whestbench.
API

merge_datasets

Concatenate partial bakes into a single canonical dataset directory.


merge_datasets(input_dirs: list[Path | str], *, output_dir: Path | str, cache_dir: Path | str | None = None) -> Path

Validates that all partials share compatible bake parameters and that
their mlp_range values cover [0, total_n_mlps) exactly once. Output
metadata strips per-partial fields and adds hardware_fingerprints and
merged_at_utc.

The merged output is bit-equivalent to a single-host bake with the same (seed, n_mlps, ...).
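
The "exactly once" coverage rule can be illustrated with a minimal sketch. It assumes each partial's mlp_range is a half-open (start, stop) pair; `check_coverage` is a hypothetical helper written for illustration, not part of the library, and the library itself reports these conditions through the exceptions it documents rather than through return strings.

```python
def check_coverage(ranges, total_n_mlps):
    """Classify half-open (start, stop) ranges against [0, total_n_mlps).

    Hypothetical helper for illustration only. Returns one of
    "ok", "gap", "overlap", or "invalid".
    """
    cursor = 0  # first index not yet covered
    for start, stop in sorted(ranges):
        # Each range must be non-empty and lie inside [0, total_n_mlps).
        if not (0 <= start < stop <= total_n_mlps):
            return "invalid"
        if start < cursor:
            return "overlap"  # this range re-covers earlier indices
        if start > cursor:
            return "gap"  # indices in [cursor, start) are uncovered
        cursor = stop
    # A short final cursor means the tail of the range is missing.
    return "ok" if cursor == total_n_mlps else "gap"


print(check_coverage([(4, 8), (0, 4)], 8))
print(check_coverage([(0, 4), (6, 8)], 8))
print(check_coverage([(0, 5), (4, 8)], 8))
```

Sorting first means the order in which partials are passed does not affect the result.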

Args:
    input_dirs: Paths to partial dataset directories.
    output_dir: Destination directory; must not already exist.
    cache_dir: HF datasets cache to use for the per-partial loads. When
        ``None`` (default), a temporary directory is created for the
        duration of the call and removed before returning, so the global
        ``~/.cache/huggingface/datasets/`` is not polluted by per-partial
        entries keyed on the input dir basenames (fleet partials named
        ``mlp-NNNN`` would otherwise leak N entries, ~2 MB each).
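
The default cache_dir behavior described above resembles the usual "temporary directory unless the caller supplies one" pattern. The sketch below shows that pattern using only the standard library. `resolve_cache_dir` is a hypothetical name and the code is an assumption about the general technique, not the library's implementation.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def resolve_cache_dir(cache_dir=None):
    """Yield a cache directory; delete it afterwards only if we created it.

    Hypothetical helper for illustration only.
    """
    if cache_dir is not None:
        # Caller-supplied cache: create it if needed and leave it in place.
        path = Path(cache_dir)
        path.mkdir(parents=True, exist_ok=True)
        yield path
        return
    # No cache given: use a throwaway directory removed when the block exits.
    with tempfile.TemporaryDirectory(prefix="merge-cache-") as tmp:
        yield Path(tmp)


with resolve_cache_dir() as cache:
    print(cache.is_dir())  # the temporary cache exists inside the block
```

Passing an explicit cache_dir keeps the per-partial cache entries for reuse across calls, at the cost of managing that directory yourself.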

Raises:
    MergeIncompatibleError: partials disagree on schema_version, seed,
        n_samples, width, depth, backend, or total_n_mlps; or any input
        is not a partial.
    MergeIncompleteError: ranges leave gaps, so they do not cover [0, total_n_mlps).
    MergeOverlapError: ranges overlap.
    MergeCorruptError: a partial's row mlp_ids don't match its declared
        mlp_range.
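
A typical call gathers the fleet partial directories and passes them with a fresh output_dir. The sketch below takes the merge function as a parameter (`merge_fn`) so that it makes no assumption about the import path of merge_datasets. `merge_fleet_partials` and the `mlp-*` glob are illustrative assumptions based on the `mlp-NNNN` naming mentioned above.

```python
from pathlib import Path


def merge_fleet_partials(merge_fn, partials_root, output_dir):
    """Collect mlp-* partial directories under partials_root and merge them.

    Hypothetical wrapper for illustration; merge_fn is expected to have the
    merge_datasets signature documented above.
    """
    root = Path(partials_root)
    # Sorted for a stable, readable order; coverage does not depend on it.
    input_dirs = sorted(p for p in root.glob("mlp-*") if p.is_dir())
    if not input_dirs:
        raise FileNotFoundError(f"no mlp-* partials under {root}")
    out = Path(output_dir)
    if out.exists():
        # Fail early: the destination must not already exist.
        raise FileExistsError(f"{out} already exists")
    # output_dir is keyword-only in the documented signature.
    return merge_fn(input_dirs, output_dir=out)
```

Any of the errors listed under Raises would propagate from `merge_fn` unchanged, so callers can catch them around this wrapper to tell incompatible, incomplete, overlapping, and corrupt partials apart.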