merge_datasets
Concatenate partial bakes into a single canonical dataset directory.
merge_datasets(input_dirs: list[Path | str], *, output_dir: Path | str, cache_dir: Path | str | None = None) -> Path
Validates that all partials share compatible bake parameters and that
their mlp_range values cover [0, total_n_mlps) exactly once. Output
metadata strips per-partial fields and adds hardware_fingerprints and
merged_at_utc.
The merged output is bit-equivalent to a single-host bake with the same (seed, n_mlps, ...).
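The "exactly once" coverage requirement can be illustrated with a small standalone check. This is only a sketch, not the library's implementation: the helper name `check_coverage` is hypothetical, and it assumes each partial's mlp_range is a half-open `(start, stop)` pair, mirroring the `[0, total_n_mlps)` notation above.

```python
def check_coverage(ranges, total):
    """Return (gaps, overlaps) for half-open (start, stop) ranges against [0, total).

    Hypothetical helper: assumes mlp_range is a half-open (start, stop) pair.
    Ranges extending past `total` are not checked here.
    """
    gaps, overlaps = [], []
    cursor = 0  # first index not yet covered by any range seen so far
    for start, stop in sorted(ranges):
        if start > cursor:
            # Nothing covers [cursor, start): a gap.
            gaps.append((cursor, start))
        elif start < cursor:
            # This range starts inside already-covered territory: an overlap.
            overlaps.append((start, min(stop, cursor)))
        cursor = max(cursor, stop)
    if cursor < total:
        # The tail [cursor, total) is uncovered.
        gaps.append((cursor, total))
    return gaps, overlaps
```

A merge would be acceptable, under these assumptions, only when both returned lists are empty; a non-empty gap list corresponds to the incomplete case and a non-empty overlap list to the overlap case.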
Args:
input_dirs: Paths to partial dataset directories.
output_dir: Destination directory; must not already exist.
cache_dir: Hugging Face datasets cache to use for the per-partial loads.
When ``None`` (default), a temporary directory is created for the
duration of the call and removed before returning. This keeps the global
``~/.cache/huggingface/datasets/`` free of per-partial entries keyed on
the input directory basenames; fleet partials named ``mlp-NNNN`` would
otherwise leak N entries of about 2 MB each.
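The default ``cache_dir=None`` behaviour described above follows a common scoped-temporary-directory pattern. The sketch below shows that pattern with the standard library only; `with_scoped_cache` and `load_fn` are hypothetical names used for illustration, and the actual loading code inside `merge_datasets` may be organised differently.

```python
import tempfile
from pathlib import Path


def with_scoped_cache(load_fn, cache_dir=None):
    """Call load_fn(cache_path), using a throwaway cache when none is given.

    Hypothetical helper: load_fn stands in for whatever performs the
    per-partial loads and accepts a cache directory path.
    """
    if cache_dir is not None:
        # Caller-supplied cache: use it as-is and leave it in place afterwards.
        return load_fn(Path(cache_dir))
    # No cache given: create a temporary one that is deleted on exit,
    # so nothing accumulates in the global cache directory.
    with tempfile.TemporaryDirectory(prefix="merge-cache-") as tmp:
        return load_fn(Path(tmp))
```

Passing an explicit ``cache_dir`` is the way to keep the cache around, for example to reuse it across repeated merges of the same partials.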
Raises:
MergeIncompatibleError: partials disagree on schema_version, seed,
n_samples, width, depth, backend, or total_n_mlps; or any input
is not a partial.
MergeIncompleteError: ranges leave gaps, so [0, total_n_mlps) is not fully covered.
MergeOverlapError: ranges overlap.
MergeCorruptError: a partial's row mlp_ids don't match its declared
mlp_range.
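The corruption condition (row mlp_ids not matching the declared mlp_range) can be pictured as a set comparison. This is a minimal sketch under stated assumptions, not the library's check: `rows_match_range` is a hypothetical name, mlp_range is assumed to be a half-open `(start, stop)` pair, and a partial is assumed to be consistent when every id in the range appears in at least one row and no row carries an id outside it.

```python
def rows_match_range(mlp_ids, mlp_range):
    """Return True if the row ids are exactly the ids in the declared range.

    Hypothetical helper: mlp_ids is any iterable of integer ids (repeats
    allowed, since several rows may belong to one MLP); mlp_range is an
    assumed half-open (start, stop) pair.
    """
    start, stop = mlp_range
    return set(mlp_ids) == set(range(start, stop))
```

Under these assumptions, a partial for which this returns ``False`` is the kind of input that would be reported as corrupt.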