create_dataset

function · source

create_dataset(*, n_mlps: 'int', n_samples: 'int', width: 'int', depth: 'int', mlp_seeds: 'Optional[List[int]]' = None, output_path: "'Path | str'", split: 'str' = 'public', config: 'str' = 'default', mlp_range: 'Optional[Tuple[int, int]]' = None, progress: 'Optional[Callable[[Dict[str, Any]], None]]' = None, **deprecated_kwargs: 'Any') -> 'Path'

Generate MLPs, compute ground truth, and write a schema-3.0 dataset directory.

Output is a directory containing data/<split>-00000-of-00001.parquet,
metadata.json, and README.md. Raises FileExistsError if output_path already exists.

Each MLP is seeded by an element of ``mlp_seeds``. When ``mlp_seeds`` is
omitted, one distinct ``secrets.randbits(63)`` value is generated per MLP.
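
A minimal sketch of how such a default seed list could be drawn, assuming
distinctness is enforced by re-drawing on collision. The helper name
``generate_mlp_seeds`` is hypothetical and is not part of the documented API.

```python
import secrets
from typing import List


def generate_mlp_seeds(n_mlps: int) -> List[int]:
    """Draw n_mlps distinct 63-bit seeds (hypothetical helper)."""
    seeds: List[int] = []
    seen = set()
    while len(seeds) < n_mlps:
        candidate = secrets.randbits(63)  # uniform integer in [0, 2**63)
        if candidate not in seen:  # re-draw on the (very unlikely) collision
            seen.add(candidate)
            seeds.append(candidate)
    return seeds


seeds = generate_mlp_seeds(4)
```

Generating the list once and passing it explicitly as ``mlp_seeds`` makes the
same seeds available to every later call.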

If ``mlp_range=(start, end)`` is set, only MLPs with indices in [start, end) are
generated and the output metadata is marked is_partial=true. Run merge_datasets
to combine the partial outputs.
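
One way to choose the ``(start, end)`` pairs for several workers is to split
[0, n_mlps) into contiguous half-open ranges. This is a sketch only: the helper
``split_ranges`` is hypothetical and not part of this module.

```python
from typing import List, Tuple


def split_ranges(n_mlps: int, n_workers: int) -> List[Tuple[int, int]]:
    """Partition [0, n_mlps) into n_workers contiguous half-open ranges."""
    base, extra = divmod(n_mlps, n_workers)
    ranges = []
    start = 0
    for worker in range(n_workers):
        # The first `extra` workers take one additional MLP each.
        end = start + base + (1 if worker < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


ranges = split_ranges(10, 3)
print(ranges)  # [(0, 4), (4, 7), (7, 10)]
```

Each worker would then pass its own pair as ``mlp_range``, together with the
shared ``mlp_seeds`` list and a separate ``output_path``, since the destination
directory must not already exist.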

Bit-equivalent property: a worker baking slice [a, b) of a logical dataset of
size N with the same ``mlp_seeds`` list produces the same rows as the
corresponding slice of a single-host bake of size N.
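
The property follows when each row depends only on that MLP's own seed and
index. The toy model below illustrates the idea under that assumption; it is
not the library's implementation, and ``toy_bake`` is a hypothetical stand-in
that emits one pseudo-random number per MLP instead of real ground truth.

```python
import random
from typing import List, Optional, Tuple


def toy_bake(
    mlp_seeds: List[int],
    mlp_range: Optional[Tuple[int, int]] = None,
) -> List[Tuple[int, float]]:
    """Toy stand-in for a bake: one row per MLP, derived only from its seed."""
    start, end = mlp_range if mlp_range is not None else (0, len(mlp_seeds))
    rows = []
    for index in range(start, end):
        rng = random.Random(mlp_seeds[index])  # fresh generator per MLP
        rows.append((index, rng.random()))
    return rows


seeds = [11, 22, 33, 44, 55]
full = toy_bake(seeds)                    # single-host bake of all 5 MLPs
part = toy_bake(seeds, mlp_range=(1, 4))  # one worker's slice [1, 4)
assert part == full[1:4]
```

Because no random state is shared between MLPs, the slice does not depend on
which other MLPs were baked in the same call.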

Args:
    n_mlps: Total number of MLPs in the logical dataset.
    n_samples: Ground-truth samples per MLP.
    width: Neurons per layer.
    depth: Number of weight matrices.
    mlp_seeds: Per-MLP input seeds (list of ``n_mlps`` distinct int63s).
        Auto-generated when omitted.
    output_path: Destination directory (must not exist).
    split: HF split name for the parquet file.
    config: HF dataset config name for this split. Defaults to "default".
    mlp_range: Optional ``(start, end)``; bake only the slice [start, end)
        of [0, n_mlps).
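    progress: Optional callback for progress events; called with a dict.
    **deprecated_kwargs: Catches legacy keyword arguments; passing
        ``seed=`` raises TypeError.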

Raises:
    TypeError: if the legacy ``seed=`` kwarg is passed.
    ValueError: if ``mlp_seeds`` length or values are invalid.
    FileExistsError: if ``output_path`` already exists.
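
The exact validation rules behind the ValueError are not spelled out here. The
sketch below shows one plausible set of checks, assuming "int63" means an
integer in [0, 2**63) and that the list must hold exactly ``n_mlps`` distinct
values. ``validate_mlp_seeds`` is a hypothetical helper, not the library's code.

```python
from typing import List

INT63_LIMIT = 2**63  # assumed exclusive upper bound for an int63 seed


def validate_mlp_seeds(mlp_seeds: List[int], n_mlps: int) -> None:
    """Raise ValueError if mlp_seeds fails the assumed checks."""
    if len(mlp_seeds) != n_mlps:
        raise ValueError(
            f"expected {n_mlps} seeds, got {len(mlp_seeds)}"
        )
    for seed in mlp_seeds:
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(seed, bool) or not isinstance(seed, int):
            raise ValueError(f"seed {seed!r} is not an int")
        if not 0 <= seed < INT63_LIMIT:
            raise ValueError(f"seed {seed} is outside [0, 2**63)")
    if len(set(mlp_seeds)) != len(mlp_seeds):
        raise ValueError("mlp_seeds must be distinct")


validate_mlp_seeds([1, 2, 3], n_mlps=3)  # passes silently
```

Running a check like this before a long bake surfaces a bad seed list early,
before any MLP is generated.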