API
create_dataset
Generate MLPs, compute ground-truth, and write a schema-3.0 dataset directory.
create_dataset(*, n_mlps: int, n_samples: int, width: int, depth: int, mlp_seeds: Optional[List[int]] = None, output_path: Path | str, split: str = 'public', config: str = 'default', mlp_range: Optional[Tuple[int, int]] = None, progress: Optional[Callable[[Dict[str, Any]], None]] = None, **deprecated_kwargs: Any) -> Path
The output is a directory containing data/<split>-00000-of-00001.parquet,
metadata.json, and README.md. Raises FileExistsError if output_path already exists.
Each MLP is seeded by an element of ``mlp_seeds``. When ``mlp_seeds`` is
omitted, one distinct ``secrets.randbits(63)`` value is generated per MLP.
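The auto-generation described above could look roughly like the following. This is a minimal sketch rather than the library's actual code: the helper name ``generate_mlp_seeds`` is hypothetical, and the retry-on-collision loop is an assumption about how distinctness might be enforced.

```python
import secrets
from typing import List


def generate_mlp_seeds(n_mlps: int) -> List[int]:
    """Hypothetical helper: draw n_mlps distinct 63-bit seeds."""
    seeds: List[int] = []
    seen = set()
    while len(seeds) < n_mlps:
        candidate = secrets.randbits(63)  # non-negative int below 2**63
        if candidate in seen:
            continue  # collisions are extremely unlikely, but retry anyway
        seen.add(candidate)
        seeds.append(candidate)
    return seeds
```

Generating the list once and passing it explicitly as ``mlp_seeds`` is what makes a bake reproducible; seeds drawn implicitly differ on every run.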
If ``mlp_range=(start, end)`` is set, only the MLPs with indices in [start, end)
are generated, and the output metadata is marked is_partial=true. Run
merge_datasets to combine the partial outputs.
Bit-equivalence property: a worker that bakes slice [a, b) of a logical dataset
of size N, using the same ``mlp_seeds`` list, produces the same rows as the
corresponding slice of a single-host bake of size N.
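To use this property for a distributed bake, a coordinator needs to partition [0, n_mlps) into half-open slices, one per worker. The sketch below shows one way to do that; ``split_ranges`` is a hypothetical helper, not part of this API, and the even-split policy is an assumption.

```python
from typing import List, Tuple


def split_ranges(n_mlps: int, n_workers: int) -> List[Tuple[int, int]]:
    """Hypothetical helper: partition [0, n_mlps) into contiguous half-open slices."""
    if n_mlps < 0 or n_workers <= 0:
        raise ValueError("n_mlps must be >= 0 and n_workers must be > 0")
    base, extra = divmod(n_mlps, n_workers)
    ranges: List[Tuple[int, int]] = []
    start = 0
    for worker in range(n_workers):
        # The first `extra` workers take one additional MLP each.
        size = base + (1 if worker < extra else 0)
        if size == 0:
            break  # more workers than MLPs: skip empty slices
        ranges.append((start, start + size))
        start += size
    return ranges
```

Each worker would then presumably call ``create_dataset`` with the same ``n_mlps`` and ``mlp_seeds``, its own ``mlp_range``, and its own ``output_path`` (since an existing path raises FileExistsError), after which merge_datasets combines the results.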
Args:
n_mlps: Total number of MLPs in the logical dataset.
n_samples: Ground-truth samples per MLP.
width: Neurons per layer.
depth: Number of weight matrices.
mlp_seeds: Per-MLP input seeds (list of ``n_mlps`` distinct int63s).
Auto-generated when omitted.
output_path: Destination directory (must not exist).
split: HF split name for the parquet file.
config: HF dataset config name for this split. Defaults to "default".
mlp_range: ``(start, end)`` to bake a slice of [0, n_mlps).
progress: Optional callback for progress events.
Raises:
TypeError: if the legacy ``seed=`` kwarg is passed.
ValueError: if ``mlp_seeds`` length or values are invalid.
FileExistsError: if ``output_path`` already exists.
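Callers who build ``mlp_seeds`` themselves may want to check the list before starting a long bake. The following is a sketch of such a pre-flight check based only on the constraints documented above (``n_mlps`` distinct int63s). ``check_mlp_seeds`` is a hypothetical name, and reading "int63" as the range [0, 2**63) is an assumption; the library's own validation may differ.

```python
from typing import Sequence


def check_mlp_seeds(mlp_seeds: Sequence[int], n_mlps: int) -> None:
    """Hypothetical pre-flight check mirroring the documented seed constraints."""
    if len(mlp_seeds) != n_mlps:
        raise ValueError(f"expected {n_mlps} seeds, got {len(mlp_seeds)}")
    for seed in mlp_seeds:
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(seed, bool) or not isinstance(seed, int):
            raise ValueError(f"seed {seed!r} is not an int")
        # Assumed int63 range: non-negative and below 2**63.
        if not 0 <= seed < 2**63:
            raise ValueError(f"seed {seed} is outside [0, 2**63)")
    if len(set(mlp_seeds)) != n_mlps:
        raise ValueError("seeds must be distinct")
```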