`whest dataset`

Dataset bake/publish/load/merge/inspect commands.

whest dataset [options]

`whest dataset bake`

Bake a new dataset to a directory.

| Option | Default | Description |
| --- | --- | --- |
| `--n-mlps` | | Total number of MLPs in the logical dataset. |
| `--n-samples` | | |
| `--width` | | |
| `--depth` | | |
| `--mlp-seeds` | | Path to a JSON file containing an array of N explicit per-MLP seeds (each a non-negative int < 2**63). If omitted, seeds are auto-generated via `secrets.randbits(63)`. See `docs/reference/dataset-format.md`. |
| `--split` | `'public'` | Split name. Must match `[a-z][a-z0-9-]*` (HF Hub split-name convention). |
| `--config` | `'default'` | HF dataset config name for this split. Use this when authoring config-per-split datasets. |
| `--output` | | Output directory (must not exist). |
| `--torch` | | Use the GPU/torch backend. |
| `--device` | `'auto'` | |
| `--mlps-per-batch` | | |
| `--chunk-size` | | |
| `--slice` | | `K/N`: this bake is slice K of N (0-indexed). |
| `--mlp-range` | | `START-END` (inclusive on both ends), e.g. `0-249`. |
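
The constraints on `--mlp-seeds`, `--split`, and `--mlp-range` can be checked before a bake is started. The following is a minimal Python sketch, not part of `whest` itself: the helper names (`make_seeds`, `write_seeds_file`, `is_valid_split_name`, `mlp_range_size`) are hypothetical, and rejecting a range whose end precedes its start is an assumption about what the CLI would accept.

```python
from __future__ import annotations

import json
import re
import secrets
import tempfile
from pathlib import Path

# Pattern quoted in the --split description above.
SPLIT_NAME = re.compile(r"[a-z][a-z0-9-]*")


def make_seeds(n_mlps: int) -> list[int]:
    """Return n_mlps random seeds, each a non-negative int < 2**63."""
    return [secrets.randbits(63) for _ in range(n_mlps)]


def write_seeds_file(path: Path, seeds: list[int]) -> None:
    """Check every seed against the documented bounds, then write a JSON array."""
    for seed in seeds:
        if isinstance(seed, bool) or not isinstance(seed, int) or not 0 <= seed < 2**63:
            raise ValueError(f"seed must be an int in [0, 2**63): {seed!r}")
    path.write_text(json.dumps(seeds))


def is_valid_split_name(name: str) -> bool:
    """True if the whole name matches [a-z][a-z0-9-]*."""
    return SPLIT_NAME.fullmatch(name) is not None


def mlp_range_size(spec: str) -> int:
    """Number of MLPs covered by START-END, inclusive on both ends."""
    start_text, end_text = spec.split("-")
    start, end = int(start_text), int(end_text)
    if start < 0 or end < start:  # assumption: such ranges are invalid
        raise ValueError(f"invalid range: {spec!r}")
    return end - start + 1


if __name__ == "__main__":
    seeds_path = Path(tempfile.mkdtemp()) / "seeds.json"
    write_seeds_file(seeds_path, make_seeds(4))
    print(seeds_path, is_valid_split_name("public"), mlp_range_size("0-249"))
```

The resulting file could then be passed as `--mlp-seeds <path>`; note that `0-249` covers 250 MLPs because both ends are inclusive.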

`whest dataset upload`

Upload a baked dataset to HF Hub.

| Option | Default | Description |
| --- | --- | --- |
| `local_dir` | | |
| `--repo` | | HF repo id (org/name). |
| `--tag` | | Optional git tag (e.g. v1). |
| `--private` | | |
| `--token` | | |
| `--message` | | |

`whest dataset push`

| Option | Default | Description |
| --- | --- | --- |
| `local_dir` | | |
| `--repo` | | HF repo id (org/name). |
| `--tag` | | Optional git tag (e.g. v1). |
| `--private` | | |
| `--token` | | |
| `--message` | | |

`whest dataset download`

Download a dataset from HF Hub.

| Option | Default | Description |
| --- | --- | --- |
| `repo_id` | | |
| `--revision` | | |
| `--output` | | |
| `--token` | | |
| `--split` | | Optional: download only the specified split's parquet (and metadata/README). |

`whest dataset pull`

| Option | Default | Description |
| --- | --- | --- |
| `repo_id` | | |
| `--revision` | | |
| `--output` | | |
| `--token` | | |
| `--split` | | Optional: download only the specified split's parquet (and metadata/README). |

`whest dataset merge`

Merge partial bakes into one dataset.

| Option | Default | Description |
| --- | --- | --- |
| `inputs` | | Partial dataset directories. |
| `--output` | | |
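
A sliced bake followed by a merge can be scripted by generating the command lines first. The sketch below only builds and prints argument lists; it does not run `whest`. It assumes that every slice is baked with the same shared options and that the slice directories are valid `inputs` for `merge`, which this page implies but does not state. The helper names, the `part` directory prefix, and the `--n-mlps 1000` value are hypothetical.

```python
from __future__ import annotations

import shlex


def sliced_bake_commands(
    n_slices: int, out_prefix: str, common_args: list[str]
) -> list[list[str]]:
    """One `whest dataset bake` argv per slice K/N, with K running from 0 to N-1."""
    return [
        ["whest", "dataset", "bake", *common_args,
         "--slice", f"{k}/{n_slices}",
         "--output", f"{out_prefix}-{k}"]
        for k in range(n_slices)
    ]


def merge_command(inputs: list[str], output: str) -> list[str]:
    """argv for merging the partial directories into one dataset."""
    return ["whest", "dataset", "merge", *inputs, "--output", output]


if __name__ == "__main__":
    # Hypothetical shared options; adjust to the dataset being baked.
    bakes = sliced_bake_commands(2, "part", ["--n-mlps", "1000"])
    for argv in bakes:
        print(shlex.join(argv))
    # Each bake's --output directory is the last element of its argv.
    print(shlex.join(merge_command([argv[-1] for argv in bakes], "merged")))
```

Keeping the commands as argument lists makes it straightforward to hand them to `subprocess.run` or a job scheduler once the options have been verified against `whest dataset bake --help`.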

`whest dataset info`

Print dataset metadata.

| Option | Default | Description |
| --- | --- | --- |
| `source` | | Local dir or HF repo id. |
| `--revision` | | |

`whest dataset inspect`

| Option | Default | Description |
| --- | --- | --- |
| `source` | | Local dir or HF repo id. |
| `--revision` | | |

`whest dataset combine-splits`

Combine N single-split datasets into a multi-split dataset directory.

| Option | Default | Description |
| --- | --- | --- |
| `input_dirs` | | One or more complete single-split dataset directories. |
| `--output` | | Output directory (must not exist). |
| `--default-split` | | Optional name of the split that downstream consumers should fall back to when `--split` is omitted on a multi-split dataset. Must match one of the input splits. Recorded as `default_split` in the combined `metadata.json` and used by `whest run`. |
| `--skip-prepared-arrow` | | Skip generation of `prepared/<split>/` Arrow artifacts. By default, combine-splits emits `Dataset.save_to_disk()` directories for each split so `whestbench.load_dataset` can memory-map them directly on the consumer side (no parquet→arrow conversion). Skip this step if the prepare cost outweighs the runtime benefit for your use case. |
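
The fallback described for `--default-split` can be illustrated with a small Python sketch. This is not the logic `whest run` actually uses; only the `metadata.json` file name and the `default_split` key come from this page. The `resolve_split` helper is hypothetical, and the sketch assumes `metadata.json` sits at the top level of the dataset directory.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def resolve_split(dataset_dir: str | Path, split: str | None = None) -> str:
    """Return the explicit split if given, else the recorded default_split."""
    if split is not None:
        return split
    metadata = json.loads((Path(dataset_dir) / "metadata.json").read_text())
    default = metadata.get("default_split")
    if default is None:
        raise ValueError("no split requested and metadata.json has no default_split")
    return default


if __name__ == "__main__":
    # Build a throwaway directory that mimics a combined dataset's metadata.
    demo_dir = Path(tempfile.mkdtemp())
    (demo_dir / "metadata.json").write_text(json.dumps({"default_split": "public"}))
    print(resolve_split(demo_dir))             # falls back to the recorded default
    print(resolve_split(demo_dir, "holdout"))  # an explicit split always wins
```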

`whest dataset prepare-arrow`

Patch an existing multi-split dataset directory with prepared/<split>/ Arrow artifacts so consumers can skip the parquet→arrow conversion on cold cache.

| Option | Default | Description |
| --- | --- | --- |
| `dataset_dir` | | Path to an existing multi-split dataset directory (with `data/`, `metadata.json`). |
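
To see which splits already have prepared Arrow artifacts, the directory layout can be inspected directly. The sketch below assumes that `prepared/` sits at the top level of the dataset directory, next to `data/` and `metadata.json`, with one subdirectory per split; the `prepared_splits` helper is hypothetical and does not validate the Arrow files themselves.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def prepared_splits(dataset_dir: str | Path) -> list[str]:
    """Sorted names of the subdirectories of prepared/, or [] if it is absent."""
    prepared = Path(dataset_dir) / "prepared"
    if not prepared.is_dir():
        return []
    return sorted(p.name for p in prepared.iterdir() if p.is_dir())


if __name__ == "__main__":
    # Throwaway directory mimicking a dataset with one prepared split.
    demo_dir = Path(tempfile.mkdtemp())
    (demo_dir / "prepared" / "public").mkdir(parents=True)
    print(prepared_splits(demo_dir))
```

An empty result suggests the dataset was combined with `--skip-prepared-arrow` and has not yet been patched.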
