whest dataset
Dataset bake/publish/load/merge/inspect commands.
whest dataset [options]
whest dataset bake
Bake a new dataset to a directory.
| Option | Default | Description |
|---|---|---|
| --n-mlps | | Total number of MLPs in the logical dataset. |
| --n-samples | | |
| --width | | |
| --depth | | |
| --mlp-seeds | | Path to a JSON file containing an array of N explicit per-MLP seeds (each a non-negative int < 2**63). If omitted, seeds are auto-generated via secrets.randbits(63). See docs/reference/dataset-format.md. |
| --split | 'public' | Split name. Must match [a-z][a-z0-9-]* (HF Hub split-name convention). |
| --config | 'default' | HF dataset config name for this split. Use this when authoring config-per-split datasets. |
| --output | | Output directory (must not exist). |
| --torch | | Use GPU/torch backend. |
| --device | 'auto' | |
| --mlps-per-batch | | |
| --chunk-size | | |
| --slice | | K/N: this bake is slice K of N (0-indexed). |
| --mlp-range | | START-END (inclusive on both ends), e.g. 0-249. |
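
The --mlp-seeds and --split constraints in the table can be checked before a bake is started. The following is a minimal sketch, not part of whest itself: the helper names (`is_valid_split_name`, `make_mlp_seeds`, `write_mlp_seeds`) are hypothetical, and the only facts taken from the table are the split-name pattern, the seed range, and the use of secrets.randbits(63).

```python
from __future__ import annotations

import json
import re
import secrets
from pathlib import Path

# Pattern quoted in the --split description.
SPLIT_NAME = re.compile(r"[a-z][a-z0-9-]*")


def is_valid_split_name(name: str) -> bool:
    """Return True if the whole name matches [a-z][a-z0-9-]*."""
    return SPLIT_NAME.fullmatch(name) is not None


def make_mlp_seeds(n_mlps: int) -> list[int]:
    """Draw one seed per MLP; randbits(63) yields 0 <= seed < 2**63."""
    return [secrets.randbits(63) for _ in range(n_mlps)]


def write_mlp_seeds(path: Path, seeds: list[int]) -> None:
    """Write the seeds as a plain JSON array (hypothetical helper)."""
    if any(not (0 <= s < 2**63) for s in seeds):
        raise ValueError("every seed must satisfy 0 <= seed < 2**63")
    path.write_text(json.dumps(seeds))
```

A file written this way could then be passed as --mlp-seeds, with --n-mlps set to the length of the array. Keeping the seeds file under version control is one way to make a bake repeatable, assuming the remaining options are also held fixed.
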
whest dataset upload
Upload a baked dataset to HF Hub.
| Option | Default | Description |
|---|---|---|
| local_dir | | |
| --repo | | HF repo id (org/name). |
| --tag | | Optional git tag (e.g. v1). |
| --private | | |
| --token | | |
| --message | | |
whest dataset push
| Option | Default | Description |
|---|---|---|
| local_dir | | |
| --repo | | HF repo id (org/name). |
| --tag | | Optional git tag (e.g. v1). |
| --private | | |
| --token | | |
| --message | | |
whest dataset download
Download a dataset from HF Hub.
| Option | Default | Description |
|---|---|---|
| repo_id | | |
| --revision | | |
| --output | | |
| --token | | |
| --split | | Optional: download only the specified split's parquet (and metadata/README). |
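
When downloads are scripted, the optional flags can be assembled conditionally. The sketch below only builds an argument list and does not run anything. The function name `download_command` is hypothetical, and it assumes that --revision, --output, and --split each take a single value, which the table does not state explicitly.

```python
from __future__ import annotations


def download_command(
    repo_id: str,
    output: str,
    revision: str | None = None,
    split: str | None = None,
) -> list[str]:
    """Build an argv list for `whest dataset download` (hypothetical helper)."""
    argv = ["whest", "dataset", "download", repo_id, "--output", output]
    if revision is not None:
        # Assumption: --revision takes one value such as a tag or branch name.
        argv += ["--revision", revision]
    if split is not None:
        # Restricts the download to one split's parquet plus metadata/README.
        argv += ["--split", split]
    return argv
```

The resulting list can be handed to `subprocess.run(argv, check=True)`. Passing a list rather than a shell string avoids quoting problems in repo ids or paths.
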
whest dataset pull
| Option | Default | Description |
|---|---|---|
| repo_id | | |
| --revision | | |
| --output | | |
| --token | | |
| --split | | Optional: download only the specified split's parquet (and metadata/README). |
whest dataset merge
Merge partial bakes into one dataset.
| Option | Default | Description |
|---|---|---|
| inputs | | Partial dataset directories. |
| --output | | |
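
One plausible workflow is to bake the dataset in slices with the --slice option of whest dataset bake and then merge the partial directories. The sketch below only plans the commands as argument lists. The helper names, the directory naming scheme, and the idea that every slice shares the same remaining bake options are assumptions made for illustration, not documented behaviour.

```python
from __future__ import annotations


def sharded_bake_commands(
    n_slices: int, out_prefix: str, common: list[str]
) -> list[list[str]]:
    """One `whest dataset bake` argv per slice, using --slice K/N (0-indexed)."""
    if n_slices < 1:
        raise ValueError("n_slices must be at least 1")
    return [
        [
            "whest", "dataset", "bake", *common,
            "--slice", f"{k}/{n_slices}",
            # Hypothetical naming: each slice gets its own fresh directory.
            "--output", f"{out_prefix}-{k}",
        ]
        for k in range(n_slices)
    ]


def merge_command(partial_dirs: list[str], output: str) -> list[str]:
    """argv for merging the partial bakes into one dataset directory."""
    return ["whest", "dataset", "merge", *partial_dirs, "--output", output]
```

For example, `sharded_bake_commands(4, "part", ["--n-mlps", "1000"])` yields four bake commands with slices 0/4 through 3/4, and `merge_command(["part-0", "part-1", "part-2", "part-3"], "full")` yields the matching merge command. Each slice can run on a separate machine because every bake writes to its own output directory.
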
whest dataset info
Print dataset metadata.
| Option | Default | Description |
|---|---|---|
| source | | Local dir or HF repo id. |
| --revision | | |
whest dataset inspect
| Option | Default | Description |
|---|---|---|
| source | | Local dir or HF repo id. |
| --revision | | |
whest dataset combine-splits
Combine N single-split datasets into a multi-split dataset directory.
| Option | Default | Description |
|---|---|---|
| input_dirs | | One or more complete single-split dataset directories. |
| --output | | Output directory (must not exist). |
| --default-split | | Optional name of the split that downstream consumers should fall back to when --split is omitted on a multi-split dataset. Must match one of the input splits. Recorded as 'default_split' in the combined metadata.json and used by whest run. |
| --skip-prepared-arrow | | Skip generation of prepared/<split>/ Arrow artifacts. By default, combine-splits emits Dataset.save_to_disk() directories for each split so whestbench.load_dataset can memory-map them directly on the consumer side (no parquet→arrow conversion). Skip if the prepare cost outweighs the runtime win for your use. |
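
The requirement that --default-split match one of the input splits can be checked before the command is run. The following sketch assumes the caller already knows which split each input directory holds and passes that as a mapping. The function name and the mapping argument are hypothetical, and --skip-prepared-arrow is assumed to be a value-less switch.

```python
from __future__ import annotations


def combine_splits_command(
    split_dirs: dict[str, str],
    output: str,
    default_split: str | None = None,
    skip_prepared_arrow: bool = False,
) -> list[str]:
    """Build argv for `whest dataset combine-splits` (hypothetical helper).

    split_dirs maps split name -> single-split dataset directory.
    """
    if not split_dirs:
        raise ValueError("at least one input directory is required")
    if default_split is not None and default_split not in split_dirs:
        raise ValueError(
            f"default split {default_split!r} is not among {sorted(split_dirs)}"
        )
    argv = ["whest", "dataset", "combine-splits", *split_dirs.values(),
            "--output", output]
    if default_split is not None:
        argv += ["--default-split", default_split]
    if skip_prepared_arrow:
        # Assumption: this flag is a switch that takes no value.
        argv.append("--skip-prepared-arrow")
    return argv
```

Failing early in the wrapper gives a clearer error than discovering a mismatched default split after the input directories have been read.
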
whest dataset prepare-arrow
Patch an existing multi-split dataset directory with prepared/<split>/ Arrow artifacts so consumers can skip the parquet→arrow conversion on cold cache.
| Option | Default | Description |
|---|---|---|
| dataset_dir | | Path to an existing multi-split dataset directory (with data/, metadata.json). |