CLI Reference
Exact command syntax and key flags for all whest commands.
For the full per-command reference, see CLI.
When to use this page
Use this page for exact command syntax and key flags.
Environment variables
- WHEST_SKIP_HARDWARE_FALLBACK_PROBES=1: skip OS-native fallback probes when collecting run_meta.host or dataset metadata.hardware. Cheap fields and psutil-backed fields are still collected; fallback-backed fields may remain null.
- HF_TOKEN: HuggingFace Hub authentication token. Used by whest dataset push, whest dataset pull, and whest run --dataset hf://... as a fallback when --token is not provided.
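The token fallback order described for HF_TOKEN can be sketched as follows. This is a minimal illustration, not whest's actual code: the function name resolve_hf_token is hypothetical, and the huggingface-cli login cache that whest dataset push also consults is deliberately left out.

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(
    cli_token: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Hypothetical helper: prefer --token, then fall back to HF_TOKEN."""
    if cli_token:
        # An explicit --token always wins.
        return cli_token
    # Treat an unset or empty HF_TOKEN as "no token".
    return env.get("HF_TOKEN") or None
```

Passing the environment as a parameter keeps the sketch easy to exercise without touching the real process environment.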
Commands
Participant workflow commands:
- whest smoke-test
- whest doctor
- whest init
- whest validate
- whest run
- whest dataset (bake / push / pull / merge / inspect)
- whest package
- whest profile-simulation
- whest version
All JSON outputs include a top-level whestbench_version string for traceability.
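A script that consumes JSON output might read the traceability field like this. It is a sketch only: the helper name whestbench_version_of is hypothetical, and the sample payload is the whest version output documented on this page.

```python
import json

# Sample payload: the documented JSON output of `whest version --json`.
SAMPLE = """
{
  "ok": true,
  "command": "version",
  "name": "whestbench",
  "version": "0.2.0",
  "whestbench_version": "0.2.0"
}
"""


def whestbench_version_of(raw_json: str) -> str:
    """Hypothetical helper: extract the top-level whestbench_version string."""
    payload = json.loads(raw_json)
    version = payload.get("whestbench_version")
    if not isinstance(version, str):
        raise ValueError("missing top-level whestbench_version string")
    return version


print(whestbench_version_of(SAMPLE))
```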
whest version
Print installed whestbench version.
whest version [--format rich|plain|json] [--json]
JSON output is:
{
  "ok": true,
  "command": "version",
  "name": "whestbench",
  "version": "0.2.0",
  "whestbench_version": "0.2.0"
}
Examples:
whest version
whest version --json
Migration note:
whest create-dataset is replaced by whest dataset bake. Running whest create-dataset prints a redirect and exits.
whest smoke-test
Run a built-in CombinedEstimator dashboard check and print next-step participant commands.
whest smoke-test [--detail raw|full] [--profile] [--show-diagnostic-plots] [--format rich|plain|json] [--debug]
- --format rich|plain|json: choose styled terminal output, plain log-friendly output, or JSON. Defaults to rich on TTYs and plain otherwise. Under a debugger, smoke-test automatically forces plain if rich was requested.
whest doctor
Run install and environment health checks. Prints a pass/fail list for Python version, uv/Node.js availability, BLAS thread pool, disk space, and working-directory writability. Useful for first-hour setup troubleshooting and for CI gates.
whest doctor [--format rich|plain|json] [--json] [--strict] [--debug]
Key options:
- --format rich|plain|json: choose styled terminal output, plain log-friendly output ([OK]/[WARN]/[FAIL] tokens, no box-drawing), or JSON (schema_version, checks, counts, overall). Defaults to rich on TTYs and plain otherwise.
- --json: alias for --format json.
- --strict: treat warnings as failures for exit-code purposes. Rendering is unchanged.
- --debug: re-raise exceptions from crashing checks instead of capturing them as fail.
Severity model
- ok: the check passed.
- warn: the check found something worth knowing but not blocking. Examples: uv missing (safe to ignore if you installed via pip), less than 1 GiB free disk in the current directory.
- fail: the check found a genuine blocker. Examples: Python version below requires-python, threadpoolctl failed to import, cannot write to the working directory.
Exit codes
- Default: 0 if all checks are ok or warn; 1 if any fail.
- --strict: 0 only if all checks are ok; 1 otherwise.
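The exit-code rules above reduce to a small decision function. The sketch below only restates those rules; doctor_exit_code is a hypothetical name, and the statuses are assumed to be the strings ok, warn, and fail from the severity model.

```python
from typing import Iterable


def doctor_exit_code(statuses: Iterable[str], strict: bool = False) -> int:
    """Hypothetical restatement of the documented whest doctor exit codes."""
    statuses = list(statuses)
    if any(status == "fail" for status in statuses):
        # Any failing check is a blocker in both modes.
        return 1
    if strict and any(status != "ok" for status in statuses):
        # With --strict, warnings count as failures too.
        return 1
    return 0
```

For a CI gate, this means a run with only warnings passes by default and fails under --strict.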
Example
# Interactive first-hour check
whest doctor
# CI pre-flight (treat anything that isn't OK as a failure)
whest doctor --strict --json
whest init
Create starter files in a target directory.
whest init [path] [--format rich|plain|json] [--json] [--debug]
whest validate
Validate estimator loading and output contract.
whest validate --estimator <path> [--class <name>] [--format rich|plain|json] [--json] [--debug]
whest run
Run local scoring with a participant estimator.
whest run --estimator <path> [options]
Default behavior: whest run --estimator <path> is equivalent to --runner local.
Key options:
- --class <name>: estimator class name (if the module exports more than one).
- --runner local|subprocess|server|inprocess
- --n-mlps <int>: number of MLPs to evaluate. Default: 10 without --dataset; full dataset size with --dataset. Clamped to dataset size when --dataset is set.
- --flop-budget <int>: cap on effective compute C_m = F_m + λ·R_m per MLP. Default: 68_000_000_000 (6.8e10). Always honored; any flop_budget stored in --dataset's metadata is ignored.
- --wall-time-limit <seconds> (default: 60.0): wall-clock limit per predict() call; forwarded to the estimator BudgetContext. Operational backstop matching the Phase 1 grader cap; the primary compute constraint is --flop-budget.
- --residual-wall-time-limit <seconds>: limit for non-flopscope time per predict() call, enforced by WhestBench after timing is reported.
- --detail raw|full
- --seed <int>: random seed for the run.
  - Without --dataset: seeds both MLP generation and estimator setup (ctx.seed).
  - With --dataset: MLP seeds come from the dataset; this flag seeds estimator setup (ctx.seed) only.
  Default: omitted (ctx.seed defaults to 0; run_config.seed is null in the JSON output). See estimator-contract for the ctx.seed reproducibility contract.
- --profile
- --show-diagnostic-plots
- --format rich|plain|json: choose styled terminal output, plain log-friendly output, or JSON. Defaults to rich on TTYs and plain otherwise.
- --json: alias for --format json.
- --dataset <path>: dataset source. Accepts:
  - Local directory: ./my-eval or /abs/path/my-eval
  - HF Hub with inline revision: hf://owner/repo@v1 or hf://aicrowd/arc-whestbench-2026@v1
  - HF Hub with --revision flag: aicrowd/arc-whestbench-2026 --revision v1
  Bare owner/repo without --revision is rejected (revision must be explicit).
- --revision <tag>: HF Hub git tag or commit SHA for --dataset. Ignored for local paths.
- --n-samples <int>: ground truth samples per MLP when generating on-the-fly (without --dataset). Default: width*width*256.
- --debug: include estimator tracebacks in the report's "Estimator Errors" panel.
- --fail-fast: stop on the first estimator error and let the raw Python traceback propagate. Combine with --debug to show it.
- --max-threads <N>: limit BLAS to at most N CPU threads.
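The three accepted --dataset forms can be told apart with a few string checks. The sketch below is an illustration under stated assumptions, not whest's parser: the names DatasetSource and parse_dataset_source are hypothetical, and treating any value that starts with ".", "/", or "~" as a local path is a guess about how local directories are recognized.

```python
from typing import NamedTuple, Optional


class DatasetSource(NamedTuple):
    kind: str                 # "local" or "hub"
    location: str             # directory path or owner/repo id
    revision: Optional[str]   # always None for local paths


def parse_dataset_source(dataset: str, revision: Optional[str] = None) -> DatasetSource:
    """Hypothetical classifier for the documented --dataset forms."""
    if dataset.startswith("hf://"):
        # hf://owner/repo@rev: split off the inline revision if present.
        repo, _, inline_rev = dataset[len("hf://"):].partition("@")
        rev = inline_rev or revision
        if not rev:
            raise ValueError("HF Hub sources need an explicit revision")
        return DatasetSource("hub", repo, rev)
    if dataset.startswith((".", "/", "~")):
        # Assumption: path-like prefixes mean a local directory;
        # --revision is ignored for local paths.
        return DatasetSource("local", dataset, None)
    if revision is None:
        # Bare owner/repo without --revision is rejected.
        raise ValueError("bare owner/repo requires --revision")
    return DatasetSource("hub", dataset, revision)
```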
Recommended debug sequence:
whest run --estimator ./path/to/estimator.py
whest run --estimator ./path/to/estimator.py --debug
whest run --estimator ./path/to/estimator.py --debug --fail-fast
whest run --estimator ./path/to/estimator.py --runner local --format plain # for pdb.set_trace() / breakpoint()
Using a pre-baked dataset
# Local directory (schema 3.0)
whest run --estimator ./estimator.py --dataset ./my-eval
# HF Hub with inline revision (preferred)
whest run --estimator ./estimator.py --dataset hf://aicrowd/arc-whestbench-2026@v1
# HF Hub with separate --revision flag
whest run --estimator ./estimator.py \
--dataset aicrowd/arc-whestbench-2026 \
--revision v1
Exit codes
- 0: scoring completed; no estimator errors (budget or time exhaustion still exits 0).
- 1: at least one MLP raised during predict, or setup/runtime failure.
Runner mode tradeoff:
- local (default): in-process execution with better traceback fidelity while debugging. Required for interactive debuggers (pdb, breakpoint()).
- subprocess: isolated execution in a separate process via the subprocess runner.
- server: legacy alias for subprocess.
- inprocess: alias for local.
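The alias relationships among the four runner names can be written as a lookup table. This is a sketch for scripts that wrap whest run; RUNNER_ALIASES and canonical_runner are hypothetical names, not part of whest.

```python
# Maps every accepted --runner value to the mode it actually selects.
RUNNER_ALIASES = {
    "local": "local",
    "inprocess": "local",        # alias for local
    "subprocess": "subprocess",
    "server": "subprocess",      # legacy alias for subprocess
}


def canonical_runner(name: str = "local") -> str:
    """Hypothetical helper: resolve a --runner value to its canonical mode."""
    try:
        return RUNNER_ALIASES[name]
    except KeyError:
        raise ValueError(f"unknown runner: {name!r}") from None
```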
whest dataset
Dataset management commands. All subcommands share the whest dataset <sub> prefix.
whest dataset {bake,push,pull,merge,inspect} ...
whest dataset bake
Bake a new evaluation dataset to a local directory.
whest dataset bake \
--n-mlps N --n-samples N --width W --depth D \
[--split SPLIT] [--config CONFIG] \
--output DIR \
[--torch] [--device auto|cuda|mps|cpu] \
[--mlps-per-batch N] [--chunk-size N] \
[--slice K/N | --mlp-range START-END]
Required options:
- --n-mlps <int>: total number of MLPs in the logical dataset.
- --n-samples <int>: ground-truth samples per MLP. Larger values give lower-noise ground truth. Default for on-the-fly runs is width*width*256 (~16.7M for 256-wide).
- --width <int>: neuron count per layer.
- --depth <int>: number of weight matrices per MLP.
- --output <dir>: output directory (must not exist).
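As a worked check of the on-the-fly default quoted above, the formula width*width*256 can be evaluated directly. The function name default_n_samples is hypothetical and only restates the documented formula.

```python
def default_n_samples(width: int) -> int:
    """Documented on-the-fly default: width * width * 256 samples per MLP."""
    return width * width * 256


# For 256-wide MLPs this gives 16,777,216, i.e. the "~16.7M" quoted above.
print(default_n_samples(256))
```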
Key optional options:
- --split <name>: dataset split name. Default: public.
- --config <name>: HF dataset config name for this split. Default: default. Use this for authoring config-per-split datasets such as default/mini + full/full or default/public + holdout/holdout.
- --torch: use the GPU/torch backend (requires pip install whestbench[gpu]). See GPU Dataset Generation.
- --device auto|cuda|mps|cpu: device when --torch is active. auto resolves cuda > mps > cpu.
- --mlps-per-batch <int>: torch backend: MLPs processed in parallel on device.
- --chunk-size <int>: torch backend: samples per chunk per step.
- --slice K/N: bake only the K-th slice of N total slices (0-indexed). Produces a partial dataset. Combine with whest dataset merge to assemble the full dataset. Example: --slice 0/4 for the first of four workers.
- --mlp-range START-END: bake only MLP indices [START, END] inclusive (both ends). Alternative to --slice for irregular splits.
Bit-equivalence guarantee: a worker baking --slice K/N produces rows that are bitwise identical to the corresponding rows of a single-host bake with the same --seed and --n-mlps.
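To see how a slice might translate into an inclusive --mlp-range, here is a sketch that splits the MLP indices into contiguous, near-equal blocks. The partition rule is an assumption (this page does not specify how whest assigns indices to slices, or how remainders are distributed), and slice_bounds is a hypothetical name.

```python
from typing import Tuple


def slice_bounds(k: int, n: int, total: int) -> Tuple[int, int]:
    """Hypothetical mapping from --slice K/N to inclusive [START, END] indices.

    Assumes contiguous blocks, with any remainder spread over the first slices.
    """
    if n < 1 or not 0 <= k < n:
        raise ValueError("expected 0 <= K < N")
    if total < n:
        raise ValueError("fewer MLPs than slices")
    base, extra = divmod(total, n)
    start = k * base + min(k, extra)
    size = base + (1 if k < extra else 0)
    return start, start + size - 1


# Four workers over 100 MLPs under this assumed rule.
for k in range(4):
    print(k, slice_bounds(k, 4, 100))
```

Under this assumed rule, worker 0 of 4 would handle indices 0 through 24 of a 100-MLP dataset.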
Output is a directory with:
<output>/
├── data/<split>-00000-of-00001.parquet
├── metadata.json
└── README.md
Example
# Full bake (10 MLPs, 10M samples each)
whest dataset bake \
--n-mlps 10 --n-samples 10_000_000 \
--width 256 --depth 8 \
--output ./my-eval
# Partial bake (slice 0 of 4)
whest dataset bake \
--n-mlps 100 --n-samples 1_000_000_000 \
--width 256 --depth 8 \
--slice 0/4 \
--output ./partial-0
# GPU bake
whest dataset bake \
--n-mlps 100 --n-samples 1_000_000_000 \
--width 256 --depth 8 \
--torch --device auto \
--output ./gpu-eval
whest dataset inspect
Print metadata from a local directory or a HF Hub repo.
whest dataset inspect <DIR_OR_REPO_ID> [--revision REV]
Arguments:
- DIR_OR_REPO_ID: local dataset directory, or HF Hub repo id (e.g. aicrowd/arc-whestbench-2026).
- --revision <tag>: HF Hub git tag or commit SHA (for remote repos).
Example
# Local
whest dataset inspect ./my-eval
# Remote
whest dataset inspect aicrowd/arc-whestbench-2026 --revision v1
Output prints key metadata fields: schema_version, format, backend, split, config, n_mlps, n_samples, width, depth, created_at_utc, and device provenance for torch bakes. Multi-split datasets print each split's config when present.
whest dataset push
Upload a baked dataset directory to HuggingFace Hub. Requires HF_TOKEN set in the environment or --token.
whest dataset push <LOCAL_DIR> \
--repo REPO_ID \
[--tag TAG] \
[--private] \
[--token TOKEN] \
[--message MSG]
Arguments:
- LOCAL_DIR: local directory produced by whest dataset bake or whest dataset merge.
- --repo <repo_id>: HF Hub repo id, e.g. aicrowd/arc-whestbench-2026.
- --tag <tag>: optional git tag to create on the uploaded commit (e.g. v1). Recommended for versioning.
- --private: create the repo as private if it doesn't exist yet.
- --token <token>: HF Hub write token. Falls back to HF_TOKEN env var, then the huggingface-cli login cache.
- --message <msg>: commit message for the HF Hub upload.
Example
# Publish with a version tag
whest dataset push ./my-eval \
--repo aicrowd/arc-whestbench-2026 \
--tag v1 \
--message "Bake: 10 MLPs, seed=42"
# Private repo
whest dataset push ./my-eval \
--repo aicrowd/arc-whestbench-2026-holdout \
--tag v1 \
--private
whest dataset pull
Download a dataset from HuggingFace Hub to a local directory.
whest dataset pull <REPO_ID> \
[--revision REV] \
--output DIR \
[--token TOKEN]
Arguments:
- REPO_ID: HF Hub repo id (e.g. aicrowd/arc-whestbench-2026).
- --revision <tag>: HF Hub git tag or commit SHA. Default: main.
- --output <dir>: local destination directory.
- --token <token>: HF Hub token for private repos. Falls back to HF_TOKEN env var.
Example
whest dataset pull aicrowd/arc-whestbench-2026 \
--revision v1 \
--output ./eval-v1
whest dataset merge
Merge partial bakes (produced with --slice or --mlp-range) into a single canonical dataset.
whest dataset merge <DIR> [<DIR>...] --output <DIR>
Arguments:
- <DIR>...: two or more partial dataset directories.
- --output <dir>: destination for the merged dataset (must not exist).
All partial datasets must share the same --seed, --n-mlps, --n-samples, --width, --depth, and --backend. Their mlp_range values must together cover [0, total_n_mlps) exactly once (no gaps, no overlaps).
The merged result is bit-equivalent to a single-host bake with the same parameters.
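The coverage requirement (no gaps, no overlaps, covering [0, total_n_mlps) exactly once) can be checked before merging. The sketch below assumes each partial's mlp_range is an inclusive (start, end) pair, as with --mlp-range; check_mlp_coverage is a hypothetical name and not part of whest.

```python
from typing import Iterable, Tuple


def check_mlp_coverage(ranges: Iterable[Tuple[int, int]], total_n_mlps: int) -> None:
    """Hypothetical pre-merge check: raise ValueError unless the inclusive
    ranges tile [0, total_n_mlps) exactly once."""
    expected_start = 0
    for start, end in sorted(ranges):
        if end < start:
            raise ValueError(f"empty or reversed range: {start}-{end}")
        if start > expected_start:
            raise ValueError(f"gap before index {start}")
        if start < expected_start:
            raise ValueError(f"overlap at index {start}")
        expected_start = end + 1
    if expected_start != total_n_mlps:
        raise ValueError(f"covered {expected_start} of {total_n_mlps} MLPs")


# Four partials in arbitrary order; passes silently.
check_mlp_coverage([(75, 99), (0, 24), (25, 49), (50, 74)], 100)
```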
Example
# After baking 4 slices on separate workers:
whest dataset merge \
./partial-0 ./partial-1 ./partial-2 ./partial-3 \
--output ./final-eval
End-to-end example (bake → inspect → push → pull → run)
# 1. Bake
whest dataset bake \
--n-mlps 10 --n-samples 10_000_000 \
--width 256 --depth 8 \
--output ./my-eval
# 2. Inspect locally
whest dataset inspect ./my-eval
# 3. Publish
export HF_TOKEN=hf_...
whest dataset push ./my-eval \
--repo aicrowd/arc-whestbench-2026 \
--tag v1
# 4. Pull on another machine
whest dataset pull aicrowd/arc-whestbench-2026 \
--revision v1 --output ./local-copy
# 5. Run evaluation
whest run --estimator ./estimator.py \
--dataset hf://aicrowd/arc-whestbench-2026@v1
whest package
Build a submission artifact.
whest package --estimator <path> [options]
Key options:
- --class <name>
- --requirements <path>
- --submission-metadata <path>
- --approach <path>
- --output <path>
- --format rich|plain|json
- --json: alias for --format json
- --debug
whest profile-simulation
Profile flopscope FLOP accounting and analytical correctness across a grid of network sizes and FLOP budgets.
whest profile-simulation [--preset super-quick|quick|standard|exhaustive]
[--output <path>]
[--format rich|plain|json]
[--json]
[--verbose]
[--debug]
Key options:
- --preset <name> (default: standard): parameter sweep size.
  - super-quick: 1 width (256), 1 depth (4), 10 000 samples. Sub-second, for testing the debug loop.
  - quick: 1 width (256), 2 depths (4, 128), 2 sample counts (10 000, 100 000). Finishes in seconds.
  - standard: 2 widths (64, 256), 3 depths (4, 32, 128), 2 sample counts (10 000, 100 000). Under a minute.
  - exhaustive: 2 widths (64, 256), 3 depths (4, 32, 128), 3 sample counts (10 000, 100 000, 1 000 000). Thorough but slow.
- --output <path>: save a JSON report with correctness results and FLOP accounting data.
- --format rich|plain|json: choose styled terminal output, plain log-friendly output, or JSON. Defaults to rich on TTYs and plain otherwise.
- --json: alias for --format json.
- --debug: show full tracebacks on errors.
- --verbose: show full tables with all columns and raw data.
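To get a feel for how the presets scale, the sketch below enumerates each preset's grid. It assumes every preset sweeps the full Cartesian product of its widths, depths, and sample counts, which this page does not state explicitly; PRESETS and preset_grid are hypothetical names.

```python
from itertools import product
from typing import List, Tuple

# Values copied from the preset descriptions above.
PRESETS = {
    "super-quick": {"widths": [256], "depths": [4], "samples": [10_000]},
    "quick": {"widths": [256], "depths": [4, 128], "samples": [10_000, 100_000]},
    "standard": {"widths": [64, 256], "depths": [4, 32, 128], "samples": [10_000, 100_000]},
    "exhaustive": {
        "widths": [64, 256],
        "depths": [4, 32, 128],
        "samples": [10_000, 100_000, 1_000_000],
    },
}


def preset_grid(name: str) -> List[Tuple[int, int, int]]:
    """Hypothetical helper: all (width, depth, n_samples) combinations,
    assuming a full Cartesian sweep."""
    spec = PRESETS[name]
    return list(product(spec["widths"], spec["depths"], spec["samples"]))


for name in PRESETS:
    print(name, len(preset_grid(name)))
```

Under that assumption the presets cover 1, 4, 12, and 18 configurations respectively.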
Example workflows:
# Quick correctness check
whest profile-simulation --preset quick
# Full profile with JSON export
whest profile-simulation --preset exhaustive --output profile_results.json
Next step
- Dataset Format — schema 3.0 specification
- Score Report Fields
- GPU Dataset Generation
- Inspect and Traverse MLP Structure (in the starter kit)
- Validate, Run, and Package (in the starter kit)