CLI Reference

For the full per-command reference, see CLI.

When to use this page

Use this page for exact command syntax and key flags.

Environment variables

WHEST_SKIP_HARDWARE_FALLBACK_PROBES=1 — skip OS-native fallback probes when collecting run_meta.host or dataset metadata.hardware. Cheap fields and psutil-backed fields are still collected; fallback-backed fields may remain null.
HF_TOKEN — HuggingFace Hub authentication token. Used by whest dataset push, whest dataset pull, and whest run --dataset hf://... as a fallback when --token is not provided.

Commands

Participant workflow commands:

whest smoke-test
whest doctor
whest init
whest validate
whest run
whest dataset (bake / push / pull / merge / inspect)
whest package
whest profile-simulation
whest version

All JSON outputs include a top-level whestbench_version string for traceability.

`whest version`

Print installed whestbench version.

whest version [--format rich|plain|json] [--json]

JSON output is:

{
  "ok": true,
  "command": "version",
  "name": "whestbench",
  "version": "0.2.0",
  "whestbench_version": "0.2.0"
}

Examples:

whest version
whest version --json

Migration note: whest create-dataset is replaced by whest dataset bake. Running whest create-dataset prints a redirect and exits.

`whest smoke-test`

Run a built-in CombinedEstimator dashboard check and print next-step participant commands.

whest smoke-test [--detail raw|full] [--profile] [--show-diagnostic-plots] [--format rich|plain|json] [--debug]

--format rich|plain|json — choose styled terminal output, plain log-friendly output, or JSON. Defaults to rich on TTYs and plain otherwise. Under a debugger, smoke-test automatically forces plain if rich was requested.

Run install and environment health checks. Prints a pass/fail list for Python version, uv/Node.js availability, BLAS thread pool, disk space, and working-directory writability. Useful for first-hour setup troubleshooting and for CI gates.

whest doctor [--format rich|plain|json] [--json] [--strict] [--debug]

Key options:

--format rich|plain|json — choose styled terminal output, plain log-friendly output ([OK]/[WARN]/[FAIL] tokens, no box-drawing), or JSON (schema_version, checks, counts, overall). Defaults to rich on TTYs and plain otherwise.
--json — alias for --format json.
--strict — treat warnings as failures for exit-code purposes. Rendering is unchanged.
--debug — re-raise exceptions from crashing checks instead of capturing them as fail.

Severity model

ok — the check passed.
warn — the check found something worth knowing but not blocking. Examples: uv missing (safe to ignore if you installed via pip), less than 1 GiB free disk in the current directory.
fail — the check found a genuine blocker. Examples: Python version below requires-python, threadpoolctl failed to import, cannot write to the working directory.

Exit codes

Default: 0 if all checks are ok or warn; 1 if any fail.
--strict: 0 only if all checks are ok; 1 otherwise.

Example

# Interactive first-hour check
whest doctor

# CI pre-flight (treat anything that isn't OK as a failure)
whest doctor --strict --json

`whest init`

Create starter files in a target directory.

whest init [path] [--format rich|plain|json] [--json] [--debug]

`whest validate`

Validate estimator loading and output contract.

whest validate --estimator <path> [--class <name>] [--format rich|plain|json] [--json] [--debug]

`whest run`

Run local scoring with a participant estimator.

whest run --estimator <path> [options]

Default behavior: whest run --estimator <path> is equivalent to --runner local.

Key options:

--class <name> — estimator class name (if the module exports more than one).
--runner local|subprocess|server|inprocess
--n-mlps <int> — number of MLPs to evaluate. Default: 10 without --dataset; full dataset size with --dataset. Clamped to dataset size when --dataset is set.
--flop-budget <int> — cap on effective compute C_m = F_m + λ·R_m per MLP. Default: 68_000_000_000 (6.8e10). Always honored; any flop_budget stored in --dataset's metadata is ignored.
--wall-time-limit <seconds> (default: 60.0) — wall-clock limit per predict() call; forwarded to the estimator BudgetContext. Operational backstop matching the Phase 1 grader cap; the primary compute constraint is --flop-budget.
--residual-wall-time-limit <seconds> — limit for non-flopscope time per predict() call, enforced by WhestBench after timing is reported.
--detail raw|full
--seed <int> — random seed for the run.
- Without --dataset: seeds both MLP generation and estimator setup (ctx.seed).
- With --dataset: MLP seeds come from the dataset; this flag seeds estimator setup (ctx.seed) only. Default: omitted (ctx.seed defaults to 0; run_config.seed is null in the JSON output). See estimator-contract for the ctx.seed reproducibility contract.
--profile
--show-diagnostic-plots
--format rich|plain|json — choose styled terminal output, plain log-friendly output, or JSON. Defaults to rich on TTYs and plain otherwise.
--json — alias for --format json.
--dataset <path> — dataset source. Accepts:
- Local directory: ./my-eval or /abs/path/my-eval
- HF Hub with inline revision: hf://owner/repo@v1 or hf://aicrowd/arc-whestbench-2026@v1
- HF Hub with --revision flag: aicrowd/arc-whestbench-2026 --revision v1 Bare owner/repo without --revision is rejected (revision must be explicit).
--revision <tag> — HF Hub git tag or commit SHA for --dataset. Ignored for local paths.
--n-samples <int> — ground truth samples per MLP when generating on-the-fly (without --dataset). Default: width*width*256.
--debug — include estimator tracebacks in the report's "Estimator Errors" panel.
--fail-fast — stop on the first estimator error and let the raw Python traceback propagate. Combine with --debug to show it.
--max-threads <N> — limit BLAS to at most N CPU threads.

Recommended debug sequence:

whest run --estimator ./path/to/estimator.py
whest run --estimator ./path/to/estimator.py --debug
whest run --estimator ./path/to/estimator.py --debug --fail-fast
whest run --estimator ./path/to/estimator.py --runner local --format plain   # for pdb.set_trace() / breakpoint()

Using a pre-baked dataset

# Local directory (schema 3.0)
whest run --estimator ./estimator.py --dataset ./my-eval

# HF Hub with inline revision (preferred)
whest run --estimator ./estimator.py --dataset hf://aicrowd/arc-whestbench-2026@v1

# HF Hub with separate --revision flag
whest run --estimator ./estimator.py \
    --dataset aicrowd/arc-whestbench-2026 \
    --revision v1

Exit codes

0 — scoring completed; no estimator errors (budget or time exhaustion still exits 0).
1 — at least one MLP raised during predict, or setup/runtime failure.

Runner mode tradeoff:

local (default): in-process execution with better traceback fidelity while debugging. Required for interactive debuggers (pdb, breakpoint()).
subprocess: isolated execution in a separate process via the subprocess runner.
server: legacy alias for subprocess.
inprocess: alias for local.

`whest dataset`

Dataset management commands. All subcommands share the whest dataset <sub> prefix.

whest dataset {bake,push,pull,merge,inspect} ...

`whest dataset bake`

Bake a new evaluation dataset to a local directory.

whest dataset bake \
    --n-mlps N --n-samples N --width W --depth D \
    [--split SPLIT] [--config CONFIG] \
    --output DIR \
    [--torch] [--device auto|cuda|mps|cpu] \
    [--mlps-per-batch N] [--chunk-size N] \
    [--slice K/N | --mlp-range START-END]

Required options:

--n-mlps <int> — total number of MLPs in the logical dataset.
--n-samples <int> — ground-truth samples per MLP. Larger values give lower-noise ground truth. Default for on-the-fly runs is width*width*256 (~16.7M for 256-wide).
--width <int> — neuron count per layer.
--depth <int> — number of weight matrices per MLP.
--output <dir> — output directory (must not exist).

Key optional options:

--split <name> — dataset split name. Default: public.
--config <name> — HF dataset config name for this split. Default: default. Use this for authoring config-per-split datasets such as default/mini + full/full or default/public + holdout/holdout.
--torch — use the GPU/torch backend (requires pip install whestbench[gpu]). See GPU Dataset Generation.
--device auto|cuda|mps|cpu — device when --torch is active. auto resolves cuda > mps > cpu.
--mlps-per-batch <int> — torch backend: MLPs processed in parallel on device.
--chunk-size <int> — torch backend: samples per chunk per step.
--slice K/N — bake only the K-th slice of N total slices (0-indexed). Produces a partial dataset. Combine with whest dataset merge to assemble the full dataset. Example: --slice 0/4 for the first of four workers.
--mlp-range START-END — bake only MLP indices [START, END] inclusive (both ends). Alternative to --slice for irregular splits.

Bit-equivalence guarantee: a worker baking --slice K/N produces rows that are bitwise identical to the corresponding rows of a single-host bake with the same --seed and --n-mlps.

Output is a directory with:

<output>/
├── data/<split>-00000-of-00001.parquet
├── metadata.json
└── README.md

Example

# Full bake (10 MLPs, 10M samples each)
whest dataset bake \
    --n-mlps 10 --n-samples 10_000_000 \
    --width 256 --depth 8 \
    --output ./my-eval

# Partial bake (slice 0 of 4)
whest dataset bake \
    --n-mlps 100 --n-samples 1_000_000_000 \
    --width 256 --depth 8 \
    --slice 0/4 \
    --output ./partial-0

# GPU bake
whest dataset bake \
    --n-mlps 100 --n-samples 1_000_000_000 \
    --width 256 --depth 8 \
    --torch --device auto \
    --output ./gpu-eval

`whest dataset inspect`

Print metadata from a local directory or a HF Hub repo.

whest dataset inspect <DIR_OR_REPO_ID> [--revision REV]

Arguments:

DIR_OR_REPO_ID — local dataset directory, or HF Hub repo id (e.g. aicrowd/arc-whestbench-2026).
--revision <tag> — HF Hub git tag or commit SHA (for remote repos).

Example

# Local
whest dataset inspect ./my-eval

# Remote
whest dataset inspect aicrowd/arc-whestbench-2026 --revision v1

Output prints key metadata fields: schema_version, format, backend, split, config, n_mlps, n_samples, width, depth, created_at_utc, and device provenance for torch bakes. Multi-split datasets print each split's config when present.

`whest dataset push`

Upload a baked dataset directory to HuggingFace Hub. Requires HF_TOKEN set in the environment or --token.

whest dataset push <LOCAL_DIR> \
    --repo REPO_ID \
    [--tag TAG] \
    [--private] \
    [--token TOKEN] \
    [--message MSG]

Arguments:

LOCAL_DIR — local directory produced by whest dataset bake or whest dataset merge.
--repo <repo_id> — HF Hub repo id, e.g. aicrowd/arc-whestbench-2026.
--tag <tag> — optional git tag to create on the uploaded commit (e.g. v1). Recommended for versioning.
--private — create the repo as private if it doesn't exist yet.
--token <token> — HF Hub write token. Falls back to HF_TOKEN env var, then the huggingface-cli login cache.
--message <msg> — commit message for the HF Hub upload.

Example

# Publish with a version tag
whest dataset push ./my-eval \
    --repo aicrowd/arc-whestbench-2026 \
    --tag v1 \
    --message "Bake: 10 MLPs, seed=42"

# Private repo
whest dataset push ./my-eval \
    --repo aicrowd/arc-whestbench-2026-holdout \
    --tag v1 \
    --private

`whest dataset pull`

Download a dataset from HuggingFace Hub to a local directory.

whest dataset pull <REPO_ID> \
    [--revision REV] \
    --output DIR \
    [--token TOKEN]

Arguments:

REPO_ID — HF Hub repo id (e.g. aicrowd/arc-whestbench-2026).
--revision <tag> — HF Hub git tag or commit SHA. Default: main.
--output <dir> — local destination directory.
--token <token> — HF Hub token for private repos. Falls back to HF_TOKEN env var.

Example

whest dataset pull aicrowd/arc-whestbench-2026 \
    --revision v1 \
    --output ./eval-v1

`whest dataset merge`

Merge partial bakes (produced with --slice or --mlp-range) into a single canonical dataset.

whest dataset merge <DIR> [<DIR>...] --output <DIR>

Arguments:

<DIR>... — two or more partial dataset directories.
--output <dir> — destination for the merged dataset (must not exist).

All partial datasets must share the same --seed, --n-mlps, --n-samples, --width, --depth, and --backend. Their mlp_range values must together cover [0, total_n_mlps) exactly once (no gaps, no overlaps).

The merged result is bit-equivalent to a single-host bake with the same parameters.

Example

# After baking 4 slices on separate workers:
whest dataset merge \
    ./partial-0 ./partial-1 ./partial-2 ./partial-3 \
    --output ./final-eval

End-to-end example (bake → inspect → push → pull → run)

# 1. Bake
whest dataset bake \
    --n-mlps 10 --n-samples 10_000_000 \
    --width 256 --depth 8 \
    --output ./my-eval

# 2. Inspect locally
whest dataset inspect ./my-eval

# 3. Publish
export HF_TOKEN=hf_...
whest dataset push ./my-eval \
    --repo aicrowd/arc-whestbench-2026 \
    --tag v1

# 4. Pull on another machine
whest dataset pull aicrowd/arc-whestbench-2026 \
    --revision v1 --output ./local-copy

# 5. Run evaluation
whest run --estimator ./estimator.py \
    --dataset hf://aicrowd/arc-whestbench-2026@v1

`whest package`

Build a submission artifact.

whest package --estimator <path> [options]

Key options:

--class <name>
--requirements <path>
--submission-metadata <path>
--approach <path>
--output <path>
--format rich|plain|json
--json — alias for --format json
--debug

`whest profile-simulation`

Profile flopscope FLOP accounting and analytical correctness across a grid of network sizes and FLOP budgets.

whest profile-simulation [--preset super-quick|quick|standard|exhaustive]
                          [--output <path>]
                          [--format rich|plain|json]
                          [--json]
                          [--verbose]
                          [--debug]

Key options:

--preset <name> (default: standard) — parameter sweep size:
- super-quick — 1 width (256), 1 depth (4), 10 000 samples. Sub-second, for testing the debug loop.
- quick — 1 width (256), 2 depths (4, 128), 2 sample counts (10 000, 100 000). Finishes in seconds.
- standard — 2 widths (64, 256), 3 depths (4, 32, 128), 2 sample counts (10 000, 100 000). Under a minute.
- exhaustive — 2 widths (64, 256), 3 depths (4, 32, 128), 3 sample counts (10 000, 100 000, 1 000 000). Thorough but slow.
--output <path> — save a JSON report with correctness results and FLOP accounting data.
--format rich|plain|json — choose styled terminal output, plain log-friendly output, or JSON. Defaults to rich on TTYs and plain otherwise.
--json — alias for --format json.
--debug — show full tracebacks on errors.
--verbose — show full tables with all columns and raw data.

Example workflows:

# Quick correctness check
whest profile-simulation --preset quick

# Full profile with JSON export
whest profile-simulation --preset exhaustive --output profile_results.json

Next step

Dataset Format — schema 3.0 specification
Score Report Fields
GPU Dataset Generation
Inspect and Traverse MLP Structure (in the starter kit)
Validate, Run, and Package (in the starter kit)

CLI Reference

When to use this page

Environment variables

Commands

`whest version`

`whest smoke-test`

`whest doctor`

Severity model

Exit codes

Example

`whest init`

`whest validate`

`whest run`

Using a pre-baked dataset

Exit codes

`whest dataset`

`whest dataset bake`

Example

`whest dataset inspect`

Example

`whest dataset push`

Example

`whest dataset pull`

Example

`whest dataset merge`

Example

End-to-end example (bake → inspect → push → pull → run)

`whest package`

`whest profile-simulation`

Next step

On this page