============================================================
Participant Guide
============================================================

--- participant-guide ---
URL: https://aicrowd.github.io/whestbench/docs/participant-guide

Tutorial and how-to for the challenge, federated from AIcrowd/whest-starterkit @ aaa3882.
whestbench: White-box estimation of MLP output statistics under a FLOP budget.

============================================================
Guides
============================================================

--- guides/datasets ---
URL: https://aicrowd.github.io/whestbench/docs/guides/datasets

Datasets: a complete guide
WhestBench uses HuggingFace Datasets as
its dataset format and HF Hub as the distribution channel. This guide walks you
through every dataset-related verb in whest, in the order you'd typically
encounter them.

If you only have 5 minutes, read the Quick start below. The rest of the
guide builds on it.

Quick start
You have a working estimator at ./estimator.py. Bake a tiny evaluation
dataset locally, then score against it:
# 1. Generate 10 MLPs with ground-truth statistics → ./my-eval/
whest dataset bake --n-mlps 10 --n-samples 1000 --width 64 --depth 4 \
                   --output ./my-eval

# 2. Inspect what got written
whest dataset info ./my-eval

# 3. Score your estimator against the same MLPs every run
whest run --estimator estimator.py --dataset ./my-eval
Why this matters: without --dataset, whest run regenerates MLPs and ground
truth on every invocation. Baking a dataset once and reusing it makes your runs
deterministic and ~10× faster.
Continue to: lifecycle ↓
The dataset lifecycle
+--------+        +-----------+        +----------+        +--------+
| local  | upload |    HF     | down-  |  local   |  run   | scores |
|  bake  | -----> | Hub repo  | load → |  cache   | -----> | report |
| (./out)|        | (org/...) |        |  (~/.hf) |        |        |
+--------+        +-----------+        +----------+        +--------+
   ^                                                            |
   |____________________________________________________________|
                   iterate on estimator code

Local-only workflow: bake → run. Best when you're iterating fast and
don't care about sharing the dataset. See
Working locally.
Team workflow: bake → upload → … later … → download → run. The HF
repo's tag pins which exact dataset everyone scores against. See
Publishing to HuggingFace Hub and
Downloading from HF Hub and the local cache.
CI / leaderboard workflow: bake → upload --tag v1-warmup. Participants
pull by tag. Streaming (whest run --streaming) is the natural fit for
per-PR CI gates — see Streaming mode.

Each verb is detailed below.
Working locally
You want to iterate fast: bake a small dataset to disk, inspect it, and reuse
it across whest run invocations. No network, no HF account needed.
whest dataset bake — create a dataset
You're starting a new evaluation. Bake 100 MLPs of moderate size with their
ground-truth statistics to ./my-eval/:
whest dataset bake \
    --n-mlps 100 \
    --n-samples 10000 \
    --width 256 --depth 8 \
    --output ./my-eval
Representative output:
→ Baking 100 MLPs (width=256, depth=8, n_samples=10000) to ./my-eval
  ✓ Generated weights         100/100
  ✓ Computed ground truth     100/100   31.7s
✓ Wrote ./my-eval (2.0 GB)
The result on disk:
my-eval/
├── data/public-00000-of-00001.parquet   # weights + ground-truth stats
├── metadata.json                         # schema_version, seed_protocol, …
└── README.md                             # dataset card
Key flags:

--mlp-seeds <file.json> — pin per-MLP seeds explicitly. JSON array of N
distinct int63 values. Required for bit-exact reproducibility with another
bake.
--mlp-range START-END or --slice K/N — bake a slice of a larger logical
dataset. The slice is bit-equivalent to the corresponding portion of a
single-host bake at the same seeds.
--torch — use the GPU backend (requires whestbench[gpu]).
--split <name> — assign a split name (default public). See
Multi-split datasets.
--config <name> — assign the HF config for this split (default default).
Dataset authors use this for config-per-split repos; participants normally
leave it unset.
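A seeds file for --mlp-seeds can be produced with a few lines of Python. This is a minimal sketch, assuming "int63" means an integer in [0, 2**63); the helper name make_seeds is illustrative and not part of whest.

```python
import json
import secrets


def make_seeds(n: int) -> list[int]:
    """Return n distinct integers in [0, 2**63) (assumed meaning of int63)."""
    seeds: list[int] = []
    seen: set[int] = set()
    while len(seeds) < n:
        candidate = secrets.randbelow(2**63)
        if candidate not in seen:  # enforce distinctness
            seen.add(candidate)
            seeds.append(candidate)
    return seeds


seeds = make_seeds(10)
payload = json.dumps(seeds)  # write this string to seeds.json
print(payload)
```

Write the printed JSON array to a file and pass that file to every bake that must reproduce the same MLPs.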

If it broke, see the Troubleshooting section — bake errors
usually trace back to seed shape, an existing output directory, or running
out of RAM on large --n-samples.

whest dataset info — what's in a dataset
You've baked or downloaded a dataset and want a one-screen summary before
running against it:
whest dataset info ./my-eval
Reports schema_version, seed_protocol, n_mlps, n_samples,
hardware fingerprint, and per-split row counts. info also works against HF
Hub directly:
whest dataset info aicrowd/arc-whestbench-public-2026 --revision v1-warmup
No download required — info only fetches metadata.json.
whest dataset merge — assemble parallel bakes
You have a multi-host cluster and want to bake a 1,000-MLP dataset in two
slices, then concatenate. Both workers must share the same --mlp-seeds file
so the result is bit-equivalent to a single-host bake:
# Two workers each bake a slice…
whest dataset bake --n-mlps 1000 --slice 0/2 --output ./partial-a \
                   --mlp-seeds seeds.json
whest dataset bake --n-mlps 1000 --slice 1/2 --output ./partial-b \
                   --mlp-seeds seeds.json

# … then merge.
whest dataset merge ./partial-a ./partial-b --output ./full
The merged dataset is bit-equivalent to a single-host bake of the same size at
the same seeds. See also: parallel-bake how-to.
whest run --dataset <local-dir> — score against a baked dataset
You're iterating on estimator.py. Score it against the first 50 MLPs of
your baked dataset (fast feedback loop):
whest run --estimator estimator.py --dataset ./my-eval --n-mlps 50
--n-mlps K clamps the run to the first K MLPs of the dataset (useful for
quick iteration). Pass --split <name> if the dataset is
multi-split.
Once you're happy with local results, publish the dataset
so teammates can score against the same MLPs.
Publishing to HuggingFace Hub
You've baked a dataset locally and want to share it with the team — or pin a
specific revision so a CI gate scores everyone against the same MLPs. Upload
to a HuggingFace Hub dataset repo.
Authenticate once
hf auth login   # opens a browser; or pass --token <hf_xxx>
Tokens with write scope are required to push. You can also set the token
without the interactive flow:
export HF_TOKEN=hf_xxx
whest dataset upload reads HF_TOKEN as a fallback when --token isn't
passed. See also: the
publish-to-hf-hub how-to for an
end-to-end walkthrough.
whest dataset upload
You have ./my-eval from the previous section. Push it as a private repo and
pin the resulting commit with a tag:
whest dataset upload ./my-eval \
    --repo aicrowd/my-eval \
    --tag v1 \
    --private   # omit for public datasets
Representative output:
→ Uploading ./my-eval to aicrowd/my-eval (private)
  ✓ Repo exists / created
  ✓ Uploaded 2.0 GB                 ████████████████████ 100%   34.1s
  ✓ Tag v1 created at d2f9a1c
✓ Done: https://huggingface.co/datasets/aicrowd/my-eval/tree/v1
The repo is created if it doesn't exist. The tag is created at the resulting
commit so
whest run --dataset hf://aicrowd/my-eval@v1
pins to this exact revision.
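The hf://<org>/<name>@<tag> form splits into a repo id and a revision. The sketch below shows one way to do that split in your own tooling; parse_hf_uri and HfDatasetRef are hypothetical names, and treating a missing @<tag> as "no pinned revision" is an assumption, not documented whest behaviour.

```python
from typing import NamedTuple, Optional


class HfDatasetRef(NamedTuple):
    repo_id: str
    revision: Optional[str]


def parse_hf_uri(uri: str) -> HfDatasetRef:
    """Split 'hf://<org>/<name>@<revision>' into its parts."""
    prefix = "hf://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an hf:// URI: {uri!r}")
    repo_id, sep, revision = uri[len(prefix):].partition("@")
    parts = repo_id.split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected <org>/<name>, got {repo_id!r}")
    if sep and not revision:
        raise ValueError("empty revision after '@'")
    # Assumption: no '@' means the caller did not pin a revision.
    return HfDatasetRef(repo_id, revision if sep else None)


print(parse_hf_uri("hf://aicrowd/my-eval@v1"))
```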
Repo naming. Use <org>/<dataset-name>. Keep names short and
hyphen-separated (e.g. aicrowd/arc-whestbench-public-2026).
Tag conventions. HF doesn't enforce semver; the de-facto pattern is
v<MAJOR>.<MINOR> (e.g. v1.0, v1.1) or descriptive (v1-warmup,
v1-holdout). See
HF's revision docs.
What gets published
The dataset card (README.md) is auto-generated from metadata.json at bake
time. It includes splits, hardware fingerprint, seed protocol, and a runnable
quick-start snippet. Edit README.md after bake and before upload to
add custom content.
The card's YAML front-matter is what HuggingFace Hub renders on the dataset
page (tags, license, language, etc.). Don't strip it.

whest dataset push continues to work as a deprecated alias for upload
through v0.6. v0.7 will remove it. Same applies to pull → download and
inspect → info.

If it broke (401, 403, repo already exists, network errors), jump to
Troubleshooting.

Downloading from HF Hub and the local cache
You want to score against a dataset published by your team or the contest
organisers. There are two paths.
whest dataset download — explicit fetch
Use when you want a real on-disk copy you can inspect, ship to another
machine, or commit to a separate artifact store:
whest dataset download aicrowd/arc-whestbench-public-2026 \
    --revision v1-warmup \
    --output ./eval
Representative output:
→ Downloading aicrowd/arc-whestbench-public-2026@v1-warmup → ./eval
  Preflight: 1 parquet shard, 2.0 GB, 1,000 MLPs
  ✓ Downloaded 2.0 GB              ████████████████████ 100%   28.9s
✓ Wrote ./eval (cache: ~/.cache/huggingface/hub/datasets--aicrowd--arc-whestbench-public-2026)
With --output set, files are materialised under the named directory; the HF
cache also picks them up.
Auto-fetch via whest run
You can skip the explicit download — whest run does it lazily on first use:
whest run --estimator estimator.py \
          --dataset hf://aicrowd/arc-whestbench-public-2026@v1-warmup
This downloads on first invocation (showing a progress bar) and caches.
Subsequent runs are ~10× faster (the cache hit prints Loaded from cache).
HF cache layout
After a fetch, the HF cache lives in three places:

| Path | What's there |
|------|--------------|
| ~/.cache/huggingface/hub/datasets--<org>--<name>/ | Raw blobs (Git LFS / Xet objects) + the revision snapshot symlinks |
| ~/.cache/huggingface/datasets/<org>___<name>/ | The datasets library's regenerated Arrow tables (memory-mapped) |
| ~/.cache/huggingface/xet/{chunk_cache,shard_cache,staging}/ | Xet chunk-level dedup cache (since hf_xet ≥ 1.0) |

Total disk usage is roughly 2× download size (the parquet blob + Arrow
rebuild). The hub cache uses content-addressed dedup, so the same blob is
shared across revisions and even repos.
Cleaning up
Defer to HF's own cache CLI — it understands the layout above and will not
accidentally orphan blobs that are still referenced from another revision:
hf cache ls                  # show what's there
hf cache prune               # drop unreferenced revisions
hf cache rm <selector>       # remove a specific repo or revision
hf cache verify              # check integrity
Full reference: HF cache management.
Cache location overrides
| Env var | What it sets | Default |
|---------|--------------|---------|
| HF_HOME | Root of all HF state | ~/.cache/huggingface |
| HF_HUB_CACHE | Hub-only cache (blobs/snapshots) | $HF_HOME/hub |
| HF_DATASETS_CACHE | datasets-library Arrow cache | $HF_HOME/datasets |
| HF_XET_CACHE | Xet chunk staging | $HF_HOME/xet |

When running on NFS, point HF_XET_CACHE=/local/ssd to avoid roundtrips.
See Performance tuning for more knobs.
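To see where each cache ends up for a given set of overrides, the defaults above can be applied by hand. This is a sketch that mirrors only the table in this guide; resolve_hf_cache_dirs is a hypothetical helper, and the HF libraries may consult further variables that are not modelled here.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_hf_cache_dirs(env: Optional[Mapping[str, str]] = None) -> dict[str, Path]:
    """Apply the documented defaults to an environment mapping."""
    env = os.environ if env is None else env
    home = Path(env.get("HF_HOME", "~/.cache/huggingface")).expanduser()

    def pick(var: str, default: Path) -> Path:
        value = env.get(var)
        return Path(value).expanduser() if value else default

    return {
        "HF_HOME": home,
        "HF_HUB_CACHE": pick("HF_HUB_CACHE", home / "hub"),
        "HF_DATASETS_CACHE": pick("HF_DATASETS_CACHE", home / "datasets"),
        "HF_XET_CACHE": pick("HF_XET_CACHE", home / "xet"),
    }


# Example: HF state on a shared volume, Xet staging on local SSD.
example_env = {"HF_HOME": "/data/hf", "HF_XET_CACHE": "/local/ssd"}
for name, path in resolve_hf_cache_dirs(example_env).items():
    print(f"{name:18} {path}")
```

Calling resolve_hf_cache_dirs() with no argument reads the current process environment instead.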

whest dataset pull continues to work as a deprecated alias for download
through v0.6.

If it broke (long pause, disk full, gated dataset, cas-bridge.xethub.hf.co
URLs you don't recognise), jump to Troubleshooting.

Streaming mode
You want to score against a small slice of a remote dataset without paying
the cost of a full download. whest run --streaming consumes the dataset
row-group-by-row-group over HTTP instead of downloading it first.
When to use

You're iterating on estimator code with --n-mlps 5 (or some small K).
Streaming fetches only the first ⌈K/47⌉ row groups (~95 MB each for the
warmup dataset) instead of the full 2 GB.
You're on a constrained-disk environment (CI runner, container).
You want a fast first-row response time more than total throughput.
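The row-group arithmetic from the first bullet can be made concrete. The sketch below takes the 47 MLPs per row group and ~95 MB per row group quoted above for the warmup dataset as given; other datasets will have different values, and streaming_fetch_estimate is an illustrative name.

```python
import math

MLPS_PER_ROW_GROUP = 47  # from the ceil(K/47) figure quoted for the warmup dataset
ROW_GROUP_MB = 95        # approximate size of one row group, same source


def streaming_fetch_estimate(k: int) -> tuple[int, int]:
    """Return (row groups fetched, approximate MB transferred) for --n-mlps k."""
    groups = math.ceil(k / MLPS_PER_ROW_GROUP)
    return groups, groups * ROW_GROUP_MB


for k in (5, 47, 48, 200):
    groups, megabytes = streaming_fetch_estimate(k)
    print(f"--n-mlps {k}: {groups} row group(s), ~{megabytes} MB")
```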

When NOT to use

Repeated full evaluations of the same dataset. Streaming does NOT populate
the cache — every run re-fetches. Use the
default materialise path
instead.
Anything that needs random access. IterableDataset is iteration-only;
len(ds), ds[i], and ds.shuffle(seed=…) don't work as expected.

Trade-off table
| Property | Materialise (default) | --streaming |
|----------|-----------------------|-------------|
| First-row latency, cold cache | ~30 s (full download) | ~5 s |
| First-row latency, warm cache | ~2 s | ~5 s (re-fetch) |
| Disk usage | ~4 GB (blob + Arrow) | 0 |
| Subsequent runs | ~2 s (cache hit) | ~5 s (re-fetch every time) |
| Random access | Yes | No |

Authentication and streaming
Unauthenticated requests to HF are rate-limited and noticeably slower. Run
hf auth login once to set a token; streaming throughput typically improves
30–50% authenticated.
Example
whest run --estimator estimator.py \
          --dataset hf://aicrowd/arc-whestbench-public-2026@v1-warmup \
          --streaming \
          --n-mlps 5
You'll see a ⚠ Streaming from HF warning at startup, then a progress
indicator while the first row group is fetched, then scoring begins.

Streaming is incompatible with --json output (it would corrupt JSON
ordering) and len(ds) raises on a streaming dataset. Both are documented
under Troubleshooting.

Multi-split datasets
A dataset can contain multiple disjoint groups of MLPs — typically public
(open to participants for tuning) and holdout (used only by the leaderboard
grader). One repo, two splits.
When and why

Leaderboard datasets: participants score against public locally,
the leaderboard grader scores against holdout. Same parquet schema, same
hardware fingerprint, different seeds.
Train/validation flow: split a dataset into train/val/test for
meta-learning experiments on top of WhestBench.

Baking a split
Each split is baked separately. Make sure to use distinct --mlp-seeds files
so the splits don't overlap:
whest dataset bake --n-mlps 500 --split public  --config default \
                   --mlp-seeds public-seeds.json --output ./eval-public
whest dataset bake --n-mlps 500 --split holdout --config holdout \
                   --mlp-seeds holdout-seeds.json --output ./eval-holdout
Combining splits into one multi-split directory
whest dataset combine-splits ./eval-public ./eval-holdout --output ./eval-full
The result is a single dataset directory with both splits in data/,
suitable for whest dataset upload to a single HF repo. combine-splits
preserves each bake's config metadata, so the published card can expose the
same config-per-split layout as the official HF datasets.
Selecting a split when running
whest run --estimator estimator.py \
          --dataset hf://aicrowd/eval-full@v1 \
          --split public
Without --split, multi-split datasets are rejected by whest run (the
scoring path scores against exactly one split at a time, by design).
Inspecting splits
whest dataset info ./eval-full
# Reports each split's n_mlps and seed.

If combine-splits complains about overlapping mlp_seeds or mismatched
hardware fingerprints, see Troubleshooting.
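Overlapping seeds can be caught before baking by comparing the seed files passed to --mlp-seeds. A minimal sketch, assuming each file is a JSON array of integers as described under Key flags; load_seeds and overlapping_seeds are illustrative helpers, not whest functions.

```python
import json


def load_seeds(path: str) -> list[int]:
    """Read a JSON array of integer seeds from disk."""
    with open(path) as fh:
        return [int(seed) for seed in json.load(fh)]


def overlapping_seeds(seeds_a: list[int], seeds_b: list[int]) -> set[int]:
    """Return the seeds that appear in both lists (empty set means disjoint)."""
    return set(seeds_a) & set(seeds_b)


public = [11, 22, 33]
holdout = [44, 22, 55]
shared = overlapping_seeds(public, holdout)
print(f"{len(shared)} shared seed(s): {sorted(shared)}")
```

With real files, pass load_seeds("<public file>") and load_seeds("<holdout file>") and bake only when the result is empty.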

Performance tuning
These are power-user knobs. The defaults are fine for almost everyone.
Xet high-performance mode
If you have ≥64 GB RAM and a fat uplink:
export HF_XET_HIGH_PERFORMANCE=1
Saturates both bandwidth and CPU cores. Helpful when downloading
many-GB datasets to a workstation. Reference:
HF Xet storage docs.
Local SSD for the Xet cache
If your HF cache is on NFS or a slow disk:
export HF_XET_CACHE=/local/ssd/hf-xet
Keeps the chunk staging cache on fast local storage. The main hub cache
(HF_HUB_CACHE) can stay on NFS — only the per-chunk Xet metadata is
roundtrip-sensitive.
Disabling Xet entirely
export HF_HUB_DISABLE_XET=1
Falls back to plain LFS transport. Rarely useful; only reach for it if you've
confirmed a Xet-specific bug.
Disabling progress bars (CI)
export HF_HUB_DISABLE_PROGRESS_BARS=1
Whestbench's say.* lines still emit; only the progress bars are suppressed.
For complete silence add --quiet to the whest invocation.
Troubleshooting
"I see a long pause and no output."
Cache miss on a cold HF cache. Watch the progress bar — for the warmup
dataset it's ~30 s on a 70 MB/s link. To avoid silent re-downloads, run
whest dataset download ahead
of time, or ls ~/.cache/huggingface/hub/ to confirm progress.
"Downloads feel slow."
You're probably unauthenticated; HF rate-limits anonymous traffic.
Run hf auth login once and re-run. See also
Authentication and streaming.
"Disk filled up."
HF stores blobs in both ~/.cache/huggingface/hub/ (raw download) and
~/.cache/huggingface/datasets/ (regenerated Arrow). Use
hf cache prune to drop unreferenced revisions, then hf cache ls to
verify reclaimed space. See Cleaning up.
"401/403 on upload."
Your token doesn't have write scope. Re-login with
hf auth login --token <new-token> from a token created with write access.
For org-owned repos, your account also needs membership in the org.
"Cannot use --streaming with --json output."
Known limitation — streaming progress events would corrupt JSON ordering.
Drop --json, or drop --streaming.
"len(ds) raises on a streaming dataset."
Expected per HF docs. Use whestbench.metadata(ds)["n_mlps"] instead — it
reflects the upstream metadata.json, not the local materialised count.
"I see cas-bridge.xethub.hf.co URLs but the file is LFS."
That's HF's Xet bridge transparently serving legacy LFS content via the Xet
CDN edge. No action required. If you need to force plain-LFS transport for
debugging, set HF_HUB_DISABLE_XET=1 (see
Disabling Xet entirely).
"Dataset is gated."
Request access on the dataset page at https://huggingface.co/datasets/<repo>
(HF will email you a link), then re-run. Make sure you're
authenticated with the same account that was granted access.
Reference
Format

Schema 3.0 spec: dataset-format.

SDK surface
import whestbench as wb

ds = wb.load_dataset(
    "aicrowd/foo", revision="v1", split="public", streaming=False
)
for mlp in wb.iter_mlps(ds):
    ...
mlp = wb.mlp_at(ds, 0)        # random access (materialised datasets only)
md = wb.metadata(ds)           # the dataset's metadata.json

wb.publish_dataset(
    "./my-eval", repo_id="aicrowd/foo", tag="v1"
)
wb.merge_datasets(["./partial-a", "./partial-b"], output_dir="./full")
wb.combine_split_datasets(
    ["./public", "./holdout"], output_dir="./full"
)
CLI verbs (canonical names)
| Verb | Purpose | Deprecated alias |
|------|---------|------------------|
| whest dataset bake | Generate locally | (none) |
| whest dataset upload | Publish to HF | push |
| whest dataset download | Fetch from HF | pull |
| whest dataset info | Show metadata | inspect |
| whest dataset merge | Concatenate partials | (none) |
| whest dataset combine-splits | Assemble multi-split | (none) |

Deprecated aliases continue to work through v0.6 and emit a deprecation
warning. v0.7 removes them.
Environment variables
| Var | Purpose |
|-----|---------|
| HF_TOKEN | Auth token (lazy: only when needed) |
| HF_HOME | Root of HF state (~/.cache/huggingface by default) |
| HF_HUB_CACHE | Hub blobs cache |
| HF_DATASETS_CACHE | datasets-library Arrow cache |
| HF_XET_CACHE | Xet chunk staging |
| HF_XET_HIGH_PERFORMANCE | Saturate bandwidth + CPU |
| HF_HUB_DISABLE_PROGRESS_BARS | Suppress progress bars |
| HF_HUB_DISABLE_XET | Force plain-LFS transport |
| HF_HUB_DISABLE_IMPLICIT_TOKEN | Don't send token on read calls |
| NO_COLOR | Disable ANSI colours |

CLI flag conventions
--repo-type, --revision, --token, --cache-dir, --quiet, --json,
--format {auto,human,agent,json,quiet}, --dry-run, --exist-ok. Adopted
from HF's hf CLI
for consistency.
See also

CLI reference — exhaustive flag list per verb.
Dataset format spec — schema 3.0 on-disk layout.
Parallel bake how-to — distributed bake + merge.
Publish to HF Hub how-to — token setup, repo creation.

============================================================
How-to
============================================================

--- how-to/parallel-bake ---
URL: https://aicrowd.github.io/whestbench/docs/how-to/parallel-bake

Parallel bake across multiple GPUs / hosts
Bake one large dataset across N workers, then merge the partials into a single
canonical artifact that is bit-equivalent to what a single-host bake would have
produced.
When to use this
At the default sampling rate (n_samples=1_000_000_000), a single L40S GPU takes
roughly 4 hours for 100 MLPs (measured; see
GPU Dataset Generation for the full timing
table). Splitting the work across multiple workers reduces wall time proportionally:

1 L40S × 100 MLPs × 10⁹ samples ≈ 4 h
4 L40S workers × 25 MLPs each × 10⁹ samples ≈ 1 h
8 L40S workers × 12–13 MLPs each × 10⁹ samples ≈ 30 min

Parallel baking is also useful for fault tolerance — if one worker fails, you only
need to re-bake its slice.
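The scaling above follows from the slowest worker, which is the one with the largest slice. A rough estimator, assuming the measured ~4 h per 100 MLPs on one L40S at 10⁹ samples scales linearly with MLP count and that all workers run concurrently (estimated_wall_hours is an illustrative name):

```python
import math

HOURS_PER_MLP = 4 / 100  # ~4 h per 100 MLPs on one L40S at n_samples=1e9


def estimated_wall_hours(n_mlps: int, n_workers: int) -> float:
    """Wall time is set by the worker holding the largest slice."""
    largest_slice = math.ceil(n_mlps / n_workers)
    return largest_slice * HOURS_PER_MLP


for workers in (1, 4, 8):
    hours = estimated_wall_hours(100, workers)
    print(f"{workers} worker(s): ~{hours:.2f} h")
```

Other GPUs, sample counts, or MLP shapes need their own per-MLP timing.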
1. Bake each slice
Use --slice K/N to assign each worker a disjoint range of MLPs. All workers must
use the same --mlp-seeds file, --n-mlps, --n-samples, --width, and
--depth — the merge step enforces this.
Generate the seeds file once before launching workers:
whest dataset generate-seeds --n-mlps 1000 > seeds.json
The following example bakes 1000 MLPs across 4 workers. Run each command on its own
host (or in a separate job):
Worker 0 (MLPs 0–249):
whest dataset bake \
    --n-mlps 1000 --n-samples 1_000_000_000 \
    --width 256 --depth 8 \
    --mlp-seeds seeds.json \
    --slice 0/4 \
    --torch --device auto \
    --output ./partial-0
Worker 1 (MLPs 250–499):
whest dataset bake \
    --n-mlps 1000 --n-samples 1_000_000_000 \
    --width 256 --depth 8 \
    --mlp-seeds seeds.json \
    --slice 1/4 \
    --torch --device auto \
    --output ./partial-1
Worker 2 (MLPs 500–749):
whest dataset bake \
    --n-mlps 1000 --n-samples 1_000_000_000 \
    --width 256 --depth 8 \
    --mlp-seeds seeds.json \
    --slice 2/4 \
    --torch --device auto \
    --output ./partial-2
Worker 3 (MLPs 750–999):
whest dataset bake \
    --n-mlps 1000 --n-samples 1_000_000_000 \
    --width 256 --depth 8 \
    --mlp-seeds seeds.json \
    --slice 3/4 \
    --torch --device auto \
    --output ./partial-3
Each worker writes a directory marked is_partial=true in metadata.json. The
loader refuses to load partial datasets directly — you must merge them first.
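The four worker commands differ only in --slice and --output, so they can be generated rather than copied by hand. The sketch below only builds command strings from the flags shown above; worker_commands is a hypothetical helper, and dispatching each string to a host or job scheduler is left to you.

```python
import shlex


def worker_commands(n_workers: int, n_mlps: int = 1000) -> list[str]:
    """Build one `whest dataset bake` command line per worker."""
    commands = []
    for k in range(n_workers):
        args = [
            "whest", "dataset", "bake",
            "--n-mlps", str(n_mlps),
            "--n-samples", "1_000_000_000",
            "--width", "256", "--depth", "8",
            "--mlp-seeds", "seeds.json",      # same file for every worker
            "--slice", f"{k}/{n_workers}",    # the only per-worker input
            "--torch", "--device", "auto",
            "--output", f"./partial-{k}",
        ]
        commands.append(shlex.join(args))
    return commands


for command in worker_commands(4):
    print(command)
```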
2. Fetch partials locally
Once all workers finish, collect the partial directories on a single machine.
# scp example (adjust hostnames and paths)
scp -r worker-0:/data/partial-0 ./partial-0
scp -r worker-1:/data/partial-1 ./partial-1
scp -r worker-2:/data/partial-2 ./partial-2
scp -r worker-3:/data/partial-3 ./partial-3

# Or rsync (preserves timestamps, supports resumption)
rsync -avz worker-0:/data/partial-0/ ./partial-0/
rsync -avz worker-1:/data/partial-1/ ./partial-1/
rsync -avz worker-2:/data/partial-2/ ./partial-2/
rsync -avz worker-3:/data/partial-3/ ./partial-3/
3. Merge
whest dataset merge validates all partials, checks that their mlp_range values
cover [0, 1000) exactly once (no gaps, no overlaps), concatenates the Parquet
shards in MLP-index order, and writes a complete dataset directory:
whest dataset merge \
    ./partial-0 ./partial-1 ./partial-2 ./partial-3 \
    --output ./final-eval
Expected output:
Merged 4 partials to ./final-eval
The merge fails loudly on any of:

Partials disagree on n_samples, width, depth, backend, or
total_n_mlps (MergeIncompatibleError)
Ranges have gaps — e.g. [0,250) and [500,750) with nothing in between
(MergeIncompleteError)
Ranges overlap (MergeOverlapError)
A partial's actual row mlp_id values don't match its declared mlp_range
(MergeCorruptError)
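The gap and overlap checks can be reproduced before copying partials around. This is an independent sketch of the coverage rule described above, using half-open [start, end) ranges; it raises plain ValueError rather than the tool's own MergeIncompleteError and MergeOverlapError, and check_coverage is an illustrative name.

```python
def check_coverage(ranges: list[tuple[int, int]], total: int) -> None:
    """Raise ValueError unless the ranges tile [0, total) exactly once."""
    expected_start = 0
    for start, end in sorted(ranges):
        if end <= start:
            raise ValueError(f"empty range [{start}, {end})")
        if start > expected_start:
            raise ValueError(f"gap: nothing covers [{expected_start}, {start})")
        if start < expected_start:
            raise ValueError(f"overlap: [{start}, {end}) starts before {expected_start}")
        expected_start = end
    if expected_start < total:
        raise ValueError(f"gap: nothing covers [{expected_start}, {total})")
    if expected_start > total:
        raise ValueError(f"ranges extend past {total}")


# Order of the partials does not matter; they are sorted first.
check_coverage([(250, 500), (0, 250), (500, 750), (750, 1000)], total=1000)
print("ranges cover [0, 1000) exactly once")
```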

4. Verify bit-equivalence (optional)
To confirm the parallel bake matches a serial bake on the same seeds file, bake a
small reference dataset on a single host and compare all_layer_means:
import numpy as np
from datasets import load_dataset

# Load the merged result
merged = load_dataset("./final-eval", split="public")

# Bake a tiny reference (e.g. first 4 MLPs) on one host for verification.
# The reference must use the same --n-samples, --width, --depth, and backend
# as the parallel workers (add --torch if they used it); the command below
# assumes the workers baked with --n-samples 1000000.
# Also pass the SAME --chunk-size as the parallel workers. Otherwise the
# auto-tuned chunk_size differs (workers: B=mlps_per_slice; reference: B=4)
# and reductions accumulate in different orders, producing ~5e-4 spurious
# diffs on CUDA.
# echo '[<seed0>,<seed1>,<seed2>,<seed3>]' > ref-seeds.json  # use seeds[0:4] from seeds.json
# whest dataset bake --n-mlps 4 --n-samples 1000000 \
#     --width 256 --depth 8 --mlp-seeds ref-seeds.json \
#     --chunk-size 524288 --output ./reference-4

reference = load_dataset("./reference-4", split="public")

# Compare means for the overlapping MLPs
for i in range(len(reference)):
    merged_means = np.array(merged[i]["all_layer_means"])
    ref_means = np.array(reference[i]["all_layer_means"])
    max_diff = np.abs(merged_means - ref_means).max()
    print(f"MLP {i}: max |Δmean| = {max_diff:.2e}")
    assert max_diff == 0.0, f"MLP {i}: not bit-exact!"

    # avg_variance loses ~1 float64 ULP from the (sum_sq/n - mean²) subtraction,
    # so compare with np.isclose rather than strict equality. rtol=1e-12 covers
    # ULP noise that scales with the variance magnitude; atol=1e-15 guards near
    # zero. Observed noise on N=1e9 bakes is ~1e-17, so this is ~100× headroom.
    merged_var = float(merged[i]["avg_variance"])
    ref_var = float(reference[i]["avg_variance"])
    assert np.isclose(merged_var, ref_var, rtol=1e-12, atol=1e-15), (
        f"MLP {i}: variance not within ULP tol "
        f"(merged={merged_var}, ref={ref_var})"
    )

print("Bit-equivalence verified for first 4 MLPs.")
Expected output (for the CPU backend):
MLP 0: max |Δmean| = 0.00e+00
MLP 1: max |Δmean| = 0.00e+00
MLP 2: max |Δmean| = 0.00e+00
MLP 3: max |Δmean| = 0.00e+00
Bit-equivalence verified for first 4 MLPs.
Bit-equivalence holds within each backend (flopscope or torch) but not across
backends, because they use different RNG algorithms.
5. Inspect and publish
Inspect the merged dataset, then push to HuggingFace Hub as a single artifact.
See Publishing a dataset to HuggingFace Hub for the full
publish walkthrough.
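# Inspect (info is the canonical name; inspect is a deprecated alias)
whest dataset info ./final-eval

# Publish (upload is the canonical name; push is a deprecated alias)
whest dataset upload ./final-eval \
    --repo aicrowd/arc-whestbench-2026 \
    --tag v1 \
    --message "Parallel bake: 1000 MLPs, 4 workers"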
Slicing model
--slice K/N
Divides the logical dataset of --n-mlps into N equal slices and assigns slice K
(0-indexed). For n_mlps=1000 and N=4:
| --slice | mlp_range |
|---------|-----------|
| 0/4 | [0, 250) |
| 1/4 | [250, 500) |
| 2/4 | [500, 750) |
| 3/4 | [750, 1000) |

If n_mlps is not evenly divisible by N, the last slice gets the remainder.
--mlp-range START-END
The lower-level alternative to --slice. Both endpoints are inclusive on the
CLI (e.g. --mlp-range 0-249 covers MLPs 0 through 249 inclusive). The Python API
uses half-open [start, end) intervals internally.
Use --mlp-range for irregular splits or when you need to re-run only a specific
MLP range after a failure.
# Re-run just MLPs 250–499 after a worker failure (use the same seeds.json as the original bake)
whest dataset bake \
    --n-mlps 1000 --n-samples 1_000_000_000 \
    --width 256 --depth 8 --mlp-seeds seeds.json \
    --mlp-range 250-499 \
    --torch --device auto \
    --output ./partial-1-retry
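The mapping between --slice K/N, half-open ranges, and the inclusive CLI form of --mlp-range is easy to get off by one. The sketch below follows the rules stated in this section (equal slices, remainder to the last slice, inclusive endpoints on the CLI); the function names are illustrative and the tool's actual remainder handling may differ.

```python
def slice_to_range(k: int, n: int, n_mlps: int) -> tuple[int, int]:
    """Half-open [start, end) range for slice k of n (0-indexed)."""
    if not 0 <= k < n:
        raise ValueError(f"slice index {k} out of range for {n} slices")
    base = n_mlps // n
    start = k * base
    # Per the rule above, the last slice absorbs any remainder.
    end = n_mlps if k == n - 1 else start + base
    return start, end


def to_cli_range(start: int, end: int) -> str:
    """Half-open [start, end) to the inclusive START-END form."""
    return f"{start}-{end - 1}"


def parse_cli_range(text: str) -> tuple[int, int]:
    """Inclusive START-END form to half-open [start, end)."""
    first, last = text.split("-")
    return int(first), int(last) + 1


start, end = slice_to_range(1, 4, 1000)
print(f"--slice 1/4 covers [{start}, {end}); CLI form: --mlp-range {to_cli_range(start, end)}")
```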
Bit-equivalence requirements
The merge step produces a dataset bit-equivalent to a single-host bake only when:

All workers use the same --mlp-seeds file and same --n-mlps. Under
seed_protocol 3.0, each slot reads its input seed directly from that shared file,
so the derived weight/sample/estimator streams are identical regardless of which
worker processes the slot.

All workers use the same backend (flopscope vs torch). The two backends
use different RNG algorithms and produce statistically equivalent but not bitwise
identical results at the same seeds.

For the torch backend on CUDA, bitwise reproducibility additionally requires
the same torch version (CUDA kernel implementations may differ between versions).

For the torch backend on CUDA, all workers and any reference re-bake must use
the same --chunk-size. The default is auto-tuned per call from
mlps_per_batch (the number of MLPs this call bakes, i.e. --n-mlps after slicing) and the device's
free memory — so a worker baking a 1-MLP slice (mlps_per_batch=1, auto chunk
≈ 1048576) and a 4-MLP reference bake (mlps_per_batch=4, auto chunk ≈ 524288)
will pick different chunk sizes, accumulate float reductions in different
orders, and disagree by ~5e-4 absolute on all_layer_means, final_means, and
avg_variance. Pinning --chunk-size to a fixed value across every bake
(workers AND any reference bake) eliminates this. For width=256,
--chunk-size 524288 is a safe choice across all batch sizes from 1 to 16.
Cross-host CUDA non-determinism beyond chunk-size has been ruled out in practice
when the standard PyTorch determinism flags are set (cudnn.deterministic=True,
cudnn.benchmark=False, CUBLAS_WORKSPACE_CONFIG=:4096:8). With those + a
pinned --chunk-size, parallel-vs-serial bakes match bit-exactly on
weights, all_layer_means, and final_means. avg_variance differs by
~1 float64 ULP (~1e-17 on N=1e9 bakes) due to the (sum_sq/n − mean²)
subtraction; compare it with np.isclose(rtol=1e-12, atol=1e-15) rather
than strict equality. rtol covers ULP noise that scales with variance
magnitude; atol guards near zero.
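The effect of accumulation order on the (sum_sq/n − mean²) form can be explored on the CPU without any GPU. The sketch below is a toy illustration, not the bake code: it sums the same values in chunks of two different sizes and compares the resulting variances with the same rule np.isclose applies, |a − b| ≤ atol + rtol·|b|. Whether the two results differ at all depends on the data and the Python version.

```python
import random


def chunked_sum(values: list[float], chunk_size: int) -> float:
    """Sum per-chunk partial sums, mimicking a chunked reduction."""
    total = 0.0
    for i in range(0, len(values), chunk_size):
        total += sum(values[i:i + chunk_size])
    return total


def variance(values: list[float], chunk_size: int) -> float:
    """Variance via sum_sq/n - mean**2, the form discussed above."""
    n = len(values)
    mean = chunked_sum(values, chunk_size) / n
    sum_sq = chunked_sum([v * v for v in values], chunk_size)
    return sum_sq / n - mean * mean


def isclose(a: float, b: float, rtol: float = 1e-12, atol: float = 1e-15) -> bool:
    """Same comparison rule as np.isclose."""
    return abs(a - b) <= atol + rtol * abs(b)


rng = random.Random(0)
values = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
var_small_chunks = variance(values, chunk_size=7)
var_large_chunks = variance(values, chunk_size=64)
print("bit-identical:", var_small_chunks == var_large_chunks)
print("within tolerance:", isclose(var_small_chunks, var_large_chunks))
```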

Multi-split datasets
For datasets with multiple splits (e.g. the evaluation dataset with public and
holdout), bake each split independently — each split has its own seed file and the
seeds must be uncorrelated — then combine.
Under seed_protocol 3.0, each split has its own JSON file of per-MLP seeds. All
workers baking a given split must receive the SAME JSON file (they internally
slice it by --slice K/N); seeds for different splits MUST be different files
to preserve cross-split independence.
(The orchestrator in whest-evaluation-utils/gpu-dataset-bake/ automates this.)
# Generate independent seed files for each split (once, before launching workers).
whest dataset generate-seeds --n-mlps 50 > public-seeds.json
whest dataset generate-seeds --n-mlps 50 > holdout-seeds.json

# Parallel-bake the public split (4 workers, same seeds file).
for K in 0 1 2 3; do
  whest dataset bake --n-mlps 50 --n-samples 1_000_000_000 --width 256 --depth 8 \
    --split public --config default --mlp-seeds public-seeds.json --slice $K/4 \
    --torch --device cuda --output ./pub-p$K &
done
wait
whest dataset merge ./pub-p* --output ./pub-complete

# Parallel-bake the holdout split (4 workers, different seeds file).
for K in 0 1 2 3; do
  whest dataset bake --n-mlps 50 --n-samples 1_000_000_000 --width 256 --depth 8 \
    --split holdout --config holdout --mlp-seeds holdout-seeds.json --slice $K/4 \
    --torch --device cuda --output ./hold-p$K &
done
wait
whest dataset merge ./hold-p* --output ./hold-complete

# Combine into one multi-split directory.
whest dataset combine-splits ./pub-complete ./hold-complete --output ./eval

# Inspect, then upload (info and upload are the canonical names for inspect and push).
whest dataset info ./eval
whest dataset upload ./eval --repo aicrowd/arc-whestbench-2026-evals --tag round-1 --private
Each per-split bake is independent — workers in different splits don't share any seed
state. The combine step validates that all splits agree on the invariants (width,
depth, n_samples, backend) but allows different per-split n_mlps.

--- how-to/publish-to-hf-hub ---
URL: https://aicrowd.github.io/whestbench/docs/how-to/publish-to-hf-hub

Publishing a dataset to HuggingFace Hub
Step-by-step walkthrough for baking a WhestBench evaluation dataset locally and
publishing it to HuggingFace Hub. Once published, participants (and other machines)
can load it directly with datasets.load_dataset or whestbench.load_dataset.
Prerequisites:

pip install whestbench (or whestbench[gpu] for GPU bakes)
A HuggingFace account with write access to the target repo
HF_TOKEN with write scope (see step 1)

1. Set up authentication
# Option A — interactive login (stores a token in ~/.cache/huggingface/token)
huggingface-cli login

# Option B — environment variable (preferred in CI)
export HF_TOKEN=hf_your_write_token_here
The HF_TOKEN environment variable is read automatically by whest dataset push
and whestbench.publish_dataset. You can also pass it explicitly via --token.
2. Bake locally
Bake a dataset to a local directory. Choose --n-mlps and --n-samples appropriate
to your use case. For bit-exact reproducibility, pass an explicit
--mlp-seeds JSON file.
whest dataset bake \
    --n-mlps 10 \
    --n-samples 10_000_000 \
    --width 256 \
    --depth 8 \
    --output ./my-bake
For larger bakes, see Parallel bake across multiple GPUs and
GPU Dataset Generation.
3. Inspect before publishing
Verify the bake parameters before uploading. This is cheap and catches any
misconfiguration before it goes out:
whest dataset inspect ./my-bake
Expected output:
WhestBench dataset
  schema_version: 3.0
  format: hf-datasets-parquet
  backend: flopscope
  seed: 42
  n_mlps: 10
  n_samples: 10000000
  width: 256
  depth: 8
  created_at_utc: 2026-05-25T12:00:00+00:00
You can also verify the dataset loads correctly before pushing:
import whestbench

ds = whestbench.load_dataset("./my-bake")
print(len(ds), "MLPs loaded")
for mlp in whestbench.iter_mlps(ds):
    print(mlp.name, mlp.weights[0].shape)
    break
4. Publish
Push the local directory to HF Hub. Use --tag to create a versioned git tag —
this is strongly recommended so participants can pin a specific version.
whest dataset push ./my-bake \
    --repo aicrowd/arc-whestbench-2026 \
    --tag v1 \
    --message "Bake: 10 MLPs, seed=42, 10M samples"
Expected output:
Uploaded to aicrowd/arc-whestbench-2026; commit abc1234def; tag v1
For a private repo (e.g. holdout sets), add --private:
whest dataset push ./my-bake \
    --repo aicrowd/arc-whestbench-2026-holdout \
    --tag v1 \
    --private \
    --message "Holdout bake: seed=99"
What gets uploaded

data/<split>-00000-of-00001.parquet — the MLP data
metadata.json — provenance sidecar
README.md — rendered dataset card (re-rendered with the actual repo_id and tag before upload, including any declared HF config layout)

5. Verify on HF Hub
Visit the dataset page to confirm the upload succeeded:
https://huggingface.co/datasets/aicrowd/arc-whestbench-2026/tree/v1
You should see three entries (the data/ directory, metadata.json, and README.md) and the
dataset card rendered from the README.
You can also inspect from the CLI without downloading:
whest dataset inspect aicrowd/arc-whestbench-2026 --revision v1
6. Pull on another machine
On any other machine with whestbench installed:
whest dataset pull aicrowd/arc-whestbench-2026 \
    --revision v1 \
    --output ./local-copy
For a private repo, pass --token or set HF_TOKEN first.
7. Load in a participant script
Using datasets.load_dataset directly
from datasets import load_dataset

ds = load_dataset(
    "aicrowd/arc-whestbench-2026",
    revision="v1",
    split="public",
)
print(ds)  # Dataset({features: ['mlp_id', 'mlp_name', ...], num_rows: 10})
print(ds[0]["mlp_name"])  # "danielle-johnson"
Using whestbench.load_dataset (recommended)
The wrapper validates the schema and attaches metadata for later retrieval:
import whestbench

ds = whestbench.load_dataset(
    "aicrowd/arc-whestbench-2026",
    revision="v1",
    split="public",
)

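# Iterate as MLP instances. my_estimator is your estimator object and
# flop_budget is the integer FLOP cap you evaluate under (see Estimator Contract).
for mlp in whestbench.iter_mlps(ds):
    y_pred = my_estimator.predict(mlp, flop_budget)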

# Access metadata
md = whestbench.metadata(ds)
print(md["seed"], md["n_mlps"])
Running evaluation against the published dataset
whest run --estimator ./estimator.py \
    --dataset hf://aicrowd/arc-whestbench-2026@v1

# Or equivalently:
whest run --estimator ./estimator.py \
    --dataset aicrowd/arc-whestbench-2026 \
    --revision v1
Note: bare aicrowd/arc-whestbench-2026 without --revision is rejected by
whest run — always pin a revision.
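The two equivalent spellings and the pinned-revision rule can be sketched as a small parser. The helper name parse_dataset_ref is hypothetical, and the way local paths are told apart from Hub repo ids (a leading ./, ../ or /) is an assumption of this sketch, not documented whest behaviour.

```python
def parse_dataset_ref(dataset, revision=None):
    """Split a --dataset style argument into (kind, location, revision).

    Hypothetical helper. Local paths are returned as-is; Hub references must
    carry a revision, either as an @suffix on an hf:// URI or via `revision`.
    """
    prefix = "hf://"
    if dataset.startswith(prefix):
        repo, sep, inline_rev = dataset[len(prefix):].partition("@")
        if sep and revision and inline_rev != revision:
            raise ValueError("conflicting revisions given")
        revision = inline_rev or revision
    elif dataset.startswith(("./", "../", "/")):
        # Assumption: anything that looks like a filesystem path is local.
        return ("local", dataset, None)
    else:
        repo = dataset
    if not revision:
        raise ValueError(f"{repo}: pin a revision, for example @v1 or --revision v1")
    return ("hub", repo, revision)


pinned_uri = parse_dataset_ref("hf://aicrowd/arc-whestbench-2026@v1")
pinned_flag = parse_dataset_ref("aicrowd/arc-whestbench-2026", revision="v1")
```

Both calls resolve to the same repo and revision, mirroring the two whest run invocations above.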
Troubleshooting
401 Unauthorized — Your HF_TOKEN doesn't have write access to the target
repo, or it has expired. Generate a new token at
huggingface.co/settings/tokens with
write scope.
404 Repository not found — The repo doesn't exist yet. whest dataset push
creates it automatically; ensure you have permission to create repos under the
target org (e.g. aicrowd/).
FileExistsError: output already exists — whest dataset bake refuses to
overwrite an existing directory. Delete or rename the old output first, or choose
a new --output path.
Dataset rejected with "partial dataset" error — You pushed a slice bake without
merging first. Run whest dataset merge on all slices, then push the merged result.
See Parallel bake.
Multi-split datasets
whest dataset push handles multi-split datasets natively. The local directory must contain one parquet per split in data/ and a metadata.json with a splits: dict; this is the shape produced by whest dataset combine-splits. If the input bakes declared --config, the push preserves that config-per-split layout in the published dataset card. The push uploads all parquets in one commit; tag with --tag round-N for per-round eval datasets.
For private repos (e.g. the evaluation dataset), pass --private on first push to create the repo as private. Subsequent pushes preserve the privacy setting.

============================================================
Reference
============================================================

--- reference/estimator-contract ---
URL: https://aicrowd.github.io/whestbench/docs/reference/estimator-contract

Estimator Contract
Exact estimator I/O requirements, FLOP tracking rules, failure semantics, memory limits, and reproducibility contract.
When to use this page
Use this page when you need exact estimator I/O requirements.
Required interface
predict(self, mlp: MLP, budget: int) -> fnp.ndarray
Optional lifecycle hooks:

setup(self, context: SetupContext) -> None
teardown(self) -> None

SetupContext fields
| Field | Type | Description |
| --- | --- | --- |
| width | int | Neuron count for generated MLPs |
| depth | int | Number of layers per MLP |
| flop_budget | int | FLOP cap for the estimator |
| api_version | str | Contract version string |
| scratch_dir | str or None | Optional writable directory for caching |
| seed | int | Per-run seed from --seed (default 0). Use in setup() to reproduce one-time random initialisation. See Setup-time reproducibility. |
Input object quick reference
| Object | Field | Meaning |
| --- | --- | --- |
| MLP | width | Number of neurons per layer |
| MLP | depth | Number of weight matrices (layers) |
| MLP | weights | Ordered weight matrices, each (width, width) |
| MLP | seed | Per-MLP grader-supplied seed; use this to seed estimator-internal randomness for reproducibility under regrade. |
| MLP | name | Human-readable slug like "danielle-johnson" derived deterministically from seed. Stable across runs and CPU/GPU backends at the WhestBench release's pinned faker version. Useful for log lines and error messages; safe to ignore. Empty string only when an MLP is constructed outside an evaluator bake path. |
For traversal examples, see Inspect and Traverse MLP Structure (in the starter kit).
Output requirements per predict call
| Requirement | Rule |
| --- | --- |
| Shape | Return a 2D array with shape (mlp.depth, mlp.width) |
| Numeric validity | Every value is finite |
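A minimal local pre-check of these two requirements might look like the sketch below. It uses nested Python lists as a stand-in for the fnp.ndarray that predict() returns, and check_prediction_grid is a hypothetical name, not the grader's own validation routine.

```python
import math


def check_prediction_grid(predictions, depth, width):
    """Raise ValueError unless predictions is a depth x width grid of finite numbers.

    Hypothetical stand-in that works on nested lists rather than fnp.ndarray.
    """
    rows = [list(row) for row in predictions]
    if len(rows) != depth or any(len(row) != width for row in rows):
        raise ValueError(f"expected {depth} rows of {width} values each")
    for row in rows:
        for value in row:
            if not math.isfinite(value):
                raise ValueError("predictions contain inf or NaN")
    return rows


grid = check_prediction_grid([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]], depth=3, width=2)
```

Running a check like this before returning from predict() lets you catch a shape or NaN problem in your own logs rather than through the failure path.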
FLOP tracking
Your estimator must use flopscope primitives (import flopscope as flops and import flopscope.numpy as fnp) for all numerical computation. flopscope tracks FLOP usage analytically. If the total FLOPs across your entire predict call exceed flop_budget, all predictions for that MLP are replaced with zeros and the MSE for that MLP is computed between those zero predictions and the ground truth.
Failure semantics
When predict() cannot return a valid result — for any reason — the affected MLP is
scored as if the estimator had returned a zero array, and the multiplier in the
budget-adjusted score s_m is forced to 1.0 (no compute discount). Concretely:

FLOP budget exhausted (flopscope.BudgetExhaustedError) → Y_hat = 0, s_m = MSE(0, Y) * 1.0
Wall-time / residual-time budget exhausted → same
Combined-budget post-check (C_m = F_m + λ·R_m > B_m) → same
predict() raised an exception (any subclass of Exception, including MemoryError,
ValueError from validate_predictions, custom estimator exceptions) → same
Invalid output shape (not (depth, width)) → same
Non-finite values (any inf or NaN) → same
Subprocess worker hard-killed (OOM, segfault, timeout, non-zero exit) → same

The scoring loop continues across the remaining MLPs and produces a finite adjusted_final_layer_score.
Per-MLP diagnostic fields (error, error_code, traceback, budget_exhausted,
time_exhausted, residual_wall_time_exhausted, combined_budget_exhausted) are preserved
so failures remain debuggable.
The "no compute discount on failure" rule (multiplier forced to 1.0) ensures that a failed
run is strictly worse than a trivial-zero submission that succeeds (which receives the
0.1 multiplier floor, i.e. the maximum discount, capped at a factor of ten).
Memory limit
ContestSpec.memory_limit_mb (default 65_536, i.e. 64 GB — matches the Phase 1 grader allocation) bounds the address space available to your estimator. Enforcement depends on the runner:

--runner subprocess (used by the grader): the worker calls resource.setrlimit(RLIMIT_AS, ...) before importing your estimator module. Any allocation that would exceed the cap raises MemoryError inside predict(), which routes through the failure path described above (zero-prediction MSE × 1.0).
--runner local: the limit is advisory only. WhestBench cannot safely call setrlimit on the CLI process itself. The runner emits a single warning at start ("memory_limit_mb=… is advisory in --runner local: enforcement requires --runner subprocess (uses RLIMIT_AS) or external sandboxing (cgroups).") and continues without enforcement. Use --runner subprocess if you want the limit actually enforced locally.

Platforms without RLIMIT_AS (Windows, some BSDs) log a warning to the worker's stderr and continue without enforcement. The grader's evaluation environment is Linux, where enforcement is reliable.
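For readers unfamiliar with RLIMIT_AS, the sketch below shows how a worker process could cap its own address space with the standard library. The function names are hypothetical and this is not WhestBench's worker code; it only illustrates the mechanism and the fallback on platforms without RLIMIT_AS.

```python
import sys


def memory_limit_bytes(memory_limit_mb):
    """Convert a megabyte cap (as in memory_limit_mb) to bytes."""
    return int(memory_limit_mb) * 1024 * 1024


def apply_address_space_limit(memory_limit_mb):
    """Cap this process's address space; return True if the cap was applied.

    Call this only inside a dedicated worker process, before importing the
    estimator module, because the cap applies to the whole process.
    """
    try:
        import resource  # POSIX only; not available on Windows
        rlimit_as = resource.RLIMIT_AS
    except (ImportError, AttributeError):
        print("RLIMIT_AS unavailable; continuing without enforcement", file=sys.stderr)
        return False
    wanted = memory_limit_bytes(memory_limit_mb)
    _soft, hard = resource.getrlimit(rlimit_as)
    if hard != resource.RLIM_INFINITY:
        wanted = min(wanted, hard)  # the soft limit may not exceed the hard limit
    resource.setrlimit(rlimit_as, (wanted, hard))
    return True
```

Once the cap is in place, an allocation that would exceed it surfaces as MemoryError in Python, which matches the failure path described above.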
Wall-clock cap
ContestSpec.wall_time_limit_s (default 60.0 seconds, matching the Phase 1 grader cap) is an operational backstop on per-MLP predict() execution. If a single predict() call's elapsed wall-clock time exceeds the cap, the estimator's prediction is replaced with zeros and the MLP is scored through the failure path (zero-prediction MSE × 1.0, no compute discount). This is intentionally generous: the primary compute constraint is the effective compute C_m = F_m + λ·R_m measured against the budget B_m; the wall-clock cap only catches stalled or runaway submissions.
The CLI flag --wall-time-limit SECONDS accepts a positive float. To disable the cap programmatically, construct your own ContestSpec with wall_time_limit_s=None.
Reproducibility under the grader seed
Predict-time reproducibility
If your estimator uses randomness — Monte Carlo sampling, randomized hashing,
random projections, etc. — seed it from mlp.seed. The grader supplies a fixed
per-MLP seed that is identical across all submissions for a given MLP, derived
deterministically from the suite seed. Submissions that use unseeded randomness
or their own seeds are NOT guaranteed to reproduce under regrade and may be
disqualified for prize eligibility.
Example:
import flopscope.numpy as fnp

def predict(self, mlp, budget):
    rng = fnp.random.default_rng(mlp.seed)
    # ... use rng for any internal randomness
If your estimator is deterministic (no internal randomness), you can ignore mlp.seed.
Setup-time reproducibility
If your estimator does randomized one-time setup (e.g., sampling a random
projection basis, jittering initial weights, choosing random hyperparameters),
seed it from ctx.seed inside setup(). When the grader passes --seed, the same value is forwarded to ctx.seed for every MLP in the run; participants running locally can pass --seed themselves to reproduce a given setup.
import flopscope.numpy as fnp

def setup(self, ctx: SetupContext) -> None:
    self.setup_rng = fnp.random.default_rng(ctx.seed)
    # ... use self.setup_rng for any one-time random work
Do not call fnp.random.seed(ctx.seed) (or np.random.seed(ctx.seed)) —
that mutates the process-global RNG and breaks composability with other
libraries. Use fnp.random.default_rng(ctx.seed) to get an isolated Generator.
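The composability argument can be seen in a small sketch that uses the standard library's random module as a stand-in for fnp.random, so that it runs without flopscope. That substitution and the ToyEstimator class are assumptions of the sketch. An isolated generator seeded in setup() gives the same setup state no matter what other code does to the process-global RNG.

```python
import random


class ToyEstimator:
    """Hypothetical estimator whose setup draws a small random basis."""

    def setup(self, seed):
        # Isolated generator: unaffected by, and invisible to, the global RNG.
        self.setup_rng = random.Random(seed)
        self.projection = [self.setup_rng.gauss(0.0, 1.0) for _ in range(4)]


first = ToyEstimator()
first.setup(seed=7)

random.random()  # unrelated code drawing from the process-global RNG

second = ToyEstimator()
second.setup(seed=7)

other = ToyEstimator()
other.setup(seed=8)
```

Here first.projection and second.projection are identical despite the intervening global draw, while a different seed gives a different basis.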
ctx.seed defaults to 0 when no --seed was passed; estimators that don't
read it are unaffected. The seed is recorded in the run output under
run_config.seed for audit-trail purposes — a reviewer can read it from a
participant's JSON output and re-run with --seed N to reproduce the
participant's setup state. See score-report-fields
for the run_config.seed field.
ctx.seed and mlp.seed are independent: mlp.seed controls per-MLP
randomness inside predict(), ctx.seed controls one-time setup. With
--dataset, the dataset supplies mlp.seed values (baked at the dataset's
own seed) while --seed controls ctx.seed only. See
cli-reference for the --seed flag semantics.
Next step

Write an Estimator (in the starter kit)
Common Participant Errors (in the starter kit)

--- reference/score-report-fields ---
URL: https://aicrowd.github.io/whestbench/docs/reference/score-report-fields

Score Report Fields
Reference for interpreting whest run output fields, including per-MLP diagnostics, time decomposition, and the budget-adjusted scoring formula.
When to use this page
Use this page to interpret whest run output fields.
Top-level fields
Typical report sections include:

schema_version
mode
run_meta
run_config
run_config.seed (always present; null when --seed is not provided)
run_config.dataset (present when --dataset is used)
results

Run configuration fields
run_config records the parameters that governed the run:
| Field | Description |
| --- | --- |
| seed | The --seed value passed at the CLI, or null when --seed is omitted. When set, it determines both MLP generation (without --dataset) and SetupContext.seed for the participant's setup() call. When null, ctx.seed defaults to 0. See estimator-contract for the reproducibility contract and cli-reference for --seed flag semantics. |
| dataset | Present when --dataset is used. See Dataset traceability fields below. |
Host metadata
run_meta.host is always an object. If you set WHEST_SKIP_HARDWARE_FALLBACK_PROBES=1, WhestBench still records cheap host fields and any values available through psutil, but fallback-backed fields such as cpu_count_physical and ram_total_bytes may be null.
Core result fields
Inside results:
| Field | Description |
| --- | --- |
| adjusted_final_layer_score | Budget-adjusted leaderboard metric: suite mean of per-MLP adjusted_final_layer_score = final_layer_mse × max(0.1, C_m/B_m); failure → × 1.0. Lower is better. |
| all_layers_mse | Raw all-layers MSE averaged across MLPs (no budget multiplier). Diagnostic: reveals where approximation error accumulates. |
| final_layer_mse | Raw final-layer MSE averaged across MLPs (no multiplier). |
| per_layer_mse | Per-layer MSE averaged across MLPs. list[float] of length depth. The last element equals final_layer_mse and the list mean equals all_layers_mse. Diagnostic only, no budget multiplier. |
| best_mlp_adjusted_final_layer_score | Minimum per-MLP adjusted_final_layer_score. |
| worst_mlp_adjusted_final_layer_score | Maximum per-MLP adjusted_final_layer_score. |
| mean_score_multiplier | Mean of per-MLP max(0.1, C_m/B_m) (1.0 on failure). Bounded [0.1, 1.0]. |
| mean_compute_utilization | Mean of per-MLP C_m/B_m, unclamped; can exceed 1.0 when an MLP busted the cap. |
| n_failed_mlps | Count of MLPs with any failure flag or error_code set. |
| mean_effective_compute | Mean of per-MLP effective_compute. |
| failure_breakdown | Dict with independent counts per failure flag: budget_exhausted, time_exhausted, residual_wall_time_exhausted, combined_budget_exhausted, error. Sums can exceed n_failed_mlps because one MLP can carry multiple flags. |
| breakdowns | Aggregate FLOP/time breakdowns keyed by section name. Includes sampling and estimator. |
| per_mlp | Array of per-MLP detail records (see below) |
Per-MLP fields
Each entry in per_mlp:
| Field | Type | Description |
| --- | --- | --- |
| mlp_index | int | Index of the MLP in the evaluation set |
| mlp_name | str | Human-readable slug for this MLP (e.g. "danielle-johnson"). Same value as mlps[i].name on the corresponding MLP; derived deterministically from mlp_index's per-MLP seed. Use it as a stable label in your own logs and dashboards. |
| flops_used | int | Total FLOPs used by your estimator for this MLP |
| effective_compute | float | C_m = F_m + λ·R_m. Combined FLOP-equivalent compute used by the estimator. |
| adjusted_final_layer_score | float | s_m. The per-MLP budget-adjusted score that flows into the suite mean. |
| combined_budget_exhausted | bool | Whether the post-hoc check C_m > B_m fired (predictions zeroed if true). |
| budget_exhausted | bool | Whether the estimator exceeded the FLOP budget (predictions zeroed if true) |
| time_exhausted | bool | Whether the estimator exceeded the wall-clock limit for this MLP (predictions zeroed if true) |
| residual_wall_time_exhausted | bool | Whether WhestBench judged non-flopscope time to exceed residual_wall_time_limit_s (predictions zeroed if true) |
| wall_time_s | float | Total elapsed wall-clock time measured for this MLP's estimator context |
| flopscope_backend_time_s | float | Wall time inside counted flopscope numpy kernels, i.e. the participant's actual numpy compute |
| flopscope_overhead_time_s | float | Wall time inside flopscope's own dispatch code (wrapper preambles, FLOP bookkeeping, namespace push/pop). Framework cost, not participant cost. |
| residual_wall_time_s | float | Wall time inside the predict context that is neither flopscope backend execution nor flopscope dispatch, i.e. participant Python (loops, control flow), GC, uninstrumented numpy |
| final_layer_mse | float | MSE of your final-layer predictions vs ground truth |
| all_layers_mse | float | MSE of your all-layer predictions vs ground truth |
| per_layer_mse | list[float] | Per-layer MSE for this MLP. Length equals depth. per_layer_mse[-1] == final_layer_mse and mean(per_layer_mse) == all_layers_mse within float precision. |
| breakdowns | dict or null | Per-MLP breakdown container. Currently includes estimator-only data under estimator. Sampling is aggregate-only. |
| traceback | str or null | Non-null when this MLP's run did not produce real predictions: captures the Python traceback for either an estimator exception or a budget/time exhaustion. null on clean runs. For subprocess/server runners, the traceback is forwarded from the worker. |
When the estimator raised an unhandled exception (not budget/time exhaustion), the entry also includes:
| Field | Type | Description |
| --- | --- | --- |
| error | str or dict | Legacy string message, or structured object: {"message": str, "details": object} |
| error_code | str | Stable identifier: PREDICT_ERROR for a RunnerError, or the Python exception class name otherwise |
For structured error objects, error.details includes:

expected_shape: List[int] with expected (depth, width).
got_shape: List[int] observed from estimator output.
cause_hints: List[str] with user-facing hints.
hint: short summary hint.
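The aggregate fields n_failed_mlps and failure_breakdown can be recomputed from per_mlp entries, which is handy when post-processing a report. The sketch below follows the field descriptions above; summarize_failures is a hypothetical helper and the sample records are made up.

```python
FAILURE_FLAGS = (
    "budget_exhausted",
    "time_exhausted",
    "residual_wall_time_exhausted",
    "combined_budget_exhausted",
)


def summarize_failures(per_mlp):
    """Return (n_failed_mlps, failure_breakdown) for a list of per-MLP records."""
    breakdown = {flag: 0 for flag in FAILURE_FLAGS}
    breakdown["error"] = 0
    n_failed = 0
    for entry in per_mlp:
        raised = [flag for flag in FAILURE_FLAGS if entry.get(flag)]
        has_error = entry.get("error") is not None or entry.get("error_code") is not None
        for flag in raised:
            breakdown[flag] += 1  # counted independently per flag
        if has_error:
            breakdown["error"] += 1
        if raised or has_error:
            n_failed += 1  # each MLP counts once, however many flags it carries
    return n_failed, breakdown


# Made-up records: one clean MLP, one with two flags, one with an exception.
records = [
    {"mlp_index": 0, "budget_exhausted": False},
    {"mlp_index": 1, "budget_exhausted": True, "combined_budget_exhausted": True},
    {"mlp_index": 2, "error": "bad shape", "error_code": "ValueError"},
]
n_failed, breakdown = summarize_failures(records)
```

In this example two MLPs failed but the breakdown counts sum to three, which is the "sums can exceed n_failed_mlps" case.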

Time decomposition
Every predict() call satisfies a strict three-bucket identity:
wall_time_s = flopscope_backend_time_s + flopscope_overhead_time_s + residual_wall_time_s

flopscope_backend_time_s - numpy kernels actually crunching numbers via flopscope.numpy.*.
flopscope_overhead_time_s - flopscope's own dispatch (wrapper preambles, FLOP bookkeeping, namespace push/pop).
residual_wall_time_s - everything else inside the wall window: participant Python, GC, uninstrumented numpy.

The decomposition holds at every level: per-MLP, aggregated across MLPs, and per namespace inside breakdowns.
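The three-bucket identity is easy to check on a per_mlp record, and the bucket shares feed directly into the interpretation guide later on this page. The helper names and the sample numbers below are hypothetical.

```python
import math


def check_time_identity(entry, tol=1e-9):
    """True if wall_time_s equals the sum of the three buckets within tol seconds."""
    total = (
        entry["flopscope_backend_time_s"]
        + entry["flopscope_overhead_time_s"]
        + entry["residual_wall_time_s"]
    )
    return math.isclose(entry["wall_time_s"], total, rel_tol=0.0, abs_tol=tol)


def bucket_shares(entry):
    """Fraction of wall time spent in each bucket (backend, overhead, residual)."""
    wall = entry["wall_time_s"]
    if wall <= 0:
        raise ValueError("wall_time_s must be positive")
    return {
        "backend": entry["flopscope_backend_time_s"] / wall,
        "overhead": entry["flopscope_overhead_time_s"] / wall,
        "residual": entry["residual_wall_time_s"] / wall,
    }


# Made-up record for illustration.
sample = {
    "wall_time_s": 0.5,
    "flopscope_backend_time_s": 0.3,
    "flopscope_overhead_time_s": 0.05,
    "residual_wall_time_s": 0.15,
}
identity_holds = check_time_identity(sample)
shares = bucket_shares(sample)
```

A large residual share is the signal that participant Python, rather than counted numpy work, dominates the call.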
Breakdown containers
When namespace-aware flopscope data is available, WhestBench adds breakdown containers in
these places:

results.breakdowns.estimator - aggregated estimator breakdown across all evaluated MLPs
results.breakdowns.sampling - aggregated sampling breakdown across all evaluated MLPs
results.per_mlp[].breakdowns.estimator - one normalized estimator breakdown per MLP

Namespace normalization rules:

sampling work is namespaced under sampling.*
unlabeled estimator work becomes estimator.estimator-client
explicit estimator namespace phase becomes estimator.phase
nested estimator namespace phase.subphase becomes estimator.phase.subphase
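The estimator-side rules above amount to a one-line mapping. The sketch below is an illustration only: normalize_estimator_namespace is a hypothetical name, and the sampling.* side is left out because the rules above do not spell out its labels.

```python
def normalize_estimator_namespace(label=None):
    """Map a participant-chosen namespace label to its breakdown key.

    None or an empty label stands for unlabeled estimator work.
    """
    if not label:
        return "estimator.estimator-client"
    return f"estimator.{label}"


keys = [
    normalize_estimator_namespace(),
    normalize_estimator_namespace("phase"),
    normalize_estimator_namespace("phase.subphase"),
]
```

The resulting keys are the ones to look for under results.per_mlp[].breakdowns.estimator when attributing FLOPs or time to a phase of your estimator.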

Each breakdown summary also includes timing totals:

flopscope_backend_time_s - accumulated time inside counted flopscope operations
flopscope_overhead_time_s - accumulated time inside flopscope's own dispatch
residual_wall_time_s - everything else (participant Python, GC, uninstrumented numpy)

For results.breakdowns.*, those values are aggregated across all evaluated
MLPs.
Budget-adjusted scoring
The leaderboard ranks submissions by adjusted_final_layer_score, the suite mean of the
budget-adjusted per-MLP score:
adjusted_final_layer_score = final_layer_mse × max(0.1, C_m / B_m)   for valid runs
adjusted_final_layer_score = final_layer_mse × 1.0                    for failures (no compute discount)

C_m = F_m + λ · R_m                      (effective compute, FLOPs and FLOP-equivalents)
λ = 1e11 FLOPs/second                    (conversion rate; see flopscope-primer)
Where F_m is the analytical FLOPs counted by flopscope (flops_used), R_m is the
residual wall-time bucket (residual_wall_time_s — neither flopscope-backend nor
flopscope-overhead), and B_m is flop_budget. The max(0.1, ...) floor caps the
discount at 10× so an arbitrarily cheap-but-wrong submission cannot dominate the ranking.
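The formulas above translate into a few lines of Python. This is a sketch for checking numbers by hand, with hypothetical function names; on failure the caller is expected to pass the zero-prediction MSE, as described in the estimator contract.

```python
LAMBDA_FLOPS_PER_SECOND = 1e11  # conversion rate λ from the formula above
MULTIPLIER_FLOOR = 0.1          # the discount is capped at 10x


def effective_compute(flops_used, residual_wall_time_s, lam=LAMBDA_FLOPS_PER_SECOND):
    """C_m = F_m + λ · R_m."""
    return flops_used + lam * residual_wall_time_s


def adjusted_score(final_layer_mse, flops_used, residual_wall_time_s,
                   flop_budget, failed=False):
    """Per-MLP budget-adjusted score s_m.

    For failed runs, final_layer_mse should be the MSE of zero predictions
    and no compute discount is applied.
    """
    if failed:
        return final_layer_mse * 1.0
    c_m = effective_compute(flops_used, residual_wall_time_s)
    return final_layer_mse * max(MULTIPLIER_FLOOR, c_m / flop_budget)


def suite_score(per_mlp_scores):
    """Leaderboard metric: mean of the per-MLP adjusted scores."""
    return sum(per_mlp_scores) / len(per_mlp_scores)


# Illustrative numbers: 5e8 counted FLOPs plus 1 ms of residual time, budget 1e9.
partial_use = adjusted_score(2.0, flops_used=5e8, residual_wall_time_s=0.001, flop_budget=1e9)
at_floor = adjusted_score(2.0, flops_used=1e6, residual_wall_time_s=0.0, flop_budget=1e9)
on_failure = adjusted_score(2.0, flops_used=0, residual_wall_time_s=0.0, flop_budget=1e9, failed=True)
```

In the first call, 1 ms of residual time converts to 1e8 FLOP-equivalents, so the multiplier is about 0.6; the second call sits at the 0.1 floor; the failed call keeps the full MSE.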

Why "score" not "MSE"? Once final_layer_mse is multiplied by the budget
factor max(0.1, C_m/B_m), the result is no longer a mean-squared-error between
predictions and targets — it is a derived ranking score (denoted s_m). The
_score suffix in adjusted_final_layer_score reflects this; the raw
diagnostics final_layer_mse and all_layers_mse keep the _mse suffix because
they remain genuine MSEs.

Interpretation guide

final_layer_mse is your most actionable diagnostic — it directly drives adjusted_final_layer_score.
budget_exhausted is the first thing to check if your score is unexpectedly high — exceeded budget means your predictions were zeroed.
time_exhausted means the estimator crossed the wall-clock limit configured through wall_time_limit_s / --wall-time-limit.
residual_wall_time_exhausted means the non-flopscope portion of execution crossed WhestBench's residual_wall_time_limit_s / --residual-wall-time-limit.
flops_used vs flop_budget shows how much headroom you have. If you are consistently near the cap, consider lighter methods.
High flopscope_backend_time_s relative to wall: numpy compute is the dominant cost. Healthy for a numpy-heavy estimator.
High flopscope_overhead_time_s relative to wall: many small ops are paying the per-call dispatch tax. Consider batching with larger numpy primitives.
High residual_wall_time_s relative to wall: participant Python is the bottleneck (tight loops, per-element attribute access, calls into uninstrumented libraries). This is the bucket future versions of WhestBench will penalise on.
adjusted_final_layer_score is the budget-adjusted leaderboard metric and is always ≤ the raw final_layer_mse
mean (the multiplier is at most 1.0 — it equals 1.0 at full budget use or on failures
and drops to 0.1 at the discount floor — a factor-of-ten cap). A value close to raw final_layer_mse
means you used near-full budget; a value close to one-tenth of raw final_layer_mse
means you used ≤10% of the budget and got the maximum discount.
all_layers_mse is a diagnostic aggregate with no budget multiplier. Use it to understand where approximation error accumulates across all layers, not just the final layer.
per_layer_mse decomposes all_layers_mse layer-by-layer (length = depth). Useful for spotting which layers your estimator struggles on — e.g. early layers vs. final layer. By construction per_layer_mse[-1] == final_layer_mse and mean(per_layer_mse) == all_layers_mse (within float precision).
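The relationship between per_layer_mse, final_layer_mse and all_layers_mse can be checked on a toy example. The sketch assumes a plain unnormalised mean of squared differences over the neurons of each layer and uses nested lists instead of arrays; the numbers are made up.

```python
def per_layer_mse(predictions, targets):
    """MSE per layer for (depth, width) grids given as nested lists."""
    return [
        sum((p - t) ** 2 for p, t in zip(pred_row, target_row)) / len(target_row)
        for pred_row, target_row in zip(predictions, targets)
    ]


# Toy example with depth 3 and width 2.
predictions = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
targets = [[1.0, 1.0], [1.0, 3.0], [2.0, 4.0]]

layer_mse = per_layer_mse(predictions, targets)
final_layer_mse = layer_mse[-1]                   # last layer only
all_layers_mse = sum(layer_mse) / len(layer_mse)  # mean over layers
```

Because every layer has the same width, the mean of the per-layer values equals the MSE over all depth × width entries.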

Dataset traceability fields
When using whest run --dataset, the report includes run_config.dataset:
| Field | Description |
| --- | --- |
| path | Absolute path to the dataset file |
| sha256 | SHA-256 hash of the file for integrity |
| seed | RNG seed used to generate the dataset |
| n_mlps | Number of MLPs in the dataset |
| seed_protocol | Object with name and version. WhestBench currently requires version = "2.0". |
Dataset format compatibility
The .npz files produced by whest create-dataset carry a seed_protocol.version in their embedded metadata. WhestBench refuses to load datasets whose version differs from the one it requires: loading a v1.0 dataset raises ValueError("Incompatible dataset seed_protocol version: file has '1.0', this whestbench requires '2.0'. Re-bake the dataset with `whest create-dataset`.").
The v2.0 format adds a per-MLP seed (stored as the mlp_seeds array in the .npz) that is exposed to estimators via mlp.seed — see estimator-contract for how to consume it. Auto-migration is intentionally not implemented because the v1.0 spawn protocol (2 streams per MLP) cannot produce a deterministic third stream; re-baking from the original spec seed is the only correct path.
Schema 2.4 added the per-MLP name slug (stored as the mlp_names array in the .npz). It is a pure function of mlp_seeds at the WhestBench release's pinned faker version, so loading a 2.3 file under 2.4 code transparently synthesizes the same names a fresh 2.4 bake would produce — no re-bake required. See estimator-contract for the mlp.name field exposed to estimators.
Next step

CLI Reference
Scoring Model (in the starter kit)

--- reference/dataset-format ---
URL: https://aicrowd.github.io/whestbench/docs/reference/dataset-format

WhestBench dataset format (schema 3.0)
WhestBench schema 3.0 stores evaluation datasets as a directory of Parquet files
plus two JSON/Markdown sidecars. This layout is native to the datasets library
(datasets.load_dataset(...) works directly on the directory), works with HuggingFace
Hub as a first-class dataset repository, and supports parallel distributed baking
with bit-exact merging.
The earlier .npz format (schemas 2.x) is no longer produced or loaded. Re-bake
with whest dataset bake to migrate.
On-disk layout
<dataset_root>/
├── data/
│   └── <split>-NNNNN-of-MMMMM.parquet   # one row per MLP
├── metadata.json                          # whestbench provenance sidecar
└── README.md                              # HuggingFace dataset card

<split> is the split name. Controlled by whest dataset bake --split.
Dataset authors can separately declare the HF config with --config; the
default is default.
NNNNN-of-MMMMM is the standard HF shard numbering; single-host bakes produce
00000-of-00001.
metadata.json is a flat JSON object with provenance, reproducibility, and hardware
fields (see below).
README.md is a rendered Jinja2 template with a YAML front-matter block that
HuggingFace Hub uses to display the dataset card.
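The shard naming pattern in the layout above can be produced and parsed with a few lines of Python. The helper names are hypothetical; the sketch only mirrors the <split>-NNNNN-of-MMMMM.parquet pattern and is not whestbench code.

```python
import re

# Five-digit, zero-padded shard index and shard count, as in the layout above.
SHARD_RE = re.compile(r"^(?P<split>.+)-(?P<index>\d{5})-of-(?P<total>\d{5})\.parquet$")


def shard_filename(split, index, total):
    """Build '<split>-NNNNN-of-MMMMM.parquet' for a zero-based shard index."""
    if not 0 <= index < total:
        raise ValueError("index must satisfy 0 <= index < total")
    return f"{split}-{index:05d}-of-{total:05d}.parquet"


def parse_shard_filename(name):
    """Return (split, index, total) or raise ValueError for other names."""
    match = SHARD_RE.match(name)
    if match is None:
        raise ValueError(f"not a shard filename: {name!r}")
    return match["split"], int(match["index"]), int(match["total"])


single_host = shard_filename("public", 0, 1)
parsed = parse_shard_filename(single_host)
```

For a single-host bake of the public split this yields the same file name that appears in the "What gets uploaded" list of the publishing how-to.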

Parquet schema (one row per MLP)
Eight columns per row. The depth and width dimensions are fixed for a given
dataset and captured in metadata.json.
This table mirrors the schema section in the published dataset card. They are
maintained in lockstep — any update here must also land in
src/whestbench/templates/dataset_card.md.j2.
| Column | Type / shape | What this is |
| --- | --- | --- |
| mlp_id | int32 | 0-based index of this MLP within the dataset (the absolute index across all parallel-bake slices). |
| mlp_name | string | Stable, deterministic human-readable slug like "danielle-johnson", derived from mlp_seed. Useful for log lines; carries no information beyond mlp_seed. |
| mlp_seed | int64 | Per-MLP seed. Under seed_protocol 3.0 (new bakes), this is the input seed, the canonical value stored in the parquet. mlp.seed (participant-facing) is derived locally from this value via SeedSequence(mlp_seed).spawn(3)[2]. Under legacy seed_protocol 2.0, this column stored the already-derived estimator seed. |
| weights | float32[depth, width, width] | The MLP's layer weight matrices. The network has no biases and uses ReLU activations. Layer l computes h_l(x) = max(0, W_l @ h_{l-1}(x)). Weights are drawn i.i.d. from N(0, 2/width) (He initialization) at bake time. |
| all_layer_means | float32[depth, width] | Ground truth. Entry [l, j] is the empirical mean of neuron j's post-ReLU output at layer l, averaged over many independent Gaussian inputs: E_{x ~ N(0, I)}[ h_l(x)_j ] ≈ (1/N) Σ_i h_l(x_i)_j, where N = n_samples. Computed by direct Monte Carlo. This is what an estimator predicts. |
| final_means | float32[width] | The last row of all_layer_means, i.e. E[h_{depth}(x)_j] for each output neuron j. Materialised as its own column because the primary scoring metric (final_layer_mse) only looks at this row. |
| avg_variance | float64 | The mean across the final-layer neurons of the per-neuron output variance: (1/width) Σ_j Var[h_{depth}(x)_j]. A single scalar per MLP. Used as a normaliser in budget-adjusted scoring so that networks with naturally low output variance don't dominate the MSE rankings. |
| sampling_budget_breakdown | string (JSON) | FLOP accounting for the bake that produced the ground truth for this row; useful as provenance. Not related to the estimator's FLOP budget at evaluation time. Decode with json.loads(...). |
Notes on individual columns
mlp_id — matches the MLP's position in the logical dataset. Partial bakes
(from --slice/--mlp-range) have mlp_id values starting from their slice
offset; after whest dataset merge, mlp_id is monotonically increasing from 0.
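The mlp_id property after a merge can be pictured with a toy sketch. This is not the whest dataset merge implementation; merge_slices is a hypothetical helper working on lists of row dicts, and it only demonstrates the "contiguous from 0" expectation.

```python
def merge_slices(slices):
    """Combine partial row lists and verify mlp_id runs 0, 1, ..., n-1."""
    rows = sorted(
        (row for part in slices for row in part),
        key=lambda row: row["mlp_id"],
    )
    ids = [row["mlp_id"] for row in rows]
    if ids != list(range(len(rows))):
        raise ValueError("partial dataset: mlp_id values are not contiguous from 0")
    return rows


# Two partial bakes, listed out of order; mlp_id starts at each slice's offset.
slice_b = [{"mlp_id": 2}, {"mlp_id": 3}]
slice_a = [{"mlp_id": 0}, {"mlp_id": 1}]
merged = merge_slices([slice_b, slice_a])
```

Passing only slice_b raises, which is the toy analogue of pushing a slice bake without merging first.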
mlp_name — the name is derived deterministically from mlp_seed using the
faker library at a pinned version. The same --seed and --n-mlps always
produce the same name list, on any hardware. Bumping the faker version pin
requires a deliberate re-bake.
weights — stored as float32. The weight matrices for each layer are
weights[i] of shape (width, width). The forward pass uses no biases and ReLU
between layers; inputs are standard Gaussian, sampled fresh per Monte-Carlo draw
when ground truth is computed.
sampling_budget_breakdown — a JSON string with the per-namespace FLOP
counts and wall time consumed by the ground-truth Monte Carlo, accounted via
flopscope. Parse with
json.loads(row["sampling_budget_breakdown"]). This is provenance metadata
about the bake itself, not the estimator's FLOP budget at evaluation time
(which is set at runtime via whest run --flop-budget N).
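To make the weights, all_layer_means and final_means columns concrete, here is a tiny pure-Python sketch of the Monte Carlo estimate they describe: bias-free ReLU layers, He-initialised weights, and standard Gaussian inputs. It is a toy under stated assumptions (nested lists, a very small network, the standard library RNG) and not the bake implementation, which uses flopscope or torch.

```python
import math
import random


def he_weights(rng, depth, width):
    """depth matrices of shape (width, width), entries drawn from N(0, 2/width)."""
    std = math.sqrt(2.0 / width)
    return [
        [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]
        for _ in range(depth)
    ]


def forward(weights, x):
    """Return post-ReLU activations h_1..h_depth with h_l = max(0, W_l @ h_{l-1})."""
    activations = []
    h = x
    for layer in weights:
        h = [max(0.0, sum(w * v for w, v in zip(row, h))) for row in layer]
        activations.append(h)
    return activations


def monte_carlo_layer_means(weights, n_samples, rng):
    """Empirical mean of every neuron's post-ReLU output over Gaussian inputs."""
    depth, width = len(weights), len(weights[0])
    totals = [[0.0] * width for _ in range(depth)]
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(width)]
        for l, h in enumerate(forward(weights, x)):
            for j, value in enumerate(h):
                totals[l][j] += value
    return [[total / n_samples for total in row] for row in totals]


rng = random.Random(0)
weights = he_weights(rng, depth=3, width=4)
all_layer_means = monte_carlo_layer_means(weights, n_samples=2000, rng=rng)
final_means = all_layer_means[-1]  # last row, as in the final_means column
```

An estimator's job is to predict a grid like all_layer_means from the weights alone, within its FLOP budget, rather than by sampling at this scale.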
metadata.json schema
metadata.json is a flat JSON object with the following fields.
Base fields (all bakes)
| Field | Type | Description |
| --- | --- | --- |
| schema_version | string | Always "3.0" for this format |
| format | string | Always "hf-datasets-parquet" |
| backend | string | "flopscope" (CPU path) or "torch" (GPU path) |
| seed_protocol.name | string | "whestbench_explicit_per_mlp_seeds" (3.0, new bakes) or "whestbench_seedsequence_hierarchy" (2.0, legacy). |
| seed_protocol.version | string | "3.0" (new bakes) or "2.0" (legacy). |
| seed | integer or null | Present under seed_protocol 2.0 only. Root seed passed to --seed. null if auto-generated. Absent in 3.0 datasets. |
| split | string | Split name for a single-split bake. New bakes populate this; legacy metadata may omit it. |
| config | string | HF dataset config for a single-split bake. Defaults to "default"; legacy metadata may omit it. |
| n_mlps | integer | Number of MLPs in this dataset (or partial) |
| n_samples | integer | Ground-truth samples per MLP |
| width | integer | Neuron count per layer |
| depth | integer | Number of weight matrices |
| created_at_utc | string | ISO-8601 UTC timestamp of bake completion |
| hardware | object | Hardware fingerprint from the baking host |
Provenance fields
These pin the exact code + runtime state that produced a dataset, so a reader
can reproduce a bake without guessing which whestbench/flopscope/torch versions
or determinism flags were in effect. See
Parallel bake → Bit-equivalence requirements
for the operational consequences.
| Field | Type | Description |
|---|---|---|
| whestbench_version | string | Installed whestbench package version (e.g. "0.3.0"). "unknown" if importlib.metadata couldn't resolve it. |
| flopscope_version | string | Installed flopscope package version. Weight init uses flopscope.numpy so this matters for bit-exact weights. |
validate_metadata treats these as informational and does not require them
(absence doesn't fail validation), but whest dataset bake always populates
them.
Torch-specific fields (when backend == "torch")
| Field | Type | Description |
|---|---|---|
| device | string | "cuda", "mps", or "cpu" |
| torch_version | string | PyTorch version string, e.g. "2.3.0" |
| cuda_device_name | string | GPU name (CUDA only), e.g. "NVIDIA L40S" |
| cuda_device_capability | [int, int] | CUDA compute capability (CUDA only), e.g. [8, 9] |
| cuda_driver_version | string | NVIDIA driver version (CUDA only, best-effort via nvidia-smi). Absent if nvidia-smi is unavailable. |
| mps_device_name | string | Processor name (MPS only) |
| mlps_per_batch | integer | Number of MLPs the bake processed per device-side batch. |
| chunk_size | integer | Number of MC samples per device-side chunk. Pinning this to a fixed value across workers + reference re-bakes is required for cross-host bit-exact verification (see parallel-bake). |
| bake_config | object | Determinism flag state at bake time. See below. |
bake_config object (torch path only)
Captures the state of torch's determinism levers + the cuBLAS workspace env var
at bake time. Two bakes that should produce bit-identical numeric columns must
have matching bake_config values (and matching chunk_size).
| Field | Type | Description |
|---|---|---|
| cudnn_deterministic | boolean | Value of torch.backends.cudnn.deterministic at bake time. |
| cudnn_benchmark | boolean | Value of torch.backends.cudnn.benchmark at bake time. |
| cublas_workspace_config | string or null | Value of the CUBLAS_WORKSPACE_CONFIG env var at bake time, or null if unset. Recommended value for deterministic cuBLAS: ":4096:8". |
| torch_use_deterministic_algorithms | boolean | Value of torch.are_deterministic_algorithms_enabled() at bake time. |
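The matching requirement on bake_config and chunk_size can be checked mechanically before comparing two torch bakes. The sketch below works on plain metadata dicts; bit_exact_mismatches is a hypothetical helper, not part of whestbench, and matching values are a precondition for bit-identical output, not proof of it.

```python
def bit_exact_mismatches(md_a: dict, md_b: dict) -> list:
    """List the reasons two torch bakes cannot be expected to match bit-for-bit."""
    problems = []
    if md_a.get("chunk_size") != md_b.get("chunk_size"):
        problems.append("chunk_size differs")
    cfg_a = md_a.get("bake_config") or {}
    cfg_b = md_b.get("bake_config") or {}
    for key in sorted(set(cfg_a) | set(cfg_b)):
        if cfg_a.get(key) != cfg_b.get(key):
            problems.append(f"bake_config.{key} differs")
    return problems


reference = {
    "chunk_size": 524288,
    "bake_config": {
        "cudnn_deterministic": True,
        "cudnn_benchmark": False,
        "cublas_workspace_config": ":4096:8",
        "torch_use_deterministic_algorithms": False,
    },
}
# Same bake, but with the cuBLAS workspace env var left unset.
rebake = {
    "chunk_size": 524288,
    "bake_config": {**reference["bake_config"], "cublas_workspace_config": None},
}
print(bit_exact_mismatches(reference, rebake))
```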
Partial-bake fields (when --slice or --mlp-range was used)
| Field | Type | Description |
|---|---|---|
| is_partial | boolean | Always true for partial bakes |
| mlp_range | [int, int] | [start, end) range of MLPs in this partial |
| total_n_mlps | integer | Logical total MLP count across all partials |
A dataset with is_partial=true is refused by whestbench.load_dataset — run
whest dataset merge first to assemble a complete dataset.
Merged dataset fields (produced by whest dataset merge)
| Field | Type | Description |
|---|---|---|
| merged_at_utc | string | ISO-8601 UTC timestamp of the merge |
| hardware_fingerprints | array | List of per-partial hardware objects, each including mlp_range |
is_partial, mlp_range, and total_n_mlps are removed by the merge step.
n_mlps is set to the total count.
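The field changes made by the merge step can be sketched as a pure function over partial metadata dicts. This is a rough illustration, not the whest dataset merge implementation: merge_metadata is a hypothetical name, and because the text does not say what happens to the top-level hardware field, the sketch simply leaves it as copied from the first partial.

```python
from datetime import datetime, timezone

PARTIAL_ONLY = ("is_partial", "mlp_range", "total_n_mlps")


def merge_metadata(partials):
    """Build merged metadata from a list of partial-bake metadata dicts."""
    ordered = sorted(partials, key=lambda md: md["mlp_range"][0])
    # Start from the first partial and drop the partial-only fields.
    merged = {k: v for k, v in ordered[0].items() if k not in PARTIAL_ONLY}
    merged["n_mlps"] = ordered[0]["total_n_mlps"]
    merged["merged_at_utc"] = datetime.now(timezone.utc).isoformat()
    merged["hardware_fingerprints"] = [
        {**md.get("hardware", {}), "mlp_range": md["mlp_range"]} for md in ordered
    ]
    return merged


p0 = {"n_mlps": 5, "is_partial": True, "mlp_range": [0, 5], "total_n_mlps": 10,
      "hardware": {"cpu_count": 64}}
p1 = {"n_mlps": 5, "is_partial": True, "mlp_range": [5, 10], "total_n_mlps": 10,
      "hardware": {"cpu_count": 32}}
merged = merge_metadata([p1, p0])
print(merged["n_mlps"], len(merged["hardware_fingerprints"]))
```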
Example metadata.json (CPU bake, seed_protocol 3.0)
{
  "schema_version": "3.0",
  "format": "hf-datasets-parquet",
  "backend": "flopscope",
  "seed_protocol": {
    "name": "whestbench_explicit_per_mlp_seeds",
    "version": "3.0"
  },
  "n_mlps": 10,
  "n_samples": 10000000,
  "width": 256,
  "depth": 8,
  "created_at_utc": "2026-05-25T12:00:00+00:00",
  "hardware": {
    "cpu_brand": "Intel Xeon Platinum 8480+",
    "cpu_count": 64,
    "ram_gb": 512.0
  },
  "whestbench_version": "0.3.0",
  "flopscope_version": "0.3.0"
}
Example metadata.json (torch CUDA bake, seed_protocol 3.0)
{
  "schema_version": "3.0",
  "format": "hf-datasets-parquet",
  "backend": "torch",
  "seed_protocol": {
    "name": "whestbench_explicit_per_mlp_seeds",
    "version": "3.0"
  },
  "n_mlps": 50,
  "n_samples": 1000000000,
  "width": 256,
  "depth": 8,
  "created_at_utc": "2026-05-26T03:45:00+00:00",
  "hardware": { "...": "..." },
  "whestbench_version": "0.3.0",
  "flopscope_version": "0.3.0",
  "torch_version": "2.3.0+cu121",
  "device": "cuda",
  "cuda_device_name": "NVIDIA L40S",
  "cuda_device_capability": [8, 9],
  "cuda_driver_version": "535.183.01",
  "mlps_per_batch": 16,
  "chunk_size": 524288,
  "bake_config": {
    "cudnn_deterministic": true,
    "cudnn_benchmark": false,
    "cublas_workspace_config": ":4096:8",
    "torch_use_deterministic_algorithms": false
  }
}
Under seed_protocol 3.0 there is no top-level seed field. Each MLP's input seed is
stored in the parquet mlp_seed column.
Example metadata.json (legacy seed_protocol 2.0)
{
  "schema_version": "3.0",
  "format": "hf-datasets-parquet",
  "backend": "flopscope",
  "seed_protocol": {
    "name": "whestbench_seedsequence_hierarchy",
    "version": "2.0"
  },
  "seed": 42,
  "n_mlps": 10,
  "n_samples": 10000000,
  "width": 256,
  "depth": 8,
  "created_at_utc": "2026-05-25T12:00:00+00:00",
  "hardware": {
    "cpu_brand": "Intel Xeon Platinum 8480+",
    "cpu_count": 64,
    "ram_gb": 512.0
  }
}
Legacy datasets (e.g. aicrowd/arc-whestbench-2026-smoke-test) use seed_protocol 2.0
and continue to load correctly. New bakes always write seed_protocol 3.0.
README.md (HF dataset card)
README.md is rendered from a Jinja2 template at bake time. It contains:

A YAML front-matter block with license, tags, task_categories, and HF
dataset card metadata required for correct Hub display.
A quick-start code snippet.
A dataset summary table (split, MLPs, width, depth, samples, schema version, seed protocol).
The full Parquet column schema.
Reproducibility information including the exact whest dataset bake command to re-bake.
Hardware provenance (for merged datasets, lists each host's GPU and mlp_range).

When whest dataset push uploads a local directory, it re-renders README.md with
the actual repo_id and revision (tag) so the published card has real values rather
than placeholders.
Loading
Bare datasets.load_dataset
Use this when you only need the raw data and don't need schema validation or the
metadata sidecar:
from datasets import load_dataset

# Local directory
ds = load_dataset("./my-eval", split="public")

# HF Hub
ds = load_dataset(
    "aicrowd/arc-whestbench-2026",
    revision="v1",
    split="public",
)
print(ds)  # Dataset({features: [...], num_rows: 10})
print(ds[0]["mlp_name"])  # "danielle-johnson"
whestbench.load_dataset wrapper
Use this for the recommended workflow. It validates metadata.json, refuses partial
datasets (suggesting the merge step), and attaches metadata to the returned Dataset
object for later retrieval via whestbench.metadata(ds):
import whestbench

# Local
ds = whestbench.load_dataset("./my-eval")

# HF Hub (pin a revision — bare repo without revision is rejected by whest run)
ds = whestbench.load_dataset(
    "aicrowd/arc-whestbench-2026",
    revision="v1",
    split="public",
)

# Access metadata sidecar
md = whestbench.metadata(ds)
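# seed_protocol 3.0 metadata has no top-level "seed" key, so read the protocol version instead
print(md["seed_protocol"]["version"], md["n_mlps"], md["backend"])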

# Iterate as MLP instances
for mlp in whestbench.iter_mlps(ds):
    print(mlp.name, mlp.weights[0].shape)

# Random access
mlp_0 = whestbench.mlp_at(ds, 0)
iter_mlps / mlp_at
Both functions return whestbench.MLP objects constructed via MLP.from_row(row).
The MLP object exposes the same interface as MLPs produced on-the-fly by
whestbench.sample_mlp: mlp.weights, mlp.width, mlp.depth, mlp.name,
mlp.seed.
Schema version policy
| Version | Format | Notes |
|---|---|---|
| 3.0 | Parquet + sidecar directory | Current. Required by this release. |
| 2.4 | .npz with mlp_names field | Legacy. Rejected by load_dataset with a re-bake hint. |
| 2.3 | .npz | Legacy. |
| 2.2 | .npz | Legacy. |
schema_version tracks the storage format (2.x = npz, 3.0 = Parquet).
seed_protocol.version tracks the RNG algorithm that produces per-MLP seeds.
These two version numbers are independent — the seed protocol can be bumped without
changing the storage format, and vice versa.
Seed protocols
whestbench_seedsequence_hierarchy version 2.0 (legacy, read-only)
The original seeding scheme. A single root seed (--seed N) is expanded via
numpy.random.SeedSequence(root_seed) into n_mlps child sequences. Each child
spawns three streams: weights, samples, and estimator. The parquet mlp_seed column
stored the already-derived estimator seed (stream index 2), not the input seed.
New bakes can no longer write seed_protocol 2.0; passing --seed N to a bake is now
rejected with a migration hint.
whestbench_explicit_per_mlp_seeds version 3.0 (new, default)
Each MLP receives an independent input seed (64-bit integer). Seeds are either
auto-generated via secrets.randbits(63) or supplied explicitly via
--mlp-seeds FILE (JSON array of N ints). The parquet mlp_seed column stores
the input seed — the canonical, portable value.
Within each MLP, the three RNG streams are still derived locally:
SeedSequence(mlp_seed).spawn(3) → [weight_seq, sample_seq, estimator_seq].
mlp.seed (participant-facing) equals int(estimator_seq.generate_state(1)[0]),
unchanged from 2.0 from the participant's perspective.
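The local stream derivation described above maps directly onto NumPy's SeedSequence API. The following sketch reproduces the stated recipe; participant_seed is a hypothetical helper name, and the code only mirrors the formula given in the text rather than quoting whestbench internals.

```python
import numpy as np


def participant_seed(mlp_seed: int) -> int:
    """Derive the participant-facing mlp.seed from the stored input seed."""
    # spawn(3) yields the weight, sample, and estimator streams, in that order.
    weight_seq, sample_seq, estimator_seq = np.random.SeedSequence(mlp_seed).spawn(3)
    # generate_state(1) returns one 32-bit unsigned word by default.
    return int(estimator_seq.generate_state(1)[0])


derived = participant_seed(1001)
print(derived)
```

Because the derivation depends only on mlp_seed, anyone holding the parquet mlp_seed column can recompute the same three streams on any host.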
Building a 3.0 dataset
# Auto-generated seeds (recommended for production bakes):
whest dataset bake --n-mlps 10 --n-samples 1e7 --width 256 --depth 8 \
    --output ./my-eval

# Explicit seeds (for reproducible small datasets or tests):
echo '[1001,2002,3003,4004]' > my-seeds.json
whest dataset bake --n-mlps 4 --n-samples 100 --width 4 --depth 2 \
    --mlp-seeds my-seeds.json --output ./tiny-eval

# Explicit HF config coordinate for authoring config-per-split repos:
whest dataset bake --n-mlps 100 --n-samples 1e9 --width 256 --depth 8 \
    --split full --config full --output ./full
In Python:
from whestbench.dataset import create_dataset

# Auto-generated:
create_dataset(n_mlps=10, n_samples=1_000_000, width=256, depth=8,
               output_path="./my-eval")

# Explicit:
create_dataset(n_mlps=4, n_samples=100, width=4, depth=2,
               mlp_seeds=[1001, 2002, 3003, 4004],
               output_path="./tiny-eval")

# Explicit config coordinate:
create_dataset(n_mlps=100, n_samples=1_000_000_000, width=256, depth=8,
               split="full", config="full", output_path="./full")
Extracting seeds from a published dataset
import whestbench

ds = whestbench.load_dataset("aicrowd/arc-whestbench-2026", revision="v1", split="public")
md = whestbench.metadata(ds)
if md["seed_protocol"]["version"] == "3.0":
    seeds = ds["mlp_seed"]   # list of input seeds
    print(seeds)
--slice + seed_protocol 3.0
Under 3.0, all workers baking a given split must receive the same --mlp-seeds
JSON file. Each worker uses --slice K/N to select its subset of rows; it draws
the corresponding seeds from that shared file. Seeds for different splits must use
different JSON files to preserve cross-split independence.
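As a rough sketch of the shared-file workflow above, the snippet below writes one seed file per split and lets each worker read only the seeds for its own half-open range. It is not the implementation of whest dataset generate-seeds; write_seed_file and read_slice_seeds are hypothetical helpers, and the file names are illustrative.

```python
import json
import secrets
import tempfile
from pathlib import Path


def write_seed_file(path, n_mlps):
    """Write a JSON array of n_mlps random 63-bit seeds and return it."""
    seeds = [secrets.randbits(63) for _ in range(n_mlps)]
    Path(path).write_text(json.dumps(seeds))
    return seeds


def read_slice_seeds(path, start, end):
    """Return the seeds for global MLP indices [start, end) from the shared file."""
    return json.loads(Path(path).read_text())[start:end]


with tempfile.TemporaryDirectory() as tmp:
    # One file per split keeps the splits independent of each other.
    public_path = Path(tmp) / "public-seeds.json"
    holdout_path = Path(tmp) / "holdout-seeds.json"
    written = write_seed_file(public_path, 8)
    write_seed_file(holdout_path, 8)
    # Two workers baking the public split read the same file.
    worker_0 = read_slice_seeds(public_path, 0, 4)
    worker_1 = read_slice_seeds(public_path, 4, 8)

print(worker_0 + worker_1 == written)
```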
HuggingFace git tags (e.g. v1, v2) are content versions for a specific published
dataset. They are independent of the schema version — a dataset at tag v2 is still
schema 3.0.
Partial datasets and merging
Baking partials
--slice K/N divides a logical dataset of n_mlps into N equal slices and bakes
slice K (0-indexed). The output metadata is marked is_partial=true and includes
mlp_range=[start, end) and total_n_mlps.
# Generate once, share the same file with all workers.
whest dataset generate-seeds --n-mlps 1000 > seeds.json

# 4 workers each bake 250 of 1000 MLPs
whest dataset bake --slice 0/4 --n-mlps 1000 --mlp-seeds seeds.json ... --output ./p0
whest dataset bake --slice 1/4 --n-mlps 1000 --mlp-seeds seeds.json ... --output ./p1
whest dataset bake --slice 2/4 --n-mlps 1000 --mlp-seeds seeds.json ... --output ./p2
whest dataset bake --slice 3/4 --n-mlps 1000 --mlp-seeds seeds.json ... --output ./p3
--mlp-range START-END is the lower-level alternative. Both endpoints are inclusive
on the CLI, but the Python API uses half-open [start, end) intervals internally.
--slice 0/4 with n_mlps=1000 is equivalent to --mlp-range 0-249.
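The relationship between --slice, the half-open Python interval, and the inclusive CLI form can be sketched as follows. Both function names are hypothetical, and the sketch assumes N divides n_mlps evenly because the text only describes equal slices; how a remainder would be distributed is not covered here.

```python
def slice_to_range(k: int, n: int, n_mlps: int):
    """Return the half-open [start, end) range for slice k of n (0-indexed)."""
    if not 0 <= k < n:
        raise ValueError(f"slice index {k} out of range for {n} slices")
    if n_mlps % n != 0:
        # Assumption: only the evenly divisible case is handled in this sketch.
        raise ValueError("n_mlps must be divisible by the number of slices")
    size = n_mlps // n
    return k * size, (k + 1) * size


def to_cli_mlp_range(start: int, end: int) -> str:
    """Convert a half-open [start, end) range to the inclusive START-END CLI form."""
    return f"{start}-{end - 1}"


start, end = slice_to_range(0, 4, 1000)
print(to_cli_mlp_range(start, end))
```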
Merging
whest dataset merge validates all partials, checks for gap-free coverage of
[0, total_n_mlps), concatenates the Parquet files in order, and writes a new
complete dataset directory:
whest dataset merge ./p0 ./p1 ./p2 ./p3 --output ./final
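The gap-free coverage check that precedes concatenation can be illustrated with a short function over the partials' mlp_range values. This is a sketch of the idea, not the merge command's code; check_coverage is a hypothetical name.

```python
def check_coverage(ranges, total_n_mlps):
    """Validate half-open [start, end) ranges and return them in merge order."""
    ordered = sorted(tuple(r) for r in ranges)
    cursor = 0
    for start, end in ordered:
        if end <= start:
            raise ValueError(f"empty or inverted range [{start}, {end})")
        if start > cursor:
            raise ValueError(f"gap: no partial covers [{cursor}, {start})")
        if start < cursor:
            raise ValueError(f"overlap: [{start}, {end}) starts before {cursor}")
        cursor = end
    if cursor != total_n_mlps:
        raise ValueError(f"coverage stops at {cursor}, expected {total_n_mlps}")
    return ordered


# Partials may be passed in any order; the check sorts them first.
order = check_coverage([(500, 750), (0, 250), (750, 1000), (250, 500)], 1000)
print(order)
```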
Bit-equivalence property
The bit-equivalence guarantee means a worker baking --slice K/N produces rows
that are bitwise identical to the corresponding rows of a single-host bake with the
same --mlp-seeds file and --n-mlps. This holds because:

Under seed_protocol 3.0, each slot's input seed comes directly from the shared
--mlp-seeds JSON file. A worker baking slot i reads seeds[i] from that file
regardless of which slice it's assigned, so the derived weight/sample/estimator
streams are identical to a single-host bake.
MLP names are derived from the same per-MLP input seeds, so the name a slice
assigns to the MLP at global index i equals full_names[i] from a single-host bake.

Note: bit-equivalence is per-backend. The flopscope (CPU) and torch backends
use different RNG algorithms and produce statistically equivalent (not bitwise
identical) results at the same seed.
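The reasoning behind the bit-equivalence property can be shown with a toy stand-in for a bake. In the sketch below every row depends only on its own seed, so a slice necessarily reproduces the matching rows of a full run. The use of random.Random is purely illustrative and is not the RNG whestbench uses; toy_bake_row and toy_bake are hypothetical names.

```python
import random


def toy_bake_row(mlp_seed: int):
    """Toy stand-in for baking one MLP: the output depends only on mlp_seed."""
    rng = random.Random(mlp_seed)
    return [rng.random() for _ in range(3)]


def toy_bake(seeds, start, end):
    """Bake global indices [start, end) using the shared seed list."""
    return [toy_bake_row(seeds[i]) for i in range(start, end)]


shared_seeds = [1001, 2002, 3003, 4004]
full = toy_bake(shared_seeds, 0, 4)      # single-host bake
partial = toy_bake(shared_seeds, 2, 4)   # worker baking the second half
print(partial == full[2:4])
```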
Multi-split datasets
A dataset directory can contain multiple splits as sibling parquet files in data/, with a single metadata.json describing all of them via an optional splits: sub-dict.
On-disk layout
my-eval/
├── data/
│   ├── public-00000-of-00001.parquet
│   └── holdout-00000-of-00001.parquet
├── metadata.json
└── README.md
metadata.json shape
{
  "schema_version": "3.0",
  "format": "hf-datasets-parquet",
  "backend": "torch",
  "seed_protocol": {"name": "whestbench_explicit_per_mlp_seeds", "version": "3.0"},
  "n_samples": 1000000000,
  "width": 256,
  "depth": 8,
  "created_at_utc": "...",
  "hardware": {},
  "splits": {
    "public":  {"config": "default", "n_mlps": 50, "created_at_utc": "...", "hardware_fingerprints": []},
    "holdout": {"config": "holdout", "n_mlps": 50, "created_at_utc": "...", "hardware_fingerprints": []}
  },
  "default_split": "public"
}
Under seed_protocol 3.0 there is no per-split seed field; seeds are stored in
the parquet mlp_seed column for each split.
Field placement
| Field | Single-split | Multi-split |
|---|---|---|
| schema_version, format, seed_protocol | top-level | top-level |
| backend, width, depth, n_samples | top-level | top-level; must match across all splits (validated at combine time) |
| split, config | top-level optional coordinate for new bakes | per-split (splits.<name>.config) |
| n_mlps, seed | top-level | per-split (splits.<name>.{n_mlps,seed}); seed exists only under legacy seed_protocol 2.0 |
| created_at_utc | top-level | top-level (= earliest of splits) + optional per-split |
| hardware | top-level (bake host) | top-level (combine host) + per-split hardware_fingerprints for provenance |
| splits | absent | present |
| is_partial, mlp_range, total_n_mlps | present iff partial | not allowed (multi-split + partial is invalid) |
The discriminator is the presence of the splits field. No schema_version bump — the multi-split shape is a purely additive extension of schema 3.0.
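A reader of metadata.json can branch on the splits field as described. The sketch below shows one way to detect the multi-split shape and flatten a single split into a single-split-shaped dict. Both helpers are hypothetical, and the exact projection returned by whestbench.metadata(dsd, split=...) may differ from this approximation.

```python
def is_multi_split(md: dict) -> bool:
    """The presence of the splits field is the discriminator."""
    return "splits" in md


def project_split(md: dict, name: str) -> dict:
    """Approximate a single-split view of one split in multi-split metadata."""
    if not is_multi_split(md):
        return dict(md)
    shared = {k: v for k, v in md.items() if k not in ("splits", "default_split")}
    # Per-split values (config, n_mlps, ...) override the shared top-level ones.
    return {**shared, "split": name, **md["splits"][name]}


example = {
    "schema_version": "3.0",
    "backend": "torch",
    "width": 256,
    "depth": 8,
    "splits": {
        "public": {"config": "default", "n_mlps": 50},
        "holdout": {"config": "holdout", "n_mlps": 50},
    },
    "default_split": "public",
}
public_view = project_split(example, "public")
print(public_view["split"], public_view["config"], public_view["n_mlps"])
```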
Loading
from whestbench import load_dataset, metadata, iter_mlps

dsd = load_dataset("./my-eval")             # → DatasetDict
ds  = load_dataset("./my-eval", split="public")   # → Dataset

print(metadata(dsd)["splits"].keys())        # full multi-split metadata
print(metadata(dsd, split="public")["n_mlps"]) # single-split-shaped projection (no "seed" key under seed_protocol 3.0)

for mlp in iter_mlps(dsd["public"]):
    mlp.validate()
Building a multi-split dataset
Bake each split as a complete single-split dataset, then combine. Under seed_protocol
3.0, each split uses its own seeds JSON file:
# Generate independent seed files for each split.
whest dataset generate-seeds --n-mlps 50 > public-seeds.json
whest dataset generate-seeds --n-mlps 50 > holdout-seeds.json

whest dataset bake --n-mlps 50 --n-samples 1e9 --width 256 --depth 8 --split public  --config default --mlp-seeds public-seeds.json  --output ./pub
whest dataset bake --n-mlps 50 --n-samples 1e9 --width 256 --depth 8 --split holdout --config holdout --mlp-seeds holdout-seeds.json --output ./hold
whest dataset combine-splits ./pub ./hold --output ./eval-r1
whest dataset push ./eval-r1 --repo aicrowd/arc-whestbench-2026-evals --tag round-1 --private
combine-splits preserves the baked config coordinate. If exactly one input
declares config="default", the combined metadata records that split as
default_split, so whest run --dataset ... can keep a split-oriented UX.
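The default_split rule can be expressed in a few lines. This is a sketch of the stated rule only; pick_default_split is a hypothetical name, and returning None when zero or several splits declare config="default" is an assumption, since that case is not described here.

```python
def pick_default_split(split_configs: dict):
    """Return the split whose config is "default" if exactly one exists."""
    defaults = [name for name, cfg in split_configs.items() if cfg == "default"]
    # Assumption: no default_split is recorded in the ambiguous or empty case.
    return defaults[0] if len(defaults) == 1 else None


print(pick_default_split({"public": "default", "holdout": "holdout"}))
```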
The public / holdout naming convention
The contest's evaluation dataset uses split names public (visible-during-contest scores) and holdout (private/final-leaderboard scores). The dataset-card template special-cases these names with leaderboard-specific wording. Other names render generically. Tooling itself accepts any HF-Hub-compatible split name (regex [a-z][a-z0-9]*(-[a-z0-9]+)*).
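The split-name pattern quoted above can be tried out directly with Python's re module. The helper name is hypothetical; the pattern is copied from the text and applied with fullmatch so that the whole name must conform.

```python
import re

# Pattern quoted from the naming convention above.
SPLIT_NAME = re.compile(r"[a-z][a-z0-9]*(-[a-z0-9]+)*")


def is_valid_split_name(name: str) -> bool:
    """True if the entire name matches the split-name pattern."""
    return SPLIT_NAME.fullmatch(name) is not None


for candidate in ("public", "holdout", "round-1", "Public", "my_split", "a--b"):
    print(candidate, is_valid_split_name(candidate))
```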

--- reference/cli-reference ---
URL: https://aicrowd.github.io/whestbench/docs/reference/cli-reference

CLI Reference
Exact command syntax and key flags for all whest commands.
For the full per-command reference, see CLI.

When to use this page
Use this page for exact command syntax and key flags.
Environment variables

WHEST_SKIP_HARDWARE_FALLBACK_PROBES=1 — skip OS-native fallback probes when collecting run_meta.host or dataset metadata.hardware. Cheap fields and psutil-backed fields are still collected; fallback-backed fields may remain null.
HF_TOKEN — HuggingFace Hub authentication token. Used by whest dataset push, whest dataset pull, and whest run --dataset hf://... as a fallback when --token is not provided.

Commands
Participant workflow commands:

whest smoke-test
whest doctor
whest init
whest validate
whest run
whest dataset (bake / push / pull / merge / inspect)
whest package
whest profile-simulation
whest version

All JSON outputs include a top-level whestbench_version string for traceability.
whest version
Print installed whestbench version.
whest version [--format rich|plain|json] [--json]
JSON output is:
{
  "ok": true,
  "command": "version",
  "name": "whestbench",
  "version": "0.2.0",
  "whestbench_version": "0.2.0"
}
Examples:
whest version
whest version --json

Migration note: whest create-dataset is replaced by whest dataset bake. Running whest create-dataset prints a redirect and exits.

whest smoke-test
Run a built-in CombinedEstimator dashboard check and print next-step participant commands.
whest smoke-test [--detail raw|full] [--profile] [--show-diagnostic-plots] [--format rich|plain|json] [--debug]

--format rich|plain|json — choose styled terminal output, plain log-friendly output, or JSON. Defaults to rich on TTYs and plain otherwise. Under a debugger, smoke-test automatically forces plain if rich was requested.

whest doctor
Run install and environment health checks. Prints a pass/fail list for Python version, uv/Node.js availability, BLAS thread pool, disk space, and working-directory writability. Useful for first-hour setup troubleshooting and for CI gates.
whest doctor [--format rich|plain|json] [--json] [--strict] [--debug]
Key options:

--format rich|plain|json — choose styled terminal output, plain log-friendly output ([OK]/[WARN]/[FAIL] tokens, no box-drawing), or JSON (schema_version, checks, counts, overall). Defaults to rich on TTYs and plain otherwise.
--json — alias for --format json.
--strict — treat warnings as failures for exit-code purposes. Rendering is unchanged.
--debug — re-raise exceptions from crashing checks instead of capturing them as fail.

Severity model

ok — the check passed.
warn — the check found something worth knowing but not blocking. Examples: uv missing (safe to ignore if you installed via pip), less than 1 GiB free disk in the current directory.
fail — the check found a genuine blocker. Examples: Python version below requires-python, threadpoolctl failed to import, cannot write to the working directory.

Exit codes

Default: 0 if all checks are ok or warn; 1 if any fail.
--strict: 0 only if all checks are ok; 1 otherwise.
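The severity model and exit-code rules above reduce to a small decision function. The sketch below restates those rules; doctor_exit_code is a hypothetical name and the input is simply a list of per-check severities.

```python
def doctor_exit_code(statuses, strict: bool = False) -> int:
    """Map per-check severities ("ok", "warn", "fail") to a process exit code."""
    if any(status == "fail" for status in statuses):
        return 1
    if strict and any(status == "warn" for status in statuses):
        return 1
    return 0


checks = ["ok", "warn", "ok"]
print(doctor_exit_code(checks), doctor_exit_code(checks, strict=True))
```

In a CI gate this corresponds to running whest doctor --strict and failing the job on any non-zero exit code.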

Example
# Interactive first-hour check
whest doctor

# CI pre-flight (treat anything that isn't OK as a failure)
whest doctor --strict --json
whest init
Create starter files in a target directory.
whest init [path] [--format rich|plain|json] [--json] [--debug]
whest validate
Validate estimator loading and output contract.
whest validate --estimator <path> [--class <name>] [--format rich|plain|json] [--json] [--debug]
whest run
Run local scoring with a participant estimator.
whest run --estimator <path> [options]
Default behavior: whest run --estimator <path> is equivalent to --runner local.
Key options:

--class <name> — estimator class name (if the module exports more than one).
--runner local|subprocess|server|inprocess
--n-mlps <int> — number of MLPs to evaluate. Default: 10 without --dataset; full dataset size with --dataset. Clamped to dataset size when --dataset is set.
--flop-budget <int> — cap on effective compute C_m = F_m + λ·R_m per MLP. Default: 68_000_000_000 (6.8e10). Always honored; any flop_budget stored in --dataset's metadata is ignored.
--wall-time-limit <seconds> (default: 60.0) — wall-clock limit per predict() call; forwarded to the estimator BudgetContext. Operational backstop matching the Phase 1 grader cap; the primary compute constraint is --flop-budget.
--residual-wall-time-limit <seconds> — limit for non-flopscope time per predict() call, enforced by WhestBench after timing is reported.
--detail raw|full
--seed <int> — random seed for the run.

Without --dataset: seeds both MLP generation and estimator setup (ctx.seed).
With --dataset: MLP seeds come from the dataset; this flag seeds estimator setup (ctx.seed) only.
Default: omitted (ctx.seed defaults to 0; run_config.seed is null in the JSON output).
See estimator-contract for the ctx.seed reproducibility contract.

--profile
--show-diagnostic-plots
--format rich|plain|json — choose styled terminal output, plain log-friendly output, or JSON. Defaults to rich on TTYs and plain otherwise.
--json — alias for --format json.
--dataset <path> — dataset source. Accepts:

Local directory: ./my-eval or /abs/path/my-eval
HF Hub with inline revision: hf://owner/repo@v1 or hf://aicrowd/arc-whestbench-2026@v1
HF Hub with --revision flag: aicrowd/arc-whestbench-2026 --revision v1
Bare owner/repo without --revision is rejected (revision must be explicit).

--revision <tag> — HF Hub git tag or commit SHA for --dataset. Ignored for local paths.
--n-samples <int> — ground truth samples per MLP when generating on-the-fly (without --dataset). Default: width*width*256.
--debug — include estimator tracebacks in the report's "Estimator Errors" panel.
--fail-fast — stop on the first estimator error and let the raw Python traceback propagate. Combine with --debug to show it.
--max-threads <N> — limit BLAS to at most N CPU threads.
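The accepted --dataset forms listed above can be summarized in a small resolver. This is a sketch of the stated rules, not the CLI's parser: resolve_dataset is a hypothetical name, treating arguments that start with "." or "/" as local paths is an assumption, and so is letting --revision fill in for an hf:// URL that lacks an inline revision.

```python
def resolve_dataset(dataset: str, revision=None) -> dict:
    """Classify a --dataset argument as a local path or a pinned Hub revision."""
    if dataset.startswith("hf://"):
        repo, _, inline = dataset[len("hf://"):].partition("@")
        pinned = inline or revision
        if not pinned:
            raise ValueError("revision must be explicit")
        return {"kind": "hub", "repo": repo, "revision": pinned}
    # Assumption: relative (./...) and absolute (/...) paths are local directories.
    if dataset.startswith((".", "/")):
        return {"kind": "local", "path": dataset}
    if revision is None:
        raise ValueError("bare owner/repo without --revision is rejected")
    return {"kind": "hub", "repo": dataset, "revision": revision}


print(resolve_dataset("hf://aicrowd/arc-whestbench-2026@v1"))
print(resolve_dataset("aicrowd/arc-whestbench-2026", revision="v1"))
print(resolve_dataset("./my-eval"))
```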

Recommended debug sequence:
whest run --estimator ./path/to/estimator.py
whest run --estimator ./path/to/estimator.py --debug
whest run --estimator ./path/to/estimator.py --debug --fail-fast
whest run --estimator ./path/to/estimator.py --runner local --format plain   # for pdb.set_trace() / breakpoint()
Using a pre-baked dataset
# Local directory (schema 3.0)
whest run --estimator ./estimator.py --dataset ./my-eval

# HF Hub with inline revision (preferred)
whest run --estimator ./estimator.py --dataset hf://aicrowd/arc-whestbench-2026@v1

# HF Hub with separate --revision flag
whest run --estimator ./estimator.py \
    --dataset aicrowd/arc-whestbench-2026 \
    --revision v1
Exit codes

0 — scoring completed; no estimator errors (budget or time exhaustion still exits 0).
1 — at least one MLP raised during predict, or setup/runtime failure.

Runner mode tradeoff:

local (default): in-process execution with better traceback fidelity while debugging. Required for interactive debuggers (pdb, breakpoint()).
subprocess: isolated execution in a separate process via the subprocess runner.
server: legacy alias for subprocess.
inprocess: alias for local.

whest dataset
Dataset management commands. All subcommands share the whest dataset <sub> prefix.
whest dataset {bake,push,pull,merge,inspect} ...
whest dataset bake
Bake a new evaluation dataset to a local directory.
whest dataset bake \
    --n-mlps N --n-samples N --width W --depth D \
    [--split SPLIT] [--config CONFIG] \
    --output DIR \
    [--torch] [--device auto|cuda|mps|cpu] \
    [--mlps-per-batch N] [--chunk-size N] \
    [--slice K/N | --mlp-range START-END]
Required options:

--n-mlps <int> — total number of MLPs in the logical dataset.
--n-samples <int> — ground-truth samples per MLP. Larger values give lower-noise ground truth. Default for on-the-fly runs is width*width*256 (~16.7M for 256-wide).
--width <int> — neuron count per layer.
--depth <int> — number of weight matrices per MLP.
--output <dir> — output directory (must not exist).

Key optional options:

--split <name> — dataset split name. Default: public.
--config <name> — HF dataset config name for this split. Default: default. Use this for authoring config-per-split datasets such as default/mini + full/full or default/public + holdout/holdout.
--torch — use the GPU/torch backend (requires pip install whestbench[gpu]). See GPU Dataset Generation.
--device auto|cuda|mps|cpu — device when --torch is active. auto resolves cuda > mps > cpu.
--mlps-per-batch <int> — torch backend: MLPs processed in parallel on device.
--chunk-size <int> — torch backend: samples per chunk per step.
--slice K/N — bake only the K-th slice of N total slices (0-indexed). Produces a partial dataset. Combine with whest dataset merge to assemble the full dataset. Example: --slice 0/4 for the first of four workers.
--mlp-range START-END — bake only MLP indices [START, END] inclusive (both ends). Alternative to --slice for irregular splits.

Bit-equivalence guarantee: a worker baking --slice K/N produces rows that are bitwise identical to the corresponding rows of a single-host bake with the same --mlp-seeds file and --n-mlps.
Output is a directory with:
<output>/
├── data/<split>-00000-of-00001.parquet
├── metadata.json
└── README.md
Example
# Full bake (10 MLPs, 10M samples each)
whest dataset bake \
    --n-mlps 10 --n-samples 10_000_000 \
    --width 256 --depth 8 \
    --output ./my-eval

# Partial bake (slice 0 of 4)
whest dataset bake \
    --n-mlps 100 --n-samples 1_000_000_000 \
    --width 256 --depth 8 \
    --slice 0/4 \
    --output ./partial-0

# GPU bake
whest dataset bake \
    --n-mlps 100 --n-samples 1_000_000_000 \
    --width 256 --depth 8 \
    --torch --device auto \
    --output ./gpu-eval
whest dataset inspect
Print metadata from a local directory or a HF Hub repo.
whest dataset inspect <DIR_OR_REPO_ID> [--revision REV]
Arguments:

DIR_OR_REPO_ID — local dataset directory, or HF Hub repo id (e.g. aicrowd/arc-whestbench-2026).
--revision <tag> — HF Hub git tag or commit SHA (for remote repos).

Example
# Local
whest dataset inspect ./my-eval

# Remote
whest dataset inspect aicrowd/arc-whestbench-2026 --revision v1
Output prints key metadata fields: schema_version, format, backend, split, config, n_mlps, n_samples, width, depth, created_at_utc, and device provenance for torch bakes. Multi-split datasets print each split's config when present.
whest dataset push
Upload a baked dataset directory to HuggingFace Hub. Requires HF_TOKEN set in the environment or --token.
whest dataset push <LOCAL_DIR> \
    --repo REPO_ID \
    [--tag TAG] \
    [--private] \
    [--token TOKEN] \
    [--message MSG]
Arguments:

LOCAL_DIR — local directory produced by whest dataset bake or whest dataset merge.
--repo <repo_id> — HF Hub repo id, e.g. aicrowd/arc-whestbench-2026.
--tag <tag> — optional git tag to create on the uploaded commit (e.g. v1). Recommended for versioning.
--private — create the repo as private if it doesn't exist yet.
--token <token> — HF Hub write token. Falls back to HF_TOKEN env var, then the huggingface-cli login cache.
--message <msg> — commit message for the HF Hub upload.

Example
# Publish with a version tag
whest dataset push ./my-eval \
    --repo aicrowd/arc-whestbench-2026 \
    --tag v1 \
    --message "Bake: 10 MLPs, seed=42"

# Private repo
whest dataset push ./my-eval \
    --repo aicrowd/arc-whestbench-2026-holdout \
    --tag v1 \
    --private
whest dataset pull
Download a dataset from HuggingFace Hub to a local directory.
whest dataset pull <REPO_ID> \
    [--revision REV] \
    --output DIR \
    [--token TOKEN]
Arguments:

REPO_ID — HF Hub repo id (e.g. aicrowd/arc-whestbench-2026).
--revision <tag> — HF Hub git tag or commit SHA. Default: main.
--output <dir> — local destination directory.
--token <token> — HF Hub token for private repos. Falls back to HF_TOKEN env var.

Example
whest dataset pull aicrowd/arc-whestbench-2026 \
    --revision v1 \
    --output ./eval-v1
whest dataset merge
Merge partial bakes (produced with --slice or --mlp-range) into a single canonical dataset.
whest dataset merge <DIR> [<DIR>...] --output <DIR>
Arguments:

<DIR>... — two or more partial dataset directories.
--output <dir> — destination for the merged dataset (must not exist).

All partial datasets must share the same --mlp-seeds file, --n-mlps, --n-samples, --width, --depth, and backend (flopscope or torch). Their mlp_range values must together cover [0, total_n_mlps) exactly once (no gaps, no overlaps).
The merged result is bit-equivalent to a single-host bake with the same parameters.
Example
# After baking 4 slices on separate workers:
whest dataset merge \
    ./partial-0 ./partial-1 ./partial-2 ./partial-3 \
    --output ./final-eval
End-to-end example (bake → inspect → push → pull → run)
# 1. Bake
whest dataset bake \
    --n-mlps 10 --n-samples 10_000_000 \
    --width 256 --depth 8 \
    --output ./my-eval

# 2. Inspect locally
whest dataset inspect ./my-eval

# 3. Publish
export HF_TOKEN=hf_...
whest dataset push ./my-eval \
    --repo aicrowd/arc-whestbench-2026 \
    --tag v1

# 4. Pull on another machine
whest dataset pull aicrowd/arc-whestbench-2026 \
    --revision v1 --output ./local-copy

# 5. Run evaluation
whest run --estimator ./estimator.py \
    --dataset hf://aicrowd/arc-whestbench-2026@v1
whest package
Build a submission artifact.
whest package --estimator <path> [options]
Key options:

--class <name>
--requirements <path>
--submission-metadata <path>
--approach <path>
--output <path>
--format rich|plain|json
--json — alias for --format json
--debug

whest profile-simulation
Profile flopscope FLOP accounting and analytical correctness across a grid of network sizes and FLOP budgets.
whest profile-simulation [--preset super-quick|quick|standard|exhaustive]
                          [--output <path>]
                          [--format rich|plain|json]
                          [--json]
                          [--verbose]
                          [--debug]
Key options:

--preset <name> (default: standard) — parameter sweep size:

super-quick — 1 width (256), 1 depth (4), 10 000 samples. Sub-second, for testing the debug loop.
quick — 1 width (256), 2 depths (4, 128), 2 sample counts (10 000, 100 000). Finishes in seconds.
standard — 2 widths (64, 256), 3 depths (4, 32, 128), 2 sample counts (10 000, 100 000). Under a minute.
exhaustive — 2 widths (64, 256), 3 depths (4, 32, 128), 3 sample counts (10 000, 100 000, 1 000 000). Thorough but slow.

--output <path> — save a JSON report with correctness results and FLOP accounting data.
--format rich|plain|json — choose styled terminal output, plain log-friendly output, or JSON. Defaults to rich on TTYs and plain otherwise.
--json — alias for --format json.
--debug — show full tracebacks on errors.
--verbose — show full tables with all columns and raw data.

Example workflows:
# Quick correctness check
whest profile-simulation --preset quick

# Full profile with JSON export
whest profile-simulation --preset exhaustive --output profile_results.json
Next step

Dataset Format — schema 3.0 specification
Score Report Fields
GPU Dataset Generation
Inspect and Traverse MLP Structure (in the starter kit)
Validate, Run, and Package (in the starter kit)

--- reference/code-patterns ---
URL: https://aicrowd.github.io/whestbench/docs/reference/code-patterns

Code Patterns
Quick reference for flopscope operations, including operators, FLOP costs, and common patterns for mean and variance propagation. All examples assume import flopscope as flops and import flopscope.numpy as fnp.
Operators are tracked
Python arithmetic operators (+, -, *, /, @) on fnp.ndarray values are
FLOP-tracked — you do not need to use the verbose fnp.add, fnp.multiply, etc. forms.
import flopscope as flops
import flopscope.numpy as fnp

a = fnp.ones(4)
b = fnp.ones(4)

# These are all equivalent and all tracked:
c = a + b           # tracked: same as fnp.add(a, b)
d = a * b           # tracked: same as fnp.multiply(a, b)
e = a / b           # tracked: same as fnp.divide(a, b)

W = fnp.eye(4)
v = fnp.ones(4)
f = W @ v           # tracked: same as fnp.matmul(W, v)
g = W.T @ v         # tracked: transpose is free, matmul is tracked
h = W.T @ W @ v     # tracked: two matmuls, chained with @
Use operators whenever they improve readability. The verbose fnp.* forms are still
available but are no longer required for tracking purposes.
Operation costs
What you want         | Code                    | FLOP cost     | Notes
----------------------|-------------------------|---------------|------------------
Create zeros          | fnp.zeros((n, n))       | 0             | Free
Create ones           | fnp.ones(n)             | 0             | Free
Identity matrix       | fnp.eye(n)              | 0             | Free
Wrap existing data    | fnp.array(data)         | 0             | Free
Matrix multiply       | fnp.matmul(A, B)        | O(m x n x k)  | Dominates budgets
Element-wise add      | fnp.add(a, b)           | 1 per element |
Element-wise multiply | fnp.multiply(a, b)      | 1 per element |
Element-wise divide   | fnp.divide(a, b)        | 1 per element |
ReLU                  | fnp.maximum(x, 0.0)     | 1 per element |
Square root           | fnp.sqrt(x)             | 1 per element |
Exponential           | fnp.exp(x)              | 1 per element |
Logarithm             | fnp.log(x)              | 1 per element |
Transpose             | fnp.transpose(W)        | 0             | Free
Reshape               | fnp.reshape(x, shape)   | 0             | Free
Extract diagonal      | fnp.diag(M)             | 0             | Free
Set diagonal          | fnp.fill_diagonal(M, v) | 0             | Free, in-place
Outer product         | fnp.outer(a, b)         | n x m         |
Sum                   | fnp.sum(x, axis=0)      | input size    |
Mean                  | fnp.mean(x, axis=0)     | input size    |
Max                   | fnp.max(x)              | input size    |
Stack arrays          | fnp.stack(rows, axis=0) | 0             | Free
Concatenate           | fnp.concatenate([a, b]) | 0             | Free
Index/slice           | x[0], x[:, 3]           | 0             | Free
Common patterns
Standard normal PDF and CDF (built-in)
flopscope provides built-in PDF and CDF functions that are FLOP-tracked:
import flopscope as flops
import flopscope.numpy as fnp

phi = flops.stats.norm.pdf(x)   # standard normal PDF
Phi = flops.stats.norm.cdf(x)   # standard normal CDF
These are the recommended approach — all example estimators use them. The manual implementations below are shown for reference.
Standard normal PDF (for ReLU expectation)
import flopscope as flops
import flopscope.numpy as fnp

def norm_pdf(x):
    """phi(x) = exp(-x^2/2) / sqrt(2*pi)"""
    return fnp.exp(-0.5 * x * x) / fnp.sqrt(2.0 * fnp.pi)
Standard normal CDF
Pure flopscope implementation using the Abramowitz & Stegun approximation (accurate to <7.5e-8):
import flopscope as flops
import flopscope.numpy as fnp

_P = 0.2316419
_A1, _A2, _A3 = 0.319381530, -0.356563782, 1.781477937
_A4, _A5 = -1.821255978, 1.330274429

def norm_cdf(x):
    t = 1.0 / (1.0 + _P * fnp.abs(x))
    poly = ((((_A5 * t + _A4) * t + _A3) * t + _A2) * t + _A1) * t
    pdf = fnp.exp(-0.5 * x * x) / fnp.sqrt(2.0 * fnp.pi)
    cdf = 1.0 - pdf * poly
    return fnp.where(x >= 0, cdf, 1.0 - cdf)
Alternatively, if you add scipy to your requirements.txt:
# Optional: requires scipy as a user-provided dependency
from scipy.special import ndtr

def norm_cdf(x):
    return fnp.array(ndtr(fnp.asarray(x, dtype=fnp.float64)).astype(fnp.float32))
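To sanity-check the stated accuracy offline, the sketch below re-implements the same polynomial with the Python standard library only (so nothing here is FLOP-tracked) and compares it against an erf-based reference. The names approx_norm_cdf and exact_norm_cdf are illustrative and are not part of flopscope.

```python
import math

# Abramowitz & Stegun coefficients, same values as in the flopscope version.
P = 0.2316419
A = (0.319381530, -0.356563782, 1.781477937, -1.821255978, 1.330274429)


def approx_norm_cdf(x: float) -> float:
    """Polynomial approximation of the standard normal CDF."""
    t = 1.0 / (1.0 + P * abs(x))
    # Horner evaluation of a1*t + a2*t^2 + ... + a5*t^5.
    poly = 0.0
    for coeff in reversed(A):
        poly = (poly + coeff) * t
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    cdf = 1.0 - pdf * poly               # value at |x|
    return cdf if x >= 0 else 1.0 - cdf  # use symmetry for x < 0


def exact_norm_cdf(x: float) -> float:
    """Reference CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


# Scan x in [-6, 6] and record the worst absolute error.
grid = [i / 100.0 for i in range(-600, 601)]
max_abs_error = max(abs(approx_norm_cdf(x) - exact_norm_cdf(x)) for x in grid)
print(f"max abs error on [-6, 6]: {max_abs_error:.2e}")
```

The check only covers the scalar formula; the flopscope version applies the same arithmetic element-wise.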
ReLU expectation (E[max(0, z)] where z ~ N(mu, sigma^2))
import flopscope as flops
import flopscope.numpy as fnp

alpha = mu_pre / sigma_pre
E_relu = mu_pre * norm_cdf(alpha) + sigma_pre * norm_pdf(alpha)
See 02_mean_propagation.py (in the starter kit) for a complete worked example using these patterns.
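The closed form above can be checked against sampling. The following is a minimal standard-library sketch (no flopscope, so no FLOP tracking); relu_mean and relu_mean_monte_carlo are illustrative names, and the formula requires sigma > 0.

```python
import math
import random


def norm_pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def norm_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu_mean(mu: float, sigma: float) -> float:
    """Closed-form E[max(0, z)] for z ~ N(mu, sigma^2), sigma > 0."""
    alpha = mu / sigma
    return mu * norm_cdf(alpha) + sigma * norm_pdf(alpha)


def relu_mean_monte_carlo(mu, sigma, n_samples=200_000, seed=0):
    """Sampling estimate of the same expectation."""
    rng = random.Random(seed)
    total = sum(max(0.0, rng.gauss(mu, sigma)) for _ in range(n_samples))
    return total / n_samples


closed = relu_mean(0.5, 2.0)
sampled = relu_mean_monte_carlo(0.5, 2.0)
print(f"closed form: {closed:.4f}, Monte Carlo: {sampled:.4f}")
```

The two numbers should agree to within sampling noise. For mu = 0 and sigma = 1 the closed form reduces to phi(0) = 1/sqrt(2*pi).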
Per-neuron variance propagation (diagonal)
import flopscope as flops
import flopscope.numpy as fnp

# var_pre[i] = sum_j w[j, i]^2 * var[j]   (w has shape (n_in, n_out))
var_pre = (w * w).T @ var
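To make the index convention concrete, here is the same sum written out with plain Python lists (illustrative only, not FLOP-tracked). It treats the inputs as uncorrelated, which is what "diagonal" means here: covariance terms between different inputs are ignored.

```python
def propagate_variance(W, var):
    """var_pre[i] = sum_j W[j][i]**2 * var[j] (diagonal approximation)."""
    n_in, n_out = len(W), len(W[0])
    return [
        sum(W[j][i] ** 2 * var[j] for j in range(n_in))
        for i in range(n_out)
    ]


W = [[1.0, 2.0],
     [3.0, 4.0]]   # W[j][i]: weight from input j to output i
var = [0.5, 2.0]   # per-input variances

var_pre = propagate_variance(W, var)
print(var_pre)  # [18.5, 34.0]
```

Output 0 collects 1^2 * 0.5 + 3^2 * 2.0 = 18.5, and output 1 collects 2^2 * 0.5 + 4^2 * 2.0 = 34.0, matching column-wise sums of squared weights.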
Next step

Estimator Contract
Manage Your FLOP Budget (in the starter kit)
Algorithm Ideas (in the starter kit)

--- reference/flopscope-primer ---
URL: https://aicrowd.github.io/whestbench/docs/reference/flopscope-primer

Flopscope Primer
Flopscope is a numpy-compatible array library that tracks FLOPs analytically rather than timing them on hardware. Every arithmetic operation on an fnp.ndarray increments a FLOP counter instead of (or in addition to) performing the computation. This is how WhestBench enforces fair FLOP budgets across different machines.
Source: github.com/AIcrowd/flopscope
BudgetContext
All estimator predictions run inside a BudgetContext. When the budget is exhausted, a BudgetExhaustedError is raised and your predictions are zeroed out.
import flopscope as flops
import flopscope.numpy as fnp

with flops.BudgetContext(flop_budget=1_000_000) as ctx:
    x = fnp.ones((100, 100))
    y = x @ fnp.eye(100)  # matmul: (100, 100) @ (100, 100) = 100 * 100 * 100 = 1M FLOPs
    # BudgetExhaustedError raised here if budget exceeded
You don't need to create BudgetContext yourself — the framework does it before calling your predict() method. The budget argument tells you how many FLOPs you have.
BudgetContext also supports wall_time_limit_s when you want a cooperative
wall-clock limit in addition to the FLOP cap:
with flops.BudgetContext(flop_budget=1_000_000, wall_time_limit_s=2.0) as ctx:
    ...
The timer starts when the context is entered and is checked before and after
each counted flopscope/NumPy call. If it is exceeded, flopscope raises
TimeExhaustedError.
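The sketch below illustrates what a cooperative limit implies, assuming "cooperative" means the check runs only around calls and cannot interrupt one that is already in progress. CooperativeTimer and TimeLimitExceeded are hypothetical stand-ins written for this example; they are not flopscope classes, and the real implementation may differ.

```python
import time


class TimeLimitExceeded(RuntimeError):
    """Hypothetical stand-in for flopscope's TimeExhaustedError."""


class CooperativeTimer:
    def __init__(self, limit_s, clock=time.perf_counter):
        self._clock = clock
        self._deadline = clock() + limit_s  # timer starts on creation

    def check(self):
        if self._clock() > self._deadline:
            raise TimeLimitExceeded("wall-time limit exceeded")

    def call(self, fn, *args):
        self.check()        # checked before the call
        result = fn(*args)  # a call in progress is not interrupted
        self.check()        # checked again after the call
        return result


# Deterministic demo with a fake clock instead of real sleeping.
fake_now = [0.0]


def slow_op():
    fake_now[0] += 5.0      # pretend this call took 5 seconds
    return "done"


timer = CooperativeTimer(limit_s=2.0, clock=lambda: fake_now[0])
try:
    timer.call(slow_op)
    outcome = "finished"
except TimeLimitExceeded:
    outcome = "time exhausted after the call returned"
print(outcome)
```

Under this model a single long-running call can overshoot the limit, because the error is raised only at the next check.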
Operation FLOP Costs
Category | Operations | Cost
---------|------------|-----
Free (0 FLOPs) | fnp.array, fnp.zeros, fnp.ones, fnp.eye, fnp.asarray, fnp.reshape, .T, indexing, fnp.stack, fnp.concatenate, .copy(), .astype() | 0
Pointwise (1 FLOP/element) | +, -, *, /, fnp.exp, fnp.sqrt, fnp.abs, fnp.maximum, fnp.where, fnp.log, comparisons | N elements
Reductions (input size) | fnp.sum, fnp.mean, fnp.var, fnp.max, fnp.min, fnp.all, fnp.any | N elements
Matmul | @, fnp.matmul | M * N * K for (M,N) @ (N,K)
Key insight: Matmul dominates. A single (100, 100) @ (100, 100) costs 1M FLOPs. A pointwise exp on 100 elements costs 100 FLOPs.
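For back-of-the-envelope planning, the cost rules above can be written as two small helpers. This is a plain-Python sketch of the arithmetic only; matmul_flops and pointwise_flops are illustrative names, and the vector case assumes a length-n vector is counted as a single (1, n) row.

```python
def matmul_flops(m: int, n: int, k: int) -> int:
    """Cost of (m, n) @ (n, k) under the M * N * K rule."""
    return m * n * k


def pointwise_flops(n_elements: int) -> int:
    """Cost of one pointwise op (exp, sqrt, +, ...) over n elements."""
    return n_elements


print(matmul_flops(100, 100, 100))  # 1000000
print(pointwise_flops(100))         # 100
# Assumed vector case: (1, 100) @ (100, 100) is 100x cheaper.
print(matmul_flops(1, 100, 100))    # 10000
```

Estimating costs this way before writing an estimator makes it easier to see which matmuls to avoid or shrink.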
Array Creation
import flopscope as flops
import flopscope.numpy as fnp

x = fnp.zeros(100)                          # 1D zeros
X = fnp.zeros((64, 100), dtype=fnp.float32)  # 2D zeros, explicit dtype
I = fnp.eye(100, dtype=fnp.float32)          # identity matrix
a = fnp.array([1.0, 2.0, 3.0])             # from list
b = fnp.asarray(numpy_array)                # convert from numpy (free)
All array creation is free (0 FLOPs).
Random Number Generation
import flopscope as flops
import flopscope.numpy as fnp

rng = fnp.random.default_rng(42)            # seeded RNG
x = rng.standard_normal((1000, 64))        # Gaussian samples
x = x.astype(fnp.float32)                   # cast to float32 (free)
Random generation itself is free. FLOPs are counted when you operate on the arrays.
Budget Inspection
Use ctx.summary() for the current explicit context and
fnp.budget_summary() for the accumulated session/global view:
with flops.BudgetContext(flop_budget=10_000_000) as ctx:
    # ... your computations ...
    print(ctx.summary())        # current context only
    print(fnp.budget_summary())  # process/session-wide summary
    print(ctx.flops_used)       # integer FLOP count
Both summaries also include four timing fields that satisfy a strict
decomposition identity, wall_time_s = flopscope_backend_time_s + flopscope_overhead_time_s + residual_wall_time_s:

wall_time_s: total elapsed time in the context
flopscope_backend_time_s: time spent inside counted flopscope numpy kernels
flopscope_overhead_time_s: time spent inside flopscope's own dispatch (wrapper preambles, FLOP bookkeeping, namespace push/pop)
residual_wall_time_s: everything else - participant Python, GC, uninstrumented numpy

This decomposition lets you see whether time is going to numpy compute, framework dispatch, or your own Python.
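As an illustration of the identity, the helper below turns the three measured quantities into shares of wall time. It takes plain floats because the return type of the summaries is not described here; time_shares is a hypothetical helper, not a flopscope function.

```python
def time_shares(wall_time_s, backend_time_s, overhead_time_s):
    """Split wall time into backend, overhead, and residual shares."""
    if wall_time_s <= 0:
        raise ValueError("wall_time_s must be positive")
    # Residual is whatever the two flopscope buckets do not explain.
    residual_s = wall_time_s - backend_time_s - overhead_time_s
    return {
        "residual_wall_time_s": residual_s,
        "backend_share": backend_time_s / wall_time_s,
        "overhead_share": overhead_time_s / wall_time_s,
        "residual_share": residual_s / wall_time_s,
    }


shares = time_shares(wall_time_s=2.0, backend_time_s=1.25, overhead_time_s=0.25)
print(shares)
```

In this made-up example, 62.5% of the time is numpy compute, 12.5% is dispatch, and the remaining 25% is residual time.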
WhestBench-specific limits
Flopscope's BudgetContext measures wall_time_s, flopscope_backend_time_s,
flopscope_overhead_time_s, and residual_wall_time_s. It also accepts
wall_time_limit_s, which it checks while counted flopscope operations run.
WhestBench exposes some of those concepts as run-level CLI knobs:

--wall-time-limit: passed through to the estimator's BudgetContext
--residual-wall-time-limit: enforced by WhestBench after predict() returns,
using the reported residual_wall_time_s. Because residual_wall_time_s no longer
includes flopscope's own dispatch time, this gate measures only your
Python work — not the framework's bookkeeping tax.

So if you see time_exhausted, that came from Flopscope's wall_time_limit_s.
If you see residual_wall_time_exhausted, that came from WhestBench scoring
logic comparing Flopscope's measured residual_wall_time_s with the configured
--residual-wall-time-limit.
Residual wall-time charging (lambda)
WhestBench's effective compute budget combines analytical FLOPs and residual wall time
via a conversion rate λ (LAMBDA_FLOPS_PER_SECOND in whestbench.scoring):
C_m = F_m + λ · R_m

F_m = analytical FLOPs counted by flopscope (flops_used)
R_m = residual wall time — the third bucket of the time decomposition. Specifically,
residual_wall_time_s = wall_time_s − flopscope_backend_time_s − flopscope_overhead_time_s.
This is participant Python (loops, control flow), GC pauses, and uninstrumented numpy.
It explicitly excludes flopscope's own dispatch overhead (the second bucket).
λ = 1e11 FLOPs/second. This rate is fixed for the initial competition round.

The combined C_m is capped at B_m = flop_budget. If C_m > B_m, the MLP is marked
combined_budget_exhausted and the prediction is replaced with zeros.
Why charge non-flopscope time at all? It lets participants use any Python they like —
not just flopscope-instrumented operations — but holds them accountable for that work
in the compute budget. Pure-flopscope solutions get the entire budget for analytical
work; pure-Python solutions trade some FLOP headroom for residual time.
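A small worked example of the charging rule, using the λ value quoted above. The helper names are illustrative and are not the whestbench.scoring API.

```python
LAMBDA_FLOPS_PER_SECOND = 1e11  # value quoted above for the initial round


def combined_compute(flops_used, residual_wall_time_s,
                     lam=LAMBDA_FLOPS_PER_SECOND):
    """C_m = F_m + lambda * R_m."""
    return flops_used + lam * residual_wall_time_s


def is_combined_budget_exhausted(flops_used, residual_wall_time_s, flop_budget):
    """True when C_m exceeds B_m = flop_budget."""
    return combined_compute(flops_used, residual_wall_time_s) > flop_budget


budget = 1e9
# 6e8 analytical FLOPs plus 2 ms of residual time: about 8e8, within budget.
ok_case = is_combined_budget_exhausted(6e8, 0.002, budget)
# Same FLOPs plus 5 ms of residual time: about 1.1e9, over budget.
over_case = is_combined_budget_exhausted(6e8, 0.005, budget)
print(ok_case, over_case)
```

At this rate each millisecond of residual wall time costs as much as 1e8 analytical FLOPs, so Python-level loops can consume a budget quickly.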
Common Gotchas
numpy arrays still count FLOPs. Since fnp.ndarray is backed by numpy, a raw numpy array passed to flopscope operations will still be tracked. Use fnp.array() or fnp.asarray() to convert explicitly.
Pythonic operators are tracked. x @ w counts the same FLOPs as fnp.matmul(x, w). Use whichever reads better.
dtype matters for precision, not FLOPs. float32 and float64 operations cost the same FLOPs. Use float32 for memory efficiency and float64 for numerical stability where needed.
Testing
Use flopscope's testing utilities:
import flopscope as flops
import flopscope.numpy as fnp

fnp.testing.assert_allclose(actual, expected, atol=1e-6)
fnp.testing.assert_array_equal(actual, expected)
These work like numpy's testing functions but on flopscope arrays.

============================================================
CLI
============================================================

--- cli ---
URL: https://aicrowd.github.io/whestbench/docs/cli

CLI Reference
Autogenerated reference for the whest command-line interface.
Generated from the whest argparse definition.

whest smoke-test — Run a built-in CombinedEstimator dashboard check and print next steps for participant workflows.
whest version — Print whestbench version.
whest init — Create starter estimator files.
whest validate — Validate estimator contract.
whest run — Run local evaluation for an estimator.
whest dataset — Dataset bake/publish/load/merge/inspect commands.
whest package — Package submission artifact.
whest profile-simulation — Benchmark flopscope simulation performance.
whest doctor — Run install/environment health checks.
whest login — Store your AIcrowd API key (interoperable with aicrowd-cli).
whest submit — Submit to AIcrowd (packages an estimator if needed, then uploads).

============================================================
API Reference
============================================================

--- api ---
URL: https://aicrowd.github.io/whestbench/docs/api

API Reference
Autogenerated reference for the whestbench public API.
Every symbol exported from whestbench (__all__). Generated from source.

BaseEstimator — Estimator contract for participant implementations.
BudgetExhaustionWarning — Raised when an estimator exhausts its FLOP budget on a single MLP.
combine_split_datasets — Combine N complete single-split datasets into a multi-split dataset directory.
CombinedBudgetExhaustionWarning — Raised when combined compute C_m = F_m + lambda*R_m exceeds the FLOP budget on a single MLP.
create_dataset — Generate MLPs, compute ground-truth, and write a schema-3.0 dataset directory.
InvalidDatasetError — Raised when a dataset directory has missing/incompatible metadata.
iter_mlps — Iterate the MLPs in a Dataset, constructing MLP objects per row.
load_dataset — Load a whestbench dataset from a local directory or HF Hub repo.
merge_datasets — Concatenate partial bakes into a single canonical dataset directory.
metadata — Return the metadata.json contents attached to a Dataset or DatasetDict.
MLP — Validated MLP container with fixed width and layer depth.
mlp_at — Return the MLP at index in the Dataset.
publish_dataset — Upload a baked dataset directory to HF Hub.
relu — Element-wise ReLU activation.
ResidualWallTimeExhaustionWarning — Raised when an estimator exhausts its residual wall-time budget on a single MLP.
run_mlp — Forward pass returning final-layer activations.
run_mlp_all_layers — Forward pass returning activations after each layer.
sample_layer_statistics — Estimate per-layer activation statistics via chunked Monte Carlo sampling.
sample_mlp — Sample a random MLP with He-initialized weight matrices.
SCHEMA_VERSION — value
ScoringExhaustionWarning — Base class for budget/time exhaustion warnings raised during scoring.
SetupContext — Runtime context passed to BaseEstimator.setup.
TimeExhaustionWarning — Raised when an estimator exhausts its wall-clock budget on a single MLP.

============================================================
Development
============================================================

--- development/release-process ---
URL: https://aicrowd.github.io/whestbench/docs/development/release-process

Release process
This document is the authoritative reference for cutting a new release
of whestbench to PyPI. It covers the steady-state flow, the one-time
setup that must happen outside the repo, and a few troubleshooting
notes.
TL;DR (steady-state)
git checkout main && git pull origin main
uv run cz bump --dry-run                  # preview the next version + CHANGELOG entry
uv run cz bump                            # writes pyproject version + CHANGELOG.md + creates v<x.y.z> tag
git push --follow-tags                    # tag push triggers the publish workflow
# … open GitHub Actions → approve the `publish-pypi` job → wait ~30s →
# package on PyPI + GitHub Release created
Pre-release tags: uv run cz bump --prerelease alpha produces tags
like v0.5.0a0.
What happens after git push --follow-tags
The tag push fires
.github/workflows/pypi-publish.yml,
which:

Builds the sdist + wheel with uv build.
Pauses for approval in the pypi GitHub environment (manual gate).
Publishes to PyPI via Trusted Publishing (OIDC; no API token stored
in repo secrets).
Creates a GitHub Release whose body is the matching CHANGELOG
section for the tag.

End result: uv add whestbench / pip install whestbench works ~2
minutes after a maintainer clicks "approve" on the publish-pypi job.
One-time setup (per maintainer, per repo)
Before the first release will succeed, two things must be configured
outside the repo.
1. PyPI Trusted Publisher
On pypi.org, as an account with Owner or
Maintainer rights on the whestbench project (or as the user
creating it, if not yet published):

"Your projects" → whestbench → "Publishing" → "Add a pending
publisher" (or "Add a publisher" if the project already exists).
Fill in:

PyPI project name: whestbench
Owner: AIcrowd
Repository name: whestbench-public
Workflow filename: pypi-publish.yml
Environment name: pypi

PyPI's "pending publisher" feature allows trusted publishing to
succeed on the very first publish of a brand-new project name.
2. GitHub pypi environment
In the whestbench-public repo on GitHub:

Settings → Environments → "New environment" → name: pypi.
Enable "Required reviewers".
Add yourself (and any other release maintainers) as reviewers.
Save.

Without this, publishes proceed without a human approval gate. The
Trusted Publishing OIDC handshake will still work — there is just no
gate to abort a bad tag.
How CHANGELOG entries get into the GitHub Release
The publish workflow extracts the body of the matching ## v<version>
section in CHANGELOG.md using an awk script and uses it as the
GitHub Release notes. Commitizen writes section headers in the
## v<version> (<date>) form, which the workflow expects.
When promoting an existing ## Unreleased section to a versioned
release manually (rather than via cz bump), use the same header
format: ## v0.4.0 (2026-05-26).
If no matching section is found, the workflow falls back to a default
body: Release v<x.y.z>\n\nSee CHANGELOG.md for details.
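The extraction logic described above can be sketched in Python. The real workflow uses an awk script, which is not reproduced here; extract_release_notes is a hypothetical stand-in, and the sample changelog text is made up for the demo.

```python
def extract_release_notes(changelog: str, version: str) -> str:
    """Return the body of the '## v<version>' section, or a default body."""
    header_prefix = f"## v{version}"
    body, in_section = [], False
    for line in changelog.splitlines():
        if line.startswith("## "):
            if in_section:
                break  # reached the next version's header
            # Match "## v0.4.0" or "## v0.4.0 (2026-05-26)", not "## v0.4.01".
            in_section = (line == header_prefix
                          or line.startswith(header_prefix + " "))
            continue
        if in_section:
            body.append(line)
    notes = "\n".join(body).strip()
    return notes or f"Release v{version}\n\nSee CHANGELOG.md for details."


sample = """## v0.5.0 (2026-05-27)

### Feat

- load_dataset: add streaming=True support

## v0.4.0 (2026-05-26)

### Added

- multi-split dataset support
"""

print(extract_release_notes(sample, "0.4.0"))
print(extract_release_notes(sample, "9.9.9"))
```

The first call prints only the v0.4.0 section body; the second finds no matching header and falls back to the default body.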
Troubleshooting
Publish job fails with "Trusted publisher not configured"
PyPI side is not configured. Re-check step 1 of "One-time setup". The
workflow filename and environment name must match exactly
(pypi-publish.yml, pypi).
Publish job fails with "File already exists on PyPI"
A version was previously uploaded and yanked. PyPI does not allow
re-uploading the same version, even after a yank. Resolution: delete
the tag locally and on the remote, bump to the next version, retag:
git tag -d v0.5.0
git push origin :refs/tags/v0.5.0
uv run cz bump   # bumps to v0.5.1
git push --follow-tags
GitHub Release step fails after PyPI succeeded
The package is on PyPI; only the GitHub Release is missing. Re-run
the workflow on the same tag from the GitHub Actions UI. The
github-release job's gh release create is the only remaining side
effect, and it is safe to retry against the existing tag: it succeeds
if no release exists yet and fails if one already does.
cz bump --dry-run previews an unexpected version
The previewed version is computed from conventional-commits types in
the commit range since the last tag: feat → minor bump, fix → patch,
and feat! or BREAKING CHANGE → minor while major_version_zero = true
in [tool.commitizen], else major. To force a specific increment, use
cz bump --increment PATCH|MINOR|MAJOR.
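The bump rules above can be sketched as a small function for reasoning about a surprising preview. This is a simplified model of the rules as stated here, not Commitizen's implementation; next_increment and its input format are hypothetical.

```python
def next_increment(commit_types, major_version_zero=True):
    """Pick the increment for a list of commit types since the last tag.

    commit_types is a list such as ["fix", "feat", "feat!"]; a trailing
    "!" or the literal "BREAKING CHANGE" marks a breaking change.
    """
    breaking = any(t.endswith("!") or t == "BREAKING CHANGE"
                   for t in commit_types)
    if breaking:
        # Breaking changes stay at MINOR while the major version is 0.
        return "MINOR" if major_version_zero else "MAJOR"
    if "feat" in commit_types:
        return "MINOR"
    if "fix" in commit_types:
        return "PATCH"
    return None  # nothing that triggers a bump


print(next_increment(["fix", "docs"]))                        # PATCH
print(next_increment(["fix", "feat"]))                        # MINOR
print(next_increment(["feat!"]))                              # MINOR
print(next_increment(["feat!"], major_version_zero=False))    # MAJOR
```

If the preview disagrees with this model, check the commit messages in the range for an unintended feat or breaking-change marker.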
Pin updates for flopscope
Whestbench pins flopscope>=0.4.1 and flopscope-server>=0.4.1.
When flopscope ships a new minor or major version, bump these floors
in pyproject.toml and re-run uv lock before cutting the next
whestbench release. (Out of scope for an automated workflow; flag if
Dependabot becomes worth the noise.)

============================================================
Changelog
============================================================

--- changelog ---
URL: https://aicrowd.github.io/whestbench/docs/changelog

Changelog
Release history and notable changes.
v0.9.2 (2026-06-01)
Fix

bump to track the flopscope 0.4.2 fix for fnp.random.default_rng() over the client/server grader boundary; the flopscope>=0.4.1 floor auto-resolves to 0.4.2 once published (AIcrowd/flopscope#109)

v0.9.1 (2026-05-31)
Fix

cli: whest submit --watch reaches terminal grading state (#74)

v0.9.0 (2026-05-29)
Feat

cli: add whest login + whest submit (hop-A AIcrowd submission)
add config-aware dataset authoring (#72)
prepared-arrow: friendly upfront notice + CLI preflight sizing (#69)

v0.8.0 (2026-05-27)
Feat

ux2: prepared-Arrow fast path on HF for multi-split datasets (#67)

Fix

prepared-arrow: handle multi-shard parquet splits (#68)

v0.7.0 (2026-05-27)
Feat

ux1: per-split configs + split-aware load + early default_split resolution (#66)
metadata: optional default_split + CLI fallback for multi-split datasets

v0.6.0 (2026-05-27)
Feat

add whest version command and version metadata in JSON
cli: validate/init/smoke-test/profile-simulation adopt unified copy
cli: package gets a bytes progress bar
cli: doctor wraps probes in a status spinner + bookends
cli: merge gets spinner + before/after copy
cli: download surfaces preflight summary + progress + completion
cli: upload gets a real progress bar + before/after copy
cli: bake gets phased progress bars + before/after copy
cli: rename dataset push/pull/inspect to upload/download/info + deprecation
cli: --streaming end-to-end with prominent cache-trade-off warning
cli: add --streaming flag to whest run
cli: use metadata-based n_mlps clamp when ds is streaming
scoring: make_contest_from_dataset supports IterableDataset
cli: wrap hf:// dataset load with hf_download progress UI
hf_progress: add hf_upload context manager
hf_progress: add hf_download context manager with three modes
hf_progress: add RichHFTqdm that forwards into active Rich Progress
hf_progress: add hf_preflight() with cache detection
hf_progress: add HFPreflight dataclass
ui: add status spinner context manager + finalize ui.py
ui: add progress_count context manager
ui: add progress_bytes context manager
ui: add say.* message helpers (intent/step/ok/warn/hint)
ui: add format_throughput helper
ui: add format_duration helper
ui: add format_bytes helper
template: emit configs: block in YAML for explicit split ordering
package: record tool and runtime versions in submission manifest

Fix

avoid duplicate JSON output in validate command
keep final_layer_mse in narrow score subtitle
guard profile-simulation JSON payload type for metadata wrapper
cli: cache-hit download says "Loaded from cache" not "Downloaded"
cli: drop stray comma in cache-miss download ok line
hf_progress: bail preflight when revision cannot be resolved
hf_progress: drop unused empty top-level upload task
hf_progress: raise on nested hf_download/hf_upload
hf_progress: subclass HF tqdm and guard disabled bars
ui: match HF Hub env-var truthy semantics in _progress_disabled
ui: roll over format_bytes at the next-unit boundary
dataset_io: use attr-set for configs to satisfy Pyright

Refactor

ui: cache the default Console as a module-level singleton
ui: inherit handles from ProgressHandle Protocol nominally

v0.5.1 (2026-05-27)
Feat

template: mini+full quick-start snippet leads with split="mini"
template: recognise mini+full split pair in dataset card

Fix

template: restore print(ds[0]['mlp_name']) smoke-test in generic quickstart fallback
template: scope companion-disclaimer to public+holdout, fix whitespace + spelling
test: import datasets.config submodule explicitly for pyright
dataset_io: scope merge_datasets HF cache to tempdir by default

v0.5.0 (2026-05-27)
Feat

load_dataset: add streaming=True support (closes #55)
readme: per-split MLP counts + tighter Compute/Reproducibility wording
readme: companion_repo template var + collapse hardware_fingerprints

Fix

lint: silence intentional type-violation in mlp_at streaming test
lint: narrow load_dataset return type via Literal[streaming] overloads
lint: narrow set element types before sort in fingerprint collapse

v0.4.0 (2026-05-26)
Added

seed_protocol 3.0 (whestbench_explicit_per_mlp_seeds): each MLP's seed is an independent input rather than a derivation from a single root. Each mlp_seed value in the parquet column is the canonical input seed. Within-MLP three-stream derivation (weight/sample/estimator) is preserved via SeedSequence(mlp_seed).spawn(3).
whest dataset bake --mlp-seeds FILE (JSON array of N ints) for explicit per-MLP seeds. Omitting both --mlp-seeds and --seed auto-generates via secrets.randbits(63).
create_dataset(mlp_seeds=[...]) / create_dataset_torch(mlp_seeds=[...]).
MLP.from_row(row, *, seed_protocol_version=...): protocol-aware estimator-seed derivation.
Frozen fixture tests/fixtures/single_split_v3_protocol/ for schema-drift regression.
Multi-split dataset support: dataset directories can now contain multiple Parquet files in data/, one per split, described by an optional splits: sub-dict in metadata.json. Backward-compatible — single-split datasets are unchanged.
whest dataset combine-splits INPUT_DIR... --output OUTPUT_DIR CLI subcommand for assembling multi-split datasets from N complete single-split inputs.
whestbench.combine_split_datasets() Python helper (re-exported from whestbench).
whest dataset bake --split <name> now accepts arbitrary split names matching [a-z][a-z0-9]*(-[a-z0-9]+)* (previously restricted to public / holdout).
whest dataset pull --split <name> and whest run --dataset ... --split <name> for selecting one split from multi-split datasets.

Changed

create_dataset(seed=...) / create_dataset_torch(seed=...) and whest dataset bake --seed N now reject with a migration hint pointing at --mlp-seeds.
Parquet mlp_seed column semantics: under 3.0, the column stores the input seed (was: derived estimator seed under 2.0). MLP.seed (participant-facing) is unchanged across protocols — derived locally from the input under 3.0.
whest dataset inspect now recognises multi-split datasets and prints a per-split summary, plus the seed_protocol: <name> (version <version>) line for all datasets.
whestbench.load_dataset() returns Dataset | DatasetDict based on the dataset shape; explicit split= always returns Dataset.
whestbench.metadata() accepts a DatasetDict and an optional split= filter that projects to single-split-shaped metadata.
The dataset-card template gains a multi-split branch with leaderboard-specific wording when splits are {public, holdout}; the single-split public branch's wording is updated to point at the new evaluation repo.

Compatibility

whestbench.load_dataset reads both seed_protocol 2.0 and 3.0 datasets indefinitely. Existing published datasets (e.g. aicrowd/arc-whestbench-2026-smoke-test) continue to work unchanged.
New bakes only write 3.0.
schema_version stays at "3.0". The protocol discriminator is seed_protocol.{name,version}.
The splits: field is purely additive.
Old whestbench reading new multi-split datasets fails loudly with a missing-n_mlps error — upgrade whestbench to read multi-split.

0.3.0 — 2026-05-25
BREAKING

Dataset format migrated from .npz to HF Parquet+sidecar (schema 2.4 → 3.0).
Datasets are now directories with data/<split>-NNNNN.parquet, metadata.json,
and README.md. The whest create-dataset command is replaced by
whest dataset bake. The DatasetBundle dataclass is removed; internal
consumers operate on datasets.Dataset directly.
Public estimator interface unchanged. Estimators still receive MLP
instances via predict(mlp: MLP).

NEW

whestbench.load_dataset(path_or_repo, revision=..., split=..., token=...) loads from local directories OR HF Hub.
whestbench.iter_mlps(ds), whestbench.mlp_at(ds, i), whestbench.metadata(ds).
whestbench.publish_dataset(local_dir, repo_id=..., tag=..., ...) for HF Hub uploads.
whestbench.merge_datasets(input_dirs, output_dir=...) — concatenate partial bakes.
whest dataset {bake, push, pull, merge, inspect} CLI subcommands.
Parallel bake via --slice K/N or --mlp-range START-END flags; merge with whest dataset merge.
whest run --dataset now accepts HF Hub repos: hf://owner/repo@v1 (inline revision) or owner/repo --revision v1.

MIGRATION

Legacy .npz datasets cannot be loaded by 0.3.0. Re-bake with whest dataset bake at the same --seed to reproduce.
See dataset-format for the schema 3.0 specification.