============================================================
Participant Guide
============================================================

--- participant-guide ---
URL: https://aicrowd.github.io/whestbench/docs/participant-guide

Participant Guide

Tutorial and how-to for the challenge, federated from AIcrowd/whest-starterkit @ aaa3882.

whestbench: white-box estimation of MLP output statistics under a FLOP budget.

============================================================
Guides
============================================================

--- guides/datasets ---
URL: https://aicrowd.github.io/whestbench/docs/guides/datasets

Datasets: a complete guide

WhestBench uses HuggingFace Datasets as its dataset format and HF Hub as the distribution channel. This guide walks you through every dataset-related verb in whest, in the order you'd typically encounter them.

If you only have 5 minutes, read the Quick start below. The rest of the guide builds on it.

Quick start

You have a working estimator at ./estimator.py. Bake a tiny evaluation dataset locally, then score against it:

    # 1. Generate 10 MLPs with ground-truth statistics → ./my-eval/
    whest dataset bake --n-mlps 10 --n-samples 1000 --width 64 --depth 4 \
        --output ./my-eval

    # 2. Inspect what got written
    whest dataset info ./my-eval

    # 3. Score your estimator against the same MLPs every run
    whest run --estimator estimator.py --dataset ./my-eval

Why this matters: without --dataset, whest run regenerates MLPs and ground truth on every invocation. Baking a dataset once and reusing it makes your runs deterministic and ~10× faster.

The dataset lifecycle

    +--------+        +-----------+        +----------+        +--------+
    | local  | upload | HF        | down-  | local    |  run   | scores |
    | bake   | -----> | Hub repo  | load → | cache    | -----> | report |
    | (./out)|        | (org/...) |        | (~/.hf)  |        |        |
    +--------+        +-----------+        +----------+        +--------+
        ^                                                          |
        |__________________________________________________________|
                          iterate on estimator code

- Local-only workflow: bake → run. Best when you're iterating fast and don't care about sharing the dataset. See Working locally.
- Team workflow: bake → upload → … later … → download → run. The HF repo's tag pins which exact dataset everyone scores against. See Publishing to HuggingFace Hub and Downloading from HF Hub and the local cache.
- CI / leaderboard workflow: bake → upload --tag v1-warmup. Participants pull by tag. Streaming (whest run --streaming) is the natural fit for per-PR CI gates; see Streaming mode.

Each verb is detailed below.

Working locally

You want to iterate fast: bake a small dataset to disk, inspect it, and reuse it across whest run invocations. No network, no HF account needed.

whest dataset bake: create a dataset

You're starting a new evaluation. Bake 100 MLPs of moderate size with their ground-truth statistics to ./my-eval/:

    whest dataset bake \
        --n-mlps 100 \
        --n-samples 10000 \
        --width 256 --depth 8 \
        --output ./my-eval

Representative output:

    → Baking 100 MLPs (width=256, depth=8, n_samples=10000) to ./my-eval
    ✓ Generated weights 100/100
    ✓ Computed ground truth 100/100 31.7s
    ✓ Wrote ./my-eval (2.0 GB)

The result on disk:

    my-eval/
    ├── data/public-00000-of-00001.parquet   # weights + ground-truth stats
    ├── metadata.json                        # schema_version, seed_protocol, …
    └── README.md                            # dataset card

Key flags:

- --mlp-seeds: pin per-MLP seeds explicitly. JSON array of N distinct int63 values. Required for bit-exact reproducibility with another bake.
- --mlp-range START-END or --slice K/N: bake a slice of a larger logical dataset.
  The slice is bit-equivalent to the corresponding portion of a single-host bake at the same seeds.
- --torch: use the GPU backend (requires whestbench[gpu]).
- --split: assign a split name (default public). See Multi-split datasets.
- --config: assign the HF config for this split (default default). Dataset authors use this for config-per-split repos; participants normally leave it unset.

If it broke, see the Troubleshooting section. Bake errors usually trace back to seed shape, an existing output directory, or running out of RAM on large --n-samples.

whest dataset info: what's in a dataset

You've baked or downloaded a dataset and want a one-screen summary before running against it:

    whest dataset info ./my-eval

Reports schema_version, seed_protocol, n_mlps, n_samples, hardware fingerprint, and per-split row counts.

info also works against HF Hub directly:

    whest dataset info aicrowd/arc-whestbench-public-2026 --revision v1-warmup

No download required: info only fetches metadata.json.

whest dataset merge: assemble parallel bakes

You have a multi-host cluster and want to bake a 1,000-MLP dataset in two slices, then concatenate. Both workers must share the same --mlp-seeds file:

    # Two workers each bake a slice…
    whest dataset bake --n-mlps 1000 --slice 0/2 --output ./partial-a \
        --mlp-seeds seeds.json
    whest dataset bake --n-mlps 1000 --slice 1/2 --output ./partial-b \
        --mlp-seeds seeds.json

    # … then merge.
    whest dataset merge ./partial-a ./partial-b --output ./full

The merged dataset is bit-equivalent to a single-host bake of the same size at the same seeds. See also: parallel-bake how-to.

whest run --dataset: score against a baked dataset

You're iterating on estimator.py. Score it against the first 50 MLPs of your baked dataset (fast feedback loop):

    whest run --estimator estimator.py --dataset ./my-eval --n-mlps 50

--n-mlps K clamps the run to the first K MLPs of the dataset (useful for quick iteration).
Pass --split if the dataset is multi-split.

Once you're happy with local results, publish the dataset so teammates can score against the same MLPs.

Publishing to HuggingFace Hub

You've baked a dataset locally and want to share it with the team, or pin a specific revision so a CI gate scores everyone against the same MLPs. Upload to a HuggingFace Hub dataset repo.

Authenticate once

    hf auth login   # opens a browser; or pass --token

Tokens with write scope are required to push. You can also set the token without the interactive flow:

    export HF_TOKEN=hf_xxx

whest dataset upload reads HF_TOKEN as a fallback when --token isn't passed.

See also: the publish-to-hf-hub how-to for an end-to-end walkthrough.

whest dataset upload

You have ./my-eval from the previous section. Push it as a private repo and pin the resulting commit with a tag:

    whest dataset upload ./my-eval \
        --repo aicrowd/my-eval \
        --tag v1 \
        --private   # omit for public datasets

Representative output:

    → Uploading ./my-eval to aicrowd/my-eval (private)
    ✓ Repo exists / created
    ✓ Uploaded 2.0 GB ████████████████████ 100% 34.1s
    ✓ Tag v1 created at d2f9a1c
    ✓ Done: https://huggingface.co/datasets/aicrowd/my-eval/tree/v1

The repo is created if it doesn't exist. The tag is created at the resulting commit so whest run --dataset hf://aicrowd/my-eval@v1 pins to this exact revision.

Repo naming. Use /. Keep names short and hyphen-separated (e.g. aicrowd/arc-whestbench-public-2026).

Tag conventions. HF doesn't enforce semver; the de-facto pattern is v. (e.g. v1.0, v1.1) or descriptive (v1-warmup, v1-holdout). See HF's revision docs.

What gets published

The dataset card (README.md) is auto-generated from metadata.json at bake time. It includes splits, hardware fingerprint, seed protocol, and a runnable quick-start snippet. Edit README.md after bake and before upload to add custom content.

The card's YAML front-matter is what HuggingFace Hub renders on the dataset page (tags, license, language, etc.). Don't strip it.
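
Because the card's front-matter has to survive manual edits, a quick pre-upload check can catch an accidental deletion. The following is a minimal sketch and not part of whest: the helper names has_front_matter and check_card are hypothetical. It only checks that the file opens with a `---` line and has a closing `---` line, which is the delimiter convention HuggingFace dataset cards use. It does not validate the YAML itself.

```python
from pathlib import Path


def has_front_matter(card_text: str) -> bool:
    """Return True if the text starts with a '---' delimited front-matter block."""
    lines = card_text.splitlines()
    # The opening delimiter must be the very first line of the card.
    if not lines or lines[0].strip() != "---":
        return False
    # A closing delimiter must appear somewhere after the first line.
    return any(line.strip() == "---" for line in lines[1:])


def check_card(dataset_dir: str) -> bool:
    """Check README.md inside a baked dataset directory (hypothetical helper)."""
    readme = Path(dataset_dir) / "README.md"
    return readme.is_file() and has_front_matter(readme.read_text(encoding="utf-8"))
```

Running check_card("./my-eval") after editing README.md and before uploading would flag a card whose front-matter was removed.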

whest dataset push continues to work as a deprecated alias for upload through v0.6. v0.7 will remove it. The same applies to pull (now download) and inspect (now info).

If it broke (401, 403, repo already exists, network errors), jump to Troubleshooting.

Downloading from HF Hub and the local cache

You want to score against a dataset published by your team or the contest organisers. There are two paths.

whest dataset download: explicit fetch

Use when you want a real on-disk copy you can inspect, ship to another machine, or commit to a separate artifact store:

    whest dataset download aicrowd/arc-whestbench-public-2026 \
        --revision v1-warmup \
        --output ./eval

Representative output:

    → Downloading aicrowd/arc-whestbench-public-2026@v1-warmup → ./eval
    Preflight: 1 parquet shard, 2.0 GB, 1,000 MLPs
    ✓ Downloaded 2.0 GB ████████████████████ 100% 28.9s
    ✓ Wrote ./eval (cache: ~/.cache/huggingface/hub/datasets--aicrowd--arc-whestbench-public-2026)

With --output set, files are materialised under the named directory; the HF cache also picks them up.

Auto-fetch via whest run

You can skip the explicit download, because whest run does it lazily on first use:

    whest run --estimator estimator.py \
        --dataset hf://aicrowd/arc-whestbench-public-2026@v1-warmup

This downloads on first invocation (showing a progress bar) and caches the result. Subsequent runs are ~10× faster (the cache hit prints Loaded from cache).

HF cache layout

After a fetch, the HF cache lives in three places:

    | Path | What's there |
    |------|--------------|
    | ~/.cache/huggingface/hub/datasets----/ | Raw blobs (Git LFS / Xet objects) + the revision snapshot symlinks |
    | ~/.cache/huggingface/datasets/___/ | The datasets library's regenerated Arrow tables (memory-mapped) |
    | ~/.cache/huggingface/xet/{chunk_cache,shard_cache,staging}/ | Xet chunk-level dedup cache (since hf_xet ≥ 1.0) |

Total disk usage is roughly 2× download size (the parquet blob + Arrow rebuild). The hub cache uses content-addressed dedup, so the same blob is shared across revisions and even repos.
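
To see how the roughly 2× figure plays out on a given machine, one can total the bytes under each cache directory. This is a minimal standard-library sketch, not a whest or HF tool. It assumes the default locations under ~/.cache/huggingface unless HF_HOME, HF_HUB_CACHE, HF_DATASETS_CACHE, or HF_XET_CACHE are set, and the helper names are hypothetical. Symlinks are skipped so that snapshot links in the hub cache are not counted twice.

```python
import os
from pathlib import Path


def hf_cache_dirs(env=None) -> dict:
    """Resolve the hub, datasets, and xet cache directories from the environment."""
    env = os.environ if env is None else env
    home = Path(env.get("HF_HOME", Path.home() / ".cache" / "huggingface"))
    return {
        "hub": Path(env.get("HF_HUB_CACHE", home / "hub")),
        "datasets": Path(env.get("HF_DATASETS_CACHE", home / "datasets")),
        "xet": Path(env.get("HF_XET_CACHE", home / "xet")),
    }


def dir_size_bytes(root: Path) -> int:
    """Sum regular-file sizes under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            if not path.is_symlink():
                total += path.stat().st_size
    return total


if __name__ == "__main__":
    # Print one line per cache directory with its size in GB.
    for label, path in hf_cache_dirs().items():
        size_gb = dir_size_bytes(path) / 1e9 if path.exists() else 0.0
        print(f"{label:9s} {path}  {size_gb:.2f} GB")
```

For actually reclaiming space, HF's own cache CLI is the safer tool; this sketch only reports sizes.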

Cleaning up

Defer to HF's own cache CLI. It understands the layout above and will not accidentally orphan blobs that are still referenced from another revision:

    hf cache ls       # show what's there
    hf cache prune    # drop unreferenced revisions
    hf cache rm       # remove a specific repo or revision
    hf cache verify   # check integrity

Full reference: HF cache management.

Cache location overrides

    | Env var | What it sets | Default |
    |---------|--------------|---------|
    | HF_HOME | Root of all HF state | ~/.cache/huggingface |
    | HF_HUB_CACHE | Hub-only cache (blobs/snapshots) | $HF_HOME/hub |
    | HF_DATASETS_CACHE | datasets-library Arrow cache | $HF_HOME/datasets |
    | HF_XET_CACHE | Xet chunk staging | $HF_HOME/xet |

When running on NFS, set HF_XET_CACHE=/local/ssd to avoid roundtrips. See Performance tuning for more knobs.

whest dataset pull continues to work as a deprecated alias for download through v0.6.

If it broke (long pause, disk full, gated dataset, cas-bridge.xethub.hf.co URLs you don't recognise), jump to Troubleshooting.

Streaming mode

You want to score against a small slice of a remote dataset without paying the cost of a full download. whest run --streaming consumes the dataset row-group-by-row-group over HTTP instead of downloading it first.

When to use

- You're iterating on estimator code with --n-mlps 5 (or some small K). Streaming fetches only the first ⌈K/47⌉ row groups (~95 MB each for the warmup dataset) instead of the full 2 GB.
- You're on a constrained-disk environment (CI runner, container).
- You want a fast first-row response time more than total throughput.

When NOT to use

- Repeated full evaluations of the same dataset. Streaming does NOT populate the cache, so every run re-fetches. Use the default materialise path instead.
- Anything that needs random access. IterableDataset is iteration-only; len(ds), ds[i], and ds.shuffle(seed=…) don't work as expected.
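
The row-group arithmetic under "When to use" can be sketched as follows. The constants (47 MLPs per row group, ~95 MB per row group) are the warmup-dataset figures quoted above and will differ for other datasets; the function name is hypothetical.

```python
import math

MLPS_PER_ROW_GROUP = 47   # warmup-dataset figure quoted in this guide
ROW_GROUP_MB = 95         # approximate size of one row group, same source


def streaming_fetch_mb(n_mlps: int) -> int:
    """Approximate MB fetched when streaming the first n_mlps MLPs."""
    # ceil(K / 47) row groups are needed to cover the first K MLPs.
    row_groups = math.ceil(max(n_mlps, 0) / MLPS_PER_ROW_GROUP)
    return row_groups * ROW_GROUP_MB


for k in (5, 47, 48, 1000):
    print(f"--n-mlps {k:4d} -> ~{streaming_fetch_mb(k)} MB")
```

With these constants, any K up to 47 costs a single row group, and K = 1000 comes out at roughly the full 2 GB, which is where streaming stops paying off.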

Trade-off table

    | Property | Materialise (default) | --streaming |
    |----------|-----------------------|-------------|
    | First-row latency, cold cache | ~30 s (full download) | ~5 s |
    | First-row latency, warm cache | ~2 s | ~5 s (re-fetch) |
    | Disk usage | ~4 GB (blob + Arrow) | 0 |
    | Subsequent runs | ~2 s (cache hit) | ~5 s (re-fetch every time) |
    | Random access | Yes | No |

Authentication and streaming

Unauthenticated requests to HF are rate-limited and noticeably slower. Run hf auth login once to set a token; streaming throughput typically improves 30–50% authenticated.

Example

    whest run --estimator estimator.py \
        --dataset hf://aicrowd/arc-whestbench-public-2026@v1-warmup \
        --streaming \
        --n-mlps 5

You'll see a ⚠ Streaming from HF warning at startup, then a progress indicator while the first row group is fetched, then scoring begins.

Streaming is incompatible with --json output (it would corrupt JSON ordering), and len(ds) raises on a streaming dataset. Both are documented under Troubleshooting.

Multi-split datasets

A dataset can contain multiple disjoint groups of MLPs, typically public (open to participants for tuning) and holdout (used only by the leaderboard grader). One repo, two splits.

When and why

- Leaderboard datasets: participants score against public locally, the leaderboard grader scores against holdout. Same parquet schema, same hardware fingerprint, different seeds.
- Train/validation flow: split a dataset into train/val/test for meta-learning experiments on top of WhestBench.

Baking a split

Each split is baked separately. Use a distinct --mlp-seeds file for each split so the splits don't overlap:

    whest dataset bake --n-mlps 500 --split public --config default \
        --mlp-seeds public-seeds.json --output ./eval-public
    whest dataset bake --n-mlps 500 --split holdout --config holdout \
        --mlp-seeds holdout-seeds.json --output ./eval-holdout

Combining splits into one multi-split directory

    whest dataset combine-splits ./eval-public ./eval-holdout --output ./eval-full

The result is a single dataset directory with both splits in data/, suitable for whest dataset upload to a single HF repo.
combine-splits preserves each bake's config metadata, so the published card can expose the same config-per-split layout as the official HF datasets.

Selecting a split when running

    whest run --estimator estimator.py \
        --dataset hf://aicrowd/eval-full@v1 \
        --split public

Without --split, multi-split datasets are rejected by whest run (the scoring path scores against exactly one split at a time, by design).

Inspecting splits

    whest dataset info ./eval-full
    # Reports each split's n_mlps and seed.

If combine-splits complains about overlapping mlp_seeds or mismatched hardware fingerprints, see Troubleshooting.

Performance tuning

These are power-user knobs. The defaults are fine for almost everyone.

Xet high-performance mode

If you have ≥64 GB RAM and a fat uplink:

    export HF_XET_HIGH_PERFORMANCE=1

This saturates both bandwidth and CPU cores. Helpful when downloading many-GB datasets to a workstation. Reference: HF Xet storage docs.

Local SSD for the Xet cache

If your HF cache is on NFS or a slow disk:

    export HF_XET_CACHE=/local/ssd/hf-xet

This keeps the chunk staging cache on fast local storage. The main hub cache (HF_HUB_CACHE) can stay on NFS; only the per-chunk Xet metadata is roundtrip-sensitive.

Disabling Xet entirely

    export HF_HUB_DISABLE_XET=1

Falls back to plain LFS transport. Rarely useful; only reach for it if you've confirmed a Xet-specific bug.

Disabling progress bars (CI)

    export HF_HUB_DISABLE_PROGRESS_BARS=1

Whestbench's say.* lines still emit; only the progress bars are suppressed. For complete silence add --quiet to the whest invocation.

Troubleshooting

"I see a long pause and no output."
Cache miss on a cold HF cache. Watch the progress bar; for the warmup dataset it's ~30 s on a 70 MB/s link. To avoid silent re-downloads, run whest dataset download ahead of time, or ls ~/.cache/huggingface/hub/ to confirm progress.

"Downloads feel slow."
You're probably unauthenticated; HF rate-limits anonymous traffic. Run hf auth login once and re-run.
See also Authentication and streaming.

"Disk filled up."
HF stores blobs in both ~/.cache/huggingface/hub/ (raw download) and ~/.cache/huggingface/datasets/ (regenerated Arrow). Use hf cache prune to drop unreferenced revisions, then hf cache ls to verify reclaimed space. See Cleaning up.

"401/403 on upload."
Your token doesn't have write scope. Re-login with hf auth login --token from a token created with write access. For org-owned repos, your account also needs membership in the org.

"Cannot use --streaming with --json output."
Known limitation: streaming progress events would corrupt JSON ordering. Drop --json, or drop --streaming.

"len(ds) raises on a streaming dataset."
Expected per HF docs. Use whestbench.metadata(ds)["n_mlps"] instead; it reflects the upstream metadata.json, not the local materialised count.

"I see cas-bridge.xethub.hf.co URLs but the file is LFS."
That's HF's Xet bridge transparently serving legacy LFS content via the Xet CDN edge. No action required. If you need to force plain-LFS transport for debugging, set HF_HUB_DISABLE_XET=1 (see Disabling Xet entirely).

"Dataset is gated."
Request access on the dataset page (HF will email you a link from https://huggingface.co/datasets/), then re-run. Make sure you're authenticated with the same account that was granted access.

Reference

Format

Schema 3.0 spec: dataset-format.

SDK surface

    import whestbench as wb

    ds = wb.load_dataset(
        "aicrowd/foo", revision="v1", split="public", streaming=False
    )
    for mlp in wb.iter_mlps(ds):
        ...
    mlp = wb.mlp_at(ds, 0)   # random access (materialised datasets only)
    md = wb.metadata(ds)     # the dataset's metadata.json

    wb.publish_dataset(
        "./my-eval", repo_id="aicrowd/foo", tag="v1"
    )
    wb.merge_datasets(["./partial-a", "./partial-b"], output_dir="./full")
    wb.combine_split_datasets(
        ["./public", "./holdout"], output_dir="./full"
    )

CLI verbs (canonical names)

    | Verb | Purpose | Deprecated alias |
    |------|---------|------------------|
    | whest dataset bake | Generate locally | (none) |
    | whest dataset upload | Publish to HF | push |
    | whest dataset download | Fetch from HF | pull |
    | whest dataset info | Show metadata | inspect |
    | whest dataset merge | Concatenate partials | (none) |
    | whest dataset combine-splits | Assemble multi-split | (none) |

Deprecated aliases continue to work through v0.6 and emit a deprecation warning. v0.7 removes them.

Environment variables

    | Var | Purpose |
    |-----|---------|
    | HF_TOKEN | Auth token (lazy: only read when needed) |
    | HF_HOME | Root of HF state (~/.cache/huggingface by default) |
    | HF_HUB_CACHE | Hub blobs cache |
    | HF_DATASETS_CACHE | datasets-library Arrow cache |
    | HF_XET_CACHE | Xet chunk staging |
    | HF_XET_HIGH_PERFORMANCE | Saturate bandwidth + CPU |
    | HF_HUB_DISABLE_PROGRESS_BARS | Suppress progress bars |
    | HF_HUB_DISABLE_XET | Force plain-LFS transport |
    | HF_HUB_DISABLE_IMPLICIT_TOKEN | Don't send token on read calls |
    | NO_COLOR | Disable ANSI colours |

CLI flag conventions

--repo-type, --revision, --token, --cache-dir, --quiet, --json, --format {auto,human,agent,json,quiet}, --dry-run, --exist-ok. Adopted from HF's hf CLI for consistency.

See also

- CLI reference: exhaustive flag list per verb.
- Dataset format spec: schema 3.0 on-disk layout.
- Parallel bake how-to: distributed bake + merge.
- Publish to HF Hub how-to: token setup, repo creation.

============================================================
How-to
============================================================

--- how-to/parallel-bake ---
URL: https://aicrowd.github.io/whestbench/docs/how-to/parallel-bake

Parallel bake across multiple GPUs / hosts

Bake one large dataset across N workers, then merge the partials into a single canonical artifact that is bit-equivalent to what a single-host bake would have produced.

When to use this

At the default sample count (n_samples=1_000_000_000), a single L40S GPU takes roughly 4 hours for 100 MLPs (measured; see GPU Dataset Generation for the full timing table). Splitting the work across multiple workers reduces wall time proportionally:

    1 L40S × 100 MLPs × 10⁹ samples                  ≈ 4 h
    4 L40S workers × 25 MLPs each × 10⁹ samples      ≈ 1 h
    8 L40S workers × 12–13 MLPs each × 10⁹ samples   ≈ 30 min

Parallel baking is also useful for fault tolerance: if one worker fails, you only need to re-bake its slice.

1. Bake each slice

Use --slice K/N to assign each worker a disjoint range of MLPs. All workers must use the same --mlp-seeds file, --n-mlps, --n-samples, --width, and --depth; the merge step enforces this.

Generate the seeds file once before launching workers:

    whest dataset generate-seeds --n-mlps 1000 > seeds.json

The following example bakes 1000 MLPs across 4 workers.
Run each command on its own host (or in a separate job):

Worker 0 (MLPs 0–249):

    whest dataset bake \
        --n-mlps 1000 --n-samples 1_000_000_000 \
        --width 256 --depth 8 \
        --mlp-seeds seeds.json \
        --slice 0/4 \
        --torch --device auto \
        --output ./partial-0

Worker 1 (MLPs 250–499):

    whest dataset bake \
        --n-mlps 1000 --n-samples 1_000_000_000 \
        --width 256 --depth 8 \
        --mlp-seeds seeds.json \
        --slice 1/4 \
        --torch --device auto \
        --output ./partial-1

Worker 2 (MLPs 500–749):

    whest dataset bake \
        --n-mlps 1000 --n-samples 1_000_000_000 \
        --width 256 --depth 8 \
        --mlp-seeds seeds.json \
        --slice 2/4 \
        --torch --device auto \
        --output ./partial-2

Worker 3 (MLPs 750–999):

    whest dataset bake \
        --n-mlps 1000 --n-samples 1_000_000_000 \
        --width 256 --depth 8 \
        --mlp-seeds seeds.json \
        --slice 3/4 \
        --torch --device auto \
        --output ./partial-3

Each worker writes a directory marked is_partial=true in metadata.json. The loader refuses to load partial datasets directly; you must merge them first.

2. Fetch partials locally

Once all workers finish, collect the partial directories on a single machine.

    # scp example (adjust hostnames and paths)
    scp -r worker-0:/data/partial-0 ./partial-0
    scp -r worker-1:/data/partial-1 ./partial-1
    scp -r worker-2:/data/partial-2 ./partial-2
    scp -r worker-3:/data/partial-3 ./partial-3

    # Or rsync (preserves timestamps, supports resumption)
    rsync -avz worker-0:/data/partial-0/ ./partial-0/
    rsync -avz worker-1:/data/partial-1/ ./partial-1/
    rsync -avz worker-2:/data/partial-2/ ./partial-2/
    rsync -avz worker-3:/data/partial-3/ ./partial-3/

3. Merge

whest dataset merge validates all partials, checks that their mlp_range values cover [0, 1000) exactly once (no gaps, no overlaps), concatenates the Parquet shards in MLP-index order, and writes a complete dataset directory:

    whest dataset merge \
        ./partial-0 ./partial-1 ./partial-2 ./partial-3 \
        --output ./final-eval

Expected output:

    Merged 4 partials to ./final-eval

The merge fails loudly on any of:

- Partials disagree on n_samples, width, depth, backend, or total_n_mlps (MergeIncompatibleError)
- Ranges have gaps, e.g. [0,250) and [500,750) with nothing in between (MergeIncompleteError)
- Ranges overlap (MergeOverlapError)
- A partial's actual row mlp_id values don't match its declared mlp_range (MergeCorruptError)

4. Verify bit-equivalence (optional)

To confirm the parallel bake matches a serial bake on the same seeds file, bake a small reference dataset on a single host and compare all_layer_means:

    import numpy as np
    from datasets import load_dataset

    # Load the merged result
    merged = load_dataset("./final-eval", split="public")

    # Bake a tiny reference (e.g. first 4 MLPs) on one host for verification.
    # Use the SAME --n-samples as the parallel workers; the statistics are only
    # comparable bit-for-bit at an identical sample count.
    # Pass the SAME --chunk-size as the parallel workers too. Otherwise the auto-tuned
    # chunk_size differs (workers: B=mlps_per_slice; reference: B=4) and reductions
    # accumulate in different orders, producing ~5e-4 spurious diffs on CUDA.
    # echo '[,,,]' > ref-seeds.json   # use seeds[0:4] from seeds.json
    # whest dataset bake --n-mlps 4 --n-samples 1_000_000_000 \
    #     --width 256 --depth 8 --mlp-seeds ref-seeds.json \
    #     --chunk-size 524288 --output ./reference-4
    reference = load_dataset("./reference-4", split="public")

    # Compare means for the overlapping MLPs
    for i in range(len(reference)):
        merged_means = np.array(merged[i]["all_layer_means"])
        ref_means = np.array(reference[i]["all_layer_means"])
        max_diff = np.abs(merged_means - ref_means).max()
        print(f"MLP {i}: max |Δmean| = {max_diff:.2e}")
        assert max_diff == 0.0, f"MLP {i}: not bit-exact!"
        # avg_variance loses ~1 float64 ULP from the (sum_sq/n - mean²) subtraction,
        # so compare with np.isclose rather than strict equality. rtol=1e-12 covers
        # ULP noise that scales with the variance magnitude; atol=1e-15 guards near
        # zero. Observed noise on N=1e9 bakes is ~1e-17, so this is ~100× headroom.
        merged_var = float(merged[i]["avg_variance"])
        ref_var = float(reference[i]["avg_variance"])
        assert np.isclose(merged_var, ref_var, rtol=1e-12, atol=1e-15), (
            f"MLP {i}: variance not within ULP tol "
            f"(merged={merged_var}, ref={ref_var})"
        )

    print("Bit-equivalence verified for first 4 MLPs.")

Expected output (for the CPU backend):

    MLP 0: max |Δmean| = 0.00e+00
    MLP 1: max |Δmean| = 0.00e+00
    MLP 2: max |Δmean| = 0.00e+00
    MLP 3: max |Δmean| = 0.00e+00
    Bit-equivalence verified for first 4 MLPs.

Bit-equivalence holds within each backend (flopscope or torch) but not across backends, because they use different RNG algorithms.

5. Inspect and publish

Inspect the merged dataset, then push to HuggingFace Hub as a single artifact. See Publishing a dataset to HuggingFace Hub for the full publish walkthrough.

    # Inspect
    whest dataset inspect ./final-eval

    # Publish
    whest dataset push ./final-eval \
        --repo aicrowd/arc-whestbench-2026 \
        --tag v1 \
        --message "Parallel bake: 1000 MLPs, 4 workers"

Slicing model

--slice K/N

Divides the logical dataset of --n-mlps into N equal slices and assigns slice K (0-indexed). For n_mlps=1000 and N=4:

    | --slice | mlp_range |
    |---------|-----------|
    | 0/4 | [0, 250) |
    | 1/4 | [250, 500) |
    | 2/4 | [500, 750) |
    | 3/4 | [750, 1000) |

If n_mlps is not evenly divisible by N, the last slice gets the remainder.

--mlp-range START-END

The lower-level alternative to --slice. Both endpoints are inclusive on the CLI (e.g. --mlp-range 0-249 covers MLPs 0 through 249 inclusive). The Python API uses half-open [start, end) intervals internally. Use --mlp-range for irregular splits or when you need to re-run only a specific MLP range after a failure.
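
The slicing rules above can be sketched in a few lines. This is an illustration, not whest's implementation: both function names are hypothetical, and the remainder handling assumes that every slice except the last has size n_mlps // N, which is one reading of "the last slice gets the remainder".

```python
def slice_to_range(k: int, n: int, n_mlps: int) -> tuple[int, int]:
    """Half-open [start, end) MLP range for slice k of n (0-indexed)."""
    if k not in range(n):
        raise ValueError(f"slice index {k} out of range for {n} slices")
    size = n_mlps // n  # equal base size per slice
    start = k * size
    # Assumption: the last slice absorbs the remainder when n does not divide n_mlps.
    end = n_mlps if k == n - 1 else start + size
    return start, end


def cli_range_to_half_open(arg: str) -> tuple[int, int]:
    """Convert an inclusive 'START-END' CLI string to a half-open [start, end)."""
    start_s, end_s = arg.split("-", 1)
    start, end = int(start_s), int(end_s)
    if start > end:
        raise ValueError(f"empty range: {arg}")
    return start, end + 1  # inclusive END becomes exclusive end


# --slice 1/4 of 1000 MLPs and --mlp-range 250-499 name the same MLPs.
print(slice_to_range(1, 4, 1000), cli_range_to_half_open("250-499"))
```

This is why a failed --slice 1/4 worker can be re-run with --mlp-range 250-499: both resolve to the half-open range [250, 500).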

    # Re-run just MLPs 250–499 after a worker failure
    # (use the same seeds.json as the original bake)
    whest dataset bake \
        --n-mlps 1000 --n-samples 1_000_000_000 \
        --width 256 --depth 8 --mlp-seeds seeds.json \
        --mlp-range 250-499 \
        --torch --device auto \
        --output ./partial-1-retry

Bit-equivalence requirements

The merge step produces a dataset bit-equivalent to a single-host bake only when:

- All workers use the same --mlp-seeds file and same --n-mlps. Under seed_protocol 3.0, each slot reads its input seed directly from that shared file, so the derived weight/sample/estimator streams are identical regardless of which worker processes the slot.
- All workers use the same backend (flopscope vs torch). The two backends use different RNG algorithms and produce statistically equivalent but not bitwise identical results at the same seeds.
- For the torch backend on CUDA, bitwise reproducibility additionally requires the same torch version (CUDA kernel implementations may differ between versions).
- For the torch backend on CUDA, all workers and any reference re-bake must use the same --chunk-size. The default is auto-tuned per call from mlps_per_batch (which derives from --n-mlps minus slicing) and the device's free memory. A worker baking a 1-MLP slice (mlps_per_batch=1, auto chunk ≈ 1048576) and a 4-MLP reference bake (mlps_per_batch=4, auto chunk ≈ 524288) will therefore pick different chunk sizes, accumulate float reductions in different orders, and disagree by ~5e-4 absolute on all_layer_means, final_means, and avg_variance. Pinning --chunk-size to a fixed value across every bake (workers AND any reference bake) eliminates this. For width=256, --chunk-size 524288 is a safe choice across all batch sizes from 1 to 16.
- Cross-host CUDA non-determinism beyond chunk-size has been ruled out in practice when the standard PyTorch determinism flags are set (cudnn.deterministic=True, cudnn.benchmark=False, CUBLAS_WORKSPACE_CONFIG=:4096:8).
  With those flags plus a pinned --chunk-size, parallel-vs-serial bakes match bit-exactly on weights, all_layer_means, and final_means. avg_variance differs by ~1 float64 ULP (~1e-17 on N=1e9 bakes) due to the (sum_sq/n − mean²) subtraction; compare it with np.isclose(rtol=1e-12, atol=1e-15) rather than strict equality. rtol covers ULP noise that scales with variance magnitude; atol guards near zero.

Multi-split datasets

For datasets with multiple splits (e.g. the evaluation dataset with public and holdout), bake each split independently, then combine. Each split has its own seed file, and the seeds must be uncorrelated.

Under seed_protocol 3.0, each split has its own JSON file of per-MLP seeds. All workers baking a given split must receive the SAME JSON file (they internally slice it by --slice K/N); seeds for different splits MUST be different files to preserve cross-split independence. (The orchestrator in whest-evaluation-utils/gpu-dataset-bake/ automates this.)

    # Generate independent seed files for each split (once, before launching workers).
    whest dataset generate-seeds --n-mlps 50 > public-seeds.json
    whest dataset generate-seeds --n-mlps 50 > holdout-seeds.json

    # Parallel-bake the public split (4 workers, same seeds file).
    for K in 0 1 2 3; do
        whest dataset bake --n-mlps 50 --n-samples 1e9 --width 256 --depth 8 \
            --split public --config default --mlp-seeds public-seeds.json --slice $K/4 \
            --torch --device cuda --output ./pub-p$K &
    done
    wait
    whest dataset merge ./pub-p* --output ./pub-complete

    # Parallel-bake the holdout split (4 workers, different seeds file).
    for K in 0 1 2 3; do
        whest dataset bake --n-mlps 50 --n-samples 1e9 --width 256 --depth 8 \
            --split holdout --config holdout --mlp-seeds holdout-seeds.json --slice $K/4 \
            --torch --device cuda --output ./hold-p$K &
    done
    wait
    whest dataset merge ./hold-p* --output ./hold-complete

    # Combine into one multi-split directory.
    whest dataset combine-splits ./pub-complete ./hold-complete --output ./eval

    # Inspect, push.
    whest dataset inspect ./eval
    whest dataset push ./eval --repo aicrowd/arc-whestbench-2026-evals --tag round-1 --private

Each per-split bake is independent: workers in different splits don't share any seed state. The combine step validates that all splits agree on the invariants (width, depth, n_samples, backend) but allows different per-split n_mlps.

--- how-to/publish-to-hf-hub ---
URL: https://aicrowd.github.io/whestbench/docs/how-to/publish-to-hf-hub

Publishing a dataset to HuggingFace Hub

Step-by-step walkthrough for baking a WhestBench evaluation dataset locally and publishing it to HuggingFace Hub. Once published, participants (and other machines) can load it directly with datasets.load_dataset or whestbench.load_dataset.

Prerequisites:

- pip install whestbench (or whestbench[gpu] for GPU bakes)
- A HuggingFace account with write access to the target repo
- HF_TOKEN with write scope (see step 1)

1. Set up authentication

    # Option A: interactive login (stores a token in ~/.cache/huggingface/token)
    huggingface-cli login

    # Option B: environment variable (preferred in CI)
    export HF_TOKEN=hf_your_write_token_here

The HF_TOKEN environment variable is read automatically by whest dataset push and whestbench.publish_dataset. You can also pass it explicitly via --token.

2. Bake locally

Bake a dataset to a local directory. Choose --n-mlps and --n-samples appropriate to your use case. For bit-exact reproducibility, pass an explicit --mlp-seeds JSON file.

    whest dataset bake \
        --n-mlps 10 \
        --n-samples 10_000_000 \
        --width 256 \
        --depth 8 \
        --output ./my-bake

For larger bakes, see Parallel bake across multiple GPUs and GPU Dataset Generation.

3. Inspect before publishing

Verify the bake parameters before uploading. This is cheap and catches any misconfiguration before it goes out:

    whest dataset inspect ./my-bake

Expected output:

    WhestBench dataset
    schema_version: 3.0
    format: hf-datasets-parquet
    backend: flopscope
    seed: 42
    n_mlps: 10
    n_samples: 10000000
    width: 256
    depth: 8
    created_at_utc: 2026-05-25T12:00:00+00:00

You can also verify the dataset loads correctly before pushing:

    import whestbench

    ds = whestbench.load_dataset("./my-bake")
    print(len(ds), "MLPs loaded")
    for mlp in whestbench.iter_mlps(ds):
        print(mlp.name, mlp.weights[0].shape)
        break

4. Publish

Push the local directory to HF Hub. Use --tag to create a versioned git tag; this is strongly recommended so participants can pin a specific version.

    whest dataset push ./my-bake \
        --repo aicrowd/arc-whestbench-2026 \
        --tag v1 \
        --message "Bake: 10 MLPs, seed=42, 10M samples"

Expected output:

    Uploaded to aicrowd/arc-whestbench-2026; commit abc1234def; tag v1

For a private repo (e.g. holdout sets), add --private:

    whest dataset push ./my-bake \
        --repo aicrowd/arc-whestbench-2026-holdout \
        --tag v1 \
        --private \
        --message "Holdout bake: seed=99"

What gets uploaded

- data/-00000-of-00001.parquet: the MLP data
- metadata.json: provenance sidecar
- README.md: rendered dataset card (re-rendered with the actual repo_id and tag before upload, including any declared HF config layout)

5. Verify on HF Hub

Visit the dataset page to confirm the upload succeeded:

    https://huggingface.co/datasets/aicrowd/arc-whestbench-2026/tree/v1

You should see the three entries (data/, metadata.json, README.md) and the dataset card rendered from the README.

You can also inspect from the CLI without downloading:

    whest dataset inspect aicrowd/arc-whestbench-2026 --revision v1

6. Pull on another machine

On any other machine with whestbench installed:

    whest dataset pull aicrowd/arc-whestbench-2026 \
        --revision v1 \
        --output ./local-copy

For a private repo, pass --token or set HF_TOKEN first.

7. Load in a participant script

Using datasets.load_dataset directly

    from datasets import load_dataset

    ds = load_dataset(
        "aicrowd/arc-whestbench-2026",
        revision="v1",
        split="public",
    )
    print(ds)                  # Dataset({features: ['mlp_id', 'mlp_name', ...], num_rows: 10})
    print(ds[0]["mlp_name"])   # "danielle-johnson"

Using whestbench.load_dataset (recommended)

The wrapper validates the schema and attaches metadata for later retrieval:

    import whestbench

    ds = whestbench.load_dataset(
        "aicrowd/arc-whestbench-2026",
        revision="v1",
        split="public",
    )

    # Iterate as MLP instances (my_estimator is your own estimator instance)
    for mlp in whestbench.iter_mlps(ds):
        y_pred = my_estimator.predict(mlp)

    # Access metadata
    md = whestbench.metadata(ds)
    print(md["seed"], md["n_mlps"])

Running evaluation against the published dataset

    whest run --estimator ./estimator.py \
        --dataset hf://aicrowd/arc-whestbench-2026@v1

    # Or equivalently:
    whest run --estimator ./estimator.py \
        --dataset aicrowd/arc-whestbench-2026 \
        --revision v1

Note: bare aicrowd/arc-whestbench-2026 without --revision is rejected by whest run, so always pin a revision.

Troubleshooting

401 Unauthorized: Your HF_TOKEN doesn't have write access to the target repo, or it has expired. Generate a new token at huggingface.co/settings/tokens with write scope.

404 Repository not found: The repo doesn't exist yet. whest dataset push creates it automatically; ensure you have permission to create repos under the target org (e.g. aicrowd/).

FileExistsError: output already exists: whest dataset bake refuses to overwrite an existing directory. Delete or rename the old output first, or choose a new --output path.

Dataset rejected with "partial dataset" error: You pushed a slice bake without merging first. Run whest dataset merge on all slices, then push the merged result.
  See Parallel bake.

Multi-split datasets

whest dataset push handles multi-split datasets natively. The local directory must contain one parquet per split in data/ and a metadata.json with a splits: dict; this is the shape produced by whest dataset combine-splits. If the input bakes declared --config, the push preserves that config-per-split layout in the published dataset card.

The push uploads all parquets in one commit; tag with --tag round-N for per-round eval datasets.

For private repos (e.g. the evaluation dataset), pass --private on first push to create the repo as private. Subsequent pushes preserve the privacy setting.

============================================================
Reference
============================================================

--- reference/estimator-contract ---
URL: https://aicrowd.github.io/whestbench/docs/reference/estimator-contract

Reference
Estimator Contract

Exact estimator I/O requirements, FLOP tracking rules, failure semantics, memory limits, and reproducibility contract.

When to use this page

Use this page when you need exact estimator I/O requirements.

Required interface

    predict(self, mlp: MLP, budget: int) -> fnp.ndarray

Optional lifecycle hooks:

    setup(self, context: SetupContext) -> None
    teardown(self) -> None

SetupContext fields

| Field | Type | Description |
|---|---|---|
| width | int | Neuron count for generated MLPs |
| depth | int | Number of layers per MLP |
| flop_budget | int | FLOP cap for the estimator |
| api_version | str | Contract version string |
| scratch_dir | str or None | Optional writable directory for caching |
| seed | int | Per-run seed from --seed (default 0). Use in setup() to reproduce one-time random initialisation. See Setup-time reproducibility. |
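As a hedged illustration of how the SetupContext fields above might be consumed, the sketch below chooses a cache file location inside scratch_dir when one is provided and disables caching otherwise. SetupContextStub, cache_path_for, and the file-name scheme are hypothetical and written only for this example; they are not part of whestbench, and the real context object is supplied by the runner.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class SetupContextStub:
    """Hypothetical stand-in mirroring the documented SetupContext fields."""
    width: int
    depth: int
    flop_budget: int
    api_version: str
    scratch_dir: Optional[str] = None
    seed: int = 0


def cache_path_for(ctx: SetupContextStub) -> Optional[Path]:
    """Return a cache file path under scratch_dir, or None when it is absent.

    The file name is an assumption of this sketch: it keys the cache on the
    fields that would change what a precomputed artifact looks like.
    """
    if ctx.scratch_dir is None:
        # scratch_dir is optional, so an estimator has to work without it.
        return None
    name = f"cache-w{ctx.width}-d{ctx.depth}-s{ctx.seed}-api{ctx.api_version}.bin"
    return Path(ctx.scratch_dir) / name
```

Including the seed in the key keeps a cached artifact from one --seed value from being reused under another.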
Input object quick reference

| Object | Field | Meaning |
|---|---|---|
| MLP | width | Number of neurons per layer |
| MLP | depth | Number of weight matrices (layers) |
| MLP | weights | Ordered weight matrices, each (width, width) |
| MLP | seed | Per-MLP grader-supplied seed; use this to seed estimator-internal randomness for reproducibility under regrade. |
| MLP | name | Human-readable slug like "danielle-johnson" derived deterministically from seed. Stable across runs and CPU/GPU backends at the WhestBench release's pinned faker version. Useful for log lines and error messages; safe to ignore. Empty string only when an MLP is constructed outside an evaluator bake path. |

For traversal examples, see Inspect and Traverse MLP Structure (in the starter kit).

Output requirements per predict call

| Requirement | Rule |
|---|---|
| Shape | Return a 2D array with shape (mlp.depth, mlp.width) |
| Numeric validity | Every value is finite |

FLOP tracking

Your estimator must use flopscope primitives (import flopscope as flops and import flopscope.numpy as fnp) for all numerical computation. flopscope tracks FLOP usage analytically. If the total FLOPs across your entire predict call exceed flop_budget, all predictions for that MLP are replaced with zeros and your MSE for that MLP is computed from those zero predictions.

Failure semantics

When predict() cannot return a valid result, for any reason, the affected MLP is scored as if the estimator had returned a zero array, and the multiplier in the budget-adjusted score s_m is forced to 1.0 (no compute discount).
Concretely:
- FLOP budget exhausted (flopscope.BudgetExhaustedError) → Y_hat = 0, s_m = MSE(0, Y) * 1.0
- Wall-time / residual-time budget exhausted → same
- Combined-budget post-check (C_m = F_m + λ·R_m > B_m) → same
- predict() raised an exception (any subclass of Exception, including MemoryError, ValueError from validate_predictions, custom estimator exceptions) → same
- Invalid output shape (not (depth, width)) → same
- Non-finite values (any inf or NaN) → same
- Subprocess worker hard-killed (OOM, segfault, timeout, non-zero exit) → same

The scoring loop continues across the remaining MLPs and produces a finite adjusted_final_layer_score. Per-MLP diagnostic fields (error, error_code, traceback, budget_exhausted, time_exhausted, residual_wall_time_exhausted, combined_budget_exhausted) are preserved so failures remain debuggable.

The "no compute discount on failure" rule (multiplier forced to 1.0) ensures that a failed run is strictly worse than a trivial-zero submission that succeeds (which receives the 0.1 multiplier floor, i.e. the maximum discount, a factor of ten).

Memory limit

ContestSpec.memory_limit_mb (default 65_536, i.e. 64 GB, matching the Phase 1 grader allocation) bounds the address space available to your estimator. Enforcement depends on the runner:
- --runner subprocess (used by the grader): the worker calls resource.setrlimit(RLIMIT_AS, ...) before importing your estimator module. Any allocation that would exceed the cap raises MemoryError inside predict(), which routes through the failure path described above (zero-prediction MSE × 1.0).
- --runner local: the limit is advisory only. WhestBench cannot safely call setrlimit on the CLI process itself. The runner emits a single warning at start ("memory_limit_mb=… is advisory in --runner local: enforcement requires --runner subprocess (uses RLIMIT_AS) or external sandboxing (cgroups).") and continues without enforcement. Use --runner subprocess if you want the limit actually enforced locally.
- Platforms without RLIMIT_AS (Windows, some BSDs) log a warning to the worker's stderr and continue without enforcement. The grader's evaluation environment is Linux, where enforcement is reliable.

Wall-clock cap

ContestSpec.wall_time_limit_s (default 60.0 seconds, matching the Phase 1 grader cap) is an operational backstop on per-MLP predict() execution. If a single predict() call's elapsed wall-clock time exceeds the cap, the estimator's prediction is replaced with zeros and the MLP is scored through the failure path (zero-prediction MSE × 1.0, no compute discount).

This is intentionally generous: the primary compute constraint is the effective FLOP budget C_m = F_m + λ·R_m; the wall-clock cap only catches stalled or runaway submissions.

The CLI flag --wall-time-limit SECONDS accepts a positive float. To disable the cap programmatically, construct your own ContestSpec with wall_time_limit_s=None.

Reproducibility under the grader seed

Predict-time reproducibility

If your estimator uses randomness (Monte Carlo sampling, randomized hashing, random projections, etc.), seed it from mlp.seed. The grader supplies a fixed per-MLP seed that is identical across all submissions for a given MLP, derived deterministically from the suite seed. Submissions that use unseeded randomness or their own seeds are NOT guaranteed to reproduce under regrade and may lose prize eligibility.

Example:

    import flopscope.numpy as fnp

    def predict(self, mlp, budget):
        rng = fnp.random.default_rng(mlp.seed)
        # ... use rng for any internal randomness

If your estimator is deterministic (no internal randomness), you can ignore mlp.seed.

Setup-time reproducibility

If your estimator does randomized one-time setup (e.g., sampling a random projection basis, jittering initial weights, choosing random hyperparameters), seed it from ctx.seed inside setup().
When the grader passes --seed, the same value is forwarded to ctx.seed for every MLP in the run; participants running locally can pass --seed themselves to reproduce a given setup.

    import flopscope.numpy as fnp

    def setup(self, ctx: SetupContext) -> None:
        self.setup_rng = fnp.random.default_rng(ctx.seed)
        # ... use self.setup_rng for any one-time random work

Do not call fnp.random.seed(ctx.seed) (or np.random.seed(ctx.seed)): that mutates the process-global RNG and breaks composability with other libraries. Use fnp.random.default_rng(ctx.seed) to get an isolated Generator.

ctx.seed defaults to 0 when no --seed was passed; estimators that don't read it are unaffected.

The seed is recorded in the run output under run_config.seed for audit-trail purposes: a reviewer can read it from a participant's JSON output and re-run with --seed N to reproduce the participant's setup state. See score-report-fields for the run_config.seed field.

ctx.seed and mlp.seed are independent: mlp.seed controls per-MLP randomness inside predict(), ctx.seed controls one-time setup. With --dataset, the dataset supplies mlp.seed values (baked at the dataset's own seed) while --seed controls ctx.seed only. See cli-reference for the --seed flag semantics.

Next step
- Write an Estimator (in the starter kit)
- Common Participant Errors (in the starter kit)
--- reference/score-report-fields ---
URL: https://aicrowd.github.io/whestbench/docs/reference/score-report-fields

Reference
Score Report Fields

Reference for interpreting whest run output fields, including per-MLP diagnostics, time decomposition, and the budget-adjusted scoring formula.

When to use this page

Use this page to interpret whest run output fields.

Top-level fields

Typical report sections include:
- schema_version
- mode
- run_meta
- run_config
- run_config.seed (always present; null when --seed is not provided)
- run_config.dataset (present when --dataset is used)
- results

Run configuration fields

run_config records the parameters that governed the run:

| Field | Description |
|---|---|
| seed | The --seed value passed at the CLI, or null when --seed is omitted. When set, it determines both MLP generation (without --dataset) and SetupContext.seed for the participant's setup() call. When null, ctx.seed defaults to 0. See estimator-contract for the reproducibility contract and cli-reference for --seed flag semantics. |
| dataset | Present when --dataset is used. See Dataset traceability fields below. |

Host metadata

run_meta.host is always an object. If you set WHEST_SKIP_HARDWARE_FALLBACK_PROBES=1, WhestBench still records cheap host fields and any values available through psutil, but fallback-backed fields such as cpu_count_physical and ram_total_bytes may be null.

Core result fields

Inside results:

| Field | Description |
|---|---|
| adjusted_final_layer_score | Budget-adjusted leaderboard metric: suite mean of per-MLP adjusted_final_layer_score = final_layer_mse × max(0.1, C_m/B_m); failure → × 1.0. Lower is better. |
| all_layers_mse | Raw all-layers MSE averaged across MLPs (no budget multiplier). Diagnostic; reveals where approximation error accumulates. |
| final_layer_mse | Raw final-layer MSE averaged across MLPs (no multiplier). |
| per_layer_mse | Per-layer MSE averaged across MLPs. list[float] of length depth. The last element equals final_layer_mse and the list mean equals all_layers_mse. Diagnostic only, no budget multiplier. |
| best_mlp_adjusted_final_layer_score | Minimum per-MLP adjusted_final_layer_score. |
| worst_mlp_adjusted_final_layer_score | Maximum per-MLP adjusted_final_layer_score. |
| mean_score_multiplier | Mean of per-MLP max(0.1, C_m/B_m) (1.0 on failure). Bounded [0.1, 1.0]. |
| mean_compute_utilization | Mean of per-MLP C_m/B_m, unclamped; can exceed 1.0 when an MLP exceeded the cap. |
| n_failed_mlps | Count of MLPs with any failure flag or error_code set. |
| mean_effective_compute | Mean of per-MLP effective_compute. |
| failure_breakdown | Dict with independent counts per failure flag: budget_exhausted, time_exhausted, residual_wall_time_exhausted, combined_budget_exhausted, error. Sums can exceed n_failed_mlps because one MLP can carry multiple flags. |
| breakdowns | Aggregate FLOP/time breakdowns keyed by section name. Includes sampling and estimator. |
| per_mlp | Array of per-MLP detail records (see below) |

Per-MLP fields

Each entry in per_mlp:

| Field | Type | Description |
|---|---|---|
| mlp_index | int | Index of the MLP in the evaluation set |
| mlp_name | str | Human-readable slug for this MLP (e.g. "danielle-johnson"). Same value as mlps[i].name on the corresponding MLP; derived deterministically from mlp_index's per-MLP seed. Use it as a stable label in your own logs and dashboards. |
| flops_used | int | Total FLOPs used by your estimator for this MLP |
| effective_compute | float | C_m = F_m + λ·R_m. Combined FLOP-equivalent compute used by the estimator. |
| adjusted_final_layer_score | float | s_m. The per-MLP budget-adjusted score that flows into the suite mean. |
| combined_budget_exhausted | bool | Whether the post-hoc check C_m > B_m fired (predictions zeroed if true). |
| budget_exhausted | bool | Whether the estimator exceeded the FLOP budget (predictions zeroed if true) |
| time_exhausted | bool | Whether the estimator exceeded the wall-clock limit for this MLP (predictions zeroed if true) |
| residual_wall_time_exhausted | bool | Whether WhestBench judged non-flopscope time to exceed residual_wall_time_limit_s (predictions zeroed if true) |
| wall_time_s | float | Total elapsed wall-clock time measured for this MLP's estimator context |
| flopscope_backend_time_s | float | Wall time inside counted flopscope numpy kernels, i.e. the participant's actual numpy compute |
| flopscope_overhead_time_s | float | Wall time inside flopscope's own dispatch code (wrapper preambles, FLOP bookkeeping, namespace push/pop). Framework cost, not participant cost. |
| residual_wall_time_s | float | Wall time inside the predict context that is neither flopscope backend execution nor flopscope dispatch, i.e. participant Python (loops, control flow), GC, uninstrumented numpy |
| final_layer_mse | float | MSE of your final-layer predictions vs ground truth |
| all_layers_mse | float | MSE of your all-layer predictions vs ground truth |
| per_layer_mse | list[float] | Per-layer MSE for this MLP. Length equals depth. per_layer_mse[-1] == final_layer_mse and mean(per_layer_mse) == all_layers_mse within float precision. |
| breakdowns | dict or null | Per-MLP breakdown container. Currently includes estimator-only data under estimator. Sampling is aggregate-only. |
| traceback | str or null | Non-null when this MLP's run did not produce real predictions; captures the Python traceback for either an estimator exception or a budget/time exhaustion. null on clean runs. For subprocess/server runners, the traceback is forwarded from the worker. |
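A minimal sketch of how the per-MLP fields above combine into the budget-adjusted score, assuming the formula from the Core result fields table and the conversion rate λ = 1e11 FLOPs/second stated in the Budget-adjusted scoring section of this page. The function names are illustrative and not part of whestbench, and the failure flag is passed in explicitly rather than derived from the individual exhaustion flags.

```python
LAMBDA_FLOPS_PER_SECOND = 1e11  # conversion rate quoted on this page
MULTIPLIER_FLOOR = 0.1          # discount floor (factor-of-ten cap)


def effective_compute(flops_used: int, residual_wall_time_s: float) -> float:
    """C_m = F_m + lambda * R_m, in FLOPs and FLOP-equivalents."""
    return flops_used + LAMBDA_FLOPS_PER_SECOND * residual_wall_time_s


def adjusted_score(final_layer_mse: float, flops_used: int,
                   residual_wall_time_s: float, flop_budget: int,
                   failed: bool = False) -> float:
    """Per-MLP score s_m; failures get no compute discount (multiplier 1.0)."""
    c_m = effective_compute(flops_used, residual_wall_time_s)
    if failed or c_m > flop_budget:
        # On failure the grader zeroes the predictions, so final_layer_mse is
        # assumed here to already be the zero-prediction MSE.
        return final_layer_mse * 1.0
    return final_layer_mse * max(MULTIPLIER_FLOOR, c_m / flop_budget)
```

For example, 4e8 counted FLOPs plus 1 ms of residual time against a 1e9 budget gives C_m = 5e8, so the multiplier is 0.5; anything at or below 10% of the budget receives the 0.1 floor.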
When the estimator raised an unhandled exception (not budget/time exhaustion), the entry also includes:

| Field | Type | Description |
|---|---|---|
| error | str or dict | Legacy string message, or structured object: {"message": str, "details": object} |
| error_code | str | Stable identifier: PREDICT_ERROR for a RunnerError, or the Python exception class name otherwise |

For structured error objects, error.details includes:
- expected_shape: List[int] with expected (depth, width).
- got_shape: List[int] observed from estimator output.
- cause_hints: List[str] with user-facing hints.
- hint: short summary hint.

Time decomposition

Every predict() call satisfies a strict three-bucket identity:

    wall_time_s = flopscope_backend_time_s + flopscope_overhead_time_s + residual_wall_time_s

- flopscope_backend_time_s: numpy kernels actually crunching numbers via flopscope.numpy.*.
- flopscope_overhead_time_s: flopscope's own dispatch (wrapper preambles, FLOP bookkeeping, namespace push/pop).
- residual_wall_time_s: everything else inside the wall window: participant Python, GC, uninstrumented numpy.

The decomposition holds at every level: per-MLP, aggregated across MLPs, and per namespace inside breakdowns.
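The three-bucket identity above can be checked on any per-MLP record from a report. The sketch below assumes the record is a plain dict carrying the four timing fields documented on this page; the helper names and the tolerance values are choices made for this example, not part of whestbench.

```python
import math

BUCKETS = (
    "flopscope_backend_time_s",
    "flopscope_overhead_time_s",
    "residual_wall_time_s",
)


def check_time_decomposition(record: dict, rel_tol: float = 1e-6) -> bool:
    """Return True when the three time buckets sum to wall_time_s."""
    total = sum(record[name] for name in BUCKETS)
    return math.isclose(record["wall_time_s"], total, rel_tol=rel_tol, abs_tol=1e-9)


def dominant_bucket(record: dict) -> str:
    """Name the bucket that accounts for the largest share of wall time."""
    return max(BUCKETS, key=lambda name: record[name])
```

A record whose dominant bucket is residual_wall_time_s points at participant Python rather than numpy compute as the main cost.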
Breakdown containers

When namespace-aware flopscope data is available, WhestBench adds breakdown containers in these places:
- results.breakdowns.estimator: aggregated estimator breakdown across all evaluated MLPs
- results.breakdowns.sampling: aggregated sampling breakdown across all evaluated MLPs
- results.per_mlp[].breakdowns.estimator: one normalized estimator breakdown per MLP

Namespace normalization rules:
- sampling work is namespaced under sampling.*
- unlabeled estimator work becomes estimator.estimator-client
- explicit estimator namespace phase becomes estimator.phase
- nested estimator namespace phase.subphase becomes estimator.phase.subphase

Each breakdown summary also includes timing totals:
- flopscope_backend_time_s: accumulated time inside counted flopscope operations
- flopscope_overhead_time_s: accumulated time inside flopscope's own dispatch
- residual_wall_time_s: everything else (participant Python, GC, uninstrumented numpy)

For results.breakdowns.*, those values are aggregated across all evaluated MLPs.

Budget-adjusted scoring

The leaderboard ranks submissions by adjusted_final_layer_score, the suite mean of the budget-adjusted per-MLP score:

    adjusted_final_layer_score = final_layer_mse × max(0.1, C_m / B_m)   for valid runs
    adjusted_final_layer_score = final_layer_mse × 1.0                   for failures (no compute discount)
    C_m = F_m + λ · R_m    (effective compute, FLOPs and FLOP-equivalents)
    λ = 1e11 FLOPs/second  (conversion rate; see flopscope-primer)

Where F_m is the analytical FLOPs counted by flopscope (flops_used), R_m is the residual wall-time bucket (residual_wall_time_s, which is neither flopscope-backend nor flopscope-overhead), and B_m is flop_budget. The max(0.1, ...) floor caps the discount at 10× so an arbitrarily cheap-but-wrong submission cannot dominate the ranking.

Why "score" not "MSE"?
Once final_layer_mse is multiplied by the budget factor max(0.1, C_m/B_m), the result is no longer a mean-squared-error between predictions and targets; it is a derived ranking score (denoted s_m). The _score suffix in adjusted_final_layer_score reflects this; the raw diagnostics final_layer_mse and all_layers_mse keep the _mse suffix because they remain genuine MSEs.

Interpretation guide
- final_layer_mse is your most actionable diagnostic: it directly drives adjusted_final_layer_score.
- budget_exhausted is the first thing to check if your score is unexpectedly high: an exceeded budget means your predictions were zeroed.
- time_exhausted means the estimator crossed the wall-clock limit configured through wall_time_limit_s / --wall-time-limit.
- residual_wall_time_exhausted means the non-flopscope portion of execution crossed WhestBench's residual_wall_time_limit_s / --residual-wall-time-limit.
- flops_used vs flop_budget shows how much headroom you have. If you are consistently near the cap, consider lighter methods.
- High flopscope_backend_time_s relative to wall: numpy compute is the dominant cost. Healthy for a numpy-heavy estimator.
- High flopscope_overhead_time_s relative to wall: many small ops are paying the per-call dispatch tax. Consider batching with larger numpy primitives.
- High residual_wall_time_s relative to wall: participant Python is the bottleneck (tight loops, per-element attribute access, calls into uninstrumented libraries). This is the bucket future versions of WhestBench will penalise.
- adjusted_final_layer_score is the budget-adjusted leaderboard metric and is always ≤ the raw final_layer_mse mean (the multiplier is at most 1.0: it equals 1.0 at full budget use or on failures and drops to 0.1 at the discount floor, a factor-of-ten cap). A value close to raw final_layer_mse means you used near-full budget; a value close to one-tenth of raw final_layer_mse means you used ≤10% of the budget and got the maximum discount.
- all_layers_mse is a diagnostic aggregate with no budget multiplier. Use it to understand where approximation error accumulates across all layers, not just the final layer.
- per_layer_mse decomposes all_layers_mse layer-by-layer (length = depth). Useful for spotting which layers your estimator struggles on, e.g. early layers vs. final layer. By construction per_layer_mse[-1] == final_layer_mse and mean(per_layer_mse) == all_layers_mse (within float precision).

Dataset traceability fields

When using whest run --dataset, the report includes run_config.dataset:

| Field | Description |
|---|---|
| path | Absolute path to the dataset file |
| sha256 | SHA-256 hash of the file for integrity |
| seed | RNG seed used to generate the dataset |
| n_mlps | Number of MLPs in the dataset |
| seed_protocol | Object with name and version. WhestBench currently requires version = "2.0". |

Dataset format compatibility

The .npz files produced by whest create-dataset carry a seed_protocol.version in their embedded metadata. WhestBench refuses to load datasets at any other version: loading a v1.0 dataset raises ValueError("Incompatible dataset seed_protocol version: file has '1.0', this whestbench requires '2.0'. Re-bake the dataset with `whest create-dataset`.").

The v2.0 format adds a per-MLP seed (stored as the mlp_seeds array in the .npz) that is exposed to estimators via mlp.seed; see estimator-contract for how to consume it. Auto-migration is intentionally not implemented because the v1.0 spawn protocol (2 streams per MLP) cannot produce a deterministic third stream; re-baking from the original spec seed is the only correct path.

Schema 2.4 added the per-MLP name slug (stored as the mlp_names array in the .npz). It is a pure function of mlp_seeds at the WhestBench release's pinned faker version, so loading a 2.3 file under 2.4 code transparently synthesizes the same names a fresh 2.4 bake would produce; no re-bake is required. See estimator-contract for the mlp.name field exposed to estimators.
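The version gate described in this section can be pictured with a short sketch. It assumes the metadata is available as a dict with a seed_protocol object, and it reproduces the error text quoted above; check_seed_protocol is a hypothetical helper, not the loader's actual code. The required version is a parameter because it depends on the installed release.

```python
REQUIRED_SEED_PROTOCOL_VERSION = "2.0"  # version quoted in this section


def check_seed_protocol(metadata: dict,
                        required: str = REQUIRED_SEED_PROTOCOL_VERSION) -> None:
    """Raise ValueError when the file's seed_protocol version differs from `required`."""
    found = metadata.get("seed_protocol", {}).get("version")
    if found != required:
        # No auto-migration: the text above explains why re-baking is the only path.
        raise ValueError(
            f"Incompatible dataset seed_protocol version: file has {found!r}, "
            f"this whestbench requires {required!r}. "
            "Re-bake the dataset with `whest create-dataset`."
        )
```

Failing fast at load time keeps a dataset with a different seed layout from silently producing mismatched mlp.seed values.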
Next step
- CLI Reference
- Scoring Model (in the starter kit)

--- reference/dataset-format ---
URL: https://aicrowd.github.io/whestbench/docs/reference/dataset-format

Reference
WhestBench dataset format (schema 3.0)

WhestBench schema 3.0 stores evaluation datasets as a directory of Parquet files plus two JSON/Markdown sidecars. This layout is native to the datasets library (datasets.load_dataset(...) works directly on the directory), works with HuggingFace Hub as a first-class dataset repository, and supports parallel distributed baking with bit-exact merging.

The earlier .npz format (schemas 2.x) is no longer produced or loaded. Re-bake with whest dataset bake to migrate.

On-disk layout

    DATASET_DIR/
    ├── data/
    │   └── SPLIT-NNNNN-of-MMMMM.parquet   # one row per MLP
    ├── metadata.json                       # whestbench provenance sidecar
    └── README.md                           # HuggingFace dataset card

- SPLIT is the split name. Controlled by whest dataset bake --split. Dataset authors can separately declare the HF config with --config; the default is default.
- NNNNN-of-MMMMM is the standard HF shard numbering; single-host bakes produce 00000-of-00001.
- metadata.json is a flat JSON object with provenance, reproducibility, and hardware fields (see below).
- README.md is a rendered Jinja2 template with a YAML front-matter block that HuggingFace Hub uses to display the dataset card.

Parquet schema (one row per MLP)

Eight columns per row. The depth and width dimensions are fixed for a given dataset and captured in metadata.json.

This table mirrors the schema section in the published dataset card.
They are maintained in lockstep: any update here must also land in src/whestbench/templates/dataset_card.md.j2.

| Column | Type / shape | What this is |
|---|---|---|
| mlp_id | int32 | 0-based index of this MLP within the dataset (the absolute index across all parallel-bake slices). |
| mlp_name | string | Stable, deterministic human-readable slug like "danielle-johnson", derived from mlp_seed. Useful for log lines; carries no information beyond mlp_seed. |
| mlp_seed | int64 | Per-MLP seed. Under seed_protocol 3.0 (new bakes), this is the input seed, i.e. the canonical value stored in the parquet. mlp.seed (participant-facing) is derived locally from this value via SeedSequence(mlp_seed).spawn(3)[2]. Under legacy seed_protocol 2.0, this column stored the already-derived estimator seed. |
| weights | float32[depth, width, width] | The MLP's layer weight matrices. The network has no biases and uses ReLU activations. Layer l computes h_l(x) = max(0, W_l @ h_{l-1}(x)). Weights are drawn i.i.d. from N(0, 2/width) (He initialization) at bake time. |
| all_layer_means | float32[depth, width] | Ground truth. Entry [l, j] is the empirical mean of neuron j's post-ReLU output at layer l, averaged over many independent Gaussian inputs: E_{x ~ N(0, I)}[ h_l(x)_j ] ≈ (1/N) Σ_i h_l(x_i)_j, where N = n_samples. Computed by direct Monte Carlo. This is what an estimator predicts. |
| final_means | float32[width] | The last row of all_layer_means, i.e. E[h_{depth}(x)_j] for each output neuron j. Materialised as its own column because the primary scoring metric (final_layer_mse) only looks at this row. |
| avg_variance | float64 | The mean across the final-layer neurons of the per-neuron output variance: (1/width) Σ_j Var[h_{depth}(x)_j]. A single scalar per MLP. Used as a normaliser in budget-adjusted scoring so that networks with naturally low output variance don't dominate the MSE rankings. |
| sampling_budget_breakdown | string (JSON) | FLOP accounting for the bake that produced the ground truth for this row; useful as provenance. Not related to the estimator's FLOP budget at evaluation time. Decode with json.loads(...). |

Notes on individual columns
- mlp_id: matches the MLP's position in the logical dataset. Partial bakes (from --slice/--mlp-range) have mlp_id values starting from their slice offset; after whest dataset merge, mlp_id is monotonically increasing from 0.
- mlp_name: the name is derived deterministically from mlp_seed using the faker library at a pinned version. The same --seed and --n-mlps always produce the same name list, on any hardware. Bumping the faker version pin requires a deliberate re-bake.
- weights: stored as float32. The weight matrices for each layer are weights[i] of shape (width, width). The forward pass uses no biases and ReLU between layers; inputs are standard Gaussian, sampled fresh per Monte-Carlo draw when ground truth is computed.
- sampling_budget_breakdown: a JSON string with the per-namespace FLOP counts and wall time consumed by the ground-truth Monte Carlo, accounted via flopscope. Parse with json.loads(row["sampling_budget_breakdown"]). This is provenance metadata about the bake itself, not the estimator's FLOP budget at evaluation time (which is set at runtime via whest run --flop-budget N).

metadata.json schema

metadata.json is a flat JSON object with the following fields.

Base fields (all bakes)

| Field | Type | Description |
|---|---|---|
| schema_version | string | Always "3.0" for this format |
| format | string | Always "hf-datasets-parquet" |
| backend | string | "flopscope" (CPU path) or "torch" (GPU path) |
| seed_protocol.name | string | "whestbench_explicit_per_mlp_seeds" (3.0, new bakes) or "whestbench_seedsequence_hierarchy" (2.0, legacy). |
| seed_protocol.version | string | "3.0" (new bakes) or "2.0" (legacy). |
| seed | integer or null | Present under seed_protocol 2.0 only. Root seed passed to --seed. null if auto-generated. Absent in 3.0 datasets. |
| split | string | Split name for a single-split bake. New bakes populate this; legacy metadata may omit it. |
| config | string | HF dataset config for a single-split bake. Defaults to "default"; legacy metadata may omit it. |
| n_mlps | integer | Number of MLPs in this dataset (or partial) |
| n_samples | integer | Ground-truth samples per MLP |
| width | integer | Neuron count per layer |
| depth | integer | Number of weight matrices |
| created_at_utc | string | ISO-8601 UTC timestamp of bake completion |
| hardware | object | Hardware fingerprint from the baking host |

Provenance fields

These pin the exact code + runtime state that produced a dataset, so a reader can reproduce a bake without guessing which whestbench/flopscope/torch versions or determinism flags were in effect. See Parallel bake → Bit-equivalence requirements for the operational consequences.

| Field | Type | Description |
|---|---|---|
| whestbench_version | string | Installed whestbench package version (e.g. "0.3.0"). "unknown" if importlib.metadata couldn't resolve it. |
| flopscope_version | string | Installed flopscope package version. Weight init uses flopscope.numpy so this matters for bit-exact weights. |

validate_metadata treats these as informational and does not require them (absence doesn't fail validation), but whest dataset bake always populates them.

Torch-specific fields (when backend == "torch")

| Field | Type | Description |
|---|---|---|
| device | string | "cuda", "mps", or "cpu" |
| torch_version | string | PyTorch version string, e.g. "2.3.0" |
| cuda_device_name | string | GPU name (CUDA only), e.g. "NVIDIA L40S" |
| cuda_device_capability | [int, int] | CUDA compute capability (CUDA only), e.g. [8, 9] |
| cuda_driver_version | string | NVIDIA driver version (CUDA only, best-effort via nvidia-smi). Absent if nvidia-smi is unavailable. |
| mps_device_name | string | Processor name (MPS only) |
| mlps_per_batch | integer | Number of MLPs the bake processed per device-side batch. |
| chunk_size | integer | Number of MC samples per device-side chunk. Pinning this to a fixed value across workers and reference re-bakes is required for cross-host bit-exact verification (see parallel-bake). |
| bake_config | object | Determinism flag state at bake time. See below. |
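To make the field tables above concrete, here is a hedged sketch of a sanity check over a loaded metadata.json dict. It is not the library's validate_metadata: which fields that function actually enforces is not specified here, so the checks below (fixed schema_version and format values, known backends, integer sizes, a known torch device) are this example's own choices, and metadata_problems is a hypothetical name.

```python
from typing import List

ALLOWED_BACKENDS = ("flopscope", "torch")
ALLOWED_DEVICES = ("cuda", "mps", "cpu")


def metadata_problems(md: dict) -> List[str]:
    """Return human-readable problems found in a metadata dict (empty list if none)."""
    problems = []
    if md.get("schema_version") != "3.0":
        problems.append("schema_version should be '3.0'")
    if md.get("format") != "hf-datasets-parquet":
        problems.append("format should be 'hf-datasets-parquet'")
    backend = md.get("backend")
    if backend not in ALLOWED_BACKENDS:
        problems.append(f"unexpected backend: {backend!r}")
    for field in ("n_mlps", "n_samples", "width", "depth"):
        if not isinstance(md.get(field), int):
            problems.append(f"{field} should be an integer")
    # Torch-specific fields only apply on the GPU path.
    if backend == "torch" and md.get("device") not in ALLOWED_DEVICES:
        problems.append(f"unexpected torch device: {md.get('device')!r}")
    return problems
```

Returning a list instead of raising on the first problem lets a caller report every issue in one pass.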
bake_config object (torch path only)

Captures the state of torch's determinism levers + the cuBLAS workspace env var at bake time. Two bakes that should produce bit-identical numeric columns must have matching bake_config values (and matching chunk_size).

| Field | Type | Description |
|---|---|---|
| cudnn_deterministic | boolean | Value of torch.backends.cudnn.deterministic at bake time. |
| cudnn_benchmark | boolean | Value of torch.backends.cudnn.benchmark at bake time. |
| cublas_workspace_config | string or null | Value of the CUBLAS_WORKSPACE_CONFIG env var at bake time, or null if unset. Recommended value for deterministic cuBLAS: ":4096:8". |
| torch_use_deterministic_algorithms | boolean | Value of torch.are_deterministic_algorithms_enabled() at bake time. |

Partial-bake fields (when --slice or --mlp-range was used)

| Field | Type | Description |
|---|---|---|
| is_partial | boolean | Always true for partial bakes |
| mlp_range | [int, int] | [start, end) range of MLPs in this partial |
| total_n_mlps | integer | Logical total MLP count across all partials |

A dataset with is_partial=true is refused by whestbench.load_dataset; run whest dataset merge first to assemble a complete dataset.

Merged dataset fields (produced by whest dataset merge)

| Field | Type | Description |
|---|---|---|
| merged_at_utc | string | ISO-8601 UTC timestamp of the merge |
| hardware_fingerprints | array | List of per-partial hardware objects, each including mlp_range |

is_partial, mlp_range, and total_n_mlps are removed by the merge step. n_mlps is set to the total count.
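The relationship between the partial-bake and merged-dataset fields above can be sketched on plain dicts. This is an illustration of the documented field changes only, not the whest dataset merge implementation: merge_partial_metadata is a hypothetical helper, the contiguity check is this sketch's own, and moving each partial's hardware object into hardware_fingerprints (dropping the top-level hardware key) is an assumption.

```python
from datetime import datetime, timezone
from typing import List

PARTIAL_ONLY_KEYS = ("is_partial", "mlp_range", "total_n_mlps")


def merge_partial_metadata(partials: List[dict]) -> dict:
    """Combine per-partial metadata dicts into one merged metadata dict."""
    if not partials:
        raise ValueError("no partials given")
    ordered = sorted(partials, key=lambda md: md["mlp_range"][0])
    total = ordered[0]["total_n_mlps"]
    expected_start = 0
    for md in ordered:
        start, end = md["mlp_range"]  # half-open [start, end)
        if not md.get("is_partial") or md["total_n_mlps"] != total or start != expected_start:
            raise ValueError(f"partial {md['mlp_range']} does not fit the expected layout")
        expected_start = end
    if expected_start != total:
        raise ValueError("partials do not cover every MLP")

    # Assumption: per-host 'hardware' moves into hardware_fingerprints.
    merged = {k: v for k, v in ordered[0].items()
              if k not in PARTIAL_ONLY_KEYS and k != "hardware"}
    merged["n_mlps"] = total
    merged["merged_at_utc"] = datetime.now(timezone.utc).isoformat()
    merged["hardware_fingerprints"] = [
        {**md.get("hardware", {}), "mlp_range": list(md["mlp_range"])} for md in ordered
    ]
    return merged
```

Sorting by range start before checking contiguity means the partials can be passed in any order.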
Example metadata.json (CPU bake, seed_protocol 3.0) { "schema_version": "3.0", "format": "hf-datasets-parquet", "backend": "flopscope", "seed_protocol": { "name": "whestbench_explicit_per_mlp_seeds", "version": "3.0" }, "n_mlps": 10, "n_samples": 10000000, "width": 256, "depth": 8, "created_at_utc": "2026-05-25T12:00:00+00:00", "hardware": { "cpu_brand": "Intel Xeon Platinum 8480+", "cpu_count": 64, "ram_gb": 512.0 }, "whestbench_version": "0.3.0", "flopscope_version": "0.3.0" } Example metadata.json (torch CUDA bake, seed_protocol 3.0) { "schema_version": "3.0", "format": "hf-datasets-parquet", "backend": "torch", "seed_protocol": { "name": "whestbench_explicit_per_mlp_seeds", "version": "3.0" }, "n_mlps": 50, "n_samples": 1000000000, "width": 256, "depth": 8, "created_at_utc": "2026-05-26T03:45:00+00:00", "hardware": { "...": "..." }, "whestbench_version": "0.3.0", "flopscope_version": "0.3.0", "torch_version": "2.3.0+cu121", "device": "cuda", "cuda_device_name": "NVIDIA L40S", "cuda_device_capability": [8, 9], "cuda_driver_version": "535.183.01", "mlps_per_batch": 16, "chunk_size": 524288, "bake_config": { "cudnn_deterministic": true, "cudnn_benchmark": false, "cublas_workspace_config": ":4096:8", "torch_use_deterministic_algorithms": false } } Under seed_protocol 3.0 there is no top-level seed field. Each MLP's input seed is stored in the parquet mlp_seed column. Example metadata.json (legacy seed_protocol 2.0) { "schema_version": "3.0", "format": "hf-datasets-parquet", "backend": "flopscope", "seed_protocol": { "name": "whestbench_seedsequence_hierarchy", "version": "2.0" }, "seed": 42, "n_mlps": 10, "n_samples": 10000000, "width": 256, "depth": 8, "created_at_utc": "2026-05-25T12:00:00+00:00", "hardware": { "cpu_brand": "Intel Xeon Platinum 8480+", "cpu_count": 64, "ram_gb": 512.0 } } Legacy datasets (e.g. aicrowd/arc-whestbench-2026-smoke-test) use seed_protocol 2.0 and continue to load correctly. New bakes always write seed_protocol 3.0. 
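Because the location of the seed differs between the two protocols shown above, code that reads metadata.json directly has to branch on seed_protocol.version. The sketch below illustrates this with the standard library only; load_metadata and describe_seed_source are hypothetical helper names, not whestbench functions.

```python
import json
from pathlib import Path


def load_metadata(dataset_dir: str) -> dict:
    """Read the metadata.json sidecar from a local dataset directory."""
    return json.loads((Path(dataset_dir) / "metadata.json").read_text(encoding="utf-8"))


def describe_seed_source(metadata: dict) -> str:
    """Say where seeds live for this dataset, based on seed_protocol.version."""
    version = metadata["seed_protocol"]["version"]
    if version == "3.0":
        # No top-level "seed" field exists under 3.0.
        return "per-MLP input seeds in the parquet mlp_seed column"
    if version == "2.0":
        # Legacy datasets carry a single root seed in the sidecar.
        return f"root seed {metadata['seed']} in metadata.json (legacy)"
    raise ValueError(f"unrecognized seed_protocol version: {version!r}")
```

Reading md["seed"] unconditionally would raise KeyError on any 3.0 dataset, which is why the version check comes first.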
README.md (HF dataset card) README.md is rendered from a Jinja2 template at bake time. It contains: A YAML front-matter block with license, tags, task_categories, and HF dataset card metadata required for correct Hub display. A quick-start code snippet. A dataset summary table (split, MLPs, width, depth, samples, schema version, seed protocol). The full Parquet column schema. Reproducibility information including the exact whest dataset bake command to re-bake. Hardware provenance (for merged datasets, lists each host's GPU and mlp_range). When whest dataset push uploads a local directory, it re-renders README.md with the actual repo_id and revision (tag) so the published card has real values rather than placeholders. Loading Bare datasets.load_dataset Use this when you only need the raw data and don't need schema validation or the metadata sidecar: from datasets import load_dataset # Local directory ds = load_dataset("./my-eval", split="public") # HF Hub ds = load_dataset( "aicrowd/arc-whestbench-2026", revision="v1", split="public", ) print(ds) # Dataset({features: [...], num_rows: 10}) print(ds[0]["mlp_name"]) # "danielle-johnson" whestbench.load_dataset wrapper Use this for the recommended workflow. 
It validates metadata.json, refuses partial datasets (suggesting the merge step), and attaches metadata to the returned Dataset object for later retrieval via whestbench.metadata(ds):

    import whestbench

    # Local
    ds = whestbench.load_dataset("./my-eval")

    # HF Hub (pin a revision; a bare repo without a revision is rejected by whest run)
    ds = whestbench.load_dataset(
        "aicrowd/arc-whestbench-2026",
        revision="v1",
        split="public",
    )

    # Access metadata sidecar (seed_protocol 3.0 has no top-level "seed" field)
    md = whestbench.metadata(ds)
    print(md["seed_protocol"]["version"], md["n_mlps"], md["backend"])

    # Iterate as MLP instances
    for mlp in whestbench.iter_mlps(ds):
        print(mlp.name, mlp.weights[0].shape)

    # Random access
    mlp_0 = whestbench.mlp_at(ds, 0)

iter_mlps / mlp_at

Both functions return whestbench.MLP objects constructed via MLP.from_row(row). The MLP object exposes the same interface as MLPs produced on-the-fly by whestbench.sample_mlp: mlp.weights, mlp.width, mlp.depth, mlp.name, mlp.seed.

Schema version policy

| Version | Format | Notes |
|---|---|---|
| 3.0 | Parquet + sidecar directory | Current. Required by this release. |
| 2.4 | .npz with mlp_names field | Legacy. Rejected by load_dataset with a re-bake hint. |
| 2.3 | .npz | Legacy. |
| 2.2 | .npz | Legacy. |

schema_version tracks the storage format (2.x = npz, 3.0 = Parquet). seed_protocol.version tracks the RNG algorithm that produces per-MLP seeds. The two version numbers are independent: the seed protocol can be bumped without changing the storage format, and vice versa.

Seed protocols

whestbench_seedsequence_hierarchy version 2.0 (legacy, read-only)

The original seeding scheme. A single root seed (--seed N) is expanded via numpy.random.SeedSequence(root_seed) into n_mlps child sequences. Each child spawns three streams: weights, samples, and estimator. The parquet mlp_seed column stored the already-derived estimator seed (stream index 2), not the input seed. New bakes can no longer write seed_protocol 2.0; --seed N on the CLI is now rejected with a migration hint.
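The legacy hierarchy described above can be sketched with NumPy's SeedSequence. This is an illustration under assumptions rather than the whestbench implementation. It assumes the root is expanded with spawn(n_mlps), that each child calls spawn(3) in the order weights, samples, estimator, and that the stored integer is taken with generate_state(1)[0]. The function name is hypothetical.

```python
import numpy as np


def legacy_estimator_seeds(root_seed: int, n_mlps: int) -> list[int]:
    """Sketch of the 2.0 derivation: root seed -> per-MLP estimator seeds."""
    root = np.random.SeedSequence(root_seed)
    seeds = []
    for child in root.spawn(n_mlps):
        # Assumed stream order: index 0 = weights, 1 = samples, 2 = estimator.
        _weight_seq, _sample_seq, estimator_seq = child.spawn(3)
        # generate_state(1) yields one uint32 word; store it as a Python int.
        seeds.append(int(estimator_seq.generate_state(1)[0]))
    return seeds


if __name__ == "__main__":
    # The same root seed always yields the same derived seeds.
    print(legacy_estimator_seeds(42, 3))
```

Because every value is derived from the root, the stored mlp_seed alone cannot reproduce the weight or sample streams.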
whestbench_explicit_per_mlp_seeds version 3.0 (new, default) Each MLP receives an independent input seed (64-bit integer). Seeds are either auto-generated via secrets.randbits(63) or supplied explicitly via --mlp-seeds FILE (JSON array of N ints). The parquet mlp_seed column stores the input seed — the canonical, portable value. Within each MLP, the three RNG streams are still derived locally: SeedSequence(mlp_seed).spawn(3) → [weight_seq, sample_seq, estimator_seq]. mlp.seed (participant-facing) equals int(estimator_seq.generate_state(1)[0]), unchanged from 2.0 from the participant's perspective. Building a 3.0 dataset # Auto-generated seeds (recommended for production bakes): whest dataset bake --n-mlps 10 --n-samples 1e7 --width 256 --depth 8 \ --output ./my-eval # Explicit seeds (for reproducible small datasets or tests): echo '[1001,2002,3003,4004]' > my-seeds.json whest dataset bake --n-mlps 4 --n-samples 100 --width 4 --depth 2 \ --mlp-seeds my-seeds.json --output ./tiny-eval # Explicit HF config coordinate for authoring config-per-split repos: whest dataset bake --n-mlps 100 --n-samples 1e9 --width 256 --depth 8 \ --split full --config full --output ./full In Python: from whestbench.dataset import create_dataset # Auto-generated: create_dataset(n_mlps=10, n_samples=1_000_000, width=256, depth=8, output_path="./my-eval") # Explicit: create_dataset(n_mlps=4, n_samples=100, width=4, depth=2, mlp_seeds=[1001, 2002, 3003, 4004], output_path="./tiny-eval") # Explicit config coordinate: create_dataset(n_mlps=100, n_samples=1_000_000_000, width=256, depth=8, split="full", config="full", output_path="./full") Extracting seeds from a published dataset import whestbench ds = whestbench.load_dataset("aicrowd/arc-whestbench-2026", revision="v1", split="public") md = whestbench.metadata(ds) if md["seed_protocol"]["version"] == "3.0": seeds = ds["mlp_seed"] # list of input seeds print(seeds) --slice + seed_protocol 3.0 Under 3.0, all workers baking a given split must 
receive the same --mlp-seeds JSON file. Each worker uses --slice K/N to select its subset of rows; it draws the corresponding seeds from that shared file. Seeds for different splits must use different JSON files to preserve cross-split independence. HuggingFace git tags (e.g. v1, v2) are content versions for a specific published dataset. They are independent of the schema version — a dataset at tag v2 is still schema 3.0. Partial datasets and merging Baking partials --slice K/N divides a logical dataset of n_mlps into N equal slices and bakes slice K (0-indexed). The output metadata is marked is_partial=true and includes mlp_range=[start, end) and total_n_mlps. # Generate once, share the same file with all workers. whest dataset generate-seeds --n-mlps 1000 > seeds.json # 4 workers each bake 250 of 1000 MLPs whest dataset bake --slice 0/4 --n-mlps 1000 --mlp-seeds seeds.json ... --output ./p0 whest dataset bake --slice 1/4 --n-mlps 1000 --mlp-seeds seeds.json ... --output ./p1 whest dataset bake --slice 2/4 --n-mlps 1000 --mlp-seeds seeds.json ... --output ./p2 whest dataset bake --slice 3/4 --n-mlps 1000 --mlp-seeds seeds.json ... --output ./p3 --mlp-range START-END is the lower-level alternative. Both endpoints are inclusive on the CLI, but the Python API uses half-open [start, end) intervals internally. --slice 0/4 with n_mlps=1000 is equivalent to --mlp-range 0-249. Merging whest dataset merge validates all partials, checks for gap-free coverage of [0, total_n_mlps), concatenates the Parquet files in order, and writes a new complete dataset directory: whest dataset merge ./p0 ./p1 ./p2 ./p3 --output ./final Bit-equivalence property The bit-equivalence guarantee means a worker baking --slice K/N produces rows that are bitwise identical to the corresponding rows of a single-host bake with the same --mlp-seeds file and --n-mlps. This holds because: Under seed_protocol 3.0, each slot's input seed comes directly from the shared --mlp-seeds JSON file. 
A worker baking slot i reads seeds[i] from that file regardless of which slice it's assigned, so the derived weight/sample/estimator streams are identical to a single-host bake. MLP names are derived from the same per-MLP input seeds so that slice_names[K] equals full_names[K]. Note: bit-equivalence is per-backend. The flopscope (CPU) and torch backends use different RNG algorithms and produce statistically equivalent (not bitwise identical) results at the same seed. Multi-split datasets A dataset directory can contain multiple splits as sibling parquet files in data/, with a single metadata.json describing all of them via an optional splits: sub-dict. On-disk layout my-eval/ ├── data/ │ ├── public-00000-of-00001.parquet │ └── holdout-00000-of-00001.parquet ├── metadata.json └── README.md metadata.json shape { "schema_version": "3.0", "format": "hf-datasets-parquet", "backend": "torch", "seed_protocol": {"name": "whestbench_explicit_per_mlp_seeds", "version": "3.0"}, "n_samples": 1000000000, "width": 256, "depth": 8, "created_at_utc": "...", "hardware": {}, "splits": { "public": {"config": "default", "n_mlps": 50, "created_at_utc": "...", "hardware_fingerprints": []}, "holdout": {"config": "holdout", "n_mlps": 50, "created_at_utc": "...", "hardware_fingerprints": []} }, "default_split": "public" } Under seed_protocol 3.0 there is no per-split seed field; seeds are stored in the parquet mlp_seed column for each split. 
Field placement

| Field | Single-split | Multi-split |
|---|---|---|
| schema_version, format, seed_protocol | top-level | top-level |
| backend, width, depth, n_samples | top-level | top-level; must match across all splits (validated at combine time) |
| split, config | top-level optional coordinate for new bakes | per-split (splits.<split>.config) |
| n_mlps, seed | top-level | per-split (splits.<split>.{n_mlps,seed}); seed exists only under legacy seed_protocol 2.0 |
| created_at_utc | top-level | top-level (= earliest of splits) + optional per-split |
| hardware | top-level (bake host) | top-level (combine host) + per-split hardware_fingerprints for provenance |
| splits | absent | present |
| is_partial, mlp_range, total_n_mlps | present iff partial | not allowed (multi-split + partial is invalid) |

The discriminator is the presence of the splits field. No schema_version bump is needed: the multi-split shape is a purely additive extension of schema 3.0.

Loading

    from whestbench import load_dataset, metadata, iter_mlps

    dsd = load_dataset("./my-eval")                 # → DatasetDict
    ds = load_dataset("./my-eval", split="public")  # → Dataset

    print(metadata(dsd)["splits"].keys())           # full multi-split metadata
    print(metadata(dsd, split="public")["n_mlps"])  # single-split-shaped projection

    for mlp in iter_mlps(dsd["public"]):
        mlp.validate()

Building a multi-split dataset

Bake each split as a complete single-split dataset, then combine. Under seed_protocol 3.0, each split uses its own seeds JSON file:

    # Generate independent seed files for each split.
    whest dataset generate-seeds --n-mlps 50 > public-seeds.json
    whest dataset generate-seeds --n-mlps 50 > holdout-seeds.json

    whest dataset bake --n-mlps 50 --n-samples 1e9 --width 256 --depth 8 --split public --config default --mlp-seeds public-seeds.json --output ./pub
    whest dataset bake --n-mlps 50 --n-samples 1e9 --width 256 --depth 8 --split holdout --config holdout --mlp-seeds holdout-seeds.json --output ./hold

    whest dataset combine-splits ./pub ./hold --output ./eval-r1
    whest dataset push ./eval-r1 --repo aicrowd/arc-whestbench-2026-evals --tag round-1 --private

combine-splits preserves the baked config coordinate.
If exactly one input declares config="default", the combined metadata records that split as default_split, so whest run --dataset ... can keep a split-oriented UX.

The public / holdout naming convention

The contest's evaluation dataset uses split names public (visible-during-contest scores) and holdout (private/final-leaderboard scores). The dataset-card template special-cases these names with leaderboard-specific wording. Other names render generically. Tooling itself accepts any HF-Hub-compatible split name (regex [a-z][a-z0-9]*(-[a-z0-9]+)*).

--- reference/cli-reference ---
URL: https://aicrowd.github.io/whestbench/docs/reference/cli-reference

CLI Reference

Exact command syntax and key flags for all whest commands. For the full per-command reference, see CLI.

When to use this page

Use this page for exact command syntax and key flags.

Environment variables

- WHEST_SKIP_HARDWARE_FALLBACK_PROBES=1 — skip OS-native fallback probes when collecting run_meta.host or dataset metadata.hardware. Cheap fields and psutil-backed fields are still collected; fallback-backed fields may remain null.
- HF_TOKEN — HuggingFace Hub authentication token. Used by whest dataset push, whest dataset pull, and whest run --dataset hf://... as a fallback when --token is not provided.

Commands

Participant workflow commands:

- whest smoke-test
- whest doctor
- whest init
- whest validate
- whest run
- whest dataset (bake / push / pull / merge / inspect)
- whest package
- whest profile-simulation
- whest version

All JSON outputs include a top-level whestbench_version string for traceability.

whest version

Print installed whestbench version.
whest version [--format rich|plain|json] [--json] JSON output is: { "ok": true, "command": "version", "name": "whestbench", "version": "0.2.0", "whestbench_version": "0.2.0" } Examples: whest version whest version --json Migration note: whest create-dataset is replaced by whest dataset bake. Running whest create-dataset prints a redirect and exits. whest smoke-test Run a built-in CombinedEstimator dashboard check and print next-step participant commands. whest smoke-test [--detail raw|full] [--profile] [--show-diagnostic-plots] [--format rich|plain|json] [--debug] --format rich|plain|json — choose styled terminal output, plain log-friendly output, or JSON. Defaults to rich on TTYs and plain otherwise. Under a debugger, smoke-test automatically forces plain if rich was requested. whest doctor Run install and environment health checks. Prints a pass/fail list for Python version, uv/Node.js availability, BLAS thread pool, disk space, and working-directory writability. Useful for first-hour setup troubleshooting and for CI gates. whest doctor [--format rich|plain|json] [--json] [--strict] [--debug] Key options: --format rich|plain|json — choose styled terminal output, plain log-friendly output ([OK]/[WARN]/[FAIL] tokens, no box-drawing), or JSON (schema_version, checks, counts, overall). Defaults to rich on TTYs and plain otherwise. --json — alias for --format json. --strict — treat warnings as failures for exit-code purposes. Rendering is unchanged. --debug — re-raise exceptions from crashing checks instead of capturing them as fail. Severity model ok — the check passed. warn — the check found something worth knowing but not blocking. Examples: uv missing (safe to ignore if you installed via pip), less than 1 GiB free disk in the current directory. fail — the check found a genuine blocker. Examples: Python version below requires-python, threadpoolctl failed to import, cannot write to the working directory. 
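The three severity levels above combine into a single process exit code according to the rules stated under Exit codes. The sketch below encodes that rule in plain Python for use in a CI wrapper. It is an illustration, not the whest doctor source; the function name and the list-of-strings input are assumptions.

```python
def doctor_exit_code(statuses: list[str], strict: bool = False) -> int:
    """Map per-check severities ("ok", "warn", "fail") to an exit code.

    Default mode: 0 unless any check failed.
    Strict mode:  0 only if every check is "ok".
    """
    unknown = set(statuses) - {"ok", "warn", "fail"}
    if unknown:
        raise ValueError(f"unknown severities: {sorted(unknown)}")

    if "fail" in statuses:
        return 1
    if strict and "warn" in statuses:
        return 1
    return 0
```

With --strict, a missing uv or low disk space is enough to fail a CI gate, since both are reported as warn.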
Exit codes Default: 0 if all checks are ok or warn; 1 if any fail. --strict: 0 only if all checks are ok; 1 otherwise. Example # Interactive first-hour check whest doctor # CI pre-flight (treat anything that isn't OK as a failure) whest doctor --strict --json whest init Create starter files in a target directory. whest init [path] [--format rich|plain|json] [--json] [--debug] whest validate Validate estimator loading and output contract. whest validate --estimator [--class ] [--format rich|plain|json] [--json] [--debug] whest run Run local scoring with a participant estimator. whest run --estimator [options] Default behavior: whest run --estimator is equivalent to --runner local. Key options: --class — estimator class name (if the module exports more than one). --runner local|subprocess|server|inprocess --n-mlps — number of MLPs to evaluate. Default: 10 without --dataset; full dataset size with --dataset. Clamped to dataset size when --dataset is set. --flop-budget — cap on effective compute C_m = F_m + λ·R_m per MLP. Default: 68_000_000_000 (6.8e10). Always honored; any flop_budget stored in --dataset's metadata is ignored. --wall-time-limit (default: 60.0) — wall-clock limit per predict() call; forwarded to the estimator BudgetContext. Operational backstop matching the Phase 1 grader cap; the primary compute constraint is --flop-budget. --residual-wall-time-limit — limit for non-flopscope time per predict() call, enforced by WhestBench after timing is reported. --detail raw|full --seed — random seed for the run. Without --dataset: seeds both MLP generation and estimator setup (ctx.seed). With --dataset: MLP seeds come from the dataset; this flag seeds estimator setup (ctx.seed) only. Default: omitted (ctx.seed defaults to 0; run_config.seed is null in the JSON output). See estimator-contract for the ctx.seed reproducibility contract. --profile --show-diagnostic-plots --format rich|plain|json — choose styled terminal output, plain log-friendly output, or JSON. 
Defaults to rich on TTYs and plain otherwise. --json — alias for --format json. --dataset — dataset source. Accepts: Local directory: ./my-eval or /abs/path/my-eval HF Hub with inline revision: hf://owner/repo@v1 or hf://aicrowd/arc-whestbench-2026@v1 HF Hub with --revision flag: aicrowd/arc-whestbench-2026 --revision v1 Bare owner/repo without --revision is rejected (revision must be explicit). --revision — HF Hub git tag or commit SHA for --dataset. Ignored for local paths. --n-samples — ground truth samples per MLP when generating on-the-fly (without --dataset). Default: width*width*256. --debug — include estimator tracebacks in the report's "Estimator Errors" panel. --fail-fast — stop on the first estimator error and let the raw Python traceback propagate. Combine with --debug to show it. --max-threads — limit BLAS to at most N CPU threads. Recommended debug sequence: whest run --estimator ./path/to/estimator.py whest run --estimator ./path/to/estimator.py --debug whest run --estimator ./path/to/estimator.py --debug --fail-fast whest run --estimator ./path/to/estimator.py --runner local --format plain # for pdb.set_trace() / breakpoint() Using a pre-baked dataset # Local directory (schema 3.0) whest run --estimator ./estimator.py --dataset ./my-eval # HF Hub with inline revision (preferred) whest run --estimator ./estimator.py --dataset hf://aicrowd/arc-whestbench-2026@v1 # HF Hub with separate --revision flag whest run --estimator ./estimator.py \ --dataset aicrowd/arc-whestbench-2026 \ --revision v1 Exit codes 0 — scoring completed; no estimator errors (budget or time exhaustion still exits 0). 1 — at least one MLP raised during predict, or setup/runtime failure. Runner mode tradeoff: local (default): in-process execution with better traceback fidelity while debugging. Required for interactive debuggers (pdb, breakpoint()). subprocess: isolated execution in a separate process via the subprocess runner. server: legacy alias for subprocess. 
inprocess: alias for local.

whest dataset

Dataset management commands. All subcommands share the whest dataset prefix.

    whest dataset {bake,push,pull,merge,inspect} ...

whest dataset bake

Bake a new evaluation dataset to a local directory.

    whest dataset bake \
      --n-mlps N --n-samples N --width W --depth D \
      [--split SPLIT] [--config CONFIG] \
      --output DIR \
      [--mlp-seeds FILE] \
      [--torch] [--device auto|cuda|mps|cpu] \
      [--mlps-per-batch N] [--chunk-size N] \
      [--slice K/N | --mlp-range START-END]

Required options:

- --n-mlps — total number of MLPs in the logical dataset.
- --n-samples — ground-truth samples per MLP. Larger values give lower-noise ground truth. Default for on-the-fly runs is width*width*256 (~16.7M for 256-wide).
- --width — neuron count per layer.
- --depth — number of weight matrices per MLP.
- --output — output directory (must not exist).

Key optional options:

- --split — dataset split name. Default: public.
- --config — HF dataset config name for this split. Default: default. Use this for authoring config-per-split datasets such as default/mini + full/full or default/public + holdout/holdout.
- --mlp-seeds FILE — JSON array of per-MLP input seeds. All workers baking slices of the same split must receive the same file.
- --torch — use the GPU/torch backend (requires pip install whestbench[gpu]). See GPU Dataset Generation.
- --device auto|cuda|mps|cpu — device when --torch is active. auto resolves cuda > mps > cpu.
- --mlps-per-batch — torch backend: MLPs processed in parallel on device.
- --chunk-size — torch backend: samples per chunk per step.
- --slice K/N — bake only the K-th slice of N total slices (0-indexed). Produces a partial dataset. Combine with whest dataset merge to assemble the full dataset. Example: --slice 0/4 for the first of four workers.
- --mlp-range START-END — bake only MLP indices [START, END] inclusive (both ends). Alternative to --slice for irregular splits.

Bit-equivalence guarantee: a worker baking --slice K/N produces rows that are bitwise identical to the corresponding rows of a single-host bake with the same --mlp-seeds file and --n-mlps.
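The relationship between --slice K/N, the inclusive --mlp-range START-END form, and the half-open mlp_range recorded in metadata is easy to get off by one. The sketch below works through the arithmetic. It assumes n_mlps is an exact multiple of N, because this page does not say how a remainder would be distributed. All three function names are hypothetical.

```python
def slice_to_range(k: int, n: int, n_mlps: int) -> tuple[int, int]:
    """Half-open [start, end) MLP index range for 0-indexed slice k of n."""
    if not 0 <= k < n:
        raise ValueError(f"slice index {k} out of range for {n} slices")
    if n_mlps % n != 0:
        # Assumption: only the evenly divisible case is handled here.
        raise ValueError("this sketch only handles n_mlps divisible by n")
    size = n_mlps // n
    return k * size, (k + 1) * size


def to_cli_mlp_range(start: int, end: int) -> str:
    """Convert half-open [start, end) to the inclusive START-END CLI form."""
    return f"{start}-{end - 1}"


def covers_exactly_once(ranges: list[tuple[int, int]], total: int) -> bool:
    """True if the half-open ranges tile [0, total) with no gaps or overlaps."""
    cursor = 0
    for start, end in sorted(ranges):
        if start != cursor or end <= start:
            return False
        cursor = end
    return cursor == total


if __name__ == "__main__":
    start, end = slice_to_range(0, 4, 1000)
    print(to_cli_mlp_range(start, end))  # inclusive form for --mlp-range
```

covers_exactly_once mirrors the coverage condition that whest dataset merge is described as enforcing. Running it on the planned ranges before starting the workers can catch a bad split early.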
Output is a directory with: / ├── data/-00000-of-00001.parquet ├── metadata.json └── README.md Example # Full bake (10 MLPs, 10M samples each) whest dataset bake \ --n-mlps 10 --n-samples 10_000_000 \ --width 256 --depth 8 \ --output ./my-eval # Partial bake (slice 0 of 4) whest dataset bake \ --n-mlps 100 --n-samples 1_000_000_000 \ --width 256 --depth 8 \ --slice 0/4 \ --output ./partial-0 # GPU bake whest dataset bake \ --n-mlps 100 --n-samples 1_000_000_000 \ --width 256 --depth 8 \ --torch --device auto \ --output ./gpu-eval whest dataset inspect Print metadata from a local directory or a HF Hub repo. whest dataset inspect [--revision REV] Arguments: DIR_OR_REPO_ID — local dataset directory, or HF Hub repo id (e.g. aicrowd/arc-whestbench-2026). --revision — HF Hub git tag or commit SHA (for remote repos). Example # Local whest dataset inspect ./my-eval # Remote whest dataset inspect aicrowd/arc-whestbench-2026 --revision v1 Output prints key metadata fields: schema_version, format, backend, split, config, n_mlps, n_samples, width, depth, created_at_utc, and device provenance for torch bakes. Multi-split datasets print each split's config when present. whest dataset push Upload a baked dataset directory to HuggingFace Hub. Requires HF_TOKEN set in the environment or --token. whest dataset push \ --repo REPO_ID \ [--tag TAG] \ [--private] \ [--token TOKEN] \ [--message MSG] Arguments: LOCAL_DIR — local directory produced by whest dataset bake or whest dataset merge. --repo — HF Hub repo id, e.g. aicrowd/arc-whestbench-2026. --tag — optional git tag to create on the uploaded commit (e.g. v1). Recommended for versioning. --private — create the repo as private if it doesn't exist yet. --token — HF Hub write token. Falls back to HF_TOKEN env var, then the huggingface-cli login cache. --message — commit message for the HF Hub upload. 
Example

    # Publish with a version tag
    whest dataset push ./my-eval \
      --repo aicrowd/arc-whestbench-2026 \
      --tag v1 \
      --message "Bake: 10 MLPs, width=256, depth=8"

    # Private repo
    whest dataset push ./my-eval \
      --repo aicrowd/arc-whestbench-2026-holdout \
      --tag v1 \
      --private

whest dataset pull

Download a dataset from HuggingFace Hub to a local directory.

    whest dataset pull REPO_ID \
      [--revision REV] \
      --output DIR \
      [--token TOKEN]

Arguments:

- REPO_ID — HF Hub repo id (e.g. aicrowd/arc-whestbench-2026).
- --revision — HF Hub git tag or commit SHA. Default: main.
- --output — local destination directory.
- --token — HF Hub token for private repos. Falls back to HF_TOKEN env var.

Example

    whest dataset pull aicrowd/arc-whestbench-2026 \
      --revision v1 \
      --output ./eval-v1

whest dataset merge

Merge partial bakes (produced with --slice or --mlp-range) into a single canonical dataset.

    whest dataset merge PARTIAL_DIR [PARTIAL_DIR ...] --output DIR

Arguments:

- PARTIAL_DIR ... — two or more partial dataset directories.
- --output — destination for the merged dataset (must not exist).

All partial datasets must have been baked from the same --mlp-seeds file with the same --n-mlps, --n-samples, --width, --depth, and backend. Their mlp_range values must together cover [0, total_n_mlps) exactly once (no gaps, no overlaps). The merged result is bit-equivalent to a single-host bake with the same parameters.

Example

    # After baking 4 slices on separate workers:
    whest dataset merge \
      ./partial-0 ./partial-1 ./partial-2 ./partial-3 \
      --output ./final-eval

End-to-end example (bake → inspect → push → pull → run)

    # 1. Bake
    whest dataset bake \
      --n-mlps 10 --n-samples 10_000_000 \
      --width 256 --depth 8 \
      --output ./my-eval

    # 2. Inspect locally
    whest dataset inspect ./my-eval

    # 3. Publish
    export HF_TOKEN=hf_...
    whest dataset push ./my-eval \
      --repo aicrowd/arc-whestbench-2026 \
      --tag v1

    # 4. Pull on another machine
    whest dataset pull aicrowd/arc-whestbench-2026 \
      --revision v1 --output ./local-copy

    # 5. Run evaluation
    whest run --estimator ./estimator.py \
      --dataset hf://aicrowd/arc-whestbench-2026@v1

whest package

Build a submission artifact.

    whest package --estimator PATH [options]

Key options:

- --class
- --requirements
- --submission-metadata
- --approach
- --output
- --format rich|plain|json
- --json — alias for --format json
- --debug

whest profile-simulation

Profile flopscope FLOP accounting and analytical correctness across a grid of network sizes and FLOP budgets.

    whest profile-simulation [--preset super-quick|quick|standard|exhaustive] [--output PATH] [--format rich|plain|json] [--json] [--verbose] [--debug]

Key options:

- --preset (default: standard) — parameter sweep size:
  - super-quick — 1 width (256), 1 depth (4), 10 000 samples. Sub-second, for testing the debug loop.
  - quick — 1 width (256), 2 depths (4, 128), 2 sample counts (10 000, 100 000). Finishes in seconds.
  - standard — 2 widths (64, 256), 3 depths (4, 32, 128), 2 sample counts (10 000, 100 000). Under a minute.
  - exhaustive — 2 widths (64, 256), 3 depths (4, 32, 128), 3 sample counts (10 000, 100 000, 1 000 000). Thorough but slow.
- --output — save a JSON report with correctness results and FLOP accounting data.
- --format rich|plain|json — choose styled terminal output, plain log-friendly output, or JSON. Defaults to rich on TTYs and plain otherwise.
- --json — alias for --format json.
- --debug — show full tracebacks on errors.
- --verbose — show full tables with all columns and raw data.
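To get a feel for how the presets above differ in size, the listed widths, depths, and sample counts can be expanded into explicit configurations. The sketch below assumes each preset sweeps the full Cartesian product of the three axes. That is an assumption: the command may also vary FLOP budgets or skip combinations. The PRESETS dict and preset_grid are hypothetical names that simply restate the values listed in the option description.

```python
from itertools import product

# Values copied from the --preset description; the dict itself is illustrative.
PRESETS = {
    "super-quick": {"widths": [256], "depths": [4], "n_samples": [10_000]},
    "quick": {"widths": [256], "depths": [4, 128], "n_samples": [10_000, 100_000]},
    "standard": {"widths": [64, 256], "depths": [4, 32, 128], "n_samples": [10_000, 100_000]},
    "exhaustive": {
        "widths": [64, 256],
        "depths": [4, 32, 128],
        "n_samples": [10_000, 100_000, 1_000_000],
    },
}


def preset_grid(name: str) -> list[tuple[int, int, int]]:
    """All (width, depth, n_samples) combinations for a preset (assumed full product)."""
    p = PRESETS[name]
    return list(product(p["widths"], p["depths"], p["n_samples"]))


if __name__ == "__main__":
    for name in PRESETS:
        print(name, len(preset_grid(name)), "configurations")
```

Under this assumption the presets expand to 1, 4, 12, and 18 configurations respectively, which is consistent with the stated ordering from sub-second to slow.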
Example workflows:

    # Quick correctness check
    whest profile-simulation --preset quick

    # Full profile with JSON export
    whest profile-simulation --preset exhaustive --output profile_results.json

Next step

- Dataset Format: schema 3.0 specification
- Score Report Fields
- GPU Dataset Generation
- Inspect and Traverse MLP Structure (in the starter kit)
- Validate, Run, and Package (in the starter kit)

--- reference/code-patterns ---
URL: https://aicrowd.github.io/whestbench/docs/reference/code-patterns

Code Patterns

Quick reference for flopscope operations, including operators, FLOP costs, and common patterns for mean and variance propagation. All examples assume import flopscope as flops and import flopscope.numpy as fnp.

Operators are tracked

Python arithmetic operators (+, -, *, /, @) on fnp.ndarray values are FLOP-tracked, so you do not need to use the verbose fnp.add, fnp.multiply, etc. forms.

    import flopscope as flops
    import flopscope.numpy as fnp

    a = fnp.ones(4)
    b = fnp.ones(4)

    # These are all equivalent and all tracked:
    c = a + b       # tracked: same as fnp.add(a, b)
    d = a * b       # tracked: same as fnp.multiply(a, b)
    e = a / b       # tracked: same as fnp.divide(a, b)

    W = fnp.eye(4)
    v = fnp.ones(4)
    f = W @ v       # tracked: same as fnp.matmul(W, v)
    g = W.T @ v     # tracked: transpose is free, matmul is tracked
    h = W.T @ W @ v # tracked: two matmuls, chained with @

Use operators whenever they improve readability. The verbose fnp.* forms are still available but are no longer required for tracking purposes.
Operation costs

| What you want | Code | FLOP cost | Notes |
|---|---|---|---|
| Create zeros | fnp.zeros((n, n)) | 0 | Free |
| Create ones | fnp.ones(n) | 0 | Free |
| Identity matrix | fnp.eye(n) | 0 | Free |
| Wrap existing data | fnp.array(data) | 0 | Free |
| Matrix multiply | fnp.matmul(A, B) | O(m x n x k) | Dominates budgets |
| Element-wise add | fnp.add(a, b) | 1 per element | |
| Element-wise multiply | fnp.multiply(a, b) | 1 per element | |
| Element-wise divide | fnp.divide(a, b) | 1 per element | |
| ReLU | fnp.maximum(x, 0.0) | 1 per element | |
| Square root | fnp.sqrt(x) | 1 per element | |
| Exponential | fnp.exp(x) | 1 per element | |
| Logarithm | fnp.log(x) | 1 per element | |
| Transpose | fnp.transpose(W) | 0 | Free |
| Reshape | fnp.reshape(x, shape) | 0 | Free |
| Extract diagonal | fnp.diag(M) | 0 | Free |
| Set diagonal | fnp.fill_diagonal(M, v) | 0 | Free, in-place |
| Outer product | fnp.outer(a, b) | n x m | |
| Sum | fnp.sum(x, axis=0) | input size | |
| Mean | fnp.mean(x, axis=0) | input size | |
| Max | fnp.max(x) | input size | |
| Stack arrays | fnp.stack(rows, axis=0) | 0 | Free |
| Concatenate | fnp.concatenate([a, b]) | 0 | Free |
| Index/slice | x[0], x[:, 3] | 0 | Free |

Common patterns

Standard normal PDF and CDF (built-in)

flopscope provides built-in PDF and CDF functions that are FLOP-tracked:

    import flopscope as flops
    import flopscope.numpy as fnp

    phi = flops.stats.norm.pdf(x)  # standard normal PDF
    Phi = flops.stats.norm.cdf(x)  # standard normal CDF

These are the recommended approach, and all example estimators use them. The manual implementations below are shown for reference.
Standard normal PDF (for ReLU expectation)

    import flopscope as flops
    import flopscope.numpy as fnp

    def norm_pdf(x):
        """phi(x) = exp(-x^2/2) / sqrt(2*pi)"""
        return fnp.exp(-0.5 * x * x) / fnp.sqrt(2.0 * fnp.pi)

Standard normal CDF

Pure flopscope implementation using the Abramowitz & Stegun approximation (absolute error below 7.5e-8):

    import flopscope as flops
    import flopscope.numpy as fnp

    _P = 0.2316419
    _A1, _A2, _A3 = 0.319381530, -0.356563782, 1.781477937
    _A4, _A5 = -1.821255978, 1.330274429

    def norm_cdf(x):
        # Evaluate the approximation at |x|, then mirror it for negative inputs.
        t = 1.0 / (1.0 + _P * fnp.abs(x))
        poly = ((((_A5 * t + _A4) * t + _A3) * t + _A2) * t + _A1) * t
        pdf = fnp.exp(-0.5 * x * x) / fnp.sqrt(2.0 * fnp.pi)
        cdf = 1.0 - pdf * poly
        return fnp.where(x >= 0, cdf, 1.0 - cdf)

Alternatively, if you add scipy to your requirements.txt:

    # Optional: requires scipy as a user-provided dependency
    from scipy.special import ndtr

    def norm_cdf(x):
        return fnp.array(ndtr(fnp.asarray(x, dtype=fnp.float64)).astype(fnp.float32))

ReLU expectation (E[max(0, z)] where z ~ N(mu, sigma^2))

    import flopscope as flops
    import flopscope.numpy as fnp

    # mu_pre and sigma_pre are the pre-activation mean and standard deviation.
    alpha = mu_pre / sigma_pre
    E_relu = mu_pre * norm_cdf(alpha) + sigma_pre * norm_pdf(alpha)

See 02_mean_propagation.py (in the starter kit) for a complete worked example using these patterns.

Per-neuron variance propagation (diagonal)

    import flopscope as flops
    import flopscope.numpy as fnp

    # var_pre[i] = sum_j w[j, i]^2 * var[j]
    var_pre = (w * w).T @ var

Next step

- Estimator Contract
- Manage Your FLOP Budget (in the starter kit)
- Algorithm Ideas (in the starter kit)
--- reference/flopscope-primer ---
URL: https://aicrowd.github.io/whestbench/docs/reference/flopscope-primer

Flopscope Primer

Flopscope is a numpy-compatible array library that tracks FLOPs analytically rather than timing them on hardware. Every arithmetic operation on an fnp.ndarray increments a FLOP counter instead of (or in addition to) performing the computation. This is how WhestBench enforces fair FLOP budgets across different machines.

Source: github.com/AIcrowd/flopscope

BudgetContext

All estimator predictions run inside a BudgetContext. When the budget is exhausted, a BudgetExhaustedError is raised and your predictions are zeroed out.

    import flopscope as flops
    import flopscope.numpy as fnp

    with flops.BudgetContext(flop_budget=1_000_000) as ctx:
        x = fnp.ones(100)
        y = x @ fnp.eye(100)  # vector-matrix product: 1 * 100 * 100 = 10,000 FLOPs
        # BudgetExhaustedError raised here if budget exceeded

You don't need to create BudgetContext yourself; the framework does it before calling your predict() method. The budget argument tells you how many FLOPs you have.

BudgetContext also supports wall_time_limit_s when you want a cooperative wall-clock limit in addition to the FLOP cap:

    with flops.BudgetContext(flop_budget=1_000_000, wall_time_limit_s=2.0) as ctx:
        ...

The timer starts when the context is entered and is checked before and after each counted flopscope/NumPy call. If it is exceeded, flopscope raises TimeExhaustedError.
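To budget a computation before running it, the matmul cost rule (M * N * K for an (M, N) @ (N, K) product) can be tallied by hand. The sketch below is plain Python and does not call flopscope; the helper names are hypothetical, and treating a 1-D operand as a single row or column is an assumption about how the rule extends to vectors.

```python
from math import prod


def matmul_flops(a_shape, b_shape):
    """FLOPs for a @ b under the M * N * K rule.

    Assumption: a 1-D left operand counts as one row, a 1-D right
    operand as one column.
    """
    a = (1, *a_shape) if len(a_shape) == 1 else tuple(a_shape)
    b = (*b_shape, 1) if len(b_shape) == 1 else tuple(b_shape)
    m, n = a
    n_b, k = b
    if n != n_b:
        raise ValueError(f"inner dimensions differ: {n} vs {n_b}")
    return m * n * k


def pointwise_flops(shape):
    """FLOPs for a 1-FLOP-per-element operation on an array of this shape."""
    return prod(shape)


vector_matrix = matmul_flops((100,), (100, 100))      # one row times a square matrix
matrix_matrix = matmul_flops((100, 100), (100, 100))  # full square product
exp_cost = pointwise_flops((100,))                    # e.g. exp on 100 elements
```

Comparing these tallies against the budget passed to predict() gives a rough idea of how many matmuls fit before any pointwise work is counted.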
Operation FLOP Costs

Category | Operations | Cost
Free (0 FLOPs) | fnp.array, fnp.zeros, fnp.ones, fnp.eye, fnp.asarray, fnp.reshape, .T, indexing, fnp.stack, fnp.concatenate, .copy(), .astype() | 0
Pointwise (1 FLOP/element) | +, -, *, /, fnp.exp, fnp.sqrt, fnp.abs, fnp.maximum, fnp.where, fnp.log, comparisons | N elements
Reductions (input size) | fnp.sum, fnp.mean, fnp.var, fnp.max, fnp.min, fnp.all, fnp.any | N elements
Matmul | @, fnp.matmul | M * N * K for (M,N) @ (N,K)

Key insight: Matmul dominates. A single (100, 100) @ (100, 100) costs 1M FLOPs. A pointwise exp on 100 elements costs 100 FLOPs.

Array Creation

    import flopscope as flops
    import flopscope.numpy as fnp

    x = fnp.zeros(100)                          # 1D zeros
    X = fnp.zeros((64, 100), dtype=fnp.float32) # 2D zeros, explicit dtype
    I = fnp.eye(100, dtype=fnp.float32)         # identity matrix
    a = fnp.array([1.0, 2.0, 3.0])              # from list
    b = fnp.asarray(numpy_array)                # convert from numpy (free)

All array creation is free (0 FLOPs).

Random Number Generation

    import flopscope as flops
    import flopscope.numpy as fnp

    rng = fnp.random.default_rng(42)       # seeded RNG
    x = rng.standard_normal((1000, 64))    # Gaussian samples
    x = x.astype(fnp.float32)              # cast to float32 (free)

Random generation itself is free. FLOPs are counted when you operate on the arrays.

Budget Inspection

Use ctx.summary() for the current explicit context and fnp.budget_summary() for the accumulated session/global view:

    with flops.BudgetContext(flop_budget=10_000_000) as ctx:
        # ... your computations ...
        print(ctx.summary())         # current context only
        print(fnp.budget_summary())  # process/session-wide summary
        print(ctx.flops_used)        # integer FLOP count

Both summaries also include four timing fields that satisfy a strict decomposition identity, wall_time_s = flopscope_backend_time_s + flopscope_overhead_time_s + residual_wall_time_s:

- wall_time_s: total elapsed time in the context
- flopscope_backend_time_s: time spent inside counted flopscope numpy kernels
- flopscope_overhead_time_s: time spent inside flopscope's own dispatch (wrapper preambles, FLOP bookkeeping, namespace push/pop)
- residual_wall_time_s: everything else, that is participant Python, GC, and uninstrumented numpy

This decomposition lets you see whether time is going to numpy compute, framework dispatch, or your own Python.

WhestBench-specific limits

Flopscope's BudgetContext measures wall_time_s, flopscope_backend_time_s, flopscope_overhead_time_s, and residual_wall_time_s. It also accepts wall_time_limit_s, which it checks while counted flopscope operations run.

WhestBench exposes some of those concepts as run-level CLI knobs:

- --wall-time-limit: passed through to the estimator's BudgetContext.
- --residual-wall-time-limit: enforced by WhestBench after predict() returns, using the reported residual_wall_time_s. Because residual_wall_time_s no longer includes flopscope's own dispatch time, this gate measures only your Python work, not the framework's bookkeeping tax.

So if you see time_exhausted, that came from Flopscope's wall_time_limit_s. If you see residual_wall_time_exhausted, that came from WhestBench scoring logic comparing Flopscope's measured residual_wall_time_s with the configured --residual-wall-time-limit.
Residual wall-time charging (lambda)

WhestBench's effective compute budget combines analytical FLOPs and residual wall time via a conversion rate λ (LAMBDA_FLOPS_PER_SECOND in whestbench.scoring):

    C_m = F_m + λ · R_m

- F_m = analytical FLOPs counted by flopscope (flops_used).
- R_m = residual wall time, the third bucket of the time decomposition. Specifically, residual_wall_time_s = wall_time_s − flopscope_backend_time_s − flopscope_overhead_time_s. This is participant Python (loops, control flow), GC pauses, and uninstrumented numpy. It explicitly excludes flopscope's own dispatch overhead (the second bucket).
- λ = 1e11 FLOPs/second. This rate is fixed for the initial competition round.

The combined C_m is capped at B_m = flop_budget. If C_m > B_m, the MLP is marked combined_budget_exhausted and the prediction is replaced with zeros.

Why charge non-flopscope time at all? It lets participants use any Python they like (not just flopscope-instrumented operations) but holds them accountable for that work in the compute budget. Pure-flopscope solutions get the entire budget for analytical work; pure-Python solutions trade some FLOP headroom for residual time.

Common Gotchas

- numpy arrays still count FLOPs. Since fnp.ndarray is backed by numpy, a raw numpy array passed to flopscope operations will still be tracked. Use fnp.array() or fnp.asarray() to convert explicitly.
- Pythonic operators are tracked. x @ w counts the same FLOPs as fnp.matmul(x, w). Use whichever reads better.
- dtype matters for precision, not FLOPs. float32 and float64 operations cost the same FLOPs. Use float32 for memory efficiency and float64 for numerical stability where needed.
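The combined-budget rule from the residual wall-time charging section can be sketched in plain Python. This is an illustration of the arithmetic only, not WhestBench's scoring code: the function names are hypothetical, the timing values are made up, and only the λ value and the C_m > B_m comparison come from the text.

```python
LAMBDA_FLOPS_PER_SECOND = 1e11  # conversion rate stated in the docs


def residual_wall_time_s(wall_time_s, backend_time_s, overhead_time_s):
    """Third bucket of the decomposition: wall minus backend minus dispatch overhead."""
    return wall_time_s - backend_time_s - overhead_time_s


def combined_compute(flops_used, residual_s, lam=LAMBDA_FLOPS_PER_SECOND):
    """C_m = F_m + lambda * R_m."""
    return flops_used + lam * residual_s


def is_combined_budget_exhausted(flops_used, residual_s, flop_budget):
    """True when the combined charge strictly exceeds the FLOP budget."""
    return combined_compute(flops_used, residual_s) > flop_budget


# Made-up timings: 50 us wall, 20 us in kernels, 10 us in dispatch.
residual = residual_wall_time_s(50e-6, 20e-6, 10e-6)
charge = combined_compute(4_000_000, residual)
```

At λ = 1e11, each microsecond of residual time costs 100,000 FLOPs of budget, so 20 microseconds of Python overhead here adds about 2M FLOPs to the 4M counted analytically.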
Testing

Use flopscope's testing utilities:

    import flopscope as flops
    import flopscope.numpy as fnp

    fnp.testing.assert_allclose(actual, expected, atol=1e-6)
    fnp.testing.assert_array_equal(actual, expected)

These work like numpy's testing functions but on flopscope arrays.

============================================================
CLI
============================================================

--- cli ---
URL: https://aicrowd.github.io/whestbench/docs/cli

CLI Reference

Autogenerated reference for the whest command-line interface, generated from the whest argparse definition.

- whest smoke-test: Run a built-in CombinedEstimator dashboard check and print next steps for participant workflows.
- whest version: Print whestbench version.
- whest init: Create starter estimator files.
- whest validate: Validate estimator contract.
- whest run: Run local evaluation for an estimator.
- whest dataset: Dataset bake/publish/load/merge/inspect commands.
- whest package: Package submission artifact.
- whest profile-simulation: Benchmark flopscope simulation performance.
- whest doctor: Run install/environment health checks.
- whest login: Store your AIcrowd API key (interoperable with aicrowd-cli).
- whest submit: Submit to AIcrowd (packages an estimator if needed, then uploads).
============================================================
API Reference
============================================================

--- api ---
URL: https://aicrowd.github.io/whestbench/docs/api

API Reference

Autogenerated reference for the whestbench public API. Every symbol exported from whestbench (__all__). Generated from source.

- BaseEstimator: Estimator contract for participant implementations.
- BudgetExhaustionWarning: Raised when an estimator exhausts its FLOP budget on a single MLP.
- combine_split_datasets: Combine N complete single-split datasets into a multi-split dataset directory.
- CombinedBudgetExhaustionWarning: Raised when combined compute C_m = F_m + lambda*R_m exceeds the FLOP budget on a single MLP.
- create_dataset: Generate MLPs, compute ground-truth, and write a schema-3.0 dataset directory.
- InvalidDatasetError: Raised when a dataset directory has missing/incompatible metadata.
- iter_mlps: Iterate the MLPs in a Dataset, constructing MLP objects per row.
- load_dataset: Load a whestbench dataset from a local directory or HF Hub repo.
- merge_datasets: Concatenate partial bakes into a single canonical dataset directory.
- metadata: Return the metadata.json contents attached to a Dataset or DatasetDict.
- MLP: Validated MLP container with fixed width and layer depth.
- mlp_at: Return the MLP at index in the Dataset.
- publish_dataset: Upload a baked dataset directory to HF Hub.
- relu: Element-wise ReLU activation.
- ResidualWallTimeExhaustionWarning: Raised when an estimator exhausts its residual wall-time budget on a single MLP.
- run_mlp: Forward pass returning final-layer activations.
- run_mlp_all_layers: Forward pass returning activations after each layer.
- sample_layer_statistics: Estimate per-layer activation statistics via chunked Monte Carlo sampling.
- sample_mlp: Sample a random MLP with He-initialized weight matrices.
- SCHEMA_VERSION: value
- ScoringExhaustionWarning: Base class for budget/time exhaustion warnings raised during scoring.
- SetupContext: Runtime context passed to BaseEstimator.setup.
- TimeExhaustionWarning: Raised when an estimator exhausts its wall-clock budget on a single MLP.

============================================================
Development
============================================================

--- development/release-process ---
URL: https://aicrowd.github.io/whestbench/docs/development/release-process

Release process

This document is the authoritative reference for cutting a new release of whestbench to PyPI. It covers the steady-state flow, the one-time setup that must happen outside the repo, and a few troubleshooting notes.

TL;DR (steady-state)

    git checkout main && git pull origin main
    uv run cz bump --dry-run   # preview the next version + CHANGELOG entry
    uv run cz bump             # writes pyproject version + CHANGELOG.md + creates the vX.Y.Z tag
    git push --follow-tags     # tag push triggers the publish workflow
    # … open GitHub Actions → approve the `publish-pypi` job → wait ~30s →
    #   package on PyPI + GitHub Release created

Pre-release tags: uv run cz bump --prerelease alpha produces tags like v0.5.0a0.

What happens after git push --follow-tags

The tag push fires .github/workflows/pypi-publish.yml, which:

1. Builds the sdist + wheel with uv build.
2. Pauses for approval in the pypi GitHub environment (manual gate).
3. Publishes to PyPI via Trusted Publishing (OIDC; no API token stored in repo secrets).
4. Creates a GitHub Release whose body is the matching CHANGELOG section for the tag.
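The changelog extraction in the last step can be sketched in Python. The workflow itself uses an awk script, so this is only an equivalent illustration, assuming section headers of the form `## v0.4.0 (2026-05-26)`; the function name, the sample text and the exact fallback wording are assumptions.

```python
def extract_release_notes(changelog: str, tag: str) -> str:
    """Return the body under the '## <tag> ...' header, or a default body."""
    body = []
    in_section = False
    for line in changelog.splitlines():
        if line.startswith("## "):
            if in_section:
                break  # next release header ends the section
            parts = line[3:].split()
            in_section = bool(parts) and parts[0] == tag
            continue
        if in_section:
            body.append(line)
    text = "\n".join(body).strip()
    # Assumed fallback when no section matches the tag.
    return text or f"Release {tag}\n\nSee CHANGELOG.md for details."


SAMPLE = (
    "## v0.5.0 (2026-05-27)\n\n### Feat\n\n- a\n\n"
    "## v0.4.0 (2026-05-26)\n\n### Added\n\n- b\n"
)

notes_040 = extract_release_notes(SAMPLE, "v0.4.0")
notes_050 = extract_release_notes(SAMPLE, "v0.5.0")
notes_missing = extract_release_notes(SAMPLE, "v9.9.9")
```

Running a check like this locally before pushing a tag is one way to confirm that a hand-edited header still matches the tag name exactly.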
End result: uv add whestbench / pip install whestbench works ~2 minutes after a maintainer clicks "approve" on the publish-pypi job.

One-time setup (per maintainer, per repo)

Before the first release will succeed, two things must be configured outside the repo.

1. PyPI Trusted Publisher

On pypi.org, as an account with Owner or Maintainer rights on the whestbench project (or as the user creating it, if not yet published):

- "Your projects" → whestbench → "Publishing" → "Add a pending publisher" (or "Add a publisher" if the project already exists).
- Fill in:
    PyPI project name: whestbench
    Owner: AIcrowd
    Repository name: whestbench-public
    Workflow filename: pypi-publish.yml
    Environment name: pypi

PyPI's "pending publisher" feature allows trusted publishing to succeed on the very first publish of a brand-new project name.

2. GitHub pypi environment

In the whestbench-public repo on GitHub:

- Settings → Environments → "New environment" → name: pypi.
- Enable "Required reviewers". Add yourself (and any other release maintainers) as reviewers.
- Save.

Without this, publishes proceed without a human approval gate. The Trusted Publishing OIDC handshake will still work; there is just no gate to abort a bad tag.

How CHANGELOG entries get into the GitHub Release

The publish workflow extracts the body of the matching ## vX.Y.Z section in CHANGELOG.md using an awk script and uses it as the GitHub Release notes. Commitizen writes section headers in the ## vX.Y.Z (YYYY-MM-DD) form, which the workflow expects.

When promoting an existing ## Unreleased section to a versioned release manually (rather than via cz bump), use the same header format: ## v0.4.0 (2026-05-26).

If no matching section is found, the workflow falls back to a default body: Release vX.Y.Z\n\nSee CHANGELOG.md for details.

Troubleshooting

Publish job fails with "Trusted publisher not configured"

PyPI side is not configured. Re-check step 1 of "One-time setup". The workflow filename and environment name must match exactly (pypi-publish.yml, pypi).
Publish job fails with "File already exists on PyPI"

A version was previously uploaded and yanked. PyPI does not allow re-uploading the same version, even after a yank. Resolution: delete the tag locally and on the remote, bump to the next version, retag:

    git tag -d v0.5.0
    git push origin :refs/tags/v0.5.0
    uv run cz bump  # bumps to v0.5.1
    git push --follow-tags

GitHub Release step fails after PyPI succeeded

The package is on PyPI; only the GitHub Release is missing. Re-run the workflow on the same tag from the GitHub Actions UI. The github-release job's gh release create is the only remaining side effect and is safe to re-run against the existing tag (it fails if a release already exists and succeeds otherwise).

cz bump --dry-run previews an unexpected version

The previewed version is computed from the conventional-commits types in the commit range since the last tag:

- feat → minor
- fix → patch
- feat! or BREAKING CHANGE → minor while major_version_zero = true in [tool.commitizen], else major

To bump to a specific version explicitly, use cz bump --increment PATCH|MINOR|MAJOR.

Pin updates for flopscope

Whestbench requires flopscope>=0.4.1 and flopscope-server>=0.4.1. When flopscope ships a new minor or major version, bump these floors in pyproject.toml and re-run uv lock before cutting the next whestbench release. (Out of scope for an automated workflow; flag if Dependabot becomes worth the noise.)
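The bump rule described under "cz bump --dry-run previews an unexpected version" can be approximated in a few lines of Python. This is a simplified sketch for reasoning about a commit range, not commitizen's implementation: the function name is hypothetical, and it only inspects commit subjects for the feat, fix and breaking markers named above.

```python
def next_increment(commit_subjects, major_version_zero=True):
    """Return 'PATCH', 'MINOR', 'MAJOR', or None for a list of commit subjects."""
    rank = {None: 0, "PATCH": 1, "MINOR": 2, "MAJOR": 3}
    best = None
    for subject in commit_subjects:
        head = subject.split(":", 1)[0]            # e.g. "feat(cli)!"
        breaking = head.endswith("!") or "BREAKING CHANGE" in subject
        ctype = head.rstrip("!").split("(", 1)[0].strip()
        if breaking:
            # Breaking changes stay minor while the project is on 0.x.
            inc = "MINOR" if major_version_zero else "MAJOR"
        elif ctype == "feat":
            inc = "MINOR"
        elif ctype == "fix":
            inc = "PATCH"
        else:
            inc = None                              # docs, chore, etc. do not bump
        if rank[inc] > rank[best]:
            best = inc
    return best


patch_only = next_increment(["fix: handle empty split", "docs: typo"])
with_feature = next_increment(["fix: a", "feat(cli): add flag"])
breaking_zero = next_increment(["feat!: drop legacy loader"], major_version_zero=True)
breaking_stable = next_increment(["feat!: drop legacy loader"], major_version_zero=False)
```

If the dry run disagrees with a tally like this, the usual culprit is a commit whose subject does not follow the conventional-commits format and is therefore ignored.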
============================================================
Changelog
============================================================

--- changelog ---
URL: https://aicrowd.github.io/whestbench/docs/changelog

Changelog

Release history and notable changes.

v0.9.2 (2026-06-01)

Fix
- bump to track the flopscope 0.4.2 fix for fnp.random.default_rng() over the client/server grader boundary; the flopscope>=0.4.1 floor auto-resolves to 0.4.2 once published (AIcrowd/flopscope#109)

v0.9.1 (2026-05-31)

Fix
- cli: whest submit --watch reaches terminal grading state (#74)

v0.9.0 (2026-05-29)

Feat
- cli: add whest login + whest submit (hop-A AIcrowd submission)
- add config-aware dataset authoring (#72)
- prepared-arrow: friendly upfront notice + CLI preflight sizing (#69)

v0.8.0 (2026-05-27)

Feat
- ux2: prepared-Arrow fast path on HF for multi-split datasets (#67)

Fix
- prepared-arrow: handle multi-shard parquet splits (#68)

v0.7.0 (2026-05-27)

Feat
- ux1: per-split configs + split-aware load + early default_split resolution (#66)
- metadata: optional default_split + CLI fallback for multi-split datasets

v0.6.0 (2026-05-27)

Feat
- add whest version command and version metadata in JSON
- cli: validate/init/smoke-test/profile-simulation adopt unified copy
- cli: package gets a bytes progress bar
- cli: doctor wraps probes in a status spinner + bookends
- cli: merge gets spinner + before/after copy
- cli: download surfaces preflight summary + progress + completion
- cli: upload gets a real progress bar + before/after copy
- cli: bake gets phased progress bars + before/after copy
- cli: rename dataset push/pull/inspect to upload/download/info + deprecation
- cli: --streaming end-to-end with prominent cache-trade-off warning
- cli: add --streaming flag to whest run
- cli: use metadata-based n_mlps clamp when ds is streaming
- scoring: make_contest_from_dataset supports IterableDataset
- cli: wrap hf:// dataset load with hf_download progress UI
- hf_progress: add hf_upload context manager
- hf_progress: add hf_download context manager with three modes
- hf_progress: add RichHFTqdm that forwards into active Rich Progress
- hf_progress: add hf_preflight() with cache detection
- hf_progress: add HFPreflight dataclass
- ui: add status spinner context manager + finalize ui.py
- ui: add progress_count context manager
- ui: add progress_bytes context manager
- ui: add say.* message helpers (intent/step/ok/warn/hint)
- ui: add format_throughput helper
- ui: add format_duration helper
- ui: add format_bytes helper
- template: emit configs: block in YAML for explicit split ordering
- package: record tool and runtime versions in submission manifest

Fix
- avoid duplicate JSON output in validate command
- keep final_layer_mse in narrow score subtitle
- guard profile-simulation JSON payload type for metadata wrapper
- cli: cache-hit download says "Loaded from cache" not "Downloaded"
- cli: drop stray comma in cache-miss download ok line
- hf_progress: bail preflight when revision cannot be resolved
- hf_progress: drop unused empty top-level upload task
- hf_progress: raise on nested hf_download/hf_upload
- hf_progress: subclass HF tqdm and guard disabled bars
- ui: match HF Hub env-var truthy semantics in _progress_disabled
- ui: roll over format_bytes at the next-unit boundary
- dataset_io: use attr-set for configs to satisfy Pyright

Refactor
- ui: cache the default Console as a module-level singleton
- ui: inherit handles from ProgressHandle Protocol nominally

v0.5.1 (2026-05-27)

Feat
- template: mini+full quick-start snippet leads with split="mini"
- template: recognise mini+full split pair in dataset card

Fix
- template: restore print(ds[0]['mlp_name']) smoke-test in generic quickstart fallback
- template: scope companion-disclaimer to public+holdout, fix whitespace + spelling
- test: import datasets.config submodule explicitly for pyright
- dataset_io: scope merge_datasets HF cache to tempdir by default

v0.5.0 (2026-05-27)

Feat
- load_dataset: add streaming=True support (closes #55)
- readme: per-split MLP counts + tighter Compute/Reproducibility wording
- readme: companion_repo template var + collapse hardware_fingerprints

Fix
- lint: silence intentional type-violation in mlp_at streaming test
- lint: narrow load_dataset return type via Literal[streaming] overloads
- lint: narrow set element types before sort in fingerprint collapse

v0.4.0 (2026-05-26)

Added
- seed_protocol 3.0 (whestbench_explicit_per_mlp_seeds): each MLP's seed is an independent input rather than a derivation from a single root. Each mlp_seed value in the parquet column is the canonical input seed. Within-MLP three-stream derivation (weight/sample/estimator) is preserved via SeedSequence(mlp_seed).spawn(3).
- whest dataset bake --mlp-seeds FILE (JSON array of N ints) for explicit per-MLP seeds. Omitting both --mlp-seeds and --seed auto-generates via secrets.randbits(63).
- create_dataset(mlp_seeds=[...]) / create_dataset_torch(mlp_seeds=[...]).
- MLP.from_row(row, *, seed_protocol_version=...): protocol-aware estimator-seed derivation.
- Frozen fixture tests/fixtures/single_split_v3_protocol/ for schema-drift regression.
- Multi-split dataset support: dataset directories can now contain multiple Parquet files in data/, one per split, described by an optional splits: sub-dict in metadata.json. Backward-compatible: single-split datasets are unchanged.
- whest dataset combine-splits INPUT_DIR... --output OUTPUT_DIR CLI subcommand for assembling multi-split datasets from N complete single-split inputs.
- whestbench.combine_split_datasets() Python helper (re-exported from whestbench).
- whest dataset bake --split now accepts arbitrary split names matching [a-z][a-z0-9]*(-[a-z0-9]+)* (previously restricted to public / holdout).
- whest dataset pull --split and whest run --dataset ... --split for selecting one split from multi-split datasets.

Changed
- create_dataset(seed=...) / create_dataset_torch(seed=...) and whest dataset bake --seed N now reject with a migration hint pointing at --mlp-seeds.
- Parquet mlp_seed column semantics: under 3.0, the column stores the input seed (was: derived estimator seed under 2.0). MLP.seed (participant-facing) is unchanged across protocols; it is derived locally from the input under 3.0.
- whest dataset inspect now recognises multi-split datasets and prints a per-split summary, plus a seed_protocol line (name and version) for all datasets.
- whestbench.load_dataset() returns Dataset | DatasetDict based on the dataset shape; explicit split= always returns Dataset.
- whestbench.metadata() accepts a DatasetDict and an optional split= filter that projects to single-split-shaped metadata.
- The dataset-card template gains a multi-split branch with leaderboard-specific wording when splits are {public, holdout}; the single-split public branch's wording is updated to point at the new evaluation repo.

Compatibility
- whestbench.load_dataset reads both seed_protocol 2.0 and 3.0 datasets indefinitely. Existing published datasets (e.g. aicrowd/arc-whestbench-2026-smoke-test) continue to work unchanged. New bakes only write 3.0.
- schema_version stays at "3.0". The protocol discriminator is seed_protocol.{name,version}.
- The splits: field is purely additive. Old whestbench reading new multi-split datasets fails loudly with a missing-n_mlps error; upgrade whestbench to read multi-split.

0.3.0 - 2026-05-25

BREAKING
- Dataset format migrated from .npz to HF Parquet+sidecar (schema 2.4 → 3.0). Datasets are now directories with data/-NNNNN.parquet, metadata.json, and README.md.
- The whest create-dataset command is replaced by whest dataset bake.
- The DatasetBundle dataclass is removed; internal consumers operate on datasets.Dataset directly.
- Public estimator interface unchanged. Estimators still receive MLP instances via predict(mlp: MLP).

NEW
- whestbench.load_dataset(path_or_repo, revision=..., split=..., token=...) loads from local directories OR HF Hub.
- whestbench.iter_mlps(ds), whestbench.mlp_at(ds, i), whestbench.metadata(ds).
- whestbench.publish_dataset(local_dir, repo_id=..., tag=..., ...) for HF Hub uploads.
- whestbench.merge_datasets(input_dirs, output_dir=...): concatenate partial bakes.
- whest dataset {bake, push, pull, merge, inspect} CLI subcommands.
- Parallel bake via --slice K/N or --mlp-range START-END flags; merge with whest dataset merge.
- whest run --dataset now accepts HF Hub repos: hf://owner/repo@v1 (inline revision) or owner/repo --revision v1.

MIGRATION
- Legacy .npz datasets cannot be loaded by 0.3.0. Re-bake with whest dataset bake at the same --seed to reproduce.
- See dataset-format for the schema 3.0 specification.
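For the explicit per-MLP seeds introduced in v0.4.0, a seeds file can be prepared with the standard library alone. The sketch below assumes the file is a JSON array of N distinct 63-bit integers, as the changelog describes; the function name and the suggested file name are illustrative, and nothing here calls whest itself.

```python
import json
import secrets


def make_mlp_seeds(n: int) -> list[int]:
    """Draw n distinct 63-bit seeds, in draw order."""
    seeds: list[int] = []
    seen: set[int] = set()
    while len(seeds) < n:
        candidate = secrets.randbits(63)
        if candidate not in seen:  # enforce distinctness
            seen.add(candidate)
            seeds.append(candidate)
    return seeds


seeds = make_mlp_seeds(5)
payload = json.dumps(seeds)  # write this string to e.g. mlp-seeds.json (hypothetical name)
```

Keeping the generated file under version control is what makes a later bake reproducible, since the seeds are drawn from the OS entropy source and cannot be regenerated.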