Datasets — a complete guide
WhestBench uses HuggingFace Datasets as its dataset format and HF Hub as the distribution channel. This guide walks you through every dataset-related verb in whest.
WhestBench uses HuggingFace Datasets as
its dataset format and HF Hub as the distribution channel. This guide walks you
through every dataset-related verb in whest, in the order you'd typically
encounter them.
If you only have 5 minutes, read the Quick start below. The rest of the guide builds on it.
Quick start
You have a working estimator at ./estimator.py. Bake a tiny evaluation
dataset locally, then score against it:
# 1. Generate 10 MLPs with ground-truth statistics → ./my-eval/
whest dataset bake --n-mlps 10 --n-samples 1000 --width 64 --depth 4 \
--output ./my-eval
# 2. Inspect what got written
whest dataset info ./my-eval
# 3. Score your estimator against the same MLPs every run
whest run --estimator estimator.py --dataset ./my-evalWhy this matters: without --dataset, whest run regenerates MLPs and ground
truth on every invocation. Baking a dataset once and reusing it makes your runs
deterministic and ~10× faster.
The dataset lifecycle
+--------+ +-----------+ +----------+ +--------+
| local | upload | HF | down- | local | run | scores |
| bake | -----> | Hub repo | load → | cache | -----> | report |
| (./out)| | (org/...) | | (~/.hf) | | |
+--------+ +-----------+ +----------+ +--------+
^ |
|____________________________________________________________|
iterate on estimator code- Local-only workflow:
bake → run. Best when you're iterating fast and don't care about sharing the dataset. See Working locally. - Team workflow:
bake → upload → … later … → download → run. The HF repo's tag pins which exact dataset everyone scores against. See Publishing to HuggingFace Hub and Downloading from HF Hub and the local cache. - CI / leaderboard workflow:
bake → upload --tag v1-warmup. Participants pull by tag. Streaming (whest run --streaming) is the natural fit for per-PR CI gates — see Streaming mode.
Each verb is detailed below.
Working locally
You want to iterate fast: bake a small dataset to disk, inspect it, and reuse
it across whest run invocations. No network, no HF account needed.
whest dataset bake — create a dataset
You're starting a new evaluation. Bake 100 MLPs of moderate size with their
ground-truth statistics to ./my-eval/:
whest dataset bake \
--n-mlps 100 \
--n-samples 10000 \
--width 256 --depth 8 \
--output ./my-evalRepresentative output:
→ Baking 100 MLPs (width=256, depth=8, n_samples=10000) to ./my-eval
✓ Generated weights 100/100
✓ Computed ground truth 100/100 31.7s
✓ Wrote ./my-eval (2.0 GB)The result on disk:
my-eval/
├── data/public-00000-of-00001.parquet # weights + ground-truth stats
├── metadata.json # schema_version, seed_protocol, …
└── README.md # dataset cardKey flags:
--mlp-seeds <file.json>— pin per-MLP seeds explicitly. JSON array of N distinct int63 values. Required for bit-exact reproducibility with another bake.--mlp-range START-ENDor--slice K/N— bake a slice of a larger logical dataset. The slice is bit-equivalent to the corresponding portion of a single-host bake at the same seeds.--torch— use the GPU backend (requireswhestbench[gpu]).--split <name>— assign a split name (defaultpublic). See Multi-split datasets.--config <name>— assign the HF config for this split (defaultdefault). Dataset authors use this for config-per-split repos; participants normally leave it unset.
If it broke, see the Troubleshooting section — bake errors usually trace back to seed shape, an existing output directory, or running out of RAM on large
--n-samples.
whest dataset info — what's in a dataset
You've baked or downloaded a dataset and want a one-screen summary before running against it:
whest dataset info ./my-evalReports schema_version, seed_protocol, n_mlps, n_samples,
hardware fingerprint, and per-split row counts. info also works against HF
Hub directly:
whest dataset info aicrowd/arc-whestbench-public-2026 --revision v1-warmupNo download required — info only fetches metadata.json.
whest dataset merge — assemble parallel bakes
You have a multi-host cluster and want to bake a 1,000-MLP dataset in two
slices, then concatenate. Both workers must share the same --mlp-seeds file
so the result is bit-equivalent to a single-host bake:
# Two workers each bake a slice…
whest dataset bake --n-mlps 1000 --slice 0/2 --output ./partial-a \
--mlp-seeds seeds.json
whest dataset bake --n-mlps 1000 --slice 1/2 --output ./partial-b \
--mlp-seeds seeds.json
# … then merge.
whest dataset merge ./partial-a ./partial-b --output ./fullThe merged dataset is bit-equivalent to a single-host bake of the same size at the same seeds. See also: parallel-bake how-to.
whest run --dataset <local-dir> — score against a baked dataset
You're iterating on estimator.py. Score it against the first 50 MLPs of
your baked dataset (fast feedback loop):
whest run --estimator estimator.py --dataset ./my-eval --n-mlps 50--n-mlps K clamps the run to the first K MLPs of the dataset (useful for
quick iteration). Pass --split <name> if the dataset is
multi-split.
Once you're happy with local results, publish the dataset so teammates can score against the same MLPs.
Publishing to HuggingFace Hub
You've baked a dataset locally and want to share it with the team — or pin a specific revision so a CI gate scores everyone against the same MLPs. Upload to a HuggingFace Hub dataset repo.
Authenticate once
hf auth login # opens a browser; or pass --token <hf_xxx>Tokens with write scope are required to push. You can also set the token
without the interactive flow:
export HF_TOKEN=hf_xxxwhest dataset upload reads HF_TOKEN as a fallback when --token isn't
passed. See also: the
publish-to-hf-hub how-to for an
end-to-end walkthrough.
whest dataset upload
You have ./my-eval from the previous section. Push it as a private repo and
pin the resulting commit with a tag:
whest dataset upload ./my-eval \
--repo aicrowd/my-eval \
--tag v1 \
--private # omit for public datasetsRepresentative output:
→ Uploading ./my-eval to aicrowd/my-eval (private)
✓ Repo exists / created
✓ Uploaded 2.0 GB ████████████████████ 100% 34.1s
✓ Tag v1 created at d2f9a1c
✓ Done: https://huggingface.co/datasets/aicrowd/my-eval/tree/v1The repo is created if it doesn't exist. The tag is created at the resulting commit so
whest run --dataset hf://aicrowd/my-eval@v1pins to this exact revision.
Repo naming. Use <org>/<dataset-name>. Keep names short and
hyphen-separated (e.g. aicrowd/arc-whestbench-public-2026).
Tag conventions. HF doesn't enforce semver; the de-facto pattern is
v<MAJOR>.<MINOR> (e.g. v1.0, v1.1) or descriptive (v1-warmup,
v1-holdout). See
HF's revision docs.
What gets published
The dataset card (README.md) is auto-generated from metadata.json at bake
time. It includes splits, hardware fingerprint, seed protocol, and a runnable
quick-start snippet. Edit README.md after bake and before upload to
add custom content.
The card's YAML front-matter is what HuggingFace Hub renders on the dataset page (tags, license, language, etc.). Don't strip it.
whest dataset pushcontinues to work as a deprecated alias foruploadthrough v0.6. v0.7 will remove it. Same applies topull→downloadandinspect→info.
If it broke (401, 403, repo already exists, network errors), jump to Troubleshooting.
Downloading from HF Hub and the local cache
You want to score against a dataset published by your team or the contest organisers. There are two paths.
whest dataset download — explicit fetch
Use when you want a real on-disk copy you can inspect, ship to another machine, or commit to a separate artifact store:
whest dataset download aicrowd/arc-whestbench-public-2026 \
--revision v1-warmup \
--output ./evalRepresentative output:
→ Downloading aicrowd/arc-whestbench-public-2026@v1-warmup → ./eval
Preflight: 1 parquet shard, 2.0 GB, 1,000 MLPs
✓ Downloaded 2.0 GB ████████████████████ 100% 28.9s
✓ Wrote ./eval (cache: ~/.cache/huggingface/hub/datasets--aicrowd--arc-whestbench-public-2026)With --output set, files are materialised under the named directory; the HF
cache also picks them up.
Auto-fetch via whest run
You can skip the explicit download — whest run does it lazily on first use:
whest run --estimator estimator.py \
--dataset hf://aicrowd/arc-whestbench-public-2026@v1-warmupThis downloads on first invocation (showing a progress bar) and caches.
Subsequent runs are ~10× faster (the cache hit prints Loaded from cache).
HF cache layout
After a fetch, the HF cache lives at three places:
| Path | What's there |
|---|---|
~/.cache/huggingface/hub/datasets--<org>--<name>/ | Raw blobs (Git LFS / Xet objects) + the revision snapshot symlinks |
~/.cache/huggingface/datasets/<org>___<name>/ | The datasets library's regenerated Arrow tables (memory-mapped) |
~/.cache/huggingface/xet/{chunk_cache,shard_cache,staging}/ | Xet chunk-level dedup cache (since hf_xet ≥ 1.0) |
Total disk usage is roughly 2× download size (the parquet blob + Arrow
rebuild). The hub cache uses content-addressed dedup, so the same blob is
shared across revisions and even repos.
Cleaning up
Defer to HF's own cache CLI — it understands the layout above and will not accidentally orphan blobs that are still referenced from another revision:
hf cache ls # show what's there
hf cache prune # drop unreferenced revisions
hf cache rm <selector> # remove a specific repo or revision
hf cache verify # check integrityFull reference: HF cache management.
Cache location overrides
| Env var | What it sets | Default |
|---|---|---|
HF_HOME | Root of all HF state | ~/.cache/huggingface |
HF_HUB_CACHE | Hub-only cache (blobs/snapshots) | $HF_HOME/hub |
HF_DATASETS_CACHE | datasets-library Arrow cache | $HF_HOME/datasets |
HF_XET_CACHE | Xet chunk staging | $HF_HOME/xet |
When running on NFS, point HF_XET_CACHE=/local/ssd to avoid roundtrips.
See Performance tuning for more knobs.
whest dataset pullcontinues to work as a deprecated alias fordownloadthrough v0.6.
If it broke (long pause, disk full, gated dataset,
cas-bridge.xethub.hf.coURLs you don't recognise), jump to Troubleshooting.
Streaming mode
You want to score against a small slice of a remote dataset without paying
the cost of a full download. whest run --streaming consumes the dataset
row-group-by-row-group over HTTP instead of downloading it first.
When to use
- You're iterating on estimator code with
--n-mlps 5(or some small K). Streaming fetches only the first ⌈K/47⌉ row groups (~95 MB each for the warmup dataset) instead of the full 2 GB. - You're on a constrained-disk environment (CI runner, container).
- You want a fast first-row response time more than total throughput.
When NOT to use
- Repeated full evaluations of the same dataset. Streaming does NOT populate the cache — every run re-fetches. Use the default materialise path instead.
- Anything that needs random access.
IterableDatasetis iteration-only;len(ds),ds[i], andds.shuffle(seed=…)don't work as expected.
Trade-off table
| Property | Materialise (default) | --streaming |
|---|---|---|
| First-row latency, cold cache | ~30 s (full download) | ~5 s |
| First-row latency, warm cache | ~2 s | ~5 s (re-fetch) |
| Disk usage | ~4 GB (blob + Arrow) | 0 |
| Subsequent runs | ~2 s (cache hit) | ~5 s (re-fetch every time) |
| Random access | Yes | No |
Authentication and streaming
Unauthenticated requests to HF are rate-limited and noticeably slower. Run
hf auth login once to set a token; streaming throughput typically improves
30–50% authenticated.
Example
whest run --estimator estimator.py \
--dataset hf://aicrowd/arc-whestbench-public-2026@v1-warmup \
--streaming \
--n-mlps 5You'll see a ⚠ Streaming from HF warning at startup, then a progress
indicator while the first row group is fetched, then scoring begins.
Streaming is incompatible with
--jsonoutput (it would corrupt JSON ordering) andlen(ds)raises on a streaming dataset. Both are documented under Troubleshooting.
Multi-split datasets
A dataset can contain multiple disjoint groups of MLPs — typically public
(open to participants for tuning) and holdout (used only by the leaderboard
grader). One repo, two splits.
When and why
- Leaderboard datasets: participants score against
publiclocally, the leaderboard grader scores againstholdout. Same parquet schema, same hardware fingerprint, different seeds. - Train/validation flow: split a dataset into
train/val/testfor meta-learning experiments on top of WhestBench.
Baking a split
Each split is baked separately. Make sure to use distinct --mlp-seeds files
so the splits don't overlap:
whest dataset bake --n-mlps 500 --split public --config default --output ./eval-public
whest dataset bake --n-mlps 500 --split holdout --config holdout --output ./eval-holdoutCombining splits into one multi-split directory
whest dataset combine-splits ./eval-public ./eval-holdout --output ./eval-fullThe result is a single dataset directory with both splits in data/,
suitable for whest dataset upload to a single HF repo. combine-splits
preserves each bake's config metadata, so the published card can expose the
same config-per-split layout as the official HF datasets.
Selecting a split when running
whest run --estimator estimator.py \
--dataset hf://aicrowd/eval-full@v1 \
--split publicWithout --split, multi-split datasets are rejected by whest run (the
scoring path scores against exactly one split at a time, by design).
Inspecting splits
whest dataset info ./eval-full
# Reports each split's n_mlps and seed.If
combine-splitscomplains about overlappingmlp_seeds or mismatched hardware fingerprints, see Troubleshooting.
Performance tuning
These are power-user knobs. The defaults are fine for almost everyone.
Xet high-performance mode
If you have ≥64 GB RAM and a fat uplink:
export HF_XET_HIGH_PERFORMANCE=1Saturates both bandwidth and CPU cores. Helpful when downloading many-GB datasets to a workstation. Reference: HF Xet storage docs.
Local SSD for the Xet cache
If your HF cache is on NFS or a slow disk:
export HF_XET_CACHE=/local/ssd/hf-xetKeeps the chunk staging cache on fast local storage. The main hub cache
(HF_HUB_CACHE) can stay on NFS — only the per-chunk Xet metadata is
roundtrip-sensitive.
Disabling Xet entirely
export HF_HUB_DISABLE_XET=1Falls back to plain LFS transport. Rarely useful; only reach for it if you've confirmed a Xet-specific bug.
Disabling progress bars (CI)
export HF_HUB_DISABLE_PROGRESS_BARS=1Whestbench's say.* lines still emit; only the progress bars are suppressed.
For complete silence add --quiet to the whest invocation.
Troubleshooting
"I see a long pause and no output."
Cache miss on a cold HF cache. Watch the progress bar — for the warmup
dataset it's ~30 s on a 70 MB/s link. To avoid silent re-downloads, run
whest dataset download ahead
of time, or ls ~/.cache/huggingface/hub/ to confirm progress.
"Downloads feel slow."
You're probably unauthenticated; HF rate-limits anonymous traffic.
Run hf auth login once and re-run. See also
Authentication and streaming.
"Disk filled up."
HF stores blobs in both ~/.cache/huggingface/hub/ (raw download) and
~/.cache/huggingface/datasets/ (regenerated Arrow). Use
hf cache prune to drop unreferenced revisions, then hf cache ls to
verify reclaimed space. See Cleaning up.
"401/403 on upload."
Your token doesn't have write scope. Re-login with
hf auth login --token <new-token> from a token created with write access.
For org-owned repos, your account also needs membership in the org.
"Cannot use --streaming with --json output."
Known limitation — streaming progress events would corrupt JSON ordering.
Drop --json, or drop --streaming.
"len(ds) raises on a streaming dataset."
Expected per HF docs. Use whestbench.metadata(ds)["n_mlps"] instead — it
reflects the upstream metadata.json, not the local materialised count.
"I see cas-bridge.xethub.hf.co URLs but the file is LFS."
That's HF's Xet bridge transparently serving legacy LFS content via the Xet
CDN edge. No action required. If you need to force plain-LFS transport for
debugging, set HF_HUB_DISABLE_XET=1 (see
Disabling Xet entirely).
"Dataset is gated."
Request access on the dataset page (HF will email you a link from
https://huggingface.co/datasets/<repo>), then re-run. Make sure you're
authenticated with the same account that was granted access.
Reference
Format
- Schema 3.0 spec: dataset-format.
SDK surface
import whestbench as wb
ds = wb.load_dataset(
"aicrowd/foo", revision="v1", split="public", streaming=False
)
for mlp in wb.iter_mlps(ds):
...
mlp = wb.mlp_at(ds, 0) # random access (materialised datasets only)
md = wb.metadata(ds) # the dataset's metadata.json
wb.publish_dataset(
"./my-eval", repo_id="aicrowd/foo", tag="v1"
)
wb.merge_datasets(["./partial-a", "./partial-b"], output_dir="./full")
wb.combine_split_datasets(
["./public", "./holdout"], output_dir="./full"
)CLI verbs (canonical names)
| Verb | Purpose | Deprecated alias |
|---|---|---|
whest dataset bake | Generate locally | — |
whest dataset upload | Publish to HF | push |
whest dataset download | Fetch from HF | pull |
whest dataset info | Show metadata | inspect |
whest dataset merge | Concatenate partials | — |
whest dataset combine-splits | Assemble multi-split | — |
Deprecated aliases continue to work through v0.6 and emit a deprecation warning. v0.7 removes them.
Environment variables
| Var | Purpose |
|---|---|
HF_TOKEN | Auth token (lazy — only when needed) |
HF_HOME | Root of HF state (~/.cache/huggingface by default) |
HF_HUB_CACHE | Hub blobs cache |
HF_DATASETS_CACHE | datasets-library Arrow cache |
HF_XET_CACHE | Xet chunk staging |
HF_XET_HIGH_PERFORMANCE | Saturate bandwidth + CPU |
HF_HUB_DISABLE_PROGRESS_BARS | Suppress progress bars |
HF_HUB_DISABLE_XET | Force plain-LFS transport |
HF_HUB_DISABLE_IMPLICIT_TOKEN | Don't send token on read calls |
NO_COLOR | Disable ANSI colours |
CLI flag conventions
--repo-type, --revision, --token, --cache-dir, --quiet, --json,
--format {auto,human,agent,json,quiet}, --dry-run, --exist-ok. Adopted
from HF's hf CLI
for consistency.
See also
- CLI reference — exhaustive flag list per verb.
- Dataset format spec — schema 3.0 on-disk layout.
- Parallel bake how-to — distributed
bake+merge. - Publish to HF Hub how-to — token setup, repo creation.