Generating Large Datasets on GPU
For ground-truth bakes with n_samples ≥ 10⁸, the default CPU path is slow. The
optional torch backend runs the same computation on GPU (or torch CPU for dev).
An n_samples=10⁹ bake at default config (10 MLPs) takes ~30 hours on CPU but
~15–30 min on a single GPU. Wall time scales linearly with n_mlps; see
Performance expectations below for measured numbers.
Install
pip install whestbench[gpu]
This pulls in torch as an optional dependency. The standard pip install whestbench does not include torch.
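To confirm that the extra actually landed in the active environment, a small standard-library check is enough. This is a minimal sketch; `has_module` is a hypothetical helper written for illustration and is not part of whestbench.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if the top-level module `name` is importable here."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if has_module("torch"):
        print("torch found: the --torch backend should be usable")
    else:
        print("torch missing: run `pip install whestbench[gpu]`")
```

The check only looks the module up and does not import it, so it stays fast even when torch is installed.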
Quick start
# Auto-detect best available device (cuda > mps > cpu)
# WARNING: this takes ~4 hours on L40S, ~14 h on M3 Max. Calibrate first
# (see "Calibration recipe" below) before committing to a multi-hour bake.
whest dataset bake \
--torch --device auto \
--n-mlps 100 --n-samples 1_000_000_000 \
--width 256 --depth 8 --seed 42 \
--output ./ground-truth
# Smaller production-realistic example (10 MLPs × 10⁹ ≈ 25 min on L40S)
whest dataset bake \
--torch --device cuda \
--n-mlps 10 --n-samples 1_000_000_000 \
--width 256 --depth 8 --seed 42 \
--output ./data
# Develop on laptop using torch CPU (works without GPU)
whest dataset bake \
--torch --device cpu \
--n-mlps 5 --n-samples 100_000 \
--width 256 --depth 8 \
--output ./dev
Output is a directory (schema 3.0 layout), not a single .npz file:
./ground-truth/
├── data/public-00000-of-00001.parquet
├── metadata.json
└── README.md
Load it with whestbench.load_dataset or push to HF Hub with
whest dataset push. The array schema is identical to a CPU bake: the same
8 Parquet columns, same mlp_name values at the same seed.
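Before loading or pushing, it can help to sanity-check that a bake directory has the layout shown above. The sketch below uses only the standard library. `check_bake_layout` is a hypothetical helper, and it checks only the three entries from the tree listing. It makes no assumption about what metadata.json contains.

```python
import json
from pathlib import Path


def check_bake_layout(root):
    """Return a list of problems found in a baked dataset directory."""
    root = Path(root)
    problems = []
    # At least one Parquet shard under data/ (e.g. public-00000-of-00001.parquet).
    if not sorted((root / "data").glob("*.parquet")):
        problems.append("no Parquet shards under data/")
    # metadata.json must exist and parse as JSON.
    meta = root / "metadata.json"
    if not meta.is_file():
        problems.append("metadata.json is missing")
    else:
        try:
            json.loads(meta.read_text())
        except json.JSONDecodeError:
            problems.append("metadata.json is not valid JSON")
    if not (root / "README.md").is_file():
        problems.append("README.md is missing")
    return problems


if __name__ == "__main__":
    issues = check_bake_layout("./ground-truth")
    print("layout OK" if not issues else "\n".join(issues))
```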
The seed → name mapping is stable across machines as long as the installed
faker version matches the pin in pyproject.toml. Bumping faker is a deliberate
operation; the lock-down test in tests/test_naming.py trips when faker's
wordlists change, and reference datasets must be re-baked alongside the
version bump.
Parallel bakes with --slice
For very large bakes, use --slice K/N to distribute across multiple GPU workers.
Each worker produces a partial directory; run whest dataset merge afterwards.
# 4 workers
whest dataset bake --slice 0/4 --torch --device cuda \
--n-mlps 400 --n-samples 1_000_000_000 \
--width 256 --depth 8 --seed 42 --output ./p0
# ... (repeat for slices 1/4, 2/4, 3/4)
# Merge
whest dataset merge ./p0 ./p1 ./p2 ./p3 --output ./final
See Parallel bake across multiple GPUs for the full walkthrough.
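Typing out one command per slice is error-prone, so the per-worker commands can be generated instead. The sketch below builds the argument lists using only the flags shown above. `slice_commands` is a hypothetical helper, and it assumes every slice takes the same arguments apart from --slice and --output.

```python
import shlex


def slice_commands(n_slices, n_mlps, n_samples,
                   width=256, depth=8, seed=42, device="cuda"):
    """Build one `whest dataset bake` argv per slice, plus the merge argv."""
    bakes = []
    for k in range(n_slices):
        bakes.append([
            "whest", "dataset", "bake",
            "--slice", f"{k}/{n_slices}",
            "--torch", "--device", device,
            "--n-mlps", str(n_mlps), "--n-samples", str(n_samples),
            "--width", str(width), "--depth", str(depth),
            "--seed", str(seed), "--output", f"./p{k}",
        ])
    parts = [f"./p{k}" for k in range(n_slices)]
    merge = ["whest", "dataset", "merge", *parts, "--output", "./final"]
    return bakes, merge


if __name__ == "__main__":
    bakes, merge = slice_commands(4, n_mlps=400, n_samples=1_000_000_000)
    for argv in bakes:
        print(shlex.join(argv))  # run each line on its own GPU worker
    print(shlex.join(merge))     # run once all partial directories exist
```

Returning argument lists rather than strings keeps the output usable with `subprocess.run` as well as for printing.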
Publishing
After baking (and optionally merging), push to HF Hub:
whest dataset push ./ground-truth \
--repo aicrowd/arc-whestbench-2026 \
--tag v1
See Publishing a dataset to HuggingFace Hub.
Device selection
| --device | Behavior |
|---|---|
| omitted | Use the default flopscope CPU path (no torch needed). |
| auto | Resolves cuda > mps > cpu at runtime. |
| cuda | Explicit CUDA. Errors if torch.cuda.is_available() is False. |
| mps | Apple Silicon GPU. Errors if MPS is unavailable. |
| cpu | Torch on CPU. First-class dev option, not a silent fallback. |
There is no automatic fallback to CPU if a GPU device is requested but unavailable. Explicit device choices are honored or rejected loudly.
--max-threads cannot be combined with --torch; torch manages threading
internally.
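The resolution rules in the table can be sketched as a pure function. This is an illustration of the documented behavior, not the whestbench implementation. `resolve_device` is a hypothetical name, and the availability flags are passed in as booleans so the logic can be exercised without torch installed.

```python
def resolve_device(requested, cuda_available, mps_available):
    """Mirror the documented --device rules on plain booleans.

    In a real session the flags would come from torch.cuda.is_available()
    and torch.backends.mps.is_available().
    """
    if requested == "auto":
        # Priority order: cuda > mps > cpu.
        if cuda_available:
            return "cuda"
        if mps_available:
            return "mps"
        return "cpu"
    if requested not in ("cuda", "mps", "cpu"):
        raise ValueError(f"unknown device: {requested!r}")
    # Explicit choices are honored or rejected; there is no silent fallback.
    if requested == "cuda" and not cuda_available:
        raise RuntimeError("CUDA requested but not available")
    if requested == "mps" and not mps_available:
        raise RuntimeError("MPS requested but not available")
    return requested
```

Only `auto` ever changes the device. Every explicit value either comes back unchanged or raises.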
Performance expectations
Key finding from L40S benchmarking: at width=256, effective throughput
is bottlenecked at ~7–10 TFLOP/s on modern GPUs regardless of peak fp32 spec.
The matmul is too small to saturate tensor cores, and TF32/fp16 give negligible
speedup at this size (measured: ~2% on L40S). Don't extrapolate from peak
fp32 ratings; they overestimate by 5–10× for this workload.
Measured (NVIDIA L40S, AWS g6e.xlarge)
| n_mlps | n_samples | wall time | effective throughput |
|---|---|---|---|
| 10 | 10⁶ | 1.41 s | ~7.5 TF |
| 100 | 10⁶ | 13.78 s | ~7.5 TF |
| 10 | 10⁹ | ~23 min (linear projection) | — |
| 100 | 10⁹ | ~3.9 h (linear projection)† | — |
† TODO: confirm with full-bake measurement — calibration anchored on N=10⁶ predicts 3.9 hours; a 100 MLPs × 10⁹ bake is in progress as of this writing.
Scaling on L40S is linear in n_mlps and n_samples, and quadratic in width.
The mlps_per_batch knob has near-zero impact at L40S scale
(measured ≤ 0.4% spread across B ∈ {4, 8, 16, 32}).
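The scaling rule translates directly into a projection from a short anchor run. The sketch below assumes the stated scaling (linear in n_mlps and n_samples, quadratic in width) holds across the whole range. `project_seconds` is a hypothetical helper, not a whestbench function.

```python
def project_seconds(anchor_seconds, anchor_mlps, anchor_samples,
                    n_mlps, n_samples, anchor_width=256, width=256):
    """Scale a measured wall time to a larger bake."""
    scale = (
        (n_mlps / anchor_mlps)
        * (n_samples / anchor_samples)
        * (width / anchor_width) ** 2
    )
    return anchor_seconds * scale


if __name__ == "__main__":
    # Anchors are the measured L40S rows at N=10^6 from the table above.
    secs = project_seconds(1.41, 10, 10**6, n_mlps=10, n_samples=10**9)
    print(f"10 MLPs x 1e9 samples: {secs / 60:.1f} min")
    secs = project_seconds(13.78, 100, 10**6, n_mlps=100, n_samples=10**9)
    print(f"100 MLPs x 1e9 samples: {secs / 3600:.1f} h")
```

With the 1.41 s anchor, the first projection comes out at about 23.5 minutes, in line with the ~23 min row.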
Extrapolations to other GPUs
Anchor: 7.5 TFLOP/s effective on L40S. For modern Ampere+/Ada/Hopper at
width=256, expect 5–10 TFLOP/s in practice — variation between cards is
small because the small-matmul ceiling binds before peak compute matters.
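An effective-throughput figure can be turned into a wall-time estimate with a simple cost model. The sketch below assumes each MLP performs `depth` dense width × width matmuls per sample at 2 FLOPs per multiply-add, and ignores biases and activations. That model is an assumption rather than something taken from the whestbench source. It does give roughly 7.4 TFLOP/s for the measured 1.41 s run, consistent with the ~7.5 TF anchor.

```python
def forward_flops(n_mlps, n_samples, width=256, depth=8):
    """Assumed cost model: `depth` square width x width matmuls per sample."""
    return 2 * width * width * depth * n_samples * n_mlps


def effective_tflops(flops, seconds):
    """Achieved throughput in TFLOP/s for a run of `seconds`."""
    return flops / seconds / 1e12


def estimated_minutes(flops, tflops):
    """Wall time in minutes at a given effective throughput."""
    return flops / (tflops * 1e12) / 60


if __name__ == "__main__":
    # L40S measurement: 10 MLPs x 1e6 samples in 1.41 s.
    small = forward_flops(10, 10**6)
    print(f"{effective_tflops(small, 1.41):.2f} TFLOP/s effective")
    # Wall-time band for 10 MLPs x 1e9 samples at 10 and 5 TFLOP/s.
    big = forward_flops(10, 10**9)
    print(f"{estimated_minutes(big, 10):.0f} to {estimated_minutes(big, 5):.0f} min")
```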
| Hardware | 10 MLPs × 10⁹ (est.) | 100 MLPs × 10⁹ (est.) |
|---|---|---|
| L40S (g6e.xlarge) | ~23 min (projected) | ~3.9 h (projected) |
| H100 PCIe | ~15–25 min | ~2.5–4 h |
| RTX 4090 | ~20–35 min | ~3.5–6 h |
| A100 80GB | ~25–40 min | ~4–6.5 h |
| RTX 3090 | ~30–50 min | ~5–8 h |
| Apple M3 Max (mps) | ~2.3 h (measured) | ~14 h (measured) |
| CPU (flopscope) | ~30 h | ~12 days |
Strong recommendation: run a 60-second calibration on your actual GPU before committing to a multi-hour bake — see Calibration recipe below.
Calibration recipe
A 60-second N=10⁶ run on any GPU gives a precise wall-time projection for
your actual N=10⁹ bake. Run this once when you spin up the instance:
import time
from pathlib import Path
import torch
from whestbench.dataset_torch import create_dataset_torch
# Warmup (kernel compilation, ~0.2s on cuda)
create_dataset_torch(
n_mlps=2, n_samples=10_000, width=256, depth=8,
seed=0, output_path=Path('/tmp/warmup'), device='cuda')
# Calibration anchored on n_mlps=10 to match the production setup
t0 = time.perf_counter()
create_dataset_torch(
n_mlps=10, n_samples=1_000_000, width=256, depth=8,
seed=42, output_path=Path('/tmp/cal'), device='cuda')
torch.cuda.synchronize()
elapsed = time.perf_counter() - t0
print(f'{elapsed:.2f}s at N=10⁶ → projected {elapsed*1000/60:.1f} min at N=10⁹')
torch.cuda.synchronize() is critical: CUDA ops are async; without it
you'd measure dispatch time, not compute time.
If the projection looks reasonable, proceed with the full bake. If it's
2× higher than expected, check torch.backends.cuda.matmul.allow_tf32
(default False in recent torch) — but expect only marginal speedup
since matmuls are small.
Verifying the output
Datasets baked with --torch have identical Parquet column layout to default
(flopscope) datasets. Provenance is in metadata.json:
import whestbench
ds = whestbench.load_dataset("./ground-truth")
md = whestbench.metadata(ds)
backend = md.get("backend", "flopscope") # "torch" or "flopscope"
device = md.get("device") # "cuda" | "mps" | "cpu" if torch
torch_version = md.get("torch_version")
Reproducibility
Datasets are deterministic per (seed, device, torch_version). The seed
hierarchy is identical to the flopscope path; only the leaf RNG that produces
input samples changes (numpy PCG64 → torch Philox/MT).
Important: the same seed on the CPU (flopscope) and torch paths will not
produce bitwise-identical datasets — different RNG algorithms. They are
statistically equivalent: per-neuron means agree within Monte Carlo noise
(~3×10⁻⁵ at N=10⁹).
Precision strategy
The torch backend uses fp32 matmul + fp64 reduction accumulators on CUDA and
CPU, matching the flopscope path's numerical semantics. On MPS (which does not
support fp64) the accumulators fall back to fp32 — this is acceptable for dev
workflows where N ≤ 10⁵, since fp32 accumulation error is comparable to Monte
Carlo noise at those scales. For production N=10⁹ bakes, use --device cuda.
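The effect of accumulator width can be seen without a GPU. The sketch below emulates an fp32 accumulator in pure Python by rounding the running sum to fp32 after every addition, and compares it with an fp64 accumulator over the same fp32 inputs. It illustrates the numerical effect only and is not the backend's reduction code. `to_fp32` and `running_sum` are hypothetical helpers.

```python
import struct


def to_fp32(x):
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def running_sum(values, fp32_accumulator):
    """Sum `values`, optionally rounding the accumulator to fp32 each step."""
    acc = 0.0
    for v in values:
        acc += v
        if fp32_accumulator:
            acc = to_fp32(acc)  # emulate an fp32 accumulator
    return acc


if __name__ == "__main__":
    n = 200_000
    x = to_fp32(0.1)   # an fp32 input value, as a matmul would produce
    exact = n * x      # exactly representable in fp64 for these sizes
    err32 = abs(running_sum([x] * n, True) - exact)
    err64 = abs(running_sum([x] * n, False) - exact)
    print(f"fp32 accumulator error: {err32:.3e}")
    print(f"fp64 accumulator error: {err64:.3e}")
```

The fp32 error grows with the number of terms because each addition is rounded to the spacing of the current sum. This is why the recommendation above keeps large production bakes on a device with fp64 accumulators.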
Python API for tuning
For power-user tuning beyond what the CLI exposes:
from whestbench.dataset_torch import create_dataset_torch
create_dataset_torch(
n_mlps=100, n_samples=10**9,
width=256, depth=8,
seed=42, output_path="ground_truth",
device="cuda",
mlps_per_batch=32, # default: min(n_mlps, 16). Larger uses more GPU memory.
chunk_size=1 << 20, # default: memory-aware on cuda, 65536 on mps/cpu.
)
See the docstring for full parameter semantics. The CLI exposes --device,
--mlps-per-batch, and --chunk-size; these are also available as Python-API
knobs for benchmarking.
Troubleshooting
ImportError: create_dataset_torch requires torch — Install the gpu
extra: pip install whestbench[gpu].
RuntimeError: CUDA requested but torch.cuda.is_available() is False —
Either CUDA isn't installed at the system level, or torch was installed
without CUDA support. Check python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)".
For dev without a GPU, use --device cpu.
Out of memory on GPU — Lower the Python-API knobs:
mlps_per_batch: fewer MLPs in parallel.
chunk_size: smaller chunks of samples per step.
The auto-tuned defaults target ~25% of free GPU memory; on very full GPUs you may need to override.
Dataset looks slightly different from a CPU bake at the same seed —
Expected (see Reproducibility above). To verify equivalence, compare means
within ~5/sqrt(n_samples) tolerance.
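The tolerance check above can be sketched as follows. It assumes the per-neuron means have already been extracted from each dataset into plain sequences of floats; how to pull them out of the Parquet columns is not shown here. `means_agree` is a hypothetical helper.

```python
import math


def means_agree(means_a, means_b, n_samples, k=5.0):
    """True if every per-neuron mean differs by at most k / sqrt(n_samples)."""
    if len(means_a) != len(means_b):
        raise ValueError("mean vectors must have the same length")
    tol = k / math.sqrt(n_samples)
    return all(abs(a - b) <= tol for a, b in zip(means_a, means_b))


if __name__ == "__main__":
    cpu_means = [0.10000, 0.25000]     # placeholder values for illustration
    torch_means = [0.10001, 0.24999]   # placeholder values for illustration
    print(means_agree(cpu_means, torch_means, n_samples=10**9))
```

At n_samples = 10⁹ the tolerance is about 1.6×10⁻⁴, roughly five times the Monte Carlo noise quoted under Reproducibility.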
Progress bar shows fewer chunks than expected — On GPU the chunk size is
much larger than on CPU (~64K–1M vs 4K), so there are 16–256× fewer chunks per
MLP. Total work units n_mlps * chunks_per_mlp still reflects the same total
samples processed.
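The chunk arithmetic is easy to reproduce. The sketch below assumes each MLP is processed in ceil(n_samples / chunk_size) chunks, with a final partial chunk counted as one unit. The rounding behavior is an assumption, and `work_units` is a hypothetical helper.

```python
def work_units(n_mlps, n_samples, chunk_size):
    """Total progress-bar units: n_mlps * chunks_per_mlp (ceiling division)."""
    chunks_per_mlp = (n_samples + chunk_size - 1) // chunk_size
    return n_mlps * chunks_per_mlp


if __name__ == "__main__":
    n_mlps, n_samples = 10, 10**9
    for label, chunk in [("4K", 4096), ("64K", 65536), ("1M", 1 << 20)]:
        units = work_units(n_mlps, n_samples, chunk)
        print(f"chunk_size={label}: {units} units")
```

Going from 4K to 64K chunks cuts the unit count by about 16×, and going to 1M cuts it by about 256×, which matches the range quoted above.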
Wall time is much longer than peak-fp32 math suggests — Expected. Peak
fp32 specs assume tensor cores can saturate, which requires large matmul
dimensions. At width=256 the matmuls are too small; effective throughput
plateaus at ~7–10 TFLOP/s on most modern GPUs regardless of the card's
rated peak fp32 throughput (tens of TFLOP/s on L40S- and H100-class cards). Tools like
nvidia-smi will correctly show 100% GPU utilization despite the low
effective TFLOP/s — the card is fully busy, the kernels are just shape-bound.
TF32 / fp16 give only ~2% speedup at this matmul size (measured), so don't
rely on them to close the gap. See Performance expectations.
mlps_per_batch doesn't seem to do anything — Correct. On CUDA at
width=256, varying mlps_per_batch between 4 and 32 has < 1% effect on
wall time (measured on L40S). The bottleneck is the per-chunk matmul shape,
not the batching layer. Don't waste time tuning it.