Validate, Run, and Package
Sourced from whest-starterkit @
aaa3882.
Validate, Run, and Package
π― When to use this page
Use this page for the standard local participant loop.
π Do this now
Validate estimator loading and output contract:
whest validateis a fast sanity check using a small fixed MLP (width=4, depth=2). It verifies loading, output shape, and value finiteness β not full behavioral or performance correctness. Always follow withwhest runfor realistic tests.
whest validate --estimator estimator.pyRun local scoring (recommended default runner):
whest run --estimator estimator.pywhest run defaults to --runner local for fast iteration.
Run against the published evaluation dataset on HuggingFace (skips sampling β much faster for repeated runs, no local bake needed):
whest run \
--estimator estimator.py \
--dataset hf://aicrowd/arc-whestbench-public-2026@v1-warmup
# auto-resolves to the `mini` split (100 MLPs, ~250 MB cached after first call)Or bake a custom local dataset once and reuse it:
whest dataset bake --output ./my-eval --n-mlps 10 --n-samples 10000
whest run --estimator estimator.py --dataset ./my-evalSee Use Evaluation Datasets for details.
Run faster local debug path:
whest run --estimator estimator.py --runner localRun with machine-readable output:
whest run --estimator estimator.py --runner local --format json--json still works as an alias, but --format rich|plain|json is the canonical output selector across the CLI.
Package submission artifact:
whest package --estimator estimator.py --output ./submission.tar.gzOptional files during packaging:
whest package \
--estimator estimator.py \
--requirements requirements.txt \
--submission-metadata submission.yaml \
--approach APPROACH.md \
--output ./submission.tar.gzUseful whest run flags
These all show up in whest run --help but get lost there. Reach for them when:
| Flag | Reach for it when⦠|
|---|---|
--seed N | Deterministic comparison between two estimator versions. Pin the seed and the same MLPs, the same per-MLP mlp.seed values, and the same SetupContext.seed are used across runs. Also accepted by whest validate (seeds the validation setup(ctx) call). With --dataset, the dataset supplies the per-MLP seeds and --seed controls ctx.seed only. |
--n-samples N | Ground-truth sampling samples per MLP. The contest default (in whest run without an explicit override) is 100 * 100 * 256 = 2,560,000; whest dataset bake --n-samples defaults to 10000. Drop to --n-samples 5000 for a ~10x faster local sanity check; raise back up before drawing real conclusions. |
--n-mlps N | Default 10. Drop to 3 while iterating to halve runtime; raise to 20+ when you're trying to reduce noise on a close score. |
--flop-budget N | Default 6.8e10 (the grader effective-compute budget β caps C_m = F_m + λ·R_m, not just analytical FLOPs). Bump to 1e11 to confirm an algorithm idea isn't budget-starved before optimizing for budget. |
--profile | Emits a per-namespace FLOP/time breakdown so you can see where your estimator burns the budget. |
--show-diagnostic-plots | Renders convergence and per-layer error plots inline (terminal-friendly). Pairs well with --profile. |
--max-threads N | Pin the BLAS thread pool size so wall_time_s is comparable across machines. Useful when triaging a "fast on my laptop, slow in CI" report. |
--detail {raw,full} | raw strips Rich formatting (handy for tee-ing logs); full adds the per-MLP raw arrays. |
--wall-time-limit S | Cap each predict() call's wall time. Useful when local debugging hangs on a numerical edge case. |
--residual-wall-time-limit S | Cap time spent outside flopscope ops (Python plumbing, scipy calls). Surfaces "looks fast in FLOPs but Python is the bottleneck" issues. |
--debug + --fail-fast | First exception β halt + raw traceback. Combine for the fastest "what broke?" loop. |
β Expected outcome
validatepasses,runproduces a score report,packagecreates a.tar.gzartifact.
β οΈ Common first failure
Symptom: run fails after validate passed.
Use this escalation flow:
- Retry with debug info:
whest run --estimator estimator.py --debug- If traceback still feels opaque, rerun in-process:
whest run --estimator estimator.py --runner local --debugFor runner modes, see Stage 3: Run Locally, Stage 4: Subprocess Runner, and the Debugging Checklist.
Concrete example:
Error [predict:PREDICT_ERROR]: Estimator predict failed.
Use --debug to include a traceback.
Tip: For estimator-level tracebacks, rerun with --runner local --debug.Next command to run:
whest run --estimator estimator.py --runner local --debug