Stage 3: Run on the Public Set (In-Process Harness)
Sourced from whest-starterkit @
aaa3882.
Stage 3: Run on the Public Set (In-Process Harness)
Stage 2 confirms the contract. Stage 3 runs the real scoring pipeline (the same one the grader uses) against the public Mini split — 100 fixed MLPs with baked N=1e9 ground truth — and in-process, so you can drop import pdb; pdb.set_trace() anywhere in predict() and step through it.
🚀 Run it
uv run whest run --estimator estimator.py --dataset hf://aicrowd/arc-whestbench-public-2026 --split mini --runner local--split mini selects the 100-MLP Mini split (it's the default split, so you can omit --split); local is the default runner, so you can omit --runner local too. Ground truth is precomputed at N=1e9, so there's no sampling step — after the first download (~250 MB, cached) it scores in seconds. The FLOP budget is 6.8e10 (68B) and the MLP shape is the competition size (width=256, depth=8). (Omit --dataset and whest run instead generates a fresh random 10-MLP suite on the fly, computing ground truth with 2,560,000 Monte-Carlo samples — slower and not reproducible. Fine for a quick pdb poke; use the Mini split for real scoring.)
You'll see a Rich-rendered report with five panels:
- Run Context — estimator class, path, timestamps,
n_mlps,width,depth,flop_budget. - Hardware & Runtime — host, OS, CPU, RAM, Python and NumPy versions (so a leaderboard score is reproducible across machines).
- Sampling Budget Breakdown (Ground Truth) — provenance/FLOPs for the reference ground truth (loaded from the baked dataset with
--dataset; sampled locally otherwise). - Estimator Budget Breakdown — same fields for your
predict()call(s). - Final Score — the headline metrics:
╭──────────────────────── Final Score ────────────────────────╮
│ Adjusted Final-Layer Score [adjusted_final_layer_score] │
│ ≈ 0.083 ← primary score (what the leaderboard ranks on) │
│ Raw Final-Layer MSE [final_layer_mse] ≈ 0.83 │
│ All-Layers MSE [all_layers_mse] ≈ 0.65 │
│ ─────── │
│ Best MLP [best_mlp_adjusted_final_layer_score] ≈ 0.037 │
│ Worst MLP [worst_mlp_adjusted_final_layer_score] ≈ 0.18 │
│ ─────── │
│ Mean Score Multiplier [mean_score_multiplier] ≈ 0.10 │
│ Mean Compute Utilization [mean_compute_utilization] ≈ 5e-6 │
│ Failed MLPs [n_failed_mlps] 0 of 100 │
╰ per-MLP score = final_layer_mse × max(0.1, C_m / flop_budget) ╯With the zeros template, the raw MSE rows (final_layer_mse ≈ 0.83, all_layers_mse ≈ 0.65) reflect the natural variance of the ReLU activations. But the metric you are ranked on is adjusted_final_layer_score: because the zeros template spends almost no compute (~5e-6 of the budget), its multiplier sits at the 0.1 floor, so the leaderboard score is about 0.83 × 0.1 ≈ 0.083. Because the Mini split is fixed, these numbers are reproducible — no --seed needed. (adjusted_final_layer_score is the mean across MLPs of final_layer_mse × max(0.1, C_m / flop_budget); the raw final_layer_mse / all_layers_mse carry no multiplier.) See score-report-fields.md for the full schema.
FLOP-budget callout: Stage 1 vs Stage 3
Stage 1's local_engine.compare_against_monte_carlo runs your predict() under estimator_budget=1e9. Stage 3's whest run uses the grader default flop_budget=6.8e10 — about 68× larger. So Stage 1 is the tighter budget here: if your estimator fits in Stage 1, it has ample headroom at the grader budget, and budget exhaustion is unlikely to be why a Stage-1-good estimator scores differently in Stage 3.
Why a different score than Stage 1?
Both stages use the same MLP shape (width=256, depth=8). The numbers still differ because:
- Stage 1 scores your estimator against one fixed MLP (
build_mlp(width=256, depth=8, seed=0)) and prints raw MSE as Monte-Carlo ground truth converges (10 → 100,000 samples). - Stage 3 scores the 100 MLPs of the public Mini split against their baked N=1e9 ground truth, and reports the budget-adjusted
adjusted_final_layer_scoreaveraged across the suite — not raw MSE.
So Stage 3's headline number is averaged over 100 MLPs and scaled by the compute multiplier; expect it to differ from the single-MLP raw MSE you saw in Stage 1.
Debugging
Because --runner local runs in-process, pdb works:
def predict(self, mlp: MLP, budget: int) -> fnp.ndarray:
import pdb; pdb.set_trace()
...✅ Expected outcome
| Estimator | Typical raw final_layer_mse (public Mini split, 100 MLPs) |
|---|---|
| Zeros template | ~0.83 (the all-zeros accuracy floor) |
02_mean_propagation | ~7.5e-04 |
03_covariance_propagation | ~3.7e-05 |
These are the raw final-layer MSEs (the accuracy signal). Your leaderboard adjusted_final_layer_score scales each by the compute multiplier max(0.1, C_m / flop_budget) — and since these all use <1% of the budget, the ranked number is exactly one-tenth of the value shown (the 0.1 floor).
(Same ballpark as the Stage 1 table because the math and shape are the same (width=256, depth=8); they differ because Stage 3 scores the 100 fixed Mini MLPs against baked ground truth, while Stage 1 scores one fixed MLP with on-the-fly Monte Carlo.) Full benchmark methodology in scoring-model.md.
✅ When you're ready
Move on to Stage 4: subprocess runner for grader parity.