Scoring Model
Sourced from whest-starterkit @
aaa3882.
🎯 When to use this page
Use this page to understand how the leaderboard score is computed from your estimator's predictions.
Pipeline at a glance
        ┌─────────────────────┐
        │  random MLP_m       │   one of M MLPs (default M=10)
        │  flop_budget (B)    │
        └──────────┬──────────┘
                   │
                   ▼
        ┌─────────────────────────────────┐
        │  your predict(mlp_m, budget)    │   runs inside flopscope.BudgetContext
        │  (flopscope counts every op)    │
        └──────────┬──────────────────────┘
                   │
                   ▼
   effective compute C_m = F_m + λ·R_m  >  B ?
               /                 \
      yes    /   budget exceeded   \  no
           ▼                         ▼
┌──────────────────────┐  ┌──────────────────────┐
│ pred_m := zeros      │  │ pred_m := your array │
│ mult_m := 1.0        │  │ mult_m :=            │
│ (no compute discount)│  │   max(0.1, C_m / B)  │
└──────────┬───────────┘  └──────────┬───────────┘
           │                         │
           └────────────┬────────────┘
                        ▼
┌──────────────────────────────────────────────────┐
│ final_layer_mse_m = mean((pred_m - truth_m)²)    │
│                     over the final layer         │
│ adjusted_m = final_layer_mse_m × mult_m          │
└───────────────────────┬──────────────────────────┘
                        │
             (repeat for every MLP)
                        │
                        ▼
┌──────────────────────────────────────────────────┐
│ adjusted_final_layer_score = mean_m(adjusted_m)  │
│   ↑ THIS is the leaderboard ranking metric       │
│ final_layer_mse, all_layers_mse = mean raw MSE   │
│   (diagnostics only, no multiplier)              │
└──────────────────────────────────────────────────┘
                  lower is better
TL;DR
- Lower score is better.
- The leaderboard ranks on adjusted_final_layer_score: the final-layer MSE scaled by a compute multiplier max(0.1, C_m / flop_budget), then averaged across MLPs.
- final_layer_mse and all_layers_mse are reported too, but as raw diagnostics; they are not the metric you are ranked on.
- The multiplier rewards using less compute, down to a 10× discount floor (reached at 10% budget use). Below that floor, only accuracy moves your score.
- If your estimator exceeds the FLOP budget, all predictions for that MLP are zeroed and the multiplier is forced to 1.0, so a failure is strictly worse than the cheapest valid submission.
The core idea
The scoring model answers a specific question: how accurately can your estimator predict expected neuron values, and how little compute does it spend doing so?
Each estimator call is given a flop_budget: a cap on the floating-point operations it may perform, tracked analytically by flopscope. If the estimator stays within budget, its final-layer predictions are scored by MSE against Monte Carlo ground truth, and that MSE is then scaled by the share of the budget you used, giving the leaderboard score. If it exceeds the budget, all predictions for that MLP are replaced with zeros and no compute discount is applied.
How scoring works
For the configured FLOP budget B:
- Your estimator runs. Your predict(mlp, budget) is called. flopscope counts every floating-point operation analytically (F_m); the harness also measures the residual wall-time bucket (R_m), which is Python-side work that runs outside a flopscope kernel.
- Effective compute is formed. C_m = F_m + λ·R_m, where λ = 1e11 FLOPs/sec converts residual wall-time into FLOP-equivalents.
- Budget is checked. If C_m > B (or flopscope trips mid-run, or a wall-time limit fires), all predictions for this MLP are replaced with zero vectors and the compute multiplier is forced to 1.0.
- Raw accuracy is measured. The final-layer mean squared error (MSE) between your predictions and Monte Carlo ground truth is computed; this is final_layer_mse, a diagnostic.
- Score is the budget-adjusted MSE. The per-MLP score is final_layer_mse × max(0.1, C_m / B): accuracy scaled by the share of budget you used, with the discount capped at 10×.
The leaderboard metric, adjusted_final_layer_score, is this per-MLP score averaged across all MLPs (multiplier forced to 1.0 wherever the budget was exceeded). Lower is better.
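The steps above can be sketched for a single MLP as follows. This is a minimal illustration, not the harness's actual code: the helper name score_one_mlp and its arguments are hypothetical, the analytical FLOP count and residual wall-time are passed in as plain numbers rather than measured by flopscope, and only the post-hoc C_m > B check is modelled (not a mid-run trip or a wall-time limit).

```python
LAMBDA = 1e11  # FLOPs/sec; converts residual wall-time into FLOP-equivalents
FLOOR = 0.1    # multiplier floor, i.e. the discount is capped at 10x


def score_one_mlp(pred_final, truth_final, flops, residual_sec, budget):
    """Hypothetical helper: return (final_layer_mse, multiplier, adjusted score).

    pred_final and truth_final are equal-length sequences of final-layer values.
    """
    effective = flops + LAMBDA * residual_sec  # C_m = F_m + lambda * R_m
    if effective > budget:
        # Budget exceeded: predictions are zeroed and no discount applies.
        pred_final = [0.0] * len(truth_final)
        multiplier = 1.0
    else:
        multiplier = max(FLOOR, effective / budget)
    mse = sum((p - t) ** 2 for p, t in zip(pred_final, truth_final)) / len(truth_final)
    return mse, multiplier, mse * multiplier


# Toy numbers, purely illustrative: 5% of the budget is used,
# so the multiplier clamps at the 0.1 floor.
mse, mult, adjusted = score_one_mlp(
    pred_final=[0.1, 0.2], truth_final=[0.0, 0.4],
    flops=3.0e9, residual_sec=0.004, budget=6.8e10,
)
```

Returning the raw MSE alongside the adjusted score mirrors the report, where final_layer_mse stays visible as a diagnostic even though only the adjusted value is ranked.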
The formula
The leaderboard ranks on a single metric, adjusted_final_layer_score: the
final-layer MSE scaled by a per-MLP compute multiplier, then averaged across
MLPs. Lower is better.
$$
\text{adjusted\_final\_layer\_score} = \frac{1}{M} \sum_{m=1}^{M} \text{final\_layer\_mse}_m \times \max\left(0.1,\ \frac{C_m}{B}\right)
$$

$$
\text{final\_layer\_mse}_m = \frac{1}{n} \sum_{i=1}^{n} \left( \text{pred}_m[d-1, i] - \text{truth}_m[d-1, i] \right)^2 \quad \text{(final-layer cells only)}
$$

where:

- C_m = F_m + λ·R_m is the effective compute: analytical FLOPs F_m plus residual wall-time R_m converted at λ = 1e11 FLOPs/sec
- B = flop_budget
- max(0.1, C_m / B) is the compute multiplier; it caps the discount at 10× (the 0.1 floor) and is forced to 1.0 for any MLP whose budget was exceeded

Why "score" and not "MSE"? Once final_layer_mse is multiplied by the budget factor, the result is no longer a mean squared error; it is a derived ranking score. That is why the leaderboard field is named adjusted_final_layer_score (the _score suffix), while the raw diagnostics keep the _mse suffix.
The two raw MSEs are reported for diagnosis only; they carry no multiplier:

$$
\text{final\_layer\_mse} = \frac{1}{M} \sum_{m=1}^{M} \frac{1}{n} \sum_{i=1}^{n} \left( \text{pred}_m[d-1, i] - \text{truth}_m[d-1, i] \right)^2 \quad \text{(final-layer cells only)}
$$

$$
\text{all\_layers\_mse} = \frac{1}{M} \sum_{m=1}^{M} \frac{1}{d \cdot n} \sum_{k=0}^{d-1} \sum_{i=1}^{n} \left( \text{pred}_m[k, i] - \text{truth}_m[k, i] \right)^2 \quad \text{(all depth} \times \text{width cells)}
$$

where:

- M = number of MLPs in the suite (default 10; --n-mlps overrides)
- d = mlp.depth, n = mlp.width
- pred_m = (depth, width) array your predict() returned for MLP m (replaced with zeros if your call exceeded flop_budget)
- truth_m = Monte Carlo ground-truth means for MLP m

adjusted_final_layer_score is what the leaderboard ranks on. all_layers_mse
helps you diagnose whether your error concentrates in the final layer or
accumulates earlier; see also best_mlp_adjusted_final_layer_score and
worst_mlp_adjusted_final_layer_score in the
score report.
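As a rough sketch of how the three reported numbers relate, the aggregation over a suite of MLPs could look like the following. The function names layer_mse and suite_metrics are hypothetical, the inputs are plain nested lists rather than whatever array type the harness uses, and predictions are assumed to be already zeroed for any MLP that exceeded its budget.

```python
from statistics import fmean


def layer_mse(pred_row, truth_row):
    """Mean squared error over the cells of one layer."""
    return fmean((p - t) ** 2 for p, t in zip(pred_row, truth_row))


def suite_metrics(preds, truths, multipliers):
    """Hypothetical aggregation over M MLPs.

    preds / truths: one (depth x width) nested list per MLP.
    multipliers: one compute multiplier per MLP (1.0 where the budget was exceeded).
    """
    final_mses = [layer_mse(p[-1], t[-1]) for p, t in zip(preds, truths)]
    # Every layer has the same width, so averaging per-layer MSEs
    # equals averaging over all depth * width cells.
    all_mses = [
        fmean(layer_mse(pr, tr) for pr, tr in zip(p, t))
        for p, t in zip(preds, truths)
    ]
    return {
        "adjusted_final_layer_score": fmean(m * k for m, k in zip(final_mses, multipliers)),
        "final_layer_mse": fmean(final_mses),
        "all_layers_mse": fmean(all_mses),
    }


# Toy suite of two depth-2, width-2 MLPs (values are illustrative only).
preds = [[[0.0, 0.0], [1.0, 1.0]], [[0.5, 0.5], [0.0, 0.0]]]
truths = [[[0.0, 0.0], [1.0, 2.0]], [[0.5, 1.5], [0.0, 2.0]]]
report = suite_metrics(preds, truths, multipliers=[0.5, 1.0])
```

Note that the multiplier is applied per MLP before averaging, so adjusted_final_layer_score is generally not equal to final_layer_mse times the mean multiplier.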
Budget behavior
Your estimator receives a budget argument (the FLOP budget B). It is a fixed
hard cap for the run, so fixed-strategy estimators that always use the same
approach are a good default, as long as they stay within budget. Because the
score scales with C_m / B, spending less compute lowers (improves) your
score, but only until you hit the 0.1 floor at 10% budget use; below that,
further savings don't help and accuracy is the only lever left.
Budget enforcement rules
The budget is enforced analytically, and your compute usage feeds the score multiplier:
- Exceeded budget. If your effective compute C_m exceeds flop_budget (whether flopscope trips mid-run on analytical FLOPs, or the post-hoc C_m > B check fires on residual wall-time), all predictions for that MLP are replaced with zeros and the multiplier is forced to 1.0. This is a hard cutoff, not per-depth. (A failed MLP therefore takes a multiplier 10× higher than the cheapest valid submission, which earns the 0.1 floor.)
- Under budget. Predictions are used as-is, and the multiplier max(0.1, C_m / B) rewards using less of the budget, down to a 10× discount at 10% utilization.
- Multiplier floor. Below 10% budget use the multiplier clamps at 0.1, so there is no further reward for getting cheaper; accuracy is what remains.
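To see the three regimes side by side, the sweep below evaluates the multiplier at a few budget shares. The helper compute_multiplier is a hypothetical name used for illustration; it only models the multiplier, not the zeroing of predictions that accompanies an exceeded budget.

```python
def compute_multiplier(effective_compute, budget, floor=0.1):
    """Hypothetical helper mirroring the enforcement rules above."""
    if effective_compute > budget:
        return 1.0  # exceeded: no discount (predictions are also zeroed)
    return max(floor, effective_compute / budget)  # under budget, clamped at the floor


budget = 6.8e10
for share in (0.01, 0.10, 0.30, 1.00, 1.01):
    mult = compute_multiplier(share * budget, budget)
    print(f"{share:>5.0%} of budget -> multiplier {mult:.2f}")
```

The 1% and 10% rows give the same multiplier, which is the floor in action; the 100% and 101% rows also share a multiplier of 1.0, but only the latter has its predictions zeroed.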
What a good score looks like
A score near zero means your predictions are accurate and you used little compute. A score well above zero means either your predictions are inaccurate, or your estimator exceeded the FLOP cap and was zeroed (with no compute discount).
Scores below what sampling would achieve at that budget indicate your structural approach is genuinely better than brute-force Monte Carlo. That is the research milestone this challenge targets.
Practical tuning intuition
- Start with a safe method that consistently emits valid rows and stays within budget.
- Use flop_budget for hard-cap-aware implementation choices (not budget-time routing).
- Tune your implementation for the fixed budget profile you care about; the multiplier rewards staying well under budget, but only down to the 10% floor.
- Compare final_layer_mse and all_layers_mse in your reports to see which depths hurt your accuracy, and watch mean_score_multiplier to see how much the budget factor is scaling that accuracy.
- Use evaluation datasets to fix networks and ground truth across runs; this makes score comparisons meaningful and skips repeated sampling.
Worked example
Suppose ground truth for a 3-neuron final layer is [0.42, 0.38, 0.51] and your estimator predicts [0.40, 0.35, 0.55].
final_layer_mse = mean([(0.40 - 0.42)^2, (0.35 - 0.38)^2, (0.55 - 0.51)^2]) = mean([0.0004, 0.0009, 0.0016]) = (0.0004 + 0.0009 + 0.0016) / 3 = 0.000967
That 0.000967 is this MLP's raw final_layer_mse. To get the per-MLP score that the leaderboard actually uses, scale it by the compute multiplier. If the call spent 30% of its budget (C_m / B = 0.30, so the multiplier is max(0.1, 0.30) = 0.30):
adjusted_m = 0.000967 × 0.30 = 0.000290
The leaderboard adjusted_final_layer_score is the mean of these per-MLP adjusted_m values across all MLPs in the evaluation: a mean of means, not a sum. (Had the same call used ≤10% of budget, the multiplier would clamp at the 0.1 floor: 0.000967 × 0.1 = 0.0000967.)
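The arithmetic in this example can be checked with a few lines of plain Python. The variable names are illustrative; the 0.05 budget share in the last step is an arbitrary stand-in for "any share at or below 10%".

```python
truth = [0.42, 0.38, 0.51]
pred = [0.40, 0.35, 0.55]

# Raw final-layer MSE for this MLP.
final_layer_mse = sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)

adjusted_30 = final_layer_mse * max(0.1, 0.30)     # 30% of budget used
adjusted_floor = final_layer_mse * max(0.1, 0.05)  # at or below 10%: clamps to the floor

print(f"{final_layer_mse:.6f}  {adjusted_30:.6f}  {adjusted_floor:.7f}")
# prints: 0.000967  0.000290  0.0000967
```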
Example estimator benchmarks
The table below shows the raw-MSE diagnostics from the bundled example estimators run against the public release dataset (arc-whestbench-public-2026, mini split: 100 MLPs, width 256 / depth 8, N=1e9 baked ground truth) at the 6.8e10 FLOP budget. These are the unscaled final_layer_mse / all_layers_mse values; the leaderboard adjusted_final_layer_score multiplies final_layer_mse by max(0.1, C_m / budget) (≤ 1.0). Every bundled example spends well under 1% of the budget, so each one bottoms out at the 0.1 floor and its ranked score is exactly final_layer_mse ÷ 10. Use these as calibration points for your own estimator.
| Estimator | final_layer_mse | all_layers_mse | Approach |
|---|---|---|---|
| random_estimator | ~0.60 | ~0.42 | Returns random values; this is the interface walkthrough. The bundled estimator.py at the repo root is the true (all-zeros) baseline (~0.83); running uv run whest init <dir> in a fresh directory produces the same template. |
| mean_propagation | ~7.5e-04 | ~4.4e-04 | Diagonal variance, O(depth x width^2), ~2.7M FLOPs. ~1000x better than the zeros baseline. |
| covariance_propagation | ~3.7e-05 | ~1.7e-05 | Full covariance, O(depth x width^3), ~404M FLOPs. ~20x better again than mean propagation. |
How to read these numbers:
- The zeros baseline (estimator.py, ~0.83) and the random estimator (~0.60) give you the "doing nothing" scale; their MSE reflects the natural magnitude of the ground-truth activations.
- Mean propagation is ~1000x more accurate than zeros, a huge improvement from a simple analytical formula at ~2.7M FLOPs (well under 1% of budget).
- Covariance propagation is another ~20x better, but costs O(width^3) per layer (~404M FLOPs at width=256, still <1% of budget). The cubic cost grows fast: by around width≈1400 it would consume the whole 6.8e10 budget.
- The leaderboard score (adjusted_final_layer_score) is not shown directly: it scales each estimator's final_layer_mse by max(0.1, C_m / budget). Every bundled example spends <1% of the budget, so all of them bottom out at the 0.1 floor; each one's ranked score is exactly its final_layer_mse ÷ 10.
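The width≈1400 estimate and the ÷ 10 ranked scores follow from a quick back-of-envelope calculation, sketched below. It assumes the covariance-propagation cost scales as a pure cubic in width from the approximate ~404M FLOPs at width 256, ignoring lower-order terms and residual wall-time; the MSE inputs are the rounded table values, so the outputs are approximate too.

```python
BUDGET = 6.8e10           # FLOP budget used for the table
COV_FLOPS_AT_256 = 404e6  # approximate covariance_propagation cost at width 256

# Assumed pure cubic scaling: flops(w) = COV_FLOPS_AT_256 * (w / 256) ** 3.
# Solve flops(w) = BUDGET for w.
width_at_budget = 256 * (BUDGET / COV_FLOPS_AT_256) ** (1 / 3)

# Approximate final_layer_mse values copied from the table above.
final_layer_mse = {
    "random_estimator": 0.60,
    "mean_propagation": 7.5e-4,
    "covariance_propagation": 3.7e-5,
}
# All three spend <1% of budget, so each takes the 0.1 floor multiplier.
ranked_score = {name: mse * 0.1 for name, mse in final_layer_mse.items()}

print(f"width at which the budget is exhausted: ~{width_at_budget:.0f}")
for name, score in ranked_score.items():
    print(f"{name}: ~{score:.1e}")
```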
To reproduce: uv run whest run --estimator examples/<NN>_<name>.py --dataset hf://aicrowd/arc-whestbench-public-2026@v1-warmup (e.g. examples/02_mean_propagation.py)
These numbers are reproducible: the mini split fixes the 100 MLPs and bakes ground truth at N=1e9, so re-running yields the same values (the random_estimator row uses --seed 42).