whestbench.
Reference

Score Report Fields

Reference for interpreting whest run output fields, including per-MLP diagnostics, time decomposition, and the budget-adjusted scoring formula.

When to use this page

Use this page to interpret whest run output fields.

Top-level fields

Typical report sections include:

  • schema_version
  • mode
  • run_meta
  • run_config
  • run_config.seed (always present; null when --seed is not provided)
  • run_config.dataset (present when --dataset is used)
  • results

Run configuration fields

run_config records the parameters that governed the run:

FieldDescription
seedThe --seed value passed at the CLI, or null when --seed is omitted. When set, it determines both MLP generation (without --dataset) and SetupContext.seed for the participant's setup() call. When null, ctx.seed defaults to 0. See estimator-contract for the reproducibility contract and cli-reference for --seed flag semantics.
datasetPresent when --dataset is used. See Dataset traceability fields below.

Host metadata

run_meta.host is always an object. If you set WHEST_SKIP_HARDWARE_FALLBACK_PROBES=1, WhestBench still records cheap host fields and any values available through psutil, but fallback-backed fields such as cpu_count_physical and ram_total_bytes may be null.

Core result fields

Inside results:

FieldDescription
adjusted_final_layer_scoreBudget-adjusted leaderboard metric — suite mean of per-MLP adjusted_final_layer_score = final_layer_mse × max(0.1, C_m/B_m); failure → × 1.0. Lower is better.
all_layers_mseRaw all-layers MSE averaged across MLPs (no budget multiplier). Diagnostic — reveals where approximation error accumulates.
final_layer_mseRaw final-layer MSE averaged across MLPs (no multiplier).
per_layer_msePer-layer MSE averaged across MLPs. list[float] of length depth. The last element equals final_layer_mse and the list mean equals all_layers_mse. Diagnostic only, no budget multiplier.
best_mlp_adjusted_final_layer_scoreMinimum per-MLP adjusted_final_layer_score.
worst_mlp_adjusted_final_layer_scoreMaximum per-MLP adjusted_final_layer_score.
mean_score_multiplierMean of per-MLP max(0.1, C_m/B_m) (1.0 on failure). Bounded [0.1, 1.0].
mean_compute_utilizationMean of per-MLP C_m/B_m, unclamped — can exceed 1.0 when an MLP busted the cap.
n_failed_mlpsCount of MLPs with any failure flag or error_code set.
mean_effective_computeMean of per-MLP effective_compute.
failure_breakdownDict with independent counts per failure flag: budget_exhausted, time_exhausted, residual_wall_time_exhausted, combined_budget_exhausted, error. Sums can exceed n_failed_mlps because one MLP can carry multiple flags.
breakdownsAggregate FLOP/time breakdowns keyed by section name. Includes sampling and estimator.
per_mlpArray of per-MLP detail records (see below)

Per-MLP fields

Each entry in per_mlp:

FieldTypeDescription
mlp_indexintIndex of the MLP in the evaluation set
mlp_namestrHuman-readable slug for this MLP (e.g. "danielle-johnson"). Same value as mlps[i].name on the corresponding MLP; derived deterministically from mlp_index's per-MLP seed. Use it as a stable label in your own logs and dashboards.
flops_usedintTotal FLOPs used by your estimator for this MLP
effective_computefloatC_m = F_m + λ·R_m. Combined FLOP-equivalent compute used by the estimator.
adjusted_final_layer_scorefloats_m. The per-MLP budget-adjusted score that flows into the suite mean.
combined_budget_exhaustedboolWhether the post-hoc check C_m > B_m fired (predictions zeroed if true).
budget_exhaustedboolWhether the estimator exceeded the FLOP budget (predictions zeroed if true)
time_exhaustedboolWhether the estimator exceeded the wall-clock limit for this MLP (predictions zeroed if true)
residual_wall_time_exhaustedboolWhether WhestBench judged non-flopscope time to exceed residual_wall_time_limit_s (predictions zeroed if true)
wall_time_sfloatTotal elapsed wall-clock time measured for this MLP's estimator context
flopscope_backend_time_sfloatWall time inside counted flopscope numpy kernels - the participant's actual numpy compute
flopscope_overhead_time_sfloatWall time inside flopscope's own dispatch code (wrapper preambles, FLOP bookkeeping, namespace push/pop). Framework cost, not participant cost.
residual_wall_time_sfloatWall time inside the predict context that is neither flopscope backend execution nor flopscope dispatch - i.e. participant Python (loops, control flow), GC, uninstrumented numpy
final_layer_msefloatMSE of your final-layer predictions vs ground truth
all_layers_msefloatMSE of your all-layer predictions vs ground truth
per_layer_mselist[float]Per-layer MSE for this MLP. Length equals depth. per_layer_mse[-1] == final_layer_mse and mean(per_layer_mse) == all_layers_mse within float precision.
breakdownsdict | nullPer-MLP breakdown container. Currently includes estimator-only data under estimator. Sampling is aggregate-only.
tracebackstr | nullNon-null when this MLP's run did not produce real predictions — captures the Python traceback for either an estimator exception or a budget/time exhaustion. null on clean runs. For subprocess/server runners, the traceback is forwarded from the worker.

When the estimator raised an unhandled exception (not budget/time exhaustion), the entry also includes:

FieldTypeDescription
errorstr | dictLegacy string message, or structured object: {"message": str, "details": object}
error_codestrStable identifier: PREDICT_ERROR for a RunnerError, or the Python exception class name otherwise

For structured error objects, error.details includes:

  • expected_shape: List[int] with expected (depth, width).
  • got_shape: List[int] observed from estimator output.
  • cause_hints: List[str] with user-facing hints.
  • hint: short summary hint.

Time decomposition

Every predict() call satisfies a strict three-bucket identity:

wall_time_s = flopscope_backend_time_s + flopscope_overhead_time_s + residual_wall_time_s
  • flopscope_backend_time_s - numpy kernels actually crunching numbers via flopscope.numpy.*.
  • flopscope_overhead_time_s - flopscope's own dispatch (wrapper preambles, FLOP bookkeeping, namespace push/pop).
  • residual_wall_time_s - everything else inside the wall window: participant Python, GC, uninstrumented numpy.

The decomposition holds at every level: per-MLP, aggregated across MLPs, and per namespace inside breakdowns.

Breakdown containers

When namespace-aware flopscope data is available, WhestBench adds breakdown containers in these places:

  • results.breakdowns.estimator - aggregated estimator breakdown across all evaluated MLPs
  • results.breakdowns.sampling - aggregated sampling breakdown across all evaluated MLPs
  • results.per_mlp[].breakdowns.estimator - one normalized estimator breakdown per MLP

Namespace normalization rules:

  • sampling work is namespaced under sampling.*
  • unlabeled estimator work becomes estimator.estimator-client
  • explicit estimator namespace phase becomes estimator.phase
  • nested estimator namespace phase.subphase becomes estimator.phase.subphase

Each breakdown summary also includes timing totals:

  • flopscope_backend_time_s - accumulated time inside counted flopscope operations
  • flopscope_overhead_time_s - accumulated time inside flopscope's own dispatch
  • residual_wall_time_s - everything else (participant Python, GC, uninstrumented numpy)

For results.breakdowns.*, those values are aggregated across all evaluated MLPs.

Budget-adjusted scoring

The leaderboard ranks submissions by adjusted_final_layer_score, the suite mean of the budget-adjusted per-MLP score:

adjusted_final_layer_score = final_layer_mse × max(0.1, C_m / B_m)   for valid runs
adjusted_final_layer_score = final_layer_mse × 1.0                    for failures (no compute discount)

C_m = F_m + λ · R_m                      (effective compute, FLOPs and FLOP-equivalents)
λ = 1e11 FLOPs/second                    (conversion rate; see flopscope-primer)

Where F_m is the analytical FLOPs counted by flopscope (flops_used), R_m is the residual wall-time bucket (residual_wall_time_s — neither flopscope-backend nor flopscope-overhead), and B_m is flop_budget. The max(0.1, ...) floor caps the discount at 10× so an arbitrarily cheap-but-wrong submission cannot dominate the ranking.

Why "score" not "MSE"? Once final_layer_mse is multiplied by the budget factor max(0.1, C_m/B_m), the result is no longer a mean-squared-error between predictions and targets — it is a derived ranking score (denoted s_m). The _score suffix in adjusted_final_layer_score reflects this; the raw diagnostics final_layer_mse and all_layers_mse keep the _mse suffix because they remain genuine MSEs.

Interpretation guide

  • final_layer_mse is your most actionable diagnostic — it directly drives adjusted_final_layer_score.
  • budget_exhausted is the first thing to check if your score is unexpectedly high — exceeded budget means your predictions were zeroed.
  • time_exhausted means the estimator crossed the wall-clock limit configured through wall_time_limit_s / --wall-time-limit.
  • residual_wall_time_exhausted means the non-flopscope portion of execution crossed WhestBench's residual_wall_time_limit_s / --residual-wall-time-limit.
  • flops_used vs flop_budget shows how much headroom you have. If you are consistently near the cap, consider lighter methods.
  • High flopscope_backend_time_s relative to wall: numpy compute is the dominant cost. Healthy for a numpy-heavy estimator.
  • High flopscope_overhead_time_s relative to wall: many small ops are paying the per-call dispatch tax. Consider batching with larger numpy primitives.
  • High residual_wall_time_s relative to wall: participant Python is the bottleneck (tight loops, per-element attribute access, calls into uninstrumented libraries). This is the bucket future versions of WhestBench will penalise on.
  • adjusted_final_layer_score is the budget-adjusted leaderboard metric and is always ≤ the raw final_layer_mse mean (the multiplier is at most 1.0 — it equals 1.0 at full budget use or on failures and drops to 0.1 at the discount floor — a factor-of-ten cap). A value close to raw final_layer_mse means you used near-full budget; a value close to one-tenth of raw final_layer_mse means you used ≤10% of the budget and got the maximum discount.
  • all_layers_mse is a diagnostic aggregate with no budget multiplier. Use it to understand where approximation error accumulates across all layers, not just the final layer.
  • per_layer_mse decomposes all_layers_mse layer-by-layer (length = depth). Useful for spotting which layers your estimator struggles on — e.g. early layers vs. final layer. By construction per_layer_mse[-1] == final_layer_mse and mean(per_layer_mse) == all_layers_mse (within float precision).

Dataset traceability fields

When using whest run --dataset, the report includes run_config.dataset:

FieldDescription
pathAbsolute path to the dataset file
sha256SHA-256 hash of the file for integrity
seedRNG seed used to generate the dataset
n_mlpsNumber of MLPs in the dataset
seed_protocolObject with name and version. WhestBench currently requires version = "2.0".

Dataset format compatibility

The .npz files produced by whest create-dataset carry a seed_protocol.version in their embedded metadata. WhestBench refuses to load datasets at any other version: loading a v1.0 dataset raises ValueError("Incompatible dataset seed_protocol version: file has '1.0', this whestbench requires '2.0'. Re-bake the dataset with \whest create-dataset`.")`.

The v2.0 format adds a per-MLP seed (stored as the mlp_seeds array in the .npz) that is exposed to estimators via mlp.seed — see estimator-contract for how to consume it. Auto-migration is intentionally not implemented because the v1.0 spawn protocol (2 streams per MLP) cannot produce a deterministic third stream; re-baking from the original spec seed is the only correct path.

Schema 2.4 added the per-MLP name slug (stored as the mlp_names array in the .npz). It is a pure function of mlp_seeds at the WhestBench release's pinned faker version, so loading a 2.3 file under 2.4 code transparently synthesizes the same names a fresh 2.4 bake would produce — no re-bake required. See estimator-contract for the mlp.name field exposed to estimators.

Next step

On this page