Flopscope Primer

Flopscope is a numpy-compatible array library that tracks FLOPs analytically rather than timing them on hardware. Every arithmetic operation on a fnp.ndarray increments a FLOP counter instead of (or in addition to) performing the computation. This is how WhestBench enforces fair FLOP budgets across different machines.

Source: github.com/AIcrowd/flopscope

BudgetContext

All estimator predictions run inside a BudgetContext. When the budget is exhausted, a BudgetExhaustedError is raised and your predictions are zeroed out.

import flopscope as flops
import flopscope.numpy as fnp

with flops.BudgetContext(flop_budget=1_000_000) as ctx:
    x = fnp.ones(100)
    y = x @ fnp.eye(100)  # (100,) @ (100, 100): 1 * 100 * 100 = 10,000 FLOPs by the M * N * K rule
    # BudgetExhaustedError raised here if budget exceeded

You don't need to create BudgetContext yourself — the framework does it before calling your predict() method. The budget argument tells you how many FLOPs you have.

BudgetContext also supports wall_time_limit_s when you want a cooperative wall-clock limit in addition to the FLOP cap:

with flops.BudgetContext(flop_budget=1_000_000, wall_time_limit_s=2.0) as ctx:
    ...

The timer starts when the context is entered and is checked before and after each counted flopscope/NumPy call. If it is exceeded, flopscope raises TimeExhaustedError.
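"Cooperative" means the limit can only fire at a check point: a long stretch of uninstrumented Python between counted calls is not interrupted, and the overrun is noticed at the next counted call. The sketch below is a toy model of that check-before-and-after pattern, not flopscope's implementation. CooperativeTimer, ToyTimeExhausted, and the injectable clock argument are hypothetical names introduced for illustration.

```python
import time


class ToyTimeExhausted(RuntimeError):
    """Hypothetical stand-in for flopscope's TimeExhaustedError."""


class CooperativeTimer:
    """Toy wall-clock limit that is only checked around wrapped calls."""

    def __init__(self, limit_s, clock=time.monotonic):
        self.limit_s = limit_s
        self.clock = clock      # injectable so the behaviour can be tested
        self.start = None

    def __enter__(self):
        self.start = self.clock()   # the timer starts on context entry
        return self

    def __exit__(self, exc_type, exc, tb):
        return False                # never swallow exceptions

    def expired(self):
        return self.clock() - self.start > self.limit_s

    def call(self, fn, *args, **kwargs):
        # Check before and after the wrapped call, never during it.
        if self.expired():
            raise ToyTimeExhausted("limit exceeded before call")
        result = fn(*args, **kwargs)
        if self.expired():
            raise ToyTimeExhausted("limit exceeded after call")
        return result


with CooperativeTimer(limit_s=2.0) as timer:
    total = timer.call(sum, [1, 2, 3])   # well inside the limit
```

In this model, code that never goes through call() is never checked, which is why a cooperative limit complements rather than replaces an external timeout.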

Operation FLOP Costs

| Category | Operations | Cost |
|---|---|---|
| Free (0 FLOPs) | fnp.array, fnp.zeros, fnp.ones, fnp.eye, fnp.asarray, fnp.reshape, .T, indexing, fnp.stack, fnp.concatenate, .copy(), .astype() | 0 |
| Pointwise (1 FLOP/element) | +, -, *, /, fnp.exp, fnp.sqrt, fnp.abs, fnp.maximum, fnp.where, fnp.log, comparisons | N elements |
| Reductions (input size) | fnp.sum, fnp.mean, fnp.var, fnp.max, fnp.min, fnp.all, fnp.any | N elements |
| Matmul | @, fnp.matmul | M * N * K for (M, N) @ (N, K) |

Key insight: Matmul dominates. A single (100, 100) @ (100, 100) costs 1M FLOPs. A pointwise exp on 100 elements costs 100 FLOPs.
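The cost rules above are simple enough to budget on paper before writing any estimator code. The following is a minimal sketch in plain Python that mirrors the table; the helper names (pointwise_flops, reduction_flops, matmul_flops) are made up for this example and are not part of flopscope.

```python
from math import prod


def pointwise_flops(shape):
    """1 FLOP per element (+, -, *, /, exp, sqrt, ...)."""
    return prod(shape)


def reduction_flops(shape):
    """Reductions cost the size of their input (sum, mean, max, ...)."""
    return prod(shape)


def matmul_flops(m, n, k):
    """(M, N) @ (N, K) costs M * N * K."""
    return m * n * k


budget = 10_000_000
per_matmul = matmul_flops(100, 100, 100)

print(per_matmul)                 # 1000000
print(pointwise_flops((100,)))    # 100
print(budget // per_matmul)       # 10 such matmuls fit in the budget
```

Estimating this way makes it easy to see which layer or step will consume most of the budget before running anything.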

Array Creation

import numpy as np
import flopscope.numpy as fnp

numpy_array = np.arange(3.0)                # plain numpy array used in the conversion below

x = fnp.zeros(100)                          # 1D zeros
X = fnp.zeros((64, 100), dtype=fnp.float32)  # 2D zeros, explicit dtype
I = fnp.eye(100, dtype=fnp.float32)          # identity matrix
a = fnp.array([1.0, 2.0, 3.0])             # from list
b = fnp.asarray(numpy_array)                # convert from numpy (free)

All array creation is free (0 FLOPs).

Random Number Generation

import flopscope as flops
import flopscope.numpy as fnp

rng = fnp.random.default_rng(42)            # seeded RNG
x = rng.standard_normal((1000, 64))        # Gaussian samples
x = x.astype(fnp.float32)                   # cast to float32 (free)

Random generation itself is free. FLOPs are counted when you operate on the arrays.

Budget Inspection

Use ctx.summary() for the current explicit context and fnp.budget_summary() for the accumulated process/session-wide view:

with flops.BudgetContext(flop_budget=10_000_000) as ctx:
    # ... your computations ...
    print(ctx.summary())        # current context only
    print(fnp.budget_summary())  # process/session-wide summary
    print(ctx.flops_used)       # integer FLOP count

Both summaries also include four timing fields that satisfy a strict decomposition identity, wall_time_s = flopscope_backend_time_s + flopscope_overhead_time_s + residual_wall_time_s:

  • wall_time_s: total elapsed time in the context
  • flopscope_backend_time_s: time spent inside counted flopscope numpy kernels
  • flopscope_overhead_time_s: time spent inside flopscope's own dispatch (wrapper preambles, FLOP bookkeeping, namespace push/pop)
  • residual_wall_time_s: everything else - participant Python, GC, uninstrumented numpy

This decomposition lets you see whether time is going to numpy compute, framework dispatch, or your own Python.
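As a worked illustration of the identity, the residual bucket is whatever remains after subtracting the two flopscope buckets from the total. The sketch below uses a plain dict with made-up timings as a stand-in; the actual return type of ctx.summary() is not shown in this primer, and only the four field names come from the text above.

```python
def residual_wall_time_s(wall_time_s, backend_time_s, overhead_time_s):
    """Third bucket of the decomposition: total minus both flopscope buckets."""
    return wall_time_s - backend_time_s - overhead_time_s


# Hypothetical timings in seconds (not measured output).
timings = {
    "wall_time_s": 2.0,
    "flopscope_backend_time_s": 1.25,
    "flopscope_overhead_time_s": 0.25,
}
timings["residual_wall_time_s"] = residual_wall_time_s(
    timings["wall_time_s"],
    timings["flopscope_backend_time_s"],
    timings["flopscope_overhead_time_s"],
)

# Share of the total wall time spent in each bucket.
shares = {
    name: value / timings["wall_time_s"]
    for name, value in timings.items()
    if name != "wall_time_s"
}
print(timings["residual_wall_time_s"])  # 0.5
print(shares)
```

With these numbers, a quarter of the wall time is residual, which is the part WhestBench later charges against the budget.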

WhestBench-specific limits

Flopscope's BudgetContext measures wall_time_s, flopscope_backend_time_s, flopscope_overhead_time_s, and residual_wall_time_s. It also accepts wall_time_limit_s, which it checks while counted flopscope operations run.

WhestBench exposes some of those concepts as run-level CLI knobs:

  • --wall-time-limit: passed through to the estimator's BudgetContext
  • --residual-wall-time-limit: enforced by WhestBench after predict() returns, using the reported residual_wall_time_s. Because residual_wall_time_s no longer includes flopscope's own dispatch time, this gate measures only your Python work — not the framework's bookkeeping tax.

So if you see time_exhausted, that came from Flopscope's wall_time_limit_s. If you see residual_wall_time_exhausted, that came from WhestBench scoring logic comparing Flopscope's measured residual_wall_time_s with the configured --residual-wall-time-limit.

Residual wall-time charging (lambda)

WhestBench's effective compute budget combines analytical FLOPs and residual wall time via a conversion rate λ (LAMBDA_FLOPS_PER_SECOND in whestbench.scoring):

C_m = F_m + λ · R_m
  • F_m = analytical FLOPs counted by flopscope (flops_used)
  • R_m = residual wall time — the third bucket of the time decomposition. Specifically, residual_wall_time_s = wall_time_s − flopscope_backend_time_s − flopscope_overhead_time_s. This is participant Python (loops, control flow), GC pauses, and uninstrumented numpy. It explicitly excludes flopscope's own dispatch overhead (the second bucket).
  • λ = 1e11 FLOPs/second. This rate is fixed for the initial competition round.

The combined C_m is capped at B_m = flop_budget. If C_m > B_m, the MLP is marked combined_budget_exhausted and the prediction is replaced with zeros.
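To get a feel for the exchange rate: at λ = 1e11, one millisecond of residual wall time is charged as roughly 1e8 FLOPs, the same as a hundred (100, 100) @ (100, 100) matmuls. The sketch below evaluates C_m = F_m + λ · R_m for two made-up runs. It defines a local copy of the constant rather than importing it from whestbench.scoring, and the function names are hypothetical, not WhestBench's API.

```python
LAMBDA_FLOPS_PER_SECOND = 1e11  # value stated in this primer (local copy)


def combined_cost(flops_used, residual_wall_time_s, lam=LAMBDA_FLOPS_PER_SECOND):
    """C_m = F_m + lambda * R_m."""
    return flops_used + lam * residual_wall_time_s


def is_combined_budget_exhausted(flops_used, residual_wall_time_s, flop_budget):
    """True when C_m exceeds B_m = flop_budget."""
    return combined_cost(flops_used, residual_wall_time_s) > flop_budget


flop_budget = 100_000_000_000      # B_m (hypothetical)
flops_used = 40_000_000_000        # F_m (hypothetical)

print(combined_cost(flops_used, 0.5))                               # 90000000000.0
print(is_combined_budget_exhausted(flops_used, 0.5, flop_budget))   # False
print(is_combined_budget_exhausted(flops_used, 0.75, flop_budget))  # True
```

In the second run the FLOP count alone is well under budget; it is the extra quarter second of residual time that pushes the combined cost over.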

Why charge non-flopscope time at all? It lets participants use any Python they like — not just flopscope-instrumented operations — but holds them accountable for that work in the compute budget. Pure-flopscope solutions get the entire budget for analytical work; pure-Python solutions trade some FLOP headroom for residual time.

Common Gotchas

Raw numpy arrays still count FLOPs. Because fnp.ndarray is backed by numpy, a raw numpy array passed to a flopscope operation is still tracked. Convert with fnp.array() or fnp.asarray() when you want the hand-off to be explicit.

Pythonic operators are tracked. x @ w counts the same FLOPs as fnp.matmul(x, w). Use whichever reads better.

dtype matters for precision, not FLOPs. float32 and float64 operations cost the same FLOPs. Use float32 for memory efficiency and float64 for numerical stability where needed.

Testing

Use flopscope's testing utilities:

import flopscope as flops
import flopscope.numpy as fnp

fnp.testing.assert_allclose(actual, expected, atol=1e-6)
fnp.testing.assert_array_equal(actual, expected)

These work like numpy's testing functions but on flopscope arrays.
