Flopscope Primer
Flopscope is a numpy-compatible array library that tracks FLOPs analytically rather than timing them on hardware. Every arithmetic operation on an fnp.ndarray increments a FLOP counter instead of (or in addition to) performing the computation. This is how WhestBench enforces fair FLOP budgets across different machines.
Source: github.com/AIcrowd/flopscope
BudgetContext
All estimator predictions run inside a BudgetContext. When the budget is exhausted, a BudgetExhaustedError is raised and your predictions are zeroed out.

    import flopscope as flops
    import flopscope.numpy as fnp

    with flops.BudgetContext(flop_budget=1_000_000) as ctx:
        x = fnp.ones((100, 100))
        y = x @ fnp.eye(100)  # matmul: 100 * 100 * 100 = 1M FLOPs
        # BudgetExhaustedError raised here if budget exceeded

You don't need to create BudgetContext yourself; the framework does it before calling your predict() method. The budget argument tells you how many FLOPs you have.
BudgetContext also supports wall_time_limit_s when you want a cooperative
wall-clock limit in addition to the FLOP cap:

    with flops.BudgetContext(flop_budget=1_000_000, wall_time_limit_s=2.0) as ctx:
        ...

The timer starts when the context is entered and is checked before and after
each counted flopscope/NumPy call. If it is exceeded, flopscope raises
TimeExhaustedError.
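To make the mechanics described above concrete, here is a minimal pure-Python sketch of a cooperative budget context. It is not flopscope's implementation: ToyBudget, charge, ToyBudgetExhausted, and ToyTimeExhausted are hypothetical stand-ins that only mirror the documented behaviour (a FLOP counter that raises once the cap is passed, and a wall-clock check performed before and after each counted call). Whether flopscope raises before or after running the offending operation, and how it treats a total exactly equal to the budget, are assumptions here.

```python
import time


class ToyBudgetExhausted(Exception):
    """Hypothetical stand-in for BudgetExhaustedError."""


class ToyTimeExhausted(Exception):
    """Hypothetical stand-in for TimeExhaustedError."""


class ToyBudget:
    """Toy model of a FLOP budget with an optional cooperative time limit."""

    def __init__(self, flop_budget, wall_time_limit_s=None):
        self.flop_budget = flop_budget
        self.wall_time_limit_s = wall_time_limit_s
        self.flops_used = 0
        self._start = None

    def __enter__(self):
        # The timer starts when the context is entered.
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        return False  # never swallow exceptions

    def _check_time(self):
        if self.wall_time_limit_s is None:
            return
        if time.perf_counter() - self._start > self.wall_time_limit_s:
            raise ToyTimeExhausted("wall-time limit exceeded")

    def charge(self, flops, fn=lambda: None):
        """Charge `flops` for one counted call; check the clock before and after."""
        self._check_time()
        # Assumption: a total strictly above the budget counts as exhausted.
        if self.flops_used + flops > self.flop_budget:
            raise ToyBudgetExhausted("FLOP budget exceeded")
        self.flops_used += flops
        result = fn()
        self._check_time()
        return result


with ToyBudget(flop_budget=1_000_000) as ctx:
    ctx.charge(100)              # pointwise op on 100 elements
    ctx.charge(64 * 100 * 100)   # (64, 100) @ (100, 100) matmul
    used_before_failure = ctx.flops_used
    try:
        ctx.charge(100 * 100 * 100)  # would push the total past the cap
        exhausted = False
    except ToyBudgetExhausted:
        exhausted = True
```

The time limit is cooperative in the same sense as described above: nothing interrupts a long-running call, so the limit is only noticed at the next check.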
Operation FLOP Costs
| Category | Operations | Cost (FLOPs) |
|---|---|---|
| Free (0 FLOPs) | fnp.array, fnp.zeros, fnp.ones, fnp.eye, fnp.asarray, fnp.reshape, .T, indexing, fnp.stack, fnp.concatenate, .copy(), .astype() | 0 |
| Pointwise (1 FLOP/element) | +, -, *, /, fnp.exp, fnp.sqrt, fnp.abs, fnp.maximum, fnp.where, fnp.log, comparisons | N for N elements |
| Reductions (input size) | fnp.sum, fnp.mean, fnp.var, fnp.max, fnp.min, fnp.all, fnp.any | N for an input of N elements |
| Matmul | @, fnp.matmul | M * N * K for (M,N) @ (N,K) |
Key insight: Matmul dominates. A single (100, 100) @ (100, 100) costs 1M FLOPs. A pointwise exp on 100 elements costs 100 FLOPs.
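The cost rules in the table can be turned into a small back-of-the-envelope estimator for planning a budget before running anything. This is a sketch in plain Python, not part of flopscope: the helper names pointwise_flops, reduction_flops, and matmul_flops are hypothetical, and the layer sizes are made up for illustration.

```python
from math import prod


def pointwise_flops(shape):
    """1 FLOP per element (+, -, *, /, exp, sqrt, maximum, ...)."""
    return prod(shape)


def reduction_flops(shape):
    """Reductions (sum, mean, max, ...) are charged by input size."""
    return prod(shape)


def matmul_flops(m, n, k):
    """(m, n) @ (n, k) is charged m * n * k."""
    return m * n * k


# Hypothetical workload: two layers of "matmul then pointwise maximum"
# on a batch of 64 inputs of width 100.
batch, width = 64, 100
layer_cost = matmul_flops(batch, width, width) + pointwise_flops((batch, width))
total = 2 * layer_cost
matmul_share = 2 * matmul_flops(batch, width, width) / total
```

In this example the matmuls account for about 99% of the total, which is the "matmul dominates" point in numbers.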
Array Creation

    import numpy as np
    import flopscope as flops
    import flopscope.numpy as fnp

    x = fnp.zeros(100)                           # 1D zeros
    X = fnp.zeros((64, 100), dtype=fnp.float32)  # 2D zeros, explicit dtype
    I = fnp.eye(100, dtype=fnp.float32)          # identity matrix
    a = fnp.array([1.0, 2.0, 3.0])               # from list
    numpy_array = np.arange(3.0)                 # any existing numpy array
    b = fnp.asarray(numpy_array)                 # convert from numpy (free)

All array creation is free (0 FLOPs).
Random Number Generation

    import flopscope as flops
    import flopscope.numpy as fnp

    rng = fnp.random.default_rng(42)      # seeded RNG
    x = rng.standard_normal((1000, 64))   # Gaussian samples
    x = x.astype(fnp.float32)             # cast to float32 (free)

Random generation itself is free. FLOPs are counted when you operate on the arrays.
Budget Inspection
Use ctx.summary() for the current explicit context and
fnp.budget_summary() for the accumulated session/global view:

    with flops.BudgetContext(flop_budget=10_000_000) as ctx:
        # ... your computations ...
        print(ctx.summary())         # current context only
        print(fnp.budget_summary())  # process/session-wide summary
        print(ctx.flops_used)        # integer FLOP count

Both summaries also include four timing fields that satisfy a strict
decomposition identity, wall_time_s = flopscope_backend_time_s + flopscope_overhead_time_s + residual_wall_time_s:

- wall_time_s: total elapsed time in the context
- flopscope_backend_time_s: time spent inside counted flopscope numpy kernels
- flopscope_overhead_time_s: time spent inside flopscope's own dispatch (wrapper preambles, FLOP bookkeeping, namespace push/pop)
- residual_wall_time_s: everything else (participant Python, GC, uninstrumented numpy)

This decomposition lets you see whether time is going to numpy compute, framework dispatch, or your own Python.
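As a worked example of the identity, the sketch below derives the residual bucket and each bucket's share of the wall time. The numbers are invented, and storing the fields in a plain dict is an assumption made for illustration; the text above does not say what type summary() returns. Only the field names come from the description.

```python
# Hypothetical timing values in seconds; only the field names are from the docs.
summary = {
    "wall_time_s": 2.0,
    "flopscope_backend_time_s": 1.25,
    "flopscope_overhead_time_s": 0.25,
}

# The identity fixes the fourth field once the other three are known.
residual_wall_time_s = (
    summary["wall_time_s"]
    - summary["flopscope_backend_time_s"]
    - summary["flopscope_overhead_time_s"]
)

# Fraction of wall time spent in each bucket.
shares = {
    "backend": summary["flopscope_backend_time_s"] / summary["wall_time_s"],
    "overhead": summary["flopscope_overhead_time_s"] / summary["wall_time_s"],
    "residual": residual_wall_time_s / summary["wall_time_s"],
}
```

With these made-up values, a quarter of the wall time is residual, which would point at participant Python rather than numpy compute or framework dispatch.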
WhestBench-specific limits
Flopscope's BudgetContext measures wall_time_s, flopscope_backend_time_s,
flopscope_overhead_time_s, and residual_wall_time_s. It also accepts
wall_time_limit_s, which it checks while counted flopscope operations run.
WhestBench exposes some of those concepts as run-level CLI knobs:

- --wall-time-limit: passed through to the estimator's BudgetContext
- --residual-wall-time-limit: enforced by WhestBench after predict() returns, using the reported residual_wall_time_s. Because residual_wall_time_s no longer includes flopscope's own dispatch time, this gate measures only your Python work, not the framework's bookkeeping tax.

So if you see time_exhausted, that came from Flopscope's wall_time_limit_s.
If you see residual_wall_time_exhausted, that came from WhestBench scoring
logic comparing Flopscope's measured residual_wall_time_s with the configured
--residual-wall-time-limit.
Residual wall-time charging (lambda)
WhestBench's effective compute budget combines analytical FLOPs and residual wall time
via a conversion rate λ (LAMBDA_FLOPS_PER_SECOND in whestbench.scoring):

    C_m = F_m + λ · R_m

- F_m = analytical FLOPs counted by flopscope (flops_used)
- R_m = residual wall time, the third bucket of the time decomposition. Specifically, residual_wall_time_s = wall_time_s − flopscope_backend_time_s − flopscope_overhead_time_s. This is participant Python (loops, control flow), GC pauses, and uninstrumented numpy. It explicitly excludes flopscope's own dispatch overhead (the second bucket).
- λ = 1e11 FLOPs/second. This rate is fixed for the initial competition round.

The combined C_m is capped at B_m = flop_budget. If C_m > B_m, the MLP is marked
combined_budget_exhausted and the prediction is replaced with zeros.
Why charge non-flopscope time at all? It lets participants use any Python they like — not just flopscope-instrumented operations — but holds them accountable for that work in the compute budget. Pure-flopscope solutions get the entire budget for analytical work; pure-Python solutions trade some FLOP headroom for residual time.
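The charging rule can be checked by hand with a few lines of Python. This is a sketch of the arithmetic only, not WhestBench's scoring code: combined_cost and is_combined_budget_exhausted are hypothetical helper names, and the FLOP counts, residual times, and budget are made-up inputs. The value of λ is the one stated above.

```python
LAMBDA_FLOPS_PER_SECOND = 1e11  # conversion rate stated in the text


def combined_cost(flops_used, residual_wall_time_s, lam=LAMBDA_FLOPS_PER_SECOND):
    """C_m = F_m + lambda * R_m."""
    return flops_used + lam * residual_wall_time_s


def is_combined_budget_exhausted(flops_used, residual_wall_time_s, flop_budget):
    """True when C_m > B_m."""
    return combined_cost(flops_used, residual_wall_time_s) > flop_budget


flop_budget = 100_000_000_000  # hypothetical B_m = 1e11

# Hypothetical run A: 6e10 analytical FLOPs and 0.25 s of residual time.
c_a = combined_cost(60_000_000_000, 0.25)
ok_a = not is_combined_budget_exhausted(60_000_000_000, 0.25, flop_budget)

# Hypothetical run B: same FLOPs, but 0.5 s of residual time.
c_b = combined_cost(60_000_000_000, 0.5)
ok_b = not is_combined_budget_exhausted(60_000_000_000, 0.5, flop_budget)
```

Run A stays under the cap (8.5e10), while run B exceeds it (1.1e11) even though its analytical FLOP count is unchanged. At λ = 1e11, each millisecond of residual time is charged like 1e8 FLOPs.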
Common Gotchas
numpy arrays still count FLOPs. Since fnp.ndarray is backed by numpy, a raw numpy array passed to flopscope operations will still be tracked. Use fnp.array() or fnp.asarray() to convert explicitly.
Pythonic operators are tracked. x @ w counts the same FLOPs as fnp.matmul(x, w). Use whichever reads better.
dtype matters for precision, not FLOPs. float32 and float64 operations cost the same FLOPs. Use float32 for memory efficiency and float64 for numerical stability where needed.
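The dtype point can be illustrated without flopscope. The sketch below uses only the standard library: to_f32 is a hypothetical helper that emulates float32 rounding via struct, and accumulate performs the same number of additions in both modes, so under the 1 FLOP per element rule the two runs would cost the same while differing in accuracy.

```python
import struct


def to_f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(step, n, as_float32):
    """Add `step` to a running total n times, optionally rounding to float32."""
    total = 0.0
    if as_float32:
        step = to_f32(step)
    for _ in range(n):
        total = total + step
        if as_float32:
            total = to_f32(total)  # emulate storing the result as float32
    return total


n = 100_000
exact = 10_000.0  # 0.1 * 100_000 in exact arithmetic
err64 = abs(accumulate(0.1, n, as_float32=False) - exact)
err32 = abs(accumulate(0.1, n, as_float32=True) - exact)
```

Both runs perform 100,000 additions, yet the float32 error is many orders of magnitude larger than the float64 error. Long accumulations are a typical place where float64 is worth its memory cost.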
Testing
Use flopscope's testing utilities:

    import flopscope as flops
    import flopscope.numpy as fnp

    actual = fnp.ones(3) * 2.0              # value under test
    expected = fnp.array([2.0, 2.0, 2.0])   # reference value

    fnp.testing.assert_allclose(actual, expected, atol=1e-6)
    fnp.testing.assert_array_equal(actual, expected)

These work like numpy's testing functions but on flopscope arrays.
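For choosing atol, it helps to know the tolerance rule behind numpy.testing.assert_allclose, which passes when |actual - desired| <= atol + rtol * |desired| with defaults rtol=1e-7 and atol=0. The scalar sketch below restates that rule in plain Python. The helper name allclose_scalar is hypothetical, and that flopscope's version uses the same defaults is an assumption based on the statement that it works like numpy's.

```python
def allclose_scalar(actual, expected, rtol=1e-7, atol=0.0):
    """Scalar form of the numpy.testing.assert_allclose tolerance rule."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Off by 5e-7 with atol=1e-6: within tolerance.
within = allclose_scalar(1.0000005, 1.0, atol=1e-6)

# Off by 1e-5 with atol=1e-6: outside tolerance.
outside = allclose_scalar(1.00001, 1.0, atol=1e-6)

# With expected == 0 the relative term vanishes, so only atol can absorb error.
near_zero = allclose_scalar(1e-9, 0.0)
```

The last case shows why an explicit atol matters when expected values are at or near zero.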
Code Patterns
Quick reference for flopscope operations, including operators, FLOP costs, and common patterns for mean and variance propagation.
Generating Large Datasets on GPU
For ground-truth bakes with n_samples ≥ 10⁸, the optional torch backend runs the same computation on GPU, reducing a 30-hour CPU job to 15–30 minutes on a single GPU.