whestbench.

Performance Tips

Sourced from whest-starterkit @ aaa3882.

This page lists concrete patterns for reducing FLOP usage in your estimator.

Matmul dominates your budget

A single fnp.matmul(A, B) on two (n, n) matrices costs O(n^3) FLOPs. For width=256, that is ~33M FLOPs per matmul (the figure matches 2 x 256^3). In an 8-layer network, 8 matmuls cost ~268M FLOPs. That is well within the 6.8e10 default budget (about 0.4% of it), but matmul is still the dominant cost in any moderately sized estimator.
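The arithmetic above can be checked with a few lines of plain Python. This is a minimal sketch that assumes a square matmul is counted as 2 * n^3 FLOPs (one multiply and one add per inner-product term), which is the convention that reproduces the ~33M figure; `matmul_flops` is a hypothetical helper, not part of flopscope.

```python
def matmul_flops(n: int) -> int:
    """FLOPs for an (n, n) @ (n, n) product, assuming a 2 * n**3 count."""
    return 2 * n ** 3


width = 256
n_layers = 8
default_budget = 68_000_000_000  # 6.8e10, the default budget quoted above

per_matmul = matmul_flops(width)       # one matmul per layer
total = n_layers * per_matmul          # all 8 layers
fraction = total / default_budget      # share of the default budget

print(f"per matmul: {per_matmul:,} FLOPs")
print(f"8 layers:   {total:,} FLOPs")
print(f"share of default budget: {fraction:.2%}")
```

Rerunning this with a larger `width` shows how quickly the cubic term grows: doubling the width multiplies the matmul cost by 8.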

Tip: If you only need diagonal information (per-neuron variance), avoid full matrix-matrix multiplies. Diagonal propagation uses matrix-vector products: O(n^2) per layer instead of O(n^3).
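To illustrate why diagonal information only needs a matrix-vector product, here is a small pure-Python sketch. It assumes a single linear layer with independent inputs (diagonal input covariance) and ignores any nonlinearity; the helper names are hypothetical and it does not use flopscope. Under that assumption the output variances are diag(W Σ Wᵀ) = (W ∘ W) v, so the full matrix products can be skipped.

```python
def matvec(M, v):
    """Matrix-vector product: O(n^2) multiply-adds."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(A, B):
    """Matrix-matrix product: O(n^3) multiply-adds."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def variance_diagonal(W, var_in):
    """Per-neuron output variance via one matvec with the squared weights."""
    W_sq = [[w * w for w in row] for row in W]  # elementwise square, O(n^2)
    return matvec(W_sq, var_in)


def variance_full(W, var_in):
    """Same quantity via the full covariance W @ Sigma @ W^T, then its diagonal."""
    n = len(var_in)
    sigma = [[var_in[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    W_t = [list(col) for col in zip(*W)]
    cov_out = matmul(matmul(W, sigma), W_t)
    return [cov_out[i][i] for i in range(len(W))]


W = [[1.0, 2.0], [0.5, -1.0]]
var_in = [0.25, 4.0]

diag_result = variance_diagonal(W, var_in)
full_result = variance_full(W, var_in)
print(diag_result, full_result)
```

Both routes return the same variances, but the diagonal route never forms an (n, n) product. In a FLOP-counted setting the elementwise square is not free either, though it stays at O(n^2).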

Free operations — use them liberally

These cost 0 FLOPs in flopscope:

  • fnp.zeros(), fnp.ones(), fnp.eye(), fnp.array()
  • fnp.reshape(), fnp.transpose()
  • fnp.concatenate(), fnp.stack()
  • Indexing: x[0], x[:, 3], fnp.diag(M)

Precompute anything you can using free ops. Store intermediate values in variables — there is no memory cost in FLOP terms.

Precompute outside the layer loop

If your estimator computes something that does not change per-layer, move it before the loop:

import flopscope.numpy as fnp

# Instead of this (wasteful):
for w in mlp.weights:
    scale = fnp.sqrt(2.0 / mlp.width)  # recomputed every layer
    ...

# Do this (free):
scale = fnp.sqrt(2.0 / mlp.width)  # computed once
for w in mlp.weights:
    ...
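The effect of hoisting can be made visible with a toy call counter. This is only a sketch: `CountingMath` is a hypothetical stand-in for a FLOP-counting library and is not flopscope's API; it simply counts how often `sqrt` is invoked in each variant.

```python
import math


class CountingMath:
    """Toy stand-in for a counted math library (hypothetical, not flopscope)."""

    def __init__(self):
        self.sqrt_calls = 0

    def sqrt(self, x):
        self.sqrt_calls += 1
        return math.sqrt(x)


def scale_in_loop(ops, width, n_layers):
    """Recomputes the layer-independent scale on every iteration."""
    scale = None
    for _ in range(n_layers):
        scale = ops.sqrt(2.0 / width)
    return scale


def scale_hoisted(ops, width, n_layers):
    """Computes the scale once, before the layer loop."""
    scale = ops.sqrt(2.0 / width)
    for _ in range(n_layers):
        pass  # per-layer work would reuse `scale` here
    return scale


in_loop = CountingMath()
hoisted = CountingMath()
a = scale_in_loop(in_loop, width=256, n_layers=8)
b = scale_hoisted(hoisted, width=256, n_layers=8)

print(in_loop.sqrt_calls, hoisted.sqrt_calls)
```

The two variants return the same value; only the number of counted calls differs (one per layer versus one in total).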

Diagonal vs full covariance — know when to switch

| Approach | Cost per layer | When to use |
| --- | --- | --- |
| Mean propagation (diagonal) | O(width^2) | Default. Budget < 30 x width^2 |
| Covariance propagation (full) | O(width^3) | Budget >= 30 x width^2 |
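The switching rule in the table can be written as a small helper. This sketch reads the table literally, comparing the FLOP budget against 30 x width^2; `choose_propagation` is a hypothetical name, and whether the threshold applies to the total budget or to a per-layer share is an assumption you should check against your own setup.

```python
def choose_propagation(flop_budget: int, width: int) -> str:
    """Pick an approach using the 30 * width**2 threshold from the table."""
    threshold = 30 * width ** 2
    return "diagonal" if flop_budget < threshold else "full"


width = 256
threshold = 30 * width ** 2

small_budget_choice = choose_propagation(1_000_000, width)
default_budget_choice = choose_propagation(68_000_000_000, width)

print(f"threshold for width={width}: {threshold:,} FLOPs")
print(small_budget_choice, default_budget_choice)
```

For width=256 the threshold is under 2M FLOPs, so under this reading the default 6.8e10 budget falls on the full-covariance side.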

Check your budget breakdown

Use flops.budget_summary() inside a BudgetContext to see exactly where your FLOPs go:

import flopscope as flops

with flops.BudgetContext(flop_budget=68_000_000_000) as budget:
    result = estimator.predict(mlp, budget=68_000_000_000)
    flops.budget_summary()

This prints a per-operation table showing call counts and cumulative FLOPs. Look for the dominant operation and optimize that first.

Skip hardware fallback probes during local iteration

If startup latency matters while you are iterating locally, you can skip the extra OS-native hardware fallback probes that populate report and dataset metadata:

WHEST_SKIP_HARDWARE_FALLBACK_PROBES=1 uv run whest run --estimator estimator.py

This keeps cheap metadata collection and psutil-backed fields enabled. Only the fallback probes are skipped, so fields such as cpu_count_physical or ram_total_bytes may remain null when they are not already available.
