
Calibration & Empirical Weights

How analytical FLOPs map to real hardware via per-operation weights.

You will learn:

  • What per-operation weights are and why they matter
  • How to run calibration and produce a weights config
  • How to load and use weights in your code
  • How the measurement methodology works
  • How to interpret weight values

What are weights?

flopscope's analytical cost formulas treat all operations within a category equally -- exp, log, sin, and abs each cost numel(output) FLOPs as pointwise unary operations. In reality, exp is typically evaluated with a minimax polynomial approximation that requires approximately 14 floating-point instructions per element, while abs is a single bit manipulation.

Per-operation weights correct for this. Each weight is a multiplicative constant applied on top of the analytical formula:

effective_cost = analytical_formula(shape) * weight(op_name)

A weight of 16.0 for exp means each analytical FLOP of exp is calibrated as approximately 16 times more expensive than one analytical FLOP of add under the chosen measurement mode. Known analytical zero-FLOP operations are stored with weight 0.0 in the official artifacts so the generated docs and packaged defaults surface them as truly free. In normal use, flopscope loads packaged official weights automatically on import. Set FLOPSCOPE_DISABLE_WEIGHTS=1 if you want the pure analytical unit-cost model instead.
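As a rough illustration of the formula above, the following sketch scales a pointwise analytical cost by a per-operation weight. The weight values are the illustrative ones used on this page, and `analytical_pointwise_cost` and `effective_cost` are hypothetical helpers written for this example, not part of the flopscope API.

```python
from math import prod

# Illustrative weights only; real values come from a calibration run.
WEIGHTS = {"exp": 16.0, "add": 1.0}


def analytical_pointwise_cost(shape):
    """Pointwise operations cost numel(output) analytical FLOPs."""
    return prod(shape)


def effective_cost(op_name, shape, weights=WEIGHTS):
    """Scale the analytical cost by the op's weight (1.0 if unlisted)."""
    return analytical_pointwise_cost(shape) * weights.get(op_name, 1.0)


# Same shape, same analytical cost (8192), different effective cost.
exp_cost = effective_cost("exp", (128, 64))  # 8192 * 16.0
add_cost = effective_cost("add", (128, 64))  # 8192 * 1.0
print(exp_cost, add_cost)
```

Both calls share one analytical formula; only the weight lookup differs, which is why a single multiplicative constant per operation is enough to separate them.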

Quick start

Run the benchmark suite to produce a JSON config:

python -m benchmarks.runner \
    --dtype float64 \
    --output weights.json \
    --html report.html \
    --repeats 5

This benchmarks all 291 operations across 14 categories and writes:

  • weights.json -- the rich weights config and metadata source for flopscope
  • report.html -- a human-readable HTML dashboard

The generated weights.json is the source-of-truth artifact: it contains the per-operation weights plus metadata used by the generated docs and API data. The packaged runtime file is a slim derivative used only for default loading.

To benchmark only certain categories:

python -m benchmarks.runner \
    --dtype float64 \
    --output weights.json \
    --category pointwise \
    --category linalg

Available categories: pointwise, reductions, linalg, linalg_delegates, fft, sorting, random, polynomial, contractions, misc, window, bitwise, complex.

Using weights

Default, override, and disable behavior

flopscope ships with packaged official weights and loads them automatically on a normal import.

Set FLOPSCOPE_WEIGHTS_FILE to override those packaged defaults at import time:

export FLOPSCOPE_WEIGHTS_FILE=weights.json
python your_code.py

Set FLOPSCOPE_DISABLE_WEIGHTS=1 to disable weighting entirely and use unit weights:

export FLOPSCOPE_DISABLE_WEIGHTS=1
python your_code.py

The JSON file must have a "weights" key mapping operation names to floats:

{
  "weights": {
    "reshape": 0.0,
    "add": 1.0,
    "exp": 16.0,
    "sin": 16.0,
    "matmul": 1.0,
    "linalg.cholesky": 4.0
  }
}

Operations not listed in the override file default to 1.0. A partial config (e.g., pointwise-only) works without error.
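A minimal sketch of producing such a partial override file with the standard library is shown below. The `lookup` helper is a hypothetical stand-in that only mimics the "unlisted operations default to 1.0" rule described above; it is not flopscope's own loader.

```python
import json
import os
import tempfile

# Partial override: only a few pointwise ops are listed.
config = {"weights": {"exp": 16.0, "sin": 16.0, "add": 1.0}}

path = os.path.join(tempfile.mkdtemp(), "weights.json")
with open(path, "w", encoding="utf-8") as fh:
    json.dump(config, fh, indent=2)


def lookup(weights, op_name):
    """Hypothetical lookup: ops missing from the file get weight 1.0."""
    return float(weights.get(op_name, 1.0))


# Read the file back the way a consumer of the "weights" key would.
with open(path, encoding="utf-8") as fh:
    loaded = json.load(fh)["weights"]

exp_w = lookup(loaded, "exp")               # listed in the file
chol_w = lookup(loaded, "linalg.cholesky")  # not listed, falls back to 1.0
print(exp_w, chol_w)
```

Pointing FLOPSCOPE_WEIGHTS_FILE at the written path would then make this file the active override on the next import.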

Programmatic loading

from flopscope._weights import load_weights, reset_weights, get_weight

# Load from a specific file
load_weights("/path/to/weights.json")

# Or re-load the packaged official defaults explicitly
load_weights()

# Check a weight
print(get_weight("exp"))   # e.g. 16.0 after calibration
print(get_weight("add"))   # 1.0
print(get_weight("foo"))   # 1.0 (unknown ops default to 1.0)

# Clear active overrides in the current process
reset_weights()

How weights are applied

Weights are applied centrally in BudgetContext.deduct(). Every counted operation passes its op_name to deduct(), which looks up the weight and multiplies it into the cost:

adjusted_cost = analytical_cost * flop_multiplier * weight(op_name)

Weights compose with flop_multiplier and with symmetry reductions -- symmetry reduces the element count, the weight scales the per-element cost, and both apply independently.
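The composition can be sketched with plain arithmetic. This is an illustration under stated assumptions, not the body of BudgetContext.deduct(): the function below is hypothetical, and the symmetry reduction is assumed to charge only the n(n+1)/2 unique elements of a symmetric n x n matrix.

```python
def adjusted_cost(analytical_cost, flop_multiplier, weight):
    """Mirror of the formula above: all three factors simply multiply."""
    return analytical_cost * flop_multiplier * weight


n = 100
full_elements = n * n               # 10000 elements without symmetry
unique_elements = n * (n + 1) // 2  # 5050 elements (assumed reduction)

exp_weight = 16.0  # illustrative calibrated weight for exp

# Symmetry changes the element count; the weight scales each element.
dense_cost = adjusted_cost(full_elements, 1.0, exp_weight)
symmetric_cost = adjusted_cost(unique_elements, 1.0, exp_weight)

# A flop_multiplier of 2.0 doubles the result regardless of the other two.
doubled_cost = adjusted_cost(unique_elements, 2.0, exp_weight)
print(dense_cost, symmetric_cost, doubled_cost)
```

Because each factor enters as an independent multiplier, changing the weight never alters how many elements symmetry removes, and vice versa.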

How measurement works

The benchmark suite supports two measurement modes, chosen automatically:

Mode   | Platform                        | What it measures
perf   | Linux (with perf installed)     | Instruction-style hardware counters, weighted by SIMD width
timing | Any (macOS, Linux without perf) | Relative wall-clock cost, normalized against np.add

Every counted operation's weight is computed as:

weight(op) = max(alpha_raw(op) - overhead_for_category, 0)

where alpha_raw(op) depends on the active measurement mode:

  • in perf mode, it is the median ratio of counter-observed retired instructions to the analytical FLOP count (FMA = 1 op)
  • in timing mode, it is the median ratio of wall-clock cost relative to np.add, normalized against the same analytical FLOP count

Known analytical zero-FLOP operations are not benchmarked; they are emitted separately into the official artifacts with weight 0.0.
In other words, perf-mode weights are instruction-oriented calibration factors, while timing-mode weights are relative cost proxies. Both are useful for scaling analytical FLOPs, but timing-mode results should not be read as literal hardware instruction counts.

The ufunc dispatch overhead (measured from np.abs, which generates zero FP arithmetic) is subtracted per category to remove NumPy implementation noise from the weight. BLAS-backed operations bypass the ufunc layer and have zero overhead subtracted.

Each measurement uses pre-allocated output arrays to eliminate memory allocation overhead, multiple input distributions for robustness, subprocess isolation to prevent interference, and warmup iterations in timing mode.
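The weight formula and the overhead subtraction can be sketched as follows. All numbers are invented for illustration, and the helper names are hypothetical; the benchmark suite's actual code may be organized differently. The measured values stand for whatever the active mode reports (retired instructions in perf mode, np.add-relative cost in timing mode).

```python
from statistics import median


def alpha_raw(measured, analytical_flops):
    """Median ratio of measured cost to analytical FLOPs across repeats."""
    return median(m / analytical_flops for m in measured)


def weight(measured, analytical_flops, overhead):
    """weight(op) = max(alpha_raw(op) - overhead_for_category, 0)."""
    return max(alpha_raw(measured, analytical_flops) - overhead, 0.0)


flops = 1_000_000        # analytical FLOPs for the benchmarked call
pointwise_overhead = 1.5  # invented per-category dispatch overhead

# Invented repeats; the per-run ratios are 17.4, 17.5, and 17.9.
exp_weight = weight([17_400_000, 17_500_000, 17_900_000],
                    flops, pointwise_overhead)

# An op whose raw ratio falls below the overhead is clamped to zero.
cheap_weight = weight([1_400_000, 1_500_000, 1_450_000],
                      flops, pointwise_overhead)
print(exp_weight, cheap_weight)
```

Taking the median rather than the mean keeps a single noisy repeat from shifting the weight, and the clamp prevents an overestimated overhead from producing a negative cost.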

Interpreting results

  • weight = 1.0 -- the operation has the same calibrated cost per analytical FLOP as np.add under the chosen measurement mode
  • weight > 1.0 -- the operation is more expensive per analytical FLOP than np.add under the chosen measurement mode (for example, transcendental functions such as exp)
  • weight < 1.0 -- the operation is cheaper per analytical FLOP than np.add under the chosen measurement mode. This can happen because of efficient kernels, vectorization, library implementation details, or benchmarking noise; it should not be interpreted using the old "FMA counts as 2 FLOPs" convention, because flopscope uses FMA = 1 throughout the model

Weights are platform-specific -- different CPUs, BLAS libraries, and libm implementations produce different values. Always measure on the target platform. Symmetry reductions are independent of weights: symmetry reduces the element count while the weight scales the per-element cost.

Full per-operation weight data is available in the API Reference and on each standalone operation page under the canonical /docs/api/... routes, for example /docs/api/einsum/.
