FLOP Counting Model
Use this page to understand how Flopscope counts FLOPs and why it uses analytical counting instead of runtime measurement.
You will learn:
- How Flopscope computes FLOP costs analytically from tensor shapes
- How cost formulas work for each operation category (einsum, linalg, FFT, etc.)
- How symmetry savings and per-operation weights modify costs
- How the FLOP multiplier and namespaces interact with the cost model
Convention: FMA = 1 operation
This codebase counts a fused multiply-add (a * b + c) as a single operation. Hardware FMA units execute this in one instruction; the common textbook convention of counting it as 2 (one multiply + one add) is not used here. All cost formulae reflect this: a matrix multiply of dimensions (m, k) x (k, n) costs mkn operations, not 2mk*n.
Why FLOPs instead of wall-clock time
- Deterministic: The same code always produces the same FLOP count, regardless of hardware
- Hardware-independent: A matmul costs the same FLOPs on a laptop and a server
- Reproducible: No variance from CPU scheduling, cache effects, or thermal throttling
- Composable: You can sum individual operation costs to predict total cost
How costs are computed
flopscope computes FLOP costs analytically from tensor shapes, not by measuring execution time.
- You call a counted operation (e.g.,
fnp.einsum('ij,j->i', W, x)) - flopscope computes the cost from the shapes: 256 × 256 = 65,536 FLOPs
- The cost is checked against the remaining budget
- If within budget: the operation executes and the cost is deducted
- If over budget:
BudgetExhaustedErroris raised, the operation does not execute
Cost formulas by category
Each formula below gives the analytical base cost. In normal use,
flopscope loads the packaged official per-operation
weights automatically at import time, so the base
cost is multiplied by the operation's weight to give the final deducted
cost. Set FLOPSCOPE_DISABLE_WEIGHTS=1 if you want the pure analytical unit-cost
model instead.
| Category | Formula | Example |
|---|---|---|
| Einsum | Per-step: product of all index dims | 'ij,jk->ik' → 3 × 4 × 5 = 60 |
| Unary (exp, log, sqrt, ...) | shape (256, 256) → 65,536 | |
| Binary (add, multiply, ...) | shape (256, 256) → 65,536 | |
| Reduction (sum, mean, max, ...) | shape (256, 256) → 65,536 | |
| SVD | (256, 256, k=10) → 655,360 | |
| Solve | (256, 256) solve → 16,777,216 | |
| Dot / Matmul | Same as einsum | (256, 256) @ (256, 256) → 256³ |
| Free ops | 0 | zeros, reshape, etc. |
Sorting & search
| Category | Formula | Example |
|---|---|---|
| Sort / Argsort | per slice | shape (4, 8), axis=-1 → 4 × 8 × 3 = 96 |
| Lexsort | 2 keys of length 8 → 2 × 8 × 3 = 48 | |
| Partition | per slice | shape (100,), kth=50 → 100 |
| Searchsorted | 5 queries into 1024 → 5 × 10 = 50 | |
| Unique | 8 elements → 8 × 3 = 24 | |
| Set ops | 4 + 4 elements → 8 × 3 = 24 |
Histogram & counting
| Category | Formula | Example |
|---|---|---|
| Histogram | 100 elements, 8 bins → 100 × 3 = 300 | |
| Bincount | 100 elements → 100 |
Random sampling
| Category | Formula | Example |
|---|---|---|
| Simple samplers | shape (10, 20) → 200 | |
| Shuffle / Permutation | 16 elements → 16 × 4 = 64 |
Symmetry savings
When a tensor is a SymmetricTensor, costs are reduced based on the number of unique elements rather than total elements. For a symmetric matrix, there are unique elements instead of .
| Category | Symmetric cost | Standard cost |
|---|---|---|
| Pointwise (unary/binary) | unique_elements | |
| Reduction | unique_elements | |
| Einsum (symmetric contraction) | Symmetry-reduced (see below) | Full product |
| Solve | ||
| Det / Slogdet | ||
| Inv |
See Exploit Symmetry Savings for usage details.
Subgraph symmetry detection
Symmetry that reduces einsum costs comes from two complementary sources, both unified under the subgraph symmetry detection algorithm:
-
Declared per-operand symmetry. When an operand is wrapped with
flops.as_symmetric(), its symmetry groups are embedded in the bipartite graph as U-vertex equivalence classes. These propagate into intermediate tensors automatically. -
Induced symmetry from repeated operands. When the same Python object is passed at multiple operand positions, the subgraph oracle detects this via Python identity (
is) and derives symmetry groups on the output that cannot be seen from per-operand metadata alone.
The oracle builds a bipartite graph once per contract_path call and
evaluates symmetry lazily per subset of operands encountered during path
search. Both sources are merged via the same group-merging machinery, so a
tensor that is both SymmetricTensor and also repeated in the subscript
benefits from both contributions simultaneously.
See the symmetry guide for usage examples, and the subgraph symmetry explanation for the algorithm walkthrough.
Einsum cost model
Every einsum — regardless of the number of operands — is decomposed into pairwise contraction steps along an optimal path (found via flopscope's opt_einsum fork). The total cost is the sum of per-step costs:
total_cost = sum(step.flop_cost for step in path.steps)Per-step cost
For each pairwise step, the dense cost is:
dense_step_cost = product of all index dimensionsEach fused multiply-add (FMA) counts as 1 operation (see Convention above), so the cost of a contraction step is simply the product of all index dimensions — there is no factor-of-2 distinction between inner products and outer products.
When symmetry is present, flopscope reduces each step's cost based on the structure of the contraction.
Symmetric contraction cost
Each pairwise step's cost is reduced by two independent multiplicative factors — one for the output (V-side) indices and one for the inner (W-side) contracted indices:
step_cost = dense_step_cost
× (unique_output_elements / total_output_elements)
× (unique_inner_elements / total_inner_elements)Each ratio is computed exactly using Burnside's lemma over the
permutation group detected for that step by the
SubgraphSymmetryOracle. For the
full symmetric group S on equal-sized axes, Burnside reduces to
the stars-and-bars formula ; for proper subgroups like
or block groups the oracle returns the exact generators and
Burnside counts over the enumerated elements.
The output (V-side) reduction is always applied when the step's intermediate has a non-trivial permutation group on its free indices — only the unique output elements need to be computed.
The inner (W-side) reduction is applied only when all labels in
the detected inner group are present as contracted indices in that
specific pairwise step. If any of those labels were contracted at an
earlier step and no longer appear in the current step, the inner
reduction is skipped (the per-step table shows this as [W: ...] when
detected-but-not-applied versus [W✓: ...] when applied). Inner
symmetry can be toggled globally with
flops.configure(use_inner_symmetry=False).
The two factors are independent; outer-product contractions (no summed indices) and non-uniform index dimensions are handled by the same formula, since Burnside's lemma makes no assumption about uniform sizes beyond requiring axes in the same orbit to share a dimension.
Multi-operand contractions
For a simple two-operand einsum like 'ij,jk->ik', there is one step,
so the total cost equals the step cost. For multi-operand einsums (3+
tensors), the optimizer finds the pairwise ordering that minimizes the
total cost.
When symmetric tensors are present, the optimizer is symmetry-aware: it uses symmetric costs to decide which pair to contract at each step, so the returned path may differ from the dense-optimal path. Symmetry propagates through intermediates — if an early contraction produces a symmetric intermediate, subsequent steps benefit from the reduced element count, and the optimizer factors this into its ordering decisions.
Use fnp.einsum_path() to inspect the per-step breakdown. See
Use Einsum for examples.
Per-operation weights
The analytical formulas above treat all operations within a category as
equally expensive -- exp, log, sin, and abs all cost
FLOPs. In reality, exp decomposes into
a minimax polynomial approximation requiring approximately 14 FP
instructions per element, while abs is a single bit manipulation.
Per-operation weights correct for this. Each weight is a multiplicative constant applied on top of the analytical formula:
actual_cost = analytical_formula(shape) × weight(op_name)| Operation | Analytical formula | Weight | Effective cost (256x256) |
|---|---|---|---|
add | 1 | 65,536 | |
exp | 16 | 1,048,576 | |
sin | 16 | 1,048,576 | |
matmul | 1 | 16,777,216 | |
linalg.cholesky | 4 | 67,108,864 | |
reshape | 0 | 0 |
Weights are measured using the overhead-subtracted correction-factor methodology described in FLOP Weight Calibration Results. The formula is:
where is the median ratio of hardware-observed FP
instructions to analytical FLOPs (FMA = 1 op), measured via
fp_arith_inst_retired performance counters. The ufunc dispatch overhead
(measured from np.abs, which generates zero FP arithmetic) is subtracted
per category to remove numpy implementation noise from the weight.
BLAS-backed operations (contractions, linalg) have weights near 1.0 because their tight FMA loops execute almost exactly 1 hardware FP instruction per analytical FLOP, with no ufunc overhead to subtract.
Known analytical zero-FLOP operations (reshape, broadcast_to,
random.seed, etc.) are stored with weight 0.0 in the official artifacts
so the generated docs surface them as free rather than as 1x unit-cost ops.
Integer and bitwise operations (bitwise_and, gcd, lcm, etc.) use the
instructions hardware counter (total retired instructions) because they
do not retire fp_arith_inst_retired events. Their weights are derived from
instruction counts normalized the same way as FP operations.
Official weights are packaged with flopscope and enabled by default on import.
Use FLOPSCOPE_WEIGHTS_FILE to override them with a custom JSON file, or set
FLOPSCOPE_DISABLE_WEIGHTS=1 to disable weighting entirely and fall back to
unit weights (1.0 for all operations).
How weights are applied
Weights are applied centrally in BudgetContext.deduct(). Every counted
operation passes its op_name to deduct(), which looks up the weight
and multiplies it into the cost:
adjusted_cost = analytical_cost × flop_multiplier × weight(op_name)This means weights compose with flop_multiplier and with symmetry
reductions -- symmetry reduces the element count, the weight scales the
per-element cost, and both apply independently.
Overriding or disabling packaged weights
Normal imports load the packaged official weights automatically. To override
them at import time, set FLOPSCOPE_WEIGHTS_FILE:
export FLOPSCOPE_WEIGHTS_FILE=/path/to/weights.jsonTo disable weighting entirely and use unit weights instead:
export FLOPSCOPE_DISABLE_WEIGHTS=1The JSON file must have a "weights" key mapping operation names to floats:
{
"weights": {
"add": 1.0,
"exp": 16.0,
"sin": 16.0,
"matmul": 1.0,
"linalg.cholesky": 4.0
}
}Operations not listed in the override file default to 1.0. See Calibrate Weights for how to generate this file.
Where weights come from
Weights can be determined in two ways:
-
Hardware performance counters (Linux
perf stat) -- counts actual floating-point instructions retired by the CPU, weighted by SIMD width. This gives the true number of basic FP ops per high-level operation. -
Wall-clock time normalization -- measures
time(op) / time(add)as a relative proxy. Less precise than hardware counters but works on any platform.
The benchmarks/ package in this repository automates both methods. See
Calibrate Weights.
FLOP multiplier
The flop_multiplier parameter in BudgetContext scales all costs:
import flopscope as flops
with flops.BudgetContext(flop_budget=10**6, flop_multiplier=2.0) as budget:
# Every operation costs 2× its normal FLOP count
...This is useful for experimentation or adjusting the difficulty of a budget
constraint. Note that flop_multiplier and per-operation weights are
independent — flop_multiplier scales all operations uniformly, while
weights scale each operation individually.
Namespaces
Use flops.namespace(...) to create nested namespace scopes inside an active BudgetContext. The namespace parameter on BudgetContext sets the root namespace prefix for operations in that context:
import flopscope as flops
import flopscope.numpy as fnp
with flops.BudgetContext(flop_budget=10**6, namespace="predict") as budget:
with flops.namespace("precompute"):
stats = fnp.mean(x)
with flops.namespace("fallback"):
with flops.namespace("sampling"):
sample = fnp.add(stats, 1)
print(budget.summary(by_namespace=True))Namespaces are hierarchical and exact per operation. flops.namespace("precompute") inside namespace="predict" records operations as predict.precompute, and nested scopes append dotted segments such as predict.fallback.sampling.
Namespaces do not create child budgets or separate time limits. They only change attribution. budget.summary() and flops.budget_summary() stay flat by default; budget.summary(by_namespace=True) and flops.budget_summary(by_namespace=True) opt into a By namespace section. The structured forms, budget.summary_dict(by_namespace=True) and flops.budget_summary_dict(by_namespace=True), return namespace buckets keyed by the full namespace path or None for unlabeled operations. Namespace buckets include namespace-attributable FLOPs, calls, operations, flopscope_backend_time_s, and flopscope_overhead_time_s; context-level wall and residual wall time stay on the top-level summary.
data = budget.summary_dict(by_namespace=True)
print(data["by_namespace"]["predict.fallback.sampling"]["flops_used"])Related pages
- Operation Categories — which operations are free, counted, or unsupported
- Budget Planning — query costs before running
- Calibrate Weights — measure per-operation weights empirically