flopscope.
Understanding Flopscope

FLOP Counting Model

Use this page to understand how Flopscope counts FLOPs and why it uses analytical counting instead of runtime measurement.

You will learn:

  • How Flopscope computes FLOP costs analytically from tensor shapes
  • How cost formulas work for each operation category (einsum, linalg, FFT, etc.)
  • How symmetry savings and per-operation weights modify costs
  • How the FLOP multiplier and namespaces interact with the cost model

Convention: FMA = 1 operation

This codebase counts a fused multiply-add (a * b + c) as a single operation. Hardware FMA units execute this in one instruction; the common textbook convention of counting it as 2 (one multiply + one add) is not used here. All cost formulae reflect this: a matrix multiply of dimensions (m, k) x (k, n) costs mkn operations, not 2mk*n.

Why FLOPs instead of wall-clock time

  • Deterministic: The same code always produces the same FLOP count, regardless of hardware
  • Hardware-independent: A matmul costs the same FLOPs on a laptop and a server
  • Reproducible: No variance from CPU scheduling, cache effects, or thermal throttling
  • Composable: You can sum individual operation costs to predict total cost

How costs are computed

flopscope computes FLOP costs analytically from tensor shapes, not by measuring execution time.

  1. You call a counted operation (e.g., fnp.einsum('ij,j->i', W, x))
  2. flopscope computes the cost from the shapes: 256 × 256 = 65,536 FLOPs
  3. The cost is checked against the remaining budget
  4. If within budget: the operation executes and the cost is deducted
  5. If over budget: BudgetExhaustedError is raised, the operation does not execute

Cost formulas by category

Each formula below gives the analytical base cost. In normal use, flopscope loads the packaged official per-operation weights automatically at import time, so the base cost is multiplied by the operation's weight to give the final deducted cost. Set FLOPSCOPE_DISABLE_WEIGHTS=1 if you want the pure analytical unit-cost model instead.

CategoryFormulaExample
EinsumPer-step: product of all index dims'ij,jk->ik' → 3 × 4 × 5 = 60
Unary (exp, log, sqrt, ...){numel}({output})\text\{numel\}(\text\{output\})shape (256, 256) → 65,536
Binary (add, multiply, ...){numel}({output})\text\{numel\}(\text\{output\})shape (256, 256) → 65,536
Reduction (sum, mean, max, ...){numel}({input})\text\{numel\}(\text\{input\})shape (256, 256) → 65,536
SVDmnkm \cdot n \cdot k(256, 256, k=10) → 655,360
Solven3n^3(256, 256) solve → 16,777,216
Dot / MatmulSame as einsum(256, 256) @ (256, 256) → 256³
Free ops0zeros, reshape, etc.
CategoryFormulaExample
Sort / Argsortnlog2nn \cdot \lceil\log_2 n\rceil per sliceshape (4, 8), axis=-1 → 4 × 8 × 3 = 96
Lexsortknlog2nk \cdot n \cdot \lceil\log_2 n\rceil2 keys of length 8 → 2 × 8 × 3 = 48
Partitionnn per sliceshape (100,), kth=50 → 100
Searchsortedmlog2nm \cdot \lceil\log_2 n\rceil5 queries into 1024 → 5 × 10 = 50
Uniquenlog2nn \cdot \lceil\log_2 n\rceil8 elements → 8 × 3 = 24
Set ops(n+m)log2(n+m)(n+m) \cdot \lceil\log_2(n+m)\rceil4 + 4 elements → 8 × 3 = 24

Histogram & counting

CategoryFormulaExample
Histogramnlog2{bins}n \cdot \lceil\log_2 \text\{bins\}\rceil100 elements, 8 bins → 100 × 3 = 300
Bincountnn100 elements → 100

Random sampling

CategoryFormulaExample
Simple samplers{numel}({output})\text\{numel\}(\text\{output\})shape (10, 20) → 200
Shuffle / Permutationnlog2nn \cdot \lceil\log_2 n\rceil16 elements → 16 × 4 = 64

Symmetry savings

When a tensor is a SymmetricTensor, costs are reduced based on the number of unique elements rather than total elements. For a symmetric n×nn \times n matrix, there are n(n+1)/2n(n+1)/2 unique elements instead of n2n^2.

CategorySymmetric costStandard cost
Pointwise (unary/binary)unique_elements{numel}({output})\text\{numel\}(\text\{output\})
Reductionunique_elements{numel}({input})\text\{numel\}(\text\{input\})
Einsum (symmetric contraction)Symmetry-reduced (see below)Full product
Solven3n^3n3n^3
Det / Slogdetn3n^3n3n^3
Invn3/3+n3n^3/3 + n^3n3n^3

See Exploit Symmetry Savings for usage details.

Subgraph symmetry detection

Symmetry that reduces einsum costs comes from two complementary sources, both unified under the subgraph symmetry detection algorithm:

  1. Declared per-operand symmetry. When an operand is wrapped with flops.as_symmetric(), its symmetry groups are embedded in the bipartite graph as U-vertex equivalence classes. These propagate into intermediate tensors automatically.

  2. Induced symmetry from repeated operands. When the same Python object is passed at multiple operand positions, the subgraph oracle detects this via Python identity (is) and derives symmetry groups on the output that cannot be seen from per-operand metadata alone.

The oracle builds a bipartite graph once per contract_path call and evaluates symmetry lazily per subset of operands encountered during path search. Both sources are merged via the same group-merging machinery, so a tensor that is both SymmetricTensor and also repeated in the subscript benefits from both contributions simultaneously.

See the symmetry guide for usage examples, and the subgraph symmetry explanation for the algorithm walkthrough.

Einsum cost model

Every einsum — regardless of the number of operands — is decomposed into pairwise contraction steps along an optimal path (found via flopscope's opt_einsum fork). The total cost is the sum of per-step costs:

total_cost = sum(step.flop_cost for step in path.steps)

Per-step cost

For each pairwise step, the dense cost is:

dense_step_cost = product of all index dimensions

Each fused multiply-add (FMA) counts as 1 operation (see Convention above), so the cost of a contraction step is simply the product of all index dimensions — there is no factor-of-2 distinction between inner products and outer products.

When symmetry is present, flopscope reduces each step's cost based on the structure of the contraction.

Symmetric contraction cost

Each pairwise step's cost is reduced by two independent multiplicative factors — one for the output (V-side) indices and one for the inner (W-side) contracted indices:

step_cost = dense_step_cost
          × (unique_output_elements / total_output_elements)
          × (unique_inner_elements / total_inner_elements)

Each ratio is computed exactly using Burnside's lemma over the permutation group detected for that step by the SubgraphSymmetryOracle. For the full symmetric group Sk_k on kk equal-sized axes, Burnside reduces to the stars-and-bars formula ({n)+k1}{k}\binom\{n+k-1\}\{k\}; for proper subgroups like CkC_k or block groups the oracle returns the exact generators and Burnside counts over the enumerated elements.

The output (V-side) reduction is always applied when the step's intermediate has a non-trivial permutation group on its free indices — only the unique output elements need to be computed.

The inner (W-side) reduction is applied only when all labels in the detected inner group are present as contracted indices in that specific pairwise step. If any of those labels were contracted at an earlier step and no longer appear in the current step, the inner reduction is skipped (the per-step table shows this as [W: ...] when detected-but-not-applied versus [W✓: ...] when applied). Inner symmetry can be toggled globally with flops.configure(use_inner_symmetry=False).

The two factors are independent; outer-product contractions (no summed indices) and non-uniform index dimensions are handled by the same formula, since Burnside's lemma makes no assumption about uniform sizes beyond requiring axes in the same orbit to share a dimension.

Multi-operand contractions

For a simple two-operand einsum like 'ij,jk->ik', there is one step, so the total cost equals the step cost. For multi-operand einsums (3+ tensors), the optimizer finds the pairwise ordering that minimizes the total cost.

When symmetric tensors are present, the optimizer is symmetry-aware: it uses symmetric costs to decide which pair to contract at each step, so the returned path may differ from the dense-optimal path. Symmetry propagates through intermediates — if an early contraction produces a symmetric intermediate, subsequent steps benefit from the reduced element count, and the optimizer factors this into its ordering decisions.

Use fnp.einsum_path() to inspect the per-step breakdown. See Use Einsum for examples.

Per-operation weights

The analytical formulas above treat all operations within a category as equally expensive -- exp, log, sin, and abs all cost {numel}({output})\text\{numel\}(\text\{output\}) FLOPs. In reality, exp decomposes into a minimax polynomial approximation requiring approximately 14 FP instructions per element, while abs is a single bit manipulation.

Per-operation weights correct for this. Each weight is a multiplicative constant applied on top of the analytical formula:

actual_cost = analytical_formula(shape) × weight(op_name)
OperationAnalytical formulaWeightEffective cost (256x256)
add{numel}({output})\text\{numel\}(\text\{output\})165,536
exp{numel}({output})\text\{numel\}(\text\{output\})161,048,576
sin{numel}({output})\text\{numel\}(\text\{output\})161,048,576
matmulmknm \cdot k \cdot n116,777,216
linalg.choleskyn3n^3467,108,864
reshape0000

Weights are measured using the overhead-subtracted correction-factor methodology described in FLOP Weight Calibration Results. The formula is:

w({op})=max(α{{raw}}({op}){overhead}{{category}}, 0)w(\text\{op\}) = \max\bigl(\alpha_\{\text\{raw\}\}(\text\{op\}) - \text\{overhead\}_\{\text\{category\}\}, \ 0\bigr)

where α{{raw}}\alpha_\{\text\{raw\}\} is the median ratio of hardware-observed FP instructions to analytical FLOPs (FMA = 1 op), measured via fp_arith_inst_retired performance counters. The ufunc dispatch overhead (measured from np.abs, which generates zero FP arithmetic) is subtracted per category to remove numpy implementation noise from the weight.

BLAS-backed operations (contractions, linalg) have weights near 1.0 because their tight FMA loops execute almost exactly 1 hardware FP instruction per analytical FLOP, with no ufunc overhead to subtract.

Known analytical zero-FLOP operations (reshape, broadcast_to, random.seed, etc.) are stored with weight 0.0 in the official artifacts so the generated docs surface them as free rather than as 1x unit-cost ops.

Integer and bitwise operations (bitwise_and, gcd, lcm, etc.) use the instructions hardware counter (total retired instructions) because they do not retire fp_arith_inst_retired events. Their weights are derived from instruction counts normalized the same way as FP operations.

Official weights are packaged with flopscope and enabled by default on import. Use FLOPSCOPE_WEIGHTS_FILE to override them with a custom JSON file, or set FLOPSCOPE_DISABLE_WEIGHTS=1 to disable weighting entirely and fall back to unit weights (1.0 for all operations).

How weights are applied

Weights are applied centrally in BudgetContext.deduct(). Every counted operation passes its op_name to deduct(), which looks up the weight and multiplies it into the cost:

adjusted_cost = analytical_cost × flop_multiplier × weight(op_name)

This means weights compose with flop_multiplier and with symmetry reductions -- symmetry reduces the element count, the weight scales the per-element cost, and both apply independently.

Overriding or disabling packaged weights

Normal imports load the packaged official weights automatically. To override them at import time, set FLOPSCOPE_WEIGHTS_FILE:

export FLOPSCOPE_WEIGHTS_FILE=/path/to/weights.json

To disable weighting entirely and use unit weights instead:

export FLOPSCOPE_DISABLE_WEIGHTS=1

The JSON file must have a "weights" key mapping operation names to floats:

{
  "weights": {
    "add": 1.0,
    "exp": 16.0,
    "sin": 16.0,
    "matmul": 1.0,
    "linalg.cholesky": 4.0
  }
}

Operations not listed in the override file default to 1.0. See Calibrate Weights for how to generate this file.

Where weights come from

Weights can be determined in two ways:

  1. Hardware performance counters (Linux perf stat) -- counts actual floating-point instructions retired by the CPU, weighted by SIMD width. This gives the true number of basic FP ops per high-level operation.

  2. Wall-clock time normalization -- measures time(op) / time(add) as a relative proxy. Less precise than hardware counters but works on any platform.

The benchmarks/ package in this repository automates both methods. See Calibrate Weights.

FLOP multiplier

The flop_multiplier parameter in BudgetContext scales all costs:

import flopscope as flops

with flops.BudgetContext(flop_budget=10**6, flop_multiplier=2.0) as budget:
    # Every operation costs 2× its normal FLOP count
    ...

This is useful for experimentation or adjusting the difficulty of a budget constraint. Note that flop_multiplier and per-operation weights are independent — flop_multiplier scales all operations uniformly, while weights scale each operation individually.

Namespaces

Use flops.namespace(...) to create nested namespace scopes inside an active BudgetContext. The namespace parameter on BudgetContext sets the root namespace prefix for operations in that context:

import flopscope as flops
import flopscope.numpy as fnp

with flops.BudgetContext(flop_budget=10**6, namespace="predict") as budget:
    with flops.namespace("precompute"):
        stats = fnp.mean(x)

    with flops.namespace("fallback"):
        with flops.namespace("sampling"):
            sample = fnp.add(stats, 1)

    print(budget.summary(by_namespace=True))

Namespaces are hierarchical and exact per operation. flops.namespace("precompute") inside namespace="predict" records operations as predict.precompute, and nested scopes append dotted segments such as predict.fallback.sampling.

Namespaces do not create child budgets or separate time limits. They only change attribution. budget.summary() and flops.budget_summary() stay flat by default; budget.summary(by_namespace=True) and flops.budget_summary(by_namespace=True) opt into a By namespace section. The structured forms, budget.summary_dict(by_namespace=True) and flops.budget_summary_dict(by_namespace=True), return namespace buckets keyed by the full namespace path or None for unlabeled operations. Namespace buckets include namespace-attributable FLOPs, calls, operations, flopscope_backend_time_s, and flopscope_overhead_time_s; context-level wall and residual wall time stay on the top-level summary.

data = budget.summary_dict(by_namespace=True)
print(data["by_namespace"]["predict.fallback.sampling"]["flops_used"])

On this page