flopscope.
Guides

Einsum Patterns

Use this page to understand fnp.einsum -- the core computation primitive in flopscope.

You will learn:

  • How to write common einsum patterns and understand their FLOP costs
  • How to use symmetric tensors with einsum for cost savings
  • How to inspect and customize contraction paths
  • How to leverage path caching for repeated operations

Prerequisites

Common patterns

import flopscope as flops
import flopscope.numpy as fnp

with flops.BudgetContext(flop_budget=10**8) as budget:
    A = fnp.ones((256, 256))
    B = fnp.ones((256, 256))
    x = fnp.ones((256,))

    # Matrix-vector multiply: cost = m × k
    y = fnp.einsum('ij,j->i', A, x)           # 256 × 256 = 65,536 FLOPs

    # Matrix multiply: cost = m × k × n
    C = fnp.einsum('ij,jk->ik', A, B)         # 256 × 256 × 256 = 16,777,216 FLOPs

    # Outer product: cost = i × j
    outer = fnp.einsum('i,j->ij', x, x)       # 256 × 256 = 65,536 FLOPs

    # Trace: cost = i
    tr = fnp.einsum('ii->', A)                 # 256 FLOPs

    # Batched matmul: cost = b × m × k × n
    batch = fnp.ones((4, 256, 256))
    out = fnp.einsum('bij,bjk->bik', batch, batch)  # 4 × 256 × 256 × 256 FLOPs

    print(budget.summary())

Cost formula

The cost of an einsum is the sum of per-step costs along the optimal contraction path. Every einsum — even a simple two-operand one — goes through the opt_einsum path optimizer (a symmetry-aware fork of opt_einsum).

For each pairwise step:

cost = product of all index dimensions

Each FMA (fused multiply-add) counts as 1 operation, so the cost is simply the product of all index dimensions with no factor-of-2.

For 'ij,jk->ik' with shapes (256, 256) and (256, 256):

  • Indices: i=256, j=256, k=256
  • Cost: 256 x 256 x 256 = 16,777,216

For multi-operand einsums (3+ tensors), Flopscope automatically decomposes the contraction into optimal pairwise steps. The total cost is the sum of per-step costs.

When symmetric tensors are involved, each step's cost is further reduced by the ratio of unique output elements to total output elements. See Symmetry Savings for the full practical guide.

fnp.dot and fnp.matmul

fnp.dot(A, B) and fnp.matmul(A, B) are equivalent to the corresponding einsum and have the same FLOP cost.

Symmetric tensors

There are two separate symmetry declarations — one for inputs, one for outputs:

Input symmetry — wrap with flops.as_symmetric() before passing to einsum. The optimizer automatically uses symmetry to choose the best contraction order and charges reduced costs:

with flops.BudgetContext(flop_budget=10**8) as budget:
    S = flops.as_symmetric(fnp.eye(10), symmetry=flops.SymmetryGroup.symmetric(axes=(0, 1)))  # 55 unique elements
    v = fnp.ones((10,))

    result = fnp.einsum('ij,j->i', S, v)  # cost reduced by input symmetry

Output symmetry — pass symmetry= to einsum() to declare that the result is symmetric. This wraps the output as a SymmetricTensor so downstream operations benefit from reduced costs. It does NOT affect the cost of this einsum — it's a declaration about the result's structure:

with flops.BudgetContext(flop_budget=10**8) as budget:
    X = fnp.random.randn(100, 10)

    # X^T X is always symmetric — declare the exact output group
    C = fnp.einsum('ki,kj->ij', X, X, symmetry=flops.SymmetryGroup.symmetric(axes=(0, 1)))

    print(type(C))  # <class 'SymmetricTensor'>
    # C can now be passed to other operations with automatic cost savings

For the full symmetry guide, see Symmetry Savings.

Inspecting costs

fnp.einsum_path() previews the contraction plan without executing the contraction itself. Planning is cheap: it records a nominal 1-FLOP einsum_path event, but none of the contraction FLOPs are spent.

import flopscope as flops
import flopscope.numpy as fnp

n = 10
T = flops.as_symmetric(fnp.ones((n, n, n)), symmetric_axes=(0, 1, 2))
A = fnp.random.randn(n, n)
B = fnp.random.randn(n, n)
C = fnp.random.randn(n, n)

n = 10
T = flops.as_symmetric(fnp.ones((n, n, n)), symmetry=flops.SymmetryGroup.symmetric(axes=(0, 1, 2)))
A = fnp.random.randn(n, n)
B = fnp.random.randn(n, n)
C = fnp.random.randn(n, n)

path, info = fnp.einsum_path('ijk,ai,bj,ck->abc', T, A, B, C)

print(f"Path: {path}")
print(info)
print(f"Naive cost:     {info.naive_cost:,}")
print(f"Optimized cost: {info.optimized_cost:,}")
print(f"Speedup:        {info.speedup:.1f}x")
print(f"Optimizer used: {info.optimizer_used}")
Path: [(0, 1), (0, 2), (0, 1)]
  Complete contraction:  ijk,ai,bj,ck->abc
      Naive cost (flopscope):  3,000,000
  Optimized cost (flopscope):  25,500
                     Speedup:  117.647x
       Largest intermediate:  1,000 elements
                Index sizes:  a=b=c=i=j=k=10
                  Optimizer:  optimal
--------------------------------------------------------------------------------------------------------------------------------------------
step  contract  subscript                                flops     dense_flops   savings  blas      unique/total  symmetry (inputs → output)
--------------------------------------------------------------------------------------------------------------------------------------------
   0  (0, 1)    ai,ijk->ajk                              5,500          10,000    45.0%  SYMM      V:550/1,000   - × S3{i,j,k} → S2{j,k}
   1  (0, 2)    ajk,bj->akb                             10,000          10,000     0.0%  TDOT      -             S2{j,k} × - → -
   2  (0, 1)    akb,ck->abc                             10,000          10,000     0.0%  TDOT      -             -
Naive cost:     3,000,000
Optimized cost: 25,500
Speedup:        117.6x
Optimizer used: optimal

The printed table gives you the contraction order, naive-vs-optimized FLOP counts, largest intermediate, grouped index sizes, and one row per pairwise contraction step. Each step shows the chosen contract tuple, dense baseline, savings, BLAS tag, and any symmetry that survived into the intermediate.

For per-step debugging, call print(info.format_table(verbose=True)). The verbose view adds indented rows with the merged operand subset, the intermediate output shape, and the running cumulative cost.

flops.einsum_cost() returns the same cost that einsum() would deduct — one source of truth:

cost = flops.einsum_cost('ij,jk->ik', shapes=[(256, 256), (256, 256)])
print(f"Matmul cost: {cost:,}")  # 16,777,216

Custom contraction paths

By default Flopscope finds the optimal contraction order automatically. You can override this by passing an explicit path — a list of int-tuples specifying which operand positions to contract at each step:

import flopscope as flops
import flopscope.numpy as fnp

A = fnp.ones((3, 4))
B = fnp.ones((4, 5))
C = fnp.ones((5, 6))

# Plan first, execute later
path, info = fnp.einsum_path('ij,jk,kl->il', A, B, C)
print(f"Optimal path: {path}")  # e.g. [(0, 1), (0, 1)]

# Execute with the planned path
with flops.BudgetContext(flop_budget=10**8) as budget:
    result = fnp.einsum('ij,jk,kl->il', A, B, C, optimize=path)

You can also specify a completely custom path. Each tuple names the positions (in the current operand list) to contract; the result is appended to the end:

# Force B×C first (positions 1,2), then A×result (positions 0,1)
result = fnp.einsum('ij,jk,kl->il', A, B, C, optimize=[(1, 2), (0, 1)])

# Force A×B first (positions 0,1), then result×C (positions 0,1)
result = fnp.einsum('ij,jk,kl->il', A, B, C, optimize=[(0, 1), (0, 1)])

Different paths may have different FLOP costs. Use fnp.einsum_path() to compare — it returns the plan without executing the contraction.

Path caching

Contraction paths are cached automatically in a module-level LRU cache. When you call fnp.einsum() with the same subscripts, shapes, optimizer, and symmetry structure, the path is reused from cache instead of being recomputed. This makes repeated einsums in loops essentially free in path-finding overhead:

with flops.BudgetContext(flop_budget=10**9) as budget:
    for i in range(1000):
        y = fnp.einsum('ij,j->i', A, x)  # path computed once, reused 999 times

fnp.einsum_path() shares the same cache, so planning a path warms the cache for subsequent fnp.einsum() calls and vice versa.

Cache management

# Inspect cache statistics
info = fnp.einsum_cache_info()
print(f"Hits: {info.hits}, Misses: {info.misses}, Size: {info.currsize}/{info.maxsize}")

# Clear the cache (e.g., to free memory or force recomputation)
fnp.clear_einsum_cache()

# Change the cache size (default 4096 entries, rebuilds the cache)
flops.configure(einsum_path_cache_size=8192)

Common pitfalls

Symptom: Unexpectedly high FLOP cost

Fix: Check all index dimensions. A subscript like 'ijkl,jklm->im' multiplies all five dimension sizes together. Use flops.einsum_cost() or fnp.einsum_path() to preview costs before executing.

On this page