flopscope.
Understanding Flopscope

Symmetry Detection Deep Dive

Contributor-level walkthrough of flopscope's symmetry detection algorithm.

You will learn:

  • How the bipartite graph is constructed from einsum subscripts
  • How the subset-keyed oracle detects and caches symmetry lazily
  • How the sigma-loop derives label permutations from row permutations
  • How Burnside's lemma counts unique elements for exact FLOP reduction

TL;DR: flopscope detects when an einsum expression has symmetry that allows computing fewer FLOPs. It does this by building a bipartite graph from the einsum subscripts, finding column permutations that preserve the graph structure, and using group theory to count how many unique terms exist. The savings can be dramatic -- a symmetric matrix multiplication can cost half as many FLOPs.

The problem

A multi-operand einsum like 'ij,ai,bj->ab' is decomposed by opt_einsum into a sequence of pairwise contractions. At each step the optimizer must evaluate candidate pairs -- and it needs to know, for each candidate intermediate, whether the result is symmetric, so it can score it with a reduced cost.

When operands are SymmetricTensors, their per-operand symmetry is known upfront. But there is a second source of symmetry: when the same Python object appears at multiple operand positions, the output can be symmetric in index labels contributed by those repeated operands -- even if the operands are dense.

The naive approach is to rerun a detection procedure at every step for every candidate subset. This is too expensive for large contractions. We want:

  1. Correctness -- detect all exploitable symmetry without false positives.
  2. Memoization -- compute each intermediate's symmetry at most once.
  3. Laziness -- only evaluate subsets that the path optimizer actually visits.

Subgraph symmetry detection achieves all three.

The bipartite graph

The core data structure is a bipartite graph over the einsum expression.

Left vertices (U): One U-vertex per axis of each operand. For a dense operand with subscript "ai", each axis produces its own U-vertex (two total). For a SymmetricTensor with subscript "ij" and declared symmetry S_2{i,j}, both axes still produce separate U-vertices -- per-operand symmetry does not affect the graph topology. Instead, per-operand symmetry is handled entirely by the expanded sigma-loop (see below), which uses the declared symmetry generators as an additional source of row permutations.

Right vertices (labels): One right vertex per unique index label. Labels are partitioned into:

  • V (free labels): appear in the final output subscript or in operands outside the current subset (they "cross the cut").
  • W (summed labels): contracted entirely within the current subset.

Incidence: An edge from U-vertex u to label c has weight equal to the multiplicity of c in the axes belonging to U-vertex u.

Identical-operand groups: Operands that are the same Python object are grouped. These groups are the source of induced symmetry.

Worked example

Consider 'ij,ai,bj->ab' with operands T, A, B where T is a dense tensor:

Subscripts:  ij,  ai,  bj  ->  ab
Operands:    T    A    B

U-vertices (one per axis):

  • (T, 0) -- label set {i}
  • (T, 1) -- label set {j}
  • (A, 0) -- label set {a}
  • (A, 1) -- label set {i}
  • (B, 0) -- label set {b}
  • (B, 1) -- label set {j}

Free labels at the top level: {a, b} (appear in output ->ab). Summed labels at the top level: {i, j} (contracted out).

No identical operands in this example -- T, A, and B are distinct Python objects.

Full bipartite graph

   U (axes)                         Labels
   -----------------               ------
                                   V (free):
   (A, 0) ------------------------- a
   (B, 0) ------------------------- b

                                   W (summed):
   (T, 0) ----------+
                    +--------------- i
   (A, 1) ----------+

   (T, 1) ----------+
                    +--------------- j
   (B, 1) ----------+

Now consider the subset {A, B} (positions 1 and 2):

  • U-vertices in subset: (A, 0), (A, 1), (B, 0), (B, 1)
  • Labels in subset: {a, i, b, j}
  • Labels outside subset (in T): {i, j}
  • Crossing labels (in subset AND in outside): {i, j}
  • V at this step = {a, b} + {i, j} = {a, b, i, j} (all four -- {i,j} cross the cut)
  • W at this step = {} (nothing is summed entirely within {A, B})

Induced subgraph for subset {A, B}

When we restrict to subset {A, B}, labels i and j cross the cut (they also appear in T, outside the subset), so they move from W to V:

   U (subset {A, B} only)           Labels
   ----------------------           ------
                                    V (all free):
   (A, 0) ------------------------- a
   (A, 1) ------------------------- i
   (B, 0) ------------------------- b
   (B, 1) ------------------------- j

                                    W: (empty)

The incidence matrix M at this subset (rows = U-vertices, columns = V+W):

         a  i  b  j
(A, 0):  1  0  0  0
(A, 1):  0  1  0  0
(B, 0):  0  0  1  0
(B, 1):  0  0  0  1

The subset-keyed oracle

The key invariant is the pure-in-subset property: the symmetry of an intermediate tensor depends only on the set of original operands it was formed from, not on the order in which they were contracted. This is because:

  • The bipartite graph structure is fixed for the full einsum.
  • The induced subgraph on a subset S is fully determined by which operands are in S.
  • Symmetry is a property of the final intermediate, not its contraction history.

This property makes the subset key canonical. The oracle stores results in a dict[frozenset[int], SubsetSymmetry] and returns cached results on subsequent calls with the same subset.

from flopscope._opt_einsum._subgraph_symmetry import SubgraphSymmetryOracle

# One oracle per contract_path call
oracle = SubgraphSymmetryOracle(
    operands=list(operands),
    subscript_parts=input_parts,
    per_op_groups=perm_groups,
    output_chars=output_str,
)

# Lazy evaluation -- only computed on first access per subset
result = oracle.sym(frozenset({0, 1}))  # SubsetSymmetry for intermediate from ops 0 and 1
result.output  # V-side (output tensor) symmetry
result.inner   # W-side (inner summation) symmetry

The detection algorithm

Goal

For a fixed subset S with incidence matrix M, we want the full group of automorphisms of the labelled bipartite graph -- pairs (sigma, pi) where sigma permutes identical-operand rows and pi permutes label columns, such that applying pi to the columns of sigma(M) recovers M:

pi(sigma(M)) = M

Every such pi is a symmetry of the intermediate tensor built from S. Restricted to V labels it contributes to the output (V-side) symmetry; restricted to W labels it contributes to the inner (W-side) symmetry. The V/W partition is part of the labelled structure, so legitimate automorphisms must preserve it -- pi(V) is a subset of V and pi(W) is a subset of W -- and any pi with a cycle crossing V to W is rejected.

Column fingerprints

For each label c, compute its column fingerprint col(c) -- the tuple of incidence values down the rows of M. Labels with identical fingerprints are candidates for symmetry equivalence. The fingerprint-to-label mapping is used by the sigma-loop to derive pi in O(1) per label via hash lookup.

Earlier versions had a standalone fast path that detected S_k whenever labels shared a fingerprint, without running the sigma-loop. This was incorrect for non-S_k groups (see the C3 bug note below) and has been removed. Fingerprints are now used only for pi derivation inside the sigma-loop -- they are not a standalone detection mechanism.

Sigma loop: derive pi from generators

The sigma-loop iterates over generators of the row-permutation group on M, drawn from three sources:

  • Source A -- per-operand internal symmetry generators. For each operand that carries a declared SymmetryGroup, each generator of that group is lifted to a row permutation on M (permuting only the rows belonging to that operand). This captures symmetry that was previously handled by orbit-based axis merging.

  • Source B -- identical-operand swap generators. For each group of k identical operands (same Python object), the k-1 adjacent transpositions (op_i, op_{i+1}) are used as generators. Each such swap is lifted to a row permutation that exchanges the rows of the two operands.

  • Source C -- coordinated axis relabeling. When identical operands share the same subscript pattern (e.g. both have subscript ij), permuting axes uniformly across all copies is equivalent to relabeling dummy indices. Adjacent transpositions on W-only (summed) axes are generated, applied to every copy simultaneously. This is restricted to W-side labels because relabeling free (output) labels would change the output tensor.

All generators are collected and passed to Dimino's algorithm to build the full row-permutation group. The sigma-loop then iterates over all elements of this group (not just the generators), deriving pi for each.

For each group element sigma:

  1. Lift sigma to a row permutation on M.
  2. Compute sigma(M)'s column fingerprints: sigma * col(c) for each label c.
  3. Derive the induced label permutation pi directly. For each label l, pi(l) is the label whose M-column matches sigma(M)'s column for l -- a hash-table lookup in O(1). When multiple labels share a fingerprint (collision), pick the lex-first unused candidate. If any label has no match, reject this sigma.
  4. Validate pi: pi(V) is a subset of V and pi(W) is a subset of W. Any cycle crossing V to W invalidates the sigma.
  5. Collect pi as a generator literal restricted to V labels (and separately to W labels). Non-identity generators become part of the detected SymmetryGroup.

The sigma-loop collects all non-identity pi restrictions as generator literals. These generators are passed to Dimino's algorithm to close the group and build the exact symmetry group on V (and separately on W).

Interactive explorer

Walk through each detection step interactively with the Symmetry Explorer — choose a preset example or define your own einsum expression.

Worked examples

Click to expand each example.

V-side and W-side

V-side groups are symmetries of the output tensor -- they reduce the number of unique output elements that need to be computed. W-side groups are symmetries among the contracted (summed) labels -- they reduce the number of unique summation terms. Both contribute multiplicatively to the cost reduction:

cost = dense_cost * (unique_output / total_output) * (unique_inner / total_inner)

The output (V-side) reduction is always applied when the step's intermediate has a non-trivial permutation group on its free indices.

The inner (W-side) reduction is applied only when all labels in the detected inner group are present as contracted indices in that specific pairwise step. If any of those labels were contracted at an earlier step and no longer appear, the inner reduction is skipped. In the contraction path table, [W checkmark: ...] indicates the inner reduction was applied, while [W: ...] indicates it was detected but not applied. Inner symmetry can be toggled with flops.configure(use_inner_symmetry=False).

Exact group detection and Burnside counting

The sigma-loop collects all valid pi permutations as generator literals and builds a SymmetryGroup directly.

When the generated group equals S_k (the full symmetric group, checked via order == k!), the existing stars-and-bars formula C(n+k-1, k) applies. When the group is a proper subgroup (e.g., C_3 from einsum('ij,jk,ki->', A, A, A)), Burnside's lemma gives the exact unique element count.

Complexity bound

The oracle evaluates each subset at most once. For a contract with N operands and groups of sizes k_1, k_2, ...:

  • Generator collection: Source A contributes O(rank) generators per operand with declared symmetry. Source B contributes k-1 generators per identical group. Source C contributes at most rank-1 generators per identical group with matching subscripts. Total generators: g = O(N * rank).
  • Group enumeration: Dimino's algorithm builds the full row-permutation group from the generators in O(|G| * g) compositions.
  • Pi derivation: For each of the |G| group elements, deriving pi costs O(n_labels) via fingerprint hash lookup.
  • Per-subset total: O(|G| * (g + n_labels)).
  • Number of subsets visited: at most 2^N (usually much less -- path algorithms visit only O(N^2) subsets in practice).

For the common case of a single pair of identical operands (|G| = 2, g = 1): per-subset cost is O(n_labels).

On this page