============================================================
Getting Started
============================================================

--- getting-started/installation ---
URL: https://aicrowd.github.io/flopscope/docs/getting-started/installation

Installation

Use this page when setting up flopscope for the first time.

You will learn:
- How to install flopscope as a dependency or for development
- How to verify your installation works
- How to fix common installation pitfalls

Install as a dependency

    uv add git+https://github.com/AIcrowd/flopscope.git

Install for development

    git clone https://github.com/AIcrowd/flopscope.git
    cd flopscope
    uv sync --all-extras

Verify installation

    uv run python -c "import flopscope as flops; print(flops.__version__)"

What you'll see

    0.2.0+np2.2.6

The version string includes the installed NumPy version suffix. If you see a version number, flopscope is installed correctly.

Common pitfalls

Symptom: ImportError: numpy version mismatch
Fix: Flopscope supports NumPy >=2.0.0,<2.3.0 (default install uses NumPy 2.2). Using uv handles this automatically. If you installed manually, check your NumPy version:

    uv run python -c "import numpy; print(numpy.__version__)"

Related pages
- Quickstart: run your first FLOP-counted computation

--- getting-started/quickstart ---
URL: https://aicrowd.github.io/flopscope/docs/getting-started/quickstart

Quickstart

Run your first FLOP-counted computation in under 2 minutes.

You will learn:
- How to run a FLOP-counted computation with the global default budget
- How to read the budget summary output

Prerequisites
- Installation

Quickest possible start

You do not need to set up a budget context to start counting FLOPs. flopscope activates a global default context the first time any counted operation runs. The default budget is 1e15 FLOPs (configurable via the FLOPSCOPE_DEFAULT_BUDGET environment variable).
Save this as first_budget.py: import flopscope as flops import flopscope.numpy as fnp depth = 10 # number of layers width = 256 # hidden dimension # No BudgetContext needed — the global default activates automatically scale = fnp.sqrt(2.0 / width) # Kaiming init scale: counted through flopscope weights = [ fnp.array(fnp.multiply(fnp.random.randn(width, width), scale)) for _ in range(depth) ] x = fnp.random.randn(width) h = x for W in weights: h = fnp.einsum('ij,j->i', W, h) # matrix-vector multiply h = fnp.maximum(h, 0) # ReLU activation: counted result = fnp.sum(h) # reduction: counted # Print the default flat summary flops.budget_summary() Run it: uv run python first_budget.py What you'll see flopscope FLOP Budget Summary ========================= Total budget: 1,000,000,000,000,000 Used: 2,624,513 (0.0%) Remaining: 999,999,997,375,487 (100.0%) By operation: random.randn 655,616 ( 25.0%) [11 calls] multiply 655,360 ( 25.0%) [10 calls] array 655,360 ( 25.0%) [10 calls] einsum 655,360 ( 25.0%) [10 calls] maximum 2,560 ( 0.1%) [10 calls] sum 256 ( 0.0%) [1 call] sqrt 1 ( 0.0%) [1 call] By operation (time): random.randn ...s ( ...%) [11 calls] multiply ...s ( ...%) [10 calls] einsum ...s ( ...%) [10 calls] array ...s ( ...%) [10 calls] maximum ...s ( ...%) [10 calls] sum ...s ( ...%) [1 call] sqrt ...s ( ...%) [1 call] Reading the output: Top rows: Used is the total FLOP count spent so far across the current session, and Remaining shows the implicit global headroom Flat default: the summary stays flat unless you explicitly ask for flops.budget_summary(by_namespace=True) Operations table: the 10-layer MLP spreads FLOPs roughly equally across random.randn, multiply, array, and einsum (~25% each); activations (maximum) are comparatively cheap, and the Kaiming scale sqrt adds just 1 FLOP If you want namespace attribution, opt in separately: with flops.BudgetContext(flop_budget=10**6, namespace="train") as budget: with fnp.namespace("precompute"): ... 
    print(budget.summary(by_namespace=True))

Next steps

Ready for budget limits? See the Competition Guide.

--- getting-started/competition ---
URL: https://aicrowd.github.io/flopscope/docs/getting-started/competition

Competition Guide

Everything you need to compete within a FLOP budget.

You will learn:
- How to set budget limits with BudgetContext
- How to use the @flops.budget decorator form
- How wall-time limits work via wall_time_limit_s
- How to read budget summaries
- Common competition pitfalls and tips

Setting a FLOP budget

Every competition submission runs inside a FLOP budget. Use BudgetContext to declare how many FLOPs your code is allowed to spend:

    import flopscope as flops
    import flopscope.numpy as fnp

    with flops.BudgetContext(flop_budget=50_000_000, namespace="solver") as budget:
        A = fnp.ones((256, 256))
        x = fnp.ones((256,))
        h = fnp.einsum('ij,j->i', A, x)
        h = fnp.exp(h)
        result = fnp.sum(h)

If your code exceeds the budget, Flopscope raises BudgetExhaustedError before the offending operation executes. The error message includes the cost of the failed operation and the remaining budget.

The namespace parameter sets the root namespace prefix for that budget context. Nested fnp.namespace(...) scopes extend it with dotted segments, but they do not create child budgets or split the FLOP limit into separate pools.

Decorator form

For cleaner code, use @flops.budget to attach a budget directly to a function:

    import flopscope as flops
    import flopscope.numpy as fnp

    @flops.budget(flop_budget=50_000_000, namespace="forward-pass")
    def forward(W, x):
        h = fnp.einsum('ij,j->i', W, x)  # matrix-vector multiply
        h = fnp.maximum(h, 0)            # ReLU activation
        return fnp.sum(h)

    # Example inputs, so the call below is runnable
    W = fnp.ones((256, 256))
    x = fnp.ones((256,))

    result = forward(W, x)
    flops.budget_summary()

Each call to the decorated function runs inside the same BudgetContext; the namespace is only the root prefix used for attribution, not a separate budget pool. Repeated calls reuse that context and keep accumulating on the same budget and operation log.
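To make the accumulate-across-calls behavior concrete, here is a toy model in plain Python. It is not flopscope's implementation: `ToyBudget`, `ToyBudgetExhausted`, and `toy_budget` are hypothetical names invented for this sketch. It assumes only what the text above states, namely that the cost is checked before the operation runs and that every call to the decorated function charges the same shared pool.

```python
import functools


class ToyBudgetExhausted(RuntimeError):
    """Hypothetical stand-in for flopscope's BudgetExhaustedError."""


class ToyBudget:
    """Toy model of one shared FLOP pool (not flopscope's implementation)."""

    def __init__(self, flop_budget):
        self.flop_budget = flop_budget
        self.flops_used = 0

    def charge(self, cost):
        # Check before "executing": refuse if the cost does not fit.
        remaining = self.flop_budget - self.flops_used
        if cost > remaining:
            raise ToyBudgetExhausted(
                f"operation costs {cost:,} FLOPs, only {remaining:,} remaining"
            )
        self.flops_used += cost


def toy_budget(flop_budget):
    """Decorator factory: every call of the wrapped function shares one ToyBudget."""
    shared = ToyBudget(flop_budget)

    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            return fn(shared, *args, **kwargs)

        wrapper.budget = shared
        return wrapper

    return decorate


@toy_budget(flop_budget=150_000)
def forward(budget, m, n):
    budget.charge(m * n)  # matrix-vector product: m * n
    budget.charge(m)      # ReLU over m outputs
    budget.charge(m)      # sum over m outputs
    return budget.flops_used


first = forward(256, 256)   # 65,536 + 256 + 256 = 66,048
second = forward(256, 256)  # same pool keeps accumulating: 132,096
try:
    forward(256, 256)       # the third matrix-vector product no longer fits
    exhausted = False
except ToyBudgetExhausted:
    exhausted = True

print(first, second, exhausted)
```

The third call fails on its first charge, so `flops_used` stays at 132,096: nothing is deducted for an operation that was refused.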
Wall-time limits In addition to FLOP budgets, competitions may enforce a wall-clock time limit via wall_time_limit_s. This prevents solutions from stalling on operations that are analytically cheap but slow in practice: with flops.BudgetContext(flop_budget=10**9, wall_time_limit_s=60.0) as budget: # Must finish within 60 seconds AND within 1 billion FLOPs ... If the wall-clock time is exceeded, Flopscope raises TimeExhaustedError. The timer starts when the context is entered and is checked after each counted operation. What wall_time_limit_s does and does not do: It is a BudgetContext setting, so you configure it in the same place you set flop_budget. It measures total wall-clock time for the active context, not FLOPs. It is checked cooperatively before and after counted NumPy calls, so overshoot is bounded by the duration of one NumPy call. It is a clean diagnostic limit inside flopscope. Hard process/container kills still belong to the outer execution environment. Reading the budget summary Call budget.summary() when you want the current context's summary, or flops.budget_summary() for the accumulated session/global view. Both stay flat by default; use by_namespace=True only when you want a namespace breakdown: print(budget.summary()) # context summary, flat by default print(budget.summary(by_namespace=True)) # context summary with namespaces flops.budget_summary() # session/global summary flops.budget_summary(by_namespace=True) # session/global summary with namespaces Use these forms for different questions: budget.summary() answers "what did this one explicit context spend?" flops.budget_summary() answers "what has this process/session spent overall?" budget.summary_dict(...) and flops.budget_summary_dict(...) return the same information as structured data instead of formatted text. 
The block below shows print(budget.summary(by_namespace=True)) for the solver context: flopscope FLOP Budget Summary [solver] ================================== Total budget: 50,000,000 Used: 66,048 (0.1%) Remaining: 49,933,952 (99.9%) By namespace: solver 66,048 (100.0%) [3 calls] Backend 0.000s Overhead 0.000s By operation: einsum 65,536 ( 99.2%) [1 call] exp 256 ( 0.4%) [1 call] sum 256 ( 0.4%) [1 call] Total Wall Time: ...s Flopscope Backend: ...s ( ...%) Flopscope Overhead: ...s ( ...%) Residual Wall Time: ...s ( ...%) By operation (time): einsum ...s ( ...%) [1 call] sum ...s ( ...%) [1 call] exp ...s ( ...%) [1 call] Key things to look for: Budget / Used / Remaining: the top rows show the explicit competition budget, current spend, and remaining headroom By namespace: solver is the root prefix, and nested scopes show up as dotted paths like solver.precompute. Use budget.summary(by_namespace=True) for the current context or flops.budget_summary(by_namespace=True) for the accumulated session/global view By operation: this toy pass is dominated by the single einsum; exp and sum are tiny by comparison Wall / backend / flopscope overhead / residual time: wall time is total elapsed time for the context. Flopscope backend time is spent inside the underlying NumPy / BLAS / LAPACK calls being counted. Flopscope overhead is spent in flopscope's own dispatch code (wrapper preambles, FLOP cost computation, view-casts, post-call wrapping, maybe_check_nan_inf when opted in). Residual wall time is the measured remainder outside backend calls and flopscope overhead (user Python between ops, sleeps, GC pauses). 
The decomposition is exact: wall_time_s = flopscope_backend_time_s + flopscope_overhead_time_s + residual_wall_time_s Flat default: the default summary stays flat unless you opt into by_namespace=True For programmatic access, use flops.budget_summary_dict(): data = flops.budget_summary_dict() print(f"Used: {data['flops_used']:,} / {data['flop_budget']:,}") # Per-namespace breakdown: data = flops.budget_summary_dict(by_namespace=True) print(data["by_namespace"]["solver"]["flops_used"]) Quick tips for competition Check costs before committing budget. Use cost query functions to estimate before executing: cost = flops.einsum_cost('ij,jk->ik', shapes=[(256, 256), (256, 256)]) print(f"This matmul will cost {cost:,} FLOPs") # 16,777,216 Use namespaces for phases. Split your solution into named phases (e.g., "init", "solve", "refine") so the budget summary shows exactly where FLOPs are spent. Exploit symmetry for savings. If your tensors are symmetric, wrapping them with flops.as_symmetric() can halve pointwise costs and significantly reduce einsum costs. See Symmetry Savings for details. Prefer cheaper operations. A matrix-vector product via fnp.einsum('ij,j->i', A, x) costs m*n FLOPs, while a full matrix-matrix multiply costs m*n*k. Avoid computing more than you need. Watch out for hidden costs. Operations like fnp.array() and fnp.concatenate() are not free -- they charge numel(output) FLOPs, and fnp.where() charges numel(condition). Check the cost of any operation you are unsure about. 
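The "avoid computing more than you need" tip can be checked with arithmetic alone. The sketch below uses only the cost rules quoted above (m*n for a matrix-vector product, m*n*k for a matrix-matrix product); `matvec_cost` and `matmul_cost` are hypothetical helpers written for this example, not flopscope functions. It compares two orders of evaluating A·B·x for 256-sized operands against the 50,000,000 FLOP budget used earlier.

```python
def matvec_cost(m, n):
    """FLOPs for 'ij,j->i' under the rule quoted above: m * n."""
    return m * n


def matmul_cost(m, n, k):
    """FLOPs for a full matrix-matrix multiply: m * n * k."""
    return m * n * k


n = 256
budget = 50_000_000

# (A @ B) @ x: one full matmul, then one matrix-vector product
matrix_first = matmul_cost(n, n, n) + matvec_cost(n, n)

# A @ (B @ x): two matrix-vector products
vector_first = matvec_cost(n, n) + matvec_cost(n, n)

print(f"(A @ B) @ x: {matrix_first:,} FLOPs")            # 16,842,752
print(f"A @ (B @ x): {vector_first:,} FLOPs")            # 131,072
print(f"ratio: {matrix_first / vector_first:.1f}x")      # 128.5x
print(f"evaluations within budget: {budget // matrix_first} vs {budget // vector_first}")
```

In practice you would confirm the real numbers with flops.einsum_cost() before spending budget; the point here is only that the association order changes the bill by two orders of magnitude.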
When things go wrong

If you hit BudgetExhaustedError, see Budget Planning & Debugging for a systematic approach to diagnosing overruns and reducing costs.

============================================================
Guides
============================================================

--- guides/migrate-from-numpy ---
URL: https://aicrowd.github.io/flopscope/docs/guides/migrate-from-numpy

Migrate from NumPy

Use this page when converting existing NumPy code to flopscope.

You will learn:
- How to convert NumPy imports and operations to flopscope equivalents
- Which NumPy behaviors stay the same and which change
- How to avoid common pitfalls when migrating

Prerequisites
- Installation
- Quickstart

The basics

Change your import and, if you need explicit limits, wrap the computation in a BudgetContext:

Before (NumPy):

    import numpy as np

    W = np.random.randn(256, 256)
    x = np.random.randn(256)
    h = np.dot(W, x)
    h = np.maximum(h, 0)

After (flopscope), simplest form:

    import flopscope as flops
    import flopscope.numpy as fnp

    # No setup needed: the global default budget tracks FLOPs automatically
    W = fnp.random.randn(256, 256)
    x = fnp.random.randn(256)
    h = fnp.dot(W, x)
    h = fnp.maximum(h, 0)

    flops.budget_summary()  # see what you spent

After (flopscope), with explicit budget control:

    import flopscope as flops
    import flopscope.numpy as fnp

    with flops.BudgetContext(flop_budget=20_000_000) as budget:
        W = fnp.random.randn(256, 256)
        x = fnp.random.randn(256)
        h = fnp.dot(W, x)
        h = fnp.maximum(h, 0)

What stays the same
- Function signatures match NumPy for supported operations
- Broadcasting rules are identical
- Array indexing, slicing, and assignment work normally

What changes

NumPy | flopscope | Notes
import numpy as np | import flopscope.numpy as fnp | Use import flopscope as flops alongside it for budgets, symmetry, and accounting helpers
Call ops anywhere | Works anywhere too | A global default budget auto-activates; use explicit BudgetContext for limits and namespacing
np.linalg.svd(A) | fnp.linalg.svd(A, k=10) | Truncated SVD with explicit k
Plain ndarray only | SymmetricTensor available | Wrap with flops.as_symmetric() for cost savings
All NumPy ops available | Most available, 32 blacklisted | I/O and config ops raise AttributeError
No cost tracking | Automatic FLOP counting | Every counted op deducts from budget

Common pitfalls

Symptom: AttributeError when calling an I/O or config function (e.g., fnp.save, fnp.seterr)
Fix: 32 operations are blacklisted because they are I/O, configuration, or datetime functions with no FLOP cost. See Operation Categories for the full list. Use numpy directly for these.

Symptom: Using np.linalg.svd instead of fnp.linalg.svd
Fix: If you import NumPy alongside flopscope, make sure to use fnp. for counted operations. Operations called through np. bypass FLOP counting entirely.

Related pages
- Operation Categories: what's supported and what isn't
- API Reference: full list of all operations

--- guides/einsum ---
URL: https://aicrowd.github.io/flopscope/docs/guides/einsum

Einsum Patterns

Use this page to understand fnp.einsum, the core computation primitive in flopscope.
You will learn: How to write common einsum patterns and understand their FLOP costs How to use symmetric tensors with einsum for cost savings How to inspect and customize contraction paths How to leverage path caching for repeated operations Prerequisites Quickstart Common patterns import flopscope as flops import flopscope.numpy as fnp with flops.BudgetContext(flop_budget=10**8) as budget: A = fnp.ones((256, 256)) B = fnp.ones((256, 256)) x = fnp.ones((256,)) # Matrix-vector multiply: cost = m × k y = fnp.einsum('ij,j->i', A, x) # 256 × 256 = 65,536 FLOPs # Matrix multiply: cost = m × k × n C = fnp.einsum('ij,jk->ik', A, B) # 256 × 256 × 256 = 16,777,216 FLOPs # Outer product: cost = i × j outer = fnp.einsum('i,j->ij', x, x) # 256 × 256 = 65,536 FLOPs # Trace: cost = i tr = fnp.einsum('ii->', A) # 256 FLOPs # Batched matmul: cost = b × m × k × n batch = fnp.ones((4, 256, 256)) out = fnp.einsum('bij,bjk->bik', batch, batch) # 4 × 256 × 256 × 256 FLOPs print(budget.summary()) Cost formula The cost of an einsum is the sum of per-step costs along the optimal contraction path. Every einsum — even a simple two-operand one — goes through the opt_einsum path optimizer (a symmetry-aware fork of opt_einsum). For each pairwise step: cost = product of all index dimensions Each FMA (fused multiply-add) counts as 1 operation, so the cost is simply the product of all index dimensions with no factor-of-2. For 'ij,jk->ik' with shapes (256, 256) and (256, 256): Indices: i=256, j=256, k=256 Cost: 256 x 256 x 256 = 16,777,216 For multi-operand einsums (3+ tensors), Flopscope automatically decomposes the contraction into optimal pairwise steps. The total cost is the sum of per-step costs. When symmetric tensors are involved, each step's cost is further reduced by the ratio of unique output elements to total output elements. See Symmetry Savings for the full practical guide. 
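The per-step rule above ("product of all index dimensions", no factor of 2) is easy to reproduce by hand. The following is a minimal sketch of that rule for the dense case only; `dense_step_cost` is a hypothetical helper written for illustration, not part of flopscope, and it ignores symmetry reductions, ellipsis subscripts, and path optimization.

```python
from math import prod


def dense_step_cost(subscripts, shapes):
    """Product of all distinct index sizes in one einsum step (dense, no symmetry)."""
    inputs, _, _output = subscripts.partition("->")
    sizes = {}
    for term, shape in zip(inputs.split(","), shapes):
        if len(term) != len(shape):
            raise ValueError(f"term {term!r} does not match shape {shape}")
        for label, dim in zip(term, shape):
            # Every occurrence of an index label must agree on its size.
            if sizes.setdefault(label, dim) != dim:
                raise ValueError(f"index {label!r} has conflicting sizes")
    return prod(sizes.values())


# The common patterns from this page
matvec = dense_step_cost("ij,j->i", [(256, 256), (256,)])
matmul = dense_step_cost("ij,jk->ik", [(256, 256), (256, 256)])
outer = dense_step_cost("i,j->ij", [(256,), (256,)])
trace = dense_step_cost("ii->", [(256, 256)])
batched = dense_step_cost("bij,bjk->bik", [(4, 256, 256), (4, 256, 256)])

# A multi-operand einsum costs the sum of its pairwise steps.
# Example: a three-step path with every index of size 10.
steps = [
    ("ai,ijk->ajk", [(10, 10), (10, 10, 10)]),
    ("ajk,bj->akb", [(10, 10, 10), (10, 10)]),
    ("akb,ck->abc", [(10, 10, 10), (10, 10)]),
]
dense_total = sum(dense_step_cost(s, shp) for s, shp in steps)

print(matvec, matmul, outer, trace, batched, dense_total)
```

For real planning, prefer flops.einsum_cost(), which the page describes as the single source of truth for what einsum() deducts.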
fnp.dot and fnp.matmul fnp.dot(A, B) and fnp.matmul(A, B) are equivalent to the corresponding einsum and have the same FLOP cost. Symmetric tensors There are two separate symmetry declarations — one for inputs, one for outputs: Input symmetry — wrap with flops.as_symmetric() before passing to einsum. The optimizer automatically uses symmetry to choose the best contraction order and charges reduced costs: with flops.BudgetContext(flop_budget=10**8) as budget: S = flops.as_symmetric(fnp.eye(10), symmetry=flops.SymmetryGroup.symmetric(axes=(0, 1))) # 55 unique elements v = fnp.ones((10,)) result = fnp.einsum('ij,j->i', S, v) # cost reduced by input symmetry Output symmetry — pass symmetry= to einsum() to declare that the result is symmetric. This wraps the output as a SymmetricTensor so downstream operations benefit from reduced costs. It does NOT affect the cost of this einsum — it's a declaration about the result's structure: with flops.BudgetContext(flop_budget=10**8) as budget: X = fnp.random.randn(100, 10) # X^T X is always symmetric — declare the exact output group C = fnp.einsum('ki,kj->ij', X, X, symmetry=flops.SymmetryGroup.symmetric(axes=(0, 1))) print(type(C)) # # C can now be passed to other operations with automatic cost savings For the full symmetry guide, see Symmetry Savings. Inspecting costs fnp.einsum_path() previews the contraction plan without executing the contraction itself. Planning is cheap: it records a nominal 1-FLOP einsum_path event, but none of the contraction FLOPs are spent. 
    import flopscope as flops
    import flopscope.numpy as fnp

    n = 10
    T = flops.as_symmetric(
        fnp.ones((n, n, n)),
        symmetry=flops.SymmetryGroup.symmetric(axes=(0, 1, 2)),
    )
    A = fnp.random.randn(n, n)
    B = fnp.random.randn(n, n)
    C = fnp.random.randn(n, n)

    path, info = fnp.einsum_path('ijk,ai,bj,ck->abc', T, A, B, C)
    print(f"Path: {path}")
    print(info)
    print(f"Naive cost: {info.naive_cost:,}")
    print(f"Optimized cost: {info.optimized_cost:,}")
    print(f"Speedup: {info.speedup:.1f}x")
    print(f"Optimizer used: {info.optimizer_used}")

Output:

    Path: [(0, 1), (0, 2), (0, 1)]
    Complete contraction: ijk,ai,bj,ck->abc
    Naive cost (flopscope): 3,000,000
    Optimized cost (flopscope): 25,500
    Speedup: 117.647x
    Largest intermediate: 1,000 elements
    Index sizes: a=b=c=i=j=k=10
    Optimizer: optimal
    ---------------------------------------------------------------------------------------------------------
    step  contract  subscript     flops   dense_flops  savings  blas  unique/total  symmetry (inputs → output)
    ---------------------------------------------------------------------------------------------------------
    0     (0, 1)    ai,ijk->ajk   5,500   10,000       45.0%    SYMM  V:550/1,000   - × S3{i,j,k} → S2{j,k}
    1     (0, 2)    ajk,bj->akb   10,000  10,000       0.0%     TDOT  -             S2{j,k} × - → -
    2     (0, 1)    akb,ck->abc   10,000  10,000       0.0%     TDOT  -             -
    Naive cost: 3,000,000
    Optimized cost: 25,500
    Speedup: 117.6x
    Optimizer used: optimal

The printed table gives you the contraction order, naive-vs-optimized FLOP counts, largest intermediate, grouped index sizes, and one row per pairwise contraction step. Each step shows the chosen contract tuple, dense baseline, savings, BLAS tag, and any symmetry that survived into the intermediate.

For per-step debugging, call print(info.format_table(verbose=True)).
The verbose view adds indented rows with the merged operand subset, the intermediate output shape, and the running cumulative cost. flops.einsum_cost() returns the same cost that einsum() would deduct — one source of truth: cost = flops.einsum_cost('ij,jk->ik', shapes=[(256, 256), (256, 256)]) print(f"Matmul cost: {cost:,}") # 16,777,216 Custom contraction paths By default Flopscope finds the optimal contraction order automatically. You can override this by passing an explicit path — a list of int-tuples specifying which operand positions to contract at each step: import flopscope as flops import flopscope.numpy as fnp A = fnp.ones((3, 4)) B = fnp.ones((4, 5)) C = fnp.ones((5, 6)) # Plan first, execute later path, info = fnp.einsum_path('ij,jk,kl->il', A, B, C) print(f"Optimal path: {path}") # e.g. [(0, 1), (0, 1)] # Execute with the planned path with flops.BudgetContext(flop_budget=10**8) as budget: result = fnp.einsum('ij,jk,kl->il', A, B, C, optimize=path) You can also specify a completely custom path. Each tuple names the positions (in the current operand list) to contract; the result is appended to the end: # Force B×C first (positions 1,2), then A×result (positions 0,1) result = fnp.einsum('ij,jk,kl->il', A, B, C, optimize=[(1, 2), (0, 1)]) # Force A×B first (positions 0,1), then result×C (positions 0,1) result = fnp.einsum('ij,jk,kl->il', A, B, C, optimize=[(0, 1), (0, 1)]) Different paths may have different FLOP costs. Use fnp.einsum_path() to compare — it returns the plan without executing the contraction. Path caching Contraction paths are cached automatically in a module-level LRU cache. When you call fnp.einsum() with the same subscripts, shapes, optimizer, and symmetry structure, the path is reused from cache instead of being recomputed. 
This makes repeated einsums in loops essentially free in path-finding overhead:

    with flops.BudgetContext(flop_budget=10**9) as budget:
        for i in range(1000):
            y = fnp.einsum('ij,j->i', A, x)  # path computed once, reused 999 times

fnp.einsum_path() shares the same cache, so planning a path warms the cache for subsequent fnp.einsum() calls and vice versa.

Cache management

    # Inspect cache statistics
    info = fnp.einsum_cache_info()
    print(f"Hits: {info.hits}, Misses: {info.misses}, Size: {info.currsize}/{info.maxsize}")

    # Clear the cache (e.g., to free memory or force recomputation)
    fnp.clear_einsum_cache()

    # Change the cache size (default 4096 entries, rebuilds the cache)
    flops.configure(einsum_path_cache_size=8192)

Common pitfalls

Symptom: Unexpectedly high FLOP cost
Fix: Check all index dimensions. A subscript like 'ijkl,jklm->im' multiplies all five dimension sizes together. Use flops.einsum_cost() or fnp.einsum_path() to preview costs before executing.

Related pages
- Symmetry Savings: full guide to symmetry mechanisms
- API Reference: algorithms, symmetry support, and operation details
- Plan Your Budget: query costs before executing
- FLOP Counting Model: how costs are computed

--- guides/symmetry ---
URL: https://aicrowd.github.io/flopscope/docs/guides/symmetry

Symmetry Savings

Reduce FLOP costs when your tensors have symmetry.

You will learn:
- How to declare full and non-full tensor symmetries with flops.as_symmetric()
- How to generate example tensors for arbitrary permutation groups
- When to use fnp.random.symmetric() (sample + project) versus flops.symmetrize() (project existing data)
- How slicing, reductions, and binary pointwise ops preserve, weaken, or drop symmetry metadata
- When to re-tag results with flops.as_symmetric() after conservative propagation

Why symmetry matters

Many tensors contain repeated structure.
A symmetric matrix has only n * (n + 1) / 2 unique elements instead of n^2, and higher-order tensors with permutation symmetry can shrink the effective element count even more. When Flopscope knows a tensor's symmetry, it charges FLOPs based on unique elements instead of dense ones.

Operation | Dense cost | Symmetry-aware cost | Why it drops
fnp.exp(s2_matrix) | n^2 | n * (n + 1) / 2 | only unique matrix entries matter
fnp.einsum('ki,kj->ij', x, x, symmetry=flops.SymmetryGroup.symmetric(axes=(0, 1))) | m * n^2 | m * n * (n + 1) / 2 | the repeated x operand lets flopscope detect symmetric output and reduce the cost
fnp.einsum('i,j,k->ijk', v, v, v) | n^3 | symmetry-reduced | repeated operands induce output symmetry

By contrast, declaring symmetry= on an einsum output tags the result for downstream operations; it does not reduce that einsum's own cost by itself.

Quick start

    import flopscope as flops
    import flopscope.numpy as fnp

    with flops.BudgetContext(flop_budget=10**6) as budget:
        s2_matrix = flops.as_symmetric(
            fnp.array([[2.0, 1.0], [1.0, 3.0]]),
            symmetric_axes=(0, 1),
        )
        exp_s2_matrix = fnp.exp(s2_matrix)
        sliced_row = s2_matrix[0]

        print(type(exp_s2_matrix).__name__)  # SymmetricTensor
        print(type(sliced_row).__name__)     # FlopscopeArray
        print(budget.flops_used)

flops.as_symmetric() validates the data first. After that, Flopscope propagates symmetry metadata algebraically through many operations. Unary pointwise ops preserve symmetry-aware costs and keep the same exact group, including non-full groups such as C_k or D_k. Slicing, reductions, and binary pointwise ops can weaken it or remove it entirely.

How to declare symmetry

Full symmetry with SymmetryGroup.symmetric

Use SymmetryGroup.symmetric when the tensor is invariant under every permutation of a set of axes.
import flopscope as flops import flopscope.numpy as fnp matrix_data = fnp.array([[2.0, 1.0], [1.0, 3.0]]) s2_matrix = flops.as_symmetric(matrix_data, symmetric_axes=(0, 1)) This is the most common declaration: full S_2 symmetry on matrix axes (0, 1). Multiple independent full symmetric groups block_tensor_data = fnp.ones((2, 2, 3, 3)) block_s2_tensor = flops.as_symmetric( block_tensor_data, symmetry=flops.SymmetryGroup.young(blocks=((0, 1), (2, 3))), ) This declares one full symmetric group on axes (0, 1) and another on (2, 3). Explicit full symmetric groups with SymmetryGroup.symmetric s3_group = flops.SymmetryGroup.symmetric(axes=(0, 1, 2)) s3_tensor = flops.as_symmetric( fnp.ones((4, 4, 4)), symmetry=s3_group, ) This explicit group is useful once you want to inspect or combine symmetry directly. Cyclic symmetry with SymmetryGroup.cyclic c3_group = flops.SymmetryGroup.cyclic(axes=(0, 1, 2)) c3_tensor = flops.as_symmetric( fnp.ones((4, 4, 4)), symmetry=c3_group, ) C_3 means rotations are allowed, but reflections are not. This is weaker than S_3, so it usually gives fewer savings. Dihedral symmetry with SymmetryGroup.dihedral d4_group = flops.SymmetryGroup.dihedral(axes=(0, 1, 2, 3)) d4_tensor = flops.as_symmetric( fnp.ones((4, 4, 4, 4)), symmetry=d4_group, ) D_4 includes both rotations and reflections of a four-position structure. Arbitrary subgroups from custom generators opposite_pair_swap_group = flops.SymmetryGroup.from_generators( [[2, 3, 0, 1]], axes=(0, 1, 2, 3), ) opposite_pair_swap_tensor = flops.as_symmetric( fnp.ones((4, 4, 4, 4)), symmetry=opposite_pair_swap_group, ) Use this form when the built-in constructors do not describe your symmetry. 
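To see what these declarations buy you, it helps to count unique elements directly. The sketch below is plain Python, independent of flopscope: `generate_group` and `count_unique_elements` are hypothetical helpers, and the sketch assumes that a generator such as [2, 3, 0, 1] is a permutation of axis positions written in array form. It counts orbits of index tuples under the group, which is the "unique elements" notion used on this page; the exact FLOP charge flopscope applies may include other terms.

```python
from itertools import product


def generate_group(generators, rank):
    """Close a set of axis permutations (array form) under composition."""
    identity = tuple(range(rank))
    group = {identity}
    frontier = [identity]
    while frontier:
        current = frontier.pop()
        for gen in generators:
            composed = tuple(current[g] for g in gen)
            if composed not in group:
                group.add(composed)
                frontier.append(composed)
    return group


def count_unique_elements(shape, generators):
    """Count orbits of index tuples under the group acting on the axes."""
    group = generate_group(generators, len(shape))
    for perm in group:
        if any(shape[a] != shape[p] for a, p in enumerate(perm)):
            raise ValueError("permuted axes must have equal sizes")
    seen = set()
    orbits = 0
    for index in product(*(range(n) for n in shape)):
        if index in seen:
            continue
        orbits += 1  # first visit to a new orbit
        for perm in group:
            seen.add(tuple(index[p] for p in perm))
    return orbits


s2 = count_unique_elements((10, 10), [(1, 0)])                    # n(n+1)/2 = 55
s3 = count_unique_elements((4, 4, 4), [(1, 0, 2), (0, 2, 1)])     # 20 of 64
c3 = count_unique_elements((4, 4, 4), [(1, 2, 0)])                # 24 of 64
pair_swap = count_unique_elements((4, 4, 4, 4), [(2, 3, 0, 1)])   # 136 of 256

print(s2, s3, c3, pair_swap)
```

The counts show why C_3 "usually gives fewer savings" than S_3: on a 4x4x4 tensor the cyclic group leaves 24 unique entries, while the full symmetric group leaves 20.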
Multiple explicit groups on one tensor

    row_swap_group = flops.SymmetryGroup.symmetric(axes=(0, 1))
    column_swap_group = flops.SymmetryGroup.symmetric(axes=(2, 3))

    two_group_tensor = flops.as_symmetric(
        fnp.ones((3, 3, 5, 5)),
        symmetry=flops.SymmetryGroup.direct_product(row_swap_group, column_swap_group),
    )

This uses a direct product of two exact symmetry factors, one on (0, 1) and one on (2, 3).

Generating example data with the Reynolds operator

When you want example tensors for arbitrary groups, prefer fnp.random.symmetric. It is a handy helper that samples from a distribution and applies the Reynolds operator. Use flops.symmetrize when you already have concrete data to symmetrize, and flops.as_symmetric when you already have concrete data to validate and tag.

    R_G(T) = (1 / |G|) * sum_{g in G} g · T

    S = fnp.random.symmetric((4, 4, 4), s3_group)
    T = fnp.random.symmetric((4, 4, 4), c3_group)
    U = fnp.random.symmetric((4, 4, 4, 4), d4_group)

This helper is ideal for docs, tests, and experiments:
- it works for S_k, C_k, D_k, and custom generator sets
- prefer fnp.random.symmetric() for synthetic data generation
- fnp.random.symmetric internally samples data and calls flops.symmetrize, so the projection and validation behavior is identical
- approximate costs (meaningful estimate), with n total elements and |G| group order:
    fnp.random.symmetric: C_dist(n) + |G| * n + n + validation
    flops.symmetrize: |G| * n + n + validation
- in exact arithmetic it projects onto the invariant subspace, and in practice flops.as_symmetric() validates the result with its usual validation tolerances
- it keeps examples consistent across symmetry classes

    s3_group = flops.SymmetryGroup.symmetric(axes=(0, 1, 2))
    c3_group = flops.SymmetryGroup.cyclic(axes=(0, 1, 2))
    d4_group = flops.SymmetryGroup.dihedral(axes=(0, 1, 2, 3))

    s3_tensor = fnp.random.symmetric((4, 4, 4), s3_group)
    c3_tensor = fnp.random.symmetric((4, 4, 4), c3_group)
    d4_tensor = fnp.random.symmetric((4, 4, 4, 4), d4_group)

The propagation examples below assume import flopscope as flops plus import flopscope.numpy as fnp; generated tensors above use fnp.random.symmetric(...) and the explicit transform helper is flops.symmetrize(...).

Symmetry propagation at a glance

Symmetry propagation is conservative. Flopscope keeps symmetry metadata only when the operation's structure guarantees that the output still respects the surviving group.

Operation | Result type | Rule
fnp.exp(s3_tensor) | SymmetricTensor | unary pointwise ops preserve the exact declared group
fnp.add(s3_tensor, c3_tensor) | SymmetricTensor or FlopscopeArray | binary pointwise ops keep the intersection of both operands' groups
s2_matrix * 3 | SymmetricTensor | scalar binary ops preserve the tensor's groups
s2_matrix[0] | FlopscopeArray | integer indexing removes one axis; no nontrivial group survives
s2_matrix[:3, :3] | SymmetricTensor | equal-size slices can preserve symmetry
s2_matrix[:3, :2] | FlopscopeArray | unequal-size slices break symmetry between those axes
s2_matrix[fnp.array([0, 1])] | FlopscopeArray | advanced indexing drops symmetry conservatively
fnp.sum(s3_tensor, axis=0) | SymmetricTensor | reductions keep the surviving setwise stabilizer subgroup
fnp.sum(d4_tensor, axis=(1, 3)) | SymmetricTensor | reductions can keep a proper subgroup like C_2
s2_matrix @ s2_matrix | FlopscopeArray | matrix products are not assumed symmetric in general

fnp.exp(c3_tensor) keeps the original C_3 subgroup exactly; the same applies to D_k and custom exact groups.
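The Reynolds operator R_G(T) behind fnp.random.symmetric and flops.symmetrize can be sketched without flopscope. The example below is an illustration under stated assumptions, not the library's code: tensors are plain dicts from index tuples to exact Fraction values, `reynolds_project` is a hypothetical helper, and the group C_3 is listed explicitly as axis permutations in array form.

```python
from fractions import Fraction
from itertools import product

# C_3 acting on three axes: identity plus the two rotations
C3 = [(0, 1, 2), (1, 2, 0), (2, 0, 1)]
shape = (3, 3, 3)


def reynolds_project(tensor, shape, group):
    """Average a tensor over every axis permutation in the group."""
    projected = {}
    for index in product(*(range(n) for n in shape)):
        total = sum(tensor[tuple(index[p] for p in perm)] for perm in group)
        projected[index] = total / len(group)
    return projected


# Deliberately asymmetric input data
raw = {
    (i, j, k): Fraction((i + 1) * (j + 2) ** 2 * (k + 3) ** 3)
    for i, j, k in product(range(3), repeat=3)
}

projected = reynolds_project(raw, shape, C3)

# The result is invariant under every rotation in C_3 ...
rotation_invariant = all(
    projected[idx] == projected[tuple(idx[p] for p in perm)]
    for idx in projected
    for perm in C3
)
# ... projecting again changes nothing (it is a projection) ...
idempotent = reynolds_project(projected, shape, C3) == projected
# ... but a reflection (swapping two axes) is not guaranteed to hold.
reflection_invariant = projected[(0, 1, 2)] == projected[(1, 0, 2)]

print(rotation_invariant, idempotent, reflection_invariant)
```

Exact fractions are used here so the invariance checks are equalities; with floating-point data you would compare within a tolerance, which is what the validation step in flops.as_symmetric() is described as doing.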
Slicing rules Slicing uses the pointwise stabilizer of the removed axes. Informally: every removed axis must stay fixed under any surviving group element. s2_matrix = flops.as_symmetric( fnp.ones((6, 6)), symmetry=flops.SymmetryGroup.symmetric(axes=(0, 1)), ) same_size_slice = s2_matrix[:3, :3] different_size_slice = s2_matrix[:3, :2] expanded_s2_matrix = s2_matrix[fnp.newaxis, :, :] advanced_index_slice = s2_matrix[fnp.array([0, 1])] What happens here: same_size_slice stays symmetric because both surviving axes still have the same size different_size_slice loses symmetry because the two axes no longer match expanded_s2_matrix keeps the same S_2 action, but the axes are renumbered from (0, 1) to (1, 2) advanced_index_slice returns a dense FlopscopeArray; flopscope does not attempt to propagate symmetry through array/list indexing Ellipsis behaves like NumPy's normal expansion rules. It can change which axes remain, but it does not change the propagation rule itself. The difference between full and non-full groups matters immediately: s3_group = flops.SymmetryGroup.symmetric(axes=(0, 1, 2)) c3_group = flops.SymmetryGroup.cyclic(axes=(0, 1, 2)) s3_tensor = fnp.random.symmetric((4, 4, 4), s3_group) c3_tensor = fnp.random.symmetric((4, 4, 4), c3_group) s3_slice = s3_tensor[:, :, 0] c3_slice = c3_tensor[:, :, 0] s3_slice keeps an S_2 subgroup on the surviving axes c3_slice loses all nontrivial symmetry, because C_3 has no non-identity element that fixes one point Reduction rules Reductions use the setwise stabilizer of the reduced axes. Informally: reduced axes are allowed to permute among themselves, because summation treats all positions along the reduced set equivalently. 
s3_group = flops.SymmetryGroup.symmetric(axes=(0, 1, 2)) c3_group = flops.SymmetryGroup.cyclic(axes=(0, 1, 2)) c4_group = flops.SymmetryGroup.cyclic(axes=(0, 1, 2, 3)) s3_tensor = fnp.random.symmetric((4, 4, 4), s3_group) c3_tensor = fnp.random.symmetric((4, 4, 4), c3_group) c4_tensor = fnp.random.symmetric((4, 4, 4, 4), c4_group) s3_reduced = fnp.sum(s3_tensor, axis=0) c3_reduced = fnp.sum(c3_tensor, axis=2) c4_reduced = fnp.sum(c4_tensor, axis=(1, 3)) c4_keepdims = fnp.sum(c4_tensor, axis=(1, 3), keepdims=True) What happens here: s3_reduced keeps an S_2 subgroup on the remaining axes c3_reduced loses all nontrivial symmetry c4_reduced keeps a C_2 subgroup c4_keepdims keeps the same surviving subgroup, but the output axes stay in their original tensor positions because keepdims=True Reducing an axis that is not in a symmetry group leaves that group alone, apart from any axis renumbering caused by the removed dimension. Binary pointwise ops and broadcasting Binary pointwise ops keep only the symmetry present in both operands. For general groups, that means element-set intersection on matching output axes, not just matching tuples of axis numbers. s3_group = flops.SymmetryGroup.symmetric(axes=(0, 1, 2)) c3_group = flops.SymmetryGroup.cyclic(axes=(0, 1, 2)) s3_tensor = fnp.random.symmetric((4, 4, 4), s3_group) c3_tensor = fnp.random.symmetric((4, 4, 4), c3_group) intersection_tensor = fnp.add(s3_tensor, c3_tensor) intersection_tensor keeps C_3, because C_3 is the common subgroup of S_3 and C_3. Multiple groups are handled independently: left_tensor = flops.as_symmetric( fnp.ones((3, 3, 5, 5)), symmetry=flops.SymmetryGroup.young(blocks=((0, 1), (2, 3))), ) right_tensor = flops.as_symmetric( fnp.ones((3, 3, 5, 5)), symmetry=flops.SymmetryGroup.symmetric(axes=(0, 1)), ) shared_group_tensor = fnp.add(left_tensor, right_tensor) shared_group_tensor keeps only the swap on (0, 1). 
In the exact-group representation that surviving action is still embedded in the full output rank, so the group's support tuple spans (0, 1, 2, 3) even though it acts nontrivially only on (0, 1). Broadcasting matters too: stretched_s2_tensor = flops.as_symmetric( fnp.ones((1, 1, 4)), symmetry=flops.SymmetryGroup.symmetric(axes=(0, 1)), ) plain_tensor = fnp.ones((3, 3, 4)) broadcast_sum = fnp.add(stretched_s2_tensor, plain_tensor) Before group intersection, any axis stretched from size 1 to a larger output size is removed from the carried candidate group. So singleton broadcasting by itself does not preserve symmetry. In this example, though, plain_tensor already carries the same analytically provable S_2 symmetry on (0, 1), so broadcast_sum keeps that shared block. Warnings, conservative behavior, and re-tagging Flopscope propagates symmetry metadata conservatively. When the operation does not guarantee that the declared symmetry survives, the result falls back to a dense FlopscopeArray with no .symmetry metadata. s2_matrix = flops.as_symmetric( fnp.ones((6, 6)), symmetry=flops.SymmetryGroup.symmetric(axes=(0, 1)), ) row_slice = s2_matrix[0] row_slice is a plain dense FlopscopeArray, not a SymmetricTensor. That is expected. SymmetryLossWarning tells you that metadata was dropped or weakened during an operation. If you know more about the result than the conservative propagation rule does, you can re-tag it with flops.as_symmetric(). flops.configure(symmetry_warnings=False) One important caveat: the current implementation does not report every possible partial weakening via SymmetryLossWarning. Treat the warning system as helpful guidance, not as a complete audit of every same-axis subgroup change. 
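The slicing, reduction, and binary-op rules in this guide can be checked by brute force on small groups. The sketch below is a standalone illustration in plain Python, not flopscope's implementation: the helper names (symmetric_group, cyclic_group, pointwise_stabilizer, setwise_stabilizer) are hypothetical, and each group is represented simply as a set of permutation tuples.

```python
from itertools import permutations


def symmetric_group(k):
    """All permutations of range(k); p[i] is the image of axis i."""
    return set(permutations(range(k)))


def cyclic_group(k):
    """The k rotations of range(k)."""
    return {tuple((i + shift) % k for i in range(k)) for shift in range(k)}


def pointwise_stabilizer(group, axes):
    """Slicing rule: keep elements that fix every removed axis individually."""
    return {p for p in group if all(p[a] == a for a in axes)}


def setwise_stabilizer(group, axes):
    """Reduction rule: keep elements that map the reduced set onto itself."""
    axes = set(axes)
    return {p for p in group if {p[a] for a in axes} == axes}


s3, c3, c4 = symmetric_group(3), cyclic_group(3), cyclic_group(4)

# Slicing away axis 2: S_3 keeps two elements (an S_2), C_3 keeps only the identity.
print(len(pointwise_stabilizer(s3, [2])))  # 2
print(len(pointwise_stabilizer(c3, [2])))  # 1

# Reducing axes {1, 3} of C_4 keeps two elements (a C_2).
print(len(setwise_stabilizer(c4, [1, 3])))  # 2

# Binary pointwise op: intersecting S_3 with C_3 leaves C_3.
print((s3 & c3) == c3)  # True
```

The sketch stops at the surviving subgroup. It does not perform the later restriction to the remaining axes or the axis renumbering.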
Edge cases when SymmetricTensors meet NumPy protocols Once your tensors flow through arbitrary user code (or third-party libraries), they inevitably hit NumPy's __array_ufunc__ and __array_function__ machinery — np.add(A, B), np.divmod(A, B), np.add.outer(A, B), np.add.at(A, idx, vals), A @= B, np.tensordot(A, B, axes=...), and so on. Flopscope's protocol implementations are conservative: when an operation would silently corrupt your declared symmetry, flopscope refuses or strips rather than letting it through. This section is a tour of those edge cases. In-place dunders refuse symmetry-corrupting writes A_sym = flops.symmetrize( fnp.random.randn(4, 4), symmetry=flops.SymmetryGroup.symmetric(axes=(0, 1)), ) B_plain = fnp.random.randn(4, 4) # not symmetric A_sym += B_plain # raises ValueError The right-hand-side has no symmetry, so A_sym + B_plain returns a plain FlopscopeArray. Writing that result back into A_sym's buffer would leave the metadata claiming symmetry while the data is asymmetric. Flopscope refuses with ValueError: in-place add on a SymmetricTensor would weaken or destroy the declared symmetry. If you know the asymmetric write is intentional, downgrade first: A_plain = A_sym.view(fnp.ndarray) # zero-copy view as plain FlopscopeArray, no symmetry A_plain += B_plain # works The same guard applies to __isub__, __imul__, __itruediv__, __ifloordiv__, __imod__, __ipow__, __iand__, __ior__, __ixor__, __ilshift__, __irshift__, and __imatmul__ (which additionally falls back to CPython's rebind-the-name semantics when the matmul output shape differs from self.shape). In-place sort / partition refuse on SymmetricTensor A_sym.sort(axis=0) # raises ValueError A_sym.partition(2) # raises ValueError A reorder along any axis breaks the permutation invariance. 
Use the out-of-place forms instead:

    sorted_arr = fnp.sort(A_sym, axis=0)  # plain FlopscopeArray, no symmetry

ufunc.at refuses on SymmetricTensor

np.add.at(a, indices, values) does an unbuffered fancy-index write: every repeat of an index applies again (unlike a[indices] += values, which dedupes). On a SymmetricTensor this almost certainly breaks symmetry, so flopscope refuses:

    np.add.at(A_sym, ([0], [1]), 1.0)  # ValueError

Downgrade with A_sym.view(fnp.ndarray) first if you really need the unbuffered update.

ufunc.outer produces direct-product symmetry

When both operands are symmetric, the output of np.<ufunc>.outer(A, B) inherits the direct product of the input symmetries: A's symmetry on its own axes, and B's symmetry on the lifted slots A.ndim..A.ndim+B.ndim-1:

    A = flops.symmetrize(fnp.random.randn(3, 3), symmetry=flops.SymmetryGroup.symmetric(axes=(0, 1)))
    B = flops.symmetrize(fnp.random.randn(2, 2), symmetry=flops.SymmetryGroup.symmetric(axes=(0, 1)))
    C = np.add.outer(A, B)
    # shape (3, 3, 2, 2), SymmetricTensor
    # symmetry: S_2 on (0, 1) × S_2 on (2, 3)

tensordot keeps surviving direct-product symmetry

The contracted axes drop out of each operand's symmetry. Whatever remains of each operand's symmetry on the surviving (uncontracted) axes is combined as a direct product:

    sym = flops.SymmetryGroup.symmetric(axes=(0, 1))
    A = flops.symmetrize(fnp.random.randn(4, 4, 4, 4), symmetry=sym)  # axes (0,1) symmetric
    B = flops.symmetrize(fnp.random.randn(4, 4, 4, 4), symmetry=sym)
    C = fnp.tensordot(A, B, axes=((2,), (2,)))
    # A's surviving axes: (0, 1, 3) → S_2 on (0, 1) survives
    # B's surviving axes: (0, 1, 3) → S_2 on (0, 1) survives
    # C: shape (4,4,4,4,4,4), SymmetricTensor with S_2 on (0,1) × S_2 on (3,4)

If the contracted axis is part of the symmetry group, that group is destroyed on the corresponding side.
Multi-output ufuncs preserve symmetry on every output np.divmod(A, B), np.frexp(A), np.modf(A) are elementwise — both outputs inherit the same symmetry as their inputs: S = flops.symmetrize(fnp.array([[1.5, 2.5], [2.5, 3.5]]), symmetry=flops.SymmetryGroup.symmetric(axes=(0, 1))) frac, integer = fnp.modf(S) # both SymmetricTensor with the same symmetry out=(o1, o2) works (per-slot identity is preserved), and out=(o1, None) lets numpy allocate just the second slot. Cost model is a placeholder above degree 12 Flopscope charges dense_cost × unique_output_elements / dense_output_elements for ufunc.outer and tensordot as a coarse proxy for the savings a symmetry-aware implementation could realise. The underlying NumPy call still does dense work; only the budget reflects the savings. For a SymmetryGroup with degree above 12, flopscope skips this adjustment and charges the full dense cost — Burnside enumeration on S_n for n > 12 becomes infeasible (13! ≈ 6.2 × 10⁹). The skip is announced via CostFallbackWarning: import warnings deep = fnp.ones((1,) * 33) # auto-inferred S_33 symmetry with warnings.catch_warnings(record=True) as caught: warnings.simplefilter("always") with flops.BudgetContext(flop_budget=int(1e10)): try: fnp.tensordot(deep, deep, axes=0) # CostFallbackWarning: bailed to dense except ValueError: pass # ndim>64 — the warning still fires before numpy refuses This is rare in practice — common user tensors have degree ≤ 8 — but high-rank auto-inferred symmetries on degenerate shapes ((1,)*N for large N) trip the cap. The warning fires once per (op_name, degree) pair per process to avoid log flooding. Suppress with flops.configure(symmetry_warnings=False), which shares the flag with SymmetryLossWarning. A proper algorithmic-cost model is a follow-up. 
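The unique-element counts behind the dense_cost × unique / dense ratio can be reproduced with Burnside's lemma on small groups, which also shows why enumeration stops being practical for large degree. This is a minimal standalone sketch, not flopscope's code: cycle_count and unique_elements are hypothetical helper names, and groups are passed as explicit sets of permutation tuples.

```python
from itertools import permutations
from math import comb, factorial


def cycle_count(perm):
    """Number of cycles of a permutation given as a tuple of images."""
    seen, cycles = set(), 0
    for start in range(len(perm)):
        if start in seen:
            continue
        cycles += 1
        j = start
        while j not in seen:
            seen.add(j)
            j = perm[j]
    return cycles


def unique_elements(group, n):
    """Burnside: orbits of index tuples over range(n) under the group.

    An index tuple is fixed by a permutation exactly when it is constant
    on each cycle, so each element contributes n ** cycle_count.
    """
    total = sum(n ** cycle_count(p) for p in group)
    return total // len(group)


s2 = set(permutations(range(2)))
s3 = set(permutations(range(3)))
c3 = {tuple((i + shift) % 3 for i in range(3)) for shift in range(3)}

print(unique_elements(s2, 3))  # 6 unique entries in a symmetric 3x3 matrix
print(unique_elements(s3, 4))  # 20, matches comb(4 + 3 - 1, 3)
print(unique_elements(c3, 4))  # 24 for the cyclic group C_3
print(f"{factorial(13):,}")    # 6,227,020,800 elements to enumerate for S_13
```

The loop visits every group element once, so the work grows with the group order. That is the reason a cap on degree is a reasonable guard for full symmetric groups.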
Under the hood

The propagation rules are easier to predict if you keep four ideas in mind:

- Unary pointwise ops preserve symmetry-aware costs and keep the same exact group, for full and non-full groups alike
- Slicing uses the pointwise stabilizer of the removed axes
- Reductions use the setwise stabilizer of the reduced axes
- Binary pointwise ops use the intersection of both operands' groups after broadcast alignment

After computing the surviving subgroup, flopscope restricts it to the axes still present in the output and remaps those axes to the output tensor's numbering. That is why:

- slicing one axis of S_3 leaves S_2
- slicing one axis of C_3 leaves nothing nontrivial
- reducing {1, 3} of C_4 can still leave C_2

Going deeper

- Einsum Patterns: how declared and induced symmetry interact with fnp.einsum
- Symmetry Detection Deep Dive: the full detection algorithm for einsum
- Symmetry Explorer: experiment with symmetry interactively

--- guides/linalg ---
URL: https://aicrowd.github.io/flopscope/docs/guides/linalg

Linear Algebra

Use this page to learn how to use fnp.linalg operations and their FLOP costs.
You will learn:

- How to use decompositions, solvers, and property operations in fnp.linalg
- How symmetric inputs reduce linalg costs
- How to query linalg costs before running them

Prerequisites

- Quickstart

Available operations

Decompositions

| Operation | Cost | Weight | Notes |
| --- | --- | --- | --- |
| fnp.linalg.svd(A, k=k) | $m \cdot n \cdot k$ | 4.0 | Truncated SVD |
| fnp.linalg.eig(A) | $10n^3$ | 4.0 | General eigendecomposition |
| fnp.linalg.eigh(A) | $4n^3/3$ | 4.0 | Symmetric eigendecomposition |
| fnp.linalg.cholesky(A) | $n^3/3$ | 4.0 | Cholesky (symmetric positive definite) |
| fnp.linalg.qr(A) | $mn^2 - n^3/3$ | 4.0 | Householder QR (FMA=1) |
| fnp.linalg.eigvals(A) | $10n^3$ | 4.0 | Eigenvalues only |
| fnp.linalg.eigvalsh(A) | $4n^3/3$ | 4.0 | Symmetric eigenvalues only |
| fnp.linalg.svdvals(A) | $m \cdot n \cdot \min(m,n)$ | 4.0 | Singular values only |

Solvers

solve_cost(n) always returns n^3 regardless of the symmetric or nrhs parameters. Those arguments exist for API compatibility but are currently ignored in the cost model.

| Operation | Cost | Weight |
| --- | --- | --- |
| fnp.linalg.solve(A, b) | $n^3$ | 4.0 |
| fnp.linalg.inv(A) | $n^3$ | 4.0 |
| fnp.linalg.lstsq(A, b) | $m \cdot n \cdot \min(m,n)$ | 4.0 |
| fnp.linalg.pinv(A) | $m \cdot n \cdot \min(m,n)$ | 4.0 |

inv of a symmetric matrix returns a SymmetricTensor.
Properties

| Operation | Cost | Weight |
| --- | --- | --- |
| fnp.linalg.det(A) | $n^3$ | 4.0 |
| fnp.linalg.slogdet(A) | $n^3$ | 4.0 |
| fnp.linalg.norm(x) | depends on ord | varies |
| fnp.linalg.cond(A) | $m \cdot n \cdot \min(m,n)$ | varies |
| fnp.linalg.matrix_rank(A) | $m \cdot n \cdot \min(m,n)$ | varies |
| fnp.linalg.trace(A) | $n$ | varies |

Compound

| Operation | Cost | Notes |
| --- | --- | --- |
| fnp.linalg.multi_dot(arrays) | Optimal chain ordering | Uses np.linalg.multi_dot |
| fnp.linalg.matrix_power(A, n) | $n^3 \times \text{exponent}$ | Repeated squaring |

Symmetric input savings

Pass a SymmetricTensor to get automatic cost reductions:

    import flopscope as flops
    import flopscope.numpy as fnp

    with flops.BudgetContext(flop_budget=10**8) as budget:
        A = flops.as_symmetric(fnp.multiply(fnp.eye(10), 2.0), symmetric_axes=(0, 1))
        # solve_cost(n=10) = n^3 = 1000 FLOPs (symmetric/nrhs params are currently ignored)
        x = fnp.linalg.solve(A, fnp.ones(10))
        # inv returns SymmetricTensor
        A_inv = fnp.linalg.inv(A)
        print(isinstance(A_inv, flops.SymmetricTensor))  # True

See Exploit Symmetry Savings for full details.

Query cost before running

    cost = flops.svd_cost(m=256, n=256, k=10)
    print(f"SVD cost: {cost:,}")  # 655,360

    cost = flops.solve_cost(n=256)
    print(f"Solve cost: {cost:,}")  # 16,777,216 (= 256^3; symmetric/nrhs params currently ignored)

Common pitfalls

Symptom: Using numpy.linalg.svd instead of fnp.linalg.svd
Fix: Operations called through numpy directly bypass FLOP counting. Always use fnp.linalg.*.

Related pages

- Exploit Symmetry Savings: symmetry-aware cost reductions
- Plan Your Budget: query costs before running
- API Reference: full list of supported operations

--- guides/fft ---
URL: https://aicrowd.github.io/flopscope/docs/guides/fft

FFT Operations

Use this page to learn how to use fnp.fft operations and understand their FLOP costs.
You will learn:

- How to use 1-D, 2-D, and N-D FFT operations and their cost formulas
- How to choose between real and complex transforms to save FLOPs
- How to query FFT costs before committing budget

Prerequisites

- Quickstart

Cost model

FFT costs are based on the Cooley-Tukey radix-2 algorithm:

| Transform | Cost Formula | Example (n=1024) |
| --- | --- | --- |
| fft, ifft | $5n \cdot \lceil\log_2 n\rceil$ | 51,200 |
| rfft, irfft | $5(n/2) \cdot \lceil\log_2 n\rceil$ | 25,600 |
| fft2, ifft2 | $5N \cdot \lceil\log_2 N\rceil$ where $N = n_1 \cdot n_2$ | varies |
| fftn, ifftn | $5N \cdot \lceil\log_2 N\rceil$ where $N = \prod_i n_i$ | varies |
| fftfreq, rfftfreq | 0 (free) | 0 |
| fftshift, ifftshift | 0 (free) | 0 |

Real-valued transforms (rfft, irfft, rfftn, irfftn) cost roughly half of their complex counterparts because they exploit conjugate symmetry.

Basic usage

    import flopscope as flops
    import flopscope.numpy as fnp

    with flops.BudgetContext(flop_budget=1_000_000) as budget:
        # Generate a signal (costs numel(output) = 1,024 FLOPs)
        signal = fnp.random.randn(1024)
        # Forward FFT: 5 * 1024 * 10 = 51,200 FLOPs
        spectrum = fnp.fft.fft(signal)
        # Inverse FFT: same cost
        reconstructed = fnp.fft.ifft(spectrum)
        # Frequency bins (free)
        freqs = fnp.fft.fftfreq(1024)
        # Total: randn 1,024 + fft 51,200 + ifft 51,200 + fftfreq 0 = 103,424
        print(f"Total FFT cost: {budget.flops_used:,}")  # 103,424

Real vs complex transforms

When your input is real-valued (which is common in signal processing), prefer rfft over fft. It costs half as much:

    import flopscope as flops
    import flopscope.numpy as fnp

    with flops.BudgetContext(flop_budget=1_000_000) as budget:
        signal = fnp.random.randn(1024)  # 1,024 FLOPs
        # Complex FFT: 51,200 FLOPs
        spec_complex = fnp.fft.fft(signal)
        budget_after_fft = budget.flops_used
        # Real FFT: 25,600 FLOPs
        spec_real = fnp.fft.rfft(signal)
        rfft_cost = budget.flops_used - budget_after_fft
        print(f"fft cost: {budget_after_fft:,}")  # 52,224 (randn 1,024 + fft 51,200)
        print(f"rfft cost: {rfft_cost:,}")  # 25,600

The output of rfft has shape (n//2 + 1,) instead of (n,), since the negative frequencies are redundant for real inputs.

Multi-dimensional FFT

Use fft2 for 2-D transforms (e.g., images) and fftn for arbitrary dimensions:

    import flopscope as flops
    import flopscope.numpy as fnp

    with flops.BudgetContext(flop_budget=10**8) as budget:
        # 2-D image (costs numel(output) = 65,536 FLOPs)
        image = fnp.random.randn(256, 256)
        # 2-D FFT
        spectrum_2d = fnp.fft.fft2(image)
        print(f"2D FFT cost: {budget.flops_used:,}")
        # N-D FFT with explicit shape
        volume = fnp.random.randn(32, 32, 32)
        spectrum_3d = fnp.fft.fftn(volume)

Windowed FFT pattern

A common signal processing pattern is to window the signal before the FFT to reduce spectral leakage:

    import flopscope as flops
    import flopscope.numpy as fnp

    with flops.BudgetContext(flop_budget=1_000_000) as budget:
        signal = fnp.random.randn(1024)
        # Window function (counted: hamming costs n FLOPs)
        window = fnp.hamming(1024)
        # Apply window (counted: multiply costs n FLOPs)
        windowed = fnp.multiply(signal, window)
        # FFT (counted)
        spectrum = fnp.fft.rfft(windowed)
        print(budget.summary())

Query costs before running

    from flopscope.flops import fft_cost, rfft_cost

    # Check cost of a large FFT before committing budget
    n = 2**20  # ~1 million points
    print(f"Complex FFT: {fft_cost(n):,} FLOPs")  # 104,857,600
    print(f"Real FFT: {rfft_cost(n):,} FLOPs")  # 52,428,800

Common pitfalls

Symptom: Using fnp.fft.fft on real data when fnp.fft.rfft would suffice
Fix: rfft costs half as much. If your input is real-valued, always prefer rfft/irfft over fft/ifft.

Symptom: Unexpectedly high cost for multi-dimensional FFT
Fix: The cost scales as $5 \cdot \prod_i n_i \cdot \lceil\log_2(\prod_i n_i)\rceil$. A 256x256 2-D FFT processes 65,536 elements, not 256. Use fft_cost to estimate before running.
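The formulas in the cost table are simple enough to reproduce by hand, which helps when sanity-checking a budget. The sketch below is a standalone re-implementation of those formulas in plain Python. The function names are hypothetical (they are not flopscope's own fft_cost and rfft_cost), and the integer division in the real-transform case is an assumption for odd n.

```python
from math import prod


def ceil_log2(n):
    """Exact ceil(log2(n)) for a positive integer, without floating point."""
    return (n - 1).bit_length()


def fft_cost_sketch(n):
    """Complex 1-D transform: 5 * n * ceil(log2 n)."""
    return 5 * n * ceil_log2(n)


def rfft_cost_sketch(n):
    """Real 1-D transform: 5 * (n / 2) * ceil(log2 n); n // 2 is an assumption."""
    return 5 * (n // 2) * ceil_log2(n)


def fftn_cost_sketch(shape):
    """N-D transform: 5 * N * ceil(log2 N) with N the product of all axis sizes."""
    total = prod(shape)
    return 5 * total * ceil_log2(total)


print(f"{fft_cost_sketch(1024):,}")          # 51,200
print(f"{rfft_cost_sketch(1024):,}")         # 25,600
print(f"{fft_cost_sketch(2**20):,}")         # 104,857,600
print(f"{fftn_cost_sketch((256, 256)):,}")   # 5,242,880
```

The last line makes the multi-dimensional pitfall concrete: a 256x256 transform is priced on all 65,536 elements with a log factor of 16, roughly 100 times the cost of a single 1,024-point transform.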
Related pages API Reference — full function signatures and docstrings Plan Your Budget — general cost estimation workflow FLOP Counting Model — how all costs are computed Linear AlgebraPrevious PageRandom Number GenerationNext Page --- guides/budget-planning --- URL: https://aicrowd.github.io/flopscope/docs/guides/budget-planning GuidesBudget Planning & DebuggingEstimate costs before running and diagnose overruns after. You will learn: How to use cost query functions to estimate FLOPs without executing How to read and interpret the budget summary How to diagnose expensive operations using the operation log Optimization strategies for reducing FLOP consumption Estimate before running Flopscope provides cost query functions that compute FLOP costs from shapes without executing anything or touching the budget. Use these to plan before committing FLOPs: import flopscope as flops import flopscope.numpy as fnp # Einsum cost cost = flops.einsum_cost('ij,jk->ik', shapes=[(256, 256), (256, 256)]) print(f"Matmul cost: {cost:,}") # 16,777,216 (256^3, FMA=1) # SVD cost cost = flops.svd_cost(m=256, n=256, k=10) print(f"SVD cost: {cost:,}") # 655,360 # Pointwise cost (unary/binary ops like exp, add, multiply) cost = flops.pointwise_cost("exp", shape=(256, 256)) print(f"Pointwise cost: {cost:,}") # 65,536 # Reduction cost (sum, mean, max, etc.) cost = flops.reduction_cost("sum", input_shape=(256, 256)) print(f"Reduction cost: {cost:,}") # 65,536 For multi-operand einsums (3+ operands), use fnp.einsum_path() to see the step-by-step contraction breakdown with per-step costs and symmetry savings: path, info = fnp.einsum_path('ijk,ai,bj,ck->abc', T, A, B, C) print(f"Optimized cost: {info.optimized_cost:,}") print(f"Naive cost: {info.naive_cost:,}") print(f"Speedup: {info.speedup:.1f}x") print(info) # full per-step table fnp.einsum_path() does not execute the contraction, but it does record a nominal 1-FLOP planning event so the path query itself is still visible in the operation log. 
Budget breakdown example Plan a multi-step computation before executing: steps = [ ("einsum ij,j->i", flops.einsum_cost('ij,j->i', shapes=[(256, 256), (256,)])), ("ReLU (maximum)", flops.pointwise_cost("maximum", shape=(256,))), ("sum reduction", flops.reduction_cost("sum", input_shape=(256,))), ] total = sum(cost for _, cost in steps) print(f"{'Operation':<20} {'FLOPs':>12}") print("-" * 34) for name, cost in steps: print(f"{name:<20} {cost:>12,}") print("-" * 34) print(f"{'Total':<20} {total:>12,}") Read the budget summary Call flops.budget_summary() after your computation for a human-readable breakdown, or budget.summary() inside a context. Pass by_namespace=True when you want dotted namespace attribution: with flops.BudgetContext(flop_budget=10_000_000) as budget: A = fnp.ones((256, 256)) x = fnp.ones((256,)) h = fnp.einsum('ij,j->i', A, x) h = fnp.exp(h) h = fnp.sum(h) print(budget.summary()) The summary shows cost per operation type, sorted by highest cost first. Look for operations consuming a disproportionate share of the budget. When you opt into by_namespace=True, the display adds a namespace breakdown for the exact dotted paths recorded in that run. 
For programmatic analysis, use flops.budget_summary_dict(): data = flops.budget_summary_dict() print(f"Budget: {data['flop_budget']:,}") print(f"Used: {data['flops_used']:,}") print(f"Left: {data['flops_remaining']:,}") for op_name, op_data in data["operations"].items(): print(f" {op_name}: {op_data['flop_cost']:,} ({op_data['calls']} calls)") Use flops.budget_summary_dict(by_namespace=True) for exact per-namespace breakdowns keyed by the full dotted path: with flops.BudgetContext(flop_budget=1000, namespace="predict") as budget: x = fnp.ones((1,)) with fnp.namespace("fallback"): with fnp.namespace("sampling"): sample = fnp.add(x, 1) data = budget.summary_dict(by_namespace=True) print(data["by_namespace"]["predict.fallback.sampling"]["flops_used"]) Add a time limit when FLOPs are not the only risk Some operations are analytically cheap enough to fit the FLOP budget but still slow in practice. Use wall_time_limit_s on the same BudgetContext when you want a cooperative wall-clock deadline in addition to the FLOP cap: with flops.BudgetContext( flop_budget=10_000_000, wall_time_limit_s=2.0, namespace="predict", ) as budget: # computation must stay within both limits ... print(budget.summary()) When the time limit is exceeded, Flopscope raises TimeExhaustedError at the next operation boundary. 
The summary exposes four timing views that decompose wall time exactly: wall_time_s: total elapsed time for the context flopscope_backend_time_s: time spent inside the underlying NumPy / BLAS / LAPACK backend calls being counted flopscope_overhead_time_s: time spent in flopscope's own dispatch code (wrapper preambles, FLOP cost computation, view-casts, post-call wrapping, maybe_check_nan_inf when opted in via flopscope.configure(check_nan_inf=True)) residual_wall_time_s: the measured wall-clock remainder outside backend calls and flopscope overhead (user Python between ops, time.sleep, GC pauses, un-instrumented numpy) The identity wall_time_s = flopscope_backend_time_s + flopscope_overhead_time_s + residual_wall_time_s holds within numerical tolerance. Use budget.summary() when you want the current context's timings, and flops.budget_summary() when you want the accumulated session/global view. Diagnose overruns When you hit a BudgetExhaustedError, the budget's operation log gives per-call detail: for record in budget.op_log: print(f"{record.op_name:<16} cost={record.flop_cost:>12,} cumulative={record.cumulative:>12,}") Each OpRecord contains: FieldDescriptionop_nameOperation name (e.g., "einsum", "exp")namespaceEffective namespace path recorded for that operationsubscriptsEinsum subscript string, or NoneshapesTuple of input shapesflop_costFLOP cost of this single callcumulativeRunning total after this callflopscope_context_start_offset_sSeconds from the active BudgetContext start to when this operation was recordedflopscope_backend_duration_sSeconds spent in the underlying backend call for this operationflopscope_overhead_duration_sSeconds of flopscope wrapper/accounting overhead attributed to this operation Look for the operation where cumulative jumps sharply -- that is your most expensive call. 
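Scanning a long log by eye gets tedious, so a small helper can rank the calls instead. The sketch below works on stand-in records rather than a real budget.op_log: the Record namedtuple is a hypothetical substitute that only mirrors the op_name, flop_cost, and cumulative fields described in the table above, and the helper names are illustrative.

```python
from collections import namedtuple

# Stand-in for OpRecord with only the fields this sketch needs (hypothetical).
Record = namedtuple("Record", ["op_name", "flop_cost", "cumulative"])


def build_log(calls):
    """Turn (op_name, flop_cost) pairs into records with a running total."""
    log, running = [], 0
    for op_name, flop_cost in calls:
        running += flop_cost
        log.append(Record(op_name, flop_cost, running))
    return log


def costliest_calls(op_log, top=3):
    """Single calls with the largest cost, i.e. the sharpest cumulative jumps."""
    return sorted(op_log, key=lambda r: r.flop_cost, reverse=True)[:top]


def totals_by_op(op_log):
    """Total cost per operation name across all calls."""
    totals = {}
    for record in op_log:
        totals[record.op_name] = totals.get(record.op_name, 0) + record.flop_cost
    return totals


log = build_log([("einsum", 65_536), ("exp", 256), ("sum", 256), ("einsum", 16_777_216)])

worst = costliest_calls(log, top=1)[0]
print(worst.op_name, f"{worst.flop_cost:,}")  # einsum 16,777,216
print(totals_by_op(log)["einsum"])            # 16842752
```

With a real context, the same two helpers should apply to budget.op_log directly, assuming its records expose the field names listed in the table.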
For real-time monitoring during long computations, use the live budget display:

    with flops.budget_live():
        with flops.BudgetContext(flop_budget=10**8, namespace="training") as budget:
            for i in range(100):
                # ... computation ...
                pass
    # The live display updates automatically as FLOPs are consumed

What to do next

Once you have identified the expensive operations, apply these strategies:

- Reduce dimensions. If random.randn(1024, 1024) is too expensive, try smaller arrays. A 512x512 array costs 1/4 the FLOPs of a 1024x1024 array for random.randn and pointwise ops, and 1/8 the FLOPs for a square matmul, which scales as n^3.
- Exploit symmetry. If operands are symmetric, use flops.as_symmetric() to roughly halve pointwise costs and significantly reduce einsum costs. See Symmetry Savings.
- Use cheaper operations. A matrix-vector product costs m*n FLOPs, while a matrix-matrix product costs m*n*k. Avoid computing full matrix products when you only need a few rows or a single vector result.
- Increase budget. If the computation is genuinely needed and you have headroom, raise flop_budget on the BudgetContext.
- Split into phases. Use namespaces to attribute different phases without splitting the FLOP budget into child budgets:

    with flops.BudgetContext(flop_budget=10**8, namespace="solver") as budget:
        with fnp.namespace("init"):
            # initialization
            ...
        with fnp.namespace("solve"):
            # main computation
            ...
    print(budget.summary(by_namespace=True))

============================================================
Understanding flopscope
============================================================

--- understanding/how-flopscope-works ---
URL: https://aicrowd.github.io/flopscope/docs/understanding/how-flopscope-works

How Flopscope Works

Understand how Flopscope wraps NumPy to count every FLOP.
You will learn: The wrapping pattern that makes import flopscope.numpy as fnp the counted NumPy surface How costs are calculated from tensor shapes before execution How budgets are enforced and what happens when they are exceeded How the operation registry classifies every NumPy callable The wrapping pattern flopscope exposes a NumPy-compatible API. When you write import flopscope.numpy as fnp and call fnp.einsum(...), you get a function that behaves like np.einsum(...) but with FLOP counting layered on top. Under the hood, flopscope re-exports wrapped versions of NumPy functions. The flopscope/__init__.py module imports from internal modules that each handle a category of operations: _pointwise.py -- unary and binary elementwise operations (exp, add, multiply, etc.) _einsum.py -- the einsum and einsum_path functions with symmetry-aware path optimization _free_ops.py -- zero-cost operations (zeros, reshape, transpose, copy, etc.) _counting_ops.py -- operations that look free but involve genuine computation (trace, histogram, etc.) _sorting_ops.py -- sorting, searching, and set operations Submodules -- flopscope.numpy.linalg, flopscope.numpy.fft, flopscope.numpy.random, flopscope.stats Each wrapped function follows the same pattern: compute the analytical FLOP cost, check the budget, then delegate to the real NumPy implementation. Cost interception When you call a counted operation, flopscope computes its FLOP cost analytically from the tensor shapes before the operation executes. The cost depends on the operation category: CategoryCost formulaExamplePointwise unarynumel(output)fnp.exp(x) on shape (256, 256) costs 65,536Pointwise binarynumel(output)fnp.add(a, b) with broadcast output (256, 256) costs 65,536Reductionnumel(input)fnp.sum(x) on shape (256, 256) costs 65,536Einsumproduct of all index dimensions'ij,jk->ik' with shapes (m, k), (k, n) costs m * k * nFree0fnp.zeros(...), fnp.reshape(...), fnp.transpose(...) 
The cost is always deterministic -- the same shapes produce the same FLOP count regardless of the data values or the hardware running the code. Each FMA (fused multiply-add) counts as 1 operation, not 2. A matrix multiply of dimensions (m, k) x (k, n) costs m * k * n FLOPs. Budget enforcement BudgetContext accumulates the cost of every operation that runs inside it. Before each counted operation executes, the budget is checked: The wrapped function computes the analytical cost from input shapes It calls budget.deduct(op_name, flop_cost=cost, ...) on the active budget deduct() checks if flops_used + cost > flop_budget If within budget: the cost is recorded, and the real NumPy function runs If over budget: BudgetExhaustedError is raised, and the operation does not execute Every deduction is recorded as an OpRecord in the budget's operation log, capturing the operation name, input shapes, FLOP cost, cumulative total, context start offset, backend duration, and flopscope overhead duration. This log powers the budget summary and debugging tools. If no explicit BudgetContext is active, Flopscope automatically creates a global default context with a budget of 1e15 FLOPs (configurable via the FLOPSCOPE_DEFAULT_BUDGET environment variable). This means bare calls outside any with block still work and still count FLOPs. The flow of a single call Here is what happens when you call fnp.matmul(A, B) with shapes (100, 200) and (200, 50): User calls fnp.matmul(A, B) | v flopscope computes cost: 100 * 200 * 50 = 1,000,000 FLOPs | v budget.deduct("matmul", flop_cost=1_000_000, shapes=((100,200), (200,50))) | +--> if flops_used + 1_000_000 > flop_budget: | raise BudgetExhaustedError | +--> else: flops_used += 1_000_000 record OpRecord to op_log | v np.matmul(A, B) executes and returns the result | v Result returned to user The operation registry The registry (flopscope/_registry.py) is a mapping of every NumPy callable to its classification and cost behavior. 
Each entry specifies: Category: one of counted_unary, counted_binary, counted_reduction, counted_custom, free, or blacklisted Module: which NumPy module it belongs to (numpy, numpy.linalg, numpy.fft, etc.) Notes: any special behavior or cost formula details The categories determine how costs are calculated: CategoryMeaningCostcounted_unaryScalar math on each elementnumel(output)counted_binaryElement-wise binary operationnumel(output)counted_reductionReduce an array along axesnumel(input)counted_customBespoke cost formulaVaries (e.g., n * ceil(log2(n)) for sort)freeZero FLOP cost0blacklistedIntentionally unsupportedRaises AttributeError Free operations include allocation (zeros, ones, empty), shape manipulation (reshape, transpose, squeeze), indexing helpers (ix_, indices), and metadata queries (shape, ndim, size). These do not touch the budget. Blocked operations include I/O (save, load), error state management (geterr, seterr), and other operations that do not make sense in a FLOP-counted context. Calling a blocked operation raises AttributeError. When per-operation weights are loaded, the analytical cost is multiplied by the operation's weight before deduction. This allows the cost model to reflect that exp is more expensive than abs in terms of actual hardware instructions, while keeping the base formulas simple and deterministic. Related pages FLOP Counting Model -- detailed cost formulas for every category Operation Categories -- which operations are free, counted, or blocked Competition Guide -- using budgets in competition Budget Planning & DebuggingPrevious PageFLOP Counting ModelNext Page --- understanding/flop-counting-model --- URL: https://aicrowd.github.io/flopscope/docs/understanding/flop-counting-model Understanding FlopscopeFLOP Counting ModelUse this page to understand how Flopscope counts FLOPs and why it uses analytical counting instead of runtime measurement. 
You will learn:

- How Flopscope computes FLOP costs analytically from tensor shapes
- How cost formulas work for each operation category (einsum, linalg, FFT, etc.)
- How symmetry savings and per-operation weights modify costs
- How the FLOP multiplier and namespaces interact with the cost model

Convention: FMA = 1 operation

This codebase counts a fused multiply-add (a * b + c) as a single operation. Hardware FMA units execute this in one instruction; the common textbook convention of counting it as 2 (one multiply + one add) is not used here. All cost formulae reflect this: a matrix multiply of dimensions (m, k) x (k, n) costs m * k * n operations, not 2 * m * k * n.

Why FLOPs instead of wall-clock time

- Deterministic: The same code always produces the same FLOP count, regardless of hardware
- Hardware-independent: A matmul costs the same FLOPs on a laptop and a server
- Reproducible: No variance from CPU scheduling, cache effects, or thermal throttling
- Composable: You can sum individual operation costs to predict total cost

How costs are computed

flopscope computes FLOP costs analytically from tensor shapes, not by measuring execution time.

1. You call a counted operation (e.g., fnp.einsum('ij,j->i', W, x))
2. flopscope computes the cost from the shapes: 256 × 256 = 65,536 FLOPs
3. The cost is checked against the remaining budget
4. If within budget: the operation executes and the cost is deducted
5. If over budget: BudgetExhaustedError is raised, the operation does not execute

Cost formulas by category

Each formula below gives the analytical base cost. In normal use, flopscope loads the packaged official per-operation weights automatically at import time, so the base cost is multiplied by the operation's weight to give the final deducted cost. Set FLOPSCOPE_DISABLE_WEIGHTS=1 if you want the pure analytical unit-cost model instead.
| Category | Formula | Example |
| --- | --- | --- |
| Einsum | Per-step: product of all index dims | 'ij,jk->ik' → 3 × 4 × 5 = 60 |
| Unary (exp, log, sqrt, ...) | $\text{numel}(\text{output})$ | shape (256, 256) → 65,536 |
| Binary (add, multiply, ...) | $\text{numel}(\text{output})$ | shape (256, 256) → 65,536 |
| Reduction (sum, mean, max, ...) | $\text{numel}(\text{input})$ | shape (256, 256) → 65,536 |
| SVD | $m \cdot n \cdot k$ | (256, 256, k=10) → 655,360 |
| Solve | $n^3$ | (256, 256) solve → 16,777,216 |
| Dot / Matmul | Same as einsum | (256, 256) @ (256, 256) → $256^3$ |
| Free ops | 0 | zeros, reshape, etc. |

Sorting & search

| Category | Formula | Example |
| --- | --- | --- |
| Sort / Argsort | $n \cdot \lceil\log_2 n\rceil$ per slice | shape (4, 8), axis=-1 → 4 × 8 × 3 = 96 |
| Lexsort | $k \cdot n \cdot \lceil\log_2 n\rceil$ | 2 keys of length 8 → 2 × 8 × 3 = 48 |
| Partition | $n$ per slice | shape (100,), kth=50 → 100 |
| Searchsorted | $m \cdot \lceil\log_2 n\rceil$ | 5 queries into 1024 → 5 × 10 = 50 |
| Unique | $n \cdot \lceil\log_2 n\rceil$ | 8 elements → 8 × 3 = 24 |
| Set ops | $(n+m) \cdot \lceil\log_2(n+m)\rceil$ | 4 + 4 elements → 8 × 3 = 24 |

Histogram & counting

| Category | Formula | Example |
| --- | --- | --- |
| Histogram | $n \cdot \lceil\log_2 \text{bins}\rceil$ | 100 elements, 8 bins → 100 × 3 = 300 |
| Bincount | $n$ | 100 elements → 100 |

Random sampling

| Category | Formula | Example |
| --- | --- | --- |
| Simple samplers | $\text{numel}(\text{output})$ | shape (10, 20) → 200 |
| Shuffle / Permutation | $n \cdot \lceil\log_2 n\rceil$ | 16 elements → 16 × 4 = 64 |

Symmetry savings

When a tensor is a SymmetricTensor, costs are reduced based on the number of unique elements rather than total elements. For a symmetric $n \times n$ matrix, there are $n(n+1)/2$ unique elements instead of $n^2$.
| Category | Symmetric cost | Standard cost |
| --- | --- | --- |
| Pointwise (unary/binary) | unique_elements | $\text{numel}(\text{output})$ |
| Reduction | unique_elements | $\text{numel}(\text{input})$ |
| Einsum (symmetric contraction) | Symmetry-reduced (see below) | Full product |
| Solve | $n^3$ | $n^3$ |
| Det / Slogdet | $n^3$ | $n^3$ |
| Inv | $n^3/3 + n^3$ | $n^3$ |

See Exploit Symmetry Savings for usage details.

Subgraph symmetry detection

Symmetry that reduces einsum costs comes from two complementary sources, both unified under the subgraph symmetry detection algorithm:
1. Declared per-operand symmetry. When an operand is wrapped with flops.as_symmetric(), its symmetry groups are embedded in the bipartite graph as U-vertex equivalence classes. These propagate into intermediate tensors automatically.
2. Induced symmetry from repeated operands. When the same Python object is passed at multiple operand positions, the subgraph oracle detects this via Python identity (is) and derives symmetry groups on the output that cannot be seen from per-operand metadata alone.

The oracle builds a bipartite graph once per contract_path call and evaluates symmetry lazily per subset of operands encountered during path search. Both sources are merged via the same group-merging machinery, so a tensor that is both a SymmetricTensor and repeated in the subscript benefits from both contributions simultaneously. See the symmetry guide for usage examples, and the subgraph symmetry explanation for the algorithm walkthrough.

Einsum cost model

Every einsum, regardless of the number of operands, is decomposed into pairwise contraction steps along an optimal path (found via flopscope's opt_einsum fork).
The total cost is the sum of per-step costs:

    total_cost = sum(step.flop_cost for step in path.steps)

Per-step cost

For each pairwise step, the dense cost is:

    dense_step_cost = product of all index dimensions

Each fused multiply-add (FMA) counts as 1 operation (see Convention above), so the cost of a contraction step is simply the product of all index dimensions; there is no factor-of-2 distinction between inner products and outer products. When symmetry is present, flopscope reduces each step's cost based on the structure of the contraction.

Symmetric contraction cost

Each pairwise step's cost is reduced by two independent multiplicative factors, one for the output (V-side) indices and one for the inner (W-side) contracted indices:

    step_cost = dense_step_cost × (unique_output_elements / total_output_elements) × (unique_inner_elements / total_inner_elements)

Each ratio is computed exactly using Burnside's lemma over the permutation group detected for that step by the SubgraphSymmetryOracle. For the full symmetric group $S_k$ on $k$ equal-sized axes, Burnside reduces to the stars-and-bars formula $\binom{n+k-1}{k}$; for proper subgroups like $C_k$ or block groups the oracle returns the exact generators and Burnside counts over the enumerated elements.

The output (V-side) reduction is always applied when the step's intermediate has a non-trivial permutation group on its free indices, because only the unique output elements need to be computed.

The inner (W-side) reduction is applied only when all labels in the detected inner group are present as contracted indices in that specific pairwise step. If any of those labels were contracted at an earlier step and no longer appear in the current step, the inner reduction is skipped (the per-step table shows this as [W: ...] when detected-but-not-applied versus [W✓: ...] when applied). Inner symmetry can be toggled globally with flops.configure(use_inner_symmetry=False).
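The Burnside ratio in the step cost can be checked by hand for small groups. The sketch below is not flopscope's implementation: it assumes every axis in the group has the same dimension n and that the group is supplied as an explicit list of permutations (tuples mapping position to position). A permutation with c cycles fixes n**c index tuples, and Burnside's lemma averages that over the group.

```python
from itertools import permutations
from math import comb


def cycle_count(perm: tuple[int, ...]) -> int:
    """Number of cycles (including fixed points) of a permutation."""
    seen: set[int] = set()
    cycles = 0
    for start in range(len(perm)):
        if start in seen:
            continue
        cycles += 1
        j = start
        while j not in seen:
            seen.add(j)
            j = perm[j]
    return cycles


def burnside_unique(group: list[tuple[int, ...]], n: int) -> int:
    """Orbit count of index tuples in range(n)**k under the given group."""
    fixed_total = sum(n ** cycle_count(g) for g in group)
    return fixed_total // len(group)


n = 4
s3 = list(permutations(range(3)))        # full symmetric group on 3 axes
c3 = [(0, 1, 2), (1, 2, 0), (2, 0, 1)]   # cyclic subgroup of order 3

unique_s3 = burnside_unique(s3, n)       # agrees with stars-and-bars
unique_c3 = burnside_unique(c3, n)       # proper subgroup: fewer savings
ratio_s3 = unique_s3 / n**3              # factor applied to the dense cost
```

With n = 4 the full symmetric group leaves 20 unique elements out of 64, matching comb(n + 3 - 1, 3), while the cyclic group leaves 24.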
The two factors are independent; outer-product contractions (no summed indices) and non-uniform index dimensions are handled by the same formula, since Burnside's lemma makes no assumption about uniform sizes beyond requiring axes in the same orbit to share a dimension. Multi-operand contractions For a simple two-operand einsum like 'ij,jk->ik', there is one step, so the total cost equals the step cost. For multi-operand einsums (3+ tensors), the optimizer finds the pairwise ordering that minimizes the total cost. When symmetric tensors are present, the optimizer is symmetry-aware: it uses symmetric costs to decide which pair to contract at each step, so the returned path may differ from the dense-optimal path. Symmetry propagates through intermediates — if an early contraction produces a symmetric intermediate, subsequent steps benefit from the reduced element count, and the optimizer factors this into its ordering decisions. Use fnp.einsum_path() to inspect the per-step breakdown. See Use Einsum for examples. Per-operation weights The analytical formulas above treat all operations within a category as equally expensive -- exp, log, sin, and abs all cost {numel}({output})\text\{numel\}(\text\{output\}){numel}({output}) FLOPs. In reality, exp decomposes into a minimax polynomial approximation requiring approximately 14 FP instructions per element, while abs is a single bit manipulation. Per-operation weights correct for this. 
Each weight is a multiplicative constant applied on top of the analytical formula:

    actual_cost = analytical_formula(shape) × weight(op_name)

| Operation | Analytical formula | Weight | Effective cost (256x256) |
| --- | --- | --- | --- |
| add | $\text{numel}(\text{output})$ | 1 | 65,536 |
| exp | $\text{numel}(\text{output})$ | 16 | 1,048,576 |
| sin | $\text{numel}(\text{output})$ | 16 | 1,048,576 |
| matmul | $m \cdot k \cdot n$ | 1 | 16,777,216 |
| linalg.cholesky | $n^3$ | 4 | 67,108,864 |
| reshape | 0 | 0 | 0 |

Weights are measured using the overhead-subtracted correction-factor methodology described in FLOP Weight Calibration Results. The formula is:

$$w(\text{op}) = \max\bigl(\alpha_{\text{raw}}(\text{op}) - \text{overhead}_{\text{category}},\ 0\bigr)$$

where $\alpha_{\text{raw}}$ is the median ratio of hardware-observed FP instructions to analytical FLOPs (FMA = 1 op), measured via fp_arith_inst_retired performance counters. The ufunc dispatch overhead (measured from np.abs, which generates zero FP arithmetic) is subtracted per category to remove numpy implementation noise from the weight.

BLAS-backed operations (contractions, linalg) have weights near 1.0 because their tight FMA loops execute almost exactly 1 hardware FP instruction per analytical FLOP, with no ufunc overhead to subtract.

Known analytical zero-FLOP operations (reshape, broadcast_to, random.seed, etc.) are stored with weight 0.0 in the official artifacts so the generated docs surface them as free rather than as 1x unit-cost ops.

Integer and bitwise operations (bitwise_and, gcd, lcm, etc.) use the instructions hardware counter (total retired instructions) because they do not retire fp_arith_inst_retired events. Their weights are derived from instruction counts normalized the same way as FP operations.
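The overhead-subtracted formula can be written out in a few lines. This is a sketch of the arithmetic only; the function name and every number below are made-up illustrative inputs, not measured calibration data and not flopscope's calibration code.

```python
from statistics import median


def op_weight(raw_ratios: list[float], category_overhead: float) -> float:
    """w(op) = max(alpha_raw(op) - overhead_category, 0).

    raw_ratios: per-run ratios of observed FP instructions to analytical
    FLOPs; alpha_raw is their median.
    """
    alpha_raw = median(raw_ratios)
    return max(alpha_raw - category_overhead, 0.0)


# Illustrative (invented) inputs: a transcendental op well above the overhead
w_transcendental = op_weight([16.25, 16.5, 16.75], category_overhead=0.5)

# An op whose ratio sits below the overhead is clamped to zero, not negative
w_clamped = op_weight([0.25, 0.25, 0.5], category_overhead=0.5)
```

The clamp at zero is what keeps an operation that performs no FP arithmetic from receiving a negative weight once the dispatch overhead is subtracted.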
Official weights are packaged with flopscope and enabled by default on import. Use FLOPSCOPE_WEIGHTS_FILE to override them with a custom JSON file, or set FLOPSCOPE_DISABLE_WEIGHTS=1 to disable weighting entirely and fall back to unit weights (1.0 for all operations). How weights are applied Weights are applied centrally in BudgetContext.deduct(). Every counted operation passes its op_name to deduct(), which looks up the weight and multiplies it into the cost: adjusted_cost = analytical_cost × flop_multiplier × weight(op_name) This means weights compose with flop_multiplier and with symmetry reductions -- symmetry reduces the element count, the weight scales the per-element cost, and both apply independently. Overriding or disabling packaged weights Normal imports load the packaged official weights automatically. To override them at import time, set FLOPSCOPE_WEIGHTS_FILE: export FLOPSCOPE_WEIGHTS_FILE=/path/to/weights.json To disable weighting entirely and use unit weights instead: export FLOPSCOPE_DISABLE_WEIGHTS=1 The JSON file must have a "weights" key mapping operation names to floats: { "weights": { "add": 1.0, "exp": 16.0, "sin": 16.0, "matmul": 1.0, "linalg.cholesky": 4.0 } } Operations not listed in the override file default to 1.0. See Calibrate Weights for how to generate this file. Where weights come from Weights can be determined in two ways: Hardware performance counters (Linux perf stat) -- counts actual floating-point instructions retired by the CPU, weighted by SIMD width. This gives the true number of basic FP ops per high-level operation. Wall-clock time normalization -- measures time(op) / time(add) as a relative proxy. Less precise than hardware counters but works on any platform. The benchmarks/ package in this repository automates both methods. See Calibrate Weights. 
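To make the composition concrete, the sketch below parses a weights file in the documented format and applies `analytical_cost × flop_multiplier × weight(op_name)`. The helper names `load_weights` and `adjusted_cost` are hypothetical and only mirror the formula in the text; they are not flopscope functions, and the default of 1.0 for unlisted operations follows the override-file rule described above.

```python
import json

# Same structure as the documented override file, inlined for the example.
WEIGHTS_JSON = """
{
  "weights": {
    "add": 1.0,
    "exp": 16.0,
    "matmul": 1.0,
    "linalg.cholesky": 4.0
  }
}
"""


def load_weights(text: str) -> dict[str, float]:
    """Return the mapping stored under the top-level "weights" key."""
    return {name: float(w) for name, w in json.loads(text)["weights"].items()}


def adjusted_cost(
    analytical_cost: int,
    op_name: str,
    weights: dict[str, float],
    flop_multiplier: float = 1.0,
) -> float:
    # Operations missing from an override file default to weight 1.0.
    return analytical_cost * flop_multiplier * weights.get(op_name, 1.0)


weights = load_weights(WEIGHTS_JSON)
numel = 256 * 256

exp_cost = adjusted_cost(numel, "exp", weights)                          # weight 16
exp_doubled = adjusted_cost(numel, "exp", weights, flop_multiplier=2.0)  # both scale
tanh_cost = adjusted_cost(numel, "tanh", weights)                        # unlisted op
```

Because the three factors multiply, doubling `flop_multiplier` doubles every cost regardless of the operation's own weight.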
FLOP multiplier The flop_multiplier parameter in BudgetContext scales all costs: import flopscope as flops with flops.BudgetContext(flop_budget=10**6, flop_multiplier=2.0) as budget: # Every operation costs 2× its normal FLOP count ... This is useful for experimentation or adjusting the difficulty of a budget constraint. Note that flop_multiplier and per-operation weights are independent — flop_multiplier scales all operations uniformly, while weights scale each operation individually. Namespaces Use flops.namespace(...) to create nested namespace scopes inside an active BudgetContext. The namespace parameter on BudgetContext sets the root namespace prefix for operations in that context: import flopscope as flops import flopscope.numpy as fnp with flops.BudgetContext(flop_budget=10**6, namespace="predict") as budget: with flops.namespace("precompute"): stats = fnp.mean(x) with flops.namespace("fallback"): with flops.namespace("sampling"): sample = fnp.add(stats, 1) print(budget.summary(by_namespace=True)) Namespaces are hierarchical and exact per operation. flops.namespace("precompute") inside namespace="predict" records operations as predict.precompute, and nested scopes append dotted segments such as predict.fallback.sampling. Namespaces do not create child budgets or separate time limits. They only change attribution. budget.summary() and flops.budget_summary() stay flat by default; budget.summary(by_namespace=True) and flops.budget_summary(by_namespace=True) opt into a By namespace section. The structured forms, budget.summary_dict(by_namespace=True) and flops.budget_summary_dict(by_namespace=True), return namespace buckets keyed by the full namespace path or None for unlabeled operations. Namespace buckets include namespace-attributable FLOPs, calls, operations, flopscope_backend_time_s, and flopscope_overhead_time_s; context-level wall and residual wall time stay on the top-level summary. 
    data = budget.summary_dict(by_namespace=True)
    print(data["by_namespace"]["predict.fallback.sampling"]["flops_used"])

Related pages
- Operation Categories: which operations are free, counted, or unsupported
- Budget Planning: query costs before running
- Calibrate Weights: measure per-operation weights empirically

--- understanding/operation-categories --- URL: https://aicrowd.github.io/flopscope/docs/understanding/operation-categories

Operation Categories

Use this page to understand which operations cost FLOPs, which are free, and which are unsupported.

You will learn:
- How to identify free, counted, and blacklisted operations
- What cost formulas apply to each counted sub-category
- Which operations are blocked and why

Three categories

Every NumPy function falls into one of three categories in flopscope:

Free operations (0 FLOPs)

Operations that involve no arithmetic computation, just memory allocation, reshaping, or data movement. Examples: zeros, ones, full, eye, arange, linspace, empty, reshape, transpose, concatenate, stack, split, squeeze, expand_dims, ravel, take, where, copy, astype, asarray

Counted operations (cost > 0)

Operations that perform arithmetic. Cost is computed analytically from tensor shapes.
| Sub-category | Cost formula | Examples |
| --- | --- | --- |
| Unary | numel(output) | exp, log, sqrt, abs, sin, cos, tanh, ceil, floor |
| Binary | numel(output) | add, multiply, maximum, divide, power, subtract |
| Reduction | numel(input) | sum, mean, max, min, std, var, argmax, nansum |
| Einsum | product of all index dims | fnp.einsum(...) |
| Dot/Matmul | equivalent einsum | fnp.dot(A, B), A @ B |
| Linalg | per-operation formula | fnp.linalg.solve, fnp.linalg.eigh, fnp.linalg.cholesky |
| FFT | 5 N log N | fnp.fft.fft, fnp.fft.rfft, fnp.fft.fft2 |
| SVD | m × n × k | fnp.linalg.svd(A, k=10) |
| Sort/Search | n log n per slice | sort, argsort, unique, searchsorted |
| Random | numel(output) | fnp.random.randn, fnp.random.normal, fnp.random.uniform |
| Stats | flat per-element (varies) | flops.stats.norm.pdf, flops.stats.expon.cdf, flops.stats.cauchy.ppf |

When inputs are SymmetricTensor, many operations automatically get reduced costs. See Exploit Symmetry.

Blacklisted operations

Operations not relevant to numerical computation. Calling them raises an AttributeError. These are I/O, configuration, datetime, and display functions that have no meaningful FLOP cost.

    fnp.save(array, "file.npy")
    # AttributeError: flopscope does not support 'save' (blacklisted). Save array to .npy file. Not supported.. Did you mean: 'ravel'?

Blacklisted categories:
- I/O (save, load, loadtxt, savetxt, savez, genfromtxt)
- configuration (seterr, geterr, setbufsize)
- datetime (busday_count, is_busday)
- display (array2string, array_repr)
- functional (apply_along_axis, piecewise, frompyfunc)

See API Reference for the complete list.

Related pages
- API Reference: complete list of every operation and its category
- FLOP Counting Model: how costs are calculated
- Migrate from NumPy: what changes when moving from NumPy

--- understanding/symmetry-detection --- URL: https://aicrowd.github.io/flopscope/docs/understanding/symmetry-detection

Symmetry Detection Deep Dive

Contributor-level walkthrough of flopscope's symmetry detection algorithm.
You will learn: How the bipartite graph is constructed from einsum subscripts How the subset-keyed oracle detects and caches symmetry lazily How the sigma-loop derives label permutations from row permutations How Burnside's lemma counts unique elements for exact FLOP reduction TL;DR: flopscope detects when an einsum expression has symmetry that allows computing fewer FLOPs. It does this by building a bipartite graph from the einsum subscripts, finding column permutations that preserve the graph structure, and using group theory to count how many unique terms exist. The savings can be dramatic -- a symmetric matrix multiplication can cost half as many FLOPs. The problem A multi-operand einsum like 'ij,ai,bj->ab' is decomposed by opt_einsum into a sequence of pairwise contractions. At each step the optimizer must evaluate candidate pairs -- and it needs to know, for each candidate intermediate, whether the result is symmetric, so it can score it with a reduced cost. When operands are SymmetricTensors, their per-operand symmetry is known upfront. But there is a second source of symmetry: when the same Python object appears at multiple operand positions, the output can be symmetric in index labels contributed by those repeated operands -- even if the operands are dense. The naive approach is to rerun a detection procedure at every step for every candidate subset. This is too expensive for large contractions. We want: Correctness -- detect all exploitable symmetry without false positives. Memoization -- compute each intermediate's symmetry at most once. Laziness -- only evaluate subsets that the path optimizer actually visits. Subgraph symmetry detection achieves all three. The bipartite graph The core data structure is a bipartite graph over the einsum expression. Left vertices (U): One U-vertex per axis of each operand. For a dense operand with subscript "ai", each axis produces its own U-vertex (two total). 
For a SymmetricTensor with subscript "ij" and declared symmetry S_2{i,j}, both axes still produce separate U-vertices -- per-operand symmetry does not affect the graph topology. Instead, per-operand symmetry is handled entirely by the expanded sigma-loop (see below), which uses the declared symmetry generators as an additional source of row permutations. Right vertices (labels): One right vertex per unique index label. Labels are partitioned into: V (free labels): appear in the final output subscript or in operands outside the current subset (they "cross the cut"). W (summed labels): contracted entirely within the current subset. Incidence: An edge from U-vertex u to label c has weight equal to the multiplicity of c in the axes belonging to U-vertex u. Identical-operand groups: Operands that are the same Python object are grouped. These groups are the source of induced symmetry. Worked example Consider 'ij,ai,bj->ab' with operands T, A, B where T is a dense tensor: Subscripts: ij, ai, bj -> ab Operands: T A B U-vertices (one per axis): (T, 0) -- label set {i} (T, 1) -- label set {j} (A, 0) -- label set {a} (A, 1) -- label set {i} (B, 0) -- label set {b} (B, 1) -- label set {j} Free labels at the top level: {a, b} (appear in output ->ab). Summed labels at the top level: {i, j} (contracted out). No identical operands in this example -- T, A, and B are distinct Python objects. 
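The construction above can be reproduced with a few lines of plain Python. This is an illustrative sketch, not the oracle's actual data structure: it assumes single-character labels, an explicit `->` in the subscripts, and no ellipsis, and the function names are invented for the example. Operands are referred to by position (T = 0, A = 1, B = 2).

```python
def u_vertices(subscripts: str) -> list[tuple[tuple[int, int], str]]:
    """One U-vertex per (operand position, axis), paired with its label."""
    inputs, _ = subscripts.split("->")
    return [
        ((op, axis), label)
        for op, part in enumerate(inputs.split(","))
        for axis, label in enumerate(part)
    ]


def split_labels(subscripts: str, subset: set[int]) -> tuple[list[str], list[str]]:
    """Partition the labels touched by `subset` into V (free) and W (summed).

    A label is free if it appears in the output or in any operand outside
    the subset; otherwise it is contracted entirely within the subset.
    """
    inputs, output = subscripts.split("->")
    parts = inputs.split(",")
    inside = {c for i in subset for c in parts[i]}
    outside = {c for i, p in enumerate(parts) if i not in subset for c in p}
    v = sorted(c for c in inside if c in output or c in outside)
    w = sorted(c for c in inside if c not in output and c not in outside)
    return v, w


expr = "ij,ai,bj->ab"
vertices = u_vertices(expr)

top_v, top_w = split_labels(expr, {0, 1, 2})   # whole expression
sub_v, sub_w = split_labels(expr, {1, 2})      # intermediate built from A and B
```

At the top level the split is V = {a, b} and W = {i, j}; for the subset of positions 1 and 2, labels i and j also occur in operand 0, so they count as free and W is empty.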
Full bipartite graph

    U (axes)                           Labels
    -----------------                  ------
    V (free):
      (A, 0) ------------------------- a
      (B, 0) ------------------------- b
    W (summed):
      (T, 0) ----------+
                       +--------------- i
      (A, 1) ----------+
      (T, 1) ----------+
                       +--------------- j
      (B, 1) ----------+

Now consider the subset {A, B} (positions 1 and 2):

    U-vertices in subset: (A, 0), (A, 1), (B, 0), (B, 1)
    Labels in subset: {a, i, b, j}
    Labels outside subset (in T): {i, j}
    Crossing labels (in subset AND in outside): {i, j}
    V at this step = {a, b} + {i, j} = {a, b, i, j} (all four -- {i,j} cross the cut)
    W at this step = {} (nothing is summed entirely within {A, B})

Induced subgraph for subset {A, B}

When we restrict to subset {A, B}, labels i and j cross the cut (they also appear in T, outside the subset), so they move from W to V:

    U (subset {A, B} only)             Labels
    ----------------------             ------
    V (all free):
      (A, 0) ------------------------- a
      (A, 1) ------------------------- i
      (B, 0) ------------------------- b
      (B, 1) ------------------------- j
    W: (empty)

The incidence matrix M at this subset (rows = U-vertices, columns = V+W):

              a  i  b  j
    (A, 0):   1  0  0  0
    (A, 1):   0  1  0  0
    (B, 0):   0  0  1  0
    (B, 1):   0  0  0  1

The subset-keyed oracle

The key invariant is the pure-in-subset property: the symmetry of an intermediate tensor depends only on the set of original operands it was formed from, not on the order in which they were contracted. This is because:
1. The bipartite graph structure is fixed for the full einsum.
2. The induced subgraph on a subset S is fully determined by which operands are in S.
3. Symmetry is a property of the final intermediate, not its contraction history.

This property makes the subset key canonical. The oracle stores results in a dict[frozenset[int], SubsetSymmetry] and returns cached results on subsequent calls with the same subset.
from flopscope._opt_einsum._subgraph_symmetry import SubgraphSymmetryOracle # One oracle per contract_path call oracle = SubgraphSymmetryOracle( operands=list(operands), subscript_parts=input_parts, per_op_groups=perm_groups, output_chars=output_str, ) # Lazy evaluation -- only computed on first access per subset result = oracle.sym(frozenset({0, 1})) # SubsetSymmetry for intermediate from ops 0 and 1 result.output # V-side (output tensor) symmetry result.inner # W-side (inner summation) symmetry The detection algorithm Goal For a fixed subset S with incidence matrix M, we want the full group of automorphisms of the labelled bipartite graph -- pairs (sigma, pi) where sigma permutes identical-operand rows and pi permutes label columns, such that applying pi to the columns of sigma(M) recovers M: pi(sigma(M)) = M Every such pi is a symmetry of the intermediate tensor built from S. Restricted to V labels it contributes to the output (V-side) symmetry; restricted to W labels it contributes to the inner (W-side) symmetry. The V/W partition is part of the labelled structure, so legitimate automorphisms must preserve it -- pi(V) is a subset of V and pi(W) is a subset of W -- and any pi with a cycle crossing V to W is rejected. Column fingerprints For each label c, compute its column fingerprint col(c) -- the tuple of incidence values down the rows of M. Labels with identical fingerprints are candidates for symmetry equivalence. The fingerprint-to-label mapping is used by the sigma-loop to derive pi in O(1) per label via hash lookup. Earlier versions had a standalone fast path that detected S_k whenever labels shared a fingerprint, without running the sigma-loop. This was incorrect for non-S_k groups (see the C3 bug note below) and has been removed. Fingerprints are now used only for pi derivation inside the sigma-loop -- they are not a standalone detection mechanism. 
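Column fingerprints and the hash lookup they enable can be illustrated on einsum('ia,ib->ab', X, X), where the same operand appears twice. The sketch below is a simplified stand-in for the real oracle, with invented helper names: it builds a 0/1 incidence matrix, swaps the rows of the two identical operands by hand, and reads off the induced label permutation by matching fingerprints. It assumes no fingerprint collisions, so the lex-first tie-breaking rule is not exercised.

```python
def incidence(parts: list[str], labels: list[str]) -> list[list[int]]:
    """Rows are (operand, axis) U-vertices in order; columns are labels."""
    return [
        [1 if axis_label == c else 0 for c in labels]
        for part in parts
        for axis_label in part
    ]


def fingerprints(matrix: list[list[int]], labels: list[str]) -> dict[str, tuple[int, ...]]:
    """col(c): the tuple of incidence values down the rows for label c."""
    return {c: tuple(row[j] for row in matrix) for j, c in enumerate(labels)}


parts = ["ia", "ib"]            # subscripts of the two copies of X
labels = ["i", "a", "b"]
M = incidence(parts, labels)
col = fingerprints(M, labels)

# Row permutation for swapping the identical operands:
# rows 0-1 (first copy) exchange places with rows 2-3 (second copy).
sigma_M = [M[2], M[3], M[0], M[1]]
sigma_col = fingerprints(sigma_M, labels)

# pi(l) is the label whose original column equals sigma(M)'s column for l.
lookup = {fp: c for c, fp in col.items()}
pi = {c: lookup[sigma_col[c]] for c in labels}

free, summed = {"a", "b"}, {"i"}
preserves_partition = all((c in free) == (pi[c] in free) for c in labels)
```

Here the swap fixes the summed label i and exchanges the free labels a and b, which corresponds to the output being symmetric in its two indices.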
Sigma loop: derive pi from generators The sigma-loop iterates over generators of the row-permutation group on M, drawn from three sources: Source A -- per-operand internal symmetry generators. For each operand that carries a declared SymmetryGroup, each generator of that group is lifted to a row permutation on M (permuting only the rows belonging to that operand). This captures symmetry that was previously handled by orbit-based axis merging. Source B -- identical-operand swap generators. For each group of k identical operands (same Python object), the k-1 adjacent transpositions (op_i, op_{i+1}) are used as generators. Each such swap is lifted to a row permutation that exchanges the rows of the two operands. Source C -- coordinated axis relabeling. When identical operands share the same subscript pattern (e.g. both have subscript ij), permuting axes uniformly across all copies is equivalent to relabeling dummy indices. Adjacent transpositions on W-only (summed) axes are generated, applied to every copy simultaneously. This is restricted to W-side labels because relabeling free (output) labels would change the output tensor. All generators are collected and passed to Dimino's algorithm to build the full row-permutation group. The sigma-loop then iterates over all elements of this group (not just the generators), deriving pi for each. For each group element sigma: Lift sigma to a row permutation on M. Compute sigma(M)'s column fingerprints: sigma * col(c) for each label c. Derive the induced label permutation pi directly. For each label l, pi(l) is the label whose M-column matches sigma(M)'s column for l -- a hash-table lookup in O(1). When multiple labels share a fingerprint (collision), pick the lex-first unused candidate. If any label has no match, reject this sigma. Validate pi: pi(V) is a subset of V and pi(W) is a subset of W. Any cycle crossing V to W invalidates the sigma. Collect pi as a generator literal restricted to V labels (and separately to W labels). 
Non-identity generators become part of the detected SymmetryGroup. The sigma-loop collects all non-identity pi restrictions as generator literals. These generators are passed to Dimino's algorithm to close the group and build the exact symmetry group on V (and separately on W). Interactive explorer Walk through each detection step interactively with the Symmetry Explorer — choose a preset example or define your own einsum expression. Worked examples Click to expand each example. Block symmetry: einsum('ab,cd->abcd', X, X)Per-index symmetry: einsum('ia,ib->ab', X, X)Cautionary note: the C3 axis-merging bug V-side and W-side V-side groups are symmetries of the output tensor -- they reduce the number of unique output elements that need to be computed. W-side groups are symmetries among the contracted (summed) labels -- they reduce the number of unique summation terms. Both contribute multiplicatively to the cost reduction: cost = dense_cost * (unique_output / total_output) * (unique_inner / total_inner) The output (V-side) reduction is always applied when the step's intermediate has a non-trivial permutation group on its free indices. The inner (W-side) reduction is applied only when all labels in the detected inner group are present as contracted indices in that specific pairwise step. If any of those labels were contracted at an earlier step and no longer appear, the inner reduction is skipped. In the contraction path table, [W checkmark: ...] indicates the inner reduction was applied, while [W: ...] indicates it was detected but not applied. Inner symmetry can be toggled with flops.configure(use_inner_symmetry=False). Exact group detection and Burnside counting The sigma-loop collects all valid pi permutations as generator literals and builds a SymmetryGroup directly. When the generated group equals S_k (the full symmetric group, checked via order == k!), the existing stars-and-bars formula C(n+k-1, k) applies. 
When the group is a proper subgroup (e.g., C_3 from einsum('ij,jk,ki->', A, A, A)), Burnside's lemma gives the exact unique element count. Worked example: tr(A³) Complexity bound The oracle evaluates each subset at most once. For a contract with N operands and groups of sizes k_1, k_2, ...: Generator collection: Source A contributes O(rank) generators per operand with declared symmetry. Source B contributes k-1 generators per identical group. Source C contributes at most rank-1 generators per identical group with matching subscripts. Total generators: g = O(N * rank). Group enumeration: Dimino's algorithm builds the full row-permutation group from the generators in O(|G| * g) compositions. Pi derivation: For each of the |G| group elements, deriving pi costs O(n_labels) via fingerprint hash lookup. Per-subset total: O(|G| * (g + n_labels)). Number of subsets visited: at most 2^N (usually much less -- path algorithms visit only O(N^2) subsets in practice). For the common case of a single pair of identical operands (|G| = 2, g = 1): per-subset cost is O(n_labels). Related pages Symmetry Savings -- practical guide to using symmetry FLOP Counting Model -- how costs are calculated Operation CategoriesPrevious PageAPI ReferenceBrowse Flopscope by namespace, then use the operation cost index when you need a dense cost-oriented lookup. 
============================================================ API Reference ============================================================ --- api --- URL: https://aicrowd.github.io/flopscope/docs/api API ReferenceAPI ReferenceBrowse Flopscope by namespace, then use the operation cost index when you need a dense cost-oriented lookup.Browse the public APIStart with a namespace chapter, then drop into canonical per-symbol pages that mirror the public import path.flopscopeFlopscope primitives39 entriesBudgets, symmetry helpers, public objects, and configuration primitives that sit outside the counted NumPy surface.flops.stats.cauchy.cdfflops.stats.cauchy.pdfflops.stats.cauchy.ppfflopscope.numpyNumPy array routines586 entriesThe counted NumPy-shaped surface, including array construction, linear algebra, FFT, and random sampling.flopscope.numpy.absflopscope.numpy.absoluteflopscope.numpy.acosflopscope.statsStatistics32 entriesDistribution objects and their methods for PDFs, CDFs, and inverse CDFs.flopscope.stats.cauchyflopscope.stats.cauchy.cdfflopscope.stats.cauchy.pdfflopscope.accountingAccounting48 entriesAnalytical FLOP estimators and planning helpers for reasoning about cost before execution.flopscope.accounting.bartlett_costflopscope.accounting.blackman_costflopscope.accounting.cholesky_costCommon entry pointsHigh-signal entry points that cover the counted NumPy surface, top-level helpers, distributions, and analytical estimators.numpyflopscope.numpy.einsumEvaluates the Einstein summation convention on the operands.flopscopeflopscope.BudgetContextContext manager for FLOP budget enforcement.flopscopeflopscope.symmetrizeProject an array onto the invariant subspace of a permutation group.statsflopscope.stats.normNormal (Gaussian) continuous random variable.accountingflopscope.accounting.einsum_costWeighted FLOP cost of an einsum operation.numpyflopscope.numpy.random.symmetricSample random data and project it to a symmetry group.Symmetry Detection Deep DivePrevious 
--- api/for-agents --- URL: https://aicrowd.github.io/flopscope/docs/api/for-agents

For AI Agents

Use this page if you are an AI coding assistant (or building one) that needs to generate flopscope code correctly.

You will learn:
- How to orient yourself with llms.txt, ops.json, per-op JSON, and the cheat sheet
- The five rules for generating correct flopscope code
- How to avoid common mistakes AI agents make with flopscope

This page is for AI coding assistants (Claude, Cursor, Copilot, etc.) helping users write flopscope code. It explains what resources are available, how to access them, and the key things you must know before generating code.

Quick orientation

flopscope is NOT NumPy. It wraps a subset of NumPy with analytical FLOP counting. Every arithmetic operation is charged against a budget. Code that works with NumPy may fail or behave differently with flopscope:
- All counted operations require an active BudgetContext
- 35 operations are blocked entirely (I/O, config, state)
- sort, argsort, trace, random.* sampling ops are now counted (not free)
- Costs are analytical (from tensor shapes), not measured at runtime

Machine-readable resources

| Resource | Format | Use case |
| --- | --- | --- |
| llms.txt | Markdown | Start here. Curated index of all doc pages with one-line descriptions. Under 4K tokens. |
| llms-full.txt | Markdown | Complete docs in one file. Use if your context window is large enough (~115KB). |
| ops.json | JSON | Slim machine-readable index of all 508 operations. Query programmatically for names, filters, and detail URLs. |
| /api-data/ops/.json | JSON | Full standalone payload for one operation, including summary, signature, notes, and structured doc sections. |
| API Reference | Markdown | Dense reference of every operation's cost and full operation inventory. |
How to use llms.txt If you're an agent encountering flopscope for the first time: Fetch llms.txt — this gives you the doc map in ~300 words Identify which page answers your question from the section descriptions Fetch that specific page URL patterns: llms.txt links to .md variants of each page (raw markdown for agents). Every page is also available as rendered HTML — just drop the trailing /index.md from the URL: Agent URL (raw markdown)Human URL (rendered HTML).../getting-started/installation/index.md.../getting-started/installation/.../guides/einsum/index.md.../guides/einsum/.../api/index.md.../api/ If you have a large context window, fetch llms-full.txt instead to get everything in one request. How to use ops.json ops.json contains a JSON object with an operations array. Each entry is a slim index row with the information needed to search, filter, and find the full payload for one operation: { "name": "einsum", "slug": "einsum", "detail_href": "/docs/api/einsum/", "detail_json_href": "/api-data/ops/einsum.json", "module": "numpy", "flopscope_ref": "fnp.einsum", "numpy_ref": "np.einsum", "category": "counted_custom", "cost_formula": "product of all index dims (FMA=1)", "cost_formula_latex": "$\\prod_i d_i$", "free": false, "blocked": false, "status": "supported", "summary": "Evaluate Einstein summation with FLOP counting and optional path optimization.", "notes": "Supports SymmetricTensor inputs and repeated-operand detection for automatic cost reduction" } Use this to: Check if an operation is supported: filter by "blocked": false Get the cost formula for a specific operation: look up by name List all free operations: filter by "free": true Map between NumPy and flopscope calls: use numpy_ref and flopscope_ref Jump to the standalone page or full payload: use detail_href and detail_json_href How to use per-op JSON Each operation also has a full standalone JSON payload at: /api-data/ops/.json For example: /api-data/ops/einsum.json These payloads include: top-level 
metadata (slug, detail_href, detail_json_href), source links (source.flopscope, source.numpy), operation metadata (op.*), and normalized doc sections under docs.sections[].

Fetch the per-op JSON when you need the full signature, summary, notes, parameter list, returns, or examples for a single function without loading the entire reference surface.

Six rules for generating flopscope code

1. A global default budget is active automatically; use BudgetContext for control.

A global default budget activates automatically, so quick scripts work without any setup. For precise budget control and namespacing, use an explicit BudgetContext. Both forms are valid:

    import flopscope as flops
    import flopscope.numpy as fnp

    # Example inputs so the snippet runs on its own
    A = fnp.random.randn(64, 64)
    B = fnp.random.randn(64, 64)
    W = fnp.random.randn(64, 64)

    # Quick work: the global default handles budget tracking automatically
    result = fnp.einsum('ij,jk->ik', A, B)

    # Recommended for budget control and namespacing
    with flops.BudgetContext(flop_budget=10**8) as budget:
        result = fnp.einsum('ij,jk->ik', A, B)

    # Decorator form for functions
    @flops.BudgetContext(flop_budget=10**8)
    def my_forward_pass(x):
        return fnp.einsum('ij,j->i', W, x)

2. Know what's free and what's counted.

- Free (0 FLOPs): zeros, ones, reshape, transpose, copy, random.seed, random.get_state, random.set_state, random.default_rng.
- Custom cost (numel FLOPs): array, linspace, arange, concatenate, where. These are NOT free: each charges numel(output) FLOPs against the budget.
- Counted: einsum, dot, matmul, exp, log, add, multiply, sum, mean, all linalg.*, all fft.*, sort, argsort, trace, unique, set ops (in1d, isin, etc.), histogram, random.* sampling.
- Blocked: save, load, geterr, seterr, and 28 others. These raise AttributeError.

When in doubt, check ops.json, the relevant /api-data/ops/.json, or the API Reference.

3. Use flops.accounting.* to estimate costs before running.
    cost = flops.accounting.einsum_cost('ij,jk->ik', shapes=[(256, 256), (256, 256)])
    cost = flops.accounting.svd_cost(m=256, n=256, k=10)

These are pure functions; no BudgetContext is needed.

4. Use fnp.einsum as the primary computation primitive.

Most linear algebra can be expressed as einsum. The cost is simply the product of all index dimensions: each FMA (fused multiply-add) counts as 1 operation. 'ij,jk->ik' with shapes (m, k) and (k, n) costs m * k * n FLOPs.

5. Use wall_time_limit_s for time-limited execution.

In competition evaluation, submissions run under both a FLOP budget and a wall-clock time limit. Test locally with:

    with flops.BudgetContext(flop_budget=10**9, wall_time_limit_s=5.0) as budget:
        # your code: must complete within 5 seconds
        ...

If the time limit is exceeded, TimeExhaustedError is raised at the next operation boundary. The error includes the operation name, elapsed time, and configured limit for diagnostics.

6. Exploit symmetry for cost savings.

- Use symmetric_axes for symmetric outputs: fnp.einsum('ki,kj->ij', X, X, symmetric_axes=[(0, 1)])
- Wrap known-symmetric matrices with flops.as_symmetric(data, symmetric_axes=(0, 1)) for automatic savings in downstream ops

Common mistakes agents make

| Mistake | What happens | Fix |
| --- | --- | --- |
| Using np.einsum instead of fnp.einsum | FLOPs not counted, budget not checked | Always use fnp.* for counted NumPy-like operations |
| Skipping BudgetContext entirely | No error (global default handles it), but budget is harder to track and namespace | Use an explicit BudgetContext for any work you want to measure or label |
| Assuming array, linspace, concatenate, where are free | Underestimates budget usage: each charges numel(output) FLOPs | These are custom-cost ops, not free; check the cheat sheet |
| Assuming sort is free | Underestimates budget usage | sort costs n*ceil(log2(n)) per slice; check the cheat sheet |
| Using fnp.save() or fnp.load() | AttributeError (blocked) | Use numpy directly for I/O |
| Nesting two explicit BudgetContext blocks | RuntimeError | Use a single explicit context; nesting with the global default is fine |
| Ignoring wall_time_limit_s in testing | TimeExhaustedError in competition | Test with a time limit locally to catch slow code early |

Related pages

- API Reference: full operation inventory and cost reference
- Exploit Symmetry: detailed symmetry guide

============================================================ Infrastructure ============================================================

--- infrastructure/client-server ---
URL: https://aicrowd.github.io/flopscope/docs/infrastructure/client-server

Client-Server Model

This page covers the client-server architecture used for competition evaluation, where participant code runs in an isolated container. For how Flopscope wraps NumPy internally, see How Flopscope Works. Use this page to understand how Flopscope's client-server architecture works and why it exists.

You will learn:

- Why Flopscope uses a client-server model for competition evaluation
- How arrays, operations, and budgets flow between client and server
- How to choose between the local library and client-server packages

Why client-server?

In competition evaluation, participant code runs in an isolated container that cannot import NumPy directly. This prevents participants from bypassing FLOP counting by calling NumPy functions outside flopscope. The client-server model enforces this isolation.

How it works

- Server runs the real flopscope library backed by NumPy. It stores all arrays, enforces budgets, and counts FLOPs.
- Client exposes the same public imports (import flopscope as flops plus import flopscope.numpy as fnp) and proxies every operation to the server over ZMQ (msgpack-encoded messages).
- Arrays stay on the server. The client holds lightweight RemoteArray handles that reference server-side data.
  When you call fnp.einsum(...), the client sends the operation and handle IDs to the server, which executes it and returns a new handle.
- Budget enforcement happens server-side. The client cannot manipulate FLOP counts.

Communication protocol

- Transport: ZMQ (REQ/REP pattern)
- Serialization: msgpack with binary-safe array payloads
- Default endpoint: ipc:///tmp/flopscope.sock (configurable via FLOPSCOPE_SERVER_URL)
- Timeout: 30 seconds per request

API compatibility

Code written for the local library works unchanged with the client:

    # This code works with BOTH the local library and the client
    import flopscope as flops
    import flopscope.numpy as fnp

    with flops.BudgetContext(flop_budget=10**6) as budget:
        x = fnp.zeros((256,))
        W = fnp.random.randn(256, 256)
        h = fnp.einsum('ij,j->i', W, x)
        print(budget.summary())

When to use which

| Use case | Package | Install path |
| --- | --- | --- |
| Development, testing, research | flopscope (local library) | uv add git+... or uv sync from repo |
| Competition evaluation, sandboxed environments | flopscope-client + flopscope-server | Docker containers |

Three packages in this repo

| Package | Location | Description |
| --- | --- | --- |
| flopscope | src/flopscope/ | Local library: full NumPy backend, direct execution |
| flopscope-client | flopscope-client/ | Client proxy: no NumPy dependency, forwards ops to server |
| flopscope-server | flopscope-server/ | Server: runs real flopscope, manages sessions and arrays |

Related pages

- Running with Docker: set up client-server locally
- Contributor Guide: source-checkout commands for local development
- Quickstart: getting started with the local library

--- infrastructure/docker ---
URL: https://aicrowd.github.io/flopscope/docs/infrastructure/docker

Running with Docker

Use this page to run the client-server model locally, either with Docker Compose or manually.
You will learn:

- How to start the client-server setup with Docker Compose
- How to run client and server manually without Docker
- How to configure IPC and TCP transports

Prerequisites

- Client-Server Model: understand why the architecture exists
- Docker and Docker Compose installed

With Docker Compose

The docker/ directory contains a ready-to-use setup:

    cd docker
    docker compose up --build

This starts two containers:

| Service | Image | Role |
| --- | --- | --- |
| backend | Dockerfile.server | Runs flopscope server, listens on IPC socket |
| participant | Dockerfile.participant-hardened | Runs participant code with flopscope-client only |

The containers share an IPC socket volume for communication.

Without Docker

From a source checkout, start both processes from the repository root so the server can import the local src/flopscope package:

    # Terminal 1: Start the server
    PYTHONPATH=src:flopscope-server/src \
      uv run --with pyzmq --with msgpack \
      python -m flopscope_server --url ipc:///tmp/flopscope.sock

    # Terminal 2: Run client code
    export FLOPSCOPE_SERVER_URL=ipc:///tmp/flopscope.sock
    PYTHONPATH=flopscope-client/src \
      uv run --with pyzmq --with msgpack python your_script.py

For TCP (e.g., across machines):

    # Server
    PYTHONPATH=src:flopscope-server/src \
      uv run --with pyzmq --with msgpack \
      python -m flopscope_server --url tcp://0.0.0.0:15555

    # Client
    export FLOPSCOPE_SERVER_URL=tcp://server-host:15555
    PYTHONPATH=flopscope-client/src \
      uv run --with pyzmq --with msgpack python your_script.py

If you already have flopscope-client and flopscope-server installed into separate environments, the shorter cd ... && uv run ... workflow also works. The commands above are the reproducible source-checkout path.

Time limit enforcement

Submissions run under two layers of time enforcement:

- In-library (cooperative): BudgetContext(wall_time_limit_s=N) checks the deadline before and after each numpy call. When exceeded, it raises TimeExhaustedError with diagnostic info (which operation, elapsed time, configured limit).
  This is a UX feature: it gives participants a clean error message.
- Container-level (hard): The Docker container enforces a kernel-level time limit via cgroups/rlimit. If the in-library check doesn't catch the overshoot (e.g., a single very long numpy call), the container delivers SIGKILL. This is the ultimate backstop: no Python code can escape it.

Signal-based preemption (SIGALRM) is deliberately not used because Python signal handlers cannot interrupt C extensions (numpy/LAPACK/BLAS), making them ineffective for exactly the operations where time limits matter most.

Common pitfalls

Symptom: Connection refused or timeout
Fix: Ensure the server is running before starting the client. Check that FLOPSCOPE_SERVER_URL matches the server's --url argument.

Symptom: Port conflict
Fix: Change the port in both the server --url and client FLOPSCOPE_SERVER_URL.

Related pages

- Client-Server Model: architecture overview
- Contributor Guide: local repo workflows

============================================================ Development ============================================================

--- development/contributing ---
URL: https://aicrowd.github.io/flopscope/docs/development/contributing

Contributor Guide

Use this page when you are working on the flopscope repository itself rather than only consuming the published API.
You will learn:

- How the repository is organized across three packages
- How to set up your development environment and run tests
- How to work with client, server, and Docker workflows
- How auto-generated documentation is maintained

Repository layout

This repository contains three Python packages plus docs and Docker assets:

| Path | Purpose |
| --- | --- |
| src/flopscope/ | Core library backed by NumPy |
| flopscope-client/src/flopscope/ | Client proxy used in sandboxed participant environments |
| flopscope-server/src/flopscope_server/ | ZMQ server that executes the real library |
| tests/ | Core library test suite |
| flopscope-client/tests/ | Client unit, integration, and adversarial tests |
| flopscope-server/tests/ | Server unit tests |
| website/content/docs/ | Docs source for the published site |
| website/public/ops.json | Generated slim API operation index consumed by /docs/api |
| website/public/api-data/ops/*.json | Generated per-operation detail payloads for canonical operation pages |
| website/.generated/public-api-routes.json | Generated canonical route manifest for /docs/api/... pages |
| website/.generated/op-doc-imports.ts | Generated static import map for operation docs |
| website/.generated/symbol-doc-imports.ts | Generated static import map for public helper and object docs |
| website/.generated/public-api-symbols.json | Generated manifest of non-registry public API pages |
| scripts/generate_api_docs.py | Regenerates API route manifests, per-operation payloads, and public symbol docs |
| docker/ | Local client-server and hardened evaluation images |

Initial setup

For normal work on the core package, docs, and root test suite:

    git clone https://github.com/AIcrowd/flopscope.git
    cd flopscope
    make install

make install runs uv sync --all-extras and configures the local git hooks.

Which environment to use

The root environment covers the core package, linting, docs, and the main test suite. The client and server each also have their own pyproject.toml.
One important caveat: flopscope-server depends on the local flopscope package, which is not resolved from a package index in a fresh source checkout. For server development, run commands from the repository root with PYTHONPATH=src:flopscope-server/src instead of relying on cd flopscope-server && uv run .... Common commands Core library make lint make test make test-numpy-compat make docs-build make docs-serve make ci If you prefer direct uv commands: uv run pytest uv run mkdocs serve When running the local docs site and you want flopscope error messages to link to your local copy instead of the hosted site, set: export FLOPSCOPE_DOCS_ROOT=http://localhost:3000/docs If FLOPSCOPE_DOCS_ROOT is unset, flopscope falls back to the hosted docs at https://aicrowd.github.io/flopscope/docs. Client package The client package is independently installable, so its test suite can run via its own project file: uv run --project flopscope-client pytest flopscope-client/tests Client integration and adversarial tests start a real server subprocess using the repository root .venv/bin/python, so run make install first. 
Server package Run server tests from the repository root so the local core package is on PYTHONPATH: PYTHONPATH=src:flopscope-server/src \ uv run --with pyzmq --with msgpack pytest flopscope-server/tests To launch the server manually from a source checkout: PYTHONPATH=src:flopscope-server/src \ uv run --with pyzmq --with msgpack \ python -m flopscope_server --url ipc:///tmp/flopscope.sock Running client and server together without Docker From a source checkout, use repo-root commands so both packages resolve correctly: # Terminal 1 PYTHONPATH=src:flopscope-server/src \ uv run --with pyzmq --with msgpack \ python -m flopscope_server --url ipc:///tmp/flopscope.sock # Terminal 2 export FLOPSCOPE_SERVER_URL=ipc:///tmp/flopscope.sock PYTHONPATH=flopscope-client/src \ uv run --with pyzmq --with msgpack python your_script.py See Running with Docker if you want the same split using containers. Generated documentation Do not hand-edit website/public/ops.json, website/public/api-data/ops/*.json, website/.generated/public-api-routes.json, website/.generated/op-doc-imports.ts, website/.generated/symbol-doc-imports.ts, or website/.generated/public-api-symbols.json. The interactive API reference, canonical API pages, and legacy redirect routes consume those generated artifacts directly. Instead, update scripts/generate_api_docs.py, the relevant source docstrings, or the operation registry, then regenerate and verify: uv run python scripts/generate_api_docs.py uv run python scripts/generate_api_docs.py --verify NumPy Compatibility Testing flopscope's goal is NumPy API compatibility on the counted surface: import flopscope.numpy as np should work for supported functions. To verify this, we run NumPy's own test suite against flopscope. How it works A pytest conftest at tests/numpy_compat/conftest.py monkeypatches numpy functions with their flopscope equivalents at session start. 
When we point pytest at NumPy's installed test files using --pyargs, every test that calls np.sum(...), np.mean(...), etc. actually calls flopscope's version.

    NumPy test file            conftest.py                flopscope
    calls np.sum(x)   ──────>  np.sum = fnp.sum  ──────>  fnp.sum(x)
    asserts result             (monkeypatch)              (FLOP-counted)

Avoiding infinite recursion

flopscope functions internally call numpy (for example, fnp.dot eventually delegates to _np.dot inside the implementation modules). Since _np is the numpy module, patching numpy.dot = fnp.dot without isolating those backend references would cause infinite recursion:

    fnp.dot → _np.dot → numpy.dot → fnp.dot → ...

We solve this by freezing numpy before patching: the conftest creates a snapshot of the numpy module (and its submodules like numpy.linalg, numpy.fft), then rebinds every flopscope module's _np reference to the frozen copy. Now flopscope's internal calls go to the original numpy functions, while the test suite sees flopscope's versions.

    # Simplified flow in conftest.py:
    frozen_np = freeze_numpy()        # snapshot of original numpy
    rebind_flopscope_np(frozen_np)    # flopscope internals → frozen copy
    patch_numpy()                     # np.sum = fnp.sum, etc.

    # Now: test calls np.sum → fnp.sum → frozen_np.sum (original) ✓

What gets patched

Of flopscope's 508 registered functions, most non-ufunc functions are patched onto numpy during testing. The only categories skipped:

| Category | Count | Why skipped |
| --- | --- | --- |
| Ufuncs | 101 | flopscope functions are plain callables, not ufuncs -- they lack .reduce, .accumulate, .outer, .nargs. Tests check these attributes at collection time. |
| Blacklisted | 32 | Intentionally unsupported |
| linalg.outer | 1 | fnp.linalg.outer delegates to np.outer (not np.linalg.outer), which has different validation behavior |

Everything else -- free ops, counted custom ops (dot, einsum, etc.), submodule functions (linalg, fft), reductions, and special functions -- is patched.
Test suites

We run 7 NumPy test modules covering core math, ufuncs, numerics, linear algebra, FFT, polynomials, and random:

| Suite | Module | Passed | xfailed |
| --- | --- | --- | --- |
| Core math | numpy._core.tests.test_umath | 4,668 | 13 |
| Ufunc infrastructure | numpy._core.tests.test_ufunc | 795 | 7 |
| Numeric operations | numpy._core.tests.test_numeric | 1,560 | 20 |
| Linear algebra | numpy.linalg.tests.test_linalg | 48 | 255 |
| FFT | numpy.fft.tests.test_pocketfft | 114 | 34 |
| Polynomials | numpy.polynomial.tests.test_polynomial | 36 | 2 |
| Random | numpy.random.tests.test_random | 142 | 0 |
| Total | | 7,363 | 331 |

All failures are tracked as xfails in tests/numpy_compat/xfails.py.

Running the tests

Tests use pytest-xdist for parallel execution across all CPU cores.

    # Run everything (recommended)
    make test-numpy-compat

    # Run a single suite
    uv run pytest tests/numpy_compat/ --pyargs numpy._core.tests.test_umath -n auto -q

    # Filter to specific functions
    uv run pytest tests/numpy_compat/ --pyargs numpy._core.tests.test_umath -k "sqrt" -n auto -v

    # Run without parallelism (for debugging)
    uv run pytest tests/numpy_compat/ --pyargs numpy._core.tests.test_umath -v --tb=short

The numpy_compat tests are excluded from the default pytest run (via pyproject.toml addopts) to prevent the monkeypatch from contaminating the main test suite. They run as a separate step in CI.

Known divergences (xfails)

Tests that fail due to known, accepted differences are tracked in tests/numpy_compat/xfails.py.
Each entry maps a test pattern to a categorized reason:

| Category | Meaning | Examples |
| --- | --- | --- |
| NOT_IMPLEMENTED | Function exists but lacks a kwarg or edge case | Missing out=, where=, subok= kwargs |
| UNSUPPORTED_DTYPE | flopscope doesn't support this dtype | timedelta, object arrays |
| UFUNC_INTERNALS | Test relies on ufunc protocol | .reduce, __array_ufunc__ |
| BUDGET_SIDE_EFFECT | Test assumes no global state changes | Budget deduction during assertions |
| NUMPY_INTERNAL | Test uses numpy internals | _umath_tests, internal type tables |

The linalg suite has the most xfails (255) because flopscope's linalg wrappers don't support stacked/batched arrays, 0-size arrays, or some advanced kwargs that numpy's linalg tests exercise extensively.

Triaging new failures

1. Run a suite: uv run pytest tests/numpy_compat/ --pyargs -n auto --tb=line
2. Categorize each failure
3. If it's a bug we should fix, create an issue
4. If it's an accepted divergence, add it to xfails.py

Why monkeypatching (not subclassing)

We considered alternatives:

- Array subclass with __array_ufunc__: Would intercept ufunc calls, but flopscope arrays are plain numpy.ndarray by design -- no custom tensor class.
- Running tests with import flopscope as np: NumPy's test files import from numpy._core, numpy.testing, etc. -- can't redirect all internal imports.
- Monkeypatching with frozen numpy: Simple, works with NumPy's existing test infrastructure, tests exactly what users experience (same function signatures), and the frozen-numpy trick prevents infinite recursion.

Related pages

- Running with Docker: containerized client-server setup
- Client-Server Model: architecture overview

--- development/calibration ---
URL: https://aicrowd.github.io/flopscope/docs/development/calibration

Calibration & Empirical Weights

How analytical FLOPs map to real hardware via per-operation weights.
You will learn: What per-operation weights are and why they matter How to run calibration and produce a weights config How to load and use weights in your code How the measurement methodology works How to interpret weight values What are weights? flopscope's analytical cost formulas treat all operations within a category equally -- exp, log, sin, and abs all cost numel(output) FLOPs for a pointwise unary operation. In reality, exp decomposes into a minimax polynomial approximation requiring approximately 14 floating-point instructions per element, while abs is a single bit manipulation. Per-operation weights correct for this. Each weight is a multiplicative constant applied on top of the analytical formula: effective_cost = analytical_formula(shape) * weight(op_name) A weight of 16.0 for exp means each analytical FLOP of exp is calibrated as approximately 16 times more expensive than one analytical FLOP of add under the chosen measurement mode. Known analytical zero-FLOP operations are stored with weight 0.0 in the official artifacts so the generated docs and packaged defaults surface them as truly free. In normal use, flopscope loads packaged official weights automatically on import. Set FLOPSCOPE_DISABLE_WEIGHTS=1 if you want the pure analytical unit-cost model instead. Quick start Run the benchmark suite to produce a JSON config: python -m benchmarks.runner \ --dtype float64 \ --output weights.json \ --html report.html \ --repeats 5 This benchmarks all 291 operations across 14 categories and writes: weights.json -- the rich weights config and metadata source for flopscope report.html -- a human-readable HTML dashboard The generated weights.json is the source-of-truth artifact: it contains the per-operation weights plus metadata used by the generated docs and API data. The packaged runtime file is a slim derivative used only for default loading. 
To benchmark only certain categories: python -m benchmarks.runner \ --dtype float64 \ --output weights.json \ --category pointwise \ --category linalg Available categories: pointwise, reductions, linalg, linalg_delegates, fft, sorting, random, polynomial, contractions, misc, window, bitwise, complex. Using weights Default, override, and disable behavior flopscope ships with packaged official weights and loads them automatically on a normal import. Set FLOPSCOPE_WEIGHTS_FILE to override those packaged defaults at import time: export FLOPSCOPE_WEIGHTS_FILE=weights.json python your_code.py Set FLOPSCOPE_DISABLE_WEIGHTS=1 to disable weighting entirely and use unit weights: export FLOPSCOPE_DISABLE_WEIGHTS=1 python your_code.py The JSON file must have a "weights" key mapping operation names to floats: { "weights": { "reshape": 0.0, "add": 1.0, "exp": 16.0, "sin": 16.0, "matmul": 1.0, "linalg.cholesky": 4.0 } } Operations not listed in the override file default to 1.0. A partial config (e.g., pointwise-only) works without error. Programmatic loading from flopscope._weights import load_weights, reset_weights, get_weight # Load from a specific file load_weights("/path/to/weights.json") # Or re-load the packaged official defaults explicitly load_weights() # Check a weight print(get_weight("exp")) # e.g. 16.0 after calibration print(get_weight("add")) # 1.0 print(get_weight("foo")) # 1.0 (unknown ops default to 1.0) # Clear active overrides in the current process reset_weights() How weights are applied Weights are applied centrally in BudgetContext.deduct(). Every counted operation passes its op_name to deduct(), which looks up the weight and multiplies it into the cost: adjusted_cost = analytical_cost * flop_multiplier * weight(op_name) Weights compose with flop_multiplier and with symmetry reductions -- symmetry reduces the element count, the weight scales the per-element cost, and both apply independently. 
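As a worked illustration of the formula above, the following sketch applies the weight lookup and multiplication by hand. It is not flopscope's implementation: `adjusted_cost` is a hypothetical helper written for this example, and the weight values are the illustrative ones from the sample JSON config, not measured results.

```python
import json

# Same shape as the override file described above: a "weights" key mapping op names to floats.
config = json.loads('{"weights": {"reshape": 0.0, "add": 1.0, "exp": 16.0}}')
weights = config["weights"]


def adjusted_cost(analytical_cost, op_name, weights, flop_multiplier=1.0):
    """Hypothetical helper: analytical_cost * flop_multiplier * weight(op_name).

    Operations missing from the weights mapping default to 1.0,
    mirroring the documented behavior for unlisted operations.
    """
    weight = weights.get(op_name, 1.0)
    return analytical_cost * flop_multiplier * weight


numel = 256 * 256  # analytical cost of a pointwise op on a 256x256 array

exp_cost = adjusted_cost(numel, "exp", weights)          # 65536 * 16.0
add_cost = adjusted_cost(numel, "add", weights)          # 65536 * 1.0
unknown_cost = adjusted_cost(numel, "foo", weights)      # unlisted op falls back to 1.0
reshape_cost = adjusted_cost(numel, "reshape", weights)  # zero-weight op stays free
doubled_exp = adjusted_cost(numel, "exp", weights, flop_multiplier=2.0)
```

The multiplier and the weight are independent factors, so doubling `flop_multiplier` doubles the charge for every operation regardless of its weight.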
How measurement works

The benchmark suite supports two measurement modes, chosen automatically:

| Mode | Platform | What it measures |
| --- | --- | --- |
| perf | Linux (with perf installed) | Instruction-style hardware counters, weighted by SIMD width |
| timing | Any (macOS, Linux without perf) | Relative wall-clock cost, normalized against np.add |

Every counted operation's weight is computed as:

    weight(op) = max(alpha_raw(op) - overhead_for_category, 0)

where alpha_raw(op) depends on the active measurement mode:

- in perf mode, it is the median ratio of counter-observed retired instructions to the analytical FLOP count (FMA = 1 op)
- in timing mode, it is the median ratio of wall-clock cost relative to np.add, normalized against the same analytical FLOP count

Known analytical zero-FLOP operations are not benchmarked; they are emitted separately into the official artifacts with weight 0.0.

In other words, perf-mode weights are instruction-oriented calibration factors, while timing-mode weights are relative cost proxies. Both are useful for scaling analytical FLOPs, but timing-mode results should not be read as literal hardware instruction counts.

The ufunc dispatch overhead (measured from np.abs, which generates zero FP arithmetic) is subtracted per category to remove NumPy implementation noise from the weight. BLAS-backed operations bypass the ufunc layer and have zero overhead subtracted.

Each measurement uses pre-allocated output arrays to eliminate memory allocation overhead, multiple input distributions for robustness, subprocess isolation to prevent interference, and warmup iterations in timing mode.

Interpreting results

- weight = 1.0 -- the operation has the same calibrated cost per analytical FLOP as np.add under the chosen measurement mode
- weight > 1.0 -- the operation is more expensive per analytical FLOP than np.add under the chosen calibration mode (for example, transcendental functions such as exp)
- weight < 1.0 -- the operation is cheaper per analytical FLOP than np.add under the chosen calibration mode.
  This can happen because of efficient kernels, vectorization, library implementation details, or benchmarking noise; it should not be interpreted using the old “FMA counts as 2 FLOPs” convention, because Flopscope uses FMA = 1 throughout the model

Weights are platform-specific -- different CPUs, BLAS libraries, and libm implementations produce different values. Always measure on the target platform.

Symmetry reductions are independent of weights: symmetry reduces the element count while the weight scales the per-element cost.

Full per-operation weight data is available in the API Reference and on each standalone operation page under the canonical /docs/api/... routes, for example /docs/api/einsum/.

Related pages

- FLOP Counting Model -- how weights fit into the cost model
- Budget Planning & Debugging -- query costs before running

============================================================ Changelog ============================================================

--- changelog ---
URL: https://aicrowd.github.io/flopscope/docs/changelog

Changelog

Unreleased

Fixed

- fnp.random.default_rng() and fnp.random.RandomState() now properly count FLOPs. Sampler methods on the returned objects (e.g. rng.standard_normal(), rs.randn()) deduct FLOPs from the active budget and return FlopscopeArray instead of raw numpy.ndarray. Previously these silently bypassed FLOP accounting, a real risk for the ARC Whitebox Estimation Challenge, since submissions could burn arbitrary compute without deducting a single FLOP. Closes flopscope#18.

Changed

- fnp.random.__getattr__ no longer silently forwards unknown attributes to numpy.random. Bit-generator classes (BitGenerator, MT19937, PCG64, PCG64DXSM, Philox, SFC64) pass through unchanged. Anything else now raises AttributeError with a pointer to default_rng(). Use numpy.random directly if you need an unwrapped/unsupported function.
Module-level samplers (fnp.random.randn, normal, uniform, …) are unchanged: same semantics as numpy, no warnings.

Notes
Downstream repos (whestbench, whest-starterkit) can now drop the fnp.asarray(rng.uniform(...).astype(...)) workaround around default_rng / RandomState sampler outputs; the wrap is no longer needed.

Added
Method-level registry entries for the methods of random.Generator and random.RandomState (~94 entries with categories counted_random_method / free_random_method and a cost_formula field). scripts/numpy_audit.py now drift-checks the new slice on every numpy version bump, so future numpy releases that add a new sampler method will fail the audit until the maintainer adds a registry entry.
NumPy __array_ufunc__ (NEP 13) and __array_function__ (NEP 18) protocols on FlopscopeArray. Calls like np.add(flopscope, x), np.add.reduce(a), np.add.outer(a, b), np.divmod(a, b), np.modf(a), np.frexp(a), np.add.at(a, idx, val), np.add.reduceat(a, idx), plus ~108 function-form callables (np.sort, np.transpose, np.linalg.solve, …) now route through flopscope's FLOP-counted wrappers automatically. Closes #58 (ndarray methods bypassing tracking), #38 (in-place dunders on SymmetricTensor rebinding instead of mutating), and #62 (no-symmetry SymmetricTensor type ambiguity). See the new "Edge cases when SymmetricTensors meet NumPy protocols" section for the corner-case rules.
25 ndarray method overrides on FlopscopeArray. a.sum(), a.dot(b), a.argsort(), a.compress(), a.trace(), a.round(), a.clip(), etc. now produce the same FLOP count as fnp.sum(a), fnp.dot(a, b), etc.
In-place dunder rewrites with symmetry-corruption guards. A_sym += B mutates A_sym in place when the result preserves A_sym's declared symmetry, and raises ValueError when it would weaken or destroy it (instead of silently rebinding to a new array, the pre-#67 behaviour). Covers __iadd__, __isub__, __imul__, __itruediv__, __ifloordiv__, __imod__, __ipow__, __iand__, __ior__, __ixor__, __ilshift__, __irshift__, __imatmul__. In-place sort / partition similarly refuse on SymmetricTensor.
Multi-output ufuncs np.divmod / np.frexp / np.modf now route through flopscope with full out=(o1, o2) support, including partial allocation (out=(o1, None)). Both outputs preserve any symmetry the input had.
Symmetry-aware cost adjustment for ufunc.outer and tensordot. The placeholder model charges dense_cost × unique_output_elements / dense_output_elements to reflect the savings a symmetry-aware implementation could realise. Above SymmetryGroup degree 12, the adjustment is skipped (Burnside enumeration on S_n for n > 12 is infeasible) and a new CostFallbackWarning fires once per (op_name, degree) pair per process. Suppress via flops.configure(symmetry_warnings=False) (shares the flag with SymmetryLossWarning).
CostFallbackWarning added to both the core library and the client package. Subclass of FlopscopeWarning.
Wall-clock time limits. BudgetContext now accepts wall_time_limit_s to set a wall-clock deadline. When exceeded, TimeExhaustedError is raised at the next operation boundary with diagnostic info (operation name, elapsed time, limit). The deadline is checked both before and after each numpy call (cooperative enforcement). The entry banner shows the time limit when set.
Timing attribution. Every operation now records its backend duration in flopscope_backend_duration_s and its attributed flopscope overhead in flopscope_overhead_duration_s. The budget summary (both plain-text and Rich) shows wall_time_s, flopscope_backend_time_s, flopscope_overhead_time_s, and residual_wall_time_s. Use budget.summary() or flops.budget_summary() to see the timing data.
TimeExhaustedError added to both the core library and the client package.
Einsum path caching. Contraction paths are now cached in a module-level LRU cache (default 4096 entries). Repeated fnp.einsum() calls with the same subscripts, shapes, optimizer, and symmetry structure reuse the cached path instead of recomputing it. New public API: fnp.clear_einsum_cache(), fnp.einsum_cache_info(), and flops.configure(einsum_path_cache_size=N).
Multi-version NumPy support. flopscope now supports NumPy 2.0, 2.1, and 2.2 (>=2.0.0,<2.3.0). Default install resolves to NumPy 2.2. Functions not available in older NumPy versions raise UnsupportedFunctionError with an actionable message at call time (not import time).
matvec and vecmat: new FLOP-counted wrappers for NumPy 2.2's matrix-vector and vector-matrix product ufuncs. Cost = output_size * contracted_axis (weight 1.0).
UnsupportedFunctionError: new exception for calling functions that require a newer NumPy version than what's installed.
CI NumPy version matrix: tests now run against NumPy 2.0, 2.1, and 2.2.

Changed
Renamed package from mechestim to flopscope to reflect the new challenge name "ARC Whitebox Estimation Challenge". The import convention changes from import mechestim as me to import flopscope as we.
Symmetric BLAS classification restored. Pairwise contractions with symmetric inputs now correctly report SYMM, SYMV, or SYDT BLAS types instead of the generic GEMM, GEMV, DOT. This was disabled during the subgraph-symmetry refactor because per-input symmetry wasn't being looked up; now each step's inputs are queried via symmetry_oracle.sym(ssa_to_subset[ssa_id]) before calling can_blas.
Symmetry detection rewritten: the induced-symmetry mechanism is replaced by a subset-keyed subgraph symmetry oracle (SubgraphSymmetryOracle). The oracle analyses the bipartite structure of the einsum expression, evaluates symmetry lazily per operand subset, and caches results. This correctly handles intermediates (not just the top-level contraction) and eliminates over-eager per-step propagation.
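As a rough illustration of what "evaluates symmetry lazily per operand subset, and caches results" can look like, here is a minimal pure-Python sketch. It is not flopscope's SubgraphSymmetryOracle: the class ToySubsetOracle, the toy_analyse rule, and the string return value are all hypothetical, and only the sym(...) method name is borrowed from the notes above.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Optional

# Hypothetical stand-in for a symmetry description; the real oracle's
# return type is not specified in these release notes.
Symmetry = Optional[str]


class ToySubsetOracle:
    """Illustrative only: lazy, subset-keyed lookup with caching."""

    def __init__(self, analyse: Callable[[FrozenSet[int]], Symmetry]) -> None:
        self._analyse = analyse  # assumed to be the expensive per-subset analysis
        self._cache: Dict[FrozenSet[int], Symmetry] = {}
        self.evaluations = 0  # how many times the analysis actually ran

    def sym(self, subset: Iterable[int]) -> Symmetry:
        key = frozenset(subset)  # operand order must not matter
        if key not in self._cache:  # evaluate only on first request
            self.evaluations += 1
            self._cache[key] = self._analyse(key)
        return self._cache[key]


def toy_analyse(subset: FrozenSet[int]) -> Symmetry:
    # Made-up rule for the demo: pretend operands 1 and 2 are the same tensor,
    # so any intermediate that contains both is treated as symmetric.
    return "symmetric" if {1, 2} <= subset else None


oracle = ToySubsetOracle(toy_analyse)
first = oracle.sym({1, 2})   # runs the analysis
second = oracle.sym([2, 1])  # same subset, served from the cache
other = oracle.sym({0})      # a different subset, analysed separately
```

Keying the cache on a frozenset of operand positions means an intermediate is identified by which original operands it merges, not by the step at which it was produced, which is the property the notes attribute to the subset-keyed design.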
Every optimizer is symmetry-aware: the symmetry_oracle kwarg is plumbed through _PATH_OPTIONS so that optimal, branch-*, greedy, random-greedy, and dynamic-programming algorithms all receive symmetry information and use the exact unique/dense ratio for scoring. DP uses a subset-keyed ratio cache (get_ratio(s, legs)) co-located with the existing bitmap_to_subset closure inside DynamicProgramming.__call__, amortizing the int↔str label translation across all _dp_compare_* helper calls for a given subset. Previously only greedy received symmetry info in some code paths.
Silent fallback deleted: the previous code silently fell back to dense costs when detection produced no result. The oracle now enforces that symmetry information is consumed. Enforcement is verified by tests/test_no_silent_symmetry_drop.py.

Removed
symmetric_flop_count's input_symmetries parameter (high-level API)
propagate_symmetry and related helpers
_detect_induced_output_symmetry and related helpers
induced_output_symmetry kwarg on contract_path

Fixed
bitmap_to_subset in DP now correctly handles operand renumbering. Previously, when _dp_parse_out_single_term_ops removed or renumbered operands before the DP loop (e.g., on einsum('i,ab,cd->abcd', v, X, X) where v has a unique index that reduces to a scalar), the bitmap-to-subset mapping would point at the wrong original operand positions, causing the oracle to return symmetry for an unrelated intermediate. This bug was latent under the conservative 2× heuristic and only surfaces with exact ratio scoring.
Heterogeneous block dimensions in unique_elements: the stars-and-bars block-cardinality calculation assumed all axes within a block had the same dimension. For rectangular block-symmetric tensors (e.g. einsum('ab,cd->abcd', X, X) with X of shape (3, 4)), it computed n**s = 3**2 = 9 instead of the correct product 3*4 = 12, silently underestimating the unique-element count by up to ~8× on rank-3 cases. block_card is now computed as prod(size_dict[c] for c in blocks[0]), which reduces to the old formula for per-index groups and gives the correct product for block groups with differing axis sizes.

Added
Enriched PathInfo display: fnp.einsum_path().format_table(verbose=False) (called by __str__) now shows an Optimizer: header line resolving optimize='auto'/'auto-hq' to the inner choice that actually ran (e.g. optimal, dynamic_programming, random_greedy_128), a contract column giving the path-supplied contraction tuple, and a unique/dense column showing the bare element counts that the symmetry savings derive from. Call format_table(verbose=True) for an indented detail row per step showing the merged operand subset, the intermediate's output shape, and the running cumulative cost; this is the most useful view when debugging why a particular step's savings are what they are.
New PathInfo.optimizer_used: str field and new StepInfo fields path_indices: tuple[int, ...] and merged_subset: frozenset[int] | None. The merged_subset field is the exact key SubgraphSymmetryOracle.sym(...) uses for its lookups, making the symmetry column directly attributable to the oracle's view of each intermediate.

0.2.0 (2026-04-03)
Second release with unified einsum cost model, NumPy compatibility testing, and expanded operation coverage.
New features
Unified einsum cost model: all einsum-like operations (einsum, dot, matmul, tensordot) now share a single cost model based on opt_einsum's contraction path optimizer
Symmetry-aware path finding: the opt_einsum path optimizer now factors symmetry savings into contraction ordering decisions, producing different (cheaper) paths for symmetric inputs
NumPy compatibility test harness: run NumPy's own test suite against flopscope via monkeypatching; 7,300+ tests passing across 7 NumPy test modules
Polynomial operations: polyval, polyfit, polymul, polydiv, polyadd, polysub, poly, roots, polyder, polyint with analytical FLOP costs
Window functions: bartlett, hamming, hanning, blackman, kaiser with per-function cost formulas
FFT module: fft, ifft, rfft, irfft, fft2, ifft2, fftn, ifftn, rfftn, irfftn and free helpers (fftfreq, rfftfreq, fftshift, ifftshift)
Client-server architecture: flopscope-client and flopscope-server packages for sandboxed competition evaluation over ZMQ
Global default budget: a 1e15 FLOP budget auto-activates on first use, so explicit BudgetContext is no longer required for quick scripts
FLOPSCOPE_DEFAULT_BUDGET env var: configure the global default budget amount
budget_live(): Rich-based live-updating budget display context manager
einsum_path(): inspect contraction plans with per-step symmetry savings without spending budget
90%+ test coverage gate enforced in CI

Breaking changes
Einsum cost formula now uses product_of_all_index_dims × op_factor (op_factor=2 for inner products, 1 for outer products), matching opt_einsum convention. Previously used a different formula.
fnp.dot and fnp.matmul costs are now computed via the einsum cost model instead of separate formulas.
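To make the 0.2.0 formula product_of_all_index_dims × op_factor concrete, the following sketch computes it for a single contraction with an explicit output. It is an illustration, not flopscope's implementation: toy_einsum_cost is a hypothetical helper, and it assumes that "inner product" means at least one index is summed away while "outer product" means none is. Later notes above give matvec a weight of 1.0, so costs reported by current releases may differ from this 0.2.0-era formula.

```python
from math import prod
from typing import Dict, Tuple


def toy_einsum_cost(subscripts: str, *shapes: Tuple[int, ...]) -> int:
    """Return product_of_all_index_dims * op_factor for one contraction.

    Hypothetical helper for illustration; requires an explicit '->' output
    and single-character index labels.
    """
    inputs, output = subscripts.split("->")
    terms = inputs.split(",")

    # Map every index label to its dimension, e.g. {'i': 256, 'j': 256}.
    size: Dict[str, int] = {}
    for term, shape in zip(terms, shapes):
        for label, dim in zip(term, shape):
            size[label] = dim

    # Assumption: an index missing from the output is summed over, which
    # makes the step an inner product (factor 2: one multiply + one add).
    contracted = set(size) - set(output)
    op_factor = 2 if contracted else 1
    return prod(size.values()) * op_factor


matvec_cost = toy_einsum_cost("ij,j->i", (256, 256), (256,))   # 256*256*2
outer_cost = toy_einsum_cost("i,j->ij", (3,), (4,))            # 3*4*1
matmul_cost = toy_einsum_cost("ij,jk->ik", (2, 3), (3, 4))     # 2*3*4*2
```

Because fnp.dot and fnp.matmul are described as going through the same model, the matmul case here is simply the "ij,jk->ik" contraction rather than a separate formula.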
Bug fixes
Accept scalars and array-likes in all flopscope functions
Fix symmetry-aware greedy algorithm to actually use symmetry in path selection
Fix contract_path cost reporting for output indices
Correctly handle symmetric_dims propagation through multi-step contraction paths

Documentation
Comprehensive how-to guides for einsum, symmetry, linalg, budget planning, and debugging
Architecture docs for client-server model and Docker deployment
AI agent guide with llms.txt, ops.json, and cheat sheet
NumPy compatibility testing methodology docs

0.1.0 (2026-04-01)
Initial release for warm-up round.
Einsum with symmetry detection and FLOP counting
Pointwise operations (exp, log, add, multiply, etc.)
Reductions (sum, mean, max, etc.)
SVD with truncated top-k
Free tensor creation and manipulation ops
Budget enforcement via BudgetContext
FLOP cost query API
NumPy-compatible API (import flopscope as we)