NumPy Compatibility Testing

mechestim's goal is to be a drop-in replacement for NumPy: import mechestim as np should work for all supported functions. To verify this, we run NumPy's own test suite against mechestim.

How it works

A pytest conftest at tests/numpy_compat/conftest.py monkeypatches numpy functions with their mechestim equivalents at session start. When we point pytest at NumPy's installed test files using --pyargs, every test that calls np.sum(...), np.mean(...), etc. actually calls mechestim's version.

NumPy test file                conftest.py               mechestim
  calls np.sum(x)  ──────>   np.sum = me.sum   ──────>  me.sum(x)
  asserts result              (monkeypatch)              (FLOP-counted)

Avoiding infinite recursion

mechestim functions internally call numpy (e.g., me.dot calls _np.dot). Since _np IS the numpy module, patching numpy.dot = me.dot would cause infinite recursion: me.dot → _np.dot → numpy.dot → me.dot → ...

We solve this by freezing numpy before patching: the conftest creates a snapshot of the numpy module (and its submodules like numpy.linalg, numpy.fft), then rebinds every mechestim module's _np reference to the frozen copy. Now mechestim's internal calls go to the original numpy functions, while the test suite sees mechestim's versions.

# Simplified flow in conftest.py:
frozen_np = freeze_numpy()           # snapshot of original numpy
rebind_mechestim_np(frozen_np)       # me._np → frozen copy
patch_numpy()                        # np.sum = me.sum, etc.
# Now: test calls np.sum → me.sum → frozen_np.sum (original) ✓

What gets patched

Of mechestim's 482 registered functions, most non-ufunc functions are patched onto numpy during testing. The only categories skipped:

Category	Count	Why skipped
Ufuncs	101	mechestim functions are plain callables, not ufuncs -- they lack `.reduce`, `.accumulate`, `.outer`, `.nargs`. Tests check these attributes at collection time.
Blacklisted	32	Intentionally unsupported
`linalg.outer`	1	`me.linalg.outer` delegates to `np.outer` (not `np.linalg.outer`), which has different validation behavior

Everything else -- free ops, counted custom ops (dot, einsum, etc.), submodule functions (linalg, fft), reductions, and special functions -- is patched.

Test suites

We run 7 NumPy test modules covering core math, ufuncs, numerics, linear algebra, FFT, polynomials, and random:

Suite	Module	Passed	xfailed
Core math	`numpy._core.tests.test_umath`	4,668	13
Ufunc infrastructure	`numpy._core.tests.test_ufunc`	795	7
Numeric operations	`numpy._core.tests.test_numeric`	1,560	20
Linear algebra	`numpy.linalg.tests.test_linalg`	48	255
FFT	`numpy.fft.tests.test_pocketfft`	114	34
Polynomials	`numpy.polynomial.tests.test_polynomial`	36	2
Random	`numpy.random.tests.test_random`	142	0
Total		7,363	331

All failures are tracked as xfails in tests/numpy_compat/xfails.py.

Running the tests

Tests use pytest-xdist for parallel execution across all CPU cores.

# Run everything (recommended)
make test-numpy-compat

# Run a single suite
uv run pytest tests/numpy_compat/ --pyargs numpy._core.tests.test_umath -n auto -q

# Filter to specific functions
uv run pytest tests/numpy_compat/ --pyargs numpy._core.tests.test_umath -k "sqrt" -n auto -v

# Run without parallelism (for debugging)
uv run pytest tests/numpy_compat/ --pyargs numpy._core.tests.test_umath -v --tb=short

The numpy_compat tests are excluded from the default pytest run (via pyproject.toml addopts) to prevent the monkeypatch from contaminating the main test suite. They run as a separate step in CI.

Known divergences (xfails)

Tests that fail due to known, accepted differences are tracked in tests/numpy_compat/xfails.py. Each entry maps a test pattern to a categorized reason:

Category	Meaning	Examples
`NOT_IMPLEMENTED`	Function exists but lacks a kwarg or edge case	Missing `out=`, `where=`, `subok=` kwargs
`UNSUPPORTED_DTYPE`	mechestim doesn't support this dtype	timedelta, object arrays
`UFUNC_INTERNALS`	Test relies on ufunc protocol	`.reduce`, `__array_ufunc__`
`BUDGET_SIDE_EFFECT`	Test assumes no global state changes	Budget deduction during assertions
`NUMPY_INTERNAL`	Test uses numpy internals	`_umath_tests`, internal type tables

The linalg suite has the most xfails (255) because mechestim's linalg wrappers don't support stacked/batched arrays, 0-size arrays, or some advanced kwargs that numpy's linalg tests exercise extensively.

Triaging new failures

Run a suite: uv run pytest tests/numpy_compat/ --pyargs <module> -n auto --tb=line
Categorize each failure
If it's a bug we should fix, create an issue
If it's an accepted divergence, add it to xfails.py

Why monkeypatching (not subclassing)

We considered alternatives:

Array subclass with __array_ufunc__: Would intercept ufunc calls, but mechestim arrays are plain numpy.ndarray by design -- no custom tensor class.
Running tests with import mechestim as np: NumPy's test files import from numpy._core, numpy.testing, etc. -- can't redirect all internal imports.
Monkeypatching with frozen numpy: Simple, works with NumPy's existing test infrastructure, tests exactly what users experience (same function signatures), and the frozen-numpy trick prevents infinite recursion.