NumPy Compatibility Testing
mechestim's goal is to be a drop-in replacement for NumPy: import mechestim as np should work for all supported functions. To verify this, we run NumPy's own test suite against mechestim.
How it works
A pytest conftest at tests/numpy_compat/conftest.py monkeypatches numpy functions with their mechestim equivalents at session start. When we point pytest at NumPy's installed test files using --pyargs, every test that calls np.sum(...), np.mean(...), etc. actually calls mechestim's version.
NumPy test file            conftest.py               mechestim
calls np.sum(x)  ──────>   np.sum = me.sum  ──────>  me.sum(x)
asserts result             (monkeypatch)             (FLOP-counted)
Avoiding infinite recursion
mechestim functions internally call numpy (e.g., me.dot calls _np.dot). Since _np IS the numpy module, patching numpy.dot = me.dot would cause infinite recursion: me.dot → _np.dot → numpy.dot → me.dot → ...
We solve this by freezing numpy before patching: the conftest creates a snapshot of the numpy module (and its submodules like numpy.linalg, numpy.fft), then rebinds every mechestim module's _np reference to the frozen copy. Now mechestim's internal calls go to the original numpy functions, while the test suite sees mechestim's versions.
# Simplified flow in conftest.py:
frozen_np = freeze_numpy() # snapshot of original numpy
rebind_mechestim_np(frozen_np) # me._np → frozen copy
patch_numpy() # np.sum = me.sum, etc.
# Now: test calls np.sum → me.sum → frozen_np.sum (original) ✓
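The freeze, rebind, patch order can be illustrated with a toy example. This is a minimal sketch that uses stand-in modules instead of the real numpy and mechestim so that it runs anywhere. The names `fake_numpy`, `me`, `counted_dot`, and `freeze_module` are hypothetical and are not taken from the actual conftest, which also snapshots submodules such as numpy.linalg and numpy.fft.

```python
import types

# Hypothetical stand-in for the real numpy module.
fake_numpy = types.ModuleType("fake_numpy")
fake_numpy.dot = lambda a, b: sum(x * y for x, y in zip(a, b))

# Hypothetical stand-in for a mechestim module holding an `_np` reference.
me = types.ModuleType("fake_mechestim")
me._np = fake_numpy
me.flops = 0


def counted_dot(a, b):
    """Count FLOPs, then delegate to whatever `me._np` currently points at."""
    me.flops += 2 * len(a)
    return me._np.dot(a, b)


me.dot = counted_dot


def freeze_module(module):
    """Return a shallow snapshot: a new module object with the same attributes."""
    frozen = types.ModuleType(module.__name__ + "_frozen")
    frozen.__dict__.update(
        {k: v for k, v in vars(module).items() if not k.startswith("__")}
    )
    return frozen


frozen_np = freeze_module(fake_numpy)  # 1. snapshot the original functions
me._np = frozen_np                     # 2. rebind the internal reference
fake_numpy.dot = me.dot                # 3. patch the public module

# The "test" calls the patched function; the wrapper reaches the original
# through the frozen snapshot, so there is no recursion.
result = fake_numpy.dot([1, 2, 3], [4, 5, 6])
```

If step 2 were skipped, `counted_dot` would look up `fake_numpy.dot`, find itself, and recurse until Python raised `RecursionError`.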
What gets patched
Most of mechestim's 482 registered functions are patched onto numpy during testing. Only three categories are skipped:
| Category | Count | Why skipped |
|---|---|---|
| Ufuncs | 101 | mechestim functions are plain callables, not ufuncs -- they lack .reduce, .accumulate, .outer, .nargs. Tests check these attributes at collection time. |
| Blacklisted | 32 | Intentionally unsupported |
| linalg.outer | 1 | me.linalg.outer delegates to np.outer (not np.linalg.outer), which has different validation behavior |
Everything else -- free ops, counted custom ops (dot, einsum, etc.), submodule functions (linalg, fft), reductions, and special functions -- is patched.
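One possible way to express these skip rules in code is sketched below. It is an illustration only: `skip_reason`, `FakeUfunc`, and the registry contents (including the name `example_blacklisted`) are hypothetical, and the real conftest may classify functions differently. The sketch treats an object as a ufunc when it exposes all four attributes listed in the table.

```python
# Attributes that NumPy ufuncs expose and plain Python callables do not.
UFUNC_ATTRS = ("reduce", "accumulate", "outer", "nargs")


class FakeUfunc:
    """Hypothetical stand-in for a NumPy ufunc such as np.add."""

    nargs = 3

    def __call__(self, a, b):
        return a + b

    def reduce(self, values):
        total = 0
        for v in values:
            total = self(total, v)
        return total

    def accumulate(self, values):
        out, total = [], 0
        for v in values:
            total = self(total, v)
            out.append(total)
        return out

    def outer(self, xs, ys):
        return [[self(x, y) for y in ys] for x in xs]


def skip_reason(name, original, blacklist, special_cases):
    """Return why `name` should not be patched, or None if it should be."""
    if name in blacklist:
        return "blacklisted"
    if name in special_cases:
        return "special case"
    if all(hasattr(original, attr) for attr in UFUNC_ATTRS):
        return "ufunc"
    return None


def plain_dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def plain_outer(xs, ys):
    return [[x * y for y in ys] for x in xs]


# Hypothetical registry mapping names to the original numpy objects.
originals = {
    "add": FakeUfunc(),
    "dot": plain_dot,
    "linalg.outer": plain_outer,
    "example_blacklisted": plain_dot,
}
blacklist = {"example_blacklisted"}
special_cases = {"linalg.outer"}

patched, skipped = [], {}
for name, original in originals.items():
    reason = skip_reason(name, original, blacklist, special_cases)
    if reason is None:
        patched.append(name)
    else:
        skipped[name] = reason
```

Checking the original object rather than the replacement matters here: the replacement is a plain callable by construction, so only the original reveals whether tests may expect ufunc attributes.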
Test suites
We run 7 NumPy test modules covering core math, ufuncs, numerics, linear algebra, FFT, polynomials, and random:
| Suite | Module | Passed | xfailed |
|---|---|---|---|
| Core math | numpy._core.tests.test_umath | 4,668 | 13 |
| Ufunc infrastructure | numpy._core.tests.test_ufunc | 795 | 7 |
| Numeric operations | numpy._core.tests.test_numeric | 1,560 | 20 |
| Linear algebra | numpy.linalg.tests.test_linalg | 48 | 255 |
| FFT | numpy.fft.tests.test_pocketfft | 114 | 34 |
| Polynomials | numpy.polynomial.tests.test_polynomial | 36 | 2 |
| Random | numpy.random.tests.test_random | 142 | 0 |
| Total | | 7,363 | 331 |
All failures are tracked as xfails in tests/numpy_compat/xfails.py.
Running the tests
Tests use pytest-xdist for parallel execution across all CPU cores.
# Run everything (recommended)
make test-numpy-compat
# Run a single suite
uv run pytest tests/numpy_compat/ --pyargs numpy._core.tests.test_umath -n auto -q
# Filter to specific functions
uv run pytest tests/numpy_compat/ --pyargs numpy._core.tests.test_umath -k "sqrt" -n auto -v
# Run without parallelism (for debugging)
uv run pytest tests/numpy_compat/ --pyargs numpy._core.tests.test_umath -v --tb=short
The numpy_compat tests are excluded from the default pytest run (via pyproject.toml addopts) to prevent the monkeypatch from contaminating the main test suite. They run as a separate step in CI.
Known divergences (xfails)
Tests that fail due to known, accepted differences are tracked in tests/numpy_compat/xfails.py. Each entry maps a test pattern to a categorized reason:
| Category | Meaning | Examples |
|---|---|---|
| NOT_IMPLEMENTED | Function exists but lacks a kwarg or edge case | Missing out=, where=, subok= kwargs |
| UNSUPPORTED_DTYPE | mechestim doesn't support this dtype | timedelta, object arrays |
| UFUNC_INTERNALS | Test relies on ufunc protocol | .reduce, __array_ufunc__ |
| BUDGET_SIDE_EFFECT | Test assumes no global state changes | Budget deduction during assertions |
| NUMPY_INTERNAL | Test uses numpy internals | _umath_tests, internal type tables |
The linalg suite has the most xfails (255) because mechestim's linalg wrappers don't support stacked/batched arrays, 0-size arrays, or some advanced kwargs that numpy's linalg tests exercise extensively.
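A pattern-to-reason mapping like the one described above could look roughly as follows. This is a sketch under assumptions: the actual layout of xfails.py is not shown here, and the patterns, reasons, `XFAILS` dict shape, and `xfail_reason` helper are all hypothetical. Only the five category names come from the table.

```python
from fnmatch import fnmatchcase

# Category names from the table above.
CATEGORIES = {
    "NOT_IMPLEMENTED",
    "UNSUPPORTED_DTYPE",
    "UFUNC_INTERNALS",
    "BUDGET_SIDE_EFFECT",
    "NUMPY_INTERNAL",
}

# Hypothetical entries: glob pattern over the pytest node ID -> (category, reason).
XFAILS = {
    "*TestExample::test_out_kwarg*": ("NOT_IMPLEMENTED", "missing out= kwarg"),
    "*TestExample::test_timedelta*": ("UNSUPPORTED_DTYPE", "timedelta arrays"),
    "*TestExample::test_reduce*": ("UFUNC_INTERNALS", "relies on .reduce"),
}


def xfail_reason(nodeid):
    """Return 'CATEGORY: reason' for the first matching pattern, else None."""
    for pattern, (category, reason) in XFAILS.items():
        if category not in CATEGORIES:
            raise ValueError(f"unknown xfail category: {category}")
        if fnmatchcase(nodeid, pattern):
            return f"{category}: {reason}"
    return None


matched = xfail_reason(
    "numpy/linalg/tests/test_linalg.py::TestExample::test_out_kwarg[float64]"
)
unmatched = xfail_reason("numpy/linalg/tests/test_linalg.py::TestOther::test_ok")
```

In a conftest, a helper like this could be called from pytest's `pytest_collection_modifyitems` hook, attaching the result with `item.add_marker(pytest.mark.xfail(reason=...))`. Whether the project wires it up this way is an assumption.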
Triaging new failures
- Run a suite: uv run pytest tests/numpy_compat/ --pyargs <module> -n auto --tb=line
- Categorize each failure
- If it's a bug we should fix, create an issue
- If it's an accepted divergence, add it to xfails.py
Why monkeypatching (not subclassing)
We considered alternatives:
- Array subclass with __array_ufunc__: Would intercept ufunc calls, but mechestim arrays are plain numpy.ndarray by design, with no custom tensor class.
- Running tests with import mechestim as np: NumPy's test files import from numpy._core, numpy.testing, etc., so we can't redirect all internal imports.
- Monkeypatching with frozen numpy: Simple, works with NumPy's existing test infrastructure, tests exactly what users experience (same function signatures), and the frozen-numpy trick prevents infinite recursion.