[Major Rewrite] NumPy nditer port, NpyExpr DSL with 3-tier custom-op API, C/F/A/K memory layout support, stride-native matmul #611
Replaces the lazy-but-standalone ValueOffsetIncrementor path with one that
constructs an NpyIter state and drives MoveNext / HasNext / Reset directly
off that state. NDIterator is now an honest thin wrapper over NpyIter — the
same traversal machinery used by all the Phase 2 production call sites —
rather than reimplementing the coord-walk logic with legacy incrementors.

How it works
------------
- ctor calls NpyIterRef.New(arr, NPY_CORDER) to build the state, then
  transfers ownership of the NpyIterState* pointer out of the ref struct
  (see NpyIterRef.ReleaseState / FreeState below). The class holds that
  pointer for its lifetime and frees it in Dispose (or in the finalizer as
  a safety net).
- MoveNext reads `*(TOut*)state->DataPtrs[0]` then calls `state->Advance()`.
  IterIndex tracks position, IterEnd bounds the non-AutoReset case, and
  `state->Reset()` restarts from IterStart on AutoReset wraparound and on
  explicit Reset.
- Cross-dtype wraps the same read with a Converts.FindConverter<TSrc, TOut>
  lookup — one switch at construction picks the typed helper, so the
  per-element hot path is still just one read + one converter delegate call.
  MoveNextReference throws when casting is in play, matching the legacy
  contract.
- NPY_CORDER is explicit so iterating a transposed view yields the logical
  row-major order the old NDIterator provided. Without it, KEEPORDER would
  give memory-efficient order (which e.g. `b.T.AsIterator<int>()` would
  surface as `0 1 2 ... 11` instead of the expected
  `0 4 8 1 5 9 2 6 10 3 7 11`).

NpyIter additions
-----------------
- NpyIterRef.ReleaseState(): hand the owned NpyIterState* to a caller who
  needs it across a non-ref-struct boundary (e.g. a class field). Marks the
  ref struct as non-owning so its Dispose is a no-op.
- NpyIterRef.FreeState(NpyIterState*): static tear-down mirror of Dispose's
  cleanup path — frees buffers (when BUFFER set), calls FreeDimArrays, and
  NativeMemory.Free's the state pointer. The long-lived owner calls this
  from its own Dispose/finalizer.

Bug fixes along the way
-----------------------
NpyIter initialization previously computed base pointers as
`(byte*)arr.Address + (shape.offset * arr.dtypesize)` in two places (initial
broadcast setup on line 340 and ResetBasePointers on line 1972).
`arr.dtypesize` goes through `Marshal.SizeOf(bool) == 4` because bool is
marshaled to win32 BOOL, but the in-memory `bool[]` storage is 1 byte per
element. For strided bool arrays this produced a base pointer 4× too far
into the buffer. Switched both sites to `arr.GetTypeCode.SizeOf()` which
returns the actual in-memory size (1 for bool). Surfaced by
`Boolean_Strided_Odd` once NDIterator started routing through NpyIter —
previously only LATENT because the legacy NDIterator path computed offsets
in element units, not bytes, and sidestepped the NpyIter init.

Test impact: 6,748 / 6,748 passing on net8.0 and net10.0 (CI filter:
TestCategory!=OpenBugs&TestCategory!=HighMemory). Smoke test of same-type
contig / cross-type / strided / transposed / broadcast / AutoReset / Reset /
foreach all produce the expected element sequences.
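The difference between the explicit C-order traversal and keep-order traversal described above can be illustrated with a small stride walk. This is only a sketch in plain Python, not NumSharp or NpyIter code; the function names `iter_c_order` and `iter_keep_order` are hypothetical, and strides are expressed in elements rather than bytes.

```python
from itertools import product

def iter_c_order(buffer, shape, strides, offset=0):
    """Yield elements in logical row-major order: the last axis varies fastest."""
    for coords in product(*(range(n) for n in shape)):
        yield buffer[offset + sum(c * s for c, s in zip(coords, strides))]

def iter_keep_order(buffer, shape, strides, offset=0):
    """Yield elements in memory order: the smallest-stride axis varies fastest."""
    axes = sorted(range(len(shape)), key=lambda a: -abs(strides[a]))
    for coords in product(*(range(shape[a]) for a in axes)):
        yield buffer[offset + sum(c * strides[a] for c, a in zip(coords, axes))]

# b = arange(12).reshape(3, 4) has element strides (4, 1);
# its transpose b.T is the (4, 3) view with strides (1, 4) over the same buffer.
buffer = list(range(12))
t_shape, t_strides = (4, 3), (1, 4)

c_order = list(iter_c_order(buffer, t_shape, t_strides))
keep_order = list(iter_keep_order(buffer, t_shape, t_strides))
print(c_order)     # logical order of the transposed view
print(keep_order)  # memory order of the underlying buffer
```

The C-order walk reproduces the `0 4 8 1 5 9 ...` sequence quoted in the commit message, while the keep-order walk degenerates to a linear scan of the buffer.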
`UnmanagedStorage.DTypeSize` (exposed via `NDArray.dtypesize`) was
delegating to `Marshal.SizeOf(_dtype)`. For every numeric dtype that
matches, but for bool, `Marshal.SizeOf(typeof(bool)) == 4` because bool
is marshaled to win32 BOOL (32-bit). The in-memory layout of `bool[]`
is 1 byte per element, so every caller computing a byte offset as
`ptr + index * arr.dtypesize` was reading/writing 4× too far into the
buffer for bool arrays.
Switches to `_typecode.SizeOf()` which correctly returns 1 for bool and
matches `Marshal.SizeOf` for every other type. 21 existing call sites
(matmul, binary/unary/comparison/reduction ops, nan reductions, std/var,
argmax, random shuffle, boolean mask gather, etc.) now get the right
value without any downstream change.
The bug had been latent until the Phase 2 iterator migration started
routing more code paths through NpyIter.Copy and the new NDIterator
wrapper; it surfaced most visibly as `sliced_bool[mask]` returning the
wrong elements when the source was non-contiguous. With the root fix:
var full = np.array(new[] { T,F,T,F,T,F,T,F,T });
var sliced = full["::2"]; // [T,T,T,T,T] non-contig
var result = sliced[new_bool_mask]; // now correct per-element
np.save.cs already special-cases bool before falling through to
`Marshal.SizeOf`, so serialization was unaffected. Remaining
Marshal.SizeOf references in the codebase are either in comments that
explain this exact issue, or in the `InfoOf<T>.Size` fallback that
only runs for types outside the 12 supported dtypes (e.g. Complex).
Tests: 6,748 / 6,748 passing on net8.0 and net10.0 with the CI filter
(TestCategory!=OpenBugs&TestCategory!=HighMemory).
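To make the 4× overshoot concrete, here is a rough model of the byte-offset arithmetic in plain Python. It is not the NumSharp code: `element_address` and the two size constants are hypothetical names, with the value 4 standing in for what the text says `Marshal.SizeOf(typeof(bool))` reports and 1 for the in-memory size.

```python
IN_MEMORY_BOOL_SIZE = 1   # one byte per element in bool[] storage (per the text)
MARSHALED_BOOL_SIZE = 4   # size the text attributes to Marshal.SizeOf(typeof(bool))

def element_address(base, element_offset, itemsize):
    """Byte address of an element, mirroring `ptr + index * dtypesize`."""
    return base + element_offset * itemsize

# T,F,T,F,T,F,T,F,T stored one byte per element.
storage = bytes([1, 0, 1, 0, 1, 0, 1, 0, 1])

# full["::2"] is a strided view: logical element i lives at element offset 2*i.
element_offsets = [2 * i for i in range(5)]

correct = [element_address(0, off, IN_MEMORY_BOOL_SIZE) for off in element_offsets]
buggy = [element_address(0, off, MARSHALED_BOOL_SIZE) for off in element_offsets]

values = [bool(storage[a]) for a in correct]
out_of_bounds = [a for a in buggy if a >= len(storage)]
print(correct, values)
print(buggy, out_of_bounds)
```

With the 4-byte size, three of the five computed addresses fall outside the 9-byte buffer, which is the kind of misread the root fix removes.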
- Delete 4 NPYITER analysis docs (audit, buffered reduce, deep audit, numpy
  differences) — information consolidated into codebase
- Delete 3 NDIterator.Cast files (Complex, Half, SByte) — casting now handled
  by unified NDIterator<T> backed by NpyIter state
- Update NDIterator.cs: minor adjustments from NpyIter backing refactor
- Update ILKernelGenerator.Scan.cs: scan kernel changes
- Update Default.MatMul.Strided.cs: add INumber<T> constraint support for
  generic matmul dispatch preparation
- Update Default.ClipNDArray.cs: initial NpFunc dispatch refactoring replacing
  6 switch blocks (~84 cases) with generic dispatch methods
- Update np.full_like.cs: minor fix
- Update RELEASE_0.51.0-prerelease.md release notes
…neric dispatch

NpFunc is a reflection-cached generic dispatch utility that bridges runtime
NPTypeCode values to compile-time generic type parameters. Hot path (cache
hit) runs at ~32ns via Delegate[] array indexed by NPTypeCode ordinal. Cold
path uses MakeGenericMethod + CreateDelegate, cached after first call per
(method, typeCode) pair.

Core NpFunc changes:
- Dynamic table sizing: Delegate[] sized from max NPTypeCode enum value
  (was hardcoded [32], broke for NPTypeCode.Complex=128)
- Overloads for 0-6 args × void/returning × 1-3 NPTypeCodes + 1-2 Types
- SmartMatchTypes for multi-type dispatch (1→broadcast, N=N→positional,
  M<N→type-identity matching)
- Per-arity ConcurrentDictionary caches for multi-type dispatch

Files refactored (12 files, ~400 cases eliminated):

Previous session (5 files, ~196 cases):
- Default.ClipNDArray.cs: 6 dispatch methods for contiguous/general clip
- Default.Clip.cs: 3 dispatch methods for scalar clip with ChangeType
- Default.NonZero.cs: 3 dispatch methods for nonzero/count operations
- Default.BooleanMask.cs: 1 dispatch method for masked copy
- Default.Shift.cs: 2 dispatch methods for array/scalar shift

This session (7 files, ~202 cases):
- NDIteratorExtensions.cs: 5 overloads → 5 dispatch methods creating
  NDIterator<T> from NDArray/UnmanagedStorage/IArraySlice
- Default.Reduction.CumAdd.cs: axis dispatch via CumSumAxisKernel<T>,
  elementwise via IAdditionOperators<T,T,T> with default(T) init
- Default.Reduction.CumMul.cs: axis dispatch via CumProdAxisKernel<T>,
  elementwise via IMultiplyOperators + T.MultiplicativeIdentity init
- np.where.cs: iterator fallback + IL kernel dispatch via pointer cast
- np.random.randint.cs: int/long fill via INumberBase<T>.CreateTruncating
- NDArray.NOT.cs: IEquatable<T>.Equals(default) unifies bool NOT and
  numeric ==0 comparison into single generic method
- Default.LogicalReduction.cs: direct dispatch to ExecuteLogicalAxis<T>

Net: -1243 lines removed across 12 files, replacing repetitive per-type switch
cases with single generic dispatch methods.
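The hot-path / cold-path split and the table-sizing fix can be sketched roughly as follows. This is a plain-Python analogy, not NpFunc itself: the `TypeCode` members other than `Complex = 128` use made-up ordinals, and `Dispatcher` and `make_zero_fill` are hypothetical names standing in for the reflection-based specialization.

```python
from enum import IntEnum

class TypeCode(IntEnum):
    # Hypothetical subset; only Complex = 128 is taken from the text.
    Boolean = 3
    Int32 = 9
    Double = 14
    Complex = 128

# Size the table from the largest enum value instead of a hardcoded 32,
# so indexing with Complex (128) stays in range.
TABLE_SIZE = max(TypeCode) + 1

class Dispatcher:
    def __init__(self, factory):
        self._factory = factory            # cold path: build a specialized callable
        self._table = [None] * TABLE_SIZE  # hot path: direct index by ordinal
        self.cold_calls = 0

    def invoke(self, code, *args):
        fn = self._table[code]
        if fn is None:                     # first call for this type code
            self.cold_calls += 1
            fn = self._table[code] = self._factory(code)
        return fn(*args)

def make_zero_fill(code):
    """Stand-in for specializing a generic method for one type code."""
    ctor = {TypeCode.Boolean: bool, TypeCode.Int32: int,
            TypeCode.Double: float, TypeCode.Complex: complex}[code]
    return lambda n: [ctor() for _ in range(n)]

dispatcher = Dispatcher(make_zero_fill)
first = dispatcher.invoke(TypeCode.Complex, 2)   # cold: builds and caches
second = dispatcher.invoke(TypeCode.Complex, 3)  # hot: table lookup only
```

A table of length 32 would raise on the `Complex` lookup, which mirrors the IndexOutOfRangeException described in the text.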
Complex does not implement IComparable<T>, so NpFunc.Invoke into
ClipArrayBoundsDispatch/ClipArrayMinDispatch/ClipArrayMaxDispatch crashed
with ArgumentException on MakeGenericMethod.

Fix: add NPTypeCode.Complex pre-checks in ClipNDArrayContiguous,
ClipNDArrayGeneral, and ClipCore that route to dedicated Complex clip methods
using lexicographic comparison (real first, then imag). NaN handling preserves
the NaN-containing element as-is (not replaced with NaN+NaN*i), matching NumPy
np.maximum/np.minimum behavior where "NaN wins" but the original value is
returned.

Half NaN propagation: ILKernelGenerator.ClipArrayBoundsScalar,
ClipArrayMinScalar, ClipArrayMaxScalar fell through to the generic CompareTo
path for Half, which treats NaN as less-than-all (IEEE totalOrder) instead of
propagating it. Added Half-specific scalar methods that check Half.IsNaN
explicitly before comparison.

Also fix NpFunc table sizing: Delegate[] was hardcoded to [32] but
NPTypeCode.Complex=128 caused IndexOutOfRangeException. Now computed
dynamically from max NPTypeCode enum value at static init.

Fixes 14 test failures (12 Complex clip/maximum/minimum constraint violations,
2 Half NaN propagation in maximum).
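A minimal sketch of the lexicographic comparison with "NaN wins, original value returned" semantics might look like the following. It is plain Python, not the NumSharp clip methods; the helper names are hypothetical, and the tie-breaking when both operands contain NaN (first operand returned) is an assumption.

```python
import math

def has_nan(z):
    return math.isnan(z.real) or math.isnan(z.imag)

def lex_ge(a, b):
    """Lexicographic >=: compare real parts first, then imaginary parts."""
    return a.real > b.real or (a.real == b.real and a.imag >= b.imag)

def lex_le(a, b):
    return a.real < b.real or (a.real == b.real and a.imag <= b.imag)

def complex_maximum(a, b):
    if has_nan(a):
        return a          # the NaN-containing element is returned unchanged
    if has_nan(b):
        return b
    return a if lex_ge(a, b) else b

def complex_minimum(a, b):
    if has_nan(a):
        return a
    if has_nan(b):
        return b
    return a if lex_le(a, b) else b

def complex_clip(z, lo, hi):
    """Clip as minimum(maximum(z, lo), hi) under the ordering above."""
    return complex_minimum(complex_maximum(z, lo), hi)

nan_value = complex(float("nan"), 7.0)
kept = complex_maximum(1 + 1j, nan_value)   # keeps imag == 7.0, not NaN+NaN*i
clipped = complex_clip(5 + 0j, 0j, 3 + 3j)
```

Returning the operand itself, rather than constructing a fresh NaN value, is what keeps the non-NaN component (here the imaginary part 7.0) intact.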
…ast paths
Replaces the broken `PowerInteger` fast-path (which crashed on sliced/broadcast
arrays via `Unsafe.Address`) with a stride-aware integer power emitted by the
existing IL kernel infrastructure. Adds NumPy's "Integers to negative integer
powers are not allowed." ValueError, fast paths for scalar exponents {0,1,2,
0.5,-1.0}, and switches f32 to single-precision `MathF.Pow` (no f64 round-trip).
Audit-v2 tickets resolved:
- T1.3a — np.power(sliced_int32, int32) no longer crashes
- T1.3b — np.power(broadcast_int32, int32) no longer crashes
- T1.36 — int**(-int) now raises ArgumentException matching NumPy ValueError
What changed
============
NEW: src/NumSharp.Core/Utilities/NpyIntegerPower.cs
Public exponentiation-by-squaring helpers for all 9 integer NumSharp dtypes
(sbyte/byte/int16/uint16/char/int32/uint32/int64/uint64) — preserves
dtype-native wraparound (uint8 ** 8 = 0, 15**15 = 437893890380859375).
Caller validates non-negative exponent.
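As a rough illustration of exponentiation by squaring with dtype-native wraparound, the following plain-Python sketch emulates fixed-width integers with masking. It is not the `NpyIntegerPower` code; `wrap` and `int_pow` are hypothetical names, and the uint8 case assumes a base of 2, which the text does not state.

```python
def wrap(value, bits, signed):
    """Reduce a Python int to a fixed-width two's-complement value."""
    value &= (1 << bits) - 1
    if signed and value >= 1 << (bits - 1):
        value -= 1 << bits
    return value

def int_pow(base, exp, bits, signed):
    """Exponentiation by squaring, wrapping after every multiply."""
    if exp < 0:
        raise ValueError("caller must validate a non-negative exponent")
    result = 1
    base = wrap(base, bits, signed)
    while exp:
        if exp & 1:
            result = wrap(result * base, bits, signed)
        exp >>= 1
        if exp:                      # skip the final, unused squaring
            base = wrap(base * base, bits, signed)
    return result

print(int_pow(2, 8, bits=8, signed=False))     # wraps past the uint8 range
print(int_pow(15, 15, bits=64, signed=True))   # still exact within int64
print(int_pow(2, 31, bits=32, signed=True))    # wraps to the int32 minimum
```

The loop performs O(log exp) multiplies, and because every intermediate product is wrapped, the result matches what repeated multiplication in the native integer type would produce.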
REWRITE: src/NumSharp.Core/Backends/Default/Math/Default.Power.cs
- Removes the `Unsafe.Address`-based fast-path that crashed on
sliced/broadcast operands and ignored strides.
- Adds pre-scan: for `int**int` with signed-int exponent, scans rhs for
any negative element and throws `ArgumentException("Integers to negative
integer powers are not allowed.")`. Matches NumPy's unconditional check
(rejects base ∈ {±1} too, per NumPy spec).
- Adds scalar-exponent fast paths when `rhs.size == 1`:
exp = 0 → ones_like(lhs)
exp = 1 → lhs.copy() (or cast)
exp = 2 → lhs * lhs (SIMD-optimized Multiply kernel)
exp = 0.5 → np.sqrt(lhs)
exp =-1.0 → np.reciprocal(lhs) (float base only)
Each path verifies the resolved result dtype matches what the IL kernel
would produce before substituting, so NEP50 promotion is preserved.
- Falls through to `ExecuteBinaryOp` for the general case, which now
walks strides correctly via the IL kernel paths.
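The pre-scan and the scalar-exponent fast-path selection listed above could be sketched along these lines. This is plain Python with hypothetical names (`check_integer_exponents`, `pick_fast_path`); it returns the name of the substituted operation as a string and deliberately omits the result-dtype check that the text says each path performs.

```python
def check_integer_exponents(exponents):
    """Pre-scan for int ** int: any negative exponent is rejected,
    regardless of the base (including +1 and -1)."""
    if any(e < 0 for e in exponents):
        raise ValueError("Integers to negative integer powers are not allowed.")

def pick_fast_path(exponent, base_is_float):
    """Choose a substitute operation for a scalar exponent, or 'general'."""
    if exponent == 0:
        return "ones_like"
    if exponent == 1:
        return "copy"
    if exponent == 2:
        return "multiply"
    if exponent == 0.5:
        return "sqrt"
    if exponent == -1.0 and base_is_float:
        return "reciprocal"
    return "general"

try:
    check_integer_exponents([3, 0, -1])
    rejected = False
except ValueError:
    rejected = True

print(rejected)
print(pick_fast_path(2, base_is_float=False))
print(pick_fast_path(-1.0, base_is_float=False))
```

Note that an exponent of -1.0 with an integer base falls through to the general path in this sketch, matching the "float base only" restriction in the list.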
src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.cs
- `EmitPowerOperation(il, resultType)`: dispatches to the matching
`NpyIntegerPower.Pow*` helper for integer result types (replaces the
`int → double → Math.Pow → int` round-trip that lost precision for
values outside [-2^53, 2^53]). float32 → `MathF.Pow`; float64 →
`Math.Pow`; Boolean and other fallthrough types use the original double
round-trip to keep the kernel verifiable.
- Cached `MethodInfo` lookups added for all 9 integer power helpers and
`MathF.Pow`.
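The precision loss from routing integer power through a double can be reproduced in a few lines of plain Python, used here only as an analogy for the `int → double → Math.Pow → int` round-trip; the variable names are illustrative.

```python
import math

# A double has a 53-bit significand, so it cannot represent every
# integer above 2**53. 3**39 is odd and larger than 2**53, yet it
# still fits in a signed 64-bit integer.
exact = 3 ** 39
via_double = int(math.pow(3, 39))    # integer -> double -> pow -> integer

fits_int64 = exact < 2 ** 63
lost_precision = via_double != exact

# The first integer a double cannot hold exactly:
rounded = int(float(2 ** 53 + 1))

print(fits_int64, lost_precision, rounded == 2 ** 53)
```

An exact integer kernel avoids this because no intermediate value ever leaves the integer domain.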
src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Binary.cs
- `EmitPowerOperation<T>(il)` (same-type contiguous kernel path):
same dispatch as the mixed-type version. Generic `T` is mapped to the
matching `NpyIntegerPower.Pow*` helper via `GetIntegerPowMethod<T>()`.
src/NumSharp.Core/Backends/Default/Math/DefaultEngine.BinaryOp.cs
- Updates the Power promotion comment to document NEP50 weak/strict
behavior accurately (NumSharp matches NumPy in the common cases; the
one documented misalignment is 0-D integer arrays explicitly constructed
via `np.array(2, int32)`, which are indistinguishable from C# `int 2`
after `np.asanyarray`).
Tests
=====
NEW: test/NumSharp.UnitTest/Math/NDArray.power.Comprehensive.Test.cs (24 tests)
- Integer dtype-native wrapping (uint8/int8/int32 overflow)
- Stride + broadcast layouts (sliced, broadcast_to, 2D-vs-1D)
- Signed integer negative exponent throws (incl. base = ±1)
- Unsigned integer exponent never throws
- Float special values (0^0, NaN, ±inf base/exp, fractional neg base)
- NEP50 promotion (f32 ** int{8,16,32}, f64 ** int*, f32 ** scalar)
- All 9 integer dtypes smoke-tested via 2^3 = 8
REMOVED [Misaligned]: PowerEdgeCaseTests.Power_Integer_LargeValues
Integer power now preserves exact precision; the test now asserts equality.
UPDATED: NewDtypesCoverageSweep_Arithmetic_Tests.B35_SByte_Power_NegativeExponent*
Previously documented the wrong (silent 0/±1) behavior; now asserts the
NumPy-correct ArgumentException.
UPDATED (removed [OpenBugs]):
- AuditV2_MathReductions.T1_3a_Power_SlicedInt32_ShouldNotCrash
- AuditV2_MathReductions.T1_3b_Power_BroadcastInt32_ShouldNotCrash
- AuditV2_ILKernelSimd.T1_36_* (4 tests)
Validation
==========
cd test/NumSharp.UnitTest
dotnet test --no-build --framework net10.0 \
--filter "TestCategory!=OpenBugs&TestCategory!=HighMemory"
→ Passed: 8255, Failed: 0
dotnet test --no-build --framework net10.0 \
--filter "FullyQualifiedName~Power"
→ Passed: 129, Failed: 0
Microbench (1M-element float32, x100 iterations):
power(arr, 2) 121ms (fast path → mul; matches multiply baseline 117ms)
power(arr, 0.5) 121ms (fast path → sqrt)
power(arr, 2.7) 518ms (general path via MathF.Pow)
Behavior changes vs. prior NumSharp
===================================
- int**(-int) now throws (was: silently returned 0, 1, or -1).
Matches NumPy 2.4.2 ValueError exactly.
Adds the iterator-subsystem branch audit documents that drove this
branch's bug-fix and refactor work:
- `NDITER_BRANCH_QUALITY_AUDIT.md` — original (V1) audit walking every
changed src file and ranking findings by severity (bugs → perf →
parity gaps → refactors → clean review). Bug catalog includes:
np.maximum/minimum NaN handling, np.power stride mishandling,
np.searchsorted incompleteness, np.repeat missing axis, NpyIter
Iternext+EXLOOP path, nan{mean,std,var} perf, np.argsort LINQ perf,
linspace/eye boxing.
- `NDITER_BRANCH_QUALITY_AUDIT_V2.md` — V2 (fact-check) audit driven by
8 parallel agents auditing file-by-file with results verified via
`python -c` against NumPy 2.4.2 and `dotnet_run` against NumSharp.
60 of 65 Tier 1 findings confirmed with failing OpenBugs reproducers
written under `test/NumSharp.UnitTest/AuditV2/AuditV2_*.cs`, plus a
list of 4 false positives and 4 newly discovered bugs.
- `docs/plans/audit_v2/01..08*.md` — per-batch audit chapters, each
including: file scope tables (LoC + role), reference to NumPy source,
reproduction commands, line-precise references, and a findings table
with severity tags (bug / parity-gap / perf / refactor / clean).
Chapters cover Iterators, ILKernel+SIMD, Default math/reductions,
Logic+Shape+Storage, NDArray creation, Manipulation APIs+logic,
Math ops + selection/sorting/stats, and Casting+random+utilities.
These files are pure documentation and contain no code; they're the
reference material for the bug fixes and tests that follow on the
nditer branch.
Adds the per-batch test classes that the V2 audit fact-check pass produced to
back its Tier 1 findings with concrete failing tests. Tests are marked
`[OpenBugs]` so CI skips them until the underlying defect is fixed; running
them locally with `TestCategory=OpenBugs` documents each bug's current
behavior versus NumPy 2.4.2.

Each test references both the master `NDITER_BRANCH_QUALITY_AUDIT_V2.md` and
the matching `docs/plans/audit_v2/XX_*.md` chapter where the finding is
documented in detail, and includes the file:line of the suspected defect plus
a `python -c` NumPy 2.4.2 expectation.

Test classes added (matching the 6 untracked batches):
- `AuditV2_Iterators.cs` — NpyIter Iternext/EXLOOP issues, buffer refill,
  cast support gaps, NDIterator broadcast strides, etc. (Batch 1).
- `AuditV2_LogicShapeStorage.cs` — Shape mutating indexer on a readonly
  struct, storage and logic op edge cases (Batch 4).
- `AuditV2_NDArrayCreation.cs` — `np.array(NDArray, copy=false)` default
  aliasing, creation API edge cases (Batch 5).
- `AuditV2_ManipulationApis.cs` — np.expand_dims on empty, manipulation
  parity gaps (Batch 6).
- `AuditV2_MathSelectionSorting.cs` — SetIndicesNDNonLinear NIE,
  math/selection/sort bugs (Batch 7).
- `AuditV2_CastingRandomUtilities.cs` — NpFunc/cast/random/utilities bugs
  (Batch 8).

Batches 2 (`AuditV2_ILKernelSimd.cs`) and 3 (`AuditV2_MathReductions.cs`)
already exist on the branch; this commit fills the remaining 6. Build is
verified to pass with the new files included.
Updates `.claude/CLAUDE.md` so the project instructions match the code's
current state:
- "C-order only" entry replaced with "Order-aware layout": Shape tracks
F-contiguity, and APIs with an `order` parameter resolve NumPy
`C`/`F`/`A`/`K` through `OrderResolver`. Verified by:
- `Shape.IsFContiguous` flag (`View/Shape.cs:115-118`)
- `Shape.Order` property (`View/Shape.cs:437`)
- F-aware construction (`View/Shape.cs:160`)
- `F_CONTIGUOUS` entry in the flags table updated from "Reserved" to
"Data is column-major contiguous" (matches `ArrayFlags.F_CONTIGUOUS`
bit `0x0002` in `View/Shape.cs:24`).
- Added `IsFContiguous — O(1) check via F_CONTIGUOUS flag` to the
key Shape properties list.
- Missing Functions count corrected from 19 → 18 and `np.where` removed
from the Selection gap because `APIs/np.where.cs` implements it; new
`### Selection` section under "Supported np.* APIs" lists `where`.
- Iterators path updated from `MultiIterator.cs` to `NpyIter.cs` and
`NpyExpr.cs` (verified — `MultiIterator` no longer exists; only
`NDIterator`, `NpyIter`, `NpyExpr` are present in `Backends/Iterators`).
- Q&A entries for NDIterator and NpyIter rewritten to match the current
legacy-wrapper / NumPy-aligned multi-operand iterator split.
Pure documentation change — no behavioral impact.
…per / Memory block

Multiple `CopyTo` overloads in the unmanaged memory layer were calling
`Buffer.MemoryCopy(...)` with source/destination swapped — the BCL signature
is `MemoryCopy(void* source, void* destination, long destinationSizeInBytes,
long sourceBytesToCopy)`, but the existing code passed the destination pointer
first. The result was that data was copied *from the destination buffer into
the source slice*, silently corrupting the caller's source data instead of
populating the destination.

ArraySlice`1.cs:
- `TryCopyTo(Span<T>)`, `CopyTo(Span<T>)`, `IArraySlice.CopyTo<T1>(Span<T1>)`,
  `IArraySlice.CopyTo<T1>(UnmanagedSpan<T1>)`: swap source / destination
  pointers so data flows source→destination.
- `CopyTo(IntPtr dst)`: also fix the byte-size argument — previous code passed
  `Count` (element count) for both destination size and bytes to copy, leaving
  non-byte dtypes with under-counted bounds. Replace with `Count * ItemLength`
  for both byte arguments and flip the source / destination order.
- `CopyTo(IntPtr dst, long sourceOffset, long sourceCount)`: this overload was
  previously identical to `CopyTo(IntPtr dst)` (ignored the offset arguments).
  Add `sourceOffset` / `sourceCount` bounds checks, honour `sourceOffset` when
  computing the source pointer, and use `sourceCount * ItemLength` for the
  copy.
- `CopyTo(Span<T>, long sourceOffset, long sourceLength)`: previous body
  recursed into itself (`CopyTo(destination, sourceOffset, sourceLength);`)
  causing a stack overflow. Replace with a bounds-checked `Buffer.MemoryCopy`
  from `Address + sourceOffset`.
- `CopyTo(UnmanagedSpan<T>, long sourceOffset, long sourceLength)`: same
  direction swap as above.
- `IArraySlice.CopyTo<T1>(Span<T1>)` /
  `IArraySlice.CopyTo<T1>(UnmanagedSpan<T1>)`: bytes-based comparison
  (`Count * ItemLength` vs `destination.Length * InfoOf<T1>.Size`) instead of
  element-count comparison, fixing the "destination too short" check for
  reinterpret-cast cases.
- `IArraySlice.Clone<T1>()`: previous code used
  `UnmanagedMemoryBlock<T1>.Copy(Address, Count)` which treats `Count` as the
  *T1* element count while reading from a `T`-element buffer. Now compute the
  byte size and divide by `InfoOf<T1>.Size` so the clone preserves the whole
  byte payload (with a hard error if the byte count is not a multiple of the
  target element size).

UnmanagedHelper.cs:
- `CopyTo(IMemoryBlock src, IMemoryBlock dst, long countOffsetDestination)`:
  validate `countOffsetDestination` against `dst.Count` and ensure the source
  fits in the *remaining* destination capacity. Fix the destination-size
  argument to `(dst.Count - countOffsetDestination) * dst.ItemLength` instead
  of the source byte length (which under-counts by the offset when the
  destination is just big enough).

UnmanagedMemoryBlock`1.cs:
- `CopyTo(UnmanagedMemoryBlock<T> memoryBlock, long arrayIndex)`: swap
  pointers so data is copied source→destination
  (`memoryBlock.Address + arrayIndex` as dst), add null + bounds checks, and
  use the remaining destination capacity for the destination size argument.

All fixes are direct corrections of misuses of `Buffer.MemoryCopy`'s
signature; behavior for legitimate callers now matches the docstrings. The
added regression tests live under
`test/NumSharp.UnitTest/Backends/CloneRegressionTests.cs` (separate commit)
and call each repaired overload to lock the contract in place.
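The effect of the swapped arguments, and of passing an element count where a byte count is expected, can be modelled with byte arrays. This is a plain-Python stand-in, not the .NET API: `memory_copy` only imitates the (source, destination, destination size, bytes to copy) parameter order, and `copy_to_swapped` / `copy_to_fixed` are hypothetical names that combine both bug patterns from the list above in one place.

```python
def memory_copy(source, destination, destination_size_in_bytes, source_bytes_to_copy):
    """Model of a raw copy taking (source, destination, dest size, bytes to copy)."""
    if source_bytes_to_copy > destination_size_in_bytes:
        raise ValueError("destination too small")
    destination[:source_bytes_to_copy] = source[:source_bytes_to_copy]

def copy_to_swapped(src, dst, count, item_length):
    # Bug pattern: destination passed first, and the element count used as bytes.
    memory_copy(dst, src, count, count)

def copy_to_fixed(src, dst, count, item_length):
    # Fixed pattern: source first, sizes expressed in bytes.
    n_bytes = count * item_length
    memory_copy(src, dst, n_bytes, n_bytes)

def int32_block(values):
    return bytearray(b"".join(v.to_bytes(4, "little") for v in values))

src_a, dst_a = int32_block([1, 2, 3]), bytearray(12)
copy_to_swapped(src_a, dst_a, count=3, item_length=4)   # corrupts src_a

src_b, dst_b = int32_block([1, 2, 3]), bytearray(12)
copy_to_fixed(src_b, dst_b, count=3, item_length=4)     # fills dst_b

print(bytes(src_a), bytes(dst_a))
print(bytes(src_b), bytes(dst_b))
```

In the swapped case the destination stays empty and the first bytes of the source are overwritten with zeros, which is the silent corruption the commit describes.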
….Clone bugs
Shape.cs:
The `Clone(deep, unview, unbroadcast)` branch logic was inconsistent and
dropped the `offset`/`bufferSize` of scalar views in the most common
call (`Clone()` with default args). Rewrite the cascade so behavior is
predictable:
- Empty shape → `default`.
- Scalar shape:
- `unview` or `unbroadcast` → return the static `Scalar` (offset=0).
- Otherwise honour `deep`: copy-constructor preserves both `offset`
and `bufferSize` for sliced scalar views like `np.arange(10)["5"]`.
- Non-scalar shape:
- `!deep && !unview && !unbroadcast` → return `this` (the readonly
struct copy is itself a value-copy).
- `unview` or `unbroadcast` → `new Shape((long[])dimensions.Clone())`,
which the constructor fills with C-contiguous strides (no offset).
This replaces the previous one-off `ComputeContiguousStrides` /
`deep && !unbroadcast` mixed branches that produced different
shapes depending on call combination.
- Plain `deep` → deep copy via the copy constructor.
Old behavior failure: `scalar.Shape.Clone()` on `np.arange(10)["5"]`
returned the canonical `Scalar` shape with `offset == 0`, hiding the
fact that the data lives at index 5. The regression test
`Shape_Clone_PreservesScalarViewOffset` in `CloneRegressionTests`
locks the fix.
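The rewritten cascade reads naturally as a short decision function. The sketch below is plain Python over a simplified, hypothetical `Shape` dataclass (the `is_empty` flag and field names are illustrative), and it assumes the default arguments are `deep=True, unview=False, unbroadcast=False`, which the text implies but does not state.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Shape:
    dims: tuple
    strides: tuple
    offset: int = 0
    buffer_size: int = 0
    is_empty: bool = False

EMPTY = Shape(dims=(), strides=(), is_empty=True)   # stands in for `default`
SCALAR = Shape(dims=(), strides=())                 # canonical scalar, offset 0

def c_strides(dims):
    """C-contiguous element strides for the given dimensions."""
    strides, step = [], 1
    for n in reversed(dims):
        strides.append(step)
        step *= n
    return tuple(reversed(strides))

def clone(shape, deep=True, unview=False, unbroadcast=False):
    if shape.is_empty:
        return EMPTY
    if shape.dims == ():                             # scalar shape
        if unview or unbroadcast:
            return SCALAR
        return replace(shape) if deep else shape     # keeps offset / buffer_size
    if not deep and not unview and not unbroadcast:
        return shape                                 # value copy is enough
    if unview or unbroadcast:
        return Shape(dims=shape.dims, strides=c_strides(shape.dims))
    return replace(shape)                            # plain deep copy

scalar_view = Shape(dims=(), strides=(), offset=5, buffer_size=10)
sliced = Shape(dims=(2, 3), strides=(1, 2), offset=4, buffer_size=12)
print(clone(scalar_view).offset)
print(clone(sliced, unview=True))
```

The first print corresponds to the `np.arange(10)["5"]` case: a default clone of the scalar view keeps offset 5 instead of collapsing to the canonical scalar.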
ArrayConvert.cs:
- `Clone(Array sourceArray)` had two issues:
1. It walked `elementType.IsArray` past the array's actual element
type, so a jagged `int[][]` was treated as a flat `int[]` and the
subsequent `Array.Copy` produced wrong results (or threw). Now the
immediate element type is used, preserving jaggedness.
2. Arrays with a non-zero lower bound (created via
`Array.CreateInstance(elementType, lengths, lowerBounds)`) were
not supported — they fell through to the multi-dim branch with
all-zero lower bounds. Capture each axis' lower bound and use
`Array.CreateInstance(elementType, lengths, lowerBounds)` whenever
the source is multi-dim or has any non-zero lower bound.
- `Clone<T>(T[,,,] sourceArray)` had a `GetLength(4)` typo for what
should be the fourth (zero-indexed: 3) dimension. `GetLength(4)`
throws `IndexOutOfRangeException` for any 4-D array. Changed to
`GetLength(3)`. (Coverage: `CloneRegressionTests
.ArrayConvert_Clone_FourDimensionalArray_UsesFourthDimensionLength`.)
…ontig

NDArray (`Backends/NDArray.cs`):
- All three `UnmanagedStorage`-based constructors now back-fill the engine
  when storage doesn't already have one, and mirror the chosen engine onto
  `Storage.Engine` so the array and storage stay in sync. Previously
  `Storage.Engine` could be null while the NDArray reported a valid
  `TensorEngine`, leaking back through chained constructors that read
  storage.Engine directly.
- `TensorEngine` setter now propagates the resolved engine to
  `Storage.Engine` so changing the engine on an NDArray cascades to
  storage-side consumers.
- `Clone()` is now `virtual` and uses the property setter (instead of the
  private field) so engine assignment propagates to storage.
  `NDArray<TDType>.Clone()` overrides it to preserve the typed wrapper —
  before this commit, `((NDArray<int>)x).Clone()` returned the non-generic
  NDArray base type, breaking generic callers (see
  `CloneRegressionTests.NDArray_Clone_PreservesGenericRuntimeType`).
- `View`/`GetData(int[])`/`GetData(long[])`/the foreach yield path all switch
  from setting the private `tensorEngine` field to the property, so storage
  gets the engine too.

UnmanagedStorage (`Backends/Unmanaged/UnmanagedStorage.cs`):
- `CreateBroadcastedUnsafe(...)` now copies `storage.Engine` onto the new
  broadcast view.

UnmanagedStorage cloning (`Backends/Unmanaged/UnmanagedStorage.Cloning.cs`):
- All `Alias(...)` overloads, the `Slice` builder, both `Cast<T>` /
  `Cast(typeCode)`, both `CastIfNecessary<T>` / `CastIfNecessary(typeCode)`,
  and the empty-storage clone now propagate `Engine`.
- Cast correctness fix: `Cast<T>` / `Cast(typeCode)` / `CastIfNecessary<T>` /
  `CastIfNecessary(typeCode)` used to cast the raw backing array via
  `InternalArray.CastTo(...)`. For strided or F-contiguous views that buffer
  holds elements in the *physical* layout, so the cast result contained
  values in the wrong logical order. They now run `CloneData()` first — which
  materializes the logical element sequence (via `NpyIter.Copy` for
  non-contiguous paths) — and cast that, so casting an F-contiguous view of
  `np.arange(6).reshape(2,3).T` yields the same values NumPy produces.
  (Verified by `CloneRegressionTests
  .UnmanagedStorage_CastGeneric_FContiguousSource_CopiesLogicalValuesAndEngine`
  and siblings.)
- `Clone()` gains a fast `CanCloneRawLayout()` path: when the storage owns
  its buffer (no offset, no broadcast, no buffer/size mismatch) and is either
  C- or F-contiguous, the underlying ArraySlice is cloned raw and the same
  `Shape` is reused. Non-trivial slices and scalar views still fall back to
  `CloneData()`. Empty storages return a new typed empty storage and preserve
  the engine instead of trying to clone a null buffer.
- `CastIfNecessary` early-return for same-dtype skips the `IsEmpty` check so
  empty storages of the requested dtype don't re-materialize.
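One way to picture the `CanCloneRawLayout()` condition is as a predicate over shape metadata. The following is a guess at its general form based only on the description above, written in plain Python with hypothetical names; it treats strides in elements, ignores size-1 axes subtleties, and does not handle the scalar-view exclusion mentioned in the text.

```python
from math import prod

def c_strides(dims):
    """Row-major (C) element strides."""
    strides, step = [], 1
    for n in reversed(dims):
        strides.append(step)
        step *= n
    return tuple(reversed(strides))

def f_strides(dims):
    """Column-major (F) element strides."""
    strides, step = [], 1
    for n in dims:
        strides.append(step)
        step *= n
    return tuple(strides)

def can_clone_raw_layout(dims, strides, offset, buffer_size):
    """True when the buffer can be copied byte-for-byte and the shape reused."""
    size = prod(dims)
    if offset != 0 or buffer_size != size:
        return False          # sliced view, or buffer larger than the view
    # Broadcast axes have stride 0 and never match either contiguous pattern.
    return tuple(strides) in (c_strides(dims), f_strides(dims))

# arange(6).reshape(2, 3).T: shape (3, 2), strides (1, 3), F-contiguous.
print(can_clone_raw_layout((3, 2), (1, 3), offset=0, buffer_size=6))
# arange(9)[::2]: five elements over a nine-element buffer.
print(can_clone_raw_layout((5,), (2,), offset=0, buffer_size=9))
```

When the predicate holds, the physical bytes plus the unchanged strides already describe the same logical array, so no element-by-element walk is needed.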
The DefaultEngine helpers for `astype` and `transpose` created new `NDArray`
instances via the `UnmanagedStorage`-only constructor, which falls back to
`BackendFactory.GetEngine()`. Code that explicitly set `nd.TensorEngine` on
the source (e.g. tests pinning a custom engine) would silently see its engine
swapped for the default after a cast or transpose.

`Default.Cast.cs` (`DefaultEngine.Cast`):
- Capture `nd.TensorEngine` once at the top.
- Empty/scalar/`(1,)` early returns now carry `engine` forward both on the
  returned `NDArray` and on `nd.Storage` (when reused in-place).
- Both `copy` and in-place branches of the generic cast attach
  `TensorEngine = engine` to the resulting NDArray and to the re-assigned
  `nd.Storage`.

`Default.Transpose.cs` (`DefaultEngine.TransposeAlias`):
- The transpose alias returned via `Storage.Alias(newShape)` now carries
  `nd.TensorEngine` so transposed views keep their engine. Without this the
  call dropped back to the global default, breaking propagation through
  compounded operations.

Coverage: `CloneRegressionTests.NDArray_AstypeCopyPath_PreservesTensorEngine`
and the engine-propagation siblings.
…/ ravel paths
All paths that build a fresh `NDArray` from an existing storage now
preserve the caller's `TensorEngine`. Previously the `NDArray
(UnmanagedStorage)` constructor would fall back to
`BackendFactory.GetEngine()` when the supplied storage didn't carry an
engine (which is common after `Clone()`/`Alias()`/`CloneData()`).
`Creation/NDArray.Copy.cs` (`copy(char physical)`):
- The C-order shortcut now requires the source to already be
C-contiguous. Before, `copy('C')` on an F-contiguous view returned a
`Clone()` whose shape preserved the F-strides — yielding a
non-C-contiguous "copy". Now any non-C-contiguous source falls
through to the iterator-driven materialization path.
- The destination shape uses the requested `physical` order instead of
hard-coding `'F'`. Combined with the fix above this gives correct
C/F selection regardless of source layout.
- Destination NDArray carries `TensorEngine = TensorEngine` of the
source. Coverage:
`CloneRegressionTests.NDArray_CopyCOrder_FromFContiguousSource_ProducesCContiguousCopy`
and `NDArray_CopyFOrder_PreservesTensorEngine`.
`Creation/NdArray.ReShape.cs`:
- The F-order reshape return (`fFlat.Storage.InternalArray`-backed
storage) and both non-contiguous fallback paths
(`new NDArray(CloneData(), Shape.Clean())`) now attach the source
`TensorEngine`. Coverage:
`CloneRegressionTests.NDArray_ReshapeCopyPath_PreservesTensorEngine`.
`Creation/np.array.cs`:
- `np.array(nd, copy)` propagates `nd.TensorEngine` for both the
alias (`copy=false`) and clone (`copy=true`) paths. Coverage:
`NpArray_FromNDArray_PreservesTensorEngineForAliasAndCopy`.
`Manipulation/np.expand_dims.cs`, `Manipulation/np.ravel.cs`,
`Manipulation/NDArray.flatten.cs`:
- The view (`Storage.Alias(...)`) and materialize (`CloneData()`)
branches both forward the source `TensorEngine`.
No semantic API changes other than the `copy('C')` correctness fix
above; engine propagation is a transparent improvement.
NumPy's np.where iterator allocates the result with an order chosen
from the *full-size* operands' contiguity flags:
- Any full-size, multi-dim operand that is C-contiguous (but not F)
→ output is C-contiguous.
- All full-size, multi-dim operands that are F-contiguous (and at least
one is strictly F, not also C) → output is F-contiguous.
- Operands that are scalar, 1-D, or broadcasted do not vote.
- Mixed C/F (or any full-size non-contiguous operand) → fall back to C.
Verified against NumPy 2.4.2:
f = np.arange(12).reshape(3,4).T # F-contig view
np.where(f > 5, f, 0) .flags # F_CONTIGUOUS=True
np.where(f > 5, f.copy('C'), f) .flags # C_CONTIGUOUS=True
np.where(np.array([True,False,True]), f, 0).flags # F_CONTIGUOUS=True
Previously `np.where` always allocated the output as C-contiguous,
losing the F layout that NumPy preserves for F-contiguous inputs.
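The voting rules above can be sketched as a small function over operand metadata. This is plain Python, not the `ResolveWhereOrder` implementation; the `Operand` record and `resolve_where_order` are hypothetical, and an operand is treated as "full-size" when it has at least two dimensions and its shape equals the result shape.

```python
from dataclasses import dataclass

@dataclass
class Operand:
    shape: tuple
    c_contig: bool
    f_contig: bool

def resolve_where_order(result_shape, *operands):
    """Return 'C' or 'F' for the output, following the rules listed above."""
    # Scalars, 1-D arrays and broadcast operands do not vote.
    voters = [op for op in operands
              if len(op.shape) >= 2 and op.shape == result_shape]
    if not voters:
        return "C"
    if any(not (op.c_contig or op.f_contig) for op in voters):
        return "C"                      # any full-size non-contiguous operand
    if any(op.c_contig and not op.f_contig for op in voters):
        return "C"                      # a strictly-C voter (covers mixed C/F)
    if any(op.f_contig and not op.c_contig for op in voters):
        return "F"                      # all F-contiguous, one strictly F
    return "C"

f = Operand((4, 3), c_contig=False, f_contig=True)    # arange(12).reshape(3,4).T
c = Operand((4, 3), c_contig=True, f_contig=False)    # f.copy('C')
scalar = Operand((), c_contig=True, f_contig=True)    # the literal 0
cond_1d = Operand((3,), c_contig=True, f_contig=True)

case_all_f = resolve_where_order((4, 3), f, f, scalar)
case_mixed = resolve_where_order((4, 3), f, c, f)
case_broadcast_cond = resolve_where_order((4, 3), cond_1d, f, scalar)
print(case_all_f, case_mixed, case_broadcast_cond)
```

The three cases correspond to the three NumPy 2.4.2 checks quoted above, with the comparison result `f > 5` modelled as another F-contiguous operand.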
`APIs/np.where.cs`:
- New `ResolveWhereOrder(params NDArray[] operands)` mirrors the rules
above. Returns 'C' or 'F'.
- The result `Shape` is now constructed via `new Shape((long[])cond
.shape.Clone(), resultOrder)` so the resulting strides match the
resolved order.
- Drop the `NpFunc.Invoke(outType, WhereImpl<int>, ...)` generic
dispatch: the actual `WhereImpl` body never used its `T` type
parameter (the iterator-driven IL kernel keys off the runtime dtype
string), so the switch-per-dtype indirection was dead weight. Replace
with a direct non-generic `WhereImpl(cond, x, y, result)` call.
`test/NumSharp.UnitTest/Logic/np.where.BattleTest.cs`:
- New "Output Layout" region with three NumPy-anchored tests:
* `Where_FContiguousInputs_ResultIsFContiguous`
* `Where_MixedCAndFInputs_ResultFallsBackToC`
* `Where_BroadcastConditionWithFInput_ResultIsFContiguous`
…sh order tests
NumPy's `np.tile(A, reps)` keeps the source memory order on the "no
expansion" shortcut (`tup == (1,)*outDim`): F-contiguous input stays
F-contiguous, C-contiguous input stays C-contiguous, and views with
strides outside C/F materialize as C-contiguous. Verified against
NumPy 2.4.2:
src = np.arange(12).reshape(3, 4).T # F-contig
np.tile(src, (1, 1)).flags # F_CONTIGUOUS=True
np.tile(src, ()).flags # F_CONTIGUOUS=True
np.tile(np.arange(12).reshape(3, 4)[:, ::-1], (1, 1)).flags
# C_CONTIGUOUS=True
`Manipulation/np.tile.cs`:
- The all-ones shortcut previously called `A.copy()` which defaults to
`'C'` — silently flipping F-contiguous inputs to C-contiguous output.
Replace with `A.copy('K')` (and the reshape variant gets the same
treatment) so `OrderResolver.Resolve('K', shape)` picks the source's
physical order. The comment is updated to describe the keep-order
semantics.
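The keep-order rule for the no-expansion shortcut can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the NumSharp `OrderResolver` code: the helper names `contiguous_strides` and `resolve_keep_order` are hypothetical, strides are counted in elements rather than bytes, and size-1 axes are not special-cased.

```python
def contiguous_strides(shape, order):
    """Element strides of a dense array with the given shape in 'C' or 'F' order."""
    strides = [0] * len(shape)
    step = 1
    # C order: last axis varies fastest. F order: first axis varies fastest.
    axes = range(len(shape) - 1, -1, -1) if order == "C" else range(len(shape))
    for axis in axes:
        strides[axis] = step
        step *= shape[axis]
    return tuple(strides)


def resolve_keep_order(shape, strides):
    """Hypothetical stand-in for resolving 'K' on the all-ones tile shortcut."""
    if tuple(strides) == contiguous_strides(shape, "C"):
        return "C"
    if tuple(strides) == contiguous_strides(shape, "F"):
        return "F"
    return "C"  # strides outside C/F: materialize as C-contiguous


# np.arange(12).reshape(3, 4).T has shape (4, 3) and element strides (1, 4)
f_source = resolve_keep_order((4, 3), (1, 4))
# np.arange(12).reshape(3, 4) has shape (3, 4) and element strides (4, 1)
c_source = resolve_keep_order((3, 4), (4, 1))
# np.arange(12).reshape(3, 4)[:, ::-1] has element strides (4, -1)
reversed_view = resolve_keep_order((3, 4), (4, -1))
```

The three calls correspond to the three NumPy cases listed in the commit message: F stays F, C stays C, and the reversed view falls back to C.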
`test/NumSharp.UnitTest/Manipulation/np.tile.Test.cs`:
- Three new tests covering the F-contig preservation, the `np.tile(a)`
no-reps overload, and the non-contiguous fall-back. Each test also
verifies element values against the source via index-based reads to
guard against logical-order regressions.
`test/NumSharp.UnitTest/View/OrderSupport.OpenBugs.Tests.cs`:
- `Tile_ApiGap` is renamed to `Tile_RepeatsLastAxis_ValuesMatchNumPy`
and its assertion stays — the API gap comment is replaced with the
matching NumPy reference output. Header rewritten from
"Missing functions" to "Manipulation helpers" since this section now
documents working APIs.
- `Where_ApiGap` (previously `[OpenBugs]` because np.where was thought
missing) is now `Where_FContig_PreservesFContig`. It asserts that
`np.where(f_arr > 5, f_arr, 0)` returns an F-contiguous result on
F-contiguous input — the same property covered by the new where
layout tests in the prior commit. The `[OpenBugs]` attribute is
removed because the feature exists and now matches NumPy.
…IterBattleTests
`benchmark/NumSharp.Benchmark.Exploration/Program.cs`:
- `Options.Clone()` reused the same `RemainingArgs` `string[]` reference on
the cloned `Options` instance. Any post-clone mutation of the array (or its
elements via index assignment) would have leaked back to the original
`Options`. Clone the array (`(string[])RemainingArgs.Clone()`) so the two
`Options` instances are independent.
`test/NumSharp.UnitTest/Backends/Iterators/NpyIterBattleTests.cs`:
- Remove a single trailing blank line at end of file. No code change.
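The aliasing problem behind the `Options.Clone()` fix is language-independent, so here is a small Python analog. It is only an analogy: the `Options` class, its `remaining_args` field, and the two clone methods below are hypothetical stand-ins for the C# type, not a port of it.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Options:
    remaining_args: list = field(default_factory=list)

    def shallow_clone(self):
        # Copies the object but keeps the same list reference (the old behavior).
        return copy.copy(self)

    def clone(self):
        # Copies the list as well, so the two instances are independent.
        return Options(remaining_args=list(self.remaining_args))


original = Options(["--size", "10"])

leaky = original.shallow_clone()
leaky.remaining_args[1] = "99"      # also changes original.remaining_args[1]
leaked_value = original.remaining_args[1]

independent = original.clone()
independent.remaining_args[1] = "7"  # original is untouched this time
kept_value = original.remaining_args[1]
```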
… after RemoveAxis
`NpyIter.Clone()` (in `Backends/Iterators/NpyIter.cs`) previously copied the
`Buffers[op]` pointer field directly from the source state, so the original
and the cloned iterator shared the *same* per-operand buffer. After
`Iternext()` on either iterator the writes from one would clobber the other's
data, and disposing one would free the buffer out from under the other.
The fix:
- After copying the operand metadata (`ElementSizes`, `BufStrides`, etc.),
allocate a fresh buffer per operand via
`NpyIterBufferManager.AllocateAligned(BufferSize, opDtype)` and
`Buffer.MemoryCopy` the source bytes into it. If allocation fails the catch
block calls `NpyIterBufferManager.FreeBuffers` for buffered states before
releasing dim arrays + state memory.
- `DataPtrs[op]` is fixed up: if the source `DataPtrs[op]` pointed into the
original `Buffers[op]` byte range we translate the offset onto the newly
allocated buffer so iteration continues at the same logical position.
- The clone now calls
`AllocateDimArrays(_state->NDim, _state->NOp, _state->StridesNDim)` (see
below).
`NpyIterState.AllocateDimArrays(int ndim, int nop, int stridesNDim)`:
- Previously, the strides block was always sized as `ndim * nop`. After
`RemoveAxis` lowers `NDim` but leaves `StridesNDim` at its original width,
cloning the iterator allocated a strides block that was too small, causing
later reads from `Strides[k]` (where `k >= NDim*NOp`) to access freed or
unrelated memory.
- The third parameter defaults to `ndim` (preserving the existing contract for
all other call sites) but accepts an explicit `stridesNDim >= ndim` so
`Clone()` can carry the original allocated stride width forward.
`StridesNDim` is now stored on the state and the strides allocation uses
`stridesNDim * nop * sizeof(long)`. The scalar fast-path now requires both
`ndim == 0` and `stridesNDim == 0` to skip the allocation.
Also moves the `GetInnerFixedStrideArray` docblock so it sits directly above
its method (it had drifted onto an unrelated method when the preceding doc was
edited).
Coverage:
- `CloneRegressionTests.NpyIterCopy_BufferedIterator_AllocatesIndependentBuffers`
asserts the two iterators have distinct `DataPtr` addresses and that
advancing one does not advance the other.
- `CloneRegressionTests.NpyIterCopy_AfterRemoveAxis_PreservesAllocatedStrideWidth`
builds an iterator over `(2,3,4)`, removes axis 1, clones it, and checks
`NDim`, `Shape`, and the first value match.
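The two pieces of arithmetic in this fix, the `DataPtrs[op]` fix-up and the strides-block sizing, can be sketched with plain integers standing in for addresses. The function names are hypothetical, the half-open range check is an assumption about what "pointed into the byte range" means, and `sizeof(long)` is taken as 8.

```python
def rebase_pointer(data_ptr, old_base, new_base, buffer_bytes):
    """Translate data_ptr onto new_base if it lies inside the old buffer."""
    if old_base <= data_ptr < old_base + buffer_bytes:
        # Keep the same byte offset so iteration resumes at the same position.
        return new_base + (data_ptr - old_base)
    return data_ptr  # not a buffer pointer: leave it alone


def strides_block_bytes(strides_ndim, nop, sizeof_long=8):
    """Size of the strides allocation for a given stride width and operand count."""
    return strides_ndim * nop * sizeof_long


# A pointer 48 bytes into a 256-byte source buffer lands 48 bytes into the clone's buffer.
rebased = rebase_pointer(data_ptr=0x1030, old_base=0x1000, new_base=0x9000, buffer_bytes=256)
# A pointer outside the source buffer is returned unchanged.
untouched = rebase_pointer(data_ptr=0x5000, old_base=0x1000, new_base=0x9000, buffer_bytes=256)

# Iterator over (2,3,4) with one axis removed: NDim drops to 2, StridesNDim stays 3.
too_small = strides_block_bytes(strides_ndim=2, nop=1)   # sized from NDim
required = strides_block_bytes(strides_ndim=3, nop=1)    # sized from StridesNDim
```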
… clone fixes
Adds `test/NumSharp.UnitTest/Backends/CloneRegressionTests.cs`, which locks in
the contracts established by the preceding fix commits. Each test reproduces a
specific bug or contract that previously regressed and asserts the corrected
behavior. 27 tests; all pass on net8.0 and net10.0.
Coverage map (each pair = test → fix commit):
ArraySlice CopyTo direction / range fixes
→ `fix(unmanaged): correct CopyTo direction + bounds in ArraySlice`
- `ArraySlice_CopyToSpan_CopiesFromSliceToDestination`
- `ArraySlice_TryCopyToSpan_CopiesFromSliceToDestination`
- `ArraySlice_CopyToSpan_WithSourceRange_CopiesRequestedRange`
- `ArraySlice_CopyToIntPtr_WithSourceRange_CopiesRequestedRange`
- `ArraySlice_InterfaceCopyToSpan_CopiesFromSliceToDestination`
- `ArraySlice_InterfaceCloneGeneric_ReinterpretsWholeBytePayload`
ArrayConvert.Clone jagged / non-zero lower-bound / 4-D GetLength fixes
→ `fix(shape+convert): preserve scalar offset on Clone; fix ArrayConvert.Clone bugs`
- `ArrayConvert_Clone_PreservesJaggedElementType`
- `ArrayConvert_Clone_PreservesNonZeroLowerBounds`
- `ArrayConvert_Clone_FourDimensionalArray_UsesFourthDimensionLength`
Shape.Clone scalar view offset preservation
→ same commit as above
- `Shape_Clone_PreservesScalarViewOffset`
UnmanagedStorage.Clone empty + F-contig + engine
→ `fix(storage+ndarray): keep TensorEngine in sync; correct cast for F-contig`
- `UnmanagedStorage_Clone_DtypeOnlyStorage_DoesNotDereferenceMissingData`
- `UnmanagedStorage_Clone_PreservesEngineAndFContiguousShape`
UnmanagedStorage.Cast / CastIfNecessary uses CloneData + propagates engine
→ same commit
- `UnmanagedStorage_CastTypeCode_FContiguousSource_CopiesLogicalValuesAndEngine`
- `UnmanagedStorage_CastGeneric_FContiguousSource_CopiesLogicalValuesAndEngine`
- `UnmanagedStorage_CastIfNecessary_FContiguousSource_CopiesLogicalValuesAndEngine`
- `UnmanagedStorage_CastEmptyStorage_PreservesEngine`
UnmanagedMemoryBlock.CopyTo arrayIndex offset
→ `fix(unmanaged): correct CopyTo direction + bounds in ArraySlice`
- `UnmanagedMemoryBlock_CopyToWithIndex_CopiesToDestinationOffset`
UnmanagedHelper.CopyTo destination-offset bounds
→ same commit
- `UnmanagedHelper_CopyToWithInvalidDestinationOffset_Throws`
NDArray.Clone / engine propagation
→ `fix(storage+ndarray): ...` + `fix(creation+manipulation): ...` + `fix(default-engine): ...`
- `NDArray_Clone_PreservesGenericRuntimeType`
- `NDArray_Clone_PreservesTensorEngineOnArrayAndStorage`
- `NpArray_FromNDArray_PreservesTensorEngineForAliasAndCopy`
- `NDArray_CopyFOrder_PreservesTensorEngine`
- `NDArray_CopyCOrder_FromFContiguousSource_ProducesCContiguousCopy`
- `NDArray_ReshapeCopyPath_PreservesTensorEngine`
- `NDArray_AstypeCopyPath_PreservesTensorEngine`
NpyIter.Clone buffered deep copy + RemoveAxis stride width
→ `fix(npyiter): deep-copy buffered Clone buffers; preserve stride width after RemoveAxis`
- `NpyIterCopy_BufferedIterator_AllocatesIndependentBuffers`
- `NpyIterCopy_AfterRemoveAxis_PreservesAllocatedStrideWidth`
…aths
Promotes SByte, Half (float16), and Complex from "partially supported" to
first-class dtypes, matching what NPTypeCode already declares and what NumPy
2.4.2 ships. The audit (NDITER_BRANCH_QUALITY_AUDIT_V2.md, Tier 1) flagged 9
production crash sites and 5 perf gaps where these three dtypes silently fell
out of 12-dtype switches. After this commit every np.power(lhs, rhs)
combination across the 15x15 dtype matrix works end-to-end, and the existing
12-dtype fast paths remain intact.
CRASH FIXES (Tier 1):
* NpyIterCasting (T1.9, T1.12, T1.38, T1.39): IsSafeCast / ReadAsDouble /
WriteFromDouble / ConvertValue / PromoteTypes now handle SByte / Half /
Complex. Complex routes through a dedicated Complex intermediate so the
imaginary component is preserved on Complex->Complex copies and dropped
cleanly (per NumPy's ComplexWarning) on Complex->real. Adds Half/SByte to
IsFloatingPoint/IsSignedInteger predicates.
* NpyIterBufferManager (related to T1.12): same-type buffered iteration was
throwing for Complex base case. Adds SByte/Half/Complex branches to
CopyToBuffer/CopyFromBuffer dtype dispatch.
* UnmanagedStorage (T1.13, T1.57): SetValue(object,int[]/long[]),
SetData(NDArray,long[]) scalar fast path, and the void*/IMemoryBlock CopyTo
overloads all gained the three missing dtype branches.
* ArrayConvert.cs (T1.30): 13 ToX(Array) destination switches were missing
SByte/Half/Complex source cases. Plus ~40 new typed converter methods
covering the previously-missing (Src -> Dst) corners. Total ~550 LOC added.
* np.asanyarray (T1.49): adds IEnumerable<sbyte>, IEnumerable<Half>,
IEnumerable<Complex> case branches; corresponding
Memory<T>/ReadOnlyMemory<T> dispatch; ConvertObjectListToNDArray branches;
and FindCommonNumericType expansion (the seenMask bitset was bounded to 12
dtypes; Complex's typecode=128 also previously aliased bit 0 due to
unbounded shift -- now masked by `(int)code & 31`).
* np.copyto T1.55: now passes via the NpyIterCasting fix.
* ILKernelGenerator.EmitDecimalConversion: Half<->Decimal and
Complex<->Decimal routes were missing. np.power(Half, Decimal) now works (was
the only np.power(15x15) failure after the casting fixes).
PERF FIXES (Tier 2):
* ILKernelGenerator.Binary.IsSimdSupported<T> (R9): adds sbyte.
Vector*<sbyte> arithmetic is natively supported in .NET.
* Converts.FindConverter (R18, R33): 12x12 type-pair fast-path ladder expanded
to 15x15 (225 entries). Eliminates the IConvertible-interface boxing and
object-cast boxing that the CreateFallbackConverter path imposes for
SByte/Half/Complex sources or destinations.
* Default.Reduction.ArgMax (R23): the per-slice NDArray view allocation in the
Half/Complex axis fallback was costing one new NDArray per slice (1000
allocations for a (1000,1000) axis-reduce). Replaced with a stride-aware loop
driven from a stackalloc coord vector. SByte path is removed from the
fallback entirely since the IL kernel already handles it via
CreateAxisArgReductionKernelTyped<sbyte>.
* Default.BooleanMask gather (T1.58): the strided/broadcast fallback was
calling Buffer.MemoryCopy(_, _, elemSize, elemSize) per matched element
(~1us/element). Specialized on element size (1/2/4/8/16 bytes); all 15
dtypes hit a typed pointer write now, including Half (2B) and Complex (16B as
two longs).
VERIFICATION:
* test/Math/NDArray.power.DtypeMatrix.Test.cs (new):
- 15x15 dtype matrix smoke test (225 (lhs, rhs) combinations).
- SByte ** -SByte raises ValueError-style ArgumentException.
- Half ** Half preserves Half.
- Complex ** Complex preserves Complex.
- Float ** Complex promotes to Complex.
- Half ** Single promotes to Single (NEP50).
- SByte/Half/Complex List/IEnumerable inputs no longer throw.
* Removed [OpenBugs] attribute from 11 AuditV2 tests that are now CI-green:
T1_9 (3x), T1_12 (2x), T1_13 (2x), T1_30 (3x), T1_49 (3x), T1_55, T1_57,
T1_58. They now run as regular tests.
* Full suite: 8281 passed, 0 failed (was 8255 before this commit, including
the new dtype-matrix tests and 26 promoted-from-OpenBugs tests).
DOCS:
* .claude/CLAUDE.md: "Supported Types (12)" -> "Supported Types (15)". Adds
Half/SByte/Complex rows and a "Perf notes" section that documents
Half/Complex/Decimal as scalar paths (no Vector<Half> arithmetic in .NET BCL;
Complex.Pow is the BCL routine).
OUT OF SCOPE FOR THIS COMMIT:
* T1.34 NpyExpr Const/Where/Call SByte/Half/Complex support: not on np.power's
critical path; left for a separate pass.
* T1.39 Int64/UInt64 -> double precision loss above 2^53: separate audit item,
unrelated to the three target dtypes.
For full audit context see docs/plans/NDITER_BRANCH_QUALITY_AUDIT_V2.md
section "Major themes" item 2 (missing SByte/Half/Complex).
…s 3000× copy
Audit V2 finding (Section 1.6 / src/NumSharp.Core/Manipulation/np.ravel.cs:30-34):
np.ravel(a, 'F') unconditionally routed through a.flatten('F'), which allocates
fresh F-contiguous memory and runs NpyIter.Copy over the source. NumPy, in
contrast, returns a 1-D view sharing the underlying buffer whenever the source
is already F-contiguous (np.shares_memory(np.ravel(aF, 'F'), aF) == True).
The audit reports a 3000× performance regression on the hot F-order path
(np.arange(12).reshape(3,4).copy('F') -> np.ravel(.,'F')): an O(1) shape-only
aliasing was replaced with an O(N) buffered copy.
Root cause
----------
ravel's F-branch had no fast path for the IsFContiguous case. flatten('F') is
documented to "ALWAYS return a copy" by design, so calling it from ravel forced
materialization even when the linear memory walk would already reproduce the
column-major read-out.
Why a 1-D view is correct for F-contiguous sources
--------------------------------------------------
An F-contiguous array has strides[0] == 1 and strides[i] == dims[i-1] *
strides[i-1] for i > 0, with no broadcast/stride-0 dimensions. Walking the
underlying buffer linearly from `offset` for `size` elements visits values in
F-order (first axis varies fastest), which is exactly what ravel('F') is
specified to produce.
For non-F-contig sources we still fall back to flatten('F') — a strided / C-
contig / sliced source needs the column-major copy to reproduce F-order
correctly.
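The claim that a linear buffer walk reproduces F-order for an F-contiguous array can be checked with a short pure-Python sketch. The helpers `f_strides` and `f_order_offsets` are hypothetical names used only for this illustration; strides are in elements and the offset is taken as 0.

```python
from itertools import product


def f_strides(dims):
    """strides[0] == 1 and strides[i] == dims[i-1] * strides[i-1]."""
    strides = []
    step = 1
    for extent in dims:
        strides.append(step)
        step *= extent
    return strides


def f_order_offsets(dims):
    """Buffer offsets visited when indices are enumerated with the first axis fastest."""
    strides = f_strides(dims)
    offsets = []
    # product() varies its last argument fastest, so feed it the axes reversed
    # and flip each index tuple back to get "first axis varies fastest".
    for reversed_index in product(*(range(extent) for extent in reversed(dims))):
        index = reversed_index[::-1]
        offsets.append(sum(i * s for i, s in zip(index, strides)))
    return offsets


offsets_2d = f_order_offsets((3, 4))
offsets_3d = f_order_offsets((2, 3, 4))
```

Both offset lists come out as `0, 1, 2, ...` with no gaps, which is why a 1-D alias over the same buffer yields the same read-out as a column-major copy.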
Implementation
--------------
ravel(a, 'F') with NDim > 1 and size > 1:
* a.Shape.IsFContiguous → build Shape(new[]{size}, new[]{1}, a.Shape.offset,
a.Shape.bufferSize) and return new NDArray(a.Storage.Alias(vec)). offset and
bufferSize are preserved so sliced F-views remain correct; size becomes the
1-D shape's logical and physical extent.
* Otherwise → existing flatten('F') copy path (unchanged).
The new shape's flags are recomputed by ComputeFlagsStatic over the 1-D
dims/strides, which trivially marks the result as both C- and F-contiguous and
writeable (a 1-D dense vector is both orders). Storage.Alias chains _baseStorage
to the ultimate owner, so view tracking and the @base property continue to work.
Test coverage
-------------
* AuditV2_ManipulationApis.Ravel_FContiguous_FOrder_ReturnsView is no longer
marked [OpenBugs(audit-v2-ravel-fcont-fview)] — the documented NumPy
np.shares_memory invariant is now asserted directly in CI.
* test/Manipulation/np.ravel.Test.cs gains 10 new tests:
- Ravel_FOrder_FContig2D_IsView — write-through verifies aliasing.
- Ravel_FOrder_FContig2D_ValuesMatchColumnMajor — read-out sequence matches NumPy.
- Ravel_FOrder_FContig3D_IsView — 3-D F-flat-index decomposition.
- Ravel_FOrder_CContig_IsCopy — C-contig source still copies.
- Ravel_FOrder_Transpose2D_IsView — a.T (F-contig view) also aliases.
- Ravel_FOrder_KOrder_FContigSource_IsView — 'K' resolves to 'F' for F-source.
- Ravel_FOrder_AOrder_FContigSource_IsView — 'A' resolves to 'F' for strict F.
- Ravel_FOrder_FContig_DtypeFloat — dtype preserved across the view.
- Ravel_FOrder_FContig_EquivalentToFlattenF_Values — values match flatten('F').
- Ravel_FOrder_FContig_PreservesSize — 2-D / 3-D / 4-D size invariants.
Verified
--------
* New tests pass on net8.0 and net10.0.
* Full CI-filtered suite (TestCategory!=OpenBugs&TestCategory!=HighMemory)
passes 8292/8292 on both target frameworks.
* The 17 pre-existing F-contig OpenBugs failures (np.flip, np.sort, np.repeat
axis, reduction F-preservation, save/load fortran_order, etc.) remain
unchanged — they live in test/View/OrderSupport.OpenBugs.Tests.cs and are
excluded by the CI filter.
References
----------
* NumPy: https://numpy.org/doc/stable/reference/generated/numpy.ravel.html
* docs/plans/NDITER_BRANCH_QUALITY_AUDIT_V2.md — Section 1.6
* Spec: np.shares_memory(np.ravel(aF, 'F'), aF) == True for IsFContiguous source
…aths
Audit of np.ravel call paths flagged two cases that the initial fix relied on
but did not directly assert in tests. Add explicit coverage so regressions are
caught:
1. Ravel_FOrder_FContigColumnSlice_PreservesOffset_IsView
aF[:, 1:3] on F-contig (4,5) yields (4,2) F-contig with offset=4. The view
path must preserve offset and bufferSize so the 1-D Alias reads memory
[4..11]. Verified:
* shape (8,)
* F-order values [1, 6, 11, 16, 2, 7, 12, 17] (column-major read-out)
* write-through r[0] → s[0,0] and aF[0,1] both updated (shared memory)
2. Ravel_FOrder_FContig_BothCAndFContig_IsView
A (1, N) shape is simultaneously C- and F-contiguous. ravel('F') enters the
F-branch (NDim>1, size>1, IsFContiguous=true) and returns an Alias view; this
was already covered by the implementation but not by an explicit test.
* shape (4,)
* values [10, 20, 30, 40] (linear memory walk)
* write-through r[0] propagates to both[0, 0]
Both cases pass on net8.0 and net10.0 (64/64 tests in the ravel suite).
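The numbers in the first case can be reproduced by hand. The sketch below assumes the source is `arange(20).reshape(4, 5)` stored F-contiguously (the listed values imply this, but the test source itself is not shown here), and uses a `memoryview` slice as a stand-in for the 1-D alias.

```python
from array import array

rows, cols = 4, 5
# F-contiguous storage: element (i, j) of arange(20).reshape(4, 5) lives at i + j * rows.
storage = array("q", [0] * (rows * cols))
for i in range(rows):
    for j in range(cols):
        storage[i + j * rows] = i * cols + j

# aF[:, 1:3] starts at column 1, so its offset is 1 * rows, and it spans two full columns.
offset = 1 * rows
size = rows * 2

view = memoryview(storage)[offset:offset + size]  # shares memory with storage
raveled = list(view)

view[0] = 99                 # write through the 1-D view
written_back = storage[offset]  # position of aF[0, 1] in the underlying buffer
```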
Background — full ravel coverage matrix audited manually:
Order Source layout Branch Result
----- ----------------------------------------- ------------- -------------
'F' strict F-contig, NDim>1, size>1 view path view
'F' both C+F contig (e.g. (1,N)), NDim>1 view path view
'F' F-contig col-slice, offset!=0 view path view (offset preserved)
'F' transpose of C-contig 2-D (→ F-contig) view path view
'F' C-contig only, NDim>1 flatten('F') copy
'F' broadcast / strided / non-contig flatten('F') copy
'F' 1-D (NDim==1) C-order path view if contig
'F' scalar / empty / size<=1 C-order path trivial
'C' C-contig reshape view
'C' F-contig only CloneData C-order copy
'A' F-contig (strict) resolves to F view
'A' otherwise resolves to C view/copy
'K' F-contig resolves to F view
'K' C-contig resolves to C view/copy
All 15 dtypes (Boolean, Byte, SByte, Int16, UInt16, Int32, UInt32, Int64, UInt64,
Char, Half, Single, Double, Decimal, Complex) verified end-to-end via in-process
buffer-address comparison and dtype assertion.
NDArray.ravel() and NDArray.ravel(char) instance methods delegate to np.ravel,
so the fix covers both call sites.
…efault-None bounds
Brings np.clip up to NumPy 2.x signature parity. Two missing capabilities are
addressed at the API surface; the underlying engine (Default.ClipNDArray.cs)
already supported null bounds for both legs of the interval.
NumPy 2.x signature mirrored:
clip(a, a_min=None, a_max=None, out=None, *, min=None, max=None)
Changes:
- src/NumSharp.Core/Math/np.clip.cs
* Replace the trio of legacy 4-arg overloads with a single unified entry
point exposing all parameters as optional. Callers may now write:
np.clip(a) — no bounds, returns copy
np.clip(a, min: 3) — lower bound only (NEP-rebrand)
np.clip(a, max: 5) — upper bound only
np.clip(a, min: lo, max: hi) — both via aliases
np.clip(a, a_min: null, a_max: 5) — explicit None
np.clip(a, min: 3.5, dtype: NPTypeCode.Double)
a_min/a_max still accepted (NumPy keeps both names; min=/max= were added
in 2.0 as keyword-only aliases).
* Conflict detection mirrors NumPy: passing both a_min and min (or both
a_max and max) raises ArgumentException rather than silently picking one.
* Type-dtype overload preserved separately (Type != NPTypeCode?, no merge
possible). Existing positional-3 call sites (np.clip(a, lo, hi)) and
named-arg call sites in np.maximum/np.minimum compile unchanged.
- test/NumSharp.UnitTest/NumPyPortedTests/ClipNDArrayTests.cs
* 9 new tests covering the NumPy 2.x surface:
- min=/max= keyword aliases (lower-only, upper-only, both)
- Explicit a_min=null / a_max=null
- Bare np.clip(a) returns a copy (verifies distinct backing storage)
- min= keyword with array bound (broadcast verification)
- Conflict detection (a_min+min, a_max+max throw)
- min= combined with dtype= promotes result dtype
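The alias and conflict rule described for `np.clip.cs` above can be sketched as follows. This is a hypothetical Python helper, not the NumSharp or NumPy source; it assumes that `None` means "not supplied" (as with a nullable C# parameter) and uses `ValueError` as a stand-in for `ArgumentException`.

```python
def resolve_bound(legacy, alias, legacy_name, alias_name):
    """Pick whichever spelling was supplied; reject the call if both were."""
    if legacy is not None and alias is not None:
        raise ValueError(f"pass either {legacy_name} or {alias_name}, not both")
    return alias if alias is not None else legacy


def clip_bounds(a_min=None, a_max=None, *, min=None, max=None):
    """Return the effective (lower, upper) pair; None means unbounded on that side."""
    lower = resolve_bound(a_min, min, "a_min", "min")
    upper = resolve_bound(a_max, max, "a_max", "max")
    return lower, upper


lower_only = clip_bounds(min=3)
explicit_none = clip_bounds(a_min=None, a_max=5)
both_aliases = clip_bounds(min=1, max=9)

try:
    clip_bounds(a_min=1, min=2)
    conflict_raised = False
except ValueError:
    conflict_raised = True
```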
Verification:
- Reference outputs cross-checked against NumPy 2.4.2 via Python; all 9
documented behaviors match byte-for-byte.
- ClipNDArrayTests: 26/26 pass (was 17, +9 new).
- ClipEdgeCaseTests + np.maximum/np.minimum suite: 105/105 pass — no
regressions (np.maximum/minimum use np.clip via named a_min:/a_max:).
- Full unit-test sweep (TestCategory!=OpenBugs&!=HighMemory) on net10.0:
7202 pass, 0 fail, 11 pre-existing skips.
Audit reference: audit_v2/07_math_ops_selection_sorting_stats.md (Batch 7,
item 12).
…n-int upcast
Brings the np.clip engine path up to NumPy 2.x ufunc parity. Three latent
bugs surfaced while battle-testing edge cases for the min=/max= alias work:
1. Dtype promotion silently demoted to lhs.typecode
* Before: outType = typeCode ?? lhs.typecode
- clip(int32, min=3.5) → int32 (NumPy: float64)
- clip(int32, min=float32) → int32 (NumPy: float64)
* After: weak-scalar promotion consistent with NumSharp's binary-op
engine and NEP 50 — a 0-d bound of the same kind (int/float/complex
/decimal) as lhs is "weak" and does not promote; cross-kind or array
bounds promote via np.result_type.
* Examples now matching NumPy:
clip(int32, min=3.5) → float64 (was int32)
clip(int32, min=3.0f) → float64 (was int32)
clip(uint8, 50, 75) → uint8 (preserved, NEP 50 weak)
clip(int32, min=long_arr) → int64 (array promotes)
clip(float32, 3, 7) → float32 (preserved)
* NaN bound on int array now upcasts to float64 with all-NaN result
(was: silently a no-op, value unchanged).
2. @out= with mismatched dtype silently wrote garbage
* Before: cast lhs/bounds to outType, blit through copyto into @out
which retained its own (often narrower) dtype — produced truncated
or pattern-aliased values.
* After: validate @out.GetTypeCode == outType up front. Mismatch
raises ArgumentException mirroring NumPy's _UFuncOutputCastingError
("Cannot cast ufunc 'clip' output from dtype('X') to dtype('Y')
with casting rule 'same_kind'").
3. Engine refactor for the both-null + dtype= case
* np.clip(arr, dtype=Single) with no bounds now properly casts the
output and respects @out when supplied (previously dtype= without
bounds returned plain lhs.Clone()).
Implementation details:
- Added PromoteClipBound(outType, bound): no-promotion shortcut for
0-d same-kind bounds; falls back to np.result_type otherwise.
- Added IsSameKind(a, b): groups Byte/Char/signed-int/unsigned-int as
integer kind; floats/decimals/complex compare by NPTypeCode group.
- @out validation now runs before any work, so shape/dtype errors fail
fast without partial mutation of @out.
- np.copyto(@out, Cast(lhs, outType, copy: false)) handles the case
where lhs needs casting to the promoted output type before writing.
Test additions (test/NumSharp.UnitTest/NumPyPortedTests/ClipNDArrayTests.cs)
— 30 new tests across 8 categories all cross-checked against NumPy 2.4.2:
Dtype Promotion (NEP-50):
- uint8 + int scalars preserves uint8
- int32 + float scalar → float64 (also float32 scalar → float64)
- float64 + int scalars preserves float64
- int32 + int64 array bound → int64
- dtype= with no bounds casts input
- dtype= override forces narrower type even when bounds promote
- NaN bound on int array upcasts to float64
@out= Edge Cases:
- in-place out=src returns same buffer & mutates
- out= separate buffer leaves src unchanged
- shape mismatch throws
- dtype mismatch throws (previously silent garbage)
- out= with no bounds copies src
Special Float Values via kwarg:
- min=-inf / max=+inf no-op
- min=NaN / max=NaN propagates
0-d (Scalar) Input:
- clip(scalar, lo, hi) preserves ndim=0
- clip(scalar, max:hi) preserves ndim=0
- clip(scalar) preserves ndim=0
Half / Complex via Kwarg:
- Half min/max preserves Half
- Complex min= (lex ordering, scalar bound)
- Complex array min/max bounds (lex ordering)
Broadcasting via Kwarg:
- 2D + row vector min → broadcasts along axis 0
- 2D + column vector max → broadcasts along axis 1
- 2D + mixed row min + column max
Strided Inputs via Kwarg:
- Reversed-slice (negative stride) clipped via min=/max=
Empty Arrays via Kwarg:
- Empty + min= only
- Empty + max= only
- Empty + dtype= cast
Verification:
- ClipNDArrayTests: 56/56 pass (was 26; +30 new).
- np.clip + np.maximum + np.minimum + ClipEdgeCase + np.clip.Test suites:
85 pass on net8.0, 55 pass on net10.0 (frameworks differ in shared-class
counts).
- Full unit-test sweep (TestCategory!=OpenBugs&!=HighMemory) on net10.0:
7232 pass, 0 fail, 11 pre-existing skips (was 7202 before this commit).
…x bug
Benchmarking np.clip against NumPy 2.4.2 revealed a 48-80× slowdown on
the common case `clip(arr, lo, hi)` with scalar literal bounds. Root
cause: the engine was materializing every scalar bound via
`np.broadcast_to(scalar, lhs.Shape).astype(outType)`, which for a 10M
int32 input allocated and memset two 40MB bound arrays per call (then
ran an element-wise array-bounds kernel that re-read both buffers).
Investigation also surfaced a pre-existing kernel bug exposed once the
new fast path routed scalar-bound calls through ClipScalar / ClipStrided
/ ClipScalarTail: the integer scalar fallbacks used `if / else if` to
apply the two clamps, so when `minVal > maxVal` any value at or below
`maxVal` was raised to `minVal` and left there instead of being capped
to `maxVal` (NumPy guarantees `min(max(x, lo), hi)`, i.e. `maxVal` wins
when bounds are inverted). SIMD paths and Math.Min(Math.Max,...) float
paths were already correct.
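The difference between the two clamp patterns shows up only with inverted bounds. The sketch below is a Python illustration, not the kernel code. One assumption to note: it applies the lower clamp before the upper clamp, since that is the order that matches `min(max(x, lo), hi)`.

```python
def clamp_else_if(x, lo, hi):
    """The replaced pattern: at most one of the two clamps runs."""
    if x > hi:
        x = hi
    elif x < lo:
        x = lo
    return x


def clamp_sequential(x, lo, hi):
    """Lower clamp first, then upper clamp, i.e. min(max(x, lo), hi)."""
    if x < lo:
        x = lo
    if x > hi:
        x = hi
    return x


lo, hi = 7, 3  # inverted bounds: lo > hi

stuck_at_lo = clamp_else_if(1, lo, hi)       # raised to lo and never capped
capped_to_hi = clamp_sequential(1, lo, hi)   # raised to lo, then capped to hi

matches_reference = all(
    clamp_sequential(x, lo, hi) == min(max(x, lo), hi) for x in range(-2, 12)
)
```

With ordinary bounds (`lo <= hi`) the two functions agree, which is why the bug stayed hidden until inverted bounds reached the scalar fallbacks.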
Changes
=======
src/NumSharp.Core/Backends/Default/Math/Default.ClipNDArray.cs
- Add scalar-bounds fast path: detect 0-d min/max (or null) and
dispatch directly to ClipUnified / ClipMinUnified / ClipMaxUnified
(the kernel family that broadcasts the scalar inside the vector loop).
Skips broadcast_to + astype materialization entirely.
- ClipNDArrayScalarBounds: type-switch on outType to call the right
generic kernel; uses a small delegate-based helper (ClipScalar<T>) so
the dispatch logic isn't duplicated 12 times.
- ClipNDArrayScalarBoundsFallback: Half and Complex still go through
the array-bounds path — their scalar SIMD kernels aren't wired and
Complex has lex-ordering NaN semantics already implemented there.
Cost is just the 0-d→shape broadcast (stride-0 view, O(1)) plus a
1-element astype.
- Array bounds (any non-0-d min or max) flow into the existing path
unchanged.
src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Clip.cs
- ClipScalar<T> (generic integer fallback, called by ClipHelper): replace
`if (val > maxVal) val = maxVal; else if (val < minVal) val = minVal;`
with two sequential `if`s. Now matches NumPy semantics when min > max.
- ClipScalarTail<T> (non-float tail after SIMD bulk loop): same fix.
- ClipStrided<T> (coordinate-iterated path for non-contiguous arrays):
same fix.
- Added comments explaining why two sequential clamps are required.
Performance (Windows 11, .NET 10.0.1, AVX2-class CPU; 50 iterations,
min of timings reported; same array shapes/dtypes on both runtimes):
Scalar bounds, contiguous
NumPy 2.4.2 NumSharp BEFORE NumSharp AFTER
int32 size=1K 3.4 µs 37.8 µs 3.3 µs
int32 size=100K 8.4 µs 2980.4 µs 66.5 µs
int32 size=10M 6 741 µs 323 557 µs 10 094 µs
int64 size=10M 14 519 µs 698 077 µs 34 860 µs
float32 size=10M 6 917 µs 570 707 µs 22 441 µs
float64 size=10M 14 228 µs 612 228 µs 30 926 µs
Single-sided scalar bound (min= or max= only)
int32 size=10M min= 12 451 µs 285 434 µs 10 532 µs (faster than NumPy)
int32 size=10M max= 12 024 µs 294 756 µs 10 720 µs (faster than NumPy)
float64 size=10M min= 23 155 µs 300 770 µs 23 043 µs (parity)
out= parameter
int32 10M, out=arr in-place 7 038 µs 562 393 µs 7 465 µs (parity)
int32 10M, out=preallocated 7 794 µs 557 192 µs 12 539 µs
No bounds (clip(a))
int32 10M 12 126 µs 7 437 µs 7 158 µs (faster than NumPy — Cast(copy:true))
Speedups range 20-75× over the previous NumSharp implementation; the
common `clip(arr, lo, hi)` path now sits at 1.5-3× NumPy or matches it
for small arrays. Remaining gaps:
* Array bounds (lo_arr, hi_arr same-size): 3.5× slower — kernel is
memory-bandwidth bound on three arrays; expected gap given .NET
Vector<T> vs hand-tuned NumPy AVX2 inner loop.
* Strided input (a[::2], a[::-1]): 15-20× slower — ClipStrided uses
Shape.TransformOffset per element; NumPy's ufunc has a strided
inner loop with stride-aware SIMD where possible.
* Half (float16): 11× slower, because .NET does not support
`Vector<Half>` arithmetic and a scalar Half→double→Half path is required.
* 2D broadcast (row vec): 33× slower — still goes through array path
after broadcast_to materializes the row vector.
These remaining gaps are tracked for future kernel work and are not
addressed in this commit.
Verification
============
- All 7232 unit tests pass on net10.0 (TestCategory!=OpenBugs&!=HighMemory),
including the regression test for min > max which now exercises the
scalar fast path through the fixed ClipScalar/ClipStrided kernels.
- Bench harness: $CLAUDE_JOB_DIR/clip_bench.py and clip_bench.cs (50
iterations each, min of timings).
Two further optimizations on top of the scalar-bounds fast path. Both
target the gap to NumPy that benchmarking surfaced.
Findings from the breakdown profiler ($CLAUDE_JOB_DIR/clip_breakdown.cs) on
int32 size=10M:
Step                                                 Time (µs)
──────────────────────────────────────────────────   ─────────
Pure ClipArrayBounds kernel (3R + 1W)                ~7,700
Cast(lhs, int32, copy:true) alloc + 1R+1W            ~6,100
np.broadcast_to(lo_arr, same_shape)                  ~negligible
np.broadcast_to(lo_arr).astype(int32) (same dtype)   ~7,700 ← wasted clone
np.clip(arr, lo_arr, hi_arr) full path               37,700
np.clip(arr, 2, 7) scalar fast path                  17,100
ClipHelper kernel only (1R + 1W in-place)            ~9,800
The two wasted passes: (1) `astype(same_dtype)` cloning the bounds even
when no cast is needed (15ms wasted on two array bounds), (2) the
Cast-then-clip pattern doing 4 memory streams (2R + 2W) when 2 streams
(1R src + 1W dst) suffice.
Changes
=======
src/NumSharp.Core/Backends/Default/Math/Default.ClipNDArray.cs
1. PrepareBound(bound, targetShape, outType) helper: When the bound is
already same-shape, same-dtype, contiguous, offset zero, return it
directly instead of running broadcast_to + astype (which clones via
UnmanagedStorage.Clone). Wins for the common case where users pass
arrays that already match the input layout.
2. ClipNDArrayFusedScalarBounds: new fast-fast path for the dominant
`np.clip(a, lo, hi)` shape: contiguous lhs, scalar literal bounds, no
@out, no dtype promotion. Allocates a fresh `np.empty(shape)` and runs
the fused CopyAndClip kernel in a single pass. Replaces the classic
Cast-then-clip pattern (which ran two passes over memory). Falls through
to the existing in-place scalar path when @out is supplied or the lhs
needs casting.
src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Clip.cs
3. CopyAndClip / CopyAndClipMin / CopyAndClipMax (and their *Simd256 /
*Scalar / *ScalarTail variants for 10 SIMD-supported dtypes): fused
read-clip-write kernels. Each loop iteration loads a Vector256, runs
Min(Max(v, lo), hi) in registers, and stores to the destination buffer,
never spilling intermediate values to memory. Halves the memory
bandwidth requirement vs the in-place "copy then clip" pattern on
memory-bandwidth-bound input sizes.
Performance vs NumPy 2.4.2 (Windows 11, .NET 10.0.1, AVX2-only CPU,
50 iterations, min reported)
                                NumPy       NumSharp BEFORE   NumSharp AFTER   AFTER/NumPy
Scalar bounds, contiguous
int32 size=1K                   3.4 µs      37.8 µs           3.1 µs           0.91× (FASTER)
int32 size=100K                 8.4 µs      2980 µs           68.2 µs          8.1×
int32 size=10M                  6741 µs     323557 µs         9336 µs          1.4×
int64 size=10M                  14519 µs    698077 µs         19287 µs         1.3×
float32 size=10M                6917 µs     570707 µs         11002 µs         1.6×
float64 size=10M                14228 µs    612228 µs         26969 µs         1.9×
Array bounds, contiguous (PrepareBound win)
int32 size=10M                  9488 µs     38259 µs          13898 µs         1.5× (was 4.0×)
float64 size=10M                24712 µs    83863 µs          42137 µs         1.7× (was 3.4×)
Single-sided
int32 10M min=                  12451 µs    285434 µs         11200 µs         0.90× (FASTER)
int32 10M max=                  12024 µs    294756 µs         11351 µs         0.94× (FASTER)
out= (in-place / preallocated)
10M in-place out=arr            7038 µs     562393 µs         4567 µs          0.65× (35% FASTER than NumPy)
10M out=preallocated            7794 µs     557192 µs         10101 µs         1.3×
Both bounds None
10M, clip(a)                    12126 µs    7437 µs           5778 µs          0.48× (2× FASTER)
Combined effect of all four perf commits (3505edc, 79c1894, 9334bd7, this
one): the common `np.clip(arr, lo, hi)` path went from 48-80× slower than
NumPy to within 1.4-1.9× across dtypes, with several cases matching or
beating NumPy outright.
Discussion of the user's two questions
──────────────────────────────────────
1. Vector<T> vs Vector256<T>: measured both on this CPU; identical wall
time (5527 vs 5559 µs/10M int32 in micro-bench, see
$CLAUDE_JOB_DIR/clip_micro.cs). Vector<T> picks the widest hardware
register at JIT time, so on AVX-512 hardware it'd be 512 bits = 2×
throughput. On THIS AVX2 machine, no gain. Switching the existing
Vector256<T> kernels to Vector<T> is a low-risk forward-compat move for
AVX-512 hosts but no measurable win here. Not changed in this commit
(would touch the whole kernel file ecosystem; out of scope).
2. IL Generation via DynamicMethod: the existing binary kernels
(ILKernelGenerator.Binary.cs) emit 4× unrolled SIMD loops via
System.Reflection.Emit. Tested whether porting that pattern to clip would
help: micro-benchmarked manually-unrolled 4× and 8× Vector256<int> loops
against the simple 1× variant. Results (10M int32):
1× unrolled: 5559 µs
4× unrolled: 6494 µs (SLOWER: register pressure)
8× unrolled: 5428 µs (2% faster, within noise)
The .NET JIT already auto-unrolls the simple SIMD loop well enough that
hand-unrolling doesn't help and can hurt. IL emission for this op would
add significant complexity for ~no perf win. Not pursued.
The wins came from algorithmic changes (fused single-pass kernel, skipping
redundant clones), not from instruction-level tuning.
Verification
============
- All 7232 unit tests pass on net10.0 (TestCategory!=OpenBugs&!=HighMemory).
- Includes the regression test for `min > max` semantics through the new
fused kernel path (which goes through CopyAndClip's scalar tail for the
size<32 case).
- Bench harness: $CLAUDE_JOB_DIR/{clip_bench.py,clip_bench.cs,
clip_breakdown.cs,clip_micro.cs}.
Remaining gaps
==============
- Strided / negative-stride / broadcast inputs: ~12-15× slower than NumPy.
ClipStrided iterates with Shape.TransformOffset per element (~50ns/element
overhead). NumPy ufunc has stride-aware SIMD inner loops. Would require a
stride-aware clip kernel similar to NumPy's.
- Half / float16: ~9× slower. .NET's Vector<Half> arithmetic is not
supported; falls back to scalar Half-via-double round-trip.
- 100K size scalar bounds (8.1×): allocation/dispatch overhead is amortized
over fewer elements; gap shrinks at larger sizes.
…-adaptive
Per the user's directive, the entire clip code path now goes through a
single ILKernelGenerator entry point that dispatches internally and
emits all loops as DynamicMethod IL using the runtime-detected vector
width (V128 / V256 / V512). No hardcoded Vector256 references remain;
no hand-written C# loops remain in the engine or kernel files.
Files
=====
src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Clip.cs
Rewritten from scratch (~1900 → ~360 lines, -81%).
Public surface (what the engine actually calls):
    public enum ClipMode { BothBounds, MinOnly, MaxOnly }
    public enum ClipBoundsKind { Scalar, Array }
    public unsafe delegate void ClipKernel(
        void* src, void* dst, long size, void* lo, void* hi);
    public static unsafe void Clip(
        NPTypeCode dtype, ClipMode mode, ClipBoundsKind kind,
        void* src, void* dst, long size, void* lo, void* hi);
All dispatch happens inside ILKernelGenerator:
- Cache key = (dtype << 16) | (mode << 8) | kind
- On first miss, Generate(dtype, mode, kind) builds a DynamicMethod
and stores the resulting delegate in a ConcurrentDictionary.
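
The key packing and generate-on-first-miss behaviour described above can be modelled in a few lines. This is a minimal Python sketch, not the NumSharp code: the integer enum values, the `cache_key` and `get_kernel` names, the plain dict standing in for the ConcurrentDictionary, and the placeholder "kernel" are all assumptions made for illustration. It also assumes mode and kind each fit in 8 bits so the fields cannot collide.

```python
from enum import IntEnum


class ClipMode(IntEnum):          # assumed numeric values, in declaration order
    BOTH_BOUNDS = 0
    MIN_ONLY = 1
    MAX_ONLY = 2


class ClipBoundsKind(IntEnum):    # assumed numeric values, in declaration order
    SCALAR = 0
    ARRAY = 1


def cache_key(dtype_code: int, mode: ClipMode, kind: ClipBoundsKind) -> int:
    """Pack (dtype, mode, kind) as (dtype << 16) | (mode << 8) | kind."""
    return (dtype_code << 16) | (int(mode) << 8) | int(kind)


_kernels = {}          # hypothetical stand-in for the delegate cache
generated_keys = []    # records every "emit" so reuse is observable


def get_kernel(dtype_code: int, mode: ClipMode, kind: ClipBoundsKind):
    """Return the cached kernel, generating it only on the first miss."""
    key = cache_key(dtype_code, mode, kind)
    kernel = _kernels.get(key)
    if kernel is None:
        generated_keys.append(key)
        kernel = lambda key=key: key   # placeholder for the emitted delegate
        _kernels[key] = kernel
    return kernel
```

Calling `get_kernel` twice with the same triple returns the same object and appends to `generated_keys` only once, which is the property the cache-correctness check relies on.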
The IL emitter:
- Uses GetVectorContainerType() / GetVectorType() / GetVectorCount()
so the SIMD loop body adapts to V512 on AVX-512 hosts, V256 on
AVX2, V128 on SSE2. There is no `Vector256` or `Vector128` token
anywhere in the kernel file.
- Hoists the scalar bound load and Vector.Create() out of the SIMD
loop (one broadcast per kernel call, not per iteration).
- Computes `byteOff = i * sz` once per iteration into a local and
reuses it for src/lo/hi/dst pointer arithmetic — avoids the
O(N × pointer_count) multiplications the prior C# kernels had.
- Falls back to a pure scalar IL loop for dtypes without
Vector<T>.Min/Max (Char, Decimal, Half, Complex). Half and
Complex delegate the per-element clamp to static helper methods
(NaN-aware / lex-order); the loop is still IL.
src/NumSharp.Core/Backends/Default/Math/Default.ClipNDArray.cs
Stripped from 1222 → 207 lines (-83%). Only policy remains:
- dtype promotion (NEP-50 weak scalar via PromoteClipBound)
- @out validation (shape, writeable, dtype)
- scalar-vs-array kind detection (min.ndim == 0)
- NaN-in-scalar-bound short-circuit for float dtypes
- dst materialization choice: in-place vs fused-fresh vs cast-copy
- single call to ILKernelGenerator.Clip(...)
The previous ClipNDArrayContiguous / ClipNDArrayGeneral /
ClipNDArrayScalarBounds / ClipNDArrayFusedScalarBounds / 12 per-dtype
switches / delegate-based generic dispatchers / 14 Generated*Core
methods are all deleted. One call site, one cache, one IL emitter.
src/NumSharp.Core/Backends/Default/Math/Default.Clip.cs
Deleted (251 lines). Dead code — internal `ClipScalar(NDArray, object,
object)` had no callers anywhere in the codebase, was a parallel
hand-coded path that the IL kernel now subsumes.
src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.cs
Added EmitVectorMinOrMax() to the existing emit-primitives section
(sibling of EmitVectorOperation). Resolves Vector{Width}.Min<T> /
Max<T> by reflection on whatever container the runtime selected at
startup — same width-adaptive pattern used by the binary kernels.
Performance vs NumPy 2.4.2 (Windows 11, .NET 10.0.1, AVX2 CPU, 50 iters,
min reported)
                            NumPy        Before IL     After IL
Scalar bounds, contiguous
  int32   size=1K              3.4 µs       3.1 µs       2.9 µs   (0.85×, faster)
  int32   size=100K            8.4 µs      68.2 µs      47.3 µs
  int32   size=10M           6 741 µs     9 336 µs     7 509 µs   (1.11×)
  int64   size=10M          14 519 µs    19 287 µs    22 279 µs
  float32 size=10M           6 917 µs    11 002 µs    13 057 µs
  float64 size=10M          14 228 µs    26 969 µs    28 842 µs
Single-sided (min= or max= only)
  int32   10M min=          12 451 µs    11 200 µs    10 944 µs   (0.88×, faster)
  int32   10M max=          12 024 µs    11 351 µs     8 009 µs   (0.67×, faster)
  float64 10M min=          23 155 µs    23 043 µs    19 776 µs   (0.85×, faster)
out=
  10M in-place out=arr       7 038 µs     4 567 µs     3 954 µs   (0.56×, faster)
  10M out=preallocated       7 794 µs    10 101 µs     9 175 µs
Both bounds None
  10M clip(a)               12 126 µs     5 778 µs     6 025 µs   (0.50×, faster)
Half (float16): IL emit cut the gap by 3×
  10M                       66 969 µs   602 219 µs   212 024 µs   (was 9×, now 3.2×)
Verification
============
- All 7232 unit tests pass on net10.0 (TestCategory!=OpenBugs&!=HighMemory).
- The 85 clip-family tests (ClipNDArrayTests, ClipEdgeCaseTests,
np.clip.Test, NewDtypes Half/Complex clip tests) cover:
* Scalar literal bounds, array bounds, both-None, min-only, max-only
* 14 dtypes (Byte, SByte, Int16/32/64, UInt16/32/64, Char, Half,
Single, Double, Decimal, Complex)
* Contiguous, transposed, strided (every other), reversed slices
* Broadcast inputs (the OpenBug test)
* NaN propagation in float arrays
* NaN in scalar bound → all-NaN result (short-circuited in engine)
* min > max → result all = max
* @out= validation (shape & dtype mismatch throws)
* NEP-50 weak-scalar promotion (uint8 + 50 stays uint8)
* Cross-kind promotion (int32 + 3.5 → float64)
- Cache correctness: each (dtype, mode, kind) combination generates
its kernel once on first call, then reuses the cached delegate. Re-
running the test suite a second time keeps the same delegates (no
re-emit per call).
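
The `min > max → result all = max` rule in the list above falls out of applying the two clamps in sequence, Min(Max(v, lo), hi). The sketch below is an illustrative Python model, not the emitted IL: `clip_sequential` mirrors the two-clamp form, while `clip_if_else` is a hypothetical reconstruction of an `if / else if` shape, included only to show how such a form diverges when the bounds are inverted.

```python
def clip_sequential(values, lo, hi):
    """Two sequential clamps: first raise to lo, then cap at hi."""
    return [min(max(v, lo), hi) for v in values]


def clip_if_else(values, lo, hi):
    """Hypothetical if / else-if form: only one branch runs per element."""
    out = []
    for v in values:
        if v < lo:
            out.append(lo)      # never re-checked against hi
        elif v > hi:
            out.append(hi)
        else:
            out.append(v)
    return out


data = [0, 5, 9]
inverted_ok = clip_sequential(data, lo=7, hi=2)   # every element ends at hi
inverted_bad = clip_if_else(data, lo=7, hi=2)     # low elements stick at lo
```

With ordinary bounds (lo <= hi) the two forms agree; they differ only for inverted bounds, which is why that case needs its own regression test.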
Remaining gaps (deferred)
=========================
- Strided / negative-stride inputs (~15-20× slower than NumPy): the
engine materializes a contiguous copy first via Cast(copy:true). A
proper fix would IL-emit a stride-aware kernel, but that doubles the
code size, and strided input is rarely the hot path.
- Array bounds are slightly worse than with the prior hand-coded V256
inner loop (~2× NumPy vs ~1.5× before): the IL emit doesn't 4×-unroll
like the binary kernels do. In the earlier micro-benchmark, a manual
4× unroll of the simple clip loop hurt rather than helped against the
JIT auto-unrolled C# baseline; the effect on IL-emitted code may
differ and was not investigated.
…) into nditer
Brings in 5 commits from worktree-clip-min-max-aliases that rebuild
np.clip end-to-end. Replaces and supersedes the in-flight clip work on
nditer (c3bbe9a "fix(clip): Complex IComparable + Half NaN propagation"),
whose root cause (generic CompareTo / NpFunc routing for clip) no longer
exists after this merge.

Summary of incoming work (3505edc..10064ab)
=============================================
1. feat(np.clip): NumPy 2.x parity: min=/max= keyword aliases and
   default-None bounds. New signature mirrors NumPy 2.x:
     clip(a, a_min=None, a_max=None, out=None, *, min=None, max=None, dtype=None)
2. fix(np.clip): NumPy 2.x dtype promotion (NEP 50 weak-scalar via
   np.result_type), out= dtype validation, NaN-on-int upcast.
3. perf(np.clip): scalar-bounds fast path + fixed a latent ClipScalar
   min>max kernel bug (`if/else if` instead of two sequential clamps).
4. perf(np.clip): fused copy+clip kernel + skip the redundant astype
   clone when the bound already matches lhs shape/dtype/contiguity.
5. refactor(np.clip): all kernels routed through a single
   ILKernelGenerator.Clip() entry. Every loop is now emitted as a
   DynamicMethod via System.Reflection.Emit. The SIMD width is resolved
   at runtime (V128/V256/V512); no Vector256 token remains anywhere in
   the clip path.

Conflict resolution
===================
* src/NumSharp.Core/Backends/Default/Math/Default.Clip.cs
  Deleted in worktree. nditer's c3bbe9a modified it (Complex pre-checks).
  The IL kernels in this merge handle Complex natively via
  ComplexMaxNaN/ComplexMinNaN helpers called from the generated loop, so
  the Default.Clip.cs path becomes redundant. Took the deletion.
* src/NumSharp.Core/Backends/Default/Math/Default.ClipNDArray.cs
  Both sides modified. nditer's version had c3bbe9a + 574a0d8 refactor
  (NpFunc generic dispatch, ~400 switch cases replaced). Worktree's
  version (this branch) is the IL-routed engine (207 lines of pure policy
  + one ILKernelGenerator.Clip call). Took worktree. Half / Complex
  correctness preserved by the new IL kernel, verified via the existing
  battletest suite (NewDtypes Half + Complex tests all pass).
* src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Clip.cs
  Both modified. nditer added Half-specific scalar paths in the old
  kernel API. Worktree rewrote the file from ~1900 → ~360 lines of IL
  emission. Took worktree: Half NaN handling now lives inside the
  IL-emitted scalar tail via HalfMaxNaN/HalfMinNaN helper calls.
* src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.cs
  Auto-merged cleanly. Worktree added EmitVectorMinOrMax helper alongside
  nditer's dtype-parity expansion.
* src/NumSharp.Core/Math/np.clip.cs
  Manually merged: kept worktree's new public signature
  (a_min/a_max/out/dtype/min/max with NumPy-2.x semantics) and nditer's
  PreserveFContigFromSource wrapper (39ef08c "F-contig preservation
  across ILKernel dispatch"). Output now keeps F-order when the input was
  F-contiguous.
* test/NumSharp.UnitTest/NumPyPortedTests/ClipNDArrayTests.cs
  Auto-merged cleanly: 39 new tests from worktree (NEP-50 promotion,
  min/max aliases, out= edge cases, etc.) sit alongside nditer's existing
  coverage.

Verification
============
- Full build: 0 errors, 17 warnings (unchanged from each side).
- Test sweep (TestCategory!=OpenBugs&!=HighMemory) on net10.0: 8334 pass,
  0 fail, 11 pre-existing skips. nditer was at ~7232 tests pre-merge in
  the worktree's view; the actual count on nditer-only is higher and the
  merge brings the combined total up.
- All 85 clip-family tests pass (39 new + 46 pre-existing).
- The Complex IComparable issue that c3bbe9a addressed is verified fixed
  by the merge: the failing tests in that commit's "Fixes 14 test
  failures" list all pass through the new IL kernel (Complex takes the
  scalar-IL path with ComplexMaxNaN/ComplexMinNaN helpers).

API behaviour for callers
=========================
- np.clip(arr, lo, hi): works exactly as before.
- np.clip(arr, lo, hi, dtype: NPTypeCode.Double): new dtype= override.
- np.clip(arr, lo, hi, @out: dst): supported as named arg.
- np.clip(arr, min: 3): NEW, NumPy 2.x kwarg alias.
- np.clip(arr, max: 7): NEW, kwarg alias.
- np.clip(arr, a_min: null, a_max: 5): NEW, explicit None bound.
- Promotion: clip(int32, 3.5) → float64 (was int32, a bug pre-merge).
- Out= dtype mismatch now throws ArgumentException (was silent garbage
  pre-merge).
…six + bitwise, np.positive full ufunc surface, NumPy round shape
Completes the single-NumPy-shaped-overload audit across 5716f86 (slice 2)
and 6a566e4 (merged-overload wave). A reflection audit over every np.*
member those commits touched found four remaining gaps; all are closed
here, every rule probed against NumPy 2.4.2 BEFORE implementation and
pinned verbatim by tests.

1) dtype= on add/subtract/multiply/divide/true_divide/mod
   The 6a566e4 wave deferred these because the loop-dtype machinery
   didn't exist; it does now (ExecuteBinaryOp's dtype override from the
   power/floor_divide work), so the deferral was stale. TensorEngine
   gains the house (lhs, rhs, typeCode, out, where) signature on the five
   members; np.* adds the trailing dtype= param.
   * the loop COMPUTES in dtype: subtract(300_i64, 5_i64, dtype=i16) = 295
     in the int16 loop; add(0.1, 0.2, out=f64, dtype=f32) stores
     float32(0.3) = 0.30000001192092896 in the f64 out (probed).
   * BUG FIX (ordering): the bool add/multiply remap (+ = logical OR,
     * = logical AND) keyed off the PROMOTED dtype and ran before the
     dtype= override, so add(True, True, dtype=i32) would have run the
     BitwiseOr loop and returned 1. NumPy runs the i32 add loop and
     returns 2 (probed). The remap now keys off the FINAL loop dtype:
     moved after the dtype override in ExecuteBinaryOp.
   * divide/true_divide are float-only ufuncs: integer/bool dtype= raises
     'No loop matching the specified signature and casting was found for
     ufunc divide' (probed; same gate family as sqrt&co).
   * mod reports 'remainder' with indexed input-cast errors (probed:
     mod(f64, f64, dtype=i32) → "Cannot cast ufunc 'remainder' input 0
     from dtype('float64') to dtype('int32') ...").

2) dtype= on bitwise_and/or/xor
   Loop selection among the bool/int loops: dtype=i64 widens, dtype=i16
   narrows under same_kind (300 & 300 = 300_i16, probed). A
   float/complex/decimal dtype= raises the NO-LOOP text, distinct from
   the float-INPUT coercion text ValidateBitwiseLoop produces ('ufunc not
   supported for the input types...'); both texts probed, the first
   implementation reused the wrong one and the probe caught it.

3) np.positive: full ufunc surface (was: bare copy only)
   Slice 2 prepared UnaryOp.Positive (identity emit, UfuncName mapping,
   Round uses it as the masked-copy vehicle) but never exposed the np
   surface. New TensorEngine.Positive + Default.Positive:
   * positive(x, out=) returns the provided instance; where= masks (false
     slots keep prior contents); both ride the unary Into-path with the
     identity kernel.
   * dtype= selects the loop: positive(i32, dtype=f64) widens (the no-out
     path is a cast-copy); positive(f64, dtype=i32) raises the unary
     same_kind input-cast error naming 'positive'.
   * bool rule probed precisely: positive has NO bool loop. Plain
     positive(bool) raises "ufunc 'positive' did not contain a loop with
     signature matching types <class 'numpy.dtypes.BoolDType'> -> None"
     and dtype=bool raises the two-sided variant naming the input's DType
     class (new NumPyDTypeClassName helper maps NPTypeCode →
     numpy.dtypes.* names); positive(bool, dtype=f64) is LEGAL → [1., 0.]
     because the guard keys off the loop, not the input.

4) round_/around: NumPy's round(a, decimals=0, out=None) shape
   Slice 2 shipped TWO out-overloads each ((x, out=null) + (x, decimals,
   out=null)); NumPy has ONE signature whose 2nd positional is DECIMALS
   (np.round(x, out_array) raises 'only integer scalar arrays can be
   converted to a scalar index', probed). Merged to (x, int decimals = 0,
   NDArray out = null); positional-out callers migrate to @out: (3 test
   sites updated). Legacy positional-dtype conveniences unchanged.

Tests: UfuncDtypeOverloadTests +8 (binary-six loop matrix incl. the
bool-remap probe pin add(True,True,dtype=i32)=2, dtype+out f32
composition, divide no-loop + remainder/add input-cast texts, bitwise
int-loop matrix with narrowing + no-loop texts, positive call-form matrix
+ both did-not-contain-a-loop texts + same_kind error, round single-shape
matrix); UfuncUnaryBatchOutWhereTests round callers moved to NumPy
positions.

Suite: net10.0 CI filter 9709/0; touched families 100/100 on net8.0; live
side-by-side value checks vs numpy 2.4.2 (f32 loop precision
0.30000001192092896, 1/3 float32, banker's rounding, positive matrix).
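
The "loop COMPUTES in dtype" point can be checked by hand for the add(0.1, 0.2, dtype=f32) case. The sketch below is a standard-library Python model of a float32 add loop, assuming IEEE 754 binary32 rounding via `struct`; `to_f32` and `add_f32` are illustrative helper names, not NumSharp or NumPy functions.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def add_f32(a: float, b: float) -> float:
    """Model a float32 add loop: cast inputs, add, round the result to float32."""
    return to_f32(to_f32(a) + to_f32(b))


f32_sum = add_f32(0.1, 0.2)   # value a float64 out= buffer would then hold
f64_sum = 0.1 + 0.2           # what a float64 loop would compute instead
```

Storing `f32_sum` into a float64 output widens it without changing its value, so the float32 rounding stays visible in the result.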
… NumPy + standalone decomposition
New matched-id benchmark pair (npyiter_core_bench.{cs,py}) measuring the ITERATOR,
not the kernels: construction across 13 flag configurations, traversal/orchestration
across chunk profiles (w=4..1024 strided rows, transposed, row/col broadcast),
buffered cast/mixed windows, per-element protocol (+C_INDEX/+MULTI_INDEX), full-
reduce, and small-N pipeline scaling N=8..2M. All kernels trivial and matched to
NumPy's loop families (memcpy / scalar-strided / V256-contig) so iterator cost
dominates. Both scripts correctness-check before timing and refuse Debug JIT.
Headline results (i9-13900K, NumPy 2.4.2, Release — full tables in
benchmark/poc/NPYITER_CORE_BENCH_RESULTS.md):
- Construction beats np.nditer 1.4-3.7x in every multi-operand config
(3-op 308 vs 622 ns; ufunc config 343 vs 1000 ns; 8-op 385 vs 1140 ns).
- Full-iterator vs full-iterator (strided-row ADD through NumPy's real ufunc
nditer): parity at w=4 (10.1 vs 10.3 ns/chunk), 2x faster at w=1024.
The 4 ns/chunk machine in np.copyto is NumPy's stripped raw-copy walker,
NOT nditer — and production NpyIter.Copy matches it exactly (T2c).
- Buffered mixed add 0.86x, broadcast adds 0.74-0.81x, transposed copy 0.84x,
reductions 0.64-0.85x, contig copy at DRAM roofline parity.
- Raw iterator pipeline tracks NumPy's whole ufunc dispatch within +-8% at every
N; production np.add(out=) carries ~200 ns of glue above it (648 vs ~440 ns).
- REUSED iterator (Reset+ForEach) at N=512: 54.7 ns/call — 7x under NumPy's
e2e floor; the structural small-N lever NumPy cannot reach from Python.
Bugs/gaps surfaced (quantified in the results doc):
1. GROWINNER is hollow — NpyIter.cs:751 sets the bit, ComputeTransferSize never
reads it: same-dtype BUFFERED iterators chunk into 512 needless 8192-elem
windows, +8.4% on 4M f64 add (G-series). One condition fixes it.
2. Broadcast construction pays +388 ns over same-shape (C5 696 vs C3 308 ns):
rank-mismatched operands miss the same-shape fast path and allocate through
np.broadcast_to; NumPy's delta is +97 ns.
3. Chunked-callback overhead decomposed: delegate ~1.3 ns/chunk (ForEach vs
ExecuteGeneric), element-stride imul per op/axis/step in ExternalLoopNext
(NumPy stores byte strides), mask-resolution branch — 7 vs 4 ns/chunk against
raw-walker workloads only; already parity against NumPy's real nditer.
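
The window count in finding 1 is plain arithmetic: 4M float64 elements split into 8192-element buffer windows. A small Python sketch of that count follows, assuming a fixed buffer size of 8192 and a simplified reading of GROWINNER (the inner loop may cover the whole range when no buffering is needed); `inner_loop_count` is a hypothetical helper, not the ComputeTransferSize logic.

```python
def inner_loop_count(n_elements: int,
                     buffer_size: int = 8192,
                     grow_inner: bool = False,
                     needs_buffering: bool = True) -> int:
    """Number of inner-loop calls needed to cover n_elements.

    Simplified model: with grow_inner set and no cast or copy required,
    one inner loop spans everything; otherwise the range is cut into
    buffer_size windows (ceiling division).
    """
    if n_elements == 0:
        return 0
    if grow_inner and not needs_buffering:
        return 1
    return -(-n_elements // buffer_size)


n = 4 * 1024 * 1024                      # "4M" elements
windows_now = inner_loop_count(n)        # flag ignored: windowed traversal
windows_grown = inner_loop_count(n, grow_inner=True, needs_buffering=False)
```

Under these assumptions the same-dtype case drops from 512 kernel calls to one, which is the overhead the G-series rows measure.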
'How NpyIter becomes the best' (prioritized, measured headroom): iterator
reuse/state pooling (54.7 ns floor proven), cut the 200 ns production glue,
implement GROWINNER, broadcast-ctor fast path, byte-stride axisdata +
specialized iternext, tiny-chunk whole-array lowering, parallel ForEach.
Roadmap doc links the new bench in its verification harness section.
charts_npyiter_core.py renders the four presentation charts from the 2026-06-12 npyiter_core_bench run (data pinned inline from NPYITER_CORE_BENCH_RESULTS.md): construction grouped bars, per-chunk dispatch overhead (copy vs raw walker + add vs real ufunc nditer), small-N log-log scaling with the production-glue and reuse-floor markers, and the traversal ratio scoreboard with the hollow-GROWINNER inset. Output: %TEMP%/npyiter_charts/*.png
… territory; P0 crash + 3 losses + 4 unexpected wins
npyiter_frontier_bench.{cs,py} (matched ids) extends the iterator-core bench into
every suspected weak spot: axis reductions through op_axes+REDUCE (Wave-5
territory), ALLOCATE outputs, where= masks at degenerate run lengths (all-true /
alternating run=1 / blocky run=64), strided buffered casts, forced-order outputs,
0-d scalar calls, tiny-chunk production copyto, 8-op single-pass fusion, and the
kernel-bound dtype frontier (complex128/float16/int8) as labeled context rows.
NEW BUGS / LOSSES (full tables + analysis in NPYITER_CORE_BENCH_RESULTS.md):
1. P0 CRASH: ForEach on a BUFFERED+REDUCE iterator AccessViolates — GetIterNext()
has no BUFFER+REDUCE branch, falls to ExternalLoopNext which walks BUFFER
pointers with SOURCE strides while GetInnerLoopSizePtr hands BufIterEnd as the
kernel count. Only BufferedReduce<TKernel>/Iternext() drives this config
safely. Repro pinned (R3, skipped with comment).
2. Iterator ALLOCATE outputs are np.zeros'd (NpyIter.cs:277) where NumPy
allocates EMPTY: +2.33 ms per 4M f64 call (32 MB memset). Still beats NumPy
allocating (0.78) only because their page-fault tax is worse than our pooled
memset; np.empty for WRITEONLY ALLOCATE => ~2x ahead.
3. Blocky where= (run=64) regresses BELOW our unmasked baseline (4.10 vs
2.80 ms) while NumPy GAINS from the same mask (3.19 vs 3.54): per-run
delegate/scan overhead eats the saved work. All-true and run=1 both WIN
(0.79 / 0.72 — NumPy degrades worse at run=1).
4. Windowed buffered cast on strided source 1.52x behind one-pass copyto;
production np.copyto already one-pass (1.08).
5. 0-d scalar calls 1.64x/2.41x behind (469/811 ns vs 286/337) — N=1 glue.
6. Axis-0 op_axes reduction 1.20x behind add.reduce; axis-1 wins 2x.
7. (kernel-bound) float16 1.34x behind, complex128 1.10-1.15x behind.
UNEXPECTED WINS:
- np.add ALLOCATING f64 4M: 3.83 vs 7.5-9.8 ms — 2-2.6x faster (Wave-2.4 pool
vs NumPy's fresh-page faults).
- F-order-out elementwise 1.5x faster (X1 0.67 / X1p 0.65).
- 8-op ONE-PASS sum of 7 arrays 1.9x faster than NumPy's best possible
composition (Y1 7.85 vs 14.59 ms) — multi-operand fusion dividend.
- int8 add 4M: 173 us vs 1.20 ms — 7x faster (NumPy 2.4.2 i8 loop slow).
- axis-1 reductions 2-2.8x faster (R2/R0b); reversed copy 0.94; production
copyto at w=4 parity with NumPy's raw walker (P4).
…ny 14.5x losses found; parallel banding 4.7x win proven
npyiter_frontier2_bench.{cs,py} (matched ids): overlap/alias per-call taxes
(exact-alias V1, forced-copy V2), comparison->bool (D1), early-exit boolean
reduces (E1/E2), reduce over a broadcast view (F1), mixed-dtype/scalar/empty
small-N (M1/O3/O4), 8-D construction (C14), and a hand-rolled 8-band parallel
iteration (PAR series — one iterator per disjoint row band via Parallel.For,
the Wave-6.2 dividend made concrete on f64 sin).
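
The banding idea behind the PAR series, one worker per disjoint row band, can be sketched independently of NpyIter. The Python below is only a model of the partitioning: `row_bands`, `sin_band`, and `parallel_sin` are hypothetical names, a flat list stands in for a row-major 2-D array, and a thread pool stands in for Parallel.For. It demonstrates that the bands are disjoint and cover every row; it does not reproduce the measured speed-up, since CPython threads do not run this loop in parallel.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def row_bands(n_rows: int, n_bands: int):
    """Split [0, n_rows) into up to n_bands contiguous, disjoint (start, stop) bands."""
    base, extra = divmod(n_rows, n_bands)
    bands, start = [], 0
    for b in range(n_bands):
        stop = start + base + (1 if b < extra else 0)
        if stop > start:            # skip empty bands when n_rows < n_bands
            bands.append((start, stop))
        start = stop
    return bands


def sin_band(src, dst, n_cols, band):
    """Process one band: each worker writes only its own rows of dst."""
    start, stop = band
    for i in range(start * n_cols, stop * n_cols):
        dst[i] = math.sin(src[i])


def parallel_sin(src, n_rows, n_cols, n_bands=8):
    """Apply sin over a row-major buffer using one task per row band."""
    dst = [0.0] * len(src)
    with ThreadPoolExecutor(max_workers=n_bands) as pool:
        # list() forces completion and re-raises any worker exception
        list(pool.map(lambda band: sin_band(src, dst, n_cols, band),
                      row_bands(n_rows, n_bands)))
    return dst
```

Because no two bands share a row, the workers never write to the same output slot, so no locking is needed around `dst`.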
HEADLINE LOSSES (root causes probed and pinned in the results doc):
1. F1: np.sum over broadcast_to(8K -> (1024,8192)) = 61.9 ms vs NumPy 1.14 ms
(54x). NOT materialization: probe shows bc.copy()=11.3ms + dense sum=2.6ms,
so even naive materialize-then-sum would be 4.5x faster — the reduction
falls to a coordinate-walking general path at 7.4 ns/elem on IsBroadcasted
inputs.
2. E1: np.any(bool 10M) full scan = 1.86 ms (4.9 GB/s scalar) vs NumPy 128 us
(14.5x) — while np.count_nonzero on the SAME array runs 0.16 ms (63.7 GB/s
SIMD). Routing bug: the SIMD scan exists, np.any doesn't use it. Early-exit
case E2 WINS 3.9x (350 ns vs 1.35 us).
3. D1: np.less(out=bool) f64 4M 1.41x behind (2.99 vs 2.12 ms) — bool packing.
4. O3: array+scalar small-N 1.73x behind (901 vs 520 ns) — the scalar wrap
costs MORE than passing a second full array (H0 648 ns).
WINS:
- PAR8: 8 banded iterators on f64 sin = 2.47 ms vs 12.1 single / 11.7 NumPy
ceiling — 4.9x scaling, 4.7x over NumPy (which never threads its iterator);
production np.sin already at single-thread parity (12.1 vs 11.7).
- V2: forced-copy overlap (write-ahead alias) 1.75x FASTER than NumPy
(4.72 vs 8.26 ms) — Wave-1.1 machinery + Wave-2.4 pooled temp beat their
fresh-alloc copy; V1 exact alias 0.88.
- C14 8-D ctor 3x faster (321 vs 953 ns); O4 empty 2.8x; E2 early-exit 3.9x;
M1 mixed small-N parity-win (888 vs 931 ns).
Results doc gains the round-2 table, findings 13-16 with probe decomposition,
and reproduce lines.
…rom source and benchmarked across argument matrices
The consumer map is grounded in src/numpy (grep NpyIter_{New,MultiNew,AdvancedNew},
enclosing functions resolved): execute_ufunc_loop, PyUFunc_{Accumulate,Reduceat,
ReduceWrapper}, ufunc_at, array_{boolean_subscript,assign_boolean_subscript},
PyArray_{MapIterNew,CountNonzero,Nonzero,CopyAsFlat,Where}, arr_{ravel_multi_index,
unravel_index}, einsum, nditer_pywrap(+nested_iters), busday/datetime/string/void
consumers. npyiter_consumers_bench.{cs,py} exercises every benchable consumer
through its np.* surface with the perf-relevant argument matrix (dtype=/out-cast/
promoting unary; reduce axis/keepdims/dtype/3-D middle axis/amin; cumsum axes;
where same/scalar/broadcast; boolean read/assign; count_nonzero/argwhere; fancy
gather/scatter; ravel transposed/F-order/flatten/astype; unravel/ravel_multi_index)
and times the consumers NumSharp lacks on the NumPy side only, as implementation targets.
Score: 20 wins, 4 losses, 1 parity, 8 feature gaps.
NEW LOSSES:
1. RD3 np.sum(f32, dtype=f64) 1.97x — composes astype-materialize (2.3ms) + sum
(0.8ms) = measured 3.23ms instead of casting on load inside the reduce loop
(NumPy buffered-REDUCE does); Wave-5 territory.
2. RD5 np.amin(axis=1) 1.54x — min/max axis kernels lag sum (which wins 2.8x on
the same shape).
3. FX2 fancy scatter 1.49x (gather wins 0.76 — write-side path).
4. AC2 cumsum axis=0 1.36x — BOTH sides ~20ns/elem scalar column-walks (95 vs
70ms); vertical-SIMD accumulate would be ~4-5ms => 15-20x leapfrog open.
WINS (selection family is a rout): argwhere 4.9x, flatten 3x, cumsum axis=1 2.9x,
sum full 2.8x, boolean read 2.6x, ravel F-order 2.2x, where broadcast 2.1x, where
scalar 2x, boolean assign 1.9x, ravel(A.T) 1.9x, sqrt(i32) promoting 1.5x, add
dtype=f32 1.8x, astype 1.8x, 3-D middle-axis sum 1.5x, ravel_multi_index 1.4x,
count_nonzero 1.4x, fancy gather 1.3x, unravel_index parity.
FEATURE GAPS with NumPy targets: reduce axis-tuple (2.08ms) / where= (9.27ms) /
initial= (2.03ms), einsum (2.30/1.42ms — canonical multi-op NpyIter consumer,
NpyExpr machinery fits), np.add.at (6.86ms, soft target), reduceat (1.24ms),
nested_iters, public np.nonzero alias.
Results doc gains the source-grounded consumer map, round-3 table, findings
17-21 (incl. 'do NOT migrate the won selection family onto per-chunk callbacks
without keeping Direct kernels as the fast path').
…(rounds 1-3)
Adds benchmark/poc/npyiter_bench_summary.py - a self-contained renderer
that aggregates every like-for-like NumSharp-vs-NumPy pair from the three
NpyIter benchmark rounds (npyiter_core_bench, npyiter_frontier_bench,
npyiter_frontier2_bench, npyiter_consumers_bench; numbers as recorded in
NPYITER_CORE_BENCH_RESULTS.md, i9-13900K / NumPy 2.4.2 / Release) into a
terminal bar chart: geomean of NumSharp_time/NumPy_time per group,
eighth-block bars scaled 10.2 chars per 1.0x with parity at ~10 chars,
win/lose row counts, and FASTER/PARITY/SLOWER annotations.
Groupings:
- size tier: <=4K (0.71x, 17W/7L), 32K-8M (0.82x, 50W/24L), 10M (1.25x,
  3W/1L - dragged solely by E1 np.any routing; T1 10M traversal is exact
  parity)
- family: construction 0.51x, small-N pipeline 1.14x, chunk traversal
  0.95x, layout copies 0.58x, elementwise/bcast 0.77x, buffered cast
  1.12x, where=/masks 0.67x, reductions/scans 1.09x (carries F1 54x +
  RD3 1.97x), selection/indexing 0.63x, kernel-bound dtypes 0.70x
- overall: 0.81x geomean, 70 win / 32 lose; 0.75x excluding the three
  root-caused outliers (F1 broadcast-view sum 54x, E1 np.any 14.5x,
  AC2 1.4x)
- architecture dividends rendered separately (no NumPy equivalent
  machinery): iterator reuse 7.0x, 8-banded parallel iterators 4.7x,
  one-pass 7-operand fusion 1.9x
Excluded by design to keep the geomean honest: T7a/b/c (Python nditer
protocol overhead is interpreter context, not iterator cost),
NS-internal-only probe rows (G2/G3, T2g, T5n), and duplicate-comparator
variants (T5i, T2x).
…wer<-1.0->faster)
Rework npyiter_bench_summary.py to the house per-size geomean layout used
in the official benchmark-report summary: ratio shown as SPEEDUP =
NumPy_time / NumSharp_time (>1.0 = NumSharp faster), bar scaled 10 chars
per 1.0x so the parity tick lands mid-field at 20-char width, dotted
padding, and the verbatim 'slower <----- 1.0 (parity) -----> faster'
header. Bars now grow toward the 'faster' end - the previous version drew
bar length proportional to NumSharp/NumPy time so faster groups got
SHORTER bars on a flipped axis; geometry, axis labels, and annotations
are now mutually consistent.
Layout per row: 7-char label, 20-char eighth-block bar (>= 2.0x overflows
to a trailing arrowhead), speedup, (N win / M lose), and only
out-of-the-ordinary verdicts annotated (PARITY within 5%, SLOWER below).
Rendered verdict over the same 89 pairs:
- tiers: 1K 1.40x / 4M 1.22x / 10M 0.80x (E1 np.any routing drags 4 rows;
  T1 10M memcpy is exact parity) / ALL 1.24x geomean 70W-32L
- families: ctor 1.95x, layout 1.74x, select 1.60x, where= 1.49x, dtypes
  1.42x, elemwise 1.30x, traversal 1.05x parity, reduce 0.92x, buffered
  cast 0.89x, small-N 0.88x
- dividends: reuse 7.0x / parallel 4.7x / fusion 1.9x rendered with
  capped overflow bars.
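
For readers who want to reproduce the aggregation, the speedup geomean and the bar geometry described above can be sketched as follows. This is an illustrative Python model, not the contents of npyiter_bench_summary.py: the function names are hypothetical, the timing pairs are made-up inputs rather than benchmark results, and the bar uses whole-block characters only, with no eighth-block resolution or overflow arrowhead.

```python
import math


def speedups(pairs):
    """pairs: (numsharp_time, numpy_time) per row -> NumPy_time / NumSharp_time."""
    return [numpy_t / numsharp_t for numsharp_t, numpy_t in pairs]


def geomean(ratios):
    """Geometric mean, computed in log space."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def bar(speedup, chars_per_unit=10, width=20):
    """Whole-block bar: 10 chars per 1.0x, capped at the field width, dotted padding."""
    filled = min(width, round(speedup * chars_per_unit))
    return "█" * filled + "·" * (width - filled)


# Made-up timings in ms, for illustration only: (NumSharp, NumPy)
example_pairs = [(1.0, 2.0), (4.0, 2.0)]
example_speedups = speedups(example_pairs)       # one 2x win, one 2x loss
example_overall = geomean(example_speedups)      # win and loss cancel to parity
```

A geomean is used because it inverts cleanly: the geomean of NumSharp/NumPy time ratios is the reciprocal of the geomean of speedups, so flipping the axis changes only the presentation, not the verdict.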
…port tiers, for the iterator core
Adds a parameterized size-sweep that runs the SAME six NpyIter operation
families at each of four element-count tiers (scalar=1, 1K, 100K, 1M), so the
iterator's NumSharp-vs-NumPy story can be read per cache tier the way the
official benchmark report presents whole-op throughput. Prior NpyIter rounds
used per-aspect ad-hoc sizes; this fixes a clean 6x4 matrix on both sides with
identical ids.
Files:
- npyiter_sizesweep_bench.cs — NumSharp side. Six matched-kernel families:
    add   contiguous binary V256        copy   contiguous copy (memcpy chunk)
    sqrt  contiguous unary V256 Sqrt    sadd   strided a[::2]+b[::2]
    sum   4-acc V256 reduction          bcast  stride-0 a+b1(1)
Same Release-only Debug-JIT guard, best-of-rounds timing, and per-size iter
counts as npyiter_core_bench.cs. All 24 correctness checks pass.
- npyiter_sizesweep_bench.py — NumPy side, identical ids. copy uses
np.positive (a REAL ufunc nditer) not np.copyto (a stripped raw-array
walker), so the copy row is an honest iterator-vs-iterator comparison.
- npyiter_sizesweep_chart.py — renders the speedup bar chart (NumPy/NumSharp,
>1.0 = NumSharp faster) in the official-report axis style, grouped by size
tier and by operation, plus a per-cell matrix. Self-contained: the settled
clean run is embedded as (NumSharp_ms, NumPy_ms) per id; pass two recorded
output files as argv to re-chart a fresh run (the bench .txt outputs are
gitignored artifacts). The first C# run after a build is noise-tainted
(machine not quiesced — sqrt@100K read 248us vs the true 57us); embedded
numbers are the settled re-runs.
Result (geomean NumPy/NumSharp): scalar 2.29x, 1K 2.08x, 100K 1.33x, 1M 1.19x;
ALL 1.66x (20 win / 4 lose). The shape is the MIRROR of the full-API report:
NpyIter is strongest at small N (construction beats np.nditer + ufunc dispatch
setup) and converges to parity at 1M where the kernel saturates memory
bandwidth and the iterator contributes ~nothing. The four sub-1.0 cells
(add@1M 0.88x, sqrt@1M 0.98x, sqrt@100K 0.99x, sadd@100K 0.99x) are all parity
within run-to-run bandwidth variance (add@1M measured 380-453us across runs vs
NumPy 398us). Standout: sum is 2.35-10.11x faster at every tier — NumPy's
reduce carries a ~1.5us fixed setup (sum@1: NS 151ns vs NumPy 1.53us) and a
slower large-N pairwise pass (sum@1M: NS 89us vs NumPy 209us).
…lar/1K/100K/1M
The minimal size-sweep covered only 6 families. This sweeps EVERY distinct
NpyIter operation family accumulated across rounds 1-3 at all four tiers. The
earlier rounds used SIZE as the id axis (H8..H2M, T2.4..T2.1024, O1..O4/M1 are
one op at many sizes); collapsing those, the distinct families are 33 + 3
dividends, each now run at scalar/1K/100K/1M = 143 measured pairs per side.
Families: elementwise (add sqrt copy strided bcast reversed castbuf mixbuf),
reductions (sum sum-ax0 sum-ax1 sum-dt= amin cumsum any-allfalse any-earlyhit),
selection (where a[mask] a[mask]= count_nonzero argwhere a[idx] a[idx]=),
copy/cast (flatten astype ravel.T in-place less->bool), index-math
(unravel_index ravel_multi_index), dtypes (complex128 float16 int8), and
dividends (fuse7 reuse par8 — NumPy has no equivalent machinery).
Files: npyiter_fullsweep_bench.{cs,py} (identical ids; raw-iterator matched
kernels for the elementwise-isolation rows + production np.* for the rest, each
mapped to its NumPy equivalent), npyiter_fullsweep_chart.py (self-contained:
embeds the clean run as (NumSharp_ms, NumPy_ms); renders per-tier, per-category,
category x tier, per-family x tier, and the dividends; pass two run files as
argv to re-chart). i9-13900K, NumPy 2.4.2, Release.
Headline (geomean NumPy/NumSharp, >1.0 = NumSharp faster): ALL 1.32x, 81 win /
51 lose over 132 main cells; tiers scalar 1.47x / 1K 1.46x / 100K 1.13x / 1M
1.27x. By category: reductions 2.14x, elementwise 1.41x, selection 1.31x,
dtypes 1.16x, but copy/cast 0.76x and index-math 0.75x lag (small-N per-call
copy overhead, crossing to wins at 1M). Dividends: fuse7 up to 17x vs chained
adds, reuse 5x at small-N, par8 2.4x at 1M.
Findings surfaced/confirmed across the size axis (presented to user):
- INTERMITTENT SEGFAULT (~50% of runs): uncatchable AccessViolation under the
heavy mixed alloc/free load, varying crash point (seen at gather@1K / argw
region) — heap corruption or GC/finalizer race on unmanaged NDArray storage.
- np.any(all-false): 24x faster at scalar but 12.5x SLOWER at 1M (0.08x) —
scalar scan, no SIMD; early-exit case hides it. (confirms the routing bug.)
- np.less(out=bool): consistently 1.5-2.7x slower at every size.
- fancy a[idx] gather 1.4-3.4x slower, a[idx]= scatter 1.2-1.7x slower at all
sizes; amin axis-reduce 2.4x slower at 100K+; float16/complex128 ~1.3-1.7x
slower (documented scalar paths).
- int8 add VERIFIED correct and ~7x faster (sweep's 12x inflated by a noisy
NumPy reading); reductions/castbuf/count_nonzero are the largest honest wins.
… harness, one results sheet
Promotes the exploratory poc/npyiter_* rounds into a single MAINTAINED
NumSharp-vs-NumPy benchmark under benchmark/npyiter/. Every distinct NpyIter
aspect the poc rounds surfaced now lives in one place, swept across cache tiers,
rendered into ONE sheet (npyiter_results.md): 162 measured pairs.
Pieces:
- npyiter_bench.cs / .py — identical-id NumSharp + NumPy benches, SECTION-
ADDRESSABLE via the NPYITER_SECTION env var. Ten sections:
operations x size : elementwise/reductions/selection/copycast/indexmath/
dtypes/dividends — 33 families x {scalar,1K,100K,1M}
construction : 9 iterator flag configs vs np.nditer build
chunkwidth : per-chunk dispatch overhead across inner widths 4..1024
pathology : the regression canaries (bcast-reduce, allocate,
overlap-copy, F-order-out, 0-d)
Iterator-isolation rows drive NpyIterRef directly with trivial NumPy-loop-
matched kernels; production rows call np.* both sides; copy compares to
np.positive (a real ufunc nditer), never np.copyto.
- npyiter_sheet.py — orchestrator + renderer. Runs each section in its own
short-lived process (crash isolation) and renders the per-tier/per-category/
per-family operation matrix + construction + chunk-width + pathology +
dividends sheet. Resilience baked in after the monolithic poc fullsweep
AV'd ~50% of runs:
* each NumSharp section retries up to 4x on a crash;
* DOTNET_DbgEnableMiniDump=0 so an AV returns a non-zero exit IMMEDIATELY
instead of stalling the process while a crash dump is written (the silent
hang that voided the first full run — we never taskkill dotnet);
* per-subprocess timeout backstop;
* tsv is written incrementally after EVERY section and --resume skips
already-collected sections, so a mid-sweep death never loses progress.
Flags: --skip-build, --render-only, --resume, --sections.
- README.md — run instructions, methodology guardrails (Release-only, matched
kernels, positive-not-copyto), section table, and the findings ledger.
- npyiter_results.{md,tsv} — the rendered sheet + raw (id, ns_ms, np_ms) pairs
from the 2026-06-13 run (i9-13900K, NumPy 2.4.2, Release).
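The crash-isolation scheme described for npyiter_sheet.py can be sketched roughly as follows. This is a hedged illustration, not the orchestrator's code: `run_section`, `sweep`, the `save` callback, and the default attempt count and timeout are hypothetical names and values; only the NPYITER_SECTION and DOTNET_DbgEnableMiniDump variables come from the text above.

```python
import os
import subprocess
import sys

def run_section(cmd, section, attempts=4, timeout_s=600):
    """Run one bench section in its own process.

    Returns stdout on success, or None if every attempt crashed or timed out.
    """
    env = dict(os.environ)
    env["NPYITER_SECTION"] = section
    # Make a crash exit immediately instead of stalling on a dump write.
    env["DOTNET_DbgEnableMiniDump"] = "0"
    for _ in range(attempts):
        try:
            proc = subprocess.run(cmd, env=env, capture_output=True,
                                  text=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            continue  # timeout backstop: count it as a failed attempt
        if proc.returncode == 0:
            return proc.stdout
    return None

def sweep(cmd, sections, done, save):
    """Collect every section, skipping ones already in `done` (resume)."""
    for section in sections:
        if section in done:
            continue
        done[section] = run_section(cmd, section)
        save(done)  # persist after EVERY section so a later death loses nothing

if __name__ == "__main__":
    # Stand-in command: a child Python process that echoes its section name.
    child = [sys.executable, "-c",
             "import os; print(os.environ['NPYITER_SECTION'])"]
    results = {}
    sweep(child, ["construction", "pathology"], results, save=lambda d: None)
    print(results)
```

Keeping the retry loop and the resume check in separate functions means a section that fails every attempt still produces a recorded entry (here `None`) rather than aborting the sweep.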
Headline (speedup = NumPy/NumSharp, >1.0 = NumSharp faster): operation matrix
1.24x geomean, 80 win / 52 lose over 132 cells; tiers scalar 1.20x / 1K 1.32x /
100K 1.14x / 1M 1.32x. Categories: reductions 2.03x, elementwise 1.48x,
selection 1.32x, dtypes 1.03x; copy/cast 0.58x and index-math 0.63x lag at
small N (per-call setup) and cross to wins by 1M. Construction beats np.nditer
6.19x geomean (up to 12.5x on the 8-operand build). Chunk-width loses at w=4
(0.74x) and wins from w=64 up. Dividends NumPy structurally can't match: fuse7
4.6-15x, reuse up to 9x, par8 2.5-7x.
Regression canaries / losses tracked by the sheet: bcast-reduce 51x slower,
F-order-out 3.5x, allocate 2x; np.any full-scan 0.07x at 1M (scalar scan vs
SIMD); comparison->bool 0.5x; fancy a[idx] gather/scatter 0.5-0.75x; amin
axis-reduce 0.4x at scale. int8 verified ~7-11x faster (correct). These are the
fix-list, ordered in README's findings ledger.
… (commit-to-master)
Wires the canonical NpyIter benchmark into a semi-manual GitHub Action that runs
AFTER a release and publishes results straight to master (the chosen target,
not the wiki — GITHUB_TOKEN can push to the repo but not to a .wiki.git, which
would need a PAT).
- .github/workflows/npyiter-benchmark.yml — SEPARATE workflow (never gates the
release). Triggers: workflow_run on 'Build and Release' completion+success,
plus workflow_dispatch (the manual knob). Sets up .NET 8+10 (preview) and
Python 3.12, pins numpy==2.4.2, builds Release, runs npyiter_sheet.py
--skip-build, renders the cards, and commits npyiter_results.{md,tsv} +
cards/*.png to master with '[skip ci]' (so the push can't re-trigger the
release workflow). permissions: contents:write — no PAT needed.
- benchmark/npyiter/npyiter_cards.py — renders two 400x300 PNG cards from the
tsv (matplotlib, figsize=(4,3) dpi=100): ops.png (speedup by size tier) and
cat.png (speedup by op class). Ratio-only by design — absolute ms vary by
runner hardware, but the same-runner NumPy/NumSharp ratio stays meaningful.
- README.md — new 'Performance vs NumPy' section embedding both cards (raw URLs)
linked to the full report, with the explicit 'same-machine ratio' caveat.
- npyiter_sheet.py — portability fix: run_ns() rewrites the .cs's absolute
'#:project K:/source/NumSharp/...' line to THIS checkout's csproj path, so the
same bench (authored to run directly via 'dotnet run - < file' on Windows)
runs unchanged on a Linux CI runner.
- Refreshed npyiter_results.{md,tsv} + cards from a clean 162-pair run
(headline 1.19x, 76 win / 56 lose). That run also exercised the resilience
for real: the selection section took an 0xC0000005 AV on attempt 1 and the
orchestrator's retry recovered it automatically — exactly the CI-safety the
section isolation + DbgEnableMiniDump=0 + retry was built for.
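The portability fix in run_ns() amounts to rewriting one directive line before the bench is handed to dotnet. A minimal sketch, assuming the directive sits on its own line; `repoint_project` and the target path are hypothetical, and the real run_ns() may differ:

```python
import re

def repoint_project(source_text, csproj_path):
    """Point every '#:project <path>' directive line at csproj_path."""
    # A callable replacement avoids backslash-escape surprises in Windows paths.
    return re.sub(r"(?m)^#:project[ \t]+[^\r\n]*",
                  lambda m: "#:project " + csproj_path,
                  source_text)

bench_src = "#:project K:/source/NumSharp/...\nusing System;\n"
# Hypothetical checkout location on a CI runner.
print(repoint_project(bench_src, "/work/checkout/Hypothetical.csproj"))
```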
Caveats documented in the workflow header: shared-runner variance (ratios not
absolutes), and direct-to-master push assumes master is CI-writable (no branch
protection blocking github-actions[bot]); switch to a PR step if that changes.
…iter
The six exploratory POC rounds (npyiter_core/frontier/frontier2/consumers/
sizesweep/fullsweep benches + bench_summary + charts + NPYITER_CORE_BENCH_RESULTS.md)
were superseded by the canonical benchmark/npyiter/ — every aspect they surfaced
now lives there, swept across cache tiers into one sheet. Removed: 17 tracked
files + 4 gitignored .txt dumps.
KEPT benchmark/poc/npyiter_parity_poc.{cs,py} — it is NOT part of the
exploratory rounds: it holds the hand-written AVX2-gather reference kernels
(PocKernels.AddF32/SqrtF32/SumF32) that docs/NPYITER_PERF_HANDOVER.md points to
as the blueprint to transcribe for the IL-emission work, and benchmark/CLAUDE.md
cites it as the Debug-JIT guard example.
Reference fixes (no dangling links left):
- docs/NPYITER_GAPS_AND_ROADMAP.md §6: the iterator-core reproduce block now
points to benchmark/npyiter/npyiter_sheet.py instead of the deleted bench.
- benchmark/npyiter/README.md: dropped the 'supersedes poc/npyiter_*' wording
(those files are gone) for a self-contained description.
Finalize: added benchmark/npyiter/.gitignore for transient run artifacts
(run.log, __pycache__) so only the durable outputs — npyiter_results.{md,tsv}
and cards/*.png — are tracked.
… tier; AV→NA; one CI
Folds the NpyIter benchmark into the official orchestrator so there is ONE entry
point and ONE report, while keeping the two harnesses distinct (they measure
different things — op/dtype/N throughput vs the iterator machinery — and the
NpyIter harness needs internal access + section-isolation the BenchmarkDotNet
in-process run can't give).
run_benchmark.py — after the official (op,dtype,N) merge, runs the NpyIter sheet
+ cards and APPENDS the sheet to benchmark-report.md as its own section (not
merged — different result model). Archives npyiter_results.{md,tsv} + cards into
results/<ts>/. New --skip-npyiter flag. This is now the single command for the
whole NumSharp-vs-NumPy comparison.
+10M tier (decision 1): npyiter_bench.{cs,py} sweep now scalar/1K/100K/1M/10M
(grid 2500x4000 = 10M exactly; pick 30 iters/3 rounds at 10M). sheet TIERS +
cards pick it up automatically.
AV → NA/IGNORED (decision 3): instead of silently omitting a section that
crashes all retries, the sheet now records its ids NA (NumPy runs first to give
the expected id set), prints an AV-POLICY header explaining the known
intermittent AccessViolation is ignored, lists 'THIS RUN: NA across <sections>',
shows NA cells in the per-family/dividends matrices, and excludes NA from every
geomean. tsv stores NA; load/cards skip it.
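The AV → NA policy boils down to "record the id, skip it in every aggregate". A small sketch of that handling over the (id, ns_ms, np_ms) tsv layout; the function names and the sample rows are hypothetical, not the sheet's code or real measurements:

```python
import math

def parse_tsv(text):
    """Parse 'id<TAB>ns_ms<TAB>np_ms' rows; an NA NumSharp cell becomes None."""
    rows = {}
    for line in text.strip().splitlines():
        bench_id, ns_ms, np_ms = line.split("\t")
        rows[bench_id] = (None if ns_ms == "NA" else float(ns_ms), float(np_ms))
    return rows

def na_ids(rows):
    """Ids whose NumSharp side crashed on every retry."""
    return [bench_id for bench_id, (ns_ms, _) in rows.items() if ns_ms is None]

def geomean_speedup(rows):
    """Geomean of NumPy/NumSharp over measured cells only; NA is excluded."""
    logs = [math.log(np_ms / ns_ms)
            for ns_ms, np_ms in rows.values() if ns_ms is not None]
    return math.exp(sum(logs) / len(logs)) if logs else None

# Made-up rows purely to exercise the NA path.
sample = "add@1K\t1.0\t2.0\nwhere@1K\tNA\t5.0\ncopy@1K\t2.0\t1.0\n"
rows = parse_tsv(sample)
print(na_ids(rows), geomean_speedup(rows))
```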
CI consolidation (decision 2): npyiter-benchmark.yml -> benchmark.yml, now runs
the ENTIRE suite via run_benchmark.py. Trigger changed from workflow_run-on-
every-build to release:published (the real 'after a successful release' signal —
'Build and Release' publishes a GitHub Release on a v* tag) + workflow_dispatch,
so the heavy full suite runs per-release, not per-push. Commits report + cards
to master with [skip ci]. timeout-minutes: 180.
The npyiter_parity_poc gather kernels and the rest of the harness methodology
(Release-only, matched kernels, positive-not-copyto, section isolation) are
unchanged.
…on selection
Refreshes the canonical NpyIter results (npyiter_results.md/.tsv) and the two README cards with a full sweep that now includes the 10M cache tier, and records the AV->NA policy firing on a real run. Also documents the run_benchmark.py integration in benchmark/CLAUDE.md.
What changed
------------
* 198 measured pairs (was 162), 35 of them NA. The new 10M tier adds 36 ids across the size-swept families; SIZES is now scalar/1K/100K/1M/10M end to end (bench .cs + .py grids: 10M = 2500x4000).
* selection (where / a[mask] / a[mask]= / count_nz / argwhere / a[idx] / a[idx]=) hit NumSharp's known intermittent AccessViolation on EVERY retry this run, so the whole section is reported NA/IGNORED per policy and excluded from every geomean. The header now reads "198 measured pairs (35 NA)" and "AV POLICY ... THIS RUN: NA across selection."; the section renders as "(no data)" / "-" / "NA" cells instead of crashing the sweep. This is the designed crash-resilience path proven on a live run, not a regression.
* Headline operation matrix: 1.17x geomean, 77 win / 53 lose over 130 cells (26 non-selection families x 5 tiers). Reductions lead (1.80x), dtypes 1.59x, elementwise 1.12x; copy/cast (0.65x) and index-math (0.70x) remain the small-N laggards already tracked as canaries.
Doc
---
benchmark/CLAUDE.md run_benchmark.py section now describes the appended NpyIter step (aspect x tier, appended-not-merged, section-isolated, AV->NA, --skip-npyiter) and points at benchmark/npyiter/README.md, so the dev guide matches the wired-in integration (run_benchmark.py + benchmark.yml).
Known bug surfaced (tracked, not fixed here)
--------------------------------------------
The selection-section AccessViolation (0xC0000005) is an unmanaged-storage lifetime bug in NumSharp under heavy mixed alloc/free load. It is intermittent (~50% per heavy section) and uncatchable; the benchmark now degrades to NA rather than masking it. Worth a dedicated issue + fix pass.
…ted report artifacts
Adds docs/website-src/docs/benchmarks.md — the DocFX page the user asked for:
"the real place where we discuss and present the efforts to surpass NumPy
through the power of Runtime IL Generation." It is the evidence companion to the
existing IL Generation page (il-generation.md explains HOW the kernels are
emitted; this page shows WHAT that buys head-to-head against NumPy).
The page is driven by the artifacts the Benchmark workflow (benchmark.yml)
auto-commits to master after every release:
* The two 400x300 cards are embedded by absolute raw.githubusercontent master
URLs (same source the README uses), so they always reflect the latest
committed run rather than a pasted screenshot. Verified the docfx build keeps
the URLs absolute (it does not relativize external links).
* The full reports are linked on master: the iterator sheet
(benchmark/npyiter/npyiter_results.md, which the cards render from) and the
op/dtype/N matrix (benchmark/benchmark-report.md), plus the harness README and
benchmark/CLAUDE.md.
Content (grounded in the current committed npyiter_results.md numbers):
* Headline cards + a by-class geomean table (reductions ~1.8x, dtypes ~1.6x,
elementwise ~1.1x parity, copy/cast ~0.65x, index-math ~0.7x).
* Class-by-class discussion tying each result to the IL mechanism (4x unrolling,
tree reduction, SIMD early-exit, per-(op,dtype,layout) specialization), and
honest about the taxes (small-N copy/cast, all-false any() scan, bcast_reduce).
* The dividends NumPy can't structurally match: expression fusion (np.evaluate,
up to ~13x), kernel reuse, parallel inner loop (par8 up to ~8x), cheaper
iterator construction (~2-3x vs np.nditer).
* Methodology + honesty section: Release-only JIT, best-of-rounds, ratios-not-
absolutes, and the AV->NA policy.
* Reproduce-locally commands.
Wiring:
* docs/toc.yml — new "Benchmarks vs NumPy" entry right after IL Generation.
* il-generation.md — cross-link from the Performance Impact section ("naive C#"
table vs the head-to-head-NumPy page).
* index.md — added IL Generation + Benchmarks links to Get Started.
Validated with `docfx build` (build-only, metadata skipped): 0 errors, the page
itself emits 0 warnings (the 84 UidNotFound warnings are api/toc.yml entries that
only resolve after the metadata step, which CI runs first). benchmarks.html
renders, cards resolve to absolute URLs, internal links rewrite to .html.
Note: deploy is via docs.yml on push to master (paths: docs/website-src/**); this
branch commit does not deploy until merged. How the page REFERENCES the
auto-committed cards (raw-master URL vs bundling copies into website-src/images/)
is the next thing to settle.
…FX site
Two follow-ups to the Benchmarks vs NumPy page, both from user direction.
1) The two 400x300 cards now carry the whole canonical summary (modeled on the
ASCII sheet the user singled out), not just one bar chart each. Everything is
still COMPUTED from npyiter_results.tsv, so the cards auto-update each run and
NA (AccessViolation) ids are skipped.
* cards/ops.png — OPERATIONS vs NumPy: headline (geomean / win-lose / cells)
+ by-array-size-tier bars (scalar..10M) + by-operation-class bars ranked
best->worst (reductions 1.80x ... copy/cast 0.65x; wins green, the two
small-N taxes red).
* cards/cat.png — the IL-GENERATION DIVIDENDS, the "machinery NumPy has no
equivalent for": iterator build vs np.nditer, expression fusion (np.evaluate),
kernel reuse, parallel inner loop — each bar is the honest geomean with an
"up to <peak>x" annotation — plus the chunk-width trend (w=4 -> w=1024) and
the honest pathology canary (bcast_reduce ~52x behind, in red).
npyiter_cards.py rewritten: shared hbars() helper, color_of() (green/amber-
parity/red), stat() for (geomean, peak), two card builders. Imports CTOR/CW/
PATH/DIVIDENDS from the sheet so the section data stays single-sourced.
Captions/alt-text updated to match the new card semantics (cat.png is no longer
"by op class") in README.md and benchmarks.md.
2) Full reports are now rendered INTO the site as searchable pages (user choice:
"Render into the site"), in addition to being linked on GitHub:
* docs/website-src/docs/benchmark-matrix.md — the op/dtype/N matrix
(benchmark-report.md body under a single page H1).
* docs/website-src/docs/benchmark-iterator.md — the canonical iterator sheet
(npyiter_results.md fenced block under a page H1).
* toc.yml nests both under "Benchmarks vs NumPy"; benchmarks.md "Read the full
reports" now links the on-site pages (raw files still linked on master).
benchmark.yml regenerates these two pages from the just-produced reports (op
matrix drops its own H1 via tail -n +2 so the page has one title; the iterator
sheet has no H1), commits them alongside the report + cards, and — because the
commit carries [skip ci] and the pages live under docs/website-src/** — then
`gh workflow run docs.yml` to redeploy the site (added actions:write + GH_TOKEN).
Validation
----------
* npyiter_cards.py renders both cards; verified visually (legible at 400x300).
* benchmark.yml is valid YAML (yaml.safe_load).
* docfx build (build-only): 0 errors; benchmark-matrix.html + benchmark-iterator.html
generate; benchmarks.html internal links to both resolve; no warning names any new
page (the 82 UidNotFound warnings are api/toc.yml, resolved by the metadata step CI
runs first). No docs/website/ build-output committed.
Still open (deferred by the user): the card REFERENCING mechanism on the docs page
(raw-master URLs today vs bundling the PNGs into website-src/images/). The redeploy
chaining added here would make that swap trivial if chosen later.
… 15 Best"
The op/dtype/N matrix report (benchmark-report.md, rendered into the site as benchmark-matrix.md) showcased garbage: every "Top 15 Best" row was np.copy(float64) and np.searchsorted at "0.0 / 0.0x". Three distinct bugs, all fixed.
BUG 1 — searchsorted benchmark measured nothing (both sides)
SortingBenchmarks.cs and numpy_benchmark.py issued a SINGLE scalar lookup (np.searchsorted(sorted, N/2)) — one O(log N) binary search, ~18ns at EVERY N, pure call overhead. Against NumPy's ~1µs Python overhead that manufactured a meaningless 50–1000x "win". Fixed: both now query the N-element array (a) into the sorted target → N binary searches, real work that scales with N. (Verified the C# benchmark project still compiles.)
BUG 2 — normalize_op_name collapsed a slice-copy onto np.copy
The Slicing suite's "np.copy(a[100:1000])" (a fixed 900-element slice copy, ~3.6µs at every N) was normalized by stripping ALL "[...]" — including the array-index "[100:1000]" — yielding "np.copy", which COLLIDED with the Creation full-array "np.copy(a)" in csharp_index (last-write-wins) and overwrote the real float64 measurement. THAT was the bogus "copy float64 = 0.0036ms" (not a copy bug at all; the op is fine — archived raw float64 copy@10M = 11.04ms). Fixed: only strip a space-separated " [annotation]" (\s+\[ instead of \s*\[), never index brackets attached to an identifier. Incidentally also de-collides concatenate/stack/slice variants. copy(float64) now reads its real values across all sizes (10M → 11.04ms, ratio 0.60 = a genuine win).
BUG 3 — the report ranked/averaged non-credible rows as wins
merge-results.py sorted "Top Best" by ratio with only a `ratio is not None` guard, so a sub-resolution NumSharp time (ratio rounding to 0.0) sorted to #1, and CSV blanked legit 0.0 via `r.ratio or ''`. Fixed with a credibility gate (classify()): a row is "negligible" (new ▫ status) when either side did <1µs of work OR the speedup exceeds 20x (NumSharp >20x faster ⇒ artifact: a view, a lazy alloc, or a dead-code-eliminated kernel). Negligible rows are EXCLUDED from Top Best/Worst and from the per-size geomean, but still listed (▫) in the per-suite tables — nothing hidden. Also: store ms at 4 / ratio at 3 decimals, show 3-decimal ms + 2-decimal ratio in the showcase (no more "0.0/0.0x"), fix the `or ''` falsy-zero in CSV, add the ▫ legend row + summary/size-table counts, and a header note stating how many rows were excluded and why.
Result (regenerated from the on-disk run archive with the fixed merge):
* Top Best is now real reductions/statistics wins (np.nansum 0.08x, np.percentile 0.10x, np.average 0.10x) — genuine ms on both sides.
* 1233 ops → 305 faster / 255 close / 169 slower / 103 much-slower / 275 NEGLIGIBLE (the artifacts, previously ~all counted as "faster") / 126 n/a.
* Top Worst surfaces a real gap: np.zeros (NumSharp eagerly zeros ~10.7ms vs NumPy lazy calloc ~0.01ms) — a legitimate optimization target, not an artifact.
benchmark-matrix.md (the DocFX page) re-seeded from the corrected report; docfx build clean (0 errors). The searchsorted benchmark fix takes effect on the next CI run; the credibility gate keeps any residual artifact out of the showcase meanwhile.
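To make BUG 2 and BUG 3 concrete, here is a hedged sketch of the two fixes. It is not the code in merge-results.py: the function bodies, the `[float64]` annotation example, and the timings passed to `classify` are assumptions for illustration. Only the two regex fragments, the <1µs and >20x thresholds, and the `or ''` pitfall come from the description above.

```python
import re

def normalize_buggy(name):
    # \s*\[ also matches index brackets glued to an identifier.
    return re.sub(r"\s*\[.*?\]", "", name).strip()

def normalize_fixed(name):
    # \s+\[ only strips a space-separated " [annotation]".
    return re.sub(r"\s+\[.*?\]", "", name).strip()

def classify(numsharp_ms, numpy_ms, min_ms=0.001, max_speedup=20.0):
    """Credibility gate: 'negligible' if either side did <1us or the win is >20x."""
    if numsharp_ms < min_ms or numpy_ms < min_ms:
        return "negligible"
    if numpy_ms / numsharp_ms > max_speedup:
        return "negligible"
    return "credible"

def csv_cell(ratio):
    # `ratio or ''` would blank a legitimate 0.0; test for None explicitly.
    return "" if ratio is None else ratio

print(normalize_buggy("np.copy(a[100:1000])"))   # collides with the full-array copy
print(normalize_fixed("np.copy(a[100:1000])"))   # index brackets survive
print(normalize_fixed("np.add(a, b) [float64]")) # hypothetical annotation stripped
print(classify(0.0, 0.5), classify(1.0, 30.0), classify(2.0, 3.0))
```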
… 1.3–6.1)
Branch advanced 31 substantive commits past the first changelog (which described through 33058b8). The branch was rebased meanwhile — the original changelog commit bb7ed7a8 is orphaned, its twin is 4140f4d, and 33058b8 remains an ancestor of HEAD, so 33058b8..HEAD is the true new-work boundary.
Learned and folded in:
- np.evaluate — Tier-3C fusion made public; per-node NumPy result_type typing (fixes the mixed-tree dtype bug: i4*i4+f8 must wrap in int32 first), fused reductions, EXTERNAL_LOOP guard, out= via ufunc rules. 3.2–6.1x vs NumPy.
- out=/where=/dtype= across the elementwise ufunc API (binary, unary-math, comparisons, predicates, bitwise, invert, arctan2) — one NumPy-shaped overload each, exact broadcast/cast/error-text semantics.
- New at np.*: bitwise_and/or/xor (were operator-only, CS0117) and positive.
- nditer: WRITEMASKED/ARRAYMASK execution + VIRTUAL operands (was silent masked-write corruption); Wave-1.4 fixes (size-1 stride-0 invariant, op_axes OOB, write-broadcast validation, PARALLEL_SAFE, unit-axis absorb).
- Alloc Wave 2.4: buffer-pool window 4KiB–1MiB -> 1B–64MiB, pool-side GC pressure, finalizer suppression.
- Canonical NpyIter benchmark suite + post-release benchmark.yml CI + DocFX Benchmarks-vs-NumPy website pages; honest frontier findings recorded (broadcast-reduce 54x, scalar np.any 14.5x, BUFFERED+REDUCE ForEach P0 crash, parallel banding 4.7x win).
Stats refreshed: 272/519/+198k -> 312 commits, 615 files, +217,949/-16,402. Tests: 9,447 -> 9,709 passed/0 failed (net10.0). New-API count 30 -> 35. Same content (minus H1) pushed live to the PR #611 description via REST PATCH.
…oard page
Adds a new DocFX page in the npyiter_results.md dashboard style (ASCII bars, geomeans, win/lose, top wins/losses) applied to the broad op × dtype × N matrix — the graph/stats/numbers companion to the narrative benchmarks.md, with minimal prose.
* benchmark/scripts/render_dashboard.py — reads the merged benchmark-report.json and emits benchmark-dashboard.md: headline geomean, BY-SIZE-TIER / BY-SUITE / BY-DTYPE bars (same bar() aesthetic as npyiter_sheet.py — length 10 = parity, 20 = 2.0×), the status mix, and TOP-12 wins/losses with raw ms. Charts only CREDIBLE rows (the merge-results.py gate), so the negligible artifacts that used to dominate stay out. speedup = NumPy ÷ NumSharp.
* docs/website-src/docs/benchmarks-dashboard.md — the page (title + one-line note + the ```-fenced sheet), seeded from the renderer. Nested under "Benchmarks vs NumPy" in toc.yml as "Dashboard (op matrix)", beside the full Operation matrix and Iterator sheet.
* benchmark/.gitignore — ignore the benchmark-dashboard.md intermediate (the tracked form is the DocFX page), matching how benchmark-report.json/csv are handled.
What it shows on the current data (honest, broad picture vs the curated npyiter sheet): 0.74× geomean over 832 credible cells (305 win / 527 lose) — NumSharp trails on the full matrix but reaches parity at 10M (0.98×), and wins decisively where its IL kernels shine: statistics 2.28×, broadcasting 1.22×, reduction 1.21×; uint8 1.07×. Laggards are arithmetic/unary/creation and bool. Top wins: nansum/percentile/average (8–13×). Top losses: np.zeros (eager-zero vs NumPy lazy calloc, ~500–880×) and argsort (~25×).
Prototype scope: the page is a committed STATIC snapshot. To make it live (auto-refresh each release like the matrix/iterator pages), wire render_dashboard.py + a seed step into run_benchmark.py / benchmark.yml — deferred pending design review. docfx build is clean.
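The "length 10 = parity, 20 = 2.0×" bar scale can be sketched as below. This is an illustrative guess at the shape of such a helper, not the bar() in npyiter_sheet.py; the fill character, the rounding rule, and the clamp at 40 characters are all assumptions.

```python
def bar(speedup, per_unit=10, max_len=40, fill="#"):
    """ASCII bar whose length is proportional to speedup (10 chars = 1.0x)."""
    # Assumption: clamp so one outlier cannot blow out the column width.
    length = max(0, min(max_len, round(speedup * per_unit)))
    return fill * length

for label, sp in [("parity", 1.0), ("2.0x", 2.0), ("0.5x", 0.5)]:
    print(f"{label:>8} {bar(sp)}")
```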
Two net8.0-only BCL semantic gaps surfaced by the fuzz differential matrix.
Both behave correctly on net9.0+ (where the BCL was fixed) but produced
wrong values on net8.0; worked around to match NumPy 2.4.2.
1. np.abs(complex) with an infinite component returned NaN instead of +inf
------------------------------------------------------------------------
cabs(NaN + inf*i) must be +inf (C99 hypot / npy_cabs: the infinity test
precedes the NaN test). System.Numerics.Complex.Abs routes through a
private Hypot whose operand ordering is NaN-unaware, so on net8.0 it
returns NaN for abs(NaN+inf*i) (fixed in the .NET 9 BCL).
Added Utilities/NpyComplexMath.Abs(Complex): returns +inf when either
component is infinite, else defers to Complex.Abs — so every finite/
NaN-only magnitude that already matched NumPy bit-for-bit is unchanged.
Repointed the two cached MethodInfo handles that drive every complex-abs
emit site: DirectILKernelGenerator.CachedMethods.ComplexAbs (6 IL call
sites across the scalar/strided/predicate/math/decimal unary loops) and
DefaultEngine.UnaryOp.s_complexAbs (NpyIter Tier-3B route).
Fixes 19 unary.jsonl + 1 random_smoke.jsonl fuzz cases (all layouts:
contiguous / strided / transposed / broadcast / negstride).
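The ordering rule in fix 1 (test for infinity before NaN) can be illustrated in a few lines. This is a language-neutral sketch in Python, not the C# in Utilities/NpyComplexMath; `nan_unaware_abs` is a deliberately naive stand-in for a magnitude routine that lets NaN win, not the BCL's actual Hypot algorithm.

```python
import math

def nan_unaware_abs(z):
    """Naive magnitude: NaN in either component poisons the result."""
    return math.sqrt(z.real * z.real + z.imag * z.imag)

def npy_abs(z):
    """Infinity test first (C99 hypot rule), else defer to the plain magnitude."""
    if math.isinf(z.real) or math.isinf(z.imag):
        return math.inf
    return nan_unaware_abs(z)

z = complex(math.nan, math.inf)
print(nan_unaware_abs(z), npy_abs(z), npy_abs(3 + 4j))
```

Because the guard only fires when a component is infinite, finite and NaN-only inputs take exactly the same path as before, which is why the already-matching cases stay unchanged.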
2. ptp / amax / amin along an axis dropped NaN instead of propagating it
------------------------------------------------------------------------
The typed-struct leading/innermost axis-reduction fast paths
(MinOp<T>/MaxOp<T>.Combine256/128) called raw Vector256/128.Min/Max. The
x86 vminps/vmaxps these lower to return the SECOND operand on an
unordered (NaN) compare; the BCL Vector{N}.Min/Max only adopted IEEE NaN
propagation in .NET 9. Verified: Vector128.Max(NaN,5) == 5 on net8.0,
== NaN on net10.0. So max/min/ptp over a NaN-laced axis silently lost
the NaN on net8.0 (ptp axis=0 returned a finite value where NumPy = NaN).
Routed MinOp/MaxOp through the existing NaNAwareMinMax256/128 helper
(already used by the contiguous/strided CombineVectors paths) and wrapped
that helper's float/double self-equality mask in #if NET8_0 — so net9.0+
keeps the single-instruction vmaxps with zero overhead while net8.0 gets
ConditionalSelect(ordered, min/max, a+b) NaN propagation. The flat
whole-array reduction kernel already emitted this via
EmitVectorNaNPropagatingMinMax, so only the axis fast paths were affected.
Fixes 12 stat.jsonl fuzz cases (ptp float32/float64, axis 0/1, C/F-contig).
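The net8.0 behaviour and the ConditionalSelect(ordered, min/max, a+b) workaround can be modelled per lane with scalars. A hedged sketch: `raw_max` models the "return the second operand on an unordered compare" rule described above, and `nan_aware_max` mirrors the select; neither is the NaNAwareMinMax256/128 helper itself.

```python
import math
from functools import reduce

def raw_max(a, b):
    """One lane of a maxps-style max: an unordered compare yields the SECOND operand."""
    return a if a > b else b

def nan_aware_max(a, b):
    """Select raw max when both lanes are ordered, else a + b (which is NaN)."""
    ordered = (a == a) and (b == b)  # self-equality is False only for NaN
    return raw_max(a, b) if ordered else a + b

axis = [1.0, math.nan, 3.0]
print(reduce(raw_max, axis))        # the NaN is silently dropped
print(reduce(nan_aware_max, axis))  # the NaN propagates
```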
Verification: full unit suite green on BOTH net8.0 and net10.0 (9709 passed
/ 0 failed under the CI filter), FuzzMatrix 42/42 on both. The originally
reported trunc "Could not find Truncate for Vector128" failures were already
resolved in-tree by the CanUseUnarySimd #if NET8_0 guard (commit 5716f86);
the leak-guard working-set tests pass locally (their CI failures were OS
working-set / GC-mode noise, not a managed or unmanaged leak).
…NumSharp faster)
The dashboard prototype was the odd one out: I rendered it speedup = NumPy ÷ NumSharp
(>1× = faster), while the op-matrix report it is derived from — and merge-results.py —
use ratio = NumSharp ÷ NumPy (<1× = faster, lower is better). Two pages off the same data
with opposite conventions is exactly the faster/slower confusion to avoid.
Verified first that the underlying direction is NOT a flip: counting raw milliseconds
(numsharp_ms vs numpy_ms, no ratio involved), NumSharp took LESS time on 305 ops and MORE
time on 526 of 832 credible ops; geomean NS/NP = 1.36. So "NumSharp trails on the broad
matrix" is real (concentrated in Arithmetic = 231 slower ops, and Unary), and it matches the
op-matrix report's own conclusion. The dashboard's data was right; only its convention was
inverted relative to the house default.
render_dashboard.py now uses NS/NP throughout:
* ratio = numsharp_ms / numpy_ms; header + axis read "faster ◄ 1.0 (parity) ► slower".
* HEADLINE 1.36× geomean · 305 faster / 527 slower.
* by-suite / by-dtype ranked fastest→slowest (ascending ratio): statistics 0.44×,
reduction 0.83×, broadcasting 0.82× now read as FASTER; creation 2.83× / unary 2.63× /
bool 3.55× as slower.
* status bands relabelled to NS/NP (faster ≤1.0× / close 1–2× / slower 2–5× / much >5×).
* tables renamed FASTEST / SLOWEST; each row shows the NS/NP ratio plus a human factor
("0.079× (12.6× faster)", "880.9× (881× slower)") so the small-ratio-is-good direction is
unambiguous.
benchmarks-dashboard.md re-seeded with the matching note; docfx build clean. This makes the
report + dashboard consistent. The narrative benchmarks.md, the npyiter iterator sheet, and
the README cards still use the speedup (NP/NS, >1× = faster) framing — flipping those is a
separate call (they are win-showcases where >1× reads naturally).
…m the changelog
Per review: the changelog should describe the final state, not the development path. Removed the temporal 'Latest wave (Waves 1.3–6.1) — added after the first changelog' umbrella section entirely and dissolved its content into the proper topical sections, with all 'wave' terminology and 'added after'/'previously absent'/'now reachable' path-language gone:
- np.evaluate folded into §2 (NpyExpr DSL): per-node result_type typing, fused reductions, out= rules, EXTERNAL_LOOP guard, measured speedups.
- out=/where=/dtype= ufunc kwargs folded into §5 as a parity subsection.
- WRITEMASKED/ARRAYMASK execution, VIRTUAL operands, and the size-1 stride-0 / op_axes-OOB / write-broadcast / PARALLEL_SAFE / unit-axis fixes folded into §1 (capability matrix + bug list); masked-write corruption fix added to §10.
- buffer-pool window (1 B–64 MiB), pool-side GC pressure, finalizer suppression folded into §7; TL;DR memory bullet updated.
- canonical NpyIter benchmark, benchmark.yml CI, DocFX benchmark pages, and the honest frontier findings folded into §8/§15.
- 'NPYITER_GAPS_AND_ROADMAP … 6-wave plan' -> 'prioritized roadmap'.
Net: zero 'wave' occurrences; the 16-section topical structure is intact. Same content (minus H1) pushed live to the PR #611 description.
… stat
Per updated direction: the ratio convention is NumPy ÷ NumSharp again (>1.0× = NumSharp faster — bars grow right = faster, the original visual), AND every row now also carries 🕐 %NumPy = (NumSharp ÷ NumPy) × 100 = the share of NumPy's time NumSharp uses. So a win reads two intuitive ways: "12.63× faster" and "🕐 8%" (takes only 8% of the time NumPy would); parity is 🕐 100%; >100% is slower. Huge slowdowns compact to e.g. 🕐 881×NP.
render_dashboard.py:
* r["sp"] = numpy/numsharp (speedup), r["pct"] = numsharp/numpy*100 (share of NumPy time).
* headline + every bar/table show both: HEADLINE 0.74× geomean · 🕐 136%; by-suite e.g. statistics 2.28× 🕐 44%, reduction 1.21× 🕐 83%, creation 0.35× 🕐 283%; FASTEST nansum 12.63× 🕐 8%; SLOWEST np.zeros 0.001× 🕐 881×NP.
* status-mix bands relabelled in %NumPy terms (faster ≤100% / close 100–200% / slower 200–500% / much >500%), a legend line explains the 🕐 stat, pct_str() keeps the column narrow (NN% under 1000%, else NN×NP).
benchmarks-dashboard.md re-seeded with the matching note (heredoc — printf mis-read %NumPy as a format spec); docfx build clean, emoji verified present (U+1F550 ×54).
Supersedes the brief NS/NP experiment (c0a5346). The op-matrix report (merge-results.py) still uses NS/NP "lower is better", and the npyiter sheet / cards use NP/NS without the %NumPy stat — rolling the NP/NS + 🕐 %NumPy convention out to those is the next step, pending confirmation.
Completes the rollout chosen after the dashboard fix: every benchmark surface now uses the SAME convention — speedup = NumPy ÷ NumSharp (>1.0× = NumSharp faster) — and every surface also carries 🕐 %NumPy = (NumSharp ÷ NumPy) × 100 = the share of NumPy's time NumSharp uses (30% = takes only 30% of the time NumPy would; <100% = faster; huge slowdowns compact to e.g. 880×NP). So a win reads two intuitive ways at once: "12.66× faster" and "🕐 8%". Op-matrix report (merge-results.py) — FLIPPED from NS/NP to NP/NS (the one surface that was "lower is better"): * ratio = numpy_ms / numsharp_ms; new pct_numpy field on UnifiedResult (JSON + CSV). * get_status bands inverted around >1 = faster (faster ≥1.0× / close 0.5–1.0× / slower 0.2–0.5× / much <0.2×); classify() credibility gate flips to ratio > 20 (was < 1/20). * Best/Worst now sort DESCENDING (fastest first); legend + tables + summary-by-size gain a 🕐 %NumPy column; ratio_fmt keeps tiny slowdowns readable (0.001× not 0.00×). * Regenerated from the on-disk run archive: Top Best nansum 12.66× 🕐 8%; Top Worst np.zeros 0.001× 🕐 880×NP; searchsorted stays negligible (now ratio>20). Counts unchanged (305/255/169/103/275/126) — same rows, just the direction relabelled. npyiter sheet (npyiter_sheet.py) + cards (npyiter_cards.py) — already NP/NS, ADDED 🕐 %NumPy: * sheet: legend line + per-bar 🕐 %NumPy + headline "1.17× geomean · 🕐 85% of NumPy's time"; re-rendered npyiter_results.md (--render-only, AV block intact). * cards: each bar label now "1.80× · 56%" (ops) / "4.3× · 23%" (dividends); footer explains the %. No emoji in matplotlib (DejaVu lacks the glyph) — the % carries it. Re-rendered. Narrative benchmarks.md + README — already NP/NS, added the 🕐 %NumPy line to the convention block, a %NumPy column to the by-class table, and a caption sentence. DocFX pages (benchmark-matrix.md, benchmark-iterator.md) re-seeded from the regenerated report + sheet; benchmarks.md updated; docfx build clean (0 errors). 
The dashboard (render_dashboard.py / benchmarks-dashboard.md) already carries this convention (49af3af), so the whole benchmark stack — report, dashboard, iterator sheet, cards, narrative, README — is now identical: NumPy ÷ NumSharp speedup + 🕐 %NumPy.
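The convention described above can be sketched in a few lines. This is a minimal illustration, not the code in merge-results.py or render_dashboard.py: the `Cell`, `compare`, and `status` names and the timings are hypothetical, while the formulas and band thresholds are the ones quoted in the message.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    speedup: float    # NumPy / NumSharp; > 1.0 means NumSharp is faster
    pct_numpy: float  # NumSharp / NumPy * 100; share of NumPy's time used


def compare(numpy_ms: float, numsharp_ms: float) -> Cell:
    """Derive both figures from one pair of mean timings."""
    return Cell(
        speedup=numpy_ms / numsharp_ms,
        pct_numpy=numsharp_ms / numpy_ms * 100.0,
    )


def status(speedup: float) -> str:
    """Bands quoted above: >=1.0 faster, 0.5-1.0 close, 0.2-0.5 slower, else much slower."""
    if speedup >= 1.0:
        return "faster"
    if speedup >= 0.5:
        return "close"
    if speedup >= 0.2:
        return "slower"
    return "much slower"


# Hypothetical timings, chosen only to make the arithmetic easy to follow.
cell = compare(numpy_ms=2.0, numsharp_ms=0.5)
print(f"{cell.speedup:.2f}x, {cell.pct_numpy:.0f}% of NumPy's time, {status(cell.speedup)}")
```

The two numbers are reciprocals of each other up to the factor of 100, so either one can be derived from the other.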
The clock sat before the figure with the right-align padding landing between them
("🕐 87%"). Moved it to immediately follow the percentage, no space — "87%🕐" — across
every surface, and likewise the metric name (🕐 %NumPy → %NumPy🕐). The alignment padding
now sits before the number (where it belongs) instead of after the emoji.
* render_dashboard.py / npyiter_sheet.py: bar values "{pct_str}🕐", headline "85%🕐 of
NumPy's time", legend "%NumPy🕐 = …". Dashboard + sheet regenerated.
* merge-results.py: report legend, status-band table, summary-by-size "%NP🕐" column,
Best/Worst note, and per-suite "%NumPy🕐" column headers. Report regenerated.
* benchmarks.md + README: convention line / table column / caption "%NumPy🕐".
* DocFX pages (matrix, iterator, dashboard) re-seeded; dashboard page note "%NumPy🕐".
docfx build clean.
The matplotlib cards are unaffected (they show "1.80× · 56%" without the emoji — DejaVu
has no clock glyph — so there was never a gap to fix there).
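A small sketch of the padding change, assuming a fixed-width right-aligned column. The function names and the width of 6 are hypothetical; only the before/after string shapes come from the message.

```python
def old_cell(pct: float, width: int = 6) -> str:
    # Clock first: the alignment padding lands between the emoji and the number.
    return f"🕐{pct:>{width}.0f}%"


def new_cell(pct: float, width: int = 6) -> str:
    # Pad the number only, then append "%" and the clock with no space.
    return f"{pct:>{width}.0f}%🕐"


print(repr(old_cell(87.0)))
print(repr(new_cell(87.0)))
```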
… form pct_str (dashboard/sheet) and pct_fmt (report) switched to a ×-multiplier form for huge slowdowns (np.zeros etc.), so the %NumPy stat showed "880×NP🕐" / "880×" — breaking the NN%🕐 depiction the column promises. Now they always render a percentage: np.zeros reads "87957%" (report) / "88087%🕐" (dashboard) = takes ~880× as long, stated as a share of NumPy's time like every other cell. The ratio column is untouched — it legitimately uses × (0.001×, 12.65×); only the %NumPy formatters changed. Report + sheet + dashboard regenerated, the three DocFX pages re-seeded, docfx build clean.
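The difference between the two formatter behaviours can be sketched as follows. `pct_str_old` and `pct_str_new` are hypothetical names for the before and after forms; the real pct_str / pct_fmt may differ in rounding and suffix details.

```python
def pct_str_old(pct: float) -> str:
    # Earlier narrow form: NN% under 1000%, otherwise a multiplier such as 881×NP.
    if pct < 1000:
        return f"{pct:.0f}%"
    return f"{pct / 100:.0f}×NP"


def pct_str_new(pct: float) -> str:
    # Always a percentage, however large, so the column keeps one unit.
    return f"{pct:.0f}%"


print(pct_str_old(88087.0), "->", pct_str_new(88087.0))
```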
…g from the report
The dashboard and benchmark-report.md disagreed on the SAME cell: np.nansum(f64,100K) read 12.63× on the dashboard vs 12.65× in the report, np.zeros(i64,10M) read 88087% vs 87957%, quantile/percentile likewise. In total, 161 rows printed a different ratio at 2dp between the two committed surfaces.
Root cause: merge-results.py computes ratio = NumPy/NumSharp and pct_numpy from the FULL-PRECISION means, then stores numpy_ms/numsharp_ms rounded to 4dp. render_dashboard.py ignored the stored ratio/pct_numpy fields and RE-DIVIDED the rounded ms (r["numpy_ms"] / r["numsharp_ms"]), so every row where the 4dp rounding moved a digit drifted from the report. The report is correct (true ratio of the measured means); the dashboard was a rounding artifact of its own recompute.
Fix: the credible loop now consumes r["ratio"] / r["pct_numpy"] straight from the JSON (the same numbers benchmark-report.md prints), falling back to 100/ratio only if pct is absent. Dashboard and report now agree cell-for-cell, and the per-suite/per-dtype geomeans key off the same stored ratios the report's Summary-by-size uses.
Regenerated benchmark-dashboard.md (gitignored) and re-seeded the DocFX dashboard page; header preserved, body refreshed.
Verified: nansum 12.65×/8%, zeros 0.001×/87957%, quantile 9.89×/10% identical on both surfaces; size tiers match Summary-by-size exactly.
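The drift is easy to reproduce in isolation. The sketch below is an assumption-laden illustration: `merge_row`, `recomputed_ratio`, and `canonical` are hypothetical helpers, and the timings are invented so that 4dp rounding visibly moves the ratio. Only the field names (numpy_ms, numsharp_ms, ratio, pct_numpy) and the fallback rule come from the message.

```python
def merge_row(numpy_ms: float, numsharp_ms: float) -> dict:
    """Mimic the merge step: ratio from full-precision means, ms stored at 4dp."""
    ratio = numpy_ms / numsharp_ms
    return {
        "numpy_ms": round(numpy_ms, 4),
        "numsharp_ms": round(numsharp_ms, 4),
        "ratio": ratio,
        "pct_numpy": 100.0 / ratio,
    }


def recomputed_ratio(row: dict) -> float:
    """The buggy path: re-divide the already-rounded milliseconds."""
    return row["numpy_ms"] / row["numsharp_ms"]


def canonical(row: dict) -> tuple:
    """The fixed path: trust the stored ratio; derive pct only if it is absent."""
    ratio = row["ratio"]
    pct = row.get("pct_numpy")
    if pct is None:
        pct = 100.0 / ratio
    return ratio, pct


row = merge_row(numpy_ms=0.12649, numsharp_ms=0.01004)  # hypothetical means
print(f"stored ratio     : {canonical(row)[0]:.2f}")
print(f"re-divided ratio : {recomputed_ratio(row):.2f}")
```

With these inputs the stored ratio is close to 12.60 while the re-divided one is close to 12.65, which is the same kind of second-decimal disagreement the message describes.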
…not run" cells
normalize_op_name dropped measured C# data on the floor whenever the C# benchmark label
and the NumPy suite name differed only cosmetically, so the report showed ⚪ "C# benchmark
not run" for ops that WERE run. Three archive-safe alias passes (applied identically to
both sides, so they only ever merge a true pair):
* empty "()" — a no-arg C# method call "a.flatten()" now meets NumPy's "a.flatten"
* "->" spacing — C# "reshape 2D -> 1D" now meets NumPy's "reshape 2D->1D"
* np.around — IS np.round (NumPy alias); C# benchmarks rounding as np.around, NumPy
emits np.round, so the whole np.round family was ⚪ despite real data
Effect (re-merged from the same archive — no re-run): ⚪ no-data 126 → 116; the np.round
family gains 6 real rows (float32/float64 × 3 sizes), a.flatten +2 (100K/10M), reshape
2D->1D +2. Verified against the archive before editing: +10 joined cells, 0 regressions
(no previously-matched cell lost), 0 new key collisions.
Regenerated benchmark-report.{md,json,csv} + the dashboard (now 840 credible cells,
0.73× geomean) and re-seeded the matrix + dashboard DocFX pages (headers preserved
byte-for-byte). The dashboard stays cell-consistent with the report via the canonical
ratio/pct fix from the prior commit.
NOT fixed here (genuine gaps needing a benchmark re-run, not a name alias): np.prod has
no NumPy full-reduction row at all; isnan/isinf/isfinite/isclose/allclose/array_equal/
maximum/minimum have no C# benchmark; amax/amin/mean/std/var axis variants and np.mean
on uint*/int16 lack a counterpart on one side.
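The three alias passes can be sketched with regular expressions. This is an illustrative guess at the shape of such a normalizer, not the actual body of normalize_op_name in merge-results.py; the regexes are assumptions.

```python
import re


def normalize_op_name(name: str) -> str:
    """Apply the same cosmetic aliases to both the C# label and the NumPy name."""
    name = name.strip()
    name = re.sub(r"\(\)", "", name)                     # "a.flatten()" -> "a.flatten"
    name = re.sub(r"\s*->\s*", "->", name)               # "2D -> 1D"    -> "2D->1D"
    name = re.sub(r"\bnp\.around\b", "np.round", name)   # np.around is an alias of np.round
    return name


pairs = [
    ("a.flatten()", "a.flatten"),
    ("reshape 2D -> 1D", "reshape 2D->1D"),
    ("np.around", "np.round"),
]
for csharp_label, numpy_label in pairs:
    print(normalize_op_name(csharp_label) == normalize_op_name(numpy_label))
```

Because every pass is applied to both sides before the join, a name that needs no change maps to itself and cannot collide with a different op.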
…lex (NumPy parity)
These six complex ufuncs previously threw NotSupportedException from the
EmitUnaryComplexOperation default arm, even though NumPy 2.x has complex
loops for all of them (csinh/ccosh/ctanh/casin/cacos/catan). This wires
them up with full NumPy 2.4.2 parity.
Approach (hybrid BCL + C99 fixups, mirroring the existing abs/log2/exp2
pattern): a bit-exact probe over a finite battery showed System.Numerics.
Complex matches NumPy to a few ULP on the finite interior, but diverges at
86/360 edge components -- it returns (NaN,NaN) for nearly all inf/NaN inputs
instead of the C99 Annex G values, drops the sign of zero on branch cuts,
and mishandles arctan's imaginary-axis cut. So:
- NpyComplexMath.{Sinh,Cosh,Tanh,Asin,Acos,Atan} delegate the finite
interior to the BCL and add the C99 fixups:
* Non-finite inputs: special-value tables ported from NumPy's msun
npy_csinh/ccosh/ctanh, with asin/atan reusing NumPy's own identities
asin(z)=i*conj(casinh(i*conj z)) and atan(z)=i*conj(catanh(i*conj z)).
* Branch-cut/signed-zero fixups (empirically derived against NumPy and
verified on a 64-point signed-zero grid): asin negates Re on x=-0 and
Im on y=-0; acos negates Im on the y=+0 cut; atan sets
Re=copysign(|y|>1?pi/2:0, x) on the imaginary axis and negates Im on y=-0.
* Where this NumPy build's system libm diverges from msun at infinities
(sign-preserving sinh(-inf+i*inf).re, cosh's even-function +inf*sin(y)
imaginary part, tanh's sign(y) zero, and the genuinely-unspecified
zero signs), the helpers match the observed NumPy 2.4.2 output.
- DirectILKernelGenerator: register CachedMethods.Complex{Sinh,Cosh,Tanh,
Asin,Acos,Atan} (pointing at NpyComplexMath, not Complex.* directly) and
add the six cases to EmitUnaryComplexOperation.
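The two identities quoted in the list above can be checked numerically with Python's standard cmath module. This sketch only confirms the identities on finite points away from the branch cuts; it does not reproduce NpyComplexMath or its special-value tables, and the helper names are hypothetical.

```python
import cmath


def asin_via_asinh(z: complex) -> complex:
    # asin(z) = i * conj(asinh(i * conj(z)))
    return 1j * cmath.asinh(1j * z.conjugate()).conjugate()


def atan_via_atanh(z: complex) -> complex:
    # atan(z) = i * conj(atanh(i * conj(z)))
    return 1j * cmath.atanh(1j * z.conjugate()).conjugate()


for z in (0.3 + 0.4j, -1.5 + 0.25j, 2.0 - 3.0j):
    print(z, abs(asin_via_asinh(z) - cmath.asin(z)), abs(atan_via_atanh(z) - cmath.atan(z)))
```

The appeal of routing asin and atan through asinh and atanh is that one set of special-value handling can serve both the circular and the hyperbolic functions.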
Verification: a bit-exact harness over a 117-point battery (finite + signed
zeros + branch cuts + inf/NaN) plus a 64-point grid, diffed against NumPy
2.4.2, gives 1402/1404 components matching (1249 bit-exact, 153 within
<=3 ULP). The only 2 residuals are arctan's finite interior (1e-10 tiny
input ~8e-8 rel; 100+100j at 3 ULP) -- .NET's Atan kernel is less accurate
than NumPy's log1p-based one; an accepted, documented divergence.
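For readers unfamiliar with ULP-based comparison, here is a minimal sketch of how a per-component check like the one described could work for float64. It is an assumption about the general technique, not the project's harness: the function names are hypothetical, and treating a sign-of-zero mismatch as a failure is a choice made here because the message stresses signed zeros.

```python
import math
import struct


def _ordered_bits(x: float) -> int:
    """Map a float64 to an integer whose ordering matches the float ordering."""
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    return bits if bits >= 0 else -(bits & 0x7FFF_FFFF_FFFF_FFFF)


def ulp_distance(a: float, b: float) -> int:
    """Number of representable float64 values between a and b (both non-NaN)."""
    return abs(_ordered_bits(a) - _ordered_bits(b))


def component_ok(expected: float, actual: float, max_ulps: int = 4) -> bool:
    if math.isnan(expected) or math.isnan(actual):
        return math.isnan(expected) and math.isnan(actual)
    if expected == 0.0 and actual == 0.0:
        # Signed zeros must agree: +0.0 and -0.0 select different branch-cut sides.
        return math.copysign(1.0, expected) == math.copysign(1.0, actual)
    return ulp_distance(expected, actual) <= max_ulps


def complex_ok(expected: complex, actual: complex, max_ulps: int = 4) -> bool:
    return (component_ok(expected.real, actual.real, max_ulps)
            and component_ok(expected.imag, actual.imag, max_ulps))


print(ulp_distance(1.0, 1.0 + 2.0 ** -52))    # adjacent doubles
print(component_ok(0.0, -0.0))                 # zero sign mismatch is rejected
```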
Tests:
- NewDtypesUnaryTests: 9 NumPy-verified cases covering interior, branch
cuts, signed zeros, and C99 special values.
- Fuzz/MisalignedRegistry: the stale "complex kernel throws" excuse is
corrected to Half-only; complex sinh/cosh/tanh/arcsin/arccos are now held
to a tight 4-ULP gate (a real regression fails) instead of the blanket
complex-unary excuse; arctan stays under the documented blanket for its
accepted BCL-interior divergence.
All 609 Fuzz + NewDtypes tests pass (net10.0); the 26x5 complex corpus
cases for the five tightly-gated ops are all within 4 ULP.
Complete changelog of the nditer branch: everything in this PR since #612 merged.
312 commits · 615 files · +217,949 / −16,402 (vs master, after #612)

TL;DR
- NpyIter: full port of NumPy 2.4.2's nditer (~12.5K lines): all iteration orders (C/F/A/K), all indexing modes, buffered casting, buffered-reduce double-loop, masking, memory-overlap protection (COPY_IF_OVERLAP), windowed buffering (DELAY_BUFALLOC), unlimited operands and dimensions. 566+ byte-for-byte NumPy parity scenarios.
- NpyExpr DSL + three-tier custom-op API: write your own ufuncs as raw IL (Tier 3A), element-wise scalar/SIMD (Tier 3B), or composable expression trees with operator overloads (Tier 3C). Exposed as the public np.evaluate, which runs fused expressions 3.2–6.1× faster than NumPy (which can't fuse), with per-node NumPy result_type typing and fused reductions.
- out= / where= / dtype= ufunc kwargs across the elementwise API: the kwargs on every NumPy ufunc, spanning the binary, unary-math, comparison, predicate, and bitwise families with exact NumPy broadcast/cast/error-text semantics. Plus np.bitwise_and/or/xor and np.positive at the np.* surface.
- np.* APIs: pad (11 modes), tile, median/percentile/quantile (all 13 interpolation methods) + their nan* variants, average, ptp, take/put/place, extract/compress, diagonal/trace, argwhere/flatnonzero, unravel_index/ravel_multi_index/indices, delete/insert/append, diff/ediff1d, asfortranarray/ascontiguousarray, np.multithreading.
- Shape understands F-contiguity, OrderResolver resolves NumPy order modes, ~68 layout bugs fixed across 9 fix groups.
- IDisposable on NDArray, plus a tcache-style buffer pool (1 B – 64 MiB window).
- MultiIterator and the Regen-generated NDIterator cast templates are gone (−3,870 LOC); NDIterator survives as a thin lazy wrapper over NpyIter.

1. NpyIter: full NumPy nditer port
From-scratch C# port of NumPy 2.4.2's iterator machinery under src/NumSharp.Core/Backends/Iterators/ (~12,557 lines), promoted to public API with NDArray overloads.
- MULTI_INDEX, C_INDEX, F_INDEX, RANGE (parallel chunking), GotoIndex / GotoMultiIndex / GotoIterIndex
- DELAY_BUFALLOC, buffered-reduce double-loop (incl. bufferSize < coreSize)
- op_axes with -1 reduction axes, REDUCE_OK, IsFirstVisit, REUSE_REDUCE_LOOPS slab accumulation
- COPY_IF_OVERLAP via a port of NumPy's mem_overlap solver (NpyMemOverlap.cs): overlapping in/out operands no longer silently corrupt
- WRITEMASKED + ARRAYMASK executed: the buffered window flush writes back only mask-nonzero elements; VIRTUAL operands (null op slots) construct with NumPy 2.x semantics
- … NPY_MAXARGS=64) and unlimited dimensions (NumPy caps at NPY_MAXDIMS=64) via dynamic allocation
- Copy, GetIterView, RemoveAxis, RemoveMultiIndex, ResetBasePointers, IterRange, DebugPrint, fixed/axis stride queries, GetValue<T> / SetValue<T>, …
- NpyIterCasting.CanCast matches NumPy's safe / same_kind lattice exactly

Validated by a dedicated battletest harness: 566 scenarios replayed against NumPy 2.4.2 byte-for-byte, a permanent variation-probe harness, and tools/iterator_parity. Dozens of parity bugs found and fixed against NumPy ground truth: negative-stride flipping, NO_BROADCAST enforcement, F_INDEX coalescing, buffered-reduction stride inversion, K-order on broadcast inputs, EXLOOP iternext, buffered-cast Advance, ranged Reset() desync, buffer free-list corruption, the size-1 stride-0 invariant (a (1,4) view with nonzero stride corrupted RemoveMultiIndex), op_axes out-of-bounds reads on stretched size-1 axes, write-broadcast validation, PARALLEL_SAFE wiring, and unit-axis absorption. Each was reproduced against NumPy first, then fixed by adopting NumPy's constructor structure.

Execution at NumPy speed
NpyIter isn't just correct: it is now the production execution engine. DefaultEngine's binary, unary, and comparison ops (same- and mixed-dtype) route through the NpyIter Tier-3B shell, and it measures at-or-faster than NumPy on every probed aspect (Release, i9-13900K, NumPy 2.4.2):
[Benchmark table not recoverable; surviving row labels: a*b+c at 10M, (a-b)/(a+b) at 10M]

Key mechanisms: an O(1) trivial-loop bypass that skips iterator construction for contiguous operands, identity-broadcast fast paths, AVX2 hardware-gather (vgatherdps) strided SIMD in the Tier-3B shell (NumPy uses scalar loops for strided binary/reduce, so its floors are beatable), and strided-reduction kernels (2-D strided sqrt 1.36× faster than NumPy, strided sum 2.2× faster).

2. NpyExpr DSL + three-tier custom-op API
User-extensible kernel layer on top of NpyIter, the public answer to "how do I write my own ufunc":
- ExecuteRawIL: emit raw IL against the NumPy ufunc signature void(void** dataptrs, long* strides, long count, void* aux).
- ExecuteElementWise: provide scalar + vector IL; the shell supplies a 4×-unrolled SIMD loop, remainder vector, scalar tail, and strided fallback.
- ExecuteExpression: compose NpyExpr trees with C# operators ((a - b) / (a + b)), 50+ node types (arithmetic, trig, exp/log, rounding, predicates, comparisons, Min/Max/Clamp/Where), plus Call() to splice any delegate/MethodInfo into a fused kernel. Compiled once, cached by structural key, ~5 ns dispatch.

This is what powers the fusion wins (one pass, no temporaries), and it is exposed publicly as np.evaluate(expr[, operands][, out]):
- result_type typing: every node resolves to its NumPy 2.4.2 dtype, so mixed trees wrap correctly: (i4*i4)+f8 wraps the multiply in int32 (→ 1410065408) before promoting. Strong-strong NEP50 (incl. int/float tier crossing), weak python-scalar literals (i4+2 → i4, i4/2 → f8) with NumPy's exact OverflowError, and special resolvers (true_divide, arctan2, negative-integer-literal power → ValueError, bool add=OR / multiply=AND).
- NpyExpr.Sum/Prod/Min/Max/Mean compile a one-pass inner loop; sum(a*b) reads a and b once and never materializes the product. NumPy reduction dtypes (int→i64, uint→u64, mean→f64).
- out= joins via the ufunc rules (same_kind validation, reference identity, overlap-safe aliasing through COPY_IF_OVERLAP); an EXTERNAL_LOOP guard prevents the silent count==1 slow path.
- a*b+c 3.2×, (a-b)/(a+b) 6.1×, sum(a*b) 3.6×, sum f32 2.9×, i4*2+f8 3.5× faster. Permanent gate in benchmark/poc/evaluate_bench.{cs,py}.

3. Legacy iterator stack retired
- MultiIterator deleted; all callers migrated to NpyIter.Copy / multi-operand execution.
- NDIterator.template.cs + 16 generated NDIterator.Cast.* partials deleted (−3,870 LOC in one commit).
- NDIterator survives as a thin, lazy compatibility wrapper over NpyIter (294 lines): no more materialized copies; same MoveNext() / HasNext() / Reset() surface.
- ~400 per-dtype NPTypeCode switch sites replaced by a generic NpFunc dispatch utility.

4. C/F/A/K memory-layout support
- Shape now tracks F-contiguity with NumPy-convention contiguity computation; new OrderResolver resolves C/F/A/K for every API with an order parameter.
- copy, array, asarray, asanyarray, *_like, astype, flatten, ravel, reshape, eye, concatenate, cumsum, argsort, tile, plus post-hoc F-contig preservation across the IL-kernel dispatchers.
- np.asfortranarray, np.ascontiguousarray.
- np.where selects C/F output layout the way NumPy does; ravel('F') of an F-contig source returns a view (was a 3,000× copy).
- … fortran_order, Decimal scalar path, fancy-write isolation, …).

5. New & completed np.* APIs
New functions (35):
- np.evaluate (fused expressions, see §2), np.bitwise_and, np.bitwise_or, np.bitwise_xor, np.positive
- np.pad (all 11 NumPy modes + callable), np.tile, np.delete, np.insert, np.append
- np.take, np.put, np.place, np.extract, np.compress, np.argwhere, np.flatnonzero, np.diagonal, np.trace, np.unravel_index, np.ravel_multi_index, np.indices
- np.median, np.percentile, np.quantile (all 13 interpolation methods, tuple axis, out=, keepdims, QuickSelect engine), np.average (weights, returned, tuple-axis; fused kernel 1.3–1.6× faster than NumPy at 1M), np.ptp, np.nanmedian, np.nanpercentile, np.nanquantile
- np.diff, np.ediff1d
- np.asfortranarray, np.ascontiguousarray
- np.multithreading(enabled, max_threads): opt-in threaded kernels

Rebuilt to full NumPy 2.x parity:
- np.clip: min= / max= keyword aliases, default-None bounds, NumPy 2.x dtype promotion, out= validation.
- np.unique: 5 missing kwargs, sort+mask algorithm (up to 43× faster), NaN partitioning, n > Array.MaxLength fallback.
- np.searchsorted: side=, sorter=, multidim validation; IL binary-search kernels 5–25× faster (beats NumPy on 20/22 benchmarks).
- np.copyto: casting=, where= masked copies at NumPy speed (was 7–72× slower).
- np.asarray: copy=, like=, device=, dtype-as-string.
- np.concatenate: full parity + C/F fast paths.
- np.all / np.any: tuple-axis, out=, where=.
- np.expand_dims: tuple axis.
- np.repeat: axis= parameter.
- np.power: integer-power semantics, negative-exponent ValueError, crash fix.
- … max/min, Complex quantile, IsInf implemented (was a stub).

out= / where= / dtype= ufunc kwargs (NumPy parity):
The kwargs present on every NumPy ufunc now span the elementwise core: binary (add, subtract, multiply, divide, true_divide, mod, power, floor_divide), unary-math (sqrt, exp, log, sin, cos, tan, abs/absolute, negative, square), the six comparisons, predicates (isnan/isfinite/isinf), bitwise, invert, arctan2. Each is one NumPy-shaped overload, with every rule pinned against NumPy 2.4.2:
- out joins the broadcast but never stretches (mismatched/stretchable out raise NumPy's exact texts, trailing space included); loop dtype resolved from inputs (NEP50), out only needs a same_kind cast; the provided instance is returned (reference identity).
- where must be exactly bool (mask cast under 'safe'); it broadcasts over operands and participates in output shape; mask-false slots keep prior out contents.
- out aliasing an input is well-defined via COPY_IF_OVERLAP: add(x[:-1], x[:-1], out=x[1:]) matches NumPy exactly.
- dtype= computes in the loop dtype (subtract(300, 5, dtype=i16) = 295), with the bool add→OR / multiply→AND remap keyed off the final loop dtype so add(True, True, dtype=i32) = 2.

6. Linear algebra
- … Vector256 FMA micro-kernel reads packed panels, so transposed/sliced inputs cost nothing extra. Eliminates the ~100× fallback penalty (np.dot(x.T, grad): 240 ms → ~1 ms) and the boxing GetValue fallback chain.
- matmul gufunc semantics: batched stacking, 1-D promotion/squeeze rules, validated by a dedicated differential matrix (816 cases).
- np.multithreading: opt-in parallel 1-D dot: 1M float dot 172 → 60 µs, ~2× faster than NumPy's default build. Off by default; bitwise-identical summation order when off.

7. Performance (beyond NpyIter and linalg)
sum(int16, axis=1)1058 ms → 2.7 ms (389×, now faster than NumPy); int32/uint32 2.3–4.6×; also fixes a uint32 axis-sum corruption bugmean(axis)var/std21×;count_nonzero20×np.nonzeronp.wheresqrtreached parity via gather→tile→SIMD bufferingVirtualAlloc+ demand-zero faults); ≥1 MiB buckets capped at 2 buffers; pool-side GC memory pressure tracking live state;GC.SuppressFinalizeon free;using/ARC adopted acrossconcatenate,allclose,convolve,tile,eye, masking, shuffle, …float→int32(cvtt), strided/reversed/gathered variants;astypecross-dtype routed through NpyIter KEEPORDER copynp.splitfamily8. Official benchmark suite + honest methodology
- … run_benchmark.py entry point: BenchmarkDotNet Full rigor (50 iters, InProcessEmit) × all suites × {1K, 100K, 10M} vs NumPy 2.x: 1,813 C# measurements, 1,111 matched op×dtype×size comparisons, structural op-name join, tracked markdown report + per-suite artifacts + history snapshots. Coverage spans all 15 dtypes (SByte/Half/Complex suites added).
- … dotnet run file-based apps compile the project reference in Debug (optimizations off) even with Configuration=Release properties: hand loops measured ~2× slow while DynamicMethod IL was immune. Benchmarks now assert IsJITOptimizerDisabled == false and refuse to mislead; the rule is documented.
- … run_benchmark.py, plus a post-release CI workflow (.github/workflows/benchmark.yml) that auto-commits report cards to master.
- … np.sum over a broadcast_to view 54× slower than NumPy (a coordinate-walking general path at 7.4 ns/elem), scalar np.any 14.5× slower (scalar scan where count_nonzero on the same array runs SIMD), a BUFFERED+REDUCE ForEach P0 crash (pinned/skipped repro; only the buffered-reduce driver handles that config), and iterator ALLOCATE zeroing outputs where NumPy leaves them empty (+2.33 ms/4M). A win too: hand-rolled 8-band parallel iteration 4.7×. All tracked as the next optimization frontier.

9. Differential fuzzing vs NumPy (new infrastructure)
.github/workflows/fuzz-soak.yml).docs/FUZZ_FINDINGS.md; every fixed class re-armed as a permanent regression gate. The error-parity tier alone surfaced 1 critical crash; the op tiers surfaced 17+ distinct bug classes that are now fixed (see §10).10. Correctness — NumPy-parity bug fixes
Semantics (behavioral changes, may affect callers):
- floor_divide / mod: NumPy-exact floored semantics and divide-by-zero results.
- <= / >= now return False for NaN (IEEE/NumPy). min / max propagate NaN.
- np.negative(uint) wraps modulo 2ⁿ instead of throwing; bool - bool and -bool / np.negative(bool) now throw (NumPy behavior).
- np.power: negative integer exponent raises ValueError; exact integer-power semantics.
- … ConvertValue); complex→bool no longer drops the imaginary part; float→int SIMD uses truncation (cvtt) like NumPy.
- … [1] meets a lower-rank operand; quantile-family dtype & bool handling; Complex np.where.

Crashes & corruption:
- … COPY_IF_OVERLAP, §1).
- WRITEMASKED write landed garbage in exactly the slots NumPy preserves (silent corruption of the elements the caller asked to protect); it now writes back only mask-nonzero elements.
- np.pad: 5 correctness/crash bugs (battle-tested against NumPy 2.4.2); linear_ramp preserved Complex dtype.
- UnmanagedStorage / ArraySlice: CopyTo direction + bounds bugs; CloneData partial-buffer bug; scalar offset lost on Clone; buffered NpyIter.Clone shared buffers; DTypeSize reported Marshal.SizeOf instead of in-memory stride; NPTypeCode.Char.SizeOf returned 1 (real: 2); stale Decimal priority.
- TensorEngine now propagates through Cast / Transpose / copy / reshape / ravel (custom engines were silently dropped).
- take with out= enforces NumPy's safe-cast direction; put / place non-contiguous writeback fixes; argsort on non-C-contiguous input.
- ForEach / ExecuteGeneric / ExecuteReducing read past the end without EXTERNAL_LOOP.

11. Memory management: ARC + IDisposable
- NDArray now implements IDisposable backed by atomic reference counting on the unmanaged block: CAS-driven TryAddRef / Release, idempotent Dispose, finalizer safety net, immortal non-owning wraps. Views keep parents alive; parent disposal never invalidates live views.
- … dot at 100K: 446 collections → 0).

12. Char8 primitive
New 1-byte character type (NumSharp.Char8), the NumPy S1 / Python bytes(1) equivalent, with conversions, operators, span helpers, and 100% Python bytes API parity validated against a Python oracle. Vendored .NET ASCII/Latin-1 reference sources under src/dotnet/ document the upstream implementations it mirrors.

13. Examples: trainable MNIST MLP
New examples/NeuralNetwork.NumSharp: a 2-layer MLP with a naive implementation and a fused one (single-NpyIter bias+ReLU fusion, fused softmax-cross-entropy backward, Adam optimizer). Originally needed a "copy transposed views before np.dot" workaround (31× training speedup at the time); the stride-native GEMM (§6) made the workaround unnecessary. Converges to >99% test accuracy in the bundled demo.

14. Kernel architecture & hygiene
- ILKernelGenerator split into DirectILKernelGenerator (legacy whole-array kernels, 51 partials under Kernels/Direct/) and ILKernelGenerator (NpyIter-driven per-chunk kernels, the target model matching NumPy's PyUFuncGenericFunction); migration path documented per kernel family.
- Vector128/256/512 and Math / MathF reflection centralized in VectorMethodCache / ScalarMethodCache; IL-emitted typed-field copier replaces the UnmanagedStorage.Alias switch.
- … [Obsolete(error: true)] pending deletion; dead axis-reduction SIMD paths removed.

15. Documentation
- docs/website-src/docs/NDIter.md (7-technique quick reference, decision tree, memory model, gotchas) + ndarray.md.
- benchmarks.md (head-to-head evidence companion to the IL-generation page), benchmark-iterator.md, benchmark-matrix.md, driven by the auto-committed report artifacts.
- PERF_LEDGER.md (every optimization with before/after), ROADMAP.md, NPYITER_GAPS_AND_ROADMAP.md (gap analysis vs NumPy 2.4.2 + prioritized roadmap), NPYITER_PARITY_ANALYSIS.md, NPYITER_PERF_HANDOVER.md, MIGRATE_NPYITER.md, IL-kernel playbook + rulebook, fuzz findings/coverage/next-plan.
- … docs/plans/audit_v2/) with every Tier-1 finding either fixed or reproduced as an [OpenBugs] test.

16. Tests & CI
- … np.evaluate (per-node wraparound, dtype matrices, weak scalars + overflow, fused-vs-unfused, out= identity/cast/aliasing, fused reductions), out= / where= / dtype= parity suites (broadcast/cast/error-text pins), WRITEMASKED/VIRTUAL parity; NpyIter battletests (566 scenarios), order-support sections 41–51, ARC lifecycle, clone regression, np.pad/average/median/percentile/ptp/diff battle tests, IL-kernel battle tests, behavioral audit harness.
- … build-and-release.yml, nightly fuzz-soak.yml, new post-release benchmark.yml (auto-commits NumPy-comparison report cards to master).
- … np.sort remains unimplemented ([OpenBugs]); the frontier benches' broadcast-reduce (54×), scalar np.any (14.5×) losses and the BUFFERED+REDUCE ForEach P0 crash (pinned/skipped repro) are documented as the next optimization frontier; small-N (~1K) dispatch overhead remains the headline focus (docs/ROADMAP.md). Every open issue found by the audits/fuzzers/benches is checked in as a failing-by-design [OpenBugs] test or pinned repro rather than ignored.

Breaking changes
[Table flattened in extraction; rows kept in order, cell boundaries only partly recoverable]
- bool - bool, -bool, np.negative(bool) now throw (migration: ^ / cast to int first)
- <= / >= returns False (migration: np.isnan explicitly)
- floor_divide / mod divide-by-zero & floored results
- np.negative(uint) wraps instead of throwing
- np.power(int, negative int) raises ValueError
- np.clip / quantile-family dtype promotion
- … [1] … .copy() to write
- MultiIterator removed; NDIterator is now an NpyIter facade (migration: NpyIter / NpyIter.Copy)
- MaxOperands=8 and 64-dim limits removed
- np.copyto unwriteable-destination error type corrected

Everything above was validated against NumPy 2.4.2 ground truth: by 37k differential corpus cases, 566 iterator parity scenarios, and per-feature battle tests run on actual NumPy output.
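Three of the semantic changes listed under Breaking changes can be illustrated with plain Python, whose integer `//` and `%` are floored like NumPy's. This is a sketch of the target semantics only, not NumSharp code; the helper names are hypothetical and the 32-bit width is an assumption.

```python
import math


def floor_divmod(a: int, b: int) -> tuple:
    """Floored quotient and remainder: the remainder takes the sign of the divisor."""
    return a // b, a % b


def trunc_divmod(a: int, b: int) -> tuple:
    """Truncating quotient and remainder (C-style), shown for contrast."""
    q = math.trunc(a / b)
    return q, a - q * b


def negate_uint(x: int, bits: int = 32) -> int:
    """Unsigned negation wraps modulo 2**bits instead of raising."""
    return (-x) % (1 << bits)


nan = float("nan")
print(floor_divmod(-7, 2), trunc_divmod(-7, 2))
print(nan <= 1.0, nan >= 1.0)   # every ordered comparison with NaN is False
print(negate_uint(5))
```

Code that relied on truncating division or on NaN comparing as true would need the explicit checks the migration notes mention, such as testing for NaN first.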