[Major Rewrite] NumPy nditer port, NpyExpr DSL with 3-tier custom-op API, C/F/A/K memory layout support, stride-native matmul by Nucs · Pull Request #611 · SciSharp/NumSharp

Nucs · 2026-04-22T09:19:43Z

Complete changelog of the nditer branch — everything in this PR since #612 merged.

312 commits · 615 files · +217,949 / −16,402 (vs master, after #612)

TL;DR

NpyIter — full port of NumPy 2.4.2's nditer (~12.5K lines): all iteration orders (C/F/A/K), all indexing modes, buffered casting, buffered-reduce double-loop, masking, memory-overlap protection (COPY_IF_OVERLAP), windowed buffering (DELAY_BUFALLOC), unlimited operands and dimensions. 566+ byte-for-byte NumPy parity scenarios.
NpyExpr DSL + three-tier custom-op API — write your own ufuncs: raw IL (Tier 3A), element-wise scalar/SIMD (Tier 3B), or composable expression trees with operator overloads (Tier 3C). Exposed as the public np.evaluate, which runs fused expressions 3.2–6.1× faster than NumPy (which can't fuse), with per-node NumPy result_type typing and fused reductions.
out= / where= / dtype= ufunc kwargs across the elementwise API — the kwargs on every NumPy ufunc, spanning the binary, unary-math, comparison, predicate, and bitwise families with exact NumPy broadcast/cast/error-text semantics. Plus np.bitwise_and/or/xor and np.positive at the np.* surface.
NumPy-parity benchmark: geomean 1.00× at 10M elements across ~409 ops (166 faster / 171 close / 36 slower) — measured by a new official BenchmarkDotNet-vs-NumPy suite committed with the report.
30 new np.* APIs — pad (11 modes), tile, median/percentile/quantile (all 13 interpolation methods) + their nan* variants, average, ptp, take/put/place, extract/compress, diagonal/trace, argwhere/flatnonzero, unravel_index/ravel_multi_index/indices, delete/insert/append, diff/ediff1d, asfortranarray/ascontiguousarray, np.multithreading.
C/F/A/K order support wired through the whole API — Shape understands F-contiguity, OrderResolver resolves NumPy order modes, ~68 layout bugs fixed across 9 fix groups.
Stride-native matmul/dot — BLIS-style GEBP GEMM absorbs arbitrary strides for all dtypes (kills a ~100× penalty on transposed inputs); fused 1-D dot is 3.5–9× faster with zero GC; opt-in multithreaded dot ~2× faster than NumPy's default on 1M vectors.
Deterministic memory management — atomic reference counting + IDisposable on NDArray, plus a tcache-style buffer pool (1 B – 64 MiB window).
Differential fuzzing infrastructure — 37,445 bit-exact NumPy-comparison cases across 24 corpus tiers, a seeded random fuzzer with shrinker, a CI FuzzMatrix gate, and a nightly soak workflow.
Legacy iterator stack deleted — MultiIterator and the Regen-generated NDIterator cast templates are gone (−3,870 LOC); NDIterator survives as a thin lazy wrapper over NpyIter.
Test suite: 9,709 passed / 0 failed on net10.0 (+2,500 new test methods), plus the 37,445-case fuzz corpus replayed by the FuzzMatrix gate.

1. NpyIter — full NumPy `nditer` port

From-scratch C# port of NumPy 2.4.2's iterator machinery under src/NumSharp.Core/Backends/Iterators/ (~12,557 lines), promoted to public API with NDArray overloads.

Capability	Detail
Iteration orders	C, F, A, K — incl. NEGPERM negative-stride handling, axis reordering + coalescing to full 1-D collapse
Indexing modes	`MULTI_INDEX`, `C_INDEX`, `F_INDEX`, `RANGE` (parallel chunking), `GotoIndex` / `GotoMultiIndex` / `GotoIterIndex`
Buffering	Buffered casting with all 5 casting rules, windowed buffered iteration, `DELAY_BUFALLOC`, buffered-reduce double-loop (incl. `bufferSize < coreSize`)
Reductions	`op_axes` with `-1` reduction axes, `REDUCE_OK`, `IsFirstVisit`, `REUSE_REDUCE_LOOPS` slab accumulation
Overlap safety	`COPY_IF_OVERLAP` via a port of NumPy's `mem_overlap` solver (`NpyMemOverlap.cs`) — overlapping in/out operands no longer silently corrupt
Masking	`WRITEMASKED` + `ARRAYMASK` executed — the buffered window flush writes back only mask-nonzero elements; `VIRTUAL` operands (null op slots) construct with NumPy 2.x semantics
Operands / dims	Unlimited operands (NumPy caps at `NPY_MAXARGS=64`) and unlimited dimensions (NumPy caps at `NPY_MAXDIMS=64`) via dynamic allocation
APIs	`Copy`, `GetIterView`, `RemoveAxis`, `RemoveMultiIndex`, `ResetBasePointers`, `IterRange`, `DebugPrint`, fixed/axis stride queries, `GetValue<T>`/`SetValue<T>`, …
Casting parity	`NpyIterCasting.CanCast` matches NumPy's `safe`/`same_kind` lattice exactly

Validated by a dedicated battletest harness: 566 scenarios replayed against NumPy 2.4.2 byte-for-byte, a permanent variation-probe harness, and tools/iterator_parity. Dozens of parity bugs found and fixed against NumPy ground truth: negative-stride flipping, NO_BROADCAST enforcement, F_INDEX coalescing, buffered-reduction stride inversion, K-order on broadcast inputs, EXLOOP iternext, buffered-cast Advance, ranged Reset() desync, buffer free-list corruption, the size-1 stride-0 invariant (a (1,4) view with nonzero stride corrupted RemoveMultiIndex), op_axes out-of-bounds reads on stretched size-1 axes, write-broadcast validation, PARALLEL_SAFE wiring, and unit-axis absorption — each reproduced against NumPy first, then fixed by adopting NumPy's constructor structure.

Execution at NumPy speed

NpyIter isn't just correct — it is now the production execution engine: DefaultEngine's binary, unary, and comparison ops (same- and mixed-dtype) route through the NpyIter Tier-3B shell, and it measures at-or-faster than NumPy on every probed aspect (Release, i9-13900K, NumPy 2.4.2):

Aspect (float32)	NumSharp	NumPy	Ratio
contig sqrt 10M	2.98 ms	3.24 ms	0.92×
contig add 10M	3.91 ms	4.09 ms	0.96×
strided add 1M	319 µs	416 µs	0.77×
strided sqrt 1M	206 µs	374 µs	0.55×
strided sum 1M	109 µs	205 µs	0.53×
fused `a*b+c` 10M	4.77 ms	13.38 ms	0.36×
fused `(a-b)/(a+b)` 10M	4.12 ms	22.33 ms	0.18×

Key mechanisms: an O(1) trivial-loop bypass that skips iterator construction for contiguous operands, identity-broadcast fast paths, AVX2 hardware-gather (vgatherdps) strided SIMD in the Tier-3B shell (NumPy uses scalar loops for strided binary/reduce — its floors are beatable), and strided-reduction kernels (2-D strided sqrt 1.36× faster than NumPy, strided sum 2.2× faster).

2. NpyExpr DSL + three-tier custom-op API

User-extensible kernel layer on top of NpyIter — the public answer to "how do I write my own ufunc":

Tier 3A — ExecuteRawIL: emit raw IL against the NumPy ufunc signature void(void** dataptrs, long* strides, long count, void* aux).
Tier 3B — ExecuteElementWise: provide scalar + vector IL; the shell supplies a 4×-unrolled SIMD loop, remainder vector, scalar tail, and strided fallback.
Tier 3C — ExecuteExpression: compose NpyExpr trees with C# operators ((a - b) / (a + b)), 50+ node types (arithmetic, trig, exp/log, rounding, predicates, comparisons, Min/Max/Clamp/Where), plus Call() to splice any delegate/MethodInfo into a fused kernel. Compiled once, cached by structural key, ~5 ns dispatch.
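One way to picture the Tier 3B loop shell is as a partition of the element count into the three contiguous stages named above. The sketch below is plain Python arithmetic, not NumSharp code: `partition_loop` and the 8-lane vector width are illustrative assumptions, and it assumes the remainder stage consumes whole vectors one at a time before the scalar tail takes over.

```python
def partition_loop(count: int, lanes: int, unroll: int = 4):
    """Split `count` elements into (unrolled, remainder_vectors, scalar_tail) sizes."""
    block = lanes * unroll                        # elements per unrolled iteration
    unrolled = (count // block) * block           # handled by the 4x-unrolled loop
    remainder = ((count - unrolled) // lanes) * lanes  # whole vectors left over
    tail = count - unrolled - remainder           # fewer than `lanes` elements
    return unrolled, remainder, tail


print(partition_loop(1000, 8))  # (992, 8, 0)
print(partition_loop(110, 8))   # (96, 8, 6)
print(partition_loop(5, 8))     # (0, 0, 5)
```

The three counts always sum to `count`, so a kernel author only has to supply the scalar and vector bodies; the shell decides which stage each element falls into.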

This is what powers the fusion wins — one pass, no temporaries — and it is exposed publicly as np.evaluate(expr[, operands][, out]):

Per-node NumPy result_type typing — every node resolves to its NumPy 2.4.2 dtype, so mixed trees wrap correctly: (i4*i4)+f8 wraps the multiply in int32 (→ 1410065408) before promoting. Strong-strong NEP50 (incl. int/float tier crossing), weak python-scalar literals (i4+2 → i4, i4/2 → f8) with NumPy's exact OverflowError, and special resolvers (true_divide, arctan2, negative-integer-literal power → ValueError, bool add=OR/multiply=AND).
Fused reductions — NpyExpr.Sum/Prod/Min/Max/Mean compile a one-pass inner loop; sum(a*b) reads a and b once and never materializes the product. NumPy reduction dtypes (int→i64, uint→u64, mean→f64).
out= joins via the ufunc rules (same_kind validation, reference identity, overlap-safe aliasing through COPY_IF_OVERLAP); an EXTERNAL_LOOP guard prevents the silent count==1 slow path.
Measured (Release, 4M f64, NumPy 2.4.2): a*b+c 3.2×, (a-b)/(a+b) 6.1×, sum(a*b) 3.6×, sum f32 2.9×, i4*2+f8 3.5× faster. Permanent gate in benchmark/poc/evaluate_bench.{cs,py}.
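The per-node typing rule can be checked by hand. The following is a minimal sketch in plain Python (no NumPy or NumSharp calls; `wrap_int32` and both function names are hypothetical). It assumes int32 operands of 100000, which is one input that reproduces the 1410065408 figure quoted for `(i4*i4)+f8`.

```python
def wrap_int32(value: int) -> int:
    """Reduce an unbounded Python int to a two's-complement int32."""
    value &= 0xFFFFFFFF
    return value - 0x1_0000_0000 if value >= 0x8000_0000 else value


def per_node_typed(a: int, b: int, c: float) -> float:
    # The multiply node is typed int32, so it wraps before anything else happens.
    product = wrap_int32(a * b)
    # Only the add node is float64; the wrapped product is promoted here.
    return float(product) + c


def root_typed(a: int, b: int, c: float) -> float:
    # What a tree typed only at its root would compute instead.
    return float(a) * float(b) + c


print(per_node_typed(100_000, 100_000, 0.5))  # 1410065408.5
print(root_typed(100_000, 100_000, 0.5))      # 10000000000.5
```

The two results differ because 10^10 does not fit in 32 bits; matching the unfused result requires the wraparound to happen inside the fused kernel.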

3. Legacy iterator stack retired

MultiIterator deleted; all callers migrated to NpyIter.Copy / multi-operand execution.
The Regen template NDIterator.template.cs + 16 generated NDIterator.Cast.* partials deleted (−3,870 LOC in one commit).
NDIterator survives as a thin, lazy compatibility wrapper over NpyIter (294 lines) — no more materialized copies; same MoveNext()/HasNext()/Reset() surface.
~400 per-dtype NPTypeCode switch sites replaced by a generic NpFunc dispatch utility.

4. C/F/A/K memory-layout support

Shape now tracks F-contiguity with NumPy-convention contiguity computation; new OrderResolver resolves C/F/A/K for every API with an order parameter.
Order support wired through: copy, array, asarray, asanyarray, *_like, astype, flatten, ravel, reshape, eye, concatenate, cumsum, argsort, tile, plus post-hoc F-contig preservation across the IL-kernel dispatchers.
New: np.asfortranarray, np.ascontiguousarray.
np.where selects C/F output layout the way NumPy does; ravel('F') of an F-contig source returns a view (previously a copy, 3,000× slower).
~68 layout bugs fixed across 9 TDD fix groups, backed by ~3,300 lines of new order tests (Sections 41–51: reductions/keepdims, matmul/dot/outer/convolve, broadcasting-from-F, manipulation, file I/O fortran_order, Decimal scalar path, fancy-write isolation, …).

5. New & completed `np.*` APIs

New functions (35):

Area	APIs
Fused / ufunc	`np.evaluate` (fused expressions — see §2), `np.bitwise_and`, `np.bitwise_or`, `np.bitwise_xor`, `np.positive`
Manipulation	`np.pad` (all 11 NumPy modes + callable), `np.tile`, `np.delete`, `np.insert`, `np.append`
Indexing/selection	`np.take`, `np.put`, `np.place`, `np.extract`, `np.compress`, `np.argwhere`, `np.flatnonzero`, `np.diagonal`, `np.trace`, `np.unravel_index`, `np.ravel_multi_index`, `np.indices`
Statistics	`np.median`, `np.percentile`, `np.quantile` (all 13 interpolation methods, tuple axis, `out=`, `keepdims`, QuickSelect engine), `np.average` (`weights`, `returned`, tuple-axis; fused kernel 1.3–1.6× faster than NumPy at 1M), `np.ptp`, `np.nanmedian`, `np.nanpercentile`, `np.nanquantile`
Math	`np.diff`, `np.ediff1d`
Creation	`np.asfortranarray`, `np.ascontiguousarray`
Runtime	`np.multithreading(enabled, max_threads)` — opt-in threaded kernels

Rebuilt to full NumPy 2.x parity:

np.clip — min=/max= keyword aliases, default-None bounds, NumPy 2.x dtype promotion, out= validation.
np.unique — 5 missing kwargs, sort+mask algorithm (up to 43× faster), NaN partitioning, n > Array.MaxLength fallback.
np.searchsorted — side=, sorter=, multidim validation; IL binary-search kernels 5–25× faster (beats NumPy on 20/22 benchmarks).
np.copyto — casting=, where= masked copies at NumPy speed (was 7–72× slower).
np.asarray — copy=, like=, device=, dtype-as-string. np.concatenate — full parity + C/F fast paths. np.all/np.any — tuple-axis, out=, where=. np.expand_dims — tuple axis. np.repeat — axis= parameter. np.power — integer-power semantics, negative-exponent ValueError, crash fix.
Engine completeness: bool/char max/min, Complex quantile, IsInf implemented (was a stub).
Full 15-dtype coverage pushed through the hot paths: the SByte/Half/Complex dtypes introduced in #612 ("[new dtypes, NEP50] fully supported Half/Complex/SByte, np.* alias overhaul, NumPy 2.x type alias alignment") now work across every kernel family this PR touches (reductions, indexing, trace, casts, quantile, …).

out= / where= / dtype= ufunc kwargs (NumPy parity):

The kwargs present on every NumPy ufunc now span the elementwise core — binary (add, subtract, multiply, divide, true_divide, mod, power, floor_divide), unary-math (sqrt, exp, log, sin, cos, tan, abs/absolute, negative, square), the six comparisons, predicates (isnan/isfinite/isinf), bitwise, invert, arctan2 — each as one NumPy-shaped overload, every rule pinned against NumPy 2.4.2:

out joins the broadcast but never stretches (mismatched/stretchable out raise NumPy's exact texts, trailing space included); loop dtype resolved from inputs (NEP50), out only needs a same_kind cast; the provided instance is returned (reference identity).
where must be exactly bool (mask cast under 'safe'); it broadcasts over operands and participates in output shape; mask-false slots keep prior out contents.
out aliasing an input is well-defined via COPY_IF_OVERLAP — add(x[:-1], x[:-1], out=x[1:]) matches NumPy exactly.
dtype= computes in the loop dtype (subtract(300, 5, dtype=i16) = 295), with the bool add→OR / multiply→AND remap keyed off the final loop dtype so add(True, True, dtype=i32) = 2.
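The aliasing rule for `add(x[:-1], x[:-1], out=x[1:])` is easiest to see by comparing a loop that reads live data with one that snapshots overlapping inputs first. This is a simplified sketch in plain Python over a list: the function names are hypothetical, and the 1-D interval test stands in for the general overlap solver the PR ports.

```python
def add_naive(x, src, dst, n):
    """x[dst+i] = x[src+i] + x[src+i], reading whatever is currently in x."""
    for i in range(n):
        x[dst + i] = x[src + i] + x[src + i]


def add_copy_if_overlap(x, src, dst, n):
    """Same operation, but the input window is copied when it overlaps the output."""
    overlaps = src < dst + n and dst < src + n
    window = x[src:src + n] if overlaps else x   # list slicing makes a copy
    base = 0 if overlaps else src
    for i in range(n):
        x[dst + i] = window[base + i] + window[base + i]


a = [1, 2, 3, 4]
add_naive(a, src=0, dst=1, n=3)
print(a)  # [1, 2, 4, 8]  (each write feeds the next read)

b = [1, 2, 3, 4]
add_copy_if_overlap(b, src=0, dst=1, n=3)
print(b)  # [1, 2, 4, 6]  (inputs are the original 1, 2, 3)
```

The second result is the one a caller expects from ufunc semantics: the output is computed from the operand values as they were when the call started.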

6. Linear algebra

Stride-native GEMM for all 12 numeric dtypes — BLIS-style GEBP with stride-aware packers; the 8×16 Vector256 FMA micro-kernel reads packed panels, so transposed/sliced inputs cost nothing extra. Eliminates the ~100× fallback penalty (np.dot(x.T, grad): 240 ms → ~1 ms) and the boxing GetValue fallback chain.
Full matmul gufunc semantics — batched stacking, 1-D promotion/squeeze rules, validated by a dedicated differential matrix (816 cases).
Fused single-pass 1-D dot — 3.5–9× faster, zero GC (was up to 446 gen-0 collections per call at 100K).
np.multithreading — opt-in parallel 1-D dot: 1M float dot 172 → 60 µs, ~2× faster than NumPy's default build. Off by default; bitwise-identical summation order when off.

7. Performance (beyond NpyIter and linalg)

Op	Improvement
Axis reductions, narrow ints	Widening SIMD (int16→int32 accum etc.): `sum(int16, axis=1)` 1058 ms → 2.7 ms (389×, now faster than NumPy); int32/uint32 2.3–4.6×; also fixes a uint32 axis-sum corruption bug
`mean` (axis)	217× (Phase-0 bug surgery); `var`/`std` 21×; `count_nonzero` 20×
`np.nonzero`	IL SIMD kernel closes an 8–241× gap to NumPy
`np.where`	IL kernels for scalar-broadcast & non-contiguous (1.2–2× NumPy on broadcast conditions)
Strided 1-D unary	Fused strided-SIMD kernel: 0.55 ns/elem flat — beats NumPy at every size; strided `sqrt` reached parity via gather→tile→SIMD buffering
Strided flat reductions	Incremental-advance path: strided sum 8.3× faster (from 11.8× behind NumPy to 1.4× behind)
Comparisons	PDEP-based packed mask→bool store; broadcast/strided compares routed via NpyIter
Axis-0 reductions	Column-tiled accumulation (breaks the output RAW dependency); 8× pairwise unrolled flat reductions
Allocation	tcache-style size-bucketed buffer pool with a 1 B – 64 MiB window (covers both the small-N ufunc result and 4M+ outputs that previously paid a fresh `VirtualAlloc` + demand-zero faults); ≥1 MiB buckets capped at 2 buffers; pool-side GC memory pressure tracking live state; `GC.SuppressFinalize` on free; `using`/ARC adopted across `concatenate`, `allclose`, `convolve`, `tile`, `eye`, masking, shuffle, …
Casts	NumPy-faithful SIMD `float→int32` (`cvtt`), strided/reversed/gathered variants; `astype` cross-dtype routed through NpyIter KEEPORDER copy
`np.split` family	O(1) sub-shape derivation, direct views — 1.5–4× faster than NumPy
Where/copyto/searchsorted/unique	see §5

8. Official benchmark suite + honest methodology

New cross-platform run_benchmark.py entry point: BenchmarkDotNet Full rigor (50 iters, InProcessEmit) × all suites × {1K, 100K, 10M} vs NumPy 2.x — 1,813 C# measurements, 1,111 matched op×dtype×size comparisons, structural op-name join, tracked markdown report + per-suite artifacts + history snapshots. Coverage spans all 15 dtypes (SByte/Half/Complex suites added).
Headline: geomean NumSharp÷NumPy = 1.00× at N=10M (166 ops faster / 171 close / 36 slower) — parity across the whole op surface at memory-bound sizes; ~1.9× at 1K where per-call dispatch dominates (tracked as the next focus).
Found and neutralized a benchmark-invalidating tooling bug: dotnet run file-based apps compile the project reference in Debug (optimizations off) even with Configuration=Release properties, so hand-written loops measured ~2× slower while DynamicMethod IL was immune. Benchmarks now assert IsJITOptimizerDisabled == false and refuse to mislead; the rule is documented.
Canonical NpyIter benchmark — a section-addressable harness covering 33 op families × {scalar/1K/100K/1M/10M}, integrated into run_benchmark.py, plus a post-release CI workflow (.github/workflows/benchmark.yml) that auto-commits report cards to master.
Honest frontier findings — adversarial probes record losses, not just wins: np.sum over a broadcast_to view 54× slower than NumPy (a coordinate-walking general path at 7.4 ns/elem), scalar np.any 14.5× slower (scalar scan where count_nonzero on the same array runs SIMD), a BUFFERED+REDUCE ForEach P0 crash (pinned/skipped repro — only the buffered-reduce driver handles that config), and iterator ALLOCATE zeroing outputs where NumPy leaves them empty (+2.33 ms/4M). A win too: hand-rolled 8-band parallel iteration 4.7×. All tracked as the next optimization frontier.

9. Differential fuzzing vs NumPy (new infrastructure)

37,445 bit-exact corpus cases across 24 JSONL tiers generated from real NumPy 2.4.2 outputs: casts (full 15×15 matrix), binary arith (NEP50), div/mod/power, comparisons, unary (incl. float16 inputs + all narrow ints), reductions, NaN-aware reductions, cumulative, statistics, logic/extrema, bitwise+shift, where/place, manipulation, matmul, modf multi-output, sorting/searching, parameter sweeps, SIMD-tail boundaries (900 cases around vector-width edges), operand aliasing, and error-parity (exception-for-exception).
Seeded random fuzzer with an element-wise shrinker for minimal repros; metamorphic invariant tier (11 algebraic properties).
CI integration: FuzzMatrix gate wired into the build workflow + a new nightly fuzz-soak workflow (.github/workflows/fuzz-soak.yml).
Findings inventoried in docs/FUZZ_FINDINGS.md; every fixed class re-armed as a permanent regression gate. The error-parity tier alone surfaced 1 critical crash; the op tiers surfaced 17+ distinct bug classes that are now fixed (see §10).

10. Correctness — NumPy-parity bug fixes

Semantics (behavioral changes, may affect callers):

floor_divide / mod: NumPy-exact floored semantics and divide-by-zero results.
Comparisons: <= / >= now return False for NaN (IEEE/NumPy).
Flat min/max propagate NaN.
np.negative(uint) wraps modulo 2ⁿ instead of throwing; bool - bool and -bool/np.negative(bool) now throw (NumPy behavior).
Transcendental ufuncs use NEP50 width-based float promotion.
np.power: negative integer exponent raises ValueError; exact integer-power semantics.
Cast semantics aligned with NumPy across all dtype pairs (IL kernels + ConvertValue); complex→bool no longer drops the imaginary part; float→int SIMD uses truncation (cvtt) like NumPy.
Broadcasting keeps rank when a 1-D [1] meets a lower-rank operand; quantile-family dtype & bool handling; Complex np.where.
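Two of the semantic changes above are pure integer arithmetic and can be illustrated without either library. The sketch below (plain Python; all three function names are hypothetical) contrasts truncating division, which is what C# integer `/` and `%` do, with the floored form, and shows unsigned negation as wraparound modulo 2ⁿ. Divide-by-zero handling is left out.

```python
def trunc_divmod(a: int, b: int):
    """Quotient rounds toward zero; remainder takes the sign of the dividend."""
    q = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        q = -q
    return q, a - q * b


def floor_divmod(a: int, b: int):
    """Quotient rounds toward negative infinity; remainder takes the sign of the divisor."""
    q = a // b
    return q, a - q * b


def negate_unsigned(value: int, bits: int) -> int:
    """Negation of an unsigned integer wraps modulo 2**bits."""
    return (-value) % (1 << bits)


print(trunc_divmod(-7, 2))     # (-3, -1)
print(floor_divmod(-7, 2))     # (-4, 1)
print(negate_unsigned(1, 8))   # 255
print(negate_unsigned(0, 32))  # 0
```

The two division forms agree whenever the operands share a sign, so only mixed-sign callers see the behavioral change.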

Crashes & corruption:

Overlapping-operand corruption eliminated iterator-wide (COPY_IF_OVERLAP, §1).
Masked iteration: a buffered WRITEMASKED write landed garbage in exactly the slots NumPy preserves (silent corruption of the elements the caller asked to protect) — now writes back only mask-nonzero elements.
uint32 axis-sum produced wrong values past 8 distinct columns (widening-SIMD rewrite).
np.pad: 5 correctness/crash bugs (battle-tested against NumPy 2.4.2); linear_ramp now preserves Complex dtype.
UnmanagedStorage/ArraySlice: CopyTo direction + bounds bugs; CloneData partial-buffer bug; scalar offset lost on Clone; buffered NpyIter.Clone shared buffers; DTypeSize reported Marshal.SizeOf instead of in-memory stride; NPTypeCode.Char.SizeOf returned 1 (real: 2); stale Decimal priority.
TensorEngine now propagates through Cast/Transpose/copy/reshape/ravel (custom engines were silently dropped).
take with out= enforces NumPy's safe-cast direction; put/place non-contiguous writeback fixes; argsort on non-C-contiguous input.
NpyIter ForEach/ExecuteGeneric/ExecuteReducing read past the end without EXTERNAL_LOOP.

11. Memory management — ARC + `IDisposable`

NDArray now implements IDisposable backed by atomic reference counting on the unmanaged block: CAS-driven TryAddRef/Release, idempotent Dispose, finalizer safety net, immortal non-owning wraps. Views keep parents alive; parent disposal never invalidates live views.
Hammered by a 15-case lifecycle suite incl. 32-thread × 1,000-op concurrency races and 50-way parallel dispose — zero corruption.
Deterministic release means hot loops no longer wait on the finalizer queue; combined with the buffer pool this removes most steady-state GC pressure (dot at 100K: 446 collections → 0).
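The ownership rules described above (views keep the block alive, Dispose is idempotent, a released block cannot be revived) can be sketched in a few lines. This is an illustrative Python model, not the NumSharp implementation: `RefCountedBlock` and `Handle` are hypothetical names, and a `threading.Lock` stands in for the CAS loop.

```python
import threading


class RefCountedBlock:
    """Unmanaged-buffer stand-in with a reference count."""

    def __init__(self, size: int):
        self._lock = threading.Lock()  # stands in for compare-and-swap
        self._count = 1                # the creating owner holds one reference
        self.data = bytearray(size)
        self.freed = False

    def try_add_ref(self) -> bool:
        with self._lock:
            if self._count == 0:
                return False           # already released; never resurrect
            self._count += 1
            return True

    def release(self) -> None:
        with self._lock:
            self._count -= 1
            if self._count == 0:
                self.data = None       # deterministic free
                self.freed = True


class Handle:
    """An array or view over a block; dispose() may be called repeatedly."""

    def __init__(self, block: RefCountedBlock):
        self._block = block
        self._disposed = False

    def view(self) -> "Handle":
        if self._disposed or not self._block.try_add_ref():
            raise ValueError("cannot create a view of a disposed array")
        return Handle(self._block)

    def dispose(self) -> None:
        if not self._disposed:
            self._disposed = True
            self._block.release()


block = RefCountedBlock(16)
parent = Handle(block)
child = parent.view()
parent.dispose()
parent.dispose()        # idempotent: no double release
print(block.freed)      # False (the view still holds a reference)
child.dispose()
print(block.freed)      # True
```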

12. `Char8` primitive

New 1-byte character type (NumSharp.Char8) — the NumPy S1/Python bytes(1) equivalent — with conversions, operators, span helpers, and 100% Python bytes API parity validated against a Python oracle. Vendored .NET ASCII/Latin-1 reference sources under src/dotnet/ document the upstream implementations it mirrors.

13. Examples — trainable MNIST MLP

New examples/NeuralNetwork.NumSharp: a 2-layer MLP with a naive implementation and a fused one (single-NpyIter bias+ReLU fusion, fused softmax-cross-entropy backward, Adam optimizer). Originally needed a "copy transposed views before np.dot" workaround (31× training speedup at the time); the stride-native GEMM (§6) made the workaround unnecessary. Converges to >99% test accuracy in the bundled demo.

14. Kernel architecture & hygiene

ILKernelGenerator split into DirectILKernelGenerator (legacy whole-array kernels, 51 partials under Kernels/Direct/) and ILKernelGenerator (NpyIter-driven per-chunk kernels — the target model matching NumPy's PyUFuncGenericFunction); migration path documented per kernel family.
All Vector128/256/512 and Math/MathF reflection centralized in VectorMethodCache / ScalarMethodCache; IL-emitted typed-field copier replaces the UnmanagedStorage.Alias switch.
24 dead kernel methods poisoned with [Obsolete(error: true)] pending deletion; dead axis-reduction SIMD paths removed.

15. Documentation

NpyIter/NDIter book: docs/website-src/docs/NDIter.md (7-technique quick reference, decision tree, memory model, gotchas) + ndarray.md.
DocFX website — Benchmarks vs NumPy: benchmarks.md (head-to-head evidence companion to the IL-generation page), benchmark-iterator.md, benchmark-matrix.md, driven by the auto-committed report artifacts.
Engineering ledgers: PERF_LEDGER.md (every optimization with before/after), ROADMAP.md, NPYITER_GAPS_AND_ROADMAP.md (gap analysis vs NumPy 2.4.2 + prioritized roadmap), NPYITER_PARITY_ANALYSIS.md, NPYITER_PERF_HANDOVER.md, MIGRATE_NPYITER.md, IL-kernel playbook + rulebook, fuzz findings/coverage/next-plan.
Branch quality audits V1+V2 (8 chapter reviews under docs/plans/audit_v2/) with every Tier-1 finding either fixed or reproduced as an [OpenBugs] test.

16. Tests & CI

+2,500 test methods; suite now 9,709 passed / 0 failed on net10.0 (also green on net8.0). Zero regressions maintained commit-by-commit.
New suites: np.evaluate (per-node wraparound, dtype matrices, weak scalars + overflow, fused-vs-unfused, out= identity/cast/aliasing, fused reductions), out=/where=/dtype= parity suites (broadcast/cast/error-text pins), WRITEMASKED/VIRTUAL parity; NpyIter battletests (566 scenarios), order-support sections 41–51, ARC lifecycle, clone regression, np.pad/average/median/percentile/ptp/diff battle tests, IL-kernel battle tests, behavioral audit harness.
CI: fuzz gate in build-and-release.yml, nightly fuzz-soak.yml, new post-release benchmark.yml (auto-commits NumPy-comparison report cards to master).
Known gaps stay visible: np.sort remains unimplemented ([OpenBugs]); the frontier benches' broadcast-reduce (54×) and scalar np.any (14.5×) losses, along with the BUFFERED+REDUCE ForEach P0 crash (pinned/skipped repro), are documented as the next optimization frontier; small-N (~1K) dispatch overhead remains the headline focus (docs/ROADMAP.md). Every open issue found by the audits/fuzzers/benches is checked in as a failing-by-design [OpenBugs] test or pinned repro rather than ignored.

Breaking changes

Change	Impact	Migration
`bool - bool`, `-bool`, `np.negative(bool)` now throw	Matches NumPy	Use `^` / cast to int first
NaN `<=` / `>=` returns `False`	Matches IEEE & NumPy	Use `np.isnan` explicitly
`floor_divide`/`mod` divide-by-zero & floored results	Matches NumPy	—
`np.negative(uint)` wraps instead of throwing	Matches NumPy	—
`np.power(int, negative int)` raises `ValueError`	Matches NumPy	Use float exponents
Cast edge cases (overflow/NaN/complex→bool/float→int truncation)	Matches NumPy	—
Transcendental ufuncs: NEP50 width-based promotion	Return dtype may change	—
`np.clip`/quantile-family dtype promotion	Return dtype may change	—
Broadcast views are read-only; broadcasting keeps rank for 1-D `[1]`	Matches NumPy	`.copy()` to write
`MultiIterator` removed; `NDIterator` is now an NpyIter facade	Internal API	Use `NpyIter` / `NpyIter.Copy`
NpyIter: `MaxOperands=8` and 64-dim limits removed	None (loosening)	—
`np.copyto` unwriteable-destination error type corrected	Exception type change	—

Everything above was validated against NumPy 2.4.2 ground truth — by 37k differential corpus cases, 566 iterator parity scenarios, and per-feature battle tests run on actual NumPy output.

Replaces the lazy-but-standalone ValueOffsetIncrementor path with one that constructs an NpyIter state and drives MoveNext / HasNext / Reset directly off that state. NDIterator is now an honest thin wrapper over NpyIter, the same traversal machinery used by all the Phase 2 production call sites, rather than reimplementing the coord-walk logic with legacy incrementors.

How it works
------------

- ctor calls NpyIterRef.New(arr, NPY_CORDER) to build the state, then transfers ownership of the NpyIterState* pointer out of the ref struct (see NpyIterRef.ReleaseState / FreeState below). The class holds that pointer for its lifetime and frees it in Dispose (or in the finalizer as a safety net).
- MoveNext reads `*(TOut*)state->DataPtrs[0]` then calls `state->Advance()`. IterIndex tracks position, IterEnd bounds the non-AutoReset case, and `state->Reset()` restarts from IterStart on AutoReset wraparound and on explicit Reset.
- Cross-dtype wraps the same read with a Converts.FindConverter<TSrc, TOut> lookup: one switch at construction picks the typed helper, so the per-element hot path is still just one read + one converter delegate call. MoveNextReference throws when casting is in play, matching the legacy contract.
- NPY_CORDER is explicit so iterating a transposed view yields the logical row-major order the old NDIterator provided. Without it, KEEPORDER would give memory-efficient order (which e.g. `b.T.AsIterator<int>()` would surface as `0 1 2 ... 11` instead of the expected `0 4 8 1 5 9 2 6 10 3 7 11`).

NpyIter additions
-----------------

- NpyIterRef.ReleaseState(): hand the owned NpyIterState* to a caller who needs it across a non-ref-struct boundary (e.g. a class field). Marks the ref struct as non-owning so its Dispose is a no-op.
- NpyIterRef.FreeState(NpyIterState*): static tear-down mirror of Dispose's cleanup path. Frees buffers (when BUFFER set), calls FreeDimArrays, and NativeMemory.Free's the state pointer. The long-lived owner calls this from its own Dispose/finalizer.

Bug fixes along the way
-----------------------

NpyIter initialization previously computed base pointers as `(byte*)arr.Address + (shape.offset * arr.dtypesize)` in two places (initial broadcast setup on line 340 and ResetBasePointers on line 1972). `arr.dtypesize` goes through `Marshal.SizeOf`, and `Marshal.SizeOf(bool) == 4` because bool is marshaled to win32 BOOL, but the in-memory `bool[]` storage is 1 byte per element. For strided bool arrays this produced a base pointer 4× too far into the buffer. Switched both sites to `arr.GetTypeCode.SizeOf()` which returns the actual in-memory size (1 for bool). Surfaced by `Boolean_Strided_Odd` once NDIterator started routing through NpyIter; previously only LATENT because the legacy NDIterator path computed offsets in element units, not bytes, and sidestepped the NpyIter init.

Test impact: 6,748 / 6,748 passing on net8.0 and net10.0 (CI filter: TestCategory!=OpenBugs&TestCategory!=HighMemory). Smoke tests of same-type contig / cross-type / strided / transposed / broadcast / AutoReset / Reset / foreach all produce the expected element sequences.
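The difference between the two traversal orders for a transposed view can be reproduced with stride arithmetic alone. The sketch below is plain Python, not NpyIter: it assumes `b` is a 3×4 row-major array holding 0..11 (which is what the expected sequence above implies), the function names are hypothetical, and the memory-order walk is a simplification that just sorts axes by descending stride.

```python
from itertools import product


def c_order_offsets(shape, strides):
    """Logical row-major walk: the last axis varies fastest, whatever the strides."""
    for index in product(*(range(n) for n in shape)):
        yield sum(i * s for i, s in zip(index, strides))


def memory_order_offsets(shape, strides):
    """Memory-friendly walk: the axis with the smallest stride varies fastest."""
    axes = sorted(range(len(shape)), key=lambda ax: -abs(strides[ax]))
    for index in product(*(range(shape[ax]) for ax in axes)):
        yield sum(i * strides[ax] for i, ax in zip(index, axes))


data = list(range(12))               # b: shape (3, 4), element strides (4, 1)
t_shape, t_strides = (4, 3), (1, 4)  # b.T swaps both shape and strides

print([data[o] for o in c_order_offsets(t_shape, t_strides)])
# [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]
print([data[o] for o in memory_order_offsets(t_shape, t_strides)])
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```

Forcing C order costs locality on transposed inputs, but it keeps the element sequence identical to what callers of the legacy iterator observed.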

`UnmanagedStorage.DTypeSize` (exposed via `NDArray.dtypesize`) was delegating to `Marshal.SizeOf(_dtype)`. That matches for every numeric dtype, but for bool, `Marshal.SizeOf(typeof(bool)) == 4` because bool is marshaled to win32 BOOL (32-bit). The in-memory layout of `bool[]` is 1 byte per element, so every caller computing a byte offset as `ptr + index * arr.dtypesize` was reading/writing 4× too far into the buffer for bool arrays.

Switches to `_typecode.SizeOf()` which correctly returns 1 for bool and matches `Marshal.SizeOf` for every other type. 21 existing call sites (matmul, binary/unary/comparison/reduction ops, nan reductions, std/var, argmax, random shuffle, boolean mask gather, etc.) now get the right value without any downstream change.

The bug had been latent until the Phase 2 iterator migration started routing more code paths through NpyIter.Copy and the new NDIterator wrapper; it surfaced most visibly as `sliced_bool[mask]` returning the wrong elements when the source was non-contiguous. With the root fix:

    var full = np.array(new[] { T,F,T,F,T,F,T,F,T });
    var sliced = full["::2"];             // [T,T,T,T,T] non-contig
    var result = sliced[new_bool_mask];   // now correct per-element

np.save.cs already special-cases bool before falling through to `Marshal.SizeOf`, so serialization was unaffected. Remaining Marshal.SizeOf references in the codebase are either in comments that explain this exact issue, or in the `InfoOf<T>.Size` fallback that only runs for types outside the 12 supported dtypes (e.g. Complex).

Tests: 6,748 / 6,748 passing on net8.0 and net10.0 with the CI filter (TestCategory!=OpenBugs&TestCategory!=HighMemory).
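The effect of the wrong element size is easy to reproduce on a byte buffer. The sketch below is plain Python; `read_strided` and the two constants are hypothetical names, and it uses a `[1::2]` slice (offset 1) rather than the `::2` slice from the message so that the misplaced base address is visible on the very first element.

```python
MARSHAL_SIZE_OF_BOOL = 4    # marshaled size reported for bool, per the message above
IN_MEMORY_SIZE_OF_BOOL = 1  # actual bytes per element in bool[] storage


def read_strided(buffer: bytes, offset: int, stride: int, count: int, item_size: int):
    """Read `count` bools starting at element `offset`, stepping `stride` elements."""
    base = offset * item_size   # byte address of the first element
    step = stride * item_size   # byte distance between consecutive elements
    return [bool(buffer[base + i * step]) for i in range(count)]


full = bytes([1, 0, 1, 0, 1, 0, 1, 0, 1])  # T,F,T,F,T,F,T,F,T

# full[1::2] is offset 1, stride 2, four elements, all False.
print(read_strided(full, 1, 2, 4, IN_MEMORY_SIZE_OF_BOOL))  # [False, False, False, False]

# With the marshaled size, the first read lands on element 4 instead of element 1.
print(read_strided(full, 1, 2, 1, MARSHAL_SIZE_OF_BOOL))    # [True]
```

Asking the second call for more than one element would index past the end of the nine-byte buffer, which in unmanaged code is an out-of-bounds read rather than an exception.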

- Delete 4 NPYITER analysis docs (audit, buffered reduce, deep audit, numpy differences): information consolidated into codebase
- Delete 3 NDIterator.Cast files (Complex, Half, SByte): casting now handled by unified NDIterator<T> backed by NpyIter state
- Update NDIterator.cs: minor adjustments from NpyIter backing refactor
- Update ILKernelGenerator.Scan.cs: scan kernel changes
- Update Default.MatMul.Strided.cs: add INumber<T> constraint support for generic matmul dispatch preparation
- Update Default.ClipNDArray.cs: initial NpFunc dispatch refactoring replacing 6 switch blocks (~84 cases) with generic dispatch methods
- Update np.full_like.cs: minor fix
- Update RELEASE_0.51.0-prerelease.md release notes

…neric dispatch

NpFunc is a reflection-cached generic dispatch utility that bridges runtime NPTypeCode values to compile-time generic type parameters. The hot path (cache hit) runs at ~32ns via a Delegate[] array indexed by NPTypeCode ordinal. The cold path uses MakeGenericMethod + CreateDelegate and is cached after the first call per (method, typeCode) pair.

Core NpFunc changes:
- Dynamic table sizing: Delegate[] sized from max NPTypeCode enum value (was hardcoded [32], broke for NPTypeCode.Complex=128)
- Overloads for 0-6 args × void/returning × 1-3 NPTypeCodes + 1-2 Types
- SmartMatchTypes for multi-type dispatch (1→broadcast, N=N→positional, M<N→type-identity matching)
- Per-arity ConcurrentDictionary caches for multi-type dispatch

Files refactored (12 files, ~400 cases eliminated):

Previous session (5 files, ~196 cases):
- Default.ClipNDArray.cs: 6 dispatch methods for contiguous/general clip
- Default.Clip.cs: 3 dispatch methods for scalar clip with ChangeType
- Default.NonZero.cs: 3 dispatch methods for nonzero/count operations
- Default.BooleanMask.cs: 1 dispatch method for masked copy
- Default.Shift.cs: 2 dispatch methods for array/scalar shift

This session (7 files, ~202 cases):
- NDIteratorExtensions.cs: 5 overloads → 5 dispatch methods creating NDIterator<T> from NDArray/UnmanagedStorage/IArraySlice
- Default.Reduction.CumAdd.cs: axis dispatch via CumSumAxisKernel<T>, elementwise via IAdditionOperators<T,T,T> with default(T) init
- Default.Reduction.CumMul.cs: axis dispatch via CumProdAxisKernel<T>, elementwise via IMultiplyOperators + T.MultiplicativeIdentity init
- np.where.cs: iterator fallback + IL kernel dispatch via pointer cast
- np.random.randint.cs: int/long fill via INumberBase<T>.CreateTruncating
- NDArray.NOT.cs: IEquatable<T>.Equals(default) unifies bool NOT and numeric ==0 comparison into single generic method
- Default.LogicalReduction.cs: direct dispatch to ExecuteLogicalAxis<T>

Net: -1243 lines removed across 12 files, replacing repetitive per-type switch cases with single generic dispatch methods.
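The table-sizing fix described above can be illustrated with a small Python sketch. It is not the NumSharp implementation: `TypeCode`, `Dispatcher`, and `make_filler` are hypothetical names, and of the enum values only `COMPLEX = 128` is taken from the commit text. The point is that a table indexed by type code has to be sized from the largest code, and that the cold path runs once per code before the hot path becomes a plain index.

```python
from enum import IntEnum


class TypeCode(IntEnum):
    # Hypothetical subset of type codes; only COMPLEX = 128 comes from the text.
    BOOLEAN = 3
    INT32 = 9
    DOUBLE = 14
    COMPLEX = 128


# Size the table from the largest enum value instead of a fixed 32 slots,
# so indexing with COMPLEX (128) stays in range.
TABLE_SIZE = max(TypeCode) + 1


class Dispatcher:
    def __init__(self, factory):
        self._factory = factory            # cold path: builds a specialised callable
        self._table = [None] * TABLE_SIZE  # hot path: indexed by type code
        self.cold_calls = 0

    def invoke(self, code, *args):
        fn = self._table[code]
        if fn is None:                     # first call for this type code
            self.cold_calls += 1
            fn = self._factory(code)
            self._table[code] = fn
        return fn(*args)


def make_filler(code):
    """Return a callable that builds a zero-filled list for the given type code."""
    zero = {
        TypeCode.BOOLEAN: False,
        TypeCode.INT32: 0,
        TypeCode.DOUBLE: 0.0,
        TypeCode.COMPLEX: 0j,
    }[code]
    return lambda n: [zero] * n


dispatcher = Dispatcher(make_filler)
print(dispatcher.invoke(TypeCode.COMPLEX, 2))
```

With a fixed 32-slot list the `COMPLEX` lookup would raise `IndexError`, which is the Python analogue of the `IndexOutOfRangeException` mentioned in the text.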

Complex does not implement IComparable<T>, so NpFunc.Invoke into ClipArrayBoundsDispatch/ClipArrayMinDispatch/ClipArrayMaxDispatch crashed with ArgumentException on MakeGenericMethod.

Fix: add NPTypeCode.Complex pre-checks in ClipNDArrayContiguous, ClipNDArrayGeneral, and ClipCore that route to dedicated Complex clip methods using lexicographic comparison (real first, then imag). NaN handling preserves the NaN-containing element as-is (not replaced with NaN+NaN*i), matching NumPy np.maximum/np.minimum behavior where "NaN wins" but the original value is returned.

Half NaN propagation: ILKernelGenerator.ClipArrayBoundsScalar, ClipArrayMinScalar, ClipArrayMaxScalar fell through to the generic CompareTo path for Half, which treats NaN as less-than-all (IEEE totalOrder) instead of propagating it. Added Half-specific scalar methods that check Half.IsNaN explicitly before comparison.

Also fix NpFunc table sizing: Delegate[] was hardcoded to [32] but NPTypeCode.Complex=128 caused IndexOutOfRangeException. Now computed dynamically from max NPTypeCode enum value at static init.

Fixes 14 test failures (12 Complex clip/maximum/minimum constraint violations, 2 Half NaN propagation in maximum).
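As a rough illustration of the comparison rule above, the following Python sketch orders complex values lexicographically (real part first, then imaginary) and returns a NaN-containing operand unchanged. The function names are hypothetical, and checking the first operand for NaN before the second is an assumption of this sketch rather than something the commit text states.

```python
import math


def has_nan(z):
    """True if either component of the complex value is NaN."""
    return math.isnan(z.real) or math.isnan(z.imag)


def complex_maximum(a, b):
    # A NaN-containing operand is returned as-is, not rebuilt as nan+nanj.
    if has_nan(a):
        return a
    if has_nan(b):
        return b
    # Lexicographic order: compare real parts, then imaginary parts.
    return a if (a.real, a.imag) >= (b.real, b.imag) else b


def complex_minimum(a, b):
    if has_nan(a):
        return a
    if has_nan(b):
        return b
    return a if (a.real, a.imag) <= (b.real, b.imag) else b


def complex_clip(z, lo, hi):
    """Clip z into [lo, hi] under the lexicographic order."""
    return complex_minimum(complex_maximum(z, lo), hi)


print(complex_clip(5 + 0j, 0j, 3 + 3j))
```

Because tuples compare element by element, `(a.real, a.imag) >= (b.real, b.imag)` is exactly the "real first, then imag" rule once NaN operands have been handled.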

…ast paths Replaces the broken `PowerInteger` fast-path (which crashed on sliced/broadcast arrays via `Unsafe.Address`) with a stride-aware integer power emitted by the existing IL kernel infrastructure. Adds NumPy's "Integers to negative integer powers are not allowed." ValueError, fast paths for scalar exponents {0,1,2, 0.5,-1.0}, and switches f32 to single-precision `MathF.Pow` (no f64 round-trip). Audit-v2 tickets resolved: - T1.3a — np.power(sliced_int32, int32) no longer crashes - T1.3b — np.power(broadcast_int32, int32) no longer crashes - T1.36 — int**(-int) now raises ArgumentException matching NumPy ValueError What changed ============ NEW: src/NumSharp.Core/Utilities/NpyIntegerPower.cs Public squared-exponentiation helpers for all 9 integer NumSharp dtypes (sbyte/byte/int16/uint16/char/int32/uint32/int64/uint64) — preserves dtype-native wraparound (uint8 ** 8 = 0, 15**15 = 437893890380859375). Caller validates non-negative exponent. REWRITE: src/NumSharp.Core/Backends/Default/Math/Default.Power.cs - Removes the `Unsafe.Address`-based fast-path that crashed on sliced/broadcast operands and ignored strides. - Adds pre-scan: for `int**int` with signed-int exponent, scans rhs for any negative element and throws `ArgumentException("Integers to negative integer powers are not allowed.")`. Matches NumPy's unconditional check (rejects base ∈ {±1} too, per NumPy spec). - Adds scalar-exponent fast paths when `rhs.size == 1`: exp = 0 → ones_like(lhs) exp = 1 → lhs.copy() (or cast) exp = 2 → lhs * lhs (SIMD-optimized Multiply kernel) exp = 0.5 → np.sqrt(lhs) exp =-1.0 → np.reciprocal(lhs) (float base only) Each path verifies the resolved result dtype matches what the IL kernel would produce before substituting, so NEP50 promotion is preserved. - Falls through to `ExecuteBinaryOp` for the general case, which now walks strides correctly via the IL kernel paths. 
src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.cs - `EmitPowerOperation(il, resultType)`: dispatches to the matching `NpyIntegerPower.Pow*` helper for integer result types (replaces the `int → double → Math.Pow → int` round-trip that lost precision for values outside [-2^52, 2^52]). float32 → `MathF.Pow`; float64 → `Math.Pow`; Boolean and other fallthrough types use the original double round-trip to keep the kernel verifiable. - Cached `MethodInfo` lookups added for all 9 integer power helpers and `MathF.Pow`. src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Binary.cs - `EmitPowerOperation<T>(il)` (same-type contiguous kernel path): same dispatch as the mixed-type version. Generic `T` is mapped to the matching `NpyIntegerPower.Pow*` helper via `GetIntegerPowMethod<T>()`. src/NumSharp.Core/Backends/Default/Math/DefaultEngine.BinaryOp.cs - Updates the Power promotion comment to document NEP50 weak/strict behavior accurately (NumSharp matches NumPy in the common cases; the one documented misalignment is 0-D integer arrays explicitly constructed via `np.array(2, int32)`, which are indistinguishable from C# `int 2` after `np.asanyarray`). Tests ===== NEW: test/NumSharp.UnitTest/Math/NDArray.power.Comprehensive.Test.cs (24 tests) - Integer dtype-native wrapping (uint8/int8/int32 overflow) - Stride + broadcast layouts (sliced, broadcast_to, 2D-vs-1D) - Signed integer negative exponent throws (incl. base = ±1) - Unsigned integer exponent never throws - Float special values (0^0, NaN, ±inf base/exp, fractional neg base) - NEP50 promotion (f32 ** int{8,16,32}, f64 ** int*, f32 ** scalar) - All 9 integer dtypes smoke-tested via 2^3 = 8 REMOVED [Misaligned]: PowerEdgeCaseTests.Power_Integer_LargeValues Integer power now preserves exact precision; the test now asserts equality. UPDATED: NewDtypesCoverageSweep_Arithmetic_Tests.B35_SByte_Power_NegativeExponent* Previously documented the wrong (silent 0/±1) behavior; now asserts the NumPy-correct ArgumentException. 
UPDATED (removed [OpenBugs]):
- AuditV2_MathReductions.T1_3a_Power_SlicedInt32_ShouldNotCrash
- AuditV2_MathReductions.T1_3b_Power_BroadcastInt32_ShouldNotCrash
- AuditV2_ILKernelSimd.T1_36_* (4 tests)

Validation
==========
    cd test/NumSharp.UnitTest
    dotnet test --no-build --framework net10.0 \
      --filter "TestCategory!=OpenBugs&TestCategory!=HighMemory"
    → Passed: 8255, Failed: 0
    dotnet test --no-build --framework net10.0 \
      --filter "FullyQualifiedName~Power"
    → Passed: 129, Failed: 0

Microbench (1M-element float32, x100 iterations):
    power(arr, 2)     121ms  (fast path → mul; matches multiply baseline 117ms)
    power(arr, 0.5)   121ms  (fast path → sqrt)
    power(arr, 2.7)   518ms  (general path via MathF.Pow)

Behavior changes vs. prior NumSharp
===================================
- int**(-int) now throws (was: silently returned 0, 1, or -1). Matches NumPy 2.4.2 ValueError exactly.
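The integer power behavior described in this commit (squared exponentiation, dtype-native wraparound, and rejection of negative exponents) can be sketched in plain Python. This is a minimal model, not the `NpyIntegerPower` code: `int_pow_wrapped` and its `bits`/`signed` parameters are hypothetical, and fixed-width arithmetic is emulated by masking.

```python
def int_pow_wrapped(base, exp, bits, signed):
    """Exponentiation by squaring with fixed-width wraparound.

    bits/signed describe the emulated integer dtype (for example 8/False for uint8).
    """
    if exp < 0:
        # Same message text as the NumPy ValueError quoted in the commit.
        raise ValueError("Integers to negative integer powers are not allowed.")
    mask = (1 << bits) - 1
    result = 1
    b = base & mask                        # two's-complement view of the base
    while exp:
        if exp & 1:
            result = (result * b) & mask   # multiply in, wrap to the dtype width
        b = (b * b) & mask                 # square, wrap to the dtype width
        exp >>= 1
    if signed and result >= 1 << (bits - 1):
        result -= 1 << bits                # reinterpret the bit pattern as signed
    return result


print(int_pow_wrapped(15, 15, 64, True))   # fits in int64, so no wrap occurs
print(int_pow_wrapped(2, 8, 8, False))     # 256 wraps to 0 in uint8
```

Staying in integer arithmetic throughout is what avoids the precision loss of an `int → double → int` round trip for results outside [-2^52, 2^52].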

Adds the iterator-subsystem branch audit documents that drove this branch's bug-fix and refactor work: - `NDITER_BRANCH_QUALITY_AUDIT.md` — original (V1) audit walking every changed src file and ranking findings by severity (bugs → perf → parity gaps → refactors → clean review). Bug catalog includes: np.maximum/minimum NaN handling, np.power stride mishandling, np.searchsorted incompleteness, np.repeat missing axis, NpyIter Iternext+EXLOOP path, nan{mean,std,var} perf, np.argsort LINQ perf, linspace/eye boxing. - `NDITER_BRANCH_QUALITY_AUDIT_V2.md` — V2 (fact-check) audit driven by 8 parallel agents auditing file-by-file with results verified via `python -c` against NumPy 2.4.2 and `dotnet_run` against NumSharp. 60 of 65 Tier 1 findings confirmed with failing OpenBugs reproducers written under `test/NumSharp.UnitTest/AuditV2/AuditV2_*.cs`, plus a list of 4 false positives and 4 newly discovered bugs. - `docs/plans/audit_v2/01..08*.md` — per-batch audit chapters, each including: file scope tables (LoC + role), reference to NumPy source, reproduction commands, line-precise references, and a findings table with severity tags (bug / parity-gap / perf / refactor / clean). Chapters cover Iterators, ILKernel+SIMD, Default math/reductions, Logic+Shape+Storage, NDArray creation, Manipulation APIs+logic, Math ops + selection/sorting/stats, and Casting+random+utilities. These files are pure documentation and contain no code; they're the reference material for the bug fixes and tests that follow on the nditer branch.

Adds the per-batch test classes that the V2 audit fact-check pass produced to back its Tier 1 findings with concrete failing tests. Tests are marked `[OpenBugs]` so CI skips them until the underlying defect is fixed; running them locally with `TestCategory=OpenBugs` documents each bug's current behavior versus NumPy 2.4.2. Each test references both the master `NDITER_BRANCH_QUALITY_AUDIT_V2.md` and the matching `docs/plans/audit_v2/XX_*.md` chapter where the finding is documented in detail, and includes the file:line of the suspected defect plus a `python -c` NumPy 2.4.2 expectation. Test classes added (matching the 6 untracked batches): - `AuditV2_Iterators.cs` — NpyIter Iternext/EXLOOP issues, buffer refill, cast support gaps, NDIterator broadcast strides, etc. (Batch 1). - `AuditV2_LogicShapeStorage.cs` — Shape mutating indexer on a readonly struct, storage and logic op edge cases (Batch 4). - `AuditV2_NDArrayCreation.cs` — `np.array(NDArray, copy=false)` default aliasing, creation API edge cases (Batch 5). - `AuditV2_ManipulationApis.cs` — np.expand_dims on empty, manipulation parity gaps (Batch 6). - `AuditV2_MathSelectionSorting.cs` — SetIndicesNDNonLinear NIE, math/selection/sort bugs (Batch 7). - `AuditV2_CastingRandomUtilities.cs` — NpFunc/cast/random/utilities bugs (Batch 8). Batches 2 (`AuditV2_ILKernelSimd.cs`) and 3 (`AuditV2_MathReductions.cs`) already exist on the branch; this commit fills the remaining 6. Build is verified to pass with the new files included.

Updates `.claude/CLAUDE.md` so the project instructions match the code's current state: - "C-order only" entry replaced with "Order-aware layout": Shape tracks F-contiguity, and APIs with an `order` parameter resolve NumPy `C`/`F`/`A`/`K` through `OrderResolver`. Verified by: - `Shape.IsFContiguous` flag (`View/Shape.cs:115-118`) - `Shape.Order` property (`View/Shape.cs:437`) - F-aware construction (`View/Shape.cs:160`) - `F_CONTIGUOUS` entry in the flags table updated from "Reserved" to "Data is column-major contiguous" (matches `ArrayFlags.F_CONTIGUOUS` bit `0x0002` in `View/Shape.cs:24`). - Added `IsFContiguous — O(1) check via F_CONTIGUOUS flag` to the key Shape properties list. - Missing Functions count corrected from 19 → 18 and `np.where` removed from the Selection gap because `APIs/np.where.cs` implements it; new `### Selection` section under "Supported np.* APIs" lists `where`. - Iterators path updated from `MultiIterator.cs` to `NpyIter.cs` and `NpyExpr.cs` (verified — `MultiIterator` no longer exists; only `NDIterator`, `NpyIter`, `NpyExpr` are present in `Backends/Iterators`). - Q&A entries for NDIterator and NpyIter rewritten to match the current legacy-wrapper / NumPy-aligned multi-operand iterator split. Pure documentation change — no behavioral impact.

…per / Memory block Multiple `CopyTo` overloads in the unmanaged memory layer were calling `Buffer.MemoryCopy(...)` with source/destination swapped — the BCL signature is `MemoryCopy(void* source, void* destination, long destBytes, long sourceBytesToCopy)`, but the existing code passed the destination pointer first. The result was that data was copied *from the destination buffer into the source slice*, silently corrupting the caller's source data instead of populating the destination. ArraySlice`1.cs: - `TryCopyTo(Span<T>)`, `CopyTo(Span<T>)`, `IArraySlice.CopyTo<T1>(Span<T1>)`, `IArraySlice.CopyTo<T1>(UnmanagedSpan<T1>)`: swap source / destination pointers so data flows source→destination. - `CopyTo(IntPtr dst)`: also fix the byte-size argument — previous code passed `Count` (element count) for both destination size and bytes to copy, leaving non-byte dtypes with under-counted bounds. Replace with `Count * ItemLength` for both byte arguments and flip the source / destination order. - `CopyTo(IntPtr dst, long sourceOffset, long sourceCount)`: this overload was previously identical to `CopyTo(IntPtr dst)` (ignored the offset arguments). Add `sourceOffset` / `sourceCount` bounds checks, honour `sourceOffset` when computing the source pointer, and use `sourceCount * ItemLength` for the copy. - `CopyTo(Span<T>, long sourceOffset, long sourceLength)`: previous body recursed into itself (`CopyTo(destination, sourceOffset, sourceLength);`) causing a stack overflow. Replace with a bounds-checked `Buffer.MemoryCopy` from `Address + sourceOffset`. - `CopyTo(UnmanagedSpan<T>, long sourceOffset, long sourceLength)`: same direction swap as above. - `IArraySlice.CopyTo<T1>(Span<T1>)` / `IArraySlice.CopyTo<T1>(UnmanagedSpan<T1>)`: bytes-based comparison (`Count * ItemLength` vs `destination.Length * InfoOf<T1>.Size`) instead of element-count comparison, fixing the "destination too short" check for reinterpret-cast cases. 
- `IArraySlice.Clone<T1>()`: previous code used `UnmanagedMemoryBlock<T1>. Copy(Address, Count)` which treats `Count` as the *T1* element count while reading from a `T`-element buffer. Now compute the byte size and divide by `InfoOf<T1>.Size` so the clone preserves the whole byte payload (with a hard error if the byte count is not a multiple of the target element size). UnmanagedHelper.cs: - `CopyTo(IMemoryBlock src, IMemoryBlock dst, long countOffsetDestination)`: validate `countOffsetDestination` against `dst.Count` and ensure the source fits in the *remaining* destination capacity. Fix the destination-size argument to `(dst.Count - countOffsetDestination) * dst.ItemLength` instead of the source byte length (which under-counts by the offset when the destination is just big enough). UnmanagedMemoryBlock`1.cs: - `CopyTo(UnmanagedMemoryBlock<T> memoryBlock, long arrayIndex)`: swap pointers so data is copied source→destination (`memoryBlock.Address + arrayIndex` as dst), add null + bounds checks, and use the remaining destination capacity for the destination size argument. All fixes are direct corrections of misuses of `Buffer.MemoryCopy`'s signature; behavior for legitimate callers now matches the docstrings. The added regression tests live under `test/NumSharp.UnitTest/Backends/CloneRegressionTests.cs` (separate commit) and call each repaired overload to lock the contract in place.
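The two classes of mistake fixed above (swapped source/destination and element counts passed where byte counts are expected) can be modeled in Python. The `memory_copy` helper below is only a stand-in that mirrors the `(source, destination, destination size in bytes, bytes to copy)` argument order; `copy_slice` is a hypothetical caller, not the `ArraySlice` code.

```python
from array import array


def memory_copy(source, destination, dest_size_bytes, bytes_to_copy):
    """Stand-in with the same argument order as a (src, dst, dstBytes, copyBytes) copy."""
    if bytes_to_copy > dest_size_bytes:
        raise ValueError("destination is too small")
    destination[:bytes_to_copy] = source[:bytes_to_copy]


def copy_slice(src, dst, source_offset, source_count):
    """Copy source_count elements of src, starting at source_offset, into dst."""
    if source_offset < 0 or source_count < 0 or source_offset + source_count > len(src):
        raise IndexError("source range is out of bounds")
    item = src.itemsize
    src_bytes = memoryview(src).cast("B")
    dst_bytes = memoryview(dst).cast("B")
    memory_copy(
        src_bytes[source_offset * item:],  # source first, honouring the offset
        dst_bytes,                         # destination second
        len(dst) * item,                   # sizes in bytes, not elements
        source_count * item,
    )


src = array("i", [10, 20, 30, 40, 50])
dst = array("i", [0, 0, 0])
copy_slice(src, dst, 1, 3)
print(dst.tolist())
```

Passing `len(dst)` instead of `len(dst) * item` would under-count the destination capacity for any element wider than one byte, which is the under-counted bound the commit describes for `CopyTo(IntPtr dst)`.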

….Clone bugs

Shape.cs:

The `Clone(deep, unview, unbroadcast)` branch logic was inconsistent and dropped the `offset`/`bufferSize` of scalar views in the most common call (`Clone()` with default args). Rewrite the cascade so behavior is predictable:
- Empty shape → `default`.
- Scalar shape:
  - `unview` or `unbroadcast` → return the static `Scalar` (offset=0).
  - Otherwise honour `deep`: copy-constructor preserves both `offset` and `bufferSize` for sliced scalar views like `np.arange(10)["5"]`.
- Non-scalar shape:
  - `!deep && !unview && !unbroadcast` → return `this` (the readonly struct copy is itself a value-copy).
  - `unview` or `unbroadcast` → `new Shape((long[])dimensions.Clone())`, which the constructor fills with C-contiguous strides (no offset). This replaces the previous one-off `ComputeContiguousStrides` / `deep && !unbroadcast` mixed branches that produced different shapes depending on call combination.
  - Plain `deep` → deep copy via the copy constructor.

Old behavior failure: `scalar.Shape.Clone()` on `np.arange(10)["5"]` returned the canonical `Scalar` shape with `offset == 0`, hiding the fact that the data lives at index 5. The regression test `Shape_Clone_PreservesScalarViewOffset` in `CloneRegressionTests` locks the fix.

ArrayConvert.cs:
- `Clone(Array sourceArray)` had two issues:
  1. It walked `elementType.IsArray` past the array's actual element type, so a jagged `int[][]` was treated as a flat `int[]` and the subsequent `Array.Copy` produced wrong results (or threw). Now the immediate element type is used, preserving jaggedness.
  2. Arrays with a non-zero lower bound (created via `Array.CreateInstance(elementType, lengths, lowerBounds)`) were not supported; they fell through to the multi-dim branch with all-zero lower bounds. Capture each axis' lower bound and use `Array.CreateInstance(elementType, lengths, lowerBounds)` whenever the source is multi-dim or has any non-zero lower bound.
- `Clone<T>(T[,,,] sourceArray)` had a `GetLength(4)` typo for what should be the fourth (zero-indexed: 3) dimension. `GetLength(4)` throws `IndexOutOfRangeException` for any 4-D array. Changed to `GetLength(3)`. (Coverage: `CloneRegressionTests.ArrayConvert_Clone_FourDimensionalArray_UsesFourthDimensionLength`.)

…ontig NDArray (`Backends/NDArray.cs`): - All three `UnmanagedStorage`-based constructors now back-fill the engine when storage doesn't already have one, and mirror the chosen engine onto `Storage.Engine` so the array and storage stay in sync. Previously `Storage.Engine` could be null while the NDArray reported a valid `TensorEngine`, leaking back through chained constructors that read storage.Engine directly. - `TensorEngine` setter now propagates the resolved engine to `Storage.Engine` so changing the engine on an NDArray cascades to storage-side consumers. - `Clone()` is now `virtual` and uses the property setter (instead of the private field) so engine assignment propagates to storage. `NDArray<TDType>.Clone()` overrides it to preserve the typed wrapper — before this commit, `((NDArray<int>)x).Clone()` returned the non-generic NDArray base type, breaking generic callers (see `CloneRegressionTests.NDArray_Clone_PreservesGenericRuntimeType`). - `View`/`GetData(int[])`/`GetData(long[])`/the foreach yield path all switch from setting the private `tensorEngine` field to the property, so storage gets the engine too. UnmanagedStorage (`Backends/Unmanaged/UnmanagedStorage.cs`): - `CreateBroadcastedUnsafe(...)` now copies `storage.Engine` onto the new broadcast view. UnmanagedStorage cloning (`Backends/Unmanaged/UnmanagedStorage.Cloning.cs`): - All `Alias(...)` overloads, the `Slice` builder, both `Cast<T>` / `Cast(typeCode)`, both `CastIfNecessary<T>` / `CastIfNecessary(typeCode)`, and the empty-storage clone now propagate `Engine`. - Cast correctness fix: `Cast<T>` / `Cast(typeCode)` / `CastIfNecessary<T>` / `CastIfNecessary(typeCode)` used to cast the raw backing array via `InternalArray.CastTo(...)`. For strided or F-contiguous views that buffer holds elements in the *physical* layout, so the cast result contained values in the wrong logical order. 
They now run `CloneData()` first — which materializes the logical element sequence (via `NpyIter.Copy` for non-contiguous paths) — and cast that, so casting an F-contiguous view of `np.arange(6).reshape(2,3).T` yields the same values NumPy produces. (Verified by `CloneRegressionTests .UnmanagedStorage_CastGeneric_FContiguousSource_CopiesLogicalValuesAndEngine` and siblings.) - `Clone()` gains a fast `CanCloneRawLayout()` path: when the storage owns its buffer (no offset, no broadcast, no buffer/size mismatch) and is either C- or F-contiguous, the underlying ArraySlice is cloned raw and the same `Shape` is reused. Non-trivial slices and scalar views still fall back to `CloneData()`. Empty storages return a new typed empty storage and preserve the engine instead of trying to clone a null buffer. - `CastIfNecessary` early-return for same-dtype skips the `IsEmpty` check so empty storages of the requested dtype don't re-materialize.

The DefaultEngine helpers for `astype` and `transpose` created new `NDArray` instances via the `UnmanagedStorage`-only constructor, which falls back to `BackendFactory.GetEngine()`. Code that explicitly set `nd.TensorEngine` on the source (e.g. tests pinning a custom engine) would silently see its engine swapped for the default after a cast or transpose. `Default.Cast.cs` (`DefaultEngine.Cast`): - Capture `nd.TensorEngine` once at the top. - Empty/scalar/`(1,)` early returns now carry `engine` forward both on the returned `NDArray` and on `nd.Storage` (when reused in-place). - Both `copy` and in-place branches of the generic cast attach `TensorEngine = engine` to the resulting NDArray and to the re-assigned `nd.Storage`. `Default.Transpose.cs` (`DefaultEngine.TransposeAlias`): - The transpose alias returned via `Storage.Alias(newShape)` now carries `nd.TensorEngine` so transposed views keep their engine. Without this the call dropped back to the global default, breaking propagation through compounded operations. Coverage: `CloneRegressionTests.NDArray_AstypeCopyPath_PreservesTensorEngine` and the engine-propagation siblings.

…/ ravel paths All paths that build a fresh `NDArray` from an existing storage now preserve the caller's `TensorEngine`. Previously the `NDArray (UnmanagedStorage)` constructor would fall back to `BackendFactory.GetEngine()` when the supplied storage didn't carry an engine (which is common after `Clone()`/`Alias()`/`CloneData()`). `Creation/NDArray.Copy.cs` (`copy(char physical)`): - The C-order shortcut now requires the source to already be C-contiguous. Before, `copy('C')` on an F-contiguous view returned a `Clone()` whose shape preserved the F-strides — yielding a non-C-contiguous "copy". Now any non-C-contiguous source falls through to the iterator-driven materialization path. - The destination shape uses the requested `physical` order instead of hard-coding `'F'`. Combined with the fix above this gives correct C/F selection regardless of source layout. - Destination NDArray carries `TensorEngine = TensorEngine` of the source. Coverage: `CloneRegressionTests.NDArray_CopyCOrder_FromFContiguousSource_ProducesCContiguousCopy` and `NDArray_CopyFOrder_PreservesTensorEngine`. `Creation/NdArray.ReShape.cs`: - The F-order reshape return (`fFlat.Storage.InternalArray`-backed storage) and both non-contiguous fallback paths (`new NDArray(CloneData(), Shape.Clean())`) now attach the source `TensorEngine`. Coverage: `CloneRegressionTests.NDArray_ReshapeCopyPath_PreservesTensorEngine`. `Creation/np.array.cs`: - `np.array(nd, copy)` propagates `nd.TensorEngine` for both the alias (`copy=false`) and clone (`copy=true`) paths. Coverage: `NpArray_FromNDArray_PreservesTensorEngineForAliasAndCopy`. `Manipulation/np.expand_dims.cs`, `Manipulation/np.ravel.cs`, `Manipulation/NDArray.flatten.cs`: - The view (`Storage.Alias(...)`) and materialize (`CloneData()`) branches both forward the source `TensorEngine`. No semantic API changes other than the `copy('C')` correctness fix above; engine propagation is a transparent improvement.

NumPy's np.where iterator allocates the result with an order chosen from the *full-size* operands' contiguity flags:
- Any full-size, multi-dim operand that is C-contiguous (but not F) → output is C-contiguous.
- All full-size, multi-dim operands that are F-contiguous (and at least one is strictly F, not also C) → output is F-contiguous.
- Operands that are scalar, 1-D, or broadcasted do not vote.
- Mixed C/F (or any full-size non-contiguous operand) → fall back to C.

Verified against NumPy 2.4.2:

    f = np.arange(12).reshape(3,4).T                     # F-contig view
    np.where(f > 5, f, 0).flags                          # F_CONTIGUOUS=True
    np.where(f > 5, f.copy('C'), f).flags                # C_CONTIGUOUS=True
    np.where(np.array([True,False,True]), f, 0).flags    # F_CONTIGUOUS=True

Previously `np.where` always allocated the output as C-contiguous, losing the F layout that NumPy preserves for F-contiguous inputs.

`APIs/np.where.cs`:
- New `ResolveWhereOrder(params NDArray[] operands)` mirrors the rules above. Returns 'C' or 'F'.
- The result `Shape` is now constructed via `new Shape((long[])cond.shape.Clone(), resultOrder)` so the resulting strides match the resolved order.
- Drop the `NpFunc.Invoke(outType, WhereImpl<int>, ...)` generic dispatch: the actual `WhereImpl` body never used its `T` type parameter (the iterator-driven IL kernel keys off the runtime dtype string), so the switch-per-dtype indirection was dead weight. Replace with a direct non-generic `WhereImpl(cond, x, y, result)` call.

`test/NumSharp.UnitTest/Logic/np.where.BattleTest.cs`:
- New "Output Layout" region with three NumPy-anchored tests:
  * `Where_FContiguousInputs_ResultIsFContiguous`
  * `Where_MixedCAndFInputs_ResultFallsBackToC`
  * `Where_BroadcastConditionWithFInput_ResultIsFContiguous`
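The voting rules listed in this commit can be written down as a small Python sketch. It models operands with a hypothetical `Operand` record (dimension count, contiguity flags, and whether the operand is full-size) rather than real arrays, and `resolve_where_order` is an illustrative name, not the `ResolveWhereOrder` source. Returning 'C' when no operand votes is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass
class Operand:
    ndim: int
    c_contiguous: bool
    f_contiguous: bool
    full_size: bool = True  # False for scalars and broadcast operands


def resolve_where_order(*operands):
    """Pick 'C' or 'F' for the output from the operands that are allowed to vote."""
    # Scalars, 1-D operands, and broadcast operands do not vote.
    voters = [op for op in operands if op.full_size and op.ndim > 1]
    saw_strict_f = False
    for op in voters:
        if not op.f_contiguous:
            return "C"           # C-only or non-contiguous voter forces C
        if not op.c_contiguous:
            saw_strict_f = True  # F-contiguous and not also C-contiguous
    return "F" if saw_strict_f else "C"


f_view = Operand(ndim=2, c_contiguous=False, f_contiguous=True)
c_copy = Operand(ndim=2, c_contiguous=True, f_contiguous=False)
scalar = Operand(ndim=0, c_contiguous=True, f_contiguous=True, full_size=False)
print(resolve_where_order(f_view, f_view, scalar))
print(resolve_where_order(f_view, c_copy, f_view))
```

The two calls correspond to the first two NumPy examples in the commit, with the condition `f > 5` modeled as an F-contiguous operand.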

…sh order tests

NumPy's `np.tile(A, reps)` keeps the source memory order on the "no expansion" shortcut (`tup == (1,)*outDim`): F-contiguous input stays F-contiguous, C-contiguous input stays C-contiguous, and views with strides outside C/F materialize as C-contiguous.

Verified against NumPy 2.4.2:

    src = np.arange(12).reshape(3, 4).T                            # F-contig
    np.tile(src, (1, 1)).flags                                     # F_CONTIGUOUS=True
    np.tile(src, ()).flags                                         # F_CONTIGUOUS=True
    np.tile(np.arange(12).reshape(3, 4)[:, ::-1], (1, 1)).flags    # C_CONTIGUOUS=True

`Manipulation/np.tile.cs`:
- The all-ones shortcut previously called `A.copy()`, which defaults to `'C'` and silently flipped F-contiguous inputs to C-contiguous output. Replace with `A.copy('K')` (and the reshape variant gets the same treatment) so `OrderResolver.Resolve('K', shape)` picks the source's physical order. The comment is updated to describe the keep-order semantics.

`test/NumSharp.UnitTest/Manipulation/np.tile.Test.cs`:
- Three new tests covering the F-contig preservation, the `np.tile(a)` no-reps overload, and the non-contiguous fall-back. Each test also verifies element values against the source via index-based reads to guard against logical-order regressions.

`test/NumSharp.UnitTest/View/OrderSupport.OpenBugs.Tests.cs`:
- `Tile_ApiGap` is renamed to `Tile_RepeatsLastAxis_ValuesMatchNumPy` and its assertion stays; the API gap comment is replaced with the matching NumPy reference output. Header rewritten from "Missing functions" to "Manipulation helpers" since this section now documents working APIs.
- `Where_ApiGap` (previously `[OpenBugs]` because np.where was thought missing) is now `Where_FContig_PreservesFContig`. It asserts that `np.where(f_arr > 5, f_arr, 0)` returns an F-contiguous result on F-contiguous input, the same property covered by the new where layout tests in the prior commit. The `[OpenBugs]` attribute is removed because the feature exists and now matches NumPy.

…IterBattleTests

`benchmark/NumSharp.Benchmark.Exploration/Program.cs`:
- `Options.Clone()` reused the same `RemainingArgs` `string[]` reference on the cloned `Options` instance. Any post-clone mutation of the array (or its elements via index assignment) would have leaked back to the original `Options`. Clone the array (`(string[])RemainingArgs.Clone()`) so the two `Options` instances are independent.

`test/NumSharp.UnitTest/Backends/Iterators/NpyIterBattleTests.cs`:
- Remove a single trailing blank line at end of file. No code change.

… after RemoveAxis `NpyIter.Clone()` (in `Backends/Iterators/NpyIter.cs`) previously copied the `Buffers[op]` pointer field directly from the source state, so the original and the cloned iterator shared the *same* per-operand buffer. After `Iternext()` on either iterator the writes from one would clobber the other's data, and disposing one would free the buffer out from under the other. The fix: - After copying the operand metadata (`ElementSizes`, `BufStrides`, etc.), allocate a fresh buffer per operand via `NpyIterBufferManager .AllocateAligned(BufferSize, opDtype)` and `Buffer.MemoryCopy` the source bytes into it. If allocation fails the catch block calls `NpyIterBufferManager.FreeBuffers` for buffered states before releasing dim arrays + state memory. - `DataPtrs[op]` is fixed up: if the source `DataPtrs[op]` pointed into the original `Buffers[op]` byte range we translate the offset onto the newly allocated buffer so iteration continues at the same logical position. - The clone now calls `AllocateDimArrays(_state->NDim, _state->NOp, _state->StridesNDim)` — see below. `NpyIterState.AllocateDimArrays(int ndim, int nop, int stridesNDim)`: - Previously, the strides block was always sized as `ndim * nop`. After `RemoveAxis` lowers `NDim` but leaves `StridesNDim` at its original width, cloning the iterator allocated a strides block that was too small, causing later reads from `Strides[k]` (where `k >= NDim*NOp`) to access freed or unrelated memory. - The third parameter defaults to `ndim` (preserving the existing contract for all other call sites) but accepts an explicit `stridesNDim >= ndim` so `Clone()` can carry the original allocated stride width forward. `StridesNDim` is now stored on the state and the strides allocation uses `stridesNDim * nop * sizeof(long)`. The scalar fast-path now requires both `ndim == 0` and `stridesNDim == 0` to skip the allocation. 
Also moves the `GetInnerFixedStrideArray` docblock so it sits directly above its method (it had drifted onto an unrelated method when the preceding doc was edited). Coverage: - `CloneRegressionTests.NpyIterCopy_BufferedIterator_AllocatesIndependentBuffers` asserts the two iterators have distinct `DataPtr` addresses and that advancing one does not advance the other. - `CloneRegressionTests.NpyIterCopy_AfterRemoveAxis_PreservesAllocatedStrideWidth` builds an iterator over `(2,3,4)`, removes axis 1, clones it, and checks `NDim`, `Shape`, and the first value match.
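The `DataPtrs[op]` fix-up described for `NpyIter.Clone()` comes down to a range check plus an offset translation. The Python sketch below models addresses as plain integers; `translate_data_ptr` is a hypothetical helper, and treating the buffer as the half-open range `[old_base, old_base + buffer_bytes)` is an assumption of this sketch.

```python
def translate_data_ptr(data_ptr, old_base, new_base, buffer_bytes):
    """Re-point a data pointer at the clone's buffer if it pointed into the original buffer."""
    offset = data_ptr - old_base
    if 0 <= offset < buffer_bytes:
        # Same logical position, but inside the freshly allocated buffer.
        return new_base + offset
    # Pointer referenced the operand's own memory, not the buffer: leave it alone.
    return data_ptr


def clone_buffer(source):
    """Deep-copy the buffer bytes so the clone never shares storage with the original."""
    return bytearray(source)


original = bytearray(b"\x01\x02\x03\x04")
copy = clone_buffer(original)
copy[0] = 0xFF  # writing through the clone must not touch the original
print(original[0], copy[0])
print(hex(translate_data_ptr(0x1010, 0x1000, 0x8000, 64)))
```

Copying the bytes and translating the pointer together is what lets the cloned iterator continue from the same logical position without aliasing the source iterator's buffer.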

… clone fixes Adds `test/NumSharp.UnitTest/Backends/CloneRegressionTests.cs`, which locks in the contracts established by the preceding fix commits. Each test reproduces a specific bug or contract that previously regressed and asserts the corrected behavior. 27 tests; all pass on net8.0 and net10.0. Coverage map (each pair = test → fix commit): ArraySlice CopyTo direction / range fixes → `fix(unmanaged): correct CopyTo direction + bounds in ArraySlice` - `ArraySlice_CopyToSpan_CopiesFromSliceToDestination` - `ArraySlice_TryCopyToSpan_CopiesFromSliceToDestination` - `ArraySlice_CopyToSpan_WithSourceRange_CopiesRequestedRange` - `ArraySlice_CopyToIntPtr_WithSourceRange_CopiesRequestedRange` - `ArraySlice_InterfaceCopyToSpan_CopiesFromSliceToDestination` - `ArraySlice_InterfaceCloneGeneric_ReinterpretsWholeBytePayload` ArrayConvert.Clone jagged / non-zero lower-bound / 4-D GetLength fixes → `fix(shape+convert): preserve scalar offset on Clone; fix ArrayConvert.Clone bugs` - `ArrayConvert_Clone_PreservesJaggedElementType` - `ArrayConvert_Clone_PreservesNonZeroLowerBounds` - `ArrayConvert_Clone_FourDimensionalArray_UsesFourthDimensionLength` Shape.Clone scalar view offset preservation → same commit as above - `Shape_Clone_PreservesScalarViewOffset` UnmanagedStorage.Clone empty + F-contig + engine → `fix(storage+ndarray): keep TensorEngine in sync; correct cast for F-contig` - `UnmanagedStorage_Clone_DtypeOnlyStorage_DoesNotDereferenceMissingData` - `UnmanagedStorage_Clone_PreservesEngineAndFContiguousShape` UnmanagedStorage.Cast / CastIfNecessary uses CloneData + propagates engine → same commit - `UnmanagedStorage_CastTypeCode_FContiguousSource_CopiesLogicalValuesAndEngine` - `UnmanagedStorage_CastGeneric_FContiguousSource_CopiesLogicalValuesAndEngine` - `UnmanagedStorage_CastIfNecessary_FContiguousSource_CopiesLogicalValuesAndEngine` - `UnmanagedStorage_CastEmptyStorage_PreservesEngine` UnmanagedMemoryBlock.CopyTo arrayIndex offset → `fix(unmanaged): correct CopyTo 
direction + bounds in ArraySlice` - `UnmanagedMemoryBlock_CopyToWithIndex_CopiesToDestinationOffset` UnmanagedHelper.CopyTo destination-offset bounds → same commit - `UnmanagedHelper_CopyToWithInvalidDestinationOffset_Throws` NDArray.Clone / engine propagation → `fix(storage+ndarray): ...` + `fix(creation+manipulation): ...` + `fix(default-engine): ...` - `NDArray_Clone_PreservesGenericRuntimeType` - `NDArray_Clone_PreservesTensorEngineOnArrayAndStorage` - `NpArray_FromNDArray_PreservesTensorEngineForAliasAndCopy` - `NDArray_CopyFOrder_PreservesTensorEngine` - `NDArray_CopyCOrder_FromFContiguousSource_ProducesCContiguousCopy` - `NDArray_ReshapeCopyPath_PreservesTensorEngine` - `NDArray_AstypeCopyPath_PreservesTensorEngine` NpyIter.Clone buffered deep copy + RemoveAxis stride width → `fix(npyiter): deep-copy buffered Clone buffers; preserve stride width after RemoveAxis` - `NpyIterCopy_BufferedIterator_AllocatesIndependentBuffers` - `NpyIterCopy_AfterRemoveAxis_PreservesAllocatedStrideWidth`

…aths Promotes SByte, Half (float16), and Complex from "partially supported" to first-class dtypes, matching what NPTypeCode already declares and what NumPy 2.4.2 ships. The audit (NDITER_BRANCH_QUALITY_AUDIT_V2.md, Tier 1) flagged 9 production crash sites and 5 perf gaps where these three dtypes silently fell out of 12-dtype switches. After this commit every np.power(lhs, rhs) combination across the 15x15 dtype matrix works end-to-end, and the existing 12-dtype fast paths remain intact. CRASH FIXES (Tier 1): * NpyIterCasting (T1.9, T1.12, T1.38, T1.39): IsSafeCast / ReadAsDouble / WriteFromDouble / ConvertValue / PromoteTypes now handle SByte / Half / Complex. Complex routes through a dedicated Complex intermediate so the imaginary component is preserved on Complex->Complex copies and dropped cleanly (per NumPy's ComplexWarning) on Complex->real. Adds Half/SByte to IsFloatingPoint/IsSignedInteger predicates. * NpyIterBufferManager (related to T1.12): same-type buffered iteration was throwing for Complex base case. Adds SByte/Half/Complex branches to CopyToBuffer/CopyFromBuffer dtype dispatch. * UnmanagedStorage (T1.13, T1.57): SetValue(object,int[]/long[]), SetData(NDArray,long[]) scalar fast path, and the void*/IMemoryBlock CopyTo overloads all gained the three missing dtype branches. * ArrayConvert.cs (T1.30): 13 ToX(Array) destination switches were missing SByte/Half/Complex source cases. Plus ~40 new typed converter methods covering the previously-missing (Src -> Dst) corners. Total ~550 LOC added. * np.asanyarray (T1.49): adds IEnumerable<sbyte>, IEnumerable<Half>, IEnumerable<Complex> case branches; corresponding Memory<T>/ ReadOnlyMemory<T> dispatch; ConvertObjectListToNDArray branches; and FindCommonNumericType expansion (the seenMask bitset was bounded to 12 dtypes; Complex's typecode=128 also previously aliased bit 0 due to unbounded shift -- now masked by `(int)code & 31`). * np.copyto T1.55: now passes via the NpyIterCasting fix. 
* ILKernelGenerator.EmitDecimalConversion: Half<->Decimal and Complex<->Decimal routes were missing. np.power(Half, Decimal) now works (was the only np.power(15x15) failure after the casting fixes). PERF FIXES (Tier 2): * ILKernelGenerator.Binary.IsSimdSupported<T> (R9): adds sbyte. Vector*<sbyte> arithmetic is natively supported in .NET. * Converts.FindConverter (R18, R33): 12x12 type-pair fast-path ladder expanded to 15x15 (225 entries). Eliminates the IConvertible-interface boxing and object-cast boxing that the CreateFallbackConverter path imposes for SByte/Half/Complex sources or destinations. * Default.Reduction.ArgMax (R23): the per-slice NDArray view allocation in the Half/Complex axis fallback was costing one new NDArray per slice (1000 allocations for a (1000,1000) axis-reduce). Replaced with a stride-aware loop driven from a stackalloc coord vector. SByte path is removed from the fallback entirely since the IL kernel already handles it via CreateAxisArgReductionKernelTyped<sbyte>. * Default.BooleanMask gather (T1.58): the strided/broadcast fallback was calling Buffer.MemoryCopy(_, _, elemSize, elemSize) per matched element (~1us/element). Specialized on element size (1/2/4/8/16 bytes); all 15 dtypes hit a typed pointer write now, including Half (2B) and Complex (16B as two longs). VERIFICATION: * test/Math/NDArray.power.DtypeMatrix.Test.cs (new): - 15x15 dtype matrix smoke test (225 (lhs, rhs) combinations). - SByte ** -SByte raises ValueError-style ArgumentException. - Half ** Half preserves Half. - Complex ** Complex preserves Complex. - Float ** Complex promotes to Complex. - Half ** Single promotes to Single (NEP50). - SByte/Half/Complex List/IEnumerable inputs no longer throw. * Removed [OpenBugs] attribute from 11 AuditV2 tests that are now CI-green: T1_9 (3x), T1_12 (2x), T1_13 (2x), T1_30 (3x), T1_49 (3x), T1_55, T1_57, T1_58. They now run as regular tests. 
* Full suite: 8281 passed, 0 failed (was 8255 before this commit, including the new dtype-matrix tests and 26 promoted-from-OpenBugs tests). DOCS: * .claude/CLAUDE.md: "Supported Types (12)" -> "Supported Types (15)". Adds Half/SByte/Complex rows and a "Perf notes" section that documents Half/Complex/Decimal as scalar paths (no Vector<Half> arithmetic in .NET BCL; Complex.Pow is the BCL routine). OUT OF SCOPE FOR THIS COMMIT: * T1.34 NpyExpr Const/Where/Call SByte/Half/Complex support: not on np.power's critical path; left for a separate pass. * T1.39 Int64/UInt64 -> double precision loss above 2^53: separate audit item, unrelated to the three target dtypes. For full audit context see docs/plans/NDITER_BRANCH_QUALITY_AUDIT_V2.md section "Major themes" item 2 (missing SByte/Half/Complex).
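The ArgMax change above swaps per-slice view allocation for a stride-aware walk over the underlying buffer. The following is a minimal Python sketch of that idea, not the NumSharp implementation: the helper name `argmax_along_axis`, its parameters, and the use of element (not byte) strides are all assumptions made for illustration, and NaN handling for Half is not modelled.

```python
from itertools import product


def argmax_along_axis(buf, shape, strides, axis, offset=0):
    """Argmax along `axis` over a flat buffer without building per-slice views.

    Hypothetical helper. `strides` are in elements, not bytes. Results are
    returned in C order over the remaining axes.
    """
    outer_axes = [d for d in range(len(shape)) if d != axis]
    results = []
    # One coordinate tuple per output element (stands in for the coord vector).
    for coord in product(*(range(shape[d]) for d in outer_axes)):
        base = offset + sum(c * strides[d] for c, d in zip(coord, outer_axes))
        best_idx, best_val = 0, buf[base]
        # The inner loop advances only by the stride of the reduced axis.
        for k in range(1, shape[axis]):
            val = buf[base + k * strides[axis]]
            if val > best_val:
                best_idx, best_val = k, val
        results.append(best_idx)
    return results


# A C-contiguous (2, 3) array [[1, 9, 3], [7, 2, 8]] has element strides (3, 1).
data = [1, 9, 3, 7, 2, 8]
row_argmax = argmax_along_axis(data, (2, 3), (3, 1), axis=1)
col_argmax = argmax_along_axis(data, (2, 3), (3, 1), axis=0)
```

No intermediate array object is created per slice; the only per-slice state is the base offset.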

…s 3000× copy Audit V2 finding (Section 1.6 / src/NumSharp.Core/Manipulation/np.ravel.cs:30-34): np.ravel(a, 'F') unconditionally routed through a.flatten('F'), which allocates fresh F-contiguous memory and runs NpyIter.Copy over the source. NumPy, in contrast, returns a 1-D view sharing the underlying buffer whenever the source is already F-contiguous (np.shares_memory(np.ravel(aF, 'F'), aF) == True). The audit reports a 3000× performance regression on the hot F-order path (np.arange(12).reshape(3,4).copy('F') -> np.ravel(.,'F')): an O(1) shape-only aliasing was replaced with an O(N) buffered copy. Root cause ---------- ravel's F-branch had no fast path for the IsFContiguous case. flatten('F') is documented to "ALWAYS return a copy" by design, so calling it from ravel forced materialization even when the linear memory walk would already reproduce the column-major read-out. Why a 1-D view is correct for F-contiguous sources -------------------------------------------------- An F-contiguous array has strides[0] == 1 and strides[i] == dims[i-1] * strides[i-1] for i > 0, with no broadcast/stride-0 dimensions. Walking the underlying buffer linearly from `offset` for `size` elements visits values in F-order (first axis varies fastest), which is exactly what ravel('F') is specified to produce. For non-F-contig sources we still fall back to flatten('F') — a strided / C- contig / sliced source needs the column-major copy to reproduce F-order correctly. Implementation -------------- ravel(a, 'F') with NDim > 1 and size > 1: * a.Shape.IsFContiguous → build Shape(new[]{size}, new[]{1}, a.Shape.offset, a.Shape.bufferSize) and return new NDArray(a.Storage.Alias(vec)). offset and bufferSize are preserved so sliced F-views remain correct; size becomes the 1-D shape's logical and physical extent. * Otherwise → existing flatten('F') copy path (unchanged). 
The new shape's flags are recomputed by ComputeFlagsStatic over the 1-D dims/strides, which trivially marks the result as both C- and F-contiguous and writeable (a 1-D dense vector is both orders). Storage.Alias chains _baseStorage to the ultimate owner, so view tracking and the @base property continue to work. Test coverage ------------- * AuditV2_ManipulationApis.Ravel_FContiguous_FOrder_ReturnsView is no longer marked [OpenBugs(audit-v2-ravel-fcont-fview)] — the documented NumPy np.shares_memory invariant is now asserted directly in CI. * test/Manipulation/np.ravel.Test.cs gains 10 new tests: - Ravel_FOrder_FContig2D_IsView — write-through verifies aliasing. - Ravel_FOrder_FContig2D_ValuesMatchColumnMajor — read-out sequence matches NumPy. - Ravel_FOrder_FContig3D_IsView — 3-D F-flat-index decomposition. - Ravel_FOrder_CContig_IsCopy — C-contig source still copies. - Ravel_FOrder_Transpose2D_IsView — a.T (F-contig view) also aliases. - Ravel_FOrder_KOrder_FContigSource_IsView — 'K' resolves to 'F' for F-source. - Ravel_FOrder_AOrder_FContigSource_IsView — 'A' resolves to 'F' for strict F. - Ravel_FOrder_FContig_DtypeFloat — dtype preserved across the view. - Ravel_FOrder_FContig_EquivalentToFlattenF_Values — values match flatten('F'). - Ravel_FOrder_FContig_PreservesSize — 2-D / 3-D / 4-D size invariants. Verified -------- * New tests pass on net8.0 and net10.0. * Full CI-filtered suite (TestCategory!=OpenBugs&TestCategory!=HighMemory) passes 8292/8292 on both target frameworks. * The 17 pre-existing F-contig OpenBugs failures (np.flip, np.sort, np.repeat axis, reduction F-preservation, save/load fortran_order, etc.) remain unchanged — they live in test/View/OrderSupport.OpenBugs.Tests.cs and are excluded by the CI filter. References ---------- * NumPy: https://numpy.org/doc/stable/reference/generated/numpy.ravel.html * docs/plans/NDITER_BRANCH_QUALITY_AUDIT_V2.md — Section 1.6 * Spec: np.shares_memory(np.ravel(aF, 'F'), aF) == True for IsFContiguous source
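The stride argument in "Why a 1-D view is correct for F-contiguous sources" can be checked with a few lines of plain Python. This is a sketch under stated assumptions: `f_strides` and `ravel_f_by_index` are hypothetical helper names, strides are counted in elements, and the buffer is a Python list standing in for unmanaged memory.

```python
from itertools import product


def f_strides(shape):
    """Element strides of an F-contiguous array:
    strides[0] == 1 and strides[i] == shape[i-1] * strides[i-1]."""
    strides = []
    step = 1
    for dim in shape:
        strides.append(step)
        step *= dim
    return tuple(strides)


def ravel_f_by_index(buf, shape, strides, offset=0):
    """Reference F-order read-out: the first axis varies fastest."""
    out = []
    # product() varies its last argument fastest, so iterate the reversed
    # axes and flip each coordinate back.
    for rev in product(*(range(d) for d in reversed(shape))):
        coord = rev[::-1]
        out.append(buf[offset + sum(c * s for c, s in zip(coord, strides))])
    return out


shape = (3, 4)
strides = f_strides(shape)
# Buffer of arange(12).reshape(3, 4) stored column-major.
buf = [r * 4 + c for c in range(4) for r in range(3)]
linear_walk = buf[0:12]          # what a 1-D alias of the buffer exposes
indexed = ravel_f_by_index(buf, shape, strides)
```

Because the indexed F-order read-out and the linear walk agree, aliasing the buffer as a 1-D vector yields the same values that a column-major copy would.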

…aths

Audit of np.ravel call paths flagged two cases that the initial fix relied on but did not directly assert in tests. Add explicit coverage so regressions are caught:

1. Ravel_FOrder_FContigColumnSlice_PreservesOffset_IsView
   aF[:, 1:3] on F-contig (4,5) yields (4,2) F-contig with offset=4. The view path must preserve offset and bufferSize so the 1-D Alias reads memory [4..11]. Verified:
   * shape (8,)
   * F-order values [1, 6, 11, 16, 2, 7, 12, 17] (column-major read-out)
   * write-through r[0] → s[0,0] and aF[0,1] both updated (shared memory)

2. Ravel_FOrder_FContig_BothCAndFContig_IsView
   A (1, N) shape is simultaneously C- and F-contiguous. ravel('F') enters the F-branch (NDim>1, size>1, IsFContiguous=true) and returns an Alias view; this was already covered by the implementation but not by an explicit test.
   * shape (4,)
   * values [10, 20, 30, 40] (linear memory walk)
   * write-through r[0] propagates to both[0, 0]

Both cases pass on net8.0 and net10.0 (64/64 tests in the ravel suite).

Background: full ravel coverage matrix audited manually:

Order  Source layout                              Branch           Result
-----  -----------------------------------------  ---------------  -----------------------
'F'    strict F-contig, NDim>1, size>1            view path        view
'F'    both C+F contig (e.g. (1,N)), NDim>1       view path        view
'F'    F-contig col-slice, offset!=0              view path        view (offset preserved)
'F'    transpose of C-contig 2-D (→ F-contig)     view path        view
'F'    C-contig only, NDim>1                      flatten('F')     copy
'F'    broadcast / strided / non-contig           flatten('F')     copy
'F'    1-D (NDim==1)                              C-order path     view if contig
'F'    scalar / empty / size<=1                   C-order path     trivial
'C'    C-contig                                   reshape          view
'C'    F-contig only                              CloneData        C-order copy
'A'    F-contig (strict)                          resolves to F    view
'A'    otherwise                                  resolves to C    view/copy
'K'    F-contig                                   resolves to F    view
'K'    C-contig                                   resolves to C    view/copy

All 15 dtypes (Boolean, Byte, SByte, Int16, UInt16, Int32, UInt32, Int64, UInt64, Char, Half, Single, Double, Decimal, Complex) verified end-to-end via in-process buffer-address comparison and dtype assertion.

NDArray.ravel() and NDArray.ravel(char) instance methods delegate to np.ravel, so the fix covers both call sites.
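The column-slice case (offset=4, memory [4..11]) can be reproduced with a small model of an aliasing 1-D view. This sketch assumes the test data is arange(20).reshape(4, 5) in column-major layout, which is consistent with the values quoted above; `View1D` is a hypothetical stand-in for a storage alias, not a NumSharp type.

```python
class View1D:
    """Hypothetical 1-D alias over a shared buffer: (buffer, offset, size)."""

    def __init__(self, buf, offset, size):
        self.buf, self.offset, self.size = buf, offset, size

    def __getitem__(self, i):
        return self.buf[self.offset + i]

    def __setitem__(self, i, value):
        self.buf[self.offset + i] = value

    def tolist(self):
        return [self[i] for i in range(self.size)]


rows, cols = 4, 5
# arange(20).reshape(4, 5) laid out column-major; element strides are (1, 4).
buf = [r * cols + c for c in range(cols) for r in range(rows)]
strides = (1, rows)

# aF[:, 1:3]: shape (4, 2), same strides, offset moved to the start of column 1.
offset = 1 * strides[1]
r = View1D(buf, offset, rows * 2)
values = r.tolist()

r[0] = 99                        # write through the 1-D alias
aF_0_1 = buf[0 * strides[0] + 1 * strides[1]]
```

Dropping the offset would make the alias start at column 0 and return [0, 5, 10, 15, ...] instead, which is why the view path has to carry it over.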

…efault-None bounds Brings np.clip up to NumPy 2.x signature parity. Two missing capabilities are addressed at the API surface; the underlying engine (Default.ClipNDArray.cs) already supported null bounds for both legs of the interval. NumPy 2.x signature mirrored: clip(a, a_min=None, a_max=None, out=None, *, min=None, max=None) Changes: - src/NumSharp.Core/Math/np.clip.cs * Replace the trio of legacy 4-arg overloads with a single unified entry point exposing all parameters as optional. Callers may now write: np.clip(a) — no bounds, returns copy np.clip(a, min: 3) — lower bound only (NEP-rebrand) np.clip(a, max: 5) — upper bound only np.clip(a, min: lo, max: hi) — both via aliases np.clip(a, a_min: null, a_max: 5) — explicit None np.clip(a, min: 3.5, dtype: NPTypeCode.Double) a_min/a_max still accepted (NumPy keeps both names; min=/max= were added in 2.0 as keyword-only aliases). * Conflict detection mirrors NumPy: passing both a_min and min (or both a_max and max) raises ArgumentException rather than silently picking one. * Type-dtype overload preserved separately (Type != NPTypeCode?, no merge possible). Existing positional-3 call sites (np.clip(a, lo, hi)) and named-arg call sites in np.maximum/np.minimum compile unchanged. - test/NumSharp.UnitTest/NumPyPortedTests/ClipNDArrayTests.cs * 9 new tests covering the NumPy 2.x surface: - min=/max= keyword aliases (lower-only, upper-only, both) - Explicit a_min=null / a_max=null - Bare np.clip(a) returns a copy (verifies distinct backing storage) - min= keyword with array bound (broadcast verification) - Conflict detection (a_min+min, a_max+max throw) - min= combined with dtype= promotes result dtype Verification: - Reference outputs cross-checked against NumPy 2.4.2 via Python; all 9 documented behaviors match byte-for-byte. - ClipNDArrayTests: 26/26 pass (was 17, +9 new). - ClipEdgeCaseTests + np.maximum/np.minimum suite: 105/105 pass — no regressions (np.maximum/minimum use np.clip via named a_min:/a_max:). 
- Full unit-test sweep (TestCategory!=OpenBugs&!=HighMemory) on net10.0: 7202 pass, 0 fail, 11 pre-existing skips. Audit reference: audit_v2/07_math_ops_selection_sorting_stats.md (Batch 7, item 12).
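The alias handling described above (a_min/a_max alongside keyword-only min/max, with conflict detection) can be sketched in Python as follows. This is an illustration over plain lists, not NumPy or NumSharp code: `resolve_bound` is a hypothetical helper, None plays the role of "not supplied", and the exception type and message are placeholders.

```python
def resolve_bound(legacy, alias, legacy_name, alias_name):
    """Pick between a legacy bound (a_min/a_max) and its alias (min/max)."""
    if legacy is not None and alias is not None:
        # Refuse to silently pick one of two conflicting values.
        raise ValueError(f"pass either {legacy_name} or {alias_name}, not both")
    return alias if legacy is None else legacy


def clip(values, a_min=None, a_max=None, *, min=None, max=None):
    lo = resolve_bound(a_min, min, "a_min", "min")
    hi = resolve_bound(a_max, max, "a_max", "max")
    out = list(values)           # with no bounds the result is still a copy
    if lo is not None:
        out = [lo if v < lo else v for v in out]
    if hi is not None:
        out = [hi if v > hi else v for v in out]
    return out


data = [1, 5, 9]
lower_only = clip(data, min=3)
upper_only = clip(data, max=5)
unbounded = clip(data)

try:
    clip(data, a_min=0, min=1)
    conflict_raised = False
except ValueError:
    conflict_raised = True
```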


…n-int upcast Brings the np.clip engine path up to NumPy 2.x ufunc parity. Three latent bugs surfaced while battle-testing edge cases for the min=/max= alias work: 1. Dtype promotion silently demoted to lhs.typecode * Before: outType = typeCode ?? lhs.typecode - clip(int32, min=3.5) → int32 (NumPy: float64) - clip(int32, min=float32) → int32 (NumPy: float64) * After: weak-scalar promotion consistent with NumSharp's binary-op engine and NEP 50 — a 0-d bound of the same kind (int/float/complex /decimal) as lhs is "weak" and does not promote; cross-kind or array bounds promote via np.result_type. * Examples now matching NumPy: clip(int32, min=3.5) → float64 (was int32) clip(int32, min=3.0f) → float64 (was int32) clip(uint8, 50, 75) → uint8 (preserved, NEP 50 weak) clip(int32, min=long_arr) → int64 (array promotes) clip(float32, 3, 7) → float32 (preserved) * NaN bound on int array now upcasts to float64 with all-NaN result (was: silently a no-op, value unchanged). 2. @out= with mismatched dtype silently wrote garbage * Before: cast lhs/bounds to outType, blit through copyto into @out which retained its own (often narrower) dtype — produced truncated or pattern-aliased values. * After: validate @out.GetTypeCode == outType up front. Mismatch raises ArgumentException mirroring NumPy's _UFuncOutputCastingError ("Cannot cast ufunc 'clip' output from dtype('X') to dtype('Y') with casting rule 'same_kind'"). 3. Engine refactor for the both-null + dtype= case * np.clip(arr, dtype=Single) with no bounds now properly casts the output and respects @out when supplied (previously dtype= without bounds returned plain lhs.Clone()). Implementation details: - Added PromoteClipBound(outType, bound): no-promotion shortcut for 0-d same-kind bounds; falls back to np.result_type otherwise. - Added IsSameKind(a, b): groups Byte/Char/signed-int/unsigned-int as integer kind; floats/decimals/complex compare by NPTypeCode group. 
- @out validation now runs before any work, so shape/dtype errors fail fast without partial mutation of @out. - np.copyto(@out, Cast(lhs, outType, copy: false)) handles the case where lhs needs casting to the promoted output type before writing. Test additions (test/NumSharp.UnitTest/NumPyPortedTests/ClipNDArrayTests.cs) — 30 new tests across 8 categories all cross-checked against NumPy 2.4.2: Dtype Promotion (NEP-50): - uint8 + int scalars preserves uint8 - int32 + float scalar → float64 (also float32 scalar → float64) - float64 + int scalars preserves float64 - int32 + int64 array bound → int64 - dtype= with no bounds casts input - dtype= override forces narrower type even when bounds promote - NaN bound on int array upcasts to float64 @out= Edge Cases: - in-place out=src returns same buffer & mutates - out= separate buffer leaves src unchanged - shape mismatch throws - dtype mismatch throws (previously silent garbage) - out= with no bounds copies src Special Float Values via kwarg: - min=-inf / max=+inf no-op - min=NaN / max=NaN propagates 0-d (Scalar) Input: - clip(scalar, lo, hi) preserves ndim=0 - clip(scalar, max:hi) preserves ndim=0 - clip(scalar) preserves ndim=0 Half / Complex via Kwarg: - Half min/max preserves Half - Complex min= (lex ordering, scalar bound) - Complex array min/max bounds (lex ordering) Broadcasting via Kwarg: - 2D + row vector min → broadcasts along axis 0 - 2D + column vector max → broadcasts along axis 1 - 2D + mixed row min + column max Strided Inputs via Kwarg: - Reversed-slice (negative stride) clipped via min=/max= Empty Arrays via Kwarg: - Empty + min= only - Empty + max= only - Empty + dtype= cast Verification: - ClipNDArrayTests: 56/56 pass (was 26; +30 new). - np.clip + np.maximum + np.minimum + ClipEdgeCase + np.clip.Test suites: 85 pass on net8.0, 55 pass on net10.0 (frameworks differ in shared-class counts). 
- Full unit-test sweep (TestCategory!=OpenBugs&!=HighMemory) on net10.0: 7232 pass, 0 fail, 11 pre-existing skips (was 7202 before this commit).
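The fail-fast @out validation described above (all checks before any write, so a rejected call leaves @out untouched) can be modelled with the standard-library `array` type, whose `typecode` stands in for a dtype. This is a sketch only: `clip_into` is a hypothetical name, the exception types are placeholders, and the promoted output type is assumed to equal the source type because the bounds here are same-kind integers.

```python
from array import array


def clip_into(src, lo, hi, out):
    """Clip `src` into `out`, validating `out` before touching it."""
    # All checks run first, so a failure leaves `out` unmodified.
    if len(out) != len(src):
        raise ValueError("out has the wrong shape")
    if out.typecode != src.typecode:
        raise TypeError("out has the wrong dtype")
    for i, v in enumerate(src):
        v = lo if v < lo else v
        out[i] = hi if v > hi else v
    return out


src = array("i", [1, 5, 9])
good = clip_into(src, 3, 7, array("i", [0, 0, 0]))

bad = array("d", [-1.0, -1.0, -1.0])
try:
    clip_into(src, 3, 7, bad)
    error = None
except TypeError as exc:
    error = exc
```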

…x bug Benchmarking np.clip against NumPy 2.4.2 revealed a 48-80× slowdown on the common case `clip(arr, lo, hi)` with scalar literal bounds. Root cause: the engine was materializing every scalar bound via `np.broadcast_to(scalar, lhs.Shape).astype(outType)`, which for a 10M int32 input allocated and memset two 40MB bound arrays per call (then ran an element-wise array-bounds kernel that re-read both buffers). Investigation also surfaced a pre-existing kernel bug exposed once the new fast path routed scalar-bound calls through ClipScalar / ClipStrided / ClipScalarTail: the integer scalar fallbacks used `if / else if` to apply the two clamps, so when `minVal > maxVal` values below `minVal` incorrectly stuck at `minVal` instead of capping to `maxVal` (NumPy guarantees `min(max(x, lo), hi)` — i.e. `maxVal` wins when bounds are inverted). SIMD paths and Math.Min(Math.Max,...) float paths were already correct. Changes ======= src/NumSharp.Core/Backends/Default/Math/Default.ClipNDArray.cs - Add scalar-bounds fast path: detect 0-d min/max (or null) and dispatch directly to ClipUnified / ClipMinUnified / ClipMaxUnified (the kernel family that broadcasts the scalar inside the vector loop). Skips broadcast_to + astype materialization entirely. - ClipNDArrayScalarBounds: type-switch on outType to call the right generic kernel; uses a small delegate-based helper (ClipScalar<T>) so the dispatch logic isn't duplicated 12 times. - ClipNDArrayScalarBoundsFallback: Half and Complex still go through the array-bounds path — their scalar SIMD kernels aren't wired and Complex has lex-ordering NaN semantics already implemented there. Cost is just the 0-d→shape broadcast (stride-0 view, O(1)) plus a 1-element astype. - Array bounds (any non-0-d min or max) flow into the existing path unchanged. 
src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Clip.cs
  - ClipScalar<T> (generic integer fallback, called by ClipHelper): replace `if (val > maxVal) val = maxVal; else if (val < minVal) val = minVal;` with two sequential `if`s. Now matches NumPy semantics when min > max.
  - ClipScalarTail<T> (non-float tail after SIMD bulk loop): same fix.
  - ClipStrided<T> (coordinate-iterated path for non-contiguous arrays): same fix.
  - Added comments explaining why two sequential clamps are required.

Performance (Windows 11, .NET 10.0.1, AVX2-class CPU; 50 iterations, min of timings reported; same array shapes/dtypes on both runtimes):

                                     NumPy 2.4.2   NumSharp BEFORE   NumSharp AFTER
Scalar bounds, contiguous
  int32   size=1K                        3.4 µs           37.8 µs           3.3 µs
  int32   size=100K                      8.4 µs         2980.4 µs          66.5 µs
  int32   size=10M                     6 741 µs        323 557 µs        10 094 µs
  int64   size=10M                    14 519 µs        698 077 µs        34 860 µs
  float32 size=10M                     6 917 µs        570 707 µs        22 441 µs
  float64 size=10M                    14 228 µs        612 228 µs        30 926 µs
Single-sided scalar bound (min= or max= only)
  int32   size=10M min=               12 451 µs        285 434 µs        10 532 µs  (faster than NumPy)
  int32   size=10M max=               12 024 µs        294 756 µs        10 720 µs  (faster than NumPy)
  float64 size=10M min=               23 155 µs        300 770 µs        23 043 µs  (parity)
out= parameter
  int32 10M, out=arr in-place          7 038 µs        562 393 µs         7 465 µs  (parity)
  int32 10M, out=preallocated          7 794 µs        557 192 µs        12 539 µs
No bounds (clip(a))
  int32 10M                           12 126 µs          7 437 µs         7 158 µs  (faster than NumPy, Cast(copy:true))

Speedups range 20-75× over the previous NumSharp implementation; the common `clip(arr, lo, hi)` path now sits at 1.5-3× NumPy or matches it for small arrays.

Remaining gaps:
  * Array bounds (lo_arr, hi_arr same-size): 3.5× slower. The kernel is memory-bandwidth bound on three arrays; expected gap given .NET Vector<T> vs hand-tuned NumPy AVX2 inner loop.
  * Strided input (a[::2], a[::-1]): 15-20× slower. ClipStrided uses Shape.TransformOffset per element; NumPy's ufunc has a strided inner loop with stride-aware SIMD where possible.
* Half (float16): 11× slower — .NET's `Vector<Half>` arithmetic is not supported, scalar Half→double→Half path required. * 2D broadcast (row vec): 33× slower — still goes through array path after broadcast_to materializes the row vector. These remaining gaps are tracked for future kernel work and are not addressed in this commit. Verification ============ - All 7232 unit tests pass on net10.0 (TestCategory!=OpenBugs&!=HighMemory), including the regression test for min > max which now exercises the scalar fast path through the fixed ClipScalar/ClipStrided kernels. - Bench harness: $CLAUDE_JOB_DIR/clip_bench.py and clip_bench.cs (50 iterations each, min of timings).
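The min > max kernel bug fixed in this commit is easy to reproduce in isolation. The sketch below contrasts the quoted `if / else if` pattern with two sequential clamps; the function names are hypothetical, and the lower-then-upper ordering of the sequential version is an assumption derived from the `min(max(x, lo), hi)` guarantee rather than from the kernel source.

```python
def clamp_else_if(val, min_val, max_val):
    """Old fallback pattern quoted in the commit message."""
    if val > max_val:
        val = max_val
    elif val < min_val:
        val = min_val
    return val


def clamp_sequential(val, min_val, max_val):
    """Lower clamp, then upper clamp: equivalent to min(max(x, lo), hi)."""
    if val < min_val:
        val = min_val
    if val > max_val:
        val = max_val
    return val


lo, hi = 7, 3                    # inverted bounds
xs = [1, 5, 10]
old = [clamp_else_if(x, lo, hi) for x in xs]
new = [clamp_sequential(x, lo, hi) for x in xs]
ref = [min(max(x, lo), hi) for x in xs]
```

With inverted bounds the old pattern leaves x=1 stuck at the lower bound (7), while the sequential version and the reference expression both yield the upper bound (3). For ordinary bounds (lo <= hi) the two functions agree, which is why the bug stayed latent.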


Two further optimizations on top of the scalar-bounds fast path. Both target the gap to NumPy that benchmarking surfaced.

Findings from the breakdown profiler ($CLAUDE_JOB_DIR/clip_breakdown.cs) on int32 size=10M:

Step                                                  Time (µs)
────────────────────────────────────────────────────  ─────────
Pure ClipArrayBounds kernel (3R + 1W)                 ~7,700
Cast(lhs, int32, copy:true) alloc + 1R+1W             ~6,100
np.broadcast_to(lo_arr, same_shape)                   ~negligible
np.broadcast_to(lo_arr).astype(int32), same dtype     ~7,700  ← wasted clone
np.clip(arr, lo_arr, hi_arr) full path                37,700
np.clip(arr, 2, 7) scalar fast path                   17,100
ClipHelper kernel only (1R + 1W in-place)             ~9,800

The two wasted passes: (1) `astype(same_dtype)` cloning the bounds even when no cast is needed (15ms wasted on two array bounds), (2) the Cast-then-clip pattern doing 4 memory streams (2R + 2W) when 2 streams (1R src + 1W dst) suffice.

Changes
=======

src/NumSharp.Core/Backends/Default/Math/Default.ClipNDArray.cs

1. PrepareBound(bound, targetShape, outType) helper: When the bound is already same-shape, same-dtype, contiguous, offset zero, return it directly instead of running broadcast_to + astype (which clones via UnmanagedStorage.Clone). Wins for the common case where users pass arrays that already match the input layout.

2. ClipNDArrayFusedScalarBounds: new fast-fast path for the dominant `np.clip(a, lo, hi)` shape: contiguous lhs, scalar literal bounds, no @out, no dtype promotion. Allocates a fresh `np.empty(shape)` and runs the fused CopyAndClip kernel in a single pass. Replaces the classic Cast-then-clip pattern (which ran two passes over memory). Falls through to the existing in-place scalar path when @out is supplied or the lhs needs casting.

src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Clip.cs

3. CopyAndClip / CopyAndClipMin / CopyAndClipMax (and their *Simd256 / *Scalar / *ScalarTail variants for 10 SIMD-supported dtypes): fused read-clip-write kernels. Each loop iteration loads a Vector256, runs Min(Max(v, lo), hi) in registers, and stores to the destination buffer, never spilling intermediate values to memory. Halves the memory bandwidth requirement vs the in-place "copy then clip" pattern on memory-bandwidth-bound input sizes.

Performance vs NumPy 2.4.2 (Windows 11, .NET 10.0.1, AVX2-only CPU, 50 iterations, min reported)

                                    NumPy   NumSharp BEFORE   NumSharp AFTER   AFTER/NumPy
Scalar bounds, contiguous
  int32   size=1K                  3.4 µs           37.8 µs           3.1 µs   0.91× (FASTER)
  int32   size=100K                8.4 µs           2980 µs          68.2 µs   8.1×
  int32   size=10M                6741 µs         323557 µs          9336 µs   1.4×
  int64   size=10M               14519 µs         698077 µs         19287 µs   1.3×
  float32 size=10M                6917 µs         570707 µs         11002 µs   1.6×
  float64 size=10M               14228 µs         612228 µs         26969 µs   1.9×
Array bounds, contiguous (PrepareBound win)
  int32   size=10M                9488 µs          38259 µs         13898 µs   1.5× (was 4.0×)
  float64 size=10M               24712 µs          83863 µs         42137 µs   1.7× (was 3.4×)
Single-sided
  int32 10M min=                 12451 µs         285434 µs         11200 µs   0.90× (FASTER)
  int32 10M max=                 12024 µs         294756 µs         11351 µs   0.94× (FASTER)
out= (in-place / preallocated)
  10M in-place out=arr            7038 µs         562393 µs          4567 µs   0.65× (35% FASTER than NumPy)
  10M out=preallocated            7794 µs         557192 µs         10101 µs   1.3×
Both bounds None
  10M, clip(a)                   12126 µs           7437 µs          5778 µs   0.48× (2× FASTER)

Combined effect of all four perf commits (3505edc, 79c1894, 9334bd7, this one): the common `np.clip(arr, lo, hi)` path went from 48-80× slower than NumPy to within 1.4-1.9× across dtypes, with several cases matching or beating NumPy outright.

Discussion of the user's two questions
──────────────────────────────────────

1. Vector<T> vs Vector256<T>: measured both on this CPU; identical wall time (5527 vs 5559 µs/10M int32 in micro-bench, see $CLAUDE_JOB_DIR/clip_micro.cs). Vector<T> picks the widest hardware register at JIT time, so on AVX-512 hardware it'd be 512 bits = 2× throughput. On THIS AVX2 machine, no gain. Switching the existing Vector256<T> kernels to Vector<T> is a low-risk forward-compat move for AVX-512 hosts but no measurable win here. Not changed in this commit (would touch the whole kernel file ecosystem; out of scope).

2. IL Generation via DynamicMethod: the existing binary kernels (ILKernelGenerator.Binary.cs) emit 4× unrolled SIMD loops via System.Reflection.Emit. Tested whether porting that pattern to clip would help: micro-benchmarked manually-unrolled 4× and 8× Vector256<int> loops against the simple 1× variant. Results (10M int32):

     1× unrolled: 5559 µs
     4× unrolled: 6494 µs (SLOWER, register pressure)
     8× unrolled: 5428 µs (2% faster, within noise)

   The .NET JIT already auto-unrolls the simple SIMD loop well enough that hand-unrolling doesn't help and can hurt. IL emission for this op would add significant complexity for ~no perf win. Not pursued.

The wins came from algorithmic changes (fused single-pass kernel, skipping redundant clones), not from instruction-level tuning.

Verification
============

- All 7232 unit tests pass on net10.0 (TestCategory!=OpenBugs&!=HighMemory).
- Includes the regression test for `min > max` semantics through the new fused kernel path (which goes through CopyAndClip's scalar tail for the size<32 case).
- Bench harness: $CLAUDE_JOB_DIR/{clip_bench.py,clip_bench.cs, clip_breakdown.cs,clip_micro.cs}.

Remaining gaps
==============

- Strided / negative-stride / broadcast inputs: ~12-15× slower than NumPy. ClipStrided iterates with Shape.TransformOffset per element (~50ns/element overhead). NumPy ufunc has stride-aware SIMD inner loops. Would require a stride-aware clip kernel similar to NumPy's.
- Half / float16: ~9× slower. .NET's Vector<Half> arithmetic is not supported; falls back to scalar Half-via-double round-trip.
- 100K size scalar bounds (8.1×): allocation/dispatch overhead is amortized over fewer elements; gap shrinks at larger sizes.
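The 4-stream versus 2-stream accounting behind the fused kernel can be made concrete by counting element reads and writes. This is a scalar Python sketch with hypothetical function names; it models only the memory-traffic argument, not the SIMD kernels themselves.

```python
def copy_then_clip(src, lo, hi):
    """Two passes: materialize a copy, then clamp it in place."""
    traffic = {"reads": 0, "writes": 0}
    dst = []
    for v in src:                        # pass 1: 1R + 1W per element
        traffic["reads"] += 1
        dst.append(v)
        traffic["writes"] += 1
    for i in range(len(dst)):            # pass 2: 1R + 1W per element
        traffic["reads"] += 1
        dst[i] = min(max(dst[i], lo), hi)
        traffic["writes"] += 1
    return dst, traffic


def copy_and_clip(src, lo, hi):
    """One fused pass: read, clamp in a local, write once."""
    traffic = {"reads": 0, "writes": 0}
    dst = [0] * len(src)
    for i, v in enumerate(src):          # 1R + 1W per element in total
        traffic["reads"] += 1
        dst[i] = min(max(v, lo), hi)
        traffic["writes"] += 1
    return dst, traffic


src = list(range(10))
two_pass, two_pass_traffic = copy_then_clip(src, 2, 7)
fused, fused_traffic = copy_and_clip(src, 2, 7)
```

Both variants produce the same output, but the fused loop touches memory half as often, which matters once the input no longer fits in cache.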


…-adaptive

Per the user's directive, the entire clip code path now goes through a single ILKernelGenerator entry point that dispatches internally and emits all loops as DynamicMethod IL using the runtime-detected vector width (V128 / V256 / V512). No hardcoded Vector256 references remain; no hand-written C# loops remain in the engine or kernel files.

Files
=====

src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Clip.cs

  Rewritten from scratch (~1900 → ~360 lines, -81%).

  Public surface (what the engine actually calls):

    public enum ClipMode { BothBounds, MinOnly, MaxOnly }
    public enum ClipBoundsKind { Scalar, Array }
    public unsafe delegate void ClipKernel(
        void* src, void* dst, long size, void* lo, void* hi);
    public static unsafe void Clip(
        NPTypeCode dtype, ClipMode mode, ClipBoundsKind kind,
        void* src, void* dst, long size, void* lo, void* hi);

  All dispatch happens inside ILKernelGenerator:
  - Cache key = (dtype << 16) | (mode << 8) | kind
  - On first miss, Generate(dtype, mode, kind) builds a DynamicMethod and stores the resulting delegate in a ConcurrentDictionary.

  The IL emitter:
  - Uses GetVectorContainerType() / GetVectorType() / GetVectorCount() so the SIMD loop body adapts to V512 on AVX-512 hosts, V256 on AVX2, V128 on SSE2. There is no `Vector256` or `Vector128` token anywhere in the kernel file.
  - Hoists the scalar bound load and Vector.Create() out of the SIMD loop (one broadcast per kernel call, not per iteration).
  - Computes `byteOff = i * sz` once per iteration into a local and reuses it for src/lo/hi/dst pointer arithmetic, avoiding the O(N × pointer_count) multiplications the prior C# kernels had.
  - Falls back to a pure scalar IL loop for dtypes without Vector<T>.Min/Max (Char, Decimal, Half, Complex). Half and Complex delegate the per-element clamp to static helper methods (NaN-aware / lex-order); the loop is still IL.

src/NumSharp.Core/Backends/Default/Math/Default.ClipNDArray.cs

  Stripped from 1222 → 207 lines (-83%). Everything but policy is gone:
  - dtype promotion (NEP-50 weak scalar via PromoteClipBound)
  - @out validation (shape, writeable, dtype)
  - scalar-vs-array kind detection (min.ndim == 0)
  - NaN-in-scalar-bound short-circuit for float dtypes
  - dst materialization choice: in-place vs fused-fresh vs cast-copy
  - single call to ILKernelGenerator.Clip(...)

  The previous ClipNDArrayContiguous / ClipNDArrayGeneral / ClipNDArrayScalarBounds / ClipNDArrayFusedScalarBounds / 12 per-dtype switches / delegate-based generic dispatchers / 14 Generated*Core methods are all deleted. One call site, one cache, one IL emitter.

src/NumSharp.Core/Backends/Default/Math/Default.Clip.cs

  Deleted (251 lines). Dead code: internal `ClipScalar(NDArray, object, object)` had no callers anywhere in the codebase and was a parallel hand-coded path that the IL kernel now subsumes.

src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.cs

  Added EmitVectorMinOrMax() to the existing emit-primitives section (sibling of EmitVectorOperation). Resolves Vector{Width}.Min<T> / Max<T> by reflection on whatever container the runtime selected at startup, the same width-adaptive pattern used by the binary kernels.
Performance vs NumPy 2.4.2 (Windows 11, .NET 10.0.1, AVX2 CPU, 50 iters, min reported)

                                     NumPy     Before IL      After IL
Scalar bounds, contiguous
  int32   size=1K                   3.4 µs        3.1 µs        2.9 µs  (0.85×, faster)
  int32   size=100K                 8.4 µs       68.2 µs       47.3 µs
  int32   size=10M                6 741 µs      9 336 µs      7 509 µs  (1.11×)
  int64   size=10M               14 519 µs     19 287 µs     22 279 µs
  float32 size=10M                6 917 µs     11 002 µs     13 057 µs
  float64 size=10M               14 228 µs     26 969 µs     28 842 µs
Single-sided (min= or max= only)
  int32   10M min=               12 451 µs     11 200 µs     10 944 µs  (0.88×, faster)
  int32   10M max=               12 024 µs     11 351 µs      8 009 µs  (0.67×, faster)
  float64 10M min=               23 155 µs     23 043 µs     19 776 µs  (0.85×, faster)
out=
  10M in-place out=arr            7 038 µs      4 567 µs      3 954 µs  (0.56×, faster)
  10M out=preallocated            7 794 µs     10 101 µs      9 175 µs
Both bounds None
  10M clip(a)                    12 126 µs      5 778 µs      6 025 µs  (0.50×, faster)
Half (float16): IL emit cut the gap by 3×
  10M                            66 969 µs    602 219 µs    212 024 µs  (was 9×, now 3.2×)

Verification
============

- All 7232 unit tests pass on net10.0 (TestCategory!=OpenBugs&!=HighMemory).
- The 85 clip-family tests (ClipNDArrayTests, ClipEdgeCaseTests, np.clip.Test, NewDtypes Half/Complex clip tests) cover:
  * Scalar literal bounds, array bounds, both-None, min-only, max-only
  * 14 dtypes (Byte, SByte, Int16/32/64, UInt16/32/64, Char, Half, Single, Double, Decimal, Complex)
  * Contiguous, transposed, strided (every other), reversed slices
  * Broadcast inputs (the OpenBug test)
  * NaN propagation in float arrays
  * NaN in scalar bound → all-NaN result (short-circuited in engine)
  * min > max → result all = max
  * @out= validation (shape & dtype mismatch throws)
  * NEP-50 weak-scalar promotion (uint8 + 50 stays uint8)
  * Cross-kind promotion (int32 + 3.5 → float64)
- Cache correctness: each (dtype, mode, kind) combination generates its kernel once on first call, then reuses the cached delegate. Re-running the test suite a second time keeps the same delegates (no re-emit per call).
Remaining gaps (deferred) ========================= - Strided / negative-stride contiguity (~15-20× NumPy): the engine materializes a contiguous copy first via Cast(copy:true). A proper fix would IL-emit a stride-aware kernel, but that doubles the code size and is rarely the hot path. - Array-bounds slightly worse than the prior hand-coded V256 inner loop (~2× NumPy vs ~1.5× before): the IL emit doesn't 4×-unroll like the binary kernels do. Measured earlier in the conversation, manual 4× unroll on the simple clip loop hurt rather than helped on the JIT auto-unrolled C# baseline; effect on IL-emitted code may differ but not investigated.
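The kernel cache described in this commit (key = (dtype << 16) | (mode << 8) | kind, generate on first miss, reuse afterwards) can be sketched in Python. Everything here is illustrative: the enum values follow the declaration order quoted above, the dtype code 7 is an arbitrary placeholder rather than a real NPTypeCode value, a plain dict stands in for the ConcurrentDictionary (so the sketch is not thread-safe), and the "kernels" are list comprehensions rather than emitted IL.

```python
from enum import IntEnum


class ClipMode(IntEnum):
    BOTH_BOUNDS = 0
    MIN_ONLY = 1
    MAX_ONLY = 2


class ClipBoundsKind(IntEnum):
    SCALAR = 0
    ARRAY = 1


def cache_key(dtype_code, mode, kind):
    # Assumes mode and kind each fit in 8 bits, as the packing implies.
    return (dtype_code << 16) | (int(mode) << 8) | int(kind)


def generate(mode):
    """Stand-in for kernel generation: returns a scalar-bounds clamp."""
    if mode == ClipMode.MIN_ONLY:
        return lambda src, lo, hi: [max(v, lo) for v in src]
    if mode == ClipMode.MAX_ONLY:
        return lambda src, lo, hi: [min(v, hi) for v in src]
    return lambda src, lo, hi: [min(max(v, lo), hi) for v in src]


_kernels = {}
generated = []                   # records each generation to show it runs once


def get_kernel(dtype_code, mode, kind):
    key = cache_key(dtype_code, mode, kind)
    kernel = _kernels.get(key)
    if kernel is None:           # first miss: build and remember the kernel
        generated.append(key)
        kernel = _kernels[key] = generate(mode)
    return kernel


INT32 = 7                        # hypothetical dtype code
first = get_kernel(INT32, ClipMode.BOTH_BOUNDS, ClipBoundsKind.SCALAR)
second = get_kernel(INT32, ClipMode.BOTH_BOUNDS, ClipBoundsKind.SCALAR)
clipped = first([1, 5, 9], 3, 7)
```

Packing the three fields into one integer keeps lookups to a single dictionary probe with no tuple allocation, which is presumably why the commit uses a bit-packed key.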


…) into nditer Brings in 5 commits from worktree-clip-min-max-aliases that rebuild np.clip end-to-end. Replaces and supersedes the in-flight clip work on nditer (c3bbe9a "fix(clip): Complex IComparable + Half NaN propagation") whose root cause — generic CompareTo / NpFunc routing for clip — no longer exists after this merge. Summary of incoming work (3505edc..10064ab) ============================================= 1. feat(np.clip): NumPy 2.x parity — min=/max= keyword aliases and default-None bounds. New signature mirrors NumPy 2.x: clip(a, a_min=None, a_max=None, out=None, *, min=None, max=None, dtype=None) 2. fix(np.clip): NumPy 2.x dtype promotion (NEP 50 weak-scalar via np.result_type), out= dtype validation, NaN-on-int upcast. 3. perf(np.clip): scalar-bounds fast path + fixed a latent ClipScalar min>max kernel bug (`if/else if` instead of two sequential clamps). 4. perf(np.clip): fused copy+clip kernel + skip the redundant astype clone when the bound already matches lhs shape/dtype/contiguity. 5. refactor(np.clip): all kernels routed through a single ILKernelGenerator.Clip() entry. Every loop is now emitted as a DynamicMethod via System.Reflection.Emit. The SIMD width is resolved at runtime (V128/V256/V512) — no Vector256 token remains anywhere in the clip path. Conflict resolution =================== * src/NumSharp.Core/Backends/Default/Math/Default.Clip.cs Deleted in worktree. nditer's c3bbe9a modified it (Complex pre- checks). The IL kernels in this merge handle Complex natively via ComplexMaxNaN/ComplexMinNaN helpers called from the generated loop, so the Default.Clip.cs path becomes redundant. Took the deletion. * src/NumSharp.Core/Backends/Default/Math/Default.ClipNDArray.cs Both sides modified. nditer's version had c3bbe9a + 574a0d8 refactor (NpFunc generic dispatch, ~400 switch cases replaced). Worktree's version (this branch) is the IL-routed engine (207 lines of pure policy + one ILKernelGenerator.Clip call). Took worktree. 
Half / Complex correctness preserved by the new IL kernel — verified via the existing battletest suite (NewDtypes Half + Complex tests all pass). * src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Clip.cs Both modified. nditer added Half-specific scalar paths in the old kernel API. Worktree rewrote the file from ~1900 → ~360 lines of IL emission. Took worktree — Half NaN handling now lives inside the IL-emitted scalar tail via HalfMaxNaN/HalfMinNaN helper calls. * src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.cs Auto-merged cleanly. Worktree added EmitVectorMinOrMax helper alongside nditer's dtype-parity expansion. * src/NumSharp.Core/Math/np.clip.cs Manually merged: kept worktree's new public signature (a_min/a_max/out/dtype/min/max with NumPy-2.x semantics) and nditer's PreserveFContigFromSource wrapper (39ef08c "F-contig preservation across ILKernel dispatch"). Output now keeps F-order when the input was F-contiguous. * test/NumSharp.UnitTest/NumPyPortedTests/ClipNDArrayTests.cs Auto-merged cleanly — 39 new tests from worktree (NEP-50 promotion, min/max aliases, out= edge cases, etc.) sit alongside nditer's existing coverage. Verification ============ - Full build: 0 errors, 17 warnings (unchanged from each side). - Test sweep (TestCategory!=OpenBugs&!=HighMemory) on net10.0: 8334 pass, 0 fail, 11 pre-existing skips. nditer was at ~7232 tests pre-merge in the worktree's view; the actual count on nditer-only is higher and the merge brings the combined total up. - All 85 clip-family tests pass (39 new + 46 pre-existing). - The Complex IComparable issue that c3bbe9a addressed is verified fixed by the merge: the failing tests in that commit's "Fixes 14 test failures" list all pass through the new IL kernel (Complex takes the scalar-IL path with ComplexMaxNaN/ComplexMinNaN helpers). API behaviour for callers ========================= - np.clip(arr, lo, hi) — works exactly as before. - np.clip(arr, lo, hi, dtype: NPTypeCode.Double) — new dtype= override. 
- np.clip(arr, lo, hi, @out: dst): supported as named arg.
- np.clip(arr, min: 3): NEW, NumPy 2.x kwarg alias.
- np.clip(arr, max: 7): NEW, kwarg alias.
- np.clip(arr, a_min: null, a_max: 5): NEW, explicit None bound.
- Promotion: clip(int32, 3.5) → float64 (was int32, a bug pre-merge).
- out= dtype mismatch now throws ArgumentException (was silent garbage pre-merge).
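The ClipScalar min>max fix from the summary can be illustrated with a minimal Python sketch. It assumes the buggy form was the `if/else if` chain and the fixed form is two sequential clamps, which agrees with NumPy's documented behaviour of returning the upper bound everywhere when the bounds are inverted. The function names are hypothetical and are not NumSharp's.

```python
def clip_if_else(x, lo, hi):
    """Branching clamp: only correct when lo <= hi."""
    if x < lo:
        return lo
    elif x > hi:
        return hi
    return x


def clip_sequential(x, lo, hi):
    """Two sequential clamps: lower bound first, then upper bound."""
    return min(max(x, lo), hi)


# Inverted bounds (lo=5 > hi=3): the upper bound should win for every input.
values = [1, 4, 9]
branching = [clip_if_else(v, 5, 3) for v in values]      # [5, 5, 3]
sequential = [clip_sequential(v, 5, 3) for v in values]  # [3, 3, 3]
```

With ordered bounds the two forms agree; they only diverge when lo > hi, which is why the bug stayed latent.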


…six + bitwise, np.positive full ufunc surface, NumPy round shape Completes the single-NumPy-shaped-overload audit across 5716f86 (slice 2) and 6a566e4 (merged-overload wave). A reflection audit over every np.* member those commits touched found four remaining gaps; all are closed here, every rule probed against NumPy 2.4.2 BEFORE implementation and pinned verbatim by tests. 1) dtype= on add/subtract/multiply/divide/true_divide/mod The 6a566e4 wave deferred these because the loop-dtype machinery didn't exist; it does now (ExecuteBinaryOp's dtype override from the power/floor_divide work), so the deferral was stale. TensorEngine gains the house (lhs, rhs, typeCode, out, where) signature on the five members; np.* adds the trailing dtype= param. * the loop COMPUTES in dtype: subtract(300_i64, 5_i64, dtype=i16) = 295 in the int16 loop; add(0.1, 0.2, out=f64, dtype=f32) stores float32(0.3) = 0.30000001192092896 in the f64 out (probed). * BUG FIX (ordering): the bool add/multiply remap (+ = logical OR, * = logical AND) keyed off the PROMOTED dtype and ran before the dtype= override — add(True, True, dtype=i32) would have run the BitwiseOr loop and returned 1. NumPy runs the i32 add loop and returns 2 (probed). The remap now keys off the FINAL loop dtype: moved after the dtype override in ExecuteBinaryOp. * divide/true_divide are float-only ufuncs: integer/bool dtype= raises 'No loop matching the specified signature and casting was found for ufunc divide' (probed; same gate family as sqrt&co). * mod reports 'remainder' with indexed input-cast errors (probed: mod(f64, f64, dtype=i32) → "Cannot cast ufunc 'remainder' input 0 from dtype('float64') to dtype('int32') ..."). 2) dtype= on bitwise_and/or/xor Loop selection among the bool/int loops: dtype=i64 widens, dtype=i16 narrows under same_kind (300 & 300 = 300_i16, probed). 
A float/complex/decimal dtype= raises the NO-LOOP text — distinct from the float-INPUT coercion text ValidateBitwiseLoop produces ('ufunc not supported for the input types...'); both texts probed, the first implementation reused the wrong one and the probe caught it. 3) np.positive — full ufunc surface (was: bare copy only) Slice 2 prepared UnaryOp.Positive (identity emit, UfuncName mapping, Round uses it as the masked-copy vehicle) but never exposed the np surface. New TensorEngine.Positive + Default.Positive: * positive(x, out=) returns the provided instance; where= masks (false slots keep prior contents); both ride the unary Into-path with the identity kernel. * dtype= selects the loop: positive(i32, dtype=f64) widens (the no-out path is a cast-copy); positive(f64, dtype=i32) raises the unary same_kind input-cast error naming 'positive'. * bool rule probed precisely: positive has NO bool loop — plain positive(bool) raises "ufunc 'positive' did not contain a loop with signature matching types <class 'numpy.dtypes.BoolDType'> -> None" and dtype=bool raises the two-sided variant naming the input's DType class (new NumPyDTypeClassName helper maps NPTypeCode → numpy.dtypes.* names); positive(bool, dtype=f64) is LEGAL → [1., 0.] — the guard keys off the loop, not the input. 4) round_/around — NumPy's round(a, decimals=0, out=None) shape Slice 2 shipped TWO out-overloads each ((x, out=null) + (x, decimals, out=null)); NumPy has ONE signature whose 2nd positional is DECIMALS (np.round(x, out_array) raises 'only integer scalar arrays can be converted to a scalar index' — probed). Merged to (x, int decimals = 0, NDArray out = null); positional-out callers migrate to @out: (3 test sites updated). Legacy positional-dtype conveniences unchanged. Tests: UfuncDtypeOverloadTests +8 (binary-six loop matrix incl. 
the bool-remap probe pin add(True,True,dtype=i32)=2, dtype+out f32 composition, divide no-loop + remainder/add input-cast texts, bitwise int-loop matrix with narrowing + no-loop texts, positive call-form matrix + both did-not-contain-a-loop texts + same_kind error, round single-shape matrix); UfuncUnaryBatchOutWhereTests round callers moved to NumPy positions. Suite: net10.0 CI filter 9709/0; touched families 100/100 on net8.0; live side-by-side value checks vs numpy 2.4.2 (f32 loop precision 0.30000001192092896, 1/3 float32, banker's rounding, positive matrix).
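The ordering fix for the bool add/multiply remap can be sketched in a few lines of Python. This is an illustration only: `select_loop` and its `remap_before_override` switch are hypothetical names, not part of ExecuteBinaryOp, and dtypes are modelled as plain strings.

```python
def select_loop(op, promoted, dtype=None, remap_before_override=False):
    """Return (loop name, loop dtype) for a binary add/multiply call."""
    bool_remap = {"add": "bitwise_or", "multiply": "bitwise_and"}
    final = dtype if dtype is not None else promoted
    # Old behaviour keyed the remap off the promoted dtype;
    # the fixed behaviour keys it off the final loop dtype.
    key = promoted if remap_before_override else final
    if key == "bool" and op in bool_remap:
        return bool_remap[op], final
    return op, final


old = select_loop("add", "bool", dtype="int32", remap_before_override=True)
new = select_loop("add", "bool", dtype="int32")
plain = select_loop("add", "bool")

# What each selected loop computes for add(True, True):
or_result = int(True) | int(True)   # 1, the pre-fix answer
add_result = int(True) + int(True)  # 2, the NumPy answer
```

Without a dtype= override the remap still applies, so plain bool + bool keeps its logical-OR meaning.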

… NumPy + standalone decomposition New matched-id benchmark pair (npyiter_core_bench.{cs,py}) measuring the ITERATOR, not the kernels: construction across 13 flag configurations, traversal/orchestration across chunk profiles (w=4..1024 strided rows, transposed, row/col broadcast), buffered cast/mixed windows, per-element protocol (+C_INDEX/+MULTI_INDEX), full- reduce, and small-N pipeline scaling N=8..2M. All kernels trivial and matched to NumPy's loop families (memcpy / scalar-strided / V256-contig) so iterator cost dominates. Both scripts correctness-check before timing and refuse Debug JIT. Headline results (i9-13900K, NumPy 2.4.2, Release — full tables in benchmark/poc/NPYITER_CORE_BENCH_RESULTS.md): - Construction beats np.nditer 1.4-3.7x in every multi-operand config (3-op 308 vs 622 ns; ufunc config 343 vs 1000 ns; 8-op 385 vs 1140 ns). - Full-iterator vs full-iterator (strided-row ADD through NumPy's real ufunc nditer): parity at w=4 (10.1 vs 10.3 ns/chunk), 2x faster at w=1024. The 4 ns/chunk machine in np.copyto is NumPy's stripped raw-copy walker, NOT nditer — and production NpyIter.Copy matches it exactly (T2c). - Buffered mixed add 0.86x, broadcast adds 0.74-0.81x, transposed copy 0.84x, reductions 0.64-0.85x, contig copy at DRAM roofline parity. - Raw iterator pipeline tracks NumPy's whole ufunc dispatch within +-8% at every N; production np.add(out=) carries ~200 ns of glue above it (648 vs ~440 ns). - REUSED iterator (Reset+ForEach) at N=512: 54.7 ns/call — 7x under NumPy's e2e floor; the structural small-N lever NumPy cannot reach from Python. Bugs/gaps surfaced (quantified in the results doc): 1. GROWINNER is hollow — NpyIter.cs:751 sets the bit, ComputeTransferSize never reads it: same-dtype BUFFERED iterators chunk into 512 needless 8192-elem windows, +8.4% on 4M f64 add (G-series). One condition fixes it. 2. 
Broadcast construction pays +388 ns over same-shape (C5 696 vs C3 308 ns): rank-mismatched operands miss the same-shape fast path and allocate through np.broadcast_to; NumPy's delta is +97 ns. 3. Chunked-callback overhead decomposed: delegate ~1.3 ns/chunk (ForEach vs ExecuteGeneric), element-stride imul per op/axis/step in ExternalLoopNext (NumPy stores byte strides), mask-resolution branch — 7 vs 4 ns/chunk against raw-walker workloads only; already parity against NumPy's real nditer. 'How NpyIter becomes the best' (prioritized, measured headroom): iterator reuse/state pooling (54.7 ns floor proven), cut the 200 ns production glue, implement GROWINNER, broadcast-ctor fast path, byte-stride axisdata + specialized iternext, tiny-chunk whole-array lowering, parallel ForEach. Roadmap doc links the new bench in its verification harness section.
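The window count in the GROWINNER finding is plain arithmetic, sketched below. The model assumes a single contiguous same-dtype operand, reads "4M" as 4 x 1024 x 1024 elements, and treats GROWINNER as "the inner loop may span the whole extent when no buffering is needed"; `window_count` is a hypothetical helper, not ComputeTransferSize.

```python
import math


def window_count(n_elements, buffer_size=8192, needs_buffering=True, growinner=False):
    """Number of inner-loop windows a buffered iterator hands to the kernel."""
    if growinner and not needs_buffering:
        # No cast or copy is required, so the inner loop can grow to the full extent.
        return 1
    return math.ceil(n_elements / buffer_size)


n = 4 * 1024 * 1024
hollow = window_count(n, needs_buffering=False, growinner=False)   # flag ignored
honoured = window_count(n, needs_buffering=False, growinner=True)  # flag honoured
```

Under these assumptions the ignored flag costs 512 kernel invocations where one would do.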

charts_npyiter_core.py renders the four presentation charts from the 2026-06-12 npyiter_core_bench run (data pinned inline from NPYITER_CORE_BENCH_RESULTS.md): construction grouped bars, per-chunk dispatch overhead (copy vs raw walker + add vs real ufunc nditer), small-N log-log scaling with the production-glue and reuse-floor markers, and the traversal ratio scoreboard with the hollow-GROWINNER inset. Output: %TEMP%/npyiter_charts/*.png

… territory; P0 crash + 3 losses + 4 unexpected wins npyiter_frontier_bench.{cs,py} (matched ids) extends the iterator-core bench into every suspected weak spot: axis reductions through op_axes+REDUCE (Wave-5 territory), ALLOCATE outputs, where= masks at degenerate run lengths (all-true / alternating run=1 / blocky run=64), strided buffered casts, forced-order outputs, 0-d scalar calls, tiny-chunk production copyto, 8-op single-pass fusion, and the kernel-bound dtype frontier (complex128/float16/int8) as labeled context rows. NEW BUGS / LOSSES (full tables + analysis in NPYITER_CORE_BENCH_RESULTS.md): 1. P0 CRASH: ForEach on a BUFFERED+REDUCE iterator AccessViolates — GetIterNext() has no BUFFER+REDUCE branch, falls to ExternalLoopNext which walks BUFFER pointers with SOURCE strides while GetInnerLoopSizePtr hands BufIterEnd as the kernel count. Only BufferedReduce<TKernel>/Iternext() drives this config safely. Repro pinned (R3, skipped with comment). 2. Iterator ALLOCATE outputs are np.zeros'd (NpyIter.cs:277) where NumPy allocates EMPTY: +2.33 ms per 4M f64 call (32 MB memset). Still beats NumPy allocating (0.78) only because their page-fault tax is worse than our pooled memset; np.empty for WRITEONLY ALLOCATE => ~2x ahead. 3. Blocky where= (run=64) regresses BELOW our unmasked baseline (4.10 vs 2.80 ms) while NumPy GAINS from the same mask (3.19 vs 3.54): per-run delegate/scan overhead eats the saved work. All-true and run=1 both WIN (0.79 / 0.72 — NumPy degrades worse at run=1). 4. Windowed buffered cast on strided source 1.52x behind one-pass copyto; production np.copyto already one-pass (1.08). 5. 0-d scalar calls 1.64x/2.41x behind (469/811 ns vs 286/337) — N=1 glue. 6. Axis-0 op_axes reduction 1.20x behind add.reduce; axis-1 wins 2x. 7. (kernel-bound) float16 1.34x behind, complex128 1.10-1.15x behind. UNEXPECTED WINS: - np.add ALLOCATING f64 4M: 3.83 vs 7.5-9.8 ms — 2-2.6x faster (Wave-2.4 pool vs NumPy's fresh-page faults). 
- F-order-out elementwise 1.5x faster (X1 0.67 / X1p 0.65).
- 8-op ONE-PASS sum of 7 arrays 1.9x faster than NumPy's best possible composition (Y1 7.85 vs 14.59 ms): the multi-operand fusion dividend.
- int8 add 4M: 173 us vs 1.20 ms, 7x faster (NumPy 2.4.2 i8 loop slow).
- axis-1 reductions 2-2.8x faster (R2/R0b); reversed copy 0.94; production copyto at w=4 parity with NumPy's raw walker (P4).

…ny 14.5x losses found; parallel banding 4.7x win proven npyiter_frontier2_bench.{cs,py} (matched ids): overlap/alias per-call taxes (exact-alias V1, forced-copy V2), comparison->bool (D1), early-exit boolean reduces (E1/E2), reduce over a broadcast view (F1), mixed-dtype/scalar/empty small-N (M1/O3/O4), 8-D construction (C14), and a hand-rolled 8-band parallel iteration (PAR series — one iterator per disjoint row band via Parallel.For, the Wave-6.2 dividend made concrete on f64 sin). HEADLINE LOSSES (root causes probed and pinned in the results doc): 1. F1: np.sum over broadcast_to(8K -> (1024,8192)) = 61.9 ms vs NumPy 1.14 ms (54x). NOT materialization: probe shows bc.copy()=11.3ms + dense sum=2.6ms, so even naive materialize-then-sum would be 4.5x faster — the reduction falls to a coordinate-walking general path at 7.4 ns/elem on IsBroadcasted inputs. 2. E1: np.any(bool 10M) full scan = 1.86 ms (4.9 GB/s scalar) vs NumPy 128 us (14.5x) — while np.count_nonzero on the SAME array runs 0.16 ms (63.7 GB/s SIMD). Routing bug: the SIMD scan exists, np.any doesn't use it. Early-exit case E2 WINS 3.9x (350 ns vs 1.35 us). 3. D1: np.less(out=bool) f64 4M 1.41x behind (2.99 vs 2.12 ms) — bool packing. 4. O3: array+scalar small-N 1.73x behind (901 vs 520 ns) — the scalar wrap costs MORE than passing a second full array (H0 648 ns). WINS: - PAR8: 8 banded iterators on f64 sin = 2.47 ms vs 12.1 single / 11.7 NumPy ceiling — 4.9x scaling, 4.7x over NumPy (which never threads its iterator); production np.sin already at single-thread parity (12.1 vs 11.7). - V2: forced-copy overlap (write-ahead alias) 1.75x FASTER than NumPy (4.72 vs 8.26 ms) — Wave-1.1 machinery + Wave-2.4 pooled temp beat their fresh-alloc copy; V1 exact alias 0.88. - C14 8-D ctor 3x faster (321 vs 953 ns); O4 empty 2.8x; E2 early-exit 3.9x; M1 mixed small-N parity-win (888 vs 931 ns). Results doc gains the round-2 table, findings 13-16 with probe decomposition, and reproduce lines.
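The PAR series structure (one iterator per disjoint row band) can be sketched as follows. This shows only the partitioning and the per-band dispatch; `row_bands` and `banded_map` are hypothetical names, a thread pool stands in for Parallel.For, and Python threads will not reproduce the measured scaling.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def row_bands(n_rows, n_bands):
    """Split range(n_rows) into n_bands contiguous, disjoint (start, stop) bands."""
    base, extra = divmod(n_rows, n_bands)
    bands, start = [], 0
    for i in range(n_bands):
        stop = start + base + (1 if i < extra else 0)
        bands.append((start, stop))
        start = stop
    return bands


def banded_map(fn, rows, n_bands=8):
    """Apply fn to every row, one worker per band, preserving row order."""
    def work(band):
        start, stop = band
        return [fn(r) for r in rows[start:stop]]

    with ThreadPoolExecutor(max_workers=n_bands) as pool:
        parts = list(pool.map(work, row_bands(len(rows), n_bands)))
    return [y for part in parts for y in part]


data = [i * 0.01 for i in range(1000)]
parallel = banded_map(math.sin, data)
```

Because the bands are disjoint and cover every row exactly once, the workers never write to the same output slot and need no locking.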

…rom source and benchmarked across argument matrices The consumer map is grounded in src/numpy (grep NpyIter_{New,MultiNew,AdvancedNew}, enclosing functions resolved): execute_ufunc_loop, PyUFunc_{Accumulate,Reduceat, ReduceWrapper}, ufunc_at, array_{boolean_subscript,assign_boolean_subscript}, PyArray_{MapIterNew,CountNonzero,Nonzero,CopyAsFlat,Where}, arr_{ravel_multi_index, unravel_index}, einsum, nditer_pywrap(+nested_iters), busday/datetime/string/void consumers. npyiter_consumers_bench.{cs,py} exercises every benchable consumer through its np.* surface with the perf-relevant argument matrix (dtype=/out-cast/ promoting unary; reduce axis/keepdims/dtype/3-D middle axis/amin; cumsum axes; where same/scalar/broadcast; boolean read/assign; count_nonzero/argwhere; fancy gather/scatter; ravel transposed/F-order/flatten/astype; unravel/ravel_multi_index) and times the consumers NumSharp lacks NumPy-only as implementation targets. Score: 20 wins, 4 losses, 1 parity, 8 feature gaps. NEW LOSSES: 1. RD3 np.sum(f32, dtype=f64) 1.97x — composes astype-materialize (2.3ms) + sum (0.8ms) = measured 3.23ms instead of casting on load inside the reduce loop (NumPy buffered-REDUCE does); Wave-5 territory. 2. RD5 np.amin(axis=1) 1.54x — min/max axis kernels lag sum (which wins 2.8x on the same shape). 3. FX2 fancy scatter 1.49x (gather wins 0.76 — write-side path). 4. AC2 cumsum axis=0 1.36x — BOTH sides ~20ns/elem scalar column-walks (95 vs 70ms); vertical-SIMD accumulate would be ~4-5ms => 15-20x leapfrog open. WINS (selection family is a rout): argwhere 4.9x, flatten 3x, cumsum axis=1 2.9x, sum full 2.8x, boolean read 2.6x, ravel F-order 2.2x, where broadcast 2.1x, where scalar 2x, boolean assign 1.9x, ravel(A.T) 1.9x, sqrt(i32) promoting 1.5x, add dtype=f32 1.8x, astype 1.8x, 3-D middle-axis sum 1.5x, ravel_multi_index 1.4x, count_nonzero 1.4x, fancy gather 1.3x, unravel_index parity. 
FEATURE GAPS with NumPy targets: reduce axis-tuple (2.08ms) / where= (9.27ms) / initial= (2.03ms), einsum (2.30/1.42ms — canonical multi-op NpyIter consumer, NpyExpr machinery fits), np.add.at (6.86ms, soft target), reduceat (1.24ms), nested_iters, public np.nonzero alias. Results doc gains the source-grounded consumer map, round-3 table, findings 17-21 (incl. 'do NOT migrate the won selection family onto per-chunk callbacks without keeping Direct kernels as the fast path').

…(rounds 1-3) Adds benchmark/poc/npyiter_bench_summary.py - a self-contained renderer that aggregates every like-for-like NumSharp-vs-NumPy pair from the three NpyIter benchmark rounds (npyiter_core_bench, npyiter_frontier_bench, npyiter_frontier2_bench, npyiter_consumers_bench; numbers as recorded in NPYITER_CORE_BENCH_RESULTS.md, i9-13900K / NumPy 2.4.2 / Release) into a terminal bar chart: geomean of NumSharp_time/NumPy_time per group, eighth-block bars scaled 10.2 chars per 1.0x with parity at ~10 chars, win/lose row counts, and FASTER/PARITY/SLOWER annotations. Groupings: - size tier: <=4K (0.71x, 17W/7L), 32K-8M (0.82x, 50W/24L), 10M (1.25x, 3W/1L - dragged solely by E1 np.any routing; T1 10M traversal is exact parity) - family: construction 0.51x, small-N pipeline 1.14x, chunk traversal 0.95x, layout copies 0.58x, elementwise/bcast 0.77x, buffered cast 1.12x, where=/masks 0.67x, reductions/scans 1.09x (carries F1 54x + RD3 1.97x), selection/indexing 0.63x, kernel-bound dtypes 0.70x - overall: 0.81x geomean, 70 win / 32 lose; 0.75x excluding the three root-caused outliers (F1 broadcast-view sum 54x, E1 np.any 14.5x, AC2 1.4x) - architecture dividends rendered separately (no NumPy equivalent machinery): iterator reuse 7.0x, 8-banded parallel iterators 4.7x, one-pass 7-operand fusion 1.9x Excluded by design to keep the geomean honest: T7a/b/c (Python nditer protocol overhead is interpreter context, not iterator cost), NS-internal-only probe rows (G2/G3, T2g, T5n), and duplicate-comparator variants (T5i, T2x).

…wer<-1.0->faster) Rework npyiter_bench_summary.py to the house per-size geomean layout used in the official benchmark-report summary: ratio shown as SPEEDUP = NumPy_time / NumSharp_time (>1.0 = NumSharp faster), bar scaled 10 chars per 1.0x so the parity tick lands mid-field at 20-char width, dotted padding, and the verbatim 'slower <----- 1.0 (parity) -----> faster' header. Bars now grow toward the 'faster' end - the previous version drew bar length proportional to NumSharp/NumPy time so faster groups got SHORTER bars on a flipped axis; geometry, axis labels, and annotations are now mutually consistent. Layout per row: 7-char label, 20-char eighth-block bar (>= 2.0x overflows to a trailing arrowhead), speedup, (N win / M lose), and only out-of-the-ordinary verdicts annotated (PARITY within 5%, SLOWER below). Rendered verdict over the same 89 pairs: tiers 1K 1.40x / 4M 1.22x / 10M 0.80x (E1 np.any routing drags 4 rows; T1 10M memcpy is exact parity) / ALL 1.24x geomean 70W-32L; families ctor 1.95x, layout 1.74x, select 1.60x, where= 1.49x, dtypes 1.42x, elemwise 1.30x, traversal 1.05x parity, reduce 0.92x, buffered cast 0.89x, small-N 0.88x; dividends reuse 7.0x / parallel 4.7x / fusion 1.9x rendered with capped overflow bars.
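The bar geometry described above (speedup = NumPy_time / NumSharp_time, 10 chars per 1.0x, 20-char field, overflow at 2.0x) can be sketched like this. It is a reconstruction from the description, not the code in npyiter_bench_summary.py; the glyph choices and the `speedup_bar` name are assumptions.

```python
# Partial-cell glyphs for 1/8 .. 7/8 of a character (index 0 is unused).
EIGHTHS = " ▏▎▍▌▋▊▉"


def speedup_bar(numpy_time, numsharp_time, width=20, chars_per_x=10):
    """Return (bar, speedup); bars grow toward the 'faster' end."""
    speedup = numpy_time / numsharp_time
    eighths = round(speedup * chars_per_x * 8)
    if eighths >= width * 8:
        # 2.0x and above fill the field and end in an arrowhead.
        return "█" * (width - 1) + "▶", speedup
    full, rem = divmod(eighths, 8)
    bar = "█" * full + (EIGHTHS[rem] if rem else "")
    return bar.ljust(width, "·"), speedup


parity_bar, parity = speedup_bar(1.0, 1.0)
slow_bar, slow = speedup_bar(1.0, 2.0)
fast_bar, fast = speedup_bar(3.0, 1.0)
```

At parity the bar fills exactly half the field, so the parity tick sits mid-field and longer bars always mean NumSharp is faster.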

…port tiers, for the iterator core Adds a parameterized size-sweep that runs the SAME six NpyIter operation families at each of four element-count tiers (scalar=1, 1K, 100K, 1M), so the iterator's NumSharp-vs-NumPy story can be read per cache tier the way the official benchmark report presents whole-op throughput. Prior NpyIter rounds used per-aspect ad-hoc sizes; this fixes a clean 6x4 matrix on both sides with identical ids. Files: - npyiter_sizesweep_bench.cs — NumSharp side. Six matched-kernel families: add contiguous binary V256 copy contiguous copy (memcpy chunk) sqrt contiguous unary V256 Sqrt sadd strided a[::2]+b[::2] sum 4-acc V256 reduction bcast stride-0 a+b1(1) Same Release-only Debug-JIT guard, best-of-rounds timing, and per-size iter counts as npyiter_core_bench.cs. All 24 correctness checks pass. - npyiter_sizesweep_bench.py — NumPy side, identical ids. copy uses np.positive (a REAL ufunc nditer) not np.copyto (a stripped raw-array walker), so the copy row is an honest iterator-vs-iterator comparison. - npyiter_sizesweep_chart.py — renders the speedup bar chart (NumPy/NumSharp, >1.0 = NumSharp faster) in the official-report axis style, grouped by size tier and by operation, plus a per-cell matrix. Self-contained: the settled clean run is embedded as (NumSharp_ms, NumPy_ms) per id; pass two recorded output files as argv to re-chart a fresh run (the bench .txt outputs are gitignored artifacts). The first C# run after a build is noise-tainted (machine not quiesced — sqrt@100K read 248us vs the true 57us); embedded numbers are the settled re-runs. Result (geomean NumPy/NumSharp): scalar 2.29x, 1K 2.08x, 100K 1.33x, 1M 1.19x; ALL 1.66x (20 win / 4 lose). The shape is the MIRROR of the full-API report: NpyIter is strongest at small N (construction beats np.nditer + ufunc dispatch setup) and converges to parity at 1M where the kernel saturates memory bandwidth and the iterator contributes ~nothing. 
The four sub-1.0 cells (add@1M 0.88x, sqrt@1M 0.98x, sqrt@100K 0.99x, sadd@100K 0.99x) are all parity within run-to-run bandwidth variance (add@1M measured 380-453us across runs vs NumPy 398us). Standout: sum is 2.35-10.11x faster at every tier — NumPy's reduce carries a ~1.5us fixed setup (sum@1: NS 151ns vs NumPy 1.53us) and a slower large-N pairwise pass (sum@1M: NS 89us vs NumPy 209us).
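Best-of-rounds timing, which the sweep relies on to shed noise-tainted runs, can be sketched as below. The benches' actual timing loop is not shown in this log, so the structure and the `best_of_rounds` name are assumptions.

```python
import time


def best_of_rounds(fn, iters, rounds):
    """Return the lowest mean seconds-per-call seen across several rounds."""
    best = float("inf")
    for _ in range(rounds):
        start = time.perf_counter()
        for _ in range(iters):
            fn()
        per_call = (time.perf_counter() - start) / iters
        best = min(best, per_call)
    return best


# Illustrative workload only; the real benches time matched kernels.
seconds = best_of_rounds(lambda: sum(range(1000)), iters=200, rounds=3)
```

Taking the minimum rather than the mean keeps a single disturbed round from inflating the reported time.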

…lar/1K/100K/1M The minimal size-sweep covered only 6 families. This sweeps EVERY distinct NpyIter operation family accumulated across rounds 1-3 at all four tiers. The earlier rounds used SIZE as the id axis (H8..H2M, T2.4..T2.1024, O1..O4/M1 are one op at many sizes); collapsing those, the distinct families are 33 + 3 dividends, each now run at scalar/1K/100K/1M = 143 measured pairs per side. Families: elementwise (add sqrt copy strided bcast reversed castbuf mixbuf), reductions (sum sum-ax0 sum-ax1 sum-dt= amin cumsum any-allfalse any-earlyhit), selection (where a[mask] a[mask]= count_nonzero argwhere a[idx] a[idx]=), copy/cast (flatten astype ravel.T in-place less->bool), index-math (unravel_index ravel_multi_index), dtypes (complex128 float16 int8), and dividends (fuse7 reuse par8 — NumPy has no equivalent machinery). Files: npyiter_fullsweep_bench.{cs,py} (identical ids; raw-iterator matched kernels for the elementwise-isolation rows + production np.* for the rest, each mapped to its NumPy equivalent), npyiter_fullsweep_chart.py (self-contained: embeds the clean run as (NumSharp_ms, NumPy_ms); renders per-tier, per-category, category x tier, per-family x tier, and the dividends; pass two run files as argv to re-chart). i9-13900K, NumPy 2.4.2, Release. Headline (geomean NumPy/NumSharp, >1.0 = NumSharp faster): ALL 1.32x, 81 win / 51 lose over 132 main cells; tiers scalar 1.47x / 1K 1.46x / 100K 1.13x / 1M 1.27x. By category: reductions 2.14x, elementwise 1.41x, selection 1.31x, dtypes 1.16x, but copy/cast 0.76x and index-math 0.75x lag (small-N per-call copy overhead, crossing to wins at 1M). Dividends: fuse7 up to 17x vs chained adds, reuse 5x at small-N, par8 2.4x at 1M. 
Findings surfaced/confirmed across the size axis (presented to user):
- INTERMITTENT SEGFAULT (~50% of runs): uncatchable AccessViolation under the heavy mixed alloc/free load, with a varying crash point (seen at gather@1K / argw region). Suspected cause: heap corruption or a GC/finalizer race on unmanaged NDArray storage.
- np.any(all-false): 24x faster at scalar but 12.5x SLOWER at 1M (0.08x): scalar scan, no SIMD; the early-exit case hides it (confirms the routing bug).
- np.less(out=bool): consistently 1.5-2.7x slower at every size.
- fancy a[idx] gather 1.4-3.4x slower, a[idx]= scatter 1.2-1.7x slower at all sizes; amin axis-reduce 2.4x slower at 100K+; float16/complex128 ~1.3-1.7x slower (documented scalar paths).
- int8 add VERIFIED correct and ~7x faster (sweep's 12x inflated by a noisy NumPy reading); reductions/castbuf/count_nonzero are the largest honest wins.

… harness, one results sheet Promotes the exploratory poc/npyiter_* rounds into a single MAINTAINED NumSharp-vs-NumPy benchmark under benchmark/npyiter/. Every distinct NpyIter aspect the poc rounds surfaced now lives in one place, swept across cache tiers, rendered into ONE sheet (npyiter_results.md): 162 measured pairs. Pieces: - npyiter_bench.cs / .py — identical-id NumSharp + NumPy benches, SECTION- ADDRESSABLE via the NPYITER_SECTION env var. Ten sections: operations x size : elementwise/reductions/selection/copycast/indexmath/ dtypes/dividends — 33 families x {scalar,1K,100K,1M} construction : 9 iterator flag configs vs np.nditer build chunkwidth : per-chunk dispatch overhead across inner widths 4..1024 pathology : the regression canaries (bcast-reduce, allocate, overlap-copy, F-order-out, 0-d) Iterator-isolation rows drive NpyIterRef directly with trivial NumPy-loop- matched kernels; production rows call np.* both sides; copy compares to np.positive (a real ufunc nditer), never np.copyto. - npyiter_sheet.py — orchestrator + renderer. Runs each section in its own short-lived process (crash isolation) and renders the per-tier/per-category/ per-family operation matrix + construction + chunk-width + pathology + dividends sheet. Resilience baked in after the monolithic poc fullsweep AV'd ~50% of runs: * each NumSharp section retries up to 4x on a crash; * DOTNET_DbgEnableMiniDump=0 so an AV returns a non-zero exit IMMEDIATELY instead of stalling the process while a crash dump is written (the silent hang that voided the first full run — we never taskkill dotnet); * per-subprocess timeout backstop; * tsv is written incrementally after EVERY section and --resume skips already-collected sections, so a mid-sweep death never loses progress. Flags: --skip-build, --render-only, --resume, --sections. - README.md — run instructions, methodology guardrails (Release-only, matched kernels, positive-not-copyto), section table, and the findings ledger. 
- npyiter_results.{md,tsv} — the rendered sheet + raw (id, ns_ms, np_ms) pairs from the 2026-06-13 run (i9-13900K, NumPy 2.4.2, Release). Headline (speedup = NumPy/NumSharp, >1.0 = NumSharp faster): operation matrix 1.24x geomean, 80 win / 52 lose over 132 cells; tiers scalar 1.20x / 1K 1.32x / 100K 1.14x / 1M 1.32x. Categories: reductions 2.03x, elementwise 1.48x, selection 1.32x, dtypes 1.03x; copy/cast 0.58x and index-math 0.63x lag at small N (per-call setup) and cross to wins by 1M. Construction beats np.nditer 6.19x geomean (up to 12.5x on the 8-operand build). Chunk-width loses at w=4 (0.74x) and wins from w=64 up. Dividends NumPy structurally can't match: fuse7 4.6-15x, reuse up to 9x, par8 2.5-7x. Regression canaries / losses tracked by the sheet: bcast-reduce 51x slower, F-order-out 3.5x, allocate 2x; np.any full-scan 0.07x at 1M (scalar scan vs SIMD); comparison->bool 0.5x; fancy a[idx] gather/scatter 0.5-0.75x; amin axis-reduce 0.4x at scale. int8 verified ~7-11x faster (correct). These are the fix-list, ordered in README's findings ledger.
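The orchestrator's resilience rules (one process per section, bounded retries, timeout backstop, incremental tsv, resume) can be sketched as follows. This is not npyiter_sheet.py: `run_section` and `sweep` are hypothetical names, "up to 4x" is read as four attempts, and the two-column row format is a simplification of the real (id, ns_ms, np_ms) rows.

```python
import csv
import os
import subprocess
import sys


def run_section(cmd, section, retries=4, timeout=600):
    """Run one section in its own process; return stdout, or None if every attempt fails."""
    env = dict(os.environ, NPYITER_SECTION=section, DOTNET_DbgEnableMiniDump="0")
    for _ in range(retries):
        try:
            proc = subprocess.run(cmd, env=env, capture_output=True,
                                  text=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            continue  # timeout backstop: treat a hang like a crash
        if proc.returncode == 0:
            return proc.stdout
    return None


def sweep(cmd, sections, tsv_path, resume=False):
    """Append one row per finished section; with resume, skip sections already on disk."""
    done = set()
    if resume and os.path.exists(tsv_path):
        with open(tsv_path, newline="") as f:
            done = {row[0] for row in csv.reader(f, delimiter="\t") if row}
    for section in sections:
        if section in done:
            continue
        out = run_section(cmd, section)
        if out is None:
            continue  # left out of the tsv, so a later resume tries it again
        with open(tsv_path, "a", newline="") as f:
            csv.writer(f, delimiter="\t", lineterminator="\n").writerow([section, out.strip()])
```

Writing after every section is what makes resume cheap: the file on disk is always a valid prefix of the full sweep.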

… (commit-to-master) Wires the canonical NpyIter benchmark into a semi-manual GitHub Action that runs AFTER a release and publishes results straight to master (the chosen target, not the wiki — GITHUB_TOKEN can push to the repo but not to a .wiki.git, which would need a PAT). - .github/workflows/npyiter-benchmark.yml — SEPARATE workflow (never gates the release). Triggers: workflow_run on 'Build and Release' completion+success, plus workflow_dispatch (the manual knob). Sets up .NET 8+10 (preview) and Python 3.12, pins numpy==2.4.2, builds Release, runs npyiter_sheet.py --skip-build, renders the cards, and commits npyiter_results.{md,tsv} + cards/*.png to master with '[skip ci]' (so the push can't re-trigger the release workflow). permissions: contents:write — no PAT needed. - benchmark/npyiter/npyiter_cards.py — renders two 400x300 PNG cards from the tsv (matplotlib, figsize=(4,3) dpi=100): ops.png (speedup by size tier) and cat.png (speedup by op class). Ratio-only by design — absolute ms vary by runner hardware, but the same-runner NumPy/NumSharp ratio stays meaningful. - README.md — new 'Performance vs NumPy' section embedding both cards (raw URLs) linked to the full report, with the explicit 'same-machine ratio' caveat. - npyiter_sheet.py — portability fix: run_ns() rewrites the .cs's absolute '#:project K:/source/NumSharp/...' line to THIS checkout's csproj path, so the same bench (authored to run directly via 'dotnet run - < file' on Windows) runs unchanged on a Linux CI runner. - Refreshed npyiter_results.{md,tsv} + cards from a clean 162-pair run (headline 1.19x, 76 win / 56 lose). That run also exercised the resilience for real: the selection section took an 0xC0000005 AV on attempt 1 and the orchestrator's retry recovered it automatically — exactly the CI-safety the section isolation + DbgEnableMiniDump=0 + retry was built for. 
Caveats documented in the workflow header: shared-runner variance (ratios not absolutes), and direct-to-master push assumes master is CI-writable (no branch protection blocking github-actions[bot]); switch to a PR step if that changes.
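The portability fix amounts to rewriting one directive line before the bench is fed to dotnet. A minimal sketch, assuming the directive sits alone on its own line; `rewrite_project_line` is a hypothetical name and both paths in the usage are placeholders, not the repository's real paths.

```python
import re


def rewrite_project_line(source_text, csproj_path):
    """Point the '#:project' directive at csproj_path, leaving all other lines untouched."""
    # A callable replacement avoids backslash-escape handling in Windows-style paths.
    return re.sub(r"(?m)^#:project[ \t]+.*$",
                  lambda _m: f"#:project {csproj_path}",
                  source_text)


original = "#:project C:/old/checkout/Example.csproj\nConsole.WriteLine(1);\n"
rewritten = rewrite_project_line(original, "/home/runner/work/repo/Example.csproj")
```

Rewriting in memory keeps the tracked .cs file unchanged, so the same source still runs directly on the authoring machine.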

…iter The six exploratory POC rounds (npyiter_core/frontier/frontier2/consumers/ sizesweep/fullsweep benches + bench_summary + charts + NPYITER_CORE_BENCH_RESULTS.md) were superseded by the canonical benchmark/npyiter/ — every aspect they surfaced now lives there, swept across cache tiers into one sheet. Removed: 17 tracked files + 4 gitignored .txt dumps. KEPT benchmark/poc/npyiter_parity_poc.{cs,py} — it is NOT part of the exploratory rounds: it holds the hand-written AVX2-gather reference kernels (PocKernels.AddF32/SqrtF32/SumF32) that docs/NPYITER_PERF_HANDOVER.md points to as the blueprint to transcribe for the IL-emission work, and benchmark/CLAUDE.md cites it as the Debug-JIT guard example. Reference fixes (no dangling links left): - docs/NPYITER_GAPS_AND_ROADMAP.md §6: the iterator-core reproduce block now points to benchmark/npyiter/npyiter_sheet.py instead of the deleted bench. - benchmark/npyiter/README.md: dropped the 'supersedes poc/npyiter_*' wording (those files are gone) for a self-contained description. Finalize: added benchmark/npyiter/.gitignore for transient run artifacts (run.log, __pycache__) so only the durable outputs — npyiter_results.{md,tsv} and cards/*.png — are tracked.

… tier; AV→NA; one CI Folds the NpyIter benchmark into the official orchestrator so there is ONE entry point and ONE report, while keeping the two harnesses distinct (they measure different things — op/dtype/N throughput vs the iterator machinery — and the NpyIter harness needs internal access + section-isolation the BenchmarkDotNet in-process run can't give). run_benchmark.py — after the official (op,dtype,N) merge, runs the NpyIter sheet + cards and APPENDS the sheet to benchmark-report.md as its own section (not merged — different result model). Archives npyiter_results.{md,tsv} + cards into results/<ts>/. New --skip-npyiter flag. This is now the single command for the whole NumSharp-vs-NumPy comparison. +10M tier (decision 1): npyiter_bench.{cs,py} sweep now scalar/1K/100K/1M/10M (grid 2500x4000 = 10M exactly; pick 30 iters/3 rounds at 10M). sheet TIERS + cards pick it up automatically. AV → NA/IGNORED (decision 3): instead of silently omitting a section that crashes all retries, the sheet now records its ids NA (NumPy runs first to give the expected id set), prints an AV-POLICY header explaining the known intermittent AccessViolation is ignored, lists 'THIS RUN: NA across <sections>', shows NA cells in the per-family/dividends matrices, and excludes NA from every geomean. tsv stores NA; load/cards skip it. CI consolidation (decision 2): npyiter-benchmark.yml -> benchmark.yml, now runs the ENTIRE suite via run_benchmark.py. Trigger changed from workflow_run-on- every-build to release:published (the real 'after a successful release' signal — 'Build and Release' publishes a GitHub Release on a v* tag) + workflow_dispatch, so the heavy full suite runs per-release, not per-push. Commits report + cards to master with [skip ci]. timeout-minutes: 180. The npyiter_parity_poc gather kernels and the rest of the harness methodology (Release-only, matched kernels, positive-not-copyto, section isolation) are unchanged.
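The AV → NA rule for geomeans can be sketched as below: NA cells are listed but never enter the mean. The row shape follows the (id, ns_ms, np_ms) tsv described earlier; `geomean_speedup` is a hypothetical name and the sample rows are made up.

```python
import math


def geomean_speedup(rows):
    """Return (geomean of NumPy/NumSharp speedups, ids reported NA); NA rows are excluded."""
    speedups, na_ids = [], []
    for row_id, ns_ms, np_ms in rows:
        if ns_ms == "NA" or np_ms == "NA":
            na_ids.append(row_id)
            continue
        speedups.append(float(np_ms) / float(ns_ms))
    if not speedups:
        return None, na_ids
    geomean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
    return geomean, na_ids


rows = [("add@1K", "1.0", "2.0"), ("where@1K", "NA", "3.0"), ("copy@1K", "4.0", "2.0")]
geomean, na_ids = geomean_speedup(rows)
```

Returning the NA ids alongside the mean lets the sheet print which cells were ignored instead of silently shrinking the sample.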

…on selection

Refreshes the canonical NpyIter results (npyiter_results.md/.tsv) and the two README cards with a full sweep that now includes the 10M cache tier, and records the AV->NA policy firing on a real run. Also documents the run_benchmark.py integration in benchmark/CLAUDE.md.

What changed
------------
* 198 measured pairs (was 162), 35 of them NA. The new 10M tier adds 36 ids across the size-swept families; SIZES is now scalar/1K/100K/1M/10M end to end (bench .cs + .py grids: 10M = 2500x4000).
* selection (where / a[mask] / a[mask]= / count_nz / argwhere / a[idx] / a[idx]=) hit NumSharp's known intermittent AccessViolation on EVERY retry this run, so the whole section is reported NA/IGNORED per policy and excluded from every geomean. The header now reads "198 measured pairs (35 NA)" and "AV POLICY ... THIS RUN: NA across selection."; the section renders as "(no data)" / "-" / "NA" cells instead of crashing the sweep. This is the designed crash-resilience path proven on a live run, not a regression.
* Headline operation matrix: 1.17x geomean, 77 win / 53 lose over 130 cells (26 non-selection families x 5 tiers). Reductions lead (1.80x), dtypes 1.59x, elementwise 1.12x; copy/cast (0.65x) and index-math (0.70x) remain the small-N laggards already tracked as canaries.

Doc
---
benchmark/CLAUDE.md run_benchmark.py section now describes the appended NpyIter step (aspect x tier, appended-not-merged, section-isolated, AV->NA, --skip-npyiter) and points at benchmark/npyiter/README.md, so the dev guide matches the wired-in integration (run_benchmark.py + benchmark.yml).

Known bug surfaced (tracked, not fixed here)
--------------------------------------------
The selection-section AccessViolation (0xC0000005) is an unmanaged-storage lifetime bug in NumSharp under heavy mixed alloc/free load. It is intermittent (~50% per heavy section) and uncatchable; the benchmark now degrades to NA rather than masking it. Worth a dedicated issue + fix pass.

…ted report artifacts Adds docs/website-src/docs/benchmarks.md — the DocFX page the user asked for: "the real place where we discuss and present the efforts to surpass NumPy through the power of Runtime IL Generation." It is the evidence companion to the existing IL Generation page (il-generation.md explains HOW the kernels are emitted; this page shows WHAT that buys head-to-head against NumPy). The page is driven by the artifacts the Benchmark workflow (benchmark.yml) auto-commits to master after every release: * The two 400x300 cards are embedded by absolute raw.githubusercontent master URLs (same source the README uses), so they always reflect the latest committed run rather than a pasted screenshot. Verified the docfx build keeps the URLs absolute (it does not relativize external links). * The full reports are linked on master: the iterator sheet (benchmark/npyiter/npyiter_results.md, which the cards render from) and the op/dtype/N matrix (benchmark/benchmark-report.md), plus the harness README and benchmark/CLAUDE.md. Content (grounded in the current committed npyiter_results.md numbers): * Headline cards + a by-class geomean table (reductions ~1.8x, dtypes ~1.6x, elementwise ~1.1x parity, copy/cast ~0.65x, index-math ~0.7x). * Class-by-class discussion tying each result to the IL mechanism (4x unrolling, tree reduction, SIMD early-exit, per-(op,dtype,layout) specialization), and honest about the taxes (small-N copy/cast, all-false any() scan, bcast_reduce). * The dividends NumPy can't structurally match: expression fusion (np.evaluate, up to ~13x), kernel reuse, parallel inner loop (par8 up to ~8x), cheaper iterator construction (~2-3x vs np.nditer). * Methodology + honesty section: Release-only JIT, best-of-rounds, ratios-not- absolutes, and the AV->NA policy. * Reproduce-locally commands. Wiring: * docs/toc.yml — new "Benchmarks vs NumPy" entry right after IL Generation. 
* il-generation.md — cross-link from the Performance Impact section ("naive C#" table vs the head-to-head-NumPy page). * index.md — added IL Generation + Benchmarks links to Get Started. Validated with `docfx build` (build-only, metadata skipped): 0 errors, the page itself emits 0 warnings (the 84 UidNotFound warnings are api/toc.yml entries that only resolve after the metadata step, which CI runs first). benchmarks.html renders, cards resolve to absolute URLs, internal links rewrite to .html. Note: deploy is via docs.yml on push to master (paths: docs/website-src/**); this branch commit does not deploy until merged. How the page REFERENCES the auto-committed cards (raw-master URL vs bundling copies into website-src/images/) is the next thing to settle.

…FX site Two follow-ups to the Benchmarks vs NumPy page, both from user direction. 1) The two 400x300 cards now carry the whole canonical summary (modeled on the ASCII sheet the user singled out), not just one bar chart each. Everything is still COMPUTED from npyiter_results.tsv, so the cards auto-update each run and NA (AccessViolation) ids are skipped. * cards/ops.png — OPERATIONS vs NumPy: headline (geomean / win-lose / cells) + by-array-size-tier bars (scalar..10M) + by-operation-class bars ranked best->worst (reductions 1.80x ... copy/cast 0.65x; wins green, the two small-N taxes red). * cards/cat.png — the IL-GENERATION DIVIDENDS, the "machinery NumPy has no equivalent for": iterator build vs np.nditer, expression fusion (np.evaluate), kernel reuse, parallel inner loop — each bar is the honest geomean with an "up to <peak>x" annotation — plus the chunk-width trend (w=4 -> w=1024) and the honest pathology canary (bcast_reduce ~52x behind, in red). npyiter_cards.py rewritten: shared hbars() helper, color_of() (green/amber- parity/red), stat() for (geomean, peak), two card builders. Imports CTOR/CW/ PATH/DIVIDENDS from the sheet so the section data stays single-sourced. Captions/alt-text updated to match the new card semantics (cat.png is no longer "by op class") in README.md and benchmarks.md. 2) Full reports are now rendered INTO the site as searchable pages (user choice: "Render into the site"), in addition to being linked on GitHub: * docs/website-src/docs/benchmark-matrix.md — the op/dtype/N matrix (benchmark-report.md body under a single page H1). * docs/website-src/docs/benchmark-iterator.md — the canonical iterator sheet (npyiter_results.md fenced block under a page H1). * toc.yml nests both under "Benchmarks vs NumPy"; benchmarks.md "Read the full reports" now links the on-site pages (raw files still linked on master). 
benchmark.yml regenerates these two pages from the just-produced reports (op matrix drops its own H1 via tail -n +2 so the page has one title; the iterator sheet has no H1), commits them alongside the report + cards, and — because the commit carries [skip ci] and the pages live under docs/website-src/** — then `gh workflow run docs.yml` to redeploy the site (added actions:write + GH_TOKEN). Validation ---------- * npyiter_cards.py renders both cards; verified visually (legible at 400x300). * benchmark.yml is valid YAML (yaml.safe_load). * docfx build (build-only): 0 errors; benchmark-matrix.html + benchmark-iterator.html generate; benchmarks.html internal links to both resolve; no warning names any new page (the 82 UidNotFound warnings are api/toc.yml, resolved by the metadata step CI runs first). No docs/website/ build-output committed. Still open (deferred by the user): the card REFERENCING mechanism on the docs page (raw-master URLs today vs bundling the PNGs into website-src/images/). The redeploy chaining added here would make that swap trivial if chosen later.

… 15 Best" The op/dtype/N matrix report (benchmark-report.md, rendered into the site as benchmark-matrix.md) showcased garbage: every "Top 15 Best" row was np.copy(float64) and np.searchsorted at "0.0 / 0.0x". Three distinct bugs, all fixed. BUG 1 — searchsorted benchmark measured nothing (both sides) SortingBenchmarks.cs and numpy_benchmark.py issued a SINGLE scalar lookup (np.searchsorted(sorted, N/2)) — one O(log N) binary search, ~18ns at EVERY N, pure call overhead. Against NumPy's ~1µs Python overhead that manufactured a meaningless 50–1000x "win". Fixed: both now query the N-element array (a) into the sorted target → N binary searches, real work that scales with N. (Verified the C# benchmark project still compiles.) BUG 2 — normalize_op_name collapsed a slice-copy onto np.copy The Slicing suite's "np.copy(a[100:1000])" (a fixed 900-element slice copy, ~3.6µs at every N) was normalized by stripping ALL "[...]" — including the array-index "[100:1000]" — yielding "np.copy", which COLLIDED with the Creation full-array "np.copy(a)" in csharp_index (last-write-wins) and overwrote the real float64 measurement. THAT was the bogus "copy float64 = 0.0036ms" (not a copy bug at all; the op is fine — archived raw float64 copy@10M = 11.04ms). Fixed: only strip a space-separated " [annotation]" (\s+\[ instead of \s*\[), never index brackets attached to an identifier. Incidentally also de-collides concatenate/stack/slice variants. copy(float64) now reads its real values across all sizes (10M → 11.04ms, ratio 0.60 = a genuine win). BUG 3 — the report ranked/averaged non-credible rows as wins merge-results.py sorted "Top Best" by ratio with only a `ratio is not None` guard, so a sub-resolution NumSharp time (ratio rounding to 0.0) sorted to #1, and CSV blanked legit 0.0 via `r.ratio or ''`. 
Fixed with a credibility gate (classify()): a row is "negligible" (new ▫ status) when either side did <1µs of work OR the speedup exceeds 20x (NumSharp >20x faster ⇒ artifact: a view, a lazy alloc, or a dead-code-eliminated kernel). Negligible rows are EXCLUDED from Top Best/Worst and from the per-size geomean, but still listed (▫) in the per-suite tables — nothing hidden. Also: store ms at 4 / ratio at 3 decimals, show 3-decimal ms + 2-decimal ratio in the showcase (no more "0.0/0.0x"), fix the `or ''` falsy-zero in CSV, add the ▫ legend row + summary/size-table counts, and a header note stating how many rows were excluded and why. Result (regenerated from the on-disk run archive with the fixed merge): * Top Best is now real reductions/statistics wins (np.nansum 0.08x, np.percentile 0.10x, np.average 0.10x) — genuine ms on both sides. * 1233 ops → 305 faster / 255 close / 169 slower / 103 much-slower / 275 NEGLIGIBLE (the artifacts, previously ~all counted as "faster") / 126 n/a. * Top Worst surfaces a real gap: np.zeros (NumSharp eagerly zeros ~10.7ms vs NumPy lazy calloc ~0.01ms) — a legitimate optimization target, not an artifact. benchmark-matrix.md (the DocFX page) re-seeded from the corrected report; docfx build clean (0 errors). The searchsorted benchmark fix takes effect on the next CI run; the credibility gate keeps any residual artifact out of the showcase meanwhile.
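A minimal sketch of the credibility gate described above, using the two thresholds from the text (under 1µs of work on either side, or a speedup above 20x). The function signature, the constant names and the "credible" / "n/a" labels are assumptions for illustration; the real classify() in merge-results.py may be shaped differently.

```python
MIN_WORK_MS = 0.001   # 1 µs, expressed in milliseconds
MAX_SPEEDUP = 20.0    # NumSharp more than 20x faster is treated as an artifact


def classify(numsharp_ms, numpy_ms):
    """Return 'n/a', 'negligible' or 'credible' for one (op, dtype, N) row."""
    if numsharp_ms is None or numpy_ms is None:
        return "n/a"  # one side was never measured
    if numsharp_ms < MIN_WORK_MS or numpy_ms < MIN_WORK_MS:
        return "negligible"  # sub-resolution timing, pure call overhead
    if numpy_ms / numsharp_ms > MAX_SPEEDUP:
        return "negligible"  # implausible win: a view, lazy alloc or dead code
    return "credible"


# The gate is one-sided: a large slowdown stays credible, because it is a
# real optimization target rather than a measurement artifact.
print(classify(10.7, 0.01))
print(classify(0.0036, 18.5))
```

Only credible rows would then feed Top Best/Worst and the per-size geomean, while negligible rows stay listed in the per-suite tables.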

… 1.3–6.1)

Branch advanced 31 substantive commits past the first changelog (which described through 33058b8). The branch was rebased meanwhile: the original changelog commit bb7ed7a8 is orphaned, its twin is 4140f4d, and 33058b8 remains an ancestor of HEAD, so 33058b8..HEAD is the true new-work boundary.

Learned and folded in:
- np.evaluate: Tier-3C fusion made public; per-node NumPy result_type typing (fixes the mixed-tree dtype bug: i4*i4+f8 must wrap in int32 first), fused reductions, EXTERNAL_LOOP guard, out= via ufunc rules. 3.2–6.1x vs NumPy.
- out=/where=/dtype= across the elementwise ufunc API (binary, unary-math, comparisons, predicates, bitwise, invert, arctan2): one NumPy-shaped overload each, exact broadcast/cast/error-text semantics.
- New at np.*: bitwise_and/or/xor (were operator-only, CS0117) and positive.
- nditer: WRITEMASKED/ARRAYMASK execution + VIRTUAL operands (was silent masked-write corruption); Wave-1.4 fixes (size-1 stride-0 invariant, op_axes OOB, write-broadcast validation, PARALLEL_SAFE, unit-axis absorb).
- Alloc Wave 2.4: buffer-pool window 4KiB–1MiB -> 1B–64MiB, pool-side GC pressure, finalizer suppression.
- Canonical NpyIter benchmark suite + post-release benchmark.yml CI + DocFX Benchmarks-vs-NumPy website pages; honest frontier findings recorded (broadcast-reduce 54x, scalar np.any 14.5x, BUFFERED+REDUCE ForEach P0 crash, parallel banding 4.7x win).

Stats refreshed: 272/519/+198k -> 312 commits, 615 files, +217,949/-16,402. Tests: 9,447 -> 9,709 passed/0 failed (net10.0). New-API count 30 -> 35. Same content (minus H1) pushed live to the PR #611 description via REST PATCH.

…oard page Adds a new DocFX page in the npyiter_results.md dashboard style (ASCII bars, geomeans, win/lose, top wins/losses) applied to the broad op × dtype × N matrix — the graph/stats/ numbers companion to the narrative benchmarks.md, with minimal prose. * benchmark/scripts/render_dashboard.py — reads the merged benchmark-report.json and emits benchmark-dashboard.md: headline geomean, BY-SIZE-TIER / BY-SUITE / BY-DTYPE bars (same bar() aesthetic as npyiter_sheet.py — length 10 = parity, 20 = 2.0×), the status mix, and TOP-12 wins/losses with raw ms. Charts only CREDIBLE rows (the merge-results.py gate), so the negligible artifacts that used to dominate stay out. speedup = NumPy ÷ NumSharp. * docs/website-src/docs/benchmarks-dashboard.md — the page (title + one-line note + the ```-fenced sheet), seeded from the renderer. Nested under "Benchmarks vs NumPy" in toc.yml as "Dashboard (op matrix)", beside the full Operation matrix and Iterator sheet. * benchmark/.gitignore — ignore the benchmark-dashboard.md intermediate (the tracked form is the DocFX page), matching how benchmark-report.json/csv are handled. What it shows on the current data (honest, broad picture vs the curated npyiter sheet): 0.74× geomean over 832 credible cells (305 win / 527 lose) — NumSharp trails on the full matrix but reaches parity at 10M (0.98×), and wins decisively where its IL kernels shine: statistics 2.28×, broadcasting 1.22×, reduction 1.21×; uint8 1.07×. Laggards are arithmetic/ unary/creation and bool. Top wins: nansum/percentile/average (8–13×). Top losses: np.zeros (eager-zero vs NumPy lazy calloc, ~500–880×) and argsort (~25×). Prototype scope: the page is a committed STATIC snapshot. To make it live (auto-refresh each release like the matrix/iterator pages), wire render_dashboard.py + a seed step into run_benchmark.py / benchmark.yml — deferred pending design review. docfx build is clean.
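The bar scaling mentioned above (length 10 = parity, 20 = 2.0×) could look roughly like this. It is a sketch only: the `scale` and `max_len` parameters and the block glyph are assumptions, and the real bar() in render_dashboard.py / npyiter_sheet.py may clamp or round differently.

```python
def bar(speedup, scale=10, max_len=40):
    """ASCII bar whose length is proportional to the speedup.

    With scale=10 a 1.0x (parity) row is 10 cells wide and a 2.0x row is 20.
    max_len is an assumed cap so one outlier cannot blow up the layout.
    """
    cells = round(speedup * scale)
    cells = max(0, min(max_len, cells))
    return "█" * cells


for label, sp in [("parity", 1.0), ("2x win", 2.0), ("laggard", 0.35)]:
    print(f"{label:<8} {bar(sp)} {sp:.2f}×")
```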

Two net8.0-only BCL semantic gaps surfaced by the fuzz differential matrix. Both behave correctly on net9.0+ (where the BCL was fixed) but produced wrong values on net8.0; worked around to match NumPy 2.4.2. 1. np.abs(complex) with an infinite component returned NaN instead of +inf ------------------------------------------------------------------------ cabs(NaN + inf*i) must be +inf (C99 hypot / npy_cabs: the infinity test precedes the NaN test). System.Numerics.Complex.Abs routes through a private Hypot whose operand ordering is NaN-unaware, so on net8.0 it returns NaN for abs(NaN+inf*i) (fixed in the .NET 9 BCL). Added Utilities/NpyComplexMath.Abs(Complex): returns +inf when either component is infinite, else defers to Complex.Abs — so every finite/ NaN-only magnitude that already matched NumPy bit-for-bit is unchanged. Repointed the two cached MethodInfo handles that drive every complex-abs emit site: DirectILKernelGenerator.CachedMethods.ComplexAbs (6 IL call sites across the scalar/strided/predicate/math/decimal unary loops) and DefaultEngine.UnaryOp.s_complexAbs (NpyIter Tier-3B route). Fixes 19 unary.jsonl + 1 random_smoke.jsonl fuzz cases (all layouts: contiguous / strided / transposed / broadcast / negstride). 2. ptp / amax / amin along an axis dropped NaN instead of propagating it ------------------------------------------------------------------------ The typed-struct leading/innermost axis-reduction fast paths (MinOp<T>/MaxOp<T>.Combine256/128) called raw Vector256/128.Min/Max. The x86 vminps/vmaxps these lower to return the SECOND operand on an unordered (NaN) compare; the BCL Vector{N}.Min/Max only adopted IEEE NaN propagation in .NET 9. Verified: Vector128.Max(NaN,5) == 5 on net8.0, == NaN on net10.0. So max/min/ptp over a NaN-laced axis silently lost the NaN on net8.0 (ptp axis=0 returned a finite value where NumPy = NaN). 
Routed MinOp/MaxOp through the existing NaNAwareMinMax256/128 helper (already used by the contiguous/strided CombineVectors paths) and wrapped that helper's float/double self-equality mask in #if NET8_0 — so net9.0+ keeps the single-instruction vmaxps with zero overhead while net8.0 gets ConditionalSelect(ordered, min/max, a+b) NaN propagation. The flat whole-array reduction kernel already emitted this via EmitVectorNaNPropagatingMinMax, so only the axis fast paths were affected. Fixes 12 stat.jsonl fuzz cases (ptp float32/float64, axis 0/1, C/F-contig). Verification: full unit suite green on BOTH net8.0 and net10.0 (9709 passed / 0 failed under the CI filter), FuzzMatrix 42/42 on both. The originally reported trunc "Could not find Truncate for Vector128" failures were already resolved in-tree by the CanUseUnarySimd #if NET8_0 guard (commit 5716f86); the leak-guard working-set tests pass locally (their CI failures were OS working-set / GC-mode noise, not a managed or unmanaged leak).
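Both workarounds above reduce to small scalar rules, sketched here in Python for illustration (the shipped fixes are C# / emitted IL; the function names below are hypothetical). The first applies the infinity test before the NaN test, as C99 hypot requires; the second mirrors the ConditionalSelect(ordered, max, a+b) pattern, where a+b yields NaN whenever either operand is NaN.

```python
import math


def npy_cabs(z: complex) -> float:
    """Complex magnitude with the infinity check ahead of the NaN check."""
    if math.isinf(z.real) or math.isinf(z.imag):
        return math.inf          # cabs(NaN + inf*i) must be +inf
    return abs(z)                # finite and NaN-only inputs defer to the default


def nan_propagating_max(a: float, b: float) -> float:
    """Scalar model of the NaN-aware vector max."""
    ordered = (a == a) and (b == b)   # self-equality is False only for NaN
    return max(a, b) if ordered else a + b


print(npy_cabs(complex(math.nan, math.inf)))
print(nan_propagating_max(math.nan, 5.0))
```

The a+b fallback is what lets a vectorized kernel propagate NaN without branching per lane.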

…NumSharp faster) The dashboard prototype was the odd one out: I rendered it speedup = NumPy ÷ NumSharp (>1× = faster), while the op-matrix report it is derived from — and merge-results.py — use ratio = NumSharp ÷ NumPy (<1× = faster, lower is better). Two pages off the same data with opposite conventions is exactly the faster/slower confusion to avoid. Verified first that the underlying direction is NOT a flip: counting raw milliseconds (numsharp_ms vs numpy_ms, no ratio involved), NumSharp took LESS time on 305 ops and MORE time on 526 of 832 credible ops; geomean NS/NP = 1.36. So "NumSharp trails on the broad matrix" is real (concentrated in Arithmetic = 231 slower ops, and Unary), and it matches the op-matrix report's own conclusion. The dashboard's data was right; only its convention was inverted relative to the house default. render_dashboard.py now uses NS/NP throughout: * ratio = numsharp_ms / numpy_ms; header + axis read "faster ◄ 1.0 (parity) ► slower". * HEADLINE 1.36× geomean · 305 faster / 527 slower. * by-suite / by-dtype ranked fastest→slowest (ascending ratio): statistics 0.44×, reduction 0.83×, broadcasting 0.82× now read as FASTER; creation 2.83× / unary 2.63× / bool 3.55× as slower. * status bands relabelled to NS/NP (faster ≤1.0× / close 1–2× / slower 2–5× / much >5×). * tables renamed FASTEST / SLOWEST; each row shows the NS/NP ratio plus a human factor ("0.079× (12.6× faster)", "880.9× (881× slower)") so the small-ratio-is-good direction is unambiguous. benchmarks-dashboard.md re-seeded with the matching note; docfx build clean. This makes the report + dashboard consistent. The narrative benchmarks.md, the npyiter iterator sheet, and the README cards still use the speedup (NP/NS, >1× = faster) framing — flipping those is a separate call (they are win-showcases where >1× reads naturally).

…m the changelog Per review: the changelog should describe the final state, not the development path. Removed the temporal 'Latest wave (Waves 1.3–6.1) — added after the first changelog' umbrella section entirely and dissolved its content into the proper topical sections, with all 'wave' terminology and 'added after'/'previously absent'/'now reachable' path-language gone: - np.evaluate folded into §2 (NpyExpr DSL): per-node result_type typing, fused reductions, out= rules, EXTERNAL_LOOP guard, measured speedups. - out=/where=/dtype= ufunc kwargs folded into §5 as a parity subsection. - WRITEMASKED/ARRAYMASK execution, VIRTUAL operands, and the size-1 stride-0 / op_axes-OOB / write-broadcast / PARALLEL_SAFE / unit-axis fixes folded into §1 (capability matrix + bug list); masked-write corruption fix added to §10. - buffer-pool window (1 B–64 MiB), pool-side GC pressure, finalizer suppression folded into §7; TL;DR memory bullet updated. - canonical NpyIter benchmark, benchmark.yml CI, DocFX benchmark pages, and the honest frontier findings folded into §8/§15. - 'NPYITER_GAPS_AND_ROADMAP … 6-wave plan' -> 'prioritized roadmap'. Net: zero 'wave' occurrences; the 16-section topical structure is intact. Same content (minus H1) pushed live to the PR #611 description.

… stat Per updated direction: the ratio convention is NumPy ÷ NumSharp again (>1.0× = NumSharp faster — bars grow right = faster, the original visual), AND every row now also carries 🕐 %NumPy = (NumSharp ÷ NumPy) × 100 = the share of NumPy's time NumSharp uses. So a win reads two intuitive ways: "12.63× faster" and "🕐 8%" (takes only 8% of the time NumPy would); parity is 🕐 100%; >100% is slower. Huge slowdowns compact to e.g. 🕐 881×NP. render_dashboard.py: * r["sp"] = numpy/numsharp (speedup), r["pct"] = numsharp/numpy*100 (share of NumPy time). * headline + every bar/table show both: HEADLINE 0.74× geomean · 🕐 136%; by-suite e.g. statistics 2.28× 🕐 44%, reduction 1.21× 🕐 83%, creation 0.35× 🕐 283%; FASTEST nansum 12.63× 🕐 8%; SLOWEST np.zeros 0.001× 🕐 881×NP. * status-mix bands relabelled in %NumPy terms (faster ≤100% / close 100–200% / slower 200–500% / much >500%), a legend line explains the 🕐 stat, pct_str() keeps the column narrow (NN% under 1000%, else NN×NP). benchmarks-dashboard.md re-seeded with the matching note (heredoc — printf mis-read %NumPy as a format spec); docfx build clean, emoji verified present (U+1F550 ×54). Supersedes the brief NS/NP experiment (c0a5346). The op-matrix report (merge-results.py) still uses NS/NP "lower is better", and the npyiter sheet / cards use NP/NS without the %NumPy stat — rolling the NP/NS + 🕐 %NumPy convention out to those is the next step, pending confirmation.

Completes the rollout chosen after the dashboard fix: every benchmark surface now uses the SAME convention — speedup = NumPy ÷ NumSharp (>1.0× = NumSharp faster) — and every surface also carries 🕐 %NumPy = (NumSharp ÷ NumPy) × 100 = the share of NumPy's time NumSharp uses (30% = takes only 30% of the time NumPy would; <100% = faster; huge slowdowns compact to e.g. 880×NP). So a win reads two intuitive ways at once: "12.66× faster" and "🕐 8%". Op-matrix report (merge-results.py) — FLIPPED from NS/NP to NP/NS (the one surface that was "lower is better"): * ratio = numpy_ms / numsharp_ms; new pct_numpy field on UnifiedResult (JSON + CSV). * get_status bands inverted around >1 = faster (faster ≥1.0× / close 0.5–1.0× / slower 0.2–0.5× / much <0.2×); classify() credibility gate flips to ratio > 20 (was < 1/20). * Best/Worst now sort DESCENDING (fastest first); legend + tables + summary-by-size gain a 🕐 %NumPy column; ratio_fmt keeps tiny slowdowns readable (0.001× not 0.00×). * Regenerated from the on-disk run archive: Top Best nansum 12.66× 🕐 8%; Top Worst np.zeros 0.001× 🕐 880×NP; searchsorted stays negligible (now ratio>20). Counts unchanged (305/255/169/103/275/126) — same rows, just the direction relabelled. npyiter sheet (npyiter_sheet.py) + cards (npyiter_cards.py) — already NP/NS, ADDED 🕐 %NumPy: * sheet: legend line + per-bar 🕐 %NumPy + headline "1.17× geomean · 🕐 85% of NumPy's time"; re-rendered npyiter_results.md (--render-only, AV block intact). * cards: each bar label now "1.80× · 56%" (ops) / "4.3× · 23%" (dividends); footer explains the %. No emoji in matplotlib (DejaVu lacks the glyph) — the % carries it. Re-rendered. Narrative benchmarks.md + README — already NP/NS, added the 🕐 %NumPy line to the convention block, a %NumPy column to the by-class table, and a caption sentence. DocFX pages (benchmark-matrix.md, benchmark-iterator.md) re-seeded from the regenerated report + sheet; benchmarks.md updated; docfx build clean (0 errors). 
The dashboard (render_dashboard.py / benchmarks-dashboard.md) already carries this convention (49af3af), so the whole benchmark stack — report, dashboard, iterator sheet, cards, narrative, README — is now identical: NumPy ÷ NumSharp speedup + 🕐 %NumPy.
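The unified convention can be summarized in a few lines. This is a sketch under stated assumptions: the helper names are illustrative, and whether each band boundary is inclusive is a guess, since the text gives only the band edges (faster ≥1.0× / close 0.5–1.0× / slower 0.2–0.5× / much <0.2×).

```python
def speedup(numpy_ms, numsharp_ms):
    """NumPy ÷ NumSharp: above 1.0 means NumSharp is faster."""
    return numpy_ms / numsharp_ms


def pct_numpy(numpy_ms, numsharp_ms):
    """Share of NumPy's time that NumSharp uses: below 100 means faster."""
    return numsharp_ms / numpy_ms * 100.0


def get_status(ratio):
    """Map a NumPy ÷ NumSharp ratio onto the four status bands."""
    if ratio >= 1.0:
        return "faster"
    if ratio >= 0.5:
        return "close"
    if ratio >= 0.2:
        return "slower"
    return "much slower"


# Made-up timings: NumPy 10 ms, NumSharp 2 ms.
print(speedup(10.0, 2.0), pct_numpy(10.0, 2.0), get_status(speedup(10.0, 2.0)))
```

The two statistics are reciprocals of each other (pct = 100 ÷ speedup), so they can never disagree about direction.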

The clock sat before the figure with the right-align padding landing between them ("🕐 87%"). Moved it to immediately follow the percentage, no space — "87%🕐" — across every surface, and likewise the metric name (🕐 %NumPy → %NumPy🕐). The alignment padding now sits before the number (where it belongs) instead of after the emoji. * render_dashboard.py / npyiter_sheet.py: bar values "{pct_str}🕐", headline "85%🕐 of NumPy's time", legend "%NumPy🕐 = …". Dashboard + sheet regenerated. * merge-results.py: report legend, status-band table, summary-by-size "%NP🕐" column, Best/Worst note, and per-suite "%NumPy🕐" column headers. Report regenerated. * benchmarks.md + README: convention line / table column / caption "%NumPy🕐". * DocFX pages (matrix, iterator, dashboard) re-seeded; dashboard page note "%NumPy🕐". docfx build clean. The matplotlib cards are unaffected (they show "1.80× · 56%" without the emoji — DejaVu has no clock glyph — so there was never a gap to fix there).

… form pct_str (dashboard/sheet) and pct_fmt (report) switched to a ×-multiplier form for huge slowdowns (np.zeros etc.), so the %NumPy stat showed "880×NP🕐" / "880×" — breaking the NN%🕐 depiction the column promises. Now they always render a percentage: np.zeros reads "87957%" (report) / "88087%🕐" (dashboard) = takes ~880× as long, stated as a share of NumPy's time like every other cell. The ratio column is untouched — it legitimately uses × (0.001×, 12.65×); only the %NumPy formatters changed. Report + sheet + dashboard regenerated, the three DocFX pages re-seeded, docfx build clean.

…g from the report The dashboard and benchmark-report.md disagreed on the SAME cell: np.nansum(f64,100K) read 12.63× on the dashboard vs 12.65× in the report, np.zeros(i64,10M) read 88087% vs 87957%, quantile/percentile likewise — 161 rows printed a different ratio at 2dp between the two committed surfaces. Root cause: merge-results.py computes ratio = NumPy/NumSharp and pct_numpy from the FULL-PRECISION means, then stores numpy_ms/numsharp_ms rounded to 4dp. render_dashboard.py ignored the stored ratio/pct_numpy fields and RE-DIVIDED the rounded ms (r["numpy_ms"] / r["numsharp_ms"]), so every row where the 4dp rounding moved a digit drifted from the report. The report is correct (true ratio of the measured means); the dashboard was a rounding artifact of its own recompute. Fix: the credible loop now consumes r["ratio"] / r["pct_numpy"] straight from the JSON (the same numbers benchmark-report.md prints), falling back to 100/ratio only if pct is absent. Dashboard and report now agree cell-for-cell, and the per-suite/per-dtype geomeans key off the same stored ratios the report's Summary-by-size uses. Regenerated benchmark-dashboard.md (gitignored) and re-seeded the DocFX dashboard page; header preserved, body refreshed. Verified: nansum 12.65×/8%, zeros 0.001×/87957%, quantile 9.89×/10% identical on both surfaces; size tiers match Summary-by-size exactly.
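The drift is easy to reproduce. The timings below are made up for illustration (they are not taken from the run archive); the point is only that dividing the 4dp-rounded milliseconds gives a different 2dp ratio than dividing the full-precision means.

```python
# Hypothetical full-precision means, in milliseconds.
numpy_mean_ms = 0.04554
numsharp_mean_ms = 0.00364

# What the merge step stores: the ratio of the true means, plus ms rounded to 4dp.
stored_ratio = numpy_mean_ms / numsharp_mean_ms
stored_numpy_ms = round(numpy_mean_ms, 4)
stored_numsharp_ms = round(numsharp_mean_ms, 4)

# What a consumer gets if it re-divides the rounded ms instead of reading the ratio.
recomputed_ratio = stored_numpy_ms / stored_numsharp_ms

print(f"stored:     {stored_ratio:.2f}x")
print(f"recomputed: {recomputed_ratio:.2f}x")
```

Reading the stored ratio keeps every downstream surface tied to one canonical number, which is the approach the fix takes.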

…not run" cells

normalize_op_name dropped measured C# data on the floor whenever the C# benchmark label and the NumPy suite name differed only cosmetically, so the report showed ⚪ "C# benchmark not run" for ops that WERE run. Three archive-safe alias passes (applied identically to both sides, so they only ever merge a true pair):

* empty "()": a no-arg C# method call "a.flatten()" now meets NumPy's "a.flatten"
* "->" spacing: C# "reshape 2D -> 1D" now meets NumPy's "reshape 2D->1D"
* np.around: IS np.round (NumPy alias); C# benchmarks rounding as np.around, NumPy emits np.round, so the whole np.round family was ⚪ despite real data

Effect (re-merged from the same archive, no re-run): ⚪ no-data 126 → 116; the np.round family gains 6 real rows (float32/float64 × 3 sizes), a.flatten +2 (100K/10M), reshape 2D->1D +2. Verified against the archive before editing: +10 joined cells, 0 regressions (no previously-matched cell lost), 0 new key collisions.

Regenerated benchmark-report.{md,json,csv} + the dashboard (now 840 credible cells, 0.73× geomean) and re-seeded the matrix + dashboard DocFX pages (headers preserved byte-for-byte). The dashboard stays cell-consistent with the report via the canonical ratio/pct fix from the prior commit.

NOT fixed here (genuine gaps needing a benchmark re-run, not a name alias): np.prod has no NumPy full-reduction row at all; isnan/isinf/isfinite/isclose/allclose/array_equal/maximum/minimum have no C# benchmark; amax/amin/mean/std/var axis variants and np.mean on uint*/int16 lack a counterpart on one side.
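The alias passes, together with the earlier annotation-stripping rule (\s+\[ rather than \s*\[), might be sketched like this. It is an illustration only, not the normalize_op_name in the merge script: the exact regexes, their order, and the "[float64]" annotation in the example are assumptions.

```python
import re


def normalize_op_name(name: str) -> str:
    """Cosmetic normalization applied identically to C# and NumPy labels."""
    s = name.strip()
    # Strip only a space-separated " [annotation]"; index brackets attached
    # to an identifier, such as a[100:1000], have no leading space and survive.
    s = re.sub(r"\s+\[[^\]]*\]", "", s)
    # A trailing empty "()" on a no-arg method call is dropped.
    s = re.sub(r"\(\)$", "", s)
    # Normalize spacing around "->".
    s = re.sub(r"\s*->\s*", "->", s)
    # np.around is an alias of np.round.
    s = s.replace("np.around", "np.round")
    return s


for label in ["a.flatten()", "reshape 2D -> 1D", "np.around(a)",
              "np.copy(a[100:1000])", "np.copy(a) [float64]"]:
    print(f"{label!r:28} -> {normalize_op_name(label)!r}")
```

Because the same function runs on both sides, a pass can only join labels that already differed cosmetically; it cannot invent a pairing on its own.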

…lex (NumPy parity) These six complex ufuncs previously threw NotSupportedException from the EmitUnaryComplexOperation default arm, even though NumPy 2.x has complex loops for all of them (csinh/ccosh/ctanh/casin/cacos/catan). This wires them up with full NumPy 2.4.2 parity. Approach (hybrid BCL + C99 fixups, mirroring the existing abs/log2/exp2 pattern): a bit-exact probe over a finite battery showed System.Numerics. Complex matches NumPy to a few ULP on the finite interior, but diverges at 86/360 edge components -- it returns (NaN,NaN) for nearly all inf/NaN inputs instead of the C99 Annex G values, drops the sign of zero on branch cuts, and mishandles arctan's imaginary-axis cut. So: - NpyComplexMath.{Sinh,Cosh,Tanh,Asin,Acos,Atan} delegate the finite interior to the BCL and add the C99 fixups: * Non-finite inputs: special-value tables ported from NumPy's msun npy_csinh/ccosh/ctanh, with asin/atan reusing NumPy's own identities asin(z)=i*conj(casinh(i*conj z)) and atan(z)=i*conj(catanh(i*conj z)). * Branch-cut/signed-zero fixups (empirically derived against NumPy and verified on a 64-point signed-zero grid): asin negates Re on x=-0 and Im on y=-0; acos negates Im on the y=+0 cut; atan sets Re=copysign(|y|>1?pi/2:0, x) on the imaginary axis and negates Im on y=-0. * Where this NumPy build's system libm diverges from msun at infinities (sign-preserving sinh(-inf+i*inf).re, cosh's even-function +inf*sin(y) imaginary part, tanh's sign(y) zero, and the genuinely-unspecified zero signs), the helpers match the observed NumPy 2.4.2 output. - DirectILKernelGenerator: register CachedMethods.Complex{Sinh,Cosh,Tanh, Asin,Acos,Atan} (pointing at NpyComplexMath, not Complex.* directly) and add the six cases to EmitUnaryComplexOperation. Verification: a bit-exact harness over a 117-point battery (finite + signed zeros + branch cuts + inf/NaN) plus a 64-point grid, diffed against NumPy 2.4.2, gives 1402/1404 components matching (1249 bit-exact, 153 within <=3 ULP). 
The only 2 residuals are arctan's finite interior (1e-10 tiny input ~8e-8 rel; 100+100j at 3 ULP) -- .NET's Atan kernel is less accurate than NumPy's log1p-based one; an accepted, documented divergence. Tests: - NewDtypesUnaryTests: 9 NumPy-verified cases covering interior, branch cuts, signed zeros, and C99 special values. - Fuzz/MisalignedRegistry: the stale "complex kernel throws" excuse is corrected to Half-only; complex sinh/cosh/tanh/arcsin/arccos are now held to a tight 4-ULP gate (a real regression fails) instead of the blanket complex-unary excuse; arctan stays under the documented blanket for its accepted BCL-interior divergence. All 609 Fuzz + NewDtypes tests pass (net10.0); the 26x5 complex corpus cases for the five tightly-gated ops are all within 4 ULP.
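The two identities reused for asin and atan can be checked numerically with Python's cmath. This is a sanity sketch at finite points away from the branch cuts; it does not reproduce the special-value tables or the signed-zero fixups, and the helper names are hypothetical.

```python
import cmath


def asin_via_asinh(z: complex) -> complex:
    """asin(z) = i * conj(asinh(i * conj(z)))"""
    return 1j * cmath.asinh(1j * z.conjugate()).conjugate()


def atan_via_atanh(z: complex) -> complex:
    """atan(z) = i * conj(atanh(i * conj(z)))"""
    return 1j * cmath.atanh(1j * z.conjugate()).conjugate()


for z in (0.3 + 0.4j, 2 - 1j):
    print(z, abs(asin_via_asinh(z) - cmath.asin(z)),
          abs(atan_via_atanh(z) - cmath.atan(z)))
```

The conjugations matter at the edges: they route the sign of zero through the hyperbolic function's own branch-cut handling, which is why the identity form is attractive for special values.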

Nucs added 4 commits April 22, 2026 23:41

Nucs force-pushed the nditer branch from f5c05a7 to 574a0d8 Compare April 23, 2026 09:34

Nucs mentioned this pull request Apr 28, 2026

Add NDIterator<T> overload with support for specific axis. #363

Open

Nucs added 23 commits May 13, 2026 09:14

Nucs added 30 commits June 12, 2026 07:21

Nucs wants to merge 324 commits into master from nditer

TL;DR

1. NpyIter — full NumPy nditer port

Execution at NumPy speed

2. NpyExpr DSL + three-tier custom-op API

3. Legacy iterator stack retired

4. C/F/A/K memory-layout support

5. New & completed np.* APIs

6. Linear algebra

7. Performance (beyond NpyIter and linalg)

8. Official benchmark suite + honest methodology

9. Differential fuzzing vs NumPy (new infrastructure)

10. Correctness — NumPy-parity bug fixes

11. Memory management — ARC + IDisposable

12. Char8 primitive

13. Examples — trainable MNIST MLP

14. Kernel architecture & hygiene

15. Documentation

16. Tests & CI

Breaking changes
