
[Major Rewrite] NumPy nditer port, NpyExpr DSL with 3-tier custom-op API, C/F/A/K memory layout support, stride-native matmul #611

Draft
Nucs wants to merge 324 commits into master from nditer

Nucs commented Apr 22, 2026


Complete changelog of the nditer branch — everything in this PR since #612 merged.

312 commits · 615 files · +217,949 / −16,402 (vs master, after #612)


TL;DR

  • NpyIter — full port of NumPy 2.4.2's nditer (~12.5K lines): all iteration orders (C/F/A/K), all indexing modes, buffered casting, buffered-reduce double-loop, masking, memory-overlap protection (COPY_IF_OVERLAP), windowed buffering (DELAY_BUFALLOC), unlimited operands and dimensions. 566+ byte-for-byte NumPy parity scenarios.
  • NpyExpr DSL + three-tier custom-op API — write your own ufuncs: raw IL (Tier 3A), element-wise scalar/SIMD (Tier 3B), or composable expression trees with operator overloads (Tier 3C). Exposed as the public np.evaluate, which runs fused expressions 3.2–6.1× faster than NumPy (which can't fuse), with per-node NumPy result_type typing and fused reductions.
  • out= / where= / dtype= ufunc kwargs across the elementwise API — the kwargs on every NumPy ufunc, spanning the binary, unary-math, comparison, predicate, and bitwise families with exact NumPy broadcast/cast/error-text semantics. Plus np.bitwise_and/or/xor and np.positive at the np.* surface.
  • NumPy-parity benchmark: geomean 1.00× at 10M elements across ~409 ops (166 faster / 171 close / 36 slower) — measured by a new official BenchmarkDotNet-vs-NumPy suite committed with the report.
  • 30 new np.* APIs: pad (11 modes), tile, median/percentile/quantile (all 13 interpolation methods) + their nan* variants, average, ptp, take/put/place, extract/compress, diagonal/trace, argwhere/flatnonzero, unravel_index/ravel_multi_index/indices, delete/insert/append, diff/ediff1d, asfortranarray/ascontiguousarray, np.multithreading.
  • C/F/A/K order support wired through the whole API: Shape understands F-contiguity, OrderResolver resolves NumPy order modes, ~68 layout bugs fixed across 9 fix groups.
  • Stride-native matmul/dot — BLIS-style GEBP GEMM absorbs arbitrary strides for all dtypes (kills a ~100× penalty on transposed inputs); fused 1-D dot is 3.5–9× faster with zero GC; opt-in multithreaded dot ~2× faster than NumPy's default on 1M vectors.
  • Deterministic memory management — atomic reference counting + IDisposable on NDArray, plus a tcache-style buffer pool (1 B – 64 MiB window).
  • Differential fuzzing infrastructure — 37,445 bit-exact NumPy-comparison cases across 24 corpus tiers, a seeded random fuzzer with shrinker, a CI FuzzMatrix gate, and a nightly soak workflow.
  • Legacy iterator stack deleted: MultiIterator and the Regen-generated NDIterator cast templates are gone (−3,870 LOC); NDIterator survives as a thin lazy wrapper over NpyIter.
  • Test suite: 9,709 passed / 0 failed on net10.0 (+2,500 new test methods), plus the 37,445-case fuzz corpus replayed by the FuzzMatrix gate.

1. NpyIter — full NumPy nditer port

From-scratch C# port of NumPy 2.4.2's iterator machinery under src/NumSharp.Core/Backends/Iterators/ (~12,557 lines), promoted to public API with NDArray overloads.

| Capability | Detail |
| --- | --- |
| Iteration orders | C, F, A, K, incl. NEGPERM negative-stride handling, axis reordering + coalescing to full 1-D collapse |
| Indexing modes | MULTI_INDEX, C_INDEX, F_INDEX, RANGE (parallel chunking), GotoIndex / GotoMultiIndex / GotoIterIndex |
| Buffering | Buffered casting with all 5 casting rules, windowed buffered iteration, DELAY_BUFALLOC, buffered-reduce double-loop (incl. bufferSize < coreSize) |
| Reductions | op_axes with -1 reduction axes, REDUCE_OK, IsFirstVisit, REUSE_REDUCE_LOOPS slab accumulation |
| Overlap safety | COPY_IF_OVERLAP via a port of NumPy's mem_overlap solver (NpyMemOverlap.cs); overlapping in/out operands no longer silently corrupt |
| Masking | WRITEMASKED + ARRAYMASK executed: the buffered window flush writes back only mask-nonzero elements; VIRTUAL operands (null op slots) construct with NumPy 2.x semantics |
| Operands / dims | Unlimited operands (NumPy caps at NPY_MAXARGS=64) and unlimited dimensions (NumPy caps at NPY_MAXDIMS=64) via dynamic allocation |
| APIs | Copy, GetIterView, RemoveAxis, RemoveMultiIndex, ResetBasePointers, IterRange, DebugPrint, fixed/axis stride queries, GetValue<T>/SetValue<T>, … |
| Casting parity | NpyIterCasting.CanCast matches NumPy's safe/same_kind lattice exactly |
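
The "coalescing to full 1-D collapse" row can be illustrated with a small model. This is only a sketch of the general axis-coalescing idea, not the code in this PR: the function name `coalesce_axes` is hypothetical, strides are counted in elements, and size-1 axes, negative strides and multiple operands are deliberately ignored.

```python
def coalesce_axes(shape, strides):
    """Merge adjacent axes that can be walked as one (strides in elements)."""
    out_shape, out_strides = [shape[0]], [strides[0]]
    for extent, stride in zip(shape[1:], strides[1:]):
        # If one step of the outer axis equals a full sweep of the inner axis,
        # both axes collapse into a single axis that uses the inner stride.
        if out_strides[-1] == extent * stride:
            out_shape[-1] *= extent
            out_strides[-1] = stride
        else:
            out_shape.append(extent)
            out_strides.append(stride)
    return tuple(out_shape), tuple(out_strides)


# A C-contiguous 2x3x4 array collapses to one axis of 24 elements.
contiguous = coalesce_axes((2, 3, 4), (12, 4, 1))
# The first 3 columns of a 2x4 array leave a gap per row, so nothing merges.
padded_rows = coalesce_axes((2, 3), (4, 1))
```

Once every axis has merged, the inner loop is a single strided run, which is the case a contiguous fast path can exploit.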

Validated by a dedicated battletest harness: 566 scenarios replayed against NumPy 2.4.2 byte-for-byte, a permanent variation-probe harness, and tools/iterator_parity. Dozens of parity bugs found and fixed against NumPy ground truth: negative-stride flipping, NO_BROADCAST enforcement, F_INDEX coalescing, buffered-reduction stride inversion, K-order on broadcast inputs, EXLOOP iternext, buffered-cast Advance, ranged Reset() desync, buffer free-list corruption, the size-1 stride-0 invariant (a (1,4) view with nonzero stride corrupted RemoveMultiIndex), op_axes out-of-bounds reads on stretched size-1 axes, write-broadcast validation, PARALLEL_SAFE wiring, and unit-axis absorption — each reproduced against NumPy first, then fixed by adopting NumPy's constructor structure.

Execution at NumPy speed

NpyIter isn't just correct — it is now the production execution engine: DefaultEngine's binary, unary, and comparison ops (same- and mixed-dtype) route through the NpyIter Tier-3B shell, and it measures at-or-faster than NumPy on every probed aspect (Release, i9-13900K, NumPy 2.4.2):

| Aspect (float32) | NumSharp | NumPy | Ratio |
| --- | --- | --- | --- |
| contig sqrt 10M | 2.98 ms | 3.24 ms | 0.92× |
| contig add 10M | 3.91 ms | 4.09 ms | 0.96× |
| strided add 1M | 319 µs | 416 µs | 0.77× |
| strided sqrt 1M | 206 µs | 374 µs | 0.55× |
| strided sum 1M | 109 µs | 205 µs | 0.53× |
| fused a*b+c 10M | 4.77 ms | 13.38 ms | 0.36× |
| fused (a-b)/(a+b) 10M | 4.12 ms | 22.33 ms | 0.18× |

Key mechanisms: an O(1) trivial-loop bypass that skips iterator construction for contiguous operands, identity-broadcast fast paths, AVX2 hardware-gather (vgatherdps) strided SIMD in the Tier-3B shell (NumPy uses scalar loops for strided binary/reduce — its floors are beatable), and strided-reduction kernels (2-D strided sqrt 1.36× faster than NumPy, strided sum 2.2× faster).

2. NpyExpr DSL + three-tier custom-op API

User-extensible kernel layer on top of NpyIter — the public answer to "how do I write my own ufunc":

  • Tier 3A — ExecuteRawIL: emit raw IL against the NumPy ufunc signature void(void** dataptrs, long* strides, long count, void* aux).
  • Tier 3B — ExecuteElementWise: provide scalar + vector IL; the shell supplies a 4×-unrolled SIMD loop, remainder vector, scalar tail, and strided fallback.
  • Tier 3C — ExecuteExpression: compose NpyExpr trees with C# operators ((a - b) / (a + b)), 50+ node types (arithmetic, trig, exp/log, rounding, predicates, comparisons, Min/Max/Clamp/Where), plus Call() to splice any delegate/MethodInfo into a fused kernel. Compiled once, cached by structural key, ~5 ns dispatch.
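
The Tier 3C idea (operator overloads build a tree, the tree is compiled once, and the kernel runs in one pass without temporaries) can be sketched in a few lines. This is an illustrative model only: `Expr`, `Input`, `BinOp` and `evaluate` are hypothetical names, and the "compilation" here produces nested closures rather than IL.

```python
import operator


class Expr:
    """Base node; arithmetic operators build a tree instead of computing."""
    def __add__(self, other): return BinOp(operator.add, self, other)
    def __sub__(self, other): return BinOp(operator.sub, self, other)
    def __mul__(self, other): return BinOp(operator.mul, self, other)
    def __truediv__(self, other): return BinOp(operator.truediv, self, other)


class Input(Expr):
    """Placeholder for the operand at a given position."""
    def __init__(self, index):
        self.index = index

    def compile(self):
        index = self.index
        return lambda row: row[index]


class BinOp(Expr):
    def __init__(self, fn, left, right):
        self.fn, self.left, self.right = fn, left, right

    def compile(self):
        fn, left, right = self.fn, self.left.compile(), self.right.compile()
        return lambda row: fn(left(row), right(row))


def evaluate(expr, *operands):
    """Compile once, then make a single pass over all operands."""
    kernel = expr.compile()
    return [kernel(row) for row in zip(*operands)]


a, b = Input(0), Input(1)
fused = evaluate((a - b) / (a + b), [3.0, 5.0], [1.0, 3.0])
```

The unfused equivalent would first materialize `a - b` and `a + b` as full intermediate arrays and then divide them; the fused kernel visits each element once.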

This is what powers the fusion wins — one pass, no temporaries — and it is exposed publicly as np.evaluate(expr[, operands][, out]):

  • Per-node NumPy result_type typing: every node resolves to its NumPy 2.4.2 dtype, so mixed trees wrap correctly: (i4*i4)+f8 wraps the multiply in int32 (→ 1410065408) before promoting. Strong-strong NEP50 (incl. int/float tier crossing), weak python-scalar literals (i4+2 → i4, i4/2 → f8) with NumPy's exact OverflowError, and special resolvers (true_divide, arctan2, negative-integer-literal power → ValueError, bool add=OR/multiply=AND).
  • Fused reductions: NpyExpr.Sum/Prod/Min/Max/Mean compile a one-pass inner loop; sum(a*b) reads a and b once and never materializes the product. NumPy reduction dtypes (int→i64, uint→u64, mean→f64).
  • out= joins via the ufunc rules (same_kind validation, reference identity, overlap-safe aliasing through COPY_IF_OVERLAP); an EXTERNAL_LOOP guard prevents the silent count==1 slow path.
  • Measured (Release, 4M f64, NumPy 2.4.2): a*b+c 3.2×, (a-b)/(a+b) 6.1×, sum(a*b) 3.6×, sum f32 2.9×, i4*2+f8 3.5× faster. Permanent gate in benchmark/poc/evaluate_bench.{cs,py}.
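
The per-node wraparound in the first bullet can be checked by hand. The sketch below assumes the operands were 100000 * 100000 (the text gives only the wrapped result, 1410065408); `wrap_int32` is a hypothetical helper that models two's-complement int32 arithmetic.

```python
def wrap_int32(value):
    """Reduce an unbounded Python int to the value an int32 would hold."""
    value &= 0xFFFFFFFF
    return value - 0x1_0000_0000 if value >= 0x8000_0000 else value


# (i4 * i4) + f8: the multiply node is typed int32, so it wraps first...
product_i4 = wrap_int32(100_000 * 100_000)
# ...and only the outer add node promotes to float64.
per_node = product_i4 + 0.5

# Promoting the whole tree to float64 up front would give a different answer.
whole_tree_f8 = float(100_000) * float(100_000) + 0.5
```

Here `product_i4` is 1410065408, whereas promoting the whole tree first would yield 10000000000.5.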

3. Legacy iterator stack retired

  • MultiIterator deleted; all callers migrated to NpyIter.Copy / multi-operand execution.
  • The Regen template NDIterator.template.cs + 16 generated NDIterator.Cast.* partials deleted (−3,870 LOC in one commit).
  • NDIterator survives as a thin, lazy compatibility wrapper over NpyIter (294 lines) — no more materialized copies; same MoveNext()/HasNext()/Reset() surface.
  • ~400 per-dtype NPTypeCode switch sites replaced by a generic NpFunc dispatch utility.

4. C/F/A/K memory-layout support

  • Shape now tracks F-contiguity with NumPy-convention contiguity computation; new OrderResolver resolves C/F/A/K for every API with an order parameter.
  • Order support wired through: copy, array, asarray, asanyarray, *_like, astype, flatten, ravel, reshape, eye, concatenate, cumsum, argsort, tile, plus post-hoc F-contig preservation across the IL-kernel dispatchers.
  • New: np.asfortranarray, np.ascontiguousarray.
  • np.where selects C/F output layout the way NumPy does; ravel('F') of an F-contig source returns a view (was a 3,000× copy).
  • ~68 layout bugs fixed across 9 TDD fix groups, backed by ~3,300 lines of new order tests (Sections 41–51: reductions/keepdims, matmul/dot/outer/convolve, broadcasting-from-F, manipulation, file I/O fortran_order, Decimal scalar path, fancy-write isolation, …).

5. New & completed np.* APIs

New functions (35):

| Area | APIs |
| --- | --- |
| Fused / ufunc | np.evaluate (fused expressions, see §2), np.bitwise_and, np.bitwise_or, np.bitwise_xor, np.positive |
| Manipulation | np.pad (all 11 NumPy modes + callable), np.tile, np.delete, np.insert, np.append |
| Indexing/selection | np.take, np.put, np.place, np.extract, np.compress, np.argwhere, np.flatnonzero, np.diagonal, np.trace, np.unravel_index, np.ravel_multi_index, np.indices |
| Statistics | np.median, np.percentile, np.quantile (all 13 interpolation methods, tuple axis, out=, keepdims, QuickSelect engine), np.average (weights, returned, tuple-axis; fused kernel 1.3–1.6× faster than NumPy at 1M), np.ptp, np.nanmedian, np.nanpercentile, np.nanquantile |
| Math | np.diff, np.ediff1d |
| Creation | np.asfortranarray, np.ascontiguousarray |
| Runtime | np.multithreading(enabled, max_threads): opt-in threaded kernels |

Rebuilt to full NumPy 2.x parity:

  • np.clip: min=/max= keyword aliases, default-None bounds, NumPy 2.x dtype promotion, out= validation.
  • np.unique — 5 missing kwargs, sort+mask algorithm (up to 43× faster), NaN partitioning, n > Array.MaxLength fallback.
  • np.searchsorted: side=, sorter=, multidim validation; IL binary-search kernels 5–25× faster (beats NumPy on 20/22 benchmarks).
  • np.copyto: casting=, where= masked copies at NumPy speed (was 7–72× slower).
  • np.asarray: copy=, like=, device=, dtype-as-string. np.concatenate: full parity + C/F fast paths. np.all/np.any: tuple-axis, out=, where=. np.expand_dims: tuple axis. np.repeat: axis= parameter. np.power: integer-power semantics, negative-exponent ValueError, crash fix.
  • Engine completeness: bool/char max/min, Complex quantile, IsInf implemented (was a stub).
  • Full 15-dtype coverage pushed through the hot paths: the SByte/Half/Complex dtypes introduced in #612 ([new dtypes, NEP50] fully supported Half/Complex/SByte, np.* alias overhaul, NumPy 2.x type alias alignment) now work across every kernel family this PR touches (reductions, indexing, trace, casts, quantile, …).

out= / where= / dtype= ufunc kwargs (NumPy parity):

The kwargs present on every NumPy ufunc now span the elementwise core — binary (add, subtract, multiply, divide, true_divide, mod, power, floor_divide), unary-math (sqrt, exp, log, sin, cos, tan, abs/absolute, negative, square), the six comparisons, predicates (isnan/isfinite/isinf), bitwise, invert, arctan2 — each as one NumPy-shaped overload, every rule pinned against NumPy 2.4.2:

  • out joins the broadcast but never stretches (mismatched/stretchable out raise NumPy's exact texts, trailing space included); loop dtype resolved from inputs (NEP50), out only needs a same_kind cast; the provided instance is returned (reference identity).
  • where must be exactly bool (mask cast under 'safe'); it broadcasts over operands and participates in output shape; mask-false slots keep prior out contents.
  • out aliasing an input is well-defined via COPY_IF_OVERLAP: add(x[:-1], x[:-1], out=x[1:]) matches NumPy exactly.
  • dtype= computes in the loop dtype (subtract(300, 5, dtype=i16) = 295), with the bool add→OR / multiply→AND remap keyed off the final loop dtype so add(True, True, dtype=i32) = 2.
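
Why the aliasing case needs overlap protection can be seen in a pure-Python model of add(x[:-1], x[:-1], out=x[1:]). This is a sketch of the hazard only, with hypothetical function names; the real mechanism is the mem_overlap solver described in §1, and the snapshot below merely stands in for the temporary copy it triggers.

```python
def add_shifted_naive(x):
    """out = x[1:], inputs = x[:-1], all aliasing one buffer, no protection."""
    for i in range(len(x) - 1):
        # Each write lands in a slot that a later iteration reads as input.
        x[i + 1] = x[i] + x[i]
    return x


def add_shifted_copy_if_overlap(x):
    """Same operation, but the overlapping input is snapshotted first."""
    inputs = x[:-1]                      # list slicing copies the elements
    for i, value in enumerate(inputs):
        x[i + 1] = value + value
    return x


corrupted = add_shifted_naive([1, 2, 3, 4])
expected = add_shifted_copy_if_overlap([1, 2, 3, 4])
```

The naive loop produces [1, 2, 4, 8] because each result feeds the next read; the protected version produces [1, 2, 4, 6], which is what doubling the original x[:-1] should give.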

6. Linear algebra

  • Stride-native GEMM for all 12 numeric dtypes — BLIS-style GEBP with stride-aware packers; the 8×16 Vector256 FMA micro-kernel reads packed panels, so transposed/sliced inputs cost nothing extra. Eliminates the ~100× fallback penalty (np.dot(x.T, grad): 240 ms → ~1 ms) and the boxing GetValue fallback chain.
  • Full matmul gufunc semantics — batched stacking, 1-D promotion/squeeze rules, validated by a dedicated differential matrix (816 cases).
  • Fused single-pass 1-D dot — 3.5–9× faster, zero GC (was up to 446 gen-0 collections per call at 100K).
  • np.multithreading — opt-in parallel 1-D dot: 1M float dot 172 → 60 µs, ~2× faster than NumPy's default build. Off by default; bitwise-identical summation order when off.

7. Performance (beyond NpyIter and linalg)

| Op | Improvement |
| --- | --- |
| Axis reductions, narrow ints | Widening SIMD (int16→int32 accum etc.): sum(int16, axis=1) 1058 ms → 2.7 ms (389×, now faster than NumPy); int32/uint32 2.3–4.6×; also fixes a uint32 axis-sum corruption bug |
| mean (axis) | 217× (Phase-0 bug surgery); var/std 21×; count_nonzero 20× |
| np.nonzero | IL SIMD kernel closes an 8–241× gap to NumPy |
| np.where | IL kernels for scalar-broadcast & non-contiguous (1.2–2× NumPy on broadcast conditions) |
| Strided 1-D unary | Fused strided-SIMD kernel: 0.55 ns/elem flat, beats NumPy at every size; strided sqrt reached parity via gather→tile→SIMD buffering |
| Strided flat reductions | Incremental-advance path: strided sum 8.3× faster (11.8× behind NumPy → 1.4×) |
| Comparisons | PDEP-based packed mask→bool store; broadcast/strided compares routed via NpyIter |
| Axis-0 reductions | Column-tiled accumulation (breaks the output RAW dependency); 8× pairwise unrolled flat reductions |
| Allocation | tcache-style size-bucketed buffer pool with a 1 B – 64 MiB window (covers both the small-N ufunc result and 4M+ outputs that previously paid a fresh VirtualAlloc + demand-zero faults); ≥1 MiB buckets capped at 2 buffers; pool-side GC memory pressure tracking live state; GC.SuppressFinalize on free; using/ARC adopted across concatenate, allclose, convolve, tile, eye, masking, shuffle, … |
| Casts | NumPy-faithful SIMD float→int32 (cvtt), strided/reversed/gathered variants; astype cross-dtype routed through NpyIter KEEPORDER copy |
| np.split family | O(1) sub-shape derivation, direct views; 1.5–4× faster than NumPy |
| Where/copyto/searchsorted/unique | see §5 |

8. Official benchmark suite + honest methodology

  • New cross-platform run_benchmark.py entry point: BenchmarkDotNet Full rigor (50 iters, InProcessEmit) × all suites × {1K, 100K, 10M} vs NumPy 2.x — 1,813 C# measurements, 1,111 matched op×dtype×size comparisons, structural op-name join, tracked markdown report + per-suite artifacts + history snapshots. Coverage spans all 15 dtypes (SByte/Half/Complex suites added).
  • Headline: geomean NumSharp÷NumPy = 1.00× at N=10M (166 ops faster / 171 close / 36 slower) — parity across the whole op surface at memory-bound sizes; ~1.9× at 1K where per-call dispatch dominates (tracked as the next focus).
  • Found and neutralized a benchmark-invalidating tooling bug: dotnet run file-based apps compile the project reference in Debug (optimizations off) even with Configuration=Release properties — hand loops measured ~2× slow while DynamicMethod IL was immune. Benchmarks now assert IsJITOptimizerDisabled == false and refuse to mislead; the rule is documented.
  • Canonical NpyIter benchmark — a section-addressable harness covering 33 op families × {scalar/1K/100K/1M/10M}, integrated into run_benchmark.py, plus a post-release CI workflow (.github/workflows/benchmark.yml) that auto-commits report cards to master.
  • Honest frontier findings — adversarial probes record losses, not just wins: np.sum over a broadcast_to view 54× slower than NumPy (a coordinate-walking general path at 7.4 ns/elem), scalar np.any 14.5× slower (scalar scan where count_nonzero on the same array runs SIMD), a BUFFERED+REDUCE ForEach P0 crash (pinned/skipped repro — only the buffered-reduce driver handles that config), and iterator ALLOCATE zeroing outputs where NumPy leaves them empty (+2.33 ms/4M). A win too: hand-rolled 8-band parallel iteration 4.7×. All tracked as the next optimization frontier.

9. Differential fuzzing vs NumPy (new infrastructure)

  • 37,445 bit-exact corpus cases across 24 JSONL tiers generated from real NumPy 2.4.2 outputs: casts (full 15×15 matrix), binary arith (NEP50), div/mod/power, comparisons, unary (incl. float16 inputs + all narrow ints), reductions, NaN-aware reductions, cumulative, statistics, logic/extrema, bitwise+shift, where/place, manipulation, matmul, modf multi-output, sorting/searching, parameter sweeps, SIMD-tail boundaries (900 cases around vector-width edges), operand aliasing, and error-parity (exception-for-exception).
  • Seeded random fuzzer with an element-wise shrinker for minimal repros; metamorphic invariant tier (11 algebraic properties).
  • CI integration: FuzzMatrix gate wired into the build workflow + a new nightly fuzz-soak workflow (.github/workflows/fuzz-soak.yml).
  • Findings inventoried in docs/FUZZ_FINDINGS.md; every fixed class re-armed as a permanent regression gate. The error-parity tier alone surfaced 1 critical crash; the op tiers surfaced 17+ distinct bug classes that are now fixed (see §10).
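
The seeded-fuzzer-plus-shrinker pattern can be sketched as follows. Everything here is illustrative: the "candidate" is a deliberately buggy max, the reference is Python's built-in, and the shrinker removes one element at a time, which is an assumption about what "element-wise" means rather than a description of the PR's fuzzer.

```python
import random


def reference_max(values):
    return max(values)


def candidate_max(values):
    """Deliberately buggy: starts from 0, so all-negative inputs come out wrong."""
    best = 0
    for value in values:
        if value > best:
            best = value
    return best


def fails(values):
    return candidate_max(values) != reference_max(values)


def shrink(values):
    """Drop one element at a time for as long as the mismatch still reproduces."""
    values = list(values)
    shrunk = True
    while shrunk:
        shrunk = False
        for i in range(len(values)):
            trial = values[:i] + values[i + 1:]
            if trial and fails(trial):
                values, shrunk = trial, True
                break
    return values


def fuzz(seed, rounds=1000):
    """Seeded, so a failing run can be replayed exactly."""
    rng = random.Random(seed)
    for _ in range(rounds):
        values = [rng.randint(-5, 5) for _ in range(rng.randint(1, 6))]
        if fails(values):
            return shrink(values)
    return None


minimal_repro = fuzz(seed=1234)
```

For this particular bug every failing input shrinks to a single negative number, which is the kind of minimal repro that can be checked in as a permanent regression case.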

10. Correctness — NumPy-parity bug fixes

Semantics (behavioral changes, may affect callers):

  • floor_divide / mod: NumPy-exact floored semantics and divide-by-zero results.
  • Comparisons: <= / >= now return False for NaN (IEEE/NumPy).
  • Flat min/max propagate NaN.
  • np.negative(uint) wraps modulo 2ⁿ instead of throwing; bool - bool and -bool/np.negative(bool) now throw (NumPy behavior).
  • Transcendental ufuncs use NEP50 width-based float promotion.
  • np.power: negative integer exponent raises ValueError; exact integer-power semantics.
  • Cast semantics aligned with NumPy across all dtype pairs (IL kernels + ConvertValue); complex→bool no longer drops the imaginary part; float→int SIMD uses truncation (cvtt) like NumPy.
  • Broadcasting keeps rank when a 1-D [1] meets a lower-rank operand; quantile-family dtype & bool handling; Complex np.where.
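
Three of the behavioral changes above can be reproduced with plain Python arithmetic, which follows the same floored-division and IEEE comparison rules. The helper names are hypothetical and the uint8 width is chosen only as an example.

```python
import math


def truncated_divmod(a, b):
    """Round-toward-zero division, as in C-family integer division."""
    quotient = math.trunc(a / b)
    return quotient, a - quotient * b


def floored_divmod(a, b):
    """Round-toward-negative-infinity division (floor_divide / mod semantics)."""
    quotient = math.floor(a / b)
    return quotient, a - quotient * b


trunc_result = truncated_divmod(-7, 2)   # quotient rounds up toward zero
floor_result = floored_divmod(-7, 2)     # remainder takes the divisor's sign

# Every ordered comparison against NaN is False, including <= and >=.
nan = float("nan")
nan_le, nan_ge = nan <= 1.0, nan >= 1.0

# Negating an unsigned value wraps modulo 2**n (n = 8 for a uint8).
negated_uint8 = (-5) % 2**8
```

The floored pair is (-4, 1) where the truncated pair is (-3, -1), so callers that relied on truncation for negative operands will see different results.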

Crashes & corruption:

  • Overlapping-operand corruption eliminated iterator-wide (COPY_IF_OVERLAP, §1).
  • Masked iteration: a buffered WRITEMASKED write landed garbage in exactly the slots NumPy preserves (silent corruption of the elements the caller asked to protect) — now writes back only mask-nonzero elements.
  • uint32 axis-sum produced wrong values past 8 distinct columns (widening-SIMD rewrite).
  • np.pad: 5 correctness/crash bugs (battle-tested against NumPy 2.4.2); linear_ramp preserved Complex dtype.
  • UnmanagedStorage/ArraySlice: CopyTo direction + bounds bugs; CloneData partial-buffer bug; scalar offset lost on Clone; buffered NpyIter.Clone shared buffers; DTypeSize reported Marshal.SizeOf instead of in-memory stride; NPTypeCode.Char.SizeOf returned 1 (real: 2); stale Decimal priority.
  • TensorEngine now propagates through Cast/Transpose/copy/reshape/ravel (custom engines were silently dropped).
  • take with out= enforces NumPy's safe-cast direction; put/place non-contiguous writeback fixes; argsort on non-C-contiguous input.
  • NpyIter ForEach/ExecuteGeneric/ExecuteReducing read past the end without EXTERNAL_LOOP.

11. Memory management — ARC + IDisposable

  • NDArray now implements IDisposable backed by atomic reference counting on the unmanaged block: CAS-driven TryAddRef/Release, idempotent Dispose, finalizer safety net, immortal non-owning wraps. Views keep parents alive; parent disposal never invalidates live views.
  • Hammered by a 15-case lifecycle suite incl. 32-thread × 1,000-op concurrency races and 50-way parallel dispose — zero corruption.
  • Deterministic release means hot loops no longer wait on the finalizer queue; combined with the buffer pool this removes most steady-state GC pressure (dot at 100K: 446 collections → 0).
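
The lifetime rules in this section (views keep the parent's block alive, Dispose is idempotent, a released block cannot be resurrected) can be modelled compactly. This is a sketch under stated assumptions: the class and method names are hypothetical, and a lock stands in for the CAS loop on the counter.

```python
import threading


class RefCountedBlock:
    """Stand-in for an unmanaged block guarded by a reference count."""

    def __init__(self):
        self._lock = threading.Lock()   # models the atomic compare-and-swap
        self._count = 1                 # the creating owner holds one reference
        self.freed = False

    def try_add_ref(self):
        with self._lock:
            if self._count == 0:        # already released: never resurrect
                return False
            self._count += 1
            return True

    def release(self):
        with self._lock:
            self._count -= 1
            if self._count == 0:
                self.freed = True       # the native free would happen here


class Array:
    def __init__(self, block=None):
        if block is None:
            block = RefCountedBlock()
        elif not block.try_add_ref():
            raise ValueError("underlying block was already freed")
        self._block = block
        self._disposed = False

    def view(self):
        return Array(self._block)

    def dispose(self):
        if not self._disposed:          # a second call is a no-op
            self._disposed = True
            self._block.release()


parent = Array()
child = parent.view()
parent.dispose()
alive_after_parent = not parent._block.freed
child.dispose()
freed_after_child = parent._block.freed
```

Disposing the parent first leaves the block alive because the view still holds a reference; the memory is released only when the last holder lets go.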

12. Char8 primitive

New 1-byte character type (NumSharp.Char8) — the NumPy S1/Python bytes(1) equivalent — with conversions, operators, span helpers, and 100% Python bytes API parity validated against a Python oracle. Vendored .NET ASCII/Latin-1 reference sources under src/dotnet/ document the upstream implementations it mirrors.

13. Examples — trainable MNIST MLP

New examples/NeuralNetwork.NumSharp: a 2-layer MLP with a naive implementation and a fused one (single-NpyIter bias+ReLU fusion, fused softmax-cross-entropy backward, Adam optimizer). Originally needed a "copy transposed views before np.dot" workaround (31× training speedup at the time); the stride-native GEMM (§6) made the workaround unnecessary. Converges to >99% test accuracy in the bundled demo.

14. Kernel architecture & hygiene

  • ILKernelGenerator split into DirectILKernelGenerator (legacy whole-array kernels, 51 partials under Kernels/Direct/) and ILKernelGenerator (NpyIter-driven per-chunk kernels — the target model matching NumPy's PyUFuncGenericFunction); migration path documented per kernel family.
  • All Vector128/256/512 and Math/MathF reflection centralized in VectorMethodCache / ScalarMethodCache; IL-emitted typed-field copier replaces the UnmanagedStorage.Alias switch.
  • 24 dead kernel methods poisoned with [Obsolete(error: true)] pending deletion; dead axis-reduction SIMD paths removed.

15. Documentation

  • NpyIter/NDIter book: docs/website-src/docs/NDIter.md (7-technique quick reference, decision tree, memory model, gotchas) + ndarray.md.
  • DocFX website — Benchmarks vs NumPy: benchmarks.md (head-to-head evidence companion to the IL-generation page), benchmark-iterator.md, benchmark-matrix.md, driven by the auto-committed report artifacts.
  • Engineering ledgers: PERF_LEDGER.md (every optimization with before/after), ROADMAP.md, NPYITER_GAPS_AND_ROADMAP.md (gap analysis vs NumPy 2.4.2 + prioritized roadmap), NPYITER_PARITY_ANALYSIS.md, NPYITER_PERF_HANDOVER.md, MIGRATE_NPYITER.md, IL-kernel playbook + rulebook, fuzz findings/coverage/next-plan.
  • Branch quality audits V1+V2 (8 chapter reviews under docs/plans/audit_v2/) with every Tier-1 finding either fixed or reproduced as an [OpenBugs] test.

16. Tests & CI

  • +2,500 test methods; suite now 9,709 passed / 0 failed on net10.0 (also green on net8.0). Zero regressions maintained commit-by-commit.
  • New suites: np.evaluate (per-node wraparound, dtype matrices, weak scalars + overflow, fused-vs-unfused, out= identity/cast/aliasing, fused reductions), out=/where=/dtype= parity suites (broadcast/cast/error-text pins), WRITEMASKED/VIRTUAL parity; NpyIter battletests (566 scenarios), order-support sections 41–51, ARC lifecycle, clone regression, np.pad/average/median/percentile/ptp/diff battle tests, IL-kernel battle tests, behavioral audit harness.
  • CI: fuzz gate in build-and-release.yml, nightly fuzz-soak.yml, new post-release benchmark.yml (auto-commits NumPy-comparison report cards to master).
  • Known gaps stay visible: np.sort remains unimplemented ([OpenBugs]); the frontier benches' broadcast-reduce (54×), scalar np.any (14.5×) losses and the BUFFERED+REDUCE ForEach P0 crash (pinned/skipped repro) are documented as the next optimization frontier; small-N (~1K) dispatch overhead remains the headline focus (docs/ROADMAP.md). Every open issue found by the audits/fuzzers/benches is checked in as a failing-by-design [OpenBugs] test or pinned repro rather than ignored.

Breaking changes

| Change | Impact | Migration |
| --- | --- | --- |
| bool - bool, -bool, np.negative(bool) now throw | Matches NumPy | Use ^ / cast to int first |
| NaN <= / >= returns False | Matches IEEE & NumPy | Use np.isnan explicitly |
| floor_divide/mod divide-by-zero & floored results | Matches NumPy | |
| np.negative(uint) wraps instead of throwing | Matches NumPy | |
| np.power(int, negative int) raises ValueError | Matches NumPy | Use float exponents |
| Cast edge cases (overflow/NaN/complex→bool/float→int truncation) | Matches NumPy | |
| Transcendental ufuncs: NEP50 width-based promotion | Return dtype may change | |
| np.clip/quantile-family dtype promotion | Return dtype may change | |
| Broadcast views are read-only; broadcasting keeps rank for 1-D [1] | Matches NumPy | .copy() to write |
| MultiIterator removed; NDIterator is now an NpyIter facade | Internal API | Use NpyIter / NpyIter.Copy |
| NpyIter: MaxOperands=8 and 64-dim limits removed | None (loosening) | |
| np.copyto unwriteable-destination error type corrected | Exception type change | |

Everything above was validated against NumPy 2.4.2 ground truth — by 37k differential corpus cases, 566 iterator parity scenarios, and per-feature battle tests run on actual NumPy output.

Nucs added 4 commits April 22, 2026 23:41
Replaces the lazy-but-standalone ValueOffsetIncrementor path with one
that constructs an NpyIter state and drives MoveNext / HasNext / Reset
directly off that state. NDIterator is now an honest thin wrapper
over NpyIter — the same traversal machinery used by all the Phase 2
production call sites — rather than reimplementing the coord-walk
logic with legacy incrementors.

How it works
------------
- ctor calls NpyIterRef.New(arr, NPY_CORDER) to build the state, then
  transfers ownership of the NpyIterState* pointer out of the ref
  struct (see NpyIterRef.ReleaseState / FreeState below). The class
  holds that pointer for its lifetime and frees it in Dispose (or in
  the finalizer as a safety net).
- MoveNext reads `*(TOut*)state->DataPtrs[0]` then calls
  `state->Advance()`. IterIndex tracks position, IterEnd bounds the
  non-AutoReset case, and `state->Reset()` restarts from IterStart on
  AutoReset wraparound and on explicit Reset.
- Cross-dtype wraps the same read with a Converts.FindConverter<TSrc, TOut>
  lookup — one switch at construction picks the typed helper, so the
  per-element hot path is still just one read + one converter delegate
  call. MoveNextReference throws when casting is in play, matching the
  legacy contract.
- NPY_CORDER is explicit so iterating a transposed view yields the
  logical row-major order the old NDIterator provided. Without it,
  KEEPORDER would give memory-efficient order (which e.g.
  `b.T.AsIterator<int>()` would surface as `0 1 2 ... 11` instead of
  the expected `0 4 8 1 5 9 2 6 10 3 7 11`).
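
The two orders mentioned in the last bullet can be reproduced with a small stride walk. This is a sketch only: it assumes `b` is a 3x4 array holding 0..11 (consistent with the sequences quoted above), counts strides in elements, and approximates the KEEPORDER walk by sorting axes by descending stride; the function names are hypothetical.

```python
from itertools import product


def offsets_c_order(shape, strides):
    """Buffer offsets visited in logical row-major (C) order."""
    for index in product(*(range(extent) for extent in shape)):
        yield sum(i * s for i, s in zip(index, strides))


def offsets_memory_order(shape, strides):
    """Offsets visited with the largest-stride axis outermost."""
    axes = sorted(range(len(shape)), key=lambda axis: -abs(strides[axis]))
    yield from offsets_c_order([shape[a] for a in axes],
                               [strides[a] for a in axes])


buffer = list(range(12))                  # b: shape (3, 4), strides (4, 1)
t_shape, t_strides = (4, 3), (1, 4)       # b.T: same buffer, axes swapped

logical = [buffer[o] for o in offsets_c_order(t_shape, t_strides)]
memory = [buffer[o] for o in offsets_memory_order(t_shape, t_strides)]
```

`logical` is the row-major sequence a caller of the old NDIterator expects; `memory` walks the buffer front to back and loses the transposition.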

NpyIter additions
-----------------
- NpyIterRef.ReleaseState(): hand the owned NpyIterState* to a caller
  who needs it across a non-ref-struct boundary (e.g. a class field).
  Marks the ref struct as non-owning so its Dispose is a no-op.
- NpyIterRef.FreeState(NpyIterState*): static tear-down mirror of
  Dispose's cleanup path — frees buffers (when BUFFER set), calls
  FreeDimArrays, and NativeMemory.Free's the state pointer. The
  long-lived owner calls this from its own Dispose/finalizer.

Bug fixes along the way
-----------------------
NpyIter initialization previously computed base pointers as
`(byte*)arr.Address + (shape.offset * arr.dtypesize)` in two places
(initial broadcast setup on line 340 and ResetBasePointers on line
1972). `arr.dtypesize` goes through `Marshal.SizeOf`, which returns 4
for bool because bool is marshaled as a Win32 BOOL, but the in-memory
`bool[]` storage is 1 byte per element. For strided bool arrays this
produced a base pointer 4× too far into the buffer.

Switched both sites to `arr.GetTypeCode.SizeOf()` which returns the
actual in-memory size (1 for bool). Surfaced by `Boolean_Strided_Odd`
once NDIterator started routing through NpyIter; the bug was previously
latent because the legacy NDIterator path computed offsets in
element units, not bytes, and sidestepped the NpyIter init.
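
The effect of the wrong element size is easy to model. The sketch below uses a Python `bytes` object as the 1-byte-per-element bool storage and a hypothetical `byte_offsets` helper; it illustrates the arithmetic only and is not the NpyIter code.

```python
# bool[] storage: one byte per element, T,F,T,F,T,F,T,F,T
storage = bytes([1, 0, 1, 0, 1, 0, 1, 0, 1])


def byte_offsets(start, step, count, itemsize):
    """Byte offset of each element of the slice start::step."""
    return [(start + i * step) * itemsize for i in range(count)]


# full["::2"] selects element indices 0, 2, 4, 6, 8.
correct = byte_offsets(0, 2, 5, itemsize=1)
wrong = byte_offsets(0, 2, 5, itemsize=4)    # the marshaled size of bool

values = [bool(storage[offset]) for offset in correct]
out_of_bounds = [offset for offset in wrong if offset >= len(storage)]
```

With the marshaled size, three of the five computed offsets fall outside the 9-byte buffer, which in unmanaged code means reading unrelated memory rather than raising an error.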

Test impact: 6,748 / 6,748 passing on net8.0 and net10.0 (CI filter:
TestCategory!=OpenBugs&TestCategory!=HighMemory). Smoke tests of
same-type contig / cross-type / strided / transposed / broadcast /
AutoReset / Reset / foreach all produce the expected element sequences.

`UnmanagedStorage.DTypeSize` (exposed via `NDArray.dtypesize`) was
delegating to `Marshal.SizeOf(_dtype)`. For every numeric dtype that
matches, but for bool, `Marshal.SizeOf(typeof(bool)) == 4` because bool
is marshaled to win32 BOOL (32-bit). The in-memory layout of `bool[]`
is 1 byte per element, so every caller computing a byte offset as
`ptr + index * arr.dtypesize` was reading/writing 4× too far into the
buffer for bool arrays.

Switches to `_typecode.SizeOf()` which correctly returns 1 for bool and
matches `Marshal.SizeOf` for every other type. 21 existing call sites
(matmul, binary/unary/comparison/reduction ops, nan reductions, std/var,
argmax, random shuffle, boolean mask gather, etc.) now get the right
value without any downstream change.

The bug had been latent until the Phase 2 iterator migration started
routing more code paths through NpyIter.Copy and the new NDIterator
wrapper; it surfaced most visibly as `sliced_bool[mask]` returning the
wrong elements when the source was non-contiguous. With the root fix:

    var full = np.array(new[] { T,F,T,F,T,F,T,F,T });
    var sliced = full["::2"];            // [T,T,T,T,T] non-contig
    var result = sliced[new_bool_mask];  // now correct per-element

np.save.cs already special-cases bool before falling through to
`Marshal.SizeOf`, so serialization was unaffected. Remaining
Marshal.SizeOf references in the codebase are either in comments that
explain this exact issue, or in the `InfoOf<T>.Size` fallback that
only runs for types outside the 12 supported dtypes (e.g. Complex).

Tests: 6,748 / 6,748 passing on net8.0 and net10.0 with the CI filter
(TestCategory!=OpenBugs&TestCategory!=HighMemory).

- Delete 4 NPYITER analysis docs (audit, buffered reduce, deep audit,
  numpy differences) — information consolidated into codebase
- Delete 3 NDIterator.Cast files (Complex, Half, SByte) — casting now
  handled by unified NDIterator<T> backed by NpyIter state
- Update NDIterator.cs: minor adjustments from NpyIter backing refactor
- Update ILKernelGenerator.Scan.cs: scan kernel changes
- Update Default.MatMul.Strided.cs: add INumber<T> constraint support
  for generic matmul dispatch preparation
- Update Default.ClipNDArray.cs: initial NpFunc dispatch refactoring
  replacing 6 switch blocks (~84 cases) with generic dispatch methods
- Update np.full_like.cs: minor fix
- Update RELEASE_0.51.0-prerelease.md release notes

…neric dispatch

NpFunc is a reflection-cached generic dispatch utility that bridges
runtime NPTypeCode values to compile-time generic type parameters.
Hot path (cache hit) runs at ~32ns via Delegate[] array indexed by
NPTypeCode ordinal. Cold path uses MakeGenericMethod + CreateDelegate,
cached after first call per (method, typeCode) pair.

Core NpFunc changes:
- Dynamic table sizing: Delegate[] sized from max NPTypeCode enum value
  (was hardcoded [32], broke for NPTypeCode.Complex=128)
- Overloads for 0-6 args × void/returning × 1-3 NPTypeCodes + 1-2 Types
- SmartMatchTypes for multi-type dispatch (1→broadcast, N=N→positional,
  M<N→type-identity matching)
- Per-arity ConcurrentDictionary caches for multi-type dispatch
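
The table-sizing fix in the first bullet can be modelled with a flat list indexed by the enum value. In this sketch only COMPLEX = 128 comes from the text above; the other ordinals, the `TypeCode` name and `build_table` are hypothetical.

```python
from enum import IntEnum


class TypeCode(IntEnum):
    # Illustrative ordinals; only COMPLEX = 128 is taken from the commit text.
    BOOLEAN = 3
    INT32 = 9
    DOUBLE = 14
    COMPLEX = 128


def build_table(handlers, size=None):
    """Flat dispatch table indexed by type-code ordinal."""
    if size is None:
        size = max(TypeCode) + 1         # sized from the largest enum value
    table = [None] * size
    for code, handler in handlers.items():
        table[code] = handler            # IndexError if the table is too small
    return table


handlers = {TypeCode.INT32: int, TypeCode.DOUBLE: float, TypeCode.COMPLEX: complex}
table = build_table(handlers)

try:
    build_table(handlers, size=32)       # the old hardcoded size
    fixed_size_ok = True
except IndexError:
    fixed_size_ok = False
```

Sizing from the largest enum value costs a few unused slots but keeps the hot path a single array index with no bounds special-casing.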

Files refactored (12 files, ~400 cases eliminated):

Previous session (5 files, ~196 cases):
- Default.ClipNDArray.cs: 6 dispatch methods for contiguous/general clip
- Default.Clip.cs: 3 dispatch methods for scalar clip with ChangeType
- Default.NonZero.cs: 3 dispatch methods for nonzero/count operations
- Default.BooleanMask.cs: 1 dispatch method for masked copy
- Default.Shift.cs: 2 dispatch methods for array/scalar shift

This session (7 files, ~202 cases):
- NDIteratorExtensions.cs: 5 overloads → 5 dispatch methods creating
  NDIterator<T> from NDArray/UnmanagedStorage/IArraySlice
- Default.Reduction.CumAdd.cs: axis dispatch via CumSumAxisKernel<T>,
  elementwise via IAdditionOperators<T,T,T> with default(T) init
- Default.Reduction.CumMul.cs: axis dispatch via CumProdAxisKernel<T>,
  elementwise via IMultiplyOperators + T.MultiplicativeIdentity init
- np.where.cs: iterator fallback + IL kernel dispatch via pointer cast
- np.random.randint.cs: int/long fill via INumberBase<T>.CreateTruncating
- NDArray.NOT.cs: IEquatable<T>.Equals(default) unifies bool NOT and
  numeric ==0 comparison into single generic method
- Default.LogicalReduction.cs: direct dispatch to ExecuteLogicalAxis<T>

Net: 1,243 lines removed across 12 files, replacing repetitive per-type
switch cases with single generic dispatch methods.
Complex does not implement IComparable<T>, so NpFunc.Invoke into
ClipArrayBoundsDispatch/ClipArrayMinDispatch/ClipArrayMaxDispatch
crashed with ArgumentException on MakeGenericMethod.

Fix: add NPTypeCode.Complex pre-checks in ClipNDArrayContiguous,
ClipNDArrayGeneral, and ClipCore that route to dedicated Complex
clip methods using lexicographic comparison (real first, then imag).
NaN handling preserves the NaN-containing element as-is (not
replaced with NaN+NaN*i), matching NumPy np.maximum/np.minimum
behavior where "NaN wins" but the original value is returned.

Half NaN propagation: ILKernelGenerator.ClipArrayBoundsScalar,
ClipArrayMinScalar, ClipArrayMaxScalar fell through to the generic
CompareTo path for Half. .NET's CompareTo orders NaN below every other
value, so NaN lost the comparison instead of propagating. Added
Half-specific scalar methods that check Half.IsNaN explicitly before
comparison.
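
A minimal Python sketch of the two comparison rules described above, written as an illustration rather than as NumSharp's implementation. `maximum_real` checks for NaN before comparing, and `maximum_complex` compares lexicographically (real first, then imaginary) and returns the NaN-containing element unchanged. Returning the first operand when both contain NaN is an assumption; the text does not specify that case.

```python
import math


def maximum_real(a: float, b: float) -> float:
    """NaN-propagating maximum: test for NaN before any ordering comparison."""
    if math.isnan(a):
        return a
    if math.isnan(b):
        return b
    return a if a >= b else b


def has_nan(z: complex) -> bool:
    return math.isnan(z.real) or math.isnan(z.imag)


def maximum_complex(a: complex, b: complex) -> complex:
    """Lexicographic maximum (real first, then imaginary).

    A NaN-containing element wins and is returned as-is, not rewritten
    to nan+nanj. Preferring `a` when both contain NaN is an assumption.
    """
    if has_nan(a):
        return a
    if has_nan(b):
        return b
    return a if (a.real, a.imag) >= (b.real, b.imag) else b
```

For example, `maximum_complex(complex(float("nan"), 3.0), 9 + 9j)` returns the first operand, so its imaginary part is still `3.0`.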

Also fix NpFunc table sizing: Delegate[] was hardcoded to [32] but
NPTypeCode.Complex=128 caused IndexOutOfRangeException. Now computed
dynamically from max NPTypeCode enum value at static init.

Fixes 14 test failures (12 Complex clip/maximum/minimum constraint
violations, 2 Half NaN propagation in maximum).
Nucs added 23 commits May 13, 2026 09:14
…ast paths

Replaces the broken `PowerInteger` fast-path (which crashed on sliced/broadcast
arrays via `Unsafe.Address`) with a stride-aware integer power emitted by the
existing IL kernel infrastructure. Adds NumPy's "Integers to negative integer
powers are not allowed." ValueError, fast paths for scalar exponents {0,1,2,
0.5,-1.0}, and switches f32 to single-precision `MathF.Pow` (no f64 round-trip).

Audit-v2 tickets resolved:
- T1.3a — np.power(sliced_int32, int32) no longer crashes
- T1.3b — np.power(broadcast_int32, int32) no longer crashes
- T1.36 — int**(-int) now raises ArgumentException matching NumPy ValueError

What changed
============

NEW: src/NumSharp.Core/Utilities/NpyIntegerPower.cs
  Public exponentiation-by-squaring helpers for all 9 integer NumSharp dtypes
  (sbyte/byte/int16/uint16/char/int32/uint32/int64/uint64) — preserves
  dtype-native wraparound (uint8 ** 8 = 0, 15**15 = 437893890380859375).
  Caller validates non-negative exponent.
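
As an illustration of the technique, here is a small Python sketch of exponentiation by squaring with fixed-width wraparound. `wrap` and `int_pow` are hypothetical names, not the `NpyIntegerPower` API, and the `bits`/`signed` parameters stand in for the per-dtype helpers. As in the text, the caller is assumed to have validated that the exponent is non-negative.

```python
def wrap(value: int, bits: int, signed: bool) -> int:
    """Reduce a Python int to the given fixed width (two's complement if signed)."""
    value &= (1 << bits) - 1
    if signed and value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


def int_pow(base: int, exponent: int, bits: int, signed: bool) -> int:
    """Exponentiation by squaring with wraparound after every multiply.

    The caller guarantees exponent >= 0; no check is made here.
    """
    result = 1
    base = wrap(base, bits, signed)
    while exponent > 0:
        if exponent & 1:                      # this bit contributes a factor
            result = wrap(result * base, bits, signed)
        base = wrap(base * base, bits, signed)  # square for the next bit
        exponent >>= 1
    return result
```

With these definitions, `int_pow(2, 8, 8, False)` wraps to `0` in 8 unsigned bits, and `int_pow(15, 15, 64, True)` gives `437893890380859375`, which fits in 64 signed bits.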

REWRITE: src/NumSharp.Core/Backends/Default/Math/Default.Power.cs
  - Removes the `Unsafe.Address`-based fast-path that crashed on
    sliced/broadcast operands and ignored strides.
  - Adds pre-scan: for `int**int` with signed-int exponent, scans rhs for
    any negative element and throws `ArgumentException("Integers to negative
    integer powers are not allowed.")`. Matches NumPy's unconditional check
    (rejects base ∈ {±1} too, per NumPy spec).
  - Adds scalar-exponent fast paths when `rhs.size == 1`:
      exp = 0   → ones_like(lhs)
      exp = 1   → lhs.copy() (or cast)
      exp = 2   → lhs * lhs (SIMD-optimized Multiply kernel)
      exp = 0.5 → np.sqrt(lhs)
      exp =-1.0 → np.reciprocal(lhs) (float base only)
    Each path verifies the resolved result dtype matches what the IL kernel
    would produce before substituting, so NEP50 promotion is preserved.
  - Falls through to `ExecuteBinaryOp` for the general case, which now
    walks strides correctly via the IL kernel paths.

src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.cs
  - `EmitPowerOperation(il, resultType)`: dispatches to the matching
    `NpyIntegerPower.Pow*` helper for integer result types (replaces the
    `int → double → Math.Pow → int` round-trip that lost precision for
    values outside [-2^53, 2^53]). float32 → `MathF.Pow`; float64 →
    `Math.Pow`; Boolean and other fallthrough types use the original double
    round-trip to keep the kernel verifiable.
  - Cached `MethodInfo` lookups added for all 9 integer power helpers and
    `MathF.Pow`.
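
The precision loss of the removed round-trip can be reproduced in a few lines of Python, whose `float` is an IEEE 754 double. This is only a demonstration of the arithmetic, not of the kernel itself: a double has a 53-bit significand, so an odd integer above 2^53 cannot survive a conversion to double and back.

```python
import math

# Exact integer result, as the integer helpers now compute it.
exact = 15 ** 15

# Emulates the old int -> double -> pow -> int round-trip.
via_double = int(math.pow(15.0, 15.0))

# 15**15 is odd and larger than 2**53, so no double can represent it exactly.
round_trip_is_lossy = via_double != exact

# The smallest positive integer that does not survive the round-trip.
first_lossy = 2 ** 53 + 1
collapsed = int(float(first_lossy))  # rounds to the nearest representable double
```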

src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Binary.cs
  - `EmitPowerOperation<T>(il)` (same-type contiguous kernel path):
    same dispatch as the mixed-type version. Generic `T` is mapped to the
    matching `NpyIntegerPower.Pow*` helper via `GetIntegerPowMethod<T>()`.

src/NumSharp.Core/Backends/Default/Math/DefaultEngine.BinaryOp.cs
  - Updates the Power promotion comment to document NEP50 weak/strict
    behavior accurately (NumSharp matches NumPy in the common cases; the
    one documented misalignment is 0-D integer arrays explicitly constructed
    via `np.array(2, int32)`, which are indistinguishable from C# `int 2`
    after `np.asanyarray`).

Tests
=====

NEW: test/NumSharp.UnitTest/Math/NDArray.power.Comprehensive.Test.cs (24 tests)
  - Integer dtype-native wrapping (uint8/int8/int32 overflow)
  - Stride + broadcast layouts (sliced, broadcast_to, 2D-vs-1D)
  - Signed integer negative exponent throws (incl. base = ±1)
  - Unsigned integer exponent never throws
  - Float special values (0^0, NaN, ±inf base/exp, fractional neg base)
  - NEP50 promotion (f32 ** int{8,16,32}, f64 ** int*, f32 ** scalar)
  - All 9 integer dtypes smoke-tested via 2^3 = 8

UPDATED (removed [Misaligned]): PowerEdgeCaseTests.Power_Integer_LargeValues
  Integer power now preserves exact precision; the test now asserts equality.

UPDATED: NewDtypesCoverageSweep_Arithmetic_Tests.B35_SByte_Power_NegativeExponent*
  Previously documented the wrong (silent 0/±1) behavior; now asserts the
  NumPy-correct ArgumentException.

UPDATED (removed [OpenBugs]):
  - AuditV2_MathReductions.T1_3a_Power_SlicedInt32_ShouldNotCrash
  - AuditV2_MathReductions.T1_3b_Power_BroadcastInt32_ShouldNotCrash
  - AuditV2_ILKernelSimd.T1_36_* (4 tests)

Validation
==========

  cd test/NumSharp.UnitTest
  dotnet test --no-build --framework net10.0 \
    --filter "TestCategory!=OpenBugs&TestCategory!=HighMemory"
  → Passed: 8255, Failed: 0

  dotnet test --no-build --framework net10.0 \
    --filter "FullyQualifiedName~Power"
  → Passed: 129, Failed: 0

Microbench (1M-element float32, x100 iterations):
  power(arr, 2)     121ms  (fast path → mul; matches multiply baseline 117ms)
  power(arr, 0.5)   121ms  (fast path → sqrt)
  power(arr, 2.7)   518ms  (general path via MathF.Pow)

Behavior changes vs. prior NumSharp
===================================
- int**(-int) now throws (was: silently returned 0, 1, or -1).
  Matches NumPy 2.4.2 ValueError exactly.
Adds the iterator-subsystem branch audit documents that drove this
branch's bug-fix and refactor work:

- `NDITER_BRANCH_QUALITY_AUDIT.md` — original (V1) audit walking every
  changed src file and ranking findings by severity (bugs → perf →
  parity gaps → refactors → clean review). Bug catalog includes:
  np.maximum/minimum NaN handling, np.power stride mishandling,
  np.searchsorted incompleteness, np.repeat missing axis, NpyIter
  Iternext+EXLOOP path, nan{mean,std,var} perf, np.argsort LINQ perf,
  linspace/eye boxing.

- `NDITER_BRANCH_QUALITY_AUDIT_V2.md` — V2 (fact-check) audit driven by
  8 parallel agents auditing file-by-file with results verified via
  `python -c` against NumPy 2.4.2 and `dotnet_run` against NumSharp.
  60 of 65 Tier 1 findings confirmed with failing OpenBugs reproducers
  written under `test/NumSharp.UnitTest/AuditV2/AuditV2_*.cs`, plus a
  list of 4 false positives and 4 newly discovered bugs.

- `docs/plans/audit_v2/01..08*.md` — per-batch audit chapters, each
  including: file scope tables (LoC + role), reference to NumPy source,
  reproduction commands, line-precise references, and a findings table
  with severity tags (bug / parity-gap / perf / refactor / clean).
  Chapters cover Iterators, ILKernel+SIMD, Default math/reductions,
  Logic+Shape+Storage, NDArray creation, Manipulation APIs+logic,
  Math ops + selection/sorting/stats, and Casting+random+utilities.

These files are pure documentation and contain no code; they're the
reference material for the bug fixes and tests that follow on the
nditer branch.
Adds the per-batch test classes that the V2 audit fact-check pass
produced to back its Tier 1 findings with concrete failing tests.
Tests are marked `[OpenBugs]` so CI skips them until the underlying
defect is fixed; running them locally with `TestCategory=OpenBugs`
documents each bug's current behavior versus NumPy 2.4.2.

Each test references both the master `NDITER_BRANCH_QUALITY_AUDIT_V2.md`
and the matching `docs/plans/audit_v2/XX_*.md` chapter where the finding
is documented in detail, and includes the file:line of the suspected
defect plus a `python -c` NumPy 2.4.2 expectation.

Test classes added (matching the 6 untracked batches):
- `AuditV2_Iterators.cs` — NpyIter Iternext/EXLOOP issues, buffer refill,
  cast support gaps, NDIterator broadcast strides, etc. (Batch 1).
- `AuditV2_LogicShapeStorage.cs` — Shape mutating indexer on a
  readonly struct, storage and logic op edge cases (Batch 4).
- `AuditV2_NDArrayCreation.cs` — `np.array(NDArray, copy=false)` default
  aliasing, creation API edge cases (Batch 5).
- `AuditV2_ManipulationApis.cs` — np.expand_dims on empty, manipulation
  parity gaps (Batch 6).
- `AuditV2_MathSelectionSorting.cs` — SetIndicesNDNonLinear NIE,
  math/selection/sort bugs (Batch 7).
- `AuditV2_CastingRandomUtilities.cs` — NpFunc/cast/random/utilities
  bugs (Batch 8).

Batches 2 (`AuditV2_ILKernelSimd.cs`) and 3 (`AuditV2_MathReductions.cs`)
already exist on the branch; this commit fills the remaining 6.
Build is verified to pass with the new files included.
Updates `.claude/CLAUDE.md` so the project instructions match the code's
current state:

- "C-order only" entry replaced with "Order-aware layout": Shape tracks
  F-contiguity, and APIs with an `order` parameter resolve NumPy
  `C`/`F`/`A`/`K` through `OrderResolver`. Verified by:
    - `Shape.IsFContiguous` flag (`View/Shape.cs:115-118`)
    - `Shape.Order` property (`View/Shape.cs:437`)
    - F-aware construction (`View/Shape.cs:160`)
- `F_CONTIGUOUS` entry in the flags table updated from "Reserved" to
  "Data is column-major contiguous" (matches `ArrayFlags.F_CONTIGUOUS`
  bit `0x0002` in `View/Shape.cs:24`).
- Added `IsFContiguous — O(1) check via F_CONTIGUOUS flag` to the
  key Shape properties list.
- Missing Functions count corrected from 19 → 18 and `np.where` removed
  from the Selection gap because `APIs/np.where.cs` implements it; new
  `### Selection` section under "Supported np.* APIs" lists `where`.
- Iterators path updated from `MultiIterator.cs` to `NpyIter.cs` and
  `NpyExpr.cs` (verified — `MultiIterator` no longer exists; only
  `NDIterator`, `NpyIter`, `NpyExpr` are present in `Backends/Iterators`).
- Q&A entries for NDIterator and NpyIter rewritten to match the current
  legacy-wrapper / NumPy-aligned multi-operand iterator split.

Pure documentation change — no behavioral impact.
…per / Memory block

Multiple `CopyTo` overloads in the unmanaged memory layer were calling
`Buffer.MemoryCopy(...)` with source and destination swapped. The BCL
signature is `MemoryCopy(void* source, void* destination, long
destinationSizeInBytes, long sourceBytesToCopy)`, but the existing code
passed the destination pointer first. As a result, data was copied *from
the destination buffer into the source slice*, silently corrupting the
caller's source data instead of populating the destination.
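
The effect of the swapped arguments can be modeled in Python with byte arrays. `memory_copy` below is a hypothetical stand-in that only mirrors the source-first argument order of `Buffer.MemoryCopy`; it is not a binding to the BCL method.

```python
def memory_copy(source, destination, destination_size, bytes_to_copy):
    """Hypothetical stand-in: source first, destination second, then the two sizes."""
    if bytes_to_copy > destination_size:
        raise ValueError("bytes_to_copy exceeds destination_size")
    destination[:bytes_to_copy] = source[:bytes_to_copy]


# Correct order: the destination receives the source bytes.
src = bytearray(b"\x01\x02\x03\x04")
dst = bytearray(4)
memory_copy(src, dst, len(dst), len(src))
good = (bytes(src), bytes(dst))

# Swapped order (the bug): the zeroed destination overwrites the caller's source.
src2 = bytearray(b"\x01\x02\x03\x04")
dst2 = bytearray(4)
memory_copy(dst2, src2, len(src2), len(dst2))
bad = (bytes(src2), bytes(dst2))
```

In the swapped call the destination stays empty and the source is zeroed, which matches the silent corruption described above.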

ArraySlice`1.cs:
- `TryCopyTo(Span<T>)`, `CopyTo(Span<T>)`, `IArraySlice.CopyTo<T1>(Span<T1>)`,
  `IArraySlice.CopyTo<T1>(UnmanagedSpan<T1>)`: swap source / destination
  pointers so data flows source→destination.
- `CopyTo(IntPtr dst)`: also fix the byte-size argument — previous code
  passed `Count` (element count) for both destination size and bytes to
  copy, leaving non-byte dtypes with under-counted bounds. Replace with
  `Count * ItemLength` for both byte arguments and flip the source /
  destination order.
- `CopyTo(IntPtr dst, long sourceOffset, long sourceCount)`: this
  overload was previously identical to `CopyTo(IntPtr dst)` (ignored the
  offset arguments). Add `sourceOffset` / `sourceCount` bounds checks,
  honour `sourceOffset` when computing the source pointer, and use
  `sourceCount * ItemLength` for the copy.
- `CopyTo(Span<T>, long sourceOffset, long sourceLength)`: previous body
  recursed into itself (`CopyTo(destination, sourceOffset, sourceLength);`)
  causing a stack overflow. Replace with a bounds-checked
  `Buffer.MemoryCopy` from `Address + sourceOffset`.
- `CopyTo(UnmanagedSpan<T>, long sourceOffset, long sourceLength)`:
  same direction swap as above.
- `IArraySlice.CopyTo<T1>(Span<T1>)` / `IArraySlice.CopyTo<T1>(UnmanagedSpan<T1>)`:
  bytes-based comparison (`Count * ItemLength` vs `destination.Length *
  InfoOf<T1>.Size`) instead of element-count comparison, fixing the
  "destination too short" check for reinterpret-cast cases.
- `IArraySlice.Clone<T1>()`: previous code used `UnmanagedMemoryBlock<T1>.
  Copy(Address, Count)` which treats `Count` as the *T1* element count
  while reading from a `T`-element buffer. Now compute the byte size
  and divide by `InfoOf<T1>.Size` so the clone preserves the whole byte
  payload (with a hard error if the byte count is not a multiple of the
  target element size).

UnmanagedHelper.cs:
- `CopyTo(IMemoryBlock src, IMemoryBlock dst, long countOffsetDestination)`:
  validate `countOffsetDestination` against `dst.Count` and ensure the
  source fits in the *remaining* destination capacity. Fix the
  destination-size argument to `(dst.Count - countOffsetDestination) *
  dst.ItemLength` instead of the source byte length (which under-counts
  by the offset when the destination is just big enough).

UnmanagedMemoryBlock`1.cs:
- `CopyTo(UnmanagedMemoryBlock<T> memoryBlock, long arrayIndex)`: swap
  pointers so data is copied source→destination (`memoryBlock.Address +
  arrayIndex` as dst), add null + bounds checks, and use the remaining
  destination capacity for the destination size argument.

All fixes are direct corrections of misuses of `Buffer.MemoryCopy`'s
signature; behavior for legitimate callers now matches the docstrings.
The added regression tests live under
`test/NumSharp.UnitTest/Backends/CloneRegressionTests.cs` (separate
commit) and call each repaired overload to lock the contract in place.
….Clone bugs

Shape.cs:
The `Clone(deep, unview, unbroadcast)` branch logic was inconsistent and
dropped the `offset`/`bufferSize` of scalar views in the most common
call (`Clone()` with default args). Rewrite the cascade so behavior is
predictable:
- Empty shape → `default`.
- Scalar shape:
    - `unview` or `unbroadcast` → return the static `Scalar` (offset=0).
    - Otherwise honour `deep`: copy-constructor preserves both `offset`
      and `bufferSize` for sliced scalar views like `np.arange(10)["5"]`.
- Non-scalar shape:
    - `!deep && !unview && !unbroadcast` → return `this` (the readonly
      struct copy is itself a value-copy).
    - `unview` or `unbroadcast` → `new Shape((long[])dimensions.Clone())`,
      which the constructor fills with C-contiguous strides (no offset).
      This replaces the previous one-off `ComputeContiguousStrides` /
      `deep && !unbroadcast` mixed branches that produced different
      shapes depending on call combination.
    - Plain `deep` → deep copy via the copy constructor.

Old behavior failure: `scalar.Shape.Clone()` on `np.arange(10)["5"]`
returned the canonical `Scalar` shape with `offset == 0`, hiding the
fact that the data lives at index 5. The regression test
`Shape_Clone_PreservesScalarViewOffset` in `CloneRegressionTests`
locks the fix.
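
The rewritten cascade can be sketched as a small Python model. This is an assumption-laden illustration, not the C# struct: the `Shape` fields, the use of `dims=None` for the empty shape, the default argument values of `clone`, and the buffer size given to the un-viewed result are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Shape:
    dims: Optional[Tuple[int, ...]]  # None models the empty/default shape
    strides: Tuple[int, ...] = ()
    offset: int = 0
    buffer_size: int = 0


SCALAR = Shape(dims=(), buffer_size=1)  # canonical scalar, offset 0


def c_strides(dims):
    """C-contiguous strides, in elements, for the given dimensions."""
    strides, step = [], 1
    for d in reversed(dims):
        strides.append(step)
        step *= d
    return tuple(reversed(strides))


def clone(shape, deep=True, unview=False, unbroadcast=False):
    if shape.dims is None:                      # empty shape -> default
        return Shape(dims=None)
    if shape.dims == ():                        # scalar shape
        if unview or unbroadcast:
            return SCALAR
        return replace(shape)                   # keeps offset and buffer_size
    if not deep and not unview and not unbroadcast:
        return shape                            # value copy is enough
    if unview or unbroadcast:                   # fresh C-contiguous shape, no offset
        size = 1
        for d in shape.dims:
            size *= d
        return Shape(tuple(shape.dims), c_strides(shape.dims), 0, size)
    return replace(shape)                       # plain deep copy
```

A scalar view such as `Shape(dims=(), offset=5, buffer_size=10)` keeps `offset == 5` through `clone(view)`, and collapses to `SCALAR` only when `unview` or `unbroadcast` is requested.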

ArrayConvert.cs:
- `Clone(Array sourceArray)` had two issues:
  1. It walked `elementType.IsArray` past the array's actual element
     type, so a jagged `int[][]` was treated as a flat `int[]` and the
     subsequent `Array.Copy` produced wrong results (or threw). Now the
     immediate element type is used, preserving jaggedness.
  2. Arrays with a non-zero lower bound (created via
     `Array.CreateInstance(elementType, lengths, lowerBounds)`) were
     not supported — they fell through to the multi-dim branch with
     all-zero lower bounds. Capture each axis' lower bound and use
     `Array.CreateInstance(elementType, lengths, lowerBounds)` whenever
     the source is multi-dim or has any non-zero lower bound.
- `Clone<T>(T[,,,] sourceArray)` had a `GetLength(4)` typo for what
  should be the fourth (zero-indexed: 3) dimension. `GetLength(4)`
  throws `IndexOutOfRangeException` for any 4-D array. Changed to
  `GetLength(3)`. (Coverage: `CloneRegressionTests
  .ArrayConvert_Clone_FourDimensionalArray_UsesFourthDimensionLength`.)
…ontig

NDArray (`Backends/NDArray.cs`):
- All three `UnmanagedStorage`-based constructors now back-fill the
  engine when storage doesn't already have one, and mirror the chosen
  engine onto `Storage.Engine` so the array and storage stay in sync.
  Previously `Storage.Engine` could be null while the NDArray reported a
  valid `TensorEngine`, leaking back through chained constructors that
  read storage.Engine directly.
- `TensorEngine` setter now propagates the resolved engine to
  `Storage.Engine` so changing the engine on an NDArray cascades to
  storage-side consumers.
- `Clone()` is now `virtual` and uses the property setter (instead of
  the private field) so engine assignment propagates to storage.
  `NDArray<TDType>.Clone()` overrides it to preserve the typed wrapper —
  before this commit, `((NDArray<int>)x).Clone()` returned the
  non-generic NDArray base type, breaking generic callers (see
  `CloneRegressionTests.NDArray_Clone_PreservesGenericRuntimeType`).
- `View`/`GetData(int[])`/`GetData(long[])`/the foreach yield path all
  switch from setting the private `tensorEngine` field to the property,
  so storage gets the engine too.

UnmanagedStorage (`Backends/Unmanaged/UnmanagedStorage.cs`):
- `CreateBroadcastedUnsafe(...)` now copies `storage.Engine` onto the
  new broadcast view.

UnmanagedStorage cloning (`Backends/Unmanaged/UnmanagedStorage.Cloning.cs`):
- All `Alias(...)` overloads, the `Slice` builder, both `Cast<T>` /
  `Cast(typeCode)`, both `CastIfNecessary<T>` / `CastIfNecessary(typeCode)`,
  and the empty-storage clone now propagate `Engine`.
- Cast correctness fix: `Cast<T>` / `Cast(typeCode)` /
  `CastIfNecessary<T>` / `CastIfNecessary(typeCode)` used to cast the
  raw backing array via `InternalArray.CastTo(...)`. For strided or
  F-contiguous views that buffer holds elements in the *physical*
  layout, so the cast result contained values in the wrong logical
  order. They now run `CloneData()` first — which materializes the
  logical element sequence (via `NpyIter.Copy` for non-contiguous
  paths) — and cast that, so casting an F-contiguous view of
  `np.arange(6).reshape(2,3).T` yields the same values NumPy produces.
  (Verified by `CloneRegressionTests
  .UnmanagedStorage_CastGeneric_FContiguousSource_CopiesLogicalValuesAndEngine`
  and siblings.)
- `Clone()` gains a fast `CanCloneRawLayout()` path: when the storage
  owns its buffer (no offset, no broadcast, no buffer/size mismatch)
  and is either C- or F-contiguous, the underlying ArraySlice is
  cloned raw and the same `Shape` is reused. Non-trivial slices and
  scalar views still fall back to `CloneData()`. Empty storages return
  a new typed empty storage and preserve the engine instead of trying
  to clone a null buffer.
- `CastIfNecessary` early-return for same-dtype skips the
  `IsEmpty` check so empty storages of the requested dtype don't
  re-materialize.
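
The physical-versus-logical distinction behind the cast fix can be shown with a plain-Python sketch. `logical_values` is a hypothetical helper that walks a strided view in row-major logical order; the buffer and strides model the transposed view of `np.arange(6).reshape(2,3)` mentioned above, with strides counted in elements.

```python
from itertools import product


def logical_values(buffer, shape, strides, offset=0):
    """Read a strided view in logical (row-major index) order."""
    out = []
    for index in product(*(range(n) for n in shape)):
        position = offset + sum(i * s for i, s in zip(index, strides))
        out.append(buffer[position])
    return out


# Backing buffer of arange(6).reshape(2, 3); its transpose is a (3, 2) view
# with element strides (1, 3), i.e. F-contiguous.
buffer = [0, 1, 2, 3, 4, 5]
shape, strides = (3, 2), (1, 3)

# Old behavior: cast the raw buffer, which is in physical order.
physical_cast = [float(v) for v in buffer]

# Fixed behavior: materialize the logical sequence first, then cast.
logical_cast = [float(v) for v in logical_values(buffer, shape, strides)]
```

The two lists contain the same values in different orders, which is why casting the raw buffer produced a wrong result for non-C-contiguous views.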
The DefaultEngine helpers for `astype` and `transpose` created new
`NDArray` instances via the `UnmanagedStorage`-only constructor, which
falls back to `BackendFactory.GetEngine()`. Code that explicitly set
`nd.TensorEngine` on the source (e.g. tests pinning a custom engine)
would silently see its engine swapped for the default after a cast or
transpose.

`Default.Cast.cs` (`DefaultEngine.Cast`):
- Capture `nd.TensorEngine` once at the top.
- Empty/scalar/`(1,)` early returns now carry `engine` forward both on
  the returned `NDArray` and on `nd.Storage` (when reused in-place).
- Both `copy` and in-place branches of the generic cast attach
  `TensorEngine = engine` to the resulting NDArray and to the
  re-assigned `nd.Storage`.

`Default.Transpose.cs` (`DefaultEngine.TransposeAlias`):
- The transpose alias returned via `Storage.Alias(newShape)` now carries
  `nd.TensorEngine` so transposed views keep their engine. Without this
  the call dropped back to the global default, breaking propagation
  through compounded operations.

Coverage: `CloneRegressionTests.NDArray_AstypeCopyPath_PreservesTensorEngine`
and the engine-propagation siblings.
…/ ravel paths

All paths that build a fresh `NDArray` from an existing storage now
preserve the caller's `TensorEngine`. Previously the `NDArray
(UnmanagedStorage)` constructor would fall back to
`BackendFactory.GetEngine()` when the supplied storage didn't carry an
engine (which is common after `Clone()`/`Alias()`/`CloneData()`).

`Creation/NDArray.Copy.cs` (`copy(char physical)`):
- The C-order shortcut now requires the source to already be
  C-contiguous. Before, `copy('C')` on an F-contiguous view returned a
  `Clone()` whose shape preserved the F-strides — yielding a
  non-C-contiguous "copy". Now any non-C-contiguous source falls
  through to the iterator-driven materialization path.
- The destination shape uses the requested `physical` order instead of
  hard-coding `'F'`. Combined with the fix above this gives correct
  C/F selection regardless of source layout.
- The destination NDArray inherits the source's `TensorEngine`.
  Coverage:
  `CloneRegressionTests.NDArray_CopyCOrder_FromFContiguousSource_ProducesCContiguousCopy`
  and `NDArray_CopyFOrder_PreservesTensorEngine`.

`Creation/NdArray.ReShape.cs`:
- The F-order reshape return (`fFlat.Storage.InternalArray`-backed
  storage) and both non-contiguous fallback paths
  (`new NDArray(CloneData(), Shape.Clean())`) now attach the source
  `TensorEngine`. Coverage:
  `CloneRegressionTests.NDArray_ReshapeCopyPath_PreservesTensorEngine`.

`Creation/np.array.cs`:
- `np.array(nd, copy)` propagates `nd.TensorEngine` for both the
  alias (`copy=false`) and clone (`copy=true`) paths. Coverage:
  `NpArray_FromNDArray_PreservesTensorEngineForAliasAndCopy`.

`Manipulation/np.expand_dims.cs`, `Manipulation/np.ravel.cs`,
`Manipulation/NDArray.flatten.cs`:
- The view (`Storage.Alias(...)`) and materialize (`CloneData()`)
  branches both forward the source `TensorEngine`.

No semantic API changes other than the `copy('C')` correctness fix
above; engine propagation is a transparent improvement.
NumPy's np.where iterator allocates the result with an order chosen
from the *full-size* operands' contiguity flags:
- Any full-size, multi-dim operand that is C-contiguous (but not F)
  → output is C-contiguous.
- All full-size, multi-dim operands that are F-contiguous (and at least
  one is strictly F, not also C) → output is F-contiguous.
- Operands that are scalar, 1-D, or broadcasted do not vote.
- Mixed C/F (or any full-size non-contiguous operand) → fall back to C.

Verified against NumPy 2.4.2:

    f = np.arange(12).reshape(3,4).T            # F-contig view
    np.where(f > 5, f, 0)              .flags   # F_CONTIGUOUS=True
    np.where(f > 5, f.copy('C'), f)    .flags   # C_CONTIGUOUS=True
    np.where(np.array([True,False,True]), f, 0).flags  # F_CONTIGUOUS=True

Previously `np.where` always allocated the output as C-contiguous,
losing the F layout that NumPy preserves for F-contiguous inputs.
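
The voting rules above can be sketched in Python over simple operand descriptors. This is a model of the described rules, not the `ResolveWhereOrder` source: the `Operand` fields are hypothetical, and returning `'C'` when no operand votes is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operand:
    ndim: int
    full_size: bool      # same shape as the result, not broadcast
    c_contiguous: bool
    f_contiguous: bool


def resolve_where_order(*operands: Operand) -> str:
    # Scalars, 1-D operands, and broadcast operands do not vote.
    voters = [op for op in operands if op.ndim >= 2 and op.full_size]
    if not voters:
        return "C"  # assumption: default when nothing votes
    # Any full-size non-contiguous operand forces C.
    if any(not (op.c_contiguous or op.f_contiguous) for op in voters):
        return "C"
    # Any operand that is C-contiguous but not F forces C (covers mixed C/F).
    if any(op.c_contiguous and not op.f_contiguous for op in voters):
        return "C"
    # All voters are F-contiguous here; require at least one strictly F.
    if any(op.f_contiguous and not op.c_contiguous for op in voters):
        return "F"
    return "C"


f_view = Operand(ndim=2, full_size=True, c_contiguous=False, f_contiguous=True)
c_copy = Operand(ndim=2, full_size=True, c_contiguous=True, f_contiguous=False)
scalar = Operand(ndim=0, full_size=False, c_contiguous=True, f_contiguous=True)
cond_1d = Operand(ndim=1, full_size=False, c_contiguous=True, f_contiguous=True)
```

The three descriptor combinations mirror the three NumPy checks listed above: all-F inputs give `'F'`, a mixed C/F call gives `'C'`, and a broadcast 1-D condition leaves the vote to the F-contiguous operand.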

`APIs/np.where.cs`:
- New `ResolveWhereOrder(params NDArray[] operands)` mirrors the rules
  above. Returns 'C' or 'F'.
- The result `Shape` is now constructed via `new Shape((long[])cond
  .shape.Clone(), resultOrder)` so the resulting strides match the
  resolved order.
- Drop the `NpFunc.Invoke(outType, WhereImpl<int>, ...)` generic
  dispatch: the actual `WhereImpl` body never used its `T` type
  parameter (the iterator-driven IL kernel keys off the runtime dtype
  string), so the switch-per-dtype indirection was dead weight. Replace
  with a direct non-generic `WhereImpl(cond, x, y, result)` call.

`test/NumSharp.UnitTest/Logic/np.where.BattleTest.cs`:
- New "Output Layout" region with three NumPy-anchored tests:
    * `Where_FContiguousInputs_ResultIsFContiguous`
    * `Where_MixedCAndFInputs_ResultFallsBackToC`
    * `Where_BroadcastConditionWithFInput_ResultIsFContiguous`
…sh order tests

NumPy's `np.tile(A, reps)` keeps the source memory order on the "no
expansion" shortcut (`tup == (1,)*outDim`): F-contiguous input stays
F-contiguous, C-contiguous input stays C-contiguous, and views with
strides outside C/F materialize as C-contiguous. Verified against
NumPy 2.4.2:

    src = np.arange(12).reshape(3, 4).T          # F-contig
    np.tile(src, (1, 1)).flags     # F_CONTIGUOUS=True
    np.tile(src, ()).flags         # F_CONTIGUOUS=True
    np.tile(np.arange(12).reshape(3, 4)[:, ::-1], (1, 1)).flags
                                                  # C_CONTIGUOUS=True

`Manipulation/np.tile.cs`:
- The all-ones shortcut previously called `A.copy()` which defaults to
  `'C'` — silently flipping F-contiguous inputs to C-contiguous output.
  Replace with `A.copy('K')` (and the reshape variant gets the same
  treatment) so `OrderResolver.Resolve('K', shape)` picks the source's
  physical order. The comment is updated to describe the keep-order
  semantics.

`test/NumSharp.UnitTest/Manipulation/np.tile.Test.cs`:
- Three new tests covering the F-contig preservation, the `np.tile(a)`
  no-reps overload, and the non-contiguous fall-back. Each test also
  verifies element values against the source via index-based reads to
  guard against logical-order regressions.

`test/NumSharp.UnitTest/View/OrderSupport.OpenBugs.Tests.cs`:
- `Tile_ApiGap` is renamed to `Tile_RepeatsLastAxis_ValuesMatchNumPy`
  and its assertion stays — the API gap comment is replaced with the
  matching NumPy reference output. Header rewritten from
  "Missing functions" to "Manipulation helpers" since this section now
  documents working APIs.
- `Where_ApiGap` (previously `[OpenBugs]` because np.where was thought
  missing) is now `Where_FContig_PreservesFContig`. It asserts that
  `np.where(f_arr > 5, f_arr, 0)` returns an F-contiguous result on
  F-contiguous input — the same property covered by the new where
  layout tests in the prior commit. The `[OpenBugs]` attribute is
  removed because the feature exists and now matches NumPy.
…IterBattleTests

`benchmark/NumSharp.Benchmark.Exploration/Program.cs`:
- `Options.Clone()` reused the same `RemainingArgs` `string[]` reference
  on the cloned `Options` instance. Any post-clone mutation of the
  array (or its elements via index assignment) would have leaked back
  to the original `Options`. Clone the array (`(string[])RemainingArgs
  .Clone()`) so the two `Options` instances are independent.

`test/NumSharp.UnitTest/Backends/Iterators/NpyIterBattleTests.cs`:
- Remove a single trailing blank line at end of file. No code change.
… after RemoveAxis

`NpyIter.Clone()` (in `Backends/Iterators/NpyIter.cs`) previously copied
the `Buffers[op]` pointer field directly from the source state, so the
original and the cloned iterator shared the *same* per-operand buffer.
After `Iternext()` on either iterator the writes from one would clobber
the other's data, and disposing one would free the buffer out from
under the other.

The fix:
- After copying the operand metadata (`ElementSizes`, `BufStrides`, etc.),
  allocate a fresh buffer per operand via `NpyIterBufferManager
  .AllocateAligned(BufferSize, opDtype)` and `Buffer.MemoryCopy` the
  source bytes into it. If allocation fails the catch block calls
  `NpyIterBufferManager.FreeBuffers` for buffered states before
  releasing dim arrays + state memory.
- `DataPtrs[op]` is fixed up: if the source `DataPtrs[op]` pointed into
  the original `Buffers[op]` byte range we translate the offset onto
  the newly allocated buffer so iteration continues at the same logical
  position.
- The clone now calls `AllocateDimArrays(_state->NDim, _state->NOp,
  _state->StridesNDim)` — see below.

`NpyIterState.AllocateDimArrays(int ndim, int nop, int stridesNDim)`:
- Previously, the strides block was always sized as `ndim * nop`. When
  `RemoveAxis` had lowered `NDim` but left `StridesNDim` at its original
  width, cloning the iterator allocated a strides block that was too
  small, so later reads from `Strides[k]` (where `k >= NDim*NOp`)
  accessed freed or unrelated memory.
- The third parameter defaults to `ndim` (preserving the existing
  contract for all other call sites) but accepts an explicit
  `stridesNDim >= ndim` so `Clone()` can carry the original allocated
  stride width forward. `StridesNDim` is now stored on the state and
  the strides allocation uses `stridesNDim * nop * sizeof(long)`. The
  scalar fast-path now requires both `ndim == 0` and `stridesNDim == 0`
  to skip the allocation.

Also moves the `GetInnerFixedStrideArray` docblock so it sits directly
above its method (it had drifted onto an unrelated method when the
preceding doc was edited).

Coverage:
- `CloneRegressionTests.NpyIterCopy_BufferedIterator_AllocatesIndependentBuffers`
  asserts the two iterators have distinct `DataPtr` addresses and that
  advancing one does not advance the other.
- `CloneRegressionTests.NpyIterCopy_AfterRemoveAxis_PreservesAllocatedStrideWidth`
  builds an iterator over `(2,3,4)`, removes axis 1, clones it, and
  checks `NDim`, `Shape`, and the first value match.
… clone fixes

Adds `test/NumSharp.UnitTest/Backends/CloneRegressionTests.cs`,
which locks in the contracts established by the preceding fix commits.
Each test reproduces a specific bug or contract that previously
regressed and asserts the corrected behavior. 27 tests; all pass on
net8.0 and net10.0.

Coverage map (each group: area → fix commit, followed by its tests):

ArraySlice CopyTo direction / range fixes
→ `fix(unmanaged): correct CopyTo direction + bounds in ArraySlice`
- `ArraySlice_CopyToSpan_CopiesFromSliceToDestination`
- `ArraySlice_TryCopyToSpan_CopiesFromSliceToDestination`
- `ArraySlice_CopyToSpan_WithSourceRange_CopiesRequestedRange`
- `ArraySlice_CopyToIntPtr_WithSourceRange_CopiesRequestedRange`
- `ArraySlice_InterfaceCopyToSpan_CopiesFromSliceToDestination`
- `ArraySlice_InterfaceCloneGeneric_ReinterpretsWholeBytePayload`

ArrayConvert.Clone jagged / non-zero lower-bound / 4-D GetLength fixes
→ `fix(shape+convert): preserve scalar offset on Clone; fix ArrayConvert.Clone bugs`
- `ArrayConvert_Clone_PreservesJaggedElementType`
- `ArrayConvert_Clone_PreservesNonZeroLowerBounds`
- `ArrayConvert_Clone_FourDimensionalArray_UsesFourthDimensionLength`

Shape.Clone scalar view offset preservation
→ same commit as above
- `Shape_Clone_PreservesScalarViewOffset`

UnmanagedStorage.Clone empty + F-contig + engine
→ `fix(storage+ndarray): keep TensorEngine in sync; correct cast for F-contig`
- `UnmanagedStorage_Clone_DtypeOnlyStorage_DoesNotDereferenceMissingData`
- `UnmanagedStorage_Clone_PreservesEngineAndFContiguousShape`

UnmanagedStorage.Cast / CastIfNecessary uses CloneData + propagates engine
→ same commit
- `UnmanagedStorage_CastTypeCode_FContiguousSource_CopiesLogicalValuesAndEngine`
- `UnmanagedStorage_CastGeneric_FContiguousSource_CopiesLogicalValuesAndEngine`
- `UnmanagedStorage_CastIfNecessary_FContiguousSource_CopiesLogicalValuesAndEngine`
- `UnmanagedStorage_CastEmptyStorage_PreservesEngine`

UnmanagedMemoryBlock.CopyTo arrayIndex offset
→ `fix(unmanaged): correct CopyTo direction + bounds in ArraySlice`
- `UnmanagedMemoryBlock_CopyToWithIndex_CopiesToDestinationOffset`

UnmanagedHelper.CopyTo destination-offset bounds
→ same commit
- `UnmanagedHelper_CopyToWithInvalidDestinationOffset_Throws`

NDArray.Clone / engine propagation
→ `fix(storage+ndarray): ...` + `fix(creation+manipulation): ...` +
  `fix(default-engine): ...`
- `NDArray_Clone_PreservesGenericRuntimeType`
- `NDArray_Clone_PreservesTensorEngineOnArrayAndStorage`
- `NpArray_FromNDArray_PreservesTensorEngineForAliasAndCopy`
- `NDArray_CopyFOrder_PreservesTensorEngine`
- `NDArray_CopyCOrder_FromFContiguousSource_ProducesCContiguousCopy`
- `NDArray_ReshapeCopyPath_PreservesTensorEngine`
- `NDArray_AstypeCopyPath_PreservesTensorEngine`

NpyIter.Clone buffered deep copy + RemoveAxis stride width
→ `fix(npyiter): deep-copy buffered Clone buffers; preserve stride
  width after RemoveAxis`
- `NpyIterCopy_BufferedIterator_AllocatesIndependentBuffers`
- `NpyIterCopy_AfterRemoveAxis_PreservesAllocatedStrideWidth`
…aths

Promotes SByte, Half (float16), and Complex from "partially supported" to
first-class dtypes, matching what NPTypeCode already declares and what
NumPy 2.4.2 ships.

The audit (NDITER_BRANCH_QUALITY_AUDIT_V2.md, Tier 1) flagged 9 production
crash sites and 5 perf gaps where these three dtypes silently fell out of
12-dtype switches. After this commit every np.power(lhs, rhs) combination
across the 15x15 dtype matrix works end-to-end, and the existing 12-dtype
fast paths remain intact.

CRASH FIXES (Tier 1):

* NpyIterCasting (T1.9, T1.12, T1.38, T1.39): IsSafeCast / ReadAsDouble /
  WriteFromDouble / ConvertValue / PromoteTypes now handle SByte / Half /
  Complex. Complex routes through a dedicated Complex intermediate so the
  imaginary component is preserved on Complex->Complex copies and dropped
  cleanly (per NumPy's ComplexWarning) on Complex->real. Adds Half/SByte
  to IsFloatingPoint/IsSignedInteger predicates.

* NpyIterBufferManager (related to T1.12): same-type buffered iteration
  was throwing for Complex base case. Adds SByte/Half/Complex branches to
  CopyToBuffer/CopyFromBuffer dtype dispatch.

* UnmanagedStorage (T1.13, T1.57): SetValue(object,int[]/long[]),
  SetData(NDArray,long[]) scalar fast path, and the void*/IMemoryBlock
  CopyTo overloads all gained the three missing dtype branches.

* ArrayConvert.cs (T1.30): 13 ToX(Array) destination switches were
  missing SByte/Half/Complex source cases. Plus ~40 new typed converter
  methods covering the previously-missing (Src -> Dst) corners. Total
  ~550 LOC added.

* np.asanyarray (T1.49): adds IEnumerable<sbyte>, IEnumerable<Half>,
  IEnumerable<Complex> case branches; corresponding Memory<T>/
  ReadOnlyMemory<T> dispatch; ConvertObjectListToNDArray branches;
  and FindCommonNumericType expansion (the seenMask bitset was bounded
  to 12 dtypes; Complex's typecode=128 also previously aliased bit 0
  due to unbounded shift -- now masked by `(int)code & 31`).

* np.copyto T1.55: now passes via the NpyIterCasting fix.

* ILKernelGenerator.EmitDecimalConversion: Half<->Decimal and
  Complex<->Decimal routes were missing. np.power(Half, Decimal) now
  works (was the only np.power(15x15) failure after the casting fixes).

PERF FIXES (Tier 2):

* ILKernelGenerator.Binary.IsSimdSupported<T> (R9): adds sbyte.
  Vector*<sbyte> arithmetic is natively supported in .NET.

* Converts.FindConverter (R18, R33): 12x12 type-pair fast-path ladder
  expanded to 15x15 (225 entries). Eliminates the IConvertible-interface
  boxing and object-cast boxing that the CreateFallbackConverter path
  imposes for SByte/Half/Complex sources or destinations.

* Default.Reduction.ArgMax (R23): the per-slice NDArray view allocation
  in the Half/Complex axis fallback was costing one new NDArray per slice
  (1000 allocations for a (1000,1000) axis-reduce). Replaced with a
  stride-aware loop driven from a stackalloc coord vector. SByte path is
  removed from the fallback entirely since the IL kernel already handles
  it via CreateAxisArgReductionKernelTyped<sbyte>.

* Default.BooleanMask gather (T1.58): the strided/broadcast fallback was
  calling Buffer.MemoryCopy(_, _, elemSize, elemSize) per matched element
  (~1us/element). Specialized on element size (1/2/4/8/16 bytes); all 15
  dtypes hit a typed pointer write now, including Half (2B) and Complex
  (16B as two longs).
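The ArgMax change above (one reused coordinate vector instead of one view object per slice) can be sketched in plain Python. This is an illustration under stated assumptions, not the NumSharp kernel: `argmax_axis` is a hypothetical name, a Python list stands in for the stackalloc'd coordinate vector, strides are in elements, and NaN handling is not modelled.

```python
def argmax_axis(buffer, shape, strides, axis, offset=0):
    """Argmax along `axis` over a flat buffer, without building per-slice views."""
    ndim = len(shape)
    outer_axes = [a for a in range(ndim) if a != axis]
    coord = [0] * ndim            # reused for every slice (assumed stackalloc analogue)
    outer_count = 1
    for a in outer_axes:
        outer_count *= shape[a]

    result = []
    for _ in range(outer_count):
        # Flat position of element 0 of the current slice.
        base = offset + sum(coord[a] * strides[a] for a in outer_axes)
        best_i, best_v = 0, buffer[base]
        for k in range(1, shape[axis]):
            v = buffer[base + k * strides[axis]]
            if v > best_v:        # strict '>' keeps the first maximum
                best_i, best_v = k, v
        result.append(best_i)
        # Advance the outer coordinates like an odometer, last axis fastest.
        for a in reversed(outer_axes):
            coord[a] += 1
            if coord[a] < shape[a]:
                break
            coord[a] = 0
    return result

# The same 2x3 matrix [[1, 9, 3], [7, 2, 8]] in two memory layouts.
c_buf, c_strides = [1, 9, 3, 7, 2, 8], (3, 1)
f_buf, f_strides = [1, 7, 9, 2, 3, 8], (1, 2)
```

The only per-slice work is recomputing `base`, so the number of allocations no longer grows with the number of slices.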

VERIFICATION:

* test/Math/NDArray.power.DtypeMatrix.Test.cs (new):
  - 15x15 dtype matrix smoke test (225 (lhs, rhs) combinations).
  - SByte ** -SByte raises ValueError-style ArgumentException.
  - Half ** Half preserves Half.
  - Complex ** Complex preserves Complex.
  - Float ** Complex promotes to Complex.
  - Half ** Single promotes to Single (NEP50).
  - SByte/Half/Complex List/IEnumerable inputs no longer throw.

* Removed [OpenBugs] attribute from 11 AuditV2 tests that are now CI-green:
  T1_9 (3x), T1_12 (2x), T1_13 (2x), T1_30 (3x), T1_49 (3x), T1_55,
  T1_57, T1_58. They now run as regular tests.

* Full suite: 8281 passed, 0 failed (was 8255 before this commit, including
  the new dtype-matrix tests and 26 promoted-from-OpenBugs tests).

DOCS:

* .claude/CLAUDE.md: "Supported Types (12)" -> "Supported Types (15)".
  Adds Half/SByte/Complex rows and a "Perf notes" section that documents
  Half/Complex/Decimal as scalar paths (no Vector<Half> arithmetic in
  .NET BCL; Complex.Pow is the BCL routine).

OUT OF SCOPE FOR THIS COMMIT:

* T1.34 NpyExpr Const/Where/Call SByte/Half/Complex support: not on
  np.power's critical path; left for a separate pass.
* T1.39 Int64/UInt64 -> double precision loss above 2^53: separate
  audit item, unrelated to the three target dtypes.

For full audit context see docs/plans/NDITER_BRANCH_QUALITY_AUDIT_V2.md
section "Major themes" item 2 (missing SByte/Half/Complex).
…s 3000× copy

Audit V2 finding (Section 1.6 / src/NumSharp.Core/Manipulation/np.ravel.cs:30-34):
np.ravel(a, 'F') unconditionally routed through a.flatten('F'), which allocates
fresh F-contiguous memory and runs NpyIter.Copy over the source. NumPy, in
contrast, returns a 1-D view sharing the underlying buffer whenever the source
is already F-contiguous (np.shares_memory(np.ravel(aF, 'F'), aF) == True).

The audit reports a 3000× performance regression on the hot F-order path
(np.arange(12).reshape(3,4).copy('F') -> np.ravel(.,'F')): an O(1) shape-only
aliasing was replaced with an O(N) buffered copy.

Root cause
----------
ravel's F-branch had no fast path for the IsFContiguous case. flatten('F') is
documented to "ALWAYS return a copy" by design, so calling it from ravel forced
materialization even when the linear memory walk would already reproduce the
column-major read-out.

Why a 1-D view is correct for F-contiguous sources
--------------------------------------------------
An F-contiguous array has strides[0] == 1 and strides[i] == dims[i-1] *
strides[i-1] for i > 0, with no broadcast/stride-0 dimensions. Walking the
underlying buffer linearly from `offset` for `size` elements visits values in
F-order (first axis varies fastest), which is exactly what ravel('F') is
specified to produce.

For non-F-contig sources we still fall back to flatten('F') — a strided / C-
contig / sliced source needs the column-major copy to reproduce F-order
correctly.
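
A minimal Python sketch of the stride condition above (not the NumSharp code; `f_strides`, `is_f_contiguous`, and `ravel_f_by_index` are hypothetical helper names, strides are in elements, and size-1 axes are treated like any other axis for simplicity). It checks that a linear walk over a dense column-major buffer reproduces the F-order read-out, and that a C-contiguous layout does not:

```python
from itertools import product

def f_strides(dims):
    """Strides of a dense column-major array: 1, dims[0], dims[0]*dims[1], ..."""
    strides, step = [], 1
    for d in dims:
        strides.append(step)
        step *= d
    return strides

def is_f_contiguous(dims, strides):
    return list(strides) == f_strides(dims)

def ravel_f_by_index(buffer, dims, strides, offset=0):
    """Reference F-order read-out: the first axis varies fastest."""
    out = []
    # product() varies its last position fastest, so feed it reversed dims.
    for rev in product(*(range(d) for d in reversed(dims))):
        idx = rev[::-1]
        out.append(buffer[offset + sum(i * s for i, s in zip(idx, strides))])
    return out

dims = (3, 4)
buffer = list(range(100, 112))   # any 12 values
f_layout = f_strides(dims)       # column-major strides
c_layout = [4, 1]                # row-major strides for the same dims

f_readout = ravel_f_by_index(buffer, dims, f_layout)
c_readout = ravel_f_by_index(buffer, dims, c_layout)
```

For the F layout the indexed read-out equals `buffer` itself, which is why aliasing is enough; for the C layout it does not, which is why that case still needs the copy.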

Implementation
--------------
ravel(a, 'F') with NDim > 1 and size > 1:
  * a.Shape.IsFContiguous → build Shape(new[]{size}, new[]{1}, a.Shape.offset,
    a.Shape.bufferSize) and return new NDArray(a.Storage.Alias(vec)). offset and
    bufferSize are preserved so sliced F-views remain correct; size becomes the
    1-D shape's logical and physical extent.
  * Otherwise → existing flatten('F') copy path (unchanged).

The new shape's flags are recomputed by ComputeFlagsStatic over the 1-D
dims/strides, which trivially marks the result as both C- and F-contiguous and
writeable (a 1-D dense vector is both orders). Storage.Alias chains _baseStorage
to the ultimate owner, so view tracking and the @base property continue to work.

Test coverage
-------------
* AuditV2_ManipulationApis.Ravel_FContiguous_FOrder_ReturnsView is no longer
  marked [OpenBugs(audit-v2-ravel-fcont-fview)] — the documented NumPy
  np.shares_memory invariant is now asserted directly in CI.
* test/Manipulation/np.ravel.Test.cs gains 10 new tests:
    - Ravel_FOrder_FContig2D_IsView                — write-through verifies aliasing.
    - Ravel_FOrder_FContig2D_ValuesMatchColumnMajor — read-out sequence matches NumPy.
    - Ravel_FOrder_FContig3D_IsView                — 3-D F-flat-index decomposition.
    - Ravel_FOrder_CContig_IsCopy                  — C-contig source still copies.
    - Ravel_FOrder_Transpose2D_IsView              — a.T (F-contig view) also aliases.
    - Ravel_FOrder_KOrder_FContigSource_IsView     — 'K' resolves to 'F' for F-source.
    - Ravel_FOrder_AOrder_FContigSource_IsView     — 'A' resolves to 'F' for strict F.
    - Ravel_FOrder_FContig_DtypeFloat              — dtype preserved across the view.
    - Ravel_FOrder_FContig_EquivalentToFlattenF_Values — values match flatten('F').
    - Ravel_FOrder_FContig_PreservesSize           — 2-D / 3-D / 4-D size invariants.

Verified
--------
* New tests pass on net8.0 and net10.0.
* Full CI-filtered suite (TestCategory!=OpenBugs&TestCategory!=HighMemory)
  passes 8292/8292 on both target frameworks.
* The 17 pre-existing F-contig OpenBugs failures (np.flip, np.sort, np.repeat
  axis, reduction F-preservation, save/load fortran_order, etc.) remain
  unchanged — they live in test/View/OrderSupport.OpenBugs.Tests.cs and are
  excluded by the CI filter.

References
----------
* NumPy: https://numpy.org/doc/stable/reference/generated/numpy.ravel.html
* docs/plans/NDITER_BRANCH_QUALITY_AUDIT_V2.md — Section 1.6
* Spec: np.shares_memory(np.ravel(aF, 'F'), aF) == True for IsFContiguous source
…aths

Audit of np.ravel call paths flagged two cases that the initial fix relied on
but did not directly assert in tests. Add explicit coverage so regressions are
caught:

1. Ravel_FOrder_FContigColumnSlice_PreservesOffset_IsView
   aF[:, 1:3] on F-contig (4,5) yields (4,2) F-contig with offset=4. The view
   path must preserve offset and bufferSize so the 1-D Alias reads memory
   [4..11]. Verified:
     * shape (8,)
     * F-order values [1, 6, 11, 16, 2, 7, 12, 17] (column-major read-out)
     * write-through r[0] → s[0,0] and aF[0,1] both updated (shared memory)

2. Ravel_FOrder_FContig_BothCAndFContig_IsView
   A (1, N) shape is simultaneously C- and F-contiguous. ravel('F') enters the
   F-branch (NDim>1, size>1, IsFContiguous=true) and returns an Alias view; this
   was already covered by the implementation but not by an explicit test.
     * shape (4,)
     * values [10, 20, 30, 40] (linear memory walk)
     * write-through r[0] propagates to both[0, 0]

Both cases pass on net8.0 and net10.0 (64/64 tests in the ravel suite).
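
The offset arithmetic in case 1 can be reproduced with the standard library alone. This is a sketch, not the test code: `array` plus a `memoryview` slice stands in for the storage and its 1-D alias, and the variable names are illustrative.

```python
from array import array

rows, cols = 4, 5
# Column-major buffer for the logical matrix aF[i, j] = 5 * i + j.
buf = array("i", [5 * i + j for j in range(cols) for i in range(rows)])

# aF[:, 1:3] keeps strides (1, rows) and starts at column 1: offset = 1 * rows.
offset = 1 * rows
size = rows * 2

# A memoryview slice aliases the buffer (assumed analogue of Storage.Alias).
r = memoryview(buf)[offset:offset + size]
values = r.tolist()

r[0] = 99                        # write through the 1-D view
af_0_1 = buf[0 * 1 + 1 * rows]   # aF[0, 1] read back via strides (1, rows)
```

`values` holds the eight elements at memory positions 4 through 11, and the write through `r[0]` is visible at `aF[0, 1]` because both address the same buffer slot.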

Background — full ravel coverage matrix audited manually:

  Order  Source layout                              Branch         Result
  -----  -----------------------------------------  -------------  -------------
  'F'    strict F-contig, NDim>1, size>1            view path      view
  'F'    both C+F contig (e.g. (1,N)), NDim>1       view path      view
  'F'    F-contig col-slice, offset!=0              view path      view (offset preserved)
  'F'    transpose of C-contig 2-D (→ F-contig)     view path      view
  'F'    C-contig only, NDim>1                      flatten('F')   copy
  'F'    broadcast / strided / non-contig           flatten('F')   copy
  'F'    1-D (NDim==1)                              C-order path   view if contig
  'F'    scalar / empty / size<=1                   C-order path   trivial
  'C'    C-contig                                   reshape        view
  'C'    F-contig only                              CloneData      C-order copy
  'A'    F-contig (strict)                          resolves to F  view
  'A'    otherwise                                  resolves to C  view/copy
  'K'    F-contig                                   resolves to F  view
  'K'    C-contig                                   resolves to C  view/copy

All 15 dtypes (Boolean, Byte, SByte, Int16, UInt16, Int32, UInt32, Int64, UInt64,
Char, Half, Single, Double, Decimal, Complex) verified end-to-end via in-process
buffer-address comparison and dtype assertion.

NDArray.ravel() and NDArray.ravel(char) instance methods delegate to np.ravel,
so the fix covers both call sites.
…efault-None bounds

Brings np.clip up to NumPy 2.x signature parity. Two missing capabilities are
addressed at the API surface; the underlying engine (Default.ClipNDArray.cs)
already supported null bounds for both legs of the interval.

NumPy 2.x signature mirrored:
    clip(a, a_min=None, a_max=None, out=None, *, min=None, max=None)

Changes:
- src/NumSharp.Core/Math/np.clip.cs
  * Replace the trio of legacy 4-arg overloads with a single unified entry
    point exposing all parameters as optional. Callers may now write:
      np.clip(a)                        — no bounds, returns copy
      np.clip(a, min: 3)                — lower bound only (NEP-rebrand)
      np.clip(a, max: 5)                — upper bound only
      np.clip(a, min: lo, max: hi)      — both via aliases
      np.clip(a, a_min: null, a_max: 5) — explicit None
      np.clip(a, min: 3.5, dtype: NPTypeCode.Double)
    a_min/a_max still accepted (NumPy keeps both names; min=/max= were added
    in 2.1 as keyword-only aliases).
  * Conflict detection mirrors NumPy: passing both a_min and min (or both
    a_max and max) raises ArgumentException rather than silently picking one.
  * Type-dtype overload preserved separately (Type != NPTypeCode?, no merge
    possible). Existing positional-3 call sites (np.clip(a, lo, hi)) and
    named-arg call sites in np.maximum/np.minimum compile unchanged.

- test/NumSharp.UnitTest/NumPyPortedTests/ClipNDArrayTests.cs
  * 9 new tests covering the NumPy 2.x surface:
      - min=/max= keyword aliases (lower-only, upper-only, both)
      - Explicit a_min=null / a_max=null
      - Bare np.clip(a) returns a copy (verifies distinct backing storage)
      - min= keyword with array bound (broadcast verification)
      - Conflict detection (a_min+min, a_max+max throw)
      - min= combined with dtype= promotes result dtype
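
The alias handling described for np.clip.cs can be sketched in Python. This is a hedged illustration rather than the C# entry point: `resolve_bound` is a hypothetical helper, `None` plays the role of a C# `null` default, `ValueError` stands in for `ArgumentException`, and plain lists stand in for arrays.

```python
def resolve_bound(legacy, alias, legacy_name, alias_name):
    """Pick whichever spelling was supplied; reject supplying both."""
    if legacy is not None and alias is not None:
        raise ValueError(f"pass either {legacy_name} or {alias_name}, not both")
    return legacy if legacy is not None else alias

def clip(values, a_min=None, a_max=None, *, min=None, max=None):
    # The keyword names shadow the builtins inside this function only.
    lo = resolve_bound(a_min, min, "a_min", "min")
    hi = resolve_bound(a_max, max, "a_max", "max")
    out = list(values)            # always a fresh copy, even with no bounds
    if lo is not None:
        out = [lo if v < lo else v for v in out]
    if hi is not None:
        out = [hi if v > hi else v for v in out]
    return out

data = [1, 5, 9]
lower_only = clip(data, min=3)
upper_only = clip(data, max=5)
no_bounds = clip(data)
```

Because an omitted bound and an explicit `None` are indistinguishable here, a conflict is only reported when both spellings carry a value.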

Verification:
- Reference outputs cross-checked against NumPy 2.4.2 via Python; all 9
  documented behaviors match byte-for-byte.
- ClipNDArrayTests: 26/26 pass (was 17, +9 new).
- ClipEdgeCaseTests + np.maximum/np.minimum suite: 105/105 pass — no
  regressions (np.maximum/minimum use np.clip via named a_min:/a_max:).
- Full unit-test sweep (TestCategory!=OpenBugs&!=HighMemory) on net10.0:
  7202 pass, 0 fail, 11 pre-existing skips.

Audit reference: audit_v2/07_math_ops_selection_sorting_stats.md (Batch 7,
item 12).
…n-int upcast

Brings the np.clip engine path up to NumPy 2.x ufunc parity. Three latent
bugs surfaced while battle-testing edge cases for the min=/max= alias work:

1. Dtype promotion silently demoted to lhs.typecode
   * Before: outType = typeCode ?? lhs.typecode
     - clip(int32, min=3.5) → int32 (NumPy: float64)
     - clip(int32, min=float32) → int32 (NumPy: float64)
   * After: weak-scalar promotion consistent with NumSharp's binary-op
     engine and NEP 50 — a 0-d bound of the same kind (int/float/complex
     /decimal) as lhs is "weak" and does not promote; cross-kind or array
     bounds promote via np.result_type.
   * Examples now matching NumPy:
       clip(int32, min=3.5)              → float64  (was int32)
       clip(int32, min=3.0f)             → float64  (was int32)
       clip(uint8, 50, 75)               → uint8    (preserved, NEP 50 weak)
       clip(int32, min=long_arr)         → int64    (array promotes)
       clip(float32, 3, 7)               → float32  (preserved)
   * NaN bound on int array now upcasts to float64 with all-NaN result
     (was: silently a no-op, value unchanged).

2. @out= with mismatched dtype silently wrote garbage
   * Before: cast lhs/bounds to outType, blit through copyto into @out
     which retained its own (often narrower) dtype — produced truncated
     or pattern-aliased values.
   * After: validate @out.GetTypeCode == outType up front. Mismatch
     raises ArgumentException mirroring NumPy's _UFuncOutputCastingError
     ("Cannot cast ufunc 'clip' output from dtype('X') to dtype('Y')
     with casting rule 'same_kind'").

3. Engine refactor for the both-null + dtype= case
   * np.clip(arr, dtype=Single) with no bounds now properly casts the
     output and respects @out when supplied (previously dtype= without
     bounds returned plain lhs.Clone()).
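
The promotion rule in item 1 can be sketched as a small Python model. Everything here is an assumption made for illustration: dtypes are `(kind, bits)` tuples, only the int and float kinds are modelled, signedness is ignored, and a 0-d bound is treated as weak when its kind is equal to or lower than the input's kind (the "lower" part is inferred from the `clip(float32, 3, 7)` example, not stated in the rule). `promote_clip_bound` and `result_type` are stand-ins, not the NumSharp methods.

```python
KIND_ORDER = {"int": 0, "float": 1}   # only two kinds modelled

def float_bits_for_int(bits):
    # Float width NumPy pairs with an integer width: 8 -> 16, 16 -> 32, 32/64 -> 64.
    return 16 if bits <= 8 else 32 if bits <= 16 else 64

def result_type(a, b):
    """Simplified int/float promotion; mixed signedness is not modelled."""
    (ka, ba), (kb, bb) = a, b
    if ka == kb:
        return (ka, max(ba, bb))
    flt, integer = (a, b) if ka == "float" else (b, a)
    return ("float", max(flt[1], float_bits_for_int(integer[1])))

def promote_clip_bound(out_type, bound_type, bound_is_0d):
    # Weak scalar: a 0-d bound of equal or lower kind leaves the dtype alone.
    if bound_is_0d and KIND_ORDER[bound_type[0]] <= KIND_ORDER[out_type[0]]:
        return out_type
    return result_type(out_type, bound_type)

INT32, INT64, UINT8 = ("int", 32), ("int", 64), ("int", 8)
FLOAT32, FLOAT64 = ("float", 32), ("float", 64)

examples = {
    "clip(int32, min=3.5)": promote_clip_bound(INT32, FLOAT64, True),
    "clip(int32, min=3.0f)": promote_clip_bound(INT32, FLOAT32, True),
    "clip(uint8, 50, 75)": promote_clip_bound(UINT8, INT32, True),
    "clip(int32, min=long_arr)": promote_clip_bound(INT32, INT64, False),
    "clip(float32, 3, 7)": promote_clip_bound(FLOAT32, INT32, True),
}
```

Under these assumptions the five calls listed in item 1 land on float64, float64, uint8, int64, and float32 respectively.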

Implementation details:
- Added PromoteClipBound(outType, bound): no-promotion shortcut for
  0-d same-kind bounds; falls back to np.result_type otherwise.
- Added IsSameKind(a, b): groups Byte/Char/signed-int/unsigned-int as
  integer kind; floats/decimals/complex compare by NPTypeCode group.
- @out validation now runs before any work, so shape/dtype errors fail
  fast without partial mutation of @out.
- np.copyto(@out, Cast(lhs, outType, copy: false)) handles the case
  where lhs needs casting to the promoted output type before writing.

Test additions (test/NumSharp.UnitTest/NumPyPortedTests/ClipNDArrayTests.cs)
— 30 new tests across 8 categories all cross-checked against NumPy 2.4.2:

  Dtype Promotion (NEP-50):
    - uint8 + int scalars preserves uint8
    - int32 + float scalar → float64 (also float32 scalar → float64)
    - float64 + int scalars preserves float64
    - int32 + int64 array bound → int64
    - dtype= with no bounds casts input
    - dtype= override forces narrower type even when bounds promote
    - NaN bound on int array upcasts to float64

  @out= Edge Cases:
    - in-place out=src returns same buffer & mutates
    - out= separate buffer leaves src unchanged
    - shape mismatch throws
    - dtype mismatch throws (previously silent garbage)
    - out= with no bounds copies src

  Special Float Values via kwarg:
    - min=-inf / max=+inf no-op
    - min=NaN / max=NaN propagates

  0-d (Scalar) Input:
    - clip(scalar, lo, hi) preserves ndim=0
    - clip(scalar, max:hi) preserves ndim=0
    - clip(scalar) preserves ndim=0

  Half / Complex via Kwarg:
    - Half min/max preserves Half
    - Complex min= (lex ordering, scalar bound)
    - Complex array min/max bounds (lex ordering)

  Broadcasting via Kwarg:
    - 2D + row vector min → broadcasts along axis 0
    - 2D + column vector max → broadcasts along axis 1
    - 2D + mixed row min + column max

  Strided Inputs via Kwarg:
    - Reversed-slice (negative stride) clipped via min=/max=

  Empty Arrays via Kwarg:
    - Empty + min= only
    - Empty + max= only
    - Empty + dtype= cast

Verification:
- ClipNDArrayTests: 56/56 pass (was 26; +30 new).
- np.clip + np.maximum + np.minimum + ClipEdgeCase + np.clip.Test suites:
  85 pass on net8.0, 55 pass on net10.0 (frameworks differ in shared-class
  counts).
- Full unit-test sweep (TestCategory!=OpenBugs&!=HighMemory) on net10.0:
  7232 pass, 0 fail, 11 pre-existing skips (was 7202 before this commit).
…x bug

Benchmarking np.clip against NumPy 2.4.2 revealed a 48-80× slowdown on
the common case `clip(arr, lo, hi)` with scalar literal bounds. Root
cause: the engine was materializing every scalar bound via
`np.broadcast_to(scalar, lhs.Shape).astype(outType)`, which for a 10M
int32 input allocated and memset two 40MB bound arrays per call (then
ran an element-wise array-bounds kernel that re-read both buffers).

Investigation also surfaced a pre-existing kernel bug exposed once the
new fast path routed scalar-bound calls through ClipScalar / ClipStrided
/ ClipScalarTail: the integer scalar fallbacks used `if / else if` to
apply the two clamps, so when `minVal > maxVal`, values at or below
`maxVal` were raised to `minVal` and left there instead of being capped
to `maxVal` (NumPy guarantees `min(max(x, lo), hi)`, i.e. `maxVal` wins
when bounds are inverted). SIMD paths and Math.Min(Math.Max,...) float
paths were already correct.
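
The difference between the two clamp shapes is easy to reproduce. The sketch below is plain Python, not the kernel source; `clamp_else_if` and `clamp_sequential` are illustrative names for the old and new control flow.

```python
def clamp_else_if(val, lo, hi):
    # Old shape: at most one clamp runs. With lo > hi, a value <= hi skips
    # the first branch, is raised to lo, and is never capped back to hi.
    if val > hi:
        val = hi
    elif val < lo:
        val = lo
    return val

def clamp_sequential(val, lo, hi):
    # New shape: lower clamp first, then upper clamp, i.e. min(max(val, lo), hi).
    if val < lo:
        val = lo
    if val > hi:
        val = hi
    return val

lo, hi = 7, 3                    # inverted bounds
samples = [1, 5, 9]
old = [clamp_else_if(v, lo, hi) for v in samples]
new = [clamp_sequential(v, lo, hi) for v in samples]
reference = [min(max(v, lo), hi) for v in samples]
```

With ordinary bounds (`lo <= hi`) both functions agree, which is why the defect stayed hidden until inverted bounds reached the scalar path.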

Changes
=======

src/NumSharp.Core/Backends/Default/Math/Default.ClipNDArray.cs
- Add scalar-bounds fast path: detect 0-d min/max (or null) and
  dispatch directly to ClipUnified / ClipMinUnified / ClipMaxUnified
  (the kernel family that broadcasts the scalar inside the vector loop).
  Skips broadcast_to + astype materialization entirely.
- ClipNDArrayScalarBounds: type-switch on outType to call the right
  generic kernel; uses a small delegate-based helper (ClipScalar<T>) so
  the dispatch logic isn't duplicated 12 times.
- ClipNDArrayScalarBoundsFallback: Half and Complex still go through
  the array-bounds path — their scalar SIMD kernels aren't wired and
  Complex has lex-ordering NaN semantics already implemented there.
  Cost is just the 0-d→shape broadcast (stride-0 view, O(1)) plus a
  1-element astype.
- Array bounds (any non-0-d min or max) flow into the existing path
  unchanged.

src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Clip.cs
- ClipScalar<T> (generic integer fallback, called by ClipHelper): replace
  `if (val > maxVal) val = maxVal; else if (val < minVal) val = minVal;`
  with two sequential `if`s. Now matches NumPy semantics when min > max.
- ClipScalarTail<T> (non-float tail after SIMD bulk loop): same fix.
- ClipStrided<T> (coordinate-iterated path for non-contiguous arrays):
  same fix.
- Added comments explaining why two sequential clamps are required.

Performance (Windows 11, .NET 10.0.1, AVX2-class CPU; 50 iterations,
min of timings reported; same array shapes/dtypes on both runtimes):

Scalar bounds, contiguous
                                NumPy 2.4.2   NumSharp BEFORE   NumSharp AFTER
  int32  size=1K                    3.4 µs         37.8 µs           3.3 µs
  int32  size=100K                  8.4 µs       2980.4 µs          66.5 µs
  int32  size=10M               6 741   µs    323 557   µs      10 094   µs
  int64  size=10M              14 519   µs    698 077   µs      34 860   µs
  float32 size=10M              6 917   µs    570 707   µs      22 441   µs
  float64 size=10M             14 228   µs    612 228   µs      30 926   µs

Single-sided scalar bound (min= or max= only)
  int32  size=10M min=         12 451   µs    285 434   µs      10 532   µs   (faster than NumPy)
  int32  size=10M max=         12 024   µs    294 756   µs      10 720   µs   (faster than NumPy)
  float64 size=10M min=        23 155   µs    300 770   µs      23 043   µs   (parity)

out= parameter
  int32 10M, out=arr in-place   7 038   µs    562 393   µs       7 465   µs   (parity)
  int32 10M, out=preallocated   7 794   µs    557 192   µs      12 539   µs

No bounds (clip(a))
  int32 10M                    12 126   µs      7 437   µs       7 158   µs   (faster than NumPy — Cast(copy:true))

Speedups range 20-75× over the previous NumSharp implementation; the
common `clip(arr, lo, hi)` path now takes roughly 1.5-3× as long as
NumPy at 10M elements and matches it at 1K elements. Remaining gaps:
  * Array bounds (lo_arr, hi_arr same-size): 3.5× slower — kernel is
    memory-bandwidth bound on three arrays; expected gap given .NET
    Vector<T> vs hand-tuned NumPy AVX2 inner loop.
  * Strided input (a[::2], a[::-1]): 15-20× slower — ClipStrided uses
    Shape.TransformOffset per element; NumPy's ufunc has a strided
    inner loop with stride-aware SIMD where possible.
  * Half (float16): 11× slower. .NET's `Vector<Half>` arithmetic is
    not supported, so a scalar Half→double→Half path is required.
  * 2D broadcast (row vec): 33× slower — still goes through array path
    after broadcast_to materializes the row vector.

These remaining gaps are tracked for future kernel work and are not
addressed in this commit.

Verification
============
- All 7232 unit tests pass on net10.0 (TestCategory!=OpenBugs&!=HighMemory),
  including the regression test for min > max which now exercises the
  scalar fast path through the fixed ClipScalar/ClipStrided kernels.
- Bench harness: $CLAUDE_JOB_DIR/clip_bench.py and clip_bench.cs (50
  iterations each, min of timings).
Two further optimizations on top of the scalar-bounds fast path. Both
target the gap to NumPy that benchmarking surfaced.

Findings from the breakdown profiler ($CLAUDE_JOB_DIR/clip_breakdown.cs)
on int32 size=10M:

  Step                                            Time (µs)
  ──────────────────────────────────────────────  ─────────
  Pure ClipArrayBounds kernel (3R + 1W)             ~7,700
  Cast(lhs, int32, copy:true)  alloc + 1R+1W       ~6,100
  np.broadcast_to(lo_arr, same_shape)              ~negligible
  np.broadcast_to(lo_arr).astype(int32) — same dtype  ~7,700  ← wasted clone
  np.clip(arr, lo_arr, hi_arr) full path           37,700
  np.clip(arr, 2, 7) scalar fast path              17,100
  ClipHelper kernel only (1R + 1W in-place)         ~9,800

The two wasted passes: (1) `astype(same_dtype)` cloning the bounds even
when no cast is needed (15ms wasted on two array bounds), (2) the
Cast-then-clip pattern doing 4 memory streams (2R + 2W) when 2 streams
(1R src + 1W dst) suffice.

Changes
=======

src/NumSharp.Core/Backends/Default/Math/Default.ClipNDArray.cs

1. PrepareBound(bound, targetShape, outType) helper:
   When the bound is already same-shape, same-dtype, contiguous, offset
   zero, return it directly instead of running broadcast_to + astype
   (which clones via UnmanagedStorage.Clone). Wins for the common case
   where users pass arrays that already match the input layout.

2. ClipNDArrayFusedScalarBounds: new fast-fast path for the dominant
   `np.clip(a, lo, hi)` shape — contiguous lhs, scalar literal bounds,
   no @out, no dtype promotion. Allocates a fresh `np.empty(shape)` and
   runs the fused CopyAndClip kernel in a single pass. Replaces the
   classic Cast-then-clip pattern (which ran two passes over memory).
   Falls through to the existing in-place scalar path when @out is
   supplied or the lhs needs casting.

src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Clip.cs

3. CopyAndClip / CopyAndClipMin / CopyAndClipMax (and their *Simd256 /
   *Scalar / *ScalarTail variants for 10 SIMD-supported dtypes): fused
   read-clip-write kernels. Each loop iteration loads a Vector256, runs
   Min(Max(v, lo), hi) in registers, and stores to the destination
   buffer — never spilling intermediate values to memory. Halves the
   memory bandwidth requirement vs the in-place "copy then clip"
   pattern on memory-bandwidth-bound input sizes.
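
The stream counting behind the fused kernel can be illustrated with a toy model in Python. This is only a sketch: lists stand in for buffers, the `traffic` counter tallies element reads and writes instead of measuring bandwidth, and both function names are hypothetical.

```python
def copy_then_clip(src, lo, hi):
    """Two passes: copy src into dst, then clip dst in place."""
    traffic = 0
    dst = [0] * len(src)
    for i, v in enumerate(src):          # pass 1: read src, write dst
        dst[i] = v
        traffic += 2
    for i in range(len(dst)):            # pass 2: read dst, write dst
        dst[i] = min(max(dst[i], lo), hi)
        traffic += 2
    return dst, traffic

def copy_and_clip(src, lo, hi):
    """One fused pass: read src, clamp in a local, write dst."""
    traffic = 0
    dst = [0] * len(src)
    for i, v in enumerate(src):
        dst[i] = min(max(v, lo), hi)
        traffic += 2
    return dst, traffic

src = list(range(-5, 15))
two_pass, two_pass_traffic = copy_then_clip(src, 2, 7)
fused, fused_traffic = copy_and_clip(src, 2, 7)
```

Both variants produce the same output; the fused one touches each element twice instead of four times, which is the 2R + 2W versus 1R + 1W distinction described above.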

Performance vs NumPy 2.4.2 (Windows 11, .NET 10.0.1, AVX2-only CPU,
50 iterations, min reported)

                                          NumPy    NumSharp BEFORE   NumSharp AFTER   AFTER/NumPy
Scalar bounds, contiguous
  int32  size=1K                          3.4 µs        37.8 µs           3.1 µs       0.91× (FASTER)
  int32  size=100K                        8.4 µs      2980   µs          68.2 µs       8.1×
  int32  size=10M                       6741   µs   323557   µs        9336   µs       1.4×
  int64  size=10M                      14519   µs   698077   µs       19287   µs       1.3×
  float32 size=10M                      6917   µs   570707   µs       11002   µs       1.6×
  float64 size=10M                     14228   µs   612228   µs       26969   µs       1.9×

Array bounds, contiguous (PrepareBound win)
  int32  size=10M                       9488   µs    38259   µs       13898   µs       1.5×    (was 4.0×)
  float64 size=10M                     24712   µs    83863   µs       42137   µs       1.7×    (was 3.4×)

Single-sided
  int32 10M min=                       12451   µs   285434   µs       11200   µs       0.90×   (FASTER)
  int32 10M max=                       12024   µs   294756   µs       11351   µs       0.94×   (FASTER)

out=  (in-place / preallocated)
  10M in-place out=arr                  7038   µs   562393   µs        4567   µs       0.65×   (35% FASTER than NumPy)
  10M out=preallocated                  7794   µs   557192   µs       10101   µs       1.3×

Both bounds None
  10M, clip(a)                         12126   µs     7437   µs        5778   µs       0.48×   (2× FASTER)

Combined effect of all four perf commits (3505edc, 79c1894, 9334bd7,
this one): the common `np.clip(arr, lo, hi)` path went from 48-80×
slower than NumPy to within 1.3-1.9× at 10M elements across dtypes,
with several cases matching or beating NumPy outright.

Discussion of the user's two questions
──────────────────────────────────────

1. Vector<T> vs Vector256<T> — measured both on this CPU; identical
   wall time (5527 vs 5559 µs/10M int32 in micro-bench, see
   $CLAUDE_JOB_DIR/clip_micro.cs). Vector<T> picks the widest hardware
   register at JIT time, so on AVX-512 hardware it'd be 512 bits = 2×
   throughput. On THIS AVX2 machine, no gain. Switching the existing
   Vector256<T> kernels to Vector<T> is a low-risk forward-compat move
   for AVX-512 hosts but no measurable win here. Not changed in this
   commit (would touch the whole kernel file ecosystem; out of scope).

2. IL Generation via DynamicMethod — the existing binary kernels
   (ILKernelGenerator.Binary.cs) emit 4× unrolled SIMD loops via
   System.Reflection.Emit. Tested whether porting that pattern to clip
   would help: micro-benchmarked manually-unrolled 4× and 8×
   Vector256<int> loops against the simple 1× variant. Results
   (10M int32):
     1× unrolled:   5559 µs
     4× unrolled:   6494 µs   (SLOWER — register pressure)
     8× unrolled:   5428 µs   (2% faster — within noise)
   The .NET JIT already auto-unrolls the simple SIMD loop well
   enough that hand-unrolling doesn't help and can hurt. IL emission
   for this op would add significant complexity for ~no perf win.
   Not pursued.

   The wins came from algorithmic changes (fused single-pass kernel,
   skipping redundant clones) — not from instruction-level tuning.

Verification
============
- All 7232 unit tests pass on net10.0 (TestCategory!=OpenBugs&!=HighMemory).
- Includes the regression test for `min > max` semantics through the
  new fused kernel path (which goes through CopyAndClip's scalar tail
  for the size<32 case).
- Bench harness: $CLAUDE_JOB_DIR/{clip_bench.py,clip_bench.cs,
  clip_breakdown.cs,clip_micro.cs}.

Remaining gaps
==============
- Strided / negative-stride / broadcast inputs: ~12-15× slower than
  NumPy. ClipStrided iterates with Shape.TransformOffset per element
  (~50ns/element overhead). NumPy ufunc has stride-aware SIMD inner
  loops. Would require a stride-aware clip kernel similar to NumPy's.
- Half / float16: ~9× slower. .NET's Vector<Half> arithmetic is not
  supported; falls back to scalar Half-via-double round-trip.
- 100K size scalar bounds (8.1×): allocation/dispatch overhead is
  amortized over fewer elements; gap shrinks at larger sizes.
…-adaptive

Per the user's directive, the entire clip code path now goes through a
single ILKernelGenerator entry point that dispatches internally and
emits all loops as DynamicMethod IL using the runtime-detected vector
width (V128 / V256 / V512). No hardcoded Vector256 references remain;
no hand-written C# loops remain in the engine or kernel files.

Files
=====

src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Clip.cs
  Rewritten from scratch (~1900 → ~360 lines, -81%).
  Public surface — what the engine actually calls:

      public enum ClipMode      { BothBounds, MinOnly, MaxOnly }
      public enum ClipBoundsKind { Scalar, Array }
      public unsafe delegate void ClipKernel(
          void* src, void* dst, long size, void* lo, void* hi);

      public static unsafe void Clip(
          NPTypeCode dtype, ClipMode mode, ClipBoundsKind kind,
          void* src, void* dst, long size, void* lo, void* hi);

  All dispatch happens inside ILKernelGenerator:
    - Cache key = (dtype << 16) | (mode << 8) | kind
    - On first miss, Generate(dtype, mode, kind) builds a DynamicMethod
      and stores the resulting delegate in a ConcurrentDictionary.
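
A minimal Python sketch of the key packing and generate-once caching described above. The names `cache_key`, `get_kernel`, and the plain dict are hypothetical stand-ins for illustration, not the ILKernelGenerator code, and the sketch assumes mode and kind each fit in 8 bits.

```python
from enum import IntEnum


class ClipMode(IntEnum):
    BOTH_BOUNDS = 0
    MIN_ONLY = 1
    MAX_ONLY = 2


class ClipBoundsKind(IntEnum):
    SCALAR = 0
    ARRAY = 1


def cache_key(dtype_code: int, mode: ClipMode, kind: ClipBoundsKind) -> int:
    # Same packing as the commit text: (dtype << 16) | (mode << 8) | kind.
    return (dtype_code << 16) | (int(mode) << 8) | int(kind)


_kernels = {}  # hypothetical stand-in for the ConcurrentDictionary


def get_kernel(dtype_code, mode, kind, generate):
    """Return the cached kernel, calling generate() only on the first miss."""
    key = cache_key(dtype_code, mode, kind)
    kernel = _kernels.get(key)
    if kernel is None:
        kernel = generate(dtype_code, mode, kind)
        _kernels[key] = kernel
    return kernel
```

Because each field occupies its own bit range, two different (dtype, mode, kind) triples can never collide as long as mode and kind stay below 256.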

  The IL emitter:
    - Uses GetVectorContainerType() / GetVectorType() / GetVectorCount()
      so the SIMD loop body adapts to V512 on AVX-512 hosts, V256 on
      AVX2, V128 on SSE2. There is no `Vector256` or `Vector128` token
      anywhere in the kernel file.
    - Hoists the scalar bound load and Vector.Create() out of the SIMD
      loop (one broadcast per kernel call, not per iteration).
    - Computes `byteOff = i * sz` once per iteration into a local and
      reuses it for src/lo/hi/dst pointer arithmetic — avoids the
      O(N × pointer_count) multiplications the prior C# kernels had.
    - Falls back to a pure scalar IL loop for dtypes without
      Vector<T>.Min/Max (Char, Decimal, Half, Complex). Half and
      Complex delegate the per-element clamp to static helper methods
      (NaN-aware / lex-order); the loop is still IL.
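
The loop shape the emitter produces (a width-sized body followed by a scalar tail) can be sketched in Python. This shows the control flow only: `clip_chunked` and its `lanes` parameter are hypothetical, and a list slice stands in for a SIMD register.

```python
def clip_chunked(src, lo, hi, lanes=8):
    """Clip src to [lo, hi] in lanes-wide chunks, then finish with a scalar tail."""
    n = len(src)
    dst = [None] * n
    i = 0
    # "Vector" body: lo and hi are fixed before the loop, mirroring the
    # hoisted broadcast; each pass handles one full chunk of `lanes` items.
    while i + lanes <= n:
        chunk = src[i:i + lanes]
        dst[i:i + lanes] = [min(max(x, lo), hi) for x in chunk]
        i += lanes
    # Scalar tail: the remaining n % lanes elements, one at a time.
    while i < n:
        dst[i] = min(max(src[i], lo), hi)
        i += 1
    return dst
```

Inputs shorter than one chunk skip the body entirely and run through the tail, which is the path small arrays take.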

src/NumSharp.Core/Backends/Default/Math/Default.ClipNDArray.cs
  Stripped from 1222 → 207 lines (-83%). Everything but policy is gone; what remains:
    - dtype promotion (NEP-50 weak scalar via PromoteClipBound)
    - @out validation (shape, writeable, dtype)
    - scalar-vs-array kind detection (min.ndim == 0)
    - NaN-in-scalar-bound short-circuit for float dtypes
    - dst materialization choice: in-place vs fused-fresh vs cast-copy
    - single call to ILKernelGenerator.Clip(...)

  The previous ClipNDArrayContiguous / ClipNDArrayGeneral /
  ClipNDArrayScalarBounds / ClipNDArrayFusedScalarBounds / 12 per-dtype
  switches / delegate-based generic dispatchers / 14 Generated*Core
  methods are all deleted. One call site, one cache, one IL emitter.

src/NumSharp.Core/Backends/Default/Math/Default.Clip.cs
  Deleted (251 lines). Dead code: internal `ClipScalar(NDArray, object,
  object)` had no callers anywhere in the codebase and was a parallel
  hand-coded path that the IL kernel now subsumes.

src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.cs
  Added EmitVectorMinOrMax() to the existing emit-primitives section
  (sibling of EmitVectorOperation). Resolves Vector{Width}.Min<T> /
  Max<T> by reflection on whatever container the runtime selected at
  startup — same width-adaptive pattern used by the binary kernels.

Performance vs NumPy 2.4.2 (Windows 11, .NET 10.0.1, AVX2 CPU, 50 iters,
min reported)
                                              NumPy      Before IL    After IL
Scalar bounds, contiguous
  int32  size=1K                              3.4 µs       3.1 µs       2.9 µs   (0.85× — faster)
  int32  size=100K                            8.4 µs      68.2 µs      47.3 µs
  int32  size=10M                          6 741 µs     9 336 µs     7 509 µs    (1.11×)
  int64  size=10M                         14 519 µs    19 287 µs    22 279 µs
  float32 size=10M                         6 917 µs    11 002 µs    13 057 µs
  float64 size=10M                        14 228 µs    26 969 µs    28 842 µs

Single-sided (min= or max= only)
  int32 10M min=                          12 451 µs    11 200 µs    10 944 µs   (0.88× — faster)
  int32 10M max=                          12 024 µs    11 351 µs     8 009 µs   (0.67× — faster)
  float64 10M min=                        23 155 µs    23 043 µs    19 776 µs   (0.85× — faster)

out=
  10M in-place out=arr                     7 038 µs     4 567 µs     3 954 µs   (0.56× — faster)
  10M out=preallocated                     7 794 µs    10 101 µs     9 175 µs

Both bounds None
  10M clip(a)                             12 126 µs     5 778 µs     6 025 µs   (0.50× — faster)

Half (float16) — IL emit cut the gap by 3×
  10M                                     66 969 µs   602 219 µs   212 024 µs   (was 9×, now 3.2×)

Verification
============
- All 7232 unit tests pass on net10.0 (TestCategory!=OpenBugs&!=HighMemory).
- The 85 clip-family tests (ClipNDArrayTests, ClipEdgeCaseTests,
  np.clip.Test, NewDtypes Half/Complex clip tests) cover:
    * Scalar literal bounds, array bounds, both-None, min-only, max-only
    * 14 dtypes (Byte, SByte, Int16/32/64, UInt16/32/64, Char, Half,
      Single, Double, Decimal, Complex)
    * Contiguous, transposed, strided (every other), reversed slices
    * Broadcast inputs (the OpenBug test)
    * NaN propagation in float arrays
    * NaN in scalar bound → all-NaN result (short-circuited in engine)
    * min > max → result all = max
    * @out= validation (shape & dtype mismatch throws)
    * NEP-50 weak-scalar promotion (uint8 + 50 stays uint8)
    * Cross-kind promotion (int32 + 3.5 → float64)
- Cache correctness: each (dtype, mode, kind) combination generates
  its kernel once on first call, then reuses the cached delegate. Re-
  running the test suite a second time keeps the same delegates (no
  re-emit per call).
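
The two promotion cases in the list above can be sketched as a small rule. This is a simplified illustration of NEP 50 weak-scalar behaviour for Python scalar bounds only; `clip_result_dtype` and the string dtype names are hypothetical, and it deliberately ignores complex bounds, array bounds, and out-of-range integer literals.

```python
INTEGER_DTYPES = {
    "int8", "int16", "int32", "int64",
    "uint8", "uint16", "uint32", "uint64",
}


def clip_result_dtype(array_dtype: str, bound) -> str:
    """Result dtype for clip(array, bound) when bound is a Python scalar."""
    if bound is None:
        return array_dtype
    # A Python float is "weak", but it still crosses kinds when the array
    # is integral: the result becomes float64.
    if isinstance(bound, float) and array_dtype in INTEGER_DTYPES:
        return "float64"
    # Same-kind weak scalars never widen the array dtype.
    return array_dtype
```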

Remaining gaps (deferred)
=========================
- Strided / negative-stride inputs (~15-20× slower than NumPy): the engine
  materializes a contiguous copy first via Cast(copy:true). A proper
  fix would IL-emit a stride-aware kernel, but that doubles the code
  size and is rarely the hot path.
- Array bounds are slightly worse than the prior hand-coded V256 inner
  loop (~2× NumPy vs ~1.5× before): the IL emit doesn't 4×-unroll
  like the binary kernels do. In an earlier measurement, a manual 4×
  unroll of the simple clip loop hurt rather than helped against the
  JIT auto-unrolled C# baseline; the effect on IL-emitted code may
  differ but was not investigated.
…) into nditer

Brings in 5 commits from worktree-clip-min-max-aliases that rebuild np.clip
end-to-end. Replaces and supersedes the in-flight clip work on nditer
(c3bbe9a "fix(clip): Complex IComparable + Half NaN propagation") whose
root cause — generic CompareTo / NpFunc routing for clip — no longer
exists after this merge.

Summary of incoming work (3505edc..10064ab)
=============================================

1. feat(np.clip): NumPy 2.x parity — min=/max= keyword aliases and
   default-None bounds. New signature mirrors NumPy 2.x:
     clip(a, a_min=None, a_max=None, out=None, *, min=None, max=None,
          dtype=None)

2. fix(np.clip): NumPy 2.x dtype promotion (NEP 50 weak-scalar via
   np.result_type), out= dtype validation, NaN-on-int upcast.

3. perf(np.clip): scalar-bounds fast path + fixed a latent ClipScalar
   min>max kernel bug (`if/else if` instead of two sequential clamps).

4. perf(np.clip): fused copy+clip kernel + skip the redundant astype
   clone when the bound already matches lhs shape/dtype/contiguity.

5. refactor(np.clip): all kernels routed through a single
   ILKernelGenerator.Clip() entry. Every loop is now emitted as a
   DynamicMethod via System.Reflection.Emit. The SIMD width is
   resolved at runtime (V128/V256/V512) — no Vector256 token remains
   anywhere in the clip path.
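
Item 3's min>max fix can be illustrated with a small Python comparison. This is one plausible reading of the `if/else if` bug, not the original C# kernel; both function names are hypothetical.

```python
def clip_if_else(x, lo, hi):
    # Buggy shape: once the lower clamp fires, the upper clamp is skipped.
    if x < lo:
        return lo
    elif x > hi:
        return hi
    return x


def clip_sequential(x, lo, hi):
    # Two sequential clamps: lower bound first, then upper bound.
    # Equivalent to NumPy's minimum(maximum(x, lo), hi).
    return min(max(x, lo), hi)
```

With ordinary bounds (lo <= hi) the two agree. With lo=8, hi=3 the sequential form returns 3 for every input, which matches the "min > max → result all = max" rule, while the `if/else if` form returns 8 for any input below 8.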

Conflict resolution
===================

* src/NumSharp.Core/Backends/Default/Math/Default.Clip.cs
    Deleted in worktree. nditer's c3bbe9a modified it (Complex pre-
    checks). The IL kernels in this merge handle Complex natively via
    ComplexMaxNaN/ComplexMinNaN helpers called from the generated loop,
    so the Default.Clip.cs path becomes redundant. Took the deletion.

* src/NumSharp.Core/Backends/Default/Math/Default.ClipNDArray.cs
    Both sides modified. nditer's version had c3bbe9a + 574a0d8
    refactor (NpFunc generic dispatch, ~400 switch cases replaced).
    Worktree's version (this branch) is the IL-routed engine (207 lines
    of pure policy + one ILKernelGenerator.Clip call). Took worktree.
    Half / Complex correctness preserved by the new IL kernel — verified
    via the existing battletest suite (NewDtypes Half + Complex tests
    all pass).

* src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.Clip.cs
    Both modified. nditer added Half-specific scalar paths in the
    old kernel API. Worktree rewrote the file from ~1900 → ~360 lines
    of IL emission. Took worktree — Half NaN handling now lives inside
    the IL-emitted scalar tail via HalfMaxNaN/HalfMinNaN helper calls.

* src/NumSharp.Core/Backends/Kernels/ILKernelGenerator.cs
    Auto-merged cleanly. Worktree added EmitVectorMinOrMax helper
    alongside nditer's dtype-parity expansion.

* src/NumSharp.Core/Math/np.clip.cs
    Manually merged: kept worktree's new public signature
    (a_min/a_max/out/dtype/min/max with NumPy-2.x semantics) and
    nditer's PreserveFContigFromSource wrapper (39ef08c "F-contig
    preservation across ILKernel dispatch"). Output now keeps F-order
    when the input was F-contiguous.

* test/NumSharp.UnitTest/NumPyPortedTests/ClipNDArrayTests.cs
    Auto-merged cleanly — 39 new tests from worktree (NEP-50 promotion,
    min/max aliases, out= edge cases, etc.) sit alongside nditer's
    existing coverage.

Verification
============
- Full build: 0 errors, 17 warnings (unchanged from each side).
- Test sweep (TestCategory!=OpenBugs&!=HighMemory) on net10.0:
  8334 pass, 0 fail, 11 pre-existing skips. The worktree saw ~7232 tests
  pre-merge; the count on nditer alone was already higher, and the merge
  raises the combined total.
- All 85 clip-family tests pass (39 new + 46 pre-existing).
- The Complex IComparable issue that c3bbe9a addressed is verified
  fixed by the merge: the failing tests in that commit's "Fixes 14
  test failures" list all pass through the new IL kernel (Complex
  takes the scalar-IL path with ComplexMaxNaN/ComplexMinNaN helpers).

API behaviour for callers
=========================
- np.clip(arr, lo, hi)  — works exactly as before.
- np.clip(arr, lo, hi, dtype: NPTypeCode.Double)  — new dtype= override.
- np.clip(arr, lo, hi, @out: dst)  — supported as named arg.
- np.clip(arr, min: 3)        — NEW: NumPy 2.x kwarg alias.
- np.clip(arr, max: 7)        — NEW: kwarg alias.
- np.clip(arr, a_min: null, a_max: 5)  — NEW: explicit None bound.
- Promotion: clip(int32, 3.5) → float64 (was int32 — bug pre-merge).
- Out= dtype mismatch now throws ArgumentException (was silent
  garbage pre-merge).
Nucs added 30 commits June 12, 2026 07:21
…six + bitwise, np.positive full ufunc surface, NumPy round shape

Completes the single-NumPy-shaped-overload audit across 5716f86 (slice 2)
and 6a566e4 (merged-overload wave). A reflection audit over every np.*
member those commits touched found four remaining gaps; all are closed
here, every rule probed against NumPy 2.4.2 BEFORE implementation and
pinned verbatim by tests.

1) dtype= on add/subtract/multiply/divide/true_divide/mod
   The 6a566e4 wave deferred these because the loop-dtype machinery
   didn't exist; it does now (ExecuteBinaryOp's dtype override from the
   power/floor_divide work), so the deferral was stale. TensorEngine
   gains the house (lhs, rhs, typeCode, out, where) signature on the
   five members; np.* adds the trailing dtype= param.
   * the loop COMPUTES in dtype: subtract(300_i64, 5_i64, dtype=i16) =
     295 in the int16 loop; add(0.1, 0.2, out=f64, dtype=f32) stores
     float32(0.3) = 0.30000001192092896 in the f64 out (probed).
   * BUG FIX (ordering): the bool add/multiply remap (+ = logical OR,
     * = logical AND) keyed off the PROMOTED dtype and ran before the
     dtype= override — add(True, True, dtype=i32) would have run the
     BitwiseOr loop and returned 1. NumPy runs the i32 add loop and
     returns 2 (probed). The remap now keys off the FINAL loop dtype:
     moved after the dtype override in ExecuteBinaryOp.
   * divide/true_divide are float-only ufuncs: integer/bool dtype=
     raises 'No loop matching the specified signature and casting was
     found for ufunc divide' (probed; same gate family as sqrt&co).
   * mod reports 'remainder' with indexed input-cast errors (probed:
     mod(f64, f64, dtype=i32) → "Cannot cast ufunc 'remainder' input 0
     from dtype('float64') to dtype('int32') ...").
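
Two of the rules in item 1 can be reproduced in plain Python as a sanity check. This is a sketch under assumptions: `to_f32`, `add_f32_loop`, and `add` are hypothetical helpers, float32 is emulated by a `struct` round-trip, and dtypes are plain strings.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (f64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def add_f32_loop(a: float, b: float) -> float:
    # The loop COMPUTES in float32: cast inputs, add, round the result.
    return to_f32(to_f32(a) + to_f32(b))


def add(a, b, dtype=None):
    """Pick the loop from the FINAL dtype, then apply the bool remap."""
    promoted = "bool" if isinstance(a, bool) and isinstance(b, bool) else "int64"
    loop = dtype or promoted          # the dtype= override comes first
    if loop == "bool":                # remap keyed off the final loop dtype
        return a or b                 # bool + bool is logical OR
    return int(a) + int(b)
```

Storing `add_f32_loop(0.1, 0.2)` into an f64 slot keeps the float32 value 0.30000001192092896, and `add(True, True, dtype="int32")` runs the integer loop and yields 2 rather than 1.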

2) dtype= on bitwise_and/or/xor
   Loop selection among the bool/int loops: dtype=i64 widens,
   dtype=i16 narrows under same_kind (300 & 300 = 300_i16, probed).
   A float/complex/decimal dtype= raises the NO-LOOP text — distinct
   from the float-INPUT coercion text ValidateBitwiseLoop produces
   ('ufunc not supported for the input types...'); both texts probed,
   the first implementation reused the wrong one and the probe caught it.

3) np.positive — full ufunc surface (was: bare copy only)
   Slice 2 prepared UnaryOp.Positive (identity emit, UfuncName mapping,
   Round uses it as the masked-copy vehicle) but never exposed the np
   surface. New TensorEngine.Positive + Default.Positive:
   * positive(x, out=) returns the provided instance; where= masks
     (false slots keep prior contents); both ride the unary Into-path
     with the identity kernel.
   * dtype= selects the loop: positive(i32, dtype=f64) widens (the
     no-out path is a cast-copy); positive(f64, dtype=i32) raises the
     unary same_kind input-cast error naming 'positive'.
   * bool rule probed precisely: positive has NO bool loop — plain
     positive(bool) raises "ufunc 'positive' did not contain a loop
     with signature matching types <class 'numpy.dtypes.BoolDType'> ->
     None" and dtype=bool raises the two-sided variant naming the
     input's DType class (new NumPyDTypeClassName helper maps
     NPTypeCode → numpy.dtypes.* names); positive(bool, dtype=f64) is
     LEGAL → [1., 0.] — the guard keys off the loop, not the input.
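
The out= and where= contract for positive described above can be sketched on plain lists. `positive_into` is a hypothetical name; the sketch ignores dtype= and broadcasting.

```python
def positive_into(x, out, where=None):
    """Write +x into out where the mask is true; return the same out object."""
    if where is None:
        where = [True] * len(x)
    for i, (value, keep) in enumerate(zip(x, where)):
        if keep:
            out[i] = +value   # identity kernel
        # False slots are left untouched, so prior contents survive.
    return out
```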

4) round_/around — NumPy's round(a, decimals=0, out=None) shape
   Slice 2 shipped TWO out-overloads each ((x, out=null) + (x, decimals,
   out=null)); NumPy has ONE signature whose 2nd positional is DECIMALS
   (np.round(x, out_array) raises 'only integer scalar arrays can be
   converted to a scalar index' — probed). Merged to
   (x, int decimals = 0, NDArray out = null); positional-out callers
   migrate to @out: (3 test sites updated). Legacy positional-dtype
   conveniences unchanged.

Tests: UfuncDtypeOverloadTests +8 (binary-six loop matrix incl. the
bool-remap probe pin add(True,True,dtype=i32)=2, dtype+out f32
composition, divide no-loop + remainder/add input-cast texts, bitwise
int-loop matrix with narrowing + no-loop texts, positive call-form
matrix + both did-not-contain-a-loop texts + same_kind error, round
single-shape matrix); UfuncUnaryBatchOutWhereTests round callers moved
to NumPy positions.

Suite: net10.0 CI filter 9709/0; touched families 100/100 on net8.0;
live side-by-side value checks vs numpy 2.4.2 (f32 loop precision
0.30000001192092896, 1/3 float32, banker's rounding, positive matrix).
… NumPy + standalone decomposition

New matched-id benchmark pair (npyiter_core_bench.{cs,py}) measuring the ITERATOR,
not the kernels: construction across 13 flag configurations, traversal/orchestration
across chunk profiles (w=4..1024 strided rows, transposed, row/col broadcast),
buffered cast/mixed windows, per-element protocol (+C_INDEX/+MULTI_INDEX), full-
reduce, and small-N pipeline scaling N=8..2M. All kernels trivial and matched to
NumPy's loop families (memcpy / scalar-strided / V256-contig) so iterator cost
dominates. Both scripts correctness-check before timing and refuse Debug JIT.

Headline results (i9-13900K, NumPy 2.4.2, Release — full tables in
benchmark/poc/NPYITER_CORE_BENCH_RESULTS.md):
- Construction beats np.nditer 1.4-3.7x in every multi-operand config
  (3-op 308 vs 622 ns; ufunc config 343 vs 1000 ns; 8-op 385 vs 1140 ns).
- Full-iterator vs full-iterator (strided-row ADD through NumPy's real ufunc
  nditer): parity at w=4 (10.1 vs 10.3 ns/chunk), 2x faster at w=1024.
  The 4 ns/chunk machine in np.copyto is NumPy's stripped raw-copy walker,
  NOT nditer — and production NpyIter.Copy matches it exactly (T2c).
- Buffered mixed add 0.86x, broadcast adds 0.74-0.81x, transposed copy 0.84x,
  reductions 0.64-0.85x, contig copy at DRAM roofline parity.
- Raw iterator pipeline tracks NumPy's whole ufunc dispatch within +-8% at every
  N; production np.add(out=) carries ~200 ns of glue above it (648 vs ~440 ns).
- REUSED iterator (Reset+ForEach) at N=512: 54.7 ns/call — 7x under NumPy's
  e2e floor; the structural small-N lever NumPy cannot reach from Python.

Bugs/gaps surfaced (quantified in the results doc):
1. GROWINNER is hollow — NpyIter.cs:751 sets the bit, ComputeTransferSize never
   reads it: same-dtype BUFFERED iterators chunk into 512 needless 8192-elem
   windows, +8.4% on 4M f64 add (G-series). One condition fixes it.
2. Broadcast construction pays +388 ns over same-shape (C5 696 vs C3 308 ns):
   rank-mismatched operands miss the same-shape fast path and allocate through
   np.broadcast_to; NumPy's delta is +97 ns.
3. Chunked-callback overhead decomposed: delegate ~1.3 ns/chunk (ForEach vs
   ExecuteGeneric), element-stride imul per op/axis/step in ExternalLoopNext
   (NumPy stores byte strides), mask-resolution branch — 7 vs 4 ns/chunk against
   raw-walker workloads only; already parity against NumPy's real nditer.
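
The window count in item 1 is simple arithmetic. The sketch below assumes "4M" means 4 × 1024 × 1024 elements and models a working GROWINNER as a single grown inner loop; that is a simplification of the flag's behaviour, and `window_count` is a hypothetical helper.

```python
import math


def window_count(n_elems: int, buffer_elems: int = 8192, grow_inner: bool = False) -> int:
    """Number of inner-loop windows a buffered iterator hands to the kernel."""
    if grow_inner:
        # Simplified: with no cast needed, the inner loop grows to cover
        # the whole contiguous run in one call.
        return 1
    return math.ceil(n_elems / buffer_elems)


windows_today = window_count(4 * 1024 * 1024)                    # fixed 8192-element windows
windows_grown = window_count(4 * 1024 * 1024, grow_inner=True)   # flag honoured
```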

'How NpyIter becomes the best' (prioritized, measured headroom): iterator
reuse/state pooling (54.7 ns floor proven), cut the 200 ns production glue,
implement GROWINNER, broadcast-ctor fast path, byte-stride axisdata +
specialized iternext, tiny-chunk whole-array lowering, parallel ForEach.
Roadmap doc links the new bench in its verification harness section.
charts_npyiter_core.py renders the four presentation charts from the 2026-06-12
npyiter_core_bench run (data pinned inline from NPYITER_CORE_BENCH_RESULTS.md):
construction grouped bars, per-chunk dispatch overhead (copy vs raw walker +
add vs real ufunc nditer), small-N log-log scaling with the production-glue and
reuse-floor markers, and the traversal ratio scoreboard with the hollow-GROWINNER
inset. Output: %TEMP%/npyiter_charts/*.png
… territory; P0 crash + 3 losses + 4 unexpected wins

npyiter_frontier_bench.{cs,py} (matched ids) extends the iterator-core bench into
every suspected weak spot: axis reductions through op_axes+REDUCE (Wave-5
territory), ALLOCATE outputs, where= masks at degenerate run lengths (all-true /
alternating run=1 / blocky run=64), strided buffered casts, forced-order outputs,
0-d scalar calls, tiny-chunk production copyto, 8-op single-pass fusion, and the
kernel-bound dtype frontier (complex128/float16/int8) as labeled context rows.

NEW BUGS / LOSSES (full tables + analysis in NPYITER_CORE_BENCH_RESULTS.md):
1. P0 CRASH: ForEach on a BUFFERED+REDUCE iterator AccessViolates — GetIterNext()
   has no BUFFER+REDUCE branch, falls to ExternalLoopNext which walks BUFFER
   pointers with SOURCE strides while GetInnerLoopSizePtr hands BufIterEnd as the
   kernel count. Only BufferedReduce<TKernel>/Iternext() drives this config
   safely. Repro pinned (R3, skipped with comment).
2. Iterator ALLOCATE outputs are np.zeros'd (NpyIter.cs:277) where NumPy
   allocates EMPTY: +2.33 ms per 4M f64 call (32 MB memset). Still beats NumPy
   allocating (0.78) only because their page-fault tax is worse than our pooled
   memset; np.empty for WRITEONLY ALLOCATE => ~2x ahead.
3. Blocky where= (run=64) regresses BELOW our unmasked baseline (4.10 vs
   2.80 ms) while NumPy GAINS from the same mask (3.19 vs 3.54): per-run
   delegate/scan overhead eats the saved work. All-true and run=1 both WIN
   (0.79 / 0.72 — NumPy degrades worse at run=1).
4. Windowed buffered cast on strided source 1.52x behind one-pass copyto;
   production np.copyto already one-pass (1.08).
5. 0-d scalar calls 1.64x/2.41x behind (469/811 ns vs 286/337) — N=1 glue.
6. Axis-0 op_axes reduction 1.20x behind add.reduce; axis-1 wins 2x.
7. (kernel-bound) float16 1.34x behind, complex128 1.10-1.15x behind.

UNEXPECTED WINS:
- np.add ALLOCATING f64 4M: 3.83 vs 7.5-9.8 ms — 2-2.6x faster (Wave-2.4 pool
  vs NumPy's fresh-page faults).
- F-order-out elementwise 1.5x faster (X1 0.67 / X1p 0.65).
- 8-op ONE-PASS sum of 7 arrays 1.9x faster than NumPy's best possible
  composition (Y1 7.85 vs 14.59 ms) — multi-operand fusion dividend.
- int8 add 4M: 173 us vs 1.20 ms — 7x faster (NumPy 2.4.2 i8 loop slow).
- axis-1 reductions 2-2.8x faster (R2/R0b); reversed copy 0.94; production
  copyto at w=4 parity with NumPy's raw walker (P4).
…ny 14.5x losses found; parallel banding 4.7x win proven

npyiter_frontier2_bench.{cs,py} (matched ids): overlap/alias per-call taxes
(exact-alias V1, forced-copy V2), comparison->bool (D1), early-exit boolean
reduces (E1/E2), reduce over a broadcast view (F1), mixed-dtype/scalar/empty
small-N (M1/O3/O4), 8-D construction (C14), and a hand-rolled 8-band parallel
iteration (PAR series — one iterator per disjoint row band via Parallel.For,
the Wave-6.2 dividend made concrete on f64 sin).

HEADLINE LOSSES (root causes probed and pinned in the results doc):
1. F1: np.sum over broadcast_to(8K -> (1024,8192)) = 61.9 ms vs NumPy 1.14 ms
   (54x). NOT materialization: probe shows bc.copy()=11.3ms + dense sum=2.6ms,
   so even naive materialize-then-sum would be 4.5x faster — the reduction
   falls to a coordinate-walking general path at 7.4 ns/elem on IsBroadcasted
   inputs.
2. E1: np.any(bool 10M) full scan = 1.86 ms (4.9 GB/s scalar) vs NumPy 128 us
   (14.5x) — while np.count_nonzero on the SAME array runs 0.16 ms (63.7 GB/s
   SIMD). Routing bug: the SIMD scan exists, np.any doesn't use it. Early-exit
   case E2 WINS 3.9x (350 ns vs 1.35 us).
3. D1: np.less(out=bool) f64 4M 1.41x behind (2.99 vs 2.12 ms) — bool packing.
4. O3: array+scalar small-N 1.73x behind (901 vs 520 ns) — the scalar wrap
   costs MORE than passing a second full array (H0 648 ns).
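
The F1 figures above can be re-derived from the quoted timings. This is a worked check in Python using only the numbers in the text.

```python
ns_sum_ms = 61.9        # NumSharp np.sum over the broadcast view
numpy_sum_ms = 1.14     # NumPy on the same input
copy_ms = 11.3          # bc.copy() probe
dense_sum_ms = 2.6      # dense sum probe
elems = 1024 * 8192     # logical elements in the (1024, 8192) view

slowdown = ns_sum_ms / numpy_sum_ms                     # about 54x
materialize_gain = ns_sum_ms / (copy_ms + dense_sum_ms) # about 4.5x
ns_per_elem = ns_sum_ms * 1e6 / elems                   # about 7.4 ns
```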

WINS:
- PAR8: 8 banded iterators on f64 sin = 2.47 ms vs 12.1 single / 11.7 NumPy
  ceiling — 4.9x scaling, 4.7x over NumPy (which never threads its iterator);
  production np.sin already at single-thread parity (12.1 vs 11.7).
- V2: forced-copy overlap (write-ahead alias) 1.75x FASTER than NumPy
  (4.72 vs 8.26 ms) — Wave-1.1 machinery + Wave-2.4 pooled temp beat their
  fresh-alloc copy; V1 exact alias 0.88.
- C14 8-D ctor 3x faster (321 vs 953 ns); O4 empty 2.8x; E2 early-exit 3.9x;
  M1 mixed small-N parity-win (888 vs 931 ns).
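
The PAR banding idea (one worker per disjoint row band) can be sketched structurally in Python. `row_bands` and `banded_sin` are hypothetical names; Python threads will not reproduce the measured speedup because of the GIL, so this only shows how the rows are split and why the bands never write to the same output slot.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def row_bands(n_rows, n_bands):
    """Yield (start, stop) row ranges that cover 0..n_rows without overlap."""
    base, extra = divmod(n_rows, n_bands)
    start = 0
    for b in range(n_bands):
        stop = start + base + (1 if b < extra else 0)
        if stop > start:
            yield start, stop
        start = stop


def banded_sin(rows, n_bands=8):
    """Apply sin row by row, one worker per band; bands write disjoint rows."""
    out = [None] * len(rows)

    def work(band):
        lo, hi = band
        for r in range(lo, hi):
            out[r] = [math.sin(v) for v in rows[r]]

    with ThreadPoolExecutor(max_workers=n_bands) as pool:
        list(pool.map(work, row_bands(len(rows), n_bands)))
    return out
```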

Results doc gains the round-2 table, findings 13-16 with probe decomposition,
and reproduce lines.
…rom source and benchmarked across argument matrices

The consumer map is grounded in src/numpy (grep NpyIter_{New,MultiNew,AdvancedNew},
enclosing functions resolved): execute_ufunc_loop, PyUFunc_{Accumulate,Reduceat,
ReduceWrapper}, ufunc_at, array_{boolean_subscript,assign_boolean_subscript},
PyArray_{MapIterNew,CountNonzero,Nonzero,CopyAsFlat,Where}, arr_{ravel_multi_index,
unravel_index}, einsum, nditer_pywrap(+nested_iters), busday/datetime/string/void
consumers. npyiter_consumers_bench.{cs,py} exercises every benchable consumer
through its np.* surface with the perf-relevant argument matrix (dtype=/out-cast/
promoting unary; reduce axis/keepdims/dtype/3-D middle axis/amin; cumsum axes;
where same/scalar/broadcast; boolean read/assign; count_nonzero/argwhere; fancy
gather/scatter; ravel transposed/F-order/flatten/astype; unravel/ravel_multi_index)
and times the consumers NumSharp lacks NumPy-only as implementation targets.

Score: 20 wins, 4 losses, 1 parity, 8 feature gaps.

NEW LOSSES:
1. RD3 np.sum(f32, dtype=f64) 1.97x — composes astype-materialize (2.3ms) + sum
   (0.8ms) = measured 3.23ms instead of casting on load inside the reduce loop
   (NumPy buffered-REDUCE does); Wave-5 territory.
2. RD5 np.amin(axis=1) 1.54x — min/max axis kernels lag sum (which wins 2.8x on
   the same shape).
3. FX2 fancy scatter 1.49x (gather wins 0.76 — write-side path).
4. AC2 cumsum axis=0 1.36x — BOTH sides ~20ns/elem scalar column-walks (95 vs
   70ms); vertical-SIMD accumulate would be ~4-5ms => 15-20x leapfrog open.

WINS (selection family is a rout): argwhere 4.9x, flatten 3x, cumsum axis=1 2.9x,
sum full 2.8x, boolean read 2.6x, ravel F-order 2.2x, where broadcast 2.1x, where
scalar 2x, boolean assign 1.9x, ravel(A.T) 1.9x, sqrt(i32) promoting 1.5x, add
dtype=f32 1.8x, astype 1.8x, 3-D middle-axis sum 1.5x, ravel_multi_index 1.4x,
count_nonzero 1.4x, fancy gather 1.3x, unravel_index parity.

FEATURE GAPS with NumPy targets: reduce axis-tuple (2.08ms) / where= (9.27ms) /
initial= (2.03ms), einsum (2.30/1.42ms — canonical multi-op NpyIter consumer,
NpyExpr machinery fits), np.add.at (6.86ms, soft target), reduceat (1.24ms),
nested_iters, public np.nonzero alias.

Results doc gains the source-grounded consumer map, round-3 table, findings
17-21 (incl. 'do NOT migrate the won selection family onto per-chunk callbacks
without keeping Direct kernels as the fast path').
…(rounds 1-3)

Adds benchmark/poc/npyiter_bench_summary.py - a self-contained renderer that
aggregates every like-for-like NumSharp-vs-NumPy pair from the three NpyIter
benchmark rounds (four bench pairs: npyiter_core_bench, npyiter_frontier_bench,
npyiter_frontier2_bench, npyiter_consumers_bench; numbers as recorded in
NPYITER_CORE_BENCH_RESULTS.md, i9-13900K / NumPy 2.4.2 / Release) into a
terminal bar chart: geomean of NumSharp_time/NumPy_time per group, eighth-block
bars scaled 10.2 chars per 1.0x with parity at ~10 chars, win/lose row counts,
and FASTER/PARITY/SLOWER annotations.

Groupings:
- size tier: <=4K (0.71x, 17W/7L), 32K-8M (0.82x, 50W/24L), 10M (1.25x, 3W/1L;
  dragged solely by E1 np.any routing, T1 10M traversal is exact parity)
- family: construction 0.51x, small-N pipeline 1.14x, chunk traversal 0.95x,
  layout copies 0.58x, elementwise/bcast 0.77x, buffered cast 1.12x,
  where=/masks 0.67x, reductions/scans 1.09x (carries F1 54x + RD3 1.97x),
  selection/indexing 0.63x, kernel-bound dtypes 0.70x
- overall: 0.81x geomean, 70 win / 32 lose; 0.75x excluding the three
  root-caused outliers (F1 broadcast-view sum 54x, E1 np.any 14.5x, AC2 1.4x)
- architecture dividends rendered separately (no NumPy equivalent machinery):
  iterator reuse 7.0x, 8-banded parallel iterators 4.7x, one-pass 7-operand
  fusion 1.9x

Excluded by design to keep the geomean honest: T7a/b/c (Python nditer protocol
overhead is interpreter context, not iterator cost), NS-internal-only probe
rows (G2/G3, T2g, T5n), and duplicate-comparator variants (T5i, T2x).
…wer<-1.0->faster)

Rework npyiter_bench_summary.py to the house per-size geomean layout used in
the official benchmark-report summary: ratio shown as SPEEDUP = NumPy_time /
NumSharp_time (>1.0 = NumSharp faster), bar scaled 10 chars per 1.0x so the
parity tick lands mid-field at 20-char width, dotted padding, and the verbatim
'slower <----- 1.0 (parity) -----> faster' header. Bars now grow toward the
'faster' end - the previous version drew bar length proportional to
NumSharp/NumPy time so faster groups got SHORTER bars on a flipped axis;
geometry, axis labels, and annotations are now mutually consistent.

Layout per row: 7-char label, 20-char eighth-block bar (>= 2.0x overflows to a
trailing arrowhead), speedup, (N win / M lose), and only out-of-the-ordinary
verdicts annotated (PARITY within 5%, SLOWER below).
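
A minimal sketch of the bar geometry described above: 10 characters per 1.0x, a 20-character field, eighth-block resolution, dotted padding, and an arrowhead on overflow. `render_bar` and the exact glyphs are assumptions for illustration, not the code in npyiter_bench_summary.py.

```python
PARTIAL = " ▏▎▍▌▋▊▉"   # index k draws k/8 of one character cell


def render_bar(speedup: float, width: int = 20, chars_per_x: int = 10) -> str:
    """Render a fixed-width bar whose length is proportional to the speedup."""
    if speedup >= width / chars_per_x:
        # At or beyond the right edge (2.0x by default): overflow marker.
        return "█" * (width - 1) + "▶"
    eighths = round(speedup * chars_per_x * 8)
    full, rem = divmod(eighths, 8)
    bar = "█" * full + (PARTIAL[rem] if rem else "")
    return bar + "·" * (width - len(bar))
```

With these defaults a 1.0x row fills exactly half the field, so bars that stop short of the midpoint read as slower and bars past it as faster.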

Rendered verdict over the same 89 pairs: tiers 1K 1.40x / 4M 1.22x / 10M 0.80x
(E1 np.any routing drags 4 rows; T1 10M memcpy is exact parity) / ALL 1.24x
geomean 70W-32L; families ctor 1.95x, layout 1.74x, select 1.60x, where= 1.49x,
dtypes 1.42x, elemwise 1.30x, traversal 1.05x parity, reduce 0.92x, buffered
cast 0.89x, small-N 0.88x; dividends reuse 7.0x / parallel 4.7x / fusion 1.9x
rendered with capped overflow bars.
…port tiers, for the iterator core

Adds a parameterized size-sweep that runs the SAME six NpyIter operation
families at each of four element-count tiers (scalar=1, 1K, 100K, 1M), so the
iterator's NumSharp-vs-NumPy story can be read per cache tier the way the
official benchmark report presents whole-op throughput. Prior NpyIter rounds
used per-aspect ad-hoc sizes; this fixes a clean 6x4 matrix on both sides with
identical ids.

Files:
- npyiter_sizesweep_bench.cs  — NumSharp side. Six matched-kernel families:
    add   contiguous binary V256        copy  contiguous copy (memcpy chunk)
    sqrt  contiguous unary V256 Sqrt    sadd  strided a[::2]+b[::2]
    sum   4-acc V256 reduction          bcast stride-0 a+b1(1)
  Same Release-only Debug-JIT guard, best-of-rounds timing, and per-size iter
  counts as npyiter_core_bench.cs. All 24 correctness checks pass.
- npyiter_sizesweep_bench.py  — NumPy side, identical ids. copy uses
  np.positive (a REAL ufunc nditer) not np.copyto (a stripped raw-array
  walker), so the copy row is an honest iterator-vs-iterator comparison.
- npyiter_sizesweep_chart.py  — renders the speedup bar chart (NumPy/NumSharp,
  >1.0 = NumSharp faster) in the official-report axis style, grouped by size
  tier and by operation, plus a per-cell matrix. Self-contained: the settled
  clean run is embedded as (NumSharp_ms, NumPy_ms) per id; pass two recorded
  output files as argv to re-chart a fresh run (the bench .txt outputs are
  gitignored artifacts). The first C# run after a build is noise-tainted
  (machine not quiesced — sqrt@100K read 248us vs the true 57us); embedded
  numbers are the settled re-runs.

Result (geomean NumPy/NumSharp): scalar 2.29x, 1K 2.08x, 100K 1.33x, 1M 1.19x;
ALL 1.66x (20 win / 4 lose). The shape is the MIRROR of the full-API report:
NpyIter is strongest at small N (construction beats np.nditer + ufunc dispatch
setup) and converges to parity at 1M where the kernel saturates memory
bandwidth and the iterator contributes ~nothing. The four sub-1.0 cells
(add@1M 0.88x, sqrt@1M 0.98x, sqrt@100K 0.99x, sadd@100K 0.99x) are all parity
within run-to-run bandwidth variance (add@1M measured 380-453us across runs vs
NumPy 398us). Standout: sum is 2.35-10.11x faster at every tier — NumPy's
reduce carries a ~1.5us fixed setup (sum@1: NS 151ns vs NumPy 1.53us) and a
slower large-N pairwise pass (sum@1M: NS 89us vs NumPy 209us).
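
The overall figure can be cross-checked from the per-tier geomeans. Because every tier has the same six cells, the geomean over all 24 cells equals the geomean of the four tier geomeans. A short Python check, using only the rounded numbers quoted above (`geomean` and `speedup` are hypothetical helper names):

```python
import math


def geomean(values):
    """Geometric mean via the mean of logs."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


def speedup(numsharp_time, numpy_time):
    """SPEEDUP = NumPy_time / NumSharp_time; above 1.0 means NumSharp is faster."""
    return numpy_time / numsharp_time


tiers = {"scalar": 2.29, "1K": 2.08, "100K": 1.33, "1M": 1.19}
overall = geomean(tiers.values())     # about 1.66
sum_at_1m = speedup(89, 209)          # sum@1M in microseconds, about 2.35
```
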
…lar/1K/100K/1M

The minimal size-sweep covered only 6 families. This sweeps EVERY distinct
NpyIter operation family accumulated across rounds 1-3 at all four tiers. The
earlier rounds used SIZE as the id axis (H8..H2M, T2.4..T2.1024, O1..O4/M1 are
one op at many sizes); collapsing those, the distinct families are 33 + 3
dividends, each now run at scalar/1K/100K/1M = 143 measured pairs per side.

Families: elementwise (add sqrt copy strided bcast reversed castbuf mixbuf),
reductions (sum sum-ax0 sum-ax1 sum-dt= amin cumsum any-allfalse any-earlyhit),
selection (where a[mask] a[mask]= count_nonzero argwhere a[idx] a[idx]=),
copy/cast (flatten astype ravel.T in-place less->bool), index-math
(unravel_index ravel_multi_index), dtypes (complex128 float16 int8), and
dividends (fuse7 reuse par8 — NumPy has no equivalent machinery).

Files: npyiter_fullsweep_bench.{cs,py} (identical ids; raw-iterator matched
kernels for the elementwise-isolation rows + production np.* for the rest, each
mapped to its NumPy equivalent), npyiter_fullsweep_chart.py (self-contained:
embeds the clean run as (NumSharp_ms, NumPy_ms); renders per-tier, per-category,
category x tier, per-family x tier, and the dividends; pass two run files as
argv to re-chart). i9-13900K, NumPy 2.4.2, Release.

Headline (geomean NumPy/NumSharp, >1.0 = NumSharp faster): ALL 1.32x, 81 win /
51 lose over 132 main cells; tiers scalar 1.47x / 1K 1.46x / 100K 1.13x / 1M
1.27x. By category: reductions 2.14x, elementwise 1.41x, selection 1.31x,
dtypes 1.16x, but copy/cast 0.76x and index-math 0.75x lag (small-N per-call
copy overhead, crossing to wins at 1M). Dividends: fuse7 up to 17x vs chained
adds, reuse 5x at small-N, par8 2.4x at 1M.

Findings surfaced/confirmed across the size axis (presented to user):
- INTERMITTENT SEGFAULT (~50% of runs): uncatchable AccessViolation under the
  heavy mixed alloc/free load, varying crash point (seen at gather@1K / argw
  region) — heap corruption or GC/finalizer race on unmanaged NDArray storage.
- np.any(all-false): 24x faster at scalar but 12.5x SLOWER at 1M (0.08x) —
  scalar scan, no SIMD; early-exit case hides it. (confirms the routing bug.)
- np.less(out=bool): consistently 1.5-2.7x slower at every size.
- fancy a[idx] gather 1.4-3.4x slower, a[idx]= scatter 1.2-1.7x slower at all
  sizes; amin axis-reduce 2.4x slower at 100K+; float16/complex128 ~1.3-1.7x
  slower (documented scalar paths).
- int8 add VERIFIED correct and ~7x faster (sweep's 12x inflated by a noisy
  NumPy reading); reductions/castbuf/count_nonzero are the largest honest wins.
… harness, one results sheet

Promotes the exploratory poc/npyiter_* rounds into a single MAINTAINED
NumSharp-vs-NumPy benchmark under benchmark/npyiter/. Every distinct NpyIter
aspect the poc rounds surfaced now lives in one place, swept across cache tiers,
rendered into ONE sheet (npyiter_results.md): 162 measured pairs.

Pieces:
- npyiter_bench.cs / .py  — identical-id NumSharp + NumPy benches, SECTION-
  ADDRESSABLE via the NPYITER_SECTION env var. Ten sections:
    operations x size : elementwise/reductions/selection/copycast/indexmath/
                        dtypes/dividends — 33 families x {scalar,1K,100K,1M}
    construction      : 9 iterator flag configs vs np.nditer build
    chunkwidth        : per-chunk dispatch overhead across inner widths 4..1024
    pathology         : the regression canaries (bcast-reduce, allocate,
                        overlap-copy, F-order-out, 0-d)
  Iterator-isolation rows drive NpyIterRef directly with trivial NumPy-loop-
  matched kernels; production rows call np.* both sides; copy compares to
  np.positive (a real ufunc nditer), never np.copyto.
- npyiter_sheet.py  — orchestrator + renderer. Runs each section in its own
  short-lived process (crash isolation) and renders the per-tier/per-category/
  per-family operation matrix + construction + chunk-width + pathology +
  dividends sheet. Resilience baked in after the monolithic poc fullsweep
  AV'd ~50% of runs:
    * each NumSharp section retries up to 4x on a crash;
    * DOTNET_DbgEnableMiniDump=0 so an AV returns a non-zero exit IMMEDIATELY
      instead of stalling the process while a crash dump is written (the silent
      hang that voided the first full run — we never taskkill dotnet);
    * per-subprocess timeout backstop;
    * tsv is written incrementally after EVERY section and --resume skips
      already-collected sections, so a mid-sweep death never loses progress.
  Flags: --skip-build, --render-only, --resume, --sections.
- README.md  — run instructions, methodology guardrails (Release-only, matched
  kernels, positive-not-copyto), section table, and the findings ledger.
- npyiter_results.{md,tsv}  — the rendered sheet + raw (id, ns_ms, np_ms) pairs
  from the 2026-06-13 run (i9-13900K, NumPy 2.4.2, Release).
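
The resilience rules listed for npyiter_sheet.py (one process per section, retries, a timeout backstop, incremental tsv, resume) can be sketched as follows. This is a simplified illustration, not the orchestrator itself: `run_section`, `sweep`, the default timeout, and the one-row-per-section tsv layout are assumptions made for the example; only the NPYITER_SECTION and DOTNET_DbgEnableMiniDump variables and the 4x retry count come from the description above.

```python
import os
import subprocess
import sys

def run_section(cmd, section, retries=4, timeout_s=600):
    """Run one bench section in its own process; return stdout, or None if every attempt failed."""
    env = dict(os.environ, NPYITER_SECTION=section, DOTNET_DbgEnableMiniDump="0")
    for _ in range(retries):
        try:
            proc = subprocess.run(cmd, env=env, capture_output=True,
                                  text=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            continue  # timeout backstop: treat a hang like a crash and retry
        if proc.returncode == 0:
            return proc.stdout
    return None

def sweep(cmd, sections, tsv_path):
    """Append each finished section to the tsv at once; skip sections already recorded."""
    done = set()
    if os.path.exists(tsv_path):
        with open(tsv_path) as f:
            done = {line.split("\t", 1)[0] for line in f if line.strip()}
    for section in sections:
        if section in done:
            continue  # resume: this section was collected by an earlier run
        out = run_section(cmd, section)
        if out is not None:
            with open(tsv_path, "a") as f:
                f.write(f"{section}\t{out.strip()}\n")
```

Writing after every section means a crash in section N costs at most that section, and a later resumed run picks up where the file ends.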

Headline (speedup = NumPy/NumSharp, >1.0 = NumSharp faster): operation matrix
1.24x geomean, 80 win / 52 lose over 132 cells; tiers scalar 1.20x / 1K 1.32x /
100K 1.14x / 1M 1.32x. Categories: reductions 2.03x, elementwise 1.48x,
selection 1.32x, dtypes 1.03x; copy/cast 0.58x and index-math 0.63x lag at
small N (per-call setup) and cross to wins by 1M. Construction beats np.nditer
6.19x geomean (up to 12.5x on the 8-operand build). Chunk-width loses at w=4
(0.74x) and wins from w=64 up. Dividends NumPy structurally can't match: fuse7
4.6-15x, reuse up to 9x, par8 2.5-7x.

Regression canaries / losses tracked by the sheet: bcast-reduce 51x slower,
F-order-out 3.5x, allocate 2x; np.any full-scan 0.07x at 1M (scalar scan vs
SIMD); comparison->bool 0.5x; fancy a[idx] gather/scatter 0.5-0.75x; amin
axis-reduce 0.4x at scale. int8 verified ~7-11x faster (correct). These are the
fix-list, ordered in README's findings ledger.
… (commit-to-master)

Wires the canonical NpyIter benchmark into a semi-manual GitHub Action that runs
AFTER a release and publishes results straight to master (the chosen target,
not the wiki — GITHUB_TOKEN can push to the repo but not to a .wiki.git, which
would need a PAT).

- .github/workflows/npyiter-benchmark.yml — SEPARATE workflow (never gates the
  release). Triggers: workflow_run on 'Build and Release' completion+success,
  plus workflow_dispatch (the manual knob). Sets up .NET 8+10 (preview) and
  Python 3.12, pins numpy==2.4.2, builds Release, runs npyiter_sheet.py
  --skip-build, renders the cards, and commits npyiter_results.{md,tsv} +
  cards/*.png to master with '[skip ci]' (so the push can't re-trigger the
  release workflow). permissions: contents:write — no PAT needed.

- benchmark/npyiter/npyiter_cards.py — renders two 400x300 PNG cards from the
  tsv (matplotlib, figsize=(4,3) dpi=100): ops.png (speedup by size tier) and
  cat.png (speedup by op class). Ratio-only by design — absolute ms vary by
  runner hardware, but the same-runner NumPy/NumSharp ratio stays meaningful.

- README.md — new 'Performance vs NumPy' section embedding both cards (raw URLs)
  linked to the full report, with the explicit 'same-machine ratio' caveat.

- npyiter_sheet.py — portability fix: run_ns() rewrites the .cs's absolute
  '#:project K:/source/NumSharp/...' line to THIS checkout's csproj path, so the
  same bench (authored to run directly via 'dotnet run - < file' on Windows)
  runs unchanged on a Linux CI runner.

- Refreshed npyiter_results.{md,tsv} + cards from a clean 162-pair run
  (headline 1.19x, 76 win / 56 lose). That run also exercised the resilience
  for real: the selection section took an 0xC0000005 AV on attempt 1 and the
  orchestrator's retry recovered it automatically — exactly the CI-safety the
  section isolation + DbgEnableMiniDump=0 + retry was built for.
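
The run_ns() portability fix described above amounts to rewriting one directive line before the bench is fed to `dotnet run`. A minimal sketch, assuming the directive sits at the start of a line; `rewrite_project_line` and both paths in the usage lines are hypothetical placeholders, not the real checkout paths:

```python
import re

def rewrite_project_line(source, csproj_path):
    """Point the '#:project <path>' directive at csproj_path; other lines are untouched."""
    # A lambda replacement avoids backslash-escape surprises in Windows-style paths
    return re.sub(r"(?m)^#:project\s+.*$", lambda m: f"#:project {csproj_path}", source)

# Hypothetical paths for illustration only
original = "#:project K:/old/Hypothetical.csproj\nConsole.WriteLine(1);\n"
patched = rewrite_project_line(original, "/checkout/Hypothetical.csproj")
```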

Caveats documented in the workflow header: shared-runner variance (ratios not
absolutes), and direct-to-master push assumes master is CI-writable (no branch
protection blocking github-actions[bot]); switch to a PR step if that changes.
…iter

The six exploratory POC rounds (npyiter_core/frontier/frontier2/consumers/
sizesweep/fullsweep benches + bench_summary + charts + NPYITER_CORE_BENCH_RESULTS.md)
were superseded by the canonical benchmark/npyiter/ — every aspect they surfaced
now lives there, swept across cache tiers into one sheet. Removed: 17 tracked
files + 4 gitignored .txt dumps.

KEPT benchmark/poc/npyiter_parity_poc.{cs,py} — it is NOT part of the
exploratory rounds: it holds the hand-written AVX2-gather reference kernels
(PocKernels.AddF32/SqrtF32/SumF32) that docs/NPYITER_PERF_HANDOVER.md points to
as the blueprint to transcribe for the IL-emission work, and benchmark/CLAUDE.md
cites it as the Debug-JIT guard example.

Reference fixes (no dangling links left):
- docs/NPYITER_GAPS_AND_ROADMAP.md §6: the iterator-core reproduce block now
  points to benchmark/npyiter/npyiter_sheet.py instead of the deleted bench.
- benchmark/npyiter/README.md: dropped the 'supersedes poc/npyiter_*' wording
  (those files are gone) for a self-contained description.

Finalize: added benchmark/npyiter/.gitignore for transient run artifacts
(run.log, __pycache__) so only the durable outputs — npyiter_results.{md,tsv}
and cards/*.png — are tracked.
… tier; AV→NA; one CI

Folds the NpyIter benchmark into the official orchestrator so there is ONE entry
point and ONE report, while keeping the two harnesses distinct (they measure
different things — op/dtype/N throughput vs the iterator machinery — and the
NpyIter harness needs internal access + section-isolation the BenchmarkDotNet
in-process run can't give).

run_benchmark.py — after the official (op,dtype,N) merge, runs the NpyIter sheet
+ cards and APPENDS the sheet to benchmark-report.md as its own section (not
merged — different result model). Archives npyiter_results.{md,tsv} + cards into
results/<ts>/. New --skip-npyiter flag. This is now the single command for the
whole NumSharp-vs-NumPy comparison.

+10M tier (decision 1): npyiter_bench.{cs,py} sweep now scalar/1K/100K/1M/10M
(grid 2500x4000 = 10M exactly; pick 30 iters/3 rounds at 10M). sheet TIERS +
cards pick it up automatically.

AV → NA/IGNORED (decision 3): instead of silently omitting a section that
crashes all retries, the sheet now records its ids NA (NumPy runs first to give
the expected id set), prints an AV-POLICY header explaining the known
intermittent AccessViolation is ignored, lists 'THIS RUN: NA across <sections>',
shows NA cells in the per-family/dividends matrices, and excludes NA from every
geomean. tsv stores NA; load/cards skip it.
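
A sketch of how the NA policy above can flow through loading and aggregation. It assumes the (id, ns_ms, np_ms) column order of the tsv and that NA appears in the NumSharp column; `load_tsv`, `geomean_speedup`, and the sample ids are hypothetical names for illustration:

```python
import math

def load_tsv(text):
    """Parse 'id<TAB>ns_ms<TAB>np_ms' rows; an 'NA' NumSharp cell becomes None."""
    rows = {}
    for line in text.strip().splitlines():
        bench_id, ns_ms, np_ms = line.split("\t")
        rows[bench_id] = (None if ns_ms == "NA" else float(ns_ms), float(np_ms))
    return rows

def geomean_speedup(rows):
    """Geomean of NumPy/NumSharp over measured cells only; NA cells are excluded."""
    logs = [math.log(np_ms / ns_ms)
            for ns_ms, np_ms in rows.values() if ns_ms is not None]
    return math.exp(sum(logs) / len(logs)) if logs else None

# Illustrative rows only, not measured data
sample = "add.1K\t1.0\t2.0\nwhere.1K\tNA\t3.0\nsum.1K\t2.0\t4.0"
rows = load_tsv(sample)
```

Keeping the NA row in the table while dropping it from the mean is what lets the sheet show the gap without letting it distort the headline.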

CI consolidation (decision 2): npyiter-benchmark.yml -> benchmark.yml, now runs
the ENTIRE suite via run_benchmark.py. Trigger changed from workflow_run-on-
every-build to release:published (the real 'after a successful release' signal —
'Build and Release' publishes a GitHub Release on a v* tag) + workflow_dispatch,
so the heavy full suite runs per-release, not per-push. Commits report + cards
to master with [skip ci]. timeout-minutes: 180.

The npyiter_parity_poc gather kernels and the rest of the harness methodology
(Release-only, matched kernels, positive-not-copyto, section isolation) are
unchanged.
…on selection

Refreshes the canonical NpyIter results (npyiter_results.md/.tsv) and the two
README cards with a full sweep that now includes the 10M cache tier, and records
the AV->NA policy firing on a real run. Also documents the run_benchmark.py
integration in benchmark/CLAUDE.md.

What changed
------------
* 198 measured pairs (was 162), 35 of them NA. The new 10M tier adds 36 ids
  across the size-swept families; SIZES is now scalar/1K/100K/1M/10M end to end
  (bench .cs + .py grids: 10M = 2500x4000).
* selection (where / a[mask] / a[mask]= / count_nz / argwhere / a[idx] / a[idx]=)
  hit NumSharp's known intermittent AccessViolation on EVERY retry this run, so
  the whole section is reported NA/IGNORED per policy and excluded from every
  geomean. The header now reads "198 measured pairs (35 NA)" and
  "AV POLICY ... THIS RUN: NA across selection."; the section renders as
  "(no data)" / "-" / "NA" cells instead of crashing the sweep. This is the
  designed crash-resilience path proven on a live run, not a regression.
* Headline operation matrix: 1.17x geomean, 77 win / 53 lose over 130 cells
  (26 non-selection families x 5 tiers). Reductions lead (1.80x), dtypes 1.59x,
  elementwise 1.12x; copy/cast (0.65x) and index-math (0.70x) remain the small-N
  laggards already tracked as canaries.

Doc
---
benchmark/CLAUDE.md run_benchmark.py section now describes the appended NpyIter
step (aspect x tier, appended-not-merged, section-isolated, AV->NA,
--skip-npyiter) and points at benchmark/npyiter/README.md, so the dev guide
matches the wired-in integration (run_benchmark.py + benchmark.yml).

Known bug surfaced (tracked, not fixed here)
--------------------------------------------
The selection-section AccessViolation (0xC0000005) is an unmanaged-storage
lifetime bug in NumSharp under heavy mixed alloc/free load. It is intermittent
(~50% per heavy section) and uncatchable; the benchmark now degrades to NA
rather than masking it. Worth a dedicated issue + fix pass.
…ted report artifacts

Adds docs/website-src/docs/benchmarks.md — the DocFX page the user asked for:
"the real place where we discuss and present the efforts to surpass NumPy
through the power of Runtime IL Generation." It is the evidence companion to the
existing IL Generation page (il-generation.md explains HOW the kernels are
emitted; this page shows WHAT that buys head-to-head against NumPy).

The page is driven by the artifacts the Benchmark workflow (benchmark.yml)
auto-commits to master after every release:
* The two 400x300 cards are embedded by absolute raw.githubusercontent master
  URLs (same source the README uses), so they always reflect the latest
  committed run rather than a pasted screenshot. Verified the docfx build keeps
  the URLs absolute (it does not relativize external links).
* The full reports are linked on master: the iterator sheet
  (benchmark/npyiter/npyiter_results.md, which the cards render from) and the
  op/dtype/N matrix (benchmark/benchmark-report.md), plus the harness README and
  benchmark/CLAUDE.md.

Content (grounded in the current committed npyiter_results.md numbers):
* Headline cards + a by-class geomean table (reductions ~1.8x, dtypes ~1.6x,
  elementwise ~1.1x parity, copy/cast ~0.65x, index-math ~0.7x).
* Class-by-class discussion tying each result to the IL mechanism (4x unrolling,
  tree reduction, SIMD early-exit, per-(op,dtype,layout) specialization), and
  honest about the taxes (small-N copy/cast, all-false any() scan, bcast_reduce).
* The dividends NumPy can't structurally match: expression fusion (np.evaluate,
  up to ~13x), kernel reuse, parallel inner loop (par8 up to ~8x), cheaper
  iterator construction (~2-3x vs np.nditer).
* Methodology + honesty section: Release-only JIT, best-of-rounds, ratios-not-
  absolutes, and the AV->NA policy.
* Reproduce-locally commands.

Wiring:
* docs/toc.yml — new "Benchmarks vs NumPy" entry right after IL Generation.
* il-generation.md — cross-link from the Performance Impact section ("naive C#"
  table vs the head-to-head-NumPy page).
* index.md — added IL Generation + Benchmarks links to Get Started.

Validated with `docfx build` (build-only, metadata skipped): 0 errors, the page
itself emits 0 warnings (the 84 UidNotFound warnings are api/toc.yml entries that
only resolve after the metadata step, which CI runs first). benchmarks.html
renders, cards resolve to absolute URLs, internal links rewrite to .html.

Note: deploy is via docs.yml on push to master (paths: docs/website-src/**); this
branch commit does not deploy until merged. How the page REFERENCES the
auto-committed cards (raw-master URL vs bundling copies into website-src/images/)
is the next thing to settle.
…FX site

Two follow-ups to the Benchmarks vs NumPy page, both from user direction.

1) The two 400x300 cards now carry the whole canonical summary (modeled on the
   ASCII sheet the user singled out), not just one bar chart each. Everything is
   still COMPUTED from npyiter_results.tsv, so the cards auto-update each run and
   NA (AccessViolation) ids are skipped.

   * cards/ops.png  — OPERATIONS vs NumPy: headline (geomean / win-lose / cells)
     + by-array-size-tier bars (scalar..10M) + by-operation-class bars ranked
     best->worst (reductions 1.80x ... copy/cast 0.65x; wins green, the two
     small-N taxes red).
   * cards/cat.png  — the IL-GENERATION DIVIDENDS, the "machinery NumPy has no
     equivalent for": iterator build vs np.nditer, expression fusion (np.evaluate),
     kernel reuse, parallel inner loop — each bar is the honest geomean with an
     "up to <peak>x" annotation — plus the chunk-width trend (w=4 -> w=1024) and
     the honest pathology canary (bcast_reduce ~52x behind, in red).

   npyiter_cards.py rewritten: shared hbars() helper, color_of() (green/amber-
   parity/red), stat() for (geomean, peak), two card builders. Imports CTOR/CW/
   PATH/DIVIDENDS from the sheet so the section data stays single-sourced.

   Captions/alt-text updated to match the new card semantics (cat.png is no longer
   "by op class") in README.md and benchmarks.md.

2) Full reports are now rendered INTO the site as searchable pages (user choice:
   "Render into the site"), in addition to being linked on GitHub:

   * docs/website-src/docs/benchmark-matrix.md   — the op/dtype/N matrix
     (benchmark-report.md body under a single page H1).
   * docs/website-src/docs/benchmark-iterator.md — the canonical iterator sheet
     (npyiter_results.md fenced block under a page H1).
   * toc.yml nests both under "Benchmarks vs NumPy"; benchmarks.md "Read the full
     reports" now links the on-site pages (raw files still linked on master).

   benchmark.yml regenerates these two pages from the just-produced reports (op
   matrix drops its own H1 via tail -n +2 so the page has one title; the iterator
   sheet has no H1), commits them alongside the report + cards, and — because the
   commit carries [skip ci] and the pages live under docs/website-src/** — then
   `gh workflow run docs.yml` to redeploy the site (added actions:write + GH_TOKEN).

Validation
----------
* npyiter_cards.py renders both cards; verified visually (legible at 400x300).
* benchmark.yml is valid YAML (yaml.safe_load).
* docfx build (build-only): 0 errors; benchmark-matrix.html + benchmark-iterator.html
  generate; benchmarks.html internal links to both resolve; no warning names any new
  page (the 82 UidNotFound warnings are api/toc.yml, resolved by the metadata step CI
  runs first). No docs/website/ build-output committed.

Still open (deferred by the user): the card REFERENCING mechanism on the docs page
(raw-master URLs today vs bundling the PNGs into website-src/images/). The redeploy
chaining added here would make that swap trivial if chosen later.
… 15 Best"

The op/dtype/N matrix report (benchmark-report.md, rendered into the site as
benchmark-matrix.md) showcased garbage: every "Top 15 Best" row was np.copy(float64)
and np.searchsorted at "0.0 / 0.0x". Three distinct bugs, all fixed.

BUG 1 — searchsorted benchmark measured nothing (both sides)
  SortingBenchmarks.cs and numpy_benchmark.py issued a SINGLE scalar lookup
  (np.searchsorted(sorted, N/2)) — one O(log N) binary search, ~18ns at EVERY N,
  pure call overhead. Against NumPy's ~1µs Python overhead that manufactured a
  meaningless 50–1000x "win". Fixed: both now query the N-element array (a) into the
  sorted target → N binary searches, real work that scales with N. (Verified the C#
  benchmark project still compiles.)

BUG 2 — normalize_op_name collapsed a slice-copy onto np.copy
  The Slicing suite's "np.copy(a[100:1000])" (a fixed 900-element slice copy, ~3.6µs
  at every N) was normalized by stripping ALL "[...]" — including the array-index
  "[100:1000]" — yielding "np.copy", which COLLIDED with the Creation full-array
  "np.copy(a)" in csharp_index (last-write-wins) and overwrote the real float64
  measurement. THAT was the bogus "copy float64 = 0.0036ms" (not a copy bug at all;
  the op is fine — archived raw float64 copy@10M = 11.04ms). Fixed: only strip a
  space-separated " [annotation]" (\s+\[ instead of \s*\[), never index brackets
  attached to an identifier. Incidentally also de-collides concatenate/stack/slice
  variants. copy(float64) now reads its real values across all sizes (10M → 11.04ms,
  ratio 0.60 = a genuine win).
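
The difference between the two patterns in BUG 2 is easiest to see side by side. In this sketch only the `\s*\[` versus `\s+\[` prefix comes from the fix; the rest of each pattern, the function names, and the " [float64]" annotation are assumptions for illustration:

```python
import re

def strip_old(name):
    # Old pattern: \s*\[ also matches brackets glued to an identifier
    return re.sub(r"\s*\[[^\]]*\]", "", name)

def strip_new(name):
    # Fixed pattern: \s+\[ requires whitespace before the bracket
    return re.sub(r"\s+\[[^\]]*\]", "", name)

slice_copy = "np.copy(a[100:1000])"
annotated = "np.add [float64]"
```

With the old pattern the slice copy normalizes to `np.copy(a)` and lands on the same key as the full-array copy; with the fixed pattern it keeps its index brackets while the space-separated annotation is still removed.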

BUG 3 — the report ranked/averaged non-credible rows as wins
  merge-results.py sorted "Top Best" by ratio with only a `ratio is not None` guard,
  so a sub-resolution NumSharp time (ratio rounding to 0.0) sorted to #1, and CSV
  blanked legit 0.0 via `r.ratio or ''`. Fixed with a credibility gate (classify()):
  a row is "negligible" (new ▫ status) when either side did <1µs of work OR the
  speedup exceeds 20x (NumSharp >20x faster ⇒ artifact: a view, a lazy alloc, or a
  dead-code-eliminated kernel). Negligible rows are EXCLUDED from Top Best/Worst and
  from the per-size geomean, but still listed (▫) in the per-suite tables — nothing
  hidden. Also: store ms at 4 / ratio at 3 decimals, show 3-decimal ms + 2-decimal
  ratio in the showcase (no more "0.0/0.0x"), fix the `or ''` falsy-zero in CSV, add
  the ▫ legend row + summary/size-table counts, and a header note stating how many
  rows were excluded and why.
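
The credibility gate and the falsy-zero fix from BUG 3 can be sketched as below. The thresholds (1µs, 20x) are the ones stated above; the function signatures, the "credible" label, and the 3-decimal CSV format are assumptions rather than the actual merge-results.py code:

```python
def classify(numsharp_ms, numpy_ms, min_ms=0.001, max_speedup=20.0):
    """Return 'negligible' for rows that cannot be credible measurements, else 'credible'."""
    if numsharp_ms < min_ms or numpy_ms < min_ms:
        return "negligible"  # either side did under 1 microsecond of work
    if numpy_ms / numsharp_ms > max_speedup:
        return "negligible"  # NumSharp more than 20x faster: treated as an artifact
    return "credible"

def csv_cell(ratio):
    """Blank only a missing ratio; a legitimate 0.0 must survive."""
    # `ratio or ''` would also blank 0.0 because 0.0 is falsy
    return "" if ratio is None else f"{ratio:.3f}"
```

The gate is written in terms of raw milliseconds so it gives the same answer whichever way the displayed ratio is oriented.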

Result (regenerated from the on-disk run archive with the fixed merge):
  * Top Best is now real reductions/statistics wins (np.nansum 0.08x, np.percentile
    0.10x, np.average 0.10x) — genuine ms on both sides.
  * 1233 ops → 305 faster / 255 close / 169 slower / 103 much-slower / 275 NEGLIGIBLE
    (the artifacts, previously ~all counted as "faster") / 126 n/a.
  * Top Worst surfaces a real gap: np.zeros (NumSharp eagerly zeros ~10.7ms vs NumPy
    lazy calloc ~0.01ms) — a legitimate optimization target, not an artifact.

benchmark-matrix.md (the DocFX page) re-seeded from the corrected report; docfx build
clean (0 errors). The searchsorted benchmark fix takes effect on the next CI run; the
credibility gate keeps any residual artifact out of the showcase meanwhile.
… 1.3–6.1)

Branch advanced 31 substantive commits past the first changelog (which
described through 33058b8). The branch was rebased meanwhile — the original
changelog commit bb7ed7a8 is orphaned, its twin is 4140f4d, and 33058b8
remains an ancestor of HEAD, so 33058b8..HEAD is the true new-work boundary.

Learned and folded in:
- np.evaluate — Tier-3C fusion made public; per-node NumPy result_type typing
  (fixes the mixed-tree dtype bug: i4*i4+f8 must wrap in int32 first), fused
  reductions, EXTERNAL_LOOP guard, out= via ufunc rules. 3.2–6.1x vs NumPy.
- out=/where=/dtype= across the elementwise ufunc API (binary, unary-math,
  comparisons, predicates, bitwise, invert, arctan2) — one NumPy-shaped
  overload each, exact broadcast/cast/error-text semantics.
- New at np.*: bitwise_and/or/xor (were operator-only, CS0117) and positive.
- nditer: WRITEMASKED/ARRAYMASK execution + VIRTUAL operands (was silent
  masked-write corruption); Wave-1.4 fixes (size-1 stride-0 invariant,
  op_axes OOB, write-broadcast validation, PARALLEL_SAFE, unit-axis absorb).
- Alloc Wave 2.4: buffer-pool window 4KiB–1MiB -> 1B–64MiB, pool-side GC
  pressure, finalizer suppression.
- Canonical NpyIter benchmark suite + post-release benchmark.yml CI +
  DocFX Benchmarks-vs-NumPy website pages; honest frontier findings recorded
  (broadcast-reduce 54x slower, scalar np.any 14.5x slower, BUFFERED+REDUCE
  ForEach P0 crash, parallel banding 4.7x win).

Stats refreshed: 272/519/+198k -> 312 commits, 615 files, +217,949/-16,402.
Tests: 9,447 -> 9,709 passed/0 failed (net10.0). New-API count 30 -> 35.
Same content (minus H1) pushed live to the PR #611 description via REST PATCH.
…oard page

Adds a new DocFX page in the npyiter_results.md dashboard style (ASCII bars, geomeans,
win/lose, top wins/losses) applied to the broad op × dtype × N matrix — the graph/stats/
numbers companion to the narrative benchmarks.md, with minimal prose.

* benchmark/scripts/render_dashboard.py — reads the merged benchmark-report.json and emits
  benchmark-dashboard.md: headline geomean, BY-SIZE-TIER / BY-SUITE / BY-DTYPE bars (same
  bar() aesthetic as npyiter_sheet.py — length 10 = parity, 20 = 2.0×), the status mix, and
  TOP-12 wins/losses with raw ms. Charts only CREDIBLE rows (the merge-results.py gate), so
  the negligible artifacts that used to dominate stay out. speedup = NumPy ÷ NumSharp.
* docs/website-src/docs/benchmarks-dashboard.md — the page (title + one-line note + the
  ```-fenced sheet), seeded from the renderer. Nested under "Benchmarks vs NumPy" in toc.yml
  as "Dashboard (op matrix)", beside the full Operation matrix and Iterator sheet.
* benchmark/.gitignore — ignore the benchmark-dashboard.md intermediate (the tracked form is
  the DocFX page), matching how benchmark-report.json/csv are handled.
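
The bar scale mentioned above (length 10 = parity, 20 = 2.0×) is a linear mapping from speedup to character count. A hedged sketch: the `#` glyph, the cap, and the rounding rule are assumptions, not necessarily what the shared bar() helper does:

```python
def bar(speedup, unit=10, cap=40):
    """ASCII bar where `unit` characters mean parity (1.0x); capped so outliers fit one line."""
    return "#" * min(cap, max(0, round(speedup * unit)))

parity = bar(1.0)
double = bar(2.0)
```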

What it shows on the current data (honest, broad picture vs the curated npyiter sheet):
0.74× geomean over 832 credible cells (305 win / 527 lose) — NumSharp trails on the full
matrix but reaches parity at 10M (0.98×), and wins decisively where its IL kernels shine:
statistics 2.28×, broadcasting 1.22×, reduction 1.21×; uint8 1.07×. Laggards are arithmetic/
unary/creation and bool. Top wins: nansum/percentile/average (8–13×). Top losses: np.zeros
(eager-zero vs NumPy lazy calloc, ~500–880×) and argsort (~25×).

Prototype scope: the page is a committed STATIC snapshot. To make it live (auto-refresh each
release like the matrix/iterator pages), wire render_dashboard.py + a seed step into
run_benchmark.py / benchmark.yml — deferred pending design review. docfx build is clean.

Two net8.0-only BCL semantic gaps surfaced by the fuzz differential matrix.
Both behave correctly on net9.0+ (where the BCL was fixed) but produced
wrong values on net8.0; worked around to match NumPy 2.4.2.

1. np.abs(complex) with an infinite component returned NaN instead of +inf
   ------------------------------------------------------------------------
   cabs(NaN + inf*i) must be +inf (C99 hypot / npy_cabs: the infinity test
   precedes the NaN test). System.Numerics.Complex.Abs routes through a
   private Hypot whose operand ordering is NaN-unaware, so on net8.0 it
   returns NaN for abs(NaN+inf*i) (fixed in the .NET 9 BCL).

   Added Utilities/NpyComplexMath.Abs(Complex): returns +inf when either
   component is infinite, else defers to Complex.Abs — so every finite/
   NaN-only magnitude that already matched NumPy bit-for-bit is unchanged.
   Repointed the two cached MethodInfo handles that drive every complex-abs
   emit site: DirectILKernelGenerator.CachedMethods.ComplexAbs (6 IL call
   sites across the scalar/strided/predicate/math/decimal unary loops) and
   DefaultEngine.UnaryOp.s_complexAbs (NpyIter Tier-3B route).

   Fixes 19 unary.jsonl + 1 random_smoke.jsonl fuzz cases (all layouts:
   contiguous / strided / transposed / broadcast / negstride).

2. ptp / amax / amin along an axis dropped NaN instead of propagating it
   ------------------------------------------------------------------------
   The typed-struct leading/innermost axis-reduction fast paths
   (MinOp<T>/MaxOp<T>.Combine256/128) called raw Vector256/128.Min/Max. The
   x86 vminps/vmaxps these lower to return the SECOND operand on an
   unordered (NaN) compare; the BCL Vector{N}.Min/Max only adopted IEEE NaN
   propagation in .NET 9. Verified: Vector128.Max(NaN,5) == 5 on net8.0,
   == NaN on net10.0. So max/min/ptp over a NaN-laced axis silently lost
   the NaN on net8.0 (ptp axis=0 returned a finite value where NumPy = NaN).

   Routed MinOp/MaxOp through the existing NaNAwareMinMax256/128 helper
   (already used by the contiguous/strided CombineVectors paths) and wrapped
   that helper's float/double self-equality mask in #if NET8_0 — so net9.0+
   keeps the single-instruction vmaxps with zero overhead while net8.0 gets
   ConditionalSelect(ordered, min/max, a+b) NaN propagation. The flat
   whole-array reduction kernel already emitted this via
   EmitVectorNaNPropagatingMinMax, so only the axis fast paths were affected.

   Fixes 12 stat.jsonl fuzz cases (ptp float32/float64, axis 0/1, C/F-contig).
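
The ordering rule behind fix 1 (test for infinity before NaN) is small enough to show directly. This Python sketch mirrors the described NpyComplexMath.Abs logic rather than reproducing the C# source; `npy_cabs` is a name chosen for the example:

```python
import math

def npy_cabs(z):
    """Magnitude with the infinity test ahead of the NaN test, as C99 hypot requires."""
    if math.isinf(z.real) or math.isinf(z.imag):
        return math.inf      # an infinite component wins even if the other is NaN
    return abs(z)            # finite and NaN-only inputs defer to the stock routine

nan_inf = npy_cabs(complex(math.nan, math.inf))
nan_only = npy_cabs(complex(math.nan, 1.0))
```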

Verification: full unit suite green on BOTH net8.0 and net10.0 (9709 passed
/ 0 failed under the CI filter), FuzzMatrix 42/42 on both. The originally
reported trunc "Could not find Truncate for Vector128" failures were already
resolved in-tree by the CanUseUnarySimd #if NET8_0 guard (commit 5716f86);
the leak-guard working-set tests pass locally (their CI failures were OS
working-set / GC-mode noise, not a managed or unmanaged leak).
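
For the min/max fix in item 2, a scalar model shows why the raw instruction loses the NaN and how the ConditionalSelect(ordered, max, a+b) form restores it. Both function names are hypothetical, and the scalar code only models the per-lane behavior described above:

```python
import math

def x86_style_max(a, b):
    # Models the vmaxps rule: an unordered (NaN) compare yields the second operand
    return a if a > b else b

def nan_propagating_max(a, b):
    # ordered is True only when neither input is NaN (NaN != NaN)
    ordered = (a == a) and (b == b)
    # a + b is NaN whenever either input is NaN, so it carries the NaN through
    return x86_style_max(a, b) if ordered else a + b

lost = x86_style_max(math.nan, 5.0)
kept = nan_propagating_max(math.nan, 5.0)
```

Here `lost` is 5.0, matching the Vector128.Max(NaN,5) == 5 observation on net8.0, while `kept` is NaN as NumPy expects.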
…NumSharp faster)

The dashboard prototype was the odd one out: I rendered it speedup = NumPy ÷ NumSharp
(>1× = faster), while the op-matrix report it is derived from — and merge-results.py —
use ratio = NumSharp ÷ NumPy (<1× = faster, lower is better). Two pages off the same data
with opposite conventions is exactly the faster/slower confusion to avoid.

Verified first that the underlying direction is NOT a flip: counting raw milliseconds
(numsharp_ms vs numpy_ms, no ratio involved), NumSharp took LESS time on 305 ops and MORE
time on 526 of 832 credible ops; geomean NS/NP = 1.36. So "NumSharp trails on the broad
matrix" is real (concentrated in Arithmetic = 231 slower ops, and Unary), and it matches the
op-matrix report's own conclusion. The dashboard's data was right; only its convention was
inverted relative to the house default.

render_dashboard.py now uses NS/NP throughout:
* ratio = numsharp_ms / numpy_ms; header + axis read "faster ◄ 1.0 (parity) ► slower".
* HEADLINE 1.36× geomean · 305 faster / 527 slower.
* by-suite / by-dtype ranked fastest→slowest (ascending ratio): statistics 0.44×,
  broadcasting 0.82×, reduction 0.83× now read as FASTER; creation 2.83× / unary 2.63× /
  bool 3.55× as slower.
* status bands relabelled to NS/NP (faster ≤1.0× / close 1–2× / slower 2–5× / much >5×).
* tables renamed FASTEST / SLOWEST; each row shows the NS/NP ratio plus a human factor
  ("0.079× (12.6× faster)", "880.9× (881× slower)") so the small-ratio-is-good direction is
  unambiguous.

benchmarks-dashboard.md re-seeded with the matching note; docfx build clean. This makes the
report + dashboard consistent. The narrative benchmarks.md, the npyiter iterator sheet, and
the README cards still use the speedup (NP/NS, >1× = faster) framing — flipping those is a
separate call (they are win-showcases where >1× reads naturally).
…m the changelog

Per review: the changelog should describe the final state, not the
development path. Removed the temporal 'Latest wave (Waves 1.3–6.1) —
added after the first changelog' umbrella section entirely and dissolved
its content into the proper topical sections, with all 'wave' terminology
and 'added after'/'previously absent'/'now reachable' path-language gone:

- np.evaluate folded into §2 (NpyExpr DSL): per-node result_type typing,
  fused reductions, out= rules, EXTERNAL_LOOP guard, measured speedups.
- out=/where=/dtype= ufunc kwargs folded into §5 as a parity subsection.
- WRITEMASKED/ARRAYMASK execution, VIRTUAL operands, and the size-1
  stride-0 / op_axes-OOB / write-broadcast / PARALLEL_SAFE / unit-axis
  fixes folded into §1 (capability matrix + bug list); masked-write
  corruption fix added to §10.
- buffer-pool window (1 B–64 MiB), pool-side GC pressure, finalizer
  suppression folded into §7; TL;DR memory bullet updated.
- canonical NpyIter benchmark, benchmark.yml CI, DocFX benchmark pages,
  and the honest frontier findings folded into §8/§15.
- 'NPYITER_GAPS_AND_ROADMAP … 6-wave plan' -> 'prioritized roadmap'.

Net: zero 'wave' occurrences; the 16-section topical structure is intact.
Same content (minus H1) pushed live to the PR #611 description.
… stat

Per updated direction: the ratio convention is NumPy ÷ NumSharp again (>1.0× = NumSharp
faster — bars grow right = faster, the original visual), AND every row now also carries
🕐 %NumPy = (NumSharp ÷ NumPy) × 100 = the share of NumPy's time NumSharp uses. So a win
reads two intuitive ways: "12.63× faster" and "🕐 8%" (takes only 8% of the time NumPy
would); parity is 🕐 100%; >100% is slower. Huge slowdowns compact to e.g. 🕐 881×NP.

render_dashboard.py:
* r["sp"] = numpy/numsharp (speedup), r["pct"] = numsharp/numpy*100 (share of NumPy time).
* headline + every bar/table show both: HEADLINE 0.74× geomean · 🕐 136%; by-suite e.g.
  statistics 2.28× 🕐 44%, reduction 1.21× 🕐 83%, creation 0.35× 🕐 283%; FASTEST nansum
  12.63× 🕐 8%; SLOWEST np.zeros 0.001× 🕐 881×NP.
* status-mix bands relabelled in %NumPy terms (faster ≤100% / close 100–200% / slower
  200–500% / much >500%), a legend line explains the 🕐 stat, pct_str() keeps the column
  narrow (NN% under 1000%, else NN×NP).
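
The two statistics above are reciprocal views of the same pair of timings. A small sketch of the arithmetic, assuming millisecond inputs; the helper names other than pct_str are hypothetical, and the exact formatting inside the real pct_str() may differ:

```python
def speedup(numsharp_ms, numpy_ms):
    return numpy_ms / numsharp_ms          # >1.0 means NumSharp is faster

def pct_numpy(numsharp_ms, numpy_ms):
    return numsharp_ms / numpy_ms * 100.0  # share of NumPy's time NumSharp uses

def pct_str(pct):
    # NN% under 1000%, otherwise compact to NN×NP
    return f"{pct:.0f}%" if pct < 1000 else f"{pct / 100:.0f}×NP"

# Illustrative timings chosen to reproduce the quoted ratios
win = pct_str(pct_numpy(1.0, 12.63))
loss = pct_str(pct_numpy(881.0, 1.0))
```

A 12.63× speedup corresponds to 100 / 12.63 ≈ 7.9, which rounds to the "8%" shown for nansum.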

benchmarks-dashboard.md re-seeded with the matching note (heredoc — printf mis-read %NumPy
as a format spec); docfx build clean, emoji verified present (U+1F550 ×54).

Supersedes the brief NS/NP experiment (c0a5346). The op-matrix report (merge-results.py)
still uses NS/NP "lower is better", and the npyiter sheet / cards use NP/NS without the
%NumPy stat — rolling the NP/NS + 🕐 %NumPy convention out to those is the next step,
pending confirmation.

Completes the rollout chosen after the dashboard fix: every benchmark surface now uses the
SAME convention — speedup = NumPy ÷ NumSharp (>1.0× = NumSharp faster) — and every surface
also carries 🕐 %NumPy = (NumSharp ÷ NumPy) × 100 = the share of NumPy's time NumSharp uses
(30% = takes only 30% of the time NumPy would; <100% = faster; huge slowdowns compact to
e.g. 880×NP). So a win reads two intuitive ways at once: "12.66× faster" and "🕐 8%".

Op-matrix report (merge-results.py) — FLIPPED from NS/NP to NP/NS (the one surface that was
"lower is better"):
* ratio = numpy_ms / numsharp_ms; new pct_numpy field on UnifiedResult (JSON + CSV).
* get_status bands inverted around >1 = faster (faster ≥1.0× / close 0.5–1.0× / slower
  0.2–0.5× / much <0.2×); classify() credibility gate flips to ratio > 20 (was < 1/20).
* Best/Worst now sort DESCENDING (fastest first); legend + tables + summary-by-size gain a
  🕐 %NumPy column; ratio_fmt keeps tiny slowdowns readable (0.001× not 0.00×).
* Regenerated from the on-disk run archive: Top Best nansum 12.66× 🕐 8%; Top Worst
  np.zeros 0.001× 🕐 880×NP; searchsorted stays negligible (now ratio>20). Counts
  unchanged (305/255/169/103/275/126) — same rows, just the direction relabelled.
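
A sketch of the flipped banding and the ratio formatter, under stated assumptions: the band edges are the ones listed above, but which side of each edge is inclusive is a guess, and the 0.01 cut-over in `ratio_fmt` is hypothetical. The real merge-results.py bodies are not shown in this log.

```python
def get_status(ratio: float) -> str:
    """Classify a NumPy / NumSharp ratio; higher is better."""
    if ratio >= 1.0:
        return "faster"
    if ratio >= 0.5:
        return "close"
    if ratio >= 0.2:
        return "slower"
    return "much"

def ratio_fmt(ratio: float) -> str:
    """Use a third decimal for tiny ratios so they do not print as 0.00x."""
    decimals = 3 if ratio < 0.01 else 2  # assumed threshold
    return f"{ratio:.{decimals}f}×"
```

With these choices `ratio_fmt(0.001137)` renders as `0.001×` and `ratio_fmt(12.66)` as `12.66×`, matching the shape of the Top Worst and Top Best cells quoted above.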

npyiter sheet (npyiter_sheet.py) + cards (npyiter_cards.py) — already NP/NS, ADDED 🕐 %NumPy:
* sheet: legend line + per-bar 🕐 %NumPy + headline "1.17× geomean · 🕐 85% of NumPy's time";
  re-rendered npyiter_results.md (--render-only, AV block intact).
* cards: each bar label now "1.80× · 56%" (ops) / "4.3× · 23%" (dividends); footer explains
  the %. No emoji in matplotlib (DejaVu lacks the glyph) — the % carries it. Re-rendered.

Narrative benchmarks.md + README — already NP/NS, added the 🕐 %NumPy line to the convention
block, a %NumPy column to the by-class table, and a caption sentence.

DocFX pages (benchmark-matrix.md, benchmark-iterator.md) re-seeded from the regenerated
report + sheet; benchmarks.md updated; docfx build clean (0 errors). The dashboard
(render_dashboard.py / benchmarks-dashboard.md) already carries this convention (49af3af),
so the whole benchmark stack — report, dashboard, iterator sheet, cards, narrative, README —
is now identical: NumPy ÷ NumSharp speedup + 🕐 %NumPy.

The clock sat before the figure with the right-align padding landing between them
("🕐  87%"). Moved it to immediately follow the percentage, no space — "87%🕐" — across
every surface, and likewise the metric name (🕐 %NumPy → %NumPy🕐). The alignment padding
now sits before the number (where it belongs) instead of after the emoji.

* render_dashboard.py / npyiter_sheet.py: bar values "{pct_str}🕐", headline "85%🕐 of
  NumPy's time", legend "%NumPy🕐 = …". Dashboard + sheet regenerated.
* merge-results.py: report legend, status-band table, summary-by-size "%NP🕐" column,
  Best/Worst note, and per-suite "%NumPy🕐" column headers. Report regenerated.
* benchmarks.md + README: convention line / table column / caption "%NumPy🕐".
* DocFX pages (matrix, iterator, dashboard) re-seeded; dashboard page note "%NumPy🕐".
  docfx build clean.

The matplotlib cards are unaffected (they show "1.80× · 56%" without the emoji — DejaVu
has no clock glyph — so there was never a gap to fix there).
… form

pct_str (dashboard/sheet) and pct_fmt (report) switched to a ×-multiplier form for huge
slowdowns (np.zeros etc.), so the %NumPy stat showed "880×NP🕐" / "880×" — breaking the
NN%🕐 depiction the column promises. Now they always render a percentage: np.zeros reads
"87957%" (report) / "88087%🕐" (dashboard) = takes ~880× as long, stated as a share of
NumPy's time like every other cell.
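
A minimal sketch of an always-a-percentage formatter with the clock trailing the figure. The bodies and the `pct_cell` helper are hypothetical (only the names pct_str / pct_fmt appear in this log), and the column width is an arbitrary choice for illustration.

```python
CLOCK = "\U0001F550"  # the clock emoji, U+1F550

def pct_fmt(pct: float) -> str:
    """Always a whole-number percentage, however large the slowdown."""
    return f"{pct:.0f}%"

def pct_cell(pct: float, width: int = 7) -> str:
    """Right-align the figure, then append the clock with no gap."""
    return pct_fmt(pct).rjust(width) + CLOCK
```

Because the padding is applied before the emoji is appended, the alignment spaces land in front of the number rather than between the number and the clock.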

The ratio column is untouched — it legitimately uses × (0.001×, 12.65×); only the %NumPy
formatters changed. Report + sheet + dashboard regenerated, the three DocFX pages re-seeded,
docfx build clean.
…g from the report

The dashboard and benchmark-report.md disagreed on the SAME cell: np.nansum(f64,100K)
read 12.63× on the dashboard vs 12.65× in the report, np.zeros(i64,10M) read 88087% vs
87957%, quantile/percentile likewise — 161 rows printed a different ratio at 2dp between
the two committed surfaces.

Root cause: merge-results.py computes ratio = NumPy/NumSharp and pct_numpy from the
FULL-PRECISION means, then stores numpy_ms/numsharp_ms rounded to 4dp. render_dashboard.py
ignored the stored ratio/pct_numpy fields and RE-DIVIDED the rounded ms (r["numpy_ms"] /
r["numsharp_ms"]), so every row where the 4dp rounding moved a digit drifted from the
report. The report is correct (true ratio of the measured means); the dashboard was a
rounding artifact of its own recompute.

Fix: the credible loop now consumes r["ratio"] / r["pct_numpy"] straight from the JSON
(the same numbers benchmark-report.md prints), falling back to 100/ratio only if pct is
absent. Dashboard and report now agree cell-for-cell, and the per-suite/per-dtype geomeans
key off the same stored ratios the report's Summary-by-size uses.
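
The drift can be reproduced with a small sketch. The timings are made up, `stored_row` and `canonical` are hypothetical helper names, and storing the ratio at full precision is an assumption; only the field names numpy_ms, numsharp_ms, ratio, and pct_numpy come from the description above.

```python
def stored_row(numpy_mean_ms: float, numsharp_mean_ms: float) -> dict:
    """Mimic the merge step: ratio from full-precision means, ms rounded to 4dp."""
    ratio = numpy_mean_ms / numsharp_mean_ms
    return {
        "numpy_ms": round(numpy_mean_ms, 4),
        "numsharp_ms": round(numsharp_mean_ms, 4),
        "ratio": ratio,
        "pct_numpy": 100.0 / ratio,
    }

def canonical(row: dict):
    """Read the stored figures; derive pct from ratio only if it is absent."""
    ratio = row["ratio"]
    pct = row.get("pct_numpy")
    if pct is None:
        pct = 100.0 / ratio
    return ratio, pct

row = stored_row(0.127, 0.01004)                    # made-up means
recomputed = row["numpy_ms"] / row["numsharp_ms"]   # the old dashboard path
ratio, pct = canonical(row)                         # the fixed path
```

Here the 4dp rounding turns 0.01004 into 0.01, so the recomputed ratio lands near 12.70 while the stored ratio rounds to 12.65: the same kind of disagreement at two decimal places that the two surfaces showed.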

Regenerated benchmark-dashboard.md (gitignored) and re-seeded the DocFX dashboard page;
header preserved, body refreshed. Verified: nansum 12.65×/8%, zeros 0.001×/87957%,
quantile 9.89×/10% identical on both surfaces; size tiers match Summary-by-size exactly.
…not run" cells

normalize_op_name dropped measured C# data on the floor whenever the C# benchmark label
and the NumPy suite name differed only cosmetically, so the report showed ⚪ "C# benchmark
not run" for ops that WERE run. Three archive-safe alias passes (applied identically to
both sides, so they only ever merge a true pair):

  * empty "()"  — a no-arg C# method call "a.flatten()" now meets NumPy's "a.flatten"
  * "->" spacing — C# "reshape 2D -> 1D" now meets NumPy's "reshape 2D->1D"
  * np.around    — IS np.round (NumPy alias); C# benchmarks rounding as np.around, NumPy
                   emits np.round, so the whole np.round family was ⚪ despite real data
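
The three alias passes could look roughly like this. It is a sketch only: `normalize_alias` is a hypothetical stand-in, and the real normalize_op_name presumably does more than these three rewrites.

```python
import re

def normalize_alias(name: str) -> str:
    """Apply the three cosmetic alias passes to a benchmark label."""
    name = name.replace("()", "")                       # "a.flatten()" -> "a.flatten"
    name = re.sub(r"\s*->\s*", "->", name)              # "2D -> 1D"    -> "2D->1D"
    name = re.sub(r"\bnp\.around\b", "np.round", name)  # NumPy alias
    return name
```

Running the same function over both the C# labels and the NumPy suite names is what keeps the passes safe: two keys can only meet if they were the same operation spelled differently.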

Effect (re-merged from the same archive — no re-run): ⚪ no-data 126 → 116; the np.round
family gains 6 real rows (float32/float64 × 3 sizes), a.flatten +2 (100K/10M), reshape
2D->1D +2. Verified against the archive before editing: +10 joined cells, 0 regressions
(no previously-matched cell lost), 0 new key collisions.

Regenerated benchmark-report.{md,json,csv} + the dashboard (now 840 credible cells,
0.73× geomean) and re-seeded the matrix + dashboard DocFX pages (headers preserved
byte-for-byte). The dashboard stays cell-consistent with the report via the canonical
ratio/pct fix from the prior commit.

NOT fixed here (genuine gaps needing a benchmark re-run, not a name alias): np.prod has
no NumPy full-reduction row at all; isnan/isinf/isfinite/isclose/allclose/array_equal/
maximum/minimum have no C# benchmark; amax/amin/mean/std/var axis variants and np.mean
on uint*/int16 lack a counterpart on one side.
…lex (NumPy parity)

These six complex ufuncs previously threw NotSupportedException from the
EmitUnaryComplexOperation default arm, even though NumPy 2.x has complex
loops for all of them (csinh/ccosh/ctanh/casin/cacos/catan). This wires
them up with full NumPy 2.4.2 parity.

Approach (hybrid BCL + C99 fixups, mirroring the existing abs/log2/exp2
pattern): a bit-exact probe over a finite battery showed System.Numerics.
Complex matches NumPy to a few ULP on the finite interior, but diverges at
86/360 edge components -- it returns (NaN,NaN) for nearly all inf/NaN inputs
instead of the C99 Annex G values, drops the sign of zero on branch cuts,
and mishandles arctan's imaginary-axis cut. So:

- NpyComplexMath.{Sinh,Cosh,Tanh,Asin,Acos,Atan} delegate the finite
  interior to the BCL and add the C99 fixups:
  * Non-finite inputs: special-value tables ported from NumPy's msun
    npy_csinh/ccosh/ctanh, with asin/atan reusing NumPy's own identities
    asin(z)=i*conj(casinh(i*conj z)) and atan(z)=i*conj(catanh(i*conj z)).
  * Branch-cut/signed-zero fixups (empirically derived against NumPy and
    verified on a 64-point signed-zero grid): asin negates Re on x=-0 and
    Im on y=-0; acos negates Im on the y=+0 cut; atan sets
    Re=copysign(|y|>1?pi/2:0, x) on the imaginary axis and negates Im on y=-0.
  * Where this NumPy build's system libm diverges from msun at infinities
    (sign-preserving sinh(-inf+i*inf).re, cosh's even-function +inf*sin(y)
    imaginary part, tanh's sign(y) zero, and the genuinely-unspecified
    zero signs), the helpers match the observed NumPy 2.4.2 output.
- DirectILKernelGenerator: register CachedMethods.Complex{Sinh,Cosh,Tanh,
  Asin,Acos,Atan} (pointing at NpyComplexMath, not Complex.* directly) and
  add the six cases to EmitUnaryComplexOperation.
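
The two identities reused for asin and atan can be checked numerically. The sketch below uses Python's cmath as a stand-in, not the C# NpyComplexMath helpers, and only probes finite points away from the branch cuts; the function names are hypothetical.

```python
import cmath

def asin_via_asinh(z: complex) -> complex:
    """asin(z) = i * conj(asinh(i * conj(z)))"""
    return 1j * cmath.asinh(1j * z.conjugate()).conjugate()

def atan_via_atanh(z: complex) -> complex:
    """atan(z) = i * conj(atanh(i * conj(z)))"""
    return 1j * cmath.atanh(1j * z.conjugate()).conjugate()

# Finite sample points chosen off the branch cuts.
POINTS = [0.5 + 0.25j, -2.0 + 3.0j, 0.1 - 0.7j]
```

Both follow from asinh(iw) = i·asin(w) and atanh(iw) = i·atan(w) together with the fact that these functions commute with conjugation.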

Verification: a bit-exact harness over a 117-point battery (finite + signed
zeros + branch cuts + inf/NaN) plus a 64-point grid, diffed against NumPy
2.4.2, gives 1402/1404 components matching (1249 bit-exact, 153 within
3 ULP). The only 2 residuals are in arctan's finite interior (a tiny 1e-10
input at ~8e-8 relative error; 100+100j at 3 ULP) -- .NET's Atan kernel is
less accurate than NumPy's log1p-based one; an accepted, documented divergence.
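
For readers unfamiliar with ULP gating, one common way to measure the distance between two float64 values is to map their bit patterns onto a monotonic integer line. This is a generic sketch, not the harness used here; `ulp_distance` is a hypothetical name and NaN inputs are deliberately not handled.

```python
import struct

def _ordered(x: float) -> int:
    """Map a float64 to an integer that increases with the float's value."""
    bits = struct.unpack("<q", struct.pack("<d", x))[0]
    # Negative floats have the sign bit set; mirror them below zero.
    return bits if bits >= 0 else -(bits & 0x7FFFFFFFFFFFFFFF)

def ulp_distance(a: float, b: float) -> int:
    """Number of representable float64 steps between a and b (no NaN support)."""
    return abs(_ordered(a) - _ordered(b))
```

Note that this mapping puts +0.0 and -0.0 at distance 0, so a bit-exact comparison of signed zeros needs a separate sign check alongside a gate such as `ulp_distance(got, want) <= 4`.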

Tests:
- NewDtypesUnaryTests: 9 NumPy-verified cases covering interior, branch
  cuts, signed zeros, and C99 special values.
- Fuzz/MisalignedRegistry: the stale "complex kernel throws" excuse is
  corrected to Half-only; complex sinh/cosh/tanh/arcsin/arccos are now held
  to a tight 4-ULP gate (a real regression fails) instead of the blanket
  complex-unary excuse; arctan stays under the documented blanket for its
  accepted BCL-interior divergence.

All 609 Fuzz + NewDtypes tests pass (net10.0); the 26x5 complex corpus
cases for the five tightly-gated ops are all within 4 ULP.