Optimizations ledger (SSOT)

Purpose. A discoverable index of every place where cljw's code is shaped for speed rather than for the simplest correct form. The user's directive (2026-05-31): "When optimizing in the future, make it easy to tell that 'this is an optimization'; ideally there is an SSOT-like place for it." This is that SSOT. Optimizations come in many kinds and not all fit one registry cleanly, so this is a best-effort index, paired with the grep-discoverable in-code // PERF: marker (see .claude/rules/perf_marker.md).

An entry answers: what is the naive correct form, what is the optimized form, why is it faster, and what verifies they agree? The naive form is the behavioural contract; the optimization must be observably equivalent (F-011) — only the internal mechanics change.
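The contract can be sketched as a differential check. The snippet below is a Python illustration only, not cljw's Zig or Clojure code; the function names (`reductions_naive`, `reductions_fast`, `agree`) are hypothetical and the running-totals function is just a stand-in for any entry's naive/optimized pair.

```python
from itertools import count, islice
from operator import add


def reductions_naive(f, init, coll):
    """Naive form (the contract): simplest correct code, quadratic cost."""
    acc = [init]
    for x in coll:
        # Copying the growing list on every step makes this O(n^2) overall.
        acc = acc + [f(acc[-1], x)]
    return acc


def reductions_fast(f, init, coll):
    """Optimized form: thread the running value and yield lazily, O(n)."""
    running = init
    yield running
    for x in coll:
        running = f(running, x)
        yield running


def agree(f, init, coll):
    """Differential check: both forms must produce the same values."""
    items = list(coll)
    return reductions_naive(f, init, items) == list(reductions_fast(f, init, items))


if __name__ == "__main__":
    print(agree(add, 0, range(1000)))
    # The lazy form also accepts an unbounded source; the naive one cannot.
    print(list(islice(reductions_fast(add, 0, count()), 3)))
```

Only the internal mechanics differ between the two functions; a check like `agree` plays the role that the "Verified by" column plays for each ledger row.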

⚠️ Measurement mode (2026-06-01 correction). Many absolute numbers in the O-001…O-004 rows below were measured on a Debug binary (zig build defaults to Debug; a Debug tree-walk interpreter is ~10-100× slower than the shipped build). They are NOT representative of shipped speed — e.g. (count (vec (range 1e6))) reads ~121s in Debug but ~0.02s in ReleaseFast, and startup is ~0.48s Debug vs ~ms Release (cljw already meets the ms-level cold-start mission target; cw v0 claims ~4ms). The algorithmic wins are real (O(n) beats O(n log n); chunked beats per-element in any mode), but the urgency was Debug-inflated. Future O-NNN numbers MUST be Release — measure only via scripts/perf.sh (see .claude/rules/perf_measure_release.md).

How to read / maintain

  • Every optimization that trades simplicity for speed gets (a) a // PERF: <what> [refs: O-NNN, …] marker at the code site and (b) a row here. The O-NNN id is this ledger's; cross-ref the driving D-NNN debt row when one exists (perf debt lives in .dev/debt.yaml; this ledger is the implemented optimizations).
  • A "fast path" that can be removed and replaced by the naive form with no behaviour change is the cleanest kind — note the naive fallback explicitly so a future reader can verify by deletion.
  • When an optimization is reverted / superseded, mark the row RETIRED <date> rather than deleting it (history).
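The marker/row pairing described above lends itself to a mechanical cross-check. The following is a minimal Python sketch under stated assumptions: it takes the marker shape `// PERF: <what> [refs: O-NNN, …]` from the first bullet and assumes a ledger row begins with its `O-NNN` id (optionally behind a table pipe). The helper names are hypothetical and this is not a script that ships with cljw.

```python
import re

# Assumed marker shape, from the bullet list: // PERF: <what> [refs: O-NNN, ...]
MARKER = re.compile(r"//\s*PERF:.*?\[refs:\s*([^\]]*)\]")
LEDGER_ID = re.compile(r"\bO-\d{3}\b")


def marker_ids(source_text):
    """Collect every O-NNN id referenced by a // PERF: marker."""
    ids = set()
    for match in MARKER.finditer(source_text):
        ids.update(LEDGER_ID.findall(match.group(1)))
    return ids


def ledger_ids(ledger_text):
    """Collect the ids that open a ledger row (RETIRED rows count too)."""
    return set(re.findall(r"^\|?\s*(O-\d{3})\b", ledger_text, flags=re.MULTILINE))


def unledgered(source_text, ledger_text):
    """Ids that appear in code markers but have no ledger row."""
    return sorted(marker_ids(source_text) - ledger_ids(ledger_text))


if __name__ == "__main__":
    src = (
        "build(); // PERF: bulk trie build [refs: O-003, D-180]\n"
        "walk();  // PERF: undocumented fast path [refs: O-999]\n"
    )
    ledger = "| O-003 | vector.zig::fromSlice | ...\n"
    print(unledgered(src, ledger))
```

Because retired rows are kept rather than deleted, a marker that still points at a RETIRED id passes this check; flagging those would need an extra rule.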

Entries

| ID | Site | Naive form (the contract) | Optimized form | Why faster | Verified by | Refs |
|---|---|---|---|---|---|---|
| O-001 | runtime/collection/range.zig + call sites | (range a b s) as a lazy cons-seq (one cons + lazy_seq per element) | Compact .range value {start,end,step,count}: O(1) count/nth, tight-loop reduce, chunked-cons seq | No per-element alloc on count/nth/reduce; 1 alloc/32 on walk | phase14_range_indexed.sh + diff oracle vs clj | D-163 / D-168 |
| O-002 | higher_order.zig::reduceFn (.vector arm) | reduce over a vector via seqFn → vectorToList (N-element eager cons list), then walk via first/next | Index-walk: vector.nth(coll, i) in a tight i loop, honouring reduced | No N-element intermediate cons list; (reduce f bigvec) / (into to bigvec) went O(n) alloc → O(1). Measured (reduce + (vec (range 1e6))) 182s → fast | `phase14_*` reduce e2e + diff oracle vs clj | D-163 |
| O-003 | vector.zig::fromSlice + transient/transient_vector.zig::toPersistent + core.clj into/vec | persistent! rebuilds the persistent vector via N persistent conjs (O(n log n)); into/vec = (reduce conj …), also N persistent conjs | Bulk fromSlice builds the HAMT trie bottom-up from the transient's flat buffer in O(n) (32-element leaves → interiors grouped 32-at-a-time → root; last ≤32 = tail); into/vec route editable targets (vector/hash-map/hash-set, NOT sorted/nil/list) through transient/conj!/persistent! | persistent! O(n log n) → O(n); into/vec build O(n) over a flat buffer + one O(n) trie conversion, vs N persistent conjs. Measured (count (vec (range 1e6))) 121s → 2.4s; (reduce + (vec (range 1e6))) 123s → 2.5s | vector.zig boundary unit test (n ∈ {0,1,31,32,33,63,64,65,1023,1024,1025,1e5}: fromSlice == conj-built, same shift/tail/root) + diff oracle vs clj (into/vec over vector/map/set/sorted/nil/list + meta) | D-180 |
| O-004 | core.clj map/filter/keep 2-arg + higher_order.zig::reduceFn chunked arm + sequence.zig::countFn chunk drain + chunked_cons.zig chunk-builder | (map f coll) / (filter pred coll) build a meta-less lazy-seq walked one element per nextFn; each step allocs a cons + lazy_seq node and tree-walks the .clj body (~408µs/element measured) | Chunk-preserving: when the source is chunked (range seq / chunked map), transform a whole 32-chunk per thunk into a fresh chunk (JVM chunk-cons shape); reduce/count drain a chunk per step. Fill loop stays in .clj (a tree-walk loop is ~2µs/iter, negligible vs the 408µs amortised) | The ~408µs/element lazy-seq machinery is paid once per 32 elements, not per element. Measured (count (map inc (range 1e5))) 41.3s → 2.8s (~15x); (reduce + (map inc (range 1e5))) → 2.4s. (Residual is the per-element f vtable call ≈ 2µs, which is D-133's, not this cycle's.) | phase14_chunked_seq.sh (chunk-boundary count 1/32/33/65/1000 + reduce/nth/last/=/lazy-take) + diff oracle vs clj | D-163 / ADR-0065 |
| O-005 RETIRED 2026-06-11 | eval/analyzer/{analyzer,bindings}.zig + eval/backend/tree_walk.zig::callMethodImpl | Every TreeWalk fn call nil-inits the full [MAX_LOCALS=256]Value (~2 KB) call frame | (reverted) leave [frame_size..256) undefined, nil-init only the analyser frame high-water | RETIRED: introduced a GC UAF. callMethodImpl's locals is ALSO the VM's frame: on the default VM backend it is handed to vm.eval, which publishes the WHOLE slice as a GC root (gc_frame.locals). The undefined tail was traced as Values under a CLJW_GC_TORTURE collect → SIGSEGV (deterministic repro: a deep call fills high stack slots, then a shallow fn runs under torture). The bound-the-slice fix then surfaced that frame_high_water under-counts the VM's slot needs for a fn containing a nested `fn*` (slot_out_of_range). 12% on one microbench was not worth the correctness risk + the slot-accounting subtlety. The full-256 nil-init is restored. | repro deep+shallow under CLJW_GC_TORTURE=1 (139→0 after revert) + full gate e2e_phase16_gc_torture | D-163 |
| O-007 | lang/primitive/higher_order.zig::sortNaturalFn (-sort-natural leaf) + core.clj sort | (sort coll) ran the .clj -msort merge sort: per recursion level (vec (take mid v)) + (vec (drop mid v)), and -merge-sorted does (first a)/(rest a)/(conj acc …)/(empty? …) + a compare call through the eval machinery per element | Default order copies the vector into a flat []Value buffer, runs std.mem.sort (stable block sort) calling valueCompare directly (no eval reentry → no GC safepoint mid-sort → no frame rooting needed), and rebuilds via vector.fromSlice (O-003). Custom-comparator (sort comp coll) / sort-by stay on .clj -msort (a user comparator re-enters eval) | Eliminates the O(n log n) .clj take/drop/vec/rest/conj churn + per-comparison eval reentry. Measured 36_sort bench (5×(reduce + (take 100 (sort (vec (range 5000 0 -1)))))) 0.39s → ~0.00s (startup-only) ReleaseFast | test/diff/clj_corpus/sort.txt (17 cases vs clj: empty/single/dups/mixed int·float ties/strings/keywords/nested vectors/custom comp/sort-by stability) + zig build test + zig build lint | D-163 |
| O-008 | build.zig (exe_mod.strip), binary-size axis | b.installArtifact(exe) shipped an UNSTRIPPED binary (the ~5.7K-symbol table rode in every release cljw); there was no packaging step to strip it, so the bench/RELEASE_METRICS.md "stripped" headline was aspirational, not the actual install | .strip = optimize != .Debug: every non-Debug build (ReleaseSafe / ReleaseFast / ReleaseSmall) strips at link time; Debug stays unstripped for lldb. cljw renders error traces from its own runtime StackFrame stack, not native symbols, so stripping costs no user diagnostics | Installed ReleaseSafe cljw 3.93 MB → 3.39 MB (~400 KB / ~10% off the shipped artifact); ReleaseSmall also stripped (a further CLI strip floors it at 1.63 MB; Zig link-strip is less aggressive than CLI strip only for the ReleaseSmall layout) | bash bench/release_metrics.sh (3.24 MB on-disk ReleaseSafe) + smoke (stripped ReleaseSafe e2e: corpus 2289/2289, build_cljw pass) + cljw -e '(sort [3 1 2])' on the stripped binary | |
| O-009 | core.clj reductions | (reductions f init coll) = (seq (reduce (fn [acc x] (conj acc (f (last acc) x))) [init] coll)); (last acc) is O(n) on the growing result vector → O(n²) overall; also fully eager (could not handle an infinite coll) and consed the reduced wrapper instead of stopping | JVM's own lazy + accumulator-threaded shape: carry the running value as init through the recursion (no (last acc)), wrap each step in lazy-seq, and stop on a reduced init. O(n), lazy, and reduced-correct | O(n²) → O(n). Measured (count (reductions + (range 100000))) 103.68 s → 0.04 s (~2500×) ReleaseFast. Also fixes two latent bugs the old eager form had: infinite-seq support ((take 5 (reductions + (range)))) and reduced early-termination | test/diff/clj_corpus/reductions.txt (14 cases vs clj: init/empty/`*`/conj/infinite-range/reduced early-stop/str/etc.; all OK) + zig build test + zig build lint | D-163 |
| O-010 | lang/primitive/higher_order.zig::sortByKeysFn (-sort-by-keys leaf) + core.clj sort-by | (sort-by f coll) ran the .clj -msort with a (fn [a b] (compare (f a) (f b))) comparator, which re-enters eval AND re-applies f on every comparison (O(n log n) f calls), plus the merge-sort take/drop/vec/rest/conj churn | 2-arg default order precomputes keys = (mapv f coll) (one f per element) and runs the native -sort-by-keys: stable-sorts an index permutation by valueCompare on the keys (no eval reentry → no GC rooting), then gathers via vector.fromSlice. 3-arg custom-comparator (sort-by f comp coll) stays on .clj -msort | Measured (last (sort-by - (range 20000))) 0.79s → 0.01s (~79×) ReleaseFast. Fewer f calls than JVM (n vs n log n); observably identical for a pure key fn (F-011 contract) | test/diff/clj_corpus/sort_by.txt (14 cases vs clj: key fns -/count/:age/val/str/stability with dup keys/3-arg custom comp/empty; all OK) + zig build test + zig build lint | D-163 |
| O-011 | core.clj map-indexed / keep-indexed (2-arg) | (map-indexed f coll) = (mapv #(f % (nth coll %)) (range (count coll))) (and keep-indexed the reduce analogue); (nth coll i) is O(i) on a non-indexed coll (lazy seq / list), so the indexed walk is O(n²); only a .range/vector source (O(1) nth) stayed fast | Route the 2-arg form through the existing 1-arg stateful transducer: (-seq-or-empty (into [] (map-indexed f) coll)); into+xform walks the source SEQUENTIALLY (O(n), transient conj path) with a volatile running index, no nth | O(n²) → O(n). Measured (count (map-indexed vector (map inc (range 20000)))) 4.64 s → 0.02 s (~230×); keep-indexed 4.61 s → 0.01 s. (A .range/vector source was already fast; the win is for the common lazy-seq / list source) | test/diff/clj_corpus/map_keep_indexed.txt (12 cases vs clj: vector/string/lazy-map/filter/empty sources + keep-indexed drop-nil; all OK) + zig build test + zig build lint | D-163 |
| O-012 | clojure/string.clj join (2-arity) | (join sep coll) = (str (reduce (fn [acc x] (str acc sep x)) nil coll)); each step copies the GROWING accumulator string → O(n²) in total length | JVM idiom (apply str (interpose sep coll)); interpose walks once (lazy), apply str builds the result in one native variadic-str pass (O(n)) | O(n²) → O(n). Measured (count (join "," (map str (range N)))): N=50k 0.99 s → 0.03 s, N=100k 3.16 s → 0.07 s (~45×; the gap widens with size) ReleaseFast | test/diff/clj_corpus/string_join.txt (12 cases vs clj: int/str/kw elems, char sep, empty/single, list + lazy-seq sources, nil elem → "") + zig build test + zig build lint | D-163 |
| O-015 | eval/analyzer/{analyzer,bindings}.zig + eval/backend/tree_walk.zig: exact-count frame rooting (ADR-0130 am1; O-005 redo) | callMethodImpl inits + GC-roots the full [256]Value locals on EVERY call (a ~2 KB nil-init per call) | The analyzer threads a per-fn-method frame_max (each declare bumps it; a nested `fn*` gets its OWN counter so its slots don't pollute the outer's, the O-005 mistake) → exact FnMethod.frame_slots → FunctionMethod. callMethodImpl inits locals[0..fs] and passes that BOUNDED slice to both backends, so the GC roots only the used slots (the prior O-005 left rooting at the full 256 → traced the undefined tail → UAF). Sentinel 0 → full MAX_LOCALS fallback | Cuts the per-call ~2 KB nil-init. ReleaseSafe quick-bench: fib_recursive 61→58 ms (~5%, consistent ×2); tak ~flat. Modest: the memset was a smaller share than the survey estimated; the larger fib lever is the call-dispatch structure (the 6-hop recursion / flat-frame, the survey's 2nd pick). Also the prerequisite (exact frame sizes) for that next step | gc_torture (nested_deep + walk_hashmap, real ReleaseSafe binary) + zig build test (1090, diff oracle) + ReleaseSafe build (cache_gen) | ADR-0130 |
| O-014 | eval/backend/{vm,compiler,intrinsic}.zig: (+ a b) op_add intrinsic (ADR-0130) | (+ a b) compiles to a generic op_call: load the + Var, resolve it at runtime, dispatch the BuiltinFn, slice the args, all per operation | The compiler emits op_add when the callee resolves (by Var pointer identity) to canonical clojure.core/+ with 2 args; dispatch does the fixnum add inline via promote.addPromoting (the SAME tower the builtin uses), else defers to the cached + builtin. A let-shadowed + is a .local_ref (never op_add); alter-var-root on + deopts via core_arith_pristine | Skips var-resolution + dispatch-frame + arg-slice for the hot 2-fixnum case. ReleaseSafe quick-bench, cumulative across the family (op_add then am1 sub/mul/`<`/`>`/`<=`/`=`): arith_loop 170→107 ms (37%) (op_add for 2 +/iter + op_eq for the (= i n) condition); fib_recursive 71→61 ms (14%) (all of fib's arith intrinsified: +, two -, and `<`); tak ~flat (call-bound). The call-bound residue (fib/tak) needs the call-path opt (v0 24A.5 monomorphic IC), not more arith opcodes | diff_test op_add inline cases + alter-var-root deopt→999 + phase4_cli e2e (i48-boundary heap-Long / shadowed-+ "ab" / deopt) + zig build test (1088) | ADR-0130 |
| O-013 RETIRED 2026-06-11 | core.clj concat | (concat & colls) = (reduce -concat2 nil colls) (LEFT fold); (apply concat N-colls) re-yields early colls through all outer wrappers → O(n × #colls) | (reverted) right-nest via -concat-seqs | RETIRED: broke interleave with a stack overflow. The right-nested form places a recursive 2-arg concat's tail arg behind an extra -concat2/lazy-seq layer; interleave ((concat (map first ss) (apply interleave (map rest ss)))) self-recurses, so it accumulated one native force-frame per level → SIGSEGV (139) at ~50k (e2e_phase14_seq_helpers2 interleave_large). The LEFT fold keeps the tail arg in -concat2's 2nd (seq-y) position so deep 2-arg recursion stays flat; restored. The apply concat N-colls O(n×N) is the accepted tradeoff (rare; mapcat uses the lazy -concat-seqs directly). | repro (count (interleave (range 50000) (range 50000))) 139→0 after revert + full gate e2e_phase14_seq_helpers2 | D-163 |
| O-016 | eval/backend/vm.zig: per-thread operand arena (VmArena; ADR-0131 2a) | Each vm.eval declares its operand stack + parallel loc stack as fresh host-C-stack [256]Value/[256]SourceLocation arrays per call (cold memory + host-frame setup every call) | A threadlocal-static VmArena (inline arrays = demand-paged BSS, nothing to alloc/free) holds both stacks; each eval borrows a region at the global watermark op_top (restored on return; nested reentrant evals stack above). stepOnce takes slices; the EvalFrame roots stack[0..op_top]. The pre-step op_top (via stepOnce's deferred write-back) keeps a callee's args rooted during vt.callFn AND positions the nested borrow above them, so NO publish-before-nest is needed | The reused arena stays warm in cache vs a cold fresh [256] host region per eval. fib_recursive 56→41 ms, tak 18→15 ms (10-run ReleaseSafe); an unexpected win for what was planned as behaviour-identical 2a infra; the deeper lever (removing the host eval re-entry itself) is ADR-0131 2b | gc_torture (frame_local_alloc heap-in-local-across-non-tail-call + nested_deep, ReleaseSafe) + zig build test ×2 (diff oracle, no leaks) + catch e2e (11+5+13 PASS) + phase14_error_format smoke | ADR-0131 |
| O-017 | eval/backend/vm.zig: inline fn stepOnce (D-386 step 1) | stepOnce is a plain fn called per instruction from the eval loop with an 11-pointer signature; ReleaseSafe did NOT inline it, so every op paid a real call boundary (v0's step fn is 2-arg) | Mark stepOnce inline fn so the per-op dispatch folds into the eval loop: no call boundary, sp/ip/handler_count stay in the loop's registers | Removes the per-instruction 11-arg call. fib_recursive 40→33 ms, tak 15→13 (10-run ReleaseSafe). The fib/tak tax is per-op dispatch (ADR-0131 2b confirmed it is NOT the call structure); this is the first D-386 dispatch lever. Naive form = plain fn; identical Values | zig build test ×2 (diff oracle; pure inlining hint, no behaviour change) + 10-run ReleaseSafe bench | D-386 |
| O-018 | eval/backend/vm/{opcode,compiler}.zig + intrinsic.zig + vm.zig: `op_*_local_const` superinstructions (D-386 step 2) | (`<op>` local-ref const-literal) compiles to 3 dispatches: op_load_local + op_const + `op_<op>` (e.g. fib (- n 1) / (< n 2)) | The compiler fuses the triple into ONE op_{add,sub,mul,lt,le,gt,ge,eq}_local_const (operand = `local_slot<<8 \| const_idx`, both <256); the VM arm loads locals[slot] + constants[idx] and runs the SAME fixnum-fast / builtin-deopt as the op_add family; net stack effect +1 (pure push) | Cuts 2 dispatches per fused triple; the post-O-017 profile is flat so reducing op COUNT is the only lever (v0 37.2). fib_recursive 33→26 ms (≈ Python 24), arith_loop 107→96 (10-run ReleaseSafe). tak unchanged (it is local-LOCAL (< y x) and needs the `*_locals` variant next) | zig build test ×2 (diff oracle: fused op = TreeWalk Value) + spot-check (sub/lt fused + bigint F-005 deopt) + 10-run bench | D-386 |
| O-019 | eval/backend/vm/{opcode,compiler}.zig + intrinsic.zig + vm.zig: `op_*_locals` superinstructions (D-386 step 3) | (`<op>` local-ref local-ref) = op_load_local + op_load_local + `op_<op>` (arith_loop (< i n) / (+ acc i), tak (< y x)) | The compiler fuses the triple into ONE `op_*_locals` (operand = `slot_a<<8 \| slot_b`, both <256); the VM arm loads locals[a] + locals[b], same fixnum-fast / builtin-deopt as op_add; net stack +1. Sibling of O-018 (local-CONST); together they cover the two hot binop operand shapes | arith_loop 94→76 ms (the (< i n) + (+ acc i) loop body), tak/fib ~flat (no local-local). 10-run ReleaseSafe | zig build test ×2 (diff oracle) + spot-check [(- a b) (< b a) (+ a b) (= a b)] → [2 true 8 false] | D-386 |
| O-020 | lang/clj/clojure/core.clj: update-in 3-arg fast arity | update-in is variadic `(fn* [m ks f & args] …)`: EVERY call rest-packs & args (even when empty) + the recursive descent uses (apply update-in (into [...] args)) (vector build + apply spread per level) | Add a 3-arg arity ([m ks f] …) that recurses DIRECTLY ((update-in (get m k) nks f)) + calls f directly ((f (get m k))): no rest-pack, no apply, no into. The variadic & args arity is kept for the rare extra-args call | The hot path (update-in m ks inc) (nested_update's 10000-loop) no longer pays variadic packing + apply + into per level. nested_update 58→24 ms (Python 20; was 2.8×, now 1.2×). 10-run ReleaseSafe | zig build test ×2 (diff oracle, rebuild + .clj load) + spot-check (update-in {:a {:b 1}} [:a :b] inc) → {:a {:b 2}} + … + 10 20 → {:a {:b 31}} | D-386 |
| O-021 | eval/backend/vm/{opcode,compiler}.zig + intrinsic.zig + vm.zig: `op_branch_*` compare-and-branch superinstructions (D-386 step 4) | (if (`<cmp>` local const/local) …) = a fused cmp op (O-018/019) + op_jump_if_false = 2 dispatches (fib (< n 2), arith_loop (= i n)) | compileIf fuses the cmp+branch into ONE op_branch_{ne,ge,gt}_{local_const,locals} (the NEGATED cmp for jump_if_false: eq→ne, lt→ge, le→gt). v0's 2-word trick fits cljw's fixed {opcode,u16} with NO format change: the op's operand = the slot/const pair, the IMMEDIATELY-FOLLOWING instruction is a DATA WORD (the backpatched op_jump_if_false, never dispatched) carrying the i16 offset. Same fixnum-fast / builtin-deopt; net stack 0. Fusion in the COMPILER (peephole stays removal-only per its contract) | Collapses the test+branch from 2 dispatches to 1. fib 26→24 ms (= Python), arith_loop 76→64. 10-run ReleaseSafe | zig build test ×2 (diff oracle: fused branch = TreeWalk control-flow) + sanity (fib20=6765, (< 5 2)→else, (= 3 3)→then, no-else (<= 1 1)) | D-386 |
| O-022 | eval/backend/vm/{opcode,compiler}.zig + vm.zig: op_recur_loop superinstruction (D-386 step 5) | A recur back-edge = op_recur N + N op_store_local (reverse) + back-op_jump = N+2 dispatches/iter (arith_loop, make-list) | compileRecur fuses it into ONE op_recur_loop when the loop bindings are CONTIGUOUS slots [base, base+N) (checked; else the unfused fallback). operand = `(base<<8)\|N`; the following DATA WORD holds the i16 back-offset. The VM stores the top N operands to locals[base..base+N) (arg k → binding k) + jumps, with no per-binding op_store_local dispatch | Collapses the loop tail from N+2 dispatches to 1. arith_loop 64→50 ms (BEATS Python 58); sieve/mfr ~flat (their cost is the lazy seq, not the loop). 10-run ReleaseSafe | zig build test ×2 (diff oracle + updated compile-shape test) + sanity ((loop [i 0 sum 0] … (recur (+ i 1) (+ sum i)))→10, list-build→ordered) | D-386 |
| O-023 | runtime/lazy_seq.zig (fuse slot) + lang/primitive/higher_order.zig (-lazy-{set,get}-fuse + reduceFn arm) + lang/clj/clojure/core.clj (map/filter split + -fused-reduce) | (reduce f init (filter p (map g xs))) over a cons list walks the lazy chain per element: each firstFn/nextFn forces a .clj lazy thunk ×2 transforms; intermediate map/filter seqs fully materialized | map/filter stamp a [xform coll] descriptor on a SEPARATE LazySeq.fuse slot (NOT user meta → (meta (map …)) nil = clj parity; lazy body unchanged via internal -map-lazy/-filter-lazy recursing WITHOUT stamping, so the lazy path pays nothing). 3-arg reduce with a fused source delegates to .clj -fused-reduce → walks the chain composing transducers inner-first + reaches the base, runs ONE (transduce composed (completing f) init base), with zero intermediate seq | map_filter_reduce 27→15 ms (BEATS Python 16); sieve/transduce un-regressed (split keeps the lazy path stamp-free); invisible to laziness. 10-run ReleaseSafe | zig build test ×2 (diff oracle) + CLJW_GC_TORTURE (fuse trace + reentrant transduce →364) + spot-check (list/range/conj/lazy-take) | D-386 |

| O-024 | runtime/regex/match.zig (ThreadList reuse) + lang/primitive/regex.zig (re-find-from fromSlice) | re-find-from (backs re-seq) built [match start end] via THREE persistent-vector conj copies per match + findFrom alloc'd 2 ThreadLists per scanned position | Build the 3-tuple in ONE vector.fromSlice; allocate the matcher's current/next ThreadLists ONCE per findFrom scan and clear+reuse them per position (was alloc+free per position) | regex_count's malloc profile was dominated by these per-match allocs. regex_count 55→45 ms (Python 24.8; the fromSlice cut is the win, the ThreadList reuse is a companion alloc reduction — neutral on this short string, helps long scans). 10-run ReleaseSafe | zig build test ×2 (diff oracle incl. regex suite) + CLJW_GC_TORTURE ((re-seq #"\d+" …)→5) + spot-check re-seq/re-find/lookahead | D-386 |

| O-025 | lang/clj/clojure/core.clj: update-in indexed descent | The 3-arg update-in recursed via (next ks), which on a VECTOR path (the common shape) allocates a subvec/seq view per level | -update-in-idx walks the path by INDEX ((nth ks i), O(1) on a vector) passing the path unchanged, with no per-level next-ks alloc. The variadic & args arity keeps the next form | nested_update 27→25 ms (Python 20.5; 1.33×→1.22×). The residual is the get+assoc per level (inherent). 10-run ReleaseSafe | zig build test ×2 (diff oracle) + spot-check vector path [:a :b :c]→inc + list path (:a :b)→+10 | D-386 |

| O-026 | runtime/collection/map.zig (fromLiteralPairs/allSimpleKeys) + eval/backend/vm.zig (op_map_literal) | A map literal {:a i :b … :c …} built via an N-deep assoc fold: each assoc COPIES the whole 16-slot ArrayMap (gc_stress: ×100k 3-entry maps = 300k array-map copies) | When all keys are SIMPLE (keyword/string/int/symbol/char/bool/nil → keyEq is pure, no eval/GC) and N/2 ≤ 8, build the ArrayMap in ONE gc.alloc + a pure dedup fill (last-key-wins). The single alloc is the only allocation → the fill cannot GC → the unrooted am is safe with no rooting frame. HAMT-size / custom-= keys fall back to the assoc fold | gc_stress 41→32 ms (Python 30; 1.36×→1.07×, ~parity). 10-run ReleaseSafe | zig build test ×2 (diff oracle) + CLJW_GC_TORTURE + spot-check dedup {:a 1 :a 2} → {:a 2} + 9-key→HAMT fallback (count 9) | D-386 |

| O-027 | lang/clj/clojure/core.clj: not= 2-arg fast arity | not= was `(fn* [& args] (not (apply = args)))`: every call rest-packs & args + applies = (sieve's filter pred (not= 0 (mod x p)) runs per element × per filter-level) | Add fixed ([a b] (not (= a b))) (direct =, the intrinsic op_eq) + ([] false)/([a] false); the variadic clause starts at 3 args (Clojure requires the variadic's required count > every fixed arity) | sieve 32→28 ms (Python 20; 1.55×→1.4×; the residual is the nested-filter lazy-seq force, structural). Broadly helps every 2-arg not=. 10-run ReleaseSafe | zig build test ×2 (diff oracle, incl. the cache_gen syntax check that caught the illegal ([a b])+([& args]) overlap) + spot-check 0/1/2/3-arg → [false false true false true false] | D-386 |

| O-028 | eval/backend/vm.zig: hoist ip to a loop-carried register (D-386 dispatch sub-step 1) | The eval loop recomputed const cur per iteration and passed &cur.ip (arena HEAP) into stepOnce; each op loaded ip from heap + wrote it back via the per-op defer; cur.ip could not stay in a register because the pointer aliased arena memory | cur + ip are loop-carried locals; ip is synced to cur.ip (heap) ONLY at frame transitions (flatten / op_ret / catch-handler jump), so the hot non-transition path keeps ip in a register. ip is NOT a GC root (only op_top is, via gc_frame.sp), so the hoist is the SAFE deterministic half of the dispatch inline; sub-step 2 (hoisting op_top itself) is the UAF-class follow-up | Removes a per-op heap load+store of the instruction pointer. fib_recursive (fib 32) 535.9→472.5 ms (~12%) (hyperfine -N -r12, ReleaseSafe). Pure dispatch refactor: identical Values (diff oracle) | zig build test ×2 (diff oracle, TreeWalk≡VM) + CLJW_GC_TORTURE (heap-in-local across non-tail recursion →200; throw/catch carrying heap ex-data under collect →50) + try/catch + nested-rethrow + loop/recur spot-checks | D-386 |

| O-029 | eval/backend/intrinsic.zig: alloc-free fixnum arith fast path | fastBinaryFixnum add/sub/mul delegated to `promote.*Promoting`, which runs 5 type-checks (float/bigdec/ratio/int) then wrapI64, and allocates a heap-Long on i48-overflow / BigInt on i64-overflow, so the "fast path" was neither inline nor alloc-free | Compute the result inline (@add/sub/mulWithOverflow on i64); return the fixnum only when it fits the i48 window, else null so the slow builtin path produces the identical heap-Long / BigInt (TreeWalk already routes overflow through the builtin → diff oracle CONVERGES). Comparisons stay exact-i48 | Skips promote's type-dispatch + function call on the hot case AND makes the whole fn alloc-free (the D-386 sub-step 2 prerequisite: the VM hot arith op then needs no op_top GC sync). fib_recursive (fib 32) 472.5→419.9 ms (~11%, on top of O-028; ~22% session total) hyperfine -N -r12 ReleaseSafe | zig build test ×2 (diff oracle) + observable overflow: `(* i48max 2)`→Long 281474976710654, `(* 9999999999999 9999999999999)`→BigInt, (- i48min 1)→Long -140737488355329 + updated unit contract test | D-386 |

| O-030 | eval/backend/intrinsic.zig + vm/opcode.zig + vm.zig: fixnum mod/rem/quot intrinsic (extends the O-029 fastBinaryFixnum family) | mod/rem/quot were NOT in the intrinsic ArithOp set, so (mod x p) resolved the mod Var → generic builtin dispatch → `promote.*Promoting` type-checks + box (~230 ns/call, the sieve's per-element cost) | Add mod/rem/quot to ArithOp + op_mod/op_rem/op_quot (+ `_local_const`/`_locals` superinstructions); fastBinaryFixnum computes @mod/@rem/@divTrunc inline for a positive divisor (bi<=0→null defers: bi==0 raises divide_by_zero, bi<0 + the @divTrunc(i48min,-1) overflow corner go to the builtin, clj-correct); alloc-free, no op_top sync | Skips the var-deref + builtin type-dispatch on the hot (mod x p) case. micro (pos? (mod x 7)) ×1M 476→236 ms (2×); sieve(1000) cljw 26.4→23.0 ms (1.40×→1.23× py). NB @mod requires b>0 (Zig safety), so the bi<=0 guard is necessary, not merely conservative | zig build test ×2 (diff oracle TreeWalk≡VM) + new diff_test.zig mod/rem/quot case + 2 intrinsic unit tests + probes (mod -7 3)→2 / (rem -7 3)→-1 / (quot -7 3)→-2 / (mod 5 0)→raises | D-386 |

| O-031 | eval/backend/intrinsic.zig + vm/opcode.zig + vm.zig + lang/bootstrap.zig — fixnum not= intrinsic (op_ne, mirrors op_eq) | not= was left to the .clj (not (= a b)) Function (core.clj:583) — a closure call + not call per invocation (260 ns), the sieve's residual after O-030 | Add ne to ArithOp + op_ne/op_ne_local_const/op_ne_locals; fastBinaryFixnum .ne => ai != bi (fixnum-only; non-fixnum defers to the cached not= Var, full value-equality e.g. (not= 1 1.0)→true). Bootstrap fix: not= is .clj-defined, so the setupCorePrefix arith-cache (which only saw the Zig builtins) missed it — finalizeUserNs now RE-caches after core.clj loads (idempotent; core does not redefine arith) so the compiler recognises the not= Var | Skips the .clj closure + not call on the hot 2-arg fixnum case. **micro (not= 0 (mod x 7)) ×1M 485→224 ms (2.2×); sieve(1000) cljw 23.0→19.7 ms — now ≈ Python (1.01× faster, from 1.40× behind)**. The sieve loser is CLOSED | zig build test ×2 (diff oracle TreeWalk≡VM) + new diff_test.zig not= case (fixnum + local_const/locals + non-fixnum defer) + intrinsic unit test + probes (not= 1 1.0)→true / (not= :a :a)→false | D-386 |

| O-032 | lang/primitive/chunk_transform.zig (new) + -map-lazy/-filter-lazy chunked arms (core.clj) | The chunked arm of lazy map/filter ran a .clj loop/recur per element: -chunk-nth + f + chunk-append (3 prim calls + the user-fn) ×32 per chunk + the tree-walked loop glue (~77 ns/elem residual on top of the ~49 ns f-call) | -chunk-map-step [f s] / -chunk-filter-step [pred s]: drain the WHOLE source chunk in Zig (currentChunkNth → invokeCallable(f) → chunkAppend into a fresh ChunkBuffer), returning the buffer; the .clj arm keeps chunk-cons + the lazy-tail recursion + O-023 fuse stamping. The producer-side analogue of reduceFn's in-Zig chunk drain (O-004). GC-root frame [f/pred, s, buf] mirrors reduceFn; the one new root site is the growing output buffer (rooted across every invokeCallable) | Removes the per-element .clj interpreter loop (one prim call per 32-chunk replaces ~32×3 prim calls + the tree-walked loop). (count (map (fn[x]x) (range 1M))) 186→86 ms (2.16×); (count (filter (fn[x]true) (range 1M))) 216→86 ms (2.5×) hyperfine ReleaseSafe. Broad win for all map/filter over chunked sources (range/vector); sieve (filter over a LIST) + map_filter_reduce (reduce O-023 path) unchanged | zig build test ×2 (diff oracle TreeWalk≡VM) + 12-golden chunk_transform clj corpus (map/filter over range+vector, partial/empty chunks, infinite-range laziness, 32-at-a-time side-effect order) + CLJW_GC_TORTURE=1 (safepoint collect: map/filter rooting holds). NB CLJW_GC_TORTURE_ALLOC=1 trips the PRE-EXISTING D-244 #4 op_vector_literal bug via the O-023 fuse [xform coll] literal (not the producer; [1 2 3] alone panics identically) | O-004, O-023, D-386 |

| O-033 | lang/primitive/collection.zig (updateInFn/updateInRec) + update-in 3-arg (core.clj) | update-in 3-arg was .clj -update-in-idx recursion: per level (nth ks i) + (get m k) + (assoc m k …), leaf (f (get m k)), i.e. N .clj frames + per-level prim calls (nested_update (update-in m [:a :b :c] inc) ×10000) | Zig -update-in [m ks f]: walk the vector path in Zig (get down → invokeCallable(f) at the leaf → assoc back up), one builtin replacing the .clj recursion. The .clj update-in keeps the variadic & args arity + routes only a NON-EMPTY VECTOR path to -update-in (list/empty paths stay .clj). GC-root frame [f, m, ks, child]: m transitively roots the descent chain (parents are sub-values of m); slot 3 re-roots the ascent's in-progress new map before each assoc alloc (the O-032 buf hazard) | Removes the per-level .clj recursion + prim-call overhead. 17_nested_update cljw 25→17 ms, 1.18× FASTER than Python (was 1.30× behind). Loser CLOSED. | zig build test ×2 (diff oracle) + 10-golden update_in clj corpus (nested, missing-path create, vector path, multi-arg f) + CLJW_GC_TORTURE=1 (safepoint: single + loop×1000 + 5-deep×500, rooting holds). NB CLJW_GC_TORTURE_ALLOC=1 blocked by PRE-EXISTING vector-builder infra bugs (D-244 #4 op_vector_literal + a vector-builtin integer-overflow), not this code; the ascent assoc/hash-map ops are ALLOC-clean | O-004, O-032, D-386 |

| O-034 | runtime/regex/match.zig (ThreadList.seen generation-stamp) (ADR-0147 Stage 1a; regex_count) | Pike-VM ThreadList.clear() @memset-ed the whole seen array per input position; findFrom clears both lists at every position, so the cost was O(positions × insts) memsets per match-scan | Replace seen: []bool with seen: []u32 generation stamps + a gen: u32 counter; clear() bumps gen (O(1)) instead of memset (wrap re-zeros); a pc is seen this position iff seen[pc] == gen. The correct finished-form Pike-VM sparse-set design (RE2/burntsushi) | Removes the per-position memset scaling. Bench delta WITHIN NOISE on regex_count (the \d+ program is ~5 insts, so the old memset was already ~5 bytes; cljw 100k 0.36→0.355s, 10k unchanged at ~0.04s). The value is algorithmic: O(1) clear regardless of program size, which matters for larger patterns and is the foundation for Stage 1b/2. NOT a claimed beat-Python win on its own | zig build test -Dwasm (all units incl. 30+ match.zig cases) + check_corpus_regression.sh regex_equivalence 48/48 (equivalence-neutral) | ADR-0147, D-447 |

| O-036 | runtime/regex/compile.zig (computeLeading + Program.leading) + runtime/regex/match.zig (scanFrom prefilter skip) (ADR-0147 Stage 2; regex_count) | Pike-VM scanFrom ran a full tryMatchAt (two ThreadList clears + addThread) at EVERY input position to find the leftmost match start, even across long stretches that provably cannot start a match (e.g. \d+ over a long alphabetic run scans every letter) | Compile-time computeLeading walks the ε-closure from pc 0 to the EXACT set of bytes that can be the first consumed byte (a 256-bit CharClass on Program.leading); scanFrom skips positions whose byte is not in the set with a cheap bitmap membership scan. Exact-or-disabled: returns null (prefilter off, current behaviour) when a match can complete via ε without consuming (nullable, e.g. a*), the first byte is undeterminable (leading look), or the set is near-full (./[^x] ≥ 250 bytes). Anchors/save are zero-width (walked through); the residual ^/\b constraint is still enforced by the VM, so an exact set never skips a startable position (equivalence-neutral). The literal-prefilter technique RE2/rust-regex/ezi-gex layer over their NFA (ADR-0147) | Replaces per-position VM restart with one bitmap test across non-matching runs. regex_count bench (digit-dense) WITHIN NOISE (17→17 ms; only ~5 skippable positions in the 15-char input). The win is on sparse inputs: re-seq #"\d+" over a ~4000-char mostly-alphabetic string ×20000: ReleaseSafe A/B 1.35→0.05 s ≈ 27× FASTER. Scalar membership scan = portable floor; range/single-byte SIMD is a deferred accelerator (measure-first) | zig build test -Dwasm ×2 (diff oracle + 9 new computeLeading unit cases: \d+/abc/alt/a*-null/a*b/^abc/\bword/.-null/(ab)+) + check_corpus_regression.sh regex_equivalence 51/51 (3 sparse-prefilter goldens added, anti-D-177) + probes (sparse/dense/zero-width/^anchor/alt/\b all == clj) | O-034, O-035, ADR-0147, D-447 |

| O-035 | runtime/regex/match.zig (findAll + scanFrom extraction) + lang/primitive/regex.zig (re-find-all) + re-seq (core.clj) (ADR-0147 Stage 1b; regex_count) | re-seq was a .clj loop/recur over re-find-from: per match a [match start end] vector was built in Zig, returned as a Value, then deconstructed in the interpreter (nth×3, conj, =, the tree-walked loop). Direct measurement showed this .clj-layer + Value round-trip was ~70% of the per-iteration cost (the audit estimated ~30%) | re-find-all [re s]: ONE Zig scan: match.zig findAll collects all match bounds (plain structs, reusing a single ThreadList pair across every match vs re-find-from's per-call pair), then builds the PersistentList from the end via consHeap. GC-root frame [list, head] roots the growing tail + each buildMatchResult value across the cons/alloc (O-032 discipline). Empty → nil (clj (seq []) parity). re-seq body collapses to (re-find-all re s) | Removes the per-match Value round-trip + interpreter loop. regex_count cljw 100k 0.355→0.11s (3.2×); the scored 10k bench 0.04→0.01s, now FASTER than Python (0.02s). Loser CLOSED (cljw's ~12ms startup edge + the 3.3× per-iter cut). re-seq return type unchanged (PersistentList) | zig build test -Dwasm (4 new findAll unit cases incl. zero-width + no-match) + check_corpus_regression.sh regex_equivalence 48/48 + clean probes: \d+/a* zero-width/x no-match→nil/grouped vectors/type→PersistentList all == clj | O-032, O-034, ADR-0147, D-447 |

| O-037 | runtime/numeric/promote.zig (partsOf → ref-based RatioParts/OwnedParts) (ADR-0148; ratio_sum) | ratioArith (+ the quot exact path) called partsOf, which cloned a ratio operand's numerator AND denominator Managed via cloneWithDifferentAllocator (alloc + limb memcpy + free) on every call, yet the arithmetic only ever READS the parts (mul/add/sub take *const Managed). For (reduce + …) over ratios both operands are ratios, so 4 wasted Managed clones per +. | partsOf returns RatioParts { num: *const Managed, den: *const Managed }: a ratio operand aliases its stored numer.m/denom.m pointers (zero alloc); a non-ratio integer/BigInt still materialises value/1 into a caller-provided OwnedParts local (&owned.num stable for the call scope, deinit only when active). | Removes 2 Managed clones per ratio operand. 33_ratio_sum ReleaseSafe 108.1→103.3 ms (hyperfine -N, 10 runs). Modest alone (the dominant cost is the per-op gcd/divTrunc scratch + result BigInt alloc in allocFromManagedPair, the next lever); broad win for all ratio + - * / quot. | zig build test -Dwasm ×2 (diff oracle, all ratioArith/quot units) + smoke gate; output unchanged (13943237577224054960759/3099044504245996706400) | O-033, ADR-0148, D-450 |

| O-049 | runtime/equal.zig (eqConsult simple-key fast path + isSimpleEqKey) (ADR-0129/ADR-0148; destructure, gc_large_heap) | eqConsult (called per keyEq in every map-get scan / set membership) read dispatch.current_env (a macOS _tlv_get_addr call) UNCONDITIONALLY first, then ran 2 keyInstanceEq probes, all a no-op for simple keys (keyword/symbol/string/number), which can never be a seq-key nor a custom-equiv instance. destructure ({:keys [a b c]} m ×100k) profiled ~20% leaf self-time in _tlv_get_addr. | Guard at the top: if (isSimpleEqKey(a) and isSimpleEqKey(b)) return keyEqValue(a, b); (skips the TLV read + both keyInstanceEq calls). BOTH operands must be simple (an instance on either side could carry a custom equiv matching a simple value → full path), so custom-equiv/seq-key dedup is unchanged. | 37_destructure ReleaseSafe 48.3→45.9 ms (cumulative 55.0→45.9, −16.5%); 27_gc_large_heap 32.5→32.0 ms. Broad win for every simple-key map-get / set-membership / =. | zig build test -Dwasm ×2 (diff oracle TreeWalk≡VM incl. deftype custom-equiv keys) + corpus 3181; both-simple→keyEqValue, instance-side→full consult unchanged | O-043, O-048, ADR-0129, ADR-0148, D-450 |

| O-048 | eval/backend/intrinsic.zig (fastGet) (ADR-0130/ADR-0148; destructure, gc_large_heap) | The op_get intrinsic's fastGet did if (map_mod.contains(coll,k)) map_mod.get(coll,k) else nil_val: two full scans of the map per (get coll k). But map_mod.get already returns nil_val for an absent key (identical to the 2-arg nil default), so the contains pre-check was pure redundancy: every lookup scanned twice. destructure ({:keys [a b c]} m ×100k = 3 gets/iter) + gc_large_heap ((get m :val) ×100k) are get-bound. | Drop the contains pre-scan: .array_map, .hash_map => try map_mod.get(coll, k). One scan; behaviour-identical (nil for absent OR present-nil, same as before; keyEq/eqConsult error behaviour unchanged). | 37_destructure ReleaseSafe 55.0→48.3 ms (−12%); 27_gc_large_heap 33.5→32.5 ms. Broad win for every 2-arg (get map k) on the VM intrinsic path. | zig build test -Dwasm ×2 (diff oracle TreeWalk≡VM, identical Values) + corpus + smoke; map hit/miss/present-nil unchanged | O-043, ADR-0130, ADR-0148, D-450 |

| O-047 | runtime/numeric/big_int.zig (wrapManaged + add/sub/mul/divFloor direct-into-cell) + runtime/numeric/promote.zig (wrapArithCell + add/sub/mul Managed arms) (ADR-0148; bigint_factorial) | Every BigInt arith result was computed into a TEMP Managed then deep-copied to the GC cell via allocFromManaged (cloneWithDifferentAllocator = alloc + limb-memcpy + free the temp). bigint_factorial ((reduce *' …), accumulator grows to ~9 limbs) cloned the growing result on EVERY * (~100k multiplies). O-039 removed the OPERAND clones; this is the RESULT clone, the symmetric other half. The hot path is promote.zig's Managed arm (BigInt×Long), not big_int.allocMulManaged (Long×Long overflow only). | Compute the result DIRECTLY into a fresh gc.infra heap *Managed (the final cell). wrapManaged attaches the GC BigInt wrapper with NO clone; promote.zig's wrapArithCell consumes the cell: inline-Long collapse frees it, a heap wrap MOVES it (struct-copy the limbs slice, no memcpy). Ownership: two errdefers (destroy the Managed alloc + deinit its limbs) cover gc.alloc failure until the wrap stores the pointer; the collapse path frees explicitly on the success return (errdefers don't fire). | 32_bigint_factorial ReleaseSafe 21.3→19.0 ms (A/B 20 runs); cross-lang cljw 20.2 ms now FASTEST-script (python 20.4, babashka 20.7); the 1.31×-behind target CLOSED. Broad win for ALL BigInt + - * / (every result previously cloned). | zig build test -Dwasm ×2 (diff oracle TreeWalk≡VM, identical Values) + lint + smoke (corpus 3181 + leak-detecting units exercise the move/collapse paths); output unchanged (158 = digit count of 100!) | O-039, O-037, ADR-0148, D-450 |

| O-046 | runtime/numeric/ratio.zig (canonical two-tier Ratio) + promote.zig (i64 fast paths in ratioArith/divPromoting + partsOf/coerce/sign/toI64/bigdec arms) + equal.zig (ratioKeyEq + .ratio hash) + print.zig + math.zig/json.zig/ratio_methods.zig/analyzer.zig (parts() branches) (ADR-0149; ratio_sum) | ratio_sum (reduce + (map #(/ 1 %) (range 1 51))) ×1000 was ALLOC-bound: each tiny ratio (1/x, x≤50) paid ~6 heap allocs (2 coerce + arena + 2 result BigInt + Ratio) through std.math.big. The lone far target (2.34×). | Small-ratio inline-i64 representation (ADR-0149 Alt 2): a Ratio stores numer/denom as inline i64 when both fit (no BigInt), auto-promoting to the BigInt tier on i64 overflow. CANONICAL (small iff reduced fits i64) so the rt-free equal/hash arms stay correct by construction (small-vs-big never equal; hashLong gives cross-rep hash parity). divPromoting int/int + ratioArith small⊗small use overflow-guarded i64 cross-multiply + i64 gcd (MIN_I64 → Managed fallback), allocating only the small Ratio struct. | 33_ratio_sum ReleaseSafe 81→31.6 ms (2.34×→0.90×, now FASTER than Babashka 35 ms; the LONE far target CLOSED); div-only 33→16 ms. Broad win for all small-rational arithmetic. | zig build test -Dwasm ×2 (diff oracle + ratio unit tests incl. canonical small/big/collapse + leak detector) + corpus 3157; verified vs clj: basic/collapse(1N/2N/2)/big/numerator/denominator/double/sort/bigdec + canonical: (= 1/2 (/ 1e11 2e11))→true, hash-eq→true, map-key→:half, set→true + harmonic sum exact | O-037, O-038, O-039, ADR-0149, ADR-0148, D-450 |

| O-045 | lang/primitive/higher_order.zig (reduceFn fusion gate) (ADR-0148; gc_large_heap) | The O-023 fused-reduce path fired for ANY 3-arg (reduce f init (map/filter … src)) with a fuse descriptor. Measurement showed it is a 2.5× REGRESSION for a CHUNKED source (range/vector): the .clj-transducer-per-element fused path is slower than the generic walk's in-Zig chunk drain (-chunk-map-step, O-032). (into [] (map f (range 100k))) = 39 ms fused vs 16 ms walked. gc_large_heap's (into [] (map (fn[i]{…}) (range))) paid this. | Gate the fusion: force (seq coll) once (memoised; the generic walk re-seqs the same node) and fire -fused-reduce ONLY when the realized head is non-chunked (a cons-list chain: O-023's actual win, map_filter_reduce). A chunked head falls through to the generic walk's O-004/O-032 chunk path. | 27_gc_large_heap ReleaseSafe 59→34.9 ms (1.99×→1.18× vs Babashka 29.7 ms); the 2nd-worst target CLOSED to near-parity. map_filter_reduce (cons-list) UNREGRESSED (still fuses, 11.7 ms). Broad win for into/reduce over map/filter of range/vector. | zig build test -Dwasm ×2 (diff oracle) + corpus 3157 (covers map/filter/reduce/into; fixture lacks core so no diff_test arm) + manual: into+map / cons-list map+filter / map+filter+set all == clj | O-023, O-032, ADR-0148, D-450 |

| O-044 | eval/backend/vm/opcode.zig (op_nth2) + intrinsic.zig (fastNth2) + vm.zig dispatch + vm/compiler.zig 2-arg emit (ADR-0130 extended; gc_alloc_rate) | O-043 intrinsified only 3-arg nth; the 2-arg (nth v i) (no default; RAISES on OOB) still compiled to a generic op_call. gc_alloc_rate's (+ sum (nth v 2)) ×200K uses 2-arg nth. | op_nth2 (reuses the cached nth Var + core_coll_pristine): fastNth2 inlines an in-range vector index; every error case (OOB / negative / non-int / non-vector / nil) defers to the cached nth builtin so the raise is identical. Compiler emits op_nth2 for a recognised 2-arg nth. | 26_gc_alloc_rate ReleaseSafe 45.8→40.5 ms (1.30×→1.15× vs Babashka 35.3 ms); cumulative 108.4→40.5 (2.81×→1.15×). | zig build test -Dwasm ×2 + corpus + smoke; verified in-range (10/30), list-defer (6), OOB raises (:oob), negative raises (:neg) == clj | O-043, ADR-0130, ADR-0148, D-450 |

| O-043 | eval/backend/vm/opcode.zig (op_get/op_nth) + intrinsic.zig (CollOp/recognizeColl/fastGet/fastNth3) + vm.zig dispatch + vm/compiler.zig emit + runtime.zig (coll_vars/core_coll_pristine) + bootstrap.zig cache + core.zig deopt (ADR-0130 extended; destructure, gc_large_heap) | (get coll k) (2-arg) + (nth coll i default) (3-arg) compiled to a generic op_call: op_get_var callee push + var-resolution + BuiltinFn dispatch per call. destructure does 3 get + 3 nth/iter ×100K (0% malloc, pure dispatch); gc_large_heap does (get m :val) ×100K. | Collection-accessor intrinsic opcodes mirroring the ADR-0130 arith family: the compiler recognises the canonical get/nth Var (pointer identity) and emits op_get/op_nth (no callee push). The VM arm runs fastGet (map/nil inline = getFn 2-arg) / fastNth3 (vector inline = nthFn 3-arg vector), deferring every other collection kind to the cached Var; core_coll_pristine (cleared by alter-var-root on get/nth) deopts to the builtin so a redefinition is honoured. Allocation-free → no GC op_top sync. | 37_destructure ReleaseSafe 73.7→53.6 ms (1.68×→1.22× vs Babashka 43.9 ms); 27_gc_large_heap 63.0→59.0 ms (2.12×→1.99×, get-bound portion; closures still dominate). Broad win for all (get map k) / (nth vec i d). | zig build test -Dwasm ×2 (diff oracle + new op_get / op_nth diff test: map hit/miss, nil, vector, set, OOB/negative default, list defer, destructure=66, string/nested checkEqual) + corpus; deopt verified (alter-var-root #'get → redefinition honoured); all values == clj | O-014, O-018, ADR-0130, ADR-0148, D-450 |

| O-042 | lang/primitive/core.zig (strFn single-int fast path) (ADR-0148; string_ops) | (str i) always built a heap std.Io.Writer.Allocating (gpa buffer alloc + free) + writeStrValue + the final string.alloc, even for a single small integer. string_ops does (str i) ×100K. | Single immediate-integer arg (args.len == 1 and args[0].isInt()): std.fmt.bufPrint("{d}") into a [24]u8 stack buffer → string.alloc directly, skipping the Allocating-writer alloc+free. Heap-Long / BigInt (tag .big_int, not isInt()) keep the full path → value-exact. Multi-arg / non-int unchanged. | string_ops ReleaseSafe 28.6→23.0 ms (1.35×→1.08× vs Babashka 21.3 ms). | zig build test -Dwasm ×2 + corpus + smoke; spot-checked (str 0/-5/123) + bigint 10^24 (full value) + (str 1 2 3)→"123" + (str "x" 5)→"x5" | O-029, ADR-0148, D-450 |

| O-041 | lang/primitive/json.zig (jsonToCw array + object arms) (ADR-0148; json_parse) | read-str converted std.json.Value → cljw collections with empty + N×conj (arrays) and empty + N×assoc (objects): N throwaway intermediate vectors/array-maps per JSON container. The bench's 200-element :users vector alone = 200 throwaway vectors; each 5-key user map = 5 throwaway array-maps ×200. | Array arm: buffer the converted elements, vector.fromSlice (one-shot). Object arm: buffer [k v …], map.fromLiteralPairs when n ≤ ARRAY_MAP_THRESHOLD (8) and keys simple (JSON keys are unique strings), else the assoc fold. Mirrors O-040 (vector literal) + O-026 (map literal). | 26_json_parse ReleaseSafe 54.3→39.0 ms (1.59×→1.14× vs Python 34.1 ms). | zig build test -Dwasm ×2 + corpus + smoke; spot-checked nested array/object + 10-key (>threshold) fallback (count 10) | O-026, O-040, ADR-0148, D-450 |

| O-040 | eval/backend/vm.zig (op_vector_literal) (ADR-0148; gc_alloc_rate) | op_vector_literal built an N-element vector via vector.empty() + N×vector.conj: each conj path-copies + allocates a fresh intermediate vector, so [i (+ i 1) (+ i 2) (+ i 3)] (gc_alloc_rate's loop body, ×200K) allocated 4 throwaway vectors per literal. sample showed the bench is NOT malloc-bound (18/3729 leaf, 0.5%); the cost was the per-conj construction work (the map literal already had this fixed as O-026; vectors were left on the slow path). | One-shot vector.fromSlice(rt, stack[sp-n..sp]): n≤32 → 1 TailNode + 1 Vector (memcpy the elements); n>32 → bulk HAMT build. The elements stay rooted on the operand stack (op_top watermark) across the build. VM-only, mirroring O-026's map fast path (TreeWalk keeps empty+conj; diff oracle proves equality). | 26_gc_alloc_rate ReleaseSafe 108.4→45.8 ms (2.37×; 2.81×→1.30× vs Babashka 35.3 ms); system time 31→6.5 ms. Broad win for every vector literal (esp. small literals in hot loops). | zig build test -Dwasm ×2 (diff oracle, TreeWalk slow ≡ VM fromSlice) + corpus + smoke; spot-checked n=0/1/40/nested vs clj (count 40, sum 780) | O-003, O-026, ADR-0148, D-450 |

| O-039 | runtime/numeric/promote.zig (operandManaged/OwnedManaged + add/sub/mul BigInt else-branches) (ADR-0148; bigint_factorial) | The integer/BigInt arithmetic else-branch (reached when an operand is a heap BigInt) called coerceToManaged on BOTH operands, which clones a .big_int operand via cloneWithDifferentAllocator (alloc + limb memcpy), yet r.add/sub/mul only READ them. bigint_factorial ((reduce *' …), acc grows to 9 limbs) clones the growing accumulator on every *. Same waste O-037 fixed for ratios, single-Managed flavour. | operandManaged(rt, v, *OwnedManaged) returns *const Managed: a BigInt aliases its stored Managed (zero alloc); an immediate Long materialises into a caller-stable OwnedManaged local (deinit only when active). The 3 else-branches use it; the @addWithOverflow both-int sub-paths keep coerceToManaged (both small, nothing to alias). | Removes the per-op accumulator clone. **32_bigint_factorial ReleaseSafe 26.4→22.4 ms (12 runs; 1.49×→1.27× vs Babashka 17.7 ms).** ratio_sum unchanged (separate path). Broad win for all BigInt + - *. | zig build test -Dwasm ×2 (diff oracle) + corpus + smoke; output unchanged (158 = digit count of 100!) | O-037, ADR-0148, D-450 |

| O-038 | runtime/numeric/promote.zig (ratioArith scratch) + runtime/numeric/ratio.zig (allocFromManagedPair reduce-scratch) (ADR-0148; ratio_sum) | Each ratio op allocated its transient cross-multiply + gcd/divTrunc scratch as individual Managed.init(rt.gc.infra) calls (ratioArith 2-4 (rn/rd/lhs/rhs) + allocFromManagedPair 4 (gcd/r_num/r_den/rem)), so ~8 GPA malloc/free per +. sample profiling showed libsystem_malloc (_xzm_*malloc/_xzm_free/memset/bzero) ≈ 20% of leaf self-time. | Each function opens ONE std.heap.ArenaAllocator over rt.gc.infra and allocates all its scratch Managed from it; defer arena.deinit() bulk-frees on return. The two arenas don't alias (rn/rd live on ratioArith's arena, read inside allocFromManagedPair which has its own). Result BigInts (numer/denom) still clone onto gc.infra (GC-managed, outlive the call). | Collapses ~8 individual malloc/free into ~2 arena chunk alloc/free per op. 33_ratio_sum 103.3→82.9 ms (12 runs); combined with O-037 108.1→82.9 (23%). Re-profile: mem-leaf 227/1950 ≈ 11.6% (alloc no longer dominant; residual is interpreter dispatch + bignum compute → D-386). Broad win for all ratio arithmetic. | zig build test -Dwasm ×2 (diff oracle) + corpus 3157/3157 + smoke gate; output unchanged | O-037, ADR-0148, D-450 |

| O-051 | runtime/collection/map.zig (arrayMapKeywordSlot helper + get/contains keyword fast path) (ADR-0165/ADR-0148; destructure, gc_large_heap) | An .array_map get/contains scanned every entry through keyEq → equal.eqConsult → isSimpleEqKey×2 → keyEqValue (a !bool error-union chain) even when the lookup key is a keyword, yet keywords are INTERNED, so = over two keywords IS NaN-box payload bit-identity. destructure ({:keys [a b c]} m ×100k = 3 keyword gets/iter) + every keyword config-map lookup pay the chain per entry. (O-049 cut the TLV read inside eqConsult; this removes the call chain + error union entirely for the dominant keyword case.) | arrayMapKeywordSlot(am, kw): when k.tag()==.keyword, scan am.entries comparing @intFromEnum(entry_key)==@intFromEnum(kw) inline (no call, no try), returning the entry index or null. get/contains take it on the keyword branch; non-keyword keys keep the general keyEq path. Correct because keyword interning makes bit-equality ⟺ =, and a keyword can only bit-match a keyword entry (distinct NaN-box tags). The >8-entry .hash_map path is unchanged (keyword-keyed maps are overwhelmingly small). | clean old-vs-new ReleaseSafe binary A/B (hyperfine -N, 30 runs, same session): 37_destructure 51.5→48.1 ms (−6.6%); 27_gc_large_heap 33.2→31.7 ms (−4.5%); 300k-get microbench 23.7→21.1 ms (−11.0%); map-destructure micro 34.8→32.6 (−6.3%). Broad win for every keyword (get map k) / (contains? map k) / :keys destructure. | zig build test -Dwasm ×2 (diff oracle TreeWalk≡VM, identical Values) + new map.zig unit test (keyword hit/miss + mixed keyword/int/string keys, re-interned keyword identity) + lint | O-048, O-049, O-026, ADR-0165, ADR-0148, D-450, D-520 |

| O-050 | runtime/numeric/promote.zig (ratioArith add/sub branch) + runtime/numeric/ratio.zig (allocFromReducedManagedPair/finishReducedPair) (ADR-0148; ratio_sum) | The BIG-tier ratio add (an*bd ± bn*ad)/(ad*bd) reduced via a full gcd in allocFromManagedPair. O-046's i64 small tier covers tiny ratios, but the harmonic-sum ACCUMULATOR ((reduce + …)) grows past i64 → big-tier add ~45×/call, each paying the big gcd. | Knuth TAOCP 4.5.1 / GMP mpq_add gcd-first add: g = gcd(ad, bd) FIRST (the added term's denom ≤ 50 → a tiny gcd), reduce by g; coprime denominators (g=1) SKIP the final gcd entirely, else a small gcd(t, g≤50). Result is coprime → allocFromReducedManagedPair (factored post-gcd tail of allocFromManagedPair, no redundant final gcd; zero-numerator→int-0 guarded). mul/div keep the full reduce. | 33_ratio_sum ReleaseSafe 39.6→31.0 ms (hyperfine -N, 15 runs, 1.28×±0.20 over the pre-Knuth baseline, same machine/session A/B). Complements O-046 (small tier); broad win for all big-tier ratio +/-. | zig build test -Dwasm ×2 (diff oracle) + corpus 3656 + 10 clj-verified edge cases (coprime / g>1 reduction / zero / integer-collapse / negative / large-coprime / harmonic) | O-037, O-046, ADR-0148, D-450 |
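
As an illustration of the gcd-first add named in the O-050 row, here is a minimal Python sketch of the textbook scheme (Knuth TAOCP 4.5.1). It is not cljw's Zig code: the function name `add_reduced` and the tuple return shape are assumptions made for this example, and Python's arbitrary-precision `int` stands in for the BigInt tier. Both inputs are assumed to be in lowest terms with positive denominators.

```python
from math import gcd


def add_reduced(an, ad, bn, bd):
    """Return an/ad + bn/bd in lowest terms as (numerator, denominator).

    Assumes both inputs are already reduced and ad, bd > 0.
    (Hypothetical helper name; a sketch of the gcd-first scheme only.)
    """
    g = gcd(ad, bd)  # gcd of the denominators first: small when one denominator is small
    if g == 1:
        # Coprime denominators: the cross-multiplied result is already
        # in lowest terms, so the final gcd is skipped entirely.
        return an * bd + bn * ad, ad * bd
    # Shared factor g: work with the reduced denominators.
    t = an * (bd // g) + bn * (ad // g)
    g2 = gcd(t, g)  # only a gcd against g, never against the large product
    return t // g2, (ad // g) * (bd // g2)


# Harmonic sum 1/1 + 1/2 + ... + 1/50, the shape of the ratio_sum bench.
num, den = 0, 1
for x in range(1, 51):
    num, den = add_reduced(num, den, 1, x)
print(f"{num}/{den}")
```

The saving comes from where the gcds are taken: the naive form reduces the full cross-multiplied numerator against the full product denominator, while this form only ever takes a gcd involving the (small) shared denominator factor.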

Identified high-ROI candidates (measured, not yet implemented)

Ranked by ROI (impact × frequency / effort·risk). Measured 2026-05-31 on mac-arm-m4pro, startup baseline 0.48s subtracted where noted.

  1. persistent! bulk trie build: DONE (O-003, D-180 discharged). transient_vector.toPersistent now calls vector.fromSlice, which builds the HAMT trie bottom-up from the flat buffer in O(n) instead of N persistent conjs (O(n log n)); into/vec route editable targets through transient/conj!/persistent!. Measured (Debug-build timings, per the measurement-mode correction above): (count (vec (range 1e6))) 121s → 2.4s; (reduce + (vec (range 1e6))) 123s → 2.5s. Verified by the fromSlice-vs-conj boundary unit test + diff oracle. The residual ~2.4s is per-element reduce/conj! interpreter dispatch + the lazy-seq walk of from, addressed by D-163 (fusion) / D-140 (startup), not persistent!. (The map/set arm of the routing is correctness-enabling, not a perf win: routing into {} / into #{} through transients required completing the transient hash map for > 8 entries (ADR-0064), which delegates to the persistent HAMT (O(n log n), no map speedup). The in-place editable-CHAMP transient that would make maps faster is deferred to D-181. The vector arm is the measured O-003 win.)

  2. cljw startup ≈ 0.48s per invocation — D-140. NOTE (corrected 2026-05-31): clojure.core is ALREADY AOT-restored from an embedded bytecode envelope, NOT re-parsed. ADR-0056 (Cycle 2b) built cache_gen → AOT-compiles core.clj to a bytecode envelope at build time → @embedFile'd into the cljw binary as bootstrap_cache; runner.zig setupCoreAot restores core from it on every cljw -e/ file/stdin run (the prior "re-parses+analyses+evaluates core.clj" description was stale — it predates Cycle 2b). So the residual ~0.48s is NOT core re-eval; it is (unprofiled) the envelope RESTORE (deserialize + run the op_def chunks to intern ~hundreds of core vars on the VM) + primitive.registerAll + process spawn + the full-self-exe read tryRunEmbedded does to check for a trailer (a footer-only seek would avoid it — noted in builder.zig:135). The 11 non-core .clj files (string/set/walk/…) are lazy on require, so a minimal cljw -e 1 does not load them. Next step = PROFILE the 0.48s (the bottleneck moved since the docs were written), then targeted tuning (footer seek / faster restore) + ADR-0056 Cycle 3 (AOT or lazy-defer the non-core files). Architecture already exists (ADR-0056); this is profile-and-tune, not a new cache. Highest dev-velocity ROI (every test/probe pays it).
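
The footer-only seek mentioned in item 2 can be sketched as follows. This is a Python illustration, not cljw's code: the trailer layout (an 8-byte magic followed by an 8-byte little-endian payload length at the very end of the file), the `MAGIC` value, and the name `read_trailer` are all assumptions invented for the example; the real trailer format is whatever builder.zig defines.

```python
import os
import struct

# Hypothetical trailer layout, assumed for this sketch only:
# [ ... executable bytes ... ][ payload ][ 8-byte magic ][ u64 LE payload length ]
MAGIC = b"CLJWTRLR"
FOOTER = struct.Struct("<8sQ")  # 16 bytes: magic + payload length


def read_trailer(path):
    """Return the embedded payload, or None if the file has no trailer.

    Reads only the fixed-size footer (plus the payload when present),
    never the whole executable.
    """
    with open(path, "rb") as f:
        size = f.seek(0, os.SEEK_END)  # file size without reading any data
        if size < FOOTER.size:
            return None
        f.seek(size - FOOTER.size)
        magic, payload_len = FOOTER.unpack(f.read(FOOTER.size))
        if magic != MAGIC or payload_len > size - FOOTER.size:
            return None  # no trailer, or a corrupt length field
        f.seek(size - FOOTER.size - payload_len)
        return f.read(payload_len)
```

For a binary without a trailer, the check costs one 16-byte read regardless of executable size, which is the property the note in item 2 is after.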

Out-of-scope future optimizations (tracked, not yet implemented)

  • (none currently — the map/filter/take reduce-fusion that was listed here landed as O-004 / D-163 first cycle.)