Optimizations ledger (SSOT)

Purpose. A discoverable index of every place where cljw's code is shaped for speed rather than for the simplest correct form. The user's directive (2026-05-31, translated from Japanese): "When optimizing in the future, make it easy to see that 'this is optimized'; ideally there is an SSOT-like place for it." This is that SSOT. Optimizations come in many kinds and not all fit one registry cleanly, so this is a best-effort index, paired with the grep-discoverable in-code // PERF: marker (see .claude/rules/perf_marker.md).

An entry answers: what is the naive correct form, what is the optimized form, why is it faster, and what verifies they agree? The naive form is the behavioural contract; the optimization must be observably equivalent (F-011) — only the internal mechanics change.
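As a rough illustration of those four questions, the sketch below pairs a naive form with an optimized form and checks that they agree on a small corpus. It is written in Python rather than cljw's Zig or Clojure, it only loosely mirrors the O-012 `join` row, and the names `join_naive`, `join_fast`, `agree`, and `CASES` are hypothetical.

```python
from typing import List, Tuple


def join_naive(sep: str, items: List[str]) -> str:
    """Naive form (the contract): fold with string concatenation.

    Each step copies the growing accumulator, so total work grows
    quadratically with the output length.
    """
    acc = ""
    for i, x in enumerate(items):
        acc = x if i == 0 else acc + sep + x
    return acc


def join_fast(sep: str, items: List[str]) -> str:
    """Optimized form: one linear pass over the items."""
    return sep.join(items)


# A tiny stand-in for a diff-oracle corpus (hypothetical cases).
CASES: List[Tuple[str, List[str]]] = [
    (",", []),
    (",", ["a"]),
    (",", ["a", "b", "c"]),
    ("", ["x", "y"]),
    ("--", ["", ""]),
]


def agree(cases: List[Tuple[str, List[str]]]) -> bool:
    """True when the optimized form is observably equal to the naive form."""
    return all(join_naive(sep, xs) == join_fast(sep, xs) for sep, xs in cases)
```

Deleting `join_fast` and routing callers to `join_naive` would leave every case in `CASES` unchanged, which is the "verify by deletion" property the ledger asks for.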

⚠️ Measurement mode (2026-06-01 correction). Many absolute numbers in the O-001…O-004 rows below were measured on a Debug binary (zig build defaults to Debug; a Debug tree-walk interpreter is ~10-100× slower than the shipped build). They are NOT representative of shipped speed: for example, (count (vec (range 1e6))) reads ~121s in Debug but ~0.02s in ReleaseFast, and startup is ~0.48s in Debug versus milliseconds in Release (cljw already meets the ms-level cold-start mission target; cw v0 claims ~4ms). The algorithmic wins are real (O(n) beats O(n log n); chunked beats per-element in any mode), but the urgency was Debug-inflated. Future O-NNN numbers MUST be measured on a Release build, and only via scripts/perf.sh (see .claude/rules/perf_measure_release.md).

How to read / maintain

Every optimization that trades simplicity for speed gets (a) a // PERF: <what> [refs: O-NNN, …] marker at the code site and (b) a row here. The O-NNN id is this ledger's; cross-ref the driving D-NNN debt row when one exists (perf debt lives in .dev/debt.yaml; this ledger is the implemented optimizations).
A "fast path" that can be removed and replaced by the naive form with no behaviour change is the cleanest kind — note the naive fallback explicitly so a future reader can verify by deletion.
When an optimization is reverted / superseded, mark the row RETIRED <date> rather than deleting it (history).
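The marker-plus-row rule above lends itself to a mechanical cross-check. The following is a minimal Python sketch, assuming the marker shape quoted in the text (`// PERF: <what> [refs: O-NNN, …]`) and assuming a ledger row starts with its `O-NNN` id, optionally behind a leading pipe. The function names are hypothetical and this is not a script that exists in the repository.

```python
import re
from typing import List, Set

# Assumed marker shape, taken from the ledger text: // PERF: <what> [refs: O-NNN, ...]
MARKER = re.compile(r"//\s*PERF:\s*(?P<what>.*?)\s*\[refs:\s*(?P<refs>[^\]]*)\]")
OPT_ID = re.compile(r"\bO-\d{3}\b")
# Assumed row shape: the id is the first cell, with or without a leading pipe.
ROW_ID = re.compile(r"^\|?\s*(O-\d{3})\b")


def marker_ids(source: str) -> Set[str]:
    """Collect every O-NNN id referenced by a PERF marker in source text."""
    ids: Set[str] = set()
    for m in MARKER.finditer(source):
        ids.update(OPT_ID.findall(m.group("refs")))
    return ids


def ledger_ids(ledger: str) -> Set[str]:
    """Collect the O-NNN id that opens each ledger row."""
    ids: Set[str] = set()
    for line in ledger.splitlines():
        m = ROW_ID.match(line)
        if m:
            ids.add(m.group(1))
    return ids


def unlisted(source: str, ledger: str) -> List[str]:
    """Ids that appear in code markers but have no ledger row."""
    return sorted(marker_ids(source) - ledger_ids(ledger))
```

A RETIRED row still opens with its id, so a retired optimization whose marker lingers in the code would not be reported here; catching that case would need an extra check on the row's status.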

Entries

ID	Site	Naive form (the contract)	Optimized form	Why faster	Verified by	Refs
O-001	`runtime/collection/range.zig` + call sites	`(range a b s)` as a lazy cons-seq (one cons + lazy_seq per element)	Compact `.range` value `{start,end,step,count}`: O(1) count/nth, tight-loop reduce, chunked-cons `seq`	No per-element alloc on count/nth/reduce; 1 alloc/32 on walk	`phase14_range_indexed.sh` + diff oracle vs `clj`	D-163 / D-168
O-002	`higher_order.zig::reduceFn` (`.vector` arm)	`reduce` over a vector via `seqFn` → `vectorToList` (N-element eager cons list), then walk via first/next	Index-walk: `vector.nth(coll, i)` in a tight `i` loop, honouring `reduced`	No N-element intermediate cons list; `(reduce f bigvec)` / `(into to bigvec)` went O(n) alloc → O(1). Measured `(reduce + (vec (range 1e6)))` 182s → fast	`phase14_*` reduce e2e + diff oracle vs `clj`	D-163
O-003	`vector.zig::fromSlice` + `transient/transient_vector.zig::toPersistent` + `core.clj` `into`/`vec`	`persistent!` rebuilds the persistent vector via N persistent `conj`s (O(n log n)); `into`/`vec` = `(reduce conj …)`, also N persistent conjs	Bulk `fromSlice` builds the HAMT trie bottom-up from the transient's flat buffer in O(n) (32-element leaves → interiors grouped 32-at-a-time → root; last ≤32 = tail); `into`/`vec` route editable targets (vector/hash-map/hash-set, NOT sorted/nil/list) through `transient`/`conj!`/`persistent!`	`persistent!` O(n log n) → O(n); `into`/`vec` build O(n) over a flat buffer + one O(n) trie conversion, vs N persistent conjs. Measured `(count (vec (range 1e6)))` 121s → 2.4s; `(reduce + (vec (range 1e6)))` 123s → 2.5s	`vector.zig` boundary unit test (n ∈ {0,1,31,32,33,63,64,65,1023,1024,1025,1e5}: `fromSlice` == conj-built, same shift/tail/root) + diff oracle vs `clj` (into/vec over vector/map/set/sorted/nil/list + meta)	D-180
O-004	`core.clj` `map`/`filter`/`keep` 2-arg + `higher_order.zig::reduceFn` chunked arm + `sequence.zig::countFn` chunk drain + `chunked_cons.zig` chunk-builder	`(map f coll)` / `(filter pred coll)` build a meta-less lazy-seq walked one element per `nextFn` — each step allocs a cons + lazy_seq node and tree-walks the `.clj` body (~408µs/element measured)	Chunk-preserving: when the source is chunked (range seq / chunked map), transform a whole 32-chunk per thunk into a fresh chunk (JVM `chunk-cons` shape); `reduce`/`count` drain a chunk per step. Fill loop stays in `.clj` (a tree-walk loop is ~2µs/iter, negligible vs the 408µs amortised)	The ~408µs/element lazy-seq machinery is paid once per 32 elements, not per element. Measured `(count (map inc (range 1e5)))` 41.3s → 2.8s (~15x); `(reduce + (map inc (range 1e5)))` → 2.4s. (Residual is the per-element `f` vtable call ≈ 2µs — D-133's, not this cycle's.)	`phase14_chunked_seq.sh` (chunk-boundary count 1/32/33/65/1000 + reduce/nth/last/=/lazy-take) + diff oracle vs `clj`	D-163 / ADR-0065
O-005 RETIRED 2026-06-11	`eval/analyzer/{analyzer,bindings}.zig` + `eval/backend/tree_walk.zig::callMethodImpl`	Every TreeWalk fn call nil-inits the full `[MAX_LOCALS=256]Value` (~2 KB) call frame	(reverted) leave `[frame_size..256)` `undefined`, nil-init only the analyser frame high-water	RETIRED — introduced a GC UAF. `callMethodImpl`'s `locals` is ALSO the VM's frame: on the default VM backend it is handed to `vm.eval`, which publishes the WHOLE slice as a GC root (`gc_frame.locals`). The `undefined` tail was traced as Values under a `CLJW_GC_TORTURE` collect → SIGSEGV (deterministic repro: a deep call fills high stack slots, then a shallow fn runs under torture). The bound-the-slice fix then surfaced that `frame_high_water` under-counts the VM's slot needs for a fn containing a nested `fn*` (`slot_out_of_range`). 12 % on one microbench was not worth the correctness risk + the slot-accounting subtlety. The full-256 nil-init is restored.	repro `deep`+`shallow` under `CLJW_GC_TORTURE=1` (139→0 after revert) + full gate `e2e_phase16_gc_torture`	D-163
O-007	`lang/primitive/higher_order.zig::sortNaturalFn` (`-sort-natural` leaf) + `core.clj` `sort`	`(sort coll)` ran the `.clj` `-msort` merge sort: per recursion level `(vec (take mid v))` + `(vec (drop mid v))`, and `-merge-sorted` does `(first a)`/`(rest a)`/`(conj acc …)`/`(empty? …)` + a `compare` call through the eval machinery per element	Default order copies the vector into a flat `[]Value` buffer, runs `std.mem.sort` (stable block sort) calling `valueCompare` directly (no eval reentry → no GC safepoint mid-sort → no frame rooting needed), and rebuilds via `vector.fromSlice` (O-003). Custom-comparator `(sort comp coll)` / `sort-by` stay on `.clj` `-msort` (a user comparator re-enters eval)	Eliminates the O(n log n) `.clj` take/drop/vec/rest/conj churn + per-comparison eval reentry. Measured `36_sort` bench (5×`(reduce + (take 100 (sort (vec (range 5000 0 -1)))))`) 0.39s → ~0.00s (startup-only) ReleaseFast	`test/diff/clj_corpus/sort.txt` (17 cases vs `clj`: empty/single/dups/mixed int·float ties/strings/keywords/nested vectors/custom comp/sort-by stability) + `zig build test` + `zig build lint`	D-163
O-008	`build.zig` (`exe_mod.strip`) — binary-size axis	`b.installArtifact(exe)` shipped an UNSTRIPPED binary (the ~5.7K-symbol table rode in every release `cljw`); there was no packaging step to strip it, so the `bench/RELEASE_METRICS.md` "stripped" headline was aspirational, not the actual install	`.strip = optimize != .Debug` — every non-Debug build (ReleaseSafe / ReleaseFast / ReleaseSmall) strips at link time; Debug stays unstripped for `lldb`. cljw renders error traces from its own runtime `StackFrame` stack, not native symbols, so stripping costs no user diagnostics	Installed ReleaseSafe `cljw` 3.93 MB → 3.39 MB (~400 KB / ~10% off the shipped artifact); ReleaseSmall also stripped (a further CLI `strip` floors it at 1.63 MB — Zig link-strip is less aggressive than CLI `strip` only for the ReleaseSmall layout)	`bash bench/release_metrics.sh` (3.24 MB on-disk ReleaseSafe) + smoke (stripped ReleaseSafe e2e: corpus 2289/2289, build_cljw pass) + `cljw -e '(sort [3 1 2])'` on the stripped binary	—
O-009	`core.clj` `reductions`	`(reductions f init coll)` = `(seq (reduce (fn [acc x] (conj acc (f (last acc) x))) [init] coll))` — `(last acc)` is O(n) on the growing result vector → O(n²) overall; also fully eager (could not handle an infinite `coll`) and consed the `reduced` wrapper instead of stopping	JVM's own lazy + accumulator-threaded shape: carry the running value as `init` through the recursion (no `(last acc)`), wrap each step in `lazy-seq`, and stop on a `reduced` init. O(n), lazy, and `reduced`-correct	O(n²) → O(n). Measured `(count (reductions + (range 100000)))` 103.68 s → 0.04 s (~2500×) ReleaseFast. Also fixes two latent bugs the old eager form had: infinite-seq support (`(take 5 (reductions + (range)))`) and `reduced` early-termination	`test/diff/clj_corpus/reductions.txt` (14 cases vs `clj`: init/empty/`*`/`conj`/infinite-`range`/`reduced` early-stop/`str`/etc. — all OK) + `zig build test` + `zig build lint`	D-163
O-010	`lang/primitive/higher_order.zig::sortByKeysFn` (`-sort-by-keys` leaf) + `core.clj` `sort-by`	`(sort-by f coll)` ran the `.clj` `-msort` with a `(fn [a b] (compare (f a) (f b)))` comparator — re-enters eval AND re-applies `f` on every comparison (O(n log n) `f` calls), plus the merge-sort take/drop/vec/rest/conj churn	2-arg default order precomputes `keys = (mapv f coll)` (one `f` per element) and runs the native `-sort-by-keys`: stable-sorts an index permutation by `valueCompare` on the keys (no eval reentry → no GC rooting), then gathers via `vector.fromSlice`. 3-arg custom-comparator `(sort-by f comp coll)` stays on `.clj` `-msort`	Measured `(last (sort-by - (range 20000)))` 0.79s → 0.01s (~79×) ReleaseFast. Fewer `f` calls than JVM (n vs n log n) — observably identical for a pure key fn (F-011 contract)	`test/diff/clj_corpus/sort_by.txt` (14 cases vs `clj`: key fns `-`/`count`/`:age`/`val`/`str`/stability with dup keys/3-arg custom comp/empty — all OK) + `zig build test` + `zig build lint`	D-163
O-011	`core.clj` `map-indexed` / `keep-indexed` (2-arg)	`(map-indexed f coll)` = `(mapv #(f % (nth coll %)) (range (count coll)))` (and `keep-indexed` the `reduce` analogue) — `(nth coll i)` is O(i) on a non-indexed `coll` (lazy seq / list), so the indexed walk is O(n²); only a `.range`/vector source (O(1) nth) stayed fast	Route the 2-arg form through the existing 1-arg stateful transducer: `(-seq-or-empty (into [] (map-indexed f) coll))` — `into`+xform walks the source SEQUENTIALLY (O(n), transient conj path) with a volatile running index, no `nth`	O(n²) → O(n). Measured `(count (map-indexed vector (map inc (range 20000))))` 4.64 s → 0.02 s (~230×); `keep-indexed` 4.61 s → 0.01 s. (A `.range`/vector source was already fast; the win is for the common lazy-seq / list source)	`test/diff/clj_corpus/map_keep_indexed.txt` (12 cases vs `clj`: vector/string/lazy-`map`/`filter`/empty sources + `keep-indexed` drop-nil — all OK) + `zig build test` + `zig build lint`	D-163
O-012	`clojure/string.clj` `join` (2-arity)	`(join sep coll)` = `(str (reduce (fn [acc x] (str acc sep x)) nil coll))` — each step copies the GROWING accumulator string → O(n²) in total length	JVM idiom `(apply str (interpose sep coll))` — `interpose` walks once (lazy), `apply str` builds the result in one native variadic-`str` pass (O(n))	O(n²) → O(n). Measured `(count (join "," (map str (range N))))`: N=50k 0.99 s → 0.03 s, N=100k 3.16 s → 0.07 s (~45×; the gap widens with size) ReleaseFast	`test/diff/clj_corpus/string_join.txt` (12 cases vs `clj`: int/str/kw elems, char sep, empty/single, list + lazy-seq sources, nil elem → "") + `zig build test` + `zig build lint`	D-163
O-015	`eval/analyzer/{analyzer,bindings}.zig` + `eval/backend/tree_walk.zig` — exact-count frame rooting (ADR-0130 am1; O-005 redo)	`callMethodImpl` inits + GC-roots the full `[256]Value` locals on EVERY call (a ~2 KB nil-init per call)	The analyzer threads a per-fn-method `frame_max` (each `declare` bumps it; a nested `fn*` gets its OWN counter so its slots don't pollute the outer's — the O-005 mistake) → exact `FnMethod.frame_slots` → `FunctionMethod`. `callMethodImpl` inits `locals[0..fs]` and passes that BOUNDED slice to both backends, so the GC roots only the used slots (the prior O-005 left rooting at the full 256 → traced the undefined tail → UAF). Sentinel 0 → full MAX_LOCALS fallback	Cuts the per-call ~2 KB nil-init. ReleaseSafe quick-bench: fib_recursive 61→58 ms (~5%, consistent ×2); tak ~flat. Modest — the memset was a smaller share than the survey estimated; the larger fib lever is the call-dispatch structure (the 6-hop recursion / flat-frame, the survey's 2nd pick). Also the prerequisite (exact frame sizes) for that next step	gc_torture (nested_deep + walk_hashmap, real ReleaseSafe binary) + `zig build test` (1090, diff oracle) + ReleaseSafe build (cache_gen)	ADR-0130
O-014	`eval/backend/{vm,compiler,intrinsic}.zig` — `(+ a b)` op_add intrinsic (ADR-0130)	`(+ a b)` compiles to a generic `op_call`: load the `+` Var, resolve it at runtime, dispatch the `BuiltinFn`, slice the args — per operation	The compiler emits `op_add` when the callee resolves (by Var pointer identity) to canonical `clojure.core/+` with 2 args; dispatch does the fixnum add inline via `promote.addPromoting` (the SAME tower the builtin uses), else defers to the cached `+` builtin. A let-shadowed `+` is a `.local_ref` (never op_add); `alter-var-root` on `+` deopts via `core_arith_pristine`	Skips var-resolution + dispatch-frame + arg-slice for the hot 2-fixnum case. ReleaseSafe quick-bench, cumulative across the family (op_add then am1 sub/mul/</></=/=): arith_loop 170→107 ms (37%) (op_add for 2 `+`/iter + op_eq for the `(= i n)` condition); fib_recursive 71→61 ms (14%) (all of fib's arith — `+`, two `-`, `<` — intrinsified); tak ~flat (call-bound). The call-bound residue (fib/tak) needs the call-path opt (v0 24A.5 monomorphic IC), not more arith opcodes	`diff_test` op_add inline cases + alter-var-root deopt→999 + `phase4_cli` e2e (i48-boundary heap-Long / shadowed-+ "ab" / deopt) + `zig build test` (1088)	ADR-0130
O-013 RETIRED 2026-06-11	`core.clj` `concat`	`(concat & colls)` = `(reduce -concat2 nil colls)` (LEFT fold) — `(apply concat N-colls)` re-yields early colls through all outer wrappers → O(n × #colls)	(reverted) right-nest via `-concat-seqs`	RETIRED — broke `interleave` with a stack overflow. The right-nested form places a recursive 2-arg `concat`'s tail arg behind an extra `-concat2`/lazy-seq layer; `interleave` (`(concat (map first ss) (apply interleave (map rest ss)))`) self-recurses, so it accumulated one native force-frame per level → SIGSEGV (139) at ~50k (`e2e_phase14_seq_helpers2 interleave_large`). The LEFT fold keeps the tail arg in `-concat2`'s 2nd (seq-y) position so deep 2-arg recursion stays flat; restored. The `apply concat N-colls` O(n×N) is the accepted tradeoff (rare; `mapcat` uses the lazy `-concat-seqs` directly).	repro `(count (interleave (range 50000) (range 50000)))` 139→0 after revert + full gate `e2e_phase14_seq_helpers2`	D-163
O-016	`eval/backend/vm.zig` — per-thread operand arena (`VmArena`; ADR-0131 2a)	Each `vm.eval` declares its operand stack + parallel loc stack as fresh host-C-stack `[256]Value`/`[256]SourceLocation` arrays per call (cold memory + host-frame setup every call)	A threadlocal-static `VmArena` (inline arrays = demand-paged BSS, nothing to alloc/free) holds both stacks; each `eval` borrows a region at the global watermark `op_top` (restored on return; nested reentrant evals stack above). `stepOnce` takes slices; the EvalFrame roots `stack[0..op_top]`. The pre-step `op_top` (via stepOnce's deferred write-back) keeps a callee's args rooted during `vt.callFn` AND positions the nested borrow above them — so NO publish-before-nest is needed	The reused arena stays warm in cache vs a cold fresh `[256]` host region per `eval`. fib_recursive 56→41 ms, tak 18→15 ms (10-run ReleaseSafe) — an unexpected win for what was planned as behaviour-identical 2a infra; the deeper lever (removing the host `eval` re-entry itself) is ADR-0131 2b	gc_torture (`frame_local_alloc` heap-in-local-across-non-tail-call + `nested_deep`, ReleaseSafe) + `zig build test` ×2 (diff oracle, no leaks) + catch e2e (11+5+13 PASS) + `phase14_error_format` smoke	ADR-0131
O-017	`eval/backend/vm.zig` — `inline fn stepOnce` (D-386 step 1)	`stepOnce` is a plain `fn` called per instruction from the `eval` loop with an 11-pointer signature; ReleaseSafe did NOT inline it, so every op paid a real call boundary (v0's step fn is 2-arg)	Mark `stepOnce` `inline fn` so the per-op dispatch folds into the `eval` loop — no call boundary, sp/ip/handler_count stay in the loop's registers	Removes the per-instruction 11-arg call. fib_recursive 40→33 ms, tak 15→13 (10-run ReleaseSafe). The fib/tak tax is per-op dispatch (ADR-0131 2b confirmed it is NOT the call structure); this is the first D-386 dispatch lever. Naive form = plain `fn`; identical Values	`zig build test` ×2 (diff oracle — pure inlining hint, no behaviour change) + 10-run ReleaseSafe bench	D-386
O-018	`eval/backend/vm/{opcode,compiler}.zig` + `intrinsic.zig` + `vm.zig` — `op_*_local_const` superinstructions (D-386 step 2)	`(<op> local-ref const-literal)` compiles to 3 dispatches: op_load_local + op_const + op_ (e.g. fib `(- n 1)` / `(< n 2)`)	The compiler fuses the triple into ONE `op_{add,sub,mul,lt,le,gt,ge,eq}_local_const` (operand = `local_slot<<8 \| const_idx`, both <256); the VM arm loads `locals[slot]` + `constants[idx]` and runs the SAME fixnum-fast / builtin-deopt as the op_add family — net stack effect +1 (pure push)	Cuts 2 dispatches per fused triple; the post-O-017 profile is flat so reducing op COUNT is the only lever (v0 37.2). fib_recursive 33→26 ms (≈ Python 24), arith_loop 107→96 (10-run ReleaseSafe). tak unchanged (it is local-LOCAL `(< y x)` — needs the `*_locals` variant next)	`zig build test` ×2 (diff oracle — fused op = TreeWalk Value) + spot-check (sub/lt fused + bigint F-005 deopt) + 10-run bench	D-386
O-019	`eval/backend/vm/{opcode,compiler}.zig` + `intrinsic.zig` + `vm.zig` — `op_*_locals` superinstructions (D-386 step 3)	`(<op> local-ref local-ref)` = op_load_local + op_load_local + op_ (arith_loop `(< i n)` / `(+ acc i)`, tak `(< y x)`)	The compiler fuses the triple into ONE `op_*_locals` (operand = `slot_a<<8 \| slot_b`, both <256); the VM arm loads `locals[a]` + `locals[b]`, same fixnum-fast / builtin-deopt as op_add — net stack +1. Sibling of O-018 (local-CONST); together they cover the two hot binop operand shapes	arith_loop 94→76 ms (the `(< i n)` + `(+ acc i)` loop body), tak/fib ~flat (no local-local). 10-run ReleaseSafe	`zig build test` ×2 (diff oracle) + spot-check `[(- a b) (< b a) (+ a b) (= a b)]` → `[2 true 8 false]`	D-386
O-020	`lang/clj/clojure/core.clj` — `update-in` 3-arg fast arity	`update-in` is variadic `(fn* [m ks f & args] …)` — EVERY call rest-packs `& args` (even when empty) + the recursive descent uses `(apply update-in (into [...] args))` (vector build + apply spread per level)	Add a 3-arg arity `([m ks f] …)` that recurses DIRECTLY (`(update-in (get m k) nks f)`) + calls `f` directly (`(f (get m k))`) — no rest-pack, no `apply`, no `into`. The variadic `& args` arity is kept for the rare extra-args call	The hot path `(update-in m ks inc)` (nested_update's 10000-loop) no longer pays variadic packing + apply + into per level. nested_update 58→24 ms (Python 20; was 2.8×, now 1.2×). 10-run ReleaseSafe	`zig build test` ×2 (diff oracle, rebuild + .clj load) + spot-check `(update-in {:a {:b 1}} [:a :b] inc)`→`{:a {:b 2}}` + `… + 10 20`→`{:a {:b 31}}`	D-386
O-021	`eval/backend/vm/{opcode,compiler}.zig` + `intrinsic.zig` + `vm.zig` — `op_branch_*` compare-and-branch superinstructions (D-386 step 4)	`(if (<cmp> local const/local) …)` = a fused cmp op (O-018/019) + `op_jump_if_false` = 2 dispatches (fib `(< n 2)`, arith_loop `(= i n)`)	`compileIf` fuses the cmp+branch into ONE `op_branch_{ne,ge,gt}_{local_const,locals}` (the NEGATED cmp for jump_if_false: eq→ne, lt→ge, le→gt). v0's 2-word trick fits cljw's fixed `{opcode,u16}` with NO format change: the op's operand = the slot/const pair, the IMMEDIATELY-FOLLOWING instruction is a DATA WORD (the backpatched `op_jump_if_false`, never dispatched) carrying the i16 offset. Same fixnum-fast / builtin-deopt; net stack 0. Fusion in the COMPILER (peephole stays removal-only per its contract)	Collapses the test+branch from 2 dispatches to 1. fib 26→24 ms (= Python), arith_loop 76→64. 10-run ReleaseSafe	`zig build test` ×2 (diff oracle — fused branch = TreeWalk control-flow) + sanity (fib20=6765, `(< 5 2)`→else, `(= 3 3)`→then, no-else `(<= 1 1)`)	D-386
O-022	`eval/backend/vm/{opcode,compiler}.zig` + `vm.zig` — `op_recur_loop` superinstruction (D-386 step 5)	A `recur` back-edge = `op_recur N` + N `op_store_local` (reverse) + back-`op_jump` = N+2 dispatches/iter (arith_loop, make-list)	`compileRecur` fuses it into ONE `op_recur_loop` when the loop bindings are CONTIGUOUS slots `[base, base+N)` (checked; else the unfused fallback). `operand` = `(base<<8)\|N`; the following DATA WORD holds the i16 back-offset. The VM stores the top N operands to `locals[base..base+N)` (arg k → binding k) + jumps — no per-binding op_store_local dispatch	Collapses the loop tail from N+2 dispatches to 1. arith_loop 64→50 ms (BEATS Python 58); sieve/mfr ~flat (their cost is the lazy seq, not the loop). 10-run ReleaseSafe	`zig build test` ×2 (diff oracle + updated compile-shape test) + sanity (`(loop [i 0 sum 0] … (recur (+ i 1) (+ sum i)))`→10, list-build→ordered)	D-386
O-023	`runtime/lazy_seq.zig` (`fuse` slot) + `lang/primitive/higher_order.zig` (`-lazy-{set,get}-fuse` + `reduceFn` arm) + `lang/clj/clojure/core.clj` (`map`/`filter` split + `-fused-reduce`)	`(reduce f init (filter p (map g xs)))` over a cons list walks the lazy chain per element — each `firstFn`/`nextFn` forces a `.clj` lazy thunk ×2 transforms; intermediate map/filter seqs fully materialized	`map`/`filter` stamp a `[xform coll]` descriptor on a SEPARATE `LazySeq.fuse` slot (NOT user meta → `(meta (map …))` nil = clj parity; lazy body unchanged via internal `-map-lazy`/`-filter-lazy` recursing WITHOUT stamping, so the lazy path pays nothing). 3-arg `reduce` with a fused source delegates to `.clj -fused-reduce` → walks the chain composing transducers inner-first + reaches the base, runs ONE `(transduce composed (completing f) init base)` — zero intermediate seq	map_filter_reduce 27→15 ms (BEATS Python 16); sieve/transduce un-regressed (split keeps the lazy path stamp-free); invisible to laziness. 10-run ReleaseSafe	`zig build test` ×2 (diff oracle) + `CLJW_GC_TORTURE` (`fuse` trace + reentrant transduce →364) + spot-check (list/range/conj/lazy-take)	D-386

| O-024 | runtime/regex/match.zig (ThreadList reuse) + lang/primitive/regex.zig (re-find-from fromSlice) | re-find-from (backs re-seq) built [match start end] via THREE persistent-vector conj copies per match + findFrom alloc'd 2 ThreadLists per scanned position | Build the 3-tuple in ONE vector.fromSlice; allocate the matcher's current/next ThreadLists ONCE per findFrom scan and clear+reuse them per position (was alloc+free per position) | regex_count's malloc profile was dominated by these per-match allocs. regex_count 55→45 ms (Python 24.8; the fromSlice cut is the win, the ThreadList reuse is a companion alloc reduction — neutral on this short string, helps long scans). 10-run ReleaseSafe | zig build test ×2 (diff oracle incl. regex suite) + CLJW_GC_TORTURE ((re-seq #"\d+" …)→5) + spot-check re-seq/re-find/lookahead | D-386 |

| O-025 | lang/clj/clojure/core.clj — update-in indexed descent | The 3-arg update-in recursed via (next ks), which on a VECTOR path (the common shape) allocates a subvec/seq view per level | -update-in-idx walks the path by INDEX ((nth ks i), O(1) on a vector) passing the path unchanged — no per-level next-ks alloc. The variadic & args arity keeps the next form | nested_update 27→25 ms (Python 20.5; 1.33×→1.22×). The residual is the get+assoc per level (inherent). 10-run ReleaseSafe | zig build test ×2 (diff oracle) + spot-check vector path [:a :b :c]→inc + list path (:a :b)→+10 | D-386 |

| O-026 | runtime/collection/map.zig (fromLiteralPairs/allSimpleKeys) + eval/backend/vm.zig (op_map_literal) | A map literal {:a i :b … :c …} built via an N-deep assoc fold — each assoc COPIES the whole 16-slot ArrayMap (gc_stress: ×100k 3-entry maps = 300k array-map copies) | When all keys are SIMPLE (keyword/string/int/symbol/char/bool/nil → keyEq is pure, no eval/GC) and N/2 ≤ 8, build the ArrayMap in ONE gc.alloc + a pure dedup fill (last-key-wins). The single alloc is the only allocation → the fill cannot GC → the unrooted am is safe with no rooting frame. HAMT-size / custom-= keys fall back to the assoc fold | gc_stress 41→32 ms (Python 30; 1.36×→1.07×, ~parity). 10-run ReleaseSafe | zig build test ×2 (diff oracle) + CLJW_GC_TORTURE + spot-check dedup {:a 1 :a 2}→{:a 2} + 9-key→HAMT fallback (count 9) | D-386 |

| O-027 | lang/clj/clojure/core.clj — not= 2-arg fast arity | not= was (fn* [& args] (not (apply = args))) — every call rest-packs & args + applys = (sieve's filter pred (not= 0 (mod x p)) runs per element × per filter-level) | Add fixed ([a b] (not (= a b))) (direct =, the intrinsic op_eq) + ([] false)/([a] false); the variadic clause starts at 3 args (Clojure requires the variadic's required count > every fixed arity) | sieve 32→28 ms (Python 20; 1.55×→1.4× — the residual is the nested-filter lazy-seq force, structural). Broadly helps every 2-arg not=. 10-run ReleaseSafe | zig build test ×2 (diff oracle, incl. the cache_gen syntax check that caught the illegal ([a b])+([& args]) overlap) + spot-check 0/1/2/3-arg → [false false true false true false] | D-386 |

| O-028 | eval/backend/vm.zig — hoist ip to a loop-carried register (D-386 dispatch sub-step 1) | The eval loop recomputed const cur per iteration and passed &cur.ip (arena HEAP) into stepOnce; each op loaded ip from heap + wrote it back via the per-op defer — cur.ip could not stay in a register because the pointer aliased arena memory | cur + ip are loop-carried locals; ip is synced to cur.ip (heap) ONLY at frame transitions (flatten / op_ret / catch-handler jump), so the hot non-transition path keeps ip in a register. ip is NOT a GC root (only op_top is, via gc_frame.sp), so the hoist is the SAFE deterministic half of the dispatch inline — sub-step 2 (hoisting op_top itself) is the UAF-class follow-up | Removes a per-op heap load+store of the instruction pointer. fib_recursive (fib 32) 535.9→472.5 ms (~12%) (hyperfine -N -r12, ReleaseSafe). Pure dispatch refactor — identical Values (diff oracle) | zig build test ×2 (diff oracle, TreeWalk≡VM) + CLJW_GC_TORTURE (heap-in-local across non-tail recursion →200; throw/catch carrying heap ex-data under collect →50) + try/catch + nested-rethrow + loop/recur spot-checks | D-386 |

| O-029 | eval/backend/intrinsic.zig: alloc-free fixnum arith fast path | fastBinaryFixnum add/sub/mul delegated to promote.*Promoting, which runs 5 type-checks (float/bigdec/ratio/int) then wrapI64, and allocates a heap-Long on i48-overflow / BigInt on i64-overflow, so the "fast path" was neither inline nor alloc-free | Compute the result inline (@add/sub/mulWithOverflow on i64); return the fixnum only when it fits the i48 window, else null so the slow builtin path produces the identical heap-Long / BigInt (TreeWalk already routes overflow through the builtin → diff oracle CONVERGES). Comparisons stay exact-i48 | Skips promote's type-dispatch + function call on the hot case AND makes the whole fn alloc-free (the D-386 sub-step 2 prerequisite: the VM hot arith op then needs no op_top GC sync). fib_recursive (fib 32) 472.5→419.9 ms (~11%, on top of O-028; ~22% session total) hyperfine -N -r12 ReleaseSafe | zig build test ×2 (diff oracle) + observable overflow: (* i48max 2)→Long 281474976710654, (* 9999999999999 9999999999999)→BigInt, (- i48min 1)→Long -140737488355329 + updated unit contract test | D-386 |

| O-030 | eval/backend/intrinsic.zig + vm/opcode.zig + vm.zig: fixnum mod/rem/quot intrinsic (extends the O-029 fastBinaryFixnum family) | mod/rem/quot were NOT in the intrinsic ArithOp set, so (mod x p) resolved the mod Var → generic builtin dispatch → promote.*Promoting type-checks + box (~230 ns/call, the sieve's per-element cost) | Add mod/rem/quot to ArithOp + op_mod/op_rem/op_quot (+ _local_const/_locals superinstructions); fastBinaryFixnum computes @mod/@rem/@divTrunc inline for a positive divisor (bi<=0→null defers: bi==0 raises divide_by_zero, bi<0 + the @divTrunc(i48min,-1) overflow corner go to the builtin, clj-correct); alloc-free, no op_top sync | Skips the var-deref + builtin type-dispatch on the hot (mod x p) case. micro (pos? (mod x 7)) ×1M 476→236 ms (2×); sieve(1000) cljw 26.4→23.0 ms (1.40×→1.23× py). NB @mod requires b>0 (Zig safety), so the bi<=0 guard is necessary, not merely conservative | zig build test ×2 (diff oracle TreeWalk≡VM) + new diff_test.zig mod/rem/quot case + 2 intrinsic unit tests + probes (mod -7 3)→2 / (rem -7 3)→-1 / (quot -7 3)→-2 / (mod 5 0)→raises | D-386 |

| O-031 | eval/backend/intrinsic.zig + vm/opcode.zig + vm.zig + lang/bootstrap.zig — fixnum not= intrinsic (op_ne, mirrors op_eq) | not= was left to the .clj (not (= a b)) Function (core.clj:583) — a closure call + not call per invocation (260 ns), the sieve's residual after O-030 | Add ne to ArithOp + op_ne/op_ne_local_const/op_ne_locals; fastBinaryFixnum .ne => ai != bi (fixnum-only; non-fixnum defers to the cached not= Var, full value-equality e.g. (not= 1 1.0)→true). Bootstrap fix: not= is .clj-defined, so the setupCorePrefix arith-cache (which only saw the Zig builtins) missed it — finalizeUserNs now RE-caches after core.clj loads (idempotent; core does not redefine arith) so the compiler recognises the not= Var | Skips the .clj closure + not call on the hot 2-arg fixnum case. **micro (not= 0 (mod x 7)) ×1M 485→224 ms (2.2×); sieve(1000) cljw 23.0→19.7 ms — now ≈ Python (1.01× faster, from 1.40× behind)**. The sieve loser is CLOSED | zig build test ×2 (diff oracle TreeWalk≡VM) + new diff_test.zig not= case (fixnum + local_const/locals + non-fixnum defer) + intrinsic unit test + probes (not= 1 1.0)→true / (not= :a :a)→false | D-386 |

| O-032 | lang/primitive/chunk_transform.zig (new) + -map-lazy/-filter-lazy chunked arms (core.clj) | The chunked arm of lazy map/filter ran a .clj loop/recur per element: -chunk-nth + f + chunk-append (3 prim calls + the user-fn) ×32 per chunk + the tree-walked loop glue (~77 ns/elem residual on top of the ~49 ns f-call) | -chunk-map-step [f s] / -chunk-filter-step [pred s]: drain the WHOLE source chunk in Zig (currentChunkNth → invokeCallable(f) → chunkAppend into a fresh ChunkBuffer), returning the buffer; the .clj arm keeps chunk-cons + the lazy-tail recursion + O-023 fuse stamping. The producer-side analogue of reduceFn's in-Zig chunk drain (O-004). GC-root frame [f/pred, s, buf] mirrors reduceFn; the one new root site is the growing output buffer (rooted across every invokeCallable) | Removes the per-element .clj interpreter loop (one prim call per 32-chunk replaces ~32×3 prim calls + the tree-walked loop). (count (map (fn[x]x) (range 1M))) 186→86 ms (2.16×); (count (filter (fn[x]true) (range 1M))) 216→86 ms (2.5×) hyperfine ReleaseSafe. Broad win for all map/filter over chunked sources (range/vector); sieve (filter over a LIST) + map_filter_reduce (reduce O-023 path) unchanged | zig build test ×2 (diff oracle TreeWalk≡VM) + 12-golden chunk_transform clj corpus (map/filter over range+vector, partial/empty chunks, infinite-range laziness, 32-at-a-time side-effect order) + CLJW_GC_TORTURE=1 (safepoint collect: map/filter rooting holds). NB CLJW_GC_TORTURE_ALLOC=1 trips the PRE-EXISTING D-244 #4 op_vector_literal bug via the O-023 fuse [xform coll] literal (not the producer; [1 2 3] alone panics identically) | O-004, O-023, D-386 |

| O-033 | lang/primitive/collection.zig (updateInFn/updateInRec) + update-in 3-arg (core.clj) | update-in 3-arg was .clj -update-in-idx recursion: per level (nth ks i) + (get m k) + (assoc m k …), leaf (f (get m k)) — N .clj frames + per-level prim calls (nested_update (update-in m [:a :b :c] inc) ×10000) | Zig -update-in [m ks f]: walk the vector path in Zig (get down → invokeCallable(f) at the leaf → assoc back up), one builtin replacing the .clj recursion. The .clj update-in keeps the variadic & args arity + routes only a NON-EMPTY VECTOR path to -update-in (list/empty paths stay .clj). GC-root frame [f, m, ks, child] — m transitively roots the descent chain (parents are sub-values of m); slot 3 re-roots the ascent's in-progress new map before each assoc alloc (the O-032 buf hazard) | Removes the per-level .clj recursion + prim-call overhead. 17_nested_update cljw 25→17 ms — 1.18× FASTER than Python (was 1.30× behind). Loser CLOSED. | zig build test ×2 (diff oracle) + 10-golden update_in clj corpus (nested, missing-path create, vector path, multi-arg f) + CLJW_GC_TORTURE=1 (safepoint: single + loop×1000 + 5-deep×500, rooting holds). NB CLJW_GC_TORTURE_ALLOC=1 blocked by PRE-EXISTING vector-builder infra bugs (D-244 #4 op_vector_literal + a vector-builtin integer-overflow) — not this code; the ascent assoc/hash-map ops are ALLOC-clean | O-004, O-032, D-386 |

| O-034 | runtime/regex/match.zig — ThreadList.seen generation-stamp (ADR-0147 Stage 1a; regex_count) | Pike-VM ThreadList.clear() @memset-ed the whole seen array per input position; findFrom clears both lists at every position, so the cost was O(positions × insts) memsets per match-scan | Replace seen: []bool with seen: []u32 generation stamps + a gen: u32 counter; clear() bumps gen (O(1)) instead of memset (wrap re-zeros); a pc is seen this position iff seen[pc] == gen. The correct finished-form Pike-VM sparse-set design (RE2/burntsushi) | Removes the per-position memset scaling. Bench delta WITHIN NOISE on regex_count (the \d+ program is ~5 insts, so the old memset was already ~5 bytes; cljw 100k ~0.36→~0.355s, 10k unchanged at ~0.04s). The value is algorithmic: O(1) clear regardless of program size — matters for larger patterns, and is the foundation for Stage 1b/2. NOT a claimed beat-Python win on its own | zig build test -Dwasm (all units incl. 30+ match.zig cases) + check_corpus_regression.sh regex_equivalence 48/48 (equivalence-neutral) | ADR-0147, D-447 |

| O-036 | runtime/regex/compile.zig (computeLeading + Program.leading) + runtime/regex/match.zig (scanFrom prefilter skip) (ADR-0147 Stage 2; regex_count) | Pike-VM scanFrom ran a full tryMatchAt (two ThreadList clears + addThread) at EVERY input position to find the leftmost match start — even across long stretches that provably cannot start a match (e.g. \d+ over a long alphabetic run scans every letter) | Compile-time computeLeading walks the ε-closure from pc 0 to the EXACT set of bytes that can be the first consumed byte (a 256-bit CharClass on Program.leading); scanFrom skips positions whose byte is not in the set with a cheap bitmap membership scan. Exact-or-disabled: returns null (prefilter off, current behaviour) when a match can complete via ε without consuming (nullable, e.g. a*), the first byte is undeterminable (leading look), or the set is near-full (./[^x] ≥ 250 bytes). Anchors/save are zero-width (walked through); the residual ^/\b constraint is still enforced by the VM, so an exact set never skips a startable position — equivalence-neutral. The literal-prefilter technique RE2/rust-regex/ezi-gex layer over their NFA (ADR-0147) | Replaces per-position VM restart with one bitmap test across non-matching runs. regex_count bench (digit-dense) WITHIN NOISE (17→17 ms — only ~5 skippable positions in the 15-char input). The win is on sparse inputs: re-seq #"\d+" over a ~4000-char mostly-alphabetic string ×20000 — ReleaseSafe A/B 1.35→0.05 s ≈ 27× FASTER. Scalar membership scan = portable floor; range/single-byte SIMD is a deferred accelerator (measure-first) | zig build test -Dwasm ×2 (diff oracle + 9 new computeLeading unit cases: \d+/abc/alt/a*-null/a*b/^abc/\bword/.-null/(ab)+) + check_corpus_regression.sh regex_equivalence 51/51 (3 sparse-prefilter goldens added, anti-D-177) + probes (sparse/dense/zero-width/^anchor/alt/\b all == clj) | O-034, O-035, ADR-0147, D-447 |

| O-035 | runtime/regex/match.zig (findAll + scanFrom extraction) + lang/primitive/regex.zig (re-find-all) + re-seq (core.clj) (ADR-0147 Stage 1b; regex_count) | re-seq was a .clj loop/recur over re-find-from: per match a [match start end] vector was built in Zig, returned as a Value, then deconstructed in the interpreter (nth×3, conj, =, the tree-walked loop). Direct measurement showed this .clj-layer + Value round-trip was ~70% of the per-iteration cost (the audit estimated ~30%) | re-find-all [re s]: ONE Zig scan — match.zig findAll collects all match bounds (plain structs, reusing a single ThreadList pair across every match vs re-find-from's per-call pair), then builds the PersistentList from the end via consHeap. GC-root frame [list, head] roots the growing tail + each buildMatchResult value across the cons/alloc (O-032 discipline). Empty → nil (clj (seq []) parity). re-seq body collapses to (re-find-all re s) | Removes the per-match Value round-trip + interpreter loop. regex_count cljw 100k 0.355→0.11s (3.2×); the scored 10k bench 0.04→0.01s — now FASTER than Python (0.02s). Loser CLOSED (cljw's ~12ms startup edge + the 3.3× per-iter cut). re-seq return type unchanged (PersistentList) | zig build test -Dwasm (4 new findAll unit cases incl. zero-width + no-match) + check_corpus_regression.sh regex_equivalence 48/48 + clean probes: \d+/a* zero-width/x no-match→nil/grouped vectors/type→PersistentList all == clj | O-032, O-034, ADR-0147, D-447 |

| O-037 | runtime/numeric/promote.zig (partsOf → ref-based RatioParts/OwnedParts) (ADR-0148; ratio_sum) | ratioArith (+ the quot exact path) called partsOf, which cloned a ratio operand's numerator AND denominator Managed via cloneWithDifferentAllocator (alloc + limb memcpy + free) on every call — yet the arithmetic only ever READS the parts (mul/add/sub take *const Managed). For (reduce + …) over ratios both operands are ratios, so 4 wasted Managed clones per +. | partsOf returns RatioParts { num: *const Managed, den: *const Managed }: a ratio operand aliases its stored numer.m/denom.m pointers (zero alloc); a non-ratio integer/BigInt still materialises value/1 into a caller-provided OwnedParts local (&owned.num stable for the call scope, deinit only when active). | Removes 2 Managed clones per ratio operand. 33_ratio_sum ReleaseSafe 108.1→103.3 ms (hyperfine -N, 10 runs). Modest alone (the dominant cost is the per-op gcd/divTrunc scratch + result BigInt alloc in allocFromManagedPair, the next lever); broad win for all ratio + - * / quot. | zig build test -Dwasm ×2 (diff oracle, all ratioArith/quot units) + smoke gate; output unchanged (13943237577224054960759/3099044504245996706400) | O-033, ADR-0148, D-450 |

| O-049 | runtime/equal.zig (eqConsult simple-key fast path + isSimpleEqKey) (ADR-0129/ADR-0148; destructure, gc_large_heap) | eqConsult (called per keyEq in every map-get scan / set membership) read dispatch.current_env (a macOS _tlv_get_addr call) UNCONDITIONALLY first, then ran 2 keyInstanceEq probes — all a no-op for simple keys (keyword/symbol/string/number), which can never be a seq-key nor a custom-equiv instance. destructure ({:keys [a b c]} m ×100k) profiled ~20% leaf self-time in _tlv_get_addr. | Guard at the top: if (isSimpleEqKey(a) and isSimpleEqKey(b)) return keyEqValue(a, b); — skips the TLV read + both keyInstanceEq calls. BOTH operands must be simple (an instance on either side could carry a custom equiv matching a simple value → full path), so custom-equiv/seq-key dedup is unchanged. | 37_destructure ReleaseSafe 48.3→45.9 ms (cumulative 55.0→45.9, −16.5%); 27_gc_large_heap 32.5→32.0 ms. Broad win for every simple-key map-get / set-membership / =. | zig build test -Dwasm ×2 (diff oracle TreeWalk≡VM incl. deftype custom-equiv keys) + corpus 3181; both-simple→keyEqValue, instance-side→full consult unchanged | O-043, O-048, ADR-0129, ADR-0148, D-450 |

| O-048 | eval/backend/intrinsic.zig (fastGet) (ADR-0130/ADR-0148; destructure, gc_large_heap) | The op_get intrinsic's fastGet did if (map_mod.contains(coll,k)) map_mod.get(coll,k) else nil_val — two full scans of the map per (get coll k). But map_mod.get already returns nil_val for an absent key (identical to the 2-arg nil default), so the contains pre-check was pure redundancy: every lookup scanned twice. destructure ({:keys [a b c]} m ×100k = 3 gets/iter) + gc_large_heap ((get m :val) ×100k) are get-bound. | Drop the contains pre-scan: .array_map, .hash_map => try map_mod.get(coll, k). One scan; behaviour-identical (nil for absent OR present-nil, same as before; keyEq/eqConsult error behaviour unchanged). | 37_destructure ReleaseSafe 55.0→48.3 ms (−12%); 27_gc_large_heap 33.5→32.5 ms. Broad win for every 2-arg (get map k) on the VM intrinsic path. | zig build test -Dwasm ×2 (diff oracle TreeWalk≡VM, identical Values) + corpus + smoke; map hit/miss/present-nil unchanged | O-043, ADR-0130, ADR-0148, D-450 |

| O-047 | runtime/numeric/big_int.zig (wrapManaged + add/sub/mul/divFloor direct-into-cell) + runtime/numeric/promote.zig (wrapArithCell + add/sub/mul Managed arms) (ADR-0148; bigint_factorial) | Every BigInt arith result was computed into a TEMP Managed then deep-copied to the GC cell via allocFromManaged (cloneWithDifferentAllocator = alloc + limb-memcpy + free the temp). bigint_factorial ((reduce *' …), accumulator grows to ~9 limbs) cloned the growing result on EVERY * (~100k multiplies). O-039 removed the OPERAND clones; this is the RESULT clone, the symmetric other half. The hot path is promote.zig's Managed arm (BigInt×Long), not big_int.allocMulManaged (Long×Long overflow only). | Compute the result DIRECTLY into a fresh gc.infra heap *Managed (the final cell). wrapManaged attaches the GC BigInt wrapper with NO clone; promote.zig's wrapArithCell consumes the cell — inline-Long collapse frees it, a heap wrap MOVES it (struct-copy the limbs slice, no memcpy). Ownership: two errdefers (destroy the Managed alloc + deinit its limbs) cover gc.alloc failure until the wrap stores the pointer; the collapse path frees explicitly on the success return (errdefers don't fire). | 32_bigint_factorial ReleaseSafe 21.3→19.0 ms (A/B 20 runs); cross-lang cljw 20.2 ms now FASTEST-script (python 20.4, babashka 20.7) — the 1.31×-behind target CLOSED. Broad win for ALL BigInt + - * / (every result previously cloned). | zig build test -Dwasm ×2 (diff oracle TreeWalk≡VM, identical Values) + lint + smoke (corpus 3181 + leak-detecting units exercise the move/collapse paths); output unchanged (158 = digit count of 100!) | O-039, O-037, ADR-0148, D-450 |

| O-046 | runtime/numeric/ratio.zig (canonical two-tier Ratio) + promote.zig (i64 fast paths in ratioArith/divPromoting + partsOf/coerce/sign/toI64/bigdec arms) + equal.zig (ratioKeyEq + .ratio hash) + print.zig + math.zig/json.zig/ratio_methods.zig/analyzer.zig (parts() branches) (ADR-0149; ratio_sum) | ratio_sum (reduce + (map #(/ 1 %) (range 1 51))) ×1000 was ALLOC-bound: each tiny ratio (1/x, x≤50) paid ~6 heap allocs (2 coerce + arena + 2 result BigInt + Ratio) through std.math.big. The lone far target (2.34×). | Small-ratio inline-i64 representation (ADR-0149 Alt 2): a Ratio stores numer/denom as inline i64 when both fit (no BigInt), auto-promoting to the BigInt tier on i64 overflow. CANONICAL (small iff reduced fits i64) so the rt-free equal/hash arms stay correct by construction (small-vs-big never equal; hashLong gives cross-rep hash parity). divPromoting int/int + ratioArith small⊗small use overflow-guarded i64 cross-multiply + i64 gcd (MIN_I64 → Managed fallback), allocating only the small Ratio struct. | 33_ratio_sum ReleaseSafe 81→31.6 ms (2.34×→0.90× — now FASTER than Babashka 35 ms; the LONE far target CLOSED); div-only 33→16 ms. Broad win for all small-rational arithmetic. | zig build test -Dwasm ×2 (diff oracle + ratio unit tests incl. canonical small/big/collapse + leak detector) + corpus 3157; verified vs clj: basic/collapse(1N/2N/2)/big/numerator/denominator/double/sort/bigdec + canonical: (= 1/2 (/ 1e11 2e11))→true, hash-eq→true, map-key→:half, set→true + harmonic sum exact | O-037, O-038, O-039, ADR-0149, ADR-0148, D-450 |

| O-045 | lang/primitive/higher_order.zig (reduceFn fusion gate) (ADR-0148; gc_large_heap) | The O-023 fused-reduce path fired for ANY 3-arg (reduce f init (map/filter … src)) with a fuse descriptor. Measurement showed it is a 2.5× REGRESSION for a CHUNKED source (range/vector): the .clj-transducer-per-element fused path is slower than the generic walk's in-Zig chunk drain (-chunk-map-step, O-032). (into [] (map f (range 100k))) = 39 ms fused vs 16 ms walked. gc_large_heap's (into [] (map (fn[i]{…}) (range))) paid this. | Gate the fusion: force (seq coll) once (memoised; the generic walk re-seqs the same node) and fire -fused-reduce ONLY when the realized head is non-chunked (a cons-list chain — O-023's actual win, map_filter_reduce). A chunked head falls through to the generic walk's O-004/O-032 chunk path. | 27_gc_large_heap ReleaseSafe 59→34.9 ms (1.99×→1.18× vs Babashka 29.7 ms) — the 2nd-worst target CLOSED to near-parity. map_filter_reduce (cons-list) UNREGRESSED (still fuses, 11.7 ms). Broad win for into/reduce over map/filter of range/vector. | zig build test -Dwasm ×2 (diff oracle) + corpus 3157 (covers map/filter/reduce/into; fixture lacks core so no diff_test arm) + manual: into+map / cons-list map+filter / map+filter+set all == clj | O-023, O-032, ADR-0148, D-450 |

| O-044 | eval/backend/vm/opcode.zig (op_nth2) + intrinsic.zig (fastNth2) + vm.zig dispatch + vm/compiler.zig 2-arg emit (ADR-0130 extended; gc_alloc_rate) | O-043 intrinsified only 3-arg nth; the 2-arg (nth v i) (no default — RAISES on OOB) still compiled to a generic op_call. gc_alloc_rate's (+ sum (nth v 2)) ×200K uses 2-arg nth. | op_nth2 (reuses the cached nth Var + core_coll_pristine): fastNth2 inlines an in-range vector index; every error case (OOB / negative / non-int / non-vector / nil) defers to the cached nth builtin so the raise is identical. Compiler emits op_nth2 for a recognised 2-arg nth. | 26_gc_alloc_rate ReleaseSafe 45.8→40.5 ms (1.30×→1.15× vs Babashka 35.3 ms); cumulative 108.4→40.5 (2.81×→1.15×). | zig build test -Dwasm ×2 + corpus + smoke; verified in-range (10/30), list-defer (6), OOB raises (:oob), negative raises (:neg) == clj | O-043, ADR-0130, ADR-0148, D-450 |

| O-043 | eval/backend/vm/opcode.zig (op_get/op_nth) + intrinsic.zig (CollOp/recognizeColl/fastGet/fastNth3) + vm.zig dispatch + vm/compiler.zig emit + runtime.zig (coll_vars/core_coll_pristine) + bootstrap.zig cache + core.zig deopt (ADR-0130 extended; destructure, gc_large_heap) | (get coll k) (2-arg) + (nth coll i default) (3-arg) compiled to a generic op_call: op_get_var callee push + var-resolution + BuiltinFn dispatch per call. destructure does 3 get + 3 nth/iter ×100K (0% malloc — pure dispatch); gc_large_heap does (get m :val) ×100K. | Collection-accessor intrinsic opcodes mirroring the ADR-0130 arith family: the compiler recognises the canonical get/nth Var (pointer identity) and emits op_get/op_nth (no callee push). The VM arm runs fastGet (map/nil inline = getFn 2-arg) / fastNth3 (vector inline = nthFn 3-arg vector), deferring every other collection kind to the cached Var; core_coll_pristine (cleared by alter-var-root on get/nth) deopts to the builtin so a redefinition is honoured. Allocation-free → no GC op_top sync. | 37_destructure ReleaseSafe 73.7→53.6 ms (1.68×→1.22× vs Babashka 43.9 ms); 27_gc_large_heap 63.0→59.0 ms (2.12×→1.99×, get-bound portion; closures still dominate). Broad win for all (get map k) / (nth vec i d). | zig build test -Dwasm ×2 (diff oracle + new op_get / op_nth diff test: map hit/miss, nil, vector, set, OOB/negative default, list defer, destructure=66, string/nested checkEqual) + corpus; deopt verified (alter-var-root #'get → redefinition honoured); all values == clj | O-014, O-018, ADR-0130, ADR-0148, D-450 |

| O-042 | lang/primitive/core.zig (strFn single-int fast path) (ADR-0148; string_ops) | (str i) always built a heap std.Io.Writer.Allocating (gpa buffer alloc + free) + writeStrValue + the final string.alloc — even for a single small integer. string_ops does (str i) ×100K. | Single immediate-integer arg (args.len == 1 and args[0].isInt()): std.fmt.bufPrint("{d}") into a [24]u8 stack buffer → string.alloc directly, skipping the Allocating-writer alloc+free. Heap-Long / BigInt (tag .big_int, not isInt()) keep the full path → value-exact. Multi-arg / non-int unchanged. | string_ops ReleaseSafe 28.6→23.0 ms (1.35×→1.08× vs Babashka 21.3 ms). | zig build test -Dwasm ×2 + corpus + smoke; spot-checked (str 0/-5/123) + bigint 10^24 (full value) + (str 1 2 3)→"123" + (str "x" 5)→"x5" | O-029, ADR-0148, D-450 |

| O-041 | lang/primitive/json.zig (jsonToCw array + object arms) (ADR-0148; json_parse) | read-str converted std.json.Value → cljw collections with empty + N×conj (arrays) and empty + N×assoc (objects) — N throwaway intermediate vectors/array-maps per JSON container. The bench's 200-element :users vector alone = 200 throwaway vectors; each 5-key user map = 5 throwaway array-maps ×200. | Array arm: buffer the converted elements, vector.fromSlice (one-shot). Object arm: buffer [k v …], map.fromLiteralPairs when n ≤ ARRAY_MAP_THRESHOLD (8) and keys simple (JSON keys are unique strings), else the assoc fold. Mirrors O-040 (vector literal) + O-026 (map literal). | 26_json_parse ReleaseSafe 54.3→39.0 ms (1.59×→1.14× vs Python 34.1 ms). | zig build test -Dwasm ×2 + corpus + smoke; spot-checked nested array/object + 10-key (>threshold) fallback (count 10) | O-026, O-040, ADR-0148, D-450 |

| O-040 | eval/backend/vm.zig (op_vector_literal) (ADR-0148; gc_alloc_rate) | op_vector_literal built an N-element vector via vector.empty() + N×vector.conj — each conj path-copies + allocates a fresh intermediate vector, so [i (+ i 1) (+ i 2) (+ i 3)] (gc_alloc_rate's loop body, ×200K) allocated 4 throwaway vectors per literal. sample showed the bench is NOT malloc-bound (18/3729 leaf, 0.5%) — the cost was the per-conj construction work (the map literal already had this fixed as O-026; vectors were left on the slow path). | One-shot vector.fromSlice(rt, stack[sp-n..sp]): n≤32 → 1 TailNode + 1 Vector (memcpy the elements); n>32 → bulk HAMT build. The elements stay rooted on the operand stack (op_top watermark) across the build. VM-only, mirroring O-026's map fast path (TreeWalk keeps empty+conj; diff oracle proves equality). | 26_gc_alloc_rate ReleaseSafe 108.4→45.8 ms (2.37×; 2.81×→1.30× vs Babashka 35.3 ms); system time 31→6.5 ms. Broad win for every vector literal (esp. small literals in hot loops). | zig build test -Dwasm ×2 (diff oracle, TreeWalk slow ≡ VM fromSlice) + corpus + smoke; spot-checked n=0/1/40/nested vs clj (count 40, sum 780) | O-003, O-026, ADR-0148, D-450 |

| O-039 | runtime/numeric/promote.zig (operandManaged/OwnedManaged + add/sub/mul BigInt else-branches) (ADR-0148; bigint_factorial) | The integer/BigInt arithmetic else-branch (reached when an operand is a heap BigInt) called coerceToManaged on BOTH operands, which clones a .big_int operand via cloneWithDifferentAllocator (alloc + limb memcpy) — yet r.add/sub/mul only READ them. bigint_factorial ((reduce *' …), acc grows to 9 limbs) clones the growing accumulator on every *. Same waste O-037 fixed for ratios, single-Managed flavour. | operandManaged(rt, v, *OwnedManaged) returns *const Managed: a BigInt aliases its stored Managed (zero alloc); an immediate Long materialises into a caller-stable OwnedManaged local (deinit only when active). The 3 else-branches use it; the @addWithOverflow both-int sub-paths keep coerceToManaged (both small, nothing to alias). | Removes the per-op accumulator clone. **32_bigint_factorial ReleaseSafe 26.4→22.4 ms (12 runs; 1.49×→1.27× vs Babashka 17.7 ms).** ratio_sum unchanged (separate path). Broad win for all BigInt + - *. | zig build test -Dwasm ×2 (diff oracle) + corpus + smoke; output unchanged (158 = digit count of 100!) | O-037, ADR-0148, D-450 |

| O-038 | runtime/numeric/promote.zig (ratioArith scratch) + runtime/numeric/ratio.zig (allocFromManagedPair reduce-scratch) (ADR-0148; ratio_sum) | Each ratio op allocated its transient cross-multiply + gcd/divTrunc scratch as individual Managed.init(rt.gc.infra) calls — ratioArith 2-4 (rn/rd/lhs/rhs) + allocFromManagedPair 4 (gcd/r_num/r_den/rem) — so ~8 GPA malloc/free per +. sample profiling showed libsystem_malloc (_xzm_*malloc/_xzm_free/memset/bzero) ≈ 20% of leaf self-time. | Each function opens ONE std.heap.ArenaAllocator over rt.gc.infra and allocates all its scratch Managed from it; defer arena.deinit() bulk-frees on return. The two arenas don't alias (rn/rd live on ratioArith's arena, read inside allocFromManagedPair which has its own). Result BigInts (numer/denom) still clone onto gc.infra (GC-managed, outlive the call). | Collapses ~8 individual malloc/free into ~2 arena chunk alloc/free per op. 33_ratio_sum 103.3→82.9 ms (12 runs); combined with O-037 108.1→82.9 (23%). Re-profile: mem-leaf 227/1950 ≈ 11.6% (alloc no longer dominant; residual is interpreter dispatch + bignum compute → D-386). Broad win for all ratio arithmetic. | zig build test -Dwasm ×2 (diff oracle) + corpus 3157/3157 + smoke gate; output unchanged | O-037, ADR-0148, D-450 |

| O-051 | runtime/collection/map.zig (arrayMapKeywordSlot helper + get/contains keyword fast path) (ADR-0165/ADR-0148; destructure, gc_large_heap) | An .array_map get/contains scanned every entry through keyEq→equal.eqConsult→isSimpleEqKey×2→keyEqValue (a !bool error-union chain) even when the lookup key is a keyword — yet keywords are INTERNED, so = over two keywords IS NaN-box payload bit-identity. destructure ({:keys [a b c]} m ×100k = 3 keyword gets/iter) + every keyword config-map lookup pay the chain per entry. (O-049 cut the TLV read inside eqConsult; this removes the call chain + error union entirely for the dominant keyword case.) | arrayMapKeywordSlot(am, kw): when k.tag()==.keyword, scan am.entries comparing @intFromEnum(entry_key)==@intFromEnum(kw) inline (no call, no try), returning the entry index or null. get/contains take it on the keyword branch; non-keyword keys keep the general keyEq path. Correct because keyword interning makes bit-equality ⟺ =, and a keyword can only bit-match a keyword entry (distinct NaN-box tags). The >8-entry .hash_map path is unchanged (keyword-keyed maps are overwhelmingly small). | clean old-vs-new ReleaseSafe binary A/B (hyperfine -N, 30 runs, same session): 37_destructure 51.5→48.1 ms (−6.6%); 27_gc_large_heap 33.2→31.7 ms (−4.5%); 300k-get microbench 23.7→21.1 ms (−11.0%); map-destructure micro 34.8→32.6 (−6.3%). Broad win for every keyword (get map k) / (contains? map k) / :keys destructure. | zig build test -Dwasm ×2 (diff oracle TreeWalk≡VM, identical Values) + new map.zig unit test (keyword hit/miss + mixed keyword/int/string keys, re-interned keyword identity) + lint | O-048, O-049, O-026, ADR-0165, ADR-0148, D-450, D-520 |

| O-050 | runtime/numeric/promote.zig (ratioArith add/sub branch) + runtime/numeric/ratio.zig (allocFromReducedManagedPair/finishReducedPair) (ADR-0148; ratio_sum) | The BIG-tier ratio add (an*bd ± bn*ad)/(ad*bd) reduced via a full gcd in allocFromManagedPair. O-046's i64 small tier covers tiny ratios, but the harmonic-sum ACCUMULATOR ((reduce + …)) grows past i64 → big-tier add ~45×/call, each paying the big gcd. | Knuth TAOCP 4.5.1 / GMP mpq_add gcd-first add: g = gcd(ad, bd) FIRST (the added term's denom ≤ 50 → a tiny gcd), reduce by g; coprime denominators (g=1) SKIP the final gcd entirely, else a small gcd(t, g≤50). Result is coprime → allocFromReducedManagedPair (factored post-gcd tail of allocFromManagedPair, no redundant final gcd; zero-numerator→int-0 guarded). mul/div keep the full reduce. | 33_ratio_sum ReleaseSafe 39.6→31.0 ms (hyperfine -N, 15 runs, 1.28×±0.20 over the pre-Knuth baseline, same machine/session A/B). Complements O-046 (small tier); broad win for all big-tier ratio +/-. | zig build test -Dwasm ×2 (diff oracle) + corpus 3656 + 10 clj-verified edge cases (coprime / g>1 reduction / zero / integer-collapse / negative / large-coprime / harmonic) | O-037, O-046, ADR-0148, D-450 |
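
The gcd-first addition named in the O-050 row can be sketched on plain integers. This is a minimal Python illustration of the Knuth TAOCP 4.5.1 scheme, not cljw's Zig code: the name `add_reduced` is hypothetical, and the sketch assumes both inputs are already in lowest terms with positive denominators.

```python
from math import gcd


def add_reduced(an, ad, bn, bd):
    """Return (num, den) for an/ad + bn/bd in lowest terms.

    Assumes an/ad and bn/bd are each already reduced and ad, bd > 0.
    """
    g = gcd(ad, bd)
    if g == 1:
        # Coprime denominators: the cross-multiplied sum is already
        # in lowest terms, so no final gcd is needed.
        return an * bd + bn * ad, ad * bd
    # Divide the common factor out of the denominators before multiplying.
    t = an * (bd // g) + bn * (ad // g)
    if t == 0:
        # Zero numerator: collapse to the canonical 0/1.
        return 0, 1
    # Any factor still shared with the denominator must divide g,
    # so the second gcd is against the small g, not the big product.
    g2 = gcd(t, g)
    return t // g2, (ad // g) * (bd // g2)


def harmonic(n):
    """Sum 1/1 + 1/2 + ... + 1/n using add_reduced as the accumulator step."""
    num, den = 0, 1
    for k in range(1, n + 1):
        num, den = add_reduced(num, den, 1, k)
    return num, den
```

In a harmonic-style accumulation the added term's denominator is small, so both gcd calls stay cheap even when the accumulator itself has grown large; that is the property the row relies on.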

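The generation-stamp clear described in the O-034 row is a small, general technique. The sketch below is a Python illustration under stated assumptions, not the match.zig implementation: the class reuses the ThreadList name from the row, but the `add`/`contains` methods and the `GEN_MAX` constant are hypothetical stand-ins for a u32 counter.

```python
class ThreadList:
    """'Seen' set over program counters with an O(1) clear.

    seen[pc] holds the generation in which pc was last added; a pc is
    a member only if its stamp equals the current generation.
    """

    GEN_MAX = 2**32 - 1  # mimics a u32 counter (assumption for the sketch)

    def __init__(self, n_insts):
        self.seen = [0] * n_insts  # stamp 0 means "never seen"
        self.gen = 1

    def clear(self):
        if self.gen == self.GEN_MAX:
            # Counter would wrap: pay one full re-zero, then restart.
            self.seen = [0] * len(self.seen)
            self.gen = 1
        else:
            # Common case: bumping the generation invalidates every stamp.
            self.gen += 1

    def add(self, pc):
        """Mark pc as seen; return True only if it was not yet a member."""
        if self.seen[pc] == self.gen:
            return False
        self.seen[pc] = self.gen
        return True

    def contains(self, pc):
        return self.seen[pc] == self.gen
```

The cost of `clear` no longer depends on the number of instructions, except for the single re-zero each time the counter wraps.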
Identified high-ROI candidates (measured, not yet implemented)

Ranked by ROI (impact × frequency / effort·risk). Measured 2026-05-31 on mac-arm-m4pro, startup baseline 0.48s subtracted where noted.

persistent! bulk trie build — DONE (O-003, D-180 discharged). transient_vector.toPersistent now calls vector.fromSlice, which builds the HAMT trie bottom-up from the flat buffer in O(n) instead of N persistent conjs (O(n log n)); into/vec route editable targets through transient/conj!/persistent!. Measured: (count (vec (range 1e6))) 121s → 2.4s; (reduce + (vec (range 1e6))) 123s → 2.5s. Verified by the fromSlice-vs-conj boundary unit test + diff oracle. The residual ~2.4s is per-element reduce/conj! interpreter dispatch + the lazy-seq walk of from — addressed by D-163 (fusion) / D-140 (startup), not persistent!. (The map/set arm of the routing is correctness-enabling, not a perf win: routing into {} / into #{} through transients required completing the transient hash map for > 8 entries — ADR-0064, which delegates to the persistent HAMT (O(n log n), no map speedup). The in-place editable-CHAMP transient that would make maps faster is deferred to D-181. The vector arm is the measured O-003 win.)
cljw startup ≈ 0.48s per invocation — D-140. NOTE (corrected 2026-05-31): clojure.core is ALREADY AOT-restored from an embedded bytecode envelope, NOT re-parsed. ADR-0056 (Cycle 2b) built cache_gen → AOT-compiles core.clj to a bytecode envelope at build time → @embedFile'd into the cljw binary as bootstrap_cache; runner.zig setupCoreAot restores core from it on every cljw -e/ file/stdin run (the prior "re-parses+analyses+evaluates core.clj" description was stale — it predates Cycle 2b). So the residual ~0.48s is NOT core re-eval; it is (unprofiled) the envelope RESTORE (deserialize + run the op_def chunks to intern ~hundreds of core vars on the VM) + primitive.registerAll + process spawn + the full-self-exe read tryRunEmbedded does to check for a trailer (a footer-only seek would avoid it — noted in builder.zig:135). The 11 non-core .clj files (string/set/walk/…) are lazy on require, so a minimal cljw -e 1 does not load them. Next step = PROFILE the 0.48s (the bottleneck moved since the docs were written), then targeted tuning (footer seek / faster restore) + ADR-0056 Cycle 3 (AOT or lazy-defer the non-core files). Architecture already exists (ADR-0056); this is profile-and-tune, not a new cache. Highest dev-velocity ROI (every test/probe pays it).
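
The footer-only seek mentioned for the trailer check can be illustrated in general terms. The sketch below is Python, not the runner's Zig code, and the trailer layout is entirely an assumption made for illustration: `MAGIC`, `FOOTER`, and `read_trailer` are hypothetical names, and the real trailer format used by cljw is not described in this ledger.

```python
import os
import struct

# Hypothetical layout: [exe bytes][payload][u64 LE payload length][8-byte magic]
MAGIC = b"CLJWTRL1"
FOOTER = struct.Struct("<Q8s")


def read_trailer(path):
    """Return the embedded payload, or None if the file has no trailer.

    Reads only the fixed-size footer and, when present, the payload
    itself; the rest of the file is never read.
    """
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        if size < FOOTER.size:
            return None
        f.seek(size - FOOTER.size)
        length, magic = FOOTER.unpack(f.read(FOOTER.size))
        if magic != MAGIC or length > size - FOOTER.size:
            # No trailer, or a corrupt length field.
            return None
        f.seek(size - FOOTER.size - length)
        return f.read(length)
```

For a binary without a trailer this touches 16 bytes rather than the whole executable, which is why the text flags it as a candidate for profiling rather than as a measured win.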

Out-of-scope future optimizations (tracked, not yet implemented)

(none currently — the map/filter/take reduce-fusion that was listed here landed as O-004 / D-163 first cycle.)
