index: 3.7x faster posting list construction via direct-indexed ASCII array#1020
Conversation
Three targeted changes to newSearchableString, the hot path where ~54%
of CPU was spent on runtime memory management (memclr, memmove, madvise,
mapassign):
1. Merge postings and lastOffsets maps into a single map[ngram]*postingList.
Pointer-stored values mean the map is only written when a new ngram is
first seen (~282K writes for kubernetes) rather than on every trigram
occurrence (~169M). This cuts per-trigram map operations from 4 to ~1.
2. Pre-size the map using estimateNgrams(shardMaxBytes) and pre-allocate
each posting list to 1024 bytes, reducing slice growth events and
eliminating map rehashing.
3. Pool postingsBuilder instances via sync.Pool on the Builder, so
sequential and parallel shard builds reuse map and slice allocations
across shards instead of re-creating them.
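The pointer-value trick in change 1 can be sketched as follows. The type and field names here are hypothetical stand-ins for illustration, not the actual zoekt code:

```go
package main

import "fmt"

// Hypothetical stand-ins for the PR's ngram/postingList types.
type ngram uint64

type postingList struct {
	data       []byte
	lastOffset uint32
}

// addOccurrences records n occurrences of one ngram and returns how many
// times the map itself was written. Because the map stores a pointer, only
// the first sighting of an ngram writes to the map; every later occurrence
// mutates the postingList through the pointer.
func addOccurrences(postings map[ngram]*postingList, ng ngram, n int) int {
	mapWrites := 0
	for off := uint32(0); off < uint32(n); off++ {
		pl := postings[ng]
		if pl == nil {
			pl = &postingList{data: make([]byte, 0, 64)}
			postings[ng] = pl // the only map write
			mapWrites++
		}
		pl.data = append(pl.data, byte(off-pl.lastOffset)) // delta-encoded offset
		pl.lastOffset = off
	}
	return mapWrites
}

func main() {
	postings := map[ngram]*postingList{}
	writes := addOccurrences(postings, 42, 5)
	fmt.Println(writes, len(postings[42].data)) // 1 5
}
```

With the previous two-map layout, each occurrence needed reads and writes on both `postings` and `lastOffsets`; here the map is touched once per unique ngram.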
Benchmarked on kubernetes (22.9K files, 169 MB, Apple M1 Max):
Cold path (NewSearchableString):
Time: 9.3s → 5.3s (-43%)
Allocs: 901K → 677K (-25%)
B/op: 1358 → 1536 MB (+13%, from larger pre-alloc)
Warm path (pooled reuse across shards):
Time: 9.3s → 5.1s (-45%)
Allocs: 901K → 23K (-97%)
B/op: 1358 → 0.6 MB (-99.96%)
WritePostings: ~155ms, unchanged.
Replace map lookups with a direct-indexed [2M]*postingList array for
ASCII trigrams (3 × 7-bit runes = 21-bit index, 16 MB of pointers).
Since 99%+ of trigrams in source code are pure ASCII, this eliminates
nearly all hash and probe overhead from the hot loop. Non-ASCII
trigrams still fall back to the map.
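A minimal sketch of the 21-bit packing; the exact packing order used in the PR is an assumption here, since any fixed order works as long as it is applied consistently:

```go
package main

import "fmt"

// asciiIndex packs three 7-bit ASCII bytes into a 21-bit index into a
// direct-addressed [1<<21]*postingList array, replacing a hash lookup.
func asciiIndex(a, b, c byte) uint32 {
	return uint32(a)<<14 | uint32(b)<<7 | uint32(c)
}

func main() {
	fmt.Println(1 << 21)                   // 2097152 slots (~2M)
	fmt.Println((1 << 21) * 8 / (1 << 20)) // 16 MB of 8-byte pointers
	fmt.Println(asciiIndex('f', 'o', 'o')) // deterministic slot for "foo"
}
```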
Also inline the ASCII check (data[0] < utf8.RuneSelf) to avoid
utf8.DecodeRune function call overhead on the 95-99% of bytes that
are ASCII.
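The inlined check can be sketched like this (a simplified stand-alone version, not the PR's actual hot loop):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeNext sketches the inlined fast path: bytes below utf8.RuneSelf
// (0x80) are single-byte ASCII runes, so the utf8.DecodeRune call is
// skipped entirely for them.
func decodeNext(data []byte) (rune, int) {
	if data[0] < utf8.RuneSelf {
		return rune(data[0]), 1 // ASCII fast path, no function call
	}
	return utf8.DecodeRune(data) // multi-byte fallback
}

func main() {
	r, n := decodeNext([]byte("a"))
	fmt.Println(r, n) // 97 1
	r, n = decodeNext([]byte("é"))
	fmt.Println(r, n) // 233 2
}
```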
In writePostings, collect ngrams from both the array and map into a
single sorted slice for writing. The on-disk format is unchanged.
Benchmarked on kubernetes (22.9K files, 169 MB, Apple M1 Max):
Cold path (NewSearchableString):
Before (with map opt): 5.3s, 677K allocs, 1536 MB
After: 2.5s, 676K allocs, 1544 MB
Speedup: 2.1x (3.7x vs original baseline)
Warm path (pooled reuse):
Before (with map opt): 5.1s, 23K allocs, 0.6 MB
After: 2.3s, 23K allocs, 0.6 MB
Speedup: 2.2x (4.0x vs original baseline)
WritePostings: ~130ms, unchanged.
The non-ASCII map now holds only ~6K entries (vs ~282K before), since
the vast majority of trigrams are served by the direct array.
Update three comments that still referenced map-only storage after the direct-indexed ASCII array was added: postingList doc, estimateNgrams doc, and reset() doc.
Haven't looked at code yet, just responding to the description
Clever! Can you quantify how much of a difference this optimization makes? I would have thought we are dominated by what happens in
I'd also be interested in how effective this is. I was under the impression the need for sync.Pool was pretty rare these days after improvements to the Go runtime. However, I did just grep the Go stdlib and I see it is still pretty heavily used.
keegancsmith
left a comment
After reading all the code, I'd be interested in the impact of this, but not with a normal Go benchmark. Rather by calling something like zoekt-git-index on a repo, using hyperfine to measure the impact.
```go
// postingsPool reuses postingsBuilder instances across shard builds,
// retaining their map and slice allocations to avoid repeated
// memclr/madvise overhead.
postingsPool sync.Pool
```
I'm kinda surprised this pool has a hit rate. Under what conditions did you test this? Did you use a smaller shard size?
The kubernetes repo is a handful of shards, right? With concurrency, the opportunity to reuse an older postingsBuilder before garbage collection has had a chance to free stuff from sync.Pool seems low?
Maybe the GC just doesn't free as often as I would expect from sync.Pool.
If you are just testing this with all the normal defaults (including GOGC being on, etc.) and it's working, cool, let's use this optimization. If you are tweaking the defaults somehow, then we could probably do better by maintaining our own free list instead of relying on sync.Pool.
Default Parallelism is 4, so up to 4 shards build concurrently. sync.Pool has per-P caching — each goroutine Gets/Puts on its own P's cache without contention. After a goroutine finishes and Puts, the next goroutine scheduled on that P Gets the cached builder. Worst case (GC clears the pool between shards) is a fresh allocation, identical to no pool.
Tested with normal defaults (GOGC=100, no tuning).
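The Get/Put pattern described above, as a minimal sketch with hypothetical names (not the actual Builder code):

```go
package main

import (
	"fmt"
	"sync"
)

// Stand-in for the pooled builder; the real one holds maps and arrays.
type postingsBuilder struct{ buf []byte }

var pool = sync.Pool{
	New: func() any { return &postingsBuilder{buf: make([]byte, 0, 1024)} },
}

// buildShard reuses a cached builder when one is available. A pool miss
// just falls through to New, identical to allocating fresh without a pool.
func buildShard() *postingsBuilder {
	pb := pool.Get().(*postingsBuilder)
	pb.buf = pb.buf[:0] // reset length, keep capacity from previous build
	// ... build postings into pb.buf ...
	return pb
}

func main() {
	pb := buildShard()
	pool.Put(pb) // return the builder for the next shard build
	pb2 := buildShard()
	fmt.Println(cap(pb2.buf) >= 1024) // true
}
```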
Do you get hits outside of benchmarks though? Mind testing end-to-end with something like zoekt-git-index (or some other zoekt binary) and adding in a log line or something on if we successfully get a value out of this sync.Pool.
I agree this is a useful optimization, especially for much larger repos, or when the shard size is tuned down so we have more shards. However, I have a suspicion that in practice sync.Pool has a bad hit rate here, since GC would have run by the time sync.Pool.Get is called. So we would need to implement our own free list rather than just plugging in sync.Pool to get the benefit in practice.
But this is all a suspicion of mine, would be great to get confirmation.
Instrumented getPostingsBuilder with a log line and tested end-to-end with zoekt-git-index on kubernetes (23K files, 162 MB):
| Shard size | Shards | Gets | Hits | Misses | Hit rate |
|---|---|---|---|---|---|
| 100 MB (default) | 3 | 7 | 2 | 5 | 29% |
| 10 MB | ~25 | 51 | 30 | 21 | 59% |
Your suspicion is partially confirmed — at default shard size (few large shards), the hit rate is poor. But with smaller shards it reaches 59%, and the pool never hurts (miss = fresh allocation = status quo).
Since Go 1.13, sync.Pool uses a victim cache that survives one extra GC cycle, so objects aren't cleared immediately — they get ~2 cycles. The 29% at 100 MB shards reflects that shard builds are still long enough for 2+ GC cycles to pass.
I think keeping the pool is the right call:
- No downside on miss — same as no pool
- 59% at 10 MB shards is meaningful, especially for users tuning shard size down
- Larger repos with more shards will also see higher hit rates
- Zero config — no sizing decisions like a channel-based free list would require
Separately, I added a sparse index (asciiPopulated []uint32) that tracks populated ASCII array slots. This makes reset() and writePostings iterate only ~275K populated entries instead of scanning all 2M slots, while still retaining all postingList allocations for pool reuse. The warm-path benchmark confirms identical B/op (0.55 MB) and allocs/op (22,978) to the original.
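A rough sketch of how such a sparse index could work; everything here besides the asciiPopulated idea (the bookkeeping details and the len(pl.data)==0 re-detection) is an assumption for illustration:

```go
package main

import "fmt"

type postingList struct{ data []byte }

// postingsBuilder pairs the direct-indexed ASCII array with a sparse list
// of the slots actually in use, so reset and iteration are O(populated)
// rather than O(2M).
type postingsBuilder struct {
	ascii          [1 << 21]*postingList
	asciiPopulated []uint32 // indices of slots used in the current build
}

func (pb *postingsBuilder) add(idx uint32, b byte) {
	pl := pb.ascii[idx]
	if pl == nil {
		pl = &postingList{}
		pb.ascii[idx] = pl
		pb.asciiPopulated = append(pb.asciiPopulated, idx)
	} else if len(pl.data) == 0 {
		// Slot retained from a previous pooled build: record it again.
		pb.asciiPopulated = append(pb.asciiPopulated, idx)
	}
	pl.data = append(pl.data, b)
}

// reset truncates only the populated lists in place, keeping their
// allocations alive for the next pooled build.
func (pb *postingsBuilder) reset() {
	for _, idx := range pb.asciiPopulated {
		pb.ascii[idx].data = pb.ascii[idx].data[:0]
	}
	pb.asciiPopulated = pb.asciiPopulated[:0]
}

func main() {
	pb := &postingsBuilder{}
	pb.add(123, 1)
	pb.add(123, 2)
	pb.reset()
	fmt.Println(len(pb.asciiPopulated), pb.ascii[123] != nil) // 0 true
}
```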
```go
content := b.getPostingsBuilder()
name := b.getPostingsBuilder()
shardBuilder := newShardBuilderWithPostings(content, name)
if err := shardBuilder.setRepository(&desc); err != nil {
```
An error here is super rare; is it worth complicating matters by doing a put after this?
index/builder.go (outdated):

```go
	return newPostingsBuilder(b.opts.ShardMax)
}

func (b *Builder) putPostingsBuilder(pb *postingsBuilder) {
```
Maybe make this take a shardBuilder instead, and have it put both contentPostings and namePostings. And finally have it set those fields to nil, so if we misuse this, something crashes in a more obvious way.
Done — returnPostingsBuilders(*ShardBuilder) puts both and nils the fields. ad39bc9.
index/shard_builder.go (outdated):

```go
newPostingsBuilder(defaultShardMax),
newPostingsBuilder(defaultShardMax),
```
All the places we call newShardBuilder should be able to pass in the actual shard max?
Done — newShardBuilder(shardMax int), 0 means default. merge.go and tests pass 0. Builder path already had it via the pool. ad39bc9.
index/write.go (outdated):

```go
		all = append(all, ngramPosting{k, pl})
	}
}
sort.Slice(all, func(i, j int) bool { return all[i].ng < all[j].ng })
```
minor: slices.SortFunc now exists and generally reads better / is faster.
```go
// git clone --depth=1 https://github.com/kubernetes/kubernetes /tmp/k8s
// ZOEKT_BENCH_REPO=/tmp/k8s go test ./index/ -bench=BenchmarkPostings -benchmem -count=5 -timeout=600s

func requireBenchRepo(b *testing.B) string {
```
FYI, we have some infra to download copies of code for these sorts of tests in internal/e2e/e2e_rank_test.go. Not sure if it's worth using here, but if there is a way to share ideas/infra with that, that would be great.
Good pointer, thanks. Will look into aligning with e2e_rank_test.go infra in a follow-up.
```go
// BenchmarkPostings_Reuse measures the warm path: building postings with a
// reset (pooled) postingsBuilder that retains its map and slice allocations
// from a previous shard build.
func BenchmarkPostings_Reuse(b *testing.B) {
```
I don't think this accurately measures the actual impact of reuse, since we won't be calling newPostingsBuilder in a hot loop. Instead it will be called once per shard.
You're right — the Reuse benchmark measures allocation reduction potential, not realistic shard-building throughput. The real end-to-end impact measured via hyperfine on kubernetes:
Without ctags:
Benchmark 1 (before): 11.373 s ± 0.174 s
Benchmark 2 (after): 7.866 s ± 0.378 s (1.45x faster)
With ctags:
Benchmark 1 (before): 20.478 s ± 0.465 s
Benchmark 2 (after): 16.588 s ± 0.119 s (1.23x faster)
- Replace putPostingsBuilder with returnPostingsBuilders(*ShardBuilder), which returns both content and name builders to the pool and nils the fields, so any subsequent misuse panics obviously.
- Drop error-path pool returns in newShardBuilder: setRepository errors are extremely rare (invalid templates or >64 branches), not worth the code complexity.
- Thread shardMax through newShardBuilder(shardMax int). Callers without a shard-size context (merge, tests, public API) pass 0 for the default (100 MB). Builder.newShardBuilder passes b.opts.ShardMax via the pool.
- Switch sort.Slice to slices.SortFunc in writePostings for type safety and to avoid interface boxing overhead.
Per-optimization breakdown (microbenchmark on kubernetes): The ASCII array accounts for the majority of the remaining speedup (2.1x on top of the map optimization). The map lookup was indeed the dominant cost after fixing the allocation pattern — 169M hash+probe ops at ~20ns each. The sync.Pool benefit is mostly allocation reduction (901K → 23K allocs on reuse), not wall clock. For kubernetes (3 shards) the wall-clock gain is marginal; it matters more for repos that produce many small shards. End-to-end, shard output is byte-identical (only the metadata timestamp differs).
keegancsmith
left a comment
LGTM! I have a few questions inline, but once answered will merge. Thanks for the great contribution.
index/shard_builder.go (outdated):

```go
func newPostingsBuilder() *postingsBuilder {
	// Initial capacity for each posting list's byte slice. Empirically,
	// the average posting list in a source-code corpus is ~900 bytes
```
Which corpus did you use? I think the median is a better value than the average from that corpus.
This might increase memory usage for large shards, right? I also think it's kind of a power law — the most common number of entries is 1. But yeah, I suspect with GC etc., overall it's probably better to use the average or median value to avoid lots of growing.
Measured on kubernetes (282K unique trigrams, 162 MB). You're right — textbook power law:
| Percentile | Size (bytes) |
|---|---|
| p25 | 4 |
| Median | 10 |
| p75 | 47 |
| p90 | 386 |
| p95 | 1,343 |
| Mean | 1,017 |
70% of posting lists are under 32 bytes. The mean was misleading — driven by a tiny tail (max: 8.4 MB).
With cap=1024: pre-allocates 275 MB, wastes 244 MB (88.5%).
With cap=64: pre-allocates 17 MB, covers 78% without growth.
Changed to 64. Cold-path B/op dropped 1,558 → 1,352 MB (-13%). End-to-end zoekt-git-index is unchanged (16.65s vs 16.68s). Shard output byte-identical.
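A quick check of the pre-allocation arithmetic above; the figures follow from list count times capacity, using the ~282K unique-trigram count from the measurement:

```go
package main

import "fmt"

func main() {
	const trigrams = 282_000 // unique trigrams measured on kubernetes
	const mb = 1 << 20
	fmt.Println(trigrams * 1024 / mb) // MB pre-allocated at cap=1024 -> 275
	fmt.Println(trigrams * 64 / mb)   // MB pre-allocated at cap=64 -> 17
}
```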
Also added a sparse index (asciiPopulated []uint32) so reset() and writePostings are O(populated) instead of O(2M). writePostings went from ~500ms to ~73ms (-85%) in benchmarks.
Thanks for the thorough review! Inline questions answered, and pushing the changes now. Summary of what changed:
Cold path (microbenchmark):
| Metric | Before | After | Change |
|---|---|---|---|
| B/op | 1,558 MB | 1,352 MB | -13% |
| Time | 2,522 ms | 2,501 ms | ~neutral |
Warm path (pooled reuse): Identical — 551 KB B/op and 22,978 allocs/op.
End-to-end zoekt-git-index via hyperfine:
| | Mean ± σ |
|---|---|
| Before | 16.65s ± 0.26s |
| After | 16.68s ± 0.12s |
No regression. Lower variance. Shard output byte-identical.
Address review feedback: - Add asciiPopulated []uint32 sparse index so reset() and writePostings iterate only populated slots (~275K) instead of scanning all 2M. Retains postingList allocations for pool reuse via len(pl.data)==0 detection on the hot path. - Reduce initialPostingCap from 1024 to 64. On kubernetes (282K trigrams), median posting list is 10 bytes and 78% are under 64. Drops pre-allocation waste from 244 MB to 11 MB. Cold-path B/op: 1558 MB → 1352 MB (-13%).
Merge activity
Summary
Reduces `newSearchableString` build time by 3.7x (9.3s → 2.5s on kubernetes) through three targeted changes to the trigram posting list construction hot path, where ~54% of CPU was previously spent on Go runtime memory management:
1. Merge the `postings` and `lastOffsets` maps into a single `map[ngram]*postingList`. Pointer-stored values mean the map is only written when a new ngram is first seen (~282K writes) rather than on every trigram occurrence (~169M). Cuts per-trigram map operations from 4 to ~1.
2. Replace map lookups with a direct-indexed `[2M]*postingList` array for ASCII trigrams (3 × 7-bit runes = 21-bit index, 16 MB). Eliminates hash/probe overhead for 99%+ of trigrams in source code. Non-ASCII trigrams fall back to the map.
3. Pool `postingsBuilder` via `sync.Pool` so sequential/parallel shard builds reuse map, array, and slice allocations across shards.
Also adds an ASCII fast path (`data[0] < utf8.RuneSelf`) to skip `utf8.DecodeRune` call overhead, and pre-sizes maps and posting slices based on shard size.
Benchmarks
On kubernetes (22.9K files, 169 MB), Apple M1 Max:
Cold path (`NewSearchableString`): 9.3s → 2.5s (3.7x), 901K → 676K allocs.
Warm path (pooled reuse across shards): 9.3s → 2.3s (4.0x), 901K → 23K allocs, 1358 MB → 0.6 MB B/op.
Write path (`WritePostings`): ~155ms → ~130ms. Unchanged.
Context
Filed as #1017. seek wraps zoekt as a local CLI for AI coding agents — every search is a blocking tool call, so cold-index speed matters.
Test plan
- `go test ./index/ -race` passes (all 50+ tests including golden shard `TestBuildv16`)
- `go vet` and `golangci-lint` clean (no new issues)
- Non-ASCII fallback covered by existing Unicode tests (`TestUnicodeVariableLength`, `TestAndOrUnicode`)
- Verified on a larger repo (torvalds/linux)
- `pprof` confirms runtime memory management dropped from ~54% to <20%

🤖 Generated with Claude Code