index: add hybrid go-re2 engine for large file content matching #1024
Merged
keegancsmith merged 1 commit into sourcegraph:main on Mar 24, 2026
keegancsmith
approved these changes
Mar 24, 2026
Member
keegancsmith
left a comment
Approving but some inline feedback. Will land after your response
internal/hybridre2/hybridre2.go
Outdated
    // EnvThreshold is the environment variable name controlling the size
    // threshold (bytes) at which go-re2 is used instead of grafana/regexp.
    // Set to -1 (default) to disable go-re2 entirely, 0 to always use it.
    EnvThreshold = "ZOEKT_RE2_THRESHOLD_BYTES"
Member
minor: this and disabled don't need to be exported
internal/hybridre2/hybridre2.go
Outdated
    // Threshold returns the configured byte threshold, reading ZOEKT_RE2_THRESHOLD_BYTES
    // from the environment exactly once. Negative means disabled; zero means always use RE2.
    func Threshold() int64 {
Member
minor: you can simplify and use sync.OnceValue
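A minimal sketch of what that suggestion could look like, assuming Go 1.21+ (where `sync.OnceValue` was added). The names `envThreshold`, `parseThreshold`, and `threshold` are illustrative and not taken from the PR, and treating an empty or unparsable value as -1 (disabled) is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"sync"
)

// envThreshold is unexported here, in line with the earlier review comment.
const envThreshold = "ZOEKT_RE2_THRESHOLD_BYTES"

// parseThreshold converts the raw environment value into a byte threshold.
// Assumption of this sketch: empty or unparsable values mean -1 (disabled).
func parseThreshold(raw string) int64 {
	if raw == "" {
		return -1
	}
	v, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return -1
	}
	return v
}

// threshold reads the environment on the first call only; later calls
// return the cached result without touching the environment again.
var threshold = sync.OnceValue(func() int64 {
	return parseThreshold(os.Getenv(envThreshold))
})

func main() {
	fmt.Println("threshold:", threshold())
}
```

Compared with a hand-rolled `sync.Once` plus a package-level variable, this keeps the cached value private to the closure, so nothing else can overwrite it by accident.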
internal/hybridre2/hybridre2_test.go
Outdated
    // NOT safe for concurrent use: do not call t.Parallel() inside f, and do not
    // use this helper from TestMain or init(); if Threshold() fires its sync.Once
    // before withThreshold sets cachedThresh, the value will be overwritten.
    func withThreshold(thresh int64, f func()) {
Member
minor: more idiomatic for test code to have a helper use t.Cleanup rather than passing in a function
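A rough sketch of the `t.Cleanup` shape being suggested. `cachedThresh` is the variable named in the comment above; `setThreshold`, the `cleanupT` interface, and `fakeT` are hypothetical names introduced here so the example can run outside `go test`. In a real test file the helper would more likely take `testing.TB` directly.

```go
package main

import "fmt"

// cleanupT is the subset of *testing.T this helper needs.
// *testing.T and *testing.B both satisfy it.
type cleanupT interface {
	Helper()
	Cleanup(func())
}

// cachedThresh stands in for the package-level cached threshold.
var cachedThresh int64 = -1

// setThreshold overrides the threshold for the current test and registers
// a cleanup that restores the previous value when the test finishes.
func setThreshold(t cleanupT, thresh int64) {
	t.Helper()
	prev := cachedThresh
	cachedThresh = thresh
	t.Cleanup(func() { cachedThresh = prev })
}

// fakeT records cleanups so this sketch can run as a plain program.
type fakeT struct{ cleanups []func() }

func (f *fakeT) Helper()           {}
func (f *fakeT) Cleanup(fn func()) { f.cleanups = append(f.cleanups, fn) }

// finish runs cleanups in last-added, first-called order, as testing.T does.
func (f *fakeT) finish() {
	for i := len(f.cleanups) - 1; i >= 0; i-- {
		f.cleanups[i]()
	}
	f.cleanups = nil
}

func main() {
	t := &fakeT{}
	setThreshold(t, 0)
	fmt.Println("during test:", cachedThresh)
	t.finish()
	fmt.Println("after test:", cachedThresh)
}
```

The caller no longer wraps its body in a closure: it calls `setThreshold(t, 0)` at the top of the test and the restore happens automatically, including when the test fails part-way through.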
internal/hybridre2/hybridre2_test.go
Outdated
    }

    // Ensure fmt is used.
    var _ = fmt.Sprintf
6d6ee90 to
604f972
Adds an optional hybrid regex engine (internal/hybridre2) that transparently switches between grafana/regexp and wasilibs/go-re2 (RE2 via WebAssembly) based on file content size. Disabled by default: no behaviour change without opt-in via ZOEKT_RE2_THRESHOLD_BYTES.

## Motivation

Issue sourcegraph#323 identified regex as the dominant CPU consumer in zoekt's webserver profile. Go's regexp engine (including the grafana/regexp fork already in use) lacks a lazy DFA. RE2's lazy DFA provides linear-time matching with much better constant factors for alternations, character classes, and complex patterns on large inputs.

The tradeoff: go-re2 uses WebAssembly (~600ns per-call overhead), making it slower than grafana/regexp for small inputs (<4KB) but dramatically faster above the threshold. A full engine swap would regress small-file searches, so a threshold-based hybrid is the pragmatic approach.

## Implementation

### New package: internal/hybridre2

hybridre2.Regexp compiles both engines once at query-parse time and dispatches FindAllIndex based on len(input) >= Threshold():

    func (re *Regexp) FindAllIndex(b []byte, n int) [][]int {
        if useRE2(len(b)) {
            return re.re2.FindAllIndex(b, n)
        }
        return re.grafana.FindAllIndex(b, n)
    }

### Change to index/matchtree.go

regexpMatchTree gains a hybridRegexp field used for file content matching; filename matching keeps using grafana/regexp directly (filenames are always short, so WASM overhead dominates there).

### Configuration

ZOEKT_RE2_THRESHOLD_BYTES env var, read once at startup:

    -1 (default): disabled; always grafana/regexp, zero behaviour change
    0:            always use go-re2 (useful for evaluation/testing)
    32768:        use go-re2 for files >= 32KB (recommended starting point)

## Benchmarks

Hardware: AMD EPYC 9B14, go-re2 v1.10.0 (WASM, no CGO).

Alternations, `func|var|const|type|import`:

    32KB:   grafana 2505µs   go-re2 467µs    5.4x speedup
    128KB:  grafana 9900µs   go-re2 1699µs   5.8x speedup
    512KB:  grafana 40.7ms   go-re2 6.8ms    6.0x speedup

Complex, `(func|var)\s+[A-Z]\w*\s*\(`:

    32KB:   grafana 1237µs   go-re2 230µs    5.4x speedup
    128KB:  grafana 4935µs   go-re2 911µs    5.4x speedup
    512KB:  grafana 19.9ms   go-re2 3.8ms    5.3x speedup

Literal, `main` (grafana wins; threshold protects this case):

    32KB:   grafana 33.2µs   go-re2 59.8µs

## Testing

    go test ./internal/hybridre2/   # unit + correctness matrix
    go test ./index/ -short         # full existing suite: passes
    go test ./... -short            # full suite: passes

Correctness verified by asserting identical match offsets between grafana and go-re2 for 9 patterns x 5 sizes (64B-256KB).

## Notes

- Binary/non-UTF-8 content: go-re2 stops at invalid UTF-8 (vs. grafana which replaces with the replacement character). The default threshold of -1 ensures zero behaviour change. Operators enabling the threshold should be aware; future work could detect non-UTF-8 and force the grafana path.
- Dependency: github.com/wasilibs/go-re2 v1.10.0, pure Go WASM, no system deps. Binary size increase: ~2MB (the embedded RE2 WASM module).
- Rollout plan: enable in GitLab via feature flag starting at 32KB, compare p95 regex latency before/after using per-shard timing in search responses.
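
The note on binary/non-UTF-8 content mentions forcing the grafana path as possible future work. A minimal sketch of that idea, assuming a `utf8.Valid` pre-check is acceptable on the large-input path; this `useRE2` signature is hypothetical and differs from the `useRE2(len(b))` call in the commit message.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// useRE2 reports whether an input should be routed to the RE2 engine.
// thresh < 0 disables RE2 entirely; thresh == 0 enables it for every size.
// The utf8.Valid guard is the hypothetical addition: inputs containing
// invalid UTF-8 stay on the default engine regardless of size.
func useRE2(b []byte, thresh int64) bool {
	if thresh < 0 || int64(len(b)) < thresh {
		return false
	}
	// Linear scan of the input, only paid for inputs at or above the threshold.
	return utf8.Valid(b)
}

func main() {
	fmt.Println(useRE2([]byte("package main"), 0))    // valid UTF-8, threshold 0
	fmt.Println(useRE2([]byte{0xff, 0xfe, 'a'}, 0))   // invalid UTF-8
	fmt.Println(useRE2([]byte("package main"), -1))   // RE2 disabled
}
```

The extra scan adds a pass over each large input, so whether it pays off would need measuring against the benchmark cases above.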
604f972 to
696eef0
Contributor
Author
@keegancsmith Thank you! I believe it's ready for another look. I've even added one optimization (skip re2 compilation for filename-only regexps) 🤝
keegancsmith
approved these changes
Mar 24, 2026
Summary
Adds an optional hybrid regex engine (internal/hybridre2) that transparently switches between grafana/regexp and wasilibs/go-re2 (RE2 via WebAssembly) based on file content size. Disabled by default: no behaviour change without opt-in via ZOEKT_RE2_THRESHOLD_BYTES.

Related: #323
Motivation
Issue #323 (April 2022) identified regex as the dominant CPU consumer in zoekt's
webserver profile. The root cause is that Go's regexp engine, including the
grafana/regexp fork already in use, lacks a lazy DFA. RE2's lazy DFA provides
linear-time matching with much better constant factors for alternations, character
classes, and complex patterns on large inputs.
The tradeoff: go-re2 uses WebAssembly under the hood (~600ns per-call overhead),
making it slower than grafana/regexp for small inputs (<1–4KB) but dramatically
faster above that threshold. A full engine swap would regress small-file searches,
so a threshold-based hybrid is the pragmatic approach.
Implementation
New package: internal/hybridre2
hybridre2.Regexp compiles grafana/regexp unconditionally. The go-re2 variant is
only compiled when ZOEKT_RE2_THRESHOLD_BYTES >= 0. When RE2 is disabled (the
default), no WASM is initialised, keeping the disabled path truly zero-cost.
Dispatch happens at match time: inputs of at least Threshold() bytes go to go-re2, smaller inputs to grafana/regexp.
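
A minimal, self-contained sketch of that dispatch. To keep it runnable without external modules, the standard library `regexp` package stands in for both grafana/regexp and go-re2; the `hybrid` type, `newHybrid`, and `pick` are hypothetical names and not the PR's actual code.

```go
package main

import (
	"fmt"
	"regexp"
)

// matcher is the single method both engines need for this sketch.
type matcher interface {
	FindAllIndex(b []byte, n int) [][]int
}

// hybrid holds an always-compiled engine for small inputs and an optional
// engine for large inputs.
type hybrid struct {
	small  matcher // stand-in for grafana/regexp
	large  matcher // stand-in for go-re2; nil when the threshold is negative
	thresh int64
}

// newHybrid compiles the small-input engine unconditionally and the
// large-input engine only when the threshold enables it.
func newHybrid(pattern string, thresh int64) (*hybrid, error) {
	small, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	h := &hybrid{small: small, thresh: thresh}
	if thresh >= 0 {
		large, err := regexp.Compile(pattern) // stand-in for the RE2 compile
		if err != nil {
			return nil, err
		}
		h.large = large
	}
	return h, nil
}

// pick selects the engine for an input of n bytes.
func (h *hybrid) pick(n int) matcher {
	if h.large != nil && int64(n) >= h.thresh {
		return h.large
	}
	return h.small
}

// FindAllIndex dispatches on input length at match time.
func (h *hybrid) FindAllIndex(b []byte, n int) [][]int {
	return h.pick(len(b)).FindAllIndex(b, n)
}

func main() {
	h, err := newHybrid(`func|var`, 8)
	if err != nil {
		panic(err)
	}
	fmt.Println(h.FindAllIndex([]byte("func main var x"), -1))
}
```

Leaving `large` nil when the threshold is negative mirrors the stated goal that the disabled path never initialises the second engine.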
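
Change to index/matchtree.go
regexpMatchTree gains a hybridRegexp field used for file content matching;
filename matching keeps using grafana/regexp directly (filenames are always short).

Configuration
ZOEKT_RE2_THRESHOLD_BYTES env var, read once at startup:
- -1 (default): disabled; always grafana/regexp, zero behaviour change
- 0: always use go-re2 (useful for evaluation/testing)
- 32768: use go-re2 for files >= 32KB (recommended starting point)

Benchmarks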
Hardware: AMD EPYC 9B14, go-re2 v1.10.0 (WASM, no CGO).
Benchmark cases:
- Alternations: `func|var|const|type|import`
- Complex: `(func|var)\s+[A-Z]\w*\s*\(`
- Literal: `main` (grafana/regexp wins; the threshold protects this case)

CGO mode (-tags re2_cgo, requires libre2-dev) adds ~30% on top of WASM.

Testing
Correctness verified by asserting identical match offsets between grafana and
go-re2 for 9 patterns × 5 sizes (64B–256KB) in TestFindAllIndexIdenticalResults.

Notes
- Binary/non-UTF-8 content: go-re2 stops at invalid UTF-8, whereas grafana/regexp
  replaces with U+FFFD and continues. Results may differ on binary content that
  slips past content-type detection. The threshold gates this to large files only,
  and the default of -1 ensures zero behaviour change.
- Regexp holds compiled state for both engines. Patterns are compiled per-search
  (not cached globally), so memory is short-lived, but RSS should be monitored
  under high concurrency with many unique patterns.
- If go-re2 rejects a pattern that grafana/regexp accepts (due to syntax
  differences), Compile returns an error rather than falling back silently. This
  is intentional (fail-fast). Patterns from zoekt's query parser are validated
  earlier in the pipeline, so this is unlikely in practice, but worth knowing if
  you enable the threshold.
- Dependency: github.com/wasilibs/go-re2 v1.10.0, pure Go WASM, no system deps.
  Binary size increase: ~2MB (the embedded RE2 WASM module).