index: add hybrid go-re2 engine for large file content matching #1024

Merged
keegancsmith merged 1 commit into sourcegraph:main from dgruzd:dgruzd/hybrid-re2-engine on Mar 24, 2026

Contributor

@dgruzd dgruzd commented Mar 24, 2026

Summary

Adds an optional hybrid regex engine (internal/hybridre2) that transparently
switches between grafana/regexp and
wasilibs/go-re2 (RE2 via WebAssembly)
based on file content size. Disabled by default — no behaviour change
without opt-in via ZOEKT_RE2_THRESHOLD_BYTES.

Related: #323

Motivation

Issue #323 (April 2022) identified regex as the dominant CPU consumer in zoekt's
webserver profile. The root cause is that Go's regexp engine — including the
grafana/regexp fork already in use — lacks a lazy DFA. RE2's lazy DFA provides
linear-time matching with much better constant factors for alternations, character
classes, and complex patterns on large inputs.

The tradeoff: go-re2 uses WebAssembly under the hood (~600ns per-call overhead),
making it slower than grafana/regexp for small inputs (<1–4KB) but dramatically
faster above that threshold. A full engine swap would regress small-file searches,
so a threshold-based hybrid is the pragmatic approach.

Implementation

New package: internal/hybridre2

hybridre2.Regexp compiles grafana/regexp unconditionally. The go-re2 variant is
only compiled when ZOEKT_RE2_THRESHOLD_BYTES >= 0 — when RE2 is disabled (the
default), no WASM is initialised, keeping the disabled path truly zero-cost.

Dispatch at match time:

func (re *Regexp) FindAllIndex(b []byte, n int) [][]int {
    if re.re2 != nil && useRE2(len(b)) {
        return re.re2.FindAllIndex(b, n)
    }
    return re.grafana.FindAllIndex(b, n)
}
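
The compile-then-dispatch behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the names `hybrid`, `compileHybrid`, and `engineFor` are hypothetical, and both engine slots are filled with the standard library's `regexp` so the example has no external dependencies (the PR uses grafana/regexp and wasilibs/go-re2).

```go
package main

import (
	"fmt"
	"regexp"
)

// matcher is the one method the dispatcher needs from either engine.
type matcher interface {
	FindAllIndex(b []byte, n int) [][]int
}

// hybrid holds a primary engine and an optional large-input engine.
// ASSUMPTION: both slots use the standard library's regexp here as a stand-in.
type hybrid struct {
	small     matcher
	large     matcher // nil when the threshold is negative (disabled)
	threshold int64
}

// compileHybrid always compiles the primary engine and compiles the second
// one only when threshold >= 0, failing fast if either compile fails.
func compileHybrid(pattern string, threshold int64) (*hybrid, error) {
	small, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	h := &hybrid{small: small, threshold: threshold}
	if threshold >= 0 {
		large, err := regexp.Compile(pattern) // stand-in for the RE2 compile
		if err != nil {
			return nil, fmt.Errorf("large-input engine: %w", err)
		}
		h.large = large
	}
	return h, nil
}

// engineFor reports which engine an input of the given size is routed to.
func (h *hybrid) engineFor(size int) string {
	if h.large != nil && int64(size) >= h.threshold {
		return "large"
	}
	return "small"
}

// FindAllIndex dispatches on input length, mirroring the snippet above.
func (h *hybrid) FindAllIndex(b []byte, n int) [][]int {
	if h.engineFor(len(b)) == "large" {
		return h.large.FindAllIndex(b, n)
	}
	return h.small.FindAllIndex(b, n)
}

func main() {
	h, err := compileHybrid(`func|var`, 8)
	if err != nil {
		panic(err)
	}
	fmt.Println(h.engineFor(4), h.engineFor(8))
	fmt.Println(h.FindAllIndex([]byte("var x; func f()"), -1))
}
```

Leaving the second slot nil when the threshold is negative is what keeps the disabled path free of any second compile.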

Change to index/matchtree.go

regexpMatchTree gains a hybridRegexp field used for file content matching;
filename matching keeps using grafana/regexp directly (filenames are always short).

Configuration

ZOEKT_RE2_THRESHOLD_BYTES env var, read once at startup:

| Value | Behaviour |
| --- | --- |
| -1 (default) | Disabled: always grafana/regexp, zero behaviour change |
| 0 | Always use go-re2 (useful for evaluation/testing) |
| 32768 | Use go-re2 for files ≥ 32KB (recommended starting point) |
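
To make the three settings concrete, here is a small sketch of how a raw value could map to an engine decision. The function names and the two-argument `useRE2` signature are hypothetical, and treating unset or malformed values as -1 is an assumption; the PR does not state how malformed input is handled.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseThreshold converts the raw env value into a byte threshold.
// ASSUMPTION: unset or unparsable values fall back to -1 (disabled).
func parseThreshold(raw string) int64 {
	if raw == "" {
		return -1
	}
	v, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return -1
	}
	return v
}

// useRE2 reports whether an input of size bytes crosses the threshold.
// A negative threshold never selects RE2; zero always does.
func useRE2(threshold int64, size int) bool {
	return threshold >= 0 && int64(size) >= threshold
}

func main() {
	for _, raw := range []string{"", "-1", "0", "32768"} {
		t := parseThreshold(raw)
		fmt.Printf("%q -> 1KB:%v 32KB:%v\n", raw, useRE2(t, 1024), useRE2(t, 32768))
	}
}
```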

Benchmarks

Hardware: AMD EPYC 9B14, go-re2 v1.10.0 (WASM, no CGO).

Alternations — func|var|const|type|import

| Size | grafana/regexp | go-re2 (WASM) | Speedup |
| --- | --- | --- | --- |
| 32KB | 2505µs | 467µs | 5.4x |
| 128KB | 9900µs | 1699µs | 5.8x |
| 512KB | 40.7ms | 6.8ms | 6.0x |

Complex — (func|var)\s+[A-Z]\w*\s*\(

| Size | grafana/regexp | go-re2 (WASM) | Speedup |
| --- | --- | --- | --- |
| 32KB | 1237µs | 230µs | 5.4x |
| 128KB | 4935µs | 911µs | 5.4x |
| 512KB | 19.9ms | 3.8ms | 5.3x |

Literal — main (grafana/regexp wins — threshold protects this case)

| Size | grafana/regexp | go-re2 (WASM) | Note |
| --- | --- | --- | --- |
| 32KB | 33.2µs | 59.8µs | grafana faster |

CGO mode (-tags re2_cgo, requires libre2-dev) adds a further ~30% speedup on top of the WASM results.
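
For anyone wanting to reproduce measurements of this shape, a rough harness might look like the sketch below. It is not the PR's benchmark: the corpus content and the `corpus` and `measure` helpers are assumptions, and it times only the standard library's `regexp`. Swapping in another engine would mean replacing the `regexp.MustCompile` call.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
	"testing"
)

// corpus repeats a line of Go-like source and truncates to exactly size bytes.
// ASSUMPTION: this content is illustrative, not the PR's benchmark input.
func corpus(size int) []byte {
	line := []byte("func Foo(x int) { var y = x; const z = 1 }\n")
	n := size/len(line) + 1
	return bytes.Repeat(line, n)[:size]
}

// measure times FindAllIndex for one pattern and one input size using
// testing.Benchmark, which works outside of "go test".
func measure(pattern string, size int) testing.BenchmarkResult {
	re := regexp.MustCompile(pattern)
	data := corpus(size)
	return testing.Benchmark(func(b *testing.B) {
		b.SetBytes(int64(size))
		for i := 0; i < b.N; i++ {
			re.FindAllIndex(data, -1)
		}
	})
}

func main() {
	for _, size := range []int{32 << 10, 128 << 10, 512 << 10} {
		r := measure(`func|var|const|type|import`, size)
		fmt.Printf("%4dKB  %s\n", size>>10, r)
	}
}
```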

Testing

go test ./internal/hybridre2/    # 14 tests + 45-case correctness matrix
go test ./index/ -short          # full existing suite: passes
go test ./... -short             # full suite: passes

Correctness verified by asserting identical match offsets between grafana and
go-re2 for 9 patterns × 5 sizes (64B–256KB) in TestFindAllIndexIdenticalResults.
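
The shape of such a patterns × sizes comparison can be sketched as below. This is not `TestFindAllIndexIdenticalResults` itself: the helper names, the corpus, and the listed patterns and sizes are assumptions, and both engines are the standard library's `regexp` compiled twice, standing in for grafana/regexp and go-re2.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
	"slices"
)

// matcher is the method both engines must provide.
type matcher interface {
	FindAllIndex(b []byte, n int) [][]int
}

// sameOffsets reports whether both engines return identical match offsets.
func sameOffsets(a, b matcher, data []byte) bool {
	return slices.EqualFunc(a.FindAllIndex(data, -1), b.FindAllIndex(data, -1),
		func(x, y []int) bool { return slices.Equal(x, y) })
}

// mismatches runs every pattern against inputs of every size and describes
// each case where the two engines disagree or fail to compile.
func mismatches(compileA, compileB func(string) (matcher, error), patterns []string, sizes []int) []string {
	var out []string
	line := []byte("func Foo() { var x = 1 }\n")
	for _, p := range patterns {
		a, errA := compileA(p)
		b, errB := compileB(p)
		if errA != nil || errB != nil {
			out = append(out, fmt.Sprintf("%q: compile: %v / %v", p, errA, errB))
			continue
		}
		for _, size := range sizes {
			data := bytes.Repeat(line, size/len(line)+1)[:size]
			if !sameOffsets(a, b, data) {
				out = append(out, fmt.Sprintf("%q at %d bytes", p, size))
			}
		}
	}
	return out
}

// stdlib compiles with the standard library; a stand-in for either engine.
func stdlib(p string) (matcher, error) { return regexp.Compile(p) }

func main() {
	patterns := []string{`func|var|const|type|import`, `(func|var)\s+[A-Z]\w*\s*\(`, `main`}
	sizes := []int{64, 1 << 10, 16 << 10, 64 << 10, 256 << 10} // hypothetical sizes
	fmt.Println("mismatches:", len(mismatches(stdlib, stdlib, patterns, sizes)))
}
```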

Notes

  • Binary / non-UTF-8 content: go-re2 stops at invalid UTF-8; grafana/regexp
    replaces with U+FFFD and continues. Results may differ on binary content that
    slips past content-type detection. The threshold gates this to large files only,
    and the default of -1 ensures zero behaviour change.
  • Memory: when RE2 is enabled, each Regexp holds compiled state for both
    engines. Patterns are compiled per-search (not cached globally), so memory is
    short-lived, but RSS should be monitored under high concurrency with many unique
    patterns.
  • RE2 compilation failure: if go-re2 rejects a pattern that grafana/regexp
    accepts (due to syntax differences), Compile returns an error rather than
    falling back silently. This is intentional (fail-fast). Patterns from zoekt's
    query parser are validated earlier in the pipeline, so this is unlikely in
    practice, but worth knowing if you enable the threshold.
  • Dependency: github.com/wasilibs/go-re2 v1.10.0 — pure Go WASM, no system
    deps. Binary size increase: ~2MB (the embedded RE2 WASM module).
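
On the binary-content note: one possible guard, which is not part of this PR, would be to check UTF-8 validity before routing to go-re2. The sketch below assumes a hypothetical `pickEngine` helper; `utf8.Valid` is the standard library call. Note that the validity check scans the whole input, so it adds a linear pass for every file above the threshold.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// pickEngine returns "re2" only for inputs that cross the threshold and are
// valid UTF-8; everything else stays on the default engine.
// ASSUMPTION: this guard is hypothetical and does not exist in the PR.
func pickEngine(b []byte, threshold int64) string {
	if threshold < 0 || int64(len(b)) < threshold {
		return "grafana"
	}
	if !utf8.Valid(b) {
		return "grafana" // avoid diverging results on binary content
	}
	return "re2"
}

func main() {
	fmt.Println(pickEngine([]byte("héllo wörld"), 4))
	fmt.Println(pickEngine([]byte{0xff, 0xfe, 'a', 'b', 'c'}, 4))
}
```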

Member

@keegancsmith keegancsmith left a comment

Approving but some inline feedback. Will land after your response

// EnvThreshold is the environment variable name controlling the size
// threshold (bytes) at which go-re2 is used instead of grafana/regexp.
// Set to -1 (default) to disable go-re2 entirely, 0 to always use it.
EnvThreshold = "ZOEKT_RE2_THRESHOLD_BYTES"

minor: this and disabled don't need to be exported


// Threshold returns the configured byte threshold, reading ZOEKT_RE2_THRESHOLD_BYTES
// from the environment exactly once. Negative means disabled; zero means always use RE2.
func Threshold() int64 {

minor: you can simplify and use sync.OnceValue
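
For reference, the `sync.OnceValue` form (Go 1.21+) might look something like this. It is a sketch rather than the merged code: the unexported names follow the earlier review suggestion, and falling back to -1 on an unset or malformed value is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"sync"
)

const envThreshold = "ZOEKT_RE2_THRESHOLD_BYTES"

// threshold reads and parses the variable on its first call and returns the
// cached result on every later call.
var threshold = sync.OnceValue(func() int64 {
	v, err := strconv.ParseInt(os.Getenv(envThreshold), 10, 64)
	if err != nil {
		return -1 // ASSUMPTION: unset or malformed means disabled
	}
	return v
})

func main() {
	os.Setenv(envThreshold, "32768")
	fmt.Println(threshold())
	os.Setenv(envThreshold, "0") // ignored: the first result is cached
	fmt.Println(threshold())
}
```

This replaces a hand-written `sync.Once` plus a cached package variable with a single declaration.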

// NOT safe for concurrent use: do not call t.Parallel() inside f, and do not
// use this helper from TestMain or init() — if Threshold() fires its sync.Once
// before withThreshold sets cachedThresh, the value will be overwritten.
func withThreshold(thresh int64, f func()) {

minor: more idiomatic for test code to have a helper use t.Cleanup rather than passing in a function
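
A `t.Cleanup`-based variant could be sketched as follows. The `setThreshold` helper, the `cleanupT` interface, and `fakeT` are hypothetical; `cachedThresh` is named after the variable mentioned in the snippet above. `fakeT` exists only so the example runs outside `go test`; in a real test a `*testing.T` would be passed in, since it satisfies the same interface.

```go
package main

import "fmt"

// cleanupT is the subset of *testing.T the helper needs.
type cleanupT interface {
	Helper()
	Cleanup(func())
}

// cachedThresh stands in for the package-level cached threshold.
var cachedThresh int64 = -1

// setThreshold overrides the threshold for the rest of the test and
// registers a cleanup that restores the previous value.
func setThreshold(t cleanupT, thresh int64) {
	t.Helper()
	prev := cachedThresh
	cachedThresh = thresh
	t.Cleanup(func() { cachedThresh = prev })
}

// fakeT records cleanups so the sketch can run without the testing package.
type fakeT struct{ cleanups []func() }

func (f *fakeT) Helper()           {}
func (f *fakeT) Cleanup(fn func()) { f.cleanups = append(f.cleanups, fn) }

// finish runs cleanups last-in, first-out, as the testing package does.
func (f *fakeT) finish() {
	for i := len(f.cleanups) - 1; i >= 0; i-- {
		f.cleanups[i]()
	}
}

func main() {
	t := &fakeT{}
	setThreshold(t, 0)
	fmt.Println("during:", cachedThresh)
	t.finish()
	fmt.Println("after:", cachedThresh)
}
```

Compared with passing a closure, the caller writes `setThreshold(t, 0)` at the top of the test and the restore happens automatically, including when the test fails.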

}

// Ensure fmt is used.
var _ = fmt.Sprintf

This is weird?

@dgruzd dgruzd force-pushed the dgruzd/hybrid-re2-engine branch from 6d6ee90 to 604f972 Compare March 24, 2026 11:05
Adds an optional hybrid regex engine (internal/hybridre2) that transparently
switches between grafana/regexp and wasilibs/go-re2 (RE2 via WebAssembly)
based on file content size. Disabled by default — no behaviour change
without opt-in via ZOEKT_RE2_THRESHOLD_BYTES.

## Motivation

Issue sourcegraph#323 identified regex as the dominant CPU consumer in zoekt's
webserver profile. Go's regexp engine (including the grafana/regexp fork
already in use) lacks a lazy DFA. RE2's lazy DFA provides linear-time
matching with much better constant factors for alternations, character
classes, and complex patterns on large inputs.

The tradeoff: go-re2 uses WebAssembly (~600ns per-call overhead), making
it slower than grafana/regexp for small inputs (<4KB) but dramatically
faster above the threshold. A full engine swap would regress small-file
searches, so a threshold-based hybrid is the pragmatic approach.

## Implementation

### New package: internal/hybridre2

hybridre2.Regexp compiles both engines once at query-parse time and
dispatches FindAllIndex based on len(input) >= Threshold():

    func (re *Regexp) FindAllIndex(b []byte, n int) [][]int {
        if useRE2(len(b)) {
            return re.re2.FindAllIndex(b, n)
        }
        return re.grafana.FindAllIndex(b, n)
    }

### Change to index/matchtree.go

regexpMatchTree gains a hybridRegexp field used for file content matching;
filename matching keeps using grafana/regexp directly (filenames are always
short, so WASM overhead dominates there).

### Configuration

ZOEKT_RE2_THRESHOLD_BYTES env var, read once at startup:

  -1 (default): disabled — always grafana/regexp, zero behaviour change
  0:            always use go-re2 (useful for evaluation/testing)
  32768:        use go-re2 for files >= 32KB (recommended starting point)

## Benchmarks

Hardware: AMD EPYC 9B14, go-re2 v1.10.0 (WASM, no CGO).

Alternations — `func|var|const|type|import`:
  32KB:  grafana 2505µs  go-re2  467µs  5.4x speedup
  128KB: grafana 9900µs  go-re2 1699µs  5.8x speedup
  512KB: grafana 40.7ms  go-re2  6.8ms  6.0x speedup

Complex — `(func|var)\s+[A-Z]\w*\s*\(`:
  32KB:  grafana 1237µs  go-re2  230µs  5.4x speedup
  128KB: grafana 4935µs  go-re2  911µs  5.4x speedup
  512KB: grafana 19.9ms  go-re2  3.8ms  5.3x speedup

Literal — `main` (grafana wins; threshold protects this case):
  32KB:  grafana 33.2µs  go-re2 59.8µs

## Testing

    go test ./internal/hybridre2/   # unit + correctness matrix
    go test ./index/ -short         # full existing suite: passes
    go test ./... -short            # full suite: passes

Correctness verified by asserting identical match offsets between grafana
and go-re2 for 9 patterns x 5 sizes (64B-256KB).

## Notes

- Binary/non-UTF-8 content: go-re2 stops at invalid UTF-8 (vs. grafana
  which replaces with the replacement character). The default threshold of
  -1 ensures zero behaviour change. Operators enabling the threshold should
  be aware; future work could detect non-UTF-8 and force the grafana path.
- Dependency: github.com/wasilibs/go-re2 v1.10.0 — pure Go WASM, no system
  deps. Binary size increase: ~2MB (the embedded RE2 WASM module).
- Rollout plan: enable in GitLab via feature flag starting at 32KB, compare
  p95 regex latency before/after using per-shard timing in search responses.
@dgruzd dgruzd force-pushed the dgruzd/hybrid-re2-engine branch from 604f972 to 696eef0 Compare March 24, 2026 11:08
Contributor Author

dgruzd commented Mar 24, 2026

@keegancsmith Thank you! I believe it's ready for another look. I've even added one optimization (skip re2 compilation for filename-only regexps) 🤝

@keegancsmith keegancsmith merged commit 971fcf5 into sourcegraph:main Mar 24, 2026
8 checks passed