Add scaling_bench: encode_batch vs worker-pool comparison (#1900) #2048
Open
stargazerZJ wants to merge 1 commit into
…e#1900)

Standalone benchmark binary (`harness = false`) that compares `Tokenizer::encode_batch` against an explicit `rayon::ThreadPoolBuilder` + `par_iter()` worker pool over `Tokenizer::encode`, on the same data, and reports the ratio of their wall times.

Deep-dive bench for maintainers and contributors working on the parallel-encode path; not wired into CI because the regression it measures only manifests on machines with many cores (>=16) under realistic batch shapes, which the `ubuntu-latest` benchmark runner cannot provide. The existing `ci_benchmark.rs::concurrent-4t` group remains the in-CI watchdog.

Two synthesized workloads (random-letters, repeated-words) stress the encode hot path and the BPE-cache + memory-pressure path respectively; `--input <file>` overrides with a real corpus.

No new dependencies (hand-rolled CLI parser); no library changes; no CI changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds `tokenizers/benches/scaling_bench.rs`, a standalone-binary benchmark (`harness = false`) that compares the two parallel-tokenization API shapes a real user can write today, on the same data, in the same process, and reports the ratio of their wall times.

Scope: a single new file; one new `[[bench]]` entry in `tokenizers/Cargo.toml`. No new dependencies, no CI changes, no library changes.

This is a deep-dive benchmark for maintainers and contributors, not a CI gate. The regression it measures only manifests at scale (≥16 cores, large documents, large batches); GitHub Actions' `ubuntu-latest` runner has 4 vCPUs and cannot reproduce it. The existing `ci_benchmark.rs` `concurrent-4t` group remains the in-CI watchdog; this bench complements it for the cases where 4-thread, ~80 KB total data isn't enough to expose the scaling problem.

Background
Issue #1900 reports that `Tokenizer::encode_batch` is several times slower than a manual `rayon::ThreadPoolBuilder` + `par_iter()` over `Tokenizer::encode` on many-core x86_64. PRs #2028 (merged) and #2029 (open) move the needle but don't close the gap on a 128-core dual-socket Xeon 8375C. A standalone reproducer with full numerical findings (including drop-site A/B isolation, allocator-swap experiments with jemalloc and mimalloc, a `MALLOC_ARENA_MAX` sweep, and futex/syscall counts) lives at https://github.com/stargazerZJ/tokenizers-1900-repro.

This PR brings the core comparison in-tree as a permanent reference for contributors working on the parallel-encode path, leaving the diagnostic surface area (allocator swaps, drop-site A/B, etc.) in the standalone repo where it belongs.
What the bench measures
Two `--method` values, both run on the same generated input via the public `Tokenizer` API only:

- `worker-pool`: explicit `rayon::ThreadPoolBuilder::new().num_threads(W).build()`, then `pool.install(|| texts.par_iter().for_each(|t| { let enc = tok.encode(t, false)?; consume(enc) }))`. Each `Encoding` is consumed (token count read, then dropped) inside the closure, so it is allocated and freed on the same worker thread. This is the natural shape for Rust users calling into `tokenizers` directly: work on each doc in a parallel closure and never ship the `Encoding`s across threads.
- `encode-batch`: stock `Tokenizer::encode_batch(texts, false)` per chunk of `--batch-size`, returning `Vec<Encoding>` to the caller. The caller iterates the returned vec on the main thread and drops there.

Reported metric: `encode_batch_elapsed / worker_pool_elapsed`. On a 128-core x86_64 box with glibc, against current `main`, this ratio is ~2-3×. After @sebpop's planned `Encoding`-recycling fix it should drop to ≈ 1.0.

Two synthesized workloads, both run by default:

- `random-letters`: random a-zA-Z, no whitespace. Essentially no BPE cache hits; exercises the encode hot path.
- `repeated-words`: short pseudo-words separated by spaces. Cache fires aggressively; exercises the cache + memory-pressure path.

The two regimes respond very differently to fixes; running only one risks over-fitting future PRs to one and silently regressing the other.
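For readers who want a feel for the two synthesized regimes without opening the bench, here is a minimal std-only sketch of generators with the same shape. It is not the code in `scaling_bench.rs`: the `XorShift64` PRNG, the function names, the 3 to 7 letter word length, and the `vocab_size` parameter are all assumptions made for illustration.

```rust
/// Tiny xorshift64 PRNG so the sketch needs no external crates (assumption:
/// the real bench may use a different generator). The seed must be non-zero.
struct XorShift64(u64);

impl XorShift64 {
    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Pseudo-random index in `0..n` (slightly biased, fine for test data).
    fn below(&mut self, n: usize) -> usize {
        (self.next() % n as u64) as usize
    }
}

const LETTERS: &[u8] = b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

/// "random-letters" shape: `len` bytes of a-zA-Z with no whitespace, so
/// repeated substrings (and therefore cache hits) are rare.
fn random_letters(rng: &mut XorShift64, len: usize) -> String {
    let mut out = String::with_capacity(len);
    for _ in 0..len {
        out.push(LETTERS[rng.below(LETTERS.len())] as char);
    }
    out
}

/// "repeated-words" shape: space-separated pseudo-words drawn from a small
/// vocabulary, so the same words recur often.
fn repeated_words(rng: &mut XorShift64, len: usize, vocab_size: usize) -> String {
    // Hypothetical vocabulary: `vocab_size` lowercase words of 3..=7 letters.
    let mut vocab = Vec::with_capacity(vocab_size.max(1));
    for _ in 0..vocab_size.max(1) {
        let word_len = 3 + rng.below(5);
        let mut word = String::with_capacity(word_len);
        for _ in 0..word_len {
            word.push(LETTERS[rng.below(26)] as char);
        }
        vocab.push(word);
    }
    let mut out = String::with_capacity(len + 8);
    while out.len() < len {
        if !out.is_empty() {
            out.push(' ');
        }
        out.push_str(&vocab[rng.below(vocab.len())]);
    }
    out.truncate(len); // ASCII only, so any byte index is a char boundary
    out
}

fn main() {
    let mut rng = XorShift64(0x9E37_79B9_7F4A_7C15);
    println!("{}", random_letters(&mut rng, 64));
    println!("{}", repeated_words(&mut rng, 64, 16));
}
```

A fixed seed keeps both methods running on identical input, which is what makes the ratio between them meaningful.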
`--input <file>` overrides the synthesized data with a real corpus (e.g. `data/big.txt`) for ad-hoc experiments.

Why standalone, not criterion?
The other benches in `tokenizers/benches/` are criterion-based, including the consolidated `ci_benchmark.rs` that the `benchmarks.yml` workflow runs. Two reasons this bench is structurally different:

- Scale. The regression doesn't manifest below ~16 cores; the benchmarks workflow runs on `ubuntu-latest` (4 vCPUs). On a 4-core box, the ratio is ≤ 1.2× and indistinguishable from noise. A criterion bench wired into `ci_benchmark.rs` would either no-op (useless watchdog) or produce noise. There is no "small N that exposes the regression": it's a function of contention on the destination thread's allocator arena and scales with worker count.
- Iteration shape. Each bench iteration here is a whole batch (many thousands of `encode` calls) taking tens of seconds at realistic sizes. Criterion's sampling model expects much shorter iterations; while `sample_size(10)` can paper over this, it costs both wall time and statistical interpretability. A standalone binary measures one full run with knobs the user controls.

The bench therefore lives next to the criterion benches but uses `harness = false`, prints its own summary, and is invoked manually via `cargo bench --bench scaling_bench --`. This matches `harness = false` precedent already established for every other bench in the directory; only the omission of criterion is new.

A note on "fair comparison"
A reviewer might reasonably ask: shouldn't both methods do the same thing to the `Encoding`s (e.g. both should return a `Vec<Encoding>` to the caller) to be a "fair" comparison?

We deliberately don't, for two reasons:

- The bench measures user-visible cost of the `encode_batch` API shape, not algorithmic equivalence between two synthetic shapes. `encode_batch` is what every Python caller (including `transformers`) routes to, because that's the shape the Python binding exposes ergonomically, and it's also what Rust users reach for when they want "give me back a `Vec<Encoding>` I can iterate." The worker-pool shape is what Rust users write directly when they want per-doc parallel work without the cross-thread hand-off; it is the shape this issue's original reporter and reviewer both reached for in #1900 (Performance: batch_encode scales poorly on high-core Server CPUs compared to sharded tokenizer instances). Forcing the worker-pool variant to also collect into `Vec<Encoding>` would measure something nobody invokes and would hide the very asymmetry the bench exists to expose, an asymmetry that every Python caller of `encode_batch` pays today whether they know it or not.
- The local drop is the cause of the cheap path, not a benchmarking gimmick. This was confirmed experimentally in the standalone reproducer (drop-site A/B, see https://github.com/stargazerZJ/tokenizers-1900-repro/blob/main/REPORT.md): the same scope/queue shape as `encode_batch`, but with each `Encoding` dropped on the worker that allocated it, takes essentially the same time as `worker-pool`. The asymmetry the bench measures is the cost of the API choice.

The headline ratio is therefore "how much more does the `encode_batch` path cost, today, than the cheap worker-pool path would on the same work." Python callers and Rust users routing through `encode_batch` pay this cost; Rust users who write the manual worker-pool shape don't. Once `Encoding` recycling lands the ratio collapses to ~1× and the bench's job becomes documenting the equalized state.

Allocator caveat
Numbers vary across allocators. On glibc the gap is widest; jemalloc roughly halves it on the cache-friendly workload; mimalloc is workload-dependent. The bench is documented as glibc-default; the direction of the ratio is what carries the regression signal, not the absolute magnitude.
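The drop-site asymmetry that this caveat and the fair-comparison note refer to can be illustrated without the `tokenizers` crate. The following is a std-only analog, not the bench itself: `FakeEncoding`, `fake_encode`, `consume`, `local_drop`, `remote_drop`, and `timed` are hypothetical names, scoped threads stand in for the rayon pool, and the toy makes no claim to reproduce the reported ratio on any particular machine or allocator.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for an `Encoding`: two heap allocations per document.
struct FakeEncoding {
    ids: Vec<u32>,
    offsets: Vec<(usize, usize)>,
}

fn fake_encode(text: &str) -> FakeEncoding {
    FakeEncoding {
        ids: text.bytes().map(u32::from).collect(),
        offsets: (0..text.len()).map(|i| (i, i + 1)).collect(),
    }
}

/// "Consume" an encoding by reading its token count.
fn consume(enc: &FakeEncoding) -> usize {
    debug_assert_eq!(enc.ids.len(), enc.offsets.len());
    enc.ids.len()
}

fn chunk_len(n: usize, workers: usize) -> usize {
    let w = workers.max(1);
    ((n + w - 1) / w).max(1)
}

/// Worker-pool shape: each result is built, read, and freed on the same thread.
fn local_drop(texts: &[String], workers: usize) -> usize {
    thread::scope(|s| {
        let handles: Vec<_> = texts
            .chunks(chunk_len(texts.len(), workers))
            .map(|part| {
                s.spawn(move || {
                    part.iter()
                        .map(|t| {
                            let enc = fake_encode(t);
                            consume(&enc) // `enc` is dropped here, on this worker
                        })
                        .sum::<usize>()
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum::<usize>()
    })
}

/// encode_batch shape: workers allocate, the caller receives a Vec and frees it.
fn remote_drop(texts: &[String], workers: usize) -> usize {
    let batch: Vec<FakeEncoding> = thread::scope(|s| {
        let handles: Vec<_> = texts
            .chunks(chunk_len(texts.len(), workers))
            .map(|part| s.spawn(move || part.iter().map(|t| fake_encode(t)).collect::<Vec<_>>()))
            .collect();
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    });
    let total: usize = batch.iter().map(consume).sum();
    drop(batch); // every allocation made by a worker is freed on the caller thread
    total
}

fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed())
}

fn main() {
    let texts: Vec<String> = (0..2_000).map(|i| format!("document {i} ").repeat(50)).collect();
    let (a, t_local) = timed(|| local_drop(&texts, 8));
    let (b, t_remote) = timed(|| remote_drop(&texts, 8));
    assert_eq!(a, b); // same work, only the drop site differs
    let ratio = t_remote.as_secs_f64() / t_local.as_secs_f64();
    println!("local {t_local:?}  remote {t_remote:?}  ratio {ratio:.2}x");
}
```

The printed ratio has the same orientation as the bench's headline metric (caller-side drop over worker-side drop), so swapping the global allocator in a copy of this sketch is a cheap way to see how allocator-dependent the gap is.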
Knobs (CLI)
- `--workers <N>`
- `--batch-size <N>` (default 1024)
- `--count <N>` (default 500)
- `--length <N>` (default 51200)
- `--workload random-letters|repeated-words|both` (default both)
- `--method worker-pool|encode-batch|both` (default both)
- `--tokenizer <path>` (default `data/llama-3-tokenizer.json`)
- `--input <path>` (real-corpus override)
- `--quick` (small enough to run in seconds on any machine; will not produce a meaningful ratio on small machines but verifies the bench compiles and runs)
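A parser for a subset of the flags above can be written against `std::env::args` alone. The sketch below is an illustration, not the parser in `scaling_bench.rs`: the `Config` struct, `parse_args`, `next_usize`, and the error strings are hypothetical, and the `--workers` default (available parallelism) is an assumption, since the description does not state one.

```rust
#[derive(Debug)]
struct Config {
    workers: usize,
    batch_size: usize,
    count: usize,
    length: usize,
    quick: bool,
}

impl Default for Config {
    fn default() -> Self {
        Config {
            // Assumption: default to the number of available cores.
            workers: std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1),
            // Defaults below are the ones listed in the PR description.
            batch_size: 1024,
            count: 500,
            length: 51200,
            quick: false,
        }
    }
}

/// Pull the value that follows `flag` and parse it as a number.
fn next_usize(it: &mut impl Iterator<Item = String>, flag: &str) -> Result<usize, String> {
    let raw = it.next().ok_or_else(|| format!("{flag} expects a value"))?;
    raw.parse::<usize>()
        .map_err(|e| format!("{flag}: invalid number {raw:?} ({e})"))
}

/// Parse the arguments after the program name. Unknown flags are errors.
fn parse_args<I: IntoIterator<Item = String>>(args: I) -> Result<Config, String> {
    let mut cfg = Config::default();
    let mut it = args.into_iter();
    while let Some(flag) = it.next() {
        match flag.as_str() {
            "--workers" => cfg.workers = next_usize(&mut it, "--workers")?,
            "--batch-size" => cfg.batch_size = next_usize(&mut it, "--batch-size")?,
            "--count" => cfg.count = next_usize(&mut it, "--count")?,
            "--length" => cfg.length = next_usize(&mut it, "--length")?,
            "--quick" => cfg.quick = true,
            other => return Err(format!("unknown flag: {other}")),
        }
    }
    Ok(cfg)
}

fn main() {
    match parse_args(std::env::args().skip(1)) {
        Ok(cfg) => println!("{cfg:?}"),
        Err(e) => {
            eprintln!("error: {e}");
            std::process::exit(2);
        }
    }
}
```

Taking any `IntoIterator<Item = String>` rather than reading `std::env::args` directly keeps the parser testable without spawning a process.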
Argument parsing is hand-rolled: no new dependency on `clap`.

Reference numbers
128-core dual-socket Xeon 8375C, glibc 2.35, current `main` (commit `22d54d37`, which already includes the per-thread BPE cache fix from #2028), `--workers 32 --batch-size 1024 --count 500 --length 51200`:

Numbers from a single run on the author's hardware, included for calibration only. Your numbers will differ; the shape of the asymmetry is what the bench guards against, not the absolute magnitude.
What's left to close the gap
Two fixes are still in flight:
- `rayon::scope` wake/sleep cost (sebpop's "claim 1" in his 2026-04-27 comment on #1900).
- `Encoding` recycling: a planned follow-up from @sebpop targeting cross-thread free of `Encoding` on the main thread's destructor sweep (sebpop's "claim 2").

The standalone reproducer at https://github.com/stargazerZJ/tokenizers-1900-repro experimentally separates the two effects (drop-site A/B, full numbers in its REPORT.md): at the batch sizes this bench measures (`-b 1024` and up), cross-thread free is the dominant cost and the per-scope wake/sleep cost is second-order. #2029 alone is therefore not expected to materially move this bench; the recycling work is what brings the ratio to ~1×. Both should land for the in-tree reference to converge.

Open questions for the maintainers
- `data/llama-3-tokenizer.json` (already wired up by `make bench`). Happy to switch to `gpt2-vocab.json` / `bert-base-uncased-vocab.txt` if you prefer; the tokenizer choice does not affect whether the asymmetry shows up.
- `ci_benchmark.rs`'s `concurrent-4t` group. Happy to also add a section to `CONTRIBUTING.md` under the "benchmarks" subsection if you'd like a pointer.
- `Makefile` integration. `make bench` runs `cargo bench` (no args), which currently runs every bench file. Since `scaling_bench` is manual-run-only and produces ~tens of seconds of output even on a 4-core box, should I either (a) gate it behind an env var so `make bench` skips it, or (b) leave as-is and let users opt in via `cargo bench --bench scaling_bench`? Currently as-is.

Files
- `tokenizers/benches/scaling_bench.rs` (new, ~530 lines including doc comments)
- `tokenizers/Cargo.toml` (4-line addition: new `[[bench]]` entry)

Test plan
- `cargo build --release --bench scaling_bench` succeeds.
- `cargo clippy --all-targets --all-features -- -D warnings` clean.
- `cargo fmt -- ./benches/scaling_bench.rs --check` clean.
- `cargo bench --bench scaling_bench -- --quick` runs end-to-end on a 128-core box (~1s).
- `cargo bench --bench scaling_bench --` (full default) runs end-to-end and reports a ratio in the expected direction.
- laptop) that `--quick` runs cleanly and the bench compiles.

This PR by itself is a watchdog/reference, not the fix. The companion fix is @sebpop's `Encoding`-recycling work, planned as a separate PR.

Closes #1900 only after that fix lands and ratios drop to ≈ 1×.