Add scaling_bench: encode_batch vs worker-pool comparison (#1900) by stargazerZJ · Pull Request #2048 · huggingface/tokenizers

stargazerZJ · 2026-05-01T16:37:39Z

Summary

Adds tokenizers/benches/scaling_bench.rs, a standalone-binary benchmark (harness = false) that compares the two parallel-tokenization API shapes a real user can write today, on the same data, in the same process, and reports the ratio of their wall times.

Scope: a single new file; one new [[bench]] entry in tokenizers/Cargo.toml. No new dependencies, no CI changes, no library changes.

This is a deep-dive benchmark for maintainers and contributors, not a CI gate. The regression it measures only manifests at scale (≥16 cores, large documents, large batches); GitHub Actions' ubuntu-latest runner has 4 vCPUs and cannot reproduce it. The existing ci_benchmark.rs concurrent-4t group remains the in-CI watchdog; this bench complements it for the cases where 4-thread, ~80 KB total data isn't enough to expose the scaling problem.

Background

Issue #1900 reports that Tokenizer::encode_batch is several times slower than a manual rayon::ThreadPoolBuilder + par_iter() over Tokenizer::encode on many-core x86_64. PRs #2028 (merged) and #2029 (open) move the needle but don't close the gap on a 128-core dual-socket Xeon 8375C. A standalone reproducer with full numerical findings — including drop-site A/B isolation, allocator-swap experiments (jemalloc/mimalloc), MALLOC_ARENA_MAX sweep, and futex/syscall counts — lives at https://github.com/stargazerZJ/tokenizers-1900-repro.

This PR brings the core comparison in-tree as a permanent reference for contributors working on the parallel-encode path, leaving the diagnostic surface area (allocator swaps, drop-site A/B, etc.) in the standalone repo where it belongs.

What the bench measures

Two --method values, both run on the same generated input via the public Tokenizer API only:

worker-pool: explicit rayon::ThreadPoolBuilder::new().num_threads(W).build(), then pool.install(|| texts.par_iter().for_each(|t| { let enc = tok.encode(t, false).unwrap(); consume(enc) })). Each Encoding is consumed (token count read, then dropped) inside the closure, so it is allocated and freed on the same worker thread. This is the natural shape for Rust users calling into tokenizers directly: work on each doc in a parallel closure and never ship the Encodings across threads.
encode-batch: stock Tokenizer::encode_batch(texts, false) per chunk of --batch-size, returning Vec<Encoding> to the caller. The caller iterates the returned vec on the main thread and drops there.
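The difference between the two shapes comes down to which thread frees each result. The following is a minimal std-only sketch of that difference, not the code in scaling_bench.rs: it uses std::thread::scope in place of rayon, and fake_encode, worker_pool_shape, and encode_batch_shape are hypothetical names, with fake_encode standing in for Tokenizer::encode.

```rust
use std::thread;

/// Hypothetical stand-in for `Tokenizer::encode`: one "token id" per byte.
fn fake_encode(text: &str) -> Vec<u32> {
    text.bytes().map(u32::from).collect()
}

/// Splits `len` items into roughly `workers` chunks; never returns 0.
fn chunk_size(len: usize, workers: usize) -> usize {
    let workers = workers.max(1);
    ((len + workers - 1) / workers).max(1)
}

/// Worker-pool shape: each worker encodes its share and consumes the result
/// locally, so every allocation is freed on the thread that made it.
fn worker_pool_shape(texts: &[String], workers: usize) -> usize {
    let chunk = chunk_size(texts.len(), workers);
    thread::scope(|s| {
        // Spawn everything first, then join, so the workers run concurrently.
        let handles: Vec<_> = texts
            .chunks(chunk)
            .map(|part| {
                s.spawn(move || {
                    part.iter()
                        .map(|t| {
                            let enc = fake_encode(t);
                            enc.len() // `enc` is dropped here, on the worker
                        })
                        .sum::<usize>()
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum::<usize>()
    })
}

/// Batch shape: workers encode, but the results travel back to the caller,
/// which reads and frees them on its own thread.
fn encode_batch_shape(texts: &[String], workers: usize) -> usize {
    let chunk = chunk_size(texts.len(), workers);
    let encodings: Vec<Vec<u32>> = thread::scope(|s| {
        let handles: Vec<_> = texts
            .chunks(chunk)
            .map(|part| {
                s.spawn(move || {
                    part.iter().map(|t| fake_encode(t)).collect::<Vec<Vec<u32>>>()
                })
            })
            .collect();
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    });
    let total: usize = encodings.iter().map(Vec::len).sum();
    drop(encodings); // every buffer is freed on the caller's thread
    total
}
```

Both functions return the same token count for the same input; only the drop site differs, which is the variable the bench isolates.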

Reported metric: encode_batch_elapsed / worker_pool_elapsed. On a 128-core x86_64 box with glibc, against current main, this ratio is ~2-3×. After @sebpop's planned Encoding-recycling fix it should drop to ≈ 1.0.
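As a rough illustration of how such a ratio can be computed, here is a small sketch using only std::time. The helper names time_once and slowdown_ratio are assumptions made for this example and are not taken from the bench source.

```rust
use std::time::{Duration, Instant};

/// Runs `f` once and returns its wall time.
fn time_once<F: FnOnce()>(f: F) -> Duration {
    let start = Instant::now();
    f();
    start.elapsed()
}

/// Headline metric: encode_batch wall time divided by worker-pool wall time.
/// A value near 1.0 means the two API shapes cost about the same.
fn slowdown_ratio(encode_batch: Duration, worker_pool: Duration) -> f64 {
    encode_batch.as_secs_f64() / worker_pool.as_secs_f64()
}
```

With illustrative timings of 600 ms for the batch path and 300 ms for the worker pool, slowdown_ratio returns 2.0.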

Two synthesized workloads, both run by default:

random-letters — random a-zA-Z, no whitespace. Essentially no BPE cache hits; exercises the encode hot path.
repeated-words — short pseudo-words separated by spaces. Cache fires aggressively; exercises the cache + memory-pressure path.
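For readers who want a feel for the two regimes, the sketch below generates comparable inputs with the standard library only. It is an assumption-laden illustration: the generator in scaling_bench.rs may differ, and XorShift64, random_letters, and repeated_words (including the 16-word vocabulary and the 3 to 7 letter word length) are choices made for this example.

```rust
/// Tiny xorshift64 generator so the sketch needs no external crates.
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // xorshift must not start from zero, so force one bit on.
        XorShift64(seed | 1)
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Pseudo-random index in `0..n` (slightly biased; fine for test data).
    fn below(&mut self, n: usize) -> usize {
        (self.next_u64() % n as u64) as usize
    }
}

const LETTERS: &[u8] = b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

/// random-letters style: a-zA-Z with no whitespace, so few repeated substrings.
fn random_letters(len: usize, seed: u64) -> String {
    let mut rng = XorShift64::new(seed);
    (0..len)
        .map(|_| LETTERS[rng.below(LETTERS.len())] as char)
        .collect::<String>()
}

/// repeated-words style: a small vocabulary of short lowercase pseudo-words,
/// space-separated, so the same words recur many times.
fn repeated_words(len: usize, seed: u64) -> String {
    let mut rng = XorShift64::new(seed);
    let vocab: Vec<String> = (0..16)
        .map(|_| {
            let word_len = 3 + rng.below(5); // 3 to 7 letters
            (0..word_len)
                .map(|_| LETTERS[rng.below(26)] as char) // lowercase half only
                .collect::<String>()
        })
        .collect();
    let mut out = String::with_capacity(len + 8);
    while out.len() < len {
        out.push_str(&vocab[rng.below(vocab.len())]);
        out.push(' ');
    }
    out.truncate(len); // all ASCII, so byte truncation is safe
    out
}
```

Both functions are deterministic for a given seed, which keeps runs of the two methods on identical data.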

The two regimes respond very differently to fixes; running only one risks over-fitting future PRs to one and silently regressing the other. --input <file> overrides the synthesized data with a real corpus (e.g. data/big.txt) for ad-hoc experiments.

Why standalone, not criterion?

The other benches in tokenizers/benches/ are criterion-based, including the consolidated ci_benchmark.rs that the benchmarks.yml workflow runs. Two reasons this bench is structurally different:

Scale. The regression doesn't manifest below ~16 cores; the benchmarks workflow runs on ubuntu-latest (4 vCPUs). On a 4-core box, the ratio is ≤ 1.2× and indistinguishable from noise. A criterion bench wired into ci_benchmark.rs would either no-op (useless watchdog) or produce noise. There is no "small N that exposes the regression" — it's a function of contention on the destination thread's allocator arena and scales with worker count.
Iteration shape. Each bench iteration here is a whole batch — many thousands of encode calls — taking tens of seconds at realistic sizes. Criterion's sampling model expects much shorter iterations; while sample_size(10) can paper over this, it costs both wall time and statistical interpretability. A standalone binary measures one full run with knobs the user controls.

The bench therefore lives next to the criterion benches but uses harness = false, prints its own summary, and is invoked manually via cargo bench --bench scaling_bench --. This matches harness = false precedent already established for every other bench in the directory; only the omission of criterion is new.

A note on "fair comparison"

A reviewer might reasonably ask: shouldn't both methods do the same thing to the Encodings — e.g. both should return a Vec<Encoding> to the caller — to be a "fair" comparison?

We deliberately don't, for two reasons:

The bench measures the user-visible cost of the encode_batch API shape, not algorithmic equivalence between two synthetic shapes. encode_batch is what every Python caller (including transformers) routes to, because that is the shape the Python binding exposes ergonomically, and it is also what Rust users reach for when they want "give me back a Vec<Encoding> I can iterate." The worker-pool shape is what Rust users write directly when they want per-doc parallel work without the cross-thread hand-off; it is the shape that both the original reporter and the reviewer reached for in #1900. Forcing the worker-pool variant to also collect into Vec<Encoding> would measure something nobody invokes and would hide the very asymmetry the bench exists to expose, an asymmetry that every Python caller of encode_batch pays today whether they know it or not.
The local drop is the cause of the cheap path, not a benchmarking gimmick. This was confirmed experimentally in the standalone reproducer (drop-site A/B, see https://github.com/stargazerZJ/tokenizers-1900-repro/blob/main/REPORT.md): the same scope/queue shape as encode_batch, but with each Encoding dropped on the worker that allocated it, takes essentially the same time as worker-pool. The asymmetry the bench measures is the cost of the API choice.

The headline ratio is therefore "how much more does the encode_batch path cost, today, than the cheap worker-pool path would on the same work." Python callers and Rust users routing through encode_batch pay this cost; Rust users who write the manual worker-pool shape don't. Once Encoding recycling lands the ratio collapses to ~1× and the bench's job becomes documenting the equalized state.

Allocator caveat

Numbers vary across allocators. On glibc the gap is widest; jemalloc roughly halves it on the cache-friendly workload; mimalloc is workload-dependent. The bench is documented as glibc-default; the direction of the ratio is what carries the regression signal, not the absolute magnitude.

Knobs (CLI)

--workers <N>
--batch-size <N> (default 1024)
--count <N> (default 500)
--length <N> (default 51200)
--workload random-letters|repeated-words|both (default both)
--method worker-pool|encode-batch|both (default both)
--tokenizer <path> (default data/llama-3-tokenizer.json)
--input <path> (real-corpus override)
--quick (small enough to run in seconds on any machine; will not produce a meaningful ratio on small machines but verifies the bench compiles and runs)

Argument parsing is hand-rolled — no new dependency on clap.
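A dependency-free parser for these flags can be quite small. The sketch below is one possible shape and is not the parser in the PR: the Args struct, the helper names, and the error messages are assumptions, and --workers is left as None by default because the description does not state its default.

```rust
/// Parsed options. Defaults mirror the values listed in the description.
#[derive(Debug, Clone, PartialEq)]
struct Args {
    workers: Option<usize>, // no default stated in the description
    batch_size: usize,
    count: usize,
    length: usize,
    workload: String,
    method: String,
    tokenizer: String,
    input: Option<String>,
    quick: bool,
}

impl Default for Args {
    fn default() -> Self {
        Args {
            workers: None,
            batch_size: 1024,
            count: 500,
            length: 51200,
            workload: "both".to_string(),
            method: "both".to_string(),
            tokenizer: "data/llama-3-tokenizer.json".to_string(),
            input: None,
            quick: false,
        }
    }
}

/// Parses a non-negative integer value for `flag`.
fn number(flag: &str, value: &str) -> Result<usize, String> {
    value
        .parse()
        .map_err(|_| format!("{} expects an integer, got {:?}", flag, value))
}

/// Accepts `value` only if it is one of `allowed`.
fn choice(flag: &str, value: String, allowed: &[&str]) -> Result<String, String> {
    if allowed.iter().any(|a| *a == value) {
        Ok(value)
    } else {
        Err(format!("{} must be one of {:?}, got {:?}", flag, allowed, value))
    }
}

/// Parses the arguments that follow `--` on the command line.
fn parse_args<I: IntoIterator<Item = String>>(argv: I) -> Result<Args, String> {
    const VALUE_FLAGS: [&str; 8] = [
        "--workers", "--batch-size", "--count", "--length",
        "--workload", "--method", "--tokenizer", "--input",
    ];
    let mut args = Args::default();
    let mut it = argv.into_iter();
    while let Some(flag) = it.next() {
        if flag == "--quick" {
            args.quick = true;
            continue;
        }
        if !VALUE_FLAGS.iter().any(|k| *k == flag) {
            return Err(format!("unknown flag {}", flag));
        }
        let value = it
            .next()
            .ok_or_else(|| format!("missing value for {}", flag))?;
        match flag.as_str() {
            "--workers" => args.workers = Some(number(&flag, &value)?),
            "--batch-size" => args.batch_size = number(&flag, &value)?,
            "--count" => args.count = number(&flag, &value)?,
            "--length" => args.length = number(&flag, &value)?,
            "--workload" => {
                args.workload =
                    choice(&flag, value, &["random-letters", "repeated-words", "both"])?
            }
            "--method" => {
                args.method = choice(&flag, value, &["worker-pool", "encode-batch", "both"])?
            }
            "--tokenizer" => args.tokenizer = value,
            "--input" => args.input = Some(value),
            _ => unreachable!("flag was checked against VALUE_FLAGS above"),
        }
    }
    Ok(args)
}
```

In a binary this would typically be fed from std::env::args().skip(1); taking any iterator of String keeps the parser easy to exercise without touching the process environment.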

Reference numbers

128-core dual-socket Xeon 8375C, glibc 2.35, current main (commit 22d54d37, which already includes the per-thread BPE cache fix from #2028), --workers 32 --batch-size 1024 --count 500 --length 51200:

Workload	worker_pool	encode_batch	ratio
random-letters	0.27 s	0.59 s	2.18×
repeated-words	0.36 s	0.78 s	2.15×

Numbers from a single run on the author's hardware, included for calibration only — your numbers will differ; the shape of the asymmetry is what the bench guards against, not the absolute magnitude.

What's left to close the gap

Two fixes are still in flight:

Batch encode: lock-free work queue with dynamic window sizing #2029 (open) — persistent-barrier worker pool, targeting the per-rayon::scope wake/sleep cost (sebpop's "claim 1" in his 2026-04-27 comment on #1900).
Encoding recycling — a planned follow-up from @sebpop targeting cross-thread free of Encoding on the main thread's destructor sweep (sebpop's "claim 2").

The standalone reproducer at https://github.com/stargazerZJ/tokenizers-1900-repro experimentally separates the two effects (drop-site A/B, full numbers in its REPORT.md): at the batch sizes this bench measures (-b 1024 and up), cross-thread free is the dominant cost and the per-scope wake/sleep cost is second-order. #2029 alone is therefore not expected to materially move this bench; the recycling work is what brings the ratio to ~1×. Both should land for the in-tree reference to converge.

Open questions for the maintainers

Tokenizer file. Default is data/llama-3-tokenizer.json (already wired up by make bench). Happy to switch to gpt2-vocab.json / bert-base-uncased-vocab.txt if you prefer; the tokenizer choice does not affect whether the asymmetry shows up.
Documentation placement. The top-of-file doc comment links #1900 (Performance: batch_encode scales poorly on high-core Server CPUs compared to sharded tokenizer instances), #2028 (BPE cache: per-thread read-through cache to avoid RwLock atomics on hits), #2029 (Batch encode: lock-free work queue with dynamic window sizing), the standalone repo, and ci_benchmark.rs's concurrent-4t group. Happy to also add a section to CONTRIBUTING.md under the "benchmarks" subsection if you'd like a pointer.
Makefile integration. make bench runs cargo bench (no args), which currently runs every bench file. Since scaling_bench is meant to be run manually and adds tens of seconds of run time even on a 4-core box, should I either (a) gate it behind an env var so make bench skips it, or (b) leave it as-is and let users opt in via cargo bench --bench scaling_bench? It is currently left as-is.

Files

tokenizers/benches/scaling_bench.rs (new, ~530 lines including doc comments)
tokenizers/Cargo.toml (4-line addition: new [[bench]] entry)

Test plan

cargo build --release --bench scaling_bench succeeds.
cargo clippy --all-targets --all-features -- -D warnings clean.
cargo fmt -- ./benches/scaling_bench.rs --check clean.
cargo bench --bench scaling_bench -- --quick runs end-to-end on a 128-core box (~1s).
cargo bench --bench scaling_bench -- (full default) runs end-to-end and reports a ratio in the expected direction.
Verified by a maintainer on a smaller box (e.g. an 8-core dev laptop) that --quick runs cleanly and the bench compiles.

This PR by itself is a watchdog/reference, not the fix. The companion fix is @sebpop's Encoding-recycling work, planned as a separate PR.
Closes #1900 only after that fix lands and ratios drop to ≈ 1×.

…e#1900) Standalone benchmark binary (`harness = false`) that compares `Tokenizer::encode_batch` against an explicit `rayon::ThreadPoolBuilder` + `par_iter()` worker pool over `Tokenizer::encode`, on the same data, and reports the ratio of their wall times. Deep-dive bench for maintainers and contributors working on the parallel-encode path; not wired into CI because the regression it measures only manifests on machines with many cores (>=16) under realistic batch shapes, which the `ubuntu-latest` benchmark runner cannot provide. The existing `ci_benchmark.rs::concurrent-4t` group remains the in-CI watchdog. Two synthesized workloads (random-letters, repeated-words) stress the encode hot path and the BPE-cache + memory-pressure path respectively; `--input <file>` overrides with a real corpus. No new dependencies (hand-rolled CLI parser); no library changes; no CI changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

stargazerZJ mentioned this pull request May 1, 2026

Performance: batch_encode scales poorly on high-core Server CPUs compared to sharded tokenizer instances #1900

Open

stargazerZJ wants to merge 1 commit into huggingface:main from stargazerZJ:scaling-bench
