Add scaling_bench: encode_batch vs worker-pool comparison (#1900) #2048

Open
stargazerZJ wants to merge 1 commit into huggingface:main from stargazerZJ:scaling-bench

@stargazerZJ

Summary

Adds tokenizers/benches/scaling_bench.rs, a standalone-binary benchmark (harness = false) that compares the two parallel-tokenization API shapes a real user can write today, on the same data, in the same process, and reports the ratio of their wall times.

Scope: a single new file; one new [[bench]] entry in tokenizers/Cargo.toml. No new dependencies, no CI changes, no library changes.

This is a deep-dive benchmark for maintainers and contributors, not a CI gate. The regression it measures only manifests at scale (≥16 cores, large documents, large batches); GitHub Actions' ubuntu-latest runner has 4 vCPUs and cannot reproduce it. The existing ci_benchmark.rs concurrent-4t group remains the in-CI watchdog; this bench complements it for the cases where 4-thread, ~80 KB total data isn't enough to expose the scaling problem.

Background

Issue #1900 reports that Tokenizer::encode_batch is several times slower than a manual rayon::ThreadPoolBuilder + par_iter() over Tokenizer::encode on many-core x86_64. PRs #2028 (merged) and #2029 (open) move the needle but don't close the gap on a 128-core dual-socket Xeon 8375C. A standalone reproducer with full numerical findings — including drop-site A/B isolation, allocator-swap experiments (jemalloc/mimalloc), MALLOC_ARENA_MAX sweep, and futex/syscall counts — lives at https://github.com/stargazerZJ/tokenizers-1900-repro.

This PR brings the core comparison in-tree as a permanent reference for contributors working on the parallel-encode path, leaving the diagnostic surface area (allocator swaps, drop-site A/B, etc.) in the standalone repo where it belongs.

What the bench measures

Two --method values, both run on the same generated input via the public Tokenizer API only:

  • worker-pool: explicit rayon::ThreadPoolBuilder::new().num_threads(W).build(), then pool.install(|| texts.par_iter().for_each(|t| { let enc = tok.encode(t, false)?; consume(enc) })). Each Encoding is consumed (token count read, then dropped) inside the closure, so it is allocated and freed on the same worker thread. This is the natural shape for Rust users calling into tokenizers directly — work on each doc in a parallel closure and never ship the Encodings across threads.

  • encode-batch: stock Tokenizer::encode_batch(texts, false) per chunk of --batch-size, returning Vec<Encoding> to the caller. The caller iterates the returned vec on the main thread and drops there.
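A minimal, std-only sketch of the two drop sites described above, for illustration only. `fake_encode` is a hypothetical stand-in for `Tokenizer::encode`, and scoped threads plus a channel stand in for the rayon pool and the returned Vec<Encoding>; the bench itself uses the public Tokenizer API and rayon. The sketch shows where each buffer is freed, not the measured slowdown.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for `Tokenizer::encode`: returns a heap-allocated
/// "encoding" (the real call returns a `tokenizers::Encoding`).
fn fake_encode(text: &str) -> Vec<u32> {
    text.bytes().map(u32::from).collect()
}

/// Documents per worker, rounded up and never zero.
fn chunk_len(total: usize, workers: usize) -> usize {
    let workers = workers.max(1);
    ((total + workers - 1) / workers).max(1)
}

/// worker-pool shape: each result is consumed and dropped on the worker
/// thread that allocated it; only the token count leaves the thread.
fn local_drop(texts: &[String], workers: usize) -> usize {
    thread::scope(|s| {
        let handles: Vec<_> = texts
            .chunks(chunk_len(texts.len(), workers))
            .map(|part| {
                s.spawn(move || part.iter().map(|t| fake_encode(t).len()).sum::<usize>())
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum::<usize>()
    })
}

/// encode-batch shape: workers hand every result back to the caller, which
/// reads the token count and frees the buffer on the calling thread.
fn remote_drop(texts: &[String], workers: usize) -> usize {
    let (tx, rx) = mpsc::channel::<Vec<u32>>();
    thread::scope(|s| {
        for part in texts.chunks(chunk_len(texts.len(), workers)) {
            let tx = tx.clone();
            s.spawn(move || {
                for t in part {
                    tx.send(fake_encode(t)).unwrap();
                }
            });
        }
        // Drop the original sender so the receiver loop ends once every
        // worker has finished and released its clone.
        drop(tx);
        rx.iter().map(|enc| enc.len()).sum::<usize>()
    })
}
```

Both functions return the same token count for the same input; the only difference is which thread frees each buffer.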

Reported metric: encode_batch_elapsed / worker_pool_elapsed. On a 128-core x86_64 box with glibc, against current main, this ratio is ~2-3×. After @sebpop's planned Encoding-recycling fix it should drop to ≈ 1.0.
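As a rough sketch of how the reported metric can be computed (the helper names `time_once` and `ratio` are hypothetical and not taken from scaling_bench.rs): each method is timed once over the full run, and the two wall times are divided.

```rust
use std::time::{Duration, Instant};

/// Time one full run of `f` and return its wall time. This is a single
/// measurement with no warm-up and no statistics.
fn time_once<F: FnOnce()>(f: F) -> Duration {
    let start = Instant::now();
    f();
    start.elapsed()
}

/// Headline metric: encode_batch wall time divided by worker-pool wall time.
/// Values above 1.0 mean the encode_batch path was slower.
fn ratio(encode_batch_elapsed: Duration, worker_pool_elapsed: Duration) -> f64 {
    encode_batch_elapsed.as_secs_f64() / worker_pool_elapsed.as_secs_f64()
}
```

A caller would wrap each method's full run in `time_once` and pass the two durations to `ratio`.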

Two synthesized workloads, both run by default:

  • random-letters — random a-zA-Z, no whitespace. Essentially no BPE cache hits; exercises the encode hot path.
  • repeated-words — short pseudo-words separated by spaces. Cache fires aggressively; exercises the cache + memory-pressure path.

The two regimes respond very differently to fixes; running only one risks over-fitting future PRs to one and silently regressing the other. --input <file> overrides the synthesized data with a real corpus (e.g. data/big.txt) for ad-hoc experiments.
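For illustration, the two synthesized workloads could be generated along these lines. This is a sketch under assumptions: the xorshift PRNG, the four-word vocabulary, and the function names are hypothetical and are not taken from the bench source.

```rust
/// Tiny xorshift64 PRNG so the sketch needs no external crate.
/// The seed must be non-zero.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Pseudo-random index in `0..n` (slightly biased; fine for a sketch).
    fn below(&mut self, n: usize) -> usize {
        (self.next_u64() % n as u64) as usize
    }
}

const LETTERS: &[u8] = b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

/// random-letters: `len` random a-zA-Z characters with no whitespace.
fn random_letters(rng: &mut XorShift64, len: usize) -> String {
    (0..len)
        .map(|_| LETTERS[rng.below(LETTERS.len())] as char)
        .collect()
}

/// repeated-words: short pseudo-words separated by spaces, cut to `len` bytes.
fn repeated_words(rng: &mut XorShift64, len: usize) -> String {
    // Hypothetical vocabulary; a small set means the same words repeat often.
    const WORDS: [&str; 4] = ["tok", "enize", "batch", "pool"];
    let mut out = String::with_capacity(len + 8);
    while out.len() < len {
        if !out.is_empty() {
            out.push(' ');
        }
        out.push_str(WORDS[rng.below(WORDS.len())]);
    }
    out.truncate(len); // ASCII only, so any byte offset is a char boundary
    out
}
```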

Why standalone, not criterion?

The other benches in tokenizers/benches/ are criterion-based, including the consolidated ci_benchmark.rs that the benchmarks.yml workflow runs. Two reasons this bench is structurally different:

  1. Scale. The regression doesn't manifest below ~16 cores; the benchmarks workflow runs on ubuntu-latest (4 vCPUs). On a 4-core box, the ratio is ≤ 1.2× and indistinguishable from noise. A criterion bench wired into ci_benchmark.rs would either no-op (useless watchdog) or produce noise. There is no "small N that exposes the regression" — it's a function of contention on the destination thread's allocator arena and scales with worker count.

  2. Iteration shape. Each bench iteration here is a whole batch — many thousands of encode calls — taking tens of seconds at realistic sizes. Criterion's sampling model expects much shorter iterations; while sample_size(10) can paper over this, it costs both wall time and statistical interpretability. A standalone binary measures one full run with knobs the user controls.

The bench therefore lives next to the criterion benches but uses harness = false, prints its own summary, and is invoked manually via cargo bench --bench scaling_bench --. This matches the harness = false precedent already established for every other bench in the directory; only the omission of criterion is new.

A note on "fair comparison"

A reviewer might reasonably ask: shouldn't both methods do the same thing to the Encodings — e.g. both should return a Vec<Encoding> to the caller — to be a "fair" comparison?

We deliberately don't, for two reasons:

  1. The bench measures the user-visible cost of the encode_batch API shape, not algorithmic equivalence between two synthetic shapes. encode_batch is what every Python caller (including transformers) routes to, because that's the shape the Python binding exposes ergonomically; it's also what Rust users reach for when they want "give me back a Vec<Encoding> I can iterate." The worker-pool shape is what Rust users write directly when they want per-doc parallel work without the cross-thread hand-off, and it is the shape that both the original reporter and the reviewer reached for in #1900. Forcing the worker-pool variant to also collect into Vec<Encoding> would measure something nobody invokes and would hide the very asymmetry the bench exists to expose: an asymmetry that every Python caller of encode_batch pays today whether they know it or not.

  2. The local drop is the cause of the cheap path, not a benchmarking gimmick. This was confirmed experimentally in the standalone reproducer (drop-site A/B, see https://github.com/stargazerZJ/tokenizers-1900-repro/blob/main/REPORT.md): the same scope/queue shape as encode_batch, but with each Encoding dropped on the worker that allocated it, takes essentially the same time as worker-pool. The asymmetry the bench measures is the cost of the API choice.

The headline ratio is therefore "how much more does the encode_batch path cost, today, than the cheap worker-pool path would on the same work." Python callers and Rust users routing through encode_batch pay this cost; Rust users who write the manual worker-pool shape don't. Once Encoding recycling lands the ratio collapses to ~1× and the bench's job becomes documenting the equalized state.

Allocator caveat

Numbers vary across allocators. On glibc the gap is widest; jemalloc roughly halves it on the cache-friendly workload; mimalloc is workload-dependent. The bench is documented as glibc-default; the direction of the ratio is what carries the regression signal, not the absolute magnitude.

Knobs (CLI)

  • --workers <N>
  • --batch-size <N> (default 1024)
  • --count <N> (default 500)
  • --length <N> (default 51200)
  • --workload random-letters|repeated-words|both (default both)
  • --method worker-pool|encode-batch|both (default both)
  • --tokenizer <path> (default data/llama-3-tokenizer.json)
  • --input <path> (real-corpus override)
  • --quick (small enough to run in seconds on any machine; will not produce a meaningful ratio on small machines but verifies the bench compiles and runs)

Argument parsing is hand-rolled — no new dependency on clap.
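A hand-rolled parser for a subset of these knobs might look like the sketch below. The `Config` struct and `parse_args` function are hypothetical names, and the string-valued flags (--workload, --method, --tokenizer, --input) are left out for brevity. The `--bench` arm is there because, as far as I know, cargo bench passes that flag to harness = false binaries.

```rust
#[derive(Debug, PartialEq)]
struct Config {
    workers: Option<usize>, // no default is stated for --workers
    batch_size: usize,
    count: usize,
    length: usize,
    quick: bool,
}

impl Default for Config {
    fn default() -> Self {
        // Defaults as listed under "Knobs (CLI)".
        Config { workers: None, batch_size: 1024, count: 500, length: 51200, quick: false }
    }
}

fn parse_args<I: IntoIterator<Item = String>>(args: I) -> Result<Config, String> {
    let mut cfg = Config::default();
    let mut it = args.into_iter();
    while let Some(flag) = it.next() {
        // Flags that take a numeric value pull it from the iterator here.
        let mut value = |name: &str| -> Result<usize, String> {
            it.next()
                .ok_or_else(|| format!("{name} expects a value"))?
                .parse::<usize>()
                .map_err(|e| format!("{name}: {e}"))
        };
        match flag.as_str() {
            "--workers" => cfg.workers = Some(value("--workers")?),
            "--batch-size" => cfg.batch_size = value("--batch-size")?,
            "--count" => cfg.count = value("--count")?,
            "--length" => cfg.length = value("--length")?,
            "--quick" => cfg.quick = true,
            // Assumption: tolerate the flag cargo bench adds on its own.
            "--bench" => {}
            other => return Err(format!("unknown argument: {other}")),
        }
    }
    Ok(cfg)
}
```

In a real binary the input would be `std::env::args().skip(1)`; taking any iterator of strings keeps the parser testable without touching the process environment.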

Reference numbers

128-core dual-socket Xeon 8375C, glibc 2.35, current main (commit 22d54d37, which already includes the per-thread BPE cache fix from #2028), --workers 32 --batch-size 1024 --count 500 --length 51200:

| Workload | worker_pool | encode_batch | ratio |
|---|---|---|---|
| random-letters | 0.27 s | 0.59 s | 2.18× |
| repeated-words | 0.36 s | 0.78 s | 2.15× |

Numbers from a single run on the author's hardware, included for calibration only — your numbers will differ; the shape of the asymmetry is what the bench guards against, not the absolute magnitude.

What's left to close the gap

Two fixes are still in flight:

  1. #2029 ("Batch encode: lock-free work queue with dynamic window sizing", open): a persistent-barrier worker pool, targeting the per-rayon::scope wake/sleep cost (sebpop's "claim 1" in his 2026-04-27 comment on #1900).
  2. Encoding recycling: a planned follow-up from @sebpop targeting the cross-thread free of Encodings in the main thread's destructor sweep (sebpop's "claim 2").

The standalone reproducer at https://github.com/stargazerZJ/tokenizers-1900-repro experimentally separates the two effects (drop-site A/B, full numbers in its REPORT.md): at the batch sizes this bench measures (-b 1024 and up), cross-thread free is the dominant cost and the per-scope wake/sleep cost is second-order. #2029 alone is therefore not expected to materially move this bench; the recycling work is what brings the ratio to ~1×. Both should land for the in-tree reference to converge.

Open questions for the maintainers

  1. Tokenizer file. Default is data/llama-3-tokenizer.json (already wired up by make bench). Happy to switch to gpt2-vocab.json / bert-base-uncased-vocab.txt if you prefer; the tokenizer choice does not affect whether the asymmetry shows up.
  2. Documentation placement. The top-of-file doc comment links #1900 ("Performance: batch_encode scales poorly on high-core Server CPUs compared to sharded tokenizer instances"), #2028 ("BPE cache: per-thread read-through cache to avoid RwLock atomics on hits"), #2029 ("Batch encode: lock-free work queue with dynamic window sizing"), the standalone repo, and ci_benchmark.rs's concurrent-4t group. Happy to also add a section to CONTRIBUTING.md under the "benchmarks" subsection if you'd like a pointer.
  3. Makefile integration. make bench runs cargo bench (no args), which currently runs every bench file. Since scaling_bench is meant to be run manually and takes tens of seconds even on a 4-core box, should I (a) gate it behind an env var so make bench skips it, or (b) leave it as-is and let users opt in via cargo bench --bench scaling_bench? It is currently left as-is.

Files

  • tokenizers/benches/scaling_bench.rs (new, ~530 lines including doc comments)
  • tokenizers/Cargo.toml (4-line addition: new [[bench]] entry)

Test plan

  • cargo build --release --bench scaling_bench succeeds.
  • cargo clippy --all-targets --all-features -- -D warnings clean.
  • cargo fmt -- ./benches/scaling_bench.rs --check clean.
  • cargo bench --bench scaling_bench -- --quick runs end-to-end on a
    128-core box (~1s).
  • cargo bench --bench scaling_bench -- (full default) runs
    end-to-end and reports a ratio in the expected direction.
  • Verified by a maintainer on a smaller box (e.g. an 8-core dev
    laptop) that --quick runs cleanly and the bench compiles.

This PR by itself is a watchdog/reference, not the fix. The companion fix is @sebpop's Encoding-recycling work, planned as a separate PR.
Closes #1900 only after that fix lands and ratios drop to ≈ 1×.

…e#1900)

Standalone benchmark binary (`harness = false`) that compares
`Tokenizer::encode_batch` against an explicit `rayon::ThreadPoolBuilder`
+ `par_iter()` worker pool over `Tokenizer::encode`, on the same data,
and reports the ratio of their wall times.

Deep-dive bench for maintainers and contributors working on the
parallel-encode path; not wired into CI because the regression it
measures only manifests on machines with many cores (>=16) under
realistic batch shapes, which the `ubuntu-latest` benchmark runner
cannot provide. The existing `ci_benchmark.rs::concurrent-4t` group
remains the in-CI watchdog.

Two synthesized workloads (random-letters, repeated-words) stress the
encode hot path and the BPE-cache + memory-pressure path respectively;
`--input <file>` overrides with a real corpus.

No new dependencies (hand-rolled CLI parser); no library changes; no
CI changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>