Bindings: acquire model read-lock once per call instead of per pre-token by sebpop · Pull Request #2072 · huggingface/tokenizers

sebpop · 2026-05-26T17:33:24Z

The Python and Node bindings both wrap the inner model in
Arc<RwLock<ModelWrapper>>. Their Model::tokenize() impls each do
self.model.read().unwrap().tokenize(seq), so one read() lock
acquisition per call is unavoidable. The problem is that
TokenizerImpl::do_tokenize calls tokenize() once per pre-token, and a
typical ~6 KB document contains ~1 500 pre-tokens. Each of those
calls pays for an acquire/release pair on the RwLock, i.e. atomic
operations on shared state that produce no useful work.
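
The difference between the two locking shapes can be sketched as follows. This is a minimal illustration, not the binding code: ToyModel, per_piece and lock_once are hypothetical names, and the "tokenizer" just returns the byte length of each piece.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for the inner model; only the locking pattern matters.
struct ToyModel;

impl ToyModel {
    /// Placeholder "tokenization": returns the byte length of the piece.
    fn tokenize(&self, piece: &str) -> usize {
        piece.len()
    }
}

/// Old shape: one read-lock acquire/release pair per pre-token.
fn per_piece(model: &Arc<RwLock<ToyModel>>, pieces: &[&str]) -> Vec<usize> {
    pieces
        .iter()
        .map(|p| model.read().unwrap().tokenize(p))
        .collect()
}

/// New shape: acquire once, then process every pre-token under the same guard.
fn lock_once(model: &Arc<RwLock<ToyModel>>, pieces: &[&str]) -> Vec<usize> {
    let guard = model.read().unwrap();
    pieces.iter().map(|p| guard.tokenize(p)).collect()
}
```

Both functions return the same result; only the number of lock acquisitions differs (one per piece versus one per call).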

Add a default Model::tokenize_in_pretokenized trait method that
takes the optional truncation parameters and dispatches to either
PreTokenizedString::tokenize or tokenize_with_limit. Override it
in PyModel (Python binding) and Model (Node binding) so that
both bindings acquire the read lock once and tokenize every
pre-token under the same guard. TokenizerImpl::do_tokenize calls
the new method instead of dispatching per pre-token, which makes
both the truncated and the non-truncated paths benefit from the
override. The full doc lives once on the trait method; the
overrides just point at it.
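
A simplified sketch of that trait shape, under stated assumptions: Pieces is a hypothetical stand-in for PreTokenizedString, truncation is reduced to a plain token limit (the direction is omitted), and CharModel and Shared are illustrative types, not the real BPE or PyModel. The actual signatures in the patch may differ.

```rust
use std::sync::{Arc, RwLock};

type Result<T> = std::result::Result<T, Box<dyn std::error::Error + Send + Sync>>;

/// Hypothetical stand-in for PreTokenizedString: the pre-token strings
/// and the tokens produced for each of them.
struct Pieces {
    splits: Vec<String>,
    tokens: Vec<Vec<String>>,
}

impl Pieces {
    /// Tokenize every split with `f`.
    fn tokenize<F>(&mut self, f: F) -> Result<()>
    where
        F: Fn(&str) -> Result<Vec<String>>,
    {
        self.tokens = self
            .splits
            .iter()
            .map(|s| f(s.as_str()))
            .collect::<Result<Vec<_>>>()?;
        Ok(())
    }

    /// Tokenize splits in order, stopping once `limit` tokens were produced.
    fn tokenize_with_limit<F>(&mut self, f: F, limit: usize) -> Result<()>
    where
        F: Fn(&str) -> Result<Vec<String>>,
    {
        let mut out = Vec::new();
        let mut remaining = limit;
        for s in &self.splits {
            if remaining == 0 {
                break;
            }
            let mut toks = f(s.as_str())?;
            toks.truncate(remaining);
            remaining -= toks.len();
            out.push(toks);
        }
        self.tokens = out;
        Ok(())
    }
}

trait Model {
    fn tokenize(&self, piece: &str) -> Result<Vec<String>>;

    /// Default: dispatch per pre-token through `self.tokenize`.
    fn tokenize_in_pretokenized(
        &self,
        pieces: &mut Pieces,
        truncation: Option<usize>,
    ) -> Result<()> {
        match truncation {
            None => pieces.tokenize(|s| self.tokenize(s)),
            Some(limit) => pieces.tokenize_with_limit(|s| self.tokenize(s), limit),
        }
    }
}

/// Illustrative inner model: one token per character.
struct CharModel;

impl Model for CharModel {
    fn tokenize(&self, piece: &str) -> Result<Vec<String>> {
        Ok(piece.chars().map(|c| c.to_string()).collect())
    }
}

/// Illustrative binding-side wrapper around a lock-protected model.
struct Shared {
    model: Arc<RwLock<CharModel>>,
}

impl Model for Shared {
    fn tokenize(&self, piece: &str) -> Result<Vec<String>> {
        self.model.read().unwrap().tokenize(piece)
    }

    /// Override: take the read lock once, then run the inner model's
    /// default implementation under that single guard.
    fn tokenize_in_pretokenized(
        &self,
        pieces: &mut Pieces,
        truncation: Option<usize>,
    ) -> Result<()> {
        let guard = self.model.read().unwrap();
        guard.tokenize_in_pretokenized(pieces, truncation)
    }
}
```

In this sketch CharModel never overrides the new method and still works through the default, which is the property that keeps the trait change non-breaking for external implementors.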

The default implementation preserves the old behaviour byte-for-byte
for any Model that is not behind a RwLock, so adding the method
to the public trait is non-breaking: external implementors do not
need to override it.

cargo test --lib --features http
 201 passed, 0 failed.

cargo clippy --lib --tests --features http -- -D warnings
 clean on tokenizers/, bindings/python/, bindings/node/.

Throughput on a Python script that calls
tokenizer.encode_batch(docs, false) in a 15 s wall-clock loop, where
docs is tokenizers/data/big.txt (6.5 MB) split into 999 ~6.5 KB chunks.
Number of encode_batch calls completed (more is better):

  platform                              cores   threads  before   after
  ---------                             -----   -------  ------   -----
  NVIDIA Vera (aarch64)                  89P    1T          10      10
                                                88T         57     147   (+158 %)
  AMD EPYC 9124 (x86_64)                 16P    1T           8       8
                                                16T         63      68   (+8 %)
                                                32T         65      70   (+8 %)
  Apple M3 (aarch64, no SMT)             12P    1T          11      11
                                                6T          43      45   (+5 %)
                                                12T         47      54   (+15 %)
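
The percentages in the last column are the relative change in completed calls, rounded to the nearest whole percent. A small sketch of that arithmetic (gain_percent is a hypothetical helper, not part of the patch or the benchmark script):

```rust
/// Relative change from `before` to `after`, rounded to the nearest percent.
fn gain_percent(before: u32, after: u32) -> i64 {
    ((f64::from(after) / f64::from(before) - 1.0) * 100.0).round() as i64
}
```

For the Vera 88T row this is (147 / 57 - 1) * 100, which rounds to 158.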

The 1T cases are flat on every platform because tokenisation is
dominated by BPE merging itself; there is no contention on a single
thread. The win scales with thread count and is largest on
many-core aarch64 (where each LSE acquire/release pair takes
substantially more cycles than x86 lock cmpxchg under contention).

Perf evidence on Vera at 88T, wheels built -Ctarget-feature=+lse,+rcpc, perf record -g --call-graph fp -F 4999:

  symbol                                                before    after
  <PyModel as Model>::tokenize                          75.36%    -
  <PyModel as Model>::tokenize_in_pretokenized          -          0.00%
  <ModelWrapper as Model>::tokenize                      0.01%     0.21%
  tokenizers::models::bpe::word::Word::merge_all         -         3.72%
  tokenizers::models::bpe::model::BPE::merge_word        0.17%     1.25%
  std::sys::sync::rwlock::futex::RwLock::read_contended  0.07%     0.00%

Before, 75 % of CPU cycles are inside PyModel::tokenize, the
wrapper that takes an Arc<RwLock>::read per pre-token, calls the
inner BPE, then drops the guard. After, that wrapper is replaced by
a single call to tokenize_in_pretokenized which takes the lock
once for the whole pre-token sequence; the actual BPE merging
(Word::merge_all, BPE::merge_word) surfaces in the profile where
it should have been all along.

Perf on AMD EPYC 9124 at 16T (built without +lse, atomics inlined
as native x86 lock instructions):

  symbol                                                before    after
  <BPE as Model>::tokenize                              0.70%     0.50%
  <PyModel as Model>::tokenize                          0.60%     -
  <ModelWrapper as Model>::tokenize                     0.15%     0.16%
  std::sys::sync::rwlock::futex::RwLock::read_contended 0.01%     0.00%

The relative magnitudes are vastly smaller on x86 because uncontended
lock cmpxchg is fast, but the wall-clock direction is the same.

The Rust-only bpe_benchmark (which builds BPE directly via
BpeBuilder and never goes through PyModel) is unaffected by this
change: 3.93 -> 3.94 MiB/s at 1T and 17.97 -> 17.93 MiB/s at 88T,
both within run-to-run noise.

HuggingFaceDocBuilderDev · 2026-05-28T08:21:51Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

McPatate

Not sure we want to replicate the comments 3 times, perhaps best to add docs over at the call site instead? just an idea though.

There's also a pretokenized.tokenize_with_limit call that might benefit from this too in the codebase.

Finally, would be good to test on other platforms and archs.

McPatate · 2026-05-28T08:18:57Z

+    pretokenized: &mut tk::tokenizer::PreTokenizedString,
+  ) -> tk::Result<()> {
+    let model = self.model.as_ref().ok_or("Uninitialized Model")?;
+    let guard = model.read().unwrap();


Would be nice to return an error rather than panic imo
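
One way this could look, as a sketch only: read() fails solely when the lock is poisoned, and the PoisonError can be turned into a message so that ? propagates it. read_len and the String payload are hypothetical; the real override would hold the guard and tokenize under it.

```rust
use std::sync::RwLock;

type Result<T> = std::result::Result<T, Box<dyn std::error::Error + Send + Sync>>;

/// Hypothetical helper: read under the lock, reporting poisoning as an
/// error instead of panicking through unwrap().
fn read_len(model: &RwLock<String>) -> Result<usize> {
    let guard = model
        .read()
        // PoisonError wraps the guard, so it is converted to an owned
        // message before being boxed as the error value.
        .map_err(|e| format!("model lock poisoned: {e}"))?;
    Ok(guard.len())
}
```

The error is mapped to a String because the PoisonError itself carries the guard and cannot be boxed as a 'static error directly.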


sebpop · 2026-05-29T02:13:36Z

Thanks for the review. Force-pushed an amended patch that addresses three points:

- Docs consolidated. The full rationale lives once on the trait default; the PyModel and Node overrides are now a one-line // See [Model::tokenize_in_pretokenized].
- tokenize_with_limit folded in. The trait method now takes truncation: Option<(usize, TruncationDirection)> and dispatches to either tokenize or tokenize_with_limit internally. do_tokenize collapses to one call, and both the truncated and non-truncated paths benefit from the lock-once override.
- Cross-platform validation. Added to the commit message: numbers on aarch64 Vera (89P/178L), AMD EPYC 9124 (16P/32L) and Apple M3 (12P). 1T is flat everywhere, as expected. The wall-clock win scales with thread count and is largest on many-core aarch64; x86 sees a smaller +8% because uncontended lock cmpxchg is much cheaper per call than aarch64 LSE under contention. The direction is the same on all three.

sebpop · 2026-05-29T09:30:12Z

There are two pre-existing CI failures on main, neither related to any code change in tokenizers/ or bindings/{python,node}/src/.

Fixed with: #2079

McPatate reviewed May 28, 2026


sebpop force-pushed the p3 branch from 8d6422e to 6b41e97 · May 29, 2026 02:13

sebpop wants to merge 1 commit into huggingface:main from sebpop:p3
