Skip to content

Bindings: acquire model read-lock once per call instead of per pre-token#2072

Open
sebpop wants to merge 1 commit into
huggingface:mainfrom
sebpop:p3
Open

Bindings: acquire model read-lock once per call instead of per pre-token#2072
sebpop wants to merge 1 commit into
huggingface:mainfrom
sebpop:p3

Conversation

@sebpop
Copy link
Copy Markdown
Contributor

@sebpop sebpop commented May 26, 2026

The Python and Node bindings both wrap the inner model in
Arc<RwLock<ModelWrapper>>. Their Model::tokenize() impls each do
self.model.read().unwrap().tokenize(seq), so a single read() lock
acquisition per call is unavoidable -- but TokenizerImpl::do_tokenize
calls tokenize() once per pre-token, and a typical ~6 KB document
contains ~1 500 pre-tokens. Each acquire/release pair on the
RwLock becomes one atomic operation per pre-token that produces no
useful work.

Add a default Model::tokenize_in_pretokenized trait method that
takes the optional truncation parameters and dispatches to either
PreTokenizedString::tokenize or tokenize_with_limit. Override it
in PyModel (Python binding) and Model (Node binding) so that
both bindings acquire the read lock once and tokenize every
pre-token under the same guard. TokenizerImpl::do_tokenize calls
the new method instead of dispatching per pre-token, which makes
both the truncated and the non-truncated paths benefit from the
override. The full doc lives once on the trait method; the
overrides just point at it.

The default implementation preserves the old behaviour byte-for-byte
for any Model that is not behind a RwLock, so adding the method
to the public trait is non-breaking: external implementors do not
need to override it.

cargo test --lib --features http
 201 passed, 0 failed.

cargo clippy --lib --tests --features http -- -D warnings
 clean on tokenizers/, bindings/python/, bindings/node/.

Throughput on a Python script that calls tokenizer.encode_batch (docs, false) in a 15 s wall-clock loop, where docs is
tokenizers/data/big.txt (6.5 MB) split into 999 ~6.5 KB chunks.
encode_batch calls completed (more is better):

  platform                              cores   threads  before   after
  ---------                             -----   -------  ------   -----
  NVIDIA Vera (aarch64)                  89P    1T          10      10
                                                88T         57     147   (+158 %)
  AMD EPYC 9124 (x86_64)                 16P    1T           8       8
                                                16T         63      68   (+8 %)
                                                32T         65      70   (+8 %)
  Apple M3 (aarch64, no SMT)             12P    1T          11      11
                                                6T          43      45   (+5 %)
                                                12T         47      54   (+15 %)

The 1T cases are flat on every platform because tokenisation is
dominated by BPE merging itself; there is no contention on a single
thread. The win scales with thread count and is largest on
many-core aarch64 (where each LSE acquire/release pair takes
substantially more cycles than x86 lock cmpxchg under contention).

Perf evidence on Vera at 88T, wheels built -Ctarget-feature= +lse,+rcpc, perf record -g --call-graph fp -F 4999:

  symbol                                                before    after
  <PyModel as Model>::tokenize                          75.36%    -
  <PyModel as Model>::tokenize_in_pretokenized          -          0.00%
  <ModelWrapper as Model>::tokenize                      0.01%     0.21%
  tokenizers::models::bpe::word::Word::merge_all         -         3.72%
  tokenizers::models::bpe::model::BPE::merge_word        0.17%     1.25%
  std::sys::sync::rwlock::futex::RwLock::read_contended  0.07%     0.00%

Before, 75 % of CPU cycles are inside PyModel::tokenize, the
wrapper that takes an Arc<RwLock>::read per pre-token, calls the
inner BPE, then drops the guard. After, that wrapper is replaced by
a single call to tokenize_in_pretokenized which takes the lock
once for the whole pre-token sequence; the actual BPE merging
(Word::merge_all, BPE::merge_word) surfaces in the profile where
it should have been all along.

Perf on AMD EPYC 9124 at 16T (built without +lse, atomics inlined
as native x86 lock instructions):

  symbol                                                before    after
  <BPE as Model>::tokenize                              0.70%     0.50%
  <PyModel as Model>::tokenize                          0.60%     -
  <ModelWrapper as Model>::tokenize                     0.15%     0.16%
  std::sys::sync::rwlock::futex::RwLock::read_contended 0.01%     0.00%

The relative magnitudes are vastly smaller on x86 because uncontended
lock cmpxchg is fast, but the wall-clock direction is the same.

The Rust-only bpe_benchmark (which builds BPE directly via
BpeBuilder and never goes through PyModel) is unaffected by this
change: 3.93 -> 3.94 MiB/s at 1T and 17.97 -> 17.93 MiB/s at 88T,
both within run-to-run noise.

@HuggingFaceDocBuilderDev
Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copy link
Copy Markdown
Member

@McPatate McPatate left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure we want to replicate the comments 3 times, perhaps best to add docs over at the call site instead? just an idea though.

There's also a pretokenized.tokenize_with_limit call that might benefit from this too in the codebase.

Finally, would be good to test on other platforms and archs.

pretokenized: &mut tk::tokenizer::PreTokenizedString,
) -> tk::Result<()> {
let model = self.model.as_ref().ok_or("Uninitialized Model")?;
let guard = model.read().unwrap();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to return an error rather than panic imo

The Python and Node bindings both wrap the inner model in
`Arc<RwLock<ModelWrapper>>`.  Their `Model::tokenize()` impls each do
`self.model.read().unwrap().tokenize(seq)`, so a single `read()` lock
acquisition per call is unavoidable -- but `TokenizerImpl::do_tokenize`
calls `tokenize()` once per pre-token, and a typical ~6 KB document
contains ~1 500 pre-tokens.  Each acquire/release pair on the
`RwLock` becomes one atomic operation per pre-token that produces no
useful work.

Add a default `Model::tokenize_in_pretokenized` trait method that
takes the optional truncation parameters and dispatches to either
`PreTokenizedString::tokenize` or `tokenize_with_limit`.  Override it
in `PyModel` (Python binding) and `Model` (Node binding) so that
both bindings acquire the read lock once and tokenize every
pre-token under the same guard.  `TokenizerImpl::do_tokenize` calls
the new method instead of dispatching per pre-token, which makes
both the truncated and the non-truncated paths benefit from the
override.  The full doc lives once on the trait method; the
overrides just point at it.

The default implementation preserves the old behaviour byte-for-byte
for any `Model` that is not behind a `RwLock`, so adding the method
to the public trait is non-breaking: external implementors do not
need to override it.

cargo test --lib --features http: 201 passed, 0 failed.
cargo clippy --lib --tests --features http -- -D warnings: clean
on tokenizers/, bindings/python/, bindings/node/.

Throughput on a Python script that calls `tokenizer.encode_batch
(docs, false)` in a 15 s wall-clock loop, where `docs` is
`tokenizers/data/big.txt` (6.5 MB) split into 999 ~6.5 KB chunks.
`encode_batch` calls completed (more is better):

  platform                              cores   threads  before   after
  ---------                             -----   -------  ------   -----
  NVIDIA Vera (aarch64)                  89P    1T          10      10
                                                88T         57     147   (+158 %)
  AMD EPYC 9124 (x86_64)                 16P    1T           8       8
                                                16T         63      68   (+8 %)
                                                32T         65      70   (+8 %)
  Apple M3 (aarch64, no SMT)             12P    1T          11      11
                                                6T          43      45   (+5 %)
                                                12T         47      54   (+15 %)

The 1T cases are flat on every platform because tokenisation is
dominated by BPE merging itself; there is no contention on a single
thread.  The win scales with thread count and is largest on
many-core aarch64 (where each LSE acquire/release pair takes
substantially more cycles than x86 `lock cmpxchg` under contention).

Perf evidence on Vera at 88T, wheels built `-Ctarget-feature=
+lse,+rcpc`, `perf record -g --call-graph fp -F 4999`:

  symbol                                                before    after
  <PyModel as Model>::tokenize                          75.36%    -
  <PyModel as Model>::tokenize_in_pretokenized          -          0.00%
  <ModelWrapper as Model>::tokenize                      0.01%     0.21%
  tokenizers::models::bpe::word::Word::merge_all         -         3.72%
  tokenizers::models::bpe::model::BPE::merge_word        0.17%     1.25%
  std::sys::sync::rwlock::futex::RwLock::read_contended  0.07%     0.00%

Before, 75 % of CPU cycles are inside `PyModel::tokenize`, the
wrapper that takes an `Arc<RwLock>::read` per pre-token, calls the
inner BPE, then drops the guard.  After, that wrapper is replaced by
a single call to `tokenize_in_pretokenized` which takes the lock
once for the whole pre-token sequence; the actual BPE merging
(`Word::merge_all`, `BPE::merge_word`) surfaces in the profile where
it should have been all along.

Perf on AMD EPYC 9124 at 16T (built without `+lse`, atomics inlined
as native x86 `lock` instructions):

  symbol                                                before    after
  <BPE as Model>::tokenize                              0.70%     0.50%
  <PyModel as Model>::tokenize                          0.60%     -
  <ModelWrapper as Model>::tokenize                     0.15%     0.16%
  std::sys::sync::rwlock::futex::RwLock::read_contended 0.01%     0.00%

The relative magnitudes are vastly smaller on x86 because uncontended
`lock cmpxchg` is fast, but the wall-clock direction is the same.

The Rust-only `bpe_benchmark` (which builds `BPE` directly via
`BpeBuilder` and never goes through `PyModel`) is unaffected by this
change: 3.93 -> 3.94 MiB/s at 1T and 17.97 -> 17.93 MiB/s at 88T,
both within run-to-run noise.
@sebpop
Copy link
Copy Markdown
Contributor Author

sebpop commented May 29, 2026

Thanks for the review. Force-pushed an amended patch that addresses three points:

  1. Docs consolidated. Full rationale lives once on the trait default; the PyModel and Node overrides are now one-line // See [Model::tokenize_in_pretokenized].

  2. tokenize_with_limit folded in. The trait method now takes truncation: Option<(usize, TruncationDirection)> and dispatches to either tokenize or tokenize_with_limit internally. do_tokenize collapses to one call, and both the truncated and non-truncated paths benefit from the lock-once override.

  3. Cross-platform validation. Added in the commit message: numbers on aarch64 Vera (89P/178L), AMD EPYC 9124 (16P/32L) and Apple M3 (12P). 1T flat everywhere as expected. The wall-clock win scales with thread count and is largest on many-core aarch64; x86 sees a smaller +8% because uncontended lock cmpxchg is much cheaper per call than aarch64 LSE under contention. Direction same on all three.

@sebpop
Copy link
Copy Markdown
Contributor Author

sebpop commented May 29, 2026

Two pre-existing CI failures on main, neither related to any code change in tokenizers/ or bindings/{python,node}/src/

Fixed with: #2079

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants