Bindings: acquire model read-lock once per call instead of per pre-token#2072
Bindings: acquire model read-lock once per call instead of per pre-token#2072sebpop wants to merge 1 commit into
Conversation
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
McPatate
left a comment
There was a problem hiding this comment.
Not sure we want to replicate the comments 3 times, perhaps best to add docs over at the call site instead? just an idea though.
There's also a pretokenized.tokenize_with_limit call that might benefit from this too in the codebase.
Finally, would be good to test on other platforms and archs.
| pretokenized: &mut tk::tokenizer::PreTokenizedString, | ||
| ) -> tk::Result<()> { | ||
| let model = self.model.as_ref().ok_or("Uninitialized Model")?; | ||
| let guard = model.read().unwrap(); |
There was a problem hiding this comment.
Would be nice to return an error rather than panic imo
The Python and Node bindings both wrap the inner model in
`Arc<RwLock<ModelWrapper>>`. Their `Model::tokenize()` impls each do
`self.model.read().unwrap().tokenize(seq)`, so a single `read()` lock
acquisition per call is unavoidable -- but `TokenizerImpl::do_tokenize`
calls `tokenize()` once per pre-token, and a typical ~6 KB document
contains ~1 500 pre-tokens. Each acquire/release pair on the
`RwLock` becomes one atomic operation per pre-token that produces no
useful work.
Add a default `Model::tokenize_in_pretokenized` trait method that
takes the optional truncation parameters and dispatches to either
`PreTokenizedString::tokenize` or `tokenize_with_limit`. Override it
in `PyModel` (Python binding) and `Model` (Node binding) so that
both bindings acquire the read lock once and tokenize every
pre-token under the same guard. `TokenizerImpl::do_tokenize` calls
the new method instead of dispatching per pre-token, which makes
both the truncated and the non-truncated paths benefit from the
override. The full doc lives once on the trait method; the
overrides just point at it.
The default implementation preserves the old behaviour byte-for-byte
for any `Model` that is not behind a `RwLock`, so adding the method
to the public trait is non-breaking: external implementors do not
need to override it.
cargo test --lib --features http: 201 passed, 0 failed.
cargo clippy --lib --tests --features http -- -D warnings: clean
on tokenizers/, bindings/python/, bindings/node/.
Throughput on a Python script that calls `tokenizer.encode_batch
(docs, false)` in a 15 s wall-clock loop, where `docs` is
`tokenizers/data/big.txt` (6.5 MB) split into 999 ~6.5 KB chunks.
`encode_batch` calls completed (more is better):
platform cores threads before after
--------- ----- ------- ------ -----
NVIDIA Vera (aarch64) 89P 1T 10 10
88T 57 147 (+158 %)
AMD EPYC 9124 (x86_64) 16P 1T 8 8
16T 63 68 (+8 %)
32T 65 70 (+8 %)
Apple M3 (aarch64, no SMT) 12P 1T 11 11
6T 43 45 (+5 %)
12T 47 54 (+15 %)
The 1T cases are flat on every platform because tokenisation is
dominated by BPE merging itself; there is no contention on a single
thread. The win scales with thread count and is largest on
many-core aarch64 (where each LSE acquire/release pair takes
substantially more cycles than x86 `lock cmpxchg` under contention).
Perf evidence on Vera at 88T, wheels built `-Ctarget-feature=
+lse,+rcpc`, `perf record -g --call-graph fp -F 4999`:
symbol before after
<PyModel as Model>::tokenize 75.36% -
<PyModel as Model>::tokenize_in_pretokenized - 0.00%
<ModelWrapper as Model>::tokenize 0.01% 0.21%
tokenizers::models::bpe::word::Word::merge_all - 3.72%
tokenizers::models::bpe::model::BPE::merge_word 0.17% 1.25%
std::sys::sync::rwlock::futex::RwLock::read_contended 0.07% 0.00%
Before, 75 % of CPU cycles are inside `PyModel::tokenize`, the
wrapper that takes an `Arc<RwLock>::read` per pre-token, calls the
inner BPE, then drops the guard. After, that wrapper is replaced by
a single call to `tokenize_in_pretokenized` which takes the lock
once for the whole pre-token sequence; the actual BPE merging
(`Word::merge_all`, `BPE::merge_word`) surfaces in the profile where
it should have been all along.
Perf on AMD EPYC 9124 at 16T (built without `+lse`, atomics inlined
as native x86 `lock` instructions):
symbol before after
<BPE as Model>::tokenize 0.70% 0.50%
<PyModel as Model>::tokenize 0.60% -
<ModelWrapper as Model>::tokenize 0.15% 0.16%
std::sys::sync::rwlock::futex::RwLock::read_contended 0.01% 0.00%
The relative magnitudes are vastly smaller on x86 because uncontended
`lock cmpxchg` is fast, but the wall-clock direction is the same.
The Rust-only `bpe_benchmark` (which builds `BPE` directly via
`BpeBuilder` and never goes through `PyModel`) is unaffected by this
change: 3.93 -> 3.94 MiB/s at 1T and 17.97 -> 17.93 MiB/s at 88T,
both within run-to-run noise.
|
Thanks for the review. Force-pushed an amended patch that addresses three points:
|
|
Two pre-existing CI failures on main, neither related to any code change in tokenizers/ or bindings/{python,node}/src/ Fixed with: #2079 |
The Python and Node bindings both wrap the inner model in
Arc<RwLock<ModelWrapper>>. TheirModel::tokenize()impls each doself.model.read().unwrap().tokenize(seq), so a singleread()lockacquisition per call is unavoidable -- but
TokenizerImpl::do_tokenizecalls
tokenize()once per pre-token, and a typical ~6 KB documentcontains ~1 500 pre-tokens. Each acquire/release pair on the
RwLockbecomes one atomic operation per pre-token that produces nouseful work.
Add a default
Model::tokenize_in_pretokenizedtrait method thattakes the optional truncation parameters and dispatches to either
PreTokenizedString::tokenizeortokenize_with_limit. Override itin
PyModel(Python binding) andModel(Node binding) so thatboth bindings acquire the read lock once and tokenize every
pre-token under the same guard.
TokenizerImpl::do_tokenizecallsthe new method instead of dispatching per pre-token, which makes
both the truncated and the non-truncated paths benefit from the
override. The full doc lives once on the trait method; the
overrides just point at it.
The default implementation preserves the old behaviour byte-for-byte
for any
Modelthat is not behind aRwLock, so adding the methodto the public trait is non-breaking: external implementors do not
need to override it.
Throughput on a Python script that calls
tokenizer.encode_batch (docs, false)in a 15 s wall-clock loop, wheredocsistokenizers/data/big.txt(6.5 MB) split into 999 ~6.5 KB chunks.encode_batchcalls completed (more is better):The 1T cases are flat on every platform because tokenisation is
dominated by BPE merging itself; there is no contention on a single
thread. The win scales with thread count and is largest on
many-core aarch64 (where each LSE acquire/release pair takes
substantially more cycles than x86
lock cmpxchgunder contention).Perf evidence on Vera at 88T, wheels built
-Ctarget-feature= +lse,+rcpc,perf record -g --call-graph fp -F 4999:Before, 75 % of CPU cycles are inside
PyModel::tokenize, thewrapper that takes an
Arc<RwLock>::readper pre-token, calls theinner BPE, then drops the guard. After, that wrapper is replaced by
a single call to
tokenize_in_pretokenizedwhich takes the lockonce for the whole pre-token sequence; the actual BPE merging
(
Word::merge_all,BPE::merge_word) surfaces in the profile whereit should have been all along.
Perf on AMD EPYC 9124 at 16T (built without
+lse, atomics inlinedas native x86
lockinstructions):The relative magnitudes are vastly smaller on x86 because uncontended
lock cmpxchgis fast, but the wall-clock direction is the same.The Rust-only
bpe_benchmark(which buildsBPEdirectly viaBpeBuilderand never goes throughPyModel) is unaffected by thischange: 3.93 -> 3.94 MiB/s at 1T and 17.97 -> 17.93 MiB/s at 88T,
both within run-to-run noise.