Avoid full vocab clone in get_vocab_size()#2074
Conversation
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
59d1e77 to
0fa29b9
Compare
ArthurZucker
left a comment
There was a problem hiding this comment.
Thanks for catching this!
Benchmarked locally on the llama-3 tokenizer (vocab=128,256, added=256), PR cuts get_vocab_size(true) from ~5.70 ms → ~1.35 µs (≈4220×).
Since non_overlapping = added.len() - overlapping is fully determined by what we've already inserted, we can cache it as a counter on AddedVocabulary
get_vocab_size(true) in O(1) without iterating added tokens at all.
Will let add_tokens skip added_tokens_map_r.keys().max().
Benched and: get_vocab_size(true) drops further to ~786 ps on llama-3.
The counter is model-dependent / stateful so we might want to do that another time as there are some edgecases.
|
Can you pull latest from main BTW ! |
get_vocab_size(true)was callingget_vocab(true).len(), which clones the entire model vocabulary into a newHashMapjust to count entries.For large vocabularies (e.g. LLaMA-3 128k tokens) this allocates ~10MB on every call.
Fix by computing
base + added.len() - overlappingdirectly, whereoverlappingcounts added tokens already present in the model viatoken_to_id. Zero allocation.Applies the same fix to the Node binding, which had the identical pattern.
Adds a test covering the overlap scenario (token in both model vocab and added_vocabulary).