Avoid full vocab clone in get_vocab_size() by eunseo9311 · Pull Request #2074 · huggingface/tokenizers

eunseo9311 · 2026-05-27T03:57:26Z

get_vocab_size(true) was calling get_vocab(true).len(), which clones the entire model vocabulary into a new HashMap just to count entries.
For large vocabularies (e.g. LLaMA-3 with 128k tokens) this allocates ~10 MB on every call.

Fix by computing base + added.len() - overlapping directly, where overlapping counts added tokens already present in the model via token_to_id. Zero allocation.
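
For readers skimming the thread, here is a minimal, self-contained sketch of the two counting strategies. It is not the code from the diff: `ToyModel`, `vocab_size_by_clone`, `vocab_size_by_count`, and `demo_overlap` are hypothetical names, and a plain `HashMap` stands in for both the model vocabulary and the added tokens.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a tokenizer model (not the real `tokenizers` type).
struct ToyModel {
    vocab: HashMap<String, u32>,
}

impl ToyModel {
    fn token_to_id(&self, token: &str) -> Option<u32> {
        self.vocab.get(token).copied()
    }

    fn get_vocab_size(&self) -> usize {
        self.vocab.len()
    }
}

/// Old pattern: clone the whole vocab, merge the added tokens, then count.
fn vocab_size_by_clone(model: &ToyModel, added: &HashMap<String, u32>) -> usize {
    let mut full = model.vocab.clone(); // allocation proportional to vocab size
    full.extend(added.iter().map(|(token, id)| (token.clone(), *id)));
    full.len()
}

/// New pattern: base + added.len() - overlapping, with no new map allocated.
fn vocab_size_by_count(model: &ToyModel, added: &HashMap<String, u32>) -> usize {
    // Added tokens that the model already knows must not be counted twice.
    let overlapping = added
        .keys()
        .filter(|token| model.token_to_id(token.as_str()).is_some())
        .count();
    model.get_vocab_size() + added.len() - overlapping
}

/// Small overlap scenario: "<s>" is in both the model vocab and the added tokens.
fn demo_overlap() -> (usize, usize) {
    let mut vocab = HashMap::new();
    vocab.insert("hello".to_string(), 0u32);
    vocab.insert("world".to_string(), 1u32);
    vocab.insert("<s>".to_string(), 2u32);
    let model = ToyModel { vocab };

    let mut added = HashMap::new();
    added.insert("<s>".to_string(), 2u32); // overlaps with the model vocab
    added.insert("<pad>".to_string(), 3u32); // genuinely new token

    (
        vocab_size_by_clone(&model, &added),
        vocab_size_by_count(&model, &added),
    )
}
```

In this toy scenario both functions agree: 3 model tokens plus 2 added tokens minus 1 overlap gives 4. The counting version still iterates the added tokens, but that set is typically far smaller than the model vocabulary.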

Applies the same fix to the Node binding, which had the identical pattern.

Adds a test covering the overlap scenario (token in both model vocab and added_vocabulary).

HuggingFaceDocBuilderDev · 2026-05-27T13:02:19Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ArthurZucker

Thanks for catching this!

Benchmarked locally on the llama-3 tokenizer (vocab=128,256, added=256): this PR cuts get_vocab_size(true) from ~5.70 ms → ~1.35 µs (≈4220×).
Since non_overlapping = added.len() - overlapping is fully determined by what we've already inserted, we can cache it as a counter on AddedVocabulary and answer get_vocab_size(true) in O(1) without iterating added tokens at all.
That would also let add_tokens skip added_tokens_map_r.keys().max().
I benched this as well: get_vocab_size(true) drops further to ~786 ps on llama-3.

The counter is model-dependent / stateful, so we might want to do that another time, as there are some edge cases.
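
A rough sketch of what such a cached counter could look like, under stated assumptions. `ToyAddedVocabulary`, `add_token`, `vocab_size_with_added`, and `demo_counter` are hypothetical and do not mirror the real `AddedVocabulary` internals. The sketch also assumes model ids are contiguous from 0, so the next free id is `base + non_overlapping`; that is the shortcut that would replace a `keys().max()` scan.

```rust
use std::collections::HashMap;

/// Hypothetical added-vocabulary with a cached count of tokens
/// that are NOT already present in the model vocab.
struct ToyAddedVocabulary {
    added: HashMap<String, u32>,
    non_overlapping: usize,
}

impl ToyAddedVocabulary {
    fn new() -> Self {
        ToyAddedVocabulary {
            added: HashMap::new(),
            non_overlapping: 0,
        }
    }

    /// Adds one token and keeps the counter in sync.
    /// Assumption: model ids are contiguous, starting at 0.
    fn add_token(&mut self, token: &str, model_vocab: &HashMap<String, u32>) -> u32 {
        if let Some(&id) = self.added.get(token) {
            return id; // already added, counter unchanged
        }
        let id = match model_vocab.get(token) {
            // Overlap: reuse the model's id and do not bump the counter.
            Some(&id) => id,
            // New token: next free id derived from the counter, no max() scan.
            None => {
                let id = (model_vocab.len() + self.non_overlapping) as u32;
                self.non_overlapping += 1;
                id
            }
        };
        self.added.insert(token.to_string(), id);
        id
    }

    /// O(1): no iteration over the added tokens.
    fn vocab_size_with_added(&self, base: usize) -> usize {
        base + self.non_overlapping
    }
}

/// Adds one overlapping and one new token, then reports ids and total size.
fn demo_counter() -> (u32, u32, usize) {
    let mut model_vocab = HashMap::new();
    model_vocab.insert("hello".to_string(), 0u32);
    model_vocab.insert("world".to_string(), 1u32);
    model_vocab.insert("<s>".to_string(), 2u32);

    let mut added = ToyAddedVocabulary::new();
    let bos = added.add_token("<s>", &model_vocab); // overlap
    let pad = added.add_token("<pad>", &model_vocab); // new
    added.add_token("<pad>", &model_vocab); // duplicate add is a no-op

    (bos, pad, added.vocab_size_with_added(model_vocab.len()))
}
```

The sketch also hints at why the counter is stateful: it is only valid for the model vocab that was passed in when each token was added. If the model were swapped or retrained afterwards, the cached value could go stale, which is presumably one of the edge cases mentioned above.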

ArthurZucker · 2026-05-28T13:26:29Z

Can you pull the latest from main, BTW?

Avoid full vocab clone in get_vocab_size()

0fa29b9

eunseo9311 force-pushed the fix/get-vocab-size-perf branch from 59d1e77 to 0fa29b9 Compare May 28, 2026 08:13

ArthurZucker approved these changes May 28, 2026

eunseo9311 wants to merge 1 commit into huggingface:main from eunseo9311:fix/get-vocab-size-perf

3 participants
