kitefishai/TokenizerBench

TokenizerBench

This dataset is designed to evaluate tokenizer performance across the following domains before you use the tokenizer for model pre-training or fine-tuning:

  • 🌍 Human languages (multilingual + scripts)
  • 💻 Programming languages (syntax-heavy)
  • 🧮 Math & science expressions (symbols, Unicode, formulas)

🎯 Goal

This dataset helps evaluate:

  • Multilingual tokenization quality
  • Code token handling
  • Mathematical symbol parsing
  • Robustness to noisy and mixed inputs

🧩 How to Use the Dataset

The dataset is organized into modular Python files:

data/
├── human_languages.py
├── programming_languages.py
├── scientific_formulas.py
└── edge_cases.py

Each file contains structured dictionaries that can be directly imported and used for tokenizer evaluation.


📥 1. Import the Dataset

from tokenizerbench.data.human_languages import human_languages
from tokenizerbench.data.programming_languages import programming_languages
from tokenizerbench.data.scientific_formulas import scientific_formulas

🔄 2. Combine All Data (Optional)

dataset = {
    "human_languages": human_languages,
    "programming_languages": programming_languages,
    "scientific_formulas": scientific_formulas
}

🔍 3. Run Tokenizer Evaluation

Example using any tokenizer that exposes an encode() method (Hugging Face, tiktoken, SentencePiece, etc.):

def evaluate_tokenizer(tokenizer, dataset):
    """Return avg/max/min token counts per category and subcategory."""
    results = {}

    for category, data in dataset.items():
        results[category] = {}

        for subcategory, samples in data.items():
            # Token count for every sample in this subcategory
            token_counts = [len(tokenizer.encode(text)) for text in samples]

            if not token_counts:
                # Skip empty subcategories to avoid dividing by zero
                continue

            results[category][subcategory] = {
                "avg_tokens": sum(token_counts) / len(token_counts),
                "max_tokens": max(token_counts),
                "min_tokens": min(token_counts),
            }

    return results
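The function above only relies on `tokenizer.encode(text)` returning a sequence, and different libraries differ in small ways (for example, Hugging Face tokenizers add special tokens by default). The following is a minimal sketch of a thin wrapper that gives every tokenizer the same interface. `TokenizerAdapter` and `byte_level` are hypothetical names introduced for illustration and are not part of TokenizerBench; the byte-level stand-in exists only so the sketch runs without third-party packages.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TokenizerAdapter:
    """Hypothetical wrapper exposing a uniform encode/decode interface."""
    name: str
    encode_fn: Callable[[str], List[int]]
    decode_fn: Callable[[List[int]], str]

    def encode(self, text: str) -> List[int]:
        return list(self.encode_fn(text))

    def decode(self, tokens: List[int]) -> str:
        return self.decode_fn(tokens)


# Stand-in tokenizer (an assumption for this sketch): one token per UTF-8 byte.
byte_level = TokenizerAdapter(
    name="utf8-bytes",
    encode_fn=lambda text: list(text.encode("utf-8")),
    decode_fn=lambda ids: bytes(ids).decode("utf-8"),
)

# With a real library only the two callables change. For example, assuming
# `hf_tok` is a Hugging Face tokenizer object you have already loaded:
#   encode_fn=lambda t: hf_tok.encode(t, add_special_tokens=False)
#   decode_fn=hf_tok.decode

print(byte_level.name, len(byte_level.encode("Hello, world!")))
```

Any object built this way can be passed as `tokenizer` to `evaluate_tokenizer`, since it provides the `encode` method the function calls.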

📊 4. Evaluate Compression Efficiency

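def compression_ratio(tokenizer, text):
    """Tokens per character; lower means fewer tokens for the same text."""
    if not text:
        return 0.0  # avoid dividing by zero on empty input
    tokens = tokenizer.encode(text)
    return len(tokens) / len(text)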

👉 Run this across:

  • Different languages
  • Code snippets
  • Math expressions
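When comparing across languages, note that a per-character ratio treats a one-byte ASCII letter and a three-byte Devanagari character the same, so it can help to report tokens per UTF-8 byte alongside it. The sketch below computes both for a few sample strings. `WhitespaceTokenizer` is a hypothetical stand-in, and the sample strings are illustrative rather than taken from the dataset.

```python
class WhitespaceTokenizer:
    """Hypothetical stand-in tokenizer: one token per whitespace-separated word."""

    def encode(self, text):
        return text.split()


def tokens_per_char(tokenizer, text):
    """Tokens divided by Unicode code points."""
    return len(tokenizer.encode(text)) / len(text) if text else 0.0


def tokens_per_byte(tokenizer, text):
    """Tokens divided by UTF-8 bytes; less sensitive to bytes-per-character."""
    n_bytes = len(text.encode("utf-8"))
    return len(tokenizer.encode(text)) / n_bytes if n_bytes else 0.0


# Illustrative samples (assumptions, not dataset entries).
samples = {
    "english": "Hello world",
    "hindi": "नमस्ते दुनिया",
    "code": "for i in range(10): print(i)",
    "math": "∑ α + β = γ",
}

tok = WhitespaceTokenizer()
for name, text in samples.items():
    print(f"{name:8s} per_char={tokens_per_char(tok, text):.3f} "
          f"per_byte={tokens_per_byte(tok, text):.3f}")
```

Swap `WhitespaceTokenizer()` for the tokenizer under test; the two helper functions only need an `encode` method.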

🌍 5. Test Unicode Robustness

def unicode_test(tokenizer, text):
    tokens = tokenizer.encode(text)
    decoded = tokenizer.decode(tokens)
    return text == decoded

Test on:

  • Multilingual text
  • Emojis
  • Scientific symbols
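A strict equality check can fail for a harmless reason: some tokenizers apply Unicode normalization before encoding, so the decoded text is equivalent but not byte-identical. The sketch below separates exact round trips from round trips that match only after NFC or NFKC normalization. `NormalizingByteTokenizer` is a hypothetical stand-in that mimics such a tokenizer, and `roundtrip_report` is an illustrative helper name.

```python
import unicodedata


class NormalizingByteTokenizer:
    """Hypothetical stand-in: NFKC-normalizes, then one token per UTF-8 byte."""

    def encode(self, text):
        return list(unicodedata.normalize("NFKC", text).encode("utf-8"))

    def decode(self, tokens):
        return bytes(tokens).decode("utf-8")


def roundtrip_report(tokenizer, text):
    """Classify a round trip as exact, NFC-equivalent, or NFKC-equivalent."""
    decoded = tokenizer.decode(tokenizer.encode(text))

    def same_under(form):
        return (unicodedata.normalize(form, decoded)
                == unicodedata.normalize(form, text))

    return {
        "exact": decoded == text,
        "nfc_equal": same_under("NFC"),
        "nfkc_equal": same_under("NFKC"),
    }


tok = NormalizingByteTokenizer()
for sample in ["Hello 世界 🚀", "\ufb01le", "e\u0301"]:
    print(repr(sample), roundtrip_report(tok, sample))
```

A sample that fails `exact` but passes `nfkc_equal` points to normalization rather than data loss; a sample that fails all three is worth inspecting for genuine corruption.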

🧪 6. Long Sequence Testing

long_text = "AI_TOKEN_TEST " * 1000  # 14,000 chars (14 chars x 1000)
tokens = tokenizer.encode(long_text)

print("Token count:", len(tokens))

👉 Helps evaluate:

  • Context handling
  • Token explosion
  • Memory efficiency
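A single long input gives one data point; encoding the same unit at several lengths shows whether token count and encode time grow roughly linearly. The following is a minimal sketch under that idea. `ByteTokenizer` is a hypothetical stand-in and `scaling_profile` is an illustrative helper name; timings depend on the machine and are only indicative.

```python
import time


class ByteTokenizer:
    """Hypothetical stand-in tokenizer: one token per UTF-8 byte."""

    def encode(self, text):
        return list(text.encode("utf-8"))


def scaling_profile(tokenizer, unit, repeats=(1, 10, 100, 1000)):
    """Encode `unit` repeated n times and record size, tokens, and time."""
    if not unit:
        raise ValueError("unit must be a non-empty string")
    rows = []
    for n in repeats:
        text = unit * n
        start = time.perf_counter()
        n_tokens = len(tokenizer.encode(text))
        elapsed = time.perf_counter() - start
        rows.append({
            "chars": len(text),
            "tokens": n_tokens,
            "tokens_per_char": n_tokens / len(text),
            "seconds": elapsed,
        })
    return rows


for row in scaling_profile(ByteTokenizer(), "AI_TOKEN_TEST "):
    print(row)
```

If `tokens_per_char` drifts upward as the input grows, or `seconds` grows much faster than `chars`, that is a sign of token explosion or poor scaling on long inputs.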

⚠️ 7. Recommended Evaluation Strategy

Run comparisons across:

  • Multiple tokenizers (e.g. BPE, Unigram, and SentencePiece-based models)

  • Multiple categories:

    • Human languages
    • Code
    • Math & symbols

Track:

  • Token count
  • Compression ratio
  • Decode fidelity
  • Stability on long inputs

🧠 Pro Tip

For serious benchmarking, log results like:

{
  "tokenizer": "tiktoken",
  "language": "hindi",
  "avg_tokens": 18.2,
  "compression_ratio": 0.32,
  "unicode_safe": True
}

👉 This allows you to build:

  • Leaderboards
  • Tokenizer comparisons
  • Performance dashboards
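One simple way to keep such records is a JSON Lines file, with one record per line, which can later be sorted into a leaderboard. The sketch below assumes the record keys shown above. The helper names, the file name, and the numeric values are placeholders for illustration, not measured results.

```python
import json
import tempfile
from pathlib import Path


def append_result(path, record):
    """Append one result record as a single JSON line."""
    with Path(path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_results(path):
    """Read all records back from a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def leaderboard(records, metric="compression_ratio"):
    """Sort records so the lowest (best) metric value comes first."""
    return sorted(records, key=lambda record: record[metric])


# Demo in a temporary directory; the values below are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "results.jsonl"
    append_result(log_path, {"tokenizer": "tokenizer_a", "language": "hindi",
                             "compression_ratio": 0.50, "unicode_safe": True})
    append_result(log_path, {"tokenizer": "tokenizer_b", "language": "hindi",
                             "compression_ratio": 0.25, "unicode_safe": True})
    ranked = leaderboard(load_results(log_path))

print([record["tokenizer"] for record in ranked])
```

Appending one line per run keeps earlier results intact, so the same file can feed comparisons and dashboards over time.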

📏 How to Measure Tokenizer Performance

1. Token Count

Measure how many tokens each input produces.

tokens = tokenizer.encode(text)
print(len(tokens))

👉 Lower token count (for the same meaning) = better efficiency


2. Compression Ratio

compression_ratio = len(tokens) / len(text)

  • Lower ratio → better tokenizer
  • Indicates how efficiently text is represented

3. Unicode Handling

Test:

  • Multilingual text
  • Emojis
  • Mathematical symbols

test = "Hello 世界 🚀 α β γ ∑"
tokens = tokenizer.encode(test)
decoded = tokenizer.decode(tokens)

Check:

  • Is decoded text identical?
  • Any corruption?
  • Any token explosion?
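To look for token explosion, one option is to count how many tokens each distinct character costs and flag the expensive ones. The sketch below does this with a hypothetical byte-level stand-in tokenizer; `explosion_report` and the threshold of 3 are illustrative choices. Real tokenizers may encode a character differently in context than in isolation, so treat the result as a rough signal.

```python
class ByteTokenizer:
    """Hypothetical stand-in tokenizer: one token per UTF-8 byte."""

    def encode(self, text):
        return list(text.encode("utf-8"))


def explosion_report(tokenizer, text, threshold=3):
    """Map each non-space character to its token count if >= threshold."""
    flagged = {}
    for ch in dict.fromkeys(text):  # distinct characters, in first-seen order
        if ch.isspace():
            continue
        n_tokens = len(tokenizer.encode(ch))
        if n_tokens >= threshold:
            flagged[ch] = n_tokens
    return flagged


report = explosion_report(ByteTokenizer(), "Hello 世界 🚀 α β γ ∑")
for ch, n_tokens in report.items():
    print(f"U+{ord(ch):04X} {ch} -> {n_tokens} tokens")
```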

4. Edge Case Robustness

Test:

  • Long sequences (2K–10K chars)
  • Mixed scripts
  • Noisy text
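Edge-case inputs like these can be generated from existing samples. The following is a minimal sketch with two hypothetical helpers: `make_long` repeats a sample up to a target length, and `add_noise` randomly drops, doubles, or case-flips characters using a fixed seed so runs are reproducible. The noise model and the mixed-script sample are assumptions for illustration, not part of the dataset.

```python
import random


def make_long(text, target_chars):
    """Repeat `text` and truncate to exactly `target_chars` characters."""
    if not text:
        raise ValueError("text must be a non-empty string")
    repeats = -(-target_chars // len(text))  # ceiling division
    return (text * repeats)[:target_chars]


def add_noise(text, rate=0.1, seed=0):
    """Drop, double, or case-flip roughly `rate` of the characters."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        r = rng.random()
        if r < rate / 3:
            continue                   # drop the character
        elif r < 2 * rate / 3:
            out.append(ch * 2)         # double the character
        elif r < rate:
            out.append(ch.swapcase())  # flip case where applicable
        else:
            out.append(ch)             # keep unchanged
    return "".join(out)


# Illustrative mixed-script sample (Latin, Han, Devanagari, Arabic).
mixed = " ".join(["Hello", "世界", "नमस्ते", "مرحبا"])

long_input = make_long(mixed + " ", 2000)
noisy_input = add_noise(long_input, rate=0.1, seed=42)
print(len(long_input), len(noisy_input))
```

Feeding `long_input` and `noisy_input` through the same token-count and round-trip checks used for clean text shows how much the tokenizer degrades on these inputs.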


TODO

  • Expand human_languages → 100 languages using ISO language list
  • Keep same semantic structure across languages for consistency
  • Add longer sequences (2K–10K chars) to test tokenizer limits

About

A comprehensive multilingual tokenizer benchmark covering 100+ human languages, programming languages, and mathematical expressions for evaluating LLM tokenization efficiency, compression, and Unicode handling.
