Cyrillic normalizer and decoder for south slavic languages by procesaur · Pull Request #2046 · huggingface/tokenizers

procesaur · 2026-04-28T13:52:20Z

This PR adds a Cyrillic normalizer and a Cyrillic decoder, enabling two-way transliteration between the Latin and Cyrillic scripts.

The normalizer detects Cyrillic text using a regex, transliterates it into Latin, and encloses it between ... tags. This way, the model does not need to learn the same token twice, once in Cyrillic and once in Latin, and treats the two forms as equal.

The decoder is used to produce Cyrillic output when necessary. It applies the same logic in reverse: it locates text enclosed between ... tags and transliterates it back to Cyrillic.
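The normalizer and decoder steps described above can be illustrated with a minimal Python sketch. This is not the code from the PR: the tag names `<cyr>` and `</cyr>` are hypothetical placeholders (the actual tags are not shown in this description), the mapping is a tiny single-character subset, and the function names `normalize` and `decode` are chosen only for illustration.

```python
import re

# Hypothetical tag names; the actual tags used by the PR are not shown here.
OPEN_TAG, CLOSE_TAG = "<cyr>", "</cyr>"

# Deliberately tiny single-character mapping, for illustration only.
CYR_TO_LAT = {"а": "a", "б": "b", "ж": "ž", "А": "A", "Б": "B", "Ж": "Ž"}
LAT_TO_CYR = {lat: cyr for cyr, lat in CYR_TO_LAT.items()}

# A run of characters from the basic Cyrillic Unicode block.
CYRILLIC_RUN = re.compile(r"[\u0400-\u04FF]+")
# Any text enclosed between the (hypothetical) tags.
TAGGED_SPAN = re.compile(
    re.escape(OPEN_TAG) + r"(.*?)" + re.escape(CLOSE_TAG), re.DOTALL
)


def normalize(text: str) -> str:
    """Transliterate each Cyrillic run to Latin and wrap it in tags."""
    def repl(match: re.Match) -> str:
        latin = "".join(CYR_TO_LAT.get(ch, ch) for ch in match.group(0))
        return f"{OPEN_TAG}{latin}{CLOSE_TAG}"
    return CYRILLIC_RUN.sub(repl, text)


def decode(text: str) -> str:
    """Transliterate each tagged span back to Cyrillic and drop the tags."""
    def repl(match: re.Match) -> str:
        return "".join(LAT_TO_CYR.get(ch, ch) for ch in match.group(1))
    return TAGGED_SPAN.sub(repl, text)


if __name__ == "__main__":
    print(normalize("In Cyrillic žaba is written as жаба."))
    print(decode(f"{OPEN_TAG}Baba{CLOSE_TAG}"))
```

Latin text outside the tags passes through both functions unchanged, so only spans that were originally Cyrillic (or that the model chose to tag) are converted back.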

Transliteration is performed using character-level (or bigram-level) mapping. The mappings are currently defined for South Slavic languages (Serbian, Montenegrin, Macedonian, Bulgarian). Serbian and Montenegrin are natively digraphic, with defined equivalents between the two scripts.
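To make the character and bigram mapping concrete, here is a small sketch assuming a lowercase Serbian subset. The table contents and the helper names `to_latin` and `to_cyrillic` are illustrative assumptions, not the mappings shipped in the PR. Going from Cyrillic to Latin is a per-character lookup; going back, the sketch tries a two-character match before falling back to a single character.

```python
# Illustrative lowercase Serbian subset; not the full table from the PR.
CYR_TO_LAT = {
    "љ": "lj", "њ": "nj", "џ": "dž",  # one Cyrillic letter, two Latin letters
    "л": "l", "н": "n", "д": "d", "ж": "ž", "ј": "j",
    "а": "a", "е": "e", "и": "i", "у": "u", "б": "b", "в": "v",
}
LAT_TO_CYR = {lat: cyr for cyr, lat in CYR_TO_LAT.items()}


def to_latin(text: str) -> str:
    """Map each Cyrillic character to its Latin equivalent (1 or 2 letters)."""
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in text)


def to_cyrillic(text: str) -> str:
    """Map Latin back to Cyrillic, preferring bigrams over single letters."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in LAT_TO_CYR:
            # Bigram such as "lj" or "nj" becomes a single Cyrillic letter.
            out.append(LAT_TO_CYR[pair])
            i += 2
        else:
            # Single letter; unknown characters pass through unchanged.
            out.append(LAT_TO_CYR.get(text[i], text[i]))
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(to_latin("љубав"))
    print(to_cyrillic("njiva"))
```

Greedy bigram matching is a simplification: a Latin sequence such as "nj" can also stand for two separate letters in some words, so a production mapping may need extra rules or exceptions for those cases.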

Examples of usage:

Training data > In Cyrillic žaba is written as жаба.
What the model sees during training > In Cyrillic žaba is written as žaba.

Input > Transliterate Baba to Cyrillic
Model sees > Transliterate Baba to Cyrillic
Model outputs > Baba
Decoder outputs > Баба

Result > The model learns to transliterate and to treat text in both scripts as equivalent from just a few examples. It is trained only on Latin tokens, but can produce Cyrillic output if needed.

cyrillic normalizer and decoder for south slavic languages

dbb4bb9

procesaur wants to merge 1 commit into huggingface:main from procesaur:main
