
Cyrillic normalizer and decoder for South Slavic languages #2046

Open
procesaur wants to merge 1 commit into huggingface:main from procesaur:main

@procesaur

This PR adds a Cyrillic normalizer and a Cyrillic decoder, which together provide two-way transliteration between the Latin and Cyrillic scripts.

The normalizer detects Cyrillic text using a regex, transliterates it into Latin, and encloses it between ... tags. This way, the model does not need to learn the same token twice, once in Cyrillic and once in Latin, and treats the two forms as equal.
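A minimal Python sketch of the normalizer idea described above, not the PR's actual implementation. The tag strings `<cyr>` and `</cyr>`, the function name `normalize`, and the Serbian-only table are hypothetical placeholders chosen for illustration; the real tag names are not shown in this description.

```python
import re

# Serbian Cyrillic letters and their Latin equivalents, in alphabet order.
CYR = "абвгдђежзијклљмнњопрстћуфхцчџш"
LAT = "a b v g d đ e ž z i j k l lj m n nj o p r s t ć u f h c č dž š".split()

# Hypothetical tag strings; the PR's real tags are not shown here.
OPEN_TAG, CLOSE_TAG = "<cyr>", "</cyr>"

# Build a translation table for lowercase and uppercase letters.
# Uppercase digraph letters map to title case (Љ -> Lj).
_table = {}
for cyr, lat in zip(CYR, LAT):
    _table[cyr] = lat
    _table[cyr.upper()] = lat.capitalize()
CYR_TO_LAT = str.maketrans(_table)

# One or more consecutive characters from the basic Cyrillic Unicode block.
CYRILLIC_RUN = re.compile(r"[\u0400-\u04FF]+")


def normalize(text: str) -> str:
    """Transliterate each Cyrillic run to Latin and wrap it in tags."""
    def wrap(match: re.Match) -> str:
        return OPEN_TAG + match.group(0).translate(CYR_TO_LAT) + CLOSE_TAG
    return CYRILLIC_RUN.sub(wrap, text)


print(normalize("In Cyrillic žaba is written as жаба."))
```

In this sketch every whitespace-separated run of Cyrillic gets its own tag pair; whether the PR wraps single words or longer spans is not stated in the description.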

The decoder is used to produce Cyrillic output when necessary. It applies the same logic in reverse: it locates text enclosed between ... tags and transliterates it back to Cyrillic.
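The reverse step could look roughly like the following Python sketch. As before, the `<cyr>`/`</cyr>` tags and the names `decode` and `to_cyrillic` are hypothetical, and the table covers Serbian only. Digraphs are matched longest-first so that `lj` becomes `љ` and not `л` followed by `ј`.

```python
import re

# Serbian Cyrillic letters and their Latin equivalents, in alphabet order.
CYR = "абвгдђежзијклљмнњопрстћуфхцчџш"
LAT = "a b v g d đ e ž z i j k l lj m n nj o p r s t ć u f h c č dž š".split()

# Reverse table covering lowercase, title-case and upper-case Latin forms.
LAT_TO_CYR = {}
for cyr, lat in zip(CYR, LAT):
    LAT_TO_CYR[lat] = cyr
    LAT_TO_CYR[lat.capitalize()] = cyr.upper()
    LAT_TO_CYR[lat.upper()] = cyr.upper()

# Longest keys first, so "lj" is tried before "l" and "j".
_keys = sorted(LAT_TO_CYR, key=len, reverse=True)
LATIN_UNIT = re.compile("|".join(re.escape(k) for k in _keys))

# Hypothetical tags; non-greedy so each tagged span is handled separately.
TAGGED = re.compile(r"<cyr>(.*?)</cyr>", re.DOTALL)


def to_cyrillic(latin: str) -> str:
    """Replace every known Latin letter or digraph with its Cyrillic letter."""
    return LATIN_UNIT.sub(lambda m: LAT_TO_CYR[m.group(0)], latin)


def decode(text: str) -> str:
    """Transliterate tagged spans back to Cyrillic and drop the tags."""
    return TAGGED.sub(lambda m: to_cyrillic(m.group(1)), text)


print(decode("<cyr>Baba</cyr>"))
```

Greedy digraph matching is a simplification: in a few words the two Latin letters are separate Cyrillic letters (for example, "injekcija" is written "инјекција"), so a production decoder would need an exception mechanism. How the PR handles such cases is not stated.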

Transliteration is performed using a character-level (or bigram-level) mapping. Currently, these mappings are defined for South Slavic languages (Serbian, Montenegrin, Macedonian, Bulgarian). Serbian and Montenegrin are natively digraphic, with defined equivalents between the two scripts.
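To make the character and bigram mapping concrete, here is a small Python sketch using the standard Serbian alphabet pairs. Whether the PR's tables match this one exactly is an assumption; the names `SERBIAN`, `digraphs`, and `is_invertible` are illustrative only. The check at the end shows why a reverse (Latin to Cyrillic) table can be derived from the forward one.

```python
# Serbian letter pairs (Cyrillic -> Latin), lowercase only.
SERBIAN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}

# Letters whose Latin equivalent is a bigram rather than a single character.
digraphs = {cyr: lat for cyr, lat in SERBIAN.items() if len(lat) > 1}

# The table can be reversed only if no two Cyrillic letters share a Latin form.
is_invertible = len(set(SERBIAN.values())) == len(SERBIAN)

print(digraphs)
print(is_invertible)
```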

Examples of usage:

Training data > In Cyrillic žaba is written as жаба.
What the model sees during training > In Cyrillic žaba is written as žaba.

Input > Transliterate Baba to Cyrillic
Model sees > Transliterate Baba to Cyrillic
Model outputs > Baba
Decoder outputs > Баба

Result > The model learns to transliterate, and to treat text in both scripts as equivalent, from just a few examples. It is trained only on Latin tokens, but can produce Cyrillic output when needed.
