Cyrillic normalizer and decoder for South Slavic languages #2046
Open
procesaur wants to merge 1 commit into
This PR adds a Cyrillic normalizer and a Cyrillic decoder, enabling two-way transliteration between the Latin and Cyrillic scripts.
The normalizer detects Cyrillic text using a regex, transliterates it into Latin, and encloses it between ... tags. This way, the model does not need to learn the same token twice, once in Cyrillic and once in Latin, and treats the two as equal.
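As a rough illustration of the normalizer step, here is a minimal Python sketch; it is not the code from this PR. The tag pair `<cyr>`/`</cyr>`, the function name `normalize`, and the tiny mapping table are all assumptions made for the example, since the actual tags and tables are not shown in this description.

```python
import re

# Hypothetical stand-ins for the tag pair; the real tags are not shown above.
OPEN_TAG, CLOSE_TAG = "<cyr>", "</cyr>"

# Tiny illustrative subset of a Cyrillic -> Latin table (an assumption,
# not the mapping shipped in the PR).
CYR_TO_LAT = {
    "а": "a", "б": "b", "ж": "ž",
    "љ": "lj", "њ": "nj", "џ": "dž",
}

# One or more consecutive characters from the basic Cyrillic Unicode block.
CYRILLIC_RUN = re.compile(r"[\u0400-\u04FF]+")


def normalize(text: str) -> str:
    """Transliterate each Cyrillic run to Latin and wrap it in the tag pair."""

    def wrap(match: re.Match) -> str:
        # Characters missing from the table are passed through unchanged.
        latin = "".join(CYR_TO_LAT.get(ch, ch) for ch in match.group(0))
        return f"{OPEN_TAG}{latin}{CLOSE_TAG}"

    return CYRILLIC_RUN.sub(wrap, text)


print(normalize("In Cyrillic žaba is written as жаба."))
```

Text that is already Latin is left untouched, so only the Cyrillic run ends up between the (hypothetical) tags.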
The decoder is used to produce Cyrillic output when necessary. It applies the same logic in reverse: it locates text enclosed between ... tags and transliterates it back to Cyrillic.
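The reverse step could look roughly like the sketch below. Again, this is only an illustration under assumptions: the `<cyr>`/`</cyr>` tags, the name `decode`, and the small Latin-to-Cyrillic table are hypothetical and do not come from the PR.

```python
import re

# Hypothetical stand-ins for the tag pair; the real tags are not shown above.
OPEN_TAG, CLOSE_TAG = "<cyr>", "</cyr>"

# Tiny illustrative subset of a Latin -> Cyrillic table (an assumption).
LAT_TO_CYR = {
    "a": "а", "b": "б", "ž": "ж",
    "A": "А", "B": "Б", "Ž": "Ж",
}

# Non-greedy match of everything between one opening and one closing tag.
TAGGED = re.compile(
    re.escape(OPEN_TAG) + r"(.*?)" + re.escape(CLOSE_TAG), re.DOTALL
)


def decode(text: str) -> str:
    """Replace each tagged span with its Cyrillic transliteration, dropping the tags."""

    def unwrap(match: re.Match) -> str:
        # Characters missing from the table are passed through unchanged.
        return "".join(LAT_TO_CYR.get(ch, ch) for ch in match.group(1))

    return TAGGED.sub(unwrap, text)


print(decode(f"{OPEN_TAG}Baba{CLOSE_TAG}"))
```

Untagged Latin text passes through unchanged, which is what lets the model keep ordinary Latin output as Latin.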
Transliteration is performed using a character-level (or bigram-level) mapping. Currently, mappings are defined for South Slavic languages (Serbian, Montenegrin, Macedonian, Bulgarian). Serbian and Montenegrin are natively digraphic, with defined equivalents between the two scripts.
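To show why bigram entries matter in the Latin-to-Cyrillic direction, here is a small sketch that tries a two-character match before falling back to a single character. The table is a lowercase-only subset chosen for the example and the function name `latin_to_cyrillic` is hypothetical; the PR's actual tables and matching strategy may differ.

```python
# Illustrative lowercase subset of a Serbian Latin -> Cyrillic table
# (an assumption, not the table from the PR). Digraphs come first for clarity.
LAT_TO_CYR = {
    "lj": "љ", "nj": "њ", "dž": "џ",
    "l": "л", "n": "н", "j": "ј", "d": "д", "ž": "ж",
    "a": "а", "b": "б", "e": "е", "i": "и", "u": "у",
}


def latin_to_cyrillic(text: str) -> str:
    """Greedy transliteration: prefer a bigram match, else map one character."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in LAT_TO_CYR:
            # Two Latin letters collapse into one Cyrillic letter.
            out.append(LAT_TO_CYR[pair])
            i += 2
        else:
            # Unknown characters are passed through unchanged.
            out.append(LAT_TO_CYR.get(text[i], text[i]))
            i += 1
    return "".join(out)


print(latin_to_cyrillic("ljudi"))
print(latin_to_cyrillic("žaba"))
```

A greedy bigram match is a simplification: in a word such as "injekcija" the letters n and j are separate sounds, so a purely greedy mapping would produce the wrong Cyrillic letter, and a real implementation would likely need to handle such exceptions.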
Examples of usage:
Training data > In Cyrillic žaba is written as жаба.
What the model sees during training > In Cyrillic žaba is written as žaba.
Input > Transliterate Baba to Cyrillic
Model sees > Transliterate Baba to Cyrillic
Model outputs > Baba
Decoder outputs > Баба
Result > The model learns to transliterate, and to treat text in both scripts as equivalent, from just a few examples. It is trained only on Latin tokens but can produce Cyrillic output when needed.