[feature](inverted-index) Add Japanese (Kuromoji) morphological analyzer#64667
nishant94 wants to merge 6 commits into
@nishant94 Have you tried the ICU analyzer? I think ICU can handle many different languages.
@yiguolei The ICU analyzer is not as good as Kuromoji for Japanese. The two differ greatly in how they handle the morphology of Japanese words, so I think it is worth adding this new parser.
Is the code under |
This is original code but it is modeled on Apache Lucene's kuromoji. |
Add a built-in `kuromoji` inverted-index parser that segments Japanese text into morphemes, mirroring the existing Chinese IK analyzer.
- Added `darts.h` to `.clang-format-ignore` and `.licenserc.yaml`.
- Improved code formatting in various Kuromoji source files for better readability.
- Updated test files to include necessary headers.
…mposition
- Added support for search mode in the Kuromoji Viterbi segmenter, applying penalties for long all-kanji and other tokens to enhance search recall.
- Updated the KuromojiMode enumeration to reflect the new search and extended modes.
- Modified the KuromojiTokenizer to utilize the new mode functionality.
- Added unit tests to validate the behavior of the search mode, ensuring correct segmentation of compounds.
- Updated NOTICE.txt to include Apache Lucene as a dependency for the kuromoji analyzer.
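The search-mode penalty described in this commit can be sketched as follows. This is a minimal illustration, not the PR's code: the thresholds and penalty values are assumptions modeled on Lucene's JapaneseTokenizer search mode, and `decode_utf8`, `is_kanji` and `search_mode_penalty` are hypothetical helper names. The idea is that a long token gets extra cost added to its word cost, so the Viterbi search prefers a path made of its shorter parts.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Assumed values, modeled on Lucene's JapaneseTokenizer search mode.
constexpr int kKanjiLength = 2;      // all-kanji tokens longer than this are penalised
constexpr int kOtherLength = 7;      // other tokens longer than this are penalised
constexpr int kKanjiPenalty = 3000;  // extra cost per kanji beyond kKanjiLength
constexpr int kOtherPenalty = 1700;  // extra cost per character beyond kOtherLength

// Decode a UTF-8 string into code points. Assumes well-formed input.
std::vector<char32_t> decode_utf8(const std::string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t n = b < 0x80 ? 1 : (b >> 5) == 0x6 ? 2 : (b >> 4) == 0xE ? 3 : 4;
        assert(i + n <= s.size());  // a truncated sequence would violate the assumption
        char32_t cp = (n == 1) ? b : (b & (0xFF >> (n + 1)));
        for (std::size_t k = 1; k < n; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += n;
    }
    return out;
}

// CJK Unified Ideographs block only; a simplification.
bool is_kanji(char32_t cp) {
    return cp >= 0x4E00 && cp <= 0x9FFF;
}

// Extra cost added to a candidate token's word cost in search mode.
int search_mode_penalty(const std::string& surface) {
    const std::vector<char32_t> cps = decode_utf8(surface);
    const int len = static_cast<int>(cps.size());
    if (len <= kKanjiLength) return 0;
    bool all_kanji = true;
    for (char32_t cp : cps) {
        if (!is_kanji(cp)) {
            all_kanji = false;
            break;
        }
    }
    if (all_kanji) return (len - kKanjiLength) * kKanjiPenalty;
    if (len > kOtherLength) return (len - kOtherLength) * kOtherPenalty;
    return 0;
}
```

Under these assumed constants, a six-kanji compound such as 関西国際空港 is penalised by four extra kanji, while a two-kanji word such as 東京 is not penalised at all.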
…wn words
- Implemented functionality in the Kuromoji Viterbi segmenter to decompose unknown (out-of-vocabulary) words into per-character unigrams when in extended mode, aligning with Lucene's JapaneseTokenizer behavior.
- Added unit tests to validate the correct segmentation of unknown words in both normal and extended modes, ensuring expected outputs for various input scenarios.
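The extended-mode behavior described in this commit can be sketched as a post-processing step over the segmenter's output. This is an illustration under stated assumptions, not the PR's code: the `Token` struct, its `unknown` flag, and the `decompose_unknown` function are hypothetical names, and the real segmenter may perform the split inside the lattice rather than afterwards.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token type: surface form plus an out-of-vocabulary flag.
struct Token {
    std::string surface;
    bool unknown;
};

// Byte length of the UTF-8 sequence that starts with lead byte `b`.
static std::size_t utf8_len(unsigned char b) {
    if (b < 0x80) return 1;
    if ((b >> 5) == 0x6) return 2;
    if ((b >> 4) == 0xE) return 3;
    if ((b >> 3) == 0x1E) return 4;
    return 1;  // invalid lead byte: advance one byte so the loop still terminates
}

// In extended mode, replace each unknown token with one token per character.
// Known tokens, and all tokens in other modes, pass through unchanged.
std::vector<Token> decompose_unknown(const std::vector<Token>& in, bool extended) {
    std::vector<Token> out;
    for (const Token& t : in) {
        if (!extended || !t.unknown) {
            out.push_back(t);
            continue;
        }
        for (std::size_t i = 0; i < t.surface.size();) {
            std::size_t n = std::min(utf8_len(static_cast<unsigned char>(t.surface[i])),
                                     t.surface.size() - i);
            assert(n > 0 && i + n <= t.surface.size());
            out.push_back(Token{t.surface.substr(i, n), true});
            i += n;
        }
    }
    return out;
}
```

With a hand-flagged unknown token `アリス` followed by a known token `を`, extended mode yields four tokens (`ア`, `リ`, `ス`, `を`), while the other modes keep the original two.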
Force-pushed from 389fcfb to b79db3c.
- Modified error messages to include 'kuromoji' parser in the parser mode validation.
- Enhanced tests for the Japanese analyzer to assert expected tokenization results.
What problem does this PR solve?
Issue Number: #64646
Related PR: None
Problem Summary:
Doris has no Japanese-aware tokenizer for the inverted index. Japanese text has no spaces between words, so the existing parsers can't segment it, and `MATCH`/`MATCH_PHRASE` on Japanese columns end up with poor recall and precision.

This PR adds a built-in `kuromoji` parser for Japanese, in the same style as the existing Chinese IK analyzer. It's opt-in per column via `PROPERTIES("parser"="kuromoji")`.

After indexing, `MATCH`, `MATCH_PHRASE` and `TOKENIZE()` run against the segmented Japanese terms.
How it works:
- … `be/src/storage/index/inverted/analyzer/kuromoji/`, so there's no JVM on the indexing path.
- `KuromojiAnalyzer` / `KuromojiTokenizer` mirror the IK analyzer/tokenizer, with a Viterbi cost-model segmenter over the IPADIC connection-cost matrix.
- Dictionary source is mecab-ipadic-2.7.0-20070801 (NAIST-2003 license, the same lexicon Lucene kuromoji uses).
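To make the Viterbi cost model concrete, here is a minimal sketch of lattice segmentation over a toy lexicon. It is not the PR's implementation: the `Entry` struct, the `viterbi_segment` function, the single class id per entry (real IPADIC entries carry separate left and right context ids), and all cost values are assumptions for illustration. Unknown-word handling is omitted, so the function returns an empty result when the lexicon cannot cover the input.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy lexicon entry.
struct Entry {
    std::string surface;
    int cls;   // simplified: one id used as both left and right context
    int cost;  // word cost
};

// Pick the segmentation minimising word costs plus connection costs.
// conn[prev][next] stands in for the connection-cost matrix; class 0 is BOS/EOS.
std::vector<std::string> viterbi_segment(const std::string& text,
                                         const std::vector<Entry>& lexicon,
                                         const std::vector<std::vector<int>>& conn) {
    struct Node {
        std::size_t start;  // byte offset where this token begins
        int entry;          // index into lexicon, -1 for BOS
        int total;          // best accumulated cost up to and including this token
        int prev;           // index of the best predecessor in ends[start]
    };
    auto cls_of = [&](const Node& n) { return n.entry < 0 ? 0 : lexicon[n.entry].cls; };

    // ends[i] holds every lattice node whose token ends at byte offset i.
    std::vector<std::vector<Node>> ends(text.size() + 1);
    ends[0].push_back(Node{0, -1, 0, -1});
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (ends[i].empty()) continue;  // offset not reachable from BOS
        for (std::size_t e = 0; e < lexicon.size(); ++e) {
            const Entry& w = lexicon[e];
            assert(!w.surface.empty());
            if (text.compare(i, w.surface.size(), w.surface) != 0) continue;
            int best = INT_MAX, best_prev = -1;
            for (std::size_t p = 0; p < ends[i].size(); ++p) {
                int c = ends[i][p].total + conn[cls_of(ends[i][p])][w.cls];
                if (c < best) { best = c; best_prev = static_cast<int>(p); }
            }
            ends[i + w.surface.size()].push_back(
                    Node{i, static_cast<int>(e), best + w.cost, best_prev});
        }
    }

    // Close the path with the connection to EOS, then walk back to BOS.
    const std::vector<Node>& last = ends[text.size()];
    int best = INT_MAX, cur = -1;
    for (std::size_t p = 0; p < last.size(); ++p) {
        int c = last[p].total + conn[cls_of(last[p])][0];
        if (c < best) { best = c; cur = static_cast<int>(p); }
    }
    std::vector<std::string> out;
    std::size_t end = text.size();
    while (cur >= 0 && ends[end][cur].entry >= 0) {
        const Node& n = ends[end][cur];
        out.push_back(lexicon[n.entry].surface);
        cur = n.prev;
        end = n.start;
    }
    std::reverse(out.begin(), out.end());
    return out;
}
```

For example, with toy entries 東京 (noun), 都 (suffix), 京都 (noun) and 東 (noun), and a matrix in which noun-to-suffix is cheap and noun-to-noun is expensive, the input 東京都 resolves to 東京 / 都 rather than 東 / 京都. The connection costs, not the word costs alone, decide between the two candidate paths.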
Release note
Support Japanese text tokenization in the inverted index via a new kuromoji parser (`PROPERTIES("parser"="kuromoji")`), with `search`/`normal`/`extended` modes.

Check List (For Author)
… `parser="kuromoji"`.

Check List (For Reviewer who merge this PR)