[feature](inverted-index) Add Japanese (Kuromoji) morphological analyzer by nishant94 · Pull Request #64667 · apache/doris

nishant94 · 2026-06-22T05:53:45Z

What problem does this PR solve?

Issue Number: #64646

Related PR: None

Problem Summary:
Doris has no Japanese-aware tokenizer for the inverted index. Japanese text has no spaces between words, so the existing parsers can't segment it and MATCH / MATCH_PHRASE on Japanese columns end up with poor recall and precision.

This PR adds a built-in kuromoji parser for Japanese, in the same style as the existing Chinese IK analyzer. It's opt-in per column:

 INDEX content_idx (`content`) USING INVERTED
 PROPERTIES("parser" = "kuromoji", "parser_mode" = "search");

After indexing, MATCH, MATCH_PHRASE and TOKENIZE() run against the segmented Japanese terms.

How it works:

- Native C++ under be/src/storage/index/inverted/analyzer/kuromoji/, so there's no JVM on the indexing path. KuromojiAnalyzer / KuromojiTokenizer mirror the IK analyzer/tokenizer, with a Viterbi cost-model segmenter over the IPADIC connection-cost matrix.
- The dictionary is a process-wide singleton loaded once from ${inverted_index_dict_path}/kuromoji. An offline converter compiles raw IPADIC into a compact C++ runtime format (double-array trie + cost matrix + char/unknown tables) at build time, so no binary blob is committed.
- search (default), normal and extended modes are supported. No thrift/proto changes — parser and mode ride as strings in the index properties.
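
To make the Viterbi step concrete, here is a minimal sketch of lowest-cost segmentation over a toy lexicon. It is not the code in this PR: `Entry`, `viterbi_segment`, the context ids and every cost value are hypothetical, and the sketch leaves out unknown-word handling, the double-array trie lookup and the mode-specific penalties. It assumes a MeCab-style layout where the connection cost is looked up by the right context id of the left word and the left context id of the right word, and it reserves id 0 for the sentence boundaries.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Hypothetical toy lexicon entry; real IPADIC entries also carry features.
struct Entry {
    std::string surface;  // UTF-8 bytes of the word
    int left_id;          // context id used when connecting to the word before
    int right_id;         // context id used when connecting to the word after
    int word_cost;
};

// Returns the lowest-cost segmentation of `text`, or an empty vector if the
// lexicon cannot cover the input. conn[r][l] is the cost of joining a word
// with right_id r to a following word with left_id l; id 0 is BOS/EOS.
std::vector<std::string> viterbi_segment(const std::string& text,
                                         const std::vector<Entry>& lexicon,
                                         const std::vector<std::vector<int>>& conn) {
    assert(!conn.empty() && !conn[0].empty());
    struct Node {
        int right_id;
        std::int64_t best;  // cheapest cost of any path ending in this node
        int prev;           // predecessor node on that path
        int entry;          // index into lexicon, -1 for BOS
    };
    const std::size_t n = text.size();
    std::vector<Node> nodes;
    nodes.push_back(Node{0, 0, -1, -1});           // node 0 is BOS
    std::vector<std::vector<int>> ends_at(n + 1);  // node indices by end offset
    ends_at[0].push_back(0);

    for (std::size_t p = 0; p < n; ++p) {
        if (ends_at[p].empty()) continue;  // no path reaches this byte offset
        for (std::size_t e = 0; e < lexicon.size(); ++e) {
            const Entry& w = lexicon[e];
            if (w.surface.empty() ||
                text.compare(p, w.surface.size(), w.surface) != 0) {
                continue;
            }
            // Pick the cheapest predecessor among nodes that end at p.
            std::int64_t best = std::numeric_limits<std::int64_t>::max();
            int best_prev = -1;
            for (int q : ends_at[p]) {
                std::int64_t c = nodes[q].best + conn[nodes[q].right_id][w.left_id] +
                                 w.word_cost;
                if (c < best) {
                    best = c;
                    best_prev = q;
                }
            }
            nodes.push_back(Node{w.right_id, best, best_prev, static_cast<int>(e)});
            ends_at[p + w.surface.size()].push_back(static_cast<int>(nodes.size()) - 1);
        }
    }

    // Connect to EOS and keep the cheapest complete path.
    std::int64_t best = std::numeric_limits<std::int64_t>::max();
    int last = -1;
    for (int q : ends_at[n]) {
        std::int64_t c = nodes[q].best + conn[nodes[q].right_id][0];
        if (c < best) {
            best = c;
            last = q;
        }
    }
    std::vector<std::string> out;
    for (int q = last; q > 0; q = nodes[q].prev) {  // stop at BOS (node 0)
        out.push_back(lexicon[nodes[q].entry].surface);
    }
    std::reverse(out.begin(), out.end());
    return out;
}
```

With a toy lexicon that contains 東京都, 東京 and 都, the result depends only on the costs supplied: a cheap 東京都 entry wins as a single token, while removing it leaves 東京 + 都 as the cheapest path.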

Dictionary source is mecab-ipadic-2.7.0-20070801 (NAIST-2003 license, the same lexicon Lucene kuromoji uses).

Release note

Support Japanese text tokenization in the inverted index via a new kuromoji parser (PROPERTIES("parser"="kuromoji")), with search/normal/extended modes.

Check List (For Author)

Test
- Regression test
- Unit Test
- Manual test (add detailed scripts or steps below)

  CREATE TABLE test_jp (
    id BIGINT,
    content TEXT,
    INDEX idx_content (content) USING INVERTED
      PROPERTIES("parser" = "kuromoji", "parser_mode" = "search")
  ) ENGINE=OLAP
  DUPLICATE KEY(id)
  DISTRIBUTED BY HASH(id) BUCKETS 1
  PROPERTIES("replication_num" = "1");

  INSERT INTO test_jp VALUES
    (1, '東京都に住んでいます'),
    (2, '日本語の形態素解析エンジン');

  -- search-mode decompounding: 東京都 also matches 東京
  SELECT id FROM test_jp WHERE content MATCH '東京';          -- expect: 1
  SELECT id FROM test_jp WHERE content MATCH_PHRASE '形態素解析'; -- expect: 2

  -- inspect segmentation directly
  SELECT TOKENIZE('東京都に住んでいます', '"parser"="kuromoji","parser_mode"="search"');

Behavior changed:
- No.
- Yes. It adds a new opt-in kuromoji parser. Existing parsers and their output are unchanged; the new behavior only applies to indexes that explicitly set parser="kuromoji".
Does this need documentation?
- No.
- Yes. PR Link to Doris-Website.

Check List (For Reviewer who merge this PR)

Confirm the release note
Confirm test cases
Confirm document
Add branch pick label

hello-stephen · 2026-06-22T05:53:50Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

nishant94 · 2026-06-22T06:23:33Z

run buildall

yiguolei · 2026-06-22T06:27:35Z

@nishant94 Have you tried the ICU analyzer? I think ICU can handle many different languages.

nishant94 · 2026-06-22T06:31:13Z

> @nishant94 have you tried icu analyzer? because I think icu could handle many different languages.

@yiguolei The ICU analyzer is not as good as Kuromoji. There is a huge difference between ICU and Kuromoji in how they handle the morphology of Japanese words, so I think it is worth adding this new parser.

BiteTheDDDDt · 2026-06-22T07:02:18Z

Is the code under be/src/storage/index/inverted/analyzer/kuromoji entirely original or derived from other projects? Perhaps we need to clarify the situation regarding this part.

nishant94 · 2026-06-22T07:29:43Z

> Is the code under be/src/storage/index/inverted/analyzer/kuromoji entirely original or derived from other projects? Perhaps we need to clarify the situation regarding this part.

This is original code but it is modeled on Apache Lucene's kuromoji.

hello-stephen · 2026-06-22T07:56:03Z

FE UT Coverage Report

Increment line coverage 44.44% (4/9) 🎉
Increment coverage report
Complete coverage report

Add a built-in `kuromoji` inverted-index parser that segments Japanese text into morphemes, mirroring the existing Chinese IK analyzer.

- Added `darts.h` to `.clang-format-ignore` and `.licenserc.yaml`.
- Improved code formatting in various Kuromoji source files for better readability.
- Updated test files to include necessary headers.

…mposition
- Added support for search mode in the Kuromoji Viterbi segmenter, applying penalties for long all-kanji and other tokens to enhance search recall.
- Updated the KuromojiMode enumeration to reflect the new search and extended modes.
- Modified the KuromojiTokenizer to utilize the new mode functionality.
- Added unit tests to validate the behavior of the search mode, ensuring correct segmentation of compounds.
- Updated NOTICE.txt to include Apache Lucene as a dependency for the kuromoji analyzer.
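
The search-mode penalty described in this commit can be sketched as follows. This is an illustration, not the PR's code: the thresholds and penalty values are the ones Lucene's JapaneseTokenizer uses, and whether this PR uses the same numbers is an assumption. `search_mode_penalty` and `is_kanji` are hypothetical names, and the kanji test is simplified to the CJK Unified Ideographs block rather than the dictionary's character classes.

```cpp
#include <cassert>
#include <string>

// Values as in Lucene's JapaneseTokenizer search mode.
// ASSUMPTION: the PR may use different constants.
constexpr int kKanjiLength = 2;     // all-kanji tokens longer than this are penalized
constexpr int kKanjiPenalty = 3000; // extra cost per kanji beyond the threshold
constexpr int kOtherLength = 7;     // other tokens longer than this are penalized
constexpr int kOtherPenalty = 1700; // extra cost per character beyond the threshold

// Simplification: only the CJK Unified Ideographs block counts as kanji.
inline bool is_kanji(char32_t c) { return c >= 0x4E00 && c <= 0x9FFF; }

// Extra cost added to a candidate token during Viterbi search, so that a
// path made of shorter tokens can beat one long compound.
inline int search_mode_penalty(const std::u32string& surface) {
    const int len = static_cast<int>(surface.size());
    if (len > kKanjiLength) {
        bool all_kanji = true;
        for (char32_t c : surface) {
            if (!is_kanji(c)) {
                all_kanji = false;
                break;
            }
        }
        if (all_kanji) return (len - kKanjiLength) * kKanjiPenalty;
        if (len > kOtherLength) return (len - kOtherLength) * kOtherPenalty;
    }
    return 0;
}
```

Under these assumed constants, the three-kanji token 東京都 receives a penalty while the two-kanji token 東京 does not, which is what lets the segmenter prefer shorter pieces in search mode.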

…wn words
- Implemented functionality in the Kuromoji Viterbi segmenter to decompose unknown (out-of-vocabulary) words into per-character unigrams when in extended mode, aligning with Lucene's JapaneseTokenizer behavior.
- Added unit tests to validate the correct segmentation of unknown words in both normal and extended modes, ensuring expected outputs for various input scenarios.
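
A minimal sketch of the extended-mode step described in this commit: tokens flagged as unknown are split into one token per UTF-8 code point, and known tokens pass through unchanged. `Token`, `utf8_len` and `expand_extended` are hypothetical names for illustration; the PR's actual types and the place where this happens inside the segmenter may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token type: surface text plus an out-of-vocabulary flag.
struct Token {
    std::string surface;  // UTF-8
    bool unknown;
};

// Byte length of the UTF-8 sequence that starts with lead byte `b`.
// Stray continuation bytes are treated as 1-byte units.
inline std::size_t utf8_len(unsigned char b) {
    if (b >= 0xF0) return 4;
    if (b >= 0xE0) return 3;
    if (b >= 0xC0) return 2;
    return 1;
}

// Splits every unknown token into per-character unigrams.
std::vector<std::string> expand_extended(const std::vector<Token>& tokens) {
    std::vector<std::string> out;
    for (const Token& t : tokens) {
        if (!t.unknown) {
            out.push_back(t.surface);
            continue;
        }
        for (std::size_t i = 0; i < t.surface.size();) {
            // Clamp so a truncated trailing sequence cannot run past the end.
            std::size_t len = std::min(utf8_len(static_cast<unsigned char>(t.surface[i])),
                                       t.surface.size() - i);
            out.push_back(t.surface.substr(i, len));
            i += len;
        }
    }
    return out;
}
```

Splitting on code points rather than bytes keeps each unigram a valid UTF-8 string, so it can be indexed and matched like any other term.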

nishant94 · 2026-06-22T09:58:03Z

run buildall

hello-stephen · 2026-06-22T12:00:26Z

BE UT Coverage Report

Increment line coverage 82.40% (791/960) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	54.51% (21439/39327)
Line Coverage	38.17% (205347/537919)
Region Coverage	34.16% (161044/471416)
Branch Coverage	35.14% (70517/200651)

hello-stephen · 2026-06-22T15:49:21Z

BE UT Coverage Report

Increment line coverage 84.10% (836/994) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	54.50% (21433/39329)
Line Coverage	38.13% (205092/537920)
Region Coverage	34.11% (160793/471446)
Branch Coverage	35.11% (70468/200678)

hello-stephen · 2026-06-22T16:58:18Z

BE Regression && UT Coverage Report

Increment line coverage 83.85% (462/551) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	74.11% (28441/38375)
Line Coverage	58.02% (309954/534209)
Region Coverage	54.69% (258833/473301)
Branch Coverage	56.10% (112608/200725)

hello-stephen · 2026-06-22T17:05:28Z

FE Regression Coverage Report

Increment line coverage 66.67% (6/9) 🎉
Increment coverage report
Complete coverage report

- Modified error messages to include 'kuromoji' parser in the parser mode validation.
- Enhanced tests for the Japanese analyzer to assert expected tokenization results.

nishant94 · 2026-06-23T06:16:44Z

run buildall

nishant94 requested review from BiteTheDDDDt, airborne12 and zclllyybb as code owners June 22, 2026 05:53

nishant94 added 5 commits June 22, 2026 15:27

[feature](inverted-index) Add Japanese (Kuromoji) morphological analyzer

ee23881

add empty line on the end of .gitignore

2fc9d8c

Update Kuromoji analyzer files and formatting

c9ad7cb

nishant94 force-pushed the feat/kuromoji-japanese-analyzer branch from 389fcfb to b79db3c Compare June 22, 2026 09:57

Enhance Japanese analyzer tests

57f897f

morningman self-assigned this Jun 23, 2026
