Skip to content

ICU-23376 Minimize the number of states of the DFA#3948

Merged
eggrobin merged 1 commit into
unicode-org:mainfrom
eggrobin:𒁾𒃲𒁔𒆷
May 7, 2026

Hidden character warning

The head ref may contain hidden characters: "\ud808\udc7e\ud808\udcf2\ud808\udc54\ud808\uddb7"
Merged

ICU-23376 Minimize the number of states of the DFA#3948
eggrobin merged 1 commit into
unicode-org:mainfrom
eggrobin:𒁾𒃲𒁔𒆷

Conversation

@eggrobin
Copy link
Copy Markdown
Member

@eggrobin eggrobin commented Apr 17, 2026

This halves the size of the line*.brk (from 73 424 bytes to 37 224 for line.brk, similarly for the others); it does nothing for char.brk or sent.brk, and not much for word.brk (23 120 to 22 672).

The C++ implementation is atrocious because of our lack of decent data structures, this was 30 lines of Python and would probably not be very much more in modern C++…

CC @aheninger, @robertbastian.

Checklist

  • Required: Issue filed: ICU-23376
  • Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-NNNNN Fix xyz"
  • Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-NNNNN Fix xyz"
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable
  • Approver: Feel free to merge on my behalf

@eggrobin eggrobin requested a review from markusicu April 17, 2026 21:42
@markusicu markusicu self-assigned this Apr 17, 2026
@markusicu markusicu requested a review from aheninger April 17, 2026 21:46
@markusicu
Copy link
Copy Markdown
Member

Going down to 199 states should let us use an 8-bit-value trie and 8-bit-value state table, right?
Does the builder select those automatically if possible?

@markusicu
Copy link
Copy Markdown
Member

Java: future commit here or future PR?

Comment thread icu4c/source/common/rbbitblb.cpp Fixed
@eggrobin
Copy link
Copy Markdown
Member Author

Going down to 199 states should let us use an 8-bit-value trie and 8-bit-value state table, right?
Does the builder select those automatically if possible?

Looks like it:

bool RBBITableBuilder::use8BitsForTable() const {
return fDStates->size() <= kMaxStateFor8BitsTable;
}

@markusicu
Copy link
Copy Markdown
Member

This halves the size of the line*.brk (from 73 424 bytes to 37 224 for line.brk

🤯

@markusicu
Copy link
Copy Markdown
Member

Going down to 199 states should let us use an 8-bit-value trie and 8-bit-value state table, right?
Does the builder select those automatically if possible?

Looks like it:

Yeah!

We might have a new problem then: Lack of code coverage for the 16-bit paths. Unless we have a unit test with strangely convoluted rules.

@eggrobin
Copy link
Copy Markdown
Member Author

Java: future commit here or future PR?

I think either works, so depends on whether you review faster than I write the Java (you have an edge, I have half of Monday off). We probably can’t regenerate the checked-in .brks until I have done that because of round-trip tests.

Copy link
Copy Markdown
Member

@markusicu markusicu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not trying to understand the logic.

Comment thread icu4c/source/common/rbbitblb.cpp
Comment thread icu4c/source/common/rbbitblb.cpp Outdated
Comment thread icu4c/source/common/rbbitblb.cpp Outdated
Comment thread icu4c/source/common/rbbitblb.cpp Outdated
Comment thread icu4c/source/common/rbbitblb.cpp Outdated
Comment thread icu4c/source/common/rbbitblb.cpp Outdated
Comment thread icu4c/source/common/rbbitblb.cpp Outdated
Comment thread icu4c/source/common/rbbitblb.cpp
@markusicu
Copy link
Copy Markdown
Member

PS: Where I ask a question for why you choose something, and there is a good reason to do it that way, consider maybe adding a code comment.

markusicu
markusicu previously approved these changes Apr 17, 2026
Comment thread icu4c/source/common/rbbitblb.cpp Outdated
markusicu
markusicu previously approved these changes Apr 18, 2026
@markusicu
Copy link
Copy Markdown
Member

Looking at your local time, I think you are practicing for the UTC meeting...

@markusicu
Copy link
Copy Markdown
Member

Sadly, CI checks are failing. I should bike home, and you should catch some sleep...

@eggrobin
Copy link
Copy Markdown
Member Author

Looking at your local time, I think you are practicing for the UTC meeting...

Yeah, I am not sure what is going on here. This morning I woke up at seven in the morning. 24 h prior I was failing to sleep at seven in the morning.

@eggrobin
Copy link
Copy Markdown
Member Author

eggrobin commented Apr 18, 2026

Sadly, CI checks are failing

Yeah I think the unique_ptr is dragging in some silly symbols. Once we have a make_unique with a status we should not let that stop us, but for now I guess I will replace the remaining one with a LocalArray.

markusicu
markusicu previously approved these changes Apr 18, 2026
@eggrobin
Copy link
Copy Markdown
Member Author

In any case, this needs TC approval, so it’s not getting merged tonight.

@eggrobin
Copy link
Copy Markdown
Member Author

It looks like the fuzzer failure is just telling us that this minimization algorithm can be a bit slow on pathological state machines.
On the rules .+䶝.䶝.............;, it starts out with 7398 states and does nothing, which means that it needs to refine the partition from two parts to 7398. On my machine this takes 10 s in debug mode and 2 s in release mode. It does not seem implausible that it would exceed 25 s with ubsan on the CI runners.

@eggrobin
Copy link
Copy Markdown
Member Author

eggrobin commented May 6, 2026

@markusicu It looks like I successfully appeased the fuzzers. The windows-msys2-gcc-x86_64 failure seems unrelated, I see it on other PRs too.

Comment thread icu4c/source/common/rbbitblb.cpp
@markusicu
Copy link
Copy Markdown
Member

The windows-msys2-gcc-x86_64 failure seems unrelated, I see it on other PRs too.

Yeah, it fails on main too: https://github.com/unicode-org/icu/actions/runs/25382815834
This is very recent :-(

@markusicu
Copy link
Copy Markdown
Member

The windows-msys2-gcc-x86_64 failure seems unrelated, I see it on other PRs too.

Yeah, it fails on main too: https://github.com/unicode-org/icu/actions/runs/25382815834 This is very recent :-(

Mihai has unblocked CI via PR #3970. Please rerun the failing test. If it still fails, try rebasing.

@eggrobin eggrobin force-pushed the 𒁾𒃲𒁔𒆷 branch from 3da6c2e to 2a3406b Compare May 7, 2026 02:02
@jira-pull-request-webhook
Copy link
Copy Markdown

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@eggrobin eggrobin merged commit cbf4c65 into unicode-org:main May 7, 2026
98 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants