ICU-23376 Minimize the number of states of the DFA #3948
|
Going down to 199 states should let us use an 8-bit-value trie and 8-bit-value state table, right? |
|
Java: future commit here or future PR? |
Looks like it: icu/icu4c/source/common/rbbitblb.cpp Lines 1362 to 1364 in c5946d7 |
🤯 |
Yeah! We might have a new problem then: Lack of code coverage for the 16-bit paths. Unless we have a unit test with strangely convoluted rules. |
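To make the 8-bit versus 16-bit point concrete, here is a minimal sketch of choosing the narrowest cell width for a state table from its state count. The threshold of 255 and the name `pack_state_table` are assumptions for illustration; the actual condition is whatever the linked lines of rbbitblb.cpp say, and a real table row would carry more than next-state indices.

```python
from array import array

# Assumption for illustration: state indices fit in one byte when there are
# at most 255 states. The real limit lives in rbbitblb.cpp.
MAX_STATES_FOR_8_BIT = 255


def pack_state_table(rows):
    """Flatten rows of next-state indices into the narrowest unsigned array.

    rows[s][c] is the next state from state s on character category c.
    Returns an array with typecode "B" (8-bit) or "H" (16-bit).
    """
    typecode = "B" if len(rows) <= MAX_STATES_FOR_8_BIT else "H"
    flat = array(typecode)
    for row in rows:
        flat.extend(row)  # raises OverflowError if a value does not fit
    return flat


# 199 states, two categories each: every cell fits in one byte.
small = pack_state_table([[s, (s + 1) % 199] for s in range(199)])
# 300 states: cells need two bytes.
large = pack_state_table([[s, (s + 1) % 300] for s in range(300)])
print(small.itemsize, large.itemsize)
```

With only small tables checked in, the wider branch would be exercised only by a test that builds a machine with more states than the threshold, which is the coverage concern raised above.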
I think either works, so depends on whether you review faster than I write the Java (you have an edge, I have half of Monday off). We probably can’t regenerate the checked-in .brks until I have done that because of round-trip tests. |
markusicu
left a comment
I am not trying to understand the logic.
|
PS: Where I ask a question for why you choose something, and there is a good reason to do it that way, consider maybe adding a code comment. |
|
Looking at your local time, I think you are practicing for the UTC meeting... |
|
Sadly, CI checks are failing. I should bike home, and you should catch some sleep... |
Yeah, I am not sure what is going on here. This morning I woke up at seven in the morning. 24 h prior I was failing to sleep at seven in the morning. |
Yeah I think the unique_ptr is dragging in some silly symbols. Once we have a make_unique with a status we should not let that stop us, but for now I guess I will replace the remaining one with a LocalArray. |
|
In any case, this needs TC approval, so it’s not getting merged tonight. |
|
It looks like the fuzzer failure is just telling us that this minimization algorithm can be a bit slow on pathological state machines. |
|
@markusicu It looks like I successfully appeased the fuzzers. The windows-msys2-gcc-x86_64 failure seems unrelated, I see it on other PRs too. |
Yeah, it fails on main too: https://github.com/unicode-org/icu/actions/runs/25382815834 |
Mihai has unblocked CI via PR #3970. Please rerun the failing test. If it still fails, try rebasing. |
|
Hooray! The files in the branch are the same across the force-push. 😃 ~ Your Friendly Jira-GitHub PR Checker Bot |
This halves the size of the line*.brk (from 73 424 bytes to 37 224 for line.brk, similarly for the others); it does nothing for char.brk or sent.brk, and not much for word.brk (23 120 to 22 672).
The C++ implementation is atrocious because of our lack of decent data structures, this was 30 lines of Python and would probably not be very much more in modern C++…
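For readers who want a feel for how short DFA minimization can be in Python, here is a minimal sketch using textbook partition refinement (Moore's algorithm). It is not the code from this PR, and the PR does not say which algorithm it uses; the names `minimize_dfa` and `labels` are hypothetical, and `labels[s]` stands in for whatever must stay distinct between states apart from their transitions (for example an accepting flag).

```python
def _number_by_key(keys):
    """Give equal keys the same id, numbering ids by first occurrence."""
    ids = {}
    return [ids.setdefault(k, len(ids)) for k in keys]


def minimize_dfa(transitions, labels):
    """Merge states that no input sequence can tell apart.

    transitions[s][c] is the next state from state s on category c.
    Returns (new_transitions, new_labels, old_to_new).
    Unreachable states are not removed by this sketch.
    """
    n = len(transitions)
    # Start by grouping states that carry the same label.
    block_of = _number_by_key(labels)
    while True:
        # Two states stay together only if they are in the same block and
        # each category leads them to the same block.
        signatures = [
            (block_of[s],) + tuple(block_of[t] for t in transitions[s])
            for s in range(n)
        ]
        refined = _number_by_key(signatures)
        # Refinement only ever splits blocks, so an unchanged block count
        # means the partition is stable.
        if len(set(refined)) == len(set(block_of)):
            break
        block_of = refined

    n_blocks = len(set(block_of))
    new_transitions = [None] * n_blocks
    new_labels = [None] * n_blocks
    for s in range(n):
        b = block_of[s]
        if new_transitions[b] is None:  # first state seen represents the block
            new_transitions[b] = [block_of[t] for t in transitions[s]]
            new_labels[b] = labels[s]
    return new_transitions, new_labels, block_of


# States 1 and 2 behave identically, so they collapse into one state.
table = [[1, 2], [3, 3], [3, 3], [3, 3]]
accepting = [0, 0, 0, 1]
new_table, new_accepting, mapping = minimize_dfa(table, accepting)
print(new_table, new_accepting, mapping)
```

Because ids are assigned by first occurrence, state 0 keeps index 0 in the result. This naive refinement may need as many rounds as there are states, each of which rescans the whole table, so it is a sketch of the idea rather than something tuned for large machines.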
CC @aheninger, @robertbastian.
Checklist