
Bump snapshot versions #1332

Open
eggrobin wants to merge 2 commits into unicode-org:main from eggrobin:pom-pom-pom-pom

Conversation

@eggrobin
Member

And use the word segmenter now that we have unicode-org/icu#3935.

  • Approver: Feel free to merge on my behalf
    • rebase & merge one or more commits
    • squash & merge multiple commits into one

@eggrobin eggrobin requested a review from markusicu April 15, 2026 03:39
Member

@markusicu markusicu left a comment


there are CI check failures

Comment on lines +307 to +310
snippet.replaceAll("\\.-", ".0")
.replaceAll(
"(?<=[0-9]'*)'(?='*\\.[0-9])",
"0"))
Member


please move this into a String variable before computing the segments

::iterator;
for (final var segment : segments) {
String word =
snippet.substring(segment.start, segment.limit)
Member


Do you really want the word from the original snippet string? Do the replaceAll() calls preserve the string indexes?

Otherwise, I would use segment.getSubSequence().toString() here.

Member Author


I really want the word from the original snippet string. The replaceAll is basically a poor man’s word segmentation tailoring (and the replacements are length preserving to achieve that).
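The idea in this reply can be sketched in isolation. The following is a minimal illustration, not the code from this PR: it applies only the first replacement from the diff (`".-"` to `".0"`), and it uses a hypothetical regex-based `TOKEN` pattern as a stand-in for the real word segmenter. Because the replacement preserves length, offsets found in the tailored string can be used to cut words out of the original snippet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LengthPreservingTailoring {
    // Hypothetical stand-in for a real word segmenter (an assumption for this
    // sketch): a "word" is a run of ASCII letters, digits, and full stops.
    private static final Pattern TOKEN = Pattern.compile("[0-9A-Za-z.]+");

    static List<String> words(String snippet) {
        // First replacement from the diff: ".-" becomes ".0" (two chars in, two out).
        String tailored = snippet.replaceAll("\\.-", ".0");
        if (tailored.length() != snippet.length()) {
            throw new IllegalStateException("replacement changed the length");
        }
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(tailored);
        while (m.find()) {
            // Boundaries come from the tailored string; text comes from the original.
            result.add(snippet.substring(m.start(), m.end()));
        }
        return result;
    }
}
```

For the input `"see 12.-3 here"`, `words` returns the three strings `see`, `12.-3`, and `here`. Running the same stand-in pattern directly on the original would split `12.-3` at the hyphen, which is the effect the tailoring avoids.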

.segments()
.filter(s -> s.ruleStatus >= BreakIterator.WORD_NUMBER)
::iterator;
for (final var segment : segments) {
Member


@echeran is asking why you don't move more of this loop into Stream map() etc. calls above...
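To make the question concrete, here is a hedged sketch of the two styles using only `java.util.stream`. The `Segment` class and the `WORD_*` constants are hypothetical stand-ins defined locally for this example (the constant values are assumed to mirror ICU's `BreakIterator` rule status values); the actual loop body in the PR is not shown on this page, so only the substring step is reproduced.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

class SegmentStreamSketch {
    // Hypothetical constants, assumed to mirror ICU's BreakIterator values.
    static final int WORD_NONE = 0;
    static final int WORD_NUMBER = 100;
    static final int WORD_LETTER = 200;

    // Hypothetical stand-in for the segmenter's segment type.
    static final class Segment {
        final int start;
        final int limit;
        final int ruleStatus;

        Segment(int start, int limit, int ruleStatus) {
            this.start = start;
            this.limit = limit;
            this.ruleStatus = ruleStatus;
        }
    }

    // Style in the diff: filter in the stream, then iterate with a for loop.
    static List<String> wordsWithLoop(String snippet, List<Segment> all) {
        Iterable<Segment> segments =
                all.stream().filter(s -> s.ruleStatus >= WORD_NUMBER)::iterator;
        List<String> words = new ArrayList<>();
        for (final var segment : segments) {
            words.add(snippet.substring(segment.start, segment.limit));
        }
        return words;
    }

    // Style suggested in the review: move the per-segment work into map().
    static List<String> wordsWithStream(String snippet, List<Segment> all) {
        return all.stream()
                .filter(s -> s.ruleStatus >= WORD_NUMBER)
                .map(s -> snippet.substring(s.start, s.limit))
                .collect(Collectors.toList());
    }
}
```

Both methods return the same list. The stream form is more compact when the loop body is a pure transformation, while the loop form can be easier to keep if the body has side effects or early exits.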

@eggrobin
Member Author

> there are CI check failures

Looks like unicode-org/cldr@79c9a73 broke https://github.com/unicode-org/unicodetools/blob/main/unicodetools/src/main/java/org/unicode/draft/GenerateCharacterFrequencyCharts.java by removing ExemplarInfo.Status.

I have no idea what ExemplarInfo.Status was, or what it should be replaced by. For that matter, I have no idea what GenerateCharacterFrequencyCharts is.

@macchiati, could you take a look?

@markusicu, if this turns out to be nontrivial, could we do the ICU & CLDR dependency updates separately?

@markusicu
Member

> if this turns out to be nontrivial, could we do the ICU & CLDR dependency updates separately?

of course

@markusicu
Member

GenerateCharacterFrequencyCharts -- FWIW

@macchiati how does CharacterFrequency play into N4M?

Judging from the output table columns, I would have guessed that this tool prints an informational chart for CLDR, and that CLDR has long forgotten about it.

@eggrobin
Member Author

> git history peters out in 2013

It is unfortunately interrupted, but if you know how to jump over the discontinuity you can dig deeper: #485 (comment)

GenerateCharacterFrequencyCharts was added in 2010, as part of a large bucket of misc. cfe4d60

This blame shows some nontrivial revisions 13 years ago: https://github.com/unicode-org/unicodetools/blame/2d1a65225b0b376ebeb4396c95fcfac192c25855/org/unicode/draft/GenerateCharacterFrequencyCharts.java.

@markusicu
Member

> git history peters out in 2013

> It is unfortunately interrupted, but if you know how to jump over the discontinuity you can dig deeper

I would like to learn how.

https://github.com/unicode-org/unicodetools/commits/main/unicodetools/src/main/java/org/unicode/draft/GenerateCharacterFrequencyCharts.java
shows “Renamed from unicodetools/org/unicode/draft/GenerateCharacterFrequencyCharts.java” (Browse History),
and that ends with commit 8317043 “ticket:1: move unicodetools under trunk”,
followed by “End of commit history for this file”.

How do you go further back from there?

I just tried going from that last commit to its parent commit 5ad0764 “ticket:1: create a new trunk” but when I browse files there I come up empty, and looking for characterfrequency just keeps spinning.

@eggrobin
Member Author

> I would like to learn how.

When you hit the bottom of the current stratum at 8317043, you start over from the top of the previous one, at https://github.com/unicode-org/unicodetools/tree/2d1a65225b0b376ebeb4396c95fcfac192c25855 (which is the parent of 5ad0764).

@markusicu
Member

> I would like to learn how.

> When you hit the bottom of the current stratum at 8317043, you start over from the top of the previous one, at https://github.com/unicode-org/unicodetools/tree/2d1a65225b0b376ebeb4396c95fcfac192c25855 (which is the parent of 5ad0764).

I see -- when one commit comes up empty / unbrowsable, you keep going up the parent chain until there is browsable content again. Thanks!

Also, this file was renamed from an earlier Combining.java...

@eggrobin
Member Author

eggrobin commented Apr 16, 2026

> I see -- when one commit comes up empty / unbrowsable, you keep going up the parent chain until there is browsable content again. Thanks!

Yes. In practice, there is one such discontinuity, so it is worth bookmarking or otherwise knowing where to find https://github.com/unicode-org/unicodetools/tree/2d1a65225b0b376ebeb4396c95fcfac192c25855; what I do is search this repository for "archæ‌ology", which brings up the comment on #485 which I mentioned above.
