Bump snapshot versions by eggrobin · Pull Request #1332 · unicode-org/unicodetools

eggrobin · 2026-04-15T03:39:18Z

And use the word segmenter now that we have unicode-org/icu#3935.

Approver: Feel free to merge on my behalf
- rebase & merge one or more commits
- squash & merge multiple commits into one

markusicu

there are CI check failures

markusicu · 2026-04-15T23:53:21Z

+                                                    snippet.replaceAll("\\.-", ".0")
+                                                            .replaceAll(
+                                                                    "(?<=[0-9]'*)'(?='*\\.[0-9])",
+                                                                    "0"))


Please move this into a String variable before computing the segments.

markusicu · 2026-04-16T00:05:19Z

+                                    ::iterator;
+                    for (final var segment : segments) {
+                        String word =
+                                snippet.substring(segment.start, segment.limit)


Do you really want the word from the original snippet string? Do the replaceAll() calls preserve the string indexes?

Otherwise, I would use segment.getSubSequence().toString() here.

I really want the word from the original snippet string. The replaceAll is basically a poor man’s word segmentation tailoring (and the replacements are length preserving to achieve that).
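A minimal sketch of the idea described above, not the code from this PR: segment a tailored copy of the snippet, but cut the words out of the original using the same offsets. It assumes `java.text.BreakIterator` in place of the ICU segmenter so that it runs without extra dependencies, and it keeps only the simpler `".-"` replacement. The names `OriginalWordExtractor`, `tailor`, and `words` are hypothetical, and the letter-or-digit filter stands in for the rule-status filter in the diff.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class OriginalWordExtractor {
    // Hypothetical helper: each replacement swaps N chars for exactly N chars,
    // so every offset in the result is also a valid offset in the input.
    static String tailor(String snippet) {
        // ".-" -> ".0": two chars in, two chars out.
        return snippet.replace(".-", ".0");
    }

    static List<String> words(String snippet) {
        String tailored = tailor(snippet);
        if (tailored.length() != snippet.length()) {
            throw new IllegalStateException("tailoring must preserve length");
        }
        // Assumption: the JDK word BreakIterator stands in for the ICU segmenter.
        BreakIterator boundaries = BreakIterator.getWordInstance(Locale.ROOT);
        boundaries.setText(tailored);
        List<String> result = new ArrayList<>();
        int start = boundaries.first();
        for (int limit = boundaries.next();
                limit != BreakIterator.DONE;
                start = limit, limit = boundaries.next()) {
            // Boundaries come from the tailored string, text from the original.
            String word = snippet.substring(start, limit);
            // Rough stand-in for a rule-status filter: drop spaces and punctuation.
            if (word.codePoints().anyMatch(Character::isLetterOrDigit)) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("hello world"));
        System.out.println(words("version 1.-2"));
    }
}
```

Because the offsets are shared, the length check is the only invariant the tailoring step has to maintain; a replacement that changed the length would silently shift every later word.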

markusicu · 2026-04-16T00:06:08Z

+                                            .segments()
+                                            .filter(s -> s.ruleStatus >= BreakIterator.WORD_NUMBER)
+                                    ::iterator;
+                    for (final var segment : segments) {


@echeran is asking why you don't move more of this loop into Stream map() etc. calls above...
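For illustration, this is roughly what moving the loop body into the stream could look like. It is a sketch under assumptions, not the PR's code: `Segment` is a hypothetical stand-in for the segmenter's segment type with the same `start`, `limit`, and `ruleStatus` fields seen in the diff, and `WORD_NUMBER` is a local constant assumed to mirror `BreakIterator.WORD_NUMBER`.

```java
import java.util.List;
import java.util.stream.Collectors;

final class SegmentPipelineSketch {
    // Assumed to mirror ICU's BreakIterator.WORD_NUMBER; verify against ICU.
    static final int WORD_NUMBER = 100;

    // Hypothetical stand-in for the segmenter's segment type.
    static final class Segment {
        final int start;
        final int limit;
        final int ruleStatus;

        Segment(int start, int limit, int ruleStatus) {
            this.start = start;
            this.limit = limit;
            this.ruleStatus = ruleStatus;
        }
    }

    // The filter and the substring extraction both live in the pipeline,
    // so no explicit for loop over the segments is needed.
    static List<String> words(String snippet, List<Segment> segments) {
        return segments.stream()
                .filter(s -> s.ruleStatus >= WORD_NUMBER)
                .map(s -> snippet.substring(s.start, s.limit))
                .collect(Collectors.toList());
    }
}
```

Whether this reads better than the loop depends on how much per-word work follows; a loop body with side effects or early exits may be clearer left as a loop.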

eggrobin · 2026-04-16T11:42:27Z

> there are CI check failures

Looks like unicode-org/cldr@79c9a73 broke https://github.com/unicode-org/unicodetools/blob/main/unicodetools/src/main/java/org/unicode/draft/GenerateCharacterFrequencyCharts.java by removing ExemplarInfo.Status.

I have no idea what ExemplarInfo.Status was, or what it should be replaced by. For that matter, I have no idea what GenerateCharacterFrequencyCharts is.

@macchiati, could you take a look?

@markusicu, if this turns out to be nontrivial, could we do the ICU & CLDR dependency updates separately?

markusicu · 2026-04-16T16:57:18Z

> if this turns out to be nontrivial, could we do the ICU & CLDR dependency updates separately?

of course

markusicu · 2026-04-16T17:15:21Z

GenerateCharacterFrequencyCharts -- FWIW

- not mentioned anywhere (no docs etc.)
- related to sibling class https://github.com/unicode-org/unicodetools/blob/main/unicodetools/src/main/java/org/unicode/draft/CharacterFrequency.java which is half commented out
- git history peters out in 2013, apparently only cleanup/refactoring changes are recorded
- Unicode Tools classes with main() says “used by N4M code” and “move to new n4m package”

@macchiati how does CharacterFrequency play into N4M?

Judging from the output table columns, I would have guessed that this tool prints an informational chart for CLDR, and that CLDR has long forgotten about it.

eggrobin · 2026-04-16T18:39:06Z

> git history peters out in 2013

It is unfortunately interrupted, but if you know how to jump over the discontinuity you can dig deeper: #485 (comment)

GenerateCharacterFrequencyCharts was added in 2010, as part of a large bucket of misc. cfe4d60

This blame shows some nontrivial revisions 13 years ago: https://github.com/unicode-org/unicodetools/blame/2d1a65225b0b376ebeb4396c95fcfac192c25855/org/unicode/draft/GenerateCharacterFrequencyCharts.java.

markusicu · 2026-04-16T21:29:05Z

> > git history peters out in 2013
>
> It is unfortunately interrupted, but if you know how to jump over the discontinuity you can dig deeper

I would like to learn how.

https://github.com/unicode-org/unicodetools/commits/main/unicodetools/src/main/java/org/unicode/draft/GenerateCharacterFrequencyCharts.java
shows
“Renamed from unicodetools/org/unicode/draft/GenerateCharacterFrequencyCharts.java
(Browse History)”
and that ends with commit 8317043 “ticket:1: move unicodetools under trunk”
followed by
“End of commit history for this file”

How do you go further back from there?

I just tried going from that last commit to its parent commit 5ad0764 “ticket:1: create a new trunk” but when I browse files there I come up empty, and looking for characterfrequency just keeps spinning.

eggrobin · 2026-04-16T21:33:53Z

> I would like to learn how.

When you hit the bottom of the current stratum at 8317043, you start over from the top of the previous one, at https://github.com/unicode-org/unicodetools/tree/2d1a65225b0b376ebeb4396c95fcfac192c25855 (which is the parent of 5ad0764).

markusicu · 2026-04-16T21:49:48Z

> > I would like to learn how.
>
> When you hit the bottom of the current stratum at 8317043, you start over from the top of the previous one, at https://github.com/unicode-org/unicodetools/tree/2d1a65225b0b376ebeb4396c95fcfac192c25855 (which is the parent of 5ad0764).

I see -- when one commit comes up empty / unbrowsable, you keep going up the parent chain until there is browsable content again. Thanks!

Also, this file was renamed from an earlier Combining.java...

eggrobin · 2026-04-16T21:52:51Z

> I see -- when one commit comes up empty / unbrowsable, you keep going up the parent chain until there is browsable content again. Thanks!

Yes. In practice, there is one such discontinuity, so it is worth bookmarking or otherwise knowing where to find https://github.com/unicode-org/unicodetools/tree/2d1a65225b0b376ebeb4396c95fcfac192c25855; what I do is search this repository for "archæ‌ology", which brings up the comment on #485 which I mentioned above.

eggrobin added 2 commits April 15, 2026 05:28

Bump 🧊🫵 & 🦭🦌 versions

5943aec

Use the word segmenter

6d83740

eggrobin requested a review from markusicu April 15, 2026 03:39

markusicu reviewed Apr 16, 2026


2 participants
