markusicu
left a comment
there are CI check failures
    snippet.replaceAll("\\.-", ".0")
        .replaceAll(
            "(?<=[0-9]'*)'(?='*\\.[0-9])",
            "0"))
please move this into a String variable before computing the segments
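A minimal sketch of what that refactoring could look like. The class and method names (`SnippetTailoring`, `tailor`) are hypothetical, and the second pattern is a simplified stand-in for the one in the diff, so this illustrates the shape of the change rather than the PR's actual code.

```java
class SnippetTailoring {
    // Hypothetical helper: the replacement chain gets its own named result,
    // so the segmentation step that follows reads from a plain variable.
    static String tailor(String snippet) {
        String tailored =
                snippet
                        // ".-" (2 chars) becomes ".0" (2 chars).
                        .replaceAll("\\.-", ".0")
                        // Simplified stand-in for the PR's second pattern:
                        // an apostrophe between two digits becomes "0".
                        .replaceAll("(?<=[0-9])'(?=[0-9])", "0");
        return tailored;
    }

    public static void main(String[] args) {
        String snippet = "costs 1'000.-";
        String tailored = tailor(snippet);
        System.out.println(tailored);
        // Both replacements swap text for text of the same length.
        System.out.println(tailored.length() == snippet.length());
    }
}
```

With the tailored text in its own variable, the segmenter call takes `tailored` as input and the original `snippet` stays available for later lookups.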
    ::iterator;
    for (final var segment : segments) {
        String word =
            snippet.substring(segment.start, segment.limit)
Do you really want the word from the original snippet string? Do the replaceAll() calls preserve the string indexes?
Otherwise, I would use segment.getSubSequence().toString() here.
I really want the word from the original snippet string. The replaceAll is basically a poor man’s word segmentation tailoring (and the replacements are length preserving to achieve that).
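To illustrate why length-preserving replacements make this safe, here is a small sketch. It assumes a regex-based tokenizer (`TOKEN`) as a stand-in for the real word segmenter, and the class name `OriginalWords` is hypothetical. Boundaries are computed on the tailored text, but the words are cut from the original snippet using the same offsets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OriginalWords {
    // Stand-in for a real word segmenter (an assumption for this sketch):
    // a token is a run of letters, or digits with optional ".digits" groups.
    private static final Pattern TOKEN =
            Pattern.compile("\\p{L}+|[0-9]+(?:\\.[0-9]+)*");

    static List<String> words(String snippet) {
        // Length-preserving: ".-" (2 chars) becomes ".0" (2 chars),
        // so every offset in `tailored` is also valid in `snippet`.
        String tailored = snippet.replaceAll("\\.-", ".0");
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(tailored);
        while (m.find()) {
            // Boundaries come from the tailored text,
            // the word itself comes from the original text.
            result.add(snippet.substring(m.start(), m.end()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("pay 12.-50 now"));
    }
}
```

If a replacement changed the length of the string, the offsets would drift and `snippet.substring(...)` would return the wrong text; in that case taking the subsequence from the segment itself would be the safer choice.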
    .segments()
    .filter(s -> s.ruleStatus >= BreakIterator.WORD_NUMBER)
    ::iterator;
    for (final var segment : segments) {
@echeran is asking why you don't move more of this loop into Stream map() etc. calls above...
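One way that suggestion could look, sketched under assumptions: `Segment` here is a hypothetical stand-in for the segmenter's segment type, and `WORD_NUMBER` is a local constant playing the role of `BreakIterator.WORD_NUMBER`. The filter stays as in the diff, and the substring step moves out of the loop body into `map()`.

```java
import java.util.List;
import java.util.stream.Collectors;

class SegmentStreams {
    // Hypothetical stand-in for the segmenter's segment type.
    static final class Segment {
        final int start;
        final int limit;
        final int ruleStatus;

        Segment(int start, int limit, int ruleStatus) {
            this.start = start;
            this.limit = limit;
            this.ruleStatus = ruleStatus;
        }
    }

    // Assumed threshold for this sketch, in place of BreakIterator.WORD_NUMBER.
    static final int WORD_NUMBER = 100;

    static List<String> words(String snippet, List<Segment> segments) {
        return segments.stream()
                // Keep only segments whose status marks a number or word.
                .filter(s -> s.ruleStatus >= WORD_NUMBER)
                // Formerly the first statement of the for-loop body.
                .map(s -> snippet.substring(s.start, s.limit))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Segment> segments = List.of(
                new Segment(0, 2, 200),  // "ab"
                new Segment(2, 3, 0),    // " " is filtered out
                new Segment(3, 5, 100)); // "12"
        System.out.println(words("ab 12", segments));
    }
}
```

Whether this reads better than the explicit loop depends on how much per-word work follows; a loop body with side effects or early exits may be clearer left as a loop.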
Looks like unicode-org/cldr@79c9a73 broke https://github.com/unicode-org/unicodetools/blob/main/unicodetools/src/main/java/org/unicode/draft/GenerateCharacterFrequencyCharts.java by removing ExemplarInfo.Status. I have no idea what ExemplarInfo.Status was, or what it should be replaced by. For that matter, I have no idea what GenerateCharacterFrequencyCharts is. @macchiati, could you take a look? @markusicu, if this turns out to be nontrivial, could we do the ICU & CLDR dependency updates separately?
of course
GenerateCharacterFrequencyCharts -- FWIW
@macchiati how does CharacterFrequency play into N4M? Judging from the output table columns, I would have guessed that this tool prints an informational chart for CLDR, and that CLDR has long forgotten about it.
It is unfortunately interrupted, but if you know how to jump over the discontinuity you can dig deeper: #485 (comment) GenerateCharacterFrequencyCharts was added in 2010, as part of a large bucket of This blame shows some nontrivial revisions 13 years ago: https://github.com/unicode-org/unicodetools/blame/2d1a65225b0b376ebeb4396c95fcfac192c25855/org/unicode/draft/GenerateCharacterFrequencyCharts.java.
I would like to learn how. https://github.com/unicode-org/unicodetools/commits/main/unicodetools/src/main/java/org/unicode/draft/GenerateCharacterFrequencyCharts.java How do you go further back from there? I just tried going from that last commit to its parent commit 5ad0764 “ticket:1: create a new trunk” but when I browse files there I come up empty, and looking for
When you hit the bottom of the current stratum at 8317043, you start over from the top of the previous one, at https://github.com/unicode-org/unicodetools/tree/2d1a65225b0b376ebeb4396c95fcfac192c25855 (which is the parent of 5ad0764).
I see -- when one commit comes up empty / unbrowsable, you keep going up the parent chain until there is browsable content again. Thanks! Also, this file was renamed from an earlier Combining.java...
Yes. In practice, there is one such discontinuity, so it is worth bookmarking or otherwise knowing where to find https://github.com/unicode-org/unicodetools/tree/2d1a65225b0b376ebeb4396c95fcfac192c25855; what I do is search this repository for "archæology", which brings up the comment on #485 which I mentioned above.
And use the word segmenter now that we have unicode-org/icu#3935.