Skip to content

ICU-3736 UAX44-LM2 loose matching for character names#3932

Open
eggrobin wants to merge 22 commits intounicode-org:mainfrom
eggrobin:𒈬𒁲𒁮

Hidden character warning

The head ref may contain hidden characters: "\ud808\ude2c\ud808\udc72\ud808\udc6e"
Open

ICU-3736 UAX44-LM2 loose matching for character names#3932
eggrobin wants to merge 22 commits intounicode-org:mainfrom
eggrobin:𒈬𒁲𒁮

Conversation

@eggrobin
Copy link
Copy Markdown
Member

@eggrobin eggrobin commented Apr 8, 2026

Checklist

  • Required: Issue filed: ICU-3736
  • Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-NNNNN Fix xyz"
  • Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-NNNNN Fix xyz"
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable
  • Approver: Feel free to merge on my behalf

@markusicu markusicu self-assigned this Apr 9, 2026
@markusicu markusicu requested a review from richgillam April 9, 2026 16:32
Copy link
Copy Markdown
Contributor

@richgillam richgillam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's just because I'm not awake yet (or maybe I'm not the best choice of reviewer), but I had trouble finding my way through this. A lot of it made sense, but it wasn't clear to me what the actual rules for matching are, and there seemed to be several spots in the code that assumed more knowledge on the part of the reader than I possess. Can you help me understand what's going on here?

Comment thread icu4c/source/common/unames.cpp Outdated
Comment thread icu4c/source/common/unames.cpp
if (static_cast<uint8_t>(';') >= tokenCount || tokens[static_cast<uint8_t>(';')] == static_cast<uint16_t>(-1)) {
continue;
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you taking this out?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unicode 1 names were removed from ICU in ICU 49.

Comment thread icu4c/source/common/unames.cpp
Comment thread icu4c/source/test/intltest/usettest.cpp
richgillam
richgillam previously approved these changes Apr 9, 2026
Copy link
Copy Markdown
Contributor

@richgillam richgillam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LOKTM

@eggrobin
Copy link
Copy Markdown
Member Author

eggrobin commented Apr 10, 2026

I added some comments; they are pretty long because while writing them I noticed some subtle edge cases (which, as best I can tell, were handled correctly, but are noteworthy)…

@eggrobin eggrobin requested a review from richgillam April 13, 2026 14:34
@eggrobin eggrobin requested a review from markusicu April 13, 2026 14:43
Copy link
Copy Markdown
Member

@markusicu markusicu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As usual, nice and elegant code where I feel like I learn good and modern techniques. I do have a few comments though.

Comment thread icu4c/source/common/unames.cpp
Comment thread icu4c/source/common/unames.cpp Outdated
public:
class Matcher {
public:
Matcher(const CharacterNameQuery *const query)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not take the query by reference? it must not be nullptr.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Google style uses pointers when there is a lifetime requirement, as here, and I like that. Added a comment though, lifetime requirements should be documented…

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT Google code has been moving away from pointers to references when not nullable, including for outputs.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regardless of Google... I much prefer using a reference when the thing can't be null.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT Google code has been moving away from pointers to references when not nullable, including for outputs.

For outputs, yes; but not for lifetime dependencies if I recall correctly.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how a pointer helps for that; it's certainly not something we do in ICU.
Please change query to a reference.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They do advise to make the member a reference. Done.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I see the point:

In some cases reference parameters can bind to temporaries, leading to lifetime bugs.

One of many C++ foot guns... I guess I can live with the pointer :-(

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how a pointer helps for that;

Well, the ToTW has some pretty clear examples; it prevents passing temporaries.

But a cursory search for outlive finds at least one comment about a lifetime requirements on reference parameters in ICU, so I guess we do this here. Done.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we certainly do that in some places in ICU: see, e.g.,

* @param out The vector to which ConversionRateInfo instances are to be
* added. This vector must outlive the use of the ResourceSink.
*/
explicit ConversionRateDataSink(MaybeStackVector<ConversionRateInfo> *out) : outVector(out) {}

Comment thread icu4c/source/common/unames.cpp Outdated
Comment thread icu4c/source/common/unames.cpp Outdated
Comment thread icu4c/source/common/unames.cpp
Comment thread icu4c/source/common/unames.cpp Outdated
Comment thread icu4c/source/common/unames.cpp Outdated
Comment thread icu4c/source/common/unames.cpp Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants