@lionel-rowe lionel-rowe commented Nov 14, 2025

Fixes #308.

In the end I didn't use ICU directly, since ICU support in Python can be a bit finicky as a dependency and can't be added with a simple pip install.

Instead, I re-implemented the Unicode sentence segmentation algorithm as a pure Python module, unicode-segment. It's a fully compliant implementation (it passes all the tests in the TR29 test suite) and, after some optimization, decently performant for inputs of a reasonable size (typically sub-millisecond per 1k chars, though YMMV depending on hardware etc.).

It's still a deterministic, rule-based algorithm, though, which means there are certain corner cases it will get wrong:

UAX #29’s sentence boundary rules are a lot smarter than just treating every full stop as the end of a sentence. But they’re not perfect. In the string `"Dr. John works at I.B.M., doesn't he?", asked Alice. "Yes," replied Charlie.`, the regex `\b{sb}.+?\b{sb}` finds 3 matches: `"Dr. `, `John works at I.B.M., doesn't he?", asked Alice. `, and `"Yes," replied Charlie.`. The full stop after `Dr` ends a sentence because it is followed by a space and then a capital letter. The question mark does not trigger a sentence break because of the comma that follows, even with the quote in between.

Without using a neural-based approach for sentence segmenting or applying some hacky, ad-hoc solution, there's not much that can be done about these.
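To make the two behaviours from the quoted passage concrete, here is a toy sketch using only the standard library `re` module. It is not the unicode-segment implementation and not a compliant UAX #29 segmenter: `toy_sentences` and both patterns are hypothetical names made up for illustration, the logic is ASCII-only, and it assumes a break can only happen after whitespace. It only approximates the "lowercase follows a full stop" and "comma follows the terminator" cases.

```python
import re

# Candidate boundary: a terminator, optional closing quotes/brackets, then
# whitespace. A terminator followed directly by a comma or a letter
# (as in `I.B.M.,` or `he?",`) never matches, so it never breaks.
_CANDIDATE = re.compile(r'([.?!])["\')\]]*\s+')

# After a full stop, skip non-letters; if the next letter is lowercase,
# treat the full stop as part of an abbreviation (no break).
_LOWER_AHEAD = re.compile(r'[^A-Za-z.?!\n]*[a-z]')


def toy_sentences(text: str) -> list[str]:
    """Split text into sentences using a rough, ASCII-only approximation."""
    sentences = []
    start = 0
    for m in _CANDIDATE.finditer(text):
        end = m.end()
        if m.group(1) == '.' and _LOWER_AHEAD.match(text, end):
            continue  # e.g. "etc. and so on": lowercase follows, keep going
        sentences.append(text[start:end])
        start = end
    if start < len(text):
        sentences.append(text[start:])  # trailing sentence without whitespace
    return sentences


example = '"Dr. John works at I.B.M., doesn\'t he?", asked Alice. "Yes," replied Charlie.'
for sentence in toy_sentences(example):
    print(repr(sentence))
```

On the example string this produces the same three segments as in the quote, including the incorrect break after `Dr.`, which shows why rules of this kind can't tell an abbreviation from a sentence end when a capital letter follows.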

@lionel-rowe changed the title from "Better sentence chunking algorithm to fix edge cases (" to "Better sentence chunking algorithm to fix edge cases ("etc.", etc.)" Nov 14, 2025
@lionel-rowe lionel-rowe marked this pull request as draft November 15, 2025 17:11
@lionel-rowe lionel-rowe marked this pull request as ready for review November 17, 2025 15:26
