Better sentence chunking algorithm to fix edge cases ("etc.", etc.) #415

lionel-rowe · 2025-11-14T15:31:04Z

Fixes #308.

In the end I didn't use ICU directly, since ICU support in Python can be a bit finicky as a dependency and can't be added with a simple pip install.

Instead, I re-implemented the Unicode sentence segmentation algorithm as a pure Python module, unicode-segment. It's a fully compliant implementation (passes all the tests in the TR29 test suite) ~~but kinda slow~~ and, after some optimization, decently performant for inputs of a reasonable size (typically sub-ms per 1k chars, YMMV depending on hardware, etc.).

It's also a deterministic algorithm, which means there are still certain corner cases it will get wrong:

> UAX #29’s sentence boundary rules are a lot smarter than just treating every full stop as the end of a sentence. But they’re not perfect. In the string `"Dr. John works at I.B.M., doesn't he?", asked Alice. "Yes," replied Charlie.`, the regex `\b{sb}.+?\b{sb}` finds 3 matches: `"Dr. `, `John works at I.B.M., doesn't he?", asked Alice. `, and `"Yes," replied Charlie.`. A full stop ends a sentence if it is followed by a capital letter. The question mark does not trigger a sentence break because of the comma that follows, even with the quote in between.

Without using a neural-based approach for sentence segmenting or applying some hacky, ad-hoc solution, there's not much that can be done about these.
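For illustration, here is a rough sketch of the kind of rule that handles the "etc." case, and why "Dr." still trips it up. This is not the unicode-segment API and nowhere near a compliant TR29 implementation: `naive_split` and `simplified_split` are hypothetical helpers, and the lowercase check is an ASCII-only approximation loosely modelled on rule SB8 (no break after a full stop when the following word starts with a lowercase letter).

```python
import re


def naive_split(text: str) -> list[str]:
    """Break after every '.', '!' or '?' that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def simplified_split(text: str) -> list[str]:
    """Hypothetical, simplified rule loosely modelled on SB8.

    '!' and '?' always end a sentence. A full stop ends one only if the
    next non-space character is NOT an ASCII lowercase letter.
    """
    # The `\s` inside the lookahead stops the regex from backtracking to a
    # shorter run of whitespace and matching anyway.
    pattern = r"(?<=[!?])\s+|(?<=\.)\s+(?![a-z\s])"
    return [s for s in re.split(pattern, text) if s]


text = "We sell apples, pears, etc. in bulk. Call us."
print(naive_split(text))       # breaks after "etc."
print(simplified_split(text))  # keeps "etc. in bulk." together

# Still wrong, as in the quoted example: a full stop followed by a
# capital letter is treated as a sentence end.
print(simplified_split("Dr. John works here."))
```

The last call shows the same limitation described above: any purely rule-based check of this kind splits after "Dr." because the next word is capitalized.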

Use Unicode TR29 algorithm for chunking into sentences

f4f2af9

lionel-rowe changed the title ~~Better sentence chunking algorithm to fix edge cases (~~ Better sentence chunking algorithm to fix edge cases ("etc.", etc.) Nov 14, 2025

lionel-rowe marked this pull request as draft November 15, 2025 17:11

Update to [email protected]

9784b10

lionel-rowe marked this pull request as ready for review November 17, 2025 15:26
