Better sentence chunking algorithm to fix edge cases ("etc.", etc.) #415
+38
−19
Fixes #308.
In the end I didn't use ICU directly, since ICU support in Python can be a bit finicky as a dependency and can't be added with a simple pip install. Instead, I re-implemented the Unicode sentence segmentation algorithm as a pure Python module, unicode-segment. It's a fully compliant implementation (passes all the tests in the TR29 test suite) and, after some optimization, decently performant for inputs of a reasonable size (typically sub-ms per 1k chars, YMMV depending on hardware etc.). It's also a deterministic algorithm, which means there are still certain corner cases it will get wrong:
Without using a neural-based approach for sentence segmenting or applying some hacky, ad-hoc solution, there's not much that can be done about these.
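To make the limitation concrete, here is a minimal sketch of a rule-based splitter. It is not the unicode-segment module and not a compliant TR29 implementation; `naive_sentences` and its regex are hypothetical names made up for illustration, loosely modeled on the idea that a period followed by whitespace ends a sentence unless the next character is lowercase. The "Mr." input is an assumed example of the kind of abbreviation case a purely deterministic rule cannot resolve.

```python
import re

# Hypothetical, heavily simplified stand-in for rule-based sentence breaking.
# Matches one terminator, optional closing quotes/brackets, then whitespace.
_TERMINATOR = re.compile(r"([.!?])[\"')\]]*\s+")


def naive_sentences(text: str) -> list[str]:
    """Split text into sentences; segments keep trailing whitespace."""
    sentences = []
    start = 0
    for match in _TERMINATOR.finditer(text):
        end = match.end()
        if end >= len(text):
            break  # nothing follows the terminator; handled by the tail below
        # A period followed by a lowercase letter is treated as an
        # abbreviation ("etc. and so on"), so no break is made there.
        if match.group(1) == "." and text[end].islower():
            continue
        sentences.append(text[start:end])
        start = end
    if start < len(text):
        sentences.append(text[start:])
    return sentences


text = "I like apples, pears, etc. and so on. Mr. Smith agrees."
parts = naive_sentences(text)
print(parts)
```

In this sketch, "etc. and" is kept together because the next word is lowercase, but "Mr. Smith" is split after "Mr." because nothing in the rule distinguishes an abbreviation from a real sentence end when a capitalized word follows. Fixing that would need an abbreviation list or a learned model, which is the trade-off described above.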