modify usfm for chapter-level drafting to avoid import issues; move remarks to chapters by mshannon-sil · Pull Request #285 · sillsdev/machine.py

mshannon-sil · 2026-03-26T21:46:59Z

This PR addresses issue #284.

Mostly looking for high-level feedback about the approach at the moment. As we were discussing, is the right place for this functionality in the get_usfm() method as essentially a post-processing step? Or should we look to implement this feature in process_tokens() (and maybe move the remark logic here as well)?

Some initial thoughts:
Pros for putting it in get_usfm():

The code is together in a cohesive unit making it potentially easier to maintain, rather than spread across process_token().
If it's just for the purposes of importing, then it can be thought of as a kind of "view" that Paratext needs to avoid import issues while the true model is kept unmodified in handler._tokens. This allows for the option to access the unmodified usfm if needed in the future.

Pros for putting it in process_token():

Faster execution time since it's all part of the same iteration
If thought of as an essential change to the usfm structure such that alternative views are unnecessary, it could make more structural sense to include it here.

This change is

…emarks to chapters

ddaspit

@ddaspit reviewed 2 files and all commit messages, and made 1 comment.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on Enkidu93 and mshannon-sil).

machine/corpora/update_usfm_parser_handler.py line 345 at r1 (raw file):

        tokens = list(self._tokens)
        if chapters is not None:
            tokens = self._get_incremental_draft_tokens(tokens, chapters)

I think we can do something similar, but before we parse instead of after. Instead of calling parse_usfm in update_usfm, we can do something like this:

tokenizer = UsfmTokenizer(self._settings.stylesheet)
tokens = tokenizer.tokenize(usfm)
tokens = filter_tokens_by_chapter(tokens, chapters)
parser = UsfmParser(tokens, handler, self._settings.stylesheet, self._settings.versification)
parser.process_tokens()

This would avoid updating the whole book.

mshannon-sil · 2026-04-15T21:09:50Z

machine/corpora/update_usfm_parser_handler.py line 345 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I think we can do something similar, but before we parse instead of after. Instead of calling parse_usfm in update_usfm, we can do something like this:
tokenizer = UsfmTokenizer(self._settings.stylesheet)
tokens = tokenizer.tokenize(usfm)
tokens = filter_tokens_by_chapter(tokens, chapters)
parser = UsfmParser(tokens, handler, self._settings.stylesheet, self._settings.versification)
parser.process_tokens()
This would avoid updating the whole book.

I updated it accordingly, how does it look now?

If we change parse_usfm to accept a Sequence[UsfmToken] as well as str, we could call it here and let it instantiate the parser and process the tokens to avoid some code duplication. But not sure if there's a reason parse_usfm only accepts str.

…g chapter remarks

Enkidu93

@Enkidu93 reviewed 4 files and all commit messages, and made 2 comments.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on ddaspit and mshannon-sil).

machine/corpora/paratext_project_text_updater_base.py line 97 at r2 (raw file):

            in_id_marker = False
        elif token.type == UsfmTokenType.CHAPTER:
            if token.data and int(token.data) in chapters:

I think this may be safe now after some recent changes, but you may want to double check what happens with bad chapter references like \c 1. if you haven't already/isn't already covered by tests.

machine/corpora/update_usfm_parser_handler.py line 348 at r2 (raw file):

                remark_tokens.append(UsfmToken(UsfmTokenType.TEXT, text=remark))
            if len(tokens) > 0:
                for index, token in enumerate(tokens):

Don't we want to preserve the ability to add book-level remarks? Am I reading this correctly that we'd no longer be able to do so with this change? Peter has also put in a PR related to the per-chapter remarks; are you guys coordinating on this sillsdev/machine#408? I think if we did something like he has there, we would have a bit more flexibility.

mshannon-sil · 2026-04-17T16:11:38Z

machine/corpora/update_usfm_parser_handler.py line 348 at r2 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

Don't we want to preserve the ability to add book-level remarks? Am I reading this correctly that we'd no longer be able to do so with this change? Peter has also put in a PR related to the per-chapter remarks; are you guys coordinating on this sillsdev/machine#408? I think if we did something like he has there, we would have a bit more flexibility.

I was unaware that Peter was making any changes. As far as I'm aware we want it working in SILNLP first (through machine.py) before changes to machine.

What I remember from the NT meeting where we discussed remarks was that the team wanted to move all the remarks to the chapter level once we realized that \rem is valid in chapters, so we would no longer have book remarks. I can throw the question to the wider group to see what we think.

ddaspit

@ddaspit reviewed 4 files and all commit messages, and made 3 comments.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on Enkidu93 and mshannon-sil).

machine/corpora/update_usfm_parser_handler.py line 345 at r1 (raw file):

Previously, mshannon-sil wrote…

I updated it accordingly, how does it look now?

If we change parse_usfm to accept a Sequence[UsfmToken] as well as str, we could call it here and let it instantiate the parser and process the tokens to avoid some code duplication. But not sure if there's a reason parse_usfm only accepts str.

I think it makes sense to update parse_usfm to accept tokens as well.

machine/corpora/update_usfm_parser_handler.py line 348 at r2 (raw file):

Previously, mshannon-sil wrote…

I was unaware that Peter was making any changes. As far as I'm aware we want it working in SILNLP first (through machine.py) before changes to machine.

What I remember from the NT meeting where we discussed remarks was that the team wanted to move all the remarks to the chapter level once we realized that \rem is valid in chapters, so we would no longer have book remarks. I can throw the question to the wider group to see what we think.

The philosophy for implementing features in the Machine library is a bit different than our other applications. For applications, we only build what we need. Because Machine is a shared library, we try to keep any features we add generally useful (within reason). In this case, I think it makes sense to include the ability to add book-level remarks even if we aren't planning on using that specific feature right now.

machine/corpora/__init__.py line 30 at r2 (raw file):

from .paratext_project_settings_parser_base import ParatextProjectSettingsParserBase
from .paratext_project_terms_parser_base import KeyTerm, ParatextProjectTermsParserBase
from .paratext_project_text_updater_base import ParatextProjectTextUpdaterBase, filter_tokens_by_chapter

This seems unnecessary.

pmachapman · 2026-04-19T20:10:34Z

What I remember from the NT meeting where we discussed remarks was that the team wanted to move all the remarks to the chapter level once we realized that \rem is valid in chapters, so we would no longer have book remarks. I can throw the question to the wider group to see what we think.

Sorry for jumping the gun a bit in C# land - I wanted to get something working so I could see how it might work in Serval/SF, and figure out any problems and pitfalls at this early stage.

I think book level remarks are still necessary, as we don't want to change how full book drafting is working. It could be a bit counterintuitive for users of full book drafting to suddenly have per-chapter remarks for a when they make their next draft of say Acts or Genesis, and then they need to track down and remove manually the remarks when they begin the process of editing the draft.

Also, I think we might need different remarks per chapter? Your implementation looks like it has the same remark for each draft. The only reason I thought this is that our current book level remarks in Serval:

\id MRK
\rem This draft of MRK was generated using AI on 2025-12-10 00:21:48Z. It should be reviewed and edited carefully.

Might change to per-chapter:

\c 3
\rem This draft of MRK 3 was generated using AI on 2025-12-10 00:21:48Z. It should be reviewed and edited carefully.

I think having the chapter number in the remark helps keep explicit what was drafted. Without some form of identifier in the remark, there is a danger it perhaps starts floating into another chapter if copy-pasted, or if a merge of incoming changes in PT goes haywire. I also think saying just "This draft was generated" or "The draft of this chapter was generated" is not as clear as stating exactly what chapter is drafted, and removes any ambiguity.

Apologies if I misunderstood things, or if I am wrong - I am happy which ever way we go RE: the points above - I just wanted to explain my thoughts, so you can correct me if I am barking up the wrong tree.

Enkidu93 · 2026-04-20T09:37:38Z

I put a comment in the other PR, but I think we need Matthew's updates to support partial-book USFM but I also think we should do something more like what Peter has to preserve the per-book remark functionality as well as have more fine-grained control of the per-chapter remarks. I think you guys should decide where to put the changes first (machine or machine.py) and then port.

mshannon-sil · 2026-04-27T19:19:09Z

I've added Peter's changes for the remarks to this PR and updated test cases to reflect this. The only change I made to the approach was to add "rem" for markers_to_skip at the chapter level as well, so that we always put the remark after any existing remarks just like we do at the book level.

codecov-commenter · 2026-04-27T22:02:05Z

Codecov Report

❌ Patch coverage is 98.80952% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 91.52%. Comparing base (a9b049e) to head (cf0ea05).

Files with missing lines	Patch %	Lines
machine/corpora/update_usfm_parser_handler.py	95.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #285      +/-   ##
==========================================
+ Coverage   91.50%   91.52%   +0.02%     
==========================================
  Files         355      355              
  Lines       22861    22927      +66     
==========================================
+ Hits        20920    20985      +65     
- Misses       1941     1942       +1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

mshannon-sil · 2026-04-27T22:42:31Z

machine/corpora/paratext_project_text_updater_base.py line 97 at r2 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

I think this may be safe now after some recent changes, but you may want to double check what happens with bad chapter references like \c 1. if you haven't already/isn't already covered by tests.

I updated it to use the parse_integer method to handle it more safely, and I added a test case to cover this. As is, the code now just skips over chapters with malformed chapter numbers and does not include them in the output, which you can see in the test_filter_chapters_with_bad_chapter_reference test case. Is this the best approach in your opinion, or do you want it to be able to select even chapters with malformed chapter numbers?

mshannon-sil · 2026-04-27T22:48:19Z

machine/corpora/update_usfm_parser_handler.py line 345 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I think it makes sense to update parse_usfm to accept tokens as well.

Done.

mshannon-sil · 2026-04-27T22:52:55Z

machine/corpora/__init__.py line 30 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This seems unnecessary.

The reason it's here is so I can call it in test_update_usfm_parser_handler.py since the test file reimplements update_usfm so that it's set up more conveniently for testing. Should we change either the test file or implementation to avoid needing to import filter_tokens_by_chapter?

mshannon-sil · 2026-04-27T22:54:18Z

machine/corpora/update_usfm_parser_handler.py line 348 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

The philosophy for implementing features in the Machine library is a bit different than our other applications. For applications, we only build what we need. Because Machine is a shared library, we try to keep any features we add generally useful (within reason). In this case, I think it makes sense to include the ability to add book-level remarks even if we aren't planning on using that specific feature right now.

Done.

Enkidu93

This is no longer a draft PR, correct? Or is there a reason we wouldn't want to merge this soon?

@Enkidu93 reviewed 5 files and all commit messages, made 3 comments, and resolved 1 discussion.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on ddaspit and mshannon-sil).

machine/corpora/__init__.py line 30 at r2 (raw file):

Previously, mshannon-sil wrote…

The reason it's here is so I can call it in test_update_usfm_parser_handler.py since the test file reimplements update_usfm so that it's set up more conveniently for testing. Should we change either the test file or implementation to avoid needing to import filter_tokens_by_chapter?

I don't think it has to happen now, but we should consider making a MemoryParatextProjectTextUpdater like we have with the other Zip.../File... Paratext classes. We could then refactor these tests to all use the memory updater and avoid the custom logic in the testupdate_usfm function as well as avoid importing filter_tokens_by_chapter. If we don't want to refactor all the tests, we could in the meantime just replace the logic in the else branch of update_usfm - i.e.:

    else:
        source = source.strip().replace("\r\n", "\n") + "\r\n"
        updater = UpdateUsfmParserHandler(
        ...

with a call toMemoryParatextProjectTextUpdater.update_usfm() which would also avoid exposing filter_tokens_by_chapter. I leave it to Damien whether a change like this should be done in this PR or separately later.

machine/corpora/paratext_project_text_updater_base.py line 97 at r2 (raw file):

Previously, mshannon-sil wrote…

I updated it to use the parse_integer method to handle it more safely, and I added a test case to cover this. As is, the code now just skips over chapters with malformed chapter numbers and does not include them in the output, which you can see in the test_filter_chapters_with_bad_chapter_reference test case. Is this the best approach in your opinion, or do you want it to be able to select even chapters with malformed chapter numbers?

Great! I think skipping it is good for now. If, down the road, we add logic to handle these chapters more robustly elsewhere, we can add support here as well.

Enkidu93

@Enkidu93 resolved 1 discussion.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on ddaspit).

mshannon-sil · 2026-04-28T13:28:52Z

Just removed the draft designation!

ddaspit

@ddaspit reviewed 5 files and all commit messages, and made 1 comment.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on Enkidu93 and mshannon-sil).

machine/corpora/__init__.py line 30 at r2 (raw file):

Previously, Enkidu93 (Eli C. Lowry) wrote…

I don't think it has to happen now, but we should consider making a MemoryParatextProjectTextUpdater like we have with the other Zip.../File... Paratext classes. We could then refactor these tests to all use the memory updater and avoid the custom logic in the testupdate_usfm function as well as avoid importing filter_tokens_by_chapter. If we don't want to refactor all the tests, we could in the meantime just replace the logic in the else branch of update_usfm - i.e.:
    else:
        source = source.strip().replace("\r\n", "\n") + "\r\n"
        updater = UpdateUsfmParserHandler(
        ...
with a call toMemoryParatextProjectTextUpdater.update_usfm() which would also avoid exposing filter_tokens_by_chapter. I leave it to Damien whether a change like this should be done in this PR or separately later.

I see why you are exporting it. I agree with Eli. If we want to test the chapter filtering, then we should test the ParatextProjectTextUpdaterBase class. The best way to do that is to create a memory-based implementation.

mshannon-sil · 2026-04-28T18:59:27Z

machine/corpora/__init__.py line 30 at r2 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I see why you are exporting it. I agree with Eli. If we want to test the chapter filtering, then we should test the ParatextProjectTextUpdaterBase class. The best way to do that is to create a memory-based implementation.

Sounds good, I'll work on adding that!

modify usfm for chapter-level drafting to avoid import issues; move r…

d8ca02d

…emarks to chapters

mshannon-sil requested review from Enkidu93 and ddaspit March 26, 2026 21:47

ddaspit reviewed Mar 30, 2026

View reviewed changes

mshannon-sil added 2 commits April 15, 2026 14:02

move filtering before token processing

aef5d5d

add test case for chapter filtering

e423708

mshannon-sil added 3 commits April 16, 2026 10:02

make sure all text in \id is included

1e2e999

update remark test and ensure remarks are added at the end of existin…

707119c

…g chapter remarks

add test case for including chapter 1 and header information

e1865ea

Enkidu93 mentioned this pull request Apr 17, 2026

Add support for per-chapter remarks sillsdev/machine#408

Open

Enkidu93 reviewed Apr 17, 2026

View reviewed changes

ddaspit reviewed Apr 17, 2026

View reviewed changes

mshannon-sil added 2 commits April 27, 2026 15:13

support both book-level and chapter-level remarks

4a8fb50

add test case for multiple remarks for the same chapter

74dcf8f

fix init.py

ef9041f

mshannon-sil added 2 commits April 27, 2026 18:12

cover edge case of chapter as last marker

d9ecd2a

handle malformed chapter numbers

e39f94c

use parse_usfm; fix pass_remark test

9cfd578

Enkidu93 reviewed Apr 28, 2026

View reviewed changes

Merge branch 'main' into incremental_draft

26a111d

mshannon-sil marked this pull request as ready for review April 28, 2026 13:27

ddaspit requested changes Apr 28, 2026

View reviewed changes

mshannon-sil added 2 commits April 28, 2026 16:54

don't filter if chapters is empty list

911df51

revert change

cf0ea05

mshannon-sil mentioned this pull request Apr 29, 2026

add support for chapter level usfm filtering and remarks sillsdev/silnlp#1021

Open

Uh oh!

Conversation

mshannon-sil commented Mar 26, 2026 • edited by ddaspit Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ddaspit left a comment

Choose a reason for hiding this comment

Uh oh!

mshannon-sil commented Apr 15, 2026

Uh oh!

Enkidu93 left a comment

Choose a reason for hiding this comment

Uh oh!

mshannon-sil commented Apr 17, 2026

Uh oh!

ddaspit left a comment

Choose a reason for hiding this comment

Uh oh!

pmachapman commented Apr 19, 2026

Uh oh!

Enkidu93 commented Apr 20, 2026

Uh oh!

mshannon-sil commented Apr 27, 2026

Uh oh!

codecov-commenter commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

mshannon-sil commented Apr 27, 2026

Uh oh!

mshannon-sil commented Apr 27, 2026

Uh oh!

mshannon-sil commented Apr 27, 2026

Uh oh!

mshannon-sil commented Apr 27, 2026

Uh oh!

Enkidu93 left a comment

Choose a reason for hiding this comment

Uh oh!

Enkidu93 left a comment

Choose a reason for hiding this comment

Uh oh!

mshannon-sil commented Apr 28, 2026

Uh oh!

ddaspit left a comment

Choose a reason for hiding this comment

Uh oh!

mshannon-sil commented Apr 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

mshannon-sil commented Mar 26, 2026 •

edited by ddaspit

Loading

codecov-commenter commented Apr 27, 2026 •

edited

Loading