restructure input_pipeline #3124
Merged
copybara-service[bot] merged 1 commit into main on Feb 14, 2026
🤖 Hi @aireenmei, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.
📋 Review Summary
This pull request is a large-scale refactoring of the input_pipeline module, and the changes look solid. The file moves, renames, and import updates are consistent and well-executed.
🔍 General Feedback
- The restructuring of the input_pipeline into its own maxtext subpackage is a great improvement for code organization.
- Renaming modules to remove the leading underscore (e.g., _hf_data_processing.py to hf_data_processing.py) improves clarity.
- The minor code cleanups, like combining imports, are also appreciated.
Overall, this is a good refactoring that improves the structure of the codebase.
hengtaoguo approved these changes Feb 12, 2026
NuojCheng approved these changes Feb 12, 2026
bvandermoon approved these changes Feb 13, 2026
bvandermoon (Collaborator) left a comment:
LGTM, thanks Aireen
Description
Retry the restructure from #3050, which was rolled back due to internal breakage.
Restructure the input pipeline folder as follows, under src/maxtext/input_pipeline:
-packing
-- prefill_packing.py
-- sequence_packing.py
-tokenizer.py
-multihost_dataloading.py
-distillation_data_processing.py (prev _distillation_data_processing.py)
-grain_data_processing.py (prev _grain_data_processing.py)
-grain_tokenizer.py (prev _grain_tokenizer.py)
-hf_data_processing.py (prev _hf_data_processing.py)
-input_pipeline_utils.py (prev _input_pipeline_utils.py)
-tfds_data_processing.py (prev _tfds_data_processing.py)
-tfds_data_processing_c4_mlperf.py (prev _tfds_data_processing_c4_mlperf.py)
-input_pipeline_interface.py
-synthetic_data_processing.py
-instruction_data_processing.py
Make corresponding changes to imports.
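For downstream code that still imports the old underscore-prefixed modules, the renames listed above can be applied mechanically. The following is a minimal sketch, not the tooling used in this PR: the helper name rewrite_module_names and the RENAMED_MODULES table are hypothetical, and the table is copied from the rename list in this description. It only rewrites module names; any change to the enclosing package path is out of scope and would need to be handled separately.

```python
import re

# Old -> new module names, copied from the rename list in the PR description.
# (Hypothetical table for illustration; not part of the PR itself.)
RENAMED_MODULES = {
    "_distillation_data_processing": "distillation_data_processing",
    "_grain_data_processing": "grain_data_processing",
    "_grain_tokenizer": "grain_tokenizer",
    "_hf_data_processing": "hf_data_processing",
    "_input_pipeline_utils": "input_pipeline_utils",
    "_tfds_data_processing": "tfds_data_processing",
    "_tfds_data_processing_c4_mlperf": "tfds_data_processing_c4_mlperf",
}

# Try longer names first so "_tfds_data_processing_c4_mlperf" is never
# partially matched as "_tfds_data_processing".
_ALTERNATION = "|".join(
    re.escape(name) for name in sorted(RENAMED_MODULES, key=len, reverse=True)
)

# (?<!\w) keeps identifiers such as "my_hf_data_processing" untouched;
# \b stops the match at the end of the module name.
_PATTERN = re.compile(r"(?<!\w)(" + _ALTERNATION + r")\b")


def rewrite_module_names(source: str) -> str:
    """Return `source` with old module names replaced by their new names."""
    return _PATTERN.sub(lambda m: RENAMED_MODULES[m.group(1)], source)
```

Calling rewrite_module_names on the text of a Python file returns the text with every old module name swapped for its new one; review the result before committing, since the substitution is purely textual and will also touch matching names in comments and strings.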
Tests
CI test
Checklist
Before submitting this PR, please make sure (put X in square brackets):
gemini-review label.