feat(object-storage): auto-enable skip_listing for all storage types; adaptive validation interval #483
Merged
FileSystemGuy merged 1 commit on Jun 19, 2026
- mlpstorage_py/benchmarks/dlio.py:
  - New _compute_validation_interval(num_files) static method: scales validation
    sampling geometrically with dataset size so startup HEAD-check time is
    bounded regardless of scale (exhaustive for <10K files; every 10,000th file
    for 10M+ files; ~1,000 checks at any scale).
  - New _apply_skip_listing_params() method: injects dataset.skip_listing=True
    and dataset.listing_validation_interval=<adaptive> for ALL training runs,
    both --file and --object. Each MPI rank reconstructs its own shard
    deterministically, so no process ever holds the full file list in memory.
    Eliminates flat-file OOM for large local-filesystem datasets as well as
    the 12+ hour S3 listing problem (issue mlcommons#472). Respects --params overrides.
  - _apply_object_storage_params(): removed skip_listing injection (now handled
    by _apply_skip_listing_params). Object-storage-specific S3 credential and
    endpoint setup unchanged.
- docs/OBJECT_STORAGE_GUIDE.md: document skip_listing, adaptive validation
  interval, override examples, scoring clarification.
Relates-to: mlcommons/DLIO_local_changes#27
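A minimal sketch of how an adaptive interval of this kind could be computed. Only the two endpoints come from the commit message (exhaustive below 10K files, every 10,000th file at 10M+ files). The rule used for sizes in between, and the function name `compute_validation_interval`, are assumptions for illustration and not the code in `dlio.py`.

```python
def compute_validation_interval(num_files: int) -> int:
    """Return N such that every Nth file is HEAD-checked at startup.

    Hypothetical stand-in for _compute_validation_interval(); the scaling
    between the two documented endpoints is an assumption.
    """
    if num_files < 10_000:
        return 1  # small datasets: check every file (exhaustive)
    # Assumed rule: aim for roughly 1,000 checks, capped at every 10,000th file.
    return min(10_000, num_files // 1_000)


for n in (5_000, 100_000, 10_000_000, 50_000_000):
    interval = compute_validation_interval(n)
    print(f"{n:>11,} files -> interval {interval:>6,}, ~{n // interval:,} checks")
```

Under this assumed rule the number of checks stays near 1,000 between 10K and 10M files and grows again only past the 10,000 cap.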
|
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
Contributor
|
@russfellows I don't see skip_listing being set to true anywhere in DLIO_local_changes:PR#27 |
Contributor
Author
|
Correct. It is set to true in mlp-storage now. I believe the default is
true; did I miss something?
|
Contributor
Author
|
Curtis:
—
The chain has four steps. Here’s exactly where it flows:
**Step 1 — mlp-storage injects it into `params_dict`**
mlpstorage_py/benchmarks/dlio.py — `_apply_skip_listing_params()`:
```python
if 'dataset.skip_listing' not in self.params_dict:
    self.params_dict['dataset.skip_listing'] = 'True'
```
Called unconditionally from `TrainingBenchmark.__init__` for every training run.
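As a standalone illustration of the "inject unless the user already set it" behavior in Step 1, here is a minimal sketch. The free function `apply_skip_listing_params`, its `validation_interval` argument, and storing the interval as a string are assumptions made for the example; the real method lives on the benchmark class and may differ.

```python
def apply_skip_listing_params(params_dict: dict, validation_interval: int) -> dict:
    """Inject skip_listing defaults without clobbering user --params overrides.

    Hypothetical free-function version of _apply_skip_listing_params().
    """
    # setdefault() writes a key only when the user has not already supplied it.
    params_dict.setdefault("dataset.skip_listing", "True")
    params_dict.setdefault(
        "dataset.listing_validation_interval", str(validation_interval)
    )
    return params_dict


# No user overrides: both defaults are injected.
print(apply_skip_listing_params({}, 100))

# User passed dataset.skip_listing=False via --params: their value is kept.
print(apply_skip_listing_params({"dataset.skip_listing": "False"}, 100))
```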
**Step 2 — `generate_dlio_command()` turns every `params_dict` entry into a Hydra override**
```python
for key, value in self.params_dict.items():
    cmd += f" ++workload.{key}={value}"
```
This produces `++workload.dataset.skip_listing=True` on the DLIO command line.
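The loop in Step 2 can be tried in isolation with the sketch below. The helper name `build_overrides` and the bare `dlio_benchmark` base command are placeholders for illustration; the real `generate_dlio_command()` assembles a longer command line.

```python
def build_overrides(params_dict: dict) -> str:
    """Turn each params_dict entry into a '++workload.<key>=<value>' override."""
    return "".join(
        f" ++workload.{key}={value}" for key, value in params_dict.items()
    )


params = {
    "dataset.skip_listing": "True",
    "dataset.listing_validation_interval": "10",
}
base_cmd = "dlio_benchmark"  # placeholder; the real command has more arguments
print(base_cmd + build_overrides(params))
```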
**Step 3 — DLIO's Hydra config system receives it**
`ConfigArguments` in `dlio_benchmark/utils/config.py` declares:
```python
skip_listing: bool = False
```
The `++workload.dataset.skip_listing=True` override sets it to `True` at startup.
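To make the effect of Step 3 concrete, the sketch below applies one `++` override to a plain nested dict. It is a simplified stand-in written for this explanation, not Hydra's implementation: it handles only dotted keys and `True`/`False` values, and the `apply_override` helper is hypothetical.

```python
def apply_override(config: dict, override: str) -> None:
    """Apply one '++a.b.c=value' override in place, adding missing keys."""
    dotted_key, _, raw = override.removeprefix("++").partition("=")
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        # '++' means "override, or add if absent", so create missing levels.
        node = node.setdefault(name, {})
    # Map boolean literals to bool; leave anything else as a string.
    node[leaf] = {"true": True, "false": False}.get(raw.lower(), raw)


config = {"workload": {"dataset": {"skip_listing": False}}}  # declared default
apply_override(config, "++workload.dataset.skip_listing=True")
print(config["workload"]["dataset"]["skip_listing"])
```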
**Step 4 — main.py branches on it**
```python
if self.args.skip_listing:
    for idx in range(self.my_rank, num_files_expected, self.comm_size):
        ...  # deterministic URI generation, no storage API calls
```
So the reviewer should look at `_apply_skip_listing_params()` in the mlp-storage dlio.py as the origin point — it's the mlp-storage layer that enables it for all runs, which then flows through as a Hydra override into the DLIO process.
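The strided loop in Step 4 is what keeps each rank's list at roughly N/comm_size entries. Here is a minimal sketch of that sharding, runnable without MPI. The file-name pattern and the `rank_shard` helper are hypothetical; DLIO's actual naming convention is not shown in this thread.

```python
def rank_shard(rank: int, comm_size: int, num_files: int) -> list[str]:
    """Build this rank's file names from indices alone (no listing calls)."""
    return [
        f"img_{idx:08d}_of_{num_files}.npz"  # hypothetical naming pattern
        for idx in range(rank, num_files, comm_size)
    ]


comm_size, num_files = 4, 10
shards = [rank_shard(rank, comm_size, num_files) for rank in range(comm_size)]
for rank, shard in enumerate(shards):
    print(f"rank {rank}: {len(shard)} files, first = {shard[0]}")
```

Because every rank derives its shard from `(rank, comm_size, num_files)` alone, the shards are disjoint and together cover every index without any rank seeing the full list.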
—
Russ
|
FileSystemGuy
approved these changes
Jun 19, 2026
Contributor
|
@russfellows Now you're going to bump up the "pin" of the last commit for DLIO_local_changes and bump the version number in the pyproject.toml file and then regenerate the uv.lock file? |
Contributor
Author
|
Yes, I'll make that fix directly. --Russ
|
Contributor
Author
|
Curtis, or whoever has time:
I did exactly this: updated pyproject.toml to point to the latest DLIO_local_changes hash, and also updated uv.lock.
See #484 (chore: bump dlio-benchmark pin to merged PR #27 SHA; version 3.0.15).
Please review, approve, and merge if the tests pass.
--Russ
|
Summary
When running benchmarks with millions of files or objects, a process can run out of memory because it lists the entire dataset and holds that list in memory. This becomes especially problematic once the number of files or objects exceeds 1M.
See issue #466.
Additionally, with S3-compatible object storage and large datasets (millions of files), DLIO must discover dataset files before training starts. The default mechanism, paginated `list_objects_v2`, can take 12+ hours for a 50M-file dataset. The same problem exists for flat local-filesystem layouts, where `os.listdir()` + `sorted()` materializes the entire file list on rank 0.
This PR makes mlp-storage automatically enable DLIO's `skip_listing` feature for all training benchmark runs (`--file` and `--object`), reducing cold-start time from hours to seconds and eliminating the per-rank OOM risk for large datasets.

Background: what is `skip_listing`?
`skip_listing` is implemented in mlcommons/DLIO_local_changes PR #27. When enabled, each MPI rank independently reconstructs its own file-URI shard from DLIO's standard naming convention: zero storage API calls, zero MPI communication, and no process ever holds the full file list in memory (each rank holds only O(N/comm_size) entries).

Does this affect benchmark scores?
No. File discovery runs inside `DLIOBenchmark.initialize()`, which completes before the scored benchmark window opens at `stats.start_run()`. AU and throughput scores are identical whether `skip_listing` is True or False.

Changes
`mlpstorage_py/benchmarks/dlio.py`
Two new methods on `DLIOBenchmark`:
- `_compute_validation_interval(num_files)`: scales HEAD-check sampling geometrically with dataset size so startup time is bounded at any scale.
- `_apply_skip_listing_params()`: injects `dataset.skip_listing=True` and `dataset.listing_validation_interval=<adaptive>` for all training runs. Called from `TrainingBenchmark.__init__` after `_apply_object_storage_params()`. Both params respect `--params` overrides.
`_apply_object_storage_params()` no longer injects `skip_listing` (now handled by the new method for all storage types).
`docs/OBJECT_STORAGE_GUIDE.md`
Documents `skip_listing`, the adaptive validation interval, override examples, and scoring clarification.

Override examples

Depends on
mlcommons/DLIO_local_changes#27: implements `skip_listing` in DLIO (required; pyproject.toml will be updated to pin the merged SHA once that PR lands).