feat(object-storage): auto-enable skip_listing for all storage types; adaptive validation interval by russfellows · Pull Request #483 · mlcommons/storage

russfellows · 2026-06-19T23:24:09Z

Summary

Benchmarks with millions of files or objects can run out of memory because the entire dataset listing is built and held in memory. The problem becomes especially acute once the number of files or objects exceeds 1M.

See issue #466

Additionally, for S3-compatible object storage with large datasets (millions of files), DLIO must discover the dataset files before training starts. The default mechanism, paginated list_objects_v2, can take 12+ hours for a 50M-file dataset. The same problem exists for flat local-filesystem layouts, where os.listdir() + sorted() materializes the entire file list on rank 0.
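As a back-of-envelope check on the listing cost, the sketch below assumes strictly sequential pagination at 1,000 keys per page (the S3 ListObjectsV2 maximum per request). The helper name `listing_pages` is hypothetical and exists only for this illustration; the 12-hour and 50M figures are the ones quoted above.

```python
import math


def listing_pages(num_objects: int, page_size: int = 1000) -> int:
    """Number of sequential list requests needed to enumerate num_objects keys."""
    return math.ceil(num_objects / page_size)


# Figures quoted in the PR description: 50M files, 12+ hours to list.
pages = listing_pages(50_000_000)
seconds_per_page = (12 * 3600) / pages

print(f"{pages} pages, about {seconds_per_page:.3f} s per page on average")
```

Under these assumptions the 12-hour figure corresponds to somewhat under one second per page, so the cost comes from the sheer number of round trips rather than from any single slow request.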

This PR makes mlp-storage automatically enable DLIO's skip_listing feature for all training benchmark runs (--file and --object), reducing cold-start time from hours to seconds and eliminating the per-rank OOM risk for large datasets.

Background: what is `skip_listing`?

skip_listing is implemented in mlcommons/DLIO_local_changes PR #27. When enabled, each MPI rank independently reconstructs its own file-URI shard from DLIO's standard naming convention — zero storage API calls, zero MPI communication, and no process ever holds the full file list in memory (each rank holds O(N/comm_size) entries only).
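A minimal sketch of the per-rank reconstruction, for illustration only. The strided index split mirrors the `range(my_rank, num_files, comm_size)` loop used in DLIO; the function name `rank_shard`, the bucket URI, and the `img_<index>_of_<total>.npz` file-name pattern are assumptions made for this example and may not match DLIO's actual naming convention.

```python
def rank_shard(rank: int, comm_size: int, num_files: int,
               base_uri: str = "s3://example-bucket/train",
               prefix: str = "img", ext: str = "npz") -> list[str]:
    """Return only the URIs owned by one rank, using a strided index split.

    No storage call and no communication is needed: every rank can compute
    its own slice from (rank, comm_size, num_files) alone.
    """
    width = len(str(num_files))  # zero-pad indices to a common width
    return [
        # Hypothetical naming pattern, 1-based index.
        f"{base_uri}/{prefix}_{idx + 1:0{width}d}_of_{num_files}.{ext}"
        for idx in range(rank, num_files, comm_size)
    ]


# Four ranks splitting a 10-file dataset.
shards = [rank_shard(r, comm_size=4, num_files=10) for r in range(4)]
for r, shard in enumerate(shards):
    print(r, len(shard), shard[0])
```

Each rank holds roughly N/comm_size entries, and the union of all shards covers every index exactly once.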

Does this affect benchmark scores?

No. File discovery runs inside DLIOBenchmark.initialize(), which completes before the scored benchmark window opens at stats.start_run(). AU and throughput scores are identical whether skip_listing is True or False.

Changes

`mlpstorage_py/benchmarks/dlio.py`

Two new methods on DLIOBenchmark:

_compute_validation_interval(num_files) — scales HEAD-check sampling geometrically with dataset size so startup time is bounded at any scale:

| Train files | Interval | ~HEAD checks |
|---|---|---|
| < 10,000 | 1 | every file |
| 10,000 – 99,999 | 10 | ~1,000 |
| 100,000 – 999,999 | 100 | ~1,000 |
| 1,000,000 – 9,999,999 | 1,000 | ~1,000 |
| ≥ 10,000,000 | 10,000 | ~1,000–5,000 |
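The banding in the table can be sketched as a plain threshold function. This is an illustration reconstructed from the table, not the code in the PR: `compute_validation_interval` and `approx_head_checks` are stand-in names, and the real `_compute_validation_interval` may be written differently.

```python
import math


def compute_validation_interval(num_files: int) -> int:
    """Map a training-file count to a sampling interval, per the table above."""
    if num_files < 10_000:
        return 1          # exhaustive: check every file
    if num_files < 100_000:
        return 10
    if num_files < 1_000_000:
        return 100
    if num_files < 10_000_000:
        return 1_000
    return 10_000


def approx_head_checks(num_files: int) -> int:
    """Approximate number of sampled files: one per interval."""
    return math.ceil(num_files / compute_validation_interval(num_files))


for n in (5_000, 10_000, 500_000, 1_000_000, 50_000_000):
    print(n, compute_validation_interval(n), approx_head_checks(n))
```

In this sketch the sampled count is roughly num_files / interval, so it sits at about 1,000 at the bottom of each band and grows toward the top of the band before the next interval takes over.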

_apply_skip_listing_params() — injects dataset.skip_listing=True and dataset.listing_validation_interval=<adaptive> for all training runs. Called from TrainingBenchmark.__init__ after _apply_object_storage_params(). Both params respect --params overrides.

_apply_object_storage_params() no longer injects skip_listing (now handled by the new method for all storage types).

`docs/OBJECT_STORAGE_GUIDE.md`

Documents skip_listing, the adaptive validation interval, override examples, and scoring clarification.

Override examples

# Default: skip_listing=True, adaptive interval
uv run mlpstorage training run --object ...

# Externally generated dataset (non-standard naming):
uv run mlpstorage training run --object ... --params dataset.skip_listing=False

# Disable validation entirely (fastest startup, no safety net):
uv run mlpstorage training run --object ... --params dataset.listing_validation_interval=0

Depends on

mlcommons/DLIO_local_changes#27 — implements skip_listing in DLIO (required; pyproject.toml will be updated to pin the merged SHA once that PR lands).

… adaptive validation interval

- mlpstorage_py/benchmarks/dlio.py:
  - New _compute_validation_interval(num_files) static method: scales validation sampling geometrically with dataset size so startup HEAD-check time is bounded regardless of scale (exhaustive for <10K files; every 10,000th file for 10M+ files; ~1,000 checks at any scale).
  - New _apply_skip_listing_params() method: injects dataset.skip_listing=True and dataset.listing_validation_interval=<adaptive> for ALL training runs, both --file and --object. Each MPI rank reconstructs its own shard deterministically, so no process ever holds the full file list in memory. Eliminates flat-file OOM for large local-filesystem datasets as well as the 12+ hour S3 listing problem (issue mlcommons#472). Respects --params overrides.
  - _apply_object_storage_params(): removed skip_listing injection (now handled by _apply_skip_listing_params). Object-storage-specific S3 credential and endpoint setup unchanged.
- docs/OBJECT_STORAGE_GUIDE.md: document skip_listing, adaptive validation interval, override examples, scoring clarification.

Relates-to: mlcommons/DLIO_local_changes#27

github-actions · 2026-06-19T23:24:19Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

FileSystemGuy · 2026-06-19T23:30:22Z

@russfellows I don't see skip_listing being set to true anywhere in DLIO_local_changes:PR#27

russfellows · 2026-06-19T23:35:10Z

Correct, it is set to true in mlp-storage now. I believe the default is true; did I miss something?


-- Thanks, --Russ

russfellows · 2026-06-19T23:38:00Z

Curtis: the chain has four steps. Here’s exactly where it flows:

**Step 1: mlp-storage injects it into `params_dict`**

mlpstorage_py/benchmarks/dlio.py, `_apply_skip_listing_params()`:

```python
if 'dataset.skip_listing' not in self.params_dict:
    self.params_dict['dataset.skip_listing'] = 'True'
```

Called unconditionally from `TrainingBenchmark.__init__` for every training run.

**Step 2: `generate_dlio_command()` turns every `params_dict` entry into a Hydra override**

```python
for key, value in self.params_dict.items():
    cmd += f" ++workload.{key}={value}"
```

This produces `++workload.dataset.skip_listing=True` on the DLIO command line.

**Step 3: DLIO's Hydra config system receives it**

`ConfigArguments` in `dlio_benchmark/utils/config.py` declares:

```python
skip_listing: bool = False
```

The `++workload.dataset.skip_listing=True` override sets it to `True` at startup.

**Step 4: main.py branches on it**

```python
if self.args.skip_listing:
    for idx in range(self.my_rank, num_files_expected, self.comm_size):
        ...  # deterministic URI generation, no storage API calls
```

So the reviewer should look at `_apply_skip_listing_params()` in the mlp-storage dlio.py as the origin point: it is the mlp-storage layer that enables it for all runs, and the setting then flows through as a Hydra override into the DLIO process.

--Russ
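For anyone who wants to trace the hand-off end to end without running DLIO, here is a small self-contained sketch. It is not the project's code: `generate_overrides`, `apply_overrides`, `DatasetArgs`, and the `dlio_benchmark` command prefix are hypothetical stand-ins, and the string matching below only imitates what Hydra does with a `++` override.

```python
from dataclasses import dataclass


@dataclass
class DatasetArgs:
    """Stand-in for the receiving config; default taken from the thread."""
    skip_listing: bool = False


def generate_overrides(params_dict: dict[str, str], base_cmd: str = "dlio_benchmark") -> str:
    """Append one ++workload.<key>=<value> override per params_dict entry."""
    cmd = base_cmd
    for key, value in params_dict.items():
        cmd += f" ++workload.{key}={value}"
    return cmd


def apply_overrides(cmd: str) -> DatasetArgs:
    """Toy receiver: pick skip_listing out of the command line (not Hydra itself)."""
    args = DatasetArgs()
    marker = "++workload.dataset.skip_listing="
    for token in cmd.split():
        if token.startswith(marker):
            args.skip_listing = token[len(marker):].lower() == "true"
    return args


cmd = generate_overrides({"dataset.skip_listing": "True"})
print(cmd)
print(apply_overrides(cmd))
print(apply_overrides("dlio_benchmark"))  # no override: default stays False
```

The point of the sketch is that the receiving default can remain `False` while every mlp-storage training run still ends up with `True`, because the value travels on the command line.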


FileSystemGuy · 2026-06-19T23:47:13Z

@russfellows Now you're going to bump up the "pin" of the last commit for DLIO_local_changes and bump the version number in the pyproject.toml file and then regenerate the uv.lock file?

russfellows · 2026-06-20T00:18:05Z

Yes, I'll make that fix directly. --Russ

russfellows · 2026-06-20T00:57:45Z

Curtis, or whoever has time: I did exactly this. I updated pyproject.toml to point to the latest DLIO_local_changes hash and also regenerated uv.lock. See #484, "chore: bump dlio-benchmark pin to merged PR #27 SHA; version 3.0.15". Please review, approve, and merge if the tests pass. --Russ


russfellows requested a review from a team June 19, 2026 23:24

FileSystemGuy approved these changes Jun 19, 2026


FileSystemGuy merged commit 124613e into mlcommons:main Jun 19, 2026
3 checks passed

github-actions Bot locked and limited conversation to collaborators Jun 19, 2026

russfellows deleted the russfellows/skip-listing-object-storage branch June 20, 2026 00:53
