Skip to content

Fix issues #449 #450 #451 #464 #466 #472 #475: large-scale listing, S3 object storage, epoch-2+ AU, TFRecord via s3dlio#27

Merged
FileSystemGuy merged 8 commits into
mlcommons:mainfrom
russfellows:russfellows/issue472-skip-listing
Jun 19, 2026
Merged

Fix issues #449 #450 #451 #464 #466 #472 #475: large-scale listing, S3 object storage, epoch-2+ AU, TFRecord via s3dlio#27
FileSystemGuy merged 8 commits into
mlcommons:mainfrom
russfellows:russfellows/issue472-skip-listing

Conversation

@russfellows

Copy link
Copy Markdown

DLIO_local_changes — PR Summary (russfellows/issue472-skip-listing)

Branch: russfellows/issue472-skip-listing
Base: origin/main (commit ef58613 — Wolfgang's PR #26 "Improve large scale training file lists")
Commits above base: 6 committed + additional uncommitted working-tree changes (see below)


Issues Addressed

✅ Issue #449 — Per-rank file listing causes OOM on large datasets

Root cause: Every MPI rank called storage.walk_node() independently,
materialising the full file list in each process. At 64M files × 8 ranks/node,
each node allocated ~60 GB RAM just for file lists before any training started.

Fix (PR #26, merged as ef58613):

  • dlio_benchmark/main.py: Rank 0 alone calls storage.walk_node() using a
    ThreadPoolExecutor with listing_threads workers for sub-folder layouts.
    The result is broadcast to all ranks in chunks of 1M entries via MPI.
    Each rank then filters by global_index % comm_size == my_rank (round-robin
    sharding), so no rank ever holds more than 1/N of the list.
  • dlio_benchmark/utils/config.py: New files_pre_sharded: bool flag (default
    False), set to True automatically after the rank-0 listing path runs.
    New listing_threads: int field (default 4) for sub-folder parallelism.
  • dlio_benchmark/utils/utility.py: New allreduce_min() and alltoall()
    MPI collective helpers used for sample-count alignment and epoch resharding.

✅ Issue #466 — Data-Loader OOM (sharding analysis)

Root cause: Same as #449. All ranks listed all files independently.

Fix: Same as #449 (PR #26). The per-rank round-robin sharding means each
rank's working set is 1/N of the total — RAM scales with files-per-rank, not
total files.


✅ Issue #475 — Persistent workers hold stale file shards across epochs

Root cause: persistent_workers=True was hard-coded in the TorchDataset
(map-style) DataLoader path. PyTorch persistent workers capture ConfigArguments
at spawn time and never see reshuffled/resharded file lists in subsequent epochs.

Fix (PR #26, merged as ef58613):

  • dlio_benchmark/data_loader/torch_data_loader.py: persistent_workers=True
    removed from the TorchDataset path. Workers re-spawn each epoch, picking up
    the updated serial_args from refresh_args().
  • New refresh_args() method re-serializes ConfigArguments before each epoch
    so worker processes receive the latest resharded file lists.
  • New _reshard_files() in config.py performs alltoall-based epoch resharding
    when file_shuffle != OFF and files_pre_sharded=True.

✅ Issue #464 — Epoch 2+ shows zero storage traffic / inflated AU

Root cause (part 1): Same as #475 — persistent workers served all reads from
a process-level cache (_local_cache in _LocalFSIterableMixin) after epoch 1,
producing zero actual I/O while reporting full AU.

Fix — local filesystem (PR #26, merged as ef58613):

  • dlio_benchmark/reader/_local_fs_iterable_mixin.py:
    _localfs_ensure_cached() no longer short-circuits on cache hit. Every call
    issues a real read regardless of cache state.

Fix — object storage (our branch, commit dba0caf):


✅ Issue #450 — S3 environment variables not honoured

Root cause: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_ENDPOINT_URL
were not read from the environment by DLIO's object storage layer.

Fix (pre-existing in DLIO_local_changes, confirmed via regression tests):

  • dlio_benchmark/storage/obj_store_lib.py: Reads all standard AWS env vars
    with YAML config values taking priority. The S3_ENDPOINT_URIS env var enables
    multi-endpoint load balancing. All covered by TestIssue9_StorageEnvOverrides
    (11 tests, all pass).

✅ Issue #451 — s3dlio BytesView incompatible with MinIO SDK writer

Root cause: When s3dlio is installed, the data generator (dgen-py) returns
data as a BytesView (zero-copy Rust buffer). When the storage library is
minio, put_data() routes through MinioWriter.write() which called
data.encode() — a str-only method. BytesView has no .encode(), so every
datagen write crashed with AttributeError.

Fix (our branch, commit dba0caf):

  • dlio_benchmark/storage/obj_store_lib.pyMinioWriter.write():
    # Before (broken):
    self.buffer.write(data.encode())
    
    # After (fixed):
    # bytes() works for BytesView, memoryview, bytearray, and any
    # object implementing the buffer protocol.
    self.buffer.write(bytes(data))
    The guard is also broadened: isinstance(data, (bytes, bytearray, memoryview))
    goes through the fast path; everything else uses bytes(data) for conversion.

✅ Issue #472 — S3 listing takes 12+ hours for 50M-file datasets

Root cause: list_objects_v2 paginates all object keys sequentially. At
50M files with 1,000 keys/page, that is 50,000 API calls. With WAN or high-
latency S3 (100ms/page), this alone takes ~83 minutes. Observed in the field
to take 12+ hours.

Partial fix (PR #26, merged): Reduces from N×concurrent listings (one per
rank) to rank-0-only listing. Eliminates the N× amplification, but rank 0 still
paginates all 50M keys.

Full fix (our branch, commits 4e41555, 62916c3, dd2b0c6):
skip_listing: bool = False — new YAML config option.

When enabled, DLIO generates file URIs deterministically without any S3 API
calls. The naming convention matches DLIO's own data generator:

{file_prefix}_{index:0Nd}_of_{num_files}.{format}

Each rank generates only its own slice (round-robin by rank), so there is zero
MPI communication and zero S3 listing for the file-discovery phase.

Optional sampling validation (listing_validation_interval: int = 1000):
rank 0 checks a sample of the generated URIs via HEAD requests to verify the
files actually exist, with progress output:

validating 50,001 of 50,136,788 files (first, last, every 1,000) via HEAD requests ...
5,000/50,001 checked (10%) — 483 checks/s — ETA 93s — 0 failed so far
validation complete — all 50,001 samples exist (103.6s, 483 checks/s)

Files changed:

  • dlio_benchmark/main.py — skip_listing branch in file discovery
  • dlio_benchmark/utils/config.py — new skip_listing and
    listing_validation_interval fields
  • dlio_benchmark/storage/obj_store_lib.py — new file_exists() method
  • dlio_benchmark/storage/file_storage.py — new file_exists() method

Additional Bugs Found and Fixed During Testing

Bug A — DLIOMPI.allreduce_min() / alltoall() crash in child-process and single-rank contexts

Root cause: PR #26 added allreduce_min() to get_global_map_index() to
align sample counts across ranks. When TorchIterableDatasetSimple runs with
num_workers=0, it calls worker_init(0) directly in the main thread. This
deserializes ConfigArguments via pickle → __setstate__DLIOMPI.reset()

  • set_parent_values(), leaving the singleton in CHILD_INITIALIZED state.
    The subsequent reconfigure() call then hits allreduce_min()
    comm() → raises "called in a child process".

Fix (commit 7319379):

  • dlio_benchmark/utils/utility.pyallreduce_min() and alltoall() now
    short-circuit when mpi_state == CHILD_INITIALIZED or mpi_size <= 1,
    returning the local value directly. For child processes, MPI collectives are
    impossible and the local value is authoritative. For single-rank runs, no
    collective is needed.

Bug B — DLIOMPI.initialize() raises when called after child-state corruption

Root cause: Same root cause as Bug A. After the first S3 test runs and
leaves DLIOMPI in CHILD_INITIALIZED, the second S3 test calls
DLIOMPI.get_instance().initialize() which raises "called in a child process"
rather than reinitialising.

Fix (commit 31759a8):

  • dlio_benchmark/utils/utility.pyinitialize() now detects
    CHILD_INITIALIZED state and, if MPI.Is_initialized() is True (meaning
    we are actually the main MPI process), resets to UNINITIALIZED and
    proceeds with normal initialisation. If MPI is not running, it still raises
    (we genuinely are in a child process).

Bug C — TFRecord + PyTorch incorrectly blocked; extended to full generate+read via s3dlio

Original problem: config.validate() raised
"pytorch support for tfrecord is not implemented for pytorch" unconditionally
for any config combining framework=pytorch + format=tfrecord, even during
pure datagen (generate_data=True, train=False). No data loading happens
during generation, so the restriction was wrong.

Initial fix (commit 31759a8):

  • dlio_benchmark/utils/config.py — validation guard now checks
    self.do_train or self.do_eval before raising. TFRecord datagen with pytorch
    works via s3dlio's put_bytes() path.

Extended fix (uncommitted working-tree changes):
TFRecordReaderS3Iterable reads TFRecord objects as raw bytes via
s3dlio.get_many() with no tensorflow/protobuf decoding required. It was
already implemented but not correctly routed, and contained a bug causing
failures when used as a map-style reader. Three changes:

  1. dlio_benchmark/reader/reader_factory.pyTFRecordReaderS3Iterable
    is now selected whenever storage_library == "s3dlio", regardless of
    storage_type. s3dlio handles both s3:// and file:// URIs, so the
    old check of storage_type in (S3, AISTORE) was both too narrow (missed
    file://) and too broad (did not guarantee s3dlio was the library).

  2. dlio_benchmark/utils/config.py — The pytorch+tfrecord restriction for
    train/eval now checks storage_library != "s3dlio" rather than
    storage_type not in (S3, AISTORE). TFRecord reading with pytorch is
    supported exclusively through s3dlio. All other paths (TFReader via
    tf.io.gfile, DALI) still raise the original error.

  3. dlio_benchmark/reader/tfrecord_reader_s3_iterable.pyread_index()
    was calling super().read_index() which resolved to NPYReader.read_index()
    _localfs_ensure_cached(), attempting to open an S3/object URI as a local
    file. Fixed to call FormatReader.read_index() directly, bypassing NPYReader.
    Also: _s3_iterable_mixin.py storage_library now defaults to "s3dlio"
    instead of raising ValueError when not set in the YAML.

Result: TFRecord generate + read works end-to-end via s3dlio for both
s3:// and file:// URIs with no tensorflow installation required. Confirmed
by test_s3dlio_tfrecord_datagen_and_read (live S3 test, PASSED).


Bug D — _S3_EXTENDED undefined in test file / hardcoded server IP

Root cause: test_s3dlio_object_store.py referenced _S3_EXTENDED at
module level without defining it, causing a NameError that prevented all test
collection. Additionally, _endpoint() had a hardcoded fallback IP — a real
internal server address in a public-facing test file.

Fix (commits 7319379, 31759a8 + uncommitted):

  • _S3_EXTENDED defined as os.environ.get("DLIO_S3_EXTENDED", "0") == "1"
  • _endpoint() calls pytest.skip() if AWS_ENDPOINT_URL is not set
  • Stale comments claiming "routes through boto3" corrected
  • test_s3dlio_tfrecord_datagen renamed to test_s3dlio_tfrecord_datagen_and_read
    and extended with a full read phase

Complete File Change Index

File Changed by Issues / Bugs
dlio_benchmark/main.py PR #26 + our branch #449 #466 #472
dlio_benchmark/data_loader/torch_data_loader.py PR #26 #475 #464
dlio_benchmark/reader/_local_fs_iterable_mixin.py PR #26 #464
dlio_benchmark/reader/_s3_iterable_mixin.py Our branch #464 (S3 gap); Bug C (s3dlio default)
dlio_benchmark/reader/reader_factory.py Our branch Bug C (TFRecord routing)
dlio_benchmark/reader/tfrecord_reader_s3_iterable.py Our branch Bug C (read_index fix)
dlio_benchmark/utils/config.py PR #26 + our branch #449 #472 Bug C
dlio_benchmark/utils/utility.py PR #26 + our branch #449 Bug A Bug B
dlio_benchmark/storage/obj_store_lib.py Our branch #451 #472
dlio_benchmark/storage/file_storage.py Our branch #472
tests/test_s3dlio_object_store.py Our branch Bug D; TFRecord read test

Test Results

Non-S3 (CI gate, no external dependencies)

uv run pytest tests/test_fast_ci.py tests/test_dlio_sampler.py \
              tests/test_virtual_index_map.py tests/test_issue_regressions.py
→ 130 passed, 1 skipped (expected: dftracer native C extension absent)

Live S3 (against MinIO at 172.16.1.40:9000, HTTPS with self-signed cert)

DLIO_OBJECT_STORAGE_TESTS=1 DLIO_TEST_BUCKET=mlp-s3dlio \
uv run pytest tests/test_s3dlio_object_store.py
→ 2 passed:
    test_s3dlio_datagen_and_read[npy]           — NPY generate + read via s3dlio
    test_s3dlio_tfrecord_datagen_and_read       — TFRecord generate + read via s3dlio

Local Branch Commits (above origin/main)

31759a8  fix(mpi): repair CHILD_INITIALIZED state in initialize(); fix TFRecord datagen validation
7319379  fix(mpi): allreduce_min/alltoall safe in child-process and single-rank contexts
dba0caf  fix(storage): fix BytesView incompatibility and S3 epoch-2+ cache bypass
dd2b0c6  feat(listing): add progress output during skip_listing validation
62916c3  feat(listing): add sampling validation for skip_listing (issue #472)
4e41555  feat(listing): add skip_listing for deterministic file URI generation (issue #472)

Uncommitted working-tree changes (not yet in any commit):

  • dlio_benchmark/reader/_s3_iterable_mixin.py — storage_library defaults to "s3dlio"
  • dlio_benchmark/reader/reader_factory.py — TFRecord routing keyed on storage_library
  • dlio_benchmark/reader/tfrecord_reader_s3_iterable.py — read_index() fix; FormatReader import
  • dlio_benchmark/utils/config.py — TFRecord validation keyed on storage_library
  • tests/test_s3dlio_object_store.py — TFRecord test renamed + read phase; comment fixes
  • docs/DLRM-Parquet-S3-Throughput-Analysis.md — removed stale boto3 wording

Base commit (origin/main):

ef58613  Improve large scale training file lists with distributed approach (#26)
         (Wolfgang De Salvador + DLIOSampler MPI guard fix)

… (issue #472)

For DLIO-generated datasets, file URIs follow a known naming pattern:
  {file_prefix}_{index:0N}_of_{num_files}.{format}

When skip_listing=True, each rank independently generates its own
round-robin shard using this convention — zero S3 API calls, zero MPI
communication for the listing phase.

This eliminates multi-hour S3 listing for large flat datasets:
- 50M files at 100ms/page × 50K pages = ~5000s (83 min) of listing
- With skip_listing=True: ~5-10s of Python string generation
- 100-500× speedup for retinanet-scale workloads (issue #472)

Also supports subfoldered layouts: subfolder index computed as
  str(file_index % num_subfolders).zfill(nd_sf)

Usage:
  ++workload.dataset.skip_listing=True

Default: False (backward compatible, listing behavior unchanged)
When skip_listing=True, rank 0 now verifies that the deterministically
generated file URIs actually exist before training begins.

Validation checks:
  - The first file (index 0)
  - The last file (index num_files - 1)
  - Every listing_validation_interval-th file (default: every 1,000th)

If any sampled file is missing, an informative exception is raised
directing the user to either fix the file prefix/format or set
skip_listing=False to fall back to directory listing.

New config fields:
  listing_validation_interval: int = 1000
    Set to 0 to disable validation entirely.

New storage methods:
  ObjStoreLibStorage.file_exists(uri)  -- uses s3dlio.exists() or stat_object()
  FileStorage.file_exists(id)          -- uses os.path.isfile()

For 50M files with interval=1000: 50,001 HEAD requests vs 50,000 listing
pages -- same order of magnitude but fully parallel-capable and verifies
bounds and uniform sampling of the dataset.
Validation now emits three kinds of log lines on rank 0:

  1. Header before any checks:
       skip_listing [train]: validating 50,001 of 50,136,788 files
       (first, last, every 1,000) via HEAD requests ...

  2. Progress every ~10% of checks (at least every 500, no more than
     every 100):
       skip_listing [train]:   5,000/50,001 checked (10%)  —
       483 checks/s  —  ETA 93s  —  0 failed so far

  3. Final summary on success:
       skip_listing [train]: validation complete — all 50,001 samples
       exist (103.6s, 483 checks/s); 12,534,197 URIs ready for rank 0
       (50,136,788 total across all ranks)

  On failure the exception now includes elapsed time and shows the first
  3 missing URIs for diagnosis.
Issue #451: MinioWriter.write() called .encode() on data which fails for
s3dlio BytesView objects (and any buffer-protocol type that is not str).
Replace data.encode() with bytes(data), which works for bytes, bytearray,
memoryview, BytesView, and any object implementing the buffer protocol.

Issue #464 (object storage gap): _s3_ensure_cached() had the same cache-
guard bug (if filename not in self._object_cache) that PR argonne-lcf#26 fixed for
_localfs_ensure_cached. With persistent_workers=True still set on the
iterable dataset paths, a cached byte count from epoch 1 would survive to
epoch 2+ and _prefetch() would never be called — producing zero S3 traffic
and invalid AU measurements. Remove the guard so every epoch always issues
a real GET.
…k contexts

When TorchIterableDatasetSimple.__iter__ runs with num_workers=0, it calls
worker_init(0) directly in the main process, which deserializes ConfigArguments
via pickle.loads → __setstate__ → DLIOMPI.reset() + set_parent_values().  This
leaves the DLIOMPI singleton in CHILD_INITIALIZED state for the remainder of
the epoch.

After the epoch, reconfigure() → get_global_map_index() → allreduce_min() then
calls comm(), which raises 'called in a child process'.

Fix: allreduce_min() and alltoall() now short-circuit when state is
CHILD_INITIALIZED or mpi_size <= 1.  For child processes, MPI collectives are
impossible and returning the local value is always correct (rank 0 owns the
authoritative state).  For single-rank runs, no allreduce is needed at all.

Also fix: _S3_EXTENDED missing definition in test_s3dlio_object_store.py.
Removes hardcoded fallback IP from _endpoint() — now skips if AWS_ENDPOINT_URL
is not set rather than silently using a real server address.
…d datagen validation

utility.py: DLIOMPI.initialize() now repairs CHILD_INITIALIZED state when MPI
is actually running (main process had its state corrupted by a DataLoader
worker_init deserialization in the num_workers=0 path).  Previously this
raised 'called in a child process' when the second S3 test called initialize()
after the first test left the state dirty.

config.py: TFRecord+PyTorch validation guard now only fires when a data loader
is actually used (do_train or do_eval).  Previously it fired unconditionally,
rejecting datagen-only runs (generate_data=True, train=False, evaluation=False)
even though no data loading occurs during generation.

test_s3dlio_object_store.py: add evaluation=False override to TFRecord datagen
test so the eval phase does not attempt to load TFRecords via pytorch.
…ndex

- reader/reader_factory.py: route TFRECORD to TFRecordReaderS3Iterable
  whenever storage_library=s3dlio, regardless of storage_type.
  s3dlio handles both s3:// and file:// URIs, so the old check of
  storage_type in (S3, AISTORE) was both too narrow and too broad.

- reader/tfrecord_reader_s3_iterable.py: fix read_index() to call
  FormatReader.read_index() directly instead of super(), which was
  resolving to NPYReader.read_index() -> _localfs_ensure_cached(),
  causing FileNotFoundError when reading from S3/object URIs.
  Add FormatReader import. Clarify class docstring.

- reader/_s3_iterable_mixin.py: storage_library now defaults to
  's3dlio' instead of raising ValueError when not set in YAML,
  consistent with how data generation defaults.

- utils/config.py: pytorch+tfrecord restriction for train/eval now
  checks storage_library != 's3dlio' rather than storage_type not
  in (S3, AISTORE). TFRecord reading with pytorch is supported
  exclusively through s3dlio's TFRecordReaderS3Iterable.

- tests/test_s3dlio_object_store.py: rename test_s3dlio_tfrecord_datagen
  to test_s3dlio_tfrecord_datagen_and_read; add full read phase.
  Fix stale comments (S3Storage uses tf.io.gfile, not boto3).
  Remove botocore from logger noise-suppression list.

- docs: remove stale boto3 references in two analysis docs.

All 130 unit tests pass. Live S3 tests: 2 passed (npy, tfrecord).
@russfellows russfellows requested a review from a team June 19, 2026 20:51
@FileSystemGuy

Copy link
Copy Markdown

The core of this PR appears to be an efficient means of avoiding listing the S3 bucket contents so that the user can avoid an operation that is extremely slow on Object storage.

It is turned off by default, do we need a way to turn it on so that the users get the benefit?

It also looks like a score that results a run where this was enabled is absolutely not comparable with a score resulting from a run where this was not enabled. Is that's the case, rather than having two sets of incomparable results, we should turn it on and not enable anyone to turn it off. Does that sound right?

@russfellows

russfellows commented Jun 19, 2026 via email

Copy link
Copy Markdown
Author

@FileSystemGuy

Copy link
Copy Markdown

@russfellows

To the second point, no I am 99% confident that results will be comparable, because listing is not part of the timed benchmark run. At least I didn’t THINK it was, but we should check.

IMHO, avoiding S3 object listing because it is so slow is tantamount to cheating. 😄 If it is too slow to use, then fix it!

Having said that, I'm ok that listing is not included in the measured performance of the SuT since everyone in the known universe is finding ways to avoid doing object listings because they're so slow.

@russfellows

russfellows commented Jun 19, 2026 via email

Copy link
Copy Markdown
Author

@russfellows

russfellows commented Jun 19, 2026 via email

Copy link
Copy Markdown
Author

Documents skip_listing and listing_validation_interval:
- What the problem is (S3 listing latency for large datasets)
- How skip_listing works (deterministic URI generation)
- Whether the listing phase is scored (it is NOT — in initialize(),
  before stats.start_run() in run())
- Configuration: Hydra CLI overrides, no YAML changes required
- Worked examples for direct dlio_benchmark and via mlp-storage
- Comparability guidance for benchmark submissions
@russfellows

russfellows commented Jun 19, 2026 via email

Copy link
Copy Markdown
Author

@FileSystemGuy FileSystemGuy merged commit 7098ccc into mlcommons:main Jun 19, 2026
4 checks passed
FileSystemGuy pushed a commit to mlcommons/storage that referenced this pull request Jun 19, 2026
… adaptive validation interval (#483)

- mlpstorage_py/benchmarks/dlio.py:
  - New _compute_validation_interval(num_files) static method: scales validation
    sampling geometrically with dataset size so startup HEAD-check time is
    bounded regardless of scale (exhaustive for <10K files; every 10,000th file
    for 10M+ files — ~1,000 checks at any scale).
  - New _apply_skip_listing_params() method: injects dataset.skip_listing=True
    and dataset.listing_validation_interval=<adaptive> for ALL training runs,
    both --file and --object. Each MPI rank reconstructs its own shard
    deterministically — no process ever holds the full file list in memory.
    Eliminates flat-file OOM for large local-filesystem datasets as well as
    the 12+ hour S3 listing problem (issue #472). Respects --params overrides.
  - _apply_object_storage_params(): removed skip_listing injection (now handled
    by _apply_skip_listing_params). Object-storage-specific S3 credential and
    endpoint setup unchanged.

- docs/OBJECT_STORAGE_GUIDE.md: document skip_listing, adaptive validation
  interval, override examples, scoring clarification.

Relates-to: mlcommons/DLIO_local_changes#27
@russfellows russfellows deleted the russfellows/issue472-skip-listing branch June 20, 2026 00:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants