Fix issues #449 #450 #451 #464 #466 #472 #475: large-scale listing, S3 object storage, epoch-2+ AU, TFRecord via s3dlio#27
… (issue #472)
For DLIO-generated datasets, file URIs follow a known naming pattern:
{file_prefix}_{index:0N}_of_{num_files}.{format}
When skip_listing=True, each rank independently generates its own
round-robin shard using this convention — zero S3 API calls, zero MPI
communication for the listing phase.
This eliminates multi-hour S3 listing for large flat datasets:
- 50M files at 100ms/page × 50K pages = ~5000s (83 min) of listing
- With skip_listing=True: ~5-10s of Python string generation
- 100-500× speedup for retinanet-scale workloads (issue #472)
Also supports subfoldered layouts: subfolder index computed as
str(file_index % num_subfolders).zfill(nd_sf)
Usage:
++workload.dataset.skip_listing=True
Default: False (backward compatible, listing behavior unchanged)
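A minimal sketch of the deterministic URI generation described above, assuming the zero-padding width N equals the number of digits in num_files, that the subfolder sits directly under the base URI, and that the index in the name is used as-is. The function name `shard_uris` and its parameters are hypothetical and are not taken from the DLIO code.

```python
def shard_uris(base_uri, file_prefix, fmt, num_files, rank, world_size,
               num_subfolders=0):
    """Return the URIs owned by `rank` without listing the bucket."""
    width = len(str(num_files))                       # assumed N in {index:0N}
    sub_width = len(str(max(num_subfolders - 1, 0)))  # assumed nd_sf
    uris = []
    # Round-robin: rank r owns indices r, r + world_size, r + 2*world_size, ...
    for index in range(rank, num_files, world_size):
        name = f"{file_prefix}_{index:0{width}d}_of_{num_files}.{fmt}"
        if num_subfolders > 0:
            sub = str(index % num_subfolders).zfill(sub_width)
            uris.append(f"{base_uri}/{sub}/{name}")
        else:
            uris.append(f"{base_uri}/{name}")
    return uris


# Rank 1 of 4 over a 10-file dataset owns indices 1, 5 and 9.
example = shard_uris("s3://bucket/train", "img", "npz",
                     num_files=10, rank=1, world_size=4)
print(example)
```

Because every rank derives its shard from (rank, world_size, num_files) alone, no rank needs to see another rank's list, which is why neither S3 calls nor MPI traffic are required for this phase.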
When skip_listing=True, rank 0 now verifies that the deterministically
generated file URIs actually exist before training begins.
Validation checks:
- The first file (index 0)
- The last file (index num_files - 1)
- Every listing_validation_interval-th file (default: every 1,000th)
If any sampled file is missing, an informative exception is raised
directing the user to either fix the file prefix/format or set
skip_listing=False to fall back to directory listing.
New config fields:
listing_validation_interval: int = 1000
Set to 0 to disable validation entirely.
New storage methods:
ObjStoreLibStorage.file_exists(uri) -- uses s3dlio.exists() or stat_object()
FileStorage.file_exists(id) -- uses os.path.isfile()
For 50M files with interval=1000: 50,001 HEAD requests vs 50,000 listing
pages -- same order of magnitude but fully parallel-capable and verifies
bounds and uniform sampling of the dataset.
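The sampling rule above (first, last, every interval-th file) can be sketched as follows. This is an illustration of the arithmetic only; `validation_indices` is a hypothetical helper, not the function used in the PR.

```python
def validation_indices(num_files, interval):
    """Indices to check: first, last, and every `interval`-th file."""
    if num_files <= 0 or interval <= 0:   # interval 0 disables validation
        return []
    picked = set(range(0, num_files, interval))  # includes index 0
    picked.add(num_files - 1)                    # always include the last file
    return sorted(picked)


sample = validation_indices(50_000_000, 1000)
print(len(sample))  # 50001
```

For 50M files and an interval of 1,000 this yields 50,000 evenly spaced indices plus the last file, which matches the 50,001 HEAD requests quoted above.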
Validation now emits three kinds of log lines on rank 0:
1. Header before any checks:
skip_listing [train]: validating 50,001 of 50,136,788 files
(first, last, every 1,000) via HEAD requests ...
2. Progress every ~10% of checks (at least every 500, no more than
every 100):
skip_listing [train]: 5,000/50,001 checked (10%) —
483 checks/s — ETA 93s — 0 failed so far
3. Final summary on success:
skip_listing [train]: validation complete — all 50,001 samples
exist (103.6s, 483 checks/s); 12,534,197 URIs ready for rank 0
(50,136,788 total across all ranks)
On failure the exception now includes elapsed time and shows the first
3 missing URIs for diagnosis.
Issue #451: MinioWriter.write() called .encode() on data, which fails for s3dlio BytesView objects (and any buffer-protocol type that is not str). Replace data.encode() with bytes(data), which works for bytes, bytearray, memoryview, BytesView, and any object implementing the buffer protocol.
Issue #464 (object storage gap): _s3_ensure_cached() had the same cache-guard bug (if filename not in self._object_cache) that PR argonne-lcf#26 fixed for _localfs_ensure_cached. With persistent_workers=True still set on the iterable dataset paths, a cached byte count from epoch 1 would survive to epoch 2+ and _prefetch() would never be called, producing zero S3 traffic and invalid AU measurements. Remove the guard so every epoch always issues a real GET.
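For the issue #451 change, a small sketch of why bytes(data) is the safer conversion. The helper name `to_bytes` is hypothetical, and array.array is used here only as a stand-in for a buffer-protocol object such as BytesView, which is not imported.

```python
import array


def to_bytes(data):
    """Normalise a write payload; hypothetical stand-in for the writer logic."""
    # Native binary types pass through unchanged.
    if isinstance(data, (bytes, bytearray, memoryview)):
        return data
    # Anything else that implements the buffer protocol is copied into bytes.
    return bytes(data)


buffer_like = array.array("B", b"abc")   # stand-in for a BytesView
payloads = [b"abc", bytearray(b"abc"), memoryview(b"abc"), buffer_like]
converted = [bytes(to_bytes(p)) for p in payloads]
print(converted)

# None of these types has a .encode() method, so data.encode() would raise.
has_encode = [hasattr(p, "encode") for p in payloads]
print(has_encode)
```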
…k contexts
When TorchIterableDatasetSimple.__iter__ runs with num_workers=0, it calls worker_init(0) directly in the main process, which deserializes ConfigArguments via pickle.loads → __setstate__ → DLIOMPI.reset() + set_parent_values(). This leaves the DLIOMPI singleton in CHILD_INITIALIZED state for the remainder of the epoch. After the epoch, reconfigure() → get_global_map_index() → allreduce_min() then calls comm(), which raises 'called in a child process'.
Fix: allreduce_min() and alltoall() now short-circuit when state is CHILD_INITIALIZED or mpi_size <= 1. For child processes, MPI collectives are impossible and returning the local value is always correct (rank 0 owns the authoritative state). For single-rank runs, no allreduce is needed at all.
Also fix: _S3_EXTENDED missing definition in test_s3dlio_object_store.py. Removes hardcoded fallback IP from _endpoint(), which now skips if AWS_ENDPOINT_URL is not set rather than silently using a real server address.
…d datagen validation
utility.py: DLIOMPI.initialize() now repairs CHILD_INITIALIZED state when MPI is actually running (main process had its state corrupted by a DataLoader worker_init deserialization in the num_workers=0 path). Previously this raised 'called in a child process' when the second S3 test called initialize() after the first test left the state dirty.
config.py: TFRecord+PyTorch validation guard now only fires when a data loader is actually used (do_train or do_eval). Previously it fired unconditionally, rejecting datagen-only runs (generate_data=True, train=False, evaluation=False) even though no data loading occurs during generation.
test_s3dlio_object_store.py: add evaluation=False override to TFRecord datagen test so the eval phase does not attempt to load TFRecords via pytorch.
…ndex
- reader/reader_factory.py: route TFRECORD to TFRecordReaderS3Iterable whenever storage_library=s3dlio, regardless of storage_type. s3dlio handles both s3:// and file:// URIs, so the old check of storage_type in (S3, AISTORE) was both too narrow and too broad.
- reader/tfrecord_reader_s3_iterable.py: fix read_index() to call FormatReader.read_index() directly instead of super(), which was resolving to NPYReader.read_index() -> _localfs_ensure_cached(), causing FileNotFoundError when reading from S3/object URIs. Add FormatReader import. Clarify class docstring.
- reader/_s3_iterable_mixin.py: storage_library now defaults to 's3dlio' instead of raising ValueError when not set in YAML, consistent with how data generation defaults.
- utils/config.py: pytorch+tfrecord restriction for train/eval now checks storage_library != 's3dlio' rather than storage_type not in (S3, AISTORE). TFRecord reading with pytorch is supported exclusively through s3dlio's TFRecordReaderS3Iterable.
- tests/test_s3dlio_object_store.py: rename test_s3dlio_tfrecord_datagen to test_s3dlio_tfrecord_datagen_and_read; add full read phase. Fix stale comments (S3Storage uses tf.io.gfile, not boto3). Remove botocore from logger noise-suppression list.
- docs: remove stale boto3 references in two analysis docs.
All 130 unit tests pass. Live S3 tests: 2 passed (npy, tfrecord).
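The read_index() fix in this commit comes down to Python method resolution. The sketch below uses stub classes with an assumed inheritance chain (the TFRecord reader deriving from NPYReader, which derives from FormatReader); the real class bodies and hierarchy are not shown in this PR text, so treat the structure as illustrative only.

```python
class FormatReader:
    def read_index(self, index):
        return f"FormatReader.read_index({index})"


class NPYReader(FormatReader):
    def read_index(self, index):
        # Stands in for the local-filesystem path (_localfs_ensure_cached).
        raise FileNotFoundError("tried to open an object URI as a local file")


class TFRecordReaderBefore(NPYReader):
    def read_index(self, index):
        # super() picks the next class in the MRO, which is NPYReader.
        return super().read_index(index)


class TFRecordReaderAfter(NPYReader):
    def read_index(self, index):
        # Naming the base class explicitly skips NPYReader's override.
        return FormatReader.read_index(self, index)


try:
    TFRecordReaderBefore().read_index(7)
    before_result = "ok"
except FileNotFoundError:
    before_result = "FileNotFoundError"

after_result = TFRecordReaderAfter().read_index(7)
print(before_result, after_result)
```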
|
The core of this PR appears to be an efficient means of avoiding listing the S3 bucket contents so that the user can avoid an operation that is extremely slow on object storage. It is turned off by default; do we need a way to turn it on so that users get the benefit? It also looks like a score that results from a run where this was enabled is absolutely not comparable with a score resulting from a run where this was not enabled. If that's the case, rather than having two sets of incomparable results, we should turn it on and not enable anyone to turn it off. Does that sound right?
|
Curtis,
Great points, and the first one (“how can users enable this?”) absolutely needs to be documented. I had thought of that before I pushed, but then forgot to come back to it.
To the second point, no I am 99% confident that results will be comparable, because listing is not part of the timed benchmark run. At least I didn’t THINK it was, but we should check.
I’ll review that point again. If more code changes are required, I can just add them to a new commit on this PR.
I will update the status of the check on timed vs. non-timed listing.
Regards,
(In reply to Curtis Anderson's comment of Jun 19, 2026, 3:18 PM.)
|
IMHO, avoiding S3 object listing because it is so slow is tantamount to cheating. 😄 If it is too slow to use, then fix it! Having said that, I'm ok that listing is not included in the measured performance of the SuT since everyone in the known universe is finding ways to avoid doing object listings because they're so slow.
|
Curtis,
I checked and tested: the listing was never in the timed AU calculation code path, so including listing or not is fair for everyone (file, object, opt-out, opt-in, whatever).
I did change the value to be true by default now when the storage type is object. I also added documentation in 3 places: 1 in DLIO_local_changes and 2 in mlpstorage.
So, this will also require a small PR to mlpstorage, which I will issue after I update the existing PR.
Regards,
—Russ
(In reply to Curtis Anderson's comment of Jun 19, 2026, 4:25 PM.)
|
|
Sorry, meant to include the Sonnet analysis:
All green. Here's a summary of everything done:
Code change — mlp-storage (mlpstorage_py/benchmarks/dlio.py):
_apply_object_storage_params() now injects two additional params for every --object run, only if the user hasn't already overridden them via --params:
dataset.skip_listing = True
dataset.listing_validation_interval = 1000
To opt out or tune:
New doc — skip-listing.md:
Explains the problem, how it works, the config parameters (both YAML and ++workload.dataset.* CLI syntax), the fact that the listing phase is NOT scored, worked examples, and comparability guidance.
Updated doc — OBJECT_STORAGE_GUIDE.md:
Auto-injected params table now includes skip_listing and listing_validation_interval. New section "skip_listing — Avoiding S3 Listing for Large Datasets" with the why, the scoring answer, requirements, override examples, and sample log output.
The reviewer's two questions, directly answered:
No YAML changes needed — use ++workload.dataset.skip_listing=True on the CLI, or just use mlp-storage with --object (it's now automatic).
Listing is NOT timed — it lives in initialize() before stats.start_run(). Scores are fully comparable. Enabling it by default for --object mode is the right call.
Claude Sonnet 4.6 • 197.1 credits
—Russ
|
Documents skip_listing and listing_validation_interval:
- What the problem is (S3 listing latency for large datasets)
- How skip_listing works (deterministic URI generation)
- Whether the listing phase is scored (it is NOT: it runs in initialize(), before stats.start_run() in run())
- Configuration: Hydra CLI overrides, no YAML changes required
- Worked examples for direct dlio_benchmark and via mlp-storage
- Comparability guidance for benchmark submissions
|
Curtis,
I'm making more changes, both to make it fair and to ACTUALLY fix issue #466 (OOM on large training sets).
Listing would still cause a problem for filesystems with 50M files, not because it takes too long, but because it uses too much memory, as Wolfgang identified originally.
So, the better, more fair fix is to set skip_listing = true by default for both file and object, and then vary the fraction of items checked based on the total number of files being trained on. With 50M files, for example, we would only check every 10,000th item.
This makes it fair, and removes the OOM issues with trying to keep 50m items in memory.
I’ll update the PR.
Regards,
—Russ
|
… adaptive validation interval (#483)
- mlpstorage_py/benchmarks/dlio.py:
  - New _compute_validation_interval(num_files) static method: scales validation sampling geometrically with dataset size so startup HEAD-check time is bounded regardless of scale (exhaustive for <10K files; every 10,000th file for 10M+ files, ~1,000 checks at any scale).
  - New _apply_skip_listing_params() method: injects dataset.skip_listing=True and dataset.listing_validation_interval=<adaptive> for ALL training runs, both --file and --object. Each MPI rank reconstructs its own shard deterministically, so no process ever holds the full file list in memory. Eliminates flat-file OOM for large local-filesystem datasets as well as the 12+ hour S3 listing problem (issue #472). Respects --params overrides.
  - _apply_object_storage_params(): removed skip_listing injection (now handled by _apply_skip_listing_params). Object-storage-specific S3 credential and endpoint setup unchanged.
- docs/OBJECT_STORAGE_GUIDE.md: document skip_listing, adaptive validation interval, override examples, scoring clarification.
Relates-to: mlcommons/DLIO_local_changes#27
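One way the adaptive interval could be tiered is sketched below. Only the two endpoints come from the commit message (exhaustive below 10K files, every 10,000th file at 10M and above); the power-of-ten steps in between are an assumption, and `compute_validation_interval` is a hypothetical stand-in for the real static method.

```python
def compute_validation_interval(num_files):
    """Hypothetical tiering: check every file for small sets, sample large ones."""
    if num_files < 10_000:
        return 1            # exhaustive
    if num_files >= 10_000_000:
        return 10_000       # cap from the commit message
    # Assumed middle tiers: one power of ten per decade of dataset size,
    # e.g. 5 digits -> 10, 6 digits -> 100, 7 digits -> 1,000.
    return 10 ** (len(str(num_files)) - 4)


for n in (9_999, 10_000, 1_000_000, 10_000_000, 50_000_000):
    interval = compute_validation_interval(n)
    print(n, interval, n // interval)
```

Under these assumed tiers the number of checks stays between roughly 1,000 and 10,000 for datasets up to 100M files, so startup validation time no longer grows linearly with dataset size.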
DLIO_local_changes — PR Summary (russfellows/issue472-skip-listing)
Branch: russfellows/issue472-skip-listing
Base: origin/main (commit ef58613, Wolfgang's PR #26 "Improve large scale training file lists")
Commits above base: 6 committed + additional uncommitted working-tree changes (see below)
Issues Addressed
✅ Issue #449 — Per-rank file listing causes OOM on large datasets
Root cause: Every MPI rank called storage.walk_node() independently, materialising the full file list in each process. At 64M files × 8 ranks/node, each node allocated ~60 GB RAM just for file lists before any training started.
Fix (PR #26, merged as ef58613):
- dlio_benchmark/main.py: Rank 0 alone calls storage.walk_node() using a ThreadPoolExecutor with listing_threads workers for sub-folder layouts. The result is broadcast to all ranks in chunks of 1M entries via MPI. Each rank then filters by global_index % comm_size == my_rank (round-robin sharding), so no rank ever holds more than 1/N of the list.
- dlio_benchmark/utils/config.py: New files_pre_sharded: bool flag (default False), set to True automatically after the rank-0 listing path runs. New listing_threads: int field (default 4) for sub-folder parallelism.
- dlio_benchmark/utils/utility.py: New allreduce_min() and alltoall() MPI collective helpers used for sample-count alignment and epoch resharding.
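The chunked round-robin filter can be illustrated without MPI. In this sketch each chunk stands in for one broadcast from rank 0; the function names and the in-memory list are assumptions made for illustration, not the code in main.py.

```python
def iter_chunks(items, chunk_size):
    """Yield (start_index, chunk) pairs, mimicking chunked broadcasts."""
    for start in range(0, len(items), chunk_size):
        yield start, items[start:start + chunk_size]


def keep_my_shard(all_files, my_rank, comm_size, chunk_size=1_000_000):
    """Keep only the entries whose global index maps to this rank."""
    mine = []
    for start, chunk in iter_chunks(all_files, chunk_size):
        # In the real code each chunk would arrive via an MPI broadcast;
        # the receiving rank keeps its entries and drops the rest.
        for offset, path in enumerate(chunk):
            if (start + offset) % comm_size == my_rank:
                mine.append(path)
    return mine


files = [f"file_{i}" for i in range(10)]
shards = [keep_my_shard(files, r, comm_size=3, chunk_size=4) for r in range(3)]
print(shards[0])  # ['file_0', 'file_3', 'file_6', 'file_9']
```

The global index (start + offset) is what makes the result independent of chunk size: a rank receives the same shard whether the list arrives in one piece or in many.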
✅ Issue #466 — Data-Loader OOM (sharding analysis)
Root cause: Same as #449. All ranks listed all files independently.
Fix: Same as #449 (PR #26). The per-rank round-robin sharding means each
rank's working set is 1/N of the total — RAM scales with files-per-rank, not
total files.
✅ Issue #475 — Persistent workers hold stale file shards across epochs
Root cause: persistent_workers=True was hard-coded in the TorchDataset (map-style) DataLoader path. PyTorch persistent workers capture ConfigArguments at spawn time and never see reshuffled/resharded file lists in subsequent epochs.
Fix (PR #26, merged as ef58613):
- dlio_benchmark/data_loader/torch_data_loader.py: persistent_workers=True removed from the TorchDataset path. Workers re-spawn each epoch, picking up the updated serial_args from refresh_args().
- refresh_args() method re-serializes ConfigArguments before each epoch so worker processes receive the latest resharded file lists.
- _reshard_files() in config.py performs alltoall-based epoch resharding when file_shuffle != OFF and files_pre_sharded=True.
✅ Issue #464 — Epoch 2+ shows zero storage traffic / inflated AU
Root cause (part 1): Same as #475: persistent workers served all reads from a process-level cache (_local_cache in _LocalFSIterableMixin) after epoch 1, producing zero actual I/O while reporting full AU.
Fix, local filesystem (PR #26, merged as ef58613):
- dlio_benchmark/reader/_local_fs_iterable_mixin.py: _localfs_ensure_cached() no longer short-circuits on cache hit. Every call issues a real read regardless of cache state.
Fix, object storage (our branch, commit dba0caf):
- dlio_benchmark/reader/_s3_iterable_mixin.py: _s3_ensure_cached() had the identical if filename not in self._object_cache guard. Removed. Every call now issues a real GET. This mirrors the local-FS fix that Wolfgang made in PR #26 ("Improve large scale training file lists with distributed approach") but did not carry over to the S3 mixin.
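The effect of the cache guard on per-epoch traffic can be reproduced with a toy reader. Everything here is a stand-in: `ToyReader`, `_fetch` and the fixed byte count are hypothetical, and the counter simply records how many simulated GETs were issued.

```python
class ToyReader:
    """Toy model of a reader that survives across epochs (persistent worker)."""

    def __init__(self, guarded):
        self.guarded = guarded
        self._object_cache = {}
        self.get_count = 0

    def _fetch(self, filename):
        self.get_count += 1       # one simulated GET
        return 1024               # pretend byte count

    def ensure_cached(self, filename):
        # The guard: skip the fetch if this file was seen in an earlier epoch.
        if self.guarded and filename in self._object_cache:
            return self._object_cache[filename]
        self._object_cache[filename] = self._fetch(filename)
        return self._object_cache[filename]


def gets_per_epoch(reader, files, epochs):
    counts = []
    for _ in range(epochs):
        before = reader.get_count
        for name in files:
            reader.ensure_cached(name)
        counts.append(reader.get_count - before)
    return counts


files = ["a.npz", "b.npz", "c.npz"]
with_guard = gets_per_epoch(ToyReader(guarded=True), files, epochs=3)
without_guard = gets_per_epoch(ToyReader(guarded=False), files, epochs=3)
print(with_guard, without_guard)  # [3, 0, 0] [3, 3, 3]
```

With the guard, epochs 2 and 3 issue no requests at all, which is the zero-traffic symptom reported in the issue; without it, every epoch reads every file.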
✅ Issue #450 — S3 environment variables not honoured
Root cause: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_ENDPOINT_URL were not read from the environment by DLIO's object storage layer.
Fix (pre-existing in DLIO_local_changes, confirmed via regression tests):
- dlio_benchmark/storage/obj_store_lib.py: Reads all standard AWS env vars with YAML config values taking priority. The S3_ENDPOINT_URIS env var enables multi-endpoint load balancing. All covered by TestIssue9_StorageEnvOverrides (11 tests, all pass).
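The stated precedence (YAML value first, environment variable as fallback) might look like the sketch below. The helper name and the YAML key names ("endpoint_url", "access_key_id") are hypothetical; only the environment variable names come from the text above.

```python
import os


def resolve_setting(yaml_cfg, yaml_key, env_var, environ=None):
    """Return the YAML value if set, otherwise fall back to the environment."""
    environ = os.environ if environ is None else environ
    value = yaml_cfg.get(yaml_key)
    if value:                      # YAML config takes priority
        return value
    return environ.get(env_var)    # may be None if neither is set


fake_env = {"AWS_ENDPOINT_URL": "https://env.example:9000",
            "AWS_ACCESS_KEY_ID": "env-key"}
yaml_cfg = {"endpoint_url": "https://yaml.example:9000"}

endpoint = resolve_setting(yaml_cfg, "endpoint_url", "AWS_ENDPOINT_URL", fake_env)
access_key = resolve_setting(yaml_cfg, "access_key_id", "AWS_ACCESS_KEY_ID", fake_env)
print(endpoint, access_key)
```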
✅ Issue #451 — s3dlio BytesView incompatible with MinIO SDK writer
Root cause: When s3dlio is installed, the data generator (dgen-py) returns data as a BytesView (zero-copy Rust buffer). When the storage library is minio, put_data() routes through MinioWriter.write() which called data.encode(), a str-only method. BytesView has no .encode(), so every datagen write crashed with AttributeError.
Fix (our branch, commit dba0caf):
- dlio_benchmark/storage/obj_store_lib.py, MinioWriter.write(): isinstance(data, (bytes, bytearray, memoryview)) goes through the fast path; everything else uses bytes(data) for conversion.
✅ Issue #472 — S3 listing takes 12+ hours for 50M-file datasets
Root cause: list_objects_v2 paginates all object keys sequentially. At 50M files with 1,000 keys/page, that is 50,000 API calls. With WAN or high-latency S3 (100ms/page), this alone takes ~83 minutes. Observed in the field to take 12+ hours.
Partial fix (PR #26, merged): Reduces from N×concurrent listings (one per
rank) to rank-0-only listing. Eliminates the N× amplification, but rank 0 still
paginates all 50M keys.
Full fix (our branch, commits 4e41555, 62916c3, dd2b0c6):
skip_listing: bool = False is a new YAML config option. When enabled, DLIO generates file URIs deterministically without any S3 API calls. The naming convention matches DLIO's own data generator:
{file_prefix}_{index:0N}_of_{num_files}.{format}
Each rank generates only its own slice (round-robin by rank), so there is zero MPI communication and zero S3 listing for the file-discovery phase.
Optional sampling validation (listing_validation_interval: int = 1000): rank 0 checks a sample of the generated URIs via HEAD requests to verify the files actually exist, logging progress as it goes.
Files changed:
- dlio_benchmark/main.py: skip_listing branch in file discovery
- dlio_benchmark/utils/config.py: new skip_listing and listing_validation_interval fields
- dlio_benchmark/storage/obj_store_lib.py: new file_exists() method
- dlio_benchmark/storage/file_storage.py: new file_exists() method
Additional Bugs Found and Fixed During Testing
Bug A — DLIOMPI.allreduce_min() / alltoall() crash in child-process and single-rank contexts
Root cause: PR #26 added allreduce_min() to get_global_map_index() to align sample counts across ranks. When TorchIterableDatasetSimple runs with num_workers=0, it calls worker_init(0) directly in the main thread. This deserializes ConfigArguments via pickle → __setstate__ → DLIOMPI.reset() + set_parent_values(), leaving the singleton in CHILD_INITIALIZED state. The subsequent reconfigure() call then hits allreduce_min() → comm() → raises "called in a child process".
Fix (commit 7319379):
- dlio_benchmark/utils/utility.py: allreduce_min() and alltoall() now short-circuit when mpi_state == CHILD_INITIALIZED or mpi_size <= 1, returning the local value directly. For child processes, MPI collectives are impossible and the local value is authoritative. For single-rank runs, no collective is needed.
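The short-circuit can be sketched with a toy class that has no real MPI dependency. `ToyMPI`, its state enum, and the peer_values list are hypothetical stand-ins for the DLIOMPI singleton and the values other ranks would contribute.

```python
from enum import Enum


class MPIState(Enum):
    UNINITIALIZED = 0
    MPI_INITIALIZED = 1
    CHILD_INITIALIZED = 2


class ToyMPI:
    def __init__(self, state, size, peer_values=()):
        self.state = state
        self.size = size
        self.peer_values = list(peer_values)  # values held by other ranks

    def _comm_min(self, value):
        # Mirrors comm(): unusable once the singleton is in child state.
        if self.state is MPIState.CHILD_INITIALIZED:
            raise RuntimeError("called in a child process")
        return min([value, *self.peer_values])

    def allreduce_min(self, value):
        # Short-circuit: no collective is possible (child) or needed (1 rank).
        if self.state is MPIState.CHILD_INITIALIZED or self.size <= 1:
            return value
        return self._comm_min(value)


child = ToyMPI(MPIState.CHILD_INITIALIZED, size=4).allreduce_min(120)
single = ToyMPI(MPIState.MPI_INITIALIZED, size=1).allreduce_min(120)
multi = ToyMPI(MPIState.MPI_INITIALIZED, size=3,
               peer_values=[100, 130]).allreduce_min(120)
print(child, single, multi)  # 120 120 100
```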
Bug B — DLIOMPI.initialize() raises when called after child-state corruption
Root cause: Same root cause as Bug A. After the first S3 test runs and leaves DLIOMPI in CHILD_INITIALIZED, the second S3 test calls DLIOMPI.get_instance().initialize() which raises "called in a child process" rather than reinitialising.
Fix (commit 31759a8):
- dlio_benchmark/utils/utility.py: initialize() now detects CHILD_INITIALIZED state and, if MPI.Is_initialized() is True (meaning we are actually the main MPI process), resets to UNINITIALIZED and proceeds with normal initialisation. If MPI is not running, it still raises (we genuinely are in a child process).
Bug C — TFRecord + PyTorch incorrectly blocked; extended to full generate+read via s3dlio
Original problem: config.validate() raised "pytorch support for tfrecord is not implemented for pytorch" unconditionally for any config combining framework=pytorch + format=tfrecord, even during pure datagen (generate_data=True, train=False). No data loading happens during generation, so the restriction was wrong.
Initial fix (commit 31759a8):
- dlio_benchmark/utils/config.py: validation guard now checks self.do_train or self.do_eval before raising. TFRecord datagen with pytorch works via s3dlio's put_bytes() path.
Extended fix (uncommitted working-tree changes):
TFRecordReaderS3Iterable reads TFRecord objects as raw bytes via s3dlio.get_many() with no tensorflow/protobuf decoding required. It was already implemented but not correctly routed, and contained a bug causing failures when used as a map-style reader. Three changes:
- dlio_benchmark/reader/reader_factory.py: TFRecordReaderS3Iterable is now selected whenever storage_library == "s3dlio", regardless of storage_type. s3dlio handles both s3:// and file:// URIs, so the old check of storage_type in (S3, AISTORE) was both too narrow (missed file://) and too broad (did not guarantee s3dlio was the library).
- dlio_benchmark/utils/config.py: The pytorch+tfrecord restriction for train/eval now checks storage_library != "s3dlio" rather than storage_type not in (S3, AISTORE). TFRecord reading with pytorch is supported exclusively through s3dlio. All other paths (TFReader via tf.io.gfile, DALI) still raise the original error.
- dlio_benchmark/reader/tfrecord_reader_s3_iterable.py: read_index() was calling super().read_index() which resolved to NPYReader.read_index() → _localfs_ensure_cached(), attempting to open an S3/object URI as a local file. Fixed to call FormatReader.read_index() directly, bypassing NPYReader.
Also: in _s3_iterable_mixin.py, storage_library now defaults to "s3dlio" instead of raising ValueError when not set in the YAML.
Result: TFRecord generate + read works end-to-end via s3dlio for both s3:// and file:// URIs with no tensorflow installation required. Confirmed by test_s3dlio_tfrecord_datagen_and_read (live S3 test, PASSED).
Bug D — _S3_EXTENDED undefined in test file / hardcoded server IP
Root cause: test_s3dlio_object_store.py referenced _S3_EXTENDED at module level without defining it, causing a NameError that prevented all test collection. Additionally, _endpoint() had a hardcoded fallback IP, a real internal server address in a public-facing test file.
Fix (commits 7319379, 31759a8 + uncommitted):
- _S3_EXTENDED defined as os.environ.get("DLIO_S3_EXTENDED", "0") == "1"
- _endpoint() calls pytest.skip() if AWS_ENDPOINT_URL is not set
- test_s3dlio_tfrecord_datagen renamed to test_s3dlio_tfrecord_datagen_and_read and extended with a full read phase
Complete File Change Index
- dlio_benchmark/main.py
- dlio_benchmark/data_loader/torch_data_loader.py
- dlio_benchmark/reader/_local_fs_iterable_mixin.py
- dlio_benchmark/reader/_s3_iterable_mixin.py
- dlio_benchmark/reader/reader_factory.py
- dlio_benchmark/reader/tfrecord_reader_s3_iterable.py
- dlio_benchmark/utils/config.py
- dlio_benchmark/utils/utility.py
- dlio_benchmark/storage/obj_store_lib.py
- dlio_benchmark/storage/file_storage.py
- tests/test_s3dlio_object_store.py
Test Results
Non-S3 (CI gate, no external dependencies)
Live S3 (against MinIO at 172.16.1.40:9000, HTTPS with self-signed cert)
Local Branch Commits (above origin/main)
Uncommitted working-tree changes (not yet in any commit):
- dlio_benchmark/reader/_s3_iterable_mixin.py: storage_library defaults to "s3dlio"
- dlio_benchmark/reader/reader_factory.py: TFRecord routing keyed on storage_library
- dlio_benchmark/reader/tfrecord_reader_s3_iterable.py: read_index() fix; FormatReader import
- dlio_benchmark/utils/config.py: TFRecord validation keyed on storage_library
- tests/test_s3dlio_object_store.py: TFRecord test renamed + read phase; comment fixes
- docs/DLRM-Parquet-S3-Throughput-Analysis.md: removed stale boto3 wording
Base commit (origin/main):