fix/better-handle-concurrent-uploads #1278
matthewlouisbrockman wants to merge 21 commits into main
Conversation
- Bound `writeFiles` upload batches to the existing worker queues while preserving fail-fast behavior on non-retryable errors. In-flight uploads may finish or be aborted, but workers stop starting new uploads once the batch is doomed (sketched below).
- Narrow JS upload retries to known transport and resource saturation codes plus explicit network error classes/names, avoiding broad retries for bare `TypeError`/message matches.
- Remove dead envd 429 handling from the JS and Python `writeFiles` upload paths.
- Update JS and Python async tests to cover targeted retries, no retry for bare fetch failures, and stopping upload issuance after non-retryable errors.
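A rough, illustration-only sketch of the intended batch semantics, written in Python for brevity (the actual behavior lives in the JS and Python SDK worker queues; `upload_one` and `max_uploads` are placeholder names, not SDK internals):

```python
import asyncio

async def upload_batch(files, upload_one, max_uploads: int = 10):
    """Upload files with bounded concurrency and fail fast on non-retryable errors."""
    semaphore = asyncio.Semaphore(max_uploads)
    failures: list[BaseException] = []  # first non-retryable error dooms the batch

    async def worker(file):
        async with semaphore:
            if failures:
                # Batch is already doomed: stop issuing new uploads.
                return
            try:
                await upload_one(file)
            except Exception as exc:
                failures.append(exc)
                raise

    # In-flight uploads are allowed to settle; no new ones start once doomed.
    results = await asyncio.gather(*(worker(f) for f in files), return_exceptions=True)
    for result in results:
        if isinstance(result, BaseException):
            raise result  # surface the failure to the caller
```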
Prepare each async write_files octet-stream payload once per queued file and reuse that bytes payload across transient upload retries. This prevents file-like inputs from being consumed on the first failed attempt and retried as empty content. Add a BytesIO regression test that verifies retry attempts send the original upload body.
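A minimal sketch of the materialize-once idea under these assumptions (hypothetical helper names, not the SDK's actual code):

```python
import io

def to_upload_body(data: bytes | str | io.IOBase) -> bytes:
    """Materialize the octet-stream payload once, before any retry loop runs."""
    if isinstance(data, bytes):
        return data
    if isinstance(data, str):
        return data.encode("utf-8")
    # A file-like object is read exactly once; retries reuse these bytes rather
    # than re-reading an already-consumed stream and sending empty content.
    return data.read()

async def upload_with_retries(data, send, attempts: int = 4) -> None:
    body = to_upload_body(data)  # prepared once per queued file
    last_error: Exception | None = None
    for _ in range(attempts):
        try:
            await send(body)  # every attempt sends the identical byte payload
            return
        except ConnectionError as exc:  # stand-in for the SDK's transient-error check
            last_error = exc
    assert last_error is not None
    raise last_error
```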
…le with the existing node >=20
Capture serialized writeFiles upload bodies in the JS queue tests and verify a gzip upload retry reuses a non-empty, byte-identical body across attempts. This documents that toUploadBody materializes retryable upload bodies before retryFileUpload.
🦋 Changeset detected
Latest commit: f4bdff6
The changes in this PR will be included in the next version bump. This PR includes changesets to release 2 packages.
PR Summary: Medium Risk
Reviewed by Cursor Bugbot for commit f4bdff6.
Package Artifacts
Built from 32f7f1e. Download artifacts from this workflow run.
- JS SDK: `npm install ./e2b-2.19.1-handle-concurrent-uploads.0.tgz`
- CLI: `npm install ./e2b-cli-2.9.1-handle-concurrent-uploads.0.tgz`
- Python SDK: `pip install ./e2b-2.20.0+handle.concurrent.uploads-py3-none-any.whl`
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bdfaddd9ed
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Additional findings (outside current diff — PR may have been updated during review):
- 🟡 `packages/python-sdk/e2b/sandbox_async/filesystem/filesystem.py:115-128` — The module-level `_GLOBAL_FILE_UPLOAD_SEMAPHORES` dict is keyed by `(id(asyncio.get_running_loop()), max_uploads)`, but entries are never removed when event loops are destroyed, causing unbounded memory growth in test environments where pytest-asyncio creates a new event loop per test. On top of the memory leak, Python's ability to reuse memory addresses means a new event loop can get the same `id()` as a previous one, potentially returning a stale `asyncio.Semaphore` — though this risk is significantly mitigated by asyncio internals. To fix, key the dict only by `max_uploads` (matching the JS SDK approach) or use weakref-based loop tracking to evict entries on loop destruction.

Extended reasoning...
What the bug is
`_GLOBAL_FILE_UPLOAD_SEMAPHORES` is a module-level `dict[tuple[int, int], asyncio.Semaphore]` introduced in this PR, keyed by `(id(asyncio.get_running_loop()), max_uploads)`. The intent is to scope the global semaphore to the current event loop so that asyncio constraints are respected. However, entries are never removed from the dict when their associated event loop is destroyed, making this a memory leak that grows linearly with the number of event loops created over the process lifetime.

The specific code path
`_get_global_file_upload_semaphore()` (lines 115–128) is called on every file upload retry attempt. It calls `id(asyncio.get_running_loop())` and uses the result as part of the dict key. When the event loop is later destroyed (e.g., at the end of a pytest-asyncio test), the dict entry for that `(loop_id, max_uploads)` pair is never cleaned up. The next test gets a fresh event loop, generates a new entry, and the old one remains.

Why existing code doesn't prevent it
There is no `__del__`, no weak reference, no loop lifecycle hook, and no cleanup call site. Python's `asyncio` does not provide a callback mechanism for loop destruction that could be used to purge stale entries. The dict simply accumulates one entry per `(loop_id, max_uploads)` pair seen over the process lifetime.

Impact and the stale-semaphore scenario
The primary confirmed impact is an unbounded memory leak in testing environments. The project's `pytest.ini` sets `asyncio_mode = auto`, meaning pytest-asyncio creates a new event loop per test by default, so every test that exercises `write_files` with octet-stream uploads adds an entry that is never freed.

The scarier scenario — ID reuse causing two tests to share a semaphore — is real but significantly mitigated by asyncio internals. As verifiers noted, `asyncio.Semaphore` only binds `self._loop` on the contended code path (when `_value == 0`). If the semaphore was never contended, `self._loop` stays `None` and a stale semaphore lazily re-binds to the new loop correctly. However, if a semaphore was fully exhausted and `self._loop` was set, asyncio holds a strong reference to the old loop object, preventing GC and thus preventing ID reuse in that case. The realistic remaining risk is a test interrupted mid-upload leaving a semaphore with `_value < max_uploads`, which a later test could inherit if ID collision occurs.

Step-by-step proof of the memory leak
1. Test A runs: `asyncio.get_running_loop()` returns `loop_A` with `id(loop_A) = 0x7f00`. `_GLOBAL_FILE_UPLOAD_SEMAPHORES[(0x7f00, 128)]` is created.
2. Test A ends: pytest-asyncio destroys `loop_A`. The dict still holds `{(0x7f00, 128): <Semaphore>}`.
3. Test B runs: a new `loop_B` is created with `id(loop_B) = 0x7f40`. `_GLOBAL_FILE_UPLOAD_SEMAPHORES[(0x7f40, 128)]` is created.
4. After N tests, the dict has N entries, none of which will ever be freed.
How to fix
The JS SDK avoids this entirely by keying `globalFileUploadSemaphores` only by `maxUploads` (a single integer), since a `Semaphore`-equivalent in JS is not loop-scoped. For Python, the simplest fix is the same: key only by `max_uploads`. Python's `asyncio.Semaphore` lazily binds to whichever loop first exhausts it, so a single process-level semaphore per limit value works correctly in production (one long-lived loop) and in tests (the semaphore re-binds on each new loop if it was never contended or was fully released before the loop ended).
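A minimal sketch of that keying change, reusing the names from the report (the actual SDK code may differ):

```python
import asyncio

# Keyed only by the upload limit (mirroring the JS SDK's globalFileUploadSemaphores),
# so no per-loop entries accumulate and nothing needs to be evicted.
_GLOBAL_FILE_UPLOAD_SEMAPHORES: dict[int, asyncio.Semaphore] = {}

def _get_global_file_upload_semaphore(max_uploads: int) -> asyncio.Semaphore:
    semaphore = _GLOBAL_FILE_UPLOAD_SEMAPHORES.get(max_uploads)
    if semaphore is None:
        # asyncio.Semaphore binds to an event loop lazily, so a process-level
        # instance per limit value is safe for a long-lived production loop.
        semaphore = asyncio.Semaphore(max_uploads)
        _GLOBAL_FILE_UPLOAD_SEMAPHORES[max_uploads] = semaphore
    return semaphore
```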
- 🔴 `packages/js-sdk/src/sandbox/filesystem/index.ts:693-703` — In `writeFiles`, `this.connectionConfig.getSignal(writeOpts?.requestTimeoutMs)` is called inside the `retryFileUpload` closure, so each retry attempt creates a fresh `AbortSignal.timeout(requestTimeoutMs)` starting from zero. With `DEFAULT_FILE_UPLOAD_RETRY_ATTEMPTS = 4` and e.g. `requestTimeoutMs = 5000`, the total wall-clock time before failure can reach 4×5s + backoff (~20+ seconds), far exceeding what callers expect from the timeout parameter. Fix by calling `getSignal` once before entering `retryFileUpload` and capturing the resulting signal in the closure.

Extended reasoning...
What the bug is and how it manifests
`ConnectionConfig.getSignal(ms)` calls `AbortSignal.timeout(ms)` on every invocation, returning a brand-new timer starting from zero. In the new `writeFiles` implementation, `getSignal(writeOpts?.requestTimeoutMs)` is placed inside the `fn` closure that is passed to `retryFileUpload`. Because `retryFileUpload` calls `fn` on every attempt, each attempt creates a fresh signal with a full `requestTimeoutMs` window.

The specific code path
In `packages/js-sdk/src/sandbox/filesystem/index.ts` (lines 693–703), the `mapWithConcurrency` callback invokes `retryFileUpload(async () => { const timeoutSignal = this.connectionConfig.getSignal(writeOpts?.requestTimeoutMs); ... })`. The `retryFileUpload` loop (lines 221–239) re-invokes this closure on every attempt after a transient error and a backoff delay.

Why existing safeguards do not prevent it
The refutation correctly notes that `isRetryableFileUploadError` returns `false` for `AbortError`, so a timed-out attempt is not retried — the first timeout that fires does propagate. However, this only applies to an attempt that reaches its timeout. When an attempt fails with a transient network error before the timeout fires (e.g. `ECONNRESET` after 50ms), the signal from that attempt is discarded and a completely fresh signal is created for the next attempt. The net effect is that each retry window is independent, not cumulative.

Impact
A caller who passes `requestTimeoutMs: 5000` expects the entire upload operation to take at most ~5 s. With four attempts and transient failures on attempts 1–3, the actual wall-clock time is up to 4×5000 ms + backoff (~20+ s). This breaks deadline-sensitive callers and makes the `requestTimeoutMs` option semantically misleading in the presence of retries.

How to fix it
Call `getSignal` once before passing `fn` to `retryFileUpload`, then capture the pre-created signal in the closure:

```ts
const timeoutSignal = this.connectionConfig.getSignal(writeOpts?.requestTimeoutMs)

return retryFileUpload(async () => {
  const { signal, cleanup } = timeoutSignal
    ? combineAbortSignals([abortSignal, timeoutSignal])
    : { signal: abortSignal, cleanup: () => {} }
  // ... existing POST call
})
```
This gives all retry attempts a shared deadline rather than independent fresh countdowns.
Step-by-step proof
1. Caller invokes `writeFiles([file], { requestTimeoutMs: 5000 })`.
2. `mapWithConcurrency` calls the per-file callback, which calls `retryFileUpload(fn)`.
3. `retryFileUpload` enters its loop, attempt=1, calls `globalUploads.run(fn)`.
4. Inside `fn`: `getSignal(5000)` → `AbortSignal.timeout(5000)` — T=0, fires at T=5s.
5. At T=0.05s the TCP connection drops; `ECONNRESET` is thrown, caught as retryable.
6. `wait(250)` backoff. attempt=2, calls `globalUploads.run(fn)` again.
7. Inside `fn` again: `getSignal(5000)` → new `AbortSignal.timeout(5000)` — T=0 again, fires at T=5.3s from wall-clock start.
8. Steps 5–7 repeat for attempts 3 and 4. Total window: up to ~20.75s, not the expected 5s.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit f4bdff6.
