fix/better-handle-concurrent-uploads #1278
matthewlouisbrockman wants to merge 21 commits into main
Conversation
- Bound `writeFiles` upload batches to the existing worker queues while preserving fail-fast behavior on non-retryable errors. In-flight uploads may finish or be aborted, but workers stop starting new uploads once the batch is doomed (sketched below).
- Narrow JS upload retries to known transport and resource saturation codes plus explicit network error classes/names, avoiding broad retries for bare `TypeError`/message matches.
- Remove dead envd 429 handling from the JS and Python `writeFiles` upload paths.
- Update JS and Python async tests to cover targeted retries, no retry for bare fetch failures, and stopping upload issuance after non-retryable errors.
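A rough, illustration-only sketch of the intended batch semantics, written in Python for brevity (the actual behavior lives in the JS and Python SDK worker queues; `upload_one` and `max_uploads` are placeholder names, not SDK internals):

```python
import asyncio

async def upload_batch(files, upload_one, max_uploads: int = 10):
    """Upload files with bounded concurrency and fail fast on non-retryable errors."""
    semaphore = asyncio.Semaphore(max_uploads)
    failures: list[BaseException] = []  # first non-retryable error dooms the batch

    async def worker(file):
        async with semaphore:
            if failures:
                # Batch is already doomed: stop issuing new uploads.
                return
            try:
                await upload_one(file)
            except Exception as exc:
                failures.append(exc)
                raise

    # In-flight uploads are allowed to settle; no new ones start once doomed.
    results = await asyncio.gather(*(worker(f) for f in files), return_exceptions=True)
    for result in results:
        if isinstance(result, BaseException):
            raise result  # surface the failure to the caller
```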
Prepare each async write_files octet-stream payload once per queued file and reuse that bytes payload across transient upload retries. This prevents file-like inputs from being consumed on the first failed attempt and retried as empty content. Add a BytesIO regression test that verifies retry attempts send the original upload body.
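A minimal sketch of the materialize-once idea under these assumptions (hypothetical helper names, not the SDK's actual code):

```python
import io

def to_upload_body(data: bytes | str | io.IOBase) -> bytes:
    """Materialize the octet-stream payload once, before any retry loop runs."""
    if isinstance(data, bytes):
        return data
    if isinstance(data, str):
        return data.encode("utf-8")
    # A file-like object is read exactly once; retries reuse these bytes rather
    # than re-reading an already-consumed stream and sending empty content.
    return data.read()

async def upload_with_retries(data, send, attempts: int = 4) -> None:
    body = to_upload_body(data)  # prepared once per queued file
    last_error: Exception | None = None
    for _ in range(attempts):
        try:
            await send(body)  # every attempt sends the identical byte payload
            return
        except ConnectionError as exc:  # stand-in for the SDK's transient-error check
            last_error = exc
    assert last_error is not None
    raise last_error
```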
…le with the existing node >=20
Capture serialized writeFiles upload bodies in the JS queue tests and verify a gzip upload retry reuses a non-empty, byte-identical body across attempts. This documents that toUploadBody materializes retryable upload bodies before retryFileUpload.
🦋 Changeset detected
Latest commit: f4bdff6
The changes in this PR will be included in the next version bump. This PR includes changesets to release 2 packages.
PR Summary: Medium Risk
Reviewed by Cursor Bugbot for commit f4bdff6.
Package Artifacts
Built from 32f7f1e. Download artifacts from this workflow run.
- JS SDK: `npm install ./e2b-2.19.1-handle-concurrent-uploads.0.tgz`
- CLI: `npm install ./e2b-cli-2.9.1-handle-concurrent-uploads.0.tgz`
- Python SDK: `pip install ./e2b-2.20.0+handle.concurrent.uploads-py3-none-any.whl`
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bdfaddd9ed
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Additional findings (outside current diff — PR may have been updated during review):
- 🟡 `packages/python-sdk/e2b/sandbox_async/filesystem/filesystem.py:115-128` — The module-level `_GLOBAL_FILE_UPLOAD_SEMAPHORES` dict is keyed by `(id(asyncio.get_running_loop()), max_uploads)`, but entries are never removed when event loops are destroyed, causing unbounded memory growth in test environments where pytest-asyncio creates a new event loop per test. On top of the memory leak, Python's ability to reuse memory addresses means a new event loop can get the same `id()` as a previous one, potentially returning a stale `asyncio.Semaphore` — though this risk is significantly mitigated by asyncio internals. To fix, key the dict only by `max_uploads` (matching the JS SDK approach) or use weakref-based loop tracking to evict entries on loop destruction.

Extended reasoning...
What the bug is
`_GLOBAL_FILE_UPLOAD_SEMAPHORES` is a module-level `dict[tuple[int, int], asyncio.Semaphore]` introduced in this PR, keyed by `(id(asyncio.get_running_loop()), max_uploads)`. The intent is to scope the global semaphore to the current event loop so that asyncio constraints are respected. However, entries are never removed from the dict when their associated event loop is destroyed, making this a memory leak that grows linearly with the number of event loops created over the process lifetime.

The specific code path
`_get_global_file_upload_semaphore()` (lines 115–128) is called on every file upload retry attempt. It calls `id(asyncio.get_running_loop())` and uses the result as part of the dict key. When the event loop is later destroyed (e.g., at the end of a pytest-asyncio test), the dict entry for that `(loop_id, max_uploads)` pair is never cleaned up. The next test gets a fresh event loop, generates a new entry, and the old one remains.

Why existing code doesn't prevent it
There is no `__del__`, no weak reference, no loop lifecycle hook, and no cleanup call site. Python's `asyncio` does not provide a callback mechanism for loop destruction that could be used to purge stale entries. The dict simply accumulates one entry per `(loop_id, max_uploads)` pair seen over the process lifetime.

Impact and the stale-semaphore scenario
The primary confirmed impact is an unbounded memory leak in testing environments. The project's `pytest.ini` sets `asyncio_mode = auto`, meaning pytest-asyncio creates a new event loop per test by default, so every test that exercises `write_files` with octet-stream uploads adds an entry that is never freed.

The scarier scenario — ID reuse causing two tests to share a semaphore — is real but significantly mitigated by asyncio internals. As verifiers noted, `asyncio.Semaphore` only binds `self._loop` on the contended code path (when `_value == 0`). If the semaphore was never contended, `self._loop` stays `None` and a stale semaphore lazily re-binds to the new loop correctly. However, if a semaphore was fully exhausted and `self._loop` was set, asyncio holds a strong reference to the old loop object, preventing GC and thus preventing ID reuse in that case. The realistic remaining risk is a test interrupted mid-upload leaving a semaphore with `_value < max_uploads`, which a later test could inherit if ID collision occurs.

Step-by-step proof of the memory leak
1. Test A runs: `asyncio.get_running_loop()` returns `loop_A` with `id(loop_A) = 0x7f00`. `_GLOBAL_FILE_UPLOAD_SEMAPHORES[(0x7f00, 128)]` is created.
2. Test A ends: pytest-asyncio destroys `loop_A`. The dict still holds `{(0x7f00, 128): <Semaphore>}`.
3. Test B runs: a new `loop_B` is created with `id(loop_B) = 0x7f40`. `_GLOBAL_FILE_UPLOAD_SEMAPHORES[(0x7f40, 128)]` is created.
4. After N tests, the dict has N entries, none of which will ever be freed.
How to fix
The JS SDK avoids this entirely by keying `globalFileUploadSemaphores` only by `maxUploads` (a single integer), since a `Semaphore`-equivalent in JS is not loop-scoped. For Python, the simplest fix is the same: key only by `max_uploads`. Python's `asyncio.Semaphore` lazily binds to whichever loop first exhausts it, so a single process-level semaphore per limit value works correctly in production (one long-lived loop) and in tests (the semaphore re-binds on each new loop if it was never contended or was fully released before the loop ended).
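A minimal sketch of that keying change, reusing the names from the report (the actual SDK code may differ):

```python
import asyncio

# Keyed only by the upload limit (mirroring the JS SDK's globalFileUploadSemaphores),
# so no per-loop entries accumulate and nothing needs to be evicted.
_GLOBAL_FILE_UPLOAD_SEMAPHORES: dict[int, asyncio.Semaphore] = {}

def _get_global_file_upload_semaphore(max_uploads: int) -> asyncio.Semaphore:
    semaphore = _GLOBAL_FILE_UPLOAD_SEMAPHORES.get(max_uploads)
    if semaphore is None:
        # asyncio.Semaphore binds to an event loop lazily, so a process-level
        # instance per limit value is safe for a long-lived production loop.
        semaphore = asyncio.Semaphore(max_uploads)
        _GLOBAL_FILE_UPLOAD_SEMAPHORES[max_uploads] = semaphore
    return semaphore
```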
- 🔴 `packages/js-sdk/src/sandbox/filesystem/index.ts:693-703` — In `writeFiles`, `this.connectionConfig.getSignal(writeOpts?.requestTimeoutMs)` is called inside the `retryFileUpload` closure, so each retry attempt creates a fresh `AbortSignal.timeout(requestTimeoutMs)` starting from zero. With `DEFAULT_FILE_UPLOAD_RETRY_ATTEMPTS = 4` and e.g. `requestTimeoutMs = 5000`, the total wall-clock time before failure can reach 4×5s + backoff (~20+ seconds), far exceeding what callers expect from the timeout parameter. Fix by calling `getSignal` once before entering `retryFileUpload` and capturing the resulting signal in the closure.

Extended reasoning...
What the bug is and how it manifests
`ConnectionConfig.getSignal(ms)` calls `AbortSignal.timeout(ms)` on every invocation, returning a brand-new timer starting from zero. In the new `writeFiles` implementation, `getSignal(writeOpts?.requestTimeoutMs)` is placed inside the `fn` closure that is passed to `retryFileUpload`. Because `retryFileUpload` calls `fn` on every attempt, each attempt creates a fresh signal with a full `requestTimeoutMs` window.

The specific code path
In `packages/js-sdk/src/sandbox/filesystem/index.ts` (lines 693–703), the `mapWithConcurrency` callback invokes `retryFileUpload(async () => { const timeoutSignal = this.connectionConfig.getSignal(writeOpts?.requestTimeoutMs); ... })`. The `retryFileUpload` loop (lines 221–239) re-invokes this closure on every attempt after a transient error and a backoff delay.

Why existing safeguards do not prevent it
The refutation correctly notes that `isRetryableFileUploadError` returns `false` for `AbortError`, so a timed-out attempt is not retried — the first timeout that fires does propagate. However, this only applies to an attempt that reaches its timeout. When an attempt fails with a transient network error before the timeout fires (e.g. `ECONNRESET` after 50ms), the signal from that attempt is discarded and a completely fresh signal is created for the next attempt. The net effect is that each retry window is independent, not cumulative.

Impact
A caller who passes `requestTimeoutMs: 5000` expects the entire upload operation to take at most ~5 s. With four attempts and transient failures on attempts 1–3, the actual wall-clock time is up to 4×5000 ms + backoff (~20+ s). This breaks deadline-sensitive callers and makes the `requestTimeoutMs` option semantically misleading in the presence of retries.

How to fix it
Call `getSignal` once before passing `fn` to `retryFileUpload`, then capture the pre-created signal in the closure:

```ts
const timeoutSignal = this.connectionConfig.getSignal(writeOpts?.requestTimeoutMs)

return retryFileUpload(async () => {
  const { signal, cleanup } = timeoutSignal
    ? combineAbortSignals([abortSignal, timeoutSignal])
    : { signal: abortSignal, cleanup: () => {} }
  // ... existing POST call
})
```
This gives all retry attempts a shared deadline rather than independent fresh countdowns.
Step-by-step proof
1. Caller invokes `writeFiles([file], { requestTimeoutMs: 5000 })`.
2. `mapWithConcurrency` calls the per-file callback, which calls `retryFileUpload(fn)`.
3. `retryFileUpload` enters its loop, attempt=1, calls `globalUploads.run(fn)`.
4. Inside `fn`: `getSignal(5000)` → `AbortSignal.timeout(5000)` — T=0, fires at T=5s.
5. At T=0.05s the TCP connection drops; `ECONNRESET` is thrown, caught as retryable.
6. `wait(250)` backoff. attempt=2, calls `globalUploads.run(fn)` again.
7. Inside `fn` again: `getSignal(5000)` → new `AbortSignal.timeout(5000)` — T=0 again, fires at T=5.3s from wall-clock start.
8. Steps 5–7 repeat for attempts 3 and 4. Total window: up to ~20.75s, not the expected 5s.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit f4bdff6.
