
fix/better-handle-concurrent-uploads #1278

Open

matthewlouisbrockman wants to merge 21 commits into main from handle-concurrent-uploads

Conversation

matthewlouisbrockman (Contributor) commented Apr 20, 2026

  • Queue octet-stream file uploads in the JS SDK and async Python SDK with per-call and global concurrency caps.
  • Handle retries for failed uploads.

Bound writeFiles upload batches to the existing worker queues while preserving fail-fast behavior on non-retryable errors. In-flight uploads may finish or be aborted, but workers stop starting new uploads once the batch is doomed.

Narrow JS upload retries to known transport and resource saturation codes plus explicit network error classes/names, avoiding broad retries for bare TypeError/message matches. Remove dead envd 429 handling from JS and Python writeFiles upload paths.
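
For illustration, a minimal sketch of such a narrowed predicate; isRetryableFileUploadError is the name used in this PR, but the specific codes and error names below are assumptions rather than the SDK's final set:

    // Illustrative only: retry known transport and resource saturation codes
    // plus explicit network error classes/names, nothing broader.
    const RETRYABLE_CODES = new Set([
      'ECONNRESET', // transport: connection dropped mid-request
      'ETIMEDOUT',  // transport: socket timed out
      'EPIPE',      // transport: wrote to a closed socket
      'EMFILE',     // resource saturation: too many open file descriptors
      'ENFILE',     // resource saturation: system-wide file table full
    ])

    function isRetryableFileUploadError(err: unknown): boolean {
      if (!(err instanceof Error)) return false
      // Never retry aborts: a fired timeout or caller cancellation must propagate.
      if (err.name === 'AbortError') return false
      const code = (err as { code?: string }).code
      if (code !== undefined && RETRYABLE_CODES.has(code)) return true
      // Explicit error names only; no bare TypeError or message-substring
      // matching, which would also retry programmer errors.
      return err.name === 'FetchError' || err.name === 'ConnectTimeoutError'
    }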

Update JS and Python async tests to cover targeted retries, no retry for bare fetch failures, and stopping upload issuance after non-retryable errors.

Prepare each async write_files octet-stream payload once per queued file and reuse that bytes payload across transient upload retries. This prevents file-like inputs from being consumed on the first failed attempt and retried as empty content.

Add a BytesIO regression test that verifies retry attempts send the original upload body.

Capture serialized writeFiles upload bodies in the JS queue tests and verify a gzip upload retry reuses a non-empty, byte-identical body across attempts. This documents that toUploadBody materializes retryable upload bodies before retryFileUpload.
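
As a sketch of the materialize-once pattern (toUploadBody and retryFileUpload are names from this PR; the bodies below are simplified stand-ins, not the SDK's implementations):

    // Materialize the payload once, before any attempt, so a stream or
    // file-like input cannot be consumed by a failed first attempt.
    async function toUploadBody(
      data: Uint8Array | Blob | ReadableStream<Uint8Array>
    ): Promise<Blob> {
      if (data instanceof Blob) return data
      if (data instanceof Uint8Array) return new Blob([data])
      // Drain the stream exactly once into an in-memory Blob.
      return await new Response(data).blob()
    }

    async function retryFileUpload<T>(fn: () => Promise<T>, attempts = 4): Promise<T> {
      for (let attempt = 1; ; attempt++) {
        try {
          return await fn()
        } catch (err) {
          // Fail fast on non-retryable errors (see the predicate sketched
          // above) and on retry exhaustion.
          if (attempt >= attempts || !isRetryableFileUploadError(err)) throw err
          await new Promise((resolve) => setTimeout(resolve, 250)) // backoff between attempts
        }
      }
    }

    // Usage: the body is materialized once; every retry resends identical bytes.
    // const body = await toUploadBody(fileData)
    // await retryFileUpload(() => fetch(uploadUrl, { method: 'POST', body }))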

changeset-bot Bot commented Apr 20, 2026

🦋 Changeset detected

Latest commit: f4bdff6

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 2 packages:

    Name               Type
    @e2b/python-sdk    Minor
    e2b                Minor


cursor Bot commented Apr 20, 2026

PR Summary

Medium Risk
Changes the file-upload execution model (batching, global semaphores, retries, and timeout semantics) in both JS and Python SDKs, which could impact upload performance and failure behavior under load. Risk is mitigated by extensive new unit tests covering concurrency, retries, timeouts, and cancellation.

Overview
Adds configurable concurrency limits and retry-with-backoff for octet-stream file uploads in the JS and Python SDKs.

ConnectionConfig now exposes per-call and process-wide upload concurrency caps plus a retry-attempts setting (also configurable via E2B_MAX_CONCURRENT_FILE_UPLOADS, E2B_MAX_GLOBAL_CONCURRENT_FILE_UPLOADS, E2B_FILE_UPLOAD_RETRY_ATTEMPTS), with strict positive-integer validation.
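
A sketch of the strict positive-integer validation; the helper name is illustrative, and of the defaults below only the 4 retry attempts (DEFAULT_FILE_UPLOAD_RETRY_ATTEMPTS) and the 128 global slots appear elsewhere in this PR, while the per-call default of 10 is a stand-in:

    // Read a positive-integer setting from the environment, falling back to a
    // default and rejecting anything that is not a whole number greater than zero.
    function readPositiveInt(name: string, fallback: number): number {
      const raw = process.env[name]
      if (raw === undefined || raw === '') return fallback
      const value = Number(raw)
      if (!Number.isInteger(value) || value <= 0) {
        throw new Error(`${name} must be a positive integer, got "${raw}"`)
      }
      return value
    }

    const maxConcurrentFileUploads = readPositiveInt('E2B_MAX_CONCURRENT_FILE_UPLOADS', 10)
    const maxGlobalConcurrentFileUploads = readPositiveInt('E2B_MAX_GLOBAL_CONCURRENT_FILE_UPLOADS', 128)
    const fileUploadRetryAttempts = readPositiveInt('E2B_FILE_UPLOAD_RETRY_ATTEMPTS', 4)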

Filesystem writeFiles for octet-stream uploads now runs uploads through a bounded queue, applies a global limiter, retries transient transport failures, and treats requestTimeout/requestTimeoutMs as a per-file budget spanning queue wait + retries + backoff; new JS/Python tests cover these behaviors (including aborting in-flight work on first error and preserving bodies across retries).
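
As a sketch of the overall shape (mapWithConcurrency is a name that appears in this PR; the bodies below are simplified stand-ins, and the deadline handling assumes signals are created at queue time so queue wait counts against the budget):

    // Run fn over items with at most `limit` concurrent workers. Workers stop
    // picking up new items as soon as any item fails (fail fast); in-flight
    // work is left to finish or abort on its own.
    async function mapWithConcurrency<T, R>(
      items: T[],
      limit: number,
      fn: (item: T) => Promise<R>
    ): Promise<R[]> {
      const results: R[] = new Array(items.length)
      let next = 0
      let firstError: unknown
      const worker = async () => {
        while (firstError === undefined && next < items.length) {
          const i = next++
          try {
            results[i] = await fn(items[i])
          } catch (err) {
            firstError = firstError ?? err
          }
        }
      }
      await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker))
      if (firstError !== undefined) throw firstError
      return results
    }

    // Create each file's deadline up front, so time spent waiting in the
    // queue draws from the same per-file budget as retries and backoff.
    async function writeFilesSketch(
      files: { path: string; body: Blob }[],
      upload: (file: { path: string; body: Blob }, signal: AbortSignal) => Promise<void>,
      perCallLimit: number,
      requestTimeoutMs: number
    ): Promise<void> {
      const jobs = files.map((file) => ({
        file,
        deadline: AbortSignal.timeout(requestTimeoutMs),
      }))
      await mapWithConcurrency(jobs, perCallLimit, ({ file, deadline }) =>
        upload(file, deadline)
      )
    }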

Reviewed by Cursor Bugbot for commit f4bdff6. Bugbot is set up for automated code reviews on this repo. Configure here.

github-actions Bot commented Apr 20, 2026

Package Artifacts

Built from 32f7f1e. Download artifacts from this workflow run.

JS SDK (e2b@2.19.1-handle-concurrent-uploads.0):

npm install ./e2b-2.19.1-handle-concurrent-uploads.0.tgz

CLI (@e2b/cli@2.9.1-handle-concurrent-uploads.0):

npm install ./e2b-cli-2.9.1-handle-concurrent-uploads.0.tgz

Python SDK (e2b==2.20.0+handle-concurrent-uploads):

pip install ./e2b-2.20.0+handle.concurrent.uploads-py3-none-any.whl

matthewlouisbrockman marked this pull request as ready for review April 20, 2026 18:39

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bdfaddd9ed

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

claude Bot left a comment

Additional findings (outside current diff — PR may have been updated during review):

  • 🟡 packages/python-sdk/e2b/sandbox_async/filesystem/filesystem.py:115-128 — The module-level _GLOBAL_FILE_UPLOAD_SEMAPHORES dict is keyed by (id(asyncio.get_running_loop()), max_uploads) but entries are never removed when event loops are destroyed, causing unbounded memory growth in test environments where pytest-asyncio creates a new event loop per test. On top of the memory leak, Python's ability to reuse memory addresses means a new event loop can get the same id() as a previous one, potentially returning a stale asyncio.Semaphore — though this risk is significantly mitigated by asyncio internals. To fix, key the dict only by max_uploads (matching the JS SDK approach) or use weakref-based loop tracking to evict entries on loop destruction.

    Extended reasoning...

    What the bug is

    _GLOBAL_FILE_UPLOAD_SEMAPHORES is a module-level dict[tuple[int, int], asyncio.Semaphore] introduced in this PR, keyed by (id(asyncio.get_running_loop()), max_uploads). The intent is to scope the global semaphore to the current event loop so that asyncio constraints are respected. However, entries are never removed from the dict when their associated event loop is destroyed, making this a memory leak that grows linearly with the number of event loops created over the process lifetime.

    The specific code path

    _get_global_file_upload_semaphore() (lines 115–128) is called on every file upload retry attempt. It calls id(asyncio.get_running_loop()) and uses the result as part of the dict key. When the event loop is later destroyed (e.g., at the end of a pytest-asyncio test), the dict entry for that (loop_id, max_uploads) pair is never cleaned up. The next test gets a fresh event loop, generates a new entry, and the old one remains.

    Why existing code doesn't prevent it

    There is no __del__, no weak reference, no loop lifecycle hook, and no cleanup call site. Python's asyncio does not provide a callback mechanism for loop destruction that could be used to purge stale entries. The dict simply accumulates one entry per (loop_id, max_uploads) pair seen over the process lifetime.

    Impact and the stale-semaphore scenario

    The primary confirmed impact is an unbounded memory leak in testing environments. The project's pytest.ini sets asyncio_mode = auto, meaning pytest-asyncio creates a new event loop per test by default, so every test that exercises write_files with octet-stream uploads adds an entry that is never freed.

    The scarier scenario — ID reuse causing two tests to share a semaphore — is real but significantly mitigated by asyncio internals. As verifiers noted, asyncio.Semaphore only binds self._loop on the contended code path (when _value == 0). If the semaphore was never contended, self._loop stays None and a stale semaphore lazily re-binds to the new loop correctly. However, if a semaphore was fully exhausted and self._loop was set, asyncio holds a strong reference to the old loop object, preventing GC and thus preventing ID reuse in that case. The realistic remaining risk is a test interrupted mid-upload leaving a semaphore with _value < max_uploads, which a later test could inherit if ID collision occurs.

    Step-by-step proof of the memory leak

    1. Test A runs: asyncio.get_running_loop() returns loop_A with id(loop_A) = 0x7f00. _GLOBAL_FILE_UPLOAD_SEMAPHORES[(0x7f00, 128)] is created.
    2. Test A ends: pytest-asyncio destroys loop_A. The dict still holds {(0x7f00, 128): <Semaphore>}.
    3. Test B runs: a new loop_B is created with id(loop_B) = 0x7f40. _GLOBAL_FILE_UPLOAD_SEMAPHORES[(0x7f40, 128)] is created.
    4. After N tests, the dict has N entries, none of which will ever be freed.

    How to fix

    The JS SDK avoids this entirely by keying globalFileUploadSemaphores only by maxUploads (a single integer), since a Semaphore-equivalent in JS is not loop-scoped. For Python, the simplest fix is the same: key only by max_uploads. Python's asyncio.Semaphore lazily binds to whichever loop first exhausts it, so a single process-level semaphore per limit value works correctly in production (one long-lived loop) and in tests (the semaphore re-binds on each new loop if it was never contended or was fully released before the loop ended). A minimal sketch of this keying scheme appears after these findings.

  • 🔴 packages/js-sdk/src/sandbox/filesystem/index.ts:693-703 — In writeFiles, this.connectionConfig.getSignal(writeOpts?.requestTimeoutMs) is called inside the retryFileUpload closure, so each retry attempt creates a fresh AbortSignal.timeout(requestTimeoutMs) starting from zero. With DEFAULT_FILE_UPLOAD_RETRY_ATTEMPTS=4 and e.g. requestTimeoutMs=5000, the total wall-clock time before failure can reach 4×5s + backoff (~20+ seconds), far exceeding what callers expect from the timeout parameter. Fix by calling getSignal once before entering retryFileUpload and capturing the resulting signal in the closure.

    Extended reasoning...

    What the bug is and how it manifests

    ConnectionConfig.getSignal(ms) calls AbortSignal.timeout(ms) on every invocation, returning a brand-new timer starting from zero. In the new writeFiles implementation, getSignal(writeOpts?.requestTimeoutMs) is placed inside the fn closure that is passed to retryFileUpload. Because retryFileUpload calls fn on every attempt, each attempt creates a fresh signal with a full requestTimeoutMs window.

    The specific code path

    In packages/js-sdk/src/sandbox/filesystem/index.ts (lines 693–703), the mapWithConcurrency callback invokes retryFileUpload(async () => { const timeoutSignal = this.connectionConfig.getSignal(writeOpts?.requestTimeoutMs); ... }). The retryFileUpload loop (lines 221–239) re-invokes this closure on every attempt after a transient error and a backoff delay.

    Why existing safeguards do not prevent it

    The refutation correctly notes that isRetryableFileUploadError returns false for AbortError, so a timed-out attempt is not retried — the first timeout that fires does propagate. However, this only applies to an attempt that reaches its timeout. When an attempt fails with a transient network error before the timeout fires (e.g. ECONNRESET after 50ms), the signal from that attempt is discarded and a completely fresh signal is created for the next attempt. The net effect is that each retry window is independent, not cumulative.

    Impact

    A caller who passes requestTimeoutMs: 5000 expects the entire upload operation to take at most ~5 s. With four attempts and transient failures on attempts 1–3, the actual wall-clock time is up to 4×5000 ms + backoff (~20+ s). This breaks deadline-sensitive callers and makes the requestTimeoutMs option semantically misleading in the presence of retries.

    How to fix it

    Call getSignal once before passing fn to retryFileUpload, then capture the pre-created signal in the closure:

    const timeoutSignal = this.connectionConfig.getSignal(writeOpts?.requestTimeoutMs)
    return retryFileUpload(async () => {
      const { signal, cleanup } = timeoutSignal
        ? combineAbortSignals([abortSignal, timeoutSignal])
        : { signal: abortSignal, cleanup: () => {} }
      // ... existing POST call
    })

    This gives all retry attempts a shared deadline rather than independent fresh countdowns.

    Step-by-step proof

    1. Caller invokes writeFiles([file], { requestTimeoutMs: 5000 }).
    2. mapWithConcurrency calls the per-file callback, which calls retryFileUpload(fn).
    3. retryFileUpload enters its loop, attempt=1, calls globalUploads.run(fn).
    4. Inside fn: getSignal(5000) → AbortSignal.timeout(5000) — T=0, fires at T=5s.
    5. At T=0.05s the TCP connection drops; ECONNRESET is thrown, caught as retryable.
    6. wait(250) backoff. attempt=2, calls globalUploads.run(fn) again.
    7. Inside fn again: getSignal(5000) → new AbortSignal.timeout(5000) — T=0 again, fires at T=5.3s from wall-clock start.
    8. Steps 5–7 repeat for attempts 3 and 4. Total window: up to ~20.75s, not the expected 5s.
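
To make the first finding's recommended keying scheme concrete, here is a minimal sketch in the style of the JS SDK's globalFileUploadSemaphores; the Semaphore class itself is illustrative, not the SDK's implementation:

    // One limiter per limit value for the whole process. No event-loop id in
    // the key, so nothing accumulates when loops (or test runs) come and go.
    class Semaphore {
      private available: number
      private waiters: (() => void)[] = []

      constructor(limit: number) {
        this.available = limit
      }

      async run<T>(fn: () => Promise<T>): Promise<T> {
        if (this.available === 0) {
          // Queue until a finishing task hands its slot over.
          await new Promise<void>((resolve) => this.waiters.push(resolve))
        } else {
          this.available--
        }
        try {
          return await fn()
        } finally {
          const next = this.waiters.shift()
          if (next) next() // transfer the slot directly to a waiter
          else this.available++
        }
      }
    }

    const globalFileUploadSemaphores = new Map<number, Semaphore>()

    function getGlobalFileUploadSemaphore(maxUploads: number): Semaphore {
      let sem = globalFileUploadSemaphores.get(maxUploads)
      if (sem === undefined) {
        sem = new Semaphore(maxUploads)
        globalFileUploadSemaphores.set(maxUploads, sem)
      }
      return sem
    }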

cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit f4bdff6. Configure here.
