Fix premature COMMIT in stream-rows-to-copy under on error stop #1748

Merged: dimitri merged 2 commits into main from fix/stream-rows-to-copy-on-error-stop on Jun 27, 2026

Conversation

dimitri (Owner) commented Jun 27, 2026

Fixes #1622 (specifically the hang scenario reported in the 2026-06-22 comment by fluca1978).

Root cause

In stream-rows-to-copy (used when on error stop is set), the original code placed COMMIT inside the unwind-protect cleanup alongside close-db-writer:

(unwind-protect
     (loop ...)
  (cl-postgres:close-db-writer copier)
  (pomo:execute "COMMIT"))  ; ← runs before the handler-case handler

In Common Lisp, handler-case unwinds the stack before it runs the matching handler clause, so every unwind-protect cleanup form inside it has already completed by the time the handler executes. So when a data error occurred:

  1. COMMIT ran inside the cleanup (wrong — partial data committed)
  2. ROLLBACK ran in the handler-case handler (no-op / warning: no transaction in progress)

This left rows partially committed on error and put the connection into an inconsistent state. Under on error stop with concurrency > 1, the inconsistent connection state could cause the remaining writer threads to hang indefinitely (PG sessions stuck in ClientRead on COPY … FROM STDIN).
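
The ordering can be illustrated outside Lisp. The sketch below is a Python analogy, not pgloader code: try/finally stands in for unwind-protect, the outer except clause stands in for the handler-case handler, and the event names are labels only (no database is involved).

```python
def buggy_order() -> list[str]:
    """Record the order of events when the row loop raises.

    Analogy only: `finally` plays the role of the unwind-protect
    cleanup forms, the outer `except` plays the handler-case clause.
    """
    events: list[str] = []
    try:
        try:
            events.append("stream rows")
            raise ValueError("data error")   # stand-in for a bad row
        finally:
            # Cleanup runs while the stack unwinds, before the handler.
            events.append("close-db-writer")
            events.append("COMMIT")          # the misplaced statement
    except ValueError:
        events.append("ROLLBACK")            # too late: already committed
    return events


if __name__ == "__main__":
    print(buggy_order())
```

The cleanup entries are recorded before ROLLBACK, which is the sequence described in the numbered list above.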

Fix

Move (pomo:execute "COMMIT") to after the unwind-protect so it only executes on the success path (loop and close-db-writer both complete without signalling):

(unwind-protect
     (loop ...)
  ;; Only close the COPY writer; cl-postgres with-syncing reads
  ;; ReadyForQuery on error, leaving the connection in SQL mode.
  (cl-postgres:close-db-writer copier))
;; Only reached on normal loop completion (no error).
(pomo:execute "COMMIT")

When either the loop or close-db-writer signals, the handler-case handler now finds a clean SQL-mode connection (cl-postgres's with-syncing in close-db-writer reads until ReadyForQuery before re-signalling) on which ROLLBACK works correctly.
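
To make the effect of the move concrete, here is a minimal sketch that uses Python's sqlite3 module as a stand-in for the PostgreSQL connection. Everything in it is an assumption made for illustration: the `load` function, the `qty` column and the `commit_in_cleanup` switch are hypothetical, and `int()` failing on a bad value stands in for the data error. It compares COMMIT in the cleanup with COMMIT after it.

```python
import sqlite3


def load(rows, commit_in_cleanup: bool) -> int:
    """Load rows in one transaction; return how many rows ended up stored."""
    # isolation_level=None: no implicit transactions, so BEGIN, COMMIT
    # and ROLLBACK below are issued explicitly.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE products (qty INTEGER)")
    try:
        conn.execute("BEGIN")
        try:
            for qty in rows:
                # int() raises ValueError on a value such as 'lots-of-it'.
                conn.execute("INSERT INTO products VALUES (?)", (int(qty),))
        finally:
            if commit_in_cleanup:            # original placement (buggy)
                conn.execute("COMMIT")
        if not commit_in_cleanup:            # fixed placement: success path only
            conn.execute("COMMIT")
    except ValueError:
        # SQLite raises on ROLLBACK without a transaction, where PostgreSQL
        # only warns, so the sketch checks the transaction state first.
        if conn.in_transaction:
            conn.execute("ROLLBACK")
    return conn.execute("SELECT count(*) FROM products").fetchone()[0]


if __name__ == "__main__":
    bad = [1, 2, "lots-of-it"]
    print(load(bad, commit_in_cleanup=True))    # rows before the error survive
    print(load(bad, commit_in_cleanup=False))   # everything is rolled back
```

With the commit in the cleanup, the two rows streamed before the error stay in the table; with the commit after the protected block, the rollback removes them.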

A secondary fix guards (incf bytes row-bytes) with (when row-bytes …) so that a nil returned by stream-row for a filtered row does not raise a TYPE-ERROR that was itself masking the COMMIT sequencing issue.
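
The guard amounts to skipping the accumulation when no size is reported. A Python analogy of that idea, where `total_bytes` is a hypothetical helper and `None` stands in for the nil that stream-row returns for a filtered row:

```python
from typing import Iterable, Optional


def total_bytes(row_sizes: Iterable[Optional[int]]) -> int:
    """Sum per-row byte counts, ignoring rows that report no size."""
    total = 0
    for row_bytes in row_sizes:
        # Without this check, `total += None` raises TypeError, the
        # Python counterpart of the TYPE-ERROR described above.
        if row_bytes is not None:
            total += row_bytes
    return total
```

The check is placed on the increment only, so a filtered row leaves the counter unchanged without interrupting the loop.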

Test

Added test/sqlite-on-error-stop.load which migrates test/sqlite/type-mismatch.db — a SQLite database whose products table has a TEXT value ('lots-of-it') stored in an INTEGER column. pgloader must exit cleanly without hanging. The test database is regenerated by test/sqlite/create-type-mismatch.py.
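
SQLite accepts such a row because column types are affinities rather than constraints. The following sketch shows how a database of this shape could be produced; it is not the contents of test/sqlite/create-type-mismatch.py. Only the products table name and the 'lots-of-it' value come from the description above, and the column names and the other row are made up.

```python
import sqlite3


def make_type_mismatch_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create a products table holding a TEXT value in an INTEGER column."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE products ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT,"
        " stock INTEGER)"            # INTEGER affinity, not enforced by SQLite
    )
    conn.execute("INSERT INTO products (name, stock) VALUES (?, ?)",
                 ("widget", 12))
    # A string that cannot be converted to a number is stored as TEXT.
    conn.execute("INSERT INTO products (name, stock) VALUES (?, ?)",
                 ("gadget", "lots-of-it"))
    conn.commit()
    return conn


if __name__ == "__main__":
    db = make_type_mismatch_db()
    for row in db.execute("SELECT name, stock, typeof(stock) FROM products"):
        print(row)
```

A strictly typed target column in PostgreSQL rejects the second row, which is the data error that the regression test relies on.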

dimitri added 2 commits June 28, 2026 00:43
When a data error occurs inside stream-rows-to-copy, the outer
handler-case handler calls ROLLBACK.  In Common Lisp, handler-case
handlers run *after* the unwind-protect cleanup forms have finished,
so the original code sequenced:

  1. COMMIT   (inside unwind-protect cleanup)
  2. ROLLBACK (inside handler-case handler — now a no-op)

This meant any rows already streamed to PostgreSQL were committed
even when the load failed, and the ROLLBACK was silently ignored
with a 'there is no transaction in progress' warning.  Under
on error stop with concurrency > 1 the subsequent inconsistent
connection state could also cause writer threads to hang (GitHub
issue #1622, comment by fluca1978 on 2026-06-22).

Fix: move the COMMIT after the unwind-protect form.  It is now
only reached when the row loop and close-db-writer both complete
without signalling.  When either signals, unwind-protect closes
the COPY writer (cl-postgres with-syncing reads until ReadyForQuery,
leaving the connection in SQL mode), and then the handler-case
handler correctly finds a clean connection on which ROLLBACK works.

Also guard the incf of bytes/seconds with (when row-bytes ...) so
that nil returned by stream-row for a filtered row does not raise
a TYPE-ERROR that was itself masking the COMMIT sequencing bug.

Add regression test: test/sqlite-on-error-stop.load uses a SQLite
database (test/sqlite/type-mismatch.db) that has a TEXT value in
an INTEGER column.  pgloader must exit cleanly without hanging.

Fixes: #1622

The WITH on error stop / on error resume next options were already
parsed by the grammar and surfaced as :on-error-stop in with-options
by ast.clj, but run-command in core.clj never read them.  Only the
--on-error-stop CLI flag set copy/*on-error-stop*.

Fix: extract :on-error-stop from with-options and add it to the
binding block alongside batch-rows, batch-size, etc.  Use (some? v)
instead of (or v ...) since the option is boolean: a false value from
a future explicit on-error-resume path must not be ignored.

Add parser tests verifying that:
- WITH on error stop  → :on-error-stop true  in with-options
- WITH on error resume next → :on-error-stop absent from with-options
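
The reason for preferring a presence check over a truthiness fallback can be shown with a small Python analogy. Both function names are hypothetical; `is None` plays the role of the (some? v) test and Python's `or` behaves like (or v ...) for a false value.

```python
from typing import Optional


def resolve_with_or(option: Optional[bool], default: bool) -> bool:
    """Truthiness fallback: an explicit False is replaced by the default."""
    return option or default


def resolve_with_presence_check(option: Optional[bool], default: bool) -> bool:
    """Presence check: only a missing option falls back to the default."""
    return default if option is None else option


if __name__ == "__main__":
    # Assume the surrounding default is True and the option says False.
    print(resolve_with_or(False, True))              # True: the False is lost
    print(resolve_with_presence_check(False, True))  # False: the option wins
```

The two functions differ only when the option is present and false, which is the case the commit message calls out for a future explicit on-error-resume path.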
dimitri merged commit f69a58e into main on Jun 27, 2026
37 checks passed
dimitri deleted the fix/stream-rows-to-copy-on-error-stop branch on June 27, 2026 23:11

Development

Successfully merging this pull request may close these issues.

Hangs while migrating from SQLite3
