fix(sqlite): report pkey and column context on UTF-8 decoding error#1747
Merged
Conversation
…1250) When a SQLite TEXT column contains bytes that are not valid UTF-8, pgloader now catches the babel decoding error per-column instead of aborting the entire table load. The error message includes: - table name - the invalid encoding name and byte position (matching MySQL style) - primary-key value of the offending row (when the PK column was already read before the failing column) - column name The bad column is substituted with NULL so the row is still inserted and loading continues with the next row — the same behaviour the MySQL and MSSQL sources have via their qmynd/mssql use-nil restarts. Since babel defines no restarts for character-decoding-error (unlike qmynd which exposes use-nil), the fix uses handler-case at the column-read site rather than handler-bind + invoke-restart. Added: test/sqlite/bad-utf8.db — pre-built SQLite DB with one row containing a Windows-1252 em-dash (0x96) stored as BLOB in a TEXT column test/sqlite/create-bad-utf8.py — script to regenerate that database test/sqlite-bad-utf8.load — pgloader load file for manual testing Closes #1250
…error The dead-code 'with row-num = 0' / 'do (incf row-num)' pair violated CL loop grammar: a do-clause cannot precede a for-variable clause. SBCL 2.1.11 (used in the debian-build CI job) rejects this ordering at compile time, causing all CI jobs to fail. Remove both lines entirely; row-num was never referenced anywhere.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
When a SQLite TEXT column contains bytes that are not valid UTF-8, pgloader emitted a terse error and aborted the entire table load:
No table name, no row identifier, no column name — making it very difficult for users to find the offending record (see #1250).
Fix
Catches
babel-encodings:character-decoding-errorper column (inside the inner column-reading loop) rather than via the outer table-level handler. When the error fires the message now includes:The bad column is substituted with NULL so the row is still inserted and the rest of the table continues loading — the same behaviour the MySQL and MSSQL sources provide via their
use-nilrestarts.Why handler-case instead of handler-bind + restart?
Babel defines no restarts for
character-decoding-error(unlike qmynd/mssql which expose ause-nilrestart).handler-caseat the column-read site gives the same outcome (substitute nil, continue) without requiring a restart.Sample error output (after fix)
Test artifacts
test/sqlite/bad-utf8.db— SQLite database with one row containing a Windows-1252 em-dash (0x96) stored in a TEXT column; rows 1 and 3 are valid UTF-8test/sqlite/create-bad-utf8.py— script to regenerate that databasetest/sqlite-bad-utf8.load— pgloader load file for manual verificationThe SQLite LOAD DATABASE command cannot be wired into the standard
--regressframework (which expects a single?tablenamein the URI), so this is provided as a manual test artifact rather than a REGRESS entry.Closes #1250