validate utf-16 surrogate halves in decodeUnicodeCodePoint #1698

Open
SABITHSAHEB wants to merge 2 commits into open-source-parsers:master from SABITHSAHEB:validate-surrogate-halves

@SABITHSAHEB

Contributor
  1. After a high surrogate (D800-DBFF), the next \u escape is consumed and combined without checking that it is a low surrogate, so "\uD801\u0041" or "\uD801\uD801" parses to a wrong astral code point and the second escape's real value is lost.
  2. A low surrogate that appears on its own (e.g. "\uDC00") is passed straight through and written out as the invalid UTF-8 bytes ED B0 80.

This change validates the low-surrogate range when completing a pair and rejects an unpaired low surrogate, in both Reader and OurReader. It adds reader tests for the two new rejection paths; existing valid-pair cases are unaffected.
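
For readers skimming the thread, the pair check described above can be sketched roughly as follows. This is a standalone illustration and not the code in the diff: `combineEscapes` and its parameters are hypothetical names, and the real readers work on token positions rather than already-parsed values.

```cpp
#include <cstdint>

// Hypothetical helper for illustration; not the actual jsoncpp function.
//   first:     value of the \uXXXX escape just read
//   hasSecond: whether another \uXXXX escape immediately follows
//   second:    value of that following escape (ignored if hasSecond is false)
// Returns false for a lone or mismatched surrogate. Otherwise writes the
// decoded code point to `out`. `second` is only used when `first` is a
// high surrogate.
bool combineEscapes(std::uint32_t first, bool hasSecond, std::uint32_t second,
                    std::uint32_t& out) {
  const bool firstIsHigh = first >= 0xD800 && first <= 0xDBFF;
  const bool firstIsLow = first >= 0xDC00 && first <= 0xDFFF;
  if (firstIsLow)
    return false; // unpaired low surrogate, e.g. "\uDC00"
  if (!firstIsHigh) {
    out = first; // ordinary BMP code point, nothing to combine
    return true;
  }
  const bool secondIsLow = hasSecond && second >= 0xDC00 && second <= 0xDFFF;
  if (!secondIsLow)
    return false; // e.g. "\uD801\u0041" or "\uD801\uD801"
  // Standard UTF-16 pair arithmetic.
  out = 0x10000 + ((first & 0x3FF) << 10) + (second & 0x3FF);
  return true;
}
```

With this shape, a valid pair such as D801 DC37 combines to U+10437, while the two inputs from the description are rejected instead of producing a wrong astral code point.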

@SABITHSAHEB

Contributor Author

Any update?

"See Line 1, Column 12 for detail.\n");
}
{
char const doc[] = R"([ "\uD801\u0041" ])";

Contributor

No tests for the legacy reader change?

@baylesj left a comment

Contributor

Thanks for the fix — the validation logic itself looks correct. One design concern before merging, though: this rejection is unconditional in both readers, with no way to opt out.

Unlike #1663 (unescaped control characters, which are grammar-invalid per RFC 8259), lone/mismatched surrogate escapes are syntactically valid JSON under both the RFC 8259 ABNF and ECMA-404 — the spec only calls the resulting value "unpredictable".

Real-world producers emit them legally: ES2019+ JSON.stringify('\uD83D') yields "\ud83d", and Python's json.dumps does the same. After this change, documents that parsed on every prior release hard-fail in every configuration — including CharReaderBuilder::ecma404Mode, which would now reject text that ECMA-404 defines as conforming, and the legacy Json::Reader, whose Features struct has no knob at all.

Comparable strictness decisions (failIfExtra, rejectDupKeys, allowSpecialFloats) are all routed through CharReaderBuilder settings. Could this be gated behind a setting (e.g. rejectInvalidSurrogates, arguably default-on) so downstream users with lenient-ingestion pipelines have an escape hatch? Alternatively, WHATWG-style U+FFFD replacement in the lenient path would avoid the hard break while still fixing the invalid-UTF-8 output.
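
To make the two options concrete, here is a rough sketch of how a gated decision could look. It is an assumption-laden illustration only: `rejectInvalidSurrogates` is the setting name proposed above and does not exist in jsoncpp today, and `finishCodePoint` and `encodeUtf8` are hypothetical helpers, not library functions.

```cpp
#include <cstdint>
#include <string>

// Hypothetical: decides what to do with a decoded value that may be a lone
// or mismatched surrogate. `rejectInvalidSurrogates` models the proposed
// (not yet existing) setting.
bool finishCodePoint(std::uint32_t cp, bool rejectInvalidSurrogates,
                     std::uint32_t& out) {
  const bool isSurrogate = cp >= 0xD800 && cp <= 0xDFFF;
  if (!isSurrogate) {
    out = cp;
    return true;
  }
  if (rejectInvalidSurrogates)
    return false; // strict path: caller reports a parse error
  out = 0xFFFD;   // lenient path: substitute U+FFFD REPLACEMENT CHARACTER
  return true;
}

// Minimal UTF-8 encoder, assuming cp is a Unicode scalar value (<= 0x10FFFF,
// not a surrogate), which finishCodePoint guarantees for in-range input.
std::string encodeUtf8(std::uint32_t cp) {
  std::string s;
  if (cp < 0x80) {
    s += static_cast<char>(cp);
  } else if (cp < 0x800) {
    s += static_cast<char>(0xC0 | (cp >> 6));
    s += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    s += static_cast<char>(0xE0 | (cp >> 12));
    s += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    s += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    s += static_cast<char>(0xF0 | (cp >> 18));
    s += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    s += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    s += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return s;
}
```

In the lenient path, "\uDC00" would come out as the bytes EF BF BD rather than ED B0 80, so the output stays valid UTF-8 without turning previously accepted documents into hard failures.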

return addError("Bad escape sequence in string", token, current);
}
} else {
if (static_cast<unsigned char>(c) < 0x20)

Contributor

Both decodeString loops copy any raw byte ≥ 0x20 through unvalidated, so the raw WTF-8 bytes ED B0 80 (U+DC00) still parse into a Value containing exactly the invalid UTF-8 this PR aims to prevent, while "\udc00" is now a hard error. The "parser accepted it, therefore strings are valid UTF-8" invariant the PR implies is not actually established — identical content is treated differently depending on byte-level spelling.
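
As a rough illustration of the gap, a check on the raw-byte path might look like the sketch below. `containsEncodedSurrogate` is a hypothetical name and this is not a full UTF-8 validator. It relies on the fact that in well-formed UTF-8 the lead byte ED may only be followed by 80..9F, so ED followed by A0..BF is always an encoded surrogate (U+D800..U+DFFF).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: reports whether `s` contains a UTF-8-style encoding
// of a surrogate code point (ED A0..BF xx). It does not check anything else
// about the byte sequence.
bool containsEncodedSurrogate(const std::string& s) {
  for (std::size_t i = 0; i + 1 < s.size(); ++i) {
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    const unsigned char next = static_cast<unsigned char>(s[i + 1]);
    if (lead == 0xED && next >= 0xA0 && next <= 0xBF)
      return true;
  }
  return false;
}
```

Applying the same policy here as for the escaped form would make ED B0 80 and "\udc00" behave identically, whichever policy is chosen.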
