validate utf-16 surrogate halves in decodeUnicodeCodePoint#1698
validate utf-16 surrogate halves in decodeUnicodeCodePoint#1698SABITHSAHEB wants to merge 2 commits into
Conversation
|
any update? |
| "See Line 1, Column 12 for detail.\n"); | ||
| } | ||
| { | ||
| char const doc[] = R"([ "\uD801\u0041" ])"; |
There was a problem hiding this comment.
No tests for the legacy reader change?
baylesj
left a comment
There was a problem hiding this comment.
Thanks for the fix — the validation logic itself looks correct. One design concern before merging, though: this rejection is unconditional in both readers, with no way to opt out.
Unlike #1663 (unescaped control characters, which are grammar-invalid per RFC 8259), lone/mismatched surrogate escapes are syntactically valid JSON under both the RFC 8259 ABNF and ECMA-404 — the spec only calls the resulting value "unpredictable".
Real-world producers emit them legally: ES2019+ JSON.stringify('\uD83D') yields "\ud83d", and Python's json.dumps does the same. After this change, documents that parsed on every prior release hard-fail in every configuration — including CharReaderBuilder::ecma404Mode, which would now reject text that ECMA-404 defines as conforming, and the legacy Json::Reader, whose Features struct has no knob at all.
Comparable strictness decisions (failIfExtra, rejectDupKeys, allowSpecialFloats) are all routed through CharReaderBuilder settings. Could this be gated behind a setting (e.g. rejectInvalidSurrogates, arguably default-on) so downstream users with lenient-ingestion pipelines have an escape hatch? Alternatively, WHATWG-style U+FFFD replacement in the lenient path would avoid the hard break while still fixing the invalid-UTF-8 output.
| return addError("Bad escape sequence in string", token, current); | ||
| } | ||
| } else { | ||
| if (static_cast<unsigned char>(c) < 0x20) |
There was a problem hiding this comment.
Both decodeString loops copy any raw byte ≥ 0x20 through unvalidated, so the raw WTF-8 bytes ED B0 80 (U+DC00) still parse into a Value containing exactly the invalid UTF-8 this PR aims to prevent, while "\udc00" is now a hard error. The "parser accepted it, therefore strings are valid UTF-8" invariant the PR implies is not actually established — identical content is treated differently depending on byte-level spelling.
Validated the low-surrogate range when completing a pair and rejected an unpaired low surrogate, in both Reader and OurReader. Added reader tests for the two new rejection paths; existing valid-pair cases are unaffected.