validate utf-16 surrogate halves in decodeUnicodeCodePoint by SABITHSAHEB · Pull Request #1698 · open-source-parsers/jsoncpp

SABITHSAHEB · 2026-06-17T15:50:57Z

After a high surrogate (D800-DBFF), the next \u escape is consumed and combined without checking that it is a low surrogate, so "\uD801\u0041" or "\uD801\uD801" parses to a wrong astral code point and the second escape's real value is lost.
A low surrogate that appears on its own (e.g. "\uDC00") is passed straight through and written out as the invalid UTF-8 bytes ED B0 80.
This PR validates the low-surrogate range (DC00-DFFF) when completing a pair and rejects an unpaired low surrogate, in both Reader and OurReader. It adds reader tests for the two new rejection paths; existing valid-pair cases are unaffected.
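For readers following along, the check described above can be sketched in isolation. This is a minimal illustration, not the code in this PR: `combineSurrogates` is a hypothetical helper that takes the already-parsed 16-bit values of one or two \u escapes, whereas the real decodeUnicodeCodePoint works on the token stream and reports failures through addError.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper (not jsoncpp's API): combines the values of one or two
// \u escapes into a code point. `second` is the value of the following \u
// escape, if any. Returns std::nullopt for the cases the PR rejects.
std::optional<std::uint32_t> combineSurrogates(
    std::uint16_t first, std::optional<std::uint16_t> second) {
  const bool firstIsHigh = first >= 0xD800 && first <= 0xDBFF;
  const bool firstIsLow = first >= 0xDC00 && first <= 0xDFFF;
  if (firstIsLow)
    return std::nullopt;  // unpaired low surrogate, e.g. "\uDC00"
  if (!firstIsHigh)
    return first;  // ordinary BMP code point; `second` is not consumed
  if (!second || *second < 0xDC00 || *second > 0xDFFF)
    return std::nullopt;  // e.g. "\uD801\u0041" or "\uD801\uD801"
  // Standard UTF-16 pair arithmetic.
  return 0x10000u + ((static_cast<std::uint32_t>(first) - 0xD800u) << 10) +
         (static_cast<std::uint32_t>(*second) - 0xDC00u);
}
```

With this sketch, the valid pair D801 DC37 combines to U+10437, while D801 followed by 0041 or by D801 yields no value instead of a wrong astral code point.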

SABITHSAHEB · 2026-06-29T06:04:31Z

Any update?

baylesj · 2026-07-01T23:23:45Z

                            "See Line 1, Column 12 for detail.\n");
  }
+  {
+    char const doc[] = R"([ "\uD801\u0041" ])";


No tests for the legacy reader change?

baylesj

Thanks for the fix — the validation logic itself looks correct. One design concern before merging, though: this rejection is unconditional in both readers, with no way to opt out.

Unlike #1663 (unescaped control characters, which are grammar-invalid per RFC 8259), lone/mismatched surrogate escapes are syntactically valid JSON under both the RFC 8259 ABNF and ECMA-404 — the spec only calls the resulting value "unpredictable".

Real-world producers emit them legally: ES2019+ JSON.stringify('\uD83D') yields "\ud83d", and Python's json.dumps does the same. After this change, documents that parsed on every prior release hard-fail in every configuration — including CharReaderBuilder::ecma404Mode, which would now reject text that ECMA-404 defines as conforming, and the legacy Json::Reader, whose Features struct has no knob at all.

Comparable strictness decisions (failIfExtra, rejectDupKeys, allowSpecialFloats) are all routed through CharReaderBuilder settings. Could this be gated behind a setting (e.g. rejectInvalidSurrogates, arguably default-on) so downstream users with lenient-ingestion pipelines have an escape hatch? Alternatively, WHATWG-style U+FFFD replacement in the lenient path would avoid the hard break while still fixing the invalid-UTF-8 output.
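To make the two alternatives concrete, here is a rough sketch of how a strict path and a lenient U+FFFD path could share one decoding loop. Everything in it is hypothetical: `SurrogatePolicy` and `decodeUnits` are illustration-only names, the input is a pre-parsed list of \u escape values rather than jsoncpp's token stream, and nothing here is wired to CharReaderBuilder settings.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical policy switch; a real setting would live in CharReaderBuilder.
enum class SurrogatePolicy { Reject, ReplaceWithFFFD };

// Decodes a sequence of \u escape values into code points.
// Returns std::nullopt if the policy is Reject and a surrogate is unpaired.
std::optional<std::vector<std::uint32_t>> decodeUnits(
    const std::vector<std::uint16_t>& units, SurrogatePolicy policy) {
  std::vector<std::uint32_t> out;
  for (std::size_t i = 0; i < units.size(); ++i) {
    const std::uint32_t u = units[i];
    const bool isHigh = u >= 0xD800 && u <= 0xDBFF;
    const bool isLow = u >= 0xDC00 && u <= 0xDFFF;
    const bool nextIsLow = i + 1 < units.size() && units[i + 1] >= 0xDC00 &&
                           units[i + 1] <= 0xDFFF;
    if (isHigh && nextIsLow) {
      const std::uint32_t low = units[i + 1];
      out.push_back(0x10000u + ((u - 0xD800u) << 10) + (low - 0xDC00u));
      ++i;  // the low surrogate was consumed as part of the pair
    } else if (isHigh || isLow) {
      if (policy == SurrogatePolicy::Reject)
        return std::nullopt;
      out.push_back(0xFFFDu);  // the following unit is decoded on its own
    } else {
      out.push_back(u);
    }
  }
  return out;
}
```

Note that in the lenient path the escape following a lone high surrogate is not swallowed, so "\uD801\u0041" decodes to U+FFFD followed by U+0041 and the second escape's value survives.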

baylesj · 2026-07-01T23:25:57Z

        return addError("Bad escape sequence in string", token, current);
      }
    } else {
      if (static_cast<unsigned char>(c) < 0x20)


Both decodeString loops copy any raw byte ≥ 0x20 through unvalidated, so the raw WTF-8 bytes ED B0 80 (U+DC00) still parse into a Value containing exactly the invalid UTF-8 this PR aims to prevent, while "\udc00" is now a hard error. The "parser accepted it, therefore strings are valid UTF-8" invariant the PR implies is not actually established — identical content is treated differently depending on byte-level spelling.
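As an illustration of the gap, a byte-level check for the same content could look roughly like the following. `containsEncodedSurrogate` is a hypothetical helper, not part of jsoncpp, and it is deliberately narrow: it only looks for the three-byte form ED A0..BF 80..BF that encodes U+D800..U+DFFF and is not a general UTF-8 validator.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: returns true if `s` contains a surrogate code point
// (U+D800..U+DFFF) spelled as raw bytes, i.e. ED A0..BF 80..BF.
// In well-formed UTF-8, ED is never a continuation byte, so a simple scan
// for this pattern is enough for this narrow purpose.
bool containsEncodedSurrogate(std::string_view s) {
  for (std::size_t i = 0; i + 2 < s.size(); ++i) {
    const auto b0 = static_cast<unsigned char>(s[i]);
    const auto b1 = static_cast<unsigned char>(s[i + 1]);
    const auto b2 = static_cast<unsigned char>(s[i + 2]);
    if (b0 == 0xED && b1 >= 0xA0 && b1 <= 0xBF && b2 >= 0x80 && b2 <= 0xBF)
      return true;
  }
  return false;
}
```

Under this sketch, the raw bytes ED B0 80 are flagged, while ED 9F BF (U+D7FF, the last code point before the surrogate range) is not.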

validate utf-16 surrogate halves in decodeUnicodeCodePoint

5e59785

Merge branch 'master' into validate-surrogate-halves

8feb5e5

baylesj reviewed Jul 1, 2026

baylesj requested changes Jul 1, 2026

SABITHSAHEB wants to merge 2 commits into open-source-parsers:master from SABITHSAHEB:validate-surrogate-halves