@dmsnell
Contributor

@dmsnell dmsnell commented Dec 4, 2025

In the example highlighting ambiguities from missing semicolons on named character references, a "correct" encoding is provided, but that example makes no mention of the fact that the fragment was ambiguous precisely because the ampersand wasn't escaped.

This patch adds a clarifying note explaining how this situation is avoided by always escaping the ampersand.

  • At least two implementers are interested (and none opposed):
  • Tests are written and can be reviewed and commented upon at:
  • Implementation bugs are filed:
    • Chromium: …
    • Gecko: …
    • WebKit: …
    • Deno (only for timers, structured clone, base64 utils, channel messaging, module resolution, web workers, and web storage): …
    • Node.js (only for timers, structured clone, base64 utils, channel messaging, and module resolution): …
  • Corresponding HTML AAM & ARIA in HTML issues & PRs:
  • MDN issue is filed: …
  • The top of this comment includes a clear commit message to use.

(See WHATWG Working Mode: Changes for more details.)

In the example highlighting ambiguities from missing semicolons on named
character references, a "correct" encoding is provided, but that example
makes no mention of the fact that the fragment was ambiguous precisely
because the ampersand wasn't escaped.

This patch adds a clarifying note explaining how this situation is
avoided by always escaping the ampersand.

Co-authored-by: Jon Surrell <[email protected]>
GitHub-PR: 11988
GitHub-PR-URL: whatwg#11988
@dmsnell dmsnell force-pushed the syntax-errors/always-escape-amp branch from d1fb385 to 9753779 Compare December 4, 2025 19:50
@dmsnell
Contributor Author

dmsnell commented Dec 4, 2025

As a side note, I overlooked adding my name to the list of contributors in my first submission.

@sirreal

sirreal commented Dec 5, 2025

I was surprised to find no recommendation about escaping & with character references anywhere in the HTML standard. The section this PR touches seems to encourage not escaping & if it is not ambiguous (bold mine):

Thus, the correct way to express the above cases is as follows:

<a href="?bill&ted">Bill and Ted</a> <!-- &ted is ok, since it's not a named character reference -->
<a href="?art&amp;copy">Art and Copy</a> <!-- the & has to be escaped, since &copy is a named character reference -->

I read this as if &amp;ted would be wrong in some way, since it isn't the correct way. However, it seems much simpler to me to escape the ampersand here as &amp;.

I would change this section to something like the following:

-<!-- &ted is ok, since it's not a named character reference -->
+<!-- "&ted" is ok because "ted" is not a named character reference. -->
+<!-- "&amp;ted" is equivalent and less error-prone because "&amp;" explicitly decodes to "&". -->

There is precedent for such a recommendation. Section 4.12.1.3 Restrictions for contents of script elements has a prominent note with an encoding recommendation:

The easiest and safest way to avoid the rather strange restrictions described in this section is to always escape an ASCII case-insensitive match for "<!--" as "\x3C!--", "<script" as "\x3Cscript", and "</script" as "\x3C/script" when these sequences appear in literals in scripts (e.g. in strings, regular expressions, or comments), and to avoid writing code that uses such constructs in expressions. Doing so avoids the pitfalls that the restrictions in this section are prone to triggering: namely, that, for historical reasons, parsing of script blocks in HTML is a strange and exotic practice that acts unintuitively in the face of these sequences.


Section 13.1.4 Character references seems like a good place to add a similar note. For example:

Note

Where character references are allowed, it's a good idea to always encode & with its character reference &amp;. This prevents any ambiguity as to whether the & is part of a character reference or a literal &.

I would consider mentioning the most common characters that are useful to escape in different contexts, but the note about & seems particularly helpful.
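
A minimal sketch of the always-escape recommendation, assuming Python's standard `html` module as a stand-in for an HTML parser's text decoding (it does not model the special handling of attribute values). The helper name `escape_amp` is hypothetical and used only for illustration. It shows that escaping every literal & as `&amp;` round-trips both example strings, whether or not the text after the ampersand happens to be a named character reference:

```python
import html


def escape_amp(text: str) -> str:
    """Hypothetical helper: encode every literal '&' as '&amp;'."""
    return text.replace("&", "&amp;")


def round_trips(text: str) -> bool:
    """True when decoding the escaped form gives back the original text."""
    return html.unescape(escape_amp(text)) == text


if __name__ == "__main__":
    for raw in ["?bill&ted", "?art&copy"]:
        encoded = escape_amp(raw)
        print(f"{raw!r} -> {encoded!r} (round-trips: {round_trips(raw)})")
```

With this approach the author never needs to know which names are in the named character reference table.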

@annevk
Member

annevk commented Dec 5, 2025

https://html.spec.whatwg.org/multipage/syntax.html#character-references already requires this so I'm not sure we need to state it again in the parser section. Is the problem that the parser doesn't flag it?

@dmsnell
Contributor Author

dmsnell commented Dec 5, 2025

Is the problem that the parser doesn't flag it?

I believe the problem here is that the illustrative example in the syntax-error section explicitly states that the correct way to produce HTML text containing & is to not escape it if what follows is not a legitimately-parsed character reference.

The example illustrates that a parser will correctly identify &ted as that raw string, but suggests that &ted is more appropriate than &amp;ted.

So basically this is just a confusing aspect for implementers and it seems like we could tweak the wording to maintain the demonstration of how these errors are handled without encouraging people to lean on syntax errors in cases where they produce the right output.
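
To make the behavior under discussion concrete, here is a small sketch using Python's standard `html.unescape`, which approximates how HTML text is decoded. This is an assumption made for illustration: a real HTML parser additionally special-cases character references inside attribute values, which this does not model. The function name `decode` is hypothetical.

```python
import html


def decode(source: str) -> str:
    """Decode character references roughly the way an HTML parser decodes text."""
    return html.unescape(source)


if __name__ == "__main__":
    examples = [
        "?bill&ted",      # unescaped; "ted" is not a named character reference
        "?bill&amp;ted",  # escaped; decodes to the same text
        "?art&copy",      # unescaped; "copy" is a legacy named reference
        "?art&amp;copy",  # escaped; the ampersand survives as intended
    ]
    for source in examples:
        print(f"{source!r} decodes to {decode(source)!r}")
```

The unescaped and escaped forms of the first string decode identically, while only the escaped form of the second preserves the literal `&copy` text.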

@annevk
Member

annevk commented Dec 5, 2025

I see, this is part of https://html.spec.whatwg.org/multipage/introduction.html#syntax-errors.

We don't disallow &ted currently so unless we also change the HTML Writing requirements in some way I'd be a bit hesitant to change it in this one place.

@dmsnell
Contributor Author

dmsnell commented Dec 5, 2025

@annevk thanks. I’m very open to trying out different ideas, but I think the spec is actually a bit vague on this.

already requires this

Unless I’m wrong, the spec does not require that & be escaped as &amp;, only that character references mixed into text must begin with & and be followed by the correct syntax.

However, if someone is authoring HTML and not intending to produce a character reference, a stray & is both properly decoded by the parser and not forbidden.

I think we all agree that the intention is to always escape & as &amp;, but in the nitty gritty, unless it’s hidden in some other section none of us have scoured up yet, it’s not explicitly normalized as such. The only reference we’ve been able to find that isn’t implied is the one in this PR, where the spec assertively states that it’s correct to omit the escaping.

@dmsnell
Contributor Author

dmsnell commented Dec 5, 2025

I apologize for omitting the before/after screenshots. I took a before shot and was waiting to add it to the description until I had the parser previews generated, but they never appeared, and I forgot to upload the before shot anyway. Here is the relevant context from the modified section.

[Image: Screenshot 2025-12-04 at 12 51 32 PM]

@annevk
Member

annevk commented Dec 5, 2025

That's what I'm saying as well though in my latest comment. The Writing section explicitly allows you to do this. So I don't want to accept this PR as-is, as it'll contradict the Writing section.

@zcorpan was involved in some of the details here and should probably weigh in.

@dmsnell
Contributor Author

dmsnell commented Dec 5, 2025

sounds great, and I have no wish that this be as-is. in fact, I was hoping for further input because I myself struggled to figure out how best to represent it. @sirreal is the author of the original suggestion.

interestingly enough, the HTML 3 spec was clearer on this point, but that entire document comprises only a handful of ill-defined paragraphs 🙃

Because certain characters will be interpreted as markup, they should be represented by markup…for instance the character "&" must be represented by the entity &amp;.
