fix: strip code blocks before HTML detection in looksLikeHtml() #42

Merged
dacharyc merged 1 commit into agent-ecosystem:main from
mvvmm:fix/strip-code-in-looks-like-html
Apr 19, 2026

Contributor

@mvvmm mvvmm commented Apr 19, 2026

Summary

Fixes looksLikeHtml() false positives on markdown pages that contain HTML tags inside fenced code blocks or inline code spans (e.g. `<body>`). These false positives cause the content-negotiation and markdown-url-support checks to misclassify valid markdown responses as HTML.

Some pages include valid HTML elements inside code blocks in their markdown.

Fix

Strip fenced code blocks and inline code spans from the sample before running HTML pattern matching.
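
As a rough illustration of the approach, the stripping step could look something like the sketch below. The PR's actual `stripCode()` body is not shown in this conversation, so the regular expressions, the `looksLikeHtmlSketch` name, and the HTML patterns it tests for are all assumptions made for this example.

```typescript
// Hypothetical sketch; the real stripCode() in the PR may differ.
function stripCode(text: string): string {
  return text
    // Fenced blocks: an opening fence of 3+ backticks or tildes at line start,
    // lazily matched up to a closing fence made of the same characters (\1).
    .replace(/^(`{3,}|~{3,})[^\n]*\n[\s\S]*?^\1[^\n]*$/gm, "")
    // Inline code spans: a single-backtick pair on one line.
    .replace(/`[^`\n]+`/g, "");
}

// Assumed, simplified stand-in for the HTML pattern matching.
function looksLikeHtmlSketch(body: string): boolean {
  const sample = stripCode(body).slice(0, 2000);
  return /<!doctype html|<html[\s>]|<head[\s>]|<body[\s>]/i.test(sample);
}
```

With this sketch, a markdown page that only mentions `<body>` inside a code span or fenced example is no longer flagged, while a response that starts with a real `<!DOCTYPE html>` still is.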

Tests

Added 5 new test cases covering fenced blocks (backtick and tilde), inline code spans, and ensuring real HTML outside code is still detected.

looksLikeHtml() false-positives on markdown pages that contain HTML tags
inside fenced code blocks or inline code spans (e.g. `<body>` or a
fenced HTML example). This causes content-negotiation and
markdown-url-support checks to misclassify valid markdown responses as
HTML.

Strip fenced code blocks and inline code spans from the sample before
running HTML pattern matching.
@dacharyc
Member

Thanks for the PR, @mvvmm!

There's a case where I know this particular implementation will be an issue, so I'm going to merge this and then make a small follow-up to flip the logic.

You've got this code here:

const sample = stripCode(body.slice(0, 2000));

There's a truncation risk when an opening code fence is inside the slice and the closing code fence is outside the slice; the regex will never be able to match that case. So I'm going to swap it to:

const sample = stripCode(body).slice(0, 2000);
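
To make the truncation case concrete, here is a small sketch that builds a markdown body whose fenced block opens before offset 2000 and closes after it. The `stripCode` below is a hypothetical stand-in that handles only fenced blocks, not the implementation from the PR.

```typescript
// Hypothetical stand-in for the PR's stripCode(); fenced blocks only, for brevity.
function stripCode(text: string): string {
  return text.replace(/^(`{3,}|~{3,})[^\n]*\n[\s\S]*?^\1[^\n]*$/gm, "");
}

const fence = "`".repeat(3);

// Markdown whose fenced block opens before offset 2000 and closes after it.
const body =
  "# Guide\n\n" +
  "x".repeat(1950) + "\n\n" +
  fence + "html\n" +
  "<html>\n<body>" + "y".repeat(200) + "</body>\n</html>\n" +
  fence + "\n";

// Slice first: the closing fence is cut off, so the regex cannot match
// and the HTML example stays in the sample.
const sliceFirst = stripCode(body.slice(0, 2000));

// Strip first: the whole fenced block is removed before sampling.
const stripFirst = stripCode(body).slice(0, 2000);

console.log(sliceFirst.includes("<html>")); // true
console.log(stripFirst.includes("<html>")); // false
```

The cost of stripping first is that the regex runs over the full body rather than the first 2000 characters.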

Given the size of the average page, I'm hoping the performance tradeoff is negligible compared to the network I/O of the page fetch. If this becomes an issue in practice, I'm open to exploring other options.

Thank you for the fix!

@dacharyc dacharyc merged commit 1e76de0 into agent-ecosystem:main Apr 19, 2026
2 checks passed