Conversation
Neo - PR Security ReviewNo security issues found Highlights
Hardening Notes
Comment |
WalkthroughThis PR enhances BodyDomainGrab by parsing HTML (goquery) and JavaScript (goja AST) when HTML is detected, extracting candidate hostnames from element attributes, srcset, meta tags, and script contents, then normalizing/validating domains and using regex as a fallback. Changes
Sequence Diagram(s)sequenceDiagram
participant Client as HTTP Response Body
participant Detector as Content Detector
participant HTMLParser as HTML Parser (goquery)
participant JSParser as JS Parser (goja)
participant Regex as Regex Fallback
participant Validator as Domain Normalizer/Validator
Client->>Detector: provide response bytes
alt body starts with '<' (HTML)
Detector->>HTMLParser: parse HTML
HTMLParser->>HTMLParser: extract href/src/srcset/action/data-*/meta
HTMLParser->>JSParser: send inline script texts
JSParser->>JSParser: parse JS AST (skip >512KB) and extract string/template literals
JSParser->>Validator: emit domain candidates
end
Client->>Regex: run regex fallback on raw body
Regex->>Validator: emit domain candidates
Validator->>Validator: normalize, validate, publicsuffix, dedupe -> populate domains/fqdns
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@common/httpx/domains.go`:
- Line 151: The deferred anonymous recover in common/httpx/domains.go currently
calls recover() without using its return value which fails lint; update the
defer to explicitly handle the result (e.g., capture into a variable and ignore
or check for nil) so the return value is used — locate the defer func() {
recover() }() and change it to call recover into a variable (for example r :=
recover()) and either assign it to the blank identifier (_ = r) or wrap it with
if r != nil { /* no-op or minimal handling */ } to satisfy errcheck.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 42d71ac6-683a-49d5-a24a-a7beed6f8603
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (4)
common/httpx/domains.gocommon/httpx/domains_test.gocommon/httpx/test-data/sample_with_js.htmlgo.mod
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@common/httpx/domains.go`:
- Around line 41-53: The current flow always runs extractDomainsFromRegex after
the HTML branch, causing JS-only bodies to miss extractDomainsFromJS and
reintroducing parser false positives; change the control flow around the
looksLikeHTML check to a single conditional switch on r.Data: if
looksLikeHTML(r.Data) use extractDomainsFromHTML and then extractDomainsFromJS
for inline scripts (respecting maxInlineScriptSize), else if
looksLikeJavaScript(r.Data) (or Content-Type if available) and len(r.Data) <=
maxInlineScriptSize call extractDomainsFromJS(string(r.Data), domains, fqdns,
r.Input), otherwise call extractDomainsFromRegex(r.Raw, domains, fqdns,
r.Input); update or add a looksLikeJavaScript helper if needed and remove the
unconditional extractDomainsFromRegex call after the HTML branch.
- Around line 58-64: The looksLikeHTML function fails for UTF-8 BOM-prefixed
documents because bytes.TrimSpace doesn't remove the BOM; update looksLikeHTML
to strip a leading UTF-8 BOM (bytes.TrimPrefix(..., []byte{0xEF,0xBB,0xBF}) or
equivalent) from the trimmed prefix before testing the first byte, so inputs
like "\ufeff<!DOCTYPE html>" are correctly detected as HTML.
- Around line 194-198: Normalize resp.Input's host portion before doing equality
checks: extract hostname from the existing input value (e.g., using
net.SplitHostPort or strings.Split on ':' into an inputHost) and then use
inputHost in the comparisons instead of raw input; replace the checks around
domains[val] and fqdns[d] so they read "if inputHost != val { domains[val] =
struct{}{} }" and "if d != val && d != inputHost { fqdns[d] = struct{}{} }" and
add the appropriate import (net or strings) so the normalization occurs before
the comparisons that involve input, d, and val.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 6d51af93-2eb2-4a8a-8815-9829a7bfd1ce
📒 Files selected for processing (1)
common/httpx/domains.go
| // Only run HTML/JS parsers if the body looks like HTML | ||
| if len(r.Data) > 0 && looksLikeHTML(r.Data) { | ||
| inlineScripts := extractDomainsFromHTML(r.Data, domains, fqdns, r.Input) | ||
|
|
||
| for _, script := range inlineScripts { | ||
| if len(script) <= maxInlineScriptSize { | ||
| extractDomainsFromJS(script, domains, fqdns, r.Input) | ||
| } | ||
| } | ||
| } | ||
|
|
||
| // Regex fallback on the raw response (catches anything the parsers miss) | ||
| extractDomainsFromRegex(r.Raw, domains, fqdns, r.Input) |
There was a problem hiding this comment.
Don’t run the legacy regex after the parser path.
looksLikeHTML limits the AST path to inline <script> tags, and Line 53 then scans the whole response with the old regex anyway. That means standalone JavaScript bodies never use extractDomainsFromJS, and HTML responses still accumulate the false positives this parser work is supposed to remove.
Control-flow sketch
if len(r.Data) > 0 {
switch {
case looksLikeHTML(r.Data):
inlineScripts := extractDomainsFromHTML(r.Data, domains, fqdns, r.Input)
for _, script := range inlineScripts {
if len(script) <= maxInlineScriptSize {
extractDomainsFromJS(script, domains, fqdns, r.Input)
}
}
case looksLikeJavaScript(r.Data): // or Content-Type based detection if that is already available here
if len(r.Data) <= maxInlineScriptSize {
extractDomainsFromJS(string(r.Data), domains, fqdns, r.Input)
}
default:
extractDomainsFromRegex(r.Raw, domains, fqdns, r.Input)
}
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@common/httpx/domains.go` around lines 41 - 53, The current flow always runs
extractDomainsFromRegex after the HTML branch, causing JS-only bodies to miss
extractDomainsFromJS and reintroducing parser false positives; change the
control flow around the looksLikeHTML check to a single conditional switch on
r.Data: if looksLikeHTML(r.Data) use extractDomainsFromHTML and then
extractDomainsFromJS for inline scripts (respecting maxInlineScriptSize), else
if looksLikeJavaScript(r.Data) (or Content-Type if available) and len(r.Data) <=
maxInlineScriptSize call extractDomainsFromJS(string(r.Data), domains, fqdns,
r.Input), otherwise call extractDomainsFromRegex(r.Raw, domains, fqdns,
r.Input); update or add a looksLikeJavaScript helper if needed and remove the
unconditional extractDomainsFromRegex call after the HTML branch.
| func looksLikeHTML(data []byte) bool { | ||
| prefix := data | ||
| if len(prefix) > 1024 { | ||
| prefix = prefix[:1024] | ||
| } | ||
| trimmed := bytes.TrimSpace(prefix) | ||
| return len(trimmed) > 0 && trimmed[0] == '<' |
There was a problem hiding this comment.
Treat BOM-prefixed documents as HTML.
bytes.TrimSpace does not strip a UTF-8 BOM, so a valid body like \ufeff<!DOCTYPE html> returns false here and skips the parser path.
Small fix
func looksLikeHTML(data []byte) bool {
prefix := data
if len(prefix) > 1024 {
prefix = prefix[:1024]
}
trimmed := bytes.TrimSpace(prefix)
+ trimmed = bytes.TrimPrefix(trimmed, []byte{0xEF, 0xBB, 0xBF})
return len(trimmed) > 0 && trimmed[0] == '<'
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| func looksLikeHTML(data []byte) bool { | |
| prefix := data | |
| if len(prefix) > 1024 { | |
| prefix = prefix[:1024] | |
| } | |
| trimmed := bytes.TrimSpace(prefix) | |
| return len(trimmed) > 0 && trimmed[0] == '<' | |
| func looksLikeHTML(data []byte) bool { | |
| prefix := data | |
| if len(prefix) > 1024 { | |
| prefix = prefix[:1024] | |
| } | |
| trimmed := bytes.TrimSpace(prefix) | |
| trimmed = bytes.TrimPrefix(trimmed, []byte{0xEF, 0xBB, 0xBF}) | |
| return len(trimmed) > 0 && trimmed[0] == '<' | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@common/httpx/domains.go` around lines 58 - 64, The looksLikeHTML function
fails for UTF-8 BOM-prefixed documents because bytes.TrimSpace doesn't remove
the BOM; update looksLikeHTML to strip a leading UTF-8 BOM
(bytes.TrimPrefix(..., []byte{0xEF,0xBB,0xBF}) or equivalent) from the trimmed
prefix before testing the first byte, so inputs like "\ufeff<!DOCTYPE html>" are
correctly detected as HTML.
| if input != val { | ||
| domains[val] = struct{}{} | ||
| } | ||
| if d != val && d != input { | ||
| fqdns[d] = struct{}{} |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
printf 'Response struct and relevant fields:\n'
rg -n -C2 'type Response struct|Input string|Raw string|Data \[\]byte' --type go
printf '\nPotential writes to Response.Input (excluding tests):\n'
rg -n -C3 '\bInput:\s|\.Input\s*=' --type go -g '!**/*_test.go' -g '!**/vendor/**'
printf '\nBodyDomainGrab call sites:\n'
rg -n -C3 '\bBodyDomainGrab\s*\(' --type goRepository: projectdiscovery/httpx
Length of output: 25591
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Examine the parseURL function to see if it normalizes hostnames
printf 'parseURL function:\n'
rg -n -A15 'func.*parseURL' --type go common/httpx/ runner/ 2>/dev/null | head -50
# Check NewRequestWithContext to understand how Host is set
printf '\n\nChecking if req.Host is normalized:\n'
rg -n -B2 -A2 'req.Host|\.Host' --type go common/httpx/httpx.go | head -40
# Look at the addDomainCandidate function more closely (context around lines 194-198)
printf '\n\naddDomainCandidate function logic:\n'
sed -n '150,220p' common/httpx/domains.goRepository: projectdiscovery/httpx
Length of output: 3365
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Trace where origInput comes from - find all calls to analyze function
printf 'Calls to analyze() function to trace origInput source:\n'
rg -n 'analyze\(' --type go runner/runner.go | head -20
# Check where the input to BodyDomainGrab is passed
printf '\n\nInput parameter flow to BodyDomainGrab:\n'
sed -n '37,50p' common/httpx/domains.go
# Check what BodyDomainGrab receives as input
printf '\n\nBodyDomainGrab signature and input usage:\n'
sed -n '37,45p' common/httpx/domains.go
# Check how input is passed to extraction functions
printf '\n\nHow input is passed to extraction functions:\n'
rg -n 'extractDomainsFrom|addDomainCandidate' --type go common/httpx/domains.go | head -30
# Verify if req.Host can be non-canonical
printf '\n\nNew request creation and Host setting:\n'
sed -n '230,250p' common/httpx/httpx.go
# Check publicsuffix.Domain behavior
printf '\n\nPublicsuffix Domain function to understand what val is:\n'
rg -n 'publicsuffix.Domain' --type go common/httpx/domains.go -B2 -A2Repository: projectdiscovery/httpx
Length of output: 3572
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find where analyze() is called and trace the input parameter
printf 'Trace calls to analyze() - where does t come from:\n'
sed -n '1690,1750p' runner/runner.go
# Check if req.Host can contain a port or other non-canonical data
printf '\n\nCheck how req.Host is set:\n'
rg -n 'NewRequest|req.Host' --type go common/httpx/httpx.go -B3 -A3 | head -40
# Check the types and parseURL implementation more carefully
printf '\n\nparseURL to understand Host vs Hostname:\n'
rg -n -A10 'type.*URL.*struct|func.*URL.*Host' --type go | grep -A10 'urlutil' | head -20
# Look at line 238 context - how is req.Host set?
printf '\n\nContext around resp.Input assignment:\n'
sed -n '225,245p' common/httpx/httpx.go
# Check if there's any normalization of the input before calling BodyDomainGrab
printf '\n\nCheck Response struct to see all fields:\n'
sed -n '1,30p' common/httpx/response.goRepository: projectdiscovery/httpx
Length of output: 3318
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Look at the full addDomainCandidate function again with more context
printf 'Full addDomainCandidate function:\n'
sed -n '176,200p' common/httpx/domains.go
# Check if Go http.Request.Host includes port
printf '\n\nLook at what req.Host value contains (from Go docs behavior):\n'
sed -n '410,425p' common/httpx/httpx.go
# Check test case to see if input can have port
printf '\n\nTest case examining input parameter:\n'
sed -n '450,460p' common/httpx/domains_test.go
# Look at where origInput is created in analyze function
printf '\n\nanalyze function signature and origInput handling:\n'
sed -n '1827,1860p' runner/runner.go
# Check how target.Host is constructed
printf '\n\nTarget structure and Host field:\n'
rg -n 'type Target struct' --type go -A10Repository: projectdiscovery/httpx
Length of output: 3030
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check if req.Host includes port by looking at how it's used and what Go's http.Request does
printf 'Check NewRequestFromURLWithContext signature and req.Host behavior:\n'
rg -n 'retryablehttp.NewRequestFromURLWithContext' --type go -A5 -B5 | head -30
# Check if there's any explicit handling of ports in the comparison
printf '\n\nCheck how input is normalized or processed before addDomainCandidate:\n'
sed -n '37,60p' common/httpx/domains.go
# Look for test cases that show input with ports
printf '\n\nSearch for test cases with input containing ports:\n'
rg -n 'Input:.*:' --type go common/httpx/domains_test.go
# Check how resp.Input gets set in all cases
printf '\n\nAll assignments to resp.Input or Input field:\n'
rg -n 'resp\.Input\s*=' --type go common/httpx/
# Verify if ports are ever included in input
printf '\n\nCheck URLs passed to NewRequestWithContext:\n'
sed -n '1880,1890p' runner/runner.go
# Look at req.String() to see if it includes port
printf '\n\nCheck how req is used after being created:\n'
sed -n '1956,1965p' runner/runner.goRepository: projectdiscovery/httpx
Length of output: 1583
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Let's verify if req.Host includes port by checking Go's behavior
printf 'Check URL parsing to understand Host field:\n'
python3 << 'EOF'
# Go's http.Request.Host behavior from URL
# URL.Host includes port, URL.Hostname() does not
# When creating request from URL, req.Host = URL.Host
test_cases = [
("http://example.com/path", "example.com"),
("http://example.com:443/path", "example.com:443"),
("http://EXAMPLE.COM/path", "EXAMPLE.COM"),
]
print("Go http.Request behavior:")
print("- URL.Host includes port: example.com:443")
print("- URL.Hostname() excludes port: example.com")
print("- req.Host comes from URL.Host, so it can include port")
print()
print("Issue: if input='example.com:443' and d='example.com'")
print(" val = publicsuffix.Domain('example.com') = 'example.com'")
print(" Check: input != val => 'example.com:443' != 'example.com' => TRUE")
print(" Result: example.com gets added to domains (BUG!)")
EOF
# Check if there's any normalization of input before BodyDomainGrab is called
printf '\n\nCheck if input is normalized before passing to BodyDomainGrab:\n'
sed -n '2635,2655p' runner/runner.go
# Look at how the URL is constructed to understand if port is included
printf '\n\nCheck what fullURL contains at line 2644:\n'
sed -n '2600,2650p' runner/runner.go | head -50Repository: projectdiscovery/httpx
Length of output: 2853
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find where the loop variable t comes from (the input parameter to analyze)
printf 'Find where input t originates (look for loop context):\n'
sed -n '1680,1700p' runner/runner.go
# Check the process function to see where it's called from
printf '\n\nCheck process() function which is called initially:\n'
rg -n 'func.*process.*Target' --type go runner/runner.go -A15 | head -40
# Look for the main entry point where inputs are read
printf '\n\nLook for where scan inputs are read:\n'
rg -n 'func.*Scan.*process' --type go runner/runner.go | head -5
# Check what origInput can contain by finding the input source
printf '\n\nFind where input strings originate from:\n'
rg -n 'for.*range.*input|readInputFile|parseInput' --type go runner/runner.go | head -20Repository: projectdiscovery/httpx
Length of output: 879
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Search for the process function definition
printf 'Find process function definition:\n'
rg -n 'func.*\(r \*Runner\) process' --type go runner/runner.go -A10
# Look for where t is defined in the context
printf '\n\nLook for the process function and where it gets called:\n'
rg -n 'r\.process\(' --type go runner/runner.go | head -10
# Find the main entry point
printf '\n\nLook for the main scanning loop:\n'
rg -n 'func.*RunEnumeration\|func.*Scan' --type go runner/runner.go | head -10Repository: projectdiscovery/httpx
Length of output: 1349
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check the test case that explicitly sets Input to verify the expected behavior
printf 'Test case showing Input field usage:\n'
sed -n '450,465p' common/httpx/domains_test.go
# Look at resp.Input assignment again carefully
printf '\n\nWhere resp.Input is set in httpx.go:\n'
sed -n '235,242p' common/httpx/httpx.go
# Check if Go http.Request.Host includes port by testing with a simple case
printf '\n\nCheck req.Host behavior - does it include port?\n'
python3 << 'EOF'
# Go's net/http behavior:
# When creating a request from a URL like "http://example.com:8080/path":
# - url.Host = "example.com:8080" (includes port)
# - url.Hostname() = "example.com" (excludes port)
# - req.Host = url.Host (comes from URL.Host)
# So resp.Input = req.Host will include the port
print("CRITICAL: resp.Input = req.Host can include port number")
print()
print("Example scenario:")
print(" Input URL: http://example.com:8080/")
print(" req.Host = 'example.com:8080'")
print(" resp.Input = 'example.com:8080'")
print()
print("In addDomainCandidate with d='example.com':")
print(" val = publicsuffix.Domain('example.com') = 'example.com'")
print(" Line 194: if input != val")
print(" if 'example.com:8080' != 'example.com' => TRUE (BUG!)")
print(" Adds 'example.com' to domains when it should be filtered")
EOF
# Verify this is actually how it works by checking the test
printf '\n\nTest to confirm the issue:\n'
sed -n '450,465p' common/httpx/domains_test.goRepository: projectdiscovery/httpx
Length of output: 1821
Normalize input to extract hostname before comparing against d and val on lines 194-198.
resp.Input is set to req.Host (line 238 of httpx.go), which in Go's net/http includes the port number (e.g., "example.com:8080"). However, d and val are hostname-only after normalization. This causes the equality checks to fail: if input != val becomes "example.com:8080" != "example.com" → true, allowing the input host to leak back into Domains and Fqdns.
Extract the hostname from input before the comparisons—for example:
inputHost := strings.Split(input, ":")[0]
if inputHost != val {
domains[val] = struct{}{}
}
if d != val && d != inputHost {
fqdns[d] = struct{}{}
}This matches the test expectation (domains_test.go:459) that the input domain should be excluded from results.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@common/httpx/domains.go` around lines 194 - 198, Normalize resp.Input's host
portion before doing equality checks: extract hostname from the existing input
value (e.g., using net.SplitHostPort or strings.Split on ':' into an inputHost)
and then use inputHost in the comparisons instead of raw input; replace the
checks around domains[val] and fqdns[d] so they read "if inputHost != val {
domains[val] = struct{}{} }" and "if d != val && d != inputHost { fqdns[d] =
struct{}{} }" and add the appropriate import (net or strings) so the
normalization occurs before the comparisons that involve input, d, and val.
Proposed changes
Close #1773
Checklist
Summary by CodeRabbit
Bug Fixes
Tests