parser for fqdn #2459

Open
Mzack9999 wants to merge 2 commits into dev from improve-fqdn-extraction

Conversation

@Mzack9999
Member

@Mzack9999 Mzack9999 commented Mar 21, 2026

Proposed changes

Close #1773

Checklist

  • Pull request is created against the dev branch
  • All checks passed (lint, unit/integration/regression tests etc.) with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Summary by CodeRabbit

  • Bug Fixes

    • Improved domain extraction by parsing HTML attributes, meta tags, and embedded scripts for more accurate results.
    • Better normalization and validation of discovered domains, and safer handling of large or complex script contents to avoid hangs.
  • Tests

    • Added extensive test coverage and benchmarks covering many HTML/JS patterns, edge cases, false-positive rejection, and performance scenarios.

@neo-by-projectdiscovery-dev

neo-by-projectdiscovery-dev bot commented Mar 21, 2026

Neo - PR Security Review

No security issues found

Highlights

  • Adds HTML parsing with goquery to extract domains from URL-bearing attributes (href, src, action, formaction, poster, etc.)
  • Adds JavaScript AST parsing with goja to extract domains from string literals and template literals in inline scripts
  • Implements comprehensive domain validation including TLD verification, package name rejection, and file extension filtering
  • Includes 512KB size limit for JS parsing and panic recovery to prevent DoS
Hardening Notes
  • Consider adding a timeout for HTML/JS parsing operations in domains.go to prevent resource exhaustion on extremely large or deeply nested documents
  • The regex pattern [a-z0-9.-]* in potentialDomainRegex at line 19 is safe due to Go's RE2 engine, but consider documenting this explicitly in a comment
  • In extractDomainsFromHTML (line 70), consider adding a maximum depth limit for nested HTML structures as an additional safeguard


@coderabbitai

coderabbitai bot commented Mar 21, 2026

Walkthrough

This PR enhances BodyDomainGrab by parsing HTML (goquery) and JavaScript (goja AST) when HTML is detected, extracting candidate hostnames from element attributes, srcset, meta tags, and script contents, then normalizing/validating domains and using regex as a fallback.

Changes

  • Core Domain Extraction (common/httpx/domains.go): Rewrote BodyDomainGrab to detect HTML bodies, parse with goquery, extract domains from href, src, action, srcset, meta[content], data-* attributes and inline <script> contents; added JS AST parsing via goja (skip >512KB). Introduced helpers addDomainCandidate and hostnameFromURL, moved regex matching to fallback.
  • Tests & Fixtures (common/httpx/domains_test.go, common/httpx/test-data/sample_with_js.html): Added extensive unit tests and benchmarks covering HTML/JS extraction, many edge cases (empty/broken bodies, deduplication, false-positive patterns), helper tests for hostnameFromURL/addDomainCandidate, and an embedded HTML fixture exercising varied URL patterns.
  • Dependencies (go.mod): Added direct dependency github.com/dop251/goja (JS parser) and introduced transitive github.com/go-sourcemap/sourcemap in the module graph.

Sequence Diagram(s)

sequenceDiagram
    participant Client as HTTP Response Body
    participant Detector as Content Detector
    participant HTMLParser as HTML Parser (goquery)
    participant JSParser as JS Parser (goja)
    participant Regex as Regex Fallback
    participant Validator as Domain Normalizer/Validator

    Client->>Detector: provide response bytes
    alt body starts with '<' (HTML)
        Detector->>HTMLParser: parse HTML
        HTMLParser->>HTMLParser: extract href/src/srcset/action/data-*/meta
        HTMLParser->>JSParser: send inline script texts
        JSParser->>JSParser: parse JS AST (skip >512KB) and extract string/template literals
        JSParser->>Validator: emit domain candidates
    end
    Client->>Regex: run regex fallback on raw body
    Regex->>Validator: emit domain candidates
    Validator->>Validator: normalize, validate, publicsuffix, dedupe -> populate domains/fqdns
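
The final normalize/validate/dedupe step in the diagram might look like this simplified sketch; the real code also consults a public-suffix list and more filters, and `addCandidate` is a hypothetical stand-in for addDomainCandidate:

```go
// Simplified sketch of candidate normalization and deduplication.
package main

import (
	"fmt"
	"sort"
	"strings"
)

// addCandidate lowercases and trims a raw hostname candidate, rejects
// obvious non-domains, and records the survivor in the dedup set.
func addCandidate(raw string, seen map[string]struct{}) {
	d := strings.ToLower(strings.TrimSpace(raw))
	d = strings.Trim(d, ".")
	if d == "" || !strings.Contains(d, ".") || strings.ContainsAny(d, " /\\") {
		return
	}
	seen[d] = struct{}{}
}

func main() {
	seen := map[string]struct{}{}
	for _, c := range []string{"Example.COM", "example.com.", "not a domain", "cdn.example.org"} {
		addCandidate(c, seen)
	}
	var out []string
	for d := range seen {
		out = append(out, d)
	}
	sort.Strings(out)
	fmt.Println(out) // [cdn.example.org example.com]
}
```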

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hopped through tags and scripts today,

found hosts that used to hide away.
With goquery's eyes and goja's sight,
domains now dance into the light.
A tiny hop for code — a joyous spray! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 15.71%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Description Check ✅ Passed: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check ✅ Passed: The title 'parser for fqdn' directly references the main change: refactoring FQDN extraction to use actual parsers (goquery for HTML, goja for JavaScript) instead of regex-only heuristics.
  • Linked Issues check ✅ Passed: The PR fully implements issue #1773's requirements: HTML parsing via goquery, JavaScript AST parsing via goja, domain extraction from specific contexts (attributes, srcset, scripts), and regex fallback to reduce false positives.
  • Out of Scope Changes check ✅ Passed: All changes are directly related to improving FQDN extraction through parsers: BodyDomainGrab refactoring, helper functions, test coverage for HTML/JS parsing, and required dependencies (goja). No unrelated changes detected.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@common/httpx/domains.go`:
- Line 151: The deferred anonymous recover in common/httpx/domains.go currently
calls recover() without using its return value which fails lint; update the
defer to explicitly handle the result (e.g., capture into a variable and ignore
or check for nil) so the return value is used — locate the defer func() {
recover() }() and change it to call recover into a variable (for example r :=
recover()) and either assign it to the blank identifier (_ = r) or wrap it with
if r != nil { /* no-op or minimal handling */ } to satisfy errcheck.
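
The requested fix reduces to the following pattern; `safeParse` is a hypothetical stand-in for the function wrapping the goja parse call:

```go
// Sketch of the errcheck-friendly recover pattern: capture recover()'s
// return value into a variable instead of discarding it.
package main

import "fmt"

// safeParse runs fn and reports whether it panicked, swallowing the
// panic so the caller can fall back gracefully.
func safeParse(fn func()) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	fn()
	return false
}

func main() {
	fmt.Println(safeParse(func() {}))                // false
	fmt.Println(safeParse(func() { panic("boom") })) // true
}
```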

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 42d71ac6-683a-49d5-a24a-a7beed6f8603

📥 Commits

Reviewing files that changed from the base of the PR and between 9836829 and 090f0c2.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (4)
  • common/httpx/domains.go
  • common/httpx/domains_test.go
  • common/httpx/test-data/sample_with_js.html
  • go.mod

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@common/httpx/domains.go`:
- Around line 41-53: The current flow always runs extractDomainsFromRegex after
the HTML branch, causing JS-only bodies to miss extractDomainsFromJS and
reintroducing parser false positives; change the control flow around the
looksLikeHTML check to a single conditional switch on r.Data: if
looksLikeHTML(r.Data) use extractDomainsFromHTML and then extractDomainsFromJS
for inline scripts (respecting maxInlineScriptSize), else if
looksLikeJavaScript(r.Data) (or Content-Type if available) and len(r.Data) <=
maxInlineScriptSize call extractDomainsFromJS(string(r.Data), domains, fqdns,
r.Input), otherwise call extractDomainsFromRegex(r.Raw, domains, fqdns,
r.Input); update or add a looksLikeJavaScript helper if needed and remove the
unconditional extractDomainsFromRegex call after the HTML branch.
- Around line 58-64: The looksLikeHTML function fails for UTF-8 BOM-prefixed
documents because bytes.TrimSpace doesn't remove the BOM; update looksLikeHTML
to strip a leading UTF-8 BOM (bytes.TrimPrefix(..., []byte{0xEF,0xBB,0xBF}) or
equivalent) from the trimmed prefix before testing the first byte, so inputs
like "\ufeff<!DOCTYPE html>" are correctly detected as HTML.
- Around line 194-198: Normalize resp.Input's host portion before doing equality
checks: extract hostname from the existing input value (e.g., using
net.SplitHostPort or strings.Split on ':' into an inputHost) and then use
inputHost in the comparisons instead of raw input; replace the checks around
domains[val] and fqdns[d] so they read "if inputHost != val { domains[val] =
struct{}{} }" and "if d != val && d != inputHost { fqdns[d] = struct{}{} }" and
add the appropriate import (net or strings) so the normalization occurs before
the comparisons that involve input, d, and val.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6d51af93-2eb2-4a8a-8815-9829a7bfd1ce

📥 Commits

Reviewing files that changed from the base of the PR and between 090f0c2 and c3b06b5.

📒 Files selected for processing (1)
  • common/httpx/domains.go

Comment on lines +41 to +53
// Only run HTML/JS parsers if the body looks like HTML
if len(r.Data) > 0 && looksLikeHTML(r.Data) {
	inlineScripts := extractDomainsFromHTML(r.Data, domains, fqdns, r.Input)

	for _, script := range inlineScripts {
		if len(script) <= maxInlineScriptSize {
			extractDomainsFromJS(script, domains, fqdns, r.Input)
		}
	}
}

// Regex fallback on the raw response (catches anything the parsers miss)
extractDomainsFromRegex(r.Raw, domains, fqdns, r.Input)

⚠️ Potential issue | 🟠 Major

Don’t run the legacy regex after the parser path.

looksLikeHTML limits the AST path to inline <script> tags, and Line 53 then scans the whole response with the old regex anyway. That means standalone JavaScript bodies never use extractDomainsFromJS, and HTML responses still accumulate the false positives this parser work is supposed to remove.

Control-flow sketch
if len(r.Data) > 0 {
	switch {
	case looksLikeHTML(r.Data):
		inlineScripts := extractDomainsFromHTML(r.Data, domains, fqdns, r.Input)
		for _, script := range inlineScripts {
			if len(script) <= maxInlineScriptSize {
				extractDomainsFromJS(script, domains, fqdns, r.Input)
			}
		}
	case looksLikeJavaScript(r.Data): // or Content-Type based detection if that is already available here
		if len(r.Data) <= maxInlineScriptSize {
			extractDomainsFromJS(string(r.Data), domains, fqdns, r.Input)
		}
	default:
		extractDomainsFromRegex(r.Raw, domains, fqdns, r.Input)
	}
}
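
If a looksLikeJavaScript helper is added as suggested, one possible heuristic is the following sketch; the token list is illustrative only, and Content-Type sniffing would be preferable where the header is available:

```go
// Sketch: sniff the first KB for common JS tokens while rejecting
// anything that already looks like markup.
package main

import (
	"bytes"
	"fmt"
)

func looksLikeJavaScript(data []byte) bool {
	prefix := data
	if len(prefix) > 1024 {
		prefix = prefix[:1024]
	}
	trimmed := bytes.TrimSpace(prefix)
	if len(trimmed) == 0 || trimmed[0] == '<' {
		return false
	}
	for _, tok := range [][]byte{
		[]byte("function"), []byte("var "), []byte("let "),
		[]byte("const "), []byte("=>"), []byte("window."),
	} {
		if bytes.Contains(trimmed, tok) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(looksLikeJavaScript([]byte("const url = 'https://a.example';"))) // true
	fmt.Println(looksLikeJavaScript([]byte("<!DOCTYPE html>")))                  // false
}
```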
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@common/httpx/domains.go` around lines 41 - 53, The current flow always runs
extractDomainsFromRegex after the HTML branch, causing JS-only bodies to miss
extractDomainsFromJS and reintroducing parser false positives; change the
control flow around the looksLikeHTML check to a single conditional switch on
r.Data: if looksLikeHTML(r.Data) use extractDomainsFromHTML and then
extractDomainsFromJS for inline scripts (respecting maxInlineScriptSize), else
if looksLikeJavaScript(r.Data) (or Content-Type if available) and len(r.Data) <=
maxInlineScriptSize call extractDomainsFromJS(string(r.Data), domains, fqdns,
r.Input), otherwise call extractDomainsFromRegex(r.Raw, domains, fqdns,
r.Input); update or add a looksLikeJavaScript helper if needed and remove the
unconditional extractDomainsFromRegex call after the HTML branch.

Comment on lines +58 to +64
func looksLikeHTML(data []byte) bool {
	prefix := data
	if len(prefix) > 1024 {
		prefix = prefix[:1024]
	}
	trimmed := bytes.TrimSpace(prefix)
	return len(trimmed) > 0 && trimmed[0] == '<'
}

⚠️ Potential issue | 🟡 Minor

Treat BOM-prefixed documents as HTML.

bytes.TrimSpace does not strip a UTF-8 BOM, so a valid body like \ufeff<!DOCTYPE html> returns false here and skips the parser path.

Small fix
 func looksLikeHTML(data []byte) bool {
 	prefix := data
 	if len(prefix) > 1024 {
 		prefix = prefix[:1024]
 	}
 	trimmed := bytes.TrimSpace(prefix)
+	trimmed = bytes.TrimPrefix(trimmed, []byte{0xEF, 0xBB, 0xBF})
 	return len(trimmed) > 0 && trimmed[0] == '<'
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

func looksLikeHTML(data []byte) bool {
	prefix := data
	if len(prefix) > 1024 {
		prefix = prefix[:1024]
	}
	trimmed := bytes.TrimSpace(prefix)
	trimmed = bytes.TrimPrefix(trimmed, []byte{0xEF, 0xBB, 0xBF})
	return len(trimmed) > 0 && trimmed[0] == '<'
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@common/httpx/domains.go` around lines 58 - 64, The looksLikeHTML function
fails for UTF-8 BOM-prefixed documents because bytes.TrimSpace doesn't remove
the BOM; update looksLikeHTML to strip a leading UTF-8 BOM
(bytes.TrimPrefix(..., []byte{0xEF,0xBB,0xBF}) or equivalent) from the trimmed
prefix before testing the first byte, so inputs like "\ufeff<!DOCTYPE html>" are
correctly detected as HTML.

Comment on lines +194 to +198
if input != val {
	domains[val] = struct{}{}
}
if d != val && d != input {
	fqdns[d] = struct{}{}
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf 'Response struct and relevant fields:\n'
rg -n -C2 'type Response struct|Input string|Raw string|Data \[\]byte' --type go

printf '\nPotential writes to Response.Input (excluding tests):\n'
rg -n -C3 '\bInput:\s|\.Input\s*=' --type go -g '!**/*_test.go' -g '!**/vendor/**'

printf '\nBodyDomainGrab call sites:\n'
rg -n -C3 '\bBodyDomainGrab\s*\(' --type go

Repository: projectdiscovery/httpx

Length of output: 25591


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Examine the parseURL function to see if it normalizes hostnames
printf 'parseURL function:\n'
rg -n -A15 'func.*parseURL' --type go common/httpx/ runner/ 2>/dev/null | head -50

# Check NewRequestWithContext to understand how Host is set
printf '\n\nChecking if req.Host is normalized:\n'
rg -n -B2 -A2 'req.Host|\.Host' --type go common/httpx/httpx.go | head -40

# Look at the addDomainCandidate function more closely (context around lines 194-198)
printf '\n\naddDomainCandidate function logic:\n'
sed -n '150,220p' common/httpx/domains.go

Repository: projectdiscovery/httpx

Length of output: 3365


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Trace where origInput comes from - find all calls to analyze function
printf 'Calls to analyze() function to trace origInput source:\n'
rg -n 'analyze\(' --type go runner/runner.go | head -20

# Check where the input to BodyDomainGrab is passed
printf '\n\nInput parameter flow to BodyDomainGrab:\n'
sed -n '37,50p' common/httpx/domains.go

# Check what BodyDomainGrab receives as input
printf '\n\nBodyDomainGrab signature and input usage:\n'
sed -n '37,45p' common/httpx/domains.go

# Check how input is passed to extraction functions
printf '\n\nHow input is passed to extraction functions:\n'
rg -n 'extractDomainsFrom|addDomainCandidate' --type go common/httpx/domains.go | head -30

# Verify if req.Host can be non-canonical
printf '\n\nNew request creation and Host setting:\n'
sed -n '230,250p' common/httpx/httpx.go

# Check publicsuffix.Domain behavior
printf '\n\nPublicsuffix Domain function to understand what val is:\n'
rg -n 'publicsuffix.Domain' --type go common/httpx/domains.go -B2 -A2

Repository: projectdiscovery/httpx

Length of output: 3572


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find where analyze() is called and trace the input parameter
printf 'Trace calls to analyze() - where does t come from:\n'
sed -n '1690,1750p' runner/runner.go

# Check if req.Host can contain a port or other non-canonical data
printf '\n\nCheck how req.Host is set:\n'
rg -n 'NewRequest|req.Host' --type go common/httpx/httpx.go -B3 -A3 | head -40

# Check the types and parseURL implementation more carefully
printf '\n\nparseURL to understand Host vs Hostname:\n'
rg -n -A10 'type.*URL.*struct|func.*URL.*Host' --type go | grep -A10 'urlutil' | head -20

# Look at line 238 context - how is req.Host set?
printf '\n\nContext around resp.Input assignment:\n'
sed -n '225,245p' common/httpx/httpx.go

# Check if there's any normalization of the input before calling BodyDomainGrab
printf '\n\nCheck Response struct to see all fields:\n'
sed -n '1,30p' common/httpx/response.go

Repository: projectdiscovery/httpx

Length of output: 3318


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Look at the full addDomainCandidate function again with more context
printf 'Full addDomainCandidate function:\n'
sed -n '176,200p' common/httpx/domains.go

# Check if Go http.Request.Host includes port
printf '\n\nLook at what req.Host value contains (from Go docs behavior):\n'
sed -n '410,425p' common/httpx/httpx.go

# Check test case to see if input can have port
printf '\n\nTest case examining input parameter:\n'
sed -n '450,460p' common/httpx/domains_test.go

# Look at where origInput is created in analyze function
printf '\n\nanalyze function signature and origInput handling:\n'
sed -n '1827,1860p' runner/runner.go

# Check how target.Host is constructed 
printf '\n\nTarget structure and Host field:\n'
rg -n 'type Target struct' --type go -A10

Repository: projectdiscovery/httpx

Length of output: 3030


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check if req.Host includes port by looking at how it's used and what Go's http.Request does
printf 'Check NewRequestFromURLWithContext signature and req.Host behavior:\n'
rg -n 'retryablehttp.NewRequestFromURLWithContext' --type go -A5 -B5 | head -30

# Check if there's any explicit handling of ports in the comparison
printf '\n\nCheck how input is normalized or processed before addDomainCandidate:\n'
sed -n '37,60p' common/httpx/domains.go

# Look for test cases that show input with ports
printf '\n\nSearch for test cases with input containing ports:\n'
rg -n 'Input:.*:' --type go common/httpx/domains_test.go

# Check how resp.Input gets set in all cases
printf '\n\nAll assignments to resp.Input or Input field:\n'
rg -n 'resp\.Input\s*=' --type go common/httpx/

# Verify if ports are ever included in input
printf '\n\nCheck URLs passed to NewRequestWithContext:\n'
sed -n '1880,1890p' runner/runner.go

# Look at req.String() to see if it includes port
printf '\n\nCheck how req is used after being created:\n'
sed -n '1956,1965p' runner/runner.go

Repository: projectdiscovery/httpx

Length of output: 1583


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Let's verify if req.Host includes port by checking Go's behavior
printf 'Check URL parsing to understand Host field:\n'
python3 << 'EOF'
# Go's http.Request.Host behavior from URL
# URL.Host includes port, URL.Hostname() does not
# When creating request from URL, req.Host = URL.Host

test_cases = [
    ("http://example.com/path", "example.com"),
    ("http://example.com:443/path", "example.com:443"),
    ("http://EXAMPLE.COM/path", "EXAMPLE.COM"),
]

print("Go http.Request behavior:")
print("- URL.Host includes port: example.com:443")
print("- URL.Hostname() excludes port: example.com")
print("- req.Host comes from URL.Host, so it can include port")
print()
print("Issue: if input='example.com:443' and d='example.com'")
print("  val = publicsuffix.Domain('example.com') = 'example.com'")
print("  Check: input != val => 'example.com:443' != 'example.com' => TRUE")
print("  Result: example.com gets added to domains (BUG!)")
EOF

# Check if there's any normalization of input before BodyDomainGrab is called
printf '\n\nCheck if input is normalized before passing to BodyDomainGrab:\n'
sed -n '2635,2655p' runner/runner.go

# Look at how the URL is constructed to understand if port is included
printf '\n\nCheck what fullURL contains at line 2644:\n'
sed -n '2600,2650p' runner/runner.go | head -50

Repository: projectdiscovery/httpx

Length of output: 2853


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find where the loop variable t comes from (the input parameter to analyze)
printf 'Find where input t originates (look for loop context):\n'
sed -n '1680,1700p' runner/runner.go

# Check the process function to see where it's called from
printf '\n\nCheck process() function which is called initially:\n'
rg -n 'func.*process.*Target' --type go runner/runner.go -A15 | head -40

# Look for the main entry point where inputs are read
printf '\n\nLook for where scan inputs are read:\n'
rg -n 'func.*Scan.*process' --type go runner/runner.go | head -5

# Check what origInput can contain by finding the input source
printf '\n\nFind where input strings originate from:\n'
rg -n 'for.*range.*input|readInputFile|parseInput' --type go runner/runner.go | head -20

Repository: projectdiscovery/httpx

Length of output: 879


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Search for the process function definition 
printf 'Find process function definition:\n'
rg -n 'func.*\(r \*Runner\) process' --type go runner/runner.go -A10

# Look for where t is defined in the context
printf '\n\nLook for the process function and where it gets called:\n'
rg -n 'r\.process\(' --type go runner/runner.go | head -10

# Find the main entry point
printf '\n\nLook for the main scanning loop:\n'
rg -n 'func.*RunEnumeration\|func.*Scan' --type go runner/runner.go | head -10

Repository: projectdiscovery/httpx

Length of output: 1349


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check the test case that explicitly sets Input to verify the expected behavior
printf 'Test case showing Input field usage:\n'
sed -n '450,465p' common/httpx/domains_test.go

# Look at resp.Input assignment again carefully
printf '\n\nWhere resp.Input is set in httpx.go:\n'
sed -n '235,242p' common/httpx/httpx.go

# Check if Go http.Request.Host includes port by testing with a simple case
printf '\n\nCheck req.Host behavior - does it include port?\n'
python3 << 'EOF'
# Go's net/http behavior:
# When creating a request from a URL like "http://example.com:8080/path":
# - url.Host = "example.com:8080"  (includes port)
# - url.Hostname() = "example.com"  (excludes port)
# - req.Host = url.Host (comes from URL.Host)
# So resp.Input = req.Host will include the port

print("CRITICAL: resp.Input = req.Host can include port number")
print()
print("Example scenario:")
print("  Input URL: http://example.com:8080/")
print("  req.Host = 'example.com:8080'")
print("  resp.Input = 'example.com:8080'")
print()
print("In addDomainCandidate with d='example.com':")
print("  val = publicsuffix.Domain('example.com') = 'example.com'")
print("  Line 194: if input != val")
print("  if 'example.com:8080' != 'example.com'  => TRUE (BUG!)")
print("  Adds 'example.com' to domains when it should be filtered")
EOF

# Verify this is actually how it works by checking the test
printf '\n\nTest to confirm the issue:\n'
sed -n '450,465p' common/httpx/domains_test.go

Repository: projectdiscovery/httpx

Length of output: 1821


Normalize input to extract hostname before comparing against d and val on lines 194-198.

resp.Input is set to req.Host (line 238 of httpx.go), which in Go's net/http includes the port number (e.g., "example.com:8080"). However, d and val are hostname-only after normalization, so the equality checks misfire: if input != val compares "example.com:8080" against "example.com" and evaluates to true, allowing the input host to leak back into Domains and Fqdns.

Extract the hostname from input before the comparisons—for example:

inputHost := strings.Split(input, ":")[0]
if inputHost != val {
    domains[val] = struct{}{}
}
if d != val && d != inputHost {
    fqdns[d] = struct{}{}
}

This matches the test expectation (domains_test.go:459) that the input domain should be excluded from results.
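
As a side note, strings.Split(input, ":")[0] mishandles IPv6 literals; a net.SplitHostPort-based variant is safer. A sketch, with `hostOnly` as a hypothetical helper name:

```go
// Sketch: strip an optional port from a host value, handling IPv6
// literals that strings.Split on ':' would break.
package main

import (
	"fmt"
	"net"
	"strings"
)

// hostOnly returns the hostname portion, falling back to the raw value
// when there is no port (net.SplitHostPort errors in that case).
func hostOnly(input string) string {
	if host, _, err := net.SplitHostPort(input); err == nil {
		return host
	}
	return strings.Trim(input, "[]") // bare IPv6 literal without port
}

func main() {
	fmt.Println(hostOnly("example.com:8080"))  // example.com
	fmt.Println(hostOnly("example.com"))       // example.com
	fmt.Println(hostOnly("[2001:db8::1]:443")) // 2001:db8::1
}
```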

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@common/httpx/domains.go` around lines 194 - 198, Normalize resp.Input's host
portion before doing equality checks: extract hostname from the existing input
value (e.g., using net.SplitHostPort or strings.Split on ':' into an inputHost)
and then use inputHost in the comparisons instead of raw input; replace the
checks around domains[val] and fqdns[d] so they read "if inputHost != val {
domains[val] = struct{}{} }" and "if d != val && d != inputHost { fqdns[d] =
struct{}{} }" and add the appropriate import (net or strings) so the
normalization occurs before the comparisons that involve input, d, and val.



Development

Successfully merging this pull request may close these issues.

improve fqdn extraction from response body using parsers
