Skip to content

fix: update sitemap URL extraction to deal with different type of sit…#298

Closed
saimsajidirl wants to merge 2 commits into
omkarcloud:masterfrom
saimsajidirl:master
Closed

fix: update sitemap URL extraction to deal with different type of sit…#298
saimsajidirl wants to merge 2 commits into
omkarcloud:masterfrom
saimsajidirl:master

Conversation

@saimsajidirl
Copy link
Copy Markdown

…emap urls

Copilot AI review requested due to automatic review settings May 3, 2026 12:19
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread botasaurus/sitemap.py Outdated
Comment on lines +153 to +156
for sm_url in clean_default_sitemap_urls(url):
content = fetch_content(sm_url,proxy = request_options['proxy'])
if content:
return [sm_url]
Comment on lines +144 to +150
"sitemap.xml",
"sitemap_index.xml",
"sitemap-index.xml",
"sitemap_index.html",
"sitemap-index.html",
)
return [base_url + candidate for candidate in candidates]
Copy link
Copy Markdown

@biswajeetdev biswajeetdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying multiple candidate URLs is the right fix — sitemap_index.xml and sitemap-index.xml are common patterns that the single-URL approach was silently missing.

Two small observations:

1. Function name
clean_default_sitemap_urls implies it is cleaning/normalising a URL, but it is actually generating a list of candidates. Something like default_sitemap_candidates or get_default_sitemap_urls would be clearer to a future reader.

2. Early return on first successful fetch
The function returns [sm_url] on the first URL that fetch_content returns a truthy value for. Depending on how fetch_content behaves, some hosts serve a 200 with an HTML error page at /sitemap_index.xml — those would be treated as a found sitemap and short-circuit the search before reaching the real sitemap.xml. If fetch_content validates content-type or XML structure this is not an issue, but worth a comment if it only checks HTTP status.

@saimsajidirl
Copy link
Copy Markdown
Author

Good catch on both points.

  1. I agree the current name is a bit misleading. The function started as a URL normalization helper and evolved into generating candidate sitemap URLs. Renaming it to something like get_default_sitemap_urls or default_sitemap_candidates would make the intent much clearer.

  2. That's a fair concern. Right now the early return assumes that a successful fetch_content indicates a valid sitemap. If fetch_content only validates the HTTP response and not the actual content, an HTML error page served with a 200 could indeed cause a false positive and prevent checking the remaining candidates. In our case, sitemap validation happens downstream, but I'll add a comment to make that assumption explicit (or move the validation closer to the fetch logic if it makes sense).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants