fix: update sitemap URL extraction to deal with different type of sit… by saimsajidirl · Pull Request #298 · omkarcloud/botasaurus

saimsajidirl · 2026-05-03T12:19:25Z

…emap urls

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

+            for sm_url in clean_default_sitemap_urls(url):
+                content = fetch_content(sm_url,proxy = request_options['proxy'])
+                if content:
+                    return [sm_url]


+        "sitemap.xml",
+        "sitemap_index.xml",
+        "sitemap-index.xml",
+        "sitemap_index.html",
+        "sitemap-index.html",
+    )
+    return [base_url + candidate for candidate in candidates]


biswajeetdev

Trying multiple candidate URLs is the right fix — sitemap_index.xml and sitemap-index.xml are common patterns that the single-URL approach was silently missing.

Two small observations:

1. Function name
clean_default_sitemap_urls implies it is cleaning/normalising a URL, but it is actually generating a list of candidates. Something like default_sitemap_candidates or get_default_sitemap_urls would be clearer to a future reader.

2. Early return on first successful fetch
The function returns [sm_url] on the first URL that fetch_content returns a truthy value for. Depending on how fetch_content behaves, some hosts serve a 200 with an HTML error page at /sitemap_index.xml — those would be treated as a found sitemap and short-circuit the search before reaching the real sitemap.xml. If fetch_content validates content-type or XML structure this is not an issue, but worth a comment if it only checks HTTP status.
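
To make the concern concrete, here is a minimal sketch of a content check that could run before the early return. The helper name looks_like_sitemap is hypothetical and not part of botasaurus; it assumes fetch_content hands back the response body as str or bytes, which this thread does not confirm. It only recognises XML sitemaps, so the .html candidates in the list would need a separate check.

```python
import xml.etree.ElementTree as ET

# Root elements defined by the sitemaps.org protocol.
SITEMAP_ROOT_TAGS = {"urlset", "sitemapindex"}


def looks_like_sitemap(content):
    """Return True if content parses as XML with a sitemap root element.

    Hypothetical helper: `content` is assumed to be the raw response body
    (str or bytes) returned by the fetch step.
    """
    if not content:
        return False
    try:
        root = ET.fromstring(content)
    except ET.ParseError:
        # Malformed XML, e.g. a typical HTML error page served with a 200.
        return False
    # Drop a namespace prefix such as
    # "{http://www.sitemaps.org/schemas/sitemap/0.9}urlset".
    local_name = root.tag.rsplit("}", 1)[-1]
    return local_name in SITEMAP_ROOT_TAGS
```

An HTML page that happens to be well-formed XML still fails the check because its root element is html. Since the body comes from arbitrary hosts, a hardened parser such as defusedxml may be preferable to the standard library one in production.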

saimsajidirl · 2026-06-02T06:48:01Z

Good catch on both points.

I agree the current name is a bit misleading. The function started as a URL normalization helper and evolved into generating candidate sitemap URLs. Renaming it to something like get_default_sitemap_urls or default_sitemap_candidates would make the intent much clearer.

That's a fair concern. Right now the early return assumes that a successful fetch_content indicates a valid sitemap. If fetch_content only validates the HTTP response and not the actual content, an HTML error page served with a 200 could indeed cause a false positive and prevent checking the remaining candidates. In our case, sitemap validation happens downstream, but I'll add a comment to make that assumption explicit (or move the validation closer to the fetch logic if it makes sense).
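
A rough sketch of what moving the validation closer to the fetch could look like, using the clearer name suggested in the review. Everything here is an assumption for illustration: find_default_sitemap, the fetch and is_valid parameters, and the path tuple (only the five entries visible in the diff) are hypothetical and do not reflect the actual botasaurus code.

```python
from urllib.parse import urljoin

# Only the candidate paths visible in the diff; the real tuple may differ.
DEFAULT_SITEMAP_PATHS = (
    "sitemap.xml",
    "sitemap_index.xml",
    "sitemap-index.xml",
    "sitemap_index.html",
    "sitemap-index.html",
)


def default_sitemap_candidates(base_url):
    """Build candidate sitemap URLs for a site root (hypothetical rename)."""
    # A trailing slash makes urljoin append the path instead of replacing
    # the last segment of the base URL.
    if not base_url.endswith("/"):
        base_url += "/"
    return [urljoin(base_url, path) for path in DEFAULT_SITEMAP_PATHS]


def find_default_sitemap(base_url, fetch, is_valid):
    """Return [url] for the first candidate whose body passes is_valid.

    `fetch(url)` is assumed to return the body or a falsy value on failure;
    `is_valid(body)` is whatever content check the project settles on.
    """
    for candidate in default_sitemap_candidates(base_url):
        content = fetch(candidate)
        # Keep searching when the body is missing or is not a real sitemap,
        # so a 200 error page no longer short-circuits the loop.
        if content and is_valid(content):
            return [candidate]
    return []
```

Passing fetch and is_valid as arguments keeps the loop testable without network access, for example by supplying a dict's get method as the fetcher.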


fix: update sitemap URL extraction to deal with different type of sit…

5efc1d6

…emap urls

Copilot AI review requested due to automatic review settings May 3, 2026 12:19

Copilot started reviewing on behalf of saimsajidirl May 3, 2026 12:19

Copilot AI reviewed May 3, 2026


saimsajidirl requested a review from Copilot May 11, 2026 07:50

Copilot started reviewing on behalf of saimsajidirl May 11, 2026 07:50

Copilot AI reviewed May 11, 2026


biswajeetdev reviewed Jun 1, 2026


feat: implement sitemap content validation and improve default sitema…

2235d61

…p discovery logic

saimsajidirl closed this Jun 2, 2026

saimsajidirl wants to merge 2 commits into omkarcloud:master from saimsajidirl:master
