fix: update sitemap URL extraction to deal with different type of sit…#298
saimsajidirl wants to merge 2 commits into
for sm_url in clean_default_sitemap_urls(url):
    content = fetch_content(sm_url, proxy=request_options['proxy'])
    if content:
        return [sm_url]
    "sitemap.xml",
    "sitemap_index.xml",
    "sitemap-index.xml",
    "sitemap_index.html",
    "sitemap-index.html",
)
return [base_url + candidate for candidate in candidates]
biswajeetdev left a comment
Trying multiple candidate URLs is the right fix: sitemap_index.xml and sitemap-index.xml are common patterns that the single-URL approach was silently missing.
Two small observations:
1. Function name
clean_default_sitemap_urls implies it is cleaning/normalising a URL, but it is actually generating a list of candidates. Something like default_sitemap_candidates or get_default_sitemap_urls would be clearer to a future reader.
2. Early return on first successful fetch
The loop returns [sm_url] for the first candidate URL whose fetch_content result is truthy. Some hosts serve a 200 with an HTML error page at paths such as /sitemap_index.xml; depending on how fetch_content behaves, such a response would be treated as a found sitemap and short-circuit the search before it reaches the real sitemap.xml. If fetch_content validates content-type or XML structure this is not an issue, but it is worth a comment if it only checks HTTP status.
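To make the second observation concrete, here is a minimal sketch of how the loop could validate the response body before returning. It is not the PR's code: default_sitemap_candidates, looks_like_sitemap and find_sitemap are hypothetical names, the fetch argument is an assumed stand-in for fetch_content (a callable that returns the body or None), and the candidate list is trimmed to the XML variants for brevity.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional

# Root elements defined by the sitemaps.org protocol.
SITEMAP_ROOT_TAGS = {"urlset", "sitemapindex"}


def default_sitemap_candidates(base_url: str) -> List[str]:
    """Build candidate sitemap URLs (hypothetical name, per the rename suggestion)."""
    candidates = (
        "sitemap.xml",
        "sitemap_index.xml",
        "sitemap-index.xml",
    )
    base = base_url.rstrip("/") + "/"
    return [base + candidate for candidate in candidates]


def looks_like_sitemap(content: str) -> bool:
    """Return True only if the body parses as XML with a sitemap root element."""
    try:
        root = ET.fromstring(content)
    except ET.ParseError:
        return False  # not well-formed XML, e.g. a typical HTML error page
    # Drop the "{namespace}" prefix ElementTree adds to namespaced tags.
    local_name = root.tag.rsplit("}", 1)[-1]
    return local_name in SITEMAP_ROOT_TAGS


def find_sitemap(base_url: str, fetch: Callable[[str], Optional[str]]) -> List[str]:
    """Return the first candidate whose body validates as a sitemap, else []."""
    for sm_url in default_sitemap_candidates(base_url):
        content = fetch(sm_url)  # assumed to return None or "" on failure
        if content and looks_like_sitemap(content):
            return [sm_url]
    return []
```

With this check, a 200 response carrying an HTML error page no longer short-circuits the search; the loop simply moves on to the next candidate. Whether the check belongs in the loop or inside fetch_content is a design choice for the PR author.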
Good catch on both points.
…p discovery logic
…emap urls