Feature: Support for crawling dynamic, JavaScript-heavy sites #10

@indrajithi

Description:

Enhance the existing web crawler to support crawling and extracting content from websites that rely heavily on JavaScript for rendering their content. This feature will involve integrating a headless browser to accurately render and interact with such pages.

Objectives:

  • Enable the crawler to fetch and parse content from JavaScript-heavy sites.
  • Use a headless browser to render JavaScript content. (explore playwright-python)
  • Ensure compatibility with the existing crawler structure and options.
  • Maintain the ability to switch between the default fetching method and the headless browser.
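
One way the switch between fetching methods could look is sketched below. This is only an illustration, not the crawler's actual interface: `fetch_default`, `fetch_rendered`, `FETCHERS`, and `get_fetcher` are hypothetical names, and the default path is assumed to be a plain `urllib` request. Playwright is imported lazily so that the default mode keeps working when it is not installed.

```python
import urllib.request
from typing import Callable, Dict


def fetch_default(url: str, timeout: float = 10.0) -> str:
    """Hypothetical default fetcher: plain HTTP GET, no JavaScript executed."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def fetch_rendered(url: str, timeout: float = 10.0) -> str:
    """Hypothetical headless fetcher: returns the DOM after JavaScript has run."""
    # Lazy import: Playwright is only required when this mode is selected.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Playwright timeouts are expressed in milliseconds.
            page.goto(url, timeout=timeout * 1000, wait_until="networkidle")
            return page.content()
        finally:
            browser.close()


# Hypothetical registry mapping a user-facing option to a fetch function.
FETCHERS: Dict[str, Callable[..., str]] = {
    "default": fetch_default,
    "headless": fetch_rendered,
}


def get_fetcher(mode: str = "default") -> Callable[..., str]:
    """Return the fetch function for the requested mode."""
    try:
        return FETCHERS[mode]
    except KeyError:
        raise ValueError(f"unknown fetch mode: {mode!r}") from None
```

Because both fetchers share the same signature, the rest of the crawler would only need to call `get_fetcher(mode)(url)` and would not need to know which backend produced the HTML.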

Design Considerations:

  • Single Headless Browser Instance:
    • Use a single instance of a headless browser to handle multiple asynchronous requests, reducing resource consumption.
  • Concurrency Management:
    • Utilize asyncio and a semaphore to manage concurrent requests within the same browser context.
    • Integrate the asynchronous fetching logic with our existing web crawler structure.
  • Error Handling:
    • Ensure proper error handling and resource cleanup. (no zombie browsers, they are already headless :p)
    • Fall back to the default fetching mode when there is an error with the headless browser. (keep the user informed)
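
The three considerations above could fit together roughly as follows. This is a minimal sketch under stated assumptions, not a proposed final design: `HeadlessFetcher`, `fetch_with_fallback`, `crawl`, the `"crawler"` logger name, and the default concurrency of 5 are all hypothetical. It uses Playwright's async API for the single browser and context, and an `asyncio.Semaphore` to cap how many pages are open at once.

```python
import asyncio
import logging
from typing import Awaitable, Callable, List

logger = logging.getLogger("crawler")  # hypothetical logger name


class HeadlessFetcher:
    """Hypothetical wrapper: one browser, one context, one page per request."""

    def __init__(self) -> None:
        self._playwright = None
        self._browser = None
        self._context = None

    async def start(self) -> None:
        # Lazy import so the default mode works without Playwright installed.
        from playwright.async_api import async_playwright

        self._playwright = await async_playwright().start()
        self._browser = await self._playwright.chromium.launch(headless=True)
        self._context = await self._browser.new_context()

    async def fetch(self, url: str) -> str:
        page = await self._context.new_page()
        try:
            await page.goto(url, wait_until="networkidle")
            return await page.content()
        finally:
            await page.close()  # always release the page, even on error

    async def close(self) -> None:
        # Tear down in reverse order of creation so no browser is left behind.
        if self._browser is not None:
            await self._browser.close()
        if self._playwright is not None:
            await self._playwright.stop()


async def fetch_with_fallback(
    url: str,
    render: Callable[[str], Awaitable[str]],
    fallback: Callable[[str], str],
    semaphore: asyncio.Semaphore,
) -> str:
    """Try the headless path; on any error, log it and use the blocking fallback."""
    try:
        async with semaphore:  # limits concurrent pages in the shared context
            return await render(url)
    except Exception as exc:
        logger.warning("Headless fetch failed for %s (%s); using default fetcher", url, exc)
        loop = asyncio.get_running_loop()
        # Run the blocking fallback in a worker thread to keep the loop responsive.
        return await loop.run_in_executor(None, fallback, url)


async def crawl(
    urls: List[str],
    render: Callable[[str], Awaitable[str]],
    fallback: Callable[[str], str],
    max_concurrency: int = 5,
) -> List[str]:
    """Fetch all URLs concurrently; results come back in the order of `urls`."""
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [fetch_with_fallback(u, render, fallback, semaphore) for u in urls]
    return await asyncio.gather(*tasks)
```

In use, the caller would create a `HeadlessFetcher`, `await fetcher.start()`, pass `fetcher.fetch` as `render` together with any blocking fetch function as `fallback`, and call `await fetcher.close()` in a `finally` block so the browser is shut down even if the crawl is interrupted. The warning log is where the user would be told that a page was fetched without JavaScript rendering.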

Labels: enhancement