Feature: Support for crawling dynamic, JavaScript-heavy sites #10

Description:

Enhance the existing web crawler to support crawling and extracting content from websites that rely heavily on JavaScript for rendering their content. This feature will involve integrating a headless browser to accurately render and interact with such pages.

Objectives:

- Enable the crawler to fetch and parse content from JavaScript-heavy sites.
- Use a headless browser to render JavaScript content (candidate to explore: [playwright-python](https://github.com/microsoft/playwright-python)).
- Ensure compatibility with the existing crawler structure and options.
- Maintain the ability to switch between the default fetching method and the headless browser.
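One way the switch between the two fetching modes could look is sketched below. This is only an illustration: `CrawlOptions`, `render_js`, `fetch_static`, `fetch_rendered`, and `select_fetcher` are hypothetical names, not part of the existing crawler, and the static path uses `urllib` as a stand-in for whatever the default fetcher actually is. For brevity the rendered path launches a browser per call, which the single-instance design consideration below is meant to avoid.

```python
import asyncio
import urllib.request
from dataclasses import dataclass
from functools import partial
from typing import Awaitable, Callable

Fetcher = Callable[[str], Awaitable[str]]


@dataclass
class CrawlOptions:
    """Hypothetical option set; render_js toggles the headless browser."""
    render_js: bool = False
    timeout: float = 10.0  # seconds


async def fetch_static(url: str, timeout: float = 10.0) -> str:
    """Stand-in for the default fetching method: plain HTTP GET, no JS."""
    def _get() -> str:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            charset = resp.headers.get_content_charset() or "utf-8"
            return resp.read().decode(charset, errors="replace")
    # Run the blocking request in a worker thread so the event loop stays free.
    return await asyncio.to_thread(_get)


async def fetch_rendered(url: str, timeout: float = 10.0) -> str:
    """Render the page in headless Chromium and return the resulting HTML."""
    # Imported lazily so Playwright is only required when this mode is used.
    from playwright.async_api import async_playwright

    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        try:
            page = await browser.new_page()
            # Playwright timeouts are expressed in milliseconds.
            await page.goto(url, timeout=timeout * 1000)
            return await page.content()
        finally:
            await browser.close()


def select_fetcher(options: CrawlOptions) -> Fetcher:
    """Pick the fetch coroutine according to the crawler options."""
    chosen = fetch_rendered if options.render_js else fetch_static
    return partial(chosen, timeout=options.timeout)
```

With this shape the rest of the crawler only ever calls `await fetch(url)`, so the existing structure does not need to know which mode is active.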

Design Considerations:

- Single Headless Browser Instance:
    - Use a single headless browser instance to serve multiple asynchronous requests, reducing resource consumption.
- Concurrency Management:
    - Use asyncio and a semaphore to limit the number of concurrent requests within the same browser context.
    - Integrate the asynchronous fetching logic with our existing web crawler structure.
- Error Handling:
    - Ensure proper error handling and resource cleanup. (no zombie browsers, they are already headless :p)
    - Fall back to the default fetching mode when there is an error with the headless browser, and keep the user informed when this happens.
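The three considerations above could fit together roughly as in the sketch below. It is a minimal illustration under assumptions, not a settled design: `BrowserFetcher`, `fetch_default`, `max_concurrency`, and the logger name are hypothetical, `fetch_default` stands in for the crawler's existing fetch path, and waiting for `networkidle` is just one possible readiness signal. The Playwright calls are from its async Python API.

```python
import asyncio
import logging
import urllib.request
from typing import Callable

logger = logging.getLogger("crawler.browser")


def fetch_default(url: str, timeout: float = 10.0) -> str:
    """Stand-in for the existing (non-JS) fetching method."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


class BrowserFetcher:
    """One shared headless browser; a semaphore caps the open pages."""

    def __init__(self, max_concurrency: int = 4,
                 fallback: Callable[[str], str] = fetch_default) -> None:
        self._sem = asyncio.Semaphore(max_concurrency)
        self._fallback = fallback
        self._pw = self._browser = self._context = None

    async def __aenter__(self) -> "BrowserFetcher":
        # Lazy import: Playwright is only needed when this mode is enabled.
        from playwright.async_api import async_playwright
        self._pw = await async_playwright().start()
        try:
            self._browser = await self._pw.chromium.launch(headless=True)
            self._context = await self._browser.new_context()
        except Exception:
            await self.close()  # do not leave a half-started browser behind
            raise
        return self

    async def __aexit__(self, *exc_info) -> None:
        await self.close()

    async def close(self) -> None:
        """Release the browser and the Playwright driver, even on errors."""
        try:
            if self._browser is not None:
                await self._browser.close()  # also closes its contexts
        finally:
            if self._pw is not None:
                await self._pw.stop()
            self._pw = self._browser = self._context = None

    async def _render(self, url: str) -> str:
        if self._context is None:
            raise RuntimeError("browser not started")
        page = await self._context.new_page()
        try:
            await page.goto(url, wait_until="networkidle")
            return await page.content()
        finally:
            await page.close()

    async def fetch(self, url: str) -> str:
        async with self._sem:  # at most max_concurrency pages at once
            try:
                return await self._render(url)
            except Exception as exc:
                # Keep the user informed before switching modes.
                logger.warning("Headless fetch failed for %s (%s); "
                               "falling back to the default fetcher", url, exc)
        # Run the blocking fallback in a thread, outside the semaphore.
        return await asyncio.to_thread(self._fallback, url)
```

Intended usage would be `async with BrowserFetcher(max_concurrency=4) as fetcher:` followed by `await asyncio.gather(*(fetcher.fetch(u) for u in urls))`, creating the fetcher inside the running event loop so the semaphore is bound to it. Whether one shared context or one context per request is the better isolation trade-off is left open here.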
