unclecode 66ac07b4f3 feat(crawler): add network request and console message capturing

Implement comprehensive network request and console message capturing functionality:
- Add capture_network_requests and capture_console_messages config parameters
- Add network_requests and console_messages fields to models
- Implement Playwright event listeners to capture requests, responses, and console output
- Create detailed documentation and examples
- Add comprehensive tests

This feature enables deep visibility into web page activity for debugging,
security analysis, performance profiling, and API discovery in web applications.

2025-04-10 16:03:48 +08:00

Development Journal

This journal tracks significant feature additions, bug fixes, and architectural decisions in the crawl4ai project. It serves as both documentation and a historical record of the project's evolution.

[2025-04-09] Added MHTML Capture Feature

Feature: MHTML snapshot capture of crawled pages

Changes Made:

- Added capture_mhtml: bool = False parameter to CrawlerRunConfig class
- Added mhtml: Optional[str] = None field to CrawlResult model
- Added mhtml_data: Optional[str] = None field to AsyncCrawlResponse class
- Implemented capture_mhtml() method in AsyncPlaywrightCrawlerStrategy class to capture MHTML via CDP
- Modified the crawler to capture MHTML when enabled and pass it to the result

Implementation Details:

- MHTML capture uses the Chrome DevTools Protocol (CDP) via Playwright's CDP session API
- The implementation waits for the page to fully load before capturing MHTML content
- Added a requestAnimationFrame-based wait so that JavaScript-rendered content is captured more reliably
- All browser resources are cleaned up after capture
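
The capture flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual capture_mhtml() method: the function name capture_mhtml_sketch and the NEXT_FRAME_JS constant are hypothetical, and the page argument is assumed to behave like a Playwright async Page running on Chromium. The CDP command Page.captureSnapshot with format "mhtml" is the standard DevTools call for this kind of snapshot.

```python
from typing import Any, Optional

# Resolves on the next animation frame, giving pending JS-driven rendering
# a chance to run before the snapshot is taken.
NEXT_FRAME_JS = "() => new Promise(resolve => requestAnimationFrame(() => resolve(true)))"


async def capture_mhtml_sketch(page: Any) -> Optional[str]:
    """Return an MHTML snapshot of `page`, or None if CDP returns no data.

    Hypothetical helper: `page` is assumed to be a Playwright async Page
    (Chromium only, since CDP sessions are a Chromium feature).
    """
    # 1. Wait for the page's load event.
    await page.wait_for_load_state("load")
    # 2. Wait one animation frame for JavaScript-rendered content.
    await page.evaluate(NEXT_FRAME_JS)
    # 3. Open a CDP session and request the snapshot.
    session = await page.context.new_cdp_session(page)
    try:
        snapshot = await session.send("Page.captureSnapshot", {"format": "mhtml"})
        return snapshot.get("data")
    finally:
        # 4. Always release the CDP session, even if the capture fails.
        await session.detach()
```

The try/finally block is one way to honor the cleanup requirement: the session is detached whether or not the snapshot call succeeds.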

Files Modified:

- crawl4ai/models.py: Added the mhtml field to CrawlResult
- crawl4ai/async_configs.py: Added capture_mhtml parameter to CrawlerRunConfig
- crawl4ai/async_crawler_strategy.py: Implemented MHTML capture logic
- crawl4ai/async_webcrawler.py: Added mapping from AsyncCrawlResponse.mhtml_data to CrawlResult.mhtml
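
To picture how the new fields connect, here is a simplified sketch of the mapping from the strategy's response to the final result. The dataclasses ResponseSketch and ResultSketch and the helper to_result are hypothetical stand-ins, not the real crawl4ai classes; only the field names mhtml_data and mhtml and their Optional[str] = None defaults come from the entry above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResponseSketch:
    """Stand-in for AsyncCrawlResponse (illustrative subset of fields)."""
    html: str
    mhtml_data: Optional[str] = None  # populated only when capture is enabled


@dataclass
class ResultSketch:
    """Stand-in for CrawlResult (illustrative subset of fields)."""
    url: str
    html: str
    mhtml: Optional[str] = None  # stays None unless a snapshot was captured


def to_result(url: str, response: ResponseSketch) -> ResultSketch:
    """Copy the snapshot from the low-level response onto the user-facing result."""
    return ResultSketch(url=url, html=response.html, mhtml=response.mhtml_data)
```

Because both fields default to None, a crawl with capture disabled produces a result whose mhtml is None without any special handling in the mapping.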

Testing:

Created comprehensive tests in tests/20241401/test_mhtml.py covering:
- Capturing MHTML when enabled
- Ensuring mhtml is None when disabled explicitly
- Ensuring mhtml is None by default
- Capturing MHTML on JavaScript-enabled pages

Challenges:

- Had to improve page-load detection to ensure JavaScript content was fully rendered
- Tests needed to be run independently due to Playwright browser instance management
- Updated the tests' expected content to match the actual MHTML output

Why This Feature: MHTML capture lets users save a complete web page, including its resources (CSS, images, etc.), in a single file. This is valuable for:

- Offline viewing of captured pages
- Creating permanent snapshots of web content for archival
- Ensuring consistent content for later analysis, even if the original site changes
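
Since an MHTML snapshot is a MIME multipart document, a captured string can be written to disk and inspected later with Python's standard email module. The sketch below is illustrative only: save_mhtml and list_resources are hypothetical helpers, and SAMPLE_MHTML is a hand-written stand-in, much simpler than what a browser actually emits (real snapshots typically use long generated boundaries and encoded bodies).

```python
import email
from pathlib import Path

# Hand-written stand-in for a captured snapshot (assumption, not real output).
SAMPLE_MHTML = """\
From: <Saved by Blink>
Snapshot-Content-Location: https://example.com/
Subject: Example
MIME-Version: 1.0
Content-Type: multipart/related; type="text/html"; boundary="BOUNDARY"

--BOUNDARY
Content-Type: text/html
Content-Location: https://example.com/

<html><body>Hello</body></html>
--BOUNDARY
Content-Type: text/css
Content-Location: https://example.com/style.css

body { color: red; }
--BOUNDARY--
"""


def save_mhtml(mhtml: str, path: Path) -> Path:
    """Write the snapshot to `path` so it can be archived or opened offline."""
    path.write_text(mhtml, encoding="utf-8")
    return path


def list_resources(mhtml: str) -> list[tuple[str, str]]:
    """Return (content type, original URL) for each resource in the snapshot."""
    message = email.message_from_string(mhtml)
    return [
        (part.get_content_type(), part.get("Content-Location", ""))
        for part in message.walk()
        if not part.is_multipart()  # skip the multipart container itself
    ]
```

Listing the embedded resources this way is one option for checking that a snapshot contains the page's stylesheets and images before relying on it for later analysis.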

Future Enhancements to Consider:

- Add an option to save MHTML to a file
- Support filtering which resources are included in the MHTML
- Add support for specifying MHTML capture options

[2025-04-10] Added Network Request and Console Message Capturing

Feature: Comprehensive capturing of network requests/responses and browser console messages during crawling