
Implement comprehensive network request and console message capturing functionality:

- Add `capture_network_requests` and `capture_console_messages` config parameters
- Add `network_requests` and `console_messages` fields to models
- Implement Playwright event listeners to capture requests, responses, and console output
- Create detailed documentation and examples
- Add comprehensive tests

This feature enables deep visibility into web page activity for debugging, security analysis, performance profiling, and API discovery in web applications.
I want to enhance the `AsyncPlaywrightCrawlerStrategy` to optionally capture network requests and console messages during a crawl, storing them in the final `CrawlResult`.

Here's a breakdown of the proposed changes across the relevant files:
**1. Configuration (`crawl4ai/async_configs.py`)**

- Goal: Add flags to `CrawlerRunConfig` to enable/disable capturing.
- Changes:
  - Add two new boolean attributes to `CrawlerRunConfig`:
    - `capture_network_requests: bool = False`
    - `capture_console_messages: bool = False`
  - Update `__init__`, `from_kwargs`, `to_dict`, and implicitly `clone`/`dump`/`load` to include these new attributes.
```python
# ==== File: crawl4ai/async_configs.py ====
# ... (imports) ...

class CrawlerRunConfig():
    # ... (existing attributes) ...

    # NEW: Network and Console Capturing Parameters
    capture_network_requests: bool = False
    capture_console_messages: bool = False

    # Experimental Parameters
    experimental: Dict[str, Any] = None

    def __init__(
        self,
        # ... (existing parameters) ...
        # NEW: Network and Console Capturing Parameters
        capture_network_requests: bool = False,
        capture_console_messages: bool = False,
        # Experimental Parameters
        experimental: Dict[str, Any] = None,
    ):
        # ... (existing assignments) ...

        # NEW: Assign new parameters
        self.capture_network_requests = capture_network_requests
        self.capture_console_messages = capture_console_messages

        # Experimental Parameters
        self.experimental = experimental or {}

        # ... (rest of __init__) ...

    @staticmethod
    def from_kwargs(kwargs: dict) -> "CrawlerRunConfig":
        return CrawlerRunConfig(
            # ... (existing kwargs gets) ...
            # NEW: Get new parameters
            capture_network_requests=kwargs.get("capture_network_requests", False),
            capture_console_messages=kwargs.get("capture_console_messages", False),
            # Experimental Parameters
            experimental=kwargs.get("experimental"),
        )

    def to_dict(self):
        return {
            # ... (existing dict entries) ...
            # NEW: Add new parameters to dict
            "capture_network_requests": self.capture_network_requests,
            "capture_console_messages": self.capture_console_messages,
            "experimental": self.experimental,
        }

    # clone(), dump(), load() should work automatically if they rely on to_dict() and from_kwargs()
    # or the serialization logic correctly handles all attributes.
```
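To sanity-check the configuration plumbing, here is a minimal usage sketch. It assumes this change has been applied; `CrawlerRunConfig`, `from_kwargs`, and `to_dict` are existing crawl4ai APIs, while the two capture flags are the new parameters proposed above.

```python
# Minimal sketch: construct a config with the proposed capture flags and
# round-trip it through to_dict()/from_kwargs() to confirm serialization support.
from crawl4ai.async_configs import CrawlerRunConfig

config = CrawlerRunConfig(
    capture_network_requests=True,    # proposed flag
    capture_console_messages=True,    # proposed flag
)

restored = CrawlerRunConfig.from_kwargs(config.to_dict())
assert restored.capture_network_requests is True
assert restored.capture_console_messages is True
```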
**2. Data Models (`crawl4ai/models.py`)**

- Goal: Add fields to store the captured data in the response/result objects.
- Changes:
  - Add `network_requests: Optional[List[Dict[str, Any]]] = None` and `console_messages: Optional[List[Dict[str, Any]]] = None` to `AsyncCrawlResponse`.
  - Add the same fields to `CrawlResult`.
```python
# ==== File: crawl4ai/models.py ====
# ... (imports) ...

# ... (Existing dataclasses/models) ...

class AsyncCrawlResponse(BaseModel):
    html: str
    response_headers: Dict[str, str]
    js_execution_result: Optional[Dict[str, Any]] = None
    status_code: int
    screenshot: Optional[str] = None
    pdf_data: Optional[bytes] = None
    get_delayed_content: Optional[Callable[[Optional[float]], Awaitable[str]]] = None
    downloaded_files: Optional[List[str]] = None
    ssl_certificate: Optional[SSLCertificate] = None
    redirected_url: Optional[str] = None
    # NEW: Fields for captured data
    network_requests: Optional[List[Dict[str, Any]]] = None
    console_messages: Optional[List[Dict[str, Any]]] = None

    class Config:
        arbitrary_types_allowed = True

# ... (Existing models like MediaItem, Link, etc.) ...

class CrawlResult(BaseModel):
    url: str
    html: str
    success: bool
    cleaned_html: Optional[str] = None
    media: Dict[str, List[Dict]] = {}
    links: Dict[str, List[Dict]] = {}
    downloaded_files: Optional[List[str]] = None
    js_execution_result: Optional[Dict[str, Any]] = None
    screenshot: Optional[str] = None
    pdf: Optional[bytes] = None
    mhtml: Optional[str] = None  # Added mhtml based on the provided models.py
    _markdown: Optional[MarkdownGenerationResult] = PrivateAttr(default=None)
    extracted_content: Optional[str] = None
    metadata: Optional[dict] = None
    error_message: Optional[str] = None
    session_id: Optional[str] = None
    response_headers: Optional[dict] = None
    status_code: Optional[int] = None
    ssl_certificate: Optional[SSLCertificate] = None
    dispatch_result: Optional[DispatchResult] = None
    redirected_url: Optional[str] = None
    # NEW: Fields for captured data
    network_requests: Optional[List[Dict[str, Any]]] = None
    console_messages: Optional[List[Dict[str, Any]]] = None

    class Config:
        arbitrary_types_allowed = True

    # ... (Existing __init__, properties, model_dump for markdown compatibility) ...

# ... (Rest of the models) ...
```
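Because the new fields are plain lists of dictionaries, downstream code can analyze them without any crawl4ai-specific types. The hypothetical helper below is a sketch that assumes the entry keys proposed in the strategy section (`event_type`, `resource_type`, `type`, `text`); it works on the raw lists whether they come from `CrawlResult` or `AsyncCrawlResponse`.

```python
from collections import Counter
from typing import Any, Dict, List, Optional


def summarize_capture(
    network_requests: Optional[List[Dict[str, Any]]],
    console_messages: Optional[List[Dict[str, Any]]],
) -> Dict[str, Any]:
    """Summarize captured data; both fields may be None when capturing is disabled."""
    net = network_requests or []
    con = console_messages or []
    return {
        "network_events_by_type": dict(Counter(e.get("event_type", "unknown") for e in net)),
        "requests_by_resource": dict(
            Counter(e.get("resource_type", "unknown") for e in net if e.get("event_type") == "request")
        ),
        "console_by_type": dict(Counter(m.get("type", "unknown") for m in con)),
        "console_errors": [m.get("text") for m in con if m.get("type") == "error"],
    }
```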
**3. Crawler Strategy (`crawl4ai/async_crawler_strategy.py`)**

- Goal: Implement the actual capturing logic within `AsyncPlaywrightCrawlerStrategy._crawl_web`.
- Changes:
  - Inside `_crawl_web`, initialize empty lists `captured_requests = []` and `captured_console = []`.
  - Conditionally attach Playwright event listeners (`page.on(...)`) based on the `config.capture_network_requests` and `config.capture_console_messages` flags.
  - Define handler functions for these listeners that extract the relevant data and append it to the respective lists, including timestamps.
  - Pass the captured lists to the `AsyncCrawlResponse` constructor at the end of the method.
```python
# ==== File: crawl4ai/async_crawler_strategy.py ====
# ... (imports) ...
import time  # Make sure time is imported


class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
    # ... (existing methods like __init__, start, close, etc.) ...

    async def _crawl_web(
        self, url: str, config: CrawlerRunConfig
    ) -> AsyncCrawlResponse:
        """
        Internal method to crawl web URLs with the specified configuration.
        Includes optional network and console capturing.  # MODIFIED DOCSTRING
        """
        config.url = url
        response_headers = {}
        execution_result = None
        status_code = None
        redirected_url = url

        # Reset downloaded files list for new crawl
        self._downloaded_files = []

        # Initialize capture lists - IMPORTANT: Reset per crawl
        captured_requests: List[Dict[str, Any]] = []
        captured_console: List[Dict[str, Any]] = []

        # Handle user agent ... (existing code) ...

        # Get page for session
        page, context = await self.browser_manager.get_page(crawlerRunConfig=config)

        # ... (existing code for cookies, navigator overrides, hooks) ...

        # --- Setup Capturing Listeners ---
        # NOTE: These listeners are attached *before* page.goto()

        # Network Request Capturing
        if config.capture_network_requests:
            async def handle_request_capture(request):
                try:
                    post_data_str = None
                    try:
                        # Be cautious with large post data
                        post_data = request.post_data_buffer
                        if post_data:
                            # Attempt to decode, fall back to a size indication for binary data
                            try:
                                post_data_str = post_data.decode('utf-8')
                            except UnicodeDecodeError:
                                post_data_str = f"[Binary data: {len(post_data)} bytes]"
                    except Exception:
                        post_data_str = "[Error retrieving post data]"

                    captured_requests.append({
                        "event_type": "request",
                        "url": request.url,
                        "method": request.method,
                        "headers": dict(request.headers),  # Convert Header dict
                        "post_data": post_data_str,
                        "resource_type": request.resource_type,
                        "is_navigation_request": request.is_navigation_request(),
                        "timestamp": time.time()
                    })
                except Exception as e:
                    self.logger.warning(f"Error capturing request details for {request.url}: {e}", tag="CAPTURE")
                    captured_requests.append({"event_type": "request_capture_error", "url": request.url, "error": str(e), "timestamp": time.time()})

            async def handle_response_capture(response):
                try:
                    # Avoid capturing full response body by default due to size/security
                    # security_details = await response.security_details()  # Optional: More SSL info
                    captured_requests.append({
                        "event_type": "response",
                        "url": response.url,
                        "status": response.status,
                        "status_text": response.status_text,
                        "headers": dict(response.headers),  # Convert Header dict
                        "from_service_worker": response.from_service_worker,
                        # "security_details": security_details,  # Uncomment if needed
                        "request_timing": response.request.timing,  # Detailed timing info
                        "timestamp": time.time()
                    })
                except Exception as e:
                    self.logger.warning(f"Error capturing response details for {response.url}: {e}", tag="CAPTURE")
                    captured_requests.append({"event_type": "response_capture_error", "url": response.url, "error": str(e), "timestamp": time.time()})

            async def handle_request_failed_capture(request):
                try:
                    captured_requests.append({
                        "event_type": "request_failed",
                        "url": request.url,
                        "method": request.method,
                        "resource_type": request.resource_type,
                        # In the Python API, request.failure is an Optional[str] error text
                        "failure_text": request.failure or "Unknown failure",
                        "timestamp": time.time()
                    })
                except Exception as e:
                    self.logger.warning(f"Error capturing request failed details for {request.url}: {e}", tag="CAPTURE")
                    captured_requests.append({"event_type": "request_failed_capture_error", "url": request.url, "error": str(e), "timestamp": time.time()})

            page.on("request", handle_request_capture)
            page.on("response", handle_response_capture)
            page.on("requestfailed", handle_request_failed_capture)

        # Console Message Capturing
        if config.capture_console_messages:
            async def handle_console_capture(msg):
                try:
                    location = msg.location  # Property (dict) in the Python API
                    # Attempt to resolve JSHandle args to primitive values
                    resolved_args = []
                    try:
                        for arg in msg.args:
                            resolved_args.append(await arg.json_value())  # May fail for complex objects
                    except Exception:
                        resolved_args.append("[Could not resolve JSHandle args]")

                    captured_console.append({
                        "type": msg.type,  # e.g., 'log', 'error', 'warning'
                        "text": msg.text,
                        "args": resolved_args,  # Captured arguments
                        "location": f"{location['url']}:{location['lineNumber']}:{location['columnNumber']}" if location else "N/A",
                        "timestamp": time.time()
                    })
                except Exception as e:
                    self.logger.warning(f"Error capturing console message: {e}", tag="CAPTURE")
                    captured_console.append({"type": "console_capture_error", "error": str(e), "timestamp": time.time()})

            def handle_pageerror_capture(err):
                try:
                    captured_console.append({
                        "type": "error",  # Consistent type for page errors
                        "text": err.message,
                        "stack": err.stack,
                        "timestamp": time.time()
                    })
                except Exception as e:
                    self.logger.warning(f"Error capturing page error: {e}", tag="CAPTURE")
                    captured_console.append({"type": "pageerror_capture_error", "error": str(e), "timestamp": time.time()})

            page.on("console", handle_console_capture)
            page.on("pageerror", handle_pageerror_capture)
        # --- End Setup Capturing Listeners ---

        # Set up console logging if requested (keep original logging logic separate or merge carefully)
        if config.log_console:
            # ... (original log_console setup using page.on(...) remains here) ...
            # This allows logging to screen *and* capturing to the list if both flags are True
            def log_consol(msg, console_log_type="debug"):
                # ... existing implementation ...
                pass  # Placeholder for existing code

            page.on("console", lambda msg: log_consol(msg, "debug"))
            page.on("pageerror", lambda e: log_consol(e, "error"))

        try:
            # ... (existing code for SSL, downloads, goto, waits, JS execution, etc.) ...

            # Get final HTML content
            # ... (existing code for selector logic or page.content()) ...
            if config.css_selector:
                # ... existing selector logic ...
                html = f"<div class='crawl4ai-result'>\n" + "\n".join(html_parts) + "\n</div>"
            else:
                html = await page.content()

            await self.execute_hook(
                "before_return_html", page=page, html=html, context=context, config=config
            )

            # Handle PDF and screenshot generation
            # ... (existing code) ...

            # Define delayed content getter
            # ... (existing code) ...

            # Return complete response - ADD CAPTURED DATA HERE
            return AsyncCrawlResponse(
                html=html,
                response_headers=response_headers,
                js_execution_result=execution_result,
                status_code=status_code,
                screenshot=screenshot_data,
                pdf_data=pdf_data,
                get_delayed_content=get_delayed_content,
                ssl_certificate=ssl_cert,
                downloaded_files=(
                    self._downloaded_files if self._downloaded_files else None
                ),
                redirected_url=redirected_url,
                # NEW: Pass captured data conditionally
                network_requests=captured_requests if config.capture_network_requests else None,
                console_messages=captured_console if config.capture_console_messages else None,
            )
        except Exception as e:
            raise e  # Re-raise the original exception
        finally:
            # If no session_id is given we should close the page
            if not config.session_id:
                # Detach listeners before closing to prevent potential errors during close
                if config.capture_network_requests:
                    page.remove_listener("request", handle_request_capture)
                    page.remove_listener("response", handle_response_capture)
                    page.remove_listener("requestfailed", handle_request_failed_capture)
                if config.capture_console_messages:
                    page.remove_listener("console", handle_console_capture)
                    page.remove_listener("pageerror", handle_pageerror_capture)
                # Also remove logging listeners if they were attached
                if config.log_console:
                    # Need to figure out how to remove the lambdas if necessary,
                    # or ensure they don't cause issues on close. Often, it's fine.
                    pass
                await page.close()

    # ... (rest of AsyncPlaywrightCrawlerStrategy methods) ...
```
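The same `page.on(...)` wiring can be exercised outside crawl4ai with a bare Playwright script, which is handy for checking event payloads before integrating them into `_crawl_web`. This sketch uses only documented Playwright Python APIs; the URL and the subset of fields captured here are illustrative.

```python
import asyncio
import time

from playwright.async_api import async_playwright


async def main(url: str = "https://example.com") -> None:
    requests_log, console_log = [], []

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()

        # Same events the proposed _crawl_web changes listen to
        page.on("request", lambda r: requests_log.append(
            {"event_type": "request", "url": r.url, "method": r.method,
             "resource_type": r.resource_type, "timestamp": time.time()}))
        page.on("response", lambda r: requests_log.append(
            {"event_type": "response", "url": r.url, "status": r.status,
             "timestamp": time.time()}))
        page.on("console", lambda m: console_log.append(
            {"type": m.type, "text": m.text, "timestamp": time.time()}))

        await page.goto(url)
        await browser.close()

    print(f"captured {len(requests_log)} network events, {len(console_log)} console messages")


if __name__ == "__main__":
    asyncio.run(main())
```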
**4. Core Crawler (`crawl4ai/async_webcrawler.py`)**

- Goal: Ensure the captured data from `AsyncCrawlResponse` is transferred to the final `CrawlResult`.
- Changes:
  - In `arun`, when processing a non-cached result (inside the `if not cached_result or not html:` block), after receiving `async_response` and calling `aprocess_html` to get `crawl_result`, copy the `network_requests` and `console_messages` from `async_response` to `crawl_result`.
```python
# ==== File: crawl4ai/async_webcrawler.py ====
# ... (imports) ...

class AsyncWebCrawler:
    # ... (existing methods) ...

    async def arun(
        self,
        url: str,
        config: CrawlerRunConfig = None,
        **kwargs,
    ) -> RunManyReturn:
        # ... (existing setup, cache check) ...

        async with self._lock or self.nullcontext():
            try:
                # ... (existing logging, cache context setup) ...

                if cached_result:
                    # ... (existing cache handling logic) ...
                    # Note: Captured network/console data is usually not useful from cache.
                    # Ensure these are None or empty if read from cache, unless stored explicitly.
                    cached_result.network_requests = cached_result.network_requests or None
                    cached_result.console_messages = cached_result.console_messages or None
                    # ... (rest of cache logic) ...

                # Fetch fresh content if needed
                if not cached_result or not html:
                    t1 = time.perf_counter()
                    # ... (existing user agent update, robots.txt check) ...

                    ##############################
                    # Call CrawlerStrategy.crawl #
                    ##############################
                    async_response = await self.crawler_strategy.crawl(
                        url,
                        config=config,
                    )

                    # ... (existing assignment of html, screenshot, pdf, js_result from async_response) ...

                    t2 = time.perf_counter()
                    # ... (existing logging) ...

                    ###############################################################
                    # Process the HTML content, Call CrawlerStrategy.process_html #
                    ###############################################################
                    crawl_result: CrawlResult = await self.aprocess_html(
                        # ... (existing args) ...
                    )

                    # --- Transfer data from AsyncCrawlResponse to CrawlResult ---
                    crawl_result.status_code = async_response.status_code
                    crawl_result.redirected_url = async_response.redirected_url or url
                    crawl_result.response_headers = async_response.response_headers
                    crawl_result.downloaded_files = async_response.downloaded_files
                    crawl_result.js_execution_result = js_execution_result
                    crawl_result.ssl_certificate = async_response.ssl_certificate
                    # NEW: Copy captured data
                    crawl_result.network_requests = async_response.network_requests
                    crawl_result.console_messages = async_response.console_messages
                    # ------------------------------------------------------------

                    crawl_result.success = bool(html)
                    crawl_result.session_id = getattr(config, "session_id", None)

                    # ... (existing logging) ...

                    # Update cache if appropriate
                    if cache_context.should_write() and not bool(cached_result):
                        # crawl_result now includes network/console data if captured
                        await async_db_manager.acache_url(crawl_result)

                    return CrawlResultContainer(crawl_result)

                else:  # Cached result was used
                    # ... (existing logging for cache hit) ...
                    cached_result.success = bool(html)
                    cached_result.session_id = getattr(config, "session_id", None)
                    cached_result.redirected_url = cached_result.redirected_url or url
                    return CrawlResultContainer(cached_result)

            except Exception as e:
                # ... (existing error handling) ...
                return CrawlResultContainer(
                    CrawlResult(
                        url=url, html="", success=False, error_message=error_message
                    )
                )

    # ... (aprocess_html remains unchanged regarding capture) ...
    # ... (arun_many remains unchanged regarding capture) ...
```
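Putting the four pieces together, end-to-end usage would look roughly like the sketch below. `AsyncWebCrawler`, `arun`, and `CacheMode` are existing crawl4ai APIs; the two capture flags and the `network_requests`/`console_messages` result fields exist only once this proposal is implemented.

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig


async def main() -> None:
    config = CrawlerRunConfig(
        capture_network_requests=True,   # proposed flag
        capture_console_messages=True,   # proposed flag
        cache_mode=CacheMode.BYPASS,     # force a fresh fetch so capturing actually runs
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        if result.success:
            print(f"network events:   {len(result.network_requests or [])}")
            print(f"console messages: {len(result.console_messages or [])}")


asyncio.run(main())
```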
**Summary of Changes:**

- Configuration: Added `capture_network_requests` and `capture_console_messages` flags to `CrawlerRunConfig`.
- Models: Added corresponding `network_requests` and `console_messages` fields (lists of dicts) to `AsyncCrawlResponse` and `CrawlResult`.
- Strategy: Implemented conditional event listeners in `AsyncPlaywrightCrawlerStrategy._crawl_web` to capture data into lists when the flags are true, and populated these fields in the returned `AsyncCrawlResponse`. Added basic error handling within the capture handlers, plus timestamps.
- Crawler: Modified `AsyncWebCrawler.arun` to copy the captured data from `AsyncCrawlResponse` into the final `CrawlResult` for non-cached fetches.
This approach keeps the capturing logic contained within the Playwright strategy, uses clear configuration flags, and integrates the results into the existing data flow. The data format (list of dictionaries) is flexible for storing varied information from requests/responses/console messages.
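For illustration, given the handlers sketched above, a single captured request event and console entry would look roughly like the following (all values are invented for the example):

```python
# Illustrative entry shapes only - the values below are made up.
example_request_event = {
    "event_type": "request",
    "url": "https://example.com/api/data",
    "method": "GET",
    "headers": {"user-agent": "..."},
    "post_data": None,
    "resource_type": "fetch",
    "is_navigation_request": False,
    "timestamp": 1713600000.123,
}

example_console_entry = {
    "type": "error",
    "text": "Uncaught TypeError: x is undefined",
    "args": [],
    "location": "https://example.com/app.js:10:5",
    "timestamp": 1713600001.456,
}
```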