
Implement comprehensive network request and console message capturing functionality:

- Add `capture_network_requests` and `capture_console_messages` config parameters
- Add `network_requests` and `console_messages` fields to models
- Implement Playwright event listeners to capture requests, responses, and console output
- Create detailed documentation and examples
- Add comprehensive tests

This feature enables deep visibility into web page activity for debugging, security analysis, performance profiling, and API discovery in web applications.
I want to enhance the `AsyncPlaywrightCrawlerStrategy` to optionally capture network requests and console messages during a crawl, storing them in the final `CrawlResult`.

Here's a breakdown of the proposed changes across the relevant files:
**1. Configuration (`crawl4ai/async_configs.py`)**

- Goal: Add flags to `CrawlerRunConfig` to enable/disable capturing.
- Changes:
  - Add two new boolean attributes to `CrawlerRunConfig`:
    - `capture_network_requests: bool = False`
    - `capture_console_messages: bool = False`
  - Update `__init__`, `from_kwargs`, `to_dict`, and implicitly `clone`/`dump`/`load` to include these new attributes.
```python
# ==== File: crawl4ai/async_configs.py ====
# ... (imports) ...

class CrawlerRunConfig():
    # ... (existing attributes) ...

    # NEW: Network and Console Capturing Parameters
    capture_network_requests: bool = False
    capture_console_messages: bool = False

    # Experimental Parameters
    experimental: Dict[str, Any] = None

    def __init__(
        self,
        # ... (existing parameters) ...
        # NEW: Network and Console Capturing Parameters
        capture_network_requests: bool = False,
        capture_console_messages: bool = False,
        # Experimental Parameters
        experimental: Dict[str, Any] = None,
    ):
        # ... (existing assignments) ...

        # NEW: Assign new parameters
        self.capture_network_requests = capture_network_requests
        self.capture_console_messages = capture_console_messages

        # Experimental Parameters
        self.experimental = experimental or {}

        # ... (rest of __init__) ...

    @staticmethod
    def from_kwargs(kwargs: dict) -> "CrawlerRunConfig":
        return CrawlerRunConfig(
            # ... (existing kwargs gets) ...
            # NEW: Get new parameters
            capture_network_requests=kwargs.get("capture_network_requests", False),
            capture_console_messages=kwargs.get("capture_console_messages", False),
            # Experimental Parameters
            experimental=kwargs.get("experimental"),
        )

    def to_dict(self):
        return {
            # ... (existing dict entries) ...
            # NEW: Add new parameters to dict
            "capture_network_requests": self.capture_network_requests,
            "capture_console_messages": self.capture_console_messages,
            "experimental": self.experimental,
        }

    # clone(), dump(), load() should work automatically if they rely on to_dict() and from_kwargs()
    # or the serialization logic correctly handles all attributes.
```
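To sanity-check the configuration plumbing, here is a minimal usage sketch. It assumes this change has been applied; `CrawlerRunConfig`, `from_kwargs`, and `to_dict` are existing crawl4ai APIs, while the two capture flags are the new parameters proposed above.

```python
# Minimal sketch: construct a config with the proposed capture flags and
# round-trip it through to_dict()/from_kwargs() to confirm serialization support.
from crawl4ai.async_configs import CrawlerRunConfig

config = CrawlerRunConfig(
    capture_network_requests=True,    # proposed flag
    capture_console_messages=True,    # proposed flag
)

restored = CrawlerRunConfig.from_kwargs(config.to_dict())
assert restored.capture_network_requests is True
assert restored.capture_console_messages is True
```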
**2. Data Models (`crawl4ai/models.py`)**

- Goal: Add fields to store the captured data in the response/result objects.
- Changes:
  - Add `network_requests: Optional[List[Dict[str, Any]]] = None` and `console_messages: Optional[List[Dict[str, Any]]] = None` to `AsyncCrawlResponse`.
  - Add the same fields to `CrawlResult`.
```python
# ==== File: crawl4ai/models.py ====
# ... (imports) ...

# ... (Existing dataclasses/models) ...

class AsyncCrawlResponse(BaseModel):
    html: str
    response_headers: Dict[str, str]
    js_execution_result: Optional[Dict[str, Any]] = None
    status_code: int
    screenshot: Optional[str] = None
    pdf_data: Optional[bytes] = None
    get_delayed_content: Optional[Callable[[Optional[float]], Awaitable[str]]] = None
    downloaded_files: Optional[List[str]] = None
    ssl_certificate: Optional[SSLCertificate] = None
    redirected_url: Optional[str] = None
    # NEW: Fields for captured data
    network_requests: Optional[List[Dict[str, Any]]] = None
    console_messages: Optional[List[Dict[str, Any]]] = None

    class Config:
        arbitrary_types_allowed = True

# ... (Existing models like MediaItem, Link, etc.) ...

class CrawlResult(BaseModel):
    url: str
    html: str
    success: bool
    cleaned_html: Optional[str] = None
    media: Dict[str, List[Dict]] = {}
    links: Dict[str, List[Dict]] = {}
    downloaded_files: Optional[List[str]] = None
    js_execution_result: Optional[Dict[str, Any]] = None
    screenshot: Optional[str] = None
    pdf: Optional[bytes] = None
    mhtml: Optional[str] = None  # Added mhtml based on the provided models.py
    _markdown: Optional[MarkdownGenerationResult] = PrivateAttr(default=None)
    extracted_content: Optional[str] = None
    metadata: Optional[dict] = None
    error_message: Optional[str] = None
    session_id: Optional[str] = None
    response_headers: Optional[dict] = None
    status_code: Optional[int] = None
    ssl_certificate: Optional[SSLCertificate] = None
    dispatch_result: Optional[DispatchResult] = None
    redirected_url: Optional[str] = None
    # NEW: Fields for captured data
    network_requests: Optional[List[Dict[str, Any]]] = None
    console_messages: Optional[List[Dict[str, Any]]] = None

    class Config:
        arbitrary_types_allowed = True

    # ... (Existing __init__, properties, model_dump for markdown compatibility) ...

# ... (Rest of the models) ...
```
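Because the new fields are plain lists of dictionaries, downstream code can analyze them without any crawl4ai-specific types. The hypothetical helper below is a sketch that assumes the entry keys proposed in the strategy section (`event_type`, `resource_type`, `type`, `text`); it works on the raw lists whether they come from `CrawlResult` or `AsyncCrawlResponse`.

```python
from collections import Counter
from typing import Any, Dict, List, Optional


def summarize_capture(
    network_requests: Optional[List[Dict[str, Any]]],
    console_messages: Optional[List[Dict[str, Any]]],
) -> Dict[str, Any]:
    """Summarize captured data; both fields may be None when capturing is disabled."""
    net = network_requests or []
    con = console_messages or []
    return {
        "network_events_by_type": dict(Counter(e.get("event_type", "unknown") for e in net)),
        "requests_by_resource": dict(
            Counter(e.get("resource_type", "unknown") for e in net if e.get("event_type") == "request")
        ),
        "console_by_type": dict(Counter(m.get("type", "unknown") for m in con)),
        "console_errors": [m.get("text") for m in con if m.get("type") == "error"],
    }
```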
**3. Crawler Strategy (`crawl4ai/async_crawler_strategy.py`)**

- Goal: Implement the actual capturing logic within `AsyncPlaywrightCrawlerStrategy._crawl_web`.
- Changes:
  - Inside `_crawl_web`, initialize empty lists `captured_requests = []` and `captured_console = []`.
  - Conditionally attach Playwright event listeners (`page.on(...)`) based on the `config.capture_network_requests` and `config.capture_console_messages` flags.
  - Define handler functions for these listeners that extract the relevant data and append it to the respective lists, including timestamps.
  - Pass the captured lists to the `AsyncCrawlResponse` constructor at the end of the method.
```python
# ==== File: crawl4ai/async_crawler_strategy.py ====
# ... (imports) ...
import time  # Make sure time is imported


class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
    # ... (existing methods like __init__, start, close, etc.) ...

    async def _crawl_web(
        self, url: str, config: CrawlerRunConfig
    ) -> AsyncCrawlResponse:
        """
        Internal method to crawl web URLs with the specified configuration.
        Includes optional network and console capturing.  # MODIFIED DOCSTRING
        """
        config.url = url
        response_headers = {}
        execution_result = None
        status_code = None
        redirected_url = url

        # Reset downloaded files list for new crawl
        self._downloaded_files = []

        # Initialize capture lists - IMPORTANT: Reset per crawl
        captured_requests: List[Dict[str, Any]] = []
        captured_console: List[Dict[str, Any]] = []

        # Handle user agent ... (existing code) ...

        # Get page for session
        page, context = await self.browser_manager.get_page(crawlerRunConfig=config)

        # ... (existing code for cookies, navigator overrides, hooks) ...

        # --- Setup Capturing Listeners ---
        # NOTE: These listeners are attached *before* page.goto()

        # Network Request Capturing
        if config.capture_network_requests:
            async def handle_request_capture(request):
                try:
                    post_data_str = None
                    try:
                        # Be cautious with large post data
                        post_data = request.post_data_buffer
                        if post_data:
                            # Attempt to decode, fall back to a size indication for binary data
                            try:
                                post_data_str = post_data.decode('utf-8')
                            except UnicodeDecodeError:
                                post_data_str = f"[Binary data: {len(post_data)} bytes]"
                    except Exception:
                        post_data_str = "[Error retrieving post data]"

                    captured_requests.append({
                        "event_type": "request",
                        "url": request.url,
                        "method": request.method,
                        "headers": dict(request.headers),  # Convert Header dict
                        "post_data": post_data_str,
                        "resource_type": request.resource_type,
                        "is_navigation_request": request.is_navigation_request(),
                        "timestamp": time.time()
                    })
                except Exception as e:
                    self.logger.warning(f"Error capturing request details for {request.url}: {e}", tag="CAPTURE")
                    captured_requests.append({"event_type": "request_capture_error", "url": request.url, "error": str(e), "timestamp": time.time()})

            async def handle_response_capture(response):
                try:
                    # Avoid capturing full response body by default due to size/security
                    # security_details = await response.security_details()  # Optional: More SSL info
                    captured_requests.append({
                        "event_type": "response",
                        "url": response.url,
                        "status": response.status,
                        "status_text": response.status_text,
                        "headers": dict(response.headers),  # Convert Header dict
                        "from_service_worker": response.from_service_worker,
                        # "security_details": security_details,  # Uncomment if needed
                        "request_timing": response.request.timing,  # Detailed timing info
                        "timestamp": time.time()
                    })
                except Exception as e:
                    self.logger.warning(f"Error capturing response details for {response.url}: {e}", tag="CAPTURE")
                    captured_requests.append({"event_type": "response_capture_error", "url": response.url, "error": str(e), "timestamp": time.time()})

            async def handle_request_failed_capture(request):
                try:
                    captured_requests.append({
                        "event_type": "request_failed",
                        "url": request.url,
                        "method": request.method,
                        "resource_type": request.resource_type,
                        # In the Python API, request.failure is an Optional[str] error text
                        "failure_text": request.failure or "Unknown failure",
                        "timestamp": time.time()
                    })
                except Exception as e:
                    self.logger.warning(f"Error capturing request failed details for {request.url}: {e}", tag="CAPTURE")
                    captured_requests.append({"event_type": "request_failed_capture_error", "url": request.url, "error": str(e), "timestamp": time.time()})

            page.on("request", handle_request_capture)
            page.on("response", handle_response_capture)
            page.on("requestfailed", handle_request_failed_capture)

        # Console Message Capturing
        if config.capture_console_messages:
            async def handle_console_capture(msg):
                try:
                    location = msg.location  # Property (dict) in the Python API
                    # Attempt to resolve JSHandle args to primitive values
                    resolved_args = []
                    try:
                        for arg in msg.args:
                            resolved_args.append(await arg.json_value())  # May fail for complex objects
                    except Exception:
                        resolved_args.append("[Could not resolve JSHandle args]")

                    captured_console.append({
                        "type": msg.type,  # e.g., 'log', 'error', 'warning'
                        "text": msg.text,
                        "args": resolved_args,  # Captured arguments
                        "location": f"{location['url']}:{location['lineNumber']}:{location['columnNumber']}" if location else "N/A",
                        "timestamp": time.time()
                    })
                except Exception as e:
                    self.logger.warning(f"Error capturing console message: {e}", tag="CAPTURE")
                    captured_console.append({"type": "console_capture_error", "error": str(e), "timestamp": time.time()})

            def handle_pageerror_capture(err):
                try:
                    captured_console.append({
                        "type": "error",  # Consistent type for page errors
                        "text": err.message,
                        "stack": err.stack,
                        "timestamp": time.time()
                    })
                except Exception as e:
                    self.logger.warning(f"Error capturing page error: {e}", tag="CAPTURE")
                    captured_console.append({"type": "pageerror_capture_error", "error": str(e), "timestamp": time.time()})

            page.on("console", handle_console_capture)
            page.on("pageerror", handle_pageerror_capture)
        # --- End Setup Capturing Listeners ---

        # Set up console logging if requested (keep original logging logic separate or merge carefully)
        if config.log_console:
            # ... (original log_console setup using page.on(...) remains here) ...
            # This allows logging to screen *and* capturing to the list if both flags are True
            def log_consol(msg, console_log_type="debug"):
                # ... existing implementation ...
                pass  # Placeholder for existing code

            page.on("console", lambda msg: log_consol(msg, "debug"))
            page.on("pageerror", lambda e: log_consol(e, "error"))

        try:
            # ... (existing code for SSL, downloads, goto, waits, JS execution, etc.) ...

            # Get final HTML content
            # ... (existing code for selector logic or page.content()) ...
            if config.css_selector:
                # ... existing selector logic ...
                html = f"<div class='crawl4ai-result'>\n" + "\n".join(html_parts) + "\n</div>"
            else:
                html = await page.content()

            await self.execute_hook(
                "before_return_html", page=page, html=html, context=context, config=config
            )

            # Handle PDF and screenshot generation
            # ... (existing code) ...

            # Define delayed content getter
            # ... (existing code) ...

            # Return complete response - ADD CAPTURED DATA HERE
            return AsyncCrawlResponse(
                html=html,
                response_headers=response_headers,
                js_execution_result=execution_result,
                status_code=status_code,
                screenshot=screenshot_data,
                pdf_data=pdf_data,
                get_delayed_content=get_delayed_content,
                ssl_certificate=ssl_cert,
                downloaded_files=(
                    self._downloaded_files if self._downloaded_files else None
                ),
                redirected_url=redirected_url,
                # NEW: Pass captured data conditionally
                network_requests=captured_requests if config.capture_network_requests else None,
                console_messages=captured_console if config.capture_console_messages else None,
            )
        except Exception as e:
            raise e  # Re-raise the original exception
        finally:
            # If no session_id is given we should close the page
            if not config.session_id:
                # Detach listeners before closing to prevent potential errors during close
                if config.capture_network_requests:
                    page.remove_listener("request", handle_request_capture)
                    page.remove_listener("response", handle_response_capture)
                    page.remove_listener("requestfailed", handle_request_failed_capture)
                if config.capture_console_messages:
                    page.remove_listener("console", handle_console_capture)
                    page.remove_listener("pageerror", handle_pageerror_capture)
                # Also remove logging listeners if they were attached
                if config.log_console:
                    # Need to figure out how to remove the lambdas if necessary,
                    # or ensure they don't cause issues on close. Often, it's fine.
                    pass
                await page.close()

    # ... (rest of AsyncPlaywrightCrawlerStrategy methods) ...
```
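The same `page.on(...)` wiring can be exercised outside crawl4ai with a bare Playwright script, which is handy for checking event payloads before integrating them into `_crawl_web`. This sketch uses only documented Playwright Python APIs; the URL and the subset of fields captured here are illustrative.

```python
import asyncio
import time

from playwright.async_api import async_playwright


async def main(url: str = "https://example.com") -> None:
    requests_log, console_log = [], []

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()

        # Same events the proposed _crawl_web changes listen to
        page.on("request", lambda r: requests_log.append(
            {"event_type": "request", "url": r.url, "method": r.method,
             "resource_type": r.resource_type, "timestamp": time.time()}))
        page.on("response", lambda r: requests_log.append(
            {"event_type": "response", "url": r.url, "status": r.status,
             "timestamp": time.time()}))
        page.on("console", lambda m: console_log.append(
            {"type": m.type, "text": m.text, "timestamp": time.time()}))

        await page.goto(url)
        await browser.close()

    print(f"captured {len(requests_log)} network events, {len(console_log)} console messages")


if __name__ == "__main__":
    asyncio.run(main())
```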
**4. Core Crawler (`crawl4ai/async_webcrawler.py`)**

- Goal: Ensure the captured data from `AsyncCrawlResponse` is transferred to the final `CrawlResult`.
- Changes:
  - In `arun`, when processing a non-cached result (inside the `if not cached_result or not html:` block), after receiving `async_response` and calling `aprocess_html` to get `crawl_result`, copy the `network_requests` and `console_messages` from `async_response` to `crawl_result`.
```python
# ==== File: crawl4ai/async_webcrawler.py ====
# ... (imports) ...

class AsyncWebCrawler:
    # ... (existing methods) ...

    async def arun(
        self,
        url: str,
        config: CrawlerRunConfig = None,
        **kwargs,
    ) -> RunManyReturn:
        # ... (existing setup, cache check) ...

        async with self._lock or self.nullcontext():
            try:
                # ... (existing logging, cache context setup) ...

                if cached_result:
                    # ... (existing cache handling logic) ...
                    # Note: Captured network/console data is usually not useful from cache.
                    # Ensure these are None or empty if read from cache, unless stored explicitly.
                    cached_result.network_requests = cached_result.network_requests or None
                    cached_result.console_messages = cached_result.console_messages or None
                    # ... (rest of cache logic) ...

                # Fetch fresh content if needed
                if not cached_result or not html:
                    t1 = time.perf_counter()
                    # ... (existing user agent update, robots.txt check) ...

                    ##############################
                    # Call CrawlerStrategy.crawl #
                    ##############################
                    async_response = await self.crawler_strategy.crawl(
                        url,
                        config=config,
                    )

                    # ... (existing assignment of html, screenshot, pdf, js_result from async_response) ...

                    t2 = time.perf_counter()
                    # ... (existing logging) ...

                    ###############################################################
                    # Process the HTML content, Call CrawlerStrategy.process_html #
                    ###############################################################
                    crawl_result: CrawlResult = await self.aprocess_html(
                        # ... (existing args) ...
                    )

                    # --- Transfer data from AsyncCrawlResponse to CrawlResult ---
                    crawl_result.status_code = async_response.status_code
                    crawl_result.redirected_url = async_response.redirected_url or url
                    crawl_result.response_headers = async_response.response_headers
                    crawl_result.downloaded_files = async_response.downloaded_files
                    crawl_result.js_execution_result = js_execution_result
                    crawl_result.ssl_certificate = async_response.ssl_certificate
                    # NEW: Copy captured data
                    crawl_result.network_requests = async_response.network_requests
                    crawl_result.console_messages = async_response.console_messages
                    # ------------------------------------------------------------

                    crawl_result.success = bool(html)
                    crawl_result.session_id = getattr(config, "session_id", None)

                    # ... (existing logging) ...

                    # Update cache if appropriate
                    if cache_context.should_write() and not bool(cached_result):
                        # crawl_result now includes network/console data if captured
                        await async_db_manager.acache_url(crawl_result)

                    return CrawlResultContainer(crawl_result)

                else:  # Cached result was used
                    # ... (existing logging for cache hit) ...
                    cached_result.success = bool(html)
                    cached_result.session_id = getattr(config, "session_id", None)
                    cached_result.redirected_url = cached_result.redirected_url or url
                    return CrawlResultContainer(cached_result)

            except Exception as e:
                # ... (existing error handling) ...
                return CrawlResultContainer(
                    CrawlResult(
                        url=url, html="", success=False, error_message=error_message
                    )
                )

    # ... (aprocess_html remains unchanged regarding capture) ...
    # ... (arun_many remains unchanged regarding capture) ...
```
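Putting the four pieces together, end-to-end usage would look roughly like the sketch below. `AsyncWebCrawler`, `arun`, and `CacheMode` are existing crawl4ai APIs; the two capture flags and the `network_requests`/`console_messages` result fields exist only once this proposal is implemented.

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig


async def main() -> None:
    config = CrawlerRunConfig(
        capture_network_requests=True,   # proposed flag
        capture_console_messages=True,   # proposed flag
        cache_mode=CacheMode.BYPASS,     # force a fresh fetch so capturing actually runs
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        if result.success:
            print(f"network events:   {len(result.network_requests or [])}")
            print(f"console messages: {len(result.console_messages or [])}")


asyncio.run(main())
```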
**Summary of Changes:**

- Configuration: Added `capture_network_requests` and `capture_console_messages` flags to `CrawlerRunConfig`.
- Models: Added corresponding `network_requests` and `console_messages` fields (lists of dicts) to `AsyncCrawlResponse` and `CrawlResult`.
- Strategy: Implemented conditional event listeners in `AsyncPlaywrightCrawlerStrategy._crawl_web` to capture data into lists when the flags are true, and populated these fields in the returned `AsyncCrawlResponse`. Added basic error handling within the capture handlers, plus timestamps.
- Crawler: Modified `AsyncWebCrawler.arun` to copy the captured data from `AsyncCrawlResponse` into the final `CrawlResult` for non-cached fetches.
This approach keeps the capturing logic contained within the Playwright strategy, uses clear configuration flags, and integrates the results into the existing data flow. The data format (list of dictionaries) is flexible for storing varied information from requests/responses/console messages.
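For illustration, given the handlers sketched above, a single captured request event and console entry would look roughly like the following (all values are invented for the example):

```python
# Illustrative entry shapes only - the values below are made up.
example_request_event = {
    "event_type": "request",
    "url": "https://example.com/api/data",
    "method": "GET",
    "headers": {"user-agent": "..."},
    "post_data": None,
    "resource_type": "fetch",
    "is_navigation_request": False,
    "timestamp": 1713600000.123,
}

example_console_entry = {
    "type": "error",
    "text": "Uncaught TypeError: x is undefined",
    "args": [],
    "location": "https://example.com/app.js:10:5",
    "timestamp": 1713600001.456,
}
```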