I want to enhance the `AsyncPlaywrightCrawlerStrategy` to optionally capture network requests and console messages during a crawl, storing them in the final `CrawlResult`. Here's a breakdown of the proposed changes across the relevant files:

**1. Configuration (`crawl4ai/async_configs.py`)**

* **Goal:** Add flags to `CrawlerRunConfig` to enable/disable capturing.
* **Changes:**
    * Add two new boolean attributes to `CrawlerRunConfig`:
        * `capture_network_requests: bool = False`
        * `capture_console_messages: bool = False`
    * Update `__init__`, `from_kwargs`, and `to_dict` to include these new attributes; `clone`/`dump`/`load` should pick them up implicitly.

```python
# ==== File: crawl4ai/async_configs.py ====
# ... (imports) ...

class CrawlerRunConfig():
    # ... (existing attributes) ...

    # NEW: Network and Console Capturing Parameters
    capture_network_requests: bool = False
    capture_console_messages: bool = False

    # Experimental Parameters
    experimental: Dict[str, Any] = None

    def __init__(
        self,
        # ... (existing parameters) ...

        # NEW: Network and Console Capturing Parameters
        capture_network_requests: bool = False,
        capture_console_messages: bool = False,

        # Experimental Parameters
        experimental: Dict[str, Any] = None,
    ):
        # ... (existing assignments) ...

        # NEW: Assign new parameters
        self.capture_network_requests = capture_network_requests
        self.capture_console_messages = capture_console_messages

        # Experimental Parameters
        self.experimental = experimental or {}

        # ... (rest of __init__) ...

    @staticmethod
    def from_kwargs(kwargs: dict) -> "CrawlerRunConfig":
        return CrawlerRunConfig(
            # ... (existing kwargs gets) ...

            # NEW: Get new parameters
            capture_network_requests=kwargs.get("capture_network_requests", False),
            capture_console_messages=kwargs.get("capture_console_messages", False),

            # Experimental Parameters
            experimental=kwargs.get("experimental"),
        )

    def to_dict(self):
        return {
            # ... (existing dict entries) ...
            # NEW: Add new parameters to dict
            "capture_network_requests": self.capture_network_requests,
            "capture_console_messages": self.capture_console_messages,
            "experimental": self.experimental,
        }

    # clone(), dump(), load() should work automatically if they rely on to_dict() and from_kwargs()
    # or the serialization logic correctly handles all attributes.
```

**2. Data Models (`crawl4ai/models.py`)**

* **Goal:** Add fields to store the captured data in the response/result objects.
* **Changes:**
    * Add `network_requests: Optional[List[Dict[str, Any]]] = None` and `console_messages: Optional[List[Dict[str, Any]]] = None` to `AsyncCrawlResponse`.
    * Add the same fields to `CrawlResult`.

```python
# ==== File: crawl4ai/models.py ====
# ... (imports) ...

# ... (Existing dataclasses/models) ...

class AsyncCrawlResponse(BaseModel):
    html: str
    response_headers: Dict[str, str]
    js_execution_result: Optional[Dict[str, Any]] = None
    status_code: int
    screenshot: Optional[str] = None
    pdf_data: Optional[bytes] = None
    get_delayed_content: Optional[Callable[[Optional[float]], Awaitable[str]]] = None
    downloaded_files: Optional[List[str]] = None
    ssl_certificate: Optional[SSLCertificate] = None
    redirected_url: Optional[str] = None

    # NEW: Fields for captured data
    network_requests: Optional[List[Dict[str, Any]]] = None
    console_messages: Optional[List[Dict[str, Any]]] = None

    class Config:
        arbitrary_types_allowed = True

# ... (Existing models like MediaItem, Link, etc.) ...

class CrawlResult(BaseModel):
    url: str
    html: str
    success: bool
    cleaned_html: Optional[str] = None
    media: Dict[str, List[Dict]] = {}
    links: Dict[str, List[Dict]] = {}
    downloaded_files: Optional[List[str]] = None
    js_execution_result: Optional[Dict[str, Any]] = None
    screenshot: Optional[str] = None
    pdf: Optional[bytes] = None
    mhtml: Optional[str] = None  # Added mhtml based on the provided models.py
    _markdown: Optional[MarkdownGenerationResult] = PrivateAttr(default=None)
    extracted_content: Optional[str] = None
    metadata: Optional[dict] = None
    error_message: Optional[str] = None
    session_id: Optional[str] = None
    response_headers: Optional[dict] = None
    status_code: Optional[int] = None
    ssl_certificate: Optional[SSLCertificate] = None
    dispatch_result: Optional[DispatchResult] = None
    redirected_url: Optional[str] = None

    # NEW: Fields for captured data
    network_requests: Optional[List[Dict[str, Any]]] = None
    console_messages: Optional[List[Dict[str, Any]]] = None

    class Config:
        arbitrary_types_allowed = True

    # ... (Existing __init__, properties, model_dump for markdown compatibility) ...

# ... (Rest of the models) ...
```

**3. Crawler Strategy (`crawl4ai/async_crawler_strategy.py`)**

* **Goal:** Implement the actual capturing logic within `AsyncPlaywrightCrawlerStrategy._crawl_web`.
* **Changes:**
    * Inside `_crawl_web`, initialize empty lists `captured_requests = []` and `captured_console = []`.
    * Conditionally attach Playwright event listeners (`page.on(...)`) based on the `config.capture_network_requests` and `config.capture_console_messages` flags.
    * Define handler functions for these listeners that extract the relevant data and append it, with a timestamp, to the respective list.
    * Pass the captured lists to the `AsyncCrawlResponse` constructor at the end of the method.

```python
# ==== File: crawl4ai/async_crawler_strategy.py ====
# ... (imports) ...
import time  # Make sure time is imported

class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
    # ... (existing methods like __init__, start, close, etc.) ...

    async def _crawl_web(
        self, url: str, config: CrawlerRunConfig
    ) -> AsyncCrawlResponse:
        """
        Internal method to crawl web URLs with the specified configuration.
        Includes optional network and console capturing. # MODIFIED DOCSTRING
        """
        config.url = url
        response_headers = {}
        execution_result = None
        status_code = None
        redirected_url = url

        # Reset downloaded files list for new crawl
        self._downloaded_files = []

        # Initialize capture lists - IMPORTANT: Reset per crawl
        captured_requests: List[Dict[str, Any]] = []
        captured_console: List[Dict[str, Any]] = []

        # Handle user agent ... (existing code) ...

        # Get page for session
        page, context = await self.browser_manager.get_page(crawlerRunConfig=config)

        # ... (existing code for cookies, navigator overrides, hooks) ...

        # --- Setup Capturing Listeners ---
        # NOTE: These listeners are attached *before* page.goto()

        # Network Request Capturing
        if config.capture_network_requests:
            async def handle_request_capture(request):
                try:
                    post_data_str = None
                    try:
                        # Be cautious with large post data
                        post_data = request.post_data_buffer
                        if post_data:
                            # Attempt a strict UTF-8 decode; fall back to a size indication
                            try:
                                post_data_str = post_data.decode('utf-8')
                            except UnicodeDecodeError:
                                post_data_str = f"[Binary data: {len(post_data)} bytes]"
                    except Exception:
                        post_data_str = "[Error retrieving post data]"

                    captured_requests.append({
                        "event_type": "request",
                        "url": request.url,
                        "method": request.method,
                        "headers": dict(request.headers),  # Convert Header dict
                        "post_data": post_data_str,
                        "resource_type": request.resource_type,
                        "is_navigation_request": request.is_navigation_request(),
                        "timestamp": time.time()
                    })
                except Exception as e:
                    self.logger.warning(f"Error capturing request details for {request.url}: {e}", tag="CAPTURE")
                    captured_requests.append({"event_type": "request_capture_error", "url": request.url, "error": str(e), "timestamp": time.time()})

            async def handle_response_capture(response):
                try:
                    # Avoid capturing full response body by default due to size/security
                    # security_details = await response.security_details()  # Optional: More SSL info
                    captured_requests.append({
                        "event_type": "response",
                        "url": response.url,
                        "status": response.status,
                        "status_text": response.status_text,
                        "headers": dict(response.headers),  # Convert Header dict
                        "from_service_worker": response.from_service_worker,
                        # "security_details": security_details,  # Uncomment if needed
                        "request_timing": response.request.timing,  # Detailed timing info
                        "timestamp": time.time()
                    })
                except Exception as e:
                    self.logger.warning(f"Error capturing response details for {response.url}: {e}", tag="CAPTURE")
                    captured_requests.append({"event_type": "response_capture_error", "url": response.url, "error": str(e), "timestamp": time.time()})

            async def handle_request_failed_capture(request):
                try:
                    captured_requests.append({
                        "event_type": "request_failed",
                        "url": request.url,
                        "method": request.method,
                        "resource_type": request.resource_type,
                        # In Playwright's Python API, request.failure is the error text (str) or None
                        "failure_text": request.failure or "Unknown failure",
                        "timestamp": time.time()
                    })
                except Exception as e:
                    self.logger.warning(f"Error capturing request failed details for {request.url}: {e}", tag="CAPTURE")
                    captured_requests.append({"event_type": "request_failed_capture_error", "url": request.url, "error": str(e), "timestamp": time.time()})

            page.on("request", handle_request_capture)
            page.on("response", handle_response_capture)
            page.on("requestfailed", handle_request_failed_capture)

        # Console Message Capturing
        if config.capture_console_messages:
            # Async handler so the JSHandle args can be awaited
            async def handle_console_capture(msg):
                try:
                    # In Playwright's Python API, type, text, location and args are properties
                    location = msg.location
                    # Attempt to resolve JSHandle args to primitive values
                    resolved_args = []
                    try:
                        for arg in msg.args:
                            resolved_args.append(await arg.json_value())  # May fail for complex objects
                    except Exception:
                        resolved_args.append("[Could not resolve JSHandle args]")

                    captured_console.append({
                        "type": msg.type,  # e.g., 'log', 'error', 'warning'
                        "text": msg.text,
                        "args": resolved_args,  # Captured arguments
                        "location": f"{location['url']}:{location['lineNumber']}:{location['columnNumber']}" if location else "N/A",
                        "timestamp": time.time()
                    })
                except Exception as e:
                    self.logger.warning(f"Error capturing console message: {e}", tag="CAPTURE")
                    captured_console.append({"type": "console_capture_error", "error": str(e), "timestamp": time.time()})

            def handle_pageerror_capture(err):
                try:
                    captured_console.append({
                        "type": "error",  # Consistent type for page errors
                        "text": err.message,
                        "stack": err.stack,
                        "timestamp": time.time()
                    })
                except Exception as e:
                    self.logger.warning(f"Error capturing page error: {e}", tag="CAPTURE")
                    captured_console.append({"type": "pageerror_capture_error", "error": str(e), "timestamp": time.time()})

            page.on("console", handle_console_capture)
            page.on("pageerror", handle_pageerror_capture)
        # --- End Setup Capturing Listeners ---

        # Set up console logging if requested (Keep original logging logic separate or merge carefully)
        if config.log_console:
            # ... (original log_console setup using page.on(...) remains here) ...
            # This allows logging to screen *and* capturing to the list if both flags are True
            def log_consol(msg, console_log_type="debug"):
                # ... existing implementation ...
                pass  # Placeholder for existing code

            page.on("console", lambda msg: log_consol(msg, "debug"))
            page.on("pageerror", lambda e: log_consol(e, "error"))

        try:
            # ... (existing code for SSL, downloads, goto, waits, JS execution, etc.) ...

            # Get final HTML content
            # ... (existing code for selector logic or page.content()) ...
            if config.css_selector:
                # ... existing selector logic ...
                html = f"