I want to enhance the `AsyncPlaywrightCrawlerStrategy` to optionally capture network requests and console messages during a crawl, storing them in the final `CrawlResult`. Here's a breakdown of the proposed changes across the relevant files:

**1. Configuration (`crawl4ai/async_configs.py`)**

* **Goal:** Add flags to `CrawlerRunConfig` to enable/disable capturing.
* **Changes:**
    * Add two new boolean attributes to `CrawlerRunConfig`:
        * `capture_network_requests: bool = False`
        * `capture_console_messages: bool = False`
    * Update `__init__`, `from_kwargs`, `to_dict`, and implicitly `clone`/`dump`/`load` to include these new attributes.

```python
# ==== File: crawl4ai/async_configs.py ====
# ... (imports) ...

class CrawlerRunConfig():
    # ... (existing attributes) ...

    # NEW: Network and Console Capturing Parameters
    capture_network_requests: bool = False
    capture_console_messages: bool = False

    # Experimental Parameters
    experimental: Dict[str, Any] = None

    def __init__(
        self,
        # ... (existing parameters) ...

        # NEW: Network and Console Capturing Parameters
        capture_network_requests: bool = False,
        capture_console_messages: bool = False,

        # Experimental Parameters
        experimental: Dict[str, Any] = None,
    ):
        # ... (existing assignments) ...

        # NEW: Assign new parameters
        self.capture_network_requests = capture_network_requests
        self.capture_console_messages = capture_console_messages

        # Experimental Parameters
        self.experimental = experimental or {}

        # ... (rest of __init__) ...

    @staticmethod
    def from_kwargs(kwargs: dict) -> "CrawlerRunConfig":
        return CrawlerRunConfig(
            # ... (existing kwargs gets) ...

            # NEW: Get new parameters
            capture_network_requests=kwargs.get("capture_network_requests", False),
            capture_console_messages=kwargs.get("capture_console_messages", False),

            # Experimental Parameters
            experimental=kwargs.get("experimental"),
        )

    def to_dict(self):
        return {
            # ... (existing dict entries) ...

            # NEW: Add new parameters to dict
            "capture_network_requests": self.capture_network_requests,
            "capture_console_messages": self.capture_console_messages,
            "experimental": self.experimental,
        }

    # clone(), dump(), load() should work automatically if they rely on to_dict()
    # and from_kwargs(), or if the serialization logic correctly handles all attributes.
```

**2. Data Models (`crawl4ai/models.py`)**

* **Goal:** Add fields to store the captured data in the response/result objects.
* **Changes:**
    * Add `network_requests: Optional[List[Dict[str, Any]]] = None` and `console_messages: Optional[List[Dict[str, Any]]] = None` to `AsyncCrawlResponse`.
    * Add the same fields to `CrawlResult`.

```python
# ==== File: crawl4ai/models.py ====
# ... (imports) ...

# ... (Existing dataclasses/models) ...

class AsyncCrawlResponse(BaseModel):
    html: str
    response_headers: Dict[str, str]
    js_execution_result: Optional[Dict[str, Any]] = None
    status_code: int
    screenshot: Optional[str] = None
    pdf_data: Optional[bytes] = None
    get_delayed_content: Optional[Callable[[Optional[float]], Awaitable[str]]] = None
    downloaded_files: Optional[List[str]] = None
    ssl_certificate: Optional[SSLCertificate] = None
    redirected_url: Optional[str] = None
    # NEW: Fields for captured data
    network_requests: Optional[List[Dict[str, Any]]] = None
    console_messages: Optional[List[Dict[str, Any]]] = None

    class Config:
        arbitrary_types_allowed = True

# ... (Existing models like MediaItem, Link, etc.) ...
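# For reference, an illustrative (not enforced) shape of the entries the new
# fields will hold, derived from the capture handlers sketched in section 3 below:
#
#   network_requests entry:
#     {"event_type": "request" | "response" | "request_failed",
#      "url": "...", "method": "GET", "headers": {...},
#      "resource_type": "document", "timestamp": 1712345678.9, ...}
#
#   console_messages entry:
#     {"type": "log" | "error" | ..., "text": "...",
#      "location": "https://example.com/app.js:10:5", "timestamp": 1712345678.9, ...}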
class CrawlResult(BaseModel):
    url: str
    html: str
    success: bool
    cleaned_html: Optional[str] = None
    media: Dict[str, List[Dict]] = {}
    links: Dict[str, List[Dict]] = {}
    downloaded_files: Optional[List[str]] = None
    js_execution_result: Optional[Dict[str, Any]] = None
    screenshot: Optional[str] = None
    pdf: Optional[bytes] = None
    mhtml: Optional[str] = None  # Added mhtml based on the provided models.py
    _markdown: Optional[MarkdownGenerationResult] = PrivateAttr(default=None)
    extracted_content: Optional[str] = None
    metadata: Optional[dict] = None
    error_message: Optional[str] = None
    session_id: Optional[str] = None
    response_headers: Optional[dict] = None
    status_code: Optional[int] = None
    ssl_certificate: Optional[SSLCertificate] = None
    dispatch_result: Optional[DispatchResult] = None
    redirected_url: Optional[str] = None
    # NEW: Fields for captured data
    network_requests: Optional[List[Dict[str, Any]]] = None
    console_messages: Optional[List[Dict[str, Any]]] = None

    class Config:
        arbitrary_types_allowed = True

    # ... (Existing __init__, properties, model_dump for markdown compatibility) ...

# ... (Rest of the models) ...
```

**3. Crawler Strategy (`crawl4ai/async_crawler_strategy.py`)**

* **Goal:** Implement the actual capturing logic within `AsyncPlaywrightCrawlerStrategy._crawl_web`.
* **Changes:**
    * Inside `_crawl_web`, initialize empty lists `captured_requests = []` and `captured_console = []`.
    * Conditionally attach Playwright event listeners (`page.on(...)`) based on the `config.capture_network_requests` and `config.capture_console_messages` flags.
    * Define handler functions for these listeners to extract relevant data and append it to the respective lists. Include timestamps.
    * Pass the captured lists to the `AsyncCrawlResponse` constructor at the end of the method.

```python
# ==== File: crawl4ai/async_crawler_strategy.py ====
# ... (imports) ...
import time  # Make sure time is imported

class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
    # ... (existing methods like __init__, start, close, etc.) ...

    async def _crawl_web(
        self, url: str, config: CrawlerRunConfig
    ) -> AsyncCrawlResponse:
        """
        Internal method to crawl web URLs with the specified configuration.
        Includes optional network and console capturing.  # MODIFIED DOCSTRING
        """
        config.url = url
        response_headers = {}
        execution_result = None
        status_code = None
        redirected_url = url

        # Reset downloaded files list for new crawl
        self._downloaded_files = []

        # Initialize capture lists - IMPORTANT: Reset per crawl
        captured_requests: List[Dict[str, Any]] = []
        captured_console: List[Dict[str, Any]] = []

        # Handle user agent
        # ... (existing code) ...

        # Get page for session
        page, context = await self.browser_manager.get_page(crawlerRunConfig=config)

        # ... (existing code for cookies, navigator overrides, hooks) ...
        # --- Setup Capturing Listeners ---
        # NOTE: These listeners are attached *before* page.goto()

        # Network Request Capturing
        if config.capture_network_requests:

            async def handle_request_capture(request):
                try:
                    post_data_str = None
                    try:
                        # Be cautious with large post data
                        post_data = request.post_data_buffer
                        if post_data:
                            # Attempt to decode, fall back to a size indication
                            try:
                                post_data_str = post_data.decode('utf-8', errors='replace')
                            except UnicodeDecodeError:
                                post_data_str = f"[Binary data: {len(post_data)} bytes]"
                    except Exception:
                        post_data_str = "[Error retrieving post data]"

                    captured_requests.append({
                        "event_type": "request",
                        "url": request.url,
                        "method": request.method,
                        "headers": dict(request.headers),  # Convert header dict
                        "post_data": post_data_str,
                        "resource_type": request.resource_type,
                        "is_navigation_request": request.is_navigation_request(),
                        "timestamp": time.time()
                    })
                except Exception as e:
                    self.logger.warning(f"Error capturing request details for {request.url}: {e}", tag="CAPTURE")
                    captured_requests.append({"event_type": "request_capture_error", "url": request.url, "error": str(e), "timestamp": time.time()})

            async def handle_response_capture(response):
                try:
                    # Avoid capturing full response bodies by default due to size/security
                    # security_details = await response.security_details()  # Optional: more SSL info
                    captured_requests.append({
                        "event_type": "response",
                        "url": response.url,
                        "status": response.status,
                        "status_text": response.status_text,
                        "headers": dict(response.headers),  # Convert header dict
                        "from_service_worker": response.from_service_worker,
                        # "security_details": security_details,  # Uncomment if needed
                        "request_timing": response.request.timing,  # Detailed timing info
                        "timestamp": time.time()
                    })
                except Exception as e:
                    self.logger.warning(f"Error capturing response details for {response.url}: {e}", tag="CAPTURE")
                    captured_requests.append({"event_type": "response_capture_error", "url": response.url, "error": str(e), "timestamp": time.time()})

            async def handle_request_failed_capture(request):
                try:
                    captured_requests.append({
                        "event_type": "request_failed",
                        "url": request.url,
                        "method": request.method,
                        "resource_type": request.resource_type,
                        # request.failure is an Optional[str] in the Python API
                        "failure_text": request.failure or "Unknown failure",
                        "timestamp": time.time()
                    })
                except Exception as e:
                    self.logger.warning(f"Error capturing request failed details for {request.url}: {e}", tag="CAPTURE")
                    captured_requests.append({"event_type": "request_failed_capture_error", "url": request.url, "error": str(e), "timestamp": time.time()})

            page.on("request", handle_request_capture)
            page.on("response", handle_response_capture)
            page.on("requestfailed", handle_request_failed_capture)

        # Console Message Capturing
        if config.capture_console_messages:

            async def handle_console_capture(msg):
                try:
                    location = msg.location  # property in the Python API
                    # Attempt to resolve JSHandle args to primitive values
                    resolved_args = []
                    try:
                        for arg in msg.args:
                            resolved_args.append(await arg.json_value())  # May fail for complex objects
                    except Exception:
                        resolved_args.append("[Could not resolve JSHandle args]")

                    captured_console.append({
                        "type": msg.type,  # e.g., 'log', 'error', 'warning'
                        "text": msg.text,
                        "args": resolved_args,  # Captured arguments
                        "location": f"{location['url']}:{location['lineNumber']}:{location['columnNumber']}" if location else "N/A",
                        "timestamp": time.time()
                    })
                except Exception as e:
                    self.logger.warning(f"Error capturing console message: {e}", tag="CAPTURE")
                    captured_console.append({"type": "console_capture_error", "error": str(e), "timestamp": time.time()})
"timestamp": time.time()}) def handle_pageerror_capture(err): try: captured_console.append({ "type": "error", # Consistent type for page errors "text": err.message, "stack": err.stack, "timestamp": time.time() }) except Exception as e: self.logger.warning(f"Error capturing page error: {e}", tag="CAPTURE") captured_console.append({"type": "pageerror_capture_error", "error": str(e), "timestamp": time.time()}) page.on("console", handle_console_capture) page.on("pageerror", handle_pageerror_capture) # --- End Setup Capturing Listeners --- # Set up console logging if requested (Keep original logging logic separate or merge carefully) if config.log_console: # ... (original log_console setup using page.on(...) remains here) ... # This allows logging to screen *and* capturing to the list if both flags are True def log_consol(msg, console_log_type="debug"): # ... existing implementation ... pass # Placeholder for existing code page.on("console", lambda msg: log_consol(msg, "debug")) page.on("pageerror", lambda e: log_consol(e, "error")) try: # ... (existing code for SSL, downloads, goto, waits, JS execution, etc.) ... # Get final HTML content # ... (existing code for selector logic or page.content()) ... if config.css_selector: # ... existing selector logic ... html = f"
\n" + "\n".join(html_parts) + "\n
" else: html = await page.content() await self.execute_hook( "before_return_html", page=page, html=html, context=context, config=config ) # Handle PDF and screenshot generation # ... (existing code) ... # Define delayed content getter # ... (existing code) ... # Return complete response - ADD CAPTURED DATA HERE return AsyncCrawlResponse( html=html, response_headers=response_headers, js_execution_result=execution_result, status_code=status_code, screenshot=screenshot_data, pdf_data=pdf_data, get_delayed_content=get_delayed_content, ssl_certificate=ssl_cert, downloaded_files=( self._downloaded_files if self._downloaded_files else None ), redirected_url=redirected_url, # NEW: Pass captured data conditionally network_requests=captured_requests if config.capture_network_requests else None, console_messages=captured_console if config.capture_console_messages else None, ) except Exception as e: raise e # Re-raise the original exception finally: # If no session_id is given we should close the page if not config.session_id: # Detach listeners before closing to prevent potential errors during close if config.capture_network_requests: page.remove_listener("request", handle_request_capture) page.remove_listener("response", handle_response_capture) page.remove_listener("requestfailed", handle_request_failed_capture) if config.capture_console_messages: page.remove_listener("console", handle_console_capture) page.remove_listener("pageerror", handle_pageerror_capture) # Also remove logging listeners if they were attached if config.log_console: # Need to figure out how to remove the lambdas if necessary, # or ensure they don't cause issues on close. Often, it's fine. pass await page.close() # ... (rest of AsyncPlaywrightCrawlerStrategy methods) ... ``` **4. Core Crawler (`crawl4ai/async_webcrawler.py`)** * **Goal:** Ensure the captured data from `AsyncCrawlResponse` is transferred to the final `CrawlResult`. * **Changes:** * In `arun`, when processing a non-cached result (inside the `if not cached_result or not html:` block), after receiving `async_response` and calling `aprocess_html` to get `crawl_result`, copy the `network_requests` and `console_messages` from `async_response` to `crawl_result`. ```python # ==== File: crawl4ai/async_webcrawler.py ==== # ... (imports) ... class AsyncWebCrawler: # ... (existing methods) ... async def arun( self, url: str, config: CrawlerRunConfig = None, **kwargs, ) -> RunManyReturn: # ... (existing setup, cache check) ... async with self._lock or self.nullcontext(): try: # ... (existing logging, cache context setup) ... if cached_result: # ... (existing cache handling logic) ... # Note: Captured network/console usually not useful from cache # Ensure they are None or empty if read from cache, unless stored explicitly cached_result.network_requests = cached_result.network_requests or None cached_result.console_messages = cached_result.console_messages or None # ... (rest of cache logic) ... # Fetch fresh content if needed if not cached_result or not html: t1 = time.perf_counter() # ... (existing user agent update, robots.txt check) ... ############################## # Call CrawlerStrategy.crawl # ############################## async_response = await self.crawler_strategy.crawl( url, config=config, ) # ... (existing assignment of html, screenshot, pdf, js_result from async_response) ... t2 = time.perf_counter() # ... (existing logging) ... 
                    ###############################################################
                    # Process the HTML content, Call CrawlerStrategy.process_html #
                    ###############################################################
                    crawl_result: CrawlResult = await self.aprocess_html(
                        # ... (existing args) ...
                    )

                    # --- Transfer data from AsyncCrawlResponse to CrawlResult ---
                    crawl_result.status_code = async_response.status_code
                    crawl_result.redirected_url = async_response.redirected_url or url
                    crawl_result.response_headers = async_response.response_headers
                    crawl_result.downloaded_files = async_response.downloaded_files
                    crawl_result.js_execution_result = js_execution_result
                    crawl_result.ssl_certificate = async_response.ssl_certificate
                    # NEW: Copy captured data
                    crawl_result.network_requests = async_response.network_requests
                    crawl_result.console_messages = async_response.console_messages
                    # ------------------------------------------------------------

                    crawl_result.success = bool(html)
                    crawl_result.session_id = getattr(config, "session_id", None)

                    # ... (existing logging) ...

                    # Update cache if appropriate
                    if cache_context.should_write() and not bool(cached_result):
                        # crawl_result now includes network/console data if captured
                        await async_db_manager.acache_url(crawl_result)

                    return CrawlResultContainer(crawl_result)

                else:
                    # Cached result was used
                    # ... (existing logging for cache hit) ...
                    cached_result.success = bool(html)
                    cached_result.session_id = getattr(config, "session_id", None)
                    cached_result.redirected_url = cached_result.redirected_url or url
                    return CrawlResultContainer(cached_result)

            except Exception as e:
                # ... (existing error handling) ...
                return CrawlResultContainer(
                    CrawlResult(
                        url=url, html="", success=False, error_message=error_message
                    )
                )

    # ... (aprocess_html remains unchanged regarding capture) ...
    # ... (arun_many remains unchanged regarding capture) ...
```

**Summary of Changes:**

1. **Configuration:** Added `capture_network_requests` and `capture_console_messages` flags to `CrawlerRunConfig`.
2. **Models:** Added corresponding `network_requests` and `console_messages` fields (lists of dicts) to `AsyncCrawlResponse` and `CrawlResult`.
3. **Strategy:** Implemented conditional event listeners in `AsyncPlaywrightCrawlerStrategy._crawl_web` that capture data into lists when the flags are true, and populated these fields in the returned `AsyncCrawlResponse`. Added basic error handling and timestamps within the capture handlers.
4. **Crawler:** Modified `AsyncWebCrawler.arun` to copy the captured data from `AsyncCrawlResponse` into the final `CrawlResult` for non-cached fetches.

This approach keeps the capturing logic contained within the Playwright strategy, uses clear configuration flags, and integrates the results into the existing data flow. The data format (a list of dictionaries) is flexible enough to store varied information from requests, responses, and console messages.
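For completeness, here is a minimal usage sketch from the caller's side. It assumes the changes above are in place; the flag and field names are the ones proposed here, the URL is just a placeholder, and the rest is the standard `AsyncWebCrawler` API.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Enable the proposed capture flags (both default to False)
    config = CrawlerRunConfig(
        capture_network_requests=True,
        capture_console_messages=True,
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        # Both fields are lists of dicts, or None when the corresponding flag is off
        print(f"Captured {len(result.network_requests or [])} network events")
        for msg in result.console_messages or []:
            print(msg["type"], msg["text"])

asyncio.run(main())
```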