`.
+
+---
+
+## 7. Putting It All Together: Larger Example
+
+Consider a blog site. We have a schema that extracts the **URL** from each post card (via `baseFields` with an `"attribute": "href"`), plus the title, date, summary, and author:
+
+```python
+schema = {
+ "name": "Blog Posts",
+ "baseSelector": "a.blog-post-card",
+ "baseFields": [
+ {"name": "post_url", "type": "attribute", "attribute": "href"}
+ ],
+ "fields": [
+ {"name": "title", "selector": "h2.post-title", "type": "text", "default": "No Title"},
+ {"name": "date", "selector": "time.post-date", "type": "text", "default": ""},
+ {"name": "summary", "selector": "p.post-summary", "type": "text", "default": ""},
+ {"name": "author", "selector": "span.post-author", "type": "text", "default": ""}
+ ]
+}
+```
+
+Then run with `JsonCssExtractionStrategy(schema)` to get an array of blog post objects, each with `"post_url"`, `"title"`, `"date"`, `"summary"`, `"author"`.
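+
+A minimal run sketch (the target URL and result handling are illustrative; `schema` is the dict defined above):
+
+```python
+import asyncio
+import json
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+
+async def main():
+    config = CrawlerRunConfig(
+        extraction_strategy=JsonCssExtractionStrategy(schema)  # `schema` from the block above
+    )
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun(url="https://example.com/blog", config=config)
+        if result.success:
+            posts = json.loads(result.extracted_content)
+            print(f"Extracted {len(posts)} posts")
+            if posts:
+                print(posts[0])  # e.g. {"post_url": "...", "title": "...", ...}
+
+asyncio.run(main())
+```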
+
+---
+
+## 8. Tips & Best Practices
+
+1. **Inspect the DOM** in Chrome DevTools or Firefox's Inspector to find stable selectors.
+2. **Start Simple**: Verify you can extract a single field. Then add complexity like nested objects or lists.
+3. **Test** your schema on partial HTML or a test page before a big crawl.
+4. **Combine with JS Execution** if the site loads content dynamically. You can pass `js_code` or `wait_for` in `CrawlerRunConfig` (see the sketch after this list).
+5. **Look at Logs** when `verbose=True`: if your selectors are off or your schema is malformed, it'll often show warnings.
+6. **Use baseFields** if you need attributes from the container element (e.g., `href`, `data-id`), especially for the "parent" item.
+7. **Performance**: For large pages, make sure your selectors are as narrow as possible.
+8. **Consider Using Regex First**: For simple data types like emails, URLs, and dates, `RegexExtractionStrategy` is often the fastest approach.
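+
+For tip 4, here's a hedged sketch of pairing a schema with dynamic-content options; the `wait_for` selector and the scroll snippet are assumptions about the target page, not fixed values:
+
+```python
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
+from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+
+async def crawl_dynamic_page(schema: dict, url: str):
+    config = CrawlerRunConfig(
+        extraction_strategy=JsonCssExtractionStrategy(schema),
+        js_code="window.scrollTo(0, document.body.scrollHeight);",  # trigger lazy-loaded cards
+        wait_for="css:.post-card-loaded",  # assumed selector that appears once content is ready
+        cache_mode=CacheMode.BYPASS,
+    )
+    async with AsyncWebCrawler() as crawler:
+        return await crawler.arun(url=url, config=config)
+```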
+
+---
+
+## 9. Schema Generation Utility
+
+While manually crafting schemas is powerful and precise, Crawl4AI now offers a convenient utility to **automatically generate** extraction schemas using an LLM. This is particularly useful when:
+
+1. You're dealing with a new website structure and want a quick starting point
+2. You need to extract complex nested data structures
+3. You want to avoid the learning curve of CSS/XPath selector syntax
+
+### Using the Schema Generator
+
+The schema generator is available as a static method on both `JsonCssExtractionStrategy` and `JsonXPathExtractionStrategy`. You can choose between OpenAI's GPT-4o and the open-source Ollama models for schema generation:
+
+```python
+from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, JsonXPathExtractionStrategy
+from crawl4ai import LLMConfig
+
+# Sample HTML with product information (minimal illustrative markup)
+html = """
+<div class="product-card">
+    <h2 class="product-title">Gaming Laptop</h2>
+    <span class="product-price">$999.99</span>
+</div>
+"""
+
+# Option 1: Using OpenAI (requires API token)
+css_schema = JsonCssExtractionStrategy.generate_schema(
+ html,
+ schema_type="css",
+ llm_config = LLMConfig(provider="openai/gpt-4o",api_token="your-openai-token")
+)
+
+# Option 2: Using Ollama (open source, no token needed)
+xpath_schema = JsonXPathExtractionStrategy.generate_schema(
+ html,
+ schema_type="xpath",
+ llm_config = LLMConfig(provider="ollama/llama3.3", api_token=None) # Not needed for Ollama
+)
+
+# Use the generated schema for fast, repeated extractions
+strategy = JsonCssExtractionStrategy(css_schema)
+```
+
+### LLM Provider Options
+
+1. **OpenAI GPT-4o (`openai/gpt-4o`)**
+ - Default provider
+ - Requires an API token
+ - Generally provides more accurate schemas
+ - Set via environment variable: `OPENAI_API_KEY` (see the sketch after this list)
+
+2. **Ollama (`ollama/llama3.3`)**
+ - Open source alternative
+ - No API token required
+ - Self-hosted option
+ - Good for development and testing
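+
+To keep the token out of your source code (see best practice 5 below), read it from the environment; a minimal sketch reusing the `html` sample above:
+
+```python
+import os
+from crawl4ai import LLMConfig
+from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+
+llm_config = LLMConfig(
+    provider="openai/gpt-4o",
+    api_token=os.getenv("OPENAI_API_KEY"),  # read from the environment, never hardcoded
+)
+schema = JsonCssExtractionStrategy.generate_schema(html, schema_type="css", llm_config=llm_config)
+```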
+
+### Benefits of Schema Generation
+
+1. **One-Time Cost**: While schema generation uses an LLM, it's a one-time cost. The generated schema can be reused for unlimited extractions without further LLM calls.
+2. **Smart Pattern Recognition**: The LLM analyzes the HTML structure and identifies common patterns, often producing more robust selectors than manual attempts.
+3. **Automatic Nesting**: Complex nested structures are automatically detected and properly represented in the schema.
+4. **Learning Tool**: The generated schemas serve as excellent examples for learning how to write your own schemas.
+
+### Best Practices
+
+1. **Review Generated Schemas**: While the generator is smart, always review and test the generated schema before using it in production.
+2. **Provide Representative HTML**: The better your sample HTML represents the overall structure, the more accurate the generated schema will be.
+3. **Consider Both CSS and XPath**: Try both schema types and choose the one that works best for your specific case.
+4. **Cache Generated Schemas**: Since generation uses an LLM, save successful schemas for reuse (see the sketch after this list).
+5. **API Token Security**: Never hardcode API tokens. Use environment variables or secure configuration management.
+6. **Choose Provider Wisely**:
+ - Use OpenAI for production-quality schemas
+ - Use Ollama for development, testing, or when you need a self-hosted solution
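+
+For best practice 4, a minimal caching sketch (the file path is illustrative): generate the schema once, persist it, and reload it on later runs so no further LLM calls are needed.
+
+```python
+import json
+import os
+from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+
+SCHEMA_FILE = "product_schema.json"  # illustrative cache location
+
+def load_or_generate_schema(html: str, llm_config) -> dict:
+    if os.path.exists(SCHEMA_FILE):
+        with open(SCHEMA_FILE) as f:
+            return json.load(f)  # cached schema: no LLM call needed
+    schema = JsonCssExtractionStrategy.generate_schema(html, schema_type="css", llm_config=llm_config)
+    with open(SCHEMA_FILE, "w") as f:
+        json.dump(schema, f, indent=2)  # save for future runs
+    return schema
+```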
+
+---
+
+## 10. Conclusion
+
+With Crawl4AI's LLM-free extraction strategies - `JsonCssExtractionStrategy`, `JsonXPathExtractionStrategy`, and now `RegexExtractionStrategy` - you can build powerful pipelines that:
+
+- Scrape any consistent site for structured data.
+- Support nested objects, repeating lists, or pattern-based extraction.
+- Scale to thousands of pages quickly and reliably.
+
+**Choosing the Right Strategy**:
+
+- Use **`RegexExtractionStrategy`** for fast extraction of common data types like emails, phones, URLs, dates, etc.
+- Use **`JsonCssExtractionStrategy`** or **`JsonXPathExtractionStrategy`** for structured data with clear HTML patterns
+- If you need both: first extract structured data with JSON strategies, then use regex on specific fields
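+
+A minimal sketch of that hybrid flow, using plain `re` on a field from the JSON output (the field name and pattern are illustrative):
+
+```python
+import json
+import re
+
+def add_iso_dates(extracted_content: str) -> list:
+    # `extracted_content` is the JSON string returned by a JsonCssExtractionStrategy run
+    items = json.loads(extracted_content)
+    for item in items:
+        # Pull an ISO-style date out of the already-extracted "date" text field
+        match = re.search(r"\d{4}-\d{2}-\d{2}", item.get("date", ""))
+        item["date_iso"] = match.group(0) if match else None
+    return items
+```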
+
+**Remember**: For repeated, structured data, you don't need to pay for or wait on an LLM. Well-crafted schemas and regex patterns get you the data faster, cleaner, and cheaper—**the real power** of Crawl4AI.
+
+**Last Updated**: 2025-05-02
+
+---
+
+That's it for **Extracting JSON (No LLM)**! You've seen how schema-based approaches (either CSS or XPath) and regex patterns can handle everything from simple lists to deeply nested product catalogs—instantly, with minimal overhead. Enjoy building robust scrapers that produce consistent, structured JSON for your data pipelines!
+```
+
+
+## File: docs/md_v2/extraction/clustring-strategies.md
+
+```md
+# Cosine Strategy
+
+The Cosine Strategy in Crawl4AI uses similarity-based clustering to identify and extract relevant content sections from web pages. This strategy is particularly useful when you need to find and extract content based on semantic similarity rather than structural patterns.
+
+## How It Works
+
+The Cosine Strategy:
+1. Breaks down page content into meaningful chunks
+2. Converts text into vector representations
+3. Calculates cosine similarity between chunks (see the sketch after this list)
+4. Clusters similar content together
+5. Ranks and filters content based on relevance
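+
+For intuition, the similarity in step 3 is plain cosine similarity over embedding vectors. A minimal sketch with `numpy` (illustrative vectors; real embeddings come from the sentence-transformers model, and this is independent of the library's internals):
+
+```python
+import numpy as np
+
+def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
+    # Dot product of the vectors divided by the product of their norms
+    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
+
+chunk_a = np.array([0.1, 0.8, 0.3])  # hypothetical embedding of chunk A
+chunk_b = np.array([0.2, 0.7, 0.4])  # hypothetical embedding of chunk B
+print(cosine_sim(chunk_a, chunk_b))  # values near 1.0 mean the chunks are semantically similar
+```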
+
+## Basic Usage
+
+```python
+from crawl4ai import AsyncWebCrawler
+from crawl4ai.extraction_strategy import CosineStrategy
+
+strategy = CosineStrategy(
+ semantic_filter="product reviews", # Target content type
+ word_count_threshold=10, # Minimum words per cluster
+ sim_threshold=0.3 # Similarity threshold
+)
+
+async with AsyncWebCrawler() as crawler:
+ result = await crawler.arun(
+ url="https://example.com/reviews",
+ extraction_strategy=strategy
+ )
+
+ content = result.extracted_content
+```
+
+## Configuration Options
+
+### Core Parameters
+
+```python
+CosineStrategy(
+ # Content Filtering
+ semantic_filter: str = None, # Keywords/topic for content filtering
+ word_count_threshold: int = 10, # Minimum words per cluster
+ sim_threshold: float = 0.3, # Similarity threshold (0.0 to 1.0)
+
+ # Clustering Parameters
+ max_dist: float = 0.2, # Maximum distance for clustering
+ linkage_method: str = 'ward', # Clustering linkage method
+ top_k: int = 3, # Number of top categories to extract
+
+ # Model Configuration
+ model_name: str = 'sentence-transformers/all-MiniLM-L6-v2', # Embedding model
+
+ verbose: bool = False # Enable logging
+)
+```
+
+### Parameter Details
+
+1. **semantic_filter**
+ - Sets the target topic or content type
+ - Use keywords relevant to your desired content
+ - Example: "technical specifications", "user reviews", "pricing information"
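+   Example (filter keywords are illustrative):
+   ```python
+   # Steer clustering toward review-like content
+   strategy = CosineStrategy(semantic_filter="customer reviews and ratings")
+   ```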
+
+2. **sim_threshold**
+ - Controls how similar content must be to be grouped together
+ - Higher values (e.g., 0.8) mean stricter matching
+ - Lower values (e.g., 0.3) allow more variation
+ ```python
+ # Strict matching
+ strategy = CosineStrategy(sim_threshold=0.8)
+
+ # Loose matching
+ strategy = CosineStrategy(sim_threshold=0.3)
+ ```
+
+3. **word_count_threshold**
+ - Filters out short content blocks
+ - Helps eliminate noise and irrelevant content
+ ```python
+ # Only consider substantial paragraphs
+ strategy = CosineStrategy(word_count_threshold=50)
+ ```
+
+4. **top_k**
+ - Number of top content clusters to return
+ - Higher values return more diverse content
+ ```python
+ # Get top 5 most relevant content clusters
+ strategy = CosineStrategy(top_k=5)
+ ```
+
+## Use Cases
+
+### 1. Article Content Extraction
+```python
+strategy = CosineStrategy(
+ semantic_filter="main article content",
+ word_count_threshold=100, # Longer blocks for articles
+ top_k=1 # Usually want single main content
+)
+
+result = await crawler.arun(
+ url="https://example.com/blog/post",
+ extraction_strategy=strategy
+)
+```
+
+### 2. Product Review Analysis
+```python
+strategy = CosineStrategy(
+ semantic_filter="customer reviews and ratings",
+ word_count_threshold=20, # Reviews can be shorter
+ top_k=10, # Get multiple reviews
+ sim_threshold=0.4 # Allow variety in review content
+)
+```
+
+### 3. Technical Documentation
+```python
+strategy = CosineStrategy(
+ semantic_filter="technical specifications documentation",
+ word_count_threshold=30,
+ sim_threshold=0.6, # Stricter matching for technical content
+ max_dist=0.3 # Allow related technical sections
+)
+```
+
+## Advanced Features
+
+### Custom Clustering
+```python
+strategy = CosineStrategy(
+ linkage_method='complete', # Alternative clustering method
+ max_dist=0.4, # Larger clusters
+ model_name='sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2' # Multilingual support
+)
+```
+
+### Content Filtering Pipeline
+```python
+strategy = CosineStrategy(
+ semantic_filter="pricing plans features",
+ word_count_threshold=15,
+ sim_threshold=0.5,
+ top_k=3
+)
+
+async def extract_pricing_features(url: str):
+ async with AsyncWebCrawler() as crawler:
+ result = await crawler.arun(
+ url=url,
+ extraction_strategy=strategy
+ )
+
+ if result.success:
+ content = json.loads(result.extracted_content)
+ return {
+ 'pricing_features': content,
+ 'clusters': len(content),
+ 'similarity_scores': [item['score'] for item in content]
+ }
+```
+
+## Best Practices
+
+1. **Adjust Thresholds Iteratively**
+ - Start with default values
+ - Adjust based on results
+ - Monitor clustering quality
+
+2. **Choose Appropriate Word Count Thresholds**
+ - Higher for articles (100+)
+ - Lower for reviews/comments (20+)
+ - Medium for product descriptions (50+)
+
+3. **Optimize Performance**
+ ```python
+ strategy = CosineStrategy(
+ word_count_threshold=10, # Filter early
+ top_k=5, # Limit results
+ verbose=True # Monitor performance
+ )
+ ```
+
+4. **Handle Different Content Types**
+ ```python
+ # For mixed content pages
+ strategy = CosineStrategy(
+ semantic_filter="product features",
+ sim_threshold=0.4, # More flexible matching
+ max_dist=0.3, # Larger clusters
+ top_k=3 # Multiple relevant sections
+ )
+ ```
+
+## Error Handling
+
+```python
+try:
+ result = await crawler.arun(
+ url="https://example.com",
+ extraction_strategy=strategy
+ )
+
+ if result.success:
+ content = json.loads(result.extracted_content)
+ if not content:
+ print("No relevant content found")
+ else:
+ print(f"Extraction failed: {result.error_message}")
+
+except Exception as e:
+ print(f"Error during extraction: {str(e)}")
+```
+
+The Cosine Strategy is particularly effective when:
+- Content structure is inconsistent
+- You need semantic understanding
+- You want to find similar content blocks
+- Structure-based extraction (CSS/XPath) isn't reliable
+
+It works well with other strategies and can be used as a pre-processing step for LLM-based extraction.
+```
+
+
+## File: docs/md_v2/advanced/advanced-features.md
+
+```md
+# Overview of Some Important Advanced Features
+(Proxy, PDF, Screenshot, SSL, Headers, & Storage State)
+
+Crawl4AI offers multiple power-user features that go beyond simple crawling. This tutorial covers:
+
+1. **Proxy Usage**
+2. **Capturing PDFs & Screenshots**
+3. **Handling SSL Certificates**
+4. **Custom Headers**
+5. **Session Persistence & Local Storage**
+6. **Robots.txt Compliance**
+
+> **Prerequisites**
+> - You have a basic grasp of [AsyncWebCrawler Basics](../core/simple-crawling.md)
+> - You know how to run or configure your Python environment with Playwright installed
+
+---
+
+## 1. Proxy Usage
+
+If you need to route your crawl traffic through a proxy—whether for IP rotation, geo-testing, or privacy—Crawl4AI supports it via `BrowserConfig.proxy_config`.
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
+
+async def main():
+ browser_cfg = BrowserConfig(
+ proxy_config={
+ "server": "http://proxy.example.com:8080",
+ "username": "myuser",
+ "password": "mypass",
+ },
+ headless=True
+ )
+ crawler_cfg = CrawlerRunConfig(
+ verbose=True
+ )
+
+ async with AsyncWebCrawler(config=browser_cfg) as crawler:
+ result = await crawler.arun(
+ url="https://www.whatismyip.com/",
+ config=crawler_cfg
+ )
+ if result.success:
+ print("[OK] Page fetched via proxy.")
+ print("Page HTML snippet:", result.html[:200])
+ else:
+ print("[ERROR]", result.error_message)
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+**Key Points**
+- **`proxy_config`** expects a dict with `server` and optional auth credentials.
+- Many commercial proxies provide an HTTP/HTTPS “gateway” server that you specify in `server`.
+- If your proxy doesn’t need auth, omit `username`/`password`.
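+
+If your proxy needs no credentials, the config shrinks to just the gateway address; a minimal sketch (the address is illustrative):
+
+```python
+from crawl4ai import BrowserConfig
+
+browser_cfg = BrowserConfig(
+    proxy_config={"server": "http://proxy.example.com:8080"},  # no username/password entries
+    headless=True,
+)
+```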
+
+---
+
+## 2. Capturing PDFs & Screenshots
+
+Sometimes you need a visual record of a page or a PDF “printout.” Crawl4AI can do both in one pass:
+
+```python
+import os, asyncio
+from base64 import b64decode
+from crawl4ai import AsyncWebCrawler, CacheMode
+
+async def main():
+ async with AsyncWebCrawler() as crawler:
+ result = await crawler.arun(
+ url="https://en.wikipedia.org/wiki/List_of_common_misconceptions",
+ cache_mode=CacheMode.BYPASS,
+ pdf=True,
+ screenshot=True
+ )
+
+ if result.success:
+ # Save screenshot
+ if result.screenshot:
+ with open("wikipedia_screenshot.png", "wb") as f:
+ f.write(b64decode(result.screenshot))
+
+ # Save PDF
+ if result.pdf:
+ with open("wikipedia_page.pdf", "wb") as f:
+ f.write(result.pdf)
+
+ print("[OK] PDF & screenshot captured.")
+ else:
+ print("[ERROR]", result.error_message)
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+**Why PDF + Screenshot?**
+- Large or complex pages can be slow or error-prone with “traditional” full-page screenshots.
+- Exporting a PDF is more reliable for very long pages. Crawl4AI automatically converts the first PDF page into an image if you request both.
+
+**Relevant Parameters**
+- **`pdf=True`**: Exports the current page as a PDF (base64-encoded in `result.pdf`).
+- **`screenshot=True`**: Creates a screenshot (base64-encoded in `result.screenshot`).
+- **`scan_full_page`** or advanced hooking can further refine how the crawler captures content.
+
+---
+
+## 3. Handling SSL Certificates
+
+If you need to verify or export a site’s SSL certificate—for compliance, debugging, or data analysis—Crawl4AI can fetch it during the crawl:
+
+```python
+import asyncio, os
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
+
+async def main():
+ tmp_dir = os.path.join(os.getcwd(), "tmp")
+ os.makedirs(tmp_dir, exist_ok=True)
+
+ config = CrawlerRunConfig(
+ fetch_ssl_certificate=True,
+ cache_mode=CacheMode.BYPASS
+ )
+
+ async with AsyncWebCrawler() as crawler:
+ result = await crawler.arun(url="https://example.com", config=config)
+
+ if result.success and result.ssl_certificate:
+ cert = result.ssl_certificate
+ print("\nCertificate Information:")
+ print(f"Issuer (CN): {cert.issuer.get('CN', '')}")
+ print(f"Valid until: {cert.valid_until}")
+ print(f"Fingerprint: {cert.fingerprint}")
+
+ # Export in multiple formats:
+ cert.to_json(os.path.join(tmp_dir, "certificate.json"))
+ cert.to_pem(os.path.join(tmp_dir, "certificate.pem"))
+ cert.to_der(os.path.join(tmp_dir, "certificate.der"))
+
+ print("\nCertificate exported to JSON/PEM/DER in 'tmp' folder.")
+ else:
+ print("[ERROR] No certificate or crawl failed.")
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+**Key Points**
+- **`fetch_ssl_certificate=True`** triggers certificate retrieval.
+- `result.ssl_certificate` includes methods (`to_json`, `to_pem`, `to_der`) for saving in various formats (handy for server config, Java keystores, etc.).
+
+---
+
+## 4. Custom Headers
+
+Sometimes you need to set custom headers (e.g., language preferences, authentication tokens, or specialized user-agent strings). You can do this in multiple ways:
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler
+
+async def main():
+ # Option 1: Set headers at the crawler strategy level
+ crawler1 = AsyncWebCrawler(
+ # The underlying strategy can accept headers in its constructor
+ crawler_strategy=None # We'll override below for clarity
+ )
+ crawler1.crawler_strategy.update_user_agent("MyCustomUA/1.0")
+ crawler1.crawler_strategy.set_custom_headers({
+ "Accept-Language": "fr-FR,fr;q=0.9"
+ })
+ result1 = await crawler1.arun("https://www.example.com")
+ print("Example 1 result success:", result1.success)
+
+ # Option 2: Pass headers directly to `arun()`
+ crawler2 = AsyncWebCrawler()
+ result2 = await crawler2.arun(
+ url="https://www.example.com",
+ headers={"Accept-Language": "es-ES,es;q=0.9"}
+ )
+ print("Example 2 result success:", result2.success)
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+**Notes**
+- Some sites may react differently to certain headers (e.g., `Accept-Language`).
+- If you need advanced user-agent randomization or client hints, see [Identity-Based Crawling (Anti-Bot)](./identity-based-crawling.md) or use `UserAgentGenerator`.
+
+---
+
+## 5. Session Persistence & Local Storage
+
+Crawl4AI can preserve cookies and localStorage so you can continue where you left off—ideal for logging into sites or skipping repeated auth flows.
+
+### 5.1 `storage_state`
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler
+
+async def main():
+ storage_dict = {
+ "cookies": [
+ {
+ "name": "session",
+ "value": "abcd1234",
+ "domain": "example.com",
+ "path": "/",
+ "expires": 1699999999.0,
+ "httpOnly": False,
+ "secure": False,
+ "sameSite": "None"
+ }
+ ],
+ "origins": [
+ {
+ "origin": "https://example.com",
+ "localStorage": [
+ {"name": "token", "value": "my_auth_token"}
+ ]
+ }
+ ]
+ }
+
+ # Provide the storage state as a dictionary to start "already logged in"
+ async with AsyncWebCrawler(
+ headless=True,
+ storage_state=storage_dict
+ ) as crawler:
+ result = await crawler.arun("https://example.com/protected")
+ if result.success:
+ print("Protected page content length:", len(result.html))
+ else:
+ print("Failed to crawl protected page")
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+### 5.2 Exporting & Reusing State
+
+You can sign in once, export the browser context, and reuse it later—without re-entering credentials.
+
+- **`await context.storage_state(path="my_storage.json")`**: Exports cookies, localStorage, etc. to a file.
+- Provide `storage_state="my_storage.json"` on subsequent runs to skip the login step.
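+
+A minimal reuse sketch (assumes `my_storage.json` was exported from an earlier logged-in session):
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler
+
+async def crawl_with_saved_state():
+    # The browser context starts pre-loaded with the exported cookies/localStorage
+    async with AsyncWebCrawler(headless=True, storage_state="my_storage.json") as crawler:
+        result = await crawler.arun("https://example.com/protected")
+        print("Success:", result.success)
+
+asyncio.run(crawl_with_saved_state())
+```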
+
+**See**: [Detailed session management tutorial](./session-management.md) or [Explanations → Browser Context & Managed Browser](./identity-based-crawling.md) for more advanced scenarios (like multi-step logins, or capturing after interactive pages).
+
+---
+
+## 6. Robots.txt Compliance
+
+Crawl4AI supports respecting robots.txt rules with efficient caching:
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+
+async def main():
+ # Enable robots.txt checking in config
+ config = CrawlerRunConfig(
+ check_robots_txt=True # Will check and respect robots.txt rules
+ )
+
+ async with AsyncWebCrawler() as crawler:
+ result = await crawler.arun(
+ "https://example.com",
+ config=config
+ )
+
+ if not result.success and result.status_code == 403:
+ print("Access denied by robots.txt")
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+**Key Points**
+- Robots.txt files are cached locally for efficiency
+- Cache is stored in `~/.crawl4ai/robots/robots_cache.db`
+- Cache has a default TTL of 7 days
+- If robots.txt can't be fetched, crawling is allowed
+- Returns 403 status code if URL is disallowed
+
+---
+
+## Putting It All Together
+
+Here’s a snippet that combines multiple “advanced” features (proxy, PDF, screenshot, SSL, custom headers, and session reuse) into one run. Normally, you’d tailor each setting to your project’s needs.
+
+```python
+import os, asyncio
+from base64 import b64decode
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
+
+async def main():
+ # 1. Browser config with proxy + headless
+ browser_cfg = BrowserConfig(
+ proxy_config={
+ "server": "http://proxy.example.com:8080",
+ "username": "myuser",
+ "password": "mypass",
+ },
+ headless=True,
+ )
+
+ # 2. Crawler config with PDF, screenshot, SSL, custom headers, and ignoring caches
+ crawler_cfg = CrawlerRunConfig(
+ pdf=True,
+ screenshot=True,
+ fetch_ssl_certificate=True,
+ cache_mode=CacheMode.BYPASS,
+ headers={"Accept-Language": "en-US,en;q=0.8"},
+ storage_state="my_storage.json", # Reuse session from a previous sign-in
+ verbose=True,
+ )
+
+ # 3. Crawl
+ async with AsyncWebCrawler(config=browser_cfg) as crawler:
+ result = await crawler.arun(
+ url = "https://secure.example.com/protected",
+ config=crawler_cfg
+ )
+
+ if result.success:
+ print("[OK] Crawled the secure page. Links found:", len(result.links.get("internal", [])))
+
+ # Save PDF & screenshot
+ if result.pdf:
+ with open("result.pdf", "wb") as f:
+ f.write(b64decode(result.pdf))
+ if result.screenshot:
+ with open("result.png", "wb") as f:
+ f.write(b64decode(result.screenshot))
+
+ # Check SSL cert
+ if result.ssl_certificate:
+ print("SSL Issuer CN:", result.ssl_certificate.issuer.get("CN", ""))
+ else:
+ print("[ERROR]", result.error_message)
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+---
+
+## Conclusion & Next Steps
+
+You’ve now explored several **advanced** features:
+
+- **Proxy Usage**
+- **PDF & Screenshot** capturing for large or critical pages
+- **SSL Certificate** retrieval & exporting
+- **Custom Headers** for language or specialized requests
+- **Session Persistence** via storage state
+- **Robots.txt Compliance**
+
+With these power tools, you can build robust scraping workflows that mimic real user behavior, handle secure sites, capture detailed snapshots, and manage sessions across multiple runs—streamlining your entire data collection pipeline.
+
+**Last Updated**: 2025-01-01
+```
+
+
+## File: docs/md_v2/advanced/crawl-dispatcher.md
+
+```md
+# Crawl Dispatcher
+
+We’re excited to announce a **Crawl Dispatcher** module that can handle **thousands** of crawling tasks simultaneously. By efficiently managing system resources (memory, CPU, network), this dispatcher ensures high-performance data extraction at scale. It also provides **real-time monitoring** of each crawler’s status, memory usage, and overall progress.
+
+Stay tuned—this feature is **coming soon** in an upcoming release of Crawl4AI! For the latest news, keep an eye on our changelogs and follow [@unclecode](https://twitter.com/unclecode) on X.
+
+Below is a **sample** of how the dispatcher’s performance monitor might look in action:
+
+
+
+
+We can’t wait to bring you this streamlined, **scalable** approach to multi-URL crawling—**watch this space** for updates!
+```
+
+
+## File: docs/md_v2/advanced/file-downloading.md
+
+```md
+# Download Handling in Crawl4AI
+
+This guide explains how to use Crawl4AI to handle file downloads during crawling. You'll learn how to trigger downloads, specify download locations, and access downloaded files.
+
+## Enabling Downloads
+
+To enable downloads, set the `accept_downloads` parameter in the `BrowserConfig` object and pass it to the crawler.
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler
+from crawl4ai.async_configs import BrowserConfig
+
+async def main():
+    config = BrowserConfig(accept_downloads=True)  # Enable downloads globally
+    async with AsyncWebCrawler(config=config) as crawler:
+        ...  # your crawling logic here
+
+asyncio.run(main())
+```
+
+## Specifying Download Location
+
+Specify the download directory using the `downloads_path` attribute in the `BrowserConfig` object. If not provided, Crawl4AI defaults to creating a "downloads" directory inside the `.crawl4ai` folder in your home directory.
+
+```python
+from crawl4ai import AsyncWebCrawler
+from crawl4ai.async_configs import BrowserConfig
+import os
+
+downloads_path = os.path.join(os.getcwd(), "my_downloads") # Custom download path
+os.makedirs(downloads_path, exist_ok=True)
+
+config = BrowserConfig(accept_downloads=True, downloads_path=downloads_path)
+
+async def main():
+ async with AsyncWebCrawler(config=config) as crawler:
+ result = await crawler.arun(url="https://example.com")
+ # ...
+```
+
+## Triggering Downloads
+
+Downloads are typically triggered by user interactions on a web page, such as clicking a download button. Use `js_code` in `CrawlerRunConfig` to simulate these actions and `wait_for` to allow sufficient time for downloads to start.
+
+```python
+from crawl4ai.async_configs import CrawlerRunConfig
+
+config = CrawlerRunConfig(
+ js_code="""
+ const downloadLink = document.querySelector('a[href$=".exe"]');
+ if (downloadLink) {
+ downloadLink.click();
+ }
+ """,
+ wait_for=5 # Wait 5 seconds for the download to start
+)
+
+result = await crawler.arun(url="https://www.python.org/downloads/", config=config)
+```
+
+## Accessing Downloaded Files
+
+The `downloaded_files` attribute of the `CrawlResult` object contains paths to downloaded files.
+
+```python
+if result.downloaded_files:
+ print("Downloaded files:")
+ for file_path in result.downloaded_files:
+ print(f"- {file_path}")
+ file_size = os.path.getsize(file_path)
+ print(f"- File size: {file_size} bytes")
+else:
+ print("No files downloaded.")
+```
+
+## Example: Downloading Multiple Files
+
+```python
+from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
+import os
+from pathlib import Path
+
+async def download_multiple_files(url: str, download_path: str):
+ config = BrowserConfig(accept_downloads=True, downloads_path=download_path)
+ async with AsyncWebCrawler(config=config) as crawler:
+ run_config = CrawlerRunConfig(
+ js_code="""
+ const downloadLinks = document.querySelectorAll('a[download]');
+ for (const link of downloadLinks) {
+ link.click();
+ // Delay between clicks
+ await new Promise(r => setTimeout(r, 2000));
+ }
+ """,
+ wait_for=10 # Wait for all downloads to start
+ )
+ result = await crawler.arun(url=url, config=run_config)
+
+ if result.downloaded_files:
+ print("Downloaded files:")
+ for file in result.downloaded_files:
+ print(f"- {file}")
+ else:
+ print("No files downloaded.")
+
+# Usage
+download_path = os.path.join(Path.home(), ".crawl4ai", "downloads")
+os.makedirs(download_path, exist_ok=True)
+
+asyncio.run(download_multiple_files("https://www.python.org/downloads/windows/", download_path))
+```
+
+## Important Considerations
+
+- **Browser Context:** Downloads are managed within the browser context. Ensure `js_code` correctly targets the download triggers on the webpage.
+- **Timing:** Use `wait_for` in `CrawlerRunConfig` to manage download timing.
+- **Error Handling:** Handle errors to manage failed downloads or incorrect paths gracefully.
+- **Security:** Scan downloaded files for potential security threats before use.
+
+All download-related configuration goes through `BrowserConfig` and `CrawlerRunConfig`, keeping these examples consistent with the rest of the Crawl4AI API.
+```
+
+
+## File: docs/md_v2/advanced/hooks-auth.md
+
+```md
+# Hooks & Auth in AsyncWebCrawler
+
+Crawl4AI’s **hooks** let you customize the crawler at specific points in the pipeline:
+
+1. **`on_browser_created`** – After browser creation.
+2. **`on_page_context_created`** – After a new context & page are created.
+3. **`before_goto`** – Just before navigating to a page.
+4. **`after_goto`** – Right after navigation completes.
+5. **`on_user_agent_updated`** – Whenever the user agent changes.
+6. **`on_execution_started`** – Once custom JavaScript execution begins.
+7. **`before_retrieve_html`** – Just before the crawler retrieves final HTML.
+8. **`before_return_html`** – Right before returning the HTML content.
+
+**Important**: Avoid heavy tasks in `on_browser_created` since you don’t yet have a page context. If you need to *log in*, do so in **`on_page_context_created`**.
+
+> **Important Hook Usage Warning**
+> **Avoid Misusing Hooks**: Do not manipulate page objects in the wrong hook or at the wrong time, as it can crash the pipeline or produce incorrect results. A common mistake is attempting to handle authentication prematurely—such as creating or closing pages in `on_browser_created`.
+
+> **Use the Right Hook for Auth**: If you need to log in or set tokens, use `on_page_context_created`. This ensures you have a valid page/context to work with, without disrupting the main crawling flow.
+
+> **Identity-Based Crawling**: For robust auth, consider identity-based crawling (or passing a session ID) to preserve state. Run your initial login steps in a separate, well-defined process, then feed that session to your main crawl—rather than shoehorning complex authentication into early hooks. Check out [Identity-Based Crawling](../advanced/identity-based-crawling.md) for more details.
+
+> **Be Cautious**: Overwriting or removing elements in the wrong hook can compromise the final crawl. Keep hooks focused on smaller tasks (like route filters, custom headers), and let your main logic (crawling, data extraction) proceed normally.
+
+
+Below is an example demonstration.
+
+---
+
+## Example: Using Hooks in AsyncWebCrawler
+
+```python
+import asyncio
+import json
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
+from playwright.async_api import Page, BrowserContext
+
+async def main():
+ print("🔗 Hooks Example: Demonstrating recommended usage")
+
+ # 1) Configure the browser
+ browser_config = BrowserConfig(
+ headless=True,
+ verbose=True
+ )
+
+ # 2) Configure the crawler run
+ crawler_run_config = CrawlerRunConfig(
+ js_code="window.scrollTo(0, document.body.scrollHeight);",
+ wait_for="body",
+ cache_mode=CacheMode.BYPASS
+ )
+
+ # 3) Create the crawler instance
+ crawler = AsyncWebCrawler(config=browser_config)
+
+ #
+ # Define Hook Functions
+ #
+
+ async def on_browser_created(browser, **kwargs):
+ # Called once the browser instance is created (but no pages or contexts yet)
+ print("[HOOK] on_browser_created - Browser created successfully!")
+ # Typically, do minimal setup here if needed
+ return browser
+
+ async def on_page_context_created(page: Page, context: BrowserContext, **kwargs):
+ # Called right after a new page + context are created (ideal for auth or route config).
+ print("[HOOK] on_page_context_created - Setting up page & context.")
+
+ # Example 1: Route filtering (e.g., block images)
+ async def route_filter(route):
+ if route.request.resource_type == "image":
+ print(f"[HOOK] Blocking image request: {route.request.url}")
+ await route.abort()
+ else:
+ await route.continue_()
+
+ await context.route("**", route_filter)
+
+ # Example 2: (Optional) Simulate a login scenario
+ # (We do NOT create or close pages here, just do quick steps if needed)
+ # e.g., await page.goto("https://example.com/login")
+ # e.g., await page.fill("input[name='username']", "testuser")
+ # e.g., await page.fill("input[name='password']", "password123")
+ # e.g., await page.click("button[type='submit']")
+ # e.g., await page.wait_for_selector("#welcome")
+ # e.g., await context.add_cookies([...])
+ # Then continue
+
+ # Example 3: Adjust the viewport
+ await page.set_viewport_size({"width": 1080, "height": 600})
+ return page
+
+ async def before_goto(
+ page: Page, context: BrowserContext, url: str, **kwargs
+ ):
+ # Called before navigating to each URL.
+ print(f"[HOOK] before_goto - About to navigate: {url}")
+ # e.g., inject custom headers
+ await page.set_extra_http_headers({
+ "Custom-Header": "my-value"
+ })
+ return page
+
+ async def after_goto(
+ page: Page, context: BrowserContext,
+ url: str, response, **kwargs
+ ):
+ # Called after navigation completes.
+ print(f"[HOOK] after_goto - Successfully loaded: {url}")
+ # e.g., wait for a certain element if we want to verify
+ try:
+ await page.wait_for_selector('.content', timeout=1000)
+ print("[HOOK] Found .content element!")
+ except:
+ print("[HOOK] .content not found, continuing anyway.")
+ return page
+
+ async def on_user_agent_updated(
+ page: Page, context: BrowserContext,
+ user_agent: str, **kwargs
+ ):
+ # Called whenever the user agent updates.
+ print(f"[HOOK] on_user_agent_updated - New user agent: {user_agent}")
+ return page
+
+ async def on_execution_started(page: Page, context: BrowserContext, **kwargs):
+ # Called after custom JavaScript execution begins.
+ print("[HOOK] on_execution_started - JS code is running!")
+ return page
+
+ async def before_retrieve_html(page: Page, context: BrowserContext, **kwargs):
+ # Called before final HTML retrieval.
+ print("[HOOK] before_retrieve_html - We can do final actions")
+ # Example: Scroll again
+ await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
+ return page
+
+ async def before_return_html(
+ page: Page, context: BrowserContext, html: str, **kwargs
+ ):
+ # Called just before returning the HTML in the result.
+ print(f"[HOOK] before_return_html - HTML length: {len(html)}")
+ return page
+
+ #
+ # Attach Hooks
+ #
+
+ crawler.crawler_strategy.set_hook("on_browser_created", on_browser_created)
+ crawler.crawler_strategy.set_hook(
+ "on_page_context_created", on_page_context_created
+ )
+ crawler.crawler_strategy.set_hook("before_goto", before_goto)
+ crawler.crawler_strategy.set_hook("after_goto", after_goto)
+ crawler.crawler_strategy.set_hook(
+ "on_user_agent_updated", on_user_agent_updated
+ )
+ crawler.crawler_strategy.set_hook(
+ "on_execution_started", on_execution_started
+ )
+ crawler.crawler_strategy.set_hook(
+ "before_retrieve_html", before_retrieve_html
+ )
+ crawler.crawler_strategy.set_hook(
+ "before_return_html", before_return_html
+ )
+
+ await crawler.start()
+
+ # 4) Run the crawler on an example page
+ url = "https://example.com"
+ result = await crawler.arun(url, config=crawler_run_config)
+
+ if result.success:
+ print("\nCrawled URL:", result.url)
+ print("HTML length:", len(result.html))
+ else:
+ print("Error:", result.error_message)
+
+ await crawler.close()
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+---
+
+## Hook Lifecycle Summary
+
+1. **`on_browser_created`**:
+ - Browser is up, but **no** pages or contexts yet.
+ - Light setup only—don’t try to open or close pages here (that belongs in `on_page_context_created`).
+
+2. **`on_page_context_created`**:
+ - Perfect for advanced **auth** or route blocking.
+ - You have a **page** + **context** ready but haven’t navigated to the target URL yet.
+
+3. **`before_goto`**:
+ - Right before navigation. Typically used for setting **custom headers** or logging the target URL.
+
+4. **`after_goto`**:
+ - After page navigation is done. Good place for verifying content or waiting on essential elements.
+
+5. **`on_user_agent_updated`**:
+ - Whenever the user agent changes (for stealth or different UA modes).
+
+6. **`on_execution_started`**:
+ - If you set `js_code` or run custom scripts, this runs once your JS is about to start.
+
+7. **`before_retrieve_html`**:
+ - Just before the final HTML snapshot is taken. Often you do a final scroll or lazy-load triggers here.
+
+8. **`before_return_html`**:
+ - The last hook before returning HTML to the `CrawlResult`. Good for logging HTML length or minor modifications.
+
+---
+
+## When to Handle Authentication
+
+**Recommended**: Use **`on_page_context_created`** if you need to:
+
+- Navigate to a login page or fill forms
+- Set cookies or localStorage tokens
+- Block resource routes to avoid ads
+
+This ensures the newly created context is under your control **before** `arun()` navigates to the main URL.
+
+---
+
+## Additional Considerations
+
+- **Session Management**: If you want multiple `arun()` calls to reuse a single session, pass `session_id=` in your `CrawlerRunConfig`. Hooks remain the same.
+- **Performance**: Hooks can slow down crawling if they do heavy tasks. Keep them concise.
+- **Error Handling**: If a hook fails, the overall crawl might fail. Catch exceptions or handle them gracefully.
+- **Concurrency**: If you run `arun_many()`, each URL triggers these hooks in parallel. Ensure your hooks are thread/async-safe.
+
+---
+
+## Conclusion
+
+Hooks provide **fine-grained** control over:
+
+- **Browser** creation (light tasks only)
+- **Page** and **context** creation (auth, route blocking)
+- **Navigation** phases
+- **Final HTML** retrieval
+
+Follow the recommended usage:
+- **Login** or advanced tasks in `on_page_context_created`
+- **Custom headers** or logs in `before_goto` / `after_goto`
+- **Scrolling** or final checks in `before_retrieve_html` / `before_return_html`
+
+
+```
+
+
+## File: docs/md_v2/advanced/identity-based-crawling.md
+
+```md
+# Preserve Your Identity with Crawl4AI
+
+Crawl4AI empowers you to navigate and interact with the web using your **authentic digital identity**, ensuring you’re recognized as a human and not mistaken for a bot. This tutorial covers:
+
+1. **Managed Browsers** – The recommended approach for persistent profiles and identity-based crawling.
+2. **Magic Mode** – A simplified fallback solution for quick automation without persistent identity.
+
+---
+
+## 1. Managed Browsers: Your Digital Identity Solution
+
+**Managed Browsers** let developers create and use **persistent browser profiles**. These profiles store local storage, cookies, and other session data, letting you browse as your **real self**—complete with logins, preferences, and cookies.
+
+### Key Benefits
+
+- **Authentic Browsing Experience**: Retain session data and browser fingerprints as though you’re a normal user.
+- **Effortless Configuration**: Once you log in or solve CAPTCHAs in your chosen data directory, you can re-run crawls without repeating those steps.
+- **Empowered Data Access**: If you can see the data in your own browser, you can automate its retrieval with your genuine identity.
+
+---
+
+Below, we set up a **user-data directory** using **Playwright's** bundled **Chromium** binary rather than a system-wide Chrome/Edge. We'll **locate** that binary and launch it with a `--user-data-dir` argument to create your profile. You can then point `BrowserConfig.user_data_dir` to that folder for subsequent crawls.
+
+---
+
+### Creating a User Data Directory (Command-Line Approach via Playwright)
+
+If you installed Crawl4AI (which installs Playwright under the hood), you already have a Playwright-managed Chromium on your system. Follow these steps to launch that **Chromium** from your command line, specifying a **custom** data directory:
+
+1. **Find** the Playwright Chromium binary:
+ - On most systems, installed browsers go under a `~/.cache/ms-playwright/` folder or similar path.
+ - To see an overview of installed browsers, run:
+ ```bash
+ python -m playwright install --dry-run
+ ```
+ or
+ ```bash
+ playwright install --dry-run
+ ```
+ (depending on your environment). This shows where Playwright keeps Chromium.
+
+ - For instance, you might see a path like:
+ ```
+ ~/.cache/ms-playwright/chromium-1234/chrome-linux/chrome
+ ```
+ on Linux, or a corresponding folder on macOS/Windows.
+
+2. **Launch** the Playwright Chromium binary with a **custom** user-data directory:
+ ```bash
+ # Linux example
+ ~/.cache/ms-playwright/chromium-1234/chrome-linux/chrome \
+ --user-data-dir=/home/<you>/my_chrome_profile
+ ```
+ ```bash
+ # macOS example (Playwright’s internal binary)
+ ~/Library/Caches/ms-playwright/chromium-1234/chrome-mac/Chromium.app/Contents/MacOS/Chromium \
+ --user-data-dir=/Users/<you>/my_chrome_profile
+ ```
+ ```powershell
+ # Windows example (PowerShell/cmd)
+ "C:\Users\\AppData\Local\ms-playwright\chromium-1234\chrome-win\chrome.exe" ^
+ --user-data-dir="C:\Users\\my_chrome_profile"
+ ```
+
+ **Replace** the path with the actual subfolder indicated in your `ms-playwright` cache structure.
+ - This **opens** a fresh Chromium with your new or existing data folder.
+ - **Log into** any sites or configure your browser the way you want.
+ - **Close** when done—your profile data is saved in that folder.
+
+3. **Use** that folder in **`BrowserConfig.user_data_dir`**:
+ ```python
+ from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
+
+ browser_config = BrowserConfig(
+ headless=True,
+ use_managed_browser=True,
+ user_data_dir="/home//my_chrome_profile",
+ browser_type="chromium"
+ )
+ ```
+ - Next time you run your code, it reuses that folder—**preserving** your session data, cookies, local storage, etc.
+
+---
+
+## 3. Using Managed Browsers in Crawl4AI
+
+Once you have a data directory with your session data, pass it to **`BrowserConfig`**:
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
+
+async def main():
+ # 1) Reference your persistent data directory
+ browser_config = BrowserConfig(
+ headless=True, # 'True' for automated runs
+ verbose=True,
+ use_managed_browser=True, # Enables persistent browser strategy
+ browser_type="chromium",
+ user_data_dir="/path/to/my-chrome-profile"
+ )
+
+ # 2) Standard crawl config
+ crawl_config = CrawlerRunConfig(
+ wait_for="css:.logged-in-content"
+ )
+
+ async with AsyncWebCrawler(config=browser_config) as crawler:
+ result = await crawler.arun(url="https://example.com/private", config=crawl_config)
+ if result.success:
+ print("Successfully accessed private data with your identity!")
+ else:
+ print("Error:", result.error_message)
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+### Workflow
+
+1. **Login** externally (via CLI or your normal Chrome with `--user-data-dir=...`).
+2. **Close** that browser.
+3. **Use** the same folder in `user_data_dir=` in Crawl4AI.
+4. **Crawl** – The site sees your identity as if you’re the same user who just logged in.
+
+---
+
+## 4. Magic Mode: Simplified Automation
+
+If you **don’t** need a persistent profile or identity-based approach, **Magic Mode** offers a quick way to simulate human-like browsing without storing long-term data.
+
+```python
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+
+async with AsyncWebCrawler() as crawler:
+ result = await crawler.arun(
+ url="https://example.com",
+ config=CrawlerRunConfig(
+ magic=True, # Simplifies a lot of interaction
+ remove_overlay_elements=True,
+ page_timeout=60000
+ )
+ )
+```
+
+**Magic Mode**:
+
+- Simulates a user-like experience
+- Randomizes user agent & navigator
+- Randomizes interactions & timings
+- Masks automation signals
+- Attempts pop-up handling
+
+**But** it’s no substitute for **true** user-based sessions if you want a fully legitimate identity-based solution.
+
+---
+
+## 5. Comparing Managed Browsers vs. Magic Mode
+
+| Feature | **Managed Browsers** | **Magic Mode** |
+|----------------------------|---------------------------------------------------------------|-----------------------------------------------------|
+| **Session Persistence** | Full localStorage/cookies retained in user_data_dir | No persistent data (fresh each run) |
+| **Genuine Identity** | Real user profile with full rights & preferences | Emulated user-like patterns, but no actual identity |
+| **Complex Sites** | Best for login-gated sites or heavy config | Simple tasks, minimal login or config needed |
+| **Setup** | External creation of user_data_dir, then use in Crawl4AI | Single-line approach (`magic=True`) |
+| **Reliability** | Extremely consistent (same data across runs) | Good for smaller tasks, can be less stable |
+
+---
+
+## 6. Using the BrowserProfiler Class
+
+Crawl4AI provides a dedicated `BrowserProfiler` class for managing browser profiles, making it easy to create, list, and delete profiles for identity-based browsing.
+
+### Creating and Managing Profiles with BrowserProfiler
+
+The `BrowserProfiler` class offers a comprehensive API for browser profile management:
+
+```python
+import asyncio
+from crawl4ai import BrowserProfiler
+
+async def manage_profiles():
+ # Create a profiler instance
+ profiler = BrowserProfiler()
+
+ # Create a profile interactively - opens a browser window
+ profile_path = await profiler.create_profile(
+ profile_name="my-login-profile" # Optional: name your profile
+ )
+
+ print(f"Profile saved at: {profile_path}")
+
+ # List all available profiles
+ profiles = profiler.list_profiles()
+
+ for profile in profiles:
+ print(f"Profile: {profile['name']}")
+ print(f" Path: {profile['path']}")
+ print(f" Created: {profile['created']}")
+ print(f" Browser type: {profile['type']}")
+
+ # Get a specific profile path by name
+ specific_profile = profiler.get_profile_path("my-login-profile")
+
+ # Delete a profile when no longer needed
+ success = profiler.delete_profile("old-profile-name")
+
+asyncio.run(manage_profiles())
+```
+
+**How profile creation works:**
+1. A browser window opens for you to interact with
+2. You log in to websites, set preferences, etc.
+3. When you're done, press 'q' in the terminal to close the browser
+4. The profile is saved in the Crawl4AI profiles directory
+5. You can use the returned path with `BrowserConfig.user_data_dir`
+
+### Interactive Profile Management
+
+The `BrowserProfiler` also offers an interactive management console that guides you through profile creation, listing, and deletion:
+
+```python
+import asyncio
+from crawl4ai import BrowserProfiler, AsyncWebCrawler, BrowserConfig
+
+# Define a function to use a profile for crawling
+async def crawl_with_profile(profile_path, url):
+ browser_config = BrowserConfig(
+ headless=True,
+ use_managed_browser=True,
+ user_data_dir=profile_path
+ )
+
+ async with AsyncWebCrawler(config=browser_config) as crawler:
+ result = await crawler.arun(url)
+ return result
+
+async def main():
+ # Create a profiler instance
+ profiler = BrowserProfiler()
+
+ # Launch the interactive profile manager
+ # Passing the crawl function as a callback adds a "crawl with profile" option
+ await profiler.interactive_manager(crawl_callback=crawl_with_profile)
+
+asyncio.run(main())
+```
+
+### Legacy Methods
+
+For backward compatibility, the previous methods on `ManagedBrowser` are still available, but they delegate to the new `BrowserProfiler` class:
+
+```python
+from crawl4ai.browser_manager import ManagedBrowser
+
+# These methods still work but use BrowserProfiler internally
+profiles = ManagedBrowser.list_profiles()
+```
+
+### Complete Example
+
+See the full example in `docs/examples/identity_based_browsing.py` for a complete demonstration of creating and using profiles for authenticated browsing using the new `BrowserProfiler` class.
+
+---
+
+## 7. Locale, Timezone, and Geolocation Control
+
+In addition to using persistent profiles, Crawl4AI supports customizing your browser's locale, timezone, and geolocation settings. These features enhance your identity-based browsing experience by allowing you to control how websites perceive your location and regional settings.
+
+### Setting Locale and Timezone
+
+You can set the browser's locale and timezone through `CrawlerRunConfig`:
+
+```python
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+
+async with AsyncWebCrawler() as crawler:
+ result = await crawler.arun(
+ url="https://example.com",
+ config=CrawlerRunConfig(
+ # Set browser locale (language and region formatting)
+ locale="fr-FR", # French (France)
+
+ # Set browser timezone
+ timezone_id="Europe/Paris",
+
+ # Other normal options...
+ magic=True,
+ page_timeout=60000
+ )
+ )
+```
+
+**How it works:**
+- `locale` affects language preferences, date formats, number formats, etc.
+- `timezone_id` affects JavaScript's Date object and time-related functionality
+- These settings are applied when creating the browser context and maintained throughout the session
+
+### Configuring Geolocation
+
+Control the GPS coordinates reported by the browser's geolocation API:
+
+```python
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, GeolocationConfig
+
+async with AsyncWebCrawler() as crawler:
+ result = await crawler.arun(
+ url="https://maps.google.com", # Or any location-aware site
+ config=CrawlerRunConfig(
+ # Configure precise GPS coordinates
+ geolocation=GeolocationConfig(
+ latitude=48.8566, # Paris coordinates
+ longitude=2.3522,
+ accuracy=100 # Accuracy in meters (optional)
+ ),
+
+ # This site will see you as being in Paris
+ page_timeout=60000
+ )
+ )
+```
+
+**Important notes:**
+- When `geolocation` is specified, the browser is automatically granted permission to access location
+- Websites using the Geolocation API will receive the exact coordinates you specify
+- This affects map services, store locators, delivery services, etc.
+- Combined with the appropriate `locale` and `timezone_id`, you can create a fully consistent location profile
+
+### Combining with Managed Browsers
+
+These settings work perfectly with managed browsers for a complete identity solution:
+
+```python
+from crawl4ai import (
+ AsyncWebCrawler, BrowserConfig, CrawlerRunConfig,
+ GeolocationConfig
+)
+
+browser_config = BrowserConfig(
+ use_managed_browser=True,
+ user_data_dir="/path/to/my-profile",
+ browser_type="chromium"
+)
+
+crawl_config = CrawlerRunConfig(
+ # Location settings
+ locale="es-MX", # Spanish (Mexico)
+ timezone_id="America/Mexico_City",
+ geolocation=GeolocationConfig(
+ latitude=19.4326, # Mexico City
+ longitude=-99.1332
+ )
+)
+
+async with AsyncWebCrawler(config=browser_config) as crawler:
+ result = await crawler.arun(url="https://example.com", config=crawl_config)
+```
+
+Combining persistent profiles with precise geolocation and region settings gives you complete control over your digital identity.
+
+## 8. Summary
+
+- **Create** your user-data directory either:
+ - By launching Chrome/Chromium externally with `--user-data-dir=/some/path`
+ - Or by using the built-in `BrowserProfiler.create_profile()` method
+ - Or through the interactive interface with `profiler.interactive_manager()`
+- **Log in** or configure sites as needed, then close the browser
+- **Reference** that folder in `BrowserConfig(user_data_dir="...")` + `use_managed_browser=True`
+- **Customize** identity aspects with `locale`, `timezone_id`, and `geolocation`
+- **List and reuse** profiles with `BrowserProfiler.list_profiles()`
+- **Manage** your profiles with the dedicated `BrowserProfiler` class
+- Enjoy **persistent** sessions that reflect your real identity
+- If you only need quick, ephemeral automation, **Magic Mode** might suffice
+
+**Recommended**: Always prefer a **Managed Browser** for robust, identity-based crawling and simpler interactions with complex sites. Use **Magic Mode** for quick tasks or prototypes where persistent data is unnecessary.
+
+With these approaches, you preserve your **authentic** browsing environment, ensuring the site sees you exactly as a normal user—no repeated logins or wasted time.
+```
+
+
+## File: docs/md_v2/advanced/lazy-loading.md
+
+```md
+## Handling Lazy-Loaded Images
+
+Many websites now load images **lazily** as you scroll. If you need to ensure they appear in your final crawl (and in `result.media`), consider:
+
+1. **`wait_for_images=True`** – Wait for images to fully load.
+2. **`scan_full_page`** – Force the crawler to scroll the entire page, triggering lazy loads.
+3. **`scroll_delay`** – Add small delays between scroll steps.
+
+**Note**: If the site requires multiple “Load More” triggers or complex interactions, see the [Page Interaction docs](../core/page-interaction.md).
+
+### Example: Ensuring Lazy Images Appear
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, BrowserConfig
+from crawl4ai.async_configs import CacheMode
+
+async def main():
+ config = CrawlerRunConfig(
+ # Force the crawler to wait until images are fully loaded
+ wait_for_images=True,
+
+ # Option 1: If you want to automatically scroll the page to load images
+ scan_full_page=True, # Tells the crawler to try scrolling the entire page
+ scroll_delay=0.5, # Delay (seconds) between scroll steps
+
+ # Option 2: If the site uses a 'Load More' or JS triggers for images,
+ # you can also specify js_code or wait_for logic here.
+
+ cache_mode=CacheMode.BYPASS,
+ verbose=True
+ )
+
+ async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
+ result = await crawler.arun("https://www.example.com/gallery", config=config)
+
+ if result.success:
+ images = result.media.get("images", [])
+ print("Images found:", len(images))
+ for i, img in enumerate(images[:5]):
+ print(f"[Image {i}] URL: {img['src']}, Score: {img.get('score','N/A')}")
+ else:
+ print("Error:", result.error_message)
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+**Explanation**:
+
+- **`wait_for_images=True`**
+ The crawler tries to ensure images have finished loading before finalizing the HTML.
+- **`scan_full_page=True`**
+ Tells the crawler to attempt scrolling from top to bottom. Each scroll step helps trigger lazy loading.
+- **`scroll_delay=0.5`**
+ Pause half a second between each scroll step. Helps the site load images before continuing.
+
+**When to Use**:
+
+- **Lazy-Loading**: If images appear only when the user scrolls into view, `scan_full_page` + `scroll_delay` helps the crawler see them.
+- **Heavier Pages**: If a page is extremely long, be mindful that scanning the entire page can be slow. Adjust `scroll_delay` or the max scroll steps as needed.
+
+---
+
+## Combining with Other Link & Media Filters
+
+You can still combine **lazy-load** logic with the usual **exclude_external_images**, **exclude_domains**, or link filtration:
+
+```python
+config = CrawlerRunConfig(
+ wait_for_images=True,
+ scan_full_page=True,
+ scroll_delay=0.5,
+
+ # Filter out external images if you only want local ones
+ exclude_external_images=True,
+
+ # Exclude certain domains for links
+ exclude_domains=["spammycdn.com"],
+)
+```
+
+This approach ensures you see **all** images from the main domain while ignoring external ones, and the crawler physically scrolls the entire page so that lazy-loading triggers.
+
+---
+
+## Tips & Troubleshooting
+
+1. **Long Pages**
+ - Setting `scan_full_page=True` on extremely long or infinite-scroll pages can be resource-intensive.
+ - Consider using [hooks](../core/page-interaction.md) or specialized logic to load specific sections or “Load More” triggers repeatedly.
+
+2. **Mixed Image Behavior**
+ - Some sites load images in batches as you scroll. If you’re missing images, increase your `scroll_delay` or call multiple partial scrolls in a loop with JS code or hooks.
+
+3. **Combining with Dynamic Wait**
+ - If the site has a placeholder that only changes to a real image after a certain event, you might do `wait_for="css:img.loaded"` or a custom JS `wait_for`.
+
+4. **Caching**
+ - If `cache_mode` is enabled, repeated crawls might skip some network fetches. If you suspect caching is missing new images, set `cache_mode=CacheMode.BYPASS` for fresh fetches.
+
+---
+
+With **lazy-loading** support, **wait_for_images**, and **scan_full_page** settings, you can capture the entire gallery or feed of images you expect—even if the site only loads them as the user scrolls. Combine these with the standard media filtering and domain exclusion for a complete link & media handling strategy.
+```
+
+
+## File: docs/md_v2/advanced/multi-url-crawling.md
+
+```md
+# Advanced Multi-URL Crawling with Dispatchers
+
+> **Heads Up**: Crawl4AI supports advanced dispatchers for **parallel** or **throttled** crawling, providing dynamic rate limiting and memory usage checks. The built-in `arun_many()` function uses these dispatchers to handle concurrency efficiently.
+
+## 1. Introduction
+
+When crawling many URLs:
+
+- **Basic**: Use `arun()` in a loop (simple but less efficient)
+- **Better**: Use `arun_many()`, which efficiently handles multiple URLs with proper concurrency control
+- **Best**: Customize dispatcher behavior for your specific needs (memory management, rate limits, etc.)
+
+**Why Dispatchers?**
+
+- **Adaptive**: Memory-based dispatchers can pause or slow down based on system resources
+- **Rate-limiting**: Built-in rate limiting with exponential backoff for 429/503 responses
+- **Real-time Monitoring**: Live dashboard of ongoing tasks, memory usage, and performance
+- **Flexibility**: Choose between memory-adaptive or semaphore-based concurrency
+
+---
+
+## 2. Core Components
+
+### 2.1 Rate Limiter
+
+```python
+class RateLimiter:
+ def __init__(
+ # Random delay range between requests
+ base_delay: Tuple[float, float] = (1.0, 3.0),
+
+ # Maximum backoff delay
+ max_delay: float = 60.0,
+
+ # Retries before giving up
+ max_retries: int = 3,
+
+ # Status codes triggering backoff
+ rate_limit_codes: List[int] = [429, 503]
+ )
+```
+
+#### RateLimiter Constructor Parameters
+
+The **RateLimiter** is a utility that helps manage the pace of requests to avoid overloading servers or getting blocked due to rate limits. It operates internally to delay requests and handle retries but can be configured using its constructor parameters.
+
+**Parameters of the `RateLimiter` constructor:**
+
+1. **`base_delay`** (`Tuple[float, float]`, default: `(1.0, 3.0)`)
+ The range for a random delay (in seconds) between consecutive requests to the same domain.
+
+- A random delay is chosen between `base_delay[0]` and `base_delay[1]` for each request.
+- This prevents sending requests at a predictable frequency, reducing the chances of triggering rate limits.
+
+**Example:**
+If `base_delay = (2.0, 5.0)`, delays could be randomly chosen as `2.3s`, `4.1s`, etc.
+
+---
+
+2. **`max_delay`** (`float`, default: `60.0`)
+ The maximum allowable delay when rate-limiting errors occur.
+
+- When servers return rate-limit responses (e.g., 429 or 503), the delay increases exponentially with jitter.
+- The `max_delay` ensures the delay doesn’t grow unreasonably high, capping it at this value.
+
+**Example:**
+For a `max_delay = 30.0`, even if backoff calculations suggest a delay of `45s`, it will cap at `30s`.
+
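+Purely to illustrate the capping behavior described above (this is **not** the library's internal code), a jittered exponential backoff capped at `max_delay` looks roughly like this:
+
+```python
+import random
+
+def illustrative_backoff(attempt: int, max_delay: float = 30.0) -> float:
+    # Exponential growth with a little jitter, capped at max_delay.
+    raw = (2 ** attempt) + random.uniform(0, 1)
+    return min(raw, max_delay)
+
+for attempt in range(6):
+    print(f"attempt {attempt}: ~{illustrative_backoff(attempt):.1f}s")
+```
+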
+---
+
+3. **`max_retries`** (`int`, default: `3`)
+ The maximum number of retries for a request if rate-limiting errors occur.
+
+- After encountering a rate-limit response, the `RateLimiter` retries the request up to this number of times.
+- If all retries fail, the request is marked as failed, and the process continues.
+
+**Example:**
+If `max_retries = 3`, the system retries a failed request three times before giving up.
+
+---
+
+4. **`rate_limit_codes`** (`List[int]`, default: `[429, 503]`)
+ A list of HTTP status codes that trigger the rate-limiting logic.
+
+- These status codes indicate the server is overwhelmed or actively limiting requests.
+- You can customize this list to include other codes based on specific server behavior.
+
+**Example:**
+If `rate_limit_codes = [429, 503, 504]`, the crawler will back off on these three error codes.
+
+---
+
+**How to Use the `RateLimiter`:**
+
+Here’s an example of initializing and using a `RateLimiter` in your project:
+
+```python
+from crawl4ai import RateLimiter
+
+# Create a RateLimiter with custom settings
+rate_limiter = RateLimiter(
+ base_delay=(2.0, 4.0), # Random delay between 2-4 seconds
+ max_delay=30.0, # Cap delay at 30 seconds
+ max_retries=5, # Retry up to 5 times on rate-limiting errors
+ rate_limit_codes=[429, 503] # Handle these HTTP status codes
+)
+
+# RateLimiter will handle delays and retries internally
+# No additional setup is required for its operation
+```
+
+The `RateLimiter` integrates seamlessly with dispatchers like `MemoryAdaptiveDispatcher` and `SemaphoreDispatcher`, ensuring requests are paced correctly without user intervention. Its internal mechanisms manage delays and retries to avoid overwhelming servers while maximizing efficiency.
+
+
+### 2.2 Crawler Monitor
+
+The CrawlerMonitor provides real-time visibility into crawling operations:
+
+```python
+from crawl4ai import CrawlerMonitor, DisplayMode
+monitor = CrawlerMonitor(
+ # Maximum rows in live display
+ max_visible_rows=15,
+
+ # DETAILED or AGGREGATED view
+ display_mode=DisplayMode.DETAILED
+)
+```
+
+**Display Modes**:
+
+1. **DETAILED**: Shows individual task status, memory usage, and timing
+2. **AGGREGATED**: Displays summary statistics and overall progress
+
+---
+
+## 3. Available Dispatchers
+
+### 3.1 MemoryAdaptiveDispatcher (Default)
+
+Automatically manages concurrency based on system memory usage:
+
+```python
+from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher
+
+dispatcher = MemoryAdaptiveDispatcher(
+ memory_threshold_percent=90.0, # Pause if memory exceeds this
+ check_interval=1.0, # How often to check memory
+ max_session_permit=10, # Maximum concurrent tasks
+ rate_limiter=RateLimiter( # Optional rate limiting
+ base_delay=(1.0, 2.0),
+ max_delay=30.0,
+ max_retries=2
+ ),
+ monitor=CrawlerMonitor( # Optional monitoring
+ max_visible_rows=15,
+ display_mode=DisplayMode.DETAILED
+ )
+)
+```
+
+**Constructor Parameters:**
+
+1. **`memory_threshold_percent`** (`float`, default: `90.0`)
+ Specifies the memory usage threshold (as a percentage). If system memory usage exceeds this value, the dispatcher pauses crawling to prevent system overload.
+
+2. **`check_interval`** (`float`, default: `1.0`)
+ The interval (in seconds) at which the dispatcher checks system memory usage.
+
+3. **`max_session_permit`** (`int`, default: `10`)
+ The maximum number of concurrent crawling tasks allowed. This ensures resource limits are respected while maintaining concurrency.
+
+4. **`memory_wait_timeout`** (`float`, default: `600.0`)
+ Optional timeout (in seconds). If memory usage exceeds `memory_threshold_percent` for longer than this duration, a `MemoryError` is raised.
+
+5. **`rate_limiter`** (`RateLimiter`, default: `None`)
+ Optional rate-limiting logic to avoid server-side blocking (e.g., for handling 429 or 503 errors). See **RateLimiter** for details.
+
+6. **`monitor`** (`CrawlerMonitor`, default: `None`)
+ Optional monitoring for real-time task tracking and performance insights. See **CrawlerMonitor** for details.
+
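+Putting a few of these parameters together, here is a minimal sketch that sets `memory_wait_timeout` and handles the `MemoryError` it can raise. It assumes the error propagates out of `arun_many`, and that the usual `AsyncWebCrawler`, `BrowserConfig`, `CrawlerRunConfig`, and `CacheMode` imports from `crawl4ai` are in place:
+
+```python
+async def crawl_with_memory_guard(urls):
+    dispatcher = MemoryAdaptiveDispatcher(
+        memory_threshold_percent=85.0,
+        memory_wait_timeout=300.0,   # give up if memory stays high for 5 minutes
+        max_session_permit=10,
+    )
+    run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
+
+    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
+        try:
+            return await crawler.arun_many(urls, config=run_config, dispatcher=dispatcher)
+        except MemoryError:
+            print("Memory stayed above the threshold for too long; aborting this batch.")
+            return []
+```
+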
+---
+
+### 3.2 SemaphoreDispatcher
+
+Provides simple concurrency control with a fixed limit:
+
+```python
+from crawl4ai.async_dispatcher import SemaphoreDispatcher
+
+dispatcher = SemaphoreDispatcher(
+ max_session_permit=20, # Maximum concurrent tasks
+ rate_limiter=RateLimiter( # Optional rate limiting
+ base_delay=(0.5, 1.0),
+ max_delay=10.0
+ ),
+ monitor=CrawlerMonitor( # Optional monitoring
+ max_visible_rows=15,
+ display_mode=DisplayMode.DETAILED
+ )
+)
+```
+
+**Constructor Parameters:**
+
+1. **`max_session_permit`** (`int`, default: `20`)
+   The maximum number of concurrent crawling tasks allowed at any one time.
+
+2. **`rate_limiter`** (`RateLimiter`, default: `None`)
+ Optional rate-limiting logic to avoid overwhelming servers. See **RateLimiter** for details.
+
+3. **`monitor`** (`CrawlerMonitor`, default: `None`)
+ Optional monitoring for tracking task progress and resource usage. See **CrawlerMonitor** for details.
+
+---
+
+## 4. Usage Examples
+
+### 4.1 Batch Processing (Default)
+
+```python
+async def crawl_batch():
+ browser_config = BrowserConfig(headless=True, verbose=False)
+ run_config = CrawlerRunConfig(
+ cache_mode=CacheMode.BYPASS,
+ stream=False # Default: get all results at once
+ )
+
+ dispatcher = MemoryAdaptiveDispatcher(
+ memory_threshold_percent=70.0,
+ check_interval=1.0,
+ max_session_permit=10,
+ monitor=CrawlerMonitor(
+ display_mode=DisplayMode.DETAILED
+ )
+ )
+
+ async with AsyncWebCrawler(config=browser_config) as crawler:
+ # Get all results at once
+ results = await crawler.arun_many(
+ urls=urls,
+ config=run_config,
+ dispatcher=dispatcher
+ )
+
+ # Process all results after completion
+ for result in results:
+ if result.success:
+ await process_result(result)
+ else:
+ print(f"Failed to crawl {result.url}: {result.error_message}")
+```
+
+**Review:**
+- **Purpose:** Executes a batch crawl with all URLs processed together after crawling is complete.
+- **Dispatcher:** Uses `MemoryAdaptiveDispatcher` to manage concurrency and system memory.
+- **Stream:** Disabled (`stream=False`), so all results are collected at once for post-processing.
+- **Best Use Case:** When you need to analyze results in bulk rather than individually during the crawl.
+
+---
+
+### 4.2 Streaming Mode
+
+```python
+async def crawl_streaming():
+ browser_config = BrowserConfig(headless=True, verbose=False)
+ run_config = CrawlerRunConfig(
+ cache_mode=CacheMode.BYPASS,
+ stream=True # Enable streaming mode
+ )
+
+ dispatcher = MemoryAdaptiveDispatcher(
+ memory_threshold_percent=70.0,
+ check_interval=1.0,
+ max_session_permit=10,
+ monitor=CrawlerMonitor(
+ display_mode=DisplayMode.DETAILED
+ )
+ )
+
+ async with AsyncWebCrawler(config=browser_config) as crawler:
+ # Process results as they become available
+ async for result in await crawler.arun_many(
+ urls=urls,
+ config=run_config,
+ dispatcher=dispatcher
+ ):
+ if result.success:
+ # Process each result immediately
+ await process_result(result)
+ else:
+ print(f"Failed to crawl {result.url}: {result.error_message}")
+```
+
+**Review:**
+- **Purpose:** Enables streaming to process results as soon as they’re available.
+- **Dispatcher:** Uses `MemoryAdaptiveDispatcher` for concurrency and memory management.
+- **Stream:** Enabled (`stream=True`), allowing real-time processing during crawling.
+- **Best Use Case:** When you need to act on results immediately, such as for real-time analytics or progressive data storage.
+
+---
+
+### 4.3 Semaphore-based Crawling
+
+```python
+async def crawl_with_semaphore(urls):
+ browser_config = BrowserConfig(headless=True, verbose=False)
+ run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
+
+ dispatcher = SemaphoreDispatcher(
+ semaphore_count=5,
+ rate_limiter=RateLimiter(
+ base_delay=(0.5, 1.0),
+ max_delay=10.0
+ ),
+ monitor=CrawlerMonitor(
+ max_visible_rows=15,
+ display_mode=DisplayMode.DETAILED
+ )
+ )
+
+ async with AsyncWebCrawler(config=browser_config) as crawler:
+ results = await crawler.arun_many(
+ urls,
+ config=run_config,
+ dispatcher=dispatcher
+ )
+ return results
+```
+
+**Review:**
+- **Purpose:** Uses `SemaphoreDispatcher` to limit concurrency with a fixed number of slots.
+- **Dispatcher:** Configured with a semaphore to control parallel crawling tasks.
+- **Rate Limiter:** Prevents servers from being overwhelmed by pacing requests.
+- **Best Use Case:** When you want precise control over the number of concurrent requests, independent of system memory.
+
+---
+
+### 4.4 Robots.txt Consideration
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
+
+async def main():
+ urls = [
+ "https://example1.com",
+ "https://example2.com",
+ "https://example3.com"
+ ]
+
+ config = CrawlerRunConfig(
+ cache_mode=CacheMode.ENABLED,
+ check_robots_txt=True, # Will respect robots.txt for each URL
+        semaphore_count=3,       # Max concurrent requests
+        stream=True              # Yield results as they complete
+ )
+
+ async with AsyncWebCrawler() as crawler:
+        async for result in await crawler.arun_many(urls, config=config):
+ if result.success:
+ print(f"Successfully crawled {result.url}")
+ elif result.status_code == 403 and "robots.txt" in result.error_message:
+ print(f"Skipped {result.url} - blocked by robots.txt")
+ else:
+ print(f"Failed to crawl {result.url}: {result.error_message}")
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+**Review:**
+- **Purpose:** Ensures compliance with `robots.txt` rules for ethical and legal web crawling.
+- **Configuration:** Set `check_robots_txt=True` to validate each URL against `robots.txt` before crawling.
+- **Dispatcher:** Handles requests with concurrency limits (`semaphore_count=3`).
+- **Best Use Case:** When crawling websites that strictly enforce robots.txt policies or for responsible crawling practices.
+
+---
+
+## 5. Dispatch Results
+
+Each crawl result includes dispatch information:
+
+```python
+@dataclass
+class DispatchResult:
+ task_id: str
+ memory_usage: float
+ peak_memory: float
+ start_time: datetime
+ end_time: datetime
+ error_message: str = ""
+```
+
+Access via `result.dispatch_result`:
+
+```python
+for result in results:
+ if result.success:
+ dr = result.dispatch_result
+ print(f"URL: {result.url}")
+ print(f"Memory: {dr.memory_usage:.1f}MB")
+ print(f"Duration: {dr.end_time - dr.start_time}")
+```
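+
+To get a quick aggregate view across a finished batch (for example, the `results` list returned by `arun_many` in example 4.1 with `stream=False`), a short sketch might look like this:
+
+```python
+durations = [
+    (r.dispatch_result.end_time - r.dispatch_result.start_time).total_seconds()
+    for r in results
+    if r.success and r.dispatch_result
+]
+peak = max(
+    (r.dispatch_result.peak_memory for r in results if r.success and r.dispatch_result),
+    default=0.0,
+)
+
+print(f"Crawled {len(durations)} pages, "
+      f"avg {sum(durations) / max(len(durations), 1):.1f}s each, "
+      f"peak memory {peak:.1f}MB")
+```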
+
+## 6. Summary
+
+1. **Two Dispatcher Types**:
+
+ - MemoryAdaptiveDispatcher (default): Dynamic concurrency based on memory
+ - SemaphoreDispatcher: Fixed concurrency limit
+
+2. **Optional Components**:
+
+ - RateLimiter: Smart request pacing and backoff
+ - CrawlerMonitor: Real-time progress visualization
+
+3. **Key Benefits**:
+
+ - Automatic memory management
+ - Built-in rate limiting
+ - Live progress monitoring
+ - Flexible concurrency control
+
+Choose the dispatcher that best fits your needs:
+
+- **MemoryAdaptiveDispatcher**: For large crawls or limited resources
+- **SemaphoreDispatcher**: For simple, fixed-concurrency scenarios
+
+```
+
+
+## File: docs/md_v2/advanced/network-console-capture.md
+
+```md
+# Network Requests & Console Message Capturing
+
+Crawl4AI can capture all network requests and browser console messages during a crawl, which is invaluable for debugging, security analysis, or understanding page behavior.
+
+## Configuration
+
+To enable network and console capturing, use these configuration options:
+
+```python
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+
+# Enable both network request capture and console message capture
+config = CrawlerRunConfig(
+ capture_network_requests=True, # Capture all network requests and responses
+ capture_console_messages=True # Capture all browser console output
+)
+```
+
+## Example Usage
+
+```python
+import asyncio
+import json
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+
+async def main():
+ # Enable both network request capture and console message capture
+ config = CrawlerRunConfig(
+ capture_network_requests=True,
+ capture_console_messages=True
+ )
+
+ async with AsyncWebCrawler() as crawler:
+ result = await crawler.arun(
+ url="https://example.com",
+ config=config
+ )
+
+ if result.success:
+ # Analyze network requests
+ if result.network_requests:
+ print(f"Captured {len(result.network_requests)} network events")
+
+ # Count request types
+ request_count = len([r for r in result.network_requests if r.get("event_type") == "request"])
+ response_count = len([r for r in result.network_requests if r.get("event_type") == "response"])
+ failed_count = len([r for r in result.network_requests if r.get("event_type") == "request_failed"])
+
+ print(f"Requests: {request_count}, Responses: {response_count}, Failed: {failed_count}")
+
+ # Find API calls
+ api_calls = [r for r in result.network_requests
+ if r.get("event_type") == "request" and "api" in r.get("url", "")]
+ if api_calls:
+ print(f"Detected {len(api_calls)} API calls:")
+ for call in api_calls[:3]: # Show first 3
+ print(f" - {call.get('method')} {call.get('url')}")
+
+ # Analyze console messages
+ if result.console_messages:
+ print(f"Captured {len(result.console_messages)} console messages")
+
+ # Group by type
+ message_types = {}
+ for msg in result.console_messages:
+ msg_type = msg.get("type", "unknown")
+ message_types[msg_type] = message_types.get(msg_type, 0) + 1
+
+ print("Message types:", message_types)
+
+ # Show errors (often the most important)
+ errors = [msg for msg in result.console_messages if msg.get("type") == "error"]
+ if errors:
+ print(f"Found {len(errors)} console errors:")
+ for err in errors[:2]: # Show first 2
+ print(f" - {err.get('text', '')[:100]}")
+
+ # Export all captured data to a file for detailed analysis
+ with open("network_capture.json", "w") as f:
+ json.dump({
+ "url": result.url,
+ "network_requests": result.network_requests or [],
+ "console_messages": result.console_messages or []
+ }, f, indent=2)
+
+ print("Exported detailed capture data to network_capture.json")
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+## Captured Data Structure
+
+### Network Requests
+
+The `result.network_requests` contains a list of dictionaries, each representing a network event with these common fields:
+
+| Field | Description |
+|-------|-------------|
+| `event_type` | Type of event: `"request"`, `"response"`, or `"request_failed"` |
+| `url` | The URL of the request |
+| `timestamp` | Unix timestamp when the event was captured |
+
+#### Request Event Fields
+
+```json
+{
+ "event_type": "request",
+ "url": "https://example.com/api/data.json",
+ "method": "GET",
+ "headers": {"User-Agent": "...", "Accept": "..."},
+ "post_data": "key=value&otherkey=value",
+ "resource_type": "fetch",
+ "is_navigation_request": false,
+ "timestamp": 1633456789.123
+}
+```
+
+#### Response Event Fields
+
+```json
+{
+ "event_type": "response",
+ "url": "https://example.com/api/data.json",
+ "status": 200,
+ "status_text": "OK",
+ "headers": {"Content-Type": "application/json", "Cache-Control": "..."},
+ "from_service_worker": false,
+ "request_timing": {"requestTime": 1234.56, "receiveHeadersEnd": 1234.78},
+ "timestamp": 1633456789.456
+}
+```
+
+#### Failed Request Event Fields
+
+```json
+{
+ "event_type": "request_failed",
+ "url": "https://example.com/missing.png",
+ "method": "GET",
+ "resource_type": "image",
+ "failure_text": "net::ERR_ABORTED 404",
+ "timestamp": 1633456789.789
+}
+```
+
+### Console Messages
+
+The `result.console_messages` contains a list of dictionaries, each representing a console message with these common fields:
+
+| Field | Description |
+|-------|-------------|
+| `type` | Message type: `"log"`, `"error"`, `"warning"`, `"info"`, etc. |
+| `text` | The message text |
+| `timestamp` | Unix timestamp when the message was captured |
+
+#### Console Message Example
+
+```json
+{
+ "type": "error",
+ "text": "Uncaught TypeError: Cannot read property 'length' of undefined",
+ "location": "https://example.com/script.js:123:45",
+ "timestamp": 1633456790.123
+}
+```
+
+## Key Benefits
+
+- **Full Request Visibility**: Capture all network activity including:
+ - Requests (URLs, methods, headers, post data)
+ - Responses (status codes, headers, timing)
+ - Failed requests (with error messages)
+
+- **Console Message Access**: View all JavaScript console output:
+ - Log messages
+ - Warnings
+ - Errors with stack traces
+ - Developer debugging information
+
+- **Debugging Power**: Identify issues such as:
+  - Failed API calls or resource loading (see the sketch after this list)
+ - JavaScript errors affecting page functionality
+ - CORS or other security issues
+ - Hidden API endpoints and data flows
+
+- **Security Analysis**: Detect:
+ - Unexpected third-party requests
+ - Data leakage in request payloads
+ - Suspicious script behavior
+
+- **Performance Insights**: Analyze:
+ - Request timing data
+ - Resource loading patterns
+ - Potential bottlenecks
+
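+As a concrete example of the debugging benefits above, the captured events make it easy to spot requests that never received a response (building on the `result` object from the example earlier on this page):
+
+```python
+# Requests with no matching response event: often blocked, aborted, or failed.
+requested = {e.get("url") for e in (result.network_requests or [])
+             if e.get("event_type") == "request"}
+responded = {e.get("url") for e in (result.network_requests or [])
+             if e.get("event_type") == "response"}
+
+for url in sorted(u for u in requested - responded if u):
+    print("No response captured for:", url)
+```
+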
+## Use Cases
+
+1. **API Discovery**: Identify hidden endpoints and data flows in single-page applications
+2. **Debugging**: Track down JavaScript errors affecting page functionality
+3. **Security Auditing**: Detect unwanted third-party requests or data leakage (see the sketch below)
+4. **Performance Analysis**: Identify slow-loading resources
+5. **Ad/Tracker Analysis**: Detect and catalog advertising or tracking calls
+
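+For example, use case 3 (security auditing) can start with nothing more than grouping the captured requests by host to surface unexpected third parties, again reusing the `result` from the example above:
+
+```python
+from collections import Counter
+from urllib.parse import urlparse
+
+# Count captured requests per host.
+hosts = Counter(
+    urlparse(event.get("url", "")).netloc
+    for event in (result.network_requests or [])
+    if event.get("event_type") == "request"
+)
+
+for host, count in hosts.most_common():
+    print(f"{host}: {count} requests")
+```
+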
+This capability is especially valuable for complex sites with heavy JavaScript, single-page applications, or when you need to understand the exact communication happening between a browser and servers.
+```
+
+
+## File: docs/md_v2/advanced/proxy-security.md
+
+```md
+# Proxy
+
+## Basic Proxy Setup
+
+Simple proxy configuration with `BrowserConfig`:
+
+```python
+from crawl4ai.async_configs import BrowserConfig
+
+# Using proxy URL
+browser_config = BrowserConfig(proxy="http://proxy.example.com:8080")
+async with AsyncWebCrawler(config=browser_config) as crawler:
+ result = await crawler.arun(url="https://example.com")
+
+# Using SOCKS proxy
+browser_config = BrowserConfig(proxy="socks5://proxy.example.com:1080")
+async with AsyncWebCrawler(config=browser_config) as crawler:
+ result = await crawler.arun(url="https://example.com")
+```
+
+## Authenticated Proxy
+
+Use an authenticated proxy with `BrowserConfig`:
+
+```python
+from crawl4ai.async_configs import BrowserConfig
+
+proxy_config = {
+ "server": "http://proxy.example.com:8080",
+ "username": "user",
+ "password": "pass"
+}
+
+browser_config = BrowserConfig(proxy_config=proxy_config)
+async with AsyncWebCrawler(config=browser_config) as crawler:
+ result = await crawler.arun(url="https://example.com")
+```
+
+## Rotating Proxies
+
+Example using a proxy rotation service dynamically:
+
+```python
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
+
+async def get_next_proxy():
+ # Your proxy rotation logic here
+ return {"server": "http://next.proxy.com:8080"}
+
+async def main():
+ browser_config = BrowserConfig()
+ run_config = CrawlerRunConfig()
+
+ async with AsyncWebCrawler(config=browser_config) as crawler:
+ # For each URL, create a new run config with different proxy
+ for url in urls:
+ proxy = await get_next_proxy()
+ # Clone the config and update proxy - this creates a new browser context
+ current_config = run_config.clone(proxy_config=proxy)
+ result = await crawler.arun(url=url, config=current_config)
+
+if __name__ == "__main__":
+ import asyncio
+ asyncio.run(main())
+```
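+
+If you don't have an external rotation service, a minimal in-process round-robin version of `get_next_proxy` could look like this (the proxy URLs and credentials are placeholders):
+
+```python
+from itertools import cycle
+
+# Placeholder proxy endpoints: substitute your own servers and credentials.
+PROXIES = cycle([
+    {"server": "http://proxy1.example.com:8080"},
+    {"server": "http://proxy2.example.com:8080", "username": "user", "password": "pass"},
+])
+
+async def get_next_proxy():
+    # Simple round-robin over the static list above.
+    return next(PROXIES)
+```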
+
+
+```
+
+
+## File: docs/md_v2/advanced/session-management.md
+
+```md
+# Session Management
+
+Session management in Crawl4AI is a powerful feature that allows you to maintain state across multiple requests, making it particularly suitable for handling complex multi-step crawling tasks. It enables you to reuse the same browser tab (or page object) across sequential actions and crawls, which is beneficial for:
+
+- **Performing JavaScript actions before and after crawling.**
+- **Executing multiple sequential crawls faster** without needing to reopen tabs or allocate memory repeatedly.
+
+**Note:** This feature is designed for sequential workflows and is not suitable for parallel operations.
+
+---
+
+#### Basic Session Usage
+
+Use `BrowserConfig` and `CrawlerRunConfig` to maintain state with a `session_id`:
+
+```python
+from crawl4ai import AsyncWebCrawler
+from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
+
+async with AsyncWebCrawler() as crawler:
+ session_id = "my_session"
+
+ # Define configurations
+ config1 = CrawlerRunConfig(
+ url="https://example.com/page1", session_id=session_id
+ )
+ config2 = CrawlerRunConfig(
+ url="https://example.com/page2", session_id=session_id
+ )
+
+ # First request
+ result1 = await crawler.arun(config=config1)
+
+ # Subsequent request using the same session
+ result2 = await crawler.arun(config=config2)
+
+ # Clean up when done
+ await crawler.crawler_strategy.kill_session(session_id)
+```
+
+---
+
+#### Dynamic Content with Sessions
+
+Here's an example of crawling GitHub commits across multiple pages while preserving session state:
+
+```python
+import json
+
+from crawl4ai import AsyncWebCrawler
+from crawl4ai.async_configs import CrawlerRunConfig
+from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+from crawl4ai.cache_context import CacheMode
+
+async def crawl_dynamic_content():
+ async with AsyncWebCrawler() as crawler:
+ session_id = "github_commits_session"
+ url = "https://github.com/microsoft/TypeScript/commits/main"
+ all_commits = []
+
+ # Define extraction schema
+ schema = {
+ "name": "Commit Extractor",
+ "baseSelector": "li.Box-sc-g0xbh4-0",
+ "fields": [{
+ "name": "title", "selector": "h4.markdown-title", "type": "text"
+ }],
+ }
+ extraction_strategy = JsonCssExtractionStrategy(schema)
+
+ # JavaScript and wait configurations
+ js_next_page = """document.querySelector('a[data-testid="pagination-next-button"]').click();"""
+ wait_for = """() => document.querySelectorAll('li.Box-sc-g0xbh4-0').length > 0"""
+
+ # Crawl multiple pages
+ for page in range(3):
+ config = CrawlerRunConfig(
+ url=url,
+ session_id=session_id,
+ extraction_strategy=extraction_strategy,
+ js_code=js_next_page if page > 0 else None,
+ wait_for=wait_for if page > 0 else None,
+ js_only=page > 0,
+ cache_mode=CacheMode.BYPASS
+ )
+
+ result = await crawler.arun(config=config)
+ if result.success:
+ commits = json.loads(result.extracted_content)
+ all_commits.extend(commits)
+ print(f"Page {page + 1}: Found {len(commits)} commits")
+
+ # Clean up session
+ await crawler.crawler_strategy.kill_session(session_id)
+ return all_commits
+```
+
+---
+
+## Example 1: Basic Session-Based Crawling
+
+A simple example using session-based crawling:
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler
+from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
+from crawl4ai.cache_context import CacheMode
+
+async def basic_session_crawl():
+ async with AsyncWebCrawler() as crawler:
+ session_id = "dynamic_content_session"
+ url = "https://example.com/dynamic-content"
+
+ for page in range(3):
+ config = CrawlerRunConfig(
+ url=url,
+ session_id=session_id,
+ js_code="document.querySelector('.load-more-button').click();" if page > 0 else None,
+ css_selector=".content-item",
+ cache_mode=CacheMode.BYPASS
+ )
+
+ result = await crawler.arun(config=config)
+ print(f"Page {page + 1}: Found {result.extracted_content.count('.content-item')} items")
+
+ await crawler.crawler_strategy.kill_session(session_id)
+
+asyncio.run(basic_session_crawl())
+```
+
+This example shows:
+1. Reusing the same `session_id` across multiple requests.
+2. Executing JavaScript to load more content dynamically.
+3. Properly closing the session to free resources.
+
+---
+
+## Advanced Technique 1: Custom Execution Hooks
+
+> Warning: The next few examples can feel confusing at first 😅, so make sure you are comfortable with the preceding sections before working through them.
+
+Use custom hooks to handle complex scenarios, such as waiting for content to load dynamically:
+
+```python
+async def advanced_session_crawl_with_hooks():
+ first_commit = ""
+
+ async def on_execution_started(page):
+ nonlocal first_commit
+ try:
+ while True:
+ await page.wait_for_selector("li.commit-item h4")
+ commit = await page.query_selector("li.commit-item h4")
+                commit = (await commit.evaluate("(element) => element.textContent")).strip()
+ if commit and commit != first_commit:
+ first_commit = commit
+ break
+ await asyncio.sleep(0.5)
+ except Exception as e:
+ print(f"Warning: New content didn't appear: {e}")
+
+ async with AsyncWebCrawler() as crawler:
+ session_id = "commit_session"
+ url = "https://github.com/example/repo/commits/main"
+ crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
+
+ js_next_page = """document.querySelector('a.pagination-next').click();"""
+
+ for page in range(3):
+ config = CrawlerRunConfig(
+ url=url,
+ session_id=session_id,
+ js_code=js_next_page if page > 0 else None,
+ css_selector="li.commit-item",
+ js_only=page > 0,
+ cache_mode=CacheMode.BYPASS
+ )
+
+ result = await crawler.arun(config=config)
+ print(f"Page {page + 1}: Found {len(result.extracted_content)} commits")
+
+ await crawler.crawler_strategy.kill_session(session_id)
+
+asyncio.run(advanced_session_crawl_with_hooks())
+```
+
+This technique ensures new content loads before the next action.
+
+---
+
+## Advanced Technique 2: Integrated JavaScript Execution and Waiting
+
+Combine JavaScript execution and waiting logic for concise handling of dynamic content:
+
+```python
+async def integrated_js_and_wait_crawl():
+ async with AsyncWebCrawler() as crawler:
+ session_id = "integrated_session"
+ url = "https://github.com/example/repo/commits/main"
+
+ js_next_page_and_wait = """
+ (async () => {
+ const getCurrentCommit = () => document.querySelector('li.commit-item h4').textContent.trim();
+ const initialCommit = getCurrentCommit();
+ document.querySelector('a.pagination-next').click();
+ while (getCurrentCommit() === initialCommit) {
+ await new Promise(resolve => setTimeout(resolve, 100));
+ }
+ })();
+ """
+
+ for page in range(3):
+ config = CrawlerRunConfig(
+ url=url,
+ session_id=session_id,
+ js_code=js_next_page_and_wait if page > 0 else None,
+ css_selector="li.commit-item",
+ js_only=page > 0,
+ cache_mode=CacheMode.BYPASS
+ )
+
+ result = await crawler.arun(config=config)
+ print(f"Page {page + 1}: Found {len(result.extracted_content)} commits")
+
+ await crawler.crawler_strategy.kill_session(session_id)
+
+asyncio.run(integrated_js_and_wait_crawl())
+```
+
+---
+
+#### Common Use Cases for Sessions
+
+1. **Authentication Flows**: Login and interact with secured pages (see the sketch after this list).
+
+2. **Pagination Handling**: Navigate through multiple pages.
+
+3. **Form Submissions**: Fill forms, submit, and process results.
+
+4. **Multi-step Processes**: Complete workflows that span multiple actions.
+
+5. **Dynamic Content Navigation**: Handle JavaScript-rendered or event-triggered content.
+
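+Below is a rough sketch of use case 1, an authentication flow. The URLs and CSS selectors (`#username`, `#password`, `.dashboard`) are placeholders for whatever the target site actually uses, and the imports match the earlier examples (plus `AsyncWebCrawler`):
+
+```python
+async def login_and_crawl():
+    session_id = "auth_session"
+
+    # Step 1: open the login page, fill the form, and wait for a post-login element.
+    login_config = CrawlerRunConfig(
+        url="https://example.com/login",
+        session_id=session_id,
+        js_code="""
+            document.querySelector('#username').value = 'my_user';
+            document.querySelector('#password').value = 'my_pass';
+            document.querySelector('form').submit();
+        """,
+        wait_for="css:.dashboard",   # an element that only appears after login
+        cache_mode=CacheMode.BYPASS,
+    )
+
+    # Step 2: reuse the same logged-in tab for a protected page.
+    account_config = CrawlerRunConfig(
+        url="https://example.com/account",
+        session_id=session_id,
+        cache_mode=CacheMode.BYPASS,
+    )
+
+    async with AsyncWebCrawler() as crawler:
+        await crawler.arun(config=login_config)
+        result = await crawler.arun(config=account_config)
+        if result.success:
+            print("Fetched protected page:", result.url)
+        else:
+            print("Error:", result.error_message)
+        await crawler.crawler_strategy.kill_session(session_id)
+```
+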
+```
+
+
+## File: docs/md_v2/advanced/ssl-certificate.md
+
+```md
+# `SSLCertificate` Reference
+
+The **`SSLCertificate`** class encapsulates an SSL certificate’s data and allows exporting it in various formats (PEM, DER, JSON, or text). It’s used within **Crawl4AI** whenever you set **`fetch_ssl_certificate=True`** in your **`CrawlerRunConfig`**.
+
+## 1. Overview
+
+**Location**: `crawl4ai/ssl_certificate.py`
+
+```python
+class SSLCertificate:
+ """
+ Represents an SSL certificate with methods to export in various formats.
+
+ Main Methods:
+ - from_url(url, timeout=10)
+ - from_file(file_path)
+ - from_binary(binary_data)
+ - to_json(filepath=None)
+ - to_pem(filepath=None)
+ - to_der(filepath=None)
+ ...
+
+ Common Properties:
+ - issuer
+ - subject
+ - valid_from
+ - valid_until
+ - fingerprint
+ """
+```
+
+### Typical Use Case
+1. You **enable** certificate fetching in your crawl by:
+ ```python
+ CrawlerRunConfig(fetch_ssl_certificate=True, ...)
+ ```
+2. After `arun()`, if `result.ssl_certificate` is present, it’s an instance of **`SSLCertificate`**.
+3. You can **read** basic properties (issuer, subject, validity) or **export** them in multiple formats.
+
+---
+
+## 2. Construction & Fetching
+
+### 2.1 **`from_url(url, timeout=10)`**
+Manually load an SSL certificate from a given URL (port 443). Typically used internally, but you can call it directly if you want:
+
+```python
+cert = SSLCertificate.from_url("https://example.com")
+if cert:
+ print("Fingerprint:", cert.fingerprint)
+```
+
+### 2.2 **`from_file(file_path)`**
+Load from a file containing certificate data in ASN.1 or DER. Rarely needed unless you have local cert files:
+
+```python
+cert = SSLCertificate.from_file("/path/to/cert.der")
+```
+
+### 2.3 **`from_binary(binary_data)`**
+Initialize from raw binary. E.g., if you captured it from a socket or another source:
+
+```python
+cert = SSLCertificate.from_binary(raw_bytes)
+```
+
+---
+
+## 3. Common Properties
+
+After obtaining an **`SSLCertificate`** instance (e.g. `result.ssl_certificate` from a crawl), you can read:
+
+1. **`issuer`** *(dict)*
+ - E.g. `{"CN": "My Root CA", "O": "..."}`
+2. **`subject`** *(dict)*
+ - E.g. `{"CN": "example.com", "O": "ExampleOrg"}`
+3. **`valid_from`** *(str)*
+ - NotBefore date/time. Often in ASN.1/UTC format.
+4. **`valid_until`** *(str)*
+ - NotAfter date/time.
+5. **`fingerprint`** *(str)*
+ - The SHA-256 digest (lowercase hex).
+ - E.g. `"d14d2e..."`
+
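+For instance, a quick expiry check can be built from `valid_until` and `subject`. This sketch assumes the value uses the `YYYYMMDDHHMMSSZ` (GeneralizedTime) form; adjust the parsing if your certificates report a different format:
+
+```python
+from datetime import datetime, timezone
+
+cert = result.ssl_certificate  # from a crawl with fetch_ssl_certificate=True
+
+# Assumes the "YYYYMMDDHHMMSSZ" form; tweak the format string if needed.
+not_after = datetime.strptime(cert.valid_until, "%Y%m%d%H%M%SZ").replace(tzinfo=timezone.utc)
+days_left = (not_after - datetime.now(timezone.utc)).days
+print(f"{cert.subject.get('CN', '?')} expires in {days_left} days")
+```
+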
+---
+
+## 4. Export Methods
+
+Once you have an **`SSLCertificate`** object, you can **export** or **inspect** it:
+
+### 4.1 **`to_json(filepath=None)` → `Optional[str]`**
+- Returns a JSON string containing the parsed certificate fields.
+- If `filepath` is provided, saves it to disk instead, returning `None`.
+
+**Usage**:
+```python
+json_data = cert.to_json() # returns JSON string
+cert.to_json("certificate.json") # writes file, returns None
+```
+
+### 4.2 **`to_pem(filepath=None)` → `Optional[str]`**
+- Returns a PEM-encoded string (common for web servers).
+- If `filepath` is provided, saves it to disk instead.
+
+```python
+pem_str = cert.to_pem() # in-memory PEM string
+cert.to_pem("/path/to/cert.pem") # saved to file
+```
+
+### 4.3 **`to_der(filepath=None)` → `Optional[bytes]`**
+- Returns the original DER (binary ASN.1) bytes.
+- If `filepath` is specified, writes the bytes there instead.
+
+```python
+der_bytes = cert.to_der()
+cert.to_der("certificate.der")
+```
+
+### 4.4 (Optional) **`export_as_text()`**
+- If you see a method like `export_as_text()`, it typically returns an OpenSSL-style textual representation.
+- Not always needed, but can help for debugging or manual inspection.
+
+---
+
+## 5. Example Usage in Crawl4AI
+
+Below is a minimal sample showing how the crawler obtains an SSL cert from a site, then reads or exports it. The code snippet:
+
+```python
+import asyncio
+import os
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
+
+async def main():
+ tmp_dir = "tmp"
+ os.makedirs(tmp_dir, exist_ok=True)
+
+ config = CrawlerRunConfig(
+ fetch_ssl_certificate=True,
+ cache_mode=CacheMode.BYPASS
+ )
+
+ async with AsyncWebCrawler() as crawler:
+ result = await crawler.arun("https://example.com", config=config)
+ if result.success and result.ssl_certificate:
+ cert = result.ssl_certificate
+ # 1. Basic Info
+ print("Issuer CN:", cert.issuer.get("CN", ""))
+ print("Valid until:", cert.valid_until)
+ print("Fingerprint:", cert.fingerprint)
+
+ # 2. Export
+ cert.to_json(os.path.join(tmp_dir, "certificate.json"))
+ cert.to_pem(os.path.join(tmp_dir, "certificate.pem"))
+ cert.to_der(os.path.join(tmp_dir, "certificate.der"))
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+---
+
+## 6. Notes & Best Practices
+
+1. **Timeout**: `SSLCertificate.from_url` uses a default **10-second** timeout for the socket connection and SSL handshake.
+2. **Binary Form**: The certificate is loaded in ASN.1 (DER) form, then re-parsed by `OpenSSL.crypto`.
+3. **Validation**: This does **not** validate the certificate chain or trust store. It only fetches and parses.
+4. **Integration**: Within Crawl4AI, you typically just set `fetch_ssl_certificate=True` in `CrawlerRunConfig`; the final result’s `ssl_certificate` is automatically built.
+5. **Export**: If you need to store or analyze a cert, `to_json` and `to_pem` are the most widely supported export formats.
+
+---
+
+### Summary
+
+- **`SSLCertificate`** is a convenience class for capturing and exporting the **TLS certificate** from your crawled site(s).
+- Common usage is in the **`CrawlResult.ssl_certificate`** field, accessible after setting `fetch_ssl_certificate=True`.
+- Offers quick access to essential certificate details (`issuer`, `subject`, `fingerprint`) and is easy to export (PEM, DER, JSON) for further analysis or server usage.
+
+Use it whenever you need **insight** into a site’s certificate or require some form of cryptographic or compliance check.
+```
+