docs(examples): update v0.4.3 features demo to v0.4.3b2

Rename and replace the features demo file to reflect the beta 2 version number. The old v0.4.3 demo file is removed and replaced with a new beta 2 version.

Renames:
- docs/examples/v0_4_3_features_demo.py -> docs/examples/v0_4_3b2_features_demo.py
docs(examples): update demo scripts and fix output formats
2025-01-22 20:41:43 +08:00 · 2025-01-22 20:40:03 +08:00 · 2025-01-22 17:14:24 +08:00 · 2025-01-22 16:11:01 +08:00 · 2025-01-21 21:20:04 +08:00 · 2025-01-21 21:03:11 +08:00
191 changed files with 25703 additions and 18326 deletions
--- a/.do/app.yaml
+++ b/.do/app.yaml
@@ -1,19 +0,0 @@
 alerts:
 - rule: DEPLOYMENT_FAILED
 - rule: DOMAIN_FAILED
 name: crawl4ai
 region: nyc
 services:
 - dockerfile_path: Dockerfile
  github:
    branch: 0.3.74
    deploy_on_push: true
    repo: unclecode/crawl4ai 
  health_check:
    http_path: /health
  http_port: 11235
  instance_count: 1
  instance_size_slug: professional-xs
  name: web
  routes:
  - path: /
--- a/.do/deploy.template.yaml
+++ b/.do/deploy.template.yaml
@@ -1,22 +0,0 @@
 spec:
  name: crawl4ai
  services:
    - name: crawl4ai
      git:
        branch: 0.3.74
        repo_clone_url: https://github.com/unclecode/crawl4ai.git
      dockerfile_path: Dockerfile
      http_port: 11235
      instance_count: 1
      instance_size_slug: professional-xs
      health_check:
        http_path: /health
      envs:
        - key: INSTALL_TYPE
          value: "basic"
        - key: PYTHON_VERSION  
          value: "3.10"
        - key: ENABLE_GPU
          value: "false"
      routes:
        - path: /
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,12 @@
 # Documentation
 *.html linguist-documentation
 docs/* linguist-documentation
 docs/examples/* linguist-documentation
 docs/md_v2/* linguist-documentation
 # Explicitly mark Python as the main language
 *.py linguist-detectable=true
 *.py linguist-language=Python
 # Exclude HTML from language statistics
 *.html linguist-detectable=false
--- a/.gitignore
+++ b/.gitignore
@@ -208,7 +208,7 @@ git_issues.md
 .next/
 .tests/
-.issues/
+# .issues/
 .docs/
 .issues/
 .gitboss/
@@ -218,3 +218,16 @@ manage-collab.sh
 publish.sh
 combine.sh
 combined_output.txt
 .local
 .scripts
 tree.md
 tree.md
 .scripts
 .local
 .do
 /plans
 .codeiumignore
 todo/
 # windsurf rules
 .windsurfrules
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,19 +1,220 @@
 # Changelog
-## [0.4.1] December 8, 2024
+All notable changes to Crawl4AI will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 ## Version 0.4.3 (2025-01-21)
 This release introduces several powerful new features, including robots.txt compliance, dynamic proxy support, LLM-powered schema generation, and improved documentation.
 ### Features
 -   **Robots.txt Compliance:**
    -   Added robots.txt compliance support with efficient SQLite-based caching.
    -   New `check_robots_txt` parameter in `CrawlerRunConfig` to enable robots.txt checking before crawling a URL.
    -   Automated robots.txt checking is now integrated into `AsyncWebCrawler` with 403 status codes for blocked URLs.
 -   **Proxy Configuration:**
    -   Added proxy configuration support to `CrawlerRunConfig`, allowing dynamic proxy settings per crawl request.
    -   Updated documentation with examples for using proxy configuration in crawl operations.
 -   **LLM-Powered Schema Generation:**
    -   Introduced a new utility for automatic CSS and XPath schema generation using OpenAI or Ollama models.
    -   Added comprehensive documentation and examples for schema generation.
    -   New prompt templates optimized for HTML schema analysis.
 -   **URL Redirection Tracking:**
    -   Added URL redirection tracking to capture the final URL after any redirects.
    -   The final URL is now available in the `redirected_url` field of the `AsyncCrawlResponse` object.
 -   **Enhanced Streamlined Documentation:**
    -   Refactored and improved the documentation structure for clarity and ease of use.
    -   Added detailed explanations of new features and updated examples.
 -   **Improved Browser Context Management:**
    -   Enhanced the management of browser contexts and added shared data support.
    -   Introduced the `shared_data` parameter in `CrawlerRunConfig` to pass data between hooks.
 -   **Memory Dispatcher System:**
    -   Migrated to a memory dispatcher system with enhanced monitoring capabilities.
    -   Introduced `MemoryAdaptiveDispatcher` and `SemaphoreDispatcher` for improved resource management.
    -   Added `RateLimiter` for rate limiting support.
    -   New `CrawlerMonitor` for real-time monitoring of crawler operations.
 -   **Streaming Support:**
    -   Added streaming support for processing crawled URLs as they are processed.
    -   Enabled streaming mode with the `stream` parameter in `CrawlerRunConfig`.
 -   **Content Scraping Strategy:**
    -   Introduced a new `LXMLWebScrapingStrategy` for faster content scraping.
    -   Added support for selecting the scraping strategy via the `scraping_strategy` parameter in `CrawlerRunConfig`.
 ### Bug Fixes
 -   **Browser Path Management:**
    -   Improved browser path management for consistent behavior across different environments.
 -   **Memory Threshold:**
    -   Adjusted the default memory threshold to improve resource utilization.
 -   **Pydantic Model Fields:**
    -   Made several model fields optional with default values to improve flexibility.
 ### Refactor
 -   **Documentation Structure:**
    -   Reorganized documentation structure to improve navigation and readability.
    -   Updated styles and added new sections for advanced features.
 -   **Scraping Mode:**
    -   Replaced the `ScrapingMode` enum with a strategy pattern for more flexible content scraping.
 -   **Version Update:**
    -   Updated the version to `0.4.248`.
 -   **Code Cleanup:**
    -   Removed unused files and improved type hints.
    -   Applied Ruff corrections for code quality.
 -   **Updated dependencies:**
    -   Updated dependencies to their latest versions to ensure compatibility and security.
 -   **Ignored certain patterns and directories:**
    -   Updated `.gitignore` and `.codeiumignore` to ignore additional patterns and directories, streamlining the development environment.
 -   **Simplified Personal Story in README:**
    -   Streamlined the personal story and project vision in the `README.md` for clarity.
 -   **Removed Deprecated Files:**
    -   Deleted several deprecated files and examples that are no longer relevant.
 ---
 **Previous Releases:**
 ### 0.4.24x (2024-12-31)
 -   **Enhanced SSL & Security**: New SSL certificate handling with custom paths and validation options for secure crawling.
 -   **Smart Content Filtering**: Advanced filtering system with regex support and efficient chunking strategies.
 -   **Improved JSON Extraction**: Support for complex JSONPath, JSON-CSS, and Microdata extraction.
 -   **New Field Types**: Added `computed`, `conditional`, `aggregate`, and `template` field types.
 -   **Performance Boost**: Optimized caching, parallel processing, and memory management.
 -   **Better Error Handling**: Enhanced debugging capabilities with detailed error tracking.
 -   **Security Features**: Improved input validation and safe expression evaluation.
 ### 0.4.247 (2025-01-06)
 #### Added
 - **Windows Event Loop Configuration**: Introduced a utility function `configure_windows_event_loop` to resolve `NotImplementedError` for asyncio subprocesses on Windows. ([#utils.py](crawl4ai/utils.py), [#tutorials/async-webcrawler-basics.md](docs/md_v3/tutorials/async-webcrawler-basics.md))
 - **`page_need_scroll` Method**: Added a method to determine if a page requires scrolling before taking actions in `AsyncPlaywrightCrawlerStrategy`. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py))
 #### Changed
 - **Version Bump**: Updated the version from `0.4.246` to `0.4.247`. ([#__version__.py](crawl4ai/__version__.py))
 - **Improved Scrolling Logic**: Enhanced scrolling methods in `AsyncPlaywrightCrawlerStrategy` by adding a `scroll_delay` parameter for better control. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py))
 - **Markdown Generation Example**: Updated the `hello_world.py` example to reflect the latest API changes and better illustrate features. ([#examples/hello_world.py](docs/examples/hello_world.py))
 - **Documentation Update**: 
  - Added Windows-specific instructions for handling asyncio event loops. ([#async-webcrawler-basics.md](docs/md_v3/tutorials/async-webcrawler-basics.md))
 #### Removed
 - **Legacy Markdown Generation Code**: Removed outdated and unused code for markdown generation in `content_scraping_strategy.py`. ([#content_scraping_strategy.py](crawl4ai/content_scraping_strategy.py))
 #### Fixed
 - **Page Closing to Prevent Memory Leaks**:
  - **Description**: Added a `finally` block to ensure pages are closed when no `session_id` is provided.
  - **Impact**: Prevents memory leaks caused by lingering pages after a crawl.
  - **File**: [`async_crawler_strategy.py`](crawl4ai/async_crawler_strategy.py)
  - **Code**:
    ```python
    finally:
        # If no session_id is given we should close the page
        if not config.session_id:
            await page.close()
    ```
 - **Multiple Element Selection**: Modified `_get_elements` in `JsonCssExtractionStrategy` to return all matching elements instead of just the first one, ensuring comprehensive extraction. ([#extraction_strategy.py](crawl4ai/extraction_strategy.py))
 - **Error Handling in Scrolling**: Added robust error handling to ensure scrolling proceeds safely even if a configuration is missing. ([#async_crawler_strategy.py](crawl4ai/async_crawler_strategy.py))
 #### Other
 - **Git Ignore Update**: Added `/plans` to `.gitignore` for better development environment consistency. ([#.gitignore](.gitignore))
 ## [0.4.24] - 2024-12-31
 ### Added
 - **Browser and SSL Handling**
  - SSL certificate validation options in extraction strategies
  - Custom certificate paths support
  - Configurable certificate validation skipping
  - Enhanced response status code handling with retry logic
 - **Content Processing**
  - New content filtering system with regex support
  - Advanced chunking strategies for large content
  - Memory-efficient parallel processing
  - Configurable chunk size optimization
 - **JSON Extraction**
  - Complex JSONPath expression support
  - JSON-CSS and Microdata extraction
  - RDFa parsing capabilities
  - Advanced data transformation pipeline
 - **Field Types**
  - New field types: `computed`, `conditional`, `aggregate`, `template`
  - Field inheritance system
  - Reusable field definitions
  - Custom validation rules
 ### Changed
 - **Performance**
  - Optimized selector compilation with caching
  - Improved HTML parsing efficiency
  - Enhanced memory management for large documents
  - Batch processing optimizations
 - **Error Handling**
  - More detailed error messages and categorization
  - Enhanced debugging capabilities
  - Improved performance metrics tracking
  - Better error recovery mechanisms
 ### Deprecated
 - Old field computation method using `eval`
 - Direct browser manipulation without proper SSL handling
 - Simple text-based content filtering
 ### Removed
 - Legacy extraction patterns without proper error handling
 - Unsafe eval-based field computation
 - Direct DOM manipulation without sanitization
 ### Fixed
 - Memory leaks in large document processing
 - SSL certificate validation issues
 - Incorrect handling of nested JSON structures
 - Performance bottlenecks in parallel processing
 ### Security
 - Improved input validation and sanitization
 - Safe expression evaluation system
 - Enhanced resource protection
 - Rate limiting implementation
 ## [0.4.1] - 2024-12-08
 ### **File: `crawl4ai/async_crawler_strategy.py`**
 #### **New Parameters and Attributes Added**
- **`text_only` (boolean)**: Enables text-only mode, disables images, JavaScript, and GPU-related features for faster, minimal rendering.
+- **`text_mode` (boolean)**: Enables text-only mode, disables images, JavaScript, and GPU-related features for faster, minimal rendering.
 - **`light_mode` (boolean)**: Optimizes the browser by disabling unnecessary background processes and features for efficiency.
- **`viewport_width` and `viewport_height`**: Dynamically adjusts based on `text_only` mode (default values: 800x600 for `text_only`, 1920x1080 otherwise).
+- **`viewport_width` and `viewport_height`**: Dynamically adjusts based on `text_mode` mode (default values: 800x600 for `text_mode`, 1920x1080 otherwise).
- **`extra_args`**: Adds browser-specific flags for `text_only` mode.
+- **`extra_args`**: Adds browser-specific flags for `text_mode` mode.
 - **`adjust_viewport_to_content`**: Dynamically adjusts the viewport to the content size for accurate rendering.
 #### **Browser Context Adjustments**
- Added **`viewport` adjustments**: Dynamically computed based on `text_only` or custom configuration.
+- Added **`viewport` adjustments**: Dynamically computed based on `text_mode` or custom configuration.
- Enhanced support for `light_mode` and `text_only` by adding specific browser arguments to reduce resource consumption.
+- Enhanced support for `light_mode` and `text_mode` by adding specific browser arguments to reduce resource consumption.
 #### **Dynamic Content Handling**
 - **Full Page Scan Feature**:
@@ -709,7 +910,7 @@ This commit introduces several key enhancements, including improved error handli
 - Improved `AsyncPlaywrightCrawlerStrategy.close()` method to use a shorter sleep time (0.5 seconds instead of 500), significantly reducing wait time when closing the crawler.
 - Enhanced flexibility in `CosineStrategy`:
  - Now uses a more generic `load_HF_embedding_model` function, allowing for easier swapping of embedding models.
- Updated `JsonCssExtractionStrategy` and `JsonXPATHExtractionStrategy` for better JSON-based extraction.
+- Updated `JsonCssExtractionStrategy` and `JsonXPathExtractionStrategy` for better JSON-based extraction.
 ### Fixed
 - Addressed potential issues with the sliding window chunking strategy to ensure all text is properly chunked.
@@ -980,6 +1181,6 @@ These changes focus on refining the existing codebase, resulting in a more stabl
 - Maintaining the semantic context of inline tags (e.g., abbreviation, DEL, INS) for improved LLM-friendliness.
 - Updated Dockerfile to ensure compatibility across multiple platforms (Hopefully!).
-## [0.2.4] - 2024-06-17
+## [v0.2.4] - 2024-06-17
 ### Fixed
 - Fix issue #22: Use MD5 hash for caching HTML files to handle long URLs
--- a/README.md
+++ b/README.md
@@ -1,19 +1,40 @@
 # 🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper.
 <div align="center">
 <a href="https://trendshift.io/repositories/11716" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11716" alt="unclecode%2Fcrawl4ai | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
 [![GitHub Stars](https://img.shields.io/github/stars/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/stargazers)
 ![PyPI - Downloads](https://img.shields.io/pypi/dm/Crawl4AI)
 [![GitHub Forks](https://img.shields.io/github/forks/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/network/members)
-[![GitHub Issues](https://img.shields.io/github/issues/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/issues)
+
-[![GitHub Pull Requests](https://img.shields.io/github/issues-pr/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/pulls)
+[![PyPI version](https://badge.fury.io/py/crawl4ai.svg)](https://badge.fury.io/py/crawl4ai)
 [![Python Version](https://img.shields.io/pypi/pyversions/crawl4ai)](https://pypi.org/project/crawl4ai/)
 [![Downloads](https://static.pepy.tech/badge/crawl4ai/month)](https://pepy.tech/project/crawl4ai)
 <!-- [![Documentation Status](https://readthedocs.org/projects/crawl4ai/badge/?version=latest)](https://crawl4ai.readthedocs.io/) -->
 [![License](https://img.shields.io/github/license/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)
 [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
 [![Security: bandit](https://img.shields.io/badge/security-bandit-yellow.svg)](https://github.com/PyCQA/bandit)
 </div>
 Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.  
-[✨ Check out latest update v0.4.2](#-recent-updates)
+[✨ Check out latest update v0.4.3b1x](#-recent-updates)
-🎉 **Version 0.4.2 is out!** Introducing our experimental PruningContentFilter - a powerful new algorithm for smarter Markdown generation. Test it out and [share your feedback](https://github.com/unclecode/crawl4ai/issues)! [Read the release notes →](https://crawl4ai.com/mkdocs/blog)
+🎉 **Version 0.4.3b1 is out!** This release brings exciting new features like a Memory Dispatcher System, Streaming Support, LLM-Powered Markdown Generation, Schema Generation, and Robots.txt Compliance! [Read the release notes →](https://docs.crawl4ai.com/blog)
 <details>
 <summary>🤓 <strong>My Personal Story</strong></summary>
 My journey with computers started in childhood when my dad, a computer scientist, introduced me to an Amstrad computer. Those early days sparked a fascination with technology, leading me to pursue computer science and specialize in NLP during my postgraduate studies. It was during this time that I first delved into web crawling, building tools to help researchers organize papers and extract information from publications, a challenging yet rewarding experience that honed my skills in data extraction.
 Fast forward to 2023, I was working on a tool for a project and needed a crawler to convert a webpage into markdown. While exploring solutions, I found one that claimed to be open-source but required creating an account and generating an API token. Worse, it turned out to be a SaaS model charging $16, and its quality didn’t meet my standards. Frustrated, I realized this was a deeper problem. That frustration turned into turbo anger mode, and I decided to build my own solution. In just a few days, I created Crawl4AI. To my surprise, it went viral, earning thousands of GitHub stars and resonating with a global community.
 I made Crawl4AI open-source for two reasons. First, it’s my way of giving back to the open-source community that has supported me throughout my career. Second, I believe data should be accessible to everyone, not locked behind paywalls or monopolized by a few. Open access to data lays the foundation for the democratization of AI—a vision where individuals can train their own models and take ownership of their information. This library is the first step in a larger journey to create the best open-source data extraction and generation tool the world has ever seen, built collaboratively by a passionate community.
 Thank you to everyone who has supported this project, used it, and shared feedback. Your encouragement motivates me to dream even bigger. Join us, file issues, submit PRs, or spread the word. Together, we can build a tool that truly empowers people to access their own data and reshape the future of AI.
 </details>
 ## 🧐 Why Crawl4AI?
@@ -28,20 +49,32 @@ Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant
 1. Install Crawl4AI:
 ```bash
-pip install crawl4ai
+# Install the package
-crawl4ai-setup # Setup the browser
+pip install -U crawl4ai
 # Run post-installation setup
 crawl4ai-setup
 # Verify your installation
 crawl4ai-doctor
 ```
 If you encounter any browser-related issues, you can install them manually:
 ```bash
 python -m playwright install --with-deps chromium
 ```
 2. Run a simple web crawl:
 ```python
 import asyncio
-from crawl4ai import AsyncWebCrawler, CacheMode
+from crawl4ai import *
 async def main():
-    async with AsyncWebCrawler(verbose=True) as crawler:
+    async with AsyncWebCrawler() as crawler:
-        result = await crawler.arun(url="https://www.nbcnews.com/business")
+        result = await crawler.arun(
-        # Soone will be change to result.markdown
+            url="https://www.nbcnews.com/business",
-        print(result.markdown_v2.raw_markdown) 
+        )
        print(result.markdown)
 if __name__ == "__main__":
    asyncio.run(main())
@@ -127,7 +160,7 @@ if __name__ == "__main__":
 ✨ Play around with this [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)
-✨ Visit our [Documentation Website](https://crawl4ai.com/mkdocs/)
+✨ Visit our [Documentation Website](https://docs.crawl4ai.com/)
 ## Installation 🛠️
@@ -200,193 +233,26 @@ pip install -e ".[all]"             # Install all optional features
 </details>
 <details>
-<summary>🚀 <strong>One-Click Deployment</strong></summary>
+<summary>🐳 <strong>Docker Deployment</strong></summary>
-Deploy your own instance of Crawl4AI with one click:
+> 🚀 **Major Changes Coming!** We're developing a completely new Docker implementation that will make deployment even more efficient and seamless. The current Docker setup is being deprecated in favor of this new solution.
-[![DigitalOcean Referral Badge](https://web-platforms.sfo2.cdn.digitaloceanspaces.com/WWW/Badge%203.svg)](https://www.digitalocean.com/?repo=https://github.com/unclecode/crawl4ai/tree/0.3.74&refcode=a0780f1bdb3d&utm_campaign=Referral_Invite&utm_medium=Referral_Program&utm_source=badge)
+### Current Docker Support
-> 💡 **Recommended specs**: 4GB RAM minimum. Select "professional-xs" or higher when deploying for stable operation.
+The existing Docker implementation is being deprecated and will be replaced soon. If you still need to use Docker with the current version:
-The deploy will:
+- 📚 [Deprecated Docker Setup](./docs/deprecated/docker-deployment.md) - Instructions for the current Docker implementation
- Set up a Docker container with Crawl4AI
+- ⚠️ Note: This setup will be replaced in the next major release
 - Configure Playwright and all dependencies
 - Start the FastAPI server on port `11235`
 - Set up health checks and auto-deployment
-</details>
+### What's Coming Next?
-<details>
+Our new Docker implementation will bring:
-<summary>🐳 <strong>Using Docker</strong></summary>
+- Improved performance and resource efficiency
 - Streamlined deployment process
 - Better integration with Crawl4AI features
 - Enhanced scalability options
-Crawl4AI is available as Docker images for easy deployment. You can either pull directly from Docker Hub (recommended) or build from the repository.
+Stay connected with our [GitHub repository](https://github.com/unclecode/crawl4ai) for updates!
 ---
 <details>
 <summary>🐳 <strong>Option 1: Docker Hub (Recommended)</strong></summary>
 Choose the appropriate image based on your platform and needs:
 ### For AMD64 (Regular Linux/Windows):
 ```bash
 # Basic version (recommended)
 docker pull unclecode/crawl4ai:basic-amd64
 docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64
 # Full ML/LLM support
 docker pull unclecode/crawl4ai:all-amd64
 docker run -p 11235:11235 unclecode/crawl4ai:all-amd64
 # With GPU support
 docker pull unclecode/crawl4ai:gpu-amd64
 docker run -p 11235:11235 unclecode/crawl4ai:gpu-amd64
 ```
 ### For ARM64 (M1/M2 Macs, ARM servers):
 ```bash
 # Basic version (recommended)
 docker pull unclecode/crawl4ai:basic-arm64
 docker run -p 11235:11235 unclecode/crawl4ai:basic-arm64
 # Full ML/LLM support
 docker pull unclecode/crawl4ai:all-arm64
 docker run -p 11235:11235 unclecode/crawl4ai:all-arm64
 # With GPU support
 docker pull unclecode/crawl4ai:gpu-arm64
 docker run -p 11235:11235 unclecode/crawl4ai:gpu-arm64
 ```
 Need more memory? Add `--shm-size`:
 ```bash
 docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic-amd64
 ```
 Test the installation:
 ```bash
 curl http://localhost:11235/health
 ```
 ### For Raspberry Pi (32-bit) (coming soon):
 ```bash
 # Pull and run basic version (recommended for Raspberry Pi)
 docker pull unclecode/crawl4ai:basic-armv7
 docker run -p 11235:11235 unclecode/crawl4ai:basic-armv7
 # With increased shared memory if needed
 docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic-armv7
 ```
 Note: Due to hardware constraints, only the basic version is recommended for Raspberry Pi.
 </details>
 <details>
 <summary>🐳 <strong>Option 2: Build from Repository</strong></summary>
 Build the image locally based on your platform:
 ```bash
 # Clone the repository
 git clone https://github.com/unclecode/crawl4ai.git
 cd crawl4ai
 # For AMD64 (Regular Linux/Windows)
 docker build --platform linux/amd64 \
  --tag crawl4ai:local \
  --build-arg INSTALL_TYPE=basic \
  .
 # For ARM64 (M1/M2 Macs, ARM servers)
 docker build --platform linux/arm64 \
  --tag crawl4ai:local \
  --build-arg INSTALL_TYPE=basic \
  .
 ```
 Build options:
 - INSTALL_TYPE=basic (default): Basic crawling features
 - INSTALL_TYPE=all: Full ML/LLM support
 - ENABLE_GPU=true: Add GPU support
 Example with all options:
 ```bash
 docker build --platform linux/amd64 \
  --tag crawl4ai:local \
  --build-arg INSTALL_TYPE=all \
  --build-arg ENABLE_GPU=true \
  .
 ```
 Run your local build:
 ```bash
 # Regular run
 docker run -p 11235:11235 crawl4ai:local
 # With increased shared memory
 docker run --shm-size=2gb -p 11235:11235 crawl4ai:local
 ```
 Test the installation:
 ```bash
 curl http://localhost:11235/health
 ```
 </details>
 <details>
 <summary>🐳 <strong>Option 3: Using Docker Compose</strong></summary>
 Docker Compose provides a more structured way to run Crawl4AI, especially when dealing with environment variables and multiple configurations.
 ```bash
 # Clone the repository
 git clone https://github.com/unclecode/crawl4ai.git
 cd crawl4ai
 ```
 ### For AMD64 (Regular Linux/Windows):
 ```bash
 # Build and run locally
 docker-compose --profile local-amd64 up
 # Run from Docker Hub
 VERSION=basic docker-compose --profile hub-amd64 up   # Basic version
 VERSION=all docker-compose --profile hub-amd64 up     # Full ML/LLM support
 VERSION=gpu docker-compose --profile hub-amd64 up     # GPU support
 ```
 ### For ARM64 (M1/M2 Macs, ARM servers):
 ```bash
 # Build and run locally
 docker-compose --profile local-arm64 up
 # Run from Docker Hub
 VERSION=basic docker-compose --profile hub-arm64 up   # Basic version
 VERSION=all docker-compose --profile hub-arm64 up     # Full ML/LLM support
 VERSION=gpu docker-compose --profile hub-arm64 up     # GPU support
 ```
 Environment variables (optional):
 ```bash
 # Create a .env file
 CRAWL4AI_API_TOKEN=your_token
 OPENAI_API_KEY=your_openai_key
 CLAUDE_API_KEY=your_claude_key
 ```
 The compose file includes:
 - Memory management (4GB limit, 1GB reserved)
 - Shared memory volume for browser support
 - Health checks
 - Auto-restart policy
 - All necessary port mappings
 Test the installation:
 ```bash
 curl http://localhost:11235/health
 ```
 </details>
@@ -410,7 +276,7 @@ task_id = response.json()["task_id"]
 result = requests.get(f"http://localhost:11235/task/{task_id}")
 ```
-For more examples, see our [Docker Examples](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_example.py). For advanced configuration, environment variables, and usage examples, see our [Docker Deployment Guide](https://crawl4ai.com/mkdocs/basic/docker-deployment/).
+For more examples, see our [Docker Examples](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_example.py). For advanced configuration, environment variables, and usage examples, see our [Docker Deployment Guide](https://docs.crawl4ai.com/basic/docker-deployment/).
 </details>
@@ -424,17 +290,16 @@ You can check the project structure in the directory [https://github.com/uncleco
 ```python
 import asyncio
-from crawl4ai import AsyncWebCrawler, CacheMode
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
 from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
 from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
 async def main():
-    async with AsyncWebCrawler(
+    browser_config = BrowserConfig(
        headless=True,  
        verbose=True,
-    ) as crawler:
+    )
-        result = await crawler.arun(
+    run_config = CrawlerRunConfig(
            url="https://docs.micronaut.io/4.7.6/guide/",
        cache_mode=CacheMode.ENABLED,
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
@@ -443,6 +308,12 @@ async def main():
        #     content_filter=BM25ContentFilter(user_query="WHEN_WE_FOCUS_BASED_ON_A_USER_QUERY", bm25_threshold=1.0)
        # ),
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://docs.micronaut.io/4.7.6/guide/",
            config=run_config
        )
        print(len(result.markdown))
        print(len(result.fit_markdown))
        print(len(result.markdown_v2.fit_markdown))
@@ -458,7 +329,7 @@ if __name__ == "__main__":
 ```python
 import asyncio
-from crawl4ai import AsyncWebCrawler, CacheMode
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
 from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
 import json
@@ -493,36 +364,26 @@ async def main():
            "type": "attribute",
            "attribute": "src"
        }
-    ]
+    }
 }
    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
-    async with AsyncWebCrawler(
+    browser_config = BrowserConfig(
        headless=False,
        verbose=True
-    ) as crawler:
+    )
    run_config = CrawlerRunConfig(
        extraction_strategy=extraction_strategy,
        js_code=["""(async () => {const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");for(let tab of tabs) {tab.scrollIntoView();tab.click();await new Promise(r => setTimeout(r, 500));}})();"""],
        cache_mode=CacheMode.BYPASS
    )
-        # Create the JavaScript that handles clicking multiple times
+    async with AsyncWebCrawler(config=browser_config) as crawler:
        js_click_tabs = """
        (async () => {
            const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");
            for(let tab of tabs) {
                // scroll to the tab
                tab.scrollIntoView();
                tab.click();
                // Wait for content to load and animations to complete
                await new Promise(r => setTimeout(r, 500));
            }
        })();
        """     
        result = await crawler.arun(
            url="https://www.kidocode.com/degrees/technology",
-            extraction_strategy=JsonCssExtractionStrategy(schema, verbose=True),
+            config=run_config
            js_code=[js_click_tabs],
            cache_mode=CacheMode.BYPASS
        )
        companies = json.loads(result.extracted_content)
@@ -542,7 +403,7 @@ if __name__ == "__main__":
 ```python
 import os
 import asyncio
-from crawl4ai import AsyncWebCrawler, CacheMode
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
 from crawl4ai.extraction_strategy import LLMExtractionStrategy
 from pydantic import BaseModel, Field
@@ -552,9 +413,8 @@ class OpenAIModelFee(BaseModel):
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
 async def main():
-    async with AsyncWebCrawler(verbose=True) as crawler:
+    browser_config = BrowserConfig(verbose=True)
-        result = await crawler.arun(
+    run_config = CrawlerRunConfig(
            url='https://openai.com/api/pricing/',
        word_count_threshold=1,
        extraction_strategy=LLMExtractionStrategy(
            # Here you can use any provider that Litellm library supports, for instance: ollama/qwen2
@@ -568,6 +428,12 @@ async def main():
        ),            
        cache_mode=CacheMode.BYPASS,
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url='https://openai.com/api/pricing/',
            config=run_config
        )
        print(result.extracted_content)
 if __name__ == "__main__":
@@ -583,37 +449,29 @@ if __name__ == "__main__":
 import os, sys
 from pathlib import Path
 import asyncio, time
-from crawl4ai import AsyncWebCrawler
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
 async def test_news_crawl():
    # Create a persistent user data directory
    user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
    os.makedirs(user_data_dir, exist_ok=True)
-    async with AsyncWebCrawler(
+    browser_config = BrowserConfig(
        verbose=True,
        headless=True,
        user_data_dir=user_data_dir,
        use_persistent_context=True,
-        headers={
+    )
-            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
+    run_config = CrawlerRunConfig(
-            "Accept-Language": "en-US,en;q=0.5",
+        cache_mode=CacheMode.BYPASS
-            "Accept-Encoding": "gzip, deflate, br",
+    )
-            "DNT": "1",
+    
-            "Connection": "keep-alive",
+    async with AsyncWebCrawler(config=browser_config) as crawler:
            "Upgrade-Insecure-Requests": "1",
            "Sec-Fetch-Dest": "document",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-Site": "none",
            "Sec-Fetch-User": "?1",
            "Cache-Control": "max-age=0",
        }
    ) as crawler:
        url = "ADDRESS_OF_A_CHALLENGING_WEBSITE"
        result = await crawler.arun(
            url,
-            cache_mode=CacheMode.BYPASS,
+            config=run_config,
            magic=True,
        )
@@ -623,28 +481,73 @@ async def test_news_crawl():
 </details>
 ## ✨ Recent Updates
- 🔧 **Configurable Crawlers and Browsers**: Simplified crawling with `BrowserConfig` and `CrawlerRunConfig`, making setups cleaner and more scalable.
+-   **🚀 New Dispatcher System**: Scale to thousands of URLs with intelligent **memory monitoring**, **concurrency control**, and optional **rate limiting**. (See `MemoryAdaptiveDispatcher`, `SemaphoreDispatcher`, `RateLimiter`, `CrawlerMonitor`)
- 🔐 **Session Management Enhancements**: Import/export local storage for personalized crawling with seamless session reuse.
+-   **⚡ Streaming Mode**: Process results **as they arrive** instead of waiting for an entire batch to complete. (Set `stream=True` in `CrawlerRunConfig`)
- 📸 **Supercharged Screenshots**: Take lightning-fast, full-page screenshots of very long pages.
+-   **🤖 Enhanced LLM Integration**:
- 📜 **Full-Page PDF Export**: Convert any web page into a PDF for easy sharing or archiving.
+    -   **Automatic schema generation**: Create extraction rules from HTML using OpenAI or Ollama, no manual CSS/XPath needed.
- 🖼️ **Lazy Load Handling**: Improved support for websites with lazy-loaded images. The crawler now waits for all images to fully load, ensuring no content is missed.
+    -   **LLM-powered Markdown filtering**: Refine your markdown output with a new `LLMContentFilter` that understands content relevance.
- ⚡ **Text-Only Mode**: New mode for fast, lightweight crawling. Disables images, JavaScript, and GPU rendering, improving speed by 3-4x for text-focused crawls.
+    -   **Ollama Support**: Use open-source or self-hosted models for private or cost-effective extraction.
- 📐 **Dynamic Viewport Adjustment**: Automatically adjusts the browser viewport to fit page content, ensuring accurate rendering and capturing of all elements.
+-   **🏎️ Faster Scraping Option**: New `LXMLWebScrapingStrategy` offers **10-20x speedup** for large, complex pages (experimental).
- 🔄 **Full-Page Scanning**: Added scrolling support for pages with infinite scroll or dynamic content loading. Ensures every part of the page is captured.
+-   **🤖 robots.txt Compliance**: Respect website rules with `check_robots_txt=True` and efficient local caching.
- 🧑‍💻 **Session Reuse**: Introduced `create_session` for efficient crawling by reusing the same browser session across multiple requests.
+-   **🔄 Proxy Rotation**: Built-in support for dynamic proxy switching and IP verification, with support for authenticated proxies and session persistence.
- 🌟 **Light Mode**: Optimized browser performance by disabling unnecessary features like extensions, background timers, and sync processes.
+-   **➡️ URL Redirection Tracking**: The `redirected_url` field now captures the final destination after any redirects.
 -   **🪞 Improved Mirroring**: The `LXMLWebScrapingStrategy` now has much greater fidelity, allowing for almost pixel-perfect mirroring of websites.
 -   **📈 Enhanced Monitoring**: Track memory, CPU, and individual crawler status with `CrawlerMonitor`.
 -   **📝 Improved Documentation**: More examples, clearer explanations, and updated tutorials.
 Read the full details in our [0.4.248 Release Notes](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
-Read the full details of this release in our [0.4.2 Release Notes](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/blog/releases/0.4.2.md).
+## Version Numbering in Crawl4AI
 Crawl4AI follows standard Python version numbering conventions (PEP 440) to help users understand the stability and features of each release.
 ### Version Numbers Explained
 Our version numbers follow this pattern: `MAJOR.MINOR.PATCH` (e.g., 0.4.3)
 #### Pre-release Versions
 We use different suffixes to indicate development stages:
 - `dev` (0.4.3dev1): Development versions, unstable
 - `a` (0.4.3a1): Alpha releases, experimental features
 - `b` (0.4.3b1): Beta releases, feature-complete but still in need of testing
 - `rc` (0.4.3rc1): Release candidates, potential final version
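As a rough illustration of how these suffixes order relative to each other, here is a minimal sketch. It assumes simplified version strings of the form `MAJOR.MINOR.PATCH` with at most one suffix, and `sort_key` and `is_prerelease` are hypothetical helper names, not part of Crawl4AI. It covers only a small subset of PEP 440; a full implementation such as the `packaging` library is the safer choice for real use.

```python
import re

# Simplified pattern: MAJOR.MINOR.PATCH plus an optional dev/a/b/rc suffix and number.
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:(dev|a|b|rc)(\d+))?$")

# Lower rank sorts earlier; a version without a suffix is the final release.
_STAGE_RANK = {"dev": 0, "a": 1, "b": 2, "rc": 3, None: 4}


def sort_key(version: str) -> tuple:
    """Build a tuple that sorts versions in the order described above."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unsupported version string: {version!r}")
    major, minor, patch, stage, stage_num = match.groups()
    return (int(major), int(minor), int(patch), _STAGE_RANK[stage], int(stage_num or 0))


def is_prerelease(version: str) -> bool:
    """True for dev, alpha, beta, and release-candidate versions."""
    return sort_key(version)[3] < _STAGE_RANK[None]


versions = ["0.4.3", "0.4.3rc1", "0.4.3b2", "0.4.3dev1", "0.4.3a1", "0.4.3b1"]
ordered = sorted(versions, key=sort_key)
print(ordered)
```

Sorting by a tuple keeps the comparison logic in one place: the numeric release segment is compared first, then the development stage, then the stage number.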
 #### Installation
 - Regular installation (stable version):
  ```bash
  pip install -U crawl4ai
  ```
 - Install pre-release versions:
  ```bash
  pip install crawl4ai --pre
  ```
 - Install specific version:
  ```bash
  pip install crawl4ai==0.4.3b1
  ```
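After installing, it can be useful to confirm which version actually ended up in the environment. The following is a minimal sketch using the standard library's `importlib.metadata`; `installed_version` is a hypothetical helper name introduced for illustration, and the distribution name `crawl4ai` is taken from the `pip` commands above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed version string, or a hint if the package is missing.
found = installed_version("crawl4ai")
print(found if found is not None else "crawl4ai is not installed")
```

A version string carrying a `dev`, `a`, `b`, or `rc` suffix indicates that a pre-release is installed.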
 #### Why Pre-releases?
 We use pre-releases to:
 - Test new features in real-world scenarios
 - Gather feedback before final releases
 - Ensure stability for production users
 - Allow early adopters to try new features
 For production environments, we recommend using the stable version. For testing new features, you can opt-in to pre-releases using the `--pre` flag.
 ## 📖 Documentation & Roadmap 
 > 🚨 **Documentation Update Alert**: We're undertaking a major documentation overhaul next week to reflect recent updates and improvements. Stay tuned for a more comprehensive and up-to-date guide!
-For current documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://crawl4ai.com/mkdocs/).
+For current documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://docs.crawl4ai.com/).
 To check our development plans and upcoming features, visit our [Roadmap](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md).
@@ -709,9 +612,6 @@ We envision a future where AI is powered by real human knowledge, ensuring data
 For more details, see our [full mission statement](./MISSION.md).
 </details>
 ## Star History
 [![Star History Chart](https://api.star-history.com/svg?repos=unclecode/crawl4ai&type=Date)](https://star-history.com/#unclecode/crawl4ai&Date)
--- a/a.md
+++ b/a.md
--- a/crawl4ai/__init__.py
+++ b/crawl4ai/__init__.py
@@ -2,45 +2,87 @@
 from .async_webcrawler import AsyncWebCrawler, CacheMode
 from .async_configs import BrowserConfig, CrawlerRunConfig
-from .extraction_strategy import ExtractionStrategy, LLMExtractionStrategy, CosineStrategy, JsonCssExtractionStrategy
+from .content_scraping_strategy import (
    ContentScrapingStrategy,
    WebScrapingStrategy,
    LXMLWebScrapingStrategy,
 )
 from .extraction_strategy import (
    ExtractionStrategy,
    LLMExtractionStrategy,
    CosineStrategy,
    JsonCssExtractionStrategy,
    JsonXPathExtractionStrategy
 )
 from .chunking_strategy import ChunkingStrategy, RegexChunking
 from .markdown_generation_strategy import DefaultMarkdownGenerator
-from .content_filter_strategy import PruningContentFilter, BM25ContentFilter
+from .content_filter_strategy import PruningContentFilter, BM25ContentFilter, LLMContentFilter
-from .models import CrawlResult
+from .models import CrawlResult, MarkdownGenerationResult
-from .__version__ import __version__
+from .async_dispatcher import (
    MemoryAdaptiveDispatcher,
    SemaphoreDispatcher,
    RateLimiter,
    CrawlerMonitor,
    DisplayMode,
    BaseDispatcher
 )
 __all__ = [
    "AsyncWebCrawler",
    "CrawlResult",
    "CacheMode",
-    'BrowserConfig',
+    "ContentScrapingStrategy",
-    'CrawlerRunConfig',
+    "WebScrapingStrategy",
-    'ExtractionStrategy',
+    "LXMLWebScrapingStrategy",
-    'LLMExtractionStrategy',
+    "BrowserConfig",
-    'CosineStrategy',
+    "CrawlerRunConfig",
-    'JsonCssExtractionStrategy',
+    "ExtractionStrategy",
-    'ChunkingStrategy',
+    "LLMExtractionStrategy",
-    'RegexChunking',
+    "CosineStrategy",
-    'DefaultMarkdownGenerator',
+    "JsonCssExtractionStrategy",
-    'PruningContentFilter',
+    "JsonXPathExtractionStrategy",
-    'BM25ContentFilter',
+    "ChunkingStrategy",
    "RegexChunking",
    "DefaultMarkdownGenerator",
    "PruningContentFilter",
    "BM25ContentFilter",
    "LLMContentFilter",
    "BaseDispatcher",
    "MemoryAdaptiveDispatcher",
    "SemaphoreDispatcher",
    "RateLimiter",
    "CrawlerMonitor",
    "DisplayMode",
    "MarkdownGenerationResult",
 ]
 def is_sync_version_installed():
    try:
        import selenium
        return True
    except ImportError:
        return False
 if is_sync_version_installed():
    try:
        from .web_crawler import WebCrawler
        __all__.append("WebCrawler")
    except ImportError:
-        import warnings
+        print(
-        print("Warning: Failed to import WebCrawler even though selenium is installed. This might be due to other missing dependencies.")
+            "Warning: Failed to import WebCrawler even though selenium is installed. This might be due to other missing dependencies."
        )
 else:
    WebCrawler = None
    # import warnings
    # print("Warning: Synchronous WebCrawler is not available. Install crawl4ai[sync] for synchronous support. However, please note that the synchronous version will be deprecated soon.")
 import warnings
 from pydantic import warnings as pydantic_warnings
 # Disable all Pydantic warnings
 warnings.filterwarnings("ignore", module="pydantic")
 # pydantic_warnings.filter_warnings()
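The `is_sync_version_installed` check in this file detects selenium by importing it. As an alternative sketch, and not what the package itself does, the standard library's `importlib.util.find_spec` can report whether a top-level module is installed without executing its import. `is_module_available` is a hypothetical helper name.

```python
import importlib.util


def is_module_available(name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


# Report whether the optional synchronous dependency could be found.
print("selenium available:", is_module_available("selenium"))
```

This sketch handles only top-level module names. For dotted names, `find_spec` imports the parent package and may raise `ModuleNotFoundError`, so a `try`/`except` would still be needed there.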
--- a/crawl4ai/__version__.py
+++ b/crawl4ai/__version__.py
@@ -1,2 +1,2 @@
 # crawl4ai/_version.py
-__version__ = "0.4.22"
+__version__ = "0.4.3b2"
--- a/crawl4ai/async_configs.py
+++ b/crawl4ai/async_configs.py
@@ -2,12 +2,17 @@ from .config import (
    MIN_WORD_THRESHOLD,
    IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
    SCREENSHOT_HEIGHT_TRESHOLD,
-    PAGE_TIMEOUT
+    PAGE_TIMEOUT,
    IMAGE_SCORE_THRESHOLD,
    SOCIAL_MEDIA_DOMAINS,
 )
 from .user_agent_generator import UserAgentGenerator
 from .extraction_strategy import ExtractionStrategy
-from .chunking_strategy import ChunkingStrategy
+from .chunking_strategy import ChunkingStrategy, RegexChunking
 from .markdown_generation_strategy import MarkdownGenerationStrategy
 from .content_scraping_strategy import ContentScrapingStrategy, WebScrapingStrategy
 from typing import Optional, Union, List
 class BrowserConfig:
    """
@@ -24,18 +29,21 @@ class BrowserConfig:
                         Default: True.
        use_managed_browser (bool): Launch the browser using a managed approach (e.g., via CDP), allowing
                                    advanced manipulation. Default: False.
        debugging_port (int): Port for the browser debugging protocol. Default: 9222.
        use_persistent_context (bool): Use a persistent browser context (like a persistent profile).
                                       Automatically sets use_managed_browser=True. Default: False.
        user_data_dir (str or None): Path to a user data directory for persistent sessions. If None, a
                                     temporary directory may be used. Default: None.
        chrome_channel (str): The Chrome channel to launch (e.g., "chrome", "msedge"). Only applies if browser_type
-                              is "chromium". Default: "chrome".
+                              is "chromium". Default: "chromium".
-        proxy (str or None): Proxy server URL (e.g., "http://username:password@proxy:port"). If None, no proxy is used.
+        channel (str): The channel to launch (e.g., "chromium", "chrome", "msedge"). Only applies if browser_type
                              is "chromium". Default: "chromium".
        proxy (Optional[str]): Proxy server URL (e.g., "http://username:password@proxy:port"). If None, no proxy is used.
                             Default: None.
        proxy_config (dict or None): Detailed proxy configuration, e.g. {"server": "...", "username": "..."}.
                                     If None, no additional proxy config. Default: None.
-        viewport_width (int): Default viewport width for pages. Default: 1920.
+        viewport_width (int): Default viewport width for pages. Default: 1080.
-        viewport_height (int): Default viewport height for pages. Default: 1080.
+        viewport_height (int): Default viewport height for pages. Default: 600.
        verbose (bool): Enable verbose logging.
                        Default: True.
        accept_downloads (bool): Whether to allow file downloads. If True, requires a downloads_path.
@@ -57,7 +65,7 @@ class BrowserConfig:
                                       user_agent as-is. Default: None.
        user_agent_generator_config (dict or None): Configuration for user agent generation if user_agent_mode is set.
                                                    Default: None.
-        text_only (bool): If True, disables images and other rich content for potentially faster load times.
+        text_mode (bool): If True, disables images and other rich content for potentially faster load times.
                          Default: False.
        light_mode (bool): Disables certain background features for performance gains. Default: False.
        extra_args (list): Additional command-line arguments passed to the browser.
@@ -71,11 +79,12 @@ class BrowserConfig:
        use_managed_browser: bool = False,
        use_persistent_context: bool = False,
        user_data_dir: str = None,
-        chrome_channel: str = "chrome",
+        chrome_channel: str = "chromium",
-        proxy: str = None,
+        channel: str = "chromium",
        proxy: Optional[str] = None,
        proxy_config: dict = None,
-        viewport_width: int = 1920,
+        viewport_width: int = 1080,
-        viewport_height: int = 1080,
+        viewport_height: int = 600,
        accept_downloads: bool = False,
        downloads_path: str = None,
        storage_state=None,
@@ -91,23 +100,21 @@ class BrowserConfig:
        ),
        user_agent_mode: str = None,
        user_agent_generator_config: dict = None,
-        text_only: bool = False,
+        text_mode: bool = False,
        light_mode: bool = False,
        extra_args: list = None,
        debugging_port: int = 9222,
    ):
        self.browser_type = browser_type
        self.headless = headless
        self.use_managed_browser = use_managed_browser
        self.use_persistent_context = use_persistent_context
        self.user_data_dir = user_data_dir
-        if self.browser_type == "chromium":
+        self.chrome_channel = chrome_channel or self.browser_type or "chromium"
-            self.chrome_channel = "chrome"
+        self.channel = channel or self.browser_type or "chromium"
-        elif self.browser_type == "firefox":
+        if self.browser_type in ["firefox", "webkit"]:
-            self.chrome_channel = "firefox"
+            self.channel = ""
-        elif self.browser_type == "webkit":
+            self.chrome_channel = ""
            self.chrome_channel = "webkit"
        else:
            self.chrome_channel = chrome_channel or "chrome"
        self.proxy = proxy
        self.proxy_config = proxy_config
        self.viewport_width = viewport_width
@@ -122,17 +129,23 @@ class BrowserConfig:
        self.user_agent = user_agent
        self.user_agent_mode = user_agent_mode
        self.user_agent_generator_config = user_agent_generator_config
-        self.text_only = text_only
+        self.text_mode = text_mode
        self.light_mode = light_mode
        self.extra_args = extra_args if extra_args is not None else []
        self.sleep_on_close = sleep_on_close
        self.verbose = verbose
        self.debugging_port = debugging_port
        user_agenr_generator = UserAgentGenerator()
-        if self.user_agent_mode != "random":
+        if self.user_agent_mode != "random" and self.user_agent_generator_config:
            self.user_agent = user_agenr_generator.generate(
                **(self.user_agent_generator_config or {})
            )
        elif self.user_agent_mode == "random":
            self.user_agent = user_agenr_generator.generate()
        else:
            pass
        self.browser_hint = user_agenr_generator.generate_client_hints(self.user_agent)
        self.headers.setdefault("sec-ch-ua", self.browser_hint)
@@ -148,11 +161,12 @@ class BrowserConfig:
            use_managed_browser=kwargs.get("use_managed_browser", False),
            use_persistent_context=kwargs.get("use_persistent_context", False),
            user_data_dir=kwargs.get("user_data_dir"),
-            chrome_channel=kwargs.get("chrome_channel", "chrome"),
+            chrome_channel=kwargs.get("chrome_channel", "chromium"),
            channel=kwargs.get("channel", "chromium"),
            proxy=kwargs.get("proxy"),
            proxy_config=kwargs.get("proxy_config"),
-            viewport_width=kwargs.get("viewport_width", 1920),
+            viewport_width=kwargs.get("viewport_width", 1080),
-            viewport_height=kwargs.get("viewport_height", 1080),
+            viewport_height=kwargs.get("viewport_height", 600),
            accept_downloads=kwargs.get("accept_downloads", False),
            downloads_path=kwargs.get("downloads_path"),
            storage_state=kwargs.get("storage_state"),
@@ -160,17 +174,62 @@ class BrowserConfig:
            java_script_enabled=kwargs.get("java_script_enabled", True),
            cookies=kwargs.get("cookies", []),
            headers=kwargs.get("headers", {}),
-            user_agent=kwargs.get("user_agent",
+            user_agent=kwargs.get(
                "user_agent",
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
-                "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
+                "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36",
            ),
            user_agent_mode=kwargs.get("user_agent_mode"),
            user_agent_generator_config=kwargs.get("user_agent_generator_config"),
-            text_only=kwargs.get("text_only", False),
+            text_mode=kwargs.get("text_mode", False),
            light_mode=kwargs.get("light_mode", False),
-            extra_args=kwargs.get("extra_args", [])
+            extra_args=kwargs.get("extra_args", []),
        )
    def to_dict(self):
        return {
            "browser_type": self.browser_type,
            "headless": self.headless,
            "use_managed_browser": self.use_managed_browser,
            "use_persistent_context": self.use_persistent_context,
            "user_data_dir": self.user_data_dir,
            "chrome_channel": self.chrome_channel,
            "channel": self.channel,
            "proxy": self.proxy,
            "proxy_config": self.proxy_config,
            "viewport_width": self.viewport_width,
            "viewport_height": self.viewport_height,
            "accept_downloads": self.accept_downloads,
            "downloads_path": self.downloads_path,
            "storage_state": self.storage_state,
            "ignore_https_errors": self.ignore_https_errors,
            "java_script_enabled": self.java_script_enabled,
            "cookies": self.cookies,
            "headers": self.headers,
            "user_agent": self.user_agent,
            "user_agent_mode": self.user_agent_mode,
            "user_agent_generator_config": self.user_agent_generator_config,
            "text_mode": self.text_mode,
            "light_mode": self.light_mode,
            "extra_args": self.extra_args,
            "sleep_on_close": self.sleep_on_close,
            "verbose": self.verbose,
            "debugging_port": self.debugging_port,
        }
    def clone(self, **kwargs):
        """Create a copy of this configuration with updated values.
        Args:
            **kwargs: Key-value pairs of configuration options to update
        Returns:
            BrowserConfig: A new instance with the specified updates
        """
        config_dict = self.to_dict()
        config_dict.update(kwargs)
        return BrowserConfig.from_kwargs(config_dict)
 class CrawlerRunConfig:
    """
@@ -182,22 +241,45 @@ class CrawlerRunConfig:
    By using this class, you have a single place to understand and adjust the crawling options.
    Attributes:
        # Content Processing Parameters
        word_count_threshold (int): Minimum word count threshold before processing content.
                                    Default: MIN_WORD_THRESHOLD (typically 200).
        extraction_strategy (ExtractionStrategy or None): Strategy to extract structured data from crawled pages.
                                                          Default: None (NoExtractionStrategy is used if None).
        chunking_strategy (ChunkingStrategy): Strategy to chunk content before extraction.
                                              Default: RegexChunking().
        markdown_generator (MarkdownGenerationStrategy): Strategy for generating markdown.
                                                         Default: None.
        content_filter (RelevantContentFilter or None): Optional filter to prune irrelevant content.
                                                        Default: None.
        only_text (bool): If True, attempt to extract text-only content where applicable.
                          Default: False.
        css_selector (str or None): CSS selector to extract a specific portion of the page.
                                    Default: None.
        excluded_tags (list of str or None): List of HTML tags to exclude from processing.
                                             Default: None.
        excluded_selector (str or None): CSS selector to exclude from processing.
                                         Default: None.
        keep_data_attributes (bool): If True, retain `data-*` attributes while removing unwanted attributes.
                                     Default: False.
        remove_forms (bool): If True, remove all `<form>` elements from the HTML.
                             Default: False.
        prettiify (bool): If True, apply `fast_format_html` to produce prettified HTML output.
                          Default: False.
        parser_type (str): Type of parser to use for HTML parsing.
                           Default: "lxml".
        scraping_strategy (ContentScrapingStrategy): Scraping strategy to use.
                           Default: WebScrapingStrategy.
        proxy_config (dict or None): Detailed proxy configuration, e.g. {"server": "...", "username": "..."}.
                                     If None, no additional proxy config. Default: None.
        # Caching Parameters
        cache_mode (CacheMode or None): Defines how caching is handled.
                                        If None, defaults to CacheMode.ENABLED internally.
                                        Default: None.
        session_id (str or None): Optional session ID to persist the browser context and the created
                                  page instance. If the ID already exists, the crawler does not
-                                    create a new page and uses the current page to preserve the state;
+                                  create a new page and uses the current page to preserve the state.
                                     If not, it creates a new page and context, then stores them in
                                     memory with the given session ID.
        bypass_cache (bool): Legacy parameter, if True acts like CacheMode.BYPASS.
                             Default: False.
        disable_cache (bool): Legacy parameter, if True acts like CacheMode.DISABLED.
@@ -206,36 +288,34 @@ class CrawlerRunConfig:
                              Default: False.
        no_cache_write (bool): Legacy parameter, if True acts like CacheMode.READ_ONLY.
                               Default: False.
-        css_selector (str or None): CSS selector to extract a specific portion of the page.
+        shared_data (dict or None): Shared data to be passed between hooks.
                                     Default: None.
-        screenshot (bool): Whether to take a screenshot after crawling.
+
-                           Default: False.
+        # Page Navigation and Timing Parameters
        pdf (bool): Whether to generate a PDF of the page.
                    Default: False.
        verbose (bool): Enable verbose logging.
                        Default: True.
        only_text (bool): If True, attempt to extract text-only content where applicable.
                          Default: False.
        image_description_min_word_threshold (int): Minimum words for image description extraction.
                                                    Default: IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD (e.g., 50).
        prettiify (bool): If True, apply `fast_format_html` to produce prettified HTML output.
                          Default: False.
        js_code (str or list of str or None): JavaScript code/snippets to run on the page.
                                              Default: None.
        wait_for (str or None): A CSS selector or JS condition to wait for before extracting content.
                                Default: None.
        js_only (bool): If True, indicates subsequent calls are JS-driven updates, not full page loads.
                        Default: False.
        wait_until (str): The condition to wait for when navigating, e.g. "domcontentloaded".
                          Default: "domcontentloaded".
        page_timeout (int): Timeout in ms for page operations like navigation.
                            Default: 60000 (60 seconds).
        wait_for (str or None): A CSS selector or JS condition to wait for before extracting content.
                                Default: None.
        wait_for_images (bool): If True, wait for images to load before extracting content.
                                Default: False.
        delay_before_return_html (float): Delay in seconds before retrieving final HTML.
                                          Default: 0.1.
        mean_delay (float): Mean base delay between requests when calling arun_many.
                            Default: 0.1.
        max_range (float): Max random additional delay range for requests in arun_many.
                           Default: 0.3.
        semaphore_count (int): Number of concurrent operations allowed.
                               Default: 5.
        # Page Interaction Parameters
        js_code (str or list of str or None): JavaScript code/snippets to run on the page.
                                              Default: None.
        js_only (bool): If True, indicates subsequent calls are JS-driven updates, not full page loads.
                        Default: False.
        ignore_body_visibility (bool): If True, ignore whether the body is visible before proceeding.
                                       Default: True.
        wait_for_images (bool): If True, wait for images to load before extracting content. 
                                Default: True.
        adjust_viewport_to_content (bool): If True, adjust viewport according to the page content dimensions.
                                           Default: False.
        scan_full_page (bool): If True, scroll through the entire page to load all content.
                               Default: False.
        scroll_delay (float): Delay in seconds between scroll steps if scan_full_page is True.
@@ -244,163 +324,392 @@ class CrawlerRunConfig:
                                Default: False.
        remove_overlay_elements (bool): If True, remove overlays/popups before extracting HTML.
                                        Default: False.
        delay_before_return_html (float): Delay in seconds before retrieving final HTML.
                                          Default: 0.1.
        log_console (bool): If True, log console messages from the page.
                            Default: False.
        simulate_user (bool): If True, simulate user interactions (mouse moves, clicks) for anti-bot measures.
                              Default: False.
        override_navigator (bool): If True, overrides navigator properties for more human-like behavior.
                                   Default: False.
        magic (bool): If True, attempts automatic handling of overlays/popups.
                      Default: False.
        adjust_viewport_to_content (bool): If True, adjust viewport according to the page content dimensions.
                                           Default: False.
        # Media Handling Parameters
        screenshot (bool): Whether to take a screenshot after crawling.
                           Default: False.
        screenshot_wait_for (float or None): Additional wait time before taking a screenshot.
                                             Default: None.
        screenshot_height_threshold (int): Threshold for page height to decide screenshot strategy.
                                           Default: SCREENSHOT_HEIGHT_TRESHOLD (from config, e.g. 20000).
-        mean_delay (float): Mean base delay between requests when calling arun_many.
+        pdf (bool): Whether to generate a PDF of the page.
-                            Default: 0.1.
+                    Default: False.
-        max_range (float): Max random additional delay range for requests in arun_many.
+        image_description_min_word_threshold (int): Minimum words for image description extraction.
-                           Default: 0.3.
+                                                    Default: IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD (e.g., 50).
-        # session_id and semaphore_count might be set at runtime, not needed as defaults here.
+        image_score_threshold (int): Minimum score threshold for processing an image.
                                     Default: IMAGE_SCORE_THRESHOLD (e.g., 3).
        exclude_external_images (bool): If True, exclude all external images from processing.
                                         Default: False.
        # Link and Domain Handling Parameters
        exclude_social_media_domains (list of str): List of domains to exclude for social media links.
                                                    Default: SOCIAL_MEDIA_DOMAINS (from config).
        exclude_external_links (bool): If True, exclude all external links from the results.
                                       Default: False.
        exclude_social_media_links (bool): If True, exclude links pointing to social media domains.
                                           Default: False.
        exclude_domains (list of str): List of specific domains to exclude from results.
                                       Default: [].
        # Debugging and Logging Parameters
        verbose (bool): Enable verbose logging.
                        Default: True.
        log_console (bool): If True, log console messages from the page.
                            Default: False.
        # Streaming Parameters
        stream (bool): If True, enables streaming of crawled URLs as they are processed when used with arun_many.
                      Default: False.
        # Optional Parameters
        url (str or None): Optional URL for this run; not a compulsory parameter.
                           Default: None.
        check_robots_txt (bool): Whether to check robots.txt rules before crawling.
                                 Default: False.
    """
    def __init__(
        self,
-        word_count_threshold: int =  MIN_WORD_THRESHOLD ,
-        extraction_strategy : ExtractionStrategy=None,  # Will default to NoExtractionStrategy if None
-        chunking_strategy : ChunkingStrategy= None,    # Will default to RegexChunking if None
-        markdown_generator : MarkdownGenerationStrategy = None,
+        # Content Processing Parameters
+        word_count_threshold: int = MIN_WORD_THRESHOLD,
+        extraction_strategy: ExtractionStrategy = None,
+        chunking_strategy: ChunkingStrategy = RegexChunking(),
        markdown_generator: MarkdownGenerationStrategy = None,
        content_filter=None,
        only_text: bool = False,
        css_selector: str = None,
        excluded_tags: list = None,
        excluded_selector: str = None,
        keep_data_attributes: bool = False,
        remove_forms: bool = False,
        prettiify: bool = False,
        parser_type: str = "lxml",
        scraping_strategy: ContentScrapingStrategy = None,
        proxy_config: dict = None,
        # SSL Parameters
        fetch_ssl_certificate: bool = False,
        # Caching Parameters
        cache_mode=None,
        session_id: str = None,
        bypass_cache: bool = False,
        disable_cache: bool = False,
        no_cache_read: bool = False,
        no_cache_write: bool = False,
-        css_selector: str = None,
+        shared_data: dict = None,
-        screenshot: bool = False,
+        # Page Navigation and Timing Parameters
        pdf: bool = False,
        verbose: bool = True,
        only_text: bool = False,
        image_description_min_word_threshold: int = IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
        prettiify: bool = False,
        js_code=None,
        wait_for: str = None,
        js_only: bool = False,
        wait_until: str = "domcontentloaded",
        page_timeout: int = PAGE_TIMEOUT,
        wait_for: str = None,
        wait_for_images: bool = False,
        delay_before_return_html: float = 0.1,
        mean_delay: float = 0.1,
        max_range: float = 0.3,
        semaphore_count: int = 5,
        # Page Interaction Parameters
        js_code: Union[str, List[str]] = None,
        js_only: bool = False,
        ignore_body_visibility: bool = True,
        wait_for_images: bool = True,
        adjust_viewport_to_content: bool = False,
        scan_full_page: bool = False,
        scroll_delay: float = 0.2,
        process_iframes: bool = False,
        remove_overlay_elements: bool = False,
        delay_before_return_html: float = 0.1,
        log_console: bool = False,
        simulate_user: bool = False,
        override_navigator: bool = False,
        magic: bool = False,
        adjust_viewport_to_content: bool = False,
        # Media Handling Parameters
        screenshot: bool = False,
        screenshot_wait_for: float = None,
        screenshot_height_threshold: int = SCREENSHOT_HEIGHT_TRESHOLD,
-        mean_delay: float = 0.1,
+        pdf: bool = False,
-        max_range: float = 0.3,
+        image_description_min_word_threshold: int = IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
-        semaphore_count: int = 5,
+        image_score_threshold: int = IMAGE_SCORE_THRESHOLD,
        exclude_external_images: bool = False,
        # Link and Domain Handling Parameters
        exclude_social_media_domains: list = None,
        exclude_external_links: bool = False,
        exclude_social_media_links: bool = False,
        exclude_domains: list = None,
        # Debugging and Logging Parameters
        verbose: bool = True,
        log_console: bool = False,
        # Streaming Parameters
        stream: bool = False,
        url: str = None,
        check_robots_txt: bool = False,
    ):
        self.url = url
        # Content Processing Parameters
        self.word_count_threshold = word_count_threshold
        self.extraction_strategy = extraction_strategy
        self.chunking_strategy = chunking_strategy
        self.markdown_generator = markdown_generator
        self.content_filter = content_filter
        self.only_text = only_text
        self.css_selector = css_selector
        self.excluded_tags = excluded_tags or []
        self.excluded_selector = excluded_selector or ""
        self.keep_data_attributes = keep_data_attributes
        self.remove_forms = remove_forms
        self.prettiify = prettiify
        self.parser_type = parser_type
        self.scraping_strategy = scraping_strategy or WebScrapingStrategy()
        self.proxy_config = proxy_config
        # SSL Parameters
        self.fetch_ssl_certificate = fetch_ssl_certificate
        # Caching Parameters
        self.cache_mode = cache_mode
        self.session_id = session_id
        self.bypass_cache = bypass_cache
        self.disable_cache = disable_cache
        self.no_cache_read = no_cache_read
        self.no_cache_write = no_cache_write
-        self.css_selector = css_selector
+        self.shared_data = shared_data
-        self.screenshot = screenshot
+
-        self.pdf = pdf
+        # Page Navigation and Timing Parameters
        self.verbose = verbose
        self.only_text = only_text
        self.image_description_min_word_threshold = image_description_min_word_threshold
        self.prettiify = prettiify
        self.js_code = js_code
        self.wait_for = wait_for
        self.js_only = js_only
        self.wait_until = wait_until
        self.page_timeout = page_timeout
-        self.ignore_body_visibility = ignore_body_visibility
+        self.wait_for = wait_for
        self.wait_for_images = wait_for_images
        self.adjust_viewport_to_content = adjust_viewport_to_content
        self.scan_full_page = scan_full_page
        self.scroll_delay = scroll_delay
        self.process_iframes = process_iframes
        self.remove_overlay_elements = remove_overlay_elements
        self.delay_before_return_html = delay_before_return_html
        self.log_console = log_console
        self.simulate_user = simulate_user
        self.override_navigator = override_navigator
        self.magic = magic
        self.screenshot_wait_for = screenshot_wait_for
        self.screenshot_height_threshold = screenshot_height_threshold
        self.mean_delay = mean_delay
        self.max_range = max_range
        self.semaphore_count = semaphore_count
        # Page Interaction Parameters
        self.js_code = js_code
        self.js_only = js_only
        self.ignore_body_visibility = ignore_body_visibility
        self.scan_full_page = scan_full_page
        self.scroll_delay = scroll_delay
        self.process_iframes = process_iframes
        self.remove_overlay_elements = remove_overlay_elements
        self.simulate_user = simulate_user
        self.override_navigator = override_navigator
        self.magic = magic
        self.adjust_viewport_to_content = adjust_viewport_to_content
        # Media Handling Parameters
        self.screenshot = screenshot
        self.screenshot_wait_for = screenshot_wait_for
        self.screenshot_height_threshold = screenshot_height_threshold
        self.pdf = pdf
        self.image_description_min_word_threshold = image_description_min_word_threshold
        self.image_score_threshold = image_score_threshold
        self.exclude_external_images = exclude_external_images
        # Link and Domain Handling Parameters
        self.exclude_social_media_domains = (
            exclude_social_media_domains or SOCIAL_MEDIA_DOMAINS
        )
        self.exclude_external_links = exclude_external_links
        self.exclude_social_media_links = exclude_social_media_links
        self.exclude_domains = exclude_domains or []
        # Debugging and Logging Parameters
        self.verbose = verbose
        self.log_console = log_console
        # Streaming Parameters
        self.stream = stream
        # Robots.txt Handling Parameters
        self.check_robots_txt = check_robots_txt
        # Validate type of extraction strategy and chunking strategy if they are provided
-        if self.extraction_strategy is not None and not isinstance(self.extraction_strategy, ExtractionStrategy):
-            raise ValueError("extraction_strategy must be an instance of ExtractionStrategy")
-        if self.chunking_strategy is not None and not isinstance(self.chunking_strategy, ChunkingStrategy):
-            raise ValueError("chunking_strategy must be an instance of ChunkingStrategy")
+        if self.extraction_strategy is not None and not isinstance(
+            self.extraction_strategy, ExtractionStrategy
+        ):
+            raise ValueError(
                "extraction_strategy must be an instance of ExtractionStrategy"
            )
        if self.chunking_strategy is not None and not isinstance(
            self.chunking_strategy, ChunkingStrategy
        ):
            raise ValueError(
                "chunking_strategy must be an instance of ChunkingStrategy"
            )
        # Set default chunking strategy if None
        if self.chunking_strategy is None:
            from .chunking_strategy import RegexChunking
            self.chunking_strategy = RegexChunking()
    @staticmethod
    def from_kwargs(kwargs: dict) -> "CrawlerRunConfig":
        return CrawlerRunConfig(
            # Content Processing Parameters
            word_count_threshold=kwargs.get("word_count_threshold", 200),
            extraction_strategy=kwargs.get("extraction_strategy"),
-            chunking_strategy=kwargs.get("chunking_strategy"),
+            chunking_strategy=kwargs.get("chunking_strategy", RegexChunking()),
            markdown_generator=kwargs.get("markdown_generator"),
            content_filter=kwargs.get("content_filter"),
            only_text=kwargs.get("only_text", False),
            css_selector=kwargs.get("css_selector"),
            excluded_tags=kwargs.get("excluded_tags", []),
            excluded_selector=kwargs.get("excluded_selector", ""),
            keep_data_attributes=kwargs.get("keep_data_attributes", False),
            remove_forms=kwargs.get("remove_forms", False),
            prettiify=kwargs.get("prettiify", False),
            parser_type=kwargs.get("parser_type", "lxml"),
            scraping_strategy=kwargs.get("scraping_strategy"),
            proxy_config=kwargs.get("proxy_config"),
            # SSL Parameters
            fetch_ssl_certificate=kwargs.get("fetch_ssl_certificate", False),
            # Caching Parameters
            cache_mode=kwargs.get("cache_mode"),
            session_id=kwargs.get("session_id"),
            bypass_cache=kwargs.get("bypass_cache", False),
            disable_cache=kwargs.get("disable_cache", False),
            no_cache_read=kwargs.get("no_cache_read", False),
            no_cache_write=kwargs.get("no_cache_write", False),
-            css_selector=kwargs.get("css_selector"),
+            shared_data=kwargs.get("shared_data", None),
-            screenshot=kwargs.get("screenshot", False),
+            # Page Navigation and Timing Parameters
            pdf=kwargs.get("pdf", False),
            verbose=kwargs.get("verbose", True),
            only_text=kwargs.get("only_text", False),
            image_description_min_word_threshold=kwargs.get("image_description_min_word_threshold",  IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD),
            prettiify=kwargs.get("prettiify", False),
            js_code=kwargs.get("js_code"), # If not provided here, will default inside constructor
            wait_for=kwargs.get("wait_for"),
            js_only=kwargs.get("js_only", False),
            wait_until=kwargs.get("wait_until", "domcontentloaded"),
            page_timeout=kwargs.get("page_timeout", 60000),
            wait_for=kwargs.get("wait_for"),
            wait_for_images=kwargs.get("wait_for_images", False),
            delay_before_return_html=kwargs.get("delay_before_return_html", 0.1),
            mean_delay=kwargs.get("mean_delay", 0.1),
            max_range=kwargs.get("max_range", 0.3),
            semaphore_count=kwargs.get("semaphore_count", 5),
            # Page Interaction Parameters
            js_code=kwargs.get("js_code"),
            js_only=kwargs.get("js_only", False),
            ignore_body_visibility=kwargs.get("ignore_body_visibility", True),
            adjust_viewport_to_content=kwargs.get("adjust_viewport_to_content", False),
            scan_full_page=kwargs.get("scan_full_page", False),
            scroll_delay=kwargs.get("scroll_delay", 0.2),
            process_iframes=kwargs.get("process_iframes", False),
            remove_overlay_elements=kwargs.get("remove_overlay_elements", False),
            delay_before_return_html=kwargs.get("delay_before_return_html", 0.1),
            log_console=kwargs.get("log_console", False),
            simulate_user=kwargs.get("simulate_user", False),
            override_navigator=kwargs.get("override_navigator", False),
            magic=kwargs.get("magic", False),
            adjust_viewport_to_content=kwargs.get("adjust_viewport_to_content", False),
            # Media Handling Parameters
            screenshot=kwargs.get("screenshot", False),
            screenshot_wait_for=kwargs.get("screenshot_wait_for"),
-            screenshot_height_threshold=kwargs.get("screenshot_height_threshold", 20000),
-            mean_delay=kwargs.get("mean_delay", 0.1),
-            max_range=kwargs.get("max_range", 0.3),
-            semaphore_count=kwargs.get("semaphore_count", 5)
+            screenshot_height_threshold=kwargs.get(
+                "screenshot_height_threshold", SCREENSHOT_HEIGHT_TRESHOLD
+            ),
+            pdf=kwargs.get("pdf", False),
            image_description_min_word_threshold=kwargs.get(
                "image_description_min_word_threshold",
                IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
            ),
            image_score_threshold=kwargs.get(
                "image_score_threshold", IMAGE_SCORE_THRESHOLD
            ),
            exclude_external_images=kwargs.get("exclude_external_images", False),
            # Link and Domain Handling Parameters
            exclude_social_media_domains=kwargs.get(
                "exclude_social_media_domains", SOCIAL_MEDIA_DOMAINS
            ),
            exclude_external_links=kwargs.get("exclude_external_links", False),
            exclude_social_media_links=kwargs.get("exclude_social_media_links", False),
            exclude_domains=kwargs.get("exclude_domains", []),
            # Debugging and Logging Parameters
            verbose=kwargs.get("verbose", True),
            log_console=kwargs.get("log_console", False),
            # Streaming Parameters
            stream=kwargs.get("stream", False),
            url=kwargs.get("url"),
            check_robots_txt=kwargs.get("check_robots_txt", False),
        )
    # Return the configuration as a dict
    def to_dict(self):
        return {
            "word_count_threshold": self.word_count_threshold,
            "extraction_strategy": self.extraction_strategy,
            "chunking_strategy": self.chunking_strategy,
            "markdown_generator": self.markdown_generator,
            "content_filter": self.content_filter,
            "only_text": self.only_text,
            "css_selector": self.css_selector,
            "excluded_tags": self.excluded_tags,
            "excluded_selector": self.excluded_selector,
            "keep_data_attributes": self.keep_data_attributes,
            "remove_forms": self.remove_forms,
            "prettiify": self.prettiify,
            "parser_type": self.parser_type,
            "scraping_strategy": self.scraping_strategy,
            "proxy_config": self.proxy_config,
            "fetch_ssl_certificate": self.fetch_ssl_certificate,
            "cache_mode": self.cache_mode,
            "session_id": self.session_id,
            "bypass_cache": self.bypass_cache,
            "disable_cache": self.disable_cache,
            "no_cache_read": self.no_cache_read,
            "no_cache_write": self.no_cache_write,
            "shared_data": self.shared_data,
            "wait_until": self.wait_until,
            "page_timeout": self.page_timeout,
            "wait_for": self.wait_for,
            "wait_for_images": self.wait_for_images,
            "delay_before_return_html": self.delay_before_return_html,
            "mean_delay": self.mean_delay,
            "max_range": self.max_range,
            "semaphore_count": self.semaphore_count,
            "js_code": self.js_code,
            "js_only": self.js_only,
            "ignore_body_visibility": self.ignore_body_visibility,
            "scan_full_page": self.scan_full_page,
            "scroll_delay": self.scroll_delay,
            "process_iframes": self.process_iframes,
            "remove_overlay_elements": self.remove_overlay_elements,
            "simulate_user": self.simulate_user,
            "override_navigator": self.override_navigator,
            "magic": self.magic,
            "adjust_viewport_to_content": self.adjust_viewport_to_content,
            "screenshot": self.screenshot,
            "screenshot_wait_for": self.screenshot_wait_for,
            "screenshot_height_threshold": self.screenshot_height_threshold,
            "pdf": self.pdf,
            "image_description_min_word_threshold": self.image_description_min_word_threshold,
            "image_score_threshold": self.image_score_threshold,
            "exclude_external_images": self.exclude_external_images,
            "exclude_social_media_domains": self.exclude_social_media_domains,
            "exclude_external_links": self.exclude_external_links,
            "exclude_social_media_links": self.exclude_social_media_links,
            "exclude_domains": self.exclude_domains,
            "verbose": self.verbose,
            "log_console": self.log_console,
            "stream": self.stream,
            "url": self.url,
            "check_robots_txt": self.check_robots_txt,
        }
    def clone(self, **kwargs):
        """Create a copy of this configuration with updated values.
        Args:
            **kwargs: Key-value pairs of configuration options to update
        Returns:
            CrawlerRunConfig: A new instance with the specified updates
        Example:
            ```python
            # Create a new config with streaming enabled
            stream_config = config.clone(stream=True)
            # Create a new config with multiple updates
            new_config = config.clone(
                stream=True,
                cache_mode=CacheMode.BYPASS,
                verbose=True
            )
            ```
        """
        config_dict = self.to_dict()
        config_dict.update(kwargs)
        return CrawlerRunConfig.from_kwargs(config_dict)
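
The `clone` method above follows a copy-then-override pattern: dump the object to a dict, overlay the caller's keyword arguments, and rebuild through `from_kwargs`. A minimal sketch of that pattern is below. `MiniRunConfig` is a hypothetical three-field stand-in, not the real `CrawlerRunConfig`, and it only illustrates the flow.

```python
class MiniRunConfig:
    """Hypothetical stand-in for CrawlerRunConfig with only three options."""

    def __init__(self, stream: bool = False, verbose: bool = True, url: str = None):
        self.stream = stream
        self.verbose = verbose
        self.url = url

    @staticmethod
    def from_kwargs(kwargs: dict) -> "MiniRunConfig":
        # Only known keys are read; anything else in kwargs is ignored.
        return MiniRunConfig(
            stream=kwargs.get("stream", False),
            verbose=kwargs.get("verbose", True),
            url=kwargs.get("url"),
        )

    def to_dict(self) -> dict:
        return {"stream": self.stream, "verbose": self.verbose, "url": self.url}

    def clone(self, **kwargs) -> "MiniRunConfig":
        config_dict = self.to_dict()   # snapshot of the current values
        config_dict.update(kwargs)     # caller overrides win
        return MiniRunConfig.from_kwargs(config_dict)


base = MiniRunConfig(url="https://example.com")
streaming = base.clone(stream=True)
```

In this sketch the original object is left untouched and the clone is a new instance. Values are copied by reference, so mutable values (lists, strategy objects) would be shared between the two unless copied explicitly.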
--- a/crawl4ai/async_crawler_strategy.py
+++ b/crawl4ai/async_crawler_strategy.py
--- a/crawl4ai/async_database.py
+++ b/crawl4ai/async_database.py
@@ -1,27 +1,30 @@
-import os, sys
+import os
 from pathlib import Path
 import aiosqlite
 import asyncio
-from typing import Optional, Tuple, Dict
+from typing import Optional, Dict
 from contextlib import asynccontextmanager
 import logging
 import json  # Added for serialization/deserialization
 from .utils import ensure_content_dirs, generate_content_hash
-from .models import CrawlResult
+from .models import CrawlResult, MarkdownGenerationResult
 import xxhash
 import aiofiles
 from .config import NEED_MIGRATION
 from .version_manager import VersionManager
 from .async_logger import AsyncLogger
 from .utils import get_error_context, create_box_message
 # Set up logging
 logging.basicConfig(level=logging.INFO)
 logger = logging.getLogger(__name__)
-base_directory = DB_PATH = os.path.join(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai")
+# Set up logging
 # logging.basicConfig(level=logging.INFO)
 # logger = logging.getLogger(__name__)
 # logger.setLevel(logging.INFO)
 base_directory = DB_PATH = os.path.join(
    os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai"
 )
 os.makedirs(DB_PATH, exist_ok=True)
 DB_PATH = os.path.join(base_directory, "crawl4ai.db")
 class AsyncDatabaseManager:
    def __init__(self, pool_size: int = 10, max_retries: int = 3):
        self.db_path = DB_PATH
@@ -37,10 +40,9 @@ class AsyncDatabaseManager:
        self.logger = AsyncLogger(
            log_file=os.path.join(base_directory, ".crawl4ai", "crawler_db.log"),
            verbose=False,
-            tag_width=10
+            tag_width=10,
        )
    async def initialize(self):
        """Initialize the database and connection pool"""
        try:
@@ -67,28 +69,32 @@ class AsyncDatabaseManager:
            if needs_update:
                self.logger.info("New version detected, running updates", tag="INIT")
                await self.update_db_schema()
-                from .migrations import run_migration  # Import here to avoid circular imports
+                from .migrations import (
                    run_migration,
                )  # Import here to avoid circular imports
                await run_migration()
                self.version_manager.update_version()  # Update stored version after successful migration
-                self.logger.success("Version update completed successfully", tag="COMPLETE")
+                self.logger.success(
                    "Version update completed successfully", tag="COMPLETE"
                )
            else:
-                self.logger.success("Database initialization completed successfully", tag="COMPLETE")
-
+                self.logger.success(
+                    "Database initialization completed successfully", tag="COMPLETE"
+                )
        except Exception as e:
            self.logger.error(
                message="Database initialization error: {error}",
                tag="ERROR",
-                params={"error": str(e)}
+                params={"error": str(e)},
            )
            self.logger.info(
-                message="Database will be initialized on first use",
-                tag="INIT"
+                message="Database will be initialized on first use", tag="INIT"
            )
            raise
    async def cleanup(self):
        """Cleanup connections when shutting down"""
        async with self.pool_lock:
@@ -107,6 +113,7 @@ class AsyncDatabaseManager:
                        self._initialized = True
                    except Exception as e:
                        import sys
                        error_context = get_error_context(sys.exc_info())
                        self.logger.error(
                            message="Database initialization failed:\n{error}\n\nContext:\n{context}\n\nTraceback:\n{traceback}",
@@ -115,8 +122,8 @@ class AsyncDatabaseManager:
                            params={
                                "error": str(e),
                                "context": error_context["code_context"],
-                                "traceback": error_context["full_traceback"]
+                                "traceback": error_context["full_traceback"],
-                            }
+                            },
                        )
                        raise
@@ -127,29 +134,40 @@ class AsyncDatabaseManager:
            async with self.pool_lock:
                if task_id not in self.connection_pool:
                    try:
-                        conn = await aiosqlite.connect(
-                            self.db_path,
-                            timeout=30.0
-                        )
-                        await conn.execute('PRAGMA journal_mode = WAL')
-                        await conn.execute('PRAGMA busy_timeout = 5000')
+                        conn = await aiosqlite.connect(self.db_path, timeout=30.0)
+                        await conn.execute("PRAGMA journal_mode = WAL")
+                        await conn.execute("PRAGMA busy_timeout = 5000")
                        # Verify database structure
-                        async with conn.execute("PRAGMA table_info(crawled_data)") as cursor:
+                        async with conn.execute(
                            "PRAGMA table_info(crawled_data)"
                        ) as cursor:
                            columns = await cursor.fetchall()
                            column_names = [col[1] for col in columns]
                            expected_columns = {
-                                'url', 'html', 'cleaned_html', 'markdown', 'extracted_content',
+                                "url",
-                                'success', 'media', 'links', 'metadata', 'screenshot',
+                                "html",
-                                'response_headers', 'downloaded_files'
+                                "cleaned_html",
                                "markdown",
                                "extracted_content",
                                "success",
                                "media",
                                "links",
                                "metadata",
                                "screenshot",
                                "response_headers",
                                "downloaded_files",
                            }
                            missing_columns = expected_columns - set(column_names)
                            if missing_columns:
-                                raise ValueError(f"Database missing columns: {missing_columns}")
+                                raise ValueError(
                                    f"Database missing columns: {missing_columns}"
                                )
                        self.connection_pool[task_id] = conn
                    except Exception as e:
                        import sys
                        error_context = get_error_context(sys.exc_info())
                        error_message = (
                            f"Unexpected error in db get_connection at line {error_context['line_no']} "
@@ -158,7 +176,7 @@ class AsyncDatabaseManager:
                            f"Code context:\n{error_context['code_context']}"
                        )
                        self.logger.error(
-                            message=create_box_message(error_message, type= "error"),
+                            message=create_box_message(error_message, type="error"),
                        )
                        raise
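
The hunk above verifies the table layout by reading `PRAGMA table_info(crawled_data)` and comparing the column names against an expected set. A minimal sketch of that check follows. It uses the standard-library `sqlite3` module rather than `aiosqlite` so that it runs without extra dependencies, and the helper name `missing_columns` and the shortened expected set are assumptions made for illustration.

```python
import sqlite3

# Shortened, illustrative subset of the expected columns.
EXPECTED_COLUMNS = {"url", "html", "markdown"}


def missing_columns(conn: sqlite3.Connection, table: str, expected: set) -> set:
    """Return the expected column names that the table does not have."""
    # Each table_info row is (cid, name, type, notnull, dflt_value, pk).
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    column_names = {row[1] for row in rows}
    return expected - column_names


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crawled_data (url TEXT PRIMARY KEY, html TEXT)")
missing = missing_columns(conn, "crawled_data", EXPECTED_COLUMNS)
if missing:
    print(f"Database missing columns: {missing}")
```

The table name is interpolated into the statement because PRAGMA arguments cannot be bound as parameters, so it should only ever come from trusted code, as it does in the diff.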
@@ -167,6 +185,7 @@ class AsyncDatabaseManager:
        except Exception as e:
            import sys
            error_context = get_error_context(sys.exc_info())
            error_message = (
                f"Unexpected error in db get_connection at line {error_context['line_no']} "
@@ -175,7 +194,7 @@ class AsyncDatabaseManager:
                f"Code context:\n{error_context['code_context']}"
            )
            self.logger.error(
-                message=create_box_message(error_message, type= "error"),
+                message=create_box_message(error_message, type="error"),
            )
            raise
        finally:
@@ -185,7 +204,6 @@ class AsyncDatabaseManager:
                    del self.connection_pool[task_id]
            self.connection_semaphore.release()
    async def execute_with_retry(self, operation, *args):
        """Execute database operations with retry logic"""
        for attempt in range(self.max_retries):
@@ -200,10 +218,7 @@ class AsyncDatabaseManager:
                        message="Operation failed after {retries} attempts: {error}",
                        tag="ERROR",
                        force_verbose=True,
-                        params={
-                            "retries": self.max_retries,
-                            "error": str(e)
-                        }
+                        params={"retries": self.max_retries, "error": str(e)},
                    )
                    raise
                await asyncio.sleep(1 * (attempt + 1))  # Linear backoff: wait 1s, 2s, ... between attempts
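
The delay in `execute_with_retry` grows linearly with the attempt number (`1 * (attempt + 1)` seconds). The sketch below isolates that retry loop so the schedule is easy to see. `retry_with_backoff`, `base_delay`, and the `flaky` operation are hypothetical names introduced for the example and are not part of the module.

```python
import asyncio


async def retry_with_backoff(operation, max_retries: int = 3, base_delay: float = 1.0):
    """Await operation() up to max_retries times, sleeping longer after each failure."""
    for attempt in range(max_retries):
        try:
            return await operation()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            # Linear backoff: base_delay, 2 * base_delay, ...
            # An exponential schedule would use base_delay * 2 ** attempt instead.
            await asyncio.sleep(base_delay * (attempt + 1))


calls = {"n": 0}


async def flaky():
    """Fail on the first two calls, then succeed."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("database is locked")
    return "ok"


result = asyncio.run(retry_with_backoff(flaky, base_delay=0.01))
```

With the defaults used in the diff (three retries, a one-second base), the waits between attempts would be 1 s and 2 s.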
@@ -211,7 +226,8 @@ class AsyncDatabaseManager:
    async def ainit_db(self):
        """Initialize database schema"""
        async with aiosqlite.connect(self.db_path, timeout=30.0) as db:
-            await db.execute('''
+            await db.execute(
                """
                CREATE TABLE IF NOT EXISTS crawled_data (
                    url TEXT PRIMARY KEY,
                    html TEXT,
@@ -226,11 +242,10 @@ class AsyncDatabaseManager:
                    response_headers TEXT DEFAULT "{}",
                    downloaded_files TEXT DEFAULT "{}"  -- New column added
                )
-            ''')
+            """
            )
            await db.commit()
    async def update_db_schema(self):
        """Update database schema if needed"""
        async with aiosqlite.connect(self.db_path, timeout=30.0) as db:
@@ -239,7 +254,14 @@ class AsyncDatabaseManager:
            column_names = [column[1] for column in columns]
            # List of new columns to add
-            new_columns = ['media', 'links', 'metadata', 'screenshot', 'response_headers', 'downloaded_files']
+            new_columns = [
                "media",
                "links",
                "metadata",
                "screenshot",
                "response_headers",
                "downloaded_files",
            ]
            for column in new_columns:
                if column not in column_names:
@@ -248,22 +270,26 @@ class AsyncDatabaseManager:
    async def aalter_db_add_column(self, new_column: str, db):
        """Add new column to the database"""
-        if new_column == 'response_headers':
+        if new_column == "response_headers":
-            await db.execute(f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT "{{}}"')
+            await db.execute(
                f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT "{{}}"'
            )
        else:
-            await db.execute(f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""')
+            await db.execute(
                f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""'
            )
        self.logger.info(
            message="Added column '{column}' to the database",
            tag="INIT",
-            params={"column": new_column}
+            params={"column": new_column},
        )
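
`update_db_schema` and `aalter_db_add_column` together form a small additive migration: list the existing columns, then `ALTER TABLE ... ADD COLUMN` for each one that is missing, with a `{}` default for `response_headers` and an empty string otherwise. A rough sketch with the standard-library `sqlite3` module is below. The helper `add_missing_columns` is hypothetical, and the sketch writes the defaults as single-quoted SQL string literals, whereas the diff uses double quotes.

```python
import sqlite3


def add_missing_columns(conn: sqlite3.Connection, table: str, wanted: list) -> list:
    """Add each wanted TEXT column the table lacks; return the names that were added."""
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    added = []
    for column in wanted:
        if column not in existing:
            # JSON-like column gets '{}' as its default, everything else ''.
            default = "'{}'" if column == "response_headers" else "''"
            conn.execute(
                f"ALTER TABLE {table} ADD COLUMN {column} TEXT DEFAULT {default}"
            )
            added.append(column)
    conn.commit()
    return added


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crawled_data (url TEXT PRIMARY KEY, media TEXT)")
conn.execute("INSERT INTO crawled_data (url) VALUES ('https://example.com')")

added = add_missing_columns(conn, "crawled_data", ["media", "links", "response_headers"])
row = conn.execute("SELECT links, response_headers FROM crawled_data").fetchone()
```

Because existing columns are skipped, running the helper a second time adds nothing, which makes it safe to call on every startup.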
    async def aget_cached_url(self, url: str) -> Optional[CrawlResult]:
        """Retrieve cached URL data as CrawlResult"""
        async def _get(db):
            async with db.execute(
-                'SELECT * FROM crawled_data WHERE url = ?', (url,)
+                "SELECT * FROM crawled_data WHERE url = ?", (url,)
            ) as cursor:
                row = await cursor.fetchone()
                if not row:
@@ -276,37 +302,58 @@ class AsyncDatabaseManager:
                # Load content from files using stored hashes
                content_fields = {
-                    'html': row_dict['html'],
+                    "html": row_dict["html"],
-                    'cleaned_html': row_dict['cleaned_html'],
+                    "cleaned_html": row_dict["cleaned_html"],
-                    'markdown': row_dict['markdown'],
+                    "markdown": row_dict["markdown"],
-                    'extracted_content': row_dict['extracted_content'],
+                    "extracted_content": row_dict["extracted_content"],
-                    'screenshot': row_dict['screenshot'],
+                    "screenshot": row_dict["screenshot"],
-                    'screenshots': row_dict['screenshot'],
+                    "screenshots": row_dict["screenshot"],
                }
                for field, hash_value in content_fields.items():
                    if hash_value:
                        content = await self._load_content(
                            hash_value,
-                            field.split('_')[0]  # Get content type from field name
+                            field.split("_")[0],  # Get content type from field name
                        )
                        row_dict[field] = content or ""
                    else:
                        row_dict[field] = ""
                # Parse JSON fields
-                json_fields = ['media', 'links', 'metadata', 'response_headers']
+                json_fields = [
                    "media",
                    "links",
                    "metadata",
                    "response_headers",
                    "markdown",
                ]
                for field in json_fields:
                    try:
-                        row_dict[field] = json.loads(row_dict[field]) if row_dict[field] else {}
+                        row_dict[field] = (
                            json.loads(row_dict[field]) if row_dict[field] else {}
                        )
                    except json.JSONDecodeError:
                        # Entries cached before this change store markdown as a
                        # plain string rather than JSON; keep that string as-is.
                        if not (
                            field == "markdown" and isinstance(row_dict[field], str)
                        ):
                            row_dict[field] = {}
                if isinstance(row_dict["markdown"], Dict):
                    row_dict["markdown_v2"] = row_dict["markdown"]
                    if row_dict["markdown"].get("raw_markdown"):
                        row_dict["markdown"] = row_dict["markdown"]["raw_markdown"]
                # Parse downloaded_files
                try:
-                    row_dict['downloaded_files'] = json.loads(row_dict['downloaded_files']) if row_dict['downloaded_files'] else []
+                    row_dict["downloaded_files"] = (
                        json.loads(row_dict["downloaded_files"])
                        if row_dict["downloaded_files"]
                        else []
                    )
                except json.JSONDecodeError:
-                    row_dict['downloaded_files'] = []
+                    row_dict["downloaded_files"] = []
                # Remove any fields not in CrawlResult model
                valid_fields = CrawlResult.__annotations__.keys()
@@ -321,7 +368,7 @@ class AsyncDatabaseManager:
                message="Error retrieving cached URL: {error}",
                tag="ERROR",
                force_verbose=True,
-                params={"error": str(e)}
+                params={"error": str(e)},
            )
            return None
@@ -329,19 +376,52 @@ class AsyncDatabaseManager:
        """Cache CrawlResult data"""
        # Store content files and get hashes
        content_map = {
-            'html': (result.html, 'html'),
+            "html": (result.html, "html"),
-            'cleaned_html': (result.cleaned_html or "", 'cleaned'),
+            "cleaned_html": (result.cleaned_html or "", "cleaned"),
-            'markdown': (result.markdown or "", 'markdown'),
+            "markdown": None,
-            'extracted_content': (result.extracted_content or "", 'extracted'),
+            "extracted_content": (result.extracted_content or "", "extracted"),
-            'screenshot': (result.screenshot or "", 'screenshots')
+            "screenshot": (result.screenshot or "", "screenshots"),
        }
        try:
            if isinstance(result.markdown, MarkdownGenerationResult):
                content_map["markdown"] = (
                    result.markdown.model_dump_json(),
                    "markdown",
                )
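            # Only use markdown_v2 when it is actually set; a None value would
            # otherwise raise and skip the plain-string branch below.
            elif getattr(result, "markdown_v2", None):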
                content_map["markdown"] = (
                    result.markdown_v2.model_dump_json(),
                    "markdown",
                )
            elif isinstance(result.markdown, str):
                markdown_result = MarkdownGenerationResult(raw_markdown=result.markdown)
                content_map["markdown"] = (
                    markdown_result.model_dump_json(),
                    "markdown",
                )
            else:
                content_map["markdown"] = (
                    MarkdownGenerationResult().model_dump_json(),
                    "markdown",
                )
        except Exception as e:
            self.logger.warning(
                message=f"Error processing markdown content: {str(e)}", tag="WARNING"
            )
            # Fallback to empty markdown result
            content_map["markdown"] = (
                MarkdownGenerationResult().model_dump_json(),
                "markdown",
            )
        content_hashes = {}
        for field, (content, content_type) in content_map.items():
            content_hashes[field] = await self._store_content(content, content_type)
        async def _cache(db):
-            await db.execute('''
+            await db.execute(
                """
                INSERT INTO crawled_data (
                    url, html, cleaned_html, markdown,
                    extracted_content, success, media, links, metadata,
@@ -360,20 +440,22 @@ class AsyncDatabaseManager:
                    screenshot = excluded.screenshot,
                    response_headers = excluded.response_headers,
                    downloaded_files = excluded.downloaded_files
-            ''', (
+            """,
                (
                    result.url,
-                content_hashes['html'],
+                    content_hashes["html"],
-                content_hashes['cleaned_html'],
+                    content_hashes["cleaned_html"],
-                content_hashes['markdown'],
+                    content_hashes["markdown"],
-                content_hashes['extracted_content'],
+                    content_hashes["extracted_content"],
                    result.success,
                    json.dumps(result.media),
                    json.dumps(result.links),
                    json.dumps(result.metadata or {}),
-                content_hashes['screenshot'],
+                    content_hashes["screenshot"],
                    json.dumps(result.response_headers or {}),
-                json.dumps(result.downloaded_files or [])
+                    json.dumps(result.downloaded_files or []),
-            ))
+                ),
            )
        try:
            await self.execute_with_retry(_cache)
@@ -382,14 +464,14 @@ class AsyncDatabaseManager:
                message="Error caching URL: {error}",
                tag="ERROR",
                force_verbose=True,
-                params={"error": str(e)}
+                params={"error": str(e)},
            )
    async def aget_total_count(self) -> int:
        """Get total number of cached URLs"""
        async def _count(db):
-            async with db.execute('SELECT COUNT(*) FROM crawled_data') as cursor:
+            async with db.execute("SELECT COUNT(*) FROM crawled_data") as cursor:
                result = await cursor.fetchone()
                return result[0] if result else 0
@@ -400,14 +482,15 @@ class AsyncDatabaseManager:
                message="Error getting total count: {error}",
                tag="ERROR",
                force_verbose=True,
-                params={"error": str(e)}
+                params={"error": str(e)},
            )
            return 0
    async def aclear_db(self):
        """Clear all data from the database"""
        async def _clear(db):
-            await db.execute('DELETE FROM crawled_data')
+            await db.execute("DELETE FROM crawled_data")
        try:
            await self.execute_with_retry(_clear)
@@ -416,13 +499,14 @@ class AsyncDatabaseManager:
                message="Error clearing database: {error}",
                tag="ERROR",
                force_verbose=True,
-                params={"error": str(e)}
+                params={"error": str(e)},
            )
    async def aflush_db(self):
        """Drop the entire table"""
        async def _flush(db):
-            await db.execute('DROP TABLE IF EXISTS crawled_data')
+            await db.execute("DROP TABLE IF EXISTS crawled_data")
        try:
            await self.execute_with_retry(_flush)
@@ -431,10 +515,9 @@ class AsyncDatabaseManager:
                message="Error flushing database: {error}",
                tag="ERROR",
                force_verbose=True,
-                params={"error": str(e)}
+                params={"error": str(e)},
            )
    async def _store_content(self, content: str, content_type: str) -> str:
        """Store content in filesystem and return hash"""
        if not content:
@@ -445,28 +528,31 @@ class AsyncDatabaseManager:
        # Only write if file doesn't exist
        if not os.path.exists(file_path):
-            async with aiofiles.open(file_path, 'w', encoding='utf-8') as f:
+            async with aiofiles.open(file_path, "w", encoding="utf-8") as f:
                await f.write(content)
        return content_hash
-    async def _load_content(self, content_hash: str, content_type: str) -> Optional[str]:
+    async def _load_content(
        self, content_hash: str, content_type: str
    ) -> Optional[str]:
        """Load content from filesystem by hash"""
        if not content_hash:
            return None
        file_path = os.path.join(self.content_paths[content_type], content_hash)
        try:
-            async with aiofiles.open(file_path, 'r', encoding='utf-8') as f:
+            async with aiofiles.open(file_path, "r", encoding="utf-8") as f:
                return await f.read()
        except Exception:
            self.logger.error(
                message="Failed to load content: {file_path}",
                tag="ERROR",
                force_verbose=True,
-                params={"file_path": file_path}
+                params={"file_path": file_path},
            )
            return None
 # Create a singleton instance
 async_db_manager = AsyncDatabaseManager()
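
Reviewer note on `_store_content` / `_load_content`: the hunks above show only part of these methods (return a hash, write only if the file does not exist, load by hash). The following is a minimal synchronous sketch of that content-addressed pattern, not the project's implementation. The function names, the `base_dir` parameter, and the choice of SHA-256 are assumptions made for illustration; the diff does not show which hash the real code uses, and the real methods use `aiofiles`.

```python
import hashlib
import os
import tempfile
from typing import Optional


def store_content(base_dir: str, content: str) -> str:
    """Write content to a file named after its digest and return the digest."""
    if not content:
        return ""
    # Assumption: SHA-256 over the UTF-8 bytes; the real hash is not shown in the diff.
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    file_path = os.path.join(base_dir, content_hash)
    # Identical content always maps to the same path, so an existing file is reused.
    if not os.path.exists(file_path):
        with open(file_path, "w", encoding="utf-8") as f:
            f.write(content)
    return content_hash


def load_content(base_dir: str, content_hash: str) -> Optional[str]:
    """Read content back by digest; return None if the hash is empty or unreadable."""
    if not content_hash:
        return None
    try:
        with open(os.path.join(base_dir, content_hash), "r", encoding="utf-8") as f:
            return f.read()
    except OSError:
        return None


if __name__ == "__main__":
    cache_dir = tempfile.mkdtemp()
    digest = store_content(cache_dir, "<html>hello</html>")
    print(digest[:12], load_content(cache_dir, digest))
```

Storing the same content twice produces a single file, which is why the database row only needs to keep the hash.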
--- a/crawl4ai/async_dispatcher.py
+++ b/crawl4ai/async_dispatcher.py
@@ -0,0 +1,647 @@
 from typing import Dict, Optional, List, Tuple
 from .async_configs import CrawlerRunConfig
 from .models import (
    CrawlResult,
    CrawlerTaskResult,
    CrawlStatus,
    DisplayMode,
    CrawlStats,
    DomainState,
 )
 from rich.live import Live
 from rich.table import Table
 from rich.console import Console
 from rich import box
 from datetime import datetime, timedelta
 from collections.abc import AsyncGenerator
 import time
 import psutil
 import asyncio
 import uuid
 from urllib.parse import urlparse
 import random
 from abc import ABC, abstractmethod
 class RateLimiter:
    def __init__(
        self,
        base_delay: Tuple[float, float] = (1.0, 3.0),
        max_delay: float = 60.0,
        max_retries: int = 3,
        rate_limit_codes: Optional[List[int]] = None,
    ):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.max_retries = max_retries
        self.rate_limit_codes = rate_limit_codes or [429, 503]
        self.domains: Dict[str, DomainState] = {}
    def get_domain(self, url: str) -> str:
        return urlparse(url).netloc
    async def wait_if_needed(self, url: str) -> None:
        domain = self.get_domain(url)
        state = self.domains.get(domain)
        if not state:
            self.domains[domain] = DomainState()
            state = self.domains[domain]
        now = time.time()
        if state.last_request_time:
            wait_time = max(0, state.current_delay - (now - state.last_request_time))
            if wait_time > 0:
                await asyncio.sleep(wait_time)
        # Random delay within base range if no current delay
        if state.current_delay == 0:
            state.current_delay = random.uniform(*self.base_delay)
        state.last_request_time = time.time()
    def update_delay(self, url: str, status_code: int) -> bool:
        domain = self.get_domain(url)
        state = self.domains[domain]
        if status_code in self.rate_limit_codes:
            state.fail_count += 1
            if state.fail_count > self.max_retries:
                return False
            # Exponential backoff with random jitter
            state.current_delay = min(
                state.current_delay * 2 * random.uniform(0.75, 1.25), self.max_delay
            )
        else:
            # Gradually reduce delay on success
            state.current_delay = max(
                random.uniform(*self.base_delay), state.current_delay * 0.75
            )
            state.fail_count = 0
        return True
 class CrawlerMonitor:
    def __init__(
        self,
        max_visible_rows: int = 15,
        display_mode: DisplayMode = DisplayMode.DETAILED,
    ):
        self.console = Console()
        self.max_visible_rows = max_visible_rows
        self.display_mode = display_mode
        self.stats: Dict[str, CrawlStats] = {}
        self.process = psutil.Process()
        self.start_time = datetime.now()
        self.live = Live(self._create_table(), refresh_per_second=2)
    def start(self):
        self.live.start()
    def stop(self):
        self.live.stop()
    def add_task(self, task_id: str, url: str):
        self.stats[task_id] = CrawlStats(
            task_id=task_id, url=url, status=CrawlStatus.QUEUED
        )
        self.live.update(self._create_table())
    def update_task(self, task_id: str, **kwargs):
        if task_id in self.stats:
            for key, value in kwargs.items():
                setattr(self.stats[task_id], key, value)
            self.live.update(self._create_table())
    def _create_aggregated_table(self) -> Table:
        """Creates a compact table showing only aggregated statistics"""
        table = Table(
            box=box.ROUNDED,
            title="Crawler Status Overview",
            title_style="bold magenta",
            header_style="bold blue",
            show_lines=True,
        )
        # Calculate statistics
        total_tasks = len(self.stats)
        queued = sum(
            1 for stat in self.stats.values() if stat.status == CrawlStatus.QUEUED
        )
        in_progress = sum(
            1 for stat in self.stats.values() if stat.status == CrawlStatus.IN_PROGRESS
        )
        completed = sum(
            1 for stat in self.stats.values() if stat.status == CrawlStatus.COMPLETED
        )
        failed = sum(
            1 for stat in self.stats.values() if stat.status == CrawlStatus.FAILED
        )
        # Memory statistics
        current_memory = self.process.memory_info().rss / (1024 * 1024)
        total_task_memory = sum(stat.memory_usage for stat in self.stats.values())
        peak_memory = max(
            (stat.peak_memory for stat in self.stats.values()), default=0.0
        )
        # Duration
        duration = datetime.now() - self.start_time
        # Create status row
        table.add_column("Status", style="bold cyan")
        table.add_column("Count", justify="right")
        table.add_column("Percentage", justify="right")
        table.add_row("Total Tasks", str(total_tasks), "100%")
        table.add_row(
            "[yellow]In Queue[/yellow]",
            str(queued),
            f"{(queued/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
        )
        table.add_row(
            "[blue]In Progress[/blue]",
            str(in_progress),
            f"{(in_progress/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
        )
        table.add_row(
            "[green]Completed[/green]",
            str(completed),
            f"{(completed/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
        )
        table.add_row(
            "[red]Failed[/red]",
            str(failed),
            f"{(failed/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
        )
        # Add memory information
        table.add_section()
        table.add_row(
            "[magenta]Current Memory[/magenta]", f"{current_memory:.1f} MB", ""
        )
        table.add_row(
            "[magenta]Total Task Memory[/magenta]", f"{total_task_memory:.1f} MB", ""
        )
        table.add_row(
            "[magenta]Peak Task Memory[/magenta]", f"{peak_memory:.1f} MB", ""
        )
        table.add_row(
            "[yellow]Runtime[/yellow]",
            str(timedelta(seconds=int(duration.total_seconds()))),
            "",
        )
        return table
    def _create_detailed_table(self) -> Table:
        table = Table(
            box=box.ROUNDED,
            title="Crawler Performance Monitor",
            title_style="bold magenta",
            header_style="bold blue",
        )
        # Add columns
        table.add_column("Task ID", style="cyan", no_wrap=True)
        table.add_column("URL", style="cyan", no_wrap=True)
        table.add_column("Status", style="bold")
        table.add_column("Memory (MB)", justify="right")
        table.add_column("Peak (MB)", justify="right")
        table.add_column("Duration", justify="right")
        table.add_column("Info", style="italic")
        # Add summary row
        total_memory = sum(stat.memory_usage for stat in self.stats.values())
        active_count = sum(
            1 for stat in self.stats.values() if stat.status == CrawlStatus.IN_PROGRESS
        )
        completed_count = sum(
            1 for stat in self.stats.values() if stat.status == CrawlStatus.COMPLETED
        )
        failed_count = sum(
            1 for stat in self.stats.values() if stat.status == CrawlStatus.FAILED
        )
        table.add_row(
            "[bold yellow]SUMMARY",
            f"Total: {len(self.stats)}",
            f"Active: {active_count}",
            f"{total_memory:.1f}",
            f"{self.process.memory_info().rss / (1024 * 1024):.1f}",
            str(
                timedelta(
                    seconds=int((datetime.now() - self.start_time).total_seconds())
                )
            ),
            f"✓{completed_count} ✗{failed_count}",
            style="bold",
        )
        table.add_section()
        # Add rows for each task
        visible_stats = sorted(
            self.stats.values(),
            key=lambda x: (
                x.status != CrawlStatus.IN_PROGRESS,
                x.status != CrawlStatus.QUEUED,
                x.end_time or datetime.max,
            ),
        )[: self.max_visible_rows]
        for stat in visible_stats:
            status_style = {
                CrawlStatus.QUEUED: "white",
                CrawlStatus.IN_PROGRESS: "yellow",
                CrawlStatus.COMPLETED: "green",
                CrawlStatus.FAILED: "red",
            }[stat.status]
            table.add_row(
                stat.task_id[:8],  # Show first 8 chars of task ID
                stat.url[:40] + "..." if len(stat.url) > 40 else stat.url,
                f"[{status_style}]{stat.status.value}[/{status_style}]",
                f"{stat.memory_usage:.1f}",
                f"{stat.peak_memory:.1f}",
                stat.duration,
                stat.error_message[:40] if stat.error_message else "",
            )
        return table
    def _create_table(self) -> Table:
        """Creates the appropriate table based on display mode"""
        if self.display_mode == DisplayMode.AGGREGATED:
            return self._create_aggregated_table()
        return self._create_detailed_table()
 class BaseDispatcher(ABC):
    def __init__(
        self,
        rate_limiter: Optional[RateLimiter] = None,
        monitor: Optional[CrawlerMonitor] = None,
    ):
        self.crawler = None
        self._domain_last_hit: Dict[str, float] = {}
        self.concurrent_sessions = 0
        self.rate_limiter = rate_limiter
        self.monitor = monitor
    @abstractmethod
    async def crawl_url(
        self,
        url: str,
        config: CrawlerRunConfig,
        task_id: str,
        monitor: Optional[CrawlerMonitor] = None,
    ) -> CrawlerTaskResult:
        pass
    @abstractmethod
    async def run_urls(
        self,
        urls: List[str],
        crawler: "AsyncWebCrawler",  # noqa: F821
        config: CrawlerRunConfig,
        monitor: Optional[CrawlerMonitor] = None,
    ) -> List[CrawlerTaskResult]:
        pass
 class MemoryAdaptiveDispatcher(BaseDispatcher):
    def __init__(
        self,
        memory_threshold_percent: float = 90.0,
        check_interval: float = 1.0,
        max_session_permit: int = 20,
        memory_wait_timeout: float = 300.0,  # 5 minutes default timeout
        rate_limiter: Optional[RateLimiter] = None,
        monitor: Optional[CrawlerMonitor] = None,
    ):
        super().__init__(rate_limiter, monitor)
        self.memory_threshold_percent = memory_threshold_percent
        self.check_interval = check_interval
        self.max_session_permit = max_session_permit
        self.memory_wait_timeout = memory_wait_timeout
        self.result_queue = asyncio.Queue()  # Queue for storing results
    async def crawl_url(
        self,
        url: str,
        config: CrawlerRunConfig,
        task_id: str,
    ) -> CrawlerTaskResult:
        start_time = datetime.now()
        error_message = ""
        memory_usage = peak_memory = 0.0
        try:
            if self.monitor:
                self.monitor.update_task(
                    task_id, status=CrawlStatus.IN_PROGRESS, start_time=start_time
                )
            self.concurrent_sessions += 1
            if self.rate_limiter:
                await self.rate_limiter.wait_if_needed(url)
            process = psutil.Process()
            start_memory = process.memory_info().rss / (1024 * 1024)
            result = await self.crawler.arun(url, config=config, session_id=task_id)
            end_memory = process.memory_info().rss / (1024 * 1024)
            memory_usage = peak_memory = end_memory - start_memory
            if self.rate_limiter and result.status_code:
                if not self.rate_limiter.update_delay(url, result.status_code):
                    error_message = f"Rate limit retry count exceeded for domain {urlparse(url).netloc}"
                    if self.monitor:
                        self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
                    result = CrawlerTaskResult(
                        task_id=task_id,
                        url=url,
                        result=result,
                        memory_usage=memory_usage,
                        peak_memory=peak_memory,
                        start_time=start_time,
                        end_time=datetime.now(),
                        error_message=error_message,
                    )
                    await self.result_queue.put(result)
                    return result
            if not result.success:
                error_message = result.error_message
                if self.monitor:
                    self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
            elif self.monitor:
                self.monitor.update_task(task_id, status=CrawlStatus.COMPLETED)
        except Exception as e:
            error_message = str(e)
            if self.monitor:
                self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
            result = CrawlResult(
                url=url, html="", metadata={}, success=False, error_message=str(e)
            )
        finally:
            end_time = datetime.now()
            if self.monitor:
                self.monitor.update_task(
                    task_id,
                    end_time=end_time,
                    memory_usage=memory_usage,
                    peak_memory=peak_memory,
                    error_message=error_message,
                )
            self.concurrent_sessions -= 1
        return CrawlerTaskResult(
            task_id=task_id,
            url=url,
            result=result,
            memory_usage=memory_usage,
            peak_memory=peak_memory,
            start_time=start_time,
            end_time=end_time,
            error_message=error_message,
        )
    async def run_urls(
        self,
        urls: List[str],
        crawler: "AsyncWebCrawler",  # noqa: F821
        config: CrawlerRunConfig,
    ) -> List[CrawlerTaskResult]:
        self.crawler = crawler
        if self.monitor:
            self.monitor.start()
        try:
            pending_tasks = []
            active_tasks = []
            task_queue = []
            for url in urls:
                task_id = str(uuid.uuid4())
                if self.monitor:
                    self.monitor.add_task(task_id, url)
                task_queue.append((url, task_id))
            while task_queue or active_tasks:
                wait_start_time = time.time()
                while len(active_tasks) < self.max_session_permit and task_queue:
                    if psutil.virtual_memory().percent >= self.memory_threshold_percent:
                        # Check if we've exceeded the timeout
                        if time.time() - wait_start_time > self.memory_wait_timeout:
                            raise MemoryError(
                                f"Memory usage above threshold ({self.memory_threshold_percent}%) for more than {self.memory_wait_timeout} seconds"
                            )
                        await asyncio.sleep(self.check_interval)
                        continue
                    url, task_id = task_queue.pop(0)
                    task = asyncio.create_task(self.crawl_url(url, config, task_id))
                    active_tasks.append(task)
                if not active_tasks:
                    await asyncio.sleep(self.check_interval)
                    continue
                done, pending = await asyncio.wait(
                    active_tasks, return_when=asyncio.FIRST_COMPLETED
                )
                pending_tasks.extend(done)
                active_tasks = list(pending)
            return await asyncio.gather(*pending_tasks)
        finally:
            if self.monitor:
                self.monitor.stop()
    async def run_urls_stream(
        self,
        urls: List[str],
        crawler: "AsyncWebCrawler",  # noqa: F821
        config: CrawlerRunConfig,
    ) -> AsyncGenerator[CrawlerTaskResult, None]:
        self.crawler = crawler
        if self.monitor:
            self.monitor.start()
        try:
            active_tasks = []
            task_queue = []
            completed_count = 0
            total_urls = len(urls)
            # Initialize task queue
            for url in urls:
                task_id = str(uuid.uuid4())
                if self.monitor:
                    self.monitor.add_task(task_id, url)
                task_queue.append((url, task_id))
            while completed_count < total_urls:
                # Start new tasks if memory permits
                while len(active_tasks) < self.max_session_permit and task_queue:
                    if psutil.virtual_memory().percent >= self.memory_threshold_percent:
                        await asyncio.sleep(self.check_interval)
                        continue
                    url, task_id = task_queue.pop(0)
                    task = asyncio.create_task(self.crawl_url(url, config, task_id))
                    active_tasks.append(task)
                if not active_tasks and not task_queue:
                    break
                # Wait for any task to complete and yield results
                if active_tasks:
                    done, pending = await asyncio.wait(
                        active_tasks,
                        timeout=0.1,
                        return_when=asyncio.FIRST_COMPLETED
                    )
                    for completed_task in done:
                        result = await completed_task
                        completed_count += 1
                        yield result
                    active_tasks = list(pending)
                else:
                    await asyncio.sleep(self.check_interval)
        finally:
            if self.monitor:
                self.monitor.stop()
 class SemaphoreDispatcher(BaseDispatcher):
    def __init__(
        self,
        semaphore_count: int = 5,
        max_session_permit: int = 20,
        rate_limiter: Optional[RateLimiter] = None,
        monitor: Optional[CrawlerMonitor] = None,
    ):
        super().__init__(rate_limiter, monitor)
        self.semaphore_count = semaphore_count
        self.max_session_permit = max_session_permit
    async def crawl_url(
        self,
        url: str,
        config: CrawlerRunConfig,
        task_id: str,
        semaphore: Optional[asyncio.Semaphore] = None,
    ) -> CrawlerTaskResult:
        start_time = datetime.now()
        error_message = ""
        memory_usage = peak_memory = 0.0
        try:
            if self.monitor:
                self.monitor.update_task(
                    task_id, status=CrawlStatus.IN_PROGRESS, start_time=start_time
                )
            if self.rate_limiter:
                await self.rate_limiter.wait_if_needed(url)
            async with semaphore:
                process = psutil.Process()
                start_memory = process.memory_info().rss / (1024 * 1024)
                result = await self.crawler.arun(url, config=config, session_id=task_id)
                end_memory = process.memory_info().rss / (1024 * 1024)
                memory_usage = peak_memory = end_memory - start_memory
                if self.rate_limiter and result.status_code:
                    if not self.rate_limiter.update_delay(url, result.status_code):
                        error_message = f"Rate limit retry count exceeded for domain {urlparse(url).netloc}"
                        if self.monitor:
                            self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
                        return CrawlerTaskResult(
                            task_id=task_id,
                            url=url,
                            result=result,
                            memory_usage=memory_usage,
                            peak_memory=peak_memory,
                            start_time=start_time,
                            end_time=datetime.now(),
                            error_message=error_message,
                        )
                if not result.success:
                    error_message = result.error_message
                    if self.monitor:
                        self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
                elif self.monitor:
                    self.monitor.update_task(task_id, status=CrawlStatus.COMPLETED)
        except Exception as e:
            error_message = str(e)
            if self.monitor:
                self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
            result = CrawlResult(
                url=url, html="", metadata={}, success=False, error_message=str(e)
            )
        finally:
            end_time = datetime.now()
            if self.monitor:
                self.monitor.update_task(
                    task_id,
                    end_time=end_time,
                    memory_usage=memory_usage,
                    peak_memory=peak_memory,
                    error_message=error_message,
                )
        return CrawlerTaskResult(
            task_id=task_id,
            url=url,
            result=result,
            memory_usage=memory_usage,
            peak_memory=peak_memory,
            start_time=start_time,
            end_time=end_time,
            error_message=error_message,
        )
    async def run_urls(
        self,
        crawler: "AsyncWebCrawler",  # noqa: F821
        urls: List[str],
        config: CrawlerRunConfig,
    ) -> List[CrawlerTaskResult]:
        self.crawler = crawler
        if self.monitor:
            self.monitor.start()
        try:
            semaphore = asyncio.Semaphore(self.semaphore_count)
            tasks = []
            for url in urls:
                task_id = str(uuid.uuid4())
                if self.monitor:
                    self.monitor.add_task(task_id, url)
                task = asyncio.create_task(
                    self.crawl_url(url, config, task_id, semaphore)
                )
                tasks.append(task)
            return await asyncio.gather(*tasks, return_exceptions=True)
        finally:
            if self.monitor:
                self.monitor.stop()
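Reviewer note: the `run_urls` method above bounds concurrency by sharing one `asyncio.Semaphore` across all tasks and collecting results with `gather(..., return_exceptions=True)`. A minimal standalone sketch of that pattern follows; `bounded_gather` and `worker` are hypothetical names used only for illustration and are not part of crawl4ai.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable, List


async def bounded_gather(
    items: Iterable[Any],
    worker: Callable[[Any], Awaitable[Any]],
    limit: int = 5,
) -> List[Any]:
    """Run worker(item) for every item, with at most `limit` running at once."""
    # The semaphore is created inside the running event loop, as in run_urls.
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item: Any) -> Any:
        # Each task waits here until one of the `limit` slots is free.
        async with semaphore:
            return await worker(item)

    tasks = [asyncio.create_task(run_one(item)) for item in items]
    # return_exceptions=True puts raised exceptions into the result list
    # instead of propagating the first one to the caller.
    return await asyncio.gather(*tasks, return_exceptions=True)
```

With `return_exceptions=True`, callers need to check each entry for `Exception` instances before treating it as a result.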
--- a/crawl4ai/async_dispatcher_.py
+++ b/crawl4ai/async_dispatcher_.py
@@ -0,0 +1,588 @@
 from typing import Dict, Optional, List, Tuple
 from .async_configs import CrawlerRunConfig
 from .models import (
    CrawlResult,
    CrawlerTaskResult,
    CrawlStatus,
    DisplayMode,
    CrawlStats,
    DomainState,
 )
 from rich.live import Live
 from rich.table import Table
 from rich.console import Console
 from rich import box
 from datetime import datetime, timedelta
 import time
 import psutil
 import asyncio
 import uuid
 from urllib.parse import urlparse
 import random
 from abc import ABC, abstractmethod
 class RateLimiter:
    def __init__(
        self,
        base_delay: Tuple[float, float] = (1.0, 3.0),
        max_delay: float = 60.0,
        max_retries: int = 3,
        rate_limit_codes: Optional[List[int]] = None,
    ):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.max_retries = max_retries
        self.rate_limit_codes = rate_limit_codes or [429, 503]
        self.domains: Dict[str, DomainState] = {}
    def get_domain(self, url: str) -> str:
        return urlparse(url).netloc
    async def wait_if_needed(self, url: str) -> None:
        domain = self.get_domain(url)
        state = self.domains.get(domain)
        if not state:
            self.domains[domain] = DomainState()
            state = self.domains[domain]
        now = time.time()
        if state.last_request_time:
            wait_time = max(0, state.current_delay - (now - state.last_request_time))
            if wait_time > 0:
                await asyncio.sleep(wait_time)
        # Random delay within base range if no current delay
        if state.current_delay == 0:
            state.current_delay = random.uniform(*self.base_delay)
        state.last_request_time = time.time()
    def update_delay(self, url: str, status_code: int) -> bool:
        domain = self.get_domain(url)
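        # Create the state on demand so a call without a prior wait_if_needed() cannot raise KeyError
        state = self.domains.setdefault(domain, DomainState())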
        if status_code in self.rate_limit_codes:
            state.fail_count += 1
            if state.fail_count > self.max_retries:
                return False
            # Exponential backoff with random jitter
            state.current_delay = min(
                state.current_delay * 2 * random.uniform(0.75, 1.25), self.max_delay
            )
        else:
            # Gradually reduce delay on success
            state.current_delay = max(
                random.uniform(*self.base_delay), state.current_delay * 0.75
            )
            state.fail_count = 0
        return True
 class CrawlerMonitor:
    def __init__(
        self,
        max_visible_rows: int = 15,
        display_mode: DisplayMode = DisplayMode.DETAILED,
    ):
        self.console = Console()
        self.max_visible_rows = max_visible_rows
        self.display_mode = display_mode
        self.stats: Dict[str, CrawlStats] = {}
        self.process = psutil.Process()
        self.start_time = datetime.now()
        self.live = Live(self._create_table(), refresh_per_second=2)
    def start(self):
        self.live.start()
    def stop(self):
        self.live.stop()
    def add_task(self, task_id: str, url: str):
        self.stats[task_id] = CrawlStats(
            task_id=task_id, url=url, status=CrawlStatus.QUEUED
        )
        self.live.update(self._create_table())
    def update_task(self, task_id: str, **kwargs):
        if task_id in self.stats:
            for key, value in kwargs.items():
                setattr(self.stats[task_id], key, value)
            self.live.update(self._create_table())
    def _create_aggregated_table(self) -> Table:
        """Creates a compact table showing only aggregated statistics"""
        table = Table(
            box=box.ROUNDED,
            title="Crawler Status Overview",
            title_style="bold magenta",
            header_style="bold blue",
            show_lines=True,
        )
        # Calculate statistics
        total_tasks = len(self.stats)
        queued = sum(
            1 for stat in self.stats.values() if stat.status == CrawlStatus.QUEUED
        )
        in_progress = sum(
            1 for stat in self.stats.values() if stat.status == CrawlStatus.IN_PROGRESS
        )
        completed = sum(
            1 for stat in self.stats.values() if stat.status == CrawlStatus.COMPLETED
        )
        failed = sum(
            1 for stat in self.stats.values() if stat.status == CrawlStatus.FAILED
        )
        # Memory statistics
        current_memory = self.process.memory_info().rss / (1024 * 1024)
        total_task_memory = sum(stat.memory_usage for stat in self.stats.values())
        peak_memory = max(
            (stat.peak_memory for stat in self.stats.values()), default=0.0
        )
        # Duration
        duration = datetime.now() - self.start_time
        # Create status row
        table.add_column("Status", style="bold cyan")
        table.add_column("Count", justify="right")
        table.add_column("Percentage", justify="right")
        table.add_row("Total Tasks", str(total_tasks), "100%")
        table.add_row(
            "[yellow]In Queue[/yellow]",
            str(queued),
            f"{(queued/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
        )
        table.add_row(
            "[blue]In Progress[/blue]",
            str(in_progress),
            f"{(in_progress/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
        )
        table.add_row(
            "[green]Completed[/green]",
            str(completed),
            f"{(completed/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
        )
        table.add_row(
            "[red]Failed[/red]",
            str(failed),
            f"{(failed/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
        )
        # Add memory information
        table.add_section()
        table.add_row(
            "[magenta]Current Memory[/magenta]", f"{current_memory:.1f} MB", ""
        )
        table.add_row(
            "[magenta]Total Task Memory[/magenta]", f"{total_task_memory:.1f} MB", ""
        )
        table.add_row(
            "[magenta]Peak Task Memory[/magenta]", f"{peak_memory:.1f} MB", ""
        )
        table.add_row(
            "[yellow]Runtime[/yellow]",
            str(timedelta(seconds=int(duration.total_seconds()))),
            "",
        )
        return table
    def _create_detailed_table(self) -> Table:
        table = Table(
            box=box.ROUNDED,
            title="Crawler Performance Monitor",
            title_style="bold magenta",
            header_style="bold blue",
        )
        # Add columns
        table.add_column("Task ID", style="cyan", no_wrap=True)
        table.add_column("URL", style="cyan", no_wrap=True)
        table.add_column("Status", style="bold")
        table.add_column("Memory (MB)", justify="right")
        table.add_column("Peak (MB)", justify="right")
        table.add_column("Duration", justify="right")
        table.add_column("Info", style="italic")
        # Add summary row
        total_memory = sum(stat.memory_usage for stat in self.stats.values())
        active_count = sum(
            1 for stat in self.stats.values() if stat.status == CrawlStatus.IN_PROGRESS
        )
        completed_count = sum(
            1 for stat in self.stats.values() if stat.status == CrawlStatus.COMPLETED
        )
        failed_count = sum(
            1 for stat in self.stats.values() if stat.status == CrawlStatus.FAILED
        )
        table.add_row(
            "[bold yellow]SUMMARY",
            f"Total: {len(self.stats)}",
            f"Active: {active_count}",
            f"{total_memory:.1f}",
            f"{self.process.memory_info().rss / (1024 * 1024):.1f}",
            str(
                timedelta(
                    seconds=int((datetime.now() - self.start_time).total_seconds())
                )
            ),
            f"✓{completed_count} ✗{failed_count}",
            style="bold",
        )
        table.add_section()
        # Add rows for each task
        visible_stats = sorted(
            self.stats.values(),
            key=lambda x: (
                x.status != CrawlStatus.IN_PROGRESS,
                x.status != CrawlStatus.QUEUED,
                x.end_time or datetime.max,
            ),
        )[: self.max_visible_rows]
        for stat in visible_stats:
            status_style = {
                CrawlStatus.QUEUED: "white",
                CrawlStatus.IN_PROGRESS: "yellow",
                CrawlStatus.COMPLETED: "green",
                CrawlStatus.FAILED: "red",
            }[stat.status]
            table.add_row(
                stat.task_id[:8],  # Show first 8 chars of task ID
                stat.url[:40] + "..." if len(stat.url) > 40 else stat.url,
                f"[{status_style}]{stat.status.value}[/{status_style}]",
                f"{stat.memory_usage:.1f}",
                f"{stat.peak_memory:.1f}",
                stat.duration,
                stat.error_message[:40] if stat.error_message else "",
            )
        return table
    def _create_table(self) -> Table:
        """Creates the appropriate table based on display mode"""
        if self.display_mode == DisplayMode.AGGREGATED:
            return self._create_aggregated_table()
        return self._create_detailed_table()
 class BaseDispatcher(ABC):
    def __init__(
        self,
        rate_limiter: Optional[RateLimiter] = None,
        monitor: Optional[CrawlerMonitor] = None,
    ):
        self.crawler = None
        self._domain_last_hit: Dict[str, float] = {}
        self.concurrent_sessions = 0
        self.rate_limiter = rate_limiter
        self.monitor = monitor
    @abstractmethod
    async def crawl_url(
        self,
        url: str,
        config: CrawlerRunConfig,
        task_id: str,
        monitor: Optional[CrawlerMonitor] = None,
    ) -> CrawlerTaskResult:
        pass
    @abstractmethod
    async def run_urls(
        self,
        urls: List[str],
        crawler: "AsyncWebCrawler",  # noqa: F821
        config: CrawlerRunConfig,
        monitor: Optional[CrawlerMonitor] = None,
    ) -> List[CrawlerTaskResult]:
        pass
 class MemoryAdaptiveDispatcher(BaseDispatcher):
    def __init__(
        self,
        memory_threshold_percent: float = 90.0,
        check_interval: float = 1.0,
        max_session_permit: int = 20,
        memory_wait_timeout: float = 300.0,  # 5 minutes default timeout
        rate_limiter: Optional[RateLimiter] = None,
        monitor: Optional[CrawlerMonitor] = None,
    ):
        super().__init__(rate_limiter, monitor)
        self.memory_threshold_percent = memory_threshold_percent
        self.check_interval = check_interval
        self.max_session_permit = max_session_permit
        self.memory_wait_timeout = memory_wait_timeout
    async def crawl_url(
        self,
        url: str,
        config: CrawlerRunConfig,
        task_id: str,
    ) -> CrawlerTaskResult:
        start_time = datetime.now()
        error_message = ""
        memory_usage = peak_memory = 0.0
        try:
            if self.monitor:
                self.monitor.update_task(
                    task_id, status=CrawlStatus.IN_PROGRESS, start_time=start_time
                )
            self.concurrent_sessions += 1
            if self.rate_limiter:
                await self.rate_limiter.wait_if_needed(url)
            process = psutil.Process()
            start_memory = process.memory_info().rss / (1024 * 1024)
            result = await self.crawler.arun(url, config=config, session_id=task_id)
            end_memory = process.memory_info().rss / (1024 * 1024)
            memory_usage = peak_memory = end_memory - start_memory
            if self.rate_limiter and result.status_code:
                if not self.rate_limiter.update_delay(url, result.status_code):
                    error_message = f"Rate limit retry count exceeded for domain {urlparse(url).netloc}"
                    if self.monitor:
                        self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
                    return CrawlerTaskResult(
                        task_id=task_id,
                        url=url,
                        result=result,
                        memory_usage=memory_usage,
                        peak_memory=peak_memory,
                        start_time=start_time,
                        end_time=datetime.now(),
                        error_message=error_message,
                    )
            if not result.success:
                error_message = result.error_message
                if self.monitor:
                    self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
            elif self.monitor:
                self.monitor.update_task(task_id, status=CrawlStatus.COMPLETED)
        except Exception as e:
            error_message = str(e)
            if self.monitor:
                self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
            result = CrawlResult(
                url=url, html="", metadata={}, success=False, error_message=str(e)
            )
        finally:
            end_time = datetime.now()
            if self.monitor:
                self.monitor.update_task(
                    task_id,
                    end_time=end_time,
                    memory_usage=memory_usage,
                    peak_memory=peak_memory,
                    error_message=error_message,
                )
            self.concurrent_sessions -= 1
        return CrawlerTaskResult(
            task_id=task_id,
            url=url,
            result=result,
            memory_usage=memory_usage,
            peak_memory=peak_memory,
            start_time=start_time,
            end_time=end_time,
            error_message=error_message,
        )
    async def run_urls(
        self,
        urls: List[str],
        crawler: "AsyncWebCrawler",  # noqa: F821
        config: CrawlerRunConfig,
    ) -> List[CrawlerTaskResult]:
        self.crawler = crawler
        if self.monitor:
            self.monitor.start()
        try:
            pending_tasks = []
            active_tasks = []
            task_queue = []
            for url in urls:
                task_id = str(uuid.uuid4())
                if self.monitor:
                    self.monitor.add_task(task_id, url)
                task_queue.append((url, task_id))
            while task_queue or active_tasks:
                wait_start_time = time.time()
                while len(active_tasks) < self.max_session_permit and task_queue:
                    if psutil.virtual_memory().percent >= self.memory_threshold_percent:
                        # Check if we've exceeded the timeout
                        if time.time() - wait_start_time > self.memory_wait_timeout:
                            raise MemoryError(
                                f"Memory usage above threshold ({self.memory_threshold_percent}%) for more than {self.memory_wait_timeout} seconds"
                            )
                        await asyncio.sleep(self.check_interval)
                        continue
                    url, task_id = task_queue.pop(0)
                    task = asyncio.create_task(self.crawl_url(url, config, task_id))
                    active_tasks.append(task)
                if not active_tasks:
                    await asyncio.sleep(self.check_interval)
                    continue
                done, pending = await asyncio.wait(
                    active_tasks, return_when=asyncio.FIRST_COMPLETED
                )
                pending_tasks.extend(done)
                active_tasks = list(pending)
            return await asyncio.gather(*pending_tasks)
        finally:
            if self.monitor:
                self.monitor.stop()
 class SemaphoreDispatcher(BaseDispatcher):
    def __init__(
        self,
        semaphore_count: int = 5,
        max_session_permit: int = 20,
        rate_limiter: Optional[RateLimiter] = None,
        monitor: Optional[CrawlerMonitor] = None,
    ):
        super().__init__(rate_limiter, monitor)
        self.semaphore_count = semaphore_count
        self.max_session_permit = max_session_permit
    async def crawl_url(
        self,
        url: str,
        config: CrawlerRunConfig,
        task_id: str,
        semaphore: Optional[asyncio.Semaphore] = None,
    ) -> CrawlerTaskResult:
        start_time = datetime.now()
        error_message = ""
        memory_usage = peak_memory = 0.0
        try:
            if self.monitor:
                self.monitor.update_task(
                    task_id, status=CrawlStatus.IN_PROGRESS, start_time=start_time
                )
            if self.rate_limiter:
                await self.rate_limiter.wait_if_needed(url)
            async with semaphore:
                process = psutil.Process()
                start_memory = process.memory_info().rss / (1024 * 1024)
                result = await self.crawler.arun(url, config=config, session_id=task_id)
                end_memory = process.memory_info().rss / (1024 * 1024)
                memory_usage = peak_memory = end_memory - start_memory
                if self.rate_limiter and result.status_code:
                    if not self.rate_limiter.update_delay(url, result.status_code):
                        error_message = f"Rate limit retry count exceeded for domain {urlparse(url).netloc}"
                        if self.monitor:
                            self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
                        return CrawlerTaskResult(
                            task_id=task_id,
                            url=url,
                            result=result,
                            memory_usage=memory_usage,
                            peak_memory=peak_memory,
                            start_time=start_time,
                            end_time=datetime.now(),
                            error_message=error_message,
                        )
                if not result.success:
                    error_message = result.error_message
                    if self.monitor:
                        self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
                elif self.monitor:
                    self.monitor.update_task(task_id, status=CrawlStatus.COMPLETED)
        except Exception as e:
            error_message = str(e)
            if self.monitor:
                self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
            result = CrawlResult(
                url=url, html="", metadata={}, success=False, error_message=str(e)
            )
        finally:
            end_time = datetime.now()
            if self.monitor:
                self.monitor.update_task(
                    task_id,
                    end_time=end_time,
                    memory_usage=memory_usage,
                    peak_memory=peak_memory,
                    error_message=error_message,
                )
        return CrawlerTaskResult(
            task_id=task_id,
            url=url,
            result=result,
            memory_usage=memory_usage,
            peak_memory=peak_memory,
            start_time=start_time,
            end_time=end_time,
            error_message=error_message,
        )
    async def run_urls(
        self,
        crawler: "AsyncWebCrawler",  # noqa: F821
        urls: List[str],
        config: CrawlerRunConfig,
    ) -> List[CrawlerTaskResult]:
        self.crawler = crawler
        if self.monitor:
            self.monitor.start()
        try:
            semaphore = asyncio.Semaphore(self.semaphore_count)
            tasks = []
            for url in urls:
                task_id = str(uuid.uuid4())
                if self.monitor:
                    self.monitor.add_task(task_id, url)
                task = asyncio.create_task(
                    self.crawl_url(url, config, task_id, semaphore)
                )
                tasks.append(task)
            return await asyncio.gather(*tasks, return_exceptions=True)
        finally:
            if self.monitor:
                self.monitor.stop()
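Reviewer note: the delay rule in `RateLimiter.update_delay` above (double the delay with ±25% jitter on a rate-limit response, decay by 25% toward the base range on success) can be isolated as a pure function for testing. This is a sketch under assumptions: `next_delay` and its `rng` parameter are hypothetical and do not exist in the module; the constants are copied from the code above.

```python
import random
from typing import Optional, Tuple


def next_delay(
    current_delay: float,
    rate_limited: bool,
    base_delay: Tuple[float, float] = (1.0, 3.0),
    max_delay: float = 60.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Return the next per-domain delay in seconds."""
    # Fall back to the module-level generator when no seeded RNG is given.
    source = rng or random
    if rate_limited:
        # Exponential backoff with random jitter, capped at max_delay.
        return min(current_delay * 2 * source.uniform(0.75, 1.25), max_delay)
    # On success, shrink the delay but never below a fresh base-range sample.
    return max(source.uniform(*base_delay), current_delay * 0.75)
```

Passing a seeded `random.Random` makes the jitter reproducible in unit tests, which the in-place version using the global `random` module does not allow.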
--- a/crawl4ai/async_logger.py
+++ b/crawl4ai/async_logger.py
@@ -1,10 +1,10 @@
 from enum import Enum
-from typing import Optional, Dict, Any, Union
+from typing import Optional, Dict, Any
-from colorama import Fore, Back, Style, init
+from colorama import Fore, Style, init
 import time
 import os
 from datetime import datetime
 class LogLevel(Enum):
    DEBUG = 1
    INFO = 2
@@ -12,6 +12,7 @@ class LogLevel(Enum):
    WARNING = 4
    ERROR = 5
 class AsyncLogger:
    """
    Asynchronous logger with support for colored console output and file logging.
@@ -19,16 +20,16 @@ class AsyncLogger:
    """
    DEFAULT_ICONS = {
-        'INIT': '→',
+        "INIT": "→",
-        'READY': '✓',
+        "READY": "✓",
-        'FETCH': '↓',
+        "FETCH": "↓",
-        'SCRAPE': '◆',
+        "SCRAPE": "◆",
-        'EXTRACT': '■',
+        "EXTRACT": "■",
-        'COMPLETE': '●',
+        "COMPLETE": "●",
-        'ERROR': '×',
+        "ERROR": "×",
-        'DEBUG': '⋯',
+        "DEBUG": "⋯",
-        'INFO': 'ℹ',
+        "INFO": "ℹ",
-        'WARNING': '⚠',
+        "WARNING": "⚠",
    }
    DEFAULT_COLORS = {
@@ -42,11 +43,11 @@ class AsyncLogger:
    def __init__(
        self,
        log_file: Optional[str] = None,
-        log_level: LogLevel = LogLevel.INFO,
+        log_level: LogLevel = LogLevel.DEBUG,
        tag_width: int = 10,
        icons: Optional[Dict[str, str]] = None,
        colors: Optional[Dict[LogLevel, str]] = None,
-        verbose: bool = True
+        verbose: bool = True,
    ):
        """
        Initialize the logger.
@@ -77,18 +78,20 @@ class AsyncLogger:
    def _get_icon(self, tag: str) -> str:
        """Get the icon for a tag, defaulting to info icon if not found."""
-        return self.icons.get(tag, self.icons['INFO'])
+        return self.icons.get(tag, self.icons["INFO"])
    def _write_to_file(self, message: str):
        """Write a message to the log file if configured."""
        if self.log_file:
-            timestamp = datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]
+            timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]
-            with open(self.log_file, 'a', encoding='utf-8') as f:
+            with open(self.log_file, "a", encoding="utf-8") as f:
                # Strip ANSI color codes for file output
-                clean_message = message.replace(Fore.RESET, '').replace(Style.RESET_ALL, '')
+                clean_message = message.replace(Fore.RESET, "").replace(
+                    Style.RESET_ALL, ""
+                )
                for color in vars(Fore).values():
                    if isinstance(color, str):
-                        clean_message = clean_message.replace(color, '')
+                        clean_message = clean_message.replace(color, "")
                f.write(f"[{timestamp}] {clean_message}\n")
    def _log(
@@ -99,7 +102,7 @@ class AsyncLogger:
        params: Optional[Dict[str, Any]] = None,
        colors: Optional[Dict[str, str]] = None,
        base_color: Optional[str] = None,
-        **kwargs
+        **kwargs,
    ):
        """
        Core logging method that handles message formatting and output.
@@ -128,12 +131,13 @@ class AsyncLogger:
                        if key in params:
                            value_str = str(params[key])
                            formatted_message = formatted_message.replace(
-                                value_str, 
+                                value_str, f"{color}{value_str}{Style.RESET_ALL}"
-                                f"{color}{value_str}{Style.RESET_ALL}"
                            )
            except KeyError as e:
-                formatted_message = f"LOGGING ERROR: Missing parameter {e} in message template"
+                formatted_message = (
+                    f"LOGGING ERROR: Missing parameter {e} in message template"
+                )
                level = LogLevel.ERROR
        else:
            formatted_message = message
@@ -175,7 +179,7 @@ class AsyncLogger:
        success: bool,
        timing: float,
        tag: str = "FETCH",
-        url_length: int = 50
+        url_length: int = 50,
    ):
        """
        Convenience method for logging URL fetch status.
@@ -195,20 +199,16 @@ class AsyncLogger:
                "url": url,
                "url_length": url_length,
                "status": success,
-                "timing": timing
+                "timing": timing,
            },
            colors={
                "status": Fore.GREEN if success else Fore.RED,
-                "timing": Fore.YELLOW
+                "timing": Fore.YELLOW,
-            }
+            },
        )
    def error_status(
-        self,
+        self, url: str, error: str, tag: str = "ERROR", url_length: int = 50
-        url: str,
-        error: str,
-        tag: str = "ERROR",
-        url_length: int = 50
    ):
        """
        Convenience method for logging error status.
@@ -223,9 +223,5 @@ class AsyncLogger:
            level=LogLevel.ERROR,
            message="{url:.{url_length}}... | Error: {error}",
            tag=tag,
-            params={
+            params={"url": url, "url_length": url_length, "error": error},
-                "url": url,
-                "url_length": url_length,
-                "error": error
-            }
        )
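Reviewer note: `_write_to_file` above strips color codes by replacing each `colorama.Fore` value and `Style.RESET_ALL` one at a time, which leaves `Back` and other `Style` sequences in the log file if they are ever used. One possible alternative, not what this commit does, is a single regular expression over ANSI SGR sequences. `strip_ansi` is a hypothetical helper name.

```python
import re

# Matches ANSI SGR (color/style) sequences such as "\x1b[31m" or "\x1b[0m".
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")


def strip_ansi(message: str) -> str:
    """Remove ANSI color and style escape sequences from a log message."""
    return ANSI_SGR.sub("", message)
```

The pattern covers only SGR sequences (those ending in `m`); cursor-movement and other escape codes are left untouched.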
--- a/crawl4ai/async_tools.py
+++ b/crawl4ai/async_tools.py
@@ -1,183 +0,0 @@
 import asyncio
 import base64
 import time
 from abc import ABC, abstractmethod
 from typing import Callable, Dict, Any, List, Optional, Awaitable
 import os, sys, shutil
 import tempfile, subprocess
 from playwright.async_api import async_playwright, Page, Browser, Error
 from playwright.async_api import TimeoutError as PlaywrightTimeoutError
 from io import BytesIO
 from PIL import Image, ImageDraw, ImageFont
 from pathlib import Path
 from playwright.async_api import ProxySettings
 from pydantic import BaseModel
 import hashlib
 import json
 import uuid
 from .models import AsyncCrawlResponse
 from .utils import create_box_message
 from .user_agent_generator import UserAgentGenerator
 from playwright_stealth import StealthConfig, stealth_async
 class ManagedBrowser:
    def __init__(self, browser_type: str = "chromium", user_data_dir: Optional[str] = None, headless: bool = False, logger = None, host: str = "localhost", debugging_port: int = 9222):
        self.browser_type = browser_type
        self.user_data_dir = user_data_dir
        self.headless = headless
        self.browser_process = None
        self.temp_dir = None
        self.debugging_port = debugging_port
        self.host = host
        self.logger = logger
        self.shutting_down = False
    async def start(self) -> str:
        """
        Starts the browser process and returns the CDP endpoint URL.
        If user_data_dir is not provided, creates a temporary directory.
        """
        # Create temp dir if needed
        if not self.user_data_dir:
            self.temp_dir = tempfile.mkdtemp(prefix="browser-profile-")
            self.user_data_dir = self.temp_dir
        # Get browser path and args based on OS and browser type
        browser_path = self._get_browser_path()
        args = self._get_browser_args()
        # Start browser process
        try:
            self.browser_process = subprocess.Popen(
                args,
                stdout=subprocess.PIPE,
                stderr=subprocess.PIPE
            )
            # Monitor browser process output for errors
            asyncio.create_task(self._monitor_browser_process())
            await asyncio.sleep(2)  # Give browser time to start
            return f"http://{self.host}:{self.debugging_port}"
        except Exception as e:
            await self.cleanup()
            raise Exception(f"Failed to start browser: {e}")
    async def _monitor_browser_process(self):
        """Monitor the browser process for unexpected termination."""
        if self.browser_process:
            try:
                stdout, stderr = await asyncio.gather(
                    asyncio.to_thread(self.browser_process.stdout.read),
                    asyncio.to_thread(self.browser_process.stderr.read)
                )
                # Check shutting_down flag BEFORE logging anything
                if self.browser_process.poll() is not None:
                    if not self.shutting_down:
                        self.logger.error(
                            message="Browser process terminated unexpectedly | Code: {code} | STDOUT: {stdout} | STDERR: {stderr}",
                            tag="ERROR",
                            params={
                                "code": self.browser_process.returncode,
                                "stdout": stdout.decode(),
                                "stderr": stderr.decode()
                            }
                        )                
                        await self.cleanup()
                    else:
                        self.logger.info(
                            message="Browser process terminated normally | Code: {code}",
                            tag="INFO",
                            params={"code": self.browser_process.returncode}
                        )
            except Exception as e:
                if not self.shutting_down:
                    self.logger.error(
                        message="Error monitoring browser process: {error}",
                        tag="ERROR",
                        params={"error": str(e)}
                    )
    def _get_browser_path(self) -> str:
        """Returns the browser executable path based on OS and browser type"""
        if sys.platform == "darwin":  # macOS
            paths = {
                "chromium": "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
                "firefox": "/Applications/Firefox.app/Contents/MacOS/firefox",
                "webkit": "/Applications/Safari.app/Contents/MacOS/Safari"
            }
        elif sys.platform == "win32":  # Windows
            paths = {
                "chromium": "C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe",
                "firefox": "C:\\Program Files\\Mozilla Firefox\\firefox.exe",
                "webkit": None  # WebKit not supported on Windows
            }
        else:  # Linux
            paths = {
                "chromium": "google-chrome",
                "firefox": "firefox",
                "webkit": None  # WebKit not supported on Linux
            }
        return paths.get(self.browser_type)
    def _get_browser_args(self) -> List[str]:
        """Returns browser-specific command line arguments"""
        base_args = [self._get_browser_path()]
        if self.browser_type == "chromium":
            args = [
                f"--remote-debugging-port={self.debugging_port}",
                f"--user-data-dir={self.user_data_dir}",
            ]
            if self.headless:
                args.append("--headless=new")
        elif self.browser_type == "firefox":
            args = [
                "--remote-debugging-port", str(self.debugging_port),
                "--profile", self.user_data_dir,
            ]
            if self.headless:
                args.append("--headless")
        else:
            raise NotImplementedError(f"Browser type {self.browser_type} not supported")
        return base_args + args
    async def cleanup(self):
        """Cleanup browser process and temporary directory"""
        # Set shutting_down flag BEFORE any termination actions
        self.shutting_down = True
        if self.browser_process:
            try:
                self.browser_process.terminate()
                # Wait for process to end gracefully
                for _ in range(10):  # 10 attempts, 100ms each
                    if self.browser_process.poll() is not None:
                        break
                    await asyncio.sleep(0.1)
                # Force kill if still running
                if self.browser_process.poll() is None:
                    self.browser_process.kill()
                    await asyncio.sleep(0.1)  # Brief wait for kill to take effect
            except Exception as e:
                self.logger.error(
                    message="Error terminating browser: {error}",
                    tag="ERROR",
                    params={"error": str(e)}
                )
        if self.temp_dir and os.path.exists(self.temp_dir):
            try:
                shutil.rmtree(self.temp_dir)
            except Exception as e:
                self.logger.error(
                    message="Error removing temporary directory: {error}",
                    tag="ERROR",
                    params={"error": str(e)}
                )
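
The `cleanup()` method above follows a terminate, poll, then force-kill sequence. As a minimal standalone sketch of that pattern (not the project's implementation): the helper names `stop_process` and `demo` are hypothetical, and a sleeping Python child process stands in for the browser.

```python
import asyncio
import subprocess
import sys


async def stop_process(
    proc: subprocess.Popen, attempts: int = 10, delay: float = 0.1
) -> int:
    """Ask proc to exit, poll for a short while, then force-kill if needed."""
    proc.terminate()  # polite request first (SIGTERM on POSIX)
    for _ in range(attempts):
        if proc.poll() is not None:  # process has already exited
            break
        await asyncio.sleep(delay)  # yield to the event loop while waiting
    if proc.poll() is None:
        proc.kill()  # still running after the grace period: force kill
    # wait() blocks, so run it in a worker thread; it also reaps the child
    return await asyncio.to_thread(proc.wait)


def demo() -> int:
    # Hypothetical stand-in for the browser: a child that sleeps for a minute
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    return asyncio.run(stop_process(child))
```

Unlike `cleanup()`, this sketch waits on the process at the end instead of sleeping briefly after `kill()`, so the caller gets the actual exit code back.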
--- a/crawl4ai/async_webcrawler.py
+++ b/crawl4ai/async_webcrawler.py
@@ -1,40 +1,54 @@
-import os, sys
+import os
 import sys
 import time
 import warnings
-from enum import Enum
+from colorama import Fore
 from colorama import init, Fore, Back, Style
 from pathlib import Path
-from typing import Optional, List, Union
+from typing import Optional, List
 import json
 import asyncio
 # from contextlib import nullcontext, asynccontextmanager
 from contextlib import asynccontextmanager
-from .models import CrawlResult, MarkdownGenerationResult
+from .models import CrawlResult, MarkdownGenerationResult, CrawlerTaskResult, DispatchResult
 from .async_database import async_db_manager
-from .chunking_strategy import *
+from .chunking_strategy import *  # noqa: F403
-from .content_filter_strategy import *
+from .chunking_strategy import RegexChunking, ChunkingStrategy, IdentityChunking
-from .extraction_strategy import *
+from .content_filter_strategy import *  # noqa: F403
-from .async_crawler_strategy import AsyncCrawlerStrategy, AsyncPlaywrightCrawlerStrategy, AsyncCrawlResponse
+from .content_filter_strategy import RelevantContentFilter
 from .extraction_strategy import * # noqa: F403
 from .extraction_strategy import NoExtractionStrategy, ExtractionStrategy
 from .async_crawler_strategy import (
    AsyncCrawlerStrategy,
    AsyncPlaywrightCrawlerStrategy,
    AsyncCrawlResponse,
 )
 from .cache_context import CacheMode, CacheContext, _legacy_to_cache_mode
-from .markdown_generation_strategy import DefaultMarkdownGenerator, MarkdownGenerationStrategy
+from .markdown_generation_strategy import (
-from .content_scraping_strategy import WebScrapingStrategy
+    DefaultMarkdownGenerator,
    MarkdownGenerationStrategy,
 )
 from .async_logger import AsyncLogger
 from .async_configs import BrowserConfig, CrawlerRunConfig
-from .config import (
+from .async_dispatcher import * # noqa: F403
-    MIN_WORD_THRESHOLD, 
+from .async_dispatcher import BaseDispatcher, MemoryAdaptiveDispatcher, RateLimiter
-    IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
+
-    URL_LOG_SHORTEN_LENGTH
+from .config import MIN_WORD_THRESHOLD
 )
 from .utils import (
    sanitize_input_encode,
    InvalidCSSSelectorError,
    format_html,
    fast_format_html,
-    create_box_message
+    create_box_message,
    get_error_context,
    RobotsParser,
 )
-from urllib.parse import urlparse
+from typing import Union, AsyncGenerator, List, TypeVar
-import random
+from collections.abc import AsyncGenerator
 CrawlResultT = TypeVar('CrawlResultT', bound=CrawlResult)
 RunManyReturn = Union[List[CrawlResultT], AsyncGenerator[CrawlResultT, None]]
 from .__version__ import __version__ as crawl4ai_version
@@ -42,14 +56,67 @@ class AsyncWebCrawler:
    """
    Asynchronous web crawler with flexible caching capabilities.
    There are two ways to use the crawler:
    1. Using context manager (recommended for simple cases):
        ```python
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(url="https://example.com")
        ```
    2. Using explicit lifecycle management (recommended for long-running applications):
        ```python
        crawler = AsyncWebCrawler()
        await crawler.start()
        # Use the crawler multiple times
        result1 = await crawler.arun(url="https://example.com")
        result2 = await crawler.arun(url="https://another.com")
        await crawler.close()
        ```
    Migration Guide:
    Old way (deprecated):
        crawler = AsyncWebCrawler(always_by_pass_cache=True, browser_type="chromium", headless=True)
    New way (recommended):
        browser_config = BrowserConfig(browser_type="chromium", headless=True)
-        crawler = AsyncWebCrawler(browser_config=browser_config)
+        crawler = AsyncWebCrawler(config=browser_config)
    Attributes:
        browser_config (BrowserConfig): Configuration object for browser settings.
        crawler_strategy (AsyncCrawlerStrategy): Strategy for crawling web pages.
        logger (AsyncLogger): Logger instance for recording events and errors.
        always_bypass_cache (bool): Whether to always bypass cache.
        crawl4ai_folder (str): Directory for storing cache.
        base_directory (str): Base directory for storing cache.
        ready (bool): Whether the crawler is ready for use.
        Methods:
            start(): Start the crawler explicitly without using context manager.
            close(): Close the crawler explicitly without using context manager.
            arun(): Run the crawler for a single source: URL (web, local file, or raw HTML).
            awarmup(): Perform warmup sequence.
            arun_many(): Run the crawler for multiple sources.
            aprocess_html(): Process HTML content.
    Typical Usage:
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(url="https://example.com")
            print(result.markdown)
        Using configuration:
        browser_config = BrowserConfig(browser_type="chromium", headless=True)
        async with AsyncWebCrawler(config=browser_config) as crawler:
            crawler_config = CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS
            )
            result = await crawler.arun(url="https://example.com", config=crawler_config)
            print(result.markdown)
    """
    _domain_last_hit = {}
    def __init__(
@@ -77,10 +144,18 @@ class AsyncWebCrawler:
        # Handle browser configuration
        browser_config = config
        if browser_config is not None:
-            if any(k in kwargs for k in ["browser_type", "headless", "viewport_width", "viewport_height"]):
+            if any(
                k in kwargs
                for k in [
                    "browser_type",
                    "headless",
                    "viewport_width",
                    "viewport_height",
                ]
            ):
                self.logger.warning(
                    message="Both browser_config and legacy browser parameters provided. browser_config will take precedence.",
-                    tag="WARNING"
+                    tag="WARNING",
                )
        else:
            # Create browser config from kwargs for backwards compatibility
@@ -92,17 +167,21 @@ class AsyncWebCrawler:
        self.logger = AsyncLogger(
            log_file=os.path.join(base_directory, ".crawl4ai", "crawler.log"),
            verbose=self.browser_config.verbose,
-            tag_width=10
+            tag_width=10,
        )
        # Initialize crawler strategy
        params = {k: v for k, v in kwargs.items() if k in ["browser_congig", "logger"]}
        self.crawler_strategy = crawler_strategy or AsyncPlaywrightCrawlerStrategy(
            browser_config=browser_config,
            logger=self.logger,
-            **kwargs  # Pass remaining kwargs for backwards compatibility
+            **params,  # Pass remaining kwargs for backwards compatibility
        )
        # If the crawler strategy doesn't have a logger, use the crawler's logger
        if not self.crawler_strategy.logger:
            self.crawler_strategy.logger = self.logger
        # Handle deprecated cache parameter
        if always_by_pass_cache is not None:
            if kwargs.get("warning", True):
@@ -111,7 +190,7 @@ class AsyncWebCrawler:
                    "Use 'always_bypass_cache' instead. "
                    "Pass warning=False to suppress this warning.",
                    DeprecationWarning,
-                    stacklevel=2
+                    stacklevel=2,
                )
            self.always_bypass_cache = always_by_pass_cache
        else:
@@ -125,18 +204,54 @@ class AsyncWebCrawler:
        os.makedirs(self.crawl4ai_folder, exist_ok=True)
        os.makedirs(f"{self.crawl4ai_folder}/cache", exist_ok=True)
        # Initialize robots parser
        self.robots_parser = RobotsParser()
        self.ready = False
-    async def __aenter__(self):
+    async def start(self):
        """
        Start the crawler explicitly without using context manager.
        This is equivalent to using 'async with' but gives more control over the lifecycle.
        This method will:
        1. Initialize the browser and context
        2. Perform warmup sequence
        3. Return the crawler instance for method chaining
        Returns:
            AsyncWebCrawler: The initialized crawler instance
        """
        await self.crawler_strategy.__aenter__()
        await self.awarmup()
        return self
    async def close(self):
        """
        Close the crawler explicitly without using context manager.
        This should be called when you're done with the crawler if you used start().
        This method will:
        1. Clean up browser resources
        2. Close any open pages and contexts
        """
        await self.crawler_strategy.__aexit__(None, None, None)
    async def __aenter__(self):
        return await self.start()
    async def __aexit__(self, exc_type, exc_val, exc_tb):
-        await self.crawler_strategy.__aexit__(exc_type, exc_val, exc_tb)
+        await self.close()
    async def awarmup(self):
-        """Initialize the crawler with warm-up sequence."""
+        """
        Initialize the crawler with warm-up sequence.
        This method:
        1. Logs initialization info
        2. Sets up browser configuration
        3. Marks the crawler as ready
        """
        self.logger.info(f"Crawl4AI {crawl4ai_version}", tag="INIT")
        self.ready = True
@@ -204,14 +319,14 @@ class AsyncWebCrawler:
            try:
                # Handle configuration
                if crawler_config is not None:
-                        if any(param is not None for param in [
+                    # if any(param is not None for param in [
-                            word_count_threshold, extraction_strategy, chunking_strategy,
+                    #     word_count_threshold, extraction_strategy, chunking_strategy,
-                            content_filter, cache_mode, css_selector, screenshot, pdf
+                    #     content_filter, cache_mode, css_selector, screenshot, pdf
-                        ]):
+                    # ]):
-                            self.logger.warning(
+                    #     self.logger.warning(
-                                message="Both crawler_config and legacy parameters provided. crawler_config will take precedence.",
+                    #         message="Both crawler_config and legacy parameters provided. crawler_config will take precedence.",
-                                tag="WARNING"
+                    #         tag="WARNING"
-                            )
+                    #     )
                    config = crawler_config
                else:
                    # Merge all parameters into a single kwargs dict for config creation
@@ -229,7 +344,7 @@ class AsyncWebCrawler:
                        "screenshot": screenshot,
                        "pdf": pdf,
                        "verbose": verbose,
-                            **kwargs
+                        **kwargs,
                    }
                    config = CrawlerRunConfig.from_kwargs(config_kwargs)
@@ -240,7 +355,7 @@ class AsyncWebCrawler:
                            "Cache control boolean flags are deprecated and will be removed in version 0.5.0. "
                            "Use 'cache_mode' parameter instead.",
                            DeprecationWarning,
-                                stacklevel=2
+                            stacklevel=2,
                        )
                    # Convert legacy parameters if cache_mode not provided
@@ -249,7 +364,7 @@ class AsyncWebCrawler:
                            disable_cache=disable_cache,
                            bypass_cache=bypass_cache,
                            no_cache_read=no_cache_read,
-                                no_cache_write=no_cache_write
+                            no_cache_write=no_cache_write,
                        )
                # Default to ENABLED if no cache mode specified
@@ -257,11 +372,13 @@ class AsyncWebCrawler:
                    config.cache_mode = CacheMode.ENABLED
                # Create cache context
-                    cache_context = CacheContext(url, config.cache_mode, self.always_bypass_cache)
+                cache_context = CacheContext(
                    url, config.cache_mode, self.always_bypass_cache
                )
                # Initialize processing variables
                async_response: AsyncCrawlResponse = None
-                    cached_result = None
+                cached_result: CrawlResult = None
                screenshot_data = None
                pdf_data = None
                extracted_content = None
@@ -273,7 +390,14 @@ class AsyncWebCrawler:
                if cached_result:
                    html = sanitize_input_encode(cached_result.html)
-                        extracted_content = sanitize_input_encode(cached_result.extracted_content or "")
+                    extracted_content = sanitize_input_encode(
                        cached_result.extracted_content or ""
                    )
                    extracted_content = (
                        None
                        if not extracted_content or extracted_content == "[]"
                        else extracted_content
                    )
                    # If a screenshot is requested but it's not in the cache, set cached_result to None
                    screenshot_data = cached_result.screenshot
                    pdf_data = cached_result.pdf
@@ -284,7 +408,7 @@ class AsyncWebCrawler:
                        url=cache_context.display_url,
                        success=bool(html),
                        timing=time.perf_counter() - start_time,
-                            tag="FETCH"
+                        tag="FETCH",
                    )
                # Fetch fresh content if needed
@@ -294,10 +418,22 @@ class AsyncWebCrawler:
                    if user_agent:
                        self.crawler_strategy.update_user_agent(user_agent)
                    # Check robots.txt if enabled
                    if config and config.check_robots_txt:
                        if not await self.robots_parser.can_fetch(url, self.browser_config.user_agent):
                            return CrawlResult(
                                url=url,
                                html="",
                                success=False,
                                status_code=403,
                                error_message="Access denied by robots.txt",
                                response_headers={"X-Robots-Status": "Blocked by robots.txt"}
                            )
                    # Pass config to crawl method
                    async_response = await self.crawler_strategy.crawl(
                        url,
-                            config=config  # Pass the entire config object
+                        config=config,  # Pass the entire config object
                    )
                    html = sanitize_input_encode(async_response.html)
@@ -309,11 +445,11 @@ class AsyncWebCrawler:
                        url=cache_context.display_url,
                        success=bool(html),
                        timing=t2 - t1,
-                            tag="FETCH"
+                        tag="FETCH",
                    )
                    # Process the HTML content
-                    crawl_result = await self.aprocess_html(
+                    crawl_result : CrawlResult = await self.aprocess_html(
                        url=url,
                        html=html,
                        extracted_content=extracted_content,
@@ -321,20 +457,40 @@ class AsyncWebCrawler:
                        screenshot=screenshot_data,
                        pdf_data=pdf_data,
                        verbose=config.verbose,
-                        **kwargs
+                        is_raw_html=url.startswith("raw:"),
                        **kwargs,
                    )
                    # Set response data
                    if async_response:
                    crawl_result.status_code = async_response.status_code
                    crawl_result.redirected_url = async_response.redirected_url or url
                    crawl_result.response_headers = async_response.response_headers
                    crawl_result.downloaded_files = async_response.downloaded_files
-                    else:
+                    crawl_result.ssl_certificate = (
-                        crawl_result.status_code = 200
+                        async_response.ssl_certificate
-                        crawl_result.response_headers = cached_result.response_headers if cached_result else {}
+                    )  # Add SSL certificate
                    # # Check and set values from async_response to crawl_result
                    # try:
                    #     for key in vars(async_response):
                    #         if hasattr(crawl_result, key):
                    #             value = getattr(async_response, key, None)
                    #             current_value = getattr(crawl_result, key, None)
                    #             if value is not None and not current_value:
                    #                 try:
                    #                     setattr(crawl_result, key, value)
                    #                 except Exception as e:
                    #                     self.logger.warning(
                    #                         message=f"Failed to set attribute {key}: {str(e)}",
                    #                         tag="WARNING"
                    #                     )
                    # except Exception as e:
                    #     self.logger.warning(
                    #         message=f"Error copying response attributes: {str(e)}",
                    #         tag="WARNING"
                    #     )
                    crawl_result.success = bool(html)
-                    crawl_result.session_id = getattr(config, 'session_id', None)
+                    crawl_result.session_id = getattr(config, "session_id", None)
                    self.logger.success(
                        message="{url:.50}... | Status: {status} | Total: {timing}",
@@ -342,12 +498,12 @@ class AsyncWebCrawler:
                        params={
                            "url": cache_context.display_url,
                            "status": crawl_result.success,
-                            "timing": f"{time.perf_counter() - start_time:.2f}s"
+                            "timing": f"{time.perf_counter() - start_time:.2f}s",
                        },
                        colors={
                            "status": Fore.GREEN if crawl_result.success else Fore.RED,
-                            "timing": Fore.YELLOW
+                            "timing": Fore.YELLOW,
-                        }
+                        },
                    )
                    # Update cache if appropriate
@@ -356,6 +512,23 @@ class AsyncWebCrawler:
                    return crawl_result
                else:
                    self.logger.success(
                        message="{url:.50}... | Status: {status} | Total: {timing}",
                        tag="COMPLETE",
                        params={
                            "url": cache_context.display_url,
                            "status": True,
                            "timing": f"{time.perf_counter() - start_time:.2f}s",
                        },
                        colors={"status": Fore.GREEN, "timing": Fore.YELLOW},
                    )
                    cached_result.success = bool(html)
                    cached_result.session_id = getattr(config, "session_id", None)
                    cached_result.redirected_url = cached_result.redirected_url or url
                    return cached_result
            except Exception as e:
                error_context = get_error_context(sys.exc_info())
@@ -371,14 +544,11 @@ class AsyncWebCrawler:
                self.logger.error_status(
                    url=url,
                    error=create_box_message(error_message, type="error"),
-                        tag="ERROR"
+                    tag="ERROR",
                )
                return CrawlResult(
-                        url=url,
+                    url=url, html="", success=False, error_message=error_message
                        html="",
                        success=False,
                        error_message=error_message
                )
    async def aprocess_html(
@@ -401,6 +571,7 @@ class AsyncWebCrawler:
            extracted_content: Previously extracted content (if any)
            config: Configuration object controlling processing behavior
            screenshot: Screenshot data (if any)
            pdf_data: PDF data (if any)
            verbose: Whether to enable verbose logging
            **kwargs: Additional parameters for backwards compatibility
@@ -411,49 +582,58 @@ class AsyncWebCrawler:
            _url = url if not kwargs.get("is_raw_html", False) else "Raw HTML"
            t1 = time.perf_counter()
-                # Initialize scraping strategy
+            # Get scraping strategy and ensure it has a logger
-                scrapping_strategy = WebScrapingStrategy(logger=self.logger)
+            scraping_strategy = config.scraping_strategy
            if not scraping_strategy.logger:
                scraping_strategy.logger = self.logger
            # Process HTML content
-                result = scrapping_strategy.scrap(
+            params = {k: v for k, v in config.to_dict().items() if k not in ["url"]}
-                    url,
+            # Add keys from kwargs that don't already exist in params
-                    html,
+            params.update({k: v for k, v in kwargs.items() if k not in params.keys()})
-                    word_count_threshold=config.word_count_threshold,
+
-                    css_selector=config.css_selector,
+            result = scraping_strategy.scrap(url, html, **params)
                    only_text=config.only_text,
                    image_description_min_word_threshold=config.image_description_min_word_threshold,
                    content_filter=config.content_filter,
                    **kwargs
                )
            if result is None:
-                    raise ValueError(f"Process HTML, Failed to extract content from the website: {url}")
+                raise ValueError(
                    f"Process HTML, Failed to extract content from the website: {url}"
                )
        except InvalidCSSSelectorError as e:
            raise ValueError(str(e))
        except Exception as e:
-                raise ValueError(f"Process HTML, Failed to extract content from the website: {url}, error: {str(e)}")
+            raise ValueError(
                f"Process HTML, Failed to extract content from the website: {url}, error: {str(e)}"
            )
-       
+        # Extract results - handle both dict and ScrapingResult
-
+        if isinstance(result, dict):
            # Extract results
            cleaned_html = sanitize_input_encode(result.get("cleaned_html", ""))
-            fit_markdown = sanitize_input_encode(result.get("fit_markdown", ""))
+            media = result.get("media", {})
-            fit_html = sanitize_input_encode(result.get("fit_html", ""))
+            links = result.get("links", {})
            media = result.get("media", [])
            links = result.get("links", [])
            metadata = result.get("metadata", {})
        else:
            cleaned_html = sanitize_input_encode(result.cleaned_html)
            media = result.media.model_dump()
            links = result.links.model_dump()
            metadata = result.metadata
        # Markdown Generation
-            markdown_generator: Optional[MarkdownGenerationStrategy] = config.markdown_generator or DefaultMarkdownGenerator()
+        markdown_generator: Optional[MarkdownGenerationStrategy] = (
-            if not config.content_filter and not markdown_generator.content_filter:
+            config.markdown_generator or DefaultMarkdownGenerator()
-                markdown_generator.content_filter = PruningContentFilter()
+        )
-            markdown_result: MarkdownGenerationResult = markdown_generator.generate_markdown(
+        # Uncomment if by default we want to use PruningContentFilter
        # if not config.content_filter and not markdown_generator.content_filter:
        #     markdown_generator.content_filter = PruningContentFilter()
        markdown_result: MarkdownGenerationResult = (
            markdown_generator.generate_markdown(
                cleaned_html=cleaned_html,
                base_url=url,
                # html2text_options=kwargs.get('html2text', {})
            )
        )
        markdown_v2 = markdown_result
        markdown = sanitize_input_encode(markdown_result.raw_markdown)
@@ -461,38 +641,50 @@ class AsyncWebCrawler:
        self.logger.info(
            message="Processed {url:.50}... | Time: {timing}ms",
            tag="SCRAPE",
-                params={
+            params={"url": _url, "timing": int((time.perf_counter() - t1) * 1000)},
                    "url": _url,
                    "timing": int((time.perf_counter() - t1) * 1000)
                }
        )
        # Handle content extraction if needed
-            if (extracted_content is None and 
+        if (
-                config.extraction_strategy and 
+            not bool(extracted_content)
-                config.chunking_strategy and 
+            and config.extraction_strategy
-                not isinstance(config.extraction_strategy, NoExtractionStrategy)):
+            and not isinstance(config.extraction_strategy, NoExtractionStrategy)
-                
+        ):
            t1 = time.perf_counter()
-                # Handle different extraction strategy types
+            # Choose content based on input_format
-                if isinstance(config.extraction_strategy, (JsonCssExtractionStrategy, JsonCssExtractionStrategy)):
+            content_format = config.extraction_strategy.input_format
-                    config.extraction_strategy.verbose = verbose
+            if content_format == "fit_markdown" and not markdown_result.fit_markdown:
-                    extracted_content = config.extraction_strategy.run(url, [html])
+                self.logger.warning(
-                    extracted_content = json.dumps(extracted_content, indent=4, default=str, ensure_ascii=False)
+                    message="Fit markdown requested but not available. Falling back to raw markdown.",
-                else:
+                    tag="EXTRACT",
-                    sections = config.chunking_strategy.chunk(markdown)
+                    params={"url": _url},
                )
                content_format = "markdown"
            content = {
                "markdown": markdown,
                "html": html,
                # The fallback above guarantees fit_markdown is non-empty when selected
                "fit_markdown": markdown_result.fit_markdown,
            }.get(content_format, markdown)
            # Use IdentityChunking for HTML input, otherwise use provided chunking strategy
            chunking = (
                IdentityChunking()
                if content_format == "html"
                else config.chunking_strategy
            )
            sections = chunking.chunk(content)
            extracted_content = config.extraction_strategy.run(url, sections)
-                    extracted_content = json.dumps(extracted_content, indent=4, default=str, ensure_ascii=False)
+            extracted_content = json.dumps(
                extracted_content, indent=4, default=str, ensure_ascii=False
            )
            # Log extraction completion
            self.logger.info(
                message="Completed for {url:.50}... | Time: {timing}s",
                tag="EXTRACT",
-                    params={
+                params={"url": _url, "timing": time.perf_counter() - t1},
                        "url": _url,
                        "timing": time.perf_counter() - t1
                    }
            )
        # Handle screenshot and PDF data
@@ -510,8 +702,8 @@ class AsyncWebCrawler:
            cleaned_html=cleaned_html,
            markdown_v2=markdown_v2,
            markdown=markdown,
-                fit_markdown=fit_markdown,
+            fit_markdown=markdown_result.fit_markdown,
-                fit_html=fit_html,
+            fit_html=markdown_result.fit_html,
            media=media,
            links=links,
            metadata=metadata,
@@ -526,6 +718,7 @@ class AsyncWebCrawler:
        self,
        urls: List[str],
        config: Optional[CrawlerRunConfig] = None, 
        dispatcher: Optional[BaseDispatcher] = None,
        # Legacy parameters maintained for backwards compatibility
        word_count_threshold=MIN_WORD_THRESHOLD,
        extraction_strategy: ExtractionStrategy = None,
@@ -538,138 +731,83 @@ class AsyncWebCrawler:
        pdf: bool = False,
        user_agent: str = None,
        verbose=True,
-            **kwargs,
+        **kwargs
-        ) -> List[CrawlResult]:
+        ) -> RunManyReturn:
        """
-            Runs the crawler for multiple URLs concurrently.
+        Runs the crawler for multiple URLs concurrently using a configurable dispatcher strategy.
            Migration Guide:
            Old way (deprecated):
                results = await crawler.arun_many(
                    urls,
                    word_count_threshold=200,
                    screenshot=True,
                    ...
                )
            New way (recommended):
                config = CrawlerRunConfig(
                    word_count_threshold=200,
                    screenshot=True,
                    ...
                )
                results = await crawler.arun_many(urls, config=config)
        Args:
        urls: List of URLs to crawl
-                crawler_config: Configuration object controlling crawl behavior for all URLs
+        config: Configuration object controlling crawl behavior for all URLs
        dispatcher: The dispatcher strategy instance to use. Defaults to MemoryAdaptiveDispatcher
        [other parameters maintained for backwards compatibility]
        Returns:
-                List[CrawlResult]: Results for each URL
+        Union[List[CrawlResult], AsyncGenerator[CrawlResult, None]]:
            Either a list of all results or an async generator yielding results
        Examples:
        # Batch processing (default)
        results = await crawler.arun_many(
            urls=["https://example1.com", "https://example2.com"],
            config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        )
        for result in results:
            print(f"Processed {result.url}: {len(result.markdown)} chars")
        # Streaming results
        async for result in await crawler.arun_many(
            urls=["https://example1.com", "https://example2.com"],
            config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS, stream=True),
        ):
            print(f"Processed {result.url}: {len(result.markdown)} chars")
        """
-            crawler_config = config
+        if config is None:
-            # Handle configuration
+            config = CrawlerRunConfig(
-            if crawler_config is not None:
+                word_count_threshold=word_count_threshold,
-                if any(param is not None for param in [
+                extraction_strategy=extraction_strategy,
-                    word_count_threshold, extraction_strategy, chunking_strategy,
+                chunking_strategy=chunking_strategy,
-                    content_filter, cache_mode, css_selector, screenshot, pdf
+                content_filter=content_filter,
-                ]):
+                cache_mode=cache_mode,
-                    self.logger.warning(
+                bypass_cache=bypass_cache,
-                        message="Both crawler_config and legacy parameters provided. crawler_config will take precedence.",
+                css_selector=css_selector,
-                        tag="WARNING"
+                screenshot=screenshot,
                pdf=pdf,
                verbose=verbose,
                **kwargs,
            )
-                config = crawler_config
+
        if dispatcher is None:
            dispatcher = MemoryAdaptiveDispatcher(
                rate_limiter=RateLimiter(
                    base_delay=(1.0, 3.0), max_delay=60.0, max_retries=3
                ),
            )
        transform_result = lambda task_result: (
            setattr(task_result.result, 'dispatch_result', 
                DispatchResult(
                    task_id=task_result.task_id,
                    memory_usage=task_result.memory_usage,
                    peak_memory=task_result.peak_memory,
                    start_time=task_result.start_time,
                    end_time=task_result.end_time,
                    error_message=task_result.error_message,
                )
            ) or task_result.result
        )
        stream = config.stream
        if stream:
            async def result_transformer():
                async for task_result in dispatcher.run_urls_stream(crawler=self, urls=urls, config=config):
                    yield transform_result(task_result)
            return result_transformer()
        else:
-                # Merge all parameters into a single kwargs dict for config creation
+            _results = await dispatcher.run_urls(crawler=self, urls=urls, config=config)
-                config_kwargs = {
+            return [transform_result(res) for res in _results]    
                    "word_count_threshold": word_count_threshold,
                    "extraction_strategy": extraction_strategy,
                    "chunking_strategy": chunking_strategy,
                    "content_filter": content_filter,
                    "cache_mode": cache_mode,
                    "bypass_cache": bypass_cache,
                    "css_selector": css_selector,
                    "screenshot": screenshot,
                    "pdf": pdf,
                    "verbose": verbose,
                    **kwargs
                }
                config = CrawlerRunConfig.from_kwargs(config_kwargs)
            if bypass_cache:
                if kwargs.get("warning", True):
                    warnings.warn(
                        "'bypass_cache' is deprecated and will be removed in version 0.5.0. "
                        "Use 'cache_mode=CacheMode.BYPASS' instead. "
                        "Pass warning=False to suppress this warning.",
                        DeprecationWarning,
                        stacklevel=2
                    )
                if config.cache_mode is None:
                    config.cache_mode = CacheMode.BYPASS
            semaphore_count = config.semaphore_count or 5
            semaphore = asyncio.Semaphore(semaphore_count)
            async def crawl_with_semaphore(url):
                # Handle rate limiting per domain
                domain = urlparse(url).netloc
                current_time = time.time()
                self.logger.debug(
                    message="Started task for {url:.50}...",
                    tag="PARALLEL",
                    params={"url": url}
                )
                # Get delay settings from config
                mean_delay = config.mean_delay
                max_range = config.max_range
                # Apply rate limiting
                if domain in self._domain_last_hit:
                    time_since_last = current_time - self._domain_last_hit[domain]
                    if time_since_last < mean_delay:
                        delay = mean_delay + random.uniform(0, max_range)
                        await asyncio.sleep(delay)
                self._domain_last_hit[domain] = current_time
                async with semaphore:
                    return await self.arun(
                        url,
                        crawler_config=config,  # Pass the entire config object
                        user_agent=user_agent  # Maintain user_agent override capability
                    )
            # Log start of concurrent crawling
            self.logger.info(
                message="Starting concurrent crawling for {count} URLs...",
                tag="INIT",
                params={"count": len(urls)}
            )
            # Execute concurrent crawls
            start_time = time.perf_counter()
            tasks = [crawl_with_semaphore(url) for url in urls]
            results = await asyncio.gather(*tasks, return_exceptions=True)
            end_time = time.perf_counter()
            # Log completion
            self.logger.success(
                message="Concurrent crawling completed for {count} URLs | Total time: {timing}",
                tag="COMPLETE",
                params={
                    "count": len(urls),
                    "timing": f"{end_time - start_time:.2f}s"
                },
                colors={
                    "timing": Fore.YELLOW
                }
            )
            return [result if not isinstance(result, Exception) else str(result) for result in results]
    async def aclear_cache(self):
        """Clear the cache database."""
@@ -682,5 +820,3 @@ class AsyncWebCrawler:
    async def aget_cache_size(self):
        """Get the total number of cached items."""
        return await async_db_manager.aget_total_count()
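
The reworked `arun_many` above returns either a list or an async generator depending on `config.stream`. The sketch below isolates that pattern. `FakeResult`, `FakeTaskResult`, `fake_stream`, and `run_many` are hypothetical stand-ins for `CrawlResult`, the dispatcher's task record, `dispatcher.run_urls_stream`, and `arun_many`; none of them is part of crawl4ai.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncGenerator, List, Optional, Union


@dataclass
class FakeResult:
    """Hypothetical stand-in for CrawlResult."""
    url: str
    dispatch_result: Optional[dict] = None


@dataclass
class FakeTaskResult:
    """Hypothetical stand-in for the dispatcher's per-task record."""
    task_id: str
    result: FakeResult


def transform_result(task_result: FakeTaskResult) -> FakeResult:
    # Same effect as the `setattr(...) or task_result.result` lambda in the
    # diff: attach dispatch metadata, then hand back the inner result.
    task_result.result.dispatch_result = {"task_id": task_result.task_id}
    return task_result.result


async def fake_stream(urls: List[str]) -> AsyncGenerator[FakeTaskResult, None]:
    # Assumption: a real dispatcher would yield results as crawls finish.
    for i, url in enumerate(urls):
        await asyncio.sleep(0)
        yield FakeTaskResult(task_id=str(i), result=FakeResult(url=url))


async def run_many(
    urls: List[str], stream: bool = False
) -> Union[List[FakeResult], AsyncGenerator[FakeResult, None]]:
    if stream:
        async def result_transformer():
            async for task_result in fake_stream(urls):
                yield transform_result(task_result)
        # Return the generator object itself; the caller iterates it.
        return result_transformer()
    # Batch mode: drain everything, then return a plain list.
    return [transform_result(tr) async for tr in fake_stream(urls)]


async def demo():
    urls = ["https://example1.com", "https://example2.com"]
    batch = await run_many(urls)
    streamed = [r.url async for r in await run_many(urls, stream=True)]
    return batch, streamed
```

Because the streaming branch returns a generator from a coroutine, callers write `async for result in await run_many(...)`, which matches the streaming example in the docstring.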
--- a/crawl4ai/cache_context.py
+++ b/crawl4ai/cache_context.py
@@ -12,6 +12,7 @@ class CacheMode(Enum):
    - WRITE_ONLY: Only write to cache, don't read
    - BYPASS: Bypass cache for this operation
    """
    ENABLED = "enabled"
    DISABLED = "disabled"
    READ_ONLY = "read_only"
@@ -25,25 +26,62 @@ class CacheContext:
    This class centralizes all cache-related logic and URL type checking,
    making the caching behavior more predictable and maintainable.
    Attributes:
        url (str): The URL being processed.
        cache_mode (CacheMode): The cache mode for the current operation.
        always_bypass (bool): If True, bypasses caching for this operation.
        is_cacheable (bool): True if the URL is cacheable, False otherwise.
        is_web_url (bool): True if the URL is a web URL, False otherwise.
        is_local_file (bool): True if the URL is a local file, False otherwise.
        is_raw_html (bool): True if the URL is raw HTML, False otherwise.
        _url_display (str): The display name for the URL (web, local file, or raw HTML).
    """
    def __init__(self, url: str, cache_mode: CacheMode, always_bypass: bool = False):
        """
        Initializes the CacheContext with the provided URL and cache mode.
        Args:
            url (str): The URL being processed.
            cache_mode (CacheMode): The cache mode for the current operation.
            always_bypass (bool): If True, bypasses caching for this operation.
        """
        self.url = url
        self.cache_mode = cache_mode
        self.always_bypass = always_bypass
-        self.is_cacheable = url.startswith(('http://', 'https://', 'file://'))
+        self.is_cacheable = url.startswith(("http://", "https://", "file://"))
-        self.is_web_url = url.startswith(('http://', 'https://'))
+        self.is_web_url = url.startswith(("http://", "https://"))
        self.is_local_file = url.startswith("file://")
        self.is_raw_html = url.startswith("raw:")
        self._url_display = url if not self.is_raw_html else "Raw HTML"
    def should_read(self) -> bool:
-        """Determines if cache should be read based on context."""
+        """
        Determines if cache should be read based on context.
        How it works:
        1. If always_bypass is True or is_cacheable is False, return False.
        2. If cache_mode is ENABLED or READ_ONLY, return True.
        Returns:
            bool: True if cache should be read, False otherwise.
        """
        if self.always_bypass or not self.is_cacheable:
            return False
        return self.cache_mode in [CacheMode.ENABLED, CacheMode.READ_ONLY]
    def should_write(self) -> bool:
-        """Determines if cache should be written based on context."""
+        """
        Determines if cache should be written based on context.
        How it works:
        1. If always_bypass is True or is_cacheable is False, return False.
        2. If cache_mode is ENABLED or WRITE_ONLY, return True.
        Returns:
            bool: True if cache should be written, False otherwise.
        """
        if self.always_bypass or not self.is_cacheable:
            return False
        return self.cache_mode in [CacheMode.ENABLED, CacheMode.WRITE_ONLY]
@@ -58,7 +96,7 @@ def _legacy_to_cache_mode(
    disable_cache: bool = False,
    bypass_cache: bool = False,
    no_cache_read: bool = False,
-    no_cache_write: bool = False
+    no_cache_write: bool = False,
 ) -> CacheMode:
    """
    Converts legacy cache parameters to the new CacheMode enum.
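
The `should_read` and `should_write` rules documented above reduce to a small truth table per cache mode. The following standalone sketch re-implements only those two checks to make the table visible. The string values for `WRITE_ONLY` and `BYPASS` are assumptions, since the hunk does not show them, and `decision_table` is a hypothetical helper.

```python
from enum import Enum


class CacheMode(Enum):
    ENABLED = "enabled"
    DISABLED = "disabled"
    READ_ONLY = "read_only"
    WRITE_ONLY = "write_only"  # assumed value, not shown in the diff
    BYPASS = "bypass"          # assumed value, not shown in the diff


def _is_cacheable(url: str) -> bool:
    # Mirrors CacheContext.is_cacheable: raw HTML ("raw:...") is never cached.
    return url.startswith(("http://", "https://", "file://"))


def should_read(url: str, mode: CacheMode, always_bypass: bool = False) -> bool:
    if always_bypass or not _is_cacheable(url):
        return False
    return mode in (CacheMode.ENABLED, CacheMode.READ_ONLY)


def should_write(url: str, mode: CacheMode, always_bypass: bool = False) -> bool:
    if always_bypass or not _is_cacheable(url):
        return False
    return mode in (CacheMode.ENABLED, CacheMode.WRITE_ONLY)


def decision_table(url: str) -> dict:
    """Map each mode name to a (read, write) pair for the given URL."""
    return {
        mode.name: (should_read(url, mode), should_write(url, mode))
        for mode in CacheMode
    }
```

For a web URL only `ENABLED` both reads and writes; `DISABLED` and `BYPASS` do neither, and a `raw:` input is never cached regardless of mode.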
--- a/crawl4ai/chunking_strategy.py
+++ b/crawl4ai/chunking_strategy.py
@@ -3,23 +3,53 @@ import re
 from collections import Counter
 import string
 from .model_loader import load_nltk_punkt
-from .utils import *
+
 # Define the abstract base class for chunking strategies
 class ChunkingStrategy(ABC):
    """
    Abstract base class for chunking strategies.
    """
    @abstractmethod
    def chunk(self, text: str) -> list:
        """
        Abstract method to chunk the given text.
        Args:
            text (str): The text to chunk.
        Returns:
            list: A list of chunks.
        """
        pass
 # Create an identity chunking strategy f(x) = [x]
 class IdentityChunking(ChunkingStrategy):
    """
    Chunking strategy that returns the input text as a single chunk.
    """
    def chunk(self, text: str) -> list:
        return [text]
 # Regex-based chunking
 class RegexChunking(ChunkingStrategy):
    """
    Chunking strategy that splits text based on regular expression patterns.
    """
    def __init__(self, patterns=None, **kwargs):
        """
        Initialize the RegexChunking object.
        Args:
            patterns (list): A list of regular expression patterns to split text.
        """
        if patterns is None:
-            patterns = [r'\n\n']  # Default split pattern
+            patterns = [r"\n\n"]  # Default split pattern
        self.patterns = patterns
    def chunk(self, text: str) -> list:
@@ -31,11 +61,18 @@ class RegexChunking(ChunkingStrategy):
            paragraphs = new_paragraphs
        return paragraphs
 # NLP-based sentence chunking
 class NlpSentenceChunking(ChunkingStrategy):
    """
    Chunking strategy that splits text into sentences using NLTK's sentence tokenizer.
    """
    def __init__(self, **kwargs):
        """
        Initialize the NlpSentenceChunking object.
        """
        load_nltk_punkt()
        pass
    def chunk(self, text: str) -> list:
        # Improved regex for sentence splitting
@@ -45,16 +82,32 @@ class NlpSentenceChunking(ChunkingStrategy):
        # sentences = sentence_endings.split(text)
        # sens =  [sent.strip() for sent in sentences if sent]
        from nltk.tokenize import sent_tokenize
        sentences = sent_tokenize(text)
        sens = [sent.strip() for sent in sentences]
        return list(set(sens))
 # Topic-based segmentation using TextTiling
 class TopicSegmentationChunking(ChunkingStrategy):
    """
    Chunking strategy that segments text into topics using NLTK's TextTilingTokenizer.
    How it works:
    1. Segment the text into topics using TextTilingTokenizer
    2. Extract keywords for each topic segment
    """
    def __init__(self, num_keywords=3, **kwargs):
        """
        Initialize the TopicSegmentationChunking object.
        Args:
            num_keywords (int): The number of keywords to extract for each topic segment.
        """
        import nltk as nl
        self.tokenizer = nl.tokenize.TextTilingTokenizer()
        self.num_keywords = num_keywords
@@ -66,8 +119,14 @@ class TopicSegmentationChunking(ChunkingStrategy):
    def extract_keywords(self, text: str) -> list:
        # Tokenize and remove stopwords and punctuation
        import nltk as nl
        tokens = nl.tokenize.word_tokenize(text)
-        tokens = [token.lower() for token in tokens if token not in nl.corpus.stopwords.words('english') and token not in string.punctuation]
+        tokens = [
            token.lower()
            for token in tokens
            if token not in nl.corpus.stopwords.words("english")
            and token not in string.punctuation
        ]
        # Calculate frequency distribution
        freq_dist = Counter(tokens)
@@ -78,11 +137,23 @@ class TopicSegmentationChunking(ChunkingStrategy):
        # Segment the text into topics
        segments = self.chunk(text)
        # Extract keywords for each topic segment
-        segments_with_topics = [(segment, self.extract_keywords(segment)) for segment in segments]
+        segments_with_topics = [
            (segment, self.extract_keywords(segment)) for segment in segments
        ]
        return segments_with_topics
 # Fixed-length word chunks
 class FixedLengthWordChunking(ChunkingStrategy):
    """
    Chunking strategy that splits text into fixed-length word chunks.
    How it works:
    1. Split the text into words
    2. Create chunks of fixed length
    3. Return the list of chunks
    """
    def __init__(self, chunk_size=100, **kwargs):
        """
        Initialize the fixed-length word chunking strategy with the given chunk size.
@@ -94,10 +165,23 @@ class FixedLengthWordChunking(ChunkingStrategy):
    def chunk(self, text: str) -> list:
        words = text.split()
-        return [' '.join(words[i:i + self.chunk_size]) for i in range(0, len(words), self.chunk_size)]
+        return [
            " ".join(words[i : i + self.chunk_size])
            for i in range(0, len(words), self.chunk_size)
        ]
 # Sliding window chunking
 class SlidingWindowChunking(ChunkingStrategy):
    """
    Chunking strategy that splits text into overlapping word chunks.
    How it works:
    1. Split the text into words
    2. Create chunks of window_size words, advancing the window by step words
    3. Return the list of chunks
    """
    def __init__(self, window_size=100, step=50, **kwargs):
        """
        Initialize the sliding window chunking strategy with the given window size and
@@ -118,17 +202,27 @@ class SlidingWindowChunking(ChunkingStrategy):
            return [text]
        for i in range(0, len(words) - self.window_size + 1, self.step):
-            chunk = ' '.join(words[i:i + self.window_size])
+            chunk = " ".join(words[i : i + self.window_size])
            chunks.append(chunk)
        # Handle the last chunk if it doesn't align perfectly
        if i + self.window_size < len(words):
-            chunks.append(' '.join(words[-self.window_size:]))
+            chunks.append(" ".join(words[-self.window_size :]))
        return chunks
 class OverlappingWindowChunking(ChunkingStrategy):
    """
    Chunking strategy that splits text into overlapping word chunks.
    How it works:
    1. Split the text into words using whitespace
    2. Create chunks of fixed length equal to the window size
    3. Slide the window by the overlap size
    4. Return the list of chunks
    """
    def __init__(self, window_size=1000, overlap=100, **kwargs):
        """
        Initialize the overlapping window chunking strategy with the given window size and
@@ -151,7 +245,7 @@ class OverlappingWindowChunking(ChunkingStrategy):
        start = 0
        while start < len(words):
            end = start + self.window_size
-            chunk = ' '.join(words[start:end])
+            chunk = " ".join(words[start:end])
            chunks.append(chunk)
            if end >= len(words):
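
To compare the fixed-length and sliding-window strategies touched in this file, here is a minimal function-level sketch of both. It follows the loops visible in the hunks. The early return for short inputs in `sliding_window_chunks` is an assumption, because the guard condition is outside the shown context, and both function names are hypothetical.

```python
from typing import List


def fixed_length_chunks(text: str, chunk_size: int = 100) -> List[str]:
    """Non-overlapping chunks of at most chunk_size words."""
    words = text.split()
    return [
        " ".join(words[i : i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def sliding_window_chunks(text: str, window_size: int = 100, step: int = 50) -> List[str]:
    """Overlapping chunks: a window of window_size words moved by step words."""
    words = text.split()
    # Assumption: inputs no longer than one window come back as a single chunk.
    if len(words) <= window_size:
        return [text]
    chunks = []
    last_start = 0
    for i in range(0, len(words) - window_size + 1, step):
        chunks.append(" ".join(words[i : i + window_size]))
        last_start = i
    # Cover the tail when the final window does not reach the last word.
    if last_start + window_size < len(words):
        chunks.append(" ".join(words[-window_size:]))
    return chunks
```

With the seven-word input `"a b c d e f g"`, a window of 4 and a step of 2 give three chunks, the last being the tail window `"d e f g"`, while a fixed chunk size of 3 gives `"a b c"`, `"d e f"`, and `"g"`.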
--- a/crawl4ai/cli.py
+++ b/crawl4ai/cli.py
@@ -0,0 +1,123 @@
 import click
 import sys
 import asyncio
 from typing import List
 from .docs_manager import DocsManager
 from .async_logger import AsyncLogger
 logger = AsyncLogger(verbose=True)
 docs_manager = DocsManager(logger)
 def print_table(headers: List[str], rows: List[List[str]], padding: int = 2):
    """Print formatted table with headers and rows"""
    widths = [max(len(str(cell)) for cell in col) for col in zip(headers, *rows)]
    border = "+" + "+".join("-" * (w + 2 * padding) for w in widths) + "+"
    def format_row(row):
        return (
            "|"
            + "|".join(
                f"{' ' * padding}{str(cell):<{w}}{' ' * padding}"
                for cell, w in zip(row, widths)
            )
            + "|"
        )
    click.echo(border)
    click.echo(format_row(headers))
    click.echo(border)
    for row in rows:
        click.echo(format_row(row))
    click.echo(border)
@click.group()
 def cli():
    """Crawl4AI Command Line Interface"""
    pass
@cli.group()
 def docs():
    """Documentation operations"""
    pass
@docs.command()
@click.argument("sections", nargs=-1)
@click.option(
    "--mode", type=click.Choice(["extended", "condensed"]), default="extended"
 )
 def combine(sections: tuple, mode: str):
    """Combine documentation sections"""
    try:
        asyncio.run(docs_manager.ensure_docs_exist())
        click.echo(docs_manager.generate(sections, mode))
    except Exception as e:
        logger.error(str(e), tag="ERROR")
        sys.exit(1)
@docs.command()
@click.argument("query")
@click.option("--top-k", "-k", default=5)
@click.option("--build-index", is_flag=True, help="Build index if missing")
 def search(query: str, top_k: int, build_index: bool):
    """Search documentation"""
    try:
        result = docs_manager.search(query, top_k)
        if result == "No search index available. Call build_search_index() first.":
            if build_index or click.confirm("No search index found. Build it now?"):
                asyncio.run(docs_manager.llm_text.generate_index_files())
                result = docs_manager.search(query, top_k)
        click.echo(result)
    except Exception as e:
        click.echo(f"Error: {str(e)}", err=True)
        sys.exit(1)
@docs.command()
 def update():
    """Update docs from GitHub"""
    try:
        asyncio.run(docs_manager.fetch_docs())
        click.echo("Documentation updated successfully")
    except Exception as e:
        click.echo(f"Error: {str(e)}", err=True)
        sys.exit(1)
@docs.command()
@click.option("--force-facts", is_flag=True, help="Force regenerate fact files")
@click.option("--clear-cache", is_flag=True, help="Clear BM25 cache")
 def index(force_facts: bool, clear_cache: bool):
    """Build or rebuild search indexes"""
    try:
        asyncio.run(docs_manager.ensure_docs_exist())
        asyncio.run(
            docs_manager.llm_text.generate_index_files(
                force_generate_facts=force_facts, clear_bm25_cache=clear_cache
            )
        )
        click.echo("Search indexes built successfully")
    except Exception as e:
        click.echo(f"Error: {str(e)}", err=True)
        sys.exit(1)
 # Add docs list command
@docs.command()
 def list():
    """List available documentation sections"""
    try:
        sections = docs_manager.list()
        print_table(["Sections"], [[section] for section in sections])
    except Exception as e:
        click.echo(f"Error: {str(e)}", err=True)
        sys.exit(1)
 if __name__ == "__main__":
    cli()
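
The column-width logic in `print_table` can be checked without click. This sketch builds the same table as a list of strings instead of echoing it. `format_table` is a hypothetical name, and the section names in the usage note are made up for illustration.

```python
from typing import List


def format_table(headers: List[str], rows: List[List[str]], padding: int = 2) -> List[str]:
    """Return the table lines that print_table would echo."""
    # Each column is as wide as its longest cell, header included.
    widths = [max(len(str(cell)) for cell in col) for col in zip(headers, *rows)]
    border = "+" + "+".join("-" * (w + 2 * padding) for w in widths) + "+"

    def format_row(row: List[str]) -> str:
        cells = (
            f"{' ' * padding}{str(cell):<{w}}{' ' * padding}"
            for cell, w in zip(row, widths)
        )
        return "|" + "|".join(cells) + "|"

    lines = [border, format_row(headers), border]
    lines.extend(format_row(row) for row in rows)
    lines.append(border)
    return lines
```

Calling `format_table(["Sections"], [["core"], ["extraction"]])` yields six lines of equal length, with the single column sized to the ten characters of `extraction`.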
--- a/crawl4ai/config.py
+++ b/crawl4ai/config.py
@@ -13,6 +13,8 @@ PROVIDER_MODELS = {
    "groq/llama3-8b-8192": os.getenv("GROQ_API_KEY"),
    "openai/gpt-4o-mini": os.getenv("OPENAI_API_KEY"),
    "openai/gpt-4o": os.getenv("OPENAI_API_KEY"),
    "openai/o1-mini": os.getenv("OPENAI_API_KEY"),
    "openai/o1-preview": os.getenv("OPENAI_API_KEY"),
    "anthropic/claude-3-haiku-20240307": os.getenv("ANTHROPIC_API_KEY"),
    "anthropic/claude-3-opus-20240229": os.getenv("ANTHROPIC_API_KEY"),
    "anthropic/claude-3-sonnet-20240229": os.getenv("ANTHROPIC_API_KEY"),
@@ -20,7 +22,7 @@ PROVIDER_MODELS = {
 }
 # Chunk token threshold
-CHUNK_TOKEN_THRESHOLD = 2 ** 11 # 2048 tokens
+CHUNK_TOKEN_THRESHOLD = 2**11  # 2048 tokens
 OVERLAP_RATE = 0.1
 WORD_TOKEN_RATE = 1.3
@@ -28,19 +30,41 @@ WORD_TOKEN_RATE = 1.3
 MIN_WORD_THRESHOLD = 1
 IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD = 1
-IMPORTANT_ATTRS = ['src', 'href', 'alt', 'title', 'width', 'height'] 
+IMPORTANT_ATTRS = ["src", "href", "alt", "title", "width", "height"]
-ONLY_TEXT_ELIGIBLE_TAGS = ['b', 'i', 'u', 'span', 'del', 'ins', 'sub', 'sup', 'strong', 'em', 'code', 'kbd', 'var', 's', 'q', 'abbr', 'cite', 'dfn', 'time', 'small', 'mark']
+ONLY_TEXT_ELIGIBLE_TAGS = [
    "b",
    "i",
    "u",
    "span",
    "del",
    "ins",
    "sub",
    "sup",
    "strong",
    "em",
    "code",
    "kbd",
    "var",
    "s",
    "q",
    "abbr",
    "cite",
    "dfn",
    "time",
    "small",
    "mark",
 ]
 SOCIAL_MEDIA_DOMAINS = [
-                            'facebook.com',
+    "facebook.com",
-                            'twitter.com',
+    "twitter.com",
-                            'x.com',
+    "x.com",
-                            'linkedin.com',
+    "linkedin.com",
-                            'instagram.com',
+    "instagram.com",
-                            'pinterest.com',
+    "pinterest.com",
-                            'tiktok.com',
+    "tiktok.com",
-                            'snapchat.com',
+    "snapchat.com",
-                            'reddit.com',
+    "reddit.com",
-                        ]
+]
 # Threshold for the Image extraction - Range is 1 to 6
 # Images are scored based on point based system, to filter based on usefulness. Points are assigned
@@ -58,5 +82,5 @@ NEED_MIGRATION = True
 URL_LOG_SHORTEN_LENGTH = 30
 SHOW_DEPRECATION_WARNINGS = True
 SCREENSHOT_HEIGHT_TRESHOLD = 10000
-PAGE_TIMEOUT=60000
+PAGE_TIMEOUT = 60000
-DOWNLOAD_PAGE_TIMEOUT=60000
+DOWNLOAD_PAGE_TIMEOUT = 60000
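
As a rough illustration of what the chunking constants in this file imply, the sketch below converts them into word counts. It assumes `WORD_TOKEN_RATE` means estimated tokens per word and `OVERLAP_RATE` means the fraction of a chunk shared with its neighbor. The diff does not confirm either reading, and both helper names are hypothetical.

```python
CHUNK_TOKEN_THRESHOLD = 2**11  # 2048 tokens
OVERLAP_RATE = 0.1
WORD_TOKEN_RATE = 1.3


def max_words_per_chunk(
    token_threshold: int = CHUNK_TOKEN_THRESHOLD,
    word_token_rate: float = WORD_TOKEN_RATE,
) -> int:
    # Assumption: tokens are estimated as words * word_token_rate.
    return int(token_threshold / word_token_rate)


def overlap_tokens(
    token_threshold: int = CHUNK_TOKEN_THRESHOLD,
    overlap_rate: float = OVERLAP_RATE,
) -> int:
    # Assumption: overlap is a fixed fraction of the chunk's token budget.
    return int(token_threshold * overlap_rate)
```

Under these assumptions a 2048-token chunk holds about 1575 words, with roughly 204 tokens of overlap between adjacent chunks.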
--- a/crawl4ai/content_filter_strategy.py
+++ b/crawl4ai/content_filter_strategy.py
--- a/crawl4ai/content_scraping_strategy.py
+++ b/crawl4ai/content_scraping_strategy.py
--- a/crawl4ai/crawler_strategy.py
+++ b/crawl4ai/crawler_strategy.py
@@ -15,32 +15,30 @@ import logging, time
 import base64
 from PIL import Image, ImageDraw, ImageFont
 from io import BytesIO
-from typing import List, Callable
+from typing import Callable
 import requests
 import os
 from pathlib import Path
 from .utils import *
-logger = logging.getLogger('selenium.webdriver.remote.remote_connection')
+logger = logging.getLogger("selenium.webdriver.remote.remote_connection")
 logger.setLevel(logging.WARNING)
-logger_driver = logging.getLogger('selenium.webdriver.common.service')
+logger_driver = logging.getLogger("selenium.webdriver.common.service")
 logger_driver.setLevel(logging.WARNING)
-urllib3_logger = logging.getLogger('urllib3.connectionpool')
+urllib3_logger = logging.getLogger("urllib3.connectionpool")
 urllib3_logger.setLevel(logging.WARNING)
 # Disable http.client logging
-http_client_logger = logging.getLogger('http.client')
+http_client_logger = logging.getLogger("http.client")
 http_client_logger.setLevel(logging.WARNING)
 # Disable driver_finder and service logging
-driver_finder_logger = logging.getLogger('selenium.webdriver.common.driver_finder')
+driver_finder_logger = logging.getLogger("selenium.webdriver.common.driver_finder")
 driver_finder_logger.setLevel(logging.WARNING)
 class CrawlerStrategy(ABC):
    @abstractmethod
    def crawl(self, url: str, **kwargs) -> str:
@@ -58,8 +56,9 @@ class CrawlerStrategy(ABC):
    def set_hook(self, hook_type: str, hook: Callable):
        pass
 class CloudCrawlerStrategy(CrawlerStrategy):
-    def __init__(self, use_cached_html = False):
+    def __init__(self, use_cached_html=False):
        super().__init__()
        self.use_cached_html = use_cached_html
@@ -76,6 +75,7 @@ class CloudCrawlerStrategy(CrawlerStrategy):
        html = response["results"][0]["html"]
        return sanitize_input_encode(html)
 class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
    def __init__(self, use_cached_html=False, js_code=None, **kwargs):
        super().__init__()
@@ -87,9 +87,14 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
        if kwargs.get("user_agent"):
            self.options.add_argument("--user-agent=" + kwargs.get("user_agent"))
        else:
-            user_agent = kwargs.get("user_agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
+            user_agent = kwargs.get(
                "user_agent",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
            )
            self.options.add_argument(f"--user-agent={user_agent}")
-            self.options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
+            self.options.add_argument(
                "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
            )
        self.options.headless = kwargs.get("headless", True)
        if self.options.headless:
@@ -123,11 +128,11 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
        # Hooks
        self.hooks = {
-            'on_driver_created': None,
+            "on_driver_created": None,
-            'on_user_agent_updated': None,
+            "on_user_agent_updated": None,
-            'before_get_url': None,
+            "before_get_url": None,
-            'after_get_url': None,
+            "after_get_url": None,
-            'before_return_html': None
+            "before_return_html": None,
        }
        # chromedriver_autoinstaller.install()
@@ -138,7 +143,6 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
        # chromedriver_path = chromedriver_autoinstaller.utils.download_chromedriver()
        # self.service = Service(chromedriver_autoinstaller.install())
        # chromedriver_path = ChromeDriverManager().install()
        # self.service = Service(chromedriver_path)
        # self.service.log_path = "NUL"
@@ -148,14 +152,12 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
        self.service = Service()
        self.driver = webdriver.Chrome(options=self.options)
-        self.driver = self.execute_hook('on_driver_created', self.driver)
+        self.driver = self.execute_hook("on_driver_created", self.driver)
        if kwargs.get("cookies"):
            for cookie in kwargs.get("cookies"):
                self.driver.add_cookie(cookie)
    def set_hook(self, hook_type: str, hook: Callable):
        if hook_type in self.hooks:
            self.hooks[hook_type] = hook
@@ -170,7 +172,9 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
                if isinstance(result, webdriver.Chrome):
                    return result
                else:
-                    raise TypeError(f"Hook {hook_type} must return an instance of webdriver.Chrome or None.")
+                    raise TypeError(
                        f"Hook {hook_type} must return an instance of webdriver.Chrome or None."
                    )
        # If the hook returns None or there is no hook, return self.driver
        return self.driver
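
The hook contract above (a hook may return a replacement driver or `None`, and anything else is a `TypeError`) can be exercised without Selenium. In this sketch `FakeDriver` and `HookRegistry` are hypothetical stand-ins for `webdriver.Chrome` and the strategy class, and raising `ValueError` for an unknown hook name is an assumption, since the diff does not show that branch.

```python
from typing import Callable, Dict, Optional


class FakeDriver:
    """Hypothetical stand-in for selenium's webdriver.Chrome."""

    def __init__(self, label: str = "default"):
        self.label = label


class HookRegistry:
    def __init__(self, driver: FakeDriver):
        self.driver = driver
        self.hooks: Dict[str, Optional[Callable]] = {
            "on_driver_created": None,
            "before_get_url": None,
            "after_get_url": None,
        }

    def set_hook(self, hook_type: str, hook: Callable) -> None:
        if hook_type not in self.hooks:
            # Assumption: unknown hook names are rejected.
            raise ValueError(f"Invalid hook type: {hook_type}")
        self.hooks[hook_type] = hook

    def execute_hook(self, hook_type: str, *args) -> FakeDriver:
        hook = self.hooks.get(hook_type)
        if hook is not None:
            result = hook(*args)
            if result is not None:
                if isinstance(result, FakeDriver):
                    return result
                raise TypeError(
                    f"Hook {hook_type} must return a driver instance or None."
                )
        # No hook registered, or the hook returned None: keep the current driver.
        return self.driver
```

A hook that only inspects or mutates the driver can return `None`, while a hook that builds a new driver returns it so the caller can rebind `self.driver`.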
@@ -178,13 +182,13 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
        self.options.add_argument(f"user-agent={user_agent}")
        self.driver.quit()
        self.driver = webdriver.Chrome(service=self.service, options=self.options)
-        self.driver = self.execute_hook('on_user_agent_updated', self.driver)
+        self.driver = self.execute_hook("on_user_agent_updated", self.driver)
    def set_custom_headers(self, headers: dict):
        # Enable Network domain for sending headers
-        self.driver.execute_cdp_cmd('Network.enable', {})
+        self.driver.execute_cdp_cmd("Network.enable", {})
        # Set extra HTTP headers
-        self.driver.execute_cdp_cmd('Network.setExtraHTTPHeaders', {'headers': headers})
+        self.driver.execute_cdp_cmd("Network.setExtraHTTPHeaders", {"headers": headers})
    def _ensure_page_load(self, max_checks=6, check_interval=0.01):
        initial_length = len(self.driver.page_source)
@@ -202,36 +206,53 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
    def crawl(self, url: str, **kwargs) -> str:
        # Create md5 hash of the URL
        import hashlib
        url_hash = hashlib.md5(url.encode()).hexdigest()
        if self.use_cached_html:
-            cache_file_path = os.path.join(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai", "cache", url_hash)
+            cache_file_path = os.path.join(
                os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()),
                ".crawl4ai",
                "cache",
                url_hash,
            )
            if os.path.exists(cache_file_path):
                with open(cache_file_path, "r") as f:
                    return sanitize_input_encode(f.read())
        try:
-            self.driver = self.execute_hook('before_get_url', self.driver)
+            self.driver = self.execute_hook("before_get_url", self.driver)
            if self.verbose:
                print(f"[LOG] 🕸️ Crawling {url} using LocalSeleniumCrawlerStrategy...")
-            self.driver.get(url) #<html><head></head><body></body></html>
+            self.driver.get(url)  # <html><head></head><body></body></html>
            WebDriverWait(self.driver, 20).until(
-                lambda d: d.execute_script('return document.readyState') == 'complete'
+                lambda d: d.execute_script("return document.readyState") == "complete"
            )
            WebDriverWait(self.driver, 10).until(
                EC.presence_of_all_elements_located((By.TAG_NAME, "body"))
            )
-            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
+            self.driver.execute_script(
                "window.scrollTo(0, document.body.scrollHeight);"
            )
-            self.driver = self.execute_hook('after_get_url', self.driver)
+            self.driver = self.execute_hook("after_get_url", self.driver)
-            html = sanitize_input_encode(self._ensure_page_load()) # self.driver.page_source
-            can_not_be_done_headless = False # Look at my creativity for naming variables
+            html = sanitize_input_encode(
+                self._ensure_page_load()
+            )  # self.driver.page_source
+            can_not_be_done_headless = (
+                False  # Look at my creativity for naming variables
+            )
            # TODO: Very ugly approach, but promise to change it!
-            if kwargs.get('bypass_headless', False) or html == "<html><head></head><body></body></html>":
-                print("[LOG] 🙌 Page could not be loaded in headless mode. Trying non-headless mode...")
+            if (
+                kwargs.get("bypass_headless", False)
+                or html == "<html><head></head><body></body></html>"
+            ):
+                print(
+                    "[LOG] 🙌 Page could not be loaded in headless mode. Trying non-headless mode..."
+                )
                can_not_be_done_headless = True
                options = Options()
                options.headless = False
@@ -239,7 +260,7 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
                options.add_argument("--window-size=5,5")
                driver = webdriver.Chrome(service=self.service, options=options)
                driver.get(url)
-                self.driver = self.execute_hook('after_get_url', driver)
+                self.driver = self.execute_hook("after_get_url", driver)
                html = sanitize_input_encode(driver.page_source)
                driver.quit()
@@ -249,17 +270,21 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
                self.driver.execute_script(self.js_code)
                # Optionally, wait for some condition after executing the JS code
                WebDriverWait(self.driver, 10).until(
-                    lambda driver: driver.execute_script("return document.readyState") == "complete"
+                    lambda driver: driver.execute_script("return document.readyState")
                    == "complete"
                )
            elif self.js_code and type(self.js_code) == list:
                for js in self.js_code:
                    self.driver.execute_script(js)
                    WebDriverWait(self.driver, 10).until(
-                        lambda driver: driver.execute_script("return document.readyState") == "complete"
+                        lambda driver: driver.execute_script(
                            "return document.readyState"
                        )
                        == "complete"
                    )
            # Optionally, wait for some condition after executing the JS code : Contributed by (https://github.com/jonymusky)
-            wait_for = kwargs.get('wait_for', False)
+            wait_for = kwargs.get("wait_for", False)
            if wait_for:
                if callable(wait_for):
                    print("[LOG] 🔄 Waiting for condition...")
@@ -272,10 +297,15 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
            if not can_not_be_done_headless:
                html = sanitize_input_encode(self.driver.page_source)
-            self.driver = self.execute_hook('before_return_html', self.driver, html)
+            self.driver = self.execute_hook("before_return_html", self.driver, html)
            # Store in cache
-            cache_file_path = os.path.join(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai", "cache", url_hash)
+            cache_file_path = os.path.join(
                os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()),
                ".crawl4ai",
                "cache",
                url_hash,
            )
            with open(cache_file_path, "w", encoding="utf-8") as f:
                f.write(html)
@@ -284,16 +314,16 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
            return html
        except InvalidArgumentException as e:
-            if not hasattr(e, 'msg'):
+            if not hasattr(e, "msg"):
                e.msg = sanitize_input_encode(str(e))
            raise InvalidArgumentException(f"Failed to crawl {url}: {e.msg}")
        except WebDriverException as e:
            # If e does not have a msg attribute, create it and set it to str(e)
-            if not hasattr(e, 'msg'):
+            if not hasattr(e, "msg"):
                e.msg = sanitize_input_encode(str(e))
            raise WebDriverException(f"Failed to crawl {url}: {e.msg}")
        except Exception as e:
-            if not hasattr(e, 'msg'):
+            if not hasattr(e, "msg"):
                e.msg = sanitize_input_encode(str(e))
            raise Exception(f"Failed to crawl {url}: {e.msg}")
@@ -301,7 +331,9 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
        try:
            # Get the dimensions of the page
            total_width = self.driver.execute_script("return document.body.scrollWidth")
-            total_height = self.driver.execute_script("return document.body.scrollHeight")
+            total_height = self.driver.execute_script(
                "return document.body.scrollHeight"
            )
            # Set the window size to the dimensions of the page
            self.driver.set_window_size(total_width, total_height)
@@ -313,23 +345,25 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
            image = Image.open(BytesIO(screenshot))
            # Convert image to RGB mode (this will handle both RGB and RGBA images)
-            rgb_image = image.convert('RGB')
+            rgb_image = image.convert("RGB")
            # Convert to JPEG and compress
            buffered = BytesIO()
            rgb_image.save(buffered, format="JPEG", quality=85)
-            img_base64 = base64.b64encode(buffered.getvalue()).decode('utf-8')
+            img_base64 = base64.b64encode(buffered.getvalue()).decode("utf-8")
            if self.verbose:
-                print(f"[LOG] 📸 Screenshot taken and converted to base64")
+                print("[LOG] 📸 Screenshot taken and converted to base64")
            return img_base64
        except Exception as e:
-            error_message = sanitize_input_encode(f"Failed to take screenshot: {str(e)}")
+            error_message = sanitize_input_encode(
                f"Failed to take screenshot: {str(e)}"
            )
            print(error_message)
            # Generate an image with black background
-            img = Image.new('RGB', (800, 600), color='black')
+            img = Image.new("RGB", (800, 600), color="black")
            draw = ImageDraw.Draw(img)
            # Load a font
@@ -352,7 +386,7 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
            # Convert to base64
            buffered = BytesIO()
            img.save(buffered, format="JPEG")
-            img_base64 = base64.b64encode(buffered.getvalue()).decode('utf-8')
+            img_base64 = base64.b64encode(buffered.getvalue()).decode("utf-8")
            return img_base64
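Review note: the `crawl` method in this file keys its on-disk HTML cache by the MD5 hex digest of the URL, under `.crawl4ai/cache` inside `CRAWL4_AI_BASE_DIRECTORY` (falling back to the home directory). The lookup can be sketched in isolation roughly as follows. The helper names `cache_path_for` and `read_cached_html` are hypothetical and not part of the diff, and the sketch passes an explicit `encoding` on read, which the original code does not.

```python
import hashlib
import os
from pathlib import Path
from typing import Optional


def cache_path_for(url: str, base_dir: Optional[str] = None) -> str:
    """Return the cache file path for a URL (hypothetical helper)."""
    # Same fallback chain as the diff: env var first, then the home directory.
    base = base_dir or os.getenv("CRAWL4_AI_BASE_DIRECTORY", str(Path.home()))
    # The file name is the MD5 hex digest of the URL (32 hex characters).
    url_hash = hashlib.md5(url.encode()).hexdigest()
    return os.path.join(base, ".crawl4ai", "cache", url_hash)


def read_cached_html(url: str, base_dir: Optional[str] = None) -> Optional[str]:
    """Return cached HTML for a URL, or None when nothing is cached."""
    path = cache_path_for(url, base_dir)
    if not os.path.exists(path):
        return None
    with open(path, "r", encoding="utf-8") as f:
        return f.read()
```

Reading with the same encoding used for the write (`utf-8` in the "Store in cache" step) avoids platform-dependent decoding of cached pages.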
--- a/crawl4ai/database.py
+++ b/crawl4ai/database.py
@@ -7,11 +7,13 @@ DB_PATH = os.path.join(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".cra
 os.makedirs(DB_PATH, exist_ok=True)
 DB_PATH = os.path.join(DB_PATH, "crawl4ai.db")
 def init_db():
    global DB_PATH
    conn = sqlite3.connect(DB_PATH)
    cursor = conn.cursor()
-    cursor.execute('''
+    cursor.execute(
        """
        CREATE TABLE IF NOT EXISTS crawled_data (
            url TEXT PRIMARY KEY,
            html TEXT,
@@ -24,31 +26,42 @@ def init_db():
            metadata TEXT DEFAULT "{}",
            screenshot TEXT DEFAULT ""
        )
-    ''')
+    """
    )
    conn.commit()
    conn.close()
 def alter_db_add_screenshot(new_column: str = "media"):
    check_db_path()
    try:
        conn = sqlite3.connect(DB_PATH)
        cursor = conn.cursor()
-        cursor.execute(f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""')
+        cursor.execute(
            f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""'
        )
        conn.commit()
        conn.close()
    except Exception as e:
        print(f"Error altering database to add screenshot column: {e}")
 def check_db_path():
    if not DB_PATH:
        raise ValueError("Database path is not set or is empty.")
-def get_cached_url(url: str) -> Optional[Tuple[str, str, str, str, str, str, str, bool, str]]:
+
 def get_cached_url(
    url: str,
 ) -> Optional[Tuple[str, str, str, str, str, str, str, bool, str]]:
    check_db_path()
    try:
        conn = sqlite3.connect(DB_PATH)
        cursor = conn.cursor()
-        cursor.execute('SELECT url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot FROM crawled_data WHERE url = ?', (url,))
+        cursor.execute(
            "SELECT url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot FROM crawled_data WHERE url = ?",
            (url,),
        )
        result = cursor.fetchone()
        conn.close()
        return result
@@ -56,12 +69,25 @@ def get_cached_url(url: str) -> Optional[Tuple[str, str, str, str, str, str, str
        print(f"Error retrieving cached URL: {e}")
        return None
-def cache_url(url: str, html: str, cleaned_html: str, markdown: str, extracted_content: str, success: bool, media : str = "{}", links : str = "{}", metadata : str = "{}", screenshot: str = ""):
+
 def cache_url(
    url: str,
    html: str,
    cleaned_html: str,
    markdown: str,
    extracted_content: str,
    success: bool,
    media: str = "{}",
    links: str = "{}",
    metadata: str = "{}",
    screenshot: str = "",
 ):
    check_db_path()
    try:
        conn = sqlite3.connect(DB_PATH)
        cursor = conn.cursor()
-        cursor.execute('''
+        cursor.execute(
            """
            INSERT INTO crawled_data (url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
            ON CONFLICT(url) DO UPDATE SET
@@ -74,18 +100,32 @@ def cache_url(url: str, html: str, cleaned_html: str, markdown: str, extracted_c
                links = excluded.links,    
                metadata = excluded.metadata,      
                screenshot = excluded.screenshot
-        ''', (url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot))
+        """,
            (
                url,
                html,
                cleaned_html,
                markdown,
                extracted_content,
                success,
                media,
                links,
                metadata,
                screenshot,
            ),
        )
        conn.commit()
        conn.close()
    except Exception as e:
        print(f"Error caching URL: {e}")
 def get_total_count() -> int:
    check_db_path()
    try:
        conn = sqlite3.connect(DB_PATH)
        cursor = conn.cursor()
-        cursor.execute('SELECT COUNT(*) FROM crawled_data')
+        cursor.execute("SELECT COUNT(*) FROM crawled_data")
        result = cursor.fetchone()
        conn.close()
        return result[0]
@@ -93,43 +133,48 @@ def get_total_count() -> int:
        print(f"Error getting total count: {e}")
        return 0
 def clear_db():
    check_db_path()
    try:
        conn = sqlite3.connect(DB_PATH)
        cursor = conn.cursor()
-        cursor.execute('DELETE FROM crawled_data')
+        cursor.execute("DELETE FROM crawled_data")
        conn.commit()
        conn.close()
    except Exception as e:
        print(f"Error clearing database: {e}")
 def flush_db():
    check_db_path()
    try:
        conn = sqlite3.connect(DB_PATH)
        cursor = conn.cursor()
-        cursor.execute('DROP TABLE crawled_data')
+        cursor.execute("DROP TABLE crawled_data")
        conn.commit()
        conn.close()
    except Exception as e:
        print(f"Error flushing database: {e}")
 def update_existing_records(new_column: str = "media", default_value: str = "{}"):
    check_db_path()
    try:
        conn = sqlite3.connect(DB_PATH)
        cursor = conn.cursor()
-        cursor.execute(f'UPDATE crawled_data SET {new_column} = "{default_value}" WHERE screenshot IS NULL')
+        cursor.execute(
            f'UPDATE crawled_data SET {new_column} = "{default_value}" WHERE screenshot IS NULL'
        )
        conn.commit()
        conn.close()
    except Exception as e:
        print(f"Error updating existing records: {e}")
 if __name__ == "__main__":
    # Delete the existing database file
    if os.path.exists(DB_PATH):
        os.remove(DB_PATH)
    init_db()
    # alter_db_add_screenshot("COL_NAME")
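Review note: `update_existing_records` in this file interpolates both the column name and the default value into the SQL string. SQLite can bind values with `?` placeholders but cannot bind identifiers, so one possible approach is to bind the value and check the column name against an allow-list. The sketch below is an assumption about how that could look, not the project's code: `ALLOWED_COLUMNS` and `set_default_for_column` are hypothetical names, and the `WHERE screenshot IS NULL` filter is kept as in the diff.

```python
import sqlite3

# Hypothetical allow-list of columns that may be updated this way.
ALLOWED_COLUMNS = {"media", "links", "metadata", "screenshot"}


def set_default_for_column(
    conn: sqlite3.Connection, column: str, default_value: str = "{}"
) -> int:
    """Set `column` to `default_value` on rows whose screenshot is NULL."""
    # Identifiers cannot be bound as parameters, so validate them explicitly.
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"Unexpected column name: {column!r}")
    cursor = conn.execute(
        f"UPDATE crawled_data SET {column} = ? WHERE screenshot IS NULL",
        (default_value,),  # the value is bound, never interpolated
    )
    conn.commit()
    return cursor.rowcount  # number of rows modified
```

Binding the value also removes the need for the double quotes around `{default_value}`, which SQLite may otherwise interpret as an identifier rather than a string literal.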
--- a/crawl4ai/docs_manager.py
+++ b/crawl4ai/docs_manager.py
@@ -0,0 +1,75 @@
 import requests
 import shutil
 from pathlib import Path
 from crawl4ai.async_logger import AsyncLogger
 from crawl4ai.llmtxt import AsyncLLMTextManager
 class DocsManager:
    def __init__(self, logger=None):
        self.docs_dir = Path.home() / ".crawl4ai" / "docs"
        self.local_docs = Path(__file__).parent.parent / "docs" / "llm.txt"
        self.docs_dir.mkdir(parents=True, exist_ok=True)
        self.logger = logger or AsyncLogger(verbose=True)
        self.llm_text = AsyncLLMTextManager(self.docs_dir, self.logger)
    async def ensure_docs_exist(self):
        """Fetch docs if not present"""
        if not any(self.docs_dir.iterdir()):
            await self.fetch_docs()
    async def fetch_docs(self) -> bool:
        """Copy from local docs or download from GitHub"""
        try:
            # Try local first
            if self.local_docs.exists() and (
                any(self.local_docs.glob("*.md"))
                or any(self.local_docs.glob("*.tokens"))
            ):
                # Empty the local docs directory
                for file_path in self.docs_dir.glob("*.md"):
                    file_path.unlink()
                # for file_path in self.docs_dir.glob("*.tokens"):
                #     file_path.unlink()
                for file_path in self.local_docs.glob("*.md"):
                    shutil.copy2(file_path, self.docs_dir / file_path.name)
                # for file_path in self.local_docs.glob("*.tokens"):
                #     shutil.copy2(file_path, self.docs_dir / file_path.name)
                return True
            # Fallback to GitHub
            response = requests.get(
                "https://api.github.com/repos/unclecode/crawl4ai/contents/docs/llm.txt",
                headers={"Accept": "application/vnd.github.v3+json"},
            )
            response.raise_for_status()
            for item in response.json():
                if item["type"] == "file" and item["name"].endswith(".md"):
                    content = requests.get(item["download_url"]).text
                    with open(self.docs_dir / item["name"], "w", encoding="utf-8") as f:
                        f.write(content)
            return True
        except Exception as e:
            self.logger.error(f"Failed to fetch docs: {str(e)}")
            raise
    def list(self) -> list[str]:
        """List available topics"""
        names = [file_path.stem for file_path in self.docs_dir.glob("*.md")]
        # Remove [0-9]+_ prefix
        names = [name.split("_", 1)[1] if name[0].isdigit() else name for name in names]
        # Exclude names that end with .xs.md or .q.md
        names = [
            name
            for name in names
            if not name.endswith(".xs") and not name.endswith(".q")
        ]
        return names
    def generate(self, sections, mode="extended"):
        return self.llm_text.generate(sections, mode)
    def search(self, query: str, top_k: int = 5):
        return self.llm_text.search(query, top_k)
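Review note: `DocsManager.list` derives topic names from file stems by stripping a leading numeric prefix and dropping the `.xs` and `.q` variants. The same transformation can be sketched as a pure function for testing. `topic_names` and the sample file names in the usage note are hypothetical. Unlike the diff, the sketch also guards against a digit-leading stem that has no underscore, which would raise `IndexError` in the original expression.

```python
from pathlib import Path
from typing import Iterable, List


def topic_names(paths: Iterable[Path]) -> List[str]:
    """Map doc file paths to topic names (hypothetical helper)."""
    # "1_intro.md" -> "1_intro", "1_intro.xs.md" -> "1_intro.xs"
    names = [p.stem for p in paths]
    # Strip a leading "<digits>_" prefix; leave the name alone if no "_" exists.
    names = [
        n.split("_", 1)[1] if n[0].isdigit() and "_" in n else n
        for n in names
    ]
    # Drop the condensed (.xs) and question-index (.q) variants.
    return [n for n in names if not n.endswith((".xs", ".q"))]
```

For example, given `1_intro.md`, `1_intro.xs.md`, `2_config.q.md`, and `faq.md`, the function keeps only `intro` and `faq`.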
--- a/crawl4ai/extraction_strategy.py
+++ b/crawl4ai/extraction_strategy.py
--- a/crawl4ai/html2text/init.py
+++ b/crawl4ai/html2text/init.py
@@ -903,7 +903,13 @@ class HTML2Text(html.parser.HTMLParser):
                self.empty_link = False
        if not self.code and not self.pre and not entity_char:
-            data = escape_md_section(data, snob=self.escape_snob, escape_dot=self.escape_dot, escape_plus=self.escape_plus, escape_dash=self.escape_dash)
+            data = escape_md_section(
                data,
                snob=self.escape_snob,
                escape_dot=self.escape_dot,
                escape_plus=self.escape_plus,
                escape_dash=self.escape_dash,
            )
        self.preceding_data = data
        self.o(data, puredata=True)
@@ -1006,6 +1012,7 @@ class HTML2Text(html.parser.HTMLParser):
                    newlines += 1
        return result
 def html2text(html: str, baseurl: str = "", bodywidth: Optional[int] = None) -> str:
    if bodywidth is None:
        bodywidth = config.BODY_WIDTH
@@ -1013,6 +1020,7 @@ def html2text(html: str, baseurl: str = "", bodywidth: Optional[int] = None) ->
    return h.handle(html)
 class CustomHTML2Text(HTML2Text):
    def __init__(self, *args, handle_code_in_pre=False, **kwargs):
        super().__init__(*args, **kwargs)
@@ -1041,9 +1049,9 @@ class CustomHTML2Text(HTML2Text):
    def update_params(self, **kwargs):
        """Update parameters and set preserved tags."""
        for key, value in kwargs.items():
-            if key == 'preserve_tags':
+            if key == "preserve_tags":
                self.preserve_tags = set(value)
-            elif key == 'handle_code_in_pre':
+            elif key == "handle_code_in_pre":
                self.handle_code_in_pre = value
            else:
                setattr(self, key, value)
@@ -1056,17 +1064,19 @@ class CustomHTML2Text(HTML2Text):
                    self.current_preserved_tag = tag
                    self.preserved_content = []
                    # Format opening tag with attributes
-                    attr_str = ''.join(f' {k}="{v}"' for k, v in attrs.items() if v is not None)
-                    self.preserved_content.append(f'<{tag}{attr_str}>')
+                    attr_str = "".join(
+                        f' {k}="{v}"' for k, v in attrs.items() if v is not None
+                    )
+                    self.preserved_content.append(f"<{tag}{attr_str}>")
                self.preserve_depth += 1
                return
            else:
                self.preserve_depth -= 1
                if self.preserve_depth == 0:
-                    self.preserved_content.append(f'</{tag}>')
+                    self.preserved_content.append(f"</{tag}>")
                    # Output the preserved HTML block with proper spacing
-                    preserved_html = ''.join(self.preserved_content)
+                    preserved_html = "".join(self.preserved_content)
-                    self.o('\n' + preserved_html + '\n')
+                    self.o("\n" + preserved_html + "\n")
                    self.current_preserved_tag = None
                return
@@ -1074,29 +1084,31 @@ class CustomHTML2Text(HTML2Text):
        if self.preserve_depth > 0:
            if start:
                # Format nested tags with attributes
-                attr_str = ''.join(f' {k}="{v}"' for k, v in attrs.items() if v is not None)
-                self.preserved_content.append(f'<{tag}{attr_str}>')
+                attr_str = "".join(
+                    f' {k}="{v}"' for k, v in attrs.items() if v is not None
+                )
+                self.preserved_content.append(f"<{tag}{attr_str}>")
            else:
-                self.preserved_content.append(f'</{tag}>')
+                self.preserved_content.append(f"</{tag}>")
            return
        # Handle pre tags
-        if tag == 'pre':
+        if tag == "pre":
            if start:
-                self.o('```\n')  # Markdown code block start
+                self.o("```\n")  # Markdown code block start
                self.inside_pre = True
            else:
-                self.o('\n```\n')  # Markdown code block end
+                self.o("\n```\n")  # Markdown code block end
                self.inside_pre = False
-        elif tag == 'code':
+        elif tag == "code":
            if self.inside_pre and not self.handle_code_in_pre:
                # Ignore code tags inside pre blocks if handle_code_in_pre is False
                return
            if start:
-                self.o('`')  # Markdown inline code start
+                self.o("`")  # Markdown inline code start
                self.inside_code = True
            else:
-                self.o('`')  # Markdown inline code end
+                self.o("`")  # Markdown inline code end
                self.inside_code = False
        else:
            super().handle_tag(tag, attrs, start)
@@ -1113,13 +1125,12 @@ class CustomHTML2Text(HTML2Text):
            return
        if self.inside_code:
            # Inline code: no newlines allowed
-            self.o(data.replace('\n', ' '))
+            self.o(data.replace("\n", " "))
            return
        # Default behavior for other tags
        super().handle_data(data, entity_char)
    #     # Handle pre tags
    #     if tag == 'pre':
    #         if start:
--- a/crawl4ai/html2text/_typing.py
+++ b/crawl4ai/html2text/_typing.py
@@ -1,2 +1,3 @@
 class OutCallback:
-    def __call__(self, s: str) -> None: ...
+    def __call__(self, s: str) -> None:
        ...
--- a/crawl4ai/html2text/utils.py
+++ b/crawl4ai/html2text/utils.py
@@ -210,7 +210,7 @@ def escape_md_section(
    snob: bool = False,
    escape_dot: bool = True,
    escape_plus: bool = True,
-    escape_dash: bool = True
+    escape_dash: bool = True,
 ) -> str:
    """
    Escapes markdown-sensitive characters across whole document sections.
@@ -233,6 +233,7 @@ def escape_md_section(
    return text
 def reformat_table(lines: List[str], right_margin: int) -> List[str]:
    """
    Given the lines of a table
--- a/crawl4ai/install.py
+++ b/crawl4ai/install.py
@@ -6,6 +6,7 @@ from .async_logger import AsyncLogger, LogLevel
 # Initialize logger
 logger = AsyncLogger(log_level=LogLevel.DEBUG, verbose=True)
 def post_install():
    """Run all post-installation tasks"""
    logger.info("Running post-installation setup...", tag="INIT")
@@ -13,21 +14,36 @@ def post_install():
    run_migration()
    logger.success("Post-installation setup completed!", tag="COMPLETE")
 def install_playwright():
    logger.info("Installing Playwright browsers...", tag="INIT")
    try:
-        subprocess.check_call([sys.executable, "-m", "playwright", "install"])
-        logger.success("Playwright installation completed successfully.", tag="COMPLETE")
-    except subprocess.CalledProcessError as e:
-        logger.error(f"Error during Playwright installation: {e}", tag="ERROR")
-        logger.warning(
-            "Please run 'python -m playwright install' manually after the installation."
-        )
-    except Exception as e:
-        logger.error(f"Unexpected error during Playwright installation: {e}", tag="ERROR")
-        logger.warning(
-            "Please run 'python -m playwright install' manually after the installation."
-        )
+        # subprocess.check_call([sys.executable, "-m", "playwright", "install", "--with-deps", "--force", "chrome"])
+        subprocess.check_call(
+            [
+                sys.executable,
+                "-m",
+                "playwright",
+                "install",
+                "--with-deps",
+                "--force",
+                "chromium",
+            ]
+        )
+        logger.success(
+            "Playwright installation completed successfully.", tag="COMPLETE"
+        )
+    except subprocess.CalledProcessError:
+        # logger.error(f"Error during Playwright installation: {e}", tag="ERROR")
+        logger.warning(
+            f"Please run '{sys.executable} -m playwright install --with-deps' manually after the installation."
+        )
+    except Exception:
+        # logger.error(f"Unexpected error during Playwright installation: {e}", tag="ERROR")
+        logger.warning(
+            f"Please run '{sys.executable} -m playwright install --with-deps' manually after the installation."
+        )
 def run_migration():
    """Initialize database during installation"""
@@ -36,9 +52,58 @@ def run_migration():
        from crawl4ai.async_database import async_db_manager
        asyncio.run(async_db_manager.initialize())
-        logger.success("Database initialization completed successfully.", tag="COMPLETE")
+        logger.success(
            "Database initialization completed successfully.", tag="COMPLETE"
        )
    except ImportError:
        logger.warning("Database module not found. Will initialize on first use.")
    except Exception as e:
        logger.warning(f"Database initialization failed: {e}")
        logger.warning("Database will be initialized on first use")
 async def run_doctor():
    """Test if Crawl4AI is working properly"""
    logger.info("Running Crawl4AI health check...", tag="INIT")
    try:
        from .async_webcrawler import (
            AsyncWebCrawler,
            BrowserConfig,
            CrawlerRunConfig,
            CacheMode,
        )
        browser_config = BrowserConfig(
            headless=True,
            browser_type="chromium",
            ignore_https_errors=True,
            light_mode=True,
            viewport_width=1280,
            viewport_height=720,
        )
        run_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            screenshot=True,
        )
        async with AsyncWebCrawler(config=browser_config) as crawler:
            logger.info("Testing crawling capabilities...", tag="TEST")
            result = await crawler.arun(url="https://crawl4ai.com", config=run_config)
            if result and result.markdown:
                logger.success("✅ Crawling test passed!", tag="COMPLETE")
                return True
            else:
                raise Exception("Failed to get content")
    except Exception as e:
        logger.error(f"❌ Test failed: {e}", tag="ERROR")
        return False
 def doctor():
    """Entry point for the doctor command"""
    import asyncio
    return asyncio.run(run_doctor())
--- a/crawl4ai/js_snippet/init.py
+++ b/crawl4ai/js_snippet/init.py
@@ -1,15 +1,18 @@
-import os, sys
+import os
 # Given the name of a JS script, load it from the folder containing this file and return its content as a string; raise an error if the script is missing
 def load_js_script(script_name):
    # Get the path of the current script
    current_script_path = os.path.dirname(os.path.realpath(__file__))
    # Get the path of the script to load
-    script_path = os.path.join(current_script_path, script_name + '.js')
+    script_path = os.path.join(current_script_path, script_name + ".js")
    # Check if the script exists
    if not os.path.exists(script_path):
-        raise ValueError(f"Script {script_name} not found in the folder {current_script_path}")
+        raise ValueError(
            f"Script {script_name} not found in the folder {current_script_path}"
        )
    # Load the content of the script
-    with open(script_path, 'r') as f:
+    with open(script_path, "r") as f:
        script_content = f.read()
    return script_content
--- a/crawl4ai/llmtxt.py
+++ b/crawl4ai/llmtxt.py
@@ -0,0 +1,546 @@
 import os
 from pathlib import Path
 import re
 from typing import Dict, List, Tuple, Optional, Any
 import json
 from tqdm import tqdm
 import time
 import psutil
 import numpy as np
 from rank_bm25 import BM25Okapi
 from nltk.tokenize import word_tokenize
 from nltk.corpus import stopwords
 from nltk.stem import WordNetLemmatizer
 from litellm import batch_completion
 from .async_logger import AsyncLogger
 import litellm
 import pickle
 import hashlib  # <--- ADDED for file-hash
 import glob
 litellm.set_verbose = False
 def _compute_file_hash(file_path: Path) -> str:
    """Compute MD5 hash for the file's entire content."""
    hash_md5 = hashlib.md5()
    with file_path.open("rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()
 class AsyncLLMTextManager:
    def __init__(
        self,
        docs_dir: Path,
        logger: Optional[AsyncLogger] = None,
        max_concurrent_calls: int = 5,
        batch_size: int = 3,
    ) -> None:
        self.docs_dir = docs_dir
        self.logger = logger
        self.max_concurrent_calls = max_concurrent_calls
        self.batch_size = batch_size
        self.bm25_index = None
        self.document_map: Dict[str, Any] = {}
        self.tokenized_facts: List[str] = []
        self.bm25_index_file = self.docs_dir / "bm25_index.pkl"
    async def _process_document_batch(self, doc_batch: List[Path]) -> None:
        """Process a batch of documents in parallel"""
        contents = []
        for file_path in doc_batch:
            try:
                with open(file_path, "r", encoding="utf-8") as f:
                    contents.append(f.read())
            except Exception as e:
                self.logger.error(f"Error reading {file_path}: {str(e)}")
                contents.append("")  # Add empty content to maintain batch alignment
        prompt = """Given a documentation file, generate a list of atomic facts where each fact:
 1. Represents a single piece of knowledge
 2. Contains variations in terminology for the same concept
 3. References relevant code patterns if they exist
 4. Is written in a way that would match natural language queries
 Each fact should follow this format:
 <main_concept>: <fact_statement> | <related_terms> | <code_reference>
 Example Facts:
 browser_config: Configure headless mode and browser type for AsyncWebCrawler | headless, browser_type, chromium, firefox | BrowserConfig(browser_type="chromium", headless=True)
 redis_connection: Redis client connection requires host and port configuration | redis setup, redis client, connection params | Redis(host='localhost', port=6379, db=0)
 pandas_filtering: Filter DataFrame rows using boolean conditions | dataframe filter, query, boolean indexing | df[df['column'] > 5]
 Wrap your response in <index>...</index> tags.
 """
        # Prepare messages for batch processing
        messages_list = [
            [
                {
                    "role": "user",
                    "content": f"{prompt}\n\nGenerate index for this documentation:\n\n{content}",
                }
            ]
            for content in contents
            if content
        ]
        try:
            responses = batch_completion(
                model="anthropic/claude-3-5-sonnet-latest",
                messages=messages_list,
                logger_fn=None,
            )
            # Process responses and save index files
            for response, file_path in zip(responses, doc_batch):
                try:
                    index_content_match = re.search(
                        r"<index>(.*?)</index>",
                        response.choices[0].message.content,
                        re.DOTALL,
                    )
                    if not index_content_match:
                        self.logger.warning(
                            f"No <index>...</index> content found for {file_path}"
                        )
                        continue
                    index_content = re.sub(
                        r"\n\s*\n", "\n", index_content_match.group(1)
                    ).strip()
                    if index_content:
                        index_file = file_path.with_suffix(".q.md")
                        with open(index_file, "w", encoding="utf-8") as f:
                            f.write(index_content)
                        self.logger.info(f"Created index file: {index_file}")
                    else:
                        self.logger.warning(
                            f"No index content found in response for {file_path}"
                        )
                except Exception as e:
                    self.logger.error(
                        f"Error processing response for {file_path}: {str(e)}"
                    )
        except Exception as e:
            self.logger.error(f"Error in batch completion: {str(e)}")
    def _validate_fact_line(self, line: str) -> Tuple[bool, Optional[str]]:
        if "|" not in line:
            return False, "Missing separator '|'"
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 3:
            return False, f"Expected 3 parts, got {len(parts)}"
        concept_part = parts[0]
        if ":" not in concept_part:
            return False, "Missing ':' in concept definition"
        return True, None
    def _load_or_create_token_cache(self, fact_file: Path) -> Dict:
        """
        Load token cache from .q.tokens if present and matching file hash.
        Otherwise return a new structure with updated file-hash.
        """
        cache_file = fact_file.with_suffix(".q.tokens")
        current_hash = _compute_file_hash(fact_file)
        if cache_file.exists():
            try:
                with open(cache_file, "r") as f:
                    cache = json.load(f)
                # If the hash matches, return it directly
                if cache.get("content_hash") == current_hash:
                    return cache
                # Otherwise, we signal that it's changed
                self.logger.info(f"Hash changed for {fact_file}, reindex needed.")
            except json.JSONDecodeError:
                self.logger.warning(f"Corrupt token cache for {fact_file}, rebuilding.")
            except Exception as e:
                self.logger.warning(f"Error reading cache for {fact_file}: {str(e)}")
        # Return a fresh cache
        return {"facts": {}, "content_hash": current_hash}
    def _save_token_cache(self, fact_file: Path, cache: Dict) -> None:
        cache_file = fact_file.with_suffix(".q.tokens")
        # Always ensure we're saving the correct file-hash
        cache["content_hash"] = _compute_file_hash(fact_file)
        with open(cache_file, "w") as f:
            json.dump(cache, f)
    def preprocess_text(self, text: str) -> List[str]:
        parts = [x.strip() for x in text.split("|")] if "|" in text else [text]
        # Remove : after the first word of parts[0]
        parts[0] = re.sub(r"^(.*?):", r"\1", parts[0])
        lemmatizer = WordNetLemmatizer()
        stop_words = set(stopwords.words("english")) - {
            "how",
            "what",
            "when",
            "where",
            "why",
            "which",
        }
        tokens = []
        for part in parts:
            if "(" in part and ")" in part:
                code_tokens = re.findall(
                    r'[\w_]+(?=\()|[\w_]+(?==[\'"]{1}[\w_]+[\'"]{1})', part
                )
                tokens.extend(code_tokens)
            words = word_tokenize(part.lower())
            tokens.extend(
                [
                    lemmatizer.lemmatize(token)
                    for token in words
                    if token not in stop_words
                ]
            )
        return tokens
    def maybe_load_bm25_index(self, clear_cache=False) -> bool:
        """
        Load existing BM25 index from disk, if present and clear_cache=False.
        """
        if not clear_cache and os.path.exists(self.bm25_index_file):
            self.logger.info("Loading existing BM25 index from disk.")
            with open(self.bm25_index_file, "rb") as f:
                data = pickle.load(f)
            self.tokenized_facts = data["tokenized_facts"]
            self.bm25_index = data["bm25_index"]
            return True
        return False
    def build_search_index(self, clear_cache=False) -> None:
        """
        Checks for new or modified .q.md files by comparing file-hash.
        If none need reindexing and clear_cache is False, loads existing index if available.
        Otherwise, reindexes only changed/new files and merges or creates a new index.
        """
        # If clear_cache is True, we skip partial logic: rebuild everything from scratch
        if clear_cache:
            self.logger.info("Clearing cache and rebuilding full search index.")
            if self.bm25_index_file.exists():
                self.bm25_index_file.unlink()
        process = psutil.Process()
        self.logger.info("Checking which .q.md files need (re)indexing...")
        # Gather all .q.md files
        q_files = [
            self.docs_dir / f for f in os.listdir(self.docs_dir) if f.endswith(".q.md")
        ]
        # We'll store known (unchanged) facts in these lists
        existing_facts: List[str] = []
        existing_tokens: List[List[str]] = []
        # Keep track of invalid lines for logging
        invalid_lines = []
        needSet = []  # files that must be (re)indexed
        for qf in q_files:
            token_cache_file = qf.with_suffix(".q.tokens")
            # If no .q.tokens or clear_cache is True → definitely reindex
            if clear_cache or not token_cache_file.exists():
                needSet.append(qf)
                continue
            # Otherwise, load the existing cache and compare hash
            cache = self._load_or_create_token_cache(qf)
            # _load_or_create_token_cache returns an empty "facts" dict when the
            # stored hash is stale or the cache is unreadable, so empty means reindex
            if not cache["facts"]:
                needSet.append(qf)
            else:
                # File is unchanged → retrieve cached token data
                for line, cache_data in cache["facts"].items():
                    existing_facts.append(line)
                    existing_tokens.append(cache_data["tokens"])
                    self.document_map[line] = qf  # track the doc for that fact
        if not needSet and not clear_cache:
            # If no file needs reindexing, try loading existing index
            if self.maybe_load_bm25_index(clear_cache=False):
                self.logger.info(
                    "No new/changed .q.md files found. Using existing BM25 index."
                )
                return
            else:
                # If there's no existing index, we must build a fresh index from the old caches
                self.logger.info(
                    "No existing BM25 index found. Building from cached facts."
                )
                if existing_facts:
                    self.logger.info(
                        f"Building BM25 index with {len(existing_facts)} cached facts."
                    )
                    self.bm25_index = BM25Okapi(existing_tokens)
                    self.tokenized_facts = existing_facts
                    with open(self.bm25_index_file, "wb") as f:
                        pickle.dump(
                            {
                                "bm25_index": self.bm25_index,
                                "tokenized_facts": self.tokenized_facts,
                            },
                            f,
                        )
                else:
                    self.logger.warning("No facts found at all. Index remains empty.")
                return
        # -----------------------------------------------------
        # If we reach here, we have new or changed .q.md files
        # We'll parse them, reindex them, and then combine with existing_facts
        # -----------------------------------------------------
        self.logger.info(f"{len(needSet)} file(s) need reindexing. Parsing now...")
        # 1) Parse the new or changed .q.md files
        new_facts = []
        new_tokens = []
        with tqdm(total=len(needSet), desc="Indexing changed files") as file_pbar:
            for file in needSet:
                # We'll build up a fresh cache
                fresh_cache = {"facts": {}, "content_hash": _compute_file_hash(file)}
                try:
                    with open(file, "r", encoding="utf-8") as f_obj:
                        content = f_obj.read().strip()
                        lines = [l.strip() for l in content.split("\n") if l.strip()]
                    for line in lines:
                        is_valid, error = self._validate_fact_line(line)
                        if not is_valid:
                            invalid_lines.append((file, line, error))
                            continue
                        tokens = self.preprocess_text(line)
                        fresh_cache["facts"][line] = {
                            "tokens": tokens,
                            "added": time.time(),
                        }
                        new_facts.append(line)
                        new_tokens.append(tokens)
                        self.document_map[line] = file
                    # Save the new .q.tokens with updated hash
                    self._save_token_cache(file, fresh_cache)
                    mem_usage = process.memory_info().rss / 1024 / 1024
                    self.logger.debug(
                        f"Memory usage after {file.name}: {mem_usage:.2f}MB"
                    )
                except Exception as e:
                    self.logger.error(f"Error processing {file}: {str(e)}")
                file_pbar.update(1)
        if invalid_lines:
            self.logger.warning(f"Found {len(invalid_lines)} invalid fact lines:")
            for file, line, error in invalid_lines:
                self.logger.warning(f"{file}: {error} in line: {line[:50]}...")
        # 2) Merge newly tokenized facts with the existing ones
        all_facts = existing_facts + new_facts
        all_tokens = existing_tokens + new_tokens
        # 3) Build BM25 index from combined facts
        self.logger.info(
            f"Building BM25 index with {len(all_facts)} total facts (old + new)."
        )
        self.bm25_index = BM25Okapi(all_tokens)
        self.tokenized_facts = all_facts
        # 4) Save the updated BM25 index to disk
        with open(self.bm25_index_file, "wb") as f:
            pickle.dump(
                {
                    "bm25_index": self.bm25_index,
                    "tokenized_facts": self.tokenized_facts,
                },
                f,
            )
        final_mem = process.memory_info().rss / 1024 / 1024
        self.logger.info(f"Search index updated. Final memory usage: {final_mem:.2f}MB")
    async def generate_index_files(
        self, force_generate_facts: bool = False, clear_bm25_cache: bool = False
    ) -> None:
        """
        Generate index files for all documents in parallel batches
        Args:
            force_generate_facts (bool): If True, regenerate indexes even if they exist
            clear_bm25_cache (bool): If True, clear existing BM25 index cache
        """
        self.logger.info("Starting index generation for documentation files.")
        md_files = [
            self.docs_dir / f
            for f in os.listdir(self.docs_dir)
            if f.endswith(".md") and not any(f.endswith(x) for x in [".q.md", ".xs.md"])
        ]
        # Skip files that already have a .q.md index unless force_generate_facts=True
        if not force_generate_facts:
            md_files = [
                f
                for f in md_files
                if not f.with_suffix(".q.md").exists()
            ]
        if not md_files:
            self.logger.info(
                "All index files exist. Use force_generate_facts=True to regenerate."
            )
        else:
            # Process documents in batches
            for i in range(0, len(md_files), self.batch_size):
                batch = md_files[i : i + self.batch_size]
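                # Ceiling division, so an evenly divisible count is not overstated by one
                total_batches = (len(md_files) + self.batch_size - 1) // self.batch_size
                self.logger.info(
                    f"Processing batch {i // self.batch_size + 1}/{total_batches}"
                )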
                await self._process_document_batch(batch)
        self.logger.info("Index generation complete, building/updating search index.")
        self.build_search_index(clear_cache=clear_bm25_cache)
    def generate(self, sections: List[str], mode: str = "extended") -> str:
        # Get all markdown files
        all_files = glob.glob(str(self.docs_dir / "[0-9]*.md")) + glob.glob(
            str(self.docs_dir / "[0-9]*.xs.md")
        )
        # Extract base names without extensions
        base_docs = {
            Path(f).name.split(".")[0]
            for f in all_files
            if not Path(f).name.endswith(".q.md")
        }
        # Filter by sections if provided
        if sections:
            base_docs = {
                doc
                for doc in base_docs
                if any(section.lower() in doc.lower() for section in sections)
            }
        # Get file paths based on mode
        files = []
        for doc in sorted(
            base_docs,
            key=lambda x: int(x.split("_")[0]) if x.split("_")[0].isdigit() else 999999,
        ):
            if mode == "condensed":
                xs_file = self.docs_dir / f"{doc}.xs.md"
                regular_file = self.docs_dir / f"{doc}.md"
                files.append(str(xs_file if xs_file.exists() else regular_file))
            else:
                files.append(str(self.docs_dir / f"{doc}.md"))
        # Read and format content
        content = []
        for file in files:
            try:
                with open(file, "r", encoding="utf-8") as f:
                    fname = Path(file).name
                    content.append(f"{'#'*20}\n# {fname}\n{'#'*20}\n\n{f.read()}")
            except Exception as e:
                self.logger.error(f"Error reading {file}: {str(e)}")
        return "\n\n---\n\n".join(content) if content else ""
    def search(self, query: str, top_k: int = 5) -> str:
        if not self.bm25_index:
            return "No search index available. Call build_search_index() first."
        query_tokens = self.preprocess_text(query)
        doc_scores = self.bm25_index.get_scores(query_tokens)
        mean_score = np.mean(doc_scores)
        std_score = np.std(doc_scores)
        score_threshold = mean_score + (0.25 * std_score)
        file_data = self._aggregate_search_scores(
            doc_scores=doc_scores,
            score_threshold=score_threshold,
            query_tokens=query_tokens,
        )
        ranked_files = sorted(
            file_data.items(),
            key=lambda x: (
                x[1]["code_match_score"] * 2.0
                + x[1]["match_count"] * 1.5
                + x[1]["total_score"]
            ),
            reverse=True,
        )[:top_k]
        results = []
        for file, _ in ranked_files:
            # Map the index file (<name>.q.md) back to its source document (<name>.md)
            main_doc = self.docs_dir / Path(file).name.replace(".q.md", ".md")
            if main_doc.exists():
                with open(main_doc, "r", encoding="utf-8") as f:
                    content = ["#" * 20, f"# {main_doc.name}", "#" * 20, "", f.read()]
                    results.append("\n".join(content))
        return "\n\n---\n\n".join(results)
    def _aggregate_search_scores(
        self, doc_scores: List[float], score_threshold: float, query_tokens: List[str]
    ) -> Dict:
        file_data = {}
        for idx, score in enumerate(doc_scores):
            if score <= score_threshold:
                continue
            fact = self.tokenized_facts[idx]
            file_path = self.document_map[fact]
            if file_path not in file_data:
                file_data[file_path] = {
                    "total_score": 0,
                    "match_count": 0,
                    "code_match_score": 0,
                    "matched_facts": [],
                }
            components = fact.split("|") if "|" in fact else [fact]
            code_match_score = 0
            if len(components) == 3:
                code_ref = components[2].strip()
                code_tokens = self.preprocess_text(code_ref)
                code_match_score = len(set(query_tokens) & set(code_tokens)) / len(
                    query_tokens
                )
            file_data[file_path]["total_score"] += score
            file_data[file_path]["match_count"] += 1
            file_data[file_path]["code_match_score"] = max(
                file_data[file_path]["code_match_score"], code_match_score
            )
            file_data[file_path]["matched_facts"].append(fact)
        return file_data
    def refresh_index(self) -> None:
        """Convenience method for a full rebuild."""
        self.build_search_index(clear_cache=True)
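Review note: the thresholding and ranking in search() and _aggregate_search_scores() can be sketched in isolation. This is a minimal sketch, not the module's code: rank_files and its list arguments are hypothetical names, and statistics.pstdev stands in for np.std (both compute the population standard deviation by default). Only the 0.25 factor and the 2.0 / 1.5 weights come from the diff above.

```python
from statistics import fmean, pstdev
from typing import Dict, List


def rank_files(
    fact_scores: List[float],
    fact_files: List[str],
    code_overlap: List[float],
    top_k: int = 5,
) -> List[str]:
    """Rank files by aggregating facts that score above mean + 0.25 * std."""
    # Hypothetical helper: one entry per fact in each of the three lists.
    threshold = fmean(fact_scores) + 0.25 * pstdev(fact_scores)
    per_file: Dict[str, Dict[str, float]] = {}
    for score, file, overlap in zip(fact_scores, fact_files, code_overlap):
        if score <= threshold:
            continue  # fact is not clearly above the average score
        entry = per_file.setdefault(file, {"total": 0.0, "matches": 0, "code": 0.0})
        entry["total"] += score
        entry["matches"] += 1
        entry["code"] = max(entry["code"], overlap)
    ranked = sorted(
        per_file,
        key=lambda f: per_file[f]["code"] * 2.0
        + per_file[f]["matches"] * 1.5
        + per_file[f]["total"],
        reverse=True,
    )
    return ranked[:top_k]


scores = [0.0, 0.0, 2.0, 4.0, 0.0, 3.0]
files = ["a.q.md", "a.q.md", "b.q.md", "b.q.md", "c.q.md", "c.q.md"]
overlap = [0.0, 0.0, 0.0, 0.5, 0.0, 1.0]
ranking = rank_files(scores, files, overlap)
```

Because the cutoff is relative to the score distribution of the current query, a file whose facts all score at or below the cutoff (a.q.md here) never enters the ranking, regardless of top_k.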
--- a/crawl4ai/markdown_generation_strategy.py
+++ b/crawl4ai/markdown_generation_strategy.py
@@ -2,47 +2,97 @@ from abc import ABC, abstractmethod
 from typing import Optional, Dict, Any, Tuple
 from .models import MarkdownGenerationResult
 from .html2text import CustomHTML2Text
-from .content_filter_strategy import RelevantContentFilter, BM25ContentFilter
+from .content_filter_strategy import RelevantContentFilter
 import re
 from urllib.parse import urljoin
 # Pre-compile the regex pattern
 LINK_PATTERN = re.compile(r'!?\[([^\]]+)\]\(([^)]+?)(?:\s+"([^"]*)")?\)')
 def fast_urljoin(base: str, url: str) -> str:
    """Fast URL joining for common cases."""
-    if url.startswith(('http://', 'https://', 'mailto:', '//')):
+    if url.startswith(("http://", "https://", "mailto:", "//")):
        return url
-    if url.startswith('/'):
+    if url.startswith("/"):
        # Handle absolute paths
-        if base.endswith('/'):
+        if base.endswith("/"):
            return base[:-1] + url
        return base + url
    return urljoin(base, url)
 class MarkdownGenerationStrategy(ABC):
    """Abstract base class for markdown generation strategies."""
-    def __init__(self, content_filter: Optional[RelevantContentFilter] = None, options: Optional[Dict[str, Any]] = None):
+
    def __init__(
        self,
        content_filter: Optional[RelevantContentFilter] = None,
        options: Optional[Dict[str, Any]] = None,
    ):
        self.content_filter = content_filter
        self.options = options or {}
    @abstractmethod
-    def generate_markdown(self, 
+    def generate_markdown(
        self,
        cleaned_html: str,
        base_url: str = "",
        html2text_options: Optional[Dict[str, Any]] = None,
        content_filter: Optional[RelevantContentFilter] = None,
        citations: bool = True,
-                         **kwargs) -> MarkdownGenerationResult:
+        **kwargs,
    ) -> MarkdownGenerationResult:
        """Generate markdown from cleaned HTML."""
        pass
 class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
-    """Default implementation of markdown generation strategy."""
-    def __init__(self, content_filter: Optional[RelevantContentFilter] = None, options: Optional[Dict[str, Any]] = None):
+    """
+    Default implementation of markdown generation strategy.
    How it works:
    1. Generate raw markdown from cleaned HTML.
    2. Convert links to citations.
    3. Generate fit markdown if content filter is provided.
    4. Return MarkdownGenerationResult.
    Args:
        content_filter (Optional[RelevantContentFilter]): Content filter for generating fit markdown.
        options (Optional[Dict[str, Any]]): Additional options for markdown generation. Defaults to None.
    Returns:
        MarkdownGenerationResult: Result containing raw markdown, fit markdown, fit HTML, and references markdown.
    """
    def __init__(
        self,
        content_filter: Optional[RelevantContentFilter] = None,
        options: Optional[Dict[str, Any]] = None,
    ):
        super().__init__(content_filter, options)
-    def convert_links_to_citations(self, markdown: str, base_url: str = "") -> Tuple[str, str]:
+    def convert_links_to_citations(
        self, markdown: str, base_url: str = ""
    ) -> Tuple[str, str]:
        """
        Convert links in markdown to citations.
        How it works:
        1. Find all links in the markdown.
        2. Convert links to citations.
        3. Return converted markdown and references markdown.
        Note:
        This function uses a regex pattern to find links in markdown.
        Args:
            markdown (str): Markdown text.
            base_url (str): Base URL for URL joins.
        Returns:
            Tuple[str, str]: Converted markdown and references markdown.
        """
        link_map = {}
        url_cache = {}  # Cache for URL joins
        parts = []
@@ -50,28 +100,34 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
        counter = 1
        for match in LINK_PATTERN.finditer(markdown):
-            parts.append(markdown[last_end:match.start()])
+            parts.append(markdown[last_end : match.start()])
            text, url, title = match.groups()
            # Use cached URL if available, otherwise compute and cache
-            if base_url and not url.startswith(('http://', 'https://', 'mailto:')):
+            if base_url and not url.startswith(("http://", "https://", "mailto:")):
                if url not in url_cache:
                    url_cache[url] = fast_urljoin(base_url, url)
                url = url_cache[url]
            if url not in link_map:
                desc = []
-                if title: desc.append(title)
-                if text and text != title: desc.append(text)
+                if title:
+                    desc.append(title)
                if text and text != title:
                    desc.append(text)
                link_map[url] = (counter, ": " + " - ".join(desc) if desc else "")
                counter += 1
            num = link_map[url][0]
-            parts.append(f"{text}⟨{num}⟩" if not match.group(0).startswith('!') else f"![{text}⟨{num}⟩]")
+            parts.append(
                f"{text}⟨{num}⟩"
                if not match.group(0).startswith("!")
                else f"![{text}⟨{num}⟩]"
            )
            last_end = match.end()
        parts.append(markdown[last_end:])
-        converted_text = ''.join(parts)
+        converted_text = "".join(parts)
        # Pre-build reference strings
        references = ["\n\n## References\n\n"]
@@ -80,52 +136,118 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
            for url, (num, desc) in sorted(link_map.items(), key=lambda x: x[1][0])
        )
-        return converted_text, ''.join(references)
+        return converted_text, "".join(references)
-    def generate_markdown(self, 
+    def generate_markdown(
        self,
        cleaned_html: str,
        base_url: str = "",
        html2text_options: Optional[Dict[str, Any]] = None,
        options: Optional[Dict[str, Any]] = None,
        content_filter: Optional[RelevantContentFilter] = None,
        citations: bool = True,
-                         **kwargs) -> MarkdownGenerationResult:
-        """Generate markdown with citations from cleaned HTML."""
-        # Initialize HTML2Text with options
-        h = CustomHTML2Text()
+        **kwargs,
+    ) -> MarkdownGenerationResult:
+        """
+        Generate markdown with citations from cleaned HTML.
        How it works:
        1. Generate raw markdown from cleaned HTML.
        2. Convert links to citations.
        3. Generate fit markdown if content filter is provided.
        4. Return MarkdownGenerationResult.
        Args:
            cleaned_html (str): Cleaned HTML content.
            base_url (str): Base URL for URL joins.
            html2text_options (Optional[Dict[str, Any]]): HTML2Text options.
            options (Optional[Dict[str, Any]]): Additional options for markdown generation.
            content_filter (Optional[RelevantContentFilter]): Content filter for generating fit markdown.
            citations (bool): Whether to generate citations.
        Returns:
            MarkdownGenerationResult: Result containing raw markdown, fit markdown, fit HTML, and references markdown.
        """
        try:
            # Initialize HTML2Text with default options for better conversion
            h = CustomHTML2Text(baseurl=base_url)
            default_options = {
                "body_width": 0,  # Disable text wrapping
                "ignore_emphasis": False,
                "ignore_links": False,
                "ignore_images": False,
                "protect_links": True,
                "single_line_break": True,
                "mark_code": True,
                "escape_snob": False,
            }
            # Update with custom options if provided
            if html2text_options:
-            h.update_params(**html2text_options)
+                default_options.update(html2text_options)
            elif options:
-            h.update_params(**options)
+                default_options.update(options)
            elif self.options:
-            h.update_params(**self.options)
+                default_options.update(self.options)
            h.update_params(**default_options)
            # Ensure we have valid input
            if not cleaned_html:
                cleaned_html = ""
            elif not isinstance(cleaned_html, str):
                cleaned_html = str(cleaned_html)
            # Generate raw markdown
            try:
                raw_markdown = h.handle(cleaned_html)
-        raw_markdown = raw_markdown.replace('    ```', '```')
+            except Exception as e:
                raw_markdown = f"Error converting HTML to markdown: {str(e)}"
            raw_markdown = raw_markdown.replace("    ```", "```")
            # Convert links to citations
-        markdown_with_citations: str = ""
+            markdown_with_citations: str = raw_markdown
            references_markdown: str = ""
            if citations:
-            markdown_with_citations, references_markdown = self.convert_links_to_citations(
-                raw_markdown, base_url
-            )
+                try:
+                    (
+                        markdown_with_citations,
                        references_markdown,
                    ) = self.convert_links_to_citations(raw_markdown, base_url)
                except Exception as e:
                    markdown_with_citations = raw_markdown
                    references_markdown = f"Error generating citations: {str(e)}"
            # Generate fit markdown if content filter is provided
            fit_markdown: Optional[str] = ""
            filtered_html: Optional[str] = ""
            if content_filter or self.content_filter:
                try:
                    content_filter = content_filter or self.content_filter
                    filtered_html = content_filter.filter_content(cleaned_html)
-            filtered_html = '\n'.join('<div>{}</div>'.format(s) for s in filtered_html)
+                    filtered_html = "\n".join(
                        "<div>{}</div>".format(s) for s in filtered_html
                    )
                    fit_markdown = h.handle(filtered_html)
                except Exception as e:
                    fit_markdown = f"Error generating fit markdown: {str(e)}"
                    filtered_html = ""
            return MarkdownGenerationResult(
-            raw_markdown=raw_markdown,
+                raw_markdown=raw_markdown or "",
-            markdown_with_citations=markdown_with_citations,
+                markdown_with_citations=markdown_with_citations or "",
-            references_markdown=references_markdown,
+                references_markdown=references_markdown or "",
-            fit_markdown=fit_markdown,
+                fit_markdown=fit_markdown or "",
-            fit_html=filtered_html,
+                fit_html=filtered_html or "",
            )
        except Exception as e:
            # If anything fails, return the error message in the markdown fields
            # and empty strings for the remaining fields
            error_msg = f"Error in markdown generation: {str(e)}"
            return MarkdownGenerationResult(
                raw_markdown=error_msg,
                markdown_with_citations=error_msg,
                references_markdown="",
                fit_markdown="",
                fit_html="",
            )
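Review note: the link-to-citation step documented in convert_links_to_citations can be sketched on its own. This is a minimal sketch under stated assumptions: links_to_citations is a hypothetical name, the regex is copied from LINK_PATTERN in this diff, and the reference line format is an assumption because the hunk that builds the references is not shown. Title handling and base-URL joining are left out.

```python
import re
from typing import Dict, List, Tuple

# Copied from LINK_PATTERN in the diff: optional "!", [text], (url "optional title")
LINK_PATTERN = re.compile(r'!?\[([^\]]+)\]\(([^)]+?)(?:\s+"([^"]*)")?\)')


def links_to_citations(markdown: str) -> Tuple[str, List[str]]:
    """Replace inline links with numbered markers; collect one reference per URL."""
    numbers: Dict[str, int] = {}
    references: List[str] = []

    def replace(match: "re.Match[str]") -> str:
        text, url, _title = match.groups()
        if url not in numbers:
            # First time this URL is seen: assign the next citation number
            numbers[url] = len(numbers) + 1
            # Assumed reference format; the real one is outside the shown hunks
            references.append(f"⟨{numbers[url]}⟩ {url}")
        marker = f"{text}⟨{numbers[url]}⟩"
        # Image links keep their "![...]" wrapper, as in the diff
        return f"![{marker}]" if match.group(0).startswith("!") else marker

    return LINK_PATTERN.sub(replace, markdown), references


body, refs = links_to_citations(
    "See [docs](https://example.com/docs) and [the docs again](https://example.com/docs)."
)
```

Repeated URLs share a citation number, so the references list grows with the number of distinct targets rather than the number of links.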
--- a/crawl4ai/migrations.py
+++ b/crawl4ai/migrations.py
@@ -1,13 +1,11 @@
 import os
 import asyncio
 import logging
 from pathlib import Path
 import aiosqlite
 from typing import Optional
 import xxhash
 import aiofiles
 import shutil
 import time
 from datetime import datetime
 from .async_logger import AsyncLogger, LogLevel
@@ -17,6 +15,7 @@ logger = AsyncLogger(log_level=LogLevel.DEBUG, verbose=True)
 # logging.basicConfig(level=logging.INFO)
 # logger = logging.getLogger(__name__)
 class DatabaseMigration:
    def __init__(self, db_path: str):
        self.db_path = db_path
@@ -24,11 +23,11 @@ class DatabaseMigration:
    def _ensure_content_dirs(self, base_path: str) -> dict:
        dirs = {
-            'html': 'html_content',
+            "html": "html_content",
-            'cleaned': 'cleaned_html',
+            "cleaned": "cleaned_html",
-            'markdown': 'markdown_content', 
+            "markdown": "markdown_content",
-            'extracted': 'extracted_content',
+            "extracted": "extracted_content",
-            'screenshots': 'screenshots'
+            "screenshots": "screenshots",
        }
        content_paths = {}
        for key, dirname in dirs.items():
@@ -52,7 +51,7 @@ class DatabaseMigration:
        file_path = os.path.join(self.content_paths[content_type], content_hash)
        if not os.path.exists(file_path):
-            async with aiofiles.open(file_path, 'w', encoding='utf-8') as f:
+            async with aiofiles.open(file_path, "w", encoding="utf-8") as f:
                await f.write(content)
        return content_hash
@@ -66,24 +65,36 @@ class DatabaseMigration:
            async with aiosqlite.connect(self.db_path) as db:
                # Get all rows
                async with db.execute(
-                    '''SELECT url, html, cleaned_html, markdown, 
+                    """SELECT url, html, cleaned_html, markdown, 
-                       extracted_content, screenshot FROM crawled_data'''
+                       extracted_content, screenshot FROM crawled_data"""
                ) as cursor:
                    rows = await cursor.fetchall()
                migrated_count = 0
                for row in rows:
-                    url, html, cleaned_html, markdown, extracted_content, screenshot = row
+                    (
                        url,
                        html,
                        cleaned_html,
                        markdown,
                        extracted_content,
                        screenshot,
                    ) = row
                    # Store content in files and get hashes
-                    html_hash = await self._store_content(html, 'html')
+                    html_hash = await self._store_content(html, "html")
-                    cleaned_hash = await self._store_content(cleaned_html, 'cleaned')
+                    cleaned_hash = await self._store_content(cleaned_html, "cleaned")
-                    markdown_hash = await self._store_content(markdown, 'markdown')
+                    markdown_hash = await self._store_content(markdown, "markdown")
-                    extracted_hash = await self._store_content(extracted_content, 'extracted')
+                    extracted_hash = await self._store_content(
+                        extracted_content, "extracted"
+                    )
-                    screenshot_hash = await self._store_content(screenshot, 'screenshots')
+                    screenshot_hash = await self._store_content(
+                        screenshot, "screenshots"
+                    )
                    # Update database with hashes
-                    await db.execute('''
+                    await db.execute(
+                        """
                        UPDATE crawled_data 
                        SET html = ?, 
                            cleaned_html = ?,
@@ -91,26 +102,37 @@ class DatabaseMigration:
                            extracted_content = ?,
                            screenshot = ?
                        WHERE url = ?
-                    ''', (html_hash, cleaned_hash, markdown_hash,
-                         extracted_hash, screenshot_hash, url))
+                    """,
+                        (
+                            html_hash,
+                            cleaned_hash,
+                            markdown_hash,
+                            extracted_hash,
+                            screenshot_hash,
+                            url,
+                        ),
+                    )
                    migrated_count += 1
                    if migrated_count % 100 == 0:
                        logger.info(f"Migrated {migrated_count} records...", tag="INIT")
                await db.commit()
-                logger.success(f"Migration completed. {migrated_count} records processed.", tag="COMPLETE")
+                logger.success(
+                    f"Migration completed. {migrated_count} records processed.",
+                    tag="COMPLETE",
+                )
        except Exception as e:
            # logger.error(f"Migration failed: {e}")
            logger.error(
                message="Migration failed: {error}",
                tag="ERROR",
-                params={"error": str(e)}
+                params={"error": str(e)},
            )
            raise e
 async def backup_database(db_path: str) -> str:
    """Create backup of existing database"""
    if not os.path.exists(db_path):
@@ -118,7 +140,7 @@ async def backup_database(db_path: str) -> str:
        return None
    # Create backup with timestamp
-    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
+    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    backup_path = f"{db_path}.backup_{timestamp}"
    try:
@@ -132,12 +154,11 @@ async def backup_database(db_path: str) -> str:
    except Exception as e:
        # logger.error(f"Backup failed: {e}")
        logger.error(
-                message="Migration failed: {error}",
-                tag="ERROR",
-                params={"error": str(e)}
+            message="Migration failed: {error}", tag="ERROR", params={"error": str(e)}
        )
        raise e
 async def run_migration(db_path: Optional[str] = None):
    """Run database migration"""
    if db_path is None:
@@ -155,14 +176,19 @@ async def run_migration(db_path: Optional[str] = None):
    migration = DatabaseMigration(db_path)
    await migration.migrate_database()
 def main():
    """CLI entry point for migration"""
    import argparse
-    parser = argparse.ArgumentParser(description='Migrate Crawl4AI database to file-based storage')
-    parser.add_argument('--db-path', help='Custom database path')
+
+    parser = argparse.ArgumentParser(
+        description="Migrate Crawl4AI database to file-based storage"
+    )
+    parser.add_argument("--db-path", help="Custom database path")
    args = parser.parse_args()
    asyncio.run(run_migration(args.db_path))
 if __name__ == "__main__":
    main()
--- a/crawl4ai/model_loader.py
+++ b/crawl4ai/model_loader.py
@@ -2,75 +2,86 @@ from functools import lru_cache
 from pathlib import Path
 import subprocess, os
 import shutil
 import tarfile
 from .model_loader import *
 import argparse
 import urllib.request
 from crawl4ai.config import MODEL_REPO_BRANCH
 __location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
@lru_cache()
 def get_available_memory(device):
    import torch
+
-    if device.type == 'cuda':
+    if device.type == "cuda":
        return torch.cuda.get_device_properties(device).total_memory
-    elif device.type == 'mps':      
+    elif device.type == "mps":
-        return 48 * 1024 ** 3  # Assuming 8GB for MPS, as a conservative estimate
+        return 48 * 1024**3  # Assuming 8GB for MPS, as a conservative estimate
    else:
        return 0
@lru_cache()
 def calculate_batch_size(device):
    available_memory = get_available_memory(device)
-    if device.type == 'cpu':
+    if device.type == "cpu":
        return 16
-    elif device.type in ['cuda', 'mps']:
+    elif device.type in ["cuda", "mps"]:
        # Adjust these thresholds based on your model size and available memory
-        if available_memory >= 31 * 1024 ** 3:  # > 32GB
+        if available_memory >= 31 * 1024**3:  # > 32GB
            return 256
-        elif available_memory >= 15 * 1024 ** 3:  # > 16GB to 32GB
+        elif available_memory >= 15 * 1024**3:  # > 16GB to 32GB
            return 128
-        elif available_memory >= 8 * 1024 ** 3:  # 8GB to 16GB
+        elif available_memory >= 8 * 1024**3:  # 8GB to 16GB
            return 64
        else:
            return 32
    else:
        return 16  # Default batch size
@lru_cache()
 def get_device():
    import torch
    if torch.cuda.is_available():
-        device = torch.device('cuda')
+        device = torch.device("cuda")
    elif torch.backends.mps.is_available():
-        device = torch.device('mps')
+        device = torch.device("mps")
    else:
-        device = torch.device('cpu')
+        device = torch.device("cpu")
    return device
 def set_model_device(model):
    device = get_device()
    model.to(device)
    return model, device
@lru_cache()
 def get_home_folder():
-    home_folder = os.path.join(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai")
+    home_folder = os.path.join(
+        os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai"
+    )
    os.makedirs(home_folder, exist_ok=True)
    os.makedirs(f"{home_folder}/cache", exist_ok=True)
    os.makedirs(f"{home_folder}/models", exist_ok=True)
    return home_folder
@lru_cache()
 def load_bert_base_uncased():
-    from transformers import BertTokenizer, BertModel, AutoTokenizer, AutoModel
+    from transformers import BertTokenizer, BertModel
+
-    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', resume_download=None)
+    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", resume_download=None)
-    model = BertModel.from_pretrained('bert-base-uncased', resume_download=None)
+    model = BertModel.from_pretrained("bert-base-uncased", resume_download=None)
    model.eval()
    model, device = set_model_device(model)
    return tokenizer, model
@lru_cache()
 def load_HF_embedding_model(model_name="BAAI/bge-small-en-v1.5") -> tuple:
    """Load the Hugging Face model for embedding.
@@ -81,30 +92,35 @@ def load_HF_embedding_model(model_name="BAAI/bge-small-en-v1.5") -> tuple:
    Returns:
        tuple: The tokenizer and model.
    """
-    from transformers import BertTokenizer, BertModel, AutoTokenizer, AutoModel
+    from transformers import AutoTokenizer, AutoModel
    tokenizer = AutoTokenizer.from_pretrained(model_name, resume_download=None)
    model = AutoModel.from_pretrained(model_name, resume_download=None)
    model.eval()
    model, device = set_model_device(model)
    return tokenizer, model
@lru_cache()
 def load_text_classifier():
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    from transformers import pipeline
    import torch
-    tokenizer = AutoTokenizer.from_pretrained("dstefa/roberta-base_topic_classification_nyt_news")
+    tokenizer = AutoTokenizer.from_pretrained(
+        "dstefa/roberta-base_topic_classification_nyt_news"
+    )
-    model = AutoModelForSequenceClassification.from_pretrained("dstefa/roberta-base_topic_classification_nyt_news")
+    model = AutoModelForSequenceClassification.from_pretrained(
+        "dstefa/roberta-base_topic_classification_nyt_news"
+    )
    model.eval()
    model, device = set_model_device(model)
    pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
    return pipe
@lru_cache()
 def load_text_multilabel_classifier():
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    import numpy as np
    from scipy.special import expit
    import torch
@@ -117,17 +133,26 @@ def load_text_multilabel_classifier():
    #     device = torch.device("cpu")
    #     # return load_spacy_model(), torch.device("cpu")
    MODEL = "cardiffnlp/tweet-topic-21-multi"
    tokenizer = AutoTokenizer.from_pretrained(MODEL, resume_download=None)
-    model = AutoModelForSequenceClassification.from_pretrained(MODEL, resume_download=None)
+    model = AutoModelForSequenceClassification.from_pretrained(
+        MODEL, resume_download=None
+    )
    model.eval()
    model, device = set_model_device(model)
    class_mapping = model.config.id2label
    def _classifier(texts, threshold=0.5, max_length=64):
-        tokens = tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=max_length)
+        tokens = tokenizer(
+            texts,
+            return_tensors="pt",
+            padding=True,
+            truncation=True,
+            max_length=max_length,
+        )
-        tokens = {key: val.to(device) for key, val in tokens.items()}  # Move tokens to the selected device
+        tokens = {
+            key: val.to(device) for key, val in tokens.items()
+        }  # Move tokens to the selected device
        with torch.no_grad():
            output = model(**tokens)
@@ -138,25 +163,31 @@ def load_text_multilabel_classifier():
        batch_labels = []
        for prediction in predictions:
-            labels = [class_mapping[i] for i, value in enumerate(prediction) if value == 1]
+            labels = [
+                class_mapping[i] for i, value in enumerate(prediction) if value == 1
+            ]
            batch_labels.append(labels)
        return batch_labels
    return _classifier, device
@lru_cache()
 def load_nltk_punkt():
    import nltk
    try:
-        nltk.data.find('tokenizers/punkt')
+        nltk.data.find("tokenizers/punkt")
    except LookupError:
-        nltk.download('punkt')
+        nltk.download("punkt")
-    return nltk.data.find('tokenizers/punkt')
+    return nltk.data.find("tokenizers/punkt")
@lru_cache()
 def load_spacy_model():
    import spacy
    name = "models/reuters"
    home_folder = get_home_folder()
    model_folder = Path(home_folder) / name
@@ -176,7 +207,9 @@ def load_spacy_model():
                if model_folder.exists():
                    shutil.rmtree(model_folder)
            except PermissionError:
-                print("[WARNING] Unable to remove existing folders. Please manually delete the following folders and try again:")
+                print(
+                    "[WARNING] Unable to remove existing folders. Please manually delete the following folders and try again:"
+                )
                print(f"- {repo_folder}")
                print(f"- {model_folder}")
                return None
@@ -187,7 +220,7 @@ def load_spacy_model():
                ["git", "clone", "-b", branch, repo_url, str(repo_folder)],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
-                check=True
+                check=True,
            )
            # Create the models directory if it doesn't exist
@@ -215,6 +248,7 @@ def load_spacy_model():
        print(f"Error loading spacy model: {e}")
        return None
 def download_all_models(remove_existing=False):
    """Download all models required for Crawl4AI."""
    if remove_existing:
@@ -243,14 +277,20 @@ def download_all_models(remove_existing=False):
    load_nltk_punkt()
    print("[LOG] ✅ All models downloaded successfully.")
 def main():
    print("[LOG] Welcome to the Crawl4AI Model Downloader!")
    print("[LOG] This script will download all the models required for Crawl4AI.")
    parser = argparse.ArgumentParser(description="Crawl4AI Model Downloader")
-    parser.add_argument('--remove-existing', action='store_true', help="Remove existing models before downloading")
+    parser.add_argument(
+        "--remove-existing",
+        action="store_true",
+        help="Remove existing models before downloading",
+    )
    args = parser.parse_args()
    download_all_models(remove_existing=args.remove_existing)
 if __name__ == "__main__":
    main()
--- a/crawl4ai/models.py
+++ b/crawl4ai/models.py
@@ -1,12 +1,83 @@
 from pydantic import BaseModel, HttpUrl
-from typing import List, Dict, Optional, Callable, Awaitable, Union
+from typing import List, Dict, Optional, Callable, Awaitable, Union, Any
 from enum import Enum
 from dataclasses import dataclass
 from .ssl_certificate import SSLCertificate
 from datetime import datetime
 from datetime import timedelta
 ###############################
 # Dispatcher Models
 ###############################
@dataclass
 class DomainState:
    last_request_time: float = 0
    current_delay: float = 0
    fail_count: int = 0
@dataclass
 class CrawlerTaskResult:
    task_id: str
    url: str
    result: "CrawlResult"
    memory_usage: float
    peak_memory: float
    start_time: datetime
    end_time: datetime
    error_message: str = ""
 class CrawlStatus(Enum):
    QUEUED = "QUEUED"
    IN_PROGRESS = "IN_PROGRESS"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
@dataclass
 class CrawlStats:
    task_id: str
    url: str
    status: CrawlStatus
    start_time: Optional[datetime] = None
    end_time: Optional[datetime] = None
    memory_usage: float = 0.0
    peak_memory: float = 0.0
    error_message: str = ""
    @property
    def duration(self) -> str:
        if not self.start_time:
            return "0:00"
        end = self.end_time or datetime.now()
        duration = end - self.start_time
        return str(timedelta(seconds=int(duration.total_seconds())))
 class DisplayMode(Enum):
    DETAILED = "DETAILED"
    AGGREGATED = "AGGREGATED"
 ###############################
 # Crawler Models
 ###############################
@dataclass
 class TokenUsage:
    completion_tokens: int = 0
    prompt_tokens: int = 0
    total_tokens: int = 0
    completion_tokens_details: Optional[dict] = None
    prompt_tokens_details: Optional[dict] = None
 class UrlModel(BaseModel):
    url: HttpUrl
    forced: bool = False
 class MarkdownGenerationResult(BaseModel):
    raw_markdown: str
    markdown_with_citations: str
@@ -14,6 +85,16 @@ class MarkdownGenerationResult(BaseModel):
    fit_markdown: Optional[str] = None
    fit_html: Optional[str] = None
 class DispatchResult(BaseModel):
    task_id: str
    memory_usage: float
    peak_memory: float
    start_time: datetime
    end_time: datetime
    error_message: str = ""
 class CrawlResult(BaseModel):
    url: str
    html: str
@@ -23,7 +104,7 @@ class CrawlResult(BaseModel):
    links: Dict[str, List[Dict]] = {}
    downloaded_files: Optional[List[str]] = None
    screenshot: Optional[str] = None
-    pdf : Optional[bytes] = None
+    pdf: Optional[bytes] = None
    markdown: Optional[Union[str, MarkdownGenerationResult]] = None
    markdown_v2: Optional[MarkdownGenerationResult] = None
    fit_markdown: Optional[str] = None
@@ -34,6 +115,13 @@ class CrawlResult(BaseModel):
    session_id: Optional[str] = None
    response_headers: Optional[dict] = None
    status_code: Optional[int] = None
    ssl_certificate: Optional[SSLCertificate] = None
    dispatch_result: Optional[DispatchResult] = None
    redirected_url: Optional[str] = None
    class Config:
        arbitrary_types_allowed = True
 class AsyncCrawlResponse(BaseModel):
    html: str
@@ -43,8 +131,52 @@ class AsyncCrawlResponse(BaseModel):
    pdf_data: Optional[bytes] = None
    get_delayed_content: Optional[Callable[[Optional[float]], Awaitable[str]]] = None
    downloaded_files: Optional[List[str]] = None
    ssl_certificate: Optional[SSLCertificate] = None
    redirected_url: Optional[str] = None
    class Config:
        arbitrary_types_allowed = True
 ###############################
 # Scraping Models
 ###############################
 class MediaItem(BaseModel):
    src: Optional[str] = ""
    alt: Optional[str] = ""
    desc: Optional[str] = ""
    score: Optional[int] = 0
    type: str = "image"
    group_id: Optional[int] = 0
    format: Optional[str] = None
    width: Optional[int] = None
 class Link(BaseModel):
    href: Optional[str] = ""
    text: Optional[str] = ""
    title: Optional[str] = ""
    base_domain: Optional[str] = ""
 class Media(BaseModel):
    images: List[MediaItem] = []
    videos: List[
        MediaItem
    ] = []  # Using MediaItem model for now, can be extended with Video model if needed
    audios: List[
        MediaItem
    ] = []  # Using MediaItem model for now, can be extended with Audio model if needed
 class Links(BaseModel):
    internal: List[Link] = []
    external: List[Link] = []
 class ScrapingResult(BaseModel):
    cleaned_html: str
    success: bool
    media: Media = Media()
    links: Links = Links()
    metadata: Dict[str, Any] = {}
--- a/crawl4ai/prompts.py
+++ b/crawl4ai/prompts.py
@@ -202,3 +202,808 @@ Avoid Common Mistakes:
 Result
 Output the final list of JSON objects, wrapped in <blocks>...</blocks> XML tags. Make sure to close the tag properly."""
 PROMPT_FILTER_CONTENT = """Your task is to filter and convert HTML content into clean, focused markdown that's optimized for use with LLMs and information retrieval systems.
 INPUT HTML: 
 <|HTML_CONTENT_START|>
 {HTML}
 <|HTML_CONTENT_END|>
 SPECIFIC INSTRUCTION: 
 <|USER_INSTRUCTION_START|>
 {REQUEST}
 <|USER_INSTRUCTION_END|>
 TASK DETAILS:
 1. Content Selection
 - DO: Keep essential information, main content, key details
 - DO: Preserve hierarchical structure using markdown headers
 - DO: Keep code blocks, tables, key lists
 - DON'T: Include navigation menus, ads, footers, cookie notices
 - DON'T: Keep social media widgets, sidebars, related content
 2. Content Transformation
 - DO: Use proper markdown syntax (#, ##, **, `, etc)
 - DO: Convert tables to markdown tables
 - DO: Preserve code formatting with ```language blocks
 - DO: Maintain link texts but remove tracking parameters
 - DON'T: Include HTML tags in output
 - DON'T: Keep class names, ids, or other HTML attributes
 3. Content Organization
 - DO: Maintain logical flow of information
 - DO: Group related content under appropriate headers
 - DO: Use consistent header levels
 - DON'T: Fragment related content
 - DON'T: Duplicate information
 Example Input:
 <div class="main-content"><h1>Setup Guide</h1><p>Follow these steps...</p></div>
 <div class="sidebar">Related articles...</div>
 Example Output:
 # Setup Guide
 Follow these steps...
 IMPORTANT: If specific instruction is provided above, prioritize those requirements over these general guidelines.
 OUTPUT FORMAT: 
 Wrap your response in <content> tags. Use proper markdown throughout.
 <content>
 [Your markdown content here]
 </content>
 Begin filtering now."""
 JSON_SCHEMA_BUILDER= """
 # HTML Schema Generation Instructions
 You are a specialized model designed to analyze HTML patterns and generate extraction schemas. Your primary job is to create structured JSON schemas that can be used to extract data from HTML in a consistent and reliable way. When presented with HTML content, you must analyze its structure and generate a schema that captures all relevant data points.
 ## Your Core Responsibilities:
 1. Analyze HTML structure to identify repeating patterns and important data points
 2. Generate valid JSON schemas following the specified format
 3. Create appropriate selectors that will work reliably for data extraction
 4. Name fields meaningfully based on their content and purpose
 5. Handle both specific user requests and autonomous pattern detection
 ## Available Schema Types You Can Generate:
 <schema_types>
 1. Basic Single-Level Schema
   - Use for simple, flat data structures
   - Example: Product cards, user profiles
   - Direct field extractions
 2. Nested Object Schema
   - Use for hierarchical data
   - Example: Articles with author details
   - Contains objects within objects
 3. List Schema
   - Use for repeating elements
   - Example: Comment sections, product lists
   - Handles arrays of similar items
 4. Complex Nested Lists
   - Use for multi-level data
   - Example: Categories with subcategories
   - Multiple levels of nesting
 5. Transformation Schema
   - Use for data requiring processing
   - Supports regex and text transformations
   - Special attribute handling
 </schema_types>
 <schema_structure>
 Your output must always be a JSON object with this structure:
 {
  "name": "Descriptive name of the pattern",
  "baseSelector": "CSS selector for the repeating element",
  "fields": [
    {
      "name": "field_name",
      "selector": "CSS selector",
      "type": "text|attribute|nested|list|regex",
      "attribute": "attribute_name",  // Optional
      "transform": "transformation_type",  // Optional
      "pattern": "regex_pattern",  // Optional
      "fields": []  // For nested/list types
    }
  ]
 }
 </schema_structure>
 <type_definitions>
 Available field types:
 - text: Direct text extraction
 - attribute: HTML attribute extraction
 - nested: Object containing other fields
 - list: Array of similar items
 - regex: Pattern-based extraction
 </type_definitions>
 <behavior_rules>
 1. When given a specific query:
   - Focus on extracting requested data points
   - Use most specific selectors possible
   - Include all fields mentioned in the query
 2. When no query is provided:
   - Identify main content areas
   - Extract all meaningful data points
   - Use semantic structure to determine importance
   - Include prices, dates, titles, and other common data types
 3. Always:
   - Use reliable CSS selectors
   - Handle dynamic class names appropriately
   - Create descriptive field names
   - Follow consistent naming conventions
 </behavior_rules>
 <examples>
 1. Basic Product Card Example:
 <html>
 <div class="product-card" data-cat-id="electronics" data-subcat-id="laptops">
  <h2 class="product-title">Gaming Laptop</h2>
  <span class="price">$999.99</span>
  <img src="laptop.jpg" alt="Gaming Laptop">
 </div>
 </html>
 Generated Schema:
 {
  "name": "Product Cards",
  "baseSelector": ".product-card",
  "baseFields": [
    {"name": "data_cat_id", "type": "attribute", "attribute": "data-cat-id"},
    {"name": "data_subcat_id", "type": "attribute", "attribute": "data-subcat-id"}
  ],
  "fields": [
    {
      "name": "title",
      "selector": ".product-title",
      "type": "text"
    },
    {
      "name": "price",
      "selector": ".price",
      "type": "text"
    },
    {
      "name": "image_url",
      "selector": "img",
      "type": "attribute",
      "attribute": "src"
    }
  ]
 }
 2. Article with Author Details Example:
 <html>
 <article>
  <h1>The Future of AI</h1>
  <div class="author-info">
    <span class="author-name">Dr. Smith</span>
    <img src="author.jpg" alt="Dr. Smith">
  </div>
 </article>
 </html>
 Generated Schema:
 {
  "name": "Article Details",
  "baseSelector": "article",
  "fields": [
    {
      "name": "title",
      "selector": "h1",
      "type": "text"
    },
    {
      "name": "author",
      "type": "nested",
      "selector": ".author-info",
      "fields": [
        {
          "name": "name",
          "selector": ".author-name",
          "type": "text"
        },
        {
          "name": "avatar",
          "selector": "img",
          "type": "attribute",
          "attribute": "src"
        }
      ]
    }
  ]
 }
 3. Comments Section Example:
 <html>
 <div class="comments-container">
  <div class="comment" data-user-id="123">
    <div class="user-name">John123</div>
    <p class="comment-text">Great article!</p>
  </div>
  <div class="comment" data-user-id="456">
    <div class="user-name">Alice456</div>
    <p class="comment-text">Thanks for sharing.</p>
  </div>
 </div>
 </html>
 Generated Schema:
 {
  "name": "Comment Section",
  "baseSelector": ".comments-container",
  "baseFields": [
    {"name": "data_user_id", "type": "attribute", "attribute": "data-user-id"}
  ],
  "fields": [
    {
      "name": "comments",
      "type": "list",
      "selector": ".comment",
      "fields": [
        {
          "name": "user",
          "selector": ".user-name",
          "type": "text"
        },
        {
          "name": "content",
          "selector": ".comment-text",
          "type": "text"
        }
      ]
    }
  ]
 }
 4. E-commerce Categories Example:
 <html>
 <div class="category-section" data-category="electronics">
  <h2>Electronics</h2>
  <div class="subcategory">
    <h3>Laptops</h3>
    <div class="product">
      <span class="product-name">MacBook Pro</span>
      <span class="price">$1299</span>
    </div>
    <div class="product">
      <span class="product-name">Dell XPS</span>
      <span class="price">$999</span>
    </div>
  </div>
 </div>
 </html>
 Generated Schema:
 {
  "name": "E-commerce Categories",
  "baseSelector": ".category-section",
  "baseFields": [
    {"name": "data_category", "type": "attribute", "attribute": "data-category"}
  ],
  "fields": [
    {
      "name": "category_name",
      "selector": "h2",
      "type": "text"
    },
    {
      "name": "subcategories",
      "type": "nested_list",
      "selector": ".subcategory",
      "fields": [
        {
          "name": "name",
          "selector": "h3",
          "type": "text"
        },
        {
          "name": "products",
          "type": "list",
          "selector": ".product",
          "fields": [
            {
              "name": "name",
              "selector": ".product-name",
              "type": "text"
            },
            {
              "name": "price",
              "selector": ".price",
              "type": "text"
            }
          ]
        }
      ]
    }
  ]
 }
 5. Job Listings with Transformations Example:
 <html>
 <div class="job-post">
  <h3 class="job-title">Senior Developer</h3>
  <span class="salary-text">Salary: $120,000/year</span>
  <span class="location">  New York, NY  </span>
 </div>
 </html>
 Generated Schema:
 {
  "name": "Job Listings",
  "baseSelector": ".job-post",
  "fields": [
    {
      "name": "title",
      "selector": ".job-title",
      "type": "text",
      "transform": "uppercase"
    },
    {
      "name": "salary",
      "selector": ".salary-text",
      "type": "regex",
      "pattern": "\\$([\\d,]+)"
    },
    {
      "name": "location",
      "selector": ".location",
      "type": "text",
      "transform": "strip"
    }
  ]
 }
 6. Skyscanner Place Card Example:
 <html>
 <div class="PlaceCard_descriptionContainer__M2NjN" data-testid="description-container">
  <div class="PlaceCard_nameContainer__ZjZmY" tabindex="0" role="link">
    <div class="PlaceCard_nameContent__ODUwZ">
      <span class="BpkText_bpk-text__MjhhY BpkText_bpk-text--heading-4__Y2FlY">Doha</span>
    </div>
    <span class="BpkText_bpk-text__MjhhY BpkText_bpk-text--heading-4__Y2FlY PlaceCard_subName__NTVkY">Qatar</span>
  </div>
  <span class="PlaceCard_advertLabel__YTM0N">Sunny days and the warmest welcome awaits</span>
  <a class="BpkLink_bpk-link__MmQwY PlaceCard_descriptionLink__NzYwN" href="/flights/del/doha/" data-testid="flights-link">
    <div class="PriceDescription_container__NjEzM">
      <span class="BpkText_bpk-text--heading-5__MTRjZ">₹17,559</span>
    </div>
  </a>
 </div>
 </html>
 Generated Schema:
 {
  "name": "Skyscanner Place Cards",
  "baseSelector": "div[class^='PlaceCard_descriptionContainer__']",
  "baseFields": [
    {"name": "data_testid", "type": "attribute", "attribute": "data-testid"}
  ],
  "fields": [
    {
      "name": "city_name",
      "selector": "div[class^='PlaceCard_nameContent__'] .BpkText_bpk-text--heading-4__",
      "type": "text"
    },
    {
      "name": "country_name",
      "selector": "span[class*='PlaceCard_subName__']",
      "type": "text"
    },
    {
      "name": "description",
      "selector": "span[class*='PlaceCard_advertLabel__']",
      "type": "text"
    },
    {
      "name": "flight_price",
      "selector": "a[data-testid='flights-link'] .BpkText_bpk-text--heading-5__",
      "type": "text"
    },
    {
      "name": "flight_url",
      "selector": "a[data-testid='flights-link']",
      "type": "attribute",
      "attribute": "href"
    }
  ]
 }
 </examples>
 <output_requirements>
 Your output must:
 1. Be valid JSON only
 2. Include no explanatory text
 3. Follow the exact schema structure provided
 4. Use appropriate field types
 5. Include all required fields
 6. Use valid CSS selectors
 </output_requirements>
 """
 JSON_SCHEMA_BUILDER_XPATH = """
 # HTML Schema Generation Instructions
 You are a specialized model designed to analyze HTML patterns and generate extraction schemas. Your primary job is to create structured JSON schemas that can be used to extract data from HTML in a consistent and reliable way. When presented with HTML content, you must analyze its structure and generate a schema that captures all relevant data points.
 ## Your Core Responsibilities:
 1. Analyze HTML structure to identify repeating patterns and important data points
 2. Generate valid JSON schemas following the specified format
 3. Create appropriate XPath selectors that will work reliably for data extraction
 4. Name fields meaningfully based on their content and purpose
 5. Handle both specific user requests and autonomous pattern detection
 ## Available Schema Types You Can Generate:
 <schema_types>
 1. Basic Single-Level Schema
  - Use for simple, flat data structures
  - Example: Product cards, user profiles
  - Direct field extractions
 2. Nested Object Schema
  - Use for hierarchical data
  - Example: Articles with author details
  - Contains objects within objects
 3. List Schema
  - Use for repeating elements
  - Example: Comment sections, product lists
  - Handles arrays of similar items
 4. Complex Nested Lists
  - Use for multi-level data
  - Example: Categories with subcategories
  - Multiple levels of nesting
 5. Transformation Schema
  - Use for data requiring processing
  - Supports regex and text transformations
  - Special attribute handling
 </schema_types>
 <schema_structure>
 Your output must always be a JSON object with this structure:
 {
 "name": "Descriptive name of the pattern",
 "baseSelector": "XPath selector for the repeating element",
 "fields": [
   {
     "name": "field_name",
     "selector": "XPath selector",
     "type": "text|attribute|nested|list|regex",
     "attribute": "attribute_name",  // Optional
     "transform": "transformation_type",  // Optional
     "pattern": "regex_pattern",  // Optional
     "fields": []  // For nested/list types
   }
 ]
 }
 </schema_structure>
 <type_definitions>
 Available field types:
 - text: Direct text extraction
 - attribute: HTML attribute extraction
 - nested: Object containing other fields
 - list: Array of similar items
 - regex: Pattern-based extraction
 </type_definitions>
 <behavior_rules>
 1. When given a specific query:
  - Focus on extracting requested data points
  - Use most specific selectors possible
  - Include all fields mentioned in the query
 2. When no query is provided:
  - Identify main content areas
  - Extract all meaningful data points
  - Use semantic structure to determine importance
  - Include prices, dates, titles, and other common data types
 3. Always:
  - Use reliable XPath selectors
  - Handle dynamic element IDs appropriately
  - Create descriptive field names
  - Follow consistent naming conventions
 </behavior_rules>
 <examples>
 1. Basic Product Card Example:
 <html>
 <div class="product-card" data-cat-id="electronics" data-subcat-id="laptops">
 <h2 class="product-title">Gaming Laptop</h2>
 <span class="price">$999.99</span>
 <img src="laptop.jpg" alt="Gaming Laptop">
 </div>
 </html>
 Generated Schema:
 {
 "name": "Product Cards",
 "baseSelector": "//div[@class='product-card']",
 "baseFields": [
   {"name": "data_cat_id", "type": "attribute", "attribute": "data-cat-id"},
   {"name": "data_subcat_id", "type": "attribute", "attribute": "data-subcat-id"}
 ],
 "fields": [
   {
     "name": "title",
     "selector": ".//h2[@class='product-title']",
     "type": "text"
   },
   {
     "name": "price",
     "selector": ".//span[@class='price']",
     "type": "text"
   },
   {
     "name": "image_url",
     "selector": ".//img",
     "type": "attribute",
     "attribute": "src"
   }
 ]
 }
 2. Article with Author Details Example:
 <html>
 <article>
 <h1>The Future of AI</h1>
 <div class="author-info">
   <span class="author-name">Dr. Smith</span>
   <img src="author.jpg" alt="Dr. Smith">
 </div>
 </article>
 </html>
 Generated Schema:
 {
 "name": "Article Details",
 "baseSelector": "//article",
 "fields": [
   {
     "name": "title",
     "selector": ".//h1",
     "type": "text"
   },
   {
     "name": "author",
     "type": "nested",
     "selector": ".//div[@class='author-info']",
     "fields": [
       {
         "name": "name",
         "selector": ".//span[@class='author-name']",
         "type": "text"
       },
       {
         "name": "avatar",
         "selector": ".//img",
         "type": "attribute",
         "attribute": "src"
       }
     ]
   }
 ]
 }
 3. Comments Section Example:
 <html>
 <div class="comments-container">
 <div class="comment" data-user-id="123">
   <div class="user-name">John123</div>
   <p class="comment-text">Great article!</p>
 </div>
 <div class="comment" data-user-id="456">
   <div class="user-name">Alice456</div>
   <p class="comment-text">Thanks for sharing.</p>
 </div>
 </div>
 </html>
 Generated Schema:
 {
 "name": "Comment Section",
 "baseSelector": "//div[@class='comments-container']",
 "fields": [
   {
     "name": "comments",
     "type": "list",
     "selector": ".//div[@class='comment']",
     "baseFields": [
       {"name": "data_user_id", "type": "attribute", "attribute": "data-user-id"}
     ],
     "fields": [
       {
         "name": "user",
         "selector": ".//div[@class='user-name']",
         "type": "text"
       },
       {
         "name": "content",
         "selector": ".//p[@class='comment-text']",
         "type": "text"
       }
     ]
   }
 ]
 }
 4. E-commerce Categories Example:
 <html>
 <div class="category-section" data-category="electronics">
 <h2>Electronics</h2>
 <div class="subcategory">
   <h3>Laptops</h3>
   <div class="product">
     <span class="product-name">MacBook Pro</span>
     <span class="price">$1299</span>
   </div>
   <div class="product">
     <span class="product-name">Dell XPS</span>
     <span class="price">$999</span>
   </div>
 </div>
 </div>
 </html>
 Generated Schema:
 {
 "name": "E-commerce Categories",
 "baseSelector": "//div[@class='category-section']",
 "baseFields": [
   {"name": "data_category", "type": "attribute", "attribute": "data-category"}
 ],
 "fields": [
   {
     "name": "category_name",
     "selector": ".//h2",
     "type": "text"
   },
   {
     "name": "subcategories",
     "type": "nested_list",
     "selector": ".//div[@class='subcategory']",
     "fields": [
       {
         "name": "name",
         "selector": ".//h3",
         "type": "text"
       },
       {
         "name": "products",
         "type": "list",
         "selector": ".//div[@class='product']",
         "fields": [
           {
             "name": "name",
             "selector": ".//span[@class='product-name']",
             "type": "text"
           },
           {
             "name": "price",
             "selector": ".//span[@class='price']",
             "type": "text"
           }
         ]
       }
     ]
   }
 ]
 }
 5. Job Listings with Transformations Example:
 <html>
 <div class="job-post">
 <h3 class="job-title">Senior Developer</h3>
 <span class="salary-text">Salary: $120,000/year</span>
 <span class="location">  New York, NY  </span>
 </div>
 </html>
 Generated Schema:
 {
 "name": "Job Listings",
 "baseSelector": "//div[@class='job-post']",
 "fields": [
   {
     "name": "title",
     "selector": ".//h3[@class='job-title']",
     "type": "text",
     "transform": "uppercase"
   },
   {
     "name": "salary",
     "selector": ".//span[@class='salary-text']",
     "type": "regex",
     "pattern": "\\$([\\d,]+)"
   },
   {
     "name": "location",
     "selector": ".//span[@class='location']",
     "type": "text",
     "transform": "strip"
   }
 ]
 }
 6. Skyscanner Place Card Example:
 <html>
 <div class="PlaceCard_descriptionContainer__M2NjN" data-testid="description-container">
 <div class="PlaceCard_nameContainer__ZjZmY" tabindex="0" role="link">
   <div class="PlaceCard_nameContent__ODUwZ">
     <span class="BpkText_bpk-text__MjhhY BpkText_bpk-text--heading-4__Y2FlY">Doha</span>
   </div>
   <span class="BpkText_bpk-text__MjhhY BpkText_bpk-text--heading-4__Y2FlY PlaceCard_subName__NTVkY">Qatar</span>
 </div>
 <span class="PlaceCard_advertLabel__YTM0N">Sunny days and the warmest welcome awaits</span>
 <a class="BpkLink_bpk-link__MmQwY PlaceCard_descriptionLink__NzYwN" href="/flights/del/doha/" data-testid="flights-link">
   <div class="PriceDescription_container__NjEzM">
     <span class="BpkText_bpk-text--heading-5__MTRjZ">₹17,559</span>
   </div>
 </a>
 </div>
 </html>
 Generated Schema:
 {
 "name": "Skyscanner Place Cards",
 "baseSelector": "//div[contains(@class, 'PlaceCard_descriptionContainer__')]",
 "baseFields": [
   {"name": "data_testid", "type": "attribute", "attribute": "data-testid"}
 ],
 "fields": [
   {
     "name": "city_name",
     "selector": ".//div[contains(@class, 'PlaceCard_nameContent__')]//span[contains(@class, 'BpkText_bpk-text--heading-4__')]",
     "type": "text"
   },
   {
     "name": "country_name",
     "selector": ".//span[contains(@class, 'PlaceCard_subName__')]",
     "type": "text"
   },
   {
     "name": "description",
     "selector": ".//span[contains(@class, 'PlaceCard_advertLabel__')]",
     "type": "text"
   },
   {
     "name": "flight_price",
     "selector": ".//a[@data-testid='flights-link']//span[contains(@class, 'BpkText_bpk-text--heading-5__')]",
     "type": "text"
   },
   {
     "name": "flight_url",
     "selector": ".//a[@data-testid='flights-link']",
     "type": "attribute",
     "attribute": "href"
   }
 ]
 }
 </examples>
 <output_requirements>
 Your output must:
 1. Be valid JSON only
 2. Include no explanatory text
 3. Follow the exact schema structure provided
 4. Use appropriate field types
 5. Include all required fields
 6. Use valid XPath selectors
 </output_requirements>
 """
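A minimal sketch of how a schema shaped like the Product Card example in the prompt could be applied to markup. It is not crawl4ai's extraction strategy: `apply_schema` is a hypothetical helper, the standard-library `xml.etree.ElementTree` stands in for a real HTML/XPath engine, and the sample markup is assumed to be well-formed XHTML (note the self-closed `<img/>`). Only the `text` and `attribute` field types are handled.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample modelled on the Product Card example.
SAMPLE = """<html><body>
<div class="product-card" data-cat-id="electronics">
  <h2 class="product-title">Gaming Laptop</h2>
  <span class="price">$999.99</span>
  <img src="laptop.jpg" alt="Gaming Laptop"/>
</div>
</body></html>"""

SCHEMA = {
    "name": "Product Cards",
    "baseSelector": "//div[@class='product-card']",
    "baseFields": [
        {"name": "data_cat_id", "type": "attribute", "attribute": "data-cat-id"}
    ],
    "fields": [
        {"name": "title", "selector": ".//h2[@class='product-title']", "type": "text"},
        {"name": "price", "selector": ".//span[@class='price']", "type": "text"},
        {"name": "image_url", "selector": ".//img", "type": "attribute", "attribute": "src"},
    ],
}


def apply_schema(markup, schema):
    """Hypothetical helper: return one dict per element matched by baseSelector."""
    root = ET.fromstring(markup)
    base = schema["baseSelector"]
    # ElementTree only accepts relative paths, so "//x" is rewritten to ".//x".
    if base.startswith("//"):
        base = "." + base
    rows = []
    for node in root.findall(base):
        row = {}
        # baseFields read attributes from the matched element itself.
        for field in schema.get("baseFields", []):
            row[field["name"]] = node.get(field["attribute"])
        for field in schema["fields"]:
            target = node.find(field["selector"])
            if target is None:
                row[field["name"]] = None
            elif field["type"] == "attribute":
                row[field["name"]] = target.get(field["attribute"])
            else:  # treat everything else as plain text in this sketch
                row[field["name"]] = (target.text or "").strip()
        rows.append(row)
    return rows


rows = apply_schema(SAMPLE, SCHEMA)
```

Here `rows` holds a single dictionary with the keys `data_cat_id`, `title`, `price`, and `image_url`. Nested, list, and regex fields would need recursion and pattern handling that this sketch leaves out.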
--- a/crawl4ai/ssl_certificate.py
+++ b/crawl4ai/ssl_certificate.py
@@ -0,0 +1,184 @@
 """SSL Certificate class for handling certificate operations."""
 import ssl
 import socket
 import base64
 import json
 from typing import Dict, Any, Optional
 from urllib.parse import urlparse
 import OpenSSL.crypto
 from pathlib import Path
 class SSLCertificate:
    """
    A class representing an SSL certificate with methods to export in various formats.
    Attributes:
        _cert_info (Dict[str, Any]): The decoded certificate information.
    Methods:
        from_url(url: str, timeout: int = 10) -> Optional['SSLCertificate']: Create an SSLCertificate instance from a URL.
        to_json(filepath: Optional[str] = None) -> Optional[str]: Export the certificate as JSON.
        to_pem(filepath: Optional[str] = None) -> Optional[str]: Export the certificate as PEM.
        to_der(filepath: Optional[str] = None) -> Optional[bytes]: Export the certificate as DER.
    Properties:
        issuer, subject, valid_from, valid_until, fingerprint: Read-only views of the certificate fields.
    """
    def __init__(self, cert_info: Dict[str, Any]):
        self._cert_info = self._decode_cert_data(cert_info)
    @staticmethod
    def from_url(url: str, timeout: int = 10) -> Optional["SSLCertificate"]:
        """
        Create SSLCertificate instance from a URL.
        Args:
            url (str): URL of the website.
            timeout (int): Timeout for the connection (default: 10).
        Returns:
            Optional[SSLCertificate]: SSLCertificate instance if successful, None otherwise.
        """
        try:
            hostname = urlparse(url).netloc
            if ":" in hostname:
                hostname = hostname.split(":")[0]
            context = ssl.create_default_context()
            with socket.create_connection((hostname, 443), timeout=timeout) as sock:
                with context.wrap_socket(sock, server_hostname=hostname) as ssock:
                    cert_binary = ssock.getpeercert(binary_form=True)
                    x509 = OpenSSL.crypto.load_certificate(
                        OpenSSL.crypto.FILETYPE_ASN1, cert_binary
                    )
                    cert_info = {
                        "subject": dict(x509.get_subject().get_components()),
                        "issuer": dict(x509.get_issuer().get_components()),
                        "version": x509.get_version(),
                        "serial_number": hex(x509.get_serial_number()),
                        "not_before": x509.get_notBefore(),
                        "not_after": x509.get_notAfter(),
                        "fingerprint": x509.digest("sha256").hex(),
                        "signature_algorithm": x509.get_signature_algorithm(),
                        "raw_cert": base64.b64encode(cert_binary),
                    }
                    # Add extensions
                    extensions = []
                    for i in range(x509.get_extension_count()):
                        ext = x509.get_extension(i)
                        extensions.append(
                            {"name": ext.get_short_name(), "value": str(ext)}
                        )
                    cert_info["extensions"] = extensions
                    return SSLCertificate(cert_info)
        except Exception:
            return None
    @staticmethod
    def _decode_cert_data(data: Any) -> Any:
        """Helper method to decode bytes in certificate data."""
        if isinstance(data, bytes):
            return data.decode("utf-8")
        elif isinstance(data, dict):
            return {
                (
                    k.decode("utf-8") if isinstance(k, bytes) else k
                ): SSLCertificate._decode_cert_data(v)
                for k, v in data.items()
            }
        elif isinstance(data, list):
            return [SSLCertificate._decode_cert_data(item) for item in data]
        return data
    def to_json(self, filepath: Optional[str] = None) -> Optional[str]:
        """
        Export certificate as JSON.
        Args:
            filepath (Optional[str]): Path to save the JSON file (default: None).
        Returns:
            Optional[str]: JSON string, or None if the JSON was written to `filepath`.
        """
        json_str = json.dumps(self._cert_info, indent=2, ensure_ascii=False)
        if filepath:
            Path(filepath).write_text(json_str, encoding="utf-8")
            return None
        return json_str
    def to_pem(self, filepath: Optional[str] = None) -> Optional[str]:
        """
        Export certificate as PEM.
        Args:
            filepath (Optional[str]): Path to save the PEM file (default: None).
        Returns:
            Optional[str]: PEM string, or None if the PEM was written to `filepath` or the export failed.
        """
        try:
            x509 = OpenSSL.crypto.load_certificate(
                OpenSSL.crypto.FILETYPE_ASN1,
                base64.b64decode(self._cert_info["raw_cert"]),
            )
            pem_data = OpenSSL.crypto.dump_certificate(
                OpenSSL.crypto.FILETYPE_PEM, x509
            ).decode("utf-8")
            if filepath:
                Path(filepath).write_text(pem_data, encoding="utf-8")
                return None
            return pem_data
        except Exception:
            return None
    def to_der(self, filepath: Optional[str] = None) -> Optional[bytes]:
        """
        Export certificate as DER.
        Args:
            filepath (Optional[str]): Path to save the DER file (default: None).
        Returns:
            Optional[bytes]: DER bytes, or None if the DER was written to `filepath` or the export failed.
        """
        try:
            der_data = base64.b64decode(self._cert_info["raw_cert"])
            if filepath:
                Path(filepath).write_bytes(der_data)
                return None
            return der_data
        except Exception:
            return None
    @property
    def issuer(self) -> Dict[str, str]:
        """Get certificate issuer information."""
        return self._cert_info.get("issuer", {})
    @property
    def subject(self) -> Dict[str, str]:
        """Get certificate subject information."""
        return self._cert_info.get("subject", {})
    @property
    def valid_from(self) -> str:
        """Get certificate validity start date."""
        return self._cert_info.get("not_before", "")
    @property
    def valid_until(self) -> str:
        """Get certificate validity end date."""
        return self._cert_info.get("not_after", "")
    @property
    def fingerprint(self) -> str:
        """Get certificate fingerprint."""
        return self._cert_info.get("fingerprint", "")
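The byte decoding done by `_decode_cert_data` can be illustrated without a network connection or pyOpenSSL. The sketch below copies that recursion into a standalone function and adds a hypothetical `parse_asn1_time` helper (not part of the class). It assumes the validity fields arrive in the `YYYYMMDDhhmmssZ` form that pyOpenSSL documents for `get_notBefore()` and `get_notAfter()`. The sample dictionary is made up.

```python
from datetime import datetime, timezone
from typing import Any


def decode_cert_data(data: Any) -> Any:
    """Recursively turn bytes keys and values into str (mirrors _decode_cert_data)."""
    if isinstance(data, bytes):
        return data.decode("utf-8")
    if isinstance(data, dict):
        return {
            (k.decode("utf-8") if isinstance(k, bytes) else k): decode_cert_data(v)
            for k, v in data.items()
        }
    if isinstance(data, list):
        return [decode_cert_data(item) for item in data]
    return data


def parse_asn1_time(value: str) -> datetime:
    """Hypothetical helper: parse 'YYYYMMDDhhmmssZ' into an aware UTC datetime."""
    return datetime.strptime(value, "%Y%m%d%H%M%SZ").replace(tzinfo=timezone.utc)


# Made-up sample shaped like the dict assembled in from_url.
raw = {
    "subject": {b"CN": b"example.com"},
    "not_after": b"20260101120000Z",
    "extensions": [{"name": b"basicConstraints", "value": "CA:FALSE"}],
}

info = decode_cert_data(raw)
expires = parse_asn1_time(info["not_after"])
```

After decoding, every key and value in `info` is a `str`, which is what lets `to_json` serialize the dictionary directly.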
--- a/crawl4ai/user_agent_generator.py
+++ b/crawl4ai/user_agent_generator.py
@@ -4,6 +4,35 @@ import re
 class UserAgentGenerator:
    """
    Generate random user agents with specified constraints.
    Attributes:
        desktop_platforms (dict): A dictionary of possible desktop platforms and their corresponding user agent strings.
        mobile_platforms (dict): A dictionary of possible mobile platforms and their corresponding user agent strings.
        browser_combinations (dict): A dictionary of possible browser combinations and their corresponding user agent strings.
        rendering_engines (dict): A dictionary of possible rendering engines and their corresponding user agent strings.
        chrome_versions (list): A list of possible Chrome browser versions.
        firefox_versions (list): A list of possible Firefox browser versions.
        edge_versions (list): A list of possible Edge browser versions.
        safari_versions (list): A list of possible Safari browser versions.
        ios_versions (list): A list of possible iOS browser versions.
        android_versions (list): A list of possible Android browser versions.
    Methods:
        generate(
            device_type: Optional[Literal["desktop", "mobile"]] = None,
            os_type: Optional[str] = None,
            device_brand: Optional[str] = None,
            browser_type: Optional[Literal["chrome", "edge", "safari", "firefox"]] = None,
            num_browsers: int = 3
        ) -> str: Generates a random user agent string based on the specified constraints.
    """
    def __init__(self):
        # Previous platform definitions remain the same...
        self.desktop_platforms = {
@@ -19,7 +48,7 @@ class UserAgentGenerator:
                "generic": "(X11; Linux x86_64)",
                "ubuntu": "(X11; Ubuntu; Linux x86_64)",
                "chrome_os": "(X11; CrOS x86_64 14541.0.0)",
-            }
+            },
        }
        self.mobile_platforms = {
@@ -32,26 +61,14 @@ class UserAgentGenerator:
            "ios": {
                "iphone": "(iPhone; CPU iPhone OS 16_5 like Mac OS X)",
                "ipad": "(iPad; CPU OS 16_5 like Mac OS X)",
-            }
+            },
        }
        # Browser Combinations
        self.browser_combinations = {
-            1: [
-                ["chrome"],
-                ["firefox"],
-                ["safari"],
-                ["edge"]
-            ],
-            2: [
-                ["gecko", "firefox"],
-                ["chrome", "safari"],
-                ["webkit", "safari"]
-            ],
-            3: [
-                ["chrome", "safari", "edge"],
-                ["webkit", "chrome", "safari"]
-            ]
+            1: [["chrome"], ["firefox"], ["safari"], ["edge"]],
+            2: [["gecko", "firefox"], ["chrome", "safari"], ["webkit", "safari"]],
+            3: [["chrome", "safari", "edge"], ["webkit", "chrome", "safari"]],
        }
        # Rendering Engines with versions
@@ -62,7 +79,7 @@ class UserAgentGenerator:
                "Gecko/20100101",
                "Gecko/20100101",  # Firefox usually uses this constant version
                "Gecko/2010010",
-            ]
+            ],
        }
        # Browser Versions
@@ -105,7 +122,21 @@ class UserAgentGenerator:
        ]
    def get_browser_stack(self, num_browsers: int = 1) -> List[str]:
-        """Get a valid combination of browser versions"""
+        """
+        Get a valid combination of browser versions.
+        How it works:
+        1. Check if the number of browsers is supported.
+        2. Randomly choose a combination of browsers.
+        3. Iterate through the combination and add browser versions.
+        4. Return the browser stack.
+        Args:
+            num_browsers: Number of browser specifications (1-3)
+        Returns:
+            List[str]: A list of browser versions.
+        """
        if num_browsers not in self.browser_combinations:
            raise ValueError(f"Unsupported number of browsers: {num_browsers}")
@@ -128,12 +159,14 @@ class UserAgentGenerator:
        return browser_stack
-    def generate(self, 
-                device_type: Optional[Literal['desktop', 'mobile']] = None,
-                os_type: Optional[str] = None,
-                device_brand: Optional[str] = None,
-                browser_type: Optional[Literal['chrome', 'edge', 'safari', 'firefox']] = None,
-                num_browsers: int = 3) -> str:
+    def generate(
+        self,
+        device_type: Optional[Literal["desktop", "mobile"]] = None,
+        os_type: Optional[str] = None,
+        device_brand: Optional[str] = None,
+        browser_type: Optional[Literal["chrome", "edge", "safari", "firefox"]] = None,
+        num_browsers: int = 3,
+    ) -> str:
        """
        Generate a random user agent with specified constraints.
@@ -173,9 +206,13 @@ class UserAgentGenerator:
    def get_random_platform(self, device_type, os_type, device_brand):
        """Helper method to get random platform based on constraints"""
-        platforms = self.desktop_platforms if device_type == 'desktop' else \
-                   self.mobile_platforms if device_type == 'mobile' else \
-                   {**self.desktop_platforms, **self.mobile_platforms}
+        platforms = (
+            self.desktop_platforms
+            if device_type == "desktop"
+            else self.mobile_platforms
+            if device_type == "mobile"
+            else {**self.desktop_platforms, **self.mobile_platforms}
+        )
        if os_type:
            for platform_group in [self.desktop_platforms, self.mobile_platforms]:
@@ -191,10 +228,10 @@ class UserAgentGenerator:
    def parse_user_agent(self, user_agent: str) -> Dict[str, str]:
        """Parse a user agent string to extract browser and version information"""
        browsers = {
-            'chrome': r'Chrome/(\d+)',
+            "chrome": r"Chrome/(\d+)",
-            'edge': r'Edg/(\d+)',
+            "edge": r"Edg/(\d+)",
-            'safari': r'Version/(\d+)',
+            "safari": r"Version/(\d+)",
-            'firefox': r'Firefox/(\d+)'
+            "firefox": r"Firefox/(\d+)",
        }
        result = {}
@@ -213,25 +250,26 @@ class UserAgentGenerator:
        hints = []
        # Handle different browser combinations
-        if 'chrome' in browsers:
+        if "chrome" in browsers:
            hints.append(f'"Chromium";v="{browsers["chrome"]}"')
            hints.append('"Not_A Brand";v="8"')
-            if 'edge' in browsers:
+            if "edge" in browsers:
                hints.append(f'"Microsoft Edge";v="{browsers["edge"]}"')
            else:
                hints.append(f'"Google Chrome";v="{browsers["chrome"]}"')
-        elif 'firefox' in browsers:
+        elif "firefox" in browsers:
            # Firefox doesn't typically send Sec-CH-UA
            return '""'
-        elif 'safari' in browsers:
+        elif "safari" in browsers:
            # Safari's format for client hints
            hints.append(f'"Safari";v="{browsers["safari"]}"')
            hints.append('"Not_A Brand";v="8"')
-        return ', '.join(hints)
+        return ", ".join(hints)
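The parsing and client-hint steps above can be exercised on their own. This is a sketch under stated assumptions rather than the class itself: the regexes and the hint-building branches are copied from the diff, while the matching loop inside `parse_user_agent` (not visible in the hunk) and the function name `client_hints` are assumptions, and the sample user agent string is made up.

```python
import re
from typing import Dict

# Patterns copied from parse_user_agent in the diff.
BROWSER_PATTERNS = {
    "chrome": r"Chrome/(\d+)",
    "edge": r"Edg/(\d+)",
    "safari": r"Version/(\d+)",
    "firefox": r"Firefox/(\d+)",
}


def parse_user_agent(user_agent: str) -> Dict[str, str]:
    """Return the major version of each browser token found (loop body is assumed)."""
    result = {}
    for name, pattern in BROWSER_PATTERNS.items():
        match = re.search(pattern, user_agent)
        if match:
            result[name] = match.group(1)
    return result


def client_hints(browsers: Dict[str, str]) -> str:
    """Build a Sec-CH-UA style value, following the branches shown in the diff."""
    hints = []
    if "chrome" in browsers:
        hints.append(f'"Chromium";v="{browsers["chrome"]}"')
        hints.append('"Not_A Brand";v="8"')
        if "edge" in browsers:
            hints.append(f'"Microsoft Edge";v="{browsers["edge"]}"')
        else:
            hints.append(f'"Google Chrome";v="{browsers["chrome"]}"')
    elif "firefox" in browsers:
        # Firefox doesn't typically send Sec-CH-UA.
        return '""'
    elif "safari" in browsers:
        hints.append(f'"Safari";v="{browsers["safari"]}"')
        hints.append('"Not_A Brand";v="8"')
    return ", ".join(hints)


# Made-up Edge-on-Windows user agent for illustration.
ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0"
)
browsers = parse_user_agent(ua)
hint_header = client_hints(browsers)
```

Because the `safari` pattern looks for `Version/`, a Chromium user agent that merely carries a `Safari/537.36` token is not reported as Safari.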
 # Example usage:
 if __name__ == "__main__":
@@ -239,7 +277,7 @@ if __name__ == "__main__":
    print(generator.generate())
    print("\nSingle browser (Chrome):")
-    print(generator.generate(num_browsers=1, browser_type='chrome'))
+    print(generator.generate(num_browsers=1, browser_type="chrome"))
    print("\nTwo browsers (Gecko/Firefox):")
    print(generator.generate(num_browsers=2))
@@ -248,16 +286,14 @@ if __name__ == "__main__":
    print(generator.generate(num_browsers=3))
    print("\nFirefox on Linux:")
-    print(generator.generate(
-        device_type='desktop',
-        os_type='linux',
-        browser_type='firefox',
-        num_browsers=2
-    ))
+    print(
+        generator.generate(
+            device_type="desktop",
+            os_type="linux",
+            browser_type="firefox",
+            num_browsers=2,
+        )
+    )
    print("\nChrome/Safari/Edge on Windows:")
-    print(generator.generate(
-        device_type='desktop',
-        os_type='windows',
-        num_browsers=3
-    ))
+    print(generator.generate(device_type="desktop", os_type="windows", num_browsers=3))
--- a/crawl4ai/utils.py
+++ b/crawl4ai/utils.py
--- a/crawl4ai/version_manager.py
+++ b/crawl4ai/version_manager.py
@@ -1,9 +1,9 @@
 # version_manager.py
 import os
 from pathlib import Path
 from packaging import version
 from . import __version__
 class VersionManager:
    def __init__(self):
        self.home_dir = Path.home() / ".crawl4ai"
@@ -27,4 +27,3 @@ class VersionManager:
        installed = self.get_installed_version()
        current = version.parse(__version__.__version__)
        return installed is None or installed < current
--- a/crawl4ai/web_crawler.py
+++ b/crawl4ai/web_crawler.py
@@ -1,9 +1,10 @@
 import os, time
 os.environ["TOKENIZERS_PARALLELISM"] = "false"
 from pathlib import Path
 from .models import UrlModel, CrawlResult
-from .database import init_db, get_cached_url, cache_url, DB_PATH, flush_db
+from .database import init_db, get_cached_url, cache_url
 from .utils import *
 from .chunking_strategy import *
 from .extraction_strategy import *
@@ -14,14 +15,27 @@ from .content_scraping_strategy import WebScrapingStrategy
 from .config import *
 import warnings
 import json
-warnings.filterwarnings("ignore", message='Field "model_name" has conflict with protected namespace "model_".')
+
+warnings.filterwarnings(
+    "ignore",
+    message='Field "model_name" has conflict with protected namespace "model_".',
+)
 class WebCrawler:
-    def __init__(self, crawler_strategy: CrawlerStrategy = None, always_by_pass_cache: bool = False, verbose: bool = False):
-        self.crawler_strategy = crawler_strategy or LocalSeleniumCrawlerStrategy(verbose=verbose)
+    def __init__(
+        self,
+        crawler_strategy: CrawlerStrategy = None,
+        always_by_pass_cache: bool = False,
+        verbose: bool = False,
+    ):
+        self.crawler_strategy = crawler_strategy or LocalSeleniumCrawlerStrategy(
+            verbose=verbose
+        )
        self.always_by_pass_cache = always_by_pass_cache
-        self.crawl4ai_folder = os.path.join(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai")
+        self.crawl4ai_folder = os.path.join(
+            os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai"
+        )
        os.makedirs(self.crawl4ai_folder, exist_ok=True)
        os.makedirs(f"{self.crawl4ai_folder}/cache", exist_ok=True)
        init_db()
@@ -30,11 +44,11 @@ class WebCrawler:
    def warmup(self):
        print("[LOG] 🌤️  Warming up the WebCrawler")
        self.run(
-            url='https://google.com/',
+            url="https://google.com/",
            word_count_threshold=5,
            extraction_strategy=NoExtractionStrategy(),
            bypass_cache=False,
-            verbose=False
+            verbose=False,
        )
        self.ready = True
        print("[LOG] 🌞 WebCrawler is ready to crawl")
@@ -80,6 +94,7 @@ class WebCrawler:
        **kwargs,
    ) -> List[CrawlResult]:
        extraction_strategy = extraction_strategy or NoExtractionStrategy()
        def fetch_page_wrapper(url_model, *args, **kwargs):
            return self.fetch_page(url_model, *args, **kwargs)
@@ -150,12 +165,25 @@ class WebCrawler:
                html = sanitize_input_encode(self.crawler_strategy.crawl(url, **kwargs))
                t2 = time.time()
                if verbose:
-                        print(f"[LOG] 🚀 Crawling done for {url}, success: {bool(html)}, time taken: {t2 - t1:.2f} seconds")
+                    print(
+                        f"[LOG] 🚀 Crawling done for {url}, success: {bool(html)}, time taken: {t2 - t1:.2f} seconds"
+                    )
                if screenshot:
                    screenshot_data = self.crawler_strategy.take_screenshot()
-                
-                crawl_result = self.process_html(url, html, extracted_content, word_count_threshold, extraction_strategy, chunking_strategy, css_selector, screenshot_data, verbose, bool(cached), **kwargs)
+            crawl_result = self.process_html(
+                url,
+                html,
+                extracted_content,
+                word_count_threshold,
+                extraction_strategy,
+                chunking_strategy,
+                css_selector,
+                screenshot_data,
+                verbose,
+                bool(cached),
+                **kwargs,
+            )
            crawl_result.success = bool(html)
            return crawl_result
        except Exception as e:
@@ -183,7 +211,11 @@ class WebCrawler:
        try:
            t1 = time.time()
            scrapping_strategy = WebScrapingStrategy()
-                extra_params = {k: v for k, v in kwargs.items() if k not in ["only_text", "image_description_min_word_threshold"]}
+            extra_params = {
+                k: v
+                for k, v in kwargs.items()
+                if k not in ["only_text", "image_description_min_word_threshold"]
+            }
            result = scrapping_strategy.scrap(
                url,
                html,
@@ -191,14 +223,17 @@ class WebCrawler:
                css_selector=css_selector,
                only_text=kwargs.get("only_text", False),
                image_description_min_word_threshold=kwargs.get(
-                        "image_description_min_word_threshold", IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD
+                    "image_description_min_word_threshold",
+                    IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
                ),
                **extra_params,
            )
            # result = get_content_of_website_optimized(url, html, word_count_threshold, css_selector=css_selector, only_text=kwargs.get("only_text", False))
            if verbose:
-                    print(f"[LOG] 🚀 Content extracted for {url}, success: True, time taken: {time.time() - t1:.2f} seconds")
+                print(
+                    f"[LOG] 🚀 Content extracted for {url}, success: True, time taken: {time.time() - t1:.2f} seconds"
+                )
            if result is None:
                raise ValueError(f"Failed to extract content from the website: {url}")
@@ -213,14 +248,20 @@ class WebCrawler:
        if extracted_content is None:
            if verbose:
-                    print(f"[LOG] 🔥 Extracting semantic blocks for {url}, Strategy: {extraction_strategy.name}")
+                print(
+                    f"[LOG] 🔥 Extracting semantic blocks for {url}, Strategy: {extraction_strategy.name}"
+                )
            sections = chunking_strategy.chunk(markdown)
            extracted_content = extraction_strategy.run(url, sections)
-                extracted_content = json.dumps(extracted_content, indent=4, default=str, ensure_ascii=False)
+            extracted_content = json.dumps(
+                extracted_content, indent=4, default=str, ensure_ascii=False
+            )
            if verbose:
-                    print(f"[LOG] 🚀 Extraction done for {url}, time taken: {time.time() - t:.2f} seconds.")
+                print(
+                    f"[LOG] 🚀 Extraction done for {url}, time taken: {time.time() - t:.2f} seconds."
+                )
        screenshot = None if not screenshot else screenshot
--- a/docs/deprecated/docker-deployment.md
+++ b/docs/deprecated/docker-deployment.md
@@ -0,0 +1,189 @@
 # 🐳 Using Docker (Legacy)
 Crawl4AI is available as Docker images for easy deployment. You can either pull directly from Docker Hub (recommended) or build from the repository.
 ---
 <details>
 <summary>🐳 <strong>Option 1: Docker Hub (Recommended)</strong></summary>
 Choose the appropriate image based on your platform and needs:
 ### For AMD64 (Regular Linux/Windows):
 ```bash
 # Basic version (recommended)
 docker pull unclecode/crawl4ai:basic-amd64
 docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64
 # Full ML/LLM support
 docker pull unclecode/crawl4ai:all-amd64
 docker run -p 11235:11235 unclecode/crawl4ai:all-amd64
 # With GPU support
 docker pull unclecode/crawl4ai:gpu-amd64
 docker run -p 11235:11235 unclecode/crawl4ai:gpu-amd64
 ```
 ### For ARM64 (M1/M2 Macs, ARM servers):
 ```bash
 # Basic version (recommended)
 docker pull unclecode/crawl4ai:basic-arm64
 docker run -p 11235:11235 unclecode/crawl4ai:basic-arm64
 # Full ML/LLM support
 docker pull unclecode/crawl4ai:all-arm64
 docker run -p 11235:11235 unclecode/crawl4ai:all-arm64
 # With GPU support
 docker pull unclecode/crawl4ai:gpu-arm64
 docker run -p 11235:11235 unclecode/crawl4ai:gpu-arm64
 ```
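If you script the pull step, the image tag can be derived from the host architecture. The following is a rough sketch: `image_for` is a hypothetical helper, the tag names come from the commands above, and the mapping from `platform.machine()` values to tag suffixes is an assumption that may not cover every host.

```python
import platform
from typing import Optional

# Assumed mapping from platform.machine() values to the tag suffixes used above.
ARCH_SUFFIX = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "arm64": "arm64",
    "aarch64": "arm64",
}

VARIANTS = ("basic", "all", "gpu")


def image_for(variant: str = "basic", machine: Optional[str] = None) -> str:
    """Hypothetical helper: return the Docker Hub image name for this host."""
    if variant not in VARIANTS:
        raise ValueError(f"Unknown variant: {variant!r}")
    machine = (machine or platform.machine()).lower()
    if machine not in ARCH_SUFFIX:
        raise ValueError(f"No published image assumed for architecture: {machine!r}")
    return f"unclecode/crawl4ai:{variant}-{ARCH_SUFFIX[machine]}"
```

For example, `image_for("all", "aarch64")` yields the `all-arm64` tag, which can then be passed to `docker pull`.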
 Need more memory? Add `--shm-size`:
 ```bash
 docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic-amd64
 ```
 Test the installation:
 ```bash
 curl http://localhost:11235/health
 ```
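The container may need a few seconds before `/health` responds, so a script can poll instead of calling it once. A minimal sketch using only the standard library follows. `wait_until_healthy` is a hypothetical helper, and treating any HTTP 200 as "healthy" is an assumption about the endpoint.

```python
import time
import urllib.error
import urllib.request


def wait_until_healthy(
    url: str = "http://localhost:11235/health",
    attempts: int = 10,
    delay: float = 1.0,
) -> bool:
    """Hypothetical helper: poll the health endpoint until it answers with HTTP 200."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, or a non-2xx answer: retry after a pause.
            pass
        time.sleep(delay)
    return False
```

Calling `wait_until_healthy()` right after `docker run` returns `True` once the service answers and `False` if it never does within the given attempts.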
 ### For Raspberry Pi (32-bit) (coming soon):
 ```bash
 # Pull and run basic version (recommended for Raspberry Pi)
 docker pull unclecode/crawl4ai:basic-armv7
 docker run -p 11235:11235 unclecode/crawl4ai:basic-armv7
 # With increased shared memory if needed
 docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic-armv7
 ```
 Note: Due to hardware constraints, only the basic version is recommended for Raspberry Pi.
 </details>
 <details>
 <summary>🐳 <strong>Option 2: Build from Repository</strong></summary>
 Build the image locally based on your platform:
 ```bash
 # Clone the repository
 git clone https://github.com/unclecode/crawl4ai.git
 cd crawl4ai
 # For AMD64 (Regular Linux/Windows)
 docker build --platform linux/amd64 \
  --tag crawl4ai:local \
  --build-arg INSTALL_TYPE=basic \
  .
 # For ARM64 (M1/M2 Macs, ARM servers)
 docker build --platform linux/arm64 \
  --tag crawl4ai:local \
  --build-arg INSTALL_TYPE=basic \
  .
 ```
 Build options:
 - INSTALL_TYPE=basic (default): Basic crawling features
 - INSTALL_TYPE=all: Full ML/LLM support
 - ENABLE_GPU=true: Add GPU support
 Example with all options:
 ```bash
 docker build --platform linux/amd64 \
  --tag crawl4ai:local \
  --build-arg INSTALL_TYPE=all \
  --build-arg ENABLE_GPU=true \
  .
 ```
 Run your local build:
 ```bash
 # Regular run
 docker run -p 11235:11235 crawl4ai:local
 # With increased shared memory
 docker run --shm-size=2gb -p 11235:11235 crawl4ai:local
 ```
 Test the installation:
 ```bash
 curl http://localhost:11235/health
 ```
 </details>
 <details>
 <summary>🐳 <strong>Option 3: Using Docker Compose</strong></summary>
 Docker Compose provides a more structured way to run Crawl4AI, especially when dealing with environment variables and multiple configurations.
 ```bash
 # Clone the repository
 git clone https://github.com/unclecode/crawl4ai.git
 cd crawl4ai
 ```
 ### For AMD64 (Regular Linux/Windows):
 ```bash
 # Build and run locally
 docker-compose --profile local-amd64 up
 # Run from Docker Hub
 VERSION=basic docker-compose --profile hub-amd64 up   # Basic version
 VERSION=all docker-compose --profile hub-amd64 up     # Full ML/LLM support
 VERSION=gpu docker-compose --profile hub-amd64 up     # GPU support
 ```
 ### For ARM64 (M1/M2 Macs, ARM servers):
 ```bash
 # Build and run locally
 docker-compose --profile local-arm64 up
 # Run from Docker Hub
 VERSION=basic docker-compose --profile hub-arm64 up   # Basic version
 VERSION=all docker-compose --profile hub-arm64 up     # Full ML/LLM support
 VERSION=gpu docker-compose --profile hub-arm64 up     # GPU support
 ```
 Environment variables (optional):
 ```bash
 # Create a .env file
 CRAWL4AI_API_TOKEN=your_token
 OPENAI_API_KEY=your_openai_key
 CLAUDE_API_KEY=your_claude_key
 ```
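When `CRAWL4AI_API_TOKEN` is set on the server, clients need to send the same token. The sketch below mirrors the pattern used in `docs/examples/docker_example.py`, which sends the token as a Bearer `Authorization` header; the `auth_headers` helper itself is a hypothetical name, and it assumes the token is also exported in the client's environment (a `.env` file read by Compose is not automatically visible to a separate Python process).

```python
import os


def auth_headers(api_token=None, environ=None):
    """Build request headers, preferring an explicit token over the environment."""
    environ = os.environ if environ is None else environ
    token = api_token or environ.get("CRAWL4AI_API_TOKEN")
    # Send no Authorization header at all when no token is configured.
    return {"Authorization": f"Bearer {token}"} if token else {}
```

The returned dictionary can be passed as the `headers` argument of an HTTP client call against the server on port `11235`.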
 The compose file includes:
 - Memory management (4GB limit, 1GB reserved)
 - Shared memory volume for browser support
 - Health checks
 - Auto-restart policy
 - All necessary port mappings
 Test the installation:
 ```bash
 curl http://localhost:11235/health
 ```
 </details>
 <details>
 <summary>🚀 <strong>One-Click Deployment</strong></summary>
 Deploy your own instance of Crawl4AI with one click:
 [![DigitalOcean Referral Badge](https://web-platforms.sfo2.cdn.digitaloceanspaces.com/WWW/Badge%203.svg)](https://www.digitalocean.com/?repo=https://github.com/unclecode/crawl4ai/tree/0.3.74&refcode=a0780f1bdb3d&utm_campaign=Referral_Invite&utm_medium=Referral_Program&utm_source=badge)
 > 💡 **Recommended specs**: 4GB RAM minimum. Select "professional-xs" or higher when deploying for stable operation.
 The deploy will:
 - Set up a Docker container with Crawl4AI
 - Configure Playwright and all dependencies
 - Start the FastAPI server on port `11235`
 - Set up health checks and auto-deployment
 </details>
--- a/docs/examples/amazon_product_extraction_direct_url.py
+++ b/docs/examples/amazon_product_extraction_direct_url.py
@@ -0,0 +1,110 @@
 """
 This example demonstrates how to use JSON CSS extraction to scrape product information 
 from Amazon search results. It shows how to extract structured data like product titles,
 prices, ratings, and other details using CSS selectors.
 """
 from crawl4ai import AsyncWebCrawler
 from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
 from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
 import json
 async def extract_amazon_products():
    # Initialize browser config
    browser_config = BrowserConfig(browser_type="chromium", headless=True)
    # Initialize crawler config with JSON CSS extraction strategy
    crawler_config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(
            schema={
                "name": "Amazon Product Search Results",
                "baseSelector": "[data-component-type='s-search-result']",
                "fields": [
                    {
                        "name": "asin",
                        "selector": "",
                        "type": "attribute",
                        "attribute": "data-asin",
                    },
                    {"name": "title", "selector": "h2 a span", "type": "text"},
                    {
                        "name": "url",
                        "selector": "h2 a",
                        "type": "attribute",
                        "attribute": "href",
                    },
                    {
                        "name": "image",
                        "selector": ".s-image",
                        "type": "attribute",
                        "attribute": "src",
                    },
                    {
                        "name": "rating",
                        "selector": ".a-icon-star-small .a-icon-alt",
                        "type": "text",
                    },
                    {
                        "name": "reviews_count",
                        "selector": "[data-csa-c-func-deps='aui-da-a-popover'] ~ span span",
                        "type": "text",
                    },
                    {
                        "name": "price",
                        "selector": ".a-price .a-offscreen",
                        "type": "text",
                    },
                    {
                        "name": "original_price",
                        "selector": ".a-price.a-text-price .a-offscreen",
                        "type": "text",
                    },
                    {
                        "name": "sponsored",
                        "selector": ".puis-sponsored-label-text",
                        "type": "exists",
                    },
                    {
                        "name": "delivery_info",
                        "selector": "[data-cy='delivery-recipe'] .a-color-base",
                        "type": "text",
                        "multiple": True,
                    },
                ],
            }
        )
    )
    # Example search URL (you should replace with your actual Amazon URL)
    url = "https://www.amazon.com/s?k=Samsung+Galaxy+Tab"
    # Use context manager for proper resource handling
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Extract the data
        result = await crawler.arun(url=url, config=crawler_config)
        # Process and print the results
        if result and result.extracted_content:
            # Parse the JSON string into a list of products
            products = json.loads(result.extracted_content)
            # Process each product in the list
            for product in products:
                print("\nProduct Details:")
                print(f"ASIN: {product.get('asin')}")
                print(f"Title: {product.get('title')}")
                print(f"Price: {product.get('price')}")
                print(f"Original Price: {product.get('original_price')}")
                print(f"Rating: {product.get('rating')}")
                print(f"Reviews: {product.get('reviews_count')}")
                print(f"Sponsored: {'Yes' if product.get('sponsored') else 'No'}")
                if product.get("delivery_info"):
                    print(f"Delivery: {' '.join(product['delivery_info'])}")
                print("-" * 80)
 if __name__ == "__main__":
    import asyncio
    asyncio.run(extract_amazon_products())
--- a/docs/examples/amazon_product_extraction_using_hooks.py
+++ b/docs/examples/amazon_product_extraction_using_hooks.py
@@ -0,0 +1,150 @@
 """
 This example demonstrates how to use JSON CSS extraction to scrape product information
 from Amazon search results. An after_goto hook types the query into the search box and submits it;
 the crawler then extracts product titles, prices, ratings, and other details using CSS selectors.
 """
 from crawl4ai import AsyncWebCrawler, CacheMode
 from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
 from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
 import json
 from playwright.async_api import Page, BrowserContext
 async def extract_amazon_products():
    # Initialize browser config
    browser_config = BrowserConfig(
        # browser_type="chromium",
        headless=True
    )
    # Initialize crawler config with JSON CSS extraction strategy
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(
            schema={
                "name": "Amazon Product Search Results",
                "baseSelector": "[data-component-type='s-search-result']",
                "fields": [
                    {
                        "name": "asin",
                        "selector": "",
                        "type": "attribute",
                        "attribute": "data-asin",
                    },
                    {"name": "title", "selector": "h2 a span", "type": "text"},
                    {
                        "name": "url",
                        "selector": "h2 a",
                        "type": "attribute",
                        "attribute": "href",
                    },
                    {
                        "name": "image",
                        "selector": ".s-image",
                        "type": "attribute",
                        "attribute": "src",
                    },
                    {
                        "name": "rating",
                        "selector": ".a-icon-star-small .a-icon-alt",
                        "type": "text",
                    },
                    {
                        "name": "reviews_count",
                        "selector": "[data-csa-c-func-deps='aui-da-a-popover'] ~ span span",
                        "type": "text",
                    },
                    {
                        "name": "price",
                        "selector": ".a-price .a-offscreen",
                        "type": "text",
                    },
                    {
                        "name": "original_price",
                        "selector": ".a-price.a-text-price .a-offscreen",
                        "type": "text",
                    },
                    {
                        "name": "sponsored",
                        "selector": ".puis-sponsored-label-text",
                        "type": "exists",
                    },
                    {
                        "name": "delivery_info",
                        "selector": "[data-cy='delivery-recipe'] .a-color-base",
                        "type": "text",
                        "multiple": True,
                    },
                ],
            }
        ),
    )
    url = "https://www.amazon.com/"
    async def after_goto(
        page: Page, context: BrowserContext, url: str, response: dict, **kwargs
    ):
        """Hook called after navigating to each URL"""
        print(f"[HOOK] after_goto - Successfully loaded: {url}")
        try:
            # Wait for search box to be available
            search_box = await page.wait_for_selector(
                "#twotabsearchtextbox", timeout=1000
            )
            # Type the search query
            await search_box.fill("Samsung Galaxy Tab")
            # Get the search button and prepare for navigation
            search_button = await page.wait_for_selector(
                "#nav-search-submit-button", timeout=1000
            )
            # Click with navigation waiting
            await search_button.click()
            # Wait for search results to load
            await page.wait_for_selector(
                '[data-component-type="s-search-result"]', timeout=10000
            )
            print("[HOOK] Search completed and results loaded!")
        except Exception as e:
            print(f"[HOOK] Error during search operation: {str(e)}")
        return page
    # Use context manager for proper resource handling
    async with AsyncWebCrawler(config=browser_config) as crawler:
        crawler.crawler_strategy.set_hook("after_goto", after_goto)
        # Extract the data
        result = await crawler.arun(url=url, config=crawler_config)
        # Process and print the results
        if result and result.extracted_content:
            # Parse the JSON string into a list of products
            products = json.loads(result.extracted_content)
            # Process each product in the list
            for product in products:
                print("\nProduct Details:")
                print(f"ASIN: {product.get('asin')}")
                print(f"Title: {product.get('title')}")
                print(f"Price: {product.get('price')}")
                print(f"Original Price: {product.get('original_price')}")
                print(f"Rating: {product.get('rating')}")
                print(f"Reviews: {product.get('reviews_count')}")
                print(f"Sponsored: {'Yes' if product.get('sponsored') else 'No'}")
                if product.get("delivery_info"):
                    print(f"Delivery: {' '.join(product['delivery_info'])}")
                print("-" * 80)
 if __name__ == "__main__":
    import asyncio
    asyncio.run(extract_amazon_products())
--- a/docs/examples/amazon_product_extraction_using_use_javascript.py
+++ b/docs/examples/amazon_product_extraction_using_use_javascript.py
@@ -0,0 +1,126 @@
 """
 This example demonstrates how to use JSON CSS extraction to scrape product information
 from Amazon search results. JavaScript passed via js_code fills in the search box and submits it;
 the crawler then extracts product titles, prices, ratings, and other details using CSS selectors.
 """
 from crawl4ai import AsyncWebCrawler, CacheMode
 from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
 from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
 import json
 async def extract_amazon_products():
    # Initialize browser config
    browser_config = BrowserConfig(
        # browser_type="chromium",
        headless=True
    )
    js_code_to_search = """
        const task = async () => {
            document.querySelector('#twotabsearchtextbox').value = 'Samsung Galaxy Tab';
            document.querySelector('#nav-search-submit-button').click();
        }
        await task();
    """
    js_code_to_search_sync = """
            document.querySelector('#twotabsearchtextbox').value = 'Samsung Galaxy Tab';
            document.querySelector('#nav-search-submit-button').click();
    """
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        js_code=js_code_to_search,
        wait_for='css:[data-component-type="s-search-result"]',
        extraction_strategy=JsonCssExtractionStrategy(
            schema={
                "name": "Amazon Product Search Results",
                "baseSelector": "[data-component-type='s-search-result']",
                "fields": [
                    {
                        "name": "asin",
                        "selector": "",
                        "type": "attribute",
                        "attribute": "data-asin",
                    },
                    {"name": "title", "selector": "h2 a span", "type": "text"},
                    {
                        "name": "url",
                        "selector": "h2 a",
                        "type": "attribute",
                        "attribute": "href",
                    },
                    {
                        "name": "image",
                        "selector": ".s-image",
                        "type": "attribute",
                        "attribute": "src",
                    },
                    {
                        "name": "rating",
                        "selector": ".a-icon-star-small .a-icon-alt",
                        "type": "text",
                    },
                    {
                        "name": "reviews_count",
                        "selector": "[data-csa-c-func-deps='aui-da-a-popover'] ~ span span",
                        "type": "text",
                    },
                    {
                        "name": "price",
                        "selector": ".a-price .a-offscreen",
                        "type": "text",
                    },
                    {
                        "name": "original_price",
                        "selector": ".a-price.a-text-price .a-offscreen",
                        "type": "text",
                    },
                    {
                        "name": "sponsored",
                        "selector": ".puis-sponsored-label-text",
                        "type": "exists",
                    },
                    {
                        "name": "delivery_info",
                        "selector": "[data-cy='delivery-recipe'] .a-color-base",
                        "type": "text",
                        "multiple": True,
                    },
                ],
            }
        ),
    )
    # Start from the Amazon home page; the injected JavaScript performs the search
    url = "https://www.amazon.com/"
    # Use context manager for proper resource handling
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Extract the data
        result = await crawler.arun(url=url, config=crawler_config)
        # Process and print the results
        if result and result.extracted_content:
            # Parse the JSON string into a list of products
            products = json.loads(result.extracted_content)
            # Process each product in the list
            for product in products:
                print("\nProduct Details:")
                print(f"ASIN: {product.get('asin')}")
                print(f"Title: {product.get('title')}")
                print(f"Price: {product.get('price')}")
                print(f"Original Price: {product.get('original_price')}")
                print(f"Rating: {product.get('rating')}")
                print(f"Reviews: {product.get('reviews_count')}")
                print(f"Sponsored: {'Yes' if product.get('sponsored') else 'No'}")
                if product.get("delivery_info"):
                    print(f"Delivery: {' '.join(product['delivery_info'])}")
                print("-" * 80)
 if __name__ == "__main__":
    import asyncio
    asyncio.run(extract_amazon_products())
--- a/docs/examples/async_webcrawler_multiple_urls_example.py
+++ b/docs/examples/async_webcrawler_multiple_urls_example.py
@@ -1,12 +1,16 @@
 # File: async_webcrawler_multiple_urls_example.py
 import os, sys
 # append 2 parent directories to sys.path to import crawl4ai
-parent_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+parent_dir = os.path.dirname(
    os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
 )
 sys.path.append(parent_dir)
 import asyncio
 from crawl4ai import AsyncWebCrawler
 async def main():
    # Initialize the AsyncWebCrawler
    async with AsyncWebCrawler(verbose=True) as crawler:
@@ -16,7 +20,7 @@ async def main():
            "https://python.org",
            "https://github.com",
            "https://stackoverflow.com",
-            "https://news.ycombinator.com"
+            "https://news.ycombinator.com",
        ]
        # Set up crawling parameters
@@ -27,7 +31,7 @@ async def main():
            urls=urls,
            word_count_threshold=word_count_threshold,
            bypass_cache=True,
-            verbose=True
+            verbose=True,
        )
        # Process the results
@@ -36,7 +40,9 @@ async def main():
                print(f"Successfully crawled: {result.url}")
                print(f"Title: {result.metadata.get('title', 'N/A')}")
                print(f"Word count: {len(result.markdown.split())}")
-                print(f"Number of links: {len(result.links.get('internal', [])) + len(result.links.get('external', []))}")
+                print(
                    f"Number of links: {len(result.links.get('internal', [])) + len(result.links.get('external', []))}"
                )
                print(f"Number of images: {len(result.media.get('images', []))}")
                print("---")
            else:
@@ -44,5 +50,6 @@ async def main():
                print(f"Error: {result.error_message}")
                print("---")
 if __name__ == "__main__":
    asyncio.run(main())
--- a/docs/examples/browser_optimization_example.py
+++ b/docs/examples/browser_optimization_example.py
@@ -0,0 +1,126 @@
 """
 This example demonstrates optimal browser usage patterns in Crawl4AI:
 1. Sequential crawling with session reuse
 2. Parallel crawling with browser instance reuse
 3. Performance optimization settings
 """
 import asyncio
 from typing import List
 from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
 from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
 async def crawl_sequential(urls: List[str]):
    """
    Sequential crawling using session reuse - most efficient for moderate workloads
    """
    print("\n=== Sequential Crawling with Session Reuse ===")
    # Configure browser with optimized settings
    browser_config = BrowserConfig(
        headless=True,
        browser_args=[
            "--disable-gpu",  # Disable GPU acceleration
            "--disable-dev-shm-usage",  # Disable /dev/shm usage
            "--no-sandbox",  # Required for Docker
        ],
        viewport={
            "width": 800,
            "height": 600,
        },  # Smaller viewport for better performance
    )
    # Configure crawl settings
    crawl_config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator(
            #  content_filter=PruningContentFilter(), In case you need fit_markdown
        ),
    )
    # Create single crawler instance
    crawler = AsyncWebCrawler(config=browser_config)
    await crawler.start()
    try:
        session_id = "session1"  # Use same session for all URLs
        for url in urls:
            result = await crawler.arun(
                url=url,
                config=crawl_config,
                session_id=session_id,  # Reuse same browser tab
            )
            if result.success:
                print(f"Successfully crawled {url}")
                print(f"Content length: {len(result.markdown_v2.raw_markdown)}")
    finally:
        await crawler.close()
 async def crawl_parallel(urls: List[str], max_concurrent: int = 3):
    """
    Parallel crawling while reusing browser instance - best for large workloads
    """
    print("\n=== Parallel Crawling with Browser Reuse ===")
    browser_config = BrowserConfig(
        headless=True,
        browser_args=["--disable-gpu", "--disable-dev-shm-usage", "--no-sandbox"],
        viewport={"width": 800, "height": 600},
    )
    crawl_config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator(
            #  content_filter=PruningContentFilter(), In case you need fit_markdown
        ),
    )
    # Create single crawler instance for all parallel tasks
    crawler = AsyncWebCrawler(config=browser_config)
    await crawler.start()
    try:
        # Create tasks in batches to control concurrency
        for i in range(0, len(urls), max_concurrent):
            batch = urls[i : i + max_concurrent]
            tasks = []
            for j, url in enumerate(batch):
                session_id = (
                    f"parallel_session_{j}"  # Different session per concurrent task
                )
                task = crawler.arun(url=url, config=crawl_config, session_id=session_id)
                tasks.append(task)
            # Wait for batch to complete
            results = await asyncio.gather(*tasks, return_exceptions=True)
            # Process results
            for url, result in zip(batch, results):
                if isinstance(result, Exception):
                    print(f"Error crawling {url}: {str(result)}")
                elif result.success:
                    print(f"Successfully crawled {url}")
                    print(f"Content length: {len(result.markdown_v2.raw_markdown)}")
    finally:
        await crawler.close()
 async def main():
    # Example URLs
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
        "https://example.com/page4",
    ]
    # Demo sequential crawling
    await crawl_sequential(urls)
    # Demo parallel crawling
    await crawl_parallel(urls, max_concurrent=2)
 if __name__ == "__main__":
    asyncio.run(main())
--- a/docs/examples/crawlai_vs_firecrawl.py
+++ b/docs/examples/crawlai_vs_firecrawl.py
@@ -1,31 +1,32 @@
 import os, time
 # append the path to the root of the project
 import sys
 import asyncio
-sys.path.append(os.path.join(os.path.dirname(__file__), '..', '..'))
+
 sys.path.append(os.path.join(os.path.dirname(__file__), "..", ".."))
 from firecrawl import FirecrawlApp
 from crawl4ai import AsyncWebCrawler
-__data__ = os.path.join(os.path.dirname(__file__), '..', '..') + '/.data'
+
 __data__ = os.path.join(os.path.dirname(__file__), "..", "..") + "/.data"
 async def compare():
-    app = FirecrawlApp(api_key=os.environ['FIRECRAWL_API_KEY'])
+    app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
    # Test Firecrawl with a simple crawl
    start = time.time()
    scrape_status = app.scrape_url(
-    'https://www.nbcnews.com/business',
+        "https://www.nbcnews.com/business", params={"formats": ["markdown", "html"]}
    params={'formats': ['markdown', 'html']}
    )
    end = time.time()
    print(f"Time taken: {end - start} seconds")
-    print(len(scrape_status['markdown']))
+    print(len(scrape_status["markdown"]))
    # save the markdown content with provider name
    with open(f"{__data__}/firecrawl_simple.md", "w") as f:
-        f.write(scrape_status['markdown'])
+        f.write(scrape_status["markdown"])
    # Count how many "cldnry.s-nbcnews.com" are in the markdown
-    print(scrape_status['markdown'].count("cldnry.s-nbcnews.com"))
+    print(scrape_status["markdown"].count("cldnry.s-nbcnews.com"))
    async with AsyncWebCrawler() as crawler:
        start = time.time()
@@ -34,7 +35,7 @@ async def compare():
            # js_code=["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"],
            word_count_threshold=0,
            bypass_cache=True,
-            verbose=False
+            verbose=False,
        )
        end = time.time()
        print(f"Time taken: {end - start} seconds")
@@ -48,10 +49,12 @@ async def compare():
        start = time.time()
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
-            js_code=["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"],
+            js_code=[
                "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
            ],
            word_count_threshold=0,
            bypass_cache=True,
-            verbose=False
+            verbose=False,
        )
        end = time.time()
        print(f"Time taken: {end - start} seconds")
@@ -62,6 +65,6 @@ async def compare():
        # count how many "cldnry.s-nbcnews.com" are in the markdown
        print(result.markdown.count("cldnry.s-nbcnews.com"))
 if __name__ == "__main__":
    asyncio.run(compare())
--- a/docs/examples/dispatcher_example.py
+++ b/docs/examples/dispatcher_example.py
@@ -0,0 +1,136 @@
 import asyncio
 import time
 from rich import print
 from rich.table import Table
 from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CrawlerRunConfig,
    MemoryAdaptiveDispatcher,
    SemaphoreDispatcher,
    RateLimiter,
    CrawlerMonitor,
    DisplayMode,
    CacheMode,
    LXMLWebScrapingStrategy,
 )
 async def memory_adaptive(urls, browser_config, run_config):
    """Memory adaptive crawler with monitoring"""
    start = time.perf_counter()
    async with AsyncWebCrawler(config=browser_config) as crawler:
        dispatcher = MemoryAdaptiveDispatcher(
            memory_threshold_percent=70.0,
            max_session_permit=10,
            monitor=CrawlerMonitor(
                max_visible_rows=15, display_mode=DisplayMode.DETAILED
            ),
        )
        results = await crawler.arun_many(
            urls, config=run_config, dispatcher=dispatcher
        )
    duration = time.perf_counter() - start
    return len(results), duration
 async def memory_adaptive_with_rate_limit(urls, browser_config, run_config):
    """Memory adaptive crawler with rate limiting"""
    start = time.perf_counter()
    async with AsyncWebCrawler(config=browser_config) as crawler:
        dispatcher = MemoryAdaptiveDispatcher(
            memory_threshold_percent=70.0,
            max_session_permit=10,
            rate_limiter=RateLimiter(
                base_delay=(1.0, 2.0), max_delay=30.0, max_retries=2
            ),
            monitor=CrawlerMonitor(
                max_visible_rows=15, display_mode=DisplayMode.DETAILED
            ),
        )
        results = await crawler.arun_many(
            urls, config=run_config, dispatcher=dispatcher
        )
    duration = time.perf_counter() - start
    return len(results), duration
 async def semaphore(urls, browser_config, run_config):
    """Basic semaphore crawler"""
    start = time.perf_counter()
    async with AsyncWebCrawler(config=browser_config) as crawler:
        dispatcher = SemaphoreDispatcher(
            semaphore_count=5,
            monitor=CrawlerMonitor(
                max_visible_rows=15, display_mode=DisplayMode.DETAILED
            ),
        )
        results = await crawler.arun_many(
            urls, config=run_config, dispatcher=dispatcher
        )
    duration = time.perf_counter() - start
    return len(results), duration
 async def semaphore_with_rate_limit(urls, browser_config, run_config):
    """Semaphore crawler with rate limiting"""
    start = time.perf_counter()
    async with AsyncWebCrawler(config=browser_config) as crawler:
        dispatcher = SemaphoreDispatcher(
            semaphore_count=5,
            rate_limiter=RateLimiter(
                base_delay=(1.0, 2.0), max_delay=30.0, max_retries=2
            ),
            monitor=CrawlerMonitor(
                max_visible_rows=15, display_mode=DisplayMode.DETAILED
            ),
        )
        results = await crawler.arun_many(
            urls, config=run_config, dispatcher=dispatcher
        )
    duration = time.perf_counter() - start
    return len(results), duration
 def create_performance_table(results):
    """Creates a rich table showing performance results"""
    table = Table(title="Crawler Strategy Performance Comparison")
    table.add_column("Strategy", style="cyan")
    table.add_column("URLs Crawled", justify="right", style="green")
    table.add_column("Time (seconds)", justify="right", style="yellow")
    table.add_column("URLs/second", justify="right", style="magenta")
    sorted_results = sorted(results.items(), key=lambda x: x[1][1])
    for strategy, (urls_crawled, duration) in sorted_results:
        urls_per_second = urls_crawled / duration
        table.add_row(
            strategy, str(urls_crawled), f"{duration:.2f}", f"{urls_per_second:.2f}"
        )
    return table
 async def main():
    urls = [f"https://example.com/page{i}" for i in range(1, 40)]
    browser_config = BrowserConfig(headless=True, verbose=False)
    run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, scraping_strategy=LXMLWebScrapingStrategy())
    results = {
        "Memory Adaptive": await memory_adaptive(urls, browser_config, run_config),
        # "Memory Adaptive + Rate Limit": await memory_adaptive_with_rate_limit(
        #     urls, browser_config, run_config
        # ),
        # "Semaphore": await semaphore(urls, browser_config, run_config),
        # "Semaphore + Rate Limit": await semaphore_with_rate_limit(
        #     urls, browser_config, run_config
        # ),
    }
    table = create_performance_table(results)
    print("\nPerformance Summary:")
    print(table)
 if __name__ == "__main__":
    asyncio.run(main())
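The summary table above sorts strategies by elapsed time and derives a URLs/second figure from each `(count, duration)` pair. A minimal stdlib-only sketch of that calculation follows. The helper name `summarize` and the zero-duration guard are assumptions added for illustration and are not part of the demo script; the sample numbers are made up.

```python
from typing import Dict, List, Tuple


def summarize(
    results: Dict[str, Tuple[int, float]]
) -> List[Tuple[str, int, float, float]]:
    """Return (strategy, urls, seconds, urls_per_second) rows, fastest first."""
    rows = []
    # Sort by duration, mirroring the key used in create_performance_table.
    for strategy, (urls_crawled, duration) in sorted(
        results.items(), key=lambda item: item[1][1]
    ):
        # Hypothetical guard: the original divides directly and would raise
        # ZeroDivisionError if a run reported a duration of exactly 0.
        rate = urls_crawled / duration if duration > 0 else 0.0
        rows.append((strategy, urls_crawled, duration, rate))
    return rows


if __name__ == "__main__":
    sample = {"Semaphore": (39, 13.0), "Memory Adaptive": (39, 6.5)}  # made-up
    for row in summarize(sample):
        print(row)
```

Keeping the arithmetic separate from the `rich` table makes it possible to check the numbers without rendering anything.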
--- a/docs/examples/docker_example.py
+++ b/docs/examples/docker_example.py
@@ -6,15 +6,24 @@ import base64
 import os
 from typing import Dict, Any
 class Crawl4AiTester:
    def __init__(self, base_url: str = "http://localhost:11235", api_token: str = None):
        self.base_url = base_url
-        self.api_token = api_token or os.getenv('CRAWL4AI_API_TOKEN') or "test_api_code"  # Check environment variable as fallback
-        self.headers = {'Authorization': f'Bearer {self.api_token}'} if self.api_token else {}
+        self.api_token = (
+            api_token or os.getenv("CRAWL4AI_API_TOKEN") or "test_api_code"
+        )  # Check environment variable as fallback
+        self.headers = (
+            {"Authorization": f"Bearer {self.api_token}"} if self.api_token else {}
+        )
-    def submit_and_wait(self, request_data: Dict[str, Any], timeout: int = 300) -> Dict[str, Any]:
+    def submit_and_wait(
+        self, request_data: Dict[str, Any], timeout: int = 300
+    ) -> Dict[str, Any]:
        # Submit crawl job
-        response = requests.post(f"{self.base_url}/crawl", json=request_data, headers=self.headers)
+        response = requests.post(
+            f"{self.base_url}/crawl", json=request_data, headers=self.headers
+        )
        if response.status_code == 403:
            raise Exception("API token is invalid or missing")
        task_id = response.json()["task_id"]
@@ -24,9 +33,13 @@ class Crawl4AiTester:
        start_time = time.time()
        while True:
            if time.time() - start_time > timeout:
-                raise TimeoutError(f"Task {task_id} did not complete within {timeout} seconds")
+                raise TimeoutError(
+                    f"Task {task_id} did not complete within {timeout} seconds"
+                )
-            result = requests.get(f"{self.base_url}/task/{task_id}", headers=self.headers)
+            result = requests.get(
+                f"{self.base_url}/task/{task_id}", headers=self.headers
+            )
            status = result.json()
            if status["status"] == "failed":
@@ -39,7 +52,12 @@ class Crawl4AiTester:
            time.sleep(2)
    def submit_sync(self, request_data: Dict[str, Any]) -> Dict[str, Any]:
-        response = requests.post(f"{self.base_url}/crawl_sync", json=request_data, headers=self.headers, timeout=60)
+        response = requests.post(
+            f"{self.base_url}/crawl_sync",
+            json=request_data,
+            headers=self.headers,
+            timeout=60,
+        )
        if response.status_code == 408:
            raise TimeoutError("Task did not complete within server timeout")
        response.raise_for_status()
@@ -48,16 +66,15 @@ class Crawl4AiTester:
    def crawl_direct(self, request_data: Dict[str, Any]) -> Dict[str, Any]:
        """Directly crawl without using task queue"""
        response = requests.post(
-            f"{self.base_url}/crawl_direct", 
-            json=request_data, 
-            headers=self.headers
+            f"{self.base_url}/crawl_direct", json=request_data, headers=self.headers
        )
        response.raise_for_status()
        return response.json()
 def test_docker_deployment(version="basic"):
    tester = Crawl4AiTester(
-        base_url="http://localhost:11235" ,
+        base_url="http://localhost:11235",
        # base_url="https://api.crawl4ai.com" # just for example
        # api_token="test" # just for example
    )
@@ -70,7 +87,7 @@ def test_docker_deployment(version="basic"):
            health = requests.get(f"{tester.base_url}/health", timeout=10)
            print("Health check:", health.json())
            break
-        except requests.exceptions.RequestException as e:
+        except requests.exceptions.RequestException:
            if i == max_retries - 1:
                print(f"Failed to connect after {max_retries} attempts")
                sys.exit(1)
@@ -99,7 +116,7 @@ def test_basic_crawl(tester: Crawl4AiTester):
    request = {
        "urls": "https://www.nbcnews.com/business",
        "priority": 10,
-        "session_id": "test"
+        "session_id": "test",
    }
    result = tester.submit_and_wait(request)
@@ -107,19 +124,21 @@ def test_basic_crawl(tester: Crawl4AiTester):
    assert result["result"]["success"]
    assert len(result["result"]["markdown"]) > 0
 def test_basic_crawl_sync(tester: Crawl4AiTester):
    print("\n=== Testing Basic Crawl (Sync) ===")
    request = {
        "urls": "https://www.nbcnews.com/business",
        "priority": 10,
-        "session_id": "test"
+        "session_id": "test",
    }
    result = tester.submit_sync(request)
    print(f"Basic crawl result length: {len(result['result']['markdown'])}")
-    assert result['status'] == 'completed'
+    assert result["status"] == "completed"
-    assert result['result']['success']
+    assert result["result"]["success"]
-    assert len(result['result']['markdown']) > 0
+    assert len(result["result"]["markdown"]) > 0
 def test_basic_crawl_direct(tester: Crawl4AiTester):
    print("\n=== Testing Basic Crawl (Direct) ===")
@@ -127,13 +146,14 @@ def test_basic_crawl_direct(tester: Crawl4AiTester):
        "urls": "https://www.nbcnews.com/business",
        "priority": 10,
        # "session_id": "test"
-        "cache_mode": "bypass"  # or "enabled", "disabled", "read_only", "write_only"
+        "cache_mode": "bypass",  # or "enabled", "disabled", "read_only", "write_only"
    }
    result = tester.crawl_direct(request)
    print(f"Basic crawl result length: {len(result['result']['markdown'])}")
-    assert result['result']['success']
+    assert result["result"]["success"]
-    assert len(result['result']['markdown']) > 0
+    assert len(result["result"]["markdown"]) > 0
 def test_js_execution(tester: Crawl4AiTester):
    print("\n=== Testing JS Execution ===")
@@ -144,32 +164,29 @@ def test_js_execution(tester: Crawl4AiTester):
            "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
        ],
        "wait_for": "article.tease-card:nth-child(10)",
-        "crawler_params": {
-            "headless": True
-        }
+        "crawler_params": {"headless": True},
    }
    result = tester.submit_and_wait(request)
    print(f"JS execution result length: {len(result['result']['markdown'])}")
    assert result["result"]["success"]
 def test_css_selector(tester: Crawl4AiTester):
    print("\n=== Testing CSS Selector ===")
    request = {
        "urls": "https://www.nbcnews.com/business",
        "priority": 7,
        "css_selector": ".wide-tease-item__description",
-        "crawler_params": {
-            "headless": True
-        },
-        "extra": {"word_count_threshold": 10}
+        "crawler_params": {"headless": True},
+        "extra": {"word_count_threshold": 10},
    }
    result = tester.submit_and_wait(request)
    print(f"CSS selector result length: {len(result['result']['markdown'])}")
    assert result["result"]["success"]
 def test_structured_extraction(tester: Crawl4AiTester):
    print("\n=== Testing Structured Extraction ===")
    schema = {
@@ -190,19 +207,14 @@ def test_structured_extraction(tester: Crawl4AiTester):
                "name": "price",
                "selector": "td:nth-child(2)",
                "type": "text",
-            }
+            },
        ],
    }
    request = {
        "urls": "https://www.coinbase.com/explore",
        "priority": 9,
-        "extraction_config": {
-            "type": "json_css",
-            "params": {
-                "schema": schema
-            }
-        }
+        "extraction_config": {"type": "json_css", "params": {"schema": schema}},
    }
    result = tester.submit_and_wait(request)
@@ -212,6 +224,7 @@ def test_structured_extraction(tester: Crawl4AiTester):
    assert result["result"]["success"]
    assert len(extracted) > 0
 def test_llm_extraction(tester: Crawl4AiTester):
    print("\n=== Testing LLM Extraction ===")
    schema = {
@@ -219,18 +232,18 @@ def test_llm_extraction(tester: Crawl4AiTester):
        "properties": {
            "model_name": {
                "type": "string",
-                "description": "Name of the OpenAI model."
+                "description": "Name of the OpenAI model.",
            },
            "input_fee": {
                "type": "string",
-                "description": "Fee for input token for the OpenAI model."
+                "description": "Fee for input token for the OpenAI model.",
            },
            "output_fee": {
                "type": "string",
-                "description": "Fee for output token for the OpenAI model."
+                "description": "Fee for output token for the OpenAI model.",
-            }
+            },
         },
-        "required": ["model_name", "input_fee", "output_fee"]
+        "required": ["model_name", "input_fee", "output_fee"],
    }
    request = {
@@ -243,10 +256,10 @@ def test_llm_extraction(tester: Crawl4AiTester):
                "api_token": os.getenv("OPENAI_API_KEY"),
                "schema": schema,
                "extraction_type": "schema",
-                "instruction": """From the crawled content, extract all mentioned model names along with their fees for input and output tokens."""
+                "instruction": """From the crawled content, extract all mentioned model names along with their fees for input and output tokens.""",
-            }
+            },
         },
-        "crawler_params": {"word_count_threshold": 1}
+        "crawler_params": {"word_count_threshold": 1},
    }
    try:
@@ -258,6 +271,7 @@ def test_llm_extraction(tester: Crawl4AiTester):
    except Exception as e:
        print(f"LLM extraction test failed (might be due to missing API key): {str(e)}")
 def test_llm_with_ollama(tester: Crawl4AiTester):
    print("\n=== Testing LLM with Ollama ===")
    schema = {
@@ -265,18 +279,18 @@ def test_llm_with_ollama(tester: Crawl4AiTester):
        "properties": {
            "article_title": {
                "type": "string",
-                "description": "The main title of the news article"
+                "description": "The main title of the news article",
            },
            "summary": {
                "type": "string",
-                "description": "A brief summary of the article content"
+                "description": "A brief summary of the article content",
            },
            "main_topics": {
                "type": "array",
                "items": {"type": "string"},
-                "description": "Main topics or themes discussed in the article"
+                "description": "Main topics or themes discussed in the article",
-            }
+            },
-        }
+        },
    }
    request = {
@@ -288,11 +302,11 @@ def test_llm_with_ollama(tester: Crawl4AiTester):
                "provider": "ollama/llama2",
                "schema": schema,
                "extraction_type": "schema",
-                "instruction": "Extract the main article information including title, summary, and main topics."
+                "instruction": "Extract the main article information including title, summary, and main topics.",
-            }
+            },
        },
        "extra": {"word_count_threshold": 1},
-        "crawler_params": {"verbose": True}
+        "crawler_params": {"verbose": True},
    }
    try:
@@ -303,6 +317,7 @@ def test_llm_with_ollama(tester: Crawl4AiTester):
    except Exception as e:
        print(f"Ollama extraction test failed: {str(e)}")
 def test_cosine_extraction(tester: Crawl4AiTester):
    print("\n=== Testing Cosine Extraction ===")
    request = {
@@ -314,9 +329,9 @@ def test_cosine_extraction(tester: Crawl4AiTester):
                "semantic_filter": "business finance economy",
                "word_count_threshold": 10,
                "max_dist": 0.2,
-                "top_k": 3
+                "top_k": 3,
-            }
+            },
-        }
+        },
    }
    try:
@@ -328,15 +343,14 @@ def test_cosine_extraction(tester: Crawl4AiTester):
    except Exception as e:
        print(f"Cosine extraction test failed: {str(e)}")
 def test_screenshot(tester: Crawl4AiTester):
    print("\n=== Testing Screenshot ===")
    request = {
        "urls": "https://www.nbcnews.com/business",
        "priority": 5,
        "screenshot": True,
-        "crawler_params": {
-            "headless": True
-        }
+        "crawler_params": {"headless": True},
    }
    result = tester.submit_and_wait(request)
@@ -351,6 +365,7 @@ def test_screenshot(tester: Crawl4AiTester):
    assert result["result"]["success"]
 if __name__ == "__main__":
    version = sys.argv[1] if len(sys.argv) > 1 else "basic"
    # version = "full"
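The `submit_and_wait` method above polls `/task/{task_id}` every two seconds until the task finishes or the timeout elapses. A minimal sketch of that polling pattern, decoupled from HTTP so it can run offline, might look like the following. The function name `wait_for_task`, the injectable `clock` and `sleep` parameters, and the `"error"` key are assumptions for illustration; only the `"status"` values `"completed"` and `"failed"` come from the example itself.

```python
import time
from typing import Any, Callable, Dict


def wait_for_task(
    fetch_status: Callable[[], Dict[str, Any]],
    timeout: float = 300.0,
    interval: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll fetch_status() until it reports completion, failure, or timeout."""
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status.get("status") == "completed":
            return status
        if status.get("status") == "failed":
            # The "error" key is a hypothetical field used only for the message.
            raise RuntimeError(f"Task failed: {status.get('error')}")
        if clock() >= deadline:
            raise TimeoutError(f"Task did not complete within {timeout} seconds")
        sleep(interval)
```

With `requests`, `fetch_status` could be a small lambda wrapping the `GET /task/{task_id}` call and `.json()`; using a monotonic clock avoids surprises if the wall clock is adjusted mid-poll.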
--- a/docs/examples/extraction_strategies_example.py
+++ b/docs/examples/extraction_strategies_example.py
@@ -0,0 +1,127 @@
 """
 Example demonstrating different extraction strategies with various input formats.
 This example shows how to:
 1. Use different input formats (markdown, HTML, fit_markdown)
 2. Work with JSON-based extractors (CSS and XPath)
 3. Use LLM-based extraction with different input formats
 4. Configure browser and crawler settings properly
 """
 import asyncio
 import os
 from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
 from crawl4ai.extraction_strategy import (
    LLMExtractionStrategy,
    JsonCssExtractionStrategy,
    JsonXPathExtractionStrategy,
 )
 from crawl4ai.content_filter_strategy import PruningContentFilter
 from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
 async def run_extraction(crawler: AsyncWebCrawler, url: str, strategy, name: str):
    """Helper function to run extraction with proper configuration"""
    try:
        # Configure the crawler run settings
        config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            extraction_strategy=strategy,
            markdown_generator=DefaultMarkdownGenerator(
                content_filter=PruningContentFilter()  # For fit_markdown support
            ),
        )
        # Run the crawler
        result = await crawler.arun(url=url, config=config)
        if result.success:
            print(f"\n=== {name} Results ===")
            print(f"Extracted Content: {result.extracted_content}")
            print(f"Raw Markdown Length: {len(result.markdown_v2.raw_markdown)}")
            print(
                f"Citations Markdown Length: {len(result.markdown_v2.markdown_with_citations)}"
            )
        else:
            print(f"Error in {name}: Crawl failed")
    except Exception as e:
        print(f"Error in {name}: {str(e)}")
 async def main():
    # Example URL (replace with actual URL)
    url = "https://example.com/product-page"
    # Configure browser settings
    browser_config = BrowserConfig(headless=True, verbose=True)
    # Initialize extraction strategies
    # 1. LLM Extraction with different input formats
    markdown_strategy = LLMExtractionStrategy(
        provider="openai/gpt-4o-mini",
        api_token=os.getenv("OPENAI_API_KEY"),
        instruction="Extract product information including name, price, and description",
    )
    html_strategy = LLMExtractionStrategy(
        input_format="html",
        provider="openai/gpt-4o-mini",
        api_token=os.getenv("OPENAI_API_KEY"),
        instruction="Extract product information from HTML including structured data",
    )
    fit_markdown_strategy = LLMExtractionStrategy(
        input_format="fit_markdown",
        provider="openai/gpt-4o-mini",
        api_token=os.getenv("OPENAI_API_KEY"),
        instruction="Extract product information from cleaned markdown",
    )
    # 2. JSON CSS Extraction (automatically uses HTML input)
    css_schema = {
        "baseSelector": ".product",
        "fields": [
            {"name": "title", "selector": "h1.product-title", "type": "text"},
            {"name": "price", "selector": ".price", "type": "text"},
            {"name": "description", "selector": ".description", "type": "text"},
        ],
    }
    css_strategy = JsonCssExtractionStrategy(schema=css_schema)
    # 3. JSON XPath Extraction (automatically uses HTML input)
    xpath_schema = {
        "baseSelector": "//div[@class='product']",
        "fields": [
            {
                "name": "title",
                "selector": ".//h1[@class='product-title']/text()",
                "type": "text",
            },
            {
                "name": "price",
                "selector": ".//span[@class='price']/text()",
                "type": "text",
            },
            {
                "name": "description",
                "selector": ".//div[@class='description']/text()",
                "type": "text",
            },
        ],
    }
    xpath_strategy = JsonXPathExtractionStrategy(schema=xpath_schema)
    # Use context manager for proper resource handling
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Run all strategies
        await run_extraction(crawler, url, markdown_strategy, "Markdown LLM")
        await run_extraction(crawler, url, html_strategy, "HTML LLM")
        await run_extraction(crawler, url, fit_markdown_strategy, "Fit Markdown LLM")
        await run_extraction(crawler, url, css_strategy, "CSS Extraction")
        await run_extraction(crawler, url, xpath_strategy, "XPath Extraction")
 if __name__ == "__main__":
    asyncio.run(main())
--- a/docs/examples/full_page_screenshot_and_pdf_export.md
+++ b/docs/examples/full_page_screenshot_and_pdf_export.md
@@ -39,8 +39,8 @@ async def main():
                    f.write(b64decode(result.screenshot))
            # Save PDF
-            if result.pdf_data:
+            if result.pdf:
-                pdf_bytes = b64decode(result.pdf_data)
+                pdf_bytes = b64decode(result.pdf)
                with open(os.path.join(__location__, "page.pdf"), "wb") as f:
                    f.write(pdf_bytes)
--- a/docs/examples/hello_world.py
+++ b/docs/examples/hello_world.py
@@ -0,0 +1,23 @@
 import asyncio
 from crawl4ai import *
 async def main():
    browser_config = BrowserConfig(headless=True, verbose=True)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        crawler_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            markdown_generator=DefaultMarkdownGenerator(
                content_filter=PruningContentFilter(
                    threshold=0.48, threshold_type="fixed", min_word_threshold=0
                )
            ),
        )
        result = await crawler.arun(
            url="https://www.helloworld.org", config=crawler_config
        )
        print(result.markdown_v2.raw_markdown[:500])
 if __name__ == "__main__":
    asyncio.run(main())
--- a/docs/examples/hooks_example.py
+++ b/docs/examples/hooks_example.py
@@ -0,0 +1,118 @@
 from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
 from playwright.async_api import Page, BrowserContext
 async def main():
    print("🔗 Hooks Example: Demonstrating different hook use cases")
    # Configure browser settings
    browser_config = BrowserConfig(headless=True)
    # Configure crawler settings
    crawler_run_config = CrawlerRunConfig(
        js_code="window.scrollTo(0, document.body.scrollHeight);",
        wait_for="body",
        cache_mode=CacheMode.BYPASS,
    )
    # Create crawler instance
    crawler = AsyncWebCrawler(config=browser_config)
    # Define and set hook functions
    async def on_browser_created(browser, context: BrowserContext, **kwargs):
        """Hook called after the browser is created"""
        print("[HOOK] on_browser_created - Browser is ready!")
        # Example: perform browser-level setup here (this demo changes nothing)
        return browser
    async def on_page_context_created(page: Page, context: BrowserContext, **kwargs):
        """Hook called after a new page and context are created"""
        print("[HOOK] on_page_context_created - New page created!")
        # Example: Set a session cookie and a default viewport size
        await context.add_cookies(
            [
                {
                    "name": "session_id",
                    "value": "example_session",
                    "domain": ".example.com",
                    "path": "/",
                }
            ]
        )
        await page.set_viewport_size({"width": 1080, "height": 800})
        return page
    async def on_user_agent_updated(
        page: Page, context: BrowserContext, user_agent: str, **kwargs
    ):
        """Hook called when the user agent is updated"""
        print(f"[HOOK] on_user_agent_updated - New user agent: {user_agent}")
        return page
    async def on_execution_started(page: Page, context: BrowserContext, **kwargs):
        """Hook called after custom JavaScript execution"""
        print("[HOOK] on_execution_started - Custom JS executed!")
        return page
    async def before_goto(page: Page, context: BrowserContext, url: str, **kwargs):
        """Hook called before navigating to each URL"""
        print(f"[HOOK] before_goto - About to visit: {url}")
        # Example: Add custom headers for the request
        await page.set_extra_http_headers({"Custom-Header": "my-value"})
        return page
    async def after_goto(
        page: Page, context: BrowserContext, url: str, response: dict, **kwargs
    ):
        """Hook called after navigating to each URL"""
        print(f"[HOOK] after_goto - Successfully loaded: {url}")
        # Example: Wait for a specific element to be loaded
        try:
            await page.wait_for_selector(".content", timeout=1000)
            print("Content element found!")
        except Exception:
            print("Content element not found, continuing anyway")
        return page
    async def before_retrieve_html(page: Page, context: BrowserContext, **kwargs):
        """Hook called before retrieving the HTML content"""
        print("[HOOK] before_retrieve_html - About to get HTML content")
        # Example: Scroll to bottom to trigger lazy loading
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
        return page
    async def before_return_html(
        page: Page, context: BrowserContext, html: str, **kwargs
    ):
        """Hook called before returning the HTML content"""
        print(f"[HOOK] before_return_html - Got HTML content (length: {len(html)})")
        # Example: You could modify the HTML content here if needed
        return page
    # Set all the hooks
    crawler.crawler_strategy.set_hook("on_browser_created", on_browser_created)
    crawler.crawler_strategy.set_hook(
        "on_page_context_created", on_page_context_created
    )
    crawler.crawler_strategy.set_hook("on_user_agent_updated", on_user_agent_updated)
    crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
    crawler.crawler_strategy.set_hook("before_goto", before_goto)
    crawler.crawler_strategy.set_hook("after_goto", after_goto)
    crawler.crawler_strategy.set_hook("before_retrieve_html", before_retrieve_html)
    crawler.crawler_strategy.set_hook("before_return_html", before_return_html)
    await crawler.start()
    # Example usage: crawl a simple website
    url = "https://example.com"
    result = await crawler.arun(url, config=crawler_run_config)
    print(f"\nCrawled URL: {result.url}")
    print(f"HTML length: {len(result.html)}")
    await crawler.close()
 if __name__ == "__main__":
    import asyncio
    asyncio.run(main())
--- a/docs/examples/language_support_example.py
+++ b/docs/examples/language_support_example.py
@@ -1,6 +1,7 @@
 import asyncio
 from crawl4ai import AsyncWebCrawler, AsyncPlaywrightCrawlerStrategy
 async def main():
    # Example 1: Setting language when creating the crawler
    crawler1 = AsyncWebCrawler(
@@ -9,11 +10,15 @@ async def main():
        )
    )
    result1 = await crawler1.arun("https://www.example.com")
-    print("Example 1 result:", result1.extracted_content[:100])  # Print first 100 characters
+    print(
+        "Example 1 result:", result1.extracted_content[:100]
+    )  # Print first 100 characters
    # Example 2: Setting language before crawling
    crawler2 = AsyncWebCrawler()
-    crawler2.crawler_strategy.headers["Accept-Language"] = "es-ES,es;q=0.9,en-US;q=0.8,en;q=0.7"
+    crawler2.crawler_strategy.headers[
+        "Accept-Language"
+    ] = "es-ES,es;q=0.9,en-US;q=0.8,en;q=0.7"
    result2 = await crawler2.arun("https://www.example.com")
    print("Example 2 result:", result2.extracted_content[:100])
@@ -21,7 +26,7 @@ async def main():
    crawler3 = AsyncWebCrawler()
    result3 = await crawler3.arun(
        "https://www.example.com",
-        headers={"Accept-Language": "de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7"}
+        headers={"Accept-Language": "de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7"},
    )
    print("Example 3 result:", result3.extracted_content[:100])
@@ -33,13 +38,13 @@ async def main():
    ]
    crawler4 = AsyncWebCrawler()
-    results = await asyncio.gather(*[
-        crawler4.arun(url, headers={"Accept-Language": lang})
-        for url, lang in urls
-    ])
+    results = await asyncio.gather(
+        *[crawler4.arun(url, headers={"Accept-Language": lang}) for url, lang in urls]
+    )
    for url, result in zip([u for u, _ in urls], results):
        print(f"Result for {url}:", result.extracted_content[:100])
 if __name__ == "__main__":
    asyncio.run(main())
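The examples above pass hand-written `Accept-Language` values such as `es-ES,es;q=0.9,en-US;q=0.8,en;q=0.7`. As a rough sketch, such a header can be assembled from an ordered list of language tags and quality weights. The helper name `accept_language` is hypothetical and is not part of crawl4ai.

```python
from typing import Sequence, Tuple


def accept_language(preferences: Sequence[Tuple[str, float]]) -> str:
    """Build an Accept-Language value from (language_tag, q) pairs."""
    parts = []
    for tag, q in preferences:
        # A weight of 1.0 is the default, so the q parameter is omitted for it.
        parts.append(tag if q >= 1.0 else f"{tag};q={q:g}")
    return ",".join(parts)


if __name__ == "__main__":
    header = accept_language(
        [("es-ES", 1.0), ("es", 0.9), ("en-US", 0.8), ("en", 0.7)]
    )
    print(header)
```

The resulting string can be passed wherever the examples set the header, for instance as `headers={"Accept-Language": header}`.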
--- a/docs/examples/llm_extraction_openai_pricing.py
+++ b/docs/examples/llm_extraction_openai_pricing.py
@@ -3,32 +3,37 @@ from crawl4ai.crawler_strategy import *
 import asyncio
 from pydantic import BaseModel, Field
-url = r'https://openai.com/api/pricing/'
+url = r"https://openai.com/api/pricing/"
 class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
-    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
+    output_fee: str = Field(
+        ..., description="Fee for output token for the OpenAI model."
+    )
 from crawl4ai import AsyncWebCrawler
 async def main():
    # Use AsyncWebCrawler
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            word_count_threshold=1,
-            extraction_strategy= LLMExtractionStrategy(
+            extraction_strategy=LLMExtractionStrategy(
                # provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'),
-                provider= "groq/llama-3.1-70b-versatile", api_token = os.getenv('GROQ_API_KEY'),
+                provider="groq/llama-3.1-70b-versatile",
+                api_token=os.getenv("GROQ_API_KEY"),
                schema=OpenAIModelFee.model_json_schema(),
                extraction_type="schema",
-                instruction="From the crawled content, extract all mentioned model names along with their " \
+                instruction="From the crawled content, extract all mentioned model names along with their "
-                            "fees for input and output tokens. Make sure not to miss anything in the entire content. " \
+                "fees for input and output tokens. Make sure not to miss anything in the entire content. "
-                            'One extracted model JSON format should look like this: ' \
+                "One extracted model JSON format should look like this: "
-                            '{ "model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens" }'
+                '{ "model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens" }',
            ),
        )
        print("Success:", result.success)
        model_fees = json.loads(result.extracted_content)
@@ -37,4 +42,5 @@ async def main():
        with open(".data/data.json", "w", encoding="utf-8") as f:
            f.write(result.extracted_content)
 asyncio.run(main())
--- a/docs/examples/llm_markdown_generator.py
+++ b/docs/examples/llm_markdown_generator.py
@@ -0,0 +1,87 @@
 import os
 import asyncio
 from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
 from crawl4ai.content_filter_strategy import LLMContentFilter
 async def test_llm_filter():
    # Create an HTML source that needs intelligent filtering
    url = "https://docs.python.org/3/tutorial/classes.html"
    browser_config = BrowserConfig(
        headless=True,
        verbose=True
    )
    # run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    run_config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # First crawl the page and take its cleaned HTML
        result = await crawler.arun(url, config=run_config)
        html = result.cleaned_html
        # Initialize LLM filter with focused instruction
        filter = LLMContentFilter(
            provider="openai/gpt-4o",
            api_token=os.getenv('OPENAI_API_KEY'),
            instruction="""
            Focus on extracting the core educational content about Python classes.
            Include:
            - Key concepts and their explanations
            - Important code examples
            - Essential technical details
            Exclude:
            - Navigation elements
            - Sidebars
            - Footer content
            - Version information
            - Any non-essential UI elements
            Format the output as clean markdown with proper code blocks and headers.
            """,
            verbose=True
        )
        filter = LLMContentFilter(
            provider="openai/gpt-4o",
            api_token=os.getenv('OPENAI_API_KEY'),
            chunk_token_threshold=2 ** 12 * 2, # 4096 * 2 = 8192
            instruction="""
            Extract the main educational content while preserving its original wording and substance completely. Your task is to:
            1. Maintain the exact language and terminology used in the main content
            2. Keep all technical explanations, examples, and educational content intact
            3. Preserve the original flow and structure of the core content
            4. Remove only clearly irrelevant elements like:
            - Navigation menus
            - Advertisement sections
            - Cookie notices
            - Footers with site information
            - Sidebars with external links
            - Any UI elements that don't contribute to learning
            The goal is to create a clean markdown version that reads exactly like the original article, 
            keeping all valuable content but free from distracting elements. Imagine you're creating 
            a perfect reading experience where nothing valuable is lost, but all noise is removed.
            """,
            verbose=True
        )        
        # Apply filtering
        filtered_content = filter.filter_content(html, ignore_cache = True)
        # Show results
        print("\nFiltered Content Length:", len(filtered_content))
        print("\nFirst 500 chars of filtered content:")
        if filtered_content:
            print(filtered_content[0][:500])
        # Save the markdown version to disk
        with open("filtered_content.md", "w", encoding="utf-8") as f:
            f.write("\n".join(filtered_content))
        # Show token usage
        filter.show_usage()
 if __name__ == "__main__":
    asyncio.run(test_llm_filter())
--- a/docs/examples/quickstart_async.config.py
+++ b/docs/examples/quickstart_async.config.py
@@ -1,18 +1,23 @@
 import os, sys
-sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))
-os.environ['FIRECRAWL_API_KEY'] = "fc-84b370ccfad44beabc686b38f1769692"
+
+sys.path.append(
+    os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+)
 import asyncio
 import time
 import json
 import re
-from typing import Dict, List
+from typing import Dict
 from bs4 import BeautifulSoup
 from pydantic import BaseModel, Field
 from crawl4ai import AsyncWebCrawler, CacheMode, BrowserConfig, CrawlerRunConfig
 from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
-from crawl4ai.content_filter_strategy import BM25ContentFilter, PruningContentFilter
+from crawl4ai.content_filter_strategy import PruningContentFilter
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, LLMExtractionStrategy
+from crawl4ai.extraction_strategy import (
    JsonCssExtractionStrategy,
    LLMExtractionStrategy,
 )
 __location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
@@ -21,108 +26,172 @@ print("GitHub Repository: https://github.com/unclecode/crawl4ai")
 print("Twitter: @unclecode")
 print("Website: https://crawl4ai.com")
 # Basic Example - Simple Crawl
 async def simple_crawl():
    print("\n--- Basic Usage ---")
    browser_config = BrowserConfig(headless=True)
-    crawler_config = CrawlerRunConfig(
-        cache_mode=CacheMode.BYPASS
-    )
+    crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
-            url="https://www.nbcnews.com/business",
-            config=crawler_config
+            url="https://www.nbcnews.com/business", config=crawler_config
        )
        print(result.markdown[:500])
 async def clean_content():
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        excluded_tags=["nav", "footer", "aside"],
        remove_overlay_elements=True,
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=PruningContentFilter(
                threshold=0.48, threshold_type="fixed", min_word_threshold=0
            ),
            options={"ignore_links": True},
        ),
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/Apple",
            config=crawler_config,
        )
        full_markdown_length = len(result.markdown_v2.raw_markdown)
        fit_markdown_length = len(result.markdown_v2.fit_markdown)
        print(f"Full Markdown Length: {full_markdown_length}")
        print(f"Fit Markdown Length: {fit_markdown_length}")
 async def link_analysis():
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,
        exclude_external_links=True,
        exclude_social_media_links=True,
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            config=crawler_config,
        )
        print(f"Found {len(result.links['internal'])} internal links")
        print(f"Found {len(result.links['external'])} external links")
        for link in result.links["internal"][:5]:
            print(f"Href: {link['href']}\nText: {link['text']}\n")
 # JavaScript Execution Example
 async def simple_example_with_running_js_code():
    print("\n--- Executing JavaScript and Using CSS Selectors ---")
-    browser_config = BrowserConfig(
-        headless=True,
-        java_script_enabled=True
-    )
+    browser_config = BrowserConfig(headless=True, java_script_enabled=True)
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
-        js_code=["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"],
+        js_code="const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();",
        # wait_for="() => { return Array.from(document.querySelectorAll('article.tease-card')).length > 10; }"
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
-            url="https://www.nbcnews.com/business",
-            config=crawler_config
+            url="https://www.nbcnews.com/business", config=crawler_config
        )
        print(result.markdown[:500])
 # CSS Selector Example
 async def simple_example_with_css_selector():
    print("\n--- Using CSS Selectors ---")
    browser_config = BrowserConfig(headless=True)
    crawler_config = CrawlerRunConfig(
-        cache_mode=CacheMode.BYPASS,
-        css_selector=".wide-tease-item__description"
+        cache_mode=CacheMode.BYPASS, css_selector=".wide-tease-item__description"
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
-            url="https://www.nbcnews.com/business",
-            config=crawler_config
+            url="https://www.nbcnews.com/business", config=crawler_config
        )
        print(result.markdown[:500])
 async def media_handling():
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS, exclude_external_images=True, screenshot=True
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business", config=crawler_config
        )
        for img in result.media["images"][:5]:
            print(f"Image URL: {img['src']}, Alt: {img['alt']}, Score: {img['score']}")
 async def custom_hook_workflow(verbose=True):
    async with AsyncWebCrawler() as crawler:
        # Set a 'before_goto' hook to run custom code just before navigation
        crawler.crawler_strategy.set_hook(
            "before_goto",
            lambda page, context: print("[Hook] Preparing to navigate..."),
        )
        # Perform the crawl operation
        result = await crawler.arun(url="https://crawl4ai.com")
        print(result.markdown_v2.raw_markdown[:500].replace("\n", " -- "))
 # Proxy Example
 async def use_proxy():
    print("\n--- Using a Proxy ---")
    browser_config = BrowserConfig(
        headless=True,
-        proxy="http://your-proxy-url:port"
-    )
-    crawler_config = CrawlerRunConfig(
-        cache_mode=CacheMode.BYPASS
+        proxy_config={
+            "server": "http://proxy.example.com:8080",
+            "username": "username",
+            "password": "password",
+        },
    )
+    crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
-            url="https://www.nbcnews.com/business",
-            config=crawler_config
+            url="https://www.nbcnews.com/business", config=crawler_config
        )
        if result.success:
            print(result.markdown[:500])
 # Screenshot Example
 async def capture_and_save_screenshot(url: str, output_path: str):
    browser_config = BrowserConfig(headless=True)
-    crawler_config = CrawlerRunConfig(
-        cache_mode=CacheMode.BYPASS,
-        screenshot=True
-    )
+    crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, screenshot=True)
    async with AsyncWebCrawler(config=browser_config) as crawler:
-        result = await crawler.arun(
-            url=url,
-            config=crawler_config
-        )
+        result = await crawler.arun(url=url, config=crawler_config)
        if result.success and result.screenshot:
            import base64
            screenshot_data = base64.b64decode(result.screenshot)
-            with open(output_path, 'wb') as f:
+            with open(output_path, "wb") as f:
                f.write(screenshot_data)
            print(f"Screenshot saved successfully to {output_path}")
        else:
            print("Failed to capture screenshot")
 # LLM Extraction Example
 class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
-    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
+    output_fee: str = Field(
+        ..., description="Fee for output token for the OpenAI model."
+    )
-async def extract_structured_data_using_llm(provider: str, api_token: str = None, extra_headers: Dict[str, str] = None):
+
+async def extract_structured_data_using_llm(
+    provider: str, api_token: str = None, extra_headers: Dict[str, str] = None
+):
    print(f"\n--- Extracting Structured Data with {provider} ---")
    if api_token is None and provider != "ollama":
@@ -131,18 +200,14 @@ async def extract_structured_data_using_llm(provider: str, api_token: str = None
    browser_config = BrowserConfig(headless=True)
-    extra_args = {
-        "temperature": 0,
-        "top_p": 0.9,
-        "max_tokens": 2000
-    }
+    extra_args = {"temperature": 0, "top_p": 0.9, "max_tokens": 2000}
    if extra_headers:
        extra_args["extra_headers"] = extra_headers
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=1,
-        page_timeout = 80000,
+        page_timeout=80000,
        extraction_strategy=LLMExtractionStrategy(
            provider=provider,
            api_token=api_token,
@@ -150,23 +215,23 @@ async def extract_structured_data_using_llm(provider: str, api_token: str = None
            extraction_type="schema",
            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. 
            Do not miss any models in the entire content.""",
-            extra_args=extra_args
+            extra_args=extra_args,
-        )
+        ),
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
-            url="https://openai.com/api/pricing/",
-            config=crawler_config
+            url="https://openai.com/api/pricing/", config=crawler_config
        )
        print(result.extracted_content)
 # CSS Extraction Example
 async def extract_structured_data_using_css_extractor():
    print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
    schema = {
        "name": "KidoCode Courses",
-        "baseSelector": "section.charge-methodology .w-tab-content > div",
+        "baseSelector": "section.charge-methodology .framework-collection-item.w-dyn-item",
        "fields": [
            {
                "name": "section_title",
@@ -192,15 +257,12 @@ async def extract_structured_data_using_css_extractor():
                "name": "course_icon",
                "selector": ".image-92",
                "type": "attribute",
-                "attribute": "src"
+                "attribute": "src",
-            }
+            },
-        ]
+        ],
    }
-    browser_config = BrowserConfig(
-        headless=True,
-        java_script_enabled=True
-    )
+    browser_config = BrowserConfig(headless=True, java_script_enabled=True)
    js_click_tabs = """
    (async () => {
@@ -216,19 +278,20 @@ async def extract_structured_data_using_css_extractor():
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema),
-        js_code=[js_click_tabs]
+        js_code=[js_click_tabs],
        delay_before_return_html=1
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
-            url="https://www.kidocode.com/degrees/technology",
-            config=crawler_config
+            url="https://www.kidocode.com/degrees/technology", config=crawler_config
        )
        companies = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(companies)} companies")
        print(json.dumps(companies[0], indent=2))
 # Dynamic Content Examples - Method 1
 async def crawl_dynamic_content_pages_method_1():
    print("\n--- Advanced Multi-Page Crawling with JavaScript Execution ---")
@@ -249,10 +312,7 @@ async def crawl_dynamic_content_pages_method_1():
        except Exception as e:
            print(f"Warning: New content didn't appear after JavaScript execution: {e}")
-    browser_config = BrowserConfig(
-        headless=False,
-        java_script_enabled=True
-    )
+    browser_config = BrowserConfig(headless=False, java_script_enabled=True)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
@@ -272,7 +332,7 @@ async def crawl_dynamic_content_pages_method_1():
                css_selector="li.Box-sc-g0xbh4-0",
                js_code=js_next_page if page > 0 else None,
                js_only=page > 0,
-                session_id=session_id
+                session_id=session_id,
            )
            result = await crawler.arun(url=url, config=crawler_config)
@@ -286,14 +346,12 @@ async def crawl_dynamic_content_pages_method_1():
        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
 # Dynamic Content Examples - Method 2
 async def crawl_dynamic_content_pages_method_2():
    print("\n--- Advanced Multi-Page Crawling with JavaScript Execution ---")
-    browser_config = BrowserConfig(
-        headless=False,
-        java_script_enabled=True
-    )
+    browser_config = BrowserConfig(headless=False, java_script_enabled=True)
    js_next_page_and_wait = """
    (async () => {
@@ -343,7 +401,7 @@ async def crawl_dynamic_content_pages_method_2():
                extraction_strategy=extraction_strategy,
                js_code=js_next_page_and_wait if page > 0 else None,
                js_only=page > 0,
-                session_id=session_id
+                session_id=session_id,
            )
            result = await crawler.arun(url=url, config=crawler_config)
@@ -355,88 +413,132 @@ async def crawl_dynamic_content_pages_method_2():
        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
 async def cosine_similarity_extraction():
    crawl_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=CosineStrategy(
            word_count_threshold=10,
            max_dist=0.2,  # Maximum distance between two words
            linkage_method="ward",  # Linkage method for hierarchical clustering (ward, complete, average, single)
            top_k=3,  # Number of top keywords to extract
            sim_threshold=0.3,  # Similarity threshold for clustering
            semantic_filter="McDonald's economic impact, American consumer trends",  # Keywords to filter the content semantically using embeddings
            verbose=True,
        ),
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business/consumer/how-mcdonalds-e-coli-crisis-inflation-politics-reflect-american-story-rcna177156",
            config=crawl_config,
        )
        print(json.loads(result.extracted_content)[:5])
 # Browser Comparison
 async def crawl_custom_browser_type():
    print("\n--- Browser Comparison ---")
    # Firefox
-    browser_config_firefox = BrowserConfig(
-        browser_type="firefox",
-        headless=True
-    )
+    browser_config_firefox = BrowserConfig(browser_type="firefox", headless=True)
    start = time.time()
    async with AsyncWebCrawler(config=browser_config_firefox) as crawler:
        result = await crawler.arun(
            url="https://www.example.com",
-            config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
+            config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS),
        )
        print("Firefox:", time.time() - start)
        print(result.markdown[:500])
    # WebKit
-    browser_config_webkit = BrowserConfig(
-        browser_type="webkit",
-        headless=True
-    )
+    browser_config_webkit = BrowserConfig(browser_type="webkit", headless=True)
    start = time.time()
    async with AsyncWebCrawler(config=browser_config_webkit) as crawler:
        result = await crawler.arun(
            url="https://www.example.com",
-            config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
+            config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS),
        )
        print("WebKit:", time.time() - start)
        print(result.markdown[:500])
    # Chromium (default)
-    browser_config_chromium = BrowserConfig(
-        browser_type="chromium",
-        headless=True
-    )
+    browser_config_chromium = BrowserConfig(browser_type="chromium", headless=True)
    start = time.time()
    async with AsyncWebCrawler(config=browser_config_chromium) as crawler:
        result = await crawler.arun(
            url="https://www.example.com",
-            config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
+            config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS),
        )
        print("Chromium:", time.time() - start)
        print(result.markdown[:500])
 # Anti-Bot and User Simulation
 async def crawl_with_user_simulation():
    browser_config = BrowserConfig(
        headless=True,
        user_agent_mode="random",
-        user_agent_generator_config={
-            "device_type": "mobile",
-            "os_type": "android"
-        }
+        user_agent_generator_config={"device_type": "mobile", "os_type": "android"},
    )
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        magic=True,
        simulate_user=True,
-        override_navigator=True
+        override_navigator=True,
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
-        result = await crawler.arun(
-            url="YOUR-URL-HERE",
-            config=crawler_config
-        )
+        result = await crawler.arun(url="YOUR-URL-HERE", config=crawler_config)
        print(result.markdown)
 async def ssl_certification():
    # Configure crawler to fetch SSL certificate
    config = CrawlerRunConfig(
        fetch_ssl_certificate=True,
        cache_mode=CacheMode.BYPASS,  # Bypass cache to always get fresh certificates
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        if result.success and result.ssl_certificate:
            cert = result.ssl_certificate
            # 1. Access certificate properties directly
            print("\nCertificate Information:")
            print(f"Issuer: {cert.issuer.get('CN', '')}")
            print(f"Valid until: {cert.valid_until}")
            print(f"Fingerprint: {cert.fingerprint}")
            # 2. Export certificate in different formats
            cert.to_json(os.path.join(tmp_dir, "certificate.json"))  # For analysis
            print("\nCertificate exported to:")
            print(f"- JSON: {os.path.join(tmp_dir, 'certificate.json')}")
            pem_data = cert.to_pem(
                os.path.join(tmp_dir, "certificate.pem")
            )  # For web servers
            print(f"- PEM: {os.path.join(tmp_dir, 'certificate.pem')}")
            der_data = cert.to_der(
                os.path.join(tmp_dir, "certificate.der")
            )  # For Java apps
            print(f"- DER: {os.path.join(tmp_dir, 'certificate.der')}")
 # Speed Comparison
 async def speed_comparison():
    print("\n--- Speed Comparison ---")
    # Firecrawl comparison
    from firecrawl import FirecrawlApp
-    app = FirecrawlApp(api_key=os.environ['FIRECRAWL_API_KEY'])
+
+    app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
    start = time.time()
    scrape_status = app.scrape_url(
-        'https://www.nbcnews.com/business',
-        params={'formats': ['markdown', 'html']}
+        "https://www.nbcnews.com/business", params={"formats": ["markdown", "html"]}
    )
    end = time.time()
    print("Firecrawl:")
@@ -454,9 +556,8 @@ async def speed_comparison():
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            config=CrawlerRunConfig(
-                cache_mode=CacheMode.BYPASS,
-                word_count_threshold=0
-            )
+                cache_mode=CacheMode.BYPASS, word_count_threshold=0
+            ),
        )
        end = time.time()
        print("Crawl4AI (simple crawl):")
@@ -474,12 +575,10 @@ async def speed_comparison():
                word_count_threshold=0,
                markdown_generator=DefaultMarkdownGenerator(
                    content_filter=PruningContentFilter(
-                        threshold=0.48,
-                        threshold_type="fixed",
-                        min_word_threshold=0
-                    )
-                )
+                        threshold=0.48, threshold_type="fixed", min_word_threshold=0
+                    )
+                ),
            ),
        )
        end = time.time()
        print("Crawl4AI (Markdown Plus):")
@@ -489,30 +588,31 @@ async def speed_comparison():
        print(f"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}")
        print()
 # Main execution
 async def main():
    # Basic examples
-    # await simple_crawl()
+    await simple_crawl()
-    # await simple_example_with_running_js_code()
+    await simple_example_with_running_js_code()
-    # await simple_example_with_css_selector()
+    await simple_example_with_css_selector()
    # Advanced examples
-    # await extract_structured_data_using_css_extractor()
+    await extract_structured_data_using_css_extractor()
-    await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"))
-    # await crawl_dynamic_content_pages_method_1()
-    # await crawl_dynamic_content_pages_method_2()
+    await extract_structured_data_using_llm(
+        "openai/gpt-4o", os.getenv("OPENAI_API_KEY")
+    )
+    await crawl_dynamic_content_pages_method_1()
+    await crawl_dynamic_content_pages_method_2()
    # Browser comparisons
-    # await crawl_custom_browser_type()
+    await crawl_custom_browser_type()
    # Performance testing
    # await speed_comparison()
    # Screenshot example
-    # await capture_and_save_screenshot(
+    await capture_and_save_screenshot(
-    #     "https://www.example.com",
+        "https://www.example.com",
-    #     os.path.join(__location__, "tmp/example_screenshot.jpg")
+        os.path.join(__location__, "tmp/example_screenshot.jpg")
-    # )
+    )
 if __name__ == "__main__":
    asyncio.run(main())
--- a/docs/examples/quickstart_async.py
+++ b/docs/examples/quickstart_async.py
@@ -1,6 +1,10 @@
 import os, sys
 # append parent directory to system path
-sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))); os.environ['FIRECRAWL_API_KEY'] = "fc-84b370ccfad44beabc686b38f1769692";
+sys.path.append(
+    os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+)
+os.environ["FIRECRAWL_API_KEY"] = "fc-84b370ccfad44beabc686b38f1769692"
 import asyncio
 # import nest_asyncio
@@ -15,7 +19,7 @@ from bs4 import BeautifulSoup
 from pydantic import BaseModel, Field
 from crawl4ai import AsyncWebCrawler, CacheMode
 from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
-from crawl4ai.content_filter_strategy import BM25ContentFilter, PruningContentFilter
+from crawl4ai.content_filter_strategy import PruningContentFilter
 from crawl4ai.extraction_strategy import (
    JsonCssExtractionStrategy,
    LLMExtractionStrategy,
@@ -32,9 +36,12 @@ print("Website: https://crawl4ai.com")
 async def simple_crawl():
    print("\n--- Basic Usage ---")
    async with AsyncWebCrawler(verbose=True) as crawler:
-        result = await crawler.arun(url="https://www.nbcnews.com/business", cache_mode= CacheMode.BYPASS)
+        result = await crawler.arun(
+            url="https://www.nbcnews.com/business", cache_mode=CacheMode.BYPASS
+        )
        print(result.markdown[:500])  # Print first 500 characters
 async def simple_example_with_running_js_code():
    print("\n--- Executing JavaScript and Using CSS Selectors ---")
    # New code to handle the wait_for parameter
@@ -57,6 +64,7 @@ async def simple_example_with_running_js_code():
        )
        print(result.markdown[:500])  # Print first 500 characters
 async def simple_example_with_css_selector():
    print("\n--- Using CSS Selectors ---")
    async with AsyncWebCrawler(verbose=True) as crawler:
@@ -67,26 +75,27 @@ async def simple_example_with_css_selector():
        )
        print(result.markdown[:500])  # Print first 500 characters
 async def use_proxy():
    print("\n--- Using a Proxy ---")
    print(
        "Note: Replace 'http://your-proxy-url:port' with a working proxy to run this example."
    )
    # Uncomment and modify the following lines to use a proxy
-    async with AsyncWebCrawler(verbose=True, proxy="http://your-proxy-url:port") as crawler:
+    async with AsyncWebCrawler(
+        verbose=True, proxy="http://your-proxy-url:port"
+    ) as crawler:
        result = await crawler.arun(
-            url="https://www.nbcnews.com/business",
-            cache_mode= CacheMode.BYPASS
+            url="https://www.nbcnews.com/business", cache_mode=CacheMode.BYPASS
        )
        if result.success:
            print(result.markdown[:500])  # Print first 500 characters
 async def capture_and_save_screenshot(url: str, output_path: str):
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
-            url=url,
-            screenshot=True,
-            cache_mode= CacheMode.BYPASS
+            url=url, screenshot=True, cache_mode=CacheMode.BYPASS
        )
        if result.success and result.screenshot:
@@ -96,13 +105,14 @@ async def capture_and_save_screenshot(url: str, output_path: str):
            screenshot_data = base64.b64decode(result.screenshot)
            # Save the screenshot as a JPEG file
-            with open(output_path, 'wb') as f:
+            with open(output_path, "wb") as f:
                f.write(screenshot_data)
            print(f"Screenshot saved successfully to {output_path}")
        else:
            print("Failed to capture screenshot")
 class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
@@ -110,7 +120,10 @@ class OpenAIModelFee(BaseModel):
        ..., description="Fee for output token for the OpenAI model."
    )
-async def extract_structured_data_using_llm(provider: str, api_token: str = None, extra_headers: Dict[str, str] = None):
+
+async def extract_structured_data_using_llm(
+    provider: str, api_token: str = None, extra_headers: Dict[str, str] = None
+):
    print(f"\n--- Extracting Structured Data with {provider} ---")
    if api_token is None and provider != "ollama":
@@ -118,7 +131,7 @@ async def extract_structured_data_using_llm(provider: str, api_token: str = None
        return
    # extra_args = {}
-    extra_args={
+    extra_args = {
        "temperature": 0,
        "top_p": 0.9,
        "max_tokens": 2000,
@@ -139,12 +152,13 @@ async def extract_structured_data_using_llm(provider: str, api_token: str = None
                instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. 
                Do not miss any models in the entire content. One extracted model JSON format should look like this: 
                {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}.""",
-                extra_args=extra_args
+                extra_args=extra_args,
            ),
            cache_mode=CacheMode.BYPASS,
        )
        print(result.extracted_content)
 async def extract_structured_data_using_css_extractor():
    print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
    schema = {
@@ -175,16 +189,12 @@ async def extract_structured_data_using_css_extractor():
                "name": "course_icon",
                "selector": ".image-92",
                "type": "attribute",
-            "attribute": "src"
+                "attribute": "src",
            },
        ],
    }
    ]
 }
-    async with AsyncWebCrawler(
-        headless=True,
-        verbose=True
-    ) as crawler:
+    async with AsyncWebCrawler(headless=True, verbose=True) as crawler:
        # Create the JavaScript that handles clicking multiple times
        js_click_tabs = """
        (async () => {
@@ -204,13 +214,14 @@ async def extract_structured_data_using_css_extractor():
            url="https://www.kidocode.com/degrees/technology",
            extraction_strategy=JsonCssExtractionStrategy(schema, verbose=True),
            js_code=[js_click_tabs],
-            cache_mode=CacheMode.BYPASS
+            cache_mode=CacheMode.BYPASS,
        )
        companies = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(companies)} companies")
        print(json.dumps(companies[0], indent=2))
 # Advanced Session-Based Crawling with Dynamic Content 🔄
 async def crawl_dynamic_content_pages_method_1():
    print("\n--- Advanced Multi-Page Crawling with JavaScript Execution ---")
@@ -267,6 +278,7 @@ async def crawl_dynamic_content_pages_method_1():
        await crawler.crawler_strategy.kill_session(session_id)
        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
 async def crawl_dynamic_content_pages_method_2():
    print("\n--- Advanced Multi-Page Crawling with JavaScript Execution ---")
@@ -334,8 +346,11 @@ async def crawl_dynamic_content_pages_method_2():
        await crawler.crawler_strategy.kill_session(session_id)
        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
 async def crawl_dynamic_content_pages_method_3():
-    print("\n--- Advanced Multi-Page Crawling with JavaScript Execution using `wait_for` ---")
+    print(
+        "\n--- Advanced Multi-Page Crawling with JavaScript Execution using `wait_for` ---"
+    )
    async with AsyncWebCrawler(verbose=True) as crawler:
        url = "https://github.com/microsoft/TypeScript/commits/main"
@@ -395,41 +410,54 @@ async def crawl_dynamic_content_pages_method_3():
        await crawler.crawler_strategy.kill_session(session_id)
        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
 async def crawl_custom_browser_type():
    # Use Firefox
    start = time.time()
-    async with AsyncWebCrawler(browser_type="firefox", verbose=True, headless = True) as crawler:
-        result = await crawler.arun(url="https://www.example.com", cache_mode= CacheMode.BYPASS)
+    async with AsyncWebCrawler(
+        browser_type="firefox", verbose=True, headless=True
+    ) as crawler:
+        result = await crawler.arun(
+            url="https://www.example.com", cache_mode=CacheMode.BYPASS
+        )
        print(result.markdown[:500])
        print("Time taken: ", time.time() - start)
    # Use WebKit
    start = time.time()
-    async with AsyncWebCrawler(browser_type="webkit", verbose=True, headless = True) as crawler:
-        result = await crawler.arun(url="https://www.example.com", cache_mode= CacheMode.BYPASS)
+    async with AsyncWebCrawler(
+        browser_type="webkit", verbose=True, headless=True
+    ) as crawler:
+        result = await crawler.arun(
+            url="https://www.example.com", cache_mode=CacheMode.BYPASS
+        )
        print(result.markdown[:500])
        print("Time taken: ", time.time() - start)
    # Use Chromium (default)
    start = time.time()
-    async with AsyncWebCrawler(verbose=True, headless = True) as crawler:
-        result = await crawler.arun(url="https://www.example.com", cache_mode= CacheMode.BYPASS)
+    async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
+        result = await crawler.arun(
+            url="https://www.example.com", cache_mode=CacheMode.BYPASS
+        )
        print(result.markdown[:500])
        print("Time taken: ", time.time() - start)
 async def crawl_with_user_simultion():
    async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
        url = "YOUR-URL-HERE"
        result = await crawler.arun(
            url=url,
            cache_mode=CacheMode.BYPASS,
-            magic = True, # Automatically detects and removes overlays, popups, and other elements that block content
+            magic=True,  # Automatically detects and removes overlays, popups, and other elements that block content
            # simulate_user = True,# Causes a series of random mouse movements and clicks to simulate user interaction
            # override_navigator = True # Overrides the navigator object to make it look like a real user
        )
        print(result.markdown)
 async def speed_comparison():
    # print("\n--- Speed Comparison ---")
    # print("Firecrawl (simulated):")
@@ -439,11 +467,11 @@ async def speed_comparison():
    # print()
    # Simulated Firecrawl performance
    from firecrawl import FirecrawlApp
+
-    app = FirecrawlApp(api_key=os.environ['FIRECRAWL_API_KEY'])
+    app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
    start = time.time()
    scrape_status = app.scrape_url(
-    'https://www.nbcnews.com/business',
-    params={'formats': ['markdown', 'html']}
+        "https://www.nbcnews.com/business", params={"formats": ["markdown", "html"]}
    )
    end = time.time()
    print("Firecrawl:")
@@ -474,7 +502,9 @@ async def speed_comparison():
            url="https://www.nbcnews.com/business",
            word_count_threshold=0,
            markdown_generator=DefaultMarkdownGenerator(
-                content_filter = PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
+                content_filter=PruningContentFilter(
+                    threshold=0.48, threshold_type="fixed", min_word_threshold=0
+                )
                # content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
            ),
            cache_mode=CacheMode.BYPASS,
@@ -498,7 +528,9 @@ async def speed_comparison():
            word_count_threshold=0,
            cache_mode=CacheMode.BYPASS,
            markdown_generator=DefaultMarkdownGenerator(
-                content_filter = PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
+                content_filter=PruningContentFilter(
+                    threshold=0.48, threshold_type="fixed", min_word_threshold=0
+                )
                # content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
            ),
            verbose=False,
@@ -520,6 +552,7 @@ async def speed_comparison():
    print("If you run these tests in an environment with better network conditions,")
    print("you may observe an even more significant speed advantage for Crawl4AI.")
 async def generate_knowledge_graph():
    class Entity(BaseModel):
        name: str
@@ -536,11 +569,11 @@ async def generate_knowledge_graph():
        relationships: List[Relationship]
    extraction_strategy = LLMExtractionStrategy(
-            provider='openai/gpt-4o-mini', # Or any other provider, including Ollama and open source models
+        provider="openai/gpt-4o-mini",  # Or any other provider, including Ollama and open source models
-            api_token=os.getenv('OPENAI_API_KEY'), # In case of Ollama just pass "no-token"
+        api_token=os.getenv("OPENAI_API_KEY"),  # In case of Ollama just pass "no-token"
        schema=KnowledgeGraph.model_json_schema(),
        extraction_type="schema",
-            instruction="""Extract entities and relationships from the given text."""
+        instruction="""Extract entities and relationships from the given text.""",
    )
    async with AsyncWebCrawler() as crawler:
        url = "https://paulgraham.com/love.html"
@@ -554,27 +587,22 @@ async def generate_knowledge_graph():
        with open(os.path.join(__location__, "kb.json"), "w") as f:
            f.write(result.extracted_content)
 async def fit_markdown_remove_overlay():
    async with AsyncWebCrawler(
        headless=True,  # Set to False to see what is happening
        verbose=True,
        user_agent_mode="random",
-            user_agent_generator_config={
-                "device_type": "mobile",
-                "os_type": "android"
-            },
+        user_agent_generator_config={"device_type": "mobile", "os_type": "android"},
    ) as crawler:
        result = await crawler.arun(
-            url='https://www.kidocode.com/degrees/technology',
+            url="https://www.kidocode.com/degrees/technology",
            cache_mode=CacheMode.BYPASS,
            markdown_generator=DefaultMarkdownGenerator(
                content_filter=PruningContentFilter(
                    threshold=0.48, threshold_type="fixed", min_word_threshold=0
                ),
-                options={
-                    "ignore_links": True
-                }
+                options={"ignore_links": True},
            ),
            # markdown_generator=DefaultMarkdownGenerator(
            #     content_filter=BM25ContentFilter(user_query="", bm25_threshold=1.0),
@@ -593,13 +621,20 @@ async def fit_markdown_remove_overlay():
            with open(os.path.join(__location__, "output/cleaned_html.html"), "w") as f:
                f.write(result.cleaned_html)
-            with open(os.path.join(__location__, "output/output_raw_markdown.md"), "w") as f:
+            with open(
+                os.path.join(__location__, "output/output_raw_markdown.md"), "w"
+            ) as f:
                f.write(result.markdown_v2.raw_markdown)
-            with open(os.path.join(__location__, "output/output_markdown_with_citations.md"), "w") as f:
+            with open(
+                os.path.join(__location__, "output/output_markdown_with_citations.md"),
+                "w",
+            ) as f:
                f.write(result.markdown_v2.markdown_with_citations)
-            with open(os.path.join(__location__, "output/output_fit_markdown.md"), "w") as f:
+            with open(
+                os.path.join(__location__, "output/output_fit_markdown.md"), "w"
+            ) as f:
                f.write(result.markdown_v2.fit_markdown)
    print("Done")
@@ -627,13 +662,13 @@ async def main():
    # }
    # await extract_structured_data_using_llm(extra_headers=custom_headers)
-    await crawl_dynamic_content_pages_method_1()
+    # await crawl_dynamic_content_pages_method_1()
-    await crawl_dynamic_content_pages_method_2()
+    # await crawl_dynamic_content_pages_method_2()
    await crawl_dynamic_content_pages_method_3()
-    await crawl_custom_browser_type()
+    # await crawl_custom_browser_type()
-    await speed_comparison()
+    # await speed_comparison()
 if __name__ == "__main__":
--- a/docs/examples/quickstart_sync.py
+++ b/docs/examples/quickstart_sync.py
@@ -10,15 +10,17 @@ from functools import lru_cache
 console = Console()
@lru_cache()
 def create_crawler():
    crawler = WebCrawler(verbose=True)
    crawler.warmup()
    return crawler
 def print_result(result):
    # Print each key in one line and just the first 10 characters of each one's value and three dots
-    console.print(f"\t[bold]Result:[/bold]")
+    console.print("\t[bold]Result:[/bold]")
    for key, value in result.model_dump().items():
        if isinstance(value, str) and value:
            console.print(f"\t{key}: [green]{value[:20]}...[/green]")
@@ -33,18 +35,27 @@ def cprint(message, press_any_key=False):
        console.print("Press any key to continue...", style="")
        input()
 def basic_usage(crawler):
-    cprint("🛠️ [bold cyan]Basic Usage: Simply provide a URL and let Crawl4ai do the magic![/bold cyan]")
-    result = crawler.run(url="https://www.nbcnews.com/business", only_text = True)
+    cprint(
+        "🛠️ [bold cyan]Basic Usage: Simply provide a URL and let Crawl4ai do the magic![/bold cyan]"
+    )
+    result = crawler.run(url="https://www.nbcnews.com/business", only_text=True)
    cprint("[LOG] 📦 [bold yellow]Basic crawl result:[/bold yellow]")
    print_result(result)
 def basic_usage_some_params(crawler):
-    cprint("🛠️ [bold cyan]Basic Usage: Simply provide a URL and let Crawl4ai do the magic![/bold cyan]")
-    result = crawler.run(url="https://www.nbcnews.com/business", word_count_threshold=1, only_text = True)
+    cprint(
+        "🛠️ [bold cyan]Basic Usage: Simply provide a URL and let Crawl4ai do the magic![/bold cyan]"
+    )
+    result = crawler.run(
+        url="https://www.nbcnews.com/business", word_count_threshold=1, only_text=True
+    )
    cprint("[LOG] 📦 [bold yellow]Basic crawl result:[/bold yellow]")
    print_result(result)
 def screenshot_usage(crawler):
    cprint("\n📸 [bold cyan]Let's take a screenshot of the page![/bold cyan]")
    result = crawler.run(url="https://www.nbcnews.com/business", screenshot=True)
@@ -55,16 +66,23 @@ def screenshot_usage(crawler):
    cprint("Screenshot saved to 'screenshot.png'!")
    print_result(result)
 def understanding_parameters(crawler):
-    cprint("\n🧠 [bold cyan]Understanding 'bypass_cache' and 'include_raw_html' parameters:[/bold cyan]")
-    cprint("By default, Crawl4ai caches the results of your crawls. This means that subsequent crawls of the same URL will be much faster! Let's see this in action.")
+    cprint(
+        "\n🧠 [bold cyan]Understanding 'bypass_cache' and 'include_raw_html' parameters:[/bold cyan]"
+    )
+    cprint(
+        "By default, Crawl4ai caches the results of your crawls. This means that subsequent crawls of the same URL will be much faster! Let's see this in action."
+    )
    # First crawl (reads from cache)
    cprint("1️⃣ First crawl (caches the result):", True)
    start_time = time.time()
    result = crawler.run(url="https://www.nbcnews.com/business")
    end_time = time.time()
-    cprint(f"[LOG] 📦 [bold yellow]First crawl took {end_time - start_time} seconds and result (from cache):[/bold yellow]")
+    cprint(
+        f"[LOG] 📦 [bold yellow]First crawl took {end_time - start_time} seconds and result (from cache):[/bold yellow]"
+    )
    print_result(result)
    # Force to crawl again
@@ -72,132 +90,194 @@ def understanding_parameters(crawler):
    start_time = time.time()
    result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)
    end_time = time.time()
-    cprint(f"[LOG] 📦 [bold yellow]Second crawl took {end_time - start_time} seconds and result (forced to crawl):[/bold yellow]")
+    cprint(
+        f"[LOG] 📦 [bold yellow]Second crawl took {end_time - start_time} seconds and result (forced to crawl):[/bold yellow]"
+    )
    print_result(result)
 def add_chunking_strategy(crawler):
    # Adding a chunking strategy: RegexChunking
-    cprint("\n🧩 [bold cyan]Let's add a chunking strategy: RegexChunking![/bold cyan]", True)
-    cprint("RegexChunking is a simple chunking strategy that splits the text based on a given regex pattern. Let's see it in action!")
+    cprint(
+        "\n🧩 [bold cyan]Let's add a chunking strategy: RegexChunking![/bold cyan]",
+        True,
+    )
+    cprint(
+        "RegexChunking is a simple chunking strategy that splits the text based on a given regex pattern. Let's see it in action!"
+    )
    result = crawler.run(
        url="https://www.nbcnews.com/business",
-        chunking_strategy=RegexChunking(patterns=["\n\n"])
+        chunking_strategy=RegexChunking(patterns=["\n\n"]),
    )
    cprint("[LOG] 📦 [bold yellow]RegexChunking result:[/bold yellow]")
    print_result(result)
    # Adding another chunking strategy: NlpSentenceChunking
-    cprint("\n🔍 [bold cyan]Time to explore another chunking strategy: NlpSentenceChunking![/bold cyan]", True)
-    cprint("NlpSentenceChunking uses NLP techniques to split the text into sentences. Let's see how it performs!")
+    cprint(
+        "\n🔍 [bold cyan]Time to explore another chunking strategy: NlpSentenceChunking![/bold cyan]",
+        True,
+    )
+    cprint(
+        "NlpSentenceChunking uses NLP techniques to split the text into sentences. Let's see how it performs!"
+    )
    result = crawler.run(
-        url="https://www.nbcnews.com/business",
-        chunking_strategy=NlpSentenceChunking()
+        url="https://www.nbcnews.com/business", chunking_strategy=NlpSentenceChunking()
    )
    cprint("[LOG] 📦 [bold yellow]NlpSentenceChunking result:[/bold yellow]")
    print_result(result)
 def add_extraction_strategy(crawler):
    # Adding an extraction strategy: CosineStrategy
-    cprint("\n🧠 [bold cyan]Let's get smarter with an extraction strategy: CosineStrategy![/bold cyan]", True)
-    cprint("CosineStrategy uses cosine similarity to extract semantically similar blocks of text. Let's see it in action!")
+    cprint(
+        "\n🧠 [bold cyan]Let's get smarter with an extraction strategy: CosineStrategy![/bold cyan]",
+        True,
+    )
+    cprint(
+        "CosineStrategy uses cosine similarity to extract semantically similar blocks of text. Let's see it in action!"
+    )
    result = crawler.run(
        url="https://www.nbcnews.com/business",
-        extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3, sim_threshold = 0.3, verbose=True)
+        extraction_strategy=CosineStrategy(
+            word_count_threshold=10,
+            max_dist=0.2,
+            linkage_method="ward",
+            top_k=3,
+            sim_threshold=0.3,
+            verbose=True,
+        ),
    )
    cprint("[LOG] 📦 [bold yellow]CosineStrategy result:[/bold yellow]")
    print_result(result)
    # Using semantic_filter with CosineStrategy
-    cprint("You can pass other parameters like 'semantic_filter' to the CosineStrategy to extract semantically similar blocks of text. Let's see it in action!")
+    cprint(
+        "You can pass other parameters like 'semantic_filter' to the CosineStrategy to extract semantically similar blocks of text. Let's see it in action!"
+    )
    result = crawler.run(
        url="https://www.nbcnews.com/business",
        extraction_strategy=CosineStrategy(
            semantic_filter="inflation rent prices",
        ),
    )
-    cprint("[LOG] 📦 [bold yellow]CosineStrategy result with semantic filter:[/bold yellow]")
+    cprint(
+        "[LOG] 📦 [bold yellow]CosineStrategy result with semantic filter:[/bold yellow]"
+    )
    print_result(result)
 def add_llm_extraction_strategy(crawler):
    # Adding an LLM extraction strategy without instructions
-    cprint("\n🤖 [bold cyan]Time to bring in the big guns: LLMExtractionStrategy without instructions![/bold cyan]", True)
-    cprint("LLMExtractionStrategy uses a large language model to extract relevant information from the web page. Let's see it in action!")
+    cprint(
+        "\n🤖 [bold cyan]Time to bring in the big guns: LLMExtractionStrategy without instructions![/bold cyan]",
+        True,
+    )
+    cprint(
+        "LLMExtractionStrategy uses a large language model to extract relevant information from the web page. Let's see it in action!"
+    )
    result = crawler.run(
        url="https://www.nbcnews.com/business",
-        extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'))
+        extraction_strategy=LLMExtractionStrategy(
+            provider="openai/gpt-4o", api_token=os.getenv("OPENAI_API_KEY")
+        ),
    )
-    cprint("[LOG] 📦 [bold yellow]LLMExtractionStrategy (no instructions) result:[/bold yellow]")
+    cprint(
+        "[LOG] 📦 [bold yellow]LLMExtractionStrategy (no instructions) result:[/bold yellow]"
+    )
    print_result(result)
    # Adding an LLM extraction strategy with instructions
-    cprint("\n📜 [bold cyan]Let's make it even more interesting: LLMExtractionStrategy with instructions![/bold cyan]", True)
-    cprint("Let's say we are only interested in financial news. Let's see how LLMExtractionStrategy performs with instructions!")
+    cprint(
+        "\n📜 [bold cyan]Let's make it even more interesting: LLMExtractionStrategy with instructions![/bold cyan]",
+        True,
+    )
+    cprint(
+        "Let's say we are only interested in financial news. Let's see how LLMExtractionStrategy performs with instructions!"
+    )
    result = crawler.run(
        url="https://www.nbcnews.com/business",
        extraction_strategy=LLMExtractionStrategy(
            provider="openai/gpt-4o",
-            api_token=os.getenv('OPENAI_API_KEY'),
+            api_token=os.getenv("OPENAI_API_KEY"),
-            instruction="I am interested in only financial news"
+            instruction="I am interested in only financial news",
        ),
    )
-    cprint("[LOG] 📦 [bold yellow]LLMExtractionStrategy (with instructions) result:[/bold yellow]")
+    cprint(
+        "[LOG] 📦 [bold yellow]LLMExtractionStrategy (with instructions) result:[/bold yellow]"
+    )
    print_result(result)
    result = crawler.run(
        url="https://www.nbcnews.com/business",
        extraction_strategy=LLMExtractionStrategy(
            provider="openai/gpt-4o",
-            api_token=os.getenv('OPENAI_API_KEY'),
+            api_token=os.getenv("OPENAI_API_KEY"),
-            instruction="Extract only content related to technology"
+            instruction="Extract only content related to technology",
        ),
    )
-    cprint("[LOG] 📦 [bold yellow]LLMExtractionStrategy (with technology instruction) result:[/bold yellow]")
+    cprint(
+        "[LOG] 📦 [bold yellow]LLMExtractionStrategy (with technology instruction) result:[/bold yellow]"
+    )
    print_result(result)
 def targeted_extraction(crawler):
    # Using a CSS selector to extract only H2 tags
-    cprint("\n🎯 [bold cyan]Targeted extraction: Let's use a CSS selector to extract only H2 tags![/bold cyan]", True)
-    result = crawler.run(
-        url="https://www.nbcnews.com/business",
-        css_selector="h2"
+    cprint(
+        "\n🎯 [bold cyan]Targeted extraction: Let's use a CSS selector to extract only H2 tags![/bold cyan]",
+        True,
    )
+    result = crawler.run(url="https://www.nbcnews.com/business", css_selector="h2")
    cprint("[LOG] 📦 [bold yellow]CSS Selector (H2 tags) result:[/bold yellow]")
    print_result(result)
 def interactive_extraction(crawler):
    # Passing JavaScript code to interact with the page
-    cprint("\n🖱️ [bold cyan]Let's get interactive: Passing JavaScript code to click 'Load More' button![/bold cyan]", True)
-    cprint("In this example we try to click the 'Load More' button on the page using JavaScript code.")
+    cprint(
+        "\n🖱️ [bold cyan]Let's get interactive: Passing JavaScript code to click 'Load More' button![/bold cyan]",
+        True,
+    )
+    cprint(
+        "In this example we try to click the 'Load More' button on the page using JavaScript code."
+    )
    js_code = """
    const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
    loadMoreButton && loadMoreButton.click();
    """
    # crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
    # crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
-    result = crawler.run(
-        url="https://www.nbcnews.com/business",
-        js = js_code
+    result = crawler.run(url="https://www.nbcnews.com/business", js=js_code)
+    cprint(
+        "[LOG] 📦 [bold yellow]JavaScript Code (Load More button) result:[/bold yellow]"
    )
-    cprint("[LOG] 📦 [bold yellow]JavaScript Code (Load More button) result:[/bold yellow]")
    print_result(result)
 def multiple_scrip(crawler):
    # Passing JavaScript code to interact with the page
-    cprint("\n🖱️ [bold cyan]Let's get interactive: Passing JavaScript code to click 'Load More' button![/bold cyan]", True)
-    cprint("In this example we try to click the 'Load More' button on the page using JavaScript code.")
-    js_code = ["""
+    cprint(
+        "\n🖱️ [bold cyan]Let's get interactive: Passing JavaScript code to click 'Load More' button![/bold cyan]",
+        True,
+    )
+    cprint(
+        "In this example we try to click the 'Load More' button on the page using JavaScript code."
+    )
+    js_code = [
+        """
    const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
    loadMoreButton && loadMoreButton.click();
-    """] * 2
+    """
+    ] * 2
    # crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
    # crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
-    result = crawler.run(
-        url="https://www.nbcnews.com/business",
-        js = js_code
+    result = crawler.run(url="https://www.nbcnews.com/business", js=js_code)
+    cprint(
+        "[LOG] 📦 [bold yellow]JavaScript Code (Load More button) result:[/bold yellow]"
    )
-    cprint("[LOG] 📦 [bold yellow]JavaScript Code (Load More button) result:[/bold yellow]")
    print_result(result)
 def using_crawler_hooks(crawler):
    # Example usage of the hooks for authentication and setting a cookie
    def on_driver_created(driver):
@@ -206,33 +286,34 @@ def using_crawler_hooks(crawler):
        driver.maximize_window()
        # Example customization: logging in to a hypothetical website
-        driver.get('https://example.com/login')
+        driver.get("https://example.com/login")
        from selenium.webdriver.support.ui import WebDriverWait
        from selenium.webdriver.common.by import By
        from selenium.webdriver.support import expected_conditions as EC
        WebDriverWait(driver, 10).until(
-            EC.presence_of_element_located((By.NAME, 'username'))
+            EC.presence_of_element_located((By.NAME, "username"))
        )
-        driver.find_element(By.NAME, 'username').send_keys('testuser')
+        driver.find_element(By.NAME, "username").send_keys("testuser")
-        driver.find_element(By.NAME, 'password').send_keys('password123')
+        driver.find_element(By.NAME, "password").send_keys("password123")
-        driver.find_element(By.NAME, 'login').click()
+        driver.find_element(By.NAME, "login").click()
        WebDriverWait(driver, 10).until(
-            EC.presence_of_element_located((By.ID, 'welcome'))
+            EC.presence_of_element_located((By.ID, "welcome"))
        )
        # Add a custom cookie
-        driver.add_cookie({'name': 'test_cookie', 'value': 'cookie_value'})
+        driver.add_cookie({"name": "test_cookie", "value": "cookie_value"})
        return driver
    def before_get_url(driver):
        print("[HOOK] before_get_url")
        # Example customization: add a custom header
        # Enable Network domain for sending headers
-        driver.execute_cdp_cmd('Network.enable', {})
+        driver.execute_cdp_cmd("Network.enable", {})
        # Add a custom header
-        driver.execute_cdp_cmd('Network.setExtraHTTPHeaders', {'headers': {'X-Test-Header': 'test'}})
+        driver.execute_cdp_cmd(
+            "Network.setExtraHTTPHeaders", {"headers": {"X-Test-Header": "test"}}
+        )
        return driver
    def after_get_url(driver):
@@ -247,20 +328,24 @@ def using_crawler_hooks(crawler):
        print(len(html))
        return driver
-    cprint("\n🔗 [bold cyan]Using Crawler Hooks: Let's see how we can customize the crawler using hooks![/bold cyan]", True)
+    cprint(
+        "\n🔗 [bold cyan]Using Crawler Hooks: Let's see how we can customize the crawler using hooks![/bold cyan]",
+        True,
+    )
    crawler_strategy = LocalSeleniumCrawlerStrategy(verbose=True)
-    crawler_strategy.set_hook('on_driver_created', on_driver_created)
+    crawler_strategy.set_hook("on_driver_created", on_driver_created)
-    crawler_strategy.set_hook('before_get_url', before_get_url)
+    crawler_strategy.set_hook("before_get_url", before_get_url)
-    crawler_strategy.set_hook('after_get_url', after_get_url)
+    crawler_strategy.set_hook("after_get_url", after_get_url)
-    crawler_strategy.set_hook('before_return_html', before_return_html)
+    crawler_strategy.set_hook("before_return_html", before_return_html)
    crawler = WebCrawler(verbose=True, crawler_strategy=crawler_strategy)
    crawler.warmup()
    result = crawler.run(url="https://example.com")
    cprint("[LOG] 📦 [bold yellow]Crawler Hooks result:[/bold yellow]")
-    print_result(result= result)
+    print_result(result=result)
 def using_crawler_hooks_dleay_example(crawler):
    def delay(driver):
@@ -270,12 +355,14 @@ def using_crawler_hooks_dleay_example(crawler):
    def create_crawler():
        crawler_strategy = LocalSeleniumCrawlerStrategy(verbose=True)
-        crawler_strategy.set_hook('after_get_url', delay)
+        crawler_strategy.set_hook("after_get_url", delay)
        crawler = WebCrawler(verbose=True, crawler_strategy=crawler_strategy)
        crawler.warmup()
        return crawler
-    cprint("\n🔗 [bold cyan]Using Crawler Hooks: Let's add a delay after fetching the url to make sure entire page is fetched.[/bold cyan]")
+    cprint(
+        "\n🔗 [bold cyan]Using Crawler Hooks: Let's add a delay after fetching the url to make sure entire page is fetched.[/bold cyan]"
+    )
    crawler = create_crawler()
    result = crawler.run(url="https://google.com", bypass_cache=True)
@@ -283,11 +370,16 @@ def using_crawler_hooks_dleay_example(crawler):
    print_result(result)
 def main():
-    cprint("🌟 [bold green]Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling fun! 🌐[/bold green]")
-    cprint("⛳️ [bold cyan]First Step: Create an instance of WebCrawler and call the `warmup()` function.[/bold cyan]")
-    cprint("If this is the first time you're running Crawl4ai, this might take a few seconds to load required model files.")
+    cprint(
+        "🌟 [bold green]Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling fun! 🌐[/bold green]"
+    )
+    cprint(
+        "⛳️ [bold cyan]First Step: Create an instance of WebCrawler and call the `warmup()` function.[/bold cyan]"
+    )
+    cprint(
+        "If this is the first time you're running Crawl4ai, this might take a few seconds to load required model files."
+    )
    crawler = create_crawler()
@@ -305,8 +397,10 @@ def main():
    interactive_extraction(crawler)
    multiple_scrip(crawler)
-    cprint("\n🎉 [bold green]Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth and crawl the web like a pro! 🕸️[/bold green]")
+    cprint(
+        "\n🎉 [bold green]Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth and crawl the web like a pro! 🕸️[/bold green]"
+    )
 if __name__ == "__main__":
    main()
--- a/docs/examples/quickstart_v0.ipynb
+++ b/docs/examples/quickstart_v0.ipynb
@@ -702,7 +702,7 @@
        "\n",
        "Crawl4AI offers a fast, flexible, and powerful solution for web crawling and data extraction tasks. Its asynchronous architecture and advanced features make it suitable for a wide range of applications, from simple web scraping to complex, multi-page data extraction scenarios.\n",
        "\n",
-        "For more information and advanced usage, please visit the [Crawl4AI documentation](https://crawl4ai.com/mkdocs/).\n",
+        "For more information and advanced usage, please visit the [Crawl4AI documentation](https://docs.crawl4ai.com/).\n",
        "\n",
        "Happy crawling!"
      ]
--- a/docs/examples/research_assistant.py
+++ b/docs/examples/research_assistant.py
@@ -11,7 +11,9 @@ from groq import Groq
 # Import threadpools to run the crawl_url function in a separate thread
 from concurrent.futures import ThreadPoolExecutor
-client = AsyncOpenAI(base_url="https://api.groq.com/openai/v1", api_key=os.getenv("GROQ_API_KEY"))
+client = AsyncOpenAI(
+    base_url="https://api.groq.com/openai/v1", api_key=os.getenv("GROQ_API_KEY")
+)
 # Instrument the OpenAI client
 cl.instrument_openai()
@@ -25,32 +27,31 @@ settings = {
    "presence_penalty": 0,
 }
 def extract_urls(text):
-    url_pattern = re.compile(r'(https?://\S+)')
+    url_pattern = re.compile(r"(https?://\S+)")
    return url_pattern.findall(text)
 def crawl_url(url):
    data = {
        "urls": [url],
        "include_raw_html": True,
        "word_count_threshold": 10,
        "extraction_strategy": "NoExtractionStrategy",
-        "chunking_strategy": "RegexChunking"
+        "chunking_strategy": "RegexChunking",
    }
    response = requests.post("https://crawl4ai.com/crawl", json=data)
    response_data = response.json()
-    response_data = response_data['results'][0]
+    response_data = response_data["results"][0]
-    return response_data['markdown']
+    return response_data["markdown"]
@cl.on_chat_start
 async def on_chat_start():
-    cl.user_session.set("session", {
-        "history": [],
-        "context": {}
-    })
-    await cl.Message(
-        content="Welcome to the chat! How can I assist you today?"
-    ).send()
+    cl.user_session.set("session", {"history": [], "context": {}})
+    await cl.Message(content="Welcome to the chat! How can I assist you today?").send()
+
@cl.on_message
 async def on_message(message: cl.Message):
@@ -59,7 +60,6 @@ async def on_message(message: cl.Message):
    # Extract URLs from the user's message
    urls = extract_urls(message.content)
    futures = []
    with ThreadPoolExecutor() as executor:
        for url in urls:
@@ -69,16 +69,9 @@ async def on_message(message: cl.Message):
    for url, result in zip(urls, results):
        ref_number = f"REF_{len(user_session['context']) + 1}"
-        user_session["context"][ref_number] = {
-            "url": url,
-            "content": result
-        }
-
-    user_session["history"].append({
-        "role": "user",
-        "content": message.content
-    })
+        user_session["context"][ref_number] = {"url": url, "content": result}
+    user_session["history"].append({"role": "user", "content": message.content})
    # Create a system message that includes the context
    context_messages = [
@@ -95,26 +88,17 @@ async def on_message(message: cl.Message):
                "If not, there is no need to add a references section. "
                "At the end of your response, provide a reference section listing the URLs and their REF numbers only if sources from the appendices were used.\n\n"
                "\n\n".join(context_messages)
-            )
+            ),
        }
    else:
-        system_message = {
-            "role": "system",
-            "content": "You are a helpful assistant."
-        }
+        system_message = {"role": "system", "content": "You are a helpful assistant."}
    msg = cl.Message(content="")
    await msg.send()
    # Get response from the LLM
    stream = await client.chat.completions.create(
-        messages=[
-            system_message,
-            *user_session["history"]
-        ],
-        stream=True,
-        **settings
+        messages=[system_message, *user_session["history"]], stream=True, **settings
    )
    assistant_response = ""
@@ -124,10 +108,7 @@ async def on_message(message: cl.Message):
            await msg.stream_token(token)
    # Add assistant message to the history
-    user_session["history"].append({
-        "role": "assistant",
-        "content": assistant_response
-    })
+    user_session["history"].append({"role": "assistant", "content": assistant_response})
    await msg.update()
    # Append the reference section to the assistant's response
@@ -154,6 +135,7 @@ async def on_audio_chunk(chunk: cl.AudioChunk):
    pass
@cl.step(type="tool")
 async def speech_to_text(audio_file):
    cli = Groq()
@@ -179,17 +161,12 @@ async def on_audio_end(elements: list[ElementBased]):
    end_time = time.time()
    print(f"Transcription took {end_time - start_time} seconds")
-    user_msg = cl.Message(
+    user_msg = cl.Message(author="You", type="user_message", content=transcription)
        author="You", 
        type="user_message",
        content=transcription
    )
    await user_msg.send()
    await on_message(user_msg)
 if __name__ == "__main__":
    from chainlit.cli import run_chainlit
    run_chainlit(__file__)
--- a/docs/examples/rest_call.py
+++ b/docs/examples/rest_call.py
@@ -1,4 +1,3 @@
 import requests, base64, os
 data = {
@@ -7,58 +6,49 @@ data = {
 }
 response = requests.post("https://crawl4ai.com/crawl", json=data)
-result = response.json()['results'][0]
+result = response.json()["results"][0]
 print(result.keys())
 # dict_keys(['url', 'html', 'success', 'cleaned_html', 'media',
 # 'links', 'screenshot', 'markdown', 'extracted_content',
 # 'metadata', 'error_message'])
 with open("screenshot.png", "wb") as f:
-    f.write(base64.b64decode(result['screenshot']))
+    f.write(base64.b64decode(result["screenshot"]))
 # Example of filtering the content using CSS selectors
 data = {
-    "urls": [
+    "urls": ["https://www.nbcnews.com/business"],
        "https://www.nbcnews.com/business"
    ],
    "css_selector": "article",
    "screenshot": True,
 }
 # Example of executing a JS script on the page before extracting the content
 data = {
-    "urls": [
+    "urls": ["https://www.nbcnews.com/business"],
        "https://www.nbcnews.com/business"
    ],
    "screenshot": True,
-    'js' : ["""
+    "js": [
        """
    const loadMoreButton = Array.from(document.querySelectorAll('button')).
    find(button => button.textContent.includes('Load More'));
    loadMoreButton && loadMoreButton.click();
-    """]
+    """
    ],
 }
 # Example of using a custom extraction strategy
 data = {
-    "urls": [
+    "urls": ["https://www.nbcnews.com/business"],
        "https://www.nbcnews.com/business"
    ],
    "extraction_strategy": "CosineStrategy",
-    "extraction_strategy_args": {
+    "extraction_strategy_args": {"semantic_filter": "inflation rent prices"},
        "semantic_filter": "inflation rent prices"
    },
 }
 # Example of using LLM to extract content
 data = {
-    "urls": [
+    "urls": ["https://www.nbcnews.com/business"],
        "https://www.nbcnews.com/business"
    ],
    "extraction_strategy": "LLMExtractionStrategy",
    "extraction_strategy_args": {
        "provider": "groq/llama3-8b-8192",
        "api_token": os.environ.get("GROQ_API_KEY"),
        "instruction": """I am interested in only financial news, 
-        and translate them in French."""
+        and translate them in French.""",
    },
 }
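Note on the screenshot handling in rest_call.py above: the script decodes `result["screenshot"]` unconditionally. A minimal sketch of a more defensive variant follows, assuming the `screenshot` field holds base64 text as the example implies. The helper name `save_screenshot` and the fake payload are hypothetical and not part of the script or of the Crawl4AI API.

```python
import base64
import tempfile
from pathlib import Path


def save_screenshot(result, path):
    """Decode result["screenshot"] (base64 text) into `path`.

    Returns the number of bytes written, or 0 when no screenshot is present.
    `save_screenshot` is a hypothetical helper, not part of the script above.
    """
    encoded = result.get("screenshot")
    if not encoded:
        return 0
    data = base64.b64decode(encoded)
    Path(path).write_bytes(data)
    return len(data)


# Fake payload standing in for response.json()["results"][0]
fake_result = {"screenshot": base64.b64encode(b"not-a-real-png").decode("ascii")}
out_path = Path(tempfile.mkdtemp()) / "screenshot.png"
written = save_screenshot(fake_result, out_path)
```

Returning 0 for a missing field keeps the caller from crashing when a crawl is requested without `"screenshot": True`.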
--- a/docs/examples/scraping_strategies_performance.py
+++ b/docs/examples/scraping_strategies_performance.py
@@ -0,0 +1,135 @@
 import time, re
 from crawl4ai.content_scraping_strategy import WebScrapingStrategy,  LXMLWebScrapingStrategy
 import functools
 from collections import defaultdict
 class TimingStats:
    def __init__(self):
        self.stats = defaultdict(lambda: defaultdict(lambda: {"calls": 0, "total_time": 0}))
    def add(self, strategy_name, func_name, elapsed):
        self.stats[strategy_name][func_name]["calls"] += 1
        self.stats[strategy_name][func_name]["total_time"] += elapsed
    def report(self):
        for strategy_name, funcs in self.stats.items():
            print(f"\n{strategy_name} Timing Breakdown:")
            print("-" * 60)
            print(f"{'Function':<30} {'Calls':<10} {'Total(s)':<10} {'Avg(ms)':<10}")
            print("-" * 60)
            for func, data in sorted(funcs.items(), key=lambda x: x[1]["total_time"], reverse=True):
                avg_ms = (data["total_time"] / data["calls"]) * 1000
                print(f"{func:<30} {data['calls']:<10} {data['total_time']:<10.3f} {avg_ms:<10.2f}")
 timing_stats = TimingStats()
 # Modify timing decorator
 def timing_decorator(strategy_name):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            result = func(*args, **kwargs)
            elapsed = time.time() - start
            timing_stats.add(strategy_name, func.__name__, elapsed)
            return result
        return wrapper
    return decorator
 # Modified decorator application
 def apply_decorators(cls, method_name, strategy_name):
    try:
        original_method = getattr(cls, method_name)
        decorated_method = timing_decorator(strategy_name)(original_method)
        setattr(cls, method_name, decorated_method)
    except AttributeError:
        print(f"Method {method_name} not found in class {cls.__name__}.")
 # Apply to key methods
 methods_to_profile = [
    '_scrap',
    # 'process_element', 
    '_process_element', 
    'process_image',
 ]
 # Apply decorators to both strategies
 for strategy, name in [(WebScrapingStrategy, "Original"), (LXMLWebScrapingStrategy, "LXML")]:
    for method in methods_to_profile:
        apply_decorators(strategy, method, name)
 def generate_large_html(n_elements=1000):
    html = ['<!DOCTYPE html><html><head></head><body>']
    for i in range(n_elements):
        html.append(f'''
            <div class="article">
                <h2>Heading {i}</h2>
                <div>
                    <div>
                        <p>This is paragraph {i} with some content and a <a href="http://example.com/{i}">link</a></p>
                    </div>
                </div>
                <img src="image{i}.jpg" alt="Image {i}">
                <ul>
                    <li>List item {i}.1</li>
                    <li>List item {i}.2</li>
                </ul>
            </div>
        ''')
    html.append('</body></html>')
    return ''.join(html)
 def test_scraping():
    # Initialize both scrapers
    original_scraper = WebScrapingStrategy()
    selected_scraper = LXMLWebScrapingStrategy()
    # Generate test HTML
    print("Generating HTML...")
    html = generate_large_html(5000)
    print(f"HTML Size: {len(html)/1024:.2f} KB")
    # Time the scraping
    print("\nStarting scrape...")
    start_time = time.time()
    kwargs = {
        "url": "http://example.com",
        "html": html,
        "word_count_threshold": 5,
        "keep_data_attributes": True
    }
    t1 = time.perf_counter()
    result_selected = selected_scraper.scrap(**kwargs)
    t2 = time.perf_counter()
    result_original = original_scraper.scrap(**kwargs)
    t3 = time.perf_counter()
    elapsed = t3 - t1  # both values come from time.perf_counter()
    print(f"\nScraping completed in {elapsed:.2f} seconds")
    timing_stats.report()
    # Print stats of LXML output
    print("\nTurbo Output:")
    print(f"\nExtracted links: {len(result_selected.links.internal) + len(result_selected.links.external)}")
    print(f"Extracted images: {len(result_selected.media.images)}")
    print(f"Clean HTML size: {len(result_selected.cleaned_html)/1024:.2f} KB")
    print(f"Scraping time: {t2 - t1:.2f} seconds")
    # Print stats of original output
    print("\nOriginal Output:")
    print(f"\nExtracted links: {len(result_original.links.internal) + len(result_original.links.external)}")
    print(f"Extracted images: {len(result_original.media.images)}")
    print(f"Clean HTML size: {len(result_original.cleaned_html)/1024:.2f} KB")
    print(f"Scraping time: {t3 - t2:.2f} seconds")
 if __name__ == "__main__":
    test_scraping()
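Note on the profiling approach in scraping_strategies_performance.py above: the script patches a timing decorator onto class methods and aggregates per-function totals. A self-contained sketch of the same pattern follows, using `time.perf_counter()` throughout so that all intervals share one clock. The names `stats`, `timed`, and `Demo` are hypothetical stand-ins, not Crawl4AI identifiers.

```python
import functools
import time
from collections import defaultdict

# Hypothetical registry: {label: {function name: {"calls", "total_time"}}}
stats = defaultdict(lambda: defaultdict(lambda: {"calls": 0, "total_time": 0.0}))


def timed(label):
    """Return a decorator that records call count and elapsed time under `label`."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the call even if func raises
                entry = stats[label][func.__name__]
                entry["calls"] += 1
                entry["total_time"] += time.perf_counter() - start

        return wrapper

    return decorator


class Demo:
    def work(self, n):
        return sum(range(n))


# Patch the method on the class, as apply_decorators does with setattr
Demo.work = timed("Demo")(Demo.work)
total = Demo().work(10)
```

Recording in a `finally` block is a design choice of this sketch: the script above only records calls that return normally.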
--- a/docs/examples/ssl_example.py
+++ b/docs/examples/ssl_example.py
@@ -0,0 +1,51 @@
 """Example showing how to work with SSL certificates in Crawl4AI."""
 import asyncio
 import os
 from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
 # Create tmp directory if it doesn't exist
 parent_dir = os.path.dirname(
    os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
 )
 tmp_dir = os.path.join(parent_dir, "tmp")
 os.makedirs(tmp_dir, exist_ok=True)
 async def main():
    # Configure crawler to fetch SSL certificate
    config = CrawlerRunConfig(
        fetch_ssl_certificate=True,
        cache_mode=CacheMode.BYPASS,  # Bypass cache to always get fresh certificates
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        if result.success and result.ssl_certificate:
            cert = result.ssl_certificate
            # 1. Access certificate properties directly
            print("\nCertificate Information:")
            print(f"Issuer: {cert.issuer.get('CN', '')}")
            print(f"Valid until: {cert.valid_until}")
            print(f"Fingerprint: {cert.fingerprint}")
            # 2. Export certificate in different formats
            cert.to_json(os.path.join(tmp_dir, "certificate.json"))  # For analysis
            print("\nCertificate exported to:")
            print(f"- JSON: {os.path.join(tmp_dir, 'certificate.json')}")
            pem_data = cert.to_pem(
                os.path.join(tmp_dir, "certificate.pem")
            )  # For web servers
            print(f"- PEM: {os.path.join(tmp_dir, 'certificate.pem')}")
            der_data = cert.to_der(
                os.path.join(tmp_dir, "certificate.der")
            )  # For Java apps
            print(f"- DER: {os.path.join(tmp_dir, 'certificate.der')}")
 if __name__ == "__main__":
    asyncio.run(main())
--- a/docs/examples/summarize_page.py
+++ b/docs/examples/summarize_page.py
@@ -1,39 +1,41 @@
 import os
 import time
 import json
 from crawl4ai.web_crawler import WebCrawler
 from crawl4ai.chunking_strategy import *
 from crawl4ai.extraction_strategy import *
 from crawl4ai.crawler_strategy import *
-url = r'https://marketplace.visualstudio.com/items?itemName=Unclecode.groqopilot'
+url = r"https://marketplace.visualstudio.com/items?itemName=Unclecode.groqopilot"
 crawler = WebCrawler()
 crawler.warmup()
 from pydantic import BaseModel, Field
 class PageSummary(BaseModel):
    title: str = Field(..., description="Title of the page.")
    summary: str = Field(..., description="Summary of the page.")
    brief_summary: str = Field(..., description="Brief summary of the page.")
    keywords: list = Field(..., description="Keywords assigned to the page.")
 result = crawler.run(
    url=url,
    word_count_threshold=1,
-    extraction_strategy= LLMExtractionStrategy(
+    extraction_strategy=LLMExtractionStrategy(
-        provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'), 
+        provider="openai/gpt-4o",
        api_token=os.getenv("OPENAI_API_KEY"),
        schema=PageSummary.model_json_schema(),
        extraction_type="schema",
-        apply_chunking =False,
+        apply_chunking=False,
-        instruction="From the crawled content, extract the following details: "\
+        instruction="From the crawled content, extract the following details: "
-            "1. Title of the page "\
+        "1. Title of the page "
-            "2. Summary of the page, which is a detailed summary "\
+        "2. Summary of the page, which is a detailed summary "
-            "3. Brief summary of the page, which is a paragraph text "\
+        "3. Brief summary of the page, which is a paragraph text "
-            "4. Keywords assigned to the page, which is a list of keywords. "\
+        "4. Keywords assigned to the page, which is a list of keywords. "
-            'The extracted JSON format should look like this: '\
+        "The extracted JSON format should look like this: "
-            '{ "title": "Page Title", "summary": "Detailed summary of the page.", "brief_summary": "Brief summary in a paragraph.", "keywords": ["keyword1", "keyword2", "keyword3"] }'
+        '{ "title": "Page Title", "summary": "Detailed summary of the page.", "brief_summary": "Brief summary in a paragraph.", "keywords": ["keyword1", "keyword2", "keyword3"] }',
    ),
    bypass_cache=True,
 )
--- a/docs/examples/tmp/chainlit_review.py
+++ b/docs/examples/tmp/chainlit_review.py
@@ -1,281 +0,0 @@
 from openai import AsyncOpenAI
 from chainlit.types import ThreadDict
 import chainlit as cl
 from chainlit.input_widget import Select, Switch, Slider
 client = AsyncOpenAI()
 # Instrument the OpenAI client
 cl.instrument_openai()
 settings = {
    "model": "gpt-3.5-turbo",
    "temperature": 0.5,
    "max_tokens": 500,
    "top_p": 1,
    "frequency_penalty": 0,
    "presence_penalty": 0,
 }
@cl.action_callback("action_button")
 async def on_action(action: cl.Action):
    print("The user clicked on the action button!")
    return "Thank you for clicking on the action button!"
@cl.set_chat_profiles
 async def chat_profile():
    return [
        cl.ChatProfile(
            name="GPT-3.5",
            markdown_description="The underlying LLM model is **GPT-3.5**.",
            icon="https://picsum.photos/200",
        ),
        cl.ChatProfile(
            name="GPT-4",
            markdown_description="The underlying LLM model is **GPT-4**.",
            icon="https://picsum.photos/250",
        ),
    ]
@cl.on_chat_start
 async def on_chat_start():
    settings = await cl.ChatSettings(
        [
            Select(
                id="Model",
                label="OpenAI - Model",
                values=["gpt-3.5-turbo", "gpt-3.5-turbo-16k", "gpt-4", "gpt-4-32k"],
                initial_index=0,
            ),
            Switch(id="Streaming", label="OpenAI - Stream Tokens", initial=True),
            Slider(
                id="Temperature",
                label="OpenAI - Temperature",
                initial=1,
                min=0,
                max=2,
                step=0.1,
            ),
            Slider(
                id="SAI_Steps",
                label="Stability AI - Steps",
                initial=30,
                min=10,
                max=150,
                step=1,
                description="Amount of inference steps performed on image generation.",
            ),
            Slider(
                id="SAI_Cfg_Scale",
                label="Stability AI - Cfg_Scale",
                initial=7,
                min=1,
                max=35,
                step=0.1,
                description="Influences how strongly your generation is guided to match your prompt.",
            ),
            Slider(
                id="SAI_Width",
                label="Stability AI - Image Width",
                initial=512,
                min=256,
                max=2048,
                step=64,
                tooltip="Measured in pixels",
            ),
            Slider(
                id="SAI_Height",
                label="Stability AI - Image Height",
                initial=512,
                min=256,
                max=2048,
                step=64,
                tooltip="Measured in pixels",
            ),
        ]
    ).send()
    chat_profile = cl.user_session.get("chat_profile")
    await cl.Message(
        content=f"starting chat using the {chat_profile} chat profile"
    ).send()
    print("A new chat session has started!")
    cl.user_session.set("session", {
        "history": [],
        "context": []
    })  
    image = cl.Image(url="https://c.tenor.com/uzWDSSLMCmkAAAAd/tenor.gif", name="cat image", display="inline")
    # Attach the image to the message
    await cl.Message(
        content="You are such a good girl, aren't you?!",
        elements=[image],
    ).send()
    text_content = "Hello, this is a text element."
    elements = [
        cl.Text(name="simple_text", content=text_content, display="inline")
    ]
    await cl.Message(
        content="Check out this text element!",
        elements=elements,
    ).send()
    elements = [
        cl.Audio(path="./assets/audio.mp3", display="inline"),
    ]
    await cl.Message(
        content="Here is an audio file",
        elements=elements,
    ).send()
    await cl.Avatar(
        name="Tool 1",
        url="https://avatars.githubusercontent.com/u/128686189?s=400&u=a1d1553023f8ea0921fba0debbe92a8c5f840dd9&v=4",
    ).send()
    await cl.Message(
        content="This message should not have an avatar!", author="Tool 0"
    ).send()
    await cl.Message(
        content="This message should have an avatar!", author="Tool 1"
    ).send()
    elements = [
        cl.File(
            name="quickstart.py",
            path="./quickstart.py",
            display="inline",
        ),
    ]
    await cl.Message(
        content="This message has a file element", elements=elements
    ).send()
    # Sending an action button within a chatbot message
    actions = [
        cl.Action(name="action_button", value="example_value", description="Click me!")
    ]
    await cl.Message(content="Interact with this action button:", actions=actions).send()
    # res = await cl.AskActionMessage(
    #     content="Pick an action!",
    #     actions=[
    #         cl.Action(name="continue", value="continue", label="✅ Continue"),
    #         cl.Action(name="cancel", value="cancel", label="❌ Cancel"),
    #     ],
    # ).send()
    # if res and res.get("value") == "continue":
    #     await cl.Message(
    #         content="Continue!",
    #     ).send()
    # import plotly.graph_objects as go
    # fig = go.Figure(
    #     data=[go.Bar(y=[2, 1, 3])],
    #     layout_title_text="An example figure",
    # )
    # elements = [cl.Plotly(name="chart", figure=fig, display="inline")]
    # await cl.Message(content="This message has a chart", elements=elements).send()
    # Sending a pdf with the local file path
    # elements = [
    #   cl.Pdf(name="pdf1", display="inline", path="./pdf1.pdf")
    # ]
    # cl.Message(content="Look at this local pdf!", elements=elements).send()    
@cl.on_settings_update
 async def setup_agent(settings):
    print("on_settings_update", settings)
@cl.on_stop
 def on_stop():
    print("The user wants to stop the task!")
@cl.on_chat_end
 def on_chat_end():
    print("The user disconnected!")
@cl.on_chat_resume
 async def on_chat_resume(thread: ThreadDict):
    print("The user resumed a previous chat session!")
 # @cl.on_message
 async def on_message(message: cl.Message):
    cl.user_session.get("session")["history"].append({
        "role": "user",
        "content": message.content
    })    
    response = await client.chat.completions.create(
        messages=[
            {
                "content": "You are a helpful bot",
                "role": "system"
            },
            *cl.user_session.get("session")["history"]
        ],
        **settings
    )
    # Add assistant message to the history
    cl.user_session.get("session")["history"].append({
        "role": "assistant",
        "content": response.choices[0].message.content
    })
    # msg.content = response.choices[0].message.content
    # await msg.update()
    # await cl.Message(content=response.choices[0].message.content).send()
@cl.on_message
 async def on_message(message: cl.Message):
    cl.user_session.get("session")["history"].append({
        "role": "user",
        "content": message.content
    })    
    msg = cl.Message(content="")
    await msg.send()    
    stream = await client.chat.completions.create(
        messages=[
            {
                "content": "You are a helpful bot",
                "role": "system"
            },
            *cl.user_session.get("session")["history"]
        ],
        stream = True, 
        **settings
    )
    async for part in stream:
        if token := part.choices[0].delta.content or "":
            await msg.stream_token(token)
    # Add assistant message to the history
    cl.user_session.get("session")["history"].append({
        "role": "assistant",
        "content": msg.content
    })    
    await msg.update()
 if __name__ == "__main__":
    from chainlit.cli import run_chainlit
    run_chainlit(__file__)
--- a/docs/examples/tmp/research_assistant_audio_not_completed.py
+++ b/docs/examples/tmp/research_assistant_audio_not_completed.py
@@ -1,238 +0,0 @@
 # Make sure to install the required packages: chainlit and groq
 import os, time
 from openai import AsyncOpenAI
 import chainlit as cl
 import re
 import requests
 from io import BytesIO
 from chainlit.element import ElementBased
 from groq import Groq
 # Import threadpools to run the crawl_url function in a separate thread
 from concurrent.futures import ThreadPoolExecutor
 client = AsyncOpenAI(base_url="https://api.groq.com/openai/v1", api_key=os.getenv("GROQ_API_KEY"))
 # Instrument the OpenAI client
 cl.instrument_openai()
 settings = {
    "model": "llama3-8b-8192",
    "temperature": 0.5,
    "max_tokens": 500,
    "top_p": 1,
    "frequency_penalty": 0,
    "presence_penalty": 0,
 }
 def extract_urls(text):
    url_pattern = re.compile(r'(https?://\S+)')
    return url_pattern.findall(text)
 def crawl_url(url):
    data = {
        "urls": [url],
        "include_raw_html": True,
        "word_count_threshold": 10,
        "extraction_strategy": "NoExtractionStrategy",
        "chunking_strategy": "RegexChunking"
    }
    response = requests.post("https://crawl4ai.com/crawl", json=data)
    response_data = response.json()
    response_data = response_data['results'][0]
    return response_data['markdown']
@cl.on_chat_start
 async def on_chat_start():
    cl.user_session.set("session", {
        "history": [],
        "context": {}
    })  
    await cl.Message(
        content="Welcome to the chat! How can I assist you today?"
    ).send()
@cl.on_message
 async def on_message(message: cl.Message):
    user_session = cl.user_session.get("session")
    # Extract URLs from the user's message
    urls = extract_urls(message.content)
    futures = []
    with ThreadPoolExecutor() as executor:
        for url in urls:
            futures.append(executor.submit(crawl_url, url))
    results = [future.result() for future in futures]
    for url, result in zip(urls, results):
        ref_number = f"REF_{len(user_session['context']) + 1}"
        user_session["context"][ref_number] = {
            "url": url,
            "content": result
        }    
    # for url in urls:
    #     # Crawl the content of each URL and add it to the session context with a reference number
    #     ref_number = f"REF_{len(user_session['context']) + 1}"
    #     crawled_content = crawl_url(url)
    #     user_session["context"][ref_number] = {
    #         "url": url,
    #         "content": crawled_content
    #     }
    user_session["history"].append({
        "role": "user",
        "content": message.content
    })
    # Create a system message that includes the context
    context_messages = [
        f'<appendix ref="{ref}">\n{data["content"]}\n</appendix>'
        for ref, data in user_session["context"].items()
    ]
    if context_messages:
        system_message = {
            "role": "system",
            "content": (
                "You are a helpful bot. Use the following context for answering questions. "
                "Refer to the sources using the REF number in square brackets, e.g., [1], only if the source is given in the appendices below.\n\n"
                "If the question requires any information from the provided appendices or context, refer to the sources. "
                "If not, there is no need to add a references section. "
                "At the end of your response, provide a reference section listing the URLs and their REF numbers only if sources from the appendices were used.\n\n"
                "\n\n".join(context_messages)
            )
        }
    else:
        system_message = {
            "role": "system",
            "content": "You are a helpful assistant."
        }
    msg = cl.Message(content="")
    await msg.send()
    # Get response from the LLM
    stream = await client.chat.completions.create(
        messages=[
            system_message,
            *user_session["history"]
        ],
        stream=True,
        **settings
    )
    assistant_response = ""
    async for part in stream:
        if token := part.choices[0].delta.content:
            assistant_response += token
            await msg.stream_token(token)
    # Add assistant message to the history
    user_session["history"].append({
        "role": "assistant",
        "content": assistant_response
    })
    await msg.update()
    # Append the reference section to the assistant's response
    reference_section = "\n\nReferences:\n"
    for ref, data in user_session["context"].items():
        reference_section += f"[{ref.split('_')[1]}]: {data['url']}\n"
    msg.content += reference_section
    await msg.update()
@cl.on_audio_chunk
 async def on_audio_chunk(chunk: cl.AudioChunk):
    if chunk.isStart:
        buffer = BytesIO()
        # This is required for whisper to recognize the file type
        buffer.name = f"input_audio.{chunk.mimeType.split('/')[1]}"
        # Initialize the session for a new audio stream
        cl.user_session.set("audio_buffer", buffer)
        cl.user_session.set("audio_mime_type", chunk.mimeType)
    # Write the chunks to a buffer and transcribe the whole audio at the end
    cl.user_session.get("audio_buffer").write(chunk.data)
    pass
@cl.step(type="tool")
 async def speech_to_text(audio_file):
    cli = Groq()
    # response = cli.audio.transcriptions.create(
    #     file=audio_file, #(filename, file.read()),
    #     model="whisper-large-v3",
    # )
    response = await client.audio.transcriptions.create(
        model="whisper-large-v3", file=audio_file
    )
    return response.text
@cl.on_audio_end
 async def on_audio_end(elements: list[ElementBased]):
    # Get the audio buffer from the session
    audio_buffer: BytesIO = cl.user_session.get("audio_buffer")
    audio_buffer.seek(0)  # Move the file pointer to the beginning
    audio_file = audio_buffer.read()
    audio_mime_type: str = cl.user_session.get("audio_mime_type")
    # input_audio_el = cl.Audio(
    #     mime=audio_mime_type, content=audio_file, name=audio_buffer.name
    # )
    # await cl.Message(
    #     author="You", 
    #     type="user_message",
    #     content="",
    #     elements=[input_audio_el, *elements]
    # ).send()
    # answer_message = await cl.Message(content="").send()
    start_time = time.time()
    whisper_input = (audio_buffer.name, audio_file, audio_mime_type)
    transcription = await speech_to_text(whisper_input)
    end_time = time.time()
    print(f"Transcription took {end_time - start_time} seconds")
    user_msg = cl.Message(
        author="You", 
        type="user_message",
        content=transcription
    )
    await user_msg.send()
    await on_message(user_msg)
    # images = [file for file in elements if "image" in file.mime]
    # text_answer = await generate_text_answer(transcription, images)
    # output_name, output_audio = await text_to_speech(text_answer, audio_mime_type)
    # output_audio_el = cl.Audio(
    #     name=output_name,
    #     auto_play=True,
    #     mime=audio_mime_type,
    #     content=output_audio,
    # )
    # answer_message.elements = [output_audio_el]
    # answer_message.content = transcription
    # await answer_message.update()
 if __name__ == "__main__":
    from chainlit.cli import run_chainlit
    run_chainlit(__file__)
--- a/docs/examples/v0.3.74.overview.py
+++ b/docs/examples/v0.3.74.overview.py
@@ -1,4 +1,5 @@
 import os, sys
 # append the parent directory to the sys.path
 parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
 sys.path.append(parent_dir)
@@ -13,6 +14,7 @@ import json
 from crawl4ai import AsyncWebCrawler, CacheMode
 from crawl4ai.content_filter_strategy import BM25ContentFilter
 # 1. File Download Processing Example
 async def download_example():
    """Example of downloading files from Python.org"""
@@ -23,9 +25,7 @@ async def download_example():
    print(f"Downloads will be saved to: {downloads_path}")
    async with AsyncWebCrawler(
-        accept_downloads=True,
+        accept_downloads=True, downloads_path=downloads_path, verbose=True
        downloads_path=downloads_path,
        verbose=True
    ) as crawler:
        result = await crawler.arun(
            url="https://www.python.org/downloads/",
@@ -40,7 +40,7 @@ async def download_example():
            }
            """,
            delay_before_return_html=1,  # Wait 1 second to ensure the download starts
-            cache_mode=CacheMode.BYPASS
+            cache_mode=CacheMode.BYPASS,
        )
        if result.downloaded_files:
@@ -52,24 +52,25 @@ async def download_example():
        else:
            print("\nNo files were downloaded")
 # 2. Local File and Raw HTML Processing Example
 async def local_and_raw_html_example():
    """Example of processing local files and raw HTML"""
    # Create a sample HTML file
    sample_file = os.path.join(__data__, "sample.html")
    with open(sample_file, "w") as f:
-        f.write("""
+        f.write(
            """
        <html><body>
            <h1>Test Content</h1>
            <p>This is a test paragraph.</p>
        </body></html>
-        """)
+        """
        )
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Process local file
-        local_result = await crawler.arun(
+        local_result = await crawler.arun(url=f"file://{os.path.abspath(sample_file)}")
            url=f"file://{os.path.abspath(sample_file)}"
        )
        # Process raw HTML
        raw_html = """
@@ -78,9 +79,7 @@ async def local_and_raw_html_example():
            <p>This is a test of raw HTML processing.</p>
        </body></html>
        """
-        raw_result = await crawler.arun(
+        raw_result = await crawler.arun(url=f"raw:{raw_html}")
            url=f"raw:{raw_html}"
        )
        # Clean up
        os.remove(sample_file)
@@ -88,6 +87,7 @@ async def local_and_raw_html_example():
        print("Local file content:", local_result.markdown)
        print("\nRaw HTML content:", raw_result.markdown)
 # 3. Enhanced Markdown Generation Example
 async def markdown_generation_example():
    """Example of enhanced markdown generation with citations and LLM-friendly features"""
@@ -102,27 +102,32 @@ async def markdown_generation_example():
            url="https://en.wikipedia.org/wiki/Apple",
            css_selector="main div#bodyContent",
            content_filter=content_filter,
-            cache_mode=CacheMode.BYPASS
+            cache_mode=CacheMode.BYPASS,
        )
        from crawl4ai import AsyncWebCrawler
        from crawl4ai.content_filter_strategy import BM25ContentFilter
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/Apple",
            css_selector="main div#bodyContent",
-            content_filter=BM25ContentFilter()
+            content_filter=BM25ContentFilter(),
        )
        print(result.markdown_v2.fit_markdown)
        print("\nMarkdown Generation Results:")
        print(f"1. Original markdown length: {len(result.markdown)}")
-        print(f"2. New markdown versions (markdown_v2):")
+        print("2. New markdown versions (markdown_v2):")
        print(f"   - Raw markdown length: {len(result.markdown_v2.raw_markdown)}")
-        print(f"   - Citations markdown length: {len(result.markdown_v2.markdown_with_citations)}")
+        print(
-        print(f"   - References section length: {len(result.markdown_v2.references_markdown)}")
+            f"   - Citations markdown length: {len(result.markdown_v2.markdown_with_citations)}"
        )
        print(
            f"   - References section length: {len(result.markdown_v2.references_markdown)}"
        )
        if result.markdown_v2.fit_markdown:
-            print(f"   - Filtered markdown length: {len(result.markdown_v2.fit_markdown)}")
+            print(
                f"   - Filtered markdown length: {len(result.markdown_v2.fit_markdown)}"
            )
        # Save examples to files
        output_dir = os.path.join(__data__, "markdown_examples")
@@ -148,7 +153,10 @@ async def markdown_generation_example():
        print("\nSample of markdown with citations:")
        print(result.markdown_v2.markdown_with_citations[:500] + "...\n")
        print("Sample of references:")
-        print('\n'.join(result.markdown_v2.references_markdown.split('\n')[:10]) + "...")
+        print(
+            "\n".join(result.markdown_v2.references_markdown.split("\n")[:10]) + "..."
+        )
 # 4. Browser Management Example
 async def browser_management_example():
@@ -163,31 +171,31 @@ async def browser_management_example():
        use_managed_browser=True,
        user_data_dir=user_data_dir,
        headless=False,
-        verbose=True
+        verbose=True,
    ) as crawler:
        result = await crawler.arun(
            url="https://crawl4ai.com",
            # session_id="persistent_session_1",
-            cache_mode=CacheMode.BYPASS
+            cache_mode=CacheMode.BYPASS,
        )
        # Use GitHub as an example - it's a good test for browser management
        # because it requires proper browser handling
        result = await crawler.arun(
            url="https://github.com/trending",
            # session_id="persistent_session_1",
-            cache_mode=CacheMode.BYPASS
+            cache_mode=CacheMode.BYPASS,
        )
        print("\nBrowser session result:", result.success)
        if result.success:
-            print("Page title:", result.metadata.get('title', 'No title found'))
+            print("Page title:", result.metadata.get("title", "No title found"))
 # 5. API Usage Example
 async def api_example():
    """Example of using the new API endpoints"""
-    api_token = os.getenv('CRAWL4AI_API_TOKEN') or "test_api_code"
+    api_token = os.getenv("CRAWL4AI_API_TOKEN") or "test_api_code"
-    headers = {'Authorization': f'Bearer {api_token}'}    
+    headers = {"Authorization": f"Bearer {api_token}"}
    async with aiohttp.ClientSession() as session:
        # Submit crawl job
        crawl_request = {
@@ -199,26 +207,18 @@ async def api_example():
                        "name": "Hacker News Articles",
                        "baseSelector": ".athing",
                        "fields": [
-                            {
-                                "name": "title",
-                                "selector": ".title a",
-                                "type": "text"
-                            },
-                            {
-                                "name": "score",
-                                "selector": ".score",
-                                "type": "text"
-                            },
+                            {"name": "title", "selector": ".title a", "type": "text"},
+                            {"name": "score", "selector": ".score", "type": "text"},
                            {
                                "name": "url",
                                "selector": ".title a",
                                "type": "attribute",
-                                "attribute": "href"
-                            }
-                        ]
+                                "attribute": "href",
+                            },
+                        ],
                    }
                    }
                },
            },
            "crawler_params": {
                "headless": True,
                # "use_managed_browser": True
@@ -229,9 +229,7 @@ async def api_example():
        }
        async with session.post(
-            "http://localhost:11235/crawl",
-            json=crawl_request,
-            headers=headers
+            "http://localhost:11235/crawl", json=crawl_request, headers=headers
        ) as response:
            task_data = await response.json()
            task_id = task_data["task_id"]
@@ -239,8 +237,7 @@ async def api_example():
            # Check task status
            while True:
                async with session.get(
-                    f"http://localhost:11235/task/{task_id}",
-                    headers=headers
+                    f"http://localhost:11235/task/{task_id}", headers=headers
                ) as status_response:
                    result = await status_response.json()
                    print(f"Task status: {result['status']}")
@@ -248,12 +245,13 @@ async def api_example():
                    if result["status"] == "completed":
                        print("Task completed!")
                        print("Results:")
-                        news = json.loads(result["results"][0]['extracted_content'])
+                        news = json.loads(result["results"][0]["extracted_content"])
                        print(json.dumps(news[:4], indent=2))
                        break
                    else:
                        await asyncio.sleep(1)
 # Main execution
 async def main():
    # print("Running Crawl4AI feature examples...")
@@ -273,5 +271,6 @@ async def main():
    # print("\n5. Running API Example:")
    await api_example()
 if __name__ == "__main__":
    asyncio.run(main())
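Review note: the status loop in `api_example` above polls `/task/{task_id}` with `while True` and no upper bound, so a task that never reaches `"completed"` would keep the demo running indefinitely. A minimal sketch of a bounded poll follows. The helper name `poll_until_complete`, its parameters, and the idea of passing a `fetch_status` coroutine are assumptions made for illustration; they are not part of the Crawl4AI API or of this commit.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Tuple


async def poll_until_complete(
    fetch_status: Callable[[], Awaitable[Dict]],
    interval: float = 1.0,
    timeout: float = 60.0,
    terminal_statuses: Tuple[str, ...] = ("completed",),
) -> Dict:
    """Call fetch_status repeatedly until it reports a terminal status.

    fetch_status is a hypothetical coroutine that returns the decoded JSON
    body of the task-status endpoint (a dict with a "status" key).
    Raises TimeoutError if no terminal status is seen within `timeout` seconds.
    """
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        result = await fetch_status()
        if result.get("status") in terminal_statuses:
            return result
        if loop.time() >= deadline:
            raise TimeoutError(f"task did not finish within {timeout} seconds")
        await asyncio.sleep(interval)
```

In the demo, `fetch_status` could wrap the existing `session.get(...)` call and return `await status_response.json()`. If the server also reports a failure status, its name could be added to `terminal_statuses`; that status name would need to be checked against the server's actual responses.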
--- a/docs/examples/v0_4_24_walkthrough.py
+++ b/docs/examples/v0_4_24_walkthrough.py
@@ -0,0 +1,464 @@
 """
 Crawl4AI v0.4.24 Feature Walkthrough
 ===================================
 This script demonstrates the new features introduced in Crawl4AI v0.4.24.
 Each section includes detailed examples and explanations of the new capabilities.
 """
 import asyncio
 import os
 import json
 import re
 from typing import List
 from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CrawlerRunConfig,
    CacheMode,
    LLMExtractionStrategy,
    JsonCssExtractionStrategy,
 )
 from crawl4ai.content_filter_strategy import RelevantContentFilter
 from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
 from bs4 import BeautifulSoup
 # Sample HTML for demonstrations
 SAMPLE_HTML = """
 <div class="article-list">
    <article class="post" data-category="tech" data-author="john">
        <h2 class="title"><a href="/post-1">First Post</a></h2>
        <div class="meta">
            <a href="/author/john" class="author">John Doe</a>
            <span class="date">2023-12-31</span>
        </div>
        <div class="content">
            <p>First post content...</p>
            <a href="/read-more-1" class="read-more">Read More</a>
        </div>
    </article>
    <article class="post" data-category="science" data-author="jane">
        <h2 class="title"><a href="/post-2">Second Post</a></h2>
        <div class="meta">
            <a href="/author/jane" class="author">Jane Smith</a>
            <span class="date">2023-12-30</span>
        </div>
        <div class="content">
            <p>Second post content...</p>
            <a href="/read-more-2" class="read-more">Read More</a>
        </div>
    </article>
 </div>
 """
 async def demo_ssl_features():
    """
    Enhanced SSL & Security Features Demo
    -----------------------------------
    This example demonstrates the new SSL certificate handling and security features:
    1. Custom certificate paths
    2. SSL verification options
    3. HTTPS error handling
    4. Certificate validation configurations
    These features are particularly useful when:
    - Working with self-signed certificates
    - Dealing with corporate proxies
    - Handling mixed content websites
    - Managing different SSL security levels
    """
    print("\n1. Enhanced SSL & Security Demo")
    print("--------------------------------")
    browser_config = BrowserConfig()
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        fetch_ssl_certificate=True,  # Enable SSL certificate fetching
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com", config=run_config)
        print(f"SSL Crawl Success: {result.success}")
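        if result.success and result.ssl_certificate:
            # Only export the certificate when the crawl actually returned one
            result.ssl_certificate.to_json(
                os.path.join(os.getcwd(), "ssl_certificate.json")
            )
        if not result.success:
            print(f"SSL Error: {result.error_message}")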
 async def demo_content_filtering():
    """
    Smart Content Filtering Demo
    ----------------------
    Demonstrates advanced content filtering capabilities:
    1. Custom filter to identify and extract specific content
    2. Integration with markdown generation
    3. Flexible pruning rules
    """
    print("\n2. Smart Content Filtering Demo")
    print("--------------------------------")
    # Create a custom content filter
    class CustomNewsFilter(RelevantContentFilter):
        def __init__(self):
            super().__init__()
            # Add news-specific patterns
            self.negative_patterns = re.compile(
                r"nav|footer|header|sidebar|ads|comment|share|related|recommended|popular|trending",
                re.I,
            )
            self.min_word_count = 30  # Higher threshold for news content
        def filter_content(
            self, html: str, min_word_threshold: int = None
        ) -> List[str]:
            """
            Implements news-specific content filtering logic.
            Args:
                html (str): HTML content to be filtered
                min_word_threshold (int, optional): Minimum word count threshold
            Returns:
                List[str]: List of filtered HTML content blocks
            """
            if not html or not isinstance(html, str):
                return []
            soup = BeautifulSoup(html, "lxml")
            if not soup.body:
                soup = BeautifulSoup(f"<body>{html}</body>", "lxml")
            body = soup.find("body")
            # Extract chunks with metadata
            chunks = self.extract_text_chunks(
                body, min_word_threshold or self.min_word_count
            )
            # Filter chunks based on news-specific criteria
            filtered_chunks = []
            for _, text, tag_type, element in chunks:
                # Skip if element has negative class/id
                if self.is_excluded(element):
                    continue
                # Headers are important in news articles
                if tag_type == "header":
                    filtered_chunks.append(self.clean_element(element))
                    continue
                # For content, check word count and link density
                text = element.get_text(strip=True)
                if len(text.split()) >= (min_word_threshold or self.min_word_count):
                    # Calculate link density
                    links_text = " ".join(
                        a.get_text(strip=True) for a in element.find_all("a")
                    )
                    link_density = len(links_text) / len(text) if text else 1
                    # Accept if link density is reasonable
                    if link_density < 0.5:
                        filtered_chunks.append(self.clean_element(element))
            return filtered_chunks
    # Create markdown generator with custom filter
    markdown_gen = DefaultMarkdownGenerator(content_filter=CustomNewsFilter())
    run_config = CrawlerRunConfig(
        markdown_generator=markdown_gen, cache_mode=CacheMode.BYPASS
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com", config=run_config
        )
        print("Filtered Content Sample:")
        print(result.markdown[:500])  # Show first 500 chars
 async def demo_json_extraction():
    """
    Improved JSON Extraction Demo
    ---------------------------
    Demonstrates the enhanced JSON extraction capabilities:
    1. Base element attributes extraction
    2. Complex nested structures
    3. Multiple extraction patterns
    Key features shown:
    - Extracting attributes from base elements (href, data-* attributes)
    - Processing repeated patterns
    - Handling optional fields
    """
    print("\n3. Improved JSON Extraction Demo")
    print("--------------------------------")
    # Define the extraction schema with base element attributes
    json_strategy = JsonCssExtractionStrategy(
        schema={
            "name": "Blog Posts",
            "baseSelector": "div.article-list",
            "baseFields": [
                {"name": "list_id", "type": "attribute", "attribute": "data-list-id"},
                {"name": "category", "type": "attribute", "attribute": "data-category"},
            ],
            "fields": [
                {
                    "name": "posts",
                    "selector": "article.post",
                    "type": "nested_list",
                    "baseFields": [
                        {
                            "name": "post_id",
                            "type": "attribute",
                            "attribute": "data-post-id",
                        },
                        {
                            "name": "author_id",
                            "type": "attribute",
                            "attribute": "data-author",
                        },
                    ],
                    "fields": [
                        {
                            "name": "title",
                            "selector": "h2.title a",
                            "type": "text",
                            "baseFields": [
                                {
                                    "name": "url",
                                    "type": "attribute",
                                    "attribute": "href",
                                }
                            ],
                        },
                        {
                            "name": "author",
                            "selector": "div.meta a.author",
                            "type": "text",
                            "baseFields": [
                                {
                                    "name": "profile_url",
                                    "type": "attribute",
                                    "attribute": "href",
                                }
                            ],
                        },
                        {"name": "date", "selector": "span.date", "type": "text"},
                        {
                            "name": "read_more",
                            "selector": "a.read-more",
                            "type": "nested",
                            "fields": [
                                {"name": "text", "type": "text"},
                                {
                                    "name": "url",
                                    "type": "attribute",
                                    "attribute": "href",
                                },
                            ],
                        },
                    ],
                }
            ],
        }
    )
    # Demonstrate extraction from raw HTML
    run_config = CrawlerRunConfig(
        extraction_strategy=json_strategy, cache_mode=CacheMode.BYPASS
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="raw:" + SAMPLE_HTML,  # Use raw: prefix for raw HTML
            config=run_config,
        )
        print("Extracted Content:")
        print(result.extracted_content)
 async def demo_input_formats():
    """
    Input Format Handling Demo
    ----------------------
    Demonstrates how LLM extraction can work with different input formats:
    1. Markdown (default) - Good for simple text extraction
    2. HTML - Better when you need structure and attributes
    This example shows how HTML input can be beneficial when:
    - You need to understand the DOM structure
    - You want to extract both visible text and HTML attributes
    - The content has complex layouts like tables or forms
    """
    print("\n4. Input Format Handling Demo")
    print("---------------------------")
    # Create a dummy HTML with rich structure
    dummy_html = """
    <div class="job-posting" data-post-id="12345">
        <header class="job-header">
            <h1 class="job-title">Senior AI/ML Engineer</h1>
            <div class="job-meta">
                <span class="department">AI Research Division</span>
                <span class="location" data-remote="hybrid">San Francisco (Hybrid)</span>
            </div>
            <div class="salary-info" data-currency="USD">
                <span class="range">$150,000 - $220,000</span>
                <span class="period">per year</span>
            </div>
        </header>
        <section class="requirements">
            <div class="technical-skills">
                <h3>Technical Requirements</h3>
                <ul class="required-skills">
                    <li class="skill required" data-priority="must-have">
                        5+ years experience in Machine Learning
                    </li>
                    <li class="skill required" data-priority="must-have">
                        Proficiency in Python and PyTorch/TensorFlow
                    </li>
                    <li class="skill preferred" data-priority="nice-to-have">
                        Experience with distributed training systems
                    </li>
                </ul>
            </div>
            <div class="soft-skills">
                <h3>Professional Skills</h3>
                <ul class="required-skills">
                    <li class="skill required" data-priority="must-have">
                        Strong problem-solving abilities
                    </li>
                    <li class="skill preferred" data-priority="nice-to-have">
                        Experience leading technical teams
                    </li>
                </ul>
            </div>
        </section>
        <section class="timeline">
            <time class="deadline" datetime="2024-02-28">
                Application Deadline: February 28, 2024
            </time>
        </section>
        <footer class="contact-section">
            <div class="hiring-manager">
                <h4>Hiring Manager</h4>
                <div class="contact-info">
                    <span class="name">Dr. Sarah Chen</span>
                    <span class="title">Director of AI Research</span>
                    <span class="email">ai.hiring@example.com</span>
                </div>
            </div>
            <div class="team-info">
                <p>Join our team of 50+ researchers working on cutting-edge AI applications</p>
            </div>
        </footer>
    </div>
    """
    # Use raw:// prefix to pass HTML content directly
    url = f"raw://{dummy_html}"
    from pydantic import BaseModel, Field
    from typing import List, Optional
    # Define our schema using Pydantic
    class JobRequirement(BaseModel):
        category: str = Field(
            description="Category of the requirement (e.g., Technical, Soft Skills)"
        )
        items: List[str] = Field(
            description="List of specific requirements in this category"
        )
        priority: str = Field(
            description="Priority level (Required/Preferred) based on the HTML class or context"
        )
    class JobPosting(BaseModel):
        title: str = Field(description="Job title")
        department: str = Field(description="Department or team")
        location: str = Field(description="Job location, including remote options")
        salary_range: Optional[str] = Field(description="Salary range if specified")
        requirements: List[JobRequirement] = Field(
            description="Categorized job requirements"
        )
        application_deadline: Optional[str] = Field(
            description="Application deadline if specified"
        )
        contact_info: Optional[dict] = Field(
            description="Contact information from footer or contact section"
        )
    # First try with markdown (default)
    markdown_strategy = LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv("OPENAI_API_KEY"),
        schema=JobPosting.model_json_schema(),
        extraction_type="schema",
        instruction="""
        Extract job posting details into structured data. Focus on the visible text content 
        and organize requirements into categories.
        """,
        input_format="markdown",  # default
    )
    # Then with HTML for better structure understanding
    html_strategy = LLMExtractionStrategy(
        provider="openai/gpt-4",
        api_token=os.getenv("OPENAI_API_KEY"),
        schema=JobPosting.model_json_schema(),
        extraction_type="schema",
        instruction="""
        Extract job posting details, using HTML structure to:
        1. Identify requirement priorities from CSS classes (e.g., 'required' vs 'preferred')
        2. Extract contact info from the page footer or dedicated contact section
        3. Parse salary information from specially formatted elements
        4. Determine application deadline from timestamp or date elements
        Use HTML attributes and classes to enhance extraction accuracy.
        """,
        input_format="html",  # explicitly use HTML
    )
    async with AsyncWebCrawler() as crawler:
        # Try with markdown first
        markdown_config = CrawlerRunConfig(extraction_strategy=markdown_strategy)
        markdown_result = await crawler.arun(url=url, config=markdown_config)
        print("\nMarkdown-based Extraction Result:")
        items = json.loads(markdown_result.extracted_content)
        print(json.dumps(items, indent=2))
        # Then with HTML for better structure understanding
        html_config = CrawlerRunConfig(extraction_strategy=html_strategy)
        html_result = await crawler.arun(url=url, config=html_config)
        print("\nHTML-based Extraction Result:")
        items = json.loads(html_result.extracted_content)
        print(json.dumps(items, indent=2))
 # Main execution
 async def main():
    print("Crawl4AI v0.4.24 Feature Walkthrough")
    print("====================================")
    # Run all demos
    await demo_ssl_features()
    await demo_content_filtering()
    await demo_json_extraction()
    # await demo_input_formats()
 if __name__ == "__main__":
    asyncio.run(main())
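Review note: the link-density rule inside `CustomNewsFilter.filter_content` above (keep a block only when the share of its text that sits inside `<a>` tags is below 0.5) can be tried in isolation. The sketch below uses only the standard library `html.parser` instead of BeautifulSoup, so it approximates the demo rather than reproducing it: whitespace handling differs slightly, and the names `_LinkTextCounter` and `link_density` are hypothetical.

```python
from html.parser import HTMLParser


class _LinkTextCounter(HTMLParser):
    """Counts characters of all text and of text nested inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.total_chars = 0
        self.link_chars = 0
        self._anchor_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_depth > 0:
            self._anchor_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        self.total_chars += len(text)
        if self._anchor_depth:
            self.link_chars += len(text)


def link_density(html_fragment: str) -> float:
    """Return link-text characters divided by total-text characters."""
    counter = _LinkTextCounter()
    counter.feed(html_fragment)
    counter.close()
    if counter.total_chars == 0:
        # Same fallback as the demo: a block with no text counts as all links
        return 1.0
    return counter.link_chars / counter.total_chars
```

A block would then be accepted with a check such as `link_density(block_html) < 0.5`, mirroring the threshold used in the demo.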
--- a/docs/examples/v0_4_3b2_features_demo.py
+++ b/docs/examples/v0_4_3b2_features_demo.py
@@ -0,0 +1,354 @@
 """
 Crawl4ai v0.4.3b2 Features Demo
 ============================
 This demonstration showcases three major categories of new features in Crawl4ai v0.4.3:
 1. Efficiency & Speed:
   - Memory-efficient dispatcher strategies
   - New scraping algorithm
   - Streaming support for batch crawling
 2. LLM Integration:
   - Automatic schema generation
   - LLM-powered content filtering
   - Smart markdown generation
 3. Core Improvements:
   - Robots.txt compliance
   - Proxy rotation
   - Enhanced URL handling
   - Shared data among hooks
   - Page routes
 Each demo function can be run independently or as part of the full suite.
 """
 import asyncio
 import os
 import json
 import re
 import random
 from typing import Optional, Dict
 from dotenv import load_dotenv
 load_dotenv()
 from crawl4ai import (
    AsyncWebCrawler, 
    BrowserConfig,
    CrawlerRunConfig,
    CacheMode,
    DisplayMode,
    MemoryAdaptiveDispatcher,
    CrawlerMonitor,
    DefaultMarkdownGenerator,
    LXMLWebScrapingStrategy,
    JsonCssExtractionStrategy,
    LLMContentFilter
 )
 async def demo_memory_dispatcher():
    """Demonstrates the new memory-efficient dispatcher system.
    Key Features:
    - Adaptive memory management
    - Real-time performance monitoring
    - Concurrent session control
    """
    print("\n=== Memory Dispatcher Demo ===")
    try:
        # Configuration
        browser_config = BrowserConfig(headless=True, verbose=False)
        crawler_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            markdown_generator=DefaultMarkdownGenerator()
        )
        # Test URLs
        urls = ["http://example.com", "http://example.org", "http://example.net"] * 3
        print("\n📈 Initializing crawler with memory monitoring...")
        async with AsyncWebCrawler(config=browser_config) as crawler:
            monitor = CrawlerMonitor(
                max_visible_rows=10,
                display_mode=DisplayMode.DETAILED
            )
            dispatcher = MemoryAdaptiveDispatcher(
                memory_threshold_percent=80.0,
                check_interval=0.5,
                max_session_permit=5,
                monitor=monitor
            )
            print("\n🚀 Starting batch crawl...")
            results = await dispatcher.run_urls(
                urls=urls,
                crawler=crawler,
                config=crawler_config,
            )
            print(f"\n✅ Completed {len(results)} URLs successfully")
    except Exception as e:
        print(f"\n❌ Error in memory dispatcher demo: {str(e)}")
 async def demo_streaming_support():
    """
    2. Streaming Support Demo
    ======================
    Shows how to process URLs as they complete using streaming
    """
    print("\n=== 2. Streaming Support Demo ===")
    browser_config = BrowserConfig(headless=True, verbose=False)
    crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, stream=True)
    # Test URLs
    urls = ["http://example.com", "http://example.org", "http://example.net"] * 2
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Initialize dispatcher for streaming
        dispatcher = MemoryAdaptiveDispatcher(max_session_permit=3, check_interval=0.5)
        print("Starting streaming crawl...")
        async for result in dispatcher.run_urls_stream(
            urls=urls, crawler=crawler, config=crawler_config
        ):
            # Process each result as it arrives
            print(
                f"Received result for {result.url} - Success: {result.result.success}"
            )
            if result.result.success:
                print(f"Content length: {len(result.result.markdown)}")
 async def demo_content_scraping():
    """
    3. Content Scraping Strategy Demo
    ==============================
    Demonstrates the new LXMLWebScrapingStrategy for faster content scraping.
    """
    print("\n=== 3. Content Scraping Strategy Demo ===")
    crawler = AsyncWebCrawler()
    url = "https://example.com/article"
    # Configure with the new LXML strategy
    config = CrawlerRunConfig(scraping_strategy=LXMLWebScrapingStrategy(), verbose=True)
    print("Scraping content with LXML strategy...")
    async with crawler:
        result = await crawler.arun(url, config=config)
        if result.success:
            print("Successfully scraped content using LXML strategy")
 async def demo_llm_markdown():
    """
    4. LLM-Powered Markdown Generation Demo
    ===================================
    Shows how to use the new LLM-powered content filtering and markdown generation.
    """
    print("\n=== 4. LLM-Powered Markdown Generation Demo ===")
    crawler = AsyncWebCrawler()
    url = "https://docs.python.org/3/tutorial/classes.html"
    content_filter = LLMContentFilter(
        provider="openai/gpt-4o",
        api_token=os.getenv("OPENAI_API_KEY"),
        instruction="""
        Focus on extracting the core educational content about Python classes.
        Include:
        - Key concepts and their explanations
        - Important code examples
        - Essential technical details
        Exclude:
        - Navigation elements
        - Sidebars
        - Footer content
        - Version information
        - Any non-essential UI elements
        Format the output as clean markdown with proper code blocks and headers.
        """,
        verbose=True,
    )
    # Configure LLM-powered markdown generation
    config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=content_filter
        ),
        cache_mode=CacheMode.BYPASS,
        verbose=True
    )
    print("Generating focused markdown with LLM...")
    async with crawler:
        result = await crawler.arun(url, config=config)
        if result.success and result.markdown_v2:
            print("Successfully generated LLM-filtered markdown")
            print("First 500 chars of filtered content:")
            print(result.markdown_v2.fit_markdown[:500])
 async def demo_robots_compliance():
    """
    5. Robots.txt Compliance Demo
    ==========================
    Demonstrates the new robots.txt compliance feature with SQLite caching.
    """
    print("\n=== 5. Robots.txt Compliance Demo ===")
    crawler = AsyncWebCrawler()
    urls = ["https://example.com", "https://facebook.com", "https://twitter.com"]
    # Enable robots.txt checking
    config = CrawlerRunConfig(check_robots_txt=True, verbose=True)
    print("Crawling with robots.txt compliance...")
    async with crawler:
        results = await crawler.arun_many(urls, config=config)
        for result in results:
            if result.status_code == 403:
                print(f"Access blocked by robots.txt: {result.url}")
            elif result.success:
                print(f"Successfully crawled: {result.url}")
 async def demo_json_schema_generation():
    """
    7. LLM-Powered Schema Generation Demo
    =================================
    Demonstrates automatic CSS and XPath schema generation using LLM models.
    """
    print("\n=== 7. LLM-Powered Schema Generation Demo ===")
    # Example HTML content for a job listing
    html_content = """
    <div class="job-listing">
        <h1 class="job-title">Senior Software Engineer</h1>
        <div class="job-details">
            <span class="location">San Francisco, CA</span>
            <span class="salary">$150,000 - $200,000</span>
            <div class="requirements">
                <h2>Requirements</h2>
                <ul>
                    <li>5+ years Python experience</li>
                    <li>Strong background in web crawling</li>
                </ul>
            </div>
        </div>
    </div>
    """
    print("Generating CSS selectors schema...")
    # Generate CSS selectors with a specific query
    css_schema = JsonCssExtractionStrategy.generate_schema(
        html_content,
        schema_type="CSS",
        query="Extract job title, location, and salary information",
        provider="openai/gpt-4o",  # or use other providers like "ollama"
    )
    print("\nGenerated CSS Schema:")
    print(css_schema)
    # Example of using the generated schema with crawler
    crawler = AsyncWebCrawler()
    url = "https://example.com/job-listing"
    # Create an extraction strategy with the generated schema
    extraction_strategy = JsonCssExtractionStrategy(schema=css_schema)
    config = CrawlerRunConfig(extraction_strategy=extraction_strategy, verbose=True)
    print("\nTesting generated schema with crawler...")
    async with crawler:
        result = await crawler.arun(url, config=config)
        if result.success:
            # extracted_content is a JSON string, so parse it before pretty-printing
            if result.extracted_content:
                print(json.dumps(json.loads(result.extracted_content), indent=2))
            print("Successfully used generated schema for crawling")
 async def demo_proxy_rotation():
    """
    8. Proxy Rotation Demo
    ===================
    Demonstrates how to rotate proxies for each request using Crawl4ai.
    """
    print("\n=== 8. Proxy Rotation Demo ===")
    async def get_next_proxy() -> Optional[Dict]:
        """Pick a random proxy from the PROXIES environment variable."""
        # PROXIES is a comma-separated list of ip:port:username:password entries
        try:
            proxies = os.getenv("PROXIES", "").split(",")
            ip, port, username, password = random.choice(proxies).split(":")
            return {
                "server": f"http://{ip}:{port}",
                "username": username,
                "password": password,
                "ip": ip  # Store original IP for verification
            }
        except Exception as e:
            print(f"Error loading proxy: {e}")
            return None
    # Create 2 test requests to httpbin
    urls = ["https://httpbin.org/ip"] * 2
    browser_config = BrowserConfig(headless=True, verbose=False)
    run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        for url in urls:
            proxy = await get_next_proxy()
            if not proxy:
                print("No proxy available, skipping...")
                continue
            # Create new config with proxy
            current_config = run_config.clone(proxy_config=proxy)
            result = await crawler.arun(url=url, config=current_config)
            if result.success:
                ip_match = re.search(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}', result.html)
                # Guard against a missing match so a response without an IP does not raise
                response_ip = ip_match.group(0) if ip_match else None
                print(f"Proxy {proxy['ip']} -> Response IP: {response_ip or 'Not found'}")
                verified = response_ip == proxy['ip']
                if verified:
                    print(f"✅ Proxy working! IP matches: {proxy['ip']}")
                else:
                    print("❌ Proxy failed or IP mismatch!")
            else:
                print(f"Failed with proxy {proxy['ip']}")
 async def main():
    """Run all feature demonstrations."""
    print("\n📊 Running Crawl4ai v0.4.3 Feature Demos\n")
    # Efficiency & Speed Demos
    print("\n🚀 EFFICIENCY & SPEED DEMOS")
    await demo_memory_dispatcher()
    await demo_streaming_support()
    await demo_content_scraping()
    # LLM Integration Demos
    print("\n🤖 LLM INTEGRATION DEMOS")
    await demo_json_schema_generation()
    await demo_llm_markdown()
    # Core Improvements
    print("\n🔧 CORE IMPROVEMENT DEMOS")
    await demo_robots_compliance()
    await demo_proxy_rotation()
 if __name__ == "__main__":
    asyncio.run(main())
--- a/docs/md_v2/advanced/advanced-features.md
+++ b/docs/md_v2/advanced/advanced-features.md
@@ -0,0 +1,365 @@
 # Overview of Some Important Advanced Features 
 (Proxy, PDF, Screenshot, SSL, Headers, & Storage State)
 Crawl4AI offers multiple power-user features that go beyond simple crawling. This tutorial covers:
 1. **Proxy Usage**  
 2. **Capturing PDFs & Screenshots**  
 3. **Handling SSL Certificates**  
 4. **Custom Headers**  
 5. **Session Persistence & Local Storage**
 6. **Robots.txt Compliance**
 > **Prerequisites**  
 > - You have a basic grasp of [AsyncWebCrawler Basics](../core/simple-crawling.md)  
 > - You know how to run or configure your Python environment with Playwright installed
 ---
 ## 1. Proxy Usage
 If you need to route your crawl traffic through a proxy—whether for IP rotation, geo-testing, or privacy—Crawl4AI supports it via `BrowserConfig.proxy_config`.
 ```python
 import asyncio
 from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
 async def main():
    browser_cfg = BrowserConfig(
        proxy_config={
            "server": "http://proxy.example.com:8080",
            "username": "myuser",
            "password": "mypass",
        },
        headless=True
    )
    crawler_cfg = CrawlerRunConfig(
        verbose=True
    )
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://www.whatismyip.com/",
            config=crawler_cfg
        )
        if result.success:
            print("[OK] Page fetched via proxy.")
            print("Page HTML snippet:", result.html[:200])
        else:
            print("[ERROR]", result.error_message)
 if __name__ == "__main__":
    asyncio.run(main())
 ```
 **Key Points**  
 - **`proxy_config`** expects a dict with `server` and optional auth credentials.  
 - Many commercial proxies provide an HTTP/HTTPS “gateway” server that you specify in `server`.  
 - If your proxy doesn’t need auth, omit `username`/`password`.
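As a rough illustration of preparing `proxy_config` values, the sketch below turns proxy strings into dicts of the shape shown above and cycles through them. The `host:port[:username:password]` string format and the helper names `parse_proxy` and `proxy_cycle` are assumptions made for this example; they are not part of Crawl4AI.

```python
import itertools
from typing import Dict, Iterator, Optional


def parse_proxy(entry: str) -> Optional[Dict[str, str]]:
    """Turn 'host:port' or 'host:port:user:pass' into a proxy_config-style dict."""
    parts = entry.strip().split(":")
    if len(parts) == 2:
        host, port = parts
        return {"server": f"http://{host}:{port}"}
    if len(parts) == 4:
        host, port, username, password = parts
        return {
            "server": f"http://{host}:{port}",
            "username": username,
            "password": password,
        }
    return None  # Malformed entry: neither 2 nor 4 fields


def proxy_cycle(raw: str) -> Iterator[Dict[str, str]]:
    """Cycle endlessly over the valid proxies in a comma-separated string."""
    candidates = (parse_proxy(e) for e in raw.split(",") if e.strip())
    proxies = [p for p in candidates if p is not None]
    if not proxies:
        raise ValueError("no valid proxies supplied")
    return itertools.cycle(proxies)
```

Each dict produced this way could be passed as `proxy_config`; round-robin cycling is only one possible policy, and random selection would work just as well.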
 ---
 ## 2. Capturing PDFs & Screenshots
 Sometimes you need a visual record of a page or a PDF “printout.” Crawl4AI can do both in one pass:
 ```python
 import os, asyncio
 from base64 import b64decode
 from crawl4ai import AsyncWebCrawler, CacheMode
 async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/List_of_common_misconceptions",
            cache_mode=CacheMode.BYPASS,
            pdf=True,
            screenshot=True
        )
        if result.success:
            # Save screenshot
            if result.screenshot:
                with open("wikipedia_screenshot.png", "wb") as f:
                    f.write(b64decode(result.screenshot))
            # Save PDF
            if result.pdf:
                with open("wikipedia_page.pdf", "wb") as f:
                    f.write(result.pdf)
            print("[OK] PDF & screenshot captured.")
        else:
            print("[ERROR]", result.error_message)
 if __name__ == "__main__":
    asyncio.run(main())
 ```
 **Why PDF + Screenshot?**  
 - Large or complex pages can be slow or error-prone with “traditional” full-page screenshots.  
 - Exporting a PDF is more reliable for very long pages. Crawl4AI automatically converts the first PDF page into an image if you request both.  
 **Relevant Parameters**  
 - **`pdf=True`**: Exports the current page as a PDF (raw bytes in `result.pdf`).  
 - **`screenshot=True`**: Creates a screenshot (base64-encoded in `result.screenshot`).  
 - **`scan_full_page`** or advanced hooking can further refine how the crawler captures content.
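Because captures may arrive either as base64 text (as with the screenshot) or as raw bytes, a small helper that accepts both keeps the file-saving code uniform. The following is a minimal sketch using only the standard library; `save_capture` is a hypothetical name, not a Crawl4AI function.

```python
import base64
from pathlib import Path
from typing import Union


def save_capture(payload: Union[str, bytes], path: str) -> int:
    """Write a capture to disk and return the number of bytes written.

    str payloads are treated as base64 text and decoded first;
    bytes payloads are assumed to be raw file content and written as-is.
    """
    if isinstance(payload, str):
        data = base64.b64decode(payload, validate=True)
    else:
        data = payload
    return Path(path).write_bytes(data)
```

With this in place, something like `save_capture(result.screenshot, "page.png")` and `save_capture(result.pdf, "page.pdf")` would both work without the caller needing to remember which field is encoded.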
 ---
 ## 3. Handling SSL Certificates
 If you need to verify or export a site’s SSL certificate—for compliance, debugging, or data analysis—Crawl4AI can fetch it during the crawl:
 ```python
 import asyncio, os
 from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
 async def main():
    tmp_dir = os.path.join(os.getcwd(), "tmp")
    os.makedirs(tmp_dir, exist_ok=True)
    config = CrawlerRunConfig(
        fetch_ssl_certificate=True,
        cache_mode=CacheMode.BYPASS
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        if result.success and result.ssl_certificate:
            cert = result.ssl_certificate
            print("\nCertificate Information:")
            print(f"Issuer (CN): {cert.issuer.get('CN', '')}")
            print(f"Valid until: {cert.valid_until}")
            print(f"Fingerprint: {cert.fingerprint}")
            # Export in multiple formats:
            cert.to_json(os.path.join(tmp_dir, "certificate.json"))
            cert.to_pem(os.path.join(tmp_dir, "certificate.pem"))
            cert.to_der(os.path.join(tmp_dir, "certificate.der"))
            print("\nCertificate exported to JSON/PEM/DER in 'tmp' folder.")
        else:
            print("[ERROR] No certificate or crawl failed.")
 if __name__ == "__main__":
    asyncio.run(main())
 ```
 **Key Points**  
 - **`fetch_ssl_certificate=True`** triggers certificate retrieval.  
 - `result.ssl_certificate` includes methods (`to_json`, `to_pem`, `to_der`) for saving in various formats (handy for server config, Java keystores, etc.).
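PEM and DER are two encodings of the same certificate: DER is the binary form, and PEM is that binary base64-encoded between `BEGIN CERTIFICATE` / `END CERTIFICATE` markers. The sketch below shows the relationship using Python's standard `ssl` and `hashlib` modules. It is not how Crawl4AI's `to_pem` / `to_der` are implemented, and the colon-separated fingerprint format is an assumption that may differ from `cert.fingerprint`.

```python
import hashlib
import ssl


def der_to_pem(der: bytes) -> str:
    """Wrap binary DER data in the text-based PEM envelope."""
    return ssl.DER_cert_to_PEM_cert(der)


def pem_to_der(pem: str) -> bytes:
    """Strip the PEM envelope and return the binary DER data."""
    return ssl.PEM_cert_to_DER_cert(pem)


def sha256_fingerprint(der: bytes) -> str:
    """SHA-256 digest of the DER bytes as colon-separated uppercase hex pairs."""
    digest = hashlib.sha256(der).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))
```

Computing the fingerprint over the DER bytes means that a certificate exported in either format can be compared against the same reference value.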
 ---
 ## 4. Custom Headers
 Sometimes you need to set custom headers (e.g., language preferences, authentication tokens, or specialized user-agent strings). You can do this in multiple ways:
 ```python
 import asyncio
 from crawl4ai import AsyncWebCrawler
 async def main():
    # Option 1: Set headers at the crawler strategy level
    crawler1 = AsyncWebCrawler(
        # The underlying strategy can accept headers in its constructor
        crawler_strategy=None  # We'll override below for clarity
    )
    crawler1.crawler_strategy.update_user_agent("MyCustomUA/1.0")
    crawler1.crawler_strategy.set_custom_headers({
        "Accept-Language": "fr-FR,fr;q=0.9"
    })
    result1 = await crawler1.arun("https://www.example.com")
    print("Example 1 result success:", result1.success)
    # Option 2: Pass headers directly to `arun()`
    crawler2 = AsyncWebCrawler()
    result2 = await crawler2.arun(
        url="https://www.example.com",
        headers={"Accept-Language": "es-ES,es;q=0.9"}
    )
    print("Example 2 result success:", result2.success)
 if __name__ == "__main__":
    asyncio.run(main())
 ```
 **Notes**  
 - Some sites may react differently to certain headers (e.g., `Accept-Language`).  
 - If you need advanced user-agent randomization or client hints, see [Identity-Based Crawling (Anti-Bot)](./identity-based-crawling.md) or use `UserAgentGenerator`.
 ---
 ## 5. Session Persistence & Local Storage
 Crawl4AI can preserve cookies and localStorage so you can continue where you left off—ideal for logging into sites or skipping repeated auth flows.
 ### 5.1 `storage_state`
 ```python
 import asyncio
 from crawl4ai import AsyncWebCrawler
 async def main():
    storage_dict = {
        "cookies": [
            {
                "name": "session",
                "value": "abcd1234",
                "domain": "example.com",
                "path": "/",
                "expires": 1699999999.0,
                "httpOnly": False,
                "secure": False,
                "sameSite": "None"
            }
        ],
        "origins": [
            {
                "origin": "https://example.com",
                "localStorage": [
                    {"name": "token", "value": "my_auth_token"}
                ]
            }
        ]
    }
    # Provide the storage state as a dictionary to start "already logged in"
    async with AsyncWebCrawler(
        headless=True,
        storage_state=storage_dict
    ) as crawler:
        result = await crawler.arun("https://example.com/protected")
        if result.success:
            print("Protected page content length:", len(result.html))
        else:
            print("Failed to crawl protected page")
 if __name__ == "__main__":
    asyncio.run(main())
 ```
 ### 5.2 Exporting & Reusing State
 You can sign in once, export the browser context, and reuse it later—without re-entering credentials.
 - **`await context.storage_state(path="my_storage.json")`**: Exports cookies, localStorage, etc. to a file.  
 - Provide `storage_state="my_storage.json"` on subsequent runs to skip the login step.
 **See**: [Detailed session management tutorial](./session-management.md) or [Explanations → Browser Context & Managed Browser](./identity-based-crawling.md) for more advanced scenarios (like multi-step logins, or capturing after interactive pages).
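A saved state file is plain JSON in the shape shown in section 5.1, so it can be inspected or pruned before reuse. The sketch below is a minimal, standard-library-only example that saves a state dict and drops expired cookies when loading it. The helper names are hypothetical, and treating `expires == -1` (or a missing `expires`) as a session cookie follows Playwright's convention, which is assumed to apply here.

```python
import json
import time
from pathlib import Path
from typing import Any, Dict, Optional


def drop_expired_cookies(state: Dict[str, Any], now: Optional[float] = None) -> Dict[str, Any]:
    """Return a copy of the state without cookies whose expiry time has passed."""
    now = time.time() if now is None else now

    def is_live(cookie: Dict[str, Any]) -> bool:
        expires = cookie.get("expires", -1)
        return expires == -1 or expires > now  # -1 marks a session cookie

    kept = [c for c in state.get("cookies", []) if is_live(c)]
    return {**state, "cookies": kept}


def save_state(state: Dict[str, Any], path: str) -> None:
    """Write the storage state to a JSON file."""
    Path(path).write_text(json.dumps(state, indent=2), encoding="utf-8")


def load_state(path: str, now: Optional[float] = None) -> Dict[str, Any]:
    """Read a storage state file and discard cookies that are already expired."""
    state = json.loads(Path(path).read_text(encoding="utf-8"))
    return drop_expired_cookies(state, now)
```

If `load_state` returns no cookies at all, the saved session has most likely lapsed and the login step needs to be repeated.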
 ---
 ## 6. Robots.txt Compliance
 Crawl4AI supports respecting robots.txt rules with efficient caching:
 ```python
 import asyncio
 from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
 async def main():
    # Enable robots.txt checking in config
    config = CrawlerRunConfig(
        check_robots_txt=True  # Will check and respect robots.txt rules
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            "https://example.com",
            config=config
        )
        if not result.success and result.status_code == 403:
            print("Access denied by robots.txt")
 if __name__ == "__main__":
    asyncio.run(main())
 ```
 **Key Points**
 - Robots.txt files are cached locally for efficiency
 - Cache is stored in `~/.crawl4ai/robots/robots_cache.db`
 - Cache has a default TTL of 7 days
 - If robots.txt can't be fetched, crawling is allowed
 - Returns 403 status code if URL is disallowed
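To make the caching behavior above more concrete, here is a simplified, in-memory sketch built on the standard library's `urllib.robotparser`. It is not Crawl4AI's implementation (which, per the points above, persists its cache in a database file); the `RobotsCache` class and its methods are hypothetical, and only the 7-day TTL and the "allow when no rules are available" policy are taken from the text.

```python
import time
from typing import Dict, Optional, Tuple
from urllib.robotparser import RobotFileParser

SEVEN_DAYS = 7 * 24 * 60 * 60  # Default TTL in seconds


class RobotsCache:
    """Keeps parsed robots.txt rules per origin, with a time-to-live."""

    def __init__(self, ttl: float = SEVEN_DAYS) -> None:
        self._ttl = ttl
        self._entries: Dict[str, Tuple[float, RobotFileParser]] = {}

    def store(self, origin: str, robots_txt: str, now: Optional[float] = None) -> None:
        """Parse robots.txt text and remember it together with the fetch time."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        fetched_at = time.time() if now is None else now
        self._entries[origin] = (fetched_at, parser)

    def allowed(self, origin: str, url: str, user_agent: str = "*",
                now: Optional[float] = None) -> bool:
        """Check a URL against cached rules; allow it when no fresh rules exist."""
        now = time.time() if now is None else now
        entry = self._entries.get(origin)
        if entry is None or now - entry[0] > self._ttl:
            # No usable rules. A real crawler would refetch robots.txt here;
            # this sketch simply allows the request.
            return True
        return entry[1].can_fetch(user_agent, url)
```

A caller would check `allowed(...)` before each request and skip (or report a 403-style failure for) any URL that returns `False`.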
 ---
 ## Putting It All Together
 Here’s a snippet that combines multiple “advanced” features (proxy, PDF, screenshot, SSL, custom headers, and session reuse) into one run. Normally, you’d tailor each setting to your project’s needs.
 ```python
 import os, asyncio
 from base64 import b64decode
 from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
 async def main():
    # 1. Browser config with proxy + headless
    browser_cfg = BrowserConfig(
        proxy_config={
            "server": "http://proxy.example.com:8080",
            "username": "myuser",
            "password": "mypass",
        },
        headless=True,
    )
    # 2. Crawler config with PDF, screenshot, SSL, custom headers, and ignoring caches
    crawler_cfg = CrawlerRunConfig(
        pdf=True,
        screenshot=True,
        fetch_ssl_certificate=True,
        cache_mode=CacheMode.BYPASS,
        headers={"Accept-Language": "en-US,en;q=0.8"},
        storage_state="my_storage.json",  # Reuse session from a previous sign-in
        verbose=True,
    )
    # 3. Crawl
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url = "https://secure.example.com/protected", 
            config=crawler_cfg
        )
        if result.success:
            print("[OK] Crawled the secure page. Links found:", len(result.links.get("internal", [])))
            # Save PDF & screenshot
            if result.pdf:
                with open("result.pdf", "wb") as f:
                    f.write(result.pdf)  # result.pdf is already raw bytes
            if result.screenshot:
                with open("result.png", "wb") as f:
                    f.write(b64decode(result.screenshot))
            # Check SSL cert
            if result.ssl_certificate:
                print("SSL Issuer CN:", result.ssl_certificate.issuer.get("CN", ""))
        else:
            print("[ERROR]", result.error_message)
 if __name__ == "__main__":
    asyncio.run(main())
 ```
 ---
 ## Conclusion & Next Steps
 You’ve now explored several **advanced** features:
 - **Proxy Usage**  
 - **PDF & Screenshot** capturing for large or critical pages  
 - **SSL Certificate** retrieval & exporting  
 - **Custom Headers** for language or specialized requests  
 - **Session Persistence** via storage state
 - **Robots.txt Compliance**
 With these power tools, you can build robust scraping workflows that mimic real user behavior, handle secure sites, capture detailed snapshots, and manage sessions across multiple runs—streamlining your entire data collection pipeline.
 **Last Updated**: 2025-01-01
--- a/docs/md_v2/advanced/content-processing.md
+++ b/docs/md_v2/advanced/content-processing.md
@@ -1,223 +0,0 @@
 # Content Processing
 Crawl4AI provides powerful content processing capabilities that help you extract clean, relevant content from web pages. This guide covers content cleaning, media handling, link analysis, and metadata extraction.
 ## Content Cleaning
 ### Understanding Clean Content
 When crawling web pages, you often encounter a lot of noise - advertisements, navigation menus, footers, popups, and other irrelevant content. Crawl4AI automatically cleans this noise using several approaches:
 1. **Basic Cleaning**: Removes unwanted HTML elements and attributes
 2. **Content Relevance**: Identifies and preserves meaningful content blocks
 3. **Layout Analysis**: Understands page structure to identify main content areas
 ```python
 result = await crawler.arun(
    url="https://example.com",
    word_count_threshold=10,        # Remove blocks with fewer words
    excluded_tags=['form', 'nav'],  # Remove specific HTML tags
    remove_overlay_elements=True    # Remove popups/modals
 )
 # Get clean content
 print(result.cleaned_html)  # Cleaned HTML
 print(result.markdown)      # Clean markdown version
 ```
 ### Fit Markdown: Smart Content Extraction
 One of Crawl4AI's most powerful features is `fit_markdown`. This feature uses advanced heuristics to identify and extract the main content from a webpage while excluding irrelevant elements.
 #### How Fit Markdown Works
 - Analyzes content density and distribution
 - Identifies content patterns and structures
 - Removes boilerplate content (headers, footers, sidebars)
 - Preserves the most relevant content blocks
 - Maintains content hierarchy and formatting
 #### Perfect For:
 - Blog posts and articles
 - News content
 - Documentation pages
 - Any page with a clear main content area
 #### Not Recommended For:
 - E-commerce product listings
 - Search results pages
 - Social media feeds
 - Pages with multiple equal-weight content sections
 ```python
 result = await crawler.arun(url="https://example.com")
 # Get the most relevant content
 main_content = result.fit_markdown
 # Compare with regular markdown
 all_content = result.markdown
 print(f"Fit Markdown Length: {len(main_content)}")
 print(f"Regular Markdown Length: {len(all_content)}")
 ```
 #### Example Use Case
 ```python
 async def extract_article_content(url: str) -> str:
    """Extract main article content from a blog or news site."""
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        # fit_markdown will focus on the article content,
        # excluding navigation, ads, and other distractions
        return result.fit_markdown
 ```
 ## Media Processing
 Crawl4AI provides comprehensive media extraction and analysis capabilities. It automatically detects and processes various types of media elements while maintaining their context and relevance.
 ### Image Processing
 The library handles various image scenarios, including:
 - Regular images
 - Lazy-loaded images
 - Background images
 - Responsive images
 - Image metadata and context
 ```python
 result = await crawler.arun(url="https://example.com")
 for image in result.media["images"]:
    # Each image includes rich metadata
    print(f"Source: {image['src']}")
    print(f"Alt text: {image['alt']}")
    print(f"Description: {image['desc']}")
    print(f"Context: {image['context']}")  # Surrounding text
    print(f"Relevance score: {image['score']}")  # 0-10 score
 ```
 ### Handling Lazy-Loaded Content
 Crawl4AI already handles lazy loading for media elements. You can also customize the wait time for lazy-loaded content:
 ```python
 result = await crawler.arun(
    url="https://example.com",
    wait_for="css:img[data-src]",  # Wait for lazy images
    delay_before_return_html=2.0   # Additional wait time
 )
 ```
 ### Video and Audio Content
 The library extracts video and audio elements with their metadata:
 ```python
 # Process videos
 for video in result.media["videos"]:
    print(f"Video source: {video['src']}")
    print(f"Type: {video['type']}")
    print(f"Duration: {video.get('duration')}")
    print(f"Thumbnail: {video.get('poster')}")
 # Process audio
 for audio in result.media["audios"]:
    print(f"Audio source: {audio['src']}")
    print(f"Type: {audio['type']}")
    print(f"Duration: {audio.get('duration')}")
 ```
 ## Link Analysis
 Crawl4AI provides sophisticated link analysis capabilities, helping you understand the relationship between pages and identify important navigation patterns.
 ### Link Classification
 The library automatically categorizes links into:
 - Internal links (same domain)
 - External links (different domains)
 - Social media links
 - Navigation links
 - Content links
 ```python
 result = await crawler.arun(url="https://example.com")
 # Analyze internal links
 for link in result.links["internal"]:
    print(f"Internal: {link['href']}")
    print(f"Link text: {link['text']}")
    print(f"Context: {link['context']}")  # Surrounding text
    print(f"Type: {link['type']}")  # nav, content, etc.
 # Analyze external links
 for link in result.links["external"]:
    print(f"External: {link['href']}")
    print(f"Domain: {link['domain']}")
    print(f"Type: {link['type']}")
 ```
 ### Smart Link Filtering
 Control which links are included in the results:
 ```python
 result = await crawler.arun(
    url="https://example.com",
    exclude_external_links=True,          # Remove external links
    exclude_social_media_links=True,      # Remove social media links
    exclude_social_media_domains=[                # Custom social media domains
        "facebook.com", "twitter.com", "instagram.com"
    ],
    exclude_domains=["ads.example.com"]   # Exclude specific domains
 )
 ```
 ## Metadata Extraction
 Crawl4AI automatically extracts and processes page metadata, providing valuable information about the content:
 ```python
 result = await crawler.arun(url="https://example.com")
 metadata = result.metadata
 print(f"Title: {metadata['title']}")
 print(f"Description: {metadata['description']}")
 print(f"Keywords: {metadata['keywords']}")
 print(f"Author: {metadata['author']}")
 print(f"Published Date: {metadata['published_date']}")
 print(f"Modified Date: {metadata['modified_date']}")
 print(f"Language: {metadata['language']}")
 ```
 ## Best Practices
 1. **Use Fit Markdown for Articles**
   ```python
   # Perfect for blog posts, news articles, documentation
   content = result.fit_markdown
   ```
 2. **Handle Media Appropriately**
   ```python
   # Filter by relevance score
   relevant_images = [
       img for img in result.media["images"]
       if img['score'] > 5
   ]
   ```
 3. **Combine Link Analysis with Content**
   ```python
   # Get content links with context
   content_links = [
       link for link in result.links["internal"]
       if link['type'] == 'content'
   ]
   ```
 4. **Clean Content with Purpose**
   ```python
   # Customize cleaning based on your needs
   result = await crawler.arun(
       url=url,
       word_count_threshold=20,      # Adjust based on content type
       keep_data_attributes=False,   # Remove data attributes
       process_iframes=True         # Include iframe content
   )
   ```
--- a/docs/md_v2/advanced/crawl-dispatcher.md
+++ b/docs/md_v2/advanced/crawl-dispatcher.md
@@ -0,0 +1,12 @@
 # Crawl Dispatcher
 We’re excited to announce a **Crawl Dispatcher** module that can handle **thousands** of crawling tasks simultaneously. By efficiently managing system resources (memory, CPU, network), this dispatcher ensures high-performance data extraction at scale. It also provides **real-time monitoring** of each crawler’s status, memory usage, and overall progress.
 Stay tuned—this feature is **coming soon** in an upcoming release of Crawl4AI! For the latest news, keep an eye on our changelogs and follow [@unclecode](https://twitter.com/unclecode) on X.
 Below is a **sample** of how the dispatcher’s performance monitor might look in action:
 ![Crawl Dispatcher Performance Monitor](../assets/images/dispatcher.png)
 We can’t wait to bring you this streamlined, **scalable** approach to multi-URL crawling—**watch this space** for updates!
--- a/docs/md_v2/advanced/file-downloading.md
+++ b/docs/md_v2/advanced/file-downloading.md
@@ -0,0 +1,118 @@
 # Download Handling in Crawl4AI
 This guide explains how to use Crawl4AI to handle file downloads during crawling. You'll learn how to trigger downloads, specify download locations, and access downloaded files.
 ## Enabling Downloads
 To enable downloads, set the `accept_downloads` parameter in the `BrowserConfig` object and pass it to the crawler.
 ```python
 import asyncio
 from crawl4ai import AsyncWebCrawler
 from crawl4ai.async_configs import BrowserConfig
 async def main():
    config = BrowserConfig(accept_downloads=True)  # Enable downloads globally
    async with AsyncWebCrawler(config=config) as crawler:
        # ... your crawling logic ...
        pass
 asyncio.run(main())
 ```
 ## Specifying Download Location
 Specify the download directory using the `downloads_path` attribute in the `BrowserConfig` object. If not provided, Crawl4AI defaults to creating a "downloads" directory inside the `.crawl4ai` folder in your home directory.
 ```python
 from crawl4ai import AsyncWebCrawler
 from crawl4ai.async_configs import BrowserConfig
 import os
 downloads_path = os.path.join(os.getcwd(), "my_downloads")  # Custom download path
 os.makedirs(downloads_path, exist_ok=True)
 config = BrowserConfig(accept_downloads=True, downloads_path=downloads_path)
 async def main():
    async with AsyncWebCrawler(config=config) as crawler:
        result = await crawler.arun(url="https://example.com")
        # ...
 ```
 ## Triggering Downloads
 Downloads are typically triggered by user interactions on a web page, such as clicking a download button. Use `js_code` in `CrawlerRunConfig` to simulate these actions and `wait_for` to allow sufficient time for downloads to start.
 ```python
 from crawl4ai.async_configs import CrawlerRunConfig
 config = CrawlerRunConfig(
    js_code="""
        const downloadLink = document.querySelector('a[href$=".exe"]');
        if (downloadLink) {
            downloadLink.click();
        }
    """,
    wait_for=5  # Wait 5 seconds for the download to start
 )
 result = await crawler.arun(url="https://www.python.org/downloads/", config=config)
 ```
 ## Accessing Downloaded Files
 The `downloaded_files` attribute of the `CrawlResult` object contains paths to downloaded files.
 ```python
 if result.downloaded_files:
    print("Downloaded files:")
    for file_path in result.downloaded_files:
        print(f"- {file_path}")
        file_size = os.path.getsize(file_path)
        print(f"- File size: {file_size} bytes")
 else:
    print("No files downloaded.")
 ```
 ## Example: Downloading Multiple Files
 ```python
 import asyncio
 from crawl4ai import AsyncWebCrawler
 from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
 import os
 from pathlib import Path
 async def download_multiple_files(url: str, download_path: str):
    config = BrowserConfig(accept_downloads=True, downloads_path=download_path)
    async with AsyncWebCrawler(config=config) as crawler:
        run_config = CrawlerRunConfig(
            js_code="""
                const downloadLinks = document.querySelectorAll('a[download]');
                for (const link of downloadLinks) {
                    link.click();
                    // Delay between clicks
                    await new Promise(r => setTimeout(r, 2000));  
                }
            """,
            wait_for=10  # Wait for all downloads to start
        )
        result = await crawler.arun(url=url, config=run_config)
        if result.downloaded_files:
            print("Downloaded files:")
            for file in result.downloaded_files:
                print(f"- {file}")
        else:
            print("No files downloaded.")
 # Usage
 download_path = os.path.join(Path.home(), ".crawl4ai", "downloads")
 os.makedirs(download_path, exist_ok=True)
 asyncio.run(download_multiple_files("https://www.python.org/downloads/windows/", download_path))
 ```
 ## Important Considerations
 - **Browser Context:** Downloads are managed within the browser context. Ensure `js_code` correctly targets the download triggers on the webpage.
 - **Timing:** Use `wait_for` in `CrawlerRunConfig` to manage download timing.
 - **Error Handling:** Handle errors to manage failed downloads or incorrect paths gracefully.
 - **Security:** Scan downloaded files for potential security threats before use.
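The error-handling and security points above can be partly automated once a crawl returns. The sketch below is one possible post-download check using only the standard library: it confirms that each reported path exists, lies inside the expected downloads directory, and records its size and SHA-256 digest for later comparison or scanning. The function name `summarize_downloads` and the status labels are assumptions made for this example.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable


def summarize_downloads(paths: Iterable[str], downloads_root: str) -> Dict[str, Dict[str, object]]:
    """Report status, size, and SHA-256 for each downloaded file path."""
    root = Path(downloads_root).resolve()
    report: Dict[str, Dict[str, object]] = {}
    for raw in paths:
        path = Path(raw).resolve()
        if not path.is_file():
            report[raw] = {"status": "missing"}
        elif root not in path.parents:
            # The file exists but was written somewhere unexpected
            report[raw] = {"status": "outside_downloads_dir"}
        else:
            data = path.read_bytes()
            report[raw] = {
                "status": "ok",
                "size": len(data),
                "sha256": hashlib.sha256(data).hexdigest(),
            }
    return report
```

It could be called as `summarize_downloads(result.downloaded_files, downloads_path)`. For very large files, hashing in chunks would be preferable to reading the whole file into memory.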
--- a/docs/md_v2/advanced/hooks-auth.md
+++ b/docs/md_v2/advanced/hooks-auth.md
@@ -1,114 +1,254 @@
-# Hooks & Auth for AsyncWebCrawler
+# Hooks & Auth in AsyncWebCrawler
-Crawl4AI's AsyncWebCrawler allows you to customize the behavior of the web crawler using hooks. Hooks are asynchronous functions that are called at specific points in the crawling process, allowing you to modify the crawler's behavior or perform additional actions. This example demonstrates how to use various hooks to customize the asynchronous crawling process.
+Crawl4AI’s **hooks** let you customize the crawler at specific points in the pipeline:
-## Example: Using Crawler Hooks with AsyncWebCrawler
+1. **`on_browser_created`** – After browser creation.  
 2. **`on_page_context_created`** – After a new context & page are created.  
 3. **`before_goto`** – Just before navigating to a page.  
 4. **`after_goto`** – Right after navigation completes.  
 5. **`on_user_agent_updated`** – Whenever the user agent changes.  
 6. **`on_execution_started`** – Once custom JavaScript execution begins.  
 7. **`before_retrieve_html`** – Just before the crawler retrieves final HTML.  
 8. **`before_return_html`** – Right before returning the HTML content.
-Let's see how we can customize the AsyncWebCrawler using hooks! In this example, we'll:
+**Important**: Avoid heavy tasks in `on_browser_created` since you don’t yet have a page context. If you need to *log in*, do so in **`on_page_context_created`**.
-1. Configure the browser when it's created.
+> note "Important Hook Usage Warning"
-2. Add custom headers before navigating to the URL.
+    **Avoid Misusing Hooks**: Do not manipulate page objects in the wrong hook or at the wrong time, as it can crash the pipeline or produce incorrect results. A common mistake is attempting to handle authentication prematurely—such as creating or closing pages in `on_browser_created`. 
 3. Log the current URL after navigation.
 4. Perform actions after JavaScript execution.
 5. Log the length of the HTML before returning it.
-### Hook Definitions
+>   **Use the Right Hook for Auth**: If you need to log in or set tokens, use `on_page_context_created`. This ensures you have a valid page/context to work with, without disrupting the main crawling flow.
 >    **Identity-Based Crawling**: For robust auth, consider identity-based crawling (or passing a session ID) to preserve state. Run your initial login steps in a separate, well-defined process, then feed that session to your main crawl—rather than shoehorning complex authentication into early hooks. Check out [Identity-Based Crawling](../advanced/identity-based-crawling.md) for more details.
 >    **Be Cautious**: Overwriting or removing elements in the wrong hook can compromise the final crawl. Keep hooks focused on smaller tasks (like route filters, custom headers), and let your main logic (crawling, data extraction) proceed normally.
 Below is an example demonstration.
 ---
 ## Example: Using Hooks in AsyncWebCrawler
 ```python
 import asyncio
-from crawl4ai import AsyncWebCrawler
+import json
-from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
-from playwright.async_api import Page, Browser, BrowserContext
+from playwright.async_api import Page, BrowserContext
 async def on_browser_created(browser: Browser):
    print("[HOOK] on_browser_created")
    # Example customization: set browser viewport size
    context = await browser.new_context(viewport={'width': 1920, 'height': 1080})
    page = await context.new_page()
    # Example customization: logging in to a hypothetical website
    await page.goto('https://example.com/login')
    await page.fill('input[name="username"]', 'testuser')
    await page.fill('input[name="password"]', 'password123')
    await page.click('button[type="submit"]')
    await page.wait_for_selector('#welcome')
    # Add a custom cookie
    await context.add_cookies([{'name': 'test_cookie', 'value': 'cookie_value', 'url': 'https://example.com'}])
    await page.close()
    await context.close()
 async def before_goto(page: Page):
    print("[HOOK] before_goto")
    # Example customization: add custom headers
    await page.set_extra_http_headers({'X-Test-Header': 'test'})
 async def after_goto(page: Page):
    print("[HOOK] after_goto")
    # Example customization: log the URL
    print(f"Current URL: {page.url}")
 async def on_execution_started(page: Page):
    print("[HOOK] on_execution_started")
    # Example customization: perform actions after JS execution
    await page.evaluate("console.log('Custom JS executed')")
 async def before_return_html(page: Page, html: str):
    print("[HOOK] before_return_html")
    # Example customization: log the HTML length
    print(f"HTML length: {len(html)}")
    return page
 ```
 ### Using the Hooks with the AsyncWebCrawler
 ```python
 import asyncio
 from crawl4ai import AsyncWebCrawler
 from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
 async def main():
-    print("\n🔗 Using Crawler Hooks: Let's see how we can customize the AsyncWebCrawler using hooks!")
+    print("🔗 Hooks Example: Demonstrating recommended usage")
-    initial_cookies = [
+    # 1) Configure the browser
-        {"name": "sessionId", "value": "abc123", "domain": ".example.com"},
+    browser_config = BrowserConfig(
-        {"name": "userId", "value": "12345", "domain": ".example.com"}
+        headless=True,
-    ]
+        verbose=True
    crawler_strategy = AsyncPlaywrightCrawlerStrategy(verbose=True, cookies=initial_cookies)
    crawler_strategy.set_hook('on_browser_created', on_browser_created)
    crawler_strategy.set_hook('before_goto', before_goto)
    crawler_strategy.set_hook('after_goto', after_goto)
    crawler_strategy.set_hook('on_execution_started', on_execution_started)
    crawler_strategy.set_hook('before_return_html', before_return_html)
    async with AsyncWebCrawler(verbose=True, crawler_strategy=crawler_strategy) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            js_code="window.scrollTo(0, document.body.scrollHeight);",
            wait_for="footer"
    )
-    print("📦 Crawler Hooks result:")
+    # 2) Configure the crawler run
-    print(result)
+    crawler_run_config = CrawlerRunConfig(
        js_code="window.scrollTo(0, document.body.scrollHeight);",
        wait_for="body",
        cache_mode=CacheMode.BYPASS
    )
-asyncio.run(main())
+    # 3) Create the crawler instance
    crawler = AsyncWebCrawler(config=browser_config)
    #
    # Define Hook Functions
    #
    async def on_browser_created(browser, **kwargs):
        # Called once the browser instance is created (but no pages or contexts yet)
        print("[HOOK] on_browser_created - Browser created successfully!")
        # Typically, do minimal setup here if needed
        return browser
    async def on_page_context_created(page: Page, context: BrowserContext, **kwargs):
        # Called right after a new page + context are created (ideal for auth or route config).
        print("[HOOK] on_page_context_created - Setting up page & context.")
        # Example 1: Route filtering (e.g., block images)
        async def route_filter(route):
            if route.request.resource_type == "image":
                print(f"[HOOK] Blocking image request: {route.request.url}")
                await route.abort()
            else:
                await route.continue_()
        await context.route("**", route_filter)
        # Example 2: (Optional) Simulate a login scenario
        # (We do NOT create or close pages here, just do quick steps if needed)
        # e.g., await page.goto("https://example.com/login")
        # e.g., await page.fill("input[name='username']", "testuser")
        # e.g., await page.fill("input[name='password']", "password123")
        # e.g., await page.click("button[type='submit']")
        # e.g., await page.wait_for_selector("#welcome")
        # e.g., await context.add_cookies([...])
        # Then continue
        # Example 3: Adjust the viewport
        await page.set_viewport_size({"width": 1080, "height": 600})
        return page
    async def before_goto(
        page: Page, context: BrowserContext, url: str, **kwargs
    ):
        # Called before navigating to each URL.
        print(f"[HOOK] before_goto - About to navigate: {url}")
        # e.g., inject custom headers
        await page.set_extra_http_headers({
            "Custom-Header": "my-value"
        })
        return page
    async def after_goto(
        page: Page, context: BrowserContext, 
        url: str, response, **kwargs
    ):
        # Called after navigation completes.
        print(f"[HOOK] after_goto - Successfully loaded: {url}")
        # e.g., wait for a certain element if we want to verify
        try:
            await page.wait_for_selector('.content', timeout=1000)
            print("[HOOK] Found .content element!")
        except Exception:
            print("[HOOK] .content not found, continuing anyway.")
        return page
    async def on_user_agent_updated(
        page: Page, context: BrowserContext, 
        user_agent: str, **kwargs
    ):
        # Called whenever the user agent updates.
        print(f"[HOOK] on_user_agent_updated - New user agent: {user_agent}")
        return page
    async def on_execution_started(page: Page, context: BrowserContext, **kwargs):
        # Called after custom JavaScript execution begins.
        print("[HOOK] on_execution_started - JS code is running!")
        return page
    async def before_retrieve_html(page: Page, context: BrowserContext, **kwargs):
        # Called before final HTML retrieval.
        print("[HOOK] before_retrieve_html - We can do final actions")
        # Example: Scroll again
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
        return page
    async def before_return_html(
        page: Page, context: BrowserContext, html: str, **kwargs
    ):
        # Called just before returning the HTML in the result.
        print(f"[HOOK] before_return_html - HTML length: {len(html)}")
        return page
    #
    # Attach Hooks
    #
    crawler.crawler_strategy.set_hook("on_browser_created", on_browser_created)
    crawler.crawler_strategy.set_hook(
        "on_page_context_created", on_page_context_created
    )
    crawler.crawler_strategy.set_hook("before_goto", before_goto)
    crawler.crawler_strategy.set_hook("after_goto", after_goto)
    crawler.crawler_strategy.set_hook(
        "on_user_agent_updated", on_user_agent_updated
    )
    crawler.crawler_strategy.set_hook(
        "on_execution_started", on_execution_started
    )
    crawler.crawler_strategy.set_hook(
        "before_retrieve_html", before_retrieve_html
    )
    crawler.crawler_strategy.set_hook(
        "before_return_html", before_return_html
    )
    await crawler.start()
    # 4) Run the crawler on an example page
    url = "https://example.com"
    result = await crawler.arun(url, config=crawler_run_config)
    if result.success:
        print("\nCrawled URL:", result.url)
        print("HTML length:", len(result.html))
    else:
        print("Error:", result.error_message)
    await crawler.close()
 if __name__ == "__main__":
    asyncio.run(main())
 ```
-### Explanation
+---
- `on_browser_created`: This hook is called when the Playwright browser is created. It sets up the browser context, logs in to a website, and adds a custom cookie.
+## Hook Lifecycle Summary
 - `before_goto`: This hook is called right before Playwright navigates to the URL. It adds custom HTTP headers.
 - `after_goto`: This hook is called after Playwright navigates to the URL. It logs the current URL.
 - `on_execution_started`: This hook is called after any custom JavaScript is executed. It performs additional JavaScript actions.
 - `before_return_html`: This hook is called before returning the HTML content. It logs the length of the HTML content.
-### Additional Ideas
+1. **`on_browser_created`**:  
   - Browser is up, but **no** pages or contexts yet.  
   - Light setup only—don’t try to open or close pages here (that belongs in `on_page_context_created`).
- **Handling authentication**: Use the `on_browser_created` hook to handle login processes or set authentication tokens.
+2. **`on_page_context_created`**:  
- **Dynamic header modification**: Modify headers based on the target URL or other conditions in the `before_goto` hook.
+   - Perfect for advanced **auth** or route blocking.  
- **Content verification**: Use the `after_goto` hook to verify that the expected content is present on the page.
+   - You have a **page** + **context** ready but haven’t navigated to the target URL yet.
- **Custom JavaScript injection**: Inject and execute custom JavaScript using the `on_execution_started` hook.
+
- **Content preprocessing**: Modify or analyze the HTML content in the `before_return_html` hook before it's returned.
+3. **`before_goto`**:  
   - Right before navigation. Typically used for setting **custom headers** or logging the target URL.
 4. **`after_goto`**:  
   - After page navigation is done. Good place for verifying content or waiting on essential elements. 
 5. **`on_user_agent_updated`**:  
   - Whenever the user agent changes (for stealth or different UA modes).
 6. **`on_execution_started`**:  
   - If you set `js_code` or run custom scripts, this runs once your JS is about to start.
 7. **`before_retrieve_html`**:  
   - Just before the final HTML snapshot is taken. Often you do a final scroll or lazy-load triggers here.
 8. **`before_return_html`**:  
   - The last hook before returning HTML to the `CrawlResult`. Good for logging HTML length or minor modifications.
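
The ordering above can be illustrated with a small stand-alone simulation. This is only a sketch, not Crawl4AI's implementation: `HookRunner`, `simulate_crawl`, and `record_order` are hypothetical names, the "page" is a plain dict, and `on_user_agent_updated` is left out because it fires on an event rather than at a fixed point in the sequence.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Optional

Hook = Callable[..., Awaitable[Any]]

# Hooks that fire at a fixed point in the simulated crawl.
SEQUENCE = [
    "on_browser_created",
    "on_page_context_created",
    "before_goto",
    "after_goto",
    "on_execution_started",
    "before_retrieve_html",
    "before_return_html",
]


class HookRunner:
    """Hypothetical stand-in for a crawler strategy's hook registry."""

    def __init__(self) -> None:
        self.hooks: Dict[str, Hook] = {}

    def set_hook(self, name: str, hook: Hook) -> None:
        self.hooks[name] = hook

    async def fire(self, name: str, **kwargs: Any) -> None:
        hook = self.hooks.get(name)
        if hook is not None:
            await hook(**kwargs)


async def simulate_crawl(
    runner: HookRunner, url: str, js_code: Optional[str] = None
) -> None:
    """Fire the hooks in the order described in the lifecycle summary."""
    await runner.fire("on_browser_created", browser="fake-browser")
    page = {"url": None}
    await runner.fire("on_page_context_created", page=page)
    await runner.fire("before_goto", page=page, url=url)
    page["url"] = url  # pretend navigation happened
    await runner.fire("after_goto", page=page, url=url)
    if js_code:  # only relevant when custom JS is configured
        await runner.fire("on_execution_started", page=page)
    await runner.fire("before_retrieve_html", page=page)
    await runner.fire("before_return_html", page=page, html="<html></html>")


def record_order(url: str, js_code: Optional[str] = None) -> List[str]:
    """Attach a recording hook to every name and return the call order."""
    calls: List[str] = []
    runner = HookRunner()

    def make_hook(name: str) -> Hook:
        async def hook(**kwargs: Any) -> None:
            calls.append(name)
        return hook

    for name in SEQUENCE:
        runner.set_hook(name, make_hook(name))
    asyncio.run(simulate_crawl(runner, url, js_code))
    return calls


if __name__ == "__main__":
    for name in record_order("https://example.com", js_code="window.scrollTo(0, 0);"):
        print(name)
```

Running the simulation without `js_code` skips `on_execution_started`, which mirrors the description in item 6.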
 ---
 ## When to Handle Authentication
 **Recommended**: Use **`on_page_context_created`** if you need to:
 - Navigate to a login page or fill forms
 - Set cookies or localStorage tokens
 - Block resource routes to avoid ads
 This ensures the newly created context is under your control **before** `arun()` navigates to the main URL.
 ---
 ## Additional Considerations
 - **Session Management**: If you want multiple `arun()` calls to reuse a single session, pass `session_id=` in your `CrawlerRunConfig`. Hooks remain the same.  
 - **Performance**: Hooks can slow down crawling if they do heavy tasks. Keep them concise.  
 - **Error Handling**: If a hook fails, the overall crawl might fail. Catch exceptions or handle them gracefully.  
 - **Concurrency**: If you run `arun_many()`, each URL triggers these hooks in parallel. Ensure your hooks are thread/async-safe.
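
The error-handling and concurrency points can be sketched without a browser. The names below (`safe_hook`, `HookStats`, `demo`) are hypothetical helpers, not part of Crawl4AI, and the "page" objects are plain dicts. The idea is to wrap each hook so that an exception is logged and the page is handed back unchanged, and to guard shared state with an `asyncio.Lock` when hooks may run concurrently.

```python
import asyncio
import functools
from typing import Any, Awaitable, Callable


def safe_hook(hook: Callable[..., Awaitable[Any]]) -> Callable[..., Awaitable[Any]]:
    """Wrap a hook so an exception is logged instead of aborting the crawl."""

    @functools.wraps(hook)
    async def wrapper(page: Any, *args: Any, **kwargs: Any) -> Any:
        try:
            return await hook(page, *args, **kwargs)
        except Exception as exc:
            print(f"[HOOK] {hook.__name__} failed: {exc!r}")
            return page  # hand the page back unchanged

    return wrapper


class HookStats:
    """Counter shared by hooks that may run concurrently."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self.pages_seen = 0

    async def increment(self) -> int:
        # The lock matters once the critical section contains an await.
        async with self._lock:
            self.pages_seen += 1
            return self.pages_seen


async def demo() -> int:
    stats = HookStats()  # created inside the running event loop

    @safe_hook
    async def flaky_before_goto(page: dict, url: str = "", **kwargs: Any) -> dict:
        await stats.increment()
        if "bad" in url:
            raise ValueError("simulated failure")
        return page

    urls = [
        "https://example.com/a",
        "https://example.com/bad",
        "https://example.com/c",
    ]
    pages = await asyncio.gather(
        *(flaky_before_goto({"url": u}, url=u) for u in urls)
    )
    # Every call returns a page, including the one whose hook raised.
    assert all(p is not None for p in pages)
    return stats.pages_seen


if __name__ == "__main__":
    print("pages seen:", asyncio.run(demo()))
```

Whether swallowing an exception is appropriate depends on the hook: a failed login step should probably still abort the crawl, while a failed logging step should not.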
 ---
 ## Conclusion
 Hooks provide **fine-grained** control over:
 - **Browser** creation (light tasks only)
 - **Page** and **context** creation (auth, route blocking)
 - **Navigation** phases
 - **Final HTML** retrieval
 Follow the recommended usage:
 - **Login** or advanced tasks in `on_page_context_created`  
 - **Custom headers** or logs in `before_goto` / `after_goto`  
 - **Scrolling** or final checks in `before_retrieve_html` / `before_return_html`
 By using these hooks, you can customize the behavior of the AsyncWebCrawler to suit your specific needs, including handling authentication, modifying requests, and preprocessing content.
--- a/docs/md_v2/advanced/identity-based-crawling.md
+++ b/docs/md_v2/advanced/identity-based-crawling.md
@@ -0,0 +1,180 @@
 # Preserve Your Identity with Crawl4AI
 Crawl4AI empowers you to navigate and interact with the web using your **authentic digital identity**, ensuring you’re recognized as a human and not mistaken for a bot. This tutorial covers:
 1. **Managed Browsers** – The recommended approach for persistent profiles and identity-based crawling.  
 2. **Magic Mode** – A simplified fallback solution for quick automation without persistent identity.
 ---
 ## 1. Managed Browsers: Your Digital Identity Solution
 **Managed Browsers** let developers create and use **persistent browser profiles**. These profiles store local storage, cookies, and other session data, letting you browse as your **real self**—complete with logins, preferences, and cookies.
 ### Key Benefits
 - **Authentic Browsing Experience**: Retain session data and browser fingerprints as though you’re a normal user.  
 - **Effortless Configuration**: Once you log in or solve CAPTCHAs in your chosen data directory, you can re-run crawls without repeating those steps.  
 - **Empowered Data Access**: If you can see the data in your own browser, you can automate its retrieval with your genuine identity.
 ---
 This section shows how to **create a user-data directory** using **Playwright’s Chromium** binary rather than a system-wide Chrome/Edge: **locate** the binary, launch it with a `--user-data-dir` argument to set up your profile, then point `BrowserConfig.user_data_dir` to that folder for subsequent crawls.
 ---
 ## 2. Creating a User Data Directory (Command-Line Approach via Playwright)
 If you installed Crawl4AI (which installs Playwright under the hood), you already have a Playwright-managed Chromium on your system. Follow these steps to launch that **Chromium** from your command line, specifying a **custom** data directory:
 1. **Find** the Playwright Chromium binary:
   - On most systems, installed browsers go under a `~/.cache/ms-playwright/` folder or similar path.  
   - To see an overview of installed browsers, run:
     ```bash
     python -m playwright install --dry-run
     ```
     or
     ```bash
     playwright install --dry-run
     ```
     (depending on your environment). This shows where Playwright keeps Chromium.
   - For instance, you might see a path like:
     ```
     ~/.cache/ms-playwright/chromium-1234/chrome-linux/chrome
     ```
     on Linux, or a corresponding folder on macOS/Windows.
 2. **Launch** the Playwright Chromium binary with a **custom** user-data directory:
   ```bash
   # Linux example
   ~/.cache/ms-playwright/chromium-1234/chrome-linux/chrome \
       --user-data-dir=/home/<you>/my_chrome_profile
   ```
   ```bash
   # macOS example (Playwright’s internal binary)
   ~/Library/Caches/ms-playwright/chromium-1234/chrome-mac/Chromium.app/Contents/MacOS/Chromium \
       --user-data-dir=/Users/<you>/my_chrome_profile
   ```
   ```powershell
   # Windows example (PowerShell)
   & "C:\Users\<you>\AppData\Local\ms-playwright\chromium-1234\chrome-win\chrome.exe" `
       --user-data-dir="C:\Users\<you>\my_chrome_profile"
   ```
   **Replace** the path with the actual subfolder indicated in your `ms-playwright` cache structure.  
   - This **opens** a fresh Chromium with your new or existing data folder.  
   - **Log into** any sites or configure your browser the way you want.  
   - **Close** when done—your profile data is saved in that folder.
 3. **Use** that folder in **`BrowserConfig.user_data_dir`**:
   ```python
   from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
   browser_config = BrowserConfig(
       headless=True,
       use_managed_browser=True,
       user_data_dir="/home/<you>/my_chrome_profile",
       browser_type="chromium"
   )
   ```
   - Next time you run your code, it reuses that folder—**preserving** your session data, cookies, local storage, etc.
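
If you prefer to locate the binary from Python, a small helper can scan the cache folder. This is a sketch under stated assumptions: `default_playwright_cache` and `find_chromium_binaries` are hypothetical helpers, the cache locations are Playwright's default per-OS folders, the relative executable paths are the ones from the examples above (they can differ between Playwright versions), and a custom `PLAYWRIGHT_BROWSERS_PATH` is not handled.

```python
import sys
from pathlib import Path
from typing import List


def default_playwright_cache(platform: str = sys.platform) -> Path:
    """Return the default Playwright browser cache folder for a platform."""
    home = Path.home()
    if platform.startswith("win"):
        return home / "AppData" / "Local" / "ms-playwright"
    if platform == "darwin":
        return home / "Library" / "Caches" / "ms-playwright"
    return home / ".cache" / "ms-playwright"


# Relative executable paths, taken from the command-line examples above.
RELATIVE_BINARIES = [
    Path("chrome-linux") / "chrome",
    Path("chrome-mac") / "Chromium.app" / "Contents" / "MacOS" / "Chromium",
    Path("chrome-win") / "chrome.exe",
]


def find_chromium_binaries(cache_dir: Path) -> List[Path]:
    """List Chromium executables found under chromium-* build folders."""
    if not cache_dir.is_dir():
        return []
    found: List[Path] = []
    for build_dir in sorted(cache_dir.glob("chromium-*")):
        for relative in RELATIVE_BINARIES:
            candidate = build_dir / relative
            if candidate.is_file():
                found.append(candidate)
    return found


if __name__ == "__main__":
    cache = default_playwright_cache()
    print("Cache folder:", cache)
    for binary in find_chromium_binaries(cache):
        print("Found:", binary)
```

An empty result usually means the browsers are installed elsewhere or the folder layout differs, in which case the `--dry-run` output remains the reference.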
 ---
 ## 3. Using Managed Browsers in Crawl4AI
 Once you have a data directory with your session data, pass it to **`BrowserConfig`**:
 ```python
 import asyncio
 from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
 async def main():
    # 1) Reference your persistent data directory
    browser_config = BrowserConfig(
        headless=True,             # 'True' for automated runs
        verbose=True,
        use_managed_browser=True,  # Enables persistent browser strategy
        browser_type="chromium",
        user_data_dir="/path/to/my-chrome-profile"
    )
    # 2) Standard crawl config
    crawl_config = CrawlerRunConfig(
        wait_for="css:.logged-in-content"
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com/private", config=crawl_config)
        if result.success:
            print("Successfully accessed private data with your identity!")
        else:
            print("Error:", result.error_message)
 if __name__ == "__main__":
    asyncio.run(main())
 ```
 ### Workflow
 1. **Log in** externally (via CLI or your normal Chrome with `--user-data-dir=...`).  
 2. **Close** that browser.  
 3. **Use** the same folder in `user_data_dir=` in Crawl4AI.  
 4. **Crawl** – The site sees your identity as if you’re the same user who just logged in.
 ---
 ## 4. Magic Mode: Simplified Automation
 If you **don’t** need a persistent profile or identity-based approach, **Magic Mode** offers a quick way to simulate human-like browsing without storing long-term data.
 ```python
 import asyncio
 from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
 async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=CrawlerRunConfig(
                magic=True,  # Simplifies a lot of interaction
                remove_overlay_elements=True,
                page_timeout=60000
            )
        )
        print("Success:", result.success)
 if __name__ == "__main__":
    asyncio.run(main())
 ```
 **Magic Mode**:
 - Simulates a user-like experience  
 - Randomizes user agent & navigator
 - Randomizes interactions & timings  
 - Masks automation signals  
 - Attempts pop-up handling  
 **But** it’s no substitute for **true** user-based sessions if you want a fully legitimate identity-based solution.
 ---
 ## 5. Comparing Managed Browsers vs. Magic Mode
 | Feature                    | **Managed Browsers**                                           | **Magic Mode**                                     |
 |----------------------------|---------------------------------------------------------------|-----------------------------------------------------|
 | **Session Persistence**    | Full localStorage/cookies retained in user_data_dir           | No persistent data (fresh each run)                |
 | **Genuine Identity**       | Real user profile with full rights & preferences              | Emulated user-like patterns, but no actual identity |
 | **Complex Sites**          | Best for login-gated sites or heavy config                    | Simple tasks, minimal login or config needed        |
 | **Setup**                  | External creation of user_data_dir, then use in Crawl4AI       | Single-line approach (`magic=True`)                 |
 | **Reliability**            | Extremely consistent (same data across runs)                  | Good for smaller tasks, can be less stable          |
 ---
 ## 6. Summary
 - **Create** your user-data directory by launching Chrome/Chromium externally with `--user-data-dir=/some/path`.  
 - **Log in** or configure sites as needed, then close the browser.  
 - **Reference** that folder in `BrowserConfig(user_data_dir="...")` + `use_managed_browser=True`.  
 - Enjoy **persistent** sessions that reflect your real identity.  
 - If you only need quick, ephemeral automation, **Magic Mode** might suffice.
 **Recommended**: Always prefer a **Managed Browser** for robust, identity-based crawling and simpler interactions with complex sites. Use **Magic Mode** for quick tasks or prototypes where persistent data is unnecessary.
 With these approaches, you preserve your **authentic** browsing environment, ensuring the site sees you exactly as a normal user—no repeated logins or wasted time.
--- a/docs/md_v2/advanced/lazy-loading.md
+++ b/docs/md_v2/advanced/lazy-loading.md
@@ -0,0 +1,104 @@
 ## Handling Lazy-Loaded Images
 Many websites now load images **lazily** as you scroll. If you need to ensure they appear in your final crawl (and in `result.media`), consider:
 1. **`wait_for_images=True`** – Wait for images to fully load.  
 2. **`scan_full_page`** – Force the crawler to scroll the entire page, triggering lazy loads.  
 3. **`scroll_delay`** – Add small delays between scroll steps.  
 **Note**: If the site requires multiple “Load More” triggers or complex interactions, see the [Page Interaction docs](../core/page-interaction.md).
 ### Example: Ensuring Lazy Images Appear
 ```python
 import asyncio
 from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, BrowserConfig
 from crawl4ai.async_configs import CacheMode
 async def main():
    config = CrawlerRunConfig(
        # Force the crawler to wait until images are fully loaded
        wait_for_images=True,
        # Option 1: If you want to automatically scroll the page to load images
        scan_full_page=True,  # Tells the crawler to try scrolling the entire page
        scroll_delay=0.5,     # Delay (seconds) between scroll steps
        # Option 2: If the site uses a 'Load More' or JS triggers for images,
        # you can also specify js_code or wait_for logic here.
        cache_mode=CacheMode.BYPASS,
        verbose=True
    )
    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        result = await crawler.arun("https://www.example.com/gallery", config=config)
        if result.success:
            images = result.media.get("images", [])
            print("Images found:", len(images))
            for i, img in enumerate(images[:5]):
                print(f"[Image {i}] URL: {img['src']}, Score: {img.get('score','N/A')}")
        else:
            print("Error:", result.error_message)
 if __name__ == "__main__":
    asyncio.run(main())
 ```
 **Explanation**:
 - **`wait_for_images=True`**  
  The crawler tries to ensure images have finished loading before finalizing the HTML.  
 - **`scan_full_page=True`**  
  Tells the crawler to attempt scrolling from top to bottom. Each scroll step helps trigger lazy loading.  
 - **`scroll_delay=0.5`**  
  Pause half a second between each scroll step. Helps the site load images before continuing.
 **When to Use**:
 - **Lazy-Loading**: If images appear only when the user scrolls into view, `scan_full_page` + `scroll_delay` helps the crawler see them.  
 - **Heavier Pages**: If a page is extremely long, be mindful that scanning the entire page can be slow. Adjust `scroll_delay` or the max scroll steps as needed.
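
To get a feel for how slow a full scan can be, a back-of-the-envelope estimate helps. The sketch below assumes the crawler scrolls roughly one viewport height per step and pauses `scroll_delay` seconds after each step; the real scrolling logic may differ, and `estimate_scan_seconds` is a hypothetical helper, not a Crawl4AI function.

```python
import math


def estimate_scan_seconds(
    page_height_px: int,
    viewport_height_px: int = 1080,
    scroll_delay: float = 0.5,
) -> float:
    """Estimate time spent pausing between scroll steps on a full-page scan."""
    if page_height_px <= 0 or viewport_height_px <= 0:
        raise ValueError("heights must be positive")
    if scroll_delay < 0:
        raise ValueError("scroll_delay must not be negative")
    # One step per viewport height, rounded up to cover the last partial screen.
    steps = math.ceil(page_height_px / viewport_height_px)
    return steps * scroll_delay


if __name__ == "__main__":
    for height in (5_400, 54_000, 540_000):
        seconds = estimate_scan_seconds(height)
        print(f"{height:>7} px tall page: about {seconds:.1f} s of scroll delay")
```

Under these assumptions a page 50 viewports tall spends 25 seconds in delays alone at `scroll_delay=0.5`, which is why lowering the delay or limiting the scan matters on very long pages.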
 ---
 ## Combining with Other Link & Media Filters
 You can still combine **lazy-load** logic with the usual **exclude_external_images**, **exclude_domains**, or link filtration:
 ```python
 config = CrawlerRunConfig(
    wait_for_images=True,
    scan_full_page=True,
    scroll_delay=0.5,
    # Filter out external images if you only want local ones
    exclude_external_images=True,
    # Exclude certain domains for links
    exclude_domains=["spammycdn.com"],
 )
 ```
 This approach ensures you see **all** images from the main domain while ignoring external ones, and the crawler physically scrolls the entire page so that lazy-loading triggers.
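
For intuition about what "external" means here, the following sketch classifies image URLs by comparing hostnames. It is an illustration only, not the logic Crawl4AI uses internally: `is_external` and `keep_internal_images` are hypothetical helpers, a leading `www.` is ignored, other subdomains count as external, and `data:` URIs are treated as internal.

```python
from typing import Dict, List
from urllib.parse import urljoin, urlparse


def _normalize(host: str) -> str:
    """Lowercase a hostname and drop a leading 'www.'."""
    host = host.lower()
    return host[4:] if host.startswith("www.") else host


def is_external(src: str, page_url: str) -> bool:
    """Return True when src resolves to a different host than page_url."""
    page_host = urlparse(page_url).hostname or ""
    # urljoin resolves relative and protocol-relative sources against the page.
    src_host = urlparse(urljoin(page_url, src)).hostname or ""
    if not src_host:  # e.g. data: URIs carry no host
        return False
    return _normalize(src_host) != _normalize(page_host)


def keep_internal_images(images: List[Dict], page_url: str) -> List[Dict]:
    """Keep only image entries whose 'src' stays on the page's own host."""
    return [img for img in images if not is_external(img.get("src", ""), page_url)]


if __name__ == "__main__":
    page = "https://www.example.com/gallery"
    images = [
        {"src": "/img/local.png"},
        {"src": "https://example.com/img/also-local.png"},
        {"src": "//cdn.other.net/remote.png"},
    ]
    for img in keep_internal_images(images, page):
        print("kept:", img["src"])
```

Sites that serve their own images from a separate CDN hostname would be classified as external by this rule, so check how your target site hosts media before relying on such a filter.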
 ---
 ## Tips & Troubleshooting
 1. **Long Pages**  
   - Setting `scan_full_page=True` on extremely long or infinite-scroll pages can be resource-intensive.  
   - Consider using [hooks](../core/page-interaction.md) or specialized logic to load specific sections or “Load More” triggers repeatedly.
 2. **Mixed Image Behavior**  
   - Some sites load images in batches as you scroll. If you’re missing images, increase your `scroll_delay` or call multiple partial scrolls in a loop with JS code or hooks.
 3. **Combining with Dynamic Wait**  
   - If the site has a placeholder that only changes to a real image after a certain event, you might do `wait_for="css:img.loaded"` or a custom JS `wait_for`.
 4. **Caching**  
   - If `cache_mode` is enabled, repeated crawls might skip some network fetches. If you suspect caching is missing new images, set `cache_mode=CacheMode.BYPASS` for fresh fetches.
 ---
 With **lazy-loading** support, **wait_for_images**, and **scan_full_page** settings, you can capture the entire gallery or feed of images you expect—even if the site only loads them as the user scrolls. Combine these with the standard media filtering and domain exclusion for a complete link & media handling strategy.
--- a/docs/md_v2/advanced/magic-mode.md
+++ b/docs/md_v2/advanced/magic-mode.md
@@ -1,52 +0,0 @@
 # Magic Mode & Anti-Bot Protection
 Crawl4AI provides powerful anti-detection capabilities, with Magic Mode being the simplest and most comprehensive solution.
 ## Magic Mode
 The easiest way to bypass anti-bot protections:
 ```python
 async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com",
        magic=True  # Enables all anti-detection features
    )
 ```
 Magic Mode automatically:
 - Masks browser automation signals
 - Simulates human-like behavior
 - Overrides navigator properties
 - Handles cookie consent popups
 - Manages browser fingerprinting
 - Randomizes timing patterns
 ## Manual Anti-Bot Options
 While Magic Mode is recommended, you can also configure individual anti-detection features:
 ```python
 result = await crawler.arun(
    url="https://example.com",
    simulate_user=True,        # Simulate human behavior
    override_navigator=True    # Mask automation signals
 )
 ```
 Note: When `magic=True` is used, you don't need to set these individual options.
 ## Example: Handling Protected Sites
 ```python
 async def crawl_protected_site(url: str):
    async with AsyncWebCrawler(headless=True) as crawler:
        result = await crawler.arun(
            url=url,
            magic=True,
            remove_overlay_elements=True,  # Remove popups/modals
            page_timeout=60000            # Increased timeout for protection checks
        )
        return result.markdown if result.success else None
 ```
--- a/docs/md_v2/advanced/managed_browser.md
+++ b/docs/md_v2/advanced/managed_browser.md
@@ -1,136 +0,0 @@
 # Content Filtering in Crawl4AI
 This guide explains how to use content filtering strategies in Crawl4AI to extract the most relevant information from crawled web pages.  You'll learn how to use the built-in `BM25ContentFilter` and how to create your own custom content filtering strategies.
 ## Relevant Content Filter
 The `RelevantContentFilter` is an abstract class that provides a common interface for content filtering strategies. Specific filtering algorithms, like `PruningContentFilter` or `BM25ContentFilter`, inherit from this class and implement the `filter_content` method. This method takes the HTML content as input and returns a list of filtered text blocks.
 ## Pruning Content Filter
 The `PruningContentFilter` is a tree-shaking algorithm that analyzes the HTML DOM structure and removes less relevant nodes based on various metrics like text density, link density, and tag importance. It evaluates each node using a composite scoring system and "prunes" nodes that fall below a certain threshold.
 ### Usage
 ```python
 from crawl4ai import AsyncWebCrawler
 from crawl4ai.content_filter_strategy import PruningContentFilter
 async def filter_content(url):
    async with AsyncWebCrawler() as crawler:
        content_filter = PruningContentFilter(
            min_word_threshold=5,
            threshold_type='dynamic',
            threshold=0.45
        )
        result = await crawler.arun(url=url, extraction_strategy=content_filter, fit_markdown=True)
        if result.success:
            print(f"Cleaned Markdown:\n{result.fit_markdown}")
 ```
 ### Parameters
 - **`min_word_threshold`**: (Optional) Minimum number of words a node must contain to be considered relevant. Nodes with fewer words are automatically pruned.
 - **`threshold_type`**: (Optional, default 'fixed') Controls how pruning thresholds are calculated:
  - `'fixed'`: Uses a constant threshold value for all nodes
  - `'dynamic'`: Adjusts threshold based on node characteristics like tag importance and text/link ratios
 - **`threshold`**: (Optional, default 0.48) Base threshold value for node pruning:
  - For fixed threshold: Nodes scoring below this value are removed
  - For dynamic threshold: This value is adjusted based on node properties
 ### How It Works
 The pruning algorithm evaluates each node using multiple metrics:
 - Text density: Ratio of actual text to overall node content
 - Link density: Proportion of text within links
 - Tag importance: Weight based on HTML tag type (e.g., article, p, div)
 - Content quality: Metrics like text length and structural importance
 Nodes scoring below the threshold are removed, effectively "shaking" less relevant content from the DOM tree. This results in a cleaner document containing only the most relevant content blocks.
 The algorithm is particularly effective for:
 - Removing boilerplate content
 - Eliminating navigation menus and sidebars
 - Preserving main article content
 - Maintaining document structure while removing noise
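
The scoring idea can be sketched in a few lines. The weights, tag table, and formula below are illustrative assumptions, not the values used by `PruningContentFilter`; `NodeStats`, `composite_score`, and `keep_node` are hypothetical names, and only the fixed-threshold mode is modeled.

```python
from dataclasses import dataclass

# Illustrative tag weights; the real filter may use different tags and values.
TAG_WEIGHTS = {"article": 1.0, "p": 0.8, "div": 0.5, "aside": 0.2, "nav": 0.1}


@dataclass
class NodeStats:
    tag: str
    text_len: int       # characters of visible text
    html_len: int       # characters of markup plus text
    link_text_len: int  # characters of text inside links
    word_count: int


def composite_score(node: NodeStats) -> float:
    """Combine text density, link density, and tag weight into one score."""
    text_density = node.text_len / node.html_len if node.html_len else 0.0
    link_density = node.link_text_len / node.text_len if node.text_len else 1.0
    tag_weight = TAG_WEIGHTS.get(node.tag, 0.3)
    # Link-heavy nodes are penalized, so link density enters as (1 - density).
    return 0.4 * text_density + 0.4 * (1.0 - link_density) + 0.2 * tag_weight


def keep_node(node: NodeStats, threshold: float = 0.48, min_word_threshold: int = 5) -> bool:
    """Apply the word-count floor first, then the fixed score threshold."""
    if node.word_count < min_word_threshold:
        return False
    return composite_score(node) >= threshold


if __name__ == "__main__":
    paragraph = NodeStats("p", text_len=800, html_len=1000, link_text_len=40, word_count=150)
    menu = NodeStats("nav", text_len=100, html_len=600, link_text_len=95, word_count=20)
    for node in (paragraph, menu):
        verdict = "keep" if keep_node(node) else "prune"
        print(f"<{node.tag}> score={composite_score(node):.2f} -> {verdict}")
```

In this toy model a text-dense paragraph scores well above the threshold, while a link-heavy navigation block scores far below it and is pruned.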
 ## BM25 Algorithm
 The `BM25ContentFilter` uses the BM25 algorithm, a ranking function used in information retrieval to estimate the relevance of documents to a given search query. In Crawl4AI, this algorithm helps to identify and extract text chunks that are most relevant to the page's metadata or a user-specified query.
 ### Usage
 To use the `BM25ContentFilter`, initialize it and then pass it as the `extraction_strategy` parameter to the `arun` method of the crawler.
 ```python
 import asyncio
 from crawl4ai import AsyncWebCrawler
 from crawl4ai.content_filter_strategy import BM25ContentFilter
 async def filter_content(url, query=None):
    async with AsyncWebCrawler() as crawler:
        content_filter = BM25ContentFilter(user_query=query)
        result = await crawler.arun(url=url, extraction_strategy=content_filter, fit_markdown=True) # Set fit_markdown flag to True to trigger BM25 filtering
        if result.success:
            print(f"Filtered Content (JSON):\n{result.extracted_content}")
            print(f"\nFiltered Markdown:\n{result.fit_markdown}") # New field in CrawlResult object
            print(f"\nFiltered HTML:\n{result.fit_html}") # New field in CrawlResult object. Note that raw HTML may have tags re-organized due to internal parsing.
        else:
            print("Error:", result.error_message)
 # Example usage:
 asyncio.run(filter_content("https://en.wikipedia.org/wiki/Apple", "fruit nutrition health")) # with query
 asyncio.run(filter_content("https://en.wikipedia.org/wiki/Apple")) # without query, metadata will be used as the query.
 ```
 ### Parameters
 - **`user_query`**:  (Optional) A string representing the search query. If not provided, the filter extracts relevant metadata (title, description, keywords) from the page and uses that as the query.
 - **`bm25_threshold`**: (Optional, default 1.0)  A float value that controls the threshold for relevance.  Higher values result in stricter filtering, returning only the most relevant text chunks. Lower values result in more lenient filtering.
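
For readers unfamiliar with BM25, here is a compact stand-alone version of the standard Okapi BM25 formula applied to text chunks. It is a sketch, not the code inside `BM25ContentFilter`: the tokenizer, the `k1` and `b` defaults, and the helper names `bm25_scores` and `filter_chunks` are assumptions, and the score scale of the real filter may differ.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(chunks: List[str], query: str, k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each chunk against the query with Okapi BM25."""
    docs = [tokenize(chunk) for chunk in chunks]
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(doc) for doc in docs) / n_docs
    terms = set(tokenize(query))
    # Document frequency: number of chunks containing each query term.
    doc_freq = {t: sum(1 for doc in docs if t in doc) for t in terms}

    scores: List[float] = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = 1.0 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


def filter_chunks(chunks: List[str], query: str, bm25_threshold: float = 1.0) -> List[str]:
    """Keep chunks whose BM25 score reaches the threshold."""
    scores = bm25_scores(chunks, query)
    return [chunk for chunk, score in zip(chunks, scores) if score >= bm25_threshold]


if __name__ == "__main__":
    chunks = [
        "Apples are a nutritious fruit rich in fiber",
        "The company released a new phone",
        "Site navigation home about contact",
    ]
    for chunk, score in zip(chunks, bm25_scores(chunks, "fruit nutrition health")):
        print(f"{score:.3f}  {chunk}")
```

Raising the threshold keeps fewer chunks, which matches the behavior described for `bm25_threshold`; chunks that share no term with the query always score zero.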
 ## Fit Markdown Flag
 Setting the `fit_markdown` flag to `True` in the `arun` method activates the BM25 content filtering during the crawl. The `fit_markdown` parameter instructs the scraper to extract and clean the HTML, primarily to prepare for a Large Language Model that cannot process large amounts of data. Setting this flag not only improves the quality of the extracted content but also adds the filtered content to two new attributes in the returned  `CrawlResult` object: `fit_markdown` and `fit_html`.
 ## Custom Content Filtering Strategies
 You can create your own custom filtering strategies by inheriting from the `RelevantContentFilter` class and implementing the `filter_content` method.  This allows you to tailor the filtering logic to your specific needs.
 ```python
 from crawl4ai import AsyncWebCrawler
 from crawl4ai.content_filter_strategy import RelevantContentFilter
 from bs4 import BeautifulSoup, Tag
 from typing import List
 class MyCustomFilter(RelevantContentFilter):
    def filter_content(self, html: str) -> List[str]:
        soup = BeautifulSoup(html, 'lxml')
        # Implement custom filtering logic here
        # Example: extract all paragraphs within divs with class "article-body"
        filtered_paragraphs = []
        for tag in soup.select("div.article-body p"):
            if isinstance(tag, Tag):
                filtered_paragraphs.append(str(tag)) # Add the cleaned HTML element.  
        return filtered_paragraphs
 async def custom_filter_demo(url: str):
    async with AsyncWebCrawler() as crawler:
        custom_filter = MyCustomFilter()
        result = await crawler.arun(url, extraction_strategy=custom_filter)
        if result.success:
            print(result.extracted_content)
 ```
 This example demonstrates extracting paragraphs from a specific div class.  You can customize this logic to implement different filtering strategies, use regular expressions, analyze text density, or apply other relevant techniques.
 ## Conclusion
 Content filtering strategies provide a powerful way to refine the output of your crawls. By using `BM25ContentFilter` or creating custom strategies, you can focus on the most pertinent information and improve the efficiency of your data processing pipeline.
--- a/docs/md_v2/advanced/multi-url-crawling.md
+++ b/docs/md_v2/advanced/multi-url-crawling.md
@@ -0,0 +1,274 @@
 # Advanced Multi-URL Crawling with Dispatchers
 > **Heads Up**: Crawl4AI supports advanced dispatchers for **parallel** or **throttled** crawling, providing dynamic rate limiting and memory usage checks. The built-in `arun_many()` function uses these dispatchers to handle concurrency efficiently.
 ## 1. Introduction
 When crawling many URLs:
 - **Basic**: Use `arun()` in a loop (simple but less efficient)
 - **Better**: Use `arun_many()`, which efficiently handles multiple URLs with proper concurrency control
 - **Best**: Customize dispatcher behavior for your specific needs (memory management, rate limits, etc.)
 **Why Dispatchers?**  
 - **Adaptive**: Memory-based dispatchers can pause or slow down based on system resources
 - **Rate-limiting**: Built-in rate limiting with exponential backoff for 429/503 responses
 - **Real-time Monitoring**: Live dashboard of ongoing tasks, memory usage, and performance
 - **Flexibility**: Choose between memory-adaptive or semaphore-based concurrency
 ## 2. Core Components
 ### 2.1 Rate Limiter
 ```python
 from typing import List, Tuple
 class RateLimiter:
    def __init__(
        self,
        base_delay: Tuple[float, float] = (1.0, 3.0),  # Random delay range between requests
        max_delay: float = 60.0,                        # Maximum backoff delay
        max_retries: int = 3,                          # Retries before giving up
        rate_limit_codes: List[int] = [429, 503]       # Status codes triggering backoff
    ): ...  # Signature only; the implementation is omitted here
 ```
 The RateLimiter provides:
 - Random delays between requests
 - Exponential backoff on rate limit responses
 - Domain-specific rate limiting
 - Automatic retry handling
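 To make the behaviors listed above concrete, here is a minimal sketch of random base delays, capped exponential backoff, and a retry decision. It is an illustration under assumed rules (the delay doubles on each rate-limited attempt), not the library's actual implementation; the function names are hypothetical, while the parameter names mirror the signature above.

```python
import random
from typing import List, Sequence, Tuple


def backoff_schedule(first_delay: float, max_delay: float, attempts: int) -> List[float]:
    """Delays for successive rate-limited attempts: double each time, capped at max_delay."""
    delays = []
    delay = first_delay
    for _ in range(attempts):
        delays.append(min(delay, max_delay))
        delay *= 2  # Assumed growth factor for this sketch
    return delays


def should_retry(status_code: int, retries_so_far: int, max_retries: int = 3,
                 rate_limit_codes: Sequence[int] = (429, 503)) -> bool:
    """Retry only on rate-limit status codes and only while retries remain."""
    return status_code in rate_limit_codes and retries_so_far < max_retries


def initial_delay(base_delay: Tuple[float, float] = (1.0, 3.0)) -> float:
    """Pick a random pause inside the base_delay range."""
    return random.uniform(*base_delay)
```

 With `max_delay=60.0`, a one-second starting delay reaches the cap on the seventh consecutive rate-limited attempt, which is why `max_retries` is usually kept small.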
 ### 2.2 Crawler Monitor
 The CrawlerMonitor provides real-time visibility into crawling operations:
 ```python
 monitor = CrawlerMonitor(
    max_visible_rows=15,           # Maximum rows in live display
    display_mode=DisplayMode.DETAILED  # DETAILED or AGGREGATED view
 )
 ```
 **Display Modes**:
 1. **DETAILED**: Shows individual task status, memory usage, and timing
 2. **AGGREGATED**: Displays summary statistics and overall progress
 ## 3. Available Dispatchers
 ### 3.1 MemoryAdaptiveDispatcher (Default)
 Automatically manages concurrency based on system memory usage:
 ```python
 dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=90.0,  # Pause if memory exceeds this
    check_interval=1.0,             # How often to check memory
    max_session_permit=10,          # Maximum concurrent tasks
    rate_limiter=RateLimiter(       # Optional rate limiting
        base_delay=(1.0, 2.0),
        max_delay=30.0,
        max_retries=2
    ),
    monitor=CrawlerMonitor(         # Optional monitoring
        max_visible_rows=15,
        display_mode=DisplayMode.DETAILED
    )
 )
 ```
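 The idea behind memory-adaptive dispatching can be sketched with plain `asyncio`: cap the number of concurrent tasks and hold each new task back while a memory reading stays above the threshold. This is a simplified illustration, not the dispatcher's real code; `run_memory_gated` and `get_memory_percent` are hypothetical names, and the memory reading is injected as a callable so that no particular monitoring library is assumed.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def run_memory_gated(
    jobs: Sequence[Callable[[], Awaitable[object]]],
    get_memory_percent: Callable[[], float],
    memory_threshold_percent: float = 90.0,
    check_interval: float = 1.0,
    max_session_permit: int = 10,
) -> List[object]:
    """Run jobs concurrently, holding new ones back while memory is above the threshold."""
    semaphore = asyncio.Semaphore(max_session_permit)

    async def guarded(job: Callable[[], Awaitable[object]]) -> object:
        async with semaphore:
            # Poll the memory reading until it drops below the threshold.
            while get_memory_percent() >= memory_threshold_percent:
                await asyncio.sleep(check_interval)
            return await job()

    # gather() returns results in the same order as the submitted jobs.
    return list(await asyncio.gather(*(guarded(job) for job in jobs)))
```

 In a real setup the reading might come from a system monitoring library; here any zero-argument callable that returns a percentage will do.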
 ### 3.2 SemaphoreDispatcher
 Provides simple concurrency control with a fixed limit:
 ```python
 dispatcher = SemaphoreDispatcher(
    max_session_permit=5,             # Fixed concurrent tasks
    rate_limiter=RateLimiter(      # Optional rate limiting
        base_delay=(0.5, 1.0),
        max_delay=10.0
    ),
    monitor=CrawlerMonitor(        # Optional monitoring
        max_visible_rows=15,
        display_mode=DisplayMode.DETAILED
    )
 )
 ```
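 Fixed-limit concurrency is the classic semaphore pattern. The sketch below shows it in plain `asyncio` and records the peak number of tasks in flight so the limit can be verified. `run_with_fixed_limit` is a hypothetical helper for illustration and does not reflect how `SemaphoreDispatcher` is implemented internally.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, Tuple


async def run_with_fixed_limit(
    jobs: Sequence[Callable[[], Awaitable[object]]],
    limit: int = 5,
) -> Tuple[List[object], int]:
    """Run jobs with at most `limit` in flight; return the results and the observed peak."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def guarded(job: Callable[[], Awaitable[object]]) -> object:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await job()
            finally:
                # Decrement before the semaphore is released on exit.
                in_flight -= 1

    results = await asyncio.gather(*(guarded(job) for job in jobs))
    return list(results), peak
```

 Unlike the memory-adaptive approach, the limit never changes at runtime, which makes behavior predictable but leaves sizing the limit to you.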
 ## 4. Usage Examples
 ### 4.1 Batch Processing (Default)
 ```python
 async def crawl_batch(urls):
    browser_config = BrowserConfig(headless=True, verbose=False)
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        stream=False  # Default: get all results at once
    )
    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=70.0,
        check_interval=1.0,
        max_session_permit=10,
        monitor=CrawlerMonitor(
            display_mode=DisplayMode.DETAILED
        )
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Get all results at once
        results = await crawler.arun_many(
            urls=urls,
            config=run_config,
            dispatcher=dispatcher
        )
        # Process all results after completion
        for result in results:
            if result.success:
                await process_result(result)
            else:
                print(f"Failed to crawl {result.url}: {result.error_message}")
 ```
 ### 4.2 Streaming Mode
 ```python
 async def crawl_streaming(urls):
    browser_config = BrowserConfig(headless=True, verbose=False)
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        stream=True  # Enable streaming mode
    )
    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=70.0,
        check_interval=1.0,
        max_session_permit=10,
        monitor=CrawlerMonitor(
            display_mode=DisplayMode.DETAILED
        )
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Process results as they become available
        async for result in await crawler.arun_many(
            urls=urls,
            config=run_config,
            dispatcher=dispatcher
        ):
            if result.success:
                # Process each result immediately
                await process_result(result)
            else:
                print(f"Failed to crawl {result.url}: {result.error_message}")
 ```
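 The difference between the two modes can be illustrated without the crawler. In this sketch, `batch_results` waits for everything and returns results in submission order, while `stream_results` yields each result as soon as its task finishes. Both helpers are hypothetical and only mimic the batch/streaming contrast described in sections 4.1 and 4.2.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, List, Sequence

Job = Callable[[], Awaitable[object]]


async def batch_results(jobs: Sequence[Job]) -> List[object]:
    """Batch style: wait for every job, then return results in submission order."""
    return list(await asyncio.gather(*(job() for job in jobs)))


async def stream_results(jobs: Sequence[Job]) -> AsyncIterator[object]:
    """Streaming style: yield each result as soon as its job completes."""
    tasks = [asyncio.ensure_future(job()) for job in jobs]
    for finished in asyncio.as_completed(tasks):
        yield await finished
```

 Streaming lets you start processing early and keeps fewer results in memory at once, at the cost of receiving them in completion order rather than input order.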
 ### 4.3 Semaphore-based Crawling
 ```python
 async def crawl_with_semaphore(urls):
    browser_config = BrowserConfig(headless=True, verbose=False)
    run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    dispatcher = SemaphoreDispatcher(
        semaphore_count=5,
        rate_limiter=RateLimiter(
            base_delay=(0.5, 1.0),
            max_delay=10.0
        ),
        monitor=CrawlerMonitor(
            max_visible_rows=15,
            display_mode=DisplayMode.DETAILED
        )
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        results = await crawler.arun_many(
            urls, 
            config=run_config,
            dispatcher=dispatcher
        )
        return results
 ```
 ### 4.4 Robots.txt Consideration
 ```python
 import asyncio
 from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
 async def main():
    urls = [
        "https://example1.com",
        "https://example2.com",
        "https://example3.com"
    ]
    config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,
        check_robots_txt=True,  # Will respect robots.txt for each URL
        semaphore_count=3      # Max concurrent requests
    )
    async with AsyncWebCrawler() as crawler:
        # Without stream=True, arun_many() returns the full list of results once awaited
        for result in await crawler.arun_many(urls, config=config):
            if result.success:
                print(f"Successfully crawled {result.url}")
            elif result.status_code == 403 and "robots.txt" in result.error_message:
                print(f"Skipped {result.url} - blocked by robots.txt")
            else:
                print(f"Failed to crawl {result.url}: {result.error_message}")
 if __name__ == "__main__":
    asyncio.run(main())
 ```
 **Key Points**:
 - When `check_robots_txt=True`, each URL's robots.txt is checked before crawling
 - Robots.txt files are cached for efficiency
 - Failed robots.txt checks return 403 status code
 - Dispatcher handles robots.txt checks automatically for each URL
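 For readers who want to see what a cached robots.txt check involves, here is a small sketch built on the standard library's `urllib.robotparser`. It is not Crawl4AI's implementation: `RobotsCache` is a hypothetical class, rules are supplied as text rather than fetched over the network, and treating hosts with no cached rules as allowed is an assumption made for this example.

```python
from typing import Dict
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Hypothetical per-host cache of parsed robots.txt rules."""

    def __init__(self) -> None:
        self._parsers: Dict[str, RobotFileParser] = {}

    def add_rules(self, host: str, robots_txt: str) -> None:
        """Parse robots.txt text once and keep the parser for later lookups."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self._parsers[host] = parser

    def is_allowed(self, url: str, user_agent: str = "*") -> bool:
        """Check a URL against the cached rules for its host."""
        parser = self._parsers.get(urlsplit(url).netloc)
        # Assumption for this sketch: no cached rules means the URL is allowed.
        return True if parser is None else parser.can_fetch(user_agent, url)
```

 Because each host's rules are parsed once and reused, checking many URLs on the same site costs only a dictionary lookup plus a path match.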
 ## 5. Dispatch Results
 Each crawl result includes dispatch information:
 ```python
 from dataclasses import dataclass
 from datetime import datetime
 @dataclass
 class DispatchResult:
    task_id: str
    memory_usage: float
    peak_memory: float
    start_time: datetime
    end_time: datetime
    error_message: str = ""
 ```
 Access via `result.dispatch_result`:
 ```python
 for result in results:
    if result.success:
        dr = result.dispatch_result
        print(f"URL: {result.url}")
        print(f"Memory: {dr.memory_usage:.1f}MB")
        print(f"Duration: {dr.end_time - dr.start_time}")
 ```
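 Dispatch information is also handy for a summary after a crawl. The sketch below aggregates a list of such records; `DispatchInfo` is a local stand-in that mirrors the fields of `DispatchResult` shown above so the example runs on its own, and `summarize` is a hypothetical helper.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Iterable


@dataclass
class DispatchInfo:
    """Local stand-in mirroring the DispatchResult fields, for illustration only."""
    task_id: str
    memory_usage: float
    peak_memory: float
    start_time: datetime
    end_time: datetime
    error_message: str = ""


def summarize(infos: Iterable[DispatchInfo]) -> Dict[str, object]:
    """Aggregate task count, failures, peak memory, total time, and the slowest task."""
    items = list(infos)
    durations = [info.end_time - info.start_time for info in items]
    slowest = max(items, key=lambda info: info.end_time - info.start_time, default=None)
    return {
        "tasks": len(items),
        "failed": sum(1 for info in items if info.error_message),
        "peak_memory": max((info.peak_memory for info in items), default=0.0),
        "total_duration": sum(durations, timedelta()),
        "slowest_task": slowest.task_id if slowest else None,
    }
```

 With real crawl results, the input would be something like `[r.dispatch_result for r in results]`.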
 ## 6. Summary
 1. **Two Dispatcher Types**:
   - MemoryAdaptiveDispatcher (default): Dynamic concurrency based on memory
   - SemaphoreDispatcher: Fixed concurrency limit
 2. **Optional Components**:
   - RateLimiter: Smart request pacing and backoff
   - CrawlerMonitor: Real-time progress visualization
 3. **Key Benefits**:
   - Automatic memory management
   - Built-in rate limiting
   - Live progress monitoring
   - Flexible concurrency control
 Choose the dispatcher that best fits your needs:
 - **MemoryAdaptiveDispatcher**: For large crawls or limited resources
 - **SemaphoreDispatcher**: For simple, fixed-concurrency scenarios
--- a/docs/md_v2/advanced/proxy-security.md
+++ b/docs/md_v2/advanced/proxy-security.md
@@ -1,84 +1,68 @@
-# Proxy & Security
+# Proxy 
 Configure proxy settings and enhance security features in Crawl4AI for reliable data extraction.
 ## Basic Proxy Setup
-Simple proxy configuration:
+Simple proxy configuration with `BrowserConfig`:
 ```python
 from crawl4ai.async_configs import BrowserConfig
 # Using proxy URL
-async with AsyncWebCrawler(
+browser_config = BrowserConfig(proxy="http://proxy.example.com:8080")
-    proxy="http://proxy.example.com:8080"
+async with AsyncWebCrawler(config=browser_config) as crawler:
 ) as crawler:
    result = await crawler.arun(url="https://example.com")
 # Using SOCKS proxy
-async with AsyncWebCrawler(
+browser_config = BrowserConfig(proxy="socks5://proxy.example.com:1080")
-    proxy="socks5://proxy.example.com:1080"
+async with AsyncWebCrawler(config=browser_config) as crawler:
 ) as crawler:
    result = await crawler.arun(url="https://example.com")
 ```
 ## Authenticated Proxy
-Use proxy with authentication:
+Use an authenticated proxy with `BrowserConfig`:
 ```python
 from crawl4ai.async_configs import BrowserConfig
 proxy_config = {
    "server": "http://proxy.example.com:8080",
    "username": "user",
    "password": "pass"
 }
-async with AsyncWebCrawler(proxy_config=proxy_config) as crawler:
+browser_config = BrowserConfig(proxy_config=proxy_config)
 async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(url="https://example.com")
 ```
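 Proxy providers often hand out a single URL with embedded credentials. As a convenience, the following sketch splits such a URL into the `server`/`username`/`password` dictionary shape used above. `proxy_config_from_url` is a hypothetical helper, not a Crawl4AI function, and it relies only on the standard library.

```python
from typing import Dict
from urllib.parse import unquote, urlsplit


def proxy_config_from_url(proxy_url: str) -> Dict[str, str]:
    """Split 'scheme://user:pass@host:port' into a proxy_config-style dictionary."""
    parts = urlsplit(proxy_url)
    if not parts.scheme or not parts.hostname:
        raise ValueError(f"Not a valid proxy URL: {proxy_url!r}")
    server = f"{parts.scheme}://{parts.hostname}"
    if parts.port:
        server += f":{parts.port}"
    config = {"server": server}
    # Credentials may be percent-encoded inside a URL, so decode them.
    if parts.username:
        config["username"] = unquote(parts.username)
    if parts.password:
        config["password"] = unquote(parts.password)
    return config
```

 Keeping credentials out of the `server` value also makes it less likely that they end up in logs that print the proxy address.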
 ## Rotating Proxies 
-Example using a proxy rotation service:
+Example using a proxy rotation service dynamically:
 ```python
 from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
 async def get_next_proxy():
    # Your proxy rotation logic here
    return {"server": "http://next.proxy.com:8080"}
-async with AsyncWebCrawler() as crawler:
+async def main():
-    # Update proxy for each request
+    browser_config = BrowserConfig()
    run_config = CrawlerRunConfig()
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # For each URL, create a new run config with different proxy
        for url in urls:
            proxy = await get_next_proxy()
-        crawler.update_proxy(proxy)
+            # Clone the config and update proxy - this creates a new browser context
-        result = await crawler.arun(url=url)
+            current_config = run_config.clone(proxy_config=proxy)
            result = await crawler.arun(url=url, config=current_config)
 if __name__ == "__main__":
    import asyncio
    asyncio.run(main())
 ```
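 The `get_next_proxy` placeholder above leaves the rotation logic open. One simple option is round-robin over a fixed pool, sketched here with `itertools.cycle`. The class name and the server addresses are made up for illustration; a real rotation service would likely also track failures or cooldowns.

```python
import asyncio
import itertools
from typing import Dict, List


class RoundRobinProxies:
    """Hypothetical rotation helper: hands out proxies from a fixed pool in turn."""

    def __init__(self, servers: List[str]) -> None:
        if not servers:
            raise ValueError("At least one proxy server is required")
        self._cycle = itertools.cycle(servers)

    async def get_next_proxy(self) -> Dict[str, str]:
        """Return the next proxy in the same dictionary shape used in the example above."""
        return {"server": next(self._cycle)}
```

 The method is `async` only to match the signature used in the example; nothing in it actually needs to wait.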
 ## Custom Headers
 Add security-related headers:
 ```python
 headers = {
    "X-Forwarded-For": "203.0.113.195",
    "Accept-Language": "en-US,en;q=0.9",
    "Cache-Control": "no-cache",
    "Pragma": "no-cache"
 }
 async with AsyncWebCrawler(headers=headers) as crawler:
    result = await crawler.arun(url="https://example.com")
 ```
 ## Combining with Magic Mode
 For maximum protection, combine proxy with Magic Mode:
 ```python
 async with AsyncWebCrawler(
    proxy="http://proxy.example.com:8080",
    headers={"Accept-Language": "en-US"}
 ) as crawler:
    result = await crawler.arun(
        url="https://example.com",
        magic=True  # Enable all anti-detection features
    )
 ```
--- a/docs/md_v2/advanced/session-management-advanced.md
+++ b/docs/md_v2/advanced/session-management-advanced.md
@@ -1,276 +0,0 @@
 # Session-Based Crawling for Dynamic Content
 In modern web applications, content is often loaded dynamically without changing the URL. Examples include "Load More" buttons, infinite scrolling, or paginated content that updates via JavaScript. To effectively crawl such websites, Crawl4AI provides powerful session-based crawling capabilities.
 This guide will explore advanced techniques for crawling dynamic content using Crawl4AI's session management features.
 ## Understanding Session-Based Crawling
 Session-based crawling allows you to maintain a persistent browser session across multiple requests. This is crucial when:
 1. The content changes dynamically without URL changes
 2. You need to interact with the page (e.g., clicking buttons) between requests
 3. The site requires authentication or maintains state across pages
 Crawl4AI's `AsyncWebCrawler` class supports session-based crawling through the `session_id` parameter and related methods.
 ## Basic Concepts
 Before diving into examples, let's review some key concepts:
 - **Session ID**: A unique identifier for a browsing session. Use the same `session_id` across multiple `arun` calls to maintain state.
 - **JavaScript Execution**: Use the `js_code` parameter to execute JavaScript on the page, such as clicking a "Load More" button.
 - **CSS Selectors**: Use these to target specific elements for extraction or interaction.
 - **Extraction Strategy**: Define how to extract structured data from the page.
 - **Wait Conditions**: Specify conditions to wait for before considering the page loaded.
 ## Example 1: Basic Session-Based Crawling
 Let's start with a basic example of session-based crawling:
 ```python
 import asyncio
 from crawl4ai import AsyncWebCrawler, CacheMode
 async def basic_session_crawl():
    async with AsyncWebCrawler(verbose=True) as crawler:
        session_id = "my_session"
        url = "https://example.com/dynamic-content"
        for page in range(3):
            result = await crawler.arun(
                url=url,
                session_id=session_id,
                js_code="document.querySelector('.load-more-button').click();" if page > 0 else None,
                css_selector=".content-item",
                cache_mode=CacheMode.BYPASS
            )
            # Rough count: occurrences of the class name in the page HTML
            print(f"Page {page + 1}: Found {result.html.count('content-item')} items")
        await crawler.crawler_strategy.kill_session(session_id)
 asyncio.run(basic_session_crawl())
 ```
 This example demonstrates:
 1. Using a consistent `session_id` across multiple `arun` calls
 2. Executing JavaScript to load more content after the first page
 3. Using a CSS selector to extract specific content
 4. Properly closing the session after crawling
 ## Advanced Technique 1: Custom Execution Hooks
 Crawl4AI allows you to set custom hooks that execute at different stages of the crawling process. This is particularly useful for handling complex loading scenarios.
 Here's an example that waits for new content to appear before proceeding:
 ```python
 async def advanced_session_crawl_with_hooks():
    first_commit = ""
    async def on_execution_started(page):
        nonlocal first_commit
        try:
            while True:
                await page.wait_for_selector("li.commit-item h4")
                commit = await page.query_selector("li.commit-item h4")
                commit = await commit.evaluate("(element) => element.textContent")
                commit = commit.strip()
                if commit and commit != first_commit:
                    first_commit = commit
                    break
                await asyncio.sleep(0.5)
        except Exception as e:
            print(f"Warning: New content didn't appear after JavaScript execution: {e}")
    async with AsyncWebCrawler(verbose=True) as crawler:
        crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
        url = "https://github.com/example/repo/commits/main"
        session_id = "commit_session"
        all_commits = []
        js_next_page = """
        const button = document.querySelector('a.pagination-next');
        if (button) button.click();
        """
        for page in range(3):
            result = await crawler.arun(
                url=url,
                session_id=session_id,
                css_selector="li.commit-item",
                js_code=js_next_page if page > 0 else None,
                cache_mode=CacheMode.BYPASS,
                js_only=page > 0
            )
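            # extracted_content is not a parsed document, so parse the page HTML before selecting
            from bs4 import BeautifulSoup  # Local import keeps this snippet self-contained
            commits = BeautifulSoup(result.html, "html.parser").select("li.commit-item")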
            all_commits.extend(commits)
            print(f"Page {page + 1}: Found {len(commits)} commits")
        await crawler.crawler_strategy.kill_session(session_id)
        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
 asyncio.run(advanced_session_crawl_with_hooks())
 ```
 This technique uses a custom `on_execution_started` hook to ensure new content has loaded before proceeding to the next step.
 ## Advanced Technique 2: Integrated JavaScript Execution and Waiting
 Instead of using separate hooks, you can integrate the waiting logic directly into your JavaScript execution. This approach can be more concise and easier to manage for some scenarios.
 Here's an example:
 ```python
 import json
 from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
 async def integrated_js_and_wait_crawl():
    async with AsyncWebCrawler(verbose=True) as crawler:
        url = "https://github.com/example/repo/commits/main"
        session_id = "integrated_session"
        all_commits = []
        js_next_page_and_wait = """
        (async () => {
            const getCurrentCommit = () => {
                const commits = document.querySelectorAll('li.commit-item h4');
                return commits.length > 0 ? commits[0].textContent.trim() : null;
            };
            const initialCommit = getCurrentCommit();
            const button = document.querySelector('a.pagination-next');
            if (button) button.click();
            while (true) {
                await new Promise(resolve => setTimeout(resolve, 100));
                const newCommit = getCurrentCommit();
                if (newCommit && newCommit !== initialCommit) {
                    break;
                }
            }
        })();
        """
        schema = {
            "name": "Commit Extractor",
            "baseSelector": "li.commit-item",
            "fields": [
                {
                    "name": "title",
                    "selector": "h4.commit-title",
                    "type": "text",
                    "transform": "strip",
                },
            ],
        }
        extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
        for page in range(3):
            result = await crawler.arun(
                url=url,
                session_id=session_id,
                css_selector="li.commit-item",
                extraction_strategy=extraction_strategy,
                js_code=js_next_page_and_wait if page > 0 else None,
                js_only=page > 0,
                cache_mode=CacheMode.BYPASS
            )
            commits = json.loads(result.extracted_content)
            all_commits.extend(commits)
            print(f"Page {page + 1}: Found {len(commits)} commits")
        await crawler.crawler_strategy.kill_session(session_id)
        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
 asyncio.run(integrated_js_and_wait_crawl())
 ```
 This approach combines the JavaScript for clicking the "next" button and waiting for new content to load into a single script.
 ## Advanced Technique 3: Using the `wait_for` Parameter
 Crawl4AI provides a `wait_for` parameter that allows you to specify a condition to wait for before considering the page fully loaded. This can be particularly useful for dynamic content.
 Here's an example:
 ```python
 async def wait_for_parameter_crawl():
    async with AsyncWebCrawler(verbose=True) as crawler:
        url = "https://github.com/example/repo/commits/main"
        session_id = "wait_for_session"
        all_commits = []
        js_next_page = """
        const commits = document.querySelectorAll('li.commit-item h4');
        if (commits.length > 0) {
            window.lastCommit = commits[0].textContent.trim();
        }
        const button = document.querySelector('a.pagination-next');
        if (button) button.click();
        """
        wait_for = """() => {
            const commits = document.querySelectorAll('li.commit-item h4');
            if (commits.length === 0) return false;
            const firstCommit = commits[0].textContent.trim();
            return firstCommit !== window.lastCommit;
        }"""
        schema = {
            "name": "Commit Extractor",
            "baseSelector": "li.commit-item",
            "fields": [
                {
                    "name": "title",
                    "selector": "h4.commit-title",
                    "type": "text",
                    "transform": "strip",
                },
            ],
        }
        extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
        for page in range(3):
            result = await crawler.arun(
                url=url,
                session_id=session_id,
                css_selector="li.commit-item",
                extraction_strategy=extraction_strategy,
                js_code=js_next_page if page > 0 else None,
                wait_for=wait_for if page > 0 else None,
                js_only=page > 0,
                cache_mode=CacheMode.BYPASS
            )
            commits = json.loads(result.extracted_content)
            all_commits.extend(commits)
            print(f"Page {page + 1}: Found {len(commits)} commits")
        await crawler.crawler_strategy.kill_session(session_id)
        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
 asyncio.run(wait_for_parameter_crawl())
 ```
 This technique separates the JavaScript execution (clicking the "next" button) from the waiting condition, providing more flexibility and clarity in some scenarios.
 ## Best Practices for Session-Based Crawling
 1. **Use Unique Session IDs**: Ensure each crawling session has a unique `session_id` to prevent conflicts.
 2. **Close Sessions**: Always close sessions using `kill_session` when you're done to free up resources.
 3. **Handle Errors**: Implement proper error handling to deal with unexpected situations during crawling.
 4. **Respect Website Terms**: Ensure your crawling adheres to the website's terms of service and robots.txt file.
 5. **Implement Delays**: Add appropriate delays between requests to avoid overwhelming the target server.
 6. **Use Extraction Strategies**: Leverage `JsonCssExtractionStrategy` or other extraction strategies for structured data extraction.
 7. **Optimize JavaScript**: Keep your JavaScript execution concise and efficient to improve crawling speed.
 8. **Monitor Performance**: Keep an eye on memory usage and crawling speed, especially for long-running sessions.
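 The first two practices, unique session IDs and guaranteed cleanup, can be combined in a small async context manager. The sketch below is one possible shape, not part of Crawl4AI: `managed_session` is a hypothetical helper, and the cleanup function is passed in as a callable so the example stays self-contained.

```python
import uuid
from contextlib import asynccontextmanager
from typing import AsyncIterator, Awaitable, Callable


@asynccontextmanager
async def managed_session(
    kill_session: Callable[[str], Awaitable[None]],
    prefix: str = "session",
) -> AsyncIterator[str]:
    """Yield a unique session ID and always call kill_session on it afterwards."""
    session_id = f"{prefix}_{uuid.uuid4().hex}"
    try:
        yield session_id
    finally:
        # Runs on normal exit and when the body raises.
        await kill_session(session_id)
```

 With the examples in this guide, the callable would be `crawler.crawler_strategy.kill_session`, used as `async with managed_session(crawler.crawler_strategy.kill_session) as session_id:`.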
 ## Conclusion
 Session-based crawling with Crawl4AI provides powerful capabilities for handling dynamic content and complex web applications. By leveraging session management, JavaScript execution, and waiting strategies, you can effectively crawl and extract data from a wide range of modern websites.
 Remember to use these techniques responsibly and in compliance with website policies and ethical web scraping practices.
 For more advanced usage and API details, refer to the Crawl4AI API documentation.
--- a/docs/md_v2/advanced/session-management.md
+++ b/docs/md_v2/advanced/session-management.md
@@ -1,74 +1,76 @@
 # Session Management
-Session management in Crawl4AI allows you to maintain state across multiple requests and handle complex multi-page crawling tasks, particularly useful for dynamic websites.
+Session management in Crawl4AI is a powerful feature that allows you to maintain state across multiple requests, making it particularly suitable for handling complex multi-step crawling tasks. It enables you to reuse the same browser tab (or page object) across sequential actions and crawls, which is beneficial for:
-## Basic Session Usage
+- **Performing JavaScript actions before and after crawling.**
 - **Executing multiple sequential crawls faster** without needing to reopen tabs or allocate memory repeatedly.
-Use `session_id` to maintain state between requests:
+**Note:** This feature is designed for sequential workflows and is not suitable for parallel operations.
 ---
 #### Basic Session Usage
 Use `BrowserConfig` and `CrawlerRunConfig` to maintain state with a `session_id`:
 ```python
 from crawl4ai import AsyncWebCrawler
 from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
 async with AsyncWebCrawler() as crawler:
    session_id = "my_session"
-    # First request
+    # Define configurations
-    result1 = await crawler.arun(
+    config1 = CrawlerRunConfig(
-        url="https://example.com/page1",
+        url="https://example.com/page1", session_id=session_id
-        session_id=session_id
+    )
    config2 = CrawlerRunConfig(
        url="https://example.com/page2", session_id=session_id
    )
-    # Subsequent request using same session
+    # First request
-    result2 = await crawler.arun(
+    result1 = await crawler.arun(config=config1)
-        url="https://example.com/page2",
+
-        session_id=session_id
+    # Subsequent request using the same session
-    )
+    result2 = await crawler.arun(config=config2)
    # Clean up when done
    await crawler.crawler_strategy.kill_session(session_id)
 ```
-## Dynamic Content with Sessions
+---
-Here's a real-world example of crawling GitHub commits across multiple pages:
+#### Dynamic Content with Sessions
 Here's an example of crawling GitHub commits across multiple pages while preserving session state:
 ```python
 import json
 from crawl4ai import AsyncWebCrawler
 from crawl4ai.async_configs import CrawlerRunConfig
 from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
 from crawl4ai.cache_context import CacheMode
 async def crawl_dynamic_content():
-    async with AsyncWebCrawler(verbose=True) as crawler:
+    async with AsyncWebCrawler() as crawler:
        session_id = "github_commits_session"
        url = "https://github.com/microsoft/TypeScript/commits/main"
        session_id = "typescript_commits_session"
        all_commits = []
        # Define navigation JavaScript
        js_next_page = """
        const button = document.querySelector('a[data-testid="pagination-next-button"]');
        if (button) button.click();
        """
        # Define wait condition
        wait_for = """() => {
            const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
            if (commits.length === 0) return false;
            const firstCommit = commits[0].textContent.trim();
            return firstCommit !== window.firstCommit;
        }"""
        # Define extraction schema
        schema = {
            "name": "Commit Extractor",
            "baseSelector": "li.Box-sc-g0xbh4-0",
-            "fields": [
+            "fields": [{
-                {
+                "name": "title", "selector": "h4.markdown-title", "type": "text"
-                    "name": "title",
+            }],
                    "selector": "h4.markdown-title",
                    "type": "text",
                    "transform": "strip",
                },
            ],
        }
        extraction_strategy = JsonCssExtractionStrategy(schema)
        # JavaScript and wait configurations
        js_next_page = """document.querySelector('a[data-testid="pagination-next-button"]').click();"""
        wait_for = """() => document.querySelectorAll('li.Box-sc-g0xbh4-0').length > 0"""
        # Crawl multiple pages
        for page in range(3):
-            result = await crawler.arun(
+            config = CrawlerRunConfig(
                url=url,
                session_id=session_id,
                extraction_strategy=extraction_strategy,
@@ -78,6 +80,7 @@ async def crawl_dynamic_content():
                cache_mode=CacheMode.BYPASS
            )
            result = await crawler.arun(config=config)
            if result.success:
                commits = json.loads(result.extracted_content)
                all_commits.extend(commits)
@@ -88,46 +91,148 @@ async def crawl_dynamic_content():
        return all_commits
 ```
-## Session Best Practices
+---
-1. **Session Naming**:
+## Example 1: Basic Session-Based Crawling
-```python
+
-# Use descriptive session IDs
+A simple example using session-based crawling:
 session_id = "login_flow_session"
 session_id = "product_catalog_session"
 ```
 2. **Resource Management**:
 ```python
-try:
+import asyncio
-    # Your crawling code
+from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
-    pass
+from crawl4ai.cache_context import CacheMode
-finally:
+
-    # Always clean up sessions
+async def basic_session_crawl():
    async with AsyncWebCrawler() as crawler:
        session_id = "dynamic_content_session"
        url = "https://example.com/dynamic-content"
        for page in range(3):
            config = CrawlerRunConfig(
                url=url,
                session_id=session_id,
                js_code="document.querySelector('.load-more-button').click();" if page > 0 else None,
                css_selector=".content-item",
                cache_mode=CacheMode.BYPASS
            )
            result = await crawler.arun(config=config)
            # Rough count: occurrences of the class name in the page HTML
            print(f"Page {page + 1}: Found {result.html.count('content-item')} items")
        await crawler.crawler_strategy.kill_session(session_id)
 asyncio.run(basic_session_crawl())
 ```
-3. **State Management**:
+This example shows:
 1. Reusing the same `session_id` across multiple requests.
 2. Executing JavaScript to load more content dynamically.
 3. Properly closing the session to free resources.
 ---
 ## Advanced Technique 1: Custom Execution Hooks
 > Warning: You might feel confused by the end of the next few examples 😅, so make sure you are comfortable with the order of the parts before you start this.
 Use custom hooks to handle complex scenarios, such as waiting for content to load dynamically:
 ```python
-# First page: login
+async def advanced_session_crawl_with_hooks():
-result = await crawler.arun(
+    first_commit = ""
    url="https://example.com/login",
    session_id=session_id,
    js_code="document.querySelector('form').submit();"
 )
-# Second page: verify login success
+    async def on_execution_started(page):
-result = await crawler.arun(
+        nonlocal first_commit
-    url="https://example.com/dashboard",
+        try:
            while True:
                await page.wait_for_selector("li.commit-item h4")
                commit = await page.query_selector("li.commit-item h4")
                commit = (await commit.evaluate("(element) => element.textContent")).strip()
                if commit and commit != first_commit:
                    first_commit = commit
                    break
                await asyncio.sleep(0.5)
        except Exception as e:
            print(f"Warning: New content didn't appear: {e}")
    async with AsyncWebCrawler() as crawler:
        session_id = "commit_session"
        url = "https://github.com/example/repo/commits/main"
        crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
        js_next_page = """document.querySelector('a.pagination-next').click();"""
        for page in range(3):
            config = CrawlerRunConfig(
                url=url,
                session_id=session_id,
-    wait_for="css:.user-profile"  # Wait for authenticated content
+                js_code=js_next_page if page > 0 else None,
-)
+                css_selector="li.commit-item",
                js_only=page > 0,
                cache_mode=CacheMode.BYPASS
            )
            result = await crawler.arun(config=config)
            # Rough count: occurrences of the class name in the page HTML
            print(f"Page {page + 1}: Found {result.html.count('commit-item')} commits")
        await crawler.crawler_strategy.kill_session(session_id)
 asyncio.run(advanced_session_crawl_with_hooks())
 ```
-## Common Use Cases
+This technique ensures new content loads before the next action.
-1. **Authentication Flows**
+---
-2. **Pagination Handling**
+
-3. **Form Submissions**
+## Advanced Technique 2: Integrated JavaScript Execution and Waiting
-4. **Multi-step Processes**
+
-5. **Dynamic Content Navigation**
+Combine JavaScript execution and waiting logic for concise handling of dynamic content:
 ```python
 async def integrated_js_and_wait_crawl():
    async with AsyncWebCrawler() as crawler:
        session_id = "integrated_session"
        url = "https://github.com/example/repo/commits/main"
        js_next_page_and_wait = """
        (async () => {
            const getCurrentCommit = () => document.querySelector('li.commit-item h4').textContent.trim();
            const initialCommit = getCurrentCommit();
            document.querySelector('a.pagination-next').click();
            while (getCurrentCommit() === initialCommit) {
                await new Promise(resolve => setTimeout(resolve, 100));
            }
        })();
        """
        for page in range(3):
            config = CrawlerRunConfig(
                url=url,
                session_id=session_id,
                js_code=js_next_page_and_wait if page > 0 else None,
                css_selector="li.commit-item",
                js_only=page > 0,
                cache_mode=CacheMode.BYPASS
            )
            result = await crawler.arun(config=config)
            print(f"Page {page + 1}: Found {len(result.extracted_content)} commits")
        await crawler.crawler_strategy.kill_session(session_id)
 asyncio.run(integrated_js_and_wait_crawl())
 ```
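Both techniques above poll until the first commit changes, and neither loop has an upper bound, so a page that never updates would stall the crawl. The following is a minimal sketch of the same polling idea with a deadline, written against plain `asyncio` rather than a real page object. `wait_for_change` and `make_fake_getter` are hypothetical helpers introduced only for illustration; they are not part of Crawl4AI.

```python
import asyncio
import time


async def wait_for_change(get_value, previous, timeout=5.0, interval=0.1):
    """Poll an async getter until it returns a truthy value different from `previous`.

    Raises asyncio.TimeoutError if nothing changes within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        current = await get_value()
        if current and current != previous:
            return current
        if time.monotonic() >= deadline:
            raise asyncio.TimeoutError(f"value still {previous!r} after {timeout}s")
        await asyncio.sleep(interval)


def make_fake_getter(values):
    """Hypothetical stand-in for a page query: yields successive items, repeating the last."""
    items = list(values)
    state = {"index": 0}

    async def getter():
        value = items[min(state["index"], len(items) - 1)]
        state["index"] += 1
        return value

    return getter


if __name__ == "__main__":
    getter = make_fake_getter(["commit A", "commit A", "commit B"])
    print(asyncio.run(wait_for_change(getter, "commit A", interval=0.01)))
```

In a hook, the getter would wrap the `query_selector`/`evaluate` calls shown earlier, and the caller would decide whether a timeout is a warning or a failure.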
 ---
 #### Common Use Cases for Sessions
 1. **Authentication Flows**: Login and interact with secured pages.
 2. **Pagination Handling**: Navigate through multiple pages.
 3. **Form Submissions**: Fill forms, submit, and process results.
 4. **Multi-step Processes**: Complete workflows that span multiple actions.
 5. **Dynamic Content Navigation**: Handle JavaScript-rendered or event-triggered content.
--- a/docs/md_v2/advanced/ssl-certificate.md
+++ b/docs/md_v2/advanced/ssl-certificate.md
@@ -0,0 +1,179 @@
 # `SSLCertificate` Reference
 The **`SSLCertificate`** class encapsulates an SSL certificate’s data and allows exporting it in various formats (PEM, DER, JSON, or text). It’s used within **Crawl4AI** whenever you set **`fetch_ssl_certificate=True`** in your **`CrawlerRunConfig`**.  
 ## 1. Overview
 **Location**: `crawl4ai/ssl_certificate.py`
 ```python
 class SSLCertificate:
    """
    Represents an SSL certificate with methods to export in various formats.
    Main Methods:
    - from_url(url, timeout=10)
    - from_file(file_path)
    - from_binary(binary_data)
    - to_json(filepath=None)
    - to_pem(filepath=None)
    - to_der(filepath=None)
    ...
    Common Properties:
    - issuer
    - subject
    - valid_from
    - valid_until
    - fingerprint
    """
 ```
 ### Typical Use Case
 1. You **enable** certificate fetching in your crawl by:
   ```python
   CrawlerRunConfig(fetch_ssl_certificate=True, ...)
   ```
 2. After `arun()`, if `result.ssl_certificate` is present, it’s an instance of **`SSLCertificate`**.  
 3. You can **read** basic properties (issuer, subject, validity) or **export** them in multiple formats.
 ---
 ## 2. Construction & Fetching
 ### 2.1 **`from_url(url, timeout=10)`**
 Manually load an SSL certificate from a given URL (port 443). Typically used internally, but you can call it directly if you want:
 ```python
 cert = SSLCertificate.from_url("https://example.com")
 if cert:
    print("Fingerprint:", cert.fingerprint)
 ```
 ### 2.2 **`from_file(file_path)`**
 Load from a file containing certificate data in binary ASN.1 (DER) form. Rarely needed unless you have local cert files:
 ```python
 cert = SSLCertificate.from_file("/path/to/cert.der")
 ```
 ### 2.3 **`from_binary(binary_data)`**
 Initialize from raw binary. E.g., if you captured it from a socket or another source:
 ```python
 cert = SSLCertificate.from_binary(raw_bytes)
 ```
 ---
 ## 3. Common Properties
 After obtaining a **`SSLCertificate`** instance (e.g. `result.ssl_certificate` from a crawl), you can read:
 1. **`issuer`** *(dict)*  
   - E.g. `{"CN": "My Root CA", "O": "..."}`
 2. **`subject`** *(dict)*  
   - E.g. `{"CN": "example.com", "O": "ExampleOrg"}`
 3. **`valid_from`** *(str)*  
   - NotBefore date/time. Often in ASN.1/UTC format.
 4. **`valid_until`** *(str)*  
   - NotAfter date/time.
 5. **`fingerprint`** *(str)*  
   - The SHA-256 digest (lowercase hex).  
   - E.g. `"d14d2e..."`
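Because `valid_from` and `valid_until` are strings, comparing them against the current time requires parsing first. The sketch below assumes the ASN.1 GeneralizedTime layout `YYYYMMDDHHMMSSZ`; that layout is an assumption for illustration, so check the actual strings your certificates return before relying on it. `parse_asn1_time` and `is_within_validity` are hypothetical helper names.

```python
from datetime import datetime, timezone


def parse_asn1_time(value: str) -> datetime:
    """Parse a string such as '20250122204143Z' into a timezone-aware UTC datetime.

    Assumes the GeneralizedTime layout YYYYMMDDHHMMSSZ; other layouts raise ValueError.
    """
    return datetime.strptime(value, "%Y%m%d%H%M%SZ").replace(tzinfo=timezone.utc)


def is_within_validity(valid_from: str, valid_until: str, now: datetime) -> bool:
    """Return True if `now` falls inside the certificate's validity window."""
    return parse_asn1_time(valid_from) <= now <= parse_asn1_time(valid_until)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(is_within_validity("20240101000000Z", "20990101000000Z", now))
```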
 ---
 ## 4. Export Methods
 Once you have a **`SSLCertificate`** object, you can **export** or **inspect** it:
 ### 4.1 **`to_json(filepath=None)` → `Optional[str]`**
 - Returns a JSON string containing the parsed certificate fields.  
 - If `filepath` is provided, saves it to disk instead, returning `None`.
 **Usage**:
 ```python
 json_data = cert.to_json()  # returns JSON string
 cert.to_json("certificate.json")  # writes file, returns None
 ```
 ### 4.2 **`to_pem(filepath=None)` → `Optional[str]`**
 - Returns a PEM-encoded string (common for web servers).  
 - If `filepath` is provided, saves it to disk instead.
 ```python
 pem_str = cert.to_pem()              # in-memory PEM string
 cert.to_pem("/path/to/cert.pem")     # saved to file
 ```
 ### 4.3 **`to_der(filepath=None)` → `Optional[bytes]`**
 - Returns the original DER (binary ASN.1) bytes.  
 - If `filepath` is specified, writes the bytes there instead.
 ```python
 der_bytes = cert.to_der()
 cert.to_der("certificate.der")
 ```
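The relationship between the DER bytes, the PEM string, and a SHA-256 fingerprint can be reproduced with the Python standard library alone. This is a sketch for understanding the formats, not a description of how `SSLCertificate` is implemented; in particular, computing the fingerprint over the DER bytes is an assumption here. The placeholder bytes are not a real certificate.

```python
import hashlib
import ssl


def sha256_fingerprint(der_bytes: bytes) -> str:
    """Return the SHA-256 digest of the DER bytes as lowercase hex."""
    return hashlib.sha256(der_bytes).hexdigest()


def der_to_pem(der_bytes: bytes) -> str:
    """Base64-encode DER bytes and wrap them in BEGIN/END CERTIFICATE lines."""
    return ssl.DER_cert_to_PEM_cert(der_bytes)


def pem_to_der(pem_text: str) -> bytes:
    """Strip the PEM armor and decode back to the original DER bytes."""
    return ssl.PEM_cert_to_DER_cert(pem_text)


if __name__ == "__main__":
    placeholder = b"\x30\x82 placeholder bytes, not a real certificate"
    pem = der_to_pem(placeholder)
    print(pem.splitlines()[0])
    print(sha256_fingerprint(placeholder))
```

With a real certificate you would pass the result of `cert.to_der()` in place of the placeholder bytes.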
 ### 4.4 (Optional) **`export_as_text()`**
 - If you see a method like `export_as_text()`, it typically returns an OpenSSL-style textual representation.  
 - Not always needed, but can help for debugging or manual inspection.
 ---
 ## 5. Example Usage in Crawl4AI
 Below is a minimal sample showing how the crawler obtains an SSL certificate from a site, then reads and exports it:
 ```python
 import asyncio
 import os
 from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
 async def main():
    tmp_dir = "tmp"
    os.makedirs(tmp_dir, exist_ok=True)
    config = CrawlerRunConfig(
        fetch_ssl_certificate=True,
        cache_mode=CacheMode.BYPASS
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        if result.success and result.ssl_certificate:
            cert = result.ssl_certificate
            # 1. Basic Info
            print("Issuer CN:", cert.issuer.get("CN", ""))
            print("Valid until:", cert.valid_until)
            print("Fingerprint:", cert.fingerprint)
            # 2. Export
            cert.to_json(os.path.join(tmp_dir, "certificate.json"))
            cert.to_pem(os.path.join(tmp_dir, "certificate.pem"))
            cert.to_der(os.path.join(tmp_dir, "certificate.der"))
 if __name__ == "__main__":
    asyncio.run(main())
 ```
 ---
 ## 6. Notes & Best Practices
 1. **Timeout**: `SSLCertificate.from_url` uses a default **10s** timeout when opening the socket, then wraps the connection in SSL.  
 2. **Binary Form**: The certificate is loaded in ASN.1 (DER) form, then re-parsed by `OpenSSL.crypto`.  
 3. **Validation**: This does **not** validate the certificate chain or trust store. It only fetches and parses.  
 4. **Integration**: Within Crawl4AI, you typically just set `fetch_ssl_certificate=True` in `CrawlerRunConfig`; the final result’s `ssl_certificate` is automatically built.  
 5. **Export**: If you need to store or analyze a cert, `to_json` and `to_pem` produce the most widely supported formats.
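Since fetching does not validate the chain (note 3), a compliance check that needs validation has to do it separately. One possible approach, sketched with the standard library only, is to open a connection through a default `ssl` context, which verifies the chain against the system trust store and checks the hostname. `make_verifying_context` and `fetch_verified_der` are hypothetical names, and the network call is shown but not executed here.

```python
import socket
import ssl


def make_verifying_context() -> ssl.SSLContext:
    """Build a context that requires a trusted chain and a matching hostname."""
    return ssl.create_default_context()


def fetch_verified_der(hostname: str, port: int = 443, timeout: float = 10.0) -> bytes:
    """Connect with verification enabled and return the peer certificate as DER bytes.

    Raises ssl.SSLCertVerificationError if the chain or hostname check fails.
    """
    context = make_verifying_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert(binary_form=True)


if __name__ == "__main__":
    ctx = make_verifying_context()
    print(ctx.verify_mode == ssl.CERT_REQUIRED, ctx.check_hostname)
```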
 ---
 ### Summary
 - **`SSLCertificate`** is a convenience class for capturing and exporting the **TLS certificate** from your crawled site(s).  
 - Common usage is in the **`CrawlResult.ssl_certificate`** field, accessible after setting `fetch_ssl_certificate=True`.  
 - Offers quick access to essential certificate details (`issuer`, `subject`, `fingerprint`) and is easy to export (PEM, DER, JSON) for further analysis or server usage.
 Use it whenever you need **insight** into a site’s certificate or require some form of cryptographic or compliance check.
--- a/docs/md_v2/api/arun.md
+++ b/docs/md_v2/api/arun.md
@@ -1,244 +1,305 @@
-# Complete Parameter Guide for arun()
+# `arun()` Parameter Guide (New Approach)
-The following parameters can be passed to the `arun()` method. They are organized by their primary usage context and functionality.
+In Crawl4AI’s **latest** configuration model, nearly all parameters that once went directly to `arun()` are now part of **`CrawlerRunConfig`**. When calling `arun()`, you provide:
 ## Core Parameters
 ```python
 await crawler.arun(
-    url="https://example.com",   # Required: URL to crawl
+    url="https://example.com",  
-    verbose=True,               # Enable detailed logging
+    config=my_run_config
    cache_mode=CacheMode.ENABLED,  # Control cache behavior
    warmup=True                # Whether to run warmup check
 )
 ```
-## Cache Control
+Below is an organized look at the parameters that can go inside `CrawlerRunConfig`, divided by their functional areas. For **Browser** settings (e.g., `headless`, `browser_type`), see [BrowserConfig](./parameters.md).
 ---
 ## 1. Core Usage
 ```python
-from crawl4ai import CacheMode
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
-await crawler.arun(
+async def main():
-    cache_mode=CacheMode.ENABLED,    # Normal caching (read/write)
+    run_config = CrawlerRunConfig(
-    # Other cache modes:
+        verbose=True,            # Detailed logging
-    # cache_mode=CacheMode.DISABLED   # No caching at all
+        cache_mode=CacheMode.ENABLED,  # Use normal read/write cache
-    # cache_mode=CacheMode.READ_ONLY  # Only read from cache
+        check_robots_txt=True,   # Respect robots.txt rules
-    # cache_mode=CacheMode.WRITE_ONLY # Only write to cache
+        # ... other parameters
-    # cache_mode=CacheMode.BYPASS     # Skip cache for this operation
+    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )
        # Check if blocked by robots.txt
        if not result.success and result.status_code == 403:
            print(f"Error: {result.error_message}")
 ```
 **Key Fields**:
 - `verbose=True` logs each crawl step.  
 - `cache_mode` decides how to read/write the local crawl cache.
 ---
 ## 2. Cache Control
 **`cache_mode`** (default: `CacheMode.ENABLED`)  
 Use a built-in enum from `CacheMode`:
 - `ENABLED`: Normal caching—reads if available, writes if missing.
 - `DISABLED`: No caching—always refetch pages.
 - `READ_ONLY`: Reads from cache only; no new writes.
 - `WRITE_ONLY`: Writes to cache but doesn’t read existing data.
 - `BYPASS`: Skips reading cache for this crawl (though it might still write if set up that way).
 ```python
 run_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS
 )
 ```
-## Content Processing Parameters
+**Additional flags**:
 - `bypass_cache=True` acts like `CacheMode.BYPASS`.
 - `disable_cache=True` acts like `CacheMode.DISABLED`.
 - `no_cache_read=True` acts like `CacheMode.WRITE_ONLY`.
 - `no_cache_write=True` acts like `CacheMode.READ_ONLY`.
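The mode descriptions and the flag equivalences above can be summarized as a small decision table. The sketch below uses a stand-in enum, `CacheModeSketch`, rather than the real `CacheMode`, and it makes two assumptions that the text does not settle: `BYPASS` is treated as not writing, and when several legacy flags are set the precedence is `disable_cache`, then `bypass_cache`, then `no_cache_read`, then `no_cache_write`.

```python
from enum import Enum


class CacheModeSketch(Enum):
    """Stand-in for the documented modes; not the library's own enum."""
    ENABLED = "enabled"
    DISABLED = "disabled"
    READ_ONLY = "read_only"
    WRITE_ONLY = "write_only"
    BYPASS = "bypass"


def should_read(mode: CacheModeSketch) -> bool:
    """Only ENABLED and READ_ONLY consult the cache."""
    return mode in (CacheModeSketch.ENABLED, CacheModeSketch.READ_ONLY)


def should_write(mode: CacheModeSketch) -> bool:
    """ENABLED and WRITE_ONLY store results (BYPASS is assumed not to write)."""
    return mode in (CacheModeSketch.ENABLED, CacheModeSketch.WRITE_ONLY)


def mode_from_legacy_flags(bypass_cache=False, disable_cache=False,
                           no_cache_read=False, no_cache_write=False):
    """Map the legacy boolean flags to a mode; the precedence order is an assumption."""
    if disable_cache:
        return CacheModeSketch.DISABLED
    if bypass_cache:
        return CacheModeSketch.BYPASS
    if no_cache_read:
        return CacheModeSketch.WRITE_ONLY
    if no_cache_write:
        return CacheModeSketch.READ_ONLY
    return CacheModeSketch.ENABLED


if __name__ == "__main__":
    for mode in CacheModeSketch:
        print(mode.name, should_read(mode), should_write(mode))
```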
 ---
 ## 3. Content Processing & Selection
 ### 3.1 Text Processing
 ### Text Processing
 ```python
-await crawler.arun(
+run_config = CrawlerRunConfig(
-    word_count_threshold=10,                # Minimum words per content block
+    word_count_threshold=10,   # Ignore text blocks <10 words
-    image_description_min_word_threshold=5,  # Minimum words for image descriptions
+    only_text=False,           # If True, tries to remove non-text elements
-    only_text=False,                        # Extract only text content
+    keep_data_attributes=False # Keep or discard data-* attributes
    excluded_tags=['form', 'nav'],          # HTML tags to exclude
    keep_data_attributes=False,             # Preserve data-* attributes
 )
 ```
-### Content Selection
+### 3.2 Content Selection
 ```python
-await crawler.arun(
+run_config = CrawlerRunConfig(
-    css_selector=".main-content",  # CSS selector for content extraction
+    css_selector=".main-content",  # Focus on .main-content region only
-    remove_forms=True,             # Remove all form elements
+    excluded_tags=["form", "nav"], # Remove entire tag blocks
-    remove_overlay_elements=True,  # Remove popups/modals/overlays
+    remove_forms=True,             # Specifically strip <form> elements
    remove_overlay_elements=True,  # Attempt to remove modals/popups
 )
 ```
-### Link Handling
+### 3.3 Link Handling
 ```python
-await crawler.arun(
+run_config = CrawlerRunConfig(
-    exclude_external_links=True,          # Remove external links
+    exclude_external_links=True,         # Remove external links from final content
-    exclude_social_media_links=True,      # Remove social media links
+    exclude_social_media_links=True,     # Remove links to known social sites
-    exclude_external_images=True,         # Remove external images
+    exclude_domains=["ads.example.com"], # Exclude links to these domains
-    exclude_domains=["ads.example.com"],  # Specific domains to exclude
+    exclude_social_media_domains=["facebook.com","twitter.com"], # Extend the default list
    social_media_domains=[               # Additional social media domains
        "facebook.com",
        "twitter.com",
        "instagram.com"
    ]
 )
 ```
-## Browser Control Parameters
+### 3.4 Media Filtering
 ### Basic Browser Settings
 ```python
-await crawler.arun(
+run_config = CrawlerRunConfig(
-    headless=True,                # Run browser in headless mode
+    exclude_external_images=True  # Strip images from other domains
    browser_type="chromium",      # Browser engine: "chromium", "firefox", "webkit"
    page_timeout=60000,          # Page load timeout in milliseconds
    user_agent="custom-agent",    # Custom user agent
 )
 ```
-### Navigation and Waiting
+---
 ## 4. Page Navigation & Timing
 ### 4.1 Basic Browser Flow
 ```python
-await crawler.arun(
+run_config = CrawlerRunConfig(
-    wait_for="css:.dynamic-content",  # Wait for element/condition
+    wait_for="css:.dynamic-content", # Wait for .dynamic-content
-    delay_before_return_html=2.0,     # Wait before returning HTML (seconds)
+    delay_before_return_html=2.0,    # Wait 2s before capturing final HTML
    page_timeout=60000,             # Navigation & script timeout (ms)
 )
 ```
-### JavaScript Execution
+**Key Fields**:
 - `wait_for`:  
  - `"css:selector"` or  
  - `"js:() => boolean"`  
  e.g. `js:() => document.querySelectorAll('.item').length > 10`.
 - `mean_delay` & `max_range`: define random delays for `arun_many()` calls.  
 - `semaphore_count`: concurrency limit when crawling multiple URLs.
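To make the two `wait_for` forms concrete, here is a small sketch that splits a condition string into its kind and expression. `parse_wait_for` is a hypothetical helper for illustration only; it rejects strings without a `css:` or `js:` prefix, which is a simplification and may differ from how the library treats unprefixed values.

```python
def parse_wait_for(value: str):
    """Split 'css:<selector>' or 'js:<function>' into a (kind, expression) tuple."""
    kind, separator, expression = value.partition(":")
    if separator and kind in ("css", "js") and expression.strip():
        return kind, expression.strip()
    raise ValueError(f"expected a 'css:' or 'js:' prefix, got {value!r}")


if __name__ == "__main__":
    print(parse_wait_for("css:.dynamic-content"))
    print(parse_wait_for("js:() => document.querySelectorAll('.item').length > 10"))
```

Only the first colon is treated as the separator, so selectors or functions that contain colons of their own (for example `a:hover`) stay intact.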
 ### 4.2 JavaScript Execution
 ```python
-await crawler.arun(
+run_config = CrawlerRunConfig(
-    js_code=[                     # JavaScript to execute (string or list)
+    js_code=[
        "window.scrollTo(0, document.body.scrollHeight);",
-        "document.querySelector('.load-more').click();"
+        "document.querySelector('.load-more')?.click();"
    ],
-    js_only=False,               # Only execute JavaScript without reloading page
+    js_only=False
 )
 ```
-### Anti-Bot Features
+- `js_code` can be a single string or a list of strings.  
 - `js_only=True` means “I’m continuing in the same session with new JS steps, no new full navigation.”
 ### 4.3 Anti-Bot
 ```python
-await crawler.arun(
+run_config = CrawlerRunConfig(
-    magic=True,              # Enable all anti-detection features
+    magic=True,
-    simulate_user=True,      # Simulate human behavior
+    simulate_user=True,
-    override_navigator=True  # Override navigator properties
+    override_navigator=True
 )
 ```
 - `magic=True` tries multiple stealth features.  
 - `simulate_user=True` mimics mouse movements or random delays.  
 - `override_navigator=True` fakes some navigator properties (like user agent checks).
 ---
 ## 5. Session Management
 **`session_id`**: 
 ```python
 run_config = CrawlerRunConfig(
    session_id="my_session123"
 )
 ```
 If re-used in subsequent `arun()` calls, the same tab/page context is continued (helpful for multi-step tasks or stateful browsing).
 ---
 ## 6. Screenshot, PDF & Media Options
 ```python
 run_config = CrawlerRunConfig(
    screenshot=True,             # Grab a screenshot as base64
    screenshot_wait_for=1.0,     # Wait 1s before capturing
    pdf=True,                    # Also produce a PDF
    image_description_min_word_threshold=5,  # If analyzing alt text
    image_score_threshold=3,                # Filter out low-score images
 )
 ```
 **Where they appear**:
 - `result.screenshot` → Base64 screenshot string.
 - `result.pdf` → Byte array with PDF data.
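Since the screenshot arrives as a base64 string and the PDF as raw bytes, saving them takes one decode step for the former and none for the latter. A minimal sketch, assuming the screenshot is PNG data (the file extension is a guess) and using a hypothetical helper named `save_media`:

```python
import base64
from pathlib import Path


def save_media(screenshot_b64, pdf_bytes, out_dir):
    """Write whichever of the two media results is present; return the written paths."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    if screenshot_b64:
        image_path = target / "page.png"  # assumes PNG content
        image_path.write_bytes(base64.b64decode(screenshot_b64))
        written.append(image_path)
    if pdf_bytes:
        pdf_path = target / "page.pdf"
        pdf_path.write_bytes(pdf_bytes)
        written.append(pdf_path)
    return written


if __name__ == "__main__":
    fake_screenshot = base64.b64encode(b"not a real image").decode("ascii")
    print(save_media(fake_screenshot, b"%PDF-placeholder", "tmp_media"))
```

After a crawl, the call would look like `save_media(result.screenshot, result.pdf, "tmp")`.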
 ---
 ## 7. Extraction Strategy
 **For advanced data extraction** (CSS/LLM-based), set `extraction_strategy`:
 ```python
 run_config = CrawlerRunConfig(
    extraction_strategy=my_css_or_llm_strategy
 )
 ```
-### Session Management
+The extracted data will appear in `result.extracted_content`.
 ```python
 await crawler.arun(
    session_id="my_session",  # Session identifier for persistent browsing
 )
 ```
-### Screenshot Options
+---
-```python
+
-await crawler.arun(
+## 8. Comprehensive Example
-    screenshot=True,              # Take page screenshot
+
-    screenshot_wait_for=2.0,      # Wait before screenshot (seconds)
+Below is a snippet combining many parameters:
 )
 ```
 ### Proxy Configuration
 ```python
-await crawler.arun(
+import asyncio
-    proxy="http://proxy.example.com:8080",     # Simple proxy URL
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
-    proxy_config={                             # Advanced proxy settings
+from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
-        "server": "http://proxy.example.com:8080",
+
-        "username": "user",
+async def main():
-        "password": "pass"
+    # Example schema
    schema = {
        "name": "Articles",
        "baseSelector": "article.post",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link",  "selector": "a",  "type": "attribute", "attribute": "href"}
        ]
    }
 )
 ```
-## Content Extraction Parameters
+    run_config = CrawlerRunConfig(
-
+        # Core
-### Extraction Strategy
+        verbose=True,
 ```python
 await crawler.arun(
    extraction_strategy=LLMExtractionStrategy(
        provider="ollama/llama2",
        schema=MySchema.schema(),
        instruction="Extract specific data"
    )
 )
 ```
 ### Chunking Strategy
 ```python
 await crawler.arun(
    chunking_strategy=RegexChunking(
        patterns=[r'\n\n', r'\.\s+']
    )
 )
 ```
 ### HTML to Text Options
 ```python
 await crawler.arun(
    html2text={
        "ignore_links": False,
        "ignore_images": False,
        "escape_dot": False,
        "body_width": 0,
        "protect_links": True,
        "unicode_snob": True
    }
 )
 ```
 ## Debug Options
 ```python
 await crawler.arun(
    log_console=True,   # Log browser console messages
 )
 ```
 ## Parameter Interactions and Notes
 1. **Cache and Performance Setup**
   ```python
   # Optimal caching for repeated crawls
   await crawler.arun(
        cache_mode=CacheMode.ENABLED,
        check_robots_txt=True,   # Respect robots.txt rules
        # Content
        word_count_threshold=10,
-       process_iframes=False
+        css_selector="main.content",
-   )
+        excluded_tags=["nav", "footer"],
-   ```
+        exclude_external_links=True,
-2. **Dynamic Content Handling**
+        # Page & JS
-   ```python
+        js_code="document.querySelector('.show-more')?.click();",
-   # Handle lazy-loaded content
+        wait_for="css:.loaded-block",
-   await crawler.arun(
+        page_timeout=30000,
       js_code="window.scrollTo(0, document.body.scrollHeight);",
       wait_for="css:.lazy-content",
       delay_before_return_html=2.0,
       cache_mode=CacheMode.WRITE_ONLY  # Cache results after dynamic load
   )
   ```
-3. **Content Extraction Pipeline**
+        # Extraction
-   ```python
+        extraction_strategy=JsonCssExtractionStrategy(schema),
   # Complete extraction setup
   await crawler.arun(
       css_selector=".main-content",
       word_count_threshold=20,
       extraction_strategy=my_strategy,
       chunking_strategy=my_chunking,
       process_iframes=True,
       remove_overlay_elements=True,
       cache_mode=CacheMode.ENABLED
   )
   ```
-## Best Practices
+        # Session
        session_id="persistent_session",
-1. **Performance Optimization**
+        # Media
-   ```python
+        screenshot=True,
-   await crawler.arun(
+        pdf=True,
       cache_mode=CacheMode.ENABLED,  # Use full caching
       word_count_threshold=10,      # Filter out noise
       process_iframes=False         # Skip iframes if not needed
   )
   ```
-2. **Reliable Scraping**
+        # Anti-bot
-   ```python
+        simulate_user=True,
-   await crawler.arun(
+        magic=True,
       magic=True,                   # Enable anti-detection
       delay_before_return_html=1.0, # Wait for dynamic content
       page_timeout=60000,          # Longer timeout for slow pages
       cache_mode=CacheMode.WRITE_ONLY  # Cache results after successful crawl
    )
   ```
-3. **Clean Content**
+    async with AsyncWebCrawler() as crawler:
-   ```python
+        result = await crawler.arun("https://example.com/posts", config=run_config)
-   await crawler.arun(
+        if result.success:
-       remove_overlay_elements=True,  # Remove popups
+            print("HTML length:", len(result.cleaned_html))
-       excluded_tags=['nav', 'aside'],# Remove unnecessary elements
+            print("Extraction JSON:", result.extracted_content)
-       keep_data_attributes=False,    # Remove data attributes
+            if result.screenshot:
-       cache_mode=CacheMode.ENABLED   # Use cache for faster processing
+                print("Screenshot length:", len(result.screenshot))
-   )
+            if result.pdf:
-   ```
+                print("PDF bytes length:", len(result.pdf))
        else:
            print("Error:", result.error_message)
 if __name__ == "__main__":
    asyncio.run(main())
 ```
 **What we covered**:
 1. **Crawling** the main content region, ignoring external links.  
 2. Running **JavaScript** to click “.show-more”.  
 3. **Waiting** for “.loaded-block” to appear.  
 4. Generating a **screenshot** & **PDF** of the final page.  
 5. Extracting repeated “article.post” elements with a **CSS-based** extraction strategy.
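The example prints `result.extracted_content` as-is. To work with the individual articles, one option is to parse it first. The sketch below assumes that the value is a JSON string encoding a list of objects keyed by the schema's field names (`title`, `link`); that shape is an assumption for illustration, and `parse_articles` is a hypothetical helper.

```python
import json


def parse_articles(extracted_content):
    """Decode the extraction output and keep entries that have both a title and a link."""
    if not extracted_content:
        return []
    items = json.loads(extracted_content)
    return [item for item in items if item.get("title") and item.get("link")]


if __name__ == "__main__":
    sample = json.dumps([
        {"title": "First post", "link": "/posts/1"},
        {"title": "Missing link"},
    ])
    for article in parse_articles(sample):
        print(article["title"], "->", article["link"])
```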
 ---
 ## 9. Best Practices
 1. **Use `BrowserConfig`** for **global** browser settings (headless, user agent).  
 2. **Use `CrawlerRunConfig`** to handle the **specific** crawl needs: content filtering, caching, JS, screenshot, extraction, etc.  
 3. Keep your **parameters consistent** in run configs, especially if you work in a large codebase with multiple crawls.  
 4. **Limit** large concurrency (`semaphore_count`) if the site or your system can’t handle it.  
 5. For dynamic pages, set `js_code` or `scan_full_page` so you load all content.
 ---
 ## 10. Conclusion
 All parameters that used to be direct arguments to `arun()` now belong in **`CrawlerRunConfig`**. This approach:
 - Makes code **clearer** and **more maintainable**.  
 - Minimizes confusion about which arguments affect global vs. per-crawl behavior.  
 - Allows you to create **reusable** config objects for different pages or tasks.
 For a **full** reference, check out the [CrawlerRunConfig Docs](./parameters.md). 
 Happy crawling with your **structured, flexible** config approach!
--- a/docs/md_v2/api/arun_many.md
+++ b/docs/md_v2/api/arun_many.md
@@ -0,0 +1,124 @@
 # `arun_many(...)` Reference
 > **Note**: This function is very similar to [`arun()`](./arun.md) but focused on **concurrent** or **batch** crawling. If you’re unfamiliar with `arun()` usage, please read that doc first, then review this for differences.
 ## Function Signature
 ```python
 async def arun_many(
    urls: Union[List[str], List[Any]],
    config: Optional[CrawlerRunConfig] = None,
    dispatcher: Optional[BaseDispatcher] = None,
    ...
 ) -> Union[List[CrawlResult], AsyncGenerator[CrawlResult, None]]:
    """
    Crawl multiple URLs concurrently or in batches.
    :param urls: A list of URLs (or tasks) to crawl.
    :param config: (Optional) A default `CrawlerRunConfig` applying to each crawl.
    :param dispatcher: (Optional) A concurrency controller (e.g. MemoryAdaptiveDispatcher).
    ...
    :return: Either a list of `CrawlResult` objects, or an async generator if streaming is enabled.
    """
 ```
 ## Differences from `arun()`
 1. **Multiple URLs**:  
   - Instead of crawling a single URL, you pass a list of them (strings or tasks).  
   - The function returns either a **list** of `CrawlResult` or an **async generator** if streaming is enabled.
 2. **Concurrency & Dispatchers**:  
   - **`dispatcher`** param allows advanced concurrency control.  
   - If omitted, a default dispatcher (like `MemoryAdaptiveDispatcher`) is used internally.  
   - Dispatchers handle concurrency, rate limiting, and memory-based adaptive throttling (see [Multi-URL Crawling](../advanced/multi-url-crawling.md)).
 3. **Streaming Support**:  
   - Enable streaming by setting `stream=True` in your `CrawlerRunConfig`.
   - When streaming, use `async for` to process results as they become available.
   - Ideal for processing large numbers of URLs without waiting for all to complete.
 4. **Parallel Execution**:  
   - `arun_many()` can run multiple requests concurrently under the hood.  
   - Each `CrawlResult` might also include a **`dispatch_result`** with concurrency details (like memory usage, start/end times).
 ### Basic Example (Batch Mode)
 ```python
 # Minimal usage: The default dispatcher will be used
 results = await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com"],
    config=CrawlerRunConfig(stream=False)  # Default behavior
 )
 for res in results:
    if res.success:
        print(res.url, "crawled OK!")
    else:
        print("Failed:", res.url, "-", res.error_message)
 ```
 ### Streaming Example
 ```python
 config = CrawlerRunConfig(
    stream=True,  # Enable streaming mode
    cache_mode=CacheMode.BYPASS
 )
 # Process results as they complete
 async for result in await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com", "https://site3.com"],
    config=config
 ):
    if result.success:
        print(f"Just completed: {result.url}")
        # Process each result immediately
        process_result(result)
 ```
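The practical difference between the batch and streaming examples is when results become available. The following sketch shows that difference with plain `asyncio` and a simulated crawl; it is an analogy, not the library's implementation, and `fake_crawl`, `batch`, and `stream` are hypothetical names.

```python
import asyncio


async def fake_crawl(url, delay):
    """Simulated crawl: sleeps for `delay` seconds, then returns the URL."""
    await asyncio.sleep(delay)
    return url


async def batch(jobs):
    """Batch style: wait for everything, results come back in input order."""
    return await asyncio.gather(*(fake_crawl(url, delay) for url, delay in jobs))


async def stream(jobs):
    """Streaming style: yield each result as soon as its crawl finishes."""
    pending = [fake_crawl(url, delay) for url, delay in jobs]
    for finished in asyncio.as_completed(pending):
        yield await finished


async def collect_stream(jobs):
    """Gather streamed results into a list, preserving completion order."""
    return [url async for url in stream(jobs)]


if __name__ == "__main__":
    jobs = [("https://slow.example", 0.2), ("https://fast.example", 0.01)]
    print(asyncio.run(batch(jobs)))
    print(asyncio.run(collect_stream(jobs)))
```

In batch mode the slow URL delays access to every result; in streaming mode the fast URL can be processed while the slow one is still in flight.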
 ### With a Custom Dispatcher
 ```python
 dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=70.0,
    max_session_permit=10
 )
 results = await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com", "https://site3.com"],
    config=my_run_config,
    dispatcher=dispatcher
 )
 ```
 **Key Points**:
 - Each URL is processed by the same or separate sessions, depending on the dispatcher’s strategy.
 - `dispatch_result` in each `CrawlResult` (if using concurrency) can hold memory and timing info.  
 - If you need to handle authentication or session IDs, pass them in each individual task or within your run config.
 ### Return Value
 Either a **list** of [`CrawlResult`](./crawl-result.md) objects, or an **async generator** if streaming is enabled. You can iterate to check `result.success` or read each item’s `extracted_content`, `markdown`, or `dispatch_result`.
 ---
 ## Dispatcher Reference
 - **`MemoryAdaptiveDispatcher`**: Dynamically manages concurrency based on system memory usage.  
 - **`SemaphoreDispatcher`**: Fixed concurrency limit, simpler but less adaptive.  
 For advanced usage or custom settings, see [Multi-URL Crawling with Dispatchers](../advanced/multi-url-crawling.md).
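A fixed concurrency limit, as described for `SemaphoreDispatcher`, can be pictured with a plain `asyncio.Semaphore`. This is a sketch of the general idea under that assumption, not the dispatcher's actual code; `run_with_limit` and the demo worker are hypothetical.

```python
import asyncio


async def run_with_limit(urls, worker, limit):
    """Run `worker(url)` for every URL with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url):
        async with semaphore:  # blocks while `limit` workers are already active
            return await worker(url)

    return await asyncio.gather(*(guarded(url) for url in urls))


if __name__ == "__main__":
    async def demo_worker(url):
        await asyncio.sleep(0.01)
        return f"done: {url}"

    urls = [f"https://example.com/{n}" for n in range(5)]
    print(asyncio.run(run_with_limit(urls, demo_worker, limit=2)))
```

A memory-adaptive approach would replace the fixed `limit` with a check that pauses new work while system memory is above a threshold.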
 ---
 ## Common Pitfalls
 1. **Large Lists**: If you pass thousands of URLs, be mindful of memory or rate-limits. A dispatcher can help.  
 2. **Session Reuse**: If you need specialized logins or persistent contexts, ensure your dispatcher or tasks handle sessions accordingly.  
 3. **Error Handling**: Each `CrawlResult` might fail for different reasons—always check `result.success` or the `error_message` before proceeding.
 ---
 ## Conclusion
 Use `arun_many()` when you want to **crawl multiple URLs** simultaneously or in controlled parallel tasks. If you need advanced concurrency features (like memory-based adaptive throttling or complex rate-limiting), provide a **dispatcher**. Each result is a standard `CrawlResult`, possibly augmented with concurrency stats (`dispatch_result`) for deeper inspection. For more details on concurrency logic and dispatchers, see the [Advanced Multi-URL Crawling](../advanced/multi-url-crawling.md) docs.
--- a/docs/md_v2/api/async-webcrawler.md
+++ b/docs/md_v2/api/async-webcrawler.md
@@ -1,320 +1,331 @@
 # AsyncWebCrawler
-The `AsyncWebCrawler` class is the main interface for web crawling operations. It provides asynchronous web crawling capabilities with extensive configuration options.
+The **`AsyncWebCrawler`** is the core class for asynchronous web crawling in Crawl4AI. You typically create it **once**, optionally customize it with a **`BrowserConfig`** (e.g., headless, user agent), then **run** multiple **`arun()`** calls with different **`CrawlerRunConfig`** objects.
-## Constructor
+**Recommended usage**:
 1. **Create** a `BrowserConfig` for global browser settings.  
 2. **Instantiate** `AsyncWebCrawler(config=browser_config)`.  
 3. **Use** the crawler in an async context manager (`async with`) or manage start/close manually.  
 4. **Call** `arun(url, config=crawler_run_config)` for each page you want.
 ---
 ## 1. Constructor Overview
 ```python
-AsyncWebCrawler(
+class AsyncWebCrawler:
-    # Browser Settings
+    def __init__(
-    browser_type: str = "chromium",         # Options: "chromium", "firefox", "webkit"
+        self,
-    headless: bool = True,                  # Run browser in headless mode
+        crawler_strategy: Optional[AsyncCrawlerStrategy] = None,
-    verbose: bool = False,                  # Enable verbose logging
+        config: Optional[BrowserConfig] = None,
        always_bypass_cache: bool = False,           # deprecated
        always_by_pass_cache: Optional[bool] = None, # also deprecated
        base_directory: str = ...,
        thread_safe: bool = False,
        **kwargs,
    ):
        """
        Create an AsyncWebCrawler instance.
-    # Cache Settings
+        Args:
-    always_by_pass_cache: bool = False,     # Always bypass cache
+            crawler_strategy: 
-    base_directory: str = str(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home())), # Base directory for cache
+                (Advanced) Provide a custom crawler strategy if needed.
            config: 
                A BrowserConfig object specifying how the browser is set up.
            always_bypass_cache: 
                (Deprecated) Use CrawlerRunConfig.cache_mode instead.
            base_directory:     
                Folder for storing caches/logs (if relevant).
            thread_safe: 
                If True, attempts some concurrency safeguards. Usually False.
            **kwargs: 
                Additional legacy or debugging parameters.
        """
    )
-    # Network Settings
+### Typical Initialization
    proxy: str = None,                      # Simple proxy URL
    proxy_config: Dict = None,              # Advanced proxy configuration
-    # Browser Behavior
+```python
-    sleep_on_close: bool = False,           # Wait before closing browser
+from crawl4ai import AsyncWebCrawler, BrowserConfig
-    # Custom Settings
+browser_cfg = BrowserConfig(
-    user_agent: str = None,                 # Custom user agent
+    browser_type="chromium",
-    headers: Dict[str, str] = {},           # Custom HTTP headers
+    headless=True,
-    js_code: Union[str, List[str]] = None,  # Default JavaScript to execute
+    verbose=True
 )
 crawler = AsyncWebCrawler(config=browser_cfg)
 ```
-### Parameters in Detail
+**Notes**:
 - **Legacy** parameters like `always_bypass_cache` remain for backward compatibility, but prefer to set **caching** in `CrawlerRunConfig`.
-#### Browser Settings
+---
- **browser_type** (str, optional)
+## 2. Lifecycle: Start/Close or Context Manager
  - Default: `"chromium"`
  - Options: `"chromium"`, `"firefox"`, `"webkit"`
  - Controls which browser engine to use
  ```python
  # Example: Using Firefox
  crawler = AsyncWebCrawler(browser_type="firefox")
  ```
- **headless** (bool, optional)
+### 2.1 Context Manager (Recommended)
  - Default: `True`
  - When `True`, browser runs without GUI
  - Set to `False` for debugging
  ```python
  # Visible browser for debugging
  crawler = AsyncWebCrawler(headless=False)
  ```
- **verbose** (bool, optional)
+```python
-  - Default: `False`
+async with AsyncWebCrawler(config=browser_cfg) as crawler:
-  - Enables detailed logging
+    result = await crawler.arun("https://example.com")
-  ```python
+    # The crawler automatically starts/closes resources
-  # Enable detailed logging
+```
  crawler = AsyncWebCrawler(verbose=True)
  ```
-#### Cache Settings
+When the `async with` block ends, the crawler cleans up (closes the browser, etc.).
- **always_by_pass_cache** (bool, optional)
+### 2.2 Manual Start & Close
  - Default: `False`
  - When `True`, always fetches fresh content
  ```python
  # Always fetch fresh content
  crawler = AsyncWebCrawler(always_by_pass_cache=True)
  ```
- **base_directory** (str, optional)
+```python
-  - Default: User's home directory
+crawler = AsyncWebCrawler(config=browser_cfg)
-  - Base path for cache storage
+await crawler.start()
  ```python
  # Custom cache directory
  crawler = AsyncWebCrawler(base_directory="/path/to/cache")
  ```
-#### Network Settings
+result1 = await crawler.arun("https://example.com")
 result2 = await crawler.arun("https://another.com")
- **proxy** (str, optional)
+await crawler.close()
-  - Simple proxy URL
+```
  ```python
  # Using simple proxy
  crawler = AsyncWebCrawler(proxy="http://proxy.example.com:8080")
  ```
- **proxy_config** (Dict, optional)
+Use this style if you have a **long-running** application or need full control of the crawler’s lifecycle.
  - Advanced proxy configuration with authentication
  ```python
  # Advanced proxy with auth
  crawler = AsyncWebCrawler(proxy_config={
      "server": "http://proxy.example.com:8080",
      "username": "user",
      "password": "pass"
  })
  ```
-#### Browser Behavior
+---
- **sleep_on_close** (bool, optional)
+## 3. Primary Method: `arun()`
  - Default: `False`
  - Adds delay before closing browser
  ```python
  # Wait before closing
  crawler = AsyncWebCrawler(sleep_on_close=True)
  ```
 #### Custom Settings
 - **user_agent** (str, optional)
  - Custom user agent string
  ```python
  # Custom user agent
  crawler = AsyncWebCrawler(
      user_agent="Mozilla/5.0 (Custom Agent) Chrome/90.0"
  )
  ```
 - **headers** (Dict[str, str], optional)
  - Custom HTTP headers
  ```python
  # Custom headers
  crawler = AsyncWebCrawler(
      headers={
          "Accept-Language": "en-US",
          "Custom-Header": "Value"
      }
  )
  ```
 - **js_code** (Union[str, List[str]], optional)
  - Default JavaScript to execute on each page
  ```python
  # Default JavaScript
  crawler = AsyncWebCrawler(
      js_code=[
          "window.scrollTo(0, document.body.scrollHeight);",
          "document.querySelector('.load-more').click();"
      ]
  )
  ```
 ## Methods
 ### arun()
 The primary method for crawling web pages.
 ```python
 async def arun(
-    # Required
+    self,
-    url: str,                              # URL to crawl
+    url: str,
-    
+    config: Optional[CrawlerRunConfig] = None,
-    # Content Selection
+    # Legacy parameters for backward compatibility...
    css_selector: str = None,              # CSS selector for content
    word_count_threshold: int = 10,        # Minimum words per block
    # Cache Control
    bypass_cache: bool = False,            # Bypass cache for this request
    # Session Management
    session_id: str = None,                # Session identifier
    # Screenshot Options
    screenshot: bool = False,              # Take screenshot
    screenshot_wait_for: float = None,     # Wait before screenshot
    # Content Processing
    process_iframes: bool = False,         # Process iframe content
    remove_overlay_elements: bool = False, # Remove popups/modals
    # Anti-Bot Settings
    simulate_user: bool = False,           # Simulate human behavior
    override_navigator: bool = False,      # Override navigator properties
    magic: bool = False,                   # Enable all anti-detection
    # Content Filtering
    excluded_tags: List[str] = None,       # HTML tags to exclude
    exclude_external_links: bool = False,  # Remove external links
    exclude_social_media_links: bool = False, # Remove social media links
    # JavaScript Handling
    js_code: Union[str, List[str]] = None, # JavaScript to execute
    wait_for: str = None,                  # Wait condition
    # Page Loading
    page_timeout: int = 60000,            # Page load timeout (ms)
    delay_before_return_html: float = None, # Wait before return
    # Extraction
    extraction_strategy: ExtractionStrategy = None  # Extraction strategy
 ) -> CrawlResult:
    ...
 ```
-### Usage Examples
+### 3.1 New Approach
-#### Basic Crawling
+You pass a `CrawlerRunConfig` object that sets up everything about a crawl—content filtering, caching, session reuse, JS code, screenshots, etc.
 ```python
 async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com")
 ```
 #### Advanced Crawling
 ```python
-async with AsyncWebCrawler(
+import asyncio
-    browser_type="firefox",
+from crawl4ai import CrawlerRunConfig, CacheMode
-    verbose=True,
+
-    headers={"Custom-Header": "Value"}
+run_cfg = CrawlerRunConfig(
-) as crawler:
+    cache_mode=CacheMode.BYPASS,
-    result = await crawler.arun(
+    css_selector="main.article",
-        url="https://example.com",
+    word_count_threshold=10,
        css_selector=".main-content",
        word_count_threshold=20,
        process_iframes=True,
        magic=True,
        wait_for="css:.dynamic-content",
    screenshot=True
-    )
+)
 async with AsyncWebCrawler(config=browser_cfg) as crawler:
    result = await crawler.arun("https://example.com/news", config=run_cfg)
    print("Crawled HTML length:", len(result.cleaned_html))
    if result.screenshot:
        print("Screenshot base64 length:", len(result.screenshot))
 ```
-#### Session Management
+### 3.2 Legacy Parameters Still Accepted
 ```python
 async with AsyncWebCrawler() as crawler:
    # First request
    result1 = await crawler.arun(
        url="https://example.com/login",
        session_id="my_session"
    )
-    # Subsequent request using same session
+For **backward** compatibility, `arun()` can still accept direct arguments like `css_selector=...`, `word_count_threshold=...`, etc., but we strongly advise migrating them into a **`CrawlerRunConfig`**.
-    result2 = await crawler.arun(
+
-        url="https://example.com/protected",
+---
-        session_id="my_session"
+
-    )
+## 4. Batch Processing: `arun_many()`
 ```python
 async def arun_many(
    self,
    urls: List[str],
    config: Optional[CrawlerRunConfig] = None,
    # Legacy parameters maintained for backwards compatibility...
 ) -> List[CrawlResult]:
    """
    Process multiple URLs with intelligent rate limiting and resource monitoring.
    """
 ```
-## Context Manager
+### 4.1 Resource-Aware Crawling
-AsyncWebCrawler implements the async context manager protocol:
+The `arun_many()` method now uses an intelligent dispatcher that:
 - Monitors system memory usage
 - Implements adaptive rate limiting
 - Provides detailed progress monitoring
 - Manages concurrent crawls efficiently
 ### 4.2 Example Usage
 ```python
-async def __aenter__(self) -> 'AsyncWebCrawler':
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, RateLimitConfig
-    # Initialize browser and resources
+from crawl4ai.dispatcher import DisplayMode
    return self
-async def __aexit__(self, *args):
+# Configure browser
-    # Cleanup resources
+browser_cfg = BrowserConfig(headless=True)
-    pass
+
 # Configure crawler with rate limiting
 run_cfg = CrawlerRunConfig(
    # Enable rate limiting
    enable_rate_limiting=True,
    rate_limit_config=RateLimitConfig(
        base_delay=(1.0, 2.0),  # Random delay between 1-2 seconds
        max_delay=30.0,         # Maximum delay after rate limit hits
        max_retries=2,          # Number of retries before giving up
        rate_limit_codes=[429, 503]  # Status codes that trigger rate limiting
    ),
    # Resource monitoring
    memory_threshold_percent=70.0,  # Pause if memory exceeds this
    check_interval=0.5,            # How often to check resources
    max_session_permit=3,          # Maximum concurrent crawls
    display_mode=DisplayMode.DETAILED.value  # Show detailed progress
 )
 urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
 ]
 async with AsyncWebCrawler(config=browser_cfg) as crawler:
    results = await crawler.arun_many(urls, config=run_cfg)
    for result in results:
        print(f"URL: {result.url}, Success: {result.success}")
 ```
-Always use AsyncWebCrawler with async context manager:
+### 4.3 Key Features
 ```python
 async with AsyncWebCrawler() as crawler:
    # Your crawling code here
    pass
 ```
-## Best Practices
+1. **Rate Limiting**
   - Automatic delay between requests
   - Exponential backoff on rate limit detection
   - Domain-specific rate limiting
   - Configurable retry strategy
-1. **Resource Management**
+2. **Resource Monitoring**
-```python
+   - Memory usage tracking
-# Always use context manager
+   - Adaptive concurrency based on system load
-async with AsyncWebCrawler() as crawler:
+   - Automatic pausing when resources are constrained
    # Crawler will be properly cleaned up
    pass
 ```
-2. **Error Handling**
+3. **Progress Monitoring**
-```python
+   - Detailed or aggregated progress display
-try:
+   - Real-time status updates
-    async with AsyncWebCrawler() as crawler:
+   - Memory usage statistics
-        result = await crawler.arun(url="https://example.com")
+
-        if not result.success:
+4. **Error Handling**
-            print(f"Crawl failed: {result.error_message}")
+   - Graceful handling of rate limits
-except Exception as e:
+   - Automatic retries with backoff
-    print(f"Error: {str(e)}")
+   - Detailed error reporting
-```
+
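The "exponential backoff" behaviour listed above can be pictured with a small standalone sketch. This is not Crawl4AI's dispatcher code: the function name `backoff_delays` is hypothetical, and its parameters only mirror the `RateLimitConfig` fields shown in 4.2 (`base_delay`, `max_delay`, `max_retries`). It assumes the delay doubles on each retry and is capped at `max_delay`, which is one common way to implement backoff.

```python
import random


def backoff_delays(base_delay=(1.0, 2.0), max_delay=30.0, max_retries=2, seed=None):
    """Return the wait (in seconds) before the first attempt and before each retry.

    Hypothetical helper: illustrates capped exponential backoff with a
    randomized base delay, not the library's actual algorithm.
    """
    rng = random.Random(seed)
    delays = []
    for attempt in range(max_retries + 1):
        # Pick a random base delay, then double it for every retry so far.
        base = rng.uniform(*base_delay)
        delays.append(min(base * (2 ** attempt), max_delay))
    return delays


delays = backoff_delays(seed=0)
for attempt, delay in enumerate(delays):
    print(f"attempt {attempt}: wait {delay:.2f}s")
```

Randomizing the base delay spreads retries out so that several concurrent crawls against one domain do not all retry at the same instant.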
 ---
 ## 5. `CrawlResult` Output
 Each `arun()` call returns a **`CrawlResult`** containing:
 - `url`: Final URL (if redirected).
 - `html`: Original HTML.
 - `cleaned_html`: Sanitized HTML.
 - `markdown_v2` (or future `markdown`): Markdown outputs (raw, fit, etc.).
 - `extracted_content`: If an extraction strategy was used (JSON for CSS/LLM strategies).
 - `screenshot`, `pdf`: If screenshots/PDF requested.
 - `media`, `links`: Information about discovered images/links.
 - `success`, `error_message`: Status info.
 For details, see [CrawlResult doc](./crawl-result.md).
 ---
 ## 6. Quick Example
 The example below ties everything together:
 3. **Performance Optimization**
 ```python
-# Enable caching for better performance
+import asyncio
-crawler = AsyncWebCrawler(
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
-    always_by_pass_cache=False,
+from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
 import json
 async def main():
    # 1. Browser config
    browser_cfg = BrowserConfig(
        browser_type="firefox",
        headless=False,
        verbose=True
-)
+    )
    # 2. Run config
    schema = {
        "name": "Articles",
        "baseSelector": "article.post",
        "fields": [
            {
                "name": "title", 
                "selector": "h2", 
                "type": "text"
            },
            {
                "name": "url", 
                "selector": "a", 
                "type": "attribute", 
                "attribute": "href"
            }
        ]
    }
    run_cfg = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema),
        word_count_threshold=15,
        remove_overlay_elements=True,
        wait_for="css:.post"  # Wait for posts to appear
    )
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://example.com/blog",
            config=run_cfg
        )
        if result.success:
            print("Cleaned HTML length:", len(result.cleaned_html))
            if result.extracted_content:
                articles = json.loads(result.extracted_content)
                print("Extracted articles:", articles[:2])
        else:
            print("Error:", result.error_message)
 asyncio.run(main())
 ```
-4. **Anti-Detection**
+**Explanation**:
-```python
+- We define a **`BrowserConfig`** that uses Firefox, runs with a visible window (`headless=False`), and sets `verbose=True`.  
-# Maximum stealth
+- We define a **`CrawlerRunConfig`** that **bypasses cache**, uses a **CSS** extraction schema, has a `word_count_threshold=15`, etc.  
-crawler = AsyncWebCrawler(
+- We pass them to `AsyncWebCrawler(config=...)` and `arun(url=..., config=...)`.
    headless=True,
    user_agent="Mozilla/5.0...",
    headers={"Accept-Language": "en-US"}
 )
 result = await crawler.arun(
    url="https://example.com",
    magic=True,
    simulate_user=True
 )
 ```
-## Note on Browser Types
+---
-Each browser type has its characteristics:
+## 7. Best Practices & Migration Notes
- **chromium**: Best overall compatibility
+1. **Use** `BrowserConfig` for **global** settings about the browser’s environment.  
- **firefox**: Good for specific use cases
+2. **Use** `CrawlerRunConfig` for **per-crawl** logic (caching, content filtering, extraction strategies, wait conditions).  
- **webkit**: Lighter weight, good for basic crawling
+3. **Avoid** legacy parameters like `css_selector` or `word_count_threshold` directly in `arun()`. Instead:
-Choose based on your specific needs:
+   ```python
-```python
+   run_cfg = CrawlerRunConfig(css_selector=".main-content", word_count_threshold=20)
-# High compatibility
+   result = await crawler.arun(url="...", config=run_cfg)
-crawler = AsyncWebCrawler(browser_type="chromium")
+   ```
-# Memory efficient
+4. **Context Manager** usage is simplest unless you want a persistent crawler across many calls.
-crawler = AsyncWebCrawler(browser_type="webkit")
+
-```
+---
 ## 8. Summary
 **AsyncWebCrawler** is your entry point to asynchronous crawling:
 - **Constructor** accepts **`BrowserConfig`** (or defaults).  
 - **`arun(url, config=CrawlerRunConfig)`** is the main method for single-page crawls.  
 - **`arun_many(urls, config=CrawlerRunConfig)`** handles concurrency across multiple URLs.  
 - For advanced lifecycle control, use `start()` and `close()` explicitly.  
 **Migration**:  
 - If you used `AsyncWebCrawler(browser_type="chromium", css_selector="...")`, move browser settings to `BrowserConfig(...)` and content/crawl logic to `CrawlerRunConfig(...)`.
 This modular approach ensures your code is **clean**, **scalable**, and **easy to maintain**. For any advanced or rarely used parameters, see the [BrowserConfig docs](../api/parameters.md).
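As a rough illustration of the migration note, the sketch below sorts a dictionary of old-style keyword arguments into a "browser" group and a "per-crawl" group. The helper `split_legacy_kwargs` and the two key sets are hypothetical: the sets contain only parameter names that appear in this page's examples and are not a complete list of what `BrowserConfig` or `CrawlerRunConfig` accept.

```python
# Parameter names taken from the examples on this page only (assumption:
# the real config classes accept many more options than these).
BROWSER_KEYS = {"browser_type", "headless", "verbose"}
RUN_KEYS = {"css_selector", "word_count_threshold", "wait_for", "screenshot"}


def split_legacy_kwargs(kwargs):
    """Split legacy keyword arguments into (browser, run, unknown) dicts."""
    browser, run, unknown = {}, {}, {}
    for key, value in kwargs.items():
        if key in BROWSER_KEYS:
            browser[key] = value
        elif key in RUN_KEYS:
            run[key] = value
        else:
            # Leave anything unrecognized for manual review.
            unknown[key] = value
    return browser, run, unknown


legacy = {"browser_type": "chromium", "css_selector": ".main-content", "magic": True}
browser_kwargs, run_kwargs, leftover = split_legacy_kwargs(legacy)
print(browser_kwargs, run_kwargs, leftover)
```

The two resulting dictionaries could then be passed as `BrowserConfig(**browser_kwargs)` and `CrawlerRunConfig(**run_kwargs)`; anything in `leftover` should be checked against the parameter reference before migrating.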
--- a/docs/md_v2/api/crawl-result.md
+++ b/docs/md_v2/api/crawl-result.md
@@ -1,302 +1,355 @@
-# CrawlResult
+# `CrawlResult` Reference
-The `CrawlResult` class represents the result of a web crawling operation. It provides access to various forms of extracted content and metadata from the crawled webpage.
+The **`CrawlResult`** class encapsulates everything returned after a single crawl operation. It provides the **raw or processed content**, details on links and media, plus optional metadata (like screenshots, PDFs, or extracted JSON).
-## Class Definition
+**Location**: `crawl4ai/crawler/models.py` (for reference)
 ```python
 class CrawlResult(BaseModel):
-    """Result of a web crawling operation."""
+    url: str
-    
+    html: str
-    # Basic Information
+    success: bool
-    url: str                                # Crawled URL
+    cleaned_html: Optional[str] = None
-    success: bool                           # Whether crawl succeeded
+    media: Dict[str, List[Dict]] = {}
-    status_code: Optional[int] = None       # HTTP status code
+    links: Dict[str, List[Dict]] = {}
-    error_message: Optional[str] = None     # Error message if failed
+    downloaded_files: Optional[List[str]] = None
-    
+    screenshot: Optional[str] = None
-    # Content
+    pdf: Optional[bytes] = None
-    html: str                              # Raw HTML content
+    markdown: Optional[Union[str, MarkdownGenerationResult]] = None
-    cleaned_html: Optional[str] = None      # Cleaned HTML
+    markdown_v2: Optional[MarkdownGenerationResult] = None
-    fit_html: Optional[str] = None          # Most relevant HTML content
+    fit_markdown: Optional[str] = None
-    markdown: Optional[str] = None          # HTML converted to markdown
+    fit_html: Optional[str] = None
-    fit_markdown: Optional[str] = None      # Most relevant markdown content
+    extracted_content: Optional[str] = None
-    downloaded_files: Optional[List[str]] = None  # Downloaded files
+    metadata: Optional[dict] = None
-    
+    error_message: Optional[str] = None
-    # Extracted Data
+    session_id: Optional[str] = None
-    extracted_content: Optional[str] = None  # Content from extraction strategy
+    response_headers: Optional[dict] = None
-    media: Dict[str, List[Dict]] = {}       # Extracted media information
+    status_code: Optional[int] = None
-    links: Dict[str, List[Dict]] = {}       # Extracted links
+    ssl_certificate: Optional[SSLCertificate] = None
-    metadata: Optional[dict] = None         # Page metadata
+    dispatch_result: Optional[DispatchResult] = None
-    
+    ...
    # Additional Data
    screenshot: Optional[str] = None         # Base64 encoded screenshot
    session_id: Optional[str] = None         # Session identifier
    response_headers: Optional[dict] = None  # HTTP response headers
 ```
-## Properties and Their Data Structures
+Below is a **field-by-field** explanation and possible usage patterns.
-### Basic Information
+---
 ## 1. Basic Crawl Info
 ### 1.1 **`url`** *(str)*  
 **What**: The final crawled URL (after any redirects).  
 **Usage**:
 ```python
-# Access basic information
+print(result.url)  # e.g., "https://example.com/"
 result = await crawler.arun(url="https://example.com")
 print(result.url)          # "https://example.com"
 print(result.success)      # True/False
 print(result.status_code)  # 200, 404, etc.
 print(result.error_message)  # Error details if failed
 ```
-### Content Properties
+### 1.2 **`success`** *(bool)*  
-
+**What**: `True` if the crawl pipeline ended without major errors; `False` otherwise.  
-#### HTML Content
+**Usage**:
 ```python
-# Raw HTML
+if not result.success:
-html_content = result.html
+    print(f"Crawl failed: {result.error_message}")
 # Cleaned HTML (removed ads, popups, etc.)
 clean_content = result.cleaned_html
 # Most relevant HTML content
 main_content = result.fit_html
 ```
-#### Markdown Content
+### 1.3 **`status_code`** *(Optional[int])*  
 **What**: The page’s HTTP status code (e.g., 200, 404).  
 **Usage**:
 ```python
-# Full markdown version
+if result.status_code == 404:
-markdown_content = result.markdown
+    print("Page not found!")
 # Most relevant markdown content
 main_content = result.fit_markdown
 ```
-### Media Content
+### 1.4 **`error_message`** *(Optional[str])*  
-
+**What**: If `success=False`, a textual description of the failure.  
-The media dictionary contains organized media elements:
+**Usage**:
 ```python
-# Structure
+if not result.success:
-media = {
+    print("Error:", result.error_message)
    "images": [
        {
            "src": str,           # Image URL
            "alt": str,           # Alt text
            "desc": str,          # Contextual description
            "score": float,       # Relevance score (0-10)
            "type": str,          # "image"
            "width": int,         # Image width (if available)
            "height": int,        # Image height (if available)
            "context": str,       # Surrounding text
            "lazy": bool          # Whether image was lazy-loaded
        }
    ],
    "videos": [
        {
            "src": str,           # Video URL
            "type": str,          # "video"
            "title": str,         # Video title
            "poster": str,        # Thumbnail URL
            "duration": str,      # Video duration
            "description": str    # Video description
        }
    ],
    "audios": [
        {
            "src": str,           # Audio URL
            "type": str,          # "audio"
            "title": str,         # Audio title
            "duration": str,      # Audio duration
            "description": str    # Audio description
        }
    ]
 }
 # Example usage
 for image in result.media["images"]:
    if image["score"] > 5:  # High-relevance images
        print(f"High-quality image: {image['src']}")
        print(f"Context: {image['context']}")
 ```
-### Link Analysis
+### 1.5 **`session_id`** *(Optional[str])*  
-
+**What**: The ID used for reusing a browser context across multiple calls.  
-The links dictionary organizes discovered links:
+**Usage**:
 ```python
-# Structure
+# If you used session_id="login_session" in CrawlerRunConfig, see it here:
-links = {
+print("Session:", result.session_id)
-    "internal": [
+```
        {
            "href": str,          # URL
            "text": str,          # Link text
            "title": str,         # Title attribute
            "type": str,          # Link type (nav, content, etc.)
            "context": str,       # Surrounding text
            "score": float        # Relevance score
        }
    ],
    "external": [
        {
            "href": str,          # External URL
            "text": str,          # Link text
            "title": str,         # Title attribute
            "domain": str,        # Domain name
            "type": str,          # Link type
            "context": str        # Surrounding text
        }
    ]
 }
-# Example usage
+### 1.6 **`response_headers`** *(Optional[dict])*  
 **What**: Final HTTP response headers.  
 **Usage**:
 ```python
 if result.response_headers:
    print("Server:", result.response_headers.get("Server", "Unknown"))
 ```
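HTTP header names are case-insensitive, so a lookup such as `.get("Server")` can miss a header that was stored as `server`. The helper below is a minimal sketch of a case-insensitive lookup over a plain dictionary; `get_header` is a hypothetical name, and whether `response_headers` keys arrive lowercased in your setup is an assumption worth checking.

```python
from typing import Mapping, Optional


def get_header(headers: Optional[Mapping[str, str]], name: str, default: str = "Unknown") -> str:
    """Look up a header by name, ignoring case; return `default` if absent."""
    if not headers:
        return default
    wanted = name.lower()
    for key, value in headers.items():
        if key.lower() == wanted:
            return value
    return default


sample_headers = {"server": "nginx", "Content-Type": "text/html"}
print("Server:", get_header(sample_headers, "Server"))
print("Content type:", get_header(sample_headers, "content-type"))
```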
 ### 1.7 **`ssl_certificate`** *(Optional[SSLCertificate])*  
 **What**: If `fetch_ssl_certificate=True` in your `CrawlerRunConfig`, **`result.ssl_certificate`** contains an [**`SSLCertificate`**](../advanced/ssl-certificate.md) object describing the site’s certificate. You can export the certificate in multiple formats (PEM/DER/JSON) or access properties such as `issuer`, `subject`, `valid_from`, and `valid_until`.  
 **Usage**:
 ```python
 if result.ssl_certificate:
    print("Issuer:", result.ssl_certificate.issuer)
 ```
 ---
 ## 2. Raw / Cleaned Content
 ### 2.1 **`html`** *(str)*  
 **What**: The **original** unmodified HTML from the final page load.  
 **Usage**:
 ```python
 # Possibly large
 print(len(result.html))
 ```
 ### 2.2 **`cleaned_html`** *(Optional[str])*  
 **What**: A sanitized HTML version—scripts, styles, or excluded tags are removed based on your `CrawlerRunConfig`.  
 **Usage**:
 ```python
 print(result.cleaned_html[:500])  # Show a snippet
 ```
 ### 2.3 **`fit_html`** *(Optional[str])*  
 **What**: If a **content filter** or heuristic (e.g., Pruning/BM25) modifies the HTML, the “fit” or post-filter version.  
 **When**: This is **only** present if your `markdown_generator` or `content_filter` produces it.  
 **Usage**:
 ```python
 if result.fit_html:
    print("High-value HTML content:", result.fit_html[:300])
 ```
 ---
 ## 3. Markdown Fields
 ### 3.1 The Markdown Generation Approach
 Crawl4AI can convert HTML→Markdown, optionally including:
 - **Raw** markdown  
 - **Links as citations** (with a references section)  
 - **Fit** markdown if a **content filter** is used (like Pruning or BM25)
 ### 3.2 **`markdown_v2`** *(Optional[MarkdownGenerationResult])*  
 **What**: The **structured** object holding multiple markdown variants. Soon to be consolidated into `markdown`.  
 **`MarkdownGenerationResult`** includes:
 - **`raw_markdown`** *(str)*: The full HTML→Markdown conversion.  
 - **`markdown_with_citations`** *(str)*: Same markdown, but with link references as academic-style citations.  
 - **`references_markdown`** *(str)*: The reference list or footnotes at the end.  
 - **`fit_markdown`** *(Optional[str])*: If content filtering (Pruning/BM25) was applied, the filtered “fit” text.  
 - **`fit_html`** *(Optional[str])*: The HTML that led to `fit_markdown`.
 **Usage**:
 ```python
 if result.markdown_v2:
    md_res = result.markdown_v2
    print("Raw MD:", md_res.raw_markdown[:300])
    print("Citations MD:", md_res.markdown_with_citations[:300])
    print("References:", md_res.references_markdown)
    if md_res.fit_markdown:
        print("Pruned text:", md_res.fit_markdown[:300])
 ```
 ### 3.3 **`markdown`** *(Optional[Union[str, MarkdownGenerationResult]])*  
 **What**: In future versions, `markdown` will fully replace `markdown_v2`. Right now, it might be a `str` or a `MarkdownGenerationResult`.  
 **Usage**:
 ```python
 # Soon, you might see:
 if isinstance(result.markdown, MarkdownGenerationResult):
    print(result.markdown.raw_markdown[:200])
 else:
    print(result.markdown)
 ```
 ### 3.4 **`fit_markdown`** *(Optional[str])*  
 **What**: A direct reference to the final filtered markdown (legacy approach).  
 **When**: Set only when a filter or content strategy explicitly writes to it. In most cases `markdown_v2.fit_markdown` is the better source.  
 **Usage**:
 ```python
 print(result.fit_markdown)  # Legacy field, prefer result.markdown_v2.fit_markdown
 ```
 **Important**: “Fit” content (in `fit_markdown`/`fit_html`) only exists if you used a **filter** (like **PruningContentFilter** or **BM25ContentFilter**) within a `MarkdownGenerationStrategy`.
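Because several markdown fields may or may not be populated, it can help to centralize the fallback order in one function. The sketch below is one possible ordering (filtered text first, then raw markdown, then the legacy fields); `best_markdown` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for a real `CrawlResult`, using only field names described in this section.

```python
from types import SimpleNamespace


def best_markdown(result) -> str:
    """Return the most specific markdown text available on a result object."""
    md_v2 = getattr(result, "markdown_v2", None)
    if md_v2 is not None:
        # Prefer filtered ("fit") markdown, then the full conversion.
        if getattr(md_v2, "fit_markdown", None):
            return md_v2.fit_markdown
        if getattr(md_v2, "raw_markdown", None):
            return md_v2.raw_markdown
    # Legacy top-level field.
    if getattr(result, "fit_markdown", None):
        return result.fit_markdown
    md = getattr(result, "markdown", None)
    if isinstance(md, str):
        return md
    # `markdown` may itself be a structured object, or None.
    return getattr(md, "raw_markdown", "") or ""


# Stand-in objects for illustration only.
filtered = SimpleNamespace(markdown_v2=SimpleNamespace(raw_markdown="# Full", fit_markdown="# Fit"))
legacy_only = SimpleNamespace(markdown_v2=None, fit_markdown=None, markdown="# Legacy")
print(best_markdown(filtered))
print(best_markdown(legacy_only))
```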
 ---
 ## 4. Media & Links
 ### 4.1 **`media`** *(Dict[str, List[Dict]])*  
 **What**: Contains info about discovered images, videos, or audio. Typically keys: `"images"`, `"videos"`, `"audios"`.  
 **Common Fields** in each item:
 - `src` *(str)*: Media URL  
 - `alt` or `title` *(str)*: Descriptive text  
 - `score` *(float)*: Relevance score if the crawler’s heuristic found it “important”  
 - `desc` or `description` *(Optional[str])*: Additional context extracted from surrounding text  
 **Usage**:
 ```python
 images = result.media.get("images", [])
 for img in images:
    if img.get("score", 0) > 5:
        print("High-value image:", img["src"])
 ```
 ### 4.2 **`links`** *(Dict[str, List[Dict]])*  
 **What**: Holds internal and external link data. Usually two keys: `"internal"` and `"external"`.  
 **Common Fields**:
 - `href` *(str)*: The link target  
 - `text` *(str)*: Link text  
 - `title` *(str)*: Title attribute  
 - `context` *(str)*: Surrounding text snippet  
 - `domain` *(str)*: If external, the domain
 **Usage**:
 ```python
 for link in result.links["internal"]:
-    print(f"Internal link: {link['href']}")
+    print(f"Internal link to {link['href']} with text {link['text']}")
    print(f"Context: {link['context']}")
 ```
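A common next step is to group discovered links by domain, for example to see which external sites a page references most. The sketch below works on plain dictionaries shaped like the `links` entries described above; `group_by_domain` is a hypothetical helper, and falling back to `urllib.parse` when `domain` is missing is an assumption about how you might want to handle incomplete entries.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_by_domain(links):
    """Map each domain to the list of hrefs found for it."""
    grouped = defaultdict(list)
    for link in links:
        href = link.get("href", "")
        # Use the provided domain if present, otherwise derive it from the URL.
        domain = link.get("domain") or urlparse(href).netloc
        if domain:
            grouped[domain.lower()].append(href)
    return dict(grouped)


# Stand-in data in the shape of result.links["external"].
external = [
    {"href": "https://docs.python.org/3/", "text": "Python docs"},
    {"href": "https://example.org/a", "text": "A", "domain": "example.org"},
    {"href": "https://example.org/b", "text": "B"},
]
for domain, hrefs in group_by_domain(external).items():
    print(domain, len(hrefs))
```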
-### Metadata
+---
-The metadata dictionary contains page information:
+## 5. Additional Fields
 ### 5.1 **`extracted_content`** *(Optional[str])*  
 **What**: If you used **`extraction_strategy`** (CSS, LLM, etc.), the structured output (JSON).  
 **Usage**:
 ```python
 # Structure
 metadata = {
    "title": str,                # Page title
    "description": str,          # Meta description
    "keywords": List[str],       # Meta keywords
    "author": str,              # Author information
    "published_date": str,      # Publication date
    "modified_date": str,       # Last modified date
    "language": str,            # Page language
    "canonical_url": str,       # Canonical URL
    "og_data": Dict,           # Open Graph data
    "twitter_data": Dict       # Twitter card data
 }
 # Example usage
 if result.metadata:
    print(f"Title: {result.metadata['title']}")
    print(f"Author: {result.metadata.get('author', 'Unknown')}")
 ```
 ### Extracted Content
 Content from extraction strategies:
 ```python
 # For LLM or CSS extraction strategies
 if result.extracted_content:
-    structured_data = json.loads(result.extracted_content)
+    data = json.loads(result.extracted_content)
-    print(structured_data)
+    print(data)
 ```
-### Screenshot
+### 5.2 **`downloaded_files`** *(Optional[List[str]])*  
-
+**What**: If your `BrowserConfig` sets `accept_downloads=True` together with a `downloads_path`, this lists the local file paths of downloaded items.  
-Base64 encoded screenshot:
+**Usage**:
 ```python
-# Save screenshot if available
+if result.downloaded_files:
-if result.screenshot:
+    for file_path in result.downloaded_files:
-    import base64
+        print("Downloaded:", file_path)
 ```
-    # Decode and save
+### 5.3 **`screenshot`** *(Optional[str])*  
-    with open("screenshot.png", "wb") as f:
+**What**: Base64-encoded screenshot if `screenshot=True` in `CrawlerRunConfig`.  
 **Usage**:
 ```python
 import base64
 if result.screenshot:
    with open("page.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))
 ```
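Before writing the decoded bytes to disk, you may want to confirm that the string really is Base64 and that the payload looks like an image. The sketch below checks for the standard PNG file signature; `decode_screenshot` is a hypothetical helper, and the assumption that screenshots are PNG-encoded is inferred from the `page.png` filename above rather than confirmed by this page.

```python
import base64
import binascii

# The fixed 8-byte signature that starts every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_screenshot(b64_data):
    """Return decoded PNG bytes, or None if the input is not Base64 or not a PNG."""
    try:
        raw = base64.b64decode(b64_data, validate=True)
    except (binascii.Error, ValueError):
        return None
    return raw if raw.startswith(PNG_SIGNATURE) else None


# Stand-in payload: a PNG signature followed by placeholder bytes.
fake_screenshot = base64.b64encode(PNG_SIGNATURE + b"placeholder").decode("ascii")
data = decode_screenshot(fake_screenshot)
print("Looks like a PNG:", data is not None)
```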
-## Usage Examples
+### 5.4 **`pdf`** *(Optional[bytes])*  
-
+**What**: Raw PDF bytes if `pdf=True` in `CrawlerRunConfig`.  
-### Basic Content Access
+**Usage**:
 ```python
-async with AsyncWebCrawler() as crawler:
+if result.pdf:
-    result = await crawler.arun(url="https://example.com")
+    with open("page.pdf", "wb") as f:
-    
+        f.write(result.pdf)
    if result.success:
        # Get clean content
        print(result.fit_markdown)
        # Process images
        for image in result.media["images"]:
            if image["score"] > 7:
                print(f"High-quality image: {image['src']}")
 ```
-### Complete Data Processing
+### 5.5 **`metadata`** *(Optional[dict])*  
 **What**: Page-level metadata if discovered (title, description, OG data, etc.).  
 **Usage**:
 ```python
-async def process_webpage(url: str) -> Dict:
+if result.metadata:
-    async with AsyncWebCrawler() as crawler:
+    print("Title:", result.metadata.get("title"))
-        result = await crawler.arun(url=url)
+    print("Author:", result.metadata.get("author"))
 ```
 ---
 ## 6. `dispatch_result` (optional)
 A `DispatchResult` object providing additional concurrency and resource usage information when crawling URLs in parallel (e.g., via `arun_many()` with custom dispatchers). It contains:
 - **`task_id`**: A unique identifier for the parallel task.
 - **`memory_usage`** (float): The memory (in MB) used at the time of completion.
 - **`peak_memory`** (float): The peak memory usage (in MB) recorded during the task’s execution.
 - **`start_time`** / **`end_time`** (datetime): Time range for this crawling task.
 - **`error_message`** (str): Any dispatcher- or concurrency-related error encountered.
 ```python
 # Example usage:
 for result in results:
    if result.success and result.dispatch_result:
        dr = result.dispatch_result
        print(f"URL: {result.url}, Task ID: {dr.task_id}")
        print(f"Memory: {dr.memory_usage:.1f} MB (Peak: {dr.peak_memory:.1f} MB)")
        print(f"Duration: {dr.end_time - dr.start_time}")
 ```
 > **Note**: This field is typically populated when using `arun_many(...)` alongside a **dispatcher** (e.g., `MemoryAdaptiveDispatcher` or `SemaphoreDispatcher`). If no concurrency or dispatcher is used, `dispatch_result` may remain `None`. 
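When many URLs are crawled in one batch, the per-task numbers are often more useful in aggregate. The sketch below summarizes a list of dispatch results; `FakeDispatchResult` is a stand-in dataclass that carries only the fields listed above, and `summarize` is a hypothetical helper, not part of Crawl4AI.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FakeDispatchResult:
    """Stand-in with the fields described in this section."""
    task_id: str
    memory_usage: float
    peak_memory: float
    start_time: datetime
    end_time: datetime


def summarize(dispatch_results):
    """Aggregate task count, highest peak memory, and summed task durations."""
    items = [dr for dr in dispatch_results if dr is not None]
    if not items:
        return {"tasks": 0, "peak_memory_mb": 0.0, "total_duration": timedelta(0)}
    return {
        "tasks": len(items),
        "peak_memory_mb": max(dr.peak_memory for dr in items),
        # Sum of per-task durations, not wall-clock time (tasks may overlap).
        "total_duration": sum((dr.end_time - dr.start_time for dr in items), timedelta(0)),
    }


t0 = datetime(2025, 1, 1, 12, 0, 0)
sample = [
    FakeDispatchResult("a", 100.0, 150.0, t0, t0 + timedelta(seconds=2)),
    None,  # e.g. a result crawled without a dispatcher
    FakeDispatchResult("b", 120.0, 180.5, t0, t0 + timedelta(seconds=3)),
]
print(summarize(sample))
```

In real use you would pass `[r.dispatch_result for r in results]`, which tolerates entries where `dispatch_result` is `None`.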
 ---
 ## 7. Example: Accessing Everything
 ```python
 async def handle_result(result: CrawlResult):
    if not result.success:
-            raise Exception(f"Crawl failed: {result.error_message}")
+        print("Crawl error:", result.error_message)
        return {
            "content": result.fit_markdown,
            "images": [
                img for img in result.media["images"]
                if img["score"] > 5
            ],
            "internal_links": [
                link["href"] for link in result.links["internal"]
            ],
            "metadata": result.metadata,
            "status": result.status_code
        }
 ```
 ### Error Handling
 ```python
 async def safe_crawl(url: str) -> Dict:
    async with AsyncWebCrawler() as crawler:
        try:
            result = await crawler.arun(url=url)
            if not result.success:
                return {
                    "success": False,
                    "error": result.error_message,
                    "status": result.status_code
                }
            return {
                "success": True,
                "content": result.fit_markdown,
                "status": result.status_code
            }
        except Exception as e:
            return {
                "success": False,
                "error": str(e),
                "status": None
            }
 ```
 ## Best Practices
 1. **Always Check Success**
 ```python
 if not result.success:
    print(f"Error: {result.error_message}")
        return
    # Basic info
    print("Crawled URL:", result.url)
    print("Status code:", result.status_code)
    # HTML
    print("Original HTML size:", len(result.html))
    print("Cleaned HTML size:", len(result.cleaned_html or ""))
    # Markdown output
    if result.markdown_v2:
        print("Raw Markdown:", result.markdown_v2.raw_markdown[:300])
        print("Citations Markdown:", result.markdown_v2.markdown_with_citations[:300])
        if result.markdown_v2.fit_markdown:
            print("Fit Markdown:", result.markdown_v2.fit_markdown[:200])
    else:
        print("Raw Markdown (legacy):", result.markdown[:200] if result.markdown else "N/A")
    # Media & Links
    if "images" in result.media:
        print("Image count:", len(result.media["images"]))
    if "internal" in result.links:
        print("Internal link count:", len(result.links["internal"]))
    # Extraction strategy result
    if result.extracted_content:
        print("Structured data:", result.extracted_content)
    # Screenshot/PDF
    if result.screenshot:
        print("Screenshot length:", len(result.screenshot))
    if result.pdf:
        print("PDF bytes length:", len(result.pdf))
 ```
-2. **Use fit_markdown for Articles**
+---
 ```python
 # Better for article content
 content = result.fit_markdown if result.fit_markdown else result.markdown
 ```
-3. **Filter Media by Score**
+## 8. Key Points & Future
 ```python
 relevant_images = [
    img for img in result.media["images"]
    if img["score"] > 5
 ]
 ```
-4. **Handle Missing Data**
+1. **`markdown_v2` vs `markdown`**  
-```python
+   - Right now, `markdown_v2` is the more robust container (`MarkdownGenerationResult`), providing **raw_markdown**, **markdown_with_citations**, references, plus possible **fit_markdown**.  
-metadata = result.metadata or {}
+   - In future versions, everything will unify under **`markdown`**. If you rely on advanced features (citations, fit content), check `markdown_v2`.
-title = metadata.get('title', 'Unknown Title')
+
-```
+2. **Fit Content**  
   - **`fit_markdown`** and **`fit_html`** appear only if you used a content filter (like **PruningContentFilter** or **BM25ContentFilter**) inside your **MarkdownGenerationStrategy** or set them directly.  
   - If no filter is used, they remain `None`.
 3. **References & Citations**  
   - If you enable link citations in your `DefaultMarkdownGenerator` (`options={"citations": True}`), you’ll see `markdown_with_citations` plus a **`references_markdown`** block. This is useful when feeding large language models or when you need academic-style references.
 4. **Links & Media**  
   - `links["internal"]` and `links["external"]` group discovered anchors by domain.  
   - `media["images"]` / `["videos"]` / `["audios"]` store extracted media elements with optional scoring or context.
 5. **Error Cases**  
   - If `success=False`, check `error_message` (e.g., timeouts, invalid URLs).  
   - `status_code` might be `None` if we failed before an HTTP response.
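The fallback order implied by points 1 and 2 can be sketched as a small helper. This is a minimal sketch, not part of Crawl4AI: `pick_markdown` is a hypothetical name, the preference order is an assumption, and `SimpleNamespace` objects stand in for a real `CrawlResult` so the snippet runs without the library. Only attribute names mentioned above (`markdown_v2`, `fit_markdown`, `raw_markdown`, `markdown`) are used.

```python
from types import SimpleNamespace


def pick_markdown(result):
    """Return the most refined markdown available on a result-like object.

    Preference order (an assumption for this sketch):
    fit_markdown -> raw_markdown -> legacy markdown -> empty string.
    """
    md_v2 = getattr(result, "markdown_v2", None)
    if md_v2 is not None:
        # fit_markdown stays None unless a content filter was configured.
        if getattr(md_v2, "fit_markdown", None):
            return md_v2.fit_markdown
        if getattr(md_v2, "raw_markdown", None):
            return md_v2.raw_markdown
    # Fall back to the legacy string field.
    return getattr(result, "markdown", None) or ""


# Stand-in objects that mimic the attributes described above.
filtered = SimpleNamespace(
    markdown_v2=SimpleNamespace(raw_markdown="# Full page", fit_markdown="# Main part"),
    markdown="# Full page",
)
unfiltered = SimpleNamespace(
    markdown_v2=SimpleNamespace(raw_markdown="# Full page", fit_markdown=None),
    markdown="# Full page",
)
legacy = SimpleNamespace(markdown_v2=None, markdown="# Legacy")

for candidate in (filtered, unfiltered, legacy):
    print(pick_markdown(candidate))
```

Returning an empty string instead of `None` keeps downstream string handling (slicing, `len()`) safe when nothing was generated.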
 Use **`CrawlResult`** to collect all final outputs and feed them into your data pipelines, AI models, or archives. With a properly configured **BrowserConfig** and **CrawlerRunConfig**, the crawler returns its structured results in **`CrawlResult`**.
--- a/docs/md_v2/api/parameters.md
+++ b/docs/md_v2/api/parameters.md
@@ -1,36 +1,296 @@
-# Parameter Reference Table
+# 1. **BrowserConfig** – Controlling the Browser
-| File Name | Parameter Name | Code Usage | Strategy/Class | Description |
+`BrowserConfig` focuses on **how** the browser is launched and behaves. This includes headless mode, proxies, user agents, and other environment tweaks.
-|-----------|---------------|------------|----------------|-------------|
+
-| async_crawler_strategy.py | user_agent | `kwargs.get("user_agent")` | AsyncPlaywrightCrawlerStrategy | User agent string for browser identification |
+```python
-| async_crawler_strategy.py | proxy | `kwargs.get("proxy")` | AsyncPlaywrightCrawlerStrategy | Proxy server configuration for network requests |
+from crawl4ai import AsyncWebCrawler, BrowserConfig
-| async_crawler_strategy.py | proxy_config | `kwargs.get("proxy_config")` | AsyncPlaywrightCrawlerStrategy | Detailed proxy configuration including auth |
+
-| async_crawler_strategy.py | headless | `kwargs.get("headless", True)` | AsyncPlaywrightCrawlerStrategy | Whether to run browser in headless mode |
+browser_cfg = BrowserConfig(
-| async_crawler_strategy.py | browser_type | `kwargs.get("browser_type", "chromium")` | AsyncPlaywrightCrawlerStrategy | Type of browser to use (chromium/firefox/webkit) |
+    browser_type="chromium",
-| async_crawler_strategy.py | headers | `kwargs.get("headers", {})` | AsyncPlaywrightCrawlerStrategy | Custom HTTP headers for requests |
+    headless=True,
-| async_crawler_strategy.py | verbose | `kwargs.get("verbose", False)` | AsyncPlaywrightCrawlerStrategy | Enable detailed logging output |
+    viewport_width=1280,
-| async_crawler_strategy.py | sleep_on_close | `kwargs.get("sleep_on_close", False)` | AsyncPlaywrightCrawlerStrategy | Add delay before closing browser |
+    viewport_height=720,
-| async_crawler_strategy.py | use_managed_browser | `kwargs.get("use_managed_browser", False)` | AsyncPlaywrightCrawlerStrategy | Use managed browser instance |
+    proxy="http://user:pass@proxy:8080",
-| async_crawler_strategy.py | user_data_dir | `kwargs.get("user_data_dir", None)` | AsyncPlaywrightCrawlerStrategy | Custom directory for browser profile data |
+    user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36",
-| async_crawler_strategy.py | session_id | `kwargs.get("session_id")` | AsyncPlaywrightCrawlerStrategy | Unique identifier for browser session |
+)
-| async_crawler_strategy.py | override_navigator | `kwargs.get("override_navigator", False)` | AsyncPlaywrightCrawlerStrategy | Override browser navigator properties |
+```
-| async_crawler_strategy.py | simulate_user | `kwargs.get("simulate_user", False)` | AsyncPlaywrightCrawlerStrategy | Simulate human-like behavior |
+
-| async_crawler_strategy.py | magic | `kwargs.get("magic", False)` | AsyncPlaywrightCrawlerStrategy | Enable advanced anti-detection features |
+## 1.1 Parameter Highlights
-| async_crawler_strategy.py | log_console | `kwargs.get("log_console", False)` | AsyncPlaywrightCrawlerStrategy | Log browser console messages |
+
-| async_crawler_strategy.py | js_only | `kwargs.get("js_only", False)` | AsyncPlaywrightCrawlerStrategy | Only execute JavaScript without page load |
+| **Parameter**         | **Type / Default**                     | **What It Does**                                                                                                                     |
-| async_crawler_strategy.py | page_timeout | `kwargs.get("page_timeout", 60000)` | AsyncPlaywrightCrawlerStrategy | Timeout for page load in milliseconds |
+|-----------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------|
-| async_crawler_strategy.py | ignore_body_visibility | `kwargs.get("ignore_body_visibility", True)` | AsyncPlaywrightCrawlerStrategy | Process page even if body is hidden |
+| **`browser_type`**    | `"chromium"`, `"firefox"`, `"webkit"`<br/>*(default: `"chromium"`)* | Which browser engine to use. `"chromium"` is typical for many sites, `"firefox"` or `"webkit"` for specialized tests.                 |
-| async_crawler_strategy.py | js_code | `kwargs.get("js_code", kwargs.get("js", self.js_code))` | AsyncPlaywrightCrawlerStrategy | Custom JavaScript code to execute |
+| **`headless`**        | `bool` (default: `True`)               | Headless means no visible UI. `False` is handy for debugging.                                                                         |
-| async_crawler_strategy.py | wait_for | `kwargs.get("wait_for")` | AsyncPlaywrightCrawlerStrategy | Wait for specific element/condition |
+| **`viewport_width`**  | `int` (default: `1080`)                | Initial page width (in px). Useful for testing responsive layouts.                                                                    |
-| async_crawler_strategy.py | process_iframes | `kwargs.get("process_iframes", False)` | AsyncPlaywrightCrawlerStrategy | Extract content from iframes |
+| **`viewport_height`** | `int` (default: `600`)                 | Initial page height (in px).                                                                                                          |
-| async_crawler_strategy.py | delay_before_return_html | `kwargs.get("delay_before_return_html")` | AsyncPlaywrightCrawlerStrategy | Additional delay before returning HTML |
+| **`proxy`**           | `str` (default: `None`)                | Single-proxy URL if you want all traffic to go through it, e.g. `"http://user:pass@proxy:8080"`.                                      |
-| async_crawler_strategy.py | remove_overlay_elements | `kwargs.get("remove_overlay_elements", False)` | AsyncPlaywrightCrawlerStrategy | Remove pop-ups and overlay elements |
+| **`proxy_config`**    | `dict` (default: `None`)               | For advanced or multi-proxy needs, specify details like `{"server": "...", "username": "...", ...}`.                                  |
-| async_crawler_strategy.py | screenshot | `kwargs.get("screenshot")` | AsyncPlaywrightCrawlerStrategy | Take page screenshot |
+| **`use_persistent_context`** | `bool` (default: `False`)       | If `True`, uses a **persistent** browser context (keep cookies, sessions across runs). Also sets `use_managed_browser=True`.          |
-| async_crawler_strategy.py | screenshot_wait_for | `kwargs.get("screenshot_wait_for")` | AsyncPlaywrightCrawlerStrategy | Wait before taking screenshot |
+| **`user_data_dir`**   | `str or None` (default: `None`)        | Directory to store user data (profiles, cookies). Must be set if you want permanent sessions.                                         |
-| async_crawler_strategy.py | semaphore_count | `kwargs.get("semaphore_count", 5)` | AsyncPlaywrightCrawlerStrategy | Concurrent request limit |
+| **`ignore_https_errors`** | `bool` (default: `True`)           | If `True`, continues despite invalid certificates (common in dev/staging).                                                            |
-| async_webcrawler.py | verbose | `kwargs.get("verbose", False)` | AsyncWebCrawler | Enable detailed logging |
+| **`java_script_enabled`** | `bool` (default: `True`)           | Disable if you want no JS overhead, or if only static content is needed.                                                              |
-| async_webcrawler.py | warmup | `kwargs.get("warmup", True)` | AsyncWebCrawler | Initialize crawler with warmup request |
+| **`cookies`**         | `list` (default: `[]`)                 | Pre-set cookies, each a dict like `{"name": "session", "value": "...", "url": "..."}`.                                                |
-| async_webcrawler.py | session_id | `kwargs.get("session_id", None)` | AsyncWebCrawler | Session identifier for browser reuse |
+| **`headers`**         | `dict` (default: `{}`)                 | Extra HTTP headers for every request, e.g. `{"Accept-Language": "en-US"}`.                                                            |
-| async_webcrawler.py | only_text | `kwargs.get("only_text", False)` | AsyncWebCrawler | Extract only text content |
+| **`user_agent`**      | `str` (default: Chrome-based UA)       | Your custom or random user agent. `user_agent_mode="random"` can shuffle it.                                                          |
-| async_webcrawler.py | bypass_cache | `kwargs.get("bypass_cache", False)` | AsyncWebCrawler | Skip cache and force fresh crawl |
+| **`light_mode`**      | `bool` (default: `False`)              | Disables some background features for performance gains.                                                                              |
-| async_webcrawler.py | cache_mode | `kwargs.get("cache_mode", CacheMode.ENABLE)` | AsyncWebCrawler | Cache handling mode for request |
+| **`text_mode`**       | `bool` (default: `False`)              | If `True`, tries to disable images/other heavy content for speed.                                                                     |
 | **`use_managed_browser`** | `bool` (default: `False`)          | For advanced “managed” interactions (debugging, CDP usage). Typically set automatically if persistent context is on.                  |
 | **`extra_args`**      | `list` (default: `[]`)                 | Additional flags for the underlying browser process, e.g. `["--disable-extensions"]`.                                                |
 **Tips**:
 - Set `headless=False` to visually **debug** how pages load or how interactions proceed.  
 - If you need **authentication** storage or repeated sessions, consider `use_persistent_context=True` and specify `user_data_dir`.  
 - For large pages, you might need a bigger `viewport_width` and `viewport_height` to handle dynamic content.
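The table lists both a single-URL `proxy` and a dict-style `proxy_config`. One way to go from the first form to the second is sketched below using only the standard library. `proxy_url_to_config` is a hypothetical helper, not a Crawl4AI function, and the `"password"` key is an assumption (the table only shows `"server"` and `"username"` explicitly).

```python
from urllib.parse import urlsplit


def proxy_url_to_config(proxy_url: str) -> dict:
    """Split 'scheme://user:pass@host:port' into a proxy_config-style dict."""
    parts = urlsplit(proxy_url)
    if not parts.hostname:
        raise ValueError(f"Not a proxy URL: {proxy_url!r}")

    # Rebuild the server address without the credentials.
    server = f"{parts.scheme}://{parts.hostname}"
    if parts.port:
        server += f":{parts.port}"

    config = {"server": server}
    if parts.username:
        config["username"] = parts.username
    if parts.password:
        # Assumed key name; check the library's proxy_config documentation.
        config["password"] = parts.password
    return config


print(proxy_url_to_config("http://user:pass@proxy:8080"))
```

Keeping credentials out of the `server` string makes it easier to log the proxy address without leaking secrets.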
 ---
 # 2. **CrawlerRunConfig** – Controlling Each Crawl
 While `BrowserConfig` sets up the **environment**, `CrawlerRunConfig` details **how** each **crawl operation** should behave: caching, content filtering, link or domain blocking, timeouts, JavaScript code, etc.
 ```python
 from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
 run_cfg = CrawlerRunConfig(
    wait_for="css:.main-content",
    word_count_threshold=15,
    excluded_tags=["nav", "footer"],
    exclude_external_links=True,
    stream=True,  # Enable streaming for arun_many()
 )
 ```
 ## 2.1 Parameter Highlights
 We group them by category. 
 ### A) **Content Processing**
 | **Parameter**                | **Type / Default**                   | **What It Does**                                                                                |
 |------------------------------|--------------------------------------|-------------------------------------------------------------------------------------------------|
 | **`word_count_threshold`**   | `int` (default: ~200)                | Skips text blocks below X words. Helps ignore trivial sections.                                 |
 | **`extraction_strategy`**    | `ExtractionStrategy` (default: None) | If set, extracts structured data (CSS-based, LLM-based, etc.).                                  |
 | **`markdown_generator`**     | `MarkdownGenerationStrategy` (None)  | If you want specialized markdown output (citations, filtering, chunking, etc.).                 |
 | **`content_filter`**         | `RelevantContentFilter` (None)       | Filters out irrelevant text blocks. E.g., `PruningContentFilter` or `BM25ContentFilter`.        |
 | **`css_selector`**           | `str` (None)                         | Retains only the part of the page matching this selector.                                       |
 | **`excluded_tags`**          | `list` (None)                        | Removes entire tags (e.g. `["script", "style"]`).                                               |
 | **`excluded_selector`**      | `str` (None)                         | Like `css_selector` but to exclude. E.g. `"#ads, .tracker"`.                                    |
 | **`only_text`**              | `bool` (False)                       | If `True`, tries to extract text-only content.                                                  |
 | **`prettiify`**              | `bool` (False)                       | If `True`, beautifies final HTML (slower, purely cosmetic).                                      |
 | **`keep_data_attributes`**   | `bool` (False)                       | If `True`, preserve `data-*` attributes in cleaned HTML.                                         |
 | **`remove_forms`**           | `bool` (False)                       | If `True`, remove all `<form>` elements.                                                        |
 ---
 ### B) **Caching & Session**
 | **Parameter**           | **Type / Default**     | **What It Does**                                                                                                              |
 |-------------------------|------------------------|------------------------------------------------------------------------------------------------------------------------------|
 | **`cache_mode`**        | `CacheMode or None`    | Controls how caching is handled (`ENABLED`, `BYPASS`, `DISABLED`, etc.). If `None`, typically defaults to `ENABLED`.          |
 | **`session_id`**        | `str or None`          | Assign a unique ID to reuse a single browser session across multiple `arun()` calls.                                          |
 | **`bypass_cache`**      | `bool` (False)         | If `True`, acts like `CacheMode.BYPASS`.                                                                                     |
 | **`disable_cache`**     | `bool` (False)         | If `True`, acts like `CacheMode.DISABLED`.                                                                                   |
 | **`no_cache_read`**     | `bool` (False)         | If `True`, acts like `CacheMode.WRITE_ONLY` (writes cache but never reads).                                                  |
 | **`no_cache_write`**    | `bool` (False)         | If `True`, acts like `CacheMode.READ_ONLY` (reads cache but never writes).                                                   |
 Use these to control whether the crawler reads from or writes to the local content cache. Handy for large batch crawls or repeated site visits.
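The mapping from the legacy boolean flags to cache modes in the table above can be written out as a short resolver. This is a sketch under stated assumptions: the `CacheMode` enum here is a local stand-in that only mirrors the mode names from the table, `resolve_cache_mode` is a hypothetical function, and the precedence between flags (explicit mode first, then `disable_cache`, `bypass_cache`, `no_cache_read`, `no_cache_write`) is a choice made for the example rather than documented behavior.

```python
from enum import Enum


class CacheMode(Enum):
    """Local stand-in mirroring the mode names listed in the table."""
    ENABLED = "enabled"
    DISABLED = "disabled"
    READ_ONLY = "read_only"
    WRITE_ONLY = "write_only"
    BYPASS = "bypass"


def resolve_cache_mode(cache_mode=None, *, bypass_cache=False, disable_cache=False,
                       no_cache_read=False, no_cache_write=False):
    """Pick one effective mode from an explicit mode plus legacy flags."""
    if cache_mode is not None:
        return cache_mode            # an explicit mode always wins in this sketch
    if disable_cache:
        return CacheMode.DISABLED
    if bypass_cache:
        return CacheMode.BYPASS
    if no_cache_read:
        return CacheMode.WRITE_ONLY  # writes cache but never reads
    if no_cache_write:
        return CacheMode.READ_ONLY   # reads cache but never writes
    return CacheMode.ENABLED         # default when nothing is set


print(resolve_cache_mode(no_cache_read=True))
```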
 ---
 ### C) **Page Navigation & Timing**
 | **Parameter**              | **Type / Default**      | **What It Does**                                                                                                    |
 |----------------------------|-------------------------|----------------------------------------------------------------------------------------------------------------------|
 | **`wait_until`**           | `str` (domcontentloaded)| Condition for navigation to “complete”. Often `"networkidle"` or `"domcontentloaded"`.                               |
 | **`page_timeout`**         | `int` (60000 ms)        | Timeout for page navigation or JS steps. Increase for slow sites.                                                    |
 | **`wait_for`**             | `str or None`           | Wait for a CSS (`"css:selector"`) or JS (`"js:() => bool"`) condition before content extraction.                     |
 | **`wait_for_images`**      | `bool` (False)          | Wait for images to load before finishing. Slows down if you only want text.                                          |
 | **`delay_before_return_html`** | `float` (0.1)       | Additional pause (seconds) before final HTML is captured. Good for last-second updates.                               |
 | **`check_robots_txt`**     | `bool` (False)          | Whether to check and respect robots.txt rules before crawling. If True, caches robots.txt for efficiency.            |
 | **`mean_delay`** and **`max_range`** | `float` (0.1, 0.3) | If you call `arun_many()`, these define random delay intervals between crawls, helping avoid detection or rate limits. |
 | **`semaphore_count`**      | `int` (5)               | Max concurrency for `arun_many()`. Increase if you have resources for parallel crawls.                                |
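The `wait_for` row describes two prefixed forms, `"css:selector"` and `"js:() => bool"`. A minimal sketch of how such a string could be split into its kind and expression follows. `parse_wait_for` is a hypothetical helper for illustration, and rejecting unprefixed strings is a choice made for this example, not a statement about how the library treats them.

```python
def parse_wait_for(wait_for):
    """Split a wait_for string into a (kind, expression) tuple.

    Returns None when no wait condition is set.
    """
    if wait_for is None:
        return None
    for prefix in ("css:", "js:"):
        if wait_for.startswith(prefix):
            # Drop the prefix and surrounding whitespace.
            return prefix[:-1], wait_for[len(prefix):].strip()
    # This sketch is strict about prefixes; the real library may be more lenient.
    raise ValueError(f"Expected a 'css:' or 'js:' prefix, got {wait_for!r}")


print(parse_wait_for("css:.main-content"))
print(parse_wait_for("js:() => window.loaded === true"))
```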
 ---
 ### D) **Page Interaction**
 | **Parameter**              | **Type / Default**            | **What It Does**                                                                                                                       |
 |----------------------------|--------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
 | **`js_code`**              | `str or list[str]` (None)      | JavaScript to run after load. E.g. `"document.querySelector('button')?.click();"`.                                                     |
 | **`js_only`**              | `bool` (False)                 | If `True`, indicates we’re reusing an existing session and only applying JS. No full reload.                                           |
 | **`ignore_body_visibility`** | `bool` (True)                | Skip checking if `<body>` is visible. Usually best to keep `True`.                                                                     |
 | **`scan_full_page`**       | `bool` (False)                 | If `True`, auto-scroll the page to load dynamic content (infinite scroll).                                                              |
 | **`scroll_delay`**         | `float` (0.2)                  | Delay between scroll steps if `scan_full_page=True`.                                                                                   |
 | **`process_iframes`**      | `bool` (False)                 | Inlines iframe content for single-page extraction.                                                                                     |
 | **`remove_overlay_elements`** | `bool` (False)              | Removes potential modals/popups blocking the main content.                                                                              |
 | **`simulate_user`**        | `bool` (False)                 | Simulate user interactions (mouse movements) to avoid bot detection.                                                                    |
 | **`override_navigator`**   | `bool` (False)                 | Override `navigator` properties in JS for stealth.                                                                                      |
 | **`magic`**                | `bool` (False)                 | Automatic handling of popups/consent banners. Experimental.                                                                             |
 | **`adjust_viewport_to_content`** | `bool` (False)           | Resizes viewport to match page content height.                                                                                          |
 If your page is a single-page app with repeated JS updates, set `js_only=True` in subsequent calls, plus a `session_id` for reusing the same tab.
 ---
 ### E) **Media Handling**
 | **Parameter**                              | **Type / Default**  | **What It Does**                                                                                         |
 |--------------------------------------------|---------------------|-----------------------------------------------------------------------------------------------------------|
 | **`screenshot`**                           | `bool` (False)      | Capture a screenshot (base64) in `result.screenshot`.                                                     |
 | **`screenshot_wait_for`**                  | `float or None`     | Extra wait time before the screenshot.                                                                    |
 | **`screenshot_height_threshold`**          | `int` (~20000)      | If the page is taller than this, alternate screenshot strategies are used.                                |
 | **`pdf`**                                  | `bool` (False)      | If `True`, returns a PDF in `result.pdf`.                                                                 |
 | **`image_description_min_word_threshold`** | `int` (~50)         | Minimum words for an image’s alt text or description to be considered valid.                              |
 | **`image_score_threshold`**                | `int` (~3)          | Filter out low-scoring images. The crawler scores images by relevance (size, context, etc.).              |
 | **`exclude_external_images`**              | `bool` (False)      | Exclude images from other domains.                                                                        |
 ---
 ### F) **Link/Domain Handling**
 | **Parameter**                | **Type / Default**      | **What It Does**                                                                                                             |
 |------------------------------|-------------------------|-----------------------------------------------------------------------------------------------------------------------------|
 | **`exclude_social_media_domains`** | `list` (e.g. Facebook/Twitter) | A default list can be extended. Any link to these domains is removed from final output.                                      |
 | **`exclude_external_links`** | `bool` (False)          | Removes all links pointing outside the current domain.                                                                      |
 | **`exclude_social_media_links`** | `bool` (False)      | Strips links specifically to social sites (like Facebook or Twitter).                                                      |
 | **`exclude_domains`**        | `list` ([])             | Provide a custom list of domains to exclude (like `["ads.com", "trackers.io"]`).                                            |
 Use these for link-level content filtering (often to keep crawls “internal” or to remove spammy domains).
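The effect of `exclude_external_links` and `exclude_domains` can be illustrated with a standard-library sketch. `filter_links` is a hypothetical function, not the crawler's implementation. It assumes that "internal" means an exact hostname match with the crawled page and that a blocked domain also blocks its subdomains.

```python
from urllib.parse import urlsplit


def filter_links(hrefs, page_url, exclude_external_links=False, exclude_domains=()):
    """Group hrefs into internal/external lists, dropping excluded ones."""
    base_host = (urlsplit(page_url).hostname or "").lower()
    blocked = {domain.lower() for domain in exclude_domains}
    kept = {"internal": [], "external": []}

    for href in hrefs:
        # Relative links have no hostname, so they belong to the page's host.
        host = (urlsplit(href).hostname or base_host).lower()

        # Drop a blocked domain and any of its subdomains.
        if any(host == d or host.endswith("." + d) for d in blocked):
            continue

        if host == base_host:
            kept["internal"].append(href)
        elif not exclude_external_links:
            kept["external"].append(href)
    return kept


links = ["https://example.com/a", "/about", "https://ads.com/x", "https://other.org/y"]
print(filter_links(links, "https://example.com/news", exclude_domains=["ads.com"]))
```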
 ---
 ### G) **Rate Limiting & Resource Management**
 | **Parameter**                | **Type / Default**                     | **What It Does**                                                                                                           |
 |------------------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------|
 | **`enable_rate_limiting`**  | `bool` (default: `False`)              | Enable intelligent rate limiting for multiple URLs                                                                          |
 | **`rate_limit_config`**     | `RateLimitConfig` (default: `None`)    | Configuration for rate limiting behavior                                                                                   |
 The `RateLimitConfig` class has these fields:
 | **Field**           | **Type / Default**                     | **What It Does**                                                                                                           |
 |--------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------|
 | **`base_delay`**   | `Tuple[float, float]` (1.0, 3.0)      | Random delay range between requests to the same domain                                                                      |
 | **`max_delay`**    | `float` (60.0)                        | Maximum delay after rate limit detection                                                                                    |
 | **`max_retries`**  | `int` (3)                             | Number of retries before giving up on rate-limited requests                                                                 |
 | **`rate_limit_codes`** | `List[int]` ([429, 503])          | HTTP status codes that trigger rate limiting behavior                                                                       |
 Resource management is controlled by these parameters:
 | **Parameter**                  | **Type / Default**                     | **What It Does**                                                                                                           |
 |-------------------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------|
 | **`memory_threshold_percent`** | `float` (70.0)                        | Maximum memory usage before pausing new crawls                                                                              |
 | **`check_interval`**          | `float` (1.0)                         | How often to check system resources (in seconds)                                                                           |
 | **`max_session_permit`**      | `int` (20)                            | Maximum number of concurrent crawl sessions                                                                                |
 | **`display_mode`**            | `str` (`None`, "DETAILED", "AGGREGATED") | How to display progress information                                                                                     |
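To make the `RateLimitConfig` fields concrete, here is a rough sketch of how `base_delay`, `max_delay`, `max_retries`, and `rate_limit_codes` could interact. `next_delay` is a hypothetical function, and the doubling backoff on each retry is an assumption for illustration. The source only states that delays are random within `base_delay` and capped at `max_delay` after rate-limit detection.

```python
import random


def next_delay(attempt, status_code, base_delay=(1.0, 3.0), max_delay=60.0,
               max_retries=3, rate_limit_codes=(429, 503), rng=random):
    """Return seconds to wait before the next request, or None to give up.

    `attempt` is the number of retries already made for this URL.
    """
    low, high = base_delay
    if status_code not in rate_limit_codes:
        # Normal response: plain random delay within the base range.
        return rng.uniform(low, high)
    if attempt >= max_retries:
        # Retry budget exhausted for a rate-limited request.
        return None
    # Assumed policy: double the delay per retry, never exceeding max_delay.
    return min(rng.uniform(low, high) * (2 ** attempt), max_delay)


rng = random.Random(0)  # seeded so repeated runs behave the same
for attempt in range(4):
    print(attempt, next_delay(attempt, 429, rng=rng))
```

Passing the random source as a parameter keeps the function easy to test with a seeded generator.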
 ---
 ### H) **Debug & Logging**
 | **Parameter**  | **Type / Default** | **What It Does**                                                         |
 |----------------|--------------------|---------------------------------------------------------------------------|
 | **`verbose`**  | `bool` (True)     | Prints logs detailing each step of crawling, interactions, or errors.    |
 | **`log_console`** | `bool` (False) | Logs the page’s JavaScript console output if you want deeper JS debugging.|
 ---
 ## 2.2 Helper Methods
 Both `BrowserConfig` and `CrawlerRunConfig` provide a `clone()` method to create modified copies:
 ```python
 # Create a base configuration
 base_config = CrawlerRunConfig(
    cache_mode=CacheMode.ENABLED,
    word_count_threshold=200
 )
 # Create variations using clone()
 stream_config = base_config.clone(stream=True)
 no_cache_config = base_config.clone(
    cache_mode=CacheMode.BYPASS,
    stream=True
 )
 ```
 The `clone()` method is particularly useful when you need slightly different configurations for different use cases, without modifying the original config.
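The copy-with-overrides behavior described here can be sketched with `dataclasses.replace`. `RunSettings` is a hypothetical stand-in, not a Crawl4AI class, and the real `clone()` may be implemented differently. The point is only that the original object is left unchanged.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RunSettings:
    """Hypothetical stand-in for a config object; not a Crawl4AI class."""
    cache_mode: str = "ENABLED"
    word_count_threshold: int = 200
    stream: bool = False

    def clone(self, **overrides):
        # replace() builds a new instance; the original is left untouched.
        return replace(self, **overrides)


base = RunSettings()
streaming = base.clone(stream=True)
no_cache = base.clone(cache_mode="BYPASS", stream=True)

print(base)
print(streaming)
print(no_cache)
```

Freezing the dataclass makes accidental in-place edits raise an error, which fits the "modified copy" usage pattern.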
 ## 2.3 Example Usage
 ```python
 import asyncio
 from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, RateLimitConfig
 async def main():
    # Configure the browser
    browser_cfg = BrowserConfig(
        headless=False,
        viewport_width=1280,
        viewport_height=720,
        proxy="http://user:pass@myproxy:8080",
        text_mode=True
    )
    # Configure the run
    run_cfg = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        session_id="my_session",
        css_selector="main.article",
        excluded_tags=["script", "style"],
        exclude_external_links=True,
        wait_for="css:.article-loaded",
        screenshot=True,
        enable_rate_limiting=True,
        rate_limit_config=RateLimitConfig(
            base_delay=(1.0, 3.0),
            max_delay=60.0,
            max_retries=3,
            rate_limit_codes=[429, 503]
        ),
        memory_threshold_percent=70.0,
        check_interval=1.0,
        max_session_permit=20,
        display_mode="DETAILED",
        stream=True
    )
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://example.com/news",
            config=run_cfg
        )
        if result.success:
            print("Final cleaned_html length:", len(result.cleaned_html))
            if result.screenshot:
                print("Screenshot captured (base64, length):", len(result.screenshot))
        else:
            print("Crawl failed:", result.error_message)
 if __name__ == "__main__":
    asyncio.run(main())
 ```
 ## 2.4 Compliance & Ethics
 | **Parameter**          | **Type / Default**      | **What It Does**                                                                                                    |
 |-----------------------|-------------------------|----------------------------------------------------------------------------------------------------------------------|
 | **`check_robots_txt`**| `bool` (False)          | When True, checks and respects robots.txt rules before crawling. Uses efficient caching with SQLite backend.          |
 | **`user_agent`**      | `str` (None)            | User agent string to identify your crawler. Used for robots.txt checking when enabled.                                |
 ```python
 run_config = CrawlerRunConfig(
    check_robots_txt=True,  # Enable robots.txt compliance
    user_agent="MyBot/1.0"  # Identify your crawler
 )
 ```
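To make concrete what a robots.txt check decides, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not how crawl4ai implements `check_robots_txt` (the table above mentions an SQLite-backed cache, which is not shown here); the rules in `ROBOTS_TXT` and the helper name `is_allowed` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real crawler would fetch /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "MyBot/1.0", "https://example.com/news"))
print(is_allowed(ROBOTS_TXT, "MyBot/1.0", "https://example.com/private/data"))
```

The first URL falls outside the disallowed prefix and is permitted; the second falls under `/private/` and is refused.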
 ## 3. Putting It All Together
 - **Use** `BrowserConfig` for **global** browser settings: engine, headless, proxy, user agent.  
 - **Use** `CrawlerRunConfig` for each crawl’s **context**: how to filter content, handle caching, wait for dynamic elements, or run JS.  
 - **Pass** the `BrowserConfig` to `AsyncWebCrawler` and the `CrawlerRunConfig` to `arun()`.  
 ```python
 # Create a modified copy with the clone() method
 stream_cfg = run_cfg.clone(
    stream=True,
    cache_mode=CacheMode.BYPASS
 )
 ```
--- a/docs/md_v2/api/strategies.md
+++ b/docs/md_v2/api/strategies.md
@@ -218,12 +218,12 @@ result = await crawler.arun(
 ## Best Practices
-1. **Choose the Right Strategy**
+1. **Choose the Right Strategy**
   - Use `LLMExtractionStrategy` for complex, unstructured content
   - Use `JsonCssExtractionStrategy` for well-structured HTML
   - Use `CosineStrategy` for content similarity and clustering
-2. **Optimize Chunking**
+2. **Optimize Chunking**
   ```python
   # For long documents
   strategy = LLMExtractionStrategy(
@@ -232,7 +232,7 @@ result = await crawler.arun(
   )
   ```
-3. **Handle Errors**
+3. **Handle Errors**
   ```python
   try:
       result = await crawler.arun(
@@ -245,7 +245,7 @@ result = await crawler.arun(
       print(f"Extraction failed: {e}")
   ```
-4. **Monitor Performance**
+4. **Monitor Performance**
   ```python
   strategy = CosineStrategy(
       verbose=True,  # Enable logging
--- a/docs/md_v2/assets/images/dispatcher.png
+++ b/docs/md_v2/assets/images/dispatcher.png
--- a/docs/md_v2/assets/styles.css
+++ b/docs/md_v2/assets/styles.css
@@ -7,6 +7,7 @@
 :root {
    --global-font-size: 16px;
    --global-code-font-size: 16px;
    --global-line-height: 1.5em;
    --global-space: 10px;
    --font-stack: Menlo, Monaco, Lucida Console, Liberation Mono, DejaVu Sans Mono, Bitstream Vera Sans Mono,
@@ -20,6 +21,7 @@
    --invert-font-color: #151515; /* Dark color for inverted elements */
    --primary-color: #1a95e0; /* Primary color can remain the same or be adjusted for better contrast */
    --secondary-color: #727578; /* Secondary color for less important text */
    --secondary-dimmed-color: #8b857a; /* Dimmed secondary color */
    --error-color: #ff5555; /* Bright color for errors */
    --progress-bar-background: #444; /* Darker background for progress bar */
    --progress-bar-fill: #1a95e0; /* Bright color for progress bar fill */
@@ -37,8 +39,9 @@
    --secondary-color: #a3abba;
    --secondary-color: #d5cec0;
    --tertiary-color: #a3abba;
-    --primary-color: #09b5a5; /* Updated to the brand color */
+    --primary-dimmed-color: #09b5a5; /* Updated to the brand color */
    --primary-color: #50ffff; /* Updated to the brand color */
    --accent-color: rgb(243, 128, 245);
    --error-color: #ff3c74;
    --progress-bar-background: #3f3f44;
    --progress-bar-fill: #09b5a5; /* Updated to the brand color */
@@ -80,10 +83,16 @@ pre, code {
    line-height: var(--global-line-height);
 }
-strong,
+strong {
    /* color : var(--primary-dimmed-color); */
    /* background-color: #50ffff17; */
    text-shadow: 0 0 0px var(--font-color), 0 0 0px var(--font-color);
 }
 .highlight {
    /* background: url(//s2.svgbox.net/pen-brushes.svg?ic=brush-1&color=50ffff); */
-    background-color: #50ffff33;
+    background-color: #50ffff17;
 }
 .terminal-card > header {
@@ -158,3 +167,79 @@ ol li::before {
    /* float: left; */
    /* padding-right: 5px; */
 }
 /* 8 TERMINAL CSS */
 .terminal code {
    font-size: var(--global-code-font-size);
    background: var(--block-background-color);
    /* color: var(--secondary-color); */
    color: var(--primary-dimmed-color);
 }
 .terminal pre code {
    background: var(--block-background-color);
    color: var(--secondary-color);
 }
 .hljs-keyword, .hljs-selector-tag, .hljs-built_in, .hljs-name, .hljs-tag {
    color: var(--accent-color);
 }
 .hljs-string {
    color: var(--primary-dimmed-color);
 }
 .hljs-comment {
    color: var(--secondary-dimmed-color);
    font-style: italic;
    font-size: 0.9em;
 }
 .hljs-number {
    color: var(--primary-dimmed-color);
 }
 .terminal strong > code, .terminal h2 > code , .terminal h3 > code {
    background-color: transparent;
    /* color: var(--font-color); */
    color: var(--primary-dimmed-color);
    text-shadow: none;
 }
 blockquote {
    background-color: var(--invert-font-color);
    padding: 1em 2em;
    border-left: 2px solid var(--primary-dimmed-color);
 }
 blockquote::after {
    content: "💡";
    white-space: pre;
    position: absolute;
    top: 1em;
    left: 5px;
    line-height: var(--global-line-height);
    color: #9ca2ab;
 }
 pre {
    display: block;
    word-break: break-word;
    word-wrap: break-word;
 }
 .terminal h1 {
    font-size: 2em;
 }
 .terminal h1, .terminal h2, .terminal h3, .terminal h4, .terminal h5, .terminal h6 {
    text-shadow: 0 0 0px var(--font-color), 0 0 0px var(--font-color), 0 0 0px var(--font-color);
 }
 /* Lower max height or width for these images */
 div.badges a {
    /* no underline */
    text-decoration: none !important;
 }
 div.badges a > img {
    width: auto;
 }
--- a/docs/md_v2/basic/browser-config.md
+++ b/docs/md_v2/basic/browser-config.md
@@ -1,208 +0,0 @@
 # Browser Configuration
 Crawl4AI supports multiple browser engines and offers extensive configuration options for browser behavior.
 ## Browser Types
 Choose from three browser engines:
 ```python
 # Chromium (default)
 async with AsyncWebCrawler(browser_type="chromium") as crawler:
    result = await crawler.arun(url="https://example.com")
 # Firefox
 async with AsyncWebCrawler(browser_type="firefox") as crawler:
    result = await crawler.arun(url="https://example.com")
 # WebKit
 async with AsyncWebCrawler(browser_type="webkit") as crawler:
    result = await crawler.arun(url="https://example.com")
 ```
 ## Basic Configuration
 Common browser settings:
 ```python
 async with AsyncWebCrawler(
    headless=True,           # Run in headless mode (no GUI)
    verbose=True,           # Enable detailed logging
    sleep_on_close=False    # No delay when closing browser
 ) as crawler:
    result = await crawler.arun(url="https://example.com")
 ```
 ## Identity Management
 Control how your crawler appears to websites:
 ```python
 # Custom user agent
 async with AsyncWebCrawler(
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
 ) as crawler:
    result = await crawler.arun(url="https://example.com")
 # Custom headers
 headers = {
    "Accept-Language": "en-US,en;q=0.9",
    "Cache-Control": "no-cache"
 }
 async with AsyncWebCrawler(headers=headers) as crawler:
    result = await crawler.arun(url="https://example.com")
 ```
 ## Screenshot Capabilities
 Capture page screenshots with enhanced error handling:
 ```python
 result = await crawler.arun(
    url="https://example.com",
    screenshot=True,                # Enable screenshot
    screenshot_wait_for=2.0        # Wait 2 seconds before capture
 )
 if result.screenshot:  # Base64 encoded image
    import base64
    with open("screenshot.png", "wb") as f:
        f.write(base64.b64decode(result.screenshot))
 ```
 ## Timeouts and Waiting
 Control page loading behavior:
 ```python
 result = await crawler.arun(
    url="https://example.com",
    page_timeout=60000,              # Page load timeout (ms)
    delay_before_return_html=2.0,    # Wait before content capture
    wait_for="css:.dynamic-content"  # Wait for specific element
 )
 ```
 ## JavaScript Execution
 Execute custom JavaScript before crawling:
 ```python
 # Single JavaScript command
 result = await crawler.arun(
    url="https://example.com",
    js_code="window.scrollTo(0, document.body.scrollHeight);"
 )
 # Multiple commands
 js_commands = [
    "window.scrollTo(0, document.body.scrollHeight);",
    "document.querySelector('.load-more').click();"
 ]
 result = await crawler.arun(
    url="https://example.com",
    js_code=js_commands
 )
 ```
 ## Proxy Configuration
 Use proxies for enhanced access:
 ```python
 # Simple proxy
 async with AsyncWebCrawler(
    proxy="http://proxy.example.com:8080"
 ) as crawler:
    result = await crawler.arun(url="https://example.com")
 # Proxy with authentication
 proxy_config = {
    "server": "http://proxy.example.com:8080",
    "username": "user",
    "password": "pass"
 }
 async with AsyncWebCrawler(proxy_config=proxy_config) as crawler:
    result = await crawler.arun(url="https://example.com")
 ```
 ## Anti-Detection Features
 Enable stealth features to avoid bot detection:
 ```python
 result = await crawler.arun(
    url="https://example.com",
    simulate_user=True,        # Simulate human behavior
    override_navigator=True,   # Mask automation signals
    magic=True               # Enable all anti-detection features
 )
 ```
 ## Handling Dynamic Content
 Configure browser to handle dynamic content:
 ```python
 # Wait for dynamic content
 result = await crawler.arun(
    url="https://example.com",
    wait_for="js:() => document.querySelector('.content').children.length > 10",
    process_iframes=True     # Process iframe content
 )
 # Handle lazy-loaded images
 result = await crawler.arun(
    url="https://example.com",
    js_code="window.scrollTo(0, document.body.scrollHeight);",
    delay_before_return_html=2.0  # Wait for images to load
 )
 ```
 ## Comprehensive Example
 Here's how to combine various browser configurations:
 ```python
 async def crawl_with_advanced_config(url: str):
    async with AsyncWebCrawler(
        # Browser setup
        browser_type="chromium",
        headless=True,
        verbose=True,
        # Identity
        user_agent="Custom User Agent",
        headers={"Accept-Language": "en-US"},
        # Proxy setup
        proxy="http://proxy.example.com:8080"
    ) as crawler:
        result = await crawler.arun(
            url=url,
            # Content handling
            process_iframes=True,
            screenshot=True,
            # Timing
            page_timeout=60000,
            delay_before_return_html=2.0,
            # Anti-detection
            magic=True,
            simulate_user=True,
            # Dynamic content
            js_code=[
                "window.scrollTo(0, document.body.scrollHeight);",
                "document.querySelector('.load-more')?.click();"
            ],
            wait_for="css:.dynamic-content"
        )
        return {
            "content": result.markdown,
            "screenshot": result.screenshot,
            "success": result.success
        }
 ```
--- a/docs/md_v2/basic/content-selection.md
+++ b/docs/md_v2/basic/content-selection.md
@@ -1,199 +0,0 @@
 # Content Selection
 Crawl4AI provides multiple ways to select and filter specific content from webpages. Learn how to precisely target the content you need.
 ## CSS Selectors
 The simplest way to extract specific content:
 ```python
 # Extract specific content using CSS selector
 result = await crawler.arun(
    url="https://example.com",
    css_selector=".main-article"  # Target main article content
 )
 # Multiple selectors
 result = await crawler.arun(
    url="https://example.com",
    css_selector="article h1, article .content"  # Target heading and content
 )
 ```
 ## Content Filtering
 Control what content is included or excluded:
 ```python
 result = await crawler.arun(
    url="https://example.com",
    # Content thresholds
    word_count_threshold=10,        # Minimum words per block
    # Tag exclusions
    excluded_tags=['form', 'header', 'footer', 'nav'],
    # Link filtering
    exclude_external_links=True,    # Remove external links
    exclude_social_media_links=True,  # Remove social media links
    # Media filtering
    exclude_external_images=True   # Remove external images
 )
 ```
 ## Iframe Content
 Process content inside iframes:
 ```python
 result = await crawler.arun(
    url="https://example.com",
    process_iframes=True,  # Extract iframe content
    remove_overlay_elements=True  # Remove popups/modals that might block iframes
 )
 ```
 ## Structured Content Selection
 ### Using LLMs for Smart Selection
 Use LLMs to intelligently extract specific types of content:
 ```python
 import json
 from typing import List
 from pydantic import BaseModel
 from crawl4ai.extraction_strategy import LLMExtractionStrategy
 class ArticleContent(BaseModel):
    title: str
    main_points: List[str]
    conclusion: str
 strategy = LLMExtractionStrategy(
    provider="ollama/nemotron",  # Works with any supported LLM
    schema=ArticleContent.schema(),
    instruction="Extract the main article title, key points, and conclusion"
 )
 result = await crawler.arun(
    url="https://example.com",
    extraction_strategy=strategy
 )
 article = json.loads(result.extracted_content)
 ```
 ### Pattern-Based Selection
 For repeated content patterns (like product listings, news feeds):
 ```python
 import json
 from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
 schema = {
    "name": "News Articles",
    "baseSelector": "article.news-item",  # Repeated element
    "fields": [
        {"name": "headline", "selector": "h2", "type": "text"},
        {"name": "summary", "selector": ".summary", "type": "text"},
        {"name": "category", "selector": ".category", "type": "text"},
        {
            "name": "metadata",
            "type": "nested",
            "fields": [
                {"name": "author", "selector": ".author", "type": "text"},
                {"name": "date", "selector": ".date", "type": "text"}
            ]
        }
    ]
 }
 strategy = JsonCssExtractionStrategy(schema)
 result = await crawler.arun(
    url="https://example.com",
    extraction_strategy=strategy
 )
 articles = json.loads(result.extracted_content)
 ```
 ## Domain-Based Filtering
 Control content based on domains:
 ```python
 result = await crawler.arun(
    url="https://example.com",
    exclude_domains=["ads.com", "tracker.com"],
    exclude_social_media_domains=["facebook.com", "twitter.com"],  # Custom social media domains to exclude
    exclude_social_media_links=True
 )
 ```
 ## Media Selection
 Select specific types of media:
 ```python
 result = await crawler.arun(url="https://example.com")
 # Access different media types
 images = result.media["images"]  # List of image details
 videos = result.media["videos"]  # List of video details
 audios = result.media["audios"]  # List of audio details
 # Image with metadata
 for image in images:
    print(f"URL: {image['src']}")
    print(f"Alt text: {image['alt']}")
    print(f"Description: {image['desc']}")
    print(f"Relevance score: {image['score']}")
 ```
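Once the image list is available, a common next step is to keep only the higher-scoring entries. The sketch below assumes each image is a dict with the `src`, `alt`, and `score` keys shown above; the helper name `top_images`, the cutoff value, and the sample data are hypothetical, and the actual score scale may differ.

```python
def top_images(images, min_score=3, limit=5):
    """Keep images scoring at least min_score, highest score first."""
    kept = [img for img in images if img.get("score", 0) >= min_score]
    kept.sort(key=lambda img: img.get("score", 0), reverse=True)
    return kept[:limit]


# Made-up sample data in the same shape as result.media["images"].
sample = [
    {"src": "https://example.com/hero.jpg", "alt": "Hero", "score": 6},
    {"src": "https://example.com/pixel.gif", "alt": "", "score": 0},
    {"src": "https://example.com/chart.png", "alt": "Chart", "score": 4},
]

for img in top_images(sample):
    print(img["src"], img["score"])
```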
 ## Comprehensive Example
 Here's how to combine different selection methods:
 ```python
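 import json
 from typing import List
 from pydantic import BaseModel
 from crawl4ai import AsyncWebCrawler
 from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, LLMExtractionStrategy
 async def extract_article_content(url: str):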
    # Define structured extraction
    article_schema = {
        "name": "Article",
        "baseSelector": "article.main",
        "fields": [
            {"name": "title", "selector": "h1", "type": "text"},
            {"name": "content", "selector": ".content", "type": "text"}
        ]
    }
    # Define LLM extraction
    class ArticleAnalysis(BaseModel):
        key_points: List[str]
        sentiment: str
        category: str
    async with AsyncWebCrawler() as crawler:
        # Get structured content
        pattern_result = await crawler.arun(
            url=url,
            extraction_strategy=JsonCssExtractionStrategy(article_schema),
            word_count_threshold=10,
            excluded_tags=['nav', 'footer'],
            exclude_external_links=True
        )
        # Get semantic analysis
        analysis_result = await crawler.arun(
            url=url,
            extraction_strategy=LLMExtractionStrategy(
                provider="ollama/nemotron",
                schema=ArticleAnalysis.schema(),
                instruction="Analyze the article content"
            )
        )
        # Combine results
        return {
            "article": json.loads(pattern_result.extracted_content),
            "analysis": json.loads(analysis_result.extracted_content),
            "media": pattern_result.media
        }
 ```
--- a/docs/md_v2/basic/content_filtering.md
+++ b/docs/md_v2/basic/content_filtering.md
@@ -1,136 +0,0 @@
 # Content Filtering in Crawl4AI
 This guide explains how to use content filtering strategies in Crawl4AI to extract the most relevant information from crawled web pages.  You'll learn how to use the built-in `BM25ContentFilter` and how to create your own custom content filtering strategies.
 ## Relevance Content Filter
 The `RelevantContentFilter` is an abstract class that provides a common interface for content filtering strategies. Specific filtering algorithms, like `PruningContentFilter` or `BM25ContentFilter`, inherit from this class and implement the `filter_content` method. This method takes the HTML content as input and returns a list of filtered text blocks.
 ## Pruning Content Filter
 The `PruningContentFilter` is a tree-shaking algorithm that analyzes the HTML DOM structure and removes less relevant nodes based on various metrics like text density, link density, and tag importance. It evaluates each node using a composite scoring system and "prunes" nodes that fall below a certain threshold.
 ### Usage
 ```python
 from crawl4ai import AsyncWebCrawler
 from crawl4ai.content_filter_strategy import PruningContentFilter
 async def filter_content(url):
    async with AsyncWebCrawler() as crawler:
        content_filter = PruningContentFilter(
            min_word_threshold=5,
            threshold_type='dynamic',
            threshold=0.45
        )
        result = await crawler.arun(url=url, extraction_strategy=content_filter, fit_markdown=True)
        if result.success:
            print(f"Cleaned Markdown:\n{result.fit_markdown}")
 ```
 ### Parameters
 - **`min_word_threshold`**: (Optional) Minimum number of words a node must contain to be considered relevant. Nodes with fewer words are automatically pruned.
 - **`threshold_type`**: (Optional, default 'fixed') Controls how pruning thresholds are calculated:
  - `'fixed'`: Uses a constant threshold value for all nodes
  - `'dynamic'`: Adjusts threshold based on node characteristics like tag importance and text/link ratios
 - **`threshold`**: (Optional, default 0.48) Base threshold value for node pruning:
  - For fixed threshold: Nodes scoring below this value are removed
  - For dynamic threshold: This value is adjusted based on node properties
 ### How It Works
 The pruning algorithm evaluates each node using multiple metrics:
 - Text density: Ratio of actual text to overall node content
 - Link density: Proportion of text within links
 - Tag importance: Weight based on HTML tag type (e.g., article, p, div)
 - Content quality: Metrics like text length and structural importance
 Nodes scoring below the threshold are removed, effectively "shaking" less relevant content from the DOM tree. This results in a cleaner document containing only the most relevant content blocks.
 The algorithm is particularly effective for:
 - Removing boilerplate content
 - Eliminating navigation menus and sidebars
 - Preserving main article content
 - Maintaining document structure while removing noise
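The scoring idea described above can be sketched as follows. This is an illustrative approximation, not the `PruningContentFilter` implementation: the `Node` class, the tag weights, and the equal-weight average are all assumptions made for the example. Only the 0.48 default threshold comes from the parameter list above.

```python
from dataclasses import dataclass

# Hypothetical tag weights, chosen only for illustration.
TAG_WEIGHTS = {"article": 1.0, "p": 0.8, "div": 0.5, "nav": 0.1}


@dataclass
class Node:
    tag: str
    html_len: int       # length of the node's raw HTML
    text_len: int       # length of its visible text
    link_text_len: int  # length of the text that sits inside links


def score(node: Node) -> float:
    """Combine text density, link density, and tag weight into one score."""
    text_density = node.text_len / node.html_len if node.html_len else 0.0
    link_density = node.link_text_len / node.text_len if node.text_len else 1.0
    tag_weight = TAG_WEIGHTS.get(node.tag, 0.3)
    # Dense text and important tags raise the score; link-heavy text lowers it.
    return (text_density + (1.0 - link_density) + tag_weight) / 3


def prune(nodes, threshold=0.48):
    """Keep only nodes whose score reaches the fixed threshold."""
    return [n for n in nodes if score(n) >= threshold]


nodes = [
    Node("article", html_len=1000, text_len=800, link_text_len=40),
    Node("nav", html_len=500, text_len=200, link_text_len=190),
]
print([n.tag for n in prune(nodes)])
```

A dynamic threshold would adjust `threshold` per node instead of using a single constant; that variation is left out to keep the sketch short.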
 ## BM25 Algorithm
 The `BM25ContentFilter` uses the BM25 algorithm, a ranking function used in information retrieval to estimate the relevance of documents to a given search query. In Crawl4AI, this algorithm helps to identify and extract text chunks that are most relevant to the page's metadata or a user-specified query.
 ### Usage
 To use the `BM25ContentFilter`, initialize it and then pass it as the `extraction_strategy` parameter to the `arun` method of the crawler.
 ```python
 import asyncio
 from crawl4ai import AsyncWebCrawler
 from crawl4ai.content_filter_strategy import BM25ContentFilter
 async def filter_content(url, query=None):
    async with AsyncWebCrawler() as crawler:
        content_filter = BM25ContentFilter(user_query=query)
        result = await crawler.arun(url=url, extraction_strategy=content_filter, fit_markdown=True) # Set fit_markdown flag to True to trigger BM25 filtering
        if result.success:
            print(f"Filtered Content (JSON):\n{result.extracted_content}")
            print(f"\nFiltered Markdown:\n{result.fit_markdown}") # New field in CrawlResult object
            print(f"\nFiltered HTML:\n{result.fit_html}") # New field in CrawlResult object. Note that raw HTML may have tags re-organized due to internal parsing.
        else:
            print("Error:", result.error_message)
 # Example usage:
 asyncio.run(filter_content("https://en.wikipedia.org/wiki/Apple", "fruit nutrition health")) # with query
 asyncio.run(filter_content("https://en.wikipedia.org/wiki/Apple")) # without query, metadata will be used as the query.
 ```
 ### Parameters
 - **`user_query`**:  (Optional) A string representing the search query. If not provided, the filter extracts relevant metadata (title, description, keywords) from the page and uses that as the query.
 - **`bm25_threshold`**: (Optional, default 1.0)  A float value that controls the threshold for relevance.  Higher values result in stricter filtering, returning only the most relevant text chunks. Lower values result in more lenient filtering.
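For readers unfamiliar with BM25, the sketch below scores text chunks against a query with the standard Okapi BM25 formula and then applies a threshold. It is a simplified illustration rather than the `BM25ContentFilter` code: whitespace tokenization, the `k1` and `b` values, and the function names are assumptions, and the library's scores for the same text may differ.

```python
import math
from collections import Counter


def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per chunk for the given query."""
    docs = [chunk.lower().split() for chunk in chunks]
    if not docs:
        return []
    avg_len = (sum(len(d) for d in docs) / len(docs)) or 1.0
    terms = query.lower().split()
    # Document frequency: in how many chunks each query term appears.
    df = {t: sum(1 for d in docs if t in d) for t in terms}
    scores = []
    for d in docs:
        tf = Counter(d)
        total = 0.0
        for t in terms:
            idf = math.log((len(docs) - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            norm = 1 - b + b * len(d) / avg_len  # length normalization
            total += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * norm)
        scores.append(total)
    return scores


def filter_chunks(query, chunks, threshold=1.0):
    """Keep chunks whose score reaches the threshold (stricter when higher)."""
    scores = bm25_scores(query, chunks)
    return [c for c, s in zip(chunks, scores) if s >= threshold]


chunks = [
    "Apples are a nutritious fruit rich in fiber",
    "Click here to subscribe to our newsletter",
    "Fruit nutrition supports good health",
]
print(filter_chunks("fruit nutrition health", chunks))
```

Raising `threshold` keeps fewer chunks and lowering it keeps more, which mirrors the behavior described for `bm25_threshold`.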
 ## Fit Markdown Flag
 Setting the `fit_markdown` flag to `True` in the `arun` method activates content filtering during the crawl. The flag instructs the scraper to extract and clean the HTML, primarily to prepare input for a Large Language Model that cannot process large amounts of data. Besides improving the quality of the extracted content, it adds the filtered content to two new attributes of the returned `CrawlResult` object: `fit_markdown` and `fit_html`.
 ## Custom Content Filtering Strategies
 You can create your own custom filtering strategies by inheriting from the `RelevantContentFilter` class and implementing the `filter_content` method.  This allows you to tailor the filtering logic to your specific needs.
 ```python
 from crawl4ai import AsyncWebCrawler
 from crawl4ai.content_filter_strategy import RelevantContentFilter
 from bs4 import BeautifulSoup, Tag
 from typing import List
 class MyCustomFilter(RelevantContentFilter):
    def filter_content(self, html: str) -> List[str]:
        soup = BeautifulSoup(html, 'lxml')
        # Implement custom filtering logic here
        # Example: extract all paragraphs within divs with class "article-body"
        filtered_paragraphs = []
        for tag in soup.select("div.article-body p"):
            if isinstance(tag, Tag):
                filtered_paragraphs.append(str(tag)) # Add the cleaned HTML element.  
        return filtered_paragraphs
 async def custom_filter_demo(url: str):
    async with AsyncWebCrawler() as crawler:
        custom_filter = MyCustomFilter()
        result = await crawler.arun(url, extraction_strategy=custom_filter)
        if result.success:
            print(result.extracted_content)
 ```
 This example demonstrates extracting paragraphs from a specific div class.  You can customize this logic to implement different filtering strategies, use regular expressions, analyze text density, or apply other relevant techniques.
 ## Conclusion
 Content filtering strategies provide a powerful way to refine the output of your crawls. By using `BM25ContentFilter` or creating custom strategies, you can focus on the most pertinent information and improve the efficiency of your data processing pipeline.
--- a/docs/md_v2/basic/file-download.md
+++ b/docs/md_v2/basic/file-download.md
@@ -1,148 +0,0 @@
 # Download Handling in Crawl4AI
 This guide explains how to use Crawl4AI to handle file downloads during crawling.  You'll learn how to trigger downloads, specify download locations, and access downloaded files.
 ## Enabling Downloads
 By default, Crawl4AI does not download files. To enable downloads, set the `accept_downloads` parameter to `True` in either the `AsyncWebCrawler` constructor or the `arun` method.
 ```python
 import asyncio
 from crawl4ai import AsyncWebCrawler
 async def main():
    async with AsyncWebCrawler(accept_downloads=True) as crawler:  # Globally enable downloads
        ...  # your crawling logic goes here
 asyncio.run(main())
 ```
 Or, enable it for a specific crawl:
 ```python
 async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="...", accept_downloads=True)
        # ...
 ```
 ## Specifying Download Location
 You can specify the download directory using the `downloads_path` parameter. If not provided, Crawl4AI creates a "downloads" directory inside the `.crawl4ai` folder in your home directory.
 ```python
 import os
 from pathlib import Path
 # ... inside your crawl function:
 downloads_path = os.path.join(os.getcwd(), "my_downloads")  # Custom download path
 os.makedirs(downloads_path, exist_ok=True)
 result = await crawler.arun(url="...", downloads_path=downloads_path, accept_downloads=True)
 # ...
 ```
 If you are setting it globally, provide the path to the AsyncWebCrawler:
 ```python
 async def crawl_with_downloads(url: str, download_path: str):
    async with AsyncWebCrawler(
        accept_downloads=True,
        downloads_path=download_path, # or set it on arun
        verbose=True
    ) as crawler:
        result = await crawler.arun(url=url) # you still need to enable downloads per call.
        # ...
 ```
 ## Triggering Downloads
 Downloads are typically triggered by user interactions on a web page (e.g., clicking a download button). You can simulate these actions with the `js_code` parameter, which injects JavaScript to run in the browser context. Use the `wait_for` parameter to give the download enough time to start before the crawler proceeds.
 ```python
 result = await crawler.arun(
    url="https://www.python.org/downloads/",
    js_code="""
        // Find and click the first Windows installer link
        const downloadLink = document.querySelector('a[href$=".exe"]');
        if (downloadLink) {
            downloadLink.click();
        }
    """,
    wait_for=5  # Wait for 5 seconds for the download to start
 )
 ```
 ## Accessing Downloaded Files
 Downloaded file paths are stored in the `downloaded_files` attribute of the returned  `CrawlResult`  object.  This is a list of strings, with each string representing the absolute path to a downloaded file.
 ```python
 if result.downloaded_files:
    print("Downloaded files:")
    for file_path in result.downloaded_files:
        print(f"- {file_path}")
        # Perform operations with downloaded files, e.g., check file size
        file_size = os.path.getsize(file_path)
        print(f"- File size: {file_size} bytes")
 else:
    print("No files downloaded.")
 ```
 ## Example: Downloading Multiple Files
 ```python
 import asyncio
 import os
 from pathlib import Path
 from crawl4ai import AsyncWebCrawler
 async def download_multiple_files(url: str, download_path: str):
    async with AsyncWebCrawler(
        accept_downloads=True,
        downloads_path=download_path,
        verbose=True
    ) as crawler:
        result = await crawler.arun(
            url=url,
            js_code="""
            // Trigger multiple downloads (example)
            const downloadLinks = document.querySelectorAll('a[download]'); // Or a more specific selector
            for (const link of downloadLinks) {
                link.click();
                await new Promise(r => setTimeout(r, 2000)); // Add a small delay between clicks if needed
            }
            """,
            wait_for=10 # Adjust the timeout to match the expected time for all downloads to start
        )
        if result.downloaded_files:
            print("Downloaded files:")
            for file in result.downloaded_files:
                print(f"- {file}")
        else:
            print("No files downloaded.")
 # Example usage
 download_path = os.path.join(Path.home(), ".crawl4ai", "downloads")
 os.makedirs(download_path, exist_ok=True) # Create directory if it doesn't exist
 asyncio.run(download_multiple_files("https://www.python.org/downloads/windows/", download_path))
 ```
 ## Important Considerations
 - **Browser Context:** Downloads are managed within the browser context.  Ensure your `js_code` correctly targets the download triggers on the specific web page.
 - **Waiting:** Use `wait_for` to give downloads time to start when they do not begin immediately.
 - **Error Handling:** Implement proper error handling to gracefully manage failed downloads or incorrect file paths.
 - **Security:** Downloaded files should be scanned for potential security threats before use.
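As a starting point for the error-handling and security items above, the sketch below checks that each reported path exists and records its size and SHA-256 digest, which can later be compared against a published checksum. The helper name `describe_downloads` is hypothetical and uses only the standard library; it does not scan for malware.

```python
import hashlib
import os
import tempfile


def describe_downloads(paths):
    """Report existence, size, and SHA-256 digest for each path."""
    report = []
    for path in paths:
        if not os.path.isfile(path):
            # Missing or non-regular file: flag it instead of raising.
            report.append({"path": path, "ok": False})
            continue
        digest = hashlib.sha256()
        with open(path, "rb") as fh:
            # Read in blocks so large downloads are not loaded into memory.
            for block in iter(lambda: fh.read(65536), b""):
                digest.update(block)
        report.append({
            "path": path,
            "ok": True,
            "size": os.path.getsize(path),
            "sha256": digest.hexdigest(),
        })
    return report


# Demonstration with a temporary file standing in for a real download.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "file.bin")
    with open(good, "wb") as fh:
        fh.write(b"hello")
    report = describe_downloads([good, os.path.join(tmp, "missing.bin")])

for entry in report:
    print(entry["ok"], entry.get("size"))
```

In a real crawl, `result.downloaded_files` would be passed in place of the temporary paths.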
 This guide provides a foundation for handling downloads with Crawl4AI. You can adapt these techniques to manage downloads in various scenarios and integrate them into more complex crawling workflows.
--- a/docs/md_v2/basic/installation.md
+++ b/docs/md_v2/basic/installation.md
@@ -1,137 +0,0 @@
 # Installation 💻
 Crawl4AI offers flexible installation options to suit various use cases. You can install it as a Python package, use it with Docker, or run it as a local server.
 ## Option 1: Python Package Installation (Recommended)
 Crawl4AI is now available on PyPI, making installation easier than ever. Choose the option that best fits your needs:
 ### Basic Installation
 For basic web crawling and scraping tasks:
 ```bash
 pip install crawl4ai
 playwright install # Install Playwright dependencies
 ```
 ### Installation with PyTorch
 For advanced text clustering (includes CosineSimilarity cluster strategy):
 ```bash
 pip install "crawl4ai[torch]"
 ```
 ### Installation with Transformers
 For text summarization and Hugging Face models:
 ```bash
 pip install "crawl4ai[transformer]"
 ```
 ### Full Installation
 For all features:
 ```bash
 pip install "crawl4ai[all]"
 ```
 ### Development Installation
 For contributors who plan to modify the source code:
 ```bash
 git clone https://github.com/unclecode/crawl4ai.git
 cd crawl4ai
 pip install -e ".[all]"
 playwright install # Install Playwright dependencies
 ```
 💡 After installation with "torch", "transformer", or "all" options, it's recommended to run the following CLI command to load the required models:
 ```bash
 crawl4ai-download-models
 ```
 This is optional but will boost the performance and speed of the crawler. You only need to do this once after installation.
 ## Playwright Installation Note for Ubuntu
 If you encounter issues with Playwright installation on Ubuntu, you may need to install additional dependencies:
 ```bash
 sudo apt-get install -y \
    libwoff1 \
    libopus0 \
    libwebp7 \
    libwebpdemux2 \
    libenchant-2-2 \
    libgudev-1.0-0 \
    libsecret-1-0 \
    libhyphen0 \
    libgdk-pixbuf2.0-0 \
    libegl1 \
    libnotify4 \
    libxslt1.1 \
    libevent-2.1-7 \
    libgles2 \
    libxcomposite1 \
    libatk1.0-0 \
    libatk-bridge2.0-0 \
    libepoxy0 \
    libgtk-3-0 \
    libharfbuzz-icu0 \
    libgstreamer-gl1.0-0 \
    libgstreamer-plugins-bad1.0-0 \
    gstreamer1.0-plugins-good \
    gstreamer1.0-plugins-bad \
    libxt6 \
    libxaw7 \
    xvfb \
    fonts-noto-color-emoji \
    libfontconfig \
    libfreetype6 \
    xfonts-cyrillic \
    xfonts-scalable \
    fonts-liberation \
    fonts-ipafont-gothic \
    fonts-wqy-zenhei \
    fonts-tlwg-loma-otf \
    fonts-freefont-ttf
 ```
 ## Option 2: Using Docker (Coming Soon)
 Docker support for Crawl4AI is currently in progress and will be available soon. This will allow you to run Crawl4AI in a containerized environment, ensuring consistency across different systems.
 ## Option 3: Local Server Installation
 For those who prefer to run Crawl4AI as a local server, instructions will be provided once the Docker implementation is complete.
 ## Verifying Your Installation
 After installation, you can verify that Crawl4AI is working correctly by running a simple Python script:
 ```python
 import asyncio
 from crawl4ai import AsyncWebCrawler
 async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://www.example.com")
        print(result.markdown[:500])  # Print first 500 characters
 if __name__ == "__main__":
    asyncio.run(main())
 ```
 This script should successfully crawl the example website and print the first 500 characters of the extracted content.
 ## Getting Help
 If you encounter any issues during installation or usage, please check the [documentation](https://crawl4ai.com/mkdocs/) or raise an issue on the [GitHub repository](https://github.com/unclecode/crawl4ai/issues).
 Happy crawling! 🕷️🤖
--- a/Show More
+++ b/Show More
@@ -1,2 +1,2 @@
 # crawl4ai/_version.py
-__version__ = "0.4.22"
+__version__ = "0.4.3b2"