chore(version): bump version to 0.3.73

feat(core): Release v0.3.73 with Browser Takeover and Docker Support
Major changes:
- Add browser takeover feature using CDP for authentic browsing
- Implement Docker support with full API server documentation
- Enhance Markdown conversion with a tag preservation system
- Improve parallel crawling performance

This release focuses on authenticity and scalability, introducing the ability to use users' own browsers while providing containerized deployment options. Breaking changes include modified browser handling and API response structure. See CHANGELOG.md for the detailed migration guide.
2024-11-05 20:05:58 +08:00 · 2024-11-05 20:04:18 +08:00 · 2024-11-04 20:33:15 +08:00 · 2024-11-04 16:51:59 +08:00 · 2024-11-04 14:12:07 +08:00 · 2024-11-04 14:10:27 +08:00
141 changed files with 23601 additions and 1004 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -165,6 +165,8 @@ Crawl4AI.egg-info/
 Crawl4AI.egg-info/*
 crawler_data.db
 .vscode/
+.tests/
+.test_pads/
 test_pad.py
 test_pad*.py
 .data/
@@ -172,3 +174,38 @@ Crawl4AI.egg-info/

 requirements0.txt
 a.txt
+
+*.sh
+.idea
+docs/examples/.chainlit/
+docs/examples/.chainlit/*
+.chainlit/config.toml
+.chainlit/translations/en-US.json
+
+local/
+.files/
+
+a.txt
+.lambda_function.py
+ec2*
+
+update_changelog.sh
+
+.DS_Store
+docs/.DS_Store
+tmp/
+test_env/
+**/.DS_Store
+**/.DS_Store
+
+todo.md
+git_changes.py
+git_changes.md
+pypi_build.sh
+git_issues.py
+git_issues.md
+
+.tests/
+.issues/
+.docs/
+.issues/
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,31 +1,493 @@
 # Changelog

-All notable changes to this project will be documented in this file.
+# CHANGELOG

-## [Unreleased]
+## [v0.3.73] - 2024-11-05
+
+### Major Features
+- **New Doctor Feature**
+  - Added comprehensive system diagnostics tool
+  - Available through package hub and CLI
+  - Provides automated troubleshooting and system health checks
+  - Includes detailed reporting of configuration issues
+
+- **Dockerized API Server**
+  - Released complete Docker implementation for API server
+  - Added comprehensive documentation for Docker deployment
+  - Implemented container communication protocols
+  - Added environment configuration guides
+
+- **Managed Browser Integration**
+  - Added support for user-controlled browser instances
+  - Implemented `ManagedBrowser` class for better browser lifecycle management
+  - Added ability to connect to existing Chrome DevTools Protocol (CDP) endpoints
+  - Introduced user data directory support for persistent browser profiles
+
+- **Enhanced HTML Processing**
+  - Added HTML tag preservation feature during markdown conversion
+  - Introduced configurable tag preservation system
+  - Improved pre-tag and code block handling
+  - Added support for nested preserved tags with attribute retention
+
+### Improvements
+- **Browser Handling**
+  - Added flag to ignore body visibility for problematic pages
+  - Improved browser process cleanup and management
+  - Enhanced temporary directory handling for browser profiles
+  - Added configurable browser launch arguments
+
+- **Database Management**
+  - Implemented connection pooling for better performance
+  - Added retry logic for database operations
+  - Improved error handling and logging
+  - Enhanced cleanup procedures for database connections
+
+- **Resource Management**
+  - Added memory and CPU monitoring
+  - Implemented dynamic task slot allocation based on system resources
+  - Added configurable cleanup intervals
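Note on the "retry logic for database operations" item above: the general pattern can be sketched as follows. This is a minimal illustration using the standard-library `sqlite3` module, not the code in `async_database.py`. The function name `execute_with_retry` and its parameters are hypothetical.

```python
import sqlite3
import time


def execute_with_retry(db_path, sql, params=(), max_retries=3, base_delay=0.1):
    """Run one statement, retrying on sqlite3.OperationalError with exponential backoff.

    Hypothetical helper for illustration only. Production code would usually
    retry only transient errors such as "database is locked".
    """
    for attempt in range(max_retries):
        try:
            conn = sqlite3.connect(db_path)
            try:
                with conn:  # commits on success, rolls back on error
                    return conn.execute(sql, params).fetchall()
            finally:
                conn.close()
        except sqlite3.OperationalError:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the original error
            time.sleep(base_delay * (2 ** attempt))
```

Opening a fresh connection per attempt keeps the sketch short. A connection pool, as the changelog describes, would hand out an existing connection instead.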
+
+### Technical Improvements
+- **Code Structure**
+  - Moved version management to dedicated _version.py file
+  - Improved error handling throughout the codebase
+  - Enhanced logging system with better error reporting
+  - Reorganized core components for better maintainability
+
+### Bug Fixes
+- Fixed issues with browser process termination
+- Improved handling of connection timeouts
+- Enhanced error recovery in database operations
+- Fixed memory leaks in long-running processes
+
+### Dependencies
+- Updated Playwright to v1.47
+- Updated core dependencies with more flexible version constraints
+- Added new development dependencies for testing
+
+### Breaking Changes
+- Changed default browser handling behavior
+- Modified database connection management approach
+- Updated API response structure for better consistency
+
+## Migration Guide
+When upgrading to v0.3.73, be aware of the following changes:
+
+1. Docker Deployment:
+   - Review Docker documentation for new deployment options
+   - Update environment configurations as needed
+   - Check container communication settings
+
+2. If using custom browser management:
+   - Update browser initialization code to use new ManagedBrowser class
+   - Review browser cleanup procedures
+
+3. For database operations:
+   - Check custom database queries for compatibility with new connection pooling
+   - Update error handling to work with new retry logic
+
+4. Using the Doctor:
+   - Run doctor command for system diagnostics: `crawl4ai doctor`
+   - Review generated reports for potential issues
+   - Follow recommended fixes for any identified problems
+
+
+## [2024-11-04 - 13:21:42] Comprehensive Update of Crawl4AI Features and Dependencies
+This commit introduces several key enhancements, including improved error handling and robust database operations in `async_database.py`, which now features a connection pool and retry logic for better reliability. Updates to the README.md provide clearer instructions and a better user experience with links to documentation sections. The `.gitignore` file has been refined to include additional directories, and the async web crawler now uses a managed browser for more efficient crawling. Multiple dependency updates and the introduction of the `CustomHTML2Text` class also enhance text extraction capabilities.
+
+## [v0.3.73] - 2024-10-24

 ### Added
- 🔧 Separate Crawl and Extract JSON Semantic Chunk: Enhancing flexibility and efficiency in large-scale web crawling tasks.
- 🔍 Colab Integration: Exploring integration with Google Colab for easy experimentation in a collaborative notebook environment.
- 🎯 XPath and CSS Selector Support: Adding support for selective retrieval of specific elements from web pages.
- 📷 Image Captioning: Incorporating image captioning capabilities to extract meaningful descriptions from images.
- 💾 Embedding Data Generation and Storage: Developing functionalities to generate and store embedding data for each crawled website.
- 🔍 Semantic Search Engine: Building a semantic search engine that fetches content, performs vector search similarity, and generates labeled chunk data based on user queries and URLs.
+- preserve_tags: Added support for preserving specific HTML tags during markdown conversion.
+- Smart overlay removal system in AsyncPlaywrightCrawlerStrategy:
+  - Automatic removal of popups, modals, and cookie notices
+  - Detection and removal of fixed/sticky position elements
+  - Cleaning of empty block elements
+  - Configurable via `remove_overlay_elements` parameter
+- Enhanced screenshot capabilities:
+  - Added `screenshot_wait_for` parameter to control timing
+  - Improved screenshot handling with existing page context
+  - Better error handling with fallback error images
+- New URL normalization utilities:
+  - `normalize_url` function for consistent URL formatting
+  - `is_external_url` function for better link classification
+- Custom base directory support for cache storage:
+  - New `base_directory` parameter in AsyncWebCrawler
+  - Allows specifying alternative locations for `.crawl4ai` folder

-### Changed
- None
-
-### Deprecated
- None
-
-### Removed
- None
+### Enhanced
+- Link handling improvements:
+  - Better duplicate link detection
+  - Enhanced internal/external link classification
+  - Improved handling of special URL protocols
+  - Support for anchor links and protocol-relative URLs
+- Configuration refinements:
+  - Streamlined social media domain list
+  - More focused external content filtering
+- LLM extraction strategy:
+  - Added support for separate API base URL via `api_base` parameter
+  - Better handling of base URLs in configuration

 ### Fixed
- None
+- Screenshot functionality:
+  - Resolved issues with screenshot timing and context
+  - Improved error handling and recovery
+- Link processing:
+  - Fixed URL normalization edge cases
+  - Better handling of invalid URLs
+  - Improved error messages for link processing failures

-### Security
- None
+### Developer Notes
+- The overlay removal system uses advanced JavaScript injection for better compatibility
+- URL normalization handles special cases like mailto:, tel:, and protocol-relative URLs
+- Screenshot system now reuses existing page context for better performance
+- Link processing maintains separate dictionaries for internal and external links to ensure uniqueness
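Note on the URL utilities above: the behaviour described for `normalize_url` and `is_external_url` can be approximated with `urllib.parse`. This is a sketch under assumptions. The function names come from the changelog, but the signatures and bodies below are illustrative and may differ from the library's.

```python
from urllib.parse import urljoin, urlparse


def normalize_url(href, base_url):
    """Resolve href against base_url; special schemes are returned unchanged."""
    href = href.strip()
    if href.lower().startswith(("mailto:", "tel:", "javascript:")):
        return href
    # urljoin handles relative paths and protocol-relative URLs (//host/path).
    return urljoin(base_url, href)


def is_external_url(url, base_domain):
    """True when url has a host that is neither base_domain nor one of its subdomains."""
    host = urlparse(url).hostname
    if not host:
        return False  # relative link or special scheme: not treated as external
    base = base_domain.lower()
    return not (host == base or host.endswith("." + base))
```

Keeping internal and external links in separate dictionaries keyed by the normalized URL, as the note above describes, would then deduplicate them automatically.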

-## [1.0.0] - YYYY-MM-DD
- Initial release
+## [v0.3.72] - 2024-10-22
+
+### Added
+- New `ContentCleaningStrategy` class:
+  - Smart content extraction based on text density and element scoring
+  - Automatic removal of boilerplate content
+  - DOM tree analysis for better content identification
+  - Configurable thresholds for content detection
+- Advanced proxy support:
+  - Added `proxy_config` option for authenticated proxy connections
+  - Support for username/password in proxy configuration
+- New content output formats:
+  - `fit_markdown`: Optimized markdown output with main content focus
+  - `fit_html`: Clean HTML with only essential content
+
+### Enhanced
+- Image source detection:
+  - Support for multiple image source attributes (`src`, `data-src`, `srcset`, etc.)
+  - Automatic fallback through potential source attributes
+  - Smart handling of srcset attribute
+- External content handling:
+  - Made external link exclusion optional (disabled by default)
+  - Improved detection and handling of social media links
+  - Better control over external image filtering
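Note on the image source detection above: the fallback through candidate attributes might look like the sketch below. The helper name `best_image_source` and the attribute order are assumptions, not the library's implementation, and the `srcset` parsing is simplified: it splits on commas, so URLs that contain commas are not handled.

```python
def best_image_source(attrs):
    """Return the first usable image URL from a tag's attributes, or None.

    `attrs` is a plain dict of attribute name to value (hypothetical input shape).
    """
    # Try direct attributes first, in priority order.
    for name in ("src", "data-src"):
        value = attrs.get(name, "").strip()
        if value:
            return value

    # Fall back to srcset and pick the widest candidate ("url 640w" entries).
    candidates = []
    for entry in attrs.get("srcset", "").split(","):
        parts = entry.split()
        if not parts:
            continue
        width = 0
        if len(parts) > 1 and parts[1].endswith("w") and parts[1][:-1].isdigit():
            width = int(parts[1][:-1])
        candidates.append((width, parts[0]))
    if candidates:
        return max(candidates, key=lambda c: c[0])[1]
    return None
```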
+
+### Fixed
+- Image extraction reliability with multiple source attribute checks
+- External link and image handling logic for better accuracy
+
+### Developer Notes
+- The new `ContentCleaningStrategy` uses configurable thresholds for customization
+- Proxy configuration now supports more complex authentication scenarios
+- Content extraction process now provides both regular and optimized outputs
+
+## [v0.3.72] - 2024-10-20
+
+### Fixed
+- Added support for parsing Base64 encoded images in WebScrappingStrategy
+
+### Added
+- Forked and integrated a customized version of the html2text library for more control over Markdown generation
+- New configuration options for controlling external content:
+  - Ability to exclude all external links
+  - Option to specify domains to exclude (default includes major social media platforms)
+  - Control over excluding external images
+
+### Changed
+- Improved Markdown generation process:
+  - Added fine-grained control over character escaping in Markdown output
+  - Enhanced handling of code blocks and pre-formatted text
+- Updated `AsyncPlaywrightCrawlerStrategy.close()` method to use a shorter sleep time (0.5 seconds instead of 500)
+- Enhanced flexibility in `CosineStrategy` with a more generic `load_HF_embedding_model` function
+
+### Improved
+- Optimized content scraping and processing for better efficiency
+- Enhanced error handling and logging in various components
+
+### Developer Notes
+- The customized html2text library is now located within the crawl4ai package
+- New configuration options are available in the `config.py` file for external content handling
+- The `WebScrappingStrategy` class has been updated to accommodate new external content exclusion options
+
+## [v0.3.71] - 2024-10-19
+
+### Added
+- New chunking strategies:
+  - `OverlappingWindowChunking`: Allows for overlapping chunks of text, useful for maintaining context between chunks.
+  - Enhanced `SlidingWindowChunking`: Improved to handle edge cases and last chunks more effectively.
+
+### Changed
+- Updated `CHUNK_TOKEN_THRESHOLD` in config to 2048 tokens (2^11) for better compatibility with most LLM models.
+- Improved `AsyncPlaywrightCrawlerStrategy.close()` method to use a shorter sleep time (0.5 seconds instead of 500), significantly reducing wait time when closing the crawler.
+- Enhanced flexibility in `CosineStrategy`:
+  - Now uses a more generic `load_HF_embedding_model` function, allowing for easier swapping of embedding models.
+- Updated `JsonCssExtractionStrategy` and `JsonXPATHExtractionStrategy` for better JSON-based extraction.
+
+### Fixed
+- Addressed potential issues with the sliding window chunking strategy to ensure all text is properly chunked.
+
+### Developer Notes
+- Added more comprehensive docstrings to chunking strategies for better code documentation.
+- Removed hardcoded device setting in `CosineStrategy`, now using the automatically detected device.
+- Added a new example in `quickstart_async.py` for generating a knowledge graph from crawled content.
+
+These updates aim to provide more flexibility in text processing, improve performance, and enhance the overall capabilities of the crawl4ai library. The new chunking strategies, in particular, offer more options for handling large texts in various scenarios.
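Note on the chunking strategies above: the idea behind an overlapping window can be sketched as follows. This is an illustration of the technique, not the `OverlappingWindowChunking` class itself. The function name and its word-based parameters are assumptions.

```python
def overlapping_window_chunks(text, window_size=100, overlap=20):
    """Split text into word windows; each chunk repeats the last `overlap` words of the previous one."""
    if window_size <= 0 or not 0 <= overlap < window_size:
        raise ValueError("need window_size > 0 and 0 <= overlap < window_size")

    words = text.split()
    step = window_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window_size]))
        if start + window_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The early `break` avoids emitting a trailing chunk that would only contain words already covered by the previous window.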
+
+## [v0.3.71] - 2024-10-18
+
+### Changes
+1. **Version Update**:
+   - Updated version number from 0.3.7 to 0.3.71.
+
+2. **Crawler Enhancements**:
+   - Added `sleep_on_close` option to AsyncPlaywrightCrawlerStrategy for delayed browser closure.
+   - Improved context creation with additional options:
+     - Enabled `accept_downloads` and `java_script_enabled`.
+     - Added a cookie to enable cookies by default.
+
+3. **Error Handling Improvements**:
+   - Enhanced error messages in AsyncWebCrawler's `arun` method.
+   - Updated error reporting format for better visibility and consistency.
+
+4. **Performance Optimization**:
+   - Commented out automatic page and context closure in `crawl` method to potentially improve performance in certain scenarios.
+
+### Documentation
+- Updated quickstart notebook:
+  - Changed installation command to use the released package instead of GitHub repository.
+  - Updated kernel display name.
+
+### Developer Notes
+- Minor code refactoring and cleanup.
+
+## [v0.3.7] - 2024-10-17
+
+### New Features
+1. **Enhanced Browser Stealth**: 
+   - Implemented `playwright_stealth` for improved bot detection avoidance.
+   - Added `StealthConfig` for fine-tuned control over stealth parameters.
+
+2. **User Simulation**:
+   - New `simulate_user` option to mimic human-like interactions (mouse movements, clicks, keyboard presses).
+
+3. **Navigator Override**:
+   - Added `override_navigator` option to modify navigator properties, further improving bot detection evasion.
+
+4. **Improved iframe Handling**:
+   - New `process_iframes` parameter to extract and integrate iframe content into the main page.
+
+5. **Flexible Browser Selection**:
+   - Support for choosing between Chromium, Firefox, and WebKit browsers.
+
+6. **Include Links in Markdown**:
+    - Added support for including links in Markdown content by defining a new flag, `include_links_on_markdown`, in the `crawl` method.
+
+### Improvements
+1. **Better Error Handling**:
+   - Enhanced error reporting in WebScrappingStrategy with detailed error messages and suggestions.
+   - Added console message and error logging for better debugging.
+
+2. **Image Processing Enhancements**:
+   - Improved image dimension updating and filtering logic.
+
+3. **Crawling Flexibility**:
+   - Added support for custom viewport sizes.
+   - Implemented delayed content retrieval with `delay_before_return_html` parameter.
+
+4. **Performance Optimization**:
+   - Adjusted default semaphore count for parallel crawling.
+
+### Bug Fixes
+- Fixed an issue where the HTML content could be empty after processing.
+
+### Examples
+- Added new example `crawl_with_user_simulation()` demonstrating the use of user simulation and navigator override features.
+
+### Developer Notes
+- Refactored code for better maintainability and readability.
+- Updated browser launch arguments for improved compatibility and performance.
+
+## [v0.3.6] - 2024-10-12 
+
+### 1. Improved Crawling Control
+- **New Hook**: Added `before_retrieve_html` hook in `AsyncPlaywrightCrawlerStrategy`.
+- **Delayed HTML Retrieval**: Introduced `delay_before_return_html` parameter to allow waiting before retrieving HTML content.
+  - Useful for pages with delayed content loading.
+- **Flexible Timeout**: `smart_wait` function now uses `page_timeout` (default 60 seconds) instead of a fixed 30-second timeout.
+  - Provides better handling for slow-loading pages.
+- **How to use**: Set `page_timeout=your_desired_timeout` (in milliseconds) when calling `crawler.arun()`.
+
+### 2. Browser Type Selection
+- Added support for different browser types (Chromium, Firefox, WebKit).
+- Users can now specify the browser type when initializing AsyncWebCrawler.
+- **How to use**: Set `browser_type="firefox"` or `browser_type="webkit"` when initializing AsyncWebCrawler.
+
+### 3. Screenshot Capture
+- Added ability to capture screenshots during crawling.
+- Useful for debugging and content verification.
+- **How to use**: Set `screenshot=True` when calling `crawler.arun()`.
+
+### 4. Enhanced LLM Extraction Strategy
+- Added support for multiple LLM providers (OpenAI, Hugging Face, Ollama).
+- **Custom Arguments**: Added support for passing extra arguments to LLM providers via `extra_args` parameter.
+- **Custom Headers**: Users can now pass custom headers to the extraction strategy.
+- **How to use**: Specify the desired provider and custom arguments when using `LLMExtractionStrategy`.
+
+### 5. iframe Content Extraction
+- New feature to process and extract content from iframes.
+- **How to use**: Set `process_iframes=True` in the crawl method.
+
+### 6. Delayed Content Retrieval
+- Introduced `get_delayed_content` method in `AsyncCrawlResponse`.
+- Allows retrieval of content after a specified delay, useful for dynamically loaded content.
+- **How to use**: Access `result.get_delayed_content(delay_in_seconds)` after crawling.
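Note on the "How to use" items above: taken together, they suggest a call along the following lines. This is a sketch, assuming the parameter names exactly as listed in this changelog entry and the `async with` usage of `AsyncWebCrawler`. The helper `build_crawl_options`, the 90-second timeout, and the assumption that `delay_before_return_html` is given in seconds are illustrative, not confirmed by the changelog.

```python
import asyncio


def build_crawl_options(timeout_ms=90_000):
    """Collect the per-call options named in the v0.3.6 notes (hypothetical helper)."""
    return {
        "page_timeout": timeout_ms,       # milliseconds, per the changelog
        "screenshot": True,               # capture a screenshot during the crawl
        "process_iframes": True,          # pull iframe content into the page
        "delay_before_return_html": 2.0,  # assumed to be seconds
    }


async def crawl(url):
    # Imported here so the sketch can be loaded without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler

    # browser_type is set at initialization, per the changelog.
    async with AsyncWebCrawler(browser_type="firefox") as crawler:
        return await crawler.arun(url=url, **build_crawl_options())
```

Running it would be `asyncio.run(crawl("https://example.com"))`. That call needs crawl4ai and Playwright's Firefox build installed.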
+
+## Improvements and Optimizations
+
+### 1. AsyncWebCrawler Enhancements
+- **Flexible Initialization**: Now accepts arbitrary keyword arguments, passed directly to the crawler strategy.
+- Allows for more customized setups.
+
+### 2. Image Processing Optimization
+- Enhanced image handling in WebScrappingStrategy.
+- Added filtering for small, invisible, or irrelevant images.
+- Improved image scoring system for better content relevance.
+- Implemented JavaScript-based image dimension updating for more accurate representation.
+
+### 3. Database Schema Auto-updates
+- Automatic database schema updates ensure compatibility with the latest version.
+
+### 4. Enhanced Error Handling and Logging
+- Improved error messages and logging for easier debugging.
+
+### 5. Content Extraction Refinements
+- Refined HTML sanitization process.
+- Improved handling of base64 encoded images.
+- Enhanced Markdown conversion process.
+- Optimized content extraction algorithms.
+
+### 6. Utility Function Enhancements
+- `perform_completion_with_backoff` function now supports additional arguments for more customized API calls to LLM providers.
+
+## Bug Fixes
+- Fixed an issue where image tags were being prematurely removed during content extraction.
+
+## Examples and Documentation
+- Updated `quickstart_async.py` with examples of:
+  - Using custom headers in LLM extraction.
+  - Different LLM provider usage (OpenAI, Hugging Face, Ollama).
+  - Custom browser type usage.
+
+## Developer Notes
+- Refactored code for better maintainability, flexibility, and performance.
+- Enhanced type hinting throughout the codebase for improved development experience.
+- Expanded error handling for more robust operation.
+
+These updates significantly enhance the flexibility, accuracy, and robustness of crawl4ai, providing users with more control and options for their web crawling and content extraction tasks.
+
+## [v0.3.5] - 2024-09-02
+
+Enhance AsyncWebCrawler with smart waiting and screenshot capabilities
+
+- Implement smart_wait function in AsyncPlaywrightCrawlerStrategy
+- Add screenshot support to AsyncCrawlResponse and AsyncWebCrawler
+- Improve error handling and timeout management in crawling process
+- Fix typo in CrawlResult model (responser_headers -> response_headers)
+
+## [v0.2.77] - 2024-08-04
+
+Significant improvements in text processing and performance:
+
+- 🚀 **Dependency reduction**: Removed dependency on spaCy model for text chunk labeling in cosine extraction strategy.
+- 🤖 **Transformer upgrade**: Implemented text sequence classification using a transformer model for labeling text chunks.
+- ⚡ **Performance enhancement**: Improved model loading speed due to removal of spaCy dependency.
+- 🔧 **Future-proofing**: Laid groundwork for potential complete removal of spaCy dependency in future versions.
+
+These changes address issue #68 and provide a foundation for faster, more efficient text processing in Crawl4AI.
+
+## [v0.2.76] - 2024-08-02
+
+Major improvements in functionality, performance, and cross-platform compatibility! 🚀
+
+- 🐳 **Docker enhancements**: Significantly improved Dockerfile for easy installation on Linux, Mac, and Windows.
+- 🌐 **Official Docker Hub image**: Launched our first official image on Docker Hub for streamlined deployment.
+- 🔧 **Selenium upgrade**: Removed dependency on ChromeDriver, now using Selenium's built-in capabilities for better compatibility.
+- 🖼️ **Image description**: Implemented ability to generate textual descriptions for extracted images from web pages.
+- ⚡ **Performance boost**: Various improvements to enhance overall speed and performance.
+
+A big shoutout to our amazing community contributors:
+- [@aravindkarnam](https://github.com/aravindkarnam) for developing the textual description extraction feature.
+- [@FractalMind](https://github.com/FractalMind) for creating the first official Docker Hub image and fixing Dockerfile errors.
+- [@ketonkss4](https://github.com/ketonkss4) for identifying Selenium's new capabilities, helping us reduce dependencies.
+
+Your contributions are driving Crawl4AI forward! 🙌
+
+## [v0.2.75] - 2024-07-19
+
+Minor improvements for a more maintainable codebase:
+
+- 🔄 Fixed typos in `chunking_strategy.py` and `crawler_strategy.py` to improve code readability
+- 🔄 Removed `.test_pads/` directory from `.gitignore` to keep our repository clean and organized
+
+These changes may seem small, but they contribute to a more stable and sustainable codebase. By fixing typos and updating our `.gitignore` settings, we're ensuring that our code is easier to maintain and scale in the long run.
+
+## [v0.2.74] - 2024-07-08
+A slew of exciting updates to improve the crawler's stability and robustness! 🎉
+
+- 💻 **UTF encoding fix**: Resolved the Windows "charmap" error by adding UTF encoding.
+- 🛡️ **Error handling**: Implemented MaxRetryError exception handling in LocalSeleniumCrawlerStrategy.
+- 🧹 **Input sanitization**: Improved input sanitization and handled encoding issues in LLMExtractionStrategy.
+- 🚮 **Database cleanup**: Removed existing database file and initialized a new one.
+
+
+## [v0.2.73] - 2024-07-03
+
+💡 In this release, we've bumped the version to v0.2.73 and refreshed our documentation to ensure you have the best experience with our project.
+
+* Added support for websites that need "with-head" (non-headless) mode to be crawled.
+* Fixed installation issues in setup.py and the Dockerfile.
+* Resolved multiple issues.
+
+## [v0.2.72] - 2024-06-30
+
+This release brings exciting updates and improvements to our project! 🎉
+
+* 📚 **Documentation Updates**: Our documentation has been revamped to reflect the latest changes and additions.
+* 🚀 **New Modes in setup.py**: We've added support for three new modes in setup.py: default, torch, and transformers. This enhances the project's flexibility and usability.
+* 🐳 **Docker File Updates**: The Docker file has been updated to ensure seamless compatibility with the new modes and improvements.
+* 🕷️ **Temporary Solution for Headless Crawling**: We've implemented a temporary solution to overcome issues with crawling websites in headless mode.
+
+These changes aim to improve the overall user experience, provide more flexibility, and enhance the project's performance. We're thrilled to share these updates with you and look forward to continuing to evolve and improve our project!
+
+## [0.2.71] - 2024-06-26
+
+**Improved Error Handling and Performance** 🚧
+
+* 🚫 Refactored `crawler_strategy.py` to handle exceptions and provide better error messages, making it more robust and reliable.
+* 💻 Optimized the `get_content_of_website_optimized` function in `utils.py` for improved performance, reducing potential bottlenecks.
+* 💻 Updated `utils.py` with the latest changes, ensuring consistency and accuracy.
+* 🚫 Migrated to `ChromeDriverManager` to resolve Chrome driver download issues, providing a smoother user experience.
+
+These changes focus on refining the existing codebase, resulting in a more stable, efficient, and user-friendly experience. With these improvements, you can expect fewer errors and better performance in the crawler strategy and utility functions.
+
+## [0.2.71] - 2024-06-25
+### Fixed
+- Made the extraction function twice as fast.
+
+
+## [0.2.6] - 2024-06-22
+### Fixed
+- Fix issue #19: Update Dockerfile to ensure compatibility across multiple platforms.
+
+## [0.2.5] - 2024-06-18
+### Added
+- Added five important hooks to the crawler:
+  - on_driver_created: Called when the driver is ready for initializations.
+  - before_get_url: Called right before Selenium fetches the URL.
+  - after_get_url: Called after Selenium fetches the URL.
+  - before_return_html: Called when the data is parsed and ready.
+  - on_user_agent_updated: Called when the user changes the user_agent, causing the driver to reinitialize.
+- Added an example in `quickstart.py` in the example folder under the docs.
+- Enhancement issue #24: Replaced inline HTML tags (e.g., DEL, INS, SUB, ABBR) with textual format for better context handling in LLM.
+- Maintaining the semantic context of inline tags (e.g., abbreviation, DEL, INS) for improved LLM-friendliness.
+- Updated Dockerfile to ensure compatibility across multiple platforms (Hopefully!).
+
+## [0.2.4] - 2024-06-17
+### Fixed
+- Fix issue #22: Use MD5 hash for caching HTML files to handle long URLs
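Note on the issue #22 fix above: hashing the URL gives a fixed-length file name no matter how long the URL is. A minimal sketch of the idea follows. The function name `cache_path_for`, the cache directory, and the `.html` suffix are assumptions, not the library's actual layout.

```python
import hashlib
from pathlib import Path


def cache_path_for(url, cache_dir=".crawl4ai/cache"):
    """Map a URL of any length to a fixed-length cache file path.

    MD5 is used here only as a compact, stable file-name key, not for security.
    """
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()  # always 32 hex characters
    return Path(cache_dir) / f"{digest}.html"
```

A URL with a query string thousands of characters long still maps to a 37-character file name, which stays well under typical file-system limits.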
--- a/CONTRIBUTORS.md
+++ b/CONTRIBUTORS.md
@@ -0,0 +1,32 @@
+# Contributors to Crawl4AI
+
+We would like to thank the following people for their contributions to Crawl4AI:
+
+## Core Team
+
+- [Unclecode](https://github.com/unclecode) - Project Creator and Main Developer
+- [Nasrin](https://github.com/ntohidi) - Project Manager and Developer
+- [Aravind Karnam](https://github.com/aravindkarnam) - Developer
+
+## Community Contributors
+
+- [FractalMind](https://github.com/FractalMind) - Created the first official Docker Hub image and fixed Dockerfile errors
+- [ketonkss4](https://github.com/ketonkss4) - Identified Selenium's new capabilities, helping reduce dependencies
+- [jonymusky](https://github.com/jonymusky) - JavaScript execution documentation and `wait_for`
+- [datehoer](https://github.com/datehoer) - Added browser proxy support
+
+## Other Contributors
+
+- [Gokhan](https://github.com/gkhngyk) 
+- [Shiv Kumar](https://github.com/shivkumar0757)
+- [QIN2DIM](https://github.com/QIN2DIM)
+
+## Acknowledgements
+
+We also want to thank all the users who have reported bugs, suggested features, or helped in any other way to make Crawl4AI better.
+
+---
+
+If you've contributed to Crawl4AI and your name isn't on this list, please [open a pull request](https://github.com/unclecode/crawl4ai/pulls) with your name, link, and contribution, and we'll review it promptly.
+
+Thank you all for your contributions!
--- a/139
+++ b/139
@@ -1,40 +1,121 @@
-# Use an official Python runtime as a parent image
-FROM python:3.10-slim
+# syntax=docker/dockerfile:1.4

-# Set the working directory in the container
-WORKDIR /usr/src/app
+# Build arguments
+ARG PYTHON_VERSION=3.10

-# Copy the current directory contents into the container at /usr/src/app
-COPY . .
+# Base stage with system dependencies
+FROM python:${PYTHON_VERSION}-slim as base

-# Install any needed packages specified in requirements.txt
-RUN pip install --no-cache-dir -r requirements.txt
+# Declare ARG variables again within the build stage
+ARG INSTALL_TYPE=all
+ARG ENABLE_GPU=false

-# Install dependencies for Chrome and ChromeDriver
+# Platform-specific labels
+LABEL maintainer="unclecode"
+LABEL description="Crawl4AI - Advanced Web Crawler with AI capabilities"
+LABEL version="1.0"
+
+# Environment setup
+ENV PYTHONUNBUFFERED=1 \
+    PYTHONDONTWRITEBYTECODE=1 \
+    PIP_NO_CACHE_DIR=1 \
+    PIP_DISABLE_PIP_VERSION_CHECK=1 \
+    PIP_DEFAULT_TIMEOUT=100 \
+    DEBIAN_FRONTEND=noninteractive
+
+# Install system dependencies
 RUN apt-get update && apt-get install -y --no-install-recommends \
-    wget \
-    xvfb \
-    unzip \
+    build-essential \
    curl \
-    gnupg2 \
-    ca-certificates \
-    apt-transport-https \
-    software-properties-common \
-    && wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add - \
-    && echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list \
-    && apt-get update \
-    && apt-get install -y google-chrome-stable \
+    wget \
+    gnupg \
+    git \
+    cmake \
+    pkg-config \
+    python3-dev \
+    libjpeg-dev \
+    libpng-dev \
    && rm -rf /var/lib/apt/lists/*

-# Set display port and dbus env to avoid hanging
-ENV DISPLAY=:99
-ENV DBUS_SESSION_BUS_ADDRESS=/dev/null
+# Playwright system dependencies for Linux
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    libglib2.0-0 \
+    libnss3 \
+    libnspr4 \
+    libatk1.0-0 \
+    libatk-bridge2.0-0 \
+    libcups2 \
+    libdrm2 \
+    libdbus-1-3 \
+    libxcb1 \
+    libxkbcommon0 \
+    libx11-6 \
+    libxcomposite1 \
+    libxdamage1 \
+    libxext6 \
+    libxfixes3 \
+    libxrandr2 \
+    libgbm1 \
+    libpango-1.0-0 \
+    libcairo2 \
+    libasound2 \
+    libatspi2.0-0 \
+    && rm -rf /var/lib/apt/lists/*

-# Make port 80 available to the world outside this container
-EXPOSE 80
+# GPU support if enabled
+RUN if [ "$ENABLE_GPU" = "true" ] ; then \
+    apt-get update && apt-get install -y --no-install-recommends \
+    nvidia-cuda-toolkit \
+    && rm -rf /var/lib/apt/lists/* ; \
+    fi

-# Define environment variable
-ENV PYTHONUNBUFFERED 1
+# Create and set working directory
+WORKDIR /app

-# Run uvicorn
-CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "80", "--workers", "4"]
+# Copy the entire project
+COPY . .
+
+# Install base requirements
+RUN pip install --no-cache-dir -r requirements.txt
+
+# Install required library for FastAPI
+RUN pip install fastapi uvicorn psutil
+
+# Install ML dependencies first for better layer caching
+RUN if [ "$INSTALL_TYPE" = "all" ] ; then \
+        pip install --no-cache-dir \
+            torch \
+            torchvision \
+            torchaudio \
+            scikit-learn \
+            nltk \
+            transformers \
+            tokenizers && \
+        python -m nltk.downloader punkt stopwords ; \
+    fi
+
+# Install the package
+RUN if [ "$INSTALL_TYPE" = "all" ] ; then \
+        pip install -e ".[all]" && \
+        python -m crawl4ai.model_loader ; \
+    elif [ "$INSTALL_TYPE" = "torch" ] ; then \
+        pip install -e ".[torch]" ; \
+    elif [ "$INSTALL_TYPE" = "transformer" ] ; then \
+        pip install -e ".[transformer]" && \
+        python -m crawl4ai.model_loader ; \
+    else \
+        pip install -e "." ; \
+    fi
+
+# Install Playwright and browsers
+RUN playwright install
+
+# Health check (must target the same port uvicorn listens on)
+HEALTHCHECK --interval=30s --timeout=30s --start-period=5s --retries=3 \
+    CMD curl -f http://localhost:11235/health || exit 1
+
+# Expose port
+EXPOSE 11235
+
+# Start the FastAPI server
+CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "11235"]
--- a/MANIFEST.in
+++ b/MANIFEST.in
@@ -0,0 +1 @@
+include requirements.txt
--- a/MISSION.md
+++ b/MISSION.md
@@ -0,0 +1,46 @@
+# Mission
+
+![Mission Diagram](./docs/assets/pitch-dark.svg)
+
+### 1. The Data Capitalization Opportunity
+
+We live in an unprecedented era of digital wealth creation. Every day, individuals and enterprises generate massive amounts of valuable digital footprints across various platforms, social media channels, messenger apps, and cloud services. While people can interact with their data within these platforms, there's an immense untapped opportunity to transform this data into true capital assets. Just as physical property became a foundational element of wealth creation, personal and enterprise data has the potential to become a new form of capital on balance sheets.
+
+For individuals, this represents an opportunity to transform their digital activities into valuable assets. For enterprises, their internal communications, team discussions, and collaborative documents contain rich insights that could be structured and valued as intellectual capital. This wealth of information represents an unprecedented opportunity for value creation in the digital age.
+
+### 2. The Potential of Authentic Data
+
+While synthetic data has played a crucial role in AI development, there's an enormous untapped potential in the authentic data generated by individuals and organizations. Every message, document, and interaction contains unique insights and patterns that could enhance AI development. The challenge isn't a lack of data - it's that most authentic human-generated data remains inaccessible for productive use.
+
+By enabling willing participation in data sharing, we can unlock this vast reservoir of authentic human knowledge. This represents an opportunity to enhance AI development with diverse, real-world data that reflects the full spectrum of human experience and knowledge.
+
+## Our Pathway to Data Democracy
+
+### 1. Open-Source Foundation
+
+Our first step is creating an open-source data extraction engine that empowers developers and innovators to build tools for data structuring and organization. This foundation ensures transparency, security, and community-driven development. By making these tools openly available, we enable the technical infrastructure needed for true data ownership and capitalization.
+
+### 2. Data Capitalization Platform
+
+Building on this open-source foundation, we're developing a platform that helps individuals and enterprises transform their digital footprints into structured, valuable assets. This platform will provide the tools and frameworks needed to organize, understand, and value personal and organizational data as true capital assets.
+
+### 3. Creating a Data Marketplace
+
+The final piece is establishing a marketplace where individuals and organizations can willingly share their data assets. This creates opportunities for:
+- Individuals to earn equity, revenue, or other forms of value from their data
+- Enterprises to access diverse, high-quality data for AI development
+- Researchers to work with authentic human-generated data
+- Startups to build innovative solutions using real-world data
+
+## Economic Vision: A Shared Data Economy
+
+We envision a future where data becomes a fundamental asset class in a thriving shared economy. This transformation will democratize AI development by enabling willing participation in data sharing, ensuring that the benefits of AI advancement flow back to data creators. Just as property rights revolutionized economic systems, establishing data as a capital asset will create new opportunities for wealth creation and economic participation.
+
+This shared data economy will:
+- Enable individuals to capitalize on their digital footprints
+- Create new revenue streams for data creators
+- Provide AI developers with access to diverse, authentic data
+- Foster innovation through broader access to real-world data
+- Ensure more equitable distribution of AI's economic benefits
+
+Our vision is to facilitate this transformation from the ground up - starting with open-source tools, progressing to data capitalization platforms, and ultimately creating a thriving marketplace where data becomes a true asset class in a shared economy. This approach ensures that the future of AI is built on a foundation of authentic human knowledge, with benefits flowing back to the individuals and organizations who create and share their valuable data.
--- a/README.md
+++ b/README.md
@@ -1,513 +1,441 @@
-# Crawl4AI 🕷️🤖
+# 🔥🕷️ Crawl4AI: LLM Friendly Web Crawler & Scraper
+
+<a href="https://trendshift.io/repositories/11716" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11716" alt="unclecode%2Fcrawl4ai | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>

 [![GitHub Stars](https://img.shields.io/github/stars/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/stargazers)
+![PyPI - Downloads](https://img.shields.io/pypi/dm/Crawl4AI)
 [![GitHub Forks](https://img.shields.io/github/forks/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/network/members)
 [![GitHub Issues](https://img.shields.io/github/issues/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/issues)
 [![GitHub Pull Requests](https://img.shields.io/github/issues-pr/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/pulls)
 [![License](https://img.shields.io/github/license/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)

-Crawl4AI has one clear task: to simplify crawling and extract useful information from web pages, making it accessible for large language models (LLMs) and AI applications. 🆓🌐
+Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. 🆓🌐

-<<<<<<< HEAD
-## 🚀 New Changes Will be Released Soon
+## 🌟 Meet the Crawl4AI Assistant: Your Copilot for Crawling

- 🚀 10x faster!!
- 📜 Execute custome JavaScript before crawling!
- 🤝 Colab friendly!
- 📚 Chunking strategies: topic-based, regex, sentence, and more!
- 🧠 Extraction strategies: cosine clustering, LLM, and more!
- 🎯 CSS selector support
- 📝 Pass instructions/keywords to refine extraction
+Use the [Crawl4AI GPT Assistant](https://tinyurl.com/crawl4ai-gpt) as your AI-powered copilot! With this assistant, you can:

-## 🚧 Work in Progress 👷‍♂️
+- 🧑‍💻 Generate code for complex crawling and extraction tasks
+- 💡 Get tailored support and examples
+- 📘 Learn Crawl4AI faster with step-by-step guidance

- 📷 Image Captioning: Incorporating image captioning capabilities to extract descriptions from images.
- 💾 Embedding Vector Data: Generate and store embedding data for each crawled website.
- 🔍 Semantic Search Engine: Building a semantic search engine that fetches content, performs vector search similarity, and generates labeled chunk data based on user queries and URLs.
-=======
-[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1wz8u30rvbq6Scodye9AGCw8Qg_Z8QGsk)
+## New in 0.3.72 ✨

-## Recent Changes
+- 📄 Fit markdown generation for extracting main article content.
+- 🪄 Magic mode for comprehensive anti-bot detection bypass.
+- 🌐 Enhanced multi-browser support with seamless switching (Chromium, Firefox, WebKit)
+- 📚 New chunking strategies (sliding window, overlapping window, flexible size control)
+- 💾 Improved caching system for better performance
+- ⚡ Optimized batch processing with automatic rate limiting

- 🚀 10x faster!!
- 📜 Execute custom JavaScript before crawling!
- 🤝 Colab friendly!
- 📚 Chunking strategies: topic-based, regex, sentence, and more!
- 🧠 Extraction strategies: cosine clustering, LLM, and more!
- 🎯 CSS selector support
- 📝 Pass instructions/keywords to refine extraction
+## Try it Now!

-## Power and Simplicity of Crawl4AI 🚀
-
-To show the simplicity take a look at the first example:
-
-```python
-from crawl4ai import WebCrawler
-
-# Create the WebCrawler instance 
-crawler = WebCrawler()
-
-# Run the crawler with keyword filtering and CSS selector
-result = crawler.run(url="https://www.nbcnews.com/business")
-print(result) # {url, html, markdown, extracted_content, metadata}
-```
-
-Now let's try a complex task. Below is an example of how you can execute JavaScript, filter data using keywords, and use a CSS selector to extract specific content—all in one go!
-
-1. Instantiate a WebCrawler object.
-2. Execute custom JavaScript to click a "Load More" button.
-3. Extract semantical chunks of content and filter the data to include only content related to technology.
-4. Use a CSS selector to extract only paragraphs (`<p>` tags).
-
-```python
-# Import necessary modules
-from crawl4ai import WebCrawler
-from crawl4ai.chunking_strategy import *
-from crawl4ai.extraction_strategy import *
-from crawl4ai.crawler_strategy import *
-
-# Define the JavaScript code to click the "Load More" button
-js_code = """
-const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
-loadMoreButton && loadMoreButton.click();
-"""
-
-# Define the crawling strategy
-crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
-
-# Create the WebCrawler instance with the defined strategy
-crawler = WebCrawler(crawler_strategy=crawler_strategy)
-
-# Run the crawler with keyword filtering and CSS selector
-result = crawler.run(
-    url="https://www.nbcnews.com/business",
-    extraction_strategy=CosineStrategy(
-        semantic_filter="technology",
-    ),
-)
-
-# Run the crawler with LLM extraction strategy
-result = crawler.run(
-    url="https://www.nbcnews.com/business",
-    extraction_strategy=LLMExtractionStrategy(
-        provider="openai/gpt-4o",
-        api_token=os.getenv('OPENAI_API_KEY'),
-        instruction="Extract only content related to technology"
-    ),
-    css_selector="p"
-)
-
-# Display the extracted result
-print(result)
-```
-
-With Crawl4AI, you can perform advanced web crawling and data extraction tasks with just a few lines of code. This example demonstrates how you can harness the power of Crawl4AI to simplify your workflow and get the data you need efficiently.
-
---
-
-*Continue reading to learn more about the features, installation process, usage, and more.*
-
-
-## Table of Contents
-
-1. [Features](#features-)
-2. [Installation](#installation-)
-3. [REST API/Local Server](#using-the-local-server-ot-rest-api-)
-4. [Python Library Usage](#python-library-usage-)
-5. [Parameters](#parameters-)
-6. [Chunking Strategies](#chunking-strategies-)
-7. [Extraction Strategies](#extraction-strategies-)
-8. [Contributing](#contributing-)
-9. [License](#license-)
-10. [Contact](#contact-)
->>>>>>> new-release-0.0.2-no-spacy
+✨ Play around with this [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)

+✨ Visit our [Documentation Website](https://crawl4ai.com/mkdocs/)

 ## Features ✨

- 🕷️ Efficient web crawling to extract valuable data from websites
+- 🆓 Completely free and open-source
+- 🚀 Blazing fast performance, outperforming many paid services
 - 🤖 LLM-friendly output formats (JSON, cleaned HTML, markdown)
+- 🌐 Multi-browser support (Chromium, Firefox, WebKit)
 - 🌍 Supports crawling multiple URLs simultaneously
- 🌃 Replace media tags with ALT.
- 🆓 Completely free to use and open-source
- 📜 Execute custom JavaScript before crawling
- 📚 Chunking strategies: topic-based, regex, sentence, and more
- 🧠 Extraction strategies: cosine clustering, LLM, and more
- 🎯 CSS selector support
- 📝 Pass instructions/keywords to refine extraction
+- 🎨 Extracts and returns all media tags (Images, Audio, and Video)
+- 🔗 Extracts all external and internal links
+- 📚 Extracts metadata from the page
+- 🔄 Custom hooks for authentication, headers, and page modifications
+- 🕵️ User-agent customization
+- 🖼️ Takes screenshots of pages with enhanced error handling
+- 📜 Executes multiple custom JavaScript snippets before crawling
+- 📊 Generates structured output without LLM using JsonCssExtractionStrategy
+- 📚 Various chunking strategies: topic-based, regex, sentence, and more
+- 🧠 Advanced extraction strategies: cosine clustering, LLM, and more
+- 🎯 CSS selector support for precise data extraction
+- 📝 Passes instructions/keywords to refine extraction
+- 🔒 Proxy support with authentication for enhanced access
+- 🔄 Session management for complex multi-page crawling
+- 🌐 Asynchronous architecture for improved performance
+- 🖼️ Improved image processing with lazy-loading detection
+- 🕰️ Enhanced handling of delayed content loading
+- 🔑 Custom headers support for LLM interactions
+- 🖼️ iframe content extraction for comprehensive analysis
+- ⏱️ Flexible timeout and delayed content retrieval options

-## Installation 💻
+## Installation 🛠️

-There are three ways to use Crawl4AI:
-1. As a library (Recommended)
-2. As a local server (Docker) or using the REST API
-4. As a Google Colab notebook. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1wz8u30rvbq6Scodye9AGCw8Qg_Z8QGsk)
+Crawl4AI offers flexible installation options to suit various use cases. You can install it as a Python package or use Docker.

-To install Crawl4AI as a library, follow these steps:
+### Using pip 🐍
+
+Choose the installation option that best fits your needs:
+
+#### Basic Installation
+
+For basic web crawling and scraping tasks:

-1. Install the package from GitHub:
 ```bash
-virtualenv venv
-source venv/bin/activate
-pip install "crawl4ai[all] @ git+https://github.com/unclecode/crawl4ai.git"
+pip install crawl4ai
 ```

-    💡 Better to run the following CLI-command to load the required models. This is optional, but it will boost the performance and speed of the crawler. You need to do this only once.
+By default, this will install the asynchronous version of Crawl4AI, using Playwright for web crawling.

-    crawl4ai-download-models
+👉 Note: When you install Crawl4AI, the setup script should automatically install and set up Playwright. However, if you encounter any Playwright-related errors, you can manually install it using one of these methods:
+
+1. Through the command line:
+
+   ```bash
+   playwright install
+   ```
+
+2. If the above doesn't work, try this more specific command:
+
+   ```bash
+   python -m playwright install chromium
+   ```
+
+This second method has proven to be more reliable in some cases.
+
+#### Installation with Synchronous Version
+
+If you need the synchronous version using Selenium:
+
+```bash
+pip install "crawl4ai[sync]"
+```
+
+#### Development Installation
+
+For contributors who plan to modify the source code:

-2. Alternatively, you can clone the repository and install the package locally:
 ```bash
-virtualenv venv
-source venv/bin/activate
 git clone https://github.com/unclecode/crawl4ai.git
 cd crawl4ai
-pip install -e .[all]
+pip install -e .
 ```

-3. Use docker to run the local server:
+### Using Docker 🐳
+
+Crawl4AI is available as Docker images for easy deployment. You can either pull directly from Docker Hub (recommended) or build from the repository.
+
+#### Option 1: Docker Hub (Recommended)
+
 ```bash
-docker build -t crawl4ai .
-# For Mac users
-# docker build --platform linux/amd64 -t crawl4ai .
-docker run -d -p 8000:80 crawl4ai
+# Pull and run from Docker Hub (choose one):
+docker pull unclecode/crawl4ai:basic    # Basic crawling features
+docker pull unclecode/crawl4ai:all      # Full installation (ML, LLM support)
+docker pull unclecode/crawl4ai:gpu      # GPU-enabled version
+
+# Run the container
+docker run -p 11235:11235 unclecode/crawl4ai:basic  # Replace 'basic' with your chosen version
 ```

-For more information about how to run Crawl4AI as a local server, please refer to the [GitHub repository](https://github.com/unclecode/crawl4ai).
+#### Option 2: Build from Repository

-## Using the Local server ot REST API 🌐
+```bash
+# Clone the repository
+git clone https://github.com/unclecode/crawl4ai.git
+cd crawl4ai

-You can also use Crawl4AI through the REST API. This method allows you to send HTTP requests to the Crawl4AI server and receive structured data in response. The base URL for the API is `https://crawl4ai.com/crawl`. If you run the local server, you can use `http://localhost:8000/crawl`. (Port is dependent on your docker configuration)
+# Build the image (INSTALL_TYPE options: basic, all)
+docker build -t crawl4ai:local \
+  --build-arg INSTALL_TYPE=basic \
+  .

-### Example Usage
+# Run your local build
+docker run -p 11235:11235 crawl4ai:local
+```

-To use the REST API, send a POST request to `https://crawl4ai.com/crawl` with the following parameters in the request body.
+Quick test (works for both options):
+```python
+import requests

-**Example Request:**
-```json
-{
-    "urls": ["https://www.nbcnews.com/business"],
-    "include_raw_html": false,
-    "bypass_cache": true,
-    "word_count_threshold": 5,
-    "extraction_strategy": "CosineStrategy",
-    "chunking_strategy": "RegexChunking",
-    "css_selector": "p",
-    "verbose": true,
-    "extraction_strategy_args": {
-        "semantic_filter": "finance economy and stock market",
-        "word_count_threshold": 20,
-        "max_dist": 0.2,
-        "linkage_method": "ward",
-        "top_k": 3
-    },
-    "chunking_strategy_args": {
-        "patterns": ["\n\n"]
+# Submit a crawl job
+response = requests.post(
+    "http://localhost:11235/crawl",
+    json={"urls": "https://example.com", "priority": 10}
+)
+task_id = response.json()["task_id"]
+
+# Get results
+result = requests.get(f"http://localhost:11235/task/{task_id}")
+```
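The quick test above fetches the task once, but a crawl job may still be running at that moment. Below is a minimal polling sketch. It assumes the task endpoint returns JSON with a `status` field whose terminal values are `completed` and `failed`; that field name and those values are assumptions, not something this README confirms. `wait_for_task` and `fetch_status` are hypothetical names introduced for illustration only.

```python
import time
from typing import Callable, Dict


def wait_for_task(
    fetch_status: Callable[[], Dict],
    timeout: float = 60.0,
    interval: float = 1.0,
) -> Dict:
    """Call fetch_status() until the task reaches a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        payload = fetch_status()
        # Assumed response shape: {"status": "...", ...}
        if payload.get("status") in ("completed", "failed"):
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError("task did not reach a terminal state in time")
        time.sleep(interval)
```

With `requests`, `fetch_status` could be something like `lambda: requests.get(f"http://localhost:11235/task/{task_id}").json()`. Passing the fetcher in as a callable keeps the loop easy to test without a running server.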
+
+For advanced configuration, environment variables, and usage examples, see our [Docker Deployment Guide](https://crawl4ai.com/mkdocs/basic/docker-deployment/).
+
+
+## Quick Start 🚀
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler
+
+async def main():
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        result = await crawler.arun(url="https://www.nbcnews.com/business")
+        print(result.markdown)
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+## Advanced Usage 🔬
+
+### Executing JavaScript and Using CSS Selectors
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler
+
+async def main():
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        js_code = ["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"]
+        result = await crawler.arun(
+            url="https://www.nbcnews.com/business",
+            js_code=js_code,
+            css_selector=".wide-tease-item__description",
+            bypass_cache=True
+        )
+        print(result.extracted_content)
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+### Using a Proxy
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler
+
+async def main():
+    async with AsyncWebCrawler(verbose=True, proxy="http://127.0.0.1:7890") as crawler:
+        result = await crawler.arun(
+            url="https://www.nbcnews.com/business",
+            bypass_cache=True
+        )
+        print(result.markdown)
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+### Extracting Structured Data without LLM
+
+The `JsonCssExtractionStrategy` allows for precise extraction of structured data from web pages using CSS selectors.
+
+```python
+import asyncio
+import json
+from crawl4ai import AsyncWebCrawler
+from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+
+async def extract_news_teasers():
+    schema = {
+        "name": "News Teaser Extractor",
+        "baseSelector": ".wide-tease-item__wrapper",
+        "fields": [
+            {
+                "name": "category",
+                "selector": ".unibrow span[data-testid='unibrow-text']",
+                "type": "text",
+            },
+            {
+                "name": "headline",
+                "selector": ".wide-tease-item__headline",
+                "type": "text",
+            },
+            {
+                "name": "summary",
+                "selector": ".wide-tease-item__description",
+                "type": "text",
+            },
+            {
+                "name": "time",
+                "selector": "[data-testid='wide-tease-date']",
+                "type": "text",
+            },
+            {
+                "name": "image",
+                "type": "nested",
+                "selector": "picture.teasePicture img",
+                "fields": [
+                    {"name": "src", "type": "attribute", "attribute": "src"},
+                    {"name": "alt", "type": "attribute", "attribute": "alt"},
+                ],
+            },
+            {
+                "name": "link",
+                "selector": "a[href]",
+                "type": "attribute",
+                "attribute": "href",
+            },
+        ],
    }
-}
+
+    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
+
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        result = await crawler.arun(
+            url="https://www.nbcnews.com/business",
+            extraction_strategy=extraction_strategy,
+            bypass_cache=True,
+        )
+
+        assert result.success, "Failed to crawl the page"
+
+        news_teasers = json.loads(result.extracted_content)
+        print(f"Successfully extracted {len(news_teasers)} news teasers")
+        print(json.dumps(news_teasers[0], indent=2))
+
+if __name__ == "__main__":
+    asyncio.run(extract_news_teasers())
 ```
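A typo in a schema like the one above is easy to miss until the crawl returns empty fields. The sketch below checks a schema dictionary before it is handed to the extraction strategy. It is not part of Crawl4AI: `schema_problems` is a hypothetical helper, and its rules (a `baseSelector`, a `name` on every field, an `attribute` key on `attribute` fields, recursion into `nested` fields) are inferred only from the example above.

```python
from typing import Dict, List


def schema_problems(schema: Dict) -> List[str]:
    """Return human-readable problems found in an extraction schema (empty if none)."""
    problems = []
    if not schema.get("baseSelector"):
        problems.append("schema: missing 'baseSelector'")
    problems.extend(_field_problems(schema.get("fields", []), "fields"))
    return problems


def _field_problems(fields: List[Dict], path: str) -> List[str]:
    problems = []
    for index, field in enumerate(fields):
        where = f"{path}[{index}]"
        if not field.get("name"):
            problems.append(f"{where}: missing 'name'")
        kind = field.get("type")
        if kind == "attribute" and not field.get("attribute"):
            problems.append(f"{where}: 'attribute' type needs an 'attribute' key")
        elif kind == "nested":
            # Nested fields carry their own list of sub-fields; check them too.
            problems.extend(_field_problems(field.get("fields", []), f"{where}.fields"))
    return problems
```

Calling `schema_problems(schema)` on the news teaser schema should return an empty list; any entries it does return name the offending field by its position in the schema.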

-**Example Response:**
-```json
-{
-    "status": "success",
-    "data": [
-        {
-            "url": "https://www.nbcnews.com/business",
-            "extracted_content": "...",
-            "html": "...",
-            "markdown": "...",
-            "metadata": {...}
-        }
-    ]
-}
-```
+For more advanced usage examples, check out our [Examples](https://crawl4ai.com/mkdocs/extraction/css-advanced/) section in the documentation.

-For more information about the available parameters and their descriptions, refer to the [Parameters](#parameters) section.
-
-
-## Python Library Usage 🚀
-
-🔥 A great way to try out Crawl4AI is to run `quickstart.py` in the `docs/examples` directory. This script demonstrates how to use Crawl4AI to crawl a website and extract content from it.
-
-### Quickstart Guide
-
-Create an instance of WebCrawler and call the `warmup()` function.
-```python
-crawler = WebCrawler()
-crawler.warmup()
-```
-
-### Understanding 'bypass_cache' and 'include_raw_html' parameters
-
-First crawl (caches the result):
-```python
-result = crawler.run(url="https://www.nbcnews.com/business")
-```
-
-Second crawl (Force to crawl again):
-```python
-result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)
-```
-    💡 Don't forget to set `bypass_cache` to True if you want to try different strategies for the same URL. Otherwise, the cached result will be returned. You can also set `always_by_pass_cache` in constructor to True to always bypass the cache.
-
-Crawl result without raw HTML content:
-```python
-result = crawler.run(url="https://www.nbcnews.com/business", include_raw_html=False)
-```
-
-### Adding a chunking strategy: RegexChunking
-
-Using RegexChunking:
-```python
-result = crawler.run(
-    url="https://www.nbcnews.com/business",
-    chunking_strategy=RegexChunking(patterns=["\n\n"])
-)
-```
-
-Using NlpSentenceChunking:
-```python
-result = crawler.run(
-    url="https://www.nbcnews.com/business",
-    chunking_strategy=NlpSentenceChunking()
-)
-```
-
-### Extraction strategy: CosineStrategy
-
-So far, the extracted content is just the result of chunking. To extract meaningful content, you can use extraction strategies. These strategies cluster consecutive chunks into meaningful blocks, keeping the same order as the text in the HTML. This approach is perfect for use in RAG applications and semantical search queries.
-
-Using CosineStrategy:
-```python
-result = crawler.run(
-    url="https://www.nbcnews.com/business",
-    extraction_strategy=CosineStrategy(
-        semantic_filter="",
-        word_count_threshold=10, 
-        max_dist=0.2, 
-        linkage_method="ward", 
-        top_k=3
-    )
-)
-```
-
-You can set `semantic_filter` to filter relevant documents before clustering. Documents are filtered based on their cosine similarity to the keyword filter embedding. 
+### Extracting Structured Data with OpenAI

 ```python
-result = crawler.run(
-    url="https://www.nbcnews.com/business",
-    extraction_strategy=CosineStrategy(
-        semantic_filter="finance economy and stock market",
-        word_count_threshold=10, 
-        max_dist=0.2, 
-        linkage_method="ward", 
-        top_k=3
-    )
-)
+import os
+import asyncio
+from crawl4ai import AsyncWebCrawler
+from crawl4ai.extraction_strategy import LLMExtractionStrategy
+from pydantic import BaseModel, Field
+
+class OpenAIModelFee(BaseModel):
+    model_name: str = Field(..., description="Name of the OpenAI model.")
+    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
+    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
+
+async def main():
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        result = await crawler.arun(
+            url='https://openai.com/api/pricing/',
+            word_count_threshold=1,
+            extraction_strategy=LLMExtractionStrategy(
+                provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'), 
+                schema=OpenAIModelFee.schema(),
+                extraction_type="schema",
+                instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. 
+                Do not miss any models in the entire content. One extracted model JSON format should look like this: 
+                {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
+            ),            
+            bypass_cache=True,
+        )
+        print(result.extracted_content)
+
+if __name__ == "__main__":
+    asyncio.run(main())
 ```

-### Using LLMExtractionStrategy
+### Session Management and Dynamic Content Crawling
+
+Crawl4AI excels at handling complex scenarios, such as crawling multiple pages with dynamic content loaded via JavaScript. Here's an example of crawling GitHub commits across multiple pages:

-Without instructions:
 ```python
-result = crawler.run(
-    url="https://www.nbcnews.com/business",
-    extraction_strategy=LLMExtractionStrategy(
-        provider="openai/gpt-4o", 
-        api_token=os.getenv('OPENAI_API_KEY')
-    )
-)
+import asyncio
+import re
+from bs4 import BeautifulSoup
+from crawl4ai import AsyncWebCrawler
+
+async def crawl_typescript_commits():
+    first_commit = ""
+    async def on_execution_started(page):
+        nonlocal first_commit 
+        try:
+            while True:
+                await page.wait_for_selector('li.Box-sc-g0xbh4-0 h4')
+                commit = await page.query_selector('li.Box-sc-g0xbh4-0 h4')
+                commit = await commit.evaluate('(element) => element.textContent')
+                commit = re.sub(r'\s+', '', commit)
+                if commit and commit != first_commit:
+                    first_commit = commit
+                    break
+                await asyncio.sleep(0.5)
+        except Exception as e:
+            print(f"Warning: New content didn't appear after JavaScript execution: {e}")
+
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        crawler.crawler_strategy.set_hook('on_execution_started', on_execution_started)
+
+        url = "https://github.com/microsoft/TypeScript/commits/main"
+        session_id = "typescript_commits_session"
+        all_commits = []
+
+        js_next_page = """
+        const button = document.querySelector('a[data-testid="pagination-next-button"]');
+        if (button) button.click();
+        """
+
+        for page in range(3):  # Crawl 3 pages
+            result = await crawler.arun(
+                url=url,
+                session_id=session_id,
+                css_selector="li.Box-sc-g0xbh4-0",
+                js_code=js_next_page if page > 0 else None,
+                bypass_cache=True,
+                js_only=page > 0
+            )
+
+            assert result.success, f"Failed to crawl page {page + 1}"
+
+            soup = BeautifulSoup(result.cleaned_html, 'html.parser')
+            commits = soup.select("li")
+            all_commits.extend(commits)
+
+            print(f"Page {page + 1}: Found {len(commits)} commits")
+
+        await crawler.crawler_strategy.kill_session(session_id)
+        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
+
+if __name__ == "__main__":
+    asyncio.run(crawl_typescript_commits())
 ```

-With instructions:
-```python
-result = crawler.run(
-    url="https://www.nbcnews.com/business",
-    extraction_strategy=LLMExtractionStrategy(
-        provider="openai/gpt-4o",
-        api_token=os.getenv('OPENAI_API_KEY'),
-        instruction="I am interested in only financial news"
-    )
-)
+This example demonstrates Crawl4AI's ability to handle complex scenarios where content is loaded asynchronously. It crawls multiple pages of GitHub commits, executing JavaScript to load new content and using custom hooks to ensure data is loaded before proceeding.
+
+For more advanced usage examples, check out our [Examples](https://crawl4ai.com/mkdocs/tutorial/episode_12_Session-Based_Crawling_for_Dynamic_Websites/) section in the documentation.
+</details>
+
+
+## Speed Comparison 🚀
+
+Crawl4AI is designed with speed as a primary focus. Our goal is to provide the fastest possible response with high-quality data extraction, minimizing abstractions between the data and the user.
+
+We've conducted a speed comparison between Crawl4AI and Firecrawl, a paid service. The results demonstrate Crawl4AI's superior performance:
+
+```bash
+Firecrawl:
+Time taken: 7.02 seconds
+Content length: 42074 characters
+Images found: 49
+
+Crawl4AI (simple crawl):
+Time taken: 1.60 seconds
+Content length: 18238 characters
+Images found: 49
+
+Crawl4AI (with JavaScript execution):
+Time taken: 4.64 seconds
+Content length: 40869 characters
+Images found: 89
 ```

-### Targeted extraction using CSS selector
+As you can see, Crawl4AI outperforms Firecrawl significantly:

-Extract only H2 tags:
-```python
-result = crawler.run(
-    url="https://www.nbcnews.com/business",
-    css_selector="h2"
-)
-```
+- Simple crawl: Crawl4AI is over 4 times faster than Firecrawl.
+- With JavaScript execution: Even when executing JavaScript to load more content (nearly doubling the number of images found, from 49 to 89), Crawl4AI is still faster than Firecrawl's simple crawl.
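The ratios behind these two claims can be recomputed from the benchmark output. This is a small arithmetic sketch using only the timings and image counts printed above; the variable names are illustrative.

```python
# Timings (seconds) and image counts copied from the benchmark output above.
firecrawl_s = 7.02
crawl4ai_simple_s = 1.60
crawl4ai_js_s = 4.64

simple_speedup = firecrawl_s / crawl4ai_simple_s  # about 4.4x
js_speedup = firecrawl_s / crawl4ai_js_s          # about 1.5x
image_ratio = 89 / 49                             # about 1.8x as many images

print(f"simple crawl speedup: {simple_speedup:.2f}x")
print(f"JavaScript crawl speedup: {js_speedup:.2f}x")
print(f"images found with JavaScript: {image_ratio:.2f}x")
```

These figures come from a single run against one page, so they indicate the order of magnitude rather than a general guarantee.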

-### Passing JavaScript code to click 'Load More' button
+You can find the full comparison code in our repository at `docs/examples/crawl4ai_vs_firecrawl.py`.

-Using JavaScript to click 'Load More' button:
-```python
-js_code = """
-const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
-loadMoreButton && loadMoreButton.click();
-"""
-crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
-crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
-result = crawler.run(url="https://www.nbcnews.com/business")
-```
+## Documentation 📚

-## Parameters 📖
-
-| Parameter             | Description                                                                                           | Required | Default Value       |
-|-----------------------|-------------------------------------------------------------------------------------------------------|----------|---------------------|
-| `urls`                | A list of URLs to crawl and extract data from.                                                        | Yes      | -                   |
-| `include_raw_html`    | Whether to include the raw HTML content in the response.                                              | No       | `false`             |
-| `bypass_cache`        | Whether to force a fresh crawl even if the URL has been previously crawled.                           | No       | `false`             |
-| `word_count_threshold`| The minimum number of words a block must contain to be considered meaningful (minimum value is 5).    | No       | `5`                 |
-| `extraction_strategy` | The strategy to use for extracting content from the HTML (e.g., "CosineStrategy").                    | No       | `NoExtractionStrategy`    |
-| `chunking_strategy`   | The strategy to use for chunking the text before processing (e.g., "RegexChunking").                  | No       | `RegexChunking`     |
-| `css_selector`        | The CSS selector to target specific parts of the HTML for extraction.                                 | No       | `None`              |
-| `verbose`             | Whether to enable verbose logging.                                                                    | No       | `true`              |
-
-## Chunking Strategies 📚
-
-### RegexChunking
-
-`RegexChunking` is a text chunking strategy that splits a given text into smaller parts using regular expressions. This is useful for preparing large texts for processing by language models, ensuring they are divided into manageable segments.
-
-**Constructor Parameters:**
- `patterns` (list, optional): A list of regular expression patterns used to split the text. Default is to split by double newlines (`['\n\n']`).
-
-**Example usage:**
-```python
-chunker = RegexChunking(patterns=[r'\n\n', r'\. '])
-chunks = chunker.chunk("This is a sample text. It will be split into chunks.")
-```
-
-### NlpSentenceChunking
-
-`NlpSentenceChunking` uses a natural language processing model to chunk a given text into sentences. This approach leverages SpaCy to accurately split text based on sentence boundaries.
-
-**Constructor Parameters:**
- None.
-
-**Example usage:**
-```python
-chunker = NlpSentenceChunking()
-chunks = chunker.chunk("This is a sample text. It will be split into sentences.")
-```
-
-### TopicSegmentationChunking
-
-`TopicSegmentationChunking` uses the TextTiling algorithm to segment a given text into topic-based chunks. This method identifies thematic boundaries in the text.
-
-**Constructor Parameters:**
- `num_keywords` (int, optional): The number of keywords to extract for each topic segment. Default is `3`.
-
-**Example usage:**
-```python
-chunker = TopicSegmentationChunking(num_keywords=3)
-chunks = chunker.chunk("This is a sample text. It will be split into topic-based segments.")
-```
-
-### FixedLengthWordChunking
-
-`FixedLengthWordChunking` splits a given text into chunks of fixed length, based on the number of words.
-
-**Constructor Parameters:**
- `chunk_size` (int, optional): The number of words in each chunk. Default is `100`.
-
-**Example usage:**
-```python
-chunker = FixedLengthWordChunking(chunk_size=100)
-chunks = chunker.chunk("This is a sample text. It will be split into fixed-length word chunks.")
-```
-
-### SlidingWindowChunking
-
-`SlidingWindowChunking` uses a sliding window approach to chunk a given text. Each chunk has a fixed length, and the window slides by a specified step size.
-
-**Constructor Parameters:**
- `window_size` (int, optional): The number of words in each chunk. Default is `100`.
- `step` (int, optional): The number of words to slide the window. Default is `50`.
-
-**Example usage:**
-```python
-chunker = SlidingWindowChunking(window_size=100, step=50)
-chunks = chunker.chunk("This is a sample text. It will be split using a sliding window approach.")
-```
-
-## Extraction Strategies 🧠
-
-### NoExtractionStrategy
-
-`NoExtractionStrategy` is a basic extraction strategy that returns the entire HTML content without any modification. It is useful for cases where no specific extraction is required.
-
-**Constructor Parameters:**
-None.
-
-**Example usage:**
-```python
-extractor = NoExtractionStrategy()
-extracted_content = extractor.extract(url, html)
-```
-
-### LLMExtractionStrategy
-
-`LLMExtractionStrategy` uses a Language Model (LLM) to extract meaningful blocks or chunks from the given HTML content. This strategy leverages an external provider for language model completions.
-
-**Constructor Parameters:**
- `provider` (str, optional): The provider to use for the language model completions. Default is `DEFAULT_PROVIDER` (e.g., openai/gpt-4).
- `api_token` (str, optional): The API token for the provider. If not provided, it will try to load from the environment variable `OPENAI_API_KEY`.
- `instruction` (str, optional): An instruction to guide the LLM on how to perform the extraction. This allows users to specify the type of data they are interested in or set the tone of the response. Default is `None`.
-
-**Example usage:**
-```python
-extractor = LLMExtractionStrategy(provider='openai', api_token='your_api_token', instruction='Extract only news about AI.')
-extracted_content = extractor.extract(url, html)
-```
-
-### CosineStrategy
-
-`CosineStrategy` uses hierarchical clustering based on cosine similarity to extract clusters of text from the given HTML content. This strategy is suitable for identifying related content sections.
-
-**Constructor Parameters:**
- `semantic_filter` (str, optional): A string containing keywords for filtering relevant documents before clustering. If provided, documents are filtered based on their cosine similarity to the keyword filter embedding. Default is `None`.
- `word_count_threshold` (int, optional): Minimum number of words per cluster. Default is `20`.
- `max_dist` (float, optional): The maximum cophenetic distance on the dendrogram to form clusters. Default is `0.2`.
- `linkage_method` (str, optional): The linkage method for hierarchical clustering. Default is `'ward'`.
- `top_k` (int, optional): Number of top categories to extract. Default is `3`.
- `model_name` (str, optional): The model name for embedding generation. Default is `'BAAI/bge-small-en-v1.5'`.
-
-**Example usage:**
-```python
-extractor = CosineStrategy(semantic_filter='finance rental prices', word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name='BAAI/bge-small-en-v1.5')
-extracted_content = extractor.extract(url, html)
-```
-
-### TopicExtractionStrategy
-
-`TopicExtractionStrategy` uses the TextTiling algorithm to segment the HTML content into topics and extracts keywords for each segment. This strategy is useful for identifying and summarizing thematic content.
-
-**Constructor Parameters:**
- `num_keywords` (int, optional): Number of keywords to represent each topic segment. Default is `3`.
-
-**Example usage:**
-```python
-extractor = TopicExtractionStrategy(num_keywords=3)
-extracted_content = extractor.extract(url, html)
-```
+For detailed documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://crawl4ai.com/mkdocs/).

 ## Contributing 🤝

-We welcome contributions from the open-source community to help improve Crawl4AI and make it even more valuable for AI enthusiasts and developers. To contribute, please follow these steps:
-
-1. Fork the repository.
-2. Create a new branch for your feature or bug fix.
-3. Make your changes and commit them with descriptive messages.
-4. Push your changes to your forked repository.
-5. Submit a pull request to the main repository.
-
-For more information on contributing, please see our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTING.md).
+We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTING.md) for more information.

 ## License 📄

@@ -515,10 +443,42 @@ Crawl4AI is released under the [Apache 2.0 License](https://github.com/unclecode

 ## Contact 📧

-If you have any questions, suggestions, or feedback, please feel free to reach out to us:
+For questions, suggestions, or feedback, feel free to reach out:

 - GitHub: [unclecode](https://github.com/unclecode)
 - Twitter: [@unclecode](https://twitter.com/unclecode)
 - Website: [crawl4ai.com](https://crawl4ai.com)

-Let's work together to make the web more accessible and useful for AI applications! 💪🌐🤖
+Happy Crawling! 🕸️🚀
+
+
+# Mission
+
+Our mission is to unlock the untapped potential of personal and enterprise data in the digital age. In today's world, individuals and organizations generate vast amounts of valuable digital footprints, yet this data remains largely uncapitalized as a true asset. 
+
+Our open-source solution empowers developers and innovators to build tools for data extraction and structuring, laying the foundation for a new era of data ownership. By transforming personal and enterprise data into structured, tradeable assets, we're creating opportunities for individuals to capitalize on their digital footprints and for organizations to unlock the value of their collective knowledge.
+
+This democratization of data represents the first step toward a shared data economy, where willing participation in data sharing drives AI advancement while ensuring the benefits flow back to data creators. Through this approach, we're building a future where AI development is powered by authentic human knowledge rather than synthetic alternatives.
+
+![Mission Diagram](./docs/assets/pitch-dark.svg)
+
+For a detailed exploration of our vision, opportunities, and pathway forward, please see our [full mission statement](./MISSION.md).
+
+## Key Opportunities
+
+- **Data Capitalization**: Transform digital footprints into valuable assets that can appear on personal and enterprise balance sheets
+- **Authentic Data**: Unlock the vast reservoir of real human insights and knowledge for AI advancement
+- **Shared Economy**: Create new value streams where data creators directly benefit from their contributions
+
+## Development Pathway
+
+1. **Open-Source Foundation**: Building transparent, community-driven data extraction tools
+2. **Data Capitalization Platform**: Creating tools to structure and value digital assets
+3. **Shared Data Marketplace**: Establishing an economic platform for ethical data exchange
+
+For a detailed exploration of our vision, challenges, and solutions, please see our [full mission statement](./MISSION.md).
+
+
+## Star History
+
+[![Star History Chart](https://api.star-history.com/svg?repos=unclecode/crawl4ai&type=Date)](https://star-history.com/#unclecode/crawl4ai&Date)
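
The speedup claims in the README's Speed Comparison section can be checked directly from the timings it reports. The following is a minimal sketch of that arithmetic; the dictionary keys and the `speedup` helper are illustrative names chosen here, not part of Crawl4AI.

```python
# Timings (seconds) and image counts copied from the README's benchmark output.
results = {
    "firecrawl": {"seconds": 7.02, "images": 49},
    "crawl4ai_simple": {"seconds": 1.60, "images": 49},
    "crawl4ai_js": {"seconds": 4.64, "images": 89},
}


def speedup(baseline: str, candidate: str) -> float:
    """Return how many times faster `candidate` ran compared with `baseline`."""
    return results[baseline]["seconds"] / results[candidate]["seconds"]


simple_speedup = speedup("firecrawl", "crawl4ai_simple")
js_speedup = speedup("firecrawl", "crawl4ai_js")
# Ratio of images found with and without JavaScript execution.
image_ratio = results["crawl4ai_js"]["images"] / results["crawl4ai_simple"]["images"]

print(f"Simple crawl speedup: {simple_speedup:.2f}x")
print(f"JavaScript crawl speedup: {js_speedup:.2f}x")
print(f"Image ratio (JavaScript vs simple): {image_ratio:.2f}x")
```

On these numbers the simple crawl is roughly 4.4 times faster than Firecrawl, the JavaScript run is about 1.5 times faster, and it finds about 1.8 times as many images.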
--- a/README.sync.md
+++ b/README.sync.md
@@ -0,0 +1,244 @@
+# Crawl4AI v0.2.77 🕷️🤖
+
+[![GitHub Stars](https://img.shields.io/github/stars/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/stargazers)
+[![GitHub Forks](https://img.shields.io/github/forks/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/network/members)
+[![GitHub Issues](https://img.shields.io/github/issues/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/issues)
+[![GitHub Pull Requests](https://img.shields.io/github/issues-pr/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/pulls)
+[![License](https://img.shields.io/github/license/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)
+
+Crawl4AI simplifies web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. 🆓🌐
+
+#### [v0.2.77] - 2024-08-02
+
+Major improvements in functionality, performance, and cross-platform compatibility! 🚀
+
+- 🐳 **Docker enhancements**:
+  - Significantly improved Dockerfile for easy installation on Linux, Mac, and Windows.
+- 🌐 **Official Docker Hub image**:
+  - Launched our first official image on Docker Hub for streamlined deployment (unclecode/crawl4ai).
+- 🔧 **Selenium upgrade**:
+  - Removed dependency on ChromeDriver, now using Selenium's built-in capabilities for better compatibility.
+- 🖼️ **Image description**:
+  - Implemented ability to generate textual descriptions for extracted images from web pages.
+- ⚡ **Performance boost**:
+  - Various improvements to enhance overall speed and performance.
+  
+## Try it Now!
+
+✨ Play around with this [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1sJPAmeLj5PMrg2VgOwMJ2ubGIcK0cJeX?usp=sharing)
+
+✨ visit our [Documentation Website](https://crawl4ai.com/mkdocs/)
+
+✨ Check [Demo](https://crawl4ai.com/mkdocs/demo)
+
+## Features ✨
+
+- 🆓 Completely free and open-source
+- 🤖 LLM-friendly output formats (JSON, cleaned HTML, markdown)
+- 🌍 Supports crawling multiple URLs simultaneously
+- 🎨 Extracts and returns all media tags (Images, Audio, and Video)
+- 🔗 Extracts all external and internal links
+- 📚 Extracts metadata from the page
+- 🔄 Custom hooks for authentication, headers, and page modifications before crawling
+- 🕵️ User-agent customization
+- 🖼️ Takes screenshots of the page
+- 📜 Executes multiple custom JavaScripts before crawling
+- 📚 Various chunking strategies: topic-based, regex, sentence, and more
+- 🧠 Advanced extraction strategies: cosine clustering, LLM, and more
+- 🎯 CSS selector support
+- 📝 Passes instructions/keywords to refine extraction
+
+# Crawl4AI
+
+## 🌟 Shoutout to Contributors of v0.2.77!
+
+A big thank you to the amazing contributors who've made this release possible:
+
+- [@aravindkarnam](https://github.com/aravindkarnam) for the new image description feature
+- [@FractalMind](https://github.com/FractalMind) for our official Docker Hub image
+- [@ketonkss4](https://github.com/ketonkss4) for helping streamline our Selenium setup
+
+Your contributions are driving Crawl4AI forward! 🚀
+
+## Cool Examples 🚀
+
+### Quick Start
+
+```python
+from crawl4ai import WebCrawler
+
+# Create an instance of WebCrawler
+crawler = WebCrawler()
+
+# Warm up the crawler (load necessary models)
+crawler.warmup()
+
+# Run the crawler on a URL
+result = crawler.run(url="https://www.nbcnews.com/business")
+
+# Print the extracted content
+print(result.markdown)
+```
+
+## How to install 🛠 
+
+### Using pip 🐍
+```bash
+virtualenv venv
+source venv/bin/activate
+pip install "crawl4ai @ git+https://github.com/unclecode/crawl4ai.git"
+```
+
+### Using Docker 🐳
+
+```bash
+# For Mac users (M1/M2)
+# docker build --platform linux/amd64 -t crawl4ai .
+docker build -t crawl4ai .
+docker run -d -p 8000:80 crawl4ai
+```
+
+### Using Docker Hub 🐳
+
+```bash
+docker pull unclecode/crawl4ai:latest
+docker run -d -p 8000:80 unclecode/crawl4ai:latest
+```
+
+
+## Speed-First Design 🚀
+
+Speed is perhaps the most important design principle for this library: it must handle many links and resources in parallel as quickly as possible. Combining this speed with fast LLM inference, such as Groq, keeps the whole extraction pipeline responsive.
+
+```python
+import time
+from crawl4ai.web_crawler import WebCrawler
+crawler = WebCrawler()
+crawler.warmup()
+
+start = time.time()
+url = r"https://www.nbcnews.com/business"
+result = crawler.run( url, word_count_threshold=10, bypass_cache=True)
+end = time.time()
+print(f"Time taken: {end - start}")
+```
+
+Let's take a look at the measured time for the above code snippet:
+
+```bash
+[LOG] 🚀 Crawling done, success: True, time taken: 1.3623387813568115 seconds
+[LOG] 🚀 Content extracted, success: True, time taken: 0.05715131759643555 seconds
+[LOG] 🚀 Extraction, time taken: 0.05750393867492676 seconds.
+Time taken: 1.439958095550537
+```
+Fetching the content from the page took 1.3623 seconds, and extracting the content took 0.0575 seconds. 🚀
+
+### Extract Structured Data from Web Pages 📊
+
+Crawl all OpenAI models and their fees from the official page.
+
+```python
+import os
+from crawl4ai import WebCrawler
+from crawl4ai.extraction_strategy import LLMExtractionStrategy
+from pydantic import BaseModel, Field
+
+class OpenAIModelFee(BaseModel):
+    model_name: str = Field(..., description="Name of the OpenAI model.")
+    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
+    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
+
+url = 'https://openai.com/api/pricing/'
+crawler = WebCrawler()
+crawler.warmup()
+
+result = crawler.run(
+        url=url,
+        word_count_threshold=1,
+        extraction_strategy= LLMExtractionStrategy(
+            provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'), 
+            schema=OpenAIModelFee.schema(),
+            extraction_type="schema",
+            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. 
+            Do not miss any models in the entire content. One extracted model JSON format should look like this: 
+            {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
+        ),            
+        bypass_cache=True,
+    )
+
+print(result.extracted_content)
+```
+
+### Execute JS, Filter Data with CSS Selector, and Clustering
+
+```python
+from crawl4ai import WebCrawler
+from crawl4ai.extraction_strategy import CosineStrategy
+
+js_code = ["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"]
+
+crawler = WebCrawler()
+crawler.warmup()
+
+result = crawler.run(
+    url="https://www.nbcnews.com/business",
+    js=js_code,
+    css_selector="p",
+    extraction_strategy=CosineStrategy(semantic_filter="technology")
+)
+
+print(result.extracted_content)
+```
+
+### Extract Structured Data from Web Pages With Proxy and BaseUrl
+
+```python
+from crawl4ai import WebCrawler
+from crawl4ai.extraction_strategy import LLMExtractionStrategy
+
+def create_crawler():
+    crawler = WebCrawler(verbose=True, proxy="http://127.0.0.1:7890")
+    crawler.warmup()
+    return crawler
+
+# create_crawler() already calls warmup(), so no second warm-up is needed
+crawler = create_crawler()
+
+result = crawler.run(
+    url="https://www.nbcnews.com/business",
+    extraction_strategy=LLMExtractionStrategy(
+        provider="openai/gpt-4o",
+        api_token="sk-",
+        base_url="https://api.openai.com/v1"
+    )
+)
+
+print(result.markdown)
+```
+
+## Documentation 📚
+
+For detailed documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://crawl4ai.com/mkdocs/).
+
+## Contributing 🤝
+
+We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTING.md) for more information.
+
+## License 📄
+
+Crawl4AI is released under the [Apache 2.0 License](https://github.com/unclecode/crawl4ai/blob/main/LICENSE).
+
+## Contact 📧
+
+For questions, suggestions, or feedback, feel free to reach out:
+
+- GitHub: [unclecode](https://github.com/unclecode)
+- Twitter: [@unclecode](https://twitter.com/unclecode)
+- Website: [crawl4ai.com](https://crawl4ai.com)
+
+Happy Crawling! 🕸️🚀
+
+## Star History
+
+[![Star History Chart](https://api.star-history.com/svg?repos=unclecode/crawl4ai&type=Date)](https://star-history.com/#unclecode/crawl4ai&Date)
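
The structured-extraction example in `README.sync.md` instructs the LLM to return objects with `model_name`, `input_fee`, and `output_fee`. LLM output is not guaranteed to follow the schema, so a light check before using the data can help. The sketch below uses only the standard library and assumes that `result.extracted_content` is a JSON string holding a list of objects; `validate_fees` and `REQUIRED_KEYS` are hypothetical names, and the sample payload is made up for illustration.

```python
import json

# Field names taken from the OpenAIModelFee schema in the README example.
REQUIRED_KEYS = {"model_name", "input_fee", "output_fee"}


def validate_fees(extracted_content: str) -> list:
    """Parse the JSON string and keep only items that carry every required key."""
    items = json.loads(extracted_content)
    if not isinstance(items, list):
        raise ValueError("expected a JSON list of extracted items")
    return [
        item for item in items
        if isinstance(item, dict) and REQUIRED_KEYS.issubset(item)
    ]


# Stand-in for result.extracted_content; the second item is deliberately incomplete.
sample = json.dumps([
    {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"},
    {"model_name": "incomplete-entry"},
])

valid_items = validate_fees(sample)
print(valid_items)
```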
--- a/crawl4ai/__init__.py
+++ b/crawl4ai/__init__.py
@@ -1 +1,30 @@
-from .web_crawler import WebCrawler
+# __init__.py
+
+from .async_webcrawler import AsyncWebCrawler
+from .models import CrawlResult
+from ._version import __version__
+# __version__ = "0.3.73"
+
+__all__ = [
+    "AsyncWebCrawler",
+    "CrawlResult",
+]
+
+def is_sync_version_installed():
+    try:
+        import selenium
+        return True
+    except ImportError:
+        return False
+
+if is_sync_version_installed():
+    try:
+        from .web_crawler import WebCrawler
+        __all__.append("WebCrawler")
+    except ImportError:
+        import warnings
+        print("Warning: Failed to import WebCrawler even though selenium is installed. This might be due to other missing dependencies.")
+else:
+    WebCrawler = None
+    import warnings
+    print("Warning: Synchronous WebCrawler is not available. Install crawl4ai[sync] for synchronous support. However, please note that the synchronous version will be deprecated soon.")
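
The `__init__.py` change above exports `WebCrawler` only when the optional `selenium` dependency is present. The same optional-export pattern can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's implementation: `is_module_available` and `optional_export` are hypothetical helpers, `find_spec` is used so the check does not import the module, and `warnings.warn` stands in for the `print` calls in the diff.

```python
import importlib.util
import warnings


def is_module_available(name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def optional_export(module_name: str, export_name: str, exports: list) -> bool:
    """Append `export_name` to `exports` only if `module_name` is installed."""
    if is_module_available(module_name):
        exports.append(export_name)
        return True
    warnings.warn(
        f"{export_name} is not available: install '{module_name}' to enable it."
    )
    return False


# The always-available names come from the diff; the optional ones are stand-ins.
exports = ["AsyncWebCrawler", "CrawlResult"]
optional_export("json", "JsonBackedFeature", exports)            # stdlib, always present
optional_export("surely_missing_module_xyz123", "Extra", exports)  # warns, not exported
print(exports)
```

Using `warnings.warn` rather than `print` lets callers filter or silence the message through the standard warnings machinery.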
--- a/crawl4ai/_version.py
+++ b/crawl4ai/_version.py
@@ -0,0 +1,2 @@
+# crawl4ai/_version.py
+__version__ = "0.3.73"
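
`AsyncPlaywrightCrawlerStrategy` in `async_crawler_strategy.py` lets callers register hooks with `set_hook` and runs them through `execute_hook`, which accepts both plain functions and coroutine functions. The sketch below isolates that dispatch logic so it can be run without Playwright. `HookRunner`, the two sample hooks, and `demo` are hypothetical names; the sketch uses `inspect.iscoroutinefunction`, whereas the source calls `asyncio.iscoroutinefunction`.

```python
import asyncio
import inspect
from typing import Any, Callable, Dict, Optional


class HookRunner:
    """Stand-alone illustration of sync/async hook dispatch."""

    def __init__(self) -> None:
        # Two hook names borrowed from the strategy's hook table.
        self.hooks: Dict[str, Optional[Callable[..., Any]]] = {
            "before_goto": None,
            "after_goto": None,
        }

    def set_hook(self, hook_type: str, hook: Callable[..., Any]) -> None:
        if hook_type not in self.hooks:
            raise ValueError(f"Invalid hook type: {hook_type}")
        self.hooks[hook_type] = hook

    async def execute_hook(self, hook_type: str, *args: Any) -> Any:
        hook = self.hooks.get(hook_type)
        if hook is None:
            # No hook registered: pass the first argument through unchanged.
            return args[0] if args else None
        if inspect.iscoroutinefunction(hook):
            return await hook(*args)
        return hook(*args)


def sync_hook(page: str) -> str:
    return f"sync:{page}"


async def async_hook(page: str) -> str:
    await asyncio.sleep(0)  # yield to the event loop, as a real async hook might
    return f"async:{page}"


async def demo() -> list:
    runner = HookRunner()
    runner.set_hook("before_goto", sync_hook)
    runner.set_hook("after_goto", async_hook)
    return [
        await runner.execute_hook("before_goto", "page"),
        await runner.execute_hook("after_goto", "page"),
    ]


demo_results = asyncio.run(demo())
print(demo_results)
```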
--- a/crawl4ai/async_crawler_strategy.py
+++ b/crawl4ai/async_crawler_strategy.py
@@ -0,0 +1,881 @@
+import asyncio
+import base64
+import time
+from abc import ABC, abstractmethod
+from typing import Callable, Dict, Any, List, Optional, Awaitable
+import os, sys, shutil
+import tempfile, subprocess
+from playwright.async_api import async_playwright, Page, Browser, Error
+from io import BytesIO
+from PIL import Image, ImageDraw, ImageFont
+from pathlib import Path
+from playwright.async_api import ProxySettings
+from pydantic import BaseModel
+import hashlib
+import json
+import uuid
+
+from playwright_stealth import StealthConfig, stealth_async
+
+stealth_config = StealthConfig(
+    webdriver=True,
+    chrome_app=True,
+    chrome_csi=True,
+    chrome_load_times=True,
+    chrome_runtime=True,
+    navigator_languages=True,
+    navigator_plugins=True,
+    navigator_permissions=True,
+    webgl_vendor=True,
+    outerdimensions=True,
+    navigator_hardware_concurrency=True,
+    media_codecs=True,
+)
+
+
+class ManagedBrowser:
+    def __init__(self, browser_type: str = "chromium", user_data_dir: Optional[str] = None, headless: bool = False):
+        self.browser_type = browser_type
+        self.user_data_dir = user_data_dir
+        self.headless = headless
+        self.browser_process = None
+        self.temp_dir = None
+        self.debugging_port = 9222
+
+    async def start(self) -> str:
+        """
+        Starts the browser process and returns the CDP endpoint URL.
+        If user_data_dir is not provided, creates a temporary directory.
+        """
+        
+        # Create temp dir if needed
+        if not self.user_data_dir:
+            self.temp_dir = tempfile.mkdtemp(prefix="browser-profile-")
+            self.user_data_dir = self.temp_dir
+
+        # Get browser path and args based on OS and browser type
+        browser_path = self._get_browser_path()
+        args = self._get_browser_args()
+
+        # Start browser process
+        try:
+            self.browser_process = subprocess.Popen(
+                args,
+                stdout=subprocess.PIPE,
+                stderr=subprocess.PIPE
+            )
+            await asyncio.sleep(2)  # Give browser time to start
+            return f"http://localhost:{self.debugging_port}"
+        except Exception as e:
+            await self.cleanup()
+            raise Exception(f"Failed to start browser: {e}")
+
+    def _get_browser_path(self) -> str:
+        """Returns the browser executable path based on OS and browser type"""
+        if sys.platform == "darwin":  # macOS
+            paths = {
+                "chromium": "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
+                "firefox": "/Applications/Firefox.app/Contents/MacOS/firefox",
+                "webkit": "/Applications/Safari.app/Contents/MacOS/Safari"
+            }
+        elif sys.platform == "win32":  # Windows
+            paths = {
+                "chromium": "C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe",
+                "firefox": "C:\\Program Files\\Mozilla Firefox\\firefox.exe",
+                "webkit": None  # WebKit not supported on Windows
+            }
+        else:  # Linux
+            paths = {
+                "chromium": "google-chrome",
+                "firefox": "firefox",
+                "webkit": None  # WebKit not supported on Linux
+            }
+        
+        return paths.get(self.browser_type)
+
+    def _get_browser_args(self) -> List[str]:
+        """Returns browser-specific command line arguments"""
+        base_args = [self._get_browser_path()]
+        
+        if self.browser_type == "chromium":
+            args = [
+                f"--remote-debugging-port={self.debugging_port}",
+                f"--user-data-dir={self.user_data_dir}",
+            ]
+            if self.headless:
+                args.append("--headless=new")
+        elif self.browser_type == "firefox":
+            args = [
+                "--remote-debugging-port", str(self.debugging_port),
+                "--profile", self.user_data_dir,
+            ]
+            if self.headless:
+                args.append("--headless")
+        else:
+            raise NotImplementedError(f"Browser type {self.browser_type} not supported")
+            
+        return base_args + args
+
+    async def cleanup(self):
+        """Cleanup browser process and temporary directory"""
+        if self.browser_process:
+            try:
+                self.browser_process.terminate()
+                await asyncio.sleep(1)
+                if self.browser_process.poll() is None:
+                    self.browser_process.kill()
+            except Exception as e:
+                print(f"Error terminating browser: {e}")
+
+        if self.temp_dir and os.path.exists(self.temp_dir):
+            try:
+                shutil.rmtree(self.temp_dir)
+            except Exception as e:
+                print(f"Error removing temporary directory: {e}")
+
+class AsyncCrawlResponse(BaseModel):
+    html: str
+    response_headers: Dict[str, str]
+    status_code: int
+    screenshot: Optional[str] = None
+    get_delayed_content: Optional[Callable[[Optional[float]], Awaitable[str]]] = None
+
+    class Config:
+        arbitrary_types_allowed = True
+
+class AsyncCrawlerStrategy(ABC):
+    @abstractmethod
+    async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
+        pass
+    
+    @abstractmethod
+    async def crawl_many(self, urls: List[str], **kwargs) -> List[AsyncCrawlResponse]:
+        pass
+    
+    @abstractmethod
+    async def take_screenshot(self, **kwargs) -> str:
+        pass
+    
+    @abstractmethod
+    def update_user_agent(self, user_agent: str):
+        pass
+    
+    @abstractmethod
+    def set_hook(self, hook_type: str, hook: Callable):
+        pass
+
+class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
+    def __init__(self, use_cached_html=False, js_code=None, **kwargs):
+        self.use_cached_html = use_cached_html
+        self.user_agent = kwargs.get(
+            "user_agent",
+            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
+            "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
+        )
+        self.proxy = kwargs.get("proxy")
+        self.proxy_config = kwargs.get("proxy_config")
+        self.headless = kwargs.get("headless", True)
+        self.browser_type = kwargs.get("browser_type", "chromium")
+        self.headers = kwargs.get("headers", {})
+        self.sessions = {}
+        self.session_ttl = 1800 
+        self.js_code = js_code
+        self.verbose = kwargs.get("verbose", False)
+        self.playwright = None
+        self.browser = None
+        self.sleep_on_close = kwargs.get("sleep_on_close", False)
+        self.use_managed_browser = kwargs.get("use_managed_browser", False)
+        self.user_data_dir = kwargs.get("user_data_dir", None)
+        self.managed_browser = None
+        self.hooks = {
+            'on_browser_created': None,
+            'on_user_agent_updated': None,
+            'on_execution_started': None,
+            'before_goto': None,
+            'after_goto': None,
+            'before_return_html': None,
+            'before_retrieve_html': None
+        }
+
+    async def __aenter__(self):
+        await self.start()
+        return self
+
+    async def __aexit__(self, exc_type, exc_val, exc_tb):
+        await self.close()
+
+    async def start(self):
+        if self.playwright is None:
+            self.playwright = await async_playwright().start()
+        if self.browser is None:
+            if self.use_managed_browser:
+                # Use managed browser approach
+                self.managed_browser = ManagedBrowser(
+                    browser_type=self.browser_type,
+                    user_data_dir=self.user_data_dir,
+                    headless=self.headless
+                )
+                cdp_url = await self.managed_browser.start()
+                self.browser = await self.playwright.chromium.connect_over_cdp(cdp_url)
+            else:
+                browser_args = {
+                    "headless": self.headless,
+                    "args": [
+                        "--disable-gpu",
+                        "--no-sandbox",
+                        "--disable-dev-shm-usage",
+                        "--disable-blink-features=AutomationControlled",
+                        "--disable-infobars",
+                        "--window-position=0,0",
+                        "--ignore-certificate-errors",
+                        "--ignore-certificate-errors-spki-list",
+                        # "--headless=new",  # Use the new headless mode
+                    ]
+                }
+                
+                # Add proxy settings if a proxy is specified
+                if self.proxy:
+                    proxy_settings = ProxySettings(server=self.proxy)
+                    browser_args["proxy"] = proxy_settings
+                elif self.proxy_config:
+                    proxy_settings = ProxySettings(server=self.proxy_config.get("server"), username=self.proxy_config.get("username"), password=self.proxy_config.get("password"))
+                    browser_args["proxy"] = proxy_settings
+                    
+                # Select the appropriate browser based on the browser_type
+                if self.browser_type == "firefox":
+                    self.browser = await self.playwright.firefox.launch(**browser_args)
+                elif self.browser_type == "webkit":
+                    self.browser = await self.playwright.webkit.launch(**browser_args)
+                else:
+                    self.browser = await self.playwright.chromium.launch(**browser_args)
+
+            await self.execute_hook('on_browser_created', self.browser)
+
+    async def close(self):
+        if self.sleep_on_close:
+            await asyncio.sleep(0.5)
+        if self.browser:
+            await self.browser.close()
+            self.browser = None
+        if self.managed_browser:
+            await self.managed_browser.cleanup()
+            self.managed_browser = None
+        if self.playwright:
+            await self.playwright.stop()
+            self.playwright = None
+
+    def __del__(self):
+        if self.browser or self.playwright:
+            asyncio.get_event_loop().run_until_complete(self.close())
+
+    def set_hook(self, hook_type: str, hook: Callable):
+        if hook_type in self.hooks:
+            self.hooks[hook_type] = hook
+        else:
+            raise ValueError(f"Invalid hook type: {hook_type}")
+
+    async def execute_hook(self, hook_type: str, *args):
+        hook = self.hooks.get(hook_type)
+        if hook:
+            if asyncio.iscoroutinefunction(hook):
+                return await hook(*args)
+            else:
+                return hook(*args)
+        return args[0] if args else None
+
+    def update_user_agent(self, user_agent: str):
+        self.user_agent = user_agent
+
+    def set_custom_headers(self, headers: Dict[str, str]):
+        self.headers = headers
+
+    async def kill_session(self, session_id: str):
+        if session_id in self.sessions:
+            context, page, _ = self.sessions[session_id]
+            await page.close()
+            await context.close()
+            del self.sessions[session_id]
+
+    def _cleanup_expired_sessions(self):
+        current_time = time.time()
+        expired_sessions = [
+            sid for sid, (_, _, last_used) in self.sessions.items() 
+            if current_time - last_used > self.session_ttl
+        ]
+        for sid in expired_sessions:
+            asyncio.create_task(self.kill_session(sid))
+            
+    async def smart_wait(self, page: Page, wait_for: str, timeout: float = 30000):
+        wait_for = wait_for.strip()
+        
+        if wait_for.startswith('js:'):
+            # Explicitly specified JavaScript
+            js_code = wait_for[3:].strip()
+            return await self.csp_compliant_wait(page, js_code, timeout)
+        elif wait_for.startswith('css:'):
+            # Explicitly specified CSS selector
+            css_selector = wait_for[4:].strip()
+            try:
+                await page.wait_for_selector(css_selector, timeout=timeout)
+            except Error as e:
+                if 'Timeout' in str(e):
+                    raise TimeoutError(f"Timeout after {timeout}ms waiting for selector '{css_selector}'")
+                else:
+                    raise ValueError(f"Invalid CSS selector: '{css_selector}'")
+        else:
+            # Auto-detect based on content
+            if wait_for.startswith('()') or wait_for.startswith('function'):
+                # It's likely a JavaScript function
+                return await self.csp_compliant_wait(page, wait_for, timeout)
+            else:
+                # Assume it's a CSS selector first
+                try:
+                    await page.wait_for_selector(wait_for, timeout=timeout)
+                except Error as e:
+                    if 'Timeout' in str(e):
+                        raise TimeoutError(f"Timeout after {timeout}ms waiting for selector '{wait_for}'")
+                    else:
+                        # If it's not a timeout error, it might be an invalid selector
+                        # Let's try to evaluate it as a JavaScript function as a fallback
+                        try:
+                            return await self.csp_compliant_wait(page, f"() => {{{wait_for}}}", timeout)
+                        except Error:
+                            raise ValueError(f"Invalid wait_for parameter: '{wait_for}'. "
+                                             "It should be either a valid CSS selector, a JavaScript function, "
+                                             "or explicitly prefixed with 'js:' or 'css:'.")
+    
+    async def csp_compliant_wait(self, page: Page, user_wait_function: str, timeout: float = 30000):
+        wrapper_js = f"""
+        async () => {{
+            const userFunction = {user_wait_function};
+            const startTime = Date.now();
+            while (true) {{
+                if (await userFunction()) {{
+                    return true;
+                }}
+                if (Date.now() - startTime > {timeout}) {{
+                    throw new Error('Timeout waiting for condition');
+                }}
+                await new Promise(resolve => setTimeout(resolve, 100));
+            }}
+        }}
+        """
+        
+        try:
+            await page.evaluate(wrapper_js)
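+        except Exception as e:
+            # The wrapper throws a plain JavaScript Error on timeout, which Playwright
+            # surfaces as a generic error, so detect the timeout by its message.
+            if 'Timeout waiting for condition' in str(e):
+                raise TimeoutError(f"Timeout after {timeout}ms waiting for condition")
+            raise RuntimeError(f"Error in wait condition: {str(e)}")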
+
+    async def process_iframes(self, page):
+        # Find all iframes
+        iframes = await page.query_selector_all('iframe')
+        
+        for i, iframe in enumerate(iframes):
+            try:
+                # Add a unique identifier to the iframe
+                await iframe.evaluate(f'(element) => element.id = "iframe-{i}"')
+                
+                # Get the frame associated with this iframe
+                frame = await iframe.content_frame()
+                
+                if frame:
+                    # Wait for the frame to load
+                    await frame.wait_for_load_state('load', timeout=30000)  # 30 seconds timeout
+                    
+                    # Extract the content of the iframe's body
+                    iframe_content = await frame.evaluate('() => document.body.innerHTML')
+                    
+                    # Generate a unique class name for this iframe
+                    class_name = f'extracted-iframe-content-{i}'
+                    
+                    # Replace the iframe with a div containing the extracted content.
+                    # The HTML is passed as an argument instead of being interpolated into
+                    # the script, so backticks, backslashes, and `${...}` need no escaping.
+                    await page.evaluate("""
+                        ([iframeId, html, className]) => {
+                            const iframe = document.getElementById(iframeId);
+                            const div = document.createElement('div');
+                            div.innerHTML = html;
+                            div.className = className;
+                            iframe.replaceWith(div);
+                        }
+                    """, [f"iframe-{i}", iframe_content, class_name])
+                else:
+                    print(f"Warning: Could not access content frame for iframe {i}")
+            except Exception as e:
+                print(f"Error processing iframe {i}: {str(e)}")
+
+        # Return the page object
+        return page  
+    
+    async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
+        response_headers = {}
+        status_code = None
+        
+        self._cleanup_expired_sessions()
+        session_id = kwargs.get("session_id")
+        if session_id:
+            context, page, _ = self.sessions.get(session_id, (None, None, None))
+            if not context:
+                context = await self.browser.new_context(
+                    user_agent=self.user_agent,
+                    viewport={"width": 1920, "height": 1080},
+                    proxy={"server": self.proxy} if self.proxy else None,
+                    accept_downloads=True,
+                    java_script_enabled=True
+                )
+                await context.add_cookies([{"name": "cookiesEnabled", "value": "true", "url": url}])
+                await context.set_extra_http_headers(self.headers)
+                page = await context.new_page()
+                self.sessions[session_id] = (context, page, time.time())
+        else:
+            context = await self.browser.new_context(
+                user_agent=self.user_agent,
+                viewport={"width": 1920, "height": 1080},
+                proxy={"server": self.proxy} if self.proxy else None
+            )
+            await context.set_extra_http_headers(self.headers)
+            
+            if kwargs.get("override_navigator", False) or kwargs.get("simulate_user", False) or kwargs.get("magic", False):
+                # Inject scripts to override navigator properties
+                await context.add_init_script("""
+                    // Pass the Permissions Test.
+                    const originalQuery = window.navigator.permissions.query;
+                    window.navigator.permissions.query = (parameters) => (
+                        parameters.name === 'notifications' ?
+                            Promise.resolve({ state: Notification.permission }) :
+                            originalQuery(parameters)
+                    );
+                    Object.defineProperty(navigator, 'webdriver', {
+                        get: () => undefined
+                    });
+                    window.navigator.chrome = {
+                        runtime: {},
+                        // Add other properties if necessary
+                    };
+                    Object.defineProperty(navigator, 'plugins', {
+                        get: () => [1, 2, 3, 4, 5],
+                    });
+                    Object.defineProperty(navigator, 'languages', {
+                        get: () => ['en-US', 'en'],
+                    });
+                    Object.defineProperty(document, 'hidden', {
+                        get: () => false
+                    });
+                    Object.defineProperty(document, 'visibilityState', {
+                        get: () => 'visible'
+                    });
+                """)
+            
+            page = await context.new_page()
+            # await stealth_async(page) #, stealth_config)
+
+        # Add console message and error logging
+        if kwargs.get("log_console", False):
+            page.on("console", lambda msg: print(f"Console: {msg.text}"))
+            page.on("pageerror", lambda exc: print(f"Page Error: {exc}"))
+        
+        try:
+            if self.verbose:
+                print(f"[LOG] 🕸️ Crawling {url} using AsyncPlaywrightCrawlerStrategy...")
+
+            if self.use_cached_html:
+                cache_file_path = os.path.join(
+                    Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest()
+                )
+                if os.path.exists(cache_file_path):
+                    html = ""
+                    with open(cache_file_path, "r", encoding="utf-8") as f:
+                        html = f.read()
+                    # Retrieve response headers and status code from the cache
+                    with open(cache_file_path + ".meta", "r", encoding="utf-8") as f:
+                        meta = json.load(f)
+                        response_headers = meta.get("response_headers", {})
+                        status_code = meta.get("status_code")
+                    response = AsyncCrawlResponse(
+                        html=html, response_headers=response_headers, status_code=status_code
+                    )
+                    return response
+
+            if not kwargs.get("js_only", False):
+                await self.execute_hook('before_goto', page)
+                
+                response = await page.goto(
+                    url, wait_until="domcontentloaded", timeout=kwargs.get("page_timeout", 60000)
+                )
+                
+                # response = await page.goto("about:blank")
+                # await page.evaluate(f"window.location.href = '{url}'")
+                
+                await self.execute_hook('after_goto', page)
+                
+                # Get status code and headers
+                status_code = response.status
+                response_headers = response.headers
+            else:
+                status_code = 200
+                response_headers = {}
+
+            # Wait for the body to be attached and visible before interacting with the page
+            try:
+                # First wait for body to exist, regardless of visibility
+                await page.wait_for_selector('body', state='attached', timeout=30000)
+                
+                # Then wait for it to become visible by checking CSS
+                await page.wait_for_function("""
+                    () => {
+                        const body = document.body;
+                        const style = window.getComputedStyle(body);
+                        return style.display !== 'none' && 
+                            style.visibility !== 'hidden' && 
+                            style.opacity !== '0';
+                    }
+                """, timeout=30000)
+                
+            except Error as e:
+                # If waiting fails, let's try to diagnose the issue
+                visibility_info = await page.evaluate("""
+                    () => {
+                        const body = document.body;
+                        const style = window.getComputedStyle(body);
+                        return {
+                            display: style.display,
+                            visibility: style.visibility,
+                            opacity: style.opacity,
+                            hasContent: body.innerHTML.length,
+                            classList: Array.from(body.classList)
+                        }
+                    }
+                """)
+                
+                if self.verbose:
+                    print(f"Body visibility debug info: {visibility_info}")
+                
+                # Even if body is hidden, we might still want to proceed
+                if kwargs.get('ignore_body_visibility', True):
+                    if self.verbose:
+                        print("Proceeding despite hidden body...")
+                    pass
+                else:
+                    raise Error(f"Body element is hidden: {visibility_info}")
+            
+            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
+
+            js_code = kwargs.get("js_code", kwargs.get("js", self.js_code))
+            if js_code:
+                if isinstance(js_code, str):
+                    await page.evaluate(js_code)
+                elif isinstance(js_code, list):
+                    for js in js_code:
+                        await page.evaluate(js)
+                
+                await page.wait_for_load_state('networkidle')
+                # Fire the on_execution_started hook once the injected scripts have run
+                await self.execute_hook('on_execution_started', page)
+                
+            if kwargs.get("simulate_user", False) or kwargs.get("magic", False):
+                # Simulate user interactions
+                await page.mouse.move(100, 100)
+                await page.mouse.down()
+                await page.mouse.up()
+                await page.keyboard.press('ArrowDown')
+
+            # Handle the wait_for parameter
+            wait_for = kwargs.get("wait_for")
+            if wait_for:
+                try:
+                    await self.smart_wait(page, wait_for, timeout=kwargs.get("page_timeout", 60000))
+                except Exception as e:
+                    raise RuntimeError(f"Wait condition failed: {str(e)}")
+
+            # Update image dimensions
+            update_image_dimensions_js = """
+            () => {
+                return new Promise((resolve) => {
+                    const filterImage = (img) => {
+                        // Filter out images that are too small
+                        if (img.width < 100 && img.height < 100) return false;
+                        
+                        // Filter out images that are not visible
+                        const rect = img.getBoundingClientRect();
+                        if (rect.width === 0 || rect.height === 0) return false;
+                        
+                        // Filter out images with certain class names (e.g., icons, thumbnails)
+                        if (img.classList.contains('icon') || img.classList.contains('thumbnail')) return false;
+                        
+                        // Filter out images with certain patterns in their src (e.g., placeholder images)
+                        if (img.src.includes('placeholder') || img.src.includes('icon')) return false;
+                        
+                        return true;
+                    };
+
+                    const images = Array.from(document.querySelectorAll('img')).filter(filterImage);
+                    let imagesLeft = images.length;
+                    
+                    if (imagesLeft === 0) {
+                        resolve();
+                        return;
+                    }
+
+                    const checkImage = (img) => {
+                        if (img.complete && img.naturalWidth !== 0) {
+                            img.setAttribute('width', img.naturalWidth);
+                            img.setAttribute('height', img.naturalHeight);
+                            imagesLeft--;
+                            if (imagesLeft === 0) resolve();
+                        }
+                    };
+
+                    images.forEach(img => {
+                        checkImage(img);
+                        if (!img.complete) {
+                            img.onload = () => {
+                                checkImage(img);
+                            };
+                            img.onerror = () => {
+                                imagesLeft--;
+                                if (imagesLeft === 0) resolve();
+                            };
+                        }
+                    });
+
+                    // Fallback timeout of 5 seconds (disabled; the promise resolves immediately below)
+                    // setTimeout(() => resolve(), 5000);
+                    resolve();
+                });
+            }
+            """
+            await page.evaluate(update_image_dimensions_js)
+
+            # Wait a bit for any onload events to complete
+            await page.wait_for_timeout(100)
+
+            # Process iframes
+            if kwargs.get("process_iframes", False):
+                page = await self.process_iframes(page)
+            
+            await self.execute_hook('before_retrieve_html', page)
+            # If delay_before_return_html is set, wait that many seconds before reading the HTML
+            delay_before_return_html = kwargs.get("delay_before_return_html")
+            if delay_before_return_html:
+                await asyncio.sleep(delay_before_return_html)
+                
+            # Check for remove_overlay_elements parameter
+            if kwargs.get("remove_overlay_elements", False):
+                await self.remove_overlay_elements(page)
+            
+            html = await page.content()
+            await self.execute_hook('before_return_html', page, html)
+            
+            # Take a screenshot if screenshot=True was passed
+            screenshot_data = None
+            if kwargs.get("screenshot"):
+                # If screenshot_wait_for is set, wait that many seconds before capturing
+                screenshot_wait_for = kwargs.get("screenshot_wait_for")
+                if screenshot_wait_for:
+                    await asyncio.sleep(screenshot_wait_for)
+                screenshot_data = await self.take_screenshot(page)          
+
+            if self.verbose:
+                print(f"[LOG] ✅ Crawled {url} successfully!")
+
+            if self.use_cached_html:
+                cache_file_path = os.path.join(
+                    Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest()
+                )
+                with open(cache_file_path, "w", encoding="utf-8") as f:
+                    f.write(html)
+                # store response headers and status code in cache
+                with open(cache_file_path + ".meta", "w", encoding="utf-8") as f:
+                    json.dump({
+                        "response_headers": response_headers,
+                        "status_code": status_code
+                    }, f)
+
+            async def get_delayed_content(delay: float = 5.0) -> str:
+                if self.verbose:
+                    print(f"[LOG] Waiting for {delay} seconds before retrieving content for {url}")
+                await asyncio.sleep(delay)
+                return await page.content()
+                
+            response = AsyncCrawlResponse(
+                html=html, 
+                response_headers=response_headers, 
+                status_code=status_code,
+                screenshot=screenshot_data,
+                get_delayed_content=get_delayed_content
+            )
+            return response
+        except Error as e:
+            raise Error(f"[ERROR] 🚫 crawl(): Failed to crawl {url}: {str(e)}") from e
+        # finally:
+        #     if not session_id:
+        #         await page.close()
+        #         await context.close()
+
+    async def crawl_many(self, urls: List[str], **kwargs) -> List[AsyncCrawlResponse]:
+        semaphore_count = kwargs.get('semaphore_count', 5)  # Adjust as needed
+        semaphore = asyncio.Semaphore(semaphore_count)
+
+        async def crawl_with_semaphore(url):
+            async with semaphore:
+                return await self.crawl(url, **kwargs)
+
+        tasks = [crawl_with_semaphore(url) for url in urls]
+        results = await asyncio.gather(*tasks, return_exceptions=True)
+        return [result if not isinstance(result, Exception) else str(result) for result in results]
+
+    async def remove_overlay_elements(self, page: Page) -> None:
+        """
+        Removes popup overlays, modals, cookie notices, and other intrusive elements from the page.
+        
+        Args:
+            page (Page): The Playwright page instance
+        """
+        remove_overlays_js = """
+        async () => {
+            // Function to check if element is visible
+            const isVisible = (elem) => {
+                const style = window.getComputedStyle(elem);
+                return style.display !== 'none' && 
+                       style.visibility !== 'hidden' && 
+                       style.opacity !== '0';
+            };
+
+            // Common selectors for popups and overlays
+            const commonSelectors = [
+                // Close buttons first
+                'button[class*="close" i]', 'button[class*="dismiss" i]', 
+                'button[aria-label*="close" i]', 'button[title*="close" i]',
+                'a[class*="close" i]', 'span[class*="close" i]',
+                
+                // Cookie notices
+                '[class*="cookie-banner" i]', '[id*="cookie-banner" i]',
+                '[class*="cookie-consent" i]', '[id*="cookie-consent" i]',
+                
+                // Newsletter/subscription dialogs
+                '[class*="newsletter" i]', '[class*="subscribe" i]',
+                
+                // Generic popups/modals
+                '[class*="popup" i]', '[class*="modal" i]', 
+                '[class*="overlay" i]', '[class*="dialog" i]',
+                '[role="dialog"]', '[role="alertdialog"]'
+            ];
+
+            // Try to click close buttons first
+            for (const selector of commonSelectors.slice(0, 6)) {
+                const closeButtons = document.querySelectorAll(selector);
+                for (const button of closeButtons) {
+                    if (isVisible(button)) {
+                        try {
+                            button.click();
+                            await new Promise(resolve => setTimeout(resolve, 100));
+                        } catch (e) {
+                            console.log('Error clicking button:', e);
+                        }
+                    }
+                }
+            }
+
+            // Remove remaining overlay elements
+            const removeOverlays = () => {
+                // Find elements with high z-index
+                const allElements = document.querySelectorAll('*');
+                for (const elem of allElements) {
+                    const style = window.getComputedStyle(elem);
+                    const zIndex = parseInt(style.zIndex);
+                    const position = style.position;
+                    
+                    if (
+                        isVisible(elem) && 
+                        (zIndex > 999 || position === 'fixed' || position === 'absolute') &&
+                        (
+                            elem.offsetWidth > window.innerWidth * 0.5 ||
+                            elem.offsetHeight > window.innerHeight * 0.5 ||
+                            style.backgroundColor.includes('rgba') ||
+                            parseFloat(style.opacity) < 1
+                        )
+                    ) {
+                        elem.remove();
+                    }
+                }
+
+                // Remove elements matching common selectors
+                for (const selector of commonSelectors) {
+                    const elements = document.querySelectorAll(selector);
+                    elements.forEach(elem => {
+                        if (isVisible(elem)) {
+                            elem.remove();
+                        }
+                    });
+                }
+            };
+
+            // Remove overlay elements
+            removeOverlays();
+
+            // Remove any fixed/sticky position elements at the top/bottom
+            const removeFixedElements = () => {
+                const elements = document.querySelectorAll('*');
+                elements.forEach(elem => {
+                    const style = window.getComputedStyle(elem);
+                    if (
+                        (style.position === 'fixed' || style.position === 'sticky') &&
+                        isVisible(elem)
+                    ) {
+                        elem.remove();
+                    }
+                });
+            };
+
+            removeFixedElements();
+            
+            // Helper to remove empty block elements such as div, p, span (defined here but not invoked)
+            const removeEmptyBlockElements = () => {
+                const blockElements = document.querySelectorAll('div, p, span, section, article, header, footer, aside, nav, main, ul, ol, li, dl, dt, dd, h1, h2, h3, h4, h5, h6');
+                blockElements.forEach(elem => {
+                    if (elem.innerText.trim() === '') {
+                        elem.remove();
+                    }
+                });
+            };
+
+            // Remove margin-right and padding-right from body (often added by modal scripts)
+            document.body.style.marginRight = '0px';
+            document.body.style.paddingRight = '0px';
+            document.body.style.overflow = 'auto';
+
+            // Wait a bit for any animations to complete
+            await new Promise(resolve => setTimeout(resolve, 100));
+        }
+        """
+        
+        try:
+            await page.evaluate(remove_overlays_js)
+            await page.wait_for_timeout(500)  # Wait for any animations to complete
+        except Exception as e:
+            if self.verbose:
+                print(f"Warning: Failed to remove overlay elements: {str(e)}")
+
+    async def take_screenshot(self, page: Page) -> str:
+        try:
+            # The page is already loaded, just take the screenshot
+            screenshot = await page.screenshot(full_page=True)
+            return base64.b64encode(screenshot).decode('utf-8')
+        except Exception as e:
+            error_message = f"Failed to take screenshot: {str(e)}"
+            print(error_message)
+
+            # Generate an error image
+            img = Image.new('RGB', (800, 600), color='black')
+            draw = ImageDraw.Draw(img)
+            font = ImageFont.load_default()
+            draw.text((10, 10), error_message, fill=(255, 255, 255), font=font)
+            
+            buffered = BytesIO()
+            img.save(buffered, format="JPEG")
+            return base64.b64encode(buffered.getvalue()).decode('utf-8')
+        finally:
+            await page.close()
+
--- a/crawl4ai/async_database.py
+++ b/crawl4ai/async_database.py
@@ -0,0 +1,192 @@
+import os
+from pathlib import Path
+import aiosqlite
+import asyncio
+from typing import Optional, Tuple, Dict
+from contextlib import asynccontextmanager
+import logging
+
+# Set up logging
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+
+DB_PATH = os.path.join(Path.home(), ".crawl4ai")
+os.makedirs(DB_PATH, exist_ok=True)
+DB_PATH = os.path.join(DB_PATH, "crawl4ai.db")
+
+class AsyncDatabaseManager:
+    def __init__(self, pool_size: int = 10, max_retries: int = 3):
+        self.db_path = DB_PATH
+        self.pool_size = pool_size
+        self.max_retries = max_retries
+        self.connection_pool: Dict[int, aiosqlite.Connection] = {}
+        self.pool_lock = asyncio.Lock()
+        self.connection_semaphore = asyncio.Semaphore(pool_size)
+        
+    async def initialize(self):
+        """Initialize the database and connection pool"""
+        await self.ainit_db()
+        
+    async def cleanup(self):
+        """Cleanup connections when shutting down"""
+        async with self.pool_lock:
+            for conn in self.connection_pool.values():
+                await conn.close()
+            self.connection_pool.clear()
+
+    @asynccontextmanager
+    async def get_connection(self):
+        """Connection pool manager"""
+        async with self.connection_semaphore:
+            task_id = id(asyncio.current_task())
+            try:
+                async with self.pool_lock:
+                    if task_id not in self.connection_pool:
+                        conn = await aiosqlite.connect(
+                            self.db_path,
+                            timeout=30.0
+                        )
+                        await conn.execute('PRAGMA journal_mode = WAL')
+                        await conn.execute('PRAGMA busy_timeout = 5000')
+                        self.connection_pool[task_id] = conn
+                    
+                yield self.connection_pool[task_id]
+                
+            except Exception as e:
+                logger.error(f"Connection error: {e}")
+                raise
+            finally:
+                async with self.pool_lock:
+                    if task_id in self.connection_pool:
+                        await self.connection_pool[task_id].close()
+                        del self.connection_pool[task_id]
+
+    async def execute_with_retry(self, operation, *args):
+        """Execute database operations with retry logic"""
+        for attempt in range(self.max_retries):
+            try:
+                async with self.get_connection() as db:
+                    result = await operation(db, *args)
+                    await db.commit()
+                    return result
+            except Exception as e:
+                if attempt == self.max_retries - 1:
+                    logger.error(f"Operation failed after {self.max_retries} attempts: {e}")
+                    raise
+                await asyncio.sleep(1 * (attempt + 1))  # Linear backoff: 1s, 2s, ...
+
+    async def ainit_db(self):
+        """Initialize database schema"""
+        async def _init(db):
+            await db.execute('''
+                CREATE TABLE IF NOT EXISTS crawled_data (
+                    url TEXT PRIMARY KEY,
+                    html TEXT,
+                    cleaned_html TEXT,
+                    markdown TEXT,
+                    extracted_content TEXT,
+                    success BOOLEAN,
+                    media TEXT DEFAULT "{}",
+                    links TEXT DEFAULT "{}",
+                    metadata TEXT DEFAULT "{}",
+                    screenshot TEXT DEFAULT ""
+                )
+            ''')
+        
+        await self.execute_with_retry(_init)
+        await self.update_db_schema()
+
+    async def update_db_schema(self):
+        """Update database schema if needed"""
+        async def _check_columns(db):
+            cursor = await db.execute("PRAGMA table_info(crawled_data)")
+            columns = await cursor.fetchall()
+            return [column[1] for column in columns]
+
+        column_names = await self.execute_with_retry(_check_columns)
+        
+        for column in ['media', 'links', 'metadata', 'screenshot']:
+            if column not in column_names:
+                await self.aalter_db_add_column(column)
+
+    async def aalter_db_add_column(self, new_column: str):
+        """Add new column to the database"""
+        async def _alter(db):
+            await db.execute(f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""')
+            logger.info(f"Added column '{new_column}' to the database.")
+
+        await self.execute_with_retry(_alter)
+
+    async def aget_cached_url(self, url: str) -> Optional[Tuple[str, str, str, str, str, bool, str, str, str, str]]:
+        """Retrieve cached URL data"""
+        async def _get(db):
+            async with db.execute(
+                'SELECT url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot FROM crawled_data WHERE url = ?',
+                (url,)
+            ) as cursor:
+                return await cursor.fetchone()
+
+        try:
+            return await self.execute_with_retry(_get)
+        except Exception as e:
+            logger.error(f"Error retrieving cached URL: {e}")
+            return None
+
+    async def acache_url(self, url: str, html: str, cleaned_html: str, markdown: str, extracted_content: str, success: bool, media: str = "{}", links: str = "{}", metadata: str = "{}", screenshot: str = ""):
+        """Cache URL data with retry logic"""
+        async def _cache(db):
+            await db.execute('''
+                INSERT INTO crawled_data (url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot)
+                VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
+                ON CONFLICT(url) DO UPDATE SET
+                    html = excluded.html,
+                    cleaned_html = excluded.cleaned_html,
+                    markdown = excluded.markdown,
+                    extracted_content = excluded.extracted_content,
+                    success = excluded.success,
+                    media = excluded.media,      
+                    links = excluded.links,    
+                    metadata = excluded.metadata,      
+                    screenshot = excluded.screenshot
+            ''', (url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot))
+
+        try:
+            await self.execute_with_retry(_cache)
+        except Exception as e:
+            logger.error(f"Error caching URL: {e}")
+
+    async def aget_total_count(self) -> int:
+        """Get total number of cached URLs"""
+        async def _count(db):
+            async with db.execute('SELECT COUNT(*) FROM crawled_data') as cursor:
+                result = await cursor.fetchone()
+                return result[0] if result else 0
+
+        try:
+            return await self.execute_with_retry(_count)
+        except Exception as e:
+            logger.error(f"Error getting total count: {e}")
+            return 0
+
+    async def aclear_db(self):
+        """Clear all data from the database"""
+        async def _clear(db):
+            await db.execute('DELETE FROM crawled_data')
+
+        try:
+            await self.execute_with_retry(_clear)
+        except Exception as e:
+            logger.error(f"Error clearing database: {e}")
+
+    async def aflush_db(self):
+        """Drop the entire table"""
+        async def _flush(db):
+            await db.execute('DROP TABLE IF EXISTS crawled_data')
+
+        try:
+            await self.execute_with_retry(_flush)
+        except Exception as e:
+            logger.error(f"Error flushing database: {e}")
+
+# Create a singleton instance
+async_db_manager = AsyncDatabaseManager()
--- a/crawl4ai/async_webcrawler.py
+++ b/crawl4ai/async_webcrawler.py
@@ -0,0 +1,289 @@
+import os
+import time
+from pathlib import Path
+from typing import List, Optional
+import json
+import asyncio
+from .models import CrawlResult
+from .async_database import async_db_manager
+from .chunking_strategy import *
+from .extraction_strategy import *
+from .async_crawler_strategy import AsyncCrawlerStrategy, AsyncPlaywrightCrawlerStrategy, AsyncCrawlResponse
+from .content_scrapping_strategy import WebScrappingStrategy
+from .config import MIN_WORD_THRESHOLD, IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD
+from .utils import (
+    sanitize_input_encode,
+    InvalidCSSSelectorError,
+    format_html
+)
+from ._version import __version__ as crawl4ai_version
+
+class AsyncWebCrawler:
+    def __init__(
+        self,
+        crawler_strategy: Optional[AsyncCrawlerStrategy] = None,
+        always_by_pass_cache: bool = False,
+        base_directory: str = str(Path.home()),
+        **kwargs,
+    ):
+        self.crawler_strategy = crawler_strategy or AsyncPlaywrightCrawlerStrategy(
+            **kwargs
+        )
+        self.always_by_pass_cache = always_by_pass_cache
+        # self.crawl4ai_folder = os.path.join(Path.home(), ".crawl4ai")
+        self.crawl4ai_folder = os.path.join(base_directory, ".crawl4ai")
+        os.makedirs(self.crawl4ai_folder, exist_ok=True)
+        os.makedirs(f"{self.crawl4ai_folder}/cache", exist_ok=True)
+        self.ready = False
+        self.verbose = kwargs.get("verbose", False)
+
+    async def __aenter__(self):
+        await self.crawler_strategy.__aenter__()
+        await self.awarmup()
+        return self
+
+    async def __aexit__(self, exc_type, exc_val, exc_tb):
+        await self.crawler_strategy.__aexit__(exc_type, exc_val, exc_tb)
+
+    async def awarmup(self):
+        # Print a message for crawl4ai and its version
+        print(f"[LOG] 🚀 Crawl4AI {crawl4ai_version}")
+        if self.verbose:
+            print("[LOG] 🌤️  Warming up the AsyncWebCrawler")
+        # await async_db_manager.ainit_db()
+        await async_db_manager.initialize()
+        await self.arun(
+            url="https://google.com/",
+            word_count_threshold=5,
+            bypass_cache=False,
+            verbose=False,
+        )
+        self.ready = True
+        if self.verbose:
+            print("[LOG] 🌞 AsyncWebCrawler is ready to crawl")
+
+    async def arun(
+        self,
+        url: str,
+        word_count_threshold=MIN_WORD_THRESHOLD,
+        extraction_strategy: ExtractionStrategy = None,
+        chunking_strategy: ChunkingStrategy = RegexChunking(),
+        bypass_cache: bool = False,
+        css_selector: str = None,
+        screenshot: bool = False,
+        user_agent: str = None,
+        verbose=True,
+        **kwargs,
+    ) -> CrawlResult:
+        try:
+            extraction_strategy = extraction_strategy or NoExtractionStrategy()
+            extraction_strategy.verbose = verbose
+            if not isinstance(extraction_strategy, ExtractionStrategy):
+                raise ValueError("Unsupported extraction strategy")
+            if not isinstance(chunking_strategy, ChunkingStrategy):
+                raise ValueError("Unsupported chunking strategy")
+            
+            word_count_threshold = max(word_count_threshold, MIN_WORD_THRESHOLD)
+
+            async_response: AsyncCrawlResponse = None
+            cached = None
+            screenshot_data = None
+            extracted_content = None
+            if not bypass_cache and not self.always_by_pass_cache:
+                cached = await async_db_manager.aget_cached_url(url)
+
+            if kwargs.get("warmup", True) and not self.ready:
+                return None
+
+            if cached:
+                html = sanitize_input_encode(cached[1])
+                extracted_content = sanitize_input_encode(cached[4])
+                if screenshot:
+                    screenshot_data = cached[9]
+                    if not screenshot_data:
+                        cached = None
+
+            if not cached or not html:
+                t1 = time.time()
+                if user_agent:
+                    self.crawler_strategy.update_user_agent(user_agent)
+                async_response: AsyncCrawlResponse = await self.crawler_strategy.crawl(url, screenshot=screenshot, **kwargs)
+                html = sanitize_input_encode(async_response.html)
+                screenshot_data = async_response.screenshot
+                t2 = time.time()
+                if verbose:
+                    print(
+                        f"[LOG] 🚀 Crawling done for {url}, success: {bool(html)}, time taken: {t2 - t1:.2f} seconds"
+                    )
+
+            crawl_result = await self.aprocess_html(
+                url,
+                html,
+                extracted_content,
+                word_count_threshold,
+                extraction_strategy,
+                chunking_strategy,
+                css_selector,
+                screenshot_data,
+                verbose,
+                bool(cached),
+                async_response=async_response,
+                bypass_cache=bypass_cache,
+                **kwargs,
+            )
+            crawl_result.status_code = async_response.status_code if async_response else 200
+            crawl_result.response_headers = async_response.response_headers if async_response else {}
+            crawl_result.success = bool(html)
+            crawl_result.session_id = kwargs.get("session_id", None)
+            return crawl_result
+        except Exception as e:
+            if not hasattr(e, "msg"):
+                e.msg = str(e)
+            print(f"[ERROR] 🚫 arun(): Failed to crawl {url}, error: {e.msg}")
+            return CrawlResult(url=url, html="", markdown = f"[ERROR] 🚫 arun(): Failed to crawl {url}, error: {e.msg}", success=False, error_message=e.msg)
+
+    async def arun_many(
+        self,
+        urls: List[str],
+        word_count_threshold=MIN_WORD_THRESHOLD,
+        extraction_strategy: ExtractionStrategy = None,
+        chunking_strategy: ChunkingStrategy = RegexChunking(),
+        bypass_cache: bool = False,
+        css_selector: str = None,
+        screenshot: bool = False,
+        user_agent: str = None,
+        verbose=True,
+        **kwargs,
+    ) -> List[CrawlResult]:
+        tasks = [
+            self.arun(
+                url,
+                word_count_threshold,
+                extraction_strategy,
+                chunking_strategy,
+                bypass_cache,
+                css_selector,
+                screenshot,
+                user_agent,
+                verbose,
+                **kwargs
+            )
+            for url in urls
+        ]
+        return await asyncio.gather(*tasks)
+
+    async def aprocess_html(
+        self,
+        url: str,
+        html: str,
+        extracted_content: str,
+        word_count_threshold: int,
+        extraction_strategy: ExtractionStrategy,
+        chunking_strategy: ChunkingStrategy,
+        css_selector: str,
+        screenshot: str,
+        verbose: bool,
+        is_cached: bool,
+        **kwargs,
+    ) -> CrawlResult:
+        t = time.time()
+        # Extract content from HTML
+        try:
+            t1 = time.time()
+            scrapping_strategy = WebScrappingStrategy()
+            # result = await scrapping_strategy.ascrap(
+            result = scrapping_strategy.scrap(
+                url,
+                html,
+                word_count_threshold=word_count_threshold,
+                css_selector=css_selector,
+                only_text=kwargs.get("only_text", False),
+                image_description_min_word_threshold=kwargs.get(
+                    "image_description_min_word_threshold", IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD
+                ),
+                **kwargs,
+            )
+            if verbose:
+                print(
+                    f"[LOG] 🚀 Content extracted for {url}, success: True, time taken: {time.time() - t1:.2f} seconds"
+                )
+
+            if result is None:
+                raise ValueError(f"Process HTML, Failed to extract content from the website: {url}")
+        except InvalidCSSSelectorError as e:
+            raise ValueError(str(e))
+        except Exception as e:
+            raise ValueError(f"Process HTML, Failed to extract content from the website: {url}, error: {str(e)}")
+
+        cleaned_html = sanitize_input_encode(result.get("cleaned_html", ""))
+        markdown = sanitize_input_encode(result.get("markdown", ""))
+        fit_markdown = sanitize_input_encode(result.get("fit_markdown", ""))
+        fit_html = sanitize_input_encode(result.get("fit_html", ""))
+        media = result.get("media", [])
+        links = result.get("links", [])
+        metadata = result.get("metadata", {})
+
+        if extracted_content is None and extraction_strategy and chunking_strategy:
+            if verbose:
+                print(
+                    f"[LOG] 🔥 Extracting semantic blocks for {url}, Strategy: {extraction_strategy.__class__.__name__}"
+                )
+
+            # JsonCssExtractionStrategy works on the raw HTML rather than on markdown chunks
+            if isinstance(extraction_strategy, JsonCssExtractionStrategy):
+                extraction_strategy.verbose = verbose
+                extracted_content = extraction_strategy.run(url, [html])
+                extracted_content = json.dumps(extracted_content, indent=4, default=str, ensure_ascii=False)
+            else:
+                sections = chunking_strategy.chunk(markdown)
+                extracted_content = extraction_strategy.run(url, sections)
+                extracted_content = json.dumps(extracted_content, indent=4, default=str, ensure_ascii=False)
+
+        if verbose:
+            print(
+                f"[LOG] 🚀 Extraction done for {url}, time taken: {time.time() - t:.2f} seconds."
+            )
+
+        screenshot = None if not screenshot else screenshot
+
+        if not is_cached or kwargs.get("bypass_cache", False) or self.always_by_pass_cache:
+            await async_db_manager.acache_url(
+                url,
+                html,
+                cleaned_html,
+                markdown,
+                extracted_content,
+                True,
+                json.dumps(media),
+                json.dumps(links),
+                json.dumps(metadata),
+                screenshot=screenshot,
+            )
+
+        return CrawlResult(
+            url=url,
+            html=html,
+            cleaned_html=format_html(cleaned_html),
+            markdown=markdown,
+            fit_markdown=fit_markdown,
+            fit_html=fit_html,
+            media=media,
+            links=links,
+            metadata=metadata,
+            screenshot=screenshot,
+            extracted_content=extracted_content,
+            success=True,
+            error_message="",
+        )
+
+    async def aclear_cache(self):
+        # await async_db_manager.aclear_db()
+        await async_db_manager.cleanup()
+
+    async def aflush_cache(self):
+        await async_db_manager.aflush_db()
+
+    async def aget_cache_size(self):
+        return await async_db_manager.aget_total_count()
+
+
--- a/crawl4ai/chunking_strategy.py
+++ b/crawl4ai/chunking_strategy.py
@@ -3,6 +3,7 @@ import re
 from collections import Counter
 import string
 from .model_loader import load_nltk_punkt
+from .utils import *

 # Define the abstract base class for chunking strategies
 class ChunkingStrategy(ABC):
@@ -16,7 +17,7 @@ class ChunkingStrategy(ABC):
    
 # Regex-based chunking
 class RegexChunking(ChunkingStrategy):
-    def __init__(self, patterns=None):
+    def __init__(self, patterns=None, **kwargs):
        if patterns is None:
            patterns = [r'\n\n']  # Default split pattern
        self.patterns = patterns
@@ -32,7 +33,7 @@ class RegexChunking(ChunkingStrategy):
    
 # NLP-based sentence chunking 
 class NlpSentenceChunking(ChunkingStrategy):
-    def __init__(self):
+    def __init__(self, **kwargs):
        load_nltk_punkt()
        pass

@@ -52,9 +53,9 @@ class NlpSentenceChunking(ChunkingStrategy):
 # Topic-based segmentation using TextTiling
 class TopicSegmentationChunking(ChunkingStrategy):
    
-    def __init__(self, num_keywords=3):
+    def __init__(self, num_keywords=3, **kwargs):
        import nltk as nl
-        self.tokenizer = nl.toknize.TextTilingTokenizer()
+        self.tokenizer = nl.tokenize.TextTilingTokenizer()
        self.num_keywords = num_keywords

    def chunk(self, text: str) -> list:
@@ -82,7 +83,13 @@ class TopicSegmentationChunking(ChunkingStrategy):
    
 # Fixed-length word chunks
 class FixedLengthWordChunking(ChunkingStrategy):
-    def __init__(self, chunk_size=100):
+    def __init__(self, chunk_size=100, **kwargs):
+        """
+        Initialize the fixed-length word chunking strategy with the given chunk size.
+        
+        Args:
+            chunk_size (int): The size of each chunk in words.
+        """
        self.chunk_size = chunk_size

    def chunk(self, text: str) -> list:
@@ -91,15 +98,65 @@ class FixedLengthWordChunking(ChunkingStrategy):
    
 # Sliding window chunking
 class SlidingWindowChunking(ChunkingStrategy):
-    def __init__(self, window_size=100, step=50):
+    def __init__(self, window_size=100, step=50, **kwargs):
+        """
+        Initialize the sliding window chunking strategy with the given window size and
+        step size.
+        
+        Args:
+            window_size (int): The size of the sliding window in words.
+            step (int): The step size for sliding the window in words.
+        """
        self.window_size = window_size
        self.step = step

    def chunk(self, text: str) -> list:
        words = text.split()
        chunks = []
-        for i in range(0, len(words), self.step):
-            chunks.append(' '.join(words[i:i + self.window_size]))
+        
+        if len(words) <= self.window_size:
+            return [text]
+        
+        for i in range(0, len(words) - self.window_size + 1, self.step):
+            chunk = ' '.join(words[i:i + self.window_size])
+            chunks.append(chunk)
+        
+        # Handle the last chunk if it doesn't align perfectly
+        if i + self.window_size < len(words):
+            chunks.append(' '.join(words[-self.window_size:]))
+        
        return chunks
    

+class OverlappingWindowChunking(ChunkingStrategy):
+    def __init__(self, window_size=1000, overlap=100, **kwargs):
+        """
+        Initialize the overlapping window chunking strategy with the given window size and
+        overlap size.
+        
+        Args:
+            window_size (int): The size of the window in words.
+            overlap (int): The size of the overlap between consecutive chunks in words.
+        """
+        self.window_size = window_size
+        self.overlap = overlap
+
+    def chunk(self, text: str) -> list:
+        words = text.split()
+        chunks = []
+        
+        if len(words) <= self.window_size:
+            return [text]
+        
+        start = 0
+        while start < len(words):
+            end = start + self.window_size
+            chunk = ' '.join(words[start:end])
+            chunks.append(chunk)
+            
+            if end >= len(words):
+                break
+            
+            start = end - self.overlap
+        
+        return chunks
--- a/crawl4ai/config.py
+++ b/crawl4ai/config.py
@@ -4,24 +4,50 @@ from dotenv import load_dotenv
 load_dotenv()  # Load environment variables from .env file

 # Default provider, ONLY used when the extraction strategy is LLMExtractionStrategy
-DEFAULT_PROVIDER = "openai/gpt-4-turbo"
+DEFAULT_PROVIDER = "openai/gpt-4o-mini"
 MODEL_REPO_BRANCH = "new-release-0.0.2"
 # Provider-model dictionary, ONLY used when the extraction strategy is LLMExtractionStrategy
 PROVIDER_MODELS = {
    "ollama/llama3": "no-token-needed", # Any model from Ollama no need for API token
    "groq/llama3-70b-8192": os.getenv("GROQ_API_KEY"),
    "groq/llama3-8b-8192": os.getenv("GROQ_API_KEY"),
-    "openai/gpt-3.5-turbo": os.getenv("OPENAI_API_KEY"),
-    "openai/gpt-4-turbo": os.getenv("OPENAI_API_KEY"),
+    "openai/gpt-4o-mini": os.getenv("OPENAI_API_KEY"),
    "openai/gpt-4o": os.getenv("OPENAI_API_KEY"),
    "anthropic/claude-3-haiku-20240307": os.getenv("ANTHROPIC_API_KEY"),
    "anthropic/claude-3-opus-20240229": os.getenv("ANTHROPIC_API_KEY"),
    "anthropic/claude-3-sonnet-20240229": os.getenv("ANTHROPIC_API_KEY"),
+    "anthropic/claude-3-5-sonnet-20240620": os.getenv("ANTHROPIC_API_KEY"),
 }

-
 # Chunk token threshold
-CHUNK_TOKEN_THRESHOLD = 1000
+CHUNK_TOKEN_THRESHOLD = 2 ** 11 # 2048 tokens
+OVERLAP_RATE = 0.1
+WORD_TOKEN_RATE = 1.3

 # Threshold for the minimum number of word in a HTML tag to be considered 
-MIN_WORD_THRESHOLD = 5
+MIN_WORD_THRESHOLD = 1
+IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD = 1
+
+IMPORTANT_ATTRS = ['src', 'href', 'alt', 'title', 'width', 'height'] 
+ONLY_TEXT_ELIGIBLE_TAGS = ['b', 'i', 'u', 'span', 'del', 'ins', 'sub', 'sup', 'strong', 'em', 'code', 'kbd', 'var', 's', 'q', 'abbr', 'cite', 'dfn', 'time', 'small', 'mark']
+SOCIAL_MEDIA_DOMAINS = [
+                            'facebook.com',
+                            'twitter.com',
+                            'x.com',
+                            'linkedin.com',
+                            'instagram.com',
+                            'pinterest.com',
+                            'tiktok.com',
+                            'snapchat.com',
+                            'reddit.com',
+                        ]
+
+# Threshold for image extraction. Range is 1 to 6.
+# Images are scored on a point-based system so that less useful ones can be filtered out. Points are
+# assigned to each image based on the following criteria:
+# - Either height or width exceeds 150px
+# - Image size is greater than 10 KB
+# - The alt attribute is set
+# - Image format is jpg, png, or webp
+# - Image is in the first half of all images extracted from the page
+IMAGE_SCORE_THRESHOLD = 2
--- a/crawl4ai/content_cleaning_strategy.py
+++ b/crawl4ai/content_cleaning_strategy.py
@@ -0,0 +1,196 @@
+from bs4 import BeautifulSoup, Tag
+import re
+from typing import Optional
+
+class ContentCleaningStrategy:
+    def __init__(self):
+        # Precompile regex patterns for performance
+        self.negative_patterns = re.compile(r'nav|footer|header|sidebar|ads|comment', re.I)
+        self.positive_patterns = re.compile(r'content|article|main|post', re.I)
+        self.priority_tags = {'article', 'main', 'section', 'div'}
+        self.non_content_tags = {'nav', 'footer', 'header', 'aside'}
+        # Thresholds
+        self.text_density_threshold = 9.0
+        self.min_word_count = 50
+        self.link_density_threshold = 0.2
+        self.max_dom_depth = 10  # To prevent excessive DOM traversal
+
+    def clean(self, clean_html: str) -> str:
+        """
+        Main function that takes cleaned HTML and returns super cleaned HTML.
+
+        Args:
+            clean_html (str): The cleaned HTML content.
+
+        Returns:
+            str: The super cleaned HTML containing only the main content.
+        """
+        try:
+            if not clean_html or not isinstance(clean_html, str):
+                return ''
+            soup = BeautifulSoup(clean_html, 'html.parser')
+            main_content = self.extract_main_content(soup)
+            if main_content:
+                super_clean_element = self.clean_element(main_content)
+                return str(super_clean_element)
+            else:
+                return ''
+        except Exception:
+            # Handle exceptions silently or log them as needed
+            return ''
+
+    def extract_main_content(self, soup: BeautifulSoup) -> Optional[Tag]:
+        """
+        Identifies and extracts the main content element from the HTML.
+
+        Args:
+            soup (BeautifulSoup): The parsed HTML soup.
+
+        Returns:
+            Optional[Tag]: The Tag object containing the main content, or None if not found.
+        """
+        candidates = []
+        for element in soup.find_all(self.priority_tags):
+            if self.is_non_content_tag(element):
+                continue
+            if self.has_negative_class_id(element):
+                continue
+            score = self.calculate_content_score(element)
+            candidates.append((score, element))
+        
+        if not candidates:
+            return None
+
+        # Sort candidates by score in descending order
+        candidates.sort(key=lambda x: x[0], reverse=True)
+        # Select the element with the highest score
+        best_element = candidates[0][1]
+        return best_element
+
+    def calculate_content_score(self, element: Tag) -> float:
+        """
+        Calculates a score for an element based on various heuristics.
+
+        Args:
+            element (Tag): The HTML element to score.
+
+        Returns:
+            float: The content score of the element.
+        """
+        score = 0.0
+
+        if self.is_priority_tag(element):
+            score += 5.0
+        if self.has_positive_class_id(element):
+            score += 3.0
+        if self.has_negative_class_id(element):
+            score -= 3.0
+        if self.is_high_text_density(element):
+            score += 2.0
+        if self.is_low_link_density(element):
+            score += 2.0
+        if self.has_sufficient_content(element):
+            score += 2.0
+        if self.has_headings(element):
+            score += 3.0
+
+        dom_depth = self.calculate_dom_depth(element)
+        score += min(dom_depth, self.max_dom_depth) * 0.5  # Adjust weight as needed
+
+        return score
+
+    def is_priority_tag(self, element: Tag) -> bool:
+        """Checks if the element is a priority tag."""
+        return element.name in self.priority_tags
+
+    def is_non_content_tag(self, element: Tag) -> bool:
+        """Checks if the element is a non-content tag."""
+        return element.name in self.non_content_tags
+
+    def has_negative_class_id(self, element: Tag) -> bool:
+        """Checks if the element has negative indicators in its class or id."""
+        class_id = ' '.join(filter(None, [
+            self.get_attr_str(element.get('class')),
+            element.get('id', '')
+        ]))
+        return bool(self.negative_patterns.search(class_id))
+
+    def has_positive_class_id(self, element: Tag) -> bool:
+        """Checks if the element has positive indicators in its class or id."""
+        class_id = ' '.join(filter(None, [
+            self.get_attr_str(element.get('class')),
+            element.get('id', '')
+        ]))
+        return bool(self.positive_patterns.search(class_id))
+
+    @staticmethod
+    def get_attr_str(attr) -> str:
+        """Converts an attribute value to a string."""
+        if isinstance(attr, list):
+            return ' '.join(attr)
+        elif isinstance(attr, str):
+            return attr
+        else:
+            return ''
+
+    def is_high_text_density(self, element: Tag) -> bool:
+        """Determines if the element has high text density."""
+        text_density = self.calculate_text_density(element)
+        return text_density > self.text_density_threshold
+
+    def calculate_text_density(self, element: Tag) -> float:
+        """Calculates the text density of an element."""
+        text_length = len(element.get_text(strip=True))
+        tag_count = len(element.find_all())
+        tag_count = tag_count or 1  # Prevent division by zero
+        return text_length / tag_count
+
+    def is_low_link_density(self, element: Tag) -> bool:
+        """Determines if the element has low link density."""
+        link_density = self.calculate_link_density(element)
+        return link_density < self.link_density_threshold
+
+    def calculate_link_density(self, element: Tag) -> float:
+        """Calculates the link density of an element."""
+        text = element.get_text(strip=True)
+        if not text:
+            return 0.0
+        link_text = ' '.join(a.get_text(strip=True) for a in element.find_all('a'))
+        return len(link_text) / len(text) if text else 0.0
+
+    def has_sufficient_content(self, element: Tag) -> bool:
+        """Checks if the element has sufficient word count."""
+        word_count = len(element.get_text(separator=' ', strip=True).split())
+        return word_count >= self.min_word_count
+
+    def calculate_dom_depth(self, element: Tag) -> int:
+        """Calculates the depth of an element in the DOM tree."""
+        depth = 0
+        current_element = element
+        while current_element.parent and depth < self.max_dom_depth:
+            depth += 1
+            current_element = current_element.parent
+        return depth
+
+    def has_headings(self, element: Tag) -> bool:
+        """Checks if the element contains heading tags."""
+        return bool(element.find(['h1', 'h2', 'h3']))
+
+    def clean_element(self, element: Tag) -> Tag:
+        """
+        Cleans the selected element by removing unnecessary attributes and nested non-content elements.
+
+        Args:
+            element (Tag): The HTML element to clean.
+
+        Returns:
+            Tag: The cleaned HTML element.
+        """
+        for tag in element.find_all(['script', 'style', 'aside']):
+            tag.decompose()
+        for tag in element.find_all():
+            attrs = dict(tag.attrs)
+            for attr in attrs:
+                if attr in ['style', 'onclick', 'onmouseover', 'align', 'bgcolor']:
+                    del tag.attrs[attr]
+        return element
--- a/crawl4ai/content_scrapping_strategy.py
+++ b/crawl4ai/content_scrapping_strategy.py
@@ -0,0 +1,541 @@
+from abc import ABC, abstractmethod
+from typing import Dict, Any
+from bs4 import BeautifulSoup
+from concurrent.futures import ThreadPoolExecutor
+import asyncio, requests, re, os
+from .config import *
+from bs4 import element, NavigableString, Comment
+from urllib.parse import urljoin
+from requests.exceptions import InvalidSchema
+from .content_cleaning_strategy import ContentCleaningStrategy
+
+from .utils import (
+    sanitize_input_encode,
+    sanitize_html,
+    extract_metadata,
+    InvalidCSSSelectorError,
+    # CustomHTML2Text,
+    normalize_url,
+    is_external_url
+    
+)
+
+from .html2text import HTML2Text
+class CustomHTML2Text(HTML2Text):
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        self.inside_pre = False
+        self.inside_code = False
+        self.preserve_tags = set()  # Set of tags to preserve
+        self.current_preserved_tag = None
+        self.preserved_content = []
+        self.preserve_depth = 0
+        
+        # Configuration options
+        self.skip_internal_links = False
+        self.single_line_break = False
+        self.mark_code = False
+        self.include_sup_sub = False
+        self.body_width = 0
+        self.ignore_mailto_links = True
+        self.ignore_links = False
+        self.escape_backslash = False
+        self.escape_dot = False
+        self.escape_plus = False
+        self.escape_dash = False
+        self.escape_snob = False
+
+    def update_params(self, **kwargs):
+        """Update parameters and set preserved tags."""
+        for key, value in kwargs.items():
+            if key == 'preserve_tags':
+                self.preserve_tags = set(value)
+            else:
+                setattr(self, key, value)
+
+    def handle_tag(self, tag, attrs, start):
+        # Handle preserved tags
+        if tag in self.preserve_tags:
+            if start:
+                if self.preserve_depth == 0:
+                    self.current_preserved_tag = tag
+                    self.preserved_content = []
+                    # Format opening tag with attributes
+                    attr_str = ''.join(f' {k}="{v}"' for k, v in attrs.items() if v is not None)
+                    self.preserved_content.append(f'<{tag}{attr_str}>')
+                self.preserve_depth += 1
+                return
+            else:
+                self.preserve_depth -= 1
+                if self.preserve_depth == 0:
+                    self.preserved_content.append(f'</{tag}>')
+                    # Output the preserved HTML block with proper spacing
+                    preserved_html = ''.join(self.preserved_content)
+                    self.o('\n' + preserved_html + '\n')
+                    self.current_preserved_tag = None
+                return
+
+        # If we're inside a preserved tag, collect all content
+        if self.preserve_depth > 0:
+            if start:
+                # Format nested tags with attributes
+                attr_str = ''.join(f' {k}="{v}"' for k, v in attrs.items() if v is not None)
+                self.preserved_content.append(f'<{tag}{attr_str}>')
+            else:
+                self.preserved_content.append(f'</{tag}>')
+            return
+
+        # Handle pre tags
+        if tag == 'pre':
+            if start:
+                self.o('```\n')
+                self.inside_pre = True
+            else:
+                self.o('\n```')
+                self.inside_pre = False
+        elif tag in ["h1", "h2", "h3", "h4", "h5", "h6"]:
+            pass
+        else:
+            super().handle_tag(tag, attrs, start)
+
+    def handle_data(self, data, entity_char=False):
+        """Override handle_data to capture content within preserved tags."""
+        if self.preserve_depth > 0:
+            self.preserved_content.append(data)
+            return
+        super().handle_data(data, entity_char)
+
+class ContentScrappingStrategy(ABC):
+    @abstractmethod
+    def scrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
+        pass
+
+    @abstractmethod
+    async def ascrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
+        pass
+
+class WebScrappingStrategy(ContentScrappingStrategy):
+    def scrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
+        return self._get_content_of_website_optimized(url, html, is_async=False, **kwargs)
+
+    async def ascrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
+        return await asyncio.to_thread(self._get_content_of_website_optimized, url, html, **kwargs)
+
+    def _get_content_of_website_optimized(self, url: str, html: str, word_count_threshold: int = MIN_WORD_THRESHOLD, css_selector: str = None, **kwargs) -> Dict[str, Any]:
+        success = True
+        if not html:
+            return None
+
+        soup = BeautifulSoup(html, 'html.parser')
+        body = soup.body
+        
+        
+        image_description_min_word_threshold = kwargs.get('image_description_min_word_threshold', IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD)
+
+        for tag in kwargs.get('excluded_tags', []) or []:
+            for el in body.select(tag):
+                el.decompose()
+        
+        if css_selector:
+            selected_elements = body.select(css_selector)
+            if not selected_elements:
+                return {
+                    'markdown': '',
+                    'cleaned_html': '',
+                    'success': True,
+                    'media': {'images': [], 'videos': [], 'audios': []},
+                    'links': {'internal': [], 'external': []},
+                    'metadata': {},
+                    'message': f"No elements found for CSS selector: {css_selector}"
+                }
+                # raise InvalidCSSSelectorError(f"Invalid CSS selector, No elements found for CSS selector: {css_selector}")
+            body = soup.new_tag('div')
+            for el in selected_elements:
+                body.append(el)
+
+        links = {'internal': [], 'external': []}
+        media = {'images': [], 'videos': [], 'audios': []}
+        internal_links_dict = {}
+        external_links_dict = {}
+
+        # Extract meaningful text for media files from closest parent
+        def find_closest_parent_with_useful_text(tag):
+            current_tag = tag
+            while current_tag:
+                current_tag = current_tag.parent
+                # Get the text content of the parent tag
+                if current_tag:
+                    text_content = current_tag.get_text(separator=' ', strip=True)
+                    # Check if the text content has at least image_description_min_word_threshold words
+                    if len(text_content.split()) >= image_description_min_word_threshold:
+                        return text_content
+            return None
+
+        def process_image(img, url, index, total_images):
+            # Check that an image is displayed and is not inside undesired HTML elements
+            def is_valid_image(img, parent, parent_classes):
+                style = img.get('style', '')
+                src = img.get('src', '')
+                classes_to_check = ['button', 'icon', 'logo']
+                tags_to_check = ['button', 'input']
+                return all([
+                    'display:none' not in style,
+                    src,
+                    not any(s in var for var in [src, img.get('alt', ''), *parent_classes] for s in classes_to_check),
+                    parent.name not in tags_to_check
+                ])
+
+            # Score an image for its usefulness
+            def score_image_for_usefulness(img, base_url, index, images_count):
+                # Function to parse image height/width value and units
+                def parse_dimension(dimension):
+                    if dimension:
+                        match = re.match(r"(\d+)(\D*)", dimension)
+                        if match:
+                            number = int(match.group(1))
+                            unit = match.group(2) or 'px'  # Default unit is 'px' if not specified
+                            return number, unit
+                    return None, None
+
+                # Fetch image file metadata to extract size and extension
+                def fetch_image_file_size(img, base_url):
+                    # If src is a relative path, construct the full URL; otherwise it may be a CDN URL
+                    img_url = urljoin(base_url, img.get('src'))
+                    try:
+                        response = requests.head(img_url)
+                        if response.status_code == 200:
+                            return response.headers.get('Content-Length', None)
+                        else:
+                            print(f"Failed to retrieve file size for {img_url}")
+                            return None
+                    except InvalidSchema:
+                        return None
+                    except requests.RequestException:
+                        return None
+
+                image_height = img.get('height')
+                height_value, height_unit = parse_dimension(image_height)
+                image_width =  img.get('width')
+                width_value, width_unit = parse_dimension(image_width)
+                image_size = 0 #int(fetch_image_file_size(img,base_url) or 0)
+                image_src = img.get('src','')
+                if "data:image/" in image_src:
+                    image_format = image_src.split(',')[0].split(';')[0].split('/')[1]
+                else:
+                    image_format = os.path.splitext(img.get('src',''))[1].lower()
+                # Remove . from format
+                image_format = image_format.strip('.').split('?')[0]
+                score = 0
+                if height_value:
+                    if height_unit == 'px' and height_value > 150:
+                        score += 1
+                    if height_unit in ['%','vh','vmin','vmax'] and height_value >30:
+                        score += 1
+                if width_value:
+                    if width_unit == 'px' and width_value > 150:
+                        score += 1
+                    if width_unit in ['%','vw','vh','vmin','vmax'] and width_value >30:
+                        score += 1
+                if image_size > 10000:
+                    score += 1
+                if img.get('alt'):
+                    score += 1
+                if image_format in ('jpg', 'png', 'webp'):
+                    score += 1
+                if index / images_count < 0.5:
+                    score += 1
+                return score
+
+            
+            
+            if not is_valid_image(img, img.parent, img.parent.get('class', [])):
+                return None
+            score = score_image_for_usefulness(img, url, index, total_images)
+            if score <= IMAGE_SCORE_THRESHOLD:
+                return None
+            return {
+                'src': img.get('src', ''),
+                'data-src': img.get('data-src', ''),
+                'alt': img.get('alt', ''),
+                'desc': find_closest_parent_with_useful_text(img),
+                'score': score,
+                'type': 'image'
+            }
+
+        def remove_unwanted_attributes(element, important_attrs, keep_data_attributes=False):
+            attrs_to_remove = []
+            for attr in element.attrs:
+                if attr not in important_attrs:
+                    if keep_data_attributes:
+                        if not attr.startswith('data-'):
+                            attrs_to_remove.append(attr)
+                    else:
+                        attrs_to_remove.append(attr)
+            
+            for attr in attrs_to_remove:
+                del element[attr]
+        
+        def process_element(element: element.PageElement) -> bool:
+            try:
+                if isinstance(element, NavigableString):
+                    if isinstance(element, Comment):
+                        element.extract()
+                    return False
+                
+                # if element.name == 'img':
+                #     process_image(element, url, 0, 1)
+                #     return True
+
+                if element.name in ['script', 'style', 'link', 'meta', 'noscript']:
+                    element.decompose()
+                    return False
+
+                keep_element = False
+                
+                exclude_social_media_domains = SOCIAL_MEDIA_DOMAINS + kwargs.get('exclude_social_media_domains', [])
+                exclude_social_media_domains = list(set(exclude_social_media_domains))
+
+                
+                try:
+                    if element.name == 'a' and element.get('href'):
+                        href = element.get('href', '').strip()
+                        if not href:  # Skip empty hrefs
+                            return False
+                            
+                        url_base = url.split('/')[2]
+                        
+                        # Normalize the URL
+                        try:
+                            normalized_href = normalize_url(href, url)
+                        except ValueError as e:
+                            # logging.warning(f"Invalid URL format: {href}, Error: {str(e)}")
+                            return False
+                            
+                        link_data = {
+                            'href': normalized_href,
+                            'text': element.get_text().strip(),
+                            'title': element.get('title', '').strip()
+                        }
+                        
+                        # Check for duplicates and add to appropriate dictionary
+                        is_external = is_external_url(normalized_href, url_base)
+                        if is_external:
+                            if normalized_href not in external_links_dict:
+                                external_links_dict[normalized_href] = link_data
+                        else:
+                            if normalized_href not in internal_links_dict:
+                                internal_links_dict[normalized_href] = link_data
+                                
+                        keep_element = True
+                        
+                        # Handle external link exclusions
+                        if is_external:
+                            if kwargs.get('exclude_external_links', False):
+                                element.decompose()
+                                return False
+                            if kwargs.get('exclude_social_media_links', False):
+                                if any(domain in normalized_href.lower() for domain in exclude_social_media_domains):
+                                    element.decompose()
+                                    return False
+                            if kwargs.get('exclude_domains', []):
+                                if any(domain in normalized_href.lower() for domain in kwargs.get('exclude_domains', [])):
+                                    element.decompose()
+                                    return False
+                                    
+                except Exception as e:
+                    raise Exception(f"Error processing links: {str(e)}")
+
+                try:
+                    if element.name == 'img':
+                        potential_sources = ['src', 'data-src', 'srcset', 'data-lazy-src', 'data-original']
+                        src = element.get('src', '')
+                        while not src and potential_sources:
+                            src = element.get(potential_sources.pop(0), '')
+                        if not src:
+                            element.decompose()
+                            return False
+                        
+                        # If it is srcset pick up the first image
+                        if 'srcset' in element.attrs:
+                            src = element.attrs['srcset'].split(',')[0].split(' ')[0]
+                            
+                        # Check flag if we should remove external images
+                        if kwargs.get('exclude_external_images', False):
+                            src_url_base = src.split('/')[2]
+                            url_base = url.split('/')[2]
+                            if url_base not in src_url_base:
+                                element.decompose()
+                                return False
+                            
+                        if not kwargs.get('exclude_external_images', False) and kwargs.get('exclude_social_media_links', False):
+                            src_url_base = src.split('/')[2]
+                            url_base = url.split('/')[2]
+                            if any(domain in src for domain in exclude_social_media_domains):
+                                element.decompose()
+                                return False
+                            
+                        # Handle exclude domains
+                        if kwargs.get('exclude_domains', []):
+                            if any(domain in src for domain in kwargs.get('exclude_domains', [])):
+                                element.decompose()
+                                return False
+                        
+                        return True  # Always keep image elements
+                except Exception as e:
+                    raise Exception(f"Error processing images: {str(e)}")
+                
+                
+                # Check if flag to remove all forms is set
+                if kwargs.get('remove_forms', False) and element.name == 'form':
+                    element.decompose()
+                    return False
+                
+                if element.name in ['video', 'audio']:
+                    media[f"{element.name}s"].append({
+                        'src': element.get('src'),
+                        'alt': element.get('alt'),
+                        'type': element.name,
+                        'description': find_closest_parent_with_useful_text(element)
+                    })
+                    source_tags = element.find_all('source')
+                    for source_tag in source_tags:
+                        media[f"{element.name}s"].append({
+                            'src': source_tag.get('src'),
+                            'alt': element.get('alt'),
+                            'type': element.name,
+                            'description': find_closest_parent_with_useful_text(element)
+                        })
+                    return True  # Always keep video and audio elements
+
+                if element.name in ONLY_TEXT_ELIGIBLE_TAGS:
+                    if kwargs.get('only_text', False):
+                        element.replace_with(element.get_text())
+
+                try:
+                    remove_unwanted_attributes(element, IMPORTANT_ATTRS, kwargs.get('keep_data_attributes', False))
+                except Exception as e:
+                    print('Error removing unwanted attributes:', str(e))
+                
+
+                # Process children
+                for child in list(element.children):
+                    if isinstance(child, NavigableString) and not isinstance(child, Comment):
+                        if len(child.strip()) > 0:
+                            keep_element = True
+                    else:
+                        if process_element(child):
+                            keep_element = True
+                    
+
+                # Check word count
+                if not keep_element:
+                    word_count = len(element.get_text(strip=True).split())
+                    keep_element = word_count >= word_count_threshold
+
+                if not keep_element:
+                    element.decompose()
+
+                return keep_element
+            except Exception as e:
+                print('Error processing element:', str(e))
+                return False
+
+        #process images by filtering and extracting contextual text from the page
+        # imgs = body.find_all('img')
+        # media['images'] = [
+        #     result for result in
+        #     (process_image(img, url, i, len(imgs)) for i, img in enumerate(imgs))
+        #     if result is not None
+        # ]
+        
+        process_element(body)
+        
+        # Update the links dictionary with unique links
+        links['internal'] = list(internal_links_dict.values())
+        links['external'] = list(external_links_dict.values())
+
+
+        # Process images using ThreadPoolExecutor
+        imgs = body.find_all('img')
+        
+        with ThreadPoolExecutor() as executor:
+            image_results = list(executor.map(process_image, imgs, [url]*len(imgs), range(len(imgs)), [len(imgs)]*len(imgs)))
+        media['images'] = [result for result in image_results if result is not None]
+
+        def flatten_nested_elements(node):
+            if isinstance(node, NavigableString):
+                return node
+            if len(node.contents) == 1 and isinstance(node.contents[0], element.Tag) and node.contents[0].name == node.name:
+                return flatten_nested_elements(node.contents[0])
+            node.contents = [flatten_nested_elements(child) for child in node.contents]
+            return node
+
+        body = flatten_nested_elements(body)
+        base64_pattern = re.compile(r'data:image/[^;]+;base64,([^"]+)')
+        for img in imgs:
+            src = img.get('src', '')
+            if base64_pattern.match(src):
+                # Replace base64 data with empty string
+                img['src'] = base64_pattern.sub('', src)
+                
+        try:
+            str(body)
+        except Exception as e:
+            # Reset body to the original HTML
+            success = False
+            body = BeautifulSoup(html, 'html.parser')
+            
+            # Create a new div with a special ID
+            error_div = body.new_tag('div', id='crawl4ai_error_message')
+            error_div.string = '''
+            Crawl4AI Error: This page is not fully supported.
+            
+            Possible reasons:
+            1. The page may have restrictions that prevent crawling.
+            2. The page might not be fully loaded.
+            
+            Suggestions:
+            - Try calling the crawl function with these parameters:
+            magic=True,
+            - Set headless=False to visualize what's happening on the page.
+            
+            If the issue persists, please check the page's structure and any potential anti-crawling measures.
+            '''
+            
+            # Append the error div to the body
+            body.body.append(error_div)
+            
+            print("[LOG] 😧 Error: After processing the crawled HTML and removing irrelevant tags, nothing was left in the page. Check the markdown for further details.")
+
+
+        cleaned_html = str(body).replace('\n\n', '\n').replace('  ', ' ')
+
+        h = CustomHTML2Text()
+        try:
+            h.update_params(**kwargs.get('html2text', {}))
+            markdown = h.handle(cleaned_html)
+        except Exception as e:
+            markdown = h.handle(sanitize_html(cleaned_html))
+        markdown = markdown.replace('    ```', '```')
+
+        try:
+            meta = extract_metadata(html, soup)
+        except Exception as e:
+            print('Error extracting metadata:', str(e))
+            meta = {}
+            
+        cleaner = ContentCleaningStrategy()
+        fit_html = cleaner.clean(cleaned_html)
+        fit_markdown = h.handle(fit_html)
+
+        cleaned_html = sanitize_html(cleaned_html)
+        return {
+            'markdown': markdown,
+            'fit_markdown': fit_markdown,
+            'fit_html': fit_html,
+            'cleaned_html': cleaned_html,
+            'success': success,
+            'media': media,
+            'links': links,
+            'metadata': meta
+        }
--- a/crawl4ai/crawler_strategy.py
+++ b/crawl4ai/crawler_strategy.py
@@ -5,17 +5,58 @@ from selenium.webdriver.common.by import By
 from selenium.webdriver.support.ui import WebDriverWait
 from selenium.webdriver.support import expected_conditions as EC
 from selenium.webdriver.chrome.options import Options
-from selenium.common.exceptions import InvalidArgumentException
+from selenium.common.exceptions import InvalidArgumentException, WebDriverException
+# from selenium.webdriver.chrome.service import Service as ChromeService
+# from webdriver_manager.chrome import ChromeDriverManager
+# from urllib3.exceptions import MaxRetryError

-from typing import List
+from .config import *
+import logging, time
+import base64
+from PIL import Image, ImageDraw, ImageFont
+from io import BytesIO
+from typing import List, Callable
 import requests
 import os
 from pathlib import Path
+from .utils import *
+
+logger = logging.getLogger('selenium.webdriver.remote.remote_connection')
+logger.setLevel(logging.WARNING)
+
+logger_driver = logging.getLogger('selenium.webdriver.common.service')
+logger_driver.setLevel(logging.WARNING)
+
+urllib3_logger = logging.getLogger('urllib3.connectionpool')
+urllib3_logger.setLevel(logging.WARNING)
+
+# Disable http.client logging
+http_client_logger = logging.getLogger('http.client')
+http_client_logger.setLevel(logging.WARNING)
+
+# Disable driver_finder and service logging
+driver_finder_logger = logging.getLogger('selenium.webdriver.common.driver_finder')
+driver_finder_logger.setLevel(logging.WARNING)
+
+
+

 class CrawlerStrategy(ABC):
    @abstractmethod
    def crawl(self, url: str, **kwargs) -> str:
        pass
+    
+    @abstractmethod
+    def take_screenshot(self, save_path: str):
+        pass
+    
+    @abstractmethod
+    def update_user_agent(self, user_agent: str):
+        pass
+    
+    @abstractmethod
+    def set_hook(self, hook_type: str, hook: Callable):
+        pass

 class CloudCrawlerStrategy(CrawlerStrategy):
    def __init__(self, use_cached_html = False):
@@ -33,60 +74,287 @@ class CloudCrawlerStrategy(CrawlerStrategy):
        response = requests.post("http://crawl4ai.uccode.io/crawl", json=data)
        response = response.json()
        html = response["results"][0]["html"]
-        return html
+        return sanitize_input_encode(html)

 class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
-    def __init__(self, use_cached_html=False, js_code=None):
+    def __init__(self, use_cached_html=False, js_code=None, **kwargs):
        super().__init__()
        print("[LOG] 🚀 Initializing LocalSeleniumCrawlerStrategy")
        self.options = Options()
        self.options.headless = True
+        if kwargs.get("proxy"):
+            self.options.add_argument("--proxy-server={}".format(kwargs.get("proxy")))
+        # Use the caller's user agent if provided; otherwise fall back to a default desktop Chrome one
+        default_user_agent = (
+            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
+        )
+        user_agent = kwargs.get("user_agent") or default_user_agent
+        self.options.add_argument(f"--user-agent={user_agent}")
+                  
+        self.options.headless = kwargs.get("headless", True)
+        if self.options.headless:
+            self.options.add_argument("--headless")
+        
+        self.options.add_argument("--disable-gpu")  
+        self.options.add_argument("--window-size=1920,1080")
        self.options.add_argument("--no-sandbox")
        self.options.add_argument("--disable-dev-shm-usage")
+        self.options.add_argument("--disable-blink-features=AutomationControlled")     
+        
+        # self.options.add_argument("--disable-dev-shm-usage")
        self.options.add_argument("--disable-gpu")
-        self.options.add_argument("--disable-extensions")
-        self.options.add_argument("--headless")
+        # self.options.add_argument("--disable-extensions")
+        # self.options.add_argument("--disable-infobars")
+        # self.options.add_argument("--disable-logging")
+        # self.options.add_argument("--disable-popup-blocking")
+        # self.options.add_argument("--disable-translate")
+        # self.options.add_argument("--disable-default-apps")
+        # self.options.add_argument("--disable-background-networking")
+        # self.options.add_argument("--disable-sync")
+        # self.options.add_argument("--disable-features=NetworkService,NetworkServiceInProcess")
+        # self.options.add_argument("--disable-browser-side-navigation")
+        # self.options.add_argument("--dns-prefetch-disable")
+        # self.options.add_argument("--disable-web-security")
+        self.options.add_argument("--log-level=3")
+        self.use_cached_html = use_cached_html
        self.use_cached_html = use_cached_html
        self.js_code = js_code
+        self.verbose = kwargs.get("verbose", False)
+        
+        # Hooks
+        self.hooks = {
+            'on_driver_created': None,
+            'on_user_agent_updated': None,
+            'before_get_url': None,
+            'after_get_url': None,
+            'before_return_html': None
+        }

        # chromedriver_autoinstaller.install()
-        import chromedriver_autoinstaller
-        self.service = Service(chromedriver_autoinstaller.install())
-        self.driver = webdriver.Chrome(service=self.service, options=self.options)
+        # import chromedriver_autoinstaller
+        # crawl4ai_folder = os.path.join(Path.home(), ".crawl4ai")
+        # driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=self.options)
+        # chromedriver_path = chromedriver_autoinstaller.install()
+        # chromedriver_path = chromedriver_autoinstaller.utils.download_chromedriver()
+        # self.service = Service(chromedriver_autoinstaller.install())
+        
+        
+        # chromedriver_path = ChromeDriverManager().install()
+        # self.service = Service(chromedriver_path)
+        # self.service.log_path = "NUL"
+        # self.driver = webdriver.Chrome(service=self.service, options=self.options)
+        
+        # Use selenium-manager (built into Selenium 4.10.0+)
+        self.service = Service()
+        self.driver = webdriver.Chrome(options=self.options)
+        
+        self.driver = self.execute_hook('on_driver_created', self.driver)
+        
+        if kwargs.get("cookies"):
+            for cookie in kwargs.get("cookies"):
+                self.driver.add_cookie(cookie)
+            
+        

-    def crawl(self, url: str) -> str:
+    def set_hook(self, hook_type: str, hook: Callable):
+        if hook_type in self.hooks:
+            self.hooks[hook_type] = hook
+        else:
+            raise ValueError(f"Invalid hook type: {hook_type}")
+    
+    def execute_hook(self, hook_type: str, *args):
+        hook = self.hooks.get(hook_type)
+        if hook:
+            result = hook(*args)
+            if result is not None:
+                if isinstance(result, webdriver.Chrome):
+                    return result
+                else:
+                    raise TypeError(f"Hook {hook_type} must return an instance of webdriver.Chrome or None.")
+        # If the hook returns None or there is no hook, return self.driver
+        return self.driver
+
+    def update_user_agent(self, user_agent: str):
+        self.options.add_argument(f"user-agent={user_agent}")
+        self.driver.quit()
+        self.driver = webdriver.Chrome(service=self.service, options=self.options)
+        self.driver = self.execute_hook('on_user_agent_updated', self.driver)
+
+    def set_custom_headers(self, headers: dict):
+        # Enable Network domain for sending headers
+        self.driver.execute_cdp_cmd('Network.enable', {})
+        # Set extra HTTP headers
+        self.driver.execute_cdp_cmd('Network.setExtraHTTPHeaders', {'headers': headers})
+
+    def _ensure_page_load(self, max_checks=6, check_interval=0.01):
+        initial_length = len(self.driver.page_source)
+        
+        for ix in range(max_checks):
+            # print(f"Checking page load: {ix}")
+            time.sleep(check_interval)
+            current_length = len(self.driver.page_source)
+            
+            if current_length != initial_length:
+                break
+
+        return self.driver.page_source
+    
+    def crawl(self, url: str, **kwargs) -> str:
+        # Create md5 hash of the URL
+        import hashlib
+        url_hash = hashlib.md5(url.encode()).hexdigest()
+        
        if self.use_cached_html:
-            cache_file_path = os.path.join(Path.home(), ".crawl4ai", "cache", url.replace("/", "_"))
+            cache_file_path = os.path.join(Path.home(), ".crawl4ai", "cache", url_hash)
            if os.path.exists(cache_file_path):
                with open(cache_file_path, "r") as f:
-                    return f.read()
+                    return sanitize_input_encode(f.read())

        try:
-            self.driver.get(url)
+            self.driver = self.execute_hook('before_get_url', self.driver)
+            if self.verbose:
+                print(f"[LOG] 🕸️ Crawling {url} using LocalSeleniumCrawlerStrategy...")
+            self.driver.get(url) #<html><head></head><body></body></html>
+            
+            WebDriverWait(self.driver, 20).until(
+                lambda d: d.execute_script('return document.readyState') == 'complete'
+            )
            WebDriverWait(self.driver, 10).until(
-                EC.presence_of_all_elements_located((By.TAG_NAME, "html"))
+                EC.presence_of_all_elements_located((By.TAG_NAME, "body"))
            )
            
+            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
+            
+            self.driver = self.execute_hook('after_get_url', self.driver)
+            html = sanitize_input_encode(self._ensure_page_load()) # self.driver.page_source                                        
+            can_not_be_done_headless = False # Look at my creativity for naming variables
+            
+            # TODO: Very ugly approach, but promise to change it!
+            if kwargs.get('bypass_headless', False) or html == "<html><head></head><body></body></html>":
+                print("[LOG] 🙌 Page could not be loaded in headless mode. Trying non-headless mode...")
+                can_not_be_done_headless = True
+                options = Options()
+                options.headless = False
+                # set window size very small
+                options.add_argument("--window-size=5,5")
+                driver = webdriver.Chrome(service=self.service, options=options)
+                driver.get(url)
+                driver = self.execute_hook('after_get_url', driver)  # keep self.driver intact; this temporary driver is quit below
+                html = sanitize_input_encode(driver.page_source)
+                driver.quit()
+            
            # Execute JS code if provided
-            if self.js_code:
+            self.js_code = kwargs.get("js_code", self.js_code)
+            if self.js_code and isinstance(self.js_code, str):
                self.driver.execute_script(self.js_code)
                # Optionally, wait for some condition after executing the JS code
                WebDriverWait(self.driver, 10).until(
                    lambda driver: driver.execute_script("return document.readyState") == "complete"
                )
+            elif self.js_code and isinstance(self.js_code, list):
+                for js in self.js_code:
+                    self.driver.execute_script(js)
+                    WebDriverWait(self.driver, 10).until(
+                        lambda driver: driver.execute_script("return document.readyState") == "complete"
+                    )
            
-            html = self.driver.page_source
+            # Optionally, wait for some condition after executing the JS code : Contributed by (https://github.com/jonymusky)
+            wait_for = kwargs.get('wait_for', False)
+            if wait_for:
+                if callable(wait_for):
+                    print("[LOG] 🔄 Waiting for condition...")
+                    WebDriverWait(self.driver, 20).until(wait_for)
+                else:
+                    print("[LOG] 🔄 Waiting for condition...")
+                    WebDriverWait(self.driver, 20).until(
+                        EC.presence_of_element_located((By.CSS_SELECTOR, wait_for))
+                    ) 
+            
+            if not can_not_be_done_headless:
+                html = sanitize_input_encode(self.driver.page_source)
+            self.driver = self.execute_hook('before_return_html', self.driver, html)
            
            # Store in cache
-            cache_file_path = os.path.join(Path.home(), ".crawl4ai", "cache", url.replace("/", "_"))
-            with open(cache_file_path, "w") as f:
+            cache_file_path = os.path.join(Path.home(), ".crawl4ai", "cache", url_hash)
+            with open(cache_file_path, "w", encoding="utf-8") as f:
                f.write(html)
+                
+            if self.verbose:
+                print(f"[LOG] ✅ Crawled {url} successfully!")
            
            return html
        except InvalidArgumentException:
-            raise InvalidArgumentException(f"Invalid URL {url}")
+            # This clause binds no exception object (there is no `as e`), so build
+            # the message from the URL alone instead of referencing an undefined `e`.
+            raise InvalidArgumentException(f"Failed to crawl {url}: invalid URL")
+        except WebDriverException as e:
+            # If e does not have a msg attribute, create it and set it to str(e)
+            if not hasattr(e, 'msg'):
+                e.msg = sanitize_input_encode(str(e))
+            raise WebDriverException(f"Failed to crawl {url}: {e.msg}")  
        except Exception as e:
-            raise Exception(f"Failed to crawl {url}: {str(e)}")
+            if not hasattr(e, 'msg'):
+                e.msg = sanitize_input_encode(str(e))
+            raise Exception(f"Failed to crawl {url}: {e.msg}")

+    def take_screenshot(self) -> str:
+        try:
+            # Get the dimensions of the page
+            total_width = self.driver.execute_script("return document.body.scrollWidth")
+            total_height = self.driver.execute_script("return document.body.scrollHeight")
+
+            # Set the window size to the dimensions of the page
+            self.driver.set_window_size(total_width, total_height)
+
+            # Take screenshot
+            screenshot = self.driver.get_screenshot_as_png()
+
+            # Open the screenshot with PIL
+            image = Image.open(BytesIO(screenshot))
+
+            # Convert image to RGB mode (this will handle both RGB and RGBA images)
+            rgb_image = image.convert('RGB')
+
+            # Convert to JPEG and compress
+            buffered = BytesIO()
+            rgb_image.save(buffered, format="JPEG", quality=85)
+            img_base64 = base64.b64encode(buffered.getvalue()).decode('utf-8')
+
+            if self.verbose:
+                print("[LOG] 📸 Screenshot taken and converted to base64")
+
+            return img_base64
+        except Exception as e:
+            error_message = sanitize_input_encode(f"Failed to take screenshot: {str(e)}")
+            print(error_message)
+
+            # Generate an image with black background
+            img = Image.new('RGB', (800, 600), color='black')
+            draw = ImageDraw.Draw(img)
+            
+            # Load a font
+            try:
+                font = ImageFont.truetype("arial.ttf", 40)
+            except IOError:
+                font = ImageFont.load_default()
+
+            # Define text color and wrap the text
+            text_color = (255, 255, 255)
+            max_width = 780
+            wrapped_text = wrap_text(draw, error_message, font, max_width)
+
+            # Calculate text position
+            text_position = (10, 10)
+            
+            # Draw the text on the image
+            draw.text(text_position, wrapped_text, fill=text_color, font=font)
+            
+            # Convert to base64
+            buffered = BytesIO()
+            img.save(buffered, format="JPEG")
+            img_base64 = base64.b64encode(buffered.getvalue()).decode('utf-8')
+
+            return img_base64
+        
    def quit(self):
-        self.driver.quit()
+        self.driver.quit()
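Review note on the `crawl` hunk above: the cache filename changes from `url.replace("/", "_")` to an MD5 hex digest of the URL. The following is a minimal sketch of that idea, not the project's code. The helper name `cache_path_for` and the temporary cache root are assumptions made for illustration. The digest is always 32 hex characters, so query strings, colons, and very long URLs cannot yield awkward filenames, and `a/b` can no longer collide with `a_b`.

```python
import hashlib
import os
import tempfile


def cache_path_for(url: str, cache_root: str) -> str:
    """Hypothetical helper: map a URL to a fixed-length, filesystem-safe cache path."""
    url_hash = hashlib.md5(url.encode()).hexdigest()  # same digest call as the diff
    return os.path.join(cache_root, url_hash)


cache_root = tempfile.mkdtemp()  # stand-in for ~/.crawl4ai/cache
url = "https://example.com/a/b?q=1&page=2"
path = cache_path_for(url, cache_root)

# Round-trip: write with an explicit encoding, then read it back the same way.
with open(path, "w", encoding="utf-8") as f:
    f.write("<html>café</html>")
with open(path, "r", encoding="utf-8") as f:
    cached = f.read()
```

In the diff, the write side passes `encoding="utf-8"` but the cached read still calls `open(cache_file_path, "r")` without it. Passing the same encoding on both sides, as in this sketch, avoids platform-dependent decoding.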
--- a/crawl4ai/database.py
+++ b/crawl4ai/database.py
@@ -1,13 +1,12 @@
 import os
 from pathlib import Path
 import sqlite3
-from typing import Optional
 from typing import Optional, Tuple

 DB_PATH = os.path.join(Path.home(), ".crawl4ai")
 os.makedirs(DB_PATH, exist_ok=True)
 DB_PATH = os.path.join(DB_PATH, "crawl4ai.db")
-        
+
 def init_db():
    global DB_PATH
    conn = sqlite3.connect(DB_PATH)
@@ -19,22 +18,37 @@ def init_db():
            cleaned_html TEXT,
            markdown TEXT,
            extracted_content TEXT,
-            success BOOLEAN
+            success BOOLEAN,
+            media TEXT DEFAULT "{}",
+            links TEXT DEFAULT "{}",
+            metadata TEXT DEFAULT "{}",
+            screenshot TEXT DEFAULT ""
        )
    ''')
    conn.commit()
    conn.close()

-def check_db_path():
-    if not DB_PATH:
-        raise ValueError("Database path is not set or is empty.")
-
-def get_cached_url(url: str) -> Optional[Tuple[str, str, str, str, str, bool]]:
+def alter_db_add_screenshot(new_column: str = "media"):
    check_db_path()
    try:
        conn = sqlite3.connect(DB_PATH)
        cursor = conn.cursor()
-        cursor.execute('SELECT url, html, cleaned_html, markdown, extracted_content, success FROM crawled_data WHERE url = ?', (url,))
+        cursor.execute(f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""')
+        conn.commit()
+        conn.close()
+    except Exception as e:
+        print(f"Error altering database to add {new_column} column: {e}")
+
+def check_db_path():
+    if not DB_PATH:
+        raise ValueError("Database path is not set or is empty.")
+
+def get_cached_url(url: str) -> Optional[Tuple[str, str, str, str, str, bool, str, str, str, str]]:
+    check_db_path()
+    try:
+        conn = sqlite3.connect(DB_PATH)
+        cursor = conn.cursor()
+        cursor.execute('SELECT url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot FROM crawled_data WHERE url = ?', (url,))
        result = cursor.fetchone()
        conn.close()
        return result
@@ -42,21 +56,25 @@ def get_cached_url(url: str) -> Optional[Tuple[str, str, str, str, str, bool]]:
        print(f"Error retrieving cached URL: {e}")
        return None

-def cache_url(url: str, html: str, cleaned_html: str, markdown: str, extracted_content: str, success: bool):
+def cache_url(url: str, html: str, cleaned_html: str, markdown: str, extracted_content: str, success: bool, media : str = "{}", links : str = "{}", metadata : str = "{}", screenshot: str = ""):
    check_db_path()
    try:
        conn = sqlite3.connect(DB_PATH)
        cursor = conn.cursor()
        cursor.execute('''
-            INSERT INTO crawled_data (url, html, cleaned_html, markdown, extracted_content, success)
-            VALUES (?, ?, ?, ?, ?, ?)
+            INSERT INTO crawled_data (url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot)
+            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
            ON CONFLICT(url) DO UPDATE SET
                html = excluded.html,
                cleaned_html = excluded.cleaned_html,
                markdown = excluded.markdown,
                extracted_content = excluded.extracted_content,
-                success = excluded.success
-        ''', (url, html, cleaned_html, markdown, extracted_content, success))
+                success = excluded.success,
+                media = excluded.media,      
+                links = excluded.links,    
+                metadata = excluded.metadata,      
+                screenshot = excluded.screenshot
+        ''', (url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot))
        conn.commit()
        conn.close()
    except Exception as e:
@@ -95,4 +113,23 @@ def flush_db():
        conn.commit()
        conn.close()
    except Exception as e:
-        print(f"Error flushing database: {e}")
+        print(f"Error flushing database: {e}")
+
+def update_existing_records(new_column: str = "media", default_value: str = "{}"):
+    check_db_path()
+    try:
+        conn = sqlite3.connect(DB_PATH)
+        cursor = conn.cursor()
+        cursor.execute(f'UPDATE crawled_data SET {new_column} = ? WHERE {new_column} IS NULL', (default_value,))
+        conn.commit()
+        conn.close()
+    except Exception as e:
+        print(f"Error updating existing records: {e}")
+
+if __name__ == "__main__":
+    # Delete the existing database file
+    if os.path.exists(DB_PATH):
+        os.remove(DB_PATH)
+    init_db()  
+    # alter_db_add_screenshot("COL_NAME")
+    
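Review note on the `database.py` changes above: `cache_url` now writes four extra columns through the same `INSERT ... ON CONFLICT(url) DO UPDATE` statement. The sketch below reproduces that upsert pattern on an in-memory database so the overwrite behavior can be checked in isolation. It is a reduced illustration under stated assumptions: the table is cut down to five columns, and `url` is assumed to be the primary key (the diff does not show that column's definition, but `ON CONFLICT(url)` needs a uniqueness constraint on it).

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database for the sketch
conn.execute(
    """
    CREATE TABLE crawled_data (
        url TEXT PRIMARY KEY,
        html TEXT,
        success BOOLEAN,
        media TEXT DEFAULT '{}',
        screenshot TEXT DEFAULT ''
    )
    """
)


def cache_url(conn, url, html, success, media="{}", screenshot=""):
    """Insert a row, or overwrite every non-key column if the URL already exists."""
    conn.execute(
        """
        INSERT INTO crawled_data (url, html, success, media, screenshot)
        VALUES (?, ?, ?, ?, ?)
        ON CONFLICT(url) DO UPDATE SET
            html = excluded.html,
            success = excluded.success,
            media = excluded.media,
            screenshot = excluded.screenshot
        """,
        (url, html, success, media, screenshot),
    )
    conn.commit()


# Caching the same URL twice leaves one row holding the second call's values.
cache_url(conn, "https://example.com", "<p>v1</p>", True)
cache_url(conn, "https://example.com", "<p>v2</p>", True, media='{"images": []}')
rows = conn.execute("SELECT url, html, media, screenshot FROM crawled_data").fetchall()
```

All values go through `?` placeholders here. Column names cannot be bound that way, which is why helpers such as `alter_db_add_screenshot` interpolate `new_column` directly and should only ever receive trusted names.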
--- a/crawl4ai/extraction_strategy.py
+++ b/crawl4ai/extraction_strategy.py
@@ -3,14 +3,15 @@ from typing import Any, List, Dict, Optional, Union
 from concurrent.futures import ThreadPoolExecutor, as_completed
 import json, time
 # from optimum.intel import IPEXModel
-from .prompts import PROMPT_EXTRACT_BLOCKS, PROMPT_EXTRACT_BLOCKS_WITH_INSTRUCTION
+from .prompts import *
 from .config import *
 from .utils import *
 from functools import partial
 from .model_loader import *
-
-
+import math
 import numpy as np
+from lxml import etree
+
 class ExtractionStrategy(ABC):
    """
    Abstract base class for all extraction strategies.
@@ -46,6 +47,7 @@ class ExtractionStrategy(ABC):
            for future in as_completed(futures):
                extracted_content.extend(future.result())
        return extracted_content    
+    
 class NoExtractionStrategy(ExtractionStrategy):
    def extract(self, url: str, html: str, *q, **kwargs) -> List[Dict[str, Any]]:
        return [{"index": 0, "content": html}]
@@ -54,7 +56,9 @@ class NoExtractionStrategy(ExtractionStrategy):
        return [{"index": i, "tags": [], "content": section} for i, section in enumerate(sections)]
   
 class LLMExtractionStrategy(ExtractionStrategy):
-    def __init__(self, provider: str = DEFAULT_PROVIDER, api_token: Optional[str] = None, instruction:str = None, **kwargs):
+    def __init__(self, 
+                 provider: str = DEFAULT_PROVIDER, api_token: Optional[str] = None, 
+                 instruction:str = None, schema:Dict = None, extraction_type = "block", **kwargs):
        """
        Initialize the strategy with clustering parameters.

@@ -64,8 +68,23 @@ class LLMExtractionStrategy(ExtractionStrategy):
        """
        super().__init__() 
        self.provider = provider
-        self.api_token = api_token or PROVIDER_MODELS.get(provider, None) or os.getenv("OPENAI_API_KEY")
+        self.api_token = api_token or PROVIDER_MODELS.get(provider, "no-token") or os.getenv("OPENAI_API_KEY")
        self.instruction = instruction
+        self.extract_type = extraction_type
+        self.schema = schema
+        if schema:
+            self.extract_type = "schema"
+        
+        self.chunk_token_threshold = kwargs.get("chunk_token_threshold", CHUNK_TOKEN_THRESHOLD)
+        self.overlap_rate = kwargs.get("overlap_rate", OVERLAP_RATE)
+        self.word_token_rate = kwargs.get("word_token_rate", WORD_TOKEN_RATE)
+        self.apply_chunking = kwargs.get("apply_chunking", True)
+        self.base_url = kwargs.get("base_url", None)
+        self.api_base = kwargs.get("api_base", kwargs.get("base_url", None))
+        self.extra_args = kwargs.get("extra_args", {})
+        if not self.apply_chunking:
+            self.chunk_token_threshold = 1e9
+        
        self.verbose = kwargs.get("verbose", False)
        
        if not self.api_token:
@@ -80,23 +99,33 @@ class LLMExtractionStrategy(ExtractionStrategy):
            "HTML": escape_json_string(sanitize_html(html)),
        }
        
+        prompt_with_variables = PROMPT_EXTRACT_BLOCKS
        if self.instruction:
            variable_values["REQUEST"] = self.instruction
+            prompt_with_variables = PROMPT_EXTRACT_BLOCKS_WITH_INSTRUCTION
+            
+        if self.extract_type == "schema" and self.schema:
+            variable_values["SCHEMA"] = json.dumps(self.schema, indent=2)
+            prompt_with_variables = PROMPT_EXTRACT_SCHEMA_WITH_INSTRUCTION

-        prompt_with_variables = PROMPT_EXTRACT_BLOCKS if not self.instruction else PROMPT_EXTRACT_BLOCKS_WITH_INSTRUCTION
        for variable in variable_values:
            prompt_with_variables = prompt_with_variables.replace(
                "{" + variable + "}", variable_values[variable]
            )
        
-        response = perform_completion_with_backoff(self.provider, prompt_with_variables, self.api_token)
+        response = perform_completion_with_backoff(
+            self.provider, 
+            prompt_with_variables, 
+            self.api_token, 
+            base_url=self.api_base or self.base_url,
+            extra_args = self.extra_args
+            ) # , json_response=self.extract_type == "schema")
        try:
            blocks = extract_xml_data(["blocks"], response.choices[0].message.content)['blocks']
            blocks = json.loads(blocks)
            for block in blocks:
                block['error'] = False
        except Exception as e:
-            print("Error extracting blocks:", str(e))
            parsed, unparsed = split_and_parse_json_objects(response.choices[0].message.content)
            blocks = parsed
            if unparsed:
@@ -111,110 +140,213 @@ class LLMExtractionStrategy(ExtractionStrategy):
            print("[LOG] Extracted", len(blocks), "blocks from URL:", url, "block index:", ix)
        return blocks
    
-    def _merge(self, documents):
+    def _merge(self, documents, chunk_token_threshold, overlap):
        chunks = []
        sections = []
+        total_tokens = 0
+
+        # Calculate the total tokens across all documents
+        for document in documents:
+            total_tokens += len(document.split(' ')) * self.word_token_rate
+
+        # Calculate the number of sections needed
+        num_sections = math.floor(total_tokens / chunk_token_threshold)
+        if num_sections < 1:
+            num_sections = 1  # Ensure there is at least one section
+        adjusted_chunk_threshold = total_tokens / num_sections
+
        total_token_so_far = 0
+        current_chunk = []

        for document in documents:
-            if total_token_so_far < CHUNK_TOKEN_THRESHOLD:
-                chunk = document.split(' ')
-                total_token_so_far += len(chunk) * 1.3
-                chunks.append(document)
-            else:
-                sections.append('\n\n'.join(chunks))
-                chunks = [document]
-                total_token_so_far = len(document.split(' ')) * 1.3 
-                
-        if chunks:
-            sections.append('\n\n'.join(chunks))
+            tokens = document.split(' ')
+            token_count = len(tokens) * self.word_token_rate
            
-        return sections       
+            if total_token_so_far + token_count <= adjusted_chunk_threshold:
+                current_chunk.extend(tokens)
+                total_token_so_far += token_count
+            else:
+                # Ensure to handle the last section properly
+                if len(sections) == num_sections - 1:
+                    current_chunk.extend(tokens)
+                    continue
+                
+                # Carry the tail of the finished chunk into the next chunk so
+                # that adjacent sections share up to `overlap` words of context
+                overlap_tokens = []
+                if overlap > 0 and current_chunk:
+                    overlap_tokens = current_chunk[-overlap:]
+                sections.append(' '.join(current_chunk))
+                current_chunk = overlap_tokens + tokens
+                total_token_so_far = token_count
+
+        # Add the last chunk
+        if current_chunk:
+            sections.append(' '.join(current_chunk))
+
+        return sections
+

    def run(self, url: str, sections: List[str]) -> List[Dict[str, Any]]:
        """
        Process sections sequentially with a delay for rate limiting issues, specifically for LLMExtractionStrategy.
        """
        
-        merged_sections = self._merge(sections)
+        merged_sections = self._merge(
+            sections, self.chunk_token_threshold,
+            overlap= int(self.chunk_token_threshold * self.overlap_rate)
+        )
        extracted_content = []
        if self.provider.startswith("groq/"):
            # Sequential processing with a delay
            for ix, section in enumerate(merged_sections):
-                extracted_content.extend(self.extract(ix, url, section))
+                extract_func = partial(self.extract, url)
+                extracted_content.extend(extract_func(ix, sanitize_input_encode(section)))
                time.sleep(0.5)  # 500 ms delay between each processing
        else:
            # Parallel processing using ThreadPoolExecutor
+            # extract_func = partial(self.extract, url)
+            # for ix, section in enumerate(merged_sections):
+            #     extracted_content.append(extract_func(ix, section))            
+            
            with ThreadPoolExecutor(max_workers=4) as executor:
                extract_func = partial(self.extract, url)
-                futures = [executor.submit(extract_func, ix, section) for ix, section in enumerate(merged_sections)]
+                futures = [executor.submit(extract_func, ix, sanitize_input_encode(section)) for ix, section in enumerate(merged_sections)]
                
                for future in as_completed(futures):
-                    extracted_content.extend(future.result())
+                    try:
+                        extracted_content.extend(future.result())
+                    except Exception as e:
+                        if self.verbose:
+                            print(f"Error in thread execution: {e}")
+                        # Add error information to extracted_content
+                        extracted_content.append({
+                            "index": 0,
+                            "error": True,
+                            "tags": ["error"],
+                            "content": str(e)
+                        })

        
        return extracted_content        
  
 class CosineStrategy(ExtractionStrategy):
-    def __init__(self, semantic_filter = None, word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name = 'BAAI/bge-small-en-v1.5', **kwargs):
+    def __init__(self, semantic_filter = None, word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name = 'sentence-transformers/all-MiniLM-L6-v2', sim_threshold = 0.3, **kwargs):
        """
        Initialize the strategy with clustering parameters.

-        :param semantic_filter: A keyword filter for document filtering.
-        :param word_count_threshold: Minimum number of words per cluster.
-        :param max_dist: The maximum cophenetic distance on the dendrogram to form clusters.
-        :param linkage_method: The linkage method for hierarchical clustering.
-        :param top_k: Number of top categories to extract.
+        Args:
+            semantic_filter (str): A keyword filter for document filtering.
+            word_count_threshold (int): Minimum number of words per cluster.
+            max_dist (float): The maximum cophenetic distance on the dendrogram to form clusters.
+            linkage_method (str): The linkage method for hierarchical clustering.
+            top_k (int): Number of top categories to extract.
        """
        super().__init__()
        
+        import numpy as np
+        
        self.semantic_filter = semantic_filter
        self.word_count_threshold = word_count_threshold
        self.max_dist = max_dist
        self.linkage_method = linkage_method
        self.top_k = top_k
+        self.sim_threshold = sim_threshold
        self.timer = time.time()
        self.verbose = kwargs.get("verbose", False)
        
        self.buffer_embeddings = np.array([])
+        self.get_embedding_method = "direct"
+        
+        self.device = get_device()
+        # import torch
+        # self.device = torch.device('cpu')
+        
+        self.default_batch_size = calculate_batch_size(self.device)

-        if model_name == "bert-base-uncased":
-            self.tokenizer, self.model = load_bert_base_uncased()
-        elif model_name == "BAAI/bge-small-en-v1.5":
-            self.tokenizer, self.model = load_bge_small_en_v1_5()
+        if self.verbose:
+            print(f"[LOG] Loading Extraction Model for {self.device.type} device.")

-        self.nlp = load_text_multilabel_classifier()
+        # if False and self.device.type == "cpu":
+        #     self.model = load_onnx_all_MiniLM_l6_v2()
+        #     self.tokenizer = self.model.tokenizer
+        #     self.get_embedding_method = "direct"
+        # else:
+
+        self.tokenizer, self.model = load_HF_embedding_model(model_name)
+        self.model.to(self.device)
+        self.model.eval()  
+        
+        self.get_embedding_method = "batch"
+        
+        self.buffer_embeddings = np.array([])
+
+        # if model_name == "bert-base-uncased":
+        #     self.tokenizer, self.model = load_bert_base_uncased()
+        #     self.model.eval()  # Ensure the model is in evaluation mode
+        #     self.get_embedding_method = "batch"
+        # elif model_name == "BAAI/bge-small-en-v1.5":
+        #     self.tokenizer, self.model = load_bge_small_en_v1_5()
+        #     self.model.eval()  # Ensure the model is in evaluation mode
+        #     self.get_embedding_method = "batch"
+        # elif model_name == "sentence-transformers/all-MiniLM-L6-v2":
+        #     self.model = load_onnx_all_MiniLM_l6_v2()
+        #     self.tokenizer = self.model.tokenizer
+        #     self.get_embedding_method = "direct"
+       
+        
+        if self.verbose:
+            print(f"[LOG] Loading Multilabel Classifier for {self.device.type} device.")
+            
+        self.nlp, _ = load_text_multilabel_classifier()
+        # self.default_batch_size = 16 if self.device.type == 'cpu' else 64
        
        if self.verbose:
            print(f"[LOG] Model loaded {model_name}, models/reuters, took " + str(time.time() - self.timer) + " seconds")

-    def filter_documents_embeddings(self, documents: List[str], semantic_filter: str, threshold: float = 0.5) -> List[str]:
+    def filter_documents_embeddings(self, documents: List[str], semantic_filter: str, at_least_k: int = 20) -> List[str]:
        """
-        Filter documents based on the cosine similarity of their embeddings with the semantic_filter embedding.
+        Filter and sort documents based on the cosine similarity of their embeddings with the semantic_filter embedding.

        :param documents: List of text chunks (documents).
        :param semantic_filter: A string containing the keywords for filtering.
        :param threshold: Cosine similarity threshold for filtering documents.
-        :return: Filtered list of documents.
+        :param at_least_k: Minimum number of documents to return.
+        :return: List of filtered documents, ensuring at least `at_least_k` documents.
        """
-        from sklearn.metrics.pairwise import cosine_similarity
+        
        if not semantic_filter:
            return documents
+        
+        if len(documents) < at_least_k:
+            at_least_k = len(documents) // 2
+        
+        from sklearn.metrics.pairwise import cosine_similarity
+        
        # Compute embedding for the keyword filter
        query_embedding = self.get_embeddings([semantic_filter])[0]
        
-        # Compute embeddings for the docu  ments
+        # Compute embeddings for the documents
        document_embeddings = self.get_embeddings(documents)
        
        # Calculate cosine similarity between the query embedding and document embeddings
        similarities = cosine_similarity([query_embedding], document_embeddings).flatten()
        
        # Filter documents based on the similarity threshold
-        filtered_docs = [doc for doc, sim in zip(documents, similarities) if sim >= threshold]
+        filtered_docs = [(doc, sim) for doc, sim in zip(documents, similarities) if sim >= self.sim_threshold]
        
-        return filtered_docs
-
-    def get_embeddings(self, sentences: List[str], bypass_buffer=True):
+        # If the number of filtered documents is less than at_least_k, sort remaining documents by similarity
+        if len(filtered_docs) < at_least_k:
+            remaining_docs = [(doc, sim) for doc, sim in zip(documents, similarities) if sim < self.sim_threshold]
+            remaining_docs.sort(key=lambda x: x[1], reverse=True)
+            filtered_docs.extend(remaining_docs[:at_least_k - len(filtered_docs)])
+        
+        # Extract the document texts from the tuples
+        filtered_docs = [doc for doc, _ in filtered_docs]
+        
+        return filtered_docs[:at_least_k]
+    
+    def get_embeddings(self, sentences: List[str], batch_size=None, bypass_buffer=False):
        """
        Get BERT embeddings for a list of sentences.

@@ -224,19 +356,42 @@ class CosineStrategy(ExtractionStrategy):
        # if self.buffer_embeddings.any() and not bypass_buffer:
        #     return self.buffer_embeddings
        
-        import torch 
-        # Tokenize sentences and convert to tensor
-        encoded_input = self.tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
-        # Compute token embeddings
-        with torch.no_grad():
-            model_output = self.model(**encoded_input)
+        if self.device.type in [ "cpu", "gpu", "cuda", "mps"]:
+            import torch 
+            # Tokenize sentences and convert to tensor
+            if batch_size is None:
+                batch_size = self.default_batch_size
+                        
+            all_embeddings = []
+            for i in range(0, len(sentences), batch_size):
+                batch_sentences = sentences[i:i + batch_size]
+                encoded_input = self.tokenizer(batch_sentences, padding=True, truncation=True, return_tensors='pt')
+                encoded_input = {key: tensor.to(self.device) for key, tensor in encoded_input.items()}
+                
+                # Ensure no gradients are calculated
+                with torch.no_grad():
+                    model_output = self.model(**encoded_input)
+                
+                # Get embeddings from the last hidden state (mean pooling)
+                embeddings = model_output.last_hidden_state.mean(dim=1).cpu().numpy()
+                all_embeddings.append(embeddings)
            
-        # Get embeddings from the last hidden state (mean pooling)
-        embeddings = model_output.last_hidden_state.mean(1)
-        self.buffer_embeddings = embeddings.numpy()
-        return embeddings.numpy()
+            self.buffer_embeddings = np.vstack(all_embeddings)
+        elif self.device.type == "cpu":  # NOTE: unreachable, "cpu" is already handled by the branch above
+            # self.buffer_embeddings = self.model(sentences)
+            if batch_size is None:
+                batch_size = self.default_batch_size
+                
+            all_embeddings = []
+            for i in range(0, len(sentences), batch_size):
+                batch_sentences = sentences[i:i + batch_size]
+                embeddings = self.model(batch_sentences)
+                all_embeddings.append(embeddings)
+                
+            self.buffer_embeddings = np.vstack(all_embeddings)
+        return self.buffer_embeddings

-    def hierarchical_clustering(self, sentences: List[str]):
+    def hierarchical_clustering(self, sentences: List[str], embeddings = None):
        """
        Perform hierarchical clustering on sentences and return cluster labels.

@@ -247,7 +402,7 @@ class CosineStrategy(ExtractionStrategy):
        from scipy.cluster.hierarchy import linkage, fcluster
        from scipy.spatial.distance import pdist
        self.timer = time.time()
-        embeddings = self.get_embeddings(sentences, bypass_buffer=False)
+        embeddings = self.get_embeddings(sentences, bypass_buffer=True) if embeddings is None else embeddings
        # print(f"[LOG] 🚀 Embeddings computed in {time.time() - self.timer:.2f} seconds")
        # Compute pairwise cosine distances
        distance_matrix = pdist(embeddings, 'cosine')
@@ -311,20 +466,33 @@ class CosineStrategy(ExtractionStrategy):
        # Convert filtered clusters to a sorted list of dictionaries
        cluster_list = [{"index": int(idx), "tags" : [], "content": " ".join(filtered_clusters[idx])} for idx in sorted(filtered_clusters)]
        
-        labels = self.nlp([cluster['content'] for cluster in cluster_list])
+        if self.verbose:
+            print(f"[LOG] 🚀 Assign tags using {self.device}")
        
-        for cluster, label in zip(cluster_list, labels):
-            cluster['tags'] = label
+        if self.device.type in ["gpu", "cuda", "mps", "cpu"]:
+            labels = self.nlp([cluster['content'] for cluster in cluster_list])
+            
+            for cluster, label in zip(cluster_list, labels):
+                cluster['tags'] = label
+        # elif self.device.type == "cpu":
+        #     # Process the text with the loaded model
+        #     texts = [cluster['content'] for cluster in cluster_list]
+        #     # Batch process texts
+        #     docs = self.nlp.pipe(texts, disable=["tagger", "parser", "ner", "lemmatizer"])

-        # Process the text with the loaded model
-        # for cluster in  cluster_list:
-        #     cluster['tags'] = self.nlp(cluster['content'])[0]['label']
-            # doc = self.nlp(cluster['content'])
-            # tok_k = self.top_k
-            # top_categories = sorted(doc.cats.items(), key=lambda x: x[1], reverse=True)[:tok_k]
-            # cluster['tags'] = [cat for cat, _ in top_categories]
+        #     for doc, cluster in zip(docs, cluster_list):
+        #         tok_k = self.top_k
+        #         top_categories = sorted(doc.cats.items(), key=lambda x: x[1], reverse=True)[:tok_k]
+        #         cluster['tags'] = [cat for cat, _ in top_categories]
+                            
+            # for cluster in  cluster_list:
+            #     doc = self.nlp(cluster['content'])
+            #     tok_k = self.top_k
+            #     top_categories = sorted(doc.cats.items(), key=lambda x: x[1], reverse=True)[:tok_k]
+            #     cluster['tags'] = [cat for cat, _ in top_categories]
        
-        # print(f"[LOG] 🚀 Categorization done in {time.time() - t:.2f} seconds")
+        if self.verbose:
+            print(f"[LOG] 🚀 Categorization done in {time.time() - t:.2f} seconds")
        
        return cluster_list

@@ -463,4 +631,241 @@ class ContentSummarizationStrategy(ExtractionStrategy):

        # Sort summaries by the original section index to maintain order
        summaries.sort(key=lambda x: x[0])
-        return [summary for _, summary in summaries]
+        return [summary for _, summary in summaries]
+  
+class JsonCssExtractionStrategy(ExtractionStrategy):
+    def __init__(self, schema: Dict[str, Any], **kwargs):
+        super().__init__(**kwargs)
+        self.schema = schema
+
+    def extract(self, url: str, html: str, *q, **kwargs) -> List[Dict[str, Any]]:
+        soup = BeautifulSoup(html, 'html.parser')
+        base_elements = soup.select(self.schema['baseSelector'])
+        
+        results = []
+        for element in base_elements:
+            item = self._extract_item(element, self.schema['fields'])
+            if item:
+                results.append(item)
+        
+        return results
+
+    
+
+    def _extract_field(self, element, field):
+        try:
+            if field['type'] == 'nested':
+                nested_element = element.select_one(field['selector'])
+                return self._extract_item(nested_element, field['fields']) if nested_element else {}
+            
+            if field['type'] == 'list':
+                elements = element.select(field['selector'])
+                return [self._extract_list_item(el, field['fields']) for el in elements]
+            
+            if field['type'] == 'nested_list':
+                elements = element.select(field['selector'])
+                return [self._extract_item(el, field['fields']) for el in elements]
+            
+            return self._extract_single_field(element, field)
+        except Exception as e:
+            if self.verbose:
+                print(f"Error extracting field {field['name']}: {str(e)}")
+            return field.get('default')
+
+    def _extract_list_item(self, element, fields):
+        item = {}
+        for field in fields:
+            value = self._extract_single_field(element, field)
+            if value is not None:
+                item[field['name']] = value
+        return item
+    
+    def _extract_single_field(self, element, field):
+        if 'selector' in field:
+            selected = element.select_one(field['selector'])
+            if not selected:
+                return field.get('default')
+        else:
+            selected = element
+
+        value = None
+        if field['type'] == 'text':
+            value = selected.get_text(strip=True)
+        elif field['type'] == 'attribute':
+            value = selected.get(field['attribute'])
+        elif field['type'] == 'html':
+            value = str(selected)
+        elif field['type'] == 'regex':
+            text = selected.get_text(strip=True)
+            match = re.search(field['pattern'], text)
+            value = match.group(1) if match else None
+
+        if 'transform' in field and value is not None:
+            value = self._apply_transform(value, field['transform'])
+
+        return value if value is not None else field.get('default')
+
+    def _extract_item(self, element, fields):
+        item = {}
+        for field in fields:
+            if field['type'] == 'computed':
+                value = self._compute_field(item, field)
+            else:
+                value = self._extract_field(element, field)
+            if value is not None:
+                item[field['name']] = value
+        return item
+    
+    def _apply_transform(self, value, transform):
+        if transform == 'lowercase':
+            return value.lower()
+        elif transform == 'uppercase':
+            return value.upper()
+        elif transform == 'strip':
+            return value.strip()
+        return value
+
+    def _compute_field(self, item, field):
+        try:
+            if 'expression' in field:
+                return eval(field['expression'], {}, item)
+            elif 'function' in field:
+                return field['function'](item)
+        except Exception as e:
+            if self.verbose:
+                print(f"Error computing field {field['name']}: {str(e)}")
+            return field.get('default')
+
+    def run(self, url: str, sections: List[str], *q, **kwargs) -> List[Dict[str, Any]]:
+        combined_html = self.DEL.join(sections)
+        return self.extract(url, combined_html, **kwargs)
+    
+class JsonXPATHExtractionStrategy(ExtractionStrategy):
+    def __init__(self, schema: Dict[str, Any], **kwargs):
+        super().__init__(**kwargs)
+        self.schema = schema
+        self.use_cssselect = self._check_cssselect()
+
+    def _check_cssselect(self):
+        try:
+            import cssselect
+            return True
+        except ImportError:
+            print("Warning: cssselect is not installed. Falling back to XPath for all selectors.")
+            return False
+
+    def extract(self, url: str, html: str, *q, **kwargs) -> List[Dict[str, Any]]:
+        self.soup = BeautifulSoup(html, 'lxml')
+        self.tree = etree.HTML(str(self.soup))
+        
+        selector_type = 'xpath' if not self.use_cssselect else self.schema.get('selectorType', 'css')
+        base_selector = self.schema.get('baseXPath' if selector_type == 'xpath' else 'baseSelector')
+        base_elements = self._select_elements(base_selector, selector_type)
+        
+        results = []
+        for element in base_elements:
+            item = self._extract_item(element, self.schema['fields'])
+            if item:
+                results.append(item)
+        
+        return results
+
+    def _select_elements(self, selector, selector_type, element=None):
+        if selector_type == 'xpath' or not self.use_cssselect:
+            return self.tree.xpath(selector) if element is None else element.xpath(selector)
+        else:  # CSS
+            return self.tree.cssselect(selector) if element is None else element.cssselect(selector)
+
+    def _extract_field(self, element, field):
+        try:
+            selector_type = 'xpath' if not self.use_cssselect else field.get('selectorType', 'css')
+            selector = field.get('xpathSelector' if selector_type == 'xpath' else 'selector')
+            
+            if field['type'] == 'nested':
+                nested_element = self._select_elements(selector, selector_type, element)
+                return self._extract_item(nested_element[0], field['fields']) if nested_element else {}
+            
+            if field['type'] == 'list':
+                elements = self._select_elements(selector, selector_type, element)
+                return [self._extract_list_item(el, field['fields']) for el in elements]
+            
+            if field['type'] == 'nested_list':
+                elements = self._select_elements(selector, selector_type, element)
+                return [self._extract_item(el, field['fields']) for el in elements]
+            
+            return self._extract_single_field(element, field)
+        except Exception as e:
+            if self.verbose:
+                print(f"Error extracting field {field['name']}: {str(e)}")
+            return field.get('default')
+
+    def _extract_list_item(self, element, fields):
+        item = {}
+        for field in fields:
+            value = self._extract_single_field(element, field)
+            if value is not None:
+                item[field['name']] = value
+        return item
+    
+    def _extract_single_field(self, element, field):
+        selector_type = field.get('selectorType', 'css')
+        
+        if 'selector' in field:
+            selected = self._select_elements(field['selector'], selector_type, element)
+            if not selected:
+                return field.get('default')
+            selected = selected[0]
+        else:
+            selected = element
+
+        value = None
+        if field['type'] == 'text':
+            value = selected.text_content().strip() if hasattr(selected, 'text_content') else selected.text.strip()
+        elif field['type'] == 'attribute':
+            value = selected.get(field['attribute'])
+        elif field['type'] == 'html':
+            value = etree.tostring(selected, encoding='unicode')
+        elif field['type'] == 'regex':
+            text = selected.text_content().strip() if hasattr(selected, 'text_content') else selected.text.strip()
+            match = re.search(field['pattern'], text)
+            value = match.group(1) if match else None
+
+        if 'transform' in field and value is not None:
+            value = self._apply_transform(value, field['transform'])
+
+        return value if value is not None else field.get('default')
+
+    def _extract_item(self, element, fields):
+        item = {}
+        for field in fields:
+            if field['type'] == 'computed':
+                value = self._compute_field(item, field)
+            else:
+                value = self._extract_field(element, field)
+            if value is not None:
+                item[field['name']] = value
+        return item
+    
+    def _apply_transform(self, value, transform):
+        if transform == 'lowercase':
+            return value.lower()
+        elif transform == 'uppercase':
+            return value.upper()
+        elif transform == 'strip':
+            return value.strip()
+        return value
+
+    def _compute_field(self, item, field):
+        try:
+            if 'expression' in field:
+                return eval(field['expression'], {}, item)
+            elif 'function' in field:
+                return field['function'](item)
+        except Exception as e:
+            if self.verbose:
+                print(f"Error computing field {field['name']}: {str(e)}")
+            return field.get('default')
+
+    def run(self, url: str, sections: List[str], *q, **kwargs) -> List[Dict[str, Any]]:
+        combined_html = self.DEL.join(sections)
+        return self.extract(url, combined_html, **kwargs)
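Review note: the two JSON strategies above build each item field by field, and a `computed` field is evaluated with the partially built item as its local namespace, so it can only see fields listed before it. The sketch below is a minimal, stdlib-only illustration of that ordering and of the `default` fallback. It is not the crawl4ai implementation: `build_item`, the `raw` dictionary (standing in for text already selected from the DOM), and the sample field names are assumptions made for the example.

```python
import re
from typing import Any, Dict, List


def build_item(raw: Dict[str, str], fields: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical helper mirroring the field loop of _extract_item."""
    item: Dict[str, Any] = {}
    for field in fields:
        if field["type"] == "computed":
            try:
                # Earlier fields are visible as local names in the expression.
                value = eval(field["expression"], {}, item)
            except Exception:
                value = field.get("default")
        elif field["type"] == "regex":
            match = re.search(field["pattern"], raw[field["name"]])
            value = match.group(1) if match else None
        else:  # treated as plain text in this sketch
            value = raw[field["name"]].strip()
        if value is None:
            value = field.get("default")
        if value is not None:
            item[field["name"]] = value
    return item


# Hypothetical schema fragment; the keys follow the ones read in the diff.
fields = [
    {"name": "title", "type": "text"},
    {"name": "price", "type": "regex", "pattern": r"(\d+\.\d+)"},
    {"name": "price_cents", "type": "computed",
     "expression": "round(float(price) * 100)"},
    {"name": "broken", "type": "computed",
     "expression": "missing + 1", "default": "n/a"},
]

item = build_item({"title": "  Widget ", "price": "USD 19.99"}, fields)
print(item)
```

Because the expression goes through `eval`, a schema should only come from a trusted source; a callable under the `function` key avoids string evaluation altogether.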
--- a/crawl4ai/html2text/__init__.py
+++ b/crawl4ai/html2text/__init__.py
--- a/crawl4ai/html2text/__main__.py
+++ b/crawl4ai/html2text/__main__.py
@@ -0,0 +1,3 @@
+from .cli import main
+
+main()
--- a/crawl4ai/html2text/_typing.py
+++ b/crawl4ai/html2text/_typing.py
@@ -0,0 +1,2 @@
+class OutCallback:
+    def __call__(self, s: str) -> None: ...
--- a/crawl4ai/html2text/cli.py
+++ b/crawl4ai/html2text/cli.py
@@ -0,0 +1,330 @@
+import argparse
+import sys
+
+from . import HTML2Text, __version__, config
+
+
+def main() -> None:
+    baseurl = ""
+
+    class bcolors:
+        HEADER = "\033[95m"
+        OKBLUE = "\033[94m"
+        OKGREEN = "\033[92m"
+        WARNING = "\033[93m"
+        FAIL = "\033[91m"
+        ENDC = "\033[0m"
+        BOLD = "\033[1m"
+        UNDERLINE = "\033[4m"
+
+    p = argparse.ArgumentParser()
+    p.add_argument(
+        "--default-image-alt",
+        dest="default_image_alt",
+        default=config.DEFAULT_IMAGE_ALT,
+        help="The default alt string for images with missing ones",
+    )
+    p.add_argument(
+        "--pad-tables",
+        dest="pad_tables",
+        action="store_true",
+        default=config.PAD_TABLES,
+        help="pad the cells to equal column width in tables",
+    )
+    p.add_argument(
+        "--no-wrap-links",
+        dest="wrap_links",
+        action="store_false",
+        default=config.WRAP_LINKS,
+        help="don't wrap links during conversion",
+    )
+    p.add_argument(
+        "--wrap-list-items",
+        dest="wrap_list_items",
+        action="store_true",
+        default=config.WRAP_LIST_ITEMS,
+        help="wrap list items during conversion",
+    )
+    p.add_argument(
+        "--wrap-tables",
+        dest="wrap_tables",
+        action="store_true",
+        default=config.WRAP_TABLES,
+        help="wrap tables",
+    )
+    p.add_argument(
+        "--ignore-emphasis",
+        dest="ignore_emphasis",
+        action="store_true",
+        default=config.IGNORE_EMPHASIS,
+        help="don't include any formatting for emphasis",
+    )
+    p.add_argument(
+        "--reference-links",
+        dest="inline_links",
+        action="store_false",
+        default=config.INLINE_LINKS,
+        help="use reference style links instead of inline links",
+    )
+    p.add_argument(
+        "--ignore-links",
+        dest="ignore_links",
+        action="store_true",
+        default=config.IGNORE_ANCHORS,
+        help="don't include any formatting for links",
+    )
+    p.add_argument(
+        "--ignore-mailto-links",
+        action="store_true",
+        dest="ignore_mailto_links",
+        default=config.IGNORE_MAILTO_LINKS,
+        help="don't include mailto: links",
+    )
+    p.add_argument(
+        "--protect-links",
+        dest="protect_links",
+        action="store_true",
+        default=config.PROTECT_LINKS,
+        help="protect links from line breaks surrounding them with angle brackets",
+    )
+    p.add_argument(
+        "--ignore-images",
+        dest="ignore_images",
+        action="store_true",
+        default=config.IGNORE_IMAGES,
+        help="don't include any formatting for images",
+    )
+    p.add_argument(
+        "--images-as-html",
+        dest="images_as_html",
+        action="store_true",
+        default=config.IMAGES_AS_HTML,
+        help=(
+            "Always write image tags as raw html; preserves `height`, `width` and "
+            "`alt` if possible."
+        ),
+    )
+    p.add_argument(
+        "--images-to-alt",
+        dest="images_to_alt",
+        action="store_true",
+        default=config.IMAGES_TO_ALT,
+        help="Discard image data, only keep alt text",
+    )
+    p.add_argument(
+        "--images-with-size",
+        dest="images_with_size",
+        action="store_true",
+        default=config.IMAGES_WITH_SIZE,
+        help=(
+            "Write image tags with height and width attrs as raw html to retain "
+            "dimensions"
+        ),
+    )
+    p.add_argument(
+        "-g",
+        "--google-doc",
+        action="store_true",
+        dest="google_doc",
+        default=False,
+        help="convert an html-exported Google Document",
+    )
+    p.add_argument(
+        "-d",
+        "--dash-unordered-list",
+        action="store_true",
+        dest="ul_style_dash",
+        default=False,
+        help="use a dash rather than a star for unordered list items",
+    )
+    p.add_argument(
+        "-e",
+        "--asterisk-emphasis",
+        action="store_true",
+        dest="em_style_asterisk",
+        default=False,
+        help="use an asterisk rather than an underscore for emphasized text",
+    )
+    p.add_argument(
+        "-b",
+        "--body-width",
+        dest="body_width",
+        type=int,
+        default=config.BODY_WIDTH,
+        help="number of characters per output line, 0 for no wrap",
+    )
+    p.add_argument(
+        "-i",
+        "--google-list-indent",
+        dest="list_indent",
+        type=int,
+        default=config.GOOGLE_LIST_INDENT,
+        help="number of pixels Google indents nested lists",
+    )
+    p.add_argument(
+        "-s",
+        "--hide-strikethrough",
+        action="store_true",
+        dest="hide_strikethrough",
+        default=False,
+        help="hide strike-through text. only relevant when -g is " "specified as well",
+    )
+    p.add_argument(
+        "--escape-all",
+        action="store_true",
+        dest="escape_snob",
+        default=False,
+        help=(
+            "Escape all special characters.  Output is less readable, but avoids "
+            "corner case formatting issues."
+        ),
+    )
+    p.add_argument(
+        "--bypass-tables",
+        action="store_true",
+        dest="bypass_tables",
+        default=config.BYPASS_TABLES,
+        help="Format tables in HTML rather than Markdown syntax.",
+    )
+    p.add_argument(
+        "--ignore-tables",
+        action="store_true",
+        dest="ignore_tables",
+        default=config.IGNORE_TABLES,
+        help="Ignore table-related tags (table, th, td, tr) " "while keeping rows.",
+    )
+    p.add_argument(
+        "--single-line-break",
+        action="store_true",
+        dest="single_line_break",
+        default=config.SINGLE_LINE_BREAK,
+        help=(
+            "Use a single line break after a block element rather than two line "
+            "breaks. NOTE: Requires --body-width=0"
+        ),
+    )
+    p.add_argument(
+        "--unicode-snob",
+        action="store_true",
+        dest="unicode_snob",
+        default=config.UNICODE_SNOB,
+        help="Use unicode throughout document",
+    )
+    p.add_argument(
+        "--no-automatic-links",
+        action="store_false",
+        dest="use_automatic_links",
+        default=config.USE_AUTOMATIC_LINKS,
+        help="Do not use automatic links wherever applicable",
+    )
+    p.add_argument(
+        "--no-skip-internal-links",
+        action="store_false",
+        dest="skip_internal_links",
+        default=config.SKIP_INTERNAL_LINKS,
+        help="Do not skip internal links",
+    )
+    p.add_argument(
+        "--links-after-para",
+        action="store_true",
+        dest="links_each_paragraph",
+        default=config.LINKS_EACH_PARAGRAPH,
+        help="Put links after each paragraph instead of document",
+    )
+    p.add_argument(
+        "--mark-code",
+        action="store_true",
+        dest="mark_code",
+        default=config.MARK_CODE,
+        help="Mark program code blocks with [code]...[/code]",
+    )
+    p.add_argument(
+        "--decode-errors",
+        dest="decode_errors",
+        default=config.DECODE_ERRORS,
+        help=(
+            "What to do in case of decode errors. 'ignore', 'strict' and 'replace' are "
+            "acceptable values"
+        ),
+    )
+    p.add_argument(
+        "--open-quote",
+        dest="open_quote",
+        default=config.OPEN_QUOTE,
+        help="The character used to open quotes",
+    )
+    p.add_argument(
+        "--close-quote",
+        dest="close_quote",
+        default=config.CLOSE_QUOTE,
+        help="The character used to close quotes",
+    )
+    p.add_argument(
+        "--version", action="version", version=".".join(map(str, __version__))
+    )
+    p.add_argument("filename", nargs="?")
+    p.add_argument("encoding", nargs="?", default="utf-8")
+    p.add_argument(
+        "--include-sup-sub",
+        dest="include_sup_sub",
+        action="store_true",
+        default=config.INCLUDE_SUP_SUB,
+        help="Include the sup and sub tags",
+    )
+    args = p.parse_args()
+
+    if args.filename and args.filename != "-":
+        with open(args.filename, "rb") as fp:
+            data = fp.read()
+    else:
+        data = sys.stdin.buffer.read()
+
+    try:
+        html = data.decode(args.encoding, args.decode_errors)
+    except UnicodeDecodeError as err:
+        warning = bcolors.WARNING + "Warning:" + bcolors.ENDC
+        warning += " Use the " + bcolors.OKGREEN
+        warning += "--decode-errors=ignore" + bcolors.ENDC + " flag."
+        print(warning)
+        raise err
+
+    h = HTML2Text(baseurl=baseurl)
+    # handle options
+    if args.ul_style_dash:
+        h.ul_item_mark = "-"
+    if args.em_style_asterisk:
+        h.emphasis_mark = "*"
+        h.strong_mark = "__"
+
+    h.body_width = args.body_width
+    h.google_list_indent = args.list_indent
+    h.ignore_emphasis = args.ignore_emphasis
+    h.ignore_links = args.ignore_links
+    h.ignore_mailto_links = args.ignore_mailto_links
+    h.protect_links = args.protect_links
+    h.ignore_images = args.ignore_images
+    h.images_as_html = args.images_as_html
+    h.images_to_alt = args.images_to_alt
+    h.images_with_size = args.images_with_size
+    h.google_doc = args.google_doc
+    h.hide_strikethrough = args.hide_strikethrough
+    h.escape_snob = args.escape_snob
+    h.bypass_tables = args.bypass_tables
+    h.ignore_tables = args.ignore_tables
+    h.single_line_break = args.single_line_break
+    h.inline_links = args.inline_links
+    h.unicode_snob = args.unicode_snob
+    h.use_automatic_links = args.use_automatic_links
+    h.skip_internal_links = args.skip_internal_links
+    h.links_each_paragraph = args.links_each_paragraph
+    h.mark_code = args.mark_code
+    h.wrap_links = args.wrap_links
+    h.wrap_list_items = args.wrap_list_items
+    h.wrap_tables = args.wrap_tables
+    h.pad_tables = args.pad_tables
+    h.default_image_alt = args.default_image_alt
+    h.open_quote = args.open_quote
+    h.close_quote = args.close_quote
+    h.include_sup_sub = args.include_sup_sub
+
+    sys.stdout.write(h.handle(html))
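Review note: several flags in this CLI are negative switches that write to a positively named attribute (`--reference-links` clears `inline_links`, `--no-wrap-links` clears `wrap_links`). The sketch below reproduces that `argparse` pattern in isolation. The defaults are hard-coded copies of the `config.py` values in this diff, and `build_parser` and the sample file name are assumptions for illustration only.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical cut-down parser with three of the options from cli.py."""
    p = argparse.ArgumentParser()
    # Passing the flag stores False into the positive attribute.
    p.add_argument("--reference-links", dest="inline_links",
                   action="store_false", default=True)
    p.add_argument("--no-wrap-links", dest="wrap_links",
                   action="store_false", default=True)
    p.add_argument("-b", "--body-width", dest="body_width", type=int, default=78)
    # Both positionals are optional, as in the original CLI.
    p.add_argument("filename", nargs="?")
    p.add_argument("encoding", nargs="?", default="utf-8")
    return p


args = build_parser().parse_args(["--reference-links", "-b", "0", "page.html"])
print(args.inline_links, args.wrap_links, args.body_width, args.filename, args.encoding)
```

Keeping the attribute names positive lets the later `h.inline_links = args.inline_links` assignments copy values straight across without inverting them.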
--- a/crawl4ai/html2text/config.py
+++ b/crawl4ai/html2text/config.py
@@ -0,0 +1,172 @@
+import re
+
+# Use Unicode characters instead of their ascii pseudo-replacements
+UNICODE_SNOB = False
+
+# Marker to use for marking tables for padding post processing
+TABLE_MARKER_FOR_PAD = "special_marker_for_table_padding"
+# Escape all special characters.  Output is less readable, but avoids
+# corner case formatting issues.
+ESCAPE_SNOB = False
+ESCAPE_BACKSLASH = False
+ESCAPE_DOT = False
+ESCAPE_PLUS = False
+ESCAPE_DASH = False
+
+# Put the links after each paragraph instead of at the end.
+LINKS_EACH_PARAGRAPH = False
+
+# Wrap long lines at position. 0 for no wrapping.
+BODY_WIDTH = 78
+
+# Don't show internal links (href="#local-anchor") -- corresponding link
+# targets won't be visible in the plain text file anyway.
+SKIP_INTERNAL_LINKS = True
+
+# Use inline, rather than reference, formatting for images and links
+INLINE_LINKS = True
+
+# Protect links from line breaks surrounding them with angle brackets (in
+# addition to their square brackets)
+PROTECT_LINKS = False
+# WRAP_LINKS = True
+WRAP_LINKS = True
+
+# Wrap list items.
+WRAP_LIST_ITEMS = False
+
+# Wrap tables
+WRAP_TABLES = False
+
+# Number of pixels Google indents nested lists
+GOOGLE_LIST_INDENT = 36
+
+# Values Google and others may use to indicate bold text
+BOLD_TEXT_STYLE_VALUES = ("bold", "700", "800", "900")
+
+IGNORE_ANCHORS = False
+IGNORE_MAILTO_LINKS = False
+IGNORE_IMAGES = False
+IMAGES_AS_HTML = False
+IMAGES_TO_ALT = False
+IMAGES_WITH_SIZE = False
+IGNORE_EMPHASIS = False
+MARK_CODE = False
+DECODE_ERRORS = "strict"
+DEFAULT_IMAGE_ALT = ""
+PAD_TABLES = False
+
+# Convert links with same href and text to <href> format
+# if they are absolute links
+USE_AUTOMATIC_LINKS = True
+
+# For checking space-only lines on line 771
+RE_SPACE = re.compile(r"\s+")
+
+RE_ORDERED_LIST_MATCHER = re.compile(r"\d+\.\s")
+RE_UNORDERED_LIST_MATCHER = re.compile(r"[-\*\+]\s")
+RE_MD_CHARS_MATCHER = re.compile(r"([\\\[\]\(\)])")
+RE_MD_CHARS_MATCHER_ALL = re.compile(r"([`\*_{}\[\]\(\)#!])")
+
+# to find links in the text
+RE_LINK = re.compile(r"(\[.*?\] ?\(.*?\))|(\[.*?\]:.*?)")
+
+# to find table separators
+RE_TABLE = re.compile(r" \| ")
+
+RE_MD_DOT_MATCHER = re.compile(
+    r"""
+    ^             # start of line
+    (\s*\d+)      # optional whitespace and a number
+    (\.)          # dot
+    (?=\s)        # lookahead assert whitespace
+    """,
+    re.MULTILINE | re.VERBOSE,
+)
+RE_MD_PLUS_MATCHER = re.compile(
+    r"""
+    ^
+    (\s*)
+    (\+)
+    (?=\s)
+    """,
+    flags=re.MULTILINE | re.VERBOSE,
+)
+RE_MD_DASH_MATCHER = re.compile(
+    r"""
+    ^
+    (\s*)
+    (-)
+    (?=\s|\-)     # followed by whitespace (bullet list, or spaced out hr)
+                  # or another dash (header or hr)
+    """,
+    flags=re.MULTILINE | re.VERBOSE,
+)
+RE_SLASH_CHARS = r"\`*_{}[]()#+-.!"
+RE_MD_BACKSLASH_MATCHER = re.compile(
+    r"""
+    (\\)          # match one slash
+    (?=[%s])      # followed by a char that requires escaping
+    """
+    % re.escape(RE_SLASH_CHARS),
+    flags=re.VERBOSE,
+)
+
+UNIFIABLE = {
+    "rsquo": "'",
+    "lsquo": "'",
+    "rdquo": '"',
+    "ldquo": '"',
+    "copy": "(C)",
+    "mdash": "--",
+    "nbsp": " ",
+    "rarr": "->",
+    "larr": "<-",
+    "middot": "*",
+    "ndash": "-",
+    "oelig": "oe",
+    "aelig": "ae",
+    "agrave": "a",
+    "aacute": "a",
+    "acirc": "a",
+    "atilde": "a",
+    "auml": "a",
+    "aring": "a",
+    "egrave": "e",
+    "eacute": "e",
+    "ecirc": "e",
+    "euml": "e",
+    "igrave": "i",
+    "iacute": "i",
+    "icirc": "i",
+    "iuml": "i",
+    "ograve": "o",
+    "oacute": "o",
+    "ocirc": "o",
+    "otilde": "o",
+    "ouml": "o",
+    "ugrave": "u",
+    "uacute": "u",
+    "ucirc": "u",
+    "uuml": "u",
+    "lrm": "",
+    "rlm": "",
+}
+
+# Format tables in HTML rather than Markdown syntax
+BYPASS_TABLES = False
+# Ignore table-related tags (table, th, td, tr) while keeping rows
+IGNORE_TABLES = False
+
+
+# Use a single line break after a block element rather than two line breaks.
+# NOTE: Requires body width setting to be 0.
+SINGLE_LINE_BREAK = False
+
+
+# Use double quotation marks when converting the <q> tag.
+OPEN_QUOTE = '"'
+CLOSE_QUOTE = '"'
+
+# Include the <sup> and <sub> tags
+INCLUDE_SUP_SUB = False
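Review note: the verbose patterns in this file are easier to judge against sample input. The sketch below copies `RE_MD_DOT_MATCHER` as defined above and applies the substitution string that `escape_md_section` in `utils.py` uses for it; the sample text is an assumption chosen to show one line that is escaped and one that is not.

```python
import re

# Copied from config.py in this diff.
RE_MD_DOT_MATCHER = re.compile(
    r"""
    ^             # start of line
    (\s*\d+)      # optional whitespace and a number
    (\.)          # dot
    (?=\s)        # lookahead assert whitespace
    """,
    re.MULTILINE | re.VERBOSE,
)

text = "1. first\nversion 2.0 is out\n  3. third"

# Insert a backslash between the number and the dot so Markdown does not
# read the line as an ordered-list item.
escaped = RE_MD_DOT_MATCHER.sub(r"\1\\\2", text)
print(escaped)
```

Only a number at the start of a line followed by a dot and whitespace is touched, so a version string such as `2.0` in running text is left alone.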
--- a/crawl4ai/html2text/elements.py
+++ b/crawl4ai/html2text/elements.py
@@ -0,0 +1,18 @@
+from typing import Dict, Optional
+
+
+class AnchorElement:
+    __slots__ = ["attrs", "count", "outcount"]
+
+    def __init__(self, attrs: Dict[str, Optional[str]], count: int, outcount: int):
+        self.attrs = attrs
+        self.count = count
+        self.outcount = outcount
+
+
+class ListElement:
+    __slots__ = ["name", "num"]
+
+    def __init__(self, name: str, num: int):
+        self.name = name
+        self.num = num
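Review note: both element classes declare `__slots__`, which fixes their attribute set and drops the per-instance `__dict__`. The sketch below repeats `AnchorElement` from this diff to show the effect; the sample attribute values and the `title` attribute are assumptions for illustration.

```python
from typing import Dict, Optional


class AnchorElement:
    __slots__ = ["attrs", "count", "outcount"]

    def __init__(self, attrs: Dict[str, Optional[str]], count: int, outcount: int):
        self.attrs = attrs
        self.count = count
        self.outcount = outcount


anchor = AnchorElement({"href": "https://example.com"}, count=1, outcount=0)

# Assigning a name that is not listed in __slots__ raises AttributeError.
try:
    anchor.title = "not a slot"
    rejected = False
except AttributeError:
    rejected = True

has_dict = hasattr(anchor, "__dict__")
print(rejected, has_dict)
```

This is a reasonable fit for a parser that may create one small object per link or list level, since a misspelled attribute fails immediately instead of silently creating a new field.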
--- a/crawl4ai/html2text/utils.py
+++ b/crawl4ai/html2text/utils.py
@@ -0,0 +1,303 @@
+import html.entities
+from typing import Dict, List, Optional
+
+from . import config
+
+unifiable_n = {
+    html.entities.name2codepoint[k]: v
+    for k, v in config.UNIFIABLE.items()
+    if k != "nbsp"
+}
+
+
+def hn(tag: str) -> int:
+    if tag[0] == "h" and len(tag) == 2:
+        n = tag[1]
+        if "0" < n <= "9":
+            return int(n)
+    return 0
+
+
+def dumb_property_dict(style: str) -> Dict[str, str]:
+    """
+    :returns: A hash of css attributes
+    """
+    return {
+        x.strip().lower(): y.strip().lower()
+        for x, y in [z.split(":", 1) for z in style.split(";") if ":" in z]
+    }
+
+
+def dumb_css_parser(data: str) -> Dict[str, Dict[str, str]]:
+    """
+    :type data: str
+
+    :returns: A hash of css selectors, each of which contains a hash of
+    css attributes.
+    :rtype: dict
+    """
+    # remove @import sentences
+    data += ";"
+    importIndex = data.find("@import")
+    while importIndex != -1:
+        data = data[0:importIndex] + data[data.find(";", importIndex) + 1 :]
+        importIndex = data.find("@import")
+
+    # parse the css. reverted from dictionary comprehension in order to
+    # support older pythons
+    pairs = [x.split("{") for x in data.split("}") if "{" in x.strip()]
+    try:
+        elements = {a.strip(): dumb_property_dict(b) for a, b in pairs}
+    except ValueError:
+        elements = {}  # not that important
+
+    return elements
+
+
+def element_style(
+    attrs: Dict[str, Optional[str]],
+    style_def: Dict[str, Dict[str, str]],
+    parent_style: Dict[str, str],
+) -> Dict[str, str]:
+    """
+    :type attrs: dict
+    :type style_def: dict
+    :type parent_style: dict
+
+    :returns: A hash of the 'final' style attributes of the element
+    :rtype: dict
+    """
+    style = parent_style.copy()
+    if "class" in attrs:
+        assert attrs["class"] is not None
+        for css_class in attrs["class"].split():
+            css_style = style_def.get("." + css_class, {})
+            style.update(css_style)
+    if "style" in attrs:
+        assert attrs["style"] is not None
+        immediate_style = dumb_property_dict(attrs["style"])
+        style.update(immediate_style)
+
+    return style
+
+
+def google_list_style(style: Dict[str, str]) -> str:
+    """
+    Finds out whether this is an ordered or unordered list
+
+    :type style: dict
+
+    :rtype: str
+    """
+    if "list-style-type" in style:
+        list_style = style["list-style-type"]
+        if list_style in ["disc", "circle", "square", "none"]:
+            return "ul"
+
+    return "ol"
+
+
+def google_has_height(style: Dict[str, str]) -> bool:
+    """
+    Check if the style of the element has the 'height' attribute
+    explicitly defined
+
+    :type style: dict
+
+    :rtype: bool
+    """
+    return "height" in style
+
+
+def google_text_emphasis(style: Dict[str, str]) -> List[str]:
+    """
+    :type style: dict
+
+    :returns: A list of all emphasis modifiers of the element
+    :rtype: list
+    """
+    emphasis = []
+    if "text-decoration" in style:
+        emphasis.append(style["text-decoration"])
+    if "font-style" in style:
+        emphasis.append(style["font-style"])
+    if "font-weight" in style:
+        emphasis.append(style["font-weight"])
+
+    return emphasis
+
+
+def google_fixed_width_font(style: Dict[str, str]) -> bool:
+    """
+    Check if the css of the current element defines a fixed width font
+
+    :type style: dict
+
+    :rtype: bool
+    """
+    font_family = ""
+    if "font-family" in style:
+        font_family = style["font-family"]
+    return "courier new" == font_family or "consolas" == font_family
+
+
+def list_numbering_start(attrs: Dict[str, Optional[str]]) -> int:
+    """
+    Extract numbering from list element attributes
+
+    :type attrs: dict
+
+    :rtype: int
+    """
+    if "start" in attrs:
+        assert attrs["start"] is not None
+        try:
+            return int(attrs["start"]) - 1
+        except ValueError:
+            pass
+
+    return 0
+
+
+def skipwrap(
+    para: str, wrap_links: bool, wrap_list_items: bool, wrap_tables: bool
+) -> bool:
+    # If it appears to contain a link
+    # don't wrap
+    if not wrap_links and config.RE_LINK.search(para):
+        return True
+    # If the text begins with four spaces or one tab, it's a code block;
+    # don't wrap
+    if para[0:4] == "    " or para[0] == "\t":
+        return True
+
+    # If the text begins with only two "--", possibly preceded by
+    # whitespace, that's an emdash; so wrap.
+    stripped = para.lstrip()
+    if stripped[0:2] == "--" and len(stripped) > 2 and stripped[2] != "-":
+        return False
+
+    # I'm not sure what this is for; I thought it was to detect lists,
+    # but there's a <br>-inside-<span> case in one of the tests that
+    # also depends upon it.
+    if stripped[0:1] in ("-", "*") and not stripped[0:2] == "**":
+        return not wrap_list_items
+
+    # If text contains a pipe character it is likely a table
+    if not wrap_tables and config.RE_TABLE.search(para):
+        return True
+
+    # If the text begins with a single -, *, or +, followed by a space,
+    # or an integer, followed by a ., followed by a space (in either
+    # case optionally preceded by whitespace), it's a list; don't wrap.
+    return bool(
+        config.RE_ORDERED_LIST_MATCHER.match(stripped)
+        or config.RE_UNORDERED_LIST_MATCHER.match(stripped)
+    )
+
+
+def escape_md(text: str) -> str:
+    """
+    Escapes markdown-sensitive characters within other markdown
+    constructs.
+    """
+    return config.RE_MD_CHARS_MATCHER.sub(r"\\\1", text)
+
+
+def escape_md_section(
+    text: str,
+    escape_backslash: bool = True,
+    snob: bool = False,
+    escape_dot: bool = True,
+    escape_plus: bool = True,
+    escape_dash: bool = True
+) -> str:
+    """
+    Escapes markdown-sensitive characters across whole document sections.
+    Each escaping operation can be controlled individually.
+    """
+    if escape_backslash:
+        text = config.RE_MD_BACKSLASH_MATCHER.sub(r"\\\1", text)
+
+    if snob:
+        text = config.RE_MD_CHARS_MATCHER_ALL.sub(r"\\\1", text)
+
+    if escape_dot:
+        text = config.RE_MD_DOT_MATCHER.sub(r"\1\\\2", text)
+
+    if escape_plus:
+        text = config.RE_MD_PLUS_MATCHER.sub(r"\1\\\2", text)
+
+    if escape_dash:
+        text = config.RE_MD_DASH_MATCHER.sub(r"\1\\\2", text)
+
+    return text
+
+def reformat_table(lines: List[str], right_margin: int) -> List[str]:
+    """
+    Given the lines of a table,
+    pads the cells and returns the new lines.
+    """
+    # find the maximum width of the columns
+    max_width = [len(x.rstrip()) + right_margin for x in lines[0].split("|")]
+    max_cols = len(max_width)
+    for line in lines:
+        cols = [x.rstrip() for x in line.split("|")]
+        num_cols = len(cols)
+
+        # don't drop any data if colspan attributes result in unequal lengths
+        if num_cols < max_cols:
+            cols += [""] * (max_cols - num_cols)
+        elif max_cols < num_cols:
+            max_width += [len(x) + right_margin for x in cols[-(num_cols - max_cols) :]]
+            max_cols = num_cols
+
+        max_width = [
+            max(len(x) + right_margin, old_len) for x, old_len in zip(cols, max_width)
+        ]
+
+    # reformat
+    new_lines = []
+    for line in lines:
+        cols = [x.rstrip() for x in line.split("|")]
+        if set(line.strip()) == set("-|"):
+            filler = "-"
+            new_cols = [
+                x.rstrip() + (filler * (M - len(x.rstrip())))
+                for x, M in zip(cols, max_width)
+            ]
+            new_lines.append("|-" + "|".join(new_cols) + "|")
+        else:
+            filler = " "
+            new_cols = [
+                x.rstrip() + (filler * (M - len(x.rstrip())))
+                for x, M in zip(cols, max_width)
+            ]
+            new_lines.append("| " + "|".join(new_cols) + "|")
+    return new_lines
+
+
+def pad_tables_in_text(text: str, right_margin: int = 1) -> str:
+    """
+    Provide padding for tables in the text
+    """
+    lines = text.split("\n")
+    table_buffer = []  # type: List[str]
+    table_started = False
+    new_lines = []
+    for line in lines:
+        # Toggle table started
+        if config.TABLE_MARKER_FOR_PAD in line:
+            table_started = not table_started
+            if not table_started:
+                table = reformat_table(table_buffer, right_margin)
+                new_lines.extend(table)
+                table_buffer = []
+                new_lines.append("")
+            continue
+        # Process lines
+        if table_started:
+            table_buffer.append(line)
+        else:
+            new_lines.append(line)
+    return "\n".join(new_lines)
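Note on the table-padding helpers above: the core idea in `reformat_table` is to measure the widest cell per column and pad every cell to that width. The sketch below is a simplified, standalone illustration of that idea only. `pad_rows` is a hypothetical name that does not appear in this diff, and it omits the leading `| ` / `|-` prefixes that `reformat_table` adds.

```python
from typing import List


def pad_rows(lines: List[str], right_margin: int = 1) -> List[str]:
    """Pad pipe-separated cells so every column has a uniform width."""
    rows = [[cell.rstrip() for cell in line.split("|")] for line in lines]
    n_cols = max(len(row) for row in rows)
    # Fill short rows with empty cells so no data is dropped.
    rows = [row + [""] * (n_cols - len(row)) for row in rows]
    # Column width = widest cell in that column plus the right margin.
    widths = [
        max(len(row[i]) for row in rows) + right_margin for i in range(n_cols)
    ]
    padded = []
    for row in rows:
        # A separator row consists only of dashes; pad it with dashes too.
        is_separator = any(row) and set("".join(row)) <= {"-"}
        filler = "-" if is_separator else " "
        cells = [cell + filler * (w - len(cell)) for cell, w in zip(row, widths)]
        padded.append("|".join(cells))
    return padded


if __name__ == "__main__":
    for line in pad_rows(["a|bb", "---|---", "ccc|d"]):
        print(line)
```

Every output line has the same length, which is what keeps the Markdown table columns visually aligned.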
--- a/crawl4ai/model_loader.py
+++ b/crawl4ai/model_loader.py
@@ -2,9 +2,59 @@ from functools import lru_cache
 from pathlib import Path
 import subprocess, os
 import shutil
-from crawl4ai.config import MODEL_REPO_BRANCH
+import tarfile
+from .model_loader import *
 import argparse
+import urllib.request
+from crawl4ai.config import MODEL_REPO_BRANCH
+__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))

+@lru_cache()
+def get_available_memory(device):
+    import torch
+    if device.type == 'cuda':
+        return torch.cuda.get_device_properties(device).total_memory
+    elif device.type == 'mps':
+        return 48 * 1024 ** 3  # Hard-coded assumption of 48GB for MPS
+    else:
+        return 0
+
+@lru_cache()
+def calculate_batch_size(device):
+    available_memory = get_available_memory(device)
+    
+    if device.type == 'cpu':
+        return 16
+    elif device.type in ['cuda', 'mps']:
+        # Adjust these thresholds based on your model size and available memory
+        if available_memory >= 31 * 1024 ** 3:  # 31GiB and up (nominal 32GB devices)
+            return 256
+        elif available_memory >= 15 * 1024 ** 3:  # 15GiB to 31GiB (nominal 16GB devices)
+            return 128
+        elif available_memory >= 8 * 1024 ** 3:  # 8GiB to 15GiB
+            return 64
+        else:
+            return 32
+    else:
+        return 16  # Default batch size   
+    
+@lru_cache()
+def get_device():
+    import torch
+    if torch.cuda.is_available():
+        device = torch.device('cuda')
+    elif torch.backends.mps.is_available():
+        device = torch.device('mps')
+    else:
+        device = torch.device('cpu')
+    return device   
+    
+def set_model_device(model):
+    device = get_device()
+    model.to(device)    
+    return model, device
+
+@lru_cache()
 def get_home_folder():
    home_folder = os.path.join(Path.home(), ".crawl4ai")
    os.makedirs(home_folder, exist_ok=True)
@@ -17,25 +67,38 @@ def load_bert_base_uncased():
    from transformers import BertTokenizer, BertModel, AutoTokenizer, AutoModel
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', resume_download=None)
    model = BertModel.from_pretrained('bert-base-uncased', resume_download=None)
+    model.eval()
+    model, device = set_model_device(model)
    return tokenizer, model

@lru_cache()
-def load_bge_small_en_v1_5():
+def load_HF_embedding_model(model_name="BAAI/bge-small-en-v1.5") -> tuple:
+    """Load the Hugging Face model for embedding.
+    
+    Args:
+        model_name (str, optional): The model name to load. Defaults to "BAAI/bge-small-en-v1.5".
+        
+    Returns:
+        tuple: The tokenizer and model.
+    """
    from transformers import BertTokenizer, BertModel, AutoTokenizer, AutoModel
-    tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-small-en-v1.5', resume_download=None)
-    model = AutoModel.from_pretrained('BAAI/bge-small-en-v1.5', resume_download=None)
+    tokenizer = AutoTokenizer.from_pretrained(model_name, resume_download=None)
+    model = AutoModel.from_pretrained(model_name, resume_download=None)
    model.eval()
+    model, device = set_model_device(model)
    return tokenizer, model

@lru_cache()
 def load_text_classifier():
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    from transformers import pipeline
+    import torch

    tokenizer = AutoTokenizer.from_pretrained("dstefa/roberta-base_topic_classification_nyt_news")
    model = AutoModelForSequenceClassification.from_pretrained("dstefa/roberta-base_topic_classification_nyt_news")
+    model.eval()
+    model, device = set_model_device(model)
    pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
-
    return pipe

@lru_cache()
@@ -45,21 +108,23 @@ def load_text_multilabel_classifier():
    from scipy.special import expit
    import torch

+    # # Check for available device: CUDA, MPS (for Apple Silicon), or CPU
+    # if torch.cuda.is_available():
+    #     device = torch.device("cuda")
+    # elif torch.backends.mps.is_available():
+    #     device = torch.device("mps")
+    # else:
+    #     device = torch.device("cpu")
+    #     # return load_spacy_model(), torch.device("cpu")
+    
+
    MODEL = "cardiffnlp/tweet-topic-21-multi"
    tokenizer = AutoTokenizer.from_pretrained(MODEL, resume_download=None)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL, resume_download=None)
+    model.eval()
+    model, device = set_model_device(model)
    class_mapping = model.config.id2label

-    # Check for available device: CUDA, MPS (for Apple Silicon), or CPU
-    if torch.cuda.is_available():
-        device = torch.device("cuda")
-    elif torch.backends.mps.is_available():
-        device = torch.device("mps")
-    else:
-        device = torch.device("cpu")
-
-    model.to(device)
-
    def _classifier(texts, threshold=0.5, max_length=64):
        tokens = tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=max_length)
        tokens = {key: val.to(device) for key, val in tokens.items()}  # Move tokens to the selected device
@@ -78,7 +143,7 @@ def load_text_multilabel_classifier():

        return batch_labels

-    return _classifier
+    return _classifier, device

@lru_cache()
 def load_nltk_punkt():
@@ -89,6 +154,67 @@ def load_nltk_punkt():
        nltk.download('punkt')
    return nltk.data.find('tokenizers/punkt')

+@lru_cache()
+def load_spacy_model():
+    import spacy
+    name = "models/reuters"
+    home_folder = get_home_folder()
+    model_folder = Path(home_folder) / name
+    
+    # Check if the model directory already exists
+    if not (model_folder.exists() and any(model_folder.iterdir())):
+        repo_url = "https://github.com/unclecode/crawl4ai.git"
+        branch = MODEL_REPO_BRANCH 
+        repo_folder = Path(home_folder) / "crawl4ai"
+        
+        print("[LOG] ⏬ Downloading Spacy model for the first time...")
+
+        # Remove existing repo folder if it exists
+        if repo_folder.exists():
+            try:
+                shutil.rmtree(repo_folder)
+                if model_folder.exists():
+                    shutil.rmtree(model_folder)
+            except PermissionError:
+                print("[WARNING] Unable to remove existing folders. Please manually delete the following folders and try again:")
+                print(f"- {repo_folder}")
+                print(f"- {model_folder}")
+                return None
+
+        try:
+            # Clone the repository
+            subprocess.run(
+                ["git", "clone", "-b", branch, repo_url, str(repo_folder)],
+                stdout=subprocess.DEVNULL,
+                stderr=subprocess.DEVNULL,
+                check=True
+            )
+
+            # Create the models directory if it doesn't exist
+            models_folder = Path(home_folder) / "models"
+            models_folder.mkdir(parents=True, exist_ok=True)
+
+            # Copy the reuters model folder to the models directory
+            source_folder = repo_folder / "models" / "reuters"
+            shutil.copytree(source_folder, model_folder)
+
+            # Remove the cloned repository
+            shutil.rmtree(repo_folder)
+
+            print("[LOG] ✅ Spacy Model downloaded successfully")
+        except subprocess.CalledProcessError as e:
+            print(f"An error occurred while cloning the repository: {e}")
+            return None
+        except Exception as e:
+            print(f"An error occurred: {e}")
+            return None
+
+    try:
+        return spacy.load(str(model_folder))
+    except Exception as e:
+        print(f"Error loading spacy model: {e}")
+        return None
+
 def download_all_models(remove_existing=False):
    """Download all models required for Crawl4AI."""
    if remove_existing:
@@ -104,12 +230,15 @@ def download_all_models(remove_existing=False):
        print("[LOG] Existing models removed.")

    # Load each model to trigger download
-    print("[LOG] Downloading BERT Base Uncased...")
-    load_bert_base_uncased()
-    print("[LOG] Downloading BGE Small EN v1.5...")
-    load_bge_small_en_v1_5()
+    # print("[LOG] Downloading BERT Base Uncased...")
+    # load_bert_base_uncased()
+    # print("[LOG] Downloading BGE Small EN v1.5...")
+    # load_bge_small_en_v1_5()
+    # print("[LOG] Downloading ONNX model...")
+    # load_onnx_all_MiniLM_l6_v2()
    print("[LOG] Downloading text classifier...")
-    load_text_multilabel_classifier
+    _, device = load_text_multilabel_classifier()
+    print(f"[LOG] Text classifier loaded on {device}")
    print("[LOG] Downloading custom NLTK Punkt model...")
    load_nltk_punkt()
    print("[LOG] ✅ All models downloaded successfully.")
@@ -124,4 +253,4 @@ def main():
    download_all_models(remove_existing=args.remove_existing)

 if __name__ == "__main__":
-    main()
+    main()
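Note on `calculate_batch_size` above: the memory thresholds can be checked without a GPU by separating the table lookup from the torch device query. The following is a minimal sketch under that assumption. `batch_size_for` is a hypothetical helper, not part of this diff, and takes the device type and total memory as plain values.

```python
GIB = 1024 ** 3

# (minimum memory in GiB, batch size), checked from largest to smallest.
# The values mirror the thresholds used in calculate_batch_size.
THRESHOLDS = ((31, 256), (15, 128), (8, 64))


def batch_size_for(device_type: str, total_memory_bytes: int) -> int:
    """Pick a batch size from the device type and its total memory."""
    if device_type not in ("cuda", "mps"):
        return 16  # CPU and unknown devices use the default batch size
    for min_gib, batch_size in THRESHOLDS:
        if total_memory_bytes >= min_gib * GIB:
            return batch_size
    return 32  # accelerators with less than 8GiB


if __name__ == "__main__":
    print(batch_size_for("cuda", 24 * GIB))
```

Keeping the thresholds in one tuple makes it easier to see that the cut-offs sit slightly below the nominal 32GB and 16GB sizes.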
--- a/crawl4ai/models.py
+++ b/crawl4ai/models.py
@@ -1,5 +1,5 @@
 from pydantic import BaseModel, HttpUrl
-from typing import List
+from typing import List, Dict, Optional

 class UrlModel(BaseModel):
    url: HttpUrl
@@ -9,8 +9,16 @@ class CrawlResult(BaseModel):
    url: str
    html: str
    success: bool
-    cleaned_html: str = None
-    markdown: str = None
-    extracted_content: str = None
-    metadata: dict = None
-    error_message: str = None
+    cleaned_html: Optional[str] = None
+    media: Dict[str, List[Dict]] = {}
+    links: Dict[str, List[Dict]] = {}
+    screenshot: Optional[str] = None
+    markdown: Optional[str] = None
+    fit_markdown: Optional[str] = None
+    fit_html: Optional[str] = None
+    extracted_content: Optional[str] = None
+    metadata: Optional[dict] = None
+    error_message: Optional[str] = None
+    session_id: Optional[str] = None
+    response_headers: Optional[dict] = None
+    status_code: Optional[int] = None
--- a/crawl4ai/prompts.py
+++ b/crawl4ai/prompts.py
@@ -1,4 +1,4 @@
-PROMPT_EXTRACT_BLOCKS = """YHere is the URL of the webpage:
+PROMPT_EXTRACT_BLOCKS = """Here is the URL of the webpage:
 <url>{URL}</url>

 And here is the cleaned HTML content of that webpage:
@@ -29,7 +29,7 @@ To generate the JSON objects:

 5. Make sure the generated JSON is complete and parsable, with no errors or omissions.

-6. Make sur to escape any special characters in the HTML content, and also single or double quote to avoid JSON parsing issues.
+6. Make sure to escape any special characters in the HTML content, as well as single and double quotes, to avoid JSON parsing issues.

 Please provide your output within <blocks> tags, like this:

@@ -79,7 +79,7 @@ To generate the JSON objects:
 2. For each block:
   a. Assign it an index based on its order in the content.
   b. Analyze the content and generate ONE semantic tag that describe what the block is about.
-   c. Extract the text content, EXACTLY SAME AS GIVE DATA, clean it up if needed, and store it as a list of strings in the "content" field.
+   c. Extract the text content, EXACTLY THE SAME AS THE GIVEN DATA, clean it up if needed, and store it as a list of strings in the "content" field.

 3. Ensure that the order of the JSON objects matches the order of the blocks as they appear in the original HTML content.

@@ -87,7 +87,7 @@ To generate the JSON objects:

 5. Make sure the generated JSON is complete and parsable, with no errors or omissions.

-6. Make sur to escape any special characters in the HTML content, and also single or double quote to avoid JSON parsing issues.
+6. Make sure to escape any special characters in the HTML content, as well as single and double quotes, to avoid JSON parsing issues.

 7. Never alter the extracted content, just copy and paste it as it is.

@@ -142,7 +142,7 @@ To generate the JSON objects:

 5. Make sure the generated JSON is complete and parsable, with no errors or omissions.

-6. Make sur to escape any special characters in the HTML content, and also single or double quote to avoid JSON parsing issues.
+6. Make sure to escape any special characters in the HTML content, as well as single and double quotes, to avoid JSON parsing issues.

 7. Never alter the extracted content, just copy and paste it as it is.

@@ -164,4 +164,41 @@ Please provide your output within <blocks> tags, like this:

 **Make sure to follow the user instruction to extract blocks aligin with the instruction.**

-Remember, the output should be a complete, parsable JSON wrapped in <blocks> tags, with no omissions or errors. The JSON objects should semantically break down the content into relevant blocks, maintaining the original order."""
+Remember, the output should be a complete, parsable JSON wrapped in <blocks> tags, with no omissions or errors. The JSON objects should semantically break down the content into relevant blocks, maintaining the original order."""
+
+PROMPT_EXTRACT_SCHEMA_WITH_INSTRUCTION = """Here is the content from the URL:
+<url>{URL}</url>
+
+<url_content>
+{HTML}
+</url_content>
+
+The user has made the following request for what information to extract from the above content:
+
+<user_request>
+{REQUEST}
+</user_request>
+
+<schema_block>
+{SCHEMA}
+</schema_block>
+
+Please carefully read the URL content and the user's request. If the user provided a desired JSON schema in the <schema_block> above, extract the requested information from the URL content according to that schema. If no schema was provided, infer an appropriate JSON schema based on the user's request that will best capture the key information they are looking for.
+
+Extraction instructions:
+Return the extracted information as a list of JSON objects, with each object in the list corresponding to a block of content from the URL, in the same order as it appears on the page. Wrap the entire JSON list in <blocks>...</blocks> XML tags.
+
+Quality Reflection:
+Before outputting your final answer, double check that the JSON you are returning is complete, containing all the information requested by the user, and is valid JSON that could be parsed by json.loads() with no errors or omissions. The outputted JSON objects should fully match the schema, either provided or inferred.
+
+Quality Score:
+After reflecting, score the quality and completeness of the JSON data you are about to return on a scale of 1 to 5. Write the score inside <score> tags.
+
+Avoid Common Mistakes:
+- Do NOT add any comments using "//" or "#" in the JSON output. It causes parsing errors.
+- Make sure the JSON is properly formatted with curly braces, square brackets, and commas in the right places.
+- Do not omit the closing </blocks> tag at the end of the JSON output.
+- Do not generate Python code showing how to do the task; your task is to extract the information and return it in JSON format.
+
+Result
+Output the final list of JSON objects, wrapped in <blocks>...</blocks> XML tags. Make sure to close the tag properly."""
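Note on `PROMPT_EXTRACT_SCHEMA_WITH_INSTRUCTION` above: the prompt asks the model to wrap a JSON list in <blocks> tags and to write a quality score inside <score> tags. How crawl4ai parses that response is not shown in this part of the diff, so the following is only a sketch of one way a caller might do it. `parse_llm_response` is a hypothetical helper name.

```python
import json
import re
from typing import Any, List, Optional, Tuple


def parse_llm_response(text: str) -> Tuple[List[Any], Optional[int]]:
    """Extract the JSON list from <blocks> and the integer from <score>."""
    blocks: List[Any] = []
    blocks_match = re.search(r"<blocks>(.*?)</blocks>", text, re.DOTALL)
    if blocks_match:
        # json.loads raises json.JSONDecodeError if the model emitted
        # invalid JSON; callers may want to catch that and retry.
        blocks = json.loads(blocks_match.group(1))

    score: Optional[int] = None
    score_match = re.search(r"<score>\s*(\d+)\s*</score>", text)
    if score_match:
        score = int(score_match.group(1))
    return blocks, score


if __name__ == "__main__":
    sample = '<score>4</score>\n<blocks>[{"title": "A"}]</blocks>'
    print(parse_llm_response(sample))
```

The non-greedy match with `re.DOTALL` lets the JSON span multiple lines while stopping at the first closing tag.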
--- a/crawl4ai/utils.py
+++ b/crawl4ai/utils.py
@@ -1,19 +1,63 @@
 import time
 from concurrent.futures import ThreadPoolExecutor, as_completed
 from bs4 import BeautifulSoup, Comment, element, Tag, NavigableString
-import html2text
 import json
 import html
 import re
 import os
-from html2text import HTML2Text
+import platform
+from .html2text import HTML2Text
 from .prompts import PROMPT_EXTRACT_BLOCKS
 from .config import *
 from pathlib import Path
+from typing import Dict, Any
+from urllib.parse import urljoin
+import requests
+from requests.exceptions import InvalidSchema

 class InvalidCSSSelectorError(Exception):
    pass

+def calculate_semaphore_count():
+    cpu_count = os.cpu_count() or 1  # os.cpu_count() can return None
+    memory_gb = get_system_memory() / (1024 ** 3)  # Convert to GB
+    base_count = max(1, cpu_count // 2)
+    memory_based_cap = max(1, int(memory_gb / 2))  # Assume 2GB per instance, at least one
+    return min(base_count, memory_based_cap)
+
+def get_system_memory():
+    system = platform.system()
+    if system == "Linux":
+        with open('/proc/meminfo', 'r') as mem:
+            for line in mem:
+                if line.startswith('MemTotal:'):
+                    return int(line.split()[1]) * 1024  # Convert KB to bytes
+    elif system == "Darwin":  # macOS
+        import subprocess
+        output = subprocess.check_output(['sysctl', '-n', 'hw.memsize']).decode('utf-8')
+        return int(output.strip())
+    elif system == "Windows":
+        import ctypes
+        kernel32 = ctypes.windll.kernel32
+        c_ulonglong = ctypes.c_ulonglong
+        class MEMORYSTATUSEX(ctypes.Structure):
+            _fields_ = [
+                ('dwLength', ctypes.c_ulong),
+                ('dwMemoryLoad', ctypes.c_ulong),
+                ('ullTotalPhys', c_ulonglong),
+                ('ullAvailPhys', c_ulonglong),
+                ('ullTotalPageFile', c_ulonglong),
+                ('ullAvailPageFile', c_ulonglong),
+                ('ullTotalVirtual', c_ulonglong),
+                ('ullAvailVirtual', c_ulonglong),
+                ('ullAvailExtendedVirtual', c_ulonglong),
+            ]
+        memoryStatus = MEMORYSTATUSEX()
+        memoryStatus.dwLength = ctypes.sizeof(MEMORYSTATUSEX)
+        kernel32.GlobalMemoryStatusEx(ctypes.byref(memoryStatus))
+        return memoryStatus.ullTotalPhys
+    else:
+        raise OSError("Unsupported operating system")

 def get_home_folder():
    home_folder = os.path.join(Path.home(), ".crawl4ai")
@@ -86,7 +130,7 @@ def split_and_parse_json_objects(json_string):
    return parsed_objects, unparsed_segments

 def sanitize_html(html):
-    # Replace all weird and special characters with an empty string
+    # Replace all unwanted and special characters with an empty string
    sanitized_html = html
    # sanitized_html = re.sub(r'[^\w\s.,;:!?=\[\]{}()<>\/\\\-"]', '', html)

@@ -95,6 +139,16 @@ def sanitize_html(html):

    return sanitized_html

+def sanitize_input_encode(text: str) -> str:
+    """Sanitize input to handle potential encoding issues."""
+    try:
+        # Attempt to encode and decode as UTF-8 to handle potential encoding issues
+        return text.encode('utf-8', errors='ignore').decode('utf-8')
+    except UnicodeEncodeError as e:
+        print(f"Warning: Encoding issue detected. Some characters may be lost. Error: {e}")
+        # Fall back to ASCII if UTF-8 fails
+        return text.encode('ascii', errors='ignore').decode('ascii')
+
 def escape_json_string(s):
    """
    Escapes characters in a string to be JSON safe.
@@ -124,12 +178,25 @@ def escape_json_string(s):
    
    return s

-class CustomHTML2Text(HTML2Text):
+class CustomHTML2Text_v0(HTML2Text):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
-        self.ignore_links = True
        self.inside_pre = False
        self.inside_code = False
+        
+        self.skip_internal_links = False
+        self.single_line_break = False
+        self.mark_code = False
+        self.include_sup_sub = False
+        self.body_width = 0
+        self.ignore_mailto_links = True
+        self.ignore_links = False
+        self.escape_backslash = False
+        self.escape_dot = False
+        self.escape_plus = False
+        self.escape_dash = False
+        self.escape_snob = False
+

    def handle_tag(self, tag, attrs, start):
        if tag == 'pre':
@@ -139,6 +206,10 @@ class CustomHTML2Text(HTML2Text):
            else:
                self.o('\n```')
                self.inside_pre = False
+        elif tag in ["h1", "h2", "h3", "h4", "h5", "h6"]:
+            pass
+
+
        # elif tag == 'code' and not self.inside_pre:
        #     if start:
        #         if not self.inside_pre:
@@ -151,7 +222,51 @@ class CustomHTML2Text(HTML2Text):

        super().handle_tag(tag, attrs, start)

-def get_content_of_website(html, word_count_threshold = MIN_WORD_THRESHOLD, css_selector = None):
+def replace_inline_tags(soup, tags, only_text=False):
+    tag_replacements = {
+        'b': lambda tag: f"**{tag.text}**",
+        'i': lambda tag: f"*{tag.text}*",
+        'u': lambda tag: f"__{tag.text}__",
+        'span': lambda tag: f"{tag.text}",
+        'del': lambda tag: f"~~{tag.text}~~",
+        'ins': lambda tag: f"++{tag.text}++",
+        'sub': lambda tag: f"~{tag.text}~",
+        'sup': lambda tag: f"^^{tag.text}^^",
+        'strong': lambda tag: f"**{tag.text}**",
+        'em': lambda tag: f"*{tag.text}*",
+        'code': lambda tag: f"`{tag.text}`",
+        'kbd': lambda tag: f"`{tag.text}`",
+        'var': lambda tag: f"_{tag.text}_",
+        's': lambda tag: f"~~{tag.text}~~",
+        'q': lambda tag: f'"{tag.text}"',
+        'abbr': lambda tag: f"{tag.text} ({tag.get('title', '')})",
+        'cite': lambda tag: f"_{tag.text}_",
+        'dfn': lambda tag: f"_{tag.text}_",
+        'time': lambda tag: f"{tag.text}",
+        'small': lambda tag: f"<small>{tag.text}</small>",
+        'mark': lambda tag: f"=={tag.text}=="
+    }
+    
+    replacement_data = [(tag, tag_replacements.get(tag, lambda t: t.text)) for tag in tags]
+
+    for tag_name, replacement_func in replacement_data:
+        for tag in soup.find_all(tag_name):
+            replacement_text = tag.text if only_text else replacement_func(tag)
+            tag.replace_with(replacement_text)
+
+    return soup    
+
+    # for tag_name in tags:
+    #     for tag in soup.find_all(tag_name):
+    #         if not only_text:
+    #             replacement_text = tag_replacements.get(tag_name, lambda t: t.text)(tag)
+    #             tag.replace_with(replacement_text)
+    #         else:
+    #             tag.replace_with(tag.text)
+
+    # return soup
+
+def get_content_of_website(url, html, word_count_threshold = MIN_WORD_THRESHOLD, css_selector = None, **kwargs):
    try:
        if not html:
            return None
@@ -170,6 +285,28 @@ def get_content_of_website(html, word_count_threshold = MIN_WORD_THRESHOLD, css_
            for el in selected_elements:
                div_tag.append(el)
            body = div_tag
+            
+        links = {
+            'internal': [],
+            'external': []
+        }
+        
+        # Extract all internal and external links
+        for a in body.find_all('a', href=True):
+            href = a['href']
+            url_base = url.split('/')[2]
+            if href.startswith('http') and url_base not in href:
+                links['external'].append({
+                    'href': href,
+                    'text': a.get_text()
+                })
+            else:
+                links['internal'].append(
+                    {
+                        'href': href,
+                        'text': a.get_text()
+                    }
+                )

        # Remove script, style, and other tags that don't carry useful content from body
        for tag in body.find_all(['script', 'style', 'link', 'meta', 'noscript']):
@@ -180,6 +317,35 @@ def get_content_of_website(html, word_count_threshold = MIN_WORD_THRESHOLD, css_
            if tag.name != 'img':
                tag.attrs = {}

+        # Extract all img tags into [{src: '', alt: ''}]
+        media = {
+            'images': [],
+            'videos': [],
+            'audios': []
+        }
+        for img in body.find_all('img'):
+            media['images'].append({
+                'src': img.get('src'),
+                'alt': img.get('alt'),
+                "type": "image"
+            })
+            
+        # Extract all video tags into [{src: '', alt: ''}]
+        for video in body.find_all('video'):
+            media['videos'].append({
+                'src': video.get('src'),
+                'alt': video.get('alt'),
+                "type": "video"
+            })
+            
+        # Extract all audio tags into [{src: '', alt: ''}]
+        for audio in body.find_all('audio'):
+            media['audios'].append({
+                'src': audio.get('src'),
+                'alt': audio.get('alt'),
+                "type": "audio"
+            })
+        
        # Replace images with their alt text or remove them if no alt text is available
        for img in body.find_all('img'):
            alt_text = img.get('alt')
@@ -189,7 +355,7 @@ def get_content_of_website(html, word_count_threshold = MIN_WORD_THRESHOLD, css_
                img.decompose()


-        # Create a function that replace content of all"pre" tage with its inner text
+        # Create a function that replaces the content of all "pre" tags with their inner text
        def replace_pre_tags_with_text(node):
            for child in node.find_all('pre'):
                # set child inner html to its text
@@ -198,6 +364,13 @@ def get_content_of_website(html, word_count_threshold = MIN_WORD_THRESHOLD, css_
        
        # Replace all "pre" tags with their inner text
        body = replace_pre_tags_with_text(body)
+        
+        # Replace inline tags with their text content
+        body = replace_inline_tags(
+            body, 
+            ['b', 'i', 'u', 'span', 'del', 'ins', 'sub', 'sup', 'strong', 'em', 'code', 'kbd', 'var', 's', 'q', 'abbr', 'cite', 'dfn', 'time', 'small', 'mark'],
+            only_text=kwargs.get('only_text', False)
+        )

        # Recursively remove empty elements, their parent elements, and elements with word count below threshold
        def remove_empty_and_low_word_count_elements(node, word_count_threshold):
@@ -295,17 +468,311 @@ def get_content_of_website(html, word_count_threshold = MIN_WORD_THRESHOLD, css_
        markdown = h.handle(cleaned_html)
        markdown = markdown.replace('    ```', '```')
            
+        try:
+            meta = extract_metadata(html, soup)
+        except Exception as e:
+            print('Error extracting metadata:', str(e))
+            meta = {}
+                
+        
        # Return the Markdown content
        return{
            'markdown': markdown,
            'cleaned_html': cleaned_html,
-            'success': True
+            'success': True,
+            'media': media,
+            'links': links,
+            'metadata': meta
        }

    except Exception as e:
        print('Error processing HTML content:', str(e))
        raise InvalidCSSSelectorError(f"Invalid CSS selector: {css_selector}") from e

+def get_content_of_website_optimized(url: str, html: str, word_count_threshold: int = MIN_WORD_THRESHOLD, css_selector: str = None, **kwargs) -> Dict[str, Any]:
+    if not html:
+        return None
+
+    soup = BeautifulSoup(html, 'html.parser')
+    body = soup.body
+    
+    image_description_min_word_threshold = kwargs.get('image_description_min_word_threshold', IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD)
+
+    for tag in kwargs.get('excluded_tags', []) or []:
+        for el in body.select(tag):
+            el.decompose()
+        
+    if css_selector:
+        selected_elements = body.select(css_selector)
+        if not selected_elements:
+            raise InvalidCSSSelectorError(f"Invalid CSS selector, No elements found for CSS selector: {css_selector}")
+        body = soup.new_tag('div')
+        for el in selected_elements:
+            body.append(el)
+
+    links = {'internal': [], 'external': []}
+    media = {'images': [], 'videos': [], 'audios': []}
+
+    # Extract meaningful text for media files from the closest parent
+    def find_closest_parent_with_useful_text(tag):
+        current_tag = tag
+        while current_tag:
+            current_tag = current_tag.parent
+            # Get the text content from the parent tag
+            if current_tag:
+                text_content = current_tag.get_text(separator=' ', strip=True)
+                # Check if the text has at least image_description_min_word_threshold words
+                if len(text_content.split()) >= image_description_min_word_threshold:
+                    return text_content
+        return None
+
+    def process_image(img, url, index, total_images):
+        # Check that an image is displayed and is not inside undesired HTML elements
+        def is_valid_image(img, parent, parent_classes):
+            style = img.get('style', '')
+            src = img.get('src', '')
+            classes_to_check = ['button', 'icon', 'logo']
+            tags_to_check = ['button', 'input']
+            return all([
+                'display:none' not in style,
+                src,
+                not any(s in var for var in [src, img.get('alt', ''), *parent_classes] for s in classes_to_check),
+                parent.name not in tags_to_check
+            ])
+
+        # Score an image for its usefulness
+        def score_image_for_usefulness(img, base_url, index, images_count):
+            # Function to parse image height/width value and units
+            def parse_dimension(dimension):
+                if dimension:
+                    match = re.match(r"(\d+)(\D*)", dimension)
+                    if match:
+                        number = int(match.group(1))
+                        unit = match.group(2) or 'px'  # Default unit is 'px' if not specified
+                        return number, unit
+                return None, None
+
+            # Fetch image file metadata to extract size and extension
+            def fetch_image_file_size(img, base_url):
+                #If src is relative path construct full URL, if not it may be CDN URL
+                img_url = urljoin(base_url,img.get('src'))
+                try:
+                    response = requests.head(img_url)
+                    if response.status_code == 200:
+                        return response.headers.get('Content-Length',None)
+                    else:
+                        print(f"Failed to retrieve file size for {img_url}")
+                        return None
+                except InvalidSchema:
+                    # A bare `return` in a `finally` block would override the Content-Length returned above
+                    return None
+
+            image_height = img.get('height')
+            height_value, height_unit = parse_dimension(image_height)
+            image_width =  img.get('width')
+            width_value, width_unit = parse_dimension(image_width)
+            image_size = 0 #int(fetch_image_file_size(img,base_url) or 0)
+            image_format = os.path.splitext(img.get('src',''))[1].lower()
+            # Remove . from format
+            image_format = image_format.strip('.')
+            score = 0
+            if height_value:
+                if height_unit == 'px' and height_value > 150:
+                    score += 1
+                if height_unit in ['%','vh','vmin','vmax'] and height_value >30:
+                    score += 1
+            if width_value:
+                if width_unit == 'px' and width_value > 150:
+                    score += 1
+                if width_unit in ['%','vh','vmin','vmax'] and width_value >30:
+                    score += 1
+            if image_size > 10000:
+                score += 1
+            if img.get('alt'):  # a missing alt attribute (None) must not count as descriptive text
+                score+=1
+            if any(image_format==format for format in ['jpg','png','webp']):
+                score+=1
+            if index/images_count<0.5:
+                score+=1
+            return score
+
+        if not is_valid_image(img, img.parent, img.parent.get('class', [])):
+            return None
+        score = score_image_for_usefulness(img, url, index, total_images)
+        if score <= IMAGE_SCORE_THRESHOLD:
+            return None
+        return {
+            'src': img.get('src', '').replace('\\"', '"').strip(),
+            'alt': img.get('alt', ''),
+            'desc': find_closest_parent_with_useful_text(img),
+            'score': score,
+            'type': 'image'
+        }
+
+    def process_element(element: element.PageElement) -> bool:
+        try:
+            if isinstance(element, NavigableString):
+                if isinstance(element, Comment):
+                    element.extract()
+                return False
+
+            if element.name in ['script', 'style', 'link', 'meta', 'noscript']:
+                element.decompose()
+                return False
+
+            keep_element = False
+
+            if element.name == 'a' and element.get('href'):
+                href = element['href']
+                url_base = url.split('/')[2]
+                link_data = {'href': href, 'text': element.get_text()}
+                if href.startswith('http') and url_base not in href:
+                    links['external'].append(link_data)
+                else:
+                    links['internal'].append(link_data)
+                keep_element = True
+
+            elif element.name == 'img':
+                return True  # Always keep image elements
+
+            elif element.name in ['video', 'audio']:
+                media[f"{element.name}s"].append({
+                    'src': element.get('src'),
+                    'alt': element.get('alt'),
+                    'type': element.name,
+                    'description': find_closest_parent_with_useful_text(element)
+                })
+                source_tags = element.find_all('source')
+                for source_tag in source_tags:
+                    media[f"{element.name}s"].append({
+                        'src': source_tag.get('src'),
+                        'alt': element.get('alt'),
+                        'type': element.name,
+                        'description': find_closest_parent_with_useful_text(element)
+                    })
+                return True  # Always keep video and audio elements
+
+            if element.name != 'pre':
+                if element.name in ['b', 'i', 'u', 'span', 'del', 'ins', 'sub', 'sup', 'strong', 'em', 'code', 'kbd', 'var', 's', 'q', 'abbr', 'cite', 'dfn', 'time', 'small', 'mark']:
+                    if kwargs.get('only_text', False):
+                        element.replace_with(element.get_text())
+                    else:
+                        element.unwrap()
+                elif element.name != 'img':
+                    element.attrs = {}
+
+            # Process children
+            for child in list(element.children):
+                if isinstance(child, NavigableString) and not isinstance(child, Comment):
+                    if len(child.strip()) > 0:
+                        keep_element = True
+                else:
+                    if process_element(child):
+                        keep_element = True
+                
+
+            # Check word count
+            if not keep_element:
+                word_count = len(element.get_text(strip=True).split())
+                keep_element = word_count >= word_count_threshold
+
+            if not keep_element:
+                element.decompose()
+
+            return keep_element
+        except Exception as e:
+            print('Error processing element:', str(e))
+            return False
+
+    #process images by filtering and extracting contextual text from the page
+    imgs = body.find_all('img')
+    media['images'] = [
+        result for result in
+        (process_image(img, url, i, len(imgs)) for i, img in enumerate(imgs))
+        if result is not None
+    ]
+
+    process_element(body)
+
+    def flatten_nested_elements(node):
+        if isinstance(node, NavigableString):
+            return node
+        if len(node.contents) == 1 and isinstance(node.contents[0], element.Tag) and node.contents[0].name == node.name:
+            return flatten_nested_elements(node.contents[0])
+        node.contents = [flatten_nested_elements(child) for child in node.contents]
+        return node
+
+    body = flatten_nested_elements(body)
+    base64_pattern = re.compile(r'data:image/[^;]+;base64,([^"]+)')
+    for img in imgs:
+        src = img.get('src', '')
+        if base64_pattern.match(src):
+            img['src'] = base64_pattern.sub('', src)
+
+    cleaned_html = str(body).replace('\n\n', '\n').replace('  ', ' ')
+    cleaned_html = sanitize_html(cleaned_html)
+
+    h = CustomHTML2Text()
+    h.ignore_links = True
+    markdown = h.handle(cleaned_html)
+    markdown = markdown.replace('    ```', '```')
+
+    try:
+        meta = extract_metadata(html, soup)
+    except Exception as e:
+        print('Error extracting metadata:', str(e))
+        meta = {}
+
+    return {
+        'markdown': markdown,
+        'cleaned_html': cleaned_html,
+        'success': True,
+        'media': media,
+        'links': links,
+        'metadata': meta
+    }
+
+def extract_metadata(html, soup = None):
+    metadata = {}
+    
+    if not html:
+        return metadata
+    
+    # Parse HTML content with BeautifulSoup
+    if not soup:
+        soup = BeautifulSoup(html, 'html.parser')
+
+    # Title
+    title_tag = soup.find('title')
+    metadata['title'] = title_tag.string if title_tag else None
+
+    # Meta description
+    description_tag = soup.find('meta', attrs={'name': 'description'})
+    metadata['description'] = description_tag['content'] if description_tag else None
+
+    # Meta keywords
+    keywords_tag = soup.find('meta', attrs={'name': 'keywords'})
+    metadata['keywords'] = keywords_tag['content'] if keywords_tag else None
+
+    # Meta author
+    author_tag = soup.find('meta', attrs={'name': 'author'})
+    metadata['author'] = author_tag['content'] if author_tag else None
+
+    # Open Graph metadata
+    og_tags = soup.find_all('meta', attrs={'property': lambda value: value and value.startswith('og:')})
+    for tag in og_tags:
+        property_name = tag['property']
+        metadata[property_name] = tag['content']
+
+    # Twitter Card metadata
+    twitter_tags = soup.find_all('meta', attrs={'name': lambda value: value and value.startswith('twitter:')})
+    for tag in twitter_tags:
+        property_name = tag['name']
+        metadata[property_name] = tag['content']
+
+    return metadata
+
 def extract_xml_tags(string):
    tags = re.findall(r'<(\w+)>', string)
    return list(set(tags))
@@ -324,12 +791,26 @@ def extract_xml_data(tags, string):
    return data
    
 # Function to perform the completion with exponential backoff
-def perform_completion_with_backoff(provider, prompt_with_variables, api_token):
+def perform_completion_with_backoff(
+    provider, 
+    prompt_with_variables, 
+    api_token, 
+    json_response = False, 
+    base_url=None,
+    **kwargs
+    ):
    from litellm import completion 
    from litellm.exceptions import RateLimitError
    max_attempts = 3
    base_delay = 2  # Base delay in seconds, you can adjust this based on your needs
    
+    extra_args = {}
+    if json_response:
+        extra_args["response_format"] = { "type": "json_object" }
+        
+    if kwargs.get("extra_args"):
+        extra_args.update(kwargs["extra_args"])
+    
    for attempt in range(max_attempts):
        try:
            response =completion(
@@ -338,7 +819,9 @@ def perform_completion_with_backoff(provider, prompt_with_variables, api_token):
                    {"role": "user", "content": prompt_with_variables}
                ],
                temperature=0.01,
-                api_key=api_token
+                api_key=api_token,
+                base_url=base_url,
+                **extra_args
            )
            return response  # Return the successful response
        except RateLimitError as e:
@@ -358,7 +841,7 @@ def perform_completion_with_backoff(provider, prompt_with_variables, api_token):
                    "content": ["Rate limit error. Please try again later."]
                }]
    
-def extract_blocks(url, html, provider = DEFAULT_PROVIDER, api_token = None):
+def extract_blocks(url, html, provider = DEFAULT_PROVIDER, api_token = None, base_url = None):
    # api_token = os.getenv('GROQ_API_KEY', None) if not api_token else api_token
    api_token = PROVIDER_MODELS.get(provider, None) if not api_token else api_token
    
@@ -373,7 +856,7 @@ def extract_blocks(url, html, provider = DEFAULT_PROVIDER, api_token = None):
            "{" + variable + "}", variable_values[variable]
        )
        
-    response = perform_completion_with_backoff(provider, prompt_with_variables, api_token)
+    response = perform_completion_with_backoff(provider, prompt_with_variables, api_token, base_url=base_url)
        
    try:
        blocks = extract_xml_data(["blocks"], response.choices[0].message.content)['blocks']
@@ -382,7 +865,6 @@ def extract_blocks(url, html, provider = DEFAULT_PROVIDER, api_token = None):
        for block in blocks:
            block['error'] = False
    except Exception as e:
-        print("Error extracting blocks:", str(e))
        parsed, unparsed = split_and_parse_json_objects(response.choices[0].message.content)
        blocks = parsed
        # Append all unparsed segments as one error block whose content is the list of unparsed segments
@@ -428,7 +910,6 @@ def extract_blocks_batch(batch_data, provider = "groq/llama3-70b-8192", api_toke
            blocks = json.loads(blocks)

        except Exception as e:
-            print("Error extracting blocks:", str(e))
            blocks = [{
                "index": 0,
                "tags": ["error"],
@@ -439,7 +920,6 @@ def extract_blocks_batch(batch_data, provider = "groq/llama3-70b-8192", api_toke
    
    return sum(all_blocks, [])

-
 def merge_chunks_based_on_token_threshold(chunks, token_threshold):
    """
    Merges small chunks into larger ones based on the total token threshold.
@@ -469,18 +949,97 @@ def merge_chunks_based_on_token_threshold(chunks, token_threshold):

    return merged_sections

-def process_sections(url: str, sections: list, provider: str, api_token: str) -> list:
+def process_sections(url: str, sections: list, provider: str, api_token: str, base_url=None) -> list:
    extracted_content = []
    if provider.startswith("groq/"):
        # Sequential processing with a delay
        for section in sections:
-            extracted_content.extend(extract_blocks(url, section, provider, api_token))
+            extracted_content.extend(extract_blocks(url, section, provider, api_token, base_url=base_url))
            time.sleep(0.5)  # 500 ms delay between each processing
    else:
        # Parallel processing using ThreadPoolExecutor
        with ThreadPoolExecutor() as executor:
-            futures = [executor.submit(extract_blocks, url, section, provider, api_token) for section in sections]
+            futures = [executor.submit(extract_blocks, url, section, provider, api_token, base_url=base_url) for section in sections]
            for future in as_completed(futures):
                extracted_content.extend(future.result())
    
-    return extracted_content
+    return extracted_content
+
+def wrap_text(draw, text, font, max_width):
+    # Wrap the text to fit within the specified width
+    lines = []
+    words = text.split()
+    while words:
+        line = ''
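+        while words and draw.textbbox((0, 0), line + words[0], font=font)[2] <= max_width:
+            line += (words.pop(0) + ' ')
+        if not line:
+            # A single word wider than max_width would never be consumed; emit it on its own line
+            line = words.pop(0) + ' '
+        lines.append(line)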
+    return '\n'.join(lines)
+
+def format_html(html_string):
+    soup = BeautifulSoup(html_string, 'html.parser')
+    return soup.prettify()
+
+def normalize_url(href, base_url):
+    """Normalize URLs to ensure consistent format"""
+    from urllib.parse import urljoin, urlparse
+
+    # Parse base URL to get components
+    parsed_base = urlparse(base_url)
+    if not parsed_base.scheme or not parsed_base.netloc:
+        raise ValueError(f"Invalid base URL format: {base_url}")
+
+    # Use urljoin to handle all cases
+    normalized = urljoin(base_url, href.strip())
+    return normalized
+
+def normalize_url_tmp(href, base_url):
+    """Normalize URLs to ensure consistent format"""
+    # Extract protocol and domain from base URL
+    try:
+        base_parts = base_url.split('/')
+        protocol = base_parts[0]
+        domain = base_parts[2]
+    except IndexError:
+        raise ValueError(f"Invalid base URL format: {base_url}")
+    
+    # Handle special protocols
+    special_protocols = {'mailto:', 'tel:', 'ftp:', 'file:', 'data:', 'javascript:'}
+    if any(href.lower().startswith(proto) for proto in special_protocols):
+        return href.strip()
+        
+    # Handle anchor links
+    if href.startswith('#'):
+        return f"{base_url}{href}"
+        
+    # Handle protocol-relative URLs
+    if href.startswith('//'):
+        return f"{protocol}{href}"
+        
+    # Handle root-relative URLs
+    if href.startswith('/'):
+        return f"{protocol}//{domain}{href}"
+        
+    # Handle relative URLs
+    if not href.startswith(('http://', 'https://')):
+        # Remove a single leading './' if present; lstrip('./') would also strip '../' and leading dots
+        if href.startswith('./'):
+            href = href[2:]
+        return f"{protocol}//{domain}/{href}"
+        
+    return href.strip()
+
+def is_external_url(url, base_domain):
+    """Determine if a URL is external"""
+    special_protocols = {'mailto:', 'tel:', 'ftp:', 'file:', 'data:', 'javascript:'}
+    if any(url.lower().startswith(proto) for proto in special_protocols):
+        return True
+        
+    try:
+        # Handle URLs with protocol
+        if url.startswith(('http://', 'https://')):
+            url_domain = url.split('/')[2]
+            return base_domain.lower() not in url_domain.lower()
+    except IndexError:
+        return False
+        
+    return False
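
The URL helpers above can be illustrated with a short standalone sketch. It assumes only the standard library: `normalize_url_sketch` mirrors the `urljoin`-based `normalize_url`, while `is_external_by_host` is a hypothetical variant (not part of this patch) that compares parsed hostnames instead of using the substring test in `is_external_url`, so that `notexample.com` is not mistaken for `example.com`.

```python
from urllib.parse import urljoin, urlparse


def normalize_url_sketch(href: str, base_url: str) -> str:
    """Resolve href against base_url, as normalize_url does via urljoin."""
    parsed_base = urlparse(base_url)
    if not parsed_base.scheme or not parsed_base.netloc:
        raise ValueError(f"Invalid base URL format: {base_url}")
    return urljoin(base_url, href.strip())


def is_external_by_host(url: str, base_domain: str) -> bool:
    """Hypothetical helper: external means the host is neither the base
    domain nor one of its subdomains. URLs without a host (relative
    paths) are treated as internal."""
    host = (urlparse(url).hostname or "").lower()
    base = base_domain.lower()
    if not host:
        return False
    return not (host == base or host.endswith("." + base))


base = "https://example.com/docs/page.html"
print(normalize_url_sketch("../img/a.png", base))
print(normalize_url_sketch("#top", base))
print(is_external_by_host("https://notexample.com/x", "example.com"))
```

Whether subdomains should count as internal is a design choice; the sketch treats them as internal, which may not match every crawler's needs.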
--- a/crawl4ai/web_crawler.back.py
+++ b/crawl4ai/web_crawler.back.py
@@ -0,0 +1,357 @@
+import os, time
+os.environ["TOKENIZERS_PARALLELISM"] = "false"
+from pathlib import Path
+
+from .models import UrlModel, CrawlResult
+from .database import init_db, get_cached_url, cache_url, DB_PATH, flush_db
+from .utils import *
+from .chunking_strategy import *
+from .extraction_strategy import *
+from .crawler_strategy import *
+from typing import List
+from concurrent.futures import ThreadPoolExecutor
+from .config import *
+
+
+class WebCrawler:
+    def __init__(
+        self,
+        # db_path: str = None,
+        crawler_strategy: CrawlerStrategy = None,
+        always_by_pass_cache: bool = False,
+        verbose: bool = False,
+    ):
+        # self.db_path = db_path
+        self.crawler_strategy = crawler_strategy or LocalSeleniumCrawlerStrategy(verbose=verbose)
+        self.always_by_pass_cache = always_by_pass_cache
+
+        # Create the .crawl4ai folder in the user's home directory if it doesn't exist
+        self.crawl4ai_folder = os.path.join(Path.home(), ".crawl4ai")
+        os.makedirs(self.crawl4ai_folder, exist_ok=True)
+        os.makedirs(f"{self.crawl4ai_folder}/cache", exist_ok=True)
+
+        # If db_path is not provided, use the default path
+        # if not db_path:
+            # self.db_path = f"{self.crawl4ai_folder}/crawl4ai.db"
+        
+        # flush_db()
+        init_db()
+        
+        self.ready = False
+        
+    def warmup(self):
+        print("[LOG] 🌤️  Warming up the WebCrawler")
+        result = self.run(
+            url='https://crawl4ai.uccode.io/',
+            word_count_threshold=5,
+            extraction_strategy= NoExtractionStrategy(),
+            bypass_cache=False,
+            verbose = False
+        )
+        self.ready = True
+        print("[LOG] 🌞 WebCrawler is ready to crawl")
+        
+    def fetch_page(
+        self,
+        url_model: UrlModel,
+        provider: str = DEFAULT_PROVIDER,
+        api_token: str = None,
+        extract_blocks_flag: bool = True,
+        word_count_threshold=MIN_WORD_THRESHOLD,
+        css_selector: str = None,
+        screenshot: bool = False,
+        use_cached_html: bool = False,
+        extraction_strategy: ExtractionStrategy = None,
+        chunking_strategy: ChunkingStrategy = RegexChunking(),
+        **kwargs,
+    ) -> CrawlResult:
+        return self.run(
+            url_model.url,
+            word_count_threshold,
+            extraction_strategy or NoExtractionStrategy(),
+            chunking_strategy,
+            bypass_cache=url_model.forced,
+            css_selector=css_selector,
+            screenshot=screenshot,
+            **kwargs,
+        )
+        pass
+
+    def run_old(
+        self,
+        url: str,
+        word_count_threshold=MIN_WORD_THRESHOLD,
+        extraction_strategy: ExtractionStrategy = None,
+        chunking_strategy: ChunkingStrategy = RegexChunking(),
+        bypass_cache: bool = False,
+        css_selector: str = None,
+        screenshot: bool = False,
+        user_agent: str = None,
+        verbose=True,
+        **kwargs,
+    ) -> CrawlResult:
+        if user_agent:
+            self.crawler_strategy.update_user_agent(user_agent)
+        extraction_strategy = extraction_strategy or NoExtractionStrategy()
+        extraction_strategy.verbose = verbose
+        # Check if extraction strategy is an instance of ExtractionStrategy if not raise an error
+        if not isinstance(extraction_strategy, ExtractionStrategy):
+            raise ValueError("Unsupported extraction strategy")
+        if not isinstance(chunking_strategy, ChunkingStrategy):
+            raise ValueError("Unsupported chunking strategy")
+        
+        # Make sure word_count_threshold is not less than MIN_WORD_THRESHOLD
+        if word_count_threshold < MIN_WORD_THRESHOLD:
+            word_count_threshold = MIN_WORD_THRESHOLD
+
+        # Check cache first
+        if not bypass_cache and not self.always_by_pass_cache:
+            cached = get_cached_url(url)
+            if cached:
+                return CrawlResult(
+                    **{
+                        "url": cached[0],
+                        "html": cached[1],
+                        "cleaned_html": cached[2],
+                        "markdown": cached[3],
+                        "extracted_content": cached[4],
+                        "success": cached[5],
+                        "media": json.loads(cached[6] or "{}"),
+                        "links": json.loads(cached[7] or "{}"),
+                        "metadata": json.loads(cached[8] or "{}"),
+                        "screenshot": cached[9],
+                        "error_message": "",
+                    }
+                )
+
+        # Initialize WebDriver for crawling
+        t = time.time()
+        if kwargs.get("js", None):
+            self.crawler_strategy.js_code = kwargs.get("js")
+        html = self.crawler_strategy.crawl(url)
+        base64_image = None
+        if screenshot:
+            base64_image = self.crawler_strategy.take_screenshot()
+        success = True
+        error_message = ""
+        # Extract content from HTML
+        try:
+            result = get_content_of_website(url, html, word_count_threshold, css_selector=css_selector)
+            metadata = extract_metadata(html)
+            if result is None:
+                raise ValueError(f"Failed to extract content from the website: {url}")
+        except InvalidCSSSelectorError as e:
+            raise ValueError(str(e))
+        
+        cleaned_html = result.get("cleaned_html", "")
+        markdown = result.get("markdown", "")
+        media = result.get("media", [])
+        links = result.get("links", [])
+
+        # Print a professional log-style message showing the time taken and that crawling is done
+        if verbose:
+            print(
+                f"[LOG] 🚀 Crawling done for {url}, success: {success}, time taken: {time.time() - t} seconds"
+            )
+
+        extracted_content = []
+        if verbose:
+            print(f"[LOG] 🔥 Extracting semantic blocks for {url}, Strategy: {extraction_strategy.name}")
+        t = time.time()
+        # Split markdown into sections
+        sections = chunking_strategy.chunk(markdown)
+        # sections = merge_chunks_based_on_token_threshold(sections, CHUNK_TOKEN_THRESHOLD)
+
+        extracted_content = extraction_strategy.run(
+            url, sections,
+        )
+        extracted_content = json.dumps(extracted_content)
+
+        if verbose:
+            print(
+                f"[LOG] 🚀 Extraction done for {url}, time taken: {time.time() - t} seconds."
+            )
+
+        # Cache the result
+        cleaned_html = beautify_html(cleaned_html)
+        cache_url(
+            url,
+            html,
+            cleaned_html,
+            markdown,
+            extracted_content,
+            success,
+            json.dumps(media),
+            json.dumps(links),
+            json.dumps(metadata),
+            screenshot=base64_image,
+        )
+
+        return CrawlResult(
+            url=url,
+            html=html,
+            cleaned_html=cleaned_html,
+            markdown=markdown,
+            media=media,
+            links=links,
+            metadata=metadata,
+            screenshot=base64_image,
+            extracted_content=extracted_content,
+            success=success,
+            error_message=error_message,
+        )
+
+    def fetch_pages(
+        self,
+        url_models: List[UrlModel],
+        provider: str = DEFAULT_PROVIDER,
+        api_token: str = None,
+        extract_blocks_flag: bool = True,
+        word_count_threshold=MIN_WORD_THRESHOLD,
+        use_cached_html: bool = False,
+        css_selector: str = None,
+        screenshot: bool = False,
+        extraction_strategy: ExtractionStrategy = None,
+        chunking_strategy: ChunkingStrategy = RegexChunking(),
+        **kwargs,
+    ) -> List[CrawlResult]:
+        extraction_strategy = extraction_strategy or NoExtractionStrategy()
+        def fetch_page_wrapper(url_model, *args, **kwargs):
+            return self.fetch_page(url_model, *args, **kwargs)
+
+        with ThreadPoolExecutor() as executor:
+            results = list(
+                executor.map(
+                    fetch_page_wrapper,
+                    url_models,
+                    [provider] * len(url_models),
+                    [api_token] * len(url_models),
+                    [extract_blocks_flag] * len(url_models),
+                    [word_count_threshold] * len(url_models),
+                    [css_selector] * len(url_models),
+                    [screenshot] * len(url_models),
+                    [use_cached_html] * len(url_models),
+                    [extraction_strategy] * len(url_models),
+                    [chunking_strategy] * len(url_models),
+                    *[kwargs] * len(url_models),
+                )
+            )
+
+        return results
+
+    def run(
+            self,
+            url: str,
+            word_count_threshold=MIN_WORD_THRESHOLD,
+            extraction_strategy: ExtractionStrategy = None,
+            chunking_strategy: ChunkingStrategy = RegexChunking(),
+            bypass_cache: bool = False,
+            css_selector: str = None,
+            screenshot: bool = False,
+            user_agent: str = None,
+            verbose=True,
+            **kwargs,
+        ) -> CrawlResult:
+            extraction_strategy = extraction_strategy or NoExtractionStrategy()
+            extraction_strategy.verbose = verbose
+            if not isinstance(extraction_strategy, ExtractionStrategy):
+                raise ValueError("Unsupported extraction strategy")
+            if not isinstance(chunking_strategy, ChunkingStrategy):
+                raise ValueError("Unsupported chunking strategy")
+            
+            if word_count_threshold < MIN_WORD_THRESHOLD:
+                word_count_threshold = MIN_WORD_THRESHOLD
+
+            # Check cache first
+            cached = None
+            extracted_content = None
+            if not bypass_cache and not self.always_by_pass_cache:
+                cached = get_cached_url(url)
+            
+            if cached:
+                html = cached[1]
+                extracted_content = cached[4]  # index 4 is extracted_content; index 2 is cleaned_html (see run_old)
+                if screenshot:
+                    screenshot = cached[9]
+            
+            else:
+                if user_agent:
+                    self.crawler_strategy.update_user_agent(user_agent)
+                html = self.crawler_strategy.crawl(url)
+                if screenshot:
+                    screenshot = self.crawler_strategy.take_screenshot()
+            
+            return self.process_html(url, html, extracted_content, word_count_threshold, extraction_strategy, chunking_strategy, css_selector, screenshot, verbose, bool(cached), **kwargs)
+
+    def process_html(
+            self,
+            url: str,
+            html: str,
+            extracted_content: str,
+            word_count_threshold: int,
+            extraction_strategy: ExtractionStrategy,
+            chunking_strategy: ChunkingStrategy,
+            css_selector: str,
+            screenshot: bool,
+            verbose: bool,
+            is_cached: bool,
+            **kwargs,
+        ) -> CrawlResult:
+            t = time.time()
+            # Extract content from HTML
+            try:
+                result = get_content_of_website(url, html, word_count_threshold, css_selector=css_selector)
+                metadata = extract_metadata(html)
+                if result is None:
+                    raise ValueError(f"Failed to extract content from the website: {url}")
+            except InvalidCSSSelectorError as e:
+                raise ValueError(str(e))
+            
+            cleaned_html = result.get("cleaned_html", "")
+            markdown = result.get("markdown", "")
+            media = result.get("media", [])
+            links = result.get("links", [])
+
+            if verbose:
+                print(f"[LOG] 🚀 Crawling done for {url}, success: True, time taken: {time.time() - t} seconds")
+                        
+            if extracted_content is None:
+                if verbose:
+                    print(f"[LOG] 🔥 Extracting semantic blocks for {url}, Strategy: {extraction_strategy.name}")
+
+                sections = chunking_strategy.chunk(markdown)
+                extracted_content = extraction_strategy.run(url, sections)
+                extracted_content = json.dumps(extracted_content)
+
+                if verbose:
+                    print(f"[LOG] 🚀 Extraction done for {url}, time taken: {time.time() - t} seconds.")
+                
+            screenshot = None if not screenshot else screenshot
+            
+            if not is_cached:
+                cache_url(
+                    url,
+                    html,
+                    cleaned_html,
+                    markdown,
+                    extracted_content,
+                    True,
+                    json.dumps(media),
+                    json.dumps(links),
+                    json.dumps(metadata),
+                    screenshot=screenshot,
+                )                
+
+            return CrawlResult(
+                url=url,
+                html=html,
+                cleaned_html=cleaned_html,
+                markdown=markdown,
+                media=media,
+                links=links,
+                metadata=metadata,
+                screenshot=screenshot,
+                extracted_content=extracted_content,
+                success=True,
+                error_message="",
+            )
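
`run_old` reads the cached row positionally (`cached[0]` through `cached[9]`), which makes index mix-ups easy. A minimal sketch of one way to name those positions follows; `CachedRow` and `cached_row_to_kwargs` are hypothetical names introduced for illustration, and the field order is taken from the `CrawlResult(**{...})` mapping in `run_old`, not from the database schema itself.

```python
import json
from typing import NamedTuple, Optional


class CachedRow(NamedTuple):
    """Hypothetical named view of the tuple returned by get_cached_url."""
    url: str
    html: str
    cleaned_html: str
    markdown: str
    extracted_content: str
    success: bool
    media: Optional[str]      # JSON string or None
    links: Optional[str]      # JSON string or None
    metadata: Optional[str]   # JSON string or None
    screenshot: Optional[str]


def cached_row_to_kwargs(cached: tuple) -> dict:
    """Build the keyword arguments that run_old passes to CrawlResult."""
    row = CachedRow(*cached)
    return {
        "url": row.url,
        "html": row.html,
        "cleaned_html": row.cleaned_html,
        "markdown": row.markdown,
        "extracted_content": row.extracted_content,
        "success": row.success,
        "media": json.loads(row.media or "{}"),
        "links": json.loads(row.links or "{}"),
        "metadata": json.loads(row.metadata or "{}"),
        "screenshot": row.screenshot,
        "error_message": "",
    }
```

With named fields, a lookup such as `row.extracted_content` cannot silently pick up the neighbouring `cleaned_html` column.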
--- a/crawl4ai/web_crawler.py
+++ b/crawl4ai/web_crawler.py
@@ -11,46 +11,33 @@ from .crawler_strategy import *
 from typing import List
 from concurrent.futures import ThreadPoolExecutor
 from .config import *
+import warnings
+import json
+warnings.filterwarnings("ignore", message='Field "model_name" has conflict with protected namespace "model_".')


 class WebCrawler:
-    def __init__(
-        self,
-        # db_path: str = None,
-        crawler_strategy: CrawlerStrategy = None,
-        always_by_pass_cache: bool = False,
-    ):
-        # self.db_path = db_path
-        self.crawler_strategy = crawler_strategy or LocalSeleniumCrawlerStrategy()
+    def __init__(self, crawler_strategy: CrawlerStrategy = None, always_by_pass_cache: bool = False, verbose: bool = False):
+        self.crawler_strategy = crawler_strategy or LocalSeleniumCrawlerStrategy(verbose=verbose)
        self.always_by_pass_cache = always_by_pass_cache
-
-        # Create the .crawl4ai folder in the user's home directory if it doesn't exist
        self.crawl4ai_folder = os.path.join(Path.home(), ".crawl4ai")
        os.makedirs(self.crawl4ai_folder, exist_ok=True)
        os.makedirs(f"{self.crawl4ai_folder}/cache", exist_ok=True)
-
-        # If db_path is not provided, use the default path
-        # if not db_path:
-            # self.db_path = f"{self.crawl4ai_folder}/crawl4ai.db"
-        
-        # flush_db()
        init_db()
-        
        self.ready = False
        
    def warmup(self):
        print("[LOG] 🌤️  Warming up the WebCrawler")
-        result = self.run(
-            url='https://crawl4ai.uccode.io/',
+        self.run(
+            url='https://google.com/',
            word_count_threshold=5,
-            extraction_strategy= NoExtractionStrategy(),
+            extraction_strategy=NoExtractionStrategy(),
            bypass_cache=False,
-            verbose = False
+            verbose=False
        )
        self.ready = True
        print("[LOG] 🌞 WebCrawler is ready to crawl")
        
-
    def fetch_page(
        self,
        url_model: UrlModel,
@@ -58,6 +45,8 @@ class WebCrawler:
        api_token: str = None,
        extract_blocks_flag: bool = True,
        word_count_threshold=MIN_WORD_THRESHOLD,
+        css_selector: str = None,
+        screenshot: bool = False,
        use_cached_html: bool = False,
        extraction_strategy: ExtractionStrategy = None,
        chunking_strategy: ChunkingStrategy = RegexChunking(),
@@ -69,111 +58,12 @@ class WebCrawler:
            extraction_strategy or NoExtractionStrategy(),
            chunking_strategy,
            bypass_cache=url_model.forced,
+            css_selector=css_selector,
+            screenshot=screenshot,
            **kwargs,
        )
        pass

-
-    def run(
-        self,
-        url: str,
-        word_count_threshold=MIN_WORD_THRESHOLD,
-        extraction_strategy: ExtractionStrategy = None,
-        chunking_strategy: ChunkingStrategy = RegexChunking(),
-        bypass_cache: bool = False,
-        css_selector: str = None,
-        verbose=True,
-        **kwargs,
-    ) -> CrawlResult:
-        extraction_strategy = extraction_strategy or NoExtractionStrategy()
-        extraction_strategy.verbose = verbose
-        # Check if extraction strategy is an instance of ExtractionStrategy if not raise an error
-        if not isinstance(extraction_strategy, ExtractionStrategy):
-            raise ValueError("Unsupported extraction strategy")
-        if not isinstance(chunking_strategy, ChunkingStrategy):
-            raise ValueError("Unsupported chunking strategy")
-        
-        # make sure word_count_threshold is not lesser than MIN_WORD_THRESHOLD
-        if word_count_threshold < MIN_WORD_THRESHOLD:
-            word_count_threshold = MIN_WORD_THRESHOLD
-
-        # Check cache first
-        if not bypass_cache and not self.always_by_pass_cache:
-            cached = get_cached_url(url)
-            if cached:
-                return CrawlResult(
-                    **{
-                        "url": cached[0],
-                        "html": cached[1],
-                        "cleaned_html": cached[2],
-                        "markdown": cached[3],
-                        "extracted_content": cached[4],
-                        "success": cached[5],
-                        "error_message": "",
-                    }
-                )
-
-        # Initialize WebDriver for crawling
-        t = time.time()
-        html = self.crawler_strategy.crawl(url)
-        success = True
-        error_message = ""
-        # Extract content from HTML
-        try:
-            result = get_content_of_website(html, word_count_threshold, css_selector=css_selector)
-            if result is None:
-                raise ValueError(f"Failed to extract content from the website: {url}")
-        except InvalidCSSSelectorError as e:
-            raise ValueError(str(e))
-        
-        cleaned_html = result.get("cleaned_html", html)
-        markdown = result.get("markdown", "")
-
-        # Print a profession LOG style message, show time taken and say crawling is done
-        if verbose:
-            print(
-                f"[LOG] 🚀 Crawling done for {url}, success: {success}, time taken: {time.time() - t} seconds"
-            )
-
-        extracted_content = []
-        if verbose:
-            print(f"[LOG] 🔥 Extracting semantic blocks for {url}, Strategy: {extraction_strategy.name}")
-        t = time.time()
-        # Split markdown into sections
-        sections = chunking_strategy.chunk(markdown)
-        # sections = merge_chunks_based_on_token_threshold(sections, CHUNK_TOKEN_THRESHOLD)
-
-        extracted_content = extraction_strategy.run(
-            url, sections,
-        )
-        extracted_content = json.dumps(extracted_content)
-
-        if verbose:
-            print(
-                f"[LOG] 🚀 Extraction done for {url}, time taken: {time.time() - t} seconds."
-            )
-
-        # Cache the result
-        cleaned_html = beautify_html(cleaned_html)
-        cache_url(
-            url,
-            html,
-            cleaned_html,
-            markdown,
-            extracted_content,
-            success,
-        )
-
-        return CrawlResult(
-            url=url,
-            html=html,
-            cleaned_html=cleaned_html,
-            markdown=markdown,
-            extracted_content=extracted_content,
-            success=success,
-            error_message=error_message,
-        )
-
    def fetch_pages(
        self,
        url_models: List[UrlModel],
@@ -182,6 +72,8 @@ class WebCrawler:
        extract_blocks_flag: bool = True,
        word_count_threshold=MIN_WORD_THRESHOLD,
        use_cached_html: bool = False,
+        css_selector: str = None,
+        screenshot: bool = False,
        extraction_strategy: ExtractionStrategy = None,
        chunking_strategy: ChunkingStrategy = RegexChunking(),
        **kwargs,
@@ -199,6 +91,8 @@ class WebCrawler:
                    [api_token] * len(url_models),
                    [extract_blocks_flag] * len(url_models),
                    [word_count_threshold] * len(url_models),
+                    [css_selector] * len(url_models),
+                    [screenshot] * len(url_models),
                    [use_cached_html] * len(url_models),
                    [extraction_strategy] * len(url_models),
                    [chunking_strategy] * len(url_models),
@@ -207,3 +101,138 @@ class WebCrawler:
            )

        return results
+
+    def run(
+            self,
+            url: str,
+            word_count_threshold=MIN_WORD_THRESHOLD,
+            extraction_strategy: ExtractionStrategy = None,
+            chunking_strategy: ChunkingStrategy = RegexChunking(),
+            bypass_cache: bool = False,
+            css_selector: str = None,
+            screenshot: bool = False,
+            user_agent: str = None,
+            verbose=True,
+            **kwargs,
+        ) -> CrawlResult:
+            try:
+                extraction_strategy = extraction_strategy or NoExtractionStrategy()
+                extraction_strategy.verbose = verbose
+                if not isinstance(extraction_strategy, ExtractionStrategy):
+                    raise ValueError("Unsupported extraction strategy")
+                if not isinstance(chunking_strategy, ChunkingStrategy):
+                    raise ValueError("Unsupported chunking strategy")
+                
+                word_count_threshold = max(word_count_threshold, MIN_WORD_THRESHOLD)
+
+                cached = None
+                screenshot_data = None
+                extracted_content = None
+                if not bypass_cache and not self.always_by_pass_cache:
+                    cached = get_cached_url(url)
+                
+                if kwargs.get("warmup", True) and not self.ready:
+                    return None
+                
+                if cached:
+                    html = sanitize_input_encode(cached[1])
+                    extracted_content = sanitize_input_encode(cached[4])
+                    if screenshot:
+                        screenshot_data = cached[9]
+                        if not screenshot_data:
+                            cached = None
+                
+                if not cached or not html:
+                    if user_agent:
+                        self.crawler_strategy.update_user_agent(user_agent)
+                    t1 = time.time()
+                    html = sanitize_input_encode(self.crawler_strategy.crawl(url, **kwargs))
+                    t2 = time.time()
+                    if verbose:
+                        print(f"[LOG] 🚀 Crawling done for {url}, success: {bool(html)}, time taken: {t2 - t1:.2f} seconds")
+                    if screenshot:
+                        screenshot_data = self.crawler_strategy.take_screenshot()
+
+                
+                crawl_result = self.process_html(url, html, extracted_content, word_count_threshold, extraction_strategy, chunking_strategy, css_selector, screenshot_data, verbose, bool(cached), **kwargs)
+                crawl_result.success = bool(html)
+                return crawl_result
+            except Exception as e:
+                if not hasattr(e, "msg"):
+                    e.msg = str(e)
+                print(f"[ERROR] 🚫 Failed to crawl {url}, error: {e.msg}")    
+                return CrawlResult(url=url, html="", success=False, error_message=e.msg)
+
+    def process_html(
+            self,
+            url: str,
+            html: str,
+            extracted_content: str,
+            word_count_threshold: int,
+            extraction_strategy: ExtractionStrategy,
+            chunking_strategy: ChunkingStrategy,
+            css_selector: str,
+            screenshot: str,
+            verbose: bool,
+            is_cached: bool,
+            **kwargs,
+        ) -> CrawlResult:
+            t = time.time()
+            # Extract content from HTML
+            try:
+                t1 = time.time()
+                result = get_content_of_website_optimized(url, html, word_count_threshold, css_selector=css_selector, only_text=kwargs.get("only_text", False))
+                if verbose:
+                    print(f"[LOG] 🚀 Content extracted for {url}, success: True, time taken: {time.time() - t1:.2f} seconds")
+                
+                if result is None:
+                    raise ValueError(f"Failed to extract content from the website: {url}")
+            except InvalidCSSSelectorError as e:
+                raise ValueError(str(e))
+            
+            cleaned_html = sanitize_input_encode(result.get("cleaned_html", ""))
+            markdown = sanitize_input_encode(result.get("markdown", ""))
+            media = result.get("media", [])
+            links = result.get("links", [])
+            metadata = result.get("metadata", {})
+                        
+            if extracted_content is None:
+                if verbose:
+                    print(f"[LOG] 🔥 Extracting semantic blocks for {url}, Strategy: {extraction_strategy.name}")
+
+                sections = chunking_strategy.chunk(markdown)
+                extracted_content = extraction_strategy.run(url, sections)
+                extracted_content = json.dumps(extracted_content, indent=4, default=str, ensure_ascii=False)
+
+                if verbose:
+                    print(f"[LOG] 🚀 Extraction done for {url}, time taken: {time.time() - t:.2f} seconds.")
+                
+            screenshot = None if not screenshot else screenshot
+            
+            if not is_cached:
+                cache_url(
+                    url,
+                    html,
+                    cleaned_html,
+                    markdown,
+                    extracted_content,
+                    True,
+                    json.dumps(media),
+                    json.dumps(links),
+                    json.dumps(metadata),
+                    screenshot=screenshot,
+                )                
+            
+            return CrawlResult(
+                url=url,
+                html=html,
+                cleaned_html=format_html(cleaned_html),
+                markdown=markdown,
+                media=media,
+                links=links,
+                metadata=metadata,
+                screenshot=screenshot,
+                extracted_content=extracted_content,
+                success=True,
+                error_message="",
+            )
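The new `run` method above consults the cache before crawling and only calls the crawler strategy when nothing usable is cached. A minimal sketch of that cache-first pattern follows, assuming a plain dict as the cache and a caller-supplied `fetch` callable. Both `crawl_with_cache` and `fetch` are hypothetical names used for illustration and are not part of crawl4ai.

```python
from typing import Callable, Dict, Optional, Tuple


def crawl_with_cache(
    url: str,
    cache: Dict[str, str],
    fetch: Callable[[str], str],
    bypass_cache: bool = False,
) -> Tuple[str, bool]:
    """Return (html, was_cached) for url, consulting the cache first.

    Hypothetical helper: a dict stands in for the database-backed cache.
    """
    # Skip the lookup entirely when the caller asks to bypass the cache.
    cached: Optional[str] = None if bypass_cache else cache.get(url)
    if cached:
        return cached, True

    # Cache miss (or bypass): fetch the page and store the fresh result.
    html = fetch(url)
    cache[url] = html
    return html, False
```

Returning the `was_cached` flag alongside the HTML mirrors how the diff passes `bool(cached)` into `process_html` so that a cached page is not written back to the cache a second time.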
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -1,10 +0,0 @@
-version: '3.8'
-
-services:
-  web:
-    build: .
-    command: uvicorn main:app --host 0.0.0.0 --port 80 --workers $(nproc)
-    ports:
-      - "80:80"
-    environment:
-      - PYTHONUNBUFFERED=1
--- a/docs/assets/pitch-dark.png
+++ b/docs/assets/pitch-dark.png
--- a/docs/assets/pitch-dark.svg
+++ b/docs/assets/pitch-dark.svg
@@ -0,0 +1,64 @@
+<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 800 500">
+    <!-- Background -->
+    <rect width="800" height="500" fill="#1a1a1a"/>
+    
+    <!-- Opportunities Section -->
+    <g transform="translate(50,50)">
+        <!-- Opportunity 1 Box -->
+        <rect x="0" y="0" width="300" height="150" rx="10" fill="#1a2d3d" stroke="#64b5f6" stroke-width="2"/>
+        <text x="150" y="30" text-anchor="middle" font-family="Arial" font-weight="bold" font-size="16" fill="#64b5f6">Data Capitalization Opportunity</text>
+        <text x="150" y="60" text-anchor="middle" font-family="Arial" font-size="12" fill="#e0e0e0">
+            <tspan x="150" dy="0">Transform digital footprints into assets</tspan>
+            <tspan x="150" dy="20">Personal data as capital</tspan>
+            <tspan x="150" dy="20">Enterprise knowledge valuation</tspan>
+            <tspan x="150" dy="20">New form of wealth creation</tspan>
+        </text>
+
+        <!-- Opportunity 2 Box -->
+        <rect x="0" y="200" width="300" height="150" rx="10" fill="#1a2d1a" stroke="#81c784" stroke-width="2"/>
+        <text x="150" y="230" text-anchor="middle" font-family="Arial" font-weight="bold" font-size="16" fill="#81c784">Authentic Data Potential</text>
+        <text x="150" y="260" text-anchor="middle" font-family="Arial" font-size="12" fill="#e0e0e0">
+            <tspan x="150" dy="0">Vast reservoir of real insights</tspan>
+            <tspan x="150" dy="20">Enhanced AI development</tspan>
+            <tspan x="150" dy="20">Diverse human knowledge</tspan>
+            <tspan x="150" dy="20">Willing participation model</tspan>
+        </text>
+    </g>
+
+    <!-- Development Pathway -->
+    <g transform="translate(450,50)">
+        <!-- Step 1 Box -->
+        <rect x="0" y="0" width="300" height="100" rx="10" fill="#2d1a2d" stroke="#ce93d8" stroke-width="2"/>
+        <text x="150" y="35" text-anchor="middle" font-family="Arial" font-weight="bold" font-size="16" fill="#ce93d8">1. Open-Source Foundation</text>
+        <text x="150" y="65" text-anchor="middle" font-family="Arial" font-size="12" fill="#e0e0e0">Data extraction engine &amp; community development</text>
+
+        <!-- Step 2 Box -->
+        <rect x="0" y="125" width="300" height="100" rx="10" fill="#2d1a2d" stroke="#ce93d8" stroke-width="2"/>
+        <text x="150" y="160" text-anchor="middle" font-family="Arial" font-weight="bold" font-size="16" fill="#ce93d8">2. Data Capitalization Platform</text>
+        <text x="150" y="190" text-anchor="middle" font-family="Arial" font-size="12" fill="#e0e0e0">Tools to structure &amp; value digital assets</text>
+
+        <!-- Step 3 Box -->
+        <rect x="0" y="250" width="300" height="100" rx="10" fill="#2d1a2d" stroke="#ce93d8" stroke-width="2"/>
+        <text x="150" y="285" text-anchor="middle" font-family="Arial" font-weight="bold" font-size="16" fill="#ce93d8">3. Shared Data Marketplace</text>
+        <text x="150" y="315" text-anchor="middle" font-family="Arial" font-size="12" fill="#e0e0e0">Economic platform for data exchange</text>
+    </g>
+
+    <!-- Connecting Arrows -->
+    <g transform="translate(400,125)">
+        <path d="M-20,0 L40,0" stroke="#666" stroke-width="2" marker-end="url(#arrowhead)"/>
+        <path d="M-20,200 L40,200" stroke="#666" stroke-width="2" marker-end="url(#arrowhead)"/>
+    </g>
+
+    <!-- Arrow Marker -->
+    <defs>
+        <marker id="arrowhead" markerWidth="10" markerHeight="7" refX="9" refY="3.5" orient="auto">
+            <polygon points="0 0, 10 3.5, 0 7" fill="#666"/>
+        </marker>
+    </defs>
+
+    <!-- Vision Box at Bottom -->
+    <g transform="translate(200,420)">
+        <rect x="0" y="0" width="400" height="60" rx="10" fill="#2d2613" stroke="#ffd54f" stroke-width="2"/>
+        <text x="200" y="35" text-anchor="middle" font-family="Arial" font-weight="bold" font-size="16" fill="#ffd54f">Economic Vision: Shared Data Economy</text>
+    </g>
+</svg>
--- a/docs/chunking_strategies.json
+++ b/docs/chunking_strategies.json
@@ -1,12 +0,0 @@
-{
-    "RegexChunking": "### RegexChunking\n\n`RegexChunking` is a text chunking strategy that splits a given text into smaller parts using regular expressions.\nThis is useful for preparing large texts for processing by language models, ensuring they are divided into manageable segments.\n\n#### Constructor Parameters:\n- `patterns` (list, optional): A list of regular expression patterns used to split the text. Default is to split by double newlines (`['\\n\\n']`).\n\n#### Example usage:\n```python\nchunker = RegexChunking(patterns=[r'\\n\\n', r'\\. '])\nchunks = chunker.chunk(\"This is a sample text. It will be split into chunks.\")\n```",
-    
-    "NlpSentenceChunking": "### NlpSentenceChunking\n\n`NlpSentenceChunking` uses a natural language processing model to chunk a given text into sentences. This approach leverages SpaCy to accurately split text based on sentence boundaries.\n\n#### Constructor Parameters:\n- None.\n\n#### Example usage:\n```python\nchunker = NlpSentenceChunking()\nchunks = chunker.chunk(\"This is a sample text. It will be split into sentences.\")\n```",
-    
-    "TopicSegmentationChunking": "### TopicSegmentationChunking\n\n`TopicSegmentationChunking` uses the TextTiling algorithm to segment a given text into topic-based chunks. This method identifies thematic boundaries in the text.\n\n#### Constructor Parameters:\n- `num_keywords` (int, optional): The number of keywords to extract for each topic segment. Default is `3`.\n\n#### Example usage:\n```python\nchunker = TopicSegmentationChunking(num_keywords=3)\nchunks = chunker.chunk(\"This is a sample text. It will be split into topic-based segments.\")\n```",
-    
-    "FixedLengthWordChunking": "### FixedLengthWordChunking\n\n`FixedLengthWordChunking` splits a given text into chunks of fixed length, based on the number of words.\n\n#### Constructor Parameters:\n- `chunk_size` (int, optional): The number of words in each chunk. Default is `100`.\n\n#### Example usage:\n```python\nchunker = FixedLengthWordChunking(chunk_size=100)\nchunks = chunker.chunk(\"This is a sample text. It will be split into fixed-length word chunks.\")\n```",
-    
-    "SlidingWindowChunking": "### SlidingWindowChunking\n\n`SlidingWindowChunking` uses a sliding window approach to chunk a given text. Each chunk has a fixed length, and the window slides by a specified step size.\n\n#### Constructor Parameters:\n- `window_size` (int, optional): The number of words in each chunk. Default is `100`.\n- `step` (int, optional): The number of words to slide the window. Default is `50`.\n\n#### Example usage:\n```python\nchunker = SlidingWindowChunking(window_size=100, step=50)\nchunks = chunker.chunk(\"This is a sample text. It will be split using a sliding window approach.\")\n```"
-  }
-  
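The removed documentation describes `SlidingWindowChunking` as producing fixed-length word chunks from a window that advances by a step size. The sketch below illustrates that idea under stated assumptions: `sliding_window_chunks` is a hypothetical standalone function, not the library's class, and the handling of the final partial window is a choice made here rather than confirmed library behavior.

```python
from typing import List


def sliding_window_chunks(text: str, window_size: int = 100, step: int = 50) -> List[str]:
    """Split text into overlapping chunks of up to window_size words.

    Hypothetical illustration; defaults follow the documented 100/50 values.
    """
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")

    words = text.split()
    if not words:
        return []

    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window_size]))
        # Stop once the current window has reached the end of the text.
        if start + window_size >= len(words):
            break
    return chunks
```

With `step` smaller than `window_size`, consecutive chunks share `window_size - step` words, which keeps some context on both sides of each boundary.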
--- a/docs/examples/assets/audio.mp3
+++ b/docs/examples/assets/audio.mp3
--- a/docs/examples/assets/basic.png
+++ b/docs/examples/assets/basic.png
--- a/docs/examples/assets/cosine_extraction.png
+++ b/docs/examples/assets/cosine_extraction.png
--- a/docs/examples/assets/css_js.png
+++ b/docs/examples/assets/css_js.png
--- a/docs/examples/assets/css_selector.png
+++ b/docs/examples/assets/css_selector.png
--- a/docs/examples/assets/exec_script.png
+++ b/docs/examples/assets/exec_script.png
--- a/docs/examples/assets/llm_extraction.png
+++ b/docs/examples/assets/llm_extraction.png
--- a/docs/examples/assets/semantic_extraction_cosine.png
+++ b/docs/examples/assets/semantic_extraction_cosine.png
--- a/docs/examples/assets/semantic_extraction_llm.png
+++ b/docs/examples/assets/semantic_extraction_llm.png
--- a/docs/examples/async_webcrawler_multiple_urls_example.py
+++ b/docs/examples/async_webcrawler_multiple_urls_example.py
@@ -0,0 +1,48 @@
+# File: async_webcrawler_multiple_urls_example.py
+import os, sys
+# append 2 parent directories to sys.path to import crawl4ai
+parent_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+sys.path.append(parent_dir)
+
+import asyncio
+from crawl4ai import AsyncWebCrawler
+
+async def main():
+    # Initialize the AsyncWebCrawler
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        # List of URLs to crawl
+        urls = [
+            "https://example.com",
+            "https://python.org",
+            "https://github.com",
+            "https://stackoverflow.com",
+            "https://news.ycombinator.com"
+        ]
+
+        # Set up crawling parameters
+        word_count_threshold = 100
+
+        # Run the crawling process for multiple URLs
+        results = await crawler.arun_many(
+            urls=urls,
+            word_count_threshold=word_count_threshold,
+            bypass_cache=True,
+            verbose=True
+        )
+
+        # Process the results
+        for result in results:
+            if result.success:
+                print(f"Successfully crawled: {result.url}")
+                print(f"Title: {result.metadata.get('title', 'N/A')}")
+                print(f"Word count: {len(result.markdown.split())}")
+                print(f"Number of links: {len(result.links.get('internal', [])) + len(result.links.get('external', []))}")
+                print(f"Number of images: {len(result.media.get('images', []))}")
+                print("---")
+            else:
+                print(f"Failed to crawl: {result.url}")
+                print(f"Error: {result.error_message}")
+                print("---")
+
+if __name__ == "__main__":
+    asyncio.run(main())
--- a/docs/examples/chainlit.md
+++ b/docs/examples/chainlit.md
@@ -0,0 +1,3 @@
+# Welcome to Crawl4AI! 🚀🤖
+
+Hi there, Developer! 👋 This example shows a research pipeline: share a URL in your conversation with any LLM, and the content of the crawled pages will be used as context.
--- a/docs/examples/crawlai_vs_firecrawl.py
+++ b/docs/examples/crawlai_vs_firecrawl.py
@@ -0,0 +1,67 @@
+import os, time
+# append the path to the root of the project
+import sys
+import asyncio
+sys.path.append(os.path.join(os.path.dirname(__file__), '..', '..'))
+from firecrawl import FirecrawlApp
+from crawl4ai import AsyncWebCrawler
+__data__ = os.path.join(os.path.dirname(__file__), '..', '..') + '/.data'
+
+async def compare():
+    app = FirecrawlApp(api_key=os.environ['FIRECRAWL_API_KEY'])
+
+    # Test Firecrawl with a simple crawl
+    start = time.time()
+    scrape_status = app.scrape_url(
+    'https://www.nbcnews.com/business',
+    params={'formats': ['markdown', 'html']}
+    )
+    end = time.time()
+    print(f"Time taken: {end - start} seconds")
+    print(len(scrape_status['markdown']))
+    # save the markdown content with provider name
+    with open(f"{__data__}/firecrawl_simple.md", "w") as f:
+        f.write(scrape_status['markdown'])
+    # Count how many "cldnry.s-nbcnews.com" are in the markdown
+    print(scrape_status['markdown'].count("cldnry.s-nbcnews.com"))
+    
+
+
+    async with AsyncWebCrawler() as crawler:
+        start = time.time()
+        result = await crawler.arun(
+            url="https://www.nbcnews.com/business",
+            # js_code=["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"],
+            word_count_threshold=0,
+            bypass_cache=True, 
+            verbose=False
+        )
+        end = time.time()
+        print(f"Time taken: {end - start} seconds")
+        print(len(result.markdown))
+        # save the markdown content with provider name  
+        with open(f"{__data__}/crawl4ai_simple.md", "w") as f:
+            f.write(result.markdown)
+        # count how many "cldnry.s-nbcnews.com" are in the markdown
+        print(result.markdown.count("cldnry.s-nbcnews.com"))
+
+        start = time.time()
+        result = await crawler.arun(
+            url="https://www.nbcnews.com/business",
+            js_code=["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"],
+            word_count_threshold=0,
+            bypass_cache=True, 
+            verbose=False
+        )
+        end = time.time()
+        print(f"Time taken: {end - start} seconds")
+        print(len(result.markdown))
+        # save the markdown content with provider name
+        with open(f"{__data__}/crawl4ai_js.md", "w") as f:
+            f.write(result.markdown)
+        # count how many "cldnry.s-nbcnews.com" are in the markdown
+        print(result.markdown.count("cldnry.s-nbcnews.com"))
+        
+if __name__ == "__main__":
+    asyncio.run(compare())
+    
--- a/docs/examples/language_support_example.py
+++ b/docs/examples/language_support_example.py
@@ -0,0 +1,45 @@
+import asyncio
+from crawl4ai import AsyncWebCrawler, AsyncPlaywrightCrawlerStrategy
+
+async def main():
+    # Example 1: Setting language when creating the crawler
+    crawler1 = AsyncWebCrawler(
+        crawler_strategy=AsyncPlaywrightCrawlerStrategy(
+            headers={"Accept-Language": "fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7"}
+        )
+    )
+    result1 = await crawler1.arun("https://www.example.com")
+    print("Example 1 result:", result1.extracted_content[:100])  # Print first 100 characters
+
+    # Example 2: Setting language before crawling
+    crawler2 = AsyncWebCrawler()
+    crawler2.crawler_strategy.headers["Accept-Language"] = "es-ES,es;q=0.9,en-US;q=0.8,en;q=0.7"
+    result2 = await crawler2.arun("https://www.example.com")
+    print("Example 2 result:", result2.extracted_content[:100])
+
+    # Example 3: Setting language when calling arun method
+    crawler3 = AsyncWebCrawler()
+    result3 = await crawler3.arun(
+        "https://www.example.com",
+        headers={"Accept-Language": "de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7"}
+    )
+    print("Example 3 result:", result3.extracted_content[:100])
+
+    # Example 4: Crawling multiple pages with different languages
+    urls = [
+        ("https://www.example.com", "fr-FR,fr;q=0.9"),
+        ("https://www.example.org", "es-ES,es;q=0.9"),
+        ("https://www.example.net", "de-DE,de;q=0.9"),
+    ]
+    
+    crawler4 = AsyncWebCrawler()
+    results = await asyncio.gather(*[
+        crawler4.arun(url, headers={"Accept-Language": lang})
+        for url, lang in urls
+    ])
+    
+    for url, result in zip([u for u, _ in urls], results):
+        print(f"Result for {url}:", result.extracted_content[:100])
+
+if __name__ == "__main__":
+    asyncio.run(main())
--- a/docs/examples/llm_extraction_openai_pricing.py
+++ b/docs/examples/llm_extraction_openai_pricing.py
@@ -0,0 +1,41 @@
+import os
+import json
+from crawl4ai.web_crawler import WebCrawler
+from crawl4ai.chunking_strategy import *
+from crawl4ai.extraction_strategy import *
+from crawl4ai.crawler_strategy import *
+
+url = r'https://openai.com/api/pricing/'
+
+crawler = WebCrawler()
+crawler.warmup()
+
+from pydantic import BaseModel, Field
+
+class OpenAIModelFee(BaseModel):
+    model_name: str = Field(..., description="Name of the OpenAI model.")
+    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
+    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
+
+result = crawler.run(
+    url=url,
+    word_count_threshold=1,
+    extraction_strategy= LLMExtractionStrategy(
+        # provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'), 
+        provider= "groq/llama-3.1-70b-versatile", api_token = os.getenv('GROQ_API_KEY'), 
+        schema=OpenAIModelFee.model_json_schema(),
+        extraction_type="schema",
+        instruction="From the crawled content, extract all mentioned model names along with their "\
+            "fees for input and output tokens. Make sure not to miss anything in the entire content. "\
+            'One extracted model JSON format should look like this: '\
+            '{ "model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens" }'
+    ),
+    bypass_cache=True,
+)
+
+model_fees = json.loads(result.extracted_content)
+
+print(len(model_fees))
+
+with open(".data/data.json", "w", encoding="utf-8") as f:
+    f.write(result.extracted_content)
--- a/docs/examples/quickstart.ipynb
+++ b/docs/examples/quickstart.ipynb
--- a/docs/examples/quickstart_async.py
+++ b/docs/examples/quickstart_async.py
@@ -0,0 +1,542 @@
+import os, sys
+# append parent directory to system path
+sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))); os.environ['FIRECRAWL_API_KEY'] = "fc-84b370ccfad44beabc686b38f1769692";
+
+import asyncio
+# import nest_asyncio
+# nest_asyncio.apply()
+
+import time
+import json
+import os
+import re
+from typing import Dict, List
+from bs4 import BeautifulSoup
+from pydantic import BaseModel, Field
+from crawl4ai import AsyncWebCrawler
+from crawl4ai.extraction_strategy import (
+    JsonCssExtractionStrategy,
+    LLMExtractionStrategy,
+)
+
+__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
+
+print("Crawl4AI: Advanced Web Crawling and Data Extraction")
+print("GitHub Repository: https://github.com/unclecode/crawl4ai")
+print("Twitter: @unclecode")
+print("Website: https://crawl4ai.com")
+
+
+async def simple_crawl():
+    print("\n--- Basic Usage ---")
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        result = await crawler.arun(url="https://www.nbcnews.com/business")
+        print(result.markdown[:500])  # Print first 500 characters
+
+async def simple_example_with_running_js_code():
+    print("\n--- Executing JavaScript and Using CSS Selectors ---")
+    # New code to handle the wait_for parameter
+    wait_for = """() => {
+        return Array.from(document.querySelectorAll('article.tease-card')).length > 10;
+    }"""
+
+    # wait_for can also be just a CSS selector
+    # wait_for = "article.tease-card:nth-child(10)"
+
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        js_code = [
+            "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
+        ]
+        result = await crawler.arun(
+            url="https://www.nbcnews.com/business",
+            js_code=js_code,
+            # wait_for=wait_for,
+            bypass_cache=True,
+        )
+        print(result.markdown[:500])  # Print first 500 characters
+
+async def simple_example_with_css_selector():
+    print("\n--- Using CSS Selectors ---")
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        result = await crawler.arun(
+            url="https://www.nbcnews.com/business",
+            css_selector=".wide-tease-item__description",
+            bypass_cache=True,
+        )
+        print(result.markdown[:500])  # Print first 500 characters
+
+async def use_proxy():
+    print("\n--- Using a Proxy ---")
+    print(
+        "Note: Replace 'http://your-proxy-url:port' with a working proxy to run this example."
+    )
+    # Uncomment and modify the following lines to use a proxy
+    # async with AsyncWebCrawler(verbose=True, proxy="http://your-proxy-url:port") as crawler:
+    #     result = await crawler.arun(
+    #         url="https://www.nbcnews.com/business",
+    #         bypass_cache=True
+    #     )
+    #     print(result.markdown[:500])  # Print first 500 characters
+
+async def capture_and_save_screenshot(url: str, output_path: str):
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        result = await crawler.arun(
+            url=url,
+            screenshot=True,
+            bypass_cache=True
+        )
+        
+        if result.success and result.screenshot:
+            import base64
+            
+            # Decode the base64 screenshot data
+            screenshot_data = base64.b64decode(result.screenshot)
+            
+            # Write the decoded image bytes to output_path
+            with open(output_path, 'wb') as f:
+                f.write(screenshot_data)
+            
+            print(f"Screenshot saved successfully to {output_path}")
+        else:
+            print("Failed to capture screenshot")
+
+class OpenAIModelFee(BaseModel):
+    model_name: str = Field(..., description="Name of the OpenAI model.")
+    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
+    output_fee: str = Field(
+        ..., description="Fee for output token for the OpenAI model."
+    )
+
+async def extract_structured_data_using_llm(provider: str, api_token: str = None, extra_headers: Dict[str, str] = None):
+    print(f"\n--- Extracting Structured Data with {provider} ---")
+    
+    if api_token is None and provider != "ollama":
+        print(f"API token is required for {provider}. Skipping this example.")
+        return
+
+    extra_args = {}
+    if extra_headers:
+        extra_args["extra_headers"] = extra_headers
+
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        result = await crawler.arun(
+            url="https://openai.com/api/pricing/",
+            word_count_threshold=1,
+            extraction_strategy=LLMExtractionStrategy(
+                provider=provider,
+                api_token=api_token,
+                schema=OpenAIModelFee.schema(),
+                extraction_type="schema",
+                instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. 
+                Do not miss any models in the entire content. One extracted model JSON format should look like this: 
+                {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}.""",
+                extra_args=extra_args
+            ),
+            bypass_cache=True,
+        )
+        print(result.extracted_content)
+
+async def extract_structured_data_using_css_extractor():
+    print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
+    schema = {
+        "name": "Coinbase Crypto Prices",
+        "baseSelector": ".cds-tableRow-t45thuk",
+        "fields": [
+            {
+                "name": "crypto",
+                "selector": "td:nth-child(1) h2",
+                "type": "text",
+            },
+            {
+                "name": "symbol",
+                "selector": "td:nth-child(1) p",
+                "type": "text",
+            },
+            {
+                "name": "price",
+                "selector": "td:nth-child(2)",
+                "type": "text",
+            }
+        ],
+    }
+
+    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
+
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        result = await crawler.arun(
+            url="https://www.coinbase.com/explore",
+            extraction_strategy=extraction_strategy,
+            bypass_cache=True,
+        )
+
+        assert result.success, "Failed to crawl the page"
+
+        crypto_prices = json.loads(result.extracted_content)
+        print(f"Successfully extracted {len(crypto_prices)} crypto prices")
+        print(json.dumps(crypto_prices[0], indent=2))
+
+# Advanced Session-Based Crawling with Dynamic Content 🔄
+async def crawl_dynamic_content_pages_method_1():
+    print("\n--- Advanced Multi-Page Crawling with JavaScript Execution ---")
+    first_commit = ""
+
+    async def on_execution_started(page):
+        nonlocal first_commit
+        try:
+            while True:
+                await page.wait_for_selector("li.Box-sc-g0xbh4-0 h4")
+                commit = await page.query_selector("li.Box-sc-g0xbh4-0 h4")
+                commit = await commit.evaluate("(element) => element.textContent")
+                commit = re.sub(r"\s+", "", commit)
+                if commit and commit != first_commit:
+                    first_commit = commit
+                    break
+                await asyncio.sleep(0.5)
+        except Exception as e:
+            print(f"Warning: New content didn't appear after JavaScript execution: {e}")
+
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
+
+        url = "https://github.com/microsoft/TypeScript/commits/main"
+        session_id = "typescript_commits_session"
+        all_commits = []
+
+        js_next_page = """
+        const button = document.querySelector('a[data-testid="pagination-next-button"]');
+        if (button) button.click();
+        """
+
+        for page in range(3):  # Crawl 3 pages
+            result = await crawler.arun(
+                url=url,
+                session_id=session_id,
+                css_selector="li.Box-sc-g0xbh4-0",
+                js_code=js_next_page if page > 0 else None,
+                bypass_cache=True,
+                js_only=page > 0,
+                headless=False,
+            )
+
+            assert result.success, f"Failed to crawl page {page + 1}"
+
+            soup = BeautifulSoup(result.cleaned_html, "html.parser")
+            commits = soup.select("li")
+            all_commits.extend(commits)
+
+            print(f"Page {page + 1}: Found {len(commits)} commits")
+
+        await crawler.crawler_strategy.kill_session(session_id)
+        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
+
+async def crawl_dynamic_content_pages_method_2():
+    print("\n--- Advanced Multi-Page Crawling with JavaScript Execution ---")
+
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        url = "https://github.com/microsoft/TypeScript/commits/main"
+        session_id = "typescript_commits_session"
+        all_commits = []
+        last_commit = ""
+
+        js_next_page_and_wait = """
+        (async () => {
+            const getCurrentCommit = () => {
+                const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
+                return commits.length > 0 ? commits[0].textContent.trim() : null;
+            };
+
+            const initialCommit = getCurrentCommit();
+            const button = document.querySelector('a[data-testid="pagination-next-button"]');
+            if (button) button.click();
+
+            // Poll for changes
+            while (true) {
+                await new Promise(resolve => setTimeout(resolve, 100)); // Wait 100ms
+                const newCommit = getCurrentCommit();
+                if (newCommit && newCommit !== initialCommit) {
+                    break;
+                }
+            }
+        })();
+        """
+
+        schema = {
+            "name": "Commit Extractor",
+            "baseSelector": "li.Box-sc-g0xbh4-0",
+            "fields": [
+                {
+                    "name": "title",
+                    "selector": "h4.markdown-title",
+                    "type": "text",
+                    "transform": "strip",
+                },
+            ],
+        }
+        extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
+
+        for page in range(3):  # Crawl 3 pages
+            result = await crawler.arun(
+                url=url,
+                session_id=session_id,
+                css_selector="li.Box-sc-g0xbh4-0",
+                extraction_strategy=extraction_strategy,
+                js_code=js_next_page_and_wait if page > 0 else None,
+                js_only=page > 0,
+                bypass_cache=True,
+                headless=False,
+            )
+
+            assert result.success, f"Failed to crawl page {page + 1}"
+
+            commits = json.loads(result.extracted_content)
+            all_commits.extend(commits)
+
+            print(f"Page {page + 1}: Found {len(commits)} commits")
+
+        await crawler.crawler_strategy.kill_session(session_id)
+        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
+
+async def crawl_dynamic_content_pages_method_3():
+    print("\n--- Advanced Multi-Page Crawling with JavaScript Execution using `wait_for` ---")
+
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        url = "https://github.com/microsoft/TypeScript/commits/main"
+        session_id = "typescript_commits_session"
+        all_commits = []
+
+        js_next_page = """
+        const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
+        if (commits.length > 0) {
+            window.firstCommit = commits[0].textContent.trim();
+        }
+        const button = document.querySelector('a[data-testid="pagination-next-button"]');
+        if (button) button.click();
+        """
+
+        wait_for = """() => {
+            const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
+            if (commits.length === 0) return false;
+            const firstCommit = commits[0].textContent.trim();
+            return firstCommit !== window.firstCommit;
+        }"""
+        
+        schema = {
+            "name": "Commit Extractor",
+            "baseSelector": "li.Box-sc-g0xbh4-0",
+            "fields": [
+                {
+                    "name": "title",
+                    "selector": "h4.markdown-title",
+                    "type": "text",
+                    "transform": "strip",
+                },
+            ],
+        }
+        extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
+
+        for page in range(3):  # Crawl 3 pages
+            result = await crawler.arun(
+                url=url,
+                session_id=session_id,
+                css_selector="li.Box-sc-g0xbh4-0",
+                extraction_strategy=extraction_strategy,
+                js_code=js_next_page if page > 0 else None,
+                wait_for=wait_for if page > 0 else None,
+                js_only=page > 0,
+                bypass_cache=True,
+                headless=False,
+            )
+
+            assert result.success, f"Failed to crawl page {page + 1}"
+
+            commits = json.loads(result.extracted_content)
+            all_commits.extend(commits)
+
+            print(f"Page {page + 1}: Found {len(commits)} commits")
+
+        await crawler.crawler_strategy.kill_session(session_id)
+        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
+
+async def crawl_custom_browser_type():
+    # Use Firefox
+    start = time.time()
+    async with AsyncWebCrawler(browser_type="firefox", verbose=True, headless=True) as crawler:
+        result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
+        print(result.markdown[:500])
+        print("Time taken: ", time.time() - start)
+
+    # Use WebKit
+    start = time.time()
+    async with AsyncWebCrawler(browser_type="webkit", verbose=True, headless=True) as crawler:
+        result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
+        print(result.markdown[:500])
+        print("Time taken: ", time.time() - start)
+
+    # Use Chromium (default)
+    start = time.time()
+    async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
+        result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
+        print(result.markdown[:500])
+        print("Time taken: ", time.time() - start)
+
+async def crawl_with_user_simulation():
+    async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
+        url = "YOUR-URL-HERE"
+        result = await crawler.arun(
+            url=url,
+            bypass_cache=True,
+            magic=True,  # Automatically detects and removes overlays, popups, and other elements that block content
+            # simulate_user=True,  # Performs a series of random mouse movements and clicks to simulate user interaction
+            # override_navigator=True  # Overrides the navigator object to make it look like a real user
+        )
+        
+        print(result.markdown)    
+
+async def speed_comparison():
+    # print("\n--- Speed Comparison ---")
+    # print("Firecrawl (simulated):")
+    # print("Time taken: 7.02 seconds")
+    # print("Content length: 42074 characters")
+    # print("Images found: 49")
+    # print()
+    # Firecrawl performance, measured via its hosted API (requires FIRECRAWL_API_KEY)
+    from firecrawl import FirecrawlApp
+    app = FirecrawlApp(api_key=os.environ['FIRECRAWL_API_KEY'])
+    start = time.time()
+    scrape_status = app.scrape_url(
+        'https://www.nbcnews.com/business',
+        params={'formats': ['markdown', 'html']}
+    )
+    end = time.time()
+    print("Firecrawl:")
+    print(f"Time taken: {end - start:.2f} seconds")
+    print(f"Content length: {len(scrape_status['markdown'])} characters")
+    print(f"Images found: {scrape_status['markdown'].count('cldnry.s-nbcnews.com')}")
+    print()    
+
+    async with AsyncWebCrawler() as crawler:
+        # Crawl4AI simple crawl
+        start = time.time()
+        result = await crawler.arun(
+            url="https://www.nbcnews.com/business",
+            word_count_threshold=0,
+            bypass_cache=True,
+            verbose=False,
+        )
+        end = time.time()
+        print("Crawl4AI (simple crawl):")
+        print(f"Time taken: {end - start:.2f} seconds")
+        print(f"Content length: {len(result.markdown)} characters")
+        print(f"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}")
+        print()
+
+        # Crawl4AI with JavaScript execution
+        start = time.time()
+        result = await crawler.arun(
+            url="https://www.nbcnews.com/business",
+            js_code=[
+                "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
+            ],
+            word_count_threshold=0,
+            bypass_cache=True,
+            verbose=False,
+        )
+        end = time.time()
+        print("Crawl4AI (with JavaScript execution):")
+        print(f"Time taken: {end - start:.2f} seconds")
+        print(f"Content length: {len(result.markdown)} characters")
+        print(f"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}")
+
+    print("\nNote on Speed Comparison:")
+    print("The speed test conducted here may not reflect optimal conditions.")
+    print("When we call Firecrawl's API, we're seeing its best performance,")
+    print("while Crawl4AI's performance is limited by the local network speed.")
+    print("For a more accurate comparison, it's recommended to run these tests")
+    print("on servers with a stable and fast internet connection.")
+    print("Despite these limitations, Crawl4AI still demonstrates faster performance.")
+    print("If you run these tests in an environment with better network conditions,")
+    print("you may observe an even more significant speed advantage for Crawl4AI.")
+
+async def generate_knowledge_graph():
+    class Entity(BaseModel):
+        name: str
+        description: str
+        
+    class Relationship(BaseModel):
+        entity1: Entity
+        entity2: Entity
+        description: str
+        relation_type: str
+
+    class KnowledgeGraph(BaseModel):
+        entities: List[Entity]
+        relationships: List[Relationship]
+
+    extraction_strategy = LLMExtractionStrategy(
+            provider='openai/gpt-4o-mini', # Or any other provider, including Ollama and open source models
+            api_token=os.getenv('OPENAI_API_KEY'), # In case of Ollama just pass "no-token"
+            schema=KnowledgeGraph.model_json_schema(),
+            extraction_type="schema",
+            instruction="""Extract entities and relationships from the given text."""
+    )
+    async with AsyncWebCrawler() as crawler:
+        url = "https://paulgraham.com/love.html"
+        result = await crawler.arun(
+            url=url,
+            bypass_cache=True,
+            extraction_strategy=extraction_strategy,
+            # magic=True
+        )
+        # print(result.extracted_content)
+        with open(os.path.join(__location__, "kb.json"), "w") as f:
+            f.write(result.extracted_content)
+
+async def fit_markdown_remove_overlay():
+    async with AsyncWebCrawler(headless=False) as crawler:
+        url = "https://janineintheworld.com/places-to-visit-in-central-mexico"
+        result = await crawler.arun(
+            url=url,
+            bypass_cache=True,
+            word_count_threshold=10,
+            remove_overlay_elements=True,
+            screenshot=True
+        )
+        # Save markdown to file
+        with open(os.path.join(__location__, "mexico_places.md"), "w") as f:
+            f.write(result.fit_markdown)
+
+    print("Done")
+
+
+async def main():
+    await simple_crawl()
+    await simple_example_with_running_js_code()
+    await simple_example_with_css_selector()
+    await use_proxy()
+    await capture_and_save_screenshot("https://www.example.com", os.path.join(__location__, "tmp/example_screenshot.jpg"))
+    await extract_structured_data_using_css_extractor()
+
+    # LLM extraction examples
+    await extract_structured_data_using_llm("openai/gpt-4o-mini", os.getenv("OPENAI_API_KEY"))
+    await extract_structured_data_using_llm("huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct", os.getenv("HUGGINGFACE_API_KEY"))
+    await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"))
+    await extract_structured_data_using_llm("ollama/llama3.2")    
+
+    # You can always pass custom headers to the extraction strategy
+    custom_headers = {
+        "Authorization": "Bearer your-custom-token",
+        "X-Custom-Header": "Some-Value"
+    }
+    await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"), extra_headers=custom_headers)
+    
+    # await crawl_dynamic_content_pages_method_1()
+    # await crawl_dynamic_content_pages_method_2()
+    await crawl_dynamic_content_pages_method_3()
+    
+    await crawl_custom_browser_type()
+    
+    await speed_comparison()
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
--- a/docs/examples/quickstart_sync.py
+++ b/docs/examples/quickstart_sync.py
@@ -12,7 +12,7 @@ console = Console()

@lru_cache()
 def create_crawler():
-    crawler = WebCrawler()
+    crawler = WebCrawler(verbose=True)
    crawler.warmup()
    return crawler

@@ -35,10 +35,26 @@ def cprint(message, press_any_key=False):

 def basic_usage(crawler):
    cprint("🛠️ [bold cyan]Basic Usage: Simply provide a URL and let Crawl4ai do the magic![/bold cyan]")
-    result = crawler.run(url="https://www.nbcnews.com/business")
+    result = crawler.run(url="https://www.nbcnews.com/business", only_text = True)
    cprint("[LOG] 📦 [bold yellow]Basic crawl result:[/bold yellow]")
    print_result(result)

+def basic_usage_some_params(crawler):
+    cprint("🛠️ [bold cyan]Basic Usage: Simply provide a URL and let Crawl4ai do the magic![/bold cyan]")
+    result = crawler.run(url="https://www.nbcnews.com/business", word_count_threshold=1, only_text = True)
+    cprint("[LOG] 📦 [bold yellow]Basic crawl result:[/bold yellow]")
+    print_result(result)
+
+def screenshot_usage(crawler):
+    cprint("\n📸 [bold cyan]Let's take a screenshot of the page![/bold cyan]")
+    result = crawler.run(url="https://www.nbcnews.com/business", screenshot=True)
+    cprint("[LOG] 📦 [bold yellow]Screenshot result:[/bold yellow]")
+    # Save the screenshot to a file
+    with open("screenshot.png", "wb") as f:
+        f.write(base64.b64decode(result.screenshot))
+    cprint("Screenshot saved to 'screenshot.png'!")
+    print_result(result)
+
 def understanding_parameters(crawler):
    cprint("\n🧠 [bold cyan]Understanding 'bypass_cache' and 'include_raw_html' parameters:[/bold cyan]")
    cprint("By default, Crawl4ai caches the results of your crawls. This means that subsequent crawls of the same URL will be much faster! Let's see this in action.")
@@ -86,7 +102,7 @@ def add_extraction_strategy(crawler):
    cprint("CosineStrategy uses cosine similarity to extract semantically similar blocks of text. Let's see it in action!")
    result = crawler.run(
        url="https://www.nbcnews.com/business",
-        extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3)
+        extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3, sim_threshold = 0.3, verbose=True)
    )
    cprint("[LOG] 📦 [bold yellow]CosineStrategy result:[/bold yellow]")
    print_result(result)
@@ -156,14 +172,118 @@ def interactive_extraction(crawler):
    const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
    loadMoreButton && loadMoreButton.click();
    """
-    crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
-    crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
+    # crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
+    # crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
    result = crawler.run(
        url="https://www.nbcnews.com/business",
+        js = js_code
    )
    cprint("[LOG] 📦 [bold yellow]JavaScript Code (Load More button) result:[/bold yellow]")
    print_result(result)

+def multiple_scrip(crawler):
+    # Passing JavaScript code to interact with the page
+    cprint("\n🖱️ [bold cyan]Let's get interactive: Passing JavaScript code to click 'Load More' button![/bold cyan]", True)
+    cprint("In this example we try to click the 'Load More' button on the page using JavaScript code.")
+    js_code = ["""
+    const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
+    loadMoreButton && loadMoreButton.click();
+    """] * 2
+    # crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
+    # crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
+    result = crawler.run(
+        url="https://www.nbcnews.com/business",
+        js = js_code  
+    )
+    cprint("[LOG] 📦 [bold yellow]JavaScript Code (Load More button) result:[/bold yellow]")
+    print_result(result)
+
+def using_crawler_hooks(crawler):
+    # Example usage of the hooks for authentication and setting a cookie
+    def on_driver_created(driver):
+        print("[HOOK] on_driver_created")
+        # Example customization: maximize the window
+        driver.maximize_window()
+        
+        # Example customization: logging in to a hypothetical website
+        driver.get('https://example.com/login')
+        
+        from selenium.webdriver.support.ui import WebDriverWait
+        from selenium.webdriver.common.by import By
+        from selenium.webdriver.support import expected_conditions as EC
+        
+        WebDriverWait(driver, 10).until(
+            EC.presence_of_element_located((By.NAME, 'username'))
+        )
+        driver.find_element(By.NAME, 'username').send_keys('testuser')
+        driver.find_element(By.NAME, 'password').send_keys('password123')
+        driver.find_element(By.NAME, 'login').click()
+        WebDriverWait(driver, 10).until(
+            EC.presence_of_element_located((By.ID, 'welcome'))
+        )
+        # Add a custom cookie
+        driver.add_cookie({'name': 'test_cookie', 'value': 'cookie_value'})
+        return driver        
+        
+
+    def before_get_url(driver):
+        print("[HOOK] before_get_url")
+        # Example customization: add a custom header
+        # Enable Network domain for sending headers
+        driver.execute_cdp_cmd('Network.enable', {})
+        # Add a custom header
+        driver.execute_cdp_cmd('Network.setExtraHTTPHeaders', {'headers': {'X-Test-Header': 'test'}})
+        return driver
+    
+    def after_get_url(driver):
+        print("[HOOK] after_get_url")
+        # Example customization: log the URL
+        print(driver.current_url)
+        return driver
+
+    def before_return_html(driver, html):
+        print("[HOOK] before_return_html")
+        # Example customization: log the HTML
+        print(len(html))
+        return driver
+    
+    cprint("\n🔗 [bold cyan]Using Crawler Hooks: Let's see how we can customize the crawler using hooks![/bold cyan]", True)
+    
+    crawler_strategy = LocalSeleniumCrawlerStrategy(verbose=True)
+    crawler_strategy.set_hook('on_driver_created', on_driver_created)
+    crawler_strategy.set_hook('before_get_url', before_get_url)
+    crawler_strategy.set_hook('after_get_url', after_get_url)
+    crawler_strategy.set_hook('before_return_html', before_return_html)
+    
+    crawler = WebCrawler(verbose=True, crawler_strategy=crawler_strategy)
+    crawler.warmup()    
+    result = crawler.run(url="https://example.com")
+    
+    cprint("[LOG] 📦 [bold yellow]Crawler Hooks result:[/bold yellow]")
+    print_result(result=result)
+    
+def using_crawler_hooks_delay_example(crawler):
+    def delay(driver):
+        print("Delaying for 5 seconds...")
+        time.sleep(5)
+        print("Resuming...")
+        
+    def create_crawler():
+        crawler_strategy = LocalSeleniumCrawlerStrategy(verbose=True)
+        crawler_strategy.set_hook('after_get_url', delay)
+        crawler = WebCrawler(verbose=True, crawler_strategy=crawler_strategy)
+        crawler.warmup()
+        return crawler
+
+    cprint("\n🔗 [bold cyan]Using Crawler Hooks: Let's add a delay after fetching the URL to make sure the entire page is loaded.[/bold cyan]")
+    crawler = create_crawler()
+    result = crawler.run(url="https://google.com", bypass_cache=True)    
+    
+    cprint("[LOG] 📦 [bold yellow]Crawler Hooks result:[/bold yellow]")
+    print_result(result)
+    
+    
+
 def main():
    cprint("🌟 [bold green]Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling fun! 🌐[/bold green]")
    cprint("⛳️ [bold cyan]First Step: Create an instance of WebCrawler and call the `warmup()` function.[/bold cyan]")
@@ -171,15 +291,19 @@ def main():

    crawler = create_crawler()

+    crawler.always_by_pass_cache = True
    basic_usage(crawler)
+    # basic_usage_some_params(crawler)
    understanding_parameters(crawler)
    
    crawler.always_by_pass_cache = True
+    screenshot_usage(crawler)
    add_chunking_strategy(crawler)
    add_extraction_strategy(crawler)
    add_llm_extraction_strategy(crawler)
    targeted_extraction(crawler)
    interactive_extraction(crawler)
+    multiple_scrip(crawler)

    cprint("\n🎉 [bold green]Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth and crawl the web like a pro! 🕸️[/bold green]")

--- a/docs/examples/quickstart_v0.ipynb
+++ b/docs/examples/quickstart_v0.ipynb
@@ -0,0 +1,735 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "6yLvrXn7yZQI"
+      },
+      "source": [
+        "# Crawl4AI: Advanced Web Crawling and Data Extraction\n",
+        "\n",
+        "Welcome to this interactive notebook showcasing Crawl4AI, an advanced asynchronous web crawling and data extraction library.\n",
+        "\n",
+        "- GitHub Repository: [https://github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)\n",
+        "- Twitter: [@unclecode](https://twitter.com/unclecode)\n",
+        "- Website: [https://crawl4ai.com](https://crawl4ai.com)\n",
+        "\n",
+        "Let's explore the powerful features of Crawl4AI!"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "KIn_9nxFyZQK"
+      },
+      "source": [
+        "## Installation\n",
+        "\n",
+        "First, let's install the required system libraries, then Crawl4AI from PyPI and the Playwright browsers:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "mSnaxLf3zMog"
+      },
+      "outputs": [],
+      "source": [
+        "!sudo apt-get update && sudo apt-get install -y libwoff1 libopus0 libwebp6 libwebpdemux2 libenchant1c2a libgudev-1.0-0 libsecret-1-0 libhyphen0 libgdk-pixbuf2.0-0 libegl1 libnotify4 libxslt1.1 libevent-2.1-7 libgles2 libvpx6 libxcomposite1 libatk1.0-0 libatk-bridge2.0-0 libepoxy0 libgtk-3-0 libharfbuzz-icu0"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "xlXqaRtayZQK"
+      },
+      "outputs": [],
+      "source": [
+        "!pip install crawl4ai\n",
+        "!pip install nest-asyncio\n",
+        "!playwright install"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qKCE7TI7yZQL"
+      },
+      "source": [
+        "Now, let's import the necessary libraries:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 1,
+      "metadata": {
+        "id": "I67tr7aAyZQL"
+      },
+      "outputs": [],
+      "source": [
+        "import asyncio\n",
+        "import nest_asyncio\n",
+        "from crawl4ai import AsyncWebCrawler\n",
+        "from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, LLMExtractionStrategy\n",
+        "import json\n",
+        "import time\n",
+        "from pydantic import BaseModel, Field\n",
+        "\n",
+        "nest_asyncio.apply()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "h7yR_Rt_yZQM"
+      },
+      "source": [
+        "## Basic Usage\n",
+        "\n",
+        "Let's start with a simple crawl example:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 2,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "yBh6hf4WyZQM",
+        "outputId": "0f83af5c-abba-4175-ed95-70b7512e6bcc"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "[LOG] 🌤️  Warming up the AsyncWebCrawler\n",
+            "[LOG] 🌞 AsyncWebCrawler is ready to crawl\n",
+            "[LOG] 🚀 Content extracted for https://www.nbcnews.com/business, success: True, time taken: 0.05 seconds\n",
+            "[LOG] 🚀 Extraction done for https://www.nbcnews.com/business, time taken: 0.05 seconds.\n",
+            "18102\n"
+          ]
+        }
+      ],
+      "source": [
+        "async def simple_crawl():\n",
+        "    async with AsyncWebCrawler(verbose=True) as crawler:\n",
+        "        result = await crawler.arun(url=\"https://www.nbcnews.com/business\")\n",
+        "        print(len(result.markdown))\n",
+        "await simple_crawl()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "9rtkgHI28uI4"
+      },
+      "source": [
+        "💡 By default, **Crawl4AI** caches the result of every URL, so the next time you call it, you’ll get an instant result. But if you want to bypass the cache, just set `bypass_cache=True`."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "MzZ0zlJ9yZQM"
+      },
+      "source": [
+        "## Advanced Features\n",
+        "\n",
+        "### Executing JavaScript and Using CSS Selectors"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 3,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "gHStF86xyZQM",
+        "outputId": "34d0fb6d-4dec-4677-f76e-85a1f082829b"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "[LOG] 🌤️  Warming up the AsyncWebCrawler\n",
+            "[LOG] 🌞 AsyncWebCrawler is ready to crawl\n",
+            "[LOG] 🕸️ Crawling https://www.nbcnews.com/business using AsyncPlaywrightCrawlerStrategy...\n",
+            "[LOG] ✅ Crawled https://www.nbcnews.com/business successfully!\n",
+            "[LOG] 🚀 Crawling done for https://www.nbcnews.com/business, success: True, time taken: 6.06 seconds\n",
+            "[LOG] 🚀 Content extracted for https://www.nbcnews.com/business, success: True, time taken: 0.10 seconds\n",
+            "[LOG] 🔥 Extracting semantic blocks for https://www.nbcnews.com/business, Strategy: AsyncWebCrawler\n",
+            "[LOG] 🚀 Extraction done for https://www.nbcnews.com/business, time taken: 0.11 seconds.\n",
+            "41135\n"
+          ]
+        }
+      ],
+      "source": [
+        "async def js_and_css():\n",
+        "    async with AsyncWebCrawler(verbose=True) as crawler:\n",
+        "        js_code = [\"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();\"]\n",
+        "        result = await crawler.arun(\n",
+        "            url=\"https://www.nbcnews.com/business\",\n",
+        "            js_code=js_code,\n",
+        "            # css_selector=\"YOUR_CSS_SELECTOR_HERE\",\n",
+        "            bypass_cache=True\n",
+        "        )\n",
+        "        print(len(result.markdown))\n",
+        "\n",
+        "await js_and_css()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "cqE_W4coyZQM"
+      },
+      "source": [
+        "### Using a Proxy\n",
+        "\n",
+        "Note: You'll need to replace the proxy URL with a working proxy for this example to run successfully."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "QjAyiAGqyZQM"
+      },
+      "outputs": [],
+      "source": [
+        "async def use_proxy():\n",
+        "    async with AsyncWebCrawler(verbose=True, proxy=\"http://your-proxy-url:port\") as crawler:\n",
+        "        result = await crawler.arun(\n",
+        "            url=\"https://www.nbcnews.com/business\",\n",
+        "            bypass_cache=True\n",
+        "        )\n",
+        "        print(result.markdown[:500])  # Print first 500 characters\n",
+        "\n",
+        "# Uncomment the following line to run the proxy example\n",
+        "# await use_proxy()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "XTZ88lbayZQN"
+      },
+      "source": [
+        "### Extracting Structured Data with OpenAI\n",
+        "\n",
+        "Note: You'll need to set your OpenAI API key as an environment variable for this example to work."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 14,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "fIOlDayYyZQN",
+        "outputId": "cb8359cc-dee0-4762-9698-5dfdcee055b8"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "[LOG] 🌤️  Warming up the AsyncWebCrawler\n",
+            "[LOG] 🌞 AsyncWebCrawler is ready to crawl\n",
+            "[LOG] 🕸️ Crawling https://openai.com/api/pricing/ using AsyncPlaywrightCrawlerStrategy...\n",
+            "[LOG] ✅ Crawled https://openai.com/api/pricing/ successfully!\n",
+            "[LOG] 🚀 Crawling done for https://openai.com/api/pricing/, success: True, time taken: 3.77 seconds\n",
+            "[LOG] 🚀 Content extracted for https://openai.com/api/pricing/, success: True, time taken: 0.21 seconds\n",
+            "[LOG] 🔥 Extracting semantic blocks for https://openai.com/api/pricing/, Strategy: AsyncWebCrawler\n",
+            "[LOG] Call LLM for https://openai.com/api/pricing/ - block index: 0\n",
+            "[LOG] Call LLM for https://openai.com/api/pricing/ - block index: 1\n",
+            "[LOG] Call LLM for https://openai.com/api/pricing/ - block index: 2\n",
+            "[LOG] Call LLM for https://openai.com/api/pricing/ - block index: 3\n",
+            "[LOG] Extracted 4 blocks from URL: https://openai.com/api/pricing/ block index: 3\n",
+            "[LOG] Call LLM for https://openai.com/api/pricing/ - block index: 4\n",
+            "[LOG] Extracted 5 blocks from URL: https://openai.com/api/pricing/ block index: 0\n",
+            "[LOG] Extracted 1 blocks from URL: https://openai.com/api/pricing/ block index: 4\n",
+            "[LOG] Extracted 8 blocks from URL: https://openai.com/api/pricing/ block index: 1\n",
+            "[LOG] Extracted 12 blocks from URL: https://openai.com/api/pricing/ block index: 2\n",
+            "[LOG] 🚀 Extraction done for https://openai.com/api/pricing/, time taken: 8.55 seconds.\n",
+            "5029\n"
+          ]
+        }
+      ],
+      "source": [
+        "import os\n",
+        "from google.colab import userdata\n",
+        "os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')\n",
+        "\n",
+        "class OpenAIModelFee(BaseModel):\n",
+        "    model_name: str = Field(..., description=\"Name of the OpenAI model.\")\n",
+        "    input_fee: str = Field(..., description=\"Fee for input token for the OpenAI model.\")\n",
+        "    output_fee: str = Field(..., description=\"Fee for output token for the OpenAI model.\")\n",
+        "\n",
+        "async def extract_openai_fees():\n",
+        "    async with AsyncWebCrawler(verbose=True) as crawler:\n",
+        "        result = await crawler.arun(\n",
+        "            url='https://openai.com/api/pricing/',\n",
+        "            word_count_threshold=1,\n",
+        "            extraction_strategy=LLMExtractionStrategy(\n",
+        "                provider=\"openai/gpt-4o\", api_token=os.getenv('OPENAI_API_KEY'),\n",
+        "                schema=OpenAIModelFee.schema(),\n",
+        "                extraction_type=\"schema\",\n",
+        "                instruction=\"\"\"From the crawled content, extract all mentioned model names along with their fees for input and output tokens.\n",
+        "                Do not miss any models in the entire content. One extracted model JSON format should look like this:\n",
+        "                {\"model_name\": \"GPT-4\", \"input_fee\": \"US$10.00 / 1M tokens\", \"output_fee\": \"US$30.00 / 1M tokens\"}.\"\"\"\n",
+        "            ),\n",
+        "            bypass_cache=True,\n",
+        "        )\n",
+        "        print(len(result.extracted_content))\n",
+        "\n",
+        "# Run the OpenAI extraction example (requires OPENAI_API_KEY to be set)\n",
+        "await extract_openai_fees()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "BypA5YxEyZQN"
+      },
+      "source": [
+        "### Advanced Multi-Page Crawling with JavaScript Execution"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "tfkcVQ0b7mw-"
+      },
+      "source": [
+        "This example demonstrates Crawl4AI's ability to handle complex crawling scenarios, specifically extracting commits from multiple pages of a GitHub repository. The challenge here is that clicking the \"Next\" button doesn't load a new page, but instead uses asynchronous JavaScript to update the content. This is a common hurdle in modern web crawling.\n",
+        "\n",
+        "To overcome this, we use Crawl4AI's custom JavaScript execution to simulate clicking the \"Next\" button, and implement a custom hook to detect when new data has loaded. Our strategy involves comparing the first commit's text before and after \"clicking\" Next, waiting until it changes to confirm new data has rendered. This showcases Crawl4AI's flexibility in handling dynamic content and its ability to implement custom logic for even the most challenging crawling tasks."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 11,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "qUBKGpn3yZQN",
+        "outputId": "3e555b6a-ed33-42f4-cce9-499a923fbe17"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "[LOG] 🌤️  Warming up the AsyncWebCrawler\n",
+            "[LOG] 🌞 AsyncWebCrawler is ready to crawl\n",
+            "[LOG] 🕸️ Crawling https://github.com/microsoft/TypeScript/commits/main using AsyncPlaywrightCrawlerStrategy...\n",
+            "[LOG] ✅ Crawled https://github.com/microsoft/TypeScript/commits/main successfully!\n",
+            "[LOG] 🚀 Crawling done for https://github.com/microsoft/TypeScript/commits/main, success: True, time taken: 5.16 seconds\n",
+            "[LOG] 🚀 Content extracted for https://github.com/microsoft/TypeScript/commits/main, success: True, time taken: 0.28 seconds\n",
+            "[LOG] 🔥 Extracting semantic blocks for https://github.com/microsoft/TypeScript/commits/main, Strategy: AsyncWebCrawler\n",
+            "[LOG] 🚀 Extraction done for https://github.com/microsoft/TypeScript/commits/main, time taken: 0.28 seconds.\n",
+            "Page 1: Found 35 commits\n",
+            "[LOG] 🕸️ Crawling https://github.com/microsoft/TypeScript/commits/main using AsyncPlaywrightCrawlerStrategy...\n",
+            "[LOG] ✅ Crawled https://github.com/microsoft/TypeScript/commits/main successfully!\n",
+            "[LOG] 🚀 Crawling done for https://github.com/microsoft/TypeScript/commits/main, success: True, time taken: 0.78 seconds\n",
+            "[LOG] 🚀 Content extracted for https://github.com/microsoft/TypeScript/commits/main, success: True, time taken: 0.90 seconds\n",
+            "[LOG] 🔥 Extracting semantic blocks for https://github.com/microsoft/TypeScript/commits/main, Strategy: AsyncWebCrawler\n",
+            "[LOG] 🚀 Extraction done for https://github.com/microsoft/TypeScript/commits/main, time taken: 0.90 seconds.\n",
+            "Page 2: Found 35 commits\n",
+            "[LOG] 🕸️ Crawling https://github.com/microsoft/TypeScript/commits/main using AsyncPlaywrightCrawlerStrategy...\n",
+            "[LOG] ✅ Crawled https://github.com/microsoft/TypeScript/commits/main successfully!\n",
+            "[LOG] 🚀 Crawling done for https://github.com/microsoft/TypeScript/commits/main, success: True, time taken: 2.00 seconds\n",
+            "[LOG] 🚀 Content extracted for https://github.com/microsoft/TypeScript/commits/main, success: True, time taken: 0.74 seconds\n",
+            "[LOG] 🔥 Extracting semantic blocks for https://github.com/microsoft/TypeScript/commits/main, Strategy: AsyncWebCrawler\n",
+            "[LOG] 🚀 Extraction done for https://github.com/microsoft/TypeScript/commits/main, time taken: 0.75 seconds.\n",
+            "Page 3: Found 35 commits\n",
+            "Successfully crawled 105 commits across 3 pages\n"
+          ]
+        }
+      ],
+      "source": [
+        "import re\n",
+        "from bs4 import BeautifulSoup\n",
+        "\n",
+        "async def crawl_typescript_commits():\n",
+        "    first_commit = \"\"\n",
+        "    async def on_execution_started(page):\n",
+        "        nonlocal first_commit\n",
+        "        try:\n",
+        "            while True:\n",
+        "                await page.wait_for_selector('li.Box-sc-g0xbh4-0 h4')\n",
+        "                commit = await page.query_selector('li.Box-sc-g0xbh4-0 h4')\n",
+        "                commit = await commit.evaluate('(element) => element.textContent')\n",
+        "                commit = re.sub(r'\\s+', '', commit)\n",
+        "                if commit and commit != first_commit:\n",
+        "                    first_commit = commit\n",
+        "                    break\n",
+        "                await asyncio.sleep(0.5)\n",
+        "        except Exception as e:\n",
+        "            print(f\"Warning: New content didn't appear after JavaScript execution: {e}\")\n",
+        "\n",
+        "    async with AsyncWebCrawler(verbose=True) as crawler:\n",
+        "        crawler.crawler_strategy.set_hook('on_execution_started', on_execution_started)\n",
+        "\n",
+        "        url = \"https://github.com/microsoft/TypeScript/commits/main\"\n",
+        "        session_id = \"typescript_commits_session\"\n",
+        "        all_commits = []\n",
+        "\n",
+        "        js_next_page = \"\"\"\n",
+        "        const button = document.querySelector('a[data-testid=\"pagination-next-button\"]');\n",
+        "        if (button) button.click();\n",
+        "        \"\"\"\n",
+        "\n",
+        "        for page in range(3):  # Crawl 3 pages\n",
+        "            result = await crawler.arun(\n",
+        "                url=url,\n",
+        "                session_id=session_id,\n",
+        "                css_selector=\"li.Box-sc-g0xbh4-0\",\n",
+        "                js_code=js_next_page if page > 0 else None,\n",
+        "                bypass_cache=True,\n",
+        "                js_only=page > 0\n",
+        "            )\n",
+        "\n",
+        "            assert result.success, f\"Failed to crawl page {page + 1}\"\n",
+        "\n",
+        "            soup = BeautifulSoup(result.cleaned_html, 'html.parser')\n",
+        "            commits = soup.select(\"li\")\n",
+        "            all_commits.extend(commits)\n",
+        "\n",
+        "            print(f\"Page {page + 1}: Found {len(commits)} commits\")\n",
+        "\n",
+        "        await crawler.crawler_strategy.kill_session(session_id)\n",
+        "        print(f\"Successfully crawled {len(all_commits)} commits across 3 pages\")\n",
+        "\n",
+        "await crawl_typescript_commits()"
+      ]
+    },
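+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "The hook above waits until the first commit's text differs from the previous one. As a minimal sketch of that wait-until-changed idea on its own, the snippet below polls a value with a timeout. `wait_for_change` and `FakePage` are hypothetical names invented for this illustration, not part of Crawl4AI or Playwright, and the timeout is an addition that the hook above does not have.\n",
+        "\n",
+        "```python\n",
+        "import asyncio\n",
+        "\n",
+        "async def wait_for_change(read_value, previous, interval=0.05, timeout=2.0):\n",
+        "    # Poll read_value() until it returns a non-empty value that differs from previous.\n",
+        "    loop = asyncio.get_running_loop()\n",
+        "    deadline = loop.time() + timeout\n",
+        "    while True:\n",
+        "        current = await read_value()\n",
+        "        if current and current != previous:\n",
+        "            return current\n",
+        "        if loop.time() >= deadline:\n",
+        "            raise TimeoutError(f'value still {previous!r} after {timeout}s')\n",
+        "        await asyncio.sleep(interval)\n",
+        "\n",
+        "class FakePage:\n",
+        "    # Hypothetical stand-in for a browser page whose first item changes after a few polls.\n",
+        "    def __init__(self, old, new, polls_before_change=3):\n",
+        "        self.old, self.new, self.remaining = old, new, polls_before_change\n",
+        "\n",
+        "    async def first_item_text(self):\n",
+        "        if self.remaining > 0:\n",
+        "            self.remaining -= 1\n",
+        "            return self.old\n",
+        "        return self.new\n",
+        "\n",
+        "async def demo():\n",
+        "    page = FakePage('commit A', 'commit B')\n",
+        "    return await wait_for_change(page.first_item_text, previous='commit A')\n",
+        "\n",
+        "new_first_item = asyncio.run(demo())\n",
+        "print(new_first_item)\n",
+        "```\n",
+        "\n",
+        "This is written as a standalone script. Inside a notebook, where an event loop is already running, replace `asyncio.run(demo())` with `await demo()`."
+      ]
+    },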
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "EJRnYsp6yZQN"
+      },
+      "source": [
+        "### Using JsonCssExtractionStrategy for Fast Structured Output"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "1ZMqIzB_8SYp"
+      },
+      "source": [
+        "The JsonCssExtractionStrategy is a powerful feature of Crawl4AI that allows for precise, structured data extraction from web pages. Here's how it works:\n",
+        "\n",
+        "1. You define a schema that describes the pattern of data you're interested in extracting.\n",
+        "2. The schema includes a base selector that identifies repeating elements on the page.\n",
+        "3. Within the schema, you define fields, each with its own selector and type.\n",
+        "4. These field selectors are applied within the context of each base selector element.\n",
+        "5. The strategy supports nested structures, lists within lists, and various data types.\n",
+        "6. You can even include computed fields for more complex data manipulation.\n",
+        "\n",
+        "This approach allows for highly flexible and precise data extraction, transforming semi-structured web content into clean, structured JSON data. It's particularly useful for extracting consistent data patterns from pages like product listings, news articles, or search results.\n",
+        "\n",
+        "For more details and advanced usage, check out the full documentation on the Crawl4AI website."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 12,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "trCMR2T9yZQN",
+        "outputId": "718d36f4-cccf-40f4-8d8c-c3ba73524d16"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "[LOG] 🌤️  Warming up the AsyncWebCrawler\n",
+            "[LOG] 🌞 AsyncWebCrawler is ready to crawl\n",
+            "[LOG] 🕸️ Crawling https://www.nbcnews.com/business using AsyncPlaywrightCrawlerStrategy...\n",
+            "[LOG] ✅ Crawled https://www.nbcnews.com/business successfully!\n",
+            "[LOG] 🚀 Crawling done for https://www.nbcnews.com/business, success: True, time taken: 7.00 seconds\n",
+            "[LOG] 🚀 Content extracted for https://www.nbcnews.com/business, success: True, time taken: 0.32 seconds\n",
+            "[LOG] 🔥 Extracting semantic blocks for https://www.nbcnews.com/business, Strategy: AsyncWebCrawler\n",
+            "[LOG] 🚀 Extraction done for https://www.nbcnews.com/business, time taken: 0.48 seconds.\n",
+            "Successfully extracted 11 news teasers\n",
+            "{\n",
+            "  \"category\": \"Business News\",\n",
+            "  \"headline\": \"NBC ripped up its Olympics playbook for 2024 \\u2014 so far, the new strategy paid off\",\n",
+            "  \"summary\": \"The Olympics have long been key to NBCUniversal. Paris marked the 18th Olympic Games broadcast by NBC in the U.S.\",\n",
+            "  \"time\": \"13h ago\",\n",
+            "  \"image\": {\n",
+            "    \"src\": \"https://media-cldnry.s-nbcnews.com/image/upload/t_focal-200x100,f_auto,q_auto:best/rockcms/2024-09/240903-nbc-olympics-ch-1344-c7a486.jpg\",\n",
+            "    \"alt\": \"Mike Tirico.\"\n",
+            "  },\n",
+            "  \"link\": \"https://www.nbcnews.com/business\"\n",
+            "}\n"
+          ]
+        }
+      ],
+      "source": [
+        "async def extract_news_teasers():\n",
+        "    schema = {\n",
+        "        \"name\": \"News Teaser Extractor\",\n",
+        "        \"baseSelector\": \".wide-tease-item__wrapper\",\n",
+        "        \"fields\": [\n",
+        "            {\n",
+        "                \"name\": \"category\",\n",
+        "                \"selector\": \".unibrow span[data-testid='unibrow-text']\",\n",
+        "                \"type\": \"text\",\n",
+        "            },\n",
+        "            {\n",
+        "                \"name\": \"headline\",\n",
+        "                \"selector\": \".wide-tease-item__headline\",\n",
+        "                \"type\": \"text\",\n",
+        "            },\n",
+        "            {\n",
+        "                \"name\": \"summary\",\n",
+        "                \"selector\": \".wide-tease-item__description\",\n",
+        "                \"type\": \"text\",\n",
+        "            },\n",
+        "            {\n",
+        "                \"name\": \"time\",\n",
+        "                \"selector\": \"[data-testid='wide-tease-date']\",\n",
+        "                \"type\": \"text\",\n",
+        "            },\n",
+        "            {\n",
+        "                \"name\": \"image\",\n",
+        "                \"type\": \"nested\",\n",
+        "                \"selector\": \"picture.teasePicture img\",\n",
+        "                \"fields\": [\n",
+        "                    {\"name\": \"src\", \"type\": \"attribute\", \"attribute\": \"src\"},\n",
+        "                    {\"name\": \"alt\", \"type\": \"attribute\", \"attribute\": \"alt\"},\n",
+        "                ],\n",
+        "            },\n",
+        "            {\n",
+        "                \"name\": \"link\",\n",
+        "                \"selector\": \"a[href]\",\n",
+        "                \"type\": \"attribute\",\n",
+        "                \"attribute\": \"href\",\n",
+        "            },\n",
+        "        ],\n",
+        "    }\n",
+        "\n",
+        "    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)\n",
+        "\n",
+        "    async with AsyncWebCrawler(verbose=True) as crawler:\n",
+        "        result = await crawler.arun(\n",
+        "            url=\"https://www.nbcnews.com/business\",\n",
+        "            extraction_strategy=extraction_strategy,\n",
+        "            bypass_cache=True,\n",
+        "        )\n",
+        "\n",
+        "        assert result.success, \"Failed to crawl the page\"\n",
+        "\n",
+        "        news_teasers = json.loads(result.extracted_content)\n",
+        "        print(f\"Successfully extracted {len(news_teasers)} news teasers\")\n",
+        "        print(json.dumps(news_teasers[0], indent=2))\n",
+        "\n",
+        "await extract_news_teasers()"
+      ]
+    },
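+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "To make the base-selector and field-selector idea above concrete, here is a toy sketch of the same pattern using only the standard library. It is not how `JsonCssExtractionStrategy` is implemented: the markup, the `extract` helper, and the schema keys `base` and `path` are assumptions made for this illustration, and it uses ElementTree paths instead of CSS selectors.\n",
+        "\n",
+        "```python\n",
+        "import json\n",
+        "import xml.etree.ElementTree as ET\n",
+        "\n",
+        "# Hypothetical, well-formed markup standing in for a crawled page.\n",
+        "MARKUP = '''\n",
+        "<div>\n",
+        "  <article><h2>First headline</h2><a href='/one'>Read</a></article>\n",
+        "  <article><h2>Second headline</h2><a href='/two'>Read</a></article>\n",
+        "</div>\n",
+        "'''\n",
+        "\n",
+        "# Toy schema: one base path for the repeating element, plus per-field paths.\n",
+        "SCHEMA = {\n",
+        "    'base': './/article',\n",
+        "    'fields': [\n",
+        "        {'name': 'headline', 'path': 'h2', 'type': 'text'},\n",
+        "        {'name': 'link', 'path': 'a', 'type': 'attribute', 'attribute': 'href'},\n",
+        "    ],\n",
+        "}\n",
+        "\n",
+        "def extract(markup, schema):\n",
+        "    root = ET.fromstring(markup)\n",
+        "    rows = []\n",
+        "    # One output row per element matched by the base path.\n",
+        "    for item in root.findall(schema['base']):\n",
+        "        row = {}\n",
+        "        for field in schema['fields']:\n",
+        "            # Field paths are resolved relative to the current item.\n",
+        "            node = item.find(field['path'])\n",
+        "            if node is None:\n",
+        "                row[field['name']] = None\n",
+        "            elif field['type'] == 'attribute':\n",
+        "                row[field['name']] = node.get(field['attribute'])\n",
+        "            else:\n",
+        "                row[field['name']] = (node.text or '').strip()\n",
+        "        rows.append(row)\n",
+        "    return rows\n",
+        "\n",
+        "rows = extract(MARKUP, SCHEMA)\n",
+        "print(json.dumps(rows, indent=2))\n",
+        "```\n",
+        "\n",
+        "Because each field path is looked up inside a single matched item, the headline and link of one article never mix with those of another."
+      ]
+    },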
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "FnyVhJaByZQN"
+      },
+      "source": [
+        "## Speed Comparison\n",
+        "\n",
+        "Let's compare the speed of Crawl4AI with Firecrawl, a paid service. The cell below calls Firecrawl's hosted API (even though its output label reads \"simulated\"), so it needs a `FIRECRAWL_API_KEY` stored in your Colab secrets."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "agDD186f3wig"
+      },
+      "source": [
+        "💡 **Note on Speed Comparison:**\n",
+        "\n",
+        "This speed test runs on Google Colab, where network speed and performance vary and may not reflect optimal conditions. When we call Firecrawl's API, we see its best performance, while Crawl4AI's performance is limited by Colab's network speed.\n",
+        "\n",
+        "For a more accurate comparison, run these tests on your own server or computer with a stable, fast internet connection. Even with these limitations, Crawl4AI's simple crawl was slightly faster in the recorded run (4.22 s versus 4.38 s).\n",
+        "\n",
+        "If you run these tests locally, you may observe an even more significant speed advantage for Crawl4AI compared to other services."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "F7KwHv8G1LbY"
+      },
+      "outputs": [],
+      "source": [
+        "!pip install firecrawl"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 4,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "91813zILyZQN",
+        "outputId": "663223db-ab89-4976-b233-05ceca62b19b"
+      },
+      "outputs": [
+        {
+          "name": "stdout",
+          "output_type": "stream",
+          "text": [
+            "Firecrawl (simulated):\n",
+            "Time taken: 4.38 seconds\n",
+            "Content length: 41967 characters\n",
+            "Images found: 49\n",
+            "\n",
+            "Crawl4AI (simple crawl):\n",
+            "Time taken: 4.22 seconds\n",
+            "Content length: 18221 characters\n",
+            "Images found: 49\n",
+            "\n",
+            "Crawl4AI (with JavaScript execution):\n",
+            "Time taken: 9.13 seconds\n",
+            "Content length: 34243 characters\n",
+            "Images found: 89\n"
+          ]
+        }
+      ],
+      "source": [
+        "import os\n",
+        "from google.colab import userdata\n",
+        "os.environ['FIRECRAWL_API_KEY'] = userdata.get('FIRECRAWL_API_KEY')\n",
+        "import time\n",
+        "from firecrawl import FirecrawlApp\n",
+        "\n",
+        "async def speed_comparison():\n",
+        "    # Firecrawl: timed call to the hosted API\n",
+        "    app = FirecrawlApp(api_key=os.environ['FIRECRAWL_API_KEY'])\n",
+        "    start = time.time()\n",
+        "    scrape_status = app.scrape_url(\n",
+        "    'https://www.nbcnews.com/business',\n",
+        "    params={'formats': ['markdown', 'html']}\n",
+        "    )\n",
+        "    end = time.time()\n",
+        "    print(\"Firecrawl (simulated):\")\n",
+        "    print(f\"Time taken: {end - start:.2f} seconds\")\n",
+        "    print(f\"Content length: {len(scrape_status['markdown'])} characters\")\n",
+        "    print(f\"Images found: {scrape_status['markdown'].count('cldnry.s-nbcnews.com')}\")\n",
+        "    print()\n",
+        "\n",
+        "    async with AsyncWebCrawler() as crawler:\n",
+        "        # Crawl4AI simple crawl\n",
+        "        start = time.time()\n",
+        "        result = await crawler.arun(\n",
+        "            url=\"https://www.nbcnews.com/business\",\n",
+        "            word_count_threshold=0,\n",
+        "            bypass_cache=True,\n",
+        "            verbose=False\n",
+        "        )\n",
+        "        end = time.time()\n",
+        "        print(\"Crawl4AI (simple crawl):\")\n",
+        "        print(f\"Time taken: {end - start:.2f} seconds\")\n",
+        "        print(f\"Content length: {len(result.markdown)} characters\")\n",
+        "        print(f\"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}\")\n",
+        "        print()\n",
+        "\n",
+        "        # Crawl4AI with JavaScript execution\n",
+        "        start = time.time()\n",
+        "        result = await crawler.arun(\n",
+        "            url=\"https://www.nbcnews.com/business\",\n",
+        "            js_code=[\"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();\"],\n",
+        "            word_count_threshold=0,\n",
+        "            bypass_cache=True,\n",
+        "            verbose=False\n",
+        "        )\n",
+        "        end = time.time()\n",
+        "        print(\"Crawl4AI (with JavaScript execution):\")\n",
+        "        print(f\"Time taken: {end - start:.2f} seconds\")\n",
+        "        print(f\"Content length: {len(result.markdown)} characters\")\n",
+        "        print(f\"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}\")\n",
+        "\n",
+        "await speed_comparison()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "OBFFYVJIyZQN"
+      },
+      "source": [
+        "On a local machine with a fast, stable internet connection:\n",
+        "- Simple crawl: Crawl4AI is typically 3-4 times faster than Firecrawl.\n",
+        "- With JavaScript execution: Even when executing JavaScript to load more content (potentially doubling the number of images found), Crawl4AI is still faster than Firecrawl's simple crawl.\n",
+        "\n",
+        "Please note that actual performance may vary depending on network conditions and the specific content being crawled."
+      ]
+    },
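+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "Because timings depend on network conditions, a single measurement can mislead. One way to reduce that noise, sketched here under the assumption that the crawl can be wrapped in a plain callable, is to repeat the measurement and report the median and the best run. `time_runs` and `fake_crawl` are hypothetical helpers for illustration only, and `fake_crawl` sleeps instead of touching the network.\n",
+        "\n",
+        "```python\n",
+        "import statistics\n",
+        "import time\n",
+        "\n",
+        "def time_runs(func, repeats=5):\n",
+        "    # Return (median, best) wall-clock seconds over several calls to func.\n",
+        "    durations = []\n",
+        "    for _ in range(repeats):\n",
+        "        start = time.perf_counter()\n",
+        "        func()\n",
+        "        durations.append(time.perf_counter() - start)\n",
+        "    return statistics.median(durations), min(durations)\n",
+        "\n",
+        "def fake_crawl():\n",
+        "    # Hypothetical stand-in for a crawl: sleeps briefly instead of fetching a page.\n",
+        "    time.sleep(0.01)\n",
+        "\n",
+        "median_s, best_s = time_runs(fake_crawl)\n",
+        "print(f'median {median_s:.3f}s, best {best_s:.3f}s')\n",
+        "```\n",
+        "\n",
+        "The median is less sensitive to one slow request than a single timing, while the best run gives a rough lower bound for the environment."
+      ]
+    },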
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "A6_1RK1_yZQO"
+      },
+      "source": [
+        "## Conclusion\n",
+        "\n",
+        "In this notebook, we've explored the powerful features of Crawl4AI, including:\n",
+        "\n",
+        "1. Basic crawling\n",
+        "2. JavaScript execution and CSS selector usage\n",
+        "3. Proxy support\n",
+        "4. Structured data extraction with OpenAI\n",
+        "5. Advanced multi-page crawling with JavaScript execution\n",
+        "6. Fast structured output using JsonCssExtractionStrategy\n",
+        "7. Speed comparison with other services\n",
+        "\n",
+        "Crawl4AI offers a fast, flexible, and powerful solution for web crawling and data extraction tasks. Its asynchronous architecture and advanced features make it suitable for a wide range of applications, from simple web scraping to complex, multi-page data extraction scenarios.\n",
+        "\n",
+        "For more information and advanced usage, please visit the [Crawl4AI documentation](https://crawl4ai.com/mkdocs/).\n",
+        "\n",
+        "Happy crawling!"
+      ]
+    }
+  ],
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "display_name": "venv",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.10.13"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 0
+}
--- a/docs/examples/research_assistant.py
+++ b/docs/examples/research_assistant.py
@@ -0,0 +1,195 @@
+# Make sure to install the required packages: chainlit and groq
+import os, time
+from openai import AsyncOpenAI
+import chainlit as cl
+import re
+import requests
+from io import BytesIO
+from chainlit.element import ElementBased
+from groq import Groq
+
+# Import threadpools to run the crawl_url function in a separate thread
+from concurrent.futures import ThreadPoolExecutor
+
+client = AsyncOpenAI(base_url="https://api.groq.com/openai/v1", api_key=os.getenv("GROQ_API_KEY"))
+
+# Instrument the OpenAI client
+cl.instrument_openai()
+
+settings = {
+    "model": "llama3-8b-8192",
+    "temperature": 0.5,
+    "max_tokens": 500,
+    "top_p": 1,
+    "frequency_penalty": 0,
+    "presence_penalty": 0,
+}
+
+def extract_urls(text):
+    url_pattern = re.compile(r'(https?://\S+)')
+    return url_pattern.findall(text)
+
+def crawl_url(url):
+    data = {
+        "urls": [url],
+        "include_raw_html": True,
+        "word_count_threshold": 10,
+        "extraction_strategy": "NoExtractionStrategy",
+        "chunking_strategy": "RegexChunking"
+    }
+    response = requests.post("https://crawl4ai.com/crawl", json=data)
+    response_data = response.json()
+    response_data = response_data['results'][0]
+    return response_data['markdown']
+
+@cl.on_chat_start
+async def on_chat_start():
+    cl.user_session.set("session", {
+        "history": [],
+        "context": {}
+    })  
+    await cl.Message(
+        content="Welcome to the chat! How can I assist you today?"
+    ).send()
+
+@cl.on_message
+async def on_message(message: cl.Message):
+    user_session = cl.user_session.get("session")
+    
+    # Extract URLs from the user's message
+    urls = extract_urls(message.content)
+    
+    
+    futures = []
+    with ThreadPoolExecutor() as executor:
+        for url in urls:
+            futures.append(executor.submit(crawl_url, url))
+
+    results = [future.result() for future in futures]
+
+    for url, result in zip(urls, results):
+        ref_number = f"REF_{len(user_session['context']) + 1}"
+        user_session["context"][ref_number] = {
+            "url": url,
+            "content": result
+        }    
+
+
+    user_session["history"].append({
+        "role": "user",
+        "content": message.content
+    })
+
+    # Create a system message that includes the context
+    context_messages = [
+        f'<appendix ref="{ref}">\n{data["content"]}\n</appendix>'
+        for ref, data in user_session["context"].items()
+    ]
+    if context_messages:
+        system_message = {
+            "role": "system",
+            "content": (
+                "You are a helpful bot. Use the following context for answering questions. "
+                "Refer to the sources using the REF number in square brackets, e.g., [1], only if the source is given in the appendices below.\n\n"
+                "If the question requires any information from the provided appendices or context, refer to the sources. "
+                "If not, there is no need to add a references section. "
+                "At the end of your response, provide a reference section listing the URLs and their REF numbers only if sources from the appendices were used.\n\n"
+                + "\n\n".join(context_messages)
+            )
+        }
+    else:
+        system_message = {
+            "role": "system",
+            "content": "You are a helpful assistant."
+        }
+
+
+    msg = cl.Message(content="")
+    await msg.send()
+
+    # Get response from the LLM
+    stream = await client.chat.completions.create(
+        messages=[
+            system_message,
+            *user_session["history"]
+        ],
+        stream=True,
+        **settings
+    )
+
+    assistant_response = ""
+    async for part in stream:
+        if token := part.choices[0].delta.content:
+            assistant_response += token
+            await msg.stream_token(token)
+
+    # Add assistant message to the history
+    user_session["history"].append({
+        "role": "assistant",
+        "content": assistant_response
+    })
+    await msg.update()
+
+    # Append the reference section to the assistant's response
+    reference_section = "\n\nReferences:\n"
+    for ref, data in user_session["context"].items():
+        reference_section += f"[{ref.split('_')[1]}]: {data['url']}\n"
+
+    msg.content += reference_section
+    await msg.update()
+
+
+@cl.on_audio_chunk
+async def on_audio_chunk(chunk: cl.AudioChunk):
+    if chunk.isStart:
+        buffer = BytesIO()
+        # This is required for whisper to recognize the file type
+        buffer.name = f"input_audio.{chunk.mimeType.split('/')[1]}"
+        # Initialize the session for a new audio stream
+        cl.user_session.set("audio_buffer", buffer)
+        cl.user_session.set("audio_mime_type", chunk.mimeType)
+
+    # Write the chunks to a buffer and transcribe the whole audio at the end
+    cl.user_session.get("audio_buffer").write(chunk.data)
+
+    pass
+
+@cl.step(type="tool")
+async def speech_to_text(audio_file):
+    cli = Groq()
+       
+    response = await client.audio.transcriptions.create(
+        model="whisper-large-v3", file=audio_file
+    )
+
+    return response.text
+
+
+@cl.on_audio_end
+async def on_audio_end(elements: list[ElementBased]):
+    # Get the audio buffer from the session
+    audio_buffer: BytesIO = cl.user_session.get("audio_buffer")
+    audio_buffer.seek(0)  # Move the file pointer to the beginning
+    audio_file = audio_buffer.read()
+    audio_mime_type: str = cl.user_session.get("audio_mime_type")
+    
+    start_time = time.time()
+    whisper_input = (audio_buffer.name, audio_file, audio_mime_type)
+    transcription = await speech_to_text(whisper_input)
+    end_time = time.time()
+    print(f"Transcription took {end_time - start_time} seconds")
+    
+    user_msg = cl.Message(
+        author="You", 
+        type="user_message",
+        content=transcription
+    )
+    await user_msg.send()
+    await on_message(user_msg)
+
+
+if __name__ == "__main__":
+    from chainlit.cli import run_chainlit
+    run_chainlit(__file__)
+
+
--- a/docs/examples/rest_call.py
+++ b/docs/examples/rest_call.py
@@ -0,0 +1,64 @@
+
+import requests, base64, os
+
+data = {
+    "urls": ["https://www.nbcnews.com/business"],
+    "screenshot": True,
+}
+
+response = requests.post("https://crawl4ai.com/crawl", json=data) 
+result = response.json()['results'][0]
+print(result.keys())
+# dict_keys(['url', 'html', 'success', 'cleaned_html', 'media', 
+# 'links', 'screenshot', 'markdown', 'extracted_content', 
+# 'metadata', 'error_message'])
+with open("screenshot.png", "wb") as f:
+    f.write(base64.b64decode(result['screenshot']))
+    
+# Example of filtering the content using CSS selectors
+data = {
+    "urls": [
+        "https://www.nbcnews.com/business"
+    ],
+    "css_selector": "article",
+    "screenshot": True,
+}
+
+# Example of executing a JS script on the page before extracting the content
+data = {
+    "urls": [
+        "https://www.nbcnews.com/business"
+    ],
+    "screenshot": True,
+    'js' : ["""
+    const loadMoreButton = Array.from(document.querySelectorAll('button')).
+    find(button => button.textContent.includes('Load More'));
+    loadMoreButton && loadMoreButton.click();
+    """]
+}
+
+# Example of using a custom extraction strategy
+data = {
+    "urls": [
+        "https://www.nbcnews.com/business"
+    ],
+    "extraction_strategy": "CosineStrategy",
+    "extraction_strategy_args": {
+        "semantic_filter": "inflation rent prices"
+    },
+}
+
+# Example of using LLM to extract content
+data = {
+    "urls": [
+        "https://www.nbcnews.com/business"
+    ],
+    "extraction_strategy": "LLMExtractionStrategy",
+    "extraction_strategy_args": {
+        "provider": "groq/llama3-8b-8192",
+        "api_token": os.environ.get("GROQ_API_KEY"),
+        "instruction": """I am interested in only financial news, 
+        and translate them in French."""
+    },
+}
+
--- a/docs/examples/sample_ecommerce.html
+++ b/docs/examples/sample_ecommerce.html
@@ -0,0 +1,106 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+    <meta charset="UTF-8">
+    <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <title>Sample E-commerce Page for JsonCssExtractionStrategy Testing</title>
+    <style>
+        body { font-family: Arial, sans-serif; line-height: 1.6; padding: 20px; }
+        .category { border: 1px solid #ddd; margin-bottom: 20px; padding: 10px; }
+        .product { border: 1px solid #eee; margin: 10px 0; padding: 10px; }
+        .product-details, .product-reviews, .related-products { margin-top: 10px; }
+        .review { background-color: #f9f9f9; margin: 5px 0; padding: 5px; }
+    </style>
+</head>
+<body>
+    <h1>Sample E-commerce Product Catalog</h1>
+    <div id="catalog"></div>
+
+    <script>
+        const categories = ['Electronics', 'Home & Kitchen', 'Books'];
+        const products = [
+            {
+                name: 'Smartphone X',
+                price: '$999',
+                brand: 'TechCorp',
+                model: 'X-2000',
+                features: ['5G capable', '6.5" OLED screen', '128GB storage'],
+                reviews: [
+                    { reviewer: 'John D.', rating: '4.5', text: 'Great phone, love the camera!' },
+                    { reviewer: 'Jane S.', rating: '5', text: 'Best smartphone I\'ve ever owned.' }
+                ],
+                related: [
+                    { name: 'Phone Case', price: '$29.99' },
+                    { name: 'Screen Protector', price: '$9.99' }
+                ]
+            },
+            {
+                name: 'Laptop Pro',
+                price: '$1499',
+                brand: 'TechMaster',
+                model: 'LT-3000',
+                features: ['Intel i7 processor', '16GB RAM', '512GB SSD'],
+                reviews: [
+                    { reviewer: 'Alice W.', rating: '4', text: 'Powerful machine, but a bit heavy.' },
+                    { reviewer: 'Bob M.', rating: '5', text: 'Perfect for my development work!' }
+                ],
+                related: [
+                    { name: 'Laptop Bag', price: '$49.99' },
+                    { name: 'Wireless Mouse', price: '$24.99' }
+                ]
+            }
+        ];
+
+        function createProductHTML(product) {
+            return `
+                <div class="product">
+                    <h3 class="product-name">${product.name}</h3>
+                    <p class="product-price">${product.price}</p>
+                    <div class="product-details">
+                        <span class="brand">${product.brand}</span>
+                        <span class="model">${product.model}</span>
+                    </div>
+                    <ul class="product-features">
+                        ${product.features.map(feature => `<li>${feature}</li>`).join('')}
+                    </ul>
+                    <div class="product-reviews">
+                        ${product.reviews.map(review => `
+                            <div class="review">
+                                <span class="reviewer">${review.reviewer}</span>
+                                <span class="rating">${review.rating}</span>
+                                <p class="review-text">${review.text}</p>
+                            </div>
+                        `).join('')}
+                    </div>
+                    <ul class="related-products">
+                        ${product.related.map(item => `
+                            <li>
+                                <span class="related-name">${item.name}</span>
+                                <span class="related-price">${item.price}</span>
+                            </li>
+                        `).join('')}
+                    </ul>
+                </div>
+            `;
+        }
+
+        function createCategoryHTML(category, products) {
+            return `
+                <div class="category">
+                    <h2 class="category-name">${category}</h2>
+                    ${products.map(createProductHTML).join('')}
+                </div>
+            `;
+        }
+
+        function populateCatalog() {
+            const catalog = document.getElementById('catalog');
+            categories.forEach(category => {
+                catalog.innerHTML += createCategoryHTML(category, products);
+            });
+        }
+
+        populateCatalog();
+    </script>
+</body>
+</html>
--- a/docs/examples/summarize_page.py
+++ b/docs/examples/summarize_page.py
@@ -0,0 +1,46 @@
+import os
+import time
+import json
+from crawl4ai.web_crawler import WebCrawler
+from crawl4ai.chunking_strategy import *
+from crawl4ai.extraction_strategy import *
+from crawl4ai.crawler_strategy import *
+
+url = r'https://marketplace.visualstudio.com/items?itemName=Unclecode.groqopilot'
+
+crawler = WebCrawler()
+crawler.warmup()
+
+from pydantic import BaseModel, Field
+
+class PageSummary(BaseModel):
+    title: str = Field(..., description="Title of the page.")
+    summary: str = Field(..., description="Summary of the page.")
+    brief_summary: str = Field(..., description="Brief summary of the page.")
+    keywords: list = Field(..., description="Keywords assigned to the page.")
+
+result = crawler.run(
+    url=url,
+    word_count_threshold=1,
+    extraction_strategy=LLMExtractionStrategy(
+        provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'),
+        schema=PageSummary.model_json_schema(),
+        extraction_type="schema",
+        apply_chunking=False,
+        instruction="From the crawled content, extract the following details: "\
+            "1. Title of the page "\
+            "2. Summary of the page, which is a detailed summary "\
+            "3. Brief summary of the page, which is a paragraph text "\
+            "4. Keywords assigned to the page, which is a list of keywords. "\
+            'The extracted JSON format should look like this: '\
+            '{ "title": "Page Title", "summary": "Detailed summary of the page.", "brief_summary": "Brief summary in a paragraph.", "keywords": ["keyword1", "keyword2", "keyword3"] }'
+    ),
+    bypass_cache=True,
+)
+
+page_summary = json.loads(result.extracted_content)
+
+print(page_summary)
+
+with open(".data/page_summary.json", "w", encoding="utf-8") as f:
+    f.write(result.extracted_content)
--- a/docs/examples/tmp/chainlit_review.py
+++ b/docs/examples/tmp/chainlit_review.py
@@ -0,0 +1,281 @@
+from openai import AsyncOpenAI
+from chainlit.types import ThreadDict
+import chainlit as cl
+from chainlit.input_widget import Select, Switch, Slider
+client = AsyncOpenAI()
+
+# Instrument the OpenAI client
+cl.instrument_openai()
+
+settings = {
+    "model": "gpt-3.5-turbo",
+    "temperature": 0.5,
+    "max_tokens": 500,
+    "top_p": 1,
+    "frequency_penalty": 0,
+    "presence_penalty": 0,
+}
+
+@cl.action_callback("action_button")
+async def on_action(action: cl.Action):
+    print("The user clicked on the action button!")
+
+    return "Thank you for clicking on the action button!"
+
+@cl.set_chat_profiles
+async def chat_profile():
+    return [
+        cl.ChatProfile(
+            name="GPT-3.5",
+            markdown_description="The underlying LLM model is **GPT-3.5**.",
+            icon="https://picsum.photos/200",
+        ),
+        cl.ChatProfile(
+            name="GPT-4",
+            markdown_description="The underlying LLM model is **GPT-4**.",
+            icon="https://picsum.photos/250",
+        ),
+    ]
+
+@cl.on_chat_start
+async def on_chat_start():
+    
+    settings = await cl.ChatSettings(
+        [
+            Select(
+                id="Model",
+                label="OpenAI - Model",
+                values=["gpt-3.5-turbo", "gpt-3.5-turbo-16k", "gpt-4", "gpt-4-32k"],
+                initial_index=0,
+            ),
+            Switch(id="Streaming", label="OpenAI - Stream Tokens", initial=True),
+            Slider(
+                id="Temperature",
+                label="OpenAI - Temperature",
+                initial=1,
+                min=0,
+                max=2,
+                step=0.1,
+            ),
+            Slider(
+                id="SAI_Steps",
+                label="Stability AI - Steps",
+                initial=30,
+                min=10,
+                max=150,
+                step=1,
+                description="Amount of inference steps performed on image generation.",
+            ),
+            Slider(
+                id="SAI_Cfg_Scale",
+                label="Stability AI - Cfg_Scale",
+                initial=7,
+                min=1,
+                max=35,
+                step=0.1,
+                description="Influences how strongly your generation is guided to match your prompt.",
+            ),
+            Slider(
+                id="SAI_Width",
+                label="Stability AI - Image Width",
+                initial=512,
+                min=256,
+                max=2048,
+                step=64,
+                tooltip="Measured in pixels",
+            ),
+            Slider(
+                id="SAI_Height",
+                label="Stability AI - Image Height",
+                initial=512,
+                min=256,
+                max=2048,
+                step=64,
+                tooltip="Measured in pixels",
+            ),
+        ]
+    ).send()
+    
+    chat_profile = cl.user_session.get("chat_profile")
+    await cl.Message(
+        content=f"starting chat using the {chat_profile} chat profile"
+    ).send()
+    
+    print("A new chat session has started!")
+    cl.user_session.set("session", {
+        "history": [],
+        "context": []
+    })  
+    
+    image = cl.Image(url="https://c.tenor.com/uzWDSSLMCmkAAAAd/tenor.gif", name="cat image", display="inline")
+
+    # Attach the image to the message
+    await cl.Message(
+        content="You are such a good girl, aren't you?!",
+        elements=[image],
+    ).send()
+    
+    text_content = "Hello, this is a text element."
+    elements = [
+        cl.Text(name="simple_text", content=text_content, display="inline")
+    ]
+
+    await cl.Message(
+        content="Check out this text element!",
+        elements=elements,
+    ).send()
+    
+    elements = [
+        cl.Audio(path="./assets/audio.mp3", display="inline"),
+    ]
+    await cl.Message(
+        content="Here is an audio file",
+        elements=elements,
+    ).send()
+    
+    await cl.Avatar(
+        name="Tool 1",
+        url="https://avatars.githubusercontent.com/u/128686189?s=400&u=a1d1553023f8ea0921fba0debbe92a8c5f840dd9&v=4",
+    ).send()
+    
+    await cl.Message(
+        content="This message should not have an avatar!", author="Tool 0"
+    ).send()
+    
+    await cl.Message(
+        content="This message should have an avatar!", author="Tool 1"
+    ).send()
+    
+    elements = [
+        cl.File(
+            name="quickstart.py",
+            path="./quickstart.py",
+            display="inline",
+        ),
+    ]
+
+    await cl.Message(
+        content="This message has a file element", elements=elements
+    ).send()
+    
+    # Sending an action button within a chatbot message
+    actions = [
+        cl.Action(name="action_button", value="example_value", description="Click me!")
+    ]
+
+    await cl.Message(content="Interact with this action button:", actions=actions).send()
+    
+    # res = await cl.AskActionMessage(
+    #     content="Pick an action!",
+    #     actions=[
+    #         cl.Action(name="continue", value="continue", label="✅ Continue"),
+    #         cl.Action(name="cancel", value="cancel", label="❌ Cancel"),
+    #     ],
+    # ).send()
+
+    # if res and res.get("value") == "continue":
+    #     await cl.Message(
+    #         content="Continue!",
+    #     ).send()
+    
+    # import plotly.graph_objects as go
+    # fig = go.Figure(
+    #     data=[go.Bar(y=[2, 1, 3])],
+    #     layout_title_text="An example figure",
+    # )
+    # elements = [cl.Plotly(name="chart", figure=fig, display="inline")]
+
+    # await cl.Message(content="This message has a chart", elements=elements).send()
+    
+    # Sending a pdf with the local file path
+    # elements = [
+    #   cl.Pdf(name="pdf1", display="inline", path="./pdf1.pdf")
+    # ]
+
+    # await cl.Message(content="Look at this local pdf!", elements=elements).send()
+
+@cl.on_settings_update
+async def setup_agent(settings):
+    print("on_settings_update", settings)
+    
+@cl.on_stop
+def on_stop():
+    print("The user wants to stop the task!")
+
+@cl.on_chat_end
+def on_chat_end():
+    print("The user disconnected!")
+
+
+@cl.on_chat_resume
+async def on_chat_resume(thread: ThreadDict):
+    print("The user resumed a previous chat session!")
+
+
+
+
+# @cl.on_message
+async def on_message(message: cl.Message):
+    cl.user_session.get("session")["history"].append({
+        "role": "user",
+        "content": message.content
+    })    
+    response = await client.chat.completions.create(
+        messages=[
+            {
+                "content": "You are a helpful bot",
+                "role": "system"
+            },
+            *cl.user_session.get("session")["history"]
+        ],
+        **settings
+    )
+    
+
+    # Add assistant message to the history
+    cl.user_session.get("session")["history"].append({
+        "role": "assistant",
+        "content": response.choices[0].message.content
+    })
+    
+    # msg.content = response.choices[0].message.content
+    # await msg.update()
+    
+    # await cl.Message(content=response.choices[0].message.content).send()
+
+@cl.on_message
+async def on_message(message: cl.Message):
+    cl.user_session.get("session")["history"].append({
+        "role": "user",
+        "content": message.content
+    })    
+
+    msg = cl.Message(content="")
+    await msg.send()    
+    
+    stream = await client.chat.completions.create(
+        messages=[
+            {
+                "content": "You are a helpful bot",
+                "role": "system"
+            },
+            *cl.user_session.get("session")["history"]
+        ],
+        stream = True, 
+        **settings
+    )
+    
+    async for part in stream:
+        if token := part.choices[0].delta.content or "":
+            await msg.stream_token(token)
+    
+    # Add assistant message to the history
+    cl.user_session.get("session")["history"].append({
+        "role": "assistant",
+        "content": msg.content
+    })    
+    await msg.update()
+
+if __name__ == "__main__":
+    from chainlit.cli import run_chainlit
+    run_chainlit(__file__)
--- a/docs/examples/tmp/research_assistant_audio_not_completed.py
+++ b/docs/examples/tmp/research_assistant_audio_not_completed.py
@@ -0,0 +1,238 @@
+# Make sure to install the required packages: chainlit and groq
+import os, time
+from openai import AsyncOpenAI
+import chainlit as cl
+import re
+import requests
+from io import BytesIO
+from chainlit.element import ElementBased
+from groq import Groq
+
+# Import threadpools to run the crawl_url function in a separate thread
+from concurrent.futures import ThreadPoolExecutor
+
+client = AsyncOpenAI(base_url="https://api.groq.com/openai/v1", api_key=os.getenv("GROQ_API_KEY"))
+
+# Instrument the OpenAI client
+cl.instrument_openai()
+
+settings = {
+    "model": "llama3-8b-8192",
+    "temperature": 0.5,
+    "max_tokens": 500,
+    "top_p": 1,
+    "frequency_penalty": 0,
+    "presence_penalty": 0,
+}
+
+def extract_urls(text):
+    url_pattern = re.compile(r'(https?://\S+)')
+    return url_pattern.findall(text)
+
+def crawl_url(url):
+    data = {
+        "urls": [url],
+        "include_raw_html": True,
+        "word_count_threshold": 10,
+        "extraction_strategy": "NoExtractionStrategy",
+        "chunking_strategy": "RegexChunking"
+    }
+    response = requests.post("https://crawl4ai.com/crawl", json=data)
+    response_data = response.json()
+    response_data = response_data['results'][0]
+    return response_data['markdown']
+
+@cl.on_chat_start
+async def on_chat_start():
+    cl.user_session.set("session", {
+        "history": [],
+        "context": {}
+    })  
+    await cl.Message(
+        content="Welcome to the chat! How can I assist you today?"
+    ).send()
+
+@cl.on_message
+async def on_message(message: cl.Message):
+    user_session = cl.user_session.get("session")
+    
+    # Extract URLs from the user's message
+    urls = extract_urls(message.content)
+    
+    
+    futures = []
+    with ThreadPoolExecutor() as executor:
+        for url in urls:
+            futures.append(executor.submit(crawl_url, url))
+
+    results = [future.result() for future in futures]
+
+    for url, result in zip(urls, results):
+        ref_number = f"REF_{len(user_session['context']) + 1}"
+        user_session["context"][ref_number] = {
+            "url": url,
+            "content": result
+        }    
+    
+    # for url in urls:
+    #     # Crawl the content of each URL and add it to the session context with a reference number
+    #     ref_number = f"REF_{len(user_session['context']) + 1}"
+    #     crawled_content = crawl_url(url)
+    #     user_session["context"][ref_number] = {
+    #         "url": url,
+    #         "content": crawled_content
+    #     }
+
+    user_session["history"].append({
+        "role": "user",
+        "content": message.content
+    })
+
+    # Create a system message that includes the context
+    context_messages = [
+        f'<appendix ref="{ref}">\n{data["content"]}\n</appendix>'
+        for ref, data in user_session["context"].items()
+    ]
+    if context_messages:
+        system_message = {
+            "role": "system",
+            "content": (
+                "You are a helpful bot. Use the following context for answering questions. "
+                "Refer to the sources using the REF number in square brackets, e.g., [1], only if the source is given in the appendices below.\n\n"
+                "If the question requires any information from the provided appendices or context, refer to the sources. "
+                "If not, there is no need to add a references section. "
+                "At the end of your response, provide a reference section listing the URLs and their REF numbers only if sources from the appendices were used.\n\n"
+                + "\n\n".join(context_messages)  # explicit "+": otherwise the literals above merge into the join separator
+            )
+        }
+    else:
+        system_message = {
+            "role": "system",
+            "content": "You are a helpful assistant."
+        }
+
+
+    msg = cl.Message(content="")
+    await msg.send()
+
+    # Get response from the LLM
+    stream = await client.chat.completions.create(
+        messages=[
+            system_message,
+            *user_session["history"]
+        ],
+        stream=True,
+        **settings
+    )
+
+    assistant_response = ""
+    async for part in stream:
+        if token := part.choices[0].delta.content:
+            assistant_response += token
+            await msg.stream_token(token)
+
+    # Add assistant message to the history
+    user_session["history"].append({
+        "role": "assistant",
+        "content": assistant_response
+    })
+    await msg.update()
+
+    # Append the reference section to the assistant's response
+    reference_section = "\n\nReferences:\n"
+    for ref, data in user_session["context"].items():
+        reference_section += f"[{ref.split('_')[1]}]: {data['url']}\n"
+
+    msg.content += reference_section
+    await msg.update()
+
+
+@cl.on_audio_chunk
+async def on_audio_chunk(chunk: cl.AudioChunk):
+    if chunk.isStart:
+        buffer = BytesIO()
+        # This is required for whisper to recognize the file type
+        buffer.name = f"input_audio.{chunk.mimeType.split('/')[1]}"
+        # Initialize the session for a new audio stream
+        cl.user_session.set("audio_buffer", buffer)
+        cl.user_session.set("audio_mime_type", chunk.mimeType)
+
+    # Write the chunks to a buffer and transcribe the whole audio at the end
+    cl.user_session.get("audio_buffer").write(chunk.data)
+
+    pass
+
+@cl.step(type="tool")
+async def speech_to_text(audio_file):
+    cli = Groq()
+    
+    # response = cli.audio.transcriptions.create(
+    #     file=audio_file, #(filename, file.read()),
+    #     model="whisper-large-v3",
+    # )
+    
+    response = await client.audio.transcriptions.create(
+        model="whisper-large-v3", file=audio_file
+    )
+
+    return response.text
+
+
+@cl.on_audio_end
+async def on_audio_end(elements: list[ElementBased]):
+    # Get the audio buffer from the session
+    audio_buffer: BytesIO = cl.user_session.get("audio_buffer")
+    audio_buffer.seek(0)  # Move the file pointer to the beginning
+    audio_file = audio_buffer.read()
+    audio_mime_type: str = cl.user_session.get("audio_mime_type")
+
+    # input_audio_el = cl.Audio(
+    #     mime=audio_mime_type, content=audio_file, name=audio_buffer.name
+    # )
+    # await cl.Message(
+    #     author="You", 
+    #     type="user_message",
+    #     content="",
+    #     elements=[input_audio_el, *elements]
+    # ).send()
+    
+    # answer_message = await cl.Message(content="").send()
+    
+    
+    start_time = time.time()
+    whisper_input = (audio_buffer.name, audio_file, audio_mime_type)
+    transcription = await speech_to_text(whisper_input)
+    end_time = time.time()
+    print(f"Transcription took {end_time - start_time} seconds")
+    
+    user_msg = cl.Message(
+        author="You", 
+        type="user_message",
+        content=transcription
+    )
+    await user_msg.send()
+    await on_message(user_msg)
+
+    # images = [file for file in elements if "image" in file.mime]
+
+    # text_answer = await generate_text_answer(transcription, images)
+    
+    # output_name, output_audio = await text_to_speech(text_answer, audio_mime_type)
+    
+    # output_audio_el = cl.Audio(
+    #     name=output_name,
+    #     auto_play=True,
+    #     mime=audio_mime_type,
+    #     content=output_audio,
+    # )
+    
+    # answer_message.elements = [output_audio_el]
+    
+    # answer_message.content = transcription
+    # await answer_message.update()
+
+if __name__ == "__main__":
+    from chainlit.cli import run_chainlit
+    run_chainlit(__file__)
+
+
--- a/docs/extraction_strategies.json
+++ b/docs/extraction_strategies.json
@@ -1,10 +0,0 @@
-{
-    "NoExtractionStrategy": "### NoExtractionStrategy\n\n`NoExtractionStrategy` is a basic extraction strategy that returns the entire HTML content without any modification. It is useful for cases where no specific extraction is required. Only clean html, and amrkdown.\n\n#### Constructor Parameters:\nNone.\n\n#### Example usage:\n```python\nextractor = NoExtractionStrategy()\nextracted_content = extractor.extract(url, html)\n```",
-    
-    "LLMExtractionStrategy": "### LLMExtractionStrategy\n\n`LLMExtractionStrategy` uses a Language Model (LLM) to extract meaningful blocks or chunks from the given HTML content. This strategy leverages an external provider for language model completions.\n\n#### Constructor Parameters:\n- `provider` (str, optional): The provider to use for the language model completions. Default is `DEFAULT_PROVIDER` (e.g., openai/gpt-4).\n- `api_token` (str, optional): The API token for the provider. If not provided, it will try to load from the environment variable `OPENAI_API_KEY`.\n- `instruction` (str, optional): An instruction to guide the LLM on how to perform the extraction. This allows users to specify the type of data they are interested in or set the tone of the response. Default is `None`.\n\n#### Example usage:\n```python\nextractor = LLMExtractionStrategy(provider='openai', api_token='your_api_token', instruction='Extract only news about AI.')\nextracted_content = extractor.extract(url, html)\n```\n\nBy providing clear instructions, users can tailor the extraction process to their specific needs, enhancing the relevance and utility of the extracted content.",
-    
-    "CosineStrategy": "### CosineStrategy\n\n`CosineStrategy` uses hierarchical clustering based on cosine similarity to extract clusters of text from the given HTML content. This strategy is suitable for identifying related content sections.\n\n#### Constructor Parameters:\n- `semantic_filter` (str, optional): A string containing keywords for filtering relevant documents before clustering. If provided, documents are filtered based on their cosine similarity to the keyword filter embedding. Default is `None`.\n- `word_count_threshold` (int, optional): Minimum number of words per cluster. Default is `20`.\n- `max_dist` (float, optional): The maximum cophenetic distance on the dendrogram to form clusters. Default is `0.2`.\n- `linkage_method` (str, optional): The linkage method for hierarchical clustering. Default is `'ward'`.\n- `top_k` (int, optional): Number of top categories to extract. Default is `3`.\n- `model_name` (str, optional): The model name for embedding generation. Default is `'BAAI/bge-small-en-v1.5'`.\n\n#### Example usage:\n```python\nextractor = CosineStrategy(semantic_filter='artificial intelligence', word_count_threshold=10, max_dist=0.2, linkage_method='ward', top_k=3, model_name='BAAI/bge-small-en-v1.5')\nextracted_content = extractor.extract(url, html)\n```\n\n#### Cosine Similarity Filtering\n\nWhen a `semantic_filter` is provided, the `CosineStrategy` applies an embedding-based filtering process to select relevant documents before performing hierarchical clustering.",
-    
-    "TopicExtractionStrategy": "### TopicExtractionStrategy\n\n`TopicExtractionStrategy` uses the TextTiling algorithm to segment the HTML content into topics and extracts keywords for each segment. This strategy is useful for identifying and summarizing thematic content.\n\n#### Constructor Parameters:\n- `num_keywords` (int, optional): Number of keywords to represent each topic segment. Default is `3`.\n\n#### Example usage:\n```python\nextractor = TopicExtractionStrategy(num_keywords=3)\nextracted_content = extractor.extract(url, html)\n```"
-  }
-  
--- a/docs/md_v2/advanced/content-processing.md
+++ b/docs/md_v2/advanced/content-processing.md
@@ -0,0 +1,223 @@
+# Content Processing
+
+Crawl4AI provides powerful content processing capabilities that help you extract clean, relevant content from web pages. This guide covers content cleaning, media handling, link analysis, and metadata extraction.
+
+## Content Cleaning
+
+### Understanding Clean Content
+When crawling web pages, you often encounter a lot of noise - advertisements, navigation menus, footers, popups, and other irrelevant content. Crawl4AI automatically cleans this noise using several approaches:
+
+1. **Basic Cleaning**: Removes unwanted HTML elements and attributes
+2. **Content Relevance**: Identifies and preserves meaningful content blocks
+3. **Layout Analysis**: Understands page structure to identify main content areas
+
+```python
+result = await crawler.arun(
+    url="https://example.com",
+    word_count_threshold=10,        # Remove blocks with fewer words
+    excluded_tags=['form', 'nav'],  # Remove specific HTML tags
+    remove_overlay_elements=True    # Remove popups/modals
+)
+
+# Get clean content
+print(result.cleaned_html)  # Cleaned HTML
+print(result.markdown)      # Clean markdown version
+```
+
+### Fit Markdown: Smart Content Extraction
+One of Crawl4AI's most powerful features is `fit_markdown`. This feature uses advanced heuristics to identify and extract the main content from a webpage while excluding irrelevant elements.
+
+#### How Fit Markdown Works
+- Analyzes content density and distribution
+- Identifies content patterns and structures
+- Removes boilerplate content (headers, footers, sidebars)
+- Preserves the most relevant content blocks
+- Maintains content hierarchy and formatting
+
+#### Perfect For:
+- Blog posts and articles
+- News content
+- Documentation pages
+- Any page with a clear main content area
+
+#### Not Recommended For:
+- E-commerce product listings
+- Search results pages
+- Social media feeds
+- Pages with multiple equal-weight content sections
+
+```python
+result = await crawler.arun(url="https://example.com")
+
+# Get the most relevant content
+main_content = result.fit_markdown
+
+# Compare with regular markdown
+all_content = result.markdown
+
+print(f"Fit Markdown Length: {len(main_content)}")
+print(f"Regular Markdown Length: {len(all_content)}")
+```
+
+#### Example Use Case
+```python
+from crawl4ai import AsyncWebCrawler
+
+async def extract_article_content(url: str) -> str:
+    """Extract main article content from a blog or news site."""
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun(url=url)
+        
+        # fit_markdown will focus on the article content,
+        # excluding navigation, ads, and other distractions
+        return result.fit_markdown
+```
+
+## Media Processing
+
+Crawl4AI provides comprehensive media extraction and analysis capabilities. It automatically detects and processes various types of media elements while maintaining their context and relevance.
+
+### Image Processing
+The library handles various image scenarios, including:
+- Regular images
+- Lazy-loaded images
+- Background images
+- Responsive images
+- Image metadata and context
+
+```python
+result = await crawler.arun(url="https://example.com")
+
+for image in result.media["images"]:
+    # Each image includes rich metadata
+    print(f"Source: {image['src']}")
+    print(f"Alt text: {image['alt']}")
+    print(f"Description: {image['desc']}")
+    print(f"Context: {image['context']}")  # Surrounding text
+    print(f"Relevance score: {image['score']}")  # 0-10 score
+```
+
+### Handling Lazy-Loaded Content
+Crawl4AI already handles lazy loading for media elements. You can also customize the wait time for lazy-loaded content:
+
+```python
+result = await crawler.arun(
+    url="https://example.com",
+    wait_for="css:img[data-src]",  # Wait for lazy images
+    delay_before_return_html=2.0   # Additional wait time
+)
+```
+
+### Video and Audio Content
+The library extracts video and audio elements with their metadata:
+
+```python
+# Process videos
+for video in result.media["videos"]:
+    print(f"Video source: {video['src']}")
+    print(f"Type: {video['type']}")
+    print(f"Duration: {video.get('duration')}")
+    print(f"Thumbnail: {video.get('poster')}")
+
+# Process audio
+for audio in result.media["audios"]:
+    print(f"Audio source: {audio['src']}")
+    print(f"Type: {audio['type']}")
+    print(f"Duration: {audio.get('duration')}")
+```
+
+## Link Analysis
+
+Crawl4AI provides sophisticated link analysis capabilities, helping you understand the relationship between pages and identify important navigation patterns.
+
+### Link Classification
+The library automatically categorizes links into:
+- Internal links (same domain)
+- External links (different domains)
+- Social media links
+- Navigation links
+- Content links
+
+```python
+result = await crawler.arun(url="https://example.com")
+
+# Analyze internal links
+for link in result.links["internal"]:
+    print(f"Internal: {link['href']}")
+    print(f"Link text: {link['text']}")
+    print(f"Context: {link['context']}")  # Surrounding text
+    print(f"Type: {link['type']}")  # nav, content, etc.
+
+# Analyze external links
+for link in result.links["external"]:
+    print(f"External: {link['href']}")
+    print(f"Domain: {link['domain']}")
+    print(f"Type: {link['type']}")
+```
+
+### Smart Link Filtering
+Control which links are included in the results:
+
+```python
+result = await crawler.arun(
+    url="https://example.com",
+    exclude_external_links=True,          # Remove external links
+    exclude_social_media_links=True,      # Remove social media links
+    exclude_social_media_domains=[                # Custom social media domains
+        "facebook.com", "twitter.com", "instagram.com"
+    ],
+    exclude_domains=["ads.example.com"]   # Exclude specific domains
+)
+```
+
+## Metadata Extraction
+
+Crawl4AI automatically extracts and processes page metadata, providing valuable information about the content:
+
+```python
+result = await crawler.arun(url="https://example.com")
+
+metadata = result.metadata
+print(f"Title: {metadata['title']}")
+print(f"Description: {metadata['description']}")
+print(f"Keywords: {metadata['keywords']}")
+print(f"Author: {metadata['author']}")
+print(f"Published Date: {metadata['published_date']}")
+print(f"Modified Date: {metadata['modified_date']}")
+print(f"Language: {metadata['language']}")
+```
+
+## Best Practices
+
+1. **Use Fit Markdown for Articles**
+   ```python
+   # Perfect for blog posts, news articles, documentation
+   content = result.fit_markdown
+   ```
+
+2. **Handle Media Appropriately**
+   ```python
+   # Filter by relevance score
+   relevant_images = [
+       img for img in result.media["images"]
+       if img['score'] > 5
+   ]
+   ```
+
+3. **Combine Link Analysis with Content**
+   ```python
+   # Get content links with context
+   content_links = [
+       link for link in result.links["internal"]
+       if link['type'] == 'content'
+   ]
+   ```
+
+4. **Clean Content with Purpose**
+   ```python
+   # Customize cleaning based on your needs
+   result = await crawler.arun(
+       url=url,
+       word_count_threshold=20,      # Adjust based on content type
+       keep_data_attributes=False,   # Remove data attributes
+       process_iframes=True         # Include iframe content
+   )
+   ```
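
Note on the content-processing.md guide above: the relevance filtering and content-link selection from its Best Practices section can be sketched as plain functions over the result dictionaries. This is a minimal sketch rather than Crawl4AI code. It assumes only the field names shown in the guide (`src`, `score`, `href`, `type`); the helper names and the sample data are hypothetical.

```python
from typing import Any, Dict, List


def select_relevant_images(
    images: List[Dict[str, Any]], min_score: float = 5
) -> List[Dict[str, Any]]:
    """Keep images scoring above min_score, highest score first."""
    kept = [img for img in images if (img.get("score") or 0) > min_score]
    return sorted(kept, key=lambda img: img["score"], reverse=True)


def select_content_links(links: List[Dict[str, Any]]) -> List[str]:
    """Return hrefs of 'content' links, de-duplicated, in first-seen order."""
    seen = set()
    hrefs: List[str] = []
    for link in links:
        href = link.get("href")
        if link.get("type") == "content" and href and href not in seen:
            seen.add(href)
            hrefs.append(href)
    return hrefs


# Hypothetical data in the shape the guide describes for
# result.media["images"] and result.links["internal"].
sample_images = [
    {"src": "hero.png", "score": 8},
    {"src": "icon.png", "score": 2},
    {"src": "chart.png", "score": 6},
]
sample_links = [
    {"href": "/post/1", "type": "content"},
    {"href": "/about", "type": "nav"},
    {"href": "/post/1", "type": "content"},
]

print([img["src"] for img in select_relevant_images(sample_images)])
print(select_content_links(sample_links))
```

Sorting by score is a design choice of this sketch; the guide itself only filters.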
--- a/docs/md_v2/advanced/hooks-auth.md
+++ b/docs/md_v2/advanced/hooks-auth.md
@@ -0,0 +1,110 @@
+# Hooks & Auth for AsyncWebCrawler
+
+Crawl4AI's AsyncWebCrawler allows you to customize the behavior of the web crawler using hooks. Hooks are asynchronous functions that are called at specific points in the crawling process, allowing you to modify the crawler's behavior or perform additional actions. This example demonstrates how to use various hooks to customize the asynchronous crawling process.
+
+## Example: Using Crawler Hooks with AsyncWebCrawler
+
+Let's see how we can customize the AsyncWebCrawler using hooks! In this example, we'll:
+
+1. Configure the browser when it's created.
+2. Add custom headers before navigating to the URL.
+3. Log the current URL after navigation.
+4. Perform actions after JavaScript execution.
+5. Log the length of the HTML before returning it.
+
+### Hook Definitions
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler
+from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
+from playwright.async_api import Page, Browser
+
+async def on_browser_created(browser: Browser):
+    print("[HOOK] on_browser_created")
+    # Example customization: set browser viewport size
+    context = await browser.new_context(viewport={'width': 1920, 'height': 1080})
+    page = await context.new_page()
+    
+    # Example customization: logging in to a hypothetical website
+    await page.goto('https://example.com/login')
+    await page.fill('input[name="username"]', 'testuser')
+    await page.fill('input[name="password"]', 'password123')
+    await page.click('button[type="submit"]')
+    await page.wait_for_selector('#welcome')
+    
+    # Add a custom cookie
+    await context.add_cookies([{'name': 'test_cookie', 'value': 'cookie_value', 'url': 'https://example.com'}])
+    
+    await page.close()
+    await context.close()
+
+async def before_goto(page: Page):
+    print("[HOOK] before_goto")
+    # Example customization: add custom headers
+    await page.set_extra_http_headers({'X-Test-Header': 'test'})
+
+async def after_goto(page: Page):
+    print("[HOOK] after_goto")
+    # Example customization: log the URL
+    print(f"Current URL: {page.url}")
+
+async def on_execution_started(page: Page):
+    print("[HOOK] on_execution_started")
+    # Example customization: perform actions after JS execution
+    await page.evaluate("console.log('Custom JS executed')")
+
+async def before_return_html(page: Page, html: str):
+    print("[HOOK] before_return_html")
+    # Example customization: log the HTML length
+    print(f"HTML length: {len(html)}")
+    return page
+```
+
+### Using the Hooks with the AsyncWebCrawler
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler
+from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
+
+async def main():
+    print("\n🔗 Using Crawler Hooks: Let's see how we can customize the AsyncWebCrawler using hooks!")
+    
+    crawler_strategy = AsyncPlaywrightCrawlerStrategy(verbose=True)
+    crawler_strategy.set_hook('on_browser_created', on_browser_created)
+    crawler_strategy.set_hook('before_goto', before_goto)
+    crawler_strategy.set_hook('after_goto', after_goto)
+    crawler_strategy.set_hook('on_execution_started', on_execution_started)
+    crawler_strategy.set_hook('before_return_html', before_return_html)
+    
+    async with AsyncWebCrawler(verbose=True, crawler_strategy=crawler_strategy) as crawler:
+        result = await crawler.arun(
+            url="https://example.com",
+            js_code="window.scrollTo(0, document.body.scrollHeight);",
+            wait_for="footer"
+        )
+
+    print("📦 Crawler Hooks result:")
+    print(result)
+
+asyncio.run(main())
+```
+
+### Explanation
+
+- `on_browser_created`: This hook is called when the Playwright browser is created. It sets up the browser context, logs in to a website, and adds a custom cookie.
+- `before_goto`: This hook is called right before Playwright navigates to the URL. It adds custom HTTP headers.
+- `after_goto`: This hook is called after Playwright navigates to the URL. It logs the current URL.
+- `on_execution_started`: This hook is called after any custom JavaScript is executed. It performs additional JavaScript actions.
+- `before_return_html`: This hook is called before returning the HTML content. It logs the length of the HTML content.
+
+### Additional Ideas
+
+- **Handling authentication**: Use the `on_browser_created` hook to handle login processes or set authentication tokens.
+- **Dynamic header modification**: Modify headers based on the target URL or other conditions in the `before_goto` hook.
+- **Content verification**: Use the `after_goto` hook to verify that the expected content is present on the page.
+- **Custom JavaScript injection**: Inject and execute custom JavaScript using the `on_execution_started` hook.
+- **Content preprocessing**: Modify or analyze the HTML content in the `before_return_html` hook before it's returned.
+
+By using these hooks, you can customize the behavior of the AsyncWebCrawler to suit your specific needs, including handling authentication, modifying requests, and preprocessing content.
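
Note on the hooks-auth.md guide above: the "dynamic header modification" idea can be sketched as a pure function that chooses headers from the target URL. This is an illustrative sketch only. The rule table, header values, and the `headers_for_url` name are hypothetical, and nothing here is part of the Crawl4AI API.

```python
from typing import Dict
from urllib.parse import urlparse

# Hypothetical per-host header rules, keyed by host suffix.
HEADER_RULES: Dict[str, Dict[str, str]] = {
    "api.example.com": {"Accept": "application/json"},
    "example.com": {"X-Test-Header": "test"},
}
DEFAULT_HEADERS: Dict[str, str] = {"Accept-Language": "en-US,en;q=0.9"}


def headers_for_url(url: str) -> Dict[str, str]:
    """Return default headers plus the most specific matching host rule."""
    host = (urlparse(url).hostname or "").lower()
    headers = dict(DEFAULT_HEADERS)
    # Check longer suffixes first so "api.example.com" beats "example.com".
    for suffix in sorted(HEADER_RULES, key=len, reverse=True):
        if host == suffix or host.endswith("." + suffix):
            headers.update(HEADER_RULES[suffix])
            break
    return headers


print(headers_for_url("https://api.example.com/v1/items"))
print(headers_for_url("https://unrelated.org/"))
```

Inside a `before_goto` hook, the returned dictionary could be passed to `page.set_extra_http_headers(...)` as in the hook definitions above. How the target URL is made available to the hook (for example through a closure) is left open here.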
--- a/docs/md_v2/advanced/magic-mode.md
+++ b/docs/md_v2/advanced/magic-mode.md
@@ -0,0 +1,52 @@
+# Magic Mode & Anti-Bot Protection
+
+Crawl4AI provides powerful anti-detection capabilities, with Magic Mode being the simplest and most comprehensive solution.
+
+## Magic Mode
+
+The easiest way to bypass anti-bot protections:
+
+```python
+async with AsyncWebCrawler() as crawler:
+    result = await crawler.arun(
+        url="https://example.com",
+        magic=True  # Enables all anti-detection features
+    )
+```
+
+Magic Mode automatically:
+- Masks browser automation signals
+- Simulates human-like behavior
+- Overrides navigator properties
+- Handles cookie consent popups
+- Manages browser fingerprinting
+- Randomizes timing patterns
+
+## Manual Anti-Bot Options
+
+While Magic Mode is recommended, you can also configure individual anti-detection features:
+
+```python
+result = await crawler.arun(
+    url="https://example.com",
+    simulate_user=True,        # Simulate human behavior
+    override_navigator=True    # Mask automation signals
+)
+```
+
+Note: When `magic=True` is used, you don't need to set these individual options.
+
+## Example: Handling Protected Sites
+
+```python
+from crawl4ai import AsyncWebCrawler
+
+async def crawl_protected_site(url: str):
+    async with AsyncWebCrawler(headless=True) as crawler:
+        result = await crawler.arun(
+            url=url,
+            magic=True,
+            remove_overlay_elements=True,  # Remove popups/modals
+            page_timeout=60000            # Increased timeout for protection checks
+        )
+        
+        return result.markdown if result.success else None
+```
--- a/docs/md_v2/advanced/proxy-security.md
+++ b/docs/md_v2/advanced/proxy-security.md
@@ -0,0 +1,84 @@
+# Proxy & Security
+
+Configure proxy settings and enhance security features in Crawl4AI for reliable data extraction.
+
+## Basic Proxy Setup
+
+Simple proxy configuration:
+
+```python
+# Using proxy URL
+async with AsyncWebCrawler(
+    proxy="http://proxy.example.com:8080"
+) as crawler:
+    result = await crawler.arun(url="https://example.com")
+
+# Using SOCKS proxy
+async with AsyncWebCrawler(
+    proxy="socks5://proxy.example.com:1080"
+) as crawler:
+    result = await crawler.arun(url="https://example.com")
+```
+
+## Authenticated Proxy
+
+Use proxy with authentication:
+
+```python
+proxy_config = {
+    "server": "http://proxy.example.com:8080",
+    "username": "user",
+    "password": "pass"
+}
+
+async with AsyncWebCrawler(proxy_config=proxy_config) as crawler:
+    result = await crawler.arun(url="https://example.com")
+```
+
+## Rotating Proxies
+
+Example using a proxy rotation service:
+
+```python
+async def get_next_proxy():
+    # Your proxy rotation logic here
+    return {"server": "http://next.proxy.com:8080"}
+
+async with AsyncWebCrawler() as crawler:
+    # Update proxy for each request
+    for url in urls:
+        proxy = await get_next_proxy()
+        crawler.update_proxy(proxy)
+        result = await crawler.arun(url=url)
+```
+
+## Custom Headers
+
+Add security-related headers:
+
+```python
+headers = {
+    "X-Forwarded-For": "203.0.113.195",
+    "Accept-Language": "en-US,en;q=0.9",
+    "Cache-Control": "no-cache",
+    "Pragma": "no-cache"
+}
+
+async with AsyncWebCrawler(headers=headers) as crawler:
+    result = await crawler.arun(url="https://example.com")
+```
+
+## Combining with Magic Mode
+
+For maximum protection, combine proxy with Magic Mode:
+
+```python
+async with AsyncWebCrawler(
+    proxy="http://proxy.example.com:8080",
+    headers={"Accept-Language": "en-US"}
+) as crawler:
+    result = await crawler.arun(
+        url="https://example.com",
+        magic=True  # Enable all anti-detection features
+    )
+```
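
Note on the proxy-security.md guide above: the `get_next_proxy` placeholder in the Rotating Proxies section could be filled in with simple round-robin rotation. The sketch below is one possible approach, not a Crawl4AI feature. The `ProxyRotator` class and the server URLs are hypothetical; only the `{"server": ...}` dictionary shape comes from the guide.

```python
import asyncio
import itertools
from typing import Dict, Iterable


class ProxyRotator:
    """Hand out proxy configs in round-robin order."""

    def __init__(self, servers: Iterable[str]) -> None:
        servers = list(servers)
        if not servers:
            raise ValueError("at least one proxy server is required")
        self._cycle = itertools.cycle(servers)

    async def get_next_proxy(self) -> Dict[str, str]:
        # Async to match the guide's placeholder; a real rotation
        # service might perform network I/O at this point.
        return {"server": next(self._cycle)}


async def demo() -> None:
    rotator = ProxyRotator(
        ["http://p1.example.com:8080", "http://p2.example.com:8080"]
    )
    for _ in range(3):
        print(await rotator.get_next_proxy())


asyncio.run(demo())
```

Round-robin is the simplest policy; weighting by proxy health or latency would need extra state that this sketch does not track.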
--- a/docs/md_v2/advanced/session-management-advanced.md
+++ b/docs/md_v2/advanced/session-management-advanced.md
@@ -0,0 +1,276 @@
+# Session-Based Crawling for Dynamic Content
+
+In modern web applications, content is often loaded dynamically without changing the URL. Examples include "Load More" buttons, infinite scrolling, or paginated content that updates via JavaScript. To effectively crawl such websites, Crawl4AI provides powerful session-based crawling capabilities.
+
+This guide will explore advanced techniques for crawling dynamic content using Crawl4AI's session management features.
+
+## Understanding Session-Based Crawling
+
+Session-based crawling allows you to maintain a persistent browser session across multiple requests. This is crucial when:
+
+1. The content changes dynamically without URL changes
+2. You need to interact with the page (e.g., clicking buttons) between requests
+3. The site requires authentication or maintains state across pages
+
+Crawl4AI's `AsyncWebCrawler` class supports session-based crawling through the `session_id` parameter and related methods.
+
+## Basic Concepts
+
+Before diving into examples, let's review some key concepts:
+
+- **Session ID**: A unique identifier for a browsing session. Use the same `session_id` across multiple `arun` calls to maintain state.
+- **JavaScript Execution**: Use the `js_code` parameter to execute JavaScript on the page, such as clicking a "Load More" button.
+- **CSS Selectors**: Use these to target specific elements for extraction or interaction.
+- **Extraction Strategy**: Define how to extract structured data from the page.
+- **Wait Conditions**: Specify conditions to wait for before considering the page loaded.
+
+## Example 1: Basic Session-Based Crawling
+
+Let's start with a basic example of session-based crawling:
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler
+
+async def basic_session_crawl():
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        session_id = "my_session"
+        url = "https://example.com/dynamic-content"
+
+        for page in range(3):
+            result = await crawler.arun(
+                url=url,
+                session_id=session_id,
+                js_code="document.querySelector('.load-more-button').click();" if page > 0 else None,
+                css_selector=".content-item",
+                bypass_cache=True
+            )
+            
+            # No extraction strategy is set here, so report how much markdown was captured
+            print(f"Page {page + 1}: Extracted {len(result.markdown or '')} characters of markdown")
+
+        await crawler.crawler_strategy.kill_session(session_id)
+
+asyncio.run(basic_session_crawl())
+```
+
+This example demonstrates:
+1. Using a consistent `session_id` across multiple `arun` calls
+2. Executing JavaScript to load more content after the first page
+3. Using a CSS selector to extract specific content
+4. Properly closing the session after crawling
+
+## Advanced Technique 1: Custom Execution Hooks
+
+Crawl4AI allows you to set custom hooks that execute at different stages of the crawling process. This is particularly useful for handling complex loading scenarios.
+
+Here's an example that waits for new content to appear before proceeding:
+
+```python
+async def advanced_session_crawl_with_hooks():
+    first_commit = ""
+
+    async def on_execution_started(page):
+        nonlocal first_commit
+        try:
+            while True:
+                await page.wait_for_selector("li.commit-item h4")
+                commit = await page.query_selector("li.commit-item h4")
+                commit = await commit.evaluate("(element) => element.textContent")
+                commit = commit.strip()
+                if commit and commit != first_commit:
+                    first_commit = commit
+                    break
+                await asyncio.sleep(0.5)
+        except Exception as e:
+            print(f"Warning: New content didn't appear after JavaScript execution: {e}")
+
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
+
+        url = "https://github.com/example/repo/commits/main"
+        session_id = "commit_session"
+        all_commits = []
+
+        js_next_page = """
+        const button = document.querySelector('a.pagination-next');
+        if (button) button.click();
+        """
+
+        for page in range(3):
+            result = await crawler.arun(
+                url=url,
+                session_id=session_id,
+                css_selector="li.commit-item",
+                js_code=js_next_page if page > 0 else None,
+                bypass_cache=True,
+                js_only=page > 0
+            )
+
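+            # extracted_content is a string, so parse the cleaned HTML instead.
+            # Class attributes may be stripped during cleaning, hence the plain "li" selector.
+            from bs4 import BeautifulSoup  # local import keeps this snippet self-contained
+            soup = BeautifulSoup(result.cleaned_html, "html.parser")
+            commits = soup.select("li")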
+            all_commits.extend(commits)
+            print(f"Page {page + 1}: Found {len(commits)} commits")
+
+        await crawler.crawler_strategy.kill_session(session_id)
+        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
+
+asyncio.run(advanced_session_crawl_with_hooks())
+```
+
+This technique uses a custom `on_execution_started` hook to ensure new content has loaded before proceeding to the next step.
+
+## Advanced Technique 2: Integrated JavaScript Execution and Waiting
+
+Instead of using separate hooks, you can integrate the waiting logic directly into your JavaScript execution. This approach can be more concise and easier to manage for some scenarios.
+
+Here's an example:
+
+```python
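+import asyncio
+import json
+
+from crawl4ai import AsyncWebCrawler
+from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+
+async def integrated_js_and_wait_crawl():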
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        url = "https://github.com/example/repo/commits/main"
+        session_id = "integrated_session"
+        all_commits = []
+
+        js_next_page_and_wait = """
+        (async () => {
+            const getCurrentCommit = () => {
+                const commits = document.querySelectorAll('li.commit-item h4');
+                return commits.length > 0 ? commits[0].textContent.trim() : null;
+            };
+
+            const initialCommit = getCurrentCommit();
+            const button = document.querySelector('a.pagination-next');
+            if (button) button.click();
+
+            while (true) {
+                await new Promise(resolve => setTimeout(resolve, 100));
+                const newCommit = getCurrentCommit();
+                if (newCommit && newCommit !== initialCommit) {
+                    break;
+                }
+            }
+        })();
+        """
+
+        schema = {
+            "name": "Commit Extractor",
+            "baseSelector": "li.commit-item",
+            "fields": [
+                {
+                    "name": "title",
+                    "selector": "h4.commit-title",
+                    "type": "text",
+                    "transform": "strip",
+                },
+            ],
+        }
+        extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
+
+        for page in range(3):
+            result = await crawler.arun(
+                url=url,
+                session_id=session_id,
+                css_selector="li.commit-item",
+                extraction_strategy=extraction_strategy,
+                js_code=js_next_page_and_wait if page > 0 else None,
+                js_only=page > 0,
+                bypass_cache=True
+            )
+
+            commits = json.loads(result.extracted_content)
+            all_commits.extend(commits)
+            print(f"Page {page + 1}: Found {len(commits)} commits")
+
+        await crawler.crawler_strategy.kill_session(session_id)
+        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
+
+asyncio.run(integrated_js_and_wait_crawl())
+```
+
+This approach combines the JavaScript for clicking the "next" button and waiting for new content to load into a single script.
+
+## Advanced Technique 3: Using the `wait_for` Parameter
+
+Crawl4AI provides a `wait_for` parameter that allows you to specify a condition to wait for before considering the page fully loaded. This can be particularly useful for dynamic content.
+
+Here's an example:
+
+```python
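+import asyncio
+import json
+
+from crawl4ai import AsyncWebCrawler
+from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+
+async def wait_for_parameter_crawl():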
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        url = "https://github.com/example/repo/commits/main"
+        session_id = "wait_for_session"
+        all_commits = []
+
+        js_next_page = """
+        const commits = document.querySelectorAll('li.commit-item h4');
+        if (commits.length > 0) {
+            window.lastCommit = commits[0].textContent.trim();
+        }
+        const button = document.querySelector('a.pagination-next');
+        if (button) button.click();
+        """
+
+        wait_for = """() => {
+            const commits = document.querySelectorAll('li.commit-item h4');
+            if (commits.length === 0) return false;
+            const firstCommit = commits[0].textContent.trim();
+            return firstCommit !== window.lastCommit;
+        }"""
+        
+        schema = {
+            "name": "Commit Extractor",
+            "baseSelector": "li.commit-item",
+            "fields": [
+                {
+                    "name": "title",
+                    "selector": "h4.commit-title",
+                    "type": "text",
+                    "transform": "strip",
+                },
+            ],
+        }
+        extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
+
+        for page in range(3):
+            result = await crawler.arun(
+                url=url,
+                session_id=session_id,
+                css_selector="li.commit-item",
+                extraction_strategy=extraction_strategy,
+                js_code=js_next_page if page > 0 else None,
+                wait_for=wait_for if page > 0 else None,
+                js_only=page > 0,
+                bypass_cache=True
+            )
+
+            commits = json.loads(result.extracted_content)
+            all_commits.extend(commits)
+            print(f"Page {page + 1}: Found {len(commits)} commits")
+
+        await crawler.crawler_strategy.kill_session(session_id)
+        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
+
+asyncio.run(wait_for_parameter_crawl())
+```
+
+This technique separates the JavaScript execution (clicking the "next" button) from the waiting condition, providing more flexibility and clarity in some scenarios.
+
+## Best Practices for Session-Based Crawling
+
+1. **Use Unique Session IDs**: Ensure each crawling session has a unique `session_id` to prevent conflicts.
+2. **Close Sessions**: Always close sessions using `kill_session` when you're done to free up resources.
+3. **Handle Errors**: Implement proper error handling to deal with unexpected situations during crawling.
+4. **Respect Website Terms**: Ensure your crawling adheres to the website's terms of service and robots.txt file.
+5. **Implement Delays**: Add appropriate delays between requests to avoid overwhelming the target server.
+6. **Use Extraction Strategies**: Leverage `JsonCssExtractionStrategy` or other extraction strategies for structured data extraction.
+7. **Optimize JavaScript**: Keep your JavaScript execution concise and efficient to improve crawling speed.
+8. **Monitor Performance**: Keep an eye on memory usage and crawling speed, especially for long-running sessions.
+
+## Conclusion
+
+Session-based crawling with Crawl4AI provides powerful capabilities for handling dynamic content and complex web applications. By leveraging session management, JavaScript execution, and waiting strategies, you can effectively crawl and extract data from a wide range of modern websites.
+
+Remember to use these techniques responsibly and in compliance with website policies and ethical web scraping practices.
+
+For more advanced usage and API details, refer to the Crawl4AI API documentation.
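
Note on the session-management-advanced.md guide above: its first two best practices (unique session IDs, always closing sessions) can be combined in a small async context manager. This is a sketch under stated assumptions. `managed_session` is a hypothetical helper, and the cleanup callback stands in for `crawler.crawler_strategy.kill_session` from the examples; a fake callback is used so the snippet runs without a browser.

```python
import asyncio
import uuid
from contextlib import asynccontextmanager
from typing import AsyncIterator, Awaitable, Callable, List


@asynccontextmanager
async def managed_session(
    kill_session: Callable[[str], Awaitable[None]],
    prefix: str = "session",
) -> AsyncIterator[str]:
    """Yield a unique session ID and always run the cleanup callback."""
    session_id = f"{prefix}_{uuid.uuid4().hex}"
    try:
        yield session_id
    finally:
        # Runs on normal exit and on exceptions alike.
        await kill_session(session_id)


async def demo() -> None:
    killed: List[str] = []

    async def fake_kill(session_id: str) -> None:
        # Stand-in for crawler.crawler_strategy.kill_session(session_id)
        killed.append(session_id)

    async with managed_session(fake_kill, prefix="commits") as session_id:
        print(f"crawling with {session_id}")
    print(f"cleaned up: {killed == [session_id]}")


asyncio.run(demo())
```

With a real crawler, the body of the `async with` block would hold the `arun` calls that share `session_id`.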
--- a/docs/md_v2/advanced/session-management.md
+++ b/docs/md_v2/advanced/session-management.md
@@ -0,0 +1,133 @@
+# Session Management
+
+Session management in Crawl4AI lets you maintain state across multiple requests and handle complex multi-page crawling tasks. It is particularly useful for dynamic websites.
+
+## Basic Session Usage
+
+Use `session_id` to maintain state between requests:
+
+```python
+async with AsyncWebCrawler() as crawler:
+    session_id = "my_session"
+    
+    # First request
+    result1 = await crawler.arun(
+        url="https://example.com/page1",
+        session_id=session_id
+    )
+    
+    # Subsequent request using same session
+    result2 = await crawler.arun(
+        url="https://example.com/page2",
+        session_id=session_id
+    )
+    
+    # Clean up when done
+    await crawler.crawler_strategy.kill_session(session_id)
+```
+
+## Dynamic Content with Sessions
+
+Here's a real-world example of crawling GitHub commits across multiple pages:
+
+```python
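+import json
+
+from crawl4ai import AsyncWebCrawler
+from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+
+async def crawl_dynamic_content():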
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        url = "https://github.com/microsoft/TypeScript/commits/main"
+        session_id = "typescript_commits_session"
+        all_commits = []
+
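+        # Define navigation JavaScript: remember the first commit, then go to the next page
+        js_next_page = """
+        const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
+        if (commits.length > 0) {
+            window.firstCommit = commits[0].textContent.trim();
+        }
+        const button = document.querySelector('a[data-testid="pagination-next-button"]');
+        if (button) button.click();
+        """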
+
+        # Define wait condition
+        wait_for = """() => {
+            const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
+            if (commits.length === 0) return false;
+            const firstCommit = commits[0].textContent.trim();
+            return firstCommit !== window.firstCommit;
+        }"""
+        
+        # Define extraction schema
+        schema = {
+            "name": "Commit Extractor",
+            "baseSelector": "li.Box-sc-g0xbh4-0",
+            "fields": [
+                {
+                    "name": "title",
+                    "selector": "h4.markdown-title",
+                    "type": "text",
+                    "transform": "strip",
+                },
+            ],
+        }
+        extraction_strategy = JsonCssExtractionStrategy(schema)
+
+        # Crawl multiple pages
+        for page in range(3):
+            result = await crawler.arun(
+                url=url,
+                session_id=session_id,
+                extraction_strategy=extraction_strategy,
+                js_code=js_next_page if page > 0 else None,
+                wait_for=wait_for if page > 0 else None,
+                js_only=page > 0,
+                bypass_cache=True
+            )
+
+            if result.success:
+                commits = json.loads(result.extracted_content)
+                all_commits.extend(commits)
+                print(f"Page {page + 1}: Found {len(commits)} commits")
+
+        # Clean up session
+        await crawler.crawler_strategy.kill_session(session_id)
+        return all_commits
+```
+
+## Session Best Practices
+
+1. **Session Naming**:
+```python
+# Use descriptive session IDs
+session_id = "login_flow_session"
+session_id = "product_catalog_session"
+```
+
+2. **Resource Management**:
+```python
+try:
+    # Your crawling code
+    pass
+finally:
+    # Always clean up sessions
+    await crawler.crawler_strategy.kill_session(session_id)
+```
+
+3. **State Management**:
+```python
+# First page: login
+result = await crawler.arun(
+    url="https://example.com/login",
+    session_id=session_id,
+    js_code="document.querySelector('form').submit();"
+)
+
+# Second page: verify login success
+result = await crawler.arun(
+    url="https://example.com/dashboard",
+    session_id=session_id,
+    wait_for="css:.user-profile"  # Wait for authenticated content
+)
+```
+
+## Common Use Cases
+
+1. **Authentication Flows**
+2. **Pagination Handling**
+3. **Form Submissions**
+4. **Multi-step Processes**
+5. **Dynamic Content Navigation**
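
Note on the session-management.md guide above: when several pages are crawled in one session, the per-page JSON can be merged and de-duplicated before use. The sketch below assumes each page's `extracted_content` is a JSON array of objects with a `title` field, as the example schema suggests; that shape and the `merge_commit_pages` helper are assumptions for illustration.

```python
import json
from typing import Dict, Iterable, List, Optional


def merge_commit_pages(pages: Iterable[Optional[str]]) -> List[Dict[str, str]]:
    """Merge per-page JSON results, dropping empty pages and repeated titles."""
    seen = set()
    merged: List[Dict[str, str]] = []
    for raw in pages:
        if not raw:
            continue  # failed page or nothing extracted
        for item in json.loads(raw):
            title = (item.get("title") or "").strip()
            if title and title not in seen:
                seen.add(title)
                merged.append({"title": title})
    return merged


# Hypothetical page payloads standing in for result.extracted_content
page_payloads = [
    '[{"title": "Fix parser"}, {"title": "Add tests"}]',
    None,
    '[{"title": "Add tests"}, {"title": " Update docs "}]',
]
print(merge_commit_pages(page_payloads))
```

De-duplicating by title guards against a page being captured twice when the wait condition fires before the new content has replaced the old.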
--- a/docs/md_v2/api/arun.md
+++ b/docs/md_v2/api/arun.md
@@ -0,0 +1,226 @@
+# Complete Parameter Guide for arun()
+
+The following parameters can be passed to the `arun()` method. They are organized by their primary usage context and functionality.
+
+## Core Parameters
+
+```python
+await crawler.arun(
+    url="https://example.com",   # Required: URL to crawl
+    verbose=True,               # Enable detailed logging
+    bypass_cache=False,         # Skip cache for this request
+    warmup=True                # Whether to run warmup check
+)
+```
+
+## Content Processing Parameters
+
+### Text Processing
+```python
+await crawler.arun(
+    word_count_threshold=10,                # Minimum words per content block
+    image_description_min_word_threshold=5,  # Minimum words for image descriptions
+    only_text=False,                        # Extract only text content
+    excluded_tags=['form', 'nav'],          # HTML tags to exclude
+    keep_data_attributes=False,             # Preserve data-* attributes
+)
+```
+
+### Content Selection
+```python
+await crawler.arun(
+    css_selector=".main-content",  # CSS selector for content extraction
+    remove_forms=True,             # Remove all form elements
+    remove_overlay_elements=True,  # Remove popups/modals/overlays
+)
+```
+
+### Link Handling
+```python
+await crawler.arun(
+    exclude_external_links=True,          # Remove external links
+    exclude_social_media_links=True,      # Remove social media links
+    exclude_external_images=True,         # Remove external images
+    exclude_domains=["ads.example.com"],  # Specific domains to exclude
+    social_media_domains=[               # Additional social media domains
+        "facebook.com",
+        "twitter.com",
+        "instagram.com"
+    ]
+)
+```
+
+## Browser Control Parameters
+
+### Basic Browser Settings
+```python
+await crawler.arun(
+    headless=True,                # Run browser in headless mode
+    browser_type="chromium",      # Browser engine: "chromium", "firefox", "webkit"
+    page_timeout=60000,          # Page load timeout in milliseconds
+    user_agent="custom-agent",    # Custom user agent
+)
+```
+
+### Navigation and Waiting
+```python
+await crawler.arun(
+    wait_for="css:.dynamic-content",  # Wait for element/condition
+    delay_before_return_html=2.0,     # Wait before returning HTML (seconds)
+)
+```
+
+### JavaScript Execution
+```python
+await crawler.arun(
+    js_code=[                     # JavaScript to execute (string or list)
+        "window.scrollTo(0, document.body.scrollHeight);",
+        "document.querySelector('.load-more').click();"
+    ],
+    js_only=False,               # Only execute JavaScript without reloading page
+)
+```
+
+### Anti-Bot Features
+```python
+await crawler.arun(
+    magic=True,              # Enable all anti-detection features
+    simulate_user=True,      # Simulate human behavior
+    override_navigator=True  # Override navigator properties
+)
+```
+
+### Session Management
+```python
+await crawler.arun(
+    session_id="my_session",  # Session identifier for persistent browsing
+)
+```
+
+### Screenshot Options
+```python
+await crawler.arun(
+    screenshot=True,              # Take page screenshot
+    screenshot_wait_for=2.0,      # Wait before screenshot (seconds)
+)
+```
+
+### Proxy Configuration
+```python
+await crawler.arun(
+    proxy="http://proxy.example.com:8080",     # Simple proxy URL
+    proxy_config={                             # Advanced proxy settings
+        "server": "http://proxy.example.com:8080",
+        "username": "user",
+        "password": "pass"
+    }
+)
+```
+
+## Content Extraction Parameters
+
+### Extraction Strategy
+```python
+await crawler.arun(
+    extraction_strategy=LLMExtractionStrategy(
+        provider="ollama/llama2",
+        schema=MySchema.schema(),
+        instruction="Extract specific data"
+    )
+)
+```
+
+### Chunking Strategy
+```python
+await crawler.arun(
+    chunking_strategy=RegexChunking(
+        patterns=[r'\n\n', r'\.\s+']
+    )
+)
+```
+
+### HTML to Text Options
+```python
+await crawler.arun(
+    html2text={
+        "ignore_links": False,
+        "ignore_images": False,
+        "escape_dot": False,
+        "body_width": 0,
+        "protect_links": True,
+        "unicode_snob": True
+    }
+)
+```
+
+## Debug Options
+```python
+await crawler.arun(
+    log_console=True,   # Log browser console messages
+)
+```
+
+## Parameter Interactions and Notes
+
+1. **Magic Mode Combinations**
+   ```python
+   # Full anti-detection setup
+   await crawler.arun(
+       magic=True,
+       headless=False,
+       simulate_user=True,
+       override_navigator=True
+   )
+   ```
+
+2. **Dynamic Content Handling**
+   ```python
+   # Handle lazy-loaded content
+   await crawler.arun(
+       js_code="window.scrollTo(0, document.body.scrollHeight);",
+       wait_for="css:.lazy-content",
+       delay_before_return_html=2.0
+   )
+   ```
+
+3. **Content Extraction Pipeline**
+   ```python
+   # Complete extraction setup
+   await crawler.arun(
+       css_selector=".main-content",
+       word_count_threshold=20,
+       extraction_strategy=my_strategy,
+       chunking_strategy=my_chunking,
+       process_iframes=True,
+       remove_overlay_elements=True
+   )
+   ```
+
+## Best Practices
+
+1. **Performance Optimization**
+   ```python
+   await crawler.arun(
+       bypass_cache=False,           # Use cache when possible
+       word_count_threshold=10,      # Filter out noise
+       process_iframes=False         # Skip iframes if not needed
+   )
+   ```
+
+2. **Reliable Scraping**
+   ```python
+   await crawler.arun(
+       magic=True,                   # Enable anti-detection
+       delay_before_return_html=1.0, # Wait for dynamic content
+       page_timeout=60000           # Longer timeout for slow pages
+   )
+   ```
+
+3. **Clean Content**
+   ```python
+   await crawler.arun(
+       remove_overlay_elements=True,  # Remove popups
+       excluded_tags=['nav', 'aside'],  # Remove unnecessary elements
+       keep_data_attributes=False     # Remove data attributes
+   )
+   ```
--- a/docs/md_v2/api/async-webcrawler.md
+++ b/docs/md_v2/api/async-webcrawler.md
@@ -0,0 +1,320 @@
+# AsyncWebCrawler
+
+The `AsyncWebCrawler` class is the main interface for web crawling operations. It provides asynchronous web crawling capabilities with extensive configuration options.
+
+## Constructor
+
+```python
+AsyncWebCrawler(
+    # Browser Settings
+    browser_type: str = "chromium",         # Options: "chromium", "firefox", "webkit"
+    headless: bool = True,                  # Run browser in headless mode
+    verbose: bool = False,                  # Enable verbose logging
+    
+    # Cache Settings
+    always_by_pass_cache: bool = False,     # Always bypass cache
+    base_directory: str = str(Path.home()), # Base directory for cache
+    
+    # Network Settings
+    proxy: str = None,                      # Simple proxy URL
+    proxy_config: Dict = None,              # Advanced proxy configuration
+    
+    # Browser Behavior
+    sleep_on_close: bool = False,           # Wait before closing browser
+    
+    # Custom Settings
+    user_agent: str = None,                 # Custom user agent
+    headers: Dict[str, str] = {},           # Custom HTTP headers
+    js_code: Union[str, List[str]] = None,  # Default JavaScript to execute
+)
+```
+
+### Parameters in Detail
+
+#### Browser Settings
+
+- **browser_type** (str, optional)
+  - Default: `"chromium"`
+  - Options: `"chromium"`, `"firefox"`, `"webkit"`
+  - Controls which browser engine to use
+  ```python
+  # Example: Using Firefox
+  crawler = AsyncWebCrawler(browser_type="firefox")
+  ```
+
+- **headless** (bool, optional)
+  - Default: `True`
+  - When `True`, browser runs without GUI
+  - Set to `False` for debugging
+  ```python
+  # Visible browser for debugging
+  crawler = AsyncWebCrawler(headless=False)
+  ```
+
+- **verbose** (bool, optional)
+  - Default: `False`
+  - Enables detailed logging
+  ```python
+  # Enable detailed logging
+  crawler = AsyncWebCrawler(verbose=True)
+  ```
+
+#### Cache Settings
+
+- **always_by_pass_cache** (bool, optional)
+  - Default: `False`
+  - When `True`, always fetches fresh content
+  ```python
+  # Always fetch fresh content
+  crawler = AsyncWebCrawler(always_by_pass_cache=True)
+  ```
+
+- **base_directory** (str, optional)
+  - Default: User's home directory
+  - Base path for cache storage
+  ```python
+  # Custom cache directory
+  crawler = AsyncWebCrawler(base_directory="/path/to/cache")
+  ```
+
+#### Network Settings
+
+- **proxy** (str, optional)
+  - Simple proxy URL
+  ```python
+  # Using simple proxy
+  crawler = AsyncWebCrawler(proxy="http://proxy.example.com:8080")
+  ```
+
+- **proxy_config** (Dict, optional)
+  - Advanced proxy configuration with authentication
+  ```python
+  # Advanced proxy with auth
+  crawler = AsyncWebCrawler(proxy_config={
+      "server": "http://proxy.example.com:8080",
+      "username": "user",
+      "password": "pass"
+  })
+  ```
+
+#### Browser Behavior
+
+- **sleep_on_close** (bool, optional)
+  - Default: `False`
+  - Adds delay before closing browser
+  ```python
+  # Wait before closing
+  crawler = AsyncWebCrawler(sleep_on_close=True)
+  ```
+
+#### Custom Settings
+
+- **user_agent** (str, optional)
+  - Custom user agent string
+  ```python
+  # Custom user agent
+  crawler = AsyncWebCrawler(
+      user_agent="Mozilla/5.0 (Custom Agent) Chrome/90.0"
+  )
+  ```
+
+- **headers** (Dict[str, str], optional)
+  - Custom HTTP headers
+  ```python
+  # Custom headers
+  crawler = AsyncWebCrawler(
+      headers={
+          "Accept-Language": "en-US",
+          "Custom-Header": "Value"
+      }
+  )
+  ```
+
+- **js_code** (Union[str, List[str]], optional)
+  - Default JavaScript to execute on each page
+  ```python
+  # Default JavaScript
+  crawler = AsyncWebCrawler(
+      js_code=[
+          "window.scrollTo(0, document.body.scrollHeight);",
+          "document.querySelector('.load-more').click();"
+      ]
+  )
+  ```
+
+## Methods
+
+### arun()
+
+The primary method for crawling web pages.
+
+```python
+async def arun(
+    # Required
+    url: str,                              # URL to crawl
+    
+    # Content Selection
+    css_selector: str = None,              # CSS selector for content
+    word_count_threshold: int = 10,        # Minimum words per block
+    
+    # Cache Control
+    bypass_cache: bool = False,            # Bypass cache for this request
+    
+    # Session Management
+    session_id: str = None,                # Session identifier
+    
+    # Screenshot Options
+    screenshot: bool = False,              # Take screenshot
+    screenshot_wait_for: float = None,     # Wait before screenshot
+    
+    # Content Processing
+    process_iframes: bool = False,         # Process iframe content
+    remove_overlay_elements: bool = False, # Remove popups/modals
+    
+    # Anti-Bot Settings
+    simulate_user: bool = False,           # Simulate human behavior
+    override_navigator: bool = False,      # Override navigator properties
+    magic: bool = False,                   # Enable all anti-detection
+    
+    # Content Filtering
+    excluded_tags: List[str] = None,       # HTML tags to exclude
+    exclude_external_links: bool = False,  # Remove external links
+    exclude_social_media_links: bool = False, # Remove social media links
+    
+    # JavaScript Handling
+    js_code: Union[str, List[str]] = None, # JavaScript to execute
+    wait_for: str = None,                  # Wait condition
+    
+    # Page Loading
+    page_timeout: int = 60000,            # Page load timeout (ms)
+    delay_before_return_html: float = None, # Wait before return
+    
+    # Extraction
+    extraction_strategy: ExtractionStrategy = None  # Extraction strategy
+) -> CrawlResult:
+```
+
+### Usage Examples
+
+#### Basic Crawling
+```python
+async with AsyncWebCrawler() as crawler:
+    result = await crawler.arun(url="https://example.com")
+```
+
+#### Advanced Crawling
+```python
+async with AsyncWebCrawler(
+    browser_type="firefox",
+    verbose=True,
+    headers={"Custom-Header": "Value"}
+) as crawler:
+    result = await crawler.arun(
+        url="https://example.com",
+        css_selector=".main-content",
+        word_count_threshold=20,
+        process_iframes=True,
+        magic=True,
+        wait_for="css:.dynamic-content",
+        screenshot=True
+    )
+```
+
+#### Session Management
+```python
+async with AsyncWebCrawler() as crawler:
+    # First request
+    result1 = await crawler.arun(
+        url="https://example.com/login",
+        session_id="my_session"
+    )
+    
+    # Subsequent request using same session
+    result2 = await crawler.arun(
+        url="https://example.com/protected",
+        session_id="my_session"
+    )
+```
+
+## Context Manager
+
+AsyncWebCrawler implements the async context manager protocol:
+
+```python
+async def __aenter__(self) -> 'AsyncWebCrawler':
+    # Initialize browser and resources
+    return self
+
+async def __aexit__(self, *args):
+    # Cleanup resources
+    pass
+```
+
+Always use `AsyncWebCrawler` as an async context manager so that the browser and related resources are cleaned up when the block exits:
+```python
+async with AsyncWebCrawler() as crawler:
+    # Your crawling code here
+    pass
+```
+
+## Best Practices
+
+1. **Resource Management**
+```python
+# Always use context manager
+async with AsyncWebCrawler() as crawler:
+    # Crawler will be properly cleaned up
+    pass
+```
+
+2. **Error Handling**
+```python
+try:
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun(url="https://example.com")
+        if not result.success:
+            print(f"Crawl failed: {result.error_message}")
+except Exception as e:
+    print(f"Error: {str(e)}")
+```
+
+3. **Performance Optimization**
+```python
+# Enable caching for better performance
+crawler = AsyncWebCrawler(
+    always_by_pass_cache=False,
+    verbose=True
+)
+```
+
+4. **Anti-Detection**
+```python
+# Maximum stealth
+async with AsyncWebCrawler(
+    headless=True,
+    user_agent="Mozilla/5.0...",
+    headers={"Accept-Language": "en-US"}
+) as crawler:
+    result = await crawler.arun(
+        url="https://example.com",
+        magic=True,
+        simulate_user=True
+    )
+```
+
+## Note on Browser Types
+
+Each browser type has its own characteristics:
+
+- **chromium**: Best overall compatibility
+- **firefox**: Good for specific use cases
+- **webkit**: Lighter weight, good for basic crawling
+
+Choose based on your specific needs:
+```python
+# High compatibility
+crawler = AsyncWebCrawler(browser_type="chromium")
+
+# Memory efficient
+crawler = AsyncWebCrawler(browser_type="webkit")
+```
--- a/docs/md_v2/api/crawl-result.md
+++ b/docs/md_v2/api/crawl-result.md
@@ -0,0 +1,301 @@
+# CrawlResult
+
+The `CrawlResult` class represents the result of a web crawling operation. It provides access to various forms of extracted content and metadata from the crawled webpage.
+
+## Class Definition
+
+```python
+class CrawlResult(BaseModel):
+    """Result of a web crawling operation."""
+    
+    # Basic Information
+    url: str                                # Crawled URL
+    success: bool                           # Whether crawl succeeded
+    status_code: Optional[int] = None       # HTTP status code
+    error_message: Optional[str] = None     # Error message if failed
+    
+    # Content
+    html: str                              # Raw HTML content
+    cleaned_html: Optional[str] = None      # Cleaned HTML
+    fit_html: Optional[str] = None          # Most relevant HTML content
+    markdown: Optional[str] = None          # HTML converted to markdown
+    fit_markdown: Optional[str] = None      # Most relevant markdown content
+    
+    # Extracted Data
+    extracted_content: Optional[str] = None  # Content from extraction strategy
+    media: Dict[str, List[Dict]] = {}       # Extracted media information
+    links: Dict[str, List[Dict]] = {}       # Extracted links
+    metadata: Optional[dict] = None         # Page metadata
+    
+    # Additional Data
+    screenshot: Optional[str] = None         # Base64 encoded screenshot
+    session_id: Optional[str] = None         # Session identifier
+    response_headers: Optional[dict] = None  # HTTP response headers
+```
+
+## Properties and Their Data Structures
+
+### Basic Information
+
+```python
+# Access basic information
+result = await crawler.arun(url="https://example.com")
+
+print(result.url)          # "https://example.com"
+print(result.success)      # True/False
+print(result.status_code)  # 200, 404, etc.
+print(result.error_message)  # Error details if failed
+```
+
+### Content Properties
+
+#### HTML Content
+```python
+# Raw HTML
+html_content = result.html
+
+# Cleaned HTML (removed ads, popups, etc.)
+clean_content = result.cleaned_html
+
+# Most relevant HTML content
+main_content = result.fit_html
+```
+
+#### Markdown Content
+```python
+# Full markdown version
+markdown_content = result.markdown
+
+# Most relevant markdown content
+main_content = result.fit_markdown
+```
+
+### Media Content
+
+The media dictionary contains organized media elements:
+
+```python
+# Structure
+media = {
+    "images": [
+        {
+            "src": str,           # Image URL
+            "alt": str,           # Alt text
+            "desc": str,          # Contextual description
+            "score": float,       # Relevance score (0-10)
+            "type": str,          # "image"
+            "width": int,         # Image width (if available)
+            "height": int,        # Image height (if available)
+            "context": str,       # Surrounding text
+            "lazy": bool          # Whether image was lazy-loaded
+        }
+    ],
+    "videos": [
+        {
+            "src": str,           # Video URL
+            "type": str,          # "video"
+            "title": str,         # Video title
+            "poster": str,        # Thumbnail URL
+            "duration": str,      # Video duration
+            "description": str    # Video description
+        }
+    ],
+    "audios": [
+        {
+            "src": str,           # Audio URL
+            "type": str,          # "audio"
+            "title": str,         # Audio title
+            "duration": str,      # Audio duration
+            "description": str    # Audio description
+        }
+    ]
+}
+
+# Example usage
+for image in result.media["images"]:
+    if image["score"] > 5:  # High-relevance images
+        print(f"High-quality image: {image['src']}")
+        print(f"Context: {image['context']}")
+```
+
+### Link Analysis
+
+The links dictionary organizes discovered links:
+
+```python
+# Structure
+links = {
+    "internal": [
+        {
+            "href": str,          # URL
+            "text": str,          # Link text
+            "title": str,         # Title attribute
+            "type": str,          # Link type (nav, content, etc.)
+            "context": str,       # Surrounding text
+            "score": float        # Relevance score
+        }
+    ],
+    "external": [
+        {
+            "href": str,          # External URL
+            "text": str,          # Link text
+            "title": str,         # Title attribute
+            "domain": str,        # Domain name
+            "type": str,          # Link type
+            "context": str        # Surrounding text
+        }
+    ]
+}
+
+# Example usage
+for link in result.links["internal"]:
+    print(f"Internal link: {link['href']}")
+    print(f"Context: {link['context']}")
+```
+
+### Metadata
+
+The metadata dictionary contains page information:
+
+```python
+# Structure
+metadata = {
+    "title": str,                # Page title
+    "description": str,          # Meta description
+    "keywords": List[str],       # Meta keywords
+    "author": str,              # Author information
+    "published_date": str,      # Publication date
+    "modified_date": str,       # Last modified date
+    "language": str,            # Page language
+    "canonical_url": str,       # Canonical URL
+    "og_data": Dict,           # Open Graph data
+    "twitter_data": Dict       # Twitter card data
+}
+
+# Example usage
+if result.metadata:
+    print(f"Title: {result.metadata.get('title', 'Untitled')}")
+    print(f"Author: {result.metadata.get('author', 'Unknown')}")
+```
+
+### Extracted Content
+
+Content from extraction strategies:
+
+```python
+import json
+
+# For LLM or CSS extraction strategies
+if result.extracted_content:
+    structured_data = json.loads(result.extracted_content)
+    print(structured_data)
+```
+
+### Screenshot
+
+Base64 encoded screenshot:
+
+```python
+import base64
+
+# Save screenshot if available
+if result.screenshot:
+    # Decode and save
+    with open("screenshot.png", "wb") as f:
+        f.write(base64.b64decode(result.screenshot))
+```
+
+## Usage Examples
+
+### Basic Content Access
+```python
+async with AsyncWebCrawler() as crawler:
+    result = await crawler.arun(url="https://example.com")
+    
+    if result.success:
+        # Get clean content
+        print(result.fit_markdown)
+        
+        # Process images
+        for image in result.media["images"]:
+            if image["score"] > 7:
+                print(f"High-quality image: {image['src']}")
+```
+
+### Complete Data Processing
+```python
+from typing import Dict
+
+from crawl4ai import AsyncWebCrawler
+
+async def process_webpage(url: str) -> Dict:
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun(url=url)
+        
+        if not result.success:
+            raise Exception(f"Crawl failed: {result.error_message}")
+        
+        return {
+            "content": result.fit_markdown,
+            "images": [
+                img for img in result.media["images"]
+                if img["score"] > 5
+            ],
+            "internal_links": [
+                link["href"] for link in result.links["internal"]
+            ],
+            "metadata": result.metadata,
+            "status": result.status_code
+        }
+```
+
+### Error Handling
+```python
+from typing import Dict
+
+from crawl4ai import AsyncWebCrawler
+
+async def safe_crawl(url: str) -> Dict:
+    async with AsyncWebCrawler() as crawler:
+        try:
+            result = await crawler.arun(url=url)
+            
+            if not result.success:
+                return {
+                    "success": False,
+                    "error": result.error_message,
+                    "status": result.status_code
+                }
+            
+            return {
+                "success": True,
+                "content": result.fit_markdown,
+                "status": result.status_code
+            }
+            
+        except Exception as e:
+            return {
+                "success": False,
+                "error": str(e),
+                "status": None
+            }
+```
+
+## Best Practices
+
+1. **Always Check Success**
+```python
+if not result.success:
+    print(f"Error: {result.error_message}")
+    return  # Inside your own function: stop before reading the result's content
+```
+
+2. **Use fit_markdown for Articles**
+```python
+# Better for article content
+content = result.fit_markdown or result.markdown
+```
+
+3. **Filter Media by Score**
+```python
+relevant_images = [
+    img for img in result.media["images"]
+    if img["score"] > 5
+]
+```
+
+4. **Handle Missing Data**
+```python
+metadata = result.metadata or {}
+title = metadata.get('title', 'Unknown Title')
+```
--- a/docs/md_v2/api/parameters.md
+++ b/docs/md_v2/api/parameters.md
@@ -0,0 +1,35 @@
+# Parameter Reference Table
+
+| File Name | Parameter Name | Code Usage | Strategy/Class | Description |
+|-----------|---------------|------------|----------------|-------------|
+| async_crawler_strategy.py | user_agent | `kwargs.get("user_agent")` | AsyncPlaywrightCrawlerStrategy | User agent string for browser identification |
+| async_crawler_strategy.py | proxy | `kwargs.get("proxy")` | AsyncPlaywrightCrawlerStrategy | Proxy server configuration for network requests |
+| async_crawler_strategy.py | proxy_config | `kwargs.get("proxy_config")` | AsyncPlaywrightCrawlerStrategy | Detailed proxy configuration including auth |
+| async_crawler_strategy.py | headless | `kwargs.get("headless", True)` | AsyncPlaywrightCrawlerStrategy | Whether to run browser in headless mode |
+| async_crawler_strategy.py | browser_type | `kwargs.get("browser_type", "chromium")` | AsyncPlaywrightCrawlerStrategy | Type of browser to use (chromium/firefox/webkit) |
+| async_crawler_strategy.py | headers | `kwargs.get("headers", {})` | AsyncPlaywrightCrawlerStrategy | Custom HTTP headers for requests |
+| async_crawler_strategy.py | verbose | `kwargs.get("verbose", False)` | AsyncPlaywrightCrawlerStrategy | Enable detailed logging output |
+| async_crawler_strategy.py | sleep_on_close | `kwargs.get("sleep_on_close", False)` | AsyncPlaywrightCrawlerStrategy | Add delay before closing browser |
+| async_crawler_strategy.py | use_managed_browser | `kwargs.get("use_managed_browser", False)` | AsyncPlaywrightCrawlerStrategy | Use managed browser instance |
+| async_crawler_strategy.py | user_data_dir | `kwargs.get("user_data_dir", None)` | AsyncPlaywrightCrawlerStrategy | Custom directory for browser profile data |
+| async_crawler_strategy.py | session_id | `kwargs.get("session_id")` | AsyncPlaywrightCrawlerStrategy | Unique identifier for browser session |
+| async_crawler_strategy.py | override_navigator | `kwargs.get("override_navigator", False)` | AsyncPlaywrightCrawlerStrategy | Override browser navigator properties |
+| async_crawler_strategy.py | simulate_user | `kwargs.get("simulate_user", False)` | AsyncPlaywrightCrawlerStrategy | Simulate human-like behavior |
+| async_crawler_strategy.py | magic | `kwargs.get("magic", False)` | AsyncPlaywrightCrawlerStrategy | Enable advanced anti-detection features |
+| async_crawler_strategy.py | log_console | `kwargs.get("log_console", False)` | AsyncPlaywrightCrawlerStrategy | Log browser console messages |
+| async_crawler_strategy.py | js_only | `kwargs.get("js_only", False)` | AsyncPlaywrightCrawlerStrategy | Only execute JavaScript without page load |
+| async_crawler_strategy.py | page_timeout | `kwargs.get("page_timeout", 60000)` | AsyncPlaywrightCrawlerStrategy | Timeout for page load in milliseconds |
+| async_crawler_strategy.py | ignore_body_visibility | `kwargs.get("ignore_body_visibility", True)` | AsyncPlaywrightCrawlerStrategy | Process page even if body is hidden |
+| async_crawler_strategy.py | js_code | `kwargs.get("js_code", kwargs.get("js", self.js_code))` | AsyncPlaywrightCrawlerStrategy | Custom JavaScript code to execute |
+| async_crawler_strategy.py | wait_for | `kwargs.get("wait_for")` | AsyncPlaywrightCrawlerStrategy | Wait for specific element/condition |
+| async_crawler_strategy.py | process_iframes | `kwargs.get("process_iframes", False)` | AsyncPlaywrightCrawlerStrategy | Extract content from iframes |
+| async_crawler_strategy.py | delay_before_return_html | `kwargs.get("delay_before_return_html")` | AsyncPlaywrightCrawlerStrategy | Additional delay before returning HTML |
+| async_crawler_strategy.py | remove_overlay_elements | `kwargs.get("remove_overlay_elements", False)` | AsyncPlaywrightCrawlerStrategy | Remove pop-ups and overlay elements |
+| async_crawler_strategy.py | screenshot | `kwargs.get("screenshot")` | AsyncPlaywrightCrawlerStrategy | Take page screenshot |
+| async_crawler_strategy.py | screenshot_wait_for | `kwargs.get("screenshot_wait_for")` | AsyncPlaywrightCrawlerStrategy | Wait before taking screenshot |
+| async_crawler_strategy.py | semaphore_count | `kwargs.get("semaphore_count", 5)` | AsyncPlaywrightCrawlerStrategy | Concurrent request limit |
+| async_webcrawler.py | verbose | `kwargs.get("verbose", False)` | AsyncWebCrawler | Enable detailed logging |
+| async_webcrawler.py | warmup | `kwargs.get("warmup", True)` | AsyncWebCrawler | Initialize crawler with warmup request |
+| async_webcrawler.py | session_id | `kwargs.get("session_id", None)` | AsyncWebCrawler | Session identifier for browser reuse |
+| async_webcrawler.py | only_text | `kwargs.get("only_text", False)` | AsyncWebCrawler | Extract only text content |
+| async_webcrawler.py | bypass_cache | `kwargs.get("bypass_cache", False)` | AsyncWebCrawler | Skip cache and force fresh crawl |
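+
+The `Code Usage` column shows that each option is read with `kwargs.get(name, default)`, so an omitted option silently takes the listed default. A minimal sketch of that lookup pattern, including the nested `js_code` fallback, might look like the following. `StrategyConfig` and `resolve` are hypothetical names used only for illustration and are not part of Crawl4AI:
+
+```python
+from typing import Any, Dict
+
+
+class StrategyConfig:
+    """Hypothetical stand-in that mimics the kwargs.get(name, default) lookups."""
+
+    def __init__(self, **kwargs: Any) -> None:
+        # Constructor-level default JavaScript, used when a call passes none
+        self.js_code = kwargs.get("js_code")
+
+    def resolve(self, **kwargs: Any) -> Dict[str, Any]:
+        """Resolve per-call options, using defaults for keys that are absent."""
+        return {
+            # Nested fallback: "js_code", then the "js" alias, then the instance default
+            "js_code": kwargs.get("js_code", kwargs.get("js", self.js_code)),
+            "page_timeout": kwargs.get("page_timeout", 60000),
+            "magic": kwargs.get("magic", False),
+        }
+
+
+config = StrategyConfig(js_code="window.scrollTo(0, 0);")
+print(config.resolve())
+print(config.resolve(js="console.log('alias');", magic=True))
+```
+
+Because `dict.get` only falls back when the key is missing, explicitly passing `js_code=None` in this sketch yields `None` rather than the `js` alias or the instance default.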
--- a/docs/md_v2/api/strategies.md
+++ b/docs/md_v2/api/strategies.md
@@ -0,0 +1,255 @@
+# Extraction & Chunking Strategies API
+
+This page is the API reference for the extraction and chunking strategies in Crawl4AI.
+
+## Extraction Strategies
+
+All extraction strategies inherit from the base `ExtractionStrategy` class and implement two key methods:
+- `extract(url: str, html: str) -> List[Dict[str, Any]]`
+- `run(url: str, sections: List[str]) -> List[Dict[str, Any]]`
+
+### LLMExtractionStrategy
+
+Used for extracting structured data using Language Models.
+
+```python
+LLMExtractionStrategy(
+    # Provider Configuration
+    provider: str = DEFAULT_PROVIDER,     # LLM provider (e.g., "ollama/llama2")
+    api_token: Optional[str] = None,      # API token
+    
+    # Extraction Configuration
+    instruction: str = None,              # Custom extraction instruction
+    schema: Dict = None,                  # Pydantic model schema for structured data
+    extraction_type: str = "block",       # "block" or "schema"
+    
+    # Chunking Parameters
+    chunk_token_threshold: int = 4000,    # Maximum tokens per chunk
+    overlap_rate: float = 0.1,           # Overlap between chunks
+    word_token_rate: float = 0.75,       # Word to token conversion rate
+    apply_chunking: bool = True,         # Enable/disable chunking
+    
+    # API Configuration
+    base_url: str = None,                # Base URL for API
+    extra_args: Dict = {},               # Additional provider arguments
+    verbose: bool = False                # Enable verbose logging
+)
+```
+
+### CosineStrategy
+
+Used for content similarity-based extraction and clustering.
+
+```python
+CosineStrategy(
+    # Content Filtering
+    semantic_filter: str = None,        # Topic/keyword filter
+    word_count_threshold: int = 10,     # Minimum words per cluster
+    sim_threshold: float = 0.3,         # Similarity threshold
+    
+    # Clustering Parameters
+    max_dist: float = 0.2,             # Maximum cluster distance
+    linkage_method: str = 'ward',       # Clustering method
+    top_k: int = 3,                    # Top clusters to return
+    
+    # Model Configuration
+    model_name: str = 'sentence-transformers/all-MiniLM-L6-v2',  # Embedding model
+    
+    verbose: bool = False              # Enable verbose logging
+)
+```
+
+### JsonCssExtractionStrategy
+
+Used for CSS selector-based structured data extraction.
+
+```python
+JsonCssExtractionStrategy(
+    schema: Dict[str, Any],    # Extraction schema
+    verbose: bool = False      # Enable verbose logging
+)
+
+# Schema Structure
+schema = {
+    "name": str,              # Schema name
+    "baseSelector": str,      # Base CSS selector
+    "fields": [               # List of fields to extract
+        {
+            "name": str,      # Field name
+            "selector": str,  # CSS selector
+            "type": str,     # Field type: "text", "attribute", "html", "regex"
+            "attribute": str, # For type="attribute"
+            "pattern": str,  # For type="regex"
+            "transform": str, # Optional: "lowercase", "uppercase", "strip"
+            "default": Any    # Default value if extraction fails
+        }
+    ]
+}
+```
+
+## Chunking Strategies
+
+All chunking strategies inherit from `ChunkingStrategy` and implement the `chunk(text: str) -> list` method.
+
+### RegexChunking
+
+Splits text based on regex patterns.
+
+```python
+RegexChunking(
+    patterns: List[str] = None  # Regex patterns for splitting
+                               # Default: [r'\n\n']
+)
+```
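+
+To illustrate what pattern-based splitting does, the sketch below applies each pattern in turn using the standard `re` module. It shows the general technique under the assumption that patterns are applied one after another; it is not Crawl4AI's implementation, and `regex_chunk` is a hypothetical helper:
+
+```python
+import re
+from typing import List, Optional
+
+
+def regex_chunk(text: str, patterns: Optional[List[str]] = None) -> List[str]:
+    """Split text on each pattern in turn (illustrative, not the library code)."""
+    patterns = patterns or [r"\n\n"]  # same default as documented above
+    chunks = [text]
+    for pattern in patterns:
+        next_chunks = []
+        for chunk in chunks:
+            # Split every existing chunk again and drop empty pieces
+            next_chunks.extend(p for p in re.split(pattern, chunk) if p.strip())
+        chunks = next_chunks
+    return chunks
+
+
+sample = "First paragraph. Still first.\n\nSecond paragraph."
+print(regex_chunk(sample))                        # paragraphs only
+print(regex_chunk(sample, [r"\n\n", r"\.\s+"]))   # paragraphs, then sentences
+```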
+
+### SlidingWindowChunking
+
+Creates overlapping chunks with a sliding window approach.
+
+```python
+SlidingWindowChunking(
+    window_size: int = 100,    # Window size in words
+    step: int = 50             # Step size between windows
+)
+```
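+
+The relationship between `window_size` and `step` can be sketched in plain Python. This is a hedged illustration of a word-based sliding window, assuming whitespace tokenization; `sliding_window_chunk` is a hypothetical helper and not the library's code:
+
+```python
+from typing import List
+
+
+def sliding_window_chunk(text: str, window_size: int = 100, step: int = 50) -> List[str]:
+    """Word-based sliding window (illustrative sketch, not the library code)."""
+    words = text.split()
+    if len(words) <= window_size:
+        # Short input fits in a single window (or is empty)
+        return [" ".join(words)] if words else []
+    chunks = []
+    for start in range(0, len(words), step):
+        chunks.append(" ".join(words[start:start + window_size]))
+        if start + window_size >= len(words):
+            break  # the last window already reaches the end of the text
+    return chunks
+
+
+text = " ".join(f"w{i}" for i in range(10))
+for chunk in sliding_window_chunk(text, window_size=4, step=2):
+    print(chunk)
+```
+
+With a step smaller than the window, consecutive chunks share `window_size - step` words; a step equal to the window size produces no overlap.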
+
+### OverlappingWindowChunking
+
+Creates chunks with specified overlap.
+
+```python
+OverlappingWindowChunking(
+    window_size: int = 1000,   # Chunk size in words
+    overlap: int = 100         # Overlap size in words
+)
+```
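+
+One way to read the `overlap` parameter is that each chunk starts `window_size - overlap` words after the previous one. The sketch below works through that arithmetic under this assumption; `overlapping_chunk` is a hypothetical helper, not Crawl4AI's implementation:
+
+```python
+from typing import List
+
+
+def overlapping_chunk(text: str, window_size: int = 1000, overlap: int = 100) -> List[str]:
+    """Fixed-size word chunks that share `overlap` words (illustrative sketch)."""
+    if overlap >= window_size:
+        raise ValueError("overlap must be smaller than window_size")
+    words = text.split()
+    stride = window_size - overlap  # distance between chunk starts
+    chunks = []
+    start = 0
+    while start < len(words):
+        chunks.append(" ".join(words[start:start + window_size]))
+        if start + window_size >= len(words):
+            break  # this chunk already covers the tail of the text
+        start += stride
+    return chunks
+
+
+text = " ".join(f"w{i}" for i in range(12))
+for chunk in overlapping_chunk(text, window_size=5, overlap=2):
+    print(chunk)
+```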
+
+## Usage Examples
+
+### LLM Extraction
+
+```python
+import json
+
+from pydantic import BaseModel
+from crawl4ai.extraction_strategy import LLMExtractionStrategy
+
+# Define schema
+class Article(BaseModel):
+    title: str
+    content: str
+    author: str
+
+# Create strategy
+strategy = LLMExtractionStrategy(
+    provider="ollama/llama2",
+    schema=Article.schema(),
+    instruction="Extract article details"
+)
+
+# Use with crawler
+result = await crawler.arun(
+    url="https://example.com/article",
+    extraction_strategy=strategy
+)
+
+# Access extracted data
+data = json.loads(result.extracted_content)
+```
+
+### CSS Extraction
+
+```python
+from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+
+# Define schema
+schema = {
+    "name": "Product List",
+    "baseSelector": ".product-card",
+    "fields": [
+        {
+            "name": "title",
+            "selector": "h2.title",
+            "type": "text"
+        },
+        {
+            "name": "price",
+            "selector": ".price",
+            "type": "text",
+            "transform": "strip"
+        },
+        {
+            "name": "image",
+            "selector": "img",
+            "type": "attribute",
+            "attribute": "src"
+        }
+    ]
+}
+
+# Create and use strategy
+strategy = JsonCssExtractionStrategy(schema)
+result = await crawler.arun(
+    url="https://example.com/products",
+    extraction_strategy=strategy
+)
+```
+
+### Content Chunking
+
+```python
+from crawl4ai.chunking_strategy import OverlappingWindowChunking
+from crawl4ai.extraction_strategy import LLMExtractionStrategy
+
+# Create chunking strategy
+chunker = OverlappingWindowChunking(
+    window_size=500,  # 500 words per chunk
+    overlap=50        # 50 words overlap
+)
+
+# Use with extraction strategy
+strategy = LLMExtractionStrategy(
+    provider="ollama/llama2",
+    chunking_strategy=chunker
+)
+
+result = await crawler.arun(
+    url="https://example.com/long-article",
+    extraction_strategy=strategy
+)
+```
+
+## Best Practices
+
+1. **Choose the Right Strategy**
+   - Use `LLMExtractionStrategy` for complex, unstructured content
+   - Use `JsonCssExtractionStrategy` for well-structured HTML
+   - Use `CosineStrategy` for content similarity and clustering
+
+2. **Optimize Chunking**
+   ```python
+   # For long documents
+   strategy = LLMExtractionStrategy(
+       chunk_token_threshold=2000,  # Smaller chunks
+       overlap_rate=0.1           # 10% overlap
+   )
+   ```
+
+3. **Handle Errors**
+   ```python
+   try:
+       result = await crawler.arun(
+           url="https://example.com",
+           extraction_strategy=strategy
+       )
+       if result.success:
+           content = json.loads(result.extracted_content)
+   except Exception as e:
+       print(f"Extraction failed: {e}")
+   ```
+
+4. **Monitor Performance**
+   ```python
+   strategy = CosineStrategy(
+       verbose=True,  # Enable logging
+       word_count_threshold=20,  # Filter short content
+       top_k=5  # Limit results
+   )
+   ```
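+
+The error-handling pattern in item 3 can be wrapped in a small helper so that a failed crawl, empty output, and malformed JSON are all handled in one place. This is a sketch with a hypothetical helper name, `parse_extracted`; it relies only on the `success` and `extracted_content` attributes used above, and the stand-in objects exist purely to exercise it:
+
+```python
+import json
+from types import SimpleNamespace
+from typing import Any, Optional
+
+
+def parse_extracted(result: Any) -> Optional[Any]:
+    """Return the decoded extraction output, or None if it is missing or invalid."""
+    if not getattr(result, "success", False):
+        return None
+    raw = getattr(result, "extracted_content", None)
+    if not raw:
+        return None
+    try:
+        return json.loads(raw)
+    except json.JSONDecodeError:
+        return None
+
+
+# Stand-ins for crawl results, used only to exercise the helper.
+ok = SimpleNamespace(success=True, extracted_content='[{"title": "Hello"}]')
+bad = SimpleNamespace(success=True, extracted_content="not json")
+failed = SimpleNamespace(success=False, extracted_content=None)
+
+print(parse_extracted(ok), parse_extracted(bad), parse_extracted(failed))
+```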
--- a/docs/md_v2/assets/DankMono-Bold.woff2
+++ b/docs/md_v2/assets/DankMono-Bold.woff2
--- a/docs/md_v2/assets/DankMono-Italic.woff2
+++ b/docs/md_v2/assets/DankMono-Italic.woff2
--- a/docs/md_v2/assets/DankMono-Regular.woff2
+++ b/docs/md_v2/assets/DankMono-Regular.woff2
--- a/docs/md_v2/assets/Monaco.woff
+++ b/docs/md_v2/assets/Monaco.woff
--- a/docs/md_v2/assets/dmvendor.css
+++ b/docs/md_v2/assets/dmvendor.css
--- a/docs/md_v2/assets/docs.zip
+++ b/docs/md_v2/assets/docs.zip
--- a/docs/md_v2/assets/highlight.css
+++ b/docs/md_v2/assets/highlight.css
--- a/docs/md_v2/assets/highlight.min.js
+++ b/docs/md_v2/assets/highlight.min.js
--- a/docs/md_v2/assets/highlight_init.js
+++ b/docs/md_v2/assets/highlight_init.js
@@ -0,0 +1,6 @@
+document.addEventListener('DOMContentLoaded', (event) => {
+    document.querySelectorAll('pre code').forEach((block) => {
+        hljs.highlightBlock(block);
+    });
+});
+  
--- a/docs/md_v2/assets/styles.css
+++ b/docs/md_v2/assets/styles.css
@@ -0,0 +1,160 @@
+@font-face {
+    font-family: "Monaco";
+    font-style: normal;
+    font-weight: normal;
+    src: local("Monaco"), url("Monaco.woff") format("woff");
+}
+
+:root {
+    --global-font-size: 16px;
+    --global-line-height: 1.5em;
+    --global-space: 10px;
+    --font-stack: Menlo, Monaco, Lucida Console, Liberation Mono, DejaVu Sans Mono, Bitstream Vera Sans Mono,
+        Courier New, monospace, serif;
+    --font-stack: dm, Monaco, Courier New, monospace, serif;
+    --mono-font-stack: Menlo, Monaco, Lucida Console, Liberation Mono, DejaVu Sans Mono, Bitstream Vera Sans Mono,
+        Courier New, monospace, serif;
+
+    --background-color: #151515; /* Dark background */
+    --font-color: #eaeaea; /* Light font color for contrast */
+    --invert-font-color: #151515; /* Dark color for inverted elements */
+    --primary-color: #1a95e0; /* Primary color can remain the same or be adjusted for better contrast */
+    --secondary-color: #727578; /* Secondary color for less important text */
+    --error-color: #ff5555; /* Bright color for errors */
+    --progress-bar-background: #444; /* Darker background for progress bar */
+    --progress-bar-fill: #1a95e0; /* Bright color for progress bar fill */
+    --code-bg-color: #1e1e1e; /* Darker background for code blocks */
+    --input-style: solid; /* Keeping input style solid */
+    --block-background-color: #202020; /* Darker background for block elements */
+    --global-font-color: #eaeaea; /* Light font color for global elements */
+
+    --background-color: #222225;
+
+    --background-color: #070708;
+    --page-width: 70em;
+    --font-color: #e8e9ed;
+    --invert-font-color: #222225;
+    --secondary-color: #a3abba;
+    --secondary-color: #d5cec0;
+    --tertiary-color: #a3abba;
+    --primary-color: #09b5a5; /* Updated to the brand color */
+    --primary-color: #50ffff; /* Updated to the brand color */
+    --error-color: #ff3c74;
+    --progress-bar-background: #3f3f44;
+    --progress-bar-fill: #09b5a5; /* Updated to the brand color */
+    --code-bg-color: #3f3f44;
+    --input-style: solid;
+    --display-h1-decoration: none;
+
+    --display-h1-decoration: none;
+}
+
+/* body {
+    background-color: var(--background-color);
+    color: var(--font-color);
+}
+
+a {
+    color: var(--primary-color);
+}
+
+a:hover {
+    background-color: var(--primary-color);
+    color: var(--invert-font-color);
+}
+
+blockquote::after {
+    color: #444; 
+}
+
+pre, code {
+    background-color: var(--code-bg-color);
+    color: var(--font-color);
+}
+
+.terminal-nav:first-child {
+    border-bottom: 1px dashed var(--secondary-color);
+} */
+
+.terminal-mkdocs-main-content {
+    line-height: var(--global-line-height);
+}
+
+strong,
+.highlight {
+    /* background: url(//s2.svgbox.net/pen-brushes.svg?ic=brush-1&color=50ffff); */
+    background-color: #50ffff33;
+}
+
+.terminal-card > header {
+    color: var(--font-color);
+    text-align: center;
+    background-color: var(--progress-bar-background);
+    padding: 0.3em 0.5em;
+}
+.btn.btn-sm {
+    color: var(--font-color);
+    padding: 0.2em 0.5em;
+    font-size: 0.8em;
+}
+
+.loading-message {
+    display: none;
+    margin-top: 20px;
+}
+
+.response-section {
+    display: none;
+    padding-top: 20px;
+}
+
+.tabs {
+    display: flex;
+    flex-direction: column;
+}
+.tab-list {
+    display: flex;
+    padding: 0;
+    margin: 0;
+    list-style-type: none;
+    border-bottom: 1px solid var(--font-color);
+}
+.tab-item {
+    cursor: pointer;
+    padding: 10px;
+    border: 1px solid var(--font-color);
+    margin-right: -1px;
+    border-bottom: none;
+}
+.tab-item:hover,
+.tab-item:focus,
+.tab-item:active {
+    background-color: var(--progress-bar-background);
+}
+.tab-content {
+    display: none;
+    border: 1px solid var(--font-color);
+    border-top: none;
+}
+.tab-content:first-of-type {
+    display: block;
+}
+
+.tab-content header {
+    padding: 0.5em;
+    display: flex; 
+    justify-content: end; 
+    align-items: center;
+    background-color: var(--progress-bar-background);
+}
+.tab-content pre {
+    margin: 0;
+    max-height: 300px;
+    overflow: auto;
+    border: none;
+}
+
+ol li::before {
+    content: counters(item, ".") ". ";
+    counter-increment: item;
+    /* float: left; */
+    /* padding-right: 5px; */
+}
--- a/docs/md_v2/basic/browser-config.md
+++ b/docs/md_v2/basic/browser-config.md
@@ -0,0 +1,208 @@
+# Browser Configuration
+
+Crawl4AI supports multiple browser engines and offers extensive configuration options for browser behavior.
+
+## Browser Types
+
+Choose from three browser engines:
+
+```python
+# Chromium (default)
+async with AsyncWebCrawler(browser_type="chromium") as crawler:
+    result = await crawler.arun(url="https://example.com")
+
+# Firefox
+async with AsyncWebCrawler(browser_type="firefox") as crawler:
+    result = await crawler.arun(url="https://example.com")
+
+# WebKit
+async with AsyncWebCrawler(browser_type="webkit") as crawler:
+    result = await crawler.arun(url="https://example.com")
+```
+
+## Basic Configuration
+
+Common browser settings:
+
+```python
+async with AsyncWebCrawler(
+    headless=True,           # Run in headless mode (no GUI)
+    verbose=True,           # Enable detailed logging
+    sleep_on_close=False    # No delay when closing browser
+) as crawler:
+    result = await crawler.arun(url="https://example.com")
+```
+
+## Identity Management
+
+Control how your crawler appears to websites:
+
+```python
+# Custom user agent
+async with AsyncWebCrawler(
+    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
+) as crawler:
+    result = await crawler.arun(url="https://example.com")
+
+# Custom headers
+headers = {
+    "Accept-Language": "en-US,en;q=0.9",
+    "Cache-Control": "no-cache"
+}
+async with AsyncWebCrawler(headers=headers) as crawler:
+    result = await crawler.arun(url="https://example.com")
+```
+
+## Screenshot Capabilities
+
+Capture a page screenshot; the image is returned as a Base64-encoded string:
+
+```python
+result = await crawler.arun(
+    url="https://example.com",
+    screenshot=True,                # Enable screenshot
+    screenshot_wait_for=2.0        # Wait 2 seconds before capture
+)
+
+if result.screenshot:  # Base64 encoded image
+    import base64
+    with open("screenshot.png", "wb") as f:
+        f.write(base64.b64decode(result.screenshot))
+```
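+
+Before writing the decoded bytes to disk, it can help to confirm that they look like an image. The sketch below assumes the screenshot is a PNG, as the filename in the example above suggests; `decode_png` is a hypothetical helper, and the sample string merely stands in for `result.screenshot`:
+
+```python
+import base64
+
+PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first eight bytes of every PNG file
+
+
+def decode_png(encoded: str) -> bytes:
+    """Decode a Base64 string and check that it starts with the PNG signature."""
+    data = base64.b64decode(encoded)
+    if not data.startswith(PNG_SIGNATURE):
+        raise ValueError("decoded data does not look like a PNG image")
+    return data
+
+
+# Stand-in for result.screenshot: the PNG signature plus placeholder bytes.
+sample = base64.b64encode(PNG_SIGNATURE + b"placeholder").decode("ascii")
+image_bytes = decode_png(sample)
+print(len(image_bytes))
+```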
+
+## Timeouts and Waiting
+
+Control page loading behavior:
+
+```python
+result = await crawler.arun(
+    url="https://example.com",
+    page_timeout=60000,              # Page load timeout (ms)
+    delay_before_return_html=2.0,    # Wait before content capture
+    wait_for="css:.dynamic-content"  # Wait for specific element
+)
+```
+
+## JavaScript Execution
+
+Execute custom JavaScript on the page before its content is captured:
+
+```python
+# Single JavaScript command
+result = await crawler.arun(
+    url="https://example.com",
+    js_code="window.scrollTo(0, document.body.scrollHeight);"
+)
+
+# Multiple commands
+js_commands = [
+    "window.scrollTo(0, document.body.scrollHeight);",
+    "document.querySelector('.load-more')?.click();"  # ?. avoids an error if the button is absent
+]
+result = await crawler.arun(
+    url="https://example.com",
+    js_code=js_commands
+)
+```
+
+## Proxy Configuration
+
+Route browser traffic through a proxy:
+
+```python
+# Simple proxy
+async with AsyncWebCrawler(
+    proxy="http://proxy.example.com:8080"
+) as crawler:
+    result = await crawler.arun(url="https://example.com")
+
+# Proxy with authentication
+proxy_config = {
+    "server": "http://proxy.example.com:8080",
+    "username": "user",
+    "password": "pass"
+}
+async with AsyncWebCrawler(proxy_config=proxy_config) as crawler:
+    result = await crawler.arun(url="https://example.com")
+```
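+
+Proxy credentials are often supplied as a single URL. As a sketch, a hypothetical helper `proxy_config_from_url` (not part of Crawl4AI) can split such a URL into the `server`, `username`, and `password` keys shown above using the standard library:
+
+```python
+from typing import Dict
+from urllib.parse import urlsplit
+
+
+def proxy_config_from_url(proxy_url: str) -> Dict[str, str]:
+    """Split 'scheme://user:pass@host:port' into server, username, and password."""
+    parts = urlsplit(proxy_url)
+    if not parts.scheme or not parts.hostname:
+        raise ValueError(f"not a valid proxy URL: {proxy_url!r}")
+    server = f"{parts.scheme}://{parts.hostname}"
+    if parts.port:
+        server += f":{parts.port}"
+    config = {"server": server}
+    if parts.username:
+        config["username"] = parts.username
+    if parts.password:
+        config["password"] = parts.password
+    return config
+
+
+proxy_config = proxy_config_from_url("http://user:pass@proxy.example.com:8080")
+print(proxy_config)
+```
+
+The resulting dictionary has the same shape as the `proxy_config` example above. Percent-encoded credentials are left as-is in this sketch.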
+
+## Anti-Detection Features
+
+Enable stealth features to avoid bot detection:
+
+```python
+result = await crawler.arun(
+    url="https://example.com",
+    simulate_user=True,        # Simulate human behavior
+    override_navigator=True,   # Mask automation signals
+    magic=True               # Enable all anti-detection features
+)
+```
+
+## Handling Dynamic Content
+
+Configure the browser to handle dynamic content:
+
+```python
+# Wait for dynamic content
+result = await crawler.arun(
+    url="https://example.com",
+    wait_for="js:() => document.querySelector('.content').children.length > 10",
+    process_iframes=True     # Process iframe content
+)
+
+# Handle lazy-loaded images
+result = await crawler.arun(
+    url="https://example.com",
+    js_code="window.scrollTo(0, document.body.scrollHeight);",
+    delay_before_return_html=2.0  # Wait for images to load
+)
+```
+
+## Comprehensive Example
+
+Here's how to combine various browser configurations:
+
+```python
+from crawl4ai import AsyncWebCrawler
+
+async def crawl_with_advanced_config(url: str):
+    async with AsyncWebCrawler(
+        # Browser setup
+        browser_type="chromium",
+        headless=True,
+        verbose=True,
+        
+        # Identity
+        user_agent="Custom User Agent",
+        headers={"Accept-Language": "en-US"},
+        
+        # Proxy setup
+        proxy="http://proxy.example.com:8080"
+    ) as crawler:
+        result = await crawler.arun(
+            url=url,
+            # Content handling
+            process_iframes=True,
+            screenshot=True,
+            
+            # Timing
+            page_timeout=60000,
+            delay_before_return_html=2.0,
+            
+            # Anti-detection
+            magic=True,
+            simulate_user=True,
+            
+            # Dynamic content
+            js_code=[
+                "window.scrollTo(0, document.body.scrollHeight);",
+                "document.querySelector('.load-more')?.click();"
+            ],
+            wait_for="css:.dynamic-content"
+        )
+        
+        return {
+            "content": result.markdown,
+            "screenshot": result.screenshot,
+            "success": result.success
+        }
+```
--- a/docs/md_v2/basic/content-selection.md
+++ b/docs/md_v2/basic/content-selection.md
@@ -0,0 +1,199 @@
+# Content Selection
+
+Crawl4AI provides multiple ways to select and filter specific content from webpages. Learn how to precisely target the content you need.
+
+## CSS Selectors
+
+The simplest way to extract specific content:
+
+```python
+# Extract specific content using CSS selector
+result = await crawler.arun(
+    url="https://example.com",
+    css_selector=".main-article"  # Target main article content
+)
+
+# Multiple selectors
+result = await crawler.arun(
+    url="https://example.com",
+    css_selector="article h1, article .content"  # Target heading and content
+)
+```
+
+## Content Filtering
+
+Control what content is included or excluded:
+
+```python
+result = await crawler.arun(
+    url="https://example.com",
+    # Content thresholds
+    word_count_threshold=10,        # Minimum words per block
+    
+    # Tag exclusions
+    excluded_tags=['form', 'header', 'footer', 'nav'],
+    
+    # Link filtering
+    exclude_external_links=True,    # Remove external links
+    exclude_social_media_links=True,  # Remove social media links
+    
+    # Media filtering
+    exclude_external_images=True   # Remove external images
+)
+```
+
+## Iframe Content
+
+Process content inside iframes:
+
+```python
+result = await crawler.arun(
+    url="https://example.com",
+    process_iframes=True,  # Extract iframe content
+    remove_overlay_elements=True  # Remove popups/modals that might block iframes
+)
+```
+
+## Structured Content Selection
+
+### Using LLMs for Smart Selection
+
+Use LLMs to intelligently extract specific types of content:
+
+```python
+import json
+from typing import List
+
+from pydantic import BaseModel
+from crawl4ai.extraction_strategy import LLMExtractionStrategy
+
+class ArticleContent(BaseModel):
+    title: str
+    main_points: List[str]
+    conclusion: str
+
+strategy = LLMExtractionStrategy(
+    provider="ollama/nemotron",  # Works with any supported LLM
+    schema=ArticleContent.schema(),
+    instruction="Extract the main article title, key points, and conclusion"
+)
+
+result = await crawler.arun(
+    url="https://example.com",
+    extraction_strategy=strategy
+)
+article = json.loads(result.extracted_content)
+```
+
+### Pattern-Based Selection
+
+For repeated content patterns (like product listings, news feeds):
+
+```python
+import json
+
+from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+
+schema = {
+    "name": "News Articles",
+    "baseSelector": "article.news-item",  # Repeated element
+    "fields": [
+        {"name": "headline", "selector": "h2", "type": "text"},
+        {"name": "summary", "selector": ".summary", "type": "text"},
+        {"name": "category", "selector": ".category", "type": "text"},
+        {
+            "name": "metadata",
+            "type": "nested",
+            "fields": [
+                {"name": "author", "selector": ".author", "type": "text"},
+                {"name": "date", "selector": ".date", "type": "text"}
+            ]
+        }
+    ]
+}
+
+strategy = JsonCssExtractionStrategy(schema)
+result = await crawler.arun(
+    url="https://example.com",
+    extraction_strategy=strategy
+)
+articles = json.loads(result.extracted_content)
+```
+
+## Domain-Based Filtering
+
+Control content based on domains:
+
+```python
+result = await crawler.arun(
+    url="https://example.com",
+    exclude_domains=["ads.com", "tracker.com"],
+    exclude_social_media_domains=["facebook.com", "twitter.com"],  # Custom social media domains to exclude
+    exclude_social_media_links=True
+)
+```
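+
+How Crawl4AI matches domains internally is not specified here. As one reasonable interpretation, the sketch below treats a link as excluded when its host equals an excluded domain or is a subdomain of it; `is_excluded` is a hypothetical helper for illustration only:
+
+```python
+from typing import Iterable
+from urllib.parse import urlsplit
+
+
+def is_excluded(url: str, excluded_domains: Iterable[str]) -> bool:
+    """True if the URL's host is an excluded domain or one of its subdomains."""
+    host = (urlsplit(url).hostname or "").lower()
+    return any(host == d or host.endswith("." + d) for d in excluded_domains)
+
+
+links = [
+    "https://example.com/post",
+    "https://cdn.ads.com/banner.js",
+    "https://notads.com/page",
+]
+kept = [u for u in links if not is_excluded(u, ["ads.com", "tracker.com"])]
+print(kept)
+```
+
+Matching on the `"." + domain` suffix rather than on a plain substring keeps unrelated hosts such as `notads.com` from being excluded by accident.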
+
+## Media Selection
+
+Select specific types of media:
+
+```python
+result = await crawler.arun(url="https://example.com")
+
+# Access different media types
+images = result.media["images"]  # List of image details
+videos = result.media["videos"]  # List of video details
+audios = result.media["audios"]  # List of audio details
+
+# Image with metadata
+for image in images:
+    print(f"URL: {image['src']}")
+    print(f"Alt text: {image['alt']}")
+    print(f"Description: {image['desc']}")
+    print(f"Relevance score: {image['score']}")
+```
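+
+The per-image metadata can be used to keep only the most relevant images. The sketch below assumes that `score` is numeric and that higher values mean more relevant; `top_images` is a hypothetical helper, and the sample data is made up to mirror the keys shown above:
+
+```python
+from typing import Any, Dict, List
+
+
+def top_images(images: List[Dict[str, Any]], min_score: float, limit: int = 5) -> List[Dict[str, Any]]:
+    """Keep images scoring at least `min_score`, highest score first."""
+    kept = [img for img in images if img.get("score", 0) >= min_score]
+    kept.sort(key=lambda img: img.get("score", 0), reverse=True)
+    return kept[:limit]
+
+
+# Sample data shaped like result.media["images"]; the values are made up.
+images = [
+    {"src": "https://example.com/a.png", "alt": "Logo", "desc": "", "score": 1},
+    {"src": "https://example.com/b.jpg", "alt": "Chart", "desc": "Quarterly revenue", "score": 5},
+    {"src": "https://example.com/c.jpg", "alt": "Team", "desc": "Team photo", "score": 3},
+]
+best = top_images(images, min_score=3)
+print([img["src"] for img in best])
+```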
+
+## Comprehensive Example
+
+Here's how to combine different selection methods:
+
+```python
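+import json
+from typing import List
+
+from pydantic import BaseModel
+from crawl4ai import AsyncWebCrawler
+from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, LLMExtractionStrategy
+
+async def extract_article_content(url: str):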
+    # Define structured extraction
+    article_schema = {
+        "name": "Article",
+        "baseSelector": "article.main",
+        "fields": [
+            {"name": "title", "selector": "h1", "type": "text"},
+            {"name": "content", "selector": ".content", "type": "text"}
+        ]
+    }
+    
+    # Define LLM extraction
+    class ArticleAnalysis(BaseModel):
+        key_points: List[str]
+        sentiment: str
+        category: str
+
+    async with AsyncWebCrawler() as crawler:
+        # Get structured content
+        pattern_result = await crawler.arun(
+            url=url,
+            extraction_strategy=JsonCssExtractionStrategy(article_schema),
+            word_count_threshold=10,
+            excluded_tags=['nav', 'footer'],
+            exclude_external_links=True
+        )
+        
+        # Get semantic analysis
+        analysis_result = await crawler.arun(
+            url=url,
+            extraction_strategy=LLMExtractionStrategy(
+                provider="ollama/nemotron",
+                schema=ArticleAnalysis.schema(),
+                instruction="Analyze the article content"
+            )
+        )
+        
+        # Combine results
+        return {
+            "article": json.loads(pattern_result.extracted_content),
+            "analysis": json.loads(analysis_result.extracted_content),
+            "media": pattern_result.media
+        }
+```
--- a/docs/md_v2/basic/docker-deploymeny.md
+++ b/docs/md_v2/basic/docker-deploymeny.md
@@ -0,0 +1,459 @@
+# Docker Deployment
+
+Crawl4AI provides official Docker images for easy deployment and scalability. This guide covers installation, configuration, and usage of Crawl4AI in Docker environments.
+
+## Quick Start 🚀
+
+Pull and run the basic version:
+
+```bash
+docker pull unclecode/crawl4ai:basic
+docker run -p 11235:11235 unclecode/crawl4ai:basic
+```
+
+Test the deployment:
+```python
+import requests
+
+# Test health endpoint
+health = requests.get("http://localhost:11235/health")
+print("Health check:", health.json())
+
+# Test basic crawl
+response = requests.post(
+    "http://localhost:11235/crawl",
+    json={
+        "urls": "https://www.nbcnews.com/business",
+        "priority": 10
+    }
+)
+task_id = response.json()["task_id"]
+print("Task ID:", task_id)
+```
+
+## Available Images 🏷️
+
+- `unclecode/crawl4ai:basic` - Basic web crawling capabilities
+- `unclecode/crawl4ai:all` - Full installation with all features
+- `unclecode/crawl4ai:gpu` - GPU-enabled version for ML features
+
+## Configuration Options 🔧
+
+### Environment Variables
+
+```bash
+docker run -p 11235:11235 \
+    -e MAX_CONCURRENT_TASKS=5 \
+    -e OPENAI_API_KEY=your_key \
+    unclecode/crawl4ai:all
+```
+
+### Volume Mounting
+
+Mount a directory for persistent data:
+```bash
+docker run -p 11235:11235 \
+    -v $(pwd)/data:/app/data \
+    unclecode/crawl4ai:all
+```
+
+### Resource Limits
+
+Control container resources:
+```bash
+docker run -p 11235:11235 \
+    --memory=4g \
+    --cpus=2 \
+    unclecode/crawl4ai:all
+```
+
+## Usage Examples 📝
+
+### Basic Crawling
+
+```python
+import requests
+
+request = {
+    "urls": "https://www.nbcnews.com/business",
+    "priority": 10
+}
+
+response = requests.post("http://localhost:11235/crawl", json=request)
+task_id = response.json()["task_id"]
+
+# Fetch the task status; the task may still be running, so poll until it reports "completed"
+result = requests.get(f"http://localhost:11235/task/{task_id}")
+```
+
+### Structured Data Extraction
+
+```python
+schema = {
+    "name": "Crypto Prices",
+    "baseSelector": ".cds-tableRow-t45thuk",
+    "fields": [
+        {
+            "name": "crypto",
+            "selector": "td:nth-child(1) h2",
+            "type": "text",
+        },
+        {
+            "name": "price",
+            "selector": "td:nth-child(2)",
+            "type": "text",
+        }
+    ],
+}
+
+request = {
+    "urls": "https://www.coinbase.com/explore",
+    "extraction_config": {
+        "type": "json_css",
+        "params": {"schema": schema}
+    }
+}
+```
+
+### Dynamic Content Handling
+
+```python
+request = {
+    "urls": "https://www.nbcnews.com/business",
+    "js_code": [
+        "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
+    ],
+    "wait_for": "article.tease-card:nth-child(10)"
+}
+```
+
+### AI-Powered Extraction (Full Version)
+
+```python
+request = {
+    "urls": "https://www.nbcnews.com/business",
+    "extraction_config": {
+        "type": "cosine",
+        "params": {
+            "semantic_filter": "business finance economy",
+            "word_count_threshold": 10,
+            "max_dist": 0.2,
+            "top_k": 3
+        }
+    }
+}
+```
+
+## Platform-Specific Instructions 💻
+
+### macOS
+```bash
+docker pull unclecode/crawl4ai:basic
+docker run -p 11235:11235 unclecode/crawl4ai:basic
+```
+
+### Ubuntu
+```bash
+# Basic version
+docker pull unclecode/crawl4ai:basic
+docker run -p 11235:11235 unclecode/crawl4ai:basic
+
+# With GPU support
+docker pull unclecode/crawl4ai:gpu
+docker run --gpus all -p 11235:11235 unclecode/crawl4ai:gpu
+```
+
+### Windows (PowerShell)
+```powershell
+docker pull unclecode/crawl4ai:basic
+docker run -p 11235:11235 unclecode/crawl4ai:basic
+```
+
+## Testing 🧪
+
+Save this as `test_docker.py`:
+
+```python
+import requests
+import time
+
+class Crawl4AiTester:
+    def __init__(self, base_url: str = "http://localhost:11235"):
+        self.base_url = base_url
+        
+    def submit_and_wait(self, request_data: dict, timeout: int = 300) -> dict:
+        # Submit crawl job; fail fast on an HTTP error instead of a KeyError below
+        response = requests.post(f"{self.base_url}/crawl", json=request_data)
+        response.raise_for_status()
+        task_id = response.json()["task_id"]
+        print(f"Task ID: {task_id}")
+
+        # Poll until the task completes or the timeout (in seconds) expires
+        start_time = time.time()
+        while True:
+            if time.time() - start_time > timeout:
+                raise TimeoutError(f"Task {task_id} timed out after {timeout} seconds")
+
+            result = requests.get(f"{self.base_url}/task/{task_id}")
+            status = result.json()
+
+            if status["status"] == "completed":
+                return status
+
+            time.sleep(2)
+
+def test_deployment():
+    tester = Crawl4AiTester()
+    
+    # Test basic crawl
+    request = {
+        "urls": "https://www.nbcnews.com/business",
+        "priority": 10
+    }
+    
+    result = tester.submit_and_wait(request)
+    print("Basic crawl successful!")
+    print(f"Content length: {len(result['result']['markdown'])}")
+
+if __name__ == "__main__":
+    test_deployment()
+```
+
+## Advanced Configuration ⚙️
+
+### Crawler Parameters
+
+The `crawler_params` field allows you to configure the browser instance and crawling behavior. Here are key parameters you can use:
+
+```python
+request = {
+    "urls": "https://example.com",
+    "crawler_params": {
+        # Browser Configuration
+        "headless": True,                    # Run in headless mode
+        "browser_type": "chromium",          # chromium/firefox/webkit
+        "user_agent": "custom-agent",        # Custom user agent
+        "proxy": "http://proxy:8080",        # Proxy configuration
+        
+        # Performance & Behavior
+        "page_timeout": 30000,               # Page load timeout (ms)
+        "verbose": True,                     # Enable detailed logging
+        "semaphore_count": 5,               # Concurrent request limit
+        
+        # Anti-Detection Features
+        "simulate_user": True,               # Simulate human behavior
+        "magic": True,                       # Advanced anti-detection
+        "override_navigator": True,          # Override navigator properties
+        
+        # Session Management
+        "user_data_dir": "./browser-data",   # Browser profile location
+        "use_managed_browser": True,         # Use persistent browser
+    }
+}
+```
+
+### Extra Parameters
+
+The `extra` field allows passing additional parameters directly to the crawler's `arun` function:
+
+```python
+request = {
+    "urls": "https://example.com",
+    "extra": {
+        "word_count_threshold": 10,          # Min words per block
+        "only_text": True,                   # Extract only text
+        "bypass_cache": True,                # Force fresh crawl
+        "process_iframes": True,             # Include iframe content
+    }
+}
+```
+
+### Complete Examples
+
+1. **Advanced News Crawling**
+```python
+request = {
+    "urls": "https://www.nbcnews.com/business",
+    "crawler_params": {
+        "headless": True,
+        "page_timeout": 30000,
+        "remove_overlay_elements": True      # Remove popups
+    },
+    "extra": {
+        "word_count_threshold": 50,          # Longer content blocks
+        "bypass_cache": True                 # Fresh content
+    },
+    "css_selector": ".article-body"
+}
+```
+
+2. **Anti-Detection Configuration**
+```python
+request = {
+    "urls": "https://example.com",
+    "crawler_params": {
+        "simulate_user": True,
+        "magic": True,
+        "override_navigator": True,
+        "user_agent": "Mozilla/5.0 ...",
+        "headers": {
+            "Accept-Language": "en-US,en;q=0.9"
+        }
+    }
+}
+```
+
+3. **LLM Extraction with Custom Parameters**
+```python
+request = {
+    "urls": "https://openai.com/pricing",
+    "extraction_config": {
+        "type": "llm",
+        "params": {
+            "provider": "openai/gpt-4",
+            "schema": pricing_schema         # JSON schema dict, defined elsewhere
+        }
+    },
+    "crawler_params": {
+        "verbose": True,
+        "page_timeout": 60000
+    },
+    "extra": {
+        "word_count_threshold": 1,
+        "only_text": True
+    }
+}
+```
+
+4. **Session-Based Dynamic Content**
+```python
+request = {
+    "urls": "https://example.com",
+    "crawler_params": {
+        "session_id": "dynamic_session",
+        "headless": False,
+        "page_timeout": 60000
+    },
+    "js_code": ["window.scrollTo(0, document.body.scrollHeight);"],
+    "wait_for": "js:() => document.querySelectorAll('.item').length > 10",
+    "extra": {
+        "delay_before_return_html": 2.0
+    }
+}
+```
+
+5. **Screenshot with Custom Timing**
+```python
+request = {
+    "urls": "https://example.com",
+    "screenshot": True,
+    "crawler_params": {
+        "headless": True,
+        "screenshot_wait_for": 2.0           # Seconds to wait before capture
+    },
+    "extra": {
+        "delay_before_return_html": 3.0
+    }
+}
+```
+
+### Parameter Reference Table
+
+| Category | Parameter | Type | Description |
+|----------|-----------|------|-------------|
+| Browser | headless | bool | Run browser in headless mode |
+| Browser | browser_type | str | Browser engine selection |
+| Browser | user_agent | str | Custom user agent string |
+| Network | proxy | str | Proxy server URL |
+| Network | headers | dict | Custom HTTP headers |
+| Timing | page_timeout | int | Page load timeout (ms) |
+| Timing | delay_before_return_html | float | Wait before capture |
+| Anti-Detection | simulate_user | bool | Human behavior simulation |
+| Anti-Detection | magic | bool | Advanced protection |
+| Session | session_id | str | Browser session ID |
+| Session | user_data_dir | str | Profile directory |
+| Content | word_count_threshold | int | Minimum words per block |
+| Content | only_text | bool | Text-only extraction |
+| Content | process_iframes | bool | Include iframe content |
+| Debug | verbose | bool | Detailed logging |
+| Debug | log_console | bool | Browser console logs |
+
+## Troubleshooting 🔍
+
+### Common Issues
+
+1. **Connection Refused**
+   ```
+   Error: Connection refused at localhost:11235
+   ```
+   Solution: Ensure the container is running and ports are properly mapped.
+
+2. **Resource Limits**
+   ```
+   Error: No available slots
+   ```
+   Solution: Increase `MAX_CONCURRENT_TASKS` or the container's resource limits.
+
+3. **GPU Access**
+   ```
+   Error: GPU not found
+   ```
+   Solution: Ensure the NVIDIA drivers are installed and pass the `--gpus all` flag.
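+
+A "connection refused" error right after `docker run` often just means the server has not finished starting. A retry loop like the sketch below can wait for it; `wait_until_healthy` is a hypothetical helper that takes any zero-argument check, so it can be tried here with a stand-in instead of a live container:
+
+```python
+import time
+from typing import Callable
+
+
+def wait_until_healthy(check: Callable[[], bool], attempts: int = 10, delay: float = 1.0) -> bool:
+    """Call `check` until it returns True or the attempts run out."""
+    for attempt in range(attempts):
+        try:
+            if check():
+                return True
+        except OSError:
+            pass  # connection refused and similar errors: server not up yet
+        if attempt < attempts - 1:
+            time.sleep(delay)
+    return False
+
+
+# Stand-in check that fails twice before succeeding, to show the retry loop.
+responses = iter([False, False, True])
+print(wait_until_healthy(lambda: next(responses), attempts=5, delay=0.01))
+```
+
+With the `requests` library, the check could be `lambda: requests.get("http://localhost:11235/health").ok`; connection errors raised by `requests` derive from `OSError`, so the loop retries on them.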
+
+### Debug Mode
+
+Access container for debugging:
+```bash
+docker run -it --entrypoint /bin/bash unclecode/crawl4ai:all
+```
+
+View container logs:
+```bash
+docker logs [container_id]
+```
+
+## Best Practices 🌟
+
+1. **Resource Management**
+   - Set appropriate memory and CPU limits
+   - Monitor resource usage via the health endpoint
+   - Use basic version for simple crawling tasks
+
+2. **Scaling**
+   - Use multiple containers for high load
+   - Implement proper load balancing
+   - Monitor performance metrics
+
+3. **Security**
+   - Use environment variables for sensitive data
+   - Implement proper network isolation
+   - Apply security updates regularly
+
+## API Reference 📚
+
+### Health Check
+```http
+GET /health
+```
+
+### Submit Crawl Task
+```http
+POST /crawl
+Content-Type: application/json
+
+{
+    "urls": "string or array",
+    "extraction_config": {
+        "type": "basic|llm|cosine|json_css",
+        "params": {}
+    },
+    "priority": 1-10,
+    "ttl": 3600
+}
+```
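+
+A client can validate a payload against the fields above before sending it. The sketch below uses a hypothetical helper, `build_crawl_request`; the default `priority` of 5 is an assumption made for illustration, while the allowed extraction types, the 1-10 priority range, and the `ttl` value come from the request body shown above:
+
+```python
+from typing import Any, Dict, List, Optional, Union
+
+EXTRACTION_TYPES = {"basic", "llm", "cosine", "json_css"}
+
+
+def build_crawl_request(
+    urls: Union[str, List[str]],
+    extraction_type: Optional[str] = None,
+    params: Optional[Dict[str, Any]] = None,
+    priority: int = 5,  # assumed default, not taken from the API
+    ttl: int = 3600,
+) -> Dict[str, Any]:
+    """Assemble a /crawl payload, rejecting values outside the documented ranges."""
+    if not 1 <= priority <= 10:
+        raise ValueError("priority must be between 1 and 10")
+    request: Dict[str, Any] = {"urls": urls, "priority": priority, "ttl": ttl}
+    if extraction_type is not None:
+        if extraction_type not in EXTRACTION_TYPES:
+            raise ValueError(f"unknown extraction type: {extraction_type!r}")
+        request["extraction_config"] = {"type": extraction_type, "params": params or {}}
+    return request
+
+
+payload = build_crawl_request("https://example.com", "cosine", {"top_k": 3}, priority=10)
+print(payload)
+```
+
+The resulting dictionary can be passed as the `json=` argument of `requests.post`, as in the usage examples earlier in this guide.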
+
+### Get Task Status
+```http
+GET /task/{task_id}
+```
+
+For more details, visit the [official documentation](https://crawl4ai.com/mkdocs/).
--- a/docs/md_v2/basic/installation.md
+++ b/docs/md_v2/basic/installation.md
@@ -0,0 +1,92 @@
+# Installation 💻
+
+Crawl4AI offers flexible installation options to suit various use cases. You can install it as a Python package, use it with Docker, or run it as a local server.
+
+## Option 1: Python Package Installation (Recommended)
+
+Crawl4AI is now available on PyPI, making installation easier than ever. Choose the option that best fits your needs:
+
+### Basic Installation
+
+For basic web crawling and scraping tasks:
+
+```bash
+pip install crawl4ai
+playwright install # Download the Playwright browser binaries
+```
+
+### Installation with PyTorch
+
+For advanced text clustering (includes the cosine-similarity clustering strategy, `CosineStrategy`):
+
+```bash
+pip install "crawl4ai[torch]"
+```
+
+### Installation with Transformers
+
+For text summarization and Hugging Face models:
+
+```bash
+pip install "crawl4ai[transformer]"
+```
+
+### Full Installation
+
+For all features:
+
+```bash
+pip install "crawl4ai[all]"
+```
+
+### Development Installation
+
+For contributors who plan to modify the source code:
+
+```bash
+git clone https://github.com/unclecode/crawl4ai.git
+cd crawl4ai
+pip install -e ".[all]"
+playwright install # Install the browser binaries Playwright needs
+```
+
+💡 After installation with "torch", "transformer", or "all" options, it's recommended to run the following CLI command to load the required models:
+
+```bash
+crawl4ai-download-models
+```
+
+This is optional but will boost the performance and speed of the crawler. You only need to do this once after installation.
+
+## Option 2: Using Docker (Coming Soon)
+
+Docker support for Crawl4AI is currently in progress and will be available soon. This will allow you to run Crawl4AI in a containerized environment, ensuring consistency across different systems.
+
+## Option 3: Local Server Installation
+
+For those who prefer to run Crawl4AI as a local server, instructions will be provided once the Docker implementation is complete.
+
+## Verifying Your Installation
+
+After installation, you can verify that Crawl4AI is working correctly by running a simple Python script:
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler
+
+async def main():
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        result = await crawler.arun(url="https://www.example.com")
+        print(result.markdown[:500])  # Print first 500 characters
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+This script should successfully crawl the example website and print the first 500 characters of the extracted content.
+
+## Getting Help
+
+If you encounter any issues during installation or usage, please check the [documentation](https://crawl4ai.com/mkdocs/) or raise an issue on the [GitHub repository](https://github.com/unclecode/crawl4ai/issues).
+
+Happy crawling! 🕷️🤖
--- a/docs/md_v2/basic/output-formats.md
+++ b/docs/md_v2/basic/output-formats.md
@@ -0,0 +1,195 @@
+# Output Formats
+
+Crawl4AI provides multiple output formats to suit different needs, from raw HTML to structured data using LLM or pattern-based extraction.
+
+## Basic Formats
+
+```python
+result = await crawler.arun(url="https://example.com")
+
+# Access different formats
+raw_html = result.html           # Original HTML
+clean_html = result.cleaned_html # Sanitized HTML
+markdown = result.markdown       # Standard markdown
+fit_md = result.fit_markdown    # Most relevant content in markdown
+```
+
+## Raw HTML
+
+Original, unmodified HTML from the webpage. Useful when you need to:
+- Preserve the exact page structure
+- Process HTML with your own tools
+- Debug page issues
+
+```python
+result = await crawler.arun(url="https://example.com")
+print(result.html)  # Complete HTML including headers, scripts, etc.
+```
+
+## Cleaned HTML
+
+Sanitized HTML with unnecessary elements removed. Automatically:
+- Removes scripts and styles
+- Cleans up formatting
+- Preserves semantic structure
+
+```python
+result = await crawler.arun(
+    url="https://example.com",
+    excluded_tags=['form', 'header', 'footer'],  # Additional tags to remove
+    keep_data_attributes=False  # Remove data-* attributes
+)
+print(result.cleaned_html)
+```
+
+## Standard Markdown
+
+HTML converted to clean markdown format. Great for:
+- Content analysis
+- Documentation
+- Readability
+
+```python
+result = await crawler.arun(
+    url="https://example.com",
+    include_links_on_markdown=True  # Include links in markdown
+)
+print(result.markdown)
+```
+
+## Fit Markdown
+
+Most relevant content extracted and converted to markdown. Ideal for:
+- Article extraction
+- Main content focus
+- Removing boilerplate
+
+```python
+result = await crawler.arun(url="https://example.com")
+print(result.fit_markdown)  # Only the main content
+```
+
+## Structured Data Extraction
+
+Crawl4AI offers two powerful approaches for structured data extraction:
+
+### 1. LLM-Based Extraction
+
+Use any LLM (OpenAI, HuggingFace, Ollama, etc.) to extract structured data with high accuracy:
+
+```python
+import json
+from typing import List
+
+from pydantic import BaseModel
+from crawl4ai.extraction_strategy import LLMExtractionStrategy
+
+class KnowledgeGraph(BaseModel):
+    entities: List[dict]
+    relationships: List[dict]
+
+strategy = LLMExtractionStrategy(
+    provider="ollama/nemotron",  # or "openai/...", "huggingface/..."
+    api_token="your-token",   # not needed for Ollama
+    schema=KnowledgeGraph.schema(),
+    instruction="Extract entities and relationships from the content"
+)
+
+result = await crawler.arun(
+    url="https://example.com",
+    extraction_strategy=strategy
+)
+knowledge_graph = json.loads(result.extracted_content)
+```
+
+### 2. Pattern-Based Extraction
+
+For pages with repetitive patterns (e.g., product listings, article feeds), use JsonCssExtractionStrategy:
+
+```python
+import json
+
+from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+
+schema = {
+    "name": "Product Listing",
+    "baseSelector": ".product-card",  # Repeated element
+    "fields": [
+        {"name": "title", "selector": "h2", "type": "text"},
+        {"name": "price", "selector": ".price", "type": "text"},
+        {"name": "description", "selector": ".desc", "type": "text"}
+    ]
+}
+
+strategy = JsonCssExtractionStrategy(schema)
+result = await crawler.arun(
+    url="https://example.com",
+    extraction_strategy=strategy
+)
+products = json.loads(result.extracted_content)
+```
+
+## Content Customization
+
+### HTML to Text Options
+
+Configure markdown conversion:
+
+```python
+result = await crawler.arun(
+    url="https://example.com",
+    html2text={
+        "escape_dot": False,
+        "body_width": 0,
+        "protect_links": True,
+        "unicode_snob": True
+    }
+)
+```
+
+### Content Filters
+
+Control what content is included:
+
+```python
+result = await crawler.arun(
+    url="https://example.com",
+    word_count_threshold=10,        # Minimum words per block
+    exclude_external_links=True,    # Remove external links
+    exclude_external_images=True,   # Remove external images
+    excluded_tags=['form', 'nav']   # Remove specific HTML tags
+)
+```
+
+## Comprehensive Example
+
+Here's how to use multiple output formats together:
+
+```python
+async def crawl_content(url: str):
+    async with AsyncWebCrawler() as crawler:
+        # Extract main content with fit markdown
+        result = await crawler.arun(
+            url=url,
+            word_count_threshold=10,
+            exclude_external_links=True
+        )
+        
+        # Get structured data using LLM
+        llm_result = await crawler.arun(
+            url=url,
+            extraction_strategy=LLMExtractionStrategy(
+                provider="ollama/nemotron",
+                schema=YourSchema.schema(),
+                instruction="Extract key information"
+            )
+        )
+        
+        # Get repeated patterns (if any)
+        pattern_result = await crawler.arun(
+            url=url,
+            extraction_strategy=JsonCssExtractionStrategy(your_schema)
+        )
+        
+        return {
+            "main_content": result.fit_markdown,
+            "structured_data": json.loads(llm_result.extracted_content),
+            "pattern_data": json.loads(pattern_result.extracted_content),
+            "media": result.media
+        }
+```
--- a/docs/md_v2/basic/page-interaction.md
+++ b/docs/md_v2/basic/page-interaction.md
@@ -0,0 +1,207 @@
+# Page Interaction
+
+Crawl4AI provides powerful features for interacting with dynamic webpages, handling JavaScript execution, and managing page events.
+
+## JavaScript Execution
+
+### Basic Execution
+
+```python
+# Single JavaScript command
+result = await crawler.arun(
+    url="https://example.com",
+    js_code="window.scrollTo(0, document.body.scrollHeight);"
+)
+
+# Multiple commands
+js_commands = [
+    "window.scrollTo(0, document.body.scrollHeight);",
+    "document.querySelector('.load-more').click();",
+    "document.querySelector('#consent-button').click();"
+]
+result = await crawler.arun(
+    url="https://example.com",
+    js_code=js_commands
+)
+```
+
+## Wait Conditions
+
+### CSS-Based Waiting
+
+Wait for elements to appear:
+
+```python
+result = await crawler.arun(
+    url="https://example.com",
+    wait_for="css:.dynamic-content"  # Wait for element with class 'dynamic-content'
+)
+```
+
+### JavaScript-Based Waiting
+
+Wait for custom conditions:
+
+```python
+# Wait for number of elements
+wait_condition = """() => {
+    return document.querySelectorAll('.item').length > 10;
+}"""
+
+result = await crawler.arun(
+    url="https://example.com",
+    wait_for=f"js:{wait_condition}"
+)
+
+# Wait for dynamic content to load
+wait_for_content = """() => {
+    const content = document.querySelector('.content');
+    return content && content.innerText.length > 100;
+}"""
+
+result = await crawler.arun(
+    url="https://example.com",
+    wait_for=f"js:{wait_for_content}"
+)
+```
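+
+When several wait conditions differ only in selector and count, the string can be generated rather than hand-written. The helper below is a hypothetical convenience (`wait_for_min_count` is not a Crawl4AI function); it only builds the `js:`-prefixed string shown above and uses `json.dumps` to quote the selector safely.
+
+```python
+import json
+
+
+def wait_for_min_count(selector, minimum):
+    """Hypothetical helper: build a `js:` condition that passes once more
+    than `minimum` elements match `selector`."""
+    quoted = json.dumps(selector)  # yields a valid double-quoted JS string literal
+    return f"js:() => document.querySelectorAll({quoted}).length > {int(minimum)}"
+
+
+condition = wait_for_min_count(".item", 10)
+print(condition)
+```
+
+The returned string can then be passed as `wait_for=condition` to `crawler.arun`.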
+
+## Handling Dynamic Content
+
+### Load More Content
+
+Handle infinite scroll or load more buttons:
+
+```python
+# Scroll and wait pattern
+result = await crawler.arun(
+    url="https://example.com",
+    js_code=[
+        # Remember how many items are on the page before loading more
+        "window.previousCount = document.querySelectorAll('.item').length;",
+        # Scroll to bottom
+        "window.scrollTo(0, document.body.scrollHeight);",
+        # Click load more if exists
+        "const loadMore = document.querySelector('.load-more'); if(loadMore) loadMore.click();"
+    ],
+    # Wait until more items than before are present
+    wait_for="js:() => document.querySelectorAll('.item').length > window.previousCount"
+)
+```
+
+### Form Interaction
+
+Handle forms and inputs:
+
+```python
+js_form_interaction = """
+    // Fill form fields
+    document.querySelector('#search').value = 'search term';
+    // Submit form
+    document.querySelector('form').submit();
+"""
+
+result = await crawler.arun(
+    url="https://example.com",
+    js_code=js_form_interaction,
+    wait_for="css:.results"  # Wait for results to load
+)
+```
+
+## Timing Control
+
+### Delays and Timeouts
+
+Control timing of interactions:
+
+```python
+result = await crawler.arun(
+    url="https://example.com",
+    page_timeout=60000,              # Page load timeout (ms)
+    delay_before_return_html=2.0,    # Wait before capturing content
+)
+```
+
+## Complex Interactions Example
+
+Here's an example of handling a dynamic page with multiple interactions:
+
+```python
+async def crawl_dynamic_content():
+    async with AsyncWebCrawler() as crawler:
+        # Initial page load
+        result = await crawler.arun(
+            url="https://example.com",
+            # Handle cookie consent
+            js_code="document.querySelector('.cookie-accept')?.click();",
+            wait_for="css:.main-content"
+        )
+
+        # Load more content
+        session_id = "dynamic_session"  # Keep session for multiple interactions
+        
+        for page in range(3):  # Load 3 pages of content
+            result = await crawler.arun(
+                url="https://example.com",
+                session_id=session_id,
+                js_code=[
+                    # Scroll to bottom
+                    "window.scrollTo(0, document.body.scrollHeight);",
+                    # Store current item count
+                    "window.previousCount = document.querySelectorAll('.item').length;",
+                    # Click load more
+                    "document.querySelector('.load-more')?.click();"
+                ],
+                # Wait for new items
+                wait_for="""() => {
+                    const currentCount = document.querySelectorAll('.item').length;
+                    return currentCount > window.previousCount;
+                }""",
+                # Only execute JS without reloading page
+                js_only=page > 0
+            )
+            
+            # Process content after each load
+            print(f"Page {page + 1} cleaned HTML length:", len(result.cleaned_html))
+            
+        # Clean up session
+        await crawler.crawler_strategy.kill_session(session_id)
+```
+
+## Using with Extraction Strategies
+
+Combine page interaction with structured extraction:
+
+```python
+from typing import List
+
+from pydantic import BaseModel
+from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, LLMExtractionStrategy
+
+# Pattern-based extraction after interaction
+schema = {
+    "name": "Dynamic Items",
+    "baseSelector": ".item",
+    "fields": [
+        {"name": "title", "selector": "h2", "type": "text"},
+        {"name": "description", "selector": ".desc", "type": "text"}
+    ]
+}
+
+result = await crawler.arun(
+    url="https://example.com",
+    js_code="window.scrollTo(0, document.body.scrollHeight);",
+    wait_for="css:.item:nth-child(10)",  # Wait for 10 items
+    extraction_strategy=JsonCssExtractionStrategy(schema)
+)
+
+# Or use LLM to analyze dynamic content
+class ContentAnalysis(BaseModel):
+    topics: List[str]
+    summary: str
+
+result = await crawler.arun(
+    url="https://example.com",
+    js_code="document.querySelector('.show-more').click();",
+    wait_for="css:.full-content",
+    extraction_strategy=LLMExtractionStrategy(
+        provider="ollama/nemotron",
+        schema=ContentAnalysis.schema(),
+        instruction="Analyze the full content"
+    )
+)
+```
--- a/docs/md_v2/basic/quickstart.md
+++ b/docs/md_v2/basic/quickstart.md
@@ -0,0 +1,297 @@
+# Quick Start Guide 🚀
+
+Welcome to the Crawl4AI Quickstart Guide! This tutorial walks you through Crawl4AI, from basic usage to advanced features like chunking and extraction strategies, all with the power of asynchronous programming. Let's dive in! 🌟
+
+## Getting Started 🛠️
+
+First, let's import the necessary modules and create an instance of `AsyncWebCrawler`. We'll use an async context manager, which handles the setup and teardown of the crawler for us.
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler
+
+async def main():
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        # We'll add our crawling code here
+        pass
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+### Basic Usage
+
+Simply provide a URL and let Crawl4AI do the magic!
+
+```python
+async def main():
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        result = await crawler.arun(url="https://www.nbcnews.com/business")
+        print(f"Basic crawl result: {result.markdown[:500]}")  # Print first 500 characters
+
+asyncio.run(main())
+```
+
+### Taking Screenshots 📸
+
+Capture screenshots of web pages easily:
+
+```python
+async def capture_and_save_screenshot(url: str, output_path: str):
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        result = await crawler.arun(
+            url=url,
+            screenshot=True,
+            bypass_cache=True
+        )
+        
+        if result.success and result.screenshot:
+            import base64
+            screenshot_data = base64.b64decode(result.screenshot)
+            with open(output_path, 'wb') as f:
+                f.write(screenshot_data)
+            print(f"Screenshot saved successfully to {output_path}")
+        else:
+            print("Failed to capture screenshot")
+```
+
+### Browser Selection 🌐
+
+Crawl4AI supports multiple browser engines. Here's how to use different browsers:
+
+```python
+# Use Firefox
+async with AsyncWebCrawler(browser_type="firefox", verbose=True, headless=True) as crawler:
+    result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
+
+# Use WebKit
+async with AsyncWebCrawler(browser_type="webkit", verbose=True, headless=True) as crawler:
+    result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
+
+# Use Chromium (default)
+async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
+    result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
+```
+
+### User Simulation 🎭
+
+Simulate real user behavior to avoid detection:
+
+```python
+async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
+    result = await crawler.arun(
+        url="YOUR-URL-HERE",
+        bypass_cache=True,
+        simulate_user=True,  # Causes random mouse movements and clicks
+        override_navigator=True  # Makes the browser appear more like a real user
+    )
+```
+
+### Understanding Parameters 🧠
+
+By default, Crawl4AI caches the results of your crawls. This means that subsequent crawls of the same URL will be much faster! Let's see this in action.
+
+```python
+async def main():
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        # First crawl (caches the result)
+        result1 = await crawler.arun(url="https://www.nbcnews.com/business")
+        print(f"First crawl result: {result1.markdown[:100]}...")
+
+        # Force to crawl again
+        result2 = await crawler.arun(url="https://www.nbcnews.com/business", bypass_cache=True)
+        print(f"Second crawl result: {result2.markdown[:100]}...")
+
+asyncio.run(main())
+```
+
+### Adding a Chunking Strategy 🧩
+
+Let's add a chunking strategy: `RegexChunking`! This strategy splits the text based on a given regex pattern.
+
+```python
+from crawl4ai.chunking_strategy import RegexChunking
+
+async def main():
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        result = await crawler.arun(
+            url="https://www.nbcnews.com/business",
+            chunking_strategy=RegexChunking(patterns=["\n\n"])
+        )
+        print(f"RegexChunking result: {result.extracted_content[:200]}...")
+
+asyncio.run(main())
+```
+
+### Using LLMExtractionStrategy with Different Providers 🤖
+
+Crawl4AI supports multiple LLM providers for extraction:
+
+```python
+import os
+
+from crawl4ai.extraction_strategy import LLMExtractionStrategy
+from pydantic import BaseModel, Field
+
+class OpenAIModelFee(BaseModel):
+    model_name: str = Field(..., description="Name of the OpenAI model.")
+    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
+    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
+
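+# Note: `extract_structured_data_using_llm` is a helper that is not defined in this guide;
+# it is assumed to run `crawler.arun` with an `LLMExtractionStrategy` for the given provider.
+
+# OpenAI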
+await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"))
+
+# Hugging Face
+await extract_structured_data_using_llm(
+    "huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct", 
+    os.getenv("HUGGINGFACE_API_KEY")
+)
+
+# Ollama
+await extract_structured_data_using_llm("ollama/llama3.2")
+
+# With custom headers
+custom_headers = {
+    "Authorization": "Bearer your-custom-token",
+    "X-Custom-Header": "Some-Value"
+}
+await extract_structured_data_using_llm(extra_headers=custom_headers)
+```
+
+### Knowledge Graph Generation 🕸️
+
+Generate knowledge graphs from web content:
+
+```python
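+import os
+from typing import List
+
+from pydantic import BaseModel
+from crawl4ai import AsyncWebCrawler
+from crawl4ai.extraction_strategy import LLMExtractionStrategy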
+
+class Entity(BaseModel):
+    name: str
+    description: str
+    
+class Relationship(BaseModel):
+    entity1: Entity
+    entity2: Entity
+    description: str
+    relation_type: str
+
+class KnowledgeGraph(BaseModel):
+    entities: List[Entity]
+    relationships: List[Relationship]
+
+extraction_strategy = LLMExtractionStrategy(
+    provider='openai/gpt-4o-mini',
+    api_token=os.getenv('OPENAI_API_KEY'),
+    schema=KnowledgeGraph.model_json_schema(),
+    extraction_type="schema",
+    instruction="Extract entities and relationships from the given text."
+)
+
+async with AsyncWebCrawler() as crawler:
+    result = await crawler.arun(
+        url="https://paulgraham.com/love.html",
+        bypass_cache=True,
+        extraction_strategy=extraction_strategy
+    )
+```
+
+### Advanced Session-Based Crawling with Dynamic Content 🔄
+
+For modern web applications with dynamic content loading, here's how to handle pagination and content updates:
+
+```python
+async def crawl_dynamic_content():
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        url = "https://github.com/microsoft/TypeScript/commits/main"
+        session_id = "typescript_commits_session"
+        
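+        js_next_page = """
+        // Remember the current first commit so wait_for can detect when the list changes
+        window.firstCommit = document.querySelector('li.Box-sc-g0xbh4-0 h4')?.textContent.trim();
+        const button = document.querySelector('a[data-testid="pagination-next-button"]');
+        if (button) button.click();
+        """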
+
+        wait_for = """() => {
+            const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
+            if (commits.length === 0) return false;
+            const firstCommit = commits[0].textContent.trim();
+            return firstCommit !== window.firstCommit;
+        }"""
+        
+        schema = {
+            "name": "Commit Extractor",
+            "baseSelector": "li.Box-sc-g0xbh4-0",
+            "fields": [
+                {
+                    "name": "title",
+                    "selector": "h4.markdown-title",
+                    "type": "text",
+                    "transform": "strip",
+                },
+            ],
+        }
+        extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
+
+        for page in range(3):  # Crawl 3 pages
+            result = await crawler.arun(
+                url=url,
+                session_id=session_id,
+                css_selector="li.Box-sc-g0xbh4-0",
+                extraction_strategy=extraction_strategy,
+                js_code=js_next_page if page > 0 else None,
+                wait_for=wait_for if page > 0 else None,
+                js_only=page > 0,
+                bypass_cache=True,
+                headless=False,
+            )
+
+        await crawler.crawler_strategy.kill_session(session_id)
+```
+
+### Handling Overlays and Fitting Content 📏
+
+Remove overlay elements and fit content appropriately:
+
+```python
+async with AsyncWebCrawler(headless=False) as crawler:
+    result = await crawler.arun(
+        url="your-url-here",
+        bypass_cache=True,
+        word_count_threshold=10,
+        remove_overlay_elements=True,
+        screenshot=True
+    )
+```
+
+## Performance Comparison 🏎️
+
+Crawl4AI offers impressive performance compared to other solutions:
+
+```python
+import os
+import time
+
+# Firecrawl comparison
+from firecrawl import FirecrawlApp
+app = FirecrawlApp(api_key=os.environ['FIRECRAWL_API_KEY'])
+start = time.time()
+scrape_status = app.scrape_url(
+    'https://www.nbcnews.com/business',
+    params={'formats': ['markdown', 'html']}
+)
+end = time.time()
+print(f"Firecrawl: {end - start:.2f} seconds")
+
+# Crawl4AI comparison
+async with AsyncWebCrawler() as crawler:
+    start = time.time()
+    result = await crawler.arun(
+        url="https://www.nbcnews.com/business",
+        word_count_threshold=0,
+        bypass_cache=True,
+        verbose=False,
+    )
+    end = time.time()
+    print(f"Crawl4AI: {end - start:.2f} seconds")
+```
+
+Note: Performance comparisons should be conducted in environments with stable and fast internet connections for accurate results.
+
+## Congratulations! 🎉
+
+You've made it through the updated Crawl4AI Quickstart Guide! Now you're equipped with even more powerful features to crawl the web asynchronously like a pro! 🕸️
+
+Happy crawling! 🚀
--- a/docs/md_v2/basic/simple-crawling.md
+++ b/docs/md_v2/basic/simple-crawling.md
@@ -0,0 +1,120 @@
+# Simple Crawling
+
+This guide covers the basics of web crawling with Crawl4AI. You'll learn how to set up a crawler, make your first request, and understand the response.
+
+## Basic Usage
+
+Here's the simplest way to crawl a webpage:
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler
+
+async def main():
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun(url="https://example.com")
+        print(result.markdown)  # Print clean markdown content
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+## Understanding the Response
+
+The `arun()` method returns a `CrawlResult` object with several useful properties. Here's a quick overview (see [CrawlResult](../api/crawl-result.md) for complete details):
+
+```python
+result = await crawler.arun(url="https://example.com")
+
+# Different content formats
+print(result.html)         # Raw HTML
+print(result.cleaned_html) # Cleaned HTML
+print(result.markdown)     # Markdown version
+print(result.fit_markdown) # Most relevant content in markdown
+
+# Check success status
+print(result.success)      # True if crawl succeeded
+print(result.status_code)  # HTTP status code (e.g., 200, 404)
+
+# Access extracted media and links
+print(result.media)        # Dictionary of found media (images, videos, audio)
+print(result.links)        # Dictionary of internal and external links
+```
+
+## Adding Basic Options
+
+Customize your crawl with these common options:
+
+```python
+result = await crawler.arun(
+    url="https://example.com",
+    word_count_threshold=10,        # Minimum words per content block
+    exclude_external_links=True,    # Remove external links
+    remove_overlay_elements=True,   # Remove popups/modals
+    process_iframes=True           # Process iframe content
+)
+```
+
+## Handling Errors
+
+Always check if the crawl was successful:
+
+```python
+result = await crawler.arun(url="https://example.com")
+if not result.success:
+    print(f"Crawl failed: {result.error_message}")
+    print(f"Status code: {result.status_code}")
+```
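+
+A failed crawl is often worth retrying before giving up. The sketch below is one possible pattern, not something Crawl4AI provides: `crawl_with_retry`, its parameters, and the linear back-off are assumptions for illustration. It accepts any async callable that returns an object with a `success` attribute, such as a thin wrapper around `crawler.arun`.
+
+```python
+import asyncio
+
+
+async def crawl_with_retry(crawl, url, attempts=3, delay=1.0):
+    """Hypothetical helper: call `crawl(url)` until it succeeds or attempts run out."""
+    result = None
+    for attempt in range(1, attempts + 1):
+        result = await crawl(url)
+        if result.success:
+            return result
+        if attempt < attempts:
+            # Linear back-off between attempts (an arbitrary choice for this sketch)
+            await asyncio.sleep(delay * attempt)
+    return result  # last failed result, so the caller can inspect it
+
+
+# Possible usage inside an `async with AsyncWebCrawler() as crawler:` block:
+#     result = await crawl_with_retry(lambda u: crawler.arun(url=u), "https://example.com")
+```
+
+The caller still checks `result.success` afterwards, since the last attempt may also have failed.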
+
+## Logging and Debugging
+
+Enable verbose mode for detailed logging:
+
+```python
+async with AsyncWebCrawler(verbose=True) as crawler:
+    result = await crawler.arun(url="https://example.com")
+```
+
+## Complete Example
+
+Here's a more comprehensive example showing common usage patterns:
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler
+
+async def main():
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        result = await crawler.arun(
+            url="https://example.com",
+            # Content filtering
+            word_count_threshold=10,
+            excluded_tags=['form', 'header'],
+            exclude_external_links=True,
+            
+            # Content processing
+            process_iframes=True,
+            remove_overlay_elements=True,
+            
+            # Cache control
+            bypass_cache=False  # Use cache if available
+        )
+        
+        if result.success:
+            # Print clean content
+            print("Content:", result.markdown[:500])  # First 500 chars
+            
+            # Process images
+            for image in result.media["images"]:
+                print(f"Found image: {image['src']}")
+            
+            # Process links
+            for link in result.links["internal"]:
+                print(f"Internal link: {link['href']}")
+                
+        else:
+            print(f"Crawl failed: {result.error_message}")
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
--- a/docs/md_v2/extraction/chunking.md
+++ b/docs/md_v2/extraction/chunking.md
@@ -0,0 +1,133 @@
+## Chunking Strategies 📚
+
+Crawl4AI provides several powerful chunking strategies to divide text into manageable parts for further processing. Each strategy has unique characteristics and is suitable for different scenarios. Let's explore them one by one.
+
+### RegexChunking
+
+`RegexChunking` splits text using regular expressions. This is ideal for creating chunks based on specific patterns like paragraphs or sentences.
+
+#### When to Use
+- Great for structured text with consistent delimiters.
+- Suitable for documents where specific patterns (e.g., double newlines, periods) indicate logical chunks.
+
+#### Parameters
+- `patterns` (list, optional): Regular expressions used to split the text. Default is to split by double newlines (`['\n\n']`).
+
+#### Example
+```python
+from crawl4ai.chunking_strategy import RegexChunking
+
+# Define patterns for splitting text
+patterns = [r'\n\n', r'\. ']
+chunker = RegexChunking(patterns=patterns)
+
+# Sample text
+text = "This is a sample text. It will be split into chunks.\n\nThis is another paragraph."
+
+# Chunk the text
+chunks = chunker.chunk(text)
+print(chunks)
+```
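+
+To make the splitting behaviour concrete, here is a standalone sketch of pattern-based chunking using only the standard library. It is an approximation for illustration, not Crawl4AI's implementation; the function name `regex_chunks` is made up for this example, and it assumes the patterns are applied one after another.
+
+```python
+import re
+
+
+def regex_chunks(text, patterns=(r"\n\n",)):
+    """Illustrative only: split on each pattern in turn and drop empty pieces."""
+    chunks = [text]
+    for pattern in patterns:
+        next_chunks = []
+        for chunk in chunks:
+            next_chunks.extend(re.split(pattern, chunk))
+        chunks = next_chunks
+    return [chunk for chunk in chunks if chunk.strip()]
+
+
+text = "This is a sample text. It will be split into chunks.\n\nThis is another paragraph."
+result = regex_chunks(text, patterns=[r"\n\n", r"\. "])
+print(result)
+```
+
+Note that `re.split` removes the matched delimiter, so the first sentence loses its trailing period in this sketch.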
+
+### NlpSentenceChunking
+
+`NlpSentenceChunking` uses NLP models to split text into sentences, ensuring accurate sentence boundaries.
+
+#### When to Use
+- Ideal for texts where sentence boundaries are crucial.
+- Useful for creating chunks that preserve grammatical structures.
+
+#### Parameters
+- None.
+
+#### Example
+```python
+from crawl4ai.chunking_strategy import NlpSentenceChunking
+
+chunker = NlpSentenceChunking()
+
+# Sample text
+text = "This is a sample text. It will be split into sentences. Here's another sentence."
+
+# Chunk the text
+chunks = chunker.chunk(text)
+print(chunks)
+```
+
+### TopicSegmentationChunking
+
+`TopicSegmentationChunking` employs the TextTiling algorithm to segment text into topic-based chunks. This method identifies thematic boundaries.
+
+#### When to Use
+- Perfect for long documents with distinct topics.
+- Useful when preserving topic continuity is more important than maintaining text order.
+
+#### Parameters
+- `num_keywords` (int, optional): Number of keywords for each topic segment. Default is `3`.
+
+#### Example
+```python
+from crawl4ai.chunking_strategy import TopicSegmentationChunking
+
+chunker = TopicSegmentationChunking(num_keywords=3)
+
+# Sample text
+text = "This document contains several topics. Topic one discusses AI. Topic two covers machine learning."
+
+# Chunk the text
+chunks = chunker.chunk(text)
+print(chunks)
+```
+
+### FixedLengthWordChunking
+
+`FixedLengthWordChunking` splits text into chunks based on a fixed number of words. This ensures each chunk has approximately the same length.
+
+#### When to Use
+- Suitable for processing large texts where uniform chunk size is important.
+- Useful when the number of words per chunk needs to be controlled.
+
+#### Parameters
+- `chunk_size` (int, optional): Number of words per chunk. Default is `100`.
+
+#### Example
+```python
+from crawl4ai.chunking_strategy import FixedLengthWordChunking
+
+chunker = FixedLengthWordChunking(chunk_size=10)
+
+# Sample text
+text = "This is a sample text. It will be split into chunks of fixed length."
+
+# Chunk the text
+chunks = chunker.chunk(text)
+print(chunks)
+```
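+
+The idea can be sketched in a few lines of plain Python. This is an illustration of fixed-size word chunking in general, not the library's code, and `fixed_length_word_chunks` is a name chosen for this example. It also shows why chunk lengths are only approximately equal: the final chunk holds whatever words remain.
+
+```python
+def fixed_length_word_chunks(text, chunk_size=100):
+    """Illustrative only: group whitespace-separated words into runs of `chunk_size`."""
+    words = text.split()
+    return [
+        " ".join(words[start:start + chunk_size])
+        for start in range(0, len(words), chunk_size)
+    ]
+
+
+text = "This is a sample text. It will be split into chunks of fixed length."
+result = fixed_length_word_chunks(text, chunk_size=10)
+print(result)
+```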
+
+### SlidingWindowChunking
+
+`SlidingWindowChunking` uses a sliding window approach to create overlapping chunks. Each chunk has a fixed length, and the window slides by a specified step size.
+
+#### When to Use
+- Ideal for creating overlapping chunks to preserve context.
+- Useful for tasks where context from adjacent chunks is needed.
+
+#### Parameters
+- `window_size` (int, optional): Number of words in each chunk. Default is `100`.
+- `step` (int, optional): Number of words to slide the window. Default is `50`.
+
+#### Example
+```python
+from crawl4ai.chunking_strategy import SlidingWindowChunking
+
+chunker = SlidingWindowChunking(window_size=10, step=5)
+
+# Sample text
+text = "This is a sample text. It will be split using a sliding window approach to preserve context."
+
+# Chunk the text
+chunks = chunker.chunk(text)
+print(chunks)
+```
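+
+The overlap is easiest to see in a small standalone sketch. The function below illustrates the sliding-window idea under the assumption that windows are measured in whitespace-separated words; it is not Crawl4AI's implementation, and `sliding_window_chunks` is a name made up for this example.
+
+```python
+def sliding_window_chunks(text, window_size=100, step=50):
+    """Illustrative only: overlapping word windows; the last one may be shorter."""
+    words = text.split()
+    chunks = []
+    for start in range(0, len(words), step):
+        chunks.append(" ".join(words[start:start + window_size]))
+        if start + window_size >= len(words):
+            break  # this window already reached the end of the text
+    return chunks
+
+
+text = "This is a sample text. It will be split using a sliding window approach to preserve context."
+result = sliding_window_chunks(text, window_size=10, step=5)
+for chunk in result:
+    print(chunk)
+```
+
+With `window_size=10` and `step=5`, each window shares its first five words with the previous one, which is how context carries across chunk boundaries.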
+
+With these chunking strategies, you can choose the best method to divide your text based on your specific needs. Whether you need precise sentence boundaries, topic-based segmentation, or uniform chunk sizes, Crawl4AI has you covered. Happy chunking! 📝✨
--- a/docs/md_v2/extraction/cosine.md
+++ b/docs/md_v2/extraction/cosine.md
@@ -0,0 +1,222 @@
+# Cosine Strategy
+
+The Cosine Strategy in Crawl4AI uses similarity-based clustering to identify and extract relevant content sections from web pages. This strategy is particularly useful when you need to find and extract content based on semantic similarity rather than structural patterns.
+
+## How It Works
+
+The Cosine Strategy:
+1. Breaks down page content into meaningful chunks
+2. Converts text into vector representations
+3. Calculates similarity between chunks
+4. Clusters similar content together
+5. Ranks and filters content based on relevance
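+
+Steps 2, 3, and 5 can be illustrated with a toy sketch in plain Python. It is not how `CosineStrategy` is implemented: it substitutes bag-of-words counts for sentence embeddings and skips the clustering step, and the helper names (`vectorize`, `cosine_similarity`, `rank_chunks`) are hypothetical.
+
+```python
+import math
+from collections import Counter
+
+def vectorize(text: str) -> Counter:
+    """Toy bag-of-words vector (a stand-in for a sentence-embedding model)."""
+    return Counter(text.lower().split())
+
+def cosine_similarity(a: Counter, b: Counter) -> float:
+    """Cosine of the angle between two sparse word-count vectors."""
+    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
+    norm_a = math.sqrt(sum(v * v for v in a.values()))
+    norm_b = math.sqrt(sum(v * v for v in b.values()))
+    if norm_a == 0 or norm_b == 0:
+        return 0.0
+    return dot / (norm_a * norm_b)
+
+def rank_chunks(chunks: list[str], query: str, sim_threshold: float = 0.3) -> list[tuple[float, str]]:
+    """Keep chunks whose similarity to the query meets the threshold, best first."""
+    query_vec = vectorize(query)
+    scored = [(cosine_similarity(vectorize(c), query_vec), c) for c in chunks]
+    kept = [pair for pair in scored if pair[0] >= sim_threshold]
+    return sorted(kept, key=lambda pair: pair[0], reverse=True)
+
+chunks = [
+    "great product reviews from customers",
+    "shipping and returns policy",
+    "customer reviews praise the product",
+]
+for score, chunk in rank_chunks(chunks, "product reviews"):
+    print(f"{score:.2f}  {chunk}")
+```
+
+The chunk about shipping shares no words with the query, so its similarity is zero and it is filtered out by the threshold.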
+
+## Basic Usage
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler
+from crawl4ai.extraction_strategy import CosineStrategy
+
+strategy = CosineStrategy(
+    semantic_filter="product reviews",    # Target content type
+    word_count_threshold=10,             # Minimum words per cluster
+    sim_threshold=0.3                    # Similarity threshold
+)
+
+async def main():
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun(
+            url="https://example.com/reviews",
+            extraction_strategy=strategy
+        )
+
+        content = result.extracted_content
+        print(content)
+
+asyncio.run(main())
+```
+
+## Configuration Options
+
+### Core Parameters
+
+```python
+CosineStrategy(
+    # Content Filtering
+    semantic_filter: str = None,       # Keywords/topic for content filtering
+    word_count_threshold: int = 10,    # Minimum words per cluster
+    sim_threshold: float = 0.3,        # Similarity threshold (0.0 to 1.0)
+    
+    # Clustering Parameters
+    max_dist: float = 0.2,            # Maximum distance for clustering
+    linkage_method: str = 'ward',      # Clustering linkage method
+    top_k: int = 3,                   # Number of top categories to extract
+    
+    # Model Configuration
+    model_name: str = 'sentence-transformers/all-MiniLM-L6-v2',  # Embedding model
+    
+    verbose: bool = False             # Enable logging
+)
+```
+
+### Parameter Details
+
+1. **semantic_filter**
+   - Sets the target topic or content type
+   - Use keywords relevant to your desired content
+   - Example: "technical specifications", "user reviews", "pricing information"
+
+2. **sim_threshold**
+   - Controls how similar content must be to be grouped together
+   - Higher values (e.g., 0.8) mean stricter matching
+   - Lower values (e.g., 0.3) allow more variation
+   ```python
+   # Strict matching
+   strategy = CosineStrategy(sim_threshold=0.8)
+   
+   # Loose matching
+   strategy = CosineStrategy(sim_threshold=0.3)
+   ```
+
+3. **word_count_threshold**
+   - Filters out short content blocks
+   - Helps eliminate noise and irrelevant content
+   ```python
+   # Only consider substantial paragraphs
+   strategy = CosineStrategy(word_count_threshold=50)
+   ```
+
+4. **top_k**
+   - Number of top content clusters to return
+   - Higher values return more diverse content
+   ```python
+   # Get top 5 most relevant content clusters
+   strategy = CosineStrategy(top_k=5)
+   ```
+
+## Use Cases
+
+### 1. Article Content Extraction
+```python
+strategy = CosineStrategy(
+    semantic_filter="main article content",
+    word_count_threshold=100,  # Longer blocks for articles
+    top_k=1                   # Usually want single main content
+)
+
+result = await crawler.arun(
+    url="https://example.com/blog/post",
+    extraction_strategy=strategy
+)
+```
+
+### 2. Product Review Analysis
+```python
+strategy = CosineStrategy(
+    semantic_filter="customer reviews and ratings",
+    word_count_threshold=20,   # Reviews can be shorter
+    top_k=10,                 # Get multiple reviews
+    sim_threshold=0.4         # Allow variety in review content
+)
+```
+
+### 3. Technical Documentation
+```python
+strategy = CosineStrategy(
+    semantic_filter="technical specifications documentation",
+    word_count_threshold=30,
+    sim_threshold=0.6,        # Stricter matching for technical content
+    max_dist=0.3             # Allow related technical sections
+)
+```
+
+## Advanced Features
+
+### Custom Clustering
+```python
+strategy = CosineStrategy(
+    linkage_method='complete',  # Alternative clustering method
+    max_dist=0.4,              # Larger clusters
+    model_name='sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2'  # Multilingual support
+)
+```
+
+### Content Filtering Pipeline
+```python
+import json
+from crawl4ai import AsyncWebCrawler
+from crawl4ai.extraction_strategy import CosineStrategy
+
+strategy = CosineStrategy(
+    semantic_filter="pricing plans features",
+    word_count_threshold=15,
+    sim_threshold=0.5,
+    top_k=3
+)
+
+async def extract_pricing_features(url: str):
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun(
+            url=url,
+            extraction_strategy=strategy
+        )
+        
+        if result.success:
+            content = json.loads(result.extracted_content)
+            return {
+                'pricing_features': content,
+                'clusters': len(content),
+                # Items may not carry a 'score' key; .get() avoids a KeyError
+                'similarity_scores': [item.get('score') for item in content]
+            }
+        return None  # Crawl or extraction failed
+```
+
+## Best Practices
+
+1. **Adjust Thresholds Iteratively**
+   - Start with default values
+   - Adjust based on results
+   - Monitor clustering quality
+
+2. **Choose Appropriate Word Count Thresholds**
+   - Higher for articles (100+)
+   - Lower for reviews/comments (20+)
+   - Medium for product descriptions (50+)
+
+3. **Optimize Performance**
+   ```python
+   strategy = CosineStrategy(
+       word_count_threshold=10,  # Filter early
+       top_k=5,                 # Limit results
+       verbose=True             # Monitor performance
+   )
+   ```
+
+4. **Handle Different Content Types**
+   ```python
+   # For mixed content pages
+   strategy = CosineStrategy(
+       semantic_filter="product features",
+       sim_threshold=0.4,      # More flexible matching
+       max_dist=0.3,          # Larger clusters
+       top_k=3                # Multiple relevant sections
+   )
+   ```
+
+## Error Handling
+
+```python
+try:
+    result = await crawler.arun(
+        url="https://example.com",
+        extraction_strategy=strategy
+    )
+    
+    if result.success:
+        content = json.loads(result.extracted_content)
+        if not content:
+            print("No relevant content found")
+    else:
+        print(f"Extraction failed: {result.error_message}")
+        
+except Exception as e:
+    print(f"Error during extraction: {str(e)}")
+```
+
+The Cosine Strategy is particularly effective when:
+- Content structure is inconsistent
+- You need semantic understanding
+- You want to find similar content blocks
+- Structure-based extraction (CSS/XPath) isn't reliable
+
+It works well with other strategies and can be used as a pre-processing step for LLM-based extraction.
--- a/docs/md_v2/extraction/css-advanced.md
+++ b/docs/md_v2/extraction/css-advanced.md
@@ -0,0 +1,282 @@
+# Advanced Usage of JsonCssExtractionStrategy
+
+While the basic usage of JsonCssExtractionStrategy is powerful for simple structures, its true potential shines when dealing with complex, nested HTML structures. This section will explore advanced usage scenarios, demonstrating how to extract nested objects, lists, and nested lists.
+
+## Hypothetical Website Example
+
+Let's consider a hypothetical e-commerce website that displays product categories, each containing multiple products. Each product has details, reviews, and related items. This complex structure will allow us to demonstrate various advanced features of JsonCssExtractionStrategy.
+
+Assume the HTML structure looks something like this:
+
+```html
+<div class="category">
+  <h2 class="category-name">Electronics</h2>
+  <div class="product">
+    <h3 class="product-name">Smartphone X</h3>
+    <p class="product-price">$999</p>
+    <div class="product-details">
+      <span class="brand">TechCorp</span>
+      <span class="model">X-2000</span>
+    </div>
+    <ul class="product-features">
+      <li>5G capable</li>
+      <li>6.5" OLED screen</li>
+      <li>128GB storage</li>
+    </ul>
+    <div class="product-reviews">
+      <div class="review">
+        <span class="reviewer">John D.</span>
+        <span class="rating">4.5</span>
+        <p class="review-text">Great phone, love the camera!</p>
+      </div>
+      <div class="review">
+        <span class="reviewer">Jane S.</span>
+        <span class="rating">5</span>
+        <p class="review-text">Best smartphone I've ever owned.</p>
+      </div>
+    </div>
+    <ul class="related-products">
+      <li>
+        <span class="related-name">Phone Case</span>
+        <span class="related-price">$29.99</span>
+      </li>
+      <li>
+        <span class="related-name">Screen Protector</span>
+        <span class="related-price">$9.99</span>
+      </li>
+    </ul>
+  </div>
+  <!-- More products... -->
+</div>
+```
+
+Now, let's create a schema to extract this complex structure:
+
+```python
+schema = {
+    "name": "E-commerce Product Catalog",
+    "baseSelector": "div.category",
+    "fields": [
+        {
+            "name": "category_name",
+            "selector": "h2.category-name",
+            "type": "text"
+        },
+        {
+            "name": "products",
+            "selector": "div.product",
+            "type": "nested_list",
+            "fields": [
+                {
+                    "name": "name",
+                    "selector": "h3.product-name",
+                    "type": "text"
+                },
+                {
+                    "name": "price",
+                    "selector": "p.product-price",
+                    "type": "text"
+                },
+                {
+                    "name": "details",
+                    "selector": "div.product-details",
+                    "type": "nested",
+                    "fields": [
+                        {
+                            "name": "brand",
+                            "selector": "span.brand",
+                            "type": "text"
+                        },
+                        {
+                            "name": "model",
+                            "selector": "span.model",
+                            "type": "text"
+                        }
+                    ]
+                },
+                {
+                    "name": "features",
+                    "selector": "ul.product-features li",
+                    "type": "list",
+                    "fields": [
+                        {
+                            "name": "feature",
+                            "type": "text"
+                        }
+                    ]
+                },
+                {
+                    "name": "reviews",
+                    "selector": "div.review",
+                    "type": "nested_list",
+                    "fields": [
+                        {
+                            "name": "reviewer",
+                            "selector": "span.reviewer",
+                            "type": "text"
+                        },
+                        {
+                            "name": "rating",
+                            "selector": "span.rating",
+                            "type": "text"
+                        },
+                        {
+                            "name": "comment",
+                            "selector": "p.review-text",
+                            "type": "text"
+                        }
+                    ]
+                },
+                {
+                    "name": "related_products",
+                    "selector": "ul.related-products li",
+                    "type": "list",
+                    "fields": [
+                        {
+                            "name": "name",
+                            "selector": "span.related-name",
+                            "type": "text"
+                        },
+                        {
+                            "name": "price",
+                            "selector": "span.related-price",
+                            "type": "text"
+                        }
+                    ]
+                }
+            ]
+        }
+    ]
+}
+```
+
+This schema demonstrates several advanced features:
+
+1. **Nested Objects**: The `details` field is a nested object within each product.
+2. **Simple Lists**: The `features` field is a simple list of text items.
+3. **Nested Lists**: The `products` field is a nested list, where each item is a complex object.
+4. **Lists of Objects**: The `reviews` and `related_products` fields are lists of objects.
+
+Let's break down the key concepts:
+
+### Nested Objects
+
+To create a nested object, use `"type": "nested"` and provide a `fields` array for the nested structure:
+
+```python
+{
+    "name": "details",
+    "selector": "div.product-details",
+    "type": "nested",
+    "fields": [
+        {
+            "name": "brand",
+            "selector": "span.brand",
+            "type": "text"
+        },
+        {
+            "name": "model",
+            "selector": "span.model",
+            "type": "text"
+        }
+    ]
+}
+```
+
+### Simple Lists
+
+For a simple list of identical items, use `"type": "list"`:
+
+```python
+{
+    "name": "features",
+    "selector": "ul.product-features li",
+    "type": "list",
+    "fields": [
+        {
+            "name": "feature",
+            "type": "text"
+        }
+    ]
+}
+```
+
+### Nested Lists
+
+For a list of complex objects, use `"type": "nested_list"`:
+
+```python
+{
+    "name": "products",
+    "selector": "div.product",
+    "type": "nested_list",
+    "fields": [
+        # ... fields for each product
+    ]
+}
+```
+
+### Lists of Objects
+
+Similar to nested lists, but typically used for simpler objects within the list:
+
+```python
+{
+    "name": "related_products",
+    "selector": "ul.related-products li",
+    "type": "list",
+    "fields": [
+        {
+            "name": "name",
+            "selector": "span.related-name",
+            "type": "text"
+        },
+        {
+            "name": "price",
+            "selector": "span.related-price",
+            "type": "text"
+        }
+    ]
+}
+```
+
+## Using the Advanced Schema
+
+To use this advanced schema with AsyncWebCrawler:
+
+```python
+import json
+import asyncio
+from crawl4ai import AsyncWebCrawler
+from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+
+async def extract_complex_product_data():
+    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
+
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        result = await crawler.arun(
+            url="https://gist.githubusercontent.com/githubusercontent/2d7b8ba3cd8ab6cf3c8da771ddb36878/raw/1ae2f90c6861ce7dd84cc50d3df9920dee5e1fd2/sample_ecommerce.html",
+            extraction_strategy=extraction_strategy,
+            bypass_cache=True,
+        )
+
+        assert result.success, "Failed to crawl the page"
+
+        product_data = json.loads(result.extracted_content)
+        print(json.dumps(product_data, indent=2))
+
+asyncio.run(extract_complex_product_data())
+```
+
+This will produce a structured JSON output that captures the complex hierarchy of the product catalog, including nested objects, lists, and nested lists.
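+
+Once the data is parsed, the nested structure can be walked like any other Python object. The sketch below assumes the output mirrors the schema's field names (a list of categories, each with a `products` list whose items carry a `reviews` list). The sample data is hand-written from the hypothetical HTML above, not real crawler output, and `average_ratings` is an illustrative helper.
+
+```python
+# Hand-written sample mirroring the schema's field names; not actual crawler output
+product_data = [
+    {
+        "category_name": "Electronics",
+        "products": [
+            {
+                "name": "Smartphone X",
+                "price": "$999",
+                "reviews": [
+                    {"reviewer": "John D.", "rating": "4.5", "comment": "Great phone, love the camera!"},
+                    {"reviewer": "Jane S.", "rating": "5", "comment": "Best smartphone I've ever owned."},
+                ],
+            }
+        ],
+    }
+]
+
+def average_ratings(categories: list[dict]) -> dict[str, float]:
+    """Map each product name to the mean of its review ratings."""
+    averages = {}
+    for category in categories:
+        for product in category.get("products", []):
+            # Ratings are extracted as text, so convert them before averaging
+            ratings = [float(r["rating"]) for r in product.get("reviews", [])]
+            if ratings:
+                averages[product["name"]] = sum(ratings) / len(ratings)
+    return averages
+
+print(average_ratings(product_data))
+```
+
+Using `.get(..., [])` keeps the walk from failing when a product has no reviews or a category has no products.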
+
+## Tips for Advanced Usage
+
+1. **Start Simple**: Begin with a basic schema and gradually add complexity.
+2. **Test Incrementally**: Test each part of your schema separately before combining them.
+3. **Use Chrome DevTools**: The Element Inspector is invaluable for identifying the correct selectors.
+4. **Handle Missing Data**: Use the `default` key in your field definitions to handle cases where data might be missing.
+5. **Leverage Transforms**: Use the `transform` key to clean or format extracted data (e.g., converting prices to numbers).
+6. **Consider Performance**: Very complex schemas might slow down extraction. Balance complexity with performance needs.
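+
+Missing values and price clean-up can also be handled after extraction in plain Python, independently of the schema's `default` and `transform` keys. The following is a minimal sketch under that assumption; `parse_price` is a hypothetical helper, not part of Crawl4AI.
+
+```python
+import re
+from typing import Optional
+
+def parse_price(raw: Optional[str], default: Optional[float] = None) -> Optional[float]:
+    """Convert a price string such as '$1,299.99' to a float, or return default."""
+    if not raw:
+        return default  # Field was missing or empty
+    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
+    if match is None:
+        return default  # No numeric part found
+    return float(match.group().replace(",", ""))
+
+print(parse_price("$999"))
+print(parse_price("$1,299.99"))
+print(parse_price(None, default=0.0))
+```
+
+Keeping the raw text in the extracted data and converting afterwards makes it easier to debug selectors, since the original string is still available when parsing fails.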
+
+By mastering these advanced techniques, you can use JsonCssExtractionStrategy to extract highly structured data from even the most complex web pages, making it a powerful tool for web scraping and data analysis tasks.
--- a/docs/md_v2/extraction/css.md
+++ b/docs/md_v2/extraction/css.md
@@ -0,0 +1,142 @@
+# JSON CSS Extraction Strategy with AsyncWebCrawler
+
+The `JsonCssExtractionStrategy` is a powerful feature of Crawl4AI that allows you to extract structured data from web pages using CSS selectors. This method is particularly useful when you need to extract specific data points from a consistent HTML structure, such as tables or repeated elements. Here's how to use it with the AsyncWebCrawler.
+
+## Overview
+
+The `JsonCssExtractionStrategy` works by defining a schema that specifies:
+1. A base CSS selector for the repeating elements
+2. Fields to extract from each element, each with its own CSS selector
+
+This strategy is fast and efficient, as it doesn't rely on external services like LLMs for extraction.
+
+## Example: Extracting Cryptocurrency Prices from Coinbase
+
+Let's look at an example that extracts cryptocurrency prices from the Coinbase explore page.
+
+```python
+import json
+import asyncio
+from crawl4ai import AsyncWebCrawler
+from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+
+async def extract_structured_data_using_css_extractor():
+    print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
+    
+    # Define the extraction schema
+    schema = {
+        "name": "Coinbase Crypto Prices",
+        "baseSelector": ".cds-tableRow-t45thuk",
+        "fields": [
+            {
+                "name": "crypto",
+                "selector": "td:nth-child(1) h2",
+                "type": "text",
+            },
+            {
+                "name": "symbol",
+                "selector": "td:nth-child(1) p",
+                "type": "text",
+            },
+            {
+                "name": "price",
+                "selector": "td:nth-child(2)",
+                "type": "text",
+            }
+        ],
+    }
+
+    # Create the extraction strategy
+    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
+
+    # Use the AsyncWebCrawler with the extraction strategy
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        result = await crawler.arun(
+            url="https://www.coinbase.com/explore",
+            extraction_strategy=extraction_strategy,
+            bypass_cache=True,
+        )
+
+        assert result.success, "Failed to crawl the page"
+
+        # Parse the extracted content
+        crypto_prices = json.loads(result.extracted_content)
+        print(f"Successfully extracted {len(crypto_prices)} cryptocurrency prices")
+        print(json.dumps(crypto_prices[0], indent=2))
+
+    return crypto_prices
+
+# Run the async function
+asyncio.run(extract_structured_data_using_css_extractor())
+```
+
+## Explanation of the Schema
+
+The schema defines how to extract the data:
+
+- `name`: A descriptive name for the extraction task.
+- `baseSelector`: The CSS selector for the repeating elements (in this case, table rows).
+- `fields`: An array of fields to extract from each element:
+  - `name`: The name to give the extracted data.
+  - `selector`: The CSS selector to find the specific data within the base element.
+  - `type`: The type of data to extract (usually "text" for textual content).
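+
+Because a schema is an ordinary dictionary, it can be sanity-checked before a crawl. The sketch below is a hypothetical helper (`validate_schema` is not part of Crawl4AI) that only checks the keys described above; it assumes `selector` is optional for a field, since a field may refer to the base element itself.
+
+```python
+def validate_schema(schema: dict) -> list[str]:
+    """Return a list of problems found in a schema; an empty list means none were found."""
+    problems = []
+    for key in ("name", "baseSelector", "fields"):
+        if key not in schema:
+            problems.append(f"missing top-level key: {key}")
+    for i, field in enumerate(schema.get("fields", [])):
+        for key in ("name", "type"):
+            if key not in field:
+                problems.append(f"field {i} is missing key: {key}")
+    return problems
+
+example_schema = {
+    "name": "Coinbase Crypto Prices",
+    "baseSelector": ".cds-tableRow-t45thuk",
+    "fields": [
+        {"name": "crypto", "selector": "td:nth-child(1) h2", "type": "text"},
+        {"name": "price", "selector": "td:nth-child(2)", "type": "text"},
+    ],
+}
+print(validate_schema(example_schema))
+```
+
+A check like this catches typos in key names early, before any network request is made.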
+
+## Advantages of JsonCssExtractionStrategy
+
+1. **Speed**: CSS selectors are fast to execute, making this method efficient for large datasets.
+2. **Precision**: You can target exactly the elements you need.
+3. **Structured Output**: The result is already structured as JSON, ready for further processing.
+4. **No External Dependencies**: Unlike LLM-based strategies, this doesn't require any API calls to external services.
+
+## Tips for Using JsonCssExtractionStrategy
+
+1. **Inspect the Page**: Use browser developer tools to identify the correct CSS selectors.
+2. **Test Selectors**: Verify your selectors in the browser console before using them in the script.
+3. **Handle Dynamic Content**: If the page uses JavaScript to load content, you may need to combine this with JS execution (see the Advanced Usage section).
+4. **Error Handling**: Always check the `result.success` flag and handle potential failures.
+
+## Advanced Usage: Combining with JavaScript Execution
+
+For pages that load data dynamically, you can combine the `JsonCssExtractionStrategy` with JavaScript execution:
+
+```python
+async def extract_dynamic_structured_data():
+    schema = {
+        "name": "Dynamic Crypto Prices",
+        "baseSelector": ".crypto-row",
+        "fields": [
+            {"name": "name", "selector": ".crypto-name", "type": "text"},
+            {"name": "price", "selector": ".crypto-price", "type": "text"},
+        ]
+    }
+
+    js_code = """
+    window.scrollTo(0, document.body.scrollHeight);
+    await new Promise(resolve => setTimeout(resolve, 2000));  // Wait for 2 seconds
+    """
+
+    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
+
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        result = await crawler.arun(
+            url="https://example.com/crypto-prices",
+            extraction_strategy=extraction_strategy,
+            js_code=js_code,
+            wait_for=".crypto-row:nth-child(20)",  # Wait for 20 rows to load
+            bypass_cache=True,
+        )
+
+        crypto_data = json.loads(result.extracted_content)
+        print(f"Extracted {len(crypto_data)} cryptocurrency entries")
+
+asyncio.run(extract_dynamic_structured_data())
+```
+
+This advanced example demonstrates how to:
+1. Execute JavaScript to trigger dynamic content loading.
+2. Wait for a specific condition (20 rows loaded) before extraction.
+3. Extract data from the dynamically loaded content.
+
+By mastering the `JsonCssExtractionStrategy`, you can efficiently extract structured data from a wide variety of web pages, making it a valuable tool in your web scraping toolkit.
+
+For more details on schema definitions and advanced extraction strategies, check out the [Advanced JsonCssExtraction](./css-advanced.md) guide.
--- a/docs/md_v2/extraction/llm.md
+++ b/docs/md_v2/extraction/llm.md
@@ -0,0 +1,179 @@
+# LLM Extraction with AsyncWebCrawler
+
+Crawl4AI's AsyncWebCrawler allows you to use Large Language Models (LLMs) to extract structured data or relevant content from web pages asynchronously. Below are two examples demonstrating how to use `LLMExtractionStrategy` for different purposes with the AsyncWebCrawler.
+
+## Example 1: Extract Structured Data
+
+In this example, we use the `LLMExtractionStrategy` to extract structured data (model names and their fees) from the OpenAI pricing page.
+
+```python
+import os
+import json
+import asyncio
+from crawl4ai import AsyncWebCrawler
+from crawl4ai.extraction_strategy import LLMExtractionStrategy
+from pydantic import BaseModel, Field
+
+class OpenAIModelFee(BaseModel):
+    model_name: str = Field(..., description="Name of the OpenAI model.")
+    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
+    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
+
+async def extract_openai_fees():
+    url = 'https://openai.com/api/pricing/'
+
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        result = await crawler.arun(
+            url=url,
+            word_count_threshold=1,
+            extraction_strategy=LLMExtractionStrategy(
+                provider="openai/gpt-4o", # Or use ollama like provider="ollama/nemotron"
+                api_token=os.getenv('OPENAI_API_KEY'),
+                schema=OpenAIModelFee.model_json_schema(),
+                extraction_type="schema",
+                instruction="From the crawled content, extract all mentioned model names along with their "
+                            "fees for input and output tokens. Make sure not to miss anything in the entire content. "
+                            'One extracted model JSON format should look like this: '
+                            '{ "model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens" }'
+            ),
+            bypass_cache=True,
+        )
+
+    model_fees = json.loads(result.extracted_content)
+    print(f"Number of models extracted: {len(model_fees)}")
+
+    os.makedirs(".data", exist_ok=True)  # Ensure the output directory exists
+    with open(".data/openai_fees.json", "w", encoding="utf-8") as f:
+        json.dump(model_fees, f, indent=2)
+
+asyncio.run(extract_openai_fees())
+```
+
+## Example 2: Extract Relevant Content
+
+In this example, we instruct the LLM to extract only content related to technology from the NBC News business page.
+
+```python
+import os
+import json
+import asyncio
+from crawl4ai import AsyncWebCrawler
+from crawl4ai.extraction_strategy import LLMExtractionStrategy
+
+async def extract_tech_content():
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        result = await crawler.arun(
+            url="https://www.nbcnews.com/business",
+            extraction_strategy=LLMExtractionStrategy(
+                provider="openai/gpt-4o",
+                api_token=os.getenv('OPENAI_API_KEY'),
+                instruction="Extract only content related to technology"
+            ),
+            bypass_cache=True,
+        )
+
+    tech_content = json.loads(result.extracted_content)
+    print(f"Number of tech-related items extracted: {len(tech_content)}")
+
+    os.makedirs(".data", exist_ok=True)  # Ensure the output directory exists
+    with open(".data/tech_content.json", "w", encoding="utf-8") as f:
+        json.dump(tech_content, f, indent=2)
+
+asyncio.run(extract_tech_content())
+```
+
+## Advanced Usage: Combining JS Execution with LLM Extraction
+
+This example demonstrates how to combine JavaScript execution with LLM extraction to handle dynamic content:
+
+```python
+async def extract_dynamic_content():
+    js_code = """
+    const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
+    if (loadMoreButton) {
+        loadMoreButton.click();
+        await new Promise(resolve => setTimeout(resolve, 2000));
+    }
+    """
+
+    wait_for = """
+    () => {
+        const articles = document.querySelectorAll('article.tease-card');
+        return articles.length > 10;
+    }
+    """
+
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        result = await crawler.arun(
+            url="https://www.nbcnews.com/business",
+            js_code=js_code,
+            wait_for=wait_for,
+            css_selector="article.tease-card",
+            extraction_strategy=LLMExtractionStrategy(
+                provider="openai/gpt-4o",
+                api_token=os.getenv('OPENAI_API_KEY'),
+                instruction="Summarize each article, focusing on technology-related content"
+            ),
+            bypass_cache=True,
+        )
+
+    summaries = json.loads(result.extracted_content)
+    print(f"Number of summarized articles: {len(summaries)}")
+
+    os.makedirs(".data", exist_ok=True)  # Ensure the output directory exists
+    with open(".data/tech_summaries.json", "w", encoding="utf-8") as f:
+        json.dump(summaries, f, indent=2)
+
+asyncio.run(extract_dynamic_content())
+```
+
+## Customizing LLM Provider
+
+Crawl4AI uses the `litellm` library under the hood, which allows you to use any LLM provider you want. Just pass the correct model name and API token:
+
+```python
+extraction_strategy=LLMExtractionStrategy(
+    provider="your_llm_provider/model_name",
+    api_token="your_api_token",
+    instruction="Your extraction instruction"
+)
+```
+
+This flexibility allows you to integrate with various LLM providers and tailor the extraction process to your specific needs.
+
+## Error Handling and Retries
+
+When working with external LLM APIs, it's important to handle potential errors and implement retry logic. Here's an example of how you might do this:
+
+```python
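+import os
+import json
+import asyncio
+from crawl4ai import AsyncWebCrawler
+from crawl4ai.extraction_strategy import LLMExtractionStrategy
+from tenacity import retry, stop_after_attempt, wait_exponential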
+
+class LLMExtractionError(Exception):
+    pass
+
+# reraise=True surfaces the last LLMExtractionError instead of tenacity's RetryError
+@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10), reraise=True)
+async def extract_with_retry(crawler, url, extraction_strategy):
+    try:
+        result = await crawler.arun(url=url, extraction_strategy=extraction_strategy, bypass_cache=True)
+        return json.loads(result.extracted_content)
+    except Exception as e:
+        raise LLMExtractionError(f"Failed to extract content: {e}") from e
+
+async def main():
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        try:
+            content = await extract_with_retry(
+                crawler,
+                "https://www.example.com",
+                LLMExtractionStrategy(
+                    provider="openai/gpt-4o",
+                    api_token=os.getenv('OPENAI_API_KEY'),
+                    instruction="Extract and summarize main points"
+                )
+            )
+            print("Extracted content:", content)
+        except LLMExtractionError as e:
+            print(f"Extraction failed after retries: {e}")
+
+asyncio.run(main())
+```
+
+This example uses the `tenacity` library to implement a retry mechanism with exponential backoff, which can help handle temporary failures or rate limiting from the LLM API.
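+
+If adding a dependency is not an option, the same idea can be sketched with the standard library alone. This is a simplified alternative, not a reimplementation of `tenacity`: the helper `retry_async`, its parameters, and the doubling schedule are assumptions chosen for illustration, and `flaky` stands in for a real LLM call.
+
+```python
+import asyncio
+
+async def retry_async(func, attempts: int = 3, base_delay: float = 1.0, max_delay: float = 10.0):
+    """Call an async function, retrying on any exception with exponential backoff."""
+    for attempt in range(1, attempts + 1):
+        try:
+            return await func()
+        except Exception:
+            if attempt == attempts:
+                raise  # Out of attempts: surface the last error
+            # Delay doubles each attempt (base, 2*base, 4*base, ...) up to max_delay
+            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
+            await asyncio.sleep(delay)
+
+calls = {"count": 0}
+
+async def flaky():
+    """Hypothetical operation that fails twice before succeeding."""
+    calls["count"] += 1
+    if calls["count"] < 3:
+        raise RuntimeError("temporary failure")
+    return "ok"
+
+result = asyncio.run(retry_async(flaky, attempts=3, base_delay=0.01))
+print(result, calls["count"])
+```
+
+In practice it is usually better to retry only on errors known to be transient (timeouts, rate limits) rather than on every exception.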
--- a/docs/md_v2/extraction/overview.md
+++ b/docs/md_v2/extraction/overview.md
@@ -0,0 +1,197 @@
+# Extraction Strategies Overview
+
+Crawl4AI provides powerful extraction strategies to help you get structured data from web pages. Each strategy is designed for specific use cases and offers different approaches to data extraction.
+
+## Available Strategies
+
+### [LLM-Based Extraction](llm.md)
+
+`LLMExtractionStrategy` uses Language Models to extract structured data from web content. This approach is highly flexible and can understand content semantically.
+
+```python
+from pydantic import BaseModel
+from crawl4ai.extraction_strategy import LLMExtractionStrategy
+
+class Product(BaseModel):
+    name: str
+    price: float
+    description: str
+
+strategy = LLMExtractionStrategy(
+    provider="ollama/llama2",
+    schema=Product.model_json_schema(),
+    instruction="Extract product details from the page"
+)
+
+result = await crawler.arun(
+    url="https://example.com/product",
+    extraction_strategy=strategy
+)
+```
+
+**Best for:**
+- Complex data structures
+- Content requiring interpretation
+- Flexible content formats
+- Natural language processing
+
+### [CSS-Based Extraction](css.md)
+
+`JsonCssExtractionStrategy` extracts data using CSS selectors. This is fast, reliable, and perfect for consistently structured pages.
+
+```python
+from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+
+schema = {
+    "name": "Product Listing",
+    "baseSelector": ".product-card",
+    "fields": [
+        {"name": "title", "selector": "h2", "type": "text"},
+        {"name": "price", "selector": ".price", "type": "text"},
+        {"name": "image", "selector": "img", "type": "attribute", "attribute": "src"}
+    ]
+}
+
+strategy = JsonCssExtractionStrategy(schema)
+
+result = await crawler.arun(
+    url="https://example.com/products",
+    extraction_strategy=strategy
+)
+```
+
+**Best for:**
+- E-commerce product listings
+- News article collections
+- Structured content pages
+- High-performance needs
+
+### [Cosine Strategy](cosine.md)
+
+`CosineStrategy` uses similarity-based clustering to identify and extract relevant content sections.
+
+```python
+from crawl4ai.extraction_strategy import CosineStrategy
+
+strategy = CosineStrategy(
+    semantic_filter="product reviews",    # Content focus
+    word_count_threshold=10,             # Minimum words per cluster
+    sim_threshold=0.3,                   # Similarity threshold
+    max_dist=0.2,                        # Maximum cluster distance
+    top_k=3                             # Number of top clusters to extract
+)
+
+result = await crawler.arun(
+    url="https://example.com/reviews",
+    extraction_strategy=strategy
+)
+```
+
+**Best for:**
+- Content similarity analysis
+- Topic clustering
+- Relevant content extraction
+- Pattern recognition in text
+
+## Strategy Selection Guide
+
+Choose your strategy based on these factors:
+
+1. **Content Structure**
+   - Well-structured HTML → Use CSS Strategy
+   - Natural language text → Use LLM Strategy
+   - Mixed/Complex content → Use Cosine Strategy
+
+2. **Performance Requirements**
+   - Fastest: CSS Strategy
+   - Moderate: Cosine Strategy
+   - Variable: LLM Strategy (depends on provider)
+
+3. **Accuracy Needs**
+   - Highest structure accuracy: CSS Strategy
+   - Best semantic understanding: LLM Strategy
+   - Best content relevance: Cosine Strategy
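+
+The content-structure mapping in item 1 can be sketched as a small lookup. This is only an illustration: `suggest_strategy` and its `content_kind` labels are hypothetical names, not part of Crawl4AI, and the returned values are simply the strategy class names used in this guide.
+
+```python
+def suggest_strategy(content_kind: str) -> str:
+    """Map a rough content description to a strategy class name (hypothetical helper)."""
+    suggestions = {
+        "structured": "JsonCssExtractionStrategy",    # well-structured HTML
+        "natural_language": "LLMExtractionStrategy",  # free-form text
+        "mixed": "CosineStrategy",                    # mixed or complex content
+    }
+    if content_kind not in suggestions:
+        raise ValueError(f"unknown content kind: {content_kind!r}")
+    return suggestions[content_kind]
+
+print(suggest_strategy("structured"))
+```
+
+Treat the result as a starting point; performance and accuracy needs (items 2 and 3) may still change the choice.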
+
+## Combining Strategies
+
+You can combine strategies for more powerful extraction:
+
+```python
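+# css_strategy and llm_strategy are strategy instances created beforehand
+# (see the strategy sections in this guide)
+
+# First use CSS strategy for initial structure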
+css_result = await crawler.arun(
+    url="https://example.com",
+    extraction_strategy=css_strategy
+)
+
+# Then use LLM for semantic analysis
+llm_result = await crawler.arun(
+    url="https://example.com",
+    extraction_strategy=llm_strategy
+)
+```
+
+## Common Use Cases
+
+1. **E-commerce Scraping**
+   ```python
+   # CSS Strategy for product listings
+   schema = {
+       "name": "Products",
+       "baseSelector": ".product",
+       "fields": [
+           {"name": "name", "selector": ".title", "type": "text"},
+           {"name": "price", "selector": ".price", "type": "text"}
+       ]
+   }
+   ```
+
+2. **News Article Extraction**
+   ```python
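+   # LLM Strategy for article content
+   from pydantic import BaseModel
+   from crawl4ai.extraction_strategy import LLMExtractionStrategy
+
+   class Article(BaseModel):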
+       title: str
+       content: str
+       author: str
+       date: str
+
+   strategy = LLMExtractionStrategy(
+       provider="ollama/llama2",
+       schema=Article.schema()
+   )
+   ```
+
+3. **Content Analysis**
+   ```python
+   # Cosine Strategy for topic analysis
+   strategy = CosineStrategy(
+       semantic_filter="technology trends",
+       top_k=5
+   )
+   ```
+
+## Best Practices
+
+1. **Choose the Right Strategy**
+   - Start with CSS for structured data
+   - Use LLM for complex interpretation
+   - Try Cosine for content relevance
+
+2. **Optimize Performance**
+   - Cache LLM results
+   - Keep CSS selectors specific
+   - Tune similarity thresholds
+
+3. **Handle Errors**
+   ```python
+   import json
+
+   result = await crawler.arun(
+       url="https://example.com",
+       extraction_strategy=strategy
+   )
+   
+   if not result.success:
+       print(f"Extraction failed: {result.error_message}")
+   else:
+       data = json.loads(result.extracted_content)
+   ```
+
+Each strategy has its strengths and optimal use cases. Explore the detailed documentation for each strategy to learn more about their specific features and configurations.
--- a/docs/md_v2/index.md
+++ b/docs/md_v2/index.md
@@ -0,0 +1,113 @@
+# Crawl4AI
+
+Welcome to the official documentation for Crawl4AI! 🕷️🤖 Crawl4AI is an open-source Python library designed to simplify web crawling and extract useful information from web pages. This documentation will guide you through the features, usage, and customization of Crawl4AI.
+
+## Introduction
+
+Crawl4AI has one clear task: to make crawling and data extraction from web pages easy and efficient, especially for large language models (LLMs) and AI applications. Whether you are using it as a REST API or a Python library, Crawl4AI offers a robust and flexible solution with full asynchronous support.
+
+## Quick Start
+
+Here's a quick example to show you how easy it is to use Crawl4AI with its asynchronous capabilities:
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler
+
+async def main():
+    # Create an instance of AsyncWebCrawler
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        # Run the crawler on a URL
+        result = await crawler.arun(url="https://www.nbcnews.com/business")
+
+        # Print the extracted content
+        print(result.markdown)
+
+# Run the async main function
+asyncio.run(main())
+```
+
+## Key Features ✨
+
+- 🆓 Completely free and open-source
+- 🚀 Blazing fast performance, outperforming many paid services
+- 🤖 LLM-friendly output formats (JSON, cleaned HTML, markdown)
+- 📄 Fit markdown generation for extracting main article content
+- 🌐 Multi-browser support (Chromium, Firefox, WebKit)
+- 🌍 Supports crawling multiple URLs simultaneously
+- 🎨 Extracts and returns all media tags (Images, Audio, and Video)
+- 🔗 Extracts all external and internal links
+- 📚 Extracts metadata from the page
+- 🔄 Custom hooks for authentication, headers, and page modifications
+- 🕵️ User-agent customization
+- 🖼️ Takes screenshots of pages with enhanced error handling
+- 📜 Executes multiple custom JavaScript snippets before crawling
+- 📊 Generates structured output without LLM using JsonCssExtractionStrategy
+- 📚 Various chunking strategies: topic-based, regex, sentence, and more
+- 🧠 Advanced extraction strategies: cosine clustering, LLM, and more
+- 🎯 CSS selector support for precise data extraction
+- 📝 Passes instructions/keywords to refine extraction
+- 🔒 Proxy support with authentication for enhanced access
+- 🔄 Session management for complex multi-page crawling
+- 🌐 Asynchronous architecture for improved performance
+- 🖼️ Improved image processing with lazy-loading detection
+- 🕰️ Enhanced handling of delayed content loading
+- 🔑 Custom headers support for LLM interactions
+- 🖼️ iframe content extraction for comprehensive analysis
+- ⏱️ Flexible timeout and delayed content retrieval options
+
+## Documentation Structure
+
+Our documentation is organized into several sections:
+
+### Basic Usage
+- [Installation](basic/installation.md)
+- [Quick Start](basic/quickstart.md)
+- [Simple Crawling](basic/simple-crawling.md)
+- [Browser Configuration](basic/browser-config.md)
+- [Content Selection](basic/content-selection.md)
+- [Output Formats](basic/output-formats.md)
+- [Page Interaction](basic/page-interaction.md)
+
+### Advanced Features
+- [Magic Mode](advanced/magic-mode.md)
+- [Session Management](advanced/session-management.md)
+- [Hooks & Authentication](advanced/hooks-auth.md)
+- [Proxy & Security](advanced/proxy-security.md)
+- [Content Processing](advanced/content-processing.md)
+
+### Extraction & Processing
+- [Extraction Strategies Overview](extraction/overview.md)
+- [LLM Integration](extraction/llm.md)
+- [CSS-Based Extraction](extraction/css.md)
+- [Cosine Strategy](extraction/cosine.md)
+- [Chunking Strategies](extraction/chunking.md)
+
+### API Reference
+- [AsyncWebCrawler](api/async-webcrawler.md)
+- [CrawlResult](api/crawl-result.md)
+- [Extraction Strategies](api/strategies.md)
+- [arun() Method Parameters](api/arun.md)
+
+### Examples
+- Coming soon!
+
+## Getting Started
+
+1. Install Crawl4AI:
+   ```bash
+   pip install crawl4ai
+   ```
+
+2. Check out our [Quick Start Guide](basic/quickstart.md) to begin crawling web pages.
+
+3. Explore our [examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples) to see Crawl4AI in action.
+
+## Support
+
+For questions, suggestions, or issues:
+- GitHub Issues: [Report a Bug](https://github.com/unclecode/crawl4ai/issues)
+- Twitter: [@unclecode](https://twitter.com/unclecode)
+- Website: [crawl4ai.com](https://crawl4ai.com)
+
+Happy Crawling! 🕸️🚀
--- a/docs/md_v2/tutorial/episode_01_Introduction_to_Crawl4AI_and_Basic_Installation.md
+++ b/docs/md_v2/tutorial/episode_01_Introduction_to_Crawl4AI_and_Basic_Installation.md
@@ -0,0 +1,51 @@
+# Crawl4AI
+
+## Episode 1: Introduction to Crawl4AI and Basic Installation
+
+### Quick Intro
+Walk through installation from PyPI, setup, and verification. Show how to install with options like `torch` or `transformer` for advanced capabilities.
+
+Here's a condensed outline of the **Installation and Setup** video content:
+
+---
+
+1) **Introduction to Crawl4AI**: Briefly explain that Crawl4AI is a powerful tool for web scraping, data extraction, and content processing, with customizable options for various needs.
+
+2) **Installation Overview**:   
+   
+   - **Basic Install**: Run `pip install crawl4ai` and `playwright install` (to set up browser dependencies).
+ 
+   - **Optional Advanced Installs**:
+     - `pip install crawl4ai[torch]` - Adds PyTorch for clustering.
+     - `pip install crawl4ai[transformer]` - Adds support for LLM-based extraction.
+     - `pip install crawl4ai[all]` - Installs all features for complete functionality.
+
+3) **Verifying the Installation**:
+   
+   - Walk through a simple test script to confirm the setup:
+      ```python
+      import asyncio
+      from crawl4ai import AsyncWebCrawler
+      
+      async def main():
+          async with AsyncWebCrawler(verbose=True) as crawler:
+              result = await crawler.arun(url="https://www.example.com")
+              print(result.markdown[:500])  # Show first 500 characters
+
+      asyncio.run(main())
+      ```
+   - Explain that this script initializes the crawler and runs it on a test URL, displaying part of the extracted content to verify functionality.
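+   - Optionally, show a pre-flight check before running the test script. This is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper, not part of Crawl4AI, and it only confirms that the packages can be found, not that the browsers are installed:
+      ```python
+      import importlib.util
+
+      def missing_packages(names):
+          """Return the names from `names` that cannot be found for import."""
+          return [name for name in names if importlib.util.find_spec(name) is None]
+
+      missing = missing_packages(["crawl4ai", "playwright"])
+      print("Missing packages:", missing or "none")
+      ```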
+
+4) **Important Tips**:
+   
+   - **Run** `playwright install` **after installation** to set up dependencies.
+   - **For full performance** on text-related tasks, run `crawl4ai-download-models` after installing with `[torch]`, `[transformer]`, or `[all]` options.
+   - If you encounter issues, refer to the documentation or GitHub issues.
+
+5) **Wrap Up**:
+   
+   - Introduce the next topic in the series, which will cover Crawl4AI's browser configuration options (like choosing between `chromium`, `firefox`, and `webkit`).
+
+---
+
+This structure provides a concise, effective guide to get viewers up and running with Crawl4AI in minutes.
--- a/docs/md_v2/tutorial/episode_02_Overview_of_Advanced_Features.md
+++ b/docs/md_v2/tutorial/episode_02_Overview_of_Advanced_Features.md
@@ -0,0 +1,78 @@
+# Crawl4AI
+
+## Episode 2: Overview of Advanced Features
+
+### Quick Intro
+A general overview of advanced features like hooks, CSS selectors, and JSON CSS extraction.
+
+Here's a condensed outline for an **Overview of Advanced Features** video covering Crawl4AI's powerful customization and extraction options:
+
+---
+
+### **Overview of Advanced Features**
+
+1) **Introduction to Advanced Features**:
+ 
+   - Briefly introduce Crawl4AI’s advanced tools, which let users go beyond basic crawling to customize and fine-tune their scraping workflows.
+
+2) **Taking Screenshots**:
+ 
+   - Explain the screenshot capability for capturing page state and verifying content.
+   - **Example**:
+      ```python
+      result = await crawler.arun(url="https://www.example.com", screenshot=True)
+      ```
+   - Mention that the screenshot is returned as a base64-encoded string in `result.screenshot`, which can be decoded and written to an image file.
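+   - The decoding step can be sketched as follows. `save_screenshot` is a hypothetical helper, not part of Crawl4AI; in a real crawl its input would be the base64 string from the `screenshot` field of `result`:
+      ```python
+      import base64
+      from pathlib import Path
+
+      def save_screenshot(b64_data: str, path: str) -> int:
+          """Decode a base64 string and write the bytes to `path`; return the byte count."""
+          image_bytes = base64.b64decode(b64_data)
+          return Path(path).write_bytes(image_bytes)
+
+      # Inside an async crawl (attribute name assumed):
+      # save_screenshot(result.screenshot, "page.png")
+      ```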
+
+3) **Media and Link Extraction**:
+ 
+   - Demonstrate how to pull all media (images, videos) and links (internal and external) from a page for deeper analysis or content gathering.
+   - **Example**:
+      ```python
+      result = await crawler.arun(url="https://www.example.com")
+      print("Media:", result.media)
+      print("Links:", result.links)
+      ```
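+   - To show the results at a glance, count the items per category. This is a minimal sketch on sample data; `summarize` is a hypothetical helper, and the dict-of-lists shape used for `result.media` and `result.links` is an assumption:
+      ```python
+      def summarize(groups: dict) -> dict:
+          """Count the items in each category of a dict of lists."""
+          return {category: len(items) for category, items in groups.items()}
+
+      # Sample data standing in for result.media and result.links (shape is assumed)
+      media = {"images": [{"src": "a.png"}, {"src": "b.png"}], "videos": []}
+      links = {"internal": [{"href": "/about"}], "external": []}
+
+      print("Media counts:", summarize(media))
+      print("Link counts:", summarize(links))
+      ```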
+
+4) **Custom User Agent**:
+ 
+   - Show how to set a custom user agent to disguise the crawler or simulate specific devices/browsers.
+   - **Example**:
+      ```python
+      result = await crawler.arun(url="https://www.example.com", user_agent="Mozilla/5.0 (compatible; MyCrawler/1.0)")
+      ```
+
+5) **Custom Hooks for Enhanced Control**:
+ 
+   - Briefly cover how to use hooks, which allow custom actions like setting headers or handling login during the crawl.
+   - **Example**: Setting a custom header with `before_get_url` hook.
+      ```python
+      async def before_get_url(page):
+          await page.set_extra_http_headers({"X-Test-Header": "test"})
+      ```
+
+6) **CSS Selectors for Targeted Extraction**:
+ 
+   - Explain the use of CSS selectors to extract specific elements, ideal for structured data like articles or product details.
+   - **Example**:
+      ```python
+      result = await crawler.arun(url="https://www.example.com", css_selector="h2")
+      print("H2 Tags:", result.extracted_content)
+      ```
+
+7) **Crawling Inside Iframes**:
+ 
+   - Mention how enabling `process_iframes=True` allows extracting content within iframes, useful for sites with embedded content or ads.
+   - **Example**:
+      ```python
+      result = await crawler.arun(url="https://www.example.com", process_iframes=True)
+      ```
+
+8) **Wrap-Up**:
+ 
+   - Summarize these advanced features and how they allow users to customize every part of their web scraping experience.
+   - Tease upcoming videos where each feature will be explored in detail.
+
+---
+
+This covers each advanced feature with a brief example, providing a useful overview to prepare viewers for the more in-depth videos.