Update Documentation

Update version
feat: enhance crawler with overlay removal and improved screenshot capabilities
2024-10-27 19:24:46 +08:00 · 2024-10-24 20:24:21 +08:00 · 2024-10-24 20:22:47 +08:00 · 2024-10-22 20:19:22 +08:00 · 2024-10-20 19:25:25 +08:00 · 2024-10-20 19:11:18 +08:00
141 changed files with 11984 additions and 370 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -196,4 +196,14 @@ docs/.DS_Store
 tmp/
 test_env/
 **/.DS_Store
-**/.DS_Store
+**/.DS_Store
+
+todo.md
+git_changes.py
+git_changes.md
+pypi_build.sh
+git_issues.py
+git_issues.md
+
+.tests/
+.issues/
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,299 @@
 # Changelog

+## [v0.3.73] - 2024-10-24
+
+### Added
+- Smart overlay removal system in AsyncPlaywrightCrawlerStrategy:
+  - Automatic removal of popups, modals, and cookie notices
+  - Detection and removal of fixed/sticky position elements
+  - Cleaning of empty block elements
+  - Configurable via `remove_overlay_elements` parameter
+- Enhanced screenshot capabilities:
+  - Added `screenshot_wait_for` parameter to control timing
+  - Improved screenshot handling with existing page context
+  - Better error handling with fallback error images
+- New URL normalization utilities:
+  - `normalize_url` function for consistent URL formatting
+  - `is_external_url` function for better link classification
+- Custom base directory support for cache storage:
+  - New `base_directory` parameter in AsyncWebCrawler
+  - Allows specifying alternative locations for `.crawl4ai` folder
+
+### Enhanced
+- Link handling improvements:
+  - Better duplicate link detection
+  - Enhanced internal/external link classification
+  - Improved handling of special URL protocols
+  - Support for anchor links and protocol-relative URLs
+- Configuration refinements:
+  - Streamlined social media domain list
+  - More focused external content filtering
+- LLM extraction strategy:
+  - Added support for separate API base URL via `api_base` parameter
+  - Better handling of base URLs in configuration
+
+### Fixed
+- Screenshot functionality:
+  - Resolved issues with screenshot timing and context
+  - Improved error handling and recovery
+- Link processing:
+  - Fixed URL normalization edge cases
+  - Better handling of invalid URLs
+  - Improved error messages for link processing failures
+
+### Developer Notes
+- The overlay removal system uses advanced JavaScript injection for better compatibility
+- URL normalization handles special cases like mailto:, tel:, and protocol-relative URLs
+- Screenshot system now reuses existing page context for better performance
+- Link processing maintains separate dictionaries for internal and external links to ensure uniqueness
+
+## [v0.3.72] - 2024-10-22
+
+### Added
+- New `ContentCleaningStrategy` class:
+  - Smart content extraction based on text density and element scoring
+  - Automatic removal of boilerplate content
+  - DOM tree analysis for better content identification
+  - Configurable thresholds for content detection
+- Advanced proxy support:
+  - Added `proxy_config` option for authenticated proxy connections
+  - Support for username/password in proxy configuration
+- New content output formats:
+  - `fit_markdown`: Optimized markdown output with main content focus
+  - `fit_html`: Clean HTML with only essential content
+
+### Enhanced
+- Image source detection:
+  - Support for multiple image source attributes (`src`, `data-src`, `srcset`, etc.)
+  - Automatic fallback through potential source attributes
+  - Smart handling of srcset attribute
+- External content handling:
+  - Made external link exclusion optional (disabled by default)
+  - Improved detection and handling of social media links
+  - Better control over external image filtering
+
+### Fixed
+- Image extraction reliability with multiple source attribute checks
+- External link and image handling logic for better accuracy
+
+### Developer Notes
+- The new `ContentCleaningStrategy` uses configurable thresholds for customization
+- Proxy configuration now supports more complex authentication scenarios
+- Content extraction process now provides both regular and optimized outputs
+
+## [v0.3.72] - 2024-10-20
+
+### Fixed
+- Added support for parsing Base64 encoded images in WebScrappingStrategy
+
+### Added
+- Forked and integrated a customized version of the html2text library for more control over Markdown generation
+- New configuration options for controlling external content:
+  - Ability to exclude all external links
+  - Option to specify domains to exclude (default includes major social media platforms)
+  - Control over excluding external images
+
+### Changed
+- Improved Markdown generation process:
+  - Added fine-grained control over character escaping in Markdown output
+  - Enhanced handling of code blocks and pre-formatted text
+- Updated `AsyncPlaywrightCrawlerStrategy.close()` method to use a shorter sleep time (0.5 seconds instead of 500)
+- Enhanced flexibility in `CosineStrategy` with a more generic `load_HF_embedding_model` function
+
+### Improved
+- Optimized content scraping and processing for better efficiency
+- Enhanced error handling and logging in various components
+
+### Developer Notes
+- The customized html2text library is now located within the crawl4ai package
+- New configuration options are available in the `config.py` file for external content handling
+- The `WebScrappingStrategy` class has been updated to accommodate new external content exclusion options
+
+## [v0.3.71] - 2024-10-19
+
+### Added
+- New chunking strategies:
+  - `OverlappingWindowChunking`: Allows for overlapping chunks of text, useful for maintaining context between chunks.
+  - Enhanced `SlidingWindowChunking`: Improved to handle edge cases and last chunks more effectively.
+
+### Changed
+- Updated `CHUNK_TOKEN_THRESHOLD` in config to 2048 tokens (2^11) for better compatibility with most LLMs.
+- Improved `AsyncPlaywrightCrawlerStrategy.close()` method to use a shorter sleep time (0.5 seconds instead of 500), significantly reducing wait time when closing the crawler.
+- Enhanced flexibility in `CosineStrategy`:
+  - Now uses a more generic `load_HF_embedding_model` function, allowing for easier swapping of embedding models.
+- Updated `JsonCssExtractionStrategy` and `JsonXPATHExtractionStrategy` for better JSON-based extraction.
+
+### Fixed
+- Addressed potential issues with the sliding window chunking strategy to ensure all text is properly chunked.
+
+### Developer Notes
+- Added more comprehensive docstrings to chunking strategies for better code documentation.
+- Removed hardcoded device setting in `CosineStrategy`, now using the automatically detected device.
+- Added a new example in `quickstart_async.py` for generating a knowledge graph from crawled content.
+
+These updates aim to provide more flexibility in text processing, improve performance, and enhance the overall capabilities of the crawl4ai library. The new chunking strategies, in particular, offer more options for handling large texts in various scenarios.
+
+## [v0.3.71] - 2024-10-18
+
+### Changes
+1. **Version Update**:
+   - Updated version number from 0.3.7 to 0.3.71.
+
+2. **Crawler Enhancements**:
+   - Added `sleep_on_close` option to AsyncPlaywrightCrawlerStrategy for delayed browser closure.
+   - Improved context creation with additional options:
+     - Enabled `accept_downloads` and `java_script_enabled`.
+     - Added a cookie to enable cookies by default.
+
+3. **Error Handling Improvements**:
+   - Enhanced error messages in AsyncWebCrawler's `arun` method.
+   - Updated error reporting format for better visibility and consistency.
+
+4. **Performance Optimization**:
+   - Commented out automatic page and context closure in `crawl` method to potentially improve performance in certain scenarios.
+
+### Documentation
+- Updated quickstart notebook:
+  - Changed installation command to use the released package instead of GitHub repository.
+  - Updated kernel display name.
+
+### Developer Notes
+- Minor code refactoring and cleanup.
+
+## [v0.3.7] - 2024-10-17
+
+### New Features
+1. **Enhanced Browser Stealth**: 
+   - Implemented `playwright_stealth` for improved bot detection avoidance.
+   - Added `StealthConfig` for fine-tuned control over stealth parameters.
+
+2. **User Simulation**:
+   - New `simulate_user` option to mimic human-like interactions (mouse movements, clicks, keyboard presses).
+
+3. **Navigator Override**:
+   - Added `override_navigator` option to modify navigator properties, further improving bot detection evasion.
+
+4. **Improved iframe Handling**:
+   - New `process_iframes` parameter to extract and integrate iframe content into the main page.
+
+5. **Flexible Browser Selection**:
+   - Support for choosing between Chromium, Firefox, and WebKit browsers.
+
+6. **Include Links in Markdown**:
+   - Added support for including links in Markdown content by defining a new flag, `include_links_on_markdown`, in the `crawl` method.
+
+### Improvements
+1. **Better Error Handling**:
+   - Enhanced error reporting in WebScrappingStrategy with detailed error messages and suggestions.
+   - Added console message and error logging for better debugging.
+
+2. **Image Processing Enhancements**:
+   - Improved image dimension updating and filtering logic.
+
+3. **Crawling Flexibility**:
+   - Added support for custom viewport sizes.
+   - Implemented delayed content retrieval with `delay_before_return_html` parameter.
+
+4. **Performance Optimization**:
+   - Adjusted default semaphore count for parallel crawling.
+
+### Bug Fixes
+- Fixed an issue where the HTML content could be empty after processing.
+
+### Examples
+- Added new example `crawl_with_user_simulation()` demonstrating the use of user simulation and navigator override features.
+
+### Developer Notes
+- Refactored code for better maintainability and readability.
+- Updated browser launch arguments for improved compatibility and performance.
+
+## [v0.3.6] - 2024-10-12 
+
+### 1. Improved Crawling Control
+- **New Hook**: Added `before_retrieve_html` hook in `AsyncPlaywrightCrawlerStrategy`.
+- **Delayed HTML Retrieval**: Introduced `delay_before_return_html` parameter to allow waiting before retrieving HTML content.
+  - Useful for pages with delayed content loading.
+- **Flexible Timeout**: `smart_wait` function now uses `page_timeout` (default 60 seconds) instead of a fixed 30-second timeout.
+  - Provides better handling for slow-loading pages.
+- **How to use**: Set `page_timeout=your_desired_timeout` (in milliseconds) when calling `crawler.arun()`.
+
+### 2. Browser Type Selection
+- Added support for different browser types (Chromium, Firefox, WebKit).
+- Users can now specify the browser type when initializing AsyncWebCrawler.
+- **How to use**: Set `browser_type="firefox"` or `browser_type="webkit"` when initializing AsyncWebCrawler.
+
+### 3. Screenshot Capture
+- Added ability to capture screenshots during crawling.
+- Useful for debugging and content verification.
+- **How to use**: Set `screenshot=True` when calling `crawler.arun()`.
+
+### 4. Enhanced LLM Extraction Strategy
+- Added support for multiple LLM providers (OpenAI, Hugging Face, Ollama).
+- **Custom Arguments**: Added support for passing extra arguments to LLM providers via `extra_args` parameter.
+- **Custom Headers**: Users can now pass custom headers to the extraction strategy.
+- **How to use**: Specify the desired provider and custom arguments when using `LLMExtractionStrategy`.
+
+### 5. iframe Content Extraction
+- New feature to process and extract content from iframes.
+- **How to use**: Set `process_iframes=True` in the crawl method.
+
+### 6. Delayed Content Retrieval
+- Introduced `get_delayed_content` method in `AsyncCrawlResponse`.
+- Allows retrieval of content after a specified delay, useful for dynamically loaded content.
+- **How to use**: Access `result.get_delayed_content(delay_in_seconds)` after crawling.
+
+## Improvements and Optimizations
+
+### 1. AsyncWebCrawler Enhancements
+- **Flexible Initialization**: Now accepts arbitrary keyword arguments, passed directly to the crawler strategy.
+- Allows for more customized setups.
+
+### 2. Image Processing Optimization
+- Enhanced image handling in WebScrappingStrategy.
+- Added filtering for small, invisible, or irrelevant images.
+- Improved image scoring system for better content relevance.
+- Implemented JavaScript-based image dimension updating for more accurate representation.
+
+### 3. Database Schema Auto-updates
+- Automatic database schema updates ensure compatibility with the latest version.
+
+### 4. Enhanced Error Handling and Logging
+- Improved error messages and logging for easier debugging.
+
+### 5. Content Extraction Refinements
+- Refined HTML sanitization process.
+- Improved handling of base64 encoded images.
+- Enhanced Markdown conversion process.
+- Optimized content extraction algorithms.
+
+### 6. Utility Function Enhancements
+- `perform_completion_with_backoff` function now supports additional arguments for more customized API calls to LLM providers.
+
+## Bug Fixes
+- Fixed an issue where image tags were being prematurely removed during content extraction.
+
+## Examples and Documentation
+- Updated `quickstart_async.py` with examples of:
+  - Using custom headers in LLM extraction.
+  - Different LLM provider usage (OpenAI, Hugging Face, Ollama).
+  - Custom browser type usage.
+
+## Developer Notes
+- Refactored code for better maintainability, flexibility, and performance.
+- Enhanced type hinting throughout the codebase for improved development experience.
+- Expanded error handling for more robust operation.
+
+These updates significantly enhance the flexibility, accuracy, and robustness of crawl4ai, providing users with more control and options for their web crawling and content extraction tasks.
+
+## [v0.3.5] - 2024-09-02
+
+Enhance AsyncWebCrawler with smart waiting and screenshot capabilities
+
+- Implement smart_wait function in AsyncPlaywrightCrawlerStrategy
+- Add screenshot support to AsyncCrawlResponse and AsyncWebCrawler
+- Improve error handling and timeout management in crawling process
+- Fix typo in CrawlResult model (responser_headers -> response_headers)
+
 ## [v0.2.77] - 2024-08-04

 Significant improvements in text processing and performance:
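
Review note on the v0.3.73 entry above: the changelog names `normalize_url` and `is_external_url`, but their bodies are not part of the hunks shown here. A minimal sketch of what such helpers might look like, using only `urllib.parse` from the standard library, is given below. The signatures, the `SPECIAL_SCHEMES` tuple, and the "same host ignoring a leading `www.`" rule are assumptions for illustration, not the crawl4ai implementation.

```python
from urllib.parse import urljoin, urlparse

# Assumption: link schemes that are passed through instead of being resolved.
SPECIAL_SCHEMES = ("mailto:", "tel:", "javascript:", "data:")


def normalize_url(href: str, base_url: str) -> str:
    """Resolve href against base_url (hypothetical stand-in for the changelog's helper)."""
    href = href.strip()
    if href.lower().startswith(SPECIAL_SCHEMES):
        return href  # non-navigable links are returned unchanged
    # urljoin covers relative paths, anchors, and protocol-relative URLs (//host/path).
    return urljoin(base_url, href)


def _host(url: str) -> str:
    """Lower-cased host with a leading 'www.' removed (illustrative rule)."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def is_external_url(url: str, base_url: str) -> bool:
    """True when url is an http(s) link to a different host than base_url."""
    if urlparse(url).scheme not in ("http", "https"):
        return False  # mailto:, tel:, etc. are not classified as external in this sketch
    return _host(url) != _host(base_url)
```

Keeping internal and external links in separate dictionaries keyed by the normalized URL, as the Developer Notes describe, would then deduplicate links that differ only in how they were written in the page.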
--- a/README.md
+++ b/README.md
@@ -8,8 +8,14 @@

 Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. 🆓🌐

-> Looking for the synchronous version? Check out [README.sync.md](./README.sync.md). You can also access the previous version in the branch [V0.2.76](https://github.com/unclecode/crawl4ai/blob/v0.2.76).
+## New in 0.3.72 ✨

+- 📄 Fit markdown generation for extracting main article content.
+- 🪄 Magic mode for comprehensive anti-bot detection bypass.
+- 🌐 Enhanced multi-browser support with seamless switching (Chromium, Firefox, WebKit)
+- 📚 New chunking strategies (Sliding window, Overlapping window, Flexible size control)
+- 💾 Improved caching system for better performance
+- ⚡ Optimized batch processing with automatic rate limiting

 ## Try it Now!

@@ -22,23 +28,28 @@ Crawl4AI simplifies asynchronous web crawling and data extraction, making it acc
 - 🆓 Completely free and open-source
 - 🚀 Blazing fast performance, outperforming many paid services
 - 🤖 LLM-friendly output formats (JSON, cleaned HTML, markdown)
+- 🌐 Multi-browser support (Chromium, Firefox, WebKit)
 - 🌍 Supports crawling multiple URLs simultaneously
 - 🎨 Extracts and returns all media tags (Images, Audio, and Video)
 - 🔗 Extracts all external and internal links
 - 📚 Extracts metadata from the page
-- 🔄 Custom hooks for authentication, headers, and page modifications before crawling
+- 🔄 Custom hooks for authentication, headers, and page modifications
 - 🕵️ User-agent customization
-- 🖼️ Takes screenshots of the page
+- 🖼️ Takes screenshots of pages with enhanced error handling
 - 📜 Executes multiple custom JavaScripts before crawling
 - 📊 Generates structured output without LLM using JsonCssExtractionStrategy
 - 📚 Various chunking strategies: topic-based, regex, sentence, and more
 - 🧠 Advanced extraction strategies: cosine clustering, LLM, and more
 - 🎯 CSS selector support for precise data extraction
 - 📝 Passes instructions/keywords to refine extraction
-- 🔒 Proxy support for enhanced privacy and access
-- 🔄 Session management for complex multi-page crawling scenarios
-- 🌐 Asynchronous architecture for improved performance and scalability
-
+- 🔒 Proxy support with authentication for enhanced access
+- 🔄 Session management for complex multi-page crawling
+- 🌐 Asynchronous architecture for improved performance
+- 🖼️ Improved image processing with lazy-loading detection
+- 🕰️ Enhanced handling of delayed content loading
+- 🔑 Custom headers support for LLM interactions
+- 🖼️ iframe content extraction for comprehensive analysis
+- ⏱️ Flexible timeout and delayed content retrieval options

 ## Installation 🛠️

@@ -56,9 +67,21 @@ For basic web crawling and scraping tasks:
 pip install crawl4ai
 ```

-By default this will install the asynchronous version of Crawl4AI, using Playwright for web crawling.
+By default, this will install the asynchronous version of Crawl4AI, using Playwright for web crawling.

-    👉 Note: The standard version of Crawl4AI uses Playwright for asynchronous crawling. If you encounter an error saying that Playwright is not installed, you can run playwright install. However, this should be done automatically during the setup process.
+👉 Note: When you install Crawl4AI, the setup script should automatically install and set up Playwright. However, if you encounter any Playwright-related errors, you can manually install it using one of these methods:
+
+1. Through the command line:
+   ```bash
+   playwright install
+   ```
+
+2. If the above doesn't work, try this more specific command:
+   ```bash
+   python -m playwright install chromium
+   ```
+
+This second method has proven to be more reliable in some cases.

 #### Installation with Synchronous Version

@@ -113,7 +136,7 @@ async def main():
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            js_code=js_code,
-            css_selector="article.tease-card",
+            css_selector=".wide-tease-item__description",
            bypass_cache=True
        )
        print(result.extracted_content)
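
Review note on the README changes above: the "New chunking strategies" bullet and the v0.3.71 changelog entry mention `OverlappingWindowChunking` without showing how overlapping windows are formed. The sketch below illustrates the general technique on whitespace-separated words. The function name, the `window_size` and `overlap` parameters, and the word-level tokenization are assumptions for illustration and may differ from the class shipped in crawl4ai.

```python
from typing import List


def overlapping_window_chunks(text: str, window_size: int, overlap: int) -> List[str]:
    """Split text into word windows where consecutive windows share `overlap` words."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")

    words = text.split()
    step = window_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window_size]))
        if start + window_size >= len(words):
            break  # this window reached the end; a further one would only repeat the overlap
    return chunks
```

With `window_size=4` and `overlap=2`, each chunk repeats the last two words of the previous one, which is the "maintaining context between chunks" behavior the changelog describes.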
--- a/crawl4ai/__init__.py
+++ b/crawl4ai/__init__.py
@@ -3,7 +3,7 @@
 from .async_webcrawler import AsyncWebCrawler
 from .models import CrawlResult

-__version__ = "0.3.3"
+__version__ = "0.3.72"

 __all__ = [
    "AsyncWebCrawler",
--- a/crawl4ai/async_crawler_strategy
+++ b/crawl4ai/async_crawler_strategy
@@ -0,0 +1,558 @@
+import asyncio
+import base64
+import time
+from abc import ABC, abstractmethod
+from typing import Callable, Dict, Any, List, Optional, Awaitable
+import os
+from playwright.async_api import async_playwright, Page, Browser, Error
+from io import BytesIO
+from PIL import Image, ImageDraw, ImageFont
+from pathlib import Path
+from playwright.async_api import ProxySettings
+from pydantic import BaseModel
+import hashlib
+import json
+import uuid
+from playwright_stealth import stealth_async
+
+class AsyncCrawlResponse(BaseModel):
+    html: str
+    response_headers: Dict[str, str]
+    status_code: int
+    screenshot: Optional[str] = None
+    get_delayed_content: Optional[Callable[[Optional[float]], Awaitable[str]]] = None
+
+    class Config:
+        arbitrary_types_allowed = True
+
+class AsyncCrawlerStrategy(ABC):
+    @abstractmethod
+    async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
+        pass
+    
+    @abstractmethod
+    async def crawl_many(self, urls: List[str], **kwargs) -> List[AsyncCrawlResponse]:
+        pass
+    
+    @abstractmethod
+    async def take_screenshot(self, url: str) -> str:
+        pass
+    
+    @abstractmethod
+    def update_user_agent(self, user_agent: str):
+        pass
+    
+    @abstractmethod
+    def set_hook(self, hook_type: str, hook: Callable):
+        pass
+
+class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
+    def __init__(self, use_cached_html=False, js_code=None, **kwargs):
+        self.use_cached_html = use_cached_html
+        self.user_agent = kwargs.get(
+            "user_agent",
+            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
+            "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
+        )
+        self.proxy = kwargs.get("proxy")
+        self.headless = kwargs.get("headless", True)
+        self.browser_type = kwargs.get("browser_type", "chromium")
+        self.headers = kwargs.get("headers", {})
+        self.sessions = {}
+        self.session_ttl = 1800 
+        self.js_code = js_code
+        self.verbose = kwargs.get("verbose", False)
+        self.playwright = None
+        self.browser = None
+        self.hooks = {
+            'on_browser_created': None,
+            'on_user_agent_updated': None,
+            'on_execution_started': None,
+            'before_goto': None,
+            'after_goto': None,
+            'before_return_html': None,
+            'before_retrieve_html': None
+        }
+
+    async def __aenter__(self):
+        await self.start()
+        return self
+
+    async def __aexit__(self, exc_type, exc_val, exc_tb):
+        await self.close()
+
+    async def start(self):
+        if self.playwright is None:
+            self.playwright = await async_playwright().start()
+        if self.browser is None:
+            browser_args = {
+                "headless": self.headless,
+                "args": [
+                    "--disable-gpu",
+                    "--no-sandbox",
+                    "--disable-dev-shm-usage",
+                    "--disable-blink-features=AutomationControlled",
+                    "--disable-infobars",
+                    "--window-position=0,0",
+                    "--ignore-certificate-errors",
+                    "--ignore-certificate-errors-spki-list",
+                    # "--headless=new",  # Use the new headless mode
+                ]
+            }
+            
+            # Add proxy settings if a proxy is specified
+            if self.proxy:
+                proxy_settings = ProxySettings(server=self.proxy)
+                browser_args["proxy"] = proxy_settings
+                
+            # Select the appropriate browser based on the browser_type
+            if self.browser_type == "firefox":
+                self.browser = await self.playwright.firefox.launch(**browser_args)
+            elif self.browser_type == "webkit":
+                self.browser = await self.playwright.webkit.launch(**browser_args)
+            else:
+                self.browser = await self.playwright.chromium.launch(**browser_args)
+
+            await self.execute_hook('on_browser_created', self.browser)
+
+    async def close(self):
+        if self.browser:
+            await self.browser.close()
+            self.browser = None
+        if self.playwright:
+            await self.playwright.stop()
+            self.playwright = None
+
+    def __del__(self):
+        if self.browser or self.playwright:
+            asyncio.get_event_loop().run_until_complete(self.close())
+
+    def set_hook(self, hook_type: str, hook: Callable):
+        if hook_type in self.hooks:
+            self.hooks[hook_type] = hook
+        else:
+            raise ValueError(f"Invalid hook type: {hook_type}")
+
+    async def execute_hook(self, hook_type: str, *args):
+        hook = self.hooks.get(hook_type)
+        if hook:
+            if asyncio.iscoroutinefunction(hook):
+                return await hook(*args)
+            else:
+                return hook(*args)
+        return args[0] if args else None
+
+    def update_user_agent(self, user_agent: str):
+        self.user_agent = user_agent
+
+    def set_custom_headers(self, headers: Dict[str, str]):
+        self.headers = headers
+
+    async def kill_session(self, session_id: str):
+        if session_id in self.sessions:
+            context, page, _ = self.sessions[session_id]
+            await page.close()
+            await context.close()
+            del self.sessions[session_id]
+
+    def _cleanup_expired_sessions(self):
+        current_time = time.time()
+        expired_sessions = [
+            sid for sid, (_, _, last_used) in self.sessions.items() 
+            if current_time - last_used > self.session_ttl
+        ]
+        for sid in expired_sessions:
+            asyncio.create_task(self.kill_session(sid))
+            
+    async def smart_wait(self, page: Page, wait_for: str, timeout: float = 30000):
+        wait_for = wait_for.strip()
+        
+        if wait_for.startswith('js:'):
+            # Explicitly specified JavaScript
+            js_code = wait_for[3:].strip()
+            return await self.csp_compliant_wait(page, js_code, timeout)
+        elif wait_for.startswith('css:'):
+            # Explicitly specified CSS selector
+            css_selector = wait_for[4:].strip()
+            try:
+                await page.wait_for_selector(css_selector, timeout=timeout)
+            except Error as e:
+                if 'Timeout' in str(e):
+                    raise TimeoutError(f"Timeout after {timeout}ms waiting for selector '{css_selector}'")
+                else:
+                    raise ValueError(f"Invalid CSS selector: '{css_selector}'")
+        else:
+            # Auto-detect based on content
+            if wait_for.startswith('()') or wait_for.startswith('function'):
+                # It's likely a JavaScript function
+                return await self.csp_compliant_wait(page, wait_for, timeout)
+            else:
+                # Assume it's a CSS selector first
+                try:
+                    await page.wait_for_selector(wait_for, timeout=timeout)
+                except Error as e:
+                    if 'Timeout' in str(e):
+                        raise TimeoutError(f"Timeout after {timeout}ms waiting for selector '{wait_for}'")
+                    else:
+                        # If it's not a timeout error, it might be an invalid selector
+                        # Let's try to evaluate it as a JavaScript function as a fallback
+                        try:
+                            return await self.csp_compliant_wait(page, f"() => {{{wait_for}}}", timeout)
+                        except Error:
+                            raise ValueError(f"Invalid wait_for parameter: '{wait_for}'. "
+                                             "It should be either a valid CSS selector, a JavaScript function, "
+                                             "or explicitly prefixed with 'js:' or 'css:'.")
+    
+    async def csp_compliant_wait(self, page: Page, user_wait_function: str, timeout: float = 30000):
+        wrapper_js = f"""
+        async () => {{
+            const userFunction = {user_wait_function};
+            const startTime = Date.now();
+            while (true) {{
+                if (await userFunction()) {{
+                    return true;
+                }}
+                if (Date.now() - startTime > {timeout}) {{
+                    throw new Error('Timeout waiting for condition');
+                }}
+                await new Promise(resolve => setTimeout(resolve, 100));
+            }}
+        }}
+        """
+        
+        try:
+            await page.evaluate(wrapper_js)
+        except TimeoutError:
+            raise TimeoutError(f"Timeout after {timeout}ms waiting for condition")
+        except Exception as e:
+            raise RuntimeError(f"Error in wait condition: {str(e)}")
+
+    async def process_iframes(self, page):
+        # Find all iframes
+        iframes = await page.query_selector_all('iframe')
+        
+        for i, iframe in enumerate(iframes):
+            try:
+                # Add a unique identifier to the iframe
+                await iframe.evaluate(f'(element) => element.id = "iframe-{i}"')
+                
+                # Get the frame associated with this iframe
+                frame = await iframe.content_frame()
+                
+                if frame:
+                    # Wait for the frame to load
+                    await frame.wait_for_load_state('load', timeout=30000)  # 30 seconds timeout
+                    
+                    # Extract the content of the iframe's body
+                    iframe_content = await frame.evaluate('() => document.body.innerHTML')
+                    
+                    # Generate a unique class name for this iframe
+                    class_name = f'extracted-iframe-content-{i}'
+                    
+                    # Replace the iframe with a div containing the extracted content
+                    _iframe = iframe_content.replace('`', '\\`')
+                    await page.evaluate(f"""
+                        () => {{
+                            const iframe = document.getElementById('iframe-{i}');
+                            const div = document.createElement('div');
+                            div.innerHTML = `{_iframe}`;
+                            div.className = '{class_name}';
+                            iframe.replaceWith(div);
+                        }}
+                    """)
+                else:
+                    print(f"Warning: Could not access content frame for iframe {i}")
+            except Exception as e:
+                print(f"Error processing iframe {i}: {str(e)}")
+
+        # Return the page object
+        return page  
+    
+    async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
+        response_headers = {}
+        status_code = None
+        
+        self._cleanup_expired_sessions()
+        session_id = kwargs.get("session_id")
+        if session_id:
+            context, page, _ = self.sessions.get(session_id, (None, None, None))
+            if not context:
+                context = await self.browser.new_context(
+                    user_agent=self.user_agent,
+                    viewport={"width": 1920, "height": 1080},
+                    proxy={"server": self.proxy} if self.proxy else None
+                )
+                await context.set_extra_http_headers(self.headers)
+                page = await context.new_page()
+                self.sessions[session_id] = (context, page, time.time())
+        else:
+            context = await self.browser.new_context(
+                user_agent=self.user_agent,
+                viewport={"width": 1920, "height": 1080},
+                proxy={"server": self.proxy} if self.proxy else None
+            )
+            await context.set_extra_http_headers(self.headers)
+            
+            if kwargs.get("override_navigator", False):
+                # Inject scripts to override navigator properties
+                await context.add_init_script("""
+                    // Pass the Permissions Test.
+                    const originalQuery = window.navigator.permissions.query;
+                    window.navigator.permissions.query = (parameters) => (
+                        parameters.name === 'notifications' ?
+                            Promise.resolve({ state: Notification.permission }) :
+                            originalQuery(parameters)
+                    );
+                    Object.defineProperty(navigator, 'webdriver', {
+                        get: () => undefined
+                    });
+                    window.navigator.chrome = {
+                        runtime: {},
+                        // Add other properties if necessary
+                    };
+                    Object.defineProperty(navigator, 'plugins', {
+                        get: () => [1, 2, 3, 4, 5],
+                    });
+                    Object.defineProperty(navigator, 'languages', {
+                        get: () => ['en-US', 'en'],
+                    });
+                    Object.defineProperty(document, 'hidden', {
+                        get: () => false
+                    });
+                    Object.defineProperty(document, 'visibilityState', {
+                        get: () => 'visible'
+                    });
+                """)
+            
+            page = await context.new_page()
+
+        try:
+            if self.verbose:
+                print(f"[LOG] 🕸️ Crawling {url} using AsyncPlaywrightCrawlerStrategy...")
+
+            if self.use_cached_html:
+                cache_file_path = os.path.join(
+                    Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest()
+                )
+                if os.path.exists(cache_file_path):
+                    html = ""
+                    with open(cache_file_path, "r") as f:
+                        html = f.read()
+                    # retrieve response headers and status code from cache
+                    with open(cache_file_path + ".meta", "r") as f:
+                        meta = json.load(f)
+                        response_headers = meta.get("response_headers", {})
+                        status_code = meta.get("status_code")
+                    response = AsyncCrawlResponse(
+                        html=html, response_headers=response_headers, status_code=status_code
+                    )
+                    return response
+
+            if not kwargs.get("js_only", False):
+                await self.execute_hook('before_goto', page)
+                
+                response = await page.goto("about:blank")
+                await stealth_async(page)
+                response = await page.goto(
+                    url, wait_until="domcontentloaded", timeout=kwargs.get("page_timeout", 60000)
+                )
+                
+                # await stealth_async(page)
+                # response = await page.goto("about:blank")
+                # await stealth_async(page)
+                # await page.evaluate(f"window.location.href = '{url}'")
+                
+                await self.execute_hook('after_goto', page)
+                
+                # Get status code and headers
+                status_code = response.status
+                response_headers = response.headers
+            else:
+                status_code = 200
+                response_headers = {}
+
+            await page.wait_for_selector('body')
+            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
+
+            js_code = kwargs.get("js_code", kwargs.get("js", self.js_code))
+            if js_code:
+                if isinstance(js_code, str):
+                    await page.evaluate(js_code)
+                elif isinstance(js_code, list):
+                    for js in js_code:
+                        await page.evaluate(js)
+                
+                await page.wait_for_load_state('networkidle')
+                # Run the on_execution_started hook once the injected JS has settled
+                await self.execute_hook('on_execution_started', page)
+                
+            if kwargs.get("simulate_user", False):
+                # Simulate user interactions
+                await page.mouse.move(100, 100)
+                await page.mouse.down()
+                await page.mouse.up()
+                await page.keyboard.press('ArrowDown')
+
+            # Handle the wait_for parameter
+            wait_for = kwargs.get("wait_for")
+            if wait_for:
+                try:
+                    await self.smart_wait(page, wait_for, timeout=kwargs.get("page_timeout", 60000))
+                except Exception as e:
+                    raise RuntimeError(f"Wait condition failed: {str(e)}")
+
+
+            
+            # Update image dimensions
+            update_image_dimensions_js = """
+            () => {
+                return new Promise((resolve) => {
+                    const filterImage = (img) => {
+                        // Filter out images that are too small
+                        if (img.width < 100 && img.height < 100) return false;
+                        
+                        // Filter out images that are not visible
+                        const rect = img.getBoundingClientRect();
+                        if (rect.width === 0 || rect.height === 0) return false;
+                        
+                        // Filter out images with certain class names (e.g., icons, thumbnails)
+                        if (img.classList.contains('icon') || img.classList.contains('thumbnail')) return false;
+                        
+                        // Filter out images with certain patterns in their src (e.g., placeholder images)
+                        if (img.src.includes('placeholder') || img.src.includes('icon')) return false;
+                        
+                        return true;
+                    };
+
+                    const images = Array.from(document.querySelectorAll('img')).filter(filterImage);
+                    let imagesLeft = images.length;
+                    
+                    if (imagesLeft === 0) {
+                        resolve();
+                        return;
+                    }
+
+                    const checkImage = (img) => {
+                        if (img.complete && img.naturalWidth !== 0) {
+                            img.setAttribute('width', img.naturalWidth);
+                            img.setAttribute('height', img.naturalHeight);
+                            imagesLeft--;
+                            if (imagesLeft === 0) resolve();
+                        }
+                    };
+
+                    images.forEach(img => {
+                        checkImage(img);
+                        if (!img.complete) {
+                            img.onload = () => {
+                                checkImage(img);
+                            };
+                            img.onerror = () => {
+                                imagesLeft--;
+                                if (imagesLeft === 0) resolve();
+                            };
+                        }
+                    });
+
+                    // Fallback timeout of 5 seconds
+                    setTimeout(() => resolve(), 5000);
+                });
+            }
+            """
+            await page.evaluate(update_image_dimensions_js)
+
+            # Wait a bit for any onload events to complete
+            await page.wait_for_timeout(100)
+
+            # Process iframes
+            if kwargs.get("process_iframes", False):
+                page = await self.process_iframes(page)
+            
+            await self.execute_hook('before_retrieve_html', page)
+            # If delay_before_return_html is set, wait that many seconds before reading the HTML
+            delay_before_return_html = kwargs.get("delay_before_return_html")
+            if delay_before_return_html:
+                await asyncio.sleep(delay_before_return_html)
+                
+            html = await page.content()
+            await self.execute_hook('before_return_html', page, html)
+            
+            # Take a screenshot if screenshot=True was passed
+            screenshot_data = None
+            if kwargs.get("screenshot"):
+                screenshot_data = await self.take_screenshot(url)            
+
+            if self.verbose:
+                print(f"[LOG] ✅ Crawled {url} successfully!")
+
+            if self.use_cached_html:
+                cache_file_path = os.path.join(
+                    Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest()
+                )
+                with open(cache_file_path, "w", encoding="utf-8") as f:
+                    f.write(html)
+                # store response headers and status code in cache
+                with open(cache_file_path + ".meta", "w", encoding="utf-8") as f:
+                    json.dump({
+                        "response_headers": response_headers,
+                        "status_code": status_code
+                    }, f)
+
+            async def get_delayed_content(delay: float = 5.0) -> str:
+                if self.verbose:
+                    print(f"[LOG] Waiting for {delay} seconds before retrieving content for {url}")
+                await asyncio.sleep(delay)
+                return await page.content()
+                
+            response = AsyncCrawlResponse(
+                html=html, 
+                response_headers=response_headers, 
+                status_code=status_code,
+                screenshot=screenshot_data,
+                get_delayed_content=get_delayed_content
+            )
+            return response
+        except Error as e:
+            raise Error(f"Failed to crawl {url}: {str(e)}")
+        finally:
+            if not session_id:
+                await page.close()
+                await context.close()
+
+    async def crawl_many(self, urls: List[str], **kwargs) -> List[AsyncCrawlResponse]:
+        semaphore_count = kwargs.get('semaphore_count', 5)  # Adjust as needed
+        semaphore = asyncio.Semaphore(semaphore_count)
+
+        async def crawl_with_semaphore(url):
+            async with semaphore:
+                return await self.crawl(url, **kwargs)
+
+        tasks = [crawl_with_semaphore(url) for url in urls]
+        results = await asyncio.gather(*tasks, return_exceptions=True)
+        return [result if not isinstance(result, Exception) else str(result) for result in results]
+
+    async def take_screenshot(self, url: str, wait_time=1000) -> str:
+        async with await self.browser.new_context(user_agent=self.user_agent) as context:
+            page = await context.new_page()
+            try:
+                await page.goto(url, wait_until="domcontentloaded", timeout=30000)
+                # Wait for a specified time (default is 1 second)
+                await page.wait_for_timeout(wait_time)
+                screenshot = await page.screenshot(full_page=True)
+                return base64.b64encode(screenshot).decode('utf-8')
+            except Exception as e:
+                error_message = f"Failed to take screenshot: {str(e)}"
+                print(error_message)
+
+                # Generate an error image
+                img = Image.new('RGB', (800, 600), color='black')
+                draw = ImageDraw.Draw(img)
+                font = ImageFont.load_default()
+                draw.text((10, 10), error_message, fill=(255, 255, 255), font=font)
+                
+                buffered = BytesIO()
+                img.save(buffered, format="JPEG")
+                return base64.b64encode(buffered.getvalue()).decode('utf-8')
+            finally:
+                await page.close()
+
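The `crawl_many` method above bounds concurrency with an `asyncio.Semaphore` and gathers results with `return_exceptions=True`. A minimal standalone sketch of that pattern follows. The names `gather_bounded`, `fake_crawl`, and `demo` are hypothetical and not part of crawl4ai, and unlike `crawl_many` this sketch returns the exception objects themselves rather than their string form.

```python
import asyncio
from typing import Awaitable, Callable, List, Union


async def gather_bounded(
    items: List[str],
    worker: Callable[[str], Awaitable[str]],
    limit: int = 5,
) -> List[Union[str, BaseException]]:
    """Run worker(item) for every item, with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item: str) -> str:
        # Each task waits here until one of the `limit` slots is free.
        async with semaphore:
            return await worker(item)

    # return_exceptions=True keeps one failure from aborting the others;
    # a failed item comes back as an exception object in its original position.
    return await asyncio.gather(
        *(run_one(item) for item in items), return_exceptions=True
    )


async def fake_crawl(url: str) -> str:
    # Hypothetical stand-in for a real crawl call; fails on purpose for one URL.
    await asyncio.sleep(0.01)
    if "bad" in url:
        raise ValueError(f"cannot crawl {url}")
    return f"<html>{url}</html>"


def demo() -> List[Union[str, BaseException]]:
    urls = ["https://a.example", "https://bad.example", "https://c.example"]
    return asyncio.run(gather_bounded(urls, fake_crawl, limit=2))
```

Keeping the exception objects lets a caller separate failures with `isinstance(result, Exception)`, which is not possible once errors have been converted to strings.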
--- a/crawl4ai/async_crawler_strategy.py
+++ b/crawl4ai/async_crawler_strategy.py
@@ -1,30 +1,45 @@
 import asyncio
-import base64, time
+import base64
+import time
 from abc import ABC, abstractmethod
-from typing import Callable, Dict, Any, List, Optional
+from typing import Callable, Dict, Any, List, Optional, Awaitable
 import os
-import psutil
 from playwright.async_api import async_playwright, Page, Browser, Error
 from io import BytesIO
 from PIL import Image, ImageDraw, ImageFont
-from .utils import sanitize_input_encode
-import json, uuid
-import hashlib
 from pathlib import Path
 from playwright.async_api import ProxySettings
 from pydantic import BaseModel
+import hashlib
+import json
+import uuid
+from playwright_stealth import StealthConfig, stealth_async
+
+stealth_config = StealthConfig(
+    webdriver=True,
+    chrome_app=True,
+    chrome_csi=True,
+    chrome_load_times=True,
+    chrome_runtime=True,
+    navigator_languages=True,
+    navigator_plugins=True,
+    navigator_permissions=True,
+    webgl_vendor=True,
+    outerdimensions=True,
+    navigator_hardware_concurrency=True,
+    media_codecs=True,
+)

-def calculate_semaphore_count():
-    cpu_count = os.cpu_count()
-    memory_gb = psutil.virtual_memory().total / (1024 ** 3)  # Convert to GB
-    base_count = max(1, cpu_count // 2)
-    memory_based_cap = int(memory_gb / 2)  # Assume 2GB per instance
-    return min(base_count, memory_based_cap)

 class AsyncCrawlResponse(BaseModel):
    html: str
    response_headers: Dict[str, str]
    status_code: int
+    screenshot: Optional[str] = None
+    get_delayed_content: Optional[Callable[[Optional[float]], Awaitable[str]]] = None
+
+    class Config:
+        arbitrary_types_allowed = True

 class AsyncCrawlerStrategy(ABC):
    @abstractmethod
@@ -36,7 +51,7 @@ class AsyncCrawlerStrategy(ABC):
        pass
    
    @abstractmethod
-    async def take_screenshot(self, url: str) -> str:
+    async def take_screenshot(self, **kwargs) -> str:
        pass
    
    @abstractmethod
@@ -50,23 +65,31 @@ class AsyncCrawlerStrategy(ABC):
 class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
    def __init__(self, use_cached_html=False, js_code=None, **kwargs):
        self.use_cached_html = use_cached_html
-        self.user_agent = kwargs.get("user_agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
+        self.user_agent = kwargs.get(
+            "user_agent",
+            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
+            "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
+        )
        self.proxy = kwargs.get("proxy")
+        self.proxy_config = kwargs.get("proxy_config")
        self.headless = kwargs.get("headless", True)
-        self.headers = {}
+        self.browser_type = kwargs.get("browser_type", "chromium")
+        self.headers = kwargs.get("headers", {})
        self.sessions = {}
        self.session_ttl = 1800 
        self.js_code = js_code
        self.verbose = kwargs.get("verbose", False)
        self.playwright = None
        self.browser = None
+        self.sleep_on_close = kwargs.get("sleep_on_close", False)
        self.hooks = {
            'on_browser_created': None,
            'on_user_agent_updated': None,
            'on_execution_started': None,
            'before_goto': None,
            'after_goto': None,
-            'before_return_html': None
+            'before_return_html': None,
+            'before_retrieve_html': None
        }

    async def __aenter__(self):
@@ -82,12 +105,16 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
        if self.browser is None:
            browser_args = {
                "headless": self.headless,
-                # "headless": False,
                "args": [
                    "--disable-gpu",
-                    "--disable-dev-shm-usage",
-                    "--disable-setuid-sandbox",
                    "--no-sandbox",
+                    "--disable-dev-shm-usage",
+                    "--disable-blink-features=AutomationControlled",
+                    "--disable-infobars",
+                    "--window-position=0,0",
+                    "--ignore-certificate-errors",
+                    "--ignore-certificate-errors-spki-list",
+                    # "--headless=new",  # Use the new headless mode
                ]
            }
            
@@ -95,12 +122,23 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
            if self.proxy:
                proxy_settings = ProxySettings(server=self.proxy)
                browser_args["proxy"] = proxy_settings
+            elif self.proxy_config:
+                proxy_settings = ProxySettings(server=self.proxy_config.get("server"), username=self.proxy_config.get("username"), password=self.proxy_config.get("password"))
+                browser_args["proxy"] = proxy_settings
                
-                
-            self.browser = await self.playwright.chromium.launch(**browser_args)
+            # Select the appropriate browser based on the browser_type
+            if self.browser_type == "firefox":
+                self.browser = await self.playwright.firefox.launch(**browser_args)
+            elif self.browser_type == "webkit":
+                self.browser = await self.playwright.webkit.launch(**browser_args)
+            else:
+                self.browser = await self.playwright.chromium.launch(**browser_args)
+
            await self.execute_hook('on_browser_created', self.browser)

    async def close(self):
+        if self.sleep_on_close:
+            await asyncio.sleep(0.5)
        if self.browser:
            await self.browser.close()
            self.browser = None
@@ -142,12 +180,52 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):

    def _cleanup_expired_sessions(self):
        current_time = time.time()
-        expired_sessions = [sid for sid, (_, _, last_used) in self.sessions.items() 
-                            if current_time - last_used > self.session_ttl]
+        expired_sessions = [
+            sid for sid, (_, _, last_used) in self.sessions.items() 
+            if current_time - last_used > self.session_ttl
+        ]
        for sid in expired_sessions:
            asyncio.create_task(self.kill_session(sid))
            
-            
+    async def smart_wait(self, page: Page, wait_for: str, timeout: float = 30000):
+        wait_for = wait_for.strip()
+        
+        if wait_for.startswith('js:'):
+            # Explicitly specified JavaScript
+            js_code = wait_for[3:].strip()
+            return await self.csp_compliant_wait(page, js_code, timeout)
+        elif wait_for.startswith('css:'):
+            # Explicitly specified CSS selector
+            css_selector = wait_for[4:].strip()
+            try:
+                await page.wait_for_selector(css_selector, timeout=timeout)
+            except Error as e:
+                if 'Timeout' in str(e):
+                    raise TimeoutError(f"Timeout after {timeout}ms waiting for selector '{css_selector}'")
+                else:
+                    raise ValueError(f"Invalid CSS selector: '{css_selector}'")
+        else:
+            # Auto-detect based on content
+            if wait_for.startswith('()') or wait_for.startswith('function'):
+                # It's likely a JavaScript function
+                return await self.csp_compliant_wait(page, wait_for, timeout)
+            else:
+                # Assume it's a CSS selector first
+                try:
+                    await page.wait_for_selector(wait_for, timeout=timeout)
+                except Error as e:
+                    if 'Timeout' in str(e):
+                        raise TimeoutError(f"Timeout after {timeout}ms waiting for selector '{wait_for}'")
+                    else:
+                        # If it's not a timeout error, it might be an invalid selector
+                        # Let's try to evaluate it as a JavaScript function as a fallback
+                        try:
+                            return await self.csp_compliant_wait(page, f"() => {{{wait_for}}}", timeout)
+                        except Error:
+                            raise ValueError(f"Invalid wait_for parameter: '{wait_for}'. "
+                                             "It should be either a valid CSS selector, a JavaScript function, "
+                                             "or explicitly prefixed with 'js:' or 'css:'.")
+    
    async def csp_compliant_wait(self, page: Page, user_wait_function: str, timeout: float = 30000):
        wrapper_js = f"""
        async () => {{
@@ -172,6 +250,47 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
        except Exception as e:
            raise RuntimeError(f"Error in wait condition: {str(e)}")

+    async def process_iframes(self, page):
+        # Find all iframes
+        iframes = await page.query_selector_all('iframe')
+        
+        for i, iframe in enumerate(iframes):
+            try:
+                # Add a unique identifier to the iframe
+                await iframe.evaluate(f'(element) => element.id = "iframe-{i}"')
+                
+                # Get the frame associated with this iframe
+                frame = await iframe.content_frame()
+                
+                if frame:
+                    # Wait for the frame to load
+                    await frame.wait_for_load_state('load', timeout=30000)  # 30 seconds timeout
+                    
+                    # Extract the content of the iframe's body
+                    iframe_content = await frame.evaluate('() => document.body.innerHTML')
+                    
+                    # Generate a unique class name for this iframe
+                    class_name = f'extracted-iframe-content-{i}'
+                    
+                    # Replace the iframe with a div containing the extracted content
+                    _iframe = iframe_content.replace('\\', '\\\\').replace('`', '\\`').replace('${', '\\${')
+                    await page.evaluate(f"""
+                        () => {{
+                            const iframe = document.getElementById('iframe-{i}');
+                            const div = document.createElement('div');
+                            div.innerHTML = `{_iframe}`;
+                            div.className = '{class_name}';
+                            iframe.replaceWith(div);
+                        }}
+                    """)
+                else:
+                    print(f"Warning: Could not access content frame for iframe {i}")
+            except Exception as e:
+                print(f"Error processing iframe {i}: {str(e)}")
+
+        # Return the page object
+        return page  
+    
    async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
        response_headers = {}
        status_code = None
@@ -183,25 +302,70 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
            if not context:
                context = await self.browser.new_context(
                    user_agent=self.user_agent,
-                    proxy={"server": self.proxy} if self.proxy else None
+                    viewport={"width": 1920, "height": 1080},
+                    proxy={"server": self.proxy} if self.proxy else None,
+                    accept_downloads=True,
+                    java_script_enabled=True
                )
+                await context.add_cookies([{"name": "cookiesEnabled", "value": "true", "url": url}])
                await context.set_extra_http_headers(self.headers)
                page = await context.new_page()
                self.sessions[session_id] = (context, page, time.time())
        else:
            context = await self.browser.new_context(
-                    user_agent=self.user_agent,
-                    proxy={"server": self.proxy} if self.proxy else None
+                user_agent=self.user_agent,
+                viewport={"width": 1920, "height": 1080},
+                proxy={"server": self.proxy} if self.proxy else None
            )
            await context.set_extra_http_headers(self.headers)
+            
+            if kwargs.get("override_navigator", False) or kwargs.get("simulate_user", False) or kwargs.get("magic", False):
+                # Inject scripts to override navigator properties
+                await context.add_init_script("""
+                    // Pass the Permissions Test.
+                    const originalQuery = window.navigator.permissions.query;
+                    window.navigator.permissions.query = (parameters) => (
+                        parameters.name === 'notifications' ?
+                            Promise.resolve({ state: Notification.permission }) :
+                            originalQuery(parameters)
+                    );
+                    Object.defineProperty(navigator, 'webdriver', {
+                        get: () => undefined
+                    });
+                    window.navigator.chrome = {
+                        runtime: {},
+                        // Add other properties if necessary
+                    };
+                    Object.defineProperty(navigator, 'plugins', {
+                        get: () => [1, 2, 3, 4, 5],
+                    });
+                    Object.defineProperty(navigator, 'languages', {
+                        get: () => ['en-US', 'en'],
+                    });
+                    Object.defineProperty(document, 'hidden', {
+                        get: () => false
+                    });
+                    Object.defineProperty(document, 'visibilityState', {
+                        get: () => 'visible'
+                    });
+                """)
+            
            page = await context.new_page()
+            # await stealth_async(page) #, stealth_config)

+        # Add console message and error logging
+        if kwargs.get("log_console", False):
+            page.on("console", lambda msg: print(f"Console: {msg.text}"))
+            page.on("pageerror", lambda exc: print(f"Page Error: {exc}"))
+        
        try:
            if self.verbose:
                print(f"[LOG] 🕸️ Crawling {url} using AsyncPlaywrightCrawlerStrategy...")

            if self.use_cached_html:
-                cache_file_path = os.path.join(Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest())
+                cache_file_path = os.path.join(
+                    Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest()
+                )
                if os.path.exists(cache_file_path):
                    html = ""
                    with open(cache_file_path, "r") as f:
@@ -211,12 +375,21 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                        meta = json.load(f)
                        response_headers = meta.get("response_headers", {})
                        status_code = meta.get("status_code")
-                    response = AsyncCrawlResponse(html=html, response_headers=response_headers, status_code=status_code)
+                    response = AsyncCrawlResponse(
+                        html=html, response_headers=response_headers, status_code=status_code
+                    )
                    return response

            if not kwargs.get("js_only", False):
                await self.execute_hook('before_goto', page)
-                response = await page.goto(url, wait_until="domcontentloaded", timeout=60000)
+                
+                response = await page.goto(
+                    url, wait_until="domcontentloaded", timeout=kwargs.get("page_timeout", 60000)
+                )
+                
+                # response = await page.goto("about:blank")
+                # await page.evaluate(f"window.location.href = '{url}'")
+                
                await self.execute_hook('after_goto', page)
                
                # Get status code and headers
@@ -227,60 +400,131 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                response_headers = {}

            await page.wait_for_selector('body')
+            
            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")

            js_code = kwargs.get("js_code", kwargs.get("js", self.js_code))
            if js_code:
                if isinstance(js_code, str):
-                    r = await page.evaluate(js_code)
+                    await page.evaluate(js_code)
                elif isinstance(js_code, list):
                    for js in js_code:
                        await page.evaluate(js)
                
-                # await page.wait_for_timeout(100)
                await page.wait_for_load_state('networkidle')
-                # Check for on execution even
+                # Run the on_execution_started hook once the injected JS has settled
                await self.execute_hook('on_execution_started', page)
                
-            # New code to handle the wait_for parameter
-            # Example usage:
-            # await crawler.crawl(
-            #     url,
-            #     js_code="// some JavaScript code",
-            #     wait_for="""() => {
-            #         return document.querySelector('#my-element') !== null;
-            #     }"""
-            # )
-            # Example of using a CSS selector:
-            # await crawler.crawl(
-            #     url,
-            #     wait_for="#my-element"
-            # )
+            if kwargs.get("simulate_user", False) or kwargs.get("magic", False):
+                # Simulate user interactions
+                await page.mouse.move(100, 100)
+                await page.mouse.down()
+                await page.mouse.up()
+                await page.keyboard.press('ArrowDown')
+
+            # Handle the wait_for parameter
            wait_for = kwargs.get("wait_for")
            if wait_for:
                try:
-                    await self.csp_compliant_wait(page, wait_for, timeout=kwargs.get("timeout", 30000))
+                    await self.smart_wait(page, wait_for, timeout=kwargs.get("page_timeout", 60000))
                except Exception as e:
-                    raise RuntimeError(f"Custom wait condition failed: {str(e)}")                
-                # try:
-                #     await page.wait_for_function(wait_for)
-                #     # if callable(wait_for):
-                #     #     await page.wait_for_function(wait_for)
-                #     # elif isinstance(wait_for, str):
-                #     #     await page.wait_for_selector(wait_for)
-                #     # else:
-                #     #     raise ValueError("wait_for must be either a callable or a CSS selector string")
-                # except Error as e:
-                #     raise Error(f"Custom wait condition failed: {str(e)}")
+                    raise RuntimeError(f"Wait condition failed: {str(e)}")

+            # Update image dimensions
+            update_image_dimensions_js = """
+            () => {
+                return new Promise((resolve) => {
+                    const filterImage = (img) => {
+                        // Filter out images that are too small
+                        if (img.width < 100 && img.height < 100) return false;
+                        
+                        // Filter out images that are not visible
+                        const rect = img.getBoundingClientRect();
+                        if (rect.width === 0 || rect.height === 0) return false;
+                        
+                        // Filter out images with certain class names (e.g., icons, thumbnails)
+                        if (img.classList.contains('icon') || img.classList.contains('thumbnail')) return false;
+                        
+                        // Filter out images with certain patterns in their src (e.g., placeholder images)
+                        if (img.src.includes('placeholder') || img.src.includes('icon')) return false;
+                        
+                        return true;
+                    };
+
+                    const images = Array.from(document.querySelectorAll('img')).filter(filterImage);
+                    let imagesLeft = images.length;
+                    
+                    if (imagesLeft === 0) {
+                        resolve();
+                        return;
+                    }
+
+                    const checkImage = (img) => {
+                        if (img.complete && img.naturalWidth !== 0) {
+                            img.setAttribute('width', img.naturalWidth);
+                            img.setAttribute('height', img.naturalHeight);
+                            imagesLeft--;
+                            if (imagesLeft === 0) resolve();
+                        }
+                    };
+
+                    images.forEach(img => {
+                        checkImage(img);
+                        if (!img.complete) {
+                            img.onload = () => {
+                                checkImage(img);
+                            };
+                            img.onerror = () => {
+                                imagesLeft--;
+                                if (imagesLeft === 0) resolve();
+                            };
+                        }
+                    });
+
+                    // Do not block on pending images: resolve now and let the onload handlers above set dimensions later.
+                    // Disabled 5-second fallback: setTimeout(() => resolve(), 5000);
+                    resolve();
+                });
+            }
+            """
+            await page.evaluate(update_image_dimensions_js)
+
+            # Wait a bit for any onload events to complete
+            await page.wait_for_timeout(100)
+
+            # Process iframes
+            if kwargs.get("process_iframes", False):
+                page = await self.process_iframes(page)
+            
+            await self.execute_hook('before_retrieve_html', page)
+            # If delay_before_return_html is set, wait that many seconds before reading the HTML
+            delay_before_return_html = kwargs.get("delay_before_return_html")
+            if delay_before_return_html:
+                await asyncio.sleep(delay_before_return_html)
+                
+            # Check for remove_overlay_elements parameter
+            if kwargs.get("remove_overlay_elements", False):
+                await self.remove_overlay_elements(page)
+            
            html = await page.content()
-            page = await self.execute_hook('before_return_html', page, html)
+            await self.execute_hook('before_return_html', page, html)
+            
+            # Take a screenshot of the already-loaded page when screenshot=True is passed
+            screenshot_data = None
+            if kwargs.get("screenshot"):
+                # If screenshot_wait_for is set, wait that many seconds before capturing
+                screenshot_wait_for = kwargs.get("screenshot_wait_for")
+                if screenshot_wait_for:
+                    await asyncio.sleep(screenshot_wait_for)
+                screenshot_data = await self.take_screenshot(page)

            if self.verbose:
                print(f"[LOG] ✅ Crawled {url} successfully!")

            if self.use_cached_html:
-                cache_file_path = os.path.join(Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest())
+                cache_file_path = os.path.join(
+                    Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest()
+                )
                with open(cache_file_path, "w", encoding="utf-8") as f:
                    f.write(html)
                # store response headers and status code in cache
@@ -290,67 +534,29 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                        "status_code": status_code
                    }, f)

-            response = AsyncCrawlResponse(html=html, response_headers=response_headers, status_code=status_code)
+            async def get_delayed_content(delay: float = 5.0) -> str:
+                if self.verbose:
+                    print(f"[LOG] Waiting for {delay} seconds before retrieving content for {url}")
+                await asyncio.sleep(delay)
+                return await page.content()
+                
+            response = AsyncCrawlResponse(
+                html=html, 
+                response_headers=response_headers, 
+                status_code=status_code,
+                screenshot=screenshot_data,
+                get_delayed_content=get_delayed_content
+            )
            return response
        except Error as e:
-            raise Error(f"Failed to crawl {url}: {str(e)}")
-        finally:
-            if not session_id:
-                await page.close()
+            raise Error(f"[ERROR] 🚫 crawl(): Failed to crawl {url}: {str(e)}")
+        # finally:
+        #     if not session_id:
+        #         await page.close()
+        #         await context.close()

-        # try:
-        #     html = await _crawl()
-        #     return sanitize_input_encode(html)
-        # except Error as e:
-        #     raise Error(f"Failed to crawl {url}: {str(e)}")
-        # except Exception as e:
-        #     raise Exception(f"Failed to crawl {url}: {str(e)}")
-
-    async def execute_js(self, session_id: str, js_code: str, wait_for_js: str = None, wait_for_css: str = None) -> AsyncCrawlResponse:
-        """
-        Execute JavaScript code in a specific session and optionally wait for a condition.
-        
-        :param session_id: The ID of the session to execute the JS code in.
-        :param js_code: The JavaScript code to execute.
-        :param wait_for_js: JavaScript condition to wait for after execution.
-        :param wait_for_css: CSS selector to wait for after execution.
-        :return: AsyncCrawlResponse containing the page's HTML and other information.
-        :raises ValueError: If the session does not exist.
-        """
-        if not session_id:
-            raise ValueError("Session ID must be provided")
-        
-        if session_id not in self.sessions:
-            raise ValueError(f"No active session found for session ID: {session_id}")
-        
-        context, page, last_used = self.sessions[session_id]
-        
-        try:
-            await page.evaluate(js_code)
-            
-            if wait_for_js:
-                await page.wait_for_function(wait_for_js)
-            
-            if wait_for_css:
-                await page.wait_for_selector(wait_for_css)
-            
-            # Get the updated HTML content
-            html = await page.content()
-            
-            # Get response headers and status code (assuming these are available)
-            response_headers = await page.evaluate("() => JSON.stringify(performance.getEntriesByType('resource')[0].responseHeaders)")
-            status_code = await page.evaluate("() => performance.getEntriesByType('resource')[0].responseStatus")
-            
-            # Update the last used time for this session
-            self.sessions[session_id] = (context, page, time.time())
-            
-            return AsyncCrawlResponse(html=html, response_headers=response_headers, status_code=status_code)
-        except Error as e:
-            raise Error(f"Failed to execute JavaScript or wait for condition in session {session_id}: {str(e)}")
-    
-    
    async def crawl_many(self, urls: List[str], **kwargs) -> List[AsyncCrawlResponse]:
-        semaphore_count = kwargs.get('semaphore_count', calculate_semaphore_count())
+        semaphore_count = kwargs.get('semaphore_count', 5)  # Adjust as needed
        semaphore = asyncio.Semaphore(semaphore_count)

        async def crawl_with_semaphore(url):
@@ -361,25 +567,156 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return [result if not isinstance(result, Exception) else str(result) for result in results]

-    async def take_screenshot(self, url: str) -> str:
-        async with await self.browser.new_context(user_agent=self.user_agent) as context:
-            page = await context.new_page()
-            try:
-                await page.goto(url, wait_until="domcontentloaded")
-                screenshot = await page.screenshot(full_page=True)
-                return base64.b64encode(screenshot).decode('utf-8')
-            except Exception as e:
-                error_message = f"Failed to take screenshot: {str(e)}"
-                print(error_message)
+    async def remove_overlay_elements(self, page: Page) -> None:
+        """
+        Removes popup overlays, modals, cookie notices, and other intrusive elements from the page.
+        
+        Args:
+            page (Page): The Playwright page instance
+        """
+        remove_overlays_js = """
+        async () => {
+            // Function to check if element is visible
+            const isVisible = (elem) => {
+                const style = window.getComputedStyle(elem);
+                return style.display !== 'none' && 
+                       style.visibility !== 'hidden' && 
+                       style.opacity !== '0';
+            };

-                # Generate an error image
-                img = Image.new('RGB', (800, 600), color='black')
-                draw = ImageDraw.Draw(img)
-                font = ImageFont.load_default()
-                draw.text((10, 10), error_message, fill=(255, 255, 255), font=font)
+            // Common selectors for popups and overlays
+            const commonSelectors = [
+                // Close buttons first
+                'button[class*="close" i]', 'button[class*="dismiss" i]', 
+                'button[aria-label*="close" i]', 'button[title*="close" i]',
+                'a[class*="close" i]', 'span[class*="close" i]',
                
-                buffered = BytesIO()
-                img.save(buffered, format="JPEG")
-                return base64.b64encode(buffered.getvalue()).decode('utf-8')
-            finally:
-                await page.close()
+                // Cookie notices
+                '[class*="cookie-banner" i]', '[id*="cookie-banner" i]',
+                '[class*="cookie-consent" i]', '[id*="cookie-consent" i]',
+                
+                // Newsletter/subscription dialogs
+                '[class*="newsletter" i]', '[class*="subscribe" i]',
+                
+                // Generic popups/modals
+                '[class*="popup" i]', '[class*="modal" i]', 
+                '[class*="overlay" i]', '[class*="dialog" i]',
+                '[role="dialog"]', '[role="alertdialog"]'
+            ];
+
+            // Try to click close buttons first
+            for (const selector of commonSelectors.slice(0, 6)) {
+                const closeButtons = document.querySelectorAll(selector);
+                for (const button of closeButtons) {
+                    if (isVisible(button)) {
+                        try {
+                            button.click();
+                            await new Promise(resolve => setTimeout(resolve, 100));
+                        } catch (e) {
+                            console.log('Error clicking button:', e);
+                        }
+                    }
+                }
+            }
+
+            // Remove remaining overlay elements
+            const removeOverlays = () => {
+                // Find elements with high z-index
+                const allElements = document.querySelectorAll('*');
+                for (const elem of allElements) {
+                    const style = window.getComputedStyle(elem);
+                    const zIndex = parseInt(style.zIndex);
+                    const position = style.position;
+                    
+                    if (
+                        isVisible(elem) && 
+                        (zIndex > 999 || position === 'fixed' || position === 'absolute') &&
+                        (
+                            elem.offsetWidth > window.innerWidth * 0.5 ||
+                            elem.offsetHeight > window.innerHeight * 0.5 ||
+                            style.backgroundColor.includes('rgba') ||
+                            parseFloat(style.opacity) < 1
+                        )
+                    ) {
+                        elem.remove();
+                    }
+                }
+
+                // Remove elements matching common selectors
+                for (const selector of commonSelectors) {
+                    const elements = document.querySelectorAll(selector);
+                    elements.forEach(elem => {
+                        if (isVisible(elem)) {
+                            elem.remove();
+                        }
+                    });
+                }
+            };
+
+            // Remove overlay elements
+            removeOverlays();
+
+            // Remove any fixed/sticky position elements at the top/bottom
+            const removeFixedElements = () => {
+                const elements = document.querySelectorAll('*');
+                elements.forEach(elem => {
+                    const style = window.getComputedStyle(elem);
+                    if (
+                        (style.position === 'fixed' || style.position === 'sticky') &&
+                        isVisible(elem)
+                    ) {
+                        elem.remove();
+                    }
+                });
+            };
+
+            removeFixedElements();
+            
+            // Remove empty block elements such as div, p, span, etc. (defined here but not called in this script)
+            const removeEmptyBlockElements = () => {
+                const blockElements = document.querySelectorAll('div, p, span, section, article, header, footer, aside, nav, main, ul, ol, li, dl, dt, dd, h1, h2, h3, h4, h5, h6');
+                blockElements.forEach(elem => {
+                    if (elem.innerText.trim() === '') {
+                        elem.remove();
+                    }
+                });
+            };
+
+            // Remove margin-right and padding-right from body (often added by modal scripts)
+            document.body.style.marginRight = '0px';
+            document.body.style.paddingRight = '0px';
+            document.body.style.overflow = 'auto';
+
+            // Wait a bit for any animations to complete
+            await new Promise(resolve => setTimeout(resolve, 100));
+        }
+        """
+        
+        try:
+            await page.evaluate(remove_overlays_js)
+            await page.wait_for_timeout(500)  # Wait for any animations to complete
+        except Exception as e:
+            if self.verbose:
+                print(f"Warning: Failed to remove overlay elements: {str(e)}")
+
+    async def take_screenshot(self, page: Page) -> str:
+        try:
+            # The page is already loaded, just take the screenshot
+            screenshot = await page.screenshot(full_page=True)
+            return base64.b64encode(screenshot).decode('utf-8')
+        except Exception as e:
+            error_message = f"Failed to take screenshot: {str(e)}"
+            print(error_message)
+
+            # Generate an error image
+            img = Image.new('RGB', (800, 600), color='black')
+            draw = ImageDraw.Draw(img)
+            font = ImageFont.load_default()
+            draw.text((10, 10), error_message, fill=(255, 255, 255), font=font)
+            
+            buffered = BytesIO()
+            img.save(buffered, format="JPEG")
+            return base64.b64encode(buffered.getvalue()).decode('utf-8')
+        finally:
+            await page.close()
+
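Review note: the hunk above applies the optional post-load steps in a fixed order: `delay_before_return_html`, then `remove_overlay_elements`, then `page.content()`, then the optional screenshot. The sketch below mirrors only that ordering. `StubPage`, `remove_overlays`, and `finish_crawl` are hypothetical stand-ins written for illustration, not part of crawl4ai or Playwright.

```python
import asyncio


class StubPage:
    """Hypothetical stand-in for a Playwright page; it only records calls."""

    def __init__(self):
        self.calls = []

    async def content(self):
        self.calls.append("content")
        return "<html></html>"

    async def screenshot(self, full_page=True):
        self.calls.append("screenshot")
        return b"png-bytes"


async def remove_overlays(page):
    # Placeholder for the overlay-removal script evaluated in the real page.
    page.calls.append("remove_overlays")


async def finish_crawl(page, **kwargs):
    # 1. Optional delay before the HTML is read.
    delay = kwargs.get("delay_before_return_html")
    if delay:
        await asyncio.sleep(delay)
    # 2. Optional overlay removal, so the captured HTML is already cleaned.
    if kwargs.get("remove_overlay_elements", False):
        await remove_overlays(page)
    # 3. Capture the HTML.
    html = await page.content()
    # 4. Optional screenshot, taken from the same page after an optional wait.
    shot = None
    if kwargs.get("screenshot"):
        wait = kwargs.get("screenshot_wait_for")
        if wait:
            await asyncio.sleep(wait)
        shot = await page.screenshot(full_page=True)
    return html, shot


page = StubPage()
html, shot = asyncio.run(
    finish_crawl(
        page,
        remove_overlay_elements=True,
        screenshot=True,
        screenshot_wait_for=0.01,
    )
)
print(page.calls)
```

One consequence of this order is that the screenshot reflects the page after overlay removal, while the returned HTML is captured before the `screenshot_wait_for` pause.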
--- a/crawl4ai/async_database.py
+++ b/crawl4ai/async_database.py
@@ -29,14 +29,31 @@ class AsyncDatabaseManager:
                )
            ''')
            await db.commit()
+        await self.update_db_schema()

-    async def aalter_db_add_screenshot(self, new_column: str = "media"):
+    async def update_db_schema(self):
+        async with aiosqlite.connect(self.db_path) as db:
+            # Check if the 'media' column exists
+            cursor = await db.execute("PRAGMA table_info(crawled_data)")
+            columns = await cursor.fetchall()
+            column_names = [column[1] for column in columns]
+            
+            if 'media' not in column_names:
+                await self.aalter_db_add_column('media')
+            
+            # Check for other missing columns and add them if necessary
+            for column in ['links', 'metadata', 'screenshot']:
+                if column not in column_names:
+                    await self.aalter_db_add_column(column)
+
+    async def aalter_db_add_column(self, new_column: str):
        try:
            async with aiosqlite.connect(self.db_path) as db:
                await db.execute(f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""')
                await db.commit()
+            print(f"Added column '{new_column}' to the database.")
        except Exception as e:
-            print(f"Error altering database to add screenshot column: {e}")
+            print(f"Error altering database to add {new_column} column: {e}")

    async def aget_cached_url(self, url: str) -> Optional[Tuple[str, str, str, str, str, str, str, bool, str]]:
        try:
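Review note: `update_db_schema` reads the existing columns with `PRAGMA table_info` and adds whichever ones are missing, which makes the migration safe to run on every start. A minimal synchronous sketch of the same idea follows, using the standard-library `sqlite3` module instead of `aiosqlite`. The helper name `ensure_columns` and the two-column starting table are assumptions made for the example.

```python
import sqlite3


def ensure_columns(conn, table, wanted):
    """Add each column in `wanted` that `table` lacks; return the names added."""
    # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    added = []
    for column in wanted:
        if column not in existing:
            # Identifiers cannot be bound as SQL parameters, so they are
            # interpolated here; only pass trusted, hard-coded names.
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} TEXT DEFAULT ''")
            added.append(column)
    conn.commit()
    return added


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crawled_data (url TEXT PRIMARY KEY, html TEXT)")

columns = ["media", "links", "metadata", "screenshot"]
first_run = ensure_columns(conn, "crawled_data", columns)
second_run = ensure_columns(conn, "crawled_data", columns)
print(first_run, second_run)
```

The second call adds nothing because every column is already present, which is the property the schema update relies on.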
--- a/crawl4ai/async_webcrawler.py
+++ b/crawl4ai/async_webcrawler.py
@@ -23,17 +23,19 @@ class AsyncWebCrawler:
        self,
        crawler_strategy: Optional[AsyncCrawlerStrategy] = None,
        always_by_pass_cache: bool = False,
-        verbose: bool = False,
+        base_directory: str = str(Path.home()),
+        **kwargs,
    ):
        self.crawler_strategy = crawler_strategy or AsyncPlaywrightCrawlerStrategy(
-            verbose=verbose
+            **kwargs
        )
        self.always_by_pass_cache = always_by_pass_cache
-        self.crawl4ai_folder = os.path.join(Path.home(), ".crawl4ai")
+        # self.crawl4ai_folder = os.path.join(Path.home(), ".crawl4ai")
+        self.crawl4ai_folder = os.path.join(base_directory, ".crawl4ai")
        os.makedirs(self.crawl4ai_folder, exist_ok=True)
        os.makedirs(f"{self.crawl4ai_folder}/cache", exist_ok=True)
        self.ready = False
-        self.verbose = verbose
+        self.verbose = kwargs.get("verbose", False)

    async def __aenter__(self):
        await self.crawler_strategy.__aenter__()
@@ -80,7 +82,7 @@ class AsyncWebCrawler:
            
            word_count_threshold = max(word_count_threshold, MIN_WORD_THRESHOLD)

-            async_response : AsyncCrawlResponse = None
+            async_response: AsyncCrawlResponse = None
            cached = None
            screenshot_data = None
            extracted_content = None
@@ -102,15 +104,14 @@ class AsyncWebCrawler:
                t1 = time.time()
                if user_agent:
                    self.crawler_strategy.update_user_agent(user_agent)
-                async_response : AsyncCrawlResponse = await self.crawler_strategy.crawl(url, **kwargs)
+                async_response: AsyncCrawlResponse = await self.crawler_strategy.crawl(url, screenshot=screenshot, **kwargs)
                html = sanitize_input_encode(async_response.html)
+                screenshot_data = async_response.screenshot
                t2 = time.time()
                if verbose:
                    print(
                        f"[LOG] 🚀 Crawling done for {url}, success: {bool(html)}, time taken: {t2 - t1:.2f} seconds"
                    )
-                if screenshot:
-                    screenshot_data = await self.crawler_strategy.take_screenshot(url)

            crawl_result = await self.aprocess_html(
                url,
@@ -127,15 +128,15 @@ class AsyncWebCrawler:
                **kwargs,
            )
            crawl_result.status_code = async_response.status_code if async_response else 200
-            crawl_result.responser_headers = async_response.response_headers if async_response else {}
+            crawl_result.response_headers = async_response.response_headers if async_response else {}
            crawl_result.success = bool(html)
            crawl_result.session_id = kwargs.get("session_id", None)
            return crawl_result
        except Exception as e:
            if not hasattr(e, "msg"):
                e.msg = str(e)
-            print(f"[ERROR] 🚫 Failed to crawl {url}, error: {e.msg}")
-            return CrawlResult(url=url, html="", success=False, error_message=e.msg)
+            print(f"[ERROR] 🚫 arun(): Failed to crawl {url}, error: {e.msg}")
+            return CrawlResult(url=url, html="", markdown=f"[ERROR] 🚫 arun(): Failed to crawl {url}, error: {e.msg}", success=False, error_message=e.msg)

    async def arun_many(
        self,
@@ -187,7 +188,8 @@ class AsyncWebCrawler:
        try:
            t1 = time.time()
            scrapping_strategy = WebScrappingStrategy()
-            result = await scrapping_strategy.ascrap(
+            # result = await scrapping_strategy.ascrap(
+            result = scrapping_strategy.scrap(
                url,
                html,
                word_count_threshold=word_count_threshold,
@@ -196,6 +198,7 @@ class AsyncWebCrawler:
                image_description_min_word_threshold=kwargs.get(
                    "image_description_min_word_threshold", IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD
                ),
+                **kwargs,
            )
            if verbose:
                print(
@@ -203,14 +206,16 @@ class AsyncWebCrawler:
                )

            if result is None:
-                raise ValueError(f"Failed to extract content from the website: {url}")
+                raise ValueError(f"Process HTML, Failed to extract content from the website: {url}")
        except InvalidCSSSelectorError as e:
            raise ValueError(str(e))
        except Exception as e:
-            raise ValueError(f"Failed to extract content from the website: {url}, error: {str(e)}")
+            raise ValueError(f"Process HTML, Failed to extract content from the website: {url}, error: {str(e)}")

        cleaned_html = sanitize_input_encode(result.get("cleaned_html", ""))
        markdown = sanitize_input_encode(result.get("markdown", ""))
+        fit_markdown = sanitize_input_encode(result.get("fit_markdown", ""))
+        fit_html = sanitize_input_encode(result.get("fit_html", ""))
        media = result.get("media", [])
        links = result.get("links", [])
        metadata = result.get("metadata", {})
@@ -257,6 +262,8 @@ class AsyncWebCrawler:
            html=html,
            cleaned_html=format_html(cleaned_html),
            markdown=markdown,
+            fit_markdown=fit_markdown,
+            fit_html=fit_html,
            media=media,
            links=links,
            metadata=metadata,
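Review note: with the new `base_directory` parameter, `AsyncWebCrawler` builds its `.crawl4ai` folder and `cache` subfolder under the given directory rather than always under the home directory. The sketch below reproduces just that path logic. `resolve_cache_dirs` is a hypothetical helper name, and a temporary directory stands in for a real base directory.

```python
import os
import tempfile


def resolve_cache_dirs(base_directory):
    """Create and return the .crawl4ai folder and its cache subfolder."""
    crawl4ai_folder = os.path.join(base_directory, ".crawl4ai")
    cache_folder = os.path.join(crawl4ai_folder, "cache")
    # exist_ok=True makes repeated construction safe, as in the constructor.
    os.makedirs(cache_folder, exist_ok=True)
    return crawl4ai_folder, cache_folder


with tempfile.TemporaryDirectory() as tmp:
    folder, cache = resolve_cache_dirs(tmp)
    folder_exists = os.path.isdir(folder)
    cache_exists = os.path.isdir(cache)
    print(folder, cache)
```

Note that the `use_cached_html` branch in the crawler strategy hunk still builds its cache path from `Path.home()`, so that path does not appear to follow `base_directory` in this diff.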
--- a/crawl4ai/chunking_strategy.py
+++ b/crawl4ai/chunking_strategy.py
@@ -84,6 +84,12 @@ class TopicSegmentationChunking(ChunkingStrategy):
 # Fixed-length word chunks
 class FixedLengthWordChunking(ChunkingStrategy):
    def __init__(self, chunk_size=100, **kwargs):
+        """
+        Initialize the fixed-length word chunking strategy with the given chunk size.
+        
+        Args:
+            chunk_size (int): The size of each chunk in words.
+        """
        self.chunk_size = chunk_size

    def chunk(self, text: str) -> list:
@@ -93,14 +99,64 @@ class FixedLengthWordChunking(ChunkingStrategy):
 # Sliding window chunking
 class SlidingWindowChunking(ChunkingStrategy):
    def __init__(self, window_size=100, step=50, **kwargs):
+        """
+        Initialize the sliding window chunking strategy with the given window size and
+        step size.
+        
+        Args:
+            window_size (int): The size of the sliding window in words.
+            step (int): The step size for sliding the window in words.
+        """
        self.window_size = window_size
        self.step = step

    def chunk(self, text: str) -> list:
        words = text.split()
        chunks = []
-        for i in range(0, len(words), self.step):
-            chunks.append(' '.join(words[i:i + self.window_size]))
+        
+        if len(words) <= self.window_size:
+            return [text]
+        
+        for i in range(0, len(words) - self.window_size + 1, self.step):
+            chunk = ' '.join(words[i:i + self.window_size])
+            chunks.append(chunk)
+        
+        # Handle the last chunk if it doesn't align perfectly
+        if i + self.window_size < len(words):
+            chunks.append(' '.join(words[-self.window_size:]))
+        
        return chunks
    

+class OverlappingWindowChunking(ChunkingStrategy):
+    def __init__(self, window_size=1000, overlap=100, **kwargs):
+        """
+        Initialize the overlapping window chunking strategy with the given window size and
+        overlap size.
+        
+        Args:
+            window_size (int): The size of the window in words.
+            overlap (int): The size of the overlap between consecutive chunks in words.
+        """
+        self.window_size = window_size
+        self.overlap = overlap
+
+    def chunk(self, text: str) -> list:
+        words = text.split()
+        chunks = []
+        
+        if len(words) <= self.window_size:
+            return [text]
+        
+        start = 0
+        while start < len(words):
+            end = start + self.window_size
+            chunk = ' '.join(words[start:end])
+            chunks.append(chunk)
+            
+            if end >= len(words):
+                break
+            
+            start = max(start + 1, end - self.overlap)  # always advance, even if overlap >= window_size
+        
+        return chunks
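Review note: the two window strategies differ in how they treat the tail of the text. `SlidingWindowChunking` emits only full windows and then appends one last full window aligned to the end, while `OverlappingWindowChunking` lets the final chunk be shorter. The functions below are standalone re-statements of the two `chunk` methods for illustration. The `ValueError` guard is an addition of this sketch, not something taken from the class.

```python
def sliding_window(text, window_size, step):
    words = text.split()
    if len(words) <= window_size:
        return [text]
    chunks = []
    # Emit only windows that fit entirely inside the text.
    for i in range(0, len(words) - window_size + 1, step):
        chunks.append(" ".join(words[i:i + window_size]))
    # If the last window stopped short of the end, add an end-aligned window.
    if i + window_size < len(words):
        chunks.append(" ".join(words[-window_size:]))
    return chunks


def overlapping_window(text, window_size, overlap):
    if overlap >= window_size:
        # Guard added in this sketch: the window must move forward each step.
        raise ValueError("overlap must be smaller than window_size")
    words = text.split()
    if len(words) <= window_size:
        return [text]
    chunks = []
    start = 0
    while start < len(words):
        end = start + window_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break
        # The next chunk re-reads the last `overlap` words of this one.
        start = end - overlap
    return chunks


text = " ".join(f"w{i}" for i in range(11))
sliding = sliding_window(text, window_size=4, step=3)
overlapping = overlapping_window(text, window_size=4, overlap=1)
print(sliding)
print(overlapping)
```

For this 11-word input the sliding variant ends with the full window `w7 w8 w9 w10`, whereas the overlapping variant ends with the short chunk `w9 w10`.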
--- a/crawl4ai/config.py
+++ b/crawl4ai/config.py
@@ -4,24 +4,23 @@ from dotenv import load_dotenv
 load_dotenv()  # Load environment variables from .env file

 # Default provider, ONLY used when the extraction strategy is LLMExtractionStrategy
-DEFAULT_PROVIDER = "openai/gpt-4-turbo"
+DEFAULT_PROVIDER = "openai/gpt-4o-mini"
 MODEL_REPO_BRANCH = "new-release-0.0.2"
 # Provider-model dictionary, ONLY used when the extraction strategy is LLMExtractionStrategy
 PROVIDER_MODELS = {
    "ollama/llama3": "no-token-needed", # Any model from Ollama no need for API token
    "groq/llama3-70b-8192": os.getenv("GROQ_API_KEY"),
    "groq/llama3-8b-8192": os.getenv("GROQ_API_KEY"),
-    "openai/gpt-3.5-turbo": os.getenv("OPENAI_API_KEY"),
-    "openai/gpt-4-turbo": os.getenv("OPENAI_API_KEY"),
+    "openai/gpt-4o-mini": os.getenv("OPENAI_API_KEY"),
    "openai/gpt-4o": os.getenv("OPENAI_API_KEY"),
    "anthropic/claude-3-haiku-20240307": os.getenv("ANTHROPIC_API_KEY"),
    "anthropic/claude-3-opus-20240229": os.getenv("ANTHROPIC_API_KEY"),
    "anthropic/claude-3-sonnet-20240229": os.getenv("ANTHROPIC_API_KEY"),
+    "anthropic/claude-3-5-sonnet-20240620": os.getenv("ANTHROPIC_API_KEY"),
 }

-
 # Chunk token threshold
-CHUNK_TOKEN_THRESHOLD = 500
+CHUNK_TOKEN_THRESHOLD = 2 ** 11 # 2048 tokens
 OVERLAP_RATE = 0.1
 WORD_TOKEN_RATE = 1.3

@@ -29,6 +28,20 @@ WORD_TOKEN_RATE = 1.3
 MIN_WORD_THRESHOLD = 1
 IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD = 1

+IMPORTANT_ATTRS = ['src', 'href', 'alt', 'title', 'width', 'height']
+ONLY_TEXT_ELIGIBLE_TAGS = ['b', 'i', 'u', 'span', 'del', 'ins', 'sub', 'sup', 'strong', 'em', 'code', 'kbd', 'var', 's', 'q', 'abbr', 'cite', 'dfn', 'time', 'small', 'mark']
+SOCIAL_MEDIA_DOMAINS = [
+                            'facebook.com',
+                            'twitter.com',
+                            'x.com',
+                            'linkedin.com',
+                            'instagram.com',
+                            'pinterest.com',
+                            'tiktok.com',
+                            'snapchat.com',
+                            'reddit.com',
+                        ]
+
 # Threshold for the Image extraction - Range is 1 to 6
 # Images are scored based on point based system, to filter based on usefulness. Points are assigned
 # to each image based on the following aspects.
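Review note: `SOCIAL_MEDIA_DOMAINS` is a plain list of registrable domains, so code that filters links against it has to decide how subdomains such as `www.facebook.com` are matched. One possible approach is sketched below. `is_social_media_url` is a hypothetical helper written for this note, and the diff does not show how the project itself performs the match.

```python
from urllib.parse import urlparse

SOCIAL_MEDIA_DOMAINS = [
    "facebook.com",
    "twitter.com",
    "x.com",
    "linkedin.com",
    "instagram.com",
    "pinterest.com",
    "tiktok.com",
    "snapchat.com",
    "reddit.com",
]


def is_social_media_url(url):
    """Return True if the URL's host is a listed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    # Compare on a dot boundary so "notx.com" is not mistaken for "x.com".
    return any(
        host == domain or host.endswith("." + domain)
        for domain in SOCIAL_MEDIA_DOMAINS
    )


print(is_social_media_url("https://www.facebook.com/some-page"))
print(is_social_media_url("https://notx.com/profile"))
```

Matching on the parsed hostname rather than on the raw URL string avoids false positives from paths or query strings that merely contain a listed domain.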
--- a/crawl4ai/content_cleaning_strategy.py
+++ b/crawl4ai/content_cleaning_strategy.py
@@ -0,0 +1,196 @@
+from bs4 import BeautifulSoup, Tag
+import re
+from typing import Optional
+
+class ContentCleaningStrategy:
+    def __init__(self):
+        # Precompile regex patterns for performance
+        self.negative_patterns = re.compile(r'nav|footer|header|sidebar|ads|comment', re.I)
+        self.positive_patterns = re.compile(r'content|article|main|post', re.I)
+        self.priority_tags = {'article', 'main', 'section', 'div'}
+        self.non_content_tags = {'nav', 'footer', 'header', 'aside'}
+        # Thresholds
+        self.text_density_threshold = 9.0
+        self.min_word_count = 50
+        self.link_density_threshold = 0.2
+        self.max_dom_depth = 10  # To prevent excessive DOM traversal
+
+    def clean(self, clean_html: str) -> str:
+        """
+        Main function that takes cleaned HTML and returns super cleaned HTML.
+
+        Args:
+            clean_html (str): The cleaned HTML content.
+
+        Returns:
+            str: The super cleaned HTML containing only the main content.
+        """
+        try:
+            if not clean_html or not isinstance(clean_html, str):
+                return ''
+            soup = BeautifulSoup(clean_html, 'html.parser')
+            main_content = self.extract_main_content(soup)
+            if main_content:
+                super_clean_element = self.clean_element(main_content)
+                return str(super_clean_element)
+            else:
+                return ''
+        except Exception:
+            # Handle exceptions silently or log them as needed
+            return ''
+
+    def extract_main_content(self, soup: BeautifulSoup) -> Optional[Tag]:
+        """
+        Identifies and extracts the main content element from the HTML.
+
+        Args:
+            soup (BeautifulSoup): The parsed HTML soup.
+
+        Returns:
+            Optional[Tag]: The Tag object containing the main content, or None if not found.
+        """
+        candidates = []
+        for element in soup.find_all(self.priority_tags):
+            if self.is_non_content_tag(element):
+                continue
+            if self.has_negative_class_id(element):
+                continue
+            score = self.calculate_content_score(element)
+            candidates.append((score, element))
+        
+        if not candidates:
+            return None
+
+        # Sort candidates by score in descending order
+        candidates.sort(key=lambda x: x[0], reverse=True)
+        # Select the element with the highest score
+        best_element = candidates[0][1]
+        return best_element
+
+    def calculate_content_score(self, element: Tag) -> float:
+        """
+        Calculates a score for an element based on various heuristics.
+
+        Args:
+            element (Tag): The HTML element to score.
+
+        Returns:
+            float: The content score of the element.
+        """
+        score = 0.0
+
+        if self.is_priority_tag(element):
+            score += 5.0
+        if self.has_positive_class_id(element):
+            score += 3.0
+        if self.has_negative_class_id(element):
+            score -= 3.0
+        if self.is_high_text_density(element):
+            score += 2.0
+        if self.is_low_link_density(element):
+            score += 2.0
+        if self.has_sufficient_content(element):
+            score += 2.0
+        if self.has_headings(element):
+            score += 3.0
+
+        dom_depth = self.calculate_dom_depth(element)
+        score += min(dom_depth, self.max_dom_depth) * 0.5  # Adjust weight as needed
+
+        return score
+
+    def is_priority_tag(self, element: Tag) -> bool:
+        """Checks if the element is a priority tag."""
+        return element.name in self.priority_tags
+
+    def is_non_content_tag(self, element: Tag) -> bool:
+        """Checks if the element is a non-content tag."""
+        return element.name in self.non_content_tags
+
+    def has_negative_class_id(self, element: Tag) -> bool:
+        """Checks if the element has negative indicators in its class or id."""
+        class_id = ' '.join(filter(None, [
+            self.get_attr_str(element.get('class')),
+            element.get('id', '')
+        ]))
+        return bool(self.negative_patterns.search(class_id))
+
+    def has_positive_class_id(self, element: Tag) -> bool:
+        """Checks if the element has positive indicators in its class or id."""
+        class_id = ' '.join(filter(None, [
+            self.get_attr_str(element.get('class')),
+            element.get('id', '')
+        ]))
+        return bool(self.positive_patterns.search(class_id))
+
+    @staticmethod
+    def get_attr_str(attr) -> str:
+        """Converts an attribute value to a string."""
+        if isinstance(attr, list):
+            return ' '.join(attr)
+        elif isinstance(attr, str):
+            return attr
+        else:
+            return ''
+
+    def is_high_text_density(self, element: Tag) -> bool:
+        """Determines if the element has high text density."""
+        text_density = self.calculate_text_density(element)
+        return text_density > self.text_density_threshold
+
+    def calculate_text_density(self, element: Tag) -> float:
+        """Calculates the text density of an element."""
+        text_length = len(element.get_text(strip=True))
+        tag_count = len(element.find_all())
+        tag_count = tag_count or 1  # Prevent division by zero
+        return text_length / tag_count
+
+    def is_low_link_density(self, element: Tag) -> bool:
+        """Determines if the element has low link density."""
+        link_density = self.calculate_link_density(element)
+        return link_density < self.link_density_threshold
+
+    def calculate_link_density(self, element: Tag) -> float:
+        """Calculates the link density of an element."""
+        text = element.get_text(strip=True)
+        if not text:
+            return 0.0
+        link_text = ' '.join(a.get_text(strip=True) for a in element.find_all('a'))
+        return len(link_text) / len(text)
+
+    def has_sufficient_content(self, element: Tag) -> bool:
+        """Checks if the element has sufficient word count."""
+        word_count = len(element.get_text(strip=True).split())
+        return word_count >= self.min_word_count
+
+    def calculate_dom_depth(self, element: Tag) -> int:
+        """Calculates the depth of an element in the DOM tree."""
+        depth = 0
+        current_element = element
+        while current_element.parent and depth < self.max_dom_depth:
+            depth += 1
+            current_element = current_element.parent
+        return depth
+
+    def has_headings(self, element: Tag) -> bool:
+        """Checks if the element contains heading tags."""
+        return bool(element.find(['h1', 'h2', 'h3']))
+
+    def clean_element(self, element: Tag) -> Tag:
+        """
+        Cleans the selected element by removing unnecessary attributes and nested non-content elements.
+
+        Args:
+            element (Tag): The HTML element to clean.
+
+        Returns:
+            Tag: The cleaned HTML element.
+        """
+        for tag in element.find_all(['script', 'style', 'aside']):
+            tag.decompose()
+        for tag in element.find_all():
+            attrs = dict(tag.attrs)
+            for attr in attrs:
+                if attr in ['style', 'onclick', 'onmouseover', 'align', 'bgcolor']:
+                    del tag.attrs[attr]
+        return element
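
Review note: the additive scoring in `score_element` above can be checked by hand with a small standalone sketch. The weights are copied from the hunk; the `Signals` dataclass, the `content_score` function name, and the `max_dom_depth` value of 10 are assumptions made for illustration only and are not part of this change.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical container for the boolean checks used by the scorer."""
    priority_tag: bool = False
    positive_class_id: bool = False
    negative_class_id: bool = False
    high_text_density: bool = False
    low_link_density: bool = False
    sufficient_content: bool = False
    has_headings: bool = False


def content_score(signals: Signals, dom_depth: int, max_dom_depth: int) -> float:
    """Sum the same weights as the diff; depth contributes 0.5 per level, capped."""
    score = 0.0
    if signals.priority_tag:
        score += 5.0
    if signals.positive_class_id:
        score += 3.0
    if signals.negative_class_id:
        score -= 3.0
    if signals.high_text_density:
        score += 2.0
    if signals.low_link_density:
        score += 2.0
    if signals.sufficient_content:
        score += 2.0
    if signals.has_headings:
        score += 3.0
    score += min(dom_depth, max_dom_depth) * 0.5
    return score


# An article-like element: priority tag, dense text, few links, headings.
article = Signals(priority_tag=True, high_text_density=True,
                  low_link_density=True, sufficient_content=True,
                  has_headings=True)
# A sidebar-like element whose class or id matched a negative pattern.
sidebar = Signals(negative_class_id=True)

article_score = content_score(article, dom_depth=4, max_dom_depth=10)
sidebar_score = content_score(sidebar, dom_depth=6, max_dom_depth=10)
```

With these inputs the depth bonus can cancel the negative class penalty for the sidebar, so the depth weight flagged "Adjust weight as needed" may deserve a second look.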
--- a/crawl4ai/content_scrapping_strategy.py
+++ b/crawl4ai/content_scrapping_strategy.py
@@ -7,17 +7,19 @@ from .config import *
 from bs4 import element, NavigableString, Comment
 from urllib.parse import urljoin
 from requests.exceptions import InvalidSchema
+from .content_cleaning_strategy import ContentCleaningStrategy

 from .utils import (
    sanitize_input_encode,
    sanitize_html,
    extract_metadata,
    InvalidCSSSelectorError,
-    CustomHTML2Text
+    CustomHTML2Text,
+    normalize_url,
+    is_external_url
+    
 )

-
-
 class ContentScrappingStrategy(ABC):
    @abstractmethod
    def scrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
@@ -35,12 +37,14 @@ class WebScrappingStrategy(ContentScrappingStrategy):
        return await asyncio.to_thread(self._get_content_of_website_optimized, url, html, **kwargs)

    def _get_content_of_website_optimized(self, url: str, html: str, word_count_threshold: int = MIN_WORD_THRESHOLD, css_selector: str = None, **kwargs) -> Dict[str, Any]:
+        success = True
        if not html:
            return None

        soup = BeautifulSoup(html, 'html.parser')
        body = soup.body
        
+        
        image_description_min_word_threshold = kwargs.get('image_description_min_word_threshold', IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD)

        for tag in kwargs.get('excluded_tags', []) or []:
@@ -66,6 +70,8 @@ class WebScrappingStrategy(ContentScrappingStrategy):

        links = {'internal': [], 'external': []}
        media = {'images': [], 'videos': [], 'audios': []}
+        internal_links_dict = {}
+        external_links_dict = {}

        # Extract meaningful text for media files from closest parent
        def find_closest_parent_with_useful_text(tag):
@@ -127,9 +133,13 @@ class WebScrappingStrategy(ContentScrappingStrategy):
                image_width =  img.get('width')
                width_value, width_unit = parse_dimension(image_width)
                image_size = 0 #int(fetch_image_file_size(img,base_url) or 0)
-                image_format = os.path.splitext(img.get('src',''))[1].lower()
+                image_src = img.get('src','')
+                if "data:image/" in image_src:
+                    image_format = image_src.split(',')[0].split(';')[0].split('/')[1]
+                else:
+                    image_format = os.path.splitext(img.get('src',''))[1].lower()
                # Remove . from format
-                image_format = image_format.strip('.')
+                image_format = image_format.strip('.').split('?')[0]
                score = 0
                if height_value:
                    if height_unit == 'px' and height_value > 150:
@@ -151,6 +161,8 @@ class WebScrappingStrategy(ContentScrappingStrategy):
                    score+=1
                return score

+            
+            
            if not is_valid_image(img, img.parent, img.parent.get('class', [])):
                return None
            score = score_image_for_usefulness(img, url, index, total_images)
@@ -158,41 +170,142 @@ class WebScrappingStrategy(ContentScrappingStrategy):
                return None
            return {
                'src': img.get('src', ''),
+                'data-src': img.get('data-src', ''),
                'alt': img.get('alt', ''),
                'desc': find_closest_parent_with_useful_text(img),
                'score': score,
                'type': 'image'
            }

+        def remove_unwanted_attributes(element, important_attrs, keep_data_attributes=False):
+            attrs_to_remove = []
+            for attr in element.attrs:
+                if attr not in important_attrs:
+                    if keep_data_attributes:
+                        if not attr.startswith('data-'):
+                            attrs_to_remove.append(attr)
+                    else:
+                        attrs_to_remove.append(attr)
+            
+            for attr in attrs_to_remove:
+                del element[attr]
+        
        def process_element(element: element.PageElement) -> bool:
            try:
                if isinstance(element, NavigableString):
                    if isinstance(element, Comment):
                        element.extract()
                    return False
+                
+                # if element.name == 'img':
+                #     process_image(element, url, 0, 1)
+                #     return True

                if element.name in ['script', 'style', 'link', 'meta', 'noscript']:
-                    if element.name == 'img':
-                        process_image(element, url, 0, 1)
                    element.decompose()
                    return False

                keep_element = False
+                
+                exclude_social_media_domains = SOCIAL_MEDIA_DOMAINS + kwargs.get('exclude_social_media_domains', [])
+                exclude_social_media_domains = list(set(exclude_social_media_domains))

-                if element.name == 'a' and element.get('href'):
-                    href = element['href']
-                    url_base = url.split('/')[2]
-                    link_data = {'href': href, 'text': element.get_text()}
-                    if href.startswith('http') and url_base not in href:
-                        links['external'].append(link_data)
-                    else:
-                        links['internal'].append(link_data)
-                    keep_element = True
+                
+                try:
+                    if element.name == 'a' and element.get('href'):
+                        href = element.get('href', '').strip()
+                        if not href:  # Skip empty hrefs
+                            return False
+                            
+                        url_base = url.split('/')[2]
+                        
+                        # Normalize the URL
+                        try:
+                            normalized_href = normalize_url(href, url)
+                        except ValueError as e:
+                            # logging.warning(f"Invalid URL format: {href}, Error: {str(e)}")
+                            return False
+                            
+                        link_data = {
+                            'href': normalized_href,
+                            'text': element.get_text().strip(),
+                            'title': element.get('title', '').strip()
+                        }
+                        
+                        # Check for duplicates and add to appropriate dictionary
+                        is_external = is_external_url(normalized_href, url_base)
+                        if is_external:
+                            if normalized_href not in external_links_dict:
+                                external_links_dict[normalized_href] = link_data
+                        else:
+                            if normalized_href not in internal_links_dict:
+                                internal_links_dict[normalized_href] = link_data
+                                
+                        keep_element = True
+                        
+                        # Handle external link exclusions
+                        if is_external:
+                            if kwargs.get('exclude_external_links', False):
+                                element.decompose()
+                                return False
+                            elif kwargs.get('exclude_social_media_links', False):
+                                if any(domain in normalized_href.lower() for domain in exclude_social_media_domains):
+                                    element.decompose()
+                                    return False
+                            elif kwargs.get('exclude_domains', []):
+                                if any(domain in normalized_href.lower() for domain in kwargs.get('exclude_domains', [])):
+                                    element.decompose()
+                                    return False
+                                    
+                except Exception as e:
+                    raise Exception(f"Error processing links: {str(e)}")

-                elif element.name == 'img':
-                    return True  # Always keep image elements
-
-                elif element.name in ['video', 'audio']:
+                try:
+                    if element.name == 'img':
+                        potential_sources = ['src', 'data-src', 'srcset', 'data-lazy-src', 'data-original']
+                        src = element.get('src', '')
+                        while not src and potential_sources:
+                            src = element.get(potential_sources.pop(0), '')
+                        if not src:
+                            element.decompose()
+                            return False
+                        
+                        # If it is srcset pick up the first image
+                        if 'srcset' in element.attrs:
+                            src = element.attrs['srcset'].split(',')[0].split(' ')[0]
+                            
+                        # Check flag if we should remove external images
+                        if kwargs.get('exclude_external_images', False):
+                            src_url_base = src.split('/')[2]
+                            url_base = url.split('/')[2]
+                            if url_base not in src_url_base:
+                                element.decompose()
+                                return False
+                            
+                        if not kwargs.get('exclude_external_images', False) and kwargs.get('exclude_social_media_links', False):
+                            src_url_base = src.split('/')[2]
+                            url_base = url.split('/')[2]
+                            if any(domain in src for domain in exclude_social_media_domains):
+                                element.decompose()
+                                return False
+                            
+                        # Handle exclude domains
+                        if kwargs.get('exclude_domains', []):
+                            if any(domain in src for domain in kwargs.get('exclude_domains', [])):
+                                element.decompose()
+                                return False
+                        
+                        return True  # Always keep image elements
+                except Exception as e:
+                    raise Exception(f"Error processing images: {str(e)}")
+                
+                
+                # Check if flag to remove all forms is set
+                if kwargs.get('remove_forms', False) and element.name == 'form':
+                    element.decompose()
+                    return False
+                
+                if element.name in ['video', 'audio']:
                    media[f"{element.name}s"].append({
                        'src': element.get('src'),
                        'alt': element.get('alt'),
@@ -209,14 +322,15 @@ class WebScrappingStrategy(ContentScrappingStrategy):
                    })
                    return True  # Always keep video and audio elements

-                if element.name != 'pre':
-                    if element.name in ['b', 'i', 'u', 'span', 'del', 'ins', 'sub', 'sup', 'strong', 'em', 'code', 'kbd', 'var', 's', 'q', 'abbr', 'cite', 'dfn', 'time', 'small', 'mark']:
-                        if kwargs.get('only_text', False):
-                            element.replace_with(element.get_text())
-                        else:
-                            element.unwrap()
-                    elif element.name != 'img':
-                        element.attrs = {}
+                if element.name in ONLY_TEXT_ELIGIBLE_TAGS:
+                    if kwargs.get('only_text', False):
+                        element.replace_with(element.get_text())
+
+                try:
+                    remove_unwanted_attributes(element, IMPORTANT_ATTRS, kwargs.get('keep_data_attributes', False))
+                except Exception as e:
+                    print('Error removing unwanted attributes:', str(e))
+                

                # Process children
                for child in list(element.children):
@@ -250,9 +364,15 @@ class WebScrappingStrategy(ContentScrappingStrategy):
        # ]
        
        process_element(body)
+        
+        # Update the links dictionary with unique links
+        links['internal'] = list(internal_links_dict.values())
+        links['external'] = list(external_links_dict.values())
+

        # # Process images using ThreadPoolExecutor
        imgs = body.find_all('img')
+        
        with ThreadPoolExecutor() as executor:
            image_results = list(executor.map(process_image, imgs, [url]*len(imgs), range(len(imgs)), [len(imgs)]*len(imgs)))
        media['images'] = [result for result in image_results if result is not None]
@@ -272,12 +392,45 @@ class WebScrappingStrategy(ContentScrappingStrategy):
            if base64_pattern.match(src):
                # Replace base64 data with empty string
                img['src'] = base64_pattern.sub('', src)
-        cleaned_html = str(body).replace('\n\n', '\n').replace('  ', ' ')
-        cleaned_html = sanitize_html(cleaned_html)
+                
+        try:
+            str(body)
+        except Exception as e:
+            # Reset body to the original HTML
+            success = False
+            body = BeautifulSoup(html, 'html.parser')
+            
+            # Create a new div with a special ID
+            error_div = body.new_tag('div', id='crawl4ai_error_message')
+            error_div.string = '''
+            Crawl4AI Error: This page is not fully supported.
+            
+            Possible reasons:
+            1. The page may have restrictions that prevent crawling.
+            2. The page might not be fully loaded.
+            
+            Suggestions:
+            - Try calling the crawl function with these parameters:
+            magic=True,
+            - Set headless=False to visualize what's happening on the page.
+            
+            If the issue persists, please check the page's structure and any potential anti-crawling measures.
+            '''
+            
+            # Append the error div to the body
+            body.body.append(error_div)
+            
+            print("[LOG] 😧 Error: After processing the crawled HTML and removing irrelevant tags, nothing was left in the page. Check the markdown for further details.")

-        h = CustomHTML2Text()
-        h.ignore_links = True
-        markdown = h.handle(cleaned_html)
+
+        cleaned_html = str(body).replace('\n\n', '\n').replace('  ', ' ')
+
+        try:
+            h = CustomHTML2Text()
+            h.update_params(**kwargs.get('html2text', {}))            
+            markdown = h.handle(cleaned_html)
+        except Exception as e:
+            markdown = h.handle(sanitize_html(cleaned_html))
        markdown = markdown.replace('    ```', '```')

        try:
@@ -285,11 +438,18 @@ class WebScrappingStrategy(ContentScrappingStrategy):
        except Exception as e:
            print('Error extracting metadata:', str(e))
            meta = {}
+            
+        cleaner = ContentCleaningStrategy()
+        fit_html = cleaner.clean(cleaned_html)
+        fit_markdown = h.handle(fit_html)

+        cleaned_html = sanitize_html(cleaned_html)
        return {
            'markdown': markdown,
+            'fit_markdown': fit_markdown,
+            'fit_html': fit_html,
            'cleaned_html': cleaned_html,
-            'success': True,
+            'success': success,
            'media': media,
            'links': links,
            'metadata': meta
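
Review note: the new image format detection in `score_image_for_usefulness` can be exercised in isolation. The sketch below repeats the same string operations as the hunk; the function name `guess_image_format` and the sample URLs are hypothetical.

```python
import os


def guess_image_format(src: str) -> str:
    """Return the image format for a data URI or a path-like src."""
    if "data:image/" in src:
        # "data:image/png;base64,...." -> "data:image/png" -> "png"
        fmt = src.split(",")[0].split(";")[0].split("/")[1]
    else:
        # Extension of the last path component, lowercased (may carry a query).
        fmt = os.path.splitext(src)[1].lower()
    # Drop the leading dot, then any query string that followed the extension.
    return fmt.strip(".").split("?")[0]


data_uri_format = guess_image_format("data:image/png;base64,iVBORw0KGgo=")
query_format = guess_image_format("/static/logo.PNG?v=3")
plain_format = guess_image_format("https://example.com/photo.jpeg")
empty_format = guess_image_format("")
```

One case this approach does not cover: a query string that itself contains a dot (for example `?file=a.b`) moves the split point, so the result would come from the query rather than the path.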
--- a/crawl4ai/extraction_strategy.py
+++ b/crawl4ai/extraction_strategy.py
@@ -68,7 +68,7 @@ class LLMExtractionStrategy(ExtractionStrategy):
        """
        super().__init__() 
        self.provider = provider
-        self.api_token = api_token or PROVIDER_MODELS.get(provider, None) or os.getenv("OPENAI_API_KEY")
+        self.api_token = api_token or PROVIDER_MODELS.get(provider, "no-token") or os.getenv("OPENAI_API_KEY")
        self.instruction = instruction
        self.extract_type = extraction_type
        self.schema = schema
@@ -80,6 +80,8 @@ class LLMExtractionStrategy(ExtractionStrategy):
        self.word_token_rate = kwargs.get("word_token_rate", WORD_TOKEN_RATE)
        self.apply_chunking = kwargs.get("apply_chunking", True)
        self.base_url = kwargs.get("base_url", None)
+        self.api_base = kwargs.get("api_base", kwargs.get("base_url", None))
+        self.extra_args = kwargs.get("extra_args", {})
        if not self.apply_chunking:
            self.chunk_token_threshold = 1e9
        
@@ -111,7 +113,13 @@ class LLMExtractionStrategy(ExtractionStrategy):
                "{" + variable + "}", variable_values[variable]
            )
        
-        response = perform_completion_with_backoff(self.provider, prompt_with_variables, self.api_token, base_url=self.base_url) # , json_response=self.extract_type == "schema")
+        response = perform_completion_with_backoff(
+            self.provider, 
+            prompt_with_variables, 
+            self.api_token, 
+            base_url=self.api_base or self.base_url,
+            extra_args = self.extra_args
+            ) # , json_response=self.extract_type == "schema")
        try:
            blocks = extract_xml_data(["blocks"], response.choices[0].message.content)['blocks']
            blocks = json.loads(blocks)
@@ -227,11 +235,12 @@ class CosineStrategy(ExtractionStrategy):
        """
        Initialize the strategy with clustering parameters.

-        :param semantic_filter: A keyword filter for document filtering.
-        :param word_count_threshold: Minimum number of words per cluster.
-        :param max_dist: The maximum cophenetic distance on the dendrogram to form clusters.
-        :param linkage_method: The linkage method for hierarchical clustering.
-        :param top_k: Number of top categories to extract.
+        Args:
+            semantic_filter (str): A keyword filter for document filtering.
+            word_count_threshold (int): Minimum number of words per cluster.
+            max_dist (float): The maximum cophenetic distance on the dendrogram to form clusters.
+            linkage_method (str): The linkage method for hierarchical clustering.
+            top_k (int): Number of top categories to extract.
        """
        super().__init__()
        
@@ -250,8 +259,8 @@ class CosineStrategy(ExtractionStrategy):
        self.get_embedding_method = "direct"
        
        self.device = get_device()
-        import torch
-        self.device = torch.device('cpu')
+        # import torch
+        # self.device = torch.device('cpu')
        
        self.default_batch_size = calculate_batch_size(self.device)

@@ -264,7 +273,7 @@ class CosineStrategy(ExtractionStrategy):
        #     self.get_embedding_method = "direct"
        # else:

-        self.tokenizer, self.model = load_bge_small_en_v1_5()
+        self.tokenizer, self.model = load_HF_embedding_model(model_name)
        self.model.to(self.device)
        self.model.eval()  
        
@@ -731,7 +740,6 @@ class JsonCssExtractionStrategy(ExtractionStrategy):
        combined_html = self.DEL.join(sections)
        return self.extract(url, combined_html, **kwargs)
    
-
 class JsonXPATHExtractionStrategy(ExtractionStrategy):
    def __init__(self, schema: Dict[str, Any], **kwargs):
        super().__init__(**kwargs)
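
Review note: the precedence between `api_base` and `base_url` in `LLMExtractionStrategy` is easy to misread, so here is a minimal sketch of the two expressions from the hunks above. The helper name `resolve_api_base` and the example URLs are assumptions for illustration; the real class stores these values as attributes.

```python
from typing import Any, Optional


def resolve_api_base(**kwargs: Any) -> Optional[str]:
    """Mirror the constructor lookup plus the `or` fallback at the call site."""
    base_url = kwargs.get("base_url", None)
    # api_base wins when the key is present; otherwise fall back to base_url.
    api_base = kwargs.get("api_base", kwargs.get("base_url", None))
    # The call site passes `api_base or base_url`, which also covers an
    # explicit api_base=None.
    return api_base or base_url


only_base_url = resolve_api_base(base_url="http://localhost:8000/v1")
both_given = resolve_api_base(api_base="http://a.example/v1",
                              base_url="http://b.example/v1")
explicit_none = resolve_api_base(api_base=None,
                                 base_url="http://b.example/v1")
neither = resolve_api_base()
```

The `explicit_none` case is the reason the `or self.base_url` at the call site is not redundant: `kwargs.get("api_base", ...)` returns `None` when the key is present with a `None` value.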
--- a/crawl4ai/html2text/__init__.py
+++ b/crawl4ai/html2text/__init__.py
--- a/crawl4ai/html2text/__main__.py
+++ b/crawl4ai/html2text/__main__.py
@@ -0,0 +1,3 @@
+from .cli import main
+
+main()
--- a/crawl4ai/html2text/_typing.py
+++ b/crawl4ai/html2text/_typing.py
@@ -0,0 +1,2 @@
+class OutCallback:
+    def __call__(self, s: str) -> None: ...
--- a/crawl4ai/html2text/cli.py
+++ b/crawl4ai/html2text/cli.py
@@ -0,0 +1,330 @@
+import argparse
+import sys
+
+from . import HTML2Text, __version__, config
+
+
+def main() -> None:
+    baseurl = ""
+
+    class bcolors:
+        HEADER = "\033[95m"
+        OKBLUE = "\033[94m"
+        OKGREEN = "\033[92m"
+        WARNING = "\033[93m"
+        FAIL = "\033[91m"
+        ENDC = "\033[0m"
+        BOLD = "\033[1m"
+        UNDERLINE = "\033[4m"
+
+    p = argparse.ArgumentParser()
+    p.add_argument(
+        "--default-image-alt",
+        dest="default_image_alt",
+        default=config.DEFAULT_IMAGE_ALT,
+        help="The default alt string for images with missing ones",
+    )
+    p.add_argument(
+        "--pad-tables",
+        dest="pad_tables",
+        action="store_true",
+        default=config.PAD_TABLES,
+        help="pad the cells to equal column width in tables",
+    )
+    p.add_argument(
+        "--no-wrap-links",
+        dest="wrap_links",
+        action="store_false",
+        default=config.WRAP_LINKS,
+        help="don't wrap links during conversion",
+    )
+    p.add_argument(
+        "--wrap-list-items",
+        dest="wrap_list_items",
+        action="store_true",
+        default=config.WRAP_LIST_ITEMS,
+        help="wrap list items during conversion",
+    )
+    p.add_argument(
+        "--wrap-tables",
+        dest="wrap_tables",
+        action="store_true",
+        default=config.WRAP_TABLES,
+        help="wrap tables",
+    )
+    p.add_argument(
+        "--ignore-emphasis",
+        dest="ignore_emphasis",
+        action="store_true",
+        default=config.IGNORE_EMPHASIS,
+        help="don't include any formatting for emphasis",
+    )
+    p.add_argument(
+        "--reference-links",
+        dest="inline_links",
+        action="store_false",
+        default=config.INLINE_LINKS,
+        help="use reference style links instead of inline links",
+    )
+    p.add_argument(
+        "--ignore-links",
+        dest="ignore_links",
+        action="store_true",
+        default=config.IGNORE_ANCHORS,
+        help="don't include any formatting for links",
+    )
+    p.add_argument(
+        "--ignore-mailto-links",
+        action="store_true",
+        dest="ignore_mailto_links",
+        default=config.IGNORE_MAILTO_LINKS,
+        help="don't include mailto: links",
+    )
+    p.add_argument(
+        "--protect-links",
+        dest="protect_links",
+        action="store_true",
+        default=config.PROTECT_LINKS,
+        help="protect links from line breaks surrounding them with angle brackets",
+    )
+    p.add_argument(
+        "--ignore-images",
+        dest="ignore_images",
+        action="store_true",
+        default=config.IGNORE_IMAGES,
+        help="don't include any formatting for images",
+    )
+    p.add_argument(
+        "--images-as-html",
+        dest="images_as_html",
+        action="store_true",
+        default=config.IMAGES_AS_HTML,
+        help=(
+            "Always write image tags as raw html; preserves `height`, `width` and "
+            "`alt` if possible."
+        ),
+    )
+    p.add_argument(
+        "--images-to-alt",
+        dest="images_to_alt",
+        action="store_true",
+        default=config.IMAGES_TO_ALT,
+        help="Discard image data, only keep alt text",
+    )
+    p.add_argument(
+        "--images-with-size",
+        dest="images_with_size",
+        action="store_true",
+        default=config.IMAGES_WITH_SIZE,
+        help=(
+            "Write image tags with height and width attrs as raw html to retain "
+            "dimensions"
+        ),
+    )
+    p.add_argument(
+        "-g",
+        "--google-doc",
+        action="store_true",
+        dest="google_doc",
+        default=False,
+        help="convert an html-exported Google Document",
+    )
+    p.add_argument(
+        "-d",
+        "--dash-unordered-list",
+        action="store_true",
+        dest="ul_style_dash",
+        default=False,
+        help="use a dash rather than a star for unordered list items",
+    )
+    p.add_argument(
+        "-e",
+        "--asterisk-emphasis",
+        action="store_true",
+        dest="em_style_asterisk",
+        default=False,
+        help="use an asterisk rather than an underscore for emphasized text",
+    )
+    p.add_argument(
+        "-b",
+        "--body-width",
+        dest="body_width",
+        type=int,
+        default=config.BODY_WIDTH,
+        help="number of characters per output line, 0 for no wrap",
+    )
+    p.add_argument(
+        "-i",
+        "--google-list-indent",
+        dest="list_indent",
+        type=int,
+        default=config.GOOGLE_LIST_INDENT,
+        help="number of pixels Google indents nested lists",
+    )
+    p.add_argument(
+        "-s",
+        "--hide-strikethrough",
+        action="store_true",
+        dest="hide_strikethrough",
+        default=False,
+        help="hide strike-through text. only relevant when -g is " "specified as well",
+    )
+    p.add_argument(
+        "--escape-all",
+        action="store_true",
+        dest="escape_snob",
+        default=False,
+        help=(
+            "Escape all special characters.  Output is less readable, but avoids "
+            "corner case formatting issues."
+        ),
+    )
+    p.add_argument(
+        "--bypass-tables",
+        action="store_true",
+        dest="bypass_tables",
+        default=config.BYPASS_TABLES,
+        help="Format tables in HTML rather than Markdown syntax.",
+    )
+    p.add_argument(
+        "--ignore-tables",
+        action="store_true",
+        dest="ignore_tables",
+        default=config.IGNORE_TABLES,
+        help="Ignore table-related tags (table, th, td, tr) " "while keeping rows.",
+    )
+    p.add_argument(
+        "--single-line-break",
+        action="store_true",
+        dest="single_line_break",
+        default=config.SINGLE_LINE_BREAK,
+        help=(
+            "Use a single line break after a block element rather than two line "
+            "breaks. NOTE: Requires --body-width=0"
+        ),
+    )
+    p.add_argument(
+        "--unicode-snob",
+        action="store_true",
+        dest="unicode_snob",
+        default=config.UNICODE_SNOB,
+        help="Use unicode throughout document",
+    )
+    p.add_argument(
+        "--no-automatic-links",
+        action="store_false",
+        dest="use_automatic_links",
+        default=config.USE_AUTOMATIC_LINKS,
+        help="Do not use automatic links wherever applicable",
+    )
+    p.add_argument(
+        "--no-skip-internal-links",
+        action="store_false",
+        dest="skip_internal_links",
+        default=config.SKIP_INTERNAL_LINKS,
+        help="Do not skip internal links",
+    )
+    p.add_argument(
+        "--links-after-para",
+        action="store_true",
+        dest="links_each_paragraph",
+        default=config.LINKS_EACH_PARAGRAPH,
+        help="Put links after each paragraph instead of document",
+    )
+    p.add_argument(
+        "--mark-code",
+        action="store_true",
+        dest="mark_code",
+        default=config.MARK_CODE,
+        help="Mark program code blocks with [code]...[/code]",
+    )
+    p.add_argument(
+        "--decode-errors",
+        dest="decode_errors",
+        default=config.DECODE_ERRORS,
+        help=(
+            "What to do in case of decode errors. 'ignore', 'strict' and 'replace' are "
+            "acceptable values"
+        ),
+    )
+    p.add_argument(
+        "--open-quote",
+        dest="open_quote",
+        default=config.OPEN_QUOTE,
+        help="The character used to open quotes",
+    )
+    p.add_argument(
+        "--close-quote",
+        dest="close_quote",
+        default=config.CLOSE_QUOTE,
+        help="The character used to close quotes",
+    )
+    p.add_argument(
+        "--version", action="version", version=".".join(map(str, __version__))
+    )
+    p.add_argument("filename", nargs="?")
+    p.add_argument("encoding", nargs="?", default="utf-8")
+    p.add_argument(
+        "--include-sup-sub",
+        dest="include_sup_sub",
+        action="store_true",
+        default=config.INCLUDE_SUP_SUB,
+        help="Include the sup and sub tags",
+    )
+    args = p.parse_args()
+
+    if args.filename and args.filename != "-":
+        with open(args.filename, "rb") as fp:
+            data = fp.read()
+    else:
+        data = sys.stdin.buffer.read()
+
+    try:
+        html = data.decode(args.encoding, args.decode_errors)
+    except UnicodeDecodeError as err:
+        warning = bcolors.WARNING + "Warning:" + bcolors.ENDC
+        warning += " Use the " + bcolors.OKGREEN
+        warning += "--decode-errors=ignore" + bcolors.ENDC + " flag."
+        print(warning)
+        raise err
+
+    h = HTML2Text(baseurl=baseurl)
+    # handle options
+    if args.ul_style_dash:
+        h.ul_item_mark = "-"
+    if args.em_style_asterisk:
+        h.emphasis_mark = "*"
+        h.strong_mark = "__"
+
+    h.body_width = args.body_width
+    h.google_list_indent = args.list_indent
+    h.ignore_emphasis = args.ignore_emphasis
+    h.ignore_links = args.ignore_links
+    h.ignore_mailto_links = args.ignore_mailto_links
+    h.protect_links = args.protect_links
+    h.ignore_images = args.ignore_images
+    h.images_as_html = args.images_as_html
+    h.images_to_alt = args.images_to_alt
+    h.images_with_size = args.images_with_size
+    h.google_doc = args.google_doc
+    h.hide_strikethrough = args.hide_strikethrough
+    h.escape_snob = args.escape_snob
+    h.bypass_tables = args.bypass_tables
+    h.ignore_tables = args.ignore_tables
+    h.single_line_break = args.single_line_break
+    h.inline_links = args.inline_links
+    h.unicode_snob = args.unicode_snob
+    h.use_automatic_links = args.use_automatic_links
+    h.skip_internal_links = args.skip_internal_links
+    h.links_each_paragraph = args.links_each_paragraph
+    h.mark_code = args.mark_code
+    h.wrap_links = args.wrap_links
+    h.wrap_list_items = args.wrap_list_items
+    h.wrap_tables = args.wrap_tables
+    h.pad_tables = args.pad_tables
+    h.default_image_alt = args.default_image_alt
+    h.open_quote = args.open_quote
+    h.close_quote = args.close_quote
+    h.include_sup_sub = args.include_sup_sub
+
+    sys.stdout.write(h.handle(html))
--- a/crawl4ai/html2text/config.py
+++ b/crawl4ai/html2text/config.py
@@ -0,0 +1,172 @@
+import re
+
+# Use Unicode characters instead of their ascii pseudo-replacements
+UNICODE_SNOB = False
+
+# Marker to use for marking tables for padding post processing
+TABLE_MARKER_FOR_PAD = "special_marker_for_table_padding"
+# Escape all special characters.  Output is less readable, but avoids
+# corner case formatting issues.
+ESCAPE_SNOB = False
+ESCAPE_BACKSLASH = False
+ESCAPE_DOT = False
+ESCAPE_PLUS = False
+ESCAPE_DASH = False
+
+# Put the links after each paragraph instead of at the end.
+LINKS_EACH_PARAGRAPH = False
+
+# Wrap long lines at position. 0 for no wrapping.
+BODY_WIDTH = 78
+
+# Don't show internal links (href="#local-anchor") -- corresponding link
+# targets won't be visible in the plain text file anyway.
+SKIP_INTERNAL_LINKS = True
+
+# Use inline, rather than reference, formatting for images and links
+INLINE_LINKS = True
+
+# Protect links from line breaks by surrounding them with angle brackets (in
+# addition to their square brackets)
+PROTECT_LINKS = False
+# WRAP_LINKS = True
+WRAP_LINKS = True
+
+# Wrap list items.
+WRAP_LIST_ITEMS = False
+
+# Wrap tables
+WRAP_TABLES = False
+
+# Number of pixels Google indents nested lists
+GOOGLE_LIST_INDENT = 36
+
+# Values Google and others may use to indicate bold text
+BOLD_TEXT_STYLE_VALUES = ("bold", "700", "800", "900")
+
+IGNORE_ANCHORS = False
+IGNORE_MAILTO_LINKS = False
+IGNORE_IMAGES = False
+IMAGES_AS_HTML = False
+IMAGES_TO_ALT = False
+IMAGES_WITH_SIZE = False
+IGNORE_EMPHASIS = False
+MARK_CODE = False
+DECODE_ERRORS = "strict"
+DEFAULT_IMAGE_ALT = ""
+PAD_TABLES = False
+
+# Convert links with same href and text to <href> format
+# if they are absolute links
+USE_AUTOMATIC_LINKS = True
+
+# For checking space-only lines on line 771
+RE_SPACE = re.compile(r"\s\+")
+
+RE_ORDERED_LIST_MATCHER = re.compile(r"\d+\.\s")
+RE_UNORDERED_LIST_MATCHER = re.compile(r"[-\*\+]\s")
+RE_MD_CHARS_MATCHER = re.compile(r"([\\\[\]\(\)])")
+RE_MD_CHARS_MATCHER_ALL = re.compile(r"([`\*_{}\[\]\(\)#!])")
+
+# to find links in the text
+RE_LINK = re.compile(r"(\[.*?\] ?\(.*?\))|(\[.*?\]:.*?)")
+
+# to find table separators
+RE_TABLE = re.compile(r" \| ")
+
+RE_MD_DOT_MATCHER = re.compile(
+    r"""
+    ^             # start of line
+    (\s*\d+)      # optional whitespace and a number
+    (\.)          # dot
+    (?=\s)        # lookahead assert whitespace
+    """,
+    re.MULTILINE | re.VERBOSE,
+)
+RE_MD_PLUS_MATCHER = re.compile(
+    r"""
+    ^
+    (\s*)
+    (\+)
+    (?=\s)
+    """,
+    flags=re.MULTILINE | re.VERBOSE,
+)
+RE_MD_DASH_MATCHER = re.compile(
+    r"""
+    ^
+    (\s*)
+    (-)
+    (?=\s|\-)     # followed by whitespace (bullet list, or spaced out hr)
+                  # or another dash (header or hr)
+    """,
+    flags=re.MULTILINE | re.VERBOSE,
+)
+RE_SLASH_CHARS = r"\`*_{}[]()#+-.!"
+RE_MD_BACKSLASH_MATCHER = re.compile(
+    r"""
+    (\\)          # match one slash
+    (?=[%s])      # followed by a char that requires escaping
+    """
+    % re.escape(RE_SLASH_CHARS),
+    flags=re.VERBOSE,
+)
+
+UNIFIABLE = {
+    "rsquo": "'",
+    "lsquo": "'",
+    "rdquo": '"',
+    "ldquo": '"',
+    "copy": "(C)",
+    "mdash": "--",
+    "nbsp": " ",
+    "rarr": "->",
+    "larr": "<-",
+    "middot": "*",
+    "ndash": "-",
+    "oelig": "oe",
+    "aelig": "ae",
+    "agrave": "a",
+    "aacute": "a",
+    "acirc": "a",
+    "atilde": "a",
+    "auml": "a",
+    "aring": "a",
+    "egrave": "e",
+    "eacute": "e",
+    "ecirc": "e",
+    "euml": "e",
+    "igrave": "i",
+    "iacute": "i",
+    "icirc": "i",
+    "iuml": "i",
+    "ograve": "o",
+    "oacute": "o",
+    "ocirc": "o",
+    "otilde": "o",
+    "ouml": "o",
+    "ugrave": "u",
+    "uacute": "u",
+    "ucirc": "u",
+    "uuml": "u",
+    "lrm": "",
+    "rlm": "",
+}
+
+# Format tables in HTML rather than Markdown syntax
+BYPASS_TABLES = False
+# Ignore table-related tags (table, th, td, tr) while keeping rows
+IGNORE_TABLES = False
+
+
+# Use a single line break after a block element rather than two line breaks.
+# NOTE: Requires body width setting to be 0.
+SINGLE_LINE_BREAK = False
+
+
+# Use double quotation marks when converting the <q> tag.
+OPEN_QUOTE = '"'
+CLOSE_QUOTE = '"'
+
+# Include the <sup> and <sub> tags
+INCLUDE_SUP_SUB = False
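The line-anchored escape patterns defined in config.py can be exercised on their own. The following is a minimal sketch, assuming the two regexes are copied verbatim from the hunk above; the helper name `escape_list_markers` and the sample text are hypothetical and not part of the patch.

```python
import re

# Copied from the config.py hunk above.
RE_MD_DOT_MATCHER = re.compile(
    r"""
    ^             # start of line
    (\s*\d+)      # optional whitespace and a number
    (\.)          # dot
    (?=\s)        # lookahead assert whitespace
    """,
    re.MULTILINE | re.VERBOSE,
)
RE_MD_DASH_MATCHER = re.compile(
    r"""
    ^
    (\s*)
    (-)
    (?=\s|\-)     # followed by whitespace or another dash
    """,
    flags=re.MULTILINE | re.VERBOSE,
)


def escape_list_markers(text: str) -> str:
    """Hypothetical helper: backslash-escape a leading "N." or "-" on each line."""
    text = RE_MD_DOT_MATCHER.sub(r"\1\\\2", text)
    return RE_MD_DASH_MATCHER.sub(r"\1\\\2", text)


sample = "1. not a list\n- not a bullet\nversion 1.2 stays"
escaped = escape_list_markers(sample)
print(escaped)
```

Only markers at the start of a line are escaped; the dot inside "1.2" on the last line is left alone because the pattern is anchored with `^`.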
--- a/crawl4ai/html2text/elements.py
+++ b/crawl4ai/html2text/elements.py
@@ -0,0 +1,18 @@
+from typing import Dict, Optional
+
+
+class AnchorElement:
+    __slots__ = ["attrs", "count", "outcount"]
+
+    def __init__(self, attrs: Dict[str, Optional[str]], count: int, outcount: int):
+        self.attrs = attrs
+        self.count = count
+        self.outcount = outcount
+
+
+class ListElement:
+    __slots__ = ["name", "num"]
+
+    def __init__(self, name: str, num: int):
+        self.name = name
+        self.num = num
--- a/crawl4ai/html2text/utils.py
+++ b/crawl4ai/html2text/utils.py
@@ -0,0 +1,303 @@
+import html.entities
+from typing import Dict, List, Optional
+
+from . import config
+
+unifiable_n = {
+    html.entities.name2codepoint[k]: v
+    for k, v in config.UNIFIABLE.items()
+    if k != "nbsp"
+}
+
+
+def hn(tag: str) -> int:
+    if tag[0] == "h" and len(tag) == 2:
+        n = tag[1]
+        if "0" < n <= "9":
+            return int(n)
+    return 0
+
+
+def dumb_property_dict(style: str) -> Dict[str, str]:
+    """
+    :returns: A hash of css attributes
+    """
+    return {
+        x.strip().lower(): y.strip().lower()
+        for x, y in [z.split(":", 1) for z in style.split(";") if ":" in z]
+    }
+
+
+def dumb_css_parser(data: str) -> Dict[str, Dict[str, str]]:
+    """
+    :type data: str
+
+    :returns: A hash of css selectors, each of which contains a hash of
+    css attributes.
+    :rtype: dict
+    """
+    # remove @import sentences
+    data += ";"
+    importIndex = data.find("@import")
+    while importIndex != -1:
+        data = data[0:importIndex] + data[data.find(";", importIndex) + 1 :]
+        importIndex = data.find("@import")
+
+    # parse the css. reverted from dictionary comprehension in order to
+    # support older pythons
+    pairs = [x.split("{") for x in data.split("}") if "{" in x.strip()]
+    try:
+        elements = {a.strip(): dumb_property_dict(b) for a, b in pairs}
+    except ValueError:
+        elements = {}  # not that important
+
+    return elements
+
+
+def element_style(
+    attrs: Dict[str, Optional[str]],
+    style_def: Dict[str, Dict[str, str]],
+    parent_style: Dict[str, str],
+) -> Dict[str, str]:
+    """
+    :type attrs: dict
+    :type style_def: dict
+    :type parent_style: dict
+
+    :returns: A hash of the 'final' style attributes of the element
+    :rtype: dict
+    """
+    style = parent_style.copy()
+    if "class" in attrs:
+        assert attrs["class"] is not None
+        for css_class in attrs["class"].split():
+            css_style = style_def.get("." + css_class, {})
+            style.update(css_style)
+    if "style" in attrs:
+        assert attrs["style"] is not None
+        immediate_style = dumb_property_dict(attrs["style"])
+        style.update(immediate_style)
+
+    return style
+
+
+def google_list_style(style: Dict[str, str]) -> str:
+    """
+    Finds out whether this is an ordered or unordered list
+
+    :type style: dict
+
+    :rtype: str
+    """
+    if "list-style-type" in style:
+        list_style = style["list-style-type"]
+        if list_style in ["disc", "circle", "square", "none"]:
+            return "ul"
+
+    return "ol"
+
+
+def google_has_height(style: Dict[str, str]) -> bool:
+    """
+    Check if the style of the element has the 'height' attribute
+    explicitly defined
+
+    :type style: dict
+
+    :rtype: bool
+    """
+    return "height" in style
+
+
+def google_text_emphasis(style: Dict[str, str]) -> List[str]:
+    """
+    :type style: dict
+
+    :returns: A list of all emphasis modifiers of the element
+    :rtype: list
+    """
+    emphasis = []
+    if "text-decoration" in style:
+        emphasis.append(style["text-decoration"])
+    if "font-style" in style:
+        emphasis.append(style["font-style"])
+    if "font-weight" in style:
+        emphasis.append(style["font-weight"])
+
+    return emphasis
+
+
+def google_fixed_width_font(style: Dict[str, str]) -> bool:
+    """
+    Check if the css of the current element defines a fixed width font
+
+    :type style: dict
+
+    :rtype: bool
+    """
+    font_family = ""
+    if "font-family" in style:
+        font_family = style["font-family"]
+    return "courier new" == font_family or "consolas" == font_family
+
+
+def list_numbering_start(attrs: Dict[str, Optional[str]]) -> int:
+    """
+    Extract numbering from list element attributes
+
+    :type attrs: dict
+
+    :rtype: int
+    """
+    if "start" in attrs:
+        assert attrs["start"] is not None
+        try:
+            return int(attrs["start"]) - 1
+        except ValueError:
+            pass
+
+    return 0
+
+
+def skipwrap(
+    para: str, wrap_links: bool, wrap_list_items: bool, wrap_tables: bool
+) -> bool:
+    # If it appears to contain a link
+    # don't wrap
+    if not wrap_links and config.RE_LINK.search(para):
+        return True
+    # If the text begins with four spaces or one tab, it's a code block;
+    # don't wrap
+    if para[0:4] == "    " or para[0] == "\t":
+        return True
+
+    # If the text begins with only two "--", possibly preceded by
+    # whitespace, that's an emdash; so wrap.
+    stripped = para.lstrip()
+    if stripped[0:2] == "--" and len(stripped) > 2 and stripped[2] != "-":
+        return False
+
+    # I'm not sure what this is for; I thought it was to detect lists,
+    # but there's a <br>-inside-<span> case in one of the tests that
+    # also depends upon it.
+    if stripped[0:1] in ("-", "*") and not stripped[0:2] == "**":
+        return not wrap_list_items
+
+    # If text contains a pipe character it is likely a table
+    if not wrap_tables and config.RE_TABLE.search(para):
+        return True
+
+    # If the text begins with a single -, *, or +, followed by a space,
+    # or an integer, followed by a ., followed by a space (in either
+    # case optionally preceded by whitespace), it's a list; don't wrap.
+    return bool(
+        config.RE_ORDERED_LIST_MATCHER.match(stripped)
+        or config.RE_UNORDERED_LIST_MATCHER.match(stripped)
+    )
+
+
+def escape_md(text: str) -> str:
+    """
+    Escapes markdown-sensitive characters within other markdown
+    constructs.
+    """
+    return config.RE_MD_CHARS_MATCHER.sub(r"\\\1", text)
+
+
+def escape_md_section(
+    text: str,
+    escape_backslash: bool = True,
+    snob: bool = False,
+    escape_dot: bool = True,
+    escape_plus: bool = True,
+    escape_dash: bool = True
+) -> str:
+    """
+    Escapes markdown-sensitive characters across whole document sections.
+    Each escaping operation can be controlled individually.
+    """
+    if escape_backslash:
+        text = config.RE_MD_BACKSLASH_MATCHER.sub(r"\\\1", text)
+
+    if snob:
+        text = config.RE_MD_CHARS_MATCHER_ALL.sub(r"\\\1", text)
+
+    if escape_dot:
+        text = config.RE_MD_DOT_MATCHER.sub(r"\1\\\2", text)
+
+    if escape_plus:
+        text = config.RE_MD_PLUS_MATCHER.sub(r"\1\\\2", text)
+
+    if escape_dash:
+        text = config.RE_MD_DASH_MATCHER.sub(r"\1\\\2", text)
+
+    return text
+
+def reformat_table(lines: List[str], right_margin: int) -> List[str]:
+    """
+    Given the lines of a table
+    pads the cells and returns the new lines
+    """
+    # find the maximum width of the columns
+    max_width = [len(x.rstrip()) + right_margin for x in lines[0].split("|")]
+    max_cols = len(max_width)
+    for line in lines:
+        cols = [x.rstrip() for x in line.split("|")]
+        num_cols = len(cols)
+
+        # don't drop any data if colspan attributes result in unequal lengths
+        if num_cols < max_cols:
+            cols += [""] * (max_cols - num_cols)
+        elif max_cols < num_cols:
+            max_width += [len(x) + right_margin for x in cols[-(num_cols - max_cols) :]]
+            max_cols = num_cols
+
+        max_width = [
+            max(len(x) + right_margin, old_len) for x, old_len in zip(cols, max_width)
+        ]
+
+    # reformat
+    new_lines = []
+    for line in lines:
+        cols = [x.rstrip() for x in line.split("|")]
+        if set(line.strip()) == set("-|"):
+            filler = "-"
+            new_cols = [
+                x.rstrip() + (filler * (M - len(x.rstrip())))
+                for x, M in zip(cols, max_width)
+            ]
+            new_lines.append("|-" + "|".join(new_cols) + "|")
+        else:
+            filler = " "
+            new_cols = [
+                x.rstrip() + (filler * (M - len(x.rstrip())))
+                for x, M in zip(cols, max_width)
+            ]
+            new_lines.append("| " + "|".join(new_cols) + "|")
+    return new_lines
+
+
+def pad_tables_in_text(text: str, right_margin: int = 1) -> str:
+    """
+    Provide padding for tables in the text
+    """
+    lines = text.split("\n")
+    table_buffer = []  # type: List[str]
+    table_started = False
+    new_lines = []
+    for line in lines:
+        # Toggle table started
+        if config.TABLE_MARKER_FOR_PAD in line:
+            table_started = not table_started
+            if not table_started:
+                table = reformat_table(table_buffer, right_margin)
+                new_lines.extend(table)
+                table_buffer = []
+                new_lines.append("")
+            continue
+        # Process lines
+        if table_started:
+            table_buffer.append(line)
+        else:
+            new_lines.append(line)
+    return "\n".join(new_lines)
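The style helpers in html2text/utils.py layer three sources in order: the parent's style, then class rules from the stylesheet, then the inline `style` attribute. The sketch below illustrates that precedence under stated assumptions: the two parser functions are condensed copies of the ones in the hunk above (the `@import` stripping is omitted), and the CSS text and class names are made up for illustration.

```python
from typing import Dict


def dumb_property_dict(style: str) -> Dict[str, str]:
    """Parse "a: b; c: d" into a lower-cased property dict."""
    return {
        x.strip().lower(): y.strip().lower()
        for x, y in [z.split(":", 1) for z in style.split(";") if ":" in z]
    }


def dumb_css_parser(data: str) -> Dict[str, Dict[str, str]]:
    """Condensed copy: maps each selector to its property dict."""
    pairs = [x.split("{") for x in data.split("}") if "{" in x.strip()]
    try:
        return {a.strip(): dumb_property_dict(b) for a, b in pairs}
    except ValueError:
        return {}


# Hypothetical stylesheet and element attributes.
css = ".c1 { font-weight: 700; color: red } .c2 { font-style: italic }"
style_def = dumb_css_parser(css)

# Same order as element_style: parent style, class rules, inline style.
style = {"color": "black"}
for css_class in "c1 c2".split():
    style.update(style_def.get("." + css_class, {}))
style.update(dumb_property_dict("color: Blue"))
print(style)
```

The inline declaration is applied last, so it overrides both the class rule and the inherited value for `color`.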
--- a/crawl4ai/model_loader.py
+++ b/crawl4ai/model_loader.py
@@ -72,10 +72,18 @@ def load_bert_base_uncased():
    return tokenizer, model

@lru_cache()
-def load_bge_small_en_v1_5():
+def load_HF_embedding_model(model_name="BAAI/bge-small-en-v1.5") -> tuple:
+    """Load the Hugging Face model for embedding.
+    
+    Args:
+        model_name (str, optional): The model name to load. Defaults to "BAAI/bge-small-en-v1.5".
+        
+    Returns:
+        tuple: The tokenizer and model.
+    """
    from transformers import BertTokenizer, BertModel, AutoTokenizer, AutoModel
-    tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-small-en-v1.5', resume_download=None)
-    model = AutoModel.from_pretrained('BAAI/bge-small-en-v1.5', resume_download=None)
+    tokenizer = AutoTokenizer.from_pretrained(model_name, resume_download=None)
+    model = AutoModel.from_pretrained(model_name, resume_download=None)
    model.eval()
    model, device = set_model_device(model)
    return tokenizer, model
--- a/crawl4ai/models.py
+++ b/crawl4ai/models.py
@@ -14,9 +14,11 @@ class CrawlResult(BaseModel):
    links: Dict[str, List[Dict]] = {}
    screenshot: Optional[str] = None
    markdown: Optional[str] = None
+    fit_markdown: Optional[str] = None
+    fit_html: Optional[str] = None
    extracted_content: Optional[str] = None
    metadata: Optional[dict] = None
    error_message: Optional[str] = None
    session_id: Optional[str] = None
-    responser_headers: Optional[dict] = None
+    response_headers: Optional[dict] = None
    status_code: Optional[int] = None
--- a/crawl4ai/prompts.py
+++ b/crawl4ai/prompts.py
@@ -1,4 +1,4 @@
-PROMPT_EXTRACT_BLOCKS = """YHere is the URL of the webpage:
+PROMPT_EXTRACT_BLOCKS = """Here is the URL of the webpage:
 <url>{URL}</url>

 And here is the cleaned HTML content of that webpage:
@@ -79,7 +79,7 @@ To generate the JSON objects:
 2. For each block:
   a. Assign it an index based on its order in the content.
   b. Analyze the content and generate ONE semantic tag that describe what the block is about.
-   c. Extract the text content, EXACTLY SAME AS GIVE DATA, clean it up if needed, and store it as a list of strings in the "content" field.
+   c. Extract the text content, EXACTLY THE SAME AS THE GIVEN DATA, clean it up if needed, and store it as a list of strings in the "content" field.

 3. Ensure that the order of the JSON objects matches the order of the blocks as they appear in the original HTML content.

--- a/crawl4ai/utils.py
+++ b/crawl4ai/utils.py
@@ -1,12 +1,12 @@
 import time
 from concurrent.futures import ThreadPoolExecutor, as_completed
 from bs4 import BeautifulSoup, Comment, element, Tag, NavigableString
-import html2text
 import json
 import html
 import re
 import os
-from html2text import HTML2Text
+import platform
+from .html2text import HTML2Text
 from .prompts import PROMPT_EXTRACT_BLOCKS
 from .config import *
 from pathlib import Path
@@ -18,6 +18,46 @@ from requests.exceptions import InvalidSchema
 class InvalidCSSSelectorError(Exception):
    pass

+def calculate_semaphore_count():
+    cpu_count = os.cpu_count() or 1  # os.cpu_count() may return None
+    memory_gb = get_system_memory() / (1024 ** 3)  # Convert to GB
+    base_count = max(1, cpu_count // 2)
+    memory_based_cap = max(1, int(memory_gb / 2))  # Assume 2GB per instance, never below 1
+    return min(base_count, memory_based_cap)
+
+def get_system_memory():
+    system = platform.system()
+    if system == "Linux":
+        with open('/proc/meminfo', 'r') as mem:
+            for line in mem:
+                if line.startswith('MemTotal:'):
+                    return int(line.split()[1]) * 1024  # Convert KB to bytes
+    elif system == "Darwin":  # macOS
+        import subprocess
+        output = subprocess.check_output(['sysctl', '-n', 'hw.memsize']).decode('utf-8')
+        return int(output.strip())
+    elif system == "Windows":
+        import ctypes
+        kernel32 = ctypes.windll.kernel32
+        c_ulonglong = ctypes.c_ulonglong
+        class MEMORYSTATUSEX(ctypes.Structure):
+            _fields_ = [
+                ('dwLength', ctypes.c_ulong),
+                ('dwMemoryLoad', ctypes.c_ulong),
+                ('ullTotalPhys', c_ulonglong),
+                ('ullAvailPhys', c_ulonglong),
+                ('ullTotalPageFile', c_ulonglong),
+                ('ullAvailPageFile', c_ulonglong),
+                ('ullTotalVirtual', c_ulonglong),
+                ('ullAvailVirtual', c_ulonglong),
+                ('ullAvailExtendedVirtual', c_ulonglong),
+            ]
+        memoryStatus = MEMORYSTATUSEX()
+        memoryStatus.dwLength = ctypes.sizeof(MEMORYSTATUSEX)
+        kernel32.GlobalMemoryStatusEx(ctypes.byref(memoryStatus))
+        return memoryStatus.ullTotalPhys
+    else:
+        raise OSError("Unsupported operating system")

 def get_home_folder():
    home_folder = os.path.join(Path.home(), ".crawl4ai")
@@ -90,7 +130,7 @@ def split_and_parse_json_objects(json_string):
    return parsed_objects, unparsed_segments

 def sanitize_html(html):
-    # Replace all weird and special characters with an empty string
+    # Replace all unwanted and special characters with an empty string
    sanitized_html = html
    # sanitized_html = re.sub(r'[^\w\s.,;:!?=\[\]{}()<>\/\\\-"]', '', html)

@@ -141,9 +181,22 @@ def escape_json_string(s):
 class CustomHTML2Text(HTML2Text):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
-        self.ignore_links = True
        self.inside_pre = False
        self.inside_code = False
+        
+        self.skip_internal_links = False
+        self.single_line_break = False
+        self.mark_code = False
+        self.include_sup_sub = False
+        self.body_width = 0
+        self.ignore_mailto_links = True
+        self.ignore_links = False
+        self.escape_backslash = False
+        self.escape_dot = False
+        self.escape_plus = False
+        self.escape_dash = False
+        self.escape_snob = False
+

    def handle_tag(self, tag, attrs, start):
        if tag == 'pre':
@@ -153,6 +206,10 @@ class CustomHTML2Text(HTML2Text):
            else:
                self.o('\n```')
                self.inside_pre = False
+        elif tag in ["h1", "h2", "h3", "h4", "h5", "h6"]:
+            pass
+
+
        # elif tag == 'code' and not self.inside_pre:
        #     if start:
        #         if not self.inside_pre:
@@ -260,7 +317,7 @@ def get_content_of_website(url, html, word_count_threshold = MIN_WORD_THRESHOLD,
            if tag.name != 'img':
                tag.attrs = {}

-        # Extract all img tgas inti [{src: '', alt: ''}]
+        # Extract all img tags into [{src: '', alt: ''}]
        media = {
            'images': [],
            'videos': [],
@@ -298,7 +355,7 @@ def get_content_of_website(url, html, word_count_threshold = MIN_WORD_THRESHOLD,
                img.decompose()


-        # Create a function that replace content of all"pre" tage with its inner text
+        # Create a function that replaces the content of all "pre" tags with their inner text
        def replace_pre_tags_with_text(node):
            for child in node.find_all('pre'):
                # set child inner html to its text
@@ -461,7 +518,7 @@ def get_content_of_website_optimized(url: str, html: str, word_count_threshold:
            current_tag = tag
            while current_tag:
                current_tag = current_tag.parent
-                # Get the text content of the parent tag
+                # Get the text content from the parent tag
                if current_tag:
                    text_content = current_tag.get_text(separator=' ',strip=True)
                    # Check if the text content has at least word_count_threshold
@@ -470,88 +527,88 @@ def get_content_of_website_optimized(url: str, html: str, word_count_threshold:
            return None

    def process_image(img, url, index, total_images):
-            #Check if an image has valid display and inside undesired html elements
-            def is_valid_image(img, parent, parent_classes):
-                style = img.get('style', '')
-                src = img.get('src', '')
-                classes_to_check = ['button', 'icon', 'logo']
-                tags_to_check = ['button', 'input']
-                return all([
-                    'display:none' not in style,
-                    src,
-                    not any(s in var for var in [src, img.get('alt', ''), *parent_classes] for s in classes_to_check),
-                    parent.name not in tags_to_check
-                ])
+        # Check that an image is displayed and is not inside undesired HTML elements
+        def is_valid_image(img, parent, parent_classes):
+            style = img.get('style', '')
+            src = img.get('src', '')
+            classes_to_check = ['button', 'icon', 'logo']
+            tags_to_check = ['button', 'input']
+            return all([
+                'display:none' not in style,
+                src,
+                not any(s in var for var in [src, img.get('alt', ''), *parent_classes] for s in classes_to_check),
+                parent.name not in tags_to_check
+            ])

-            #Score an image for it's usefulness
-            def score_image_for_usefulness(img, base_url, index, images_count):
-                # Function to parse image height/width value and units
-                def parse_dimension(dimension):
-                    if dimension:
-                        match = re.match(r"(\d+)(\D*)", dimension)
-                        if match:
-                            number = int(match.group(1))
-                            unit = match.group(2) or 'px'  # Default unit is 'px' if not specified
-                            return number, unit
-                    return None, None
+        # Score an image for its usefulness
+        def score_image_for_usefulness(img, base_url, index, images_count):
+            # Function to parse image height/width value and units
+            def parse_dimension(dimension):
+                if dimension:
+                    match = re.match(r"(\d+)(\D*)", dimension)
+                    if match:
+                        number = int(match.group(1))
+                        unit = match.group(2) or 'px'  # Default unit is 'px' if not specified
+                        return number, unit
+                return None, None

-                # Fetch image file metadata to extract size and extension
-                def fetch_image_file_size(img, base_url):
-                    #If src is relative path construct full URL, if not it may be CDN URL
-                    img_url = urljoin(base_url,img.get('src'))
-                    try:
-                        response = requests.head(img_url)
-                        if response.status_code == 200:
-                            return response.headers.get('Content-Length',None)
-                        else:
-                            print(f"Failed to retrieve file size for {img_url}")
-                            return None
-                    except InvalidSchema as e:
+            # Fetch image file metadata to extract size and extension
+            def fetch_image_file_size(img, base_url):
+                # If src is a relative path, build the full URL; otherwise it may be a CDN URL
+                img_url = urljoin(base_url,img.get('src'))
+                try:
+                    response = requests.head(img_url)
+                    if response.status_code == 200:
+                        return response.headers.get('Content-Length',None)
+                    else:
+                        print(f"Failed to retrieve file size for {img_url}")
                        return None
-                    finally:
-                        return
+                except InvalidSchema:
+                    return None
+                except requests.RequestException:  # network failure: size unknown
+                    return None

-                image_height = img.get('height')
-                height_value, height_unit = parse_dimension(image_height)
-                image_width =  img.get('width')
-                width_value, width_unit = parse_dimension(image_width)
-                image_size = 0 #int(fetch_image_file_size(img,base_url) or 0)
-                image_format = os.path.splitext(img.get('src',''))[1].lower()
-                # Remove . from format
-                image_format = image_format.strip('.')
-                score = 0
-                if height_value:
-                    if height_unit == 'px' and height_value > 150:
-                        score += 1
-                    if height_unit in ['%','vh','vmin','vmax'] and height_value >30:
-                        score += 1
-                if width_value:
-                    if width_unit == 'px' and width_value > 150:
-                        score += 1
-                    if width_unit in ['%','vh','vmin','vmax'] and width_value >30:
-                        score += 1
-                if image_size > 10000:
+            image_height = img.get('height')
+            height_value, height_unit = parse_dimension(image_height)
+            image_width =  img.get('width')
+            width_value, width_unit = parse_dimension(image_width)
+            image_size = 0 #int(fetch_image_file_size(img,base_url) or 0)
+            image_format = os.path.splitext(img.get('src',''))[1].lower()
+            # Remove . from format
+            image_format = image_format.strip('.')
+            score = 0
+            if height_value:
+                if height_unit == 'px' and height_value > 150:
                    score += 1
-                if img.get('alt') != '':
-                    score+=1
-                if any(image_format==format for format in ['jpg','png','webp']):
-                    score+=1
-                if index/images_count<0.5:
-                    score+=1
-                return score
+                if height_unit in ['%','vh','vmin','vmax'] and height_value >30:
+                    score += 1
+            if width_value:
+                if width_unit == 'px' and width_value > 150:
+                    score += 1
+                if width_unit in ['%','vh','vmin','vmax'] and width_value >30:
+                    score += 1
+            if image_size > 10000:
+                score += 1
+            if img.get('alt'):
+                score+=1
+            if any(image_format==format for format in ['jpg','png','webp']):
+                score+=1
+            if index/images_count<0.5:
+                score+=1
+            return score

-            if not is_valid_image(img, img.parent, img.parent.get('class', [])):
-                return None
-            score = score_image_for_usefulness(img, url, index, total_images)
-            if score <= IMAGE_SCORE_THRESHOLD:
-                return None
-            return {
-                'src': img.get('src', ''),
-                'alt': img.get('alt', ''),
-                'desc': find_closest_parent_with_useful_text(img),
-                'score': score,
-                'type': 'image'
-            }
+        if not is_valid_image(img, img.parent, img.parent.get('class', [])):
+            return None
+        score = score_image_for_usefulness(img, url, index, total_images)
+        if score <= IMAGE_SCORE_THRESHOLD:
+            return None
+        return {
+            'src': img.get('src', '').replace('\\"', '"').strip(),
+            'alt': img.get('alt', ''),
+            'desc': find_closest_parent_with_useful_text(img),
+            'score': score,
+            'type': 'image'
+        }

    def process_element(element: element.PageElement) -> bool:
        try:
@@ -651,8 +708,8 @@ def get_content_of_website_optimized(url: str, html: str, word_count_threshold:
    for img in imgs:
        src = img.get('src', '')
        if base64_pattern.match(src):
-            # Replace base64 data with empty string
            img['src'] = base64_pattern.sub('', src)
+
    cleaned_html = str(body).replace('\n\n', '\n').replace('  ', ' ')
    cleaned_html = sanitize_html(cleaned_html)

@@ -734,7 +791,14 @@ def extract_xml_data(tags, string):
    return data
    
 # Function to perform the completion with exponential backoff
-def perform_completion_with_backoff(provider, prompt_with_variables, api_token, json_response = False, base_url=None):
+def perform_completion_with_backoff(
+    provider, 
+    prompt_with_variables, 
+    api_token, 
+    json_response = False, 
+    base_url=None,
+    **kwargs
+    ):
    from litellm import completion 
    from litellm.exceptions import RateLimitError
    max_attempts = 3
@@ -743,6 +807,9 @@ def perform_completion_with_backoff(provider, prompt_with_variables, api_token,
    extra_args = {}
    if json_response:
        extra_args["response_format"] = { "type": "json_object" }
+        
+    if kwargs.get("extra_args"):
+        extra_args.update(kwargs["extra_args"])
    
    for attempt in range(max_attempts):
        try:
@@ -913,4 +980,53 @@ def format_html(html_string):
    soup = BeautifulSoup(html_string, 'html.parser')
    return soup.prettify()

+def normalize_url(href, base_url):
+    """Normalize URLs to ensure consistent format"""
+    # Extract protocol and domain from base URL
+    try:
+        base_parts = base_url.split('/')
+        protocol = base_parts[0]
+        domain = base_parts[2]
+    except IndexError:
+        raise ValueError(f"Invalid base URL format: {base_url}")
+    
+    # Handle special protocols
+    special_protocols = {'mailto:', 'tel:', 'ftp:', 'file:', 'data:', 'javascript:'}
+    if any(href.lower().startswith(proto) for proto in special_protocols):
+        return href.strip()
+        
+    # Handle anchor links
+    if href.startswith('#'):
+        return f"{base_url}{href}"
+        
+    # Handle protocol-relative URLs
+    if href.startswith('//'):
+        return f"{protocol}{href}"
+        
+    # Handle root-relative URLs
+    if href.startswith('/'):
+        return f"{protocol}//{domain}{href}"
+        
+    # Handle relative URLs
+    if not href.startswith(('http://', 'https://')):
+        # Remove leading './' if present
+        href = href[2:] if href.startswith('./') else href
+        return f"{protocol}//{domain}/{href}"
+        
+    return href.strip()

+def is_external_url(url, base_domain):
+    """Determine if a URL is external"""
+    special_protocols = {'mailto:', 'tel:', 'ftp:', 'file:', 'data:', 'javascript:'}
+    if any(url.lower().startswith(proto) for proto in special_protocols):
+        return True
+        
+    try:
+        # Handle URLs with protocol
+        if url.startswith(('http://', 'https://')):
+            url_domain = url.split('/')[2]
+            return base_domain.lower() not in url_domain.lower()
+    except IndexError:
+        return False
+        
+    return False
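Review note on `normalize_url` above: its branches (special protocols, anchors, protocol-relative, root-relative, and relative links) can be cross-checked against the standard library. The following is a minimal sketch, assuming `urllib.parse.urljoin` is an acceptable reference. The helper name `reference_normalize` is hypothetical and is not part of this diff.

```python
from urllib.parse import urljoin

# Same special-protocol list as in the diff, as a tuple for str.startswith
SPECIAL_PROTOCOLS = ('mailto:', 'tel:', 'ftp:', 'file:', 'data:', 'javascript:')


def reference_normalize(href: str, base_url: str) -> str:
    """Hypothetical reference resolver built on urljoin (not the PR's code)."""
    href = href.strip()
    # Leave non-navigational schemes untouched
    if href.lower().startswith(SPECIAL_PROTOCOLS):
        return href
    # urljoin covers anchors, protocol-relative, root-relative,
    # relative, and already-absolute URLs
    return urljoin(base_url, href)
```

For a page below the site root, such as `https://example.com/docs/page`, `urljoin` resolves `./img/x.png` against `/docs/`, whereas the function in this diff appends relative paths to the domain root, so the two results can differ.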
--- a/crawl4ai/web_crawler.py
+++ b/crawl4ai/web_crawler.py
@@ -12,6 +12,7 @@ from typing import List
 from concurrent.futures import ThreadPoolExecutor
 from .config import *
 import warnings
+import json
 warnings.filterwarnings("ignore", message='Field "model_name" has conflict with protected namespace "model_".')


--- a/docs/details/extraction.md
+++ b/docs/details/extraction.md
@@ -0,0 +1,157 @@
+### Extraction Strategies
+
+#### 1. LLMExtractionStrategy
+```python
+LLMExtractionStrategy(
+    # Core Parameters
+    provider: str = DEFAULT_PROVIDER,  # LLM provider (e.g., "openai/gpt-4", "huggingface/...", "ollama/...")
+    api_token: Optional[str] = None,  # API token for the provider
+    instruction: str = None,  # Custom instruction for extraction
+    schema: Dict = None,  # Pydantic model schema for structured extraction
+    extraction_type: str = "block",  # Type of extraction: "block" or "schema"
+    
+    # Chunking Parameters
+    chunk_token_threshold: int = CHUNK_TOKEN_THRESHOLD,  # Maximum tokens per chunk
+    overlap_rate: float = OVERLAP_RATE,  # Overlap between chunks
+    word_token_rate: float = WORD_TOKEN_RATE,  # Conversion rate from words to tokens
+    apply_chunking: bool = True,  # Whether to apply text chunking
+    
+    # API Configuration
+    base_url: str = None,  # Base URL for API calls
+    api_base: str = None,  # Alternative base URL
+    extra_args: Dict = {},  # Additional provider-specific arguments
+    
+    verbose: bool = False  # Enable verbose logging
+)
+```
+
+Usage Example:
+```python
+class NewsArticle(BaseModel):
+    title: str
+    content: str
+
+strategy = LLMExtractionStrategy(
+    provider="ollama/nemotron",
+    api_token="your-token",
+    schema=NewsArticle.schema(),
+    instruction="Extract news article content with title and main text"
+)
+
+result = await crawler.arun(url="https://example.com", extraction_strategy=strategy)
+```
+
+#### 2. JsonCssExtractionStrategy
+```python
+JsonCssExtractionStrategy(
+    schema: Dict[str, Any],  # Schema defining extraction rules
+    verbose: bool = False  # Enable verbose logging
+)
+
+# Schema Structure
+schema = {
+    "name": str,  # Name of the extraction schema
+    "baseSelector": str,  # CSS selector for base elements
+    "fields": [
+        {
+            "name": str,  # Field name
+            "selector": str,  # CSS selector
+            "type": str,  # Field type: "text", "attribute", "html", "regex", "nested", "list", "nested_list"
+            "attribute": str,  # For type="attribute"
+            "pattern": str,  # For type="regex"
+            "transform": str,  # Optional: "lowercase", "uppercase", "strip"
+            "default": Any,  # Default value if extraction fails
+            "fields": List[Dict],  # For nested/list types
+        }
+    ]
+}
+```
+
+Usage Example:
+```python
+schema = {
+    "name": "News Articles",
+    "baseSelector": "article.news-item",
+    "fields": [
+        {
+            "name": "title",
+            "selector": "h1",
+            "type": "text",
+            "transform": "strip"
+        },
+        {
+            "name": "date",
+            "selector": ".date",
+            "type": "attribute",
+            "attribute": "datetime"
+        }
+    ]
+}
+
+strategy = JsonCssExtractionStrategy(schema)
+result = await crawler.arun(url="https://example.com", extraction_strategy=strategy)
+```
+
+#### 3. CosineStrategy
+```python
+CosineStrategy(
+    # Content Filtering
+    semantic_filter: str = None,  # Keyword filter for document filtering
+    word_count_threshold: int = 10,  # Minimum words per cluster
+    sim_threshold: float = 0.3,  # Similarity threshold for filtering
+    
+    # Clustering Parameters
+    max_dist: float = 0.2,  # Maximum distance for clustering
+    linkage_method: str = 'ward',  # Clustering linkage method
+    top_k: int = 3,  # Number of top categories to extract
+    
+    # Model Configuration
+    model_name: str = 'sentence-transformers/all-MiniLM-L6-v2',  # Embedding model
+    
+    verbose: bool = False  # Enable verbose logging
+)
+```
+
+### Chunking Strategies
+
+#### 1. RegexChunking
+```python
+RegexChunking(
+    patterns: List[str] = None  # List of regex patterns for splitting text
+    # Default pattern: [r'\n\n']
+)
+```
+
+Usage Example:
+```python
+chunker = RegexChunking(patterns=[r'\n\n', r'\.\s+'])  # Split on double newlines and sentences
+chunks = chunker.chunk(text)
+```
+
+#### 2. SlidingWindowChunking
+```python
+SlidingWindowChunking(
+    window_size: int = 100,  # Size of the window in words
+    step: int = 50,  # Number of words to slide the window
+)
+```
+
+Usage Example:
+```python
+chunker = SlidingWindowChunking(window_size=200, step=100)
+chunks = chunker.chunk(text)  # Creates overlapping chunks of 200 words, moving 100 words at a time
+```
+
+#### 3. OverlappingWindowChunking
+```python
+OverlappingWindowChunking(
+    window_size: int = 1000,  # Size of each chunk in words
+    overlap: int = 100  # Number of words to overlap between chunks
+)
+```
+
+Usage Example:
+```python
+chunker = OverlappingWindowChunking(window_size=500, overlap=50)
+chunks = chunker.chunk(text)  # Creates 500-word chunks with 50-word overlap
+```
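Review note on the chunking section above: the relationship between `window_size`, `step`, and `overlap` may be easier to see in plain Python. This is a sketch only; the function names `sliding_window_chunks` and `overlapping_chunks` are hypothetical, and the library's classes may treat edge cases differently.

```python
from typing import List


def sliding_window_chunks(text: str, window_size: int = 100, step: int = 50) -> List[str]:
    """Hypothetical word-window chunker; not the library's implementation."""
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    words = text.split()
    if len(words) <= window_size:
        # Short texts fit in a single chunk (or none, if empty)
        return [" ".join(words)] if words else []
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window_size]))
        if start + window_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def overlapping_chunks(text: str, window_size: int = 1000, overlap: int = 100) -> List[str]:
    """An overlap of N words is equivalent to sliding by window_size - N words."""
    if overlap >= window_size:
        raise ValueError("overlap must be smaller than window_size")
    return sliding_window_chunks(text, window_size, step=window_size - overlap)
```

Under this reading, `SlidingWindowChunking(window_size=200, step=100)` and `OverlappingWindowChunking(window_size=200, overlap=100)` would describe the same windows.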
--- a/docs/details/feature_lists.md
+++ b/docs/details/feature_lists.md
@@ -0,0 +1,175 @@
+# Features
+
+## Current Features
+1. Async-first architecture for high-performance web crawling
+2. Built-in anti-bot detection bypass ("magic mode")
+3. Multiple browser engine support (Chromium, Firefox, WebKit)
+4. Smart session management with automatic cleanup
+5. Automatic content cleaning and relevance scoring
+6. Built-in markdown generation with formatting preservation
+7. Intelligent image scoring and filtering
+8. Automatic popup and overlay removal
+9. Smart wait conditions (CSS/JavaScript based)
+10. Multi-provider LLM integration (OpenAI, HuggingFace, Ollama)
+11. Schema-based structured data extraction
+12. Automated iframe content processing
+13. Intelligent link categorization (internal/external)
+14. Multiple chunking strategies for large content
+15. Real-time HTML cleaning and sanitization
+16. Automatic screenshot capabilities
+17. Social media link filtering
+18. Semantic similarity-based content clustering
+19. Human behavior simulation for anti-bot bypass
+20. Proxy support with authentication
+21. Automatic resource cleanup
+22. Custom CSS selector-based extraction
+23. Automatic content relevance scoring ("fit" content)
+24. Recursive website crawling capabilities
+25. Flexible hook system for customization
+26. Built-in caching system
+27. Domain-based content filtering
+28. Dynamic content handling with JavaScript execution
+29. Automatic media content extraction and classification
+30. Metadata extraction and processing
+31. Customizable HTML to Markdown conversion
+32. Token-aware content chunking for LLM processing
+33. Automatic response header and status code handling
+34. Browser fingerprint customization
+35. Multiple extraction strategies (LLM, CSS, Cosine, XPath)
+36. Automatic error image generation for failed screenshots
+37. Smart content overlap handling for large texts
+38. Built-in rate limiting for batch processing
+39. Automatic cookie handling
+40. Browser Console logging and debugging capabilities
+
+## Technical Features
+• Browser Management
+  - Asynchronous browser control
+  - Multi-browser support (Chromium, Firefox, WebKit)
+  - Headless mode support
+  - Browser cleanup and resource management
+  - Custom browser arguments and configuration
+  - Context management with `__aenter__` and `__aexit__`
+
+• Session Handling
+  - Session management with TTL (Time To Live)
+  - Session reuse capabilities
+  - Session cleanup for expired sessions
+  - Session-based context preservation
+
+• Stealth Features
+  - Playwright stealth configuration
+  - Navigator properties override
+  - WebDriver detection evasion
+  - Chrome app simulation
+  - Plugin simulation
+  - Language preferences simulation
+  - Hardware concurrency simulation
+  - Media codecs simulation
+
+• Network Features
+  - Proxy support with authentication
+  - Custom headers management
+  - Cookie handling
+  - Response header capture
+  - Status code tracking
+  - Network idle detection
+
+• Page Interaction
+  - Smart wait functionality for multiple conditions
+  - CSS selector-based waiting
+  - JavaScript condition waiting
+  - Custom JavaScript execution
+  - User interaction simulation (mouse/keyboard)
+  - Page scrolling
+  - Timeout management
+  - Load state monitoring
+
+• Content Processing
+  - HTML content extraction
+  - Iframe processing and content extraction
+  - Delayed content retrieval
+  - Content caching
+  - Cache file management
+  - HTML cleaning and processing
+
+• Image Handling
+  - Screenshot capabilities (full page)
+  - Base64 encoding of screenshots
+  - Image dimension updating
+  - Image filtering (size/visibility)
+  - Error image generation
+  - Natural width/height preservation
+
+• Overlay Management
+  - Popup removal
+  - Cookie notice removal
+  - Newsletter dialog removal
+  - Modal removal
+  - Fixed position element removal
+  - Z-index based overlay detection
+  - Visibility checking
+
+• Hook System
+  - Browser creation hooks
+  - User agent update hooks
+  - Execution start hooks
+  - Navigation hooks (before/after goto)
+  - HTML retrieval hooks
+  - HTML return hooks
+
+• Error Handling
+  - Browser error catching
+  - Network error handling
+  - Timeout handling
+  - Screenshot error recovery
+  - Invalid selector handling
+  - General exception management
+
+• Performance Features
+  - Concurrent URL processing
+  - Semaphore-based rate limiting
+  - Async gathering of results
+  - Resource cleanup
+  - Memory management
+
+• Debug Features
+  - Console logging
+  - Page error logging
+  - Verbose mode
+  - Error message generation
+  - Warning system
+
+• Security Features
+  - Certificate error handling
+  - Sandbox configuration
+  - GPU handling
+  - CSP (Content Security Policy) compliant waiting
+
+• Configuration
+  - User agent customization
+  - Viewport configuration
+  - Timeout configuration
+  - Browser type selection
+  - Proxy configuration
+  - Header configuration
+
+• Data Models
+  - Pydantic model for responses
+  - Type hints throughout code
+  - Structured response format
+  - Optional response fields
+
+• File System Integration
+  - Cache directory management
+  - File path handling
+  - Cache metadata storage
+  - File read/write operations
+
+• Metadata Handling
+  - Response headers capture
+  - Status code tracking
+  - Cache metadata
+  - Session tracking
+  - Timestamp management
+
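Review note on the "Performance Features" list above: "semaphore-based rate limiting" combined with "async gathering of results" usually looks like the pattern below. This is a generic `asyncio` sketch, not the crawler's code. `gather_with_limit` and its `fetch` callback are hypothetical names; in practice `fetch` might wrap a call such as `crawler.arun`.

```python
import asyncio
from typing import Any, Awaitable, Callable, List, Sequence


async def gather_with_limit(
    urls: Sequence[str],
    fetch: Callable[[str], Awaitable[Any]],
    limit: int = 5,
) -> List[Any]:
    """Run fetch(url) for every URL with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url: str) -> Any:
        # Each task waits here until one of the `limit` slots is free
        async with semaphore:
            return await fetch(url)

    # gather preserves input order in its result list
    return await asyncio.gather(*(bounded(url) for url in urls))
```

The semaphore caps concurrency rather than requests per second, so a time-based limiter would be needed if a site enforces a strict request rate.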
--- a/docs/details/features.md
+++ b/docs/details/features.md
@@ -0,0 +1,150 @@
+### 1. Basic Web Crawling
+```python
+async with AsyncWebCrawler() as crawler:
+    result = await crawler.arun(url="https://example.com")
+    print(result.markdown)  # Get clean markdown content
+    print(result.html)      # Get raw HTML
+    print(result.cleaned_html)  # Get cleaned HTML
+```
+
+### 2. Browser Control Options
+- Multiple Browser Support
+```python
+# Choose between different browser engines
+crawler = AsyncWebCrawler(browser_type="firefox")  # or "chromium", "webkit"
+crawler = AsyncWebCrawler(headless=False)  # For visible browser
+```
+
+- Proxy Configuration
+```python
+crawler = AsyncWebCrawler(proxy="http://proxy.example.com:8080")
+# Or with authentication
+crawler = AsyncWebCrawler(proxy_config={
+    "server": "http://proxy.example.com:8080",
+    "username": "user",
+    "password": "pass"
+})
+```
+
+### 3. Content Selection & Filtering
+- CSS Selector Support
+```python
+result = await crawler.arun(
+    url="https://example.com",
+    css_selector=".main-content"  # Extract specific content
+)
+```
+
+- Content Filtering Options
+```python
+result = await crawler.arun(
+    url="https://example.com",
+    word_count_threshold=10,  # Minimum words per block
+    excluded_tags=['form', 'header'],  # Tags to exclude
+    exclude_external_links=True,  # Remove external links
+    exclude_social_media_links=True,  # Remove social media links
+    exclude_external_images=True  # Remove external images
+)
+```
+
+### 4. Dynamic Content Handling
+- JavaScript Execution
+```python
+result = await crawler.arun(
+    url="https://example.com",
+    js_code="window.scrollTo(0, document.body.scrollHeight)"  # Execute custom JS
+)
+```
+
+- Wait Conditions
+```python
+result = await crawler.arun(
+    url="https://example.com",
+    wait_for="css:.my-element",  # Wait for an element to appear
+    # Alternatively, wait for a JavaScript condition (pass only one wait_for):
+    # wait_for="js:() => document.readyState === 'complete'"
+)
+```
+
+### 5. Anti-Bot Protection Handling
+```python
+result = await crawler.arun(
+    url="https://example.com",
+    simulate_user=True,  # Simulate human behavior
+    override_navigator=True,  # Mask automation signals
+    magic=True  # Enable all anti-detection features
+)
+```
+
+### 6. Session Management
+```python
+session_id = "my_session"
+result1 = await crawler.arun(url="https://example.com/page1", session_id=session_id)
+result2 = await crawler.arun(url="https://example.com/page2", session_id=session_id)
+await crawler.crawler_strategy.kill_session(session_id)
+```
+
+### 7. Media Handling
+- Screenshot Capture
+```python
+result = await crawler.arun(
+    url="https://example.com",
+    screenshot=True
+)
+base64_screenshot = result.screenshot
+```
+
+- Media Extraction
+```python
+result = await crawler.arun(url="https://example.com")
+print(result.media['images'])  # List of images
+print(result.media['videos'])  # List of videos
+print(result.media['audios'])  # List of audio files
+```
+
+### 8. Structured Data Extraction
+- CSS-based Extraction
+```python
+schema = {
+    "name": "News Articles",
+    "baseSelector": "article",
+    "fields": [
+        {"name": "title", "selector": "h1", "type": "text"},
+        {"name": "date", "selector": ".date", "type": "text"}
+    ]
+}
+extraction_strategy = JsonCssExtractionStrategy(schema)
+result = await crawler.arun(
+    url="https://example.com",
+    extraction_strategy=extraction_strategy
+)
+structured_data = json.loads(result.extracted_content)
+```
+
+- LLM-based Extraction (Multiple Providers)
+```python
+class NewsArticle(BaseModel):
+    title: str
+    summary: str
+
+strategy = LLMExtractionStrategy(
+    provider="ollama/nemotron",  # or "openai/gpt-4", "huggingface/..."
+    api_token="your-token",
+    schema=NewsArticle.schema(),
+    instruction="Extract news article details..."
+)
+result = await crawler.arun(
+    url="https://example.com",
+    extraction_strategy=strategy
+)
+```
+
+### 9. Content Cleaning & Processing
+```python
+result = await crawler.arun(
+    url="https://example.com",
+    remove_overlay_elements=True,  # Remove popups/modals
+    process_iframes=True,  # Process iframe content
+)
+print(result.fit_markdown)  # Get most relevant content
+print(result.fit_html)     # Get most relevant HTML
+```
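Review note on the wait conditions documented above: the `css:` and `js:` prefixes on `wait_for` suggest a small dispatch step before the browser is asked to wait. The sketch below shows one way such a string could be split. The function `parse_wait_condition` is hypothetical, and the fallback rule for unprefixed values is an assumption rather than something this diff confirms.

```python
from typing import Tuple


def parse_wait_condition(wait_for: str) -> Tuple[str, str]:
    """Split a wait_for string into (kind, expression); hypothetical helper."""
    wait_for = wait_for.strip()
    if wait_for.startswith("css:"):
        return "css", wait_for[len("css:"):].strip()
    if wait_for.startswith("js:"):
        return "js", wait_for[len("js:"):].strip()
    # Assumption: an unprefixed arrow function or `function` is JavaScript,
    # and anything else is treated as a CSS selector
    if wait_for.startswith("()") or wait_for.startswith("function"):
        return "js", wait_for
    return "css", wait_for
```

Using explicit prefixes in documentation examples avoids relying on whatever fallback the library actually applies.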
--- a/docs/details/features_details.md
+++ b/docs/details/features_details.md
@@ -0,0 +1,457 @@
+### 1. Basic Web Crawling
+Basic web crawling provides the foundation for extracting content from websites. The library supports both simple single-page crawling and recursive website crawling.
+
+```python
+# Simple page crawling
+async with AsyncWebCrawler() as crawler:
+    result = await crawler.arun(url="https://example.com")
+    print(result.html)        # Raw HTML
+    print(result.markdown)    # Cleaned markdown
+    print(result.cleaned_html)  # Cleaned HTML
+
+# Recursive website crawling
+class SimpleWebsiteScraper:
+    def __init__(self, crawler: AsyncWebCrawler):
+        self.crawler = crawler
+
+    async def scrape(self, start_url: str, max_depth: int):
+        results = await self.scrape_recursive(start_url, max_depth)
+        return results
+
+# Usage
+async with AsyncWebCrawler() as crawler:
+    scraper = SimpleWebsiteScraper(crawler)
+    results = await scraper.scrape("https://example.com", max_depth=2)
+```
+
+### 2. Browser Control Options
+The library provides extensive control over browser behavior, allowing customization of browser type, headless mode, and proxy settings.
+
+```python
+# Browser Type Selection
+async with AsyncWebCrawler(
+    browser_type="firefox",  # Options: "chromium", "firefox", "webkit"
+    headless=False,         # For visible browser
+    verbose=True           # Enable logging
+) as crawler:
+    result = await crawler.arun(url="https://example.com")
+
+# Proxy Configuration
+async with AsyncWebCrawler(
+    proxy_config={
+        "server": "http://proxy.example.com:8080",
+        "username": "user",
+        "password": "pass"
+    },
+    headers={
+        "User-Agent": "Custom User Agent",
+        "Accept-Language": "en-US,en;q=0.9"
+    }
+) as crawler:
+    result = await crawler.arun(url="https://example.com")
+```
+
+### 3. Content Selection & Filtering
+The library offers multiple ways to select and filter content, from CSS selectors to word count thresholds.
+
+```python
+# CSS Selector and Content Filtering
+async with AsyncWebCrawler() as crawler:
+    result = await crawler.arun(
+        url="https://example.com",
+        css_selector="article.main-content",  # Extract specific content
+        word_count_threshold=10,              # Minimum words per block
+        excluded_tags=['form', 'header'],     # Tags to exclude
+        exclude_external_links=True,          # Remove external links
+        exclude_social_media_links=True,      # Remove social media links
+        exclude_domains=["pinterest.com", "facebook.com"]  # Exclude specific domains
+    )
+
+# Custom HTML to Text Options
+async with AsyncWebCrawler() as crawler:
+    result = await crawler.arun(
+        url="https://example.com",
+        html2text={
+            "escape_dot": False,
+            "links_each_paragraph": True,
+            "protect_links": True
+        }
+    )
+```
+
+### 4. Dynamic Content Handling
+The library provides sophisticated handling of dynamic content with JavaScript execution and wait conditions.
+
+```python
+# JavaScript Execution and Wait Conditions
+async with AsyncWebCrawler() as crawler:
+    result = await crawler.arun(
+        url="https://example.com",
+        js_code=[
+            "window.scrollTo(0, document.body.scrollHeight);",
+            "document.querySelector('.load-more').click();"
+        ],
+        wait_for="css:.dynamic-content",  # Wait for element
+        delay_before_return_html=2.0      # Wait after JS execution
+    )
+
+# Smart Wait Conditions
+async with AsyncWebCrawler() as crawler:
+    result = await crawler.arun(
+        url="https://example.com",
+        wait_for="""() => {
+            return document.querySelectorAll('.item').length > 10;
+        }""",
+        page_timeout=60000  # 60 seconds timeout
+    )
+```
+
+### 5. Advanced Link Analysis
+The library provides comprehensive link analysis capabilities, distinguishing between internal and external links, with options for filtering and processing.
+
+```python
+# Basic Link Analysis
+async with AsyncWebCrawler() as crawler:
+    result = await crawler.arun(url="https://example.com")
+    
+    # Access internal and external links
+    for internal_link in result.links['internal']:
+        print(f"Internal: {internal_link['href']} - {internal_link['text']}")
+    
+    for external_link in result.links['external']:
+        print(f"External: {external_link['href']} - {external_link['text']}")
+
+# Advanced Link Filtering
+async with AsyncWebCrawler() as crawler:
+    result = await crawler.arun(
+        url="https://example.com",
+        exclude_external_links=True,          # Remove all external links
+        exclude_social_media_links=True,      # Remove social media links
+        exclude_social_media_domains=[                # Custom social media domains
+            "facebook.com", "twitter.com", "instagram.com"
+        ],
+        exclude_domains=["pinterest.com"]     # Specific domains to exclude
+    )
+```
+
+### 6. Anti-Bot Protection Handling
+The library includes sophisticated anti-detection mechanisms to handle websites with bot protection.
+
+```python
+# Basic Anti-Detection
+async with AsyncWebCrawler() as crawler:
+    result = await crawler.arun(
+        url="https://example.com",
+        simulate_user=True,        # Simulate human behavior
+        override_navigator=True    # Override navigator properties
+    )
+
+# Advanced Anti-Detection with Magic Mode
+async with AsyncWebCrawler(headless=False) as crawler:
+    result = await crawler.arun(
+        url="https://example.com",
+        magic=True,               # Enable all anti-detection features
+        remove_overlay_elements=True,  # Remove popups/modals automatically
+        # Custom navigator properties
+        js_code="""
+        Object.defineProperty(navigator, 'webdriver', {
+            get: () => undefined
+        });
+        """
+    )
+```
+
+### 7. Session Management
+Session management allows maintaining state across multiple requests and handling cookies.
+
+```python
+# Basic Session Management
+async with AsyncWebCrawler() as crawler:
+    session_id = "my_session"
+    
+    # Login
+    login_result = await crawler.arun(
+        url="https://example.com/login",
+        session_id=session_id,
+        js_code="document.querySelector('form').submit();"
+    )
+    
+    # Use same session for subsequent requests
+    protected_result = await crawler.arun(
+        url="https://example.com/protected",
+        session_id=session_id
+    )
+    
+    # Clean up session
+    await crawler.crawler_strategy.kill_session(session_id)
+
+# Advanced Session with Custom Cookies
+async with AsyncWebCrawler() as crawler:
+    result = await crawler.arun(
+        url="https://example.com",
+        session_id="custom_session",
+        cookies=[{
+            "name": "sessionId",
+            "value": "abc123",
+            "domain": "example.com"
+        }]
+    )
+```
+
+### 8. Screenshot and Media Handling
+The library provides comprehensive media handling capabilities, including screenshots and media content extraction.
+
+```python
+import base64
+
+# Screenshot Capture
+async with AsyncWebCrawler() as crawler:
+    result = await crawler.arun(
+        url="https://example.com",
+        screenshot=True,
+        screenshot_wait_for=2.0  # Wait before taking screenshot
+    )
+    
+    # Save screenshot
+    if result.screenshot:
+        with open("screenshot.png", "wb") as f:
+            f.write(base64.b64decode(result.screenshot))
+
+# Media Extraction
+async with AsyncWebCrawler() as crawler:
+    result = await crawler.arun(url="https://example.com")
+    
+    # Process images with metadata
+    for image in result.media['images']:
+        print(f"Image: {image['src']}")
+        print(f"Alt text: {image['alt']}")
+        print(f"Context: {image['desc']}")
+        print(f"Relevance score: {image['score']}")
+    
+    # Process videos and audio
+    for video in result.media['videos']:
+        print(f"Video: {video['src']}")
+    for audio in result.media['audios']:
+        print(f"Audio: {audio['src']}")
+```
+
+### 9. Structured Data Extraction & Chunking
+The library supports multiple strategies for structured data extraction and content chunking.
+
+```python
+# LLM-based Extraction
+class NewsArticle(BaseModel):
+    title: str
+    content: str
+    author: str
+
+extraction_strategy = LLMExtractionStrategy(
+    provider='openai/gpt-4',
+    api_token="your-token",
+    schema=NewsArticle.schema(),
+    instruction="Extract news article details",
+    chunk_token_threshold=1000,
+    overlap_rate=0.1
+)
+
+# CSS-based Extraction
+schema = {
+    "name": "Product Listing",
+    "baseSelector": ".product-card",
+    "fields": [
+        {
+            "name": "title",
+            "selector": "h2",
+            "type": "text"
+        },
+        {
+            "name": "price",
+            "selector": ".price",
+            "type": "text",
+            "transform": "strip"
+        }
+    ]
+}
+
+css_strategy = JsonCssExtractionStrategy(schema)
+
+# Text Chunking
+from crawl4ai.chunking_strategy import OverlappingWindowChunking
+
+chunking_strategy = OverlappingWindowChunking(
+    window_size=1000,
+    overlap=100
+)
+
+async with AsyncWebCrawler() as crawler:
+    result = await crawler.arun(
+        url="https://example.com",
+        extraction_strategy=extraction_strategy,
+        chunking_strategy=chunking_strategy
+    )
+```
+
+
+### 10. Content Cleaning & Processing
+The library provides extensive content cleaning and processing capabilities, ensuring high-quality output in various formats.
+
+```python
+# Basic Content Cleaning
+async with AsyncWebCrawler() as crawler:
+    result = await crawler.arun(
+        url="https://example.com",
+        remove_overlay_elements=True,  # Remove popups/modals
+        process_iframes=True,          # Process iframe content
+        word_count_threshold=10        # Minimum words per block
+    )
+    
+    print(result.cleaned_html)    # Clean HTML
+    print(result.fit_html)        # Most relevant HTML content
+    print(result.fit_markdown)    # Most relevant markdown content
+
+# Advanced Content Processing
+async with AsyncWebCrawler() as crawler:
+    result = await crawler.arun(
+        url="https://example.com",
+        excluded_tags=['form', 'header', 'footer', 'nav'],
+        html2text={
+            "escape_dot": False,
+            "body_width": 0,
+            "protect_links": True,
+            "unicode_snob": True,
+            "ignore_links": False,
+            "ignore_images": False,
+            "ignore_emphasis": False,
+            "bypass_tables": False,
+            "ignore_tables": False
+        }
+    )
+```
+
+### Advanced Usage Patterns
+
+#### 1. Combining Multiple Features
+```python
+async with AsyncWebCrawler(
+    browser_type="chromium",
+    headless=False,
+    verbose=True
+) as crawler:
+    result = await crawler.arun(
+        url="https://example.com",
+        # Anti-bot measures
+        magic=True,
+        simulate_user=True,
+        
+        # Content selection
+        css_selector="article.main",
+        word_count_threshold=10,
+        
+        # Dynamic content handling
+        js_code="window.scrollTo(0, document.body.scrollHeight);",
+        wait_for="css:.dynamic-content",
+        
+        # Content filtering
+        exclude_external_links=True,
+        exclude_social_media_links=True,
+        
+        # Media handling
+        screenshot=True,
+        process_iframes=True,
+        
+        # Content cleaning
+        remove_overlay_elements=True
+    )
+```
+
+#### 2. Custom Extraction Pipeline
+```python
+# Define custom schemas and strategies
+class Article(BaseModel):
+    title: str
+    content: str
+    date: str
+
+# CSS extraction for initial content
+css_schema = {
+    "name": "Article Extraction",
+    "baseSelector": "article",
+    "fields": [
+        {"name": "title", "selector": "h1", "type": "text"},
+        {"name": "content", "selector": ".content", "type": "html"},
+        {"name": "date", "selector": ".date", "type": "text"}
+    ]
+}
+
+# LLM processing for semantic analysis
+llm_strategy = LLMExtractionStrategy(
+    provider="ollama/nemotron",
+    api_token="your-token",
+    schema=Article.schema(),
+    instruction="Extract and clean article content"
+)
+
+# Chunking strategy for large content
+chunking = OverlappingWindowChunking(window_size=1000, overlap=100)
+
+async with AsyncWebCrawler() as crawler:
+    # First pass: Extract structure
+    css_result = await crawler.arun(
+        url="https://example.com",
+        extraction_strategy=JsonCssExtractionStrategy(css_schema)
+    )
+    
+    # Second pass: Semantic processing
+    llm_result = await crawler.arun(
+        url="https://example.com",
+        extraction_strategy=llm_strategy,
+        chunking_strategy=chunking
+    )
+```
+
+#### 3. Website Crawling with Custom Processing
+```python
+class CustomWebsiteCrawler:
+    def __init__(self, crawler: AsyncWebCrawler):
+        self.crawler = crawler
+        self.results = {}
+
+    async def process_page(self, url: str) -> Dict:
+        result = await self.crawler.arun(
+            url=url,
+            magic=True,
+            word_count_threshold=10,
+            exclude_external_links=True,
+            process_iframes=True,
+            remove_overlay_elements=True
+        )
+        
+        # Process internal links
+        internal_links = [
+            link['href'] for link in result.links['internal']
+            if self._is_valid_link(link['href'])
+        ]
+        
+        # Extract media
+        media_urls = [img['src'] for img in result.media['images']]
+        
+        return {
+            'content': result.markdown,
+            'links': internal_links,
+            'media': media_urls,
+            'metadata': result.metadata
+        }
+
+    async def crawl_website(self, start_url: str, max_depth: int = 2):
+        visited = set()
+        queue = [(start_url, 0)]
+        
+        while queue:
+            url, depth = queue.pop(0)
+            if depth > max_depth or url in visited:
+                continue
+                
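+            visited.add(url)
+            self.results[url] = await self.process_page(url)
+
+            # Queue this page's internal links for the next depth level
+            for link in self.results[url]['links']:
+                queue.append((link, depth + 1))
+
+        return self.results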
+```
+
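Review note on the `crawl_website` example above: the depth-limited breadth-first traversal can be checked in isolation, without a browser. The sketch below replaces the crawler with a plain callback. `crawl_order` and `get_links` are hypothetical names introduced for illustration only.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl_order(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    max_depth: int = 2,
) -> List[str]:
    """Return URLs in the order a depth-limited BFS would visit them."""
    visited = set()
    order: List[str] = []
    queue = deque([(start_url, 0)])

    while queue:
        url, depth = queue.popleft()  # deque gives O(1) pops from the left
        if depth > max_depth or url in visited:
            continue
        visited.add(url)
        order.append(url)
        # Links found on this page belong to the next depth level
        for link in get_links(url):
            queue.append((link, depth + 1))

    return order
```

A `deque` is used here instead of `list.pop(0)` because popping from the front of a list is linear in the queue length, which matters on large sites.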
--- a/docs/details/input_output.md
+++ b/docs/details/input_output.md
@@ -0,0 +1,282 @@
+### AsyncWebCrawler Constructor Parameters
+```python
+AsyncWebCrawler(
+    # Core Browser Settings
+    browser_type: str = "chromium",  # Options: "chromium", "firefox", "webkit"
+    headless: bool = True,  # Whether to run browser in headless mode
+    verbose: bool = False,  # Enable verbose logging
+    
+    # Cache Settings
+    always_by_pass_cache: bool = False,  # Always bypass cache regardless of run settings
+    base_directory: str = str(Path.home()),  # Base directory for cache storage
+    
+    # Network Settings
+    proxy: str = None,  # Simple proxy URL (e.g., "http://proxy.example.com:8080")
+    proxy_config: Dict = None,  # Advanced proxy settings with auth: {"server": str, "username": str, "password": str}
+    
+    # Browser Behavior
+    sleep_on_close: bool = False,  # Wait before closing browser
+    
+    # Other Settings passed to AsyncPlaywrightCrawlerStrategy
+    user_agent: str = None,  # Custom user agent string
+    headers: Dict[str, str] = {},  # Custom HTTP headers
+    js_code: Union[str, List[str]] = None,  # Default JavaScript to execute
+)
+```
+
+### arun() Method Parameters
+```python
+arun(
+    # Core Parameters
+    url: str,  # Required: URL to crawl
+    
+    # Content Selection
+    css_selector: str = None,  # CSS selector to extract specific content
+    word_count_threshold: int = MIN_WORD_THRESHOLD,  # Minimum words for content blocks
+    
+    # Cache Control
+    bypass_cache: bool = False,  # Bypass cache for this request
+    
+    # Session Management
+    session_id: str = None,  # Session identifier for persistent browsing
+    
+    # Screenshot Options
+    screenshot: bool = False,  # Take page screenshot
+    screenshot_wait_for: float = None,  # Wait time before screenshot
+    
+    # Content Processing
+    process_iframes: bool = False,  # Process iframe content
+    remove_overlay_elements: bool = False,  # Remove popups/modals
+    
+    # Anti-Bot/Detection
+    simulate_user: bool = False,  # Simulate human-like behavior
+    override_navigator: bool = False,  # Override navigator properties
+    magic: bool = False,  # Enable all anti-detection features
+    
+    # Content Filtering
+    excluded_tags: List[str] = None,  # HTML tags to exclude
+    exclude_external_links: bool = False,  # Remove external links
+    exclude_social_media_links: bool = False,  # Remove social media links
+    exclude_external_images: bool = False,  # Remove external images
+    exclude_social_media_domains: List[str] = None,  # Additional social media domains to exclude
+    remove_forms: bool = False,  # Remove all form elements
+    
+    # JavaScript Handling
+    js_code: Union[str, List[str]] = None,  # JavaScript to execute
+    js_only: bool = False,  # Only execute JavaScript without reloading page
+    wait_for: str = None,  # Wait condition (CSS selector or JS function)
+    
+    # Page Loading
+    page_timeout: int = 60000,  # Page load timeout in milliseconds
+    delay_before_return_html: float = None,  # Wait before returning HTML
+    
+    # Debug Options
+    log_console: bool = False,  # Log browser console messages
+    
+    # Content Format Control
+    only_text: bool = False,  # Extract only text content
+    keep_data_attributes: bool = False,  # Keep data-* attributes in HTML
+    
+    # Markdown Options
+    include_links_on_markdown: bool = False,  # Include links in markdown output
+    html2text: Dict = {},  # HTML to text conversion options
+    
+    # Extraction Strategy
+    extraction_strategy: ExtractionStrategy = None,  # Strategy for structured data extraction
+    
+    # Advanced Browser Control
+    user_agent: str = None,  # Override user agent for this request
+)
+```
+
+### Extraction Strategy Parameters
+```python
+# JsonCssExtractionStrategy
+{
+    "name": str,  # Name of extraction schema
+    "baseSelector": str,  # Base CSS selector
+    "fields": [
+        {
+            "name": str,  # Field name
+            "selector": str,  # CSS selector
+            "type": str,  # Data type ("text", etc.)
+            "transform": str = None  # Optional transformation
+        }
+    ]
+}
+
+# LLMExtractionStrategy
+{
+    "provider": str,  # LLM provider (e.g., "openai/gpt-4", "huggingface/...", "ollama/...")
+    "api_token": str,  # API token
+    "schema": dict,  # Pydantic model schema
+    "extraction_type": str,  # Type of extraction ("schema", etc.)
+    "instruction": str,  # Extraction instruction
+    "extra_args": dict = None,  # Additional provider-specific arguments
+    "extra_headers": dict = None  # Additional HTTP headers
+}
+```
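
To make the `JsonCssExtractionStrategy` schema shape above concrete, here is a minimal sketch of a schema dictionary plus a small shape check. The schema contents (`Product Listing`, the CSS selectors) and the `check_schema` helper are hypothetical illustrations, not part of crawl4ai; the check covers only the keys listed above.

```python
from typing import Any, Dict, List

# Hypothetical schema for a product listing page; selectors are illustrative only.
product_schema: Dict[str, Any] = {
    "name": "Product Listing",
    "baseSelector": "div.product",
    "fields": [
        {"name": "title", "selector": "h2.title", "type": "text"},
        {"name": "price", "selector": "span.price", "type": "text"},
    ],
}


def check_schema(schema: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    problems: List[str] = []
    for key in ("name", "baseSelector", "fields"):
        if key not in schema:
            problems.append(f"missing top-level key: {key}")
    for i, field in enumerate(schema.get("fields", [])):
        for key in ("name", "selector", "type"):
            if key not in field:
                problems.append(f"fields[{i}] missing key: {key}")
    return problems


print(check_schema(product_schema))
```

Checking the dictionary before handing it to a strategy tends to surface a misspelled key earlier than an empty extraction result would.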
+
+### HTML to Text Conversion Options (html2text parameter)
+```python
+{
+    "escape_dot": bool = True,  # Escape dots in text
+    # Other html2text library options
+}
+```
+
+
+### CrawlResult Fields
+
+```python
+class CrawlResult(BaseModel):
+    # Basic Information
+    url: str  # The crawled URL
+    # Example: "https://example.com"
+    
+    success: bool  # Whether the crawl was successful
+    # Example: True/False
+    
+    status_code: Optional[int]  # HTTP status code
+    # Example: 200, 404, 500
+    
+    # Content Fields
+    html: str  # Raw HTML content
+    # Example: "<html><body>...</body></html>"
+    
+    cleaned_html: Optional[str]  # HTML after cleaning and processing
+    # Example: "<article><p>Clean content...</p></article>"
+    
+    fit_html: Optional[str]  # Most relevant HTML content after content cleaning strategy
+    # Example: "<div><p>Most relevant content...</p></div>"
+    
+    markdown: Optional[str]  # HTML converted to markdown
+    # Example: "# Title\n\nContent paragraph..."
+    
+    fit_markdown: Optional[str]  # Most relevant content in markdown
+    # Example: "# Main Article\n\nKey content..."
+    
+    # Media Content
+    media: Dict[str, List[Dict]] = {}  # Extracted media information
+    # Example: {
+    #     "images": [
+    #         {
+    #             "src": "https://example.com/image.jpg",
+    #             "alt": "Image description",
+    #             "desc": "Contextual description",
+    #             "score": 5,  # Relevance score
+    #             "type": "image"
+    #         }
+    #     ],
+    #     "videos": [
+    #         {
+    #             "src": "https://example.com/video.mp4",
+    #             "alt": "Video title",
+    #             "type": "video",
+    #             "description": "Video context"
+    #         }
+    #     ],
+    #     "audios": [
+    #         {
+    #             "src": "https://example.com/audio.mp3",
+    #             "alt": "Audio title",
+    #             "type": "audio",
+    #             "description": "Audio context"
+    #         }
+    #     ]
+    # }
+    
+    # Link Information
+    links: Dict[str, List[Dict]] = {}  # Extracted links
+    # Example: {
+    #     "internal": [
+    #         {
+    #             "href": "https://example.com/page",
+    #             "text": "Link text",
+    #             "title": "Link title"
+    #         }
+    #     ],
+    #     "external": [
+    #         {
+    #             "href": "https://external.com",
+    #             "text": "External link text",
+    #             "title": "External link title"
+    #         }
+    #     ]
+    # }
+    
+    # Extraction Results
+    extracted_content: Optional[str]  # Content from extraction strategy
+    # Example for JsonCssExtractionStrategy:
+    # '[{"title": "Article 1", "date": "2024-03-20"}, ...]'
+    # Example for LLMExtractionStrategy:
+    # '{"entities": [...], "relationships": [...]}'
+    
+    # Additional Information
+    metadata: Optional[dict] = None  # Page metadata
+    # Example: {
+    #     "title": "Page Title",
+    #     "description": "Meta description",
+    #     "keywords": ["keyword1", "keyword2"],
+    #     "author": "Author Name",
+    #     "published_date": "2024-03-20"
+    # }
+    
+    screenshot: Optional[str] = None  # Base64 encoded screenshot
+    # Example: "iVBORw0KGgoAAAANSUhEUgAA..."
+    
+    error_message: Optional[str] = None  # Error message if crawl failed
+    # Example: "Failed to load page: timeout"
+    
+    session_id: Optional[str] = None  # Session identifier
+    # Example: "session_123456"
+    
+    response_headers: Optional[dict] = None  # HTTP response headers
+    # Example: {
+    #     "content-type": "text/html",
+    #     "server": "nginx/1.18.0",
+    #     "date": "Wed, 20 Mar 2024 12:00:00 GMT"
+    # }
+```
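
The `screenshot` field holds base64 text, so saving it means decoding it first. As a hedged sketch (the helper name `detect_image_extension` is hypothetical, and which image format the crawler actually emits is an assumption to verify), the decoded bytes can be checked against the standard PNG and JPEG signatures before choosing a file extension:

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file
JPEG_SIGNATURE = b"\xff\xd8\xff"      # first 3 bytes of every JPEG file


def detect_image_extension(b64_data: str) -> str:
    """Guess a file extension from the decoded image's leading bytes."""
    raw = base64.b64decode(b64_data)
    if raw.startswith(PNG_SIGNATURE):
        return ".png"
    if raw.startswith(JPEG_SIGNATURE):
        return ".jpg"
    return ".bin"  # unknown format; keep the bytes but do not guess


# Build tiny stand-in payloads instead of taking a real screenshot.
fake_png = base64.b64encode(PNG_SIGNATURE + b"\x00" * 8).decode("ascii")
fake_jpeg = base64.b64encode(JPEG_SIGNATURE + b"\x00" * 8).decode("ascii")

print(detect_image_extension(fake_png), detect_image_extension(fake_jpeg))
```

A base64 string beginning with `iVBORw0KGgo`, as in the `screenshot` example above, decodes to the PNG signature.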
+
+### Common Usage Patterns:
+
+1. Basic Content Extraction:
+```python
+result = await crawler.arun(url="https://example.com")
+print(result.markdown)  # Clean, readable content
+print(result.cleaned_html)  # Cleaned HTML
+```
+
+2. Media Analysis:
+```python
+result = await crawler.arun(url="https://example.com")
+for image in result.media["images"]:
+    if image["score"] > 3:  # High-relevance images
+        print(f"High-quality image: {image['src']}")
+```
+
+3. Link Analysis:
+```python
+result = await crawler.arun(url="https://example.com")
+internal_links = [link["href"] for link in result.links["internal"]]
+external_links = [link["href"] for link in result.links["external"]]
+```
+
+4. Structured Data Extraction:
+```python
+import json
+
+result = await crawler.arun(
+    url="https://example.com",
+    extraction_strategy=my_strategy
+)
+structured_data = json.loads(result.extracted_content)
+```
+
+5. Error Handling:
+```python
+result = await crawler.arun(url="https://example.com")
+if not result.success:
+    print(f"Crawl failed: {result.error_message}")
+    print(f"Status code: {result.status_code}")
+```
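
Several `CrawlResult` fields are `Optional`, so the patterns above can be combined into one defensive summary step. The sketch below is an illustration under stated assumptions: `summarize_result` is a hypothetical helper, and `SimpleNamespace` objects stand in for real `CrawlResult` instances so the example runs without a browser.

```python
from types import SimpleNamespace
from typing import Any, Dict


def summarize_result(result: Any) -> Dict[str, Any]:
    """Reduce a CrawlResult-like object to a few counts, tolerating None fields."""
    if not result.success:
        return {"url": result.url, "ok": False, "error": result.error_message}

    links = result.links or {}
    media = result.media or {}
    return {
        "url": result.url,
        "ok": True,
        "words": len((result.markdown or "").split()),
        "internal_links": len(links.get("internal", [])),
        "external_links": len(links.get("external", [])),
        "images": len(media.get("images", [])),
    }


# Stand-ins for real crawl results (field names follow the CrawlResult listing).
ok_result = SimpleNamespace(
    url="https://example.com",
    success=True,
    markdown="# Title\n\nSome content here",
    links={"internal": [{"href": "https://example.com/page"}]},
    media={},
    error_message=None,
)
failed_result = SimpleNamespace(
    url="https://example.com/missing",
    success=False,
    error_message="Failed to load page: timeout",
)

print(summarize_result(ok_result))
print(summarize_result(failed_result))
```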
+
--- a/docs/details/realworld_examples.md
+++ b/docs/details/realworld_examples.md
@@ -0,0 +1,67 @@
+1. **E-commerce Product Monitor**
+   - Scraping product details from multiple e-commerce sites
+   - Price tracking with structured data extraction
+   - Handling dynamic content and anti-bot measures
+   - Features: JsonCssExtraction, session management, anti-bot
+
+2. **News Aggregator & Summarizer**
+   - Crawling news websites
+   - Content extraction and summarization
+   - Topic classification
+   - Features: LLMExtraction, CosineStrategy, content cleaning
+
+3. **Academic Paper Research Assistant**
+   - Crawling research papers from academic sites
+   - Extracting citations and references
+   - Building knowledge graphs
+   - Features: structured extraction, link analysis, chunking
+
+4. **Social Media Content Analyzer**
+   - Handling JavaScript-heavy sites
+   - Dynamic content loading
+   - Sentiment analysis integration
+   - Features: dynamic content handling, session management
+
+5. **Real Estate Market Analyzer**
+   - Scraping property listings
+   - Processing image galleries
+   - Geolocation data extraction
+   - Features: media handling, structured data extraction
+
+6. **Documentation Site Generator**
+   - Recursive website crawling
+   - Markdown generation
+   - Link validation
+   - Features: website crawling, content cleaning
+
+7. **Job Board Aggregator**
+   - Handling pagination
+   - Structured job data extraction
+   - Filtering and categorization
+   - Features: session management, JsonCssExtraction
+
+8. **Recipe Database Builder**
+   - Schema-based extraction
+   - Image processing
+   - Ingredient parsing
+   - Features: structured extraction, media handling
+
+9. **Travel Blog Content Analyzer**
+   - Location extraction
+   - Image and map processing
+   - Content categorization
+   - Features: CosineStrategy, media handling
+
+10. **Technical Documentation Scraper**
+    - API documentation extraction
+    - Code snippet processing
+    - Version tracking
+    - Features: content cleaning, structured extraction
+
+Each example will include:
+- Problem description
+- Technical requirements
+- Complete implementation
+- Error handling
+- Output processing
+- Performance considerations
--- a/docs/examples/async_webcrawler_multiple_urls_example.py
+++ b/docs/examples/async_webcrawler_multiple_urls_example.py
@@ -0,0 +1,48 @@
+# File: async_webcrawler_multiple_urls_example.py
+import os, sys
+# Add the repository root (two levels above this file's directory) to sys.path so crawl4ai can be imported
+parent_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+sys.path.append(parent_dir)
+
+import asyncio
+from crawl4ai import AsyncWebCrawler
+
+async def main():
+    # Initialize the AsyncWebCrawler
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        # List of URLs to crawl
+        urls = [
+            "https://example.com",
+            "https://python.org",
+            "https://github.com",
+            "https://stackoverflow.com",
+            "https://news.ycombinator.com"
+        ]
+
+        # Set up crawling parameters
+        word_count_threshold = 100
+
+        # Run the crawling process for multiple URLs
+        results = await crawler.arun_many(
+            urls=urls,
+            word_count_threshold=word_count_threshold,
+            bypass_cache=True,
+            verbose=True
+        )
+
+        # Process the results
+        for result in results:
+            if result.success:
+                print(f"Successfully crawled: {result.url}")
+                print(f"Title: {result.metadata.get('title', 'N/A')}")
+                print(f"Word count: {len(result.markdown.split())}")
+                print(f"Number of links: {len(result.links.get('internal', [])) + len(result.links.get('external', []))}")
+                print(f"Number of images: {len(result.media.get('images', []))}")
+                print("---")
+            else:
+                print(f"Failed to crawl: {result.url}")
+                print(f"Error: {result.error_message}")
+                print("---")
+
+if __name__ == "__main__":
+    asyncio.run(main())
--- a/docs/examples/language_support_example.py
+++ b/docs/examples/language_support_example.py
@@ -0,0 +1,45 @@
+import asyncio
+from crawl4ai import AsyncWebCrawler, AsyncPlaywrightCrawlerStrategy
+
+async def main():
+    # Example 1: Setting language when creating the crawler
+    crawler1 = AsyncWebCrawler(
+        crawler_strategy=AsyncPlaywrightCrawlerStrategy(
+            headers={"Accept-Language": "fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7"}
+        )
+    )
+    result1 = await crawler1.arun("https://www.example.com")
+    print("Example 1 result:", result1.extracted_content[:100])  # Print first 100 characters
+
+    # Example 2: Setting language before crawling
+    crawler2 = AsyncWebCrawler()
+    crawler2.crawler_strategy.headers["Accept-Language"] = "es-ES,es;q=0.9,en-US;q=0.8,en;q=0.7"
+    result2 = await crawler2.arun("https://www.example.com")
+    print("Example 2 result:", result2.extracted_content[:100])
+
+    # Example 3: Setting language when calling arun method
+    crawler3 = AsyncWebCrawler()
+    result3 = await crawler3.arun(
+        "https://www.example.com",
+        headers={"Accept-Language": "de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7"}
+    )
+    print("Example 3 result:", result3.extracted_content[:100])
+
+    # Example 4: Crawling multiple pages with different languages
+    urls = [
+        ("https://www.example.com", "fr-FR,fr;q=0.9"),
+        ("https://www.example.org", "es-ES,es;q=0.9"),
+        ("https://www.example.net", "de-DE,de;q=0.9"),
+    ]
+    
+    crawler4 = AsyncWebCrawler()
+    results = await asyncio.gather(*[
+        crawler4.arun(url, headers={"Accept-Language": lang})
+        for url, lang in urls
+    ])
+    
+    for url, result in zip([u for u, _ in urls], results):
+        print(f"Result for {url}:", result.extracted_content[:100])
+
+if __name__ == "__main__":
+    asyncio.run(main())
--- a/docs/examples/quickstart.ipynb
+++ b/docs/examples/quickstart.ipynb
@@ -47,8 +47,7 @@
      },
      "outputs": [],
      "source": [
-        "# !pip install \"crawl4ai @ git+https://github.com/unclecode/crawl4ai.git\"\n",
-        "!pip install \"crawl4ai @ git+https://github.com/unclecode/crawl4ai.git@staging\"\n",
+        "!pip install crawl4ai\n",
        "!pip install nest-asyncio\n",
        "!playwright install"
      ]
@@ -714,7 +713,7 @@
      "provenance": []
    },
    "kernelspec": {
-      "display_name": "Python 3",
+      "display_name": "venv",
      "language": "python",
      "name": "python3"
    },
--- a/docs/examples/quickstart_async.py
+++ b/docs/examples/quickstart_async.py
@@ -10,6 +10,7 @@ import time
 import json
 import os
 import re
+from typing import Dict, List
 from bs4 import BeautifulSoup
 from pydantic import BaseModel, Field
 from crawl4ai import AsyncWebCrawler
@@ -18,6 +19,8 @@ from crawl4ai.extraction_strategy import (
    LLMExtractionStrategy,
 )

+__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
+
 print("Crawl4AI: Advanced Web Crawling and Data Extraction")
 print("GitHub Repository: https://github.com/unclecode/crawl4ai")
 print("Twitter: @unclecode")
@@ -30,7 +33,7 @@ async def simple_crawl():
        result = await crawler.arun(url="https://www.nbcnews.com/business")
        print(result.markdown[:500])  # Print first 500 characters

-async def js_and_css():
+async def simple_example_with_running_js_code():
    print("\n--- Executing JavaScript and Using CSS Selectors ---")
    # New code to handle the wait_for parameter
    wait_for = """() => {
@@ -47,12 +50,21 @@ async def js_and_css():
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            js_code=js_code,
-            # css_selector="article.tease-card",
            # wait_for=wait_for,
            bypass_cache=True,
        )
        print(result.markdown[:500])  # Print first 500 characters

+async def simple_example_with_css_selector():
+    print("\n--- Using CSS Selectors ---")
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        result = await crawler.arun(
+            url="https://www.nbcnews.com/business",
+            css_selector=".wide-tease-item__description",
+            bypass_cache=True,
+        )
+        print(result.markdown[:500])  # Print first 500 characters
+
 async def use_proxy():
    print("\n--- Using a Proxy ---")
    print(
@@ -66,6 +78,28 @@ async def use_proxy():
    #     )
    #     print(result.markdown[:500])  # Print first 500 characters

+async def capture_and_save_screenshot(url: str, output_path: str):
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        result = await crawler.arun(
+            url=url,
+            screenshot=True,
+            bypass_cache=True
+        )
+        
+        if result.success and result.screenshot:
+            import base64
+            
+            # Decode the base64 screenshot data
+            screenshot_data = base64.b64decode(result.screenshot)
+            
+            # Write the decoded image bytes to output_path
+            with open(output_path, 'wb') as f:
+                f.write(screenshot_data)
+            
+            print(f"Screenshot saved successfully to {output_path}")
+        else:
+            print("Failed to capture screenshot")
+
 class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
@@ -73,27 +107,30 @@ class OpenAIModelFee(BaseModel):
        ..., description="Fee for output token for the OpenAI model."
    )

-async def extract_structured_data_using_llm():
-    print("\n--- Extracting Structured Data with OpenAI ---")
-    print(
-        "Note: Set your OpenAI API key as an environment variable to run this example."
-    )
-    if not os.getenv("OPENAI_API_KEY"):
-        print("OpenAI API key not found. Skipping this example.")
+async def extract_structured_data_using_llm(provider: str, api_token: str = None, extra_headers: Dict[str, str] = None):
+    print(f"\n--- Extracting Structured Data with {provider} ---")
+    
+    if api_token is None and provider != "ollama":
+        print(f"API token is required for {provider}. Skipping this example.")
        return

+    extra_args = {}
+    if extra_headers:
+        extra_args["extra_headers"] = extra_headers
+
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://openai.com/api/pricing/",
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
-                provider="openai/gpt-4o",
-                api_token=os.getenv("OPENAI_API_KEY"),
+                provider=provider,
+                api_token=api_token,
                schema=OpenAIModelFee.schema(),
                extraction_type="schema",
                instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. 
                Do not miss any models in the entire content. One extracted model JSON format should look like this: 
                {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}.""",
+                extra_args=extra_args
            ),
            bypass_cache=True,
        )
@@ -320,6 +357,40 @@ async def crawl_dynamic_content_pages_method_3():
        await crawler.crawler_strategy.kill_session(session_id)
        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")

+async def crawl_custom_browser_type():
+    # Use Firefox
+    start = time.time()
+    async with AsyncWebCrawler(browser_type="firefox", verbose=True, headless = True) as crawler:
+        result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
+        print(result.markdown[:500])
+        print("Time taken: ", time.time() - start)
+
+    # Use WebKit
+    start = time.time()
+    async with AsyncWebCrawler(browser_type="webkit", verbose=True, headless = True) as crawler:
+        result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
+        print(result.markdown[:500])
+        print("Time taken: ", time.time() - start)
+
+    # Use Chromium (default)
+    start = time.time()
+    async with AsyncWebCrawler(verbose=True, headless = True) as crawler:
+        result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
+        print(result.markdown[:500])
+        print("Time taken: ", time.time() - start)
+
+async def crawl_with_user_simulation():
+    async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
+        url = "YOUR-URL-HERE"
+        result = await crawler.arun(
+            url=url,
+            bypass_cache=True,
+            simulate_user=True,  # Performs a series of random mouse movements and clicks to simulate user interaction
+            override_navigator=True,  # Overrides the navigator object to make it look like a real user
+        )
+        
+        print(result.markdown)    
+
 async def speed_comparison():
    # print("\n--- Speed Comparison ---")
    # print("Firecrawl (simulated):")
@@ -385,15 +456,84 @@ async def speed_comparison():
    print("If you run these tests in an environment with better network conditions,")
    print("you may observe an even more significant speed advantage for Crawl4AI.")

+async def generate_knowledge_graph():
+    class Entity(BaseModel):
+        name: str
+        description: str
+        
+    class Relationship(BaseModel):
+        entity1: Entity
+        entity2: Entity
+        description: str
+        relation_type: str
+
+    class KnowledgeGraph(BaseModel):
+        entities: List[Entity]
+        relationships: List[Relationship]
+
+    extraction_strategy = LLMExtractionStrategy(
+            provider='openai/gpt-4o-mini', # Or any other provider, including Ollama and open source models
+            api_token=os.getenv('OPENAI_API_KEY'), # In case of Ollama just pass "no-token"
+            schema=KnowledgeGraph.model_json_schema(),
+            extraction_type="schema",
+            instruction="""Extract entities and relationships from the given text."""
+    )
+    async with AsyncWebCrawler() as crawler:
+        url = "https://paulgraham.com/love.html"
+        result = await crawler.arun(
+            url=url,
+            bypass_cache=True,
+            extraction_strategy=extraction_strategy,
+            # magic=True
+        )
+        # print(result.extracted_content)
+        with open(os.path.join(__location__, "kb.json"), "w") as f:
+            f.write(result.extracted_content)
+
+async def fit_markdown_remove_overlay():
+    async with AsyncWebCrawler(headless = False) as crawler:
+        url = "https://janineintheworld.com/places-to-visit-in-central-mexico"
+        result = await crawler.arun(
+            url=url,
+            bypass_cache=True,
+            word_count_threshold = 10,
+            remove_overlay_elements=True,
+            screenshot = True
+        )
+        # Save markdown to file
+        with open(os.path.join(__location__, "mexico_places.md"), "w") as f:
+            f.write(result.fit_markdown)
+
+    print("Done")
+
+
 async def main():
    await simple_crawl()
-    await js_and_css()
+    await simple_example_with_running_js_code()
+    await simple_example_with_css_selector()
    await use_proxy()
+    await capture_and_save_screenshot("https://www.example.com", os.path.join(__location__, "tmp/example_screenshot.jpg"))
    await extract_structured_data_using_css_extractor()
+
+    # LLM extraction examples
    await extract_structured_data_using_llm()
+    await extract_structured_data_using_llm("huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct", os.getenv("HUGGINGFACE_API_KEY"))
+    await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"))
+    await extract_structured_data_using_llm("ollama/llama3.2")    
+
+    # You can always pass custom headers to the extraction strategy
+    custom_headers = {
+        "Authorization": "Bearer your-custom-token",
+        "X-Custom-Header": "Some-Value"
+    }
+    await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"), extra_headers=custom_headers)
+    
    # await crawl_dynamic_content_pages_method_1()
    # await crawl_dynamic_content_pages_method_2()
    await crawl_dynamic_content_pages_method_3()
+    
+    await crawl_custom_browser_type()
+    
    await speed_comparison()


--- a/docs/md_v0/api/core_classes_and_functions.md
+++ b/docs/md_v0/api/core_classes_and_functions.md
--- a/docs/md_v0/api/detailed_api_documentation.md
+++ b/docs/md_v0/api/detailed_api_documentation.md
--- a/docs/md_v0/assets/DankMono-Bold.woff2
+++ b/docs/md_v0/assets/DankMono-Bold.woff2
--- a/docs/md_v0/assets/DankMono-Italic.woff2
+++ b/docs/md_v0/assets/DankMono-Italic.woff2
--- a/docs/md_v0/assets/DankMono-Regular.woff2
+++ b/docs/md_v0/assets/DankMono-Regular.woff2
--- a/docs/md_v0/assets/Monaco.woff
+++ b/docs/md_v0/assets/Monaco.woff
--- a/docs/md_v0/assets/dmvendor.css
+++ b/docs/md_v0/assets/dmvendor.css
--- a/docs/md_v0/assets/highlight.css
+++ b/docs/md_v0/assets/highlight.css
--- a/docs/md_v0/assets/highlight.min.js
+++ b/docs/md_v0/assets/highlight.min.js
--- a/docs/md_v0/assets/highlight_init.js
+++ b/docs/md_v0/assets/highlight_init.js
--- a/docs/md_v0/assets/styles.css
+++ b/docs/md_v0/assets/styles.css
--- a/docs/md_v0/changelog.md
+++ b/docs/md_v0/changelog.md
--- a/docs/md_v0/chunking_strategies.json
+++ b/docs/md_v0/chunking_strategies.json
--- a/docs/md_v0/contact.md
+++ b/docs/md_v0/contact.md
--- a/docs/md_v0/demo.md
+++ b/docs/md_v0/demo.md
--- a/docs/md_v0/examples/hooks_auth.md
+++ b/docs/md_v0/examples/hooks_auth.md
--- a/docs/md_v0/examples/index.md
+++ b/docs/md_v0/examples/index.md
--- a/docs/md_v0/examples/js_execution_css_filtering.md
+++ b/docs/md_v0/examples/js_execution_css_filtering.md
--- a/docs/md_v0/examples/llm_extraction.md
+++ b/docs/md_v0/examples/llm_extraction.md
--- a/docs/md_v0/examples/research_assistant.md
+++ b/docs/md_v0/examples/research_assistant.md
--- a/docs/md_v0/examples/summarization.md
+++ b/docs/md_v0/examples/summarization.md
--- a/docs/md_v0/extraction_strategies.json
+++ b/docs/md_v0/extraction_strategies.json
--- a/docs/md_v0/full_details/advanced_features.md
+++ b/docs/md_v0/full_details/advanced_features.md
--- a/docs/md_v0/full_details/chunking_strategies.md
+++ b/docs/md_v0/full_details/chunking_strategies.md
--- a/docs/md_v0/full_details/crawl_request_parameters.md
+++ b/docs/md_v0/full_details/crawl_request_parameters.md
--- a/docs/md_v0/full_details/crawl_result_class.md
+++ b/docs/md_v0/full_details/crawl_result_class.md
--- a/docs/md_v0/full_details/extraction_strategies.md
+++ b/docs/md_v0/full_details/extraction_strategies.md
--- a/docs/md_v0/index.md
+++ b/docs/md_v0/index.md
--- a/docs/md_v0/installation.md
+++ b/docs/md_v0/installation.md
--- a/docs/md_v0/interactive_content.html
+++ b/docs/md_v0/interactive_content.html
--- a/docs/md_v0/introduction.md
+++ b/docs/md_v0/introduction.md
--- a/docs/md_v0/quickstart.md
+++ b/docs/md_v0/quickstart.md
--- a/docs/md_v1/api/core_classes_and_functions.md
+++ b/docs/md_v1/api/core_classes_and_functions.md
--- a/docs/md_v1/api/detailed_api_documentation.md
+++ b/docs/md_v1/api/detailed_api_documentation.md
--- a/docs/md_v1/assets/DankMono-Bold.woff2
+++ b/docs/md_v1/assets/DankMono-Bold.woff2
--- a/docs/md_v1/assets/DankMono-Italic.woff2
+++ b/docs/md_v1/assets/DankMono-Italic.woff2
--- a/docs/md_v1/assets/DankMono-Regular.woff2
+++ b/docs/md_v1/assets/DankMono-Regular.woff2
--- a/docs/md_v1/assets/Monaco.woff
+++ b/docs/md_v1/assets/Monaco.woff
--- a/docs/md_v1/assets/dmvendor.css
+++ b/docs/md_v1/assets/dmvendor.css
--- a/docs/md_v1/assets/highlight.css
+++ b/docs/md_v1/assets/highlight.css
--- a/docs/md_v1/assets/highlight.min.js
+++ b/docs/md_v1/assets/highlight.min.js
--- a/docs/md_v1/assets/highlight_init.js
+++ b/docs/md_v1/assets/highlight_init.js
--- a/docs/md_v1/assets/styles.css
+++ b/docs/md_v1/assets/styles.css
--- a/docs/md_v1/changelog.md
+++ b/docs/md_v1/changelog.md
--- a/docs/md_v1/contact.md
+++ b/docs/md_v1/contact.md
--- a/docs/md_v1/demo.md
+++ b/docs/md_v1/demo.md
--- a/docs/md_v1/examples/hooks_auth.md
+++ b/docs/md_v1/examples/hooks_auth.md
--- a/docs/md_v1/examples/index.md
+++ b/docs/md_v1/examples/index.md
--- a/docs/md_v1/examples/js_execution_css_filtering.md
+++ b/docs/md_v1/examples/js_execution_css_filtering.md
--- a/docs/md_v1/examples/json_css_extraction.md
+++ b/docs/md_v1/examples/json_css_extraction.md
--- a/docs/md_v1/examples/llm_extraction.md
+++ b/docs/md_v1/examples/llm_extraction.md
--- a/docs/md_v1/examples/research_assistant.md
+++ b/docs/md_v1/examples/research_assistant.md
--- a/docs/md_v1/examples/summarization.md
+++ b/docs/md_v1/examples/summarization.md
--- a/docs/md_v1/full_details/advanced_features.md
+++ b/docs/md_v1/full_details/advanced_features.md
--- a/docs/md_v1/full_details/advanced_jsoncss_extraction.md
+++ b/docs/md_v1/full_details/advanced_jsoncss_extraction.md
--- a/docs/md_v1/full_details/chunking_strategies.md
+++ b/docs/md_v1/full_details/chunking_strategies.md
--- a/docs/md_v1/full_details/crawl_request_parameters.md
+++ b/docs/md_v1/full_details/crawl_request_parameters.md
--- a/docs/md_v1/full_details/crawl_result_class.md
+++ b/docs/md_v1/full_details/crawl_result_class.md
--- a/docs/md_v1/full_details/extraction_strategies.md
+++ b/docs/md_v1/full_details/extraction_strategies.md
--- a/docs/md_v1/full_details/session_based_crawling.md
+++ b/docs/md_v1/full_details/session_based_crawling.md
--- a/docs/md_v1/index.md
+++ b/docs/md_v1/index.md
--- a/docs/md_v1/installation.md
+++ b/docs/md_v1/installation.md
--- a/docs/md_v1/interactive_content.html
+++ b/docs/md_v1/interactive_content.html
--- a/docs/md_v1/introduction.md
+++ b/docs/md_v1/introduction.md
--- a/docs/md_v1/mkdocs.yml
+++ b/docs/md_v1/mkdocs.yml
@@ -0,0 +1,45 @@
+site_name: Crawl4AI Documentation
+site_description: 🔥🕷️ Crawl4AI, Open-source LLM Friendly Web Crawler & Scraper
+site_url: https://docs.crawl4ai.com
+repo_url: https://github.com/unclecode/crawl4ai
+repo_name: unclecode/crawl4ai
+docs_dir: docs/md
+nav:
+  - Home: index.md
+  - First Steps:
+      - Introduction: introduction.md
+      - Installation: installation.md
+      - Quick Start: quickstart.md
+  - Examples:
+      - Intro: examples/index.md
+      - Structured Data Extraction: examples/json_css_extraction.md
+      - LLM Extraction: examples/llm_extraction.md
+      - JS Execution & CSS Filtering: examples/js_execution_css_filtering.md
+      - Hooks & Auth: examples/hooks_auth.md
+      - Summarization: examples/summarization.md
+      - Research Assistant: examples/research_assistant.md
+  - Full Details of Using Crawler:
+      - Crawl Request Parameters: full_details/crawl_request_parameters.md
+      - Crawl Result Class: full_details/crawl_result_class.md
+      - Session Based Crawling: full_details/session_based_crawling.md
+      - Advanced Features: full_details/advanced_features.md
+      - Advanced JsonCssExtraction: full_details/advanced_jsoncss_extraction.md
+      - Chunking Strategies: full_details/chunking_strategies.md
+      - Extraction Strategies: full_details/extraction_strategies.md
+  - Miscellaneous:
+      - Change Log: changelog.md
+      - Contact: contact.md
+
+theme:
+  name: terminal
+  palette: dark
+
+# Add the css/extra.css
+extra_css:
+  - assets/styles.css
+  - assets/highlight.css
+  - assets/dmvendor.css
+
+extra_javascript:
+  - assets/highlight.min.js
+  - assets/highlight_init.js
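
The `nav` entries in the `mkdocs.yml` above are paths relative to `docs_dir`, and MkDocs resolves a relative `docs_dir` against the directory that contains the config file. A minimal sketch of a pre-build check that lists nav targets with no matching file is shown here. The helper names `nav_targets` and `missing_pages` are hypothetical and not part of Crawl4AI or MkDocs, and the line-based regex is a simplification that assumes each nav entry sits on its own line and ends in `.md`, as in this file; it is not a general YAML parser.

```python
import re
from pathlib import Path

# Matches "<label>: <path>.md" at the end of a line (assumption: one nav entry per line).
NAV_TARGET = re.compile(r":\s*(\S+\.md)\s*$")


def nav_targets(mkdocs_text: str) -> list[str]:
    """Return every .md path referenced in the config text, in file order."""
    targets = []
    for line in mkdocs_text.splitlines():
        match = NAV_TARGET.search(line)
        if match:
            targets.append(match.group(1))
    return targets


def missing_pages(mkdocs_text: str, docs_dir: Path) -> list[str]:
    """Return the nav targets that do not exist as files under docs_dir."""
    return [t for t in nav_targets(mkdocs_text) if not (docs_dir / t).is_file()]


if __name__ == "__main__":
    config = Path("mkdocs.yml")  # hypothetical location; adjust to your checkout
    if config.is_file():
        # Assumption: docs_dir is passed by hand here rather than read from the config.
        for page in missing_pages(config.read_text(encoding="utf-8"), Path("docs/md")):
            print(f"missing nav target: {page}")
```

Running this before `mkdocs build` gives a quick list of pages the nav references but the docs directory does not contain, such as `quickstart.md` or `changelog.md` if they were not added in the same change.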
Author	SHA1	Message	Date
UncleCode	4239654722	Update Documentation	2024-10-27 19:24:46 +08:00
UncleCode	38474bd66a	Update version	2024-10-24 20:24:21 +08:00
UncleCode	bcfe83f702	feat: enhance crawler with overlay removal and improved screenshot capabilities • Add smart overlay removal system for handling popups and modals • Improve screenshot functionality with configurable timing controls • Implement URL normalization and enhanced link processing • Add custom base directory support for cache storage • Refine external content filtering and social media domain handling This commit significantly improves the crawler's ability to handle modern websites by automatically removing intrusive overlays and providing better screenshot capabilities. URL handling is now more robust with proper normalization and duplicate detection. The cache system is more flexible with customizable base directory support. Breaking changes: None Issue numbers: None	2024-10-24 20:22:47 +08:00
UncleCode	60ba131ac8	[v0.3.72] Enhance content extraction and proxy support - Add ContentCleaningStrategy for improved content extraction - Implement advanced proxy configuration with authentication - Enhance image source detection and handling - Add fit_markdown and fit_html for refined content output - Improve external link and image handling flexibility	2024-10-22 20:19:22 +08:00
UncleCode	04d16e6d2b	Fix Base64 image parsing in WebScrappingStrategy (issue 182) - Add support for extracting Base64 encoded images - Improve image format detection to include Base64 images - Enhance compatibility with locally saved HTML files using Base64 image encoding	2024-10-20 19:25:25 +08:00
UncleCode	1dd36f9035	Refactor content scraping strategy and improve error handling	2024-10-20 19:11:18 +08:00
UncleCode	6ec4cb33ca	Enhance Markdown generation and external content control - Integrate customized html2text library for flexible Markdown output - Add options to exclude external links and images - Improve content scraping efficiency and error handling - Update AsyncPlaywrightCrawlerStrategy for faster closing - Enhance CosineStrategy with generic embedding model loading	2024-10-20 18:56:58 +08:00
UncleCode	e7cd8a1c2d	Update Changelog	2024-10-19 18:37:12 +08:00
UncleCode	4e2852d5ff	[v0.3.71] Enhance chunking strategies and improve overall performance - Add OverlappingWindowChunking and improve SlidingWindowChunking - Update CHUNK_TOKEN_THRESHOLD to 2048 tokens - Optimize AsyncPlaywrightCrawlerStrategy close method - Enhance flexibility in CosineStrategy with generic embedding model loading - Improve JSON-based extraction strategies - Add knowledge graph generation example	2024-10-19 18:36:59 +08:00
UncleCode	b309bc34e1	Fix the model name in quick start example	2024-10-18 15:32:25 +08:00
UncleCode	b8147b64e0	chore: Bump version to 0.3.71 and improve error handling - Update version number to 0.3.71 - Add sleep_on_close option to AsyncPlaywrightCrawlerStrategy - Enhance context creation with additional options - Improve error message formatting and visibility - Update quickstart documentation	2024-10-18 13:31:12 +08:00
UncleCode	aab6ea022e	Update requirements and switch to 0.3.8	2024-10-18 12:51:23 +08:00
UncleCode	dd17ed0e63	Rename some flags name, introducing magic flag.	2024-10-18 12:35:09 +08:00
UncleCode	768aa06ceb	feat(crawler): Enhance stealth and flexibility, improve error handling - Implement playwright_stealth for better bot detection avoidance - Add user simulation and navigator override options - Improve iframe processing and browser selection - Enhance error reporting and debugging capabilities - Optimize image processing and parallel crawling - Add new example for user simulation feature - Added support for including links in Markdown content, by defining a new flag `include_links_on_markdown` in `crawl` method.	2024-10-17 21:37:48 +08:00
unclecode	9ffa34b697	Update README	2024-10-14 22:58:27 +08:00
unclecode	740802c491	Merge branch '0.3.6'	2024-10-14 22:55:24 +08:00
unclecode	b9ac96c332	Merge branch 'main' of https://github.com/unclecode/crawl4ai	2024-10-14 22:54:23 +08:00
unclecode	d06535388a	Update gitignore	2024-10-14 22:53:56 +08:00
unclecode	2b73bdf6b0	Update changelog	2024-10-14 21:04:02 +08:00
unclecode	6aa803d712	Update gitignore	2024-10-14 21:03:40 +08:00
unclecode	320afdea64	feat: Enhance crawler flexibility and LLM extraction capabilities - Add browser type selection (Chromium, Firefox, WebKit) - Implement iframe content extraction - Improve image processing and dimension updates - Add custom headers support in AsyncPlaywrightCrawlerStrategy - Enhance delayed content retrieval with new parameter - Optimize HTML sanitization and Markdown conversion - Update examples in quickstart_async.py for new features	2024-10-14 21:03:28 +08:00
UncleCode	ccbe72cfc1	Merge pull request #135 from hitesh22rana/fix/docs-example docs: fixed css_selector for example	2024-10-13 14:39:07 +08:00
unclecode	b9bbd42373	Update Quickstart examples	2024-10-13 14:37:45 +08:00
unclecode	68e9144ce3	feat: Enhance crawling control and LLM extraction flexibility - Add before_retrieve_html hook and delay_before_return_html option - Implement flexible page_timeout for smart_wait function - Support extra_args and custom headers in LLM extraction - Allow arbitrary kwargs in AsyncWebCrawler initialization - Improve perform_completion_with_backoff for custom API calls - Update examples with new features and diverse LLM providers	2024-10-12 14:48:22 +08:00
unclecode	9b2b267820	CHANGELOG UPDATE	2024-10-12 13:42:56 +08:00
unclecode	ff3524d9b1	feat(v0.3.6): Add screenshot capture, delayed content, and custom timeouts - Implement screenshot capture functionality - Add delayed content retrieval method - Introduce custom page timeout parameter - Enhance LLM support with multiple providers - Improve database schema auto-updates - Optimize image processing in WebScrappingStrategy - Update error handling and logging - Expand examples in quickstart_async.py	2024-10-12 13:42:42 +08:00
unclecode	b99d20b725	Add pypi_build.sh to .gitignore	2024-10-08 18:10:57 +08:00
hitesh22rana	768b93140f	docs: fixed css_selector for example	2024-10-05 00:25:41 +09:00
unclecode	4750810a67	Enhance AsyncWebCrawler with smart waiting and screenshot capabilities - Implement smart_wait function in AsyncPlaywrightCrawlerStrategy - Add screenshot support to AsyncCrawlResponse and AsyncWebCrawler - Improve error handling and timeout management in crawling process - Fix typo in CrawlResult model (responser_headers -> response_headers) - Update .gitignore to exclude additional files - Adjust import path in test_basic_crawling.py	2024-10-02 17:34:56 +08:00
unclecode	e0e0db4247	Bump version to 0.3.4	2024-09-29 17:07:52 +08:00
unclecode	bccadec887	Remove dependency on psutil, PyYaml, and extend requests version range	2024-09-29 17:07:06 +08:00