Update README.md

Update simple-crawling.md (#379)
In the comprehensive example, AttributeError: type object 'CacheMode' has no attribute 'ENABLE'. Did you mean: 'ENABLED'?
2024-12-30 21:23:19 +08:00 · 2024-12-27 17:42:59 +08:00 · 2024-12-24 19:56:07 +08:00 · 2024-12-15 19:49:38 +08:00 · 2024-12-15 19:49:30 +08:00 · 2024-12-13 21:51:38 +08:00
50 changed files with 9558 additions and 2174 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -206,6 +206,7 @@ pypi_build.sh
 git_issues.py
 git_issues.md

+.next/
 .tests/
 .issues/
 .docs/
@@ -214,4 +215,6 @@ git_issues.md
 todo_executor.md
 protect-all-except-feature.sh
 manage-collab.sh
-publish.sh
+publish.sh
+combine.sh
+combined_output.txt
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,141 @@
 # Changelog

+## [0.4.1] December 8, 2024
+
+### **File: `crawl4ai/async_crawler_strategy.py`**
+
+#### **New Parameters and Attributes Added**
+- **`text_only` (boolean)**: Enables text-only mode, which disables images, JavaScript, and GPU-related features for faster, minimal rendering.
+- **`light_mode` (boolean)**: Optimizes the browser by disabling unnecessary background processes and features for efficiency.
+- **`viewport_width` and `viewport_height`**: Adjusted dynamically based on `text_only` mode (default values: 800x600 for `text_only`, 1920x1080 otherwise).
+- **`extra_args`**: Adds browser-specific flags for `text_only` mode.
+- **`adjust_viewport_to_content`**: Dynamically adjusts the viewport to the content size for accurate rendering.
+
+#### **Browser Context Adjustments**
+- Added **`viewport` adjustments**: Dynamically computed based on `text_only` or custom configuration.
+- Enhanced support for `light_mode` and `text_only` by adding specific browser arguments to reduce resource consumption.
+
+#### **Dynamic Content Handling**
+- **Full Page Scan Feature**:
+  - Scrolls through the entire page while dynamically detecting content changes.
+  - Ensures scrolling stops when no new dynamic content is loaded.
+
+#### **Session Management**
+- Added **`create_session`** method:
+  - Creates a new browser session and assigns a unique ID.
+  - Supports persistent and non-persistent contexts with full compatibility for cookies, headers, and proxies.
+
+#### **Improved Content Loading and Adjustment**
+- **`adjust_viewport_to_content`**:
+  - Automatically adjusts viewport to match content dimensions.
+  - Includes scaling via Chrome DevTools Protocol (CDP).
+- Enhanced content loading:
+  - Waits for images to load and ensures network activity is idle before proceeding.
+
+#### **Error Handling and Logging**
+- Improved error handling and detailed logging for:
+  - Viewport adjustment (`adjust_viewport_to_content`).
+  - Full page scanning (`scan_full_page`).
+  - Dynamic content loading.
+
+#### **Refactoring and Cleanup**
+- Removed hardcoded viewport dimensions in multiple places, replaced with dynamic values (`self.viewport_width`, `self.viewport_height`).
+- Removed commented-out and unused code for better readability.
+- Added default value for `delay_before_return_html` parameter.
+
+#### **Optimizations**
+- Reduced resource usage in `light_mode` by disabling unnecessary browser features such as extensions, background timers, and sync.
+- Improved compatibility for different browser types (`chrome`, `firefox`, `webkit`).
+
+---
+
+### **File: `docs/examples/quickstart_async.py`**
+
+#### **Schema Adjustment**
+- Changed schema reference for `LLMExtractionStrategy`:
+  - **Old**: `OpenAIModelFee.schema()`
+  - **New**: `OpenAIModelFee.model_json_schema()`
+  - `model_json_schema()` is the Pydantic v2 replacement for the deprecated `schema()` method.
+
+#### **Documentation Comments Updated**
+- Improved extraction instruction for schema-based LLM strategies.
+
+---
+
+### **New Features Added**
+1. **Text-Only Mode**:
+   - Focuses on minimal resource usage by disabling non-essential browser features.
+2. **Light Mode**:
+   - Optimizes browser for performance by disabling background tasks and unnecessary services.
+3. **Full Page Scanning**:
+   - Ensures the entire content of a page is crawled, including dynamic elements loaded during scrolling.
+4. **Dynamic Viewport Adjustment**:
+   - Automatically resizes the viewport to match content dimensions, improving compatibility and rendering accuracy.
+5. **Session Management**:
+   - Simplifies session handling with better support for persistent and non-persistent contexts.
+
+---
+
+### **Bug Fixes**
+- Fixed potential viewport mismatches by ensuring consistent use of `self.viewport_width` and `self.viewport_height` throughout the code.
+- Improved robustness of dynamic content loading to avoid timeouts and failed evaluations.
+
+
+
+
+
+
+
+## [0.3.75] December 1, 2024
+
+### PruningContentFilter
+
+#### 1. Introduced PruningContentFilter (Dec 01, 2024)
+A new content filtering strategy that removes less relevant nodes based on metrics like text and link density.
+
+**Affected Files:**
+- `crawl4ai/content_filter_strategy.py`: Enhancement of content filtering capabilities.
+```diff
+Implemented effective pruning algorithm with comprehensive scoring.
+```
+- `README.md`: Improved documentation regarding new features.
+```diff
+Updated to include usage and explanation for the PruningContentFilter.
+```
+- `docs/md_v2/basic/content_filtering.md`: Expanded documentation for users.
+```diff
+Added detailed section explaining the PruningContentFilter.
+```
+
+#### 2. Added Unit Tests for PruningContentFilter (Dec 01, 2024)
+Comprehensive tests added to ensure correct functionality of PruningContentFilter.
+
+**Affected Files:**
+- `tests/async/test_content_filter_prune.py`: Increased test coverage for content filtering strategies.
+```diff
+Created test cases for various scenarios using the PruningContentFilter.
+```
+
+### Development Updates
+
+#### 3. Enhanced BM25ContentFilter tests (Dec 01, 2024)
+Extended testing to cover additional edge cases and performance metrics.
+
+**Affected Files:**
+- `tests/async/test_content_filter_bm25.py`: Improved reliability and performance assurance.
+```diff
+Added tests for new extraction scenarios including malformed HTML.
+```
+
+### Infrastructure & Documentation
+
+#### 4. Updated Examples (Dec 01, 2024)
+Altered examples in documentation to promote the use of PruningContentFilter alongside existing strategies.
+
+**Affected Files:**
+- `docs/examples/quickstart_async.py`: Enhanced usability and clarity for new users.
+- Revised example to illustrate usage of PruningContentFilter.
+
 ## [0.3.746] November 29, 2024

 ### Major Features
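The Full Page Scan entry in the 0.4.1 changelog above describes scrolling through the page and stopping once no new dynamic content loads. A minimal sketch of that stopping rule follows. It assumes a hypothetical `FakePage` stand-in rather than a real browser page, and the function body is an illustration only; the actual `scan_full_page` implementation is not shown in this diff.

```python
class FakePage:
    """Hypothetical stand-in for a browser page whose height grows as it is scrolled."""

    def __init__(self, heights):
        self._heights = list(heights)  # successive page heights after each scroll
        self._index = 0
        self.scroll_y = 0

    def height(self):
        return self._heights[self._index]

    def scroll_to(self, y):
        self.scroll_y = y
        # Scrolling may trigger lazy loading, which makes the page taller.
        if self._index < len(self._heights) - 1:
            self._index += 1


def scan_full_page(page, max_scrolls=50):
    """Scroll to the bottom repeatedly; stop once the height stops changing."""
    scrolls = 0
    last_height = page.height()
    while scrolls < max_scrolls:
        page.scroll_to(last_height)
        scrolls += 1
        new_height = page.height()
        if new_height == last_height:
            break  # the last scroll loaded no new content
        last_height = new_height
    return scrolls, last_height


page = FakePage([1000, 1800, 2400, 2400])
scrolls, final_height = scan_full_page(page)
```

The `max_scrolls` cap is an assumed safeguard so that a truly infinite feed cannot keep the loop running forever.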
--- a/MANIFEST.in
+++ b/MANIFEST.in
@@ -1 +1,2 @@
-include requirements.txt
+include requirements.txt
+recursive-include crawl4ai/js_snippet *.js
--- a/README.md
+++ b/README.md
@@ -1,4 +1,4 @@
-# 🔥🕷️ Crawl4AI: Crawl Smarter, Faster, Freely. For AI.
+# 🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper.

 <a href="https://trendshift.io/repositories/11716" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11716" alt="unclecode%2Fcrawl4ai | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>

@@ -11,7 +11,9 @@

 Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.  

-[✨ Check out latest update v0.3.745](#-recent-updates)
+[✨ Check out latest update v0.4.2](#-recent-updates)
+
+🎉 **Version 0.4.2 is out!** Introducing our experimental PruningContentFilter - a powerful new algorithm for smarter Markdown generation. Test it out and [share your feedback](https://github.com/unclecode/crawl4ai/issues)! [Read the release notes →](https://crawl4ai.com/mkdocs/blog)

 ## 🧐 Why Crawl4AI?

@@ -77,6 +79,7 @@ if __name__ == "__main__":
 - 🧩 **Proxy Support**: Seamlessly connect to proxies with authentication for secure access.
 - ⚙️ **Full Browser Control**: Modify headers, cookies, user agents, and more for tailored crawling setups.
 - 🌍 **Multi-Browser Support**: Compatible with Chromium, Firefox, and WebKit.
+- 📐 **Dynamic Viewport Adjustment**: Automatically adjusts the browser viewport to match page content, ensuring complete rendering and capturing of all elements.

 </details>

@@ -92,6 +95,8 @@ if __name__ == "__main__":
 - 💾 **Caching**: Cache data for improved speed and to avoid redundant fetches.
 - 📄 **Metadata Extraction**: Retrieve structured metadata from web pages.
 - 📡 **IFrame Content Extraction**: Seamless extraction from embedded iframe content.
+- 🕵️ **Lazy Load Handling**: Waits for images to fully load, ensuring no content is missed due to lazy loading.
+- 🔄 **Full-Page Scanning**: Simulates scrolling to load and capture all dynamic content, perfect for infinite scroll pages.

 </details>

@@ -118,8 +123,6 @@ if __name__ == "__main__":

 </details>

-
-
 ## Try it Now!

 ✨ Play around with this [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)
@@ -422,7 +425,7 @@ You can check the project structure in the directory [https://github.com/uncleco
 ```python
 import asyncio
 from crawl4ai import AsyncWebCrawler, CacheMode
-from crawl4ai.content_filter_strategy import BM25ContentFilter
+from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
 from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

 async def main():
@@ -434,8 +437,11 @@ async def main():
            url="https://docs.micronaut.io/4.7.6/guide/",
            cache_mode=CacheMode.ENABLED,
            markdown_generator=DefaultMarkdownGenerator(
-                content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
+                content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
            ),
+            # markdown_generator=DefaultMarkdownGenerator(
+            #     content_filter=BM25ContentFilter(user_query="WHEN_WE_FOCUS_BASED_ON_A_USER_QUERY", bm25_threshold=1.0)
+            # ),
        )
        print(len(result.markdown))
        print(len(result.fit_markdown))
@@ -620,18 +626,27 @@ async def test_news_crawl():

 ## ✨ Recent Updates   

- 🚀 **Improved ManagedBrowser Configuration**: Dynamic host and port support for more flexible browser management.  
- 📝 **Enhanced Markdown Generation**: New generator class for better formatting and customization.  
- ⚡ **Fast HTML Formatting**: Significantly optimized HTML formatting in the web crawler.  
- 🛠️ **Utility & Sanitization Upgrades**: Improved sanitization and expanded utility functions for streamlined workflows.  
- 👥 **Acknowledgments**: Added contributor details and pull request acknowledgments for better transparency.  
+- 🔧 **Configurable Crawlers and Browsers**: Simplified crawling with `BrowserConfig` and `CrawlerRunConfig`, making setups cleaner and more scalable.
+- 🔐 **Session Management Enhancements**: Import/export local storage for personalized crawling with seamless session reuse.
+- 📸 **Supercharged Screenshots**: Take lightning-fast, full-page screenshots of very long pages.
+- 📜 **Full-Page PDF Export**: Convert any web page into a PDF for easy sharing or archiving.
+- 🖼️ **Lazy Load Handling**: Improved support for websites with lazy-loaded images. The crawler now waits for all images to fully load, ensuring no content is missed.
+- ⚡ **Text-Only Mode**: New mode for fast, lightweight crawling. Disables images, JavaScript, and GPU rendering, improving speed by 3-4x for text-focused crawls.
+- 📐 **Dynamic Viewport Adjustment**: Automatically adjusts the browser viewport to fit page content, ensuring accurate rendering and capturing of all elements.
+- 🔄 **Full-Page Scanning**: Added scrolling support for pages with infinite scroll or dynamic content loading. Ensures every part of the page is captured.
+- 🧑‍💻 **Session Reuse**: Introduced `create_session` for efficient crawling by reusing the same browser session across multiple requests.
+- 🌟 **Light Mode**: Optimized browser performance by disabling unnecessary features like extensions, background timers, and sync processes.


+Read the full details of this release in our [0.4.2 Release Notes](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/blog/releases/0.4.2.md).
+
 ## 📖 Documentation & Roadmap 

-For detailed documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://crawl4ai.com/mkdocs/).
+> 🚨 **Documentation Update Alert**: We're undertaking a major documentation overhaul next week to reflect recent updates and improvements. Stay tuned for a more comprehensive and up-to-date guide!

-Moreover to check our development plans and upcoming features, check out our [Roadmap](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md).
+For current documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://crawl4ai.com/mkdocs/).
+
+To check our development plans and upcoming features, visit our [Roadmap](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md).

 <details>
 <summary>📈 <strong>Development TODOs</strong></summary>
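The README example in this file switches to `PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)`, which the changelog describes as removing less relevant nodes based on metrics like text and link density. A rough sketch of that idea, under stated assumptions: the `Node` type, the scoring formula, and `prune` are hypothetical illustrations, not the library's algorithm.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical content block: its visible text and how much of it is link text."""
    text: str
    link_text: str = ""


def score(node: Node) -> float:
    """Assumed metric: share of the node's characters that are not link text."""
    total = len(node.text)
    if total == 0:
        return 0.0
    return 1.0 - len(node.link_text) / total


def prune(nodes, threshold=0.48):
    """Keep nodes whose score meets a fixed threshold; drop the rest."""
    return [n for n in nodes if score(n) >= threshold]


nodes = [
    Node("A long paragraph of real article content."),
    Node("Home | About | Contact", link_text="Home About Contact"),
    Node(""),
]
kept = prune(nodes)
```

Here the navigation bar is dropped because most of its characters are link text, and the empty block is dropped because it scores zero.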
--- a/README.sync.md
+++ b/README.sync.md
@@ -1,244 +0,0 @@
-# Crawl4AI v0.2.77 🕷️🤖
-
-[![GitHub Stars](https://img.shields.io/github/stars/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/stargazers)
-[![GitHub Forks](https://img.shields.io/github/forks/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/network/members)
-[![GitHub Issues](https://img.shields.io/github/issues/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/issues)
-[![GitHub Pull Requests](https://img.shields.io/github/issues-pr/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/pulls)
-[![License](https://img.shields.io/github/license/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)
-
-Crawl4AI simplifies web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. 🆓🌐
-
-#### [v0.2.77] - 2024-08-02
-
-Major improvements in functionality, performance, and cross-platform compatibility! 🚀
-
- 🐳 **Docker enhancements**:
-  - Significantly improved Dockerfile for easy installation on Linux, Mac, and Windows.
- 🌐 **Official Docker Hub image**:
-  - Launched our first official image on Docker Hub for streamlined deployment (unclecode/crawl4ai).
- 🔧 **Selenium upgrade**:
-  - Removed dependency on ChromeDriver, now using Selenium's built-in capabilities for better compatibility.
- 🖼️ **Image description**:
-  - Implemented ability to generate textual descriptions for extracted images from web pages.
- ⚡ **Performance boost**:
-  - Various improvements to enhance overall speed and performance.
-  
-## Try it Now!
-
-✨ Play around with this [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1sJPAmeLj5PMrg2VgOwMJ2ubGIcK0cJeX?usp=sharing)
-
-✨ visit our [Documentation Website](https://crawl4ai.com/mkdocs/)
-
-✨ Check [Demo](https://crawl4ai.com/mkdocs/demo)
-
-## Features ✨
-
- 🆓 Completely free and open-source
- 🤖 LLM-friendly output formats (JSON, cleaned HTML, markdown)
- 🌍 Supports crawling multiple URLs simultaneously
- 🎨 Extracts and returns all media tags (Images, Audio, and Video)
- 🔗 Extracts all external and internal links
- 📚 Extracts metadata from the page
- 🔄 Custom hooks for authentication, headers, and page modifications before crawling
- 🕵️ User-agent customization
- 🖼️ Takes screenshots of the page
- 📜 Executes multiple custom JavaScripts before crawling
- 📚 Various chunking strategies: topic-based, regex, sentence, and more
- 🧠 Advanced extraction strategies: cosine clustering, LLM, and more
- 🎯 CSS selector support
- 📝 Passes instructions/keywords to refine extraction
-
-# Crawl4AI
-
-## 🌟 Shoutout to Contributors of v0.2.77!
-
-A big thank you to the amazing contributors who've made this release possible:
-
- [@aravindkarnam](https://github.com/aravindkarnam) for the new image description feature
- [@FractalMind](https://github.com/FractalMind) for our official Docker Hub image
- [@ketonkss4](https://github.com/ketonkss4) for helping streamline our Selenium setup
-
-Your contributions are driving Crawl4AI forward! 🚀
-
-## Cool Examples 🚀
-
-### Quick Start
-
-```python
-from crawl4ai import WebCrawler
-
-# Create an instance of WebCrawler
-crawler = WebCrawler()
-
-# Warm up the crawler (load necessary models)
-crawler.warmup()
-
-# Run the crawler on a URL
-result = crawler.run(url="https://www.nbcnews.com/business")
-
-# Print the extracted content
-print(result.markdown)
-```
-
-## How to install 🛠 
-
-### Using pip 🐍
-```bash
-virtualenv venv
-source venv/bin/activate
-pip install "crawl4ai @ git+https://github.com/unclecode/crawl4ai.git"
-```
-
-### Using Docker 🐳
-
-```bash
-# For Mac users (M1/M2)
-# docker build --platform linux/amd64 -t crawl4ai .
-docker build -t crawl4ai .
-docker run -d -p 8000:80 crawl4ai
-```
-
-### Using Docker Hub 🐳
-
-```bash
-docker pull unclecode/crawl4ai:latest
-docker run -d -p 8000:80 unclecode/crawl4ai:latest
-```
-
-
-## Speed-First Design 🚀
-
-Perhaps the most important design principle for this library is speed. We need to ensure it can handle many links and resources in parallel as quickly as possible. By combining this speed with fast LLMs like Groq, the results will be truly amazing.
-
-```python
-import time
-from crawl4ai.web_crawler import WebCrawler
-crawler = WebCrawler()
-crawler.warmup()
-
-start = time.time()
-url = r"https://www.nbcnews.com/business"
-result = crawler.run( url, word_count_threshold=10, bypass_cache=True)
-end = time.time()
-print(f"Time taken: {end - start}")
-```
-
-Let's take a look the calculated time for the above code snippet:
-
-```bash
-[LOG] 🚀 Crawling done, success: True, time taken: 1.3623387813568115 seconds
-[LOG] 🚀 Content extracted, success: True, time taken: 0.05715131759643555 seconds
-[LOG] 🚀 Extraction, time taken: 0.05750393867492676 seconds.
-Time taken: 1.439958095550537
-```
-Fetching the content from the page took 1.3623 seconds, and extracting the content took 0.0575 seconds. 🚀
-
-### Extract Structured Data from Web Pages 📊
-
-Crawl all OpenAI models and their fees from the official page.
-
-```python
-import os
-from crawl4ai import WebCrawler
-from crawl4ai.extraction_strategy import LLMExtractionStrategy
-from pydantic import BaseModel, Field
-
-class OpenAIModelFee(BaseModel):
-    model_name: str = Field(..., description="Name of the OpenAI model.")
-    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
-    output_fee: str = Field(..., description="Fee for output token ßfor the OpenAI model.")
-
-url = 'https://openai.com/api/pricing/'
-crawler = WebCrawler()
-crawler.warmup()
-
-result = crawler.run(
-        url=url,
-        word_count_threshold=1,
-        extraction_strategy= LLMExtractionStrategy(
-            provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'), 
-            schema=OpenAIModelFee.schema(),
-            extraction_type="schema",
-            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. 
-            Do not miss any models in the entire content. One extracted model JSON format should look like this: 
-            {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
-        ),            
-        bypass_cache=True,
-    )
-
-print(result.extracted_content)
-```
-
-### Execute JS, Filter Data with CSS Selector, and Clustering
-
-```python
-from crawl4ai import WebCrawler
-from crawl4ai.chunking_strategy import CosineStrategy
-
-js_code = ["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"]
-
-crawler = WebCrawler()
-crawler.warmup()
-
-result = crawler.run(
-    url="https://www.nbcnews.com/business",
-    js=js_code,
-    css_selector="p",
-    extraction_strategy=CosineStrategy(semantic_filter="technology")
-)
-
-print(result.extracted_content)
-```
-
-### Extract Structured Data from Web Pages With Proxy and BaseUrl
-
-```python
-from crawl4ai import WebCrawler
-from crawl4ai.extraction_strategy import LLMExtractionStrategy
-
-def create_crawler():
-    crawler = WebCrawler(verbose=True, proxy="http://127.0.0.1:7890")
-    crawler.warmup()
-    return crawler
-
-crawler = create_crawler()
-
-crawler.warmup()
-
-result = crawler.run(
-    url="https://www.nbcnews.com/business",
-    extraction_strategy=LLMExtractionStrategy(
-        provider="openai/gpt-4o",
-        api_token="sk-",
-        base_url="https://api.openai.com/v1"
-    )
-)
-
-print(result.markdown)
-```
-
-## Documentation 📚
-
-For detailed documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://crawl4ai.com/mkdocs/).
-
-## Contributing 🤝
-
-We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTING.md) for more information.
-
-## License 📄
-
-Crawl4AI is released under the [Apache 2.0 License](https://github.com/unclecode/crawl4ai/blob/main/LICENSE).
-
-## Contact 📧
-
-For questions, suggestions, or feedback, feel free to reach out:
-
- GitHub: [unclecode](https://github.com/unclecode)
- Twitter: [@unclecode](https://twitter.com/unclecode)
- Website: [crawl4ai.com](https://crawl4ai.com)
-
-Happy Crawling! 🕸️🚀
-
-## Star History
-
-[![Star History Chart](https://api.star-history.com/svg?repos=unclecode/crawl4ai&type=Date)](https://star-history.com/#unclecode/crawl4ai&Date)
--- a/a.md
+++ b/a.md
--- a/crawl4ai/__init__.py
+++ b/crawl4ai/__init__.py
@@ -1,7 +1,11 @@
 # __init__.py

 from .async_webcrawler import AsyncWebCrawler, CacheMode
-
+from .async_configs import BrowserConfig, CrawlerRunConfig
+from .extraction_strategy import ExtractionStrategy, LLMExtractionStrategy, CosineStrategy, JsonCssExtractionStrategy
+from .chunking_strategy import ChunkingStrategy, RegexChunking
+from .markdown_generation_strategy import DefaultMarkdownGenerator
+from .content_filter_strategy import PruningContentFilter, BM25ContentFilter
 from .models import CrawlResult
 from .__version__ import __version__

@@ -9,6 +13,17 @@ __all__ = [
    "AsyncWebCrawler",
    "CrawlResult",
    "CacheMode",
+    'BrowserConfig',
+    'CrawlerRunConfig',
+    'ExtractionStrategy',
+    'LLMExtractionStrategy',
+    'CosineStrategy',
+    'JsonCssExtractionStrategy',
+    'ChunkingStrategy',
+    'RegexChunking',
+    'DefaultMarkdownGenerator',
+    'PruningContentFilter',
+    'BM25ContentFilter',
 ]

 def is_sync_version_installed():
--- a/crawl4ai/__version__.py
+++ b/crawl4ai/__version__.py
@@ -1,2 +1,2 @@
 # crawl4ai/_version.py
-__version__ = "0.3.746"
+__version__ = "0.4.22"
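The `crawl4ai/async_configs.py` file that follows replaces scattered keyword arguments with configuration objects. The sketch below isolates three patterns from that file: `None` defaults for mutable arguments, a persistent context forcing the managed browser on, and a `from_kwargs` constructor. `MiniBrowserConfig` is a hypothetical, trimmed-down class for illustration, and dropping unknown keys in `from_kwargs` is an assumption of this sketch.

```python
class MiniBrowserConfig:
    """Hypothetical, trimmed-down config object; not the real BrowserConfig."""

    def __init__(self, headless=True, use_managed_browser=False,
                 use_persistent_context=False, cookies=None, headers=None):
        self.headless = headless
        self.use_managed_browser = use_managed_browser
        self.use_persistent_context = use_persistent_context
        # None defaults avoid sharing one mutable list/dict across instances.
        self.cookies = cookies if cookies is not None else []
        self.headers = headers if headers is not None else {}
        # A persistent context requires the managed browser.
        if self.use_persistent_context:
            self.use_managed_browser = True

    @staticmethod
    def from_kwargs(kwargs):
        """Build a config from a legacy kwargs dict, ignoring unknown keys."""
        known = ("headless", "use_managed_browser", "use_persistent_context",
                 "cookies", "headers")
        return MiniBrowserConfig(**{k: kwargs[k] for k in known if k in kwargs})


a = MiniBrowserConfig(use_persistent_context=True)
b = MiniBrowserConfig.from_kwargs({"headless": False, "verbose": True})
a.cookies.append({"name": "session", "value": "abc"})
```

Because each instance gets its own list, appending a cookie to `a` leaves `b.cookies` empty.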
--- a/crawl4ai/async_configs.py
+++ b/crawl4ai/async_configs.py
@@ -0,0 +1,406 @@
+from .config import (
+    MIN_WORD_THRESHOLD, 
+    IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
+    SCREENSHOT_HEIGHT_TRESHOLD,
+    PAGE_TIMEOUT
+)
+from .user_agent_generator import UserAgentGenerator
+from .extraction_strategy import ExtractionStrategy
+from .chunking_strategy import ChunkingStrategy
+from .markdown_generation_strategy import MarkdownGenerationStrategy
+
+class BrowserConfig:
+    """
+    Configuration class for setting up a browser instance and its context in AsyncPlaywrightCrawlerStrategy.
+
+    This class centralizes all parameters that affect browser and context creation. Instead of passing
+    scattered keyword arguments, users can instantiate and modify this configuration object. The crawler
+    code will then reference these settings to initialize the browser in a consistent, documented manner.
+
+    Attributes:
+        browser_type (str): The type of browser to launch. Supported values: "chromium", "firefox", "webkit".
+                            Default: "chromium".
+        headless (bool): Whether to run the browser in headless mode (no visible GUI).
+                         Default: True.
+        use_managed_browser (bool): Launch the browser using a managed approach (e.g., via CDP), allowing
+                                    advanced manipulation. Default: False.
+        use_persistent_context (bool): Use a persistent browser context (like a persistent profile).
+                                       Automatically sets use_managed_browser=True. Default: False.
+        user_data_dir (str or None): Path to a user data directory for persistent sessions. If None, a
+                                     temporary directory may be used. Default: None.
+        chrome_channel (str): The Chrome channel to launch (e.g., "chrome", "msedge"). Only applies if browser_type
+                              is "chromium". Default: "chrome".
+        proxy (str or None): Proxy server URL (e.g., "http://username:password@proxy:port"). If None, no proxy is used.
+                             Default: None.
+        proxy_config (dict or None): Detailed proxy configuration, e.g. {"server": "...", "username": "..."}.
+                                     If None, no additional proxy config. Default: None.
+        viewport_width (int): Default viewport width for pages. Default: 1920.
+        viewport_height (int): Default viewport height for pages. Default: 1080.
+        verbose (bool): Enable verbose logging.
+                        Default: True.
+        accept_downloads (bool): Whether to allow file downloads. If True, requires a downloads_path.
+                                 Default: False.
+        downloads_path (str or None): Directory to store downloaded files. If None and accept_downloads is True,
+                                      a default path will be created. Default: None.
+        storage_state (str or dict or None): Path or object describing storage state (cookies, localStorage).
+                                             Default: None.
+        ignore_https_errors (bool): Ignore HTTPS certificate errors. Default: True.
+        java_script_enabled (bool): Enable JavaScript execution in pages. Default: True.
+        cookies (list): List of cookies to add to the browser context. Each cookie is a dict with fields like
+                        {"name": "...", "value": "...", "url": "..."}.
+                        Default: [].
+        headers (dict): Extra HTTP headers to apply to all requests in this context.
+                        Default: {}.
+        user_agent (str): Custom User-Agent string to use. Default: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
+                           "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36".
+        user_agent_mode (str or None): Mode for generating the user agent (e.g., "random"). If None, use the provided
+                                       user_agent as-is. Default: None.
+        user_agent_generator_config (dict or None): Configuration for user agent generation if user_agent_mode is set.
+                                                    Default: None.
+        text_only (bool): If True, disables images and other rich content for potentially faster load times.
+                          Default: False.
+        light_mode (bool): Disables certain background features for performance gains. Default: False.
+        extra_args (list): Additional command-line arguments passed to the browser.
+                           Default: [].
+    """
+
+    def __init__(
+        self,
+        browser_type: str = "chromium",
+        headless: bool = True,
+        use_managed_browser: bool = False,
+        use_persistent_context: bool = False,
+        user_data_dir: str = None,
+        chrome_channel: str = "chrome",
+        proxy: str = None,
+        proxy_config: dict = None,
+        viewport_width: int = 1920,
+        viewport_height: int = 1080,
+        accept_downloads: bool = False,
+        downloads_path: str = None,
+        storage_state=None,
+        ignore_https_errors: bool = True,
+        java_script_enabled: bool = True,
+        sleep_on_close: bool = False,
+        verbose: bool = True,
+        cookies: list = None,
+        headers: dict = None,
+        user_agent: str = (
+            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) AppleWebKit/537.36 "
+            "(KHTML, like Gecko) Chrome/116.0.5845.187 Safari/604.1 Edg/117.0.2045.47"
+        ),
+        user_agent_mode: str = None,
+        user_agent_generator_config: dict = None,
+        text_only: bool = False,
+        light_mode: bool = False,
+        extra_args: list = None,
+    ):
+        self.browser_type = browser_type
+        self.headless = headless
+        self.use_managed_browser = use_managed_browser
+        self.use_persistent_context = use_persistent_context
+        self.user_data_dir = user_data_dir
+        if self.browser_type == "chromium":
+            self.chrome_channel = "chrome"
+        elif self.browser_type == "firefox":
+            self.chrome_channel = "firefox"
+        elif self.browser_type == "webkit":
+            self.chrome_channel = "webkit"
+        else:
+            self.chrome_channel = chrome_channel or "chrome"
+        self.proxy = proxy
+        self.proxy_config = proxy_config
+        self.viewport_width = viewport_width
+        self.viewport_height = viewport_height
+        self.accept_downloads = accept_downloads
+        self.downloads_path = downloads_path
+        self.storage_state = storage_state
+        self.ignore_https_errors = ignore_https_errors
+        self.java_script_enabled = java_script_enabled
+        self.cookies = cookies if cookies is not None else []
+        self.headers = headers if headers is not None else {}
+        self.user_agent = user_agent
+        self.user_agent_mode = user_agent_mode
+        self.user_agent_generator_config = user_agent_generator_config
+        self.text_only = text_only
+        self.light_mode = light_mode
+        self.extra_args = extra_args if extra_args is not None else []
+        self.sleep_on_close = sleep_on_close
+        self.verbose = verbose
+        
+        user_agent_generator = UserAgentGenerator()
+        if self.user_agent_mode == "random":  # per the docstring, None keeps the provided user_agent
+            self.user_agent = user_agent_generator.generate(
+                **(self.user_agent_generator_config or {})
+            )
+        self.browser_hint = user_agent_generator.generate_client_hints(self.user_agent)
+        self.headers.setdefault("sec-ch-ua", self.browser_hint)
+
+        # If persistent context is requested, ensure managed browser is enabled
+        if self.use_persistent_context:
+            self.use_managed_browser = True
+
+    @staticmethod
+    def from_kwargs(kwargs: dict) -> "BrowserConfig":
+        return BrowserConfig(
+            browser_type=kwargs.get("browser_type", "chromium"),
+            headless=kwargs.get("headless", True),
+            use_managed_browser=kwargs.get("use_managed_browser", False),
+            use_persistent_context=kwargs.get("use_persistent_context", False),
+            user_data_dir=kwargs.get("user_data_dir"),
+            chrome_channel=kwargs.get("chrome_channel", "chrome"),
+            proxy=kwargs.get("proxy"),
+            proxy_config=kwargs.get("proxy_config"),
+            viewport_width=kwargs.get("viewport_width", 1920),
+            viewport_height=kwargs.get("viewport_height", 1080),
+            accept_downloads=kwargs.get("accept_downloads", False),
+            downloads_path=kwargs.get("downloads_path"),
+            storage_state=kwargs.get("storage_state"),
+            ignore_https_errors=kwargs.get("ignore_https_errors", True),
+            java_script_enabled=kwargs.get("java_script_enabled", True),
+            cookies=kwargs.get("cookies", []),
+            headers=kwargs.get("headers", {}),
+            user_agent=kwargs.get("user_agent",
+                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
+                "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
+            ),
+            user_agent_mode=kwargs.get("user_agent_mode"),
+            user_agent_generator_config=kwargs.get("user_agent_generator_config"),
+            text_only=kwargs.get("text_only", False),
+            light_mode=kwargs.get("light_mode", False),
+            extra_args=kwargs.get("extra_args", [])
+        )
+
+
+class CrawlerRunConfig:
+    """
+    Configuration class for controlling how the crawler runs each crawl operation.
+    This includes parameters for content extraction, page manipulation, waiting conditions,
+    caching, and other runtime behaviors.
+
+    This centralizes parameters that were previously scattered as kwargs to `arun()` and related methods.
+    By using this class, you have a single place to understand and adjust the crawling options.
+
+    Attributes:
+        word_count_threshold (int): Minimum word count threshold before processing content.
+                                    Default: MIN_WORD_THRESHOLD (typically 200).
+        extraction_strategy (ExtractionStrategy or None): Strategy to extract structured data from crawled pages.
+                                                          Default: None (NoExtractionStrategy is used if None).
+        chunking_strategy (ChunkingStrategy): Strategy to chunk content before extraction.
+                                              Default: RegexChunking().
+        content_filter (RelevantContentFilter or None): Optional filter to prune irrelevant content.
+                                                        Default: None.
+        cache_mode (CacheMode or None): Defines how caching is handled.
+                                        If None, defaults to CacheMode.ENABLED internally.
+                                        Default: None.
+        session_id (str or None):   Optional session ID to persist the browser context and the created 
+                                    page instance. If the ID already exists, the crawler does not 
+                                    create a new page and uses the current page to preserve the state;
+                                    if not, it creates a new page and context then stores it in 
+                                    memory with the given session ID.
+        bypass_cache (bool): Legacy parameter, if True acts like CacheMode.BYPASS.
+                             Default: False.
+        disable_cache (bool): Legacy parameter, if True acts like CacheMode.DISABLED.
+                              Default: False.
+        no_cache_read (bool): Legacy parameter, if True acts like CacheMode.WRITE_ONLY.
+                              Default: False.
+        no_cache_write (bool): Legacy parameter, if True acts like CacheMode.READ_ONLY.
+                               Default: False.
+        css_selector (str or None): CSS selector to extract a specific portion of the page.
+                                    Default: None.
+        screenshot (bool): Whether to take a screenshot after crawling.
+                           Default: False.
+        pdf (bool): Whether to generate a PDF of the page.
+                    Default: False.
+        verbose (bool): Enable verbose logging.
+                        Default: True.
+        only_text (bool): If True, attempt to extract text-only content where applicable.
+                          Default: False.
+        image_description_min_word_threshold (int): Minimum words for image description extraction.
+                                                    Default: IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD (e.g., 50).
+        prettiify (bool): If True, apply `fast_format_html` to produce prettified HTML output.
+                          Default: False.
+        js_code (str or list of str or None): JavaScript code/snippets to run on the page.
+                                              Default: None.
+        wait_for (str or None): A CSS selector or JS condition to wait for before extracting content.
+                                Default: None.
+        js_only (bool): If True, indicates subsequent calls are JS-driven updates, not full page loads.
+                        Default: False.
+        wait_until (str): The condition to wait for when navigating, e.g. "domcontentloaded".
+                          Default: "domcontentloaded".
+        page_timeout (int): Timeout in ms for page operations like navigation.
+                            Default: 60000 (60 seconds).
+        ignore_body_visibility (bool): If True, ignore whether the body is visible before proceeding.
+                                       Default: True.
+        wait_for_images (bool): If True, wait for images to load before extracting content. 
+                                Default: True.
+        adjust_viewport_to_content (bool): If True, adjust viewport according to the page content dimensions.
+                                           Default: False.
+        scan_full_page (bool): If True, scroll through the entire page to load all content.
+                               Default: False.
+        scroll_delay (float): Delay in seconds between scroll steps if scan_full_page is True.
+                              Default: 0.2.
+        process_iframes (bool): If True, attempts to process and inline iframe content.
+                                Default: False.
+        remove_overlay_elements (bool): If True, remove overlays/popups before extracting HTML.
+                                        Default: False.
+        delay_before_return_html (float): Delay in seconds before retrieving final HTML.
+                                          Default: 0.1.
+        log_console (bool): If True, log console messages from the page.
+                            Default: False.
+        simulate_user (bool): If True, simulate user interactions (mouse moves, clicks) for anti-bot measures.
+                              Default: False.
+        override_navigator (bool): If True, overrides navigator properties for more human-like behavior.
+                                   Default: False.
+        magic (bool): If True, attempts automatic handling of overlays/popups.
+                      Default: False.
+        screenshot_wait_for (float or None): Additional wait time before taking a screenshot.
+                                             Default: None.
+        screenshot_height_threshold (int): Threshold for page height to decide screenshot strategy.
+                                           Default: SCREENSHOT_HEIGHT_TRESHOLD (from config, e.g. 10000).
+        mean_delay (float): Mean base delay between requests when calling arun_many.
+                            Default: 0.1.
+        max_range (float): Max random additional delay range for requests in arun_many.
+                           Default: 0.3.
+        semaphore_count (int): Maximum number of concurrent crawls allowed in arun_many. Default: 5.
+    """
+
+    def __init__(
+        self,
+        word_count_threshold: int =  MIN_WORD_THRESHOLD ,
+        extraction_strategy : ExtractionStrategy=None,  # Will default to NoExtractionStrategy if None
+        chunking_strategy : ChunkingStrategy= None,    # Will default to RegexChunking if None
+        markdown_generator : MarkdownGenerationStrategy = None,
+        content_filter=None,
+        cache_mode=None,
+        session_id: str = None,
+        bypass_cache: bool = False,
+        disable_cache: bool = False,
+        no_cache_read: bool = False,
+        no_cache_write: bool = False,
+        css_selector: str = None,
+        screenshot: bool = False,
+        pdf: bool = False,
+        verbose: bool = True,
+        only_text: bool = False,
+        image_description_min_word_threshold: int = IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
+        prettiify: bool = False,
+        js_code=None,
+        wait_for: str = None,
+        js_only: bool = False,
+        wait_until: str = "domcontentloaded",
+        page_timeout: int = PAGE_TIMEOUT,
+        ignore_body_visibility: bool = True,
+        wait_for_images: bool = True,
+        adjust_viewport_to_content: bool = False,
+        scan_full_page: bool = False,
+        scroll_delay: float = 0.2,
+        process_iframes: bool = False,
+        remove_overlay_elements: bool = False,
+        delay_before_return_html: float = 0.1,
+        log_console: bool = False,
+        simulate_user: bool = False,
+        override_navigator: bool = False,
+        magic: bool = False,
+        screenshot_wait_for: float = None,
+        screenshot_height_threshold: int = SCREENSHOT_HEIGHT_TRESHOLD,
+        mean_delay: float = 0.1,
+        max_range: float = 0.3,
+        semaphore_count: int = 5,
+    ):
+        self.word_count_threshold = word_count_threshold
+        self.extraction_strategy = extraction_strategy
+        self.chunking_strategy = chunking_strategy
+        self.markdown_generator = markdown_generator
+        self.content_filter = content_filter
+        self.cache_mode = cache_mode
+        self.session_id = session_id
+        self.bypass_cache = bypass_cache
+        self.disable_cache = disable_cache
+        self.no_cache_read = no_cache_read
+        self.no_cache_write = no_cache_write
+        self.css_selector = css_selector
+        self.screenshot = screenshot
+        self.pdf = pdf
+        self.verbose = verbose
+        self.only_text = only_text
+        self.image_description_min_word_threshold = image_description_min_word_threshold
+        self.prettiify = prettiify
+        self.js_code = js_code
+        self.wait_for = wait_for
+        self.js_only = js_only
+        self.wait_until = wait_until
+        self.page_timeout = page_timeout
+        self.ignore_body_visibility = ignore_body_visibility
+        self.wait_for_images = wait_for_images
+        self.adjust_viewport_to_content = adjust_viewport_to_content
+        self.scan_full_page = scan_full_page
+        self.scroll_delay = scroll_delay
+        self.process_iframes = process_iframes
+        self.remove_overlay_elements = remove_overlay_elements
+        self.delay_before_return_html = delay_before_return_html
+        self.log_console = log_console
+        self.simulate_user = simulate_user
+        self.override_navigator = override_navigator
+        self.magic = magic
+        self.screenshot_wait_for = screenshot_wait_for
+        self.screenshot_height_threshold = screenshot_height_threshold
+        self.mean_delay = mean_delay
+        self.max_range = max_range
+        self.semaphore_count = semaphore_count
+
+        # Validate type of extraction strategy and chunking strategy if they are provided
+        if self.extraction_strategy is not None and not isinstance(self.extraction_strategy, ExtractionStrategy):
+            raise ValueError("extraction_strategy must be an instance of ExtractionStrategy")
+        if self.chunking_strategy is not None and not isinstance(self.chunking_strategy, ChunkingStrategy):
+            raise ValueError("chunking_strategy must be an instance of ChunkingStrategy")
+
+        # Set default chunking strategy if None
+        if self.chunking_strategy is None:
+            from .chunking_strategy import RegexChunking
+            self.chunking_strategy = RegexChunking()
+        
+
+    @staticmethod
+    def from_kwargs(kwargs: dict) -> "CrawlerRunConfig":
+        return CrawlerRunConfig(
+            word_count_threshold=kwargs.get("word_count_threshold", 200),
+            extraction_strategy=kwargs.get("extraction_strategy"),
+            chunking_strategy=kwargs.get("chunking_strategy"),
+            markdown_generator=kwargs.get("markdown_generator"),
+            content_filter=kwargs.get("content_filter"),
+            cache_mode=kwargs.get("cache_mode"),
+            session_id=kwargs.get("session_id"),
+            bypass_cache=kwargs.get("bypass_cache", False),
+            disable_cache=kwargs.get("disable_cache", False),
+            no_cache_read=kwargs.get("no_cache_read", False),
+            no_cache_write=kwargs.get("no_cache_write", False),
+            css_selector=kwargs.get("css_selector"),
+            screenshot=kwargs.get("screenshot", False),
+            pdf=kwargs.get("pdf", False),
+            verbose=kwargs.get("verbose", True),
+            only_text=kwargs.get("only_text", False),
+            image_description_min_word_threshold=kwargs.get("image_description_min_word_threshold",  IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD),
+            prettiify=kwargs.get("prettiify", False),
+            js_code=kwargs.get("js_code"), # If not provided here, will default inside constructor
+            wait_for=kwargs.get("wait_for"),
+            js_only=kwargs.get("js_only", False),
+            wait_until=kwargs.get("wait_until", "domcontentloaded"),
+            page_timeout=kwargs.get("page_timeout", 60000),
+            ignore_body_visibility=kwargs.get("ignore_body_visibility", True),
+            wait_for_images=kwargs.get("wait_for_images", True),
+            adjust_viewport_to_content=kwargs.get("adjust_viewport_to_content", False),
+            scan_full_page=kwargs.get("scan_full_page", False),
+            scroll_delay=kwargs.get("scroll_delay", 0.2),
+            process_iframes=kwargs.get("process_iframes", False),
+            remove_overlay_elements=kwargs.get("remove_overlay_elements", False),
+            delay_before_return_html=kwargs.get("delay_before_return_html", 0.1),
+            log_console=kwargs.get("log_console", False),
+            simulate_user=kwargs.get("simulate_user", False),
+            override_navigator=kwargs.get("override_navigator", False),
+            magic=kwargs.get("magic", False),
+            screenshot_wait_for=kwargs.get("screenshot_wait_for"),
+            screenshot_height_threshold=kwargs.get("screenshot_height_threshold", SCREENSHOT_HEIGHT_TRESHOLD),
+            mean_delay=kwargs.get("mean_delay", 0.1),
+            max_range=kwargs.get("max_range", 0.3),
+            semaphore_count=kwargs.get("semaphore_count", 5)
+        )
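Review note: the `CrawlerRunConfig` docstring says each legacy cache flag behaves like one `CacheMode` value. The sketch below shows one way that mapping could be folded into a single value. It is an illustration only: the `CacheMode` enum here is a local stand-in that borrows just the member names from the docstring, `resolve_cache_mode` is a hypothetical helper, and the precedence between flags is an assumption, not something this diff specifies.

```python
from enum import Enum
from typing import Optional


class CacheMode(Enum):
    """Local stand-in; only the member names are taken from the docstring."""
    ENABLED = "enabled"
    DISABLED = "disabled"
    READ_ONLY = "read_only"
    WRITE_ONLY = "write_only"
    BYPASS = "bypass"


def resolve_cache_mode(
    cache_mode: Optional[CacheMode] = None,
    bypass_cache: bool = False,
    disable_cache: bool = False,
    no_cache_read: bool = False,
    no_cache_write: bool = False,
) -> CacheMode:
    """Hypothetical helper: fold the legacy booleans into one CacheMode."""
    # An explicit cache_mode wins over every legacy flag (assumed precedence).
    if cache_mode is not None:
        return cache_mode
    if disable_cache:
        return CacheMode.DISABLED
    if bypass_cache:
        return CacheMode.BYPASS
    if no_cache_read:
        return CacheMode.WRITE_ONLY
    if no_cache_write:
        return CacheMode.READ_ONLY
    # The docstring states that None falls back to ENABLED.
    return CacheMode.ENABLED


if __name__ == "__main__":
    print(resolve_cache_mode(bypass_cache=True))
    print(resolve_cache_mode())
```

Resolving the flags once, at config construction time, would keep the rest of the crawler free of legacy branches.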
--- a/crawl4ai/async_crawler_strategy.py
+++ b/crawl4ai/async_crawler_strategy.py
--- a/crawl4ai/async_database.py
+++ b/crawl4ai/async_database.py
@@ -1,4 +1,4 @@
-import os
+import os, sys
 from pathlib import Path
 import aiosqlite
 import asyncio
@@ -13,6 +13,7 @@ import aiofiles
 from .config import NEED_MIGRATION
 from .version_manager import VersionManager
 from .async_logger import AsyncLogger
+from .utils import get_error_context, create_box_message
 # Set up logging
 logging.basicConfig(level=logging.INFO)
 logger = logging.getLogger(__name__)
@@ -97,35 +98,84 @@ class AsyncDatabaseManager:

    @asynccontextmanager
    async def get_connection(self):
-        """Connection pool manager"""
+        """Connection pool manager with enhanced error handling"""
        if not self._initialized:
-            # Use an asyncio.Lock to ensure only one initialization occurs
            async with self.init_lock:
                if not self._initialized:
-                    await self.initialize()
-                    self._initialized = True
+                    try:
+                        await self.initialize()
+                        self._initialized = True
+                    except Exception as e:
+                        import sys
+                        error_context = get_error_context(sys.exc_info())
+                        self.logger.error(
+                            message="Database initialization failed:\n{error}\n\nContext:\n{context}\n\nTraceback:\n{traceback}",
+                            tag="ERROR",
+                            force_verbose=True,
+                            params={
+                                "error": str(e),
+                                "context": error_context["code_context"],
+                                "traceback": error_context["full_traceback"]
+                            }
+                        )
+                        raise

        await self.connection_semaphore.acquire()
        task_id = id(asyncio.current_task())
+        
        try:
            async with self.pool_lock:
                if task_id not in self.connection_pool:
-                    conn = await aiosqlite.connect(
-                        self.db_path,
-                        timeout=30.0
-                    )
-                    await conn.execute('PRAGMA journal_mode = WAL')
-                    await conn.execute('PRAGMA busy_timeout = 5000')
-                    self.connection_pool[task_id] = conn
+                    try:
+                        conn = await aiosqlite.connect(
+                            self.db_path,
+                            timeout=30.0
+                        )
+                        await conn.execute('PRAGMA journal_mode = WAL')
+                        await conn.execute('PRAGMA busy_timeout = 5000')
+                        
+                        # Verify database structure
+                        async with conn.execute("PRAGMA table_info(crawled_data)") as cursor:
+                            columns = await cursor.fetchall()
+                            column_names = [col[1] for col in columns]
+                            expected_columns = {
+                                'url', 'html', 'cleaned_html', 'markdown', 'extracted_content',
+                                'success', 'media', 'links', 'metadata', 'screenshot',
+                                'response_headers', 'downloaded_files'
+                            }
+                            missing_columns = expected_columns - set(column_names)
+                            if missing_columns:
+                                raise ValueError(f"Database missing columns: {missing_columns}")
+                        
+                        self.connection_pool[task_id] = conn
+                    except Exception as e:
+                        import sys
+                        error_context = get_error_context(sys.exc_info())
+                        error_message = (
+                            f"Unexpected error in db get_connection at line {error_context['line_no']} "
+                            f"in {error_context['function']} ({error_context['filename']}):\n"
+                            f"Error: {str(e)}\n\n"
+                            f"Code context:\n{error_context['code_context']}"
+                        )
+                        self.logger.error(
+                            message=create_box_message(error_message, type="error"),
+                        )
+
+                        raise

            yield self.connection_pool[task_id]

        except Exception as e:
+            import sys
+            error_context = get_error_context(sys.exc_info())
+            error_message = (
+                f"Unexpected error in db get_connection at line {error_context['line_no']} "
+                f"in {error_context['function']} ({error_context['filename']}):\n"
+                f"Error: {str(e)}\n\n"
+                f"Code context:\n{error_context['code_context']}"
+            )
            self.logger.error(
-                message="Connection error: {error}",
-                tag="ERROR",
-                force_verbose=True,
-                params={"error": str(e)}
+                message=create_box_message(error_message, type="error"),
            )
            raise
        finally:
@@ -230,7 +280,8 @@ class AsyncDatabaseManager:
                    'cleaned_html': row_dict['cleaned_html'],
                    'markdown': row_dict['markdown'],
                    'extracted_content': row_dict['extracted_content'],
-                    'screenshot': row_dict['screenshot']
+                    'screenshot': row_dict['screenshot'],
+                    'screenshots': row_dict['screenshot'],
                }
                
                for field, hash_value in content_fields.items():
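Review note: the new structure check in `get_connection` compares the columns reported by `PRAGMA table_info(crawled_data)` against an expected set. The same idea can be tried in isolation with the standard-library `sqlite3` module (swapped in here for `aiosqlite` so the sketch runs synchronously). `find_missing_columns` is a hypothetical name; the column set is copied from the diff.

```python
import sqlite3
from typing import Set

# Column names copied from the expected_columns set in the diff.
EXPECTED_COLUMNS = {
    "url", "html", "cleaned_html", "markdown", "extracted_content",
    "success", "media", "links", "metadata", "screenshot",
    "response_headers", "downloaded_files",
}


def find_missing_columns(conn: sqlite3.Connection, table: str = "crawled_data") -> Set[str]:
    """Return the expected columns that the table does not have."""
    # Each table_info row is (cid, name, type, notnull, dflt_value, pk).
    # PRAGMA arguments cannot be bound as parameters, so the table name is
    # interpolated; only pass trusted names here.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    present = {row[1] for row in rows}
    return EXPECTED_COLUMNS - present


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE crawled_data (url TEXT PRIMARY KEY, html TEXT)")
    print(sorted(find_missing_columns(conn)))
    conn.close()
```

Note that `PRAGMA table_info` returns no rows for a table that does not exist, so in that case every expected column is reported as missing.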
--- a/crawl4ai/async_tools.py
+++ b/crawl4ai/async_tools.py
@@ -0,0 +1,183 @@
+import asyncio
+import base64
+import time
+from abc import ABC, abstractmethod
+from typing import Callable, Dict, Any, List, Optional, Awaitable
+import os, sys, shutil
+import tempfile, subprocess
+from playwright.async_api import async_playwright, Page, Browser, Error
+from playwright.async_api import TimeoutError as PlaywrightTimeoutError
+from io import BytesIO
+from PIL import Image, ImageDraw, ImageFont
+from pathlib import Path
+from playwright.async_api import ProxySettings
+from pydantic import BaseModel
+import hashlib
+import json
+import uuid
+from .models import AsyncCrawlResponse
+from .utils import create_box_message
+from .user_agent_generator import UserAgentGenerator
+from playwright_stealth import StealthConfig, stealth_async
+
+
+class ManagedBrowser:
+    def __init__(self, browser_type: str = "chromium", user_data_dir: Optional[str] = None, headless: bool = False, logger = None, host: str = "localhost", debugging_port: int = 9222):
+        self.browser_type = browser_type
+        self.user_data_dir = user_data_dir
+        self.headless = headless
+        self.browser_process = None
+        self.temp_dir = None
+        self.debugging_port = debugging_port
+        self.host = host
+        self.logger = logger
+        self.shutting_down = False
+
+    async def start(self) -> str:
+        """
+        Starts the browser process and returns the CDP endpoint URL.
+        If user_data_dir is not provided, creates a temporary directory.
+        """
+        
+        # Create temp dir if needed
+        if not self.user_data_dir:
+            self.temp_dir = tempfile.mkdtemp(prefix="browser-profile-")
+            self.user_data_dir = self.temp_dir
+
+        # Get browser path and args based on OS and browser type
+        browser_path = self._get_browser_path()
+        args = self._get_browser_args()
+
+        # Start browser process
+        try:
+            self.browser_process = subprocess.Popen(
+                args,
+                stdout=subprocess.PIPE,
+                stderr=subprocess.PIPE
+            )
+            # Monitor browser process output for errors
+            asyncio.create_task(self._monitor_browser_process())
+            await asyncio.sleep(2)  # Give browser time to start
+            return f"http://{self.host}:{self.debugging_port}"
+        except Exception as e:
+            await self.cleanup()
+            raise Exception(f"Failed to start browser: {e}") from e
+
+    async def _monitor_browser_process(self):
+        """Monitor the browser process for unexpected termination."""
+        if self.browser_process:
+            try:
+                stdout, stderr = await asyncio.gather(
+                    asyncio.to_thread(self.browser_process.stdout.read),
+                    asyncio.to_thread(self.browser_process.stderr.read)
+                )
+                
+                # Check shutting_down flag BEFORE logging anything
+                if self.browser_process.poll() is not None:
+                    if not self.shutting_down:
+                        self.logger.error(
+                            message="Browser process terminated unexpectedly | Code: {code} | STDOUT: {stdout} | STDERR: {stderr}",
+                            tag="ERROR",
+                            params={
+                                "code": self.browser_process.returncode,
+                                "stdout": stdout.decode(),
+                                "stderr": stderr.decode()
+                            }
+                        )                
+                        await self.cleanup()
+                    else:
+                        self.logger.info(
+                            message="Browser process terminated normally | Code: {code}",
+                            tag="INFO",
+                            params={"code": self.browser_process.returncode}
+                        )
+            except Exception as e:
+                if not self.shutting_down:
+                    self.logger.error(
+                        message="Error monitoring browser process: {error}",
+                        tag="ERROR",
+                        params={"error": str(e)}
+                    )
+
+    def _get_browser_path(self) -> Optional[str]:
+        """Returns the browser executable path based on OS and browser type"""
+        if sys.platform == "darwin":  # macOS
+            paths = {
+                "chromium": "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
+                "firefox": "/Applications/Firefox.app/Contents/MacOS/firefox",
+                "webkit": "/Applications/Safari.app/Contents/MacOS/Safari"
+            }
+        elif sys.platform == "win32":  # Windows
+            paths = {
+                "chromium": "C:\\Program Files\\Google\\Chrome\\Application\\chrome.exe",
+                "firefox": "C:\\Program Files\\Mozilla Firefox\\firefox.exe",
+                "webkit": None  # WebKit not supported on Windows
+            }
+        else:  # Linux
+            paths = {
+                "chromium": "google-chrome",
+                "firefox": "firefox",
+                "webkit": None  # WebKit not supported on Linux
+            }
+        
+        return paths.get(self.browser_type)
+
+    def _get_browser_args(self) -> List[str]:
+        """Returns browser-specific command line arguments"""
+        base_args = [self._get_browser_path()]
+        
+        if self.browser_type == "chromium":
+            args = [
+                f"--remote-debugging-port={self.debugging_port}",
+                f"--user-data-dir={self.user_data_dir}",
+            ]
+            if self.headless:
+                args.append("--headless=new")
+        elif self.browser_type == "firefox":
+            args = [
+                "--remote-debugging-port", str(self.debugging_port),
+                "--profile", self.user_data_dir,
+            ]
+            if self.headless:
+                args.append("--headless")
+        else:
+            raise NotImplementedError(f"Browser type {self.browser_type} not supported")
+            
+        return base_args + args
+
+    async def cleanup(self):
+        """Cleanup browser process and temporary directory"""
+        # Set shutting_down flag BEFORE any termination actions
+        self.shutting_down = True
+        
+        if self.browser_process:
+            try:
+                self.browser_process.terminate()
+                # Wait for process to end gracefully
+                for _ in range(10):  # 10 attempts, 100ms each
+                    if self.browser_process.poll() is not None:
+                        break
+                    await asyncio.sleep(0.1)
+                
+                # Force kill if still running
+                if self.browser_process.poll() is None:
+                    self.browser_process.kill()
+                    await asyncio.sleep(0.1)  # Brief wait for kill to take effect
+                    
+            except Exception as e:
+                self.logger.error(
+                    message="Error terminating browser: {error}",
+                    tag="ERROR",
+                    params={"error": str(e)}
+                )
+
+        if self.temp_dir and os.path.exists(self.temp_dir):
+            try:
+                shutil.rmtree(self.temp_dir)
+            except Exception as e:
+                self.logger.error(
+                    message="Error removing temporary directory: {error}",
+                    tag="ERROR",
+                    params={"error": str(e)}
+                )
+
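Review note: `ManagedBrowser.cleanup` uses a terminate, poll, then kill sequence. The sketch below isolates that pattern against a plain child process so it can be exercised without a browser. `stop_process` is a hypothetical helper, and the final `wait()` call is an addition of this sketch (it reaps the child so `returncode` is always set), not something the diff does.

```python
import asyncio
import subprocess
import sys


async def stop_process(
    proc: subprocess.Popen, attempts: int = 10, interval: float = 0.1
) -> int:
    """Ask a child process to exit, then force-kill it if it lingers."""
    proc.terminate()
    # Poll for a graceful exit: `attempts` checks, `interval` seconds apart.
    for _ in range(attempts):
        if proc.poll() is not None:
            break
        await asyncio.sleep(interval)
    # Still running after the grace period: force kill.
    if proc.poll() is None:
        proc.kill()
    # Reap the child in a worker thread so the event loop is not blocked.
    return await asyncio.to_thread(proc.wait)


if __name__ == "__main__":
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    exit_code = asyncio.run(stop_process(child))
    print("child exited with", exit_code)
```

Polling with `asyncio.sleep` keeps the event loop responsive during the grace period, which matters when other crawl tasks are still running.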
--- a/crawl4ai/async_webcrawler.py
+++ b/crawl4ai/async_webcrawler.py
--- a/crawl4ai/config.py
+++ b/crawl4ai/config.py
@@ -56,4 +56,7 @@ MAX_METRICS_HISTORY = 1000

 NEED_MIGRATION = True
 URL_LOG_SHORTEN_LENGTH = 30
-SHOW_DEPRECATION_WARNINGS = True
+SHOW_DEPRECATION_WARNINGS = True
+SCREENSHOT_HEIGHT_TRESHOLD = 10000
+PAGE_TIMEOUT = 60000
+DOWNLOAD_PAGE_TIMEOUT = 60000
--- a/crawl4ai/content_filter_strategy.py
+++ b/crawl4ai/content_filter_strategy.py
@@ -4,10 +4,10 @@ from typing import List, Tuple, Dict
 from rank_bm25 import BM25Okapi
 from time import perf_counter
 from collections import deque
-from bs4 import BeautifulSoup, NavigableString, Tag
+from bs4 import BeautifulSoup, NavigableString, Tag, Comment
 from .utils import clean_tokens
 from abc import ABC, abstractmethod
-
+import math
 from snowballstemmer import stemmer


@@ -358,145 +358,186 @@ class BM25ContentFilter(RelevantContentFilter):
        return [self.clean_element(tag) for _, _, tag in selected_candidates]


-class HeuristicContentFilter(RelevantContentFilter):
-    def __init__(self):
-        super().__init__()
-        # Weights for different heuristics
-        self.tag_weights = {
-            'article': 10,
-            'main': 8,
-            'section': 5,
-            'div': 3,
-            'p': 2,
-            'pre': 2,
-            'code': 2,
-            'blockquote': 2,
-            'li': 1,
-            'span': 1,
-        }
-        self.max_depth = 5  # Maximum depth from body to consider

-    def filter_content(self, html: str) -> List[str]:
-        """Implements heuristic content filtering without relying on a query."""
+
+
+
+class PruningContentFilter(RelevantContentFilter):
+    def __init__(self, user_query: str = None, min_word_threshold: int = None, 
+                 threshold_type: str = 'fixed', threshold: float = 0.48):
+        super().__init__(user_query)
+        self.min_word_threshold = min_word_threshold
+        self.threshold_type = threshold_type
+        self.threshold = threshold
+        
+        # Add tag importance for dynamic threshold
+        self.tag_importance = {
+            'article': 1.5,
+            'main': 1.4,
+            'section': 1.3,
+            'p': 1.2,
+            'h1': 1.4,
+            'h2': 1.3,
+            'h3': 1.2,
+            'div': 0.7,
+            'span': 0.6
+        }
+        
+        # Metric configuration
+        self.metric_config = {
+            'text_density': True,
+            'link_density': True,
+            'tag_weight': True,
+            'class_id_weight': True,
+            'text_length': True,
+        }
+        
+        self.metric_weights = {
+            'text_density': 0.4,
+            'link_density': 0.2,
+            'tag_weight': 0.2,
+            'class_id_weight': 0.1,
+            'text_length': 0.1,
+        }
+        
+        self.tag_weights = {
+            'div': 0.5,
+            'p': 1.0,
+            'article': 1.5,
+            'section': 1.0,
+            'span': 0.3,
+            'li': 0.5,
+            'ul': 0.5,
+            'ol': 0.5,
+            'h1': 1.2,
+            'h2': 1.1,
+            'h3': 1.0,
+            'h4': 0.9,
+            'h5': 0.8,
+            'h6': 0.7,
+        }
+
+    def filter_content(self, html: str, min_word_threshold: int = None) -> List[str]:
        if not html or not isinstance(html, str):
            return []
-
+            
        soup = BeautifulSoup(html, 'lxml')
-
-        # Ensure there is a body tag
        if not soup.body:
            soup = BeautifulSoup(f'<body>{html}</body>', 'lxml')
-        body = soup.body
+        
+        # Remove comments and unwanted tags
+        self._remove_comments(soup)
+        self._remove_unwanted_tags(soup)
+        
+        # Prune tree starting from body
+        body = soup.find('body')
+        self._prune_tree(body)
+        
+        # Extract remaining content as list of HTML strings
+        content_blocks = []
+        for element in body.children:
+            if isinstance(element, str) or not hasattr(element, 'name'):
+                continue
+            if len(element.get_text(strip=True)) > 0:
+                content_blocks.append(str(element))
+                
+        return content_blocks

-        # Extract candidate text chunks
-        candidates = self.extract_text_chunks(body)
+    def _remove_comments(self, soup):
+        for element in soup(string=lambda text: isinstance(text, Comment)):
+            element.extract()

-        if not candidates:
-            return []
+    def _remove_unwanted_tags(self, soup):
+        for tag in self.excluded_tags:
+            for element in soup.find_all(tag):
+                element.decompose()

-        # Score each candidate
-        scored_candidates = []
-        for index, text, tag_type, tag in candidates:
-            score = self.score_element(tag, text)
-            if score > 0:
-                scored_candidates.append((score, index, text, tag))
+    def _prune_tree(self, node):
+        if not node or not hasattr(node, 'name') or node.name is None:
+            return

-        # Sort candidates by score and then by document order
-        scored_candidates.sort(key=lambda x: (-x[0], x[1]))
+        text_len = len(node.get_text(strip=True))
+        tag_len = len(node.encode_contents().decode('utf-8'))
+        link_text_len = sum(len(s.strip()) for s in (a.string for a in node.find_all('a', recursive=False)) if s)

-        # Extract the top candidates (e.g., top 5)
-        top_candidates = scored_candidates[:5]  # Adjust the number as needed
+        metrics = {
+            'node': node,
+            'tag_name': node.name,
+            'text_len': text_len,
+            'tag_len': tag_len,
+            'link_text_len': link_text_len
+        }

-        # Sort the top candidates back to their original document order
-        top_candidates.sort(key=lambda x: x[1])
+        score = self._compute_composite_score(metrics, text_len, tag_len, link_text_len)

-        # Clean and return the content
-        return [self.clean_element(tag) for _, _, _, tag in top_candidates]
+        if self.threshold_type == 'fixed':
+            should_remove = score < self.threshold
+        else:  # dynamic
+            tag_importance = self.tag_importance.get(node.name, 0.7)
+            text_ratio = text_len / tag_len if tag_len > 0 else 0
+            link_ratio = link_text_len / text_len if text_len > 0 else 1
+            
+            threshold = self.threshold  # base threshold
+            if tag_importance > 1:
+                threshold *= 0.8
+            if text_ratio > 0.4:
+                threshold *= 0.9
+            if link_ratio > 0.6:
+                threshold *= 1.2
+                
+            should_remove = score < threshold

-    def score_element(self, tag: Tag, text: str) -> float:
-        """Compute a score for an element based on heuristics."""
-        if not text or not tag:
-            return 0
+        if should_remove:
+            node.decompose()
+        else:
+            children = [child for child in node.children if hasattr(child, 'name')]
+            for child in children:
+                self._prune_tree(child)

-        # Exclude unwanted tags
-        if self.is_excluded(tag):
-            return 0
+    def _compute_composite_score(self, metrics, text_len, tag_len, link_text_len):
+        if self.min_word_threshold:
+            # Get raw text from metrics node - avoid extra processing
+            text = metrics['node'].get_text(strip=True)
+            word_count = text.count(' ') + 1
+            if word_count < self.min_word_threshold:
+                return -1.0  # Guaranteed removal
+        score = 0.0
+        total_weight = 0.0

-        # Text density
-        text_length = len(text.strip())
-        html_length = len(str(tag))
-        text_density = text_length / html_length if html_length > 0 else 0
+        if self.metric_config['text_density']:
+            density = text_len / tag_len if tag_len > 0 else 0
+            score += self.metric_weights['text_density'] * density
+            total_weight += self.metric_weights['text_density']

-        # Link density
-        link_text_length = sum(len(a.get_text().strip()) for a in tag.find_all('a'))
-        link_density = link_text_length / text_length if text_length > 0 else 0
+        if self.metric_config['link_density']:
+            density = 1 - (link_text_len / text_len if text_len > 0 else 0)
+            score += self.metric_weights['link_density'] * density
+            total_weight += self.metric_weights['link_density']

-        # Tag weight
-        tag_weight = self.tag_weights.get(tag.name, 1)
+        if self.metric_config['tag_weight']:
+            tag_score = self.tag_weights.get(metrics['tag_name'], 0.5)
+            score += self.metric_weights['tag_weight'] * tag_score
+            total_weight += self.metric_weights['tag_weight']

-        # Depth factor (prefer elements closer to the body tag)
-        depth = self.get_depth(tag)
-        depth_weight = max(self.max_depth - depth, 1) / self.max_depth
+        if self.metric_config['class_id_weight']:
+            class_score = self._compute_class_id_weight(metrics['node'])
+            score += self.metric_weights['class_id_weight'] * max(0, class_score)
+            total_weight += self.metric_weights['class_id_weight']

-        # Compute the final score
-        score = (text_density * tag_weight * depth_weight) / (1 + link_density)
+        if self.metric_config['text_length']:
+            score += self.metric_weights['text_length'] * math.log(text_len + 1)
+            total_weight += self.metric_weights['text_length']

-        return score
+        return score / total_weight if total_weight > 0 else 0

-    def get_depth(self, tag: Tag) -> int:
-        """Compute the depth of the tag from the body tag."""
-        depth = 0
-        current = tag
-        while current and current != current.parent and current.name != 'body':
-            current = current.parent
-            depth += 1
-        return depth
-
-    def extract_text_chunks(self, body: Tag) -> List[Tuple[int, str, str, Tag]]:
-        """
-        Extracts text chunks from the body element while preserving order.
-        Returns list of tuples (index, text, tag_type, tag) for scoring.
-        """
-        chunks = []
-        index = 0
-
-        def traverse(element):
-            nonlocal index
-            if isinstance(element, NavigableString):
-                return
-            if not isinstance(element, Tag):
-                return
-            if self.is_excluded(element):
-                return
-            # Only consider included tags
-            if element.name in self.included_tags:
-                text = element.get_text(separator=' ', strip=True)
-                if len(text.split()) >= self.min_word_count:
-                    tag_type = 'header' if element.name in self.header_tags else 'content'
-                    chunks.append((index, text, tag_type, element))
-                    index += 1
-                    # Do not traverse children of this element to prevent duplication
-                    return
-            for child in element.children:
-                traverse(child)
-
-        traverse(body)
-        return chunks
-
-    def is_excluded(self, tag: Tag) -> bool:
-        """Determine if a tag should be excluded based on heuristics."""
-        if tag.name in self.excluded_tags:
-            return True
-        class_id = ' '.join(filter(None, [
-            ' '.join(tag.get('class', [])),
-            tag.get('id', '')
-        ]))
-        if self.negative_patterns.search(class_id):
-            return True
-        # Exclude tags with high link density (e.g., navigation menus)
-        text = tag.get_text(separator=' ', strip=True)
-        link_text_length = sum(len(a.get_text(strip=True)) for a in tag.find_all('a'))
-        text_length = len(text)
-        if text_length > 0 and (link_text_length / text_length) > 0.5:
-            return True
-        return False
+    def _compute_class_id_weight(self, node):
+        class_id_score = 0
+        if 'class' in node.attrs:
+            classes = ' '.join(node['class'])
+            if self.negative_patterns.search(classes):
+                class_id_score -= 0.5
+        if 'id' in node.attrs:
+            element_id = node['id']
+            if self.negative_patterns.search(element_id):
+                class_id_score -= 0.5
+        return class_id_score
--- a/crawl4ai/content_scraping_strategy.py
+++ b/crawl4ai/content_scraping_strategy.py
@@ -6,6 +6,7 @@ from concurrent.futures import ThreadPoolExecutor
 import asyncio, requests, re, os
 from .config import *
 from bs4 import element, NavigableString, Comment
+from bs4 import PageElement, Tag
 from urllib.parse import urljoin
 from requests.exceptions import InvalidSchema
 # from .content_cleaning_strategy import ContentCleaningStrategy
@@ -13,15 +14,11 @@ from .content_filter_strategy import RelevantContentFilter, BM25ContentFilter#,
 from .markdown_generation_strategy import MarkdownGenerationStrategy, DefaultMarkdownGenerator
 from .models import MarkdownGenerationResult
 from .utils import (
-    sanitize_input_encode,
-    sanitize_html,
    extract_metadata,
-    InvalidCSSSelectorError,
-    CustomHTML2Text,
    normalize_url,
    is_external_url    
 )
-from .tools import profile_and_time
+

 # Pre-compile regular expressions for Open Graph and Twitter metadata
 OG_REGEX = re.compile(r'^og:')
@@ -75,11 +72,10 @@ class WebScrapingStrategy(ContentScrapingStrategy):
            log_method(message=message, tag=tag, **kwargs)
                
    def scrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
-        return self._get_content_of_website_optimized(url, html, is_async=False, **kwargs)
+        return self._scrap(url, html, is_async=False, **kwargs)

    async def ascrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
-        return await asyncio.to_thread(self._get_content_of_website_optimized, url, html, **kwargs)
-
+        return await asyncio.to_thread(self._scrap, url, html, **kwargs)

    def _generate_markdown_content(self, 
                                 cleaned_html: str,
@@ -87,24 +83,6 @@ class WebScrapingStrategy(ContentScrapingStrategy):
                                 url: str,
                                 success: bool,
                                 **kwargs) -> Dict[str, Any]:
-        """Generate markdown content using either new strategy or legacy method.
-        
-        Args:
-            cleaned_html: Sanitized HTML content
-            html: Original HTML content
-            url: Base URL of the page
-            success: Whether scraping was successful
-            **kwargs: Additional options including:
-                - markdown_generator: Optional[MarkdownGenerationStrategy]
-                - html2text: Dict[str, Any] options for HTML2Text
-                - content_filter: Optional[RelevantContentFilter]
-                - fit_markdown: bool
-                - fit_markdown_user_query: Optional[str]
-                - fit_markdown_bm25_threshold: float
-        
-        Returns:
-            Dict containing markdown content in various formats
-        """
        markdown_generator: Optional[MarkdownGenerationStrategy] = kwargs.get('markdown_generator', DefaultMarkdownGenerator())
        
        if markdown_generator:
@@ -121,8 +99,6 @@ class WebScrapingStrategy(ContentScrapingStrategy):
                    html2text_options=kwargs.get('html2text', {})
                )
                
-                help_message = """"""
-                
                return {
                    'markdown': markdown_result.raw_markdown,  
                    'fit_markdown': markdown_result.fit_markdown,
@@ -144,46 +120,370 @@ class WebScrapingStrategy(ContentScrapingStrategy):
                }

        # Legacy method
-        h = CustomHTML2Text()
-        h.update_params(**kwargs.get('html2text', {}))            
-        markdown = h.handle(cleaned_html)
-        markdown = markdown.replace('    ```', '```')
+        """
+        # h = CustomHTML2Text()
+        # h.update_params(**kwargs.get('html2text', {}))            
+        # markdown = h.handle(cleaned_html)
+        # markdown = markdown.replace('    ```', '```')
        
-        fit_markdown = "Set flag 'fit_markdown' to True to get cleaned HTML content."
-        fit_html = "Set flag 'fit_markdown' to True to get cleaned HTML content."
+        # fit_markdown = "Set flag 'fit_markdown' to True to get cleaned HTML content."
+        # fit_html = "Set flag 'fit_markdown' to True to get cleaned HTML content."
        
-        if kwargs.get('content_filter', None) or kwargs.get('fit_markdown', False):
-            content_filter = kwargs.get('content_filter', None)
-            if not content_filter:
-                content_filter = BM25ContentFilter(
-                    user_query=kwargs.get('fit_markdown_user_query', None),
-                    bm25_threshold=kwargs.get('fit_markdown_bm25_threshold', 1.0)
-                )
-            fit_html = content_filter.filter_content(html)
-            fit_html = '\n'.join('<div>{}</div>'.format(s) for s in fit_html)
-            fit_markdown = h.handle(fit_html)
+        # if kwargs.get('content_filter', None) or kwargs.get('fit_markdown', False):
+        #     content_filter = kwargs.get('content_filter', None)
+        #     if not content_filter:
+        #         content_filter = BM25ContentFilter(
+        #             user_query=kwargs.get('fit_markdown_user_query', None),
+        #             bm25_threshold=kwargs.get('fit_markdown_bm25_threshold', 1.0)
+        #         )
+        #     fit_html = content_filter.filter_content(html)
+        #     fit_html = '\n'.join('<div>{}</div>'.format(s) for s in fit_html)
+        #     fit_markdown = h.handle(fit_html)

-        markdown_v2 = MarkdownGenerationResult(
-            raw_markdown=markdown,
-            markdown_with_citations=markdown,
-            references_markdown=markdown,
-            fit_markdown=fit_markdown
-        )
+        # markdown_v2 = MarkdownGenerationResult(
+        #     raw_markdown=markdown,
+        #     markdown_with_citations=markdown,
+        #     references_markdown=markdown,
+        #     fit_markdown=fit_markdown
+        # )
        
-        return {
-            'markdown': markdown,
-            'fit_markdown': fit_markdown,
-            'fit_html': fit_html,
-            'markdown_v2' : markdown_v2
+        # return {
+        #     'markdown': markdown,
+        #     'fit_markdown': fit_markdown,
+        #     'fit_html': fit_html,
+        #     'markdown_v2' : markdown_v2
+        # }
+        """
+
+    def flatten_nested_elements(self, node):
+        if isinstance(node, NavigableString):
+            return node
+        if len(node.contents) == 1 and isinstance(node.contents[0], Tag) and node.contents[0].name == node.name:
+            return self.flatten_nested_elements(node.contents[0])
+        node.contents = [self.flatten_nested_elements(child) for child in node.contents]
+        return node
+
+    def find_closest_parent_with_useful_text(self, tag, **kwargs):
+        image_description_min_word_threshold = kwargs.get('image_description_min_word_threshold', IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD)
+        current_tag = tag
+        while current_tag:
+            current_tag = current_tag.parent
+            # Get the text content of the parent tag
+            if current_tag:
+                text_content = current_tag.get_text(separator=' ',strip=True)
+                # Check if the text content has at least image_description_min_word_threshold words
+                if len(text_content.split()) >= image_description_min_word_threshold:
+                    return text_content
+        return None
+
+    def remove_unwanted_attributes(self, element, important_attrs, keep_data_attributes=False):
+        attrs_to_remove = []
+        for attr in element.attrs:
+            if attr not in important_attrs:
+                if keep_data_attributes:
+                    if not attr.startswith('data-'):
+                        attrs_to_remove.append(attr)
+                else:
+                    attrs_to_remove.append(attr)
+        
+        for attr in attrs_to_remove:
+            del element[attr]
+
+    def process_image(self, img, url, index, total_images, **kwargs):
+        parse_srcset = lambda s: [{'url': u.strip().split()[0], 'width': u.strip().split()[-1].rstrip('w') 
+                        if ' ' in u else None} 
+                        for u in [f"http{p}" for p in s.split("http") if p]]
+        
+        # Constants for checks
+        classes_to_check = frozenset(['button', 'icon', 'logo'])
+        tags_to_check = frozenset(['button', 'input'])
+        
+        # Pre-fetch commonly used attributes
+        style = img.get('style', '')
+        alt = img.get('alt', '')
+        src = img.get('src', '')
+        data_src = img.get('data-src', '')
+        width = img.get('width')
+        height = img.get('height')
+        parent = img.parent
+        parent_classes = parent.get('class', [])
+
+        # Quick validation checks
+        if ('display:none' in style or
+            parent.name in tags_to_check or
+            any(cls in c for c in parent_classes for cls in classes_to_check) or
+            any(c in src for c in classes_to_check) or
+            any(c in alt for c in classes_to_check)):
+            return None
+
+        # Quick score calculation
+        score = 0
+        if width and width.isdigit():
+            width_val = int(width)
+            score += 1 if width_val > 150 else 0
+        if height and height.isdigit():
+            height_val = int(height)
+            score += 1 if height_val > 150 else 0
+        if alt:
+            score += 1
+        score += 1 if index / total_images < 0.5 else 0
+        
+        image_format = ''
+        if "data:image/" in src:
+            image_format = src.split(',')[0].split(';')[0].split('/')[1].split(';')[0]
+        else:
+            image_format = os.path.splitext(src)[1].lower().strip('.').split('?')[0]
+        
+        if image_format in ('jpg', 'png', 'webp', 'avif'):
+            score += 1
+
+        if score <= kwargs.get('image_score_threshold', IMAGE_SCORE_THRESHOLD):
+            return None
+
+        # Use set for deduplication
+        unique_urls = set()
+        image_variants = []
+        
+        # Generate a unique group ID for this set of variants
+        group_id = index 
+        
+        # Base image info template
+        image_description_min_word_threshold = kwargs.get('image_description_min_word_threshold', IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD)
+        base_info = {
+            'alt': alt,
+            'desc': self.find_closest_parent_with_useful_text(img, **kwargs),
+            'score': score,
+            'type': 'image',
+            'group_id': group_id # Group ID for this set of variants
        }

+        # Inline function for adding variants
+        def add_variant(src, width=None):
+            if src and not src.startswith('data:') and src not in unique_urls:
+                unique_urls.add(src)
+                image_variants.append({**base_info, 'src': src, 'width': width})

-    def _get_content_of_website_optimized(self, url: str, html: str, word_count_threshold: int = MIN_WORD_THRESHOLD, css_selector: str = None, **kwargs) -> Dict[str, Any]:
+        # Process all sources
+        add_variant(src)
+        add_variant(data_src)
+        
+        # Handle srcset and data-srcset in one pass
+        for attr in ('srcset', 'data-srcset'):
+            if value := img.get(attr):
+                for source in parse_srcset(value):
+                    add_variant(source['url'], source['width'])
+
+        # Quick picture element check
+        if picture := img.find_parent('picture'):
+            for source in picture.find_all('source'):
+                if srcset := source.get('srcset'):
+                    for src in parse_srcset(srcset):
+                        add_variant(src['url'], src['width'])
+
+        # Framework-specific attributes in one pass
+        for attr, value in img.attrs.items():
+            if attr.startswith('data-') and ('src' in attr or 'srcset' in attr) and 'http' in value:
+                add_variant(value)
+
+        return image_variants if image_variants else None
+
+    
+    def process_element(self, url, element: PageElement, **kwargs) -> Dict[str, Any]:        
+        media = {'images': [], 'videos': [], 'audios': []}
+        internal_links_dict = {}
+        external_links_dict = {}
+        self._process_element(
+            url,
+            element,
+            media,
+            internal_links_dict,
+            external_links_dict,
+            **kwargs
+        )
+        return {
+            'media': media,
+            'internal_links_dict': internal_links_dict,
+            'external_links_dict': external_links_dict
+        }
+        
+    def _process_element(self, url, element: PageElement,  media: Dict[str, Any], internal_links_dict: Dict[str, Any], external_links_dict: Dict[str, Any], **kwargs) -> bool:
+        try:
+            if isinstance(element, NavigableString):
+                if isinstance(element, Comment):
+                    element.extract()
+                return False
+            
+            # if element.name == 'img':
+            #     process_image(element, url, 0, 1)
+            #     return True
+
+            if element.name in ['script', 'style', 'link', 'meta', 'noscript']:
+                element.decompose()
+                return False
+
+            keep_element = False
+            
+            exclude_social_media_domains = SOCIAL_MEDIA_DOMAINS + kwargs.get('exclude_social_media_domains', [])
+            exclude_social_media_domains = list(set(exclude_social_media_domains))
+            
+            try:
+                if element.name == 'a' and element.get('href'):
+                    href = element.get('href', '').strip()
+                    if not href:  # Skip empty hrefs
+                        return False
+                        
+                    url_base = url.split('/')[2]
+                    
+                    # Normalize the URL
+                    try:
+                        normalized_href = normalize_url(href, url)
+                    except ValueError as e:
+                        # logging.warning(f"Invalid URL format: {href}, Error: {str(e)}")
+                        return False
+                        
+                    link_data = {
+                        'href': normalized_href,
+                        'text': element.get_text().strip(),
+                        'title': element.get('title', '').strip()
+                    }
+                    
+                    # Check for duplicates and add to appropriate dictionary
+                    is_external = is_external_url(normalized_href, url_base)
+                    if is_external:
+                        if normalized_href not in external_links_dict:
+                            external_links_dict[normalized_href] = link_data
+                    else:
+                        if normalized_href not in internal_links_dict:
+                            internal_links_dict[normalized_href] = link_data
+                            
+                    keep_element = True
+                    
+                    # Handle external link exclusions
+                    if is_external:
+                        if kwargs.get('exclude_external_links', False):
+                            element.decompose()
+                            return False
+                        elif kwargs.get('exclude_social_media_links', False):
+                            if any(domain in normalized_href.lower() for domain in exclude_social_media_domains):
+                                element.decompose()
+                                return False
+                        elif kwargs.get('exclude_domains', []):
+                            if any(domain in normalized_href.lower() for domain in kwargs.get('exclude_domains', [])):
+                                element.decompose()
+                                return False
+                                
+            except Exception as e:
+                raise Exception(f"Error processing links: {str(e)}")
+
+            try:
+                if element.name == 'img':
+                    potential_sources = ['src', 'data-src', 'srcset', 'data-lazy-src', 'data-original']
+                    src = element.get('src', '')
+                    while not src and potential_sources:
+                        src = element.get(potential_sources.pop(0), '')
+                    if not src:
+                        element.decompose()
+                        return False
+                    
+                    # If it is srcset pick up the first image
+                    if 'srcset' in element.attrs:
+                        src = element.attrs['srcset'].split(',')[0].split(' ')[0]
+                        
+                    # Check flag if we should remove external images
+                    if kwargs.get('exclude_external_images', False):
+                        src_url_base = src.split('/')[2]
+                        url_base = url.split('/')[2]
+                        if url_base not in src_url_base:
+                            element.decompose()
+                            return False
+                        
+                    if not kwargs.get('exclude_external_images', False) and kwargs.get('exclude_social_media_links', False):
+                        src_url_base = src.split('/')[2]
+                        url_base = url.split('/')[2]
+                        if any(domain in src for domain in exclude_social_media_domains):
+                            element.decompose()
+                            return False
+                        
+                    # Handle exclude domains
+                    if kwargs.get('exclude_domains', []):
+                        if any(domain in src for domain in kwargs.get('exclude_domains', [])):
+                            element.decompose()
+                            return False
+                    
+                    return True  # Always keep image elements
+            except Exception as e:
+                raise Exception(f"Error processing images: {str(e)}")
+            
+            
+            # Check if flag to remove all forms is set
+            if kwargs.get('remove_forms', False) and element.name == 'form':
+                element.decompose()
+                return False
+            
+            if element.name in ['video', 'audio']:
+                media[f"{element.name}s"].append({
+                    'src': element.get('src'),
+                    'alt': element.get('alt'),
+                    'type': element.name,
+                    'description': self.find_closest_parent_with_useful_text(element, **kwargs)
+                })
+                source_tags = element.find_all('source')
+                for source_tag in source_tags:
+                    media[f"{element.name}s"].append({
+                    'src': source_tag.get('src'),
+                    'alt': element.get('alt'),
+                    'type': element.name,
+                    'description': self.find_closest_parent_with_useful_text(element, **kwargs)
+                })
+                return True  # Always keep video and audio elements
+
+            if element.name in ONLY_TEXT_ELIGIBLE_TAGS:
+                if kwargs.get('only_text', False):
+                    element.replace_with(element.get_text())
+
+            try:
+                self.remove_unwanted_attributes(element, IMPORTANT_ATTRS, kwargs.get('keep_data_attributes', False))
+            except Exception as e:
+                # print('Error removing unwanted attributes:', str(e))
+                self._log('error',
+                    message="Error removing unwanted attributes: {error}",
+                    tag="SCRAPE",
+                    params={"error": str(e)}
+                )
+            # Process children
+            for child in list(element.children):
+                if isinstance(child, NavigableString) and not isinstance(child, Comment):
+                    if len(child.strip()) > 0:
+                        keep_element = True
+                else:
+                    if self._process_element(url, child, media, internal_links_dict, external_links_dict, **kwargs):
+                        keep_element = True
+                
+
+            # Check word count
+            word_count_threshold = kwargs.get('word_count_threshold', MIN_WORD_THRESHOLD)
+            if not keep_element:
+                word_count = len(element.get_text(strip=True).split())
+                keep_element = word_count >= word_count_threshold
+
+            if not keep_element:
+                element.decompose()
+
+            return keep_element
+        except Exception as e:
+            # print('Error processing element:', str(e))
+            self._log('error',
+                message="Error processing element: {error}",
+                tag="SCRAPE",
+                params={"error": str(e)}
+            )                
+            return False
+
+    def _scrap(self, url: str, html: str, word_count_threshold: int = MIN_WORD_THRESHOLD, css_selector: str = None, **kwargs) -> Dict[str, Any]:
        success = True
        if not html:
            return None

-        # soup = BeautifulSoup(html, 'html.parser')
        soup = BeautifulSoup(html, 'lxml')
        body = soup.body
        
@@ -195,15 +495,24 @@ class WebScrapingStrategy(ContentScrapingStrategy):
                tag="SCRAPE",
                params={"error": str(e)}
            )            
-            # print('Error extracting metadata:', str(e))
            meta = {}
        
+        # Handle tag-based removal first - faster than CSS selection
+        excluded_tags = set(kwargs.get('excluded_tags', []) or [])  
+        if excluded_tags:
+            for element in body.find_all(lambda tag: tag.name in excluded_tags):
+                element.extract()
        
-        image_description_min_word_threshold = kwargs.get('image_description_min_word_threshold', IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD)
-
-        for tag in kwargs.get('excluded_tags', []) or []:
-            for el in body.select(tag):
-                el.decompose()
+        # Handle CSS selector-based removal
+        excluded_selector = kwargs.get('excluded_selector', '')
+        if excluded_selector:
+            is_single_selector = ',' not in excluded_selector and ' ' not in excluded_selector
+            if is_single_selector:
+                while element := body.select_one(excluded_selector):
+                    element.extract()
+            else:
+                for element in body.select(excluded_selector):
+                    element.extract()  
        
        if css_selector:
            selected_elements = body.select(css_selector)
@@ -222,384 +531,17 @@ class WebScrapingStrategy(ContentScrapingStrategy):
            for el in selected_elements:
                body.append(el)

-        links = {'internal': [], 'external': []}
-        media = {'images': [], 'videos': [], 'audios': []}
-        internal_links_dict = {}
-        external_links_dict = {}
-
-        # Extract meaningful text for media files from closest parent
-        def find_closest_parent_with_useful_text(tag):
-                current_tag = tag
-                while current_tag:
-                    current_tag = current_tag.parent
-                    # Get the text content of the parent tag
-                    if current_tag:
-                        text_content = current_tag.get_text(separator=' ',strip=True)
-                        # Check if the text content has at least word_count_threshold
-                        if len(text_content.split()) >= image_description_min_word_threshold:
-                            return text_content
-                return None
-
-        def process_image_old(img, url, index, total_images):
-                   
-            
-            #Check if an image has valid display and inside undesired html elements
-            def is_valid_image(img, parent, parent_classes):
-                style = img.get('style', '')
-                src = img.get('src', '')
-                classes_to_check = ['button', 'icon', 'logo']
-                tags_to_check = ['button', 'input']
-                return all([
-                    'display:none' not in style,
-                    src,
-                    not any(s in var for var in [src, img.get('alt', ''), *parent_classes] for s in classes_to_check),
-                    parent.name not in tags_to_check
-                ])
-
-            #Score an image for it's usefulness
-            def score_image_for_usefulness(img, base_url, index, images_count):
-                image_height = img.get('height')
-                height_value, height_unit = parse_dimension(image_height)
-                image_width =  img.get('width')
-                width_value, width_unit = parse_dimension(image_width)
-                image_size = 0 #int(fetch_image_file_size(img,base_url) or 0)
-                image_src = img.get('src','')
-                if "data:image/" in image_src:
-                    image_format = image_src.split(',')[0].split(';')[0].split('/')[1]
-                else:
-                    image_format = os.path.splitext(img.get('src',''))[1].lower()
-                # Remove . from format
-                image_format = image_format.strip('.').split('?')[0]
-                score = 0
-                if height_value:
-                    if height_unit == 'px' and height_value > 150:
-                        score += 1
-                    if height_unit in ['%','vh','vmin','vmax'] and height_value >30:
-                        score += 1
-                if width_value:
-                    if width_unit == 'px' and width_value > 150:
-                        score += 1
-                    if width_unit in ['%','vh','vmin','vmax'] and width_value >30:
-                        score += 1
-                if image_size > 10000:
-                    score += 1
-                if img.get('alt') != '':
-                    score+=1
-                if any(image_format==format for format in ['jpg','png','webp']):
-                    score+=1
-                if index/images_count<0.5:
-                    score+=1
-                return score
-
-            if not is_valid_image(img, img.parent, img.parent.get('class', [])):
-                return None
-                
-            score = score_image_for_usefulness(img, url, index, total_images)
-            if score <= kwargs.get('image_score_threshold', IMAGE_SCORE_THRESHOLD):
-                return None
-
-            base_result = {
-                'src': img.get('src', ''),
-                'data-src': img.get('data-src', ''),
-                'alt': img.get('alt', ''),
-                'desc': find_closest_parent_with_useful_text(img),
-                'score': score,
-                'type': 'image'
-            }
-
-            sources = []
-            srcset = img.get('srcset', '')
-            if srcset:
-                sources = parse_srcset(srcset)
-                if sources:
-                    return [dict(base_result, src=source['url'], width=source['width']) 
-                        for source in sources]
-
-            return [base_result]  # Always return a list
-
-        def process_image(img, url, index, total_images):
-            parse_srcset = lambda s: [{'url': u.strip().split()[0], 'width': u.strip().split()[-1].rstrip('w') 
-                          if ' ' in u else None} 
-                         for u in [f"http{p}" for p in s.split("http") if p]]
-            
-            # Constants for checks
-            classes_to_check = frozenset(['button', 'icon', 'logo'])
-            tags_to_check = frozenset(['button', 'input'])
-            
-            # Pre-fetch commonly used attributes
-            style = img.get('style', '')
-            alt = img.get('alt', '')
-            src = img.get('src', '')
-            data_src = img.get('data-src', '')
-            width = img.get('width')
-            height = img.get('height')
-            parent = img.parent
-            parent_classes = parent.get('class', [])
-
-            # Quick validation checks
-            if ('display:none' in style or
-                parent.name in tags_to_check or
-                any(c in cls for c in parent_classes for cls in classes_to_check) or
-                any(c in src for c in classes_to_check) or
-                any(c in alt for c in classes_to_check)):
-                return None
-
-            # Quick score calculation
-            score = 0
-            if width and width.isdigit():
-                width_val = int(width)
-                score += 1 if width_val > 150 else 0
-            if height and height.isdigit():
-                height_val = int(height)
-                score += 1 if height_val > 150 else 0
-            if alt:
-                score += 1
-            score += index/total_images < 0.5
-            
-            image_format = ''
-            if "data:image/" in src:
-                image_format = src.split(',')[0].split(';')[0].split('/')[1].split(';')[0]
-            else:
-                image_format = os.path.splitext(src)[1].lower().strip('.').split('?')[0]
-            
-            if image_format in ('jpg', 'png', 'webp', 'avif'):
-                score += 1
-
-            if score <= kwargs.get('image_score_threshold', IMAGE_SCORE_THRESHOLD):
-                return None
-
-            # Use set for deduplication
-            unique_urls = set()
-            image_variants = []
-            
-            # Generate a unique group ID for this set of variants
-            group_id = index 
-            
-            # Base image info template
-            base_info = {
-                'alt': alt,
-                'desc': find_closest_parent_with_useful_text(img),
-                'score': score,
-                'type': 'image',
-                'group_id': group_id # Group ID for this set of variants
-            }
-
-            # Inline function for adding variants
-            def add_variant(src, width=None):
-                if src and not src.startswith('data:') and src not in unique_urls:
-                    unique_urls.add(src)
-                    image_variants.append({**base_info, 'src': src, 'width': width})
-
-            # Process all sources
-            add_variant(src)
-            add_variant(data_src)
-            
-            # Handle srcset and data-srcset in one pass
-            for attr in ('srcset', 'data-srcset'):
-                if value := img.get(attr):
-                    for source in parse_srcset(value):
-                        add_variant(source['url'], source['width'])
-
-            # Quick picture element check
-            if picture := img.find_parent('picture'):
-                for source in picture.find_all('source'):
-                    if srcset := source.get('srcset'):
-                        for src in parse_srcset(srcset):
-                            add_variant(src['url'], src['width'])
-
-            # Framework-specific attributes in one pass
-            for attr, value in img.attrs.items():
-                if attr.startswith('data-') and ('src' in attr or 'srcset' in attr) and 'http' in value:
-                    add_variant(value)
-
-            return image_variants if image_variants else None
-
-        def remove_unwanted_attributes(element, important_attrs, keep_data_attributes=False):
-            attrs_to_remove = []
-            for attr in element.attrs:
-                if attr not in important_attrs:
-                    if keep_data_attributes:
-                        if not attr.startswith('data-'):
-                            attrs_to_remove.append(attr)
-                    else:
-                        attrs_to_remove.append(attr)
-            
-            for attr in attrs_to_remove:
-                del element[attr]
+        result_obj = self.process_element(
+            url, 
+            body, 
+            word_count_threshold = word_count_threshold, 
+            **kwargs
+        )
        
-        def process_element(element: element.PageElement) -> bool:
-            try:
-                if isinstance(element, NavigableString):
-                    if isinstance(element, Comment):
-                        element.extract()
-                    return False
-                
-                # if element.name == 'img':
-                #     process_image(element, url, 0, 1)
-                #     return True
-
-                if element.name in ['script', 'style', 'link', 'meta', 'noscript']:
-                    element.decompose()
-                    return False
-
-                keep_element = False
-                
-                exclude_social_media_domains = SOCIAL_MEDIA_DOMAINS + kwargs.get('exclude_social_media_domains', [])
-                exclude_social_media_domains = list(set(exclude_social_media_domains))
-                
-                try:
-                    if element.name == 'a' and element.get('href'):
-                        href = element.get('href', '').strip()
-                        if not href:  # Skip empty hrefs
-                            return False
-                            
-                        url_base = url.split('/')[2]
-                        
-                        # Normalize the URL
-                        try:
-                            normalized_href = normalize_url(href, url)
-                        except ValueError as e:
-                            # logging.warning(f"Invalid URL format: {href}, Error: {str(e)}")
-                            return False
-                            
-                        link_data = {
-                            'href': normalized_href,
-                            'text': element.get_text().strip(),
-                            'title': element.get('title', '').strip()
-                        }
-                        
-                        # Check for duplicates and add to appropriate dictionary
-                        is_external = is_external_url(normalized_href, url_base)
-                        if is_external:
-                            if normalized_href not in external_links_dict:
-                                external_links_dict[normalized_href] = link_data
-                        else:
-                            if normalized_href not in internal_links_dict:
-                                internal_links_dict[normalized_href] = link_data
-                                
-                        keep_element = True
-                        
-                        # Handle external link exclusions
-                        if is_external:
-                            if kwargs.get('exclude_external_links', False):
-                                element.decompose()
-                                return False
-                            elif kwargs.get('exclude_social_media_links', False):
-                                if any(domain in normalized_href.lower() for domain in exclude_social_media_domains):
-                                    element.decompose()
-                                    return False
-                            elif kwargs.get('exclude_domains', []):
-                                if any(domain in normalized_href.lower() for domain in kwargs.get('exclude_domains', [])):
-                                    element.decompose()
-                                    return False
-                                    
-                except Exception as e:
-                    raise Exception(f"Error processing links: {str(e)}")
-
-                try:
-                    if element.name == 'img':
-                        potential_sources = ['src', 'data-src', 'srcset' 'data-lazy-src', 'data-original']
-                        src = element.get('src', '')
-                        while not src and potential_sources:
-                            src = element.get(potential_sources.pop(0), '')
-                        if not src:
-                            element.decompose()
-                            return False
-                        
-                        # If it is srcset pick up the first image
-                        if 'srcset' in element.attrs:
-                            src = element.attrs['srcset'].split(',')[0].split(' ')[0]
-                            
-                        # Check flag if we should remove external images
-                        if kwargs.get('exclude_external_images', False):
-                            src_url_base = src.split('/')[2]
-                            url_base = url.split('/')[2]
-                            if url_base not in src_url_base:
-                                element.decompose()
-                                return False
-                            
-                        if not kwargs.get('exclude_external_images', False) and kwargs.get('exclude_social_media_links', False):
-                            src_url_base = src.split('/')[2]
-                            url_base = url.split('/')[2]
-                            if any(domain in src for domain in exclude_social_media_domains):
-                                element.decompose()
-                                return False
-                            
-                        # Handle exclude domains
-                        if kwargs.get('exclude_domains', []):
-                            if any(domain in src for domain in kwargs.get('exclude_domains', [])):
-                                element.decompose()
-                                return False
-                        
-                        return True  # Always keep image elements
-                except Exception as e:
-                    raise "Error processing images"
-                
-                
-                # Check if flag to remove all forms is set
-                if kwargs.get('remove_forms', False) and element.name == 'form':
-                    element.decompose()
-                    return False
-                
-                if element.name in ['video', 'audio']:
-                    media[f"{element.name}s"].append({
-                        'src': element.get('src'),
-                        'alt': element.get('alt'),
-                        'type': element.name,
-                        'description': find_closest_parent_with_useful_text(element)
-                    })
-                    source_tags = element.find_all('source')
-                    for source_tag in source_tags:
-                        media[f"{element.name}s"].append({
-                        'src': source_tag.get('src'),
-                        'alt': element.get('alt'),
-                        'type': element.name,
-                        'description': find_closest_parent_with_useful_text(element)
-                    })
-                    return True  # Always keep video and audio elements
-
-                if element.name in ONLY_TEXT_ELIGIBLE_TAGS:
-                    if kwargs.get('only_text', False):
-                        element.replace_with(element.get_text())
-
-                try:
-                    remove_unwanted_attributes(element, IMPORTANT_ATTRS, kwargs.get('keep_data_attributes', False))
-                except Exception as e:
-                    # print('Error removing unwanted attributes:', str(e))
-                    self._log('error',
-                        message="Error removing unwanted attributes: {error}",
-                        tag="SCRAPE",
-                        params={"error": str(e)}
-                    )
-                # Process children
-                for child in list(element.children):
-                    if isinstance(child, NavigableString) and not isinstance(child, Comment):
-                        if len(child.strip()) > 0:
-                            keep_element = True
-                    else:
-                        if process_element(child):
-                            keep_element = True
-                    
-
-                # Check word count
-                if not keep_element:
-                    word_count = len(element.get_text(strip=True).split())
-                    keep_element = word_count >= word_count_threshold
-
-                if not keep_element:
-                    element.decompose()
-
-                return keep_element
-            except Exception as e:
-                # print('Error processing element:', str(e))
-                self._log('error',
-                    message="Error processing element: {error}",
-                    tag="SCRAPE",
-                    params={"error": str(e)}
-                )                
-                return False
-       
-        process_element(body)
+        links = {'internal': [], 'external': []}
+        media = result_obj['media']
+        internal_links_dict = result_obj['internal_links_dict']
+        external_links_dict = result_obj['external_links_dict']
        
        # Update the links dictionary with unique links
        links['internal'] = list(internal_links_dict.values())
@@ -608,23 +550,14 @@ class WebScrapingStrategy(ContentScrapingStrategy):
        # # Process images using ThreadPoolExecutor
        imgs = body.find_all('img')
        
-        # For test we use for loop instead of thread
        media['images'] = [
-            img for result in (process_image(img, url, i, len(imgs)) 
+            img for result in (self.process_image(img, url, i, len(imgs)) 
                            for i, img in enumerate(imgs))
            if result is not None
            for img in result
        ]

-        def flatten_nested_elements(node):
-            if isinstance(node, NavigableString):
-                return node
-            if len(node.contents) == 1 and isinstance(node.contents[0], element.Tag) and node.contents[0].name == node.name:
-                return flatten_nested_elements(node.contents[0])
-            node.contents = [flatten_nested_elements(child) for child in node.contents]
-            return node
-
-        body = flatten_nested_elements(body)
+        body = self.flatten_nested_elements(body)
        base64_pattern = re.compile(r'data:image/[^;]+;base64,([^"]+)')
        for img in imgs:
            src = img.get('src', '')
@@ -669,16 +602,16 @@ class WebScrapingStrategy(ContentScrapingStrategy):

        cleaned_html = str_body.replace('\n\n', '\n').replace('  ', ' ')

-        markdown_content = self._generate_markdown_content(
-            cleaned_html=cleaned_html,
-            html=html,
-            url=url,
-            success=success,
-            **kwargs
-        )
+        # markdown_content = self._generate_markdown_content(
+        #     cleaned_html=cleaned_html,
+        #     html=html,
+        #     url=url,
+        #     success=success,
+        #     **kwargs
+        # )
        
        return {
-            **markdown_content,
+            # **markdown_content,
            'cleaned_html': cleaned_html,
            'success': success,
            'media': media,
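
Note on the exclusion hunk above: the two-stage removal (tag names first, then CSS selectors) can be sketched in isolation as below. This is a minimal illustration, not the code in `WebScrapingStrategy`: `strip_excluded` is a hypothetical helper, it uses the `html.parser` backend instead of `lxml` so it needs only `beautifulsoup4`, and it skips the single-selector `select_one` loop from the diff.

```python
from bs4 import BeautifulSoup


def strip_excluded(html, excluded_tags=None, excluded_selector=""):
    """Hypothetical helper: drop elements by tag name, then by CSS selector."""
    # Assumption: html.parser is used here; the diff itself parses with lxml.
    soup = BeautifulSoup(html, "html.parser")
    body = soup.body or soup

    # Stage 1: tag-based removal, one set lookup per visited tag.
    tags = set(excluded_tags or [])
    if tags:
        for element in body.find_all(lambda tag: tag.name in tags):
            element.extract()

    # Stage 2: selector-based removal for whatever is left.
    if excluded_selector:
        for element in body.select(excluded_selector):
            element.extract()

    return str(body)
```

For example, `strip_excluded(page_html, ["nav"], ".ad, footer")` would remove every `<nav>` element first and then anything matching `.ad` or `footer`.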
--- a/crawl4ai/extraction_strategy.py
+++ b/crawl4ai/extraction_strategy.py
@@ -92,8 +92,10 @@ class LLMExtractionStrategy(ExtractionStrategy):
        
            
    def extract(self, url: str, ix:int, html: str) -> List[Dict[str, Any]]:
-        # print("[LOG] Extracting blocks from URL:", url)
-        print(f"[LOG] Call LLM for {url} - block index: {ix}")
+        if self.verbose:
+            # print("[LOG] Extracting blocks from URL:", url)
+            print(f"[LOG] Call LLM for {url} - block index: {ix}")
+
        variable_values = {
            "URL": url,
            "HTML": escape_json_string(sanitize_html(html)),
@@ -632,7 +634,7 @@ class ContentSummarizationStrategy(ExtractionStrategy):
        # Sort summaries by the original section index to maintain order
        summaries.sort(key=lambda x: x[0])
        return [summary for _, summary in summaries]
-  
+ 
 class JsonCssExtractionStrategy(ExtractionStrategy):
    def __init__(self, schema: Dict[str, Any], **kwargs):
        super().__init__(**kwargs)
@@ -868,4 +870,4 @@ class JsonXPATHExtractionStrategy(ExtractionStrategy):

    def run(self, url: str, sections: List[str], *q, **kwargs) -> List[Dict[str, Any]]:
        combined_html = self.DEL.join(sections)
-        return self.extract(url, combined_html, **kwargs)
+        return self.extract(url, combined_html, **kwargs)
--- a/crawl4ai/html2text/__init__.py
+++ b/crawl4ai/html2text/__init__.py
@@ -1006,10 +1006,136 @@ class HTML2Text(html.parser.HTMLParser):
                    newlines += 1
        return result

-
 def html2text(html: str, baseurl: str = "", bodywidth: Optional[int] = None) -> str:
    if bodywidth is None:
        bodywidth = config.BODY_WIDTH
    h = HTML2Text(baseurl=baseurl, bodywidth=bodywidth)

    return h.handle(html)
+
+class CustomHTML2Text(HTML2Text):
+    def __init__(self, *args, handle_code_in_pre=False, **kwargs):
+        super().__init__(*args, **kwargs)
+        self.inside_pre = False
+        self.inside_code = False
+        self.preserve_tags = set()  # Set of tags to preserve
+        self.current_preserved_tag = None
+        self.preserved_content = []
+        self.preserve_depth = 0
+        self.handle_code_in_pre = handle_code_in_pre 
+        
+        # Configuration options
+        self.skip_internal_links = False
+        self.single_line_break = False
+        self.mark_code = False
+        self.include_sup_sub = False
+        self.body_width = 0
+        self.ignore_mailto_links = True
+        self.ignore_links = False
+        self.escape_backslash = False
+        self.escape_dot = False
+        self.escape_plus = False
+        self.escape_dash = False
+        self.escape_snob = False
+
+    def update_params(self, **kwargs):
+        """Update parameters and set preserved tags."""
+        for key, value in kwargs.items():
+            if key == 'preserve_tags':
+                self.preserve_tags = set(value)
+            elif key == 'handle_code_in_pre':
+                self.handle_code_in_pre = value
+            else:
+                setattr(self, key, value)
+
+    def handle_tag(self, tag, attrs, start):
+        # Handle preserved tags
+        if tag in self.preserve_tags:
+            if start:
+                if self.preserve_depth == 0:
+                    self.current_preserved_tag = tag
+                    self.preserved_content = []
+                    # Format opening tag with attributes
+                    attr_str = ''.join(f' {k}="{v}"' for k, v in attrs.items() if v is not None)
+                    self.preserved_content.append(f'<{tag}{attr_str}>')
+                self.preserve_depth += 1
+                return
+            else:
+                self.preserve_depth -= 1
+                if self.preserve_depth == 0:
+                    self.preserved_content.append(f'</{tag}>')
+                    # Output the preserved HTML block with proper spacing
+                    preserved_html = ''.join(self.preserved_content)
+                    self.o('\n' + preserved_html + '\n')
+                    self.current_preserved_tag = None
+                return
+
+        # If we're inside a preserved tag, collect all content
+        if self.preserve_depth > 0:
+            if start:
+                # Format nested tags with attributes
+                attr_str = ''.join(f' {k}="{v}"' for k, v in attrs.items() if v is not None)
+                self.preserved_content.append(f'<{tag}{attr_str}>')
+            else:
+                self.preserved_content.append(f'</{tag}>')
+            return
+
+        # Handle pre tags
+        if tag == 'pre':
+            if start:
+                self.o('```\n')  # Markdown code block start
+                self.inside_pre = True
+            else:
+                self.o('\n```\n')  # Markdown code block end
+                self.inside_pre = False
+        elif tag == 'code':
+            if self.inside_pre and not self.handle_code_in_pre:
+                # Ignore code tags inside pre blocks if handle_code_in_pre is False
+                return
+            if start:
+                self.o('`')  # Markdown inline code start
+                self.inside_code = True
+            else:
+                self.o('`')  # Markdown inline code end
+                self.inside_code = False
+        else:
+            super().handle_tag(tag, attrs, start)
+
+    def handle_data(self, data, entity_char=False):
+        """Override handle_data to capture content within preserved tags."""
+        if self.preserve_depth > 0:
+            self.preserved_content.append(data)
+            return
+
+        if self.inside_pre:
+            # Output the raw content for pre blocks, including content inside code tags
+            self.o(data)  # Directly output the data as-is (preserve newlines)
+            return
+        if self.inside_code:
+            # Inline code: no newlines allowed
+            self.o(data.replace('\n', ' '))
+            return
+
+        # Default behavior for other tags
+        super().handle_data(data, entity_char)
+
+
+    #     # Handle pre tags
+    #     if tag == 'pre':
+    #         if start:
+    #             self.o('```\n')
+    #             self.inside_pre = True
+    #         else:
+    #             self.o('\n```')
+    #             self.inside_pre = False
+    #     # elif tag in ["h1", "h2", "h3", "h4", "h5", "h6"]:
+    #     #     pass
+    #     else:
+    #         super().handle_tag(tag, attrs, start)
+
+    # def handle_data(self, data, entity_char=False):
+    #     """Override handle_data to capture content within preserved tags."""
+    #     if self.preserve_depth > 0:
+    #         self.preserved_content.append(data)
+    #         return
+    #     super().handle_data(data, entity_char)
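
Note on `CustomHTML2Text` above: the `preserve_depth` counter is what lets a preserved tag pass through as raw HTML while everything outside it is converted. The sketch below shows the same idea on top of the standard library's `html.parser.HTMLParser`. `PreservingParser` is a hypothetical stand-in, not the class from the diff, and it differs in one respect: a preserved tag nested inside itself is kept verbatim here, whereas the diff only emits the outermost pair.

```python
from html.parser import HTMLParser


class PreservingParser(HTMLParser):
    """Hypothetical sketch: keep selected tags as raw HTML, pass other text through."""

    def __init__(self, preserve_tags):
        super().__init__()
        self.preserve_tags = set(preserve_tags)
        self.preserve_depth = 0  # > 0 while inside a preserved subtree
        self.preserved = []      # pieces of the preserved block being built
        self.out = []            # final output pieces

    def handle_starttag(self, tag, attrs):
        if tag in self.preserve_tags or self.preserve_depth > 0:
            # Rebuild the opening tag, skipping valueless attributes.
            attr_str = "".join(f' {k}="{v}"' for k, v in attrs if v is not None)
            self.preserved.append(f"<{tag}{attr_str}>")
            if tag in self.preserve_tags:
                self.preserve_depth += 1

    def handle_endtag(self, tag):
        if self.preserve_depth == 0:
            return  # outside a preserved block: tags are simply dropped
        self.preserved.append(f"</{tag}>")
        if tag in self.preserve_tags:
            self.preserve_depth -= 1
            if self.preserve_depth == 0:
                # Outermost preserved tag closed: flush the block on its own line.
                self.out.append("\n" + "".join(self.preserved) + "\n")
                self.preserved = []

    def handle_data(self, data):
        target = self.preserved if self.preserve_depth > 0 else self.out
        target.append(data)
```

Feeding it `<p>Hi</p><table class="t">...</table>` with `preserve_tags=["table"]` leaves the paragraph text as plain text and the table as an untouched HTML block.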
--- a/crawl4ai/js_snippet/__init__.py
+++ b/crawl4ai/js_snippet/__init__.py
@@ -0,0 +1,15 @@
+import os, sys
+
+# Load a JS script by name from the folder containing this file and return its content as a string.
+def load_js_script(script_name):
+    # Get the path of the current script
+    current_script_path = os.path.dirname(os.path.realpath(__file__))
+    # Get the path of the script to load
+    script_path = os.path.join(current_script_path, script_name + '.js')
+    # Check if the script exists
+    if not os.path.exists(script_path):
+        raise ValueError(f"Script {script_name} not found in the folder {current_script_path}")
+    # Load the content of the script
+    with open(script_path, 'r') as f:
+        script_content = f.read()
+    return script_content
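
Note on `load_js_script` above: the helper resolves `<script_name>.js` next to the module and returns its text, so a call such as `load_js_script("navigator_overrider")` would read the snippet added later in this diff. A variant that takes the directory as a parameter is easier to exercise in isolation. The sketch below is an assumption-laden alternative, not the project's function: `load_script` and its `script_dir` argument are hypothetical, and it raises `FileNotFoundError` where the diff raises `ValueError`.

```python
from pathlib import Path


def load_script(script_dir, script_name):
    """Hypothetical variant: return the text of <script_dir>/<script_name>.js."""
    script_path = Path(script_dir) / f"{script_name}.js"
    if not script_path.is_file():
        # Assumption: a missing file is reported as FileNotFoundError here.
        raise FileNotFoundError(f"Script {script_name!r} not found in {script_dir}")
    # Read with an explicit encoding so the result does not depend on the locale.
    return script_path.read_text(encoding="utf-8")
```

Passing `Path(__file__).parent` as `script_dir` would reproduce the lookup location used in the diff.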
--- a/crawl4ai/js_snippet/navigator_overrider.js
+++ b/crawl4ai/js_snippet/navigator_overrider.js
@@ -0,0 +1,25 @@
+// Pass the Permissions Test.
+const originalQuery = window.navigator.permissions.query;
+window.navigator.permissions.query = (parameters) =>
+    parameters.name === "notifications"
+        ? Promise.resolve({ state: Notification.permission })
+        : originalQuery(parameters);
+Object.defineProperty(navigator, "webdriver", {
+    get: () => undefined,
+});
+window.navigator.chrome = {
+    runtime: {},
+    // Add other properties if necessary
+};
+Object.defineProperty(navigator, "plugins", {
+    get: () => [1, 2, 3, 4, 5],
+});
+Object.defineProperty(navigator, "languages", {
+    get: () => ["en-US", "en"],
+});
+Object.defineProperty(document, "hidden", {
+    get: () => false,
+});
+Object.defineProperty(document, "visibilityState", {
+    get: () => "visible",
+});
--- a/crawl4ai/js_snippet/remove_overlay_elements.js
+++ b/crawl4ai/js_snippet/remove_overlay_elements.js
@@ -0,0 +1,119 @@
+async () => {
+    // Function to check if element is visible
+    const isVisible = (elem) => {
+        const style = window.getComputedStyle(elem);
+        return style.display !== "none" && style.visibility !== "hidden" && style.opacity !== "0";
+    };
+
+    // Common selectors for popups and overlays
+    const commonSelectors = [
+        // Close buttons first
+        'button[class*="close" i]',
+        'button[class*="dismiss" i]',
+        'button[aria-label*="close" i]',
+        'button[title*="close" i]',
+        'a[class*="close" i]',
+        'span[class*="close" i]',
+
+        // Cookie notices
+        '[class*="cookie-banner" i]',
+        '[id*="cookie-banner" i]',
+        '[class*="cookie-consent" i]',
+        '[id*="cookie-consent" i]',
+
+        // Newsletter/subscription dialogs
+        '[class*="newsletter" i]',
+        '[class*="subscribe" i]',
+
+        // Generic popups/modals
+        '[class*="popup" i]',
+        '[class*="modal" i]',
+        '[class*="overlay" i]',
+        '[class*="dialog" i]',
+        '[role="dialog"]',
+        '[role="alertdialog"]',
+    ];
+
+    // Try to click close buttons first
+    for (const selector of commonSelectors.slice(0, 6)) {
+        const closeButtons = document.querySelectorAll(selector);
+        for (const button of closeButtons) {
+            if (isVisible(button)) {
+                try {
+                    button.click();
+                    await new Promise((resolve) => setTimeout(resolve, 100));
+                } catch (e) {
+                    console.log("Error clicking button:", e);
+                }
+            }
+        }
+    }
+
+    // Remove remaining overlay elements
+    const removeOverlays = () => {
+        // Find elements with high z-index
+        const allElements = document.querySelectorAll("*");
+        for (const elem of allElements) {
+            const style = window.getComputedStyle(elem);
+            const zIndex = parseInt(style.zIndex);
+            const position = style.position;
+
+            if (
+                isVisible(elem) &&
+                (zIndex > 999 || position === "fixed" || position === "absolute") &&
+                (elem.offsetWidth > window.innerWidth * 0.5 ||
+                    elem.offsetHeight > window.innerHeight * 0.5 ||
+                    style.backgroundColor.includes("rgba") ||
+                    parseFloat(style.opacity) < 1)
+            ) {
+                elem.remove();
+            }
+        }
+
+        // Remove elements matching common selectors
+        for (const selector of commonSelectors) {
+            const elements = document.querySelectorAll(selector);
+            elements.forEach((elem) => {
+                if (isVisible(elem)) {
+                    elem.remove();
+                }
+            });
+        }
+    };
+
+    // Remove overlay elements
+    removeOverlays();
+
+    // Remove any fixed/sticky position elements at the top/bottom
+    const removeFixedElements = () => {
+        const elements = document.querySelectorAll("*");
+        elements.forEach((elem) => {
+            const style = window.getComputedStyle(elem);
+            if ((style.position === "fixed" || style.position === "sticky") && isVisible(elem)) {
+                elem.remove();
+            }
+        });
+    };
+
+    removeFixedElements();
+
+    // Remove empty block elements as: div, p, span, etc.
+    const removeEmptyBlockElements = () => {
+        const blockElements = document.querySelectorAll(
+            "div, p, span, section, article, header, footer, aside, nav, main, ul, ol, li, dl, dt, dd, h1, h2, h3, h4, h5, h6"
+        );
+        blockElements.forEach((elem) => {
+            if (elem.innerText.trim() === "") {
+                elem.remove();
+            }
+        });
+    };
+
+    // Remove margin-right and padding-right from body (often added by modal scripts)
+    document.body.style.marginRight = "0px";
+    document.body.style.paddingRight = "0px";
+    document.body.style.overflow = "auto";
+
+    // Wait a bit for any animations to complete
+    await new Promise((resolve) => setTimeout(resolve, 100));
+};
--- a/crawl4ai/js_snippet/update_image_dimensions.js
+++ b/crawl4ai/js_snippet/update_image_dimensions.js
@@ -0,0 +1,54 @@
+() => {
+    return new Promise((resolve) => {
+        const filterImage = (img) => {
+            // Filter out images that are too small
+            if (img.width < 100 && img.height < 100) return false;
+
+            // Filter out images that are not visible
+            const rect = img.getBoundingClientRect();
+            if (rect.width === 0 || rect.height === 0) return false;
+
+            // Filter out images with certain class names (e.g., icons, thumbnails)
+            if (img.classList.contains("icon") || img.classList.contains("thumbnail")) return false;
+
+            // Filter out images with certain patterns in their src (e.g., placeholder images)
+            if (img.src.includes("placeholder") || img.src.includes("icon")) return false;
+
+            return true;
+        };
+
+        const images = Array.from(document.querySelectorAll("img")).filter(filterImage);
+        let imagesLeft = images.length;
+
+        if (imagesLeft === 0) {
+            resolve();
+            return;
+        }
+
+        const checkImage = (img) => {
+            if (img.complete && img.naturalWidth !== 0) {
+                img.setAttribute("width", img.naturalWidth);
+                img.setAttribute("height", img.naturalHeight);
+                imagesLeft--;
+                if (imagesLeft === 0) resolve();
+            }
+        };
+
+        images.forEach((img) => {
+            checkImage(img);
+            if (!img.complete) {
+                img.onload = () => {
+                    checkImage(img);
+                };
+                img.onerror = () => {
+                    imagesLeft--;
+                    if (imagesLeft === 0) resolve();
+                };
+            }
+        });
+
+        // Fallback timeout is disabled: resolve immediately rather than waiting up to 5 seconds.
+        // setTimeout(() => resolve(), 5000);
+        resolve();
+    });
+};
--- a/crawl4ai/markdown_generation_strategy.py
+++ b/crawl4ai/markdown_generation_strategy.py
@@ -1,7 +1,7 @@
 from abc import ABC, abstractmethod
 from typing import Optional, Dict, Any, Tuple
 from .models import MarkdownGenerationResult
-from .utils import CustomHTML2Text
+from .html2text import CustomHTML2Text
 from .content_filter_strategy import RelevantContentFilter, BM25ContentFilter
 import re
 from urllib.parse import urljoin
@@ -9,10 +9,22 @@ from urllib.parse import urljoin
 # Pre-compile the regex pattern
 LINK_PATTERN = re.compile(r'!?\[([^\]]+)\]\(([^)]+?)(?:\s+"([^"]*)")?\)')

+def fast_urljoin(base: str, url: str) -> str:
+    """Fast URL joining for common cases."""
+    if url.startswith(('http://', 'https://', 'mailto:', '//')):
+        return url
+    if url.startswith('/'):
+        # Root-relative path: append to base (assumes base has no path component)
+        if base.endswith('/'):
+            return base[:-1] + url
+        return base + url
+    return urljoin(base, url)
+
 class MarkdownGenerationStrategy(ABC):
    """Abstract base class for markdown generation strategies."""
-    def __init__(self, content_filter: Optional[RelevantContentFilter] = None):
+    def __init__(self, content_filter: Optional[RelevantContentFilter] = None, options: Optional[Dict[str, Any]] = None):
        self.content_filter = content_filter
+        self.options = options or {}
    
    @abstractmethod
    def generate_markdown(self, 
@@ -27,8 +39,8 @@ class MarkdownGenerationStrategy(ABC):

 class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
    """Default implementation of markdown generation strategy."""
-    def __init__(self, content_filter: Optional[RelevantContentFilter] = None):
-        super().__init__(content_filter)
+    def __init__(self, content_filter: Optional[RelevantContentFilter] = None, options: Optional[Dict[str, Any]] = None):
+        super().__init__(content_filter, options)
    
    def convert_links_to_citations(self, markdown: str, base_url: str = "") -> Tuple[str, str]:
        link_map = {}
@@ -74,6 +86,7 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
                         cleaned_html: str, 
                         base_url: str = "",
                         html2text_options: Optional[Dict[str, Any]] = None,
+                         options: Optional[Dict[str, Any]] = None,
                         content_filter: Optional[RelevantContentFilter] = None,
                         citations: bool = True,
                         **kwargs) -> MarkdownGenerationResult:
@@ -82,6 +95,10 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
        h = CustomHTML2Text()
        if html2text_options:
            h.update_params(**html2text_options)
+        elif options:
+            h.update_params(**options)
+        elif self.options:
+            h.update_params(**self.options)

        # Generate raw markdown
        raw_markdown = h.handle(cleaned_html)
@@ -112,13 +129,3 @@ class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
            fit_html=filtered_html,
        )

-def fast_urljoin(base: str, url: str) -> str:
-    """Fast URL joining for common cases."""
-    if url.startswith(('http://', 'https://', 'mailto:', '//')):
-        return url
-    if url.startswith('/'):
-        # Handle absolute paths
-        if base.endswith('/'):
-            return base[:-1] + url
-        return base + url
-    return urljoin(base, url)
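Note on `fast_urljoin`: the helper moved in this file short-circuits absolute URLs and root-relative paths before falling back to `urljoin`. The sketch below reproduces the function from the diff and compares it with `urllib.parse.urljoin`. The `example.com` URLs are made up for illustration. It suggests the shortcut matches the standard behaviour when the base is a site root, and differs when the base already carries a path.

```python
from urllib.parse import urljoin


def fast_urljoin(base: str, url: str) -> str:
    """Copy of the helper from the diff, reproduced for illustration."""
    if url.startswith(('http://', 'https://', 'mailto:', '//')):
        return url
    if url.startswith('/'):
        if base.endswith('/'):
            return base[:-1] + url
        return base + url
    return urljoin(base, url)


# Site-root base: the shortcut and urljoin produce the same URL.
root = "https://example.com/"
same = fast_urljoin(root, "/docs/page") == urljoin(root, "/docs/page")

# Base with a path: the shortcut concatenates, urljoin resolves against the host.
nested = "https://example.com/blog/post"
shortcut = fast_urljoin(nested, "/docs/page")
standard = urljoin(nested, "/docs/page")

print(same, shortcut, standard)
```

If callers may pass page URLs rather than site roots as `base_url`, falling back to `urljoin` for root-relative paths would be the safer choice, at the cost of the speed-up.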
--- a/crawl4ai/models.py
+++ b/crawl4ai/models.py
@@ -23,6 +23,7 @@ class CrawlResult(BaseModel):
    links: Dict[str, List[Dict]] = {}
    downloaded_files: Optional[List[str]] = None
    screenshot: Optional[str] = None
+    pdf: Optional[bytes] = None
    markdown: Optional[Union[str, MarkdownGenerationResult]] = None
    markdown_v2: Optional[MarkdownGenerationResult] = None
    fit_markdown: Optional[str] = None
@@ -39,6 +40,7 @@ class AsyncCrawlResponse(BaseModel):
    response_headers: Dict[str, str]
    status_code: int
    screenshot: Optional[str] = None
+    pdf_data: Optional[bytes] = None
    get_delayed_content: Optional[Callable[[Optional[float]], Awaitable[str]]] = None
    downloaded_files: Optional[List[str]] = None

--- a/crawl4ai/tools.py
+++ b/crawl4ai/tools.py
@@ -1,34 +0,0 @@
-import time
-import cProfile
-import pstats
-from functools import wraps
-
-def profile_and_time(func):
-    @wraps(func)
-    def wrapper(self, *args, **kwargs):
-        # Start timer
-        start_time = time.perf_counter()
-        
-        # Setup profiler
-        profiler = cProfile.Profile()
-        profiler.enable()
-        
-        # Run function
-        result = func(self, *args, **kwargs)
-        
-        # Stop profiler
-        profiler.disable()
-        
-        # Calculate elapsed time
-        elapsed_time = time.perf_counter() - start_time
-        
-        # Print timing
-        print(f"[PROFILER] Scraping completed in {elapsed_time:.2f} seconds")
-        
-        # Print profiling stats
-        stats = pstats.Stats(profiler)
-        stats.sort_stats('cumulative')  # Sort by cumulative time
-        stats.print_stats(20)  # Print top 20 time-consuming functions
-        
-        return result
-    return wrapper
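Note on `profile_and_time`: the decorator removed here reappears in `crawl4ai/utils.py` later in this diff. The following is a minimal sketch of the same idea, under a few assumptions that are not in the original: the profiler report is written to a string buffer instead of stdout, the results are stored on the instance as `last_elapsed` and `last_report`, and the `Scraper` class is hypothetical.

```python
import cProfile
import io
import pstats
import time
from functools import wraps


def profile_and_time(func):
    """Variant of the decorator in the diff; stats go to a string buffer."""
    @wraps(func)
    def wrapper(self, *args, **kwargs):
        start = time.perf_counter()
        profiler = cProfile.Profile()
        profiler.enable()
        try:
            result = func(self, *args, **kwargs)
        finally:
            # Always stop the profiler, even if the wrapped call raises.
            profiler.disable()
        self.last_elapsed = time.perf_counter() - start
        buffer = io.StringIO()
        stats = pstats.Stats(profiler, stream=buffer)
        stats.sort_stats("cumulative").print_stats(5)  # top 5 entries
        self.last_report = buffer.getvalue()
        return result
    return wrapper


class Scraper:  # hypothetical class, for illustration only
    @profile_and_time
    def word_count(self, html: str) -> int:
        return len(html.split())


scraper = Scraper()
count = scraper.word_count("<p>hello world</p>")
print(count, f"{scraper.last_elapsed:.6f}s")
```

Because the wrapper takes `self` as its first argument, the decorator as written in the diff only fits instance methods.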
--- a/crawl4ai/user_agent_generator.py
+++ b/crawl4ai/user_agent_generator.py
@@ -0,0 +1,263 @@
+import random
+from typing import Optional, Literal, List, Dict, Tuple
+import re
+
+
+class UserAgentGenerator:
+    def __init__(self):
+        # Platform tokens, keyed by OS and then by variant or device brand
+        self.desktop_platforms = {
+            "windows": {
+                "10_64": "(Windows NT 10.0; Win64; x64)",
+                "10_32": "(Windows NT 10.0; WOW64)",
+            },
+            "macos": {
+                "intel": "(Macintosh; Intel Mac OS X 10_15_7)",
+                "newer": "(Macintosh; Intel Mac OS X 10.15; rv:109.0)",
+            },
+            "linux": {
+                "generic": "(X11; Linux x86_64)",
+                "ubuntu": "(X11; Ubuntu; Linux x86_64)",
+                "chrome_os": "(X11; CrOS x86_64 14541.0.0)",
+            }
+        }
+
+        self.mobile_platforms = {
+            "android": {
+                "samsung": "(Linux; Android 13; SM-S901B)",
+                "pixel": "(Linux; Android 12; Pixel 6)",
+                "oneplus": "(Linux; Android 13; OnePlus 9 Pro)",
+                "xiaomi": "(Linux; Android 12; M2102J20SG)",
+            },
+            "ios": {
+                "iphone": "(iPhone; CPU iPhone OS 16_5 like Mac OS X)",
+                "ipad": "(iPad; CPU OS 16_5 like Mac OS X)",
+            }
+        }
+
+        # Browser Combinations
+        self.browser_combinations = {
+            1: [
+                ["chrome"],
+                ["firefox"],
+                ["safari"],
+                ["edge"]
+            ],
+            2: [
+                ["gecko", "firefox"],
+                ["chrome", "safari"],
+                ["webkit", "safari"]
+            ],
+            3: [
+                ["chrome", "safari", "edge"],
+                ["webkit", "chrome", "safari"]
+            ]
+        }
+
+        # Rendering Engines with versions
+        self.rendering_engines = {
+            "chrome_webkit": "AppleWebKit/537.36",
+            "safari_webkit": "AppleWebKit/605.1.15",
+            "gecko": [  # Added Gecko versions
+                "Gecko/20100101",
+                "Gecko/20100101",  # Firefox usually uses this constant version
+                "Gecko/20100101",
+            ]
+        }
+
+        # Browser Versions
+        self.chrome_versions = [
+            "Chrome/119.0.6045.199",
+            "Chrome/118.0.5993.117",
+            "Chrome/117.0.5938.149",
+            "Chrome/116.0.5845.187",
+            "Chrome/115.0.5790.171",
+        ]
+
+        self.edge_versions = [
+            "Edg/119.0.2151.97",
+            "Edg/118.0.2088.76",
+            "Edg/117.0.2045.47",
+            "Edg/116.0.1938.81",
+            "Edg/115.0.1901.203",
+        ]
+
+        self.safari_versions = [
+            "Safari/537.36",  # For Chrome-based
+            "Safari/605.1.15",
+            "Safari/604.1",
+            "Safari/602.1",
+            "Safari/601.5.17",
+        ]
+
+        # Added Firefox versions
+        self.firefox_versions = [
+            "Firefox/119.0",
+            "Firefox/118.0.2",
+            "Firefox/117.0.1",
+            "Firefox/116.0",
+            "Firefox/115.0.3",
+            "Firefox/114.0.2",
+            "Firefox/113.0.1",
+            "Firefox/112.0",
+            "Firefox/111.0.1",
+            "Firefox/110.0",
+        ]
+
+    def get_browser_stack(self, num_browsers: int = 1) -> List[str]:
+        """Get a valid combination of browser versions"""
+        if num_browsers not in self.browser_combinations:
+            raise ValueError(f"Unsupported number of browsers: {num_browsers}")
+        
+        combination = random.choice(self.browser_combinations[num_browsers])
+        browser_stack = []
+        
+        for browser in combination:
+            if browser == "chrome":
+                browser_stack.append(random.choice(self.chrome_versions))
+            elif browser == "firefox":
+                browser_stack.append(random.choice(self.firefox_versions))
+            elif browser == "safari":
+                browser_stack.append(random.choice(self.safari_versions))
+            elif browser == "edge":
+                browser_stack.append(random.choice(self.edge_versions))
+            elif browser == "gecko":
+                browser_stack.append(random.choice(self.rendering_engines["gecko"]))
+            elif browser == "webkit":
+                browser_stack.append(self.rendering_engines["chrome_webkit"])
+        
+        return browser_stack
+
+    def generate(self, 
+                device_type: Optional[Literal['desktop', 'mobile']] = None,
+                os_type: Optional[str] = None,
+                device_brand: Optional[str] = None,
+                browser_type: Optional[Literal['chrome', 'edge', 'safari', 'firefox']] = None,
+                num_browsers: int = 3) -> str:
+        """
+        Generate a random user agent with specified constraints.
+        
+        Args:
+            device_type: 'desktop' or 'mobile'
+            os_type: 'windows', 'macos', 'linux', 'android', 'ios'
+            device_brand: Specific device brand
+            browser_type: 'chrome', 'edge', 'safari', or 'firefox' (accepted but not yet used to filter the stack)
+            num_browsers: Number of browser specifications (1-3)
+        """
+        # Get platform string
+        platform = self.get_random_platform(device_type, os_type, device_brand)
+        
+        # Start with Mozilla
+        components = ["Mozilla/5.0", platform]
+        
+        # Add browser stack
+        browser_stack = self.get_browser_stack(num_browsers)
+        
+        # Add appropriate legacy token based on browser stack
+        if "Firefox" in str(browser_stack):
+            components.append(random.choice(self.rendering_engines["gecko"]))
+        elif "Chrome" in str(browser_stack) or "Safari" in str(browser_stack):
+            components.append(self.rendering_engines["chrome_webkit"])
+            components.append("(KHTML, like Gecko)")
+        
+        # Add browser versions
+        components.extend(browser_stack)
+        
+        return " ".join(components)
+
+    def generate_with_client_hints(self, **kwargs) -> Tuple[str, str]:
+        """Generate both user agent and matching client hints"""
+        user_agent = self.generate(**kwargs)
+        client_hints = self.generate_client_hints(user_agent)
+        return user_agent, client_hints
+
+    def get_random_platform(self, device_type, os_type, device_brand):
+        """Helper method to get random platform based on constraints"""
+        platforms = self.desktop_platforms if device_type == 'desktop' else \
+                   self.mobile_platforms if device_type == 'mobile' else \
+                   {**self.desktop_platforms, **self.mobile_platforms}
+        
+        if os_type:
+            for platform_group in [self.desktop_platforms, self.mobile_platforms]:
+                if os_type in platform_group:
+                    platforms = {os_type: platform_group[os_type]}
+                    break
+        
+        os_key = random.choice(list(platforms.keys()))
+        if device_brand and device_brand in platforms[os_key]:
+            return platforms[os_key][device_brand]
+        return random.choice(list(platforms[os_key].values()))
+
+    def parse_user_agent(self, user_agent: str) -> Dict[str, str]:
+        """Parse a user agent string to extract browser and version information"""
+        browsers = {
+            'chrome': r'Chrome/(\d+)',
+            'edge': r'Edg/(\d+)',
+            'safari': r'Version/(\d+)',
+            'firefox': r'Firefox/(\d+)'
+        }
+        
+        result = {}
+        for browser, pattern in browsers.items():
+            match = re.search(pattern, user_agent)
+            if match:
+                result[browser] = match.group(1)
+        
+        return result
+
+    def generate_client_hints(self, user_agent: str) -> str:
+        """Generate Sec-CH-UA header value based on user agent string"""
+        browsers = self.parse_user_agent(user_agent)
+        
+        # Client hints components
+        hints = []
+        
+        # Handle different browser combinations
+        if 'chrome' in browsers:
+            hints.append(f'"Chromium";v="{browsers["chrome"]}"')
+            hints.append('"Not_A Brand";v="8"')
+            
+            if 'edge' in browsers:
+                hints.append(f'"Microsoft Edge";v="{browsers["edge"]}"')
+            else:
+                hints.append(f'"Google Chrome";v="{browsers["chrome"]}"')
+                
+        elif 'firefox' in browsers:
+            # Firefox doesn't typically send Sec-CH-UA
+            return '""'
+            
+        elif 'safari' in browsers:
+            # Safari's format for client hints
+            hints.append(f'"Safari";v="{browsers["safari"]}"')
+            hints.append('"Not_A Brand";v="8"')
+        
+        return ', '.join(hints)
+
+# Example usage:
+if __name__ == "__main__":
+    generator = UserAgentGenerator()
+    print(generator.generate())
+    
+    print("\nSingle browser (random; browser_type is not yet applied):")
+    print(generator.generate(num_browsers=1, browser_type='chrome'))
+    
+    print("\nTwo browsers (random combination):")
+    print(generator.generate(num_browsers=2))
+    
+    print("\nThree browsers (random combination):")
+    print(generator.generate(num_browsers=3))
+    
+    print("\nFirefox on Linux:")
+    print(generator.generate(
+        device_type='desktop',
+        os_type='linux',
+        browser_type='firefox',
+        num_browsers=2
+    ))
+    
+    print("\nChrome/Safari/Edge on Windows:")
+    print(generator.generate(
+        device_type='desktop',
+        os_type='windows',
+        num_browsers=3
+    ))
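Note on the client-hints path: `generate_client_hints` derives a `Sec-CH-UA` value from the major versions that `parse_user_agent` finds in the string. The sketch below restates that logic as two standalone functions so it can be run without the class. The function names `major_versions` and `sec_ch_ua` and the sample user agent are assumptions made for illustration. The regex patterns and brand strings are taken from the diff.

```python
import re

# Same per-browser patterns as parse_user_agent in the diff.
PATTERNS = {
    "chrome": r"Chrome/(\d+)",
    "edge": r"Edg/(\d+)",
    "safari": r"Version/(\d+)",
    "firefox": r"Firefox/(\d+)",
}


def major_versions(user_agent: str) -> dict:
    """Return {browser: major_version} for every pattern that matches."""
    found = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, user_agent)
        if match:
            found[name] = match.group(1)
    return found


def sec_ch_ua(user_agent: str) -> str:
    """Build a Sec-CH-UA style value following the branches in the diff."""
    browsers = major_versions(user_agent)
    if "chrome" in browsers:
        hints = [f'"Chromium";v="{browsers["chrome"]}"', '"Not_A Brand";v="8"']
        if "edge" in browsers:
            hints.append(f'"Microsoft Edge";v="{browsers["edge"]}"')
        else:
            hints.append(f'"Google Chrome";v="{browsers["chrome"]}"')
        return ", ".join(hints)
    if "firefox" in browsers:
        return '""'  # the diff returns an empty quoted string for Firefox
    if "safari" in browsers:
        return ", ".join([f'"Safari";v="{browsers["safari"]}"', '"Not_A Brand";v="8"'])
    return ""


ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/119.0.6045.199 Safari/537.36 Edg/119.0.2151.97"
)
hints = sec_ch_ua(ua)
print(hints)
```

The Safari branch keys on a `Version/` token, which the generator in this diff never emits, so generated Safari-only strings fall through to an empty hint value.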
--- a/crawl4ai/utils.py
+++ b/crawl4ai/utils.py
@@ -19,99 +19,17 @@ from typing import Optional, Tuple, Dict, Any
 import xxhash
 from colorama import Fore, Style, init
 import textwrap
+import cProfile
+import pstats
+from functools import wraps

-from .html2text import HTML2Text
-class CustomHTML2Text(HTML2Text):
-    def __init__(self, *args, **kwargs):
-        super().__init__(*args, **kwargs)
-        self.inside_pre = False
-        self.inside_code = False
-        self.preserve_tags = set()  # Set of tags to preserve
-        self.current_preserved_tag = None
-        self.preserved_content = []
-        self.preserve_depth = 0
-        
-        # Configuration options
-        self.skip_internal_links = False
-        self.single_line_break = False
-        self.mark_code = False
-        self.include_sup_sub = False
-        self.body_width = 0
-        self.ignore_mailto_links = True
-        self.ignore_links = False
-        self.escape_backslash = False
-        self.escape_dot = False
-        self.escape_plus = False
-        self.escape_dash = False
-        self.escape_snob = False
-
-    def update_params(self, **kwargs):
-        """Update parameters and set preserved tags."""
-        for key, value in kwargs.items():
-            if key == 'preserve_tags':
-                self.preserve_tags = set(value)
-            else:
-                setattr(self, key, value)
-
-    def handle_tag(self, tag, attrs, start):
-        # Handle preserved tags
-        if tag in self.preserve_tags:
-            if start:
-                if self.preserve_depth == 0:
-                    self.current_preserved_tag = tag
-                    self.preserved_content = []
-                    # Format opening tag with attributes
-                    attr_str = ''.join(f' {k}="{v}"' for k, v in attrs.items() if v is not None)
-                    self.preserved_content.append(f'<{tag}{attr_str}>')
-                self.preserve_depth += 1
-                return
-            else:
-                self.preserve_depth -= 1
-                if self.preserve_depth == 0:
-                    self.preserved_content.append(f'</{tag}>')
-                    # Output the preserved HTML block with proper spacing
-                    preserved_html = ''.join(self.preserved_content)
-                    self.o('\n' + preserved_html + '\n')
-                    self.current_preserved_tag = None
-                return
-
-        # If we're inside a preserved tag, collect all content
-        if self.preserve_depth > 0:
-            if start:
-                # Format nested tags with attributes
-                attr_str = ''.join(f' {k}="{v}"' for k, v in attrs.items() if v is not None)
-                self.preserved_content.append(f'<{tag}{attr_str}>')
-            else:
-                self.preserved_content.append(f'</{tag}>')
-            return
-
-        # Handle pre tags
-        if tag == 'pre':
-            if start:
-                self.o('```\n')
-                self.inside_pre = True
-            else:
-                self.o('\n```')
-                self.inside_pre = False
-        # elif tag in ["h1", "h2", "h3", "h4", "h5", "h6"]:
-        #     pass
-        else:
-            super().handle_tag(tag, attrs, start)
-
-    def handle_data(self, data, entity_char=False):
-        """Override handle_data to capture content within preserved tags."""
-        if self.preserve_depth > 0:
-            self.preserved_content.append(data)
-            return
-        super().handle_data(data, entity_char)
 class InvalidCSSSelectorError(Exception):
    pass

-
 def create_box_message(
   message: str, 
   type: str = "info", 
-   width: int = 80, 
+   width: int = 120, 
   add_newlines: bool = True,
   double_line: bool = False
 ) -> str:
@@ -330,50 +248,6 @@ def escape_json_string(s):
    
    return s

-class CustomHTML2Text_v0(HTML2Text):
-    def __init__(self, *args, **kwargs):
-        super().__init__(*args, **kwargs)
-        self.inside_pre = False
-        self.inside_code = False
-        
-        self.skip_internal_links = False
-        self.single_line_break = False
-        self.mark_code = False
-        self.include_sup_sub = False
-        self.body_width = 0
-        self.ignore_mailto_links = True
-        self.ignore_links = False
-        self.escape_backslash = False
-        self.escape_dot = False
-        self.escape_plus = False
-        self.escape_dash = False
-        self.escape_snob = False
-
-
-    def handle_tag(self, tag, attrs, start):
-        if tag == 'pre':
-            if start:
-                self.o('```\n')
-                self.inside_pre = True
-            else:
-                self.o('\n```')
-                self.inside_pre = False
-        elif tag in ["h1", "h2", "h3", "h4", "h5", "h6"]:
-            pass
-
-
-        # elif tag == 'code' and not self.inside_pre:
-        #     if start:
-        #         if not self.inside_pre:
-        #             self.o('`')
-        #         self.inside_code = True
-        #     else:
-        #         if not self.inside_pre:
-        #             self.o('`')
-        #         self.inside_code = False
-
-        super().handle_tag(tag, attrs, start)
-
 def replace_inline_tags(soup, tags, only_text=False):
    tag_replacements = {
        'b': lambda tag: f"**{tag.text}**",
@@ -935,7 +809,6 @@ def extract_metadata(html, soup=None):
    
    return metadata

-
 def extract_xml_tags(string):
    tags = re.findall(r'<(\w+)>', string)
    return list(set(tags))
@@ -953,7 +826,6 @@ def extract_xml_data(tags, string):

    return data
    
-# Function to perform the completion with exponential backoff
 def perform_completion_with_backoff(
    provider, 
    prompt_with_variables, 
@@ -967,7 +839,11 @@ def perform_completion_with_backoff(
    max_attempts = 3
    base_delay = 2  # Base delay in seconds, you can adjust this based on your needs
    
-    extra_args = {}
+    extra_args = {
+        "temperature": 0.01,
+        'api_key': api_token,
+        'base_url': base_url
+    }
    if json_response:
        extra_args["response_format"] = { "type": "json_object" }
        
@@ -976,14 +852,12 @@ def perform_completion_with_backoff(
    
    for attempt in range(max_attempts):
        try:
+            
            response =completion(
                model=provider,
                messages=[
                    {"role": "user", "content": prompt_with_variables}
                ],
-                temperature=0.01,
-                api_key=api_token,
-                base_url=base_url,
                **extra_args
            )
            return response  # Return the successful response
@@ -1307,6 +1181,35 @@ def clean_tokens(tokens: list[str]) -> list[str]:
            and not token.startswith('▲')
            and not token.startswith('⬆')]

+def profile_and_time(func):
+    @wraps(func)
+    def wrapper(self, *args, **kwargs):
+        # Start timer
+        start_time = time.perf_counter()
+        
+        # Setup profiler
+        profiler = cProfile.Profile()
+        profiler.enable()
+        
+        # Run function
+        result = func(self, *args, **kwargs)
+        
+        # Stop profiler
+        profiler.disable()
+        
+        # Calculate elapsed time
+        elapsed_time = time.perf_counter() - start_time
+        
+        # Print timing
+        print(f"[PROFILER] Scraping completed in {elapsed_time:.2f} seconds")
+        
+        # Print profiling stats
+        stats = pstats.Stats(profiler)
+        stats.sort_stats('cumulative')  # Sort by cumulative time
+        stats.print_stats(20)  # Print top 20 time-consuming functions
+        
+        return result
+    return wrapper

 def generate_content_hash(content: str) -> str:
    """Generate a unique hash for content"""
@@ -1320,7 +1223,8 @@ def ensure_content_dirs(base_path: str) -> Dict[str, str]:
        'cleaned': 'cleaned_html',
        'markdown': 'markdown_content', 
        'extracted': 'extracted_content',
-        'screenshots': 'screenshots'
+        'screenshots': 'screenshots',
+        'screenshot': 'screenshots'
    }
    
    content_paths = {}
@@ -1329,4 +1233,60 @@ def ensure_content_dirs(base_path: str) -> Dict[str, str]:
        os.makedirs(path, exist_ok=True)
        content_paths[key] = path
        
-    return content_paths
+    return content_paths
+
+def get_error_context(exc_info, context_lines: int = 5):
+    """
+    Extract error context with more reliable line number tracking.
+    
+    Args:
+        exc_info: The exception info from sys.exc_info()
+        context_lines: Number of lines to show before and after the error
+    
+    Returns:
+        dict: Error context information
+    """
+    import traceback
+    import linecache
+    import os
+    
+    # Get the full traceback
+    tb = traceback.extract_tb(exc_info[2])
+    
+    # Get the last frame (where the error occurred)
+    last_frame = tb[-1]
+    filename = last_frame.filename
+    line_no = last_frame.lineno
+    func_name = last_frame.name
+    
+    # Get the source code context using linecache
+    # This is more reliable than inspect.getsourcelines
+    context_start = max(1, line_no - context_lines)
+    context_end = line_no + context_lines + 1
+    
+    # Build the context lines with line numbers
+    numbered_lines = []
+    for i in range(context_start, context_end):
+        line = linecache.getline(filename, i)
+        if line:
+            # Remove any trailing whitespace/newlines and add the pointer for error line
+            line = line.rstrip()
+            pointer = '→' if i == line_no else ' '
+            numbered_lines.append(f"{i:4d} {pointer} {line}")
+    
+    # Join the lines with newlines
+    code_context = '\n'.join(numbered_lines)
+    
+    # Get relative path for cleaner output
+    try:
+        rel_path = os.path.relpath(filename)
+    except ValueError:
+        # Fallback if relpath fails (can happen on Windows with different drives)
+        rel_path = filename
+    
+    return {
+        "filename": rel_path,
+        "line_no": line_no,
+        "function": func_name,
+        "code_context": code_context
+    }
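Note on `get_error_context`: the helper takes the tuple from `sys.exc_info()`, picks the last traceback frame, and reads the surrounding source through `linecache`. Below is a condensed sketch of that flow. The names `locate_error` and `parse_price` are hypothetical and used only for illustration. The returned dictionary is a reduced version of the one in the diff.

```python
import linecache
import sys
import traceback


def locate_error(exc_info, context_lines: int = 2) -> dict:
    """Condensed variant of get_error_context: last frame plus nearby source."""
    last = traceback.extract_tb(exc_info[2])[-1]
    start = max(1, last.lineno - context_lines)
    stop = last.lineno + context_lines + 1
    rows = []
    for i in range(start, stop):
        line = linecache.getline(last.filename, i)
        if line:  # empty when the source is unavailable (e.g. code run via exec)
            marker = "→" if i == last.lineno else " "
            rows.append(f"{i:4d} {marker} {line.rstrip()}")
    return {
        "function": last.name,
        "line_no": last.lineno,
        "code_context": "\n".join(rows),
    }


def parse_price(text: str) -> float:
    return float(text)  # raises ValueError for non-numeric input


try:
    parse_price("n/a")
except ValueError:
    context = locate_error(sys.exc_info())

print(context["function"], context["line_no"])
print(context["code_context"])
```

The code context is empty whenever `linecache` cannot find the file on disk, so callers should treat it as optional.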
--- a/crawl4ai/utils.scraping.py
+++ b/crawl4ai/utils.scraping.py
--- a/docs/examples/full_page_screenshot_and_pdf_export.md
+++ b/docs/examples/full_page_screenshot_and_pdf_export.md
@@ -0,0 +1,58 @@
+# Capturing Full-Page Screenshots and PDFs from Massive Webpages with Crawl4AI
+
+When dealing with very long web pages, traditional full-page screenshots can be slow or fail entirely. For large pages (like extensive Wikipedia articles), generating a single massive screenshot often leads to delays, memory issues, or style differences.
+
+**The New Approach:**
+We’ve introduced a feature that handles very long pages by first exporting them as a PDF, then converting that PDF into a high-quality image. This approach relies on the browser’s built-in PDF rendering, which makes it stable and efficient for very long content. You can also save the PDF directly, with no need for multiple passes or complex stitching logic.
+
+**Key Benefits:**
+- **Reliability:** PDF export uses the browser’s built-in renderer, so it is far less prone to timeouts on very long pages than a stitched screenshot.
+- **Versatility:** Get both the PDF and a screenshot in one crawl, without reloading or reprocessing.
+- **Performance:** Skips manual scrolling and stitching images, reducing complexity and runtime.
+
+**Simple Example:**
+```python
+import os, sys
+import asyncio
+from crawl4ai import AsyncWebCrawler, CacheMode
+
+# Adjust paths as needed
+parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
+sys.path.append(parent_dir)
+__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
+
+async def main():
+    async with AsyncWebCrawler() as crawler:
+        # Request both PDF and screenshot
+        result = await crawler.arun(
+            url='https://en.wikipedia.org/wiki/List_of_common_misconceptions',
+            cache_mode=CacheMode.BYPASS,
+            pdf=True,
+            screenshot=True
+        )
+        
+        if result.success:
+            # Save screenshot
+            if result.screenshot:
+                from base64 import b64decode
+                with open(os.path.join(__location__, "screenshot.png"), "wb") as f:
+                    f.write(b64decode(result.screenshot))
+            
+            # Save PDF
+            if result.pdf:
+                # CrawlResult.pdf holds raw bytes, so no base64 decoding is needed
+                with open(os.path.join(__location__, "page.pdf"), "wb") as f:
+                    f.write(result.pdf)
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+**What Happens Under the Hood:**
+- Crawl4AI navigates to the target page.
+- If `pdf=True`, it exports the current page as a full PDF, capturing all of its content no matter the length.
+- If `screenshot=True`, and a PDF is already available, it directly converts the first page of that PDF to an image for you—no repeated loading or scrolling.
+- Finally, you get your PDF and/or screenshot ready to use.
+
+**Conclusion:**
+With this feature, Crawl4AI becomes even more robust and versatile for large-scale content extraction. Whether you need a PDF snapshot or a quick screenshot, you now have a reliable solution for even the most extensive webpages.
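Note on saving the two outputs: in `crawl4ai/models.py` from this diff, `CrawlResult.screenshot` is an optional string and `CrawlResult.pdf` is optional bytes. The example above treats the screenshot as base64 text, which this sketch also assumes. It shows the two write paths side by side using a stand-in object, so it runs without a crawler. The `save_outputs` helper and the fake payloads are hypothetical.

```python
import base64
import tempfile
from pathlib import Path
from types import SimpleNamespace


def save_outputs(result, out_dir: Path) -> list:
    """Write whichever of screenshot/pdf is present; return the written paths."""
    written = []
    if result.screenshot:  # base64-encoded string, decode before writing
        path = out_dir / "screenshot.png"
        path.write_bytes(base64.b64decode(result.screenshot))
        written.append(path)
    if result.pdf:  # raw bytes, written as-is
        path = out_dir / "page.pdf"
        path.write_bytes(result.pdf)
        written.append(path)
    return written


# Stand-in for a crawl result (hypothetical data, not real crawler output).
fake = SimpleNamespace(
    screenshot=base64.b64encode(b"\x89PNG fake image bytes").decode(),
    pdf=b"%PDF-1.4 fake pdf bytes",
)
out_dir = Path(tempfile.mkdtemp())
paths = save_outputs(fake, out_dir)
print([p.name for p in paths])
```

Keeping the decode step inside the screenshot branch avoids applying `b64decode` to PDF bytes that are already binary.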
--- a/docs/examples/llm_extraction_openai_pricing.py
+++ b/docs/examples/llm_extraction_openai_pricing.py
@@ -1,41 +1,40 @@
-import os
-import time
-from crawl4ai.web_crawler import WebCrawler
-from crawl4ai.chunking_strategy import *
 from crawl4ai.extraction_strategy import *
 from crawl4ai.crawler_strategy import *
+import asyncio
+from pydantic import BaseModel, Field

 url = r'https://openai.com/api/pricing/'

-crawler = WebCrawler()
-crawler.warmup()
-
-from pydantic import BaseModel, Field
-
 class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

-result = crawler.run(
-    url=url,
-    word_count_threshold=1,
-    extraction_strategy= LLMExtractionStrategy(
-        # provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'), 
-        provider= "groq/llama-3.1-70b-versatile", api_token = os.getenv('GROQ_API_KEY'), 
-        schema=OpenAIModelFee.model_json_schema(),
-        extraction_type="schema",
-        instruction="From the crawled content, extract all mentioned model names along with their "\
-            "fees for input and output tokens. Make sure not to miss anything in the entire content. "\
-            'One extracted model JSON format should look like this: '\
-            '{ "model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens" }'
-    ),
-    bypass_cache=True,
-)
+from crawl4ai import AsyncWebCrawler

-model_fees = json.loads(result.extracted_content)
+async def main():
+    # Use AsyncWebCrawler
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun(
+            url=url,
+            word_count_threshold=1,
+            extraction_strategy= LLMExtractionStrategy(
+                # provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'),
+                provider= "groq/llama-3.1-70b-versatile", api_token = os.getenv('GROQ_API_KEY'),
+                schema=OpenAIModelFee.model_json_schema(),
+                extraction_type="schema",
+                instruction="From the crawled content, extract all mentioned model names along with their " \
+                            "fees for input and output tokens. Make sure not to miss anything in the entire content. " \
+                            'One extracted model JSON format should look like this: ' \
+                            '{ "model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens" }'
+            ),

-print(len(model_fees))
+        )
+        print("Success:", result.success)
+        model_fees = json.loads(result.extracted_content)
+        print(len(model_fees))

-with open(".data/data.json", "w", encoding="utf-8") as f:
-    f.write(result.extracted_content)
+        with open(".data/data.json", "w", encoding="utf-8") as f:
+            f.write(result.extracted_content)
+
+asyncio.run(main())
--- a/docs/examples/quickstart_async.config.py
+++ b/docs/examples/quickstart_async.config.py
@@ -0,0 +1,518 @@
+import os, sys
+sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__)))))
+# Placeholder only: set FIRECRAWL_API_KEY in your environment instead of hardcoding a real key
+os.environ.setdefault('FIRECRAWL_API_KEY', "YOUR_FIRECRAWL_API_KEY")
+
+import asyncio
+import time
+import json
+import re
+from typing import Dict, List
+from bs4 import BeautifulSoup
+from pydantic import BaseModel, Field
+from crawl4ai import AsyncWebCrawler, CacheMode, BrowserConfig, CrawlerRunConfig
+from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
+from crawl4ai.content_filter_strategy import BM25ContentFilter, PruningContentFilter
+from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, LLMExtractionStrategy
+
+__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
+
+print("Crawl4AI: Advanced Web Crawling and Data Extraction")
+print("GitHub Repository: https://github.com/unclecode/crawl4ai")
+print("Twitter: @unclecode")
+print("Website: https://crawl4ai.com")
+
+# Basic Example - Simple Crawl
+async def simple_crawl():
+    print("\n--- Basic Usage ---")
+    browser_config = BrowserConfig(headless=True)
+    crawler_config = CrawlerRunConfig(
+        cache_mode=CacheMode.BYPASS
+    )
+    
+    async with AsyncWebCrawler(config=browser_config) as crawler:
+        result = await crawler.arun(
+            url="https://www.nbcnews.com/business",
+            config=crawler_config
+        )
+        print(result.markdown[:500])
+
+# JavaScript Execution Example
+async def simple_example_with_running_js_code():
+    print("\n--- Executing JavaScript and Using CSS Selectors ---")
+    
+    browser_config = BrowserConfig(
+        headless=True,
+        java_script_enabled=True
+    )
+    
+    crawler_config = CrawlerRunConfig(
+        cache_mode=CacheMode.BYPASS,
+        js_code=["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"],
+        # wait_for="() => { return Array.from(document.querySelectorAll('article.tease-card')).length > 10; }"
+    )
+    
+    async with AsyncWebCrawler(config=browser_config) as crawler:
+        result = await crawler.arun(
+            url="https://www.nbcnews.com/business",
+            config=crawler_config
+        )
+        print(result.markdown[:500])
+
+# CSS Selector Example
+async def simple_example_with_css_selector():
+    print("\n--- Using CSS Selectors ---")
+    browser_config = BrowserConfig(headless=True)
+    crawler_config = CrawlerRunConfig(
+        cache_mode=CacheMode.BYPASS,
+        css_selector=".wide-tease-item__description"
+    )
+    
+    async with AsyncWebCrawler(config=browser_config) as crawler:
+        result = await crawler.arun(
+            url="https://www.nbcnews.com/business",
+            config=crawler_config
+        )
+        print(result.markdown[:500])
+
+# Proxy Example
+async def use_proxy():
+    print("\n--- Using a Proxy ---")
+    browser_config = BrowserConfig(
+        headless=True,
+        proxy="http://your-proxy-url:port"
+    )
+    crawler_config = CrawlerRunConfig(
+        cache_mode=CacheMode.BYPASS
+    )
+    
+    async with AsyncWebCrawler(config=browser_config) as crawler:
+        result = await crawler.arun(
+            url="https://www.nbcnews.com/business",
+            config=crawler_config
+        )
+        if result.success:
+            print(result.markdown[:500])
+
+# Screenshot Example
+async def capture_and_save_screenshot(url: str, output_path: str):
+    browser_config = BrowserConfig(headless=True)
+    crawler_config = CrawlerRunConfig(
+        cache_mode=CacheMode.BYPASS,
+        screenshot=True
+    )
+    
+    async with AsyncWebCrawler(config=browser_config) as crawler:
+        result = await crawler.arun(
+            url=url,
+            config=crawler_config
+        )
+        
+        if result.success and result.screenshot:
+            import base64
+            screenshot_data = base64.b64decode(result.screenshot)
+            with open(output_path, 'wb') as f:
+                f.write(screenshot_data)
+            print(f"Screenshot saved successfully to {output_path}")
+        else:
+            print("Failed to capture screenshot")
+
+# LLM Extraction Example
+class OpenAIModelFee(BaseModel):
+    model_name: str = Field(..., description="Name of the OpenAI model.")
+    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
+    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
+
+async def extract_structured_data_using_llm(provider: str, api_token: str = None, extra_headers: Dict[str, str] = None):
+    print(f"\n--- Extracting Structured Data with {provider} ---")
+    
+    if api_token is None and provider != "ollama":
+        print(f"API token is required for {provider}. Skipping this example.")
+        return
+
+    browser_config = BrowserConfig(headless=True)
+    
+    extra_args = {
+        "temperature": 0,
+        "top_p": 0.9,
+        "max_tokens": 2000
+    }
+    if extra_headers:
+        extra_args["extra_headers"] = extra_headers
+
+    crawler_config = CrawlerRunConfig(
+        cache_mode=CacheMode.BYPASS,
+        word_count_threshold=1,
+        page_timeout = 80000,
+        extraction_strategy=LLMExtractionStrategy(
+            provider=provider,
+            api_token=api_token,
+            schema=OpenAIModelFee.model_json_schema(),
+            extraction_type="schema",
+            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. 
+            Do not miss any models in the entire content.""",
+            extra_args=extra_args
+        )
+    )
+    
+    async with AsyncWebCrawler(config=browser_config) as crawler:
+        result = await crawler.arun(
+            url="https://openai.com/api/pricing/",
+            config=crawler_config
+        )
+        print(result.extracted_content)
+
+# CSS Extraction Example
+async def extract_structured_data_using_css_extractor():
+    print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
+    schema = {
+        "name": "KidoCode Courses",
+        "baseSelector": "section.charge-methodology .w-tab-content > div",
+        "fields": [
+            {
+                "name": "section_title",
+                "selector": "h3.heading-50",
+                "type": "text",
+            },
+            {
+                "name": "section_description",
+                "selector": ".charge-content",
+                "type": "text",
+            },
+            {
+                "name": "course_name",
+                "selector": ".text-block-93",
+                "type": "text",
+            },
+            {
+                "name": "course_description",
+                "selector": ".course-content-text",
+                "type": "text",
+            },
+            {
+                "name": "course_icon",
+                "selector": ".image-92",
+                "type": "attribute",
+                "attribute": "src"
+            }
+        ]
+    }
+
+    browser_config = BrowserConfig(
+        headless=True,
+        java_script_enabled=True
+    )
+    
+    js_click_tabs = """
+    (async () => {
+        const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");
+        for(let tab of tabs) {
+            tab.scrollIntoView();
+            tab.click();
+            await new Promise(r => setTimeout(r, 500));
+        }
+    })();
+    """
+    
+    crawler_config = CrawlerRunConfig(
+        cache_mode=CacheMode.BYPASS,
+        extraction_strategy=JsonCssExtractionStrategy(schema),
+        js_code=[js_click_tabs]
+    )
+    
+    async with AsyncWebCrawler(config=browser_config) as crawler:
+        result = await crawler.arun(
+            url="https://www.kidocode.com/degrees/technology",
+            config=crawler_config
+        )
+
+        courses = json.loads(result.extracted_content)
+        print(f"Successfully extracted {len(courses)} courses")
+        print(json.dumps(courses[0], indent=2))
+
+# Dynamic Content Examples - Method 1
+async def crawl_dynamic_content_pages_method_1():
+    print("\n--- Advanced Multi-Page Crawling with JavaScript Execution ---")
+    first_commit = ""
+
+    async def on_execution_started(page, **kwargs):
+        nonlocal first_commit
+        try:
+            while True:
+                await page.wait_for_selector("li.Box-sc-g0xbh4-0 h4")
+                commit = await page.query_selector("li.Box-sc-g0xbh4-0 h4")
+                commit = await commit.evaluate("(element) => element.textContent")
+                commit = re.sub(r"\s+", "", commit)
+                if commit and commit != first_commit:
+                    first_commit = commit
+                    break
+                await asyncio.sleep(0.5)
+        except Exception as e:
+            print(f"Warning: New content didn't appear after JavaScript execution: {e}")
+
+    browser_config = BrowserConfig(
+        headless=False,
+        java_script_enabled=True
+    )
+
+    async with AsyncWebCrawler(config=browser_config) as crawler:
+        crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
+
+        url = "https://github.com/microsoft/TypeScript/commits/main"
+        session_id = "typescript_commits_session"
+        all_commits = []
+
+        js_next_page = """
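+        (() => {
+            // Wrapped in an IIFE so `button` is not redeclared when the script runs again in the same session
+            const button = document.querySelector('a[data-testid="pagination-next-button"]');
+            if (button) button.click();
+        })();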
+        """
+
+        for page in range(3):
+            crawler_config = CrawlerRunConfig(
+                cache_mode=CacheMode.BYPASS,
+                css_selector="li.Box-sc-g0xbh4-0",
+                js_code=js_next_page if page > 0 else None,
+                js_only=page > 0,
+                session_id=session_id
+            )
+
+            result = await crawler.arun(url=url, config=crawler_config)
+            assert result.success, f"Failed to crawl page {page + 1}"
+
+            soup = BeautifulSoup(result.cleaned_html, "html.parser")
+            commits = soup.select("li")
+            all_commits.extend(commits)
+
+            print(f"Page {page + 1}: Found {len(commits)} commits")
+
+        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
+
+# Dynamic Content Examples - Method 2
+async def crawl_dynamic_content_pages_method_2():
+    print("\n--- Advanced Multi-Page Crawling with JavaScript Execution ---")
+
+    browser_config = BrowserConfig(
+        headless=False,
+        java_script_enabled=True
+    )
+
+    js_next_page_and_wait = """
+    (async () => {
+        const getCurrentCommit = () => {
+            const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
+            return commits.length > 0 ? commits[0].textContent.trim() : null;
+        };
+
+        const initialCommit = getCurrentCommit();
+        const button = document.querySelector('a[data-testid="pagination-next-button"]');
+        if (button) button.click();
+
+        while (true) {
+            await new Promise(resolve => setTimeout(resolve, 100));
+            const newCommit = getCurrentCommit();
+            if (newCommit && newCommit !== initialCommit) {
+                break;
+            }
+        }
+    })();
+    """
+
+    schema = {
+        "name": "Commit Extractor",
+        "baseSelector": "li.Box-sc-g0xbh4-0",
+        "fields": [
+            {
+                "name": "title",
+                "selector": "h4.markdown-title",
+                "type": "text",
+                "transform": "strip",
+            },
+        ],
+    }
+
+    async with AsyncWebCrawler(config=browser_config) as crawler:
+        url = "https://github.com/microsoft/TypeScript/commits/main"
+        session_id = "typescript_commits_session"
+        all_commits = []
+
+        extraction_strategy = JsonCssExtractionStrategy(schema)
+
+        for page in range(3):
+            crawler_config = CrawlerRunConfig(
+                cache_mode=CacheMode.BYPASS,
+                css_selector="li.Box-sc-g0xbh4-0",
+                extraction_strategy=extraction_strategy,
+                js_code=js_next_page_and_wait if page > 0 else None,
+                js_only=page > 0,
+                session_id=session_id
+            )
+
+            result = await crawler.arun(url=url, config=crawler_config)
+            assert result.success, f"Failed to crawl page {page + 1}"
+
+            commits = json.loads(result.extracted_content)
+            all_commits.extend(commits)
+            print(f"Page {page + 1}: Found {len(commits)} commits")
+
+        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
+
+# Browser Comparison
+async def crawl_custom_browser_type():
+    print("\n--- Browser Comparison ---")
+    
+    # Firefox
+    browser_config_firefox = BrowserConfig(
+        browser_type="firefox",
+        headless=True
+    )
+    start = time.time()
+    async with AsyncWebCrawler(config=browser_config_firefox) as crawler:
+        result = await crawler.arun(
+            url="https://www.example.com",
+            config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
+        )
+        print("Firefox:", time.time() - start)
+        print(result.markdown[:500])
+
+    # WebKit
+    browser_config_webkit = BrowserConfig(
+        browser_type="webkit",
+        headless=True
+    )
+    start = time.time()
+    async with AsyncWebCrawler(config=browser_config_webkit) as crawler:
+        result = await crawler.arun(
+            url="https://www.example.com",
+            config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
+        )
+        print("WebKit:", time.time() - start)
+        print(result.markdown[:500])
+
+    # Chromium (default)
+    browser_config_chromium = BrowserConfig(
+        browser_type="chromium",
+        headless=True
+    )
+    start = time.time()
+    async with AsyncWebCrawler(config=browser_config_chromium) as crawler:
+        result = await crawler.arun(
+            url="https://www.example.com",
+            config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
+        )
+        print("Chromium:", time.time() - start)
+        print(result.markdown[:500])
+
+# Anti-Bot and User Simulation
+async def crawl_with_user_simulation():
+    browser_config = BrowserConfig(
+        headless=True,
+        user_agent_mode="random",
+        user_agent_generator_config={
+            "device_type": "mobile",
+            "os_type": "android"
+        }
+    )
+
+    crawler_config = CrawlerRunConfig(
+        cache_mode=CacheMode.BYPASS,
+        magic=True,
+        simulate_user=True,
+        override_navigator=True
+    )
+
+    async with AsyncWebCrawler(config=browser_config) as crawler:
+        result = await crawler.arun(
+            url="YOUR-URL-HERE",
+            config=crawler_config
+        )
+        print(result.markdown)
+
+# Speed Comparison
+async def speed_comparison():
+    print("\n--- Speed Comparison ---")
+    
+    # Firecrawl comparison
+    from firecrawl import FirecrawlApp
+    app = FirecrawlApp(api_key=os.environ['FIRECRAWL_API_KEY'])
+    start = time.time()
+    scrape_status = app.scrape_url(
+        'https://www.nbcnews.com/business',
+        params={'formats': ['markdown', 'html']}
+    )
+    end = time.time()
+    print("Firecrawl:")
+    print(f"Time taken: {end - start:.2f} seconds")
+    print(f"Content length: {len(scrape_status['markdown'])} characters")
+    print(f"Images found: {scrape_status['markdown'].count('cldnry.s-nbcnews.com')}")
+    print()
+
+    # Crawl4AI comparisons
+    browser_config = BrowserConfig(headless=True)
+    
+    # Simple crawl
+    async with AsyncWebCrawler(config=browser_config) as crawler:
+        start = time.time()
+        result = await crawler.arun(
+            url="https://www.nbcnews.com/business",
+            config=CrawlerRunConfig(
+                cache_mode=CacheMode.BYPASS,
+                word_count_threshold=0
+            )
+        )
+        end = time.time()
+        print("Crawl4AI (simple crawl):")
+        print(f"Time taken: {end - start:.2f} seconds")
+        print(f"Content length: {len(result.markdown)} characters")
+        print(f"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}")
+        print()
+
+        # Advanced filtering
+        start = time.time()
+        result = await crawler.arun(
+            url="https://www.nbcnews.com/business",
+            config=CrawlerRunConfig(
+                cache_mode=CacheMode.BYPASS,
+                word_count_threshold=0,
+                markdown_generator=DefaultMarkdownGenerator(
+                    content_filter=PruningContentFilter(
+                        threshold=0.48,
+                        threshold_type="fixed",
+                        min_word_threshold=0
+                    )
+                )
+            )
+        )
+        end = time.time()
+        print("Crawl4AI (Markdown Plus):")
+        print(f"Time taken: {end - start:.2f} seconds")
+        print(f"Content length: {len(result.markdown_v2.raw_markdown)} characters")
+        print(f"Fit Markdown: {len(result.markdown_v2.fit_markdown)} characters")
+        print(f"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}")
+        print()
+
+# Main execution
+async def main():
+    # Basic examples
+    # await simple_crawl()
+    # await simple_example_with_running_js_code()
+    # await simple_example_with_css_selector()
+    
+    # Advanced examples
+    # await extract_structured_data_using_css_extractor()
+    await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"))
+    # await crawl_dynamic_content_pages_method_1()
+    # await crawl_dynamic_content_pages_method_2()
+    
+    # Browser comparisons
+    # await crawl_custom_browser_type()
+    
+    # Performance testing
+    # await speed_comparison()
+
+    # Screenshot example
+    # await capture_and_save_screenshot(
+    #     "https://www.example.com",
+    #     os.path.join(__location__, "tmp/example_screenshot.jpg")
+    # )
+
+if __name__ == "__main__":
+    asyncio.run(main())
--- a/docs/examples/quickstart_async.py
+++ b/docs/examples/quickstart_async.py
@@ -15,7 +15,7 @@ from bs4 import BeautifulSoup
 from pydantic import BaseModel, Field
 from crawl4ai import AsyncWebCrawler, CacheMode
 from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
-from crawl4ai.content_filter_strategy import BM25ContentFilter
+from crawl4ai.content_filter_strategy import BM25ContentFilter, PruningContentFilter
 from crawl4ai.extraction_strategy import (
    JsonCssExtractionStrategy,
    LLMExtractionStrategy,
@@ -117,7 +117,13 @@ async def extract_structured_data_using_llm(provider: str, api_token: str = None
        print(f"API token is required for {provider}. Skipping this example.")
        return

-    extra_args = {}
+    # extra_args = {}
+    extra_args={
+        "temperature": 0, 
+        "top_p": 0.9,
+        "max_tokens": 2000,
+        # any other supported parameters for litellm
+    }
    if extra_headers:
        extra_args["extra_headers"] = extra_headers

@@ -128,7 +134,7 @@ async def extract_structured_data_using_llm(provider: str, api_token: str = None
            extraction_strategy=LLMExtractionStrategy(
                provider=provider,
                api_token=api_token,
-                schema=OpenAIModelFee.schema(),
+                schema=OpenAIModelFee.model_json_schema(),
                extraction_type="schema",
                instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. 
                Do not miss any models in the entire content. One extracted model JSON format should look like this: 
@@ -233,8 +239,10 @@ async def crawl_dynamic_content_pages_method_1():
        all_commits = []

        js_next_page = """
-        const button = document.querySelector('a[data-testid="pagination-next-button"]');
-        if (button) button.click();
+        (() => {
+            const button = document.querySelector('a[data-testid="pagination-next-button"]');
+            if (button) button.click();
+        })();
        """

        for page in range(3):  # Crawl 3 pages
@@ -466,7 +474,8 @@ async def speed_comparison():
            url="https://www.nbcnews.com/business",
            word_count_threshold=0,
            markdown_generator=DefaultMarkdownGenerator(
-                content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
+                content_filter = PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
+                # content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
            ),
            cache_mode=CacheMode.BYPASS,
            verbose=False,
@@ -489,7 +498,8 @@ async def speed_comparison():
            word_count_threshold=0,
            cache_mode=CacheMode.BYPASS,
            markdown_generator=DefaultMarkdownGenerator(
-                content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
+                content_filter = PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
+                # content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
            ),
            verbose=False,
        )
@@ -545,35 +555,70 @@ async def generate_knowledge_graph():
            f.write(result.extracted_content)

 async def fit_markdown_remove_overlay():
-    async with AsyncWebCrawler(headless = False) as crawler:
-        url = "https://janineintheworld.com/places-to-visit-in-central-mexico"
+    
+    async with AsyncWebCrawler(
+            headless=True,  # Set to False to see what is happening
+            verbose=True,
+            user_agent_mode="random",
+            user_agent_generator_config={
+                "device_type": "mobile",
+                "os_type": "android"
+            },
+    ) as crawler:
        result = await crawler.arun(
-            url=url,
+            url='https://www.kidocode.com/degrees/technology',
            cache_mode=CacheMode.BYPASS,
-            word_count_threshold = 10,
-            remove_overlay_elements=True,
-            screenshot = True
+            markdown_generator=DefaultMarkdownGenerator(
+                content_filter=PruningContentFilter(
+                    threshold=0.48, threshold_type="fixed", min_word_threshold=0
+                ),
+                options={
+                    "ignore_links": True
+                }
+            ),
+            # markdown_generator=DefaultMarkdownGenerator(
+            #     content_filter=BM25ContentFilter(user_query="", bm25_threshold=1.0),
+            #     options={
+            #         "ignore_links": True
+            #     }
+            # ),
        )
-        # Save markdown to file
-        with open(os.path.join(__location__, "mexico_places.md"), "w") as f:
-            f.write(result.fit_markdown)
-
+        
+        if result.success:
+            print(len(result.markdown_v2.raw_markdown))
+            print(len(result.markdown_v2.markdown_with_citations))
+            print(len(result.markdown_v2.fit_markdown))
+            
+            # Save clean html
+            with open(os.path.join(__location__, "output/cleaned_html.html"), "w") as f:
+                f.write(result.cleaned_html)
+            
+            with open(os.path.join(__location__, "output/output_raw_markdown.md"), "w") as f:
+                f.write(result.markdown_v2.raw_markdown)
+                
+            with open(os.path.join(__location__, "output/output_markdown_with_citations.md"), "w") as f:
+                f.write(result.markdown_v2.markdown_with_citations) 
+                
+            with open(os.path.join(__location__, "output/output_fit_markdown.md"), "w") as f:   
+                f.write(result.markdown_v2.fit_markdown)
+        
    print("Done")


 async def main():
-    await simple_crawl()
-    await simple_example_with_running_js_code()
-    await simple_example_with_css_selector()
-    # await use_proxy()
-    await capture_and_save_screenshot("https://www.example.com", os.path.join(__location__, "tmp/example_screenshot.jpg"))
-    await extract_structured_data_using_css_extractor()
+    # await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"))
+    
+    # await simple_crawl()
+    # await simple_example_with_running_js_code()
+    # await simple_example_with_css_selector()
+    # # await use_proxy()
+    # await capture_and_save_screenshot("https://www.example.com", os.path.join(__location__, "tmp/example_screenshot.jpg"))
+    # await extract_structured_data_using_css_extractor()

    # LLM extraction examples
    # await extract_structured_data_using_llm()
    # await extract_structured_data_using_llm("huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct", os.getenv("HUGGINGFACE_API_KEY"))
    # await extract_structured_data_using_llm("ollama/llama3.2")    
-    await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"))

    # You always can pass custom headers to the extraction strategy
    # custom_headers = {
--- a/docs/examples/storage_state_tutorial.md
+++ b/docs/examples/storage_state_tutorial.md
@@ -0,0 +1,225 @@
+### Using `storage_state` to Pre-Load Cookies and LocalStorage
+
+Crawl4AI’s `AsyncWebCrawler` lets you preserve and reuse session data, including cookies and localStorage, across multiple runs. By providing a `storage_state`, you can start your crawls already “logged in” or with any other necessary session data, with no need to repeat the login flow every time.
+
+#### What is `storage_state`?
+
+`storage_state` can be:
+
+- A dictionary containing cookies and localStorage data.
+- A path to a JSON file that holds this information.
+
+When you pass `storage_state` to the crawler, it applies these cookies and localStorage entries before loading any pages. This means your crawler effectively starts in a known authenticated or pre-configured state.
+
+#### Example Structure
+
+Here’s an example storage state:
+
+```json
+{
+  "cookies": [
+    {
+      "name": "session",
+      "value": "abcd1234",
+      "domain": "example.com",
+      "path": "/",
+      "expires": 1675363572.037711,
+      "httpOnly": false,
+      "secure": false,
+      "sameSite": "None"
+    }
+  ],
+  "origins": [
+    {
+      "origin": "https://example.com",
+      "localStorage": [
+        { "name": "token", "value": "my_auth_token" },
+        { "name": "refreshToken", "value": "my_refresh_token" }
+      ]
+    }
+  ]
+}
+```
+
+This JSON sets a `session` cookie and two localStorage entries (`token` and `refreshToken`) for `https://example.com`.
+
+---
+
+### Passing `storage_state` as a Dictionary
+
+You can directly provide the data as a dictionary:
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler
+
+async def main():
+    storage_dict = {
+        "cookies": [
+            {
+                "name": "session",
+                "value": "abcd1234",
+                "domain": "example.com",
+                "path": "/",
+                "expires": 1675363572.037711,
+                "httpOnly": False,
+                "secure": False,
+                "sameSite": "None"
+            }
+        ],
+        "origins": [
+            {
+                "origin": "https://example.com",
+                "localStorage": [
+                    {"name": "token", "value": "my_auth_token"},
+                    {"name": "refreshToken", "value": "my_refresh_token"}
+                ]
+            }
+        ]
+    }
+
+    async with AsyncWebCrawler(
+        headless=True,
+        storage_state=storage_dict
+    ) as crawler:
+        result = await crawler.arun(url='https://example.com/protected')
+        if result.success:
+            print("Crawl succeeded with pre-loaded session data!")
+            print("Page HTML length:", len(result.html))
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+---
+
+### Passing `storage_state` as a File
+
+If you prefer a file-based approach, save the JSON above to `mystate.json` and reference it:
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler
+
+async def main():
+    async with AsyncWebCrawler(
+        headless=True,
+        storage_state="mystate.json"  # Uses a JSON file instead of a dictionary
+    ) as crawler:
+        result = await crawler.arun(url='https://example.com/protected')
+        if result.success:
+            print("Crawl succeeded with pre-loaded session data!")
+            print("Page HTML length:", len(result.html))
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+---
+
+### Using `storage_state` to Avoid Repeated Logins (Sign In Once, Use Later)
+
+A common scenario is a site that requires a login (username, password, and so on) before you can access protected pages. Repeating that login on every crawl is cumbersome. Instead, you can:
+
+1. Perform the login once in a hook.
+2. After login completes, export the resulting `storage_state` to a file.
+3. On subsequent runs, provide that `storage_state` to skip the login step.
+
+**Step-by-Step Example:**
+
+**First Run (Perform Login and Save State):**
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler, CacheMode
+from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
+
+async def on_browser_created_hook(browser):
+    # Access the default context and create a page
+    context = browser.contexts[0]
+    page = await context.new_page()
+    
+    # Navigate to the login page
+    await page.goto("https://example.com/login", wait_until="domcontentloaded")
+    
+    # Fill in credentials and submit
+    await page.fill("input[name='username']", "myuser")
+    await page.fill("input[name='password']", "mypassword")
+    await page.click("button[type='submit']")
+    await page.wait_for_load_state("networkidle")
+    
+    # Now the site sets tokens in localStorage and cookies
+    # Export this state to a file so we can reuse it
+    await context.storage_state(path="my_storage_state.json")
+    await page.close()
+
+async def main():
+    # First run: perform login and export the storage_state
+    async with AsyncWebCrawler(
+        headless=True,
+        verbose=True,
+        hooks={"on_browser_created": on_browser_created_hook},
+        use_persistent_context=True,
+        user_data_dir="./my_user_data"
+    ) as crawler:
+        
+        # After on_browser_created_hook runs, we have storage_state saved to my_storage_state.json
+        result = await crawler.arun(
+            url='https://example.com/protected-page',
+            cache_mode=CacheMode.BYPASS,
+            markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True}),
+        )
+        print("First run result success:", result.success)
+        if result.success:
+            print("Protected page HTML length:", len(result.html))
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+**Second Run (Reuse Saved State, No Login Needed):**
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler, CacheMode
+from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
+
+async def main():
+    # Second run: no need to hook on_browser_created this time.
+    # Just provide the previously saved storage state.
+    async with AsyncWebCrawler(
+        headless=True,
+        verbose=True,
+        use_persistent_context=True,
+        user_data_dir="./my_user_data",
+        storage_state="my_storage_state.json"  # Reuse previously exported state
+    ) as crawler:
+        
+        # Now the crawler starts already logged in
+        result = await crawler.arun(
+            url='https://example.com/protected-page',
+            cache_mode=CacheMode.BYPASS,
+            markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True}),
+        )
+        print("Second run result success:", result.success)
+        if result.success:
+            print("Protected page HTML length:", len(result.html))
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+**What’s Happening Here?**
+
+- During the first run, the `on_browser_created_hook` logs into the site.  
+- After logging in, the hook exports the current session (cookies, localStorage, etc.) to `my_storage_state.json`.  
+- On subsequent runs, passing `storage_state="my_storage_state.json"` starts the browser context with these tokens already in place, skipping the login steps.
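+
+To choose between the two scripts automatically, you could check whether a usable state file already exists. The sketch below uses only the standard library; `load_saved_state` is a hypothetical helper, not part of Crawl4AI, and it only checks that the file parses as a JSON object, not that the tokens inside are still valid.
+
+```python
+import json
+from pathlib import Path
+
+def load_saved_state(path: str = "my_storage_state.json"):
+    """Return the parsed state if the file exists and is valid JSON, else None."""
+    state_file = Path(path)
+    if not state_file.is_file():
+        return None
+    try:
+        state = json.loads(state_file.read_text(encoding="utf-8"))
+    except json.JSONDecodeError:
+        return None
+    # A usable state should be a dict; treat anything else as missing.
+    return state if isinstance(state, dict) else None
+
+if __name__ == "__main__":
+    if load_saved_state() is None:
+        print("No saved state: run the first script to log in.")
+    else:
+        print("Saved state found: run the second script to reuse it.")
+```
+
+If the site expires its tokens, a saved file may still load but no longer grant access, so a failed protected-page crawl is a reasonable signal to repeat the first run.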
+
+**Sign Out Scenario:**  
+If the website allows you to sign out by clearing tokens or by navigating to a sign-out URL, you can also run a script that uses `on_browser_created_hook` or `arun` to simulate signing out, then export the resulting `storage_state` again. That would give you a baseline “logged out” state to start fresh from next time.
+
+---
+
+### Conclusion
+
+By using `storage_state`, you can skip repetitive actions, like logging in, and jump straight into crawling protected content. Whether you provide a file path or a dictionary, this powerful feature helps maintain state between crawls, simplifying your data extraction pipelines.
--- a/docs/examples/tutorial_dynamic_clicks.md
+++ b/docs/examples/tutorial_dynamic_clicks.md
@@ -0,0 +1,117 @@
+# Tutorial: Clicking Buttons to Load More Content with Crawl4AI
+
+## Introduction
+
+When scraping dynamic websites, it’s common to encounter “Load More” or “Next” buttons that must be clicked to reveal new content. Crawl4AI provides a straightforward way to handle these situations using JavaScript execution and waiting conditions. In this tutorial, we’ll cover two approaches:
+
+1. **Step-by-step (Session-based) Approach:** Multiple calls to `arun()` to progressively load more content.
+2. **Single-call Approach:** Execute a more complex JavaScript snippet inside a single `arun()` call to handle all clicks at once before the extraction.
+
+## Prerequisites
+
+- A working installation of Crawl4AI
+- Basic familiarity with Python’s `async`/`await` syntax
+
+## Step-by-Step Approach
+
+Use a session ID to maintain state across multiple `arun()` calls:
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler, CacheMode
+
+js_code = [
+    # This JS finds the “Next” button and clicks it
+    "const nextButton = document.querySelector('button.next'); nextButton && nextButton.click();"
+]
+
+wait_for_condition = "css:.new-content-class"
+
+async def main():
+    async with AsyncWebCrawler(headless=True, verbose=True) as crawler:
+        # 1. Load the initial page
+        result_initial = await crawler.arun(
+            url="https://example.com",
+            cache_mode=CacheMode.BYPASS,
+            session_id="my_session"
+        )
+
+        # 2. Click the 'Next' button and wait for new content
+        result_next = await crawler.arun(
+            url="https://example.com",
+            session_id="my_session",
+            js_code=js_code,
+            wait_for=wait_for_condition,
+            js_only=True,
+            cache_mode=CacheMode.BYPASS
+        )
+
+    # `result_next` now contains the updated HTML after clicking 'Next'
+    return result_next
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+**Key Points:**
+- **`session_id`**: Keeps the same browser context open.
+- **`js_code`**: Executes JavaScript in the context of the already loaded page.
+- **`wait_for`**: Makes the crawler wait until the given condition is met (here, an element matching the CSS selector appears) before returning.
+- **`js_only=True`**: Runs the JS in the current session without reloading the page.
+
+By repeating the `arun()` call multiple times and modifying the `js_code` (e.g., clicking different modules or pages), you can iteratively load all the desired content.
+
+## Single-call Approach
+
+If the page allows it, you can run a single `arun()` call with a more elaborate JavaScript snippet that:
+- Iterates over all the modules or "Next" buttons
+- Clicks them one by one
+- Waits for content updates between each click
+- Returns control to Crawl4AI for extraction once all clicks are done
+
+Example snippet:
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler, CacheMode
+
+js_code = [
+    # Example JS that clicks multiple modules:
+    """
+    (async () => {
+      const modules = document.querySelectorAll('.module-item');
+      for (let i = 0; i < modules.length; i++) {
+        modules[i].scrollIntoView();
+        modules[i].click();
+        // Wait for each module’s content to load, adjust 100ms as needed
+        await new Promise(r => setTimeout(r, 100));
+      }
+    })();
+    """
+]
+
+async def main():
+    async with AsyncWebCrawler(headless=True, verbose=True) as crawler:
+        result = await crawler.arun(
+            url="https://example.com",
+            js_code=js_code,
+            wait_for="css:.final-loaded-content-class",
+            cache_mode=CacheMode.BYPASS
+        )
+
+    # `result` now contains all content after all modules have been clicked in one go.
+    return result
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+**Key Points:**
+- All interactions (clicks and waits) happen before the extraction.
+- Ideal for pages where all steps can be done in a single pass.
+
+## Choosing the Right Approach
+
+- **Step-by-Step (Session-based)**: 
+  - Good when you need fine-grained control or must dynamically check conditions before clicking the next page.
+  - Useful if the page requires multiple conditions checked at runtime.
+
+- **Single-call**:
+  - Perfect if the sequence of interactions is known in advance.
+  - Cleaner code if the page’s structure is consistent and predictable.
+
+## Conclusion
+
+Crawl4AI makes it easy to handle dynamic content:
+- Use session IDs and multiple `arun()` calls for stepwise crawling.
+- Or pack all actions into one `arun()` call if the interactions are well-defined upfront.
+
+This flexibility ensures you can handle a wide range of dynamic web pages efficiently.
--- a/docs/md_v2/advanced/managed_browser.md
+++ b/docs/md_v2/advanced/managed_browser.md
@@ -4,7 +4,59 @@ This guide explains how to use content filtering strategies in Crawl4AI to extra

 ## Relevance Content Filter

-The `RelevanceContentFilter` is an abstract class that provides a common interface for content filtering strategies. Specific filtering algorithms, like `BM25ContentFilter`, inherit from this class and implement the `filter_content` method. This method takes the HTML content as input and returns a list of filtered text blocks.
+The `RelevantContentFilter` is an abstract class that provides a common interface for content filtering strategies. Specific filtering algorithms, like `PruningContentFilter` or `BM25ContentFilter`, inherit from this class and implement the `filter_content` method. This method takes the HTML content as input and returns a list of filtered text blocks.
+
+
+## Pruning Content Filter
+
+The `PruningContentFilter` is a tree-shaking algorithm that analyzes the HTML DOM structure and removes less relevant nodes based on various metrics like text density, link density, and tag importance. It evaluates each node using a composite scoring system and "prunes" nodes that fall below a certain threshold.
+
+### Usage
+
+```python
+from crawl4ai import AsyncWebCrawler
+from crawl4ai.content_filter_strategy import PruningContentFilter
+
+async def filter_content(url):
+    async with AsyncWebCrawler() as crawler:
+        content_filter = PruningContentFilter(
+            min_word_threshold=5,
+            threshold_type='dynamic',
+            threshold=0.45
+        )
+        result = await crawler.arun(url=url, extraction_strategy=content_filter, fit_markdown=True)
+        if result.success:
+            print(f"Cleaned Markdown:\n{result.fit_markdown}")
+```
+
+### Parameters
+
+- **`min_word_threshold`**: (Optional) Minimum number of words a node must contain to be considered relevant. Nodes with fewer words are automatically pruned.
+
+- **`threshold_type`**: (Optional, default 'fixed') Controls how pruning thresholds are calculated:
+  - `'fixed'`: Uses a constant threshold value for all nodes
+  - `'dynamic'`: Adjusts threshold based on node characteristics like tag importance and text/link ratios
+
+- **`threshold`**: (Optional, default 0.48) Base threshold value for node pruning:
+  - For fixed threshold: Nodes scoring below this value are removed
+  - For dynamic threshold: This value is adjusted based on node properties
+
+### How It Works
+
+The pruning algorithm evaluates each node using multiple metrics:
+- Text density: Ratio of actual text to overall node content
+- Link density: Proportion of text within links
+- Tag importance: Weight based on HTML tag type (e.g., article, p, div)
+- Content quality: Metrics like text length and structural importance
+
+Nodes scoring below the threshold are removed, effectively "shaking" less relevant content from the DOM tree. This results in a cleaner document containing only the most relevant content blocks.
+
+The algorithm is particularly effective for:
+- Removing boilerplate content
+- Eliminating navigation menus and sidebars
+- Preserving main article content
+- Maintaining document structure while removing noise
+

 ## BM25 Algorithm

--- a/docs/md_v2/basic/cache-modes.md
+++ b/docs/md_v2/basic/cache-modes.md
@@ -1,7 +1,7 @@
 # Crawl4AI Cache System and Migration Guide

 ## Overview
-Starting from version X.X.X, Crawl4AI introduces a new caching system that replaces the old boolean flags with a more intuitive `CacheMode` enum. This change simplifies cache control and makes the behavior more predictable.
+Starting from version 0.5.0, Crawl4AI introduces a new caching system that replaces the old boolean flags with a more intuitive `CacheMode` enum. This change simplifies cache control and makes the behavior more predictable.

 ## Old vs New Approach

--- a/docs/md_v2/basic/content_filtering.md
+++ b/docs/md_v2/basic/content_filtering.md
@@ -4,7 +4,59 @@ This guide explains how to use content filtering strategies in Crawl4AI to extra

 ## Relevance Content Filter

-The `RelevanceContentFilter` is an abstract class that provides a common interface for content filtering strategies. Specific filtering algorithms, like `BM25ContentFilter`, inherit from this class and implement the `filter_content` method. This method takes the HTML content as input and returns a list of filtered text blocks.
+The `RelevantContentFilter` is an abstract class that provides a common interface for content filtering strategies. Specific filtering algorithms, like `PruningContentFilter` or `BM25ContentFilter`, inherit from this class and implement the `filter_content` method. This method takes the HTML content as input and returns a list of filtered text blocks.
+
+
+## Pruning Content Filter
+
+The `PruningContentFilter` is a tree-shaking algorithm that analyzes the HTML DOM structure and removes less relevant nodes based on various metrics like text density, link density, and tag importance. It evaluates each node using a composite scoring system and "prunes" nodes that fall below a certain threshold.
+
+### Usage
+
+```python
+from crawl4ai import AsyncWebCrawler
+from crawl4ai.content_filter_strategy import PruningContentFilter
+
+async def filter_content(url):
+    async with AsyncWebCrawler() as crawler:
+        content_filter = PruningContentFilter(
+            min_word_threshold=5,
+            threshold_type='dynamic',
+            threshold=0.45
+        )
+        result = await crawler.arun(url=url, extraction_strategy=content_filter, fit_markdown=True)
+        if result.success:
+            print(f"Cleaned Markdown:\n{result.fit_markdown}")
+```
+
+### Parameters
+
+- **`min_word_threshold`**: (Optional) Minimum number of words a node must contain to be considered relevant. Nodes with fewer words are automatically pruned.
+
+- **`threshold_type`**: (Optional, default 'fixed') Controls how pruning thresholds are calculated:
+  - `'fixed'`: Uses a constant threshold value for all nodes
+  - `'dynamic'`: Adjusts threshold based on node characteristics like tag importance and text/link ratios
+
+- **`threshold`**: (Optional, default 0.48) Base threshold value for node pruning:
+  - For fixed threshold: Nodes scoring below this value are removed
+  - For dynamic threshold: This value is adjusted based on node properties
+
+### How It Works
+
+The pruning algorithm evaluates each node using multiple metrics:
+- Text density: Ratio of actual text to overall node content
+- Link density: Proportion of text within links
+- Tag importance: Weight based on HTML tag type (e.g., article, p, div)
+- Content quality: Metrics like text length and structural importance
+
+Nodes scoring below the threshold are removed, effectively "shaking" less relevant content from the DOM tree. This results in a cleaner document containing only the most relevant content blocks.
+
+The algorithm is particularly effective for:
+- Removing boilerplate content
+- Eliminating navigation menus and sidebars
+- Preserving main article content
+- Maintaining document structure while removing noise
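+
+To make the scoring idea concrete, here is a toy sketch of a composite score with a fixed threshold. It is not the formula `PruningContentFilter` uses: the `Node` class, the tag weights, and the 0.4/0.3/0.3 mix are all assumptions chosen for illustration.
+
+```python
+from dataclasses import dataclass
+
+# Illustrative tag weights; not the values used by Crawl4AI.
+TAG_WEIGHTS = {"article": 1.0, "p": 0.8, "div": 0.5, "nav": 0.1}
+
+@dataclass
+class Node:
+    tag: str
+    text_len: int        # characters of visible text in the node
+    link_text_len: int   # characters of that text that sit inside <a> tags
+    html_len: int        # total characters of the node's markup
+
+def score(node: Node) -> float:
+    """Combine text density, link density, and tag weight into one 0..1 score."""
+    text_density = node.text_len / node.html_len if node.html_len else 0.0
+    link_density = node.link_text_len / node.text_len if node.text_len else 1.0
+    tag_weight = TAG_WEIGHTS.get(node.tag, 0.3)
+    return 0.4 * text_density + 0.3 * (1.0 - link_density) + 0.3 * tag_weight
+
+def prune(nodes, threshold=0.48):
+    """Keep only nodes whose score reaches the fixed threshold."""
+    return [n for n in nodes if score(n) >= threshold]
+
+if __name__ == "__main__":
+    nodes = [
+        Node("article", text_len=900, link_text_len=30, html_len=1200),
+        Node("nav", text_len=120, link_text_len=115, html_len=600),
+    ]
+    for kept in prune(nodes):
+        print(kept.tag, round(score(kept), 2))
+```
+
+In this sketch the text-heavy `article` node survives while the link-heavy `nav` node falls below the threshold, which mirrors the behaviour described above for navigation menus and sidebars.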
+

 ## BM25 Algorithm

@@ -21,7 +73,7 @@ from crawl4ai.content_filter_strategy import BM25ContentFilter
 async def filter_content(url, query=None):
    async with AsyncWebCrawler() as crawler:
        content_filter = BM25ContentFilter(user_query=query)
-        result = await crawler.arun(url=url, content_filter=content_filter, fit_markdown=True) # Set fit_markdown flag to True to trigger BM25 filtering
+        result = await crawler.arun(url=url, extraction_strategy=content_filter, fit_markdown=True) # Set fit_markdown flag to True to trigger BM25 filtering
        if result.success:
            print(f"Filtered Content (JSON):\n{result.extracted_content}")
            print(f"\nFiltered Markdown:\n{result.fit_markdown}") # New field in CrawlResult object
@@ -71,7 +123,7 @@ class MyCustomFilter(RelevantContentFilter):
 async def custom_filter_demo(url: str):
    async with AsyncWebCrawler() as crawler:
        custom_filter = MyCustomFilter()
-        result = await crawler.arun(url, content_filter=custom_filter)
+        result = await crawler.arun(url, extraction_strategy=custom_filter)
        if result.success:
            print(result.extracted_content)

--- a/docs/md_v2/basic/quickstart.md
+++ b/docs/md_v2/basic/quickstart.md
@@ -8,7 +8,7 @@ First, let's import the necessary modules and create an instance of `AsyncWebCra

 ```python
 import asyncio
-from crawl4ai import AsyncWebCrawler, CasheMode
+from crawl4ai import AsyncWebCrawler, CacheMode

 async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
--- a/docs/md_v2/basic/simple-crawling.md
+++ b/docs/md_v2/basic/simple-crawling.md
@@ -99,7 +99,7 @@ async def main():
            remove_overlay_elements=True,
            
            # Cache control
-            cache_mode=CacheMode.ENABLE  # Use cache if available
+            cache_mode=CacheMode.ENABLED  # Use cache if available
        )
        
        if result.success:
--- a/docs/md_v2/blog/index.md
+++ b/docs/md_v2/blog/index.md
@@ -0,0 +1,47 @@
+# Crawl4AI Blog
+
+Welcome to the Crawl4AI blog! Here you'll find detailed release notes, technical insights, and updates about the project. Whether you're looking for the latest improvements or want to dive deep into web crawling techniques, this is the place.
+
+## Latest Release
+
+### [0.4.2 - Configurable Crawlers, Session Management, and Smarter Screenshots](releases/0.4.2.md)
+*December 12, 2024*
+
+The 0.4.2 update brings massive improvements to configuration, making crawlers and browsers easier to manage with dedicated objects. You can now import/export local storage for seamless session management. Plus, long-page screenshots are faster and cleaner, and full-page PDF exports are now possible. Check out all the new features to make your crawling experience even smoother.
+
+[Read full release notes →](releases/0.4.2.md)
+
+---
+
+### [0.4.1 - Smarter Crawling with Lazy-Load Handling, Text-Only Mode, and More](releases/0.4.1.md)
+*December 8, 2024*
+
+This release brings major improvements to handling lazy-loaded images, a blazing-fast Text-Only Mode, full-page scanning for infinite scrolls, dynamic viewport adjustments, and session reuse for efficient crawling. If you're looking to improve speed, reliability, or handle dynamic content with ease, this update has you covered.
+
+[Read full release notes →](releases/0.4.1.md)
+
+---
+
+### [0.4.0 - Major Content Filtering Update](releases/0.4.0.md)
+*December 1, 2024*
+
+Introduced significant improvements to content filtering, multi-threaded environment handling, and user-agent generation. This release features the new PruningContentFilter, enhanced thread safety, and improved test coverage.
+
+[Read full release notes →](releases/0.4.0.md)
+
+## Project History
+
+Curious about how Crawl4AI has evolved? Check out our [complete changelog](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md) for a detailed history of all versions and updates.
+
+## Categories
+
+- [Technical Deep Dives](/blog/technical) - Coming soon
+- [Tutorials & Guides](/blog/tutorials) - Coming soon
+- [Community Updates](/blog/community) - Coming soon
+
+## Stay Updated
+
+- Star us on [GitHub](https://github.com/unclecode/crawl4ai)
+- Follow [@unclecode](https://twitter.com/unclecode) on Twitter
+- Join our community discussions on GitHub
+
--- a/docs/md_v2/blog/releases/0.4.0.md
+++ b/docs/md_v2/blog/releases/0.4.0.md
@@ -0,0 +1,62 @@
+# Release Summary for Version 0.4.0 (December 1, 2024)
+
+## Overview
+The 0.4.0 release introduces significant improvements to content filtering, multi-threaded environment handling, user-agent generation, and test coverage. Key highlights include the introduction of the PruningContentFilter, designed to automatically identify and extract the most valuable parts of an HTML document, as well as enhancements to the BM25ContentFilter to extend its versatility and effectiveness.
+
+## Major Features and Enhancements
+
+### 1. PruningContentFilter
+- Introduced a new unsupervised content filtering strategy that scores and prunes less relevant nodes in an HTML document based on metrics like text and link density.
+- Focuses on retaining the most valuable parts of the content, making it highly effective for extracting relevant information from complex web pages.
+- Fully documented with updated README and expanded user guides.
+
+### 2. User-Agent Generator
+- Added a user-agent generator utility that resolves compatibility issues and supports customizable user-agent strings.
+- By default, the generator randomizes user agents for each request, adding diversity, but users can customize it for tailored scenarios.
+
+### 3. Enhanced Thread Safety
+- Improved handling of multi-threaded environments by adding better thread locks for parallel processing, ensuring consistency and stability when running multiple threads.
+
+### 4. Extended Content Filtering Strategies
+- Users now have access to both the PruningContentFilter for unsupervised extraction and the BM25ContentFilter for supervised filtering based on user queries.
+- Enhanced BM25ContentFilter with improved capabilities to process page titles, meta tags, and descriptions, allowing for more effective classification and clustering of text chunks.
+
+### 5. Documentation Updates
+- Updated examples and tutorials to promote the use of the PruningContentFilter alongside the BM25ContentFilter, providing clear instructions for selecting the appropriate filter for each use case.
+
+### 6. Unit Test Enhancements
+- Added unit tests for PruningContentFilter to ensure accuracy and reliability.
+- Enhanced BM25ContentFilter tests to cover additional edge cases and performance metrics, particularly for malformed HTML inputs.
+
+## Revised Change Logs for Version 0.4.0
+
+### PruningContentFilter (Dec 01, 2024)
+- Introduced the PruningContentFilter to optimize content extraction by pruning less relevant HTML nodes.
+  - **Affected Files:**
+    - **crawl4ai/content_filter_strategy.py**: Added a scoring-based pruning algorithm.
+    - **README.md**: Updated to include PruningContentFilter usage.
+    - **docs/md_v2/basic/content_filtering.md**: Expanded user documentation, detailing the use and benefits of PruningContentFilter.
+
+### Unit Tests for PruningContentFilter (Dec 01, 2024)
+- Added comprehensive unit tests for PruningContentFilter to ensure correctness and efficiency.
+  - **Affected Files:**
+    - **tests/async/test_content_filter_prune.py**: Created tests covering different pruning scenarios to ensure stability and correctness.
+
+### Enhanced BM25ContentFilter Tests (Dec 01, 2024)
+- Expanded tests to cover additional extraction scenarios and performance metrics, improving robustness.
+  - **Affected Files:**
+    - **tests/async/test_content_filter_bm25.py**: Added tests for edge cases, including malformed HTML inputs.
+
+### Documentation and Example Updates (Dec 01, 2024)
+- Revised examples to illustrate the use of PruningContentFilter alongside existing content filtering methods.
+  - **Affected Files:**
+    - **docs/examples/quickstart_async.py**: Enhanced example clarity and usability for new users.
+
+## Experimental Features
+- The PruningContentFilter is still under experimental development, and we continue to gather feedback for further refinements.
+
+## Conclusion
+This release significantly enhances the content extraction capabilities of Crawl4ai with the introduction of the PruningContentFilter, improved supervised filtering with BM25ContentFilter, and robust multi-threaded handling. Additionally, the user-agent generator provides much-needed versatility, resolving compatibility issues faced by many users.
+
+Users are encouraged to experiment with the new content filtering methods to determine which best suits their needs.
+
--- a/docs/md_v2/blog/releases/0.4.1.md
+++ b/docs/md_v2/blog/releases/0.4.1.md
@@ -0,0 +1,145 @@
+# Release Summary for Version 0.4.1 (December 8, 2024): Major Efficiency Boosts with New Features!
+
+_This post was generated with the help of ChatGPT; take everything with a grain of salt. 🧂_
+
+Hi everyone,
+
+I just finished putting together version 0.4.1 of Crawl4AI, and there are a few changes in here that I think you’ll find really helpful. I’ll explain what’s new, why it matters, and exactly how you can use these features (with the code to back it up). Let’s get into it.
+
+---
+
+### Handling Lazy Loading Better (Images Included)
+
+One thing that always bugged me with crawlers is how often they miss lazy-loaded content, especially images. In this version, I made sure Crawl4AI **waits for all images to load** before moving forward. This is useful because many modern websites only load images when they’re in the viewport or after some JavaScript executes.
+
+Here’s how to enable it:
+
+```python
+await crawler.crawl(
+    url="https://example.com",
+    wait_for_images=True  # Add this argument to ensure images are fully loaded
+)
+```
+
+What this does is:
+1. Waits for the page to reach a "network idle" state.
+2. Ensures all images on the page have been completely loaded.
+
+This single change handles the majority of lazy-loading cases you’re likely to encounter.
+
+---
+
+### Text-Only Mode (Fast, Lightweight Crawling)
+
+Sometimes, you don’t need to download images or process JavaScript at all. For example, if you’re crawling to extract text data, you can enable **text-only mode** to speed things up. By disabling images, JavaScript, and other heavy resources, this mode makes crawling **3-4 times faster** in most cases.
+
+Here’s how to turn it on:
+
+```python
+crawler = AsyncPlaywrightCrawlerStrategy(
+    text_only=True  # Set this to True to enable text-only crawling
+)
+```
+
+When `text_only=True`, the crawler automatically:
+- Disables GPU processing.
+- Blocks image and JavaScript resources.
+- Reduces the viewport size to 800x600 (you can override this with `viewport_width` and `viewport_height`).
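+
+The override rule for the viewport can be pictured with a small sketch. `resolve_viewport` is a hypothetical helper, not a Crawl4AI function, and the 1920x1080 fallback for normal mode is taken from the changelog entry for this release.
+
+```python
+def resolve_viewport(text_only=False, viewport_width=None, viewport_height=None):
+    """Pick a viewport: explicit values win, otherwise fall back by mode."""
+    default_width, default_height = (800, 600) if text_only else (1920, 1080)
+    width = viewport_width if viewport_width is not None else default_width
+    height = viewport_height if viewport_height is not None else default_height
+    return width, height
+
+if __name__ == "__main__":
+    print(resolve_viewport(text_only=True))
+    print(resolve_viewport(text_only=True, viewport_width=1024))
+```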
+
+If you need to crawl thousands of pages where you only care about text, this mode will save you a ton of time and resources.
+
+---
+
+### Adjusting the Viewport Dynamically
+
+Another useful addition is the ability to **dynamically adjust the viewport size** to match the content on the page. This is particularly helpful when you’re working with responsive layouts or want to ensure all parts of the page load properly.
+
+Here’s how it works:
+1. The crawler calculates the page’s width and height after it loads.
+2. It adjusts the viewport to fit the content dimensions.
+3. (Optional) It uses Chrome DevTools Protocol (CDP) to simulate zooming out so everything fits in the viewport.
+
+To enable this, use:
+
+```python
+await crawler.crawl(
+    url="https://example.com",
+    adjust_viewport_to_content=True  # Dynamically adjusts the viewport
+)
+```
+
+This approach makes sure the entire page gets loaded into the viewport, especially for layouts that load content based on visibility.
+
+---
+
+### Simulating Full-Page Scrolling
+
+Some websites load data dynamically as you scroll down the page. To handle these cases, I added support for **full-page scanning**. It simulates scrolling to the bottom of the page, checking for new content, and capturing it all.
+
+Here’s an example:
+
+```python
+await crawler.crawl(
+    url="https://example.com",
+    scan_full_page=True,   # Enables scrolling
+    scroll_delay=0.2       # Waits 200ms between scrolls (optional)
+)
+```
+
+What happens here:
+1. The crawler scrolls down in increments, waiting for content to load after each scroll.
+2. It stops when no new content appears (i.e., dynamic elements stop loading).
+3. It scrolls back to the top before finishing (if necessary).
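+
+The loop described in these steps can be sketched in plain Python. This is a toy model, not Crawl4AI's implementation: `FakePage` is a hypothetical stand-in for a browser page whose height grows when the bottom is reached, and the `step` size is an assumption.
+
+```python
+import time
+
+class FakePage:
+    """Stand-in for a browser page: height grows a few times, then stops."""
+    def __init__(self, heights):
+        self._heights = list(heights)
+        self.height = self._heights.pop(0)
+        self.scroll_y = 0
+
+    def scroll_to(self, y):
+        self.scroll_y = y
+        # Simulate lazy loading: reaching the bottom may reveal more content.
+        if y >= self.height and self._heights:
+            self.height = self._heights.pop(0)
+
+def scan_full_page(page, scroll_delay=0.2, step=500):
+    """Scroll down until the page height stops growing, then return to the top."""
+    scrolls = 0
+    while page.scroll_y < page.height:
+        page.scroll_to(min(page.scroll_y + step, page.height))
+        scrolls += 1
+        time.sleep(scroll_delay)  # give new content time to load
+    page.scroll_to(0)
+    return scrolls
+
+if __name__ == "__main__":
+    page = FakePage([1000, 1800])
+    print("scrolls:", scan_full_page(page, scroll_delay=0.0))
+```
+
+The loop ends once the scroll position has caught up with a height that no longer grows, which corresponds to step 2 above.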
+
+If you’ve ever had to deal with infinite scroll pages, this is going to save you a lot of headaches.
+
+---
+
+### Reusing Browser Sessions (Save Time on Setup)
+
+By default, every time you crawl a page, a new browser context (or tab) is created. That’s fine for small crawls, but if you’re working on a large dataset, it’s more efficient to reuse the same session.
+
+I added a method called `create_session` for this:
+
+```python
+session_id = await crawler.create_session()
+
+# Use the same session for multiple crawls
+await crawler.crawl(
+    url="https://example.com/page1",
+    session_id=session_id  # Reuse the session
+)
+await crawler.crawl(
+    url="https://example.com/page2",
+    session_id=session_id
+)
+```
+
+This avoids creating a new tab for every page, speeding up the crawl and reducing memory usage.
+
+---
+
+### Other Updates
+
+Here are a few smaller updates I’ve made:
+- **Light Mode**: Use `light_mode=True` to disable background processes, extensions, and other unnecessary features, making the browser more efficient.
+- **Logging**: Improved logs to make debugging easier.
+- **Defaults**: Added sensible defaults for things like `delay_before_return_html` (now set to 0.1 seconds).
+
+---
+
+### How to Get the Update
+
+You can install or upgrade to version `0.4.1` like this:
+
+```bash
+pip install crawl4ai --upgrade
+```
+
+As always, I’d love to hear your thoughts. If there’s something you think could be improved or if you have suggestions for future versions, let me know!
+
+Enjoy the new features, and happy crawling! 🕷️
+
+--- 
+
+
--- a/docs/md_v2/blog/releases/0.4.2.md
+++ b/docs/md_v2/blog/releases/0.4.2.md
@@ -0,0 +1,86 @@
+## 🚀 Crawl4AI 0.4.2 Update: Smarter Crawling Just Got Easier (Dec 12, 2024)
+
+### Hey Developers,
+
+I’m excited to share Crawl4AI 0.4.2—a major upgrade that makes crawling smarter, faster, and a whole lot more intuitive. I’ve packed in a bunch of new features to simplify your workflows and improve your experience. Let’s cut to the chase!
+
+---
+
+### 🔧 **Configurable Browser and Crawler Behavior**
+
+You’ve asked for better control over how browsers and crawlers are configured, and now you’ve got it. With the new `BrowserConfig` and `CrawlerRunConfig` objects, you can set up your browser and crawling behavior exactly how you want. No more cluttering `arun` with a dozen arguments—just pass in your configs and go.
+
+**Example:**
+```python
+from crawl4ai import BrowserConfig, CrawlerRunConfig, AsyncWebCrawler, CacheMode
+
+browser_config = BrowserConfig(headless=True, viewport_width=1920, viewport_height=1080)
+crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
+
+async with AsyncWebCrawler(config=browser_config) as crawler:
+    result = await crawler.arun(url="https://example.com", config=crawler_config)
+    print(result.markdown[:500])
+```
+
+This setup is a game-changer for scalability, keeping your code clean and flexible as we add more parameters in the future.
+
+Remember: if you prefer the old way, you can still pass arguments directly to `arun` as before. No worries!
+
+---
+
+### 🔐 **Streamlined Session Management**
+
+Here’s the big one: You can now pass local storage and cookies directly. Whether it’s setting values programmatically or importing a saved JSON state, managing sessions has never been easier. This is a must-have for authenticated crawls—just export your storage state once and reuse it effortlessly across runs.
+
+**Example:**
+1. Open a browser, log in manually, and export the storage state.
+2. Import the JSON file for seamless authenticated crawling:
+
+```python
+result = await crawler.arun(
+    url="https://example.com/protected",
+    storage_state="my_storage_state.json"
+)
+```
+
+---
+
+### 🔢 **Handling Large Pages: Supercharged Screenshots and PDF Conversion**
+
+Two big upgrades here:
+
+- **Blazing-fast long-page screenshots**: Turn extremely long web pages into clean, high-quality screenshots—without breaking a sweat. It’s optimized to handle large content without lag.
+
+- **Full-page PDF exports**: Now, you can also convert any page into a PDF with all the details intact. Perfect for archiving or sharing complex layouts.
+
+---
+
+### 🔧 **Other Cool Stuff**
+
+- **Anti-bot enhancements**: Magic mode now handles overlays, user simulation, and anti-detection features like a pro.
+- **JavaScript execution**: Execute custom JS snippets to handle dynamic content. No more wrestling with endless page interactions.
+
+---
+
+### 📊 **Performance Boosts and Dev-friendly Updates**
+
+- Faster rendering and viewport adjustments for better performance.
+- Improved cookie and local storage handling for seamless authentication.
+- Better debugging with detailed logs and actionable error messages.
+
+---
+
+### 🔠 **Use Cases You’ll Love**
+
+1. **Authenticated Crawls**: Log in once, export your storage state, and reuse it across multiple requests without the headache.
+2. **Long-page Screenshots**: Perfect for blogs, e-commerce pages, or any endless-scroll website.
+3. **PDF Export**: Create professional-looking page PDFs in seconds.
+
+---
+
+### Let’s Get Crawling
+
+Crawl4AI 0.4.2 is ready for you to download and try. I’m always looking for ways to improve, so don’t hold back—share your thoughts and feedback.
+
+Happy Crawling! 🚀
+
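Note on the session-management section of the release post above: it refers to an exported storage state without showing what the file contains. Below is a minimal sketch, assuming the file follows Playwright's storage-state layout (a `cookies` list plus per-origin `localStorage` entries). The helper name `build_storage_state`, the cookie and token values, and the output location are all hypothetical.

```python
import json
import os
import tempfile
import time


def build_storage_state(domain, cookies, local_storage):
    """Return a dict in Playwright's storage-state layout (assumed format)."""
    expires = int(time.time()) + 365 * 24 * 3600  # one year from now
    return {
        "cookies": [
            {
                "name": name,
                "value": value,
                "domain": domain,
                "path": "/",
                "expires": expires,
                "httpOnly": False,
                "secure": True,
                "sameSite": "Lax",
            }
            for name, value in cookies.items()
        ],
        "origins": [
            {
                "origin": f"https://{domain}",
                "localStorage": [
                    {"name": key, "value": value}
                    for key, value in local_storage.items()
                ],
            }
        ],
    }


state = build_storage_state(
    "example.com",
    cookies={"session": "abc123"},   # placeholder value
    local_storage={"token": "xyz"},  # placeholder value
)

# Written to the temp directory here; the post's example uses my_storage_state.json.
path = os.path.join(tempfile.gettempdir(), "my_storage_state.json")
with open(path, "w", encoding="utf-8") as fh:
    json.dump(state, fh, indent=2)
```

A file like this could then be handed to the `storage_state` argument shown in the post. Exporting it from a real logged-in browser session is usually more reliable than building it by hand.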
--- a/main.py
+++ b/main.py
@@ -342,7 +342,7 @@ app.add_middleware(

 # API token security
 security = HTTPBearer()
-CRAWL4AI_API_TOKEN = os.getenv("CRAWL4AI_API_TOKEN") or "test_api_code"
+CRAWL4AI_API_TOKEN = os.getenv("CRAWL4AI_API_TOKEN")

 async def verify_token(credentials: HTTPAuthorizationCredentials = Security(security)):
    if not CRAWL4AI_API_TOKEN:
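Note on the `main.py` hunk above: with the hardcoded `"test_api_code"` fallback removed, the token is `None` unless the environment variable is set. The hunk ends at the `if not CRAWL4AI_API_TOKEN:` check, so what the unset branch does is not visible here. The sketch below shows one plausible shape, assuming an unset token disables authentication. `check_token` is a hypothetical helper, not the function in `main.py`.

```python
import hmac
import os


def check_token(presented, expected):
    """Return True when the request may proceed.

    Assumption: an unset or empty expected token means auth is disabled.
    """
    if not expected:
        return True
    # compare_digest avoids leaking the mismatch position through timing
    return hmac.compare_digest(presented.encode(), expected.encode())


# No fallback value: None when CRAWL4AI_API_TOKEN is not set in the environment.
expected_token = os.getenv("CRAWL4AI_API_TOKEN")
```

Under this assumption a deployment is only protected once the variable is set. That is still safer than shipping a publicly known default token.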
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -10,7 +10,11 @@ nav:
  - 'Installation': 'basic/installation.md'
  - 'Docker Deplotment': 'basic/docker-deploymeny.md'
  - 'Quick Start': 'basic/quickstart.md'
-  
+  - Changelog & Blog:
+    - 'Blog Home': 'blog/index.md'
+    - 'Latest (0.4.1)': 'blog/releases/0.4.1.md'
+    - 'Changelog': 'https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md'
+
  - Basic:
    - 'Simple Crawling': 'basic/simple-crawling.md'
    - 'Output Formats': 'basic/output-formats.md'
@@ -50,12 +54,12 @@ nav:
    - '5. Dynamic Content': 'tutorial/episode_05_JavaScript_Execution_and_Dynamic_Content_Handling.md'
    - '6. Magic Mode': 'tutorial/episode_06_Magic_Mode_and_Anti-Bot_Protection.md'
    - '7. Content Cleaning': 'tutorial/episode_07_Content_Cleaning_and_Fit_Markdown.md'
-    - '8. Media Handling': 'tutorial/episode_08_Media_Handling:_Images,_Videos,_and_Audio.md'
+    - '8. Media Handling': 'tutorial/episode_08_Media_Handling_Images_Videos_and_Audio.md'
    - '9. Link Analysis': 'tutorial/episode_09_Link_Analysis_and_Smart_Filtering.md'
    - '10. User Simulation': 'tutorial/episode_10_Custom_Headers,_Identity,_and_User_Simulation.md'
-    - '11.1. JSON CSS': 'tutorial/episode_11_1_Extraction_Strategies:_JSON_CSS.md'
-    - '11.2. LLM Strategy': 'tutorial/episode_11_2_Extraction_Strategies:_LLM.md'
-    - '11.3. Cosine Strategy': 'tutorial/episode_11_3_Extraction_Strategies:_Cosine.md'
+    - '11.1. JSON CSS': 'tutorial/episode_11_1_Extraction_Strategies_JSON_CSS.md'
+    - '11.2. LLM Strategy': 'tutorial/episode_11_2_Extraction_Strategies_LLM.md'
+    - '11.3. Cosine Strategy': 'tutorial/episode_11_3_Extraction_Strategies_Cosine.md'
    - '12. Session Crawling': 'tutorial/episode_12_Session-Based_Crawling_for_Dynamic_Websites.md'
    - '13. Text Chunking': 'tutorial/episode_13_Chunking_Strategies_for_Large_Text_Processing.md'
    - '14. Custom Workflows': 'tutorial/episode_14_Hooks_and_Custom_Workflow_with_AsyncWebCrawler.md'
--- a/setup.py
+++ b/setup.py
@@ -57,6 +57,9 @@ setup(
    author_email="unclecode@kidocode.com",
    license="MIT",
    packages=find_packages(),
+    package_data={
+        'crawl4ai': ['js_snippet/*.js']  # This matches the exact path structure
+    },
    install_requires=default_requirements
    + ["playwright", "aiofiles"],  # Added aiofiles
    extras_require={
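Note on the `setup.py` hunk above: a small sketch of which files a pattern like `js_snippet/*.js` selects. It assumes setuptools expands `package_data` patterns as non-recursive globs relative to the package directory. The directory layout and file names are hypothetical and live in a throwaway temp directory.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    pkg = os.path.join(root, "crawl4ai")
    os.makedirs(os.path.join(pkg, "js_snippet", "nested"))

    # Hypothetical files: one direct .js, one non-.js, one nested .js
    for rel in ("js_snippet/a.js", "js_snippet/notes.txt", "js_snippet/nested/b.js"):
        with open(os.path.join(pkg, *rel.split("/")), "w") as fh:
            fh.write("// placeholder\n")

    # Same pattern as in package_data, resolved relative to the package directory
    matched = sorted(
        os.path.relpath(p, pkg).replace(os.sep, "/")
        for p in glob.glob(os.path.join(pkg, "js_snippet", "*.js"))
    )

print(matched)
```

Only the `.js` file directly inside `js_snippet/` matches. Snippets placed in subdirectories would need their own pattern.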
--- a/tests/async/test_0.4.2_browser_manager.py
+++ b/tests/async/test_0.4.2_browser_manager.py
@@ -0,0 +1,153 @@
+import os, sys
+parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
+sys.path.append(parent_dir)
+__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
+
+import os, sys
+import asyncio
+from crawl4ai import AsyncWebCrawler, CacheMode
+from crawl4ai.content_filter_strategy import PruningContentFilter
+from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
+
+# Assuming that the changes made allow different configurations 
+# for managed browser, persistent context, and so forth.
+
+async def test_default_headless():
+    async with AsyncWebCrawler(
+        headless=True,
+        verbose=True,
+        user_agent_mode="random",
+        user_agent_generator_config={"device_type": "mobile", "os_type": "android"},
+        use_managed_browser=False,
+        use_persistent_context=False,
+        ignore_https_errors=True,
+        # Testing normal ephemeral context
+    ) as crawler:
+        result = await crawler.arun(
+            url='https://www.kidocode.com/degrees/technology',
+            cache_mode=CacheMode.BYPASS,
+            markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True}),
+        )
+        print("[test_default_headless] success:", result.success)
+        print("HTML length:", len(result.html if result.html else ""))
+        
+async def test_managed_browser_persistent():
+    # Treating use_persistent_context=True as managed_browser scenario.
+    async with AsyncWebCrawler(
+        headless=False,
+        verbose=True,
+        user_agent_mode="random",
+        user_agent_generator_config={"device_type": "desktop", "os_type": "mac"},
+        use_managed_browser=True,
+        use_persistent_context=True,  # now should behave same as managed browser
+        user_data_dir="./output/test_profile",
+        # This should store and reuse profile data across runs
+    ) as crawler:
+        result = await crawler.arun(
+            url='https://www.google.com',
+            cache_mode=CacheMode.BYPASS,
+            markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True})
+        )
+        print("[test_managed_browser_persistent] success:", result.success)
+        print("HTML length:", len(result.html if result.html else ""))
+
+async def test_session_reuse():
+    # Test creating a session, using it for multiple calls
+    session_id = "my_session"
+    async with AsyncWebCrawler(
+        headless=False,
+        verbose=True,
+        user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
+        # Fixed user-agent for consistency
+        use_managed_browser=False,
+        use_persistent_context=False,
+    ) as crawler:
+        
+        # First call: create session
+        result1 = await crawler.arun(
+            url='https://www.example.com',
+            cache_mode=CacheMode.BYPASS,
+            session_id=session_id,
+            markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True})
+        )
+        print("[test_session_reuse first call] success:", result1.success)
+        
+        # Second call: same session, possibly cookie retained
+        result2 = await crawler.arun(
+            url='https://www.example.com/about',
+            cache_mode=CacheMode.BYPASS,
+            session_id=session_id,
+            markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True})
+        )
+        print("[test_session_reuse second call] success:", result2.success)
+
+async def test_magic_mode():
+    # Test magic mode with override_navigator and simulate_user
+    async with AsyncWebCrawler(
+        headless=False,
+        verbose=True,
+        user_agent_mode="random",
+        user_agent_generator_config={"device_type": "desktop", "os_type": "windows"},
+        use_managed_browser=False,
+        use_persistent_context=False,
+        magic=True,
+        override_navigator=True,
+        simulate_user=True,
+    ) as crawler:
+        result = await crawler.arun(
+            url='https://www.kidocode.com/degrees/business',
+            cache_mode=CacheMode.BYPASS,
+            markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True})
+        )
+        print("[test_magic_mode] success:", result.success)
+        print("HTML length:", len(result.html if result.html else ""))
+
+async def test_proxy_settings():
+    # Test with a proxy (if available) to ensure code runs with proxy
+    async with AsyncWebCrawler(
+        headless=True,
+        verbose=False,
+        user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
+        proxy="http://127.0.0.1:8080",  # Assuming local proxy server for test
+        use_managed_browser=False,
+        use_persistent_context=False,
+    ) as crawler:
+        result = await crawler.arun(
+            url='https://httpbin.org/ip',
+            cache_mode=CacheMode.BYPASS,
+            markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True})
+        )
+        print("[test_proxy_settings] success:", result.success)
+        if result.success:
+            print("HTML preview:", result.html[:200] if result.html else "")
+
+async def test_ignore_https_errors():
+    # Test ignore HTTPS errors with a self-signed or invalid cert domain
+    # This is just conceptual, the domain should be one that triggers SSL error.
+    # Using a hypothetical URL that fails SSL:
+    async with AsyncWebCrawler(
+        headless=True,
+        verbose=True,
+        user_agent="Mozilla/5.0",
+        ignore_https_errors=True,
+        use_managed_browser=False,
+        use_persistent_context=False,
+    ) as crawler:
+        result = await crawler.arun(
+            url='https://self-signed.badssl.com/',
+            cache_mode=CacheMode.BYPASS,
+            markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True})
+        )
+        print("[test_ignore_https_errors] success:", result.success)
+
+async def main():
+    print("Running tests...")
+    # await test_default_headless()
+    # await test_managed_browser_persistent()
+    # await test_session_reuse()
+    # await test_magic_mode()
+    # await test_proxy_settings()
+    await test_ignore_https_errors()
+
+if __name__ == "__main__":
+    asyncio.run(main())
--- a/tests/async/test_0.4.2_config_params.py
+++ b/tests/async/test_0.4.2_config_params.py
@@ -0,0 +1,231 @@
+import os, sys
+parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
+sys.path.append(parent_dir)
+__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
+
+import asyncio
+from crawl4ai import AsyncWebCrawler, CacheMode
+from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
+from crawl4ai.content_filter_strategy import PruningContentFilter
+from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+from crawl4ai.chunking_strategy import RegexChunking
+from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
+
+# Category 1: Browser Configuration Tests
+async def test_browser_config_object():
+    """Test the new BrowserConfig object with various browser settings"""
+    browser_config = BrowserConfig(
+        browser_type="chromium",
+        headless=False,
+        viewport_width=1920,
+        viewport_height=1080,
+        use_managed_browser=True,
+        user_agent_mode="random",
+        user_agent_generator_config={"device_type": "desktop", "os_type": "windows"}
+    )
+    
+    async with AsyncWebCrawler(config=browser_config, verbose=True) as crawler:
+        result = await crawler.arun('https://example.com', cache_mode=CacheMode.BYPASS)
+        assert result.success, "Browser config crawl failed"
+        assert len(result.html) > 0, "No HTML content retrieved"
+
+async def test_browser_performance_config():
+    """Test browser configurations focused on performance"""
+    browser_config = BrowserConfig(
+        text_only=True,
+        light_mode=True,
+        extra_args=['--disable-gpu', '--disable-software-rasterizer'],
+        ignore_https_errors=True,
+        java_script_enabled=False
+    )
+    
+    async with AsyncWebCrawler(config=browser_config) as crawler:
+        result = await crawler.arun('https://example.com')
+        assert result.success, "Performance optimized crawl failed"
+        assert result.status_code == 200, "Unexpected status code"
+
+# Category 2: Content Processing Tests
+async def test_content_extraction_config():
+    """Test content extraction with various strategies"""
+    crawler_config = CrawlerRunConfig(
+        word_count_threshold=300,
+        extraction_strategy=JsonCssExtractionStrategy(
+            schema={
+                "name": "article",
+                "baseSelector": "div",
+                "fields": [{
+                    "name": "title",
+                    "selector": "h1",
+                    "type": "text"
+                }]
+            }
+        ),
+        chunking_strategy=RegexChunking(),
+        content_filter=PruningContentFilter()
+    )
+    
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun(
+            'https://example.com/article',
+            config=crawler_config
+        )
+        assert result.extracted_content is not None, "Content extraction failed"
+        assert 'title' in result.extracted_content, "Missing expected content field"
+
+# Category 3: Cache and Session Management Tests
+async def test_cache_and_session_management():
+    """Test different cache modes and session handling"""
+    browser_config = BrowserConfig(use_persistent_context=True)
+    crawler_config = CrawlerRunConfig(
+        cache_mode=CacheMode.WRITE_ONLY,
+        process_iframes=True,
+        remove_overlay_elements=True
+    )
+    
+    async with AsyncWebCrawler(config=browser_config) as crawler:
+        # First request - should write to cache
+        result1 = await crawler.arun(
+            'https://example.com',
+            config=crawler_config
+        )
+        
+        # Second request - should use fresh fetch due to WRITE_ONLY mode
+        result2 = await crawler.arun(
+            'https://example.com',
+            config=crawler_config
+        )
+        
+        assert result1.success and result2.success, "Cache mode crawl failed"
+        assert result1.html == result2.html, "Inconsistent results between requests"
+
+# Category 4: Media Handling Tests
+async def test_media_handling_config():
+    """Test configurations related to media handling"""
+    # Make sure the downloads directory (~/.crawl4ai/downloads) exists
+    os.makedirs(os.path.expanduser("~/.crawl4ai/downloads"), exist_ok=True)
+    browser_config = BrowserConfig(
+        viewport_width=1920,
+        viewport_height=1080,
+        accept_downloads=True,
+        downloads_path=os.path.expanduser("~/.crawl4ai/downloads")
+    )
+    crawler_config = CrawlerRunConfig(
+        screenshot=True,
+        pdf=True,
+        adjust_viewport_to_content=True,
+        wait_for_images=True,
+        screenshot_height_threshold=20000
+    )
+    
+    async with AsyncWebCrawler(config=browser_config) as crawler:
+        result = await crawler.arun(
+            'https://example.com',
+            config=crawler_config
+        )
+        assert result.screenshot is not None, "Screenshot capture failed"
+        assert result.pdf is not None, "PDF generation failed"
+
+# Category 5: Anti-Bot and Site Interaction Tests
+async def test_antibot_config():
+    """Test configurations for handling anti-bot measures"""
+    crawler_config = CrawlerRunConfig(
+        simulate_user=True,
+        override_navigator=True,
+        magic=True,
+        wait_for="js:()=>document.querySelector('body')",
+        delay_before_return_html=1.0,
+        log_console=True,
+        cache_mode=CacheMode.BYPASS
+    )
+    
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun(
+            'https://example.com',
+            config=crawler_config
+        )
+        assert result.success, "Anti-bot measure handling failed"
+
+# Category 6: Parallel Processing Tests
+async def test_parallel_processing():
+    """Test parallel processing capabilities"""
+    crawler_config = CrawlerRunConfig(
+        mean_delay=0.5,
+        max_range=1.0,
+        semaphore_count=5
+    )
+    
+    urls = [
+        'https://example.com/1',
+        'https://example.com/2',
+        'https://example.com/3'
+    ]
+    
+    async with AsyncWebCrawler() as crawler:
+        results = await crawler.arun_many(
+            urls,
+            config=crawler_config
+        )
+        assert len(results) == len(urls), "Not all URLs were processed"
+        assert all(r.success for r in results), "Some parallel requests failed"
+
+# Category 7: Backwards Compatibility Tests
+async def test_legacy_parameter_support():
+    """Test that legacy parameters still work"""
+    async with AsyncWebCrawler(
+        headless=True,
+        browser_type="chromium",
+        viewport_width=1024,
+        viewport_height=768
+    ) as crawler:
+        result = await crawler.arun(
+            'https://example.com',
+            screenshot=True,
+            word_count_threshold=200,
+            bypass_cache=True,
+            css_selector=".main-content"
+        )
+        assert result.success, "Legacy parameter support failed"
+
+# Category 8: Mixed Configuration Tests
+async def test_mixed_config_usage():
+    """Test mixing new config objects with legacy parameters"""
+    browser_config = BrowserConfig(headless=True)
+    crawler_config = CrawlerRunConfig(screenshot=True)
+    
+    async with AsyncWebCrawler(
+        config=browser_config,
+        verbose=True  # legacy parameter
+    ) as crawler:
+        result = await crawler.arun(
+            'https://example.com',
+            config=crawler_config,
+            cache_mode=CacheMode.BYPASS,  # legacy parameter
+            css_selector="body"  # legacy parameter
+        )
+        assert result.success, "Mixed configuration usage failed"
+
+if __name__ == "__main__":
+    async def run_tests():
+        test_functions = [
+            test_browser_config_object,
+            # test_browser_performance_config,
+            # test_content_extraction_config,
+            # test_cache_and_session_management,
+            # test_media_handling_config,
+            # test_antibot_config,
+            # test_parallel_processing,
+            # test_legacy_parameter_support,
+            # test_mixed_config_usage
+        ]
+        
+        for test in test_functions:
+            print(f"\nRunning {test.__name__}...")
+            try:
+                await test()
+                print(f"✓ {test.__name__} passed")
+            except AssertionError as e:
+                print(f"✗ {test.__name__} failed: {str(e)}")
+            except Exception as e:
+                print(f"✗ {test.__name__} error: {str(e)}")
+    
+    asyncio.run(run_tests())
--- a/tests/async/test_content_filter_bm25.py
+++ b/tests/async/test_content_filter_bm25.py
--- a/tests/async/test_content_filter_prune.py
+++ b/tests/async/test_content_filter_prune.py
@@ -0,0 +1,159 @@
+import os, sys
+import pytest
+from bs4 import BeautifulSoup
+
+parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
+sys.path.append(parent_dir)
+
+from crawl4ai.content_filter_strategy import PruningContentFilter
+
+@pytest.fixture
+def basic_html():
+    return """
+    <html>
+        <body>
+            <article>
+                <h1>Main Article</h1>
+                <p>This is a high-quality paragraph with substantial text content. It contains enough words to pass the threshold and has good text density without too many links. This kind of content should survive the pruning process.</p>
+                <div class="sidebar">Low quality sidebar content</div>
+                <div class="social-share">Share buttons</div>
+            </article>
+        </body>
+    </html>
+    """
+
+@pytest.fixture
+def link_heavy_html():
+    return """
+    <html>
+        <body>
+            <div class="content">
+                <p>Good content paragraph that should remain.</p>
+                <div class="links">
+                    <a href="#">Link 1</a>
+                    <a href="#">Link 2</a>
+                    <a href="#">Link 3</a>
+                    <a href="#">Link 4</a>
+                </div>
+            </div>
+        </body>
+    </html>
+    """
+
+@pytest.fixture
+def mixed_content_html():
+    return """
+    <html>
+        <body>
+            <article>
+                <h1>Article Title</h1>
+                <p class="summary">Short summary.</p>
+                <div class="content">
+                    <p>Long high-quality paragraph with substantial content that should definitely survive the pruning process. This content has good text density and proper formatting which makes it valuable for retention.</p>
+                </div>
+                <div class="comments">
+                    <p>Short comment 1</p>
+                    <p>Short comment 2</p>
+                </div>
+            </article>
+        </body>
+    </html>
+    """
+
+class TestPruningContentFilter:
+    def test_basic_pruning(self, basic_html):
+        """Test basic content pruning functionality"""
+        filter = PruningContentFilter(min_word_threshold=5)
+        contents = filter.filter_content(basic_html)
+        
+        combined_content = ' '.join(contents).lower()
+        assert "high-quality paragraph" in combined_content
+        assert "sidebar content" not in combined_content
+        assert "share buttons" not in combined_content
+
+    def test_min_word_threshold(self, mixed_content_html):
+        """Test minimum word threshold filtering"""
+        filter = PruningContentFilter(min_word_threshold=10)
+        contents = filter.filter_content(mixed_content_html)
+        
+        combined_content = ' '.join(contents).lower()
+        assert "short summary" not in combined_content
+        assert "long high-quality paragraph" in combined_content
+        assert "short comment" not in combined_content
+
+    def test_threshold_types(self, basic_html):
+        """Test fixed vs dynamic thresholds"""
+        fixed_filter = PruningContentFilter(threshold_type='fixed', threshold=0.48)
+        dynamic_filter = PruningContentFilter(threshold_type='dynamic', threshold=0.45)
+        
+        fixed_contents = fixed_filter.filter_content(basic_html)
+        dynamic_contents = dynamic_filter.filter_content(basic_html)
+        
+        assert len(fixed_contents) != len(dynamic_contents), \
+            "Fixed and dynamic thresholds should yield different results"
+
+    def test_link_density_impact(self, link_heavy_html):
+        """Test handling of link-heavy content"""
+        filter = PruningContentFilter(threshold_type='dynamic')
+        contents = filter.filter_content(link_heavy_html)
+        
+        combined_content = ' '.join(contents).lower()
+        assert "good content paragraph" in combined_content
+        assert len([c for c in contents if 'href' in c]) < 2, \
+            "Should prune link-heavy sections"
+
+    def test_tag_importance(self, mixed_content_html):
+        """Test tag importance in scoring"""
+        filter = PruningContentFilter(threshold_type='dynamic')
+        contents = filter.filter_content(mixed_content_html)
+        
+        has_article = any('article' in c.lower() for c in contents)
+        has_h1 = any('h1' in c.lower() for c in contents)
+        assert has_article or has_h1, "Should retain important tags"
+
+    def test_empty_input(self):
+        """Test handling of empty input"""
+        filter = PruningContentFilter()
+        assert filter.filter_content("") == []
+        assert filter.filter_content(None) == []
+
+    def test_malformed_html(self):
+        """Test handling of malformed HTML"""
+        malformed_html = "<div>Unclosed div<p>Nested<span>content</div>"
+        filter = PruningContentFilter()
+        contents = filter.filter_content(malformed_html)
+        assert isinstance(contents, list)
+
+    def test_performance(self, basic_html):
+        """Test performance with timer"""
+        filter = PruningContentFilter()
+        
+        import time
+        start = time.perf_counter()
+        filter.filter_content(basic_html)
+        duration = time.perf_counter() - start
+        
+        # Strict bound: filtering a small document should take well under 100 ms
+        assert duration < 0.1, f"Processing took too long: {duration:.3f} seconds"
+
+    @pytest.mark.parametrize("threshold,expected_count", [
+        (0.3, 4),  # Very lenient
+        (0.48, 2), # Default
+        (0.7, 1),  # Very strict
+    ])
+    def test_threshold_levels(self, mixed_content_html, threshold, expected_count):
+        """Test different threshold levels"""
+        filter = PruningContentFilter(threshold_type='fixed', threshold=threshold)
+        contents = filter.filter_content(mixed_content_html)
+        assert len(contents) <= expected_count, \
+            f"Expected {expected_count} or fewer elements with threshold {threshold}"
+
+    def test_consistent_output(self, basic_html):
+        """Test output consistency across multiple runs"""
+        filter = PruningContentFilter()
+        first_run = filter.filter_content(basic_html)
+        second_run = filter.filter_content(basic_html)
+        assert first_run == second_run, "Output should be consistent"
+
+if __name__ == "__main__":
+    pytest.main([__file__])
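Note on `test_link_density_impact` above: the test expects link-heavy sections to be pruned, but the diff does not show how the filter scores them. The sketch below is a toy link-density metric (the share of visible text that sits inside `<a>` tags), written with the standard library only. It is an illustration of the idea, not the scoring used by `PruningContentFilter`; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class _LinkDensityParser(HTMLParser):
    """Counts visible characters overall and inside <a> elements."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0
        self.total_chars = 0
        self.link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth:
            self.anchor_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.anchor_depth:
            self.link_chars += n


def link_density(html):
    """Return link text length divided by total text length (0.0 if no text)."""
    parser = _LinkDensityParser()
    parser.feed(html)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.link_chars / parser.total_chars


links_block = '<div class="links"><a href="#">Link 1</a><a href="#">Link 2</a></div>'
paragraph = "<p>Good content paragraph that should remain.</p>"
```

With this metric the links block from the `link_heavy_html` fixture scores at the maximum and the plain paragraph at the minimum. That gap is the kind of separation a density-based threshold relies on.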
Author	SHA1	Message	Date
UncleCode	8a4952c128	Update README.md	2024-12-30 21:23:19 +08:00
Robin Singh	78768fd714	Update simple-crawling.md (#379 ) In the comprehensive example, AttributeError: type object 'CacheMode' has no attribute 'ENABLE'. Did you mean: 'ENABLED'?	2024-12-27 17:42:59 +08:00
Haopeng138	bacbeb3ed4	Fix #340 example llm_extraction (#358 ) @Haopeng138 Thank you so much. They are still part of the library. I forgot to update them since I moved the asynchronous versions years ago. I really appreciate it. I have to say that I feel weak in the documentation. That's why I spent a lot of time on it last week. Now, when you mention some of the things in the example folder, I realize I forgot about the example folder. I'll try to update it more. If you find anything else, please help and support. Thank you. I will add your name to contributor name as well.	2024-12-24 19:56:07 +08:00
UncleCode	ed7bc1909c	Bump version to 0.4.22	2024-12-15 19:49:38 +08:00
UncleCode	e9e5b5642d	Fix js_snipprt issue 0.4.21 bump to 0.4.22	2024-12-15 19:49:30 +08:00
UncleCode	7524aa7b5e	Feature: Add Markdown generation to CrawlerRunConfig - Added markdown generator parameter to CrawlerRunConfig in `async_configs.py`. - Implemented logic for Markdown generation in content scraping in `async_webcrawler.py`. - Updated version number to 0.4.21 in `__version__.py`.	2024-12-13 21:51:38 +08:00
UncleCode	7af1d32ef6	Update README for version 0.4.2: Reflect new features and enhancements	2024-12-12 20:18:44 +08:00
UncleCode	399af801a1	Merge branch 'next'	2024-12-12 20:17:27 +08:00
UncleCode	4a72c5ea6e	Add release notes and documentation for version 0.4.2: Configurable Crawlers, Session Management, and Enhanced Screenshot/PDF features	2024-12-12 20:15:50 +08:00
UncleCode	20d6f5fdf4	Merge branch 'main' of https://github.com/unclecode/crawl4ai	2024-12-12 19:58:01 +08:00
UncleCode	3d69715dba	chore: Update .gitignore to include new files and directories	2024-12-12 19:57:59 +08:00
UncleCode	de1766d565	Bump version to 0.4.2	2024-12-12 19:35:30 +08:00
UncleCode	0982c639ae	Enhance AsyncWebCrawler and related configurations - Introduced new configuration classes: BrowserConfig and CrawlerRunConfig. - Refactored AsyncWebCrawler to leverage the new configuration system for cleaner parameter management. - Updated AsyncPlaywrightCrawlerStrategy for better flexibility and reduced legacy parameters. - Improved error handling with detailed context extraction during exceptions. - Enhanced overall maintainability and usability of the web crawler.	2024-12-12 19:35:09 +08:00
UncleCode	5188b7a6a0	Add full-page screenshot and PDF export features - Introduced a new approach for capturing full-page screenshots by exporting them as PDFs first, enhancing reliability and performance. - Added documentation for the feature in `docs/examples/full_page_screenshot_and_pdf_export.md`. - Refactored `perform_completion_with_backoff` in `crawl4ai/utils.py` to include necessary extra parameters. - Updated `quickstart_async.py` to utilize LLM extraction with refined arguments.	2024-12-10 20:59:31 +08:00
lvzhengri	759164831d	Update async_webcrawler.py (#337 ) add @asynccontextmanager	2024-12-10 20:56:52 +08:00
UncleCode	5431fa2d0c	Add PDF & screenshot functionality, new tutorial - Added support for exporting pages as PDFs - Enhanced screenshot functionality for long pages - Created a tutorial on dynamic content loading with 'Load More' buttons. - Updated web crawler to handle PDF data in responses.	2024-12-10 20:10:39 +08:00
UncleCode	e130fd8db9	Implement new async crawler features and stability updates - Introduced new async crawl strategy with session management. - Added BrowserManager for improved browser management. - Enhanced documentation, focusing on storage state and usage examples. - Improved error handling and logging for sessions. - Added JavaScript snippets for customizing navigator properties.	2024-12-10 17:55:29 +08:00
Mohammed	ded554d334	Fixed typo (#324 )	2024-12-09 20:17:43 +08:00
UncleCode	2d31915f0a	Commit Message: Enhance Async Crawler with storage state handling - Updated Async Crawler to support storage state management. - Added error handling for URL validation in Async Web Crawler. - Modified README logo and improved .gitignore entries. - Fixed issues in multiple files for better code robustness.	2024-12-09 20:04:59 +08:00
lu4nx	ba3e808802	fix: The extract method logs output only when self.verbose is set to True. (#314 ) Co-authored-by: lu4nx <lu4nx@lx-pc>	2024-12-09 17:19:26 +08:00
Olavo Henrique Marques Peixoto	e3488da194	fixing Readmen tap (#313 )	2024-12-09 14:34:52 +08:00
UncleCode	740214e021	Merge branch 'next'	2024-12-08 20:06:36 +08:00
UncleCode	c51e901f68	feat: Enhance AsyncPlaywrightCrawlerStrategy with text-only and light modes, dynamic viewport adjustment, and session management

### New Features:
- Text-Only Mode: Added support for text-only crawling by disabling images, JavaScript, GPU, and other non-essential features.
- Light Mode: Optimized browser settings to reduce resource usage and improve efficiency during crawling.
- Dynamic Viewport Adjustment: Automatically adjusts viewport dimensions based on content size, ensuring accurate rendering and scaling.
- Full Page Scanning: Introduced a feature to scroll and capture dynamic content for pages with infinite scroll or lazy-loading elements.
- Session Management: Added `create_session` method for creating and managing browser sessions with unique IDs.

### Improvements:
- Unified viewport handling across contexts by dynamically setting dimensions using `self.viewport_width` and `self.viewport_height`.
- Enhanced logging and error handling for viewport adjustments, page scanning, and content evaluation.
- Reduced resource usage with additional browser flags for both `light_mode` and `text_only` configurations.
- Improved handling of cookies, headers, and proxies in session creation.

### Refactoring:
- Removed hardcoded viewport dimensions and replaced them with dynamic configurations.
- Cleaned up unused and commented-out code for better readability and maintainability.
- Introduced defaults for frequently used parameters like `delay_before_return_html`.

### Fixes:
- Resolved potential inconsistencies in viewport handling.
- Improved robustness of content loading and dynamic adjustments to avoid failures and timeouts.

### Docs Update:
- Updated schema usage in `quickstart_async.py` example:
  - Changed `OpenAIModelFee.schema()` to `OpenAIModelFee.model_json_schema()` for compatibility.
- Enhanced LLM extraction instruction documentation.

This commit introduces significant enhancements to improve efficiency, flexibility, and reliability of the crawler strategy.	2024-12-08 20:04:44 +08:00
UncleCode	8c611dcb4b	Refactored web scraping components - Enhanced the web scraping strategy with new methods for optimized media handling. - Added new utility functions for better content processing. - Refined existing features for improved accuracy and efficiency in scraping tasks. - Introduced more robust filtering criteria for media elements.	2024-12-05 22:33:47 +08:00
UncleCode	a45b8b1eb1	Merge issues with 0.4.0 is over	2024-12-04 20:29:25 +08:00
UncleCode	56f82f3e7f	Merge branch 'next'	2024-12-04 20:27:35 +08:00
UncleCode	486db3a771	Updated to version 0.4.0 with new features - Enhanced error handling in async crawler. - Added flexible options in Markdown generation. - Updated user agent settings for improved reliability. - Reflected changes in documentation and examples.	2024-12-04 20:26:39 +08:00
UncleCode	b02544bc0b	docs: update README and blog for version 0.4.0 release, highlighting new features and improvements	2024-12-03 21:28:52 +08:00
UncleCode	e9639ad189	refactor: improve error handling in DataProcessor and optimize data parsing logic	2024-12-03 19:44:38 +08:00
UncleCode	95a4f74d2a	fix: pass logger to WebScrapingStrategy and update score computation in PruningContentFilter	2024-12-02 20:37:28 +08:00
unclecode	293f299c08	Add PruningContentFilter with unit tests and update documentation - Introduced the PruningContentFilter for better content relevance. - Implemented comprehensive unit tests for verification of functionality. - Enhanced existing BM25ContentFilter tests for edge case coverage. - Updated documentation to include usage examples for new filter.	2024-12-01 19:17:33 +08:00
UncleCode	80d58ad24c	bump version to 0.3.747	2024-11-30 22:00:15 +08:00
UncleCode	3e83893b3f	Enhance User-Agent Handling - Added a new UserAgentGenerator class for generating random User-Agents. - Integrated User-Agent generation in AsyncPlaywrightCrawlerStrategy for randomization. - Enhanced HTTP headers with generated Client Hints.	2024-11-30 18:13:12 +08:00
dvschuyl	1ed7c15118	🩹 Page-evaluate navigation destroyed error (#304 ) Thanks for your contribution and such a nice approach. Now that I think of it, I guess I can make good use of this for some other part of the code. By the way, thank you so much; I will add your name to the new list of contributors.	2024-11-29 21:06:04 +08:00
UncleCode	569bdb6073	Merge branch 'next'	2024-11-29 20:54:28 +08:00
UncleCode	b0419edda6	Update README.md (#300 )	2024-11-29 02:31:17 +08:00