feat(scraper): Enhance URL filtering and scoring systems

Implement comprehensive URL filtering and scoring capabilities:

Filters:
- Add URLPatternFilter with glob/regex support
- Implement ContentTypeFilter with MIME type checking
- Add DomainFilter for domain control
- Create FilterChain with stats tracking

Scorers:
- Complete KeywordRelevanceScorer implementation
- Add PathDepthScorer for URL structure scoring
- Implement ContentTypeScorer for file type priorities
- Add FreshnessScorer for date-based scoring
- Add DomainAuthorityScorer for domain weighting
- Create CompositeScorer for combined strategies

Features:
- Add statistics tracking for both filters and scorers
- Implement logging support throughout
- Add resource cleanup methods
- Create comprehensive documentation
- Include performance optimizations

Tests and docs included.

Note: Review URL normalization overlap with recent crawler changes.
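
For reviewers who want a mental model of how the pieces named above might fit together, here is a minimal sketch. It reuses the class names from this message, but every constructor signature, method name (`apply`, `score`), the `FilterStats` dataclass, and the scoring formulas are assumptions made for illustration; they are not taken from the code in this change.

```python
import fnmatch
import re
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class FilterStats:
    """Hypothetical counters for the 'stats tracking' mentioned above."""
    total: int = 0
    passed: int = 0
    rejected: int = 0


class URLPatternFilter:
    """Accept a URL if it matches any glob string or compiled regex."""

    def __init__(self, patterns):
        # Globs are translated to anchored regexes; compiled regexes use search().
        self._matchers = [
            p.search if isinstance(p, re.Pattern)
            else re.compile(fnmatch.translate(p)).match
            for p in patterns
        ]

    def apply(self, url):
        return any(m(url) for m in self._matchers)


class DomainFilter:
    """Reject blocked hosts; if an allow-list is given, require membership."""

    def __init__(self, allowed=None, blocked=None):
        self.allowed = set(allowed or [])
        self.blocked = set(blocked or [])

    def apply(self, url):
        host = urlparse(url).hostname or ""
        if host in self.blocked:
            return False
        return not self.allowed or host in self.allowed


class FilterChain:
    """Run every filter in order; a URL passes only if all filters accept it."""

    def __init__(self, filters):
        self.filters = list(filters)
        self.stats = FilterStats()

    def apply(self, url):
        self.stats.total += 1
        ok = all(f.apply(url) for f in self.filters)
        if ok:
            self.stats.passed += 1
        else:
            self.stats.rejected += 1
        return ok


class PathDepthScorer:
    """Score 1.0 at the preferred path depth, decaying with distance from it."""

    def __init__(self, optimal_depth=2):
        self.optimal_depth = optimal_depth

    def score(self, url):
        depth = len([seg for seg in urlparse(url).path.split("/") if seg])
        return 1.0 / (1.0 + abs(depth - self.optimal_depth))


class KeywordRelevanceScorer:
    """Fraction of the keywords that appear anywhere in the URL."""

    def __init__(self, keywords):
        self.keywords = [k.lower() for k in keywords]

    def score(self, url):
        if not self.keywords:
            return 0.0
        lowered = url.lower()
        return sum(k in lowered for k in self.keywords) / len(self.keywords)


class CompositeScorer:
    """Weighted average over (scorer, weight) pairs."""

    def __init__(self, weighted_scorers):
        self.weighted_scorers = list(weighted_scorers)

    def score(self, url):
        total_weight = sum(w for _, w in self.weighted_scorers)
        return sum(s.score(url) * w for s, w in self.weighted_scorers) / total_weight
```

In this sketch a crawler would call `FilterChain.apply(url)` to decide whether to enqueue a link and `CompositeScorer.score(url)` to prioritize it. Keeping filters (boolean) separate from scorers (float) is one possible design; the real implementation may differ.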
2024-11-08 19:02:28 +08:00 · 2024-11-08 18:45:12 +08:00 · 2024-11-08 15:57:23 +08:00 · 2024-11-07 18:54:53 +08:00 · 2024-11-06 21:09:47 +08:00 · 2024-11-06 18:44:03 +08:00
275 changed files with 79556 additions and 42087 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -1,12 +0,0 @@
-# Documentation
-*.html linguist-documentation
-docs/* linguist-documentation
-docs/examples/* linguist-documentation
-docs/md_v2/* linguist-documentation
-
-# Explicitly mark Python as the main language
-*.py linguist-detectable=true
-*.py linguist-language=Python
-
-# Exclude HTML from language statistics
-*.html linguist-detectable=false
--- a/.gitignore
+++ b/.gitignore
@@ -199,35 +199,13 @@ test_env/
 **/.DS_Store

 todo.md
-todo_executor.md
 git_changes.py
 git_changes.md
 pypi_build.sh
 git_issues.py
 git_issues.md

-.next/
 .tests/
-# .issues/
-.docs/
 .issues/
-.gitboss/
-todo_executor.md
-protect-all-except-feature.sh
-manage-collab.sh
-publish.sh
-combine.sh
-combined_output.txt
-.local
-.scripts
-tree.md
-tree.md
-.scripts
-.local
-.do
-/plans
-.codeiumignore
-todo/
-
-# windsurf rules
-.windsurfrules
+.docs/
+.issues/
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
--- a/CONTRIBUTORS.md
+++ b/CONTRIBUTORS.md
@@ -10,21 +10,11 @@ We would like to thank the following people for their contributions to Crawl4AI:

 ## Community Contributors

- [aadityakanjolia4](https://github.com/aadityakanjolia4) - Fix for `CustomHTML2Text` is not defined.
 - [FractalMind](https://github.com/FractalMind) - Created the first official Docker Hub image and fixed Dockerfile errors
 - [ketonkss4](https://github.com/ketonkss4) - Identified Selenium's new capabilities, helping reduce dependencies
 - [jonymusky](https://github.com/jonymusky) - Javascript execution documentation, and wait_for
 - [datehoer](https://github.com/datehoer) - Add browser proxy support

-## Pull Requests
-
- [dvschuyl](https://github.com/dvschuyl) - AsyncPlaywrightCrawlerStrategy page-evaluate context destroyed by navigation [#304](https://github.com/unclecode/crawl4ai/pull/304)
- [nelzomal](https://github.com/nelzomal) - Enhance development installation instructions [#286](https://github.com/unclecode/crawl4ai/pull/286)
- [HamzaFarhan](https://github.com/HamzaFarhan) - Handled the cases where markdown_with_citations, references_markdown, and filtered_html might not be defined [#293](https://github.com/unclecode/crawl4ai/pull/293)
- [NanmiCoder](https://github.com/NanmiCoder) - fix: crawler strategy exception handling and fixes [#271](https://github.com/unclecode/crawl4ai/pull/271)
- [paulokuong](https://github.com/paulokuong) - fix: RAWL4_AI_BASE_DIRECTORY should be Path object instead of string [#298](https://github.com/unclecode/crawl4ai/pull/298)
-
-
 ## Other Contributors

 - [Gokhan](https://github.com/gkhngyk) 
--- a/136
+++ b/136
@@ -1,136 +0,0 @@
-# syntax=docker/dockerfile:1.4
-
-ARG TARGETPLATFORM
-ARG BUILDPLATFORM
-
-# Other build arguments
-ARG PYTHON_VERSION=3.10
-
-# Base stage with system dependencies
-FROM python:${PYTHON_VERSION}-slim as base
-
-# Declare ARG variables again within the build stage
-ARG INSTALL_TYPE=all
-ARG ENABLE_GPU=false
-
-# Platform-specific labels
-LABEL maintainer="unclecode"
-LABEL description="🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & scraper"
-LABEL version="1.0"
-
-# Environment setup
-ENV PYTHONUNBUFFERED=1 \
-    PYTHONDONTWRITEBYTECODE=1 \
-    PIP_NO_CACHE_DIR=1 \
-    PIP_DISABLE_PIP_VERSION_CHECK=1 \
-    PIP_DEFAULT_TIMEOUT=100 \
-    DEBIAN_FRONTEND=noninteractive
-
-# Install system dependencies
-RUN apt-get update && apt-get install -y --no-install-recommends \
-    build-essential \
-    curl \
-    wget \
-    gnupg \
-    git \
-    cmake \
-    pkg-config \
-    python3-dev \
-    libjpeg-dev \
-    libpng-dev \
-    && rm -rf /var/lib/apt/lists/*
-
-# Playwright system dependencies for Linux
-RUN apt-get update && apt-get install -y --no-install-recommends \
-    libglib2.0-0 \
-    libnss3 \
-    libnspr4 \
-    libatk1.0-0 \
-    libatk-bridge2.0-0 \
-    libcups2 \
-    libdrm2 \
-    libdbus-1-3 \
-    libxcb1 \
-    libxkbcommon0 \
-    libx11-6 \
-    libxcomposite1 \
-    libxdamage1 \
-    libxext6 \
-    libxfixes3 \
-    libxrandr2 \
-    libgbm1 \
-    libpango-1.0-0 \
-    libcairo2 \
-    libasound2 \
-    libatspi2.0-0 \
-    && rm -rf /var/lib/apt/lists/*
-
-# GPU support if enabled and architecture is supported
-RUN if [ "$ENABLE_GPU" = "true" ] && [ "$TARGETPLATFORM" = "linux/amd64" ] ; then \
-    apt-get update && apt-get install -y --no-install-recommends \
-    nvidia-cuda-toolkit \
-    && rm -rf /var/lib/apt/lists/* ; \
-else \
-    echo "Skipping NVIDIA CUDA Toolkit installation (unsupported platform or GPU disabled)"; \
-fi
-
-# Create and set working directory
-WORKDIR /app
-
-# Copy the entire project
-COPY . .
-
-# Install base requirements
-RUN pip install --no-cache-dir -r requirements.txt
-
-# Install required library for FastAPI
-RUN pip install fastapi uvicorn psutil
-
-# Install ML dependencies first for better layer caching
-RUN if [ "$INSTALL_TYPE" = "all" ] ; then \
-        pip install --no-cache-dir \
-            torch \
-            torchvision \
-            torchaudio \
-            scikit-learn \
-            nltk \
-            transformers \
-            tokenizers && \
-        python -m nltk.downloader punkt stopwords ; \
-    fi
-
-# Install the package
-RUN if [ "$INSTALL_TYPE" = "all" ] ; then \
-        pip install ".[all]" && \
-        python -m crawl4ai.model_loader ; \
-    elif [ "$INSTALL_TYPE" = "torch" ] ; then \
-        pip install ".[torch]" ; \
-    elif [ "$INSTALL_TYPE" = "transformer" ] ; then \
-        pip install ".[transformer]" && \
-        python -m crawl4ai.model_loader ; \
-    else \
-        pip install "." ; \
-    fi
-
-    # Install MkDocs and required plugins
-RUN pip install --no-cache-dir \
-    mkdocs \
-    mkdocs-material \
-    mkdocs-terminal \
-    pymdown-extensions
-
-# Build MkDocs documentation
-RUN mkdocs build
-
-# Install Playwright and browsers
-RUN if [ "$TARGETPLATFORM" = "linux/amd64" ]; then \
-    playwright install chromium; \
-    elif [ "$TARGETPLATFORM" = "linux/arm64" ]; then \
-    playwright install chromium; \
-    fi
-
-# Expose port
-EXPOSE 8000 11235 9222 8080
-
-# Start the FastAPI server
-CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "11235"]
--- a/MANIFEST.in
+++ b/MANIFEST.in
@@ -1,2 +1 @@
-include requirements.txt
-recursive-include crawl4ai/js_snippet *.js
+include requirements.txt
--- a/MISSION.md
+++ b/MISSION.md
@@ -1,46 +0,0 @@
-# Mission
-
-![Mission Diagram](./docs/assets/pitch-dark.svg)
-
-### 1. The Data Capitalization Opportunity
-
-We live in an unprecedented era of digital wealth creation. Every day, individuals and enterprises generate massive amounts of valuable digital footprints across various platforms, social media channels, messenger apps, and cloud services. While people can interact with their data within these platforms, there's an immense untapped opportunity to transform this data into true capital assets. Just as physical property became a foundational element of wealth creation, personal and enterprise data has the potential to become a new form of capital on balance sheets.
-
-For individuals, this represents an opportunity to transform their digital activities into valuable assets. For enterprises, their internal communications, team discussions, and collaborative documents contain rich insights that could be structured and valued as intellectual capital. This wealth of information represents an unprecedented opportunity for value creation in the digital age.
-
-### 2. The Potential of Authentic Data
-
-While synthetic data has played a crucial role in AI development, there's an enormous untapped potential in the authentic data generated by individuals and organizations. Every message, document, and interaction contains unique insights and patterns that could enhance AI development. The challenge isn't a lack of data - it's that most authentic human-generated data remains inaccessible for productive use.
-
-By enabling willing participation in data sharing, we can unlock this vast reservoir of authentic human knowledge. This represents an opportunity to enhance AI development with diverse, real-world data that reflects the full spectrum of human experience and knowledge.
-
-## Our Pathway to Data Democracy
-
-### 1. Open-Source Foundation
-
-Our first step is creating an open-source data extraction engine that empowers developers and innovators to build tools for data structuring and organization. This foundation ensures transparency, security, and community-driven development. By making these tools openly available, we enable the technical infrastructure needed for true data ownership and capitalization.
-
-### 2. Data Capitalization Platform
-
-Building on this open-source foundation, we're developing a platform that helps individuals and enterprises transform their digital footprints into structured, valuable assets. This platform will provide the tools and frameworks needed to organize, understand, and value personal and organizational data as true capital assets.
-
-### 3. Creating a Data Marketplace
-
-The final piece is establishing a marketplace where individuals and organizations can willingly share their data assets. This creates opportunities for:
- Individuals to earn equity, revenue, or other forms of value from their data
- Enterprises to access diverse, high-quality data for AI development
- Researchers to work with authentic human-generated data
- Startups to build innovative solutions using real-world data
-
-## Economic Vision: A Shared Data Economy
-
-We envision a future where data becomes a fundamental asset class in a thriving shared economy. This transformation will democratize AI development by enabling willing participation in data sharing, ensuring that the benefits of AI advancement flow back to data creators. Just as property rights revolutionized economic systems, establishing data as a capital asset will create new opportunities for wealth creation and economic participation.
-
-This shared data economy will:
- Enable individuals to capitalize on their digital footprints
- Create new revenue streams for data creators
- Provide AI developers with access to diverse, authentic data
- Foster innovation through broader access to real-world data
- Ensure more equitable distribution of AI's economic benefits
-
-Our vision is to facilitate this transformation from the ground up - starting with open-source tools, progressing to data capitalization platforms, and ultimately creating a thriving marketplace where data becomes a true asset class in a shared economy. This approach ensures that the future of AI is built on a foundation of authentic human knowledge, with benefits flowing back to the individuals and organizations who create and share their valuable data.
--- a/README.md
+++ b/README.md
@@ -1,78 +1,157 @@
-# 🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper.
-
-<div align="center">
-
-<a href="https://trendshift.io/repositories/11716" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11716" alt="unclecode%2Fcrawl4ai | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
+# Crawl4AI (Async Version) 🕷️🤖

 [![GitHub Stars](https://img.shields.io/github/stars/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/stargazers)
 [![GitHub Forks](https://img.shields.io/github/forks/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/network/members)
-
-[![PyPI version](https://badge.fury.io/py/crawl4ai.svg)](https://badge.fury.io/py/crawl4ai)
-[![Python Version](https://img.shields.io/pypi/pyversions/crawl4ai)](https://pypi.org/project/crawl4ai/)
-[![Downloads](https://static.pepy.tech/badge/crawl4ai/month)](https://pepy.tech/project/crawl4ai)
-
-<!-- [![Documentation Status](https://readthedocs.org/projects/crawl4ai/badge/?version=latest)](https://crawl4ai.readthedocs.io/) -->
+[![GitHub Issues](https://img.shields.io/github/issues/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/issues)
+[![GitHub Pull Requests](https://img.shields.io/github/issues-pr/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/pulls)
 [![License](https://img.shields.io/github/license/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)
-[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
-[![Security: bandit](https://img.shields.io/badge/security-bandit-yellow.svg)](https://github.com/PyCQA/bandit)

-</div>
+Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. 🆓🌐

-Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.  
+> Looking for the synchronous version? Check out [README.sync.md](./README.sync.md). You can also access the previous version in the branch [V0.2.76](https://github.com/unclecode/crawl4ai/blob/v0.2.76).

-[✨ Check out latest update v0.4.3b1x](#-recent-updates)
+## New update 0.3.6
+- 🌐 Multi-browser support (Chromium, Firefox, WebKit)
+- 🖼️ Improved image processing with lazy-loading detection
+- 🔧 Custom page timeout parameter for better control over crawling behavior
+- 🕰️ Enhanced handling of delayed content loading
+- 🔑 Custom headers support for LLM interactions
+- 🖼️ iframe content extraction for comprehensive page analysis
+- ⏱️ Flexible timeout and delayed content retrieval options

-🎉 **Version 0.4.3b1 is out!** This release brings exciting new features like a Memory Dispatcher System, Streaming Support, LLM-Powered Markdown Generation, Schema Generation, and Robots.txt Compliance! [Read the release notes →](https://docs.crawl4ai.com/blog)
+## Try it Now!

-<details>
-<summary>🤓 <strong>My Personal Story</strong></summary>
+✨ Play around with this [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1REChY6fXQf-EaVYLv0eHEWvzlYxGm0pd?usp=sharing)

-My journey with computers started in childhood when my dad, a computer scientist, introduced me to an Amstrad computer. Those early days sparked a fascination with technology, leading me to pursue computer science and specialize in NLP during my postgraduate studies. It was during this time that I first delved into web crawling, building tools to help researchers organize papers and extract information from publications a challenging yet rewarding experience that honed my skills in data extraction.
+✨ Visit our [Documentation Website](https://crawl4ai.com/mkdocs/)

-Fast forward to 2023, I was working on a tool for a project and needed a crawler to convert a webpage into markdown. While exploring solutions, I found one that claimed to be open-source but required creating an account and generating an API token. Worse, it turned out to be a SaaS model charging $16, and its quality didn’t meet my standards. Frustrated, I realized this was a deeper problem. That frustration turned into turbo anger mode, and I decided to build my own solution. In just a few days, I created Crawl4AI. To my surprise, it went viral, earning thousands of GitHub stars and resonating with a global community.
+## Features ✨

-I made Crawl4AI open-source for two reasons. First, it’s my way of giving back to the open-source community that has supported me throughout my career. Second, I believe data should be accessible to everyone, not locked behind paywalls or monopolized by a few. Open access to data lays the foundation for the democratization of AI—a vision where individuals can train their own models and take ownership of their information. This library is the first step in a larger journey to create the best open-source data extraction and generation tool the world has ever seen, built collaboratively by a passionate community.
+- 🆓 Completely free and open-source
+- 🚀 Blazing fast performance, outperforming many paid services
+- 🤖 LLM-friendly output formats (JSON, cleaned HTML, markdown)
+- 🌍 Supports crawling multiple URLs simultaneously
+- 🎨 Extracts and returns all media tags (Images, Audio, and Video)
+- 🔗 Extracts all external and internal links
+- 📚 Extracts metadata from the page
+- 🔄 Custom hooks for authentication, headers, and page modifications before crawling
+- 🕵️ User-agent customization
+- 🖼️ Takes screenshots of the page
+- 📜 Executes multiple custom JavaScripts before crawling
+- 📊 Generates structured output without LLM using JsonCssExtractionStrategy
+- 📚 Various chunking strategies: topic-based, regex, sentence, and more
+- 🧠 Advanced extraction strategies: cosine clustering, LLM, and more
+- 🎯 CSS selector support for precise data extraction
+- 📝 Passes instructions/keywords to refine extraction
+- 🔒 Proxy support for enhanced privacy and access
+- 🔄 Session management for complex multi-page crawling scenarios
+- 🌐 Asynchronous architecture for improved performance and scalability

-Thank you to everyone who has supported this project, used it, and shared feedback. Your encouragement motivates me to dream even bigger. Join us, file issues, submit PRs, or spread the word. Together, we can build a tool that truly empowers people to access their own data and reshape the future of AI.
-</details>
+## Installation 🛠️

-## 🧐 Why Crawl4AI?
+Crawl4AI offers flexible installation options to suit various use cases. You can install it as a Python package or use Docker.

-1. **Built for LLMs**: Creates smart, concise Markdown optimized for RAG and fine-tuning applications.  
-2. **Lightning Fast**: Delivers results 6x faster with real-time, cost-efficient performance.  
-3. **Flexible Browser Control**: Offers session management, proxies, and custom hooks for seamless data access.  
-4. **Heuristic Intelligence**: Uses advanced algorithms for efficient extraction, reducing reliance on costly models.  
-5. **Open Source & Deployable**: Fully open-source with no API keys—ready for Docker and cloud integration.  
-6. **Thriving Community**: Actively maintained by a vibrant community and the #1 trending GitHub repository.
+### Using pip 🐍

-## 🚀 Quick Start 
+Choose the installation option that best fits your needs:
+
+#### Basic Installation
+
+For basic web crawling and scraping tasks:

-1. Install Crawl4AI:
 ```bash
-# Install the package
-pip install -U crawl4ai
-
-# Run post-installation setup
-crawl4ai-setup
-
-# Verify your installation
-crawl4ai-doctor
+pip install crawl4ai
 ```

-If you encounter any browser-related issues, you can install them manually:
+By default, this will install the asynchronous version of Crawl4AI, using Playwright for web crawling.
+
+👉 Note: When you install Crawl4AI, the setup script should automatically install and set up Playwright. However, if you encounter any Playwright-related errors, you can manually install it using one of these methods:
+
+1. Through the command line:
+   ```bash
+   playwright install
+   ```
+
+2. If the above doesn't work, try this more specific command:
+   ```bash
+   python -m playwright install chromium
+   ```
+
+This second method has proven to be more reliable in some cases.
+
+#### Installation with Synchronous Version
+
+If you need the synchronous version using Selenium:
+
 ```bash
-python -m playwright install --with-deps chromium
+pip install crawl4ai[sync]
 ```

-2. Run a simple web crawl:
+#### Development Installation
+
+For contributors who plan to modify the source code:
+
+```bash
+git clone https://github.com/unclecode/crawl4ai.git
+cd crawl4ai
+pip install -e .
+```
+
+### Using Docker 🐳
+
+We're in the process of creating Docker images and pushing them to Docker Hub. This will provide an easy way to run Crawl4AI in a containerized environment. Stay tuned for updates!
+
+For more detailed installation instructions and options, please refer to our [Installation Guide](https://crawl4ai.com/mkdocs/installation).
+
+## Quick Start 🚀
+
 ```python
 import asyncio
-from crawl4ai import *
+from crawl4ai import AsyncWebCrawler

 async def main():
-    async with AsyncWebCrawler() as crawler:
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        result = await crawler.arun(url="https://www.nbcnews.com/business")
+        print(result.markdown)
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+## Advanced Usage 🔬
+
+### Executing JavaScript and Using CSS Selectors
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler
+
+async def main():
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        js_code = ["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"]
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
+            js_code=js_code,
+            css_selector=".wide-tease-item__description",
+            bypass_cache=True
+        )
+        print(result.extracted_content)
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+### Using a Proxy
+
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler
+
+async def main():
+    async with AsyncWebCrawler(verbose=True, proxy="http://127.0.0.1:7890") as crawler:
+        result = await crawler.arun(
+            url="https://www.nbcnews.com/business",
+            bypass_cache=True
        )
        print(result.markdown)

@@ -80,330 +159,86 @@ if __name__ == "__main__":
    asyncio.run(main())
 ```

-## ✨ Features 
+### Extracting Structured Data without LLM

-<details>
-<summary>📝 <strong>Markdown Generation</strong></summary>
-
- 🧹 **Clean Markdown**: Generates clean, structured Markdown with accurate formatting.
- 🎯 **Fit Markdown**: Heuristic-based filtering to remove noise and irrelevant parts for AI-friendly processing.
- 🔗 **Citations and References**: Converts page links into a numbered reference list with clean citations.
- 🛠️ **Custom Strategies**: Users can create their own Markdown generation strategies tailored to specific needs.
- 📚 **BM25 Algorithm**: Employs BM25-based filtering for extracting core information and removing irrelevant content. 
-</details>
-
-<details>
-<summary>📊 <strong>Structured Data Extraction</strong></summary>
-
- 🤖 **LLM-Driven Extraction**: Supports all LLMs (open-source and proprietary) for structured data extraction.
- 🧱 **Chunking Strategies**: Implements chunking (topic-based, regex, sentence-level) for targeted content processing.
- 🌌 **Cosine Similarity**: Find relevant content chunks based on user queries for semantic extraction.
- 🔎 **CSS-Based Extraction**: Fast schema-based data extraction using XPath and CSS selectors.
- 🔧 **Schema Definition**: Define custom schemas for extracting structured JSON from repetitive patterns.
-
-</details>
-
-<details>
-<summary>🌐 <strong>Browser Integration</strong></summary>
-
- 🖥️ **Managed Browser**: Use user-owned browsers with full control, avoiding bot detection.
- 🔄 **Remote Browser Control**: Connect to Chrome Developer Tools Protocol for remote, large-scale data extraction.
- 🔒 **Session Management**: Preserve browser states and reuse them for multi-step crawling.
- 🧩 **Proxy Support**: Seamlessly connect to proxies with authentication for secure access.
- ⚙️ **Full Browser Control**: Modify headers, cookies, user agents, and more for tailored crawling setups.
- 🌍 **Multi-Browser Support**: Compatible with Chromium, Firefox, and WebKit.
- 📐 **Dynamic Viewport Adjustment**: Automatically adjusts the browser viewport to match page content, ensuring complete rendering and capturing of all elements.
-
-</details>
-
-<details>
-<summary>🔎 <strong>Crawling & Scraping</strong></summary>
-
- 🖼️ **Media Support**: Extract images, audio, videos, and responsive image formats like `srcset` and `picture`.
- 🚀 **Dynamic Crawling**: Execute JS and wait for async or sync for dynamic content extraction.
- 📸 **Screenshots**: Capture page screenshots during crawling for debugging or analysis.
- 📂 **Raw Data Crawling**: Directly process raw HTML (`raw:`) or local files (`file://`).
- 🔗 **Comprehensive Link Extraction**: Extracts internal, external links, and embedded iframe content.
- 🛠️ **Customizable Hooks**: Define hooks at every step to customize crawling behavior.
- 💾 **Caching**: Cache data for improved speed and to avoid redundant fetches.
- 📄 **Metadata Extraction**: Retrieve structured metadata from web pages.
- 📡 **IFrame Content Extraction**: Seamless extraction from embedded iframe content.
- 🕵️ **Lazy Load Handling**: Waits for images to fully load, ensuring no content is missed due to lazy loading.
- 🔄 **Full-Page Scanning**: Simulates scrolling to load and capture all dynamic content, perfect for infinite scroll pages.
-
-</details>
-
-<details>
-<summary>🚀 <strong>Deployment</strong></summary>
-
- 🐳 **Dockerized Setup**: Optimized Docker image with API server for easy deployment.
- 🔄 **API Gateway**: One-click deployment with secure token authentication for API-based workflows.
- 🌐 **Scalable Architecture**: Designed for mass-scale production and optimized server performance.
- ⚙️ **DigitalOcean Deployment**: Ready-to-deploy configurations for DigitalOcean and similar platforms.
-
-</details>
-
-<details>
-<summary>🎯 <strong>Additional Features</strong></summary>
-
- 🕶️ **Stealth Mode**: Avoid bot detection by mimicking real users.
- 🏷️ **Tag-Based Content Extraction**: Refine crawling based on custom tags, headers, or metadata.
- 🔗 **Link Analysis**: Extract and analyze all links for detailed data exploration.
- 🛡️ **Error Handling**: Robust error management for seamless execution.
- 🔐 **CORS & Static Serving**: Supports filesystem-based caching and cross-origin requests.
- 📖 **Clear Documentation**: Simplified and updated guides for onboarding and advanced usage.
- 🙌 **Community Recognition**: Acknowledges contributors and pull requests for transparency.
-
-</details>
-
-## Try it Now!
-
-✨ Play around with this [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)
-
-✨ Visit our [Documentation Website](https://docs.crawl4ai.com/)
-
-## Installation 🛠️
-
-Crawl4AI offers flexible installation options to suit various use cases. You can install it as a Python package or use Docker.
-
-<details>
-<summary>🐍 <strong>Using pip</strong></summary>
-
-Choose the installation option that best fits your needs:
-
-### Basic Installation
-
-For basic web crawling and scraping tasks:
-
-```bash
-pip install crawl4ai
-crawl4ai-setup # Setup the browser
-```
-
-By default, this will install the asynchronous version of Crawl4AI, using Playwright for web crawling.
-
-👉 **Note**: When you install Crawl4AI, the `crawl4ai-setup` should automatically install and set up Playwright. However, if you encounter any Playwright-related errors, you can manually install it using one of these methods:
-
-1. Through the command line:
-
-   ```bash
-   playwright install
-   ```
-
-2. If the above doesn't work, try this more specific command:
-
-   ```bash
-   python -m playwright install chromium
-   ```
-
-This second method has proven to be more reliable in some cases.
-
---
-
-### Installation with Synchronous Version
-
-The sync version is deprecated and will be removed in future versions. If you need the synchronous version using Selenium:
-
-```bash
-pip install crawl4ai[sync]
-```
-
---
-
-### Development Installation
-
-For contributors who plan to modify the source code:
-
-```bash
-git clone https://github.com/unclecode/crawl4ai.git
-cd crawl4ai
-pip install -e .                    # Basic installation in editable mode
-```
-
-Install optional features:
-
-```bash
-pip install -e ".[torch]"           # With PyTorch features
-pip install -e ".[transformer]"     # With Transformer features
-pip install -e ".[cosine]"          # With cosine similarity features
-pip install -e ".[sync]"            # With synchronous crawling (Selenium)
-pip install -e ".[all]"             # Install all optional features
-```
-
-</details>
-
-<details>
-<summary>🐳 <strong>Docker Deployment</strong></summary>
-
-> 🚀 **Major Changes Coming!** We're developing a completely new Docker implementation that will make deployment even more efficient and seamless. The current Docker setup is being deprecated in favor of this new solution.
-
-### Current Docker Support
-
-The existing Docker implementation is being deprecated and will be replaced soon. If you still need to use Docker with the current version:
-
- 📚 [Deprecated Docker Setup](./docs/deprecated/docker-deployment.md) - Instructions for the current Docker implementation
- ⚠️ Note: This setup will be replaced in the next major release
-
-### What's Coming Next?
-
-Our new Docker implementation will bring:
- Improved performance and resource efficiency
- Streamlined deployment process
- Better integration with Crawl4AI features
- Enhanced scalability options
-
-Stay connected with our [GitHub repository](https://github.com/unclecode/crawl4ai) for updates!
-
-</details>
-
---
-
-### Quick Test
-
-Run a quick test (works for both Docker options):
-
-```python
-import requests
-
-# Submit a crawl job
-response = requests.post(
-    "http://localhost:11235/crawl",
-    json={"urls": "https://example.com", "priority": 10}
-)
-task_id = response.json()["task_id"]
-
-# Continue polling until the task is complete (status="completed")
-result = requests.get(f"http://localhost:11235/task/{task_id}")
-```
-
-For more examples, see our [Docker Examples](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_example.py). For advanced configuration, environment variables, and usage examples, see our [Docker Deployment Guide](https://docs.crawl4ai.com/basic/docker-deployment/).
-
-</details>
-
-
-## 🔬 Advanced Usage Examples 🔬
-
-You can check the project structure in the directory [https://github.com/unclecode/crawl4ai/docs/examples](docs/examples). Over there, you can find a variety of examples; here, some popular examples are shared.
-
-<details>
-<summary>📝 <strong>Heuristic Markdown Generation with Clean and Fit Markdown</strong></summary>
+The `JsonCssExtractionStrategy` allows for precise extraction of structured data from web pages using CSS selectors.

 ```python
 import asyncio
-from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
-from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
-from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
-
-async def main():
-    browser_config = BrowserConfig(
-        headless=True,  
-        verbose=True,
-    )
-    run_config = CrawlerRunConfig(
-        cache_mode=CacheMode.ENABLED,
-        markdown_generator=DefaultMarkdownGenerator(
-            content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
-        ),
-        # markdown_generator=DefaultMarkdownGenerator(
-        #     content_filter=BM25ContentFilter(user_query="WHEN_WE_FOCUS_BASED_ON_A_USER_QUERY", bm25_threshold=1.0)
-        # ),
-    )
-    
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        result = await crawler.arun(
-            url="https://docs.micronaut.io/4.7.6/guide/",
-            config=run_config
-        )
-        print(len(result.markdown))
-        print(len(result.fit_markdown))
-        print(len(result.markdown_v2.fit_markdown))
-
-if __name__ == "__main__":
-    asyncio.run(main())
-```
-
-</details>
-
-<details>
-<summary>🖥️ <strong>Executing JavaScript & Extract Structured Data without LLMs</strong></summary>
-
-```python
-import asyncio
-from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
 import json
+from crawl4ai import AsyncWebCrawler
+from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

-async def main():
+async def extract_news_teasers():
    schema = {
-    "name": "KidoCode Courses",
-    "baseSelector": "section.charge-methodology .w-tab-content > div",
-    "fields": [
-        {
-            "name": "section_title",
-            "selector": "h3.heading-50",
-            "type": "text",
-        },
-        {
-            "name": "section_description",
-            "selector": ".charge-content",
-            "type": "text",
-        },
-        {
-            "name": "course_name",
-            "selector": ".text-block-93",
-            "type": "text",
-        },
-        {
-            "name": "course_description",
-            "selector": ".course-content-text",
-            "type": "text",
-        },
-        {
-            "name": "course_icon",
-            "selector": ".image-92",
-            "type": "attribute",
-            "attribute": "src"
-        }
+        "name": "News Teaser Extractor",
+        "baseSelector": ".wide-tease-item__wrapper",
+        "fields": [
+            {
+                "name": "category",
+                "selector": ".unibrow span[data-testid='unibrow-text']",
+                "type": "text",
+            },
+            {
+                "name": "headline",
+                "selector": ".wide-tease-item__headline",
+                "type": "text",
+            },
+            {
+                "name": "summary",
+                "selector": ".wide-tease-item__description",
+                "type": "text",
+            },
+            {
+                "name": "time",
+                "selector": "[data-testid='wide-tease-date']",
+                "type": "text",
+            },
+            {
+                "name": "image",
+                "type": "nested",
+                "selector": "picture.teasePicture img",
+                "fields": [
+                    {"name": "src", "type": "attribute", "attribute": "src"},
+                    {"name": "alt", "type": "attribute", "attribute": "alt"},
+                ],
+            },
+            {
+                "name": "link",
+                "selector": "a[href]",
+                "type": "attribute",
+                "attribute": "href",
+            },
+        ],
    }
-}

    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

-    browser_config = BrowserConfig(
-        headless=False,
-        verbose=True
-    )
-    run_config = CrawlerRunConfig(
-        extraction_strategy=extraction_strategy,
-        js_code=["""(async () => {const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");for(let tab of tabs) {tab.scrollIntoView();tab.click();await new Promise(r => setTimeout(r, 500));}})();"""],
-        cache_mode=CacheMode.BYPASS
-    )
-        
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        
+    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
-            url="https://www.kidocode.com/degrees/technology",
-            config=run_config
+            url="https://www.nbcnews.com/business",
+            extraction_strategy=extraction_strategy,
+            bypass_cache=True,
        )

-        companies = json.loads(result.extracted_content)
-        print(f"Successfully extracted {len(companies)} companies")
-        print(json.dumps(companies[0], indent=2))
+        assert result.success, "Failed to crawl the page"

+        news_teasers = json.loads(result.extracted_content)
+        print(f"Successfully extracted {len(news_teasers)} news teasers")
+        print(json.dumps(news_teasers[0], indent=2))

 if __name__ == "__main__":
-    asyncio.run(main())
+    asyncio.run(extract_news_teasers())
 ```

-</details>
+For more advanced usage examples, check out our [Examples](https://crawl4ai.com/mkdocs/full_details/advanced_jsoncss_extraction.md) section in the documentation.

-<details>
-<summary>📚 <strong>Extracting Structured Data with LLMs</strong></summary>
+### Extracting Structured Data with OpenAI

 ```python
 import os
 import asyncio
-from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
+from crawl4ai import AsyncWebCrawler
 from crawl4ai.extraction_strategy import LLMExtractionStrategy
 from pydantic import BaseModel, Field

@@ -413,26 +248,19 @@ class OpenAIModelFee(BaseModel):
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

 async def main():
-    browser_config = BrowserConfig(verbose=True)
-    run_config = CrawlerRunConfig(
-        word_count_threshold=1,
-        extraction_strategy=LLMExtractionStrategy(
-            # Here you can use any provider that Litellm library supports, for instance: ollama/qwen2
-            # provider="ollama/qwen2", api_token="no-token", 
-            provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'), 
-            schema=OpenAIModelFee.schema(),
-            extraction_type="schema",
-            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. 
-            Do not miss any models in the entire content. One extracted model JSON format should look like this: 
-            {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
-        ),            
-        cache_mode=CacheMode.BYPASS,
-    )
-    
-    async with AsyncWebCrawler(config=browser_config) as crawler:
+    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url='https://openai.com/api/pricing/',
-            config=run_config
+            word_count_threshold=1,
+            extraction_strategy=LLMExtractionStrategy(
+                provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'), 
+                schema=OpenAIModelFee.schema(),
+                extraction_type="schema",
+                instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. 
+                Do not miss any models in the entire content. One extracted model JSON format should look like this: 
+                {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
+            ),            
+            bypass_cache=True,
        )
        print(result.extracted_content)

@@ -440,143 +268,117 @@ if __name__ == "__main__":
    asyncio.run(main())
 ```

-</details>
+### Session Management and Dynamic Content Crawling

-<details>
-<summary>🤖 <strong>Using You own Browswer with Custome User Profile</strong></summary>
+Crawl4AI excels at handling complex scenarios, such as crawling multiple pages with dynamic content loaded via JavaScript. Here's an example of crawling GitHub commits across multiple pages:

 ```python
-import os, sys
-from pathlib import Path
-import asyncio, time
-from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
+import asyncio
+import re
+from bs4 import BeautifulSoup
+from crawl4ai import AsyncWebCrawler

-async def test_news_crawl():
-    # Create a persistent user data directory
-    user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
-    os.makedirs(user_data_dir, exist_ok=True)
+async def crawl_typescript_commits():
+    first_commit = ""
+    async def on_execution_started(page):
+        nonlocal first_commit 
+        try:
+            while True:
+                await page.wait_for_selector('li.Box-sc-g0xbh4-0 h4')
+                commit = await page.query_selector('li.Box-sc-g0xbh4-0 h4')
+                commit = await commit.evaluate('(element) => element.textContent')
+                commit = re.sub(r'\s+', '', commit)
+                if commit and commit != first_commit:
+                    first_commit = commit
+                    break
+                await asyncio.sleep(0.5)
+        except Exception as e:
+            print(f"Warning: New content didn't appear after JavaScript execution: {e}")

-    browser_config = BrowserConfig(
-        verbose=True,
-        headless=True,
-        user_data_dir=user_data_dir,
-        use_persistent_context=True,
-    )
-    run_config = CrawlerRunConfig(
-        cache_mode=CacheMode.BYPASS
-    )
-    
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        url = "ADDRESS_OF_A_CHALLENGING_WEBSITE"
-        
-        result = await crawler.arun(
-            url,
-            config=run_config,
-            magic=True,
-        )
-        
-        print(f"Successfully crawled {url}")
-        print(f"Content length: {len(result.markdown)}")
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        crawler.crawler_strategy.set_hook('on_execution_started', on_execution_started)
+
+        url = "https://github.com/microsoft/TypeScript/commits/main"
+        session_id = "typescript_commits_session"
+        all_commits = []
+
+        js_next_page = """
+        const button = document.querySelector('a[data-testid="pagination-next-button"]');
+        if (button) button.click();
+        """
+
+        for page in range(3):  # Crawl 3 pages
+            result = await crawler.arun(
+                url=url,
+                session_id=session_id,
+                css_selector="li.Box-sc-g0xbh4-0",
+                js=js_next_page if page > 0 else None,
+                bypass_cache=True,
+                js_only=page > 0
+            )
+
+            assert result.success, f"Failed to crawl page {page + 1}"
+
+            soup = BeautifulSoup(result.cleaned_html, 'html.parser')
+            commits = soup.select("li")
+            all_commits.extend(commits)
+
+            print(f"Page {page + 1}: Found {len(commits)} commits")
+
+        await crawler.crawler_strategy.kill_session(session_id)
+        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
+
+if __name__ == "__main__":
+    asyncio.run(crawl_typescript_commits())
 ```

-</details>
+This example demonstrates Crawl4AI's ability to handle complex scenarios where content is loaded asynchronously. It crawls multiple pages of GitHub commits, executing JavaScript to load new content and using custom hooks to ensure data is loaded before proceeding.

-## ✨ Recent Updates
-
-   **🚀 New Dispatcher System**: Scale to thousands of URLs with intelligent **memory monitoring**, **concurrency control**, and optional **rate limiting**. (See `MemoryAdaptiveDispatcher`, `SemaphoreDispatcher`, `RateLimiter`, `CrawlerMonitor`)
-   **⚡ Streaming Mode**: Process results **as they arrive** instead of waiting for an entire batch to complete. (Set `stream=True` in `CrawlerRunConfig`)
-   **🤖 Enhanced LLM Integration**:
-    -   **Automatic schema generation**: Create extraction rules from HTML using OpenAI or Ollama, no manual CSS/XPath needed.
-    -   **LLM-powered Markdown filtering**: Refine your markdown output with a new `LLMContentFilter` that understands content relevance.
-    -   **Ollama Support**: Use open-source or self-hosted models for private or cost-effective extraction.
-   **🏎️ Faster Scraping Option**: New `LXMLWebScrapingStrategy` offers **10-20x speedup** for large, complex pages (experimental).
-   **🤖 robots.txt Compliance**: Respect website rules with `check_robots_txt=True` and efficient local caching.
-   **➡️ URL Redirection Tracking**: The `final_url` field now captures the final destination after any redirects.
-   **🪞 Improved Mirroring**: The `LXMLWebScrapingStrategy` now has much greater fidelity, allowing for almost pixel-perfect mirroring of websites.
-   **📈 Enhanced Monitoring**: Track memory, CPU, and individual crawler status with `CrawlerMonitor`.
-   **📝 Improved Documentation**: More examples, clearer explanations, and updated tutorials.
-
-Read the full details in our [0.4.248 Release Notes](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
-
-Here's a clear markdown explanation for your users about version numbering:
+For more advanced usage examples, check out our [Examples](https://crawl4ai.com/mkdocs/full_details/session_based_crawling.md) section in the documentation.


-## Version Numbering in Crawl4AI
+## Speed Comparison 🚀

-Crawl4AI follows standard Python version numbering conventions (PEP 440) to help users understand the stability and features of each release.
+Crawl4AI is designed with speed as a primary focus. Our goal is to provide the fastest possible response with high-quality data extraction, minimizing abstractions between the data and the user.

-### Version Numbers Explained
+We've conducted a speed comparison between Crawl4AI and Firecrawl, a paid service. The results demonstrate Crawl4AI's superior performance:

-Our version numbers follow this pattern: `MAJOR.MINOR.PATCH` (e.g., 0.4.3)
+```
+Firecrawl:
+Time taken: 7.02 seconds
+Content length: 42074 characters
+Images found: 49

-#### Pre-release Versions
-We use different suffixes to indicate development stages:
+Crawl4AI (simple crawl):
+Time taken: 1.60 seconds
+Content length: 18238 characters
+Images found: 49

- `dev` (0.4.3dev1): Development versions, unstable
- `a` (0.4.3a1): Alpha releases, experimental features
- `b` (0.4.3b1): Beta releases, feature complete but needs testing
- `rc` (0.4.3rc1): Release candidates, potential final version
+Crawl4AI (with JavaScript execution):
+Time taken: 4.64 seconds
+Content length: 40869 characters
+Images found: 89
+```

-#### Installation
- Regular installation (stable version):
-  ```bash
-  pip install -U crawl4ai
-  ```
+As you can see, Crawl4AI outperforms Firecrawl significantly:
+- Simple crawl: Crawl4AI is over 4 times faster than Firecrawl.
+- With JavaScript execution: Even when executing JavaScript to load more content (nearly doubling the number of images found, from 49 to 89), Crawl4AI is still faster than Firecrawl's simple crawl.

- Install pre-release versions:
-  ```bash
-  pip install crawl4ai --pre
-  ```
+You can find the full comparison code in our repository at `docs/examples/crawl4ai_vs_firecrawl.py`.

- Install specific version:
-  ```bash
-  pip install crawl4ai==0.4.3b1
-  ```
+## Documentation 📚

-#### Why Pre-releases?
-We use pre-releases to:
- Test new features in real-world scenarios
- Gather feedback before final releases
- Ensure stability for production users
- Allow early adopters to try new features
+For detailed documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://crawl4ai.com/mkdocs/).

-For production environments, we recommend using the stable version. For testing new features, you can opt-in to pre-releases using the `--pre` flag.
-
-## 📖 Documentation & Roadmap 
-
-> 🚨 **Documentation Update Alert**: We're undertaking a major documentation overhaul next week to reflect recent updates and improvements. Stay tuned for a more comprehensive and up-to-date guide!
-
-For current documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://docs.crawl4ai.com/).
-
-To check our development plans and upcoming features, visit our [Roadmap](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md).
-
-<details>
-<summary>📈 <strong>Development TODOs</strong></summary>
-
- [x] 0. Graph Crawler: Smart website traversal using graph search algorithms for comprehensive nested page extraction
- [ ] 1. Question-Based Crawler: Natural language driven web discovery and content extraction
- [ ] 2. Knowledge-Optimal Crawler: Smart crawling that maximizes knowledge while minimizing data extraction
- [ ] 3. Agentic Crawler: Autonomous system for complex multi-step crawling operations
- [ ] 4. Automated Schema Generator: Convert natural language to extraction schemas
- [ ] 5. Domain-Specific Scrapers: Pre-configured extractors for common platforms (academic, e-commerce)
- [ ] 6. Web Embedding Index: Semantic search infrastructure for crawled content
- [ ] 7. Interactive Playground: Web UI for testing, comparing strategies with AI assistance
- [ ] 8. Performance Monitor: Real-time insights into crawler operations
- [ ] 9. Cloud Integration: One-click deployment solutions across cloud providers
- [ ] 10. Sponsorship Program: Structured support system with tiered benefits
- [ ] 11. Educational Content: "How to Crawl" video series and interactive tutorials
-
-</details>
-
-## 🤝 Contributing 
+## Contributing 🤝

 We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTING.md) for more information.

-## 📄 License 
+## License 📄

 Crawl4AI is released under the [Apache 2.0 License](https://github.com/unclecode/crawl4ai/blob/main/LICENSE).

-## 📧 Contact 
+## Contact 📧

 For questions, suggestions, or feedback, feel free to reach out:

@@ -586,31 +388,6 @@ For questions, suggestions, or feedback, feel free to reach out:

 Happy Crawling! 🕸️🚀

-## 🗾 Mission
-
-Our mission is to unlock the value of personal and enterprise data by transforming digital footprints into structured, tradeable assets. Crawl4AI empowers individuals and organizations with open-source tools to extract and structure data, fostering a shared data economy.  
-
-We envision a future where AI is powered by real human knowledge, ensuring data creators directly benefit from their contributions. By democratizing data and enabling ethical sharing, we are laying the foundation for authentic AI advancement.
-
-<details>
-<summary>🔑 <strong>Key Opportunities</strong></summary>
- 
- **Data Capitalization**: Transform digital footprints into measurable, valuable assets.  
- **Authentic AI Data**: Provide AI systems with real human insights.  
- **Shared Economy**: Create a fair data marketplace that benefits data creators.  
-
-</details>
-
-<details>
-<summary>🚀 <strong>Development Pathway</strong></summary>
-
-1. **Open-Source Tools**: Community-driven platforms for transparent data extraction.  
-2. **Digital Asset Structuring**: Tools to organize and value digital knowledge.  
-3. **Ethical Data Marketplace**: A secure, fair platform for exchanging structured data.  
-
-For more details, see our [full mission statement](./MISSION.md).
-</details>
-
 ## Star History

-[![Star History Chart](https://api.star-history.com/svg?repos=unclecode/crawl4ai&type=Date)](https://star-history.com/#unclecode/crawl4ai&Date)
+[![Star History Chart](https://api.star-history.com/svg?repos=unclecode/crawl4ai&type=Date)](https://star-history.com/#unclecode/crawl4ai&Date)
--- a/README.sync.md
+++ b/README.sync.md
@@ -0,0 +1,244 @@
+# Crawl4AI v0.2.77 🕷️🤖
+
+[![GitHub Stars](https://img.shields.io/github/stars/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/stargazers)
+[![GitHub Forks](https://img.shields.io/github/forks/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/network/members)
+[![GitHub Issues](https://img.shields.io/github/issues/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/issues)
+[![GitHub Pull Requests](https://img.shields.io/github/issues-pr/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/pulls)
+[![License](https://img.shields.io/github/license/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)
+
+Crawl4AI simplifies web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. 🆓🌐
+
+#### [v0.2.77] - 2024-08-02
+
+Major improvements in functionality, performance, and cross-platform compatibility! 🚀
+
+- 🐳 **Docker enhancements**:
+  - Significantly improved Dockerfile for easy installation on Linux, Mac, and Windows.
+- 🌐 **Official Docker Hub image**:
+  - Launched our first official image on Docker Hub for streamlined deployment (unclecode/crawl4ai).
+- 🔧 **Selenium upgrade**:
+  - Removed dependency on ChromeDriver, now using Selenium's built-in capabilities for better compatibility.
+- 🖼️ **Image description**:
+  - Implemented ability to generate textual descriptions for extracted images from web pages.
+- ⚡ **Performance boost**:
+  - Various improvements to enhance overall speed and performance.
+  
+## Try it Now!
+
+✨ Play around with this [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1sJPAmeLj5PMrg2VgOwMJ2ubGIcK0cJeX?usp=sharing)
+
+✨ Visit our [Documentation Website](https://crawl4ai.com/mkdocs/)
+
+✨ Check [Demo](https://crawl4ai.com/mkdocs/demo)
+
+## Features ✨
+
+- 🆓 Completely free and open-source
+- 🤖 LLM-friendly output formats (JSON, cleaned HTML, markdown)
+- 🌍 Supports crawling multiple URLs simultaneously
+- 🎨 Extracts and returns all media tags (Images, Audio, and Video)
+- 🔗 Extracts all external and internal links
+- 📚 Extracts metadata from the page
+- 🔄 Custom hooks for authentication, headers, and page modifications before crawling
+- 🕵️ User-agent customization
+- 🖼️ Takes screenshots of the page
+- 📜 Executes multiple custom JavaScript snippets before crawling
+- 📚 Various chunking strategies: topic-based, regex, sentence, and more
+- 🧠 Advanced extraction strategies: cosine clustering, LLM, and more
+- 🎯 CSS selector support
+- 📝 Passes instructions/keywords to refine extraction
+
+# Crawl4AI
+
+## 🌟 Shoutout to Contributors of v0.2.77!
+
+A big thank you to the amazing contributors who've made this release possible:
+
+- [@aravindkarnam](https://github.com/aravindkarnam) for the new image description feature
+- [@FractalMind](https://github.com/FractalMind) for our official Docker Hub image
+- [@ketonkss4](https://github.com/ketonkss4) for helping streamline our Selenium setup
+
+Your contributions are driving Crawl4AI forward! 🚀
+
+## Cool Examples 🚀
+
+### Quick Start
+
+```python
+from crawl4ai import WebCrawler
+
+# Create an instance of WebCrawler
+crawler = WebCrawler()
+
+# Warm up the crawler (load necessary models)
+crawler.warmup()
+
+# Run the crawler on a URL
+result = crawler.run(url="https://www.nbcnews.com/business")
+
+# Print the extracted content
+print(result.markdown)
+```
+
+## How to install 🛠 
+
+### Using pip 🐍
+```bash
+virtualenv venv
+source venv/bin/activate
+pip install "crawl4ai @ git+https://github.com/unclecode/crawl4ai.git"
+```
+
+### Using Docker 🐳
+
+```bash
+# For Mac users (M1/M2)
+# docker build --platform linux/amd64 -t crawl4ai .
+docker build -t crawl4ai .
+docker run -d -p 8000:80 crawl4ai
+```
+
+### Using Docker Hub 🐳
+
+```bash
+docker pull unclecode/crawl4ai:latest
+docker run -d -p 8000:80 unclecode/crawl4ai:latest
+```
+
+
+## Speed-First Design 🚀
+
+Perhaps the most important design principle for this library is speed. We need to ensure it can handle many links and resources in parallel as quickly as possible. By combining this speed with fast LLMs like Groq, the results will be truly amazing.
+
+```python
+import time
+from crawl4ai.web_crawler import WebCrawler
+crawler = WebCrawler()
+crawler.warmup()
+
+start = time.time()
+url = r"https://www.nbcnews.com/business"
+result = crawler.run(url, word_count_threshold=10, bypass_cache=True)
+end = time.time()
+print(f"Time taken: {end - start}")
+```
+
+Let's take a look at the measured time for the above code snippet:
+
+```bash
+[LOG] 🚀 Crawling done, success: True, time taken: 1.3623387813568115 seconds
+[LOG] 🚀 Content extracted, success: True, time taken: 0.05715131759643555 seconds
+[LOG] 🚀 Extraction, time taken: 0.05750393867492676 seconds.
+Time taken: 1.439958095550537
+```
+Fetching the content from the page took 1.3623 seconds, and extracting the content took 0.0575 seconds. 🚀
+
+### Extract Structured Data from Web Pages 📊
+
+Crawl all OpenAI models and their fees from the official page.
+
+```python
+import os
+from crawl4ai import WebCrawler
+from crawl4ai.extraction_strategy import LLMExtractionStrategy
+from pydantic import BaseModel, Field
+
+class OpenAIModelFee(BaseModel):
+    model_name: str = Field(..., description="Name of the OpenAI model.")
+    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
+    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
+
+url = 'https://openai.com/api/pricing/'
+crawler = WebCrawler()
+crawler.warmup()
+
+result = crawler.run(
+        url=url,
+        word_count_threshold=1,
+        extraction_strategy=LLMExtractionStrategy(
+            provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'),
+            schema=OpenAIModelFee.schema(),
+            extraction_type="schema",
+            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. 
+            Do not miss any models in the entire content. One extracted model JSON format should look like this: 
+            {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
+        ),            
+        bypass_cache=True,
+    )
+
+print(result.extracted_content)
+```
+
+### Execute JS, Filter Data with CSS Selector, and Clustering
+
+```python
+from crawl4ai import WebCrawler
+from crawl4ai.extraction_strategy import CosineStrategy
+
+js_code = ["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"]
+
+crawler = WebCrawler()
+crawler.warmup()
+
+result = crawler.run(
+    url="https://www.nbcnews.com/business",
+    js=js_code,
+    css_selector="p",
+    extraction_strategy=CosineStrategy(semantic_filter="technology")
+)
+
+print(result.extracted_content)
+```
+
+### Extract Structured Data from Web Pages With Proxy and BaseUrl
+
+```python
+from crawl4ai import WebCrawler
+from crawl4ai.extraction_strategy import LLMExtractionStrategy
+
+def create_crawler():
+    crawler = WebCrawler(verbose=True, proxy="http://127.0.0.1:7890")
+    crawler.warmup()
+    return crawler
+
+crawler = create_crawler()
+
+crawler.warmup()
+
+result = crawler.run(
+    url="https://www.nbcnews.com/business",
+    extraction_strategy=LLMExtractionStrategy(
+        provider="openai/gpt-4o",
+        api_token="sk-",
+        base_url="https://api.openai.com/v1"
+    )
+)
+
+print(result.markdown)
+```
+
+## Documentation 📚
+
+For detailed documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://crawl4ai.com/mkdocs/).
+
+## Contributing 🤝
+
+We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTING.md) for more information.
+
+## License 📄
+
+Crawl4AI is released under the [Apache 2.0 License](https://github.com/unclecode/crawl4ai/blob/main/LICENSE).
+
+## Contact 📧
+
+For questions, suggestions, or feedback, feel free to reach out:
+
+- GitHub: [unclecode](https://github.com/unclecode)
+- Twitter: [@unclecode](https://twitter.com/unclecode)
+- Website: [crawl4ai.com](https://crawl4ai.com)
+
+Happy Crawling! 🕸️🚀
+
+## Star History
+
+[![Star History Chart](https://api.star-history.com/svg?repos=unclecode/crawl4ai&type=Date)](https://star-history.com/#unclecode/crawl4ai&Date)
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -1,503 +0,0 @@
-# Crawl4AI Strategic Roadmap
-
-```mermaid
-%%{init: {'themeVariables': { 'fontSize': '14px'}}}%%
-graph TD
-    subgraph A1[Advanced Crawling Systems 🔧]
-        A["`
-        • Graph Crawler ✓
-        • Question-Based Crawler
-        • Knowledge-Optimal Crawler
-        • Agentic Crawler
-        `"]
-    end
-
-    subgraph A2[Specialized Features 🛠️]
-        B["`
-        • Automated Schema Generator
-        • Domain-Specific Scrapers
-        • 
-        • 
-        `"]
-    end
-
-    subgraph A3[Development Tools 🔨]
-        C["`
-        • Interactive Playground
-        • Performance Monitor
-        • Cloud Integration
-        • 
-        `"]
-    end
-
-    subgraph A4[Community & Growth 🌱]
-        D["`
-        • Sponsorship Program
-        • Educational Content
-        • 
-        • 
-        `"]
-    end
-
-    classDef default fill:#f9f9f9,stroke:#333,stroke-width:2px
-    classDef section fill:#f0f0f0,stroke:#333,stroke-width:4px,rx:10
-    class A1,A2,A3,A4 section
-
-    %% Layout hints
-    A1 --> A2[" "]
-    A3 --> A4[" "]
-    linkStyle 0,1 stroke:none
-```
-
-Crawl4AI is evolving to provide more intelligent, efficient, and versatile web crawling capabilities. This roadmap outlines the key developments and features planned for the project, organized into strategic sections that build upon our current foundation.
-
-## 1. Advanced Crawling Systems 🔧
-
-This section introduces three powerful crawling systems that extend Crawl4AI's capabilities from basic web crawling to intelligent, purpose-driven data extraction.
-
-### 1.1 Question-Based Crawler
-The Question-Based Crawler enhances our core engine by enabling automatic discovery and extraction of relevant web content based on natural language questions.
-
-Key Features:
- SerpiAPI integration for intelligent web search
- Relevancy scoring for search results
- Automatic URL discovery and prioritization
- Cross-source validation
-
-```python
-from crawl4ai import AsyncWebCrawler
-from crawl4ai.discovery import QuestionBasedDiscovery
-
-async with AsyncWebCrawler() as crawler:
-    discovery = QuestionBasedDiscovery(crawler)
-    results = await discovery.arun(
-        question="What are the system requirements for major cloud providers' GPU instances?",
-        max_urls=5,
-        relevance_threshold=0.7
-    )
-    
-    for result in results:
-        print(f"Source: {result.url} (Relevance: {result.relevance_score})")
-        print(f"Content: {result.markdown}\n")
-```
-
-### 1.2 Knowledge-Optimal Crawler
-An intelligent crawling system that solves the optimization problem of minimizing data extraction while maximizing knowledge acquisition for specific objectives.
-
-Key Features:
- Smart content prioritization
- Minimal data extraction for maximum knowledge
- Probabilistic relevance assessment
- Objective-driven crawling paths
-
-```python
-from crawl4ai import AsyncWebCrawler
-from crawl4ai.optimization import KnowledgeOptimizer
-
-async with AsyncWebCrawler() as crawler:
-    optimizer = KnowledgeOptimizer(
-        objective="Understand GPU instance pricing and limitations across cloud providers",
-        required_knowledge=[
-            "pricing structure",
-            "GPU specifications",
-            "usage limits",
-            "availability zones"
-        ],
-        confidence_threshold=0.85
-    )
-    
-    result = await crawler.arun(
-        urls=[
-            "https://aws.amazon.com/ec2/pricing/",
-            "https://cloud.google.com/gpu",
-            "https://azure.microsoft.com/pricing/"
-        ],
-        optimizer=optimizer,
-        optimization_mode="minimal_extraction"
-    )
-    
-    print(f"Knowledge Coverage: {result.knowledge_coverage}")
-    print(f"Data Efficiency: {result.efficiency_ratio}")
-    print(f"Extracted Content: {result.optimal_content}")
-```
-
-### 1.3 Agentic Crawler
-An autonomous system capable of understanding complex goals and automatically planning and executing multi-step crawling operations.
-
-Key Features:
- Autonomous goal interpretation
- Dynamic step planning
- Interactive navigation capabilities
- Visual recognition and interaction
- Automatic error recovery
-
-```python
-from crawl4ai import AsyncWebCrawler
-from crawl4ai.agents import CrawlerAgent
-
-async with AsyncWebCrawler() as crawler:
-    agent = CrawlerAgent(crawler)
-    
-    # Automatic planning and execution
-    result = await agent.arun(
-        goal="Find research papers about quantum computing published in 2023 with more than 50 citations",
-        auto_retry=True
-    )
-    print("Generated Plan:", result.executed_steps)
-    print("Extracted Data:", result.data)
-    
-    # Using custom steps with automatic execution
-    result = await agent.arun(
-        goal="Extract conference deadlines from ML conferences",
-        custom_plan=[
-            "Navigate to conference page",
-            "Find important dates section",
-            "Extract submission deadlines",
-            "Verify dates are for 2024"
-        ]
-    )
-    
-    # Monitoring execution
-    print("Step Completion:", result.step_status)
-    print("Execution Time:", result.execution_time)
-    print("Success Rate:", result.success_rate)
-```
-
-# Section 2: Specialized Features 🛠️
-
-This section introduces specialized tools and features that enhance Crawl4AI's capabilities for specific use cases and data extraction needs.
-
-### 2.1 Automated Schema Generator
-A system that automatically generates JsonCssExtractionStrategy schemas from natural language descriptions, making structured data extraction accessible to all users.
-
-Key Features:
- Natural language schema generation
- Automatic pattern detection
- Predefined schema templates
- Chrome extension for visual schema building
-
-```python
-from crawl4ai import AsyncWebCrawler
-from crawl4ai.schema import SchemaGenerator
-
-# Generate schema from natural language description
-generator = SchemaGenerator()
-schema = await generator.generate(
-    url="https://news-website.com",
-    description="For each news article on the page, I need the headline, publication date, and main image"
-)
-
-# Use generated schema with crawler
-async with AsyncWebCrawler() as crawler:
-    result = await crawler.arun(
-        url="https://news-website.com",
-        extraction_strategy=schema
-    )
-
-# Example of generated schema:
-"""
-{
-    "name": "News Article Extractor",
-    "baseSelector": "article.news-item",
-    "fields": [
-        {
-            "name": "headline",
-            "selector": "h2.article-title",
-            "type": "text"
-        },
-        {
-            "name": "date",
-            "selector": "span.publish-date",
-            "type": "text"
-        },
-        {
-            "name": "image",
-            "selector": "img.article-image",
-            "type": "attribute",
-            "attribute": "src"
-        }
-    ]
-}
-"""
-```
-
-### 2.2 Domain Specific Scrapers
-Specialized extraction strategies optimized for common website types and platforms, providing consistent and reliable data extraction without additional configuration.
-
-Key Features:
- Pre-configured extractors for popular platforms
- Academic site specialization (arXiv, NCBI)
- E-commerce standardization
- Documentation site handling
-
-```python
-from crawl4ai import AsyncWebCrawler
-from crawl4ai.extractors import AcademicExtractor, EcommerceExtractor
-
-async with AsyncWebCrawler() as crawler:
-    # Academic paper extraction
-    papers = await crawler.arun(
-        url="https://arxiv.org/list/cs.AI/recent",
-        extractor="academic",  # Built-in extractor type
-        site_type="arxiv",     # Specific site optimization
-        extract_fields=[
-            "title", 
-            "authors", 
-            "abstract", 
-            "citations"
-        ]
-    )
-    
-    # E-commerce product data
-    products = await crawler.arun(
-        url="https://store.example.com/products",
-        extractor="ecommerce",
-        extract_fields=[
-            "name",
-            "price",
-            "availability",
-            "reviews"
-        ]
-    )
-```
-
-### 2.3 Web Embedding Index
-Creates and maintains a semantic search infrastructure for crawled content, enabling efficient retrieval and querying of web content through vector embeddings.
-
-Key Features:
- Automatic embedding generation
- Intelligent content chunking
- Efficient vector storage and indexing
- Semantic search capabilities
-
-```python
-from crawl4ai import AsyncWebCrawler
-from crawl4ai.indexing import WebIndex
-
-# Initialize and build index
-index = WebIndex(model="efficient-mini")
-
-async with AsyncWebCrawler() as crawler:
-    # Crawl and index content
-    await index.build(
-        urls=["https://docs.example.com"],
-        crawler=crawler,
-        options={
-            "chunk_method": "semantic",
-            "update_policy": "incremental",
-            "embedding_batch_size": 100
-        }
-    )
-
-    # Search through indexed content
-    results = await index.search(
-        query="How to implement OAuth authentication?",
-        filters={
-            "content_type": "technical",
-            "recency": "6months"
-        },
-        top_k=5
-    )
-
-    # Get similar content
-    similar = await index.find_similar(
-        url="https://docs.example.com/auth/oauth",
-        threshold=0.85
-    )
-```
-
-Each of these specialized features builds upon Crawl4AI's core functionality while providing targeted solutions for specific use cases. They can be used independently or combined for more complex data extraction and processing needs.
-
-# Section 3: Development Tools 🔧
-
-This section covers tools designed to enhance the development experience, monitoring, and deployment of Crawl4AI applications.
-
-### 3.1 Crawl4AI Playground 🎮
-
-The Crawl4AI Playground is an interactive web-based development environment that simplifies web scraping experimentation, development, and deployment. With its intuitive interface and AI-powered assistance, users can quickly prototype, test, and deploy web scraping solutions.
-
-#### Key Features 🌟
-
-##### Visual Strategy Builder
- Interactive point-and-click interface for building extraction strategies
- Real-time preview of selected elements
- Side-by-side comparison of different extraction approaches
- Visual validation of CSS selectors and XPath queries
-
-##### AI Assistant Integration
- Strategy recommendations based on target website analysis
- Parameter optimization suggestions
- Best practices guidance for specific use cases
- Automated error detection and resolution
- Performance optimization tips
-
-##### Real-Time Testing & Validation
- Live preview of extraction results
- Side-by-side comparison of multiple strategies
- Performance metrics visualization
- Automatic validation of extracted data
- Error detection and debugging tools
-
-##### Project Management
- Save and organize multiple scraping projects
- Version control for configurations
- Export/import project settings
- Share configurations with team members
- Project templates for common use cases
-
-##### Deployment Pipeline
- One-click deployment to various environments
- Docker container generation
- Cloud deployment templates (AWS, GCP, Azure)
- Scaling configuration management
- Monitoring setup automation
-
-
-### 3.2 Performance Monitoring System
-A comprehensive monitoring solution providing real-time insights into crawler operations, resource usage, and system health through both CLI and GUI interfaces.
-
-Key Features:
- Real-time resource tracking
- Active crawl monitoring
- Performance statistics
- Customizable alerting system
-
-```python
-from crawl4ai import AsyncWebCrawler
-from crawl4ai.monitor import CrawlMonitor
-
-# Initialize monitoring
-monitor = CrawlMonitor()
-
-# Start monitoring with CLI interface
-await monitor.start(
-    mode="cli",  # or "gui"
-    refresh_rate="1s",
-    metrics={
-        "resources": ["cpu", "memory", "network"],
-        "crawls": ["active", "queued", "completed"],
-        "performance": ["success_rate", "response_times"]
-    }
-)
-
-# Example CLI output:
-"""
-Crawl4AI Monitor (Live) - Press Q to exit
-────────────────────────────────────────
-System Usage:
- ├─ CPU: ███████░░░ 70%
- └─ Memory: ████░░░░░ 2.1GB/8GB
-
-Active Crawls:
-ID    URL                   Status    Progress
-001   docs.example.com     🟢 Active   75%
-002   api.service.com      🟡 Queue    -
-
-Metrics (Last 5min):
- ├─ Success Rate: 98%
- ├─ Avg Response: 0.6s
- └─ Pages/sec: 8.5
-"""
-```
-
-### 3.3 Cloud Integration
-Streamlined deployment tools for setting up Crawl4AI in various cloud environments, with support for scaling and monitoring.
-
-Key Features:
- One-click deployment solutions
- Auto-scaling configuration
- Load balancing setup
- Cloud-specific optimizations
- Monitoring integration
-
-```python
-from crawl4ai import AsyncWebCrawler
-from crawl4ai.deploy import CloudDeployer
-
-# Initialize deployer
-deployer = CloudDeployer()
-
-# Deploy crawler service
-deployment = await deployer.deploy(
-    service_name="crawler-cluster",
-    platform="aws",  # or "gcp", "azure"
-    config={
-        "instance_type": "compute-optimized",
-        "auto_scaling": {
-            "min_instances": 2,
-            "max_instances": 10,
-            "scale_based_on": "cpu_usage"
-        },
-        "region": "us-east-1",
-        "monitoring": True
-    }
-)
-
-# Get deployment status and endpoints
-print(f"Service Status: {deployment.status}")
-print(f"API Endpoint: {deployment.endpoint}")
-print(f"Monitor URL: {deployment.monitor_url}")
-```
-
-These development tools work together to provide a comprehensive environment for developing, testing, monitoring, and deploying Crawl4AI applications. The Playground helps users experiment and generate optimal configurations, the Performance Monitor ensures smooth operation, and the Cloud Integration tools simplify deployment and scaling.
-
-# Section 4: Community & Growth 🌱
-
-This section outlines initiatives designed to build and support the Crawl4AI community, provide educational resources, and ensure sustainable project growth.
-
-### 4.1 Sponsorship Program
-A structured program to support ongoing development and maintenance of Crawl4AI while providing valuable benefits to sponsors.
-
-Key Features:
- Multiple sponsorship tiers
- Sponsor recognition system
- Priority support for sponsors
- Early access to new features
- Custom feature development opportunities
-
-Program Structure (not yet finalized):
-```
-Sponsorship Tiers:
-
-🥉 Bronze Supporter
- GitHub Sponsor badge
- Priority issue response
- Community Discord role
-
-🥈 Silver Supporter
- All Bronze benefits
- Technical support channel
- Vote on roadmap priorities
- Early access to beta features
-
-🥇 Gold Supporter
- All Silver benefits
- Custom feature requests
- Direct developer access
- Private support sessions
-
-💎 Diamond Partner
- All Gold benefits
- Custom development
- On-demand consulting
- Integration support
-```
-
-### 4.2 "How to Crawl" Video Series
-A comprehensive educational resource teaching users how to effectively use Crawl4AI for various web scraping and data extraction scenarios.
-
-Key Features:
- Step-by-step tutorials
- Real-world use cases
- Best practices
- Integration guides
- Advanced feature deep-dives
-
-These community initiatives are designed to:
- Provide comprehensive learning resources
- Foster a supportive user community
- Ensure sustainable project development
- Share knowledge and best practices
- Create opportunities for collaboration
-
-The combination of structured support through sponsorship, educational content through video series, and interactive learning through the playground creates a robust ecosystem for both new and experienced users of Crawl4AI.
--- a/crawl4ai/__init__.py
+++ b/crawl4ai/__init__.py
@@ -1,88 +1,30 @@
 # __init__.py

-from .async_webcrawler import AsyncWebCrawler, CacheMode
-from .async_configs import BrowserConfig, CrawlerRunConfig
-from .content_scraping_strategy import (
-    ContentScrapingStrategy,
-    WebScrapingStrategy,
-    LXMLWebScrapingStrategy,
-)
-from .extraction_strategy import (
-    ExtractionStrategy,
-    LLMExtractionStrategy,
-    CosineStrategy,
-    JsonCssExtractionStrategy,
-    JsonXPathExtractionStrategy
-)
-from .chunking_strategy import ChunkingStrategy, RegexChunking
-from .markdown_generation_strategy import DefaultMarkdownGenerator
-from .content_filter_strategy import PruningContentFilter, BM25ContentFilter, LLMContentFilter
-from .models import CrawlResult, MarkdownGenerationResult
-from .async_dispatcher import (
-    MemoryAdaptiveDispatcher,
-    SemaphoreDispatcher,
-    RateLimiter,
-    CrawlerMonitor,
-    DisplayMode,
-    BaseDispatcher
-)
+from .async_webcrawler import AsyncWebCrawler
+from .models import CrawlResult
+
+__version__ = "0.3.6"

 __all__ = [
    "AsyncWebCrawler",
    "CrawlResult",
-    "CacheMode",
-    "ContentScrapingStrategy",
-    "WebScrapingStrategy",
-    "LXMLWebScrapingStrategy",
-    "BrowserConfig",
-    "CrawlerRunConfig",
-    "ExtractionStrategy",
-    "LLMExtractionStrategy",
-    "CosineStrategy",
-    "JsonCssExtractionStrategy",
-    "JsonXPathExtractionStrategy",
-    "ChunkingStrategy",
-    "RegexChunking",
-    "DefaultMarkdownGenerator",
-    "PruningContentFilter",
-    "BM25ContentFilter",
-    "LLMContentFilter",
-    "BaseDispatcher",
-    "MemoryAdaptiveDispatcher",
-    "SemaphoreDispatcher",
-    "RateLimiter",
-    "CrawlerMonitor",
-    "DisplayMode",
-    "MarkdownGenerationResult",
 ]

-
 def is_sync_version_installed():
    try:
        import selenium
-
        return True
    except ImportError:
        return False

-
 if is_sync_version_installed():
    try:
        from .web_crawler import WebCrawler
-
        __all__.append("WebCrawler")
    except ImportError:
-        print(
-            "Warning: Failed to import WebCrawler even though selenium is installed. This might be due to other missing dependencies."
-        )
+        import warnings
+        print("Warning: Failed to import WebCrawler even though selenium is installed. This might be due to other missing dependencies.")
 else:
    WebCrawler = None
-    # import warnings
-    # print("Warning: Synchronous WebCrawler is not available. Install crawl4ai[sync] for synchronous support. However, please note that the synchronous version will be deprecated soon.")
-
-import warnings
-from pydantic import warnings as pydantic_warnings
-
-# Disable all Pydantic warnings
-warnings.filterwarnings("ignore", module="pydantic")
-# pydantic_warnings.filter_warnings()
+    import warnings
+    print("Warning: Synchronous WebCrawler is not available. Install crawl4ai[sync] for synchronous support. However, please note that the synchronous version will be deprecated soon.")
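Note on the hunk above: both fallback branches import `warnings` but still report through `print`. A minimal sketch of how an optional import could emit a real warning instead, so callers can filter or capture it. The helper name `optional_import` and the stand-in module names are assumptions for illustration only; they are not part of crawl4ai.

```python
import importlib
import warnings


def optional_import(module_name, attr, hint):
    """Return `attr` from `module_name`, or None after emitting a warning.

    Hypothetical helper for illustration; not part of crawl4ai.
    """
    try:
        module = importlib.import_module(module_name)
        return getattr(module, attr)
    except (ImportError, AttributeError):
        # stacklevel=2 attributes the warning to the caller, not this helper.
        warnings.warn(f"{attr} is unavailable: {hint}", stacklevel=2)
        return None


# Stand-in names so the sketch runs without selenium or crawl4ai installed.
OrderedDict = optional_import("collections", "OrderedDict", "stdlib, always present")

# Capture the warning instead of printing it, as a caller or test might.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    Missing = optional_import("no_such_module_xyz", "Thing", "install the optional extra")

print(Missing, [str(w.message) for w in caught])
```

Unlike `print`, a message sent through `warnings.warn` respects `warnings.filterwarnings` and the `-W` command-line option, so applications embedding the package can silence or escalate it.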
--- a/crawl4ai/__version__.py
+++ b/crawl4ai/__version__.py
@@ -1,2 +0,0 @@
-# crawl4ai/_version.py
-__version__ = "0.4.3b1"
--- a/crawl4ai/async_configs.py
+++ b/crawl4ai/async_configs.py
@@ -1,715 +0,0 @@
-from .config import (
-    MIN_WORD_THRESHOLD,
-    IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
-    SCREENSHOT_HEIGHT_TRESHOLD,
-    PAGE_TIMEOUT,
-    IMAGE_SCORE_THRESHOLD,
-    SOCIAL_MEDIA_DOMAINS,
-)
-from .user_agent_generator import UserAgentGenerator
-from .extraction_strategy import ExtractionStrategy
-from .chunking_strategy import ChunkingStrategy, RegexChunking
-from .markdown_generation_strategy import MarkdownGenerationStrategy
-from .content_scraping_strategy import ContentScrapingStrategy, WebScrapingStrategy
-from typing import Optional, Union, List
-
-
-class BrowserConfig:
-    """
-    Configuration class for setting up a browser instance and its context in AsyncPlaywrightCrawlerStrategy.
-
-    This class centralizes all parameters that affect browser and context creation. Instead of passing
-    scattered keyword arguments, users can instantiate and modify this configuration object. The crawler
-    code will then reference these settings to initialize the browser in a consistent, documented manner.
-
-    Attributes:
-        browser_type (str): The type of browser to launch. Supported values: "chromium", "firefox", "webkit".
-                            Default: "chromium".
-        headless (bool): Whether to run the browser in headless mode (no visible GUI).
-                         Default: True.
-        use_managed_browser (bool): Launch the browser using a managed approach (e.g., via CDP), allowing
-                                    advanced manipulation. Default: False.
-        debugging_port (int): Port for the browser debugging protocol. Default: 9222.
-        use_persistent_context (bool): Use a persistent browser context (like a persistent profile).
-                                       Automatically sets use_managed_browser=True. Default: False.
-        user_data_dir (str or None): Path to a user data directory for persistent sessions. If None, a
-                                     temporary directory may be used. Default: None.
-        chrome_channel (str): The Chrome channel to launch (e.g., "chrome", "msedge"). Only applies if browser_type
-                              is "chromium". Default: "chromium".
-        channel (str): The channel to launch (e.g., "chromium", "chrome", "msedge"). Only applies if browser_type
-                              is "chromium". Default: "chromium".
-        proxy (Optional[str]): Proxy server URL (e.g., "http://username:password@proxy:port"). If None, no proxy is used.
-                             Default: None.
-        proxy_config (dict or None): Detailed proxy configuration, e.g. {"server": "...", "username": "..."}.
-                                     If None, no additional proxy config. Default: None.
-        viewport_width (int): Default viewport width for pages. Default: 1080.
-        viewport_height (int): Default viewport height for pages. Default: 600.
-        verbose (bool): Enable verbose logging.
-                        Default: True.
-        accept_downloads (bool): Whether to allow file downloads. If True, requires a downloads_path.
-                                 Default: False.
-        downloads_path (str or None): Directory to store downloaded files. If None and accept_downloads is True,
-                                      a default path will be created. Default: None.
-        storage_state (str or dict or None): Path or object describing storage state (cookies, localStorage).
-                                             Default: None.
-        ignore_https_errors (bool): Ignore HTTPS certificate errors. Default: True.
-        java_script_enabled (bool): Enable JavaScript execution in pages. Default: True.
-        cookies (list): List of cookies to add to the browser context. Each cookie is a dict with fields like
-                        {"name": "...", "value": "...", "url": "..."}.
-                        Default: [].
-        headers (dict): Extra HTTP headers to apply to all requests in this context.
-                        Default: {}.
-        user_agent (str): Custom User-Agent string to use. Default: "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
-                           "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36".
-        user_agent_mode (str or None): Mode for generating the user agent (e.g., "random"). If None, use the provided
-                                       user_agent as-is. Default: None.
-        user_agent_generator_config (dict or None): Configuration for user agent generation if user_agent_mode is set.
-                                                    Default: None.
-        text_mode (bool): If True, disables images and other rich content for potentially faster load times.
-                          Default: False.
-        light_mode (bool): Disables certain background features for performance gains. Default: False.
-        extra_args (list): Additional command-line arguments passed to the browser.
-                           Default: [].
-    """
-
-    def __init__(
-        self,
-        browser_type: str = "chromium",
-        headless: bool = True,
-        use_managed_browser: bool = False,
-        use_persistent_context: bool = False,
-        user_data_dir: str = None,
-        chrome_channel: str = "chromium",
-        channel: str = "chromium",
-        proxy: Optional[str] = None,
-        proxy_config: dict = None,
-        viewport_width: int = 1080,
-        viewport_height: int = 600,
-        accept_downloads: bool = False,
-        downloads_path: str = None,
-        storage_state=None,
-        ignore_https_errors: bool = True,
-        java_script_enabled: bool = True,
-        sleep_on_close: bool = False,
-        verbose: bool = True,
-        cookies: list = None,
-        headers: dict = None,
-        user_agent: str = (
-            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) AppleWebKit/537.36 "
-            "(KHTML, like Gecko) Chrome/116.0.5845.187 Safari/604.1 Edg/117.0.2045.47"
-        ),
-        user_agent_mode: str = None,
-        user_agent_generator_config: dict = None,
-        text_mode: bool = False,
-        light_mode: bool = False,
-        extra_args: list = None,
-        debugging_port: int = 9222,
-    ):
-        self.browser_type = browser_type
-        self.headless = headless
-        self.use_managed_browser = use_managed_browser
-        self.use_persistent_context = use_persistent_context
-        self.user_data_dir = user_data_dir
-        self.chrome_channel = chrome_channel or self.browser_type or "chromium"
-        self.channel = channel or self.browser_type or "chromium"
-        if self.browser_type in ["firefox", "webkit"]:
-            self.channel = ""
-            self.chrome_channel = ""
-        self.proxy = proxy
-        self.proxy_config = proxy_config
-        self.viewport_width = viewport_width
-        self.viewport_height = viewport_height
-        self.accept_downloads = accept_downloads
-        self.downloads_path = downloads_path
-        self.storage_state = storage_state
-        self.ignore_https_errors = ignore_https_errors
-        self.java_script_enabled = java_script_enabled
-        self.cookies = cookies if cookies is not None else []
-        self.headers = headers if headers is not None else {}
-        self.user_agent = user_agent
-        self.user_agent_mode = user_agent_mode
-        self.user_agent_generator_config = user_agent_generator_config
-        self.text_mode = text_mode
-        self.light_mode = light_mode
-        self.extra_args = extra_args if extra_args is not None else []
-        self.sleep_on_close = sleep_on_close
-        self.verbose = verbose
-        self.debugging_port = debugging_port
-
-        user_agenr_generator = UserAgentGenerator()
-        if self.user_agent_mode != "random" and self.user_agent_generator_config:
-            self.user_agent = user_agenr_generator.generate(
-                **(self.user_agent_generator_config or {})
-            )
-        elif self.user_agent_mode == "random":
-            self.user_agent = user_agenr_generator.generate()
-        else:
-            pass
-
-        self.browser_hint = user_agenr_generator.generate_client_hints(self.user_agent)
-        self.headers.setdefault("sec-ch-ua", self.browser_hint)
-
-        # If persistent context is requested, ensure managed browser is enabled
-        if self.use_persistent_context:
-            self.use_managed_browser = True
-
-    @staticmethod
-    def from_kwargs(kwargs: dict) -> "BrowserConfig":
-        return BrowserConfig(
-            browser_type=kwargs.get("browser_type", "chromium"),
-            headless=kwargs.get("headless", True),
-            use_managed_browser=kwargs.get("use_managed_browser", False),
-            use_persistent_context=kwargs.get("use_persistent_context", False),
-            user_data_dir=kwargs.get("user_data_dir"),
-            chrome_channel=kwargs.get("chrome_channel", "chromium"),
-            channel=kwargs.get("channel", "chromium"),
-            proxy=kwargs.get("proxy"),
-            proxy_config=kwargs.get("proxy_config"),
-            viewport_width=kwargs.get("viewport_width", 1080),
-            viewport_height=kwargs.get("viewport_height", 600),
-            accept_downloads=kwargs.get("accept_downloads", False),
-            downloads_path=kwargs.get("downloads_path"),
-            storage_state=kwargs.get("storage_state"),
-            ignore_https_errors=kwargs.get("ignore_https_errors", True),
-            java_script_enabled=kwargs.get("java_script_enabled", True),
-            cookies=kwargs.get("cookies", []),
-            headers=kwargs.get("headers", {}),
-            user_agent=kwargs.get(
-                "user_agent",
-                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
-                "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36",
-            ),
-            user_agent_mode=kwargs.get("user_agent_mode"),
-            user_agent_generator_config=kwargs.get("user_agent_generator_config"),
-            text_mode=kwargs.get("text_mode", False),
-            light_mode=kwargs.get("light_mode", False),
-            extra_args=kwargs.get("extra_args", []),
-        )
-
-    def to_dict(self):
-        return {
-            "browser_type": self.browser_type,
-            "headless": self.headless,
-            "use_managed_browser": self.use_managed_browser,
-            "use_persistent_context": self.use_persistent_context,
-            "user_data_dir": self.user_data_dir,
-            "chrome_channel": self.chrome_channel,
-            "channel": self.channel,
-            "proxy": self.proxy,
-            "proxy_config": self.proxy_config,
-            "viewport_width": self.viewport_width,
-            "viewport_height": self.viewport_height,
-            "accept_downloads": self.accept_downloads,
-            "downloads_path": self.downloads_path,
-            "storage_state": self.storage_state,
-            "ignore_https_errors": self.ignore_https_errors,
-            "java_script_enabled": self.java_script_enabled,
-            "cookies": self.cookies,
-            "headers": self.headers,
-            "user_agent": self.user_agent,
-            "user_agent_mode": self.user_agent_mode,
-            "user_agent_generator_config": self.user_agent_generator_config,
-            "text_mode": self.text_mode,
-            "light_mode": self.light_mode,
-            "extra_args": self.extra_args,
-            "sleep_on_close": self.sleep_on_close,
-            "verbose": self.verbose,
-            "debugging_port": self.debugging_port,
-        }
-
-    def clone(self, **kwargs):
-        """Create a copy of this configuration with updated values.
-        
-        Args:
-            **kwargs: Key-value pairs of configuration options to update
-            
-        Returns:
-            BrowserConfig: A new instance with the specified updates
-        """
-        config_dict = self.to_dict()
-        config_dict.update(kwargs)
-        return BrowserConfig.from_kwargs(config_dict)
-
-
-class CrawlerRunConfig:
-    """
-    Configuration class for controlling how the crawler runs each crawl operation.
-    This includes parameters for content extraction, page manipulation, waiting conditions,
-    caching, and other runtime behaviors.
-
-    This centralizes parameters that were previously scattered as kwargs to `arun()` and related methods.
-    By using this class, you have a single place to understand and adjust the crawling options.
-
-    Attributes:
-        # Content Processing Parameters
-        word_count_threshold (int): Minimum word count threshold before processing content.
-                                    Default: MIN_WORD_THRESHOLD (typically 200).
-        extraction_strategy (ExtractionStrategy or None): Strategy to extract structured data from crawled pages.
-                                                          Default: None (NoExtractionStrategy is used if None).
-        chunking_strategy (ChunkingStrategy): Strategy to chunk content before extraction.
-                                              Default: RegexChunking().
-        markdown_generator (MarkdownGenerationStrategy): Strategy for generating markdown.
-                                                         Default: None.
-        content_filter (RelevantContentFilter or None): Optional filter to prune irrelevant content.
-                                                        Default: None.
-        only_text (bool): If True, attempt to extract text-only content where applicable.
-                          Default: False.
-        css_selector (str or None): CSS selector to extract a specific portion of the page.
-                                    Default: None.
-        excluded_tags (list of str or None): List of HTML tags to exclude from processing.
-                                             Default: None.
-        excluded_selector (str or None): CSS selector to exclude from processing.
-                                         Default: None.
-        keep_data_attributes (bool): If True, retain `data-*` attributes while removing unwanted attributes.
-                                     Default: False.
-        remove_forms (bool): If True, remove all `<form>` elements from the HTML.
-                             Default: False.
-        prettiify (bool): If True, apply `fast_format_html` to produce prettified HTML output.
-                          Default: False.
-        parser_type (str): Type of parser to use for HTML parsing.
-                           Default: "lxml".
-        scraping_strategy (ContentScrapingStrategy): Scraping strategy to use.
-                           Default: WebScrapingStrategy.
-        proxy_config (dict or None): Detailed proxy configuration, e.g. {"server": "...", "username": "..."}.
-                                     If None, no additional proxy config. Default: None.
-
-        # Caching Parameters
-        cache_mode (CacheMode or None): Defines how caching is handled.
-                                        If None, defaults to CacheMode.ENABLED internally.
-                                        Default: None.
-        session_id (str or None): Optional session ID to persist the browser context and the created
-                                  page instance. If the ID already exists, the crawler does not
-                                  create a new page and uses the current page to preserve the state.
-        bypass_cache (bool): Legacy parameter, if True acts like CacheMode.BYPASS.
-                             Default: False.
-        disable_cache (bool): Legacy parameter, if True acts like CacheMode.DISABLED.
-                              Default: False.
-        no_cache_read (bool): Legacy parameter, if True acts like CacheMode.WRITE_ONLY.
-                              Default: False.
-        no_cache_write (bool): Legacy parameter, if True acts like CacheMode.READ_ONLY.
-                               Default: False.
-        shared_data (dict or None): Shared data to be passed between hooks.
-                                     Default: None.
-
-        # Page Navigation and Timing Parameters
-        wait_until (str): The condition to wait for when navigating, e.g. "domcontentloaded".
-                          Default: "domcontentloaded".
-        page_timeout (int): Timeout in ms for page operations like navigation.
-                            Default: 60000 (60 seconds).
-        wait_for (str or None): A CSS selector or JS condition to wait for before extracting content.
-                                Default: None.
-        wait_for_images (bool): If True, wait for images to load before extracting content.
-                                Default: False.
-        delay_before_return_html (float): Delay in seconds before retrieving final HTML.
-                                          Default: 0.1.
-        mean_delay (float): Mean base delay between requests when calling arun_many.
-                            Default: 0.1.
-        max_range (float): Max random additional delay range for requests in arun_many.
-                           Default: 0.3.
-        semaphore_count (int): Number of concurrent operations allowed.
-                               Default: 5.
-
-        # Page Interaction Parameters
-        js_code (str or list of str or None): JavaScript code/snippets to run on the page.
-                                              Default: None.
-        js_only (bool): If True, indicates subsequent calls are JS-driven updates, not full page loads.
-                        Default: False.
-        ignore_body_visibility (bool): If True, ignore whether the body is visible before proceeding.
-                                       Default: True.
-        scan_full_page (bool): If True, scroll through the entire page to load all content.
-                               Default: False.
-        scroll_delay (float): Delay in seconds between scroll steps if scan_full_page is True.
-                              Default: 0.2.
-        process_iframes (bool): If True, attempts to process and inline iframe content.
-                                Default: False.
-        remove_overlay_elements (bool): If True, remove overlays/popups before extracting HTML.
-                                        Default: False.
-        simulate_user (bool): If True, simulate user interactions (mouse moves, clicks) for anti-bot measures.
-                              Default: False.
-        override_navigator (bool): If True, overrides navigator properties for more human-like behavior.
-                                   Default: False.
-        magic (bool): If True, attempts automatic handling of overlays/popups.
-                      Default: False.
-        adjust_viewport_to_content (bool): If True, adjust viewport according to the page content dimensions.
-                                           Default: False.
-
-        # Media Handling Parameters
-        screenshot (bool): Whether to take a screenshot after crawling.
-                           Default: False.
-        screenshot_wait_for (float or None): Additional wait time before taking a screenshot.
-                                             Default: None.
-        screenshot_height_threshold (int): Threshold for page height to decide screenshot strategy.
-                                           Default: SCREENSHOT_HEIGHT_TRESHOLD (from config, e.g. 20000).
-        pdf (bool): Whether to generate a PDF of the page.
-                    Default: False.
-        image_description_min_word_threshold (int): Minimum words for image description extraction.
-                                                    Default: IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD (e.g., 50).
-        image_score_threshold (int): Minimum score threshold for processing an image.
-                                     Default: IMAGE_SCORE_THRESHOLD (e.g., 3).
-        exclude_external_images (bool): If True, exclude all external images from processing.
-                                         Default: False.
-
-        # Link and Domain Handling Parameters
-        exclude_social_media_domains (list of str): List of domains to exclude for social media links.
-                                                    Default: SOCIAL_MEDIA_DOMAINS (from config).
-        exclude_external_links (bool): If True, exclude all external links from the results.
-                                       Default: False.
-        exclude_social_media_links (bool): If True, exclude links pointing to social media domains.
-                                           Default: False.
-        exclude_domains (list of str): List of specific domains to exclude from results.
-                                       Default: [].
-
-        # Debugging and Logging Parameters
-        verbose (bool): Enable verbose logging.
-                        Default: True.
-        log_console (bool): If True, log console messages from the page.
-                            Default: False.
-
-        # Streaming Parameters
-        stream (bool): If True, enables streaming of crawled URLs as they are processed when used with arun_many.
-                      Default: False.
-
-        # Optional Parameters
-        stream (bool): If True, stream the page content as it is being loaded.
-        url (str or None): Not a compulsory parameter. Default: None.
-        check_robots_txt (bool): Whether to check robots.txt rules before crawling. Default: False
-    """
-
-    def __init__(
-        self,
-        # Content Processing Parameters
-        word_count_threshold: int = MIN_WORD_THRESHOLD,
-        extraction_strategy: ExtractionStrategy = None,
-        chunking_strategy: ChunkingStrategy = RegexChunking(),
-        markdown_generator: MarkdownGenerationStrategy = None,
-        content_filter=None,
-        only_text: bool = False,
-        css_selector: str = None,
-        excluded_tags: list = None,
-        excluded_selector: str = None,
-        keep_data_attributes: bool = False,
-        remove_forms: bool = False,
-        prettiify: bool = False,
-        parser_type: str = "lxml",
-        scraping_strategy: ContentScrapingStrategy = None,
-        proxy_config: dict = None,
-        # SSL Parameters
-        fetch_ssl_certificate: bool = False,
-        # Caching Parameters
-        cache_mode=None,
-        session_id: str = None,
-        bypass_cache: bool = False,
-        disable_cache: bool = False,
-        no_cache_read: bool = False,
-        no_cache_write: bool = False,
-        shared_data: dict = None,
-        # Page Navigation and Timing Parameters
-        wait_until: str = "domcontentloaded",
-        page_timeout: int = PAGE_TIMEOUT,
-        wait_for: str = None,
-        wait_for_images: bool = False,
-        delay_before_return_html: float = 0.1,
-        mean_delay: float = 0.1,
-        max_range: float = 0.3,
-        semaphore_count: int = 5,
-        # Page Interaction Parameters
-        js_code: Union[str, List[str]] = None,
-        js_only: bool = False,
-        ignore_body_visibility: bool = True,
-        scan_full_page: bool = False,
-        scroll_delay: float = 0.2,
-        process_iframes: bool = False,
-        remove_overlay_elements: bool = False,
-        simulate_user: bool = False,
-        override_navigator: bool = False,
-        magic: bool = False,
-        adjust_viewport_to_content: bool = False,
-        # Media Handling Parameters
-        screenshot: bool = False,
-        screenshot_wait_for: float = None,
-        screenshot_height_threshold: int = SCREENSHOT_HEIGHT_TRESHOLD,
-        pdf: bool = False,
-        image_description_min_word_threshold: int = IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
-        image_score_threshold: int = IMAGE_SCORE_THRESHOLD,
-        exclude_external_images: bool = False,
-        # Link and Domain Handling Parameters
-        exclude_social_media_domains: list = None,
-        exclude_external_links: bool = False,
-        exclude_social_media_links: bool = False,
-        exclude_domains: list = None,
-        # Debugging and Logging Parameters
-        verbose: bool = True,
-        log_console: bool = False,
-        # Streaming Parameters
-        stream: bool = False,
-        url: str = None,
-        check_robots_txt: bool = False,
-    ):
-        self.url = url
-
-        # Content Processing Parameters
-        self.word_count_threshold = word_count_threshold
-        self.extraction_strategy = extraction_strategy
-        self.chunking_strategy = chunking_strategy
-        self.markdown_generator = markdown_generator
-        self.content_filter = content_filter
-        self.only_text = only_text
-        self.css_selector = css_selector
-        self.excluded_tags = excluded_tags or []
-        self.excluded_selector = excluded_selector or ""
-        self.keep_data_attributes = keep_data_attributes
-        self.remove_forms = remove_forms
-        self.prettiify = prettiify
-        self.parser_type = parser_type
-        self.scraping_strategy = scraping_strategy or WebScrapingStrategy()
-        self.proxy_config = proxy_config
-
-        # SSL Parameters
-        self.fetch_ssl_certificate = fetch_ssl_certificate
-
-        # Caching Parameters
-        self.cache_mode = cache_mode
-        self.session_id = session_id
-        self.bypass_cache = bypass_cache
-        self.disable_cache = disable_cache
-        self.no_cache_read = no_cache_read
-        self.no_cache_write = no_cache_write
-        self.shared_data = shared_data
-
-        # Page Navigation and Timing Parameters
-        self.wait_until = wait_until
-        self.page_timeout = page_timeout
-        self.wait_for = wait_for
-        self.wait_for_images = wait_for_images
-        self.delay_before_return_html = delay_before_return_html
-        self.mean_delay = mean_delay
-        self.max_range = max_range
-        self.semaphore_count = semaphore_count
-
-        # Page Interaction Parameters
-        self.js_code = js_code
-        self.js_only = js_only
-        self.ignore_body_visibility = ignore_body_visibility
-        self.scan_full_page = scan_full_page
-        self.scroll_delay = scroll_delay
-        self.process_iframes = process_iframes
-        self.remove_overlay_elements = remove_overlay_elements
-        self.simulate_user = simulate_user
-        self.override_navigator = override_navigator
-        self.magic = magic
-        self.adjust_viewport_to_content = adjust_viewport_to_content
-
-        # Media Handling Parameters
-        self.screenshot = screenshot
-        self.screenshot_wait_for = screenshot_wait_for
-        self.screenshot_height_threshold = screenshot_height_threshold
-        self.pdf = pdf
-        self.image_description_min_word_threshold = image_description_min_word_threshold
-        self.image_score_threshold = image_score_threshold
-        self.exclude_external_images = exclude_external_images
-
-        # Link and Domain Handling Parameters
-        self.exclude_social_media_domains = (
-            exclude_social_media_domains or SOCIAL_MEDIA_DOMAINS
-        )
-        self.exclude_external_links = exclude_external_links
-        self.exclude_social_media_links = exclude_social_media_links
-        self.exclude_domains = exclude_domains or []
-
-        # Debugging and Logging Parameters
-        self.verbose = verbose
-        self.log_console = log_console
-
-        # Streaming Parameters
-        self.stream = stream
-
-        # Robots.txt Handling Parameters
-        self.check_robots_txt = check_robots_txt
-
-        # Validate type of extraction strategy and chunking strategy if they are provided
-        if self.extraction_strategy is not None and not isinstance(
-            self.extraction_strategy, ExtractionStrategy
-        ):
-            raise ValueError(
-                "extraction_strategy must be an instance of ExtractionStrategy"
-            )
-        if self.chunking_strategy is not None and not isinstance(
-            self.chunking_strategy, ChunkingStrategy
-        ):
-            raise ValueError(
-                "chunking_strategy must be an instance of ChunkingStrategy"
-            )
-
-        # Set default chunking strategy if None
-        if self.chunking_strategy is None:
-            self.chunking_strategy = RegexChunking()
-
-    @staticmethod
-    def from_kwargs(kwargs: dict) -> "CrawlerRunConfig":
-        return CrawlerRunConfig(
-            # Content Processing Parameters
-            word_count_threshold=kwargs.get("word_count_threshold", 200),
-            extraction_strategy=kwargs.get("extraction_strategy"),
-            chunking_strategy=kwargs.get("chunking_strategy", RegexChunking()),
-            markdown_generator=kwargs.get("markdown_generator"),
-            content_filter=kwargs.get("content_filter"),
-            only_text=kwargs.get("only_text", False),
-            css_selector=kwargs.get("css_selector"),
-            excluded_tags=kwargs.get("excluded_tags", []),
-            excluded_selector=kwargs.get("excluded_selector", ""),
-            keep_data_attributes=kwargs.get("keep_data_attributes", False),
-            remove_forms=kwargs.get("remove_forms", False),
-            prettiify=kwargs.get("prettiify", False),
-            parser_type=kwargs.get("parser_type", "lxml"),
-            scraping_strategy=kwargs.get("scraping_strategy"),
-            proxy_config=kwargs.get("proxy_config"),
-            # SSL Parameters
-            fetch_ssl_certificate=kwargs.get("fetch_ssl_certificate", False),
-            # Caching Parameters
-            cache_mode=kwargs.get("cache_mode"),
-            session_id=kwargs.get("session_id"),
-            bypass_cache=kwargs.get("bypass_cache", False),
-            disable_cache=kwargs.get("disable_cache", False),
-            no_cache_read=kwargs.get("no_cache_read", False),
-            no_cache_write=kwargs.get("no_cache_write", False),
-            shared_data=kwargs.get("shared_data", None),
-            # Page Navigation and Timing Parameters
-            wait_until=kwargs.get("wait_until", "domcontentloaded"),
-            page_timeout=kwargs.get("page_timeout", 60000),
-            wait_for=kwargs.get("wait_for"),
-            wait_for_images=kwargs.get("wait_for_images", False),
-            delay_before_return_html=kwargs.get("delay_before_return_html", 0.1),
-            mean_delay=kwargs.get("mean_delay", 0.1),
-            max_range=kwargs.get("max_range", 0.3),
-            semaphore_count=kwargs.get("semaphore_count", 5),
-            # Page Interaction Parameters
-            js_code=kwargs.get("js_code"),
-            js_only=kwargs.get("js_only", False),
-            ignore_body_visibility=kwargs.get("ignore_body_visibility", True),
-            scan_full_page=kwargs.get("scan_full_page", False),
-            scroll_delay=kwargs.get("scroll_delay", 0.2),
-            process_iframes=kwargs.get("process_iframes", False),
-            remove_overlay_elements=kwargs.get("remove_overlay_elements", False),
-            simulate_user=kwargs.get("simulate_user", False),
-            override_navigator=kwargs.get("override_navigator", False),
-            magic=kwargs.get("magic", False),
-            adjust_viewport_to_content=kwargs.get("adjust_viewport_to_content", False),
-            # Media Handling Parameters
-            screenshot=kwargs.get("screenshot", False),
-            screenshot_wait_for=kwargs.get("screenshot_wait_for"),
-            screenshot_height_threshold=kwargs.get(
-                "screenshot_height_threshold", SCREENSHOT_HEIGHT_TRESHOLD
-            ),
-            pdf=kwargs.get("pdf", False),
-            image_description_min_word_threshold=kwargs.get(
-                "image_description_min_word_threshold",
-                IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
-            ),
-            image_score_threshold=kwargs.get(
-                "image_score_threshold", IMAGE_SCORE_THRESHOLD
-            ),
-            exclude_external_images=kwargs.get("exclude_external_images", False),
-            # Link and Domain Handling Parameters
-            exclude_social_media_domains=kwargs.get(
-                "exclude_social_media_domains", SOCIAL_MEDIA_DOMAINS
-            ),
-            exclude_external_links=kwargs.get("exclude_external_links", False),
-            exclude_social_media_links=kwargs.get("exclude_social_media_links", False),
-            exclude_domains=kwargs.get("exclude_domains", []),
-            # Debugging and Logging Parameters
-            verbose=kwargs.get("verbose", True),
-            log_console=kwargs.get("log_console", False),
-            # Streaming Parameters
-            stream=kwargs.get("stream", False),
-            url=kwargs.get("url"),
-            check_robots_txt=kwargs.get("check_robots_txt", False),
-        )
-
-    # Return a dict representation of this object
-    def to_dict(self):
-        return {
-            "word_count_threshold": self.word_count_threshold,
-            "extraction_strategy": self.extraction_strategy,
-            "chunking_strategy": self.chunking_strategy,
-            "markdown_generator": self.markdown_generator,
-            "content_filter": self.content_filter,
-            "only_text": self.only_text,
-            "css_selector": self.css_selector,
-            "excluded_tags": self.excluded_tags,
-            "excluded_selector": self.excluded_selector,
-            "keep_data_attributes": self.keep_data_attributes,
-            "remove_forms": self.remove_forms,
-            "prettiify": self.prettiify,
-            "parser_type": self.parser_type,
-            "scraping_strategy": self.scraping_strategy,
-            "proxy_config": self.proxy_config,
-            "fetch_ssl_certificate": self.fetch_ssl_certificate,
-            "cache_mode": self.cache_mode,
-            "session_id": self.session_id,
-            "bypass_cache": self.bypass_cache,
-            "disable_cache": self.disable_cache,
-            "no_cache_read": self.no_cache_read,
-            "no_cache_write": self.no_cache_write,
-            "shared_data": self.shared_data,
-            "wait_until": self.wait_until,
-            "page_timeout": self.page_timeout,
-            "wait_for": self.wait_for,
-            "wait_for_images": self.wait_for_images,
-            "delay_before_return_html": self.delay_before_return_html,
-            "mean_delay": self.mean_delay,
-            "max_range": self.max_range,
-            "semaphore_count": self.semaphore_count,
-            "js_code": self.js_code,
-            "js_only": self.js_only,
-            "ignore_body_visibility": self.ignore_body_visibility,
-            "scan_full_page": self.scan_full_page,
-            "scroll_delay": self.scroll_delay,
-            "process_iframes": self.process_iframes,
-            "remove_overlay_elements": self.remove_overlay_elements,
-            "simulate_user": self.simulate_user,
-            "override_navigator": self.override_navigator,
-            "magic": self.magic,
-            "adjust_viewport_to_content": self.adjust_viewport_to_content,
-            "screenshot": self.screenshot,
-            "screenshot_wait_for": self.screenshot_wait_for,
-            "screenshot_height_threshold": self.screenshot_height_threshold,
-            "pdf": self.pdf,
-            "image_description_min_word_threshold": self.image_description_min_word_threshold,
-            "image_score_threshold": self.image_score_threshold,
-            "exclude_external_images": self.exclude_external_images,
-            "exclude_social_media_domains": self.exclude_social_media_domains,
-            "exclude_external_links": self.exclude_external_links,
-            "exclude_social_media_links": self.exclude_social_media_links,
-            "exclude_domains": self.exclude_domains,
-            "verbose": self.verbose,
-            "log_console": self.log_console,
-            "stream": self.stream,
-            "url": self.url,
-            "check_robots_txt": self.check_robots_txt,
-        }
-
-    def clone(self, **kwargs):
-        """Create a copy of this configuration with updated values.
-        
-        Args:
-            **kwargs: Key-value pairs of configuration options to update
-            
-        Returns:
-            CrawlerRunConfig: A new instance with the specified updates
-            
-        Example:
-            ```python
-            # Create a new config with streaming enabled
-            stream_config = config.clone(stream=True)
-            
-            # Create a new config with multiple updates
-            new_config = config.clone(
-                stream=True,
-                cache_mode=CacheMode.BYPASS,
-                verbose=True
-            )
-            ```
-        """
-        config_dict = self.to_dict()
-        config_dict.update(kwargs)
-        return CrawlerRunConfig.from_kwargs(config_dict)
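
Note on the removed `clone` helper: it round-trips the configuration through `to_dict`, applies the overrides, and rebuilds the object with `from_kwargs`. The sketch below reproduces that pattern on a small scale. `MiniConfig` and its three options are hypothetical stand-ins chosen for illustration, not part of crawl4ai.

```python
class MiniConfig:
    """Hypothetical stand-in for a run config with only three options."""

    def __init__(self, stream=False, verbose=True, page_timeout=60000):
        self.stream = stream
        self.verbose = verbose
        self.page_timeout = page_timeout

    @staticmethod
    def from_kwargs(kwargs):
        # Missing keys fall back to the same defaults as __init__.
        return MiniConfig(
            stream=kwargs.get("stream", False),
            verbose=kwargs.get("verbose", True),
            page_timeout=kwargs.get("page_timeout", 60000),
        )

    def to_dict(self):
        return {
            "stream": self.stream,
            "verbose": self.verbose,
            "page_timeout": self.page_timeout,
        }

    def clone(self, **kwargs):
        # Copy the current values, overlay the overrides, build a new object.
        config_dict = self.to_dict()
        config_dict.update(kwargs)
        return MiniConfig.from_kwargs(config_dict)


base = MiniConfig(page_timeout=30000)
streamed = base.clone(stream=True)
```

With this shape the original object is left untouched and unspecified options carry over. The copy is shallow, so any mutable values (lists, dicts) would be shared between the original and the clone.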
--- a/crawl4ai/async_crawler_strategy.py
+++ b/crawl4ai/async_crawler_strategy.py
--- a/crawl4ai/async_database.py
+++ b/crawl4ai/async_database.py
@@ -2,232 +2,19 @@ import os
 from pathlib import Path
 import aiosqlite
 import asyncio
-from typing import Optional, Dict
-from contextlib import asynccontextmanager
-import logging
-import json  # Added for serialization/deserialization
-from .utils import ensure_content_dirs, generate_content_hash
-from .models import CrawlResult, MarkdownGenerationResult
-import aiofiles
-from .version_manager import VersionManager
-from .async_logger import AsyncLogger
-from .utils import get_error_context, create_box_message
+from typing import Optional, Tuple

-# Set up logging
-# logging.basicConfig(level=logging.INFO)
-# logger = logging.getLogger(__name__)
-# logger.setLevel(logging.INFO)
-
-base_directory = DB_PATH = os.path.join(
-    os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai"
-)
+DB_PATH = os.path.join(Path.home(), ".crawl4ai")
 os.makedirs(DB_PATH, exist_ok=True)
-DB_PATH = os.path.join(base_directory, "crawl4ai.db")
-
+DB_PATH = os.path.join(DB_PATH, "crawl4ai.db")

 class AsyncDatabaseManager:
-    def __init__(self, pool_size: int = 10, max_retries: int = 3):
+    def __init__(self):
        self.db_path = DB_PATH
-        self.content_paths = ensure_content_dirs(os.path.dirname(DB_PATH))
-        self.pool_size = pool_size
-        self.max_retries = max_retries
-        self.connection_pool: Dict[int, aiosqlite.Connection] = {}
-        self.pool_lock = asyncio.Lock()
-        self.init_lock = asyncio.Lock()
-        self.connection_semaphore = asyncio.Semaphore(pool_size)
-        self._initialized = False
-        self.version_manager = VersionManager()
-        self.logger = AsyncLogger(
-            log_file=os.path.join(base_directory, ".crawl4ai", "crawler_db.log"),
-            verbose=False,
-            tag_width=10,
-        )
-
-    async def initialize(self):
-        """Initialize the database and connection pool"""
-        try:
-            self.logger.info("Initializing database", tag="INIT")
-            # Ensure the database file exists
-            os.makedirs(os.path.dirname(self.db_path), exist_ok=True)
-
-            # Check if version update is needed
-            needs_update = self.version_manager.needs_update()
-
-            # Always ensure base table exists
-            await self.ainit_db()
-
-            # Verify the table exists
-            async with aiosqlite.connect(self.db_path, timeout=30.0) as db:
-                async with db.execute(
-                    "SELECT name FROM sqlite_master WHERE type='table' AND name='crawled_data'"
-                ) as cursor:
-                    result = await cursor.fetchone()
-                    if not result:
-                        raise Exception("crawled_data table was not created")
-
-            # If version changed or fresh install, run updates
-            if needs_update:
-                self.logger.info("New version detected, running updates", tag="INIT")
-                await self.update_db_schema()
-                from .migrations import (
-                    run_migration,
-                )  # Import here to avoid circular imports
-
-                await run_migration()
-                self.version_manager.update_version()  # Update stored version after successful migration
-                self.logger.success(
-                    "Version update completed successfully", tag="COMPLETE"
-                )
-            else:
-                self.logger.success(
-                    "Database initialization completed successfully", tag="COMPLETE"
-                )
-
-        except Exception as e:
-            self.logger.error(
-                message="Database initialization error: {error}",
-                tag="ERROR",
-                params={"error": str(e)},
-            )
-            self.logger.info(
-                message="Database will be initialized on first use", tag="INIT"
-            )
-
-            raise
-
-    async def cleanup(self):
-        """Cleanup connections when shutting down"""
-        async with self.pool_lock:
-            for conn in self.connection_pool.values():
-                await conn.close()
-            self.connection_pool.clear()
-
-    @asynccontextmanager
-    async def get_connection(self):
-        """Connection pool manager with enhanced error handling"""
-        if not self._initialized:
-            async with self.init_lock:
-                if not self._initialized:
-                    try:
-                        await self.initialize()
-                        self._initialized = True
-                    except Exception as e:
-                        import sys
-
-                        error_context = get_error_context(sys.exc_info())
-                        self.logger.error(
-                            message="Database initialization failed:\n{error}\n\nContext:\n{context}\n\nTraceback:\n{traceback}",
-                            tag="ERROR",
-                            force_verbose=True,
-                            params={
-                                "error": str(e),
-                                "context": error_context["code_context"],
-                                "traceback": error_context["full_traceback"],
-                            },
-                        )
-                        raise
-
-        await self.connection_semaphore.acquire()
-        task_id = id(asyncio.current_task())
-
-        try:
-            async with self.pool_lock:
-                if task_id not in self.connection_pool:
-                    try:
-                        conn = await aiosqlite.connect(self.db_path, timeout=30.0)
-                        await conn.execute("PRAGMA journal_mode = WAL")
-                        await conn.execute("PRAGMA busy_timeout = 5000")
-
-                        # Verify database structure
-                        async with conn.execute(
-                            "PRAGMA table_info(crawled_data)"
-                        ) as cursor:
-                            columns = await cursor.fetchall()
-                            column_names = [col[1] for col in columns]
-                            expected_columns = {
-                                "url",
-                                "html",
-                                "cleaned_html",
-                                "markdown",
-                                "extracted_content",
-                                "success",
-                                "media",
-                                "links",
-                                "metadata",
-                                "screenshot",
-                                "response_headers",
-                                "downloaded_files",
-                            }
-                            missing_columns = expected_columns - set(column_names)
-                            if missing_columns:
-                                raise ValueError(
-                                    f"Database missing columns: {missing_columns}"
-                                )
-
-                        self.connection_pool[task_id] = conn
-                    except Exception as e:
-                        import sys
-
-                        error_context = get_error_context(sys.exc_info())
-                        error_message = (
-                            f"Unexpected error in db get_connection at line {error_context['line_no']} "
-                            f"in {error_context['function']} ({error_context['filename']}):\n"
-                            f"Error: {str(e)}\n\n"
-                            f"Code context:\n{error_context['code_context']}"
-                        )
-                        self.logger.error(
-                            message=create_box_message(error_message, type="error"),
-                        )
-
-                        raise
-
-            yield self.connection_pool[task_id]
-
-        except Exception as e:
-            import sys
-
-            error_context = get_error_context(sys.exc_info())
-            error_message = (
-                f"Unexpected error in db get_connection at line {error_context['line_no']} "
-                f"in {error_context['function']} ({error_context['filename']}):\n"
-                f"Error: {str(e)}\n\n"
-                f"Code context:\n{error_context['code_context']}"
-            )
-            self.logger.error(
-                message=create_box_message(error_message, type="error"),
-            )
-            raise
-        finally:
-            async with self.pool_lock:
-                if task_id in self.connection_pool:
-                    await self.connection_pool[task_id].close()
-                    del self.connection_pool[task_id]
-            self.connection_semaphore.release()
-
-    async def execute_with_retry(self, operation, *args):
-        """Execute database operations with retry logic"""
-        for attempt in range(self.max_retries):
-            try:
-                async with self.get_connection() as db:
-                    result = await operation(db, *args)
-                    await db.commit()
-                    return result
-            except Exception as e:
-                if attempt == self.max_retries - 1:
-                    self.logger.error(
-                        message="Operation failed after {retries} attempts: {error}",
-                        tag="ERROR",
-                        force_verbose=True,
-                        params={"retries": self.max_retries, "error": str(e)},
-                    )
-                    raise
-                await asyncio.sleep(1 * (attempt + 1))  # Linear backoff: 1s, 2s, ...

    async def ainit_db(self):
-        """Initialize database schema"""
-        async with aiosqlite.connect(self.db_path, timeout=30.0) as db:
-            await db.execute(
-                """
+        async with aiosqlite.connect(self.db_path) as db:
+            await db.execute('''
                CREATE TABLE IF NOT EXISTS crawled_data (
                    url TEXT PRIMARY KEY,
                    html TEXT,
@@ -238,321 +25,90 @@ class AsyncDatabaseManager:
                    media TEXT DEFAULT "{}",
                    links TEXT DEFAULT "{}",
                    metadata TEXT DEFAULT "{}",
-                    screenshot TEXT DEFAULT "",
-                    response_headers TEXT DEFAULT "{}",
-                    downloaded_files TEXT DEFAULT "{}"  -- New column added
+                    screenshot TEXT DEFAULT ""
                )
-            """
-            )
+            ''')
            await db.commit()
+        await self.update_db_schema()

    async def update_db_schema(self):
-        """Update database schema if needed"""
-        async with aiosqlite.connect(self.db_path, timeout=30.0) as db:
+        async with aiosqlite.connect(self.db_path) as db:
+            # Check if the 'media' column exists
            cursor = await db.execute("PRAGMA table_info(crawled_data)")
            columns = await cursor.fetchall()
            column_names = [column[1] for column in columns]
-
-            # List of new columns to add
-            new_columns = [
-                "media",
-                "links",
-                "metadata",
-                "screenshot",
-                "response_headers",
-                "downloaded_files",
-            ]
-
-            for column in new_columns:
+
+            if 'media' not in column_names:
+                await self.aalter_db_add_column('media')
+
+            # Check for other missing columns and add them if necessary
+            for column in ['links', 'metadata', 'screenshot']:
                if column not in column_names:
-                    await self.aalter_db_add_column(column, db)
-            await db.commit()
-
-    async def aalter_db_add_column(self, new_column: str, db):
-        """Add new column to the database"""
-        if new_column == "response_headers":
-            await db.execute(
-                f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT "{{}}"'
-            )
-        else:
-            await db.execute(
-                f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""'
-            )
-        self.logger.info(
-            message="Added column '{column}' to the database",
-            tag="INIT",
-            params={"column": new_column},
-        )
-
-    async def aget_cached_url(self, url: str) -> Optional[CrawlResult]:
-        """Retrieve cached URL data as CrawlResult"""
-
-        async def _get(db):
-            async with db.execute(
-                "SELECT * FROM crawled_data WHERE url = ?", (url,)
-            ) as cursor:
-                row = await cursor.fetchone()
-                if not row:
-                    return None
-
-                # Get column names
-                columns = [description[0] for description in cursor.description]
-                # Create dict from row data
-                row_dict = dict(zip(columns, row))
-
-                # Load content from files using stored hashes
-                content_fields = {
-                    "html": row_dict["html"],
-                    "cleaned_html": row_dict["cleaned_html"],
-                    "markdown": row_dict["markdown"],
-                    "extracted_content": row_dict["extracted_content"],
-                    "screenshot": row_dict["screenshot"],
-                    "screenshots": row_dict["screenshot"],
-                }
-
-                for field, hash_value in content_fields.items():
-                    if hash_value:
-                        content = await self._load_content(
-                            hash_value,
-                            field.split("_")[0],  # Get content type from field name
-                        )
-                        row_dict[field] = content or ""
-                    else:
-                        row_dict[field] = ""
-
-                # Parse JSON fields
-                json_fields = [
-                    "media",
-                    "links",
-                    "metadata",
-                    "response_headers",
-                    "markdown",
-                ]
-                for field in json_fields:
-                    try:
-                        row_dict[field] = (
-                            json.loads(row_dict[field]) if row_dict[field] else {}
-                        )
-                    except json.JSONDecodeError:
-                        # Very UGLY, never mention it to me please
-                        if field == "markdown" and isinstance(row_dict[field], str):
-                            row_dict[field] = row_dict[field]
-                        else:
-                            row_dict[field] = {}
-
-                if isinstance(row_dict["markdown"], Dict):
-                    row_dict["markdown_v2"] = row_dict["markdown"]
-                    if row_dict["markdown"].get("raw_markdown"):
-                        row_dict["markdown"] = row_dict["markdown"]["raw_markdown"]
-
-                # Parse downloaded_files
-                try:
-                    row_dict["downloaded_files"] = (
-                        json.loads(row_dict["downloaded_files"])
-                        if row_dict["downloaded_files"]
-                        else []
-                    )
-                except json.JSONDecodeError:
-                    row_dict["downloaded_files"] = []
-
-                # Remove any fields not in CrawlResult model
-                valid_fields = CrawlResult.__annotations__.keys()
-                filtered_dict = {k: v for k, v in row_dict.items() if k in valid_fields}
-
-                return CrawlResult(**filtered_dict)
+                    await self.aalter_db_add_column(column)

+    async def aalter_db_add_column(self, new_column: str):
        try:
-            return await self.execute_with_retry(_get)
+            async with aiosqlite.connect(self.db_path) as db:
+                await db.execute(f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""')
+                await db.commit()
+            print(f"Added column '{new_column}' to the database.")
        except Exception as e:
-            self.logger.error(
-                message="Error retrieving cached URL: {error}",
-                tag="ERROR",
-                force_verbose=True,
-                params={"error": str(e)},
-            )
+            print(f"Error altering database to add {new_column} column: {e}")
+
+    async def aget_cached_url(self, url: str) -> Optional[Tuple[str, str, str, str, str, bool, str, str, str, str]]:
+        try:
+            async with aiosqlite.connect(self.db_path) as db:
+                async with db.execute('SELECT url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot FROM crawled_data WHERE url = ?', (url,)) as cursor:
+                    return await cursor.fetchone()
+        except Exception as e:
+            print(f"Error retrieving cached URL: {e}")
            return None

-    async def acache_url(self, result: CrawlResult):
-        """Cache CrawlResult data"""
-        # Store content files and get hashes
-        content_map = {
-            "html": (result.html, "html"),
-            "cleaned_html": (result.cleaned_html or "", "cleaned"),
-            "markdown": None,
-            "extracted_content": (result.extracted_content or "", "extracted"),
-            "screenshot": (result.screenshot or "", "screenshots"),
-        }
-
+    async def acache_url(self, url: str, html: str, cleaned_html: str, markdown: str, extracted_content: str, success: bool, media: str = "{}", links: str = "{}", metadata: str = "{}", screenshot: str = ""):
        try:
-            if isinstance(result.markdown, MarkdownGenerationResult):
-                content_map["markdown"] = (
-                    result.markdown.model_dump_json(),
-                    "markdown",
-                )
-            elif hasattr(result, "markdown_v2"):
-                content_map["markdown"] = (
-                    result.markdown_v2.model_dump_json(),
-                    "markdown",
-                )
-            elif isinstance(result.markdown, str):
-                markdown_result = MarkdownGenerationResult(raw_markdown=result.markdown)
-                content_map["markdown"] = (
-                    markdown_result.model_dump_json(),
-                    "markdown",
-                )
-            else:
-                content_map["markdown"] = (
-                    MarkdownGenerationResult().model_dump_json(),
-                    "markdown",
-                )
+            async with aiosqlite.connect(self.db_path) as db:
+                await db.execute('''
+                    INSERT INTO crawled_data (url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot)
+                    VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
+                    ON CONFLICT(url) DO UPDATE SET
+                        html = excluded.html,
+                        cleaned_html = excluded.cleaned_html,
+                        markdown = excluded.markdown,
+                        extracted_content = excluded.extracted_content,
+                        success = excluded.success,
+                        media = excluded.media,
+                        links = excluded.links,
+                        metadata = excluded.metadata,
+                        screenshot = excluded.screenshot
+                ''', (url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot))
+                await db.commit()
        except Exception as e:
-            self.logger.warning(
-                message=f"Error processing markdown content: {str(e)}", tag="WARNING"
-            )
-            # Fallback to empty markdown result
-            content_map["markdown"] = (
-                MarkdownGenerationResult().model_dump_json(),
-                "markdown",
-            )
-
-        content_hashes = {}
-        for field, (content, content_type) in content_map.items():
-            content_hashes[field] = await self._store_content(content, content_type)
-
-        async def _cache(db):
-            await db.execute(
-                """
-                INSERT INTO crawled_data (
-                    url, html, cleaned_html, markdown,
-                    extracted_content, success, media, links, metadata,
-                    screenshot, response_headers, downloaded_files
-                )
-                VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
-                ON CONFLICT(url) DO UPDATE SET
-                    html = excluded.html,
-                    cleaned_html = excluded.cleaned_html,
-                    markdown = excluded.markdown,
-                    extracted_content = excluded.extracted_content,
-                    success = excluded.success,
-                    media = excluded.media,
-                    links = excluded.links,
-                    metadata = excluded.metadata,
-                    screenshot = excluded.screenshot,
-                    response_headers = excluded.response_headers,
-                    downloaded_files = excluded.downloaded_files
-            """,
-                (
-                    result.url,
-                    content_hashes["html"],
-                    content_hashes["cleaned_html"],
-                    content_hashes["markdown"],
-                    content_hashes["extracted_content"],
-                    result.success,
-                    json.dumps(result.media),
-                    json.dumps(result.links),
-                    json.dumps(result.metadata or {}),
-                    content_hashes["screenshot"],
-                    json.dumps(result.response_headers or {}),
-                    json.dumps(result.downloaded_files or []),
-                ),
-            )
-
-        try:
-            await self.execute_with_retry(_cache)
-        except Exception as e:
-            self.logger.error(
-                message="Error caching URL: {error}",
-                tag="ERROR",
-                force_verbose=True,
-                params={"error": str(e)},
-            )
+            print(f"Error caching URL: {e}")

    async def aget_total_count(self) -> int:
-        """Get total number of cached URLs"""
-
-        async def _count(db):
-            async with db.execute("SELECT COUNT(*) FROM crawled_data") as cursor:
-                result = await cursor.fetchone()
-                return result[0] if result else 0
-
        try:
-            return await self.execute_with_retry(_count)
+            async with aiosqlite.connect(self.db_path) as db:
+                async with db.execute('SELECT COUNT(*) FROM crawled_data') as cursor:
+                    result = await cursor.fetchone()
+                    return result[0] if result else 0
        except Exception as e:
-            self.logger.error(
-                message="Error getting total count: {error}",
-                tag="ERROR",
-                force_verbose=True,
-                params={"error": str(e)},
-            )
+            print(f"Error getting total count: {e}")
            return 0

    async def aclear_db(self):
-        """Clear all data from the database"""
-
-        async def _clear(db):
-            await db.execute("DELETE FROM crawled_data")
-
        try:
-            await self.execute_with_retry(_clear)
+            async with aiosqlite.connect(self.db_path) as db:
+                await db.execute('DELETE FROM crawled_data')
+                await db.commit()
        except Exception as e:
-            self.logger.error(
-                message="Error clearing database: {error}",
-                tag="ERROR",
-                force_verbose=True,
-                params={"error": str(e)},
-            )
+            print(f"Error clearing database: {e}")

    async def aflush_db(self):
-        """Drop the entire table"""
-
-        async def _flush(db):
-            await db.execute("DROP TABLE IF EXISTS crawled_data")
-
        try:
-            await self.execute_with_retry(_flush)
+            async with aiosqlite.connect(self.db_path) as db:
+                await db.execute('DROP TABLE IF EXISTS crawled_data')
+                await db.commit()
        except Exception as e:
-            self.logger.error(
-                message="Error flushing database: {error}",
-                tag="ERROR",
-                force_verbose=True,
-                params={"error": str(e)},
-            )
+            print(f"Error flushing database: {e}")

-    async def _store_content(self, content: str, content_type: str) -> str:
-        """Store content in filesystem and return hash"""
-        if not content:
-            return ""
-
-        content_hash = generate_content_hash(content)
-        file_path = os.path.join(self.content_paths[content_type], content_hash)
-
-        # Only write if file doesn't exist
-        if not os.path.exists(file_path):
-            async with aiofiles.open(file_path, "w", encoding="utf-8") as f:
-                await f.write(content)
-
-        return content_hash
-
-    async def _load_content(
-        self, content_hash: str, content_type: str
-    ) -> Optional[str]:
-        """Load content from filesystem by hash"""
-        if not content_hash:
-            return None
-
-        file_path = os.path.join(self.content_paths[content_type], content_hash)
-        try:
-            async with aiofiles.open(file_path, "r", encoding="utf-8") as f:
-                return await f.read()
-        except:
-            self.logger.error(
-                message="Failed to load content: {file_path}",
-                tag="ERROR",
-                force_verbose=True,
-                params={"file_path": file_path},
-            )
-            return None
-
-
-# Create a singleton instance
-async_db_manager = AsyncDatabaseManager()
+async_db_manager = AsyncDatabaseManager()
--- a/crawl4ai/async_dispatcher.py
+++ b/crawl4ai/async_dispatcher.py
@@ -1,647 +0,0 @@
-from typing import Dict, Optional, List, Tuple
-from .async_configs import CrawlerRunConfig
-from .models import (
-    CrawlResult,
-    CrawlerTaskResult,
-    CrawlStatus,
-    DisplayMode,
-    CrawlStats,
-    DomainState,
-)
-
-from rich.live import Live
-from rich.table import Table
-from rich.console import Console
-from rich import box
-from datetime import datetime, timedelta
-from collections.abc import AsyncGenerator
-import time
-import psutil
-import asyncio
-import uuid
-
-from urllib.parse import urlparse
-import random
-from abc import ABC, abstractmethod
-
-
-
-class RateLimiter:
-    def __init__(
-        self,
-        base_delay: Tuple[float, float] = (1.0, 3.0),
-        max_delay: float = 60.0,
-        max_retries: int = 3,
-        rate_limit_codes: List[int] = None,
-    ):
-        self.base_delay = base_delay
-        self.max_delay = max_delay
-        self.max_retries = max_retries
-        self.rate_limit_codes = rate_limit_codes or [429, 503]
-        self.domains: Dict[str, DomainState] = {}
-
-    def get_domain(self, url: str) -> str:
-        return urlparse(url).netloc
-
-    async def wait_if_needed(self, url: str) -> None:
-        domain = self.get_domain(url)
-        state = self.domains.get(domain)
-
-        if not state:
-            self.domains[domain] = DomainState()
-            state = self.domains[domain]
-
-        now = time.time()
-        if state.last_request_time:
-            wait_time = max(0, state.current_delay - (now - state.last_request_time))
-            if wait_time > 0:
-                await asyncio.sleep(wait_time)
-
-        # Random delay within base range if no current delay
-        if state.current_delay == 0:
-            state.current_delay = random.uniform(*self.base_delay)
-
-        state.last_request_time = time.time()
-
-    def update_delay(self, url: str, status_code: int) -> bool:
-        domain = self.get_domain(url)
-        state = self.domains[domain]
-
-        if status_code in self.rate_limit_codes:
-            state.fail_count += 1
-            if state.fail_count > self.max_retries:
-                return False
-
-            # Exponential backoff with random jitter
-            state.current_delay = min(
-                state.current_delay * 2 * random.uniform(0.75, 1.25), self.max_delay
-            )
-        else:
-            # Gradually reduce delay on success
-            state.current_delay = max(
-                random.uniform(*self.base_delay), state.current_delay * 0.75
-            )
-            state.fail_count = 0
-
-        return True
-
-
-class CrawlerMonitor:
-    def __init__(
-        self,
-        max_visible_rows: int = 15,
-        display_mode: DisplayMode = DisplayMode.DETAILED,
-    ):
-        self.console = Console()
-        self.max_visible_rows = max_visible_rows
-        self.display_mode = display_mode
-        self.stats: Dict[str, CrawlStats] = {}
-        self.process = psutil.Process()
-        self.start_time = datetime.now()
-        self.live = Live(self._create_table(), refresh_per_second=2)
-
-    def start(self):
-        self.live.start()
-
-    def stop(self):
-        self.live.stop()
-
-    def add_task(self, task_id: str, url: str):
-        self.stats[task_id] = CrawlStats(
-            task_id=task_id, url=url, status=CrawlStatus.QUEUED
-        )
-        self.live.update(self._create_table())
-
-    def update_task(self, task_id: str, **kwargs):
-        if task_id in self.stats:
-            for key, value in kwargs.items():
-                setattr(self.stats[task_id], key, value)
-            self.live.update(self._create_table())
-
-    def _create_aggregated_table(self) -> Table:
-        """Creates a compact table showing only aggregated statistics"""
-        table = Table(
-            box=box.ROUNDED,
-            title="Crawler Status Overview",
-            title_style="bold magenta",
-            header_style="bold blue",
-            show_lines=True,
-        )
-
-        # Calculate statistics
-        total_tasks = len(self.stats)
-        queued = sum(
-            1 for stat in self.stats.values() if stat.status == CrawlStatus.QUEUED
-        )
-        in_progress = sum(
-            1 for stat in self.stats.values() if stat.status == CrawlStatus.IN_PROGRESS
-        )
-        completed = sum(
-            1 for stat in self.stats.values() if stat.status == CrawlStatus.COMPLETED
-        )
-        failed = sum(
-            1 for stat in self.stats.values() if stat.status == CrawlStatus.FAILED
-        )
-
-        # Memory statistics
-        current_memory = self.process.memory_info().rss / (1024 * 1024)
-        total_task_memory = sum(stat.memory_usage for stat in self.stats.values())
-        peak_memory = max(
-            (stat.peak_memory for stat in self.stats.values()), default=0.0
-        )
-
-        # Duration
-        duration = datetime.now() - self.start_time
-
-        # Create status row
-        table.add_column("Status", style="bold cyan")
-        table.add_column("Count", justify="right")
-        table.add_column("Percentage", justify="right")
-
-        table.add_row("Total Tasks", str(total_tasks), "100%")
-        table.add_row(
-            "[yellow]In Queue[/yellow]",
-            str(queued),
-            f"{(queued/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
-        )
-        table.add_row(
-            "[blue]In Progress[/blue]",
-            str(in_progress),
-            f"{(in_progress/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
-        )
-        table.add_row(
-            "[green]Completed[/green]",
-            str(completed),
-            f"{(completed/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
-        )
-        table.add_row(
-            "[red]Failed[/red]",
-            str(failed),
-            f"{(failed/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
-        )
-
-        # Add memory information
-        table.add_section()
-        table.add_row(
-            "[magenta]Current Memory[/magenta]", f"{current_memory:.1f} MB", ""
-        )
-        table.add_row(
-            "[magenta]Total Task Memory[/magenta]", f"{total_task_memory:.1f} MB", ""
-        )
-        table.add_row(
-            "[magenta]Peak Task Memory[/magenta]", f"{peak_memory:.1f} MB", ""
-        )
-        table.add_row(
-            "[yellow]Runtime[/yellow]",
-            str(timedelta(seconds=int(duration.total_seconds()))),
-            "",
-        )
-
-        return table
-
-    def _create_detailed_table(self) -> Table:
-        table = Table(
-            box=box.ROUNDED,
-            title="Crawler Performance Monitor",
-            title_style="bold magenta",
-            header_style="bold blue",
-        )
-
-        # Add columns
-        table.add_column("Task ID", style="cyan", no_wrap=True)
-        table.add_column("URL", style="cyan", no_wrap=True)
-        table.add_column("Status", style="bold")
-        table.add_column("Memory (MB)", justify="right")
-        table.add_column("Peak (MB)", justify="right")
-        table.add_column("Duration", justify="right")
-        table.add_column("Info", style="italic")
-
-        # Add summary row
-        total_memory = sum(stat.memory_usage for stat in self.stats.values())
-        active_count = sum(
-            1 for stat in self.stats.values() if stat.status == CrawlStatus.IN_PROGRESS
-        )
-        completed_count = sum(
-            1 for stat in self.stats.values() if stat.status == CrawlStatus.COMPLETED
-        )
-        failed_count = sum(
-            1 for stat in self.stats.values() if stat.status == CrawlStatus.FAILED
-        )
-
-        table.add_row(
-            "[bold yellow]SUMMARY",
-            f"Total: {len(self.stats)}",
-            f"Active: {active_count}",
-            f"{total_memory:.1f}",
-            f"{self.process.memory_info().rss / (1024 * 1024):.1f}",
-            str(
-                timedelta(
-                    seconds=int((datetime.now() - self.start_time).total_seconds())
-                )
-            ),
-            f"✓{completed_count} ✗{failed_count}",
-            style="bold",
-        )
-
-        table.add_section()
-
-        # Add rows for each task
-        visible_stats = sorted(
-            self.stats.values(),
-            key=lambda x: (
-                x.status != CrawlStatus.IN_PROGRESS,
-                x.status != CrawlStatus.QUEUED,
-                x.end_time or datetime.max,
-            ),
-        )[: self.max_visible_rows]
-
-        for stat in visible_stats:
-            status_style = {
-                CrawlStatus.QUEUED: "white",
-                CrawlStatus.IN_PROGRESS: "yellow",
-                CrawlStatus.COMPLETED: "green",
-                CrawlStatus.FAILED: "red",
-            }[stat.status]
-
-            table.add_row(
-                stat.task_id[:8],  # Show first 8 chars of task ID
-                stat.url[:40] + "..." if len(stat.url) > 40 else stat.url,
-                f"[{status_style}]{stat.status.value}[/{status_style}]",
-                f"{stat.memory_usage:.1f}",
-                f"{stat.peak_memory:.1f}",
-                stat.duration,
-                stat.error_message[:40] if stat.error_message else "",
-            )
-
-        return table
-
-    def _create_table(self) -> Table:
-        """Creates the appropriate table based on display mode"""
-        if self.display_mode == DisplayMode.AGGREGATED:
-            return self._create_aggregated_table()
-        return self._create_detailed_table()
-
-
-class BaseDispatcher(ABC):
-    def __init__(
-        self,
-        rate_limiter: Optional[RateLimiter] = None,
-        monitor: Optional[CrawlerMonitor] = None,
-    ):
-        self.crawler = None
-        self._domain_last_hit: Dict[str, float] = {}
-        self.concurrent_sessions = 0
-        self.rate_limiter = rate_limiter
-        self.monitor = monitor
-
-    @abstractmethod
-    async def crawl_url(
-        self,
-        url: str,
-        config: CrawlerRunConfig,
-        task_id: str,
-        monitor: Optional[CrawlerMonitor] = None,
-    ) -> CrawlerTaskResult:
-        pass
-
-    @abstractmethod
-    async def run_urls(
-        self,
-        urls: List[str],
-        crawler: "AsyncWebCrawler",  # noqa: F821
-        config: CrawlerRunConfig,
-        monitor: Optional[CrawlerMonitor] = None,
-    ) -> List[CrawlerTaskResult]:
-        pass
-
-
-class MemoryAdaptiveDispatcher(BaseDispatcher):
-    def __init__(
-        self,
-        memory_threshold_percent: float = 90.0,
-        check_interval: float = 1.0,
-        max_session_permit: int = 20,
-        memory_wait_timeout: float = 300.0,  # 5 minutes default timeout
-        rate_limiter: Optional[RateLimiter] = None,
-        monitor: Optional[CrawlerMonitor] = None,
-    ):
-        super().__init__(rate_limiter, monitor)
-        self.memory_threshold_percent = memory_threshold_percent
-        self.check_interval = check_interval
-        self.max_session_permit = max_session_permit
-        self.memory_wait_timeout = memory_wait_timeout
-        self.result_queue = asyncio.Queue()  # Queue for storing results
-
-    async def crawl_url(
-        self,
-        url: str,
-        config: CrawlerRunConfig,
-        task_id: str,
-    ) -> CrawlerTaskResult:
-        start_time = datetime.now()
-        error_message = ""
-        memory_usage = peak_memory = 0.0
-
-        try:
-            if self.monitor:
-                self.monitor.update_task(
-                    task_id, status=CrawlStatus.IN_PROGRESS, start_time=start_time
-                )
-            self.concurrent_sessions += 1
-
-            if self.rate_limiter:
-                await self.rate_limiter.wait_if_needed(url)
-
-            process = psutil.Process()
-            start_memory = process.memory_info().rss / (1024 * 1024)
-            result = await self.crawler.arun(url, config=config, session_id=task_id)
-            end_memory = process.memory_info().rss / (1024 * 1024)
-
-            memory_usage = peak_memory = end_memory - start_memory
-
-            if self.rate_limiter and result.status_code:
-                if not self.rate_limiter.update_delay(url, result.status_code):
-                    error_message = f"Rate limit retry count exceeded for domain {urlparse(url).netloc}"
-                    if self.monitor:
-                        self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
-                    result = CrawlerTaskResult(
-                        task_id=task_id,
-                        url=url,
-                        result=result,
-                        memory_usage=memory_usage,
-                        peak_memory=peak_memory,
-                        start_time=start_time,
-                        end_time=datetime.now(),
-                        error_message=error_message,
-                    )
-                    await self.result_queue.put(result)
-                    return result
-
-            if not result.success:
-                error_message = result.error_message
-                if self.monitor:
-                    self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
-            elif self.monitor:
-                self.monitor.update_task(task_id, status=CrawlStatus.COMPLETED)
-
-        except Exception as e:
-            error_message = str(e)
-            if self.monitor:
-                self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
-            result = CrawlResult(
-                url=url, html="", metadata={}, success=False, error_message=str(e)
-            )
-
-        finally:
-            end_time = datetime.now()
-            if self.monitor:
-                self.monitor.update_task(
-                    task_id,
-                    end_time=end_time,
-                    memory_usage=memory_usage,
-                    peak_memory=peak_memory,
-                    error_message=error_message,
-                )
-            self.concurrent_sessions -= 1
-
-        return CrawlerTaskResult(
-            task_id=task_id,
-            url=url,
-            result=result,
-            memory_usage=memory_usage,
-            peak_memory=peak_memory,
-            start_time=start_time,
-            end_time=end_time,
-            error_message=error_message,
-        )
-
-    async def run_urls(
-        self,
-        urls: List[str],
-        crawler: "AsyncWebCrawler",  # noqa: F821
-        config: CrawlerRunConfig,
-        ) -> List[CrawlerTaskResult]:
-            self.crawler = crawler
-
-            if self.monitor:
-                self.monitor.start()
-
-            try:
-                pending_tasks = []
-                active_tasks = []
-                task_queue = []
-
-                for url in urls:
-                    task_id = str(uuid.uuid4())
-                    if self.monitor:
-                        self.monitor.add_task(task_id, url)
-                    task_queue.append((url, task_id))
-
-                while task_queue or active_tasks:
-                    wait_start_time = time.time()
-                    while len(active_tasks) < self.max_session_permit and task_queue:
-                        if psutil.virtual_memory().percent >= self.memory_threshold_percent:
-                            # Check if we've exceeded the timeout
-                            if time.time() - wait_start_time > self.memory_wait_timeout:
-                                raise MemoryError(
-                                    f"Memory usage above threshold ({self.memory_threshold_percent}%) for more than {self.memory_wait_timeout} seconds"
-                                )
-                            await asyncio.sleep(self.check_interval)
-                            continue
-
-                        url, task_id = task_queue.pop(0)
-                        task = asyncio.create_task(self.crawl_url(url, config, task_id))
-                        active_tasks.append(task)
-
-                    if not active_tasks:
-                        await asyncio.sleep(self.check_interval)
-                        continue
-
-                    done, pending = await asyncio.wait(
-                        active_tasks, return_when=asyncio.FIRST_COMPLETED
-                    )
-
-                    pending_tasks.extend(done)
-                    active_tasks = list(pending)
-
-                return await asyncio.gather(*pending_tasks)
-            finally:
-                if self.monitor:
-                    self.monitor.stop()
-
-    async def run_urls_stream(
-        self,
-        urls: List[str],
-        crawler: "AsyncWebCrawler",
-        config: CrawlerRunConfig,
-    ) -> AsyncGenerator[CrawlerTaskResult, None]:
-        self.crawler = crawler
-        if self.monitor:
-            self.monitor.start()
-
-        try:
-            active_tasks = []
-            task_queue = []
-            completed_count = 0
-            total_urls = len(urls)
-
-            # Initialize task queue
-            for url in urls:
-                task_id = str(uuid.uuid4())
-                if self.monitor:
-                    self.monitor.add_task(task_id, url)
-                task_queue.append((url, task_id))
-
-            while completed_count < total_urls:
-                # Start new tasks if memory permits
-                while len(active_tasks) < self.max_session_permit and task_queue:
-                    if psutil.virtual_memory().percent >= self.memory_threshold_percent:
-                        await asyncio.sleep(self.check_interval)
-                        continue
-
-                    url, task_id = task_queue.pop(0)
-                    task = asyncio.create_task(self.crawl_url(url, config, task_id))
-                    active_tasks.append(task)
-
-                if not active_tasks and not task_queue:
-                    break
-
-                # Wait for any task to complete and yield results
-                if active_tasks:
-                    done, pending = await asyncio.wait(
-                        active_tasks,
-                        timeout=0.1,
-                        return_when=asyncio.FIRST_COMPLETED
-                    )
-                    for completed_task in done:
-                        result = await completed_task
-                        completed_count += 1
-                        yield result
-                    active_tasks = list(pending)
-                else:
-                    await asyncio.sleep(self.check_interval)
-
-        finally:
-            if self.monitor:
-                self.monitor.stop()
-
-class SemaphoreDispatcher(BaseDispatcher):
-    def __init__(
-        self,
-        semaphore_count: int = 5,
-        max_session_permit: int = 20,
-        rate_limiter: Optional[RateLimiter] = None,
-        monitor: Optional[CrawlerMonitor] = None,
-    ):
-        super().__init__(rate_limiter, monitor)
-        self.semaphore_count = semaphore_count
-        self.max_session_permit = max_session_permit
-
-    async def crawl_url(
-        self,
-        url: str,
-        config: CrawlerRunConfig,
-        task_id: str,
-        semaphore: asyncio.Semaphore = None,
-    ) -> CrawlerTaskResult:
-        start_time = datetime.now()
-        error_message = ""
-        memory_usage = peak_memory = 0.0
-
-        try:
-            if self.monitor:
-                self.monitor.update_task(
-                    task_id, status=CrawlStatus.IN_PROGRESS, start_time=start_time
-                )
-
-            if self.rate_limiter:
-                await self.rate_limiter.wait_if_needed(url)
-
-            async with semaphore:
-                process = psutil.Process()
-                start_memory = process.memory_info().rss / (1024 * 1024)
-                result = await self.crawler.arun(url, config=config, session_id=task_id)
-                end_memory = process.memory_info().rss / (1024 * 1024)
-
-                memory_usage = peak_memory = end_memory - start_memory
-
-                if self.rate_limiter and result.status_code:
-                    if not self.rate_limiter.update_delay(url, result.status_code):
-                        error_message = f"Rate limit retry count exceeded for domain {urlparse(url).netloc}"
-                        if self.monitor:
-                            self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
-                        return CrawlerTaskResult(
-                            task_id=task_id,
-                            url=url,
-                            result=result,
-                            memory_usage=memory_usage,
-                            peak_memory=peak_memory,
-                            start_time=start_time,
-                            end_time=datetime.now(),
-                            error_message=error_message,
-                        )
-
-                if not result.success:
-                    error_message = result.error_message
-                    if self.monitor:
-                        self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
-                elif self.monitor:
-                    self.monitor.update_task(task_id, status=CrawlStatus.COMPLETED)
-
-        except Exception as e:
-            error_message = str(e)
-            if self.monitor:
-                self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
-            result = CrawlResult(
-                url=url, html="", metadata={}, success=False, error_message=str(e)
-            )
-
-        finally:
-            end_time = datetime.now()
-            if self.monitor:
-                self.monitor.update_task(
-                    task_id,
-                    end_time=end_time,
-                    memory_usage=memory_usage,
-                    peak_memory=peak_memory,
-                    error_message=error_message,
-                )
-
-        return CrawlerTaskResult(
-            task_id=task_id,
-            url=url,
-            result=result,
-            memory_usage=memory_usage,
-            peak_memory=peak_memory,
-            start_time=start_time,
-            end_time=end_time,
-            error_message=error_message,
-        )
-
-    async def run_urls(
-        self,
-        crawler: "AsyncWebCrawler",  # noqa: F821
-        urls: List[str],
-        config: CrawlerRunConfig,
-    ) -> List[CrawlerTaskResult]:
-        self.crawler = crawler
-        if self.monitor:
-            self.monitor.start()
-
-        try:
-            semaphore = asyncio.Semaphore(self.semaphore_count)
-            tasks = []
-
-            for url in urls:
-                task_id = str(uuid.uuid4())
-                if self.monitor:
-                    self.monitor.add_task(task_id, url)
-                task = asyncio.create_task(
-                    self.crawl_url(url, config, task_id, semaphore)
-                )
-                tasks.append(task)
-
-            return await asyncio.gather(*tasks, return_exceptions=True)
-        finally:
-            if self.monitor:
-                self.monitor.stop()
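Reviewer note: the removed `SemaphoreDispatcher.run_urls` above bounds concurrency with a shared `asyncio.Semaphore` and collects results with `asyncio.gather(..., return_exceptions=True)`. A minimal standalone sketch of that pattern follows. The names `run_bounded`, `fetch`, and `limit` are hypothetical stand-ins and are not part of crawl4ai; `fetch` takes the place of `crawler.arun`.

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_bounded(
    urls: List[str],
    fetch: Callable[[str], Awaitable[str]],  # hypothetical stand-in for crawler.arun
    limit: int = 5,
) -> List[object]:
    """Call fetch(url) for every URL with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def worker(url: str) -> str:
        # Every task is created up front, but only `limit` of them can be
        # inside this block at the same time.
        async with semaphore:
            return await fetch(url)

    tasks = [asyncio.create_task(worker(url)) for url in urls]
    # return_exceptions=True returns failures as exception objects in the
    # result list (in input order) instead of raising the first one.
    return await asyncio.gather(*tasks, return_exceptions=True)
```

Because failures come back as values, a caller of this sketch has to check each entry with `isinstance(entry, Exception)` before treating it as a result.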
--- a/crawl4ai/async_dispatcher_.py
+++ b/crawl4ai/async_dispatcher_.py
@@ -1,588 +0,0 @@
-from typing import Dict, Optional, List, Tuple
-from .async_configs import CrawlerRunConfig
-from .models import (
-    CrawlResult,
-    CrawlerTaskResult,
-    CrawlStatus,
-    DisplayMode,
-    CrawlStats,
-    DomainState,
-)
-
-from rich.live import Live
-from rich.table import Table
-from rich.console import Console
-from rich import box
-from datetime import datetime, timedelta
-
-import time
-import psutil
-import asyncio
-import uuid
-
-from urllib.parse import urlparse
-import random
-from abc import ABC, abstractmethod
-
-
-class RateLimiter:
-    def __init__(
-        self,
-        base_delay: Tuple[float, float] = (1.0, 3.0),
-        max_delay: float = 60.0,
-        max_retries: int = 3,
-        rate_limit_codes: List[int] = None,
-    ):
-        self.base_delay = base_delay
-        self.max_delay = max_delay
-        self.max_retries = max_retries
-        self.rate_limit_codes = rate_limit_codes or [429, 503]
-        self.domains: Dict[str, DomainState] = {}
-
-    def get_domain(self, url: str) -> str:
-        return urlparse(url).netloc
-
-    async def wait_if_needed(self, url: str) -> None:
-        domain = self.get_domain(url)
-        state = self.domains.get(domain)
-
-        if not state:
-            self.domains[domain] = DomainState()
-            state = self.domains[domain]
-
-        now = time.time()
-        if state.last_request_time:
-            wait_time = max(0, state.current_delay - (now - state.last_request_time))
-            if wait_time > 0:
-                await asyncio.sleep(wait_time)
-
-        # Random delay within base range if no current delay
-        if state.current_delay == 0:
-            state.current_delay = random.uniform(*self.base_delay)
-
-        state.last_request_time = time.time()
-
-    def update_delay(self, url: str, status_code: int) -> bool:
-        domain = self.get_domain(url)
-        state = self.domains[domain]
-
-        if status_code in self.rate_limit_codes:
-            state.fail_count += 1
-            if state.fail_count > self.max_retries:
-                return False
-
-            # Exponential backoff with random jitter
-            state.current_delay = min(
-                state.current_delay * 2 * random.uniform(0.75, 1.25), self.max_delay
-            )
-        else:
-            # Gradually reduce delay on success
-            state.current_delay = max(
-                random.uniform(*self.base_delay), state.current_delay * 0.75
-            )
-            state.fail_count = 0
-
-        return True
-
-
-class CrawlerMonitor:
-    def __init__(
-        self,
-        max_visible_rows: int = 15,
-        display_mode: DisplayMode = DisplayMode.DETAILED,
-    ):
-        self.console = Console()
-        self.max_visible_rows = max_visible_rows
-        self.display_mode = display_mode
-        self.stats: Dict[str, CrawlStats] = {}
-        self.process = psutil.Process()
-        self.start_time = datetime.now()
-        self.live = Live(self._create_table(), refresh_per_second=2)
-
-    def start(self):
-        self.live.start()
-
-    def stop(self):
-        self.live.stop()
-
-    def add_task(self, task_id: str, url: str):
-        self.stats[task_id] = CrawlStats(
-            task_id=task_id, url=url, status=CrawlStatus.QUEUED
-        )
-        self.live.update(self._create_table())
-
-    def update_task(self, task_id: str, **kwargs):
-        if task_id in self.stats:
-            for key, value in kwargs.items():
-                setattr(self.stats[task_id], key, value)
-            self.live.update(self._create_table())
-
-    def _create_aggregated_table(self) -> Table:
-        """Creates a compact table showing only aggregated statistics"""
-        table = Table(
-            box=box.ROUNDED,
-            title="Crawler Status Overview",
-            title_style="bold magenta",
-            header_style="bold blue",
-            show_lines=True,
-        )
-
-        # Calculate statistics
-        total_tasks = len(self.stats)
-        queued = sum(
-            1 for stat in self.stats.values() if stat.status == CrawlStatus.QUEUED
-        )
-        in_progress = sum(
-            1 for stat in self.stats.values() if stat.status == CrawlStatus.IN_PROGRESS
-        )
-        completed = sum(
-            1 for stat in self.stats.values() if stat.status == CrawlStatus.COMPLETED
-        )
-        failed = sum(
-            1 for stat in self.stats.values() if stat.status == CrawlStatus.FAILED
-        )
-
-        # Memory statistics
-        current_memory = self.process.memory_info().rss / (1024 * 1024)
-        total_task_memory = sum(stat.memory_usage for stat in self.stats.values())
-        peak_memory = max(
-            (stat.peak_memory for stat in self.stats.values()), default=0.0
-        )
-
-        # Duration
-        duration = datetime.now() - self.start_time
-
-        # Create status row
-        table.add_column("Status", style="bold cyan")
-        table.add_column("Count", justify="right")
-        table.add_column("Percentage", justify="right")
-
-        table.add_row("Total Tasks", str(total_tasks), "100%")
-        table.add_row(
-            "[yellow]In Queue[/yellow]",
-            str(queued),
-            f"{(queued/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
-        )
-        table.add_row(
-            "[blue]In Progress[/blue]",
-            str(in_progress),
-            f"{(in_progress/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
-        )
-        table.add_row(
-            "[green]Completed[/green]",
-            str(completed),
-            f"{(completed/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
-        )
-        table.add_row(
-            "[red]Failed[/red]",
-            str(failed),
-            f"{(failed/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
-        )
-
-        # Add memory information
-        table.add_section()
-        table.add_row(
-            "[magenta]Current Memory[/magenta]", f"{current_memory:.1f} MB", ""
-        )
-        table.add_row(
-            "[magenta]Total Task Memory[/magenta]", f"{total_task_memory:.1f} MB", ""
-        )
-        table.add_row(
-            "[magenta]Peak Task Memory[/magenta]", f"{peak_memory:.1f} MB", ""
-        )
-        table.add_row(
-            "[yellow]Runtime[/yellow]",
-            str(timedelta(seconds=int(duration.total_seconds()))),
-            "",
-        )
-
-        return table
-
-    def _create_detailed_table(self) -> Table:
-        table = Table(
-            box=box.ROUNDED,
-            title="Crawler Performance Monitor",
-            title_style="bold magenta",
-            header_style="bold blue",
-        )
-
-        # Add columns
-        table.add_column("Task ID", style="cyan", no_wrap=True)
-        table.add_column("URL", style="cyan", no_wrap=True)
-        table.add_column("Status", style="bold")
-        table.add_column("Memory (MB)", justify="right")
-        table.add_column("Peak (MB)", justify="right")
-        table.add_column("Duration", justify="right")
-        table.add_column("Info", style="italic")
-
-        # Add summary row
-        total_memory = sum(stat.memory_usage for stat in self.stats.values())
-        active_count = sum(
-            1 for stat in self.stats.values() if stat.status == CrawlStatus.IN_PROGRESS
-        )
-        completed_count = sum(
-            1 for stat in self.stats.values() if stat.status == CrawlStatus.COMPLETED
-        )
-        failed_count = sum(
-            1 for stat in self.stats.values() if stat.status == CrawlStatus.FAILED
-        )
-
-        table.add_row(
-            "[bold yellow]SUMMARY",
-            f"Total: {len(self.stats)}",
-            f"Active: {active_count}",
-            f"{total_memory:.1f}",
-            f"{self.process.memory_info().rss / (1024 * 1024):.1f}",
-            str(
-                timedelta(
-                    seconds=int((datetime.now() - self.start_time).total_seconds())
-                )
-            ),
-            f"✓{completed_count} ✗{failed_count}",
-            style="bold",
-        )
-
-        table.add_section()
-
-        # Add rows for each task
-        visible_stats = sorted(
-            self.stats.values(),
-            key=lambda x: (
-                x.status != CrawlStatus.IN_PROGRESS,
-                x.status != CrawlStatus.QUEUED,
-                x.end_time or datetime.max,
-            ),
-        )[: self.max_visible_rows]
-
-        for stat in visible_stats:
-            status_style = {
-                CrawlStatus.QUEUED: "white",
-                CrawlStatus.IN_PROGRESS: "yellow",
-                CrawlStatus.COMPLETED: "green",
-                CrawlStatus.FAILED: "red",
-            }[stat.status]
-
-            table.add_row(
-                stat.task_id[:8],  # Show first 8 chars of task ID
-                stat.url[:40] + "..." if len(stat.url) > 40 else stat.url,
-                f"[{status_style}]{stat.status.value}[/{status_style}]",
-                f"{stat.memory_usage:.1f}",
-                f"{stat.peak_memory:.1f}",
-                stat.duration,
-                stat.error_message[:40] if stat.error_message else "",
-            )
-
-        return table
-
-    def _create_table(self) -> Table:
-        """Creates the appropriate table based on display mode"""
-        if self.display_mode == DisplayMode.AGGREGATED:
-            return self._create_aggregated_table()
-        return self._create_detailed_table()
-
-
-class BaseDispatcher(ABC):
-    def __init__(
-        self,
-        rate_limiter: Optional[RateLimiter] = None,
-        monitor: Optional[CrawlerMonitor] = None,
-    ):
-        self.crawler = None
-        self._domain_last_hit: Dict[str, float] = {}
-        self.concurrent_sessions = 0
-        self.rate_limiter = rate_limiter
-        self.monitor = monitor
-
-    @abstractmethod
-    async def crawl_url(
-        self,
-        url: str,
-        config: CrawlerRunConfig,
-        task_id: str,
-        monitor: Optional[CrawlerMonitor] = None,
-    ) -> CrawlerTaskResult:
-        pass
-
-    @abstractmethod
-    async def run_urls(
-        self,
-        urls: List[str],
-        crawler: "AsyncWebCrawler",  # noqa: F821
-        config: CrawlerRunConfig,
-        monitor: Optional[CrawlerMonitor] = None,
-    ) -> List[CrawlerTaskResult]:
-        pass
-
-
-class MemoryAdaptiveDispatcher(BaseDispatcher):
-    def __init__(
-        self,
-        memory_threshold_percent: float = 90.0,
-        check_interval: float = 1.0,
-        max_session_permit: int = 20,
-        memory_wait_timeout: float = 300.0,  # 5 minutes default timeout
-        rate_limiter: Optional[RateLimiter] = None,
-        monitor: Optional[CrawlerMonitor] = None,
-    ):
-        super().__init__(rate_limiter, monitor)
-        self.memory_threshold_percent = memory_threshold_percent
-        self.check_interval = check_interval
-        self.max_session_permit = max_session_permit
-        self.memory_wait_timeout = memory_wait_timeout
-
-    async def crawl_url(
-        self,
-        url: str,
-        config: CrawlerRunConfig,
-        task_id: str,
-    ) -> CrawlerTaskResult:
-        start_time = datetime.now()
-        error_message = ""
-        memory_usage = peak_memory = 0.0
-
-        try:
-            if self.monitor:
-                self.monitor.update_task(
-                    task_id, status=CrawlStatus.IN_PROGRESS, start_time=start_time
-                )
-            self.concurrent_sessions += 1
-
-            if self.rate_limiter:
-                await self.rate_limiter.wait_if_needed(url)
-
-            process = psutil.Process()
-            start_memory = process.memory_info().rss / (1024 * 1024)
-            result = await self.crawler.arun(url, config=config, session_id=task_id)
-            end_memory = process.memory_info().rss / (1024 * 1024)
-
-            memory_usage = peak_memory = end_memory - start_memory
-
-            if self.rate_limiter and result.status_code:
-                if not self.rate_limiter.update_delay(url, result.status_code):
-                    error_message = f"Rate limit retry count exceeded for domain {urlparse(url).netloc}"
-                    if self.monitor:
-                        self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
-                    return CrawlerTaskResult(
-                        task_id=task_id,
-                        url=url,
-                        result=result,
-                        memory_usage=memory_usage,
-                        peak_memory=peak_memory,
-                        start_time=start_time,
-                        end_time=datetime.now(),
-                        error_message=error_message,
-                    )
-
-            if not result.success:
-                error_message = result.error_message
-                if self.monitor:
-                    self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
-            elif self.monitor:
-                self.monitor.update_task(task_id, status=CrawlStatus.COMPLETED)
-
-        except Exception as e:
-            error_message = str(e)
-            if self.monitor:
-                self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
-            result = CrawlResult(
-                url=url, html="", metadata={}, success=False, error_message=str(e)
-            )
-
-        finally:
-            end_time = datetime.now()
-            if self.monitor:
-                self.monitor.update_task(
-                    task_id,
-                    end_time=end_time,
-                    memory_usage=memory_usage,
-                    peak_memory=peak_memory,
-                    error_message=error_message,
-                )
-            self.concurrent_sessions -= 1
-
-        return CrawlerTaskResult(
-            task_id=task_id,
-            url=url,
-            result=result,
-            memory_usage=memory_usage,
-            peak_memory=peak_memory,
-            start_time=start_time,
-            end_time=end_time,
-            error_message=error_message,
-        )
-
-    async def run_urls(
-        self,
-        urls: List[str],
-        crawler: "AsyncWebCrawler",  # noqa: F821
-        config: CrawlerRunConfig,
-    ) -> List[CrawlerTaskResult]:
-        self.crawler = crawler
-
-        if self.monitor:
-            self.monitor.start()
-
-        try:
-            pending_tasks = []
-            active_tasks = []
-            task_queue = []
-
-            for url in urls:
-                task_id = str(uuid.uuid4())
-                if self.monitor:
-                    self.monitor.add_task(task_id, url)
-                task_queue.append((url, task_id))
-
-            while task_queue or active_tasks:
-                wait_start_time = time.time()
-                while len(active_tasks) < self.max_session_permit and task_queue:
-                    if psutil.virtual_memory().percent >= self.memory_threshold_percent:
-                        # Check if we've exceeded the timeout
-                        if time.time() - wait_start_time > self.memory_wait_timeout:
-                            raise MemoryError(
-                                f"Memory usage above threshold ({self.memory_threshold_percent}%) for more than {self.memory_wait_timeout} seconds"
-                            )
-                        await asyncio.sleep(self.check_interval)
-                        continue
-
-                    url, task_id = task_queue.pop(0)
-                    task = asyncio.create_task(self.crawl_url(url, config, task_id))
-                    active_tasks.append(task)
-
-                if not active_tasks:
-                    await asyncio.sleep(self.check_interval)
-                    continue
-
-                done, pending = await asyncio.wait(
-                    active_tasks, return_when=asyncio.FIRST_COMPLETED
-                )
-
-                pending_tasks.extend(done)
-                active_tasks = list(pending)
-
-            return await asyncio.gather(*pending_tasks)
-        finally:
-            if self.monitor:
-                self.monitor.stop()
-
-
-class SemaphoreDispatcher(BaseDispatcher):
-    def __init__(
-        self,
-        semaphore_count: int = 5,
-        max_session_permit: int = 20,
-        rate_limiter: Optional[RateLimiter] = None,
-        monitor: Optional[CrawlerMonitor] = None,
-    ):
-        super().__init__(rate_limiter, monitor)
-        self.semaphore_count = semaphore_count
-        self.max_session_permit = max_session_permit
-
-    async def crawl_url(
-        self,
-        url: str,
-        config: CrawlerRunConfig,
-        task_id: str,
-        semaphore: asyncio.Semaphore = None,
-    ) -> CrawlerTaskResult:
-        start_time = datetime.now()
-        error_message = ""
-        memory_usage = peak_memory = 0.0
-
-        try:
-            if self.monitor:
-                self.monitor.update_task(
-                    task_id, status=CrawlStatus.IN_PROGRESS, start_time=start_time
-                )
-
-            if self.rate_limiter:
-                await self.rate_limiter.wait_if_needed(url)
-
-            async with semaphore:
-                process = psutil.Process()
-                start_memory = process.memory_info().rss / (1024 * 1024)
-                result = await self.crawler.arun(url, config=config, session_id=task_id)
-                end_memory = process.memory_info().rss / (1024 * 1024)
-
-                memory_usage = peak_memory = end_memory - start_memory
-
-                if self.rate_limiter and result.status_code:
-                    if not self.rate_limiter.update_delay(url, result.status_code):
-                        error_message = f"Rate limit retry count exceeded for domain {urlparse(url).netloc}"
-                        if self.monitor:
-                            self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
-                        return CrawlerTaskResult(
-                            task_id=task_id,
-                            url=url,
-                            result=result,
-                            memory_usage=memory_usage,
-                            peak_memory=peak_memory,
-                            start_time=start_time,
-                            end_time=datetime.now(),
-                            error_message=error_message,
-                        )
-
-                if not result.success:
-                    error_message = result.error_message
-                    if self.monitor:
-                        self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
-                elif self.monitor:
-                    self.monitor.update_task(task_id, status=CrawlStatus.COMPLETED)
-
-        except Exception as e:
-            error_message = str(e)
-            if self.monitor:
-                self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
-            result = CrawlResult(
-                url=url, html="", metadata={}, success=False, error_message=str(e)
-            )
-
-        finally:
-            end_time = datetime.now()
-            if self.monitor:
-                self.monitor.update_task(
-                    task_id,
-                    end_time=end_time,
-                    memory_usage=memory_usage,
-                    peak_memory=peak_memory,
-                    error_message=error_message,
-                )
-
-        return CrawlerTaskResult(
-            task_id=task_id,
-            url=url,
-            result=result,
-            memory_usage=memory_usage,
-            peak_memory=peak_memory,
-            start_time=start_time,
-            end_time=end_time,
-            error_message=error_message,
-        )
-
-    async def run_urls(
-        self,
-        crawler: "AsyncWebCrawler",  # noqa: F821
-        urls: List[str],
-        config: CrawlerRunConfig,
-    ) -> List[CrawlerTaskResult]:
-        self.crawler = crawler
-        if self.monitor:
-            self.monitor.start()
-
-        try:
-            semaphore = asyncio.Semaphore(self.semaphore_count)
-            tasks = []
-
-            for url in urls:
-                task_id = str(uuid.uuid4())
-                if self.monitor:
-                    self.monitor.add_task(task_id, url)
-                task = asyncio.create_task(
-                    self.crawl_url(url, config, task_id, semaphore)
-                )
-                tasks.append(task)
-
-            return await asyncio.gather(*tasks, return_exceptions=True)
-        finally:
-            if self.monitor:
-                self.monitor.stop()
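Reviewer note: the delay arithmetic in the removed `RateLimiter.update_delay` is easier to check in isolation. The sketch below restates it as a pure function, using the same constants as the deleted code (doubling with 0.75 to 1.25 jitter, capped at `max_delay`; decay by 0.75 on success, floored at a random base delay). The function name `next_delay` and the injectable `rng` parameter are assumptions added for illustration and do not exist in crawl4ai.

```python
import random
from typing import Optional, Tuple


def next_delay(
    current_delay: float,
    rate_limited: bool,
    base_delay: Tuple[float, float] = (1.0, 3.0),
    max_delay: float = 60.0,
    rng: Optional[random.Random] = None,  # hypothetical hook for deterministic tests
) -> float:
    """Return the next per-domain delay in seconds."""
    rng = rng or random.Random()
    if rate_limited:
        # Exponential backoff with random jitter, capped at max_delay.
        return min(current_delay * 2 * rng.uniform(0.75, 1.25), max_delay)
    # On success, shrink the delay but never below a random base delay.
    return max(rng.uniform(*base_delay), current_delay * 0.75)
```

Note that a `current_delay` of 0 stays at 0 under backoff; in the removed code, `wait_if_needed` seeds a non-zero base delay before `update_delay` is reached.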
--- a/crawl4ai/async_logger.py
+++ b/crawl4ai/async_logger.py
@@ -1,227 +0,0 @@
-from enum import Enum
-from typing import Optional, Dict, Any
-from colorama import Fore, Style, init
-import os
-from datetime import datetime
-
-
-class LogLevel(Enum):
-    DEBUG = 1
-    INFO = 2
-    SUCCESS = 3
-    WARNING = 4
-    ERROR = 5
-
-
-class AsyncLogger:
-    """
-    Asynchronous logger with support for colored console output and file logging.
-    Supports templated messages with colored components.
-    """
-
-    DEFAULT_ICONS = {
-        "INIT": "→",
-        "READY": "✓",
-        "FETCH": "↓",
-        "SCRAPE": "◆",
-        "EXTRACT": "■",
-        "COMPLETE": "●",
-        "ERROR": "×",
-        "DEBUG": "⋯",
-        "INFO": "ℹ",
-        "WARNING": "⚠",
-    }
-
-    DEFAULT_COLORS = {
-        LogLevel.DEBUG: Fore.LIGHTBLACK_EX,
-        LogLevel.INFO: Fore.CYAN,
-        LogLevel.SUCCESS: Fore.GREEN,
-        LogLevel.WARNING: Fore.YELLOW,
-        LogLevel.ERROR: Fore.RED,
-    }
-
-    def __init__(
-        self,
-        log_file: Optional[str] = None,
-        log_level: LogLevel = LogLevel.DEBUG,
-        tag_width: int = 10,
-        icons: Optional[Dict[str, str]] = None,
-        colors: Optional[Dict[LogLevel, str]] = None,
-        verbose: bool = True,
-    ):
-        """
-        Initialize the logger.
-
-        Args:
-            log_file: Optional file path for logging
-            log_level: Minimum log level to display
-            tag_width: Width for tag formatting
-            icons: Custom icons for different tags
-            colors: Custom colors for different log levels
-            verbose: Whether to output to console
-        """
-        init()  # Initialize colorama
-        self.log_file = log_file
-        self.log_level = log_level
-        self.tag_width = tag_width
-        self.icons = icons or self.DEFAULT_ICONS
-        self.colors = colors or self.DEFAULT_COLORS
-        self.verbose = verbose
-
-        # Create log file directory if needed
-        if log_file:
-            os.makedirs(os.path.dirname(os.path.abspath(log_file)), exist_ok=True)
-
-    def _format_tag(self, tag: str) -> str:
-        """Format a tag with consistent width."""
-        return f"[{tag}]".ljust(self.tag_width, ".")
-
-    def _get_icon(self, tag: str) -> str:
-        """Get the icon for a tag, defaulting to info icon if not found."""
-        return self.icons.get(tag, self.icons["INFO"])
-
-    def _write_to_file(self, message: str):
-        """Write a message to the log file if configured."""
-        if self.log_file:
-            timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]
-            with open(self.log_file, "a", encoding="utf-8") as f:
-                # Strip ANSI color codes for file output
-                clean_message = message.replace(Fore.RESET, "").replace(
-                    Style.RESET_ALL, ""
-                )
-                for color in vars(Fore).values():
-                    if isinstance(color, str):
-                        clean_message = clean_message.replace(color, "")
-                f.write(f"[{timestamp}] {clean_message}\n")
-
-    def _log(
-        self,
-        level: LogLevel,
-        message: str,
-        tag: str,
-        params: Optional[Dict[str, Any]] = None,
-        colors: Optional[Dict[str, str]] = None,
-        base_color: Optional[str] = None,
-        **kwargs,
-    ):
-        """
-        Core logging method that handles message formatting and output.
-
-        Args:
-            level: Log level for this message
-            message: Message template string
-            tag: Tag for the message
-            params: Parameters to format into the message
-            colors: Color overrides for specific parameters
-            base_color: Base color for the entire message
-        """
-        if level.value < self.log_level.value:
-            return
-
-        # Format the message with parameters if provided
-        if params:
-            try:
-                # First format the message with raw parameters
-                formatted_message = message.format(**params)
-
-                # Then apply colors if specified
-                if colors:
-                    for key, color in colors.items():
-                        # Find the formatted value in the message and wrap it with color
-                        if key in params:
-                            value_str = str(params[key])
-                            formatted_message = formatted_message.replace(
-                                value_str, f"{color}{value_str}{Style.RESET_ALL}"
-                            )
-
-            except KeyError as e:
-                formatted_message = (
-                    f"LOGGING ERROR: Missing parameter {e} in message template"
-                )
-                level = LogLevel.ERROR
-        else:
-            formatted_message = message
-
-        # Construct the full log line
-        color = base_color or self.colors[level]
-        log_line = f"{color}{self._format_tag(tag)} {self._get_icon(tag)} {formatted_message}{Style.RESET_ALL}"
-
-        # Output to console if verbose
-        if self.verbose or kwargs.get("force_verbose", False):
-            print(log_line)
-
-        # Write to file if configured
-        self._write_to_file(log_line)
-
-    def debug(self, message: str, tag: str = "DEBUG", **kwargs):
-        """Log a debug message."""
-        self._log(LogLevel.DEBUG, message, tag, **kwargs)
-
-    def info(self, message: str, tag: str = "INFO", **kwargs):
-        """Log an info message."""
-        self._log(LogLevel.INFO, message, tag, **kwargs)
-
-    def success(self, message: str, tag: str = "SUCCESS", **kwargs):
-        """Log a success message."""
-        self._log(LogLevel.SUCCESS, message, tag, **kwargs)
-
-    def warning(self, message: str, tag: str = "WARNING", **kwargs):
-        """Log a warning message."""
-        self._log(LogLevel.WARNING, message, tag, **kwargs)
-
-    def error(self, message: str, tag: str = "ERROR", **kwargs):
-        """Log an error message."""
-        self._log(LogLevel.ERROR, message, tag, **kwargs)
-
-    def url_status(
-        self,
-        url: str,
-        success: bool,
-        timing: float,
-        tag: str = "FETCH",
-        url_length: int = 50,
-    ):
-        """
-        Convenience method for logging URL fetch status.
-
-        Args:
-            url: The URL being processed
-            success: Whether the operation was successful
-            timing: Time taken for the operation
-            tag: Tag for the message
-            url_length: Maximum length for URL in log
-        """
-        self._log(
-            level=LogLevel.SUCCESS if success else LogLevel.ERROR,
-            message="{url:.{url_length}}... | Status: {status} | Time: {timing:.2f}s",
-            tag=tag,
-            params={
-                "url": url,
-                "url_length": url_length,
-                "status": success,
-                "timing": timing,
-            },
-            colors={
-                "status": Fore.GREEN if success else Fore.RED,
-                "timing": Fore.YELLOW,
-            },
-        )
-
-    def error_status(
-        self, url: str, error: str, tag: str = "ERROR", url_length: int = 50
-    ):
-        """
-        Convenience method for logging error status.
-
-        Args:
-            url: The URL being processed
-            error: Error message
-            tag: Tag for the message
-            url_length: Maximum length for URL in log
-        """
-        self._log(
-            level=LogLevel.ERROR,
-            message="{url:.{url_length}}... | Error: {error}",
-            tag=tag,
-            params={"url": url, "url_length": url_length, "error": error},
-        )
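The removed `url_status` template relies on a nested replacement field: in `{url:.{url_length}}` the precision that truncates the URL is itself read from `params`. A minimal standalone sketch of that formatting step, assuming only the standard library (`format_status` is a hypothetical helper for illustration, not part of crawl4ai):

```python
def format_status(url: str, success: bool, timing: float, url_length: int = 50) -> str:
    """Hypothetical helper mirroring the message template used by url_status."""
    template = "{url:.{url_length}}... | Status: {status} | Time: {timing:.2f}s"
    params = {
        "url": url,
        "url_length": url_length,
        "status": success,
        "timing": timing,
    }
    # str.format resolves the inner {url_length} first, producing a spec such as
    # ".19", which truncates the string to at most 19 characters.
    return template.format(**params)


if __name__ == "__main__":
    print(format_status("https://example.com/a/very/long/path", True, 1.2345, url_length=19))
```

One thing to keep in mind when reading `_log`: the colouring pass uses `str.replace` on the already formatted message, so every occurrence of a parameter's string value is wrapped in colour codes, not only the one that came from the template field.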
--- a/crawl4ai/async_webcrawler.py
+++ b/crawl4ai/async_webcrawler.py
--- a/crawl4ai/cache_context.py
+++ b/crawl4ai/cache_context.py
@@ -1,117 +0,0 @@
-from enum import Enum
-
-
-class CacheMode(Enum):
-    """
-    Defines the caching behavior for web crawling operations.
-
-    Modes:
-    - ENABLED: Normal caching behavior (read and write)
-    - DISABLED: No caching at all
-    - READ_ONLY: Only read from cache, don't write
-    - WRITE_ONLY: Only write to cache, don't read
-    - BYPASS: Bypass cache for this operation
-    """
-
-    ENABLED = "enabled"
-    DISABLED = "disabled"
-    READ_ONLY = "read_only"
-    WRITE_ONLY = "write_only"
-    BYPASS = "bypass"
-
-
-class CacheContext:
-    """
-    Encapsulates cache-related decisions and URL handling.
-
-    This class centralizes all cache-related logic and URL type checking,
-    making the caching behavior more predictable and maintainable.
-
-    Attributes:
-        url (str): The URL being processed.
-        cache_mode (CacheMode): The cache mode for the current operation.
-        always_bypass (bool): If True, bypasses caching for this operation.
-        is_cacheable (bool): True if the URL is cacheable, False otherwise.
-        is_web_url (bool): True if the URL is a web URL, False otherwise.
-        is_local_file (bool): True if the URL is a local file, False otherwise.
-        is_raw_html (bool): True if the URL is raw HTML, False otherwise.
-        _url_display (str): The display name for the URL (web, local file, or raw HTML).
-    """
-
-    def __init__(self, url: str, cache_mode: CacheMode, always_bypass: bool = False):
-        """
-        Initializes the CacheContext with the provided URL and cache mode.
-
-        Args:
-            url (str): The URL being processed.
-            cache_mode (CacheMode): The cache mode for the current operation.
-            always_bypass (bool): If True, bypasses caching for this operation.
-        """
-        self.url = url
-        self.cache_mode = cache_mode
-        self.always_bypass = always_bypass
-        self.is_cacheable = url.startswith(("http://", "https://", "file://"))
-        self.is_web_url = url.startswith(("http://", "https://"))
-        self.is_local_file = url.startswith("file://")
-        self.is_raw_html = url.startswith("raw:")
-        self._url_display = url if not self.is_raw_html else "Raw HTML"
-
-    def should_read(self) -> bool:
-        """
-        Determines if cache should be read based on context.
-
-        How it works:
-        1. If always_bypass is True or is_cacheable is False, return False.
-        2. If cache_mode is ENABLED or READ_ONLY, return True.
-
-        Returns:
-            bool: True if cache should be read, False otherwise.
-        """
-        if self.always_bypass or not self.is_cacheable:
-            return False
-        return self.cache_mode in [CacheMode.ENABLED, CacheMode.READ_ONLY]
-
-    def should_write(self) -> bool:
-        """
-        Determines if cache should be written based on context.
-
-        How it works:
-        1. If always_bypass is True or is_cacheable is False, return False.
-        2. If cache_mode is ENABLED or WRITE_ONLY, return True.
-
-        Returns:
-            bool: True if cache should be written, False otherwise.
-        """
-        if self.always_bypass or not self.is_cacheable:
-            return False
-        return self.cache_mode in [CacheMode.ENABLED, CacheMode.WRITE_ONLY]
-
-    @property
-    def display_url(self) -> str:
-        """Returns the URL in display format."""
-        return self._url_display
-
-
-def _legacy_to_cache_mode(
-    disable_cache: bool = False,
-    bypass_cache: bool = False,
-    no_cache_read: bool = False,
-    no_cache_write: bool = False,
-) -> CacheMode:
-    """
-    Converts legacy cache parameters to the new CacheMode enum.
-
-    This is an internal function to help transition from the old boolean flags
-    to the new CacheMode system.
-    """
-    if disable_cache:
-        return CacheMode.DISABLED
-    if bypass_cache:
-        return CacheMode.BYPASS
-    if no_cache_read and no_cache_write:
-        return CacheMode.DISABLED
-    if no_cache_read:
-        return CacheMode.WRITE_ONLY
-    if no_cache_write:
-        return CacheMode.READ_ONLY
-    return CacheMode.ENABLED
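The read/write rules of the removed `CacheContext` can be summarised as a small decision function. This is a sketch under the assumption that only the logic shown in `should_read` and `should_write` matters; `cache_decision` is a hypothetical name introduced here, and the enum is re-declared so the block runs on its own:

```python
from enum import Enum
from typing import Tuple


class CacheMode(Enum):
    ENABLED = "enabled"
    DISABLED = "disabled"
    READ_ONLY = "read_only"
    WRITE_ONLY = "write_only"
    BYPASS = "bypass"


def cache_decision(url: str, mode: CacheMode, always_bypass: bool = False) -> Tuple[bool, bool]:
    """Hypothetical helper: return (should_read, should_write) for a URL and mode."""
    # Only web and file URLs are cacheable; "raw:" HTML never touches the cache.
    cacheable = url.startswith(("http://", "https://", "file://"))
    if always_bypass or not cacheable:
        return (False, False)
    return (
        mode in (CacheMode.ENABLED, CacheMode.READ_ONLY),
        mode in (CacheMode.ENABLED, CacheMode.WRITE_ONLY),
    )


if __name__ == "__main__":
    for mode in CacheMode:
        print(mode.name, cache_decision("https://example.com", mode))
```

As far as these two methods are concerned, `DISABLED` and `BYPASS` behave identically: neither reads nor writes.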
--- a/crawl4ai/chunking_strategy.py
+++ b/crawl4ai/chunking_strategy.py
@@ -3,53 +3,23 @@ import re
 from collections import Counter
 import string
 from .model_loader import load_nltk_punkt
-
+from .utils import *

 # Define the abstract base class for chunking strategies
 class ChunkingStrategy(ABC):
-    """
-    Abstract base class for chunking strategies.
-    """
-
+    
    @abstractmethod
    def chunk(self, text: str) -> list:
        """
        Abstract method to chunk the given text.
-
-        Args:
-            text (str): The text to chunk.
-
-        Returns:
-            list: A list of chunks.
        """
        pass
-
-
-# Create an identity chunking strategy f(x) = [x]
-class IdentityChunking(ChunkingStrategy):
-    """
-    Chunking strategy that returns the input text as a single chunk.
-    """
-
-    def chunk(self, text: str) -> list:
-        return [text]
-
-
+    
 # Regex-based chunking
 class RegexChunking(ChunkingStrategy):
-    """
-    Chunking strategy that splits text based on regular expression patterns.
-    """
-
    def __init__(self, patterns=None, **kwargs):
-        """
-        Initialize the RegexChunking object.
-
-        Args:
-            patterns (list): A list of regular expression patterns to split text.
-        """
        if patterns is None:
-            patterns = [r"\n\n"]  # Default split pattern
+            patterns = [r'\n\n']  # Default split pattern
        self.patterns = patterns

    def chunk(self, text: str) -> list:
@@ -60,19 +30,12 @@ class RegexChunking(ChunkingStrategy):
                new_paragraphs.extend(re.split(pattern, paragraph))
            paragraphs = new_paragraphs
        return paragraphs
-
-
-# NLP-based sentence chunking
+    
+# NLP-based sentence chunking 
 class NlpSentenceChunking(ChunkingStrategy):
-    """
-    Chunking strategy that splits text into sentences using NLTK's sentence tokenizer.
-    """
-
    def __init__(self, **kwargs):
-        """
-        Initialize the NlpSentenceChunking object.
-        """
        load_nltk_punkt()
+        pass

    def chunk(self, text: str) -> list:
        # Improved regex for sentence splitting
@@ -80,34 +43,18 @@ class NlpSentenceChunking(ChunkingStrategy):
        #     r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z][A-Z]\.)(?<![A-Za-z]\.)(?<=\.|\?|\!|\n)\s'
        # )
        # sentences = sentence_endings.split(text)
-        # sens =  [sent.strip() for sent in sentences if sent]
+        # sens =  [sent.strip() for sent in sentences if sent]            
        from nltk.tokenize import sent_tokenize
-
        sentences = sent_tokenize(text)
-        sens = [sent.strip() for sent in sentences]
-
+        sens =  [sent.strip() for sent in sentences]        
+        
        return list(set(sens))
-
-
+    
 # Topic-based segmentation using TextTiling
 class TopicSegmentationChunking(ChunkingStrategy):
-    """
-    Chunking strategy that segments text into topics using NLTK's TextTilingTokenizer.
-
-    How it works:
-    1. Segment the text into topics using TextTilingTokenizer
-    2. Extract keywords for each topic segment
-    """
-
+    
    def __init__(self, num_keywords=3, **kwargs):
-        """
-        Initialize the TopicSegmentationChunking object.
-
-        Args:
-            num_keywords (int): The number of keywords to extract for each topic segment.
-        """
        import nltk as nl
-
        self.tokenizer = nl.tokenize.TextTilingTokenizer()
        self.num_keywords = num_keywords

@@ -119,14 +66,8 @@ class TopicSegmentationChunking(ChunkingStrategy):
    def extract_keywords(self, text: str) -> list:
        # Tokenize and remove stopwords and punctuation
        import nltk as nl
-
        tokens = nl.tokenize.word_tokenize(text)
-        tokens = [
-            token.lower()
-            for token in tokens
-            if token not in nl.corpus.stopwords.words("english")
-            and token not in string.punctuation
-        ]
+        tokens = [token.lower() for token in tokens if token not in nl.corpus.stopwords.words('english') and token not in string.punctuation]

        # Calculate frequency distribution
        freq_dist = Counter(tokens)
@@ -137,120 +78,29 @@ class TopicSegmentationChunking(ChunkingStrategy):
        # Segment the text into topics
        segments = self.chunk(text)
        # Extract keywords for each topic segment
-        segments_with_topics = [
-            (segment, self.extract_keywords(segment)) for segment in segments
-        ]
+        segments_with_topics = [(segment, self.extract_keywords(segment)) for segment in segments]
        return segments_with_topics
-
-
+    
 # Fixed-length word chunks
 class FixedLengthWordChunking(ChunkingStrategy):
-    """
-    Chunking strategy that splits text into fixed-length word chunks.
-
-    How it works:
-    1. Split the text into words
-    2. Create chunks of fixed length
-    3. Return the list of chunks
-    """
-
    def __init__(self, chunk_size=100, **kwargs):
-        """
-        Initialize the fixed-length word chunking strategy with the given chunk size.
-
-        Args:
-            chunk_size (int): The size of each chunk in words.
-        """
        self.chunk_size = chunk_size

    def chunk(self, text: str) -> list:
        words = text.split()
-        return [
-            " ".join(words[i : i + self.chunk_size])
-            for i in range(0, len(words), self.chunk_size)
-        ]
-
-
+        return [' '.join(words[i:i + self.chunk_size]) for i in range(0, len(words), self.chunk_size)]
+    
 # Sliding window chunking
 class SlidingWindowChunking(ChunkingStrategy):
-    """
-    Chunking strategy that splits text into overlapping word chunks.
-
-    How it works:
-    1. Split the text into words
-    2. Slide a window of window_size words forward by step words
-    3. Append a final window if words remain past the last full window
-    """
-
    def __init__(self, window_size=100, step=50, **kwargs):
-        """
-        Initialize the sliding window chunking strategy with the given window size and
-        step size.
-
-        Args:
-            window_size (int): The size of the sliding window in words.
-            step (int): The step size for sliding the window in words.
-        """
        self.window_size = window_size
        self.step = step

    def chunk(self, text: str) -> list:
        words = text.split()
        chunks = []
-
-        if len(words) <= self.window_size:
-            return [text]
-
-        for i in range(0, len(words) - self.window_size + 1, self.step):
-            chunk = " ".join(words[i : i + self.window_size])
-            chunks.append(chunk)
-
-        # Handle the last chunk if it doesn't align perfectly
-        if i + self.window_size < len(words):
-            chunks.append(" ".join(words[-self.window_size :]))
-
+        for i in range(0, len(words), self.step):
+            chunks.append(' '.join(words[i:i + self.window_size]))
        return chunks
+    

-
-class OverlappingWindowChunking(ChunkingStrategy):
-    """
-    Chunking strategy that splits text into overlapping word chunks.
-
-    How it works:
-    1. Split the text into words using whitespace
-    2. Create chunks of fixed length equal to the window size
-    3. Slide the window by the overlap size
-    4. Return the list of chunks
-    """
-
-    def __init__(self, window_size=1000, overlap=100, **kwargs):
-        """
-        Initialize the overlapping window chunking strategy with the given window size and
-        overlap size.
-
-        Args:
-            window_size (int): The size of the window in words.
-            overlap (int): The size of the overlap between consecutive chunks in words.
-        """
-        self.window_size = window_size
-        self.overlap = overlap
-
-    def chunk(self, text: str) -> list:
-        words = text.split()
-        chunks = []
-
-        if len(words) <= self.window_size:
-            return [text]
-
-        start = 0
-        while start < len(words):
-            end = start + self.window_size
-            chunk = " ".join(words[start:end])
-            chunks.append(chunk)
-
-            if end >= len(words):
-                break
-
-            start = end - self.overlap
-
-        return chunks
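The two `SlidingWindowChunking.chunk` variants in this diff differ at the end of the text: the `-` lines emit only full windows plus one tail window, while the `+` lines step through every start index and can emit short trailing chunks. A side-by-side sketch, assuming plain functions in place of the class (`sliding_removed` and `sliding_added` are hypothetical names used only here):

```python
from typing import List


def sliding_removed(text: str, window_size: int, step: int) -> List[str]:
    """Behaviour of the '-' lines: full windows only, plus one tail window."""
    words = text.split()
    if len(words) <= window_size:
        return [text]
    chunks = []
    last_start = 0
    for i in range(0, len(words) - window_size + 1, step):
        chunks.append(" ".join(words[i:i + window_size]))
        last_start = i
    # Cover the remaining words with a final full-size window.
    if last_start + window_size < len(words):
        chunks.append(" ".join(words[-window_size:]))
    return chunks


def sliding_added(text: str, window_size: int, step: int) -> List[str]:
    """Behaviour of the '+' lines: one chunk per start index, tail may be short."""
    words = text.split()
    return [" ".join(words[i:i + window_size]) for i in range(0, len(words), step)]


if __name__ == "__main__":
    sample = "a b c d e f g h i j"
    print(sliding_removed(sample, window_size=4, step=3))
    print(sliding_added(sample, window_size=4, step=3))
```

With ten words, a window of 4 and a step of 3, the first variant yields three full chunks and the second yields the same three followed by a one-word chunk.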
--- a/crawl4ai/cli.py
+++ b/crawl4ai/cli.py
@@ -1,123 +0,0 @@
-import click
-import sys
-import asyncio
-from typing import List
-from .docs_manager import DocsManager
-from .async_logger import AsyncLogger
-
-logger = AsyncLogger(verbose=True)
-docs_manager = DocsManager(logger)
-
-
-def print_table(headers: List[str], rows: List[List[str]], padding: int = 2):
-    """Print formatted table with headers and rows"""
-    widths = [max(len(str(cell)) for cell in col) for col in zip(headers, *rows)]
-    border = "+" + "+".join("-" * (w + 2 * padding) for w in widths) + "+"
-
-    def format_row(row):
-        return (
-            "|"
-            + "|".join(
-                f"{' ' * padding}{str(cell):<{w}}{' ' * padding}"
-                for cell, w in zip(row, widths)
-            )
-            + "|"
-        )
-
-    click.echo(border)
-    click.echo(format_row(headers))
-    click.echo(border)
-    for row in rows:
-        click.echo(format_row(row))
-    click.echo(border)
-
-
-@click.group()
-def cli():
-    """Crawl4AI Command Line Interface"""
-    pass
-
-
-@cli.group()
-def docs():
-    """Documentation operations"""
-    pass
-
-
-@docs.command()
-@click.argument("sections", nargs=-1)
-@click.option(
-    "--mode", type=click.Choice(["extended", "condensed"]), default="extended"
-)
-def combine(sections: tuple, mode: str):
-    """Combine documentation sections"""
-    try:
-        asyncio.run(docs_manager.ensure_docs_exist())
-        click.echo(docs_manager.generate(sections, mode))
-    except Exception as e:
-        logger.error(str(e), tag="ERROR")
-        sys.exit(1)
-
-
-@docs.command()
-@click.argument("query")
-@click.option("--top-k", "-k", default=5)
-@click.option("--build-index", is_flag=True, help="Build index if missing")
-def search(query: str, top_k: int, build_index: bool):
-    """Search documentation"""
-    try:
-        result = docs_manager.search(query, top_k)
-        if result == "No search index available. Call build_search_index() first.":
-            if build_index or click.confirm("No search index found. Build it now?"):
-                asyncio.run(docs_manager.llm_text.generate_index_files())
-                result = docs_manager.search(query, top_k)
-        click.echo(result)
-    except Exception as e:
-        click.echo(f"Error: {str(e)}", err=True)
-        sys.exit(1)
-
-
-@docs.command()
-def update():
-    """Update docs from GitHub"""
-    try:
-        asyncio.run(docs_manager.fetch_docs())
-        click.echo("Documentation updated successfully")
-    except Exception as e:
-        click.echo(f"Error: {str(e)}", err=True)
-        sys.exit(1)
-
-
-@docs.command()
-@click.option("--force-facts", is_flag=True, help="Force regenerate fact files")
-@click.option("--clear-cache", is_flag=True, help="Clear BM25 cache")
-def index(force_facts: bool, clear_cache: bool):
-    """Build or rebuild search indexes"""
-    try:
-        asyncio.run(docs_manager.ensure_docs_exist())
-        asyncio.run(
-            docs_manager.llm_text.generate_index_files(
-                force_generate_facts=force_facts, clear_bm25_cache=clear_cache
-            )
-        )
-        click.echo("Search indexes built successfully")
-    except Exception as e:
-        click.echo(f"Error: {str(e)}", err=True)
-        sys.exit(1)
-
-
-# Add docs list command
-@docs.command()
-def list():
-    """List available documentation sections"""
-    try:
-        sections = docs_manager.list()
-        print_table(["Sections"], [[section] for section in sections])
-
-    except Exception as e:
-        click.echo(f"Error: {str(e)}", err=True)
-        sys.exit(1)
-
-
-if __name__ == "__main__":
-    cli()
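The column sizing in the removed `print_table` comes from `zip(headers, *rows)`, which transposes the table so each column can be measured in one pass. A sketch of the same layout logic that returns lines instead of echoing them, assuming no dependency on `click` (`render_table` is a hypothetical name for illustration):

```python
from typing import List


def render_table(headers: List[str], rows: List[List[str]], padding: int = 2) -> List[str]:
    """Hypothetical variant of print_table that returns the table as a list of lines."""
    # Transpose so each tuple holds one column: the header plus every cell below it.
    widths = [max(len(str(cell)) for cell in col) for col in zip(headers, *rows)]
    border = "+" + "+".join("-" * (w + 2 * padding) for w in widths) + "+"

    def format_row(row: List[str]) -> str:
        cells = (
            f"{' ' * padding}{str(cell):<{w}}{' ' * padding}"
            for cell, w in zip(row, widths)
        )
        return "|" + "|".join(cells) + "|"

    lines = [border, format_row(headers), border]
    lines.extend(format_row(row) for row in rows)
    lines.append(border)
    return lines


if __name__ == "__main__":
    print("\n".join(render_table(["Sections"], [["intro"], ["api"]])))
```

Returning lines rather than printing keeps the layout testable without capturing console output.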
--- a/crawl4ai/config.py
+++ b/crawl4ai/config.py
@@ -4,68 +4,31 @@ from dotenv import load_dotenv
 load_dotenv()  # Load environment variables from .env file

 # Default provider, ONLY used when the extraction strategy is LLMExtractionStrategy
-DEFAULT_PROVIDER = "openai/gpt-4o-mini"
+DEFAULT_PROVIDER = "openai/gpt-4-turbo"
 MODEL_REPO_BRANCH = "new-release-0.0.2"
 # Provider-model dictionary, ONLY used when the extraction strategy is LLMExtractionStrategy
 PROVIDER_MODELS = {
-    "ollama/llama3": "no-token-needed",  # Any model from Ollama no need for API token
+    "ollama/llama3": "no-token-needed", # Any model from Ollama no need for API token
    "groq/llama3-70b-8192": os.getenv("GROQ_API_KEY"),
    "groq/llama3-8b-8192": os.getenv("GROQ_API_KEY"),
-    "openai/gpt-4o-mini": os.getenv("OPENAI_API_KEY"),
+    "openai/gpt-3.5-turbo": os.getenv("OPENAI_API_KEY"),
+    "openai/gpt-4-turbo": os.getenv("OPENAI_API_KEY"),
    "openai/gpt-4o": os.getenv("OPENAI_API_KEY"),
-    "openai/o1-mini": os.getenv("OPENAI_API_KEY"),
-    "openai/o1-preview": os.getenv("OPENAI_API_KEY"),
    "anthropic/claude-3-haiku-20240307": os.getenv("ANTHROPIC_API_KEY"),
    "anthropic/claude-3-opus-20240229": os.getenv("ANTHROPIC_API_KEY"),
    "anthropic/claude-3-sonnet-20240229": os.getenv("ANTHROPIC_API_KEY"),
-    "anthropic/claude-3-5-sonnet-20240620": os.getenv("ANTHROPIC_API_KEY"),
 }

+
 # Chunk token threshold
-CHUNK_TOKEN_THRESHOLD = 2**11  # 2048 tokens
+CHUNK_TOKEN_THRESHOLD = 500
 OVERLAP_RATE = 0.1
 WORD_TOKEN_RATE = 1.3

-# Threshold for the minimum number of words in an HTML tag to be considered
+# Threshold for the minimum number of words in an HTML tag to be considered 
 MIN_WORD_THRESHOLD = 1
 IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD = 1

-IMPORTANT_ATTRS = ["src", "href", "alt", "title", "width", "height"]
-ONLY_TEXT_ELIGIBLE_TAGS = [
-    "b",
-    "i",
-    "u",
-    "span",
-    "del",
-    "ins",
-    "sub",
-    "sup",
-    "strong",
-    "em",
-    "code",
-    "kbd",
-    "var",
-    "s",
-    "q",
-    "abbr",
-    "cite",
-    "dfn",
-    "time",
-    "small",
-    "mark",
-]
-SOCIAL_MEDIA_DOMAINS = [
-    "facebook.com",
-    "twitter.com",
-    "x.com",
-    "linkedin.com",
-    "instagram.com",
-    "pinterest.com",
-    "tiktok.com",
-    "snapchat.com",
-    "reddit.com",
-]
-
 # Threshold for the Image extraction - Range is 1 to 6
 # Images are scored based on point based system, to filter based on usefulness. Points are assigned
 # to each image based on the following aspects.
@@ -75,12 +38,3 @@ SOCIAL_MEDIA_DOMAINS = [
 # If image format is in jpg, png or webp
 # If image is in the first half of the total images extracted from the page
 IMAGE_SCORE_THRESHOLD = 2
-
-MAX_METRICS_HISTORY = 1000
-
-NEED_MIGRATION = True
-URL_LOG_SHORTEN_LENGTH = 30
-SHOW_DEPRECATION_WARNINGS = True
-SCREENSHOT_HEIGHT_TRESHOLD = 10000
-PAGE_TIMEOUT = 60000
-DOWNLOAD_PAGE_TIMEOUT = 60000
--- a/crawl4ai/content_filter_strategy.py
+++ b/crawl4ai/content_filter_strategy.py
@@ -1,999 +0,0 @@
-import re
-import time
-from bs4 import BeautifulSoup, Tag
-from typing import List, Tuple, Dict, Optional
-from rank_bm25 import BM25Okapi
-from collections import deque
-from bs4 import NavigableString, Comment
-from .utils import clean_tokens, perform_completion_with_backoff, escape_json_string, sanitize_html, get_home_folder, extract_xml_data
-from abc import ABC, abstractmethod
-import math
-from snowballstemmer import stemmer
-from .config import DEFAULT_PROVIDER, OVERLAP_RATE, WORD_TOKEN_RATE
-from .models import TokenUsage
-from .prompts import PROMPT_FILTER_CONTENT
-import os
-import json
-import hashlib
-from pathlib import Path
-from concurrent.futures import ThreadPoolExecutor, as_completed
-from .async_logger import AsyncLogger, LogLevel
-from colorama import Fore, Style, init
-
-class RelevantContentFilter(ABC):
-    """Abstract base class for content filtering strategies"""
-
-    def __init__(self, user_query: str = None):
-        self.user_query = user_query
-        self.included_tags = {
-            # Primary structure
-            "article",
-            "main",
-            "section",
-            "div",
-            # List structures
-            "ul",
-            "ol",
-            "li",
-            "dl",
-            "dt",
-            "dd",
-            # Text content
-            "p",
-            "span",
-            "blockquote",
-            "pre",
-            "code",
-            # Headers
-            "h1",
-            "h2",
-            "h3",
-            "h4",
-            "h5",
-            "h6",
-            # Tables
-            "table",
-            "thead",
-            "tbody",
-            "tr",
-            "td",
-            "th",
-            # Other semantic elements
-            "figure",
-            "figcaption",
-            "details",
-            "summary",
-            # Text formatting
-            "em",
-            "strong",
-            "b",
-            "i",
-            "mark",
-            "small",
-            # Rich content
-            "time",
-            "address",
-            "cite",
-            "q",
-        }
-        self.excluded_tags = {
-            "nav",
-            "footer",
-            "header",
-            "aside",
-            "script",
-            "style",
-            "form",
-            "iframe",
-            "noscript",
-        }
-        self.header_tags = {"h1", "h2", "h3", "h4", "h5", "h6"}
-        self.negative_patterns = re.compile(
-            r"nav|footer|header|sidebar|ads|comment|promo|advert|social|share", re.I
-        )
-        self.min_word_count = 2
-
-    @abstractmethod
-    def filter_content(self, html: str) -> List[str]:
-        """Abstract method to be implemented by specific filtering strategies"""
-        pass
-
-    def extract_page_query(self, soup: BeautifulSoup, body: Tag) -> str:
-        """Common method to extract page metadata with fallbacks"""
-        if self.user_query:
-            return self.user_query
-
-        query_parts = []
-
-        # Title
-        try:
-            title = soup.title.string
-            if title:
-                query_parts.append(title)
-        except Exception:
-            pass
-
-        if soup.find("h1"):
-            query_parts.append(soup.find("h1").get_text())
-
-        # Meta tags
-        temp = ""
-        for meta_name in ["keywords", "description"]:
-            meta = soup.find("meta", attrs={"name": meta_name})
-            if meta and meta.get("content"):
-                query_parts.append(meta["content"])
-                temp += meta["content"]
-
-        # If still empty, grab first significant paragraph
-        if not temp:
-            # Find the first <p> tag whose text contains more than 150 characters
-            for p in body.find_all("p"):
-                if len(p.get_text()) > 150:
-                    query_parts.append(p.get_text()[:150])
-                    break
-
-        return " ".join(filter(None, query_parts))
-
-    def extract_text_chunks(
-        self, body: Tag, min_word_threshold: int = None
-    ) -> List[Tuple[str, str]]:
-        """
-        Extracts text chunks from a BeautifulSoup body element while preserving order.
-        Chunks with fewer than min_word_threshold words are dropped when it is set.
-
-        Args:
-            body: BeautifulSoup Tag object representing the body element
-
-        Returns:
-            List of (chunk_index, text, tag_type, element) tuples; tag_type is "header" or "content"
-        """
-        # Tags to ignore - inline elements that shouldn't break text flow
-        INLINE_TAGS = {
-            "a",
-            "abbr",
-            "acronym",
-            "b",
-            "bdo",
-            "big",
-            "br",
-            "button",
-            "cite",
-            "code",
-            "dfn",
-            "em",
-            "i",
-            "img",
-            "input",
-            "kbd",
-            "label",
-            "map",
-            "object",
-            "q",
-            "samp",
-            "script",
-            "select",
-            "small",
-            "span",
-            "strong",
-            "sub",
-            "sup",
-            "textarea",
-            "time",
-            "tt",
-            "var",
-        }
-
-        # Tags that typically contain meaningful headers
-        HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "header"}
-
-        chunks = []
-        current_text = []
-        chunk_index = 0
-
-        def should_break_chunk(tag: Tag) -> bool:
-            """Determine if a tag should cause a break in the current text chunk"""
-            return tag.name not in INLINE_TAGS and not (
-                tag.name == "p" and len(current_text) == 0
-            )
-
-        # Use deque for efficient push/pop operations
-        stack = deque([(body, False)])
-
-        while stack:
-            element, visited = stack.pop()
-
-            if visited:
-                # End of block element - flush accumulated text
-                if current_text and should_break_chunk(element):
-                    text = " ".join("".join(current_text).split())
-                    if text:
-                        tag_type = (
-                            "header" if element.name in HEADER_TAGS else "content"
-                        )
-                        chunks.append((chunk_index, text, tag_type, element))
-                        chunk_index += 1
-                    current_text = []
-                continue
-
-            if isinstance(element, NavigableString):
-                if str(element).strip():
-                    current_text.append(str(element).strip())
-                continue
-
-            # Pre-allocate children to avoid multiple list operations
-            children = list(element.children)
-            if not children:
-                continue
-
-            # Mark block for revisit after processing children
-            stack.append((element, True))
-
-            # Add children in reverse order for correct processing
-            for child in reversed(children):
-                if isinstance(child, (Tag, NavigableString)):
-                    stack.append((child, False))
-
-        # Handle any remaining text
-        if current_text:
-            text = " ".join("".join(current_text).split())
-            if text:
-                chunks.append((chunk_index, text, "content", body))
-
-        if min_word_threshold:
-            chunks = [
-                chunk for chunk in chunks if len(chunk[1].split()) >= min_word_threshold
-            ]
-
-        return chunks
-
-    def _deprecated_extract_text_chunks(
-        self, soup: BeautifulSoup
-    ) -> List[Tuple[int, str, Tag]]:
-        """Common method for extracting text chunks"""
-        _text_cache = {}
-
-        def fast_text(element: Tag) -> str:
-            elem_id = id(element)
-            if elem_id in _text_cache:
-                return _text_cache[elem_id]
-            texts = []
-            for content in element.contents:
-                if isinstance(content, str):
-                    text = content.strip()
-                    if text:
-                        texts.append(text)
-            result = " ".join(texts)
-            _text_cache[elem_id] = result
-            return result
-
-        candidates = []
-        index = 0
-
-        def dfs(element):
-            nonlocal index
-            if isinstance(element, Tag):
-                if element.name in self.included_tags:
-                    if not self.is_excluded(element):
-                        text = fast_text(element)
-                        word_count = len(text.split())
-
-                        # Headers pass through with adjusted minimum
-                        if element.name in self.header_tags:
-                            if word_count >= 3:  # Minimal sanity check for headers
-                                candidates.append((index, text, element))
-                                index += 1
-                        # Regular content uses standard minimum
-                        elif word_count >= self.min_word_count:
-                            candidates.append((index, text, element))
-                            index += 1
-
-                for child in element.children:
-                    dfs(child)
-
-        dfs(soup.body if soup.body else soup)
-        return candidates
-
-    def is_excluded(self, tag: Tag) -> bool:
-        """Common method for exclusion logic"""
-        if tag.name in self.excluded_tags:
-            return True
-        class_id = " ".join(
-            filter(None, [" ".join(tag.get("class", [])), tag.get("id", "")])
-        )
-        return bool(self.negative_patterns.search(class_id))
-
-    def clean_element(self, tag: Tag) -> str:
-        """Common method for cleaning HTML elements with minimal overhead"""
-        if not tag or not isinstance(tag, Tag):
-            return ""
-
-        unwanted_tags = {"script", "style", "aside", "form", "iframe", "noscript"}
-        unwanted_attrs = {
-            "style",
-            "onclick",
-            "onmouseover",
-            "align",
-            "bgcolor",
-            "class",
-            "id",
-        }
-
-        # Use string builder pattern for better performance
-        builder = []
-
-        def render_tag(elem):
-            if not isinstance(elem, Tag):
-                if isinstance(elem, str):
-                    builder.append(elem.strip())
-                return
-
-            if elem.name in unwanted_tags:
-                return
-
-            # Start tag
-            builder.append(f"<{elem.name}")
-
-            # Add cleaned attributes
-            attrs = {k: v for k, v in elem.attrs.items() if k not in unwanted_attrs}
-            for key, value in attrs.items():
-                builder.append(f' {key}="{value}"')
-
-            builder.append(">")
-
-            # Process children
-            for child in elem.children:
-                render_tag(child)
-
-            # Close tag
-            builder.append(f"</{elem.name}>")
-
-        try:
-            render_tag(tag)
-            return "".join(builder)
-        except Exception:
-            return str(tag)  # Fallback to original if anything fails
-
-class BM25ContentFilter(RelevantContentFilter):
-    """
-    Content filtering using BM25 algorithm with priority tag handling.
-
-    How it works:
-    1. Extracts page metadata with fallbacks.
-    2. Extracts text chunks from the body element.
-    3. Tokenizes the corpus and query.
-    4. Applies BM25 algorithm to calculate scores for each chunk.
-    5. Filters out chunks below the threshold.
-    6. Sorts chunks by score in descending order.
-    7. Returns the top N chunks.
-
-    Attributes:
-        user_query (str): User query for filtering (optional).
-        bm25_threshold (float): BM25 threshold for filtering (default: 1.0).
-        language (str): Language for stemming (default: 'english').
-
-        Methods:
-            filter_content(self, html: str, min_word_threshold: int = None)
-    """
-
-    def __init__(
-        self,
-        user_query: str = None,
-        bm25_threshold: float = 1.0,
-        language: str = "english",
-    ):
-        """
-        Initializes the BM25ContentFilter class, if not provided, falls back to page metadata.
-
-        Note:
-        If no query is given and no page metadata is available, then it tries to pick up the first significant paragraph.
-
-        Args:
-            user_query (str): User query for filtering (optional).
-            bm25_threshold (float): BM25 threshold for filtering (default: 1.0).
-            language (str): Language for stemming (default: 'english').
-        """
-        super().__init__(user_query=user_query)
-        self.bm25_threshold = bm25_threshold
-        self.priority_tags = {
-            "h1": 5.0,
-            "h2": 4.0,
-            "h3": 3.0,
-            "title": 4.0,
-            "strong": 2.0,
-            "b": 1.5,
-            "em": 1.5,
-            "blockquote": 2.0,
-            "code": 2.0,
-            "pre": 1.5,
-            "th": 1.5,  # Table headers
-        }
-        self.stemmer = stemmer(language)
-
-    def filter_content(self, html: str, min_word_threshold: int = None) -> List[str]:
-        """
-        Implements content filtering using BM25 algorithm with priority tag handling.
-
-            Note:
-        This method implements the filtering logic for the BM25ContentFilter class.
-        It takes HTML content as input and returns a list of filtered text chunks.
-
-        Args:
-            html (str): HTML content to be filtered.
-            min_word_threshold (int): Minimum word threshold for filtering (optional).
-
-        Returns:
-            List[str]: List of filtered text chunks.
-        """
-        if not html or not isinstance(html, str):
-            return []
-
-        soup = BeautifulSoup(html, "lxml")
-
-        # Check if body is present
-        if not soup.body:
-            # Wrap in body tag if missing
-            soup = BeautifulSoup(f"<body>{html}</body>", "lxml")
-        body = soup.find("body")
-
-        query = self.extract_page_query(soup, body)
-
-        if not query:
-            return []
-            # return [self.clean_element(soup)]
-
-        candidates = self.extract_text_chunks(body, min_word_threshold)
-
-        if not candidates:
-            return []
-
-        # Tokenize corpus
-        # tokenized_corpus = [chunk.lower().split() for _, chunk, _, _ in candidates]
-        # tokenized_query = query.lower().split()
-
-        # tokenized_corpus = [[ps.stem(word) for word in chunk.lower().split()]
-        #                 for _, chunk, _, _ in candidates]
-        # tokenized_query = [ps.stem(word) for word in query.lower().split()]
-
-        tokenized_corpus = [
-            [self.stemmer.stemWord(word) for word in chunk.lower().split()]
-            for _, chunk, _, _ in candidates
-        ]
-        tokenized_query = [
-            self.stemmer.stemWord(word) for word in query.lower().split()
-        ]
-
-        # tokenized_corpus = [[self.stemmer.stemWord(word) for word in tokenize_text(chunk.lower())]
-        #            for _, chunk, _, _ in candidates]
-        # tokenized_query = [self.stemmer.stemWord(word) for word in tokenize_text(query.lower())]
-
-        # Clean from stop words and noise
-        tokenized_corpus = [clean_tokens(tokens) for tokens in tokenized_corpus]
-        tokenized_query = clean_tokens(tokenized_query)
-
-        bm25 = BM25Okapi(tokenized_corpus)
-        scores = bm25.get_scores(tokenized_query)
-
-        # Adjust scores with tag weights
-        adjusted_candidates = []
-        for score, (index, chunk, tag_type, tag) in zip(scores, candidates):
-            tag_weight = self.priority_tags.get(tag.name, 1.0)
-            adjusted_score = score * tag_weight
-            adjusted_candidates.append((adjusted_score, index, chunk, tag))
-
-        # Filter candidates by threshold
-        selected_candidates = [
-            (index, chunk, tag)
-            for adjusted_score, index, chunk, tag in adjusted_candidates
-            if adjusted_score >= self.bm25_threshold
-        ]
-
-        if not selected_candidates:
-            return []
-
-        # Sort selected candidates by original document order
-        selected_candidates.sort(key=lambda x: x[0])
-
-        return [self.clean_element(tag) for _, _, tag in selected_candidates]
-
-class PruningContentFilter(RelevantContentFilter):
-    """
-    Content filtering using pruning algorithm with dynamic threshold.
-
-    How it works:
-    1. Extracts page metadata with fallbacks.
-    2. Extracts text chunks from the body element.
-    3. Applies pruning algorithm to calculate scores for each chunk.
-    4. Filters out chunks below the threshold.
-    5. Sorts chunks by score in descending order.
-    6. Returns the top N chunks.
-
-    Attributes:
-        user_query (str): User query for filtering (optional), if not provided, falls back to page metadata.
-        min_word_threshold (int): Minimum word threshold for filtering (optional).
-        threshold_type (str): Threshold type for dynamic threshold (default: 'fixed').
-        threshold (float): Fixed threshold value (default: 0.48).
-
-        Methods:
-            filter_content(self, html: str, min_word_threshold: int = None):
-    """
-
-    def __init__(
-        self,
-        user_query: str = None,
-        min_word_threshold: int = None,
-        threshold_type: str = "fixed",
-        threshold: float = 0.48,
-    ):
-        """
-        Initializes the PruningContentFilter class, if not provided, falls back to page metadata.
-
-        Note:
-        If no query is given and no page metadata is available, then it tries to pick up the first significant paragraph.
-
-        Args:
-            user_query (str): User query for filtering (optional).
-            min_word_threshold (int): Minimum word threshold for filtering (optional).
-            threshold_type (str): Threshold type for dynamic threshold (default: 'fixed').
-            threshold (float): Fixed threshold value (default: 0.48).
-        """
-        super().__init__(None)
-        self.min_word_threshold = min_word_threshold
-        self.threshold_type = threshold_type
-        self.threshold = threshold
-
-        # Add tag importance for dynamic threshold
-        self.tag_importance = {
-            "article": 1.5,
-            "main": 1.4,
-            "section": 1.3,
-            "p": 1.2,
-            "h1": 1.4,
-            "h2": 1.3,
-            "h3": 1.2,
-            "div": 0.7,
-            "span": 0.6,
-        }
-
-        # Metric configuration
-        self.metric_config = {
-            "text_density": True,
-            "link_density": True,
-            "tag_weight": True,
-            "class_id_weight": True,
-            "text_length": True,
-        }
-
-        self.metric_weights = {
-            "text_density": 0.4,
-            "link_density": 0.2,
-            "tag_weight": 0.2,
-            "class_id_weight": 0.1,
-            "text_length": 0.1,
-        }
-
-        self.tag_weights = {
-            "div": 0.5,
-            "p": 1.0,
-            "article": 1.5,
-            "section": 1.0,
-            "span": 0.3,
-            "li": 0.5,
-            "ul": 0.5,
-            "ol": 0.5,
-            "h1": 1.2,
-            "h2": 1.1,
-            "h3": 1.0,
-            "h4": 0.9,
-            "h5": 0.8,
-            "h6": 0.7,
-        }
-
-    def filter_content(self, html: str, min_word_threshold: int = None) -> List[str]:
-        """
-        Implements content filtering using pruning algorithm with dynamic threshold.
-
-        Note:
-        This method implements the filtering logic for the PruningContentFilter class.
-        It takes HTML content as input and returns a list of filtered text chunks.
-
-        Args:
-            html (str): HTML content to be filtered.
-            min_word_threshold (int): Minimum word threshold for filtering (optional).
-
-        Returns:
-            List[str]: List of filtered text chunks.
-        """
-        if not html or not isinstance(html, str):
-            return []
-
-        soup = BeautifulSoup(html, "lxml")
-        if not soup.body:
-            soup = BeautifulSoup(f"<body>{html}</body>", "lxml")
-
-        # Remove comments and unwanted tags
-        self._remove_comments(soup)
-        self._remove_unwanted_tags(soup)
-
-        # Prune tree starting from body
-        body = soup.find("body")
-        self._prune_tree(body)
-
-        # Extract remaining content as list of HTML strings
-        content_blocks = []
-        for element in body.children:
-            if isinstance(element, str) or not hasattr(element, "name"):
-                continue
-            if len(element.get_text(strip=True)) > 0:
-                content_blocks.append(str(element))
-
-        return content_blocks
-
-    def _remove_comments(self, soup):
-        """Removes HTML comments"""
-        for element in soup(text=lambda text: isinstance(text, Comment)):
-            element.extract()
-
-    def _remove_unwanted_tags(self, soup):
-        """Removes unwanted tags"""
-        for tag in self.excluded_tags:
-            for element in soup.find_all(tag):
-                element.decompose()
-
-    def _prune_tree(self, node):
-        """
-        Prunes the tree starting from the given node.
-
-        Args:
-            node (Tag): The node from which the pruning starts.
-        """
-        if not node or not hasattr(node, "name") or node.name is None:
-            return
-
-        text_len = len(node.get_text(strip=True))
-        tag_len = len(node.encode_contents().decode("utf-8"))
-        link_text_len = sum(
-            len(s.strip())
-            for s in (a.string for a in node.find_all("a", recursive=False))
-            if s
-        )
-
-        metrics = {
-            "node": node,
-            "tag_name": node.name,
-            "text_len": text_len,
-            "tag_len": tag_len,
-            "link_text_len": link_text_len,
-        }
-
-        score = self._compute_composite_score(metrics, text_len, tag_len, link_text_len)
-
-        if self.threshold_type == "fixed":
-            should_remove = score < self.threshold
-        else:  # dynamic
-            tag_importance = self.tag_importance.get(node.name, 0.7)
-            text_ratio = text_len / tag_len if tag_len > 0 else 0
-            link_ratio = link_text_len / text_len if text_len > 0 else 1
-
-            threshold = self.threshold  # base threshold
-            if tag_importance > 1:
-                threshold *= 0.8
-            if text_ratio > 0.4:
-                threshold *= 0.9
-            if link_ratio > 0.6:
-                threshold *= 1.2
-
-            should_remove = score < threshold
-
-        if should_remove:
-            node.decompose()
-        else:
-            children = [child for child in node.children if hasattr(child, "name")]
-            for child in children:
-                self._prune_tree(child)
-
-    def _compute_composite_score(self, metrics, text_len, tag_len, link_text_len):
-        """Computes the composite score"""
-        if self.min_word_threshold:
-            # Get raw text from metrics node - avoid extra processing
-            text = metrics["node"].get_text(strip=True)
-            word_count = text.count(" ") + 1
-            if word_count < self.min_word_threshold:
-                return -1.0  # Guaranteed removal
-        score = 0.0
-        total_weight = 0.0
-
-        if self.metric_config["text_density"]:
-            density = text_len / tag_len if tag_len > 0 else 0
-            score += self.metric_weights["text_density"] * density
-            total_weight += self.metric_weights["text_density"]
-
-        if self.metric_config["link_density"]:
-            density = 1 - (link_text_len / text_len if text_len > 0 else 0)
-            score += self.metric_weights["link_density"] * density
-            total_weight += self.metric_weights["link_density"]
-
-        if self.metric_config["tag_weight"]:
-            tag_score = self.tag_weights.get(metrics["tag_name"], 0.5)
-            score += self.metric_weights["tag_weight"] * tag_score
-            total_weight += self.metric_weights["tag_weight"]
-
-        if self.metric_config["class_id_weight"]:
-            class_score = self._compute_class_id_weight(metrics["node"])
-            score += self.metric_weights["class_id_weight"] * max(0, class_score)
-            total_weight += self.metric_weights["class_id_weight"]
-
-        if self.metric_config["text_length"]:
-            score += self.metric_weights["text_length"] * math.log(text_len + 1)
-            total_weight += self.metric_weights["text_length"]
-
-        return score / total_weight if total_weight > 0 else 0
-
-    def _compute_class_id_weight(self, node):
-        """Computes the class ID weight"""
-        class_id_score = 0
-        if "class" in node.attrs:
-            classes = " ".join(node["class"])
-            if self.negative_patterns.match(classes):
-                class_id_score -= 0.5
-        if "id" in node.attrs:
-            element_id = node["id"]
-            if self.negative_patterns.match(element_id):
-                class_id_score -= 0.5
-        return class_id_score
-
-class LLMContentFilter(RelevantContentFilter):
-    """Content filtering using LLMs to generate relevant markdown."""
-
-    def __init__(
-        self,
-        provider: str = DEFAULT_PROVIDER,
-        api_token: Optional[str] = None,
-        instruction: str = None,
-        chunk_token_threshold: int = int(1e9),
-        overlap_rate: float = OVERLAP_RATE,
-        word_token_rate: float = WORD_TOKEN_RATE,
-        base_url: Optional[str] = None,
-        api_base: Optional[str] = None,
-        extra_args: Dict = None,
-        verbose: bool = False,
-        logger: Optional[AsyncLogger] = None,
-    ):
-        super().__init__(None)
-        self.provider = provider
-        self.api_token = (
-            api_token
-            or PROVIDER_MODELS.get(provider, "no-token")
-            or os.getenv("OPENAI_API_KEY")
-        )
-        self.instruction = instruction
-        self.chunk_token_threshold = chunk_token_threshold
-        self.overlap_rate = overlap_rate
-        self.word_token_rate = word_token_rate
-        self.base_url = base_url
-        self.api_base = api_base or base_url
-        self.extra_args = extra_args or {}
-        self.verbose = verbose
-        
-        # Setup logger with custom styling for LLM operations
-        if logger:
-            self.logger = logger
-        elif verbose:
-            self.logger = AsyncLogger(
-                verbose=True,
-                icons={
-                    **AsyncLogger.DEFAULT_ICONS,
-                    "LLM": "★",  # Star for LLM operations
-                    "CHUNK": "◈",  # Diamond for chunks
-                    "CACHE": "⚡", # Lightning for cache operations
-                },
-                colors={
-                    **AsyncLogger.DEFAULT_COLORS,
-                    LogLevel.INFO: Fore.MAGENTA + Style.DIM,  # Dimmed purple for LLM ops
-                }
-            )
-        else:
-            self.logger = None
-        
-        self.usages = []
-        self.total_usage = TokenUsage()
-
-    def _get_cache_key(self, html: str, instruction: str) -> str:
-        """Generate a unique cache key based on HTML and instruction"""
-        content = f"{html}{instruction}"
-        return hashlib.md5(content.encode()).hexdigest()
-
-    def _merge_chunks(self, text: str) -> List[str]:
-        """Split text into chunks with overlap"""
-        # Calculate tokens and sections
-        total_tokens = len(text.split()) * self.word_token_rate
-        num_sections = max(1, math.floor(total_tokens / self.chunk_token_threshold))
-        adjusted_chunk_threshold = total_tokens / num_sections
-
-        # Split into words
-        words = text.split()
-        chunks = []
-        current_chunk = []
-        current_token_count = 0
-
-        for word in words:
-            word_tokens = len(word) * self.word_token_rate
-            if current_token_count + word_tokens <= adjusted_chunk_threshold:
-                current_chunk.append(word)
-                current_token_count += word_tokens
-            else:
-                # Add overlap if not the last chunk
-                if chunks and self.overlap_rate > 0:
-                    overlap_size = int(len(current_chunk) * self.overlap_rate)
-                    current_chunk.extend(current_chunk[-overlap_size:])
-                
-                chunks.append(" ".join(current_chunk))
-                current_chunk = [word]
-                current_token_count = word_tokens
-
-        if current_chunk:
-            chunks.append(" ".join(current_chunk))
-
-        return chunks
-
-    def filter_content(self, html: str, ignore_cache: bool = False) -> List[str]:
-        if not html or not isinstance(html, str):
-            return []
-
-        if self.logger:
-            self.logger.info(
-                "Starting LLM content filtering process", 
-                tag="LLM",
-                params={"provider": self.provider},
-                colors={"provider": Fore.CYAN}
-            )
-
-        # Cache handling
-        cache_dir = Path(get_home_folder()) / "llm_cache" / "content_filter"
-        cache_dir.mkdir(parents=True, exist_ok=True)
-        cache_key = self._get_cache_key(html, self.instruction or "")
-        cache_file = cache_dir / f"{cache_key}.json"
-
-        if not ignore_cache and cache_file.exists():
-            if self.logger:
-                self.logger.info("Found cached result", tag="CACHE")
-            try:
-                with cache_file.open('r') as f:
-                    cached_data = json.load(f)
-                    usage = TokenUsage(**cached_data['usage'])
-                    self.usages.append(usage)
-                    self.total_usage.completion_tokens += usage.completion_tokens
-                    self.total_usage.prompt_tokens += usage.prompt_tokens
-                    self.total_usage.total_tokens += usage.total_tokens
-                    return cached_data['blocks']
-            except Exception as e:
-                if self.logger:
-                    self.logger.error(f"Cache read error: {str(e)}", tag="CACHE")
-
-        # Split into chunks
-        html_chunks = self._merge_chunks(html)
-        if self.logger:
-            self.logger.info(
-                "Split content into {chunk_count} chunks", 
-                tag="CHUNK",
-                params={"chunk_count": len(html_chunks)},
-                colors={"chunk_count": Fore.YELLOW}
-            )
-        
-        extracted_content = []
-        start_time = time.time()
-        
-        # Process chunks in parallel
-        with ThreadPoolExecutor(max_workers=4) as executor:
-            futures = []
-            for i, chunk in enumerate(html_chunks):
-                if self.logger:
-                    self.logger.debug(
-                        "Processing chunk {chunk_num}/{total_chunks}", 
-                        tag="CHUNK",
-                        params={
-                            "chunk_num": i + 1,
-                            "total_chunks": len(html_chunks)
-                        }
-                    )
-
-                prompt_variables = {
-                    "HTML": escape_json_string(sanitize_html(chunk)),
-                    "REQUEST": self.instruction or "Convert this HTML into clean, relevant markdown, removing any noise or irrelevant content."
-                }
-
-                prompt = PROMPT_FILTER_CONTENT
-                for var, value in prompt_variables.items():
-                    prompt = prompt.replace("{" + var + "}", value)
-
-                future = executor.submit(
-                    perform_completion_with_backoff,
-                    self.provider,
-                    prompt,
-                    self.api_token,
-                    base_url=self.api_base,
-                    extra_args=self.extra_args
-                )
-                futures.append((i, future))
-
-            # Collect results in order
-            ordered_results = []
-            for i, future in sorted(futures):
-                try:
-                    response = future.result()
-                    
-                    # Track usage
-                    usage = TokenUsage(
-                        completion_tokens=response.usage.completion_tokens,
-                        prompt_tokens=response.usage.prompt_tokens,
-                        total_tokens=response.usage.total_tokens,
-                        completion_tokens_details=response.usage.completion_tokens_details.__dict__ 
-                        if response.usage.completion_tokens_details else {},
-                        prompt_tokens_details=response.usage.prompt_tokens_details.__dict__
-                        if response.usage.prompt_tokens_details else {},
-                    )
-                    self.usages.append(usage)
-                    self.total_usage.completion_tokens += usage.completion_tokens
-                    self.total_usage.prompt_tokens += usage.prompt_tokens
-                    self.total_usage.total_tokens += usage.total_tokens
-
-                    blocks = extract_xml_data(["content"], response.choices[0].message.content)["content"]
-                    if blocks:
-                        ordered_results.append(blocks)
-                        if self.logger:
-                            self.logger.success(
-                                "Successfully processed chunk {chunk_num}", 
-                                tag="CHUNK",
-                                params={"chunk_num": i + 1}
-                            )
-                except Exception as e:
-                    if self.logger:
-                        self.logger.error(
-                            "Error processing chunk {chunk_num}: {error}", 
-                            tag="CHUNK",
-                            params={
-                                "chunk_num": i + 1,
-                                "error": str(e)
-                            }
-                        )
-
-        end_time = time.time()
-        if self.logger:
-            self.logger.success(
-                "Completed processing in {time:.2f}s", 
-                tag="LLM",
-                params={"time": end_time - start_time},
-                colors={"time": Fore.YELLOW}
-            )
-
-        result = ordered_results if ordered_results else []
-
-        # Cache the final result
-        cache_data = {
-            'blocks': result,
-            'usage': self.total_usage.__dict__
-        }
-        with cache_file.open('w') as f:
-            json.dump(cache_data, f)
-            if self.logger:
-                self.logger.info("Cached results for future use", tag="CACHE")
-
-        return result
-
-    def show_usage(self) -> None:
-        """Print usage statistics"""
-        print("\n=== Token Usage Summary ===")
-        print(f"{'Type':<15} {'Count':>12}")
-        print("-" * 30)
-        print(f"{'Completion':<15} {self.total_usage.completion_tokens:>12,}")
-        print(f"{'Prompt':<15} {self.total_usage.prompt_tokens:>12,}")
-        print(f"{'Total':<15} {self.total_usage.total_tokens:>12,}")
-
-        if self.usages:
-            print("\n=== Usage History ===")
-            print(f"{'Request #':<10} {'Completion':>12} {'Prompt':>12} {'Total':>12}")
-            print("-" * 48)
-            for i, usage in enumerate(self.usages, 1):
-                print(
-                    f"{i:<10} {usage.completion_tokens:>12,} "
-                    f"{usage.prompt_tokens:>12,} {usage.total_tokens:>12,}"
-                )
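
Review note on the removed `_merge_chunks`: as written, it appears to append the overlap words to the end of the same chunk they were taken from, rather than carrying them into the next chunk, and it estimates tokens from `len(word)` (characters) instead of counting words. The sketch below shows one way overlapping word chunks could be produced. It is an illustration only, not the project's implementation; the function name `split_with_overlap` and its parameters `chunk_words` and `overlap_rate` are hypothetical, and it counts words rather than tokens.

```python
from typing import List


def split_with_overlap(text: str, chunk_words: int, overlap_rate: float = 0.1) -> List[str]:
    """Split text into word chunks; each chunk after the first
    begins with the tail of the previous chunk (hypothetical helper)."""
    if chunk_words <= 0:
        raise ValueError("chunk_words must be positive")
    if not 0 <= overlap_rate < 1:
        raise ValueError("overlap_rate must be in [0, 1)")

    words = text.split()
    if not words:
        return []

    # Number of words shared between consecutive chunks.
    overlap = int(chunk_words * overlap_rate)
    # Distance between chunk starts; always >= 1 because overlap < chunk_words.
    step = chunk_words - overlap

    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_words >= len(words):
            break
    return chunks


if __name__ == "__main__":
    for chunk in split_with_overlap("a b c d e f g h i j", chunk_words=4, overlap_rate=0.5):
        print(chunk)
```

Using a fixed step between window starts keeps the overlap size constant and guarantees that every word appears in at least one chunk.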
--- a/crawl4ai/content_scraping_strategy.py
+++ b/crawl4ai/content_scraping_strategy.py
--- a/crawl4ai/content_scrapping_strategy.py
+++ b/crawl4ai/content_scrapping_strategy.py
@@ -0,0 +1,301 @@
+from abc import ABC, abstractmethod
+from typing import Dict, Any
+from bs4 import BeautifulSoup
+from concurrent.futures import ThreadPoolExecutor
+import asyncio, requests, re, os
+from .config import *
+from bs4 import element, NavigableString, Comment
+from urllib.parse import urljoin
+from requests.exceptions import InvalidSchema
+
+from .utils import (
+    sanitize_input_encode,
+    sanitize_html,
+    extract_metadata,
+    InvalidCSSSelectorError,
+    CustomHTML2Text
+)
+
+class ContentScrappingStrategy(ABC):
+    @abstractmethod
+    def scrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
+        pass
+
+    @abstractmethod
+    async def ascrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
+        pass
+
+class WebScrappingStrategy(ContentScrappingStrategy):
+    def scrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
+        return self._get_content_of_website_optimized(url, html, is_async=False, **kwargs)
+
+    async def ascrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
+        return await asyncio.to_thread(self._get_content_of_website_optimized, url, html, **kwargs)
+
+    def _get_content_of_website_optimized(self, url: str, html: str, word_count_threshold: int = MIN_WORD_THRESHOLD, css_selector: str = None, **kwargs) -> Dict[str, Any]:
+        if not html:
+            return None
+
+        soup = BeautifulSoup(html, 'html.parser')
+        body = soup.body
+        
+        image_description_min_word_threshold = kwargs.get('image_description_min_word_threshold', IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD)
+
+        for tag in kwargs.get('excluded_tags', []) or []:
+            for el in body.select(tag):
+                el.decompose()
+        
+        if css_selector:
+            selected_elements = body.select(css_selector)
+            if not selected_elements:
+                return {
+                    'markdown': '',
+                    'cleaned_html': '',
+                    'success': True,
+                    'media': {'images': [], 'videos': [], 'audios': []},
+                    'links': {'internal': [], 'external': []},
+                    'metadata': {},
+                    'message': f"No elements found for CSS selector: {css_selector}"
+                }
+                # raise InvalidCSSSelectorError(f"Invalid CSS selector, No elements found for CSS selector: {css_selector}")
+            body = soup.new_tag('div')
+            for el in selected_elements:
+                body.append(el)
+
+        links = {'internal': [], 'external': []}
+        media = {'images': [], 'videos': [], 'audios': []}
+
+        # Extract meaningful text for media files from closest parent
+        def find_closest_parent_with_useful_text(tag):
+            current_tag = tag
+            while current_tag:
+                current_tag = current_tag.parent
+                # Get the text content of the parent tag
+                if current_tag:
+                    text_content = current_tag.get_text(separator=' ', strip=True)
+                    # Accept the first ancestor whose text has at least image_description_min_word_threshold words
+                    if len(text_content.split()) >= image_description_min_word_threshold:
+                        return text_content
+            return None
+
+        def process_image(img, url, index, total_images):
+            # Check that the image is displayed, has a src, and is not inside an undesired HTML element
+            def is_valid_image(img, parent, parent_classes):
+                style = img.get('style', '')
+                src = img.get('src', '')
+                classes_to_check = ['button', 'icon', 'logo']
+                tags_to_check = ['button', 'input']
+                return all([
+                    'display:none' not in style,
+                    src,
+                    not any(s in var for var in [src, img.get('alt', ''), *parent_classes] for s in classes_to_check),
+                    parent.name not in tags_to_check
+                ])
+
+            # Score an image for its usefulness
+            def score_image_for_usefulness(img, base_url, index, images_count):
+                # Function to parse image height/width value and units
+                def parse_dimension(dimension):
+                    if dimension:
+                        match = re.match(r"(\d+)(\D*)", dimension)
+                        if match:
+                            number = int(match.group(1))
+                            unit = match.group(2) or 'px'  # Default unit is 'px' if not specified
+                            return number, unit
+                    return None, None
+
+                # Fetch image file metadata to extract size and extension
+                def fetch_image_file_size(img, base_url):
+                    # If src is a relative path, build the full URL; an absolute src (e.g. a CDN URL) is kept as-is
+                    img_url = urljoin(base_url,img.get('src'))
+                    try:
+                        response = requests.head(img_url)
+                        if response.status_code == 200:
+                            return response.headers.get('Content-Length',None)
+                        else:
+                            print(f"Failed to retrieve file size for {img_url}")
+                            return None
+                    except InvalidSchema:
+                        return None
+                    except requests.RequestException:
+                        return None
+
+                image_height = img.get('height')
+                height_value, height_unit = parse_dimension(image_height)
+                image_width =  img.get('width')
+                width_value, width_unit = parse_dimension(image_width)
+                image_size = 0 #int(fetch_image_file_size(img,base_url) or 0)
+                image_format = os.path.splitext(img.get('src',''))[1].lower()
+                # Remove . from format
+                image_format = image_format.strip('.').split('?')[0]
+                score = 0
+                if height_value:
+                    if height_unit == 'px' and height_value > 150:
+                        score += 1
+                    if height_unit in ['%','vh','vmin','vmax'] and height_value >30:
+                        score += 1
+                if width_value:
+                    if width_unit == 'px' and width_value > 150:
+                        score += 1
+                    if width_unit in ['%','vh','vmin','vmax'] and width_value >30:
+                        score += 1
+                if image_size > 10000:
+                    score += 1
+                if img.get('alt'):  # only a non-empty alt earns a point; a missing alt is None, not ''
+                    score+=1
+                if any(image_format==format for format in ['jpg','png','webp']):
+                    score+=1
+                if index/images_count<0.5:
+                    score+=1
+                return score
+
+            if not is_valid_image(img, img.parent, img.parent.get('class', [])):
+                return None
+            score = score_image_for_usefulness(img, url, index, total_images)
+            if score <= IMAGE_SCORE_THRESHOLD:
+                return None
+            return {
+                'src': img.get('src', ''),
+                'data-src': img.get('data-src', ''),
+                'alt': img.get('alt', ''),
+                'desc': find_closest_parent_with_useful_text(img),
+                'score': score,
+                'type': 'image'
+            }
+
+        def process_element(element: element.PageElement) -> bool:
+            try:
+                if isinstance(element, NavigableString):
+                    if isinstance(element, Comment):
+                        element.extract()
+                    return False
+                
+                # if element.name == 'img':
+                #     process_image(element, url, 0, 1)
+                #     return True
+
+                if element.name in ['script', 'style', 'link', 'meta', 'noscript']:
+                    element.decompose()
+                    return False
+
+                keep_element = False
+
+                if element.name == 'a' and element.get('href'):
+                    href = element['href']
+                    url_base = url.split('/')[2]
+                    link_data = {'href': href, 'text': element.get_text()}
+                    if href.startswith('http') and url_base not in href:
+                        links['external'].append(link_data)
+                    else:
+                        links['internal'].append(link_data)
+                    keep_element = True
+
+                elif element.name == 'img':
+                    return True  # Always keep image elements
+
+                elif element.name in ['video', 'audio']:
+                    media[f"{element.name}s"].append({
+                        'src': element.get('src'),
+                        'alt': element.get('alt'),
+                        'type': element.name,
+                        'description': find_closest_parent_with_useful_text(element)
+                    })
+                    source_tags = element.find_all('source')
+                    for source_tag in source_tags:
+                        media[f"{element.name}s"].append({
+                            'src': source_tag.get('src'),
+                            'alt': element.get('alt'),
+                            'type': element.name,
+                            'description': find_closest_parent_with_useful_text(element)
+                        })
+                    return True  # Always keep video and audio elements
+
+                if element.name != 'pre':
+                    if element.name in ['b', 'i', 'u', 'span', 'del', 'ins', 'sub', 'sup', 'strong', 'em', 'code', 'kbd', 'var', 's', 'q', 'abbr', 'cite', 'dfn', 'time', 'small', 'mark']:
+                        if kwargs.get('only_text', False):
+                            element.replace_with(element.get_text())
+                        else:
+                            element.unwrap()
+                    elif element.name != 'img':
+                        element.attrs = {}
+
+                # Process children
+                for child in list(element.children):
+                    if isinstance(child, NavigableString) and not isinstance(child, Comment):
+                        if len(child.strip()) > 0:
+                            keep_element = True
+                    else:
+                        if process_element(child):
+                            keep_element = True
+                    
+
+                # Check word count
+                if not keep_element:
+                    word_count = len(element.get_text(strip=True).split())
+                    keep_element = word_count >= word_count_threshold
+
+                if not keep_element:
+                    element.decompose()
+
+                return keep_element
+            except Exception as e:
+                print('Error processing element:', str(e))
+                return False
+
+        # Process images by filtering and extracting contextual text from the page
+        # imgs = body.find_all('img')
+        # media['images'] = [
+        #     result for result in
+        #     (process_image(img, url, i, len(imgs)) for i, img in enumerate(imgs))
+        #     if result is not None
+        # ]
+        
+        process_element(body)
+
+        # Process images using ThreadPoolExecutor
+        imgs = body.find_all('img')
+        with ThreadPoolExecutor() as executor:
+            image_results = list(executor.map(process_image, imgs, [url]*len(imgs), range(len(imgs)), [len(imgs)]*len(imgs)))
+        media['images'] = [result for result in image_results if result is not None]
+
+        def flatten_nested_elements(node):
+            if isinstance(node, NavigableString):
+                return node
+            if len(node.contents) == 1 and isinstance(node.contents[0], element.Tag) and node.contents[0].name == node.name:
+                return flatten_nested_elements(node.contents[0])
+            node.contents = [flatten_nested_elements(child) for child in node.contents]
+            return node
+
+        body = flatten_nested_elements(body)
+        base64_pattern = re.compile(r'data:image/[^;]+;base64,([^"]+)')
+        for img in imgs:
+            src = img.get('src', '')
+            if base64_pattern.match(src):
+                # Replace base64 data with empty string
+                img['src'] = base64_pattern.sub('', src)
+        cleaned_html = str(body).replace('\n\n', '\n').replace('  ', ' ')
+
+        h = CustomHTML2Text()
+        h.ignore_links = True
+        h.body_width = 0
+        try:
+            markdown = h.handle(cleaned_html)
+        except Exception as e:
+            markdown = h.handle(sanitize_html(cleaned_html))
+        markdown = markdown.replace('    ```', '```')
+
+        try:
+            meta = extract_metadata(html, soup)
+        except Exception as e:
+            print('Error extracting metadata:', str(e))
+            meta = {}
+
+        cleaned_html = sanitize_html(cleaned_html)
+        return {
+            'markdown': markdown,
+            'cleaned_html': cleaned_html,
+            'success': True,
+            'media': media,
+            'links': links,
+            'metadata': meta
+        }
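Reviewer sketch for the image-scoring heuristic added above. The nested `score_image_for_usefulness` is hard to exercise in isolation, so the version below restates the same rules as a standalone function over a plain attribute dict. The names `score_image` and `attrs` are hypothetical and not part of this patch; the file-size check is left out because the patch hard-codes `image_size = 0`, and treating a missing `alt` as empty is an assumption about the intent.

```python
import os
import re
from typing import Optional, Tuple


def parse_dimension(dimension: Optional[str]) -> Tuple[Optional[int], Optional[str]]:
    """Split a height/width attribute such as '400' or '50%' into (number, unit)."""
    if dimension:
        match = re.match(r"(\d+)(\D*)", dimension)
        if match:
            # A bare number is interpreted as pixels, as in the patch.
            return int(match.group(1)), (match.group(2) or "px")
    return None, None


def score_image(attrs: dict, index: int, images_count: int) -> int:
    """Hypothetical standalone version of the scoring rules (assumption: attrs
    is a plain dict of the <img> attributes rather than a BeautifulSoup tag)."""
    score = 0
    for key in ("height", "width"):
        value, unit = parse_dimension(attrs.get(key))
        if value:
            if unit == "px" and value > 150:
                score += 1
            if unit in ("%", "vh", "vmin", "vmax") and value > 30:
                score += 1
    if attrs.get("alt"):  # assumption: missing and empty alt both score nothing
        score += 1
    ext = os.path.splitext(attrs.get("src", ""))[1].lower().strip(".").split("?")[0]
    if ext in ("jpg", "png", "webp"):
        score += 1
    if index / images_count < 0.5:  # images in the first half of the page
        score += 1
    return score
```

With these rules the score can never exceed 5 once the size check is disabled, which is worth keeping in mind when choosing `IMAGE_SCORE_THRESHOLD`.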
--- a/crawl4ai/crawler_strategy.py
+++ b/crawl4ai/crawler_strategy.py
@@ -15,53 +15,54 @@ import logging, time
 import base64
 from PIL import Image, ImageDraw, ImageFont
 from io import BytesIO
-from typing import Callable
+from typing import List, Callable
 import requests
 import os
 from pathlib import Path
 from .utils import *

-logger = logging.getLogger("selenium.webdriver.remote.remote_connection")
+logger = logging.getLogger('selenium.webdriver.remote.remote_connection')
 logger.setLevel(logging.WARNING)

-logger_driver = logging.getLogger("selenium.webdriver.common.service")
+logger_driver = logging.getLogger('selenium.webdriver.common.service')
 logger_driver.setLevel(logging.WARNING)

-urllib3_logger = logging.getLogger("urllib3.connectionpool")
+urllib3_logger = logging.getLogger('urllib3.connectionpool')
 urllib3_logger.setLevel(logging.WARNING)

 # Disable http.client logging
-http_client_logger = logging.getLogger("http.client")
+http_client_logger = logging.getLogger('http.client')
 http_client_logger.setLevel(logging.WARNING)

 # Disable driver_finder and service logging
-driver_finder_logger = logging.getLogger("selenium.webdriver.common.driver_finder")
+driver_finder_logger = logging.getLogger('selenium.webdriver.common.driver_finder')
 driver_finder_logger.setLevel(logging.WARNING)


+
+
 class CrawlerStrategy(ABC):
    @abstractmethod
    def crawl(self, url: str, **kwargs) -> str:
        pass
-
+    
    @abstractmethod
    def take_screenshot(self, save_path: str):
        pass
-
+    
    @abstractmethod
    def update_user_agent(self, user_agent: str):
        pass
-
+    
    @abstractmethod
    def set_hook(self, hook_type: str, hook: Callable):
        pass

-
 class CloudCrawlerStrategy(CrawlerStrategy):
-    def __init__(self, use_cached_html=False):
+    def __init__(self, use_cached_html=False):
        super().__init__()
        self.use_cached_html = use_cached_html
-
+        
    def crawl(self, url: str) -> str:
        data = {
            "urls": [url],
@@ -75,7 +76,6 @@ class CloudCrawlerStrategy(CrawlerStrategy):
        html = response["results"][0]["html"]
        return sanitize_input_encode(html)

-
 class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
    def __init__(self, use_cached_html=False, js_code=None, **kwargs):
        super().__init__()
@@ -87,25 +87,20 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
        if kwargs.get("user_agent"):
            self.options.add_argument("--user-agent=" + kwargs.get("user_agent"))
        else:
-            user_agent = kwargs.get(
-                "user_agent",
-                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
-            )
+            user_agent = kwargs.get("user_agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
            self.options.add_argument(f"--user-agent={user_agent}")
-            self.options.add_argument(
-                "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
-            )
-
+            self.options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
+                  
        self.options.headless = kwargs.get("headless", True)
        if self.options.headless:
            self.options.add_argument("--headless")
-
-        self.options.add_argument("--disable-gpu")
+        
+        self.options.add_argument("--disable-gpu")  
        self.options.add_argument("--window-size=1920,1080")
        self.options.add_argument("--no-sandbox")
        self.options.add_argument("--disable-dev-shm-usage")
-        self.options.add_argument("--disable-blink-features=AutomationControlled")
-
+        self.options.add_argument("--disable-blink-features=AutomationControlled")     
+        
        # self.options.add_argument("--disable-dev-shm-usage")
        self.options.add_argument("--disable-gpu")
        # self.options.add_argument("--disable-extensions")
@@ -125,45 +120,48 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
        self.use_cached_html = use_cached_html
        self.js_code = js_code
        self.verbose = kwargs.get("verbose", False)
-
+        
        # Hooks
        self.hooks = {
-            "on_driver_created": None,
-            "on_user_agent_updated": None,
-            "before_get_url": None,
-            "after_get_url": None,
-            "before_return_html": None,
+            'on_driver_created': None,
+            'on_user_agent_updated': None,
+            'before_get_url': None,
+            'after_get_url': None,
+            'before_return_html': None
        }

        # chromedriver_autoinstaller.install()
        # import chromedriver_autoinstaller
-        # crawl4ai_folder = os.path.join(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai")
+        # crawl4ai_folder = os.path.join(Path.home(), ".crawl4ai")
        # driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=self.options)
        # chromedriver_path = chromedriver_autoinstaller.install()
        # chromedriver_path = chromedriver_autoinstaller.utils.download_chromedriver()
        # self.service = Service(chromedriver_autoinstaller.install())
-
+        
+        
        # chromedriver_path = ChromeDriverManager().install()
        # self.service = Service(chromedriver_path)
        # self.service.log_path = "NUL"
        # self.driver = webdriver.Chrome(service=self.service, options=self.options)
-
+        
        # Use selenium-manager (built into Selenium 4.10.0+)
        self.service = Service()
        self.driver = webdriver.Chrome(options=self.options)
-
-        self.driver = self.execute_hook("on_driver_created", self.driver)
-
+        
+        self.driver = self.execute_hook('on_driver_created', self.driver)
+        
        if kwargs.get("cookies"):
            for cookie in kwargs.get("cookies"):
                self.driver.add_cookie(cookie)
+            
+        

    def set_hook(self, hook_type: str, hook: Callable):
        if hook_type in self.hooks:
            self.hooks[hook_type] = hook
        else:
            raise ValueError(f"Invalid hook type: {hook_type}")
-
+    
    def execute_hook(self, hook_type: str, *args):
        hook = self.hooks.get(hook_type)
        if hook:
@@ -172,9 +170,7 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
                if isinstance(result, webdriver.Chrome):
                    return result
                else:
-                    raise TypeError(
-                        f"Hook {hook_type} must return an instance of webdriver.Chrome or None."
-                    )
+                    raise TypeError(f"Hook {hook_type} must return an instance of webdriver.Chrome or None.")
        # If the hook returns None or there is no hook, return self.driver
        return self.driver

@@ -182,77 +178,60 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
        self.options.add_argument(f"user-agent={user_agent}")
        self.driver.quit()
        self.driver = webdriver.Chrome(service=self.service, options=self.options)
-        self.driver = self.execute_hook("on_user_agent_updated", self.driver)
+        self.driver = self.execute_hook('on_user_agent_updated', self.driver)

    def set_custom_headers(self, headers: dict):
        # Enable Network domain for sending headers
-        self.driver.execute_cdp_cmd("Network.enable", {})
+        self.driver.execute_cdp_cmd('Network.enable', {})
        # Set extra HTTP headers
-        self.driver.execute_cdp_cmd("Network.setExtraHTTPHeaders", {"headers": headers})
+        self.driver.execute_cdp_cmd('Network.setExtraHTTPHeaders', {'headers': headers})

-    def _ensure_page_load(self, max_checks=6, check_interval=0.01):
+    def _ensure_page_load(self, max_checks=6, check_interval=0.01):
        initial_length = len(self.driver.page_source)
-
+        
        for ix in range(max_checks):
            # print(f"Checking page load: {ix}")
            time.sleep(check_interval)
            current_length = len(self.driver.page_source)
-
+            
            if current_length != initial_length:
                break

        return self.driver.page_source
-
+    
    def crawl(self, url: str, **kwargs) -> str:
        # Create md5 hash of the URL
        import hashlib
-
        url_hash = hashlib.md5(url.encode()).hexdigest()
-
+        
        if self.use_cached_html:
-            cache_file_path = os.path.join(
-                os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()),
-                ".crawl4ai",
-                "cache",
-                url_hash,
-            )
+            cache_file_path = os.path.join(Path.home(), ".crawl4ai", "cache", url_hash)
            if os.path.exists(cache_file_path):
                with open(cache_file_path, "r") as f:
                    return sanitize_input_encode(f.read())

        try:
-            self.driver = self.execute_hook("before_get_url", self.driver)
+            self.driver = self.execute_hook('before_get_url', self.driver)
            if self.verbose:
                print(f"[LOG] 🕸️ Crawling {url} using LocalSeleniumCrawlerStrategy...")
-            self.driver.get(url)  # <html><head></head><body></body></html>
-
+            self.driver.get(url) #<html><head></head><body></body></html>
+            
            WebDriverWait(self.driver, 20).until(
-                lambda d: d.execute_script("return document.readyState") == "complete"
+                lambda d: d.execute_script('return document.readyState') == 'complete'
            )
            WebDriverWait(self.driver, 10).until(
                EC.presence_of_all_elements_located((By.TAG_NAME, "body"))
            )
-
-            self.driver.execute_script(
-                "window.scrollTo(0, document.body.scrollHeight);"
-            )
-
-            self.driver = self.execute_hook("after_get_url", self.driver)
-            html = sanitize_input_encode(
-                self._ensure_page_load()
-            )  # self.driver.page_source
-            can_not_be_done_headless = (
-                False  # Look at my creativity for naming variables
-            )
-
+            
+            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
+            
+            self.driver = self.execute_hook('after_get_url', self.driver)
+            html = sanitize_input_encode(self._ensure_page_load()) # self.driver.page_source                                        
+            can_not_be_done_headless = False # Look at my creativity for naming variables
+            
            # TODO: Very ugly approach, but promise to change it!
-            if (
-                kwargs.get("bypass_headless", False)
-                or html == "<html><head></head><body></body></html>"
-            ):
-                print(
-                    "[LOG] 🙌 Page could not be loaded in headless mode. Trying non-headless mode..."
-                )
+            if kwargs.get('bypass_headless', False) or html == "<html><head></head><body></body></html>":
+                print("[LOG] 🙌 Page could not be loaded in headless mode. Trying non-headless mode...")
                can_not_be_done_headless = True
                options = Options()
                options.headless = False
@@ -260,31 +239,27 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
                options.add_argument("--window-size=5,5")
                driver = webdriver.Chrome(service=self.service, options=options)
                driver.get(url)
-                self.driver = self.execute_hook("after_get_url", driver)
+                self.driver = self.execute_hook('after_get_url', driver)
                html = sanitize_input_encode(driver.page_source)
                driver.quit()
-
+            
            # Execute JS code if provided
            self.js_code = kwargs.get("js_code", self.js_code)
            if self.js_code and type(self.js_code) == str:
                self.driver.execute_script(self.js_code)
                # Optionally, wait for some condition after executing the JS code
                WebDriverWait(self.driver, 10).until(
-                    lambda driver: driver.execute_script("return document.readyState")
-                    == "complete"
+                    lambda driver: driver.execute_script("return document.readyState") == "complete"
                )
            elif self.js_code and type(self.js_code) == list:
                for js in self.js_code:
                    self.driver.execute_script(js)
                    WebDriverWait(self.driver, 10).until(
-                        lambda driver: driver.execute_script(
-                            "return document.readyState"
-                        )
-                        == "complete"
+                        lambda driver: driver.execute_script("return document.readyState") == "complete"
                    )
-
+            
            # Optionally, wait for some condition after executing the JS code : Contributed by (https://github.com/jonymusky)
-            wait_for = kwargs.get("wait_for", False)
+            wait_for = kwargs.get('wait_for', False)
            if wait_for:
                if callable(wait_for):
                    print("[LOG] 🔄 Waiting for condition...")
@@ -293,37 +268,32 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
                    print("[LOG] 🔄 Waiting for condition...")
                    WebDriverWait(self.driver, 20).until(
                        EC.presence_of_element_located((By.CSS_SELECTOR, wait_for))
-                    )
-
+                    ) 
+            
            if not can_not_be_done_headless:
                html = sanitize_input_encode(self.driver.page_source)
-            self.driver = self.execute_hook("before_return_html", self.driver, html)
-
+            self.driver = self.execute_hook('before_return_html', self.driver, html)
+            
            # Store in cache
-            cache_file_path = os.path.join(
-                os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()),
-                ".crawl4ai",
-                "cache",
-                url_hash,
-            )
+            cache_file_path = os.path.join(Path.home(), ".crawl4ai", "cache", url_hash)
            with open(cache_file_path, "w", encoding="utf-8") as f:
                f.write(html)
-
+                
            if self.verbose:
                print(f"[LOG] ✅ Crawled {url} successfully!")
-
+            
            return html
-        except InvalidArgumentException as e:
-            if not hasattr(e, "msg"):
+        except InvalidArgumentException as e:
+            if not hasattr(e, 'msg'):
                e.msg = sanitize_input_encode(str(e))
            raise InvalidArgumentException(f"Failed to crawl {url}: {e.msg}")
        except WebDriverException as e:
             # If e does not have a msg attribute, create it and set it to str(e)
-            if not hasattr(e, "msg"):
+            if not hasattr(e, 'msg'):
                e.msg = sanitize_input_encode(str(e))
-            raise WebDriverException(f"Failed to crawl {url}: {e.msg}")
+            raise WebDriverException(f"Failed to crawl {url}: {e.msg}")  
        except Exception as e:
-            if not hasattr(e, "msg"):
+            if not hasattr(e, 'msg'):
                e.msg = sanitize_input_encode(str(e))
            raise Exception(f"Failed to crawl {url}: {e.msg}")

@@ -331,9 +301,7 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
        try:
            # Get the dimensions of the page
            total_width = self.driver.execute_script("return document.body.scrollWidth")
-            total_height = self.driver.execute_script(
-                "return document.body.scrollHeight"
-            )
+            total_height = self.driver.execute_script("return document.body.scrollHeight")

            # Set the window size to the dimensions of the page
            self.driver.set_window_size(total_width, total_height)
@@ -345,27 +313,25 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
            image = Image.open(BytesIO(screenshot))

            # Convert image to RGB mode (this will handle both RGB and RGBA images)
-            rgb_image = image.convert("RGB")
+            rgb_image = image.convert('RGB')

            # Convert to JPEG and compress
            buffered = BytesIO()
            rgb_image.save(buffered, format="JPEG", quality=85)
-            img_base64 = base64.b64encode(buffered.getvalue()).decode("utf-8")
+            img_base64 = base64.b64encode(buffered.getvalue()).decode('utf-8')

            if self.verbose:
-                print("[LOG] 📸 Screenshot taken and converted to base64")
+                print("[LOG] 📸 Screenshot taken and converted to base64")

            return img_base64
        except Exception as e:
-            error_message = sanitize_input_encode(
-                f"Failed to take screenshot: {str(e)}"
-            )
+            error_message = sanitize_input_encode(f"Failed to take screenshot: {str(e)}")
            print(error_message)

            # Generate an image with black background
-            img = Image.new("RGB", (800, 600), color="black")
+            img = Image.new('RGB', (800, 600), color='black')
            draw = ImageDraw.Draw(img)
-
+            
            # Load a font
            try:
                font = ImageFont.truetype("arial.ttf", 40)
@@ -379,16 +345,16 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):

            # Calculate text position
            text_position = (10, 10)
-
+            
            # Draw the text on the image
            draw.text(text_position, wrapped_text, fill=text_color, font=font)
-
+            
            # Convert to base64
            buffered = BytesIO()
            img.save(buffered, format="JPEG")
-            img_base64 = base64.b64encode(buffered.getvalue()).decode("utf-8")
+            img_base64 = base64.b64encode(buffered.getvalue()).decode('utf-8')

            return img_base64
-
+        
    def quit(self):
        self.driver.quit()
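Reviewer sketch of the hook contract used by `LocalSeleniumCrawlerStrategy` above: a hook may return a replacement driver or `None`, any other return value is a `TypeError`, and an unknown hook name is a `ValueError`. `HookRegistry` and `FakeDriver` are hypothetical stand-ins so the example runs without Selenium; the patched code checks `isinstance(result, webdriver.Chrome)` instead.

```python
from typing import Any, Callable, Dict, Iterable, Optional


class FakeDriver:
    """Hypothetical stand-in for webdriver.Chrome (assumption for this sketch)."""

    def __init__(self, name: str):
        self.name = name


class HookRegistry:
    """Minimal model of set_hook / execute_hook from the strategy class."""

    def __init__(self, driver: FakeDriver, hook_types: Iterable[str]):
        self.driver = driver
        self.hooks: Dict[str, Optional[Callable[..., Any]]] = {
            name: None for name in hook_types
        }

    def set_hook(self, hook_type: str, hook: Callable[..., Any]) -> None:
        if hook_type not in self.hooks:
            raise ValueError(f"Invalid hook type: {hook_type}")
        self.hooks[hook_type] = hook

    def execute_hook(self, hook_type: str, *args: Any) -> FakeDriver:
        hook = self.hooks.get(hook_type)
        if hook:
            result = hook(*args)
            if result is not None:
                if isinstance(result, FakeDriver):
                    return result
                raise TypeError(
                    f"Hook {hook_type} must return a driver instance or None."
                )
        # No hook registered, or the hook returned None: keep the current driver.
        return self.driver
```

A caller would register, for example, `registry.set_hook("before_get_url", my_hook)` and assign the result of `execute_hook` back to its driver attribute, as `crawl()` does in the patch.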
--- a/crawl4ai/database.py
+++ b/crawl4ai/database.py
@@ -3,17 +3,15 @@ from pathlib import Path
 import sqlite3
 from typing import Optional, Tuple

-DB_PATH = os.path.join(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai")
+DB_PATH = os.path.join(Path.home(), ".crawl4ai")
 os.makedirs(DB_PATH, exist_ok=True)
 DB_PATH = os.path.join(DB_PATH, "crawl4ai.db")

-
 def init_db():
    global DB_PATH
    conn = sqlite3.connect(DB_PATH)
    cursor = conn.cursor()
-    cursor.execute(
-        """
+    cursor.execute('''
        CREATE TABLE IF NOT EXISTS crawled_data (
            url TEXT PRIMARY KEY,
            html TEXT,
@@ -26,42 +24,31 @@ def init_db():
            metadata TEXT DEFAULT "{}",
            screenshot TEXT DEFAULT ""
        )
-    """
-    )
+    ''')
    conn.commit()
    conn.close()

-
 def alter_db_add_screenshot(new_column: str = "media"):
    check_db_path()
    try:
        conn = sqlite3.connect(DB_PATH)
        cursor = conn.cursor()
-        cursor.execute(
-            f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""'
-        )
+        cursor.execute(f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""')
        conn.commit()
        conn.close()
    except Exception as e:
        print(f"Error altering database to add screenshot column: {e}")

-
 def check_db_path():
    if not DB_PATH:
        raise ValueError("Database path is not set or is empty.")

-
-def get_cached_url(
-    url: str,
-) -> Optional[Tuple[str, str, str, str, str, str, str, bool, str]]:
+def get_cached_url(url: str) -> Optional[Tuple[str, str, str, str, str, bool, str, str, str, str]]:
    check_db_path()
    try:
        conn = sqlite3.connect(DB_PATH)
        cursor = conn.cursor()
-        cursor.execute(
-            "SELECT url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot FROM crawled_data WHERE url = ?",
-            (url,),
-        )
+        cursor.execute('SELECT url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot FROM crawled_data WHERE url = ?', (url,))
        result = cursor.fetchone()
        conn.close()
        return result
@@ -69,25 +56,12 @@ def get_cached_url(
        print(f"Error retrieving cached URL: {e}")
        return None

-
-def cache_url(
-    url: str,
-    html: str,
-    cleaned_html: str,
-    markdown: str,
-    extracted_content: str,
-    success: bool,
-    media: str = "{}",
-    links: str = "{}",
-    metadata: str = "{}",
-    screenshot: str = "",
-):
+def cache_url(url: str, html: str, cleaned_html: str, markdown: str, extracted_content: str, success: bool, media: str = "{}", links: str = "{}", metadata: str = "{}", screenshot: str = ""):
    check_db_path()
    try:
        conn = sqlite3.connect(DB_PATH)
        cursor = conn.cursor()
-        cursor.execute(
-            """
+        cursor.execute('''
            INSERT INTO crawled_data (url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot)
            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
            ON CONFLICT(url) DO UPDATE SET
@@ -100,32 +74,18 @@ def cache_url(
                links = excluded.links,    
                metadata = excluded.metadata,      
                screenshot = excluded.screenshot
-        """,
-            (
-                url,
-                html,
-                cleaned_html,
-                markdown,
-                extracted_content,
-                success,
-                media,
-                links,
-                metadata,
-                screenshot,
-            ),
-        )
+        ''', (url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot))
        conn.commit()
        conn.close()
    except Exception as e:
        print(f"Error caching URL: {e}")

-
 def get_total_count() -> int:
    check_db_path()
    try:
        conn = sqlite3.connect(DB_PATH)
        cursor = conn.cursor()
-        cursor.execute("SELECT COUNT(*) FROM crawled_data")
+        cursor.execute('SELECT COUNT(*) FROM crawled_data')
        result = cursor.fetchone()
        conn.close()
        return result[0]
@@ -133,48 +93,43 @@ def get_total_count() -> int:
        print(f"Error getting total count: {e}")
        return 0

-
 def clear_db():
    check_db_path()
    try:
        conn = sqlite3.connect(DB_PATH)
        cursor = conn.cursor()
-        cursor.execute("DELETE FROM crawled_data")
+        cursor.execute('DELETE FROM crawled_data')
        conn.commit()
        conn.close()
    except Exception as e:
        print(f"Error clearing database: {e}")
-
-
+        
 def flush_db():
    check_db_path()
    try:
        conn = sqlite3.connect(DB_PATH)
        cursor = conn.cursor()
-        cursor.execute("DROP TABLE crawled_data")
+        cursor.execute('DROP TABLE crawled_data')
        conn.commit()
        conn.close()
    except Exception as e:
        print(f"Error flushing database: {e}")

-
 def update_existing_records(new_column: str = "media", default_value: str = "{}"):
    check_db_path()
    try:
        conn = sqlite3.connect(DB_PATH)
        cursor = conn.cursor()
-        cursor.execute(
-            f'UPDATE crawled_data SET {new_column} = "{default_value}" WHERE screenshot IS NULL'
-        )
+        cursor.execute(f'UPDATE crawled_data SET {new_column} = ? WHERE screenshot IS NULL', (default_value,))
        conn.commit()
        conn.close()
    except Exception as e:
        print(f"Error updating existing records: {e}")

-
 if __name__ == "__main__":
    # Delete the existing database file
    if os.path.exists(DB_PATH):
        os.remove(DB_PATH)
-    init_db()
+    init_db()  
    # alter_db_add_screenshot("COL_NAME")
+    
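Review note on update_existing_records above: the statement interpolates both the column name and the default value into the SQL text. A minimal sketch of a bound-parameter variant follows, assuming the caller passes an open connection. The ALLOWED_COLUMNS set and the changed signature are illustrative assumptions, not part of this commit.

```python
import sqlite3

# Hypothetical allow-list for illustration; the real column set lives in the module's schema.
ALLOWED_COLUMNS = {"media", "links", "metadata"}


def update_existing_records(conn, new_column="media", default_value="{}"):
    """Set new_column to default_value on rows whose screenshot is NULL."""
    # Identifiers cannot be bound as parameters, so the column name is checked
    # against a fixed set before it is placed in the statement.
    if new_column not in ALLOWED_COLUMNS:
        raise ValueError(f"Unexpected column name: {new_column!r}")
    cursor = conn.execute(
        f"UPDATE crawled_data SET {new_column} = ? WHERE screenshot IS NULL",
        (default_value,),  # the value is bound, not interpolated
    )
    conn.commit()
    return cursor.rowcount
```

Returning the row count is a design choice for this sketch only; the function in the diff returns nothing.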
--- a/crawl4ai/docs_manager.py
+++ b/crawl4ai/docs_manager.py
@@ -1,75 +0,0 @@
-import requests
-import shutil
-from pathlib import Path
-from crawl4ai.async_logger import AsyncLogger
-from crawl4ai.llmtxt import AsyncLLMTextManager
-
-
-class DocsManager:
-    def __init__(self, logger=None):
-        self.docs_dir = Path.home() / ".crawl4ai" / "docs"
-        self.local_docs = Path(__file__).parent.parent / "docs" / "llm.txt"
-        self.docs_dir.mkdir(parents=True, exist_ok=True)
-        self.logger = logger or AsyncLogger(verbose=True)
-        self.llm_text = AsyncLLMTextManager(self.docs_dir, self.logger)
-
-    async def ensure_docs_exist(self):
-        """Fetch docs if not present"""
-        if not any(self.docs_dir.iterdir()):
-            await self.fetch_docs()
-
-    async def fetch_docs(self) -> bool:
-        """Copy from local docs or download from GitHub"""
-        try:
-            # Try local first
-            if self.local_docs.exists() and (
-                any(self.local_docs.glob("*.md"))
-                or any(self.local_docs.glob("*.tokens"))
-            ):
-                # Empty the local docs directory
-                for file_path in self.docs_dir.glob("*.md"):
-                    file_path.unlink()
-                # for file_path in self.docs_dir.glob("*.tokens"):
-                #     file_path.unlink()
-                for file_path in self.local_docs.glob("*.md"):
-                    shutil.copy2(file_path, self.docs_dir / file_path.name)
-                # for file_path in self.local_docs.glob("*.tokens"):
-                #     shutil.copy2(file_path, self.docs_dir / file_path.name)
-                return True
-
-            # Fallback to GitHub
-            response = requests.get(
-                "https://api.github.com/repos/unclecode/crawl4ai/contents/docs/llm.txt",
-                headers={"Accept": "application/vnd.github.v3+json"},
-            )
-            response.raise_for_status()
-
-            for item in response.json():
-                if item["type"] == "file" and item["name"].endswith(".md"):
-                    content = requests.get(item["download_url"]).text
-                    with open(self.docs_dir / item["name"], "w", encoding="utf-8") as f:
-                        f.write(content)
-            return True
-
-        except Exception as e:
-            self.logger.error(f"Failed to fetch docs: {str(e)}")
-            raise
-
-    def list(self) -> list[str]:
-        """List available topics"""
-        names = [file_path.stem for file_path in self.docs_dir.glob("*.md")]
-        # Remove [0-9]+_ prefix
-        names = [name.split("_", 1)[1] if name[0].isdigit() else name for name in names]
-        # Exclude those end with .xs.md and .q.md
-        names = [
-            name
-            for name in names
-            if not name.endswith(".xs") and not name.endswith(".q")
-        ]
-        return names
-
-    def generate(self, sections, mode="extended"):
-        return self.llm_text.generate(sections, mode)
-
-    def search(self, query: str, top_k: int = 5):
-        return self.llm_text.search(query, top_k)
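Review note on the removed DocsManager.list above: the name clean-up can be sketched as a pure function over file names. The helper name topic_names and the sample file names are hypothetical. The regex is a small variation on the removed code, which indexes the result of split("_", 1) and so would raise IndexError for a stem that starts with a digit but contains no underscore.

```python
import re
from pathlib import Path


def topic_names(filenames):
    """Return topic names from *.md file names, mirroring the removed list() logic."""
    stems = [Path(name).stem for name in filenames if name.endswith(".md")]
    # Drop a leading numeric prefix such as "1_"; stems without one are left alone.
    stems = [re.sub(r"^[0-9]+_", "", stem) for stem in stems]
    # Exclude the .xs.md and .q.md variants, whose stems end in ".xs" or ".q".
    return [s for s in stems if not s.endswith(".xs") and not s.endswith(".q")]


print(topic_names(["1_introduction.md", "2_quickstart.xs.md", "faq.md", "3_api.q.md"]))
```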
--- a/crawl4ai/extraction_strategy.py
+++ b/crawl4ai/extraction_strategy.py
--- a/crawl4ai/html2text/init.py
+++ b/crawl4ai/html2text/init.py
--- a/crawl4ai/html2text/main.py
+++ b/crawl4ai/html2text/main.py
@@ -1,3 +0,0 @@
-from .cli import main
-
-main()
--- a/crawl4ai/html2text/_typing.py
+++ b/crawl4ai/html2text/_typing.py
@@ -1,3 +0,0 @@
-class OutCallback:
-    def __call__(self, s: str) -> None:
-        ...
--- a/crawl4ai/html2text/cli.py
+++ b/crawl4ai/html2text/cli.py
@@ -1,330 +0,0 @@
-import argparse
-import sys
-
-from . import HTML2Text, __version__, config
-
-
-def main() -> None:
-    baseurl = ""
-
-    class bcolors:
-        HEADER = "\033[95m"
-        OKBLUE = "\033[94m"
-        OKGREEN = "\033[92m"
-        WARNING = "\033[93m"
-        FAIL = "\033[91m"
-        ENDC = "\033[0m"
-        BOLD = "\033[1m"
-        UNDERLINE = "\033[4m"
-
-    p = argparse.ArgumentParser()
-    p.add_argument(
-        "--default-image-alt",
-        dest="default_image_alt",
-        default=config.DEFAULT_IMAGE_ALT,
-        help="The default alt string for images with missing ones",
-    )
-    p.add_argument(
-        "--pad-tables",
-        dest="pad_tables",
-        action="store_true",
-        default=config.PAD_TABLES,
-        help="pad the cells to equal column width in tables",
-    )
-    p.add_argument(
-        "--no-wrap-links",
-        dest="wrap_links",
-        action="store_false",
-        default=config.WRAP_LINKS,
-        help="don't wrap links during conversion",
-    )
-    p.add_argument(
-        "--wrap-list-items",
-        dest="wrap_list_items",
-        action="store_true",
-        default=config.WRAP_LIST_ITEMS,
-        help="wrap list items during conversion",
-    )
-    p.add_argument(
-        "--wrap-tables",
-        dest="wrap_tables",
-        action="store_true",
-        default=config.WRAP_TABLES,
-        help="wrap tables",
-    )
-    p.add_argument(
-        "--ignore-emphasis",
-        dest="ignore_emphasis",
-        action="store_true",
-        default=config.IGNORE_EMPHASIS,
-        help="don't include any formatting for emphasis",
-    )
-    p.add_argument(
-        "--reference-links",
-        dest="inline_links",
-        action="store_false",
-        default=config.INLINE_LINKS,
-        help="use reference style links instead of inline links",
-    )
-    p.add_argument(
-        "--ignore-links",
-        dest="ignore_links",
-        action="store_true",
-        default=config.IGNORE_ANCHORS,
-        help="don't include any formatting for links",
-    )
-    p.add_argument(
-        "--ignore-mailto-links",
-        action="store_true",
-        dest="ignore_mailto_links",
-        default=config.IGNORE_MAILTO_LINKS,
-        help="don't include mailto: links",
-    )
-    p.add_argument(
-        "--protect-links",
-        dest="protect_links",
-        action="store_true",
-        default=config.PROTECT_LINKS,
-        help="protect links from line breaks surrounding them with angle brackets",
-    )
-    p.add_argument(
-        "--ignore-images",
-        dest="ignore_images",
-        action="store_true",
-        default=config.IGNORE_IMAGES,
-        help="don't include any formatting for images",
-    )
-    p.add_argument(
-        "--images-as-html",
-        dest="images_as_html",
-        action="store_true",
-        default=config.IMAGES_AS_HTML,
-        help=(
-            "Always write image tags as raw html; preserves `height`, `width` and "
-            "`alt` if possible."
-        ),
-    )
-    p.add_argument(
-        "--images-to-alt",
-        dest="images_to_alt",
-        action="store_true",
-        default=config.IMAGES_TO_ALT,
-        help="Discard image data, only keep alt text",
-    )
-    p.add_argument(
-        "--images-with-size",
-        dest="images_with_size",
-        action="store_true",
-        default=config.IMAGES_WITH_SIZE,
-        help=(
-            "Write image tags with height and width attrs as raw html to retain "
-            "dimensions"
-        ),
-    )
-    p.add_argument(
-        "-g",
-        "--google-doc",
-        action="store_true",
-        dest="google_doc",
-        default=False,
-        help="convert an html-exported Google Document",
-    )
-    p.add_argument(
-        "-d",
-        "--dash-unordered-list",
-        action="store_true",
-        dest="ul_style_dash",
-        default=False,
-        help="use a dash rather than a star for unordered list items",
-    )
-    p.add_argument(
-        "-e",
-        "--asterisk-emphasis",
-        action="store_true",
-        dest="em_style_asterisk",
-        default=False,
-        help="use an asterisk rather than an underscore for emphasized text",
-    )
-    p.add_argument(
-        "-b",
-        "--body-width",
-        dest="body_width",
-        type=int,
-        default=config.BODY_WIDTH,
-        help="number of characters per output line, 0 for no wrap",
-    )
-    p.add_argument(
-        "-i",
-        "--google-list-indent",
-        dest="list_indent",
-        type=int,
-        default=config.GOOGLE_LIST_INDENT,
-        help="number of pixels Google indents nested lists",
-    )
-    p.add_argument(
-        "-s",
-        "--hide-strikethrough",
-        action="store_true",
-        dest="hide_strikethrough",
-        default=False,
-        help="hide strike-through text. only relevant when -g is " "specified as well",
-    )
-    p.add_argument(
-        "--escape-all",
-        action="store_true",
-        dest="escape_snob",
-        default=False,
-        help=(
-            "Escape all special characters.  Output is less readable, but avoids "
-            "corner case formatting issues."
-        ),
-    )
-    p.add_argument(
-        "--bypass-tables",
-        action="store_true",
-        dest="bypass_tables",
-        default=config.BYPASS_TABLES,
-        help="Format tables in HTML rather than Markdown syntax.",
-    )
-    p.add_argument(
-        "--ignore-tables",
-        action="store_true",
-        dest="ignore_tables",
-        default=config.IGNORE_TABLES,
-        help="Ignore table-related tags (table, th, td, tr) " "while keeping rows.",
-    )
-    p.add_argument(
-        "--single-line-break",
-        action="store_true",
-        dest="single_line_break",
-        default=config.SINGLE_LINE_BREAK,
-        help=(
-            "Use a single line break after a block element rather than two line "
-            "breaks. NOTE: Requires --body-width=0"
-        ),
-    )
-    p.add_argument(
-        "--unicode-snob",
-        action="store_true",
-        dest="unicode_snob",
-        default=config.UNICODE_SNOB,
-        help="Use unicode throughout document",
-    )
-    p.add_argument(
-        "--no-automatic-links",
-        action="store_false",
-        dest="use_automatic_links",
-        default=config.USE_AUTOMATIC_LINKS,
-        help="Do not use automatic links wherever applicable",
-    )
-    p.add_argument(
-        "--no-skip-internal-links",
-        action="store_false",
-        dest="skip_internal_links",
-        default=config.SKIP_INTERNAL_LINKS,
-        help="Do not skip internal links",
-    )
-    p.add_argument(
-        "--links-after-para",
-        action="store_true",
-        dest="links_each_paragraph",
-        default=config.LINKS_EACH_PARAGRAPH,
-        help="Put links after each paragraph instead of document",
-    )
-    p.add_argument(
-        "--mark-code",
-        action="store_true",
-        dest="mark_code",
-        default=config.MARK_CODE,
-        help="Mark program code blocks with [code]...[/code]",
-    )
-    p.add_argument(
-        "--decode-errors",
-        dest="decode_errors",
-        default=config.DECODE_ERRORS,
-        help=(
-            "What to do in case of decode errors.'ignore', 'strict' and 'replace' are "
-            "acceptable values"
-        ),
-    )
-    p.add_argument(
-        "--open-quote",
-        dest="open_quote",
-        default=config.OPEN_QUOTE,
-        help="The character used to open quotes",
-    )
-    p.add_argument(
-        "--close-quote",
-        dest="close_quote",
-        default=config.CLOSE_QUOTE,
-        help="The character used to close quotes",
-    )
-    p.add_argument(
-        "--version", action="version", version=".".join(map(str, __version__))
-    )
-    p.add_argument("filename", nargs="?")
-    p.add_argument("encoding", nargs="?", default="utf-8")
-    p.add_argument(
-        "--include-sup-sub",
-        dest="include_sup_sub",
-        action="store_true",
-        default=config.INCLUDE_SUP_SUB,
-        help="Include the sup and sub tags",
-    )
-    args = p.parse_args()
-
-    if args.filename and args.filename != "-":
-        with open(args.filename, "rb") as fp:
-            data = fp.read()
-    else:
-        data = sys.stdin.buffer.read()
-
-    try:
-        html = data.decode(args.encoding, args.decode_errors)
-    except UnicodeDecodeError as err:
-        warning = bcolors.WARNING + "Warning:" + bcolors.ENDC
-        warning += " Use the " + bcolors.OKGREEN
-        warning += "--decode-errors=ignore" + bcolors.ENDC + " flag."
-        print(warning)
-        raise err
-
-    h = HTML2Text(baseurl=baseurl)
-    # handle options
-    if args.ul_style_dash:
-        h.ul_item_mark = "-"
-    if args.em_style_asterisk:
-        h.emphasis_mark = "*"
-        h.strong_mark = "__"
-
-    h.body_width = args.body_width
-    h.google_list_indent = args.list_indent
-    h.ignore_emphasis = args.ignore_emphasis
-    h.ignore_links = args.ignore_links
-    h.ignore_mailto_links = args.ignore_mailto_links
-    h.protect_links = args.protect_links
-    h.ignore_images = args.ignore_images
-    h.images_as_html = args.images_as_html
-    h.images_to_alt = args.images_to_alt
-    h.images_with_size = args.images_with_size
-    h.google_doc = args.google_doc
-    h.hide_strikethrough = args.hide_strikethrough
-    h.escape_snob = args.escape_snob
-    h.bypass_tables = args.bypass_tables
-    h.ignore_tables = args.ignore_tables
-    h.single_line_break = args.single_line_break
-    h.inline_links = args.inline_links
-    h.unicode_snob = args.unicode_snob
-    h.use_automatic_links = args.use_automatic_links
-    h.skip_internal_links = args.skip_internal_links
-    h.links_each_paragraph = args.links_each_paragraph
-    h.mark_code = args.mark_code
-    h.wrap_links = args.wrap_links
-    h.wrap_list_items = args.wrap_list_items
-    h.wrap_tables = args.wrap_tables
-    h.pad_tables = args.pad_tables
-    h.default_image_alt = args.default_image_alt
-    h.open_quote = args.open_quote
-    h.close_quote = args.close_quote
-    h.include_sup_sub = args.include_sup_sub
-
-    sys.stdout.write(h.handle(html))
--- a/crawl4ai/html2text/config.py
+++ b/crawl4ai/html2text/config.py
@@ -1,172 +0,0 @@
-import re
-
-# Use Unicode characters instead of their ascii pseudo-replacements
-UNICODE_SNOB = False
-
-# Marker to use for marking tables for padding post processing
-TABLE_MARKER_FOR_PAD = "special_marker_for_table_padding"
-# Escape all special characters.  Output is less readable, but avoids
-# corner case formatting issues.
-ESCAPE_SNOB = False
-ESCAPE_BACKSLASH = False
-ESCAPE_DOT = False
-ESCAPE_PLUS = False
-ESCAPE_DASH = False
-
-# Put the links after each paragraph instead of at the end.
-LINKS_EACH_PARAGRAPH = False
-
-# Wrap long lines at position. 0 for no wrapping.
-BODY_WIDTH = 78
-
-# Don't show internal links (href="#local-anchor") -- corresponding link
-# targets won't be visible in the plain text file anyway.
-SKIP_INTERNAL_LINKS = True
-
-# Use inline, rather than reference, formatting for images and links
-INLINE_LINKS = True
-
-# Protect links from line breaks surrounding them with angle brackets (in
-# addition to their square brackets)
-PROTECT_LINKS = False
-# WRAP_LINKS = True
-WRAP_LINKS = True
-
-# Wrap list items.
-WRAP_LIST_ITEMS = False
-
-# Wrap tables
-WRAP_TABLES = False
-
-# Number of pixels Google indents nested lists
-GOOGLE_LIST_INDENT = 36
-
-# Values Google and others may use to indicate bold text
-BOLD_TEXT_STYLE_VALUES = ("bold", "700", "800", "900")
-
-IGNORE_ANCHORS = False
-IGNORE_MAILTO_LINKS = False
-IGNORE_IMAGES = False
-IMAGES_AS_HTML = False
-IMAGES_TO_ALT = False
-IMAGES_WITH_SIZE = False
-IGNORE_EMPHASIS = False
-MARK_CODE = False
-DECODE_ERRORS = "strict"
-DEFAULT_IMAGE_ALT = ""
-PAD_TABLES = False
-
-# Convert links with same href and text to <href> format
-# if they are absolute links
-USE_AUTOMATIC_LINKS = True
-
-# For checking space-only lines on line 771
-RE_SPACE = re.compile(r"\s\+")
-
-RE_ORDERED_LIST_MATCHER = re.compile(r"\d+\.\s")
-RE_UNORDERED_LIST_MATCHER = re.compile(r"[-\*\+]\s")
-RE_MD_CHARS_MATCHER = re.compile(r"([\\\[\]\(\)])")
-RE_MD_CHARS_MATCHER_ALL = re.compile(r"([`\*_{}\[\]\(\)#!])")
-
-# to find links in the text
-RE_LINK = re.compile(r"(\[.*?\] ?\(.*?\))|(\[.*?\]:.*?)")
-
-# to find table separators
-RE_TABLE = re.compile(r" \| ")
-
-RE_MD_DOT_MATCHER = re.compile(
-    r"""
-    ^             # start of line
-    (\s*\d+)      # optional whitespace and a number
-    (\.)          # dot
-    (?=\s)        # lookahead assert whitespace
-    """,
-    re.MULTILINE | re.VERBOSE,
-)
-RE_MD_PLUS_MATCHER = re.compile(
-    r"""
-    ^
-    (\s*)
-    (\+)
-    (?=\s)
-    """,
-    flags=re.MULTILINE | re.VERBOSE,
-)
-RE_MD_DASH_MATCHER = re.compile(
-    r"""
-    ^
-    (\s*)
-    (-)
-    (?=\s|\-)     # followed by whitespace (bullet list, or spaced out hr)
-                  # or another dash (header or hr)
-    """,
-    flags=re.MULTILINE | re.VERBOSE,
-)
-RE_SLASH_CHARS = r"\`*_{}[]()#+-.!"
-RE_MD_BACKSLASH_MATCHER = re.compile(
-    r"""
-    (\\)          # match one slash
-    (?=[%s])      # followed by a char that requires escaping
-    """
-    % re.escape(RE_SLASH_CHARS),
-    flags=re.VERBOSE,
-)
-
-UNIFIABLE = {
-    "rsquo": "'",
-    "lsquo": "'",
-    "rdquo": '"',
-    "ldquo": '"',
-    "copy": "(C)",
-    "mdash": "--",
-    "nbsp": " ",
-    "rarr": "->",
-    "larr": "<-",
-    "middot": "*",
-    "ndash": "-",
-    "oelig": "oe",
-    "aelig": "ae",
-    "agrave": "a",
-    "aacute": "a",
-    "acirc": "a",
-    "atilde": "a",
-    "auml": "a",
-    "aring": "a",
-    "egrave": "e",
-    "eacute": "e",
-    "ecirc": "e",
-    "euml": "e",
-    "igrave": "i",
-    "iacute": "i",
-    "icirc": "i",
-    "iuml": "i",
-    "ograve": "o",
-    "oacute": "o",
-    "ocirc": "o",
-    "otilde": "o",
-    "ouml": "o",
-    "ugrave": "u",
-    "uacute": "u",
-    "ucirc": "u",
-    "uuml": "u",
-    "lrm": "",
-    "rlm": "",
-}
-
-# Format tables in HTML rather than Markdown syntax
-BYPASS_TABLES = False
-# Ignore table-related tags (table, th, td, tr) while keeping rows
-IGNORE_TABLES = False
-
-
-# Use a single line break after a block element rather than two line breaks.
-# NOTE: Requires body width setting to be 0.
-SINGLE_LINE_BREAK = False
-
-
-# Use double quotation marks when converting the <q> tag.
-OPEN_QUOTE = '"'
-CLOSE_QUOTE = '"'
-
-# Include the <sup> and <sub> tags
-INCLUDE_SUP_SUB = False
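Review note on RE_SPACE in the removed config above: as written, r"\s\+" matches one whitespace character followed by a literal plus sign, not a run of whitespace. The short check below shows the difference from r"\s+". Whether r"\s+" was the intended pattern is an assumption based on the "space-only lines" comment.

```python
import re

as_written = re.compile(r"\s\+")  # one whitespace character, then a literal "+"
one_or_more = re.compile(r"\s+")  # one or more whitespace characters


def is_space_only(line, pattern):
    """True when the whole line is matched by the given pattern."""
    return pattern.fullmatch(line) is not None


for sample in ["   ", " +", "a +b"]:
    print(repr(sample), is_space_only(sample, as_written), is_space_only(sample, one_or_more))
```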
--- a/crawl4ai/html2text/elements.py
+++ b/crawl4ai/html2text/elements.py
@@ -1,18 +0,0 @@
-from typing import Dict, Optional
-
-
-class AnchorElement:
-    __slots__ = ["attrs", "count", "outcount"]
-
-    def __init__(self, attrs: Dict[str, Optional[str]], count: int, outcount: int):
-        self.attrs = attrs
-        self.count = count
-        self.outcount = outcount
-
-
-class ListElement:
-    __slots__ = ["name", "num"]
-
-    def __init__(self, name: str, num: int):
-        self.name = name
-        self.num = num
--- a/crawl4ai/html2text/utils.py
+++ b/crawl4ai/html2text/utils.py
@@ -1,304 +0,0 @@
-import html.entities
-from typing import Dict, List, Optional
-
-from . import config
-
-unifiable_n = {
-    html.entities.name2codepoint[k]: v
-    for k, v in config.UNIFIABLE.items()
-    if k != "nbsp"
-}
-
-
-def hn(tag: str) -> int:
-    if tag[0] == "h" and len(tag) == 2:
-        n = tag[1]
-        if "0" < n <= "9":
-            return int(n)
-    return 0
-
-
-def dumb_property_dict(style: str) -> Dict[str, str]:
-    """
-    :returns: A hash of css attributes
-    """
-    return {
-        x.strip().lower(): y.strip().lower()
-        for x, y in [z.split(":", 1) for z in style.split(";") if ":" in z]
-    }
-
-
-def dumb_css_parser(data: str) -> Dict[str, Dict[str, str]]:
-    """
-    :type data: str
-
-    :returns: A hash of css selectors, each of which contains a hash of
-    css attributes.
-    :rtype: dict
-    """
-    # remove @import sentences
-    data += ";"
-    importIndex = data.find("@import")
-    while importIndex != -1:
-        data = data[0:importIndex] + data[data.find(";", importIndex) + 1 :]
-        importIndex = data.find("@import")
-
-    # parse the css. reverted from dictionary comprehension in order to
-    # support older pythons
-    pairs = [x.split("{") for x in data.split("}") if "{" in x.strip()]
-    try:
-        elements = {a.strip(): dumb_property_dict(b) for a, b in pairs}
-    except ValueError:
-        elements = {}  # not that important
-
-    return elements
-
-
-def element_style(
-    attrs: Dict[str, Optional[str]],
-    style_def: Dict[str, Dict[str, str]],
-    parent_style: Dict[str, str],
-) -> Dict[str, str]:
-    """
-    :type attrs: dict
-    :type style_def: dict
-    :type style_def: dict
-
-    :returns: A hash of the 'final' style attributes of the element
-    :rtype: dict
-    """
-    style = parent_style.copy()
-    if "class" in attrs:
-        assert attrs["class"] is not None
-        for css_class in attrs["class"].split():
-            css_style = style_def.get("." + css_class, {})
-            style.update(css_style)
-    if "style" in attrs:
-        assert attrs["style"] is not None
-        immediate_style = dumb_property_dict(attrs["style"])
-        style.update(immediate_style)
-
-    return style
-
-
-def google_list_style(style: Dict[str, str]) -> str:
-    """
-    Finds out whether this is an ordered or unordered list
-
-    :type style: dict
-
-    :rtype: str
-    """
-    if "list-style-type" in style:
-        list_style = style["list-style-type"]
-        if list_style in ["disc", "circle", "square", "none"]:
-            return "ul"
-
-    return "ol"
-
-
-def google_has_height(style: Dict[str, str]) -> bool:
-    """
-    Check if the style of the element has the 'height' attribute
-    explicitly defined
-
-    :type style: dict
-
-    :rtype: bool
-    """
-    return "height" in style
-
-
-def google_text_emphasis(style: Dict[str, str]) -> List[str]:
-    """
-    :type style: dict
-
-    :returns: A list of all emphasis modifiers of the element
-    :rtype: list
-    """
-    emphasis = []
-    if "text-decoration" in style:
-        emphasis.append(style["text-decoration"])
-    if "font-style" in style:
-        emphasis.append(style["font-style"])
-    if "font-weight" in style:
-        emphasis.append(style["font-weight"])
-
-    return emphasis
-
-
-def google_fixed_width_font(style: Dict[str, str]) -> bool:
-    """
-    Check if the css of the current element defines a fixed width font
-
-    :type style: dict
-
-    :rtype: bool
-    """
-    font_family = ""
-    if "font-family" in style:
-        font_family = style["font-family"]
-    return "courier new" == font_family or "consolas" == font_family
-
-
-def list_numbering_start(attrs: Dict[str, Optional[str]]) -> int:
-    """
-    Extract numbering from list element attributes
-
-    :type attrs: dict
-
-    :rtype: int or None
-    """
-    if "start" in attrs:
-        assert attrs["start"] is not None
-        try:
-            return int(attrs["start"]) - 1
-        except ValueError:
-            pass
-
-    return 0
-
-
-def skipwrap(
-    para: str, wrap_links: bool, wrap_list_items: bool, wrap_tables: bool
-) -> bool:
-    # If it appears to contain a link
-    # don't wrap
-    if not wrap_links and config.RE_LINK.search(para):
-        return True
-    # If the text begins with four spaces or one tab, it's a code block;
-    # don't wrap
-    if para[0:4] == "    " or para[0] == "\t":
-        return True
-
-    # If the text begins with only two "--", possibly preceded by
-    # whitespace, that's an emdash; so wrap.
-    stripped = para.lstrip()
-    if stripped[0:2] == "--" and len(stripped) > 2 and stripped[2] != "-":
-        return False
-
-    # I'm not sure what this is for; I thought it was to detect lists,
-    # but there's a <br>-inside-<span> case in one of the tests that
-    # also depends upon it.
-    if stripped[0:1] in ("-", "*") and not stripped[0:2] == "**":
-        return not wrap_list_items
-
-    # If text contains a pipe character it is likely a table
-    if not wrap_tables and config.RE_TABLE.search(para):
-        return True
-
-    # If the text begins with a single -, *, or +, followed by a space,
-    # or an integer, followed by a ., followed by a space (in either
-    # case optionally proceeded by whitespace), it's a list; don't wrap.
-    return bool(
-        config.RE_ORDERED_LIST_MATCHER.match(stripped)
-        or config.RE_UNORDERED_LIST_MATCHER.match(stripped)
-    )
-
-
-def escape_md(text: str) -> str:
-    """
-    Escapes markdown-sensitive characters within other markdown
-    constructs.
-    """
-    return config.RE_MD_CHARS_MATCHER.sub(r"\\\1", text)
-
-
-def escape_md_section(
-    text: str,
-    escape_backslash: bool = True,
-    snob: bool = False,
-    escape_dot: bool = True,
-    escape_plus: bool = True,
-    escape_dash: bool = True,
-) -> str:
-    """
-    Escapes markdown-sensitive characters across whole document sections.
-    Each escaping operation can be controlled individually.
-    """
-    if escape_backslash:
-        text = config.RE_MD_BACKSLASH_MATCHER.sub(r"\\\1", text)
-
-    if snob:
-        text = config.RE_MD_CHARS_MATCHER_ALL.sub(r"\\\1", text)
-
-    if escape_dot:
-        text = config.RE_MD_DOT_MATCHER.sub(r"\1\\\2", text)
-
-    if escape_plus:
-        text = config.RE_MD_PLUS_MATCHER.sub(r"\1\\\2", text)
-
-    if escape_dash:
-        text = config.RE_MD_DASH_MATCHER.sub(r"\1\\\2", text)
-
-    return text
-
-
-def reformat_table(lines: List[str], right_margin: int) -> List[str]:
-    """
-    Given the lines of a table
-    padds the cells and returns the new lines
-    """
-    # find the maximum width of the columns
-    max_width = [len(x.rstrip()) + right_margin for x in lines[0].split("|")]
-    max_cols = len(max_width)
-    for line in lines:
-        cols = [x.rstrip() for x in line.split("|")]
-        num_cols = len(cols)
-
-        # don't drop any data if colspan attributes result in unequal lengths
-        if num_cols < max_cols:
-            cols += [""] * (max_cols - num_cols)
-        elif max_cols < num_cols:
-            max_width += [len(x) + right_margin for x in cols[-(num_cols - max_cols) :]]
-            max_cols = num_cols
-
-        max_width = [
-            max(len(x) + right_margin, old_len) for x, old_len in zip(cols, max_width)
-        ]
-
-    # reformat
-    new_lines = []
-    for line in lines:
-        cols = [x.rstrip() for x in line.split("|")]
-        if set(line.strip()) == set("-|"):
-            filler = "-"
-            new_cols = [
-                x.rstrip() + (filler * (M - len(x.rstrip())))
-                for x, M in zip(cols, max_width)
-            ]
-            new_lines.append("|-" + "|".join(new_cols) + "|")
-        else:
-            filler = " "
-            new_cols = [
-                x.rstrip() + (filler * (M - len(x.rstrip())))
-                for x, M in zip(cols, max_width)
-            ]
-            new_lines.append("| " + "|".join(new_cols) + "|")
-    return new_lines
-
-
-def pad_tables_in_text(text: str, right_margin: int = 1) -> str:
-    """
-    Provide padding for tables in the text
-    """
-    lines = text.split("\n")
-    table_buffer = []  # type: List[str]
-    table_started = False
-    new_lines = []
-    for line in lines:
-        # Toggle table started
-        if config.TABLE_MARKER_FOR_PAD in line:
-            table_started = not table_started
-            if not table_started:
-                table = reformat_table(table_buffer, right_margin)
-                new_lines.extend(table)
-                table_buffer = []
-                new_lines.append("")
-            continue
-        # Process lines
-        if table_started:
-            table_buffer.append(line)
-        else:
-            new_lines.append(line)
-    return "\n".join(new_lines)
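Review note on the removed utils above: dumb_property_dict is the building block that element_style uses to fold inline style attributes into the inherited style. The function body below is restated from the removed file so the example is self-contained; the sample style string is made up for illustration.

```python
from typing import Dict


def dumb_property_dict(style: str) -> Dict[str, str]:
    """Parse a CSS declaration string into a lower-cased property/value mapping."""
    return {
        x.strip().lower(): y.strip().lower()
        for x, y in [z.split(":", 1) for z in style.split(";") if ":" in z]
    }


# Inline declarations override inherited ones, as in element_style.
parent_style = {"font-weight": "normal"}
style = parent_style.copy()
style.update(dumb_property_dict("Font-Weight: Bold; color:red;"))
print(style)
```

Because each declaration is split on the first colon only, a value that itself contains a colon (a URL, for instance) stays in one piece.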
--- a/crawl4ai/install.py
+++ b/crawl4ai/install.py
@@ -1,109 +0,0 @@
-import subprocess
-import sys
-import asyncio
-from .async_logger import AsyncLogger, LogLevel
-
-# Initialize logger
-logger = AsyncLogger(log_level=LogLevel.DEBUG, verbose=True)
-
-
-def post_install():
-    """Run all post-installation tasks"""
-    logger.info("Running post-installation setup...", tag="INIT")
-    install_playwright()
-    run_migration()
-    logger.success("Post-installation setup completed!", tag="COMPLETE")
-
-
-def install_playwright():
-    logger.info("Installing Playwright browsers...", tag="INIT")
-    try:
-        # subprocess.check_call([sys.executable, "-m", "playwright", "install", "--with-deps", "--force", "chrome"])
-        subprocess.check_call(
-            [
-                sys.executable,
-                "-m",
-                "playwright",
-                "install",
-                "--with-deps",
-                "--force",
-                "chromium",
-            ]
-        )
-        logger.success(
-            "Playwright installation completed successfully.", tag="COMPLETE"
-        )
-    except subprocess.CalledProcessError:
-        # logger.error(f"Error during Playwright installation: {e}", tag="ERROR")
-        logger.warning(
-            f"Please run '{sys.executable} -m playwright install --with-deps' manually after the installation."
-        )
-    except Exception:
-        # logger.error(f"Unexpected error during Playwright installation: {e}", tag="ERROR")
-        logger.warning(
-            f"Please run '{sys.executable} -m playwright install --with-deps' manually after the installation."
-        )
-
-
-def run_migration():
-    """Initialize database during installation"""
-    try:
-        logger.info("Starting database initialization...", tag="INIT")
-        from crawl4ai.async_database import async_db_manager
-
-        asyncio.run(async_db_manager.initialize())
-        logger.success(
-            "Database initialization completed successfully.", tag="COMPLETE"
-        )
-    except ImportError:
-        logger.warning("Database module not found. Will initialize on first use.")
-    except Exception as e:
-        logger.warning(f"Database initialization failed: {e}")
-        logger.warning("Database will be initialized on first use")
-
-
-async def run_doctor():
-    """Test if Crawl4AI is working properly"""
-    logger.info("Running Crawl4AI health check...", tag="INIT")
-    try:
-        from .async_webcrawler import (
-            AsyncWebCrawler,
-            BrowserConfig,
-            CrawlerRunConfig,
-            CacheMode,
-        )
-
-        browser_config = BrowserConfig(
-            headless=True,
-            browser_type="chromium",
-            ignore_https_errors=True,
-            light_mode=True,
-            viewport_width=1280,
-            viewport_height=720,
-        )
-
-        run_config = CrawlerRunConfig(
-            cache_mode=CacheMode.BYPASS,
-            screenshot=True,
-        )
-
-        async with AsyncWebCrawler(config=browser_config) as crawler:
-            logger.info("Testing crawling capabilities...", tag="TEST")
-            result = await crawler.arun(url="https://crawl4ai.com", config=run_config)
-
-            if result and result.markdown:
-                logger.success("✅ Crawling test passed!", tag="COMPLETE")
-                return True
-            else:
-                raise Exception("Failed to get content")
-
-    except Exception as e:
-        logger.error(f"❌ Test failed: {e}", tag="ERROR")
-        return False
-
-
-def doctor():
-    """Entry point for the doctor command"""
-    import asyncio
-
-    return asyncio.run(run_doctor())
--- a/crawl4ai/js_snippet/init.py
+++ b/crawl4ai/js_snippet/init.py
@@ -1,18 +0,0 @@
-import os
-
-
-# Create a function get name of a js script, then load from the CURRENT folder of this script and return its content as string, make sure its error free
-def load_js_script(script_name):
-    # Get the path of the current script
-    current_script_path = os.path.dirname(os.path.realpath(__file__))
-    # Get the path of the script to load
-    script_path = os.path.join(current_script_path, script_name + ".js")
-    # Check if the script exists
-    if not os.path.exists(script_path):
-        raise ValueError(
-            f"Script {script_name} not found in the folder {current_script_path}"
-        )
-    # Load the content of the script
-    with open(script_path, "r") as f:
-        script_content = f.read()
-    return script_content
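The removed `load_js_script` helper resolves a script name against the module's own directory and raises if the file is missing. A minimal sketch of the same lookup with `pathlib` follows. It assumes a hypothetical `base_dir` parameter, which is not in the original, so the function can be exercised outside the package:

```python
from pathlib import Path


def load_js_script(script_name: str, base_dir: Path) -> str:
    """Return the text of `<base_dir>/<script_name>.js`.

    `base_dir` is a hypothetical parameter added for illustration; the
    original helper always used the directory of its own module.
    """
    script_path = base_dir / f"{script_name}.js"
    if not script_path.is_file():
        raise ValueError(f"Script {script_name} not found in the folder {base_dir}")
    # Read as UTF-8 explicitly so the result does not depend on the locale.
    return script_path.read_text(encoding="utf-8")
```

Inside a package, `base_dir` could be `Path(__file__).resolve().parent`, which mirrors the `os.path.realpath(__file__)` lookup in the removed code.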
--- a/crawl4ai/js_snippet/navigator_overrider.js
+++ b/crawl4ai/js_snippet/navigator_overrider.js
@@ -1,25 +0,0 @@
-// Pass the Permissions Test.
-const originalQuery = window.navigator.permissions.query;
-window.navigator.permissions.query = (parameters) =>
-    parameters.name === "notifications"
-        ? Promise.resolve({ state: Notification.permission })
-        : originalQuery(parameters);
-Object.defineProperty(navigator, "webdriver", {
-    get: () => undefined,
-});
-window.navigator.chrome = {
-    runtime: {},
-    // Add other properties if necessary
-};
-Object.defineProperty(navigator, "plugins", {
-    get: () => [1, 2, 3, 4, 5],
-});
-Object.defineProperty(navigator, "languages", {
-    get: () => ["en-US", "en"],
-});
-Object.defineProperty(document, "hidden", {
-    get: () => false,
-});
-Object.defineProperty(document, "visibilityState", {
-    get: () => "visible",
-});
--- a/crawl4ai/js_snippet/remove_overlay_elements.js
+++ b/crawl4ai/js_snippet/remove_overlay_elements.js
@@ -1,119 +0,0 @@
-async () => {
-    // Function to check if element is visible
-    const isVisible = (elem) => {
-        const style = window.getComputedStyle(elem);
-        return style.display !== "none" && style.visibility !== "hidden" && style.opacity !== "0";
-    };
-
-    // Common selectors for popups and overlays
-    const commonSelectors = [
-        // Close buttons first
-        'button[class*="close" i]',
-        'button[class*="dismiss" i]',
-        'button[aria-label*="close" i]',
-        'button[title*="close" i]',
-        'a[class*="close" i]',
-        'span[class*="close" i]',
-
-        // Cookie notices
-        '[class*="cookie-banner" i]',
-        '[id*="cookie-banner" i]',
-        '[class*="cookie-consent" i]',
-        '[id*="cookie-consent" i]',
-
-        // Newsletter/subscription dialogs
-        '[class*="newsletter" i]',
-        '[class*="subscribe" i]',
-
-        // Generic popups/modals
-        '[class*="popup" i]',
-        '[class*="modal" i]',
-        '[class*="overlay" i]',
-        '[class*="dialog" i]',
-        '[role="dialog"]',
-        '[role="alertdialog"]',
-    ];
-
-    // Try to click close buttons first
-    for (const selector of commonSelectors.slice(0, 6)) {
-        const closeButtons = document.querySelectorAll(selector);
-        for (const button of closeButtons) {
-            if (isVisible(button)) {
-                try {
-                    button.click();
-                    await new Promise((resolve) => setTimeout(resolve, 100));
-                } catch (e) {
-                    console.log("Error clicking button:", e);
-                }
-            }
-        }
-    }
-
-    // Remove remaining overlay elements
-    const removeOverlays = () => {
-        // Find elements with high z-index
-        const allElements = document.querySelectorAll("*");
-        for (const elem of allElements) {
-            const style = window.getComputedStyle(elem);
-            const zIndex = parseInt(style.zIndex, 10);
-            const position = style.position;
-
-            if (
-                isVisible(elem) &&
-                (zIndex > 999 || position === "fixed" || position === "absolute") &&
-                (elem.offsetWidth > window.innerWidth * 0.5 ||
-                    elem.offsetHeight > window.innerHeight * 0.5 ||
-                    style.backgroundColor.includes("rgba") ||
-                    parseFloat(style.opacity) < 1)
-            ) {
-                elem.remove();
-            }
-        }
-
-        // Remove elements matching common selectors
-        for (const selector of commonSelectors) {
-            const elements = document.querySelectorAll(selector);
-            elements.forEach((elem) => {
-                if (isVisible(elem)) {
-                    elem.remove();
-                }
-            });
-        }
-    };
-
-    // Remove overlay elements
-    removeOverlays();
-
-    // Remove any fixed/sticky position elements at the top/bottom
-    const removeFixedElements = () => {
-        const elements = document.querySelectorAll("*");
-        elements.forEach((elem) => {
-            const style = window.getComputedStyle(elem);
-            if ((style.position === "fixed" || style.position === "sticky") && isVisible(elem)) {
-                elem.remove();
-            }
-        });
-    };
-
-    removeFixedElements();
-
-    // Remove empty block elements such as div, p, span, etc. (note: this helper is defined but never called)
-    const removeEmptyBlockElements = () => {
-        const blockElements = document.querySelectorAll(
-            "div, p, span, section, article, header, footer, aside, nav, main, ul, ol, li, dl, dt, dd, h1, h2, h3, h4, h5, h6"
-        );
-        blockElements.forEach((elem) => {
-            if (elem.innerText.trim() === "") {
-                elem.remove();
-            }
-        });
-    };
-
-    // Remove margin-right and padding-right from body (often added by modal scripts)
-    document.body.style.marginRight = "0px";
-    document.body.style.paddingRight = "0px";
-    document.body.style.overflow = "auto";
-
-    // Wait a bit for any animations to complete
-    await new Promise((resolve) => setTimeout(resolve, 100));
-};
--- a/crawl4ai/js_snippet/update_image_dimensions.js
+++ b/crawl4ai/js_snippet/update_image_dimensions.js
@@ -1,54 +0,0 @@
-() => {
-    return new Promise((resolve) => {
-        const filterImage = (img) => {
-            // Filter out images that are too small
-            if (img.width < 100 && img.height < 100) return false;
-
-            // Filter out images that are not visible
-            const rect = img.getBoundingClientRect();
-            if (rect.width === 0 || rect.height === 0) return false;
-
-            // Filter out images with certain class names (e.g., icons, thumbnails)
-            if (img.classList.contains("icon") || img.classList.contains("thumbnail")) return false;
-
-            // Filter out images with certain patterns in their src (e.g., placeholder images)
-            if (img.src.includes("placeholder") || img.src.includes("icon")) return false;
-
-            return true;
-        };
-
-        const images = Array.from(document.querySelectorAll("img")).filter(filterImage);
-        let imagesLeft = images.length;
-
-        if (imagesLeft === 0) {
-            resolve();
-            return;
-        }
-
-        const checkImage = (img) => {
-            if (img.complete && img.naturalWidth !== 0) {
-                img.setAttribute("width", img.naturalWidth);
-                img.setAttribute("height", img.naturalHeight);
-                imagesLeft--;
-                if (imagesLeft === 0) resolve();
-            }
-        };
-
-        images.forEach((img) => {
-            checkImage(img);
-            if (!img.complete) {
-                img.onload = () => {
-                    checkImage(img);
-                };
-                img.onerror = () => {
-                    imagesLeft--;
-                    if (imagesLeft === 0) resolve();
-                };
-            }
-        });
-
-        // Fallback timeout of 5 seconds (currently disabled; the promise resolves immediately)
-        // setTimeout(() => resolve(), 5000);
-        resolve();
-    });
-};
--- a/crawl4ai/llmtxt.py
+++ b/crawl4ai/llmtxt.py
@@ -1,546 +0,0 @@
-import os
-from pathlib import Path
-import re
-from typing import Dict, List, Tuple, Optional, Any
-import json
-from tqdm import tqdm
-import time
-import psutil
-import numpy as np
-from rank_bm25 import BM25Okapi
-from nltk.tokenize import word_tokenize
-from nltk.corpus import stopwords
-from nltk.stem import WordNetLemmatizer
-from litellm import batch_completion
-from .async_logger import AsyncLogger
-import litellm
-import pickle
-import hashlib  # used to hash file content for cache invalidation
-import glob
-
-litellm.set_verbose = False
-
-
-def _compute_file_hash(file_path: Path) -> str:
-    """Compute MD5 hash for the file's entire content."""
-    hash_md5 = hashlib.md5()
-    with file_path.open("rb") as f:
-        for chunk in iter(lambda: f.read(4096), b""):
-            hash_md5.update(chunk)
-    return hash_md5.hexdigest()
-
-
-class AsyncLLMTextManager:
-    def __init__(
-        self,
-        docs_dir: Path,
-        logger: Optional[AsyncLogger] = None,
-        max_concurrent_calls: int = 5,
-        batch_size: int = 3,
-    ) -> None:
-        self.docs_dir = docs_dir
-        self.logger = logger
-        self.max_concurrent_calls = max_concurrent_calls
-        self.batch_size = batch_size
-        self.bm25_index = None
-        self.document_map: Dict[str, Any] = {}
-        self.tokenized_facts: List[str] = []
-        self.bm25_index_file = self.docs_dir / "bm25_index.pkl"
-
-    async def _process_document_batch(self, doc_batch: List[Path]) -> None:
-        """Process a batch of documents in parallel"""
-        contents = []
-        for file_path in doc_batch:
-            try:
-                with open(file_path, "r", encoding="utf-8") as f:
-                    contents.append(f.read())
-            except Exception as e:
-                self.logger.error(f"Error reading {file_path}: {str(e)}")
-                contents.append("")  # Add empty content to maintain batch alignment
-
-        prompt = """Given a documentation file, generate a list of atomic facts where each fact:
-1. Represents a single piece of knowledge
-2. Contains variations in terminology for the same concept
-3. References relevant code patterns if they exist
-4. Is written in a way that would match natural language queries
-
-Each fact should follow this format:
-<main_concept>: <fact_statement> | <related_terms> | <code_reference>
-
-Example Facts:
-browser_config: Configure headless mode and browser type for AsyncWebCrawler | headless, browser_type, chromium, firefox | BrowserConfig(browser_type="chromium", headless=True)
-redis_connection: Redis client connection requires host and port configuration | redis setup, redis client, connection params | Redis(host='localhost', port=6379, db=0)
-pandas_filtering: Filter DataFrame rows using boolean conditions | dataframe filter, query, boolean indexing | df[df['column'] > 5]
-
-Wrap your response in <index>...</index> tags.
-"""
-
-        # Prepare messages for batch processing
-        messages_list = [
-            [
-                {
-                    "role": "user",
-                    "content": f"{prompt}\n\nGenerate index for this documentation:\n\n{content}",
-                }
-            ]
-            for content in contents
-            if content
-        ]
-
-        try:
-            responses = batch_completion(
-                model="anthropic/claude-3-5-sonnet-latest",
-                messages=messages_list,
-                logger_fn=None,
-            )
-
-            # Save index files; skip unread files so responses stay aligned with their paths
-            for response, file_path in zip(responses, [p for p, c in zip(doc_batch, contents) if c]):
-                try:
-                    index_content_match = re.search(
-                        r"<index>(.*?)</index>",
-                        response.choices[0].message.content,
-                        re.DOTALL,
-                    )
-                    if not index_content_match:
-                        self.logger.warning(
-                            f"No <index>...</index> content found for {file_path}"
-                        )
-                        continue
-
-                    index_content = re.sub(
-                        r"\n\s*\n", "\n", index_content_match.group(1)
-                    ).strip()
-                    if index_content:
-                        index_file = file_path.with_suffix(".q.md")
-                        with open(index_file, "w", encoding="utf-8") as f:
-                            f.write(index_content)
-                        self.logger.info(f"Created index file: {index_file}")
-                    else:
-                        self.logger.warning(
-                            f"No index content found in response for {file_path}"
-                        )
-
-                except Exception as e:
-                    self.logger.error(
-                        f"Error processing response for {file_path}: {str(e)}"
-                    )
-
-        except Exception as e:
-            self.logger.error(f"Error in batch completion: {str(e)}")
-
-    def _validate_fact_line(self, line: str) -> Tuple[bool, Optional[str]]:
-        if "|" not in line:
-            return False, "Missing separator '|'"
-
-        parts = [p.strip() for p in line.split("|")]
-        if len(parts) != 3:
-            return False, f"Expected 3 parts, got {len(parts)}"
-
-        concept_part = parts[0]
-        if ":" not in concept_part:
-            return False, "Missing ':' in concept definition"
-
-        return True, None
-
-    def _load_or_create_token_cache(self, fact_file: Path) -> Dict:
-        """
-        Load token cache from .q.tokens if present and matching file hash.
-        Otherwise return a new structure with updated file-hash.
-        """
-        cache_file = fact_file.with_suffix(".q.tokens")
-        current_hash = _compute_file_hash(fact_file)
-
-        if cache_file.exists():
-            try:
-                with open(cache_file, "r") as f:
-                    cache = json.load(f)
-                # If the hash matches, return it directly
-                if cache.get("content_hash") == current_hash:
-                    return cache
-                # Otherwise, we signal that it's changed
-                self.logger.info(f"Hash changed for {fact_file}, reindex needed.")
-            except json.JSONDecodeError:
-                self.logger.warning(f"Corrupt token cache for {fact_file}, rebuilding.")
-            except Exception as e:
-                self.logger.warning(f"Error reading cache for {fact_file}: {str(e)}")
-
-        # Return a fresh cache
-        return {"facts": {}, "content_hash": current_hash}
-
-    def _save_token_cache(self, fact_file: Path, cache: Dict) -> None:
-        cache_file = fact_file.with_suffix(".q.tokens")
-        # Always ensure we're saving the correct file-hash
-        cache["content_hash"] = _compute_file_hash(fact_file)
-        with open(cache_file, "w") as f:
-            json.dump(cache, f)
-
-    def preprocess_text(self, text: str) -> List[str]:
-        parts = [x.strip() for x in text.split("|")] if "|" in text else [text]
-        # Remove : after the first word of parts[0]
-        parts[0] = re.sub(r"^(.*?):", r"\1", parts[0])
-
-        lemmatizer = WordNetLemmatizer()
-        stop_words = set(stopwords.words("english")) - {
-            "how",
-            "what",
-            "when",
-            "where",
-            "why",
-            "which",
-        }
-
-        tokens = []
-        for part in parts:
-            if "(" in part and ")" in part:
-                code_tokens = re.findall(
-                    r'[\w_]+(?=\()|[\w_]+(?==[\'"]{1}[\w_]+[\'"]{1})', part
-                )
-                tokens.extend(code_tokens)
-
-            words = word_tokenize(part.lower())
-            tokens.extend(
-                [
-                    lemmatizer.lemmatize(token)
-                    for token in words
-                    if token not in stop_words
-                ]
-            )
-
-        return tokens
-
-    def maybe_load_bm25_index(self, clear_cache=False) -> bool:
-        """
-        Load existing BM25 index from disk, if present and clear_cache=False.
-        """
-        if not clear_cache and os.path.exists(self.bm25_index_file):
-            self.logger.info("Loading existing BM25 index from disk.")
-            with open(self.bm25_index_file, "rb") as f:
-                data = pickle.load(f)
-            self.tokenized_facts = data["tokenized_facts"]
-            self.bm25_index = data["bm25_index"]
-            return True
-        return False
-
-    def build_search_index(self, clear_cache=False) -> None:
-        """
-        Checks for new or modified .q.md files by comparing file-hash.
-        If none need reindexing and clear_cache is False, loads existing index if available.
-        Otherwise, reindexes only changed/new files and merges or creates a new index.
-        """
-        # If clear_cache is True, we skip partial logic: rebuild everything from scratch
-        if clear_cache:
-            self.logger.info("Clearing cache and rebuilding full search index.")
-            if self.bm25_index_file.exists():
-                self.bm25_index_file.unlink()
-
-        process = psutil.Process()
-        self.logger.info("Checking which .q.md files need (re)indexing...")
-
-        # Gather all .q.md files
-        q_files = [
-            self.docs_dir / f for f in os.listdir(self.docs_dir) if f.endswith(".q.md")
-        ]
-
-        # We'll store known (unchanged) facts in these lists
-        existing_facts: List[str] = []
-        existing_tokens: List[List[str]] = []
-
-        # Keep track of invalid lines for logging
-        invalid_lines = []
-        needSet = []  # files that must be (re)indexed
-
-        for qf in q_files:
-            token_cache_file = qf.with_suffix(".q.tokens")
-
-            # If no .q.tokens or clear_cache is True → definitely reindex
-            if clear_cache or not token_cache_file.exists():
-                needSet.append(qf)
-                continue
-
-            # Otherwise, load the existing cache and compare hash
-            cache = self._load_or_create_token_cache(qf)
-            # If the .q.tokens was out of date (i.e. changed hash), we reindex
-            if len(cache["facts"]) == 0 or cache.get(
-                "content_hash"
-            ) != _compute_file_hash(qf):
-                needSet.append(qf)
-            else:
-                # File is unchanged → retrieve cached token data
-                for line, cache_data in cache["facts"].items():
-                    existing_facts.append(line)
-                    existing_tokens.append(cache_data["tokens"])
-                    self.document_map[line] = qf  # track the doc for that fact
-
-        if not needSet and not clear_cache:
-            # If no file needs reindexing, try loading existing index
-            if self.maybe_load_bm25_index(clear_cache=False):
-                self.logger.info(
-                    "No new/changed .q.md files found. Using existing BM25 index."
-                )
-                return
-            else:
-                # If there's no existing index, we must build a fresh index from the old caches
-                self.logger.info(
-                    "No existing BM25 index found. Building from cached facts."
-                )
-                if existing_facts:
-                    self.logger.info(
-                        f"Building BM25 index with {len(existing_facts)} cached facts."
-                    )
-                    self.bm25_index = BM25Okapi(existing_tokens)
-                    self.tokenized_facts = existing_facts
-                    with open(self.bm25_index_file, "wb") as f:
-                        pickle.dump(
-                            {
-                                "bm25_index": self.bm25_index,
-                                "tokenized_facts": self.tokenized_facts,
-                            },
-                            f,
-                        )
-                else:
-                    self.logger.warning("No facts found at all. Index remains empty.")
-                return
-
-        # -----------------------------------------------------
-        # If we reach here, we have new or changed .q.md files
-        # We'll parse them, reindex them, and then combine with existing_facts
-        # -----------------------------------------------------
-
-        self.logger.info(f"{len(needSet)} file(s) need reindexing. Parsing now...")
-
-        # 1) Parse the new or changed .q.md files
-        new_facts = []
-        new_tokens = []
-        with tqdm(total=len(needSet), desc="Indexing changed files") as file_pbar:
-            for file in needSet:
-                # We'll build up a fresh cache
-                fresh_cache = {"facts": {}, "content_hash": _compute_file_hash(file)}
-                try:
-                    with open(file, "r", encoding="utf-8") as f_obj:
-                        content = f_obj.read().strip()
-                        lines = [l.strip() for l in content.split("\n") if l.strip()]
-
-                    for line in lines:
-                        is_valid, error = self._validate_fact_line(line)
-                        if not is_valid:
-                            invalid_lines.append((file, line, error))
-                            continue
-
-                        tokens = self.preprocess_text(line)
-                        fresh_cache["facts"][line] = {
-                            "tokens": tokens,
-                            "added": time.time(),
-                        }
-                        new_facts.append(line)
-                        new_tokens.append(tokens)
-                        self.document_map[line] = file
-
-                    # Save the new .q.tokens with updated hash
-                    self._save_token_cache(file, fresh_cache)
-
-                    mem_usage = process.memory_info().rss / 1024 / 1024
-                    self.logger.debug(
-                        f"Memory usage after {file.name}: {mem_usage:.2f}MB"
-                    )
-
-                except Exception as e:
-                    self.logger.error(f"Error processing {file}: {str(e)}")
-
-                file_pbar.update(1)
-
-        if invalid_lines:
-            self.logger.warning(f"Found {len(invalid_lines)} invalid fact lines:")
-            for file, line, error in invalid_lines:
-                self.logger.warning(f"{file}: {error} in line: {line[:50]}...")
-
-        # 2) Merge newly tokenized facts with the existing ones
-        all_facts = existing_facts + new_facts
-        all_tokens = existing_tokens + new_tokens
-
-        # 3) Build BM25 index from combined facts
-        self.logger.info(
-            f"Building BM25 index with {len(all_facts)} total facts (old + new)."
-        )
-        self.bm25_index = BM25Okapi(all_tokens)
-        self.tokenized_facts = all_facts
-
-        # 4) Save the updated BM25 index to disk
-        with open(self.bm25_index_file, "wb") as f:
-            pickle.dump(
-                {
-                    "bm25_index": self.bm25_index,
-                    "tokenized_facts": self.tokenized_facts,
-                },
-                f,
-            )
-
-        final_mem = process.memory_info().rss / 1024 / 1024
-        self.logger.info(f"Search index updated. Final memory usage: {final_mem:.2f}MB")
-
-    async def generate_index_files(
-        self, force_generate_facts: bool = False, clear_bm25_cache: bool = False
-    ) -> None:
-        """
-        Generate index files for all documents in parallel batches
-
-        Args:
-            force_generate_facts (bool): If True, regenerate indexes even if they exist
-            clear_bm25_cache (bool): If True, clear existing BM25 index cache
-        """
-        self.logger.info("Starting index generation for documentation files.")
-
-        md_files = [
-            self.docs_dir / f
-            for f in os.listdir(self.docs_dir)
-            if f.endswith(".md") and not any(f.endswith(x) for x in [".q.md", ".xs.md"])
-        ]
-
-        # Filter out files that already have .q files unless force=True
-        if not force_generate_facts:
-            md_files = [
-                f
-                for f in md_files
-                if not (self.docs_dir / f.name.replace(".md", ".q.md")).exists()
-            ]
-
-        if not md_files:
-            self.logger.info("All index files exist. Use force_generate_facts=True to regenerate.")
-        else:
-            # Process documents in batches
-            for i in range(0, len(md_files), self.batch_size):
-                batch = md_files[i : i + self.batch_size]
-                self.logger.info(
-                    f"Processing batch {i // self.batch_size + 1}/{-(-len(md_files) // self.batch_size)}"
-                )
-                await self._process_document_batch(batch)
-
-        self.logger.info("Index generation complete, building/updating search index.")
-        self.build_search_index(clear_cache=clear_bm25_cache)
-
-    def generate(self, sections: List[str], mode: str = "extended") -> str:
-        # Get all markdown files
-        all_files = glob.glob(str(self.docs_dir / "[0-9]*.md")) + glob.glob(
-            str(self.docs_dir / "[0-9]*.xs.md")
-        )
-
-        # Extract base names without extensions
-        base_docs = {
-            Path(f).name.split(".")[0]
-            for f in all_files
-            if not Path(f).name.endswith(".q.md")
-        }
-
-        # Filter by sections if provided
-        if sections:
-            base_docs = {
-                doc
-                for doc in base_docs
-                if any(section.lower() in doc.lower() for section in sections)
-            }
-
-        # Get file paths based on mode
-        files = []
-        for doc in sorted(
-            base_docs,
-            key=lambda x: int(x.split("_")[0]) if x.split("_")[0].isdigit() else 999999,
-        ):
-            if mode == "condensed":
-                xs_file = self.docs_dir / f"{doc}.xs.md"
-                regular_file = self.docs_dir / f"{doc}.md"
-                files.append(str(xs_file if xs_file.exists() else regular_file))
-            else:
-                files.append(str(self.docs_dir / f"{doc}.md"))
-
-        # Read and format content
-        content = []
-        for file in files:
-            try:
-                with open(file, "r", encoding="utf-8") as f:
-                    fname = Path(file).name
-                    content.append(f"{'#'*20}\n# {fname}\n{'#'*20}\n\n{f.read()}")
-            except Exception as e:
-                self.logger.error(f"Error reading {file}: {str(e)}")
-
-        return "\n\n---\n\n".join(content) if content else ""
-
-    def search(self, query: str, top_k: int = 5) -> str:
-        if not self.bm25_index:
-            return "No search index available. Call build_search_index() first."
-
-        query_tokens = self.preprocess_text(query)
-        doc_scores = self.bm25_index.get_scores(query_tokens)
-
-        mean_score = np.mean(doc_scores)
-        std_score = np.std(doc_scores)
-        score_threshold = mean_score + (0.25 * std_score)
-
-        file_data = self._aggregate_search_scores(
-            doc_scores=doc_scores,
-            score_threshold=score_threshold,
-            query_tokens=query_tokens,
-        )
-
-        ranked_files = sorted(
-            file_data.items(),
-            key=lambda x: (
-                x[1]["code_match_score"] * 2.0
-                + x[1]["match_count"] * 1.5
-                + x[1]["total_score"]
-            ),
-            reverse=True,
-        )[:top_k]
-
-        results = []
-        for file, _ in ranked_files:
-            main_doc = str(file).replace(".q.md", ".md")
-            if os.path.exists(self.docs_dir / main_doc):
-                with open(self.docs_dir / main_doc, "r", encoding="utf-8") as f:
-                    only_file_name = main_doc.split("/")[-1]
-                    content = ["#" * 20, f"# {only_file_name}", "#" * 20, "", f.read()]
-                    results.append("\n".join(content))
-
-        return "\n\n---\n\n".join(results)
-
-    def _aggregate_search_scores(
-        self, doc_scores: List[float], score_threshold: float, query_tokens: List[str]
-    ) -> Dict:
-        file_data = {}
-
-        for idx, score in enumerate(doc_scores):
-            if score <= score_threshold:
-                continue
-
-            fact = self.tokenized_facts[idx]
-            file_path = self.document_map[fact]
-
-            if file_path not in file_data:
-                file_data[file_path] = {
-                    "total_score": 0,
-                    "match_count": 0,
-                    "code_match_score": 0,
-                    "matched_facts": [],
-                }
-
-            components = fact.split("|") if "|" in fact else [fact]
-
-            code_match_score = 0
-            if len(components) == 3:
-                code_ref = components[2].strip()
-                code_tokens = self.preprocess_text(code_ref)
-                code_match_score = len(set(query_tokens) & set(code_tokens)) / len(
-                    query_tokens
-                )
-
-            file_data[file_path]["total_score"] += score
-            file_data[file_path]["match_count"] += 1
-            file_data[file_path]["code_match_score"] = max(
-                file_data[file_path]["code_match_score"], code_match_score
-            )
-            file_data[file_path]["matched_facts"].append(fact)
-
-        return file_data
-
-    def refresh_index(self) -> None:
-        """Convenience method for a full rebuild."""
-        self.build_search_index(clear_cache=True)
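The removed `build_search_index` decides per file whether to re-tokenize by comparing a stored content hash with the file's current MD5. The sketch below isolates that invalidation step under stated assumptions: `cache_path_for`, `load_tokens`, and the `.tokens.json` sidecar name are hypothetical, and a lowercase whitespace split stands in for the NLTK-based `preprocess_text`.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, List


def file_md5(path: Path) -> str:
    """MD5 of the file content, read in 4 KiB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path_for(fact_file: Path) -> Path:
    # Hypothetical naming: a sidecar file next to the fact file.
    return fact_file.with_name(fact_file.name + ".tokens.json")


def load_tokens(fact_file: Path) -> Dict[str, List[str]]:
    """Return cached tokens if the stored hash matches, else re-tokenize."""
    cache_file = cache_path_for(fact_file)
    current = file_md5(fact_file)
    if cache_file.exists():
        try:
            cache = json.loads(cache_file.read_text(encoding="utf-8"))
            if isinstance(cache, dict) and cache.get("content_hash") == current:
                return cache["facts"]
        except (json.JSONDecodeError, KeyError):
            pass  # corrupt or incomplete cache: fall through and rebuild
    # Placeholder tokenizer (assumption): lowercase whitespace split.
    text = fact_file.read_text(encoding="utf-8")
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    facts = {line: line.lower().split() for line in lines}
    # Store the hash alongside the tokens so the next call can detect changes.
    payload = {"content_hash": current, "facts": facts}
    cache_file.write_text(json.dumps(payload), encoding="utf-8")
    return facts
```

Keying the cache on a content hash instead of a modification time means a file that is rewritten with identical content is not re-tokenized, at the cost of reading every file once per run.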
--- a/crawl4ai/markdown_generation_strategy.py
+++ b/crawl4ai/markdown_generation_strategy.py
@@ -1,253 +0,0 @@
-from abc import ABC, abstractmethod
-from typing import Optional, Dict, Any, Tuple
-from .models import MarkdownGenerationResult
-from .html2text import CustomHTML2Text
-from .content_filter_strategy import RelevantContentFilter
-import re
-from urllib.parse import urljoin
-
-# Pre-compile the regex pattern
-LINK_PATTERN = re.compile(r'!?\[([^\]]+)\]\(([^)]+?)(?:\s+"([^"]*)")?\)')
-
-
-def fast_urljoin(base: str, url: str) -> str:
-    """Fast URL joining for common cases."""
-    if url.startswith(("http://", "https://", "mailto:", "//")):
-        return url
-    if url.startswith("/"):
-        # Root-relative path: append to base as-is (correct only when base is the site root)
-        if base.endswith("/"):
-            return base[:-1] + url
-        return base + url
-    return urljoin(base, url)
-
-
-class MarkdownGenerationStrategy(ABC):
-    """Abstract base class for markdown generation strategies."""
-
-    def __init__(
-        self,
-        content_filter: Optional[RelevantContentFilter] = None,
-        options: Optional[Dict[str, Any]] = None,
-    ):
-        self.content_filter = content_filter
-        self.options = options or {}
-
-    @abstractmethod
-    def generate_markdown(
-        self,
-        cleaned_html: str,
-        base_url: str = "",
-        html2text_options: Optional[Dict[str, Any]] = None,
-        content_filter: Optional[RelevantContentFilter] = None,
-        citations: bool = True,
-        **kwargs,
-    ) -> MarkdownGenerationResult:
-        """Generate markdown from cleaned HTML."""
-        pass
-
-
-class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
-    """
-    Default implementation of markdown generation strategy.
-
-    How it works:
-    1. Generate raw markdown from cleaned HTML.
-    2. Convert links to citations.
-    3. Generate fit markdown if content filter is provided.
-    4. Return MarkdownGenerationResult.
-
-    Args:
-        content_filter (Optional[RelevantContentFilter]): Content filter for generating fit markdown.
-        options (Optional[Dict[str, Any]]): Additional options for markdown generation. Defaults to None.
-
-    Returns:
-        MarkdownGenerationResult: Result containing raw markdown, fit markdown, fit HTML, and references markdown.
-    """
-
-    def __init__(
-        self,
-        content_filter: Optional[RelevantContentFilter] = None,
-        options: Optional[Dict[str, Any]] = None,
-    ):
-        super().__init__(content_filter, options)
-
-    def convert_links_to_citations(
-        self, markdown: str, base_url: str = ""
-    ) -> Tuple[str, str]:
-        """
-        Convert links in markdown to citations.
-
-        How it works:
-        1. Find all links in the markdown.
-        2. Convert links to citations.
-        3. Return converted markdown and references markdown.
-
-        Note:
-        This function uses a regex pattern to find links in markdown.
-
-        Args:
-            markdown (str): Markdown text.
-            base_url (str): Base URL for URL joins.
-
-        Returns:
-            Tuple[str, str]: Converted markdown and references markdown.
-        """
-        link_map = {}
-        url_cache = {}  # Cache for URL joins
-        parts = []
-        last_end = 0
-        counter = 1
-
-        for match in LINK_PATTERN.finditer(markdown):
-            parts.append(markdown[last_end : match.start()])
-            text, url, title = match.groups()
-
-            # Use cached URL if available, otherwise compute and cache
-            if base_url and not url.startswith(("http://", "https://", "mailto:")):
-                if url not in url_cache:
-                    url_cache[url] = fast_urljoin(base_url, url)
-                url = url_cache[url]
-
-            if url not in link_map:
-                desc = []
-                if title:
-                    desc.append(title)
-                if text and text != title:
-                    desc.append(text)
-                link_map[url] = (counter, ": " + " - ".join(desc) if desc else "")
-                counter += 1
-
-            num = link_map[url][0]
-            parts.append(
-                f"{text}⟨{num}⟩"
-                if not match.group(0).startswith("!")
-                else f"![{text}⟨{num}⟩]"
-            )
-            last_end = match.end()
-
-        parts.append(markdown[last_end:])
-        converted_text = "".join(parts)
-
-        # Pre-build reference strings
-        references = ["\n\n## References\n\n"]
-        references.extend(
-            f"⟨{num}⟩ {url}{desc}\n"
-            for url, (num, desc) in sorted(link_map.items(), key=lambda x: x[1][0])
-        )
-
-        return converted_text, "".join(references)
-
-    def generate_markdown(
-        self,
-        cleaned_html: str,
-        base_url: str = "",
-        html2text_options: Optional[Dict[str, Any]] = None,
-        options: Optional[Dict[str, Any]] = None,
-        content_filter: Optional[RelevantContentFilter] = None,
-        citations: bool = True,
-        **kwargs,
-    ) -> MarkdownGenerationResult:
-        """
-        Generate markdown with citations from cleaned HTML.
-
-        How it works:
-        1. Generate raw markdown from cleaned HTML.
-        2. Convert links to citations.
-        3. Generate fit markdown if content filter is provided.
-        4. Return MarkdownGenerationResult.
-
-        Args:
-            cleaned_html (str): Cleaned HTML content.
-            base_url (str): Base URL for URL joins.
-            html2text_options (Optional[Dict[str, Any]]): HTML2Text options.
-            options (Optional[Dict[str, Any]]): Additional options for markdown generation.
-            content_filter (Optional[RelevantContentFilter]): Content filter for generating fit markdown.
-            citations (bool): Whether to generate citations.
-
-        Returns:
-            MarkdownGenerationResult: Result containing raw markdown, fit markdown, fit HTML, and references markdown.
-        """
-        try:
-            # Initialize HTML2Text with default options for better conversion
-            h = CustomHTML2Text(baseurl=base_url)
-            default_options = {
-                "body_width": 0,  # Disable text wrapping
-                "ignore_emphasis": False,
-                "ignore_links": False,
-                "ignore_images": False,
-                "protect_links": True,
-                "single_line_break": True,
-                "mark_code": True,
-                "escape_snob": False,
-            }
-
-            # Update with custom options if provided
-            if html2text_options:
-                default_options.update(html2text_options)
-            elif options:
-                default_options.update(options)
-            elif self.options:
-                default_options.update(self.options)
-
-            h.update_params(**default_options)
-
-            # Ensure we have valid input
-            if not cleaned_html:
-                cleaned_html = ""
-            elif not isinstance(cleaned_html, str):
-                cleaned_html = str(cleaned_html)
-
-            # Generate raw markdown
-            try:
-                raw_markdown = h.handle(cleaned_html)
-            except Exception as e:
-                raw_markdown = f"Error converting HTML to markdown: {str(e)}"
-
-            raw_markdown = raw_markdown.replace("    ```", "```")
-
-            # Convert links to citations
-            markdown_with_citations: str = raw_markdown
-            references_markdown: str = ""
-            if citations:
-                try:
-                    (
-                        markdown_with_citations,
-                        references_markdown,
-                    ) = self.convert_links_to_citations(raw_markdown, base_url)
-                except Exception as e:
-                    markdown_with_citations = raw_markdown
-                    references_markdown = f"Error generating citations: {str(e)}"
-
-            # Generate fit markdown if content filter is provided
-            fit_markdown: Optional[str] = ""
-            filtered_html: Optional[str] = ""
-            if content_filter or self.content_filter:
-                try:
-                    content_filter = content_filter or self.content_filter
-                    filtered_html = content_filter.filter_content(cleaned_html)
-                    filtered_html = "\n".join(
-                        "<div>{}</div>".format(s) for s in filtered_html
-                    )
-                    fit_markdown = h.handle(filtered_html)
-                except Exception as e:
-                    fit_markdown = f"Error generating fit markdown: {str(e)}"
-                    filtered_html = ""
-
-            return MarkdownGenerationResult(
-                raw_markdown=raw_markdown or "",
-                markdown_with_citations=markdown_with_citations or "",
-                references_markdown=references_markdown or "",
-                fit_markdown=fit_markdown or "",
-                fit_html=filtered_html or "",
-            )
-        except Exception as e:
-            # If anything fails, return empty strings with error message
-            error_msg = f"Error in markdown generation: {str(e)}"
-            return MarkdownGenerationResult(
-                raw_markdown=error_msg,
-                markdown_with_citations=error_msg,
-                references_markdown="",
-                fit_markdown="",
-                fit_html="",
-            )
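Review note: the removed `convert_links_to_citations` method replaces each inline markdown link with a numbered marker and collects the URLs into a reference list. The core of that idea can be sketched as below. This is a simplified illustration, not the removed implementation: it reuses the same regex, but the function name `links_to_citations` is hypothetical, and it leaves out image links, base-URL joining, and title/description text.

```python
import re
from typing import Dict, List, Tuple

# Same pattern as the removed LINK_PATTERN: optional "!", then [text](url "title")
LINK_PATTERN = re.compile(r'!?\[([^\]]+)\]\(([^)]+?)(?:\s+"([^"]*)")?\)')


def links_to_citations(markdown: str) -> Tuple[str, str]:
    """Replace inline links with numbered markers and build a reference list.

    Hypothetical helper for illustration only; images, relative URLs and
    link titles are not handled here.
    """
    numbers: Dict[str, int] = {}  # url -> citation number, in first-seen order
    parts: List[str] = []
    last_end = 0

    for match in LINK_PATTERN.finditer(markdown):
        text, url, _title = match.groups()
        # Keep the text between the previous link and this one unchanged
        parts.append(markdown[last_end:match.start()])
        # A URL seen before keeps its original number
        num = numbers.setdefault(url, len(numbers) + 1)
        parts.append(f"{text}⟨{num}⟩")
        last_end = match.end()

    parts.append(markdown[last_end:])
    references = "".join(f"⟨{num}⟩ {url}\n" for url, num in numbers.items())
    return "".join(parts), references


if __name__ == "__main__":
    body, refs = links_to_citations(
        "See [docs](https://a.example) and [more](https://b.example)."
    )
    print(body)
    print(refs)
```

Numbering by first occurrence keeps repeated links pointing at a single reference entry, which is the behaviour the removed `link_map` dictionary provided.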
--- a/crawl4ai/migrations.py
+++ b/crawl4ai/migrations.py
@@ -1,194 +0,0 @@
-import os
-import asyncio
-from pathlib import Path
-import aiosqlite
-from typing import Optional
-import xxhash
-import aiofiles
-import shutil
-from datetime import datetime
-from .async_logger import AsyncLogger, LogLevel
-
-# Initialize logger
-logger = AsyncLogger(log_level=LogLevel.DEBUG, verbose=True)
-
-# logging.basicConfig(level=logging.INFO)
-# logger = logging.getLogger(__name__)
-
-
-class DatabaseMigration:
-    def __init__(self, db_path: str):
-        self.db_path = db_path
-        self.content_paths = self._ensure_content_dirs(os.path.dirname(db_path))
-
-    def _ensure_content_dirs(self, base_path: str) -> dict:
-        dirs = {
-            "html": "html_content",
-            "cleaned": "cleaned_html",
-            "markdown": "markdown_content",
-            "extracted": "extracted_content",
-            "screenshots": "screenshots",
-        }
-        content_paths = {}
-        for key, dirname in dirs.items():
-            path = os.path.join(base_path, dirname)
-            os.makedirs(path, exist_ok=True)
-            content_paths[key] = path
-        return content_paths
-
-    def _generate_content_hash(self, content: str) -> str:
-        x = xxhash.xxh64()
-        x.update(content.encode())
-        content_hash = x.hexdigest()
-        return content_hash
-        # return hashlib.sha256(content.encode()).hexdigest()
-
-    async def _store_content(self, content: str, content_type: str) -> str:
-        if not content:
-            return ""
-
-        content_hash = self._generate_content_hash(content)
-        file_path = os.path.join(self.content_paths[content_type], content_hash)
-
-        if not os.path.exists(file_path):
-            async with aiofiles.open(file_path, "w", encoding="utf-8") as f:
-                await f.write(content)
-
-        return content_hash
-
-    async def migrate_database(self):
-        """Migrate existing database to file-based storage"""
-        # logger.info("Starting database migration...")
-        logger.info("Starting database migration...", tag="INIT")
-
-        try:
-            async with aiosqlite.connect(self.db_path) as db:
-                # Get all rows
-                async with db.execute(
-                    """SELECT url, html, cleaned_html, markdown, 
-                       extracted_content, screenshot FROM crawled_data"""
-                ) as cursor:
-                    rows = await cursor.fetchall()
-
-                migrated_count = 0
-                for row in rows:
-                    (
-                        url,
-                        html,
-                        cleaned_html,
-                        markdown,
-                        extracted_content,
-                        screenshot,
-                    ) = row
-
-                    # Store content in files and get hashes
-                    html_hash = await self._store_content(html, "html")
-                    cleaned_hash = await self._store_content(cleaned_html, "cleaned")
-                    markdown_hash = await self._store_content(markdown, "markdown")
-                    extracted_hash = await self._store_content(
-                        extracted_content, "extracted"
-                    )
-                    screenshot_hash = await self._store_content(
-                        screenshot, "screenshots"
-                    )
-
-                    # Update database with hashes
-                    await db.execute(
-                        """
-                        UPDATE crawled_data 
-                        SET html = ?, 
-                            cleaned_html = ?,
-                            markdown = ?,
-                            extracted_content = ?,
-                            screenshot = ?
-                        WHERE url = ?
-                    """,
-                        (
-                            html_hash,
-                            cleaned_hash,
-                            markdown_hash,
-                            extracted_hash,
-                            screenshot_hash,
-                            url,
-                        ),
-                    )
-
-                    migrated_count += 1
-                    if migrated_count % 100 == 0:
-                        logger.info(f"Migrated {migrated_count} records...", tag="INIT")
-
-                await db.commit()
-                logger.success(
-                    f"Migration completed. {migrated_count} records processed.",
-                    tag="COMPLETE",
-                )
-
-        except Exception as e:
-            # logger.error(f"Migration failed: {e}")
-            logger.error(
-                message="Migration failed: {error}",
-                tag="ERROR",
-                params={"error": str(e)},
-            )
-            raise e
-
-
-async def backup_database(db_path: str) -> str:
-    """Create backup of existing database"""
-    if not os.path.exists(db_path):
-        logger.info("No existing database found. Skipping backup.", tag="INIT")
-        return None
-
-    # Create backup with timestamp
-    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
-    backup_path = f"{db_path}.backup_{timestamp}"
-
-    try:
-        # Wait for any potential write operations to finish
-        await asyncio.sleep(1)
-
-        # Create backup
-        shutil.copy2(db_path, backup_path)
-        logger.info(f"Database backup created at: {backup_path}", tag="COMPLETE")
-        return backup_path
-    except Exception as e:
-        # logger.error(f"Backup failed: {e}")
-        logger.error(
-            message="Migration failed: {error}", tag="ERROR", params={"error": str(e)}
-        )
-        raise e
-
-
-async def run_migration(db_path: Optional[str] = None):
-    """Run database migration"""
-    if db_path is None:
-        db_path = os.path.join(Path.home(), ".crawl4ai", "crawl4ai.db")
-
-    if not os.path.exists(db_path):
-        logger.info("No existing database found. Skipping migration.", tag="INIT")
-        return
-
-    # Create backup first
-    backup_path = await backup_database(db_path)
-    if not backup_path:
-        return
-
-    migration = DatabaseMigration(db_path)
-    await migration.migrate_database()
-
-
-def main():
-    """CLI entry point for migration"""
-    import argparse
-
-    parser = argparse.ArgumentParser(
-        description="Migrate Crawl4AI database to file-based storage"
-    )
-    parser.add_argument("--db-path", help="Custom database path")
-    args = parser.parse_args()
-
-    asyncio.run(run_migration(args.db_path))
-
-
-if __name__ == "__main__":
-    main()
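Review note: the removed `DatabaseMigration._store_content` is a small content-addressed store. Each payload is written to a file named after its hash, and the database row keeps only the hash. A minimal synchronous sketch of that pattern follows. It swaps in `hashlib.sha256` from the standard library for the `xxhash` call used in the removed code, and the name `store_content` and the use of a temporary directory are assumptions for illustration.

```python
import hashlib
import os
import tempfile


def store_content(content: str, directory: str) -> str:
    """Write content to a file named after its hash and return the hash.

    Illustrative stand-in: uses sha256 rather than the xxh64 digest
    that the removed migration code used.
    """
    if not content:
        return ""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    path = os.path.join(directory, digest)
    if not os.path.exists(path):  # identical content is written only once
        with open(path, "w", encoding="utf-8") as f:
            f.write(content)
    return digest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        first = store_content("<html>hello</html>", tmp)
        second = store_content("<html>hello</html>", tmp)
        print(first == second, len(os.listdir(tmp)))
```

Because the file name is derived from the content, storing the same page twice produces one file, which is why the migration could replace each column value with a hash without duplicating data on disk.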
--- a/crawl4ai/model_loader.py
+++ b/crawl4ai/model_loader.py
@@ -2,125 +2,101 @@ from functools import lru_cache
 from pathlib import Path
 import subprocess, os
 import shutil
+import tarfile
 from .model_loader import *
 import argparse
+import urllib.request
 from crawl4ai.config import MODEL_REPO_BRANCH
-
 __location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))

-
@lru_cache()
 def get_available_memory(device):
    import torch
-
-    if device.type == "cuda":
+    if device.type == 'cuda':
        return torch.cuda.get_device_properties(device).total_memory
-    elif device.type == "mps":
-        return 48 * 1024**3  # Assuming 8GB for MPS, as a conservative estimate
+    elif device.type == 'mps':      
+        return 48 * 1024 ** 3  # Assumes 48 GiB for MPS (hard-coded, not queried from the device)
    else:
        return 0

-
@lru_cache()
 def calculate_batch_size(device):
    available_memory = get_available_memory(device)
-
-    if device.type == "cpu":
+    
+    if device.type == 'cpu':
        return 16
-    elif device.type in ["cuda", "mps"]:
+    elif device.type in ['cuda', 'mps']:
        # Adjust these thresholds based on your model size and available memory
-        if available_memory >= 31 * 1024**3:  # > 32GB
+        if available_memory >= 31 * 1024 ** 3:  # 31 GiB or more
            return 256
-        elif available_memory >= 15 * 1024**3:  # > 16GB to 32GB
+        elif available_memory >= 15 * 1024 ** 3:  # 15 GiB up to 31 GiB
            return 128
-        elif available_memory >= 8 * 1024**3:  # 8GB to 16GB
+        elif available_memory >= 8 * 1024 ** 3:  # 8GB to 16GB
            return 64
        else:
            return 32
    else:
-        return 16  # Default batch size
-
-
+        return 16  # Default batch size   
+    
@lru_cache()
 def get_device():
    import torch
-
    if torch.cuda.is_available():
-        device = torch.device("cuda")
+        device = torch.device('cuda')
    elif torch.backends.mps.is_available():
-        device = torch.device("mps")
+        device = torch.device('mps')
    else:
-        device = torch.device("cpu")
-    return device
-
-
+        device = torch.device('cpu')
+    return device   
+    
 def set_model_device(model):
    device = get_device()
-    model.to(device)
+    model.to(device)    
    return model, device

-
@lru_cache()
 def get_home_folder():
-    home_folder = os.path.join(
-        os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai"
-    )
+    home_folder = os.path.join(Path.home(), ".crawl4ai")
    os.makedirs(home_folder, exist_ok=True)
    os.makedirs(f"{home_folder}/cache", exist_ok=True)
    os.makedirs(f"{home_folder}/models", exist_ok=True)
-    return home_folder
-
+    return home_folder 

@lru_cache()
 def load_bert_base_uncased():
-    from transformers import BertTokenizer, BertModel
-
-    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", resume_download=None)
-    model = BertModel.from_pretrained("bert-base-uncased", resume_download=None)
+    from transformers import BertTokenizer, BertModel, AutoTokenizer, AutoModel
+    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', resume_download=None)
+    model = BertModel.from_pretrained('bert-base-uncased', resume_download=None)
    model.eval()
    model, device = set_model_device(model)
    return tokenizer, model

-
@lru_cache()
-def load_HF_embedding_model(model_name="BAAI/bge-small-en-v1.5") -> tuple:
-    """Load the Hugging Face model for embedding.
-
-    Args:
-        model_name (str, optional): The model name to load. Defaults to "BAAI/bge-small-en-v1.5".
-
-    Returns:
-        tuple: The tokenizer and model.
-    """
-    from transformers import AutoTokenizer, AutoModel
-
-    tokenizer = AutoTokenizer.from_pretrained(model_name, resume_download=None)
-    model = AutoModel.from_pretrained(model_name, resume_download=None)
+def load_bge_small_en_v1_5():
+    from transformers import BertTokenizer, BertModel, AutoTokenizer, AutoModel
+    tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-small-en-v1.5', resume_download=None)
+    model = AutoModel.from_pretrained('BAAI/bge-small-en-v1.5', resume_download=None)
    model.eval()
    model, device = set_model_device(model)
    return tokenizer, model

-
@lru_cache()
 def load_text_classifier():
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    from transformers import pipeline
+    import torch

-    tokenizer = AutoTokenizer.from_pretrained(
-        "dstefa/roberta-base_topic_classification_nyt_news"
-    )
-    model = AutoModelForSequenceClassification.from_pretrained(
-        "dstefa/roberta-base_topic_classification_nyt_news"
-    )
+    tokenizer = AutoTokenizer.from_pretrained("dstefa/roberta-base_topic_classification_nyt_news")
+    model = AutoModelForSequenceClassification.from_pretrained("dstefa/roberta-base_topic_classification_nyt_news")
    model.eval()
    model, device = set_model_device(model)
    pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
    return pipe

-
@lru_cache()
 def load_text_multilabel_classifier():
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
+    import numpy as np
    from scipy.special import expit
    import torch

@@ -132,27 +108,18 @@ def load_text_multilabel_classifier():
    # else:
    #     device = torch.device("cpu")
    #     # return load_spacy_model(), torch.device("cpu")
+    

    MODEL = "cardiffnlp/tweet-topic-21-multi"
    tokenizer = AutoTokenizer.from_pretrained(MODEL, resume_download=None)
-    model = AutoModelForSequenceClassification.from_pretrained(
-        MODEL, resume_download=None
-    )
+    model = AutoModelForSequenceClassification.from_pretrained(MODEL, resume_download=None)
    model.eval()
    model, device = set_model_device(model)
    class_mapping = model.config.id2label

    def _classifier(texts, threshold=0.5, max_length=64):
-        tokens = tokenizer(
-            texts,
-            return_tensors="pt",
-            padding=True,
-            truncation=True,
-            max_length=max_length,
-        )
-        tokens = {
-            key: val.to(device) for key, val in tokens.items()
-        }  # Move tokens to the selected device
+        tokens = tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=max_length)
+        tokens = {key: val.to(device) for key, val in tokens.items()}  # Move tokens to the selected device

        with torch.no_grad():
            output = model(**tokens)
@@ -163,41 +130,35 @@ def load_text_multilabel_classifier():

        batch_labels = []
        for prediction in predictions:
-            labels = [
-                class_mapping[i] for i, value in enumerate(prediction) if value == 1
-            ]
+            labels = [class_mapping[i] for i, value in enumerate(prediction) if value == 1]
            batch_labels.append(labels)

        return batch_labels

    return _classifier, device

-
@lru_cache()
 def load_nltk_punkt():
    import nltk
-
    try:
-        nltk.data.find("tokenizers/punkt")
+        nltk.data.find('tokenizers/punkt')
    except LookupError:
-        nltk.download("punkt")
-    return nltk.data.find("tokenizers/punkt")
-
+        nltk.download('punkt')
+    return nltk.data.find('tokenizers/punkt')

@lru_cache()
 def load_spacy_model():
    import spacy
-
    name = "models/reuters"
    home_folder = get_home_folder()
    model_folder = Path(home_folder) / name
-
+    
    # Check if the model directory already exists
    if not (model_folder.exists() and any(model_folder.iterdir())):
        repo_url = "https://github.com/unclecode/crawl4ai.git"
-        branch = MODEL_REPO_BRANCH
+        branch = MODEL_REPO_BRANCH 
        repo_folder = Path(home_folder) / "crawl4ai"
-
+        
        print("[LOG] ⏬ Downloading Spacy model for the first time...")

        # Remove existing repo folder if it exists
@@ -207,9 +168,7 @@ def load_spacy_model():
                if model_folder.exists():
                    shutil.rmtree(model_folder)
            except PermissionError:
-                print(
-                    "[WARNING] Unable to remove existing folders. Please manually delete the following folders and try again:"
-                )
+                print("[WARNING] Unable to remove existing folders. Please manually delete the following folders and try again:")
                print(f"- {repo_folder}")
                print(f"- {model_folder}")
                return None
@@ -220,7 +179,7 @@ def load_spacy_model():
                ["git", "clone", "-b", branch, repo_url, str(repo_folder)],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
-                check=True,
+                check=True
            )

            # Create the models directory if it doesn't exist
@@ -248,7 +207,6 @@ def load_spacy_model():
        print(f"Error loading spacy model: {e}")
        return None

-
 def download_all_models(remove_existing=False):
    """Download all models required for Crawl4AI."""
    if remove_existing:
@@ -277,20 +235,14 @@ def download_all_models(remove_existing=False):
    load_nltk_punkt()
    print("[LOG] ✅ All models downloaded successfully.")

-
 def main():
    print("[LOG] Welcome to the Crawl4AI Model Downloader!")
    print("[LOG] This script will download all the models required for Crawl4AI.")
    parser = argparse.ArgumentParser(description="Crawl4AI Model Downloader")
-    parser.add_argument(
-        "--remove-existing",
-        action="store_true",
-        help="Remove existing models before downloading",
-    )
+    parser.add_argument('--remove-existing', action='store_true', help="Remove existing models before downloading")
    args = parser.parse_args()
-
+    
    download_all_models(remove_existing=args.remove_existing)

-
 if __name__ == "__main__":
    main()
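Review note: the threshold logic in `calculate_batch_size` can be checked without `torch` by restating it as a pure function. The sketch below mirrors the thresholds in this diff. The function name `batch_size_for` and the plain string device type are assumptions made so the example runs on its own.

```python
GIB = 1024 ** 3


def batch_size_for(device_type: str, available_bytes: int) -> int:
    """Pure-Python restatement of the batch-size thresholds (illustrative)."""
    if device_type not in ("cuda", "mps"):
        return 16  # CPU and any unknown device type
    if available_bytes >= 31 * GIB:
        return 256
    if available_bytes >= 15 * GIB:
        return 128
    if available_bytes >= 8 * GIB:
        return 64
    return 32


if __name__ == "__main__":
    for device_type, gib in [("cpu", 0), ("cuda", 8), ("cuda", 16), ("mps", 48)]:
        print(device_type, gib, batch_size_for(device_type, gib * GIB))
```

One consequence worth noting: since `get_available_memory` returns a fixed 48 GiB for MPS, every MPS device lands in the top bucket regardless of its actual memory.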
--- a/crawl4ai/models.py
+++ b/crawl4ai/models.py
@@ -1,100 +1,10 @@
 from pydantic import BaseModel, HttpUrl
-from typing import List, Dict, Optional, Callable, Awaitable, Union, Any
-from enum import Enum
-from dataclasses import dataclass
-from .ssl_certificate import SSLCertificate
-from datetime import datetime
-from datetime import timedelta
-
-
-###############################
-# Dispatcher Models
-###############################
-@dataclass
-class DomainState:
-    last_request_time: float = 0
-    current_delay: float = 0
-    fail_count: int = 0
-
-
-@dataclass
-class CrawlerTaskResult:
-    task_id: str
-    url: str
-    result: "CrawlResult"
-    memory_usage: float
-    peak_memory: float
-    start_time: datetime
-    end_time: datetime
-    error_message: str = ""
-
-
-class CrawlStatus(Enum):
-    QUEUED = "QUEUED"
-    IN_PROGRESS = "IN_PROGRESS"
-    COMPLETED = "COMPLETED"
-    FAILED = "FAILED"
-
-
-@dataclass
-class CrawlStats:
-    task_id: str
-    url: str
-    status: CrawlStatus
-    start_time: Optional[datetime] = None
-    end_time: Optional[datetime] = None
-    memory_usage: float = 0.0
-    peak_memory: float = 0.0
-    error_message: str = ""
-
-    @property
-    def duration(self) -> str:
-        if not self.start_time:
-            return "0:00"
-        end = self.end_time or datetime.now()
-        duration = end - self.start_time
-        return str(timedelta(seconds=int(duration.total_seconds())))
-
-
-class DisplayMode(Enum):
-    DETAILED = "DETAILED"
-    AGGREGATED = "AGGREGATED"
-
-
-###############################
-# Crawler Models
-###############################
-@dataclass
-class TokenUsage:
-    completion_tokens: int = 0
-    prompt_tokens: int = 0
-    total_tokens: int = 0
-    completion_tokens_details: Optional[dict] = None
-    prompt_tokens_details: Optional[dict] = None
-
+from typing import List, Dict, Optional

 class UrlModel(BaseModel):
    url: HttpUrl
    forced: bool = False

-
-class MarkdownGenerationResult(BaseModel):
-    raw_markdown: str
-    markdown_with_citations: str
-    references_markdown: str
-    fit_markdown: Optional[str] = None
-    fit_html: Optional[str] = None
-
-
-class DispatchResult(BaseModel):
-    task_id: str
-    memory_usage: float
-    peak_memory: float
-    start_time: datetime
-    end_time: datetime
-    error_message: str = ""
-
-
 class CrawlResult(BaseModel):
    url: str
    html: str
@@ -102,81 +12,11 @@ class CrawlResult(BaseModel):
    cleaned_html: Optional[str] = None
    media: Dict[str, List[Dict]] = {}
    links: Dict[str, List[Dict]] = {}
-    downloaded_files: Optional[List[str]] = None
    screenshot: Optional[str] = None
-    pdf: Optional[bytes] = None
-    markdown: Optional[Union[str, MarkdownGenerationResult]] = None
-    markdown_v2: Optional[MarkdownGenerationResult] = None
-    fit_markdown: Optional[str] = None
-    fit_html: Optional[str] = None
+    markdown: Optional[str] = None
    extracted_content: Optional[str] = None
    metadata: Optional[dict] = None
    error_message: Optional[str] = None
    session_id: Optional[str] = None
    response_headers: Optional[dict] = None
-    status_code: Optional[int] = None
-    ssl_certificate: Optional[SSLCertificate] = None
-    dispatch_result: Optional[DispatchResult] = None
-    redirected_url: Optional[str] = None
-
-    class Config:
-        arbitrary_types_allowed = True
-
-
-class AsyncCrawlResponse(BaseModel):
-    html: str
-    response_headers: Dict[str, str]
-    status_code: int
-    screenshot: Optional[str] = None
-    pdf_data: Optional[bytes] = None
-    get_delayed_content: Optional[Callable[[Optional[float]], Awaitable[str]]] = None
-    downloaded_files: Optional[List[str]] = None
-    ssl_certificate: Optional[SSLCertificate] = None
-    final_url: Optional[str] = None
-
-    class Config:
-        arbitrary_types_allowed = True
-
-
-###############################
-# Scraping Models
-###############################
-class MediaItem(BaseModel):
-    src: Optional[str] = ""
-    alt: Optional[str] = ""
-    desc: Optional[str] = ""
-    score: Optional[int] = 0
-    type: str = "image"
-    group_id: Optional[int] = 0
-    format: Optional[str] = None
-    width: Optional[int] = None
-
-
-class Link(BaseModel):
-    href: Optional[str] = ""
-    text: Optional[str] = ""
-    title: Optional[str] = ""
-    base_domain: Optional[str] = ""
-
-
-class Media(BaseModel):
-    images: List[MediaItem] = []
-    videos: List[
-        MediaItem
-    ] = []  # Using MediaItem model for now, can be extended with Video model if needed
-    audios: List[
-        MediaItem
-    ] = []  # Using MediaItem model for now, can be extended with Audio model if needed
-
-
-class Links(BaseModel):
-    internal: List[Link] = []
-    external: List[Link] = []
-
-
-class ScrapingResult(BaseModel):
-    cleaned_html: str
-    success: bool
-    media: Media = Media()
-    links: Links = Links()
-    metadata: Dict[str, Any] = {}
+    status_code: Optional[int] = None
--- a/crawl4ai/models/onnx/config.json
+++ b/crawl4ai/models/onnx/config.json
@@ -0,0 +1,25 @@
+{
+  "_name_or_path": "sentence-transformers/all-MiniLM-L6-v2",
+  "architectures": [
+    "BertModel"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 384,
+  "initializer_range": 0.02,
+  "intermediate_size": 1536,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 6,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "transformers_version": "4.27.4",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 30522
+}
--- a/crawl4ai/models/onnx/model.onnx
+++ b/crawl4ai/models/onnx/model.onnx
--- a/crawl4ai/models/onnx/special_tokens_map.json
+++ b/crawl4ai/models/onnx/special_tokens_map.json
@@ -0,0 +1,7 @@
+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}
--- a/crawl4ai/models/onnx/tokenizer.json
+++ b/crawl4ai/models/onnx/tokenizer.json
--- a/crawl4ai/models/onnx/tokenizer_config.json
+++ b/crawl4ai/models/onnx/tokenizer_config.json
@@ -0,0 +1,15 @@
+{
+  "cls_token": "[CLS]",
+  "do_basic_tokenize": true,
+  "do_lower_case": true,
+  "mask_token": "[MASK]",
+  "model_max_length": 512,
+  "never_split": null,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "special_tokens_map_file": "/Users/hammad/.cache/huggingface/hub/models--sentence-transformers--all-MiniLM-L6-v2/snapshots/7dbbc90392e2f80f3d3c277d6e90027e55de9125/special_tokens_map.json",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}
--- a/crawl4ai/models/onnx/vocab.txt
+++ b/crawl4ai/models/onnx/vocab.txt
--- a/crawl4ai/prompts.py
+++ b/crawl4ai/prompts.py
@@ -202,808 +202,3 @@ Avoid Common Mistakes:

 Result
 Output the final list of JSON objects, wrapped in <blocks>...</blocks> XML tags. Make sure to close the tag properly."""
-
-
-PROMPT_FILTER_CONTENT = """Your task is to filter and convert HTML content into clean, focused markdown that's optimized for use with LLMs and information retrieval systems.
-
-INPUT HTML: 
-<|HTML_CONTENT_START|>
-{HTML}
-<|HTML_CONTENT_END|>
-
-
-SPECIFIC INSTRUCTION: 
-<|USER_INSTRUCTION_START|>
-{REQUEST}
-<|USER_INSTRUCTION_END|>
-
-TASK DETAILS:
-1. Content Selection
- DO: Keep essential information, main content, key details
- DO: Preserve hierarchical structure using markdown headers
- DO: Keep code blocks, tables, key lists
- DON'T: Include navigation menus, ads, footers, cookie notices
- DON'T: Keep social media widgets, sidebars, related content
-
-2. Content Transformation
- DO: Use proper markdown syntax (#, ##, **, `, etc)
- DO: Convert tables to markdown tables
- DO: Preserve code formatting with ```language blocks
- DO: Maintain link texts but remove tracking parameters
- DON'T: Include HTML tags in output
- DON'T: Keep class names, ids, or other HTML attributes
-
-3. Content Organization
- DO: Maintain logical flow of information
- DO: Group related content under appropriate headers
- DO: Use consistent header levels
- DON'T: Fragment related content
- DON'T: Duplicate information
-
-Example Input:
-<div class="main-content"><h1>Setup Guide</h1><p>Follow these steps...</p></div>
-<div class="sidebar">Related articles...</div>
-
-Example Output:
-# Setup Guide
-Follow these steps...
-
-IMPORTANT: If specific instruction is provided above, prioritize those requirements over these general guidelines.
-
-OUTPUT FORMAT: 
-Wrap your response in <content> tags. Use proper markdown throughout.
-<content>
-[Your markdown content here]
-</content>
-
-Begin filtering now."""
-
-JSON_SCHEMA_BUILDER= """
-# HTML Schema Generation Instructions
-You are a specialized model designed to analyze HTML patterns and generate extraction schemas. Your primary job is to create structured JSON schemas that can be used to extract data from HTML in a consistent and reliable way. When presented with HTML content, you must analyze its structure and generate a schema that captures all relevant data points.
-
-## Your Core Responsibilities:
-1. Analyze HTML structure to identify repeating patterns and important data points
-2. Generate valid JSON schemas following the specified format
-3. Create appropriate selectors that will work reliably for data extraction
-4. Name fields meaningfully based on their content and purpose
-5. Handle both specific user requests and autonomous pattern detection
-
-## Available Schema Types You Can Generate:
-
-<schema_types>
-1. Basic Single-Level Schema
-   - Use for simple, flat data structures
-   - Example: Product cards, user profiles
-   - Direct field extractions
-
-2. Nested Object Schema
-   - Use for hierarchical data
-   - Example: Articles with author details
-   - Contains objects within objects
-
-3. List Schema
-   - Use for repeating elements
-   - Example: Comment sections, product lists
-   - Handles arrays of similar items
-
-4. Complex Nested Lists
-   - Use for multi-level data
-   - Example: Categories with subcategories
-   - Multiple levels of nesting
-
-5. Transformation Schema
-   - Use for data requiring processing
-   - Supports regex and text transformations
-   - Special attribute handling
-</schema_types>
-
-<schema_structure>
-Your output must always be a JSON object with this structure:
-{
-  "name": "Descriptive name of the pattern",
-  "baseSelector": "CSS selector for the repeating element",
-  "fields": [
-    {
-      "name": "field_name",
-      "selector": "CSS selector",
-      "type": "text|attribute|nested|list|regex",
-      "attribute": "attribute_name",  // Optional
-      "transform": "transformation_type",  // Optional
-      "pattern": "regex_pattern",  // Optional
-      "fields": []  // For nested/list types
-    }
-  ]
-}
-</schema_structure>
-
-<type_definitions>
-Available field types:
- text: Direct text extraction
- attribute: HTML attribute extraction
- nested: Object containing other fields
- list: Array of similar items
- regex: Pattern-based extraction
-</type_definitions>
-
-<behavior_rules>
-1. When given a specific query:
-   - Focus on extracting requested data points
-   - Use most specific selectors possible
-   - Include all fields mentioned in the query
-
-2. When no query is provided:
-   - Identify main content areas
-   - Extract all meaningful data points
-   - Use semantic structure to determine importance
-   - Include prices, dates, titles, and other common data types
-
-3. Always:
-   - Use reliable CSS selectors
-   - Handle dynamic class names appropriately
-   - Create descriptive field names
-   - Follow consistent naming conventions
-</behavior_rules>
-
-<examples>
-1. Basic Product Card Example:
-<html>
-<div class="product-card" data-cat-id="electronics" data-subcat-id="laptops">
-  <h2 class="product-title">Gaming Laptop</h2>
-  <span class="price">$999.99</span>
-  <img src="laptop.jpg" alt="Gaming Laptop">
-</div>
-</html>
-
-Generated Schema:
-{
-  "name": "Product Cards",
-  "baseSelector": ".product-card",
-  "baseFields": [
-    {"name": "data_cat_id", "type": "attribute", "attribute": "data-cat-id"},
-    {"name": "data_subcat_id", "type": "attribute", "attribute": "data-subcat-id"}
-  ],
-  "fields": [
-    {
-      "name": "title",
-      "selector": ".product-title",
-      "type": "text"
-    },
-    {
-      "name": "price",
-      "selector": ".price",
-      "type": "text"
-    },
-    {
-      "name": "image_url",
-      "selector": "img",
-      "type": "attribute",
-      "attribute": "src"
-    }
-  ]
-}
-
-2. Article with Author Details Example:
-<html>
-<article>
-  <h1>The Future of AI</h1>
-  <div class="author-info">
-    <span class="author-name">Dr. Smith</span>
-    <img src="author.jpg" alt="Dr. Smith">
-  </div>
-</article>
-</html>
-
-Generated Schema:
-{
-  "name": "Article Details",
-  "baseSelector": "article",
-  "fields": [
-    {
-      "name": "title",
-      "selector": "h1",
-      "type": "text"
-    },
-    {
-      "name": "author",
-      "type": "nested",
-      "selector": ".author-info",
-      "fields": [
-        {
-          "name": "name",
-          "selector": ".author-name",
-          "type": "text"
-        },
-        {
-          "name": "avatar",
-          "selector": "img",
-          "type": "attribute",
-          "attribute": "src"
-        }
-      ]
-    }
-  ]
-}
-
-3. Comments Section Example:
-<html>
-<div class="comments-container">
-  <div class="comment" data-user-id="123">
-    <div class="user-name">John123</div>
-    <p class="comment-text">Great article!</p>
-  </div>
-  <div class="comment" data-user-id="456">
-    <div class="user-name">Alice456</div>
-    <p class="comment-text">Thanks for sharing.</p>
-  </div>
-</div>
-</html>
-
-Generated Schema:
-{
-  "name": "Comment Section",
-  "baseSelector": ".comments-container",
-  "baseFields": [
-    {"name": "data_user_id", "type": "attribute", "attribute": "data-user-id"}
-  ],
-  "fields": [
-    {
-      "name": "comments",
-      "type": "list",
-      "selector": ".comment",
-      "fields": [
-        {
-          "name": "user",
-          "selector": ".user-name",
-          "type": "text"
-        },
-        {
-          "name": "content",
-          "selector": ".comment-text",
-          "type": "text"
-        }
-      ]
-    }
-  ]
-}
-
-4. E-commerce Categories Example:
-<html>
-<div class="category-section" data-category="electronics">
-  <h2>Electronics</h2>
-  <div class="subcategory">
-    <h3>Laptops</h3>
-    <div class="product">
-      <span class="product-name">MacBook Pro</span>
-      <span class="price">$1299</span>
-    </div>
-    <div class="product">
-      <span class="product-name">Dell XPS</span>
-      <span class="price">$999</span>
-    </div>
-  </div>
-</div>
-</html>
-
-Generated Schema:
-{
-  "name": "E-commerce Categories",
-  "baseSelector": ".category-section",
-  "baseFields": [
-    {"name": "data_category", "type": "attribute", "attribute": "data-category"}
-  ],
-  "fields": [
-    {
-      "name": "category_name",
-      "selector": "h2",
-      "type": "text"
-    },
-    {
-      "name": "subcategories",
-      "type": "nested_list",
-      "selector": ".subcategory",
-      "fields": [
-        {
-          "name": "name",
-          "selector": "h3",
-          "type": "text"
-        },
-        {
-          "name": "products",
-          "type": "list",
-          "selector": ".product",
-          "fields": [
-            {
-              "name": "name",
-              "selector": ".product-name",
-              "type": "text"
-            },
-            {
-              "name": "price",
-              "selector": ".price",
-              "type": "text"
-            }
-          ]
-        }
-      ]
-    }
-  ]
-}
-
-5. Job Listings with Transformations Example:
-<html>
-<div class="job-post">
-  <h3 class="job-title">Senior Developer</h3>
-  <span class="salary-text">Salary: $120,000/year</span>
-  <span class="location">  New York, NY  </span>
-</div>
-</html>
-
-Generated Schema:
-{
-  "name": "Job Listings",
-  "baseSelector": ".job-post",
-  "fields": [
-    {
-      "name": "title",
-      "selector": ".job-title",
-      "type": "text",
-      "transform": "uppercase"
-    },
-    {
-      "name": "salary",
-      "selector": ".salary-text",
-      "type": "regex",
-      "pattern": "\\$([\\d,]+)"
-    },
-    {
-      "name": "location",
-      "selector": ".location",
-      "type": "text",
-      "transform": "strip"
-    }
-  ]
-}
-
-6. Skyscanner Place Card Example:
-<html>
-<div class="PlaceCard_descriptionContainer__M2NjN" data-testid="description-container">
-  <div class="PlaceCard_nameContainer__ZjZmY" tabindex="0" role="link">
-    <div class="PlaceCard_nameContent__ODUwZ">
-      <span class="BpkText_bpk-text__MjhhY BpkText_bpk-text--heading-4__Y2FlY">Doha</span>
-    </div>
-    <span class="BpkText_bpk-text__MjhhY BpkText_bpk-text--heading-4__Y2FlY PlaceCard_subName__NTVkY">Qatar</span>
-  </div>
-  <span class="PlaceCard_advertLabel__YTM0N">Sunny days and the warmest welcome awaits</span>
-  <a class="BpkLink_bpk-link__MmQwY PlaceCard_descriptionLink__NzYwN" href="/flights/del/doha/" data-testid="flights-link">
-    <div class="PriceDescription_container__NjEzM">
-      <span class="BpkText_bpk-text--heading-5__MTRjZ">₹17,559</span>
-    </div>
-  </a>
-</div>
-</html>
-
-Generated Schema:
-{
-  "name": "Skyscanner Place Cards",
-  "baseSelector": "div[class^='PlaceCard_descriptionContainer__']",
-  "baseFields": [
-    {"name": "data_testid", "type": "attribute", "attribute": "data-testid"}
-  ],
-  "fields": [
-    {
-      "name": "city_name",
-      "selector": "div[class^='PlaceCard_nameContent__'] .BpkText_bpk-text--heading-4__",
-      "type": "text"
-    },
-    {
-      "name": "country_name",
-      "selector": "span[class*='PlaceCard_subName__']",
-      "type": "text"
-    },
-    {
-      "name": "description",
-      "selector": "span[class*='PlaceCard_advertLabel__']",
-      "type": "text"
-    },
-    {
-      "name": "flight_price",
-      "selector": "a[data-testid='flights-link'] .BpkText_bpk-text--heading-5__",
-      "type": "text"
-    },
-    {
-      "name": "flight_url",
-      "selector": "a[data-testid='flights-link']",
-      "type": "attribute",
-      "attribute": "href"
-    }
-  ]
-}
-</examples>
-
-
-<output_requirements>
-Your output must:
-1. Be valid JSON only
-2. Include no explanatory text
-3. Follow the exact schema structure provided
-4. Use appropriate field types
-5. Include all required fields
-6. Use valid CSS selectors
-</output_requirements>
-
-"""
-
-JSON_SCHEMA_BUILDER_XPATH = """
-# HTML Schema Generation Instructions
-You are a specialized model designed to analyze HTML patterns and generate extraction schemas. Your primary job is to create structured JSON schemas that can be used to extract data from HTML in a consistent and reliable way. When presented with HTML content, you must analyze its structure and generate a schema that captures all relevant data points.
-
-## Your Core Responsibilities:
-1. Analyze HTML structure to identify repeating patterns and important data points
-2. Generate valid JSON schemas following the specified format
-3. Create appropriate XPath selectors that will work reliably for data extraction
-4. Name fields meaningfully based on their content and purpose
-5. Handle both specific user requests and autonomous pattern detection
-
-## Available Schema Types You Can Generate:
-
-<schema_types>
-1. Basic Single-Level Schema
-  - Use for simple, flat data structures
-  - Example: Product cards, user profiles
-  - Direct field extractions
-
-2. Nested Object Schema
-  - Use for hierarchical data
-  - Example: Articles with author details
-  - Contains objects within objects
-
-3. List Schema
-  - Use for repeating elements
-  - Example: Comment sections, product lists
-  - Handles arrays of similar items
-
-4. Complex Nested Lists
-  - Use for multi-level data
-  - Example: Categories with subcategories
-  - Multiple levels of nesting
-
-5. Transformation Schema
-  - Use for data requiring processing
-  - Supports regex and text transformations
-  - Special attribute handling
-</schema_types>
-
-<schema_structure>
-Your output must always be a JSON object with this structure:
-{
- "name": "Descriptive name of the pattern",
- "baseSelector": "XPath selector for the repeating element",
- "fields": [
-   {
-     "name": "field_name",
-     "selector": "XPath selector",
-     "type": "text|attribute|nested|list|regex",
-     "attribute": "attribute_name",  // Optional
-     "transform": "transformation_type",  // Optional
-     "pattern": "regex_pattern",  // Optional
-     "fields": []  // For nested/list types
-   }
- ]
-}
-</schema_structure>
-
-<type_definitions>
-Available field types:
- text: Direct text extraction
- attribute: HTML attribute extraction
- nested: Object containing other fields
- list: Array of similar items
- regex: Pattern-based extraction
-</type_definitions>
-
-<behavior_rules>
-1. When given a specific query:
-  - Focus on extracting requested data points
-  - Use most specific selectors possible
-  - Include all fields mentioned in the query
-
-2. When no query is provided:
-  - Identify main content areas
-  - Extract all meaningful data points
-  - Use semantic structure to determine importance
-  - Include prices, dates, titles, and other common data types
-
-3. Always:
-  - Use reliable XPath selectors
-  - Handle dynamic element IDs appropriately
-  - Create descriptive field names
-  - Follow consistent naming conventions
-</behavior_rules>
-
-<examples>
-1. Basic Product Card Example:
-<html>
-<div class="product-card" data-cat-id="electronics" data-subcat-id="laptops">
- <h2 class="product-title">Gaming Laptop</h2>
- <span class="price">$999.99</span>
- <img src="laptop.jpg" alt="Gaming Laptop">
-</div>
-</html>
-
-Generated Schema:
-{
- "name": "Product Cards",
- "baseSelector": "//div[@class='product-card']",
- "baseFields": [
-   {"name": "data_cat_id", "type": "attribute", "attribute": "data-cat-id"},
-   {"name": "data_subcat_id", "type": "attribute", "attribute": "data-subcat-id"}
- ],
- "fields": [
-   {
-     "name": "title",
-     "selector": ".//h2[@class='product-title']",
-     "type": "text"
-   },
-   {
-     "name": "price",
-     "selector": ".//span[@class='price']",
-     "type": "text"
-   },
-   {
-     "name": "image_url",
-     "selector": ".//img",
-     "type": "attribute",
-     "attribute": "src"
-   }
- ]
-}
-
-2. Article with Author Details Example:
-<html>
-<article>
- <h1>The Future of AI</h1>
- <div class="author-info">
-   <span class="author-name">Dr. Smith</span>
-   <img src="author.jpg" alt="Dr. Smith">
- </div>
-</article>
-</html>
-
-Generated Schema:
-{
- "name": "Article Details",
- "baseSelector": "//article",
- "fields": [
-   {
-     "name": "title",
-     "selector": ".//h1",
-     "type": "text"
-   },
-   {
-     "name": "author",
-     "type": "nested",
-     "selector": ".//div[@class='author-info']",
-     "fields": [
-       {
-         "name": "name",
-         "selector": ".//span[@class='author-name']",
-         "type": "text"
-       },
-       {
-         "name": "avatar",
-         "selector": ".//img",
-         "type": "attribute",
-         "attribute": "src"
-       }
-     ]
-   }
- ]
-}
-
-3. Comments Section Example:
-<html>
-<div class="comments-container">
- <div class="comment" data-user-id="123">
-   <div class="user-name">John123</div>
-   <p class="comment-text">Great article!</p>
- </div>
- <div class="comment" data-user-id="456">
-   <div class="user-name">Alice456</div>
-   <p class="comment-text">Thanks for sharing.</p>
- </div>
-</div>
-</html>
-
-Generated Schema:
-{
- "name": "Comment Section",
- "baseSelector": "//div[@class='comments-container']",
- "fields": [
-   {
-     "name": "comments",
-     "type": "list",
-     "selector": ".//div[@class='comment']",
-     "baseFields": [
-       {"name": "data_user_id", "type": "attribute", "attribute": "data-user-id"}
-     ],
-     "fields": [
-       {
-         "name": "user",
-         "selector": ".//div[@class='user-name']",
-         "type": "text"
-       },
-       {
-         "name": "content",
-         "selector": ".//p[@class='comment-text']",
-         "type": "text"
-       }
-     ]
-   }
- ]
-}
-
-4. E-commerce Categories Example:
-<html>
-<div class="category-section" data-category="electronics">
- <h2>Electronics</h2>
- <div class="subcategory">
-   <h3>Laptops</h3>
-   <div class="product">
-     <span class="product-name">MacBook Pro</span>
-     <span class="price">$1299</span>
-   </div>
-   <div class="product">
-     <span class="product-name">Dell XPS</span>
-     <span class="price">$999</span>
-   </div>
- </div>
-</div>
-</html>
-
-Generated Schema:
-{
- "name": "E-commerce Categories",
- "baseSelector": "//div[@class='category-section']",
- "baseFields": [
-   {"name": "data_category", "type": "attribute", "attribute": "data-category"}
- ],
- "fields": [
-   {
-     "name": "category_name",
-     "selector": ".//h2",
-     "type": "text"
-   },
-   {
-     "name": "subcategories",
-     "type": "nested_list",
-     "selector": ".//div[@class='subcategory']",
-     "fields": [
-       {
-         "name": "name",
-         "selector": ".//h3",
-         "type": "text"
-       },
-       {
-         "name": "products",
-         "type": "list",
-         "selector": ".//div[@class='product']",
-         "fields": [
-           {
-             "name": "name",
-             "selector": ".//span[@class='product-name']",
-             "type": "text"
-           },
-           {
-             "name": "price",
-             "selector": ".//span[@class='price']",
-             "type": "text"
-           }
-         ]
-       }
-     ]
-   }
- ]
-}
-
-5. Job Listings with Transformations Example:
-<html>
-<div class="job-post">
- <h3 class="job-title">Senior Developer</h3>
- <span class="salary-text">Salary: $120,000/year</span>
- <span class="location">  New York, NY  </span>
-</div>
-</html>
-
-Generated Schema:
-{
- "name": "Job Listings",
- "baseSelector": "//div[@class='job-post']",
- "fields": [
-   {
-     "name": "title",
-     "selector": ".//h3[@class='job-title']",
-     "type": "text",
-     "transform": "uppercase"
-   },
-   {
-     "name": "salary",
-     "selector": ".//span[@class='salary-text']",
-     "type": "regex",
-     "pattern": "\\$([\\d,]+)"
-   },
-   {
-     "name": "location",
-     "selector": ".//span[@class='location']",
-     "type": "text",
-     "transform": "strip"
-   }
- ]
-}
-
-6. Skyscanner Place Card Example:
-<html>
-<div class="PlaceCard_descriptionContainer__M2NjN" data-testid="description-container">
- <div class="PlaceCard_nameContainer__ZjZmY" tabindex="0" role="link">
-   <div class="PlaceCard_nameContent__ODUwZ">
-     <span class="BpkText_bpk-text__MjhhY BpkText_bpk-text--heading-4__Y2FlY">Doha</span>
-   </div>
-   <span class="BpkText_bpk-text__MjhhY BpkText_bpk-text--heading-4__Y2FlY PlaceCard_subName__NTVkY">Qatar</span>
- </div>
- <span class="PlaceCard_advertLabel__YTM0N">Sunny days and the warmest welcome awaits</span>
- <a class="BpkLink_bpk-link__MmQwY PlaceCard_descriptionLink__NzYwN" href="/flights/del/doha/" data-testid="flights-link">
-   <div class="PriceDescription_container__NjEzM">
-     <span class="BpkText_bpk-text--heading-5__MTRjZ">₹17,559</span>
-   </div>
- </a>
-</div>
-</html>
-
-Generated Schema:
-{
- "name": "Skyscanner Place Cards",
- "baseSelector": "//div[contains(@class, 'PlaceCard_descriptionContainer__')]",
- "baseFields": [
-   {"name": "data_testid", "type": "attribute", "attribute": "data-testid"}
- ],
- "fields": [
-   {
-     "name": "city_name",
-     "selector": ".//div[contains(@class, 'PlaceCard_nameContent__')]//span[contains(@class, 'BpkText_bpk-text--heading-4__')]",
-     "type": "text"
-   },
-   {
-     "name": "country_name",
-     "selector": ".//span[contains(@class, 'PlaceCard_subName__')]",
-     "type": "text"
-   },
-   {
-     "name": "description",
-     "selector": ".//span[contains(@class, 'PlaceCard_advertLabel__')]",
-     "type": "text"
-   },
-   {
-     "name": "flight_price",
-     "selector": ".//a[@data-testid='flights-link']//span[contains(@class, 'BpkText_bpk-text--heading-5__')]",
-     "type": "text"
-   },
-   {
-     "name": "flight_url",
-     "selector": ".//a[@data-testid='flights-link']",
-     "type": "attribute",
-     "attribute": "href"
-   }
- ]
-}
-</examples>
-
-<output_requirements>
-Your output must:
-1. Be valid JSON only
-2. Include no explanatory text
-3. Follow the exact schema structure provided
-4. Use appropriate field types
-5. Include all required fields
-6. Use valid XPath selectors
-</output_requirements>
-"""
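Reviewer note: the two schema-builder prompts removed above describe fields with a `type`, an optional `pattern`, and an optional `transform`. A minimal sketch of how the regex and transform fields from example 5 could be evaluated once a selector has already matched some text is shown below. `apply_field` is a hypothetical helper written for illustration only; it is not crawl4ai's extraction code, and the choice to return capture group 1 for regex fields is an assumption.

```python
import re
from typing import Optional


def apply_field(text: str, field: dict) -> Optional[str]:
    """Hypothetical evaluator for one schema field, given the matched element's text."""
    value: Optional[str] = text
    if field["type"] == "regex":
        # Assumption: a regex field keeps the first capture group of the first match.
        match = re.search(field["pattern"], text)
        value = match.group(1) if match else None
    if value is None:
        return None
    transform = field.get("transform")
    if transform == "strip":
        value = value.strip()
    elif transform == "uppercase":
        value = value.upper()
    return value


# Field definitions copied from the "Job Listings" example in the removed prompt.
salary = apply_field(
    "Salary: $120,000/year",
    {"name": "salary", "type": "regex", "pattern": r"\$([\d,]+)"},
)
location = apply_field(
    "  New York, NY  ",
    {"name": "location", "type": "text", "transform": "strip"},
)
title = apply_field(
    "Senior Developer",
    {"name": "title", "type": "text", "transform": "uppercase"},
)
```

The JSON string `"\\$([\\d,]+)"` in the prompt decodes to the raw pattern used here.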
--- a/crawl4ai/scraper/__init__.py
+++ b/crawl4ai/scraper/__init__.py
@@ -0,0 +1,3 @@
+from .async_web_scraper import AsyncWebScraper
+from .bfs_scraper_strategy import BFSScraperStrategy
+from .filters import URLFilter, FilterChain, URLPatternFilter, ContentTypeFilter
--- a/crawl4ai/scraper/async_web_scraper.py
+++ b/crawl4ai/scraper/async_web_scraper.py
@@ -0,0 +1,123 @@
+from typing import Union, AsyncGenerator, Optional
+from .scraper_strategy import ScraperStrategy
+from .models import ScraperResult, CrawlResult
+from ..async_webcrawler import AsyncWebCrawler
+import logging
+from dataclasses import dataclass
+from contextlib import asynccontextmanager
+
+@dataclass
+class ScrapingProgress:
+    """Tracks the progress of a scraping operation."""
+    processed_urls: int = 0
+    failed_urls: int = 0
+    current_url: Optional[str] = None
+
+class AsyncWebScraper:
+    """
+    A high-level web scraper that combines an async crawler with a scraping strategy.
+    
+    Args:
+        crawler (AsyncWebCrawler): The async web crawler implementation
+        strategy (ScraperStrategy): The scraping strategy to use
+        logger (Optional[logging.Logger]): Custom logger for the scraper
+    """
+    
+    def __init__(
+        self, 
+        crawler: AsyncWebCrawler, 
+        strategy: ScraperStrategy,
+        logger: Optional[logging.Logger] = None
+    ):
+        if not isinstance(crawler, AsyncWebCrawler):
+            raise TypeError("crawler must be an instance of AsyncWebCrawler")
+        if not isinstance(strategy, ScraperStrategy):
+            raise TypeError("strategy must be an instance of ScraperStrategy")
+            
+        self.crawler = crawler
+        self.strategy = strategy
+        self.logger = logger or logging.getLogger(__name__)
+        self._progress = ScrapingProgress()
+
+    @property
+    def progress(self) -> ScrapingProgress:
+        """Get current scraping progress."""
+        return self._progress
+
+    @asynccontextmanager
+    async def _error_handling_context(self, url: str):
+        """Context manager for handling errors during scraping."""
+        try:
+            yield
+        except Exception as e:
+            self.logger.error(f"Error scraping {url}: {str(e)}")
+            self._progress.failed_urls += 1
+            raise
+
+    async def ascrape(
+        self, 
+        url: str, 
+        parallel_processing: bool = True, 
+        stream: bool = False
+    ) -> Union[AsyncGenerator[CrawlResult, None], ScraperResult]:
+        """
+        Scrape a website starting from the given URL.
+        
+        Args:
+            url: Starting URL for scraping
+            parallel_processing: Whether to process URLs in parallel
+            stream: If True, yield results as they come; if False, collect all results
+            
+        Returns:
+            Either an async generator yielding CrawlResults or a final ScraperResult
+        """
+        self._progress = ScrapingProgress()  # Reset progress
+        
+        async with self._error_handling_context(url):
+            if stream:
+                return self._ascrape_yielding(url, parallel_processing)
+            return await self._ascrape_collecting(url, parallel_processing)
+
+    async def _ascrape_yielding(
+        self, 
+        url: str, 
+        parallel_processing: bool
+    ) -> AsyncGenerator[CrawlResult, None]:
+        """Stream scraping results as they become available."""
+        try:
+            result_generator = self.strategy.ascrape(url, self.crawler, parallel_processing)
+            async for res in result_generator:
+                self._progress.processed_urls += 1
+                self._progress.current_url = res.url
+                yield res
+        except Exception as e:
+            self.logger.error(f"Error in streaming scrape: {str(e)}")
+            raise
+
+    async def _ascrape_collecting(
+        self, 
+        url: str, 
+        parallel_processing: bool
+    ) -> ScraperResult:
+        """Collect all scraping results before returning."""
+        extracted_data = {}
+        
+        try:
+            result_generator = self.strategy.ascrape(url, self.crawler, parallel_processing)
+            async for res in result_generator:
+                self._progress.processed_urls += 1
+                self._progress.current_url = res.url
+                extracted_data[res.url] = res
+                
+            return ScraperResult(
+                url=url,
+                crawled_urls=list(extracted_data.keys()),
+                extracted_data=extracted_data,
+                stats={
+                    'processed_urls': self._progress.processed_urls,
+                    'failed_urls': self._progress.failed_urls
+                }
+            )
+        except Exception as e:
+            self.logger.error(f"Error in collecting scrape: {str(e)}")
+            raise
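Reviewer note on `ascrape(stream=True)` in the file above: the method returns the async generator from inside `async with self._error_handling_context(url)`, so the context manager has already exited by the time the caller starts iterating. Errors raised during streaming therefore would not reach the `failed_urls` increment. The sketch below reproduces that shape with stand-in names (`Progress`, `error_context`, `results`, `start_stream`, `guarded_stream` are all hypothetical, not part of this commit) and contrasts it with a variant that keeps the context open for the whole iteration.

```python
import asyncio
from contextlib import asynccontextmanager
from dataclasses import dataclass


@dataclass
class Progress:
    failed: int = 0  # stands in for ScrapingProgress.failed_urls


@asynccontextmanager
async def error_context(progress: Progress):
    """Counts a failure if the wrapped block raises, then re-raises."""
    try:
        yield
    except Exception:
        progress.failed += 1
        raise


async def results():
    """Stand-in for a strategy's result stream: one item, then an error."""
    yield "page-1"
    raise RuntimeError("boom")


async def start_stream(progress: Progress):
    # Same shape as ascrape(stream=True): the generator is returned unstarted,
    # so the context manager exits before any item is produced.
    async with error_context(progress):
        return results()


async def guarded_stream(progress: Progress):
    # Alternative: iterate inside the context so streaming errors are counted.
    async with error_context(progress):
        async for item in results():
            yield item


async def drain(agen):
    """Collects items until the stream ends or raises RuntimeError."""
    items = []
    try:
        async for item in agen:
            items.append(item)
    except RuntimeError:
        pass
    return items


async def demo():
    unguarded, guarded = Progress(), Progress()
    first = await drain(await start_stream(unguarded))
    second = await drain(guarded_stream(guarded))
    return first, unguarded.failed, second, guarded.failed


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Both variants deliver `"page-1"` before the error, but only the guarded variant records the failure. One option for this commit would be to move the failure accounting into `_ascrape_yielding` itself.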
--- a/crawl4ai/scraper/bfs_scraper_strategy.py
+++ b/crawl4ai/scraper/bfs_scraper_strategy.py
@@ -0,0 +1,327 @@
+from abc import ABC, abstractmethod
+from typing import Union, AsyncGenerator, Optional, Dict, Set
+from dataclasses import dataclass
+from datetime import datetime
+import asyncio
+import logging
+from urllib.parse import urljoin, urlparse, urlunparse
+from urllib.robotparser import RobotFileParser
+import validators
+import time
+from aiolimiter import AsyncLimiter
+from tenacity import retry, stop_after_attempt, wait_exponential
+from collections import defaultdict
+
+from .models import ScraperResult, CrawlResult
+from .filters import FilterChain
+from .scorers import URLScorer
+from ..async_webcrawler import AsyncWebCrawler
+
+@dataclass
+class CrawlStats:
+    """Statistics for the crawling process"""
+    start_time: datetime
+    urls_processed: int = 0
+    urls_failed: int = 0
+    urls_skipped: int = 0
+    total_depth_reached: int = 0
+    current_depth: int = 0
+    robots_blocked: int = 0
+
+class ScraperStrategy(ABC):
+    """Base class for scraping strategies"""
+    
+    @abstractmethod
+    async def ascrape(
+        self, 
+        url: str, 
+        crawler: AsyncWebCrawler, 
+        parallel_processing: bool = True,
+        stream: bool = False
+    ) -> Union[AsyncGenerator[CrawlResult, None], ScraperResult]:
+        """Abstract method for scraping implementation"""
+        pass
+
+    @abstractmethod
+    async def can_process_url(self, url: str) -> bool:
+        """Check if URL can be processed based on strategy rules"""
+        pass
+
+    @abstractmethod
+    async def shutdown(self):
+        """Clean up resources used by the strategy"""
+        pass
+
+class BFSScraperStrategy(ScraperStrategy):
+    """Breadth-First Search scraping strategy with politeness controls"""
+
+    def __init__(
+        self,
+        max_depth: int,
+        filter_chain: FilterChain,
+        url_scorer: URLScorer,
+        max_concurrent: int = 5,
+        min_crawl_delay: int = 1,
+        timeout: int = 30,
+        logger: Optional[logging.Logger] = None
+    ):
+        self.max_depth = max_depth
+        self.filter_chain = filter_chain
+        self.url_scorer = url_scorer
+        self.max_concurrent = max_concurrent
+        self.min_crawl_delay = min_crawl_delay
+        self.timeout = timeout
+        self.logger = logger or logging.getLogger(__name__)
+        
+        # Crawl control
+        self.stats = CrawlStats(start_time=datetime.now())
+        self._cancel_event = asyncio.Event()
+        self.process_external_links = False
+        
+        # Rate limiting and politeness
+        self.rate_limiter = AsyncLimiter(1, 1)
+        self.last_crawl_time = defaultdict(float)
+        self.robot_parsers: Dict[str, RobotFileParser] = {}
+        self.domain_queues: Dict[str, asyncio.Queue] = defaultdict(asyncio.Queue)
+
+    async def can_process_url(self, url: str) -> bool:
+        """Check if URL can be processed based on robots.txt and filters
+        This is our gatekeeper method that determines if a URL should be processed. It:
+            - Validates URL format using the validators library
+            - Checks robots.txt permissions for the domain
+            - Applies custom filters from the filter chain
+            - Updates statistics for blocked URLs
+            - Returns False early if any check fails
+        """
+        if not validators.url(url):
+            self.logger.warning(f"Invalid URL: {url}")
+            return False
+
+        robot_parser = await self._get_robot_parser(url)
+        if robot_parser and not robot_parser.can_fetch("*", url):
+            self.stats.robots_blocked += 1
+            self.logger.info(f"Blocked by robots.txt: {url}")
+            return False
+
+        return self.filter_chain.apply(url)
+
+    async def _get_robot_parser(self, url: str) -> Optional[RobotFileParser]:
+        """Get or create robots.txt parser for domain.
+            This is our robots.txt manager that:
+                - Uses domain-level caching of robot parsers
+                - Creates and caches new parsers as needed
+                - Handles failed robots.txt fetches gracefully
+                - Returns None if robots.txt can't be fetched, allowing crawling to proceed        
+        """
+        domain = urlparse(url).netloc
+        if domain not in self.robot_parsers:
+            parser = RobotFileParser()
+            try:
+                robots_url = f"{urlparse(url).scheme}://{domain}/robots.txt"
+                parser.set_url(robots_url)
+                await asyncio.to_thread(parser.read)  # read() does blocking I/O; keep it off the event loop
+                self.robot_parsers[domain] = parser
+            except Exception as e:
+                self.logger.warning(f"Error fetching robots.txt for {domain}: {e}")
+                return None
+        return self.robot_parsers[domain]
+
+    @retry(stop=stop_after_attempt(3), 
+           wait=wait_exponential(multiplier=1, min=4, max=10))
+    async def _crawl_with_retry(
+        self, 
+        crawler: AsyncWebCrawler, 
+        url: str
+    ) -> CrawlResult:
+        """Crawl URL with retry logic"""
+        try:
+            async with asyncio.timeout(self.timeout):
+                return await crawler.arun(url)
+        except asyncio.TimeoutError:
+            self.logger.error(f"Timeout crawling {url}")
+            raise
+
+    async def process_url(
+        self,
+        url: str,
+        depth: int,
+        crawler: AsyncWebCrawler,
+        queue: asyncio.PriorityQueue,
+        visited: Set[str],
+        depths: Dict[str, int]
+    ) -> Optional[CrawlResult]:
+        """Process a single URL and extract links.
+        This is our main URL processing workhorse that:
+            - Checks for cancellation
+            - Validates URLs through can_process_url
+            - Implements politeness delays per domain
+            - Applies rate limiting
+            - Handles crawling with retries
+            - Updates various statistics
+            - Processes extracted links
+            - Returns the crawl result or None on failure
+        """
+        
+        if self._cancel_event.is_set():
+            return None
+            
+        if not await self.can_process_url(url):
+            self.stats.urls_skipped += 1
+            return None
+
+        # Politeness delay
+        domain = urlparse(url).netloc
+        time_since_last = time.time() - self.last_crawl_time[domain]
+        if time_since_last < self.min_crawl_delay:
+            await asyncio.sleep(self.min_crawl_delay - time_since_last)
+        self.last_crawl_time[domain] = time.time()
+
+        # Crawl with rate limiting
+        try:
+            async with self.rate_limiter:
+                result = await self._crawl_with_retry(crawler, url)
+                self.stats.urls_processed += 1
+        except Exception as e:
+            self.logger.error(f"Error crawling {url}: {e}")
+            self.stats.urls_failed += 1
+            return None
+
+        # Process links
+        await self._process_links(result, url, depth, queue, visited, depths)
+        
+        return result
+
+    async def _process_links(
+        self,
+        result: CrawlResult,
+        source_url: str,
+        depth: int,
+        queue: asyncio.PriorityQueue,
+        visited: Set[str],
+        depths: Dict[str, int]
+    ):
+        """Process extracted links from crawl result.
+        This is our link processor that:
+            - Handles internal links, plus external links when process_external_links is set
+            - Uses each link's href as-is (fragment normalization is currently commented out)
+            - Checks depth limits
+            - Scores URLs for priority
+            - Updates depth tracking
+            - Adds valid URLs to the queue
+            - Updates maximum depth statistics
+        """
+        link_types = ["internal"]
+        if self.process_external_links:
+            link_types.append("external")
+        for link_type in link_types:
+            for link in result.links[link_type]:
+                url = link['href']
+                # url = urljoin(source_url, link['href'])
+                # url = urlunparse(urlparse(url)._replace(fragment=""))
+                
+                if url not in visited and await self.can_process_url(url):
+                    new_depth = depths[source_url] + 1
+                    if new_depth <= self.max_depth:
+                        score = self.url_scorer.score(url)
+                        await queue.put((score, new_depth, url))
+                        depths[url] = new_depth
+                        self.stats.total_depth_reached = max(
+                            self.stats.total_depth_reached, 
+                            new_depth
+                        )
+
+    async def ascrape(
+        self,
+        start_url: str,
+        crawler: AsyncWebCrawler,
+        parallel_processing: bool = True
+    ) -> AsyncGenerator[CrawlResult, None]:
+        """Implement BFS crawling strategy"""
+        
+        # Initialize crawl state
+        """
+        queue: A priority queue where items are tuples of (score, depth, url)
+            Score: Determines crawling priority (asyncio.PriorityQueue pops the lowest score first)
+            Depth: Current distance from start_url
+            URL: The actual URL to crawl
+        visited: Keeps track of URLs we've already seen to avoid cycles
+        depths: Maps URLs to their depths from the start URL
+        pending_tasks: Tracks currently running crawl tasks        
+        """
+        queue = asyncio.PriorityQueue()
+        await queue.put((0, 0, start_url))
+        visited: Set[str] = set()
+        depths = {start_url: 0}
+        pending_tasks = set()
+        
+        try:
+            while (not queue.empty() or pending_tasks) and not self._cancel_event.is_set():
+                """
+                This sets up our main control loop which:
+                    - Continues while there are URLs to process (not queue.empty())
+                    - Or while there are tasks still running (pending_tasks)
+                    - Can be interrupted via cancellation (not self._cancel_event.is_set())
+                """
+                # Start new tasks up to max_concurrent
+                while not queue.empty() and len(pending_tasks) < self.max_concurrent:
+                    """
+                    This section manages task creation:
+                        - Checks if we can start more tasks (under max_concurrent limit)
+                        - Gets the next URL from the priority queue
+                        - Marks URLs as visited immediately to prevent duplicates
+                        - Updates current depth in stats
+                        - Either:
+                            - Creates a new async task (parallel mode)
+                            - Processes URL directly (sequential mode)
+                    """
+                    _, depth, url = await queue.get()
+                    if url not in visited:
+                        visited.add(url)
+                        self.stats.current_depth = depth
+                        
+                        if parallel_processing:
+                            task = asyncio.create_task(
+                                self.process_url(url, depth, crawler, queue, visited, depths)
+                            )
+                            pending_tasks.add(task)
+                        else:
+                            result = await self.process_url(
+                                url, depth, crawler, queue, visited, depths
+                            )
+                            if result:
+                                yield result
+
+                # Process completed tasks
+                """
+                This section manages completed tasks:
+                    - Waits for any task to complete using asyncio.wait
+                    - Uses FIRST_COMPLETED to handle results as soon as they're ready
+                    - Yields successful results to the caller
+                    - Updates pending_tasks to remove completed ones
+                """
+                if pending_tasks:
+                    done, pending_tasks = await asyncio.wait(
+                        pending_tasks,
+                        return_when=asyncio.FIRST_COMPLETED
+                    )
+                    for task in done:
+                        result = await task
+                        if result:
+                            yield result
+                            
+        except Exception as e:
+            self.logger.error(f"Error in crawl process: {e}")
+            raise
+            
+        finally:
+            # Clean up any remaining tasks
+            for task in pending_tasks:
+                task.cancel()
+            self.stats.end_time = datetime.now()
+
+    async def shutdown(self):
+        """Clean up resources and stop crawling"""
+        self._cancel_event.set()
+        # Clear caches and close connections
+        self.robot_parsers.clear()
+        self.domain_queues.clear()
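Review note on queue ordering in `ascrape`: `asyncio.PriorityQueue` pops the smallest tuple first, while the scorers added in `scorers.py` appear to return larger values for more desirable URLs. If higher-is-better is the intended convention (an assumption, not something the diff states), one option is to negate the score when enqueuing. The sketch below is standalone; `enqueue`, `drain`, and `crawl_order` are hypothetical helper names, not part of this change.

```python
import asyncio
from typing import List, Tuple


async def enqueue(queue: asyncio.PriorityQueue, score: float, depth: int, url: str) -> None:
    # Assumption: higher score = more desirable. Negating it makes the
    # min-first PriorityQueue hand back the best-scoring URL first.
    await queue.put((-score, depth, url))


async def drain(items: List[Tuple[float, int, str]]) -> List[str]:
    # Enqueue every (score, depth, url) item, then pop until empty
    # and record the order in which URLs would be crawled.
    queue: asyncio.PriorityQueue = asyncio.PriorityQueue()
    for score, depth, url in items:
        await enqueue(queue, score, depth, url)
    order = []
    while not queue.empty():
        _, _, url = await queue.get()
        order.append(url)
    return order


def crawl_order(items: List[Tuple[float, int, str]]) -> List[str]:
    # Synchronous wrapper so the sketch can be run directly.
    return asyncio.run(drain(items))
```

With equal scores the tuple comparison falls through to depth and then URL, so shallower pages still come out first among ties.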
--- a/crawl4ai/scraper/filters.py
+++ b/crawl4ai/scraper/filters.py
@@ -0,0 +1,205 @@
+# from .url_filter import URLFilter, FilterChain
+# from .content_type_filter import ContentTypeFilter
+# from .url_pattern_filter import URLPatternFilter
+
+from abc import ABC, abstractmethod
+from typing import List, Pattern, Set, Union
+import re
+from urllib.parse import urlparse
+import mimetypes
+import logging
+from dataclasses import dataclass
+import fnmatch
+
+@dataclass
+class FilterStats:
+    """Statistics for filter applications"""
+    total_urls: int = 0
+    rejected_urls: int = 0
+    passed_urls: int = 0
+
+class URLFilter(ABC):
+    """Base class for URL filters"""
+    
+    def __init__(self, name: str = None):
+        self.name = name or self.__class__.__name__
+        self.stats = FilterStats()
+        self.logger = logging.getLogger(f"urlfilter.{self.name}")
+
+    @abstractmethod
+    def apply(self, url: str) -> bool:
+        """Apply the filter to a URL"""
+        pass
+
+    def _update_stats(self, passed: bool):
+        """Update filter statistics"""
+        self.stats.total_urls += 1
+        if passed:
+            self.stats.passed_urls += 1
+        else:
+            self.stats.rejected_urls += 1
+
+class FilterChain:
+    """Chain of URL filters."""
+    
+    def __init__(self, filters: List[URLFilter] = None):
+        self.filters = filters or []
+        self.stats = FilterStats()
+        self.logger = logging.getLogger("urlfilter.chain")
+
+    def add_filter(self, filter_: URLFilter) -> 'FilterChain':
+        """Add a filter to the chain"""
+        self.filters.append(filter_)
+        return self  # Enable method chaining
+
+    def apply(self, url: str) -> bool:
+        """Apply all filters in the chain"""
+        self.stats.total_urls += 1
+        
+        for filter_ in self.filters:
+            if not filter_.apply(url):
+                self.stats.rejected_urls += 1
+                self.logger.debug(f"URL {url} rejected by {filter_.name}")
+                return False
+        
+        self.stats.passed_urls += 1
+        return True
+
+class URLPatternFilter(URLFilter):
+    """Filter URLs based on glob patterns or regex.
+    
+    pattern_filter = URLPatternFilter([
+        "*.example.com/*",  # Glob pattern
+        "*/article/*",      # Path pattern
+        re.compile(r"blog-\d+") # Regex pattern
+    ])
+
+    - Supports glob patterns and regex
+    - Multiple patterns per filter
+    - Pattern pre-compilation for performance    
+    """
+    
+    def __init__(self, patterns: Union[str, Pattern, List[Union[str, Pattern]]], 
+                 use_glob: bool = True):
+        super().__init__()
+        self.patterns = [patterns] if isinstance(patterns, (str, Pattern)) else patterns
+        self.use_glob = use_glob
+        self._compiled_patterns = []
+        
+        for pattern in self.patterns:
+            if isinstance(pattern, str) and use_glob:
+                self._compiled_patterns.append(self._glob_to_regex(pattern))
+            else:
+                self._compiled_patterns.append(re.compile(pattern) if isinstance(pattern, str) else pattern)
+
+    def _glob_to_regex(self, pattern: str) -> Pattern:
+        """Convert glob pattern to regex"""
+        return re.compile(fnmatch.translate(pattern))
+
+    def apply(self, url: str) -> bool:
+        """Check if URL matches any of the patterns"""
+        matches = any(pattern.search(url) for pattern in self._compiled_patterns)
+        self._update_stats(matches)
+        return matches
+
+class ContentTypeFilter(URLFilter):
+    """Filter URLs based on expected content type.
+    
+    content_filter = ContentTypeFilter([
+        "text/html",
+        "application/pdf"
+    ], check_extension=True)
+
+    - Filter by MIME types
+    - Extension checking
+    - Support for multiple content types
+    """
+    
+    def __init__(self, allowed_types: Union[str, List[str]], 
+                 check_extension: bool = True):
+        super().__init__()
+        self.allowed_types = [allowed_types] if isinstance(allowed_types, str) else allowed_types
+        self.check_extension = check_extension
+        self._normalize_types()
+
+    def _normalize_types(self):
+        """Normalize content type strings"""
+        self.allowed_types = [t.lower() for t in self.allowed_types]
+
+    def _check_extension(self, url: str) -> bool:
+        """Check URL's file extension"""
+        ext = urlparse(url).path.split('.')[-1].lower() if '.' in urlparse(url).path else ''
+        if not ext:
+            return True  # No extension, might be dynamic content
+            
+        guessed_type = mimetypes.guess_type(urlparse(url).path)[0]
+        return any(allowed in (guessed_type or '').lower() for allowed in self.allowed_types)
+
+    def apply(self, url: str) -> bool:
+        """Check if URL's content type is allowed"""
+        result = True
+        if self.check_extension:
+            result = self._check_extension(url)
+        self._update_stats(result)
+        return result
+
+class DomainFilter(URLFilter):
+    """Filter URLs based on allowed/blocked domains.
+    
+    domain_filter = DomainFilter(
+        allowed_domains=["example.com", "blog.example.com"],
+        blocked_domains=["ads.example.com"]
+    )
+
+    - Allow/block specific domains
+    - Subdomain support
+    - Efficient domain matching
+    """
+    
+    def __init__(self, allowed_domains: Union[str, List[str]] = None, 
+                 blocked_domains: Union[str, List[str]] = None):
+        super().__init__()
+        self.allowed_domains = set(self._normalize_domains(allowed_domains)) if allowed_domains else None
+        self.blocked_domains = set(self._normalize_domains(blocked_domains)) if blocked_domains else set()
+
+    def _normalize_domains(self, domains: Union[str, List[str]]) -> List[str]:
+        """Normalize domain strings"""
+        if isinstance(domains, str):
+            domains = [domains]
+        return [d.lower().strip() for d in domains]
+
+    def _extract_domain(self, url: str) -> str:
+        """Extract domain from URL"""
+        return urlparse(url).netloc.lower()
+
+    def apply(self, url: str) -> bool:
+        """Check if URL's domain is allowed"""
+        domain = self._extract_domain(url)
+        
+        if any(fnmatch.fnmatchcase(domain, p) for p in self.blocked_domains):
+            self._update_stats(False)
+            return False
+            
+        if self.allowed_domains is not None and not any(fnmatch.fnmatchcase(domain, p) for p in self.allowed_domains):
+            self._update_stats(False)
+            return False
+            
+        self._update_stats(True)
+        return True
+
+# Example usage:
+def create_common_filter_chain() -> FilterChain:
+    """Create a commonly used filter chain"""
+    return FilterChain([
+        URLPatternFilter([
+            "*.html", "*.htm",  # HTML files
+            "*/article/*", "*/blog/*"  # Common content paths
+        ]),
+        ContentTypeFilter([
+            "text/html",
+            "application/xhtml+xml"
+        ]),
+        DomainFilter(
+            blocked_domains=["ads.*", "analytics.*"]
+        )
+    ])
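Review note on `DomainFilter`: `create_common_filter_chain` passes `"ads.*"` and `"analytics.*"` as blocked domains, which suggests glob semantics for domain entries. Below is a minimal standalone sketch of glob-based domain matching with the standard-library `fnmatch` module, offered as one possible approach. `domain_allowed` is a hypothetical helper, and using `hostname` to drop any port is an assumption beyond what the diff does.

```python
from fnmatch import fnmatchcase
from typing import Iterable, Optional
from urllib.parse import urlparse


def domain_allowed(
    url: str,
    allowed: Optional[Iterable[str]] = None,
    blocked: Iterable[str] = (),
) -> bool:
    # urlparse(...).hostname is lower-cased and excludes the port.
    host = (urlparse(url).hostname or "").lower()

    # A blocked pattern always wins over an allowed one.
    if any(fnmatchcase(host, pattern.lower()) for pattern in blocked):
        return False

    # With an allow-list, the host must match at least one pattern.
    if allowed is not None:
        return any(fnmatchcase(host, pattern.lower()) for pattern in allowed)

    return True
```

Plain domain names still work as exact matches, because a pattern without wildcards only matches itself.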
--- a/crawl4ai/scraper/models.py
+++ b/crawl4ai/scraper/models.py
@@ -0,0 +1,8 @@
+from pydantic import BaseModel
+from typing import List, Dict
+from ..models import CrawlResult
+
+class ScraperResult(BaseModel):
+    url: str
+    crawled_urls: List[str]
+    extracted_data: Dict[str, CrawlResult]
--- a/crawl4ai/scraper/scorers.py
+++ b/crawl4ai/scraper/scorers.py
@@ -0,0 +1,268 @@
+# from .url_scorer import URLScorer
+# from .keyword_relevance_scorer import KeywordRelevanceScorer
+
+from abc import ABC, abstractmethod
+from typing import List, Dict, Optional, Union
+from dataclasses import dataclass
+from urllib.parse import urlparse, unquote
+import re
+from collections import defaultdict
+import math
+import logging
+
+@dataclass
+class ScoringStats:
+    """Statistics for URL scoring"""
+    urls_scored: int = 0
+    total_score: float = 0.0
+    min_score: float = float('inf')
+    max_score: float = float('-inf')
+    
+    def update(self, score: float):
+        """Update scoring statistics"""
+        self.urls_scored += 1
+        self.total_score += score
+        self.min_score = min(self.min_score, score)
+        self.max_score = max(self.max_score, score)
+    
+    @property
+    def average_score(self) -> float:
+        """Calculate average score"""
+        return self.total_score / self.urls_scored if self.urls_scored > 0 else 0.0
+
+class URLScorer(ABC):
+    """Base class for URL scoring strategies"""
+    
+    def __init__(self, weight: float = 1.0, name: str = None):
+        self.weight = weight
+        self.name = name or self.__class__.__name__
+        self.stats = ScoringStats()
+        self.logger = logging.getLogger(f"urlscorer.{self.name}")
+
+    @abstractmethod
+    def _calculate_score(self, url: str) -> float:
+        """Calculate the raw score for a URL"""
+        pass
+
+    def score(self, url: str) -> float:
+        """Calculate the weighted score for a URL"""
+        raw_score = self._calculate_score(url)
+        weighted_score = raw_score * self.weight
+        self.stats.update(weighted_score)
+        return weighted_score
+
+class CompositeScorer(URLScorer):
+    """Combines multiple scorers with weights"""
+    
+    def __init__(self, scorers: List[URLScorer], normalize: bool = True):
+        super().__init__(name="CompositeScorer")
+        self.scorers = scorers
+        self.normalize = normalize
+
+    def _calculate_score(self, url: str) -> float:
+        scores = [scorer.score(url) for scorer in self.scorers]
+        total_score = sum(scores)
+        
+        if self.normalize and scores:
+            total_score /= len(scores)
+            
+        return total_score
+
+class KeywordRelevanceScorer(URLScorer):
+    """Score URLs based on keyword relevance.
+
+    keyword_scorer = KeywordRelevanceScorer(
+        keywords=["python", "programming"],
+        weight=1.0,
+        case_sensitive=False
+    )
+
+    - Score based on keyword matches
+    - Case sensitivity options
+    - Weighted scoring
+    """
+    
+    def __init__(self, keywords: List[str], weight: float = 1.0,
+                 case_sensitive: bool = False):
+        super().__init__(weight=weight)
+        self.keywords = keywords
+        self.case_sensitive = case_sensitive
+        self._compile_keywords()
+
+    def _compile_keywords(self):
+        """Prepare keywords for matching"""
+        flags = 0 if self.case_sensitive else re.IGNORECASE
+        self.patterns = [re.compile(re.escape(k), flags) for k in self.keywords]
+
+    def _calculate_score(self, url: str) -> float:
+        """Calculate score based on keyword matches"""
+        decoded_url = unquote(url)
+        total_matches = sum(
+            1 for pattern in self.patterns
+            if pattern.search(decoded_url)
+        )
+        # Normalize score between 0 and 1
+        return total_matches / len(self.patterns) if self.patterns else 0.0
+
+class PathDepthScorer(URLScorer):
+    """Score URLs based on their path depth.
+        
+    path_scorer = PathDepthScorer(
+        optimal_depth=3,  # Preferred URL depth
+        weight=0.7
+    )
+
+    - Score based on URL path depth
+    - Configurable optimal depth
+    - Diminishing returns for deeper paths
+    """
+    
+    def __init__(self, optimal_depth: int = 3, weight: float = 1.0):
+        super().__init__(weight=weight)
+        self.optimal_depth = optimal_depth
+
+    def _calculate_score(self, url: str) -> float:
+        """Calculate score based on path depth"""
+        path = urlparse(url).path
+        depth = len([x for x in path.split('/') if x])
+        
+        # Score decreases as we move away from optimal depth
+        distance_from_optimal = abs(depth - self.optimal_depth)
+        return 1.0 / (1.0 + distance_from_optimal)
+
+class ContentTypeScorer(URLScorer):
+    """Score URLs based on content type preferences.
+    
+    content_scorer = ContentTypeScorer({
+        r'\.html$': 1.0,
+        r'\.pdf$': 0.8,
+        r'\.xml$': 0.6
+    })
+
+    - Score based on file types
+    - Configurable type weights
+    - Pattern matching support
+    """
+    
+    def __init__(self, type_weights: Dict[str, float], weight: float = 1.0):
+        super().__init__(weight=weight)
+        self.type_weights = type_weights
+        self._compile_patterns()
+
+    def _compile_patterns(self):
+        """Prepare content type patterns"""
+        self.patterns = {
+            re.compile(pattern): weight
+            for pattern, weight in self.type_weights.items()
+        }
+
+    def _calculate_score(self, url: str) -> float:
+        """Calculate score based on content type matching"""
+        for pattern, weight in self.patterns.items():
+            if pattern.search(url):
+                return weight
+        return 0.0
+
+class FreshnessScorer(URLScorer):
+    """Score URLs based on freshness indicators.
+    
+    freshness_scorer = FreshnessScorer(weight=0.9)
+
+    - Score based on date indicators in URLs
+    - Multiple date format support
+    - Recency weighting"""
+    
+    def __init__(self, weight: float = 1.0):
+        super().__init__(weight=weight)
+        self.date_patterns = [
+            r'/(\d{4})/(\d{2})/(\d{2})/',  # yyyy/mm/dd
+            r'(\d{4})[-_](\d{2})[-_](\d{2})',  # yyyy-mm-dd
+            r'/(\d{4})/',  # year only
+        ]
+        self._compile_patterns()
+
+    def _compile_patterns(self):
+        """Prepare date patterns"""
+        self.compiled_patterns = [re.compile(p) for p in self.date_patterns]
+
+    def _calculate_score(self, url: str) -> float:
+        """Calculate score based on date indicators"""
+        for pattern in self.compiled_patterns:
+            if match := pattern.search(url):
+                year = int(match.group(1))
+                # Score higher for more recent years (2024 is a hard-coded reference), clamped to [0, 1]
+                return max(0.0, min(1.0, 1.0 - (2024 - year) * 0.1))
+        return 0.5  # Default score for URLs without dates
+
+class DomainAuthorityScorer(URLScorer):
+    """Score URLs based on domain authority.
+
+    authority_scorer = DomainAuthorityScorer({
+        "python.org": 1.0,
+        "github.com": 0.9,
+        "medium.com": 0.7
+    })
+
+    - Score based on domain importance
+    - Configurable domain weights
+    - Default weight for unknown domains"""
+    
+    def __init__(self, domain_weights: Dict[str, float], 
+                 default_weight: float = 0.5, weight: float = 1.0):
+        super().__init__(weight=weight)
+        self.domain_weights = domain_weights
+        self.default_weight = default_weight
+
+    def _calculate_score(self, url: str) -> float:
+        """Calculate score based on domain authority"""
+        domain = urlparse(url).netloc.lower()
+        return self.domain_weights.get(domain, self.default_weight)
+
+def create_balanced_scorer() -> CompositeScorer:
+    """Create a balanced composite scorer"""
+    return CompositeScorer([
+        KeywordRelevanceScorer(
+            keywords=["article", "blog", "news", "research"],
+            weight=1.0
+        ),
+        PathDepthScorer(
+            optimal_depth=3,
+            weight=0.7
+        ),
+        ContentTypeScorer(
+            type_weights={
+                r'\.html?$': 1.0,
+                r'\.pdf$': 0.8,
+                r'\.xml$': 0.6
+            },
+            weight=0.8
+        ),
+        FreshnessScorer(
+            weight=0.9
+        )
+    ])
+
+# Example Usage:
+"""
+# Create a composite scorer
+scorer = CompositeScorer([
+    KeywordRelevanceScorer(["python", "programming"], weight=1.0),
+    PathDepthScorer(optimal_depth=2, weight=0.7),
+    FreshnessScorer(weight=0.8),
+    DomainAuthorityScorer(
+        domain_weights={
+            "python.org": 1.0,
+            "github.com": 0.9,
+            "medium.com": 0.7
+        },
+        weight=0.9
+    )
+])
+
+# Score a URL
+score = scorer.score("https://python.org/article/2024/01/new-features")
+
+# Access statistics
+print(f"Average score: {scorer.stats.average_score}")
+print(f"URLs scored: {scorer.stats.urls_scored}")
+"""
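Review note on how the composite score is assembled: each scorer's raw score is multiplied by its weight, and with `normalize=True` the weighted scores are averaged. The sketch below reproduces that arithmetic with plain functions for the example URL used above. `keyword_score`, `path_depth_score`, and `composite_score` are hypothetical stand-ins that mirror the formulas in this file; they are not the classes themselves.

```python
import math
from typing import Callable, List, Tuple
from urllib.parse import unquote, urlparse


def keyword_score(url: str, keywords: List[str]) -> float:
    # Fraction of keywords that occur in the decoded URL (case-insensitive).
    if not keywords:
        return 0.0
    decoded = unquote(url).lower()
    return sum(1 for k in keywords if k.lower() in decoded) / len(keywords)


def path_depth_score(url: str, optimal_depth: int) -> float:
    # 1 / (1 + distance from the preferred number of path segments).
    depth = len([seg for seg in urlparse(url).path.split("/") if seg])
    return 1.0 / (1.0 + abs(depth - optimal_depth))


def composite_score(
    url: str,
    weighted_scorers: List[Tuple[Callable[[str], float], float]],
    normalize: bool = True,
) -> float:
    # Weight each raw score, sum, and optionally average over the scorers.
    scores = [weight * fn(url) for fn, weight in weighted_scorers]
    total = sum(scores)
    if normalize and scores:
        total /= len(scores)
    return total


URL = "https://python.org/article/2024/01/new-features"
scorers = [
    (lambda u: keyword_score(u, ["python", "programming"]), 1.0),
    (lambda u: path_depth_score(u, optimal_depth=2), 0.7),
]
```

For this URL one of two keywords matches (0.5) and the path has four segments against an optimal depth of two (1/3), so the normalized composite is (1.0 * 0.5 + 0.7 * 1/3) / 2.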
--- a/crawl4ai/scraper/scraper_strategy.py
+++ b/crawl4ai/scraper/scraper_strategy.py
@@ -0,0 +1,26 @@
+from abc import ABC, abstractmethod
+from .models import ScraperResult
+from ..models import CrawlResult
+from ..async_webcrawler import AsyncWebCrawler
+from typing import Union, AsyncGenerator
+
+class ScraperStrategy(ABC):
+    @abstractmethod
+    async def ascrape(self, url: str, crawler: AsyncWebCrawler, parallel_processing: bool = True, stream: bool = False) -> Union[AsyncGenerator[CrawlResult, None], ScraperResult]:
+        """Scrape the given URL using the specified crawler.
+
+        Args:
+            url (str): The starting URL for the scrape.
+            crawler (AsyncWebCrawler): The web crawler instance.
+            parallel_processing (bool): Whether to use parallel processing. Defaults to True.
+            stream (bool): If True, yields individual crawl results as they are ready;
+                if False, accumulates results and returns a final ScraperResult.
+
+        Yields:
+            CrawlResult: Individual crawl results if stream is True.
+
+        Returns:
+            ScraperResult: A summary of the scrape results containing the final extracted data 
+            and the list of crawled URLs if stream is False.
+        """
+        pass
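Review note on the `ascrape` return type: in Python, an `async def` whose body contains `yield` is always an async generator and cannot `return` a value, so a single method body cannot both stream `CrawlResult` objects and return a `ScraperResult`. One way to serve both modes is to keep a streaming generator and wrap it in a collecting coroutine. The sketch below uses hypothetical names (`stream_results`, `collect_results`, `run_collect`) and plain dicts as stand-ins for `CrawlResult` and `ScraperResult`.

```python
import asyncio
from typing import Any, AsyncGenerator, Dict, List


async def stream_results(urls: List[str]) -> AsyncGenerator[Dict[str, str], None]:
    # Streaming mode: yield each (fake) crawl result as soon as it is ready.
    for url in urls:
        await asyncio.sleep(0)  # stand-in for the real crawl
        yield {"url": url, "html": f"<html>{url}</html>"}


async def collect_results(urls: List[str]) -> Dict[str, Any]:
    # Collecting mode: consume the stream and return one summary object.
    extracted: Dict[str, Dict[str, str]] = {}
    async for result in stream_results(urls):
        extracted[result["url"]] = result
    return {"crawled_urls": list(extracted), "extracted_data": extracted}


def run_collect(urls: List[str]) -> Dict[str, Any]:
    # Synchronous wrapper so the sketch can be run directly.
    return asyncio.run(collect_results(urls))
```

A caller that wants streaming iterates `stream_results` with `async for`; a caller that wants the summary awaits `collect_results`.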
--- a/crawl4ai/ssl_certificate.py
+++ b/crawl4ai/ssl_certificate.py
@@ -1,184 +0,0 @@
-"""SSL Certificate class for handling certificate operations."""
-
-import ssl
-import socket
-import base64
-import json
-from typing import Dict, Any, Optional
-from urllib.parse import urlparse
-import OpenSSL.crypto
-from pathlib import Path
-
-
-class SSLCertificate:
-    """
-    A class representing an SSL certificate with methods to export in various formats.
-
-    Attributes:
-        cert_info (Dict[str, Any]): The certificate information.
-
-        Methods:
-            from_url(url: str, timeout: int = 10) -> Optional['SSLCertificate']: Create SSLCertificate instance from a URL.
-            from_file(file_path: str) -> Optional['SSLCertificate']: Create SSLCertificate instance from a file.
-            from_binary(binary_data: bytes) -> Optional['SSLCertificate']: Create SSLCertificate instance from binary data.
-            export_as_pem() -> str: Export the certificate as PEM format.
-            export_as_der() -> bytes: Export the certificate as DER format.
-            export_as_json() -> Dict[str, Any]: Export the certificate as JSON format.
-            export_as_text() -> str: Export the certificate as text format.
-    """
-
-    def __init__(self, cert_info: Dict[str, Any]):
-        self._cert_info = self._decode_cert_data(cert_info)
-
-    @staticmethod
-    def from_url(url: str, timeout: int = 10) -> Optional["SSLCertificate"]:
-        """
-        Create SSLCertificate instance from a URL.
-
-        Args:
-            url (str): URL of the website.
-            timeout (int): Timeout for the connection (default: 10).
-
-        Returns:
-            Optional[SSLCertificate]: SSLCertificate instance if successful, None otherwise.
-        """
-        try:
-            hostname = urlparse(url).netloc
-            if ":" in hostname:
-                hostname = hostname.split(":")[0]
-
-            context = ssl.create_default_context()
-            with socket.create_connection((hostname, 443), timeout=timeout) as sock:
-                with context.wrap_socket(sock, server_hostname=hostname) as ssock:
-                    cert_binary = ssock.getpeercert(binary_form=True)
-                    x509 = OpenSSL.crypto.load_certificate(
-                        OpenSSL.crypto.FILETYPE_ASN1, cert_binary
-                    )
-
-                    cert_info = {
-                        "subject": dict(x509.get_subject().get_components()),
-                        "issuer": dict(x509.get_issuer().get_components()),
-                        "version": x509.get_version(),
-                        "serial_number": hex(x509.get_serial_number()),
-                        "not_before": x509.get_notBefore(),
-                        "not_after": x509.get_notAfter(),
-                        "fingerprint": x509.digest("sha256").hex(),
-                        "signature_algorithm": x509.get_signature_algorithm(),
-                        "raw_cert": base64.b64encode(cert_binary),
-                    }
-
-                    # Add extensions
-                    extensions = []
-                    for i in range(x509.get_extension_count()):
-                        ext = x509.get_extension(i)
-                        extensions.append(
-                            {"name": ext.get_short_name(), "value": str(ext)}
-                        )
-                    cert_info["extensions"] = extensions
-
-                    return SSLCertificate(cert_info)
-
-        except Exception:
-            return None
-
-    @staticmethod
-    def _decode_cert_data(data: Any) -> Any:
-        """Helper method to decode bytes in certificate data."""
-        if isinstance(data, bytes):
-            return data.decode("utf-8")
-        elif isinstance(data, dict):
-            return {
-                (
-                    k.decode("utf-8") if isinstance(k, bytes) else k
-                ): SSLCertificate._decode_cert_data(v)
-                for k, v in data.items()
-            }
-        elif isinstance(data, list):
-            return [SSLCertificate._decode_cert_data(item) for item in data]
-        return data
-
-    def to_json(self, filepath: Optional[str] = None) -> Optional[str]:
-        """
-        Export certificate as JSON.
-
-        Args:
-            filepath (Optional[str]): Path to save the JSON file (default: None).
-
-        Returns:
-            Optional[str]: JSON string if successful, None otherwise.
-        """
-        json_str = json.dumps(self._cert_info, indent=2, ensure_ascii=False)
-        if filepath:
-            Path(filepath).write_text(json_str, encoding="utf-8")
-            return None
-        return json_str
-
-    def to_pem(self, filepath: Optional[str] = None) -> Optional[str]:
-        """
-        Export certificate as PEM.
-
-        Args:
-            filepath (Optional[str]): Path to save the PEM file (default: None).
-
-        Returns:
-            Optional[str]: PEM string if successful, None otherwise.
-        """
-        try:
-            x509 = OpenSSL.crypto.load_certificate(
-                OpenSSL.crypto.FILETYPE_ASN1,
-                base64.b64decode(self._cert_info["raw_cert"]),
-            )
-            pem_data = OpenSSL.crypto.dump_certificate(
-                OpenSSL.crypto.FILETYPE_PEM, x509
-            ).decode("utf-8")
-
-            if filepath:
-                Path(filepath).write_text(pem_data, encoding="utf-8")
-                return None
-            return pem_data
-        except Exception:
-            return None
-
-    def to_der(self, filepath: Optional[str] = None) -> Optional[bytes]:
-        """
-        Export certificate as DER.
-
-        Args:
-            filepath (Optional[str]): Path to save the DER file (default: None).
-
-        Returns:
-            Optional[bytes]: DER bytes if successful, None otherwise.
-        """
-        try:
-            der_data = base64.b64decode(self._cert_info["raw_cert"])
-            if filepath:
-                Path(filepath).write_bytes(der_data)
-                return None
-            return der_data
-        except Exception:
-            return None
-
-    @property
-    def issuer(self) -> Dict[str, str]:
-        """Get certificate issuer information."""
-        return self._cert_info.get("issuer", {})
-
-    @property
-    def subject(self) -> Dict[str, str]:
-        """Get certificate subject information."""
-        return self._cert_info.get("subject", {})
-
-    @property
-    def valid_from(self) -> str:
-        """Get certificate validity start date."""
-        return self._cert_info.get("not_before", "")
-
-    @property
-    def valid_until(self) -> str:
-        """Get certificate validity end date."""
-        return self._cert_info.get("not_after", "")
-
-    @property
-    def fingerprint(self) -> str:
-        """Get certificate fingerprint."""
-        return self._cert_info.get("fingerprint", "")
--- a/crawl4ai/train.py
+++ b/crawl4ai/train.py
@@ -0,0 +1,146 @@
+import spacy
+from spacy.training import Example
+import random
+import nltk
+from nltk.corpus import reuters
+import torch
+
+def save_spacy_model_as_torch(nlp, model_dir="models/reuters"):
+    # Extract the TextCategorizer component
+    textcat = nlp.get_pipe("textcat_multilabel")
+
+    # Convert the weights to a PyTorch state dictionary
+    state_dict = {name: torch.tensor(param.data) for name, param in textcat.model.named_parameters()}
+
+    # Save the state dictionary
+    torch.save(state_dict, f"{model_dir}/model_weights.pth")
+
+    # Extract and save the vocabulary
+    vocab = extract_vocab(nlp)
+    with open(f"{model_dir}/vocab.txt", "w") as vocab_file:
+        for word, idx in vocab.items():
+            vocab_file.write(f"{word}\t{idx}\n")
+    
+    print(f"Model weights and vocabulary saved to: {model_dir}")
+
+def extract_vocab(nlp):
+    # Extract vocabulary (string -> index) from the spaCy model's string store
+    vocab = {word: i for i, word in enumerate(nlp.vocab.strings)}
+    return vocab
+
+nlp = spacy.load("models/reuters")
+save_spacy_model_as_torch(nlp, model_dir="models")
+
+def train_and_save_reuters_model(model_dir="models/reuters"):
+    # Ensure the Reuters corpus is downloaded
+    nltk.download('reuters')
+    nltk.download('punkt')
+    if not reuters.fileids():
+        print("Reuters corpus not found.")
+        return
+
+    # Load a blank English spaCy model
+    nlp = spacy.blank("en")
+
+    # Create a TextCategorizer with the ensemble model for multi-label classification
+    textcat = nlp.add_pipe("textcat_multilabel")
+
+    # Add labels to text classifier
+    for label in reuters.categories():
+        textcat.add_label(label)
+
+    # Prepare training data
+    train_examples = []
+    for fileid in reuters.fileids():
+        categories = reuters.categories(fileid)
+        text = reuters.raw(fileid)
+        cats = {label: label in categories for label in reuters.categories()}
+        # Prepare spacy Example objects
+        doc = nlp.make_doc(text)
+        example = Example.from_dict(doc, {'cats': cats})
+        train_examples.append(example)
+
+    # Initialize the text categorizer with the example objects
+    nlp.initialize(lambda: train_examples)
+
+    # Train the model
+    random.seed(1)
+    spacy.util.fix_random_seed(1)
+    for i in range(5):  # Adjust iterations for better accuracy
+        random.shuffle(train_examples)
+        losses = {}
+        # Create batches of data
+        batches = spacy.util.minibatch(train_examples, size=8)
+        for batch in batches:
+            nlp.update(batch, drop=0.2, losses=losses)
+        print(f"Losses at iteration {i}: {losses}")
+
+    # Save the trained model
+    nlp.to_disk(model_dir)
+    print(f"Model saved to: {model_dir}")
+
+def train_model(model_dir, additional_epochs=0):
+    # Load the model if it exists, otherwise start with a blank model
+    try:
+        nlp = spacy.load(model_dir)
+        print("Model loaded from disk.")
+    except IOError:
+        print("No existing model found. Starting with a new model.")
+        nlp = spacy.blank("en")
+        textcat = nlp.add_pipe("textcat_multilabel")
+        for label in reuters.categories():
+            textcat.add_label(label)
+
+    # Prepare training data
+    train_examples = []
+    for fileid in reuters.fileids():
+        categories = reuters.categories(fileid)
+        text = reuters.raw(fileid)
+        cats = {label: label in categories for label in reuters.categories()}
+        doc = nlp.make_doc(text)
+        example = Example.from_dict(doc, {'cats': cats})
+        train_examples.append(example)
+
+    # Initialize the model if it was newly created
+    if 'textcat_multilabel' not in nlp.pipe_names:
+        nlp.initialize(lambda: train_examples)
+    else:
+        print("Continuing training with existing model.")
+
+    # Train the model
+    random.seed(1)
+    spacy.util.fix_random_seed(1)
+    num_epochs = 5 + additional_epochs
+    for i in range(num_epochs):
+        random.shuffle(train_examples)
+        losses = {}
+        batches = spacy.util.minibatch(train_examples, size=8)
+        for batch in batches:
+            nlp.update(batch, drop=0.2, losses=losses)
+        print(f"Losses at iteration {i}: {losses}")
+
+    # Save the trained model
+    nlp.to_disk(model_dir)
+    print(f"Model saved to: {model_dir}")
+
+def load_model_and_predict(model_dir, text, top_k=3):
+    # Load the trained model from the specified directory
+    nlp = spacy.load(model_dir)
+
+    # Process the text with the loaded model
+    doc = nlp(text)
+
+    # Get the top_k highest-scoring categories
+    top_categories = sorted(doc.cats.items(), key=lambda x: x[1], reverse=True)[:top_k]
+    print(f"Top {top_k} categories:")
+    
+    return top_categories    
+
+if __name__ == "__main__":
+    train_and_save_reuters_model()
+    train_model("models/reuters", additional_epochs=5)
+    model_directory = "models/reuters"  # same directory the model was trained and saved to above
+    print(reuters.categories())
+    example_text = "Apple Inc. is reportedly buying a startup for $1 billion"
+    r = load_model_and_predict(model_directory, example_text)
+    print(r)
--- a/crawl4ai/user_agent_generator.py
+++ b/crawl4ai/user_agent_generator.py
@@ -1,299 +0,0 @@
-import random
-from typing import Optional, Literal, List, Dict, Tuple
-import re
-
-
-class UserAgentGenerator:
-    """
-    Generate random user agents with specified constraints.
-
-    Attributes:
-        desktop_platforms (dict): A dictionary of possible desktop platforms and their corresponding user agent strings.
-        mobile_platforms (dict): A dictionary of possible mobile platforms and their corresponding user agent strings.
-        browser_combinations (dict): A dictionary of possible browser combinations and their corresponding user agent strings.
-        rendering_engines (dict): A dictionary of possible rendering engines and their corresponding user agent strings.
-        chrome_versions (list): A list of possible Chrome browser versions.
-        firefox_versions (list): A list of possible Firefox browser versions.
-        edge_versions (list): A list of possible Edge browser versions.
-        safari_versions (list): A list of possible Safari browser versions.
-        ios_versions (list): A list of possible iOS browser versions.
-        android_versions (list): A list of possible Android browser versions.
-
-        Methods:
-            generate_user_agent(
-                platform: Literal["desktop", "mobile"] = "desktop",
-                browser: str = "chrome",
-                rendering_engine: str = "chrome_webkit",
-                chrome_version: Optional[str] = None,
-                firefox_version: Optional[str] = None,
-                edge_version: Optional[str] = None,
-                safari_version: Optional[str] = None,
-                ios_version: Optional[str] = None,
-                android_version: Optional[str] = None
-            ): Generates a random user agent string based on the specified parameters.
-    """
-
-    def __init__(self):
-        # Previous platform definitions remain the same...
-        self.desktop_platforms = {
-            "windows": {
-                "10_64": "(Windows NT 10.0; Win64; x64)",
-                "10_32": "(Windows NT 10.0; WOW64)",
-            },
-            "macos": {
-                "intel": "(Macintosh; Intel Mac OS X 10_15_7)",
-                "newer": "(Macintosh; Intel Mac OS X 10.15; rv:109.0)",
-            },
-            "linux": {
-                "generic": "(X11; Linux x86_64)",
-                "ubuntu": "(X11; Ubuntu; Linux x86_64)",
-                "chrome_os": "(X11; CrOS x86_64 14541.0.0)",
-            },
-        }
-
-        self.mobile_platforms = {
-            "android": {
-                "samsung": "(Linux; Android 13; SM-S901B)",
-                "pixel": "(Linux; Android 12; Pixel 6)",
-                "oneplus": "(Linux; Android 13; OnePlus 9 Pro)",
-                "xiaomi": "(Linux; Android 12; M2102J20SG)",
-            },
-            "ios": {
-                "iphone": "(iPhone; CPU iPhone OS 16_5 like Mac OS X)",
-                "ipad": "(iPad; CPU OS 16_5 like Mac OS X)",
-            },
-        }
-
-        # Browser Combinations
-        self.browser_combinations = {
-            1: [["chrome"], ["firefox"], ["safari"], ["edge"]],
-            2: [["gecko", "firefox"], ["chrome", "safari"], ["webkit", "safari"]],
-            3: [["chrome", "safari", "edge"], ["webkit", "chrome", "safari"]],
-        }
-
-        # Rendering Engines with versions
-        self.rendering_engines = {
-            "chrome_webkit": "AppleWebKit/537.36",
-            "safari_webkit": "AppleWebKit/605.1.15",
-            "gecko": [  # Added Gecko versions
-                "Gecko/20100101",
-                "Gecko/20100101",  # Firefox usually uses this constant version
-                "Gecko/2010010",
-            ],
-        }
-
-        # Browser Versions
-        self.chrome_versions = [
-            "Chrome/119.0.6045.199",
-            "Chrome/118.0.5993.117",
-            "Chrome/117.0.5938.149",
-            "Chrome/116.0.5845.187",
-            "Chrome/115.0.5790.171",
-        ]
-
-        self.edge_versions = [
-            "Edg/119.0.2151.97",
-            "Edg/118.0.2088.76",
-            "Edg/117.0.2045.47",
-            "Edg/116.0.1938.81",
-            "Edg/115.0.1901.203",
-        ]
-
-        self.safari_versions = [
-            "Safari/537.36",  # For Chrome-based
-            "Safari/605.1.15",
-            "Safari/604.1",
-            "Safari/602.1",
-            "Safari/601.5.17",
-        ]
-
-        # Added Firefox versions
-        self.firefox_versions = [
-            "Firefox/119.0",
-            "Firefox/118.0.2",
-            "Firefox/117.0.1",
-            "Firefox/116.0",
-            "Firefox/115.0.3",
-            "Firefox/114.0.2",
-            "Firefox/113.0.1",
-            "Firefox/112.0",
-            "Firefox/111.0.1",
-            "Firefox/110.0",
-        ]
-
-    def get_browser_stack(self, num_browsers: int = 1) -> List[str]:
-        """
-        Get a valid combination of browser versions.
-
-        How it works:
-        1. Check if the number of browsers is supported.
-        2. Randomly choose a combination of browsers.
-        3. Iterate through the combination and add browser versions.
-        4. Return the browser stack.
-
-        Args:
-            num_browsers: Number of browser specifications (1-3)
-
-        Returns:
-            List[str]: A list of browser versions.
-        """
-        if num_browsers not in self.browser_combinations:
-            raise ValueError(f"Unsupported number of browsers: {num_browsers}")
-
-        combination = random.choice(self.browser_combinations[num_browsers])
-        browser_stack = []
-
-        for browser in combination:
-            if browser == "chrome":
-                browser_stack.append(random.choice(self.chrome_versions))
-            elif browser == "firefox":
-                browser_stack.append(random.choice(self.firefox_versions))
-            elif browser == "safari":
-                browser_stack.append(random.choice(self.safari_versions))
-            elif browser == "edge":
-                browser_stack.append(random.choice(self.edge_versions))
-            elif browser == "gecko":
-                browser_stack.append(random.choice(self.rendering_engines["gecko"]))
-            elif browser == "webkit":
-                browser_stack.append(self.rendering_engines["chrome_webkit"])
-
-        return browser_stack
-
-    def generate(
-        self,
-        device_type: Optional[Literal["desktop", "mobile"]] = None,
-        os_type: Optional[str] = None,
-        device_brand: Optional[str] = None,
-        browser_type: Optional[Literal["chrome", "edge", "safari", "firefox"]] = None,
-        num_browsers: int = 3,
-    ) -> str:
-        """
-        Generate a random user agent with specified constraints.
-
-        Args:
-            device_type: 'desktop' or 'mobile'
-            os_type: 'windows', 'macos', 'linux', 'android', 'ios'
-            device_brand: Specific device brand
-            browser_type: 'chrome', 'edge', 'safari', or 'firefox'
-            num_browsers: Number of browser specifications (1-3)
-        """
-        # Get platform string
-        platform = self.get_random_platform(device_type, os_type, device_brand)
-
-        # Start with Mozilla
-        components = ["Mozilla/5.0", platform]
-
-        # Add browser stack
-        browser_stack = self.get_browser_stack(num_browsers)
-
-        # Add appropriate legacy token based on browser stack
-        if "Firefox" in str(browser_stack):
-            components.append(random.choice(self.rendering_engines["gecko"]))
-        elif "Chrome" in str(browser_stack) or "Safari" in str(browser_stack):
-            components.append(self.rendering_engines["chrome_webkit"])
-            components.append("(KHTML, like Gecko)")
-
-        # Add browser versions
-        components.extend(browser_stack)
-
-        return " ".join(components)
-
-    def generate_with_client_hints(self, **kwargs) -> Tuple[str, str]:
-        """Generate both user agent and matching client hints"""
-        user_agent = self.generate(**kwargs)
-        client_hints = self.generate_client_hints(user_agent)
-        return user_agent, client_hints
-
-    def get_random_platform(self, device_type, os_type, device_brand):
-        """Helper method to get random platform based on constraints"""
-        platforms = (
-            self.desktop_platforms
-            if device_type == "desktop"
-            else self.mobile_platforms
-            if device_type == "mobile"
-            else {**self.desktop_platforms, **self.mobile_platforms}
-        )
-
-        if os_type:
-            for platform_group in [self.desktop_platforms, self.mobile_platforms]:
-                if os_type in platform_group:
-                    platforms = {os_type: platform_group[os_type]}
-                    break
-
-        os_key = random.choice(list(platforms.keys()))
-        if device_brand and device_brand in platforms[os_key]:
-            return platforms[os_key][device_brand]
-        return random.choice(list(platforms[os_key].values()))
-
-    def parse_user_agent(self, user_agent: str) -> Dict[str, str]:
-        """Parse a user agent string to extract browser and version information"""
-        browsers = {
-            "chrome": r"Chrome/(\d+)",
-            "edge": r"Edg/(\d+)",
-            "safari": r"Version/(\d+)",
-            "firefox": r"Firefox/(\d+)",
-        }
-
-        result = {}
-        for browser, pattern in browsers.items():
-            match = re.search(pattern, user_agent)
-            if match:
-                result[browser] = match.group(1)
-
-        return result
-
-    def generate_client_hints(self, user_agent: str) -> str:
-        """Generate Sec-CH-UA header value based on user agent string"""
-        browsers = self.parse_user_agent(user_agent)
-
-        # Client hints components
-        hints = []
-
-        # Handle different browser combinations
-        if "chrome" in browsers:
-            hints.append(f'"Chromium";v="{browsers["chrome"]}"')
-            hints.append('"Not_A Brand";v="8"')
-
-            if "edge" in browsers:
-                hints.append(f'"Microsoft Edge";v="{browsers["edge"]}"')
-            else:
-                hints.append(f'"Google Chrome";v="{browsers["chrome"]}"')
-
-        elif "firefox" in browsers:
-            # Firefox doesn't typically send Sec-CH-UA
-            return '""'
-
-        elif "safari" in browsers:
-            # Safari's format for client hints
-            hints.append(f'"Safari";v="{browsers["safari"]}"')
-            hints.append('"Not_A Brand";v="8"')
-
-        return ", ".join(hints)
-
-
-# Example usage:
-if __name__ == "__main__":
-    generator = UserAgentGenerator()
-    print(generator.generate())
-
-    print("\nSingle browser (Chrome):")
-    print(generator.generate(num_browsers=1, browser_type="chrome"))
-
-    print("\nTwo browsers (Gecko/Firefox):")
-    print(generator.generate(num_browsers=2))
-
-    print("\nThree browsers (Chrome/Safari/Edge):")
-    print(generator.generate(num_browsers=3))
-
-    print("\nFirefox on Linux:")
-    print(
-        generator.generate(
-            device_type="desktop",
-            os_type="linux",
-            browser_type="firefox",
-            num_browsers=2,
-        )
-    )
-
-    print("\nChrome/Safari/Edge on Windows:")
-    print(generator.generate(device_type="desktop", os_type="windows", num_browsers=3))
--- a/crawl4ai/utils.py
+++ b/crawl4ai/utils.py
--- a/crawl4ai/version_manager.py
+++ b/crawl4ai/version_manager.py
@@ -1,29 +0,0 @@
-# version_manager.py
-from pathlib import Path
-from packaging import version
-from . import __version__
-
-
-class VersionManager:
-    def __init__(self):
-        self.home_dir = Path.home() / ".crawl4ai"
-        self.version_file = self.home_dir / "version.txt"
-
-    def get_installed_version(self):
-        """Get the version recorded in home directory"""
-        if not self.version_file.exists():
-            return None
-        try:
-            return version.parse(self.version_file.read_text().strip())
-        except:
-            return None
-
-    def update_version(self):
-        """Update the version file to current library version"""
-        self.version_file.write_text(__version__.__version__)
-
-    def needs_update(self):
-        """Check if database needs update based on version"""
-        installed = self.get_installed_version()
-        current = version.parse(__version__.__version__)
-        return installed is None or installed < current
--- a/crawl4ai/web_crawler.back.py
+++ b/crawl4ai/web_crawler.back.py
@@ -0,0 +1,357 @@
+import os, time
+os.environ["TOKENIZERS_PARALLELISM"] = "false"
+from pathlib import Path
+
+from .models import UrlModel, CrawlResult
+from .database import init_db, get_cached_url, cache_url, DB_PATH, flush_db
+from .utils import *
+from .chunking_strategy import *
+from .extraction_strategy import *
+from .crawler_strategy import *
+from typing import List
+from concurrent.futures import ThreadPoolExecutor
+from .config import *
+
+
+class WebCrawler:
+    def __init__(
+        self,
+        # db_path: str = None,
+        crawler_strategy: CrawlerStrategy = None,
+        always_by_pass_cache: bool = False,
+        verbose: bool = False,
+    ):
+        # self.db_path = db_path
+        self.crawler_strategy = crawler_strategy or LocalSeleniumCrawlerStrategy(verbose=verbose)
+        self.always_by_pass_cache = always_by_pass_cache
+
+        # Create the .crawl4ai folder in the user's home directory if it doesn't exist
+        self.crawl4ai_folder = os.path.join(Path.home(), ".crawl4ai")
+        os.makedirs(self.crawl4ai_folder, exist_ok=True)
+        os.makedirs(f"{self.crawl4ai_folder}/cache", exist_ok=True)
+
+        # If db_path is not provided, use the default path
+        # if not db_path:
+            # self.db_path = f"{self.crawl4ai_folder}/crawl4ai.db"
+        
+        # flush_db()
+        init_db()
+        
+        self.ready = False
+        
+    def warmup(self):
+        print("[LOG] 🌤️  Warming up the WebCrawler")
+        result = self.run(
+            url='https://crawl4ai.uccode.io/',
+            word_count_threshold=5,
+            extraction_strategy= NoExtractionStrategy(),
+            bypass_cache=False,
+            verbose = False
+        )
+        self.ready = True
+        print("[LOG] 🌞 WebCrawler is ready to crawl")
+        
+    def fetch_page(
+        self,
+        url_model: UrlModel,
+        provider: str = DEFAULT_PROVIDER,
+        api_token: str = None,
+        extract_blocks_flag: bool = True,
+        word_count_threshold=MIN_WORD_THRESHOLD,
+        css_selector: str = None,
+        screenshot: bool = False,
+        use_cached_html: bool = False,
+        extraction_strategy: ExtractionStrategy = None,
+        chunking_strategy: ChunkingStrategy = RegexChunking(),
+        **kwargs,
+    ) -> CrawlResult:
+        return self.run(
+            url_model.url,
+            word_count_threshold,
+            extraction_strategy or NoExtractionStrategy(),
+            chunking_strategy,
+            bypass_cache=url_model.forced,
+            css_selector=css_selector,
+            screenshot=screenshot,
+            **kwargs,
+        )
+        pass
+
+    def run_old(
+        self,
+        url: str,
+        word_count_threshold=MIN_WORD_THRESHOLD,
+        extraction_strategy: ExtractionStrategy = None,
+        chunking_strategy: ChunkingStrategy = RegexChunking(),
+        bypass_cache: bool = False,
+        css_selector: str = None,
+        screenshot: bool = False,
+        user_agent: str = None,
+        verbose=True,
+        **kwargs,
+    ) -> CrawlResult:
+        if user_agent:
+            self.crawler_strategy.update_user_agent(user_agent)
+        extraction_strategy = extraction_strategy or NoExtractionStrategy()
+        extraction_strategy.verbose = verbose
+        # Validate the strategy types; raise an error if either one is unsupported
+        if not isinstance(extraction_strategy, ExtractionStrategy):
+            raise ValueError("Unsupported extraction strategy")
+        if not isinstance(chunking_strategy, ChunkingStrategy):
+            raise ValueError("Unsupported chunking strategy")
+        
+        # Ensure word_count_threshold is not less than MIN_WORD_THRESHOLD
+        if word_count_threshold < MIN_WORD_THRESHOLD:
+            word_count_threshold = MIN_WORD_THRESHOLD
+
+        # Check cache first
+        if not bypass_cache and not self.always_by_pass_cache:
+            cached = get_cached_url(url)
+            if cached:
+                return CrawlResult(
+                    **{
+                        "url": cached[0],
+                        "html": cached[1],
+                        "cleaned_html": cached[2],
+                        "markdown": cached[3],
+                        "extracted_content": cached[4],
+                        "success": cached[5],
+                        "media": json.loads(cached[6] or "{}"),
+                        "links": json.loads(cached[7] or "{}"),
+                        "metadata": json.loads(cached[8] or "{}"),
+                        "screenshot": cached[9],
+                        "error_message": "",
+                    }
+                )
+
+        # Initialize WebDriver for crawling
+        t = time.time()
+        if kwargs.get("js", None):
+            self.crawler_strategy.js_code = kwargs.get("js")
+        html = self.crawler_strategy.crawl(url)
+        base64_image = None
+        if screenshot:
+            base64_image = self.crawler_strategy.take_screenshot()
+        success = True
+        error_message = ""
+        # Extract content from HTML
+        try:
+            result = get_content_of_website(url, html, word_count_threshold, css_selector=css_selector)
+            metadata = extract_metadata(html)
+            if result is None:
+                raise ValueError(f"Failed to extract content from the website: {url}")
+        except InvalidCSSSelectorError as e:
+            raise ValueError(str(e))
+        
+        cleaned_html = result.get("cleaned_html", "")
+        markdown = result.get("markdown", "")
+        media = result.get("media", [])
+        links = result.get("links", [])
+
+        # Print a log-style message reporting that crawling is done and the time taken
+        if verbose:
+            print(
+                f"[LOG] 🚀 Crawling done for {url}, success: {success}, time taken: {time.time() - t} seconds"
+            )
+
+        extracted_content = []
+        if verbose:
+            print(f"[LOG] 🔥 Extracting semantic blocks for {url}, Strategy: {extraction_strategy.name}")
+        t = time.time()
+        # Split markdown into sections
+        sections = chunking_strategy.chunk(markdown)
+        # sections = merge_chunks_based_on_token_threshold(sections, CHUNK_TOKEN_THRESHOLD)
+
+        extracted_content = extraction_strategy.run(
+            url, sections,
+        )
+        extracted_content = json.dumps(extracted_content)
+
+        if verbose:
+            print(
+                f"[LOG] 🚀 Extraction done for {url}, time taken: {time.time() - t} seconds."
+            )
+
+        # Cache the result
+        cleaned_html = beautify_html(cleaned_html)
+        cache_url(
+            url,
+            html,
+            cleaned_html,
+            markdown,
+            extracted_content,
+            success,
+            json.dumps(media),
+            json.dumps(links),
+            json.dumps(metadata),
+            screenshot=base64_image,
+        )
+
+        return CrawlResult(
+            url=url,
+            html=html,
+            cleaned_html=cleaned_html,
+            markdown=markdown,
+            media=media,
+            links=links,
+            metadata=metadata,
+            screenshot=base64_image,
+            extracted_content=extracted_content,
+            success=success,
+            error_message=error_message,
+        )
+
+    def fetch_pages(
+        self,
+        url_models: List[UrlModel],
+        provider: str = DEFAULT_PROVIDER,
+        api_token: str = None,
+        extract_blocks_flag: bool = True,
+        word_count_threshold=MIN_WORD_THRESHOLD,
+        use_cached_html: bool = False,
+        css_selector: str = None,
+        screenshot: bool = False,
+        extraction_strategy: ExtractionStrategy = None,
+        chunking_strategy: ChunkingStrategy = RegexChunking(),
+        **kwargs,
+    ) -> List[CrawlResult]:
+        extraction_strategy = extraction_strategy or NoExtractionStrategy()
+        def fetch_page_wrapper(url_model, *args, **kwargs):
+            return self.fetch_page(url_model, *args, **kwargs)
+
+        with ThreadPoolExecutor() as executor:
+            results = list(
+                executor.map(
+                    fetch_page_wrapper,
+                    url_models,
+                    [provider] * len(url_models),
+                    [api_token] * len(url_models),
+                    [extract_blocks_flag] * len(url_models),
+                    [word_count_threshold] * len(url_models),
+                    [css_selector] * len(url_models),
+                    [screenshot] * len(url_models),
+                    [use_cached_html] * len(url_models),
+                    [extraction_strategy] * len(url_models),
+                    [chunking_strategy] * len(url_models),
+                    *[kwargs] * len(url_models),
+                )
+            )
+
+        return results
+
+    def run(
+            self,
+            url: str,
+            word_count_threshold=MIN_WORD_THRESHOLD,
+            extraction_strategy: ExtractionStrategy = None,
+            chunking_strategy: ChunkingStrategy = RegexChunking(),
+            bypass_cache: bool = False,
+            css_selector: str = None,
+            screenshot: bool = False,
+            user_agent: str = None,
+            verbose=True,
+            **kwargs,
+        ) -> CrawlResult:
+            extraction_strategy = extraction_strategy or NoExtractionStrategy()
+            extraction_strategy.verbose = verbose
+            if not isinstance(extraction_strategy, ExtractionStrategy):
+                raise ValueError("Unsupported extraction strategy")
+            if not isinstance(chunking_strategy, ChunkingStrategy):
+                raise ValueError("Unsupported chunking strategy")
+            
+            if word_count_threshold < MIN_WORD_THRESHOLD:
+                word_count_threshold = MIN_WORD_THRESHOLD
+
+            # Check cache first
+            cached = None
+            extracted_content = None
+            if not bypass_cache and not self.always_by_pass_cache:
+                cached = get_cached_url(url)
+            
+            if cached:
+                html = cached[1]
+                extracted_content = cached[4]  # index 4 is extracted_content; index 2 is cleaned_html (see run_old)
+                if screenshot:
+                    screenshot = cached[9]
+            
+            else:
+                if user_agent:
+                    self.crawler_strategy.update_user_agent(user_agent)
+                html = self.crawler_strategy.crawl(url)
+                if screenshot:
+                    screenshot = self.crawler_strategy.take_screenshot()
+            
+            return self.process_html(url, html, extracted_content, word_count_threshold, extraction_strategy, chunking_strategy, css_selector, screenshot, verbose, bool(cached), **kwargs)
+
+    def process_html(
+            self,
+            url: str,
+            html: str,
+            extracted_content: str,
+            word_count_threshold: int,
+            extraction_strategy: ExtractionStrategy,
+            chunking_strategy: ChunkingStrategy,
+            css_selector: str,
+            screenshot: bool,
+            verbose: bool,
+            is_cached: bool,
+            **kwargs,
+        ) -> CrawlResult:
+            t = time.time()
+            # Extract content from HTML
+            try:
+                result = get_content_of_website(url, html, word_count_threshold, css_selector=css_selector)
+                metadata = extract_metadata(html)
+                if result is None:
+                    raise ValueError(f"Failed to extract content from the website: {url}")
+            except InvalidCSSSelectorError as e:
+                raise ValueError(str(e))
+            
+            cleaned_html = result.get("cleaned_html", "")
+            markdown = result.get("markdown", "")
+            media = result.get("media", [])
+            links = result.get("links", [])
+
+            if verbose:
+                print(f"[LOG] 🚀 Crawling done for {url}, success: True, time taken: {time.time() - t} seconds")
+                        
+            if extracted_content is None:
+                if verbose:
+                    print(f"[LOG] 🔥 Extracting semantic blocks for {url}, Strategy: {extraction_strategy.name}")
+
+                sections = chunking_strategy.chunk(markdown)
+                extracted_content = extraction_strategy.run(url, sections)
+                extracted_content = json.dumps(extracted_content)
+
+                if verbose:
+                    print(f"[LOG] 🚀 Extraction done for {url}, time taken: {time.time() - t} seconds.")
+                
+            screenshot = None if not screenshot else screenshot
+            
+            if not is_cached:
+                cache_url(
+                    url,
+                    html,
+                    cleaned_html,
+                    markdown,
+                    extracted_content,
+                    True,
+                    json.dumps(media),
+                    json.dumps(links),
+                    json.dumps(metadata),
+                    screenshot=screenshot,
+                )                
+
+            return CrawlResult(
+                url=url,
+                html=html,
+                cleaned_html=cleaned_html,
+                markdown=markdown,
+                media=media,
+                links=links,
+                metadata=metadata,
+                screenshot=screenshot,
+                extracted_content=extracted_content,
+                success=True,
+                error_message="",
+            )
--- a/crawl4ai/web_crawler.py
+++ b/crawl4ai/web_crawler.py
@@ -1,58 +1,43 @@
 import os, time
-
 os.environ["TOKENIZERS_PARALLELISM"] = "false"
 from pathlib import Path

 from .models import UrlModel, CrawlResult
-from .database import init_db, get_cached_url, cache_url
+from .database import init_db, get_cached_url, cache_url, DB_PATH, flush_db
 from .utils import *
 from .chunking_strategy import *
 from .extraction_strategy import *
 from .crawler_strategy import *
 from typing import List
 from concurrent.futures import ThreadPoolExecutor
-from .content_scraping_strategy import WebScrapingStrategy
 from .config import *
 import warnings
 import json
-
-warnings.filterwarnings(
-    "ignore",
-    message='Field "model_name" has conflict with protected namespace "model_".',
-)
+warnings.filterwarnings("ignore", message='Field "model_name" has conflict with protected namespace "model_".')


 class WebCrawler:
-    def __init__(
-        self,
-        crawler_strategy: CrawlerStrategy = None,
-        always_by_pass_cache: bool = False,
-        verbose: bool = False,
-    ):
-        self.crawler_strategy = crawler_strategy or LocalSeleniumCrawlerStrategy(
-            verbose=verbose
-        )
+    def __init__(self, crawler_strategy: CrawlerStrategy = None, always_by_pass_cache: bool = False, verbose: bool = False):
+        self.crawler_strategy = crawler_strategy or LocalSeleniumCrawlerStrategy(verbose=verbose)
        self.always_by_pass_cache = always_by_pass_cache
-        self.crawl4ai_folder = os.path.join(
-            os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai"
-        )
+        self.crawl4ai_folder = os.path.join(Path.home(), ".crawl4ai")
        os.makedirs(self.crawl4ai_folder, exist_ok=True)
        os.makedirs(f"{self.crawl4ai_folder}/cache", exist_ok=True)
        init_db()
        self.ready = False
-
+        
    def warmup(self):
        print("[LOG] 🌤️  Warming up the WebCrawler")
        self.run(
-            url="https://google.com/",
+            url='https://google.com/',
            word_count_threshold=5,
            extraction_strategy=NoExtractionStrategy(),
            bypass_cache=False,
-            verbose=False,
+            verbose=False
        )
        self.ready = True
        print("[LOG] 🌞 WebCrawler is ready to crawl")
-
+        
    def fetch_page(
        self,
        url_model: UrlModel,
@@ -94,7 +79,6 @@ class WebCrawler:
        **kwargs,
    ) -> List[CrawlResult]:
        extraction_strategy = extraction_strategy or NoExtractionStrategy()
-
        def fetch_page_wrapper(url_model, *args, **kwargs):
            return self.fetch_page(url_model, *args, **kwargs)

@@ -119,176 +103,136 @@ class WebCrawler:
        return results

    def run(
-        self,
-        url: str,
-        word_count_threshold=MIN_WORD_THRESHOLD,
-        extraction_strategy: ExtractionStrategy = None,
-        chunking_strategy: ChunkingStrategy = RegexChunking(),
-        bypass_cache: bool = False,
-        css_selector: str = None,
-        screenshot: bool = False,
-        user_agent: str = None,
-        verbose=True,
-        **kwargs,
-    ) -> CrawlResult:
-        try:
-            extraction_strategy = extraction_strategy or NoExtractionStrategy()
-            extraction_strategy.verbose = verbose
-            if not isinstance(extraction_strategy, ExtractionStrategy):
-                raise ValueError("Unsupported extraction strategy")
-            if not isinstance(chunking_strategy, ChunkingStrategy):
-                raise ValueError("Unsupported chunking strategy")
+            self,
+            url: str,
+            word_count_threshold=MIN_WORD_THRESHOLD,
+            extraction_strategy: ExtractionStrategy = None,
+            chunking_strategy: ChunkingStrategy = RegexChunking(),
+            bypass_cache: bool = False,
+            css_selector: str = None,
+            screenshot: bool = False,
+            user_agent: str = None,
+            verbose=True,
+            **kwargs,
+        ) -> CrawlResult:
+            try:
+                extraction_strategy = extraction_strategy or NoExtractionStrategy()
+                extraction_strategy.verbose = verbose
+                if not isinstance(extraction_strategy, ExtractionStrategy):
+                    raise ValueError("Unsupported extraction strategy")
+                if not isinstance(chunking_strategy, ChunkingStrategy):
+                    raise ValueError("Unsupported chunking strategy")
+                
+                word_count_threshold = max(word_count_threshold, MIN_WORD_THRESHOLD)

-            word_count_threshold = max(word_count_threshold, MIN_WORD_THRESHOLD)
+                cached = None
+                screenshot_data = None
+                extracted_content = None
+                if not bypass_cache and not self.always_by_pass_cache:
+                    cached = get_cached_url(url)
+                
+                if kwargs.get("warmup", True) and not self.ready:
+                    return None
+                
+                if cached:
+                    html = sanitize_input_encode(cached[1])
+                    extracted_content = sanitize_input_encode(cached[4])
+                    if screenshot:
+                        screenshot_data = cached[9]
+                        if not screenshot_data:
+                            cached = None
+                
+                if not cached or not html:
+                    if user_agent:
+                        self.crawler_strategy.update_user_agent(user_agent)
+                    t1 = time.time()
+                    html = sanitize_input_encode(self.crawler_strategy.crawl(url, **kwargs))
+                    t2 = time.time()
+                    if verbose:
+                        print(f"[LOG] 🚀 Crawling done for {url}, success: {bool(html)}, time taken: {t2 - t1:.2f} seconds")
+                    if screenshot:
+                        screenshot_data = self.crawler_strategy.take_screenshot()

-            cached = None
-            screenshot_data = None
-            extracted_content = None
-            if not bypass_cache and not self.always_by_pass_cache:
-                cached = get_cached_url(url)
-
-            if kwargs.get("warmup", True) and not self.ready:
-                return None
-
-            if cached:
-                html = sanitize_input_encode(cached[1])
-                extracted_content = sanitize_input_encode(cached[4])
-                if screenshot:
-                    screenshot_data = cached[9]
-                    if not screenshot_data:
-                        cached = None
-
-            if not cached or not html:
-                if user_agent:
-                    self.crawler_strategy.update_user_agent(user_agent)
-                t1 = time.time()
-                html = sanitize_input_encode(self.crawler_strategy.crawl(url, **kwargs))
-                t2 = time.time()
-                if verbose:
-                    print(
-                        f"[LOG] 🚀 Crawling done for {url}, success: {bool(html)}, time taken: {t2 - t1:.2f} seconds"
-                    )
-                if screenshot:
-                    screenshot_data = self.crawler_strategy.take_screenshot()
-
-            crawl_result = self.process_html(
-                url,
-                html,
-                extracted_content,
-                word_count_threshold,
-                extraction_strategy,
-                chunking_strategy,
-                css_selector,
-                screenshot_data,
-                verbose,
-                bool(cached),
-                **kwargs,
-            )
-            crawl_result.success = bool(html)
-            return crawl_result
-        except Exception as e:
-            if not hasattr(e, "msg"):
-                e.msg = str(e)
-            print(f"[ERROR] 🚫 Failed to crawl {url}, error: {e.msg}")
-            return CrawlResult(url=url, html="", success=False, error_message=e.msg)
+                
+                crawl_result = self.process_html(url, html, extracted_content, word_count_threshold, extraction_strategy, chunking_strategy, css_selector, screenshot_data, verbose, bool(cached), **kwargs)
+                crawl_result.success = bool(html)
+                return crawl_result
+            except Exception as e:
+                if not hasattr(e, "msg"):
+                    e.msg = str(e)
+                print(f"[ERROR] 🚫 Failed to crawl {url}, error: {e.msg}")    
+                return CrawlResult(url=url, html="", success=False, error_message=e.msg)

    def process_html(
-        self,
-        url: str,
-        html: str,
-        extracted_content: str,
-        word_count_threshold: int,
-        extraction_strategy: ExtractionStrategy,
-        chunking_strategy: ChunkingStrategy,
-        css_selector: str,
-        screenshot: bool,
-        verbose: bool,
-        is_cached: bool,
-        **kwargs,
-    ) -> CrawlResult:
-        t = time.time()
-        # Extract content from HTML
-        try:
-            t1 = time.time()
-            scrapping_strategy = WebScrapingStrategy()
-            extra_params = {
-                k: v
-                for k, v in kwargs.items()
-                if k not in ["only_text", "image_description_min_word_threshold"]
-            }
-            result = scrapping_strategy.scrap(
-                url,
-                html,
-                word_count_threshold=word_count_threshold,
-                css_selector=css_selector,
-                only_text=kwargs.get("only_text", False),
-                image_description_min_word_threshold=kwargs.get(
-                    "image_description_min_word_threshold",
-                    IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
-                ),
-                **extra_params,
-            )
+            self,
+            url: str,
+            html: str,
+            extracted_content: str,
+            word_count_threshold: int,
+            extraction_strategy: ExtractionStrategy,
+            chunking_strategy: ChunkingStrategy,
+            css_selector: str,
+            screenshot: bool,
+            verbose: bool,
+            is_cached: bool,
+            **kwargs,
+        ) -> CrawlResult:
+            t = time.time()
+            # Extract content from HTML
+            try:
+                t1 = time.time()
+                result = get_content_of_website_optimized(url, html, word_count_threshold, css_selector=css_selector, only_text=kwargs.get("only_text", False))
+                if verbose:
+                    print(f"[LOG] 🚀 Content extracted for {url}, success: True, time taken: {time.time() - t1:.2f} seconds")
+                
+                if result is None:
+                    raise ValueError(f"Failed to extract content from the website: {url}")
+            except InvalidCSSSelectorError as e:
+                raise ValueError(str(e))
+            
+            cleaned_html = sanitize_input_encode(result.get("cleaned_html", ""))
+            markdown = sanitize_input_encode(result.get("markdown", ""))
+            media = result.get("media", [])
+            links = result.get("links", [])
+            metadata = result.get("metadata", {})
+                        
+            if extracted_content is None:
+                if verbose:
+                    print(f"[LOG] 🔥 Extracting semantic blocks for {url}, Strategy: {extraction_strategy.name}")

-            # result = get_content_of_website_optimized(url, html, word_count_threshold, css_selector=css_selector, only_text=kwargs.get("only_text", False))
-            if verbose:
-                print(
-                    f"[LOG] 🚀 Content extracted for {url}, success: True, time taken: {time.time() - t1:.2f} seconds"
-                )
+                sections = chunking_strategy.chunk(markdown)
+                extracted_content = extraction_strategy.run(url, sections)
+                extracted_content = json.dumps(extracted_content, indent=4, default=str, ensure_ascii=False)

-            if result is None:
-                raise ValueError(f"Failed to extract content from the website: {url}")
-        except InvalidCSSSelectorError as e:
-            raise ValueError(str(e))
-
-        cleaned_html = sanitize_input_encode(result.get("cleaned_html", ""))
-        markdown = sanitize_input_encode(result.get("markdown", ""))
-        media = result.get("media", [])
-        links = result.get("links", [])
-        metadata = result.get("metadata", {})
-
-        if extracted_content is None:
-            if verbose:
-                print(
-                    f"[LOG] 🔥 Extracting semantic blocks for {url}, Strategy: {extraction_strategy.name}"
-                )
-
-            sections = chunking_strategy.chunk(markdown)
-            extracted_content = extraction_strategy.run(url, sections)
-            extracted_content = json.dumps(
-                extracted_content, indent=4, default=str, ensure_ascii=False
-            )
-
-            if verbose:
-                print(
-                    f"[LOG] 🚀 Extraction done for {url}, time taken: {time.time() - t:.2f} seconds."
-                )
-
-        screenshot = None if not screenshot else screenshot
-
-        if not is_cached:
-            cache_url(
-                url,
-                html,
-                cleaned_html,
-                markdown,
-                extracted_content,
-                True,
-                json.dumps(media),
-                json.dumps(links),
-                json.dumps(metadata),
+                if verbose:
+                    print(f"[LOG] 🚀 Extraction done for {url}, time taken: {time.time() - t:.2f} seconds.")
+                
+            screenshot = None if not screenshot else screenshot
+            
+            if not is_cached:
+                cache_url(
+                    url,
+                    html,
+                    cleaned_html,
+                    markdown,
+                    extracted_content,
+                    True,
+                    json.dumps(media),
+                    json.dumps(links),
+                    json.dumps(metadata),
+                    screenshot=screenshot,
+                )                
+            
+            return CrawlResult(
+                url=url,
+                html=html,
+                cleaned_html=format_html(cleaned_html),
+                markdown=markdown,
+                media=media,
+                links=links,
+                metadata=metadata,
                screenshot=screenshot,
-            )
-
-        return CrawlResult(
-            url=url,
-            html=html,
-            cleaned_html=format_html(cleaned_html),
-            markdown=markdown,
-            media=media,
-            links=links,
-            metadata=metadata,
-            screenshot=screenshot,
-            extracted_content=extracted_content,
-            success=True,
-            error_message="",
-        )
+                extracted_content=extracted_content,
+                success=True,
+                error_message="",
+            )
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -1,67 +0,0 @@
-services:
-  # Local build services for different platforms
-  crawl4ai-amd64:
-    build:
-      context: .
-      dockerfile: Dockerfile
-      args:
-        PYTHON_VERSION: "3.10"
-        INSTALL_TYPE: ${INSTALL_TYPE:-basic}
-        ENABLE_GPU: false
-      platforms:
-        - linux/amd64
-    profiles: ["local-amd64"]
-    extends: &base-config
-      file: docker-compose.yml
-      service: base-config
-
-  crawl4ai-arm64:
-    build:
-      context: .
-      dockerfile: Dockerfile
-      args:
-        PYTHON_VERSION: "3.10"
-        INSTALL_TYPE: ${INSTALL_TYPE:-basic}
-        ENABLE_GPU: false
-      platforms:
-        - linux/arm64
-    profiles: ["local-arm64"]
-    extends: *base-config
-
-  # Hub services for different platforms and versions
-  crawl4ai-hub-amd64:
-    image: unclecode/crawl4ai:${VERSION:-basic}-amd64
-    profiles: ["hub-amd64"]
-    extends: *base-config
-
-  crawl4ai-hub-arm64:
-    image: unclecode/crawl4ai:${VERSION:-basic}-arm64
-    profiles: ["hub-arm64"]
-    extends: *base-config
-
-  # Base configuration to be extended
-  base-config:
-    ports:
-      - "11235:11235"
-      - "8000:8000"
-      - "9222:9222"
-      - "8080:8080"
-    environment:
-      - CRAWL4AI_API_TOKEN=${CRAWL4AI_API_TOKEN:-}
-      - OPENAI_API_KEY=${OPENAI_API_KEY:-}
-      - CLAUDE_API_KEY=${CLAUDE_API_KEY:-}
-    volumes:
-      - /dev/shm:/dev/shm
-    deploy:
-      resources:
-        limits:
-          memory: 4G
-        reservations:
-          memory: 1G
-    restart: unless-stopped
-    healthcheck:
-      test: ["CMD", "curl", "-f", "http://localhost:11235/health"]
-      interval: 30s
-      timeout: 10s
-      retries: 3
-      start_period: 40s
--- a/docs/.DS_Store
+++ b/docs/.DS_Store
--- a/docs/assets/pitch-dark.png
+++ b/docs/assets/pitch-dark.png
--- a/docs/assets/pitch-dark.svg
+++ b/docs/assets/pitch-dark.svg
@@ -1,64 +0,0 @@
-<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 800 500">
-    <!-- Background -->
-    <rect width="800" height="500" fill="#1a1a1a"/>
-    
-    <!-- Opportunities Section -->
-    <g transform="translate(50,50)">
-        <!-- Opportunity 1 Box -->
-        <rect x="0" y="0" width="300" height="150" rx="10" fill="#1a2d3d" stroke="#64b5f6" stroke-width="2"/>
-        <text x="150" y="30" text-anchor="middle" font-family="Arial" font-weight="bold" font-size="16" fill="#64b5f6">Data Capitalization Opportunity</text>
-        <text x="150" y="60" text-anchor="middle" font-family="Arial" font-size="12" fill="#e0e0e0">
-            <tspan x="150" dy="0">Transform digital footprints into assets</tspan>
-            <tspan x="150" dy="20">Personal data as capital</tspan>
-            <tspan x="150" dy="20">Enterprise knowledge valuation</tspan>
-            <tspan x="150" dy="20">New form of wealth creation</tspan>
-        </text>
-
-        <!-- Opportunity 2 Box -->
-        <rect x="0" y="200" width="300" height="150" rx="10" fill="#1a2d1a" stroke="#81c784" stroke-width="2"/>
-        <text x="150" y="230" text-anchor="middle" font-family="Arial" font-weight="bold" font-size="16" fill="#81c784">Authentic Data Potential</text>
-        <text x="150" y="260" text-anchor="middle" font-family="Arial" font-size="12" fill="#e0e0e0">
-            <tspan x="150" dy="0">Vast reservoir of real insights</tspan>
-            <tspan x="150" dy="20">Enhanced AI development</tspan>
-            <tspan x="150" dy="20">Diverse human knowledge</tspan>
-            <tspan x="150" dy="20">Willing participation model</tspan>
-        </text>
-    </g>
-
-    <!-- Development Pathway -->
-    <g transform="translate(450,50)">
-        <!-- Step 1 Box -->
-        <rect x="0" y="0" width="300" height="100" rx="10" fill="#2d1a2d" stroke="#ce93d8" stroke-width="2"/>
-        <text x="150" y="35" text-anchor="middle" font-family="Arial" font-weight="bold" font-size="16" fill="#ce93d8">1. Open-Source Foundation</text>
-        <text x="150" y="65" text-anchor="middle" font-family="Arial" font-size="12" fill="#e0e0e0">Data extraction engine &amp; community development</text>
-
-        <!-- Step 2 Box -->
-        <rect x="0" y="125" width="300" height="100" rx="10" fill="#2d1a2d" stroke="#ce93d8" stroke-width="2"/>
-        <text x="150" y="160" text-anchor="middle" font-family="Arial" font-weight="bold" font-size="16" fill="#ce93d8">2. Data Capitalization Platform</text>
-        <text x="150" y="190" text-anchor="middle" font-family="Arial" font-size="12" fill="#e0e0e0">Tools to structure &amp; value digital assets</text>
-
-        <!-- Step 3 Box -->
-        <rect x="0" y="250" width="300" height="100" rx="10" fill="#2d1a2d" stroke="#ce93d8" stroke-width="2"/>
-        <text x="150" y="285" text-anchor="middle" font-family="Arial" font-weight="bold" font-size="16" fill="#ce93d8">3. Shared Data Marketplace</text>
-        <text x="150" y="315" text-anchor="middle" font-family="Arial" font-size="12" fill="#e0e0e0">Economic platform for data exchange</text>
-    </g>
-
-    <!-- Connecting Arrows -->
-    <g transform="translate(400,125)">
-        <path d="M-20,0 L40,0" stroke="#666" stroke-width="2" marker-end="url(#arrowhead)"/>
-        <path d="M-20,200 L40,200" stroke="#666" stroke-width="2" marker-end="url(#arrowhead)"/>
-    </g>
-
-    <!-- Arrow Marker -->
-    <defs>
-        <marker id="arrowhead" markerWidth="10" markerHeight="7" refX="9" refY="3.5" orient="auto">
-            <polygon points="0 0, 10 3.5, 0 7" fill="#666"/>
-        </marker>
-    </defs>
-
-    <!-- Vision Box at Bottom -->
-    <g transform="translate(200,420)">
-        <rect x="0" y="0" width="400" height="60" rx="10" fill="#2d2613" stroke="#ffd54f" stroke-width="2"/>
-        <text x="200" y="35" text-anchor="middle" font-family="Arial" font-weight="bold" font-size="16" fill="#ffd54f">Economic Vision: Shared Data Economy</text>
-    </g>
-</svg>
--- a/docs/chunking_strategies.json
+++ b/docs/chunking_strategies.json
@@ -0,0 +1,12 @@
+{
+    "RegexChunking": "### RegexChunking\n\n`RegexChunking` is a text chunking strategy that splits a given text into smaller parts using regular expressions.\nThis is useful for preparing large texts for processing by language models, ensuring they are divided into manageable segments.\n\n#### Constructor Parameters:\n- `patterns` (list, optional): A list of regular expression patterns used to split the text. Default is to split by double newlines (`['\\n\\n']`).\n\n#### Example usage:\n```python\nchunker = RegexChunking(patterns=[r'\\n\\n', r'\\. '])\nchunks = chunker.chunk(\"This is a sample text. It will be split into chunks.\")\n```",
+    
+    "NlpSentenceChunking": "### NlpSentenceChunking\n\n`NlpSentenceChunking` uses a natural language processing model to chunk a given text into sentences. This approach leverages spaCy to accurately split text based on sentence boundaries.\n\n#### Constructor Parameters:\n- None.\n\n#### Example usage:\n```python\nchunker = NlpSentenceChunking()\nchunks = chunker.chunk(\"This is a sample text. It will be split into sentences.\")\n```",
+    
+    "TopicSegmentationChunking": "### TopicSegmentationChunking\n\n`TopicSegmentationChunking` uses the TextTiling algorithm to segment a given text into topic-based chunks. This method identifies thematic boundaries in the text.\n\n#### Constructor Parameters:\n- `num_keywords` (int, optional): The number of keywords to extract for each topic segment. Default is `3`.\n\n#### Example usage:\n```python\nchunker = TopicSegmentationChunking(num_keywords=3)\nchunks = chunker.chunk(\"This is a sample text. It will be split into topic-based segments.\")\n```",
+    
+    "FixedLengthWordChunking": "### FixedLengthWordChunking\n\n`FixedLengthWordChunking` splits a given text into chunks of fixed length, based on the number of words.\n\n#### Constructor Parameters:\n- `chunk_size` (int, optional): The number of words in each chunk. Default is `100`.\n\n#### Example usage:\n```python\nchunker = FixedLengthWordChunking(chunk_size=100)\nchunks = chunker.chunk(\"This is a sample text. It will be split into fixed-length word chunks.\")\n```",
+    
+    "SlidingWindowChunking": "### SlidingWindowChunking\n\n`SlidingWindowChunking` uses a sliding window approach to chunk a given text. Each chunk has a fixed length, and the window slides by a specified step size.\n\n#### Constructor Parameters:\n- `window_size` (int, optional): The number of words in each chunk. Default is `100`.\n- `step` (int, optional): The number of words to slide the window. Default is `50`.\n\n#### Example usage:\n```python\nchunker = SlidingWindowChunking(window_size=100, step=50)\nchunks = chunker.chunk(\"This is a sample text. It will be split using a sliding window approach.\")\n```"
+  }
+  
--- a/docs/deprecated/docker-deployment.md
+++ b/docs/deprecated/docker-deployment.md
@@ -1,189 +0,0 @@
-# 🐳 Using Docker (Legacy)
-
-Crawl4AI is available as Docker images for easy deployment. You can either pull directly from Docker Hub (recommended) or build from the repository.
-
---
-
-<details>
-<summary>🐳 <strong>Option 1: Docker Hub (Recommended)</strong></summary>
-
-Choose the appropriate image based on your platform and needs:
-
-### For AMD64 (Regular Linux/Windows):
-```bash
-# Basic version (recommended)
-docker pull unclecode/crawl4ai:basic-amd64
-docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64
-
-# Full ML/LLM support
-docker pull unclecode/crawl4ai:all-amd64
-docker run -p 11235:11235 unclecode/crawl4ai:all-amd64
-
-# With GPU support
-docker pull unclecode/crawl4ai:gpu-amd64
-docker run -p 11235:11235 unclecode/crawl4ai:gpu-amd64
-```
-
-### For ARM64 (M1/M2 Macs, ARM servers):
-```bash
-# Basic version (recommended)
-docker pull unclecode/crawl4ai:basic-arm64
-docker run -p 11235:11235 unclecode/crawl4ai:basic-arm64
-
-# Full ML/LLM support
-docker pull unclecode/crawl4ai:all-arm64
-docker run -p 11235:11235 unclecode/crawl4ai:all-arm64
-
-# With GPU support
-docker pull unclecode/crawl4ai:gpu-arm64
-docker run -p 11235:11235 unclecode/crawl4ai:gpu-arm64
-```
-
-Need more memory? Add `--shm-size`:
-```bash
-docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic-amd64
-```
-
-Test the installation:
-```bash
-curl http://localhost:11235/health
-```
-
-### For Raspberry Pi (32-bit) (coming soon):
-```bash
-# Pull and run basic version (recommended for Raspberry Pi)
-docker pull unclecode/crawl4ai:basic-armv7
-docker run -p 11235:11235 unclecode/crawl4ai:basic-armv7
-
-# With increased shared memory if needed
-docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic-armv7
-```
-
-Note: Due to hardware constraints, only the basic version is recommended for Raspberry Pi.
-
-</details>
-
-<details>
-<summary>🐳 <strong>Option 2: Build from Repository</strong></summary>
-
-Build the image locally based on your platform:
-
-```bash
-# Clone the repository
-git clone https://github.com/unclecode/crawl4ai.git
-cd crawl4ai
-
-# For AMD64 (Regular Linux/Windows)
-docker build --platform linux/amd64 \
-  --tag crawl4ai:local \
-  --build-arg INSTALL_TYPE=basic \
-  .
-
-# For ARM64 (M1/M2 Macs, ARM servers)
-docker build --platform linux/arm64 \
-  --tag crawl4ai:local \
-  --build-arg INSTALL_TYPE=basic \
-  .
-```
-
-Build options:
- INSTALL_TYPE=basic (default): Basic crawling features
- INSTALL_TYPE=all: Full ML/LLM support
- ENABLE_GPU=true: Add GPU support
-
-Example with all options:
-```bash
-docker build --platform linux/amd64 \
-  --tag crawl4ai:local \
-  --build-arg INSTALL_TYPE=all \
-  --build-arg ENABLE_GPU=true \
-  .
-```
-
-Run your local build:
-```bash
-# Regular run
-docker run -p 11235:11235 crawl4ai:local
-
-# With increased shared memory
-docker run --shm-size=2gb -p 11235:11235 crawl4ai:local
-```
-
-Test the installation:
-```bash
-curl http://localhost:11235/health
-```
-
-</details>
-
-<details>
-<summary>🐳 <strong>Option 3: Using Docker Compose</strong></summary>
-
-Docker Compose provides a more structured way to run Crawl4AI, especially when dealing with environment variables and multiple configurations.
-
-```bash
-# Clone the repository
-git clone https://github.com/unclecode/crawl4ai.git
-cd crawl4ai
-```
-
-### For AMD64 (Regular Linux/Windows):
-```bash
-# Build and run locally
-docker-compose --profile local-amd64 up
-
-# Run from Docker Hub
-VERSION=basic docker-compose --profile hub-amd64 up   # Basic version
-VERSION=all docker-compose --profile hub-amd64 up     # Full ML/LLM support
-VERSION=gpu docker-compose --profile hub-amd64 up     # GPU support
-```
-
-### For ARM64 (M1/M2 Macs, ARM servers):
-```bash
-# Build and run locally
-docker-compose --profile local-arm64 up
-
-# Run from Docker Hub
-VERSION=basic docker-compose --profile hub-arm64 up   # Basic version
-VERSION=all docker-compose --profile hub-arm64 up     # Full ML/LLM support
-VERSION=gpu docker-compose --profile hub-arm64 up     # GPU support
-```
-
-Environment variables (optional):
-```bash
-# Create a .env file
-CRAWL4AI_API_TOKEN=your_token
-OPENAI_API_KEY=your_openai_key
-CLAUDE_API_KEY=your_claude_key
-```
-
-The compose file includes:
- Memory management (4GB limit, 1GB reserved)
- Shared memory volume for browser support
- Health checks
- Auto-restart policy
- All necessary port mappings
-
-Test the installation:
-```bash
-curl http://localhost:11235/health
-```
-
-</details>
-
-<details>
-<summary>🚀 <strong>One-Click Deployment</strong></summary>
-
-Deploy your own instance of Crawl4AI with one click:
-
-[![DigitalOcean Referral Badge](https://web-platforms.sfo2.cdn.digitaloceanspaces.com/WWW/Badge%203.svg)](https://www.digitalocean.com/?repo=https://github.com/unclecode/crawl4ai/tree/0.3.74&refcode=a0780f1bdb3d&utm_campaign=Referral_Invite&utm_medium=Referral_Program&utm_source=badge)
-
-> 💡 **Recommended specs**: 4GB RAM minimum. Select "professional-xs" or higher when deploying for stable operation.
-
-The deploy will:
- Set up a Docker container with Crawl4AI
- Configure Playwright and all dependencies
- Start the FastAPI server on port `11235`
- Set up health checks and auto-deployment
-
-</details>
--- a/docs/examples/amazon_product_extraction_direct_url.py
+++ b/docs/examples/amazon_product_extraction_direct_url.py
@@ -1,110 +0,0 @@
-"""
-This example demonstrates how to use JSON CSS extraction to scrape product information 
-from Amazon search results. It shows how to extract structured data like product titles,
-prices, ratings, and other details using CSS selectors.
-"""
-
-from crawl4ai import AsyncWebCrawler
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
-from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
-import json
-
-
-async def extract_amazon_products():
-    # Initialize browser config
-    browser_config = BrowserConfig(browser_type="chromium", headless=True)
-
-    # Initialize crawler config with JSON CSS extraction strategy
-    crawler_config = CrawlerRunConfig(
-        extraction_strategy=JsonCssExtractionStrategy(
-            schema={
-                "name": "Amazon Product Search Results",
-                "baseSelector": "[data-component-type='s-search-result']",
-                "fields": [
-                    {
-                        "name": "asin",
-                        "selector": "",
-                        "type": "attribute",
-                        "attribute": "data-asin",
-                    },
-                    {"name": "title", "selector": "h2 a span", "type": "text"},
-                    {
-                        "name": "url",
-                        "selector": "h2 a",
-                        "type": "attribute",
-                        "attribute": "href",
-                    },
-                    {
-                        "name": "image",
-                        "selector": ".s-image",
-                        "type": "attribute",
-                        "attribute": "src",
-                    },
-                    {
-                        "name": "rating",
-                        "selector": ".a-icon-star-small .a-icon-alt",
-                        "type": "text",
-                    },
-                    {
-                        "name": "reviews_count",
-                        "selector": "[data-csa-c-func-deps='aui-da-a-popover'] ~ span span",
-                        "type": "text",
-                    },
-                    {
-                        "name": "price",
-                        "selector": ".a-price .a-offscreen",
-                        "type": "text",
-                    },
-                    {
-                        "name": "original_price",
-                        "selector": ".a-price.a-text-price .a-offscreen",
-                        "type": "text",
-                    },
-                    {
-                        "name": "sponsored",
-                        "selector": ".puis-sponsored-label-text",
-                        "type": "exists",
-                    },
-                    {
-                        "name": "delivery_info",
-                        "selector": "[data-cy='delivery-recipe'] .a-color-base",
-                        "type": "text",
-                        "multiple": True,
-                    },
-                ],
-            }
-        )
-    )
-
-    # Example search URL (you should replace with your actual Amazon URL)
-    url = "https://www.amazon.com/s?k=Samsung+Galaxy+Tab"
-
-    # Use context manager for proper resource handling
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        # Extract the data
-        result = await crawler.arun(url=url, config=crawler_config)
-
-        # Process and print the results
-        if result and result.extracted_content:
-            # Parse the JSON string into a list of products
-            products = json.loads(result.extracted_content)
-
-            # Process each product in the list
-            for product in products:
-                print("\nProduct Details:")
-                print(f"ASIN: {product.get('asin')}")
-                print(f"Title: {product.get('title')}")
-                print(f"Price: {product.get('price')}")
-                print(f"Original Price: {product.get('original_price')}")
-                print(f"Rating: {product.get('rating')}")
-                print(f"Reviews: {product.get('reviews_count')}")
-                print(f"Sponsored: {'Yes' if product.get('sponsored') else 'No'}")
-                if product.get("delivery_info"):
-                    print(f"Delivery: {' '.join(product['delivery_info'])}")
-                print("-" * 80)
-
-
-if __name__ == "__main__":
-    import asyncio
-
-    asyncio.run(extract_amazon_products())
--- a/docs/examples/amazon_product_extraction_using_hooks.py
+++ b/docs/examples/amazon_product_extraction_using_hooks.py
@@ -1,150 +0,0 @@
-"""
-This example demonstrates how to use JSON CSS extraction to scrape product information 
-from Amazon search results. It shows how to extract structured data like product titles,
-prices, ratings, and other details using CSS selectors.
-"""
-
-from crawl4ai import AsyncWebCrawler, CacheMode
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
-from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
-import json
-from playwright.async_api import Page, BrowserContext
-
-
-async def extract_amazon_products():
-    # Initialize browser config
-    browser_config = BrowserConfig(
-        # browser_type="chromium",
-        headless=True
-    )
-
-    # Initialize crawler config with JSON CSS extraction strategy nav-search-submit-button
-    crawler_config = CrawlerRunConfig(
-        cache_mode=CacheMode.BYPASS,
-        extraction_strategy=JsonCssExtractionStrategy(
-            schema={
-                "name": "Amazon Product Search Results",
-                "baseSelector": "[data-component-type='s-search-result']",
-                "fields": [
-                    {
-                        "name": "asin",
-                        "selector": "",
-                        "type": "attribute",
-                        "attribute": "data-asin",
-                    },
-                    {"name": "title", "selector": "h2 a span", "type": "text"},
-                    {
-                        "name": "url",
-                        "selector": "h2 a",
-                        "type": "attribute",
-                        "attribute": "href",
-                    },
-                    {
-                        "name": "image",
-                        "selector": ".s-image",
-                        "type": "attribute",
-                        "attribute": "src",
-                    },
-                    {
-                        "name": "rating",
-                        "selector": ".a-icon-star-small .a-icon-alt",
-                        "type": "text",
-                    },
-                    {
-                        "name": "reviews_count",
-                        "selector": "[data-csa-c-func-deps='aui-da-a-popover'] ~ span span",
-                        "type": "text",
-                    },
-                    {
-                        "name": "price",
-                        "selector": ".a-price .a-offscreen",
-                        "type": "text",
-                    },
-                    {
-                        "name": "original_price",
-                        "selector": ".a-price.a-text-price .a-offscreen",
-                        "type": "text",
-                    },
-                    {
-                        "name": "sponsored",
-                        "selector": ".puis-sponsored-label-text",
-                        "type": "exists",
-                    },
-                    {
-                        "name": "delivery_info",
-                        "selector": "[data-cy='delivery-recipe'] .a-color-base",
-                        "type": "text",
-                        "multiple": True,
-                    },
-                ],
-            }
-        ),
-    )
-
-    url = "https://www.amazon.com/"
-
-    async def after_goto(
-        page: Page, context: BrowserContext, url: str, response: dict, **kwargs
-    ):
-        """Hook called after navigating to each URL"""
-        print(f"[HOOK] after_goto - Successfully loaded: {url}")
-
-        try:
-            # Wait for search box to be available
-            search_box = await page.wait_for_selector(
-                "#twotabsearchtextbox", timeout=1000
-            )
-
-            # Type the search query
-            await search_box.fill("Samsung Galaxy Tab")
-
-            # Get the search button and prepare for navigation
-            search_button = await page.wait_for_selector(
-                "#nav-search-submit-button", timeout=1000
-            )
-
-            # Click with navigation waiting
-            await search_button.click()
-
-            # Wait for search results to load
-            await page.wait_for_selector(
-                '[data-component-type="s-search-result"]', timeout=10000
-            )
-            print("[HOOK] Search completed and results loaded!")
-
-        except Exception as e:
-            print(f"[HOOK] Error during search operation: {str(e)}")
-
-        return page
-
-    # Use context manager for proper resource handling
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        crawler.crawler_strategy.set_hook("after_goto", after_goto)
-
-        # Extract the data
-        result = await crawler.arun(url=url, config=crawler_config)
-
-        # Process and print the results
-        if result and result.extracted_content:
-            # Parse the JSON string into a list of products
-            products = json.loads(result.extracted_content)
-
-            # Process each product in the list
-            for product in products:
-                print("\nProduct Details:")
-                print(f"ASIN: {product.get('asin')}")
-                print(f"Title: {product.get('title')}")
-                print(f"Price: {product.get('price')}")
-                print(f"Original Price: {product.get('original_price')}")
-                print(f"Rating: {product.get('rating')}")
-                print(f"Reviews: {product.get('reviews_count')}")
-                print(f"Sponsored: {'Yes' if product.get('sponsored') else 'No'}")
-                if product.get("delivery_info"):
-                    print(f"Delivery: {' '.join(product['delivery_info'])}")
-                print("-" * 80)
-
-
-if __name__ == "__main__":
-    import asyncio
-
-    asyncio.run(extract_amazon_products())
--- a/docs/examples/amazon_product_extraction_using_use_javascript.py
+++ b/docs/examples/amazon_product_extraction_using_use_javascript.py
@@ -1,126 +0,0 @@
-"""
-This example demonstrates how to use JSON CSS extraction to scrape product information 
-from Amazon search results. It shows how to extract structured data like product titles,
-prices, ratings, and other details using CSS selectors.
-"""
-
-from crawl4ai import AsyncWebCrawler, CacheMode
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
-from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
-import json
-
-
-async def extract_amazon_products():
-    # Initialize browser config
-    browser_config = BrowserConfig(
-        # browser_type="chromium",
-        headless=True
-    )
-
-    js_code_to_search = """
-        const task = async () => {
-            document.querySelector('#twotabsearchtextbox').value = 'Samsung Galaxy Tab';
-            document.querySelector('#nav-search-submit-button').click();
-        }
-        await task();
-    """
-    js_code_to_search_sync = """
-            document.querySelector('#twotabsearchtextbox').value = 'Samsung Galaxy Tab';
-            document.querySelector('#nav-search-submit-button').click();
-    """
-    crawler_config = CrawlerRunConfig(
-        cache_mode=CacheMode.BYPASS,
-        js_code=js_code_to_search,
-        wait_for='css:[data-component-type="s-search-result"]',
-        extraction_strategy=JsonCssExtractionStrategy(
-            schema={
-                "name": "Amazon Product Search Results",
-                "baseSelector": "[data-component-type='s-search-result']",
-                "fields": [
-                    {
-                        "name": "asin",
-                        "selector": "",
-                        "type": "attribute",
-                        "attribute": "data-asin",
-                    },
-                    {"name": "title", "selector": "h2 a span", "type": "text"},
-                    {
-                        "name": "url",
-                        "selector": "h2 a",
-                        "type": "attribute",
-                        "attribute": "href",
-                    },
-                    {
-                        "name": "image",
-                        "selector": ".s-image",
-                        "type": "attribute",
-                        "attribute": "src",
-                    },
-                    {
-                        "name": "rating",
-                        "selector": ".a-icon-star-small .a-icon-alt",
-                        "type": "text",
-                    },
-                    {
-                        "name": "reviews_count",
-                        "selector": "[data-csa-c-func-deps='aui-da-a-popover'] ~ span span",
-                        "type": "text",
-                    },
-                    {
-                        "name": "price",
-                        "selector": ".a-price .a-offscreen",
-                        "type": "text",
-                    },
-                    {
-                        "name": "original_price",
-                        "selector": ".a-price.a-text-price .a-offscreen",
-                        "type": "text",
-                    },
-                    {
-                        "name": "sponsored",
-                        "selector": ".puis-sponsored-label-text",
-                        "type": "exists",
-                    },
-                    {
-                        "name": "delivery_info",
-                        "selector": "[data-cy='delivery-recipe'] .a-color-base",
-                        "type": "text",
-                        "multiple": True,
-                    },
-                ],
-            }
-        ),
-    )
-
-    # Example search URL (you should replace with your actual Amazon URL)
-    url = "https://www.amazon.com/"
-
-    # Use context manager for proper resource handling
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        # Extract the data
-        result = await crawler.arun(url=url, config=crawler_config)
-
-        # Process and print the results
-        if result and result.extracted_content:
-            # Parse the JSON string into a list of products
-            products = json.loads(result.extracted_content)
-
-            # Process each product in the list
-            for product in products:
-                print("\nProduct Details:")
-                print(f"ASIN: {product.get('asin')}")
-                print(f"Title: {product.get('title')}")
-                print(f"Price: {product.get('price')}")
-                print(f"Original Price: {product.get('original_price')}")
-                print(f"Rating: {product.get('rating')}")
-                print(f"Reviews: {product.get('reviews_count')}")
-                print(f"Sponsored: {'Yes' if product.get('sponsored') else 'No'}")
-                if product.get("delivery_info"):
-                    print(f"Delivery: {' '.join(product['delivery_info'])}")
-                print("-" * 80)
-
-
-if __name__ == "__main__":
-    import asyncio
-
-    asyncio.run(extract_amazon_products())
--- a/docs/examples/async_webcrawler_multiple_urls_example.py
+++ b/docs/examples/async_webcrawler_multiple_urls_example.py
@@ -1,16 +1,12 @@
 # File: async_webcrawler_multiple_urls_example.py
 import os, sys
-
 # append 2 parent directories to sys.path to import crawl4ai
-parent_dir = os.path.dirname(
-    os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
-)
+parent_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
 sys.path.append(parent_dir)

 import asyncio
 from crawl4ai import AsyncWebCrawler

-
 async def main():
    # Initialize the AsyncWebCrawler
    async with AsyncWebCrawler(verbose=True) as crawler:
@@ -20,7 +16,7 @@ async def main():
            "https://python.org",
            "https://github.com",
            "https://stackoverflow.com",
-            "https://news.ycombinator.com",
+            "https://news.ycombinator.com"
        ]

        # Set up crawling parameters
@@ -31,7 +27,7 @@ async def main():
            urls=urls,
            word_count_threshold=word_count_threshold,
            bypass_cache=True,
-            verbose=True,
+            verbose=True
        )

        # Process the results
@@ -40,9 +36,7 @@ async def main():
                print(f"Successfully crawled: {result.url}")
                print(f"Title: {result.metadata.get('title', 'N/A')}")
                print(f"Word count: {len(result.markdown.split())}")
-                print(
-                    f"Number of links: {len(result.links.get('internal', [])) + len(result.links.get('external', []))}"
-                )
+                print(f"Number of links: {len(result.links.get('internal', [])) + len(result.links.get('external', []))}")
                print(f"Number of images: {len(result.media.get('images', []))}")
                print("---")
            else:
@@ -50,6 +44,5 @@ async def main():
                print(f"Error: {result.error_message}")
                print("---")

-
 if __name__ == "__main__":
-    asyncio.run(main())
+    asyncio.run(main())
--- a/docs/examples/browser_optimization_example.py
+++ b/docs/examples/browser_optimization_example.py
@@ -1,126 +0,0 @@
-"""
-This example demonstrates optimal browser usage patterns in Crawl4AI:
-1. Sequential crawling with session reuse
-2. Parallel crawling with browser instance reuse
-3. Performance optimization settings
-"""
-
-import asyncio
-from typing import List
-from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
-from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
-
-
-async def crawl_sequential(urls: List[str]):
-    """
-    Sequential crawling using session reuse - most efficient for moderate workloads
-    """
-    print("\n=== Sequential Crawling with Session Reuse ===")
-
-    # Configure browser with optimized settings
-    browser_config = BrowserConfig(
-        headless=True,
-        browser_args=[
-            "--disable-gpu",  # Disable GPU acceleration
-            "--disable-dev-shm-usage",  # Disable /dev/shm usage
-            "--no-sandbox",  # Required for Docker
-        ],
-        viewport={
-            "width": 800,
-            "height": 600,
-        },  # Smaller viewport for better performance
-    )
-
-    # Configure crawl settings
-    crawl_config = CrawlerRunConfig(
-        markdown_generator=DefaultMarkdownGenerator(
-            #  content_filter=PruningContentFilter(), In case you need fit_markdown
-        ),
-    )
-
-    # Create single crawler instance
-    crawler = AsyncWebCrawler(config=browser_config)
-    await crawler.start()
-
-    try:
-        session_id = "session1"  # Use same session for all URLs
-        for url in urls:
-            result = await crawler.arun(
-                url=url,
-                config=crawl_config,
-                session_id=session_id,  # Reuse same browser tab
-            )
-            if result.success:
-                print(f"Successfully crawled {url}")
-                print(f"Content length: {len(result.markdown_v2.raw_markdown)}")
-    finally:
-        await crawler.close()
-
-
-async def crawl_parallel(urls: List[str], max_concurrent: int = 3):
-    """
-    Parallel crawling while reusing browser instance - best for large workloads
-    """
-    print("\n=== Parallel Crawling with Browser Reuse ===")
-
-    browser_config = BrowserConfig(
-        headless=True,
-        browser_args=["--disable-gpu", "--disable-dev-shm-usage", "--no-sandbox"],
-        viewport={"width": 800, "height": 600},
-    )
-
-    crawl_config = CrawlerRunConfig(
-        markdown_generator=DefaultMarkdownGenerator(
-            #  content_filter=PruningContentFilter(), In case you need fit_markdown
-        ),
-    )
-
-    # Create single crawler instance for all parallel tasks
-    crawler = AsyncWebCrawler(config=browser_config)
-    await crawler.start()
-
-    try:
-        # Create tasks in batches to control concurrency
-        for i in range(0, len(urls), max_concurrent):
-            batch = urls[i : i + max_concurrent]
-            tasks = []
-
-            for j, url in enumerate(batch):
-                session_id = (
-                    f"parallel_session_{j}"  # Different session per concurrent task
-                )
-                task = crawler.arun(url=url, config=crawl_config, session_id=session_id)
-                tasks.append(task)
-
-            # Wait for batch to complete
-            results = await asyncio.gather(*tasks, return_exceptions=True)
-
-            # Process results
-            for url, result in zip(batch, results):
-                if isinstance(result, Exception):
-                    print(f"Error crawling {url}: {str(result)}")
-                elif result.success:
-                    print(f"Successfully crawled {url}")
-                    print(f"Content length: {len(result.markdown_v2.raw_markdown)}")
-    finally:
-        await crawler.close()
-
-
-async def main():
-    # Example URLs
-    urls = [
-        "https://example.com/page1",
-        "https://example.com/page2",
-        "https://example.com/page3",
-        "https://example.com/page4",
-    ]
-
-    # Demo sequential crawling
-    await crawl_sequential(urls)
-
-    # Demo parallel crawling
-    await crawl_parallel(urls, max_concurrent=2)
-
-
-if __name__ == "__main__":
-    asyncio.run(main())
--- a/docs/examples/crawlai_vs_firecrawl.py
+++ b/docs/examples/crawlai_vs_firecrawl.py
@@ -1,32 +1,31 @@
 import os, time
-
 # append the path to the root of the project
 import sys
 import asyncio
-
-sys.path.append(os.path.join(os.path.dirname(__file__), "..", ".."))
+sys.path.append(os.path.join(os.path.dirname(__file__), '..', '..'))
 from firecrawl import FirecrawlApp
 from crawl4ai import AsyncWebCrawler
-
-__data__ = os.path.join(os.path.dirname(__file__), "..", "..") + "/.data"
-
+__data__ = os.path.join(os.path.dirname(__file__), '..', '..') + '/.data'

 async def compare():
-    app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
+    app = FirecrawlApp(api_key=os.environ['FIRECRAWL_API_KEY'])

     # Test Firecrawl with a simple crawl
    start = time.time()
    scrape_status = app.scrape_url(
-        "https://www.nbcnews.com/business", params={"formats": ["markdown", "html"]}
+    'https://www.nbcnews.com/business',
+    params={'formats': ['markdown', 'html']}
    )
    end = time.time()
    print(f"Time taken: {end - start} seconds")
-    print(len(scrape_status["markdown"]))
+    print(len(scrape_status['markdown']))
    # save the markdown content with provider name
    with open(f"{__data__}/firecrawl_simple.md", "w") as f:
-        f.write(scrape_status["markdown"])
+        f.write(scrape_status['markdown'])
    # Count how many "cldnry.s-nbcnews.com" are in the markdown
-    print(scrape_status["markdown"].count("cldnry.s-nbcnews.com"))
+    print(scrape_status['markdown'].count("cldnry.s-nbcnews.com"))
+    
+

    async with AsyncWebCrawler() as crawler:
        start = time.time()
@@ -34,13 +33,13 @@ async def compare():
            url="https://www.nbcnews.com/business",
            # js_code=["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"],
            word_count_threshold=0,
-            bypass_cache=True,
-            verbose=False,
+            bypass_cache=True, 
+            verbose=False
        )
        end = time.time()
        print(f"Time taken: {end - start} seconds")
        print(len(result.markdown))
-        # save the markdown content with provider name
+        # save the markdown content with provider name  
        with open(f"{__data__}/crawl4ai_simple.md", "w") as f:
            f.write(result.markdown)
        # count how many "cldnry.s-nbcnews.com" are in the markdown
@@ -49,12 +48,10 @@ async def compare():
        start = time.time()
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
-            js_code=[
-                "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
-            ],
+            js_code=["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"],
            word_count_threshold=0,
-            bypass_cache=True,
-            verbose=False,
+            bypass_cache=True, 
+            verbose=False
        )
        end = time.time()
        print(f"Time taken: {end - start} seconds")
@@ -64,7 +61,7 @@ async def compare():
            f.write(result.markdown)
        # count how many "cldnry.s-nbcnews.com" are in the markdown
        print(result.markdown.count("cldnry.s-nbcnews.com"))
-
-
+        
 if __name__ == "__main__":
    asyncio.run(compare())
+    
--- a/docs/examples/dispatcher_example.py
+++ b/docs/examples/dispatcher_example.py
@@ -1,136 +0,0 @@
-import asyncio
-import time
-from rich import print
-from rich.table import Table
-from crawl4ai import (
-    AsyncWebCrawler,
-    BrowserConfig,
-    CrawlerRunConfig,
-    MemoryAdaptiveDispatcher,
-    SemaphoreDispatcher,
-    RateLimiter,
-    CrawlerMonitor,
-    DisplayMode,
-    CacheMode,
-    LXMLWebScrapingStrategy,
-)
-
-
-async def memory_adaptive(urls, browser_config, run_config):
-    """Memory adaptive crawler with monitoring"""
-    start = time.perf_counter()
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        dispatcher = MemoryAdaptiveDispatcher(
-            memory_threshold_percent=70.0,
-            max_session_permit=10,
-            monitor=CrawlerMonitor(
-                max_visible_rows=15, display_mode=DisplayMode.DETAILED
-            ),
-        )
-        results = await crawler.arun_many(
-            urls, config=run_config, dispatcher=dispatcher
-        )
-    duration = time.perf_counter() - start
-    return len(results), duration
-
-
-async def memory_adaptive_with_rate_limit(urls, browser_config, run_config):
-    """Memory adaptive crawler with rate limiting"""
-    start = time.perf_counter()
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        dispatcher = MemoryAdaptiveDispatcher(
-            memory_threshold_percent=70.0,
-            max_session_permit=10,
-            rate_limiter=RateLimiter(
-                base_delay=(1.0, 2.0), max_delay=30.0, max_retries=2
-            ),
-            monitor=CrawlerMonitor(
-                max_visible_rows=15, display_mode=DisplayMode.DETAILED
-            ),
-        )
-        results = await crawler.arun_many(
-            urls, config=run_config, dispatcher=dispatcher
-        )
-    duration = time.perf_counter() - start
-    return len(results), duration
-
-
-async def semaphore(urls, browser_config, run_config):
-    """Basic semaphore crawler"""
-    start = time.perf_counter()
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        dispatcher = SemaphoreDispatcher(
-            semaphore_count=5,
-            monitor=CrawlerMonitor(
-                max_visible_rows=15, display_mode=DisplayMode.DETAILED
-            ),
-        )
-        results = await crawler.arun_many(
-            urls, config=run_config, dispatcher=dispatcher
-        )
-    duration = time.perf_counter() - start
-    return len(results), duration
-
-
-async def semaphore_with_rate_limit(urls, browser_config, run_config):
-    """Semaphore crawler with rate limiting"""
-    start = time.perf_counter()
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        dispatcher = SemaphoreDispatcher(
-            semaphore_count=5,
-            rate_limiter=RateLimiter(
-                base_delay=(1.0, 2.0), max_delay=30.0, max_retries=2
-            ),
-            monitor=CrawlerMonitor(
-                max_visible_rows=15, display_mode=DisplayMode.DETAILED
-            ),
-        )
-        results = await crawler.arun_many(
-            urls, config=run_config, dispatcher=dispatcher
-        )
-    duration = time.perf_counter() - start
-    return len(results), duration
-
-
-def create_performance_table(results):
-    """Creates a rich table showing performance results"""
-    table = Table(title="Crawler Strategy Performance Comparison")
-    table.add_column("Strategy", style="cyan")
-    table.add_column("URLs Crawled", justify="right", style="green")
-    table.add_column("Time (seconds)", justify="right", style="yellow")
-    table.add_column("URLs/second", justify="right", style="magenta")
-
-    sorted_results = sorted(results.items(), key=lambda x: x[1][1])
-
-    for strategy, (urls_crawled, duration) in sorted_results:
-        urls_per_second = urls_crawled / duration
-        table.add_row(
-            strategy, str(urls_crawled), f"{duration:.2f}", f"{urls_per_second:.2f}"
-        )
-
-    return table
-
-
-async def main():
-    urls = [f"https://example.com/page{i}" for i in range(1, 20)]
-    browser_config = BrowserConfig(headless=True, verbose=False)
-    run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, scraping_strategy=LXMLWebScrapingStrategy())
-
-    results = {
-        "Memory Adaptive": await memory_adaptive(urls, browser_config, run_config),
-        "Memory Adaptive + Rate Limit": await memory_adaptive_with_rate_limit(
-            urls, browser_config, run_config
-        ),
-        "Semaphore": await semaphore(urls, browser_config, run_config),
-        "Semaphore + Rate Limit": await semaphore_with_rate_limit(
-            urls, browser_config, run_config
-        ),
-    }
-
-    table = create_performance_table(results)
-    print("\nPerformance Summary:")
-    print(table)
-
-
-if __name__ == "__main__":
-    asyncio.run(main())
--- a/docs/examples/docker_example.py
+++ b/docs/examples/docker_example.py
@@ -1,372 +0,0 @@
-import requests
-import json
-import time
-import sys
-import base64
-import os
-from typing import Dict, Any
-
-
-class Crawl4AiTester:
-    def __init__(self, base_url: str = "http://localhost:11235", api_token: str = None):
-        self.base_url = base_url
-        self.api_token = (
-            api_token or os.getenv("CRAWL4AI_API_TOKEN") or "test_api_code"
-        )  # Check environment variable as fallback
-        self.headers = (
-            {"Authorization": f"Bearer {self.api_token}"} if self.api_token else {}
-        )
-
-    def submit_and_wait(
-        self, request_data: Dict[str, Any], timeout: int = 300
-    ) -> Dict[str, Any]:
-        # Submit crawl job
-        response = requests.post(
-            f"{self.base_url}/crawl", json=request_data, headers=self.headers
-        )
-        if response.status_code == 403:
-            raise Exception("API token is invalid or missing")
-        task_id = response.json()["task_id"]
-        print(f"Task ID: {task_id}")
-
-        # Poll for result
-        start_time = time.time()
-        while True:
-            if time.time() - start_time > timeout:
-                raise TimeoutError(
-                    f"Task {task_id} did not complete within {timeout} seconds"
-                )
-
-            result = requests.get(
-                f"{self.base_url}/task/{task_id}", headers=self.headers
-            )
-            status = result.json()
-
-            if status["status"] == "failed":
-                print("Task failed:", status.get("error"))
-                raise Exception(f"Task failed: {status.get('error')}")
-
-            if status["status"] == "completed":
-                return status
-
-            time.sleep(2)
-
-    def submit_sync(self, request_data: Dict[str, Any]) -> Dict[str, Any]:
-        response = requests.post(
-            f"{self.base_url}/crawl_sync",
-            json=request_data,
-            headers=self.headers,
-            timeout=60,
-        )
-        if response.status_code == 408:
-            raise TimeoutError("Task did not complete within server timeout")
-        response.raise_for_status()
-        return response.json()
-
-    def crawl_direct(self, request_data: Dict[str, Any]) -> Dict[str, Any]:
-        """Directly crawl without using task queue"""
-        response = requests.post(
-            f"{self.base_url}/crawl_direct", json=request_data, headers=self.headers
-        )
-        response.raise_for_status()
-        return response.json()
-
-
-def test_docker_deployment(version="basic"):
-    tester = Crawl4AiTester(
-        base_url="http://localhost:11235",
-        # base_url="https://api.crawl4ai.com" # just for example
-        # api_token="test" # just for example
-    )
-    print(f"Testing Crawl4AI Docker {version} version")
-
-    # Health check with timeout and retry
-    max_retries = 5
-    for i in range(max_retries):
-        try:
-            health = requests.get(f"{tester.base_url}/health", timeout=10)
-            print("Health check:", health.json())
-            break
-        except requests.exceptions.RequestException:
-            if i == max_retries - 1:
-                print(f"Failed to connect after {max_retries} attempts")
-                sys.exit(1)
-            print(f"Waiting for service to start (attempt {i+1}/{max_retries})...")
-            time.sleep(5)
-
-    # Test cases based on version
-    test_basic_crawl_direct(tester)
-    test_basic_crawl(tester)
-    test_basic_crawl(tester)
-    test_basic_crawl_sync(tester)
-
-    if version in ["full", "transformer"]:
-        test_cosine_extraction(tester)
-
-    test_js_execution(tester)
-    test_css_selector(tester)
-    test_structured_extraction(tester)
-    test_llm_extraction(tester)
-    test_llm_with_ollama(tester)
-    test_screenshot(tester)
-
-
-def test_basic_crawl(tester: Crawl4AiTester):
-    print("\n=== Testing Basic Crawl ===")
-    request = {
-        "urls": "https://www.nbcnews.com/business",
-        "priority": 10,
-        "session_id": "test",
-    }
-
-    result = tester.submit_and_wait(request)
-    print(f"Basic crawl result length: {len(result['result']['markdown'])}")
-    assert result["result"]["success"]
-    assert len(result["result"]["markdown"]) > 0
-
-
-def test_basic_crawl_sync(tester: Crawl4AiTester):
-    print("\n=== Testing Basic Crawl (Sync) ===")
-    request = {
-        "urls": "https://www.nbcnews.com/business",
-        "priority": 10,
-        "session_id": "test",
-    }
-
-    result = tester.submit_sync(request)
-    print(f"Basic crawl result length: {len(result['result']['markdown'])}")
-    assert result["status"] == "completed"
-    assert result["result"]["success"]
-    assert len(result["result"]["markdown"]) > 0
-
-
-def test_basic_crawl_direct(tester: Crawl4AiTester):
-    print("\n=== Testing Basic Crawl (Direct) ===")
-    request = {
-        "urls": "https://www.nbcnews.com/business",
-        "priority": 10,
-        # "session_id": "test"
-        "cache_mode": "bypass",  # or "enabled", "disabled", "read_only", "write_only"
-    }
-
-    result = tester.crawl_direct(request)
-    print(f"Basic crawl result length: {len(result['result']['markdown'])}")
-    assert result["result"]["success"]
-    assert len(result["result"]["markdown"]) > 0
-
-
-def test_js_execution(tester: Crawl4AiTester):
-    print("\n=== Testing JS Execution ===")
-    request = {
-        "urls": "https://www.nbcnews.com/business",
-        "priority": 8,
-        "js_code": [
-            "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
-        ],
-        "wait_for": "article.tease-card:nth-child(10)",
-        "crawler_params": {"headless": True},
-    }
-
-    result = tester.submit_and_wait(request)
-    print(f"JS execution result length: {len(result['result']['markdown'])}")
-    assert result["result"]["success"]
-
-
-def test_css_selector(tester: Crawl4AiTester):
-    print("\n=== Testing CSS Selector ===")
-    request = {
-        "urls": "https://www.nbcnews.com/business",
-        "priority": 7,
-        "css_selector": ".wide-tease-item__description",
-        "crawler_params": {"headless": True},
-        "extra": {"word_count_threshold": 10},
-    }
-
-    result = tester.submit_and_wait(request)
-    print(f"CSS selector result length: {len(result['result']['markdown'])}")
-    assert result["result"]["success"]
-
-
-def test_structured_extraction(tester: Crawl4AiTester):
-    print("\n=== Testing Structured Extraction ===")
-    schema = {
-        "name": "Coinbase Crypto Prices",
-        "baseSelector": ".cds-tableRow-t45thuk",
-        "fields": [
-            {
-                "name": "crypto",
-                "selector": "td:nth-child(1) h2",
-                "type": "text",
-            },
-            {
-                "name": "symbol",
-                "selector": "td:nth-child(1) p",
-                "type": "text",
-            },
-            {
-                "name": "price",
-                "selector": "td:nth-child(2)",
-                "type": "text",
-            },
-        ],
-    }
-
-    request = {
-        "urls": "https://www.coinbase.com/explore",
-        "priority": 9,
-        "extraction_config": {"type": "json_css", "params": {"schema": schema}},
-    }
-
-    result = tester.submit_and_wait(request)
-    extracted = json.loads(result["result"]["extracted_content"])
-    print(f"Extracted {len(extracted)} items")
-    print("Sample item:", json.dumps(extracted[0], indent=2))
-    assert result["result"]["success"]
-    assert len(extracted) > 0
-
-
-def test_llm_extraction(tester: Crawl4AiTester):
-    print("\n=== Testing LLM Extraction ===")
-    schema = {
-        "type": "object",
-        "properties": {
-            "model_name": {
-                "type": "string",
-                "description": "Name of the OpenAI model.",
-            },
-            "input_fee": {
-                "type": "string",
-                "description": "Fee for input token for the OpenAI model.",
-            },
-            "output_fee": {
-                "type": "string",
-                "description": "Fee for output token for the OpenAI model.",
-            },
-        },
-        "required": ["model_name", "input_fee", "output_fee"],
-    }
-
-    request = {
-        "urls": "https://openai.com/api/pricing",
-        "priority": 8,
-        "extraction_config": {
-            "type": "llm",
-            "params": {
-                "provider": "openai/gpt-4o-mini",
-                "api_token": os.getenv("OPENAI_API_KEY"),
-                "schema": schema,
-                "extraction_type": "schema",
-                "instruction": """From the crawled content, extract all mentioned model names along with their fees for input and output tokens.""",
-            },
-        },
-        "crawler_params": {"word_count_threshold": 1},
-    }
-
-    try:
-        result = tester.submit_and_wait(request)
-        extracted = json.loads(result["result"]["extracted_content"])
-        print(f"Extracted {len(extracted)} model pricing entries")
-        print("Sample entry:", json.dumps(extracted[0], indent=2))
-        assert result["result"]["success"]
-    except Exception as e:
-        print(f"LLM extraction test failed (might be due to missing API key): {str(e)}")
-
-
-def test_llm_with_ollama(tester: Crawl4AiTester):
-    print("\n=== Testing LLM with Ollama ===")
-    schema = {
-        "type": "object",
-        "properties": {
-            "article_title": {
-                "type": "string",
-                "description": "The main title of the news article",
-            },
-            "summary": {
-                "type": "string",
-                "description": "A brief summary of the article content",
-            },
-            "main_topics": {
-                "type": "array",
-                "items": {"type": "string"},
-                "description": "Main topics or themes discussed in the article",
-            },
-        },
-    }
-
-    request = {
-        "urls": "https://www.nbcnews.com/business",
-        "priority": 8,
-        "extraction_config": {
-            "type": "llm",
-            "params": {
-                "provider": "ollama/llama2",
-                "schema": schema,
-                "extraction_type": "schema",
-                "instruction": "Extract the main article information including title, summary, and main topics.",
-            },
-        },
-        "extra": {"word_count_threshold": 1},
-        "crawler_params": {"verbose": True},
-    }
-
-    try:
-        result = tester.submit_and_wait(request)
-        extracted = json.loads(result["result"]["extracted_content"])
-        print("Extracted content:", json.dumps(extracted, indent=2))
-        assert result["result"]["success"]
-    except Exception as e:
-        print(f"Ollama extraction test failed: {str(e)}")
-
-
-def test_cosine_extraction(tester: Crawl4AiTester):
-    print("\n=== Testing Cosine Extraction ===")
-    request = {
-        "urls": "https://www.nbcnews.com/business",
-        "priority": 8,
-        "extraction_config": {
-            "type": "cosine",
-            "params": {
-                "semantic_filter": "business finance economy",
-                "word_count_threshold": 10,
-                "max_dist": 0.2,
-                "top_k": 3,
-            },
-        },
-    }
-
-    try:
-        result = tester.submit_and_wait(request)
-        extracted = json.loads(result["result"]["extracted_content"])
-        print(f"Extracted {len(extracted)} text clusters")
-        print("First cluster tags:", extracted[0]["tags"])
-        assert result["result"]["success"]
-    except Exception as e:
-        print(f"Cosine extraction test failed: {str(e)}")
-
-
-def test_screenshot(tester: Crawl4AiTester):
-    print("\n=== Testing Screenshot ===")
-    request = {
-        "urls": "https://www.nbcnews.com/business",
-        "priority": 5,
-        "screenshot": True,
-        "crawler_params": {"headless": True},
-    }
-
-    result = tester.submit_and_wait(request)
-    print("Screenshot captured:", bool(result["result"]["screenshot"]))
-
-    if result["result"]["screenshot"]:
-        # Save screenshot
-        screenshot_data = base64.b64decode(result["result"]["screenshot"])
-        with open("test_screenshot.jpg", "wb") as f:
-            f.write(screenshot_data)
-        print("Screenshot saved as test_screenshot.jpg")
-
-    assert result["result"]["success"]
-
-
-if __name__ == "__main__":
-    version = sys.argv[1] if len(sys.argv) > 1 else "basic"
-    # version = "full"
-    test_docker_deployment(version)
--- a/docs/examples/extraction_strategies_example.py
+++ b/docs/examples/extraction_strategies_example.py
@@ -1,127 +0,0 @@
-"""
-Example demonstrating different extraction strategies with various input formats.
-This example shows how to:
-1. Use different input formats (markdown, HTML, fit_markdown)
-2. Work with JSON-based extractors (CSS and XPath)
-3. Use LLM-based extraction with different input formats
-4. Configure browser and crawler settings properly
-"""
-
-import asyncio
-import os
-
-from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
-from crawl4ai.extraction_strategy import (
-    LLMExtractionStrategy,
-    JsonCssExtractionStrategy,
-    JsonXPathExtractionStrategy,
-)
-from crawl4ai.content_filter_strategy import PruningContentFilter
-from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
-
-
-async def run_extraction(crawler: AsyncWebCrawler, url: str, strategy, name: str):
-    """Helper function to run extraction with proper configuration"""
-    try:
-        # Configure the crawler run settings
-        config = CrawlerRunConfig(
-            cache_mode=CacheMode.BYPASS,
-            extraction_strategy=strategy,
-            markdown_generator=DefaultMarkdownGenerator(
-                content_filter=PruningContentFilter()  # For fit_markdown support
-            ),
-        )
-
-        # Run the crawler
-        result = await crawler.arun(url=url, config=config)
-
-        if result.success:
-            print(f"\n=== {name} Results ===")
-            print(f"Extracted Content: {result.extracted_content}")
-            print(f"Raw Markdown Length: {len(result.markdown_v2.raw_markdown)}")
-            print(
-                f"Citations Markdown Length: {len(result.markdown_v2.markdown_with_citations)}"
-            )
-        else:
-            print(f"Error in {name}: Crawl failed")
-
-    except Exception as e:
-        print(f"Error in {name}: {str(e)}")
-
-
-async def main():
-    # Example URL (replace with actual URL)
-    url = "https://example.com/product-page"
-
-    # Configure browser settings
-    browser_config = BrowserConfig(headless=True, verbose=True)
-
-    # Initialize extraction strategies
-
-    # 1. LLM Extraction with different input formats
-    markdown_strategy = LLMExtractionStrategy(
-        provider="openai/gpt-4o-mini",
-        api_token=os.getenv("OPENAI_API_KEY"),
-        instruction="Extract product information including name, price, and description",
-    )
-
-    html_strategy = LLMExtractionStrategy(
-        input_format="html",
-        provider="openai/gpt-4o-mini",
-        api_token=os.getenv("OPENAI_API_KEY"),
-        instruction="Extract product information from HTML including structured data",
-    )
-
-    fit_markdown_strategy = LLMExtractionStrategy(
-        input_format="fit_markdown",
-        provider="openai/gpt-4o-mini",
-        api_token=os.getenv("OPENAI_API_KEY"),
-        instruction="Extract product information from cleaned markdown",
-    )
-
-    # 2. JSON CSS Extraction (automatically uses HTML input)
-    css_schema = {
-        "baseSelector": ".product",
-        "fields": [
-            {"name": "title", "selector": "h1.product-title", "type": "text"},
-            {"name": "price", "selector": ".price", "type": "text"},
-            {"name": "description", "selector": ".description", "type": "text"},
-        ],
-    }
-    css_strategy = JsonCssExtractionStrategy(schema=css_schema)
-
-    # 3. JSON XPath Extraction (automatically uses HTML input)
-    xpath_schema = {
-        "baseSelector": "//div[@class='product']",
-        "fields": [
-            {
-                "name": "title",
-                "selector": ".//h1[@class='product-title']/text()",
-                "type": "text",
-            },
-            {
-                "name": "price",
-                "selector": ".//span[@class='price']/text()",
-                "type": "text",
-            },
-            {
-                "name": "description",
-                "selector": ".//div[@class='description']/text()",
-                "type": "text",
-            },
-        ],
-    }
-    xpath_strategy = JsonXPathExtractionStrategy(schema=xpath_schema)
-
-    # Use context manager for proper resource handling
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        # Run all strategies
-        await run_extraction(crawler, url, markdown_strategy, "Markdown LLM")
-        await run_extraction(crawler, url, html_strategy, "HTML LLM")
-        await run_extraction(crawler, url, fit_markdown_strategy, "Fit Markdown LLM")
-        await run_extraction(crawler, url, css_strategy, "CSS Extraction")
-        await run_extraction(crawler, url, xpath_strategy, "XPath Extraction")
-
-
-if __name__ == "__main__":
-    asyncio.run(main())
--- a/docs/examples/full_page_screenshot_and_pdf_export.md
+++ b/docs/examples/full_page_screenshot_and_pdf_export.md
@@ -1,58 +0,0 @@
-# Capturing Full-Page Screenshots and PDFs from Massive Webpages with Crawl4AI
-
-When dealing with very long web pages, traditional full-page screenshots can be slow or fail entirely. For large pages (like extensive Wikipedia articles), generating a single massive screenshot often leads to delays, memory issues, or style differences.
-
-**The New Approach:**
-We’ve introduced a new feature that effortlessly handles even the biggest pages by first exporting them as a PDF, then converting that PDF into a high-quality image. This approach leverages the browser’s built-in PDF rendering, making it both stable and efficient for very long content. You also have the option to directly save the PDF for your own usage—no need for multiple passes or complex stitching logic.
-
-**Key Benefits:**
-- **Reliability:** The PDF export never times out and works regardless of page length.
-- **Versatility:** Get both the PDF and a screenshot in one crawl, without reloading or reprocessing.
-- **Performance:** Skips manual scrolling and stitching images, reducing complexity and runtime.
-
-**Simple Example:**
-```python
-import os, sys
-import asyncio
-from crawl4ai import AsyncWebCrawler, CacheMode
-
-# Adjust paths as needed
-parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
-sys.path.append(parent_dir)
-__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
-
-async def main():
-    async with AsyncWebCrawler() as crawler:
-        # Request both PDF and screenshot
-        result = await crawler.arun(
-            url='https://en.wikipedia.org/wiki/List_of_common_misconceptions',
-            cache_mode=CacheMode.BYPASS,
-            pdf=True,
-            screenshot=True
-        )
-        
-        if result.success:
-            from base64 import b64decode  # imported up front so the PDF branch can use it too
-            # Save screenshot
-            if result.screenshot:
-                with open(os.path.join(__location__, "screenshot.png"), "wb") as f:
-                    f.write(b64decode(result.screenshot))
-            
-            # Save PDF
-            if result.pdf:
-                pdf_bytes = b64decode(result.pdf)
-                with open(os.path.join(__location__, "page.pdf"), "wb") as f:
-                    f.write(pdf_bytes)
-
-if __name__ == "__main__":
-    asyncio.run(main())
-```
-
-**What Happens Under the Hood:**
-- Crawl4AI navigates to the target page.
-- If `pdf=True`, it exports the current page as a full PDF, capturing all of its content no matter the length.
-- If `screenshot=True`, and a PDF is already available, it directly converts the first page of that PDF to an image for you, with no repeated loading or scrolling.
-- Finally, you get your PDF and/or screenshot ready to use.
-
-**Conclusion:**
-With this feature, Crawl4AI becomes even more robust and versatile for large-scale content extraction. Whether you need a PDF snapshot or a quick screenshot, you now have a reliable solution for even the most extensive webpages.
--- a/docs/examples/hello_world.py
+++ b/docs/examples/hello_world.py
@@ -1,23 +0,0 @@
-import asyncio
-from crawl4ai import *
-
-
-async def main():
-    browser_config = BrowserConfig(headless=True, verbose=True)
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        crawler_config = CrawlerRunConfig(
-            cache_mode=CacheMode.BYPASS,
-            markdown_generator=DefaultMarkdownGenerator(
-                content_filter=PruningContentFilter(
-                    threshold=0.48, threshold_type="fixed", min_word_threshold=0
-                )
-            ),
-        )
-        result = await crawler.arun(
-            url="https://www.helloworld.org", config=crawler_config
-        )
-        print(result.markdown_v2.raw_markdown[:500])
-
-
-if __name__ == "__main__":
-    asyncio.run(main())
--- a/docs/examples/hooks_example.py
+++ b/docs/examples/hooks_example.py
@@ -1,118 +0,0 @@
-from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
-from playwright.async_api import Page, BrowserContext
-
-
-async def main():
-    print("🔗 Hooks Example: Demonstrating different hook use cases")
-
-    # Configure browser settings
-    browser_config = BrowserConfig(headless=True)
-
-    # Configure crawler settings
-    crawler_run_config = CrawlerRunConfig(
-        js_code="window.scrollTo(0, document.body.scrollHeight);",
-        wait_for="body",
-        cache_mode=CacheMode.BYPASS,
-    )
-
-    # Create crawler instance
-    crawler = AsyncWebCrawler(config=browser_config)
-
-    # Define and set hook functions
-    async def on_browser_created(browser, context: BrowserContext, **kwargs):
-        """Hook called after the browser is created"""
-        print("[HOOK] on_browser_created - Browser is ready!")
-        # Example: browser-wide setup (such as shared cookies) could go here
-        return browser
-
-    async def on_page_context_created(page: Page, context: BrowserContext, **kwargs):
-        """Hook called after a new page and context are created"""
-        print("[HOOK] on_page_context_created - New page created!")
-        # Example: Add a session cookie and set the default viewport size
-        await context.add_cookies(
-            [
-                {
-                    "name": "session_id",
-                    "value": "example_session",
-                    "domain": ".example.com",
-                    "path": "/",
-                }
-            ]
-        )
-        await page.set_viewport_size({"width": 1080, "height": 800})
-        return page
-
-    async def on_user_agent_updated(
-        page: Page, context: BrowserContext, user_agent: str, **kwargs
-    ):
-        """Hook called when the user agent is updated"""
-        print(f"[HOOK] on_user_agent_updated - New user agent: {user_agent}")
-        return page
-
-    async def on_execution_started(page: Page, context: BrowserContext, **kwargs):
-        """Hook called after custom JavaScript execution"""
-        print("[HOOK] on_execution_started - Custom JS executed!")
-        return page
-
-    async def before_goto(page: Page, context: BrowserContext, url: str, **kwargs):
-        """Hook called before navigating to each URL"""
-        print(f"[HOOK] before_goto - About to visit: {url}")
-        # Example: Add custom headers for the request
-        await page.set_extra_http_headers({"Custom-Header": "my-value"})
-        return page
-
-    async def after_goto(
-        page: Page, context: BrowserContext, url: str, response: dict, **kwargs
-    ):
-        """Hook called after navigating to each URL"""
-        print(f"[HOOK] after_goto - Successfully loaded: {url}")
-        # Example: Wait for a specific element to be loaded
-        try:
-            await page.wait_for_selector(".content", timeout=1000)
-            print("Content element found!")
-        except Exception:
-            print("Content element not found, continuing anyway")
-        return page
-
-    async def before_retrieve_html(page: Page, context: BrowserContext, **kwargs):
-        """Hook called before retrieving the HTML content"""
-        print("[HOOK] before_retrieve_html - About to get HTML content")
-        # Example: Scroll to bottom to trigger lazy loading
-        await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
-        return page
-
-    async def before_return_html(
-        page: Page, context: BrowserContext, html: str, **kwargs
-    ):
-        """Hook called before returning the HTML content"""
-        print(f"[HOOK] before_return_html - Got HTML content (length: {len(html)})")
-        # Example: You could modify the HTML content here if needed
-        return page
-
-    # Set all the hooks
-    crawler.crawler_strategy.set_hook("on_browser_created", on_browser_created)
-    crawler.crawler_strategy.set_hook(
-        "on_page_context_created", on_page_context_created
-    )
-    crawler.crawler_strategy.set_hook("on_user_agent_updated", on_user_agent_updated)
-    crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
-    crawler.crawler_strategy.set_hook("before_goto", before_goto)
-    crawler.crawler_strategy.set_hook("after_goto", after_goto)
-    crawler.crawler_strategy.set_hook("before_retrieve_html", before_retrieve_html)
-    crawler.crawler_strategy.set_hook("before_return_html", before_return_html)
-
-    await crawler.start()
-
-    # Example usage: crawl a simple website
-    url = "https://example.com"
-    result = await crawler.arun(url, config=crawler_run_config)
-    print(f"\nCrawled URL: {result.url}")
-    print(f"HTML length: {len(result.html)}")
-
-    await crawler.close()
-
-
-if __name__ == "__main__":
-    import asyncio
-
-    asyncio.run(main())
--- a/docs/examples/language_support_example.py
+++ b/docs/examples/language_support_example.py
@@ -1,7 +1,6 @@
 import asyncio
 from crawl4ai import AsyncWebCrawler, AsyncPlaywrightCrawlerStrategy

-
 async def main():
    # Example 1: Setting language when creating the crawler
    crawler1 = AsyncWebCrawler(
@@ -10,15 +9,11 @@ async def main():
        )
    )
    result1 = await crawler1.arun("https://www.example.com")
-    print(
-        "Example 1 result:", result1.extracted_content[:100]
-    )  # Print first 100 characters
+    print("Example 1 result:", result1.extracted_content[:100])  # Print first 100 characters

    # Example 2: Setting language before crawling
    crawler2 = AsyncWebCrawler()
-    crawler2.crawler_strategy.headers[
-        "Accept-Language"
-    ] = "es-ES,es;q=0.9,en-US;q=0.8,en;q=0.7"
+    crawler2.crawler_strategy.headers["Accept-Language"] = "es-ES,es;q=0.9,en-US;q=0.8,en;q=0.7"
    result2 = await crawler2.arun("https://www.example.com")
    print("Example 2 result:", result2.extracted_content[:100])

@@ -26,7 +21,7 @@ async def main():
    crawler3 = AsyncWebCrawler()
    result3 = await crawler3.arun(
        "https://www.example.com",
-        headers={"Accept-Language": "de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7"},
+        headers={"Accept-Language": "de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7"}
    )
    print("Example 3 result:", result3.extracted_content[:100])

@@ -36,15 +31,15 @@ async def main():
        ("https://www.example.org", "es-ES,es;q=0.9"),
        ("https://www.example.net", "de-DE,de;q=0.9"),
    ]
-
+    
    crawler4 = AsyncWebCrawler()
-    results = await asyncio.gather(
-        *[crawler4.arun(url, headers={"Accept-Language": lang}) for url, lang in urls]
-    )
-
+    results = await asyncio.gather(*[
+        crawler4.arun(url, headers={"Accept-Language": lang})
+        for url, lang in urls
+    ])
+    
    for url, result in zip([u for u, _ in urls], results):
        print(f"Result for {url}:", result.extracted_content[:100])

-
 if __name__ == "__main__":
-    asyncio.run(main())
+    asyncio.run(main())
--- a/docs/examples/llm_extraction_openai_pricing.py
+++ b/docs/examples/llm_extraction_openai_pricing.py
@@ -1,46 +1,41 @@
+import os
+import time
+from crawl4ai.web_crawler import WebCrawler
+from crawl4ai.chunking_strategy import *
 from crawl4ai.extraction_strategy import *
 from crawl4ai.crawler_strategy import *
-import asyncio
+
+url = r'https://openai.com/api/pricing/'
+
+crawler = WebCrawler()
+crawler.warmup()
+
 from pydantic import BaseModel, Field

-url = r"https://openai.com/api/pricing/"
-
-
 class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
-    output_fee: str = Field(
-        ..., description="Fee for output token for the OpenAI model."
-    )
+    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

+result = crawler.run(
+    url=url,
+    word_count_threshold=1,
+    extraction_strategy= LLMExtractionStrategy(
+        # provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'), 
+        provider= "groq/llama-3.1-70b-versatile", api_token = os.getenv('GROQ_API_KEY'), 
+        schema=OpenAIModelFee.model_json_schema(),
+        extraction_type="schema",
+        instruction="From the crawled content, extract all mentioned model names along with their "\
+            "fees for input and output tokens. Make sure not to miss anything in the entire content. "\
+            'One extracted model JSON format should look like this: '\
+            '{ "model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens" }'
+    ),
+    bypass_cache=True,
+)

-from crawl4ai import AsyncWebCrawler
+model_fees = json.loads(result.extracted_content)

+print(len(model_fees))

-async def main():
-    # Use AsyncWebCrawler
-    async with AsyncWebCrawler() as crawler:
-        result = await crawler.arun(
-            url=url,
-            word_count_threshold=1,
-            extraction_strategy=LLMExtractionStrategy(
-                # provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'),
-                provider="groq/llama-3.1-70b-versatile",
-                api_token=os.getenv("GROQ_API_KEY"),
-                schema=OpenAIModelFee.model_json_schema(),
-                extraction_type="schema",
-                instruction="From the crawled content, extract all mentioned model names along with their "
-                "fees for input and output tokens. Make sure not to miss anything in the entire content. "
-                "One extracted model JSON format should look like this: "
-                '{ "model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens" }',
-            ),
-        )
-        print("Success:", result.success)
-        model_fees = json.loads(result.extracted_content)
-        print(len(model_fees))
-
-        with open(".data/data.json", "w", encoding="utf-8") as f:
-            f.write(result.extracted_content)
-
-
-asyncio.run(main())
+with open(".data/data.json", "w", encoding="utf-8") as f:
+    f.write(result.extracted_content)
--- a/docs/examples/llm_markdown_generator.py
+++ b/docs/examples/llm_markdown_generator.py
@@ -1,87 +0,0 @@
-import os
-import asyncio
-from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
-from crawl4ai.content_filter_strategy import LLMContentFilter
-
-async def test_llm_filter():
-    # Create an HTML source that needs intelligent filtering
-    url = "https://docs.python.org/3/tutorial/classes.html"
-    
-    browser_config = BrowserConfig(
-        headless=True,
-        verbose=True
-    )
-    
-    # run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
-    run_config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
-    
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        # First get the cleaned HTML
-        result = await crawler.arun(url, config=run_config)
-        html = result.cleaned_html
-
-        # Initialize LLM filter with focused instruction
-        filter = LLMContentFilter(
-            provider="openai/gpt-4o",
-            api_token=os.getenv('OPENAI_API_KEY'),
-            instruction="""
-            Focus on extracting the core educational content about Python classes.
-            Include:
-            - Key concepts and their explanations
-            - Important code examples
-            - Essential technical details
-            Exclude:
-            - Navigation elements
-            - Sidebars
-            - Footer content
-            - Version information
-            - Any non-essential UI elements
-            
-            Format the output as clean markdown with proper code blocks and headers.
-            """,
-            verbose=True
-        )
-        
-        filter = LLMContentFilter(
-            provider="openai/gpt-4o",
-            api_token=os.getenv('OPENAI_API_KEY'),
-            chunk_token_threshold=2 ** 12 * 2, # 4096 * 2 = 8192
-            instruction="""
-            Extract the main educational content while preserving its original wording and substance completely. Your task is to:
-
-            1. Maintain the exact language and terminology used in the main content
-            2. Keep all technical explanations, examples, and educational content intact
-            3. Preserve the original flow and structure of the core content
-            4. Remove only clearly irrelevant elements like:
-            - Navigation menus
-            - Advertisement sections
-            - Cookie notices
-            - Footers with site information
-            - Sidebars with external links
-            - Any UI elements that don't contribute to learning
-
-            The goal is to create a clean markdown version that reads exactly like the original article, 
-            keeping all valuable content but free from distracting elements. Imagine you're creating 
-            a perfect reading experience where nothing valuable is lost, but all noise is removed.
-            """,
-            verbose=True
-        )        
-
-        # Apply filtering
-        filtered_content = filter.filter_content(html, ignore_cache = True)
-        
-        # Show results
-        print("\nFiltered Content Length:", len(filtered_content))
-        print("\nFirst 500 chars of filtered content:")
-        if filtered_content:
-            print(filtered_content[0][:500])
-        
-        # Save the markdown version to disk
-        with open("filtered_content.md", "w", encoding="utf-8") as f:
-            f.write("\n".join(filtered_content))
-        
-        # Show token usage
-        filter.show_usage()
-
-if __name__ == "__main__":
-    asyncio.run(test_llm_filter())
--- a/docs/examples/quickstart.ipynb
+++ b/docs/examples/quickstart.ipynb
--- a/docs/examples/quickstart_async.config.py
+++ b/docs/examples/quickstart_async.config.py
@@ -1,618 +0,0 @@
-import os, sys
-
-sys.path.append(
-    os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
-)
-
-import asyncio
-import time
-import json
-import re
-from typing import Dict
-from bs4 import BeautifulSoup
-from pydantic import BaseModel, Field
-from crawl4ai import AsyncWebCrawler, CacheMode, BrowserConfig, CrawlerRunConfig
-from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
-from crawl4ai.content_filter_strategy import PruningContentFilter
-from crawl4ai.extraction_strategy import (
-    JsonCssExtractionStrategy,
-    LLMExtractionStrategy,
-)
-
-__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
-
-print("Crawl4AI: Advanced Web Crawling and Data Extraction")
-print("GitHub Repository: https://github.com/unclecode/crawl4ai")
-print("Twitter: @unclecode")
-print("Website: https://crawl4ai.com")
-
-
-# Basic Example - Simple Crawl
-async def simple_crawl():
-    print("\n--- Basic Usage ---")
-    browser_config = BrowserConfig(headless=True)
-    crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
-
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        result = await crawler.arun(
-            url="https://www.nbcnews.com/business", config=crawler_config
-        )
-        print(result.markdown[:500])
-
-
-async def clean_content():
-    crawler_config = CrawlerRunConfig(
-        cache_mode=CacheMode.BYPASS,
-        excluded_tags=["nav", "footer", "aside"],
-        remove_overlay_elements=True,
-        markdown_generator=DefaultMarkdownGenerator(
-            content_filter=PruningContentFilter(
-                threshold=0.48, threshold_type="fixed", min_word_threshold=0
-            ),
-            options={"ignore_links": True},
-        ),
-    )
-    async with AsyncWebCrawler() as crawler:
-        result = await crawler.arun(
-            url="https://en.wikipedia.org/wiki/Apple",
-            config=crawler_config,
-        )
-        full_markdown_length = len(result.markdown_v2.raw_markdown)
-        fit_markdown_length = len(result.markdown_v2.fit_markdown)
-        print(f"Full Markdown Length: {full_markdown_length}")
-        print(f"Fit Markdown Length: {fit_markdown_length}")
-
-
-async def link_analysis():
-    crawler_config = CrawlerRunConfig(
-        cache_mode=CacheMode.ENABLED,
-        exclude_external_links=True,
-        exclude_social_media_links=True,
-    )
-    async with AsyncWebCrawler() as crawler:
-        result = await crawler.arun(
-            url="https://www.nbcnews.com/business",
-            config=crawler_config,
-        )
-        print(f"Found {len(result.links['internal'])} internal links")
-        print(f"Found {len(result.links['external'])} external links")
-
-        for link in result.links["internal"][:5]:
-            print(f"Href: {link['href']}\nText: {link['text']}\n")
-
-
-# JavaScript Execution Example
-async def simple_example_with_running_js_code():
-    print("\n--- Executing JavaScript and Using CSS Selectors ---")
-
-    browser_config = BrowserConfig(headless=True, java_script_enabled=True)
-
-    crawler_config = CrawlerRunConfig(
-        cache_mode=CacheMode.BYPASS,
-        js_code="const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();",
-        # wait_for="() => { return Array.from(document.querySelectorAll('article.tease-card')).length > 10; }"
-    )
-
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        result = await crawler.arun(
-            url="https://www.nbcnews.com/business", config=crawler_config
-        )
-        print(result.markdown[:500])
-
-
-# CSS Selector Example
-async def simple_example_with_css_selector():
-    print("\n--- Using CSS Selectors ---")
-    browser_config = BrowserConfig(headless=True)
-    crawler_config = CrawlerRunConfig(
-        cache_mode=CacheMode.BYPASS, css_selector=".wide-tease-item__description"
-    )
-
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        result = await crawler.arun(
-            url="https://www.nbcnews.com/business", config=crawler_config
-        )
-        print(result.markdown[:500])
-
-
-async def media_handling():
-    crawler_config = CrawlerRunConfig(
-        cache_mode=CacheMode.BYPASS, exclude_external_images=True, screenshot=True
-    )
-    async with AsyncWebCrawler() as crawler:
-        result = await crawler.arun(
-            url="https://www.nbcnews.com/business", config=crawler_config
-        )
-        for img in result.media["images"][:5]:
-            print(f"Image URL: {img['src']}, Alt: {img['alt']}, Score: {img['score']}")
-
-
-async def custom_hook_workflow(verbose=True):
-    async with AsyncWebCrawler() as crawler:
-        # Set a 'before_goto' hook to run custom code just before navigation
-        crawler.crawler_strategy.set_hook(
-            "before_goto",
-            lambda page, context: print("[Hook] Preparing to navigate..."),
-        )
-
-        # Perform the crawl operation
-        result = await crawler.arun(url="https://crawl4ai.com")
-        print(result.markdown_v2.raw_markdown[:500].replace("\n", " -- "))
-
-
-# Proxy Example
-async def use_proxy():
-    print("\n--- Using a Proxy ---")
-    browser_config = BrowserConfig(
-        headless=True,
-        proxy_config={
-            "server": "http://proxy.example.com:8080",
-            "username": "username",
-            "password": "password",
-        },
-    )
-    crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
-
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        result = await crawler.arun(
-            url="https://www.nbcnews.com/business", config=crawler_config
-        )
-        if result.success:
-            print(result.markdown[:500])
-
-
-# Screenshot Example
-async def capture_and_save_screenshot(url: str, output_path: str):
-    browser_config = BrowserConfig(headless=True)
-    crawler_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, screenshot=True)
-
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        result = await crawler.arun(url=url, config=crawler_config)
-
-        if result.success and result.screenshot:
-            import base64
-
-            screenshot_data = base64.b64decode(result.screenshot)
-            with open(output_path, "wb") as f:
-                f.write(screenshot_data)
-            print(f"Screenshot saved successfully to {output_path}")
-        else:
-            print("Failed to capture screenshot")
-
-
-# LLM Extraction Example
-class OpenAIModelFee(BaseModel):
-    model_name: str = Field(..., description="Name of the OpenAI model.")
-    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
-    output_fee: str = Field(
-        ..., description="Fee for output token for the OpenAI model."
-    )
-
-
-async def extract_structured_data_using_llm(
-    provider: str, api_token: str = None, extra_headers: Dict[str, str] = None
-):
-    print(f"\n--- Extracting Structured Data with {provider} ---")
-
-    if api_token is None and provider != "ollama":
-        print(f"API token is required for {provider}. Skipping this example.")
-        return
-
-    browser_config = BrowserConfig(headless=True)
-
-    extra_args = {"temperature": 0, "top_p": 0.9, "max_tokens": 2000}
-    if extra_headers:
-        extra_args["extra_headers"] = extra_headers
-
-    crawler_config = CrawlerRunConfig(
-        cache_mode=CacheMode.BYPASS,
-        word_count_threshold=1,
-        page_timeout=80000,
-        extraction_strategy=LLMExtractionStrategy(
-            provider=provider,
-            api_token=api_token,
-            schema=OpenAIModelFee.model_json_schema(),
-            extraction_type="schema",
-            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. 
-            Do not miss any models in the entire content.""",
-            extra_args=extra_args,
-        ),
-    )
-
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        result = await crawler.arun(
-            url="https://openai.com/api/pricing/", config=crawler_config
-        )
-        print(result.extracted_content)
-
-
-# CSS Extraction Example
-async def extract_structured_data_using_css_extractor():
-    print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
-    schema = {
-        "name": "KidoCode Courses",
-        "baseSelector": "section.charge-methodology .framework-collection-item.w-dyn-item",
-        "fields": [
-            {
-                "name": "section_title",
-                "selector": "h3.heading-50",
-                "type": "text",
-            },
-            {
-                "name": "section_description",
-                "selector": ".charge-content",
-                "type": "text",
-            },
-            {
-                "name": "course_name",
-                "selector": ".text-block-93",
-                "type": "text",
-            },
-            {
-                "name": "course_description",
-                "selector": ".course-content-text",
-                "type": "text",
-            },
-            {
-                "name": "course_icon",
-                "selector": ".image-92",
-                "type": "attribute",
-                "attribute": "src",
-            },
-        ],
-    }
-
-    browser_config = BrowserConfig(headless=True, java_script_enabled=True)
-
-    js_click_tabs = """
-    (async () => {
-        const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");
-        for(let tab of tabs) {
-            tab.scrollIntoView();
-            tab.click();
-            await new Promise(r => setTimeout(r, 500));
-        }
-    })();
-    """
-
-    crawler_config = CrawlerRunConfig(
-        cache_mode=CacheMode.BYPASS,
-        extraction_strategy=JsonCssExtractionStrategy(schema),
-        js_code=[js_click_tabs],
-        delay_before_return_html=1
-    )
-
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        result = await crawler.arun(
-            url="https://www.kidocode.com/degrees/technology", config=crawler_config
-        )
-
-        companies = json.loads(result.extracted_content)
-        print(f"Successfully extracted {len(companies)} companies")
-        print(json.dumps(companies[0], indent=2))
-
-
-# Dynamic Content Examples - Method 1
-async def crawl_dynamic_content_pages_method_1():
-    print("\n--- Advanced Multi-Page Crawling with JavaScript Execution ---")
-    first_commit = ""
-
-    async def on_execution_started(page, **kwargs):
-        nonlocal first_commit
-        try:
-            while True:
-                await page.wait_for_selector("li.Box-sc-g0xbh4-0 h4")
-                commit = await page.query_selector("li.Box-sc-g0xbh4-0 h4")
-                commit = await commit.evaluate("(element) => element.textContent")
-                commit = re.sub(r"\s+", "", commit)
-                if commit and commit != first_commit:
-                    first_commit = commit
-                    break
-                await asyncio.sleep(0.5)
-        except Exception as e:
-            print(f"Warning: New content didn't appear after JavaScript execution: {e}")
-
-    browser_config = BrowserConfig(headless=False, java_script_enabled=True)
-
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
-
-        url = "https://github.com/microsoft/TypeScript/commits/main"
-        session_id = "typescript_commits_session"
-        all_commits = []
-
-        js_next_page = """
-        const button = document.querySelector('a[data-testid="pagination-next-button"]');
-        if (button) button.click();
-        """
-
-        for page in range(3):
-            crawler_config = CrawlerRunConfig(
-                cache_mode=CacheMode.BYPASS,
-                css_selector="li.Box-sc-g0xbh4-0",
-                js_code=js_next_page if page > 0 else None,
-                js_only=page > 0,
-                session_id=session_id,
-            )
-
-            result = await crawler.arun(url=url, config=crawler_config)
-            assert result.success, f"Failed to crawl page {page + 1}"
-
-            soup = BeautifulSoup(result.cleaned_html, "html.parser")
-            commits = soup.select("li")
-            all_commits.extend(commits)
-
-            print(f"Page {page + 1}: Found {len(commits)} commits")
-
-        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
-
-
-# Dynamic Content Examples - Method 2
-async def crawl_dynamic_content_pages_method_2():
-    print("\n--- Advanced Multi-Page Crawling with JavaScript Execution ---")
-
-    browser_config = BrowserConfig(headless=False, java_script_enabled=True)
-
-    js_next_page_and_wait = """
-    (async () => {
-        const getCurrentCommit = () => {
-            const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
-            return commits.length > 0 ? commits[0].textContent.trim() : null;
-        };
-
-        const initialCommit = getCurrentCommit();
-        const button = document.querySelector('a[data-testid="pagination-next-button"]');
-        if (button) button.click();
-
-        while (true) {
-            await new Promise(resolve => setTimeout(resolve, 100));
-            const newCommit = getCurrentCommit();
-            if (newCommit && newCommit !== initialCommit) {
-                break;
-            }
-        }
-    })();
-    """
-
-    schema = {
-        "name": "Commit Extractor",
-        "baseSelector": "li.Box-sc-g0xbh4-0",
-        "fields": [
-            {
-                "name": "title",
-                "selector": "h4.markdown-title",
-                "type": "text",
-                "transform": "strip",
-            },
-        ],
-    }
-
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        url = "https://github.com/microsoft/TypeScript/commits/main"
-        session_id = "typescript_commits_session"
-        all_commits = []
-
-        extraction_strategy = JsonCssExtractionStrategy(schema)
-
-        for page in range(3):
-            crawler_config = CrawlerRunConfig(
-                cache_mode=CacheMode.BYPASS,
-                css_selector="li.Box-sc-g0xbh4-0",
-                extraction_strategy=extraction_strategy,
-                js_code=js_next_page_and_wait if page > 0 else None,
-                js_only=page > 0,
-                session_id=session_id,
-            )
-
-            result = await crawler.arun(url=url, config=crawler_config)
-            assert result.success, f"Failed to crawl page {page + 1}"
-
-            commits = json.loads(result.extracted_content)
-            all_commits.extend(commits)
-            print(f"Page {page + 1}: Found {len(commits)} commits")
-
-        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
-
-
-async def cosine_similarity_extraction():
-    crawl_config = CrawlerRunConfig(
-        cache_mode=CacheMode.BYPASS,
-        extraction_strategy=CosineStrategy(
-            word_count_threshold=10,
-            max_dist=0.2,  # Maximum distance between two words
-            linkage_method="ward",  # Linkage method for hierarchical clustering (ward, complete, average, single)
-            top_k=3,  # Number of top keywords to extract
-            sim_threshold=0.3,  # Similarity threshold for clustering
-            semantic_filter="McDonald's economic impact, American consumer trends",  # Keywords to filter the content semantically using embeddings
-            verbose=True,
-        ),
-    )
-    async with AsyncWebCrawler() as crawler:
-        result = await crawler.arun(
-            url="https://www.nbcnews.com/business/consumer/how-mcdonalds-e-coli-crisis-inflation-politics-reflect-american-story-rcna177156",
-            config=crawl_config,
-        )
-        print(json.loads(result.extracted_content)[:5])
-
-
-# Browser Comparison
-async def crawl_custom_browser_type():
-    print("\n--- Browser Comparison ---")
-
-    # Firefox
-    browser_config_firefox = BrowserConfig(browser_type="firefox", headless=True)
-    start = time.time()
-    async with AsyncWebCrawler(config=browser_config_firefox) as crawler:
-        result = await crawler.arun(
-            url="https://www.example.com",
-            config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS),
-        )
-        print("Firefox:", time.time() - start)
-        print(result.markdown[:500])
-
-    # WebKit
-    browser_config_webkit = BrowserConfig(browser_type="webkit", headless=True)
-    start = time.time()
-    async with AsyncWebCrawler(config=browser_config_webkit) as crawler:
-        result = await crawler.arun(
-            url="https://www.example.com",
-            config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS),
-        )
-        print("WebKit:", time.time() - start)
-        print(result.markdown[:500])
-
-    # Chromium (default)
-    browser_config_chromium = BrowserConfig(browser_type="chromium", headless=True)
-    start = time.time()
-    async with AsyncWebCrawler(config=browser_config_chromium) as crawler:
-        result = await crawler.arun(
-            url="https://www.example.com",
-            config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS),
-        )
-        print("Chromium:", time.time() - start)
-        print(result.markdown[:500])
-
-
-# Anti-Bot and User Simulation
-async def crawl_with_user_simulation():
-    browser_config = BrowserConfig(
-        headless=True,
-        user_agent_mode="random",
-        user_agent_generator_config={"device_type": "mobile", "os_type": "android"},
-    )
-
-    crawler_config = CrawlerRunConfig(
-        cache_mode=CacheMode.BYPASS,
-        magic=True,
-        simulate_user=True,
-        override_navigator=True,
-    )
-
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        result = await crawler.arun(url="YOUR-URL-HERE", config=crawler_config)
-        print(result.markdown)
-
-
-async def ssl_certification():
-    # Configure crawler to fetch SSL certificate
-    config = CrawlerRunConfig(
-        fetch_ssl_certificate=True,
-        cache_mode=CacheMode.BYPASS,  # Bypass cache to always get fresh certificates
-    )
-
-    async with AsyncWebCrawler() as crawler:
-        result = await crawler.arun(url="https://example.com", config=config)
-
-        if result.success and result.ssl_certificate:
-            cert = result.ssl_certificate
-
-            # 1. Access certificate properties directly
-            print("\nCertificate Information:")
-            print(f"Issuer: {cert.issuer.get('CN', '')}")
-            print(f"Valid until: {cert.valid_until}")
-            print(f"Fingerprint: {cert.fingerprint}")
-
-            # 2. Export certificate in different formats
-            cert.to_json(os.path.join(tmp_dir, "certificate.json"))  # For analysis
-            print("\nCertificate exported to:")
-            print(f"- JSON: {os.path.join(tmp_dir, 'certificate.json')}")
-
-            pem_data = cert.to_pem(
-                os.path.join(tmp_dir, "certificate.pem")
-            )  # For web servers
-            print(f"- PEM: {os.path.join(tmp_dir, 'certificate.pem')}")
-
-            der_data = cert.to_der(
-                os.path.join(tmp_dir, "certificate.der")
-            )  # For Java apps
-            print(f"- DER: {os.path.join(tmp_dir, 'certificate.der')}")
-
-
-# Speed Comparison
-async def speed_comparison():
-    print("\n--- Speed Comparison ---")
-
-    # Firecrawl comparison
-    from firecrawl import FirecrawlApp
-
-    app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
-    start = time.time()
-    scrape_status = app.scrape_url(
-        "https://www.nbcnews.com/business", params={"formats": ["markdown", "html"]}
-    )
-    end = time.time()
-    print("Firecrawl:")
-    print(f"Time taken: {end - start:.2f} seconds")
-    print(f"Content length: {len(scrape_status['markdown'])} characters")
-    print(f"Images found: {scrape_status['markdown'].count('cldnry.s-nbcnews.com')}")
-    print()
-
-    # Crawl4AI comparisons
-    browser_config = BrowserConfig(headless=True)
-
-    # Simple crawl
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        start = time.time()
-        result = await crawler.arun(
-            url="https://www.nbcnews.com/business",
-            config=CrawlerRunConfig(
-                cache_mode=CacheMode.BYPASS, word_count_threshold=0
-            ),
-        )
-        end = time.time()
-        print("Crawl4AI (simple crawl):")
-        print(f"Time taken: {end - start:.2f} seconds")
-        print(f"Content length: {len(result.markdown)} characters")
-        print(f"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}")
-        print()
-
-        # Advanced filtering
-        start = time.time()
-        result = await crawler.arun(
-            url="https://www.nbcnews.com/business",
-            config=CrawlerRunConfig(
-                cache_mode=CacheMode.BYPASS,
-                word_count_threshold=0,
-                markdown_generator=DefaultMarkdownGenerator(
-                    content_filter=PruningContentFilter(
-                        threshold=0.48, threshold_type="fixed", min_word_threshold=0
-                    )
-                ),
-            ),
-        )
-        end = time.time()
-        print("Crawl4AI (Markdown Plus):")
-        print(f"Time taken: {end - start:.2f} seconds")
-        print(f"Content length: {len(result.markdown_v2.raw_markdown)} characters")
-        print(f"Fit Markdown: {len(result.markdown_v2.fit_markdown)} characters")
-        print(f"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}")
-        print()
-
-
-# Main execution
-async def main():
-    # Basic examples
-    await simple_crawl()
-    await simple_example_with_running_js_code()
-    await simple_example_with_css_selector()
-
-    # Advanced examples
-    await extract_structured_data_using_css_extractor()
-    await extract_structured_data_using_llm(
-        "openai/gpt-4o", os.getenv("OPENAI_API_KEY")
-    )
-    await crawl_dynamic_content_pages_method_1()
-    await crawl_dynamic_content_pages_method_2()
-
-    # Browser comparisons
-    await crawl_custom_browser_type()
-
-    # Screenshot example
-    await capture_and_save_screenshot(
-        "https://www.example.com",
-        os.path.join(__location__, "tmp/example_screenshot.jpg")
-    )
-
-
-if __name__ == "__main__":
-    asyncio.run(main())
--- a/docs/examples/quickstart_async.py
+++ b/docs/examples/quickstart_async.py
@@ -1,10 +1,6 @@
 import os, sys
-
 # append parent directory to system path
-sys.path.append(
-    os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
-)
-os.environ["FIRECRAWL_API_KEY"] = "fc-84b370ccfad44beabc686b38f1769692"
+sys.path.append(os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))); os.environ['FIRECRAWL_API_KEY'] = "fc-84b370ccfad44beabc686b38f1769692";

 import asyncio
 # import nest_asyncio
@@ -14,12 +10,10 @@ import time
 import json
 import os
 import re
-from typing import Dict, List
+from typing import Dict
 from bs4 import BeautifulSoup
 from pydantic import BaseModel, Field
-from crawl4ai import AsyncWebCrawler, CacheMode
-from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
-from crawl4ai.content_filter_strategy import PruningContentFilter
+from crawl4ai import AsyncWebCrawler
 from crawl4ai.extraction_strategy import (
    JsonCssExtractionStrategy,
    LLMExtractionStrategy,
@@ -36,12 +30,9 @@ print("Website: https://crawl4ai.com")
 async def simple_crawl():
    print("\n--- Basic Usage ---")
    async with AsyncWebCrawler(verbose=True) as crawler:
-        result = await crawler.arun(
-            url="https://www.nbcnews.com/business", cache_mode=CacheMode.BYPASS
-        )
+        result = await crawler.arun(url="https://www.nbcnews.com/business")
        print(result.markdown[:500])  # Print first 500 characters

-
 async def simple_example_with_running_js_code():
    print("\n--- Executing JavaScript and Using CSS Selectors ---")
    # New code to handle the wait_for parameter
@@ -60,59 +51,55 @@ async def simple_example_with_running_js_code():
            url="https://www.nbcnews.com/business",
            js_code=js_code,
            # wait_for=wait_for,
-            cache_mode=CacheMode.BYPASS,
+            bypass_cache=True,
        )
        print(result.markdown[:500])  # Print first 500 characters

-
 async def simple_example_with_css_selector():
    print("\n--- Using CSS Selectors ---")
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            css_selector=".wide-tease-item__description",
-            cache_mode=CacheMode.BYPASS,
+            bypass_cache=True,
        )
        print(result.markdown[:500])  # Print first 500 characters

-
 async def use_proxy():
    print("\n--- Using a Proxy ---")
    print(
        "Note: Replace 'http://your-proxy-url:port' with a working proxy to run this example."
    )
    # Uncomment and modify the following lines to use a proxy
-    async with AsyncWebCrawler(
-        verbose=True, proxy="http://your-proxy-url:port"
-    ) as crawler:
-        result = await crawler.arun(
-            url="https://www.nbcnews.com/business", cache_mode=CacheMode.BYPASS
-        )
-        if result.success:
-            print(result.markdown[:500])  # Print first 500 characters
-
+    # async with AsyncWebCrawler(verbose=True, proxy="http://your-proxy-url:port") as crawler:
+    #     result = await crawler.arun(
+    #         url="https://www.nbcnews.com/business",
+    #         bypass_cache=True
+    #     )
+    #     print(result.markdown[:500])  # Print first 500 characters

 async def capture_and_save_screenshot(url: str, output_path: str):
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
-            url=url, screenshot=True, cache_mode=CacheMode.BYPASS
+            url=url,
+            screenshot=True,
+            bypass_cache=True
        )
-
+        
        if result.success and result.screenshot:
            import base64
-
+            
            # Decode the base64 screenshot data
            screenshot_data = base64.b64decode(result.screenshot)
-
+            
            # Save the screenshot as a JPEG file
-            with open(output_path, "wb") as f:
+            with open(output_path, 'wb') as f:
                f.write(screenshot_data)
-
+            
            print(f"Screenshot saved successfully to {output_path}")
        else:
            print("Failed to capture screenshot")

-
 class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
@@ -120,23 +107,14 @@ class OpenAIModelFee(BaseModel):
        ..., description="Fee for output token for the OpenAI model."
    )

-
-async def extract_structured_data_using_llm(
-    provider: str, api_token: str = None, extra_headers: Dict[str, str] = None
-):
+async def extract_structured_data_using_llm(provider: str, api_token: str = None, extra_headers: Dict[str, str] = None):
    print(f"\n--- Extracting Structured Data with {provider} ---")
-
+    
    if api_token is None and provider != "ollama":
        print(f"API token is required for {provider}. Skipping this example.")
        return

-    # extra_args = {}
-    extra_args = {
-        "temperature": 0,
-        "top_p": 0.9,
-        "max_tokens": 2000,
-        # any other supported parameters for litellm
-    }
+    extra_args = {}
    if extra_headers:
        extra_args["extra_headers"] = extra_headers

@@ -147,80 +125,55 @@ async def extract_structured_data_using_llm(
            extraction_strategy=LLMExtractionStrategy(
                provider=provider,
                api_token=api_token,
-                schema=OpenAIModelFee.model_json_schema(),
+                schema=OpenAIModelFee.schema(),
                extraction_type="schema",
                instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. 
                Do not miss any models in the entire content. One extracted model JSON format should look like this: 
                {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}.""",
-                extra_args=extra_args,
+                extra_args=extra_args
            ),
-            cache_mode=CacheMode.BYPASS,
+            bypass_cache=True,
        )
        print(result.extracted_content)

-
 async def extract_structured_data_using_css_extractor():
    print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
    schema = {
-        "name": "KidoCode Courses",
-        "baseSelector": "section.charge-methodology .w-tab-content > div",
+        "name": "Coinbase Crypto Prices",
+        "baseSelector": ".cds-tableRow-t45thuk",
        "fields": [
            {
-                "name": "section_title",
-                "selector": "h3.heading-50",
+                "name": "crypto",
+                "selector": "td:nth-child(1) h2",
                "type": "text",
            },
            {
-                "name": "section_description",
-                "selector": ".charge-content",
+                "name": "symbol",
+                "selector": "td:nth-child(1) p",
                "type": "text",
            },
            {
-                "name": "course_name",
-                "selector": ".text-block-93",
+                "name": "price",
+                "selector": "td:nth-child(2)",
                "type": "text",
-            },
-            {
-                "name": "course_description",
-                "selector": ".course-content-text",
-                "type": "text",
-            },
-            {
-                "name": "course_icon",
-                "selector": ".image-92",
-                "type": "attribute",
-                "attribute": "src",
-            },
+            }
        ],
    }

-    async with AsyncWebCrawler(headless=True, verbose=True) as crawler:
-        # Create the JavaScript that handles clicking multiple times
-        js_click_tabs = """
-        (async () => {
-            const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");
-            
-            for(let tab of tabs) {
-                // scroll to the tab
-                tab.scrollIntoView();
-                tab.click();
-                // Wait for content to load and animations to complete
-                await new Promise(r => setTimeout(r, 500));
-            }
-        })();
-        """
+    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

+    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
-            url="https://www.kidocode.com/degrees/technology",
-            extraction_strategy=JsonCssExtractionStrategy(schema, verbose=True),
-            js_code=[js_click_tabs],
-            cache_mode=CacheMode.BYPASS,
+            url="https://www.coinbase.com/explore",
+            extraction_strategy=extraction_strategy,
+            bypass_cache=True,
        )

-        companies = json.loads(result.extracted_content)
-        print(f"Successfully extracted {len(companies)} companies")
-        print(json.dumps(companies[0], indent=2))
+        assert result.success, "Failed to crawl the page"

+        crypto_prices = json.loads(result.extracted_content)
+        print(f"Successfully extracted {len(crypto_prices)} crypto prices")
+        print(json.dumps(crypto_prices[0], indent=2))

 # Advanced Session-Based Crawling with Dynamic Content 🔄
 async def crawl_dynamic_content_pages_method_1():
@@ -250,10 +203,8 @@ async def crawl_dynamic_content_pages_method_1():
        all_commits = []

        js_next_page = """
-        (() => {
-            const button = document.querySelector('a[data-testid="pagination-next-button"]');
-            if (button) button.click();
-        })();
+        const button = document.querySelector('a[data-testid="pagination-next-button"]');
+        if (button) button.click();
        """

        for page in range(3):  # Crawl 3 pages
@@ -262,7 +213,7 @@ async def crawl_dynamic_content_pages_method_1():
                session_id=session_id,
                css_selector="li.Box-sc-g0xbh4-0",
                js=js_next_page if page > 0 else None,
-                cache_mode=CacheMode.BYPASS,
+                bypass_cache=True,
                js_only=page > 0,
                headless=False,
            )
@@ -278,7 +229,6 @@ async def crawl_dynamic_content_pages_method_1():
        await crawler.crawler_strategy.kill_session(session_id)
        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")

-
 async def crawl_dynamic_content_pages_method_2():
    print("\n--- Advanced Multi-Page Crawling with JavaScript Execution ---")

@@ -332,7 +282,7 @@ async def crawl_dynamic_content_pages_method_2():
                extraction_strategy=extraction_strategy,
                js_code=js_next_page_and_wait if page > 0 else None,
                js_only=page > 0,
-                cache_mode=CacheMode.BYPASS,
+                bypass_cache=True,
                headless=False,
            )

@@ -346,11 +296,8 @@ async def crawl_dynamic_content_pages_method_2():
        await crawler.crawler_strategy.kill_session(session_id)
        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")

-
 async def crawl_dynamic_content_pages_method_3():
-    print(
-        "\n--- Advanced Multi-Page Crawling with JavaScript Execution using `wait_for` ---"
-    )
+    print("\n--- Advanced Multi-Page Crawling with JavaScript Execution using `wait_for` ---")

    async with AsyncWebCrawler(verbose=True) as crawler:
        url = "https://github.com/microsoft/TypeScript/commits/main"
@@ -372,7 +319,7 @@ async def crawl_dynamic_content_pages_method_3():
            const firstCommit = commits[0].textContent.trim();
            return firstCommit !== window.firstCommit;
        }"""
-
+        
        schema = {
            "name": "Commit Extractor",
            "baseSelector": "li.Box-sc-g0xbh4-0",
@@ -396,7 +343,7 @@ async def crawl_dynamic_content_pages_method_3():
                js_code=js_next_page if page > 0 else None,
                wait_for=wait_for if page > 0 else None,
                js_only=page > 0,
-                cache_mode=CacheMode.BYPASS,
+                bypass_cache=True,
                headless=False,
            )

@@ -410,54 +357,28 @@ async def crawl_dynamic_content_pages_method_3():
        await crawler.crawler_strategy.kill_session(session_id)
        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")

-
 async def crawl_custom_browser_type():
    # Use Firefox
    start = time.time()
-    async with AsyncWebCrawler(
-        browser_type="firefox", verbose=True, headless=True
-    ) as crawler:
-        result = await crawler.arun(
-            url="https://www.example.com", cache_mode=CacheMode.BYPASS
-        )
+    async with AsyncWebCrawler(browser_type="firefox", verbose=True, headless=True) as crawler:
+        result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
        print(result.markdown[:500])
        print("Time taken: ", time.time() - start)

    # Use WebKit
    start = time.time()
-    async with AsyncWebCrawler(
-        browser_type="webkit", verbose=True, headless=True
-    ) as crawler:
-        result = await crawler.arun(
-            url="https://www.example.com", cache_mode=CacheMode.BYPASS
-        )
+    async with AsyncWebCrawler(browser_type="webkit", verbose=True, headless=True) as crawler:
+        result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
        print(result.markdown[:500])
        print("Time taken: ", time.time() - start)

    # Use Chromium (default)
    start = time.time()
-    async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
-        result = await crawler.arun(
-            url="https://www.example.com", cache_mode=CacheMode.BYPASS
-        )
+    async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
+        result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
        print(result.markdown[:500])
        print("Time taken: ", time.time() - start)

-
-async def crawl_with_user_simultion():
-    async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
-        url = "YOUR-URL-HERE"
-        result = await crawler.arun(
-            url=url,
-            cache_mode=CacheMode.BYPASS,
-            magic=True,  # Automatically detects and removes overlays, popups, and other elements that block content
-            # simulate_user = True,# Causes a series of random mouse movements and clicks to simulate user interaction
-            # override_navigator = True # Overrides the navigator object to make it look like a real user
-        )
-
-        print(result.markdown)
-
-
 async def speed_comparison():
    # print("\n--- Speed Comparison ---")
    # print("Firecrawl (simulated):")
@@ -467,18 +388,18 @@ async def speed_comparison():
    # print()
    # Simulated Firecrawl performance
    from firecrawl import FirecrawlApp
-
-    app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
+    app = FirecrawlApp(api_key=os.environ['FIRECRAWL_API_KEY'])
    start = time.time()
    scrape_status = app.scrape_url(
-        "https://www.nbcnews.com/business", params={"formats": ["markdown", "html"]}
+        'https://www.nbcnews.com/business',
+        params={'formats': ['markdown', 'html']}
    )
    end = time.time()
-    print("Firecrawl:")
+    print("Firecrawl (simulated):")
    print(f"Time taken: {end - start:.2f} seconds")
    print(f"Content length: {len(scrape_status['markdown'])} characters")
    print(f"Images found: {scrape_status['markdown'].count('cldnry.s-nbcnews.com')}")
-    print()
+    print()    

    async with AsyncWebCrawler() as crawler:
        # Crawl4AI simple crawl
@@ -486,7 +407,7 @@ async def speed_comparison():
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            word_count_threshold=0,
-            cache_mode=CacheMode.BYPASS,
+            bypass_cache=True,
            verbose=False,
        )
        end = time.time()
@@ -496,28 +417,6 @@ async def speed_comparison():
        print(f"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}")
        print()

-        # Crawl4AI with advanced content filtering
-        start = time.time()
-        result = await crawler.arun(
-            url="https://www.nbcnews.com/business",
-            word_count_threshold=0,
-            markdown_generator=DefaultMarkdownGenerator(
-                content_filter=PruningContentFilter(
-                    threshold=0.48, threshold_type="fixed", min_word_threshold=0
-                )
-                # content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
-            ),
-            cache_mode=CacheMode.BYPASS,
-            verbose=False,
-        )
-        end = time.time()
-        print("Crawl4AI (Markdown Plus):")
-        print(f"Time taken: {end - start:.2f} seconds")
-        print(f"Content length: {len(result.markdown_v2.raw_markdown)} characters")
-        print(f"Fit Markdown: {len(result.markdown_v2.fit_markdown)} characters")
-        print(f"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}")
-        print()
-
        # Crawl4AI with JavaScript execution
        start = time.time()
        result = await crawler.arun(
@@ -526,20 +425,13 @@ async def speed_comparison():
                "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
            ],
            word_count_threshold=0,
-            cache_mode=CacheMode.BYPASS,
-            markdown_generator=DefaultMarkdownGenerator(
-                content_filter=PruningContentFilter(
-                    threshold=0.48, threshold_type="fixed", min_word_threshold=0
-                )
-                # content_filter=BM25ContentFilter(user_query=None, bm25_threshold=1.0)
-            ),
+            bypass_cache=True,
            verbose=False,
        )
        end = time.time()
        print("Crawl4AI (with JavaScript execution):")
        print(f"Time taken: {end - start:.2f} seconds")
        print(f"Content length: {len(result.markdown)} characters")
-        print(f"Fit Markdown: {len(result.markdown_v2.fit_markdown)} characters")
        print(f"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}")

    print("\nNote on Speed Comparison:")
@@ -552,123 +444,34 @@ async def speed_comparison():
    print("If you run these tests in an environment with better network conditions,")
    print("you may observe an even more significant speed advantage for Crawl4AI.")

-
-async def generate_knowledge_graph():
-    class Entity(BaseModel):
-        name: str
-        description: str
-
-    class Relationship(BaseModel):
-        entity1: Entity
-        entity2: Entity
-        description: str
-        relation_type: str
-
-    class KnowledgeGraph(BaseModel):
-        entities: List[Entity]
-        relationships: List[Relationship]
-
-    extraction_strategy = LLMExtractionStrategy(
-        provider="openai/gpt-4o-mini",  # Or any other provider, including Ollama and open source models
-        api_token=os.getenv("OPENAI_API_KEY"),  # In case of Ollama just pass "no-token"
-        schema=KnowledgeGraph.model_json_schema(),
-        extraction_type="schema",
-        instruction="""Extract entities and relationships from the given text.""",
-    )
-    async with AsyncWebCrawler() as crawler:
-        url = "https://paulgraham.com/love.html"
-        result = await crawler.arun(
-            url=url,
-            cache_mode=CacheMode.BYPASS,
-            extraction_strategy=extraction_strategy,
-            # magic=True
-        )
-        # print(result.extracted_content)
-        with open(os.path.join(__location__, "kb.json"), "w") as f:
-            f.write(result.extracted_content)
-
-
-async def fit_markdown_remove_overlay():
-    async with AsyncWebCrawler(
-        headless=True,  # Set to False to see what is happening
-        verbose=True,
-        user_agent_mode="random",
-        user_agent_generator_config={"device_type": "mobile", "os_type": "android"},
-    ) as crawler:
-        result = await crawler.arun(
-            url="https://www.kidocode.com/degrees/technology",
-            cache_mode=CacheMode.BYPASS,
-            markdown_generator=DefaultMarkdownGenerator(
-                content_filter=PruningContentFilter(
-                    threshold=0.48, threshold_type="fixed", min_word_threshold=0
-                ),
-                options={"ignore_links": True},
-            ),
-            # markdown_generator=DefaultMarkdownGenerator(
-            #     content_filter=BM25ContentFilter(user_query="", bm25_threshold=1.0),
-            #     options={
-            #         "ignore_links": True
-            #     }
-            # ),
-        )
-
-        if result.success:
-            print(len(result.markdown_v2.raw_markdown))
-            print(len(result.markdown_v2.markdown_with_citations))
-            print(len(result.markdown_v2.fit_markdown))
-
-            # Save clean html
-            with open(os.path.join(__location__, "output/cleaned_html.html"), "w") as f:
-                f.write(result.cleaned_html)
-
-            with open(
-                os.path.join(__location__, "output/output_raw_markdown.md"), "w"
-            ) as f:
-                f.write(result.markdown_v2.raw_markdown)
-
-            with open(
-                os.path.join(__location__, "output/output_markdown_with_citations.md"),
-                "w",
-            ) as f:
-                f.write(result.markdown_v2.markdown_with_citations)
-
-            with open(
-                os.path.join(__location__, "output/output_fit_markdown.md"), "w"
-            ) as f:
-                f.write(result.markdown_v2.fit_markdown)
-
-    print("Done")
-
-
 async def main():
-    # await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"))
-
-    # await simple_crawl()
-    # await simple_example_with_running_js_code()
-    # await simple_example_with_css_selector()
-    # # await use_proxy()
-    # await capture_and_save_screenshot("https://www.example.com", os.path.join(__location__, "tmp/example_screenshot.jpg"))
-    # await extract_structured_data_using_css_extractor()
+    await simple_crawl()
+    await simple_example_with_running_js_code()
+    await simple_example_with_css_selector()
+    await use_proxy()
+    await capture_and_save_screenshot("https://www.example.com", os.path.join(__location__, "tmp/example_screenshot.jpg"))
+    await extract_structured_data_using_css_extractor()

    # LLM extraction examples
-    # await extract_structured_data_using_llm()
-    # await extract_structured_data_using_llm("huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct", os.getenv("HUGGINGFACE_API_KEY"))
-    # await extract_structured_data_using_llm("ollama/llama3.2")
+    await extract_structured_data_using_llm()
+    await extract_structured_data_using_llm("huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct", os.getenv("HUGGINGFACE_API_KEY"))
+    await extract_structured_data_using_llm("openai/gpt-4", os.getenv("OPENAI_API_KEY"))
+    await extract_structured_data_using_llm("ollama/llama3.2")    

    # You always can pass custom headers to the extraction strategy
-    # custom_headers = {
-    #     "Authorization": "Bearer your-custom-token",
-    #     "X-Custom-Header": "Some-Value"
-    # }
-    # await extract_structured_data_using_llm(extra_headers=custom_headers)
-
+    custom_headers = {
+        "Authorization": "Bearer your-custom-token",
+        "X-Custom-Header": "Some-Value"
+    }
+    await extract_structured_data_using_llm(extra_headers=custom_headers)
+    
    # await crawl_dynamic_content_pages_method_1()
    # await crawl_dynamic_content_pages_method_2()
    await crawl_dynamic_content_pages_method_3()
-
-    # await crawl_custom_browser_type()
-
-    # await speed_comparison()
+    
+    await crawl_custom_browser_type()
+    
+    await speed_comparison()


 if __name__ == "__main__":
--- a/docs/examples/quickstart_sync.py
+++ b/docs/examples/quickstart_sync.py
@@ -10,17 +10,15 @@ from functools import lru_cache

 console = Console()

-
@lru_cache()
 def create_crawler():
    crawler = WebCrawler(verbose=True)
    crawler.warmup()
    return crawler

-
 def print_result(result):
    # Print each key in one line and just the first 10 characters of each one's value and three dots
-    console.print("\t[bold]Result:[/bold]")
+    console.print("\t[bold]Result:[/bold]")
    for key, value in result.model_dump().items():
        if isinstance(value, str) and value:
            console.print(f"\t{key}: [green]{value[:20]}...[/green]")
@@ -35,27 +33,18 @@ def cprint(message, press_any_key=False):
        console.print("Press any key to continue...", style="")
        input()

-
 def basic_usage(crawler):
-    cprint(
-        "🛠️ [bold cyan]Basic Usage: Simply provide a URL and let Crawl4ai do the magic![/bold cyan]"
-    )
-    result = crawler.run(url="https://www.nbcnews.com/business", only_text=True)
+    cprint("🛠️ [bold cyan]Basic Usage: Simply provide a URL and let Crawl4ai do the magic![/bold cyan]")
+    result = crawler.run(url="https://www.nbcnews.com/business", only_text=True)
    cprint("[LOG] 📦 [bold yellow]Basic crawl result:[/bold yellow]")
    print_result(result)

-
 def basic_usage_some_params(crawler):
-    cprint(
-        "🛠️ [bold cyan]Basic Usage: Simply provide a URL and let Crawl4ai do the magic![/bold cyan]"
-    )
-    result = crawler.run(
-        url="https://www.nbcnews.com/business", word_count_threshold=1, only_text=True
-    )
+    cprint("🛠️ [bold cyan]Basic Usage: Simply provide a URL and let Crawl4ai do the magic![/bold cyan]")
+    result = crawler.run(url="https://www.nbcnews.com/business", word_count_threshold=1, only_text=True)
    cprint("[LOG] 📦 [bold yellow]Basic crawl result:[/bold yellow]")
    print_result(result)

-
 def screenshot_usage(crawler):
    cprint("\n📸 [bold cyan]Let's take a screenshot of the page![/bold cyan]")
    result = crawler.run(url="https://www.nbcnews.com/business", screenshot=True)
@@ -66,23 +55,16 @@ def screenshot_usage(crawler):
    cprint("Screenshot saved to 'screenshot.png'!")
    print_result(result)

-
 def understanding_parameters(crawler):
-    cprint(
-        "\n🧠 [bold cyan]Understanding 'bypass_cache' and 'include_raw_html' parameters:[/bold cyan]"
-    )
-    cprint(
-        "By default, Crawl4ai caches the results of your crawls. This means that subsequent crawls of the same URL will be much faster! Let's see this in action."
-    )
-
+    cprint("\n🧠 [bold cyan]Understanding 'bypass_cache' and 'include_raw_html' parameters:[/bold cyan]")
+    cprint("By default, Crawl4ai caches the results of your crawls. This means that subsequent crawls of the same URL will be much faster! Let's see this in action.")
+    
    # First crawl (reads from cache)
    cprint("1️⃣ First crawl (caches the result):", True)
    start_time = time.time()
    result = crawler.run(url="https://www.nbcnews.com/business")
    end_time = time.time()
-    cprint(
-        f"[LOG] 📦 [bold yellow]First crawl took {end_time - start_time} seconds and result (from cache):[/bold yellow]"
-    )
+    cprint(f"[LOG] 📦 [bold yellow]First crawl took {end_time - start_time} seconds and result (from cache):[/bold yellow]")
    print_result(result)

    # Force to crawl again
@@ -90,232 +72,169 @@ def understanding_parameters(crawler):
    start_time = time.time()
    result = crawler.run(url="https://www.nbcnews.com/business", bypass_cache=True)
    end_time = time.time()
-    cprint(
-        f"[LOG] 📦 [bold yellow]Second crawl took {end_time - start_time} seconds and result (forced to crawl):[/bold yellow]"
-    )
+    cprint(f"[LOG] 📦 [bold yellow]Second crawl took {end_time - start_time} seconds and result (forced to crawl):[/bold yellow]")
    print_result(result)

-
 def add_chunking_strategy(crawler):
    # Adding a chunking strategy: RegexChunking
-    cprint(
-        "\n🧩 [bold cyan]Let's add a chunking strategy: RegexChunking![/bold cyan]",
-        True,
-    )
-    cprint(
-        "RegexChunking is a simple chunking strategy that splits the text based on a given regex pattern. Let's see it in action!"
-    )
+    cprint("\n🧩 [bold cyan]Let's add a chunking strategy: RegexChunking![/bold cyan]", True)
+    cprint("RegexChunking is a simple chunking strategy that splits the text based on a given regex pattern. Let's see it in action!")
    result = crawler.run(
        url="https://www.nbcnews.com/business",
-        chunking_strategy=RegexChunking(patterns=["\n\n"]),
+        chunking_strategy=RegexChunking(patterns=["\n\n"])
    )
    cprint("[LOG] 📦 [bold yellow]RegexChunking result:[/bold yellow]")
    print_result(result)

    # Adding another chunking strategy: NlpSentenceChunking
-    cprint(
-        "\n🔍 [bold cyan]Time to explore another chunking strategy: NlpSentenceChunking![/bold cyan]",
-        True,
-    )
-    cprint(
-        "NlpSentenceChunking uses NLP techniques to split the text into sentences. Let's see how it performs!"
-    )
+    cprint("\n🔍 [bold cyan]Time to explore another chunking strategy: NlpSentenceChunking![/bold cyan]", True)
+    cprint("NlpSentenceChunking uses NLP techniques to split the text into sentences. Let's see how it performs!")
    result = crawler.run(
-        url="https://www.nbcnews.com/business", chunking_strategy=NlpSentenceChunking()
+        url="https://www.nbcnews.com/business",
+        chunking_strategy=NlpSentenceChunking()
    )
    cprint("[LOG] 📦 [bold yellow]NlpSentenceChunking result:[/bold yellow]")
    print_result(result)

-
 def add_extraction_strategy(crawler):
    # Adding an extraction strategy: CosineStrategy
-    cprint(
-        "\n🧠 [bold cyan]Let's get smarter with an extraction strategy: CosineStrategy![/bold cyan]",
-        True,
-    )
-    cprint(
-        "CosineStrategy uses cosine similarity to extract semantically similar blocks of text. Let's see it in action!"
-    )
+    cprint("\n🧠 [bold cyan]Let's get smarter with an extraction strategy: CosineStrategy![/bold cyan]", True)
+    cprint("CosineStrategy uses cosine similarity to extract semantically similar blocks of text. Let's see it in action!")
    result = crawler.run(
        url="https://www.nbcnews.com/business",
-        extraction_strategy=CosineStrategy(
-            word_count_threshold=10,
-            max_dist=0.2,
-            linkage_method="ward",
-            top_k=3,
-            sim_threshold=0.3,
-            verbose=True,
-        ),
+        extraction_strategy=CosineStrategy(word_count_threshold=10, max_dist=0.2, linkage_method="ward", top_k=3, sim_threshold=0.3, verbose=True)
    )
    cprint("[LOG] 📦 [bold yellow]CosineStrategy result:[/bold yellow]")
    print_result(result)
-
+    
    # Using semantic_filter with CosineStrategy
-    cprint(
-        "You can pass other parameters like 'semantic_filter' to the CosineStrategy to extract semantically similar blocks of text. Let's see it in action!"
-    )
+    cprint("You can pass other parameters like 'semantic_filter' to the CosineStrategy to extract semantically similar blocks of text. Let's see it in action!")
    result = crawler.run(
        url="https://www.nbcnews.com/business",
        extraction_strategy=CosineStrategy(
            semantic_filter="inflation rent prices",
-        ),
-    )
-    cprint(
-        "[LOG] 📦 [bold yellow]CosineStrategy result with semantic filter:[/bold yellow]"
+        )
    )
+    cprint("[LOG] 📦 [bold yellow]CosineStrategy result with semantic filter:[/bold yellow]")
    print_result(result)

-
 def add_llm_extraction_strategy(crawler):
    # Adding an LLM extraction strategy without instructions
-    cprint(
-        "\n🤖 [bold cyan]Time to bring in the big guns: LLMExtractionStrategy without instructions![/bold cyan]",
-        True,
-    )
-    cprint(
-        "LLMExtractionStrategy uses a large language model to extract relevant information from the web page. Let's see it in action!"
-    )
+    cprint("\n🤖 [bold cyan]Time to bring in the big guns: LLMExtractionStrategy without instructions![/bold cyan]", True)
+    cprint("LLMExtractionStrategy uses a large language model to extract relevant information from the web page. Let's see it in action!")
    result = crawler.run(
        url="https://www.nbcnews.com/business",
-        extraction_strategy=LLMExtractionStrategy(
-            provider="openai/gpt-4o", api_token=os.getenv("OPENAI_API_KEY")
-        ),
-    )
-    cprint(
-        "[LOG] 📦 [bold yellow]LLMExtractionStrategy (no instructions) result:[/bold yellow]"
+        extraction_strategy=LLMExtractionStrategy(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY'))
    )
+    cprint("[LOG] 📦 [bold yellow]LLMExtractionStrategy (no instructions) result:[/bold yellow]")
    print_result(result)
-
+    
    # Adding an LLM extraction strategy with instructions
-    cprint(
-        "\n📜 [bold cyan]Let's make it even more interesting: LLMExtractionStrategy with instructions![/bold cyan]",
-        True,
-    )
-    cprint(
-        "Let's say we are only interested in financial news. Let's see how LLMExtractionStrategy performs with instructions!"
-    )
+    cprint("\n📜 [bold cyan]Let's make it even more interesting: LLMExtractionStrategy with instructions![/bold cyan]", True)
+    cprint("Let's say we are only interested in financial news. Let's see how LLMExtractionStrategy performs with instructions!")
    result = crawler.run(
        url="https://www.nbcnews.com/business",
        extraction_strategy=LLMExtractionStrategy(
            provider="openai/gpt-4o",
-            api_token=os.getenv("OPENAI_API_KEY"),
-            instruction="I am interested in only financial news",
-        ),
-    )
-    cprint(
-        "[LOG] 📦 [bold yellow]LLMExtractionStrategy (with instructions) result:[/bold yellow]"
+            api_token=os.getenv('OPENAI_API_KEY'),
+            instruction="I am interested in only financial news"
+        )
    )
+    cprint("[LOG] 📦 [bold yellow]LLMExtractionStrategy (with instructions) result:[/bold yellow]")
    print_result(result)
-
+    
    result = crawler.run(
        url="https://www.nbcnews.com/business",
        extraction_strategy=LLMExtractionStrategy(
            provider="openai/gpt-4o",
-            api_token=os.getenv("OPENAI_API_KEY"),
-            instruction="Extract only content related to technology",
-        ),
-    )
-    cprint(
-        "[LOG] 📦 [bold yellow]LLMExtractionStrategy (with technology instruction) result:[/bold yellow]"
+            api_token=os.getenv('OPENAI_API_KEY'),
+            instruction="Extract only content related to technology"
+        )
    )
+    cprint("[LOG] 📦 [bold yellow]LLMExtractionStrategy (with technology instruction) result:[/bold yellow]")
    print_result(result)

-
 def targeted_extraction(crawler):
    # Using a CSS selector to extract only H2 tags
-    cprint(
-        "\n🎯 [bold cyan]Targeted extraction: Let's use a CSS selector to extract only H2 tags![/bold cyan]",
-        True,
+    cprint("\n🎯 [bold cyan]Targeted extraction: Let's use a CSS selector to extract only H2 tags![/bold cyan]", True)
+    result = crawler.run(
+        url="https://www.nbcnews.com/business",
+        css_selector="h2"
    )
-    result = crawler.run(url="https://www.nbcnews.com/business", css_selector="h2")
    cprint("[LOG] 📦 [bold yellow]CSS Selector (H2 tags) result:[/bold yellow]")
    print_result(result)

-
 def interactive_extraction(crawler):
    # Passing JavaScript code to interact with the page
-    cprint(
-        "\n🖱️ [bold cyan]Let's get interactive: Passing JavaScript code to click 'Load More' button![/bold cyan]",
-        True,
-    )
-    cprint(
-        "In this example we try to click the 'Load More' button on the page using JavaScript code."
-    )
+    cprint("\n🖱️ [bold cyan]Let's get interactive: Passing JavaScript code to click 'Load More' button![/bold cyan]", True)
+    cprint("In this example we try to click the 'Load More' button on the page using JavaScript code.")
    js_code = """
    const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
    loadMoreButton && loadMoreButton.click();
    """
    # crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
    # crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
-    result = crawler.run(url="https://www.nbcnews.com/business", js=js_code)
-    cprint(
-        "[LOG] 📦 [bold yellow]JavaScript Code (Load More button) result:[/bold yellow]"
+    result = crawler.run(
+        url="https://www.nbcnews.com/business",
+        js=js_code
    )
+    cprint("[LOG] 📦 [bold yellow]JavaScript Code (Load More button) result:[/bold yellow]")
    print_result(result)

-
 def multiple_scrip(crawler):
    # Passing JavaScript code to interact with the page
-    cprint(
-        "\n🖱️ [bold cyan]Let's get interactive: Passing JavaScript code to click 'Load More' button![/bold cyan]",
-        True,
-    )
-    cprint(
-        "In this example we try to click the 'Load More' button on the page using JavaScript code."
-    )
-    js_code = [
-        """
+    cprint("\n🖱️ [bold cyan]Let's get interactive: Passing JavaScript code to click 'Load More' button![/bold cyan]", True)
+    cprint("In this example we try to click the 'Load More' button on the page using JavaScript code.")
+    js_code = ["""
    const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
    loadMoreButton && loadMoreButton.click();
-    """
-    ] * 2
+    """] * 2
    # crawler_strategy = LocalSeleniumCrawlerStrategy(js_code=js_code)
    # crawler = WebCrawler(crawler_strategy=crawler_strategy, always_by_pass_cache=True)
-    result = crawler.run(url="https://www.nbcnews.com/business", js=js_code)
-    cprint(
-        "[LOG] 📦 [bold yellow]JavaScript Code (Load More button) result:[/bold yellow]"
+    result = crawler.run(
+        url="https://www.nbcnews.com/business",
+        js=js_code
    )
+    cprint("[LOG] 📦 [bold yellow]JavaScript Code (Load More button) result:[/bold yellow]")
    print_result(result)

-
 def using_crawler_hooks(crawler):
    # Example usage of the hooks for authentication and setting a cookie
    def on_driver_created(driver):
        print("[HOOK] on_driver_created")
        # Example customization: maximize the window
        driver.maximize_window()
-
+        
        # Example customization: logging in to a hypothetical website
-        driver.get("https://example.com/login")
-
+        driver.get('https://example.com/login')
+        
        from selenium.webdriver.support.ui import WebDriverWait
        from selenium.webdriver.common.by import By
        from selenium.webdriver.support import expected_conditions as EC
-
+        
        WebDriverWait(driver, 10).until(
-            EC.presence_of_element_located((By.NAME, "username"))
+            EC.presence_of_element_located((By.NAME, 'username'))
        )
-        driver.find_element(By.NAME, "username").send_keys("testuser")
-        driver.find_element(By.NAME, "password").send_keys("password123")
-        driver.find_element(By.NAME, "login").click()
+        driver.find_element(By.NAME, 'username').send_keys('testuser')
+        driver.find_element(By.NAME, 'password').send_keys('password123')
+        driver.find_element(By.NAME, 'login').click()
        WebDriverWait(driver, 10).until(
-            EC.presence_of_element_located((By.ID, "welcome"))
+            EC.presence_of_element_located((By.ID, 'welcome'))
        )
        # Add a custom cookie
-        driver.add_cookie({"name": "test_cookie", "value": "cookie_value"})
-        return driver
+        driver.add_cookie({'name': 'test_cookie', 'value': 'cookie_value'})
+        return driver        
+        

    def before_get_url(driver):
        print("[HOOK] before_get_url")
        # Example customization: add a custom header
        # Enable Network domain for sending headers
-        driver.execute_cdp_cmd("Network.enable", {})
+        driver.execute_cdp_cmd('Network.enable', {})
        # Add a custom header
-        driver.execute_cdp_cmd(
-            "Network.setExtraHTTPHeaders", {"headers": {"X-Test-Header": "test"}}
-        )
+        driver.execute_cdp_cmd('Network.setExtraHTTPHeaders', {'headers': {'X-Test-Header': 'test'}})
        return driver
-
+    
    def after_get_url(driver):
        print("[HOOK] after_get_url")
        # Example customization: log the URL
@@ -327,59 +246,48 @@ def using_crawler_hooks(crawler):
        # Example customization: log the HTML
        print(len(html))
        return driver
-
-    cprint(
-        "\n🔗 [bold cyan]Using Crawler Hooks: Let's see how we can customize the crawler using hooks![/bold cyan]",
-        True,
-    )
-
+    
+    cprint("\n🔗 [bold cyan]Using Crawler Hooks: Let's see how we can customize the crawler using hooks![/bold cyan]", True)
+    
    crawler_strategy = LocalSeleniumCrawlerStrategy(verbose=True)
-    crawler_strategy.set_hook("on_driver_created", on_driver_created)
-    crawler_strategy.set_hook("before_get_url", before_get_url)
-    crawler_strategy.set_hook("after_get_url", after_get_url)
-    crawler_strategy.set_hook("before_return_html", before_return_html)
-
+    crawler_strategy.set_hook('on_driver_created', on_driver_created)
+    crawler_strategy.set_hook('before_get_url', before_get_url)
+    crawler_strategy.set_hook('after_get_url', after_get_url)
+    crawler_strategy.set_hook('before_return_html', before_return_html)
+    
    crawler = WebCrawler(verbose=True, crawler_strategy=crawler_strategy)
-    crawler.warmup()
+    crawler.warmup()    
    result = crawler.run(url="https://example.com")
-
+    
    cprint("[LOG] 📦 [bold yellow]Crawler Hooks result:[/bold yellow]")
-    print_result(result=result)
-
-
+    print_result(result=result)
+    
 def using_crawler_hooks_dleay_example(crawler):
    def delay(driver):
        print("Delaying for 5 seconds...")
        time.sleep(5)
        print("Resuming...")
-
+        
    def create_crawler():
        crawler_strategy = LocalSeleniumCrawlerStrategy(verbose=True)
-        crawler_strategy.set_hook("after_get_url", delay)
+        crawler_strategy.set_hook('after_get_url', delay)
        crawler = WebCrawler(verbose=True, crawler_strategy=crawler_strategy)
        crawler.warmup()
        return crawler

-    cprint(
-        "\n🔗 [bold cyan]Using Crawler Hooks: Let's add a delay after fetching the url to make sure entire page is fetched.[/bold cyan]"
-    )
+    cprint("\n🔗 [bold cyan]Using Crawler Hooks: Let's add a delay after fetching the URL to make sure the entire page is fetched.[/bold cyan]")
    crawler = create_crawler()
-    result = crawler.run(url="https://google.com", bypass_cache=True)
-
+    result = crawler.run(url="https://google.com", bypass_cache=True)    
+    
    cprint("[LOG] 📦 [bold yellow]Crawler Hooks result:[/bold yellow]")
    print_result(result)
-
+    
+    

 def main():
-    cprint(
-        "🌟 [bold green]Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling fun! 🌐[/bold green]"
-    )
-    cprint(
-        "⛳️ [bold cyan]First Step: Create an instance of WebCrawler and call the `warmup()` function.[/bold cyan]"
-    )
-    cprint(
-        "If this is the first time you're running Crawl4ai, this might take a few seconds to load required model files."
-    )
+    cprint("🌟 [bold green]Welcome to the Crawl4ai Quickstart Guide! Let's dive into some web crawling fun! 🌐[/bold green]")
+    cprint("⛳️ [bold cyan]First Step: Create an instance of WebCrawler and call the `warmup()` function.[/bold cyan]")
+    cprint("If this is the first time you're running Crawl4ai, this might take a few seconds to load required model files.")

    crawler = create_crawler()

@@ -387,7 +295,7 @@ def main():
    basic_usage(crawler)
    # basic_usage_some_params(crawler)
    understanding_parameters(crawler)
-
+    
    crawler.always_by_pass_cache = True
    screenshot_usage(crawler)
    add_chunking_strategy(crawler)
@@ -397,10 +305,8 @@ def main():
    interactive_extraction(crawler)
    multiple_scrip(crawler)

-    cprint(
-        "\n🎉 [bold green]Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth and crawl the web like a pro! 🕸️[/bold green]"
-    )
-
+    cprint("\n🎉 [bold green]Congratulations! You've made it through the Crawl4ai Quickstart Guide! Now go forth and crawl the web like a pro! 🕸️[/bold green]")

 if __name__ == "__main__":
    main()
+
--- a/docs/examples/quickstart_v0.ipynb
+++ b/docs/examples/quickstart_v0.ipynb
@@ -1,735 +0,0 @@
-{
-  "cells": [
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "6yLvrXn7yZQI"
-      },
-      "source": [
-        "# Crawl4AI: Advanced Web Crawling and Data Extraction\n",
-        "\n",
-        "Welcome to this interactive notebook showcasing Crawl4AI, an advanced asynchronous web crawling and data extraction library.\n",
-        "\n",
-        "- GitHub Repository: [https://github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)\n",
-        "- Twitter: [@unclecode](https://twitter.com/unclecode)\n",
-        "- Website: [https://crawl4ai.com](https://crawl4ai.com)\n",
-        "\n",
-        "Let's explore the powerful features of Crawl4AI!"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "KIn_9nxFyZQK"
-      },
-      "source": [
-        "## Installation\n",
-        "\n",
-        "First, let's install Crawl4AI from GitHub:"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": null,
-      "metadata": {
-        "id": "mSnaxLf3zMog"
-      },
-      "outputs": [],
-      "source": [
-        "!sudo apt-get update && sudo apt-get install -y libwoff1 libopus0 libwebp6 libwebpdemux2 libenchant1c2a libgudev-1.0-0 libsecret-1-0 libhyphen0 libgdk-pixbuf2.0-0 libegl1 libnotify4 libxslt1.1 libevent-2.1-7 libgles2 libvpx6 libxcomposite1 libatk1.0-0 libatk-bridge2.0-0 libepoxy0 libgtk-3-0 libharfbuzz-icu0"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": null,
-      "metadata": {
-        "id": "xlXqaRtayZQK"
-      },
-      "outputs": [],
-      "source": [
-        "!pip install crawl4ai\n",
-        "!pip install nest-asyncio\n",
-        "!playwright install"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "qKCE7TI7yZQL"
-      },
-      "source": [
-        "Now, let's import the necessary libraries:"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 1,
-      "metadata": {
-        "id": "I67tr7aAyZQL"
-      },
-      "outputs": [],
-      "source": [
-        "import asyncio\n",
-        "import nest_asyncio\n",
-        "from crawl4ai import AsyncWebCrawler\n",
-        "from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, LLMExtractionStrategy\n",
-        "import json\n",
-        "import time\n",
-        "from pydantic import BaseModel, Field\n",
-        "\n",
-        "nest_asyncio.apply()"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "h7yR_Rt_yZQM"
-      },
-      "source": [
-        "## Basic Usage\n",
-        "\n",
-        "Let's start with a simple crawl example:"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 2,
-      "metadata": {
-        "colab": {
-          "base_uri": "https://localhost:8080/"
-        },
-        "id": "yBh6hf4WyZQM",
-        "outputId": "0f83af5c-abba-4175-ed95-70b7512e6bcc"
-      },
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "[LOG] 🌤️  Warming up the AsyncWebCrawler\n",
-            "[LOG] 🌞 AsyncWebCrawler is ready to crawl\n",
-            "[LOG] 🚀 Content extracted for https://www.nbcnews.com/business, success: True, time taken: 0.05 seconds\n",
-            "[LOG] 🚀 Extraction done for https://www.nbcnews.com/business, time taken: 0.05 seconds.\n",
-            "18102\n"
-          ]
-        }
-      ],
-      "source": [
-        "async def simple_crawl():\n",
-        "    async with AsyncWebCrawler(verbose=True) as crawler:\n",
-        "        result = await crawler.arun(url=\"https://www.nbcnews.com/business\")\n",
-        "        print(len(result.markdown))\n",
-        "await simple_crawl()"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "9rtkgHI28uI4"
-      },
-      "source": [
-        "💡 By default, **Crawl4AI** caches the result of every URL, so the next time you call it, you’ll get an instant result. But if you want to bypass the cache, just set `bypass_cache=True`."
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "MzZ0zlJ9yZQM"
-      },
-      "source": [
-        "## Advanced Features\n",
-        "\n",
-        "### Executing JavaScript and Using CSS Selectors"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 3,
-      "metadata": {
-        "colab": {
-          "base_uri": "https://localhost:8080/"
-        },
-        "id": "gHStF86xyZQM",
-        "outputId": "34d0fb6d-4dec-4677-f76e-85a1f082829b"
-      },
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "[LOG] 🌤️  Warming up the AsyncWebCrawler\n",
-            "[LOG] 🌞 AsyncWebCrawler is ready to crawl\n",
-            "[LOG] 🕸️ Crawling https://www.nbcnews.com/business using AsyncPlaywrightCrawlerStrategy...\n",
-            "[LOG] ✅ Crawled https://www.nbcnews.com/business successfully!\n",
-            "[LOG] 🚀 Crawling done for https://www.nbcnews.com/business, success: True, time taken: 6.06 seconds\n",
-            "[LOG] 🚀 Content extracted for https://www.nbcnews.com/business, success: True, time taken: 0.10 seconds\n",
-            "[LOG] 🔥 Extracting semantic blocks for https://www.nbcnews.com/business, Strategy: AsyncWebCrawler\n",
-            "[LOG] 🚀 Extraction done for https://www.nbcnews.com/business, time taken: 0.11 seconds.\n",
-            "41135\n"
-          ]
-        }
-      ],
-      "source": [
-        "async def js_and_css():\n",
-        "    async with AsyncWebCrawler(verbose=True) as crawler:\n",
-        "        js_code = [\"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();\"]\n",
-        "        result = await crawler.arun(\n",
-        "            url=\"https://www.nbcnews.com/business\",\n",
-        "            js_code=js_code,\n",
-        "            # css_selector=\"YOUR_CSS_SELECTOR_HERE\",\n",
-        "            bypass_cache=True\n",
-        "        )\n",
-        "        print(len(result.markdown))\n",
-        "\n",
-        "await js_and_css()"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "cqE_W4coyZQM"
-      },
-      "source": [
-        "### Using a Proxy\n",
-        "\n",
-        "Note: You'll need to replace the proxy URL with a working proxy for this example to run successfully."
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": null,
-      "metadata": {
-        "id": "QjAyiAGqyZQM"
-      },
-      "outputs": [],
-      "source": [
-        "async def use_proxy():\n",
-        "    async with AsyncWebCrawler(verbose=True, proxy=\"http://your-proxy-url:port\") as crawler:\n",
-        "        result = await crawler.arun(\n",
-        "            url=\"https://www.nbcnews.com/business\",\n",
-        "            bypass_cache=True\n",
-        "        )\n",
-        "        print(result.markdown[:500])  # Print first 500 characters\n",
-        "\n",
-        "# Uncomment the following line to run the proxy example\n",
-        "# await use_proxy()"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "XTZ88lbayZQN"
-      },
-      "source": [
-        "### Extracting Structured Data with OpenAI\n",
-        "\n",
-        "Note: You'll need to set your OpenAI API key as an environment variable for this example to work."
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 14,
-      "metadata": {
-        "colab": {
-          "base_uri": "https://localhost:8080/"
-        },
-        "id": "fIOlDayYyZQN",
-        "outputId": "cb8359cc-dee0-4762-9698-5dfdcee055b8"
-      },
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "[LOG] 🌤️  Warming up the AsyncWebCrawler\n",
-            "[LOG] 🌞 AsyncWebCrawler is ready to crawl\n",
-            "[LOG] 🕸️ Crawling https://openai.com/api/pricing/ using AsyncPlaywrightCrawlerStrategy...\n",
-            "[LOG] ✅ Crawled https://openai.com/api/pricing/ successfully!\n",
-            "[LOG] 🚀 Crawling done for https://openai.com/api/pricing/, success: True, time taken: 3.77 seconds\n",
-            "[LOG] 🚀 Content extracted for https://openai.com/api/pricing/, success: True, time taken: 0.21 seconds\n",
-            "[LOG] 🔥 Extracting semantic blocks for https://openai.com/api/pricing/, Strategy: AsyncWebCrawler\n",
-            "[LOG] Call LLM for https://openai.com/api/pricing/ - block index: 0\n",
-            "[LOG] Call LLM for https://openai.com/api/pricing/ - block index: 1\n",
-            "[LOG] Call LLM for https://openai.com/api/pricing/ - block index: 2\n",
-            "[LOG] Call LLM for https://openai.com/api/pricing/ - block index: 3\n",
-            "[LOG] Extracted 4 blocks from URL: https://openai.com/api/pricing/ block index: 3\n",
-            "[LOG] Call LLM for https://openai.com/api/pricing/ - block index: 4\n",
-            "[LOG] Extracted 5 blocks from URL: https://openai.com/api/pricing/ block index: 0\n",
-            "[LOG] Extracted 1 blocks from URL: https://openai.com/api/pricing/ block index: 4\n",
-            "[LOG] Extracted 8 blocks from URL: https://openai.com/api/pricing/ block index: 1\n",
-            "[LOG] Extracted 12 blocks from URL: https://openai.com/api/pricing/ block index: 2\n",
-            "[LOG] 🚀 Extraction done for https://openai.com/api/pricing/, time taken: 8.55 seconds.\n",
-            "5029\n"
-          ]
-        }
-      ],
-      "source": [
-        "import os\n",
-        "from google.colab import userdata\n",
-        "os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')\n",
-        "\n",
-        "class OpenAIModelFee(BaseModel):\n",
-        "    model_name: str = Field(..., description=\"Name of the OpenAI model.\")\n",
-        "    input_fee: str = Field(..., description=\"Fee for input token for the OpenAI model.\")\n",
-        "    output_fee: str = Field(..., description=\"Fee for output token for the OpenAI model.\")\n",
-        "\n",
-        "async def extract_openai_fees():\n",
-        "    async with AsyncWebCrawler(verbose=True) as crawler:\n",
-        "        result = await crawler.arun(\n",
-        "            url='https://openai.com/api/pricing/',\n",
-        "            word_count_threshold=1,\n",
-        "            extraction_strategy=LLMExtractionStrategy(\n",
-        "                provider=\"openai/gpt-4o\", api_token=os.getenv('OPENAI_API_KEY'),\n",
-        "                schema=OpenAIModelFee.schema(),\n",
-        "                extraction_type=\"schema\",\n",
-        "                instruction=\"\"\"From the crawled content, extract all mentioned model names along with their fees for input and output tokens.\n",
-        "                Do not miss any models in the entire content. One extracted model JSON format should look like this:\n",
-        "                {\"model_name\": \"GPT-4\", \"input_fee\": \"US$10.00 / 1M tokens\", \"output_fee\": \"US$30.00 / 1M tokens\"}.\"\"\"\n",
-        "            ),\n",
-        "            bypass_cache=True,\n",
-        "        )\n",
-        "        print(len(result.extracted_content))\n",
-        "\n",
-        "# Uncomment the following line to run the OpenAI extraction example\n",
-        "await extract_openai_fees()"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "BypA5YxEyZQN"
-      },
-      "source": [
-        "### Advanced Multi-Page Crawling with JavaScript Execution"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "tfkcVQ0b7mw-"
-      },
-      "source": [
-        "## Advanced Multi-Page Crawling with JavaScript Execution\n",
-        "\n",
-        "This example demonstrates Crawl4AI's ability to handle complex crawling scenarios, specifically extracting commits from multiple pages of a GitHub repository. The challenge here is that clicking the \"Next\" button doesn't load a new page, but instead uses asynchronous JavaScript to update the content. This is a common hurdle in modern web crawling.\n",
-        "\n",
-        "To overcome this, we use Crawl4AI's custom JavaScript execution to simulate clicking the \"Next\" button, and implement a custom hook to detect when new data has loaded. Our strategy involves comparing the first commit's text before and after \"clicking\" Next, waiting until it changes to confirm new data has rendered. This showcases Crawl4AI's flexibility in handling dynamic content and its ability to implement custom logic for even the most challenging crawling tasks."
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 11,
-      "metadata": {
-        "colab": {
-          "base_uri": "https://localhost:8080/"
-        },
-        "id": "qUBKGpn3yZQN",
-        "outputId": "3e555b6a-ed33-42f4-cce9-499a923fbe17"
-      },
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "[LOG] 🌤️  Warming up the AsyncWebCrawler\n",
-            "[LOG] 🌞 AsyncWebCrawler is ready to crawl\n",
-            "[LOG] 🕸️ Crawling https://github.com/microsoft/TypeScript/commits/main using AsyncPlaywrightCrawlerStrategy...\n",
-            "[LOG] ✅ Crawled https://github.com/microsoft/TypeScript/commits/main successfully!\n",
-            "[LOG] 🚀 Crawling done for https://github.com/microsoft/TypeScript/commits/main, success: True, time taken: 5.16 seconds\n",
-            "[LOG] 🚀 Content extracted for https://github.com/microsoft/TypeScript/commits/main, success: True, time taken: 0.28 seconds\n",
-            "[LOG] 🔥 Extracting semantic blocks for https://github.com/microsoft/TypeScript/commits/main, Strategy: AsyncWebCrawler\n",
-            "[LOG] 🚀 Extraction done for https://github.com/microsoft/TypeScript/commits/main, time taken: 0.28 seconds.\n",
-            "Page 1: Found 35 commits\n",
-            "[LOG] 🕸️ Crawling https://github.com/microsoft/TypeScript/commits/main using AsyncPlaywrightCrawlerStrategy...\n",
-            "[LOG] ✅ Crawled https://github.com/microsoft/TypeScript/commits/main successfully!\n",
-            "[LOG] 🚀 Crawling done for https://github.com/microsoft/TypeScript/commits/main, success: True, time taken: 0.78 seconds\n",
-            "[LOG] 🚀 Content extracted for https://github.com/microsoft/TypeScript/commits/main, success: True, time taken: 0.90 seconds\n",
-            "[LOG] 🔥 Extracting semantic blocks for https://github.com/microsoft/TypeScript/commits/main, Strategy: AsyncWebCrawler\n",
-            "[LOG] 🚀 Extraction done for https://github.com/microsoft/TypeScript/commits/main, time taken: 0.90 seconds.\n",
-            "Page 2: Found 35 commits\n",
-            "[LOG] 🕸️ Crawling https://github.com/microsoft/TypeScript/commits/main using AsyncPlaywrightCrawlerStrategy...\n",
-            "[LOG] ✅ Crawled https://github.com/microsoft/TypeScript/commits/main successfully!\n",
-            "[LOG] 🚀 Crawling done for https://github.com/microsoft/TypeScript/commits/main, success: True, time taken: 2.00 seconds\n",
-            "[LOG] 🚀 Content extracted for https://github.com/microsoft/TypeScript/commits/main, success: True, time taken: 0.74 seconds\n",
-            "[LOG] 🔥 Extracting semantic blocks for https://github.com/microsoft/TypeScript/commits/main, Strategy: AsyncWebCrawler\n",
-            "[LOG] 🚀 Extraction done for https://github.com/microsoft/TypeScript/commits/main, time taken: 0.75 seconds.\n",
-            "Page 3: Found 35 commits\n",
-            "Successfully crawled 105 commits across 3 pages\n"
-          ]
-        }
-      ],
-      "source": [
-        "import re\n",
-        "from bs4 import BeautifulSoup\n",
-        "\n",
-        "async def crawl_typescript_commits():\n",
-        "    first_commit = \"\"\n",
-        "    async def on_execution_started(page):\n",
-        "        nonlocal first_commit\n",
-        "        try:\n",
-        "            while True:\n",
-        "                await page.wait_for_selector('li.Box-sc-g0xbh4-0 h4')\n",
-        "                commit = await page.query_selector('li.Box-sc-g0xbh4-0 h4')\n",
-        "                commit = await commit.evaluate('(element) => element.textContent')\n",
-        "                commit = re.sub(r'\\s+', '', commit)\n",
-        "                if commit and commit != first_commit:\n",
-        "                    first_commit = commit\n",
-        "                    break\n",
-        "                await asyncio.sleep(0.5)\n",
-        "        except Exception as e:\n",
-        "            print(f\"Warning: New content didn't appear after JavaScript execution: {e}\")\n",
-        "\n",
-        "    async with AsyncWebCrawler(verbose=True) as crawler:\n",
-        "        crawler.crawler_strategy.set_hook('on_execution_started', on_execution_started)\n",
-        "\n",
-        "        url = \"https://github.com/microsoft/TypeScript/commits/main\"\n",
-        "        session_id = \"typescript_commits_session\"\n",
-        "        all_commits = []\n",
-        "\n",
-        "        js_next_page = \"\"\"\n",
-        "        const button = document.querySelector('a[data-testid=\"pagination-next-button\"]');\n",
-        "        if (button) button.click();\n",
-        "        \"\"\"\n",
-        "\n",
-        "        for page in range(3):  # Crawl 3 pages\n",
-        "            result = await crawler.arun(\n",
-        "                url=url,\n",
-        "                session_id=session_id,\n",
-        "                css_selector=\"li.Box-sc-g0xbh4-0\",\n",
-        "                js=js_next_page if page > 0 else None,\n",
-        "                bypass_cache=True,\n",
-        "                js_only=page > 0\n",
-        "            )\n",
-        "\n",
-        "            assert result.success, f\"Failed to crawl page {page + 1}\"\n",
-        "\n",
-        "            soup = BeautifulSoup(result.cleaned_html, 'html.parser')\n",
-        "            commits = soup.select(\"li\")\n",
-        "            all_commits.extend(commits)\n",
-        "\n",
-        "            print(f\"Page {page + 1}: Found {len(commits)} commits\")\n",
-        "\n",
-        "        await crawler.crawler_strategy.kill_session(session_id)\n",
-        "        print(f\"Successfully crawled {len(all_commits)} commits across 3 pages\")\n",
-        "\n",
-        "await crawl_typescript_commits()"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "EJRnYsp6yZQN"
-      },
-      "source": [
-        "### Using JsonCssExtractionStrategy for Fast Structured Output"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "1ZMqIzB_8SYp"
-      },
-      "source": [
-        "The JsonCssExtractionStrategy is a powerful feature of Crawl4AI that allows for precise, structured data extraction from web pages. Here's how it works:\n",
-        "\n",
-        "1. You define a schema that describes the pattern of data you're interested in extracting.\n",
-        "2. The schema includes a base selector that identifies repeating elements on the page.\n",
-        "3. Within the schema, you define fields, each with its own selector and type.\n",
-        "4. These field selectors are applied within the context of each base selector element.\n",
-        "5. The strategy supports nested structures, lists within lists, and various data types.\n",
-        "6. You can even include computed fields for more complex data manipulation.\n",
-        "\n",
-        "This approach allows for highly flexible and precise data extraction, transforming semi-structured web content into clean, structured JSON data. It's particularly useful for extracting consistent data patterns from pages like product listings, news articles, or search results.\n",
-        "\n",
-        "For more details and advanced usage, check out the full documentation on the Crawl4AI website."
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 12,
-      "metadata": {
-        "colab": {
-          "base_uri": "https://localhost:8080/"
-        },
-        "id": "trCMR2T9yZQN",
-        "outputId": "718d36f4-cccf-40f4-8d8c-c3ba73524d16"
-      },
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "[LOG] 🌤️  Warming up the AsyncWebCrawler\n",
-            "[LOG] 🌞 AsyncWebCrawler is ready to crawl\n",
-            "[LOG] 🕸️ Crawling https://www.nbcnews.com/business using AsyncPlaywrightCrawlerStrategy...\n",
-            "[LOG] ✅ Crawled https://www.nbcnews.com/business successfully!\n",
-            "[LOG] 🚀 Crawling done for https://www.nbcnews.com/business, success: True, time taken: 7.00 seconds\n",
-            "[LOG] 🚀 Content extracted for https://www.nbcnews.com/business, success: True, time taken: 0.32 seconds\n",
-            "[LOG] 🔥 Extracting semantic blocks for https://www.nbcnews.com/business, Strategy: AsyncWebCrawler\n",
-            "[LOG] 🚀 Extraction done for https://www.nbcnews.com/business, time taken: 0.48 seconds.\n",
-            "Successfully extracted 11 news teasers\n",
-            "{\n",
-            "  \"category\": \"Business News\",\n",
-            "  \"headline\": \"NBC ripped up its Olympics playbook for 2024 \\u2014 so far, the new strategy paid off\",\n",
-            "  \"summary\": \"The Olympics have long been key to NBCUniversal. Paris marked the 18th Olympic Games broadcast by NBC in the U.S.\",\n",
-            "  \"time\": \"13h ago\",\n",
-            "  \"image\": {\n",
-            "    \"src\": \"https://media-cldnry.s-nbcnews.com/image/upload/t_focal-200x100,f_auto,q_auto:best/rockcms/2024-09/240903-nbc-olympics-ch-1344-c7a486.jpg\",\n",
-            "    \"alt\": \"Mike Tirico.\"\n",
-            "  },\n",
-            "  \"link\": \"https://www.nbcnews.com/business\"\n",
-            "}\n"
-          ]
-        }
-      ],
-      "source": [
-        "async def extract_news_teasers():\n",
-        "    schema = {\n",
-        "        \"name\": \"News Teaser Extractor\",\n",
-        "        \"baseSelector\": \".wide-tease-item__wrapper\",\n",
-        "        \"fields\": [\n",
-        "            {\n",
-        "                \"name\": \"category\",\n",
-        "                \"selector\": \".unibrow span[data-testid='unibrow-text']\",\n",
-        "                \"type\": \"text\",\n",
-        "            },\n",
-        "            {\n",
-        "                \"name\": \"headline\",\n",
-        "                \"selector\": \".wide-tease-item__headline\",\n",
-        "                \"type\": \"text\",\n",
-        "            },\n",
-        "            {\n",
-        "                \"name\": \"summary\",\n",
-        "                \"selector\": \".wide-tease-item__description\",\n",
-        "                \"type\": \"text\",\n",
-        "            },\n",
-        "            {\n",
-        "                \"name\": \"time\",\n",
-        "                \"selector\": \"[data-testid='wide-tease-date']\",\n",
-        "                \"type\": \"text\",\n",
-        "            },\n",
-        "            {\n",
-        "                \"name\": \"image\",\n",
-        "                \"type\": \"nested\",\n",
-        "                \"selector\": \"picture.teasePicture img\",\n",
-        "                \"fields\": [\n",
-        "                    {\"name\": \"src\", \"type\": \"attribute\", \"attribute\": \"src\"},\n",
-        "                    {\"name\": \"alt\", \"type\": \"attribute\", \"attribute\": \"alt\"},\n",
-        "                ],\n",
-        "            },\n",
-        "            {\n",
-        "                \"name\": \"link\",\n",
-        "                \"selector\": \"a[href]\",\n",
-        "                \"type\": \"attribute\",\n",
-        "                \"attribute\": \"href\",\n",
-        "            },\n",
-        "        ],\n",
-        "    }\n",
-        "\n",
-        "    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)\n",
-        "\n",
-        "    async with AsyncWebCrawler(verbose=True) as crawler:\n",
-        "        result = await crawler.arun(\n",
-        "            url=\"https://www.nbcnews.com/business\",\n",
-        "            extraction_strategy=extraction_strategy,\n",
-        "            bypass_cache=True,\n",
-        "        )\n",
-        "\n",
-        "        assert result.success, \"Failed to crawl the page\"\n",
-        "\n",
-        "        news_teasers = json.loads(result.extracted_content)\n",
-        "        print(f\"Successfully extracted {len(news_teasers)} news teasers\")\n",
-        "        print(json.dumps(news_teasers[0], indent=2))\n",
-        "\n",
-        "await extract_news_teasers()"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "FnyVhJaByZQN"
-      },
-      "source": [
-        "## Speed Comparison\n",
-        "\n",
-        "Let's compare the speed of Crawl4AI with Firecrawl, a paid service. Note that we can't run Firecrawl in this Colab environment, so we'll simulate its performance based on previously recorded data."
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "agDD186f3wig"
-      },
-      "source": [
-        "💡 **Note on Speed Comparison:**\n",
-        "\n",
-        "The speed test conducted here is running on Google Colab, where the internet speed and performance can vary and may not reflect optimal conditions. When we call Firecrawl's API, we're seeing its best performance, while Crawl4AI's performance is limited by Colab's network speed.\n",
-        "\n",
-        "For a more accurate comparison, it's recommended to run these tests on your own servers or computers with a stable and fast internet connection. Despite these limitations, Crawl4AI still demonstrates faster performance in this environment.\n",
-        "\n",
-        "If you run these tests locally, you may observe an even more significant speed advantage for Crawl4AI compared to other services."
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": null,
-      "metadata": {
-        "id": "F7KwHv8G1LbY"
-      },
-      "outputs": [],
-      "source": [
-        "!pip install firecrawl"
-      ]
-    },
-    {
-      "cell_type": "code",
-      "execution_count": 4,
-      "metadata": {
-        "colab": {
-          "base_uri": "https://localhost:8080/"
-        },
-        "id": "91813zILyZQN",
-        "outputId": "663223db-ab89-4976-b233-05ceca62b19b"
-      },
-      "outputs": [
-        {
-          "name": "stdout",
-          "output_type": "stream",
-          "text": [
-            "Firecrawl (simulated):\n",
-            "Time taken: 4.38 seconds\n",
-            "Content length: 41967 characters\n",
-            "Images found: 49\n",
-            "\n",
-            "Crawl4AI (simple crawl):\n",
-            "Time taken: 4.22 seconds\n",
-            "Content length: 18221 characters\n",
-            "Images found: 49\n",
-            "\n",
-            "Crawl4AI (with JavaScript execution):\n",
-            "Time taken: 9.13 seconds\n",
-            "Content length: 34243 characters\n",
-            "Images found: 89\n"
-          ]
-        }
-      ],
-      "source": [
-        "import os\n",
-        "from google.colab import userdata\n",
-        "os.environ['FIRECRAWL_API_KEY'] = userdata.get('FIRECRAWL_API_KEY')\n",
-        "import time\n",
-        "from firecrawl import FirecrawlApp\n",
-        "\n",
-        "async def speed_comparison():\n",
-        "    # Simulated Firecrawl performance\n",
-        "    app = FirecrawlApp(api_key=os.environ['FIRECRAWL_API_KEY'])\n",
-        "    start = time.time()\n",
-        "    scrape_status = app.scrape_url(\n",
-        "    'https://www.nbcnews.com/business',\n",
-        "    params={'formats': ['markdown', 'html']}\n",
-        "    )\n",
-        "    end = time.time()\n",
-        "    print(\"Firecrawl (simulated):\")\n",
-        "    print(f\"Time taken: {end - start:.2f} seconds\")\n",
-        "    print(f\"Content length: {len(scrape_status['markdown'])} characters\")\n",
-        "    print(f\"Images found: {scrape_status['markdown'].count('cldnry.s-nbcnews.com')}\")\n",
-        "    print()\n",
-        "\n",
-        "    async with AsyncWebCrawler() as crawler:\n",
-        "        # Crawl4AI simple crawl\n",
-        "        start = time.time()\n",
-        "        result = await crawler.arun(\n",
-        "            url=\"https://www.nbcnews.com/business\",\n",
-        "            word_count_threshold=0,\n",
-        "            bypass_cache=True,\n",
-        "            verbose=False\n",
-        "        )\n",
-        "        end = time.time()\n",
-        "        print(\"Crawl4AI (simple crawl):\")\n",
-        "        print(f\"Time taken: {end - start:.2f} seconds\")\n",
-        "        print(f\"Content length: {len(result.markdown)} characters\")\n",
-        "        print(f\"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}\")\n",
-        "        print()\n",
-        "\n",
-        "        # Crawl4AI with JavaScript execution\n",
-        "        start = time.time()\n",
-        "        result = await crawler.arun(\n",
-        "            url=\"https://www.nbcnews.com/business\",\n",
-        "            js_code=[\"const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();\"],\n",
-        "            word_count_threshold=0,\n",
-        "            bypass_cache=True,\n",
-        "            verbose=False\n",
-        "        )\n",
-        "        end = time.time()\n",
-        "        print(\"Crawl4AI (with JavaScript execution):\")\n",
-        "        print(f\"Time taken: {end - start:.2f} seconds\")\n",
-        "        print(f\"Content length: {len(result.markdown)} characters\")\n",
-        "        print(f\"Images found: {result.markdown.count('cldnry.s-nbcnews.com')}\")\n",
-        "\n",
-        "await speed_comparison()"
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "OBFFYVJIyZQN"
-      },
-      "source": [
-        "If you run on a local machine with a proper internet speed:\n",
-        "- Simple crawl: Crawl4AI is typically over 3-4 times faster than Firecrawl.\n",
-        "- With JavaScript execution: Even when executing JavaScript to load more content (potentially doubling the number of images found), Crawl4AI is still faster than Firecrawl's simple crawl.\n",
-        "\n",
-        "Please note that actual performance may vary depending on network conditions and the specific content being crawled."
-      ]
-    },
-    {
-      "cell_type": "markdown",
-      "metadata": {
-        "id": "A6_1RK1_yZQO"
-      },
-      "source": [
-        "## Conclusion\n",
-        "\n",
-        "In this notebook, we've explored the powerful features of Crawl4AI, including:\n",
-        "\n",
-        "1. Basic crawling\n",
-        "2. JavaScript execution and CSS selector usage\n",
-        "3. Proxy support\n",
-        "4. Structured data extraction with OpenAI\n",
-        "5. Advanced multi-page crawling with JavaScript execution\n",
-        "6. Fast structured output using JsonCssExtractionStrategy\n",
-        "7. Speed comparison with other services\n",
-        "\n",
-        "Crawl4AI offers a fast, flexible, and powerful solution for web crawling and data extraction tasks. Its asynchronous architecture and advanced features make it suitable for a wide range of applications, from simple web scraping to complex, multi-page data extraction scenarios.\n",
-        "\n",
-        "For more information and advanced usage, please visit the [Crawl4AI documentation](https://docs.crawl4ai.com/).\n",
-        "\n",
-        "Happy crawling!"
-      ]
-    }
-  ],
-  "metadata": {
-    "colab": {
-      "provenance": []
-    },
-    "kernelspec": {
-      "display_name": "venv",
-      "language": "python",
-      "name": "python3"
-    },
-    "language_info": {
-      "codemirror_mode": {
-        "name": "ipython",
-        "version": 3
-      },
-      "file_extension": ".py",
-      "mimetype": "text/x-python",
-      "name": "python",
-      "nbconvert_exporter": "python",
-      "pygments_lexer": "ipython3",
-      "version": "3.10.13"
-    }
-  },
-  "nbformat": 4,
-  "nbformat_minor": 0
-}
--- a/docs/examples/research_assistant.py
+++ b/docs/examples/research_assistant.py
@@ -11,9 +11,7 @@ from groq import Groq
 # Import threadpools to run the crawl_url function in a separate thread
 from concurrent.futures import ThreadPoolExecutor

-client = AsyncOpenAI(
-    base_url="https://api.groq.com/openai/v1", api_key=os.getenv("GROQ_API_KEY")
-)
+client = AsyncOpenAI(base_url="https://api.groq.com/openai/v1", api_key=os.getenv("GROQ_API_KEY"))

 # Instrument the OpenAI client
 cl.instrument_openai()
@@ -27,39 +25,41 @@ settings = {
    "presence_penalty": 0,
 }

-
 def extract_urls(text):
-    url_pattern = re.compile(r"(https?://\S+)")
+    url_pattern = re.compile(r'(https?://\S+)')
    return url_pattern.findall(text)

-
 def crawl_url(url):
    data = {
        "urls": [url],
        "include_raw_html": True,
        "word_count_threshold": 10,
        "extraction_strategy": "NoExtractionStrategy",
-        "chunking_strategy": "RegexChunking",
+        "chunking_strategy": "RegexChunking"
    }
    response = requests.post("https://crawl4ai.com/crawl", json=data)
    response_data = response.json()
-    response_data = response_data["results"][0]
-    return response_data["markdown"]
-
+    response_data = response_data['results'][0]
+    return response_data['markdown']

 @cl.on_chat_start
 async def on_chat_start():
-    cl.user_session.set("session", {"history": [], "context": {}})
-    await cl.Message(content="Welcome to the chat! How can I assist you today?").send()
-
+    cl.user_session.set("session", {
+        "history": [],
+        "context": {}
+    })  
+    await cl.Message(
+        content="Welcome to the chat! How can I assist you today?"
+    ).send()

 @cl.on_message
 async def on_message(message: cl.Message):
    user_session = cl.user_session.get("session")
-
+    
    # Extract URLs from the user's message
    urls = extract_urls(message.content)
-
+    
+    
    futures = []
    with ThreadPoolExecutor() as executor:
        for url in urls:
@@ -69,9 +69,16 @@ async def on_message(message: cl.Message):

    for url, result in zip(urls, results):
        ref_number = f"REF_{len(user_session['context']) + 1}"
-        user_session["context"][ref_number] = {"url": url, "content": result}
+        user_session["context"][ref_number] = {
+            "url": url,
+            "content": result
+        }    

-    user_session["history"].append({"role": "user", "content": message.content})
+
+    user_session["history"].append({
+        "role": "user",
+        "content": message.content
+    })

    # Create a system message that includes the context
    context_messages = [
@@ -88,17 +95,26 @@ async def on_message(message: cl.Message):
                "If not, there is no need to add a references section. "
                "At the end of your response, provide a reference section listing the URLs and their REF numbers only if sources from the appendices were used.\n\n"
                "\n\n".join(context_messages)
-            ),
+            )
        }
    else:
-        system_message = {"role": "system", "content": "You are a helpful assistant."}
+        system_message = {
+            "role": "system",
+            "content": "You are a helpful assistant."
+        }
+

    msg = cl.Message(content="")
    await msg.send()

    # Get response from the LLM
    stream = await client.chat.completions.create(
-        messages=[system_message, *user_session["history"]], stream=True, **settings
+        messages=[
+            system_message,
+            *user_session["history"]
+        ],
+        stream=True,
+        **settings
    )

    assistant_response = ""
@@ -108,7 +124,10 @@ async def on_message(message: cl.Message):
            await msg.stream_token(token)

    # Add assistant message to the history
-    user_session["history"].append({"role": "assistant", "content": assistant_response})
+    user_session["history"].append({
+        "role": "assistant",
+        "content": assistant_response
+    })
    await msg.update()

    # Append the reference section to the assistant's response
@@ -135,11 +154,10 @@ async def on_audio_chunk(chunk: cl.AudioChunk):

    pass

-
 @cl.step(type="tool")
 async def speech_to_text(audio_file):
    cli = Groq()
-
+       
    response = await client.audio.transcriptions.create(
        model="whisper-large-v3", file=audio_file
    )
@@ -154,19 +172,24 @@ async def on_audio_end(elements: list[ElementBased]):
    audio_buffer.seek(0)  # Move the file pointer to the beginning
    audio_file = audio_buffer.read()
    audio_mime_type: str = cl.user_session.get("audio_mime_type")
-
+    
    start_time = time.time()
    whisper_input = (audio_buffer.name, audio_file, audio_mime_type)
    transcription = await speech_to_text(whisper_input)
    end_time = time.time()
    print(f"Transcription took {end_time - start_time} seconds")
-
-    user_msg = cl.Message(author="You", type="user_message", content=transcription)
+    
+    user_msg = cl.Message(
+        author="You", 
+        type="user_message",
+        content=transcription
+    )
    await user_msg.send()
    await on_message(user_msg)


 if __name__ == "__main__":
    from chainlit.cli import run_chainlit
-
    run_chainlit(__file__)
+
+
--- a/docs/examples/rest_call.py
+++ b/docs/examples/rest_call.py
@@ -1,3 +1,4 @@
+
 import requests, base64, os

 data = {
@@ -5,50 +6,59 @@ data = {
    "screenshot": True,
 }

-response = requests.post("https://crawl4ai.com/crawl", json=data)
-result = response.json()["results"][0]
+response = requests.post("https://crawl4ai.com/crawl", json=data) 
+result = response.json()['results'][0]
 print(result.keys())
-# dict_keys(['url', 'html', 'success', 'cleaned_html', 'media',
-# 'links', 'screenshot', 'markdown', 'extracted_content',
+# dict_keys(['url', 'html', 'success', 'cleaned_html', 'media', 
+# 'links', 'screenshot', 'markdown', 'extracted_content', 
 # 'metadata', 'error_message'])
 with open("screenshot.png", "wb") as f:
-    f.write(base64.b64decode(result["screenshot"]))
-
+    f.write(base64.b64decode(result['screenshot']))
+    
 # Example of filtering the content using CSS selectors
 data = {
-    "urls": ["https://www.nbcnews.com/business"],
+    "urls": [
+        "https://www.nbcnews.com/business"
+    ],
    "css_selector": "article",
    "screenshot": True,
 }

 # Example of executing a JS script on the page before extracting the content
 data = {
-    "urls": ["https://www.nbcnews.com/business"],
+    "urls": [
+        "https://www.nbcnews.com/business"
+    ],
    "screenshot": True,
-    "js": [
-        """
+    'js' : ["""
    const loadMoreButton = Array.from(document.querySelectorAll('button')).
    find(button => button.textContent.includes('Load More'));
    loadMoreButton && loadMoreButton.click();
-    """
-    ],
+    """]
 }

 # Example of using a custom extraction strategy
 data = {
-    "urls": ["https://www.nbcnews.com/business"],
+    "urls": [
+        "https://www.nbcnews.com/business"
+    ],
    "extraction_strategy": "CosineStrategy",
-    "extraction_strategy_args": {"semantic_filter": "inflation rent prices"},
+    "extraction_strategy_args": {
+        "semantic_filter": "inflation rent prices"
+    },
 }

 # Example of using LLM to extract content
 data = {
-    "urls": ["https://www.nbcnews.com/business"],
+    "urls": [
+        "https://www.nbcnews.com/business"
+    ],
    "extraction_strategy": "LLMExtractionStrategy",
    "extraction_strategy_args": {
        "provider": "groq/llama3-8b-8192",
        "api_token": os.environ.get("GROQ_API_KEY"),
        "instruction": """I am interested in only financial news, 
-        and translate them in French.""",
+        and translate them in French."""
    },
 }
+
--- a/docs/examples/scraping_strategies_performance.py
+++ b/docs/examples/scraping_strategies_performance.py
@@ -1,135 +0,0 @@
-import time, re
-from crawl4ai.content_scraping_strategy import WebScrapingStrategy,  LXMLWebScrapingStrategy
-import time
-import functools
-from collections import defaultdict
-
-class TimingStats:
-    def __init__(self):
-        self.stats = defaultdict(lambda: defaultdict(lambda: {"calls": 0, "total_time": 0}))
-        
-    def add(self, strategy_name, func_name, elapsed):
-        self.stats[strategy_name][func_name]["calls"] += 1
-        self.stats[strategy_name][func_name]["total_time"] += elapsed
-        
-    def report(self):
-        for strategy_name, funcs in self.stats.items():
-            print(f"\n{strategy_name} Timing Breakdown:")
-            print("-" * 60)
-            print(f"{'Function':<30} {'Calls':<10} {'Total(s)':<10} {'Avg(ms)':<10}")
-            print("-" * 60)
-            
-            for func, data in sorted(funcs.items(), key=lambda x: x[1]["total_time"], reverse=True):
-                avg_ms = (data["total_time"] / data["calls"]) * 1000
-                print(f"{func:<30} {data['calls']:<10} {data['total_time']:<10.3f} {avg_ms:<10.2f}")
-
-timing_stats = TimingStats()
-
-# Modify timing decorator
-def timing_decorator(strategy_name):
-    def decorator(func):
-        @functools.wraps(func)
-        def wrapper(*args, **kwargs):
-            start = time.time()
-            result = func(*args, **kwargs)
-            elapsed = time.time() - start
-            timing_stats.add(strategy_name, func.__name__, elapsed)
-            return result
-        return wrapper
-    return decorator
-
-# Modified decorator application
-def apply_decorators(cls, method_name, strategy_name):
-    try:
-        original_method = getattr(cls, method_name)
-        decorated_method = timing_decorator(strategy_name)(original_method)
-        setattr(cls, method_name, decorated_method)
-    except AttributeError:
-        print(f"Method {method_name} not found in class {cls.__name__}.")
-
-# Apply to key methods
-methods_to_profile = [
-    '_scrap',
-    # 'process_element', 
-    '_process_element', 
-    'process_image',
-]
-
-
-# Apply decorators to both strategies
-for strategy, name in [(WebScrapingStrategy, "Original"), (LXMLWebScrapingStrategy, "LXML")]:
-    for method in methods_to_profile:
-        apply_decorators(strategy, method, name)
-
-
-def generate_large_html(n_elements=1000):
-    html = ['<!DOCTYPE html><html><head></head><body>']
-    for i in range(n_elements):
-        html.append(f'''
-            <div class="article">
-                <h2>Heading {i}</h2>
-                <div>
-                    <div>
-                        <p>This is paragraph {i} with some content and a <a href="http://example.com/{i}">link</a></p>
-                    </div>
-                </div>
-                <img src="image{i}.jpg" alt="Image {i}">
-                <ul>
-                    <li>List item {i}.1</li>
-                    <li>List item {i}.2</li>
-                </ul>
-            </div>
-        ''')
-    html.append('</body></html>')
-    return ''.join(html)
-
-def test_scraping():
-    # Initialize both scrapers
-    original_scraper = WebScrapingStrategy()
-    selected_scraper = LXMLWebScrapingStrategy()
-    
-    # Generate test HTML
-    print("Generating HTML...")
-    html = generate_large_html(5000)
-    print(f"HTML Size: {len(html)/1024:.2f} KB")
-    
-    # Time the scraping
-    print("\nStarting scrape...")
-    start_time = time.time()
-    
-    kwargs = {
-        "url": "http://example.com",
-        "html": html,
-        "word_count_threshold": 5,
-        "keep_data_attributes": True
-    }
-    
-    t1 = time.perf_counter()
-    result_selected = selected_scraper.scrap(**kwargs)
-    t2 = time.perf_counter()
-    
-    result_original = original_scraper.scrap(**kwargs)
-    t3 = time.perf_counter()
-    
-    elapsed = t3 - start_time
-    print(f"\nScraping completed in {elapsed:.2f} seconds")
-    
-    timing_stats.report()
-    
-    # Print stats of LXML output
-    print("\nLXML Output:")
-    print(f"\nExtracted links: {len(result_selected['links']['internal']) + len(result_selected['links']['external'])}")
-    print(f"Extracted images: {len(result_selected['media']['images'])}")
-    print(f"Clean HTML size: {len(result_selected['cleaned_html'])/1024:.2f} KB")
-    print(f"Scraping time: {t2 - t1:.2f} seconds")
-
-    # Print stats of original output
-    print("\nOriginal Output:")
-    print(f"\nExtracted links: {len(result_original['links']['internal']) + len(result_original['links']['external'])}")
-    print(f"Extracted images: {len(result_original['media']['images'])}")
-    print(f"Clean HTML size: {len(result_original['cleaned_html'])/1024:.2f} KB")
-    print(f"Scraping time: {t3 - t1:.2f} seconds")
-        
-        
-if __name__ == "__main__":
-    test_scraping()
--- a/docs/examples/ssl_example.py
+++ b/docs/examples/ssl_example.py
@@ -1,51 +0,0 @@
-"""Example showing how to work with SSL certificates in Crawl4AI."""
-
-import asyncio
-import os
-from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
-
-# Create tmp directory if it doesn't exist
-parent_dir = os.path.dirname(
-    os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
-)
-tmp_dir = os.path.join(parent_dir, "tmp")
-os.makedirs(tmp_dir, exist_ok=True)
-
-
-async def main():
-    # Configure crawler to fetch SSL certificate
-    config = CrawlerRunConfig(
-        fetch_ssl_certificate=True,
-        cache_mode=CacheMode.BYPASS,  # Bypass cache to always get fresh certificates
-    )
-
-    async with AsyncWebCrawler() as crawler:
-        result = await crawler.arun(url="https://example.com", config=config)
-
-        if result.success and result.ssl_certificate:
-            cert = result.ssl_certificate
-
-            # 1. Access certificate properties directly
-            print("\nCertificate Information:")
-            print(f"Issuer: {cert.issuer.get('CN', '')}")
-            print(f"Valid until: {cert.valid_until}")
-            print(f"Fingerprint: {cert.fingerprint}")
-
-            # 2. Export certificate in different formats
-            cert.to_json(os.path.join(tmp_dir, "certificate.json"))  # For analysis
-            print("\nCertificate exported to:")
-            print(f"- JSON: {os.path.join(tmp_dir, 'certificate.json')}")
-
-            pem_data = cert.to_pem(
-                os.path.join(tmp_dir, "certificate.pem")
-            )  # For web servers
-            print(f"- PEM: {os.path.join(tmp_dir, 'certificate.pem')}")
-
-            der_data = cert.to_der(
-                os.path.join(tmp_dir, "certificate.der")
-            )  # For Java apps
-            print(f"- DER: {os.path.join(tmp_dir, 'certificate.der')}")
-
-
-if __name__ == "__main__":
-    asyncio.run(main())
--- a/docs/examples/storage_state_tutorial.md
+++ b/docs/examples/storage_state_tutorial.md
@@ -1,225 +0,0 @@
-### Using `storage_state` to Pre-Load Cookies and LocalStorage
-
-Crawl4ai’s `AsyncWebCrawler` lets you preserve and reuse session data, including cookies and localStorage, across multiple runs. By providing a `storage_state`, you can start your crawls already “logged in” or with any other necessary session data—no need to repeat the login flow every time.
-
-#### What is `storage_state`?
-
-`storage_state` can be:
-
-- A dictionary containing cookies and localStorage data.
-- A path to a JSON file that holds this information.
-
-When you pass `storage_state` to the crawler, it applies these cookies and localStorage entries before loading any pages. This means your crawler effectively starts in a known authenticated or pre-configured state.
-
-#### Example Structure
-
-Here’s an example storage state:
-
-```json
-{
-  "cookies": [
-    {
-      "name": "session",
-      "value": "abcd1234",
-      "domain": "example.com",
-      "path": "/",
-      "expires": 1675363572.037711,
-      "httpOnly": false,
-      "secure": false,
-      "sameSite": "None"
-    }
-  ],
-  "origins": [
-    {
-      "origin": "https://example.com",
-      "localStorage": [
-        { "name": "token", "value": "my_auth_token" },
-        { "name": "refreshToken", "value": "my_refresh_token" }
-      ]
-    }
-  ]
-}
-```
-
-This JSON sets a `session` cookie and two localStorage entries (`token` and `refreshToken`) for `https://example.com`.
-
----
-
-### Passing `storage_state` as a Dictionary
-
-You can directly provide the data as a dictionary:
-
-```python
-import asyncio
-from crawl4ai import AsyncWebCrawler
-
-async def main():
-    storage_dict = {
-        "cookies": [
-            {
-                "name": "session",
-                "value": "abcd1234",
-                "domain": "example.com",
-                "path": "/",
-                "expires": 1675363572.037711,
-                "httpOnly": False,
-                "secure": False,
-                "sameSite": "None"
-            }
-        ],
-        "origins": [
-            {
-                "origin": "https://example.com",
-                "localStorage": [
-                    {"name": "token", "value": "my_auth_token"},
-                    {"name": "refreshToken", "value": "my_refresh_token"}
-                ]
-            }
-        ]
-    }
-
-    async with AsyncWebCrawler(
-        headless=True,
-        storage_state=storage_dict
-    ) as crawler:
-        result = await crawler.arun(url='https://example.com/protected')
-        if result.success:
-            print("Crawl succeeded with pre-loaded session data!")
-            print("Page HTML length:", len(result.html))
-
-if __name__ == "__main__":
-    asyncio.run(main())
-```
-
----
-
-### Passing `storage_state` as a File
-
-If you prefer a file-based approach, save the JSON above to `mystate.json` and reference it:
-
-```python
-import asyncio
-from crawl4ai import AsyncWebCrawler
-
-async def main():
-    async with AsyncWebCrawler(
-        headless=True,
-        storage_state="mystate.json"  # Uses a JSON file instead of a dictionary
-    ) as crawler:
-        result = await crawler.arun(url='https://example.com/protected')
-        if result.success:
-            print("Crawl succeeded with pre-loaded session data!")
-            print("Page HTML length:", len(result.html))
-
-if __name__ == "__main__":
-    asyncio.run(main())
-```
-
----
-
-### Using `storage_state` to Avoid Repeated Logins (Sign In Once, Use Later)
-
-A common scenario is when you need to log in to a site (entering username/password, etc.) to access protected pages. Doing so every crawl is cumbersome. Instead, you can:
-
-1. Perform the login once in a hook.
-2. After login completes, export the resulting `storage_state` to a file.
-3. On subsequent runs, provide that `storage_state` to skip the login step.
-
-**Step-by-Step Example:**
-
-**First Run (Perform Login and Save State):**
-
-```python
-import asyncio
-from crawl4ai import AsyncWebCrawler, CacheMode
-from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
-
-async def on_browser_created_hook(browser):
-    # Access the default context and create a page
-    context = browser.contexts[0]
-    page = await context.new_page()
-    
-    # Navigate to the login page
-    await page.goto("https://example.com/login", wait_until="domcontentloaded")
-    
-    # Fill in credentials and submit
-    await page.fill("input[name='username']", "myuser")
-    await page.fill("input[name='password']", "mypassword")
-    await page.click("button[type='submit']")
-    await page.wait_for_load_state("networkidle")
-    
-    # Now the site sets tokens in localStorage and cookies
-    # Export this state to a file so we can reuse it
-    await context.storage_state(path="my_storage_state.json")
-    await page.close()
-
-async def main():
-    # First run: perform login and export the storage_state
-    async with AsyncWebCrawler(
-        headless=True,
-        verbose=True,
-        hooks={"on_browser_created": on_browser_created_hook},
-        use_persistent_context=True,
-        user_data_dir="./my_user_data"
-    ) as crawler:
-        
-        # After on_browser_created_hook runs, we have storage_state saved to my_storage_state.json
-        result = await crawler.arun(
-            url='https://example.com/protected-page',
-            cache_mode=CacheMode.BYPASS,
-            markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True}),
-        )
-        print("First run result success:", result.success)
-        if result.success:
-            print("Protected page HTML length:", len(result.html))
-
-if __name__ == "__main__":
-    asyncio.run(main())
-```
-
-**Second Run (Reuse Saved State, No Login Needed):**
-
-```python
-import asyncio
-from crawl4ai import AsyncWebCrawler, CacheMode
-from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
-
-async def main():
-    # Second run: no need to hook on_browser_created this time.
-    # Just provide the previously saved storage state.
-    async with AsyncWebCrawler(
-        headless=True,
-        verbose=True,
-        use_persistent_context=True,
-        user_data_dir="./my_user_data",
-        storage_state="my_storage_state.json"  # Reuse previously exported state
-    ) as crawler:
-        
-        # Now the crawler starts already logged in
-        result = await crawler.arun(
-            url='https://example.com/protected-page',
-            cache_mode=CacheMode.BYPASS,
-            markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True}),
-        )
-        print("Second run result success:", result.success)
-        if result.success:
-            print("Protected page HTML length:", len(result.html))
-
-if __name__ == "__main__":
-    asyncio.run(main())
-```
-
-**What’s Happening Here?**
-
-- During the first run, the `on_browser_created_hook` logs into the site.  
-- After logging in, the crawler exports the current session (cookies, localStorage, etc.) to `my_storage_state.json`.  
-- On subsequent runs, passing `storage_state="my_storage_state.json"` starts the browser context with these tokens already in place, skipping the login steps.
-
-**Sign Out Scenario:**  
-If the website allows you to sign out by clearing tokens or by navigating to a sign-out URL, you can also run a script that uses `on_browser_created_hook` or `arun` to simulate signing out, then export the resulting `storage_state` again. That would give you a baseline “logged out” state to start fresh from next time.
-
----
-
-### Conclusion
-
-By using `storage_state`, you can skip repetitive actions, like logging in, and jump straight into crawling protected content. Whether you provide a file path or a dictionary, this powerful feature helps maintain state between crawls, simplifying your data extraction pipelines.
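
The removed tutorial describes two accepted forms of `storage_state` (a dictionary or a path to a JSON file) with top-level `cookies` and `origins` lists. A minimal sketch, assuming only that layout: the helpers `load_storage_state` and `expired_cookies` below are hypothetical, are not part of Crawl4AI, and use only the standard library. They normalise either form to a dictionary and list cookies whose `expires` timestamp has passed, which may be worth checking before reusing a saved state.

```python
import json
import time
from pathlib import Path
from typing import List, Optional, Union


def load_storage_state(state: Union[dict, str, Path]) -> dict:
    """Hypothetical helper: return the storage state as a dict.

    Accepts a dictionary or a path to a JSON file, mirroring the two
    forms described in the tutorial.
    """
    if isinstance(state, dict):
        data = state
    else:
        data = json.loads(Path(state).read_text(encoding="utf-8"))
    # Both top-level keys are lists in the tutorial's example structure.
    for key in ("cookies", "origins"):
        if not isinstance(data.get(key, []), list):
            raise ValueError(f"'{key}' must be a list")
    return data


def expired_cookies(state: dict, now: Optional[float] = None) -> List[str]:
    """Return names of cookies whose 'expires' timestamp is in the past.

    Assumption: a missing or negative 'expires' marks a session cookie,
    so such cookies are never reported.
    """
    now = time.time() if now is None else now
    names = []
    for cookie in state.get("cookies", []):
        expires = cookie.get("expires", -1)
        if 0 <= expires < now:
            names.append(cookie["name"])
    return names
```

The example cookie in the tutorial carries `"expires": 1675363572.037711`, a timestamp in February 2023, so `expired_cookies` would report it as stale on any later run.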
--- a/docs/examples/summarize_page.py
+++ b/docs/examples/summarize_page.py
@@ -1,41 +1,39 @@
 import os
+import time
 import json
 from crawl4ai.web_crawler import WebCrawler
 from crawl4ai.chunking_strategy import *
 from crawl4ai.extraction_strategy import *
 from crawl4ai.crawler_strategy import *

-url = r"https://marketplace.visualstudio.com/items?itemName=Unclecode.groqopilot"
+url = r'https://marketplace.visualstudio.com/items?itemName=Unclecode.groqopilot'

 crawler = WebCrawler()
 crawler.warmup()

 from pydantic import BaseModel, Field

-
 class PageSummary(BaseModel):
    title: str = Field(..., description="Title of the page.")
    summary: str = Field(..., description="Summary of the page.")
    brief_summary: str = Field(..., description="Brief summary of the page.")
    keywords: list = Field(..., description="Keywords assigned to the page.")

-
 result = crawler.run(
    url=url,
    word_count_threshold=1,
-    extraction_strategy=LLMExtractionStrategy(
-        provider="openai/gpt-4o",
-        api_token=os.getenv("OPENAI_API_KEY"),
+    extraction_strategy= LLMExtractionStrategy(
+        provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'), 
        schema=PageSummary.model_json_schema(),
        extraction_type="schema",
-        apply_chunking=False,
-        instruction="From the crawled content, extract the following details: "
-        "1. Title of the page "
-        "2. Summary of the page, which is a detailed summary "
-        "3. Brief summary of the page, which is a paragraph text "
-        "4. Keywords assigned to the page, which is a list of keywords. "
-        "The extracted JSON format should look like this: "
-        '{ "title": "Page Title", "summary": "Detailed summary of the page.", "brief_summary": "Brief summary in a paragraph.", "keywords": ["keyword1", "keyword2", "keyword3"] }',
+        apply_chunking =False,
+        instruction="From the crawled content, extract the following details: "\
+            "1. Title of the page "\
+            "2. Summary of the page, which is a detailed summary "\
+            "3. Brief summary of the page, which is a paragraph text "\
+            "4. Keywords assigned to the page, which is a list of keywords. "\
+            'The extracted JSON format should look like this: '\
+            '{ "title": "Page Title", "summary": "Detailed summary of the page.", "brief_summary": "Brief summary in a paragraph.", "keywords": ["keyword1", "keyword2", "keyword3"] }'
    ),
    bypass_cache=True,
 )
Author	SHA1	Message	Date
UncleCode	0d357ab7d2	feat(scraper): Enhance URL filtering and scoring systems Implement comprehensive URL filtering and scoring capabilities: Filters: - Add URLPatternFilter with glob/regex support - Implement ContentTypeFilter with MIME type checking - Add DomainFilter for domain control - Create FilterChain with stats tracking Scorers: - Complete KeywordRelevanceScorer implementation - Add PathDepthScorer for URL structure scoring - Implement ContentTypeScorer for file type priorities - Add FreshnessScorer for date-based scoring - Add DomainAuthorityScorer for domain weighting - Create CompositeScorer for combined strategies Features: - Add statistics tracking for both filters and scorers - Implement logging support throughout - Add resource cleanup methods - Create comprehensive documentation - Include performance optimizations Tests and docs included. Note: Review URL normalization overlap with recent crawler changes.	2024-11-08 19:02:28 +08:00
UncleCode	bae4665949	feat(scraper): Enhance URL filtering and scoring systems Implement comprehensive URL filtering and scoring capabilities: Filters: - Add URLPatternFilter with glob/regex support - Implement ContentTypeFilter with MIME type checking - Add DomainFilter for domain control - Create FilterChain with stats tracking Scorers: - Complete KeywordRelevanceScorer implementation - Add PathDepthScorer for URL structure scoring - Implement ContentTypeScorer for file type priorities - Add FreshnessScorer for date-based scoring - Add DomainAuthorityScorer for domain weighting - Create CompositeScorer for combined strategies Features: - Add statistics tracking for both filters and scorers - Implement logging support throughout - Add resource cleanup methods - Create comprehensive documentation - Include performance optimizations Tests and docs included. Note: Review URL normalization overlap with recent crawler changes. - Quick Start is created and added	2024-11-08 18:45:12 +08:00
UncleCode	d11c004fbb	Enhanced BFS Strategy: Improved monitoring, resource management & configuration - Added CrawlStats for comprehensive crawl monitoring - Implemented proper resource cleanup with shutdown mechanism - Enhanced URL processing with better validation and politeness controls - Added configuration options (max_concurrent, timeout, external_links) - Improved error handling with retry logic - Added domain-specific queues for better performance - Created comprehensive documentation Note: URL normalization needs review - potential duplicate processing with core crawler for internal links. Currently commented out pending further investigation of edge cases.	2024-11-08 15:57:23 +08:00
UncleCode	3d1c9a8434	Revieweing the BFS strategy.	2024-11-07 18:54:53 +08:00
UncleCode	be472c624c	Refactored AsyncWebScraper to include comprehensive error handling and progress tracking capabilities. Introduced a ScrapingProgress data class to monitor processed and failed URLs. Enhanced scraping methods to log errors and track stats throughout the scraping process.	2024-11-06 21:09:47 +08:00
UncleCode	06b21dcc50	Update .gitignore to include new directories for issues and documentation	2024-11-06 18:44:03 +08:00
UncleCode	0f0f60527d	Merge pull request #172 from aravindkarnam/scraper Scraper	2024-11-06 07:00:44 +01:00
Aravind Karnam	8105fd178e	Removed stubs for remove_from_future_crawls, since the visited set is updated as soon as the URL is queued. Removed add_to_retry_queue(url), since retry with exponential backoff via tenacity takes care of it.	2024-10-17 15:42:43 +05:30
Aravind Karnam	ce7fce4b16	1. Moved to asyncio.wait instead of gather so that results can be yielded as soon as they are ready, rather than in batches. 2. Moved visited.add(url) to before the task is put in the queue rather than after the crawl is completed. This ensures duplicate crawls don't happen when the same URL is found at a different depth and gets queued again because the first crawl has not yet completed and the visited set has not been updated. 3. Renamed the yield_results attribute to stream, since that name seems to be popularly used in other AI libraries for intermediate results.	2024-10-17 12:25:17 +05:30
Aravind Karnam	de28b59aca	removed unused imports	2024-10-16 22:36:48 +05:30
Aravind Karnam	04d8b47b92	Exposed min_crawl_delay for BFSScraperStrategy	2024-10-16 22:34:54 +05:30
Aravind Karnam	2943feeecf	1. Added a flag to yield each crawl result as it becomes ready, with the final scraper result as another option. 2. Removed the ascrape_many method, as I'm currently not focusing on it in the first cut of the scraper. 3. Added some error handling for cases where robots.txt cannot be fetched or parsed.	2024-10-16 22:05:29 +05:30
Aravind Karnam	8a7d29ce85	updated some comments and removed content type checking functionality from core as it's implemented as a filter	2024-10-16 15:59:37 +05:30
aravind	159bd875bd	Merge pull request #5 from aravindkarnam/main Merging 0.3.6	2024-10-16 10:41:22 +05:30
Aravind Karnam	d743adac68	Fixed some bugs in robots.txt processing	2024-10-03 15:58:57 +05:30
Aravind Karnam	7fe220dbd5	1. Introduced a bool flag on the ascrape method to switch between sequential and concurrent processing. 2. Introduced a dictionary for depth tracking across tasks. 3. Removed the redundant crawled_urls variable; the returned object now carries a list built from the visited set instead.	2024-10-03 11:17:11 +05:30
aravind	65e013d9d1	Merge pull request #3 from aravindkarnam/main Merging latest changes from main branch	2024-10-03 09:52:12 +05:30
Aravind Karnam	7f3e2e47ed	Parallel processing with retry on failure with exponential backoff - Simplified URL validation and normalisation - Respecting robots.txt	2024-09-19 12:34:12 +05:30
aravind	78f26ac263	Merge pull request #2 from aravindkarnam/staging Staging	2024-09-18 18:16:23 +05:30
Aravind Karnam	44ce12c62c	Created scaffolding for Scraper as per the plan. Implemented the ascrape method in bfs_scraper_strategy	2024-09-09 13:13:34 +05:30