chore: Bump version to 0.2.2 in setup.py

2024-05-19 16:19:40 +00:00
232 changed files with 65992 additions and 51251 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -1,12 +0,0 @@
-# Documentation
-*.html linguist-documentation
-docs/* linguist-documentation
-docs/examples/* linguist-documentation
-docs/md_v2/* linguist-documentation
-
-# Explicitly mark Python as the main language
-*.py linguist-detectable=true
-*.py linguist-language=Python
-
-# Exclude HTML from language statistics
-*.html linguist-detectable=false
--- a/.github/DISCUSSION_TEMPLATE/feature-requests.yml
+++ b/.github/DISCUSSION_TEMPLATE/feature-requests.yml
@@ -1,59 +0,0 @@
-title: "[Feature Request]: "
-labels: ["⚙️ New"]
-body:
-  - type: markdown
-    attributes:
-      value: |
-        Thank you for your interest in suggesting a new feature! Before you submit, please take a moment to check if already exists in
-        this discussions category to avoid duplicates. 😊
-
-  - type: textarea
-    id: needs_to_be_done
-    attributes:
-      label: What needs to be done?
-      description: Please describe the feature or functionality you'd like to see.
-      placeholder: "e.g., Return alt text along with images scraped from a webpages in Result"
-    validations:
-      required: true
-
-  - type: textarea
-    id: problem_to_solve
-    attributes:
-      label: What problem does this solve?
-      description: Explain the pain point or issue this feature will help address.
-      placeholder: "e.g., Bypass Captchas added by cloudflare"
-    validations:
-      required: true
-
-  - type: textarea
-    id: target_users
-    attributes:
-      label: Target users/beneficiaries
-      description: Who would benefit from this feature? (e.g., specific teams, developers, users, etc.)
-      placeholder: "e.g., Marketing teams, developers"
-    validations:
-      required: false
-
-  - type: textarea
-    id: current_workarounds
-    attributes:
-      label: Current alternatives/workarounds
-      description: Are there any existing solutions or workarounds? How does this feature improve upon them?
-      placeholder: "e.g., Users manually select the css classes mapped to data fields to extract them"
-    validations:
-      required: false
-
-  - type: markdown
-    attributes:
-      value: |
-        ### 💡 Implementation Ideas
-
-  - type: textarea
-    id: proposed_approach
-    attributes:
-      label: Proposed approach
-      description: Share any ideas you have for how this feature could be implemented. Point out any challenges your foresee
-       and the success metrics for this feature
-      placeholder: "e.g., Implement a breadth first traversal algorithm for scraper"
-    validations:
-      required: false
--- a/.github/ISSUE_TEMPLATE/bug_report.yml
+++ b/.github/ISSUE_TEMPLATE/bug_report.yml
@@ -1,127 +0,0 @@
-name: Bug Report
-description: Report a bug with the Crawl4AI.
-title: "[Bug]: "
-labels: ["🐞 Bug","🩺 Needs Triage"]
-body:
-  - type: input
-    id: crawl4ai_version
-    attributes:
-      label: crawl4ai version
-      description: Specify the version of crawl4ai you are using.
-      placeholder: "e.g., 2.0.0"
-    validations:
-      required: true
-
-  - type: textarea
-    id: expected_behavior
-    attributes:
-      label: Expected Behavior
-      description: Describe what you expected to happen.
-      placeholder: "Provide a detailed explanation of the expected outcome."
-    validations:
-      required: true
-
-  - type: textarea
-    id: current_behavior
-    attributes:
-      label: Current Behavior
-      description: Describe what is happening instead of the expected behavior.
-      placeholder: "Describe the actual result or issue you encountered."
-    validations:
-      required: true
-
-  - type: dropdown
-    id: reproducible
-    attributes:
-      label: Is this reproducible?
-      description: Indicate whether this bug can be reproduced consistently.
-      options:
-        - "Yes"
-        - "No"
-    validations:
-      required: true
-
-  - type: textarea
-    id: inputs
-    attributes:
-      label: Inputs Causing the Bug
-      description: Provide details about the inputs causing the issue.
-      placeholder: |
-        - URL(s): 
-        - Settings used: 
-        - Input data (if applicable): 
-      render: bash
-  
-  - type: textarea
-    id: steps_to_reproduce
-    attributes:
-      label: Steps to Reproduce
-      description: Provide step-by-step instructions to reproduce the issue.
-      placeholder: |
-        1. Go to...
-        2. Click on...
-        3. Observe the issue...
-      render: bash
-  
-  - type: textarea
-    id: code_snippets
-    attributes:
-      label: Code snippets
-      description: Provide code snippets(if any). Add comments as necessary
-      placeholder: print("Hello world")
-      render: python
-
-  # Header Section with Title
-  - type: markdown
-    attributes:
-      value: |
-          ## Supporting Information
-          Please provide the following details to help us understand and resolve your issue. This will assist us in reproducing and diagnosing the problem
-
-  - type: input
-    id: os
-    attributes:
-      label: OS
-      description: Please provide the operating system & distro where the issue occurs.
-      placeholder: "e.g., Windows, macOS, Linux"
-    validations:
-      required: true
-
-  - type: input
-    id: python_version
-    attributes:
-      label: Python version
-      description: Specify the Python version being used.
-      placeholder: "e.g., 3.8.5"
-    validations:
-      required: true
-
-  # Browser Field
-  - type: input
-    id: browser
-    attributes:
-      label: Browser
-      description: Provide the name of the browser you are using.
-      placeholder: "e.g., Chrome, Firefox, Safari"
-    validations:
-      required: false
-
-  # Browser Version Field
-  - type: input
-    id: browser_version
-    attributes:
-      label: Browser version
-      description: Provide the version of the browser you are using.
-      placeholder: "e.g., 91.0.4472.124"
-    validations:
-      required: false
-
-  # Error Logs Field (Text Area)
-  - type: textarea
-    id: error_logs
-    attributes:
-      label: Error logs & Screenshots (if applicable)
-      description: If you encountered any errors, please provide the error logs. Attach any relevant screenshots to help us understand the issue.
-      placeholder: "Paste error logs here and attach your screenshots"
-    validations:
-      required: false
--- a/.github/ISSUE_TEMPLATE/config.yml
+++ b/.github/ISSUE_TEMPLATE/config.yml
@@ -1,8 +0,0 @@
-blank_issues_enabled: false
-contact_links:
-  - name: Feature Requests
-    url: https://github.com/unclecode/crawl4ai/discussions/categories/feature-requests
-    about: "Suggest new features or enhancements for Crawl4AI"
-  - name: Forums - Q&A
-    url: https://github.com/unclecode/crawl4ai/discussions/categories/forums-q-a
-    about: "Ask questions or engage in general discussions about Crawl4AI"
--- a/.github/pull_request_template.md
+++ b/.github/pull_request_template.md
@@ -1,19 +0,0 @@
-## Summary
-Please include a summary of the change and/or which issues are fixed.
-
-eg: `Fixes #123` (Tag GitHub issue numbers in this format, so it automatically links the issues with your PR)
-
-## List of files changed and why
-eg: quickstart.py - To update the example as per new changes
-
-## How Has This Been Tested?
-Please describe the tests that you ran to verify your changes.
-
-## Checklist:
-
- [ ] My code follows the style guidelines of this project
- [ ] I have performed a self-review of my own code
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] I have added/updated unit tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass locally with my changes
--- a/.gitignore
+++ b/.gitignore
@@ -165,8 +165,6 @@ Crawl4AI.egg-info/
 Crawl4AI.egg-info/*
 crawler_data.db
 .vscode/
-.tests/
-.test_pads/
 test_pad.py
 test_pad*.py
 .data/
@@ -175,66 +173,4 @@ Crawl4AI.egg-info/
 requirements0.txt
 a.txt

-*.sh
-.idea
-docs/examples/.chainlit/
-docs/examples/.chainlit/*
-.chainlit/config.toml
-.chainlit/translations/en-US.json
-
-local/
-.files/
-
-a.txt
-.lambda_function.py
-ec2*
-
-update_changelog.sh
-
-.DS_Store
-docs/.DS_Store
-tmp/
-test_env/
-**/.DS_Store
-**/.DS_Store
-
-todo.md
-todo_executor.md
-git_changes.py
-git_changes.md
-pypi_build.sh
-git_issues.py
-git_issues.md
-
-.next/
-.tests/
-# .issues/
-.docs/
-.issues/
-.gitboss/
-todo_executor.md
-protect-all-except-feature.sh
-manage-collab.sh
-publish.sh
-combine.sh
-combined_output.txt
-.local
-.scripts
-tree.md
-tree.md
-.scripts
-.local
-.do
-/plans
-plans/
-
-# Codeium
-.codeiumignore
-todo/
-
-# windsurf rules
-.windsurfrules
-
-
-# windsurf rules
-.windsurfrules
+*.sh
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
--- a/CODE_OF_CONDUCT.md
+++ b/CODE_OF_CONDUCT.md
@@ -1,131 +0,0 @@
-# Crawl4AI Code of Conduct
-
-## Our Pledge
-
-We as members, contributors, and leaders pledge to make participation in our
-community a harassment-free experience for everyone, regardless of age, body
-size, visible or invisible disability, ethnicity, sex characteristics, gender
-identity and expression, level of experience, education, socio-economic status,
-nationality, personal appearance, race, caste, color, religion, or sexual
-identity and orientation.
-
-We pledge to act and interact in ways that contribute to an open, welcoming,
-diverse, inclusive, and healthy community.
-
-## Our Standards
-
-Examples of behavior that contributes to a positive environment for our
-community include:
-
-* Demonstrating empathy and kindness toward other people
-* Being respectful of differing opinions, viewpoints, and experiences
-* Giving and gracefully accepting constructive feedback
-* Accepting responsibility and apologizing to those affected by our mistakes,
-  and learning from the experience
-* Focusing on what is best not just for us as individuals, but for the overall
-  community
-
-Examples of unacceptable behavior include:
-
-* The use of sexualized language or imagery, and sexual attention or advances of
-  any kind
-* Trolling, insulting or derogatory comments, and personal or political attacks
-* Public or private harassment
-* Publishing others' private information, such as a physical or email address,
-  without their explicit permission
-* Other conduct which could reasonably be considered inappropriate in a
-  professional setting
-
-## Enforcement Responsibilities
-
-Community leaders are responsible for clarifying and enforcing our standards of
-acceptable behavior and will take appropriate and fair corrective action in
-response to any behavior that they deem inappropriate, threatening, offensive,
-or harmful.
-
-Community leaders have the right and responsibility to remove, edit, or reject
-comments, commits, code, wiki edits, issues, and other contributions that are
-not aligned to this Code of Conduct, and will communicate reasons for moderation
-decisions when appropriate.
-
-## Scope
-
-This Code of Conduct applies within all community spaces, and also applies when
-an individual is officially representing the community in public spaces.
-Examples of representing our community include using an official email address,
-posting via an official social media account, or acting as an appointed
-representative at an online or offline event.
-
-## Enforcement
-
-Instances of abusive, harassing, or otherwise unacceptable behavior may be
-reported to the community leaders responsible for enforcement at
-unclecode@crawl4ai.com. All complaints will be reviewed and investigated promptly and fairly.
-
-All community leaders are obligated to respect the privacy and security of the
-reporter of any incident.
-
-## Enforcement Guidelines
-
-Community leaders will follow these Community Impact Guidelines in determining
-the consequences for any action they deem in violation of this Code of Conduct:
-
-### 1. Correction
-
-**Community Impact**: Use of inappropriate language or other behavior deemed
-unprofessional or unwelcome in the community.
-
-**Consequence**: A private, written warning from community leaders, providing
-clarity around the nature of the violation and an explanation of why the
-behavior was inappropriate. A public apology may be requested.
-
-### 2. Warning
-
-**Community Impact**: A violation through a single incident or series of
-actions.
-
-**Consequence**: A warning with consequences for continued behavior. No
-interaction with the people involved, including unsolicited interaction with
-those enforcing the Code of Conduct, for a specified period of time. This
-includes avoiding interactions in community spaces as well as external channels
-like social media. Violating these terms may lead to a temporary or permanent
-ban.
-
-### 3. Temporary Ban
-
-**Community Impact**: A serious violation of community standards, including
-sustained inappropriate behavior.
-
-**Consequence**: A temporary ban from any sort of interaction or public
-communication with the community for a specified period of time. No public or
-private interaction with the people involved, including unsolicited interaction
-with those enforcing the Code of Conduct, is allowed during this period.
-Violating these terms may lead to a permanent ban.
-
-### 4. Permanent Ban
-
-**Community Impact**: Demonstrating a pattern of violation of community
-standards, including sustained inappropriate behavior, harassment of an
-individual, or aggression toward or disparagement of classes of individuals.
-
-**Consequence**: A permanent ban from any sort of public interaction within the
-community.
-
-## Attribution
-
-This Code of Conduct is adapted from the [Contributor Covenant][homepage],
-version 2.1, available at
-[https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1].
-
-Community Impact Guidelines were inspired by
-[Mozilla's code of conduct enforcement ladder][Mozilla CoC].
-
-For answers to common questions about this code of conduct, see the FAQ at
-[https://www.contributor-covenant.org/faq][FAQ]. Translations are available at
-[https://www.contributor-covenant.org/translations][translations].
-
-[homepage]: https://www.contributor-covenant.org
-[v2.1]: https://www.contributor-covenant.org/version/2/1/code_of_conduct.html
-[Mozilla CoC]: https://github.com/mozilla/diversity
-[FAQ]: https://www.contributor-covenant.org/faq
-[translations]: https://www.contributor-covenant.org/translations
--- a/CONTRIBUTORS.md
+++ b/CONTRIBUTORS.md
@@ -1,42 +0,0 @@
-# Contributors to Crawl4AI
-
-We would like to thank the following people for their contributions to Crawl4AI:
-
-## Core Team
-
- [Unclecode](https://github.com/unclecode) - Project Creator and Main Developer
- [Nasrin](https://github.com/ntohidi) - Project Manager and Developer
- [Aravind Karnam](https://github.com/aravindkarnam) - Head of Community and Product 
-
-## Community Contributors
-
- [aadityakanjolia4](https://github.com/aadityakanjolia4) - Fix for `CustomHTML2Text` is not defined.
- [FractalMind](https://github.com/FractalMind) - Created the first official Docker Hub image and fixed Dockerfile errors
- [ketonkss4](https://github.com/ketonkss4) - Identified Selenium's new capabilities, helping reduce dependencies
- [jonymusky](https://github.com/jonymusky) - Javascript execution documentation, and wait_for
- [datehoer](https://github.com/datehoer) - Add browser prxy support
-
-## Pull Requests
-
- [dvschuyl](https://github.com/dvschuyl) - AsyncPlaywrightCrawlerStrategy page-evaluate context destroyed by navigation [#304](https://github.com/unclecode/crawl4ai/pull/304)
- [nelzomal](https://github.com/nelzomal) - Enhance development installation instructions [#286](https://github.com/unclecode/crawl4ai/pull/286)
- [HamzaFarhan](https://github.com/HamzaFarhan) - Handled the cases where markdown_with_citations, references_markdown, and filtered_html might not be defined [#293](https://github.com/unclecode/crawl4ai/pull/293)
- [NanmiCoder](https://github.com/NanmiCoder) - fix: crawler strategy exception handling and fixes [#271](https://github.com/unclecode/crawl4ai/pull/271)
- [paulokuong](https://github.com/paulokuong) - fix: RAWL4_AI_BASE_DIRECTORY should be Path object instead of string [#298](https://github.com/unclecode/crawl4ai/pull/298)
-
-
-## Other Contributors
-
- [Gokhan](https://github.com/gkhngyk) 
- [Shiv Kumar](https://github.com/shivkumar0757)
- [QIN2DIM](https://github.com/QIN2DIM)
-
-## Acknowledgements
-
-We also want to thank all the users who have reported bugs, suggested features, or helped in any other way to make Crawl4AI better.
-
---
-
-If you've contributed to Crawl4AI and your name isn't on this list, please [open a pull request](https://github.com/unclecode/crawl4ai/pulls) with your name, link, and contribution, and we'll review it promptly.
-
-Thank you all for your contributions!
--- a/161
+++ b/161
@@ -1,136 +1,43 @@
-# syntax=docker/dockerfile:1.4
+# Use an official Python runtime as a parent image
+FROM python:3.10-slim

-ARG TARGETPLATFORM
-ARG BUILDPLATFORM
+# Set the working directory in the container
+WORKDIR /usr/src/app

-# Other build arguments
-ARG PYTHON_VERSION=3.10
-
-# Base stage with system dependencies
-FROM python:${PYTHON_VERSION}-slim as base
-
-# Declare ARG variables again within the build stage
-ARG INSTALL_TYPE=all
-ARG ENABLE_GPU=false
-
-# Platform-specific labels
-LABEL maintainer="unclecode"
-LABEL description="🔥🕷️ Crawl4AI: Open-source LLM Friendly Web Crawler & scraper"
-LABEL version="1.0"
-
-# Environment setup
-ENV PYTHONUNBUFFERED=1 \
-    PYTHONDONTWRITEBYTECODE=1 \
-    PIP_NO_CACHE_DIR=1 \
-    PIP_DISABLE_PIP_VERSION_CHECK=1 \
-    PIP_DEFAULT_TIMEOUT=100 \
-    DEBIAN_FRONTEND=noninteractive
-
-# Install system dependencies
-RUN apt-get update && apt-get install -y --no-install-recommends \
-    build-essential \
-    curl \
-    wget \
-    gnupg \
-    git \
-    cmake \
-    pkg-config \
-    python3-dev \
-    libjpeg-dev \
-    libpng-dev \
-    && rm -rf /var/lib/apt/lists/*
-
-# Playwright system dependencies for Linux
-RUN apt-get update && apt-get install -y --no-install-recommends \
-    libglib2.0-0 \
-    libnss3 \
-    libnspr4 \
-    libatk1.0-0 \
-    libatk-bridge2.0-0 \
-    libcups2 \
-    libdrm2 \
-    libdbus-1-3 \
-    libxcb1 \
-    libxkbcommon0 \
-    libx11-6 \
-    libxcomposite1 \
-    libxdamage1 \
-    libxext6 \
-    libxfixes3 \
-    libxrandr2 \
-    libgbm1 \
-    libpango-1.0-0 \
-    libcairo2 \
-    libasound2 \
-    libatspi2.0-0 \
-    && rm -rf /var/lib/apt/lists/*
-
-# GPU support if enabled and architecture is supported
-RUN if [ "$ENABLE_GPU" = "true" ] && [ "$TARGETPLATFORM" = "linux/amd64" ] ; then \
-    apt-get update && apt-get install -y --no-install-recommends \
-    nvidia-cuda-toolkit \
-    && rm -rf /var/lib/apt/lists/* ; \
-else \
-    echo "Skipping NVIDIA CUDA Toolkit installation (unsupported platform or GPU disabled)"; \
-fi
-
-# Create and set working directory
-WORKDIR /app
-
-# Copy the entire project
+# Copy the current directory contents into the container at /usr/src/app
 COPY . .

-# Install base requirements
+# Install dependencies for Chrome and ChromeDriver
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    wget \
+    xvfb \
+    unzip \
+    curl \
+    gnupg2 \
+    ca-certificates \
+    apt-transport-https \
+    software-properties-common \
+    && mkdir -p /etc/apt/keyrings \
+    && curl -fsSL https://dl-ssl.google.com/linux/linux_signing_key.pub | gpg --dearmor -o /etc/apt/keyrings/google-linux-signing-keyring.gpg \
+    && echo 'deb [arch=amd64 signed-by=/etc/apt/keyrings/google-linux-signing-keyring.gpg] http://dl.google.com/linux/chrome/deb/ stable main' | tee /etc/apt/sources.list.d/google-chrome.list \
+    && apt-get update \
+    && apt-get install -y google-chrome-stable \
+    && apt-get install -y chromium-chromedriver \
+    && rm -rf /var/lib/apt/lists/*
+
+# Install Python dependencies
 RUN pip install --no-cache-dir -r requirements.txt
+RUN pip install --no-cache-dir spacy torch torchvision torchaudio

-# Install required library for FastAPI
-RUN pip install fastapi uvicorn psutil
+# Set display port and dbus env to avoid hanging
+ENV DISPLAY=:99
+ENV DBUS_SESSION_BUS_ADDRESS=/dev/null

-# Install ML dependencies first for better layer caching
-RUN if [ "$INSTALL_TYPE" = "all" ] ; then \
-        pip install --no-cache-dir \
-            torch \
-            torchvision \
-            torchaudio \
-            scikit-learn \
-            nltk \
-            transformers \
-            tokenizers && \
-        python -m nltk.downloader punkt stopwords ; \
-    fi
+# Make port 80 available to the world outside this container
+EXPOSE 80

-# Install the package
-RUN if [ "$INSTALL_TYPE" = "all" ] ; then \
-        pip install ".[all]" && \
-        python -m crawl4ai.model_loader ; \
-    elif [ "$INSTALL_TYPE" = "torch" ] ; then \
-        pip install ".[torch]" ; \
-    elif [ "$INSTALL_TYPE" = "transformer" ] ; then \
-        pip install ".[transformer]" && \
-        python -m crawl4ai.model_loader ; \
-    else \
-        pip install "." ; \
-    fi
+# Define environment variable
+ENV PYTHONUNBUFFERED=1

-    # Install MkDocs and required plugins
-RUN pip install --no-cache-dir \
-    mkdocs \
-    mkdocs-material \
-    mkdocs-terminal \
-    pymdown-extensions
-
-# Build MkDocs documentation
-RUN mkdocs build
-
-# Install Playwright and browsers
-RUN if [ "$TARGETPLATFORM" = "linux/amd64" ]; then \
-    playwright install chromium; \
-    elif [ "$TARGETPLATFORM" = "linux/arm64" ]; then \
-    playwright install chromium; \
-    fi
-
-# Expose port
-EXPOSE 8000 11235 9222 8080
-
-# Start the FastAPI server
-CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "11235"]
+# Run uvicorn
+CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "80", "--workers", "4"]
--- a/44
+++ b/44
@@ -0,0 +1,44 @@
+# Use an official Python runtime as a parent image
+FROM python:3.10-slim
+
+# Set the working directory in the container
+WORKDIR /usr/src/app
+
+# Copy the current directory contents into the container at /usr/src/app
+COPY . .
+
+# Install any needed packages specified in requirements.txt
+RUN pip install --no-cache-dir -r requirements.txt
+
+# Install dependencies for Chrome and ChromeDriver
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    wget \
+    xvfb \
+    unzip \
+    curl \
+    gnupg2 \
+    ca-certificates \
+    apt-transport-https \
+    software-properties-common \
+    && wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add - \
+    && echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list \
+    && apt-get update \
+    && apt-get install -y google-chrome-stable \
+    && apt-get install -y chromium-chromedriver \
+    && rm -rf /var/lib/apt/lists/*
+
+# Install spacy library using pip
+RUN pip install --no-cache-dir spacy
+
+# Set display port and dbus env to avoid hanging
+ENV DISPLAY=:99
+ENV DBUS_SESSION_BUS_ADDRESS=/dev/null
+
+# Make port 80 available to the world outside this container
+EXPOSE 80
+
+# Define environment variable
+ENV PYTHONUNBUFFERED=1
+
+# Run uvicorn
+CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "80", "--workers", "4"]
--- a/MANIFEST.in
+++ b/MANIFEST.in
@@ -1,2 +0,0 @@
-include requirements.txt
-recursive-include crawl4ai/js_snippet *.js
--- a/MISSION.md
+++ b/MISSION.md
@@ -1,46 +0,0 @@
-# Mission
-
-![Mission Diagram](./docs/assets/pitch-dark.svg)
-
-### 1. The Data Capitalization Opportunity
-
-We live in an unprecedented era of digital wealth creation. Every day, individuals and enterprises generate massive amounts of valuable digital footprints across various platforms, social media channels, messenger apps, and cloud services. While people can interact with their data within these platforms, there's an immense untapped opportunity to transform this data into true capital assets. Just as physical property became a foundational element of wealth creation, personal and enterprise data has the potential to become a new form of capital on balance sheets.
-
-For individuals, this represents an opportunity to transform their digital activities into valuable assets. For enterprises, their internal communications, team discussions, and collaborative documents contain rich insights that could be structured and valued as intellectual capital. This wealth of information represents an unprecedented opportunity for value creation in the digital age.
-
-### 2. The Potential of Authentic Data
-
-While synthetic data has played a crucial role in AI development, there's an enormous untapped potential in the authentic data generated by individuals and organizations. Every message, document, and interaction contains unique insights and patterns that could enhance AI development. The challenge isn't a lack of data - it's that most authentic human-generated data remains inaccessible for productive use.
-
-By enabling willing participation in data sharing, we can unlock this vast reservoir of authentic human knowledge. This represents an opportunity to enhance AI development with diverse, real-world data that reflects the full spectrum of human experience and knowledge.
-
-## Our Pathway to Data Democracy
-
-### 1. Open-Source Foundation
-
-Our first step is creating an open-source data extraction engine that empowers developers and innovators to build tools for data structuring and organization. This foundation ensures transparency, security, and community-driven development. By making these tools openly available, we enable the technical infrastructure needed for true data ownership and capitalization.
-
-### 2. Data Capitalization Platform
-
-Building on this open-source foundation, we're developing a platform that helps individuals and enterprises transform their digital footprints into structured, valuable assets. This platform will provide the tools and frameworks needed to organize, understand, and value personal and organizational data as true capital assets.
-
-### 3. Creating a Data Marketplace
-
-The final piece is establishing a marketplace where individuals and organizations can willingly share their data assets. This creates opportunities for:
- Individuals to earn equity, revenue, or other forms of value from their data
- Enterprises to access diverse, high-quality data for AI development
- Researchers to work with authentic human-generated data
- Startups to build innovative solutions using real-world data
-
-## Economic Vision: A Shared Data Economy
-
-We envision a future where data becomes a fundamental asset class in a thriving shared economy. This transformation will democratize AI development by enabling willing participation in data sharing, ensuring that the benefits of AI advancement flow back to data creators. Just as property rights revolutionized economic systems, establishing data as a capital asset will create new opportunities for wealth creation and economic participation.
-
-This shared data economy will:
- Enable individuals to capitalize on their digital footprints
- Create new revenue streams for data creators
- Provide AI developers with access to diverse, authentic data
- Foster innovation through broader access to real-world data
- Ensure more equitable distribution of AI's economic benefits
-
-Our vision is to facilitate this transformation from the ground up - starting with open-source tools, progressing to data capitalization platforms, and ultimately creating a thriving marketplace where data becomes a true asset class in a shared economy. This approach ensures that the future of AI is built on a foundation of authentic human knowledge, with benefits flowing back to the individuals and organizations who create and share their valuable data.
--- a/README.md
+++ b/README.md
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -1,503 +0,0 @@
-# Crawl4AI Strategic Roadmap
-
-```mermaid
-%%{init: {'themeVariables': { 'fontSize': '14px'}}}%%
-graph TD
-    subgraph A1[Advanced Crawling Systems 🔧]
-        A["`
-        • Graph Crawler ✓
-        • Question-Based Crawler
-        • Knowledge-Optimal Crawler
-        • Agentic Crawler
-        `"]
-    end
-
-    subgraph A2[Specialized Features 🛠️]
-        B["`
-        • Automated Schema Generator
-        • Domain-Specific Scrapers
-        • 
-        • 
-        `"]
-    end
-
-    subgraph A3[Development Tools 🔨]
-        C["`
-        • Interactive Playground
-        • Performance Monitor
-        • Cloud Integration
-        • 
-        `"]
-    end
-
-    subgraph A4[Community & Growth 🌱]
-        D["`
-        • Sponsorship Program
-        • Educational Content
-        • 
-        • 
-        `"]
-    end
-
-    classDef default fill:#f9f9f9,stroke:#333,stroke-width:2px
-    classDef section fill:#f0f0f0,stroke:#333,stroke-width:4px,rx:10
-    class A1,A2,A3,A4 section
-
-    %% Layout hints
-    A1 --> A2[" "]
-    A3 --> A4[" "]
-    linkStyle 0,1 stroke:none
-```
-
-Crawl4AI is evolving to provide more intelligent, efficient, and versatile web crawling capabilities. This roadmap outlines the key developments and features planned for the project, organized into strategic sections that build upon our current foundation.
-
-## 1. Advanced Crawling Systems 🔧
-
-This section introduces three powerful crawling systems that extend Crawl4AI's capabilities from basic web crawling to intelligent, purpose-driven data extraction.
-
-### 1.1 Question-Based Crawler
-The Question-Based Crawler enhances our core engine by enabling automatic discovery and extraction of relevant web content based on natural language questions.
-
-Key Features:
- SerpiAPI integration for intelligent web search
- Relevancy scoring for search results
- Automatic URL discovery and prioritization
- Cross-source validation
-
-```python
-from crawl4ai import AsyncWebCrawler
-from crawl4ai.discovery import QuestionBasedDiscovery
-
-async with AsyncWebCrawler() as crawler:
-    discovery = QuestionBasedDiscovery(crawler)
-    results = await discovery.arun(
-        question="What are the system requirements for major cloud providers' GPU instances?",
-        max_urls=5,
-        relevance_threshold=0.7
-    )
-    
-    for result in results:
-        print(f"Source: {result.url} (Relevance: {result.relevance_score})")
-        print(f"Content: {result.markdown}\n")
-```
-
-### 1.2 Knowledge-Optimal Crawler
-An intelligent crawling system that solves the optimization problem of minimizing data extraction while maximizing knowledge acquisition for specific objectives.
-
-Key Features:
- Smart content prioritization
- Minimal data extraction for maximum knowledge
- Probabilistic relevance assessment
- Objective-driven crawling paths
-
-```python
-from crawl4ai import AsyncWebCrawler
-from crawl4ai.optimization import KnowledgeOptimizer
-
-async with AsyncWebCrawler() as crawler:
-    optimizer = KnowledgeOptimizer(
-        objective="Understand GPU instance pricing and limitations across cloud providers",
-        required_knowledge=[
-            "pricing structure",
-            "GPU specifications",
-            "usage limits",
-            "availability zones"
-        ],
-        confidence_threshold=0.85
-    )
-    
-    result = await crawler.arun(
-        urls=[
-            "https://aws.amazon.com/ec2/pricing/",
-            "https://cloud.google.com/gpu",
-            "https://azure.microsoft.com/pricing/"
-        ],
-        optimizer=optimizer,
-        optimization_mode="minimal_extraction"
-    )
-    
-    print(f"Knowledge Coverage: {result.knowledge_coverage}")
-    print(f"Data Efficiency: {result.efficiency_ratio}")
-    print(f"Extracted Content: {result.optimal_content}")
-```
-
-### 1.3 Agentic Crawler
-An autonomous system capable of understanding complex goals and automatically planning and executing multi-step crawling operations.
-
-Key Features:
-- Autonomous goal interpretation
-- Dynamic step planning
-- Interactive navigation capabilities
-- Visual recognition and interaction
-- Automatic error recovery
-
-```python
-from crawl4ai import AsyncWebCrawler
-from crawl4ai.agents import CrawlerAgent
-
-async with AsyncWebCrawler() as crawler:
-    agent = CrawlerAgent(crawler)
-    
-    # Automatic planning and execution
-    result = await agent.arun(
-        goal="Find research papers about quantum computing published in 2023 with more than 50 citations",
-        auto_retry=True
-    )
-    print("Generated Plan:", result.executed_steps)
-    print("Extracted Data:", result.data)
-    
-    # Using custom steps with automatic execution
-    result = await agent.arun(
-        goal="Extract conference deadlines from ML conferences",
-        custom_plan=[
-            "Navigate to conference page",
-            "Find important dates section",
-            "Extract submission deadlines",
-            "Verify dates are for 2024"
-        ]
-    )
-    
-    # Monitoring execution
-    print("Step Completion:", result.step_status)
-    print("Execution Time:", result.execution_time)
-    print("Success Rate:", result.success_rate)
-```
-
-# Section 2: Specialized Features 🛠️
-
-This section introduces specialized tools and features that enhance Crawl4AI's capabilities for specific use cases and data extraction needs.
-
-### 2.1 Automated Schema Generator
-A system that automatically generates JsonCssExtractionStrategy schemas from natural language descriptions, making structured data extraction accessible to all users.
-
-Key Features:
-- Natural language schema generation
-- Automatic pattern detection
-- Predefined schema templates
-- Chrome extension for visual schema building
-
-```python
-from crawl4ai import AsyncWebCrawler
-from crawl4ai.schema import SchemaGenerator
-
-# Generate schema from natural language description
-generator = SchemaGenerator()
-schema = await generator.generate(
-    url="https://news-website.com",
-    description="For each news article on the page, I need the headline, publication date, and main image"
-)
-
-# Use generated schema with crawler
-async with AsyncWebCrawler() as crawler:
-    result = await crawler.arun(
-        url="https://news-website.com",
-        extraction_strategy=schema
-    )
-
-# Example of generated schema:
-"""
-{
-    "name": "News Article Extractor",
-    "baseSelector": "article.news-item",
-    "fields": [
-        {
-            "name": "headline",
-            "selector": "h2.article-title",
-            "type": "text"
-        },
-        {
-            "name": "date",
-            "selector": "span.publish-date",
-            "type": "text"
-        },
-        {
-            "name": "image",
-            "selector": "img.article-image",
-            "type": "attribute",
-            "attribute": "src"
-        }
-    ]
-}
-"""
-```
-
-### 2.2 Domain-Specific Scrapers
-Specialized extraction strategies optimized for common website types and platforms, providing consistent and reliable data extraction without additional configuration.
-
-Key Features:
-- Pre-configured extractors for popular platforms
-- Academic site specialization (arXiv, NCBI)
-- E-commerce standardization
-- Documentation site handling
-
-```python
-from crawl4ai import AsyncWebCrawler
-from crawl4ai.extractors import AcademicExtractor, EcommerceExtractor
-
-async with AsyncWebCrawler() as crawler:
-    # Academic paper extraction
-    papers = await crawler.arun(
-        url="https://arxiv.org/list/cs.AI/recent",
-        extractor="academic",  # Built-in extractor type
-        site_type="arxiv",     # Specific site optimization
-        extract_fields=[
-            "title", 
-            "authors", 
-            "abstract", 
-            "citations"
-        ]
-    )
-    
-    # E-commerce product data
-    products = await crawler.arun(
-        url="https://store.example.com/products",
-        extractor="ecommerce",
-        extract_fields=[
-            "name",
-            "price",
-            "availability",
-            "reviews"
-        ]
-    )
-```
-
-### 2.3 Web Embedding Index
-Creates and maintains a semantic search infrastructure for crawled content, enabling efficient retrieval and querying of web content through vector embeddings.
-
-Key Features:
-- Automatic embedding generation
-- Intelligent content chunking
-- Efficient vector storage and indexing
-- Semantic search capabilities
-
-```python
-from crawl4ai import AsyncWebCrawler
-from crawl4ai.indexing import WebIndex
-
-# Initialize and build index
-index = WebIndex(model="efficient-mini")
-
-async with AsyncWebCrawler() as crawler:
-    # Crawl and index content
-    await index.build(
-        urls=["https://docs.example.com"],
-        crawler=crawler,
-        options={
-            "chunk_method": "semantic",
-            "update_policy": "incremental",
-            "embedding_batch_size": 100
-        }
-    )
-
-    # Search through indexed content
-    results = await index.search(
-        query="How to implement OAuth authentication?",
-        filters={
-            "content_type": "technical",
-            "recency": "6months"
-        },
-        top_k=5
-    )
-
-    # Get similar content
-    similar = await index.find_similar(
-        url="https://docs.example.com/auth/oauth",
-        threshold=0.85
-    )
-```
-
-Each of these specialized features builds upon Crawl4AI's core functionality while providing targeted solutions for specific use cases. They can be used independently or combined for more complex data extraction and processing needs.
-
-# Section 3: Development Tools 🔧
-
-This section covers tools designed to enhance the development experience, monitoring, and deployment of Crawl4AI applications.
-
-### 3.1 Crawl4AI Playground 🎮
-
-The Crawl4AI Playground is an interactive web-based development environment that simplifies web scraping experimentation, development, and deployment. With its intuitive interface and AI-powered assistance, users can quickly prototype, test, and deploy web scraping solutions.
-
-#### Key Features 🌟
-
-##### Visual Strategy Builder
-- Interactive point-and-click interface for building extraction strategies
-- Real-time preview of selected elements
-- Side-by-side comparison of different extraction approaches
-- Visual validation of CSS selectors and XPath queries
-
-##### AI Assistant Integration
-- Strategy recommendations based on target website analysis
-- Parameter optimization suggestions
-- Best practices guidance for specific use cases
-- Automated error detection and resolution
-- Performance optimization tips
-
-##### Real-Time Testing & Validation
-- Live preview of extraction results
-- Side-by-side comparison of multiple strategies
-- Performance metrics visualization
-- Automatic validation of extracted data
-- Error detection and debugging tools
-
-##### Project Management
-- Save and organize multiple scraping projects
-- Version control for configurations
-- Export/import project settings
-- Share configurations with team members
-- Project templates for common use cases
-
-##### Deployment Pipeline
-- One-click deployment to various environments
-- Docker container generation
-- Cloud deployment templates (AWS, GCP, Azure)
-- Scaling configuration management
-- Monitoring setup automation
-
-
-### 3.2 Performance Monitoring System
-A comprehensive monitoring solution providing real-time insights into crawler operations, resource usage, and system health through both CLI and GUI interfaces.
-
-Key Features:
-- Real-time resource tracking
-- Active crawl monitoring
-- Performance statistics
-- Customizable alerting system
-
-```python
-from crawl4ai import AsyncWebCrawler
-from crawl4ai.monitor import CrawlMonitor
-
-# Initialize monitoring
-monitor = CrawlMonitor()
-
-# Start monitoring with CLI interface
-await monitor.start(
-    mode="cli",  # or "gui"
-    refresh_rate="1s",
-    metrics={
-        "resources": ["cpu", "memory", "network"],
-        "crawls": ["active", "queued", "completed"],
-        "performance": ["success_rate", "response_times"]
-    }
-)
-
-# Example CLI output:
-"""
-Crawl4AI Monitor (Live) - Press Q to exit
-────────────────────────────────────────
-System Usage:
- ├─ CPU: ███████░░░ 70%
- └─ Memory: ████░░░░░ 2.1GB/8GB
-
-Active Crawls:
-ID    URL                   Status    Progress
-001   docs.example.com     🟢 Active   75%
-002   api.service.com      🟡 Queue    -
-
-Metrics (Last 5min):
- ├─ Success Rate: 98%
- ├─ Avg Response: 0.6s
- └─ Pages/sec: 8.5
-"""
-```
-
-### 3.3 Cloud Integration
-Streamlined deployment tools for setting up Crawl4AI in various cloud environments, with support for scaling and monitoring.
-
-Key Features:
-- One-click deployment solutions
-- Auto-scaling configuration
-- Load balancing setup
-- Cloud-specific optimizations
-- Monitoring integration
-
-```python
-from crawl4ai import AsyncWebCrawler
-from crawl4ai.deploy import CloudDeployer
-
-# Initialize deployer
-deployer = CloudDeployer()
-
-# Deploy crawler service
-deployment = await deployer.deploy(
-    service_name="crawler-cluster",
-    platform="aws",  # or "gcp", "azure"
-    config={
-        "instance_type": "compute-optimized",
-        "auto_scaling": {
-            "min_instances": 2,
-            "max_instances": 10,
-            "scale_based_on": "cpu_usage"
-        },
-        "region": "us-east-1",
-        "monitoring": True
-    }
-)
-
-# Get deployment status and endpoints
-print(f"Service Status: {deployment.status}")
-print(f"API Endpoint: {deployment.endpoint}")
-print(f"Monitor URL: {deployment.monitor_url}")
-```
-
-These development tools work together to provide a comprehensive environment for developing, testing, monitoring, and deploying Crawl4AI applications. The Playground helps users experiment and generate optimal configurations, the Performance Monitor ensures smooth operation, and the Cloud Integration tools simplify deployment and scaling.
-
-# Section 4: Community & Growth 🌱
-
-This section outlines initiatives designed to build and support the Crawl4AI community, provide educational resources, and ensure sustainable project growth.
-
-### 4.1 Sponsorship Program
-A structured program to support ongoing development and maintenance of Crawl4AI while providing valuable benefits to sponsors.
-
-Key Features:
-- Multiple sponsorship tiers
-- Sponsor recognition system
-- Priority support for sponsors
-- Early access to new features
-- Custom feature development opportunities
-
-Program Structure (not yet finalized):
-```
-Sponsorship Tiers:
-
-🥉 Bronze Supporter
-- GitHub Sponsor badge
-- Priority issue response
-- Community Discord role
-
-🥈 Silver Supporter
-- All Bronze benefits
-- Technical support channel
-- Vote on roadmap priorities
-- Early access to beta features
-
-🥇 Gold Supporter
-- All Silver benefits
-- Custom feature requests
-- Direct developer access
-- Private support sessions
-
-💎 Diamond Partner
-- All Gold benefits
-- Custom development
-- On-demand consulting
-- Integration support
-```
-
-### 4.2 "How to Crawl" Video Series
-A comprehensive educational resource teaching users how to effectively use Crawl4AI for various web scraping and data extraction scenarios.
-
-Key Features:
-- Step-by-step tutorials
-- Real-world use cases
-- Best practices
-- Integration guides
-- Advanced feature deep-dives
-
-These community initiatives are designed to:
-- Provide comprehensive learning resources
-- Foster a supportive user community
-- Ensure sustainable project development
-- Share knowledge and best practices
-- Create opportunities for collaboration
-
-The combination of structured support through sponsorship, educational content through video series, and interactive learning through the playground creates a robust ecosystem for both new and experienced users of Crawl4AI.
--- a/crawl4ai/__init__.py
+++ b/crawl4ai/__init__.py
@@ -1,89 +1 @@
-# __init__.py
-
-from .async_webcrawler import AsyncWebCrawler, CacheMode
-from .async_configs import BrowserConfig, CrawlerRunConfig
-from .content_scraping_strategy import (
-    ContentScrapingStrategy,
-    WebScrapingStrategy,
-    LXMLWebScrapingStrategy,
-)
-from .extraction_strategy import (
-    ExtractionStrategy,
-    LLMExtractionStrategy,
-    CosineStrategy,
-    JsonCssExtractionStrategy,
-    JsonXPathExtractionStrategy
-)
-from .chunking_strategy import ChunkingStrategy, RegexChunking
-from .markdown_generation_strategy import DefaultMarkdownGenerator
-from .content_filter_strategy import PruningContentFilter, BM25ContentFilter, LLMContentFilter, RelevantContentFilter
-from .models import CrawlResult, MarkdownGenerationResult
-from .async_dispatcher import (
-    MemoryAdaptiveDispatcher,
-    SemaphoreDispatcher,
-    RateLimiter,
-    CrawlerMonitor,
-    DisplayMode,
-    BaseDispatcher
-)
-
-__all__ = [
-    "AsyncWebCrawler",
-    "CrawlResult",
-    "CacheMode",
-    "ContentScrapingStrategy",
-    "WebScrapingStrategy",
-    "LXMLWebScrapingStrategy",
-    "BrowserConfig",
-    "CrawlerRunConfig",
-    "ExtractionStrategy",
-    "LLMExtractionStrategy",
-    "CosineStrategy",
-    "JsonCssExtractionStrategy",
-    "JsonXPathExtractionStrategy",
-    "ChunkingStrategy",
-    "RegexChunking",
-    "DefaultMarkdownGenerator",
-    "RelevantContentFilter",
-    "PruningContentFilter",
-    "BM25ContentFilter",
-    "LLMContentFilter",
-    "BaseDispatcher",
-    "MemoryAdaptiveDispatcher",
-    "SemaphoreDispatcher",
-    "RateLimiter",
-    "CrawlerMonitor",
-    "DisplayMode",
-    "MarkdownGenerationResult",
-]
-
-
-def is_sync_version_installed():
-    try:
-        import selenium
-
-        return True
-    except ImportError:
-        return False
-
-
-if is_sync_version_installed():
-    try:
-        from .web_crawler import WebCrawler
-
-        __all__.append("WebCrawler")
-    except ImportError:
-        print(
-            "Warning: Failed to import WebCrawler even though selenium is installed. This might be due to other missing dependencies."
-        )
-else:
-    WebCrawler = None
-    # import warnings
-    # print("Warning: Synchronous WebCrawler is not available. Install crawl4ai[sync] for synchronous support. However, please note that the synchronous version will be deprecated soon.")
-
-import warnings
-from pydantic import warnings as pydantic_warnings
-
-# Disable all Pydantic warnings
-warnings.filterwarnings("ignore", module="pydantic")
-# pydantic_warnings.filter_warnings()
+from .web_crawler import WebCrawler
--- a/crawl4ai/__version__.py
+++ b/crawl4ai/__version__.py
@@ -1,2 +0,0 @@
-# crawl4ai/_version.py
-__version__ = "0.4.3b3"
--- a/crawl4ai/async_configs.py
+++ b/crawl4ai/async_configs.py
@@ -1,756 +0,0 @@
-from .config import (
-    MIN_WORD_THRESHOLD,
-    IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
-    SCREENSHOT_HEIGHT_TRESHOLD,
-    PAGE_TIMEOUT,
-    IMAGE_SCORE_THRESHOLD,
-    SOCIAL_MEDIA_DOMAINS,
-)
-
-from .user_agent_generator import UserAgentGenerator, UAGen, ValidUAGenerator, OnlineUAGenerator
-from .extraction_strategy import ExtractionStrategy
-from .chunking_strategy import ChunkingStrategy, RegexChunking
-from .deep_crawl import DeepCrawlStrategy
-from .markdown_generation_strategy import MarkdownGenerationStrategy
-from .content_filter_strategy import RelevantContentFilter, BM25ContentFilter, LLMContentFilter, PruningContentFilter
-from .content_scraping_strategy import ContentScrapingStrategy, WebScrapingStrategy
-from typing import Optional, Union, List
-from .cache_context import CacheMode
-
-
-class BrowserConfig:
-    """
-    Configuration class for setting up a browser instance and its context in AsyncPlaywrightCrawlerStrategy.
-
-    This class centralizes all parameters that affect browser and context creation. Instead of passing
-    scattered keyword arguments, users can instantiate and modify this configuration object. The crawler
-    code will then reference these settings to initialize the browser in a consistent, documented manner.
-
-    Attributes:
-        browser_type (str): The type of browser to launch. Supported values: "chromium", "firefox", "webkit".
-                            Default: "chromium".
-        headless (bool): Whether to run the browser in headless mode (no visible GUI).
-                         Default: True.
-        use_managed_browser (bool): Launch the browser using a managed approach (e.g., via CDP), allowing
-                                    advanced manipulation. Default: False.
-        cdp_url (str or None): URL for the Chrome DevTools Protocol (CDP) endpoint. Default: None.
-        debugging_port (int): Port for the browser debugging protocol. Default: 9222.
-        use_persistent_context (bool): Use a persistent browser context (like a persistent profile).
-                                       Automatically sets use_managed_browser=True. Default: False.
-        user_data_dir (str or None): Path to a user data directory for persistent sessions. If None, a
-                                     temporary directory may be used. Default: None.
-        chrome_channel (str): The Chrome channel to launch (e.g., "chrome", "msedge"). Only applies if browser_type
-                              is "chromium". Default: "chromium".
-        channel (str): The channel to launch (e.g., "chromium", "chrome", "msedge"). Only applies if browser_type
-                              is "chromium". Default: "chromium".
-        proxy (Optional[str]): Proxy server URL (e.g., "http://username:password@proxy:port"). If None, no proxy is used.
-                             Default: None.
-        proxy_config (dict or None): Detailed proxy configuration, e.g. {"server": "...", "username": "..."}.
-                                     If None, no additional proxy config. Default: None.
-        viewport_width (int): Default viewport width for pages. Default: 1080.
-        viewport_height (int): Default viewport height for pages. Default: 600.
-        verbose (bool): Enable verbose logging.
-                        Default: True.
-        accept_downloads (bool): Whether to allow file downloads. If True, requires a downloads_path.
-                                 Default: False.
-        downloads_path (str or None): Directory to store downloaded files. If None and accept_downloads is True,
-                                      a default path will be created. Default: None.
-        storage_state (str or dict or None): Path or object describing storage state (cookies, localStorage).
-                                             Default: None.
-        ignore_https_errors (bool): Ignore HTTPS certificate errors. Default: True.
-        java_script_enabled (bool): Enable JavaScript execution in pages. Default: True.
-        cookies (list): List of cookies to add to the browser context. Each cookie is a dict with fields like
-                        {"name": "...", "value": "...", "url": "..."}.
-                        Default: [].
-        headers (dict): Extra HTTP headers to apply to all requests in this context.
-                        Default: {}.
-        user_agent (str): Custom User-Agent string to use. Default: "Mozilla/5.0 (X11; Linux x86_64) "
-                           "AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36".
-        user_agent_mode (str or None): Mode for generating the user agent (e.g., "random"). If None, use the provided
-                                       user_agent as-is. Default: None.
-        user_agent_generator_config (dict or None): Configuration for user agent generation if user_agent_mode is set.
-                                                    Default: None.
-        text_mode (bool): If True, disables images and other rich content for potentially faster load times.
-                          Default: False.
-        light_mode (bool): Disables certain background features for performance gains. Default: False.
-        extra_args (list): Additional command-line arguments passed to the browser.
-                           Default: [].
-    """
-
-    def __init__(
-        self,
-        browser_type: str = "chromium",
-        headless: bool = True,
-        use_managed_browser: bool = False,
-        cdp_url: str = None,
-        use_persistent_context: bool = False,
-        user_data_dir: str = None,
-        chrome_channel: str = "chromium",
-        channel: str = "chromium",
-        proxy: str = None,
-        proxy_config: dict = None,
-        viewport_width: int = 1080,
-        viewport_height: int = 600,
-        accept_downloads: bool = False,
-        downloads_path: str = None,
-        storage_state : Union[str, dict, None]=None,
-        ignore_https_errors: bool = True,
-        java_script_enabled: bool = True,
-        sleep_on_close: bool = False,
-        verbose: bool = True,
-        cookies: list = None,
-        headers: dict = None,
-        user_agent: str = (
-            # "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) AppleWebKit/537.36 "
-            # "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
-            # "(KHTML, like Gecko) Chrome/116.0.5845.187 Safari/604.1 Edg/117.0.2045.47"
-            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36"
-        ),
-        user_agent_mode: str = "",
-        user_agent_generator_config: Optional[dict] = None,
-        text_mode: bool = False,
-        light_mode: bool = False,
-        extra_args: list = None,
-        debugging_port: int = 9222,
-        host: str = "localhost",
-    ):
-        self.browser_type = browser_type
-        self.headless = headless
-        self.use_managed_browser = use_managed_browser
-        self.cdp_url = cdp_url
-        self.use_persistent_context = use_persistent_context
-        self.user_data_dir = user_data_dir
-        self.chrome_channel = chrome_channel or self.browser_type or "chromium"
-        self.channel = channel or self.browser_type or "chromium"
-        if self.browser_type in ["firefox", "webkit"]:
-            self.channel = ""
-            self.chrome_channel = ""
-        self.proxy = proxy
-        self.proxy_config = proxy_config
-        self.viewport_width = viewport_width
-        self.viewport_height = viewport_height
-        self.accept_downloads = accept_downloads
-        self.downloads_path = downloads_path
-        self.storage_state = storage_state
-        self.ignore_https_errors = ignore_https_errors
-        self.java_script_enabled = java_script_enabled
-        self.cookies = cookies if cookies is not None else []
-        self.headers = headers if headers is not None else {}
-        self.user_agent = user_agent
-        self.user_agent_mode = user_agent_mode
-        self.user_agent_generator_config = user_agent_generator_config
-        self.text_mode = text_mode
-        self.light_mode = light_mode
-        self.extra_args = extra_args if extra_args is not None else []
-        self.sleep_on_close = sleep_on_close
-        self.verbose = verbose
-        self.debugging_port = debugging_port
-
-        ua_generator = ValidUAGenerator()
-        if self.user_agent_mode == "random":
-            self.user_agent = ua_generator.generate(
-                **(self.user_agent_generator_config or {})
-            )
-        else:
-            pass
-        
-        self.browser_hint = UAGen.generate_client_hints(self.user_agent)
-        self.headers.setdefault("sec-ch-ua", self.browser_hint)
-
-        # If persistent context is requested, ensure managed browser is enabled
-        if self.use_persistent_context:
-            self.use_managed_browser = True
-
-    @staticmethod
-    def from_kwargs(kwargs: dict) -> "BrowserConfig":
-        return BrowserConfig(
-            browser_type=kwargs.get("browser_type", "chromium"),
-            headless=kwargs.get("headless", True),
-            use_managed_browser=kwargs.get("use_managed_browser", False),
-            cdp_url=kwargs.get("cdp_url"),
-            use_persistent_context=kwargs.get("use_persistent_context", False),
-            user_data_dir=kwargs.get("user_data_dir"),
-            chrome_channel=kwargs.get("chrome_channel", "chromium"),
-            channel=kwargs.get("channel", "chromium"),
-            proxy=kwargs.get("proxy"),
-            proxy_config=kwargs.get("proxy_config"),
-            viewport_width=kwargs.get("viewport_width", 1080),
-            viewport_height=kwargs.get("viewport_height", 600),
-            accept_downloads=kwargs.get("accept_downloads", False),
-            downloads_path=kwargs.get("downloads_path"),
-            storage_state=kwargs.get("storage_state"),
-            ignore_https_errors=kwargs.get("ignore_https_errors", True),
-            java_script_enabled=kwargs.get("java_script_enabled", True),
-            cookies=kwargs.get("cookies", []),
-            headers=kwargs.get("headers", {}),
-            user_agent=kwargs.get(
-                "user_agent",
-                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
-                "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36",
-            ),
-            user_agent_mode=kwargs.get("user_agent_mode"),
-            user_agent_generator_config=kwargs.get("user_agent_generator_config"),
-            text_mode=kwargs.get("text_mode", False),
-            light_mode=kwargs.get("light_mode", False),
-            extra_args=kwargs.get("extra_args", []),
-        )
-
-    def to_dict(self):
-        return {
-            "browser_type": self.browser_type,
-            "headless": self.headless,
-            "use_managed_browser": self.use_managed_browser,
-            "cdp_url": self.cdp_url,
-            "use_persistent_context": self.use_persistent_context,
-            "user_data_dir": self.user_data_dir,
-            "chrome_channel": self.chrome_channel,
-            "channel": self.channel,
-            "proxy": self.proxy,
-            "proxy_config": self.proxy_config,
-            "viewport_width": self.viewport_width,
-            "viewport_height": self.viewport_height,
-            "accept_downloads": self.accept_downloads,
-            "downloads_path": self.downloads_path,
-            "storage_state": self.storage_state,
-            "ignore_https_errors": self.ignore_https_errors,
-            "java_script_enabled": self.java_script_enabled,
-            "cookies": self.cookies,
-            "headers": self.headers,
-            "user_agent": self.user_agent,
-            "user_agent_mode": self.user_agent_mode,
-            "user_agent_generator_config": self.user_agent_generator_config,
-            "text_mode": self.text_mode,
-            "light_mode": self.light_mode,
-            "extra_args": self.extra_args,
-            "sleep_on_close": self.sleep_on_close,
-            "verbose": self.verbose,
-            "debugging_port": self.debugging_port,
-        }
-
-    def clone(self, **kwargs):
-        """Create a copy of this configuration with updated values.
-        
-        Args:
-            **kwargs: Key-value pairs of configuration options to update
-            
-        Returns:
-            BrowserConfig: A new instance with the specified updates
-        """
-        config_dict = self.to_dict()
-        config_dict.update(kwargs)
-        return BrowserConfig.from_kwargs(config_dict)
-
-
-class CrawlerRunConfig:
-    """
-    Configuration class for controlling how the crawler runs each crawl operation.
-    This includes parameters for content extraction, page manipulation, waiting conditions,
-    caching, and other runtime behaviors.
-
-    This centralizes parameters that were previously scattered as kwargs to `arun()` and related methods.
-    By using this class, you have a single place to understand and adjust the crawling options.
-
-    Attributes:
-        # Content Processing Parameters
-        word_count_threshold (int): Minimum word count threshold before processing content.
-                                    Default: MIN_WORD_THRESHOLD (typically 200).
-        extraction_strategy (ExtractionStrategy or None): Strategy to extract structured data from crawled pages.
-                                                          Default: None (NoExtractionStrategy is used if None).
-        chunking_strategy (ChunkingStrategy): Strategy to chunk content before extraction.
-                                              Default: RegexChunking().
-        markdown_generator (MarkdownGenerationStrategy): Strategy for generating markdown.
-                                                         Default: None.
-        content_filter (RelevantContentFilter or None): Optional filter to prune irrelevant content.
-                                                        Default: None.
-        only_text (bool): If True, attempt to extract text-only content where applicable.
-                          Default: False.
-        css_selector (str or None): CSS selector to extract a specific portion of the page.
-                                    Default: None.
-        excluded_tags (list of str or None): List of HTML tags to exclude from processing.
-                                             Default: None.
-        excluded_selector (str or None): CSS selector to exclude from processing.
-                                         Default: None.
-        keep_data_attributes (bool): If True, retain `data-*` attributes while removing unwanted attributes.
-                                     Default: False.
-        remove_forms (bool): If True, remove all `<form>` elements from the HTML.
-                             Default: False.
-        prettiify (bool): If True, apply `fast_format_html` to produce prettified HTML output.
-                          Default: False.
-        parser_type (str): Type of parser to use for HTML parsing.
-                           Default: "lxml".
-        scraping_strategy (ContentScrapingStrategy): Scraping strategy to use.
-                           Default: WebScrapingStrategy.
-        proxy_config (dict or None): Detailed proxy configuration, e.g. {"server": "...", "username": "..."}.
-                                     If None, no additional proxy config. Default: None.
-
-        # Caching Parameters
-        cache_mode (CacheMode or None): Defines how caching is handled.
-                                        If None, defaults to CacheMode.ENABLED internally.
-                                        Default: None.
-        session_id (str or None): Optional session ID to persist the browser context and the created
-                                  page instance. If the ID already exists, the crawler does not
-                                  create a new page and uses the current page to preserve the state.
-        bypass_cache (bool): Legacy parameter, if True acts like CacheMode.BYPASS.
-                             Default: False.
-        disable_cache (bool): Legacy parameter, if True acts like CacheMode.DISABLED.
-                              Default: False.
-        no_cache_read (bool): Legacy parameter, if True acts like CacheMode.WRITE_ONLY.
-                              Default: False.
-        no_cache_write (bool): Legacy parameter, if True acts like CacheMode.READ_ONLY.
-                               Default: False.
-        shared_data (dict or None): Shared data to be passed between hooks.
-                                     Default: None.
-
-        # Page Navigation and Timing Parameters
-        wait_until (str): The condition to wait for when navigating, e.g. "domcontentloaded".
-                          Default: "domcontentloaded".
-        page_timeout (int): Timeout in ms for page operations like navigation.
-                            Default: 60000 (60 seconds).
-        wait_for (str or None): A CSS selector or JS condition to wait for before extracting content.
-                                Default: None.
-        wait_for_images (bool): If True, wait for images to load before extracting content.
-                                Default: False.
-        delay_before_return_html (float): Delay in seconds before retrieving final HTML.
-                                          Default: 0.1.
-        mean_delay (float): Mean base delay between requests when calling arun_many.
-                            Default: 0.1.
-        max_range (float): Max random additional delay range for requests in arun_many.
-                           Default: 0.3.
-        semaphore_count (int): Number of concurrent operations allowed.
-                               Default: 5.
-
-        # Page Interaction Parameters
-        js_code (str or list of str or None): JavaScript code/snippets to run on the page.
-                                              Default: None.
-        js_only (bool): If True, indicates subsequent calls are JS-driven updates, not full page loads.
-                        Default: False.
-        ignore_body_visibility (bool): If True, ignore whether the body is visible before proceeding.
-                                       Default: True.
-        scan_full_page (bool): If True, scroll through the entire page to load all content.
-                               Default: False.
-        scroll_delay (float): Delay in seconds between scroll steps if scan_full_page is True.
-                              Default: 0.2.
-        process_iframes (bool): If True, attempts to process and inline iframe content.
-                                Default: False.
-        remove_overlay_elements (bool): If True, remove overlays/popups before extracting HTML.
-                                        Default: False.
-        simulate_user (bool): If True, simulate user interactions (mouse moves, clicks) for anti-bot measures.
-                              Default: False.
-        override_navigator (bool): If True, overrides navigator properties for more human-like behavior.
-                                   Default: False.
-        magic (bool): If True, attempts automatic handling of overlays/popups.
-                      Default: False.
-        adjust_viewport_to_content (bool): If True, adjust viewport according to the page content dimensions.
-                                           Default: False.
-
-        # Media Handling Parameters
-        screenshot (bool): Whether to take a screenshot after crawling.
-                           Default: False.
-        screenshot_wait_for (float or None): Additional wait time before taking a screenshot.
-                                             Default: None.
-        screenshot_height_threshold (int): Threshold for page height to decide screenshot strategy.
-                                           Default: SCREENSHOT_HEIGHT_TRESHOLD (from config, e.g. 20000).
-        pdf (bool): Whether to generate a PDF of the page.
-                    Default: False.
-        image_description_min_word_threshold (int): Minimum words for image description extraction.
-                                                    Default: IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD (e.g., 50).
-        image_score_threshold (int): Minimum score threshold for processing an image.
-                                     Default: IMAGE_SCORE_THRESHOLD (e.g., 3).
-        exclude_external_images (bool): If True, exclude all external images from processing.
-                                         Default: False.
-
-        # Link and Domain Handling Parameters
-        exclude_social_media_domains (list of str): List of domains to exclude for social media links.
-                                                    Default: SOCIAL_MEDIA_DOMAINS (from config).
-        exclude_external_links (bool): If True, exclude all external links from the results.
-                                       Default: False.
-        exclude_social_media_links (bool): If True, exclude links pointing to social media domains.
-                                           Default: False.
-        exclude_domains (list of str): List of specific domains to exclude from results.
-                                       Default: [].
-
-        # Debugging and Logging Parameters
-        verbose (bool): Enable verbose logging.
-                        Default: True.
-        log_console (bool): If True, log console messages from the page.
-                            Default: False.
-
-        # Streaming Parameters
-        stream (bool): If True, enables streaming of crawled URLs as they are processed when used with arun_many.
-                      Default: False.
-
-        # Optional Parameters
-        url (str or None): Optional URL; this is not a compulsory parameter. Default: None.
-        check_robots_txt (bool): Whether to check robots.txt rules before crawling. Default: False
-        user_agent (str): Custom User-Agent string to use. Default: None
-        user_agent_mode (str or None): Mode for generating the user agent (e.g., "random"). If None, use the provided
-                                       user_agent as-is. Default: None.
-        user_agent_generator_config (dict or None): Configuration for user agent generation if user_agent_mode is set.
-                                                    Default: None.
-    """
-
-    def __init__(
-        self,
-        # Content Processing Parameters
-        word_count_threshold: int = MIN_WORD_THRESHOLD,
-        extraction_strategy: ExtractionStrategy = None,
-        chunking_strategy: ChunkingStrategy = RegexChunking(),
-        deep_crawl_strategy: DeepCrawlStrategy = None,
-        markdown_generator: MarkdownGenerationStrategy = None,
-        content_filter : RelevantContentFilter = None,
-        only_text: bool = False,
-        css_selector: str = None,
-        excluded_tags: list = None,
-        excluded_selector: str = None,
-        keep_data_attributes: bool = False,
-        remove_forms: bool = False,
-        prettiify: bool = False,
-        parser_type: str = "lxml",
-        scraping_strategy: ContentScrapingStrategy = None,
-        proxy_config: dict = None,
-        # SSL Parameters
-        fetch_ssl_certificate: bool = False,
-        # Caching Parameters
-        cache_mode: CacheMode =None,
-        session_id: str = None,
-        bypass_cache: bool = False,
-        disable_cache: bool = False,
-        no_cache_read: bool = False,
-        no_cache_write: bool = False,
-        shared_data: dict = None,
-        # Page Navigation and Timing Parameters
-        wait_until: str = "domcontentloaded",
-        page_timeout: int = PAGE_TIMEOUT,
-        wait_for: str = None,
-        wait_for_images: bool = False,
-        delay_before_return_html: float = 0.1,
-        mean_delay: float = 0.1,
-        max_range: float = 0.3,
-        semaphore_count: int = 5,
-        # Page Interaction Parameters
-        js_code: Union[str, List[str]] = None,
-        js_only: bool = False,
-        ignore_body_visibility: bool = True,
-        scan_full_page: bool = False,
-        scroll_delay: float = 0.2,
-        process_iframes: bool = False,
-        remove_overlay_elements: bool = False,
-        simulate_user: bool = False,
-        override_navigator: bool = False,
-        magic: bool = False,
-        adjust_viewport_to_content: bool = False,
-        # Media Handling Parameters
-        screenshot: bool = False,
-        screenshot_wait_for: float = None,
-        screenshot_height_threshold: int = SCREENSHOT_HEIGHT_TRESHOLD,
-        pdf: bool = False,
-        image_description_min_word_threshold: int = IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
-        image_score_threshold: int = IMAGE_SCORE_THRESHOLD,
-        exclude_external_images: bool = False,
-        # Link and Domain Handling Parameters
-        exclude_social_media_domains: list = None,
-        exclude_external_links: bool = False,
-        exclude_social_media_links: bool = False,
-        exclude_domains: list = None,
-        # Debugging and Logging Parameters
-        verbose: bool = True,
-        log_console: bool = False,
-        # Streaming Parameters
-        stream: bool = False,
-        url: str = None,
-        check_robots_txt: bool = False,
-        user_agent: str = None,
-        user_agent_mode: str = None,
-        user_agent_generator_config: dict = {},
-    ):
-        self.url = url
-
-        # Content Processing Parameters
-        self.word_count_threshold = word_count_threshold
-        self.extraction_strategy = extraction_strategy
-        self.chunking_strategy = chunking_strategy
-        self.deep_crawl_strategy = deep_crawl_strategy
-        self.markdown_generator = markdown_generator
-        self.content_filter = content_filter
-        self.only_text = only_text
-        self.css_selector = css_selector
-        self.excluded_tags = excluded_tags or []
-        self.excluded_selector = excluded_selector or ""
-        self.keep_data_attributes = keep_data_attributes
-        self.remove_forms = remove_forms
-        self.prettiify = prettiify
-        self.parser_type = parser_type
-        self.scraping_strategy = scraping_strategy or WebScrapingStrategy()
-        self.proxy_config = proxy_config
-
-        # SSL Parameters
-        self.fetch_ssl_certificate = fetch_ssl_certificate
-
-        # Caching Parameters
-        self.cache_mode = cache_mode
-        self.session_id = session_id
-        self.bypass_cache = bypass_cache
-        self.disable_cache = disable_cache
-        self.no_cache_read = no_cache_read
-        self.no_cache_write = no_cache_write
-        self.shared_data = shared_data
-
-        # Page Navigation and Timing Parameters
-        self.wait_until = wait_until
-        self.page_timeout = page_timeout
-        self.wait_for = wait_for
-        self.wait_for_images = wait_for_images
-        self.delay_before_return_html = delay_before_return_html
-        self.mean_delay = mean_delay
-        self.max_range = max_range
-        self.semaphore_count = semaphore_count
-
-        # Page Interaction Parameters
-        self.js_code = js_code
-        self.js_only = js_only
-        self.ignore_body_visibility = ignore_body_visibility
-        self.scan_full_page = scan_full_page
-        self.scroll_delay = scroll_delay
-        self.process_iframes = process_iframes
-        self.remove_overlay_elements = remove_overlay_elements
-        self.simulate_user = simulate_user
-        self.override_navigator = override_navigator
-        self.magic = magic
-        self.adjust_viewport_to_content = adjust_viewport_to_content
-
-        # Media Handling Parameters
-        self.screenshot = screenshot
-        self.screenshot_wait_for = screenshot_wait_for
-        self.screenshot_height_threshold = screenshot_height_threshold
-        self.pdf = pdf
-        self.image_description_min_word_threshold = image_description_min_word_threshold
-        self.image_score_threshold = image_score_threshold
-        self.exclude_external_images = exclude_external_images
-
-        # Link and Domain Handling Parameters
-        self.exclude_social_media_domains = (
-            exclude_social_media_domains or SOCIAL_MEDIA_DOMAINS
-        )
-        self.exclude_external_links = exclude_external_links
-        self.exclude_social_media_links = exclude_social_media_links
-        self.exclude_domains = exclude_domains or []
-
-        # Debugging and Logging Parameters
-        self.verbose = verbose
-        self.log_console = log_console
-
-        # Streaming Parameters
-        self.stream = stream
-
-        # Robots.txt Handling Parameters
-        self.check_robots_txt = check_robots_txt
-
-        # User Agent Parameters
-        self.user_agent = user_agent
-        self.user_agent_mode = user_agent_mode
-        # Fall back to a fresh dict so instances never share the mutable default argument
-        self.user_agent_generator_config = user_agent_generator_config or {}
-
-        # Validate type of extraction strategy and chunking strategy if they are provided
-        if self.extraction_strategy is not None and not isinstance(
-            self.extraction_strategy, ExtractionStrategy
-        ):
-            raise ValueError(
-                "extraction_strategy must be an instance of ExtractionStrategy"
-            )
-        
-        if self.deep_crawl_strategy is not None and not isinstance(
-            self.deep_crawl_strategy, DeepCrawlStrategy
-        ):
-            raise ValueError(
-                "deep_crawl_strategy must be an instance of DeepCrawlStrategy"
-            )
-
-        if self.chunking_strategy is not None and not isinstance(
-            self.chunking_strategy, ChunkingStrategy
-        ):
-            raise ValueError(
-                "chunking_strategy must be an instance of ChunkingStrategy"
-            )
-
-        # Set default chunking strategy if None
-        if self.chunking_strategy is None:
-            self.chunking_strategy = RegexChunking()
-
-    @staticmethod
-    def from_kwargs(kwargs: dict) -> "CrawlerRunConfig":
-        return CrawlerRunConfig(
-            # Content Processing Parameters
-            word_count_threshold=kwargs.get("word_count_threshold", MIN_WORD_THRESHOLD),
-            extraction_strategy=kwargs.get("extraction_strategy"),
-            chunking_strategy=kwargs.get("chunking_strategy", RegexChunking()),
-            deep_crawl_strategy=kwargs.get("deep_crawl_strategy"),
-            markdown_generator=kwargs.get("markdown_generator"),
-            content_filter=kwargs.get("content_filter"),
-            only_text=kwargs.get("only_text", False),
-            css_selector=kwargs.get("css_selector"),
-            excluded_tags=kwargs.get("excluded_tags", []),
-            excluded_selector=kwargs.get("excluded_selector", ""),
-            keep_data_attributes=kwargs.get("keep_data_attributes", False),
-            remove_forms=kwargs.get("remove_forms", False),
-            prettiify=kwargs.get("prettiify", False),
-            parser_type=kwargs.get("parser_type", "lxml"),
-            scraping_strategy=kwargs.get("scraping_strategy"),
-            proxy_config=kwargs.get("proxy_config"),
-            # SSL Parameters
-            fetch_ssl_certificate=kwargs.get("fetch_ssl_certificate", False),
-            # Caching Parameters
-            cache_mode=kwargs.get("cache_mode"),
-            session_id=kwargs.get("session_id"),
-            bypass_cache=kwargs.get("bypass_cache", False),
-            disable_cache=kwargs.get("disable_cache", False),
-            no_cache_read=kwargs.get("no_cache_read", False),
-            no_cache_write=kwargs.get("no_cache_write", False),
-            shared_data=kwargs.get("shared_data", None),
-            # Page Navigation and Timing Parameters
-            wait_until=kwargs.get("wait_until", "domcontentloaded"),
-            page_timeout=kwargs.get("page_timeout", PAGE_TIMEOUT),
-            wait_for=kwargs.get("wait_for"),
-            wait_for_images=kwargs.get("wait_for_images", False),
-            delay_before_return_html=kwargs.get("delay_before_return_html", 0.1),
-            mean_delay=kwargs.get("mean_delay", 0.1),
-            max_range=kwargs.get("max_range", 0.3),
-            semaphore_count=kwargs.get("semaphore_count", 5),
-            # Page Interaction Parameters
-            js_code=kwargs.get("js_code"),
-            js_only=kwargs.get("js_only", False),
-            ignore_body_visibility=kwargs.get("ignore_body_visibility", True),
-            scan_full_page=kwargs.get("scan_full_page", False),
-            scroll_delay=kwargs.get("scroll_delay", 0.2),
-            process_iframes=kwargs.get("process_iframes", False),
-            remove_overlay_elements=kwargs.get("remove_overlay_elements", False),
-            simulate_user=kwargs.get("simulate_user", False),
-            override_navigator=kwargs.get("override_navigator", False),
-            magic=kwargs.get("magic", False),
-            adjust_viewport_to_content=kwargs.get("adjust_viewport_to_content", False),
-            # Media Handling Parameters
-            screenshot=kwargs.get("screenshot", False),
-            screenshot_wait_for=kwargs.get("screenshot_wait_for"),
-            screenshot_height_threshold=kwargs.get(
-                "screenshot_height_threshold", SCREENSHOT_HEIGHT_TRESHOLD
-            ),
-            pdf=kwargs.get("pdf", False),
-            image_description_min_word_threshold=kwargs.get(
-                "image_description_min_word_threshold",
-                IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
-            ),
-            image_score_threshold=kwargs.get(
-                "image_score_threshold", IMAGE_SCORE_THRESHOLD
-            ),
-            exclude_external_images=kwargs.get("exclude_external_images", False),
-            # Link and Domain Handling Parameters
-            exclude_social_media_domains=kwargs.get(
-                "exclude_social_media_domains", SOCIAL_MEDIA_DOMAINS
-            ),
-            exclude_external_links=kwargs.get("exclude_external_links", False),
-            exclude_social_media_links=kwargs.get("exclude_social_media_links", False),
-            exclude_domains=kwargs.get("exclude_domains", []),
-            # Debugging and Logging Parameters
-            verbose=kwargs.get("verbose", True),
-            log_console=kwargs.get("log_console", False),
-            # Streaming Parameters
-            stream=kwargs.get("stream", False),
-            url=kwargs.get("url"),
-            check_robots_txt=kwargs.get("check_robots_txt", False),
-            user_agent=kwargs.get("user_agent"),
-            user_agent_mode=kwargs.get("user_agent_mode"),
-            user_agent_generator_config=kwargs.get("user_agent_generator_config", {}),
-        )
-
-    # Return the configuration as a plain dict
-    def to_dict(self):
-        return {
-            "word_count_threshold": self.word_count_threshold,
-            "extraction_strategy": self.extraction_strategy,
-            "chunking_strategy": self.chunking_strategy,
-            "deep_crawl_strategy": self.deep_crawl_strategy,
-            "markdown_generator": self.markdown_generator,
-            "content_filter": self.content_filter,
-            "only_text": self.only_text,
-            "css_selector": self.css_selector,
-            "excluded_tags": self.excluded_tags,
-            "excluded_selector": self.excluded_selector,
-            "keep_data_attributes": self.keep_data_attributes,
-            "remove_forms": self.remove_forms,
-            "prettiify": self.prettiify,
-            "parser_type": self.parser_type,
-            "scraping_strategy": self.scraping_strategy,
-            "proxy_config": self.proxy_config,
-            "fetch_ssl_certificate": self.fetch_ssl_certificate,
-            "cache_mode": self.cache_mode,
-            "session_id": self.session_id,
-            "bypass_cache": self.bypass_cache,
-            "disable_cache": self.disable_cache,
-            "no_cache_read": self.no_cache_read,
-            "no_cache_write": self.no_cache_write,
-            "shared_data": self.shared_data,
-            "wait_until": self.wait_until,
-            "page_timeout": self.page_timeout,
-            "wait_for": self.wait_for,
-            "wait_for_images": self.wait_for_images,
-            "delay_before_return_html": self.delay_before_return_html,
-            "mean_delay": self.mean_delay,
-            "max_range": self.max_range,
-            "semaphore_count": self.semaphore_count,
-            "js_code": self.js_code,
-            "js_only": self.js_only,
-            "ignore_body_visibility": self.ignore_body_visibility,
-            "scan_full_page": self.scan_full_page,
-            "scroll_delay": self.scroll_delay,
-            "process_iframes": self.process_iframes,
-            "remove_overlay_elements": self.remove_overlay_elements,
-            "simulate_user": self.simulate_user,
-            "override_navigator": self.override_navigator,
-            "magic": self.magic,
-            "adjust_viewport_to_content": self.adjust_viewport_to_content,
-            "screenshot": self.screenshot,
-            "screenshot_wait_for": self.screenshot_wait_for,
-            "screenshot_height_threshold": self.screenshot_height_threshold,
-            "pdf": self.pdf,
-            "image_description_min_word_threshold": self.image_description_min_word_threshold,
-            "image_score_threshold": self.image_score_threshold,
-            "exclude_external_images": self.exclude_external_images,
-            "exclude_social_media_domains": self.exclude_social_media_domains,
-            "exclude_external_links": self.exclude_external_links,
-            "exclude_social_media_links": self.exclude_social_media_links,
-            "exclude_domains": self.exclude_domains,
-            "verbose": self.verbose,
-            "log_console": self.log_console,
-            "stream": self.stream,
-            "url": self.url,
-            "check_robots_txt": self.check_robots_txt,
-            "user_agent": self.user_agent,
-            "user_agent_mode": self.user_agent_mode,
-            "user_agent_generator_config": self.user_agent_generator_config,
-        }
-
-    def clone(self, **kwargs):
-        """Create a copy of this configuration with updated values.
-        
-        Args:
-            **kwargs: Key-value pairs of configuration options to update
-            
-        Returns:
-            CrawlerRunConfig: A new instance with the specified updates
-            
-        Example:
-            ```python
-            # Create a new config with streaming enabled
-            stream_config = config.clone(stream=True)
-            
-            # Create a new config with multiple updates
-            new_config = config.clone(
-                stream=True,
-                cache_mode=CacheMode.BYPASS,
-                verbose=True
-            )
-            ```
-        """
-        config_dict = self.to_dict()
-        config_dict.update(kwargs)
-        return CrawlerRunConfig.from_kwargs(config_dict)
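
Note on the removed `clone()` above: it builds a copy by round-tripping through `to_dict()` and `from_kwargs()`. The following is a minimal sketch of that copy-with-overrides pattern, not the library's code. `MiniConfig` and its three options are a hypothetical stand-in for `CrawlerRunConfig`, reduced so the example runs on its own.

```python
class MiniConfig:
    """Hypothetical stand-in for CrawlerRunConfig with only three options."""

    def __init__(self, stream=False, verbose=True, excluded_tags=None):
        self.stream = stream
        self.verbose = verbose
        # Copy the list so instances never share a caller's list.
        self.excluded_tags = list(excluded_tags or [])

    @staticmethod
    def from_kwargs(kwargs):
        # Only known keys are read; anything else in kwargs is ignored.
        return MiniConfig(
            stream=kwargs.get("stream", False),
            verbose=kwargs.get("verbose", True),
            excluded_tags=kwargs.get("excluded_tags"),
        )

    def to_dict(self):
        return {
            "stream": self.stream,
            "verbose": self.verbose,
            "excluded_tags": self.excluded_tags,
        }

    def clone(self, **kwargs):
        # Dump current values, overlay the overrides, rebuild a new instance.
        config_dict = self.to_dict()
        config_dict.update(kwargs)
        return MiniConfig.from_kwargs(config_dict)


base = MiniConfig(excluded_tags=["nav"])
streaming = base.clone(stream=True)
```

Because `from_kwargs` reads only the keys it knows, an override with an unrecognised name passed to `clone()` is dropped silently rather than raising. The same holds for the removed `CrawlerRunConfig.from_kwargs`, which also uses `kwargs.get(...)` for a fixed set of keys.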
--- a/crawl4ai/async_crawler_strategy.py
+++ b/crawl4ai/async_crawler_strategy.py
--- a/crawl4ai/async_database.py
+++ b/crawl4ai/async_database.py
@@ -1,558 +0,0 @@
-import os
-from pathlib import Path
-import aiosqlite
-import asyncio
-from typing import Optional, Dict
-from contextlib import asynccontextmanager
-import logging
-import json  # Added for serialization/deserialization
-from .utils import ensure_content_dirs, generate_content_hash
-from .models import CrawlResult, MarkdownGenerationResult
-import aiofiles
-from .version_manager import VersionManager
-from .async_logger import AsyncLogger
-from .utils import get_error_context, create_box_message
-
-# Set up logging
-# logging.basicConfig(level=logging.INFO)
-# logger = logging.getLogger(__name__)
-# logger.setLevel(logging.INFO)
-
-base_directory = DB_PATH = os.path.join(
-    os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai"
-)
-os.makedirs(DB_PATH, exist_ok=True)
-DB_PATH = os.path.join(base_directory, "crawl4ai.db")
-
-
-class AsyncDatabaseManager:
-    def __init__(self, pool_size: int = 10, max_retries: int = 3):
-        self.db_path = DB_PATH
-        self.content_paths = ensure_content_dirs(os.path.dirname(DB_PATH))
-        self.pool_size = pool_size
-        self.max_retries = max_retries
-        self.connection_pool: Dict[int, aiosqlite.Connection] = {}
-        self.pool_lock = asyncio.Lock()
-        self.init_lock = asyncio.Lock()
-        self.connection_semaphore = asyncio.Semaphore(pool_size)
-        self._initialized = False
-        self.version_manager = VersionManager()
-        self.logger = AsyncLogger(
-            log_file=os.path.join(base_directory, ".crawl4ai", "crawler_db.log"),
-            verbose=False,
-            tag_width=10,
-        )
-
-    async def initialize(self):
-        """Initialize the database and connection pool"""
-        try:
-            self.logger.info("Initializing database", tag="INIT")
-            # Ensure the database file exists
-            os.makedirs(os.path.dirname(self.db_path), exist_ok=True)
-
-            # Check if version update is needed
-            needs_update = self.version_manager.needs_update()
-
-            # Always ensure base table exists
-            await self.ainit_db()
-
-            # Verify the table exists
-            async with aiosqlite.connect(self.db_path, timeout=30.0) as db:
-                async with db.execute(
-                    "SELECT name FROM sqlite_master WHERE type='table' AND name='crawled_data'"
-                ) as cursor:
-                    result = await cursor.fetchone()
-                    if not result:
-                        raise Exception("crawled_data table was not created")
-
-            # If version changed or fresh install, run updates
-            if needs_update:
-                self.logger.info("New version detected, running updates", tag="INIT")
-                await self.update_db_schema()
-                from .migrations import (
-                    run_migration,
-                )  # Import here to avoid circular imports
-
-                await run_migration()
-                self.version_manager.update_version()  # Update stored version after successful migration
-                self.logger.success(
-                    "Version update completed successfully", tag="COMPLETE"
-                )
-            else:
-                self.logger.success(
-                    "Database initialization completed successfully", tag="COMPLETE"
-                )
-
-        except Exception as e:
-            self.logger.error(
-                message="Database initialization error: {error}",
-                tag="ERROR",
-                params={"error": str(e)},
-            )
-            self.logger.info(
-                message="Database will be initialized on first use", tag="INIT"
-            )
-
-            raise
-
-    async def cleanup(self):
-        """Cleanup connections when shutting down"""
-        async with self.pool_lock:
-            for conn in self.connection_pool.values():
-                await conn.close()
-            self.connection_pool.clear()
-
-    @asynccontextmanager
-    async def get_connection(self):
-        """Connection pool manager with enhanced error handling"""
-        if not self._initialized:
-            async with self.init_lock:
-                if not self._initialized:
-                    try:
-                        await self.initialize()
-                        self._initialized = True
-                    except Exception as e:
-                        import sys
-
-                        error_context = get_error_context(sys.exc_info())
-                        self.logger.error(
-                            message="Database initialization failed:\n{error}\n\nContext:\n{context}\n\nTraceback:\n{traceback}",
-                            tag="ERROR",
-                            force_verbose=True,
-                            params={
-                                "error": str(e),
-                                "context": error_context["code_context"],
-                                "traceback": error_context["full_traceback"],
-                            },
-                        )
-                        raise
-
-        await self.connection_semaphore.acquire()
-        task_id = id(asyncio.current_task())
-
-        try:
-            async with self.pool_lock:
-                if task_id not in self.connection_pool:
-                    try:
-                        conn = await aiosqlite.connect(self.db_path, timeout=30.0)
-                        await conn.execute("PRAGMA journal_mode = WAL")
-                        await conn.execute("PRAGMA busy_timeout = 5000")
-
-                        # Verify database structure
-                        async with conn.execute(
-                            "PRAGMA table_info(crawled_data)"
-                        ) as cursor:
-                            columns = await cursor.fetchall()
-                            column_names = [col[1] for col in columns]
-                            expected_columns = {
-                                "url",
-                                "html",
-                                "cleaned_html",
-                                "markdown",
-                                "extracted_content",
-                                "success",
-                                "media",
-                                "links",
-                                "metadata",
-                                "screenshot",
-                                "response_headers",
-                                "downloaded_files",
-                            }
-                            missing_columns = expected_columns - set(column_names)
-                            if missing_columns:
-                                raise ValueError(
-                                    f"Database missing columns: {missing_columns}"
-                                )
-
-                        self.connection_pool[task_id] = conn
-                    except Exception as e:
-                        import sys
-
-                        error_context = get_error_context(sys.exc_info())
-                        error_message = (
-                            f"Unexpected error in db get_connection at line {error_context['line_no']} "
-                            f"in {error_context['function']} ({error_context['filename']}):\n"
-                            f"Error: {str(e)}\n\n"
-                            f"Code context:\n{error_context['code_context']}"
-                        )
-                        self.logger.error(
-                            message=create_box_message(error_message, type="error"),
-                        )
-
-                        raise
-
-            yield self.connection_pool[task_id]
-
-        except Exception as e:
-            import sys
-
-            error_context = get_error_context(sys.exc_info())
-            error_message = (
-                f"Unexpected error in db get_connection at line {error_context['line_no']} "
-                f"in {error_context['function']} ({error_context['filename']}):\n"
-                f"Error: {str(e)}\n\n"
-                f"Code context:\n{error_context['code_context']}"
-            )
-            self.logger.error(
-                message=create_box_message(error_message, type="error"),
-            )
-            raise
-        finally:
-            async with self.pool_lock:
-                if task_id in self.connection_pool:
-                    await self.connection_pool[task_id].close()
-                    del self.connection_pool[task_id]
-            self.connection_semaphore.release()
-
-    async def execute_with_retry(self, operation, *args):
-        """Execute database operations with retry logic"""
-        for attempt in range(self.max_retries):
-            try:
-                async with self.get_connection() as db:
-                    result = await operation(db, *args)
-                    await db.commit()
-                    return result
-            except Exception as e:
-                if attempt == self.max_retries - 1:
-                    self.logger.error(
-                        message="Operation failed after {retries} attempts: {error}",
-                        tag="ERROR",
-                        force_verbose=True,
-                        params={"retries": self.max_retries, "error": str(e)},
-                    )
-                    raise
-                await asyncio.sleep(1 * (attempt + 1))  # Linear backoff: 1s, 2s, 3s, ...
-
-    async def ainit_db(self):
-        """Initialize database schema"""
-        async with aiosqlite.connect(self.db_path, timeout=30.0) as db:
-            await db.execute(
-                """
-                CREATE TABLE IF NOT EXISTS crawled_data (
-                    url TEXT PRIMARY KEY,
-                    html TEXT,
-                    cleaned_html TEXT,
-                    markdown TEXT,
-                    extracted_content TEXT,
-                    success BOOLEAN,
-                    media TEXT DEFAULT "{}",
-                    links TEXT DEFAULT "{}",
-                    metadata TEXT DEFAULT "{}",
-                    screenshot TEXT DEFAULT "",
-                    response_headers TEXT DEFAULT "{}",
-                    downloaded_files TEXT DEFAULT "{}"  -- New column added
-                )
-            """
-            )
-            await db.commit()
-
-    async def update_db_schema(self):
-        """Update database schema if needed"""
-        async with aiosqlite.connect(self.db_path, timeout=30.0) as db:
-            cursor = await db.execute("PRAGMA table_info(crawled_data)")
-            columns = await cursor.fetchall()
-            column_names = [column[1] for column in columns]
-
-            # List of new columns to add
-            new_columns = [
-                "media",
-                "links",
-                "metadata",
-                "screenshot",
-                "response_headers",
-                "downloaded_files",
-            ]
-
-            for column in new_columns:
-                if column not in column_names:
-                    await self.aalter_db_add_column(column, db)
-            await db.commit()
-
-    async def aalter_db_add_column(self, new_column: str, db):
-        """Add new column to the database"""
-        if new_column == "response_headers":
-            await db.execute(
-                f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT "{{}}"'
-            )
-        else:
-            await db.execute(
-                f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""'
-            )
-        self.logger.info(
-            message="Added column '{column}' to the database",
-            tag="INIT",
-            params={"column": new_column},
-        )
-
-    async def aget_cached_url(self, url: str) -> Optional[CrawlResult]:
-        """Retrieve cached URL data as CrawlResult"""
-
-        async def _get(db):
-            async with db.execute(
-                "SELECT * FROM crawled_data WHERE url = ?", (url,)
-            ) as cursor:
-                row = await cursor.fetchone()
-                if not row:
-                    return None
-
-                # Get column names
-                columns = [description[0] for description in cursor.description]
-                # Create dict from row data
-                row_dict = dict(zip(columns, row))
-
-                # Load content from files using stored hashes
-                content_fields = {
-                    "html": row_dict["html"],
-                    "cleaned_html": row_dict["cleaned_html"],
-                    "markdown": row_dict["markdown"],
-                    "extracted_content": row_dict["extracted_content"],
-                    "screenshot": row_dict["screenshot"],
-                    "screenshots": row_dict["screenshot"],
-                }
-
-                for field, hash_value in content_fields.items():
-                    if hash_value:
-                        content = await self._load_content(
-                            hash_value,
-                            field.split("_")[0],  # Get content type from field name
-                        )
-                        row_dict[field] = content or ""
-                    else:
-                        row_dict[field] = ""
-
-                # Parse JSON fields
-                json_fields = [
-                    "media",
-                    "links",
-                    "metadata",
-                    "response_headers",
-                    "markdown",
-                ]
-                for field in json_fields:
-                    try:
-                        row_dict[field] = (
-                            json.loads(row_dict[field]) if row_dict[field] else {}
-                        )
-                    except json.JSONDecodeError:
-                        # Legacy rows may hold plain-text markdown instead of JSON
-                        if field == "markdown" and isinstance(row_dict[field], str):
-                            pass  # keep the raw markdown string unchanged
-                        else:
-                            row_dict[field] = {}
-
-                if isinstance(row_dict["markdown"], dict):
-                    row_dict["markdown_v2"] = row_dict["markdown"]
-                    if row_dict["markdown"].get("raw_markdown"):
-                        row_dict["markdown"] = row_dict["markdown"]["raw_markdown"]
-
-                # Parse downloaded_files
-                try:
-                    row_dict["downloaded_files"] = (
-                        json.loads(row_dict["downloaded_files"])
-                        if row_dict["downloaded_files"]
-                        else []
-                    )
-                except json.JSONDecodeError:
-                    row_dict["downloaded_files"] = []
-
-                # Remove any fields not in CrawlResult model
-                valid_fields = CrawlResult.__annotations__.keys()
-                filtered_dict = {k: v for k, v in row_dict.items() if k in valid_fields}
-
-                return CrawlResult(**filtered_dict)
-
-        try:
-            return await self.execute_with_retry(_get)
-        except Exception as e:
-            self.logger.error(
-                message="Error retrieving cached URL: {error}",
-                tag="ERROR",
-                force_verbose=True,
-                params={"error": str(e)},
-            )
-            return None
-
-    async def acache_url(self, result: CrawlResult):
-        """Cache CrawlResult data"""
-        # Store content files and get hashes
-        content_map = {
-            "html": (result.html, "html"),
-            "cleaned_html": (result.cleaned_html or "", "cleaned"),
-            "markdown": None,
-            "extracted_content": (result.extracted_content or "", "extracted"),
-            "screenshot": (result.screenshot or "", "screenshots"),
-        }
-
-        try:
-            if isinstance(result.markdown, MarkdownGenerationResult):
-                content_map["markdown"] = (
-                    result.markdown.model_dump_json(),
-                    "markdown",
-                )
-            elif hasattr(result, "markdown_v2"):
-                content_map["markdown"] = (
-                    result.markdown_v2.model_dump_json(),
-                    "markdown",
-                )
-            elif isinstance(result.markdown, str):
-                markdown_result = MarkdownGenerationResult(raw_markdown=result.markdown)
-                content_map["markdown"] = (
-                    markdown_result.model_dump_json(),
-                    "markdown",
-                )
-            else:
-                content_map["markdown"] = (
-                    MarkdownGenerationResult().model_dump_json(),
-                    "markdown",
-                )
-        except Exception as e:
-            self.logger.warning(
-                message=f"Error processing markdown content: {str(e)}", tag="WARNING"
-            )
-            # Fallback to empty markdown result
-            content_map["markdown"] = (
-                MarkdownGenerationResult().model_dump_json(),
-                "markdown",
-            )
-
-        content_hashes = {}
-        for field, (content, content_type) in content_map.items():
-            content_hashes[field] = await self._store_content(content, content_type)
-
-        async def _cache(db):
-            await db.execute(
-                """
-                INSERT INTO crawled_data (
-                    url, html, cleaned_html, markdown,
-                    extracted_content, success, media, links, metadata,
-                    screenshot, response_headers, downloaded_files
-                )
-                VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
-                ON CONFLICT(url) DO UPDATE SET
-                    html = excluded.html,
-                    cleaned_html = excluded.cleaned_html,
-                    markdown = excluded.markdown,
-                    extracted_content = excluded.extracted_content,
-                    success = excluded.success,
-                    media = excluded.media,
-                    links = excluded.links,
-                    metadata = excluded.metadata,
-                    screenshot = excluded.screenshot,
-                    response_headers = excluded.response_headers,
-                    downloaded_files = excluded.downloaded_files
-            """,
-                (
-                    result.url,
-                    content_hashes["html"],
-                    content_hashes["cleaned_html"],
-                    content_hashes["markdown"],
-                    content_hashes["extracted_content"],
-                    result.success,
-                    json.dumps(result.media),
-                    json.dumps(result.links),
-                    json.dumps(result.metadata or {}),
-                    content_hashes["screenshot"],
-                    json.dumps(result.response_headers or {}),
-                    json.dumps(result.downloaded_files or []),
-                ),
-            )
-
-        try:
-            await self.execute_with_retry(_cache)
-        except Exception as e:
-            self.logger.error(
-                message="Error caching URL: {error}",
-                tag="ERROR",
-                force_verbose=True,
-                params={"error": str(e)},
-            )
-
-    async def aget_total_count(self) -> int:
-        """Get total number of cached URLs"""
-
-        async def _count(db):
-            async with db.execute("SELECT COUNT(*) FROM crawled_data") as cursor:
-                result = await cursor.fetchone()
-                return result[0] if result else 0
-
-        try:
-            return await self.execute_with_retry(_count)
-        except Exception as e:
-            self.logger.error(
-                message="Error getting total count: {error}",
-                tag="ERROR",
-                force_verbose=True,
-                params={"error": str(e)},
-            )
-            return 0
-
-    async def aclear_db(self):
-        """Clear all data from the database"""
-
-        async def _clear(db):
-            await db.execute("DELETE FROM crawled_data")
-
-        try:
-            await self.execute_with_retry(_clear)
-        except Exception as e:
-            self.logger.error(
-                message="Error clearing database: {error}",
-                tag="ERROR",
-                force_verbose=True,
-                params={"error": str(e)},
-            )
-
-    async def aflush_db(self):
-        """Drop the entire table"""
-
-        async def _flush(db):
-            await db.execute("DROP TABLE IF EXISTS crawled_data")
-
-        try:
-            await self.execute_with_retry(_flush)
-        except Exception as e:
-            self.logger.error(
-                message="Error flushing database: {error}",
-                tag="ERROR",
-                force_verbose=True,
-                params={"error": str(e)},
-            )
-
-    async def _store_content(self, content: str, content_type: str) -> str:
-        """Store content in filesystem and return hash"""
-        if not content:
-            return ""
-
-        content_hash = generate_content_hash(content)
-        file_path = os.path.join(self.content_paths[content_type], content_hash)
-
-        # Only write if file doesn't exist
-        if not os.path.exists(file_path):
-            async with aiofiles.open(file_path, "w", encoding="utf-8") as f:
-                await f.write(content)
-
-        return content_hash
-
-    async def _load_content(
-        self, content_hash: str, content_type: str
-    ) -> Optional[str]:
-        """Load content from filesystem by hash"""
-        if not content_hash:
-            return None
-
-        file_path = os.path.join(self.content_paths[content_type], content_hash)
-        try:
-            async with aiofiles.open(file_path, "r", encoding="utf-8") as f:
-                return await f.read()
-        except Exception:
-            self.logger.error(
-                message="Failed to load content: {file_path}",
-                tag="ERROR",
-                force_verbose=True,
-                params={"file_path": file_path},
-            )
-            return None
-
-
-# Create a singleton instance
-async_db_manager = AsyncDatabaseManager()
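
Note on `execute_with_retry` in the file above: it sleeps `1 * (attempt + 1)` seconds between attempts, so the delay grows linearly. If exponential growth is wanted, a minimal sketch could look like the following. The names `backoff_delay` and `retry_async`, and the `base`, `cap`, and `sleep` parameters, are hypothetical and not part of the removed module.

```python
import asyncio


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Delay before the next try: base * 2**attempt seconds, capped at `cap`."""
    return min(cap, base * (2 ** attempt))


async def retry_async(
    operation,
    *args,
    max_retries: int = 3,
    base: float = 1.0,
    cap: float = 30.0,
    sleep=asyncio.sleep,  # injectable so callers can avoid real waiting in tests
):
    """Await operation(*args), retrying on any Exception with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return await operation(*args)
        except Exception:
            # Re-raise once the final attempt has failed.
            if attempt == max_retries - 1:
                raise
            await sleep(backoff_delay(attempt, base, cap))
```

With the defaults, the waits between attempts are 1s, 2s, 4s, and so on, up to the cap. Unlike the removed method, this sketch does not open a database connection or commit; it only illustrates the delay schedule.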
--- a/crawl4ai/async_dispatcher.py
+++ b/crawl4ai/async_dispatcher.py
@@ -1,647 +0,0 @@
-from typing import Dict, Optional, List, Tuple
-from .async_configs import CrawlerRunConfig
-from .models import (
-    CrawlResult,
-    CrawlerTaskResult,
-    CrawlStatus,
-    DisplayMode,
-    CrawlStats,
-    DomainState,
-)
-
-from rich.live import Live
-from rich.table import Table
-from rich.console import Console
-from rich import box
-from datetime import datetime, timedelta
-from collections.abc import AsyncGenerator
-import time
-import psutil
-import asyncio
-import uuid
-
-from urllib.parse import urlparse
-import random
-from abc import ABC, abstractmethod
-
-
-
-class RateLimiter:
-    def __init__(
-        self,
-        base_delay: Tuple[float, float] = (1.0, 3.0),
-        max_delay: float = 60.0,
-        max_retries: int = 3,
-        rate_limit_codes: Optional[List[int]] = None,
-    ):
-        self.base_delay = base_delay
-        self.max_delay = max_delay
-        self.max_retries = max_retries
-        self.rate_limit_codes = rate_limit_codes or [429, 503]
-        self.domains: Dict[str, DomainState] = {}
-
-    def get_domain(self, url: str) -> str:
-        return urlparse(url).netloc
-
-    async def wait_if_needed(self, url: str) -> None:
-        domain = self.get_domain(url)
-        state = self.domains.get(domain)
-
-        if not state:
-            self.domains[domain] = DomainState()
-            state = self.domains[domain]
-
-        now = time.time()
-        if state.last_request_time:
-            wait_time = max(0, state.current_delay - (now - state.last_request_time))
-            if wait_time > 0:
-                await asyncio.sleep(wait_time)
-
-        # Random delay within base range if no current delay
-        if state.current_delay == 0:
-            state.current_delay = random.uniform(*self.base_delay)
-
-        state.last_request_time = time.time()
-
-    def update_delay(self, url: str, status_code: int) -> bool:
-        domain = self.get_domain(url)
-        state = self.domains[domain]
-
-        if status_code in self.rate_limit_codes:
-            state.fail_count += 1
-            if state.fail_count > self.max_retries:
-                return False
-
-            # Exponential backoff with random jitter
-            state.current_delay = min(
-                state.current_delay * 2 * random.uniform(0.75, 1.25), self.max_delay
-            )
-        else:
-            # Gradually reduce delay on success
-            state.current_delay = max(
-                random.uniform(*self.base_delay), state.current_delay * 0.75
-            )
-            state.fail_count = 0
-
-        return True
-
-
-class CrawlerMonitor:
-    def __init__(
-        self,
-        max_visible_rows: int = 15,
-        display_mode: DisplayMode = DisplayMode.DETAILED,
-    ):
-        self.console = Console()
-        self.max_visible_rows = max_visible_rows
-        self.display_mode = display_mode
-        self.stats: Dict[str, CrawlStats] = {}
-        self.process = psutil.Process()
-        self.start_time = datetime.now()
-        self.live = Live(self._create_table(), refresh_per_second=2)
-
-    def start(self):
-        self.live.start()
-
-    def stop(self):
-        self.live.stop()
-
-    def add_task(self, task_id: str, url: str):
-        self.stats[task_id] = CrawlStats(
-            task_id=task_id, url=url, status=CrawlStatus.QUEUED
-        )
-        self.live.update(self._create_table())
-
-    def update_task(self, task_id: str, **kwargs):
-        if task_id in self.stats:
-            for key, value in kwargs.items():
-                setattr(self.stats[task_id], key, value)
-            self.live.update(self._create_table())
-
-    def _create_aggregated_table(self) -> Table:
-        """Creates a compact table showing only aggregated statistics"""
-        table = Table(
-            box=box.ROUNDED,
-            title="Crawler Status Overview",
-            title_style="bold magenta",
-            header_style="bold blue",
-            show_lines=True,
-        )
-
-        # Calculate statistics
-        total_tasks = len(self.stats)
-        queued = sum(
-            1 for stat in self.stats.values() if stat.status == CrawlStatus.QUEUED
-        )
-        in_progress = sum(
-            1 for stat in self.stats.values() if stat.status == CrawlStatus.IN_PROGRESS
-        )
-        completed = sum(
-            1 for stat in self.stats.values() if stat.status == CrawlStatus.COMPLETED
-        )
-        failed = sum(
-            1 for stat in self.stats.values() if stat.status == CrawlStatus.FAILED
-        )
-
-        # Memory statistics
-        current_memory = self.process.memory_info().rss / (1024 * 1024)
-        total_task_memory = sum(stat.memory_usage for stat in self.stats.values())
-        peak_memory = max(
-            (stat.peak_memory for stat in self.stats.values()), default=0.0
-        )
-
-        # Duration
-        duration = datetime.now() - self.start_time
-
-        # Create status row
-        table.add_column("Status", style="bold cyan")
-        table.add_column("Count", justify="right")
-        table.add_column("Percentage", justify="right")
-
-        table.add_row("Total Tasks", str(total_tasks), "100%")
-        table.add_row(
-            "[yellow]In Queue[/yellow]",
-            str(queued),
-            f"{(queued/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
-        )
-        table.add_row(
-            "[blue]In Progress[/blue]",
-            str(in_progress),
-            f"{(in_progress/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
-        )
-        table.add_row(
-            "[green]Completed[/green]",
-            str(completed),
-            f"{(completed/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
-        )
-        table.add_row(
-            "[red]Failed[/red]",
-            str(failed),
-            f"{(failed/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
-        )
-
-        # Add memory information
-        table.add_section()
-        table.add_row(
-            "[magenta]Current Memory[/magenta]", f"{current_memory:.1f} MB", ""
-        )
-        table.add_row(
-            "[magenta]Total Task Memory[/magenta]", f"{total_task_memory:.1f} MB", ""
-        )
-        table.add_row(
-            "[magenta]Peak Task Memory[/magenta]", f"{peak_memory:.1f} MB", ""
-        )
-        table.add_row(
-            "[yellow]Runtime[/yellow]",
-            str(timedelta(seconds=int(duration.total_seconds()))),
-            "",
-        )
-
-        return table
-
-    def _create_detailed_table(self) -> Table:
-        table = Table(
-            box=box.ROUNDED,
-            title="Crawler Performance Monitor",
-            title_style="bold magenta",
-            header_style="bold blue",
-        )
-
-        # Add columns
-        table.add_column("Task ID", style="cyan", no_wrap=True)
-        table.add_column("URL", style="cyan", no_wrap=True)
-        table.add_column("Status", style="bold")
-        table.add_column("Memory (MB)", justify="right")
-        table.add_column("Peak (MB)", justify="right")
-        table.add_column("Duration", justify="right")
-        table.add_column("Info", style="italic")
-
-        # Add summary row
-        total_memory = sum(stat.memory_usage for stat in self.stats.values())
-        active_count = sum(
-            1 for stat in self.stats.values() if stat.status == CrawlStatus.IN_PROGRESS
-        )
-        completed_count = sum(
-            1 for stat in self.stats.values() if stat.status == CrawlStatus.COMPLETED
-        )
-        failed_count = sum(
-            1 for stat in self.stats.values() if stat.status == CrawlStatus.FAILED
-        )
-
-        table.add_row(
-            "[bold yellow]SUMMARY",
-            f"Total: {len(self.stats)}",
-            f"Active: {active_count}",
-            f"{total_memory:.1f}",
-            f"{self.process.memory_info().rss / (1024 * 1024):.1f}",
-            str(
-                timedelta(
-                    seconds=int((datetime.now() - self.start_time).total_seconds())
-                )
-            ),
-            f"✓{completed_count} ✗{failed_count}",
-            style="bold",
-        )
-
-        table.add_section()
-
-        # Add rows for each task
-        visible_stats = sorted(
-            self.stats.values(),
-            key=lambda x: (
-                x.status != CrawlStatus.IN_PROGRESS,
-                x.status != CrawlStatus.QUEUED,
-                x.end_time or datetime.max,
-            ),
-        )[: self.max_visible_rows]
-
-        for stat in visible_stats:
-            status_style = {
-                CrawlStatus.QUEUED: "white",
-                CrawlStatus.IN_PROGRESS: "yellow",
-                CrawlStatus.COMPLETED: "green",
-                CrawlStatus.FAILED: "red",
-            }[stat.status]
-
-            table.add_row(
-                stat.task_id[:8],  # Show first 8 chars of task ID
-                stat.url[:40] + "..." if len(stat.url) > 40 else stat.url,
-                f"[{status_style}]{stat.status.value}[/{status_style}]",
-                f"{stat.memory_usage:.1f}",
-                f"{stat.peak_memory:.1f}",
-                stat.duration,
-                stat.error_message[:40] if stat.error_message else "",
-            )
-
-        return table
-
-    def _create_table(self) -> Table:
-        """Creates the appropriate table based on display mode"""
-        if self.display_mode == DisplayMode.AGGREGATED:
-            return self._create_aggregated_table()
-        return self._create_detailed_table()
-
-
-class BaseDispatcher(ABC):
-    def __init__(
-        self,
-        rate_limiter: Optional[RateLimiter] = None,
-        monitor: Optional[CrawlerMonitor] = None,
-    ):
-        self.crawler = None
-        self._domain_last_hit: Dict[str, float] = {}
-        self.concurrent_sessions = 0
-        self.rate_limiter = rate_limiter
-        self.monitor = monitor
-
-    @abstractmethod
-    async def crawl_url(
-        self,
-        url: str,
-        config: CrawlerRunConfig,
-        task_id: str,
-        monitor: Optional[CrawlerMonitor] = None,
-    ) -> CrawlerTaskResult:
-        pass
-
-    @abstractmethod
-    async def run_urls(
-        self,
-        urls: List[str],
-        crawler: "AsyncWebCrawler",  # noqa: F821
-        config: CrawlerRunConfig,
-        monitor: Optional[CrawlerMonitor] = None,
-    ) -> List[CrawlerTaskResult]:
-        pass
-
-
-class MemoryAdaptiveDispatcher(BaseDispatcher):
-    def __init__(
-        self,
-        memory_threshold_percent: float = 90.0,
-        check_interval: float = 1.0,
-        max_session_permit: int = 20,
-        memory_wait_timeout: float = 300.0,  # 5 minutes default timeout
-        rate_limiter: Optional[RateLimiter] = None,
-        monitor: Optional[CrawlerMonitor] = None,
-    ):
-        super().__init__(rate_limiter, monitor)
-        self.memory_threshold_percent = memory_threshold_percent
-        self.check_interval = check_interval
-        self.max_session_permit = max_session_permit
-        self.memory_wait_timeout = memory_wait_timeout
-        self.result_queue = asyncio.Queue()  # Queue for storing results
-
-    async def crawl_url(
-        self,
-        url: str,
-        config: CrawlerRunConfig,
-        task_id: str,
-    ) -> CrawlerTaskResult:
-        start_time = datetime.now()
-        error_message = ""
-        memory_usage = peak_memory = 0.0
-
-        try:
-            if self.monitor:
-                self.monitor.update_task(
-                    task_id, status=CrawlStatus.IN_PROGRESS, start_time=start_time
-                )
-            self.concurrent_sessions += 1
-
-            if self.rate_limiter:
-                await self.rate_limiter.wait_if_needed(url)
-
-            process = psutil.Process()
-            start_memory = process.memory_info().rss / (1024 * 1024)
-            result = await self.crawler.arun(url, config=config, session_id=task_id)
-            end_memory = process.memory_info().rss / (1024 * 1024)
-
-            memory_usage = peak_memory = end_memory - start_memory
-
-            if self.rate_limiter and result.status_code:
-                if not self.rate_limiter.update_delay(url, result.status_code):
-                    error_message = f"Rate limit retry count exceeded for domain {urlparse(url).netloc}"
-                    if self.monitor:
-                        self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
-                    result = CrawlerTaskResult(
-                        task_id=task_id,
-                        url=url,
-                        result=result,
-                        memory_usage=memory_usage,
-                        peak_memory=peak_memory,
-                        start_time=start_time,
-                        end_time=datetime.now(),
-                        error_message=error_message,
-                    )
-                    await self.result_queue.put(result)
-                    return result
-
-            if not result.success:
-                error_message = result.error_message
-                if self.monitor:
-                    self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
-            elif self.monitor:
-                self.monitor.update_task(task_id, status=CrawlStatus.COMPLETED)
-
-        except Exception as e:
-            error_message = str(e)
-            if self.monitor:
-                self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
-            result = CrawlResult(
-                url=url, html="", metadata={}, success=False, error_message=str(e)
-            )
-
-        finally:
-            end_time = datetime.now()
-            if self.monitor:
-                self.monitor.update_task(
-                    task_id,
-                    end_time=end_time,
-                    memory_usage=memory_usage,
-                    peak_memory=peak_memory,
-                    error_message=error_message,
-                )
-            self.concurrent_sessions -= 1
-
-        return CrawlerTaskResult(
-            task_id=task_id,
-            url=url,
-            result=result,
-            memory_usage=memory_usage,
-            peak_memory=peak_memory,
-            start_time=start_time,
-            end_time=end_time,
-            error_message=error_message,
-        )
-
-    async def run_urls(
-        self,
-        urls: List[str],
-        crawler: "AsyncWebCrawler",  # noqa: F821
-        config: CrawlerRunConfig,
-    ) -> List[CrawlerTaskResult]:
-        self.crawler = crawler
-
-        if self.monitor:
-            self.monitor.start()
-
-        try:
-            pending_tasks = []
-            active_tasks = []
-            task_queue = []
-
-            for url in urls:
-                task_id = str(uuid.uuid4())
-                if self.monitor:
-                    self.monitor.add_task(task_id, url)
-                task_queue.append((url, task_id))
-
-            while task_queue or active_tasks:
-                wait_start_time = time.time()
-                while len(active_tasks) < self.max_session_permit and task_queue:
-                    if psutil.virtual_memory().percent >= self.memory_threshold_percent:
-                        # Check if we've exceeded the timeout
-                        if time.time() - wait_start_time > self.memory_wait_timeout:
-                            raise MemoryError(
-                                f"Memory usage above threshold ({self.memory_threshold_percent}%) for more than {self.memory_wait_timeout} seconds"
-                            )
-                        await asyncio.sleep(self.check_interval)
-                        continue
-
-                    url, task_id = task_queue.pop(0)
-                    task = asyncio.create_task(self.crawl_url(url, config, task_id))
-                    active_tasks.append(task)
-
-                if not active_tasks:
-                    await asyncio.sleep(self.check_interval)
-                    continue
-
-                done, pending = await asyncio.wait(
-                    active_tasks, return_when=asyncio.FIRST_COMPLETED
-                )
-
-                pending_tasks.extend(done)
-                active_tasks = list(pending)
-
-            return await asyncio.gather(*pending_tasks)
-        finally:
-            if self.monitor:
-                self.monitor.stop()
-
-    async def run_urls_stream(
-        self,
-        urls: List[str],
-        crawler: "AsyncWebCrawler",
-        config: CrawlerRunConfig,
-    ) -> AsyncGenerator[CrawlerTaskResult, None]:
-        self.crawler = crawler
-        if self.monitor:
-            self.monitor.start()
-
-        try:
-            active_tasks = []
-            task_queue = []
-            completed_count = 0
-            total_urls = len(urls)
-
-            # Initialize task queue
-            for url in urls:
-                task_id = str(uuid.uuid4())
-                if self.monitor:
-                    self.monitor.add_task(task_id, url)
-                task_queue.append((url, task_id))
-
-            while completed_count < total_urls:
-                # Start new tasks if memory permits
-                while len(active_tasks) < self.max_session_permit and task_queue:
-                    if psutil.virtual_memory().percent >= self.memory_threshold_percent:
-                        await asyncio.sleep(self.check_interval)
-                        continue
-
-                    url, task_id = task_queue.pop(0)
-                    task = asyncio.create_task(self.crawl_url(url, config, task_id))
-                    active_tasks.append(task)
-
-                if not active_tasks and not task_queue:
-                    break
-
-                # Wait for any task to complete and yield results
-                if active_tasks:
-                    done, pending = await asyncio.wait(
-                        active_tasks,
-                        timeout=0.1,
-                        return_when=asyncio.FIRST_COMPLETED
-                    )
-                    for completed_task in done:
-                        result = await completed_task
-                        completed_count += 1
-                        yield result
-                    active_tasks = list(pending)
-                else:
-                    await asyncio.sleep(self.check_interval)
-
-        finally:
-            if self.monitor:
-                self.monitor.stop()
-
-class SemaphoreDispatcher(BaseDispatcher):
-    def __init__(
-        self,
-        semaphore_count: int = 5,
-        max_session_permit: int = 20,
-        rate_limiter: Optional[RateLimiter] = None,
-        monitor: Optional[CrawlerMonitor] = None,
-    ):
-        super().__init__(rate_limiter, monitor)
-        self.semaphore_count = semaphore_count
-        self.max_session_permit = max_session_permit
-
-    async def crawl_url(
-        self,
-        url: str,
-        config: CrawlerRunConfig,
-        task_id: str,
-        semaphore: Optional[asyncio.Semaphore] = None,
-    ) -> CrawlerTaskResult:
-        start_time = datetime.now()
-        error_message = ""
-        memory_usage = peak_memory = 0.0
-
-        try:
-            if self.monitor:
-                self.monitor.update_task(
-                    task_id, status=CrawlStatus.IN_PROGRESS, start_time=start_time
-                )
-
-            if self.rate_limiter:
-                await self.rate_limiter.wait_if_needed(url)
-
-            async with semaphore:
-                process = psutil.Process()
-                start_memory = process.memory_info().rss / (1024 * 1024)
-                result = await self.crawler.arun(url, config=config, session_id=task_id)
-                end_memory = process.memory_info().rss / (1024 * 1024)
-
-                memory_usage = peak_memory = end_memory - start_memory
-
-                if self.rate_limiter and result.status_code:
-                    if not self.rate_limiter.update_delay(url, result.status_code):
-                        error_message = f"Rate limit retry count exceeded for domain {urlparse(url).netloc}"
-                        if self.monitor:
-                            self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
-                        return CrawlerTaskResult(
-                            task_id=task_id,
-                            url=url,
-                            result=result,
-                            memory_usage=memory_usage,
-                            peak_memory=peak_memory,
-                            start_time=start_time,
-                            end_time=datetime.now(),
-                            error_message=error_message,
-                        )
-
-                if not result.success:
-                    error_message = result.error_message
-                    if self.monitor:
-                        self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
-                elif self.monitor:
-                    self.monitor.update_task(task_id, status=CrawlStatus.COMPLETED)
-
-        except Exception as e:
-            error_message = str(e)
-            if self.monitor:
-                self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
-            result = CrawlResult(
-                url=url, html="", metadata={}, success=False, error_message=str(e)
-            )
-
-        finally:
-            end_time = datetime.now()
-            if self.monitor:
-                self.monitor.update_task(
-                    task_id,
-                    end_time=end_time,
-                    memory_usage=memory_usage,
-                    peak_memory=peak_memory,
-                    error_message=error_message,
-                )
-
-        return CrawlerTaskResult(
-            task_id=task_id,
-            url=url,
-            result=result,
-            memory_usage=memory_usage,
-            peak_memory=peak_memory,
-            start_time=start_time,
-            end_time=end_time,
-            error_message=error_message,
-        )
-
-    async def run_urls(
-        self,
-        crawler: "AsyncWebCrawler",  # noqa: F821
-        urls: List[str],
-        config: CrawlerRunConfig,
-    ) -> List[CrawlerTaskResult]:
-        self.crawler = crawler
-        if self.monitor:
-            self.monitor.start()
-
-        try:
-            semaphore = asyncio.Semaphore(self.semaphore_count)
-            tasks = []
-
-            for url in urls:
-                task_id = str(uuid.uuid4())
-                if self.monitor:
-                    self.monitor.add_task(task_id, url)
-                task = asyncio.create_task(
-                    self.crawl_url(url, config, task_id, semaphore)
-                )
-                tasks.append(task)
-
-            return await asyncio.gather(*tasks, return_exceptions=True)
-        finally:
-            if self.monitor:
-                self.monitor.stop()
--- a/crawl4ai/async_dispatcher_.py
+++ b/crawl4ai/async_dispatcher_.py
@@ -1,588 +0,0 @@
-from typing import Dict, Optional, List, Tuple
-from .async_configs import CrawlerRunConfig
-from .models import (
-    CrawlResult,
-    CrawlerTaskResult,
-    CrawlStatus,
-    DisplayMode,
-    CrawlStats,
-    DomainState,
-)
-
-from rich.live import Live
-from rich.table import Table
-from rich.console import Console
-from rich import box
-from datetime import datetime, timedelta
-
-import time
-import psutil
-import asyncio
-import uuid
-
-from urllib.parse import urlparse
-import random
-from abc import ABC, abstractmethod
-
-
-class RateLimiter:
-    def __init__(
-        self,
-        base_delay: Tuple[float, float] = (1.0, 3.0),
-        max_delay: float = 60.0,
-        max_retries: int = 3,
-        rate_limit_codes: Optional[List[int]] = None,
-    ):
-        self.base_delay = base_delay
-        self.max_delay = max_delay
-        self.max_retries = max_retries
-        self.rate_limit_codes = rate_limit_codes or [429, 503]
-        self.domains: Dict[str, DomainState] = {}
-
-    def get_domain(self, url: str) -> str:
-        return urlparse(url).netloc
-
-    async def wait_if_needed(self, url: str) -> None:
-        domain = self.get_domain(url)
-        state = self.domains.get(domain)
-
-        if not state:
-            self.domains[domain] = DomainState()
-            state = self.domains[domain]
-
-        now = time.time()
-        if state.last_request_time:
-            wait_time = max(0, state.current_delay - (now - state.last_request_time))
-            if wait_time > 0:
-                await asyncio.sleep(wait_time)
-
-        # Random delay within base range if no current delay
-        if state.current_delay == 0:
-            state.current_delay = random.uniform(*self.base_delay)
-
-        state.last_request_time = time.time()
-
-    def update_delay(self, url: str, status_code: int) -> bool:
-        domain = self.get_domain(url)
-        state = self.domains[domain]
-
-        if status_code in self.rate_limit_codes:
-            state.fail_count += 1
-            if state.fail_count > self.max_retries:
-                return False
-
-            # Exponential backoff with random jitter
-            state.current_delay = min(
-                state.current_delay * 2 * random.uniform(0.75, 1.25), self.max_delay
-            )
-        else:
-            # Gradually reduce delay on success
-            state.current_delay = max(
-                random.uniform(*self.base_delay), state.current_delay * 0.75
-            )
-            state.fail_count = 0
-
-        return True
-
-
-class CrawlerMonitor:
-    def __init__(
-        self,
-        max_visible_rows: int = 15,
-        display_mode: DisplayMode = DisplayMode.DETAILED,
-    ):
-        self.console = Console()
-        self.max_visible_rows = max_visible_rows
-        self.display_mode = display_mode
-        self.stats: Dict[str, CrawlStats] = {}
-        self.process = psutil.Process()
-        self.start_time = datetime.now()
-        self.live = Live(self._create_table(), refresh_per_second=2)
-
-    def start(self):
-        self.live.start()
-
-    def stop(self):
-        self.live.stop()
-
-    def add_task(self, task_id: str, url: str):
-        self.stats[task_id] = CrawlStats(
-            task_id=task_id, url=url, status=CrawlStatus.QUEUED
-        )
-        self.live.update(self._create_table())
-
-    def update_task(self, task_id: str, **kwargs):
-        if task_id in self.stats:
-            for key, value in kwargs.items():
-                setattr(self.stats[task_id], key, value)
-            self.live.update(self._create_table())
-
-    def _create_aggregated_table(self) -> Table:
-        """Creates a compact table showing only aggregated statistics"""
-        table = Table(
-            box=box.ROUNDED,
-            title="Crawler Status Overview",
-            title_style="bold magenta",
-            header_style="bold blue",
-            show_lines=True,
-        )
-
-        # Calculate statistics
-        total_tasks = len(self.stats)
-        queued = sum(
-            1 for stat in self.stats.values() if stat.status == CrawlStatus.QUEUED
-        )
-        in_progress = sum(
-            1 for stat in self.stats.values() if stat.status == CrawlStatus.IN_PROGRESS
-        )
-        completed = sum(
-            1 for stat in self.stats.values() if stat.status == CrawlStatus.COMPLETED
-        )
-        failed = sum(
-            1 for stat in self.stats.values() if stat.status == CrawlStatus.FAILED
-        )
-
-        # Memory statistics
-        current_memory = self.process.memory_info().rss / (1024 * 1024)
-        total_task_memory = sum(stat.memory_usage for stat in self.stats.values())
-        peak_memory = max(
-            (stat.peak_memory for stat in self.stats.values()), default=0.0
-        )
-
-        # Duration
-        duration = datetime.now() - self.start_time
-
-        # Create status row
-        table.add_column("Status", style="bold cyan")
-        table.add_column("Count", justify="right")
-        table.add_column("Percentage", justify="right")
-
-        table.add_row("Total Tasks", str(total_tasks), "100%")
-        table.add_row(
-            "[yellow]In Queue[/yellow]",
-            str(queued),
-            f"{(queued/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
-        )
-        table.add_row(
-            "[blue]In Progress[/blue]",
-            str(in_progress),
-            f"{(in_progress/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
-        )
-        table.add_row(
-            "[green]Completed[/green]",
-            str(completed),
-            f"{(completed/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
-        )
-        table.add_row(
-            "[red]Failed[/red]",
-            str(failed),
-            f"{(failed/total_tasks*100):.1f}%" if total_tasks > 0 else "0%",
-        )
-
-        # Add memory information
-        table.add_section()
-        table.add_row(
-            "[magenta]Current Memory[/magenta]", f"{current_memory:.1f} MB", ""
-        )
-        table.add_row(
-            "[magenta]Total Task Memory[/magenta]", f"{total_task_memory:.1f} MB", ""
-        )
-        table.add_row(
-            "[magenta]Peak Task Memory[/magenta]", f"{peak_memory:.1f} MB", ""
-        )
-        table.add_row(
-            "[yellow]Runtime[/yellow]",
-            str(timedelta(seconds=int(duration.total_seconds()))),
-            "",
-        )
-
-        return table
-
-    def _create_detailed_table(self) -> Table:
-        table = Table(
-            box=box.ROUNDED,
-            title="Crawler Performance Monitor",
-            title_style="bold magenta",
-            header_style="bold blue",
-        )
-
-        # Add columns
-        table.add_column("Task ID", style="cyan", no_wrap=True)
-        table.add_column("URL", style="cyan", no_wrap=True)
-        table.add_column("Status", style="bold")
-        table.add_column("Memory (MB)", justify="right")
-        table.add_column("Peak (MB)", justify="right")
-        table.add_column("Duration", justify="right")
-        table.add_column("Info", style="italic")
-
-        # Add summary row
-        total_memory = sum(stat.memory_usage for stat in self.stats.values())
-        active_count = sum(
-            1 for stat in self.stats.values() if stat.status == CrawlStatus.IN_PROGRESS
-        )
-        completed_count = sum(
-            1 for stat in self.stats.values() if stat.status == CrawlStatus.COMPLETED
-        )
-        failed_count = sum(
-            1 for stat in self.stats.values() if stat.status == CrawlStatus.FAILED
-        )
-
-        table.add_row(
-            "[bold yellow]SUMMARY",
-            f"Total: {len(self.stats)}",
-            f"Active: {active_count}",
-            f"{total_memory:.1f}",
-            f"{self.process.memory_info().rss / (1024 * 1024):.1f}",
-            str(
-                timedelta(
-                    seconds=int((datetime.now() - self.start_time).total_seconds())
-                )
-            ),
-            f"✓{completed_count} ✗{failed_count}",
-            style="bold",
-        )
-
-        table.add_section()
-
-        # Add rows for each task
-        visible_stats = sorted(
-            self.stats.values(),
-            key=lambda x: (
-                x.status != CrawlStatus.IN_PROGRESS,
-                x.status != CrawlStatus.QUEUED,
-                x.end_time or datetime.max,
-            ),
-        )[: self.max_visible_rows]
-
-        for stat in visible_stats:
-            status_style = {
-                CrawlStatus.QUEUED: "white",
-                CrawlStatus.IN_PROGRESS: "yellow",
-                CrawlStatus.COMPLETED: "green",
-                CrawlStatus.FAILED: "red",
-            }[stat.status]
-
-            table.add_row(
-                stat.task_id[:8],  # Show first 8 chars of task ID
-                stat.url[:40] + "..." if len(stat.url) > 40 else stat.url,
-                f"[{status_style}]{stat.status.value}[/{status_style}]",
-                f"{stat.memory_usage:.1f}",
-                f"{stat.peak_memory:.1f}",
-                stat.duration,
-                stat.error_message[:40] if stat.error_message else "",
-            )
-
-        return table
-
-    def _create_table(self) -> Table:
-        """Creates the appropriate table based on display mode"""
-        if self.display_mode == DisplayMode.AGGREGATED:
-            return self._create_aggregated_table()
-        return self._create_detailed_table()
-
-
-class BaseDispatcher(ABC):
-    def __init__(
-        self,
-        rate_limiter: Optional[RateLimiter] = None,
-        monitor: Optional[CrawlerMonitor] = None,
-    ):
-        self.crawler = None
-        self._domain_last_hit: Dict[str, float] = {}
-        self.concurrent_sessions = 0
-        self.rate_limiter = rate_limiter
-        self.monitor = monitor
-
-    @abstractmethod
-    async def crawl_url(
-        self,
-        url: str,
-        config: CrawlerRunConfig,
-        task_id: str,
-        monitor: Optional[CrawlerMonitor] = None,
-    ) -> CrawlerTaskResult:
-        pass
-
-    @abstractmethod
-    async def run_urls(
-        self,
-        urls: List[str],
-        crawler: "AsyncWebCrawler",  # noqa: F821
-        config: CrawlerRunConfig,
-        monitor: Optional[CrawlerMonitor] = None,
-    ) -> List[CrawlerTaskResult]:
-        pass
-
-
-class MemoryAdaptiveDispatcher(BaseDispatcher):
-    def __init__(
-        self,
-        memory_threshold_percent: float = 90.0,
-        check_interval: float = 1.0,
-        max_session_permit: int = 20,
-        memory_wait_timeout: float = 300.0,  # 5 minutes default timeout
-        rate_limiter: Optional[RateLimiter] = None,
-        monitor: Optional[CrawlerMonitor] = None,
-    ):
-        super().__init__(rate_limiter, monitor)
-        self.memory_threshold_percent = memory_threshold_percent
-        self.check_interval = check_interval
-        self.max_session_permit = max_session_permit
-        self.memory_wait_timeout = memory_wait_timeout
-
-    async def crawl_url(
-        self,
-        url: str,
-        config: CrawlerRunConfig,
-        task_id: str,
-    ) -> CrawlerTaskResult:
-        start_time = datetime.now()
-        error_message = ""
-        memory_usage = peak_memory = 0.0
-
-        try:
-            if self.monitor:
-                self.monitor.update_task(
-                    task_id, status=CrawlStatus.IN_PROGRESS, start_time=start_time
-                )
-            self.concurrent_sessions += 1
-
-            if self.rate_limiter:
-                await self.rate_limiter.wait_if_needed(url)
-
-            process = psutil.Process()
-            start_memory = process.memory_info().rss / (1024 * 1024)
-            result = await self.crawler.arun(url, config=config, session_id=task_id)
-            end_memory = process.memory_info().rss / (1024 * 1024)
-
-            memory_usage = peak_memory = end_memory - start_memory
-
-            if self.rate_limiter and result.status_code:
-                if not self.rate_limiter.update_delay(url, result.status_code):
-                    error_message = f"Rate limit retry count exceeded for domain {urlparse(url).netloc}"
-                    if self.monitor:
-                        self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
-                    return CrawlerTaskResult(
-                        task_id=task_id,
-                        url=url,
-                        result=result,
-                        memory_usage=memory_usage,
-                        peak_memory=peak_memory,
-                        start_time=start_time,
-                        end_time=datetime.now(),
-                        error_message=error_message,
-                    )
-
-            if not result.success:
-                error_message = result.error_message
-                if self.monitor:
-                    self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
-            elif self.monitor:
-                self.monitor.update_task(task_id, status=CrawlStatus.COMPLETED)
-
-        except Exception as e:
-            error_message = str(e)
-            if self.monitor:
-                self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
-            result = CrawlResult(
-                url=url, html="", metadata={}, success=False, error_message=str(e)
-            )
-
-        finally:
-            end_time = datetime.now()
-            if self.monitor:
-                self.monitor.update_task(
-                    task_id,
-                    end_time=end_time,
-                    memory_usage=memory_usage,
-                    peak_memory=peak_memory,
-                    error_message=error_message,
-                )
-            self.concurrent_sessions -= 1
-
-        return CrawlerTaskResult(
-            task_id=task_id,
-            url=url,
-            result=result,
-            memory_usage=memory_usage,
-            peak_memory=peak_memory,
-            start_time=start_time,
-            end_time=end_time,
-            error_message=error_message,
-        )
-
-    async def run_urls(
-        self,
-        urls: List[str],
-        crawler: "AsyncWebCrawler",  # noqa: F821
-        config: CrawlerRunConfig,
-    ) -> List[CrawlerTaskResult]:
-        self.crawler = crawler
-
-        if self.monitor:
-            self.monitor.start()
-
-        try:
-            pending_tasks = []
-            active_tasks = []
-            task_queue = []
-
-            for url in urls:
-                task_id = str(uuid.uuid4())
-                if self.monitor:
-                    self.monitor.add_task(task_id, url)
-                task_queue.append((url, task_id))
-
-            while task_queue or active_tasks:
-                wait_start_time = time.time()
-                while len(active_tasks) < self.max_session_permit and task_queue:
-                    if psutil.virtual_memory().percent >= self.memory_threshold_percent:
-                        # Check if we've exceeded the timeout
-                        if time.time() - wait_start_time > self.memory_wait_timeout:
-                            raise MemoryError(
-                                f"Memory usage above threshold ({self.memory_threshold_percent}%) for more than {self.memory_wait_timeout} seconds"
-                            )
-                        await asyncio.sleep(self.check_interval)
-                        continue
-
-                    url, task_id = task_queue.pop(0)
-                    task = asyncio.create_task(self.crawl_url(url, config, task_id))
-                    active_tasks.append(task)
-
-                if not active_tasks:
-                    await asyncio.sleep(self.check_interval)
-                    continue
-
-                done, pending = await asyncio.wait(
-                    active_tasks, return_when=asyncio.FIRST_COMPLETED
-                )
-
-                pending_tasks.extend(done)  # despite the name, this list collects finished tasks
-                active_tasks = list(pending)
-
-            return await asyncio.gather(*pending_tasks)
-        finally:
-            if self.monitor:
-                self.monitor.stop()
-
-
-class SemaphoreDispatcher(BaseDispatcher):
-    def __init__(
-        self,
-        semaphore_count: int = 5,
-        max_session_permit: int = 20,
-        rate_limiter: Optional[RateLimiter] = None,
-        monitor: Optional[CrawlerMonitor] = None,
-    ):
-        super().__init__(rate_limiter, monitor)
-        self.semaphore_count = semaphore_count
-        self.max_session_permit = max_session_permit
-
-    async def crawl_url(
-        self,
-        url: str,
-        config: CrawlerRunConfig,
-        task_id: str,
-        semaphore: Optional[asyncio.Semaphore] = None,
-    ) -> CrawlerTaskResult:
-        start_time = datetime.now()
-        error_message = ""
-        memory_usage = peak_memory = 0.0
-
-        try:
-            if self.monitor:
-                self.monitor.update_task(
-                    task_id, status=CrawlStatus.IN_PROGRESS, start_time=start_time
-                )
-
-            if self.rate_limiter:
-                await self.rate_limiter.wait_if_needed(url)
-
-            async with semaphore:
-                process = psutil.Process()
-                start_memory = process.memory_info().rss / (1024 * 1024)
-                result = await self.crawler.arun(url, config=config, session_id=task_id)
-                end_memory = process.memory_info().rss / (1024 * 1024)
-
-                memory_usage = peak_memory = end_memory - start_memory
-
-                if self.rate_limiter and result.status_code:
-                    if not self.rate_limiter.update_delay(url, result.status_code):
-                        error_message = f"Rate limit retry count exceeded for domain {urlparse(url).netloc}"
-                        if self.monitor:
-                            self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
-                        return CrawlerTaskResult(
-                            task_id=task_id,
-                            url=url,
-                            result=result,
-                            memory_usage=memory_usage,
-                            peak_memory=peak_memory,
-                            start_time=start_time,
-                            end_time=datetime.now(),
-                            error_message=error_message,
-                        )
-
-                if not result.success:
-                    error_message = result.error_message
-                    if self.monitor:
-                        self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
-                elif self.monitor:
-                    self.monitor.update_task(task_id, status=CrawlStatus.COMPLETED)
-
-        except Exception as e:
-            error_message = str(e)
-            if self.monitor:
-                self.monitor.update_task(task_id, status=CrawlStatus.FAILED)
-            result = CrawlResult(
-                url=url, html="", metadata={}, success=False, error_message=str(e)
-            )
-
-        finally:
-            end_time = datetime.now()
-            if self.monitor:
-                self.monitor.update_task(
-                    task_id,
-                    end_time=end_time,
-                    memory_usage=memory_usage,
-                    peak_memory=peak_memory,
-                    error_message=error_message,
-                )
-
-        return CrawlerTaskResult(
-            task_id=task_id,
-            url=url,
-            result=result,
-            memory_usage=memory_usage,
-            peak_memory=peak_memory,
-            start_time=start_time,
-            end_time=end_time,
-            error_message=error_message,
-        )
-
-    async def run_urls(
-        self,
-        crawler: "AsyncWebCrawler",  # noqa: F821
-        urls: List[str],
-        config: CrawlerRunConfig,
-    ) -> List[CrawlerTaskResult]:
-        self.crawler = crawler
-        if self.monitor:
-            self.monitor.start()
-
-        try:
-            semaphore = asyncio.Semaphore(self.semaphore_count)
-            tasks = []
-
-            for url in urls:
-                task_id = str(uuid.uuid4())
-                if self.monitor:
-                    self.monitor.add_task(task_id, url)
-                task = asyncio.create_task(
-                    self.crawl_url(url, config, task_id, semaphore)
-                )
-                tasks.append(task)
-
-            return await asyncio.gather(*tasks, return_exceptions=True)
-        finally:
-            if self.monitor:
-                self.monitor.stop()
--- a/crawl4ai/async_logger.py
+++ b/crawl4ai/async_logger.py
@@ -1,227 +0,0 @@
-from enum import Enum
-from typing import Optional, Dict, Any
-from colorama import Fore, Style, init
-import os
-from datetime import datetime
-
-
-class LogLevel(Enum):
-    DEBUG = 1
-    INFO = 2
-    SUCCESS = 3
-    WARNING = 4
-    ERROR = 5
-
-
-class AsyncLogger:
-    """
-    Asynchronous logger with support for colored console output and file logging.
-    Supports templated messages with colored components.
-    """
-
-    DEFAULT_ICONS = {
-        "INIT": "→",
-        "READY": "✓",
-        "FETCH": "↓",
-        "SCRAPE": "◆",
-        "EXTRACT": "■",
-        "COMPLETE": "●",
-        "ERROR": "×",
-        "DEBUG": "⋯",
-        "INFO": "ℹ",
-        "WARNING": "⚠",
-    }
-
-    DEFAULT_COLORS = {
-        LogLevel.DEBUG: Fore.LIGHTBLACK_EX,
-        LogLevel.INFO: Fore.CYAN,
-        LogLevel.SUCCESS: Fore.GREEN,
-        LogLevel.WARNING: Fore.YELLOW,
-        LogLevel.ERROR: Fore.RED,
-    }
-
-    def __init__(
-        self,
-        log_file: Optional[str] = None,
-        log_level: LogLevel = LogLevel.DEBUG,
-        tag_width: int = 10,
-        icons: Optional[Dict[str, str]] = None,
-        colors: Optional[Dict[LogLevel, str]] = None,
-        verbose: bool = True,
-    ):
-        """
-        Initialize the logger.
-
-        Args:
-            log_file: Optional file path for logging
-            log_level: Minimum log level to display
-            tag_width: Width for tag formatting
-            icons: Custom icons for different tags
-            colors: Custom colors for different log levels
-            verbose: Whether to output to console
-        """
-        init()  # Initialize colorama
-        self.log_file = log_file
-        self.log_level = log_level
-        self.tag_width = tag_width
-        self.icons = icons or self.DEFAULT_ICONS
-        self.colors = colors or self.DEFAULT_COLORS
-        self.verbose = verbose
-
-        # Create log file directory if needed
-        if log_file:
-            os.makedirs(os.path.dirname(os.path.abspath(log_file)), exist_ok=True)
-
-    def _format_tag(self, tag: str) -> str:
-        """Format a tag with consistent width."""
-        return f"[{tag}]".ljust(self.tag_width, ".")
-
-    def _get_icon(self, tag: str) -> str:
-        """Get the icon for a tag, defaulting to info icon if not found."""
-        return self.icons.get(tag, self.icons["INFO"])
-
-    def _write_to_file(self, message: str):
-        """Write a message to the log file if configured."""
-        if self.log_file:
-            timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]
-            with open(self.log_file, "a", encoding="utf-8") as f:
-                # Strip ANSI color codes for file output
-                clean_message = message.replace(Fore.RESET, "").replace(
-                    Style.RESET_ALL, ""
-                )
-                for color in vars(Fore).values():
-                    if isinstance(color, str):
-                        clean_message = clean_message.replace(color, "")
-                f.write(f"[{timestamp}] {clean_message}\n")
-
-    def _log(
-        self,
-        level: LogLevel,
-        message: str,
-        tag: str,
-        params: Optional[Dict[str, Any]] = None,
-        colors: Optional[Dict[str, str]] = None,
-        base_color: Optional[str] = None,
-        **kwargs,
-    ):
-        """
-        Core logging method that handles message formatting and output.
-
-        Args:
-            level: Log level for this message
-            message: Message template string
-            tag: Tag for the message
-            params: Parameters to format into the message
-            colors: Color overrides for specific parameters
-            base_color: Base color for the entire message
-        """
-        if level.value < self.log_level.value:
-            return
-
-        # Format the message with parameters if provided
-        if params:
-            try:
-                # First format the message with raw parameters
-                formatted_message = message.format(**params)
-
-                # Then apply colors if specified
-                if colors:
-                    for key, color in colors.items():
-                        # Find the formatted value in the message and wrap it with color
-                        if key in params:
-                            value_str = str(params[key])
-                            formatted_message = formatted_message.replace(
-                                value_str, f"{color}{value_str}{Style.RESET_ALL}"
-                            )
-
-            except KeyError as e:
-                formatted_message = (
-                    f"LOGGING ERROR: Missing parameter {e} in message template"
-                )
-                level = LogLevel.ERROR
-        else:
-            formatted_message = message
-
-        # Construct the full log line
-        color = base_color or self.colors[level]
-        log_line = f"{color}{self._format_tag(tag)} {self._get_icon(tag)} {formatted_message}{Style.RESET_ALL}"
-
-        # Output to console if verbose
-        if self.verbose or kwargs.get("force_verbose", False):
-            print(log_line)
-
-        # Write to file if configured
-        self._write_to_file(log_line)
-
-    def debug(self, message: str, tag: str = "DEBUG", **kwargs):
-        """Log a debug message."""
-        self._log(LogLevel.DEBUG, message, tag, **kwargs)
-
-    def info(self, message: str, tag: str = "INFO", **kwargs):
-        """Log an info message."""
-        self._log(LogLevel.INFO, message, tag, **kwargs)
-
-    def success(self, message: str, tag: str = "SUCCESS", **kwargs):
-        """Log a success message."""
-        self._log(LogLevel.SUCCESS, message, tag, **kwargs)
-
-    def warning(self, message: str, tag: str = "WARNING", **kwargs):
-        """Log a warning message."""
-        self._log(LogLevel.WARNING, message, tag, **kwargs)
-
-    def error(self, message: str, tag: str = "ERROR", **kwargs):
-        """Log an error message."""
-        self._log(LogLevel.ERROR, message, tag, **kwargs)
-
-    def url_status(
-        self,
-        url: str,
-        success: bool,
-        timing: float,
-        tag: str = "FETCH",
-        url_length: int = 50,
-    ):
-        """
-        Convenience method for logging URL fetch status.
-
-        Args:
-            url: The URL being processed
-            success: Whether the operation was successful
-            timing: Time taken for the operation
-            tag: Tag for the message
-            url_length: Maximum length for URL in log
-        """
-        self._log(
-            level=LogLevel.SUCCESS if success else LogLevel.ERROR,
-            message="{url:.{url_length}}... | Status: {status} | Time: {timing:.2f}s",
-            tag=tag,
-            params={
-                "url": url,
-                "url_length": url_length,
-                "status": success,
-                "timing": timing,
-            },
-            colors={
-                "status": Fore.GREEN if success else Fore.RED,
-                "timing": Fore.YELLOW,
-            },
-        )
-
-    def error_status(
-        self, url: str, error: str, tag: str = "ERROR", url_length: int = 50
-    ):
-        """
-        Convenience method for logging error status.
-
-        Args:
-            url: The URL being processed
-            error: Error message
-            tag: Tag for the message
-            url_length: Maximum length for URL in log
-        """
-        self._log(
-            level=LogLevel.ERROR,
-            message="{url:.{url_length}}... | Error: {error}",
-            tag=tag,
-            params={"url": url, "url_length": url_length, "error": error},
-        )
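Note on the format strings in `url_status` and `error_status` above: `{url:.{url_length}}` is a nested replacement field, so the precision that truncates the URL is itself passed in as a parameter. The sketch below shows this with plain `str.format`. The helper name `format_status` is hypothetical and not part of the removed file, and the sketch leaves out the tag and color handling that `_log` performs.

```python
def format_status(url: str, success: bool, timing: float, url_length: int = 50) -> str:
    """Hypothetical helper: render the url_status template with plain str.format."""
    # The inner {url_length} is substituted first, giving e.g. "{url:.19}",
    # which truncates the URL to at most url_length characters.
    template = "{url:.{url_length}}... | Status: {status} | Time: {timing:.2f}s"
    return template.format(
        url=url, url_length=url_length, status=success, timing=timing
    )


if __name__ == "__main__":
    print(format_status("https://example.com/a/very/long/path", True, 1.23456, url_length=19))
```

The trailing `...` is appended unconditionally, so it also appears when the URL is shorter than `url_length`.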
--- a/crawl4ai/async_webcrawler.py
+++ b/crawl4ai/async_webcrawler.py
@@ -1,901 +0,0 @@
-import os
-import sys
-import time
-import warnings
-from colorama import Fore
-from pathlib import Path
-from typing import Optional, List
-import json
-import asyncio
-
-# from contextlib import nullcontext, asynccontextmanager
-from contextlib import asynccontextmanager
-
-from .models import (
-    CrawlResult,
-    MarkdownGenerationResult,
-    CrawlerTaskResult,
-    DispatchResult,
-)
-from .async_database import async_db_manager
-from .chunking_strategy import *  # noqa: F403
-from .chunking_strategy import RegexChunking, ChunkingStrategy, IdentityChunking
-from .content_filter_strategy import *  # noqa: F403
-from .content_filter_strategy import RelevantContentFilter
-from .extraction_strategy import *  # noqa: F403
-from .extraction_strategy import NoExtractionStrategy, ExtractionStrategy
-from .async_crawler_strategy import (
-    AsyncCrawlerStrategy,
-    AsyncPlaywrightCrawlerStrategy,
-    AsyncCrawlResponse,
-)
-from .cache_context import CacheMode, CacheContext, _legacy_to_cache_mode
-from .markdown_generation_strategy import (
-    DefaultMarkdownGenerator,
-    MarkdownGenerationStrategy,
-)
-from .async_logger import AsyncLogger
-from .async_configs import BrowserConfig, CrawlerRunConfig
-from .async_dispatcher import *  # noqa: F403
-from .async_dispatcher import BaseDispatcher, MemoryAdaptiveDispatcher, RateLimiter
-from .deep_crawl import DeepCrawlStrategy
-
-from .config import MIN_WORD_THRESHOLD
-from .utils import (
-    sanitize_input_encode,
-    InvalidCSSSelectorError,
-    fast_format_html,
-    create_box_message,
-    get_error_context,
-    RobotsParser,
-)
-
-from typing import Union, TypeVar
-from collections.abc import AsyncGenerator
-
-
-from .__version__ import __version__ as crawl4ai_version
-
-CrawlResultT = TypeVar("CrawlResultT", bound=CrawlResult)
-RunManyReturn = Union[List[CrawlResultT], AsyncGenerator[CrawlResultT, None]]
-
-DeepCrawlSingleReturn = Union[List[CrawlResultT], AsyncGenerator[CrawlResultT, None]]
-DeepCrawlManyReturn = Union[
-    List[List[CrawlResultT]],
-    AsyncGenerator[CrawlResultT, None],
-]
-
-
-class AsyncWebCrawler:
-    """
-    Asynchronous web crawler with flexible caching capabilities.
-
-    There are two ways to use the crawler:
-
-    1. Using context manager (recommended for simple cases):
-        ```python
-        async with AsyncWebCrawler() as crawler:
-            result = await crawler.arun(url="https://example.com")
-        ```
-
-    2. Using explicit lifecycle management (recommended for long-running applications):
-        ```python
-        crawler = AsyncWebCrawler()
-        await crawler.start()
-
-        # Use the crawler multiple times
-        result1 = await crawler.arun(url="https://example.com")
-        result2 = await crawler.arun(url="https://another.com")
-
-        await crawler.close()
-        ```
-
-    Migration Guide:
-    Old way (deprecated):
-        crawler = AsyncWebCrawler(always_by_pass_cache=True, browser_type="chromium", headless=True)
-
-    New way (recommended):
-        browser_config = BrowserConfig(browser_type="chromium", headless=True)
-        crawler = AsyncWebCrawler(config=browser_config)
-
-
-    Attributes:
-        browser_config (BrowserConfig): Configuration object for browser settings.
-        crawler_strategy (AsyncCrawlerStrategy): Strategy for crawling web pages.
-        logger (AsyncLogger): Logger instance for recording events and errors.
-        always_bypass_cache (bool): Whether to always bypass cache.
-        crawl4ai_folder (str): Directory for storing cache.
-        base_directory (str): Base directory for storing cache.
-        ready (bool): Whether the crawler is ready for use.
-
-    Methods:
-        start(): Start the crawler explicitly without using context manager.
-        close(): Close the crawler explicitly without using context manager.
-        arun(): Run the crawler for a single source: URL (web, local file, or raw HTML).
-        awarmup(): Perform warmup sequence.
-        arun_many(): Run the crawler for multiple sources.
-        aprocess_html(): Process HTML content.
-
-    Typical Usage:
-        async with AsyncWebCrawler() as crawler:
-            result = await crawler.arun(url="https://example.com")
-            print(result.markdown)
-
-        Using configuration:
-        browser_config = BrowserConfig(browser_type="chromium", headless=True)
-        async with AsyncWebCrawler(config=browser_config) as crawler:
-            crawler_config = CrawlerRunConfig(
-                cache_mode=CacheMode.BYPASS
-            )
-            result = await crawler.arun(url="https://example.com", config=crawler_config)
-            print(result.markdown)
-    """
-
-    _domain_last_hit = {}
-
-    def __init__(
-        self,
-        crawler_strategy: Optional[AsyncCrawlerStrategy] = None,
-        config: Optional[BrowserConfig] = None,
-        always_bypass_cache: bool = False,
-        always_by_pass_cache: Optional[bool] = None,  # Deprecated parameter
-        base_directory: str = str(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home())),
-        thread_safe: bool = False,
-        **kwargs,
-    ):
-        """
-        Initialize the AsyncWebCrawler.
-
-        Args:
-            crawler_strategy: Strategy for crawling web pages. If None, will create AsyncPlaywrightCrawlerStrategy
-            config: Configuration object for browser settings. If None, will be created from kwargs
-            always_bypass_cache: Whether to always bypass cache (new parameter)
-            always_by_pass_cache: Deprecated, use always_bypass_cache instead
-            base_directory: Base directory for storing cache
-            thread_safe: Whether to use thread-safe operations
-            **kwargs: Additional arguments for backwards compatibility
-        """
-        # Handle browser configuration
-        browser_config = config
-        if browser_config is not None:
-            if any(
-                k in kwargs
-                for k in [
-                    "browser_type",
-                    "headless",
-                    "viewport_width",
-                    "viewport_height",
-                ]
-            ):
-                warnings.warn(  # self.logger is not initialized yet at this point
-                    "Both browser_config and legacy browser parameters provided. browser_config will take precedence.",
-                    stacklevel=2,
-                )
-        else:
-            # Create browser config from kwargs for backwards compatibility
-            browser_config = BrowserConfig.from_kwargs(kwargs)
-
-        self.browser_config = browser_config
-
-        # Initialize logger first since other components may need it
-        self.logger = AsyncLogger(
-            log_file=os.path.join(base_directory, ".crawl4ai", "crawler.log"),
-            verbose=self.browser_config.verbose,
-            tag_width=10,
-        )
-
-        # Initialize crawler strategy
-        params = {k: v for k, v in kwargs.items() if k in ["browser_congig", "logger"]}
-        self.crawler_strategy = crawler_strategy or AsyncPlaywrightCrawlerStrategy(
-            browser_config=browser_config,
-            logger=self.logger,
-            **params,  # Pass remaining kwargs for backwards compatibility
-        )
-
-        # If the crawler strategy doesn't have a logger, use the crawler's logger
-        if not self.crawler_strategy.logger:
-            self.crawler_strategy.logger = self.logger
-
-        # Handle deprecated cache parameter
-        if always_by_pass_cache is not None:
-            if kwargs.get("warning", True):
-                warnings.warn(
-                    "'always_by_pass_cache' is deprecated and will be removed in version 0.5.0. "
-                    "Use 'always_bypass_cache' instead. "
-                    "Pass warning=False to suppress this warning.",
-                    DeprecationWarning,
-                    stacklevel=2,
-                )
-            self.always_bypass_cache = always_by_pass_cache
-        else:
-            self.always_bypass_cache = always_bypass_cache
-
-        # Thread safety setup
-        self._lock = asyncio.Lock() if thread_safe else None
-
-        # Initialize directories
-        self.crawl4ai_folder = os.path.join(base_directory, ".crawl4ai")
-        os.makedirs(self.crawl4ai_folder, exist_ok=True)
-        os.makedirs(f"{self.crawl4ai_folder}/cache", exist_ok=True)
-
-        # Initialize robots parser
-        self.robots_parser = RobotsParser()
-
-        self.ready = False
-
-    async def start(self):
-        """
-        Start the crawler explicitly without using context manager.
-        This is equivalent to using 'async with' but gives more control over the lifecycle.
-
-        This method will:
-        1. Initialize the browser and context
-        2. Perform warmup sequence
-        3. Return the crawler instance for method chaining
-
-        Returns:
-            AsyncWebCrawler: The initialized crawler instance
-        """
-        await self.crawler_strategy.__aenter__()
-        await self.awarmup()
-        return self
-
-    async def close(self):
-        """
-        Close the crawler explicitly without using context manager.
-        This should be called when you're done with the crawler if you used start().
-
-        This method will:
-        1. Clean up browser resources
-        2. Close any open pages and contexts
-        """
-        await self.crawler_strategy.__aexit__(None, None, None)
-
-    async def __aenter__(self):
-        return await self.start()
-
-    async def __aexit__(self, exc_type, exc_val, exc_tb):
-        await self.close()
-
-    async def awarmup(self):
-        """
-        Initialize the crawler with warm-up sequence.
-
-        This method:
-        1. Logs initialization info
-        2. Sets up browser configuration
-        3. Marks the crawler as ready
-        """
-        self.logger.info(f"Crawl4AI {crawl4ai_version}", tag="INIT")
-        self.ready = True
-
-    @asynccontextmanager
-    async def nullcontext(self):
-        """Asynchronous null context manager"""
-        yield
-
-    async def arun(
-        self,
-        url: str,
-        config: Optional[CrawlerRunConfig] = None,
-        # Legacy parameters maintained for backwards compatibility
-        word_count_threshold=MIN_WORD_THRESHOLD,
-        extraction_strategy: ExtractionStrategy = None,
-        chunking_strategy: ChunkingStrategy = RegexChunking(),
-        content_filter: RelevantContentFilter = None,
-        cache_mode: Optional[CacheMode] = None,
-        # Deprecated cache parameters
-        bypass_cache: bool = False,
-        disable_cache: bool = False,
-        no_cache_read: bool = False,
-        no_cache_write: bool = False,
-        # Other legacy parameters
-        css_selector: str = None,
-        screenshot: bool = False,
-        pdf: bool = False,
-        user_agent: str = None,
-        verbose=True,
-        **kwargs,
-    ) -> Union[CrawlResult, DeepCrawlSingleReturn]:
-        """
-        Runs the crawler for a single source: URL (web, local file, or raw HTML).
-
-        Migration Guide:
-        Old way (deprecated):
-            result = await crawler.arun(
-                url="https://example.com",
-                word_count_threshold=200,
-                screenshot=True,
-                ...
-            )
-
-        New way (recommended):
-            config = CrawlerRunConfig(
-                word_count_threshold=200,
-                screenshot=True,
-                ...
-            )
-            result = await crawler.arun(url="https://example.com", config=config)
-
-        Args:
-            url: The URL to crawl (http://, https://, file://, or raw:)
-            config: Configuration object controlling crawl behavior
-            [other parameters maintained for backwards compatibility]
-
-        Returns:
-            CrawlResult: The result of crawling and processing
-        """
-        crawler_config = config
-        if not isinstance(url, str) or not url:
-            raise ValueError("Invalid URL, make sure the URL is a non-empty string")
-
-        async with self._lock or self.nullcontext():
-            try:
-                # Handle configuration
-                if crawler_config is not None:
-                    # if any(param is not None for param in [
-                    #     word_count_threshold, extraction_strategy, chunking_strategy,
-                    #     content_filter, cache_mode, css_selector, screenshot, pdf
-                    # ]):
-                    #     self.logger.warning(
-                    #         message="Both crawler_config and legacy parameters provided. crawler_config will take precedence.",
-                    #         tag="WARNING"
-                    #     )
-                    config = crawler_config
-                else:
-                    # Merge all parameters into a single kwargs dict for config creation
-                    config_kwargs = {
-                        "word_count_threshold": word_count_threshold,
-                        "extraction_strategy": extraction_strategy,
-                        "chunking_strategy": chunking_strategy,
-                        "content_filter": content_filter,
-                        "cache_mode": cache_mode,
-                        "bypass_cache": bypass_cache,
-                        "disable_cache": disable_cache,
-                        "no_cache_read": no_cache_read,
-                        "no_cache_write": no_cache_write,
-                        "css_selector": css_selector,
-                        "screenshot": screenshot,
-                        "pdf": pdf,
-                        "verbose": verbose,
-                        **kwargs,
-                    }
-                    config = CrawlerRunConfig.from_kwargs(config_kwargs)
-
-                # Handle deprecated cache parameters
-                if any([bypass_cache, disable_cache, no_cache_read, no_cache_write]):
-                    if kwargs.get("warning", True):
-                        warnings.warn(
-                            "Cache control boolean flags are deprecated and will be removed in version 0.5.0. "
-                            "Use 'cache_mode' parameter instead.",
-                            DeprecationWarning,
-                            stacklevel=2,
-                        )
-
-                    # Convert legacy parameters if cache_mode not provided
-                    if config.cache_mode is None:
-                        config.cache_mode = _legacy_to_cache_mode(
-                            disable_cache=disable_cache,
-                            bypass_cache=bypass_cache,
-                            no_cache_read=no_cache_read,
-                            no_cache_write=no_cache_write,
-                        )
-
-                # Default to ENABLED if no cache mode specified
-                if config.cache_mode is None:
-                    config.cache_mode = CacheMode.ENABLED
-
-                # Create cache context
-                cache_context = CacheContext(
-                    url, config.cache_mode, self.always_bypass_cache
-                )
-
-                # Initialize processing variables
-                async_response: AsyncCrawlResponse = None
-                cached_result: CrawlResult = None
-                screenshot_data = None
-                pdf_data = None
-                extracted_content = None
-                start_time = time.perf_counter()
-
-                if config.deep_crawl_strategy:  # crawler_config is None on the legacy-parameter path
-                    if config.stream:
-                        return config.deep_crawl_strategy.arun(
-                            start_url=url,
-                            crawler=self,
-                            crawler_run_config=config,
-                        )
-                    else:
-                        results = []
-                        async for result in config.deep_crawl_strategy.arun(
-                            start_url=url,
-                            crawler=self,
-                            crawler_run_config=config,
-                        ):
-                            results.append(result)
-                        return results
-
-                # Try to get cached result if appropriate
-                if cache_context.should_read():
-                    cached_result = await async_db_manager.aget_cached_url(url)
-
-                if cached_result:
-                    html = sanitize_input_encode(cached_result.html)
-                    extracted_content = sanitize_input_encode(
-                        cached_result.extracted_content or ""
-                    )
-                    extracted_content = (
-                        None
-                        if not extracted_content or extracted_content == "[]"
-                        else extracted_content
-                    )
-                    # If a screenshot or PDF is requested but is not in the cache, discard cached_result
-                    screenshot_data = cached_result.screenshot
-                    pdf_data = cached_result.pdf
-                    if (config.screenshot and not screenshot_data) or (config.pdf and not pdf_data):
-                        cached_result = None
-
-                    self.logger.url_status(
-                        url=cache_context.display_url,
-                        success=bool(html),
-                        timing=time.perf_counter() - start_time,
-                        tag="FETCH",
-                    )
-
-                # Fetch fresh content if needed
-                if not cached_result or not html:
-                    t1 = time.perf_counter()
-
-                    if user_agent:
-                        self.crawler_strategy.update_user_agent(user_agent)
-
-                    # Check robots.txt if enabled
-                    if config and config.check_robots_txt:
-                        if not await self.robots_parser.can_fetch(
-                            url, self.browser_config.user_agent
-                        ):
-                            return CrawlResult(
-                                url=url,
-                                html="",
-                                success=False,
-                                status_code=403,
-                                error_message="Access denied by robots.txt",
-                                response_headers={
-                                    "X-Robots-Status": "Blocked by robots.txt"
-                                },
-                            )
-
-                    # Pass config to crawl method
-                    async_response = await self.crawler_strategy.crawl(
-                        url,
-                        config=config,  # Pass the entire config object
-                    )
-
-                    html = sanitize_input_encode(async_response.html)
-                    screenshot_data = async_response.screenshot
-                    pdf_data = async_response.pdf_data
-
-                    t2 = time.perf_counter()
-                    self.logger.url_status(
-                        url=cache_context.display_url,
-                        success=bool(html),
-                        timing=t2 - t1,
-                        tag="FETCH",
-                    )
-
-                    # Process the HTML content
-                    crawl_result: CrawlResult = await self.aprocess_html(
-                        url=url,
-                        html=html,
-                        extracted_content=extracted_content,
-                        config=config,  # Pass the config object instead of individual parameters
-                        screenshot=screenshot_data,
-                        pdf_data=pdf_data,
-                        verbose=config.verbose,
-                        is_raw_html=url.startswith("raw:"),
-                        **kwargs,
-                    )
-
-                    crawl_result.status_code = async_response.status_code
-                    crawl_result.redirected_url = async_response.redirected_url or url
-                    crawl_result.response_headers = async_response.response_headers
-                    crawl_result.downloaded_files = async_response.downloaded_files
-                    crawl_result.ssl_certificate = (
-                        async_response.ssl_certificate
-                    )  # Add SSL certificate
-
-                    # # Check and set values from async_response to crawl_result
-                    # try:
-                    #     for key in vars(async_response):
-                    #         if hasattr(crawl_result, key):
-                    #             value = getattr(async_response, key, None)
-                    #             current_value = getattr(crawl_result, key, None)
-                    #             if value is not None and not current_value:
-                    #                 try:
-                    #                     setattr(crawl_result, key, value)
-                    #                 except Exception as e:
-                    #                     self.logger.warning(
-                    #                         message=f"Failed to set attribute {key}: {str(e)}",
-                    #                         tag="WARNING"
-                    #                     )
-                    # except Exception as e:
-                    #     self.logger.warning(
-                    #         message=f"Error copying response attributes: {str(e)}",
-                    #         tag="WARNING"
-                    #     )
-
-                    crawl_result.success = bool(html)
-                    crawl_result.session_id = getattr(config, "session_id", None)
-
-                    self.logger.success(
-                        message="{url:.50}... | Status: {status} | Total: {timing}",
-                        tag="COMPLETE",
-                        params={
-                            "url": cache_context.display_url,
-                            "status": crawl_result.success,
-                            "timing": f"{time.perf_counter() - start_time:.2f}s",
-                        },
-                        colors={
-                            "status": Fore.GREEN if crawl_result.success else Fore.RED,
-                            "timing": Fore.YELLOW,
-                        },
-                    )
-
-                    # Update cache if appropriate
-                    if cache_context.should_write() and not bool(cached_result):
-                        await async_db_manager.acache_url(crawl_result)
-
-                    return crawl_result
-
-                else:
-                    self.logger.success(
-                        message="{url:.50}... | Status: {status} | Total: {timing}",
-                        tag="COMPLETE",
-                        params={
-                            "url": cache_context.display_url,
-                            "status": True,
-                            "timing": f"{time.perf_counter() - start_time:.2f}s",
-                        },
-                        colors={"status": Fore.GREEN, "timing": Fore.YELLOW},
-                    )
-
-                    cached_result.success = bool(html)
-                    cached_result.session_id = getattr(config, "session_id", None)
-                    cached_result.redirected_url = cached_result.redirected_url or url
-                    return cached_result
-
-            except Exception as e:
-                error_context = get_error_context(sys.exc_info())
-
-                error_message = (
-                    f"Unexpected error in arun at line {error_context['line_no']} "
-                    f"in {error_context['function']} ({error_context['filename']}):\n"
-                    f"Error: {str(e)}\n\n"
-                    f"Code context:\n{error_context['code_context']}"
-                )
-                # if not hasattr(e, "msg"):
-                #     e.msg = str(e)
-
-                self.logger.error_status(
-                    url=url,
-                    error=create_box_message(error_message, type="error"),
-                    tag="ERROR",
-                )
-
-                return CrawlResult(
-                    url=url, html="", success=False, error_message=error_message
-                )
-
-    async def aprocess_html(
-        self,
-        url: str,
-        html: str,
-        extracted_content: str,
-        config: CrawlerRunConfig,
-        screenshot: str,
-        pdf_data: str,
-        verbose: bool,
-        **kwargs,
-    ) -> CrawlResult:
-        """
-        Process HTML content using the provided configuration.
-
-        Args:
-            url: The URL being processed
-            html: Raw HTML content
-            extracted_content: Previously extracted content (if any)
-            config: Configuration object controlling processing behavior
-            screenshot: Screenshot data (if any)
-            pdf_data: PDF data (if any)
-            verbose: Whether to enable verbose logging
-            **kwargs: Additional parameters for backwards compatibility
-
-        Returns:
-            CrawlResult: Processed result containing extracted and formatted content
-        """
-        try:
-            _url = url if not kwargs.get("is_raw_html", False) else "Raw HTML"
-            t1 = time.perf_counter()
-
-            # Get scraping strategy and ensure it has a logger
-            scraping_strategy = config.scraping_strategy
-            if not scraping_strategy.logger:
-                scraping_strategy.logger = self.logger
-
-            # Process HTML content
-            params = {k: v for k, v in config.to_dict().items() if k not in ["url"]}
-            # Add keys from kwargs that do not already exist in params
-            params.update({k: v for k, v in kwargs.items() if k not in params.keys()})
-
-            result = scraping_strategy.scrap(url, html, **params)
-
-            if result is None:
-                raise ValueError(
-                    f"Process HTML, Failed to extract content from the website: {url}"
-                )
-
-        except InvalidCSSSelectorError as e:
-            raise ValueError(str(e))
-        except Exception as e:
-            raise ValueError(
-                f"Process HTML, Failed to extract content from the website: {url}, error: {str(e)}"
-            )
-
-        # Extract results - handle both dict and ScrapingResult
-        if isinstance(result, dict):
-            cleaned_html = sanitize_input_encode(result.get("cleaned_html", ""))
-            media = result.get("media", {})
-            links = result.get("links", {})
-            metadata = result.get("metadata", {})
-        else:
-            cleaned_html = sanitize_input_encode(result.cleaned_html)
-            media = result.media.model_dump()
-            links = result.links.model_dump()
-            metadata = result.metadata
-
-        # Markdown Generation
-        markdown_generator: Optional[MarkdownGenerationStrategy] = (
-            config.markdown_generator or DefaultMarkdownGenerator()
-        )
-
-        # Uncomment if by default we want to use PruningContentFilter
-        # if not config.content_filter and not markdown_generator.content_filter:
-        #     markdown_generator.content_filter = PruningContentFilter()
-
-        markdown_result: MarkdownGenerationResult = (
-            markdown_generator.generate_markdown(
-                cleaned_html=cleaned_html,
-                base_url=url,
-                # html2text_options=kwargs.get('html2text', {})
-            )
-        )
-        markdown_v2 = markdown_result
-        markdown = sanitize_input_encode(markdown_result.raw_markdown)
-
-        # Log processing completion
-        self.logger.info(
-            message="Processed {url:.50}... | Time: {timing}ms",
-            tag="SCRAPE",
-            params={"url": _url, "timing": int((time.perf_counter() - t1) * 1000)},
-        )
-
-        # Handle content extraction if needed
-        if (
-            not bool(extracted_content)
-            and config.extraction_strategy
-            and not isinstance(config.extraction_strategy, NoExtractionStrategy)
-        ):
-            t1 = time.perf_counter()
-
-            # Choose content based on input_format
-            content_format = config.extraction_strategy.input_format
-            if content_format == "fit_markdown" and not markdown_result.fit_markdown:
-                self.logger.warning(
-                    message="Fit markdown requested but not available. Falling back to raw markdown.",
-                    tag="EXTRACT",
-                    params={"url": _url},
-                )
-                content_format = "markdown"
-
-            content = {
-                "markdown": markdown,
-                "html": html,
-                "fit_markdown": markdown_result.fit_markdown,
-            }.get(content_format, markdown)
-
-            # Use IdentityChunking for HTML input, otherwise use provided chunking strategy
-            chunking = (
-                IdentityChunking()
-                if content_format == "html"
-                else config.chunking_strategy
-            )
-            sections = chunking.chunk(content)
-            extracted_content = config.extraction_strategy.run(url, sections)
-            extracted_content = json.dumps(
-                extracted_content, indent=4, default=str, ensure_ascii=False
-            )
-
-            # Log extraction completion
-            self.logger.info(
-                message="Completed for {url:.50}... | Time: {timing}s",
-                tag="EXTRACT",
-                params={"url": _url, "timing": time.perf_counter() - t1},
-            )
-
-        # Handle screenshot and PDF data
-        screenshot_data = None if not screenshot else screenshot
-        pdf_data = None if not pdf_data else pdf_data
-
-        # Apply HTML formatting if requested
-        if config.prettiify:
-            cleaned_html = fast_format_html(cleaned_html)
-
-        # Return complete crawl result
-        return CrawlResult(
-            url=url,
-            html=html,
-            cleaned_html=cleaned_html,
-            markdown_v2=markdown_v2,
-            markdown=markdown,
-            fit_markdown=markdown_result.fit_markdown,
-            fit_html=markdown_result.fit_html,
-            media=media,
-            links=links,
-            metadata=metadata,
-            screenshot=screenshot_data,
-            pdf=pdf_data,
-            extracted_content=extracted_content,
-            success=True,
-            error_message="",
-        )
-
-    async def arun_many(
-        self,
-        urls: List[str],
-        config: Optional[CrawlerRunConfig] = None,
-        dispatcher: Optional[BaseDispatcher] = None,
-        # Legacy parameters maintained for backwards compatibility
-        word_count_threshold=MIN_WORD_THRESHOLD,
-        extraction_strategy: ExtractionStrategy = None,
-        chunking_strategy: ChunkingStrategy = RegexChunking(),
-        content_filter: RelevantContentFilter = None,
-        cache_mode: Optional[CacheMode] = None,
-        bypass_cache: bool = False,
-        css_selector: str = None,
-        screenshot: bool = False,
-        pdf: bool = False,
-        user_agent: str = None,
-        verbose=True,
-        **kwargs,
-    ) -> Union[RunManyReturn, DeepCrawlManyReturn]:
-        """
-        Runs the crawler for multiple URLs concurrently using a configurable dispatcher strategy.
-
-        Args:
-        urls: List of URLs to crawl
-        config: Configuration object controlling crawl behavior for all URLs
-        dispatcher: The dispatcher strategy instance to use. Defaults to MemoryAdaptiveDispatcher
-        [other parameters maintained for backwards compatibility]
-
-        Returns:
-        Union[List[CrawlResult], AsyncGenerator[CrawlResult, None]]:
-            Either a list of all results or an async generator yielding results
-
-        Examples:
-
-        # Batch processing (default)
-        results = await crawler.arun_many(
-            urls=["https://example1.com", "https://example2.com"],
-            config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
-        )
-        for result in results:
-            print(f"Processed {result.url}: {len(result.markdown)} chars")
-
-        # Streaming results
-        async for result in await crawler.arun_many(
-            urls=["https://example1.com", "https://example2.com"],
-            config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS, stream=True),
-        ):
-            print(f"Processed {result.url}: {len(result.markdown)} chars")
-        """
-
-        async def merge_async_generators(generators):
-            tasks = {asyncio.create_task(gen.__anext__()): gen for gen in generators}
-            while tasks:
-                done, _ = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
-                
-                for task in done:
-                    gen = tasks.pop(task)  # Get the generator associated with this task
-                    
-                    try:
-                        result = task.result()
-                        yield result  # Yield the result
-                        tasks[asyncio.create_task(gen.__anext__())] = gen  # Fetch next item
-                    except StopAsyncIteration:
-                        pass  # Generator is exhausted, don't add it back to the tasks
-
-        if config is None:
-            config = CrawlerRunConfig(
-                word_count_threshold=word_count_threshold,
-                extraction_strategy=extraction_strategy,
-                chunking_strategy=chunking_strategy,
-                content_filter=content_filter,
-                cache_mode=cache_mode,
-                bypass_cache=bypass_cache,
-                css_selector=css_selector,
-                screenshot=screenshot,
-                pdf=pdf,
-                verbose=verbose,
-                **kwargs,
-            )
-
-        if dispatcher is None:
-            dispatcher = MemoryAdaptiveDispatcher(
-                rate_limiter=RateLimiter(
-                    base_delay=(1.0, 3.0), max_delay=60.0, max_retries=3
-                ),
-            )
-
-        transform_result = lambda task_result: (
-            setattr(
-                task_result.result,
-                "dispatch_result",
-                DispatchResult(
-                    task_id=task_result.task_id,
-                    memory_usage=task_result.memory_usage,
-                    peak_memory=task_result.peak_memory,
-                    start_time=task_result.start_time,
-                    end_time=task_result.end_time,
-                    error_message=task_result.error_message,
-                ),
-            )
-            or task_result.result
-        )
-
-        stream = config.stream
-
-        if config.deep_crawl_strategy:
-            if config.stream:
-                generators = []
-                for url in urls:
-                    generators.append(
-                        config.deep_crawl_strategy.arun(
-                            start_url=url, crawler=self, crawler_run_config=config
-                        )
-                    )
-                return merge_async_generators(generators)
-            else:
-                results = []
-                for url in urls:
-                    url_results = []
-                    async for result in config.deep_crawl_strategy.arun(
-                        start_url=url, crawler=self, crawler_run_config=config
-                    ):
-                        url_results.append(result)
-                    results.append(url_results)
-                return results
-
-        if stream:
-
-            async def result_transformer():
-                async for task_result in dispatcher.run_urls_stream(
-                    crawler=self, urls=urls, config=config
-                ):
-                    yield transform_result(task_result)
-
-            return result_transformer()
-        else:
-            _results = await dispatcher.run_urls(crawler=self, urls=urls, config=config)
-            return [transform_result(res) for res in _results]
-
-    async def aclear_cache(self):
-        """Clear the cache database."""
-        await async_db_manager.cleanup()
-
-    async def aflush_cache(self):
-        """Flush the cache database."""
-        await async_db_manager.aflush_db()
-
-    async def aget_cache_size(self):
-        """Get the total number of cached items."""
-        return await async_db_manager.aget_total_count()
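Reviewer note: the `merge_async_generators` helper removed from `arun_many` interleaves results from several async generators as they become ready. The sketch below is a minimal standalone version of that pattern, not the library's code: `numbers` and `collect` are hypothetical demo names, and `asyncio.ensure_future` stands in for the original `asyncio.create_task` because it accepts any awaitable.

```python
import asyncio


async def merge_async_generators(generators):
    # Map one pending "next item" task to the generator that produced it.
    tasks = {asyncio.ensure_future(gen.__anext__()): gen for gen in generators}
    while tasks:
        done, _ = await asyncio.wait(set(tasks), return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            gen = tasks.pop(task)
            try:
                item = task.result()
            except StopAsyncIteration:
                continue  # Generator exhausted: do not reschedule it.
            yield item
            # Schedule the next item from the same generator.
            tasks[asyncio.ensure_future(gen.__anext__())] = gen


async def numbers(start, count, delay):
    # Hypothetical stand-in for a deep-crawl result stream.
    for n in range(start, start + count):
        await asyncio.sleep(delay)
        yield n


async def collect():
    streams = [numbers(0, 3, 0.01), numbers(100, 2, 0.015)]
    return [item async for item in merge_async_generators(streams)]


merged = asyncio.run(collect())
```

Items from the same generator keep their relative order. The interleaving across generators depends on timing, so callers should not rely on it.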
--- a/crawl4ai/cache_context.py
+++ b/crawl4ai/cache_context.py
@@ -1,117 +0,0 @@
-from enum import Enum
-
-
-class CacheMode(Enum):
-    """
-    Defines the caching behavior for web crawling operations.
-
-    Modes:
-    - ENABLED: Normal caching behavior (read and write)
-    - DISABLED: No caching at all
-    - READ_ONLY: Only read from cache, don't write
-    - WRITE_ONLY: Only write to cache, don't read
-    - BYPASS: Bypass cache for this operation
-    """
-
-    ENABLED = "enabled"
-    DISABLED = "disabled"
-    READ_ONLY = "read_only"
-    WRITE_ONLY = "write_only"
-    BYPASS = "bypass"
-
-
-class CacheContext:
-    """
-    Encapsulates cache-related decisions and URL handling.
-
-    This class centralizes all cache-related logic and URL type checking,
-    making the caching behavior more predictable and maintainable.
-
-    Attributes:
-        url (str): The URL being processed.
-        cache_mode (CacheMode): The cache mode for the current operation.
-        always_bypass (bool): If True, bypasses caching for this operation.
-        is_cacheable (bool): True if the URL is cacheable, False otherwise.
-        is_web_url (bool): True if the URL is a web URL, False otherwise.
-        is_local_file (bool): True if the URL is a local file, False otherwise.
-        is_raw_html (bool): True if the URL is raw HTML, False otherwise.
-        _url_display (str): The display name for the URL (web, local file, or raw HTML).
-    """
-
-    def __init__(self, url: str, cache_mode: CacheMode, always_bypass: bool = False):
-        """
-        Initializes the CacheContext with the provided URL and cache mode.
-
-        Args:
-            url (str): The URL being processed.
-            cache_mode (CacheMode): The cache mode for the current operation.
-            always_bypass (bool): If True, bypasses caching for this operation.
-        """
-        self.url = url
-        self.cache_mode = cache_mode
-        self.always_bypass = always_bypass
-        self.is_cacheable = url.startswith(("http://", "https://", "file://"))
-        self.is_web_url = url.startswith(("http://", "https://"))
-        self.is_local_file = url.startswith("file://")
-        self.is_raw_html = url.startswith("raw:")
-        self._url_display = url if not self.is_raw_html else "Raw HTML"
-
-    def should_read(self) -> bool:
-        """
-        Determines if cache should be read based on context.
-
-        How it works:
-        1. If always_bypass is True or is_cacheable is False, return False.
-        2. If cache_mode is ENABLED or READ_ONLY, return True.
-
-        Returns:
-            bool: True if cache should be read, False otherwise.
-        """
-        if self.always_bypass or not self.is_cacheable:
-            return False
-        return self.cache_mode in [CacheMode.ENABLED, CacheMode.READ_ONLY]
-
-    def should_write(self) -> bool:
-        """
-        Determines if cache should be written based on context.
-
-        How it works:
-        1. If always_bypass is True or is_cacheable is False, return False.
-        2. If cache_mode is ENABLED or WRITE_ONLY, return True.
-
-        Returns:
-            bool: True if cache should be written, False otherwise.
-        """
-        if self.always_bypass or not self.is_cacheable:
-            return False
-        return self.cache_mode in [CacheMode.ENABLED, CacheMode.WRITE_ONLY]
-
-    @property
-    def display_url(self) -> str:
-        """Returns the URL in display format."""
-        return self._url_display
-
-
-def _legacy_to_cache_mode(
-    disable_cache: bool = False,
-    bypass_cache: bool = False,
-    no_cache_read: bool = False,
-    no_cache_write: bool = False,
-) -> CacheMode:
-    """
-    Converts legacy cache parameters to the new CacheMode enum.
-
-    This is an internal function to help transition from the old boolean flags
-    to the new CacheMode system.
-    """
-    if disable_cache:
-        return CacheMode.DISABLED
-    if bypass_cache:
-        return CacheMode.BYPASS
-    if no_cache_read and no_cache_write:
-        return CacheMode.DISABLED
-    if no_cache_read:
-        return CacheMode.WRITE_ONLY
-    if no_cache_write:
-        return CacheMode.READ_ONLY
-    return CacheMode.ENABLED
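Reviewer note: the removed `CacheContext.should_read` / `should_write` logic reduces to a small truth table. The sketch below re-declares a trimmed `CacheMode` and uses a hypothetical helper, `cache_decision`, which is not part of crawl4ai, to show how each mode maps to a (read, write) decision under the rules in the removed file.

```python
from enum import Enum


class CacheMode(Enum):
    ENABLED = "enabled"
    DISABLED = "disabled"
    READ_ONLY = "read_only"
    WRITE_ONLY = "write_only"
    BYPASS = "bypass"


def cache_decision(url: str, mode: CacheMode, always_bypass: bool = False) -> tuple:
    """Hypothetical helper: return (should_read, should_write) for a URL and mode."""
    # Only http(s):// and file:// URLs are cacheable; "raw:" HTML is not.
    cacheable = url.startswith(("http://", "https://", "file://"))
    if always_bypass or not cacheable:
        return (False, False)
    return (
        mode in (CacheMode.ENABLED, CacheMode.READ_ONLY),
        mode in (CacheMode.ENABLED, CacheMode.WRITE_ONLY),
    )


decisions = {mode.name: cache_decision("https://example.com", mode) for mode in CacheMode}
```

Under these rules `DISABLED` and `BYPASS` yield the same (False, False) decision. Any difference between them would have to come from code outside this file.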
--- a/crawl4ai/chunking_strategy.py
+++ b/crawl4ai/chunking_strategy.py
@@ -4,52 +4,21 @@ from collections import Counter
 import string
 from .model_loader import load_nltk_punkt

-
 # Define the abstract base class for chunking strategies
 class ChunkingStrategy(ABC):
-    """
-    Abstract base class for chunking strategies.
-    """
-
+    
    @abstractmethod
    def chunk(self, text: str) -> list:
        """
        Abstract method to chunk the given text.
-
-        Args:
-            text (str): The text to chunk.
-
-        Returns:
-            list: A list of chunks.
        """
        pass
-
-
-# Create an identity chunking strategy f(x) = [x]
-class IdentityChunking(ChunkingStrategy):
-    """
-    Chunking strategy that returns the input text as a single chunk.
-    """
-
-    def chunk(self, text: str) -> list:
-        return [text]
-
-
+    
 # Regex-based chunking
 class RegexChunking(ChunkingStrategy):
-    """
-    Chunking strategy that splits text based on regular expression patterns.
-    """
-
    def __init__(self, patterns=None, **kwargs):
-        """
-        Initialize the RegexChunking object.
-
-        Args:
-            patterns (list): A list of regular expression patterns to split text.
-        """
        if patterns is None:
-            patterns = [r"\n\n"]  # Default split pattern
+            patterns = [r'\n\n']  # Default split pattern
        self.patterns = patterns

    def chunk(self, text: str) -> list:
@@ -60,19 +29,12 @@ class RegexChunking(ChunkingStrategy):
                new_paragraphs.extend(re.split(pattern, paragraph))
            paragraphs = new_paragraphs
        return paragraphs
-
-
-# NLP-based sentence chunking
+    
+# NLP-based sentence chunking 
 class NlpSentenceChunking(ChunkingStrategy):
-    """
-    Chunking strategy that splits text into sentences using NLTK's sentence tokenizer.
-    """
-
    def __init__(self, **kwargs):
-        """
-        Initialize the NlpSentenceChunking object.
-        """
        load_nltk_punkt()
+        pass

    def chunk(self, text: str) -> list:
        # Improved regex for sentence splitting
@@ -80,35 +42,19 @@ class NlpSentenceChunking(ChunkingStrategy):
        #     r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z][A-Z]\.)(?<![A-Za-z]\.)(?<=\.|\?|\!|\n)\s'
        # )
        # sentences = sentence_endings.split(text)
-        # sens =  [sent.strip() for sent in sentences if sent]
+        # sens =  [sent.strip() for sent in sentences if sent]            
        from nltk.tokenize import sent_tokenize
-
        sentences = sent_tokenize(text)
-        sens = [sent.strip() for sent in sentences]
-
+        sens =  [sent.strip() for sent in sentences]        
+        
        return list(set(sens))
-
-
+    
 # Topic-based segmentation using TextTiling
 class TopicSegmentationChunking(ChunkingStrategy):
-    """
-    Chunking strategy that segments text into topics using NLTK's TextTilingTokenizer.
-
-    How it works:
-    1. Segment the text into topics using TextTilingTokenizer
-    2. Extract keywords for each topic segment
-    """
-
+    
    def __init__(self, num_keywords=3, **kwargs):
-        """
-        Initialize the TopicSegmentationChunking object.
-
-        Args:
-            num_keywords (int): The number of keywords to extract for each topic segment.
-        """
        import nltk as nl
-
-        self.tokenizer = nl.tokenize.TextTilingTokenizer()
+        self.tokenizer = nl.tokenize.TextTilingTokenizer()
        self.num_keywords = num_keywords

    def chunk(self, text: str) -> list:
@@ -119,14 +65,8 @@ class TopicSegmentationChunking(ChunkingStrategy):
    def extract_keywords(self, text: str) -> list:
        # Tokenize and remove stopwords and punctuation
        import nltk as nl
-
        tokens = nl.tokenize.word_tokenize(text)
-        tokens = [
-            token.lower()
-            for token in tokens
-            if token not in nl.corpus.stopwords.words("english")
-            and token not in string.punctuation
-        ]
+        tokens = [token.lower() for token in tokens if token not in nl.corpus.stopwords.words('english') and token not in string.punctuation]

        # Calculate frequency distribution
        freq_dist = Counter(tokens)
@@ -137,120 +77,29 @@ class TopicSegmentationChunking(ChunkingStrategy):
        # Segment the text into topics
        segments = self.chunk(text)
        # Extract keywords for each topic segment
-        segments_with_topics = [
-            (segment, self.extract_keywords(segment)) for segment in segments
-        ]
+        segments_with_topics = [(segment, self.extract_keywords(segment)) for segment in segments]
        return segments_with_topics
-
-
+    
 # Fixed-length word chunks
 class FixedLengthWordChunking(ChunkingStrategy):
-    """
-    Chunking strategy that splits text into fixed-length word chunks.
-
-    How it works:
-    1. Split the text into words
-    2. Create chunks of fixed length
-    3. Return the list of chunks
-    """
-
    def __init__(self, chunk_size=100, **kwargs):
-        """
-        Initialize the fixed-length word chunking strategy with the given chunk size.
-
-        Args:
-            chunk_size (int): The size of each chunk in words.
-        """
        self.chunk_size = chunk_size

    def chunk(self, text: str) -> list:
        words = text.split()
-        return [
-            " ".join(words[i : i + self.chunk_size])
-            for i in range(0, len(words), self.chunk_size)
-        ]
-
-
+        return [' '.join(words[i:i + self.chunk_size]) for i in range(0, len(words), self.chunk_size)]
+    
 # Sliding window chunking
 class SlidingWindowChunking(ChunkingStrategy):
-    """
-    Chunking strategy that splits text into overlapping word chunks.
-
-    How it works:
-    1. Split the text into words
-    2. Create chunks of fixed length
-    3. Return the list of chunks
-    """
-
    def __init__(self, window_size=100, step=50, **kwargs):
-        """
-        Initialize the sliding window chunking strategy with the given window size and
-        step size.
-
-        Args:
-            window_size (int): The size of the sliding window in words.
-            step (int): The step size for sliding the window in words.
-        """
        self.window_size = window_size
        self.step = step

    def chunk(self, text: str) -> list:
        words = text.split()
        chunks = []
-
-        if len(words) <= self.window_size:
-            return [text]
-
-        for i in range(0, len(words) - self.window_size + 1, self.step):
-            chunk = " ".join(words[i : i + self.window_size])
-            chunks.append(chunk)
-
-        # Handle the last chunk if it doesn't align perfectly
-        if i + self.window_size < len(words):
-            chunks.append(" ".join(words[-self.window_size :]))
-
+        for i in range(0, len(words), self.step):
+            chunks.append(' '.join(words[i:i + self.window_size]))
        return chunks
+    

-
-class OverlappingWindowChunking(ChunkingStrategy):
-    """
-    Chunking strategy that splits text into overlapping word chunks.
-
-    How it works:
-    1. Split the text into words using whitespace
-    2. Create chunks of fixed length equal to the window size
-    3. Slide the window by the overlap size
-    4. Return the list of chunks
-    """
-
-    def __init__(self, window_size=1000, overlap=100, **kwargs):
-        """
-        Initialize the overlapping window chunking strategy with the given window size and
-        overlap size.
-
-        Args:
-            window_size (int): The size of the window in words.
-            overlap (int): The size of the overlap between consecutive chunks in words.
-        """
-        self.window_size = window_size
-        self.overlap = overlap
-
-    def chunk(self, text: str) -> list:
-        words = text.split()
-        chunks = []
-
-        if len(words) <= self.window_size:
-            return [text]
-
-        start = 0
-        while start < len(words):
-            end = start + self.window_size
-            chunk = " ".join(words[start:end])
-            chunks.append(chunk)
-
-            if end >= len(words):
-                break
-
-            start = end - self.overlap
-
-        return chunks
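Reviewer note: the two sides of the `SlidingWindowChunking.chunk` hunk behave differently at the end of the text. The sketch below restates both loops as standalone functions so the difference is easy to see. `sliding_minus_side` and `sliding_plus_side` are hypothetical names for the `-` and `+` variants, not library functions.

```python
def sliding_minus_side(text: str, window_size: int, step: int) -> list:
    """Variant on the '-' side: full windows only, plus one aligned tail window."""
    words = text.split()
    if len(words) <= window_size:
        return [text]
    chunks = []
    for i in range(0, len(words) - window_size + 1, step):
        chunks.append(" ".join(words[i : i + window_size]))
    # Add a final full-size window if the loop stopped short of the end.
    if i + window_size < len(words):
        chunks.append(" ".join(words[-window_size:]))
    return chunks


def sliding_plus_side(text: str, window_size: int, step: int) -> list:
    """Variant on the '+' side: a window starts at every step, so tails shrink."""
    words = text.split()
    return [" ".join(words[i : i + window_size]) for i in range(0, len(words), step)]


sample = "a b c d e f g"
minus_chunks = sliding_minus_side(sample, window_size=4, step=2)
plus_chunks = sliding_plus_side(sample, window_size=4, step=2)
```

With this sample the `-` side produces three full four-word windows, while the `+` side also emits the shorter trailing chunks `"e f g"` and `"g"`.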
--- a/crawl4ai/cli.py
+++ b/crawl4ai/cli.py
@@ -1,123 +0,0 @@
-import click
-import sys
-import asyncio
-from typing import List
-from .docs_manager import DocsManager
-from .async_logger import AsyncLogger
-
-logger = AsyncLogger(verbose=True)
-docs_manager = DocsManager(logger)
-
-
-def print_table(headers: List[str], rows: List[List[str]], padding: int = 2):
-    """Print formatted table with headers and rows"""
-    widths = [max(len(str(cell)) for cell in col) for col in zip(headers, *rows)]
-    border = "+" + "+".join("-" * (w + 2 * padding) for w in widths) + "+"
-
-    def format_row(row):
-        return (
-            "|"
-            + "|".join(
-                f"{' ' * padding}{str(cell):<{w}}{' ' * padding}"
-                for cell, w in zip(row, widths)
-            )
-            + "|"
-        )
-
-    click.echo(border)
-    click.echo(format_row(headers))
-    click.echo(border)
-    for row in rows:
-        click.echo(format_row(row))
-    click.echo(border)
-
-
-@click.group()
-def cli():
-    """Crawl4AI Command Line Interface"""
-    pass
-
-
-@cli.group()
-def docs():
-    """Documentation operations"""
-    pass
-
-
-@docs.command()
-@click.argument("sections", nargs=-1)
-@click.option(
-    "--mode", type=click.Choice(["extended", "condensed"]), default="extended"
-)
-def combine(sections: tuple, mode: str):
-    """Combine documentation sections"""
-    try:
-        asyncio.run(docs_manager.ensure_docs_exist())
-        click.echo(docs_manager.generate(sections, mode))
-    except Exception as e:
-        logger.error(str(e), tag="ERROR")
-        sys.exit(1)
-
-
-@docs.command()
-@click.argument("query")
-@click.option("--top-k", "-k", default=5)
-@click.option("--build-index", is_flag=True, help="Build index if missing")
-def search(query: str, top_k: int, build_index: bool):
-    """Search documentation"""
-    try:
-        result = docs_manager.search(query, top_k)
-        if result == "No search index available. Call build_search_index() first.":
-            if build_index or click.confirm("No search index found. Build it now?"):
-                asyncio.run(docs_manager.llm_text.generate_index_files())
-                result = docs_manager.search(query, top_k)
-        click.echo(result)
-    except Exception as e:
-        click.echo(f"Error: {str(e)}", err=True)
-        sys.exit(1)
-
-
-@docs.command()
-def update():
-    """Update docs from GitHub"""
-    try:
-        asyncio.run(docs_manager.fetch_docs())
-        click.echo("Documentation updated successfully")
-    except Exception as e:
-        click.echo(f"Error: {str(e)}", err=True)
-        sys.exit(1)
-
-
-@docs.command()
-@click.option("--force-facts", is_flag=True, help="Force regenerate fact files")
-@click.option("--clear-cache", is_flag=True, help="Clear BM25 cache")
-def index(force_facts: bool, clear_cache: bool):
-    """Build or rebuild search indexes"""
-    try:
-        asyncio.run(docs_manager.ensure_docs_exist())
-        asyncio.run(
-            docs_manager.llm_text.generate_index_files(
-                force_generate_facts=force_facts, clear_bm25_cache=clear_cache
-            )
-        )
-        click.echo("Search indexes built successfully")
-    except Exception as e:
-        click.echo(f"Error: {str(e)}", err=True)
-        sys.exit(1)
-
-
-# Add docs list command
-@docs.command()
-def list():
-    """List available documentation sections"""
-    try:
-        sections = docs_manager.list()
-        print_table(["Sections"], [[section] for section in sections])
-
-    except Exception as e:
-        click.echo(f"Error: {str(e)}", err=True)
-        sys.exit(1)
-
-
-if __name__ == "__main__":
-    cli()
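Reviewer note: the column-width logic in the removed `print_table` is easier to check when it returns lines instead of echoing them. The sketch below is a hypothetical standard-library variant, `render_table`, which is not part of the CLI. It uses the same width and border calculation.

```python
from typing import List


def render_table(headers: List[str], rows: List[List[str]], padding: int = 2) -> List[str]:
    """Hypothetical variant of print_table that returns the table lines."""
    # Each column is as wide as its longest cell, header included.
    widths = [max(len(str(cell)) for cell in col) for col in zip(headers, *rows)]
    border = "+" + "+".join("-" * (w + 2 * padding) for w in widths) + "+"

    def format_row(row):
        cells = (
            f"{' ' * padding}{str(cell):<{w}}{' ' * padding}"
            for cell, w in zip(row, widths)
        )
        return "|" + "|".join(cells) + "|"

    return [border, format_row(headers), border, *(format_row(r) for r in rows), border]


table_lines = render_table(["Sections"], [["intro"], ["api"]])
```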
--- a/crawl4ai/config.py
+++ b/crawl4ai/config.py
@@ -4,84 +4,24 @@ from dotenv import load_dotenv
 load_dotenv()  # Load environment variables from .env file

 # Default provider, ONLY used when the extraction strategy is LLMExtractionStrategy
-DEFAULT_PROVIDER = "openai/gpt-4o-mini"
+DEFAULT_PROVIDER = "openai/gpt-4-turbo"
 MODEL_REPO_BRANCH = "new-release-0.0.2"
 # Provider-model dictionary, ONLY used when the extraction strategy is LLMExtractionStrategy
 PROVIDER_MODELS = {
-    "ollama/llama3": "no-token-needed",  # Ollama models do not need an API token
+    "ollama/llama3": "no-token-needed", # Ollama models do not need an API token
    "groq/llama3-70b-8192": os.getenv("GROQ_API_KEY"),
    "groq/llama3-8b-8192": os.getenv("GROQ_API_KEY"),
-    "openai/gpt-4o-mini": os.getenv("OPENAI_API_KEY"),
+    "openai/gpt-3.5-turbo": os.getenv("OPENAI_API_KEY"),
+    "openai/gpt-4-turbo": os.getenv("OPENAI_API_KEY"),
    "openai/gpt-4o": os.getenv("OPENAI_API_KEY"),
-    "openai/o1-mini": os.getenv("OPENAI_API_KEY"),
-    "openai/o1-preview": os.getenv("OPENAI_API_KEY"),
    "anthropic/claude-3-haiku-20240307": os.getenv("ANTHROPIC_API_KEY"),
    "anthropic/claude-3-opus-20240229": os.getenv("ANTHROPIC_API_KEY"),
    "anthropic/claude-3-sonnet-20240229": os.getenv("ANTHROPIC_API_KEY"),
-    "anthropic/claude-3-5-sonnet-20240620": os.getenv("ANTHROPIC_API_KEY"),
 }

+
 # Chunk token threshold
-CHUNK_TOKEN_THRESHOLD = 2**11  # 2048 tokens
-OVERLAP_RATE = 0.1
-WORD_TOKEN_RATE = 1.3
+CHUNK_TOKEN_THRESHOLD = 1000

-# Threshold for the minimum number of words in an HTML tag to be considered
-MIN_WORD_THRESHOLD = 1
-IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD = 1
-
-IMPORTANT_ATTRS = ["src", "href", "alt", "title", "width", "height"]
-ONLY_TEXT_ELIGIBLE_TAGS = [
-    "b",
-    "i",
-    "u",
-    "span",
-    "del",
-    "ins",
-    "sub",
-    "sup",
-    "strong",
-    "em",
-    "code",
-    "kbd",
-    "var",
-    "s",
-    "q",
-    "abbr",
-    "cite",
-    "dfn",
-    "time",
-    "small",
-    "mark",
-]
-SOCIAL_MEDIA_DOMAINS = [
-    "facebook.com",
-    "twitter.com",
-    "x.com",
-    "linkedin.com",
-    "instagram.com",
-    "pinterest.com",
-    "tiktok.com",
-    "snapchat.com",
-    "reddit.com",
-]
-
-# Threshold for the Image extraction - Range is 1 to 6
-# Images are scored based on point based system, to filter based on usefulness. Points are assigned
-# to each image based on the following aspects.
-# If either height or width exceeds 150px
-# If image size is greater than 10Kb
-# If alt property is set
-# If image format is in jpg, png or webp
-# If image is in the first half of the total images extracted from the page
-IMAGE_SCORE_THRESHOLD = 2
-
-MAX_METRICS_HISTORY = 1000
-
-NEED_MIGRATION = True
-URL_LOG_SHORTEN_LENGTH = 30
-SHOW_DEPRECATION_WARNINGS = True
-SCREENSHOT_HEIGHT_TRESHOLD = 10000
-PAGE_TIMEOUT = 60000
-DOWNLOAD_PAGE_TIMEOUT = 60000
-DEEP_CRAWL_BATCH_SIZE = 5
+# Threshold for the minimum number of words in an HTML tag to be considered
+MIN_WORD_THRESHOLD = 5
--- a/crawl4ai/content_filter_strategy.py
+++ b/crawl4ai/content_filter_strategy.py
@@ -1,999 +0,0 @@
-import re
-import time
-from bs4 import BeautifulSoup, Tag
-from typing import List, Tuple, Dict, Optional
-from rank_bm25 import BM25Okapi
-from collections import deque
-from bs4 import NavigableString, Comment
-from .utils import clean_tokens, perform_completion_with_backoff, escape_json_string, sanitize_html, get_home_folder, extract_xml_data
-from abc import ABC, abstractmethod
-import math
-from snowballstemmer import stemmer
-from .config import DEFAULT_PROVIDER, OVERLAP_RATE, WORD_TOKEN_RATE
-from .models import TokenUsage
-from .prompts import PROMPT_FILTER_CONTENT
-import os
-import json
-import hashlib
-from pathlib import Path
-from concurrent.futures import ThreadPoolExecutor, as_completed
-from .async_logger import AsyncLogger, LogLevel
-from colorama import Fore, Style, init
-
-class RelevantContentFilter(ABC):
-    """Abstract base class for content filtering strategies"""
-
-    def __init__(self, user_query: str = None):
-        self.user_query = user_query
-        self.included_tags = {
-            # Primary structure
-            "article",
-            "main",
-            "section",
-            "div",
-            # List structures
-            "ul",
-            "ol",
-            "li",
-            "dl",
-            "dt",
-            "dd",
-            # Text content
-            "p",
-            "span",
-            "blockquote",
-            "pre",
-            "code",
-            # Headers
-            "h1",
-            "h2",
-            "h3",
-            "h4",
-            "h5",
-            "h6",
-            # Tables
-            "table",
-            "thead",
-            "tbody",
-            "tr",
-            "td",
-            "th",
-            # Other semantic elements
-            "figure",
-            "figcaption",
-            "details",
-            "summary",
-            # Text formatting
-            "em",
-            "strong",
-            "b",
-            "i",
-            "mark",
-            "small",
-            # Rich content
-            "time",
-            "address",
-            "cite",
-            "q",
-        }
-        self.excluded_tags = {
-            "nav",
-            "footer",
-            "header",
-            "aside",
-            "script",
-            "style",
-            "form",
-            "iframe",
-            "noscript",
-        }
-        self.header_tags = {"h1", "h2", "h3", "h4", "h5", "h6"}
-        self.negative_patterns = re.compile(
-            r"nav|footer|header|sidebar|ads|comment|promo|advert|social|share", re.I
-        )
-        self.min_word_count = 2
-
-    @abstractmethod
-    def filter_content(self, html: str) -> List[str]:
-        """Abstract method to be implemented by specific filtering strategies"""
-        pass
-
-    def extract_page_query(self, soup: BeautifulSoup, body: Tag) -> str:
-        """Common method to extract page metadata with fallbacks"""
-        if self.user_query:
-            return self.user_query
-
-        query_parts = []
-
-        # Title
-        try:
-            title = soup.title.string
-            if title:
-                query_parts.append(title)
-        except Exception:
-            pass
-
-        if soup.find("h1"):
-            query_parts.append(soup.find("h1").get_text())
-
-        # Meta tags
-        temp = ""
-        for meta_name in ["keywords", "description"]:
-            meta = soup.find("meta", attrs={"name": meta_name})
-            if meta and meta.get("content"):
-                query_parts.append(meta["content"])
-                temp += meta["content"]
-
-        # If still empty, grab first significant paragraph
-        if not temp:
-            # Find the first <p> tag whose text contains more than 150 characters
-            for p in body.find_all("p"):
-                if len(p.get_text()) > 150:
-                    query_parts.append(p.get_text()[:150])
-                    break
-
-        return " ".join(filter(None, query_parts))
-
-    def extract_text_chunks(
-        self, body: Tag, min_word_threshold: int = None
-    ) -> List[Tuple[str, str]]:
-        """
-        Extracts text chunks from a BeautifulSoup body element while preserving order.
-        Returns a list of (index, text, tag_type, element) tuples for classification.
-
-        Args:
-            body: BeautifulSoup Tag object representing the body element
-
-        Returns:
-            List of (index, text, tag_type, element) tuples; tag_type is "header" or "content"
-        """
-        # Tags to ignore - inline elements that shouldn't break text flow
-        INLINE_TAGS = {
-            "a",
-            "abbr",
-            "acronym",
-            "b",
-            "bdo",
-            "big",
-            "br",
-            "button",
-            "cite",
-            "code",
-            "dfn",
-            "em",
-            "i",
-            "img",
-            "input",
-            "kbd",
-            "label",
-            "map",
-            "object",
-            "q",
-            "samp",
-            "script",
-            "select",
-            "small",
-            "span",
-            "strong",
-            "sub",
-            "sup",
-            "textarea",
-            "time",
-            "tt",
-            "var",
-        }
-
-        # Tags that typically contain meaningful headers
-        HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "header"}
-
-        chunks = []
-        current_text = []
-        chunk_index = 0
-
-        def should_break_chunk(tag: Tag) -> bool:
-            """Determine if a tag should cause a break in the current text chunk"""
-            return tag.name not in INLINE_TAGS and not (
-                tag.name == "p" and len(current_text) == 0
-            )
-
-        # Use deque for efficient push/pop operations
-        stack = deque([(body, False)])
-
-        while stack:
-            element, visited = stack.pop()
-
-            if visited:
-                # End of block element - flush accumulated text
-                if current_text and should_break_chunk(element):
-                    text = " ".join("".join(current_text).split())
-                    if text:
-                        tag_type = (
-                            "header" if element.name in HEADER_TAGS else "content"
-                        )
-                        chunks.append((chunk_index, text, tag_type, element))
-                        chunk_index += 1
-                    current_text = []
-                continue
-
-            if isinstance(element, NavigableString):
-                if str(element).strip():
-                    current_text.append(str(element).strip())
-                continue
-
-            # Pre-allocate children to avoid multiple list operations
-            children = list(element.children)
-            if not children:
-                continue
-
-            # Mark block for revisit after processing children
-            stack.append((element, True))
-
-            # Add children in reverse order for correct processing
-            for child in reversed(children):
-                if isinstance(child, (Tag, NavigableString)):
-                    stack.append((child, False))
-
-        # Handle any remaining text
-        if current_text:
-            text = " ".join("".join(current_text).split())
-            if text:
-                chunks.append((chunk_index, text, "content", body))
-
-        if min_word_threshold:
-            chunks = [
-                chunk for chunk in chunks if len(chunk[1].split()) >= min_word_threshold
-            ]
-
-        return chunks
-
-    def _deprecated_extract_text_chunks(
-        self, soup: BeautifulSoup
-    ) -> List[Tuple[int, str, Tag]]:
-        """Common method for extracting text chunks"""
-        _text_cache = {}
-
-        def fast_text(element: Tag) -> str:
-            elem_id = id(element)
-            if elem_id in _text_cache:
-                return _text_cache[elem_id]
-            texts = []
-            for content in element.contents:
-                if isinstance(content, str):
-                    text = content.strip()
-                    if text:
-                        texts.append(text)
-            result = " ".join(texts)
-            _text_cache[elem_id] = result
-            return result
-
-        candidates = []
-        index = 0
-
-        def dfs(element):
-            nonlocal index
-            if isinstance(element, Tag):
-                if element.name in self.included_tags:
-                    if not self.is_excluded(element):
-                        text = fast_text(element)
-                        word_count = len(text.split())
-
-                        # Headers pass through with adjusted minimum
-                        if element.name in self.header_tags:
-                            if word_count >= 3:  # Minimal sanity check for headers
-                                candidates.append((index, text, element))
-                                index += 1
-                        # Regular content uses standard minimum
-                        elif word_count >= self.min_word_count:
-                            candidates.append((index, text, element))
-                            index += 1
-
-                for child in element.children:
-                    dfs(child)
-
-        dfs(soup.body if soup.body else soup)
-        return candidates
-
-    def is_excluded(self, tag: Tag) -> bool:
-        """Common method for exclusion logic"""
-        if tag.name in self.excluded_tags:
-            return True
-        class_id = " ".join(
-            filter(None, [" ".join(tag.get("class", [])), tag.get("id", "")])
-        )
-        return bool(self.negative_patterns.search(class_id))
-
-    def clean_element(self, tag: Tag) -> str:
-        """Common method for cleaning HTML elements with minimal overhead"""
-        if not tag or not isinstance(tag, Tag):
-            return ""
-
-        unwanted_tags = {"script", "style", "aside", "form", "iframe", "noscript"}
-        unwanted_attrs = {
-            "style",
-            "onclick",
-            "onmouseover",
-            "align",
-            "bgcolor",
-            "class",
-            "id",
-        }
-
-        # Use string builder pattern for better performance
-        builder = []
-
-        def render_tag(elem):
-            if not isinstance(elem, Tag):
-                if isinstance(elem, str):
-                    builder.append(elem.strip())
-                return
-
-            if elem.name in unwanted_tags:
-                return
-
-            # Start tag
-            builder.append(f"<{elem.name}")
-
-            # Add cleaned attributes
-            attrs = {k: v for k, v in elem.attrs.items() if k not in unwanted_attrs}
-            for key, value in attrs.items():
-                builder.append(f' {key}="{value}"')
-
-            builder.append(">")
-
-            # Process children
-            for child in elem.children:
-                render_tag(child)
-
-            # Close tag
-            builder.append(f"</{elem.name}>")
-
-        try:
-            render_tag(tag)
-            return "".join(builder)
-        except Exception:
-            return str(tag)  # Fallback to original if anything fails
-
-class BM25ContentFilter(RelevantContentFilter):
-    """
-    Content filtering using BM25 algorithm with priority tag handling.
-
-    How it works:
-    1. Extracts page metadata with fallbacks.
-    2. Extracts text chunks from the body element.
-    3. Tokenizes the corpus and query.
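-    4. Applies the BM25 algorithm to score each chunk, then multiplies the score by the priority weight of the chunk's tag.
-    5. Filters out chunks whose adjusted score is below the threshold.
-    6. Sorts the remaining chunks by original document order.
-    7. Returns the cleaned HTML of each remaining chunk.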
-
-    Attributes:
-        user_query (str): User query for filtering (optional).
-        bm25_threshold (float): BM25 threshold for filtering (default: 1.0).
-        language (str): Language for stemming (default: 'english').
-
-    Methods:
-        filter_content(self, html: str, min_word_threshold: int = None)
-    """
-
-    def __init__(
-        self,
-        user_query: str = None,
-        bm25_threshold: float = 1.0,
-        language: str = "english",
-    ):
-        """
-        Initializes the BM25ContentFilter class. If no user query is provided, falls back to page metadata.
-
-        Note:
-        If no query is given and no page metadata is available, then it tries to pick up the first significant paragraph.
-
-        Args:
-            user_query (str): User query for filtering (optional).
-            bm25_threshold (float): BM25 threshold for filtering (default: 1.0).
-            language (str): Language for stemming (default: 'english').
-        """
-        super().__init__(user_query=user_query)
-        self.bm25_threshold = bm25_threshold
-        self.priority_tags = {
-            "h1": 5.0,
-            "h2": 4.0,
-            "h3": 3.0,
-            "title": 4.0,
-            "strong": 2.0,
-            "b": 1.5,
-            "em": 1.5,
-            "blockquote": 2.0,
-            "code": 2.0,
-            "pre": 1.5,
-            "th": 1.5,  # Table headers
-        }
-        self.stemmer = stemmer(language)
-
-    def filter_content(self, html: str, min_word_threshold: int = None) -> List[str]:
-        """
-        Implements content filtering using BM25 algorithm with priority tag handling.
-
-        Note:
-        This method implements the filtering logic for the BM25ContentFilter class.
-        It takes HTML content as input and returns a list of filtered text chunks.
-
-        Args:
-            html (str): HTML content to be filtered.
-            min_word_threshold (int): Minimum word threshold for filtering (optional).
-
-        Returns:
-            List[str]: List of filtered text chunks.
-        """
-        if not html or not isinstance(html, str):
-            return []
-
-        soup = BeautifulSoup(html, "lxml")
-
-        # Check if body is present
-        if not soup.body:
-            # Wrap in body tag if missing
-            soup = BeautifulSoup(f"<body>{html}</body>", "lxml")
-        body = soup.find("body")
-
-        query = self.extract_page_query(soup, body)
-
-        if not query:
-            return []
-            # return [self.clean_element(soup)]
-
-        candidates = self.extract_text_chunks(body, min_word_threshold)
-
-        if not candidates:
-            return []
-
-        # Tokenize corpus
-        # tokenized_corpus = [chunk.lower().split() for _, chunk, _, _ in candidates]
-        # tokenized_query = query.lower().split()
-
-        # tokenized_corpus = [[ps.stem(word) for word in chunk.lower().split()]
-        #                 for _, chunk, _, _ in candidates]
-        # tokenized_query = [ps.stem(word) for word in query.lower().split()]
-
-        tokenized_corpus = [
-            [self.stemmer.stemWord(word) for word in chunk.lower().split()]
-            for _, chunk, _, _ in candidates
-        ]
-        tokenized_query = [
-            self.stemmer.stemWord(word) for word in query.lower().split()
-        ]
-
-        # tokenized_corpus = [[self.stemmer.stemWord(word) for word in tokenize_text(chunk.lower())]
-        #            for _, chunk, _, _ in candidates]
-        # tokenized_query = [self.stemmer.stemWord(word) for word in tokenize_text(query.lower())]
-
-        # Clean from stop words and noise
-        tokenized_corpus = [clean_tokens(tokens) for tokens in tokenized_corpus]
-        tokenized_query = clean_tokens(tokenized_query)
-
-        bm25 = BM25Okapi(tokenized_corpus)
-        scores = bm25.get_scores(tokenized_query)
-
-        # Adjust scores with tag weights
-        adjusted_candidates = []
-        for score, (index, chunk, tag_type, tag) in zip(scores, candidates):
-            tag_weight = self.priority_tags.get(tag.name, 1.0)
-            adjusted_score = score * tag_weight
-            adjusted_candidates.append((adjusted_score, index, chunk, tag))
-
-        # Filter candidates by threshold
-        selected_candidates = [
-            (index, chunk, tag)
-            for adjusted_score, index, chunk, tag in adjusted_candidates
-            if adjusted_score >= self.bm25_threshold
-        ]
-
-        if not selected_candidates:
-            return []
-
-        # Sort selected candidates by original document order
-        selected_candidates.sort(key=lambda x: x[0])
-
-        return [self.clean_element(tag) for _, _, tag in selected_candidates]
-
-class PruningContentFilter(RelevantContentFilter):
-    """
-    Content filtering using pruning algorithm with dynamic threshold.
-
-    How it works:
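-    1. Parses the HTML, wrapping it in a body element if none is present.
-    2. Removes comments and excluded tags.
-    3. Computes a composite score for each node (text density, link density, tag weight, class/id weight, text length).
-    4. Removes nodes whose score falls below the fixed or dynamic threshold.
-    5. Recurses into the children of the nodes that are kept.
-    6. Returns the remaining top-level children of body as HTML strings.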
-
-    Attributes:
-        user_query (str): User query for filtering (optional), if not provided, falls back to page metadata.
-        min_word_threshold (int): Minimum word threshold for filtering (optional).
-        threshold_type (str): Threshold type, 'fixed' or 'dynamic' (default: 'fixed').
-        threshold (float): Threshold value; used as the base threshold in dynamic mode (default: 0.48).
-
-    Methods:
-        filter_content(self, html: str, min_word_threshold: int = None)
-    """
-
-    def __init__(
-        self,
-        user_query: str = None,
-        min_word_threshold: int = None,
-        threshold_type: str = "fixed",
-        threshold: float = 0.48,
-    ):
-        """
-        Initializes the PruningContentFilter class. If no user query is provided, falls back to page metadata.
-
-        Note:
-        If no query is given and no page metadata is available, then it tries to pick up the first significant paragraph.
-
-        Args:
-            user_query (str): User query for filtering (optional).
-            min_word_threshold (int): Minimum word threshold for filtering (optional).
-            threshold_type (str): Threshold type, 'fixed' or 'dynamic' (default: 'fixed').
-            threshold (float): Threshold value; used as the base threshold in dynamic mode (default: 0.48).
-        """
-        super().__init__(None)
-        self.min_word_threshold = min_word_threshold
-        self.threshold_type = threshold_type
-        self.threshold = threshold
-
-        # Add tag importance for dynamic threshold
-        self.tag_importance = {
-            "article": 1.5,
-            "main": 1.4,
-            "section": 1.3,
-            "p": 1.2,
-            "h1": 1.4,
-            "h2": 1.3,
-            "h3": 1.2,
-            "div": 0.7,
-            "span": 0.6,
-        }
-
-        # Metric configuration
-        self.metric_config = {
-            "text_density": True,
-            "link_density": True,
-            "tag_weight": True,
-            "class_id_weight": True,
-            "text_length": True,
-        }
-
-        self.metric_weights = {
-            "text_density": 0.4,
-            "link_density": 0.2,
-            "tag_weight": 0.2,
-            "class_id_weight": 0.1,
-            "text_length": 0.1,
-        }
-
-        self.tag_weights = {
-            "div": 0.5,
-            "p": 1.0,
-            "article": 1.5,
-            "section": 1.0,
-            "span": 0.3,
-            "li": 0.5,
-            "ul": 0.5,
-            "ol": 0.5,
-            "h1": 1.2,
-            "h2": 1.1,
-            "h3": 1.0,
-            "h4": 0.9,
-            "h5": 0.8,
-            "h6": 0.7,
-        }
-
-    def filter_content(self, html: str, min_word_threshold: int = None) -> List[str]:
-        """
-        Implements content filtering using pruning algorithm with dynamic threshold.
-
-        Note:
-        This method implements the filtering logic for the PruningContentFilter class.
-        It takes HTML content as input and returns a list of filtered text chunks.
-
-        Args:
-            html (str): HTML content to be filtered.
-            min_word_threshold (int): Minimum word threshold for filtering (optional).
-
-        Returns:
-            List[str]: List of filtered text chunks.
-        """
-        if not html or not isinstance(html, str):
-            return []
-
-        soup = BeautifulSoup(html, "lxml")
-        if not soup.body:
-            soup = BeautifulSoup(f"<body>{html}</body>", "lxml")
-
-        # Remove comments and unwanted tags
-        self._remove_comments(soup)
-        self._remove_unwanted_tags(soup)
-
-        # Prune tree starting from body
-        body = soup.find("body")
-        self._prune_tree(body)
-
-        # Extract remaining content as list of HTML strings
-        content_blocks = []
-        for element in body.children:
-            if isinstance(element, str) or not hasattr(element, "name"):
-                continue
-            if len(element.get_text(strip=True)) > 0:
-                content_blocks.append(str(element))
-
-        return content_blocks
-
-    def _remove_comments(self, soup):
-        """Removes HTML comments"""
-        for element in soup(string=lambda text: isinstance(text, Comment)):
-            element.extract()
-
-    def _remove_unwanted_tags(self, soup):
-        """Removes unwanted tags"""
-        for tag in self.excluded_tags:
-            for element in soup.find_all(tag):
-                element.decompose()
-
-    def _prune_tree(self, node):
-        """
-        Prunes the tree starting from the given node.
-
-        Args:
-            node (Tag): The node from which the pruning starts.
-        """
-        if not node or not hasattr(node, "name") or node.name is None:
-            return
-
-        text_len = len(node.get_text(strip=True))
-        tag_len = len(node.encode_contents().decode("utf-8"))
-        link_text_len = sum(
-            len(s.strip())
-            for s in (a.string for a in node.find_all("a", recursive=False))
-            if s
-        )
-
-        metrics = {
-            "node": node,
-            "tag_name": node.name,
-            "text_len": text_len,
-            "tag_len": tag_len,
-            "link_text_len": link_text_len,
-        }
-
-        score = self._compute_composite_score(metrics, text_len, tag_len, link_text_len)
-
-        if self.threshold_type == "fixed":
-            should_remove = score < self.threshold
-        else:  # dynamic
-            tag_importance = self.tag_importance.get(node.name, 0.7)
-            text_ratio = text_len / tag_len if tag_len > 0 else 0
-            link_ratio = link_text_len / text_len if text_len > 0 else 1
-
-            threshold = self.threshold  # base threshold
-            if tag_importance > 1:
-                threshold *= 0.8
-            if text_ratio > 0.4:
-                threshold *= 0.9
-            if link_ratio > 0.6:
-                threshold *= 1.2
-
-            should_remove = score < threshold
-
-        if should_remove:
-            node.decompose()
-        else:
-            children = [child for child in node.children if hasattr(child, "name")]
-            for child in children:
-                self._prune_tree(child)
-
-    def _compute_composite_score(self, metrics, text_len, tag_len, link_text_len):
-        """Computes the composite score"""
-        if self.min_word_threshold:
-            # Get raw text from metrics node - avoid extra processing
-            text = metrics["node"].get_text(strip=True)
-            word_count = text.count(" ") + 1
-            if word_count < self.min_word_threshold:
-                return -1.0  # Guaranteed removal
-        score = 0.0
-        total_weight = 0.0
-
-        if self.metric_config["text_density"]:
-            density = text_len / tag_len if tag_len > 0 else 0
-            score += self.metric_weights["text_density"] * density
-            total_weight += self.metric_weights["text_density"]
-
-        if self.metric_config["link_density"]:
-            density = 1 - (link_text_len / text_len if text_len > 0 else 0)
-            score += self.metric_weights["link_density"] * density
-            total_weight += self.metric_weights["link_density"]
-
-        if self.metric_config["tag_weight"]:
-            tag_score = self.tag_weights.get(metrics["tag_name"], 0.5)
-            score += self.metric_weights["tag_weight"] * tag_score
-            total_weight += self.metric_weights["tag_weight"]
-
-        if self.metric_config["class_id_weight"]:
-            class_score = self._compute_class_id_weight(metrics["node"])
-            score += self.metric_weights["class_id_weight"] * max(0, class_score)
-            total_weight += self.metric_weights["class_id_weight"]
-
-        if self.metric_config["text_length"]:
-            score += self.metric_weights["text_length"] * math.log(text_len + 1)
-            total_weight += self.metric_weights["text_length"]
-
-        return score / total_weight if total_weight > 0 else 0
-
-    def _compute_class_id_weight(self, node):
-        """Computes the class ID weight"""
-        class_id_score = 0
-        if "class" in node.attrs:
-            classes = " ".join(node["class"])
-            if self.negative_patterns.search(classes):
-                class_id_score -= 0.5
-        if "id" in node.attrs:
-            element_id = node["id"]
-            if self.negative_patterns.search(element_id):
-                class_id_score -= 0.5
-        return class_id_score
-
-class LLMContentFilter(RelevantContentFilter):
-    """Content filtering using LLMs to generate relevant markdown."""
-
-    def __init__(
-        self,
-        provider: str = DEFAULT_PROVIDER,
-        api_token: Optional[str] = None,
-        instruction: str = None,
-        chunk_token_threshold: int = int(1e9),
-        overlap_rate: float = OVERLAP_RATE,
-        word_token_rate: float = WORD_TOKEN_RATE,
-        base_url: Optional[str] = None,
-        api_base: Optional[str] = None,
-        extra_args: Dict = None,
-        verbose: bool = False,
-        logger: Optional[AsyncLogger] = None,
-    ):
-        super().__init__(None)
-        self.provider = provider
-        self.api_token = (
-            api_token
-            or PROVIDER_MODELS.get(provider, "no-token")
-            or os.getenv("OPENAI_API_KEY")
-        )
-        self.instruction = instruction
-        self.chunk_token_threshold = chunk_token_threshold
-        self.overlap_rate = overlap_rate
-        self.word_token_rate = word_token_rate
-        self.base_url = base_url
-        self.api_base = api_base or base_url
-        self.extra_args = extra_args or {}
-        self.verbose = verbose
-        
-        # Setup logger with custom styling for LLM operations
-        if logger:
-            self.logger = logger
-        elif verbose:
-            self.logger = AsyncLogger(
-                verbose=True,
-                icons={
-                    **AsyncLogger.DEFAULT_ICONS,
-                    "LLM": "★",  # Star for LLM operations
-                    "CHUNK": "◈",  # Diamond for chunks
-                    "CACHE": "⚡", # Lightning for cache operations
-                },
-                colors={
-                    **AsyncLogger.DEFAULT_COLORS,
-                    LogLevel.INFO: Fore.MAGENTA + Style.DIM,  # Dimmed purple for LLM ops
-                }
-            )
-        else:
-            self.logger = None
-        
-        self.usages = []
-        self.total_usage = TokenUsage()
-
-    def _get_cache_key(self, html: str, instruction: str) -> str:
-        """Generate a unique cache key based on HTML and instruction"""
-        content = f"{html}{instruction}"
-        return hashlib.md5(content.encode()).hexdigest()
-
-    def _merge_chunks(self, text: str) -> List[str]:
-        """Split text into chunks with overlap"""
-        # Calculate tokens and sections
-        total_tokens = len(text.split()) * self.word_token_rate
-        num_sections = max(1, math.floor(total_tokens / self.chunk_token_threshold))
-        adjusted_chunk_threshold = total_tokens / num_sections
-
-        # Split into words
-        words = text.split()
-        chunks = []
-        current_chunk = []
-        current_token_count = 0
-
-        for word in words:
-            word_tokens = self.word_token_rate  # one word, matching the per-word estimate used for total_tokens
-            if current_token_count + word_tokens <= adjusted_chunk_threshold:
-                current_chunk.append(word)
-                current_token_count += word_tokens
-            else:
-                # Repeat the chunk's tail as overlap (skipped for the first chunk)
-                overlap_size = int(len(current_chunk) * self.overlap_rate)
-                if chunks and overlap_size > 0:
-                    current_chunk.extend(current_chunk[-overlap_size:])
-                
-                chunks.append(" ".join(current_chunk))
-                current_chunk = [word]
-                current_token_count = word_tokens
-
-        if current_chunk:
-            chunks.append(" ".join(current_chunk))
-
-        return chunks
-
-    def filter_content(self, html: str, ignore_cache: bool = False) -> List[str]:
-        if not html or not isinstance(html, str):
-            return []
-
-        if self.logger:
-            self.logger.info(
-                "Starting LLM content filtering process", 
-                tag="LLM",
-                params={"provider": self.provider},
-                colors={"provider": Fore.CYAN}
-            )
-
-        # Cache handling
-        cache_dir = Path(get_home_folder()) / "llm_cache" / "content_filter"
-        cache_dir.mkdir(parents=True, exist_ok=True)
-        cache_key = self._get_cache_key(html, self.instruction or "")
-        cache_file = cache_dir / f"{cache_key}.json"
-
-        if not ignore_cache and cache_file.exists():
-            if self.logger:
-                self.logger.info("Found cached result", tag="CACHE")
-            try:
-                with cache_file.open('r') as f:
-                    cached_data = json.load(f)
-                    usage = TokenUsage(**cached_data['usage'])
-                    self.usages.append(usage)
-                    self.total_usage.completion_tokens += usage.completion_tokens
-                    self.total_usage.prompt_tokens += usage.prompt_tokens
-                    self.total_usage.total_tokens += usage.total_tokens
-                    return cached_data['blocks']
-            except Exception as e:
-                if self.logger:
-                    self.logger.error(f"Cache read error: {str(e)}", tag="CACHE")
-
-        # Split into chunks
-        html_chunks = self._merge_chunks(html)
-        if self.logger:
-            self.logger.info(
-                "Split content into {chunk_count} chunks", 
-                tag="CHUNK",
-                params={"chunk_count": len(html_chunks)},
-                colors={"chunk_count": Fore.YELLOW}
-            )
-        
-        extracted_content = []
-        start_time = time.time()
-        
-        # Process chunks in parallel
-        with ThreadPoolExecutor(max_workers=4) as executor:
-            futures = []
-            for i, chunk in enumerate(html_chunks):
-                if self.logger:
-                    self.logger.debug(
-                        "Processing chunk {chunk_num}/{total_chunks}", 
-                        tag="CHUNK",
-                        params={
-                            "chunk_num": i + 1,
-                            "total_chunks": len(html_chunks)
-                        }
-                    )
-
-                prompt_variables = {
-                    "HTML": escape_json_string(sanitize_html(chunk)),
-                    "REQUEST": self.instruction or "Convert this HTML into clean, relevant markdown, removing any noise or irrelevant content."
-                }
-
-                prompt = PROMPT_FILTER_CONTENT
-                for var, value in prompt_variables.items():
-                    prompt = prompt.replace("{" + var + "}", value)
-
-                future = executor.submit(
-                    perform_completion_with_backoff,
-                    self.provider,
-                    prompt,
-                    self.api_token,
-                    base_url=self.api_base,
-                    extra_args=self.extra_args
-                )
-                futures.append((i, future))
-
-            # Collect results in order
-            ordered_results = []
-            for i, future in sorted(futures):
-                try:
-                    response = future.result()
-                    
-                    # Track usage
-                    usage = TokenUsage(
-                        completion_tokens=response.usage.completion_tokens,
-                        prompt_tokens=response.usage.prompt_tokens,
-                        total_tokens=response.usage.total_tokens,
-                        completion_tokens_details=response.usage.completion_tokens_details.__dict__ 
-                        if response.usage.completion_tokens_details else {},
-                        prompt_tokens_details=response.usage.prompt_tokens_details.__dict__
-                        if response.usage.prompt_tokens_details else {},
-                    )
-                    self.usages.append(usage)
-                    self.total_usage.completion_tokens += usage.completion_tokens
-                    self.total_usage.prompt_tokens += usage.prompt_tokens
-                    self.total_usage.total_tokens += usage.total_tokens
-
-                    blocks = extract_xml_data(["content"], response.choices[0].message.content)["content"]
-                    if blocks:
-                        ordered_results.append(blocks)
-                        if self.logger:
-                            self.logger.success(
-                                "Successfully processed chunk {chunk_num}", 
-                                tag="CHUNK",
-                                params={"chunk_num": i + 1}
-                            )
-                except Exception as e:
-                    if self.logger:
-                        self.logger.error(
-                            "Error processing chunk {chunk_num}: {error}", 
-                            tag="CHUNK",
-                            params={
-                                "chunk_num": i + 1,
-                                "error": str(e)
-                            }
-                        )
-
-        end_time = time.time()
-        if self.logger:
-            self.logger.success(
-                "Completed processing in {time:.2f}s", 
-                tag="LLM",
-                params={"time": end_time - start_time},
-                colors={"time": Fore.YELLOW}
-            )
-
-        result = ordered_results if ordered_results else []
-
-        # Cache the final result
-        cache_data = {
-            'blocks': result,
-            'usage': self.total_usage.__dict__
-        }
-        with cache_file.open('w') as f:
-            json.dump(cache_data, f)
-            if self.logger:
-                self.logger.info("Cached results for future use", tag="CACHE")
-
-        return result
-
-    def show_usage(self) -> None:
-        """Print usage statistics"""
-        print("\n=== Token Usage Summary ===")
-        print(f"{'Type':<15} {'Count':>12}")
-        print("-" * 30)
-        print(f"{'Completion':<15} {self.total_usage.completion_tokens:>12,}")
-        print(f"{'Prompt':<15} {self.total_usage.prompt_tokens:>12,}")
-        print(f"{'Total':<15} {self.total_usage.total_tokens:>12,}")
-
-        if self.usages:
-            print("\n=== Usage History ===")
-            print(f"{'Request #':<10} {'Completion':>12} {'Prompt':>12} {'Total':>12}")
-            print("-" * 48)
-            for i, usage in enumerate(self.usages, 1):
-                print(
-                    f"{i:<10} {usage.completion_tokens:>12,} "
-                    f"{usage.prompt_tokens:>12,} {usage.total_tokens:>12,}"
-                )
--- a/crawl4ai/content_scraping_strategy.py
+++ b/crawl4ai/content_scraping_strategy.py
--- a/crawl4ai/crawler_strategy.py
+++ b/crawl4ai/crawler_strategy.py
@@ -5,63 +5,41 @@ from selenium.webdriver.common.by import By
 from selenium.webdriver.support.ui import WebDriverWait
 from selenium.webdriver.support import expected_conditions as EC
 from selenium.webdriver.chrome.options import Options
-from selenium.common.exceptions import InvalidArgumentException, WebDriverException
-# from selenium.webdriver.chrome.service import Service as ChromeService
-# from webdriver_manager.chrome import ChromeDriverManager
-# from urllib3.exceptions import MaxRetryError
-
-from .config import *
-import logging, time
-import base64
-from PIL import Image, ImageDraw, ImageFont
-from io import BytesIO
-from typing import Callable
-import requests
-import os
-from pathlib import Path
-from .utils import *
-
-logger = logging.getLogger("selenium.webdriver.remote.remote_connection")
+from selenium.common.exceptions import InvalidArgumentException
+import logging
+logger = logging.getLogger('selenium.webdriver.remote.remote_connection')
 logger.setLevel(logging.WARNING)

-logger_driver = logging.getLogger("selenium.webdriver.common.service")
+logger_driver = logging.getLogger('selenium.webdriver.common.service')
 logger_driver.setLevel(logging.WARNING)

-urllib3_logger = logging.getLogger("urllib3.connectionpool")
+urllib3_logger = logging.getLogger('urllib3.connectionpool')
 urllib3_logger.setLevel(logging.WARNING)

 # Disable http.client logging
-http_client_logger = logging.getLogger("http.client")
+http_client_logger = logging.getLogger('http.client')
 http_client_logger.setLevel(logging.WARNING)

 # Disable driver_finder and service logging
-driver_finder_logger = logging.getLogger("selenium.webdriver.common.driver_finder")
+driver_finder_logger = logging.getLogger('selenium.webdriver.common.driver_finder')
 driver_finder_logger.setLevel(logging.WARNING)


+from typing import List
+import requests
+import os
+from pathlib import Path
+
 class CrawlerStrategy(ABC):
    @abstractmethod
    def crawl(self, url: str, **kwargs) -> str:
        pass

-    @abstractmethod
-    def take_screenshot(self, save_path: str):
-        pass
-
-    @abstractmethod
-    def update_user_agent(self, user_agent: str):
-        pass
-
-    @abstractmethod
-    def set_hook(self, hook_type: str, hook: Callable):
-        pass
-
-
 class CloudCrawlerStrategy(CrawlerStrategy):
-    def __init__(self, use_cached_html=False):
+    def __init__(self, use_cached_html = False):
        super().__init__()
        self.use_cached_html = use_cached_html
-
+        
    def crawl(self, url: str) -> str:
        data = {
            "urls": [url],
@@ -73,8 +51,7 @@ class CloudCrawlerStrategy(CrawlerStrategy):
        response = requests.post("http://crawl4ai.uccode.io/crawl", json=data)
        response = response.json()
        html = response["results"][0]["html"]
-        return sanitize_input_encode(html)
-
+        return html

 class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
    def __init__(self, use_cached_html=False, js_code=None, **kwargs):
@@ -82,30 +59,8 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
        print("[LOG] 🚀 Initializing LocalSeleniumCrawlerStrategy")
        self.options = Options()
        self.options.headless = True
-        if kwargs.get("proxy"):
-            self.options.add_argument("--proxy-server={}".format(kwargs.get("proxy")))
-        if kwargs.get("user_agent"):
-            self.options.add_argument("--user-agent=" + kwargs.get("user_agent"))
-        else:
-            user_agent = kwargs.get(
-                "user_agent",
-                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
-            )
-            self.options.add_argument(f"--user-agent={user_agent}")
-            self.options.add_argument(
-                "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
-            )
-
-        self.options.headless = kwargs.get("headless", True)
-        if self.options.headless:
-            self.options.add_argument("--headless")
-
-        self.options.add_argument("--disable-gpu")
-        self.options.add_argument("--window-size=1920,1080")
        self.options.add_argument("--no-sandbox")
-        self.options.add_argument("--disable-dev-shm-usage")
-        self.options.add_argument("--disable-blink-features=AutomationControlled")
-
+        self.options.add_argument("--headless")
        # self.options.add_argument("--disable-dev-shm-usage")
        self.options.add_argument("--disable-gpu")
        # self.options.add_argument("--disable-extensions")
@@ -126,269 +81,50 @@ class LocalSeleniumCrawlerStrategy(CrawlerStrategy):
        self.js_code = js_code
        self.verbose = kwargs.get("verbose", False)

-        # Hooks
-        self.hooks = {
-            "on_driver_created": None,
-            "on_user_agent_updated": None,
-            "before_get_url": None,
-            "after_get_url": None,
-            "before_return_html": None,
-        }
-
        # chromedriver_autoinstaller.install()
-        # import chromedriver_autoinstaller
-        # crawl4ai_folder = os.path.join(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai")
-        # driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=self.options)
-        # chromedriver_path = chromedriver_autoinstaller.install()
-        # chromedriver_path = chromedriver_autoinstaller.utils.download_chromedriver()
-        # self.service = Service(chromedriver_autoinstaller.install())
-
-        # chromedriver_path = ChromeDriverManager().install()
-        # self.service = Service(chromedriver_path)
-        # self.service.log_path = "NUL"
-        # self.driver = webdriver.Chrome(service=self.service, options=self.options)
-
-        # Use selenium-manager (built into Selenium 4.10.0+)
-        self.service = Service()
-        self.driver = webdriver.Chrome(options=self.options)
-
-        self.driver = self.execute_hook("on_driver_created", self.driver)
-
-        if kwargs.get("cookies"):
-            for cookie in kwargs.get("cookies"):
-                self.driver.add_cookie(cookie)
-
-    def set_hook(self, hook_type: str, hook: Callable):
-        if hook_type in self.hooks:
-            self.hooks[hook_type] = hook
-        else:
-            raise ValueError(f"Invalid hook type: {hook_type}")
-
-    def execute_hook(self, hook_type: str, *args):
-        hook = self.hooks.get(hook_type)
-        if hook:
-            result = hook(*args)
-            if result is not None:
-                if isinstance(result, webdriver.Chrome):
-                    return result
-                else:
-                    raise TypeError(
-                        f"Hook {hook_type} must return an instance of webdriver.Chrome or None."
-                    )
-        # If the hook returns None or there is no hook, return self.driver
-        return self.driver
-
-    def update_user_agent(self, user_agent: str):
-        self.options.add_argument(f"user-agent={user_agent}")
-        self.driver.quit()
+        import chromedriver_autoinstaller
+        self.service = Service(chromedriver_autoinstaller.install())
+        self.service.log_path = "NUL"
        self.driver = webdriver.Chrome(service=self.service, options=self.options)
-        self.driver = self.execute_hook("on_user_agent_updated", self.driver)
-
-    def set_custom_headers(self, headers: dict):
-        # Enable Network domain for sending headers
-        self.driver.execute_cdp_cmd("Network.enable", {})
-        # Set extra HTTP headers
-        self.driver.execute_cdp_cmd("Network.setExtraHTTPHeaders", {"headers": headers})
-
-    def _ensure_page_load(self, max_checks=6, check_interval=0.01):
-        initial_length = len(self.driver.page_source)
-
-        for ix in range(max_checks):
-            # print(f"Checking page load: {ix}")
-            time.sleep(check_interval)
-            current_length = len(self.driver.page_source)
-
-            if current_length != initial_length:
-                break
-
-        return self.driver.page_source
-
-    def crawl(self, url: str, **kwargs) -> str:
-        # Create md5 hash of the URL
-        import hashlib
-
-        url_hash = hashlib.md5(url.encode()).hexdigest()

+    def crawl(self, url: str) -> str:
        if self.use_cached_html:
-            cache_file_path = os.path.join(
-                os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()),
-                ".crawl4ai",
-                "cache",
-                url_hash,
-            )
+            cache_file_path = os.path.join(Path.home(), ".crawl4ai", "cache", url.replace("/", "_"))
            if os.path.exists(cache_file_path):
                with open(cache_file_path, "r") as f:
-                    return sanitize_input_encode(f.read())
+                    return f.read()

        try:
-            self.driver = self.execute_hook("before_get_url", self.driver)
            if self.verbose:
                print(f"[LOG] 🕸️ Crawling {url} using LocalSeleniumCrawlerStrategy...")
-            self.driver.get(url)  # <html><head></head><body></body></html>
-
-            WebDriverWait(self.driver, 20).until(
-                lambda d: d.execute_script("return document.readyState") == "complete"
-            )
+            self.driver.get(url)
            WebDriverWait(self.driver, 10).until(
-                EC.presence_of_all_elements_located((By.TAG_NAME, "body"))
+                EC.presence_of_all_elements_located((By.TAG_NAME, "html"))
            )
-
-            self.driver.execute_script(
-                "window.scrollTo(0, document.body.scrollHeight);"
-            )
-
-            self.driver = self.execute_hook("after_get_url", self.driver)
-            html = sanitize_input_encode(
-                self._ensure_page_load()
-            )  # self.driver.page_source
-            can_not_be_done_headless = (
-                False  # Look at my creativity for naming variables
-            )
-
-            # TODO: Very ugly approach, but promise to change it!
-            if (
-                kwargs.get("bypass_headless", False)
-                or html == "<html><head></head><body></body></html>"
-            ):
-                print(
-                    "[LOG] 🙌 Page could not be loaded in headless mode. Trying non-headless mode..."
-                )
-                can_not_be_done_headless = True
-                options = Options()
-                options.headless = False
-                # set window size very small
-                options.add_argument("--window-size=5,5")
-                driver = webdriver.Chrome(service=self.service, options=options)
-                driver.get(url)
-                self.driver = self.execute_hook("after_get_url", driver)
-                html = sanitize_input_encode(driver.page_source)
-                driver.quit()
-
+            
            # Execute JS code if provided
-            self.js_code = kwargs.get("js_code", self.js_code)
-            if self.js_code and type(self.js_code) == str:
+            if self.js_code:
                self.driver.execute_script(self.js_code)
                # Optionally, wait for some condition after executing the JS code
                WebDriverWait(self.driver, 10).until(
-                    lambda driver: driver.execute_script("return document.readyState")
-                    == "complete"
+                    lambda driver: driver.execute_script("return document.readyState") == "complete"
                )
-            elif self.js_code and type(self.js_code) == list:
-                for js in self.js_code:
-                    self.driver.execute_script(js)
-                    WebDriverWait(self.driver, 10).until(
-                        lambda driver: driver.execute_script(
-                            "return document.readyState"
-                        )
-                        == "complete"
-                    )
-
-            # Optionally, wait for some condition after executing the JS code : Contributed by (https://github.com/jonymusky)
-            wait_for = kwargs.get("wait_for", False)
-            if wait_for:
-                if callable(wait_for):
-                    print("[LOG] 🔄 Waiting for condition...")
-                    WebDriverWait(self.driver, 20).until(wait_for)
-                else:
-                    print("[LOG] 🔄 Waiting for condition...")
-                    WebDriverWait(self.driver, 20).until(
-                        EC.presence_of_element_located((By.CSS_SELECTOR, wait_for))
-                    )
-
-            if not can_not_be_done_headless:
-                html = sanitize_input_encode(self.driver.page_source)
-            self.driver = self.execute_hook("before_return_html", self.driver, html)
-
+            
+            html = self.driver.page_source
+            
            # Store in cache
-            cache_file_path = os.path.join(
-                os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()),
-                ".crawl4ai",
-                "cache",
-                url_hash,
-            )
-            with open(cache_file_path, "w", encoding="utf-8") as f:
+            cache_file_path = os.path.join(Path.home(), ".crawl4ai", "cache", url.replace("/", "_"))
+            with open(cache_file_path, "w") as f:
                f.write(html)
-
+                
            if self.verbose:
                print(f"[LOG] ✅ Crawled {url} successfully!")
-
+            
            return html
-        except InvalidArgumentException as e:
-            if not hasattr(e, "msg"):
-                e.msg = sanitize_input_encode(str(e))
-            raise InvalidArgumentException(f"Failed to crawl {url}: {e.msg}")
-        except WebDriverException as e:
-            # If e does not have a msg attribute, create it and set it to str(e)
-            if not hasattr(e, "msg"):
-                e.msg = sanitize_input_encode(str(e))
-            raise WebDriverException(f"Failed to crawl {url}: {e.msg}")
+        except InvalidArgumentException:
+            raise InvalidArgumentException(f"Invalid URL {url}")
        except Exception as e:
-            if not hasattr(e, "msg"):
-                e.msg = sanitize_input_encode(str(e))
-            raise Exception(f"Failed to crawl {url}: {e.msg}")
-
-    def take_screenshot(self) -> str:
-        try:
-            # Get the dimensions of the page
-            total_width = self.driver.execute_script("return document.body.scrollWidth")
-            total_height = self.driver.execute_script(
-                "return document.body.scrollHeight"
-            )
-
-            # Set the window size to the dimensions of the page
-            self.driver.set_window_size(total_width, total_height)
-
-            # Take screenshot
-            screenshot = self.driver.get_screenshot_as_png()
-
-            # Open the screenshot with PIL
-            image = Image.open(BytesIO(screenshot))
-
-            # Convert image to RGB mode (this will handle both RGB and RGBA images)
-            rgb_image = image.convert("RGB")
-
-            # Convert to JPEG and compress
-            buffered = BytesIO()
-            rgb_image.save(buffered, format="JPEG", quality=85)
-            img_base64 = base64.b64encode(buffered.getvalue()).decode("utf-8")
-
-            if self.verbose:
-                print("[LOG] 📸 Screenshot taken and converted to base64")
-
-            return img_base64
-        except Exception as e:
-            error_message = sanitize_input_encode(
-                f"Failed to take screenshot: {str(e)}"
-            )
-            print(error_message)
-
-            # Generate an image with black background
-            img = Image.new("RGB", (800, 600), color="black")
-            draw = ImageDraw.Draw(img)
-
-            # Load a font
-            try:
-                font = ImageFont.truetype("arial.ttf", 40)
-            except IOError:
-                font = ImageFont.load_default()
-
-            # Define text color and wrap the text
-            text_color = (255, 255, 255)
-            max_width = 780
-            wrapped_text = wrap_text(draw, error_message, font, max_width)
-
-            # Calculate text position
-            text_position = (10, 10)
-
-            # Draw the text on the image
-            draw.text(text_position, wrapped_text, fill=text_color, font=font)
-
-            # Convert to base64
-            buffered = BytesIO()
-            img.save(buffered, format="JPEG")
-            img_base64 = base64.b64encode(buffered.getvalue()).decode("utf-8")
-
-            return img_base64
+            raise Exception(f"Failed to crawl {url}: {str(e)}")

    def quit(self):
-        self.driver.quit()
+        self.driver.quit()
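Note on the cache file naming touched above: one side of this diff names the cache file after the MD5 hash of the URL, the other uses `url.replace("/", "_")`. The following is a minimal sketch comparing the two schemes, not the project's code. The function names and the `base_dir` parameter are hypothetical and exist only for illustration.

```python
import hashlib
import os


def cache_path_by_replace(url: str, base_dir: str) -> str:
    """Name the cache file after the URL with slashes replaced (hypothetical helper)."""
    # Other characters such as ':' and '?' are kept as-is in the file name.
    return os.path.join(base_dir, ".crawl4ai", "cache", url.replace("/", "_"))


def cache_path_by_md5(url: str, base_dir: str) -> str:
    """Name the cache file after the MD5 hex digest of the URL (hypothetical helper)."""
    url_hash = hashlib.md5(url.encode()).hexdigest()
    return os.path.join(base_dir, ".crawl4ai", "cache", url_hash)


if __name__ == "__main__":
    example = "https://example.com/a"
    print(os.path.basename(cache_path_by_replace(example, "/tmp")))
    print(os.path.basename(cache_path_by_md5(example, "/tmp")))
```

The hash-based name has a fixed length and contains only hexadecimal characters, while the replace-based name stays human-readable but grows with the URL and keeps characters such as `:` that some filesystems reject.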
--- a/crawl4ai/database.py
+++ b/crawl4ai/database.py
@@ -1,67 +1,40 @@
 import os
 from pathlib import Path
 import sqlite3
+from typing import Optional
 from typing import Optional, Tuple

-DB_PATH = os.path.join(os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai")
+DB_PATH = os.path.join(Path.home(), ".crawl4ai")
 os.makedirs(DB_PATH, exist_ok=True)
 DB_PATH = os.path.join(DB_PATH, "crawl4ai.db")
-
-
+        
 def init_db():
    global DB_PATH
    conn = sqlite3.connect(DB_PATH)
    cursor = conn.cursor()
-    cursor.execute(
-        """
+    cursor.execute('''
        CREATE TABLE IF NOT EXISTS crawled_data (
            url TEXT PRIMARY KEY,
            html TEXT,
            cleaned_html TEXT,
            markdown TEXT,
            extracted_content TEXT,
-            success BOOLEAN,
-            media TEXT DEFAULT "{}",
-            links TEXT DEFAULT "{}",
-            metadata TEXT DEFAULT "{}",
-            screenshot TEXT DEFAULT ""
+            success BOOLEAN
        )
-    """
-    )
+    ''')
    conn.commit()
    conn.close()

-
-def alter_db_add_screenshot(new_column: str = "media"):
-    check_db_path()
-    try:
-        conn = sqlite3.connect(DB_PATH)
-        cursor = conn.cursor()
-        cursor.execute(
-            f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""'
-        )
-        conn.commit()
-        conn.close()
-    except Exception as e:
-        print(f"Error altering database to add screenshot column: {e}")
-
-
 def check_db_path():
    if not DB_PATH:
        raise ValueError("Database path is not set or is empty.")

-
-def get_cached_url(
-    url: str,
-) -> Optional[Tuple[str, str, str, str, str, str, str, bool, str]]:
+def get_cached_url(url: str) -> Optional[Tuple[str, str, str, str, str, bool]]:
    check_db_path()
    try:
        conn = sqlite3.connect(DB_PATH)
        cursor = conn.cursor()
-        cursor.execute(
-            "SELECT url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot FROM crawled_data WHERE url = ?",
-            (url,),
-        )
+        cursor.execute('SELECT url, html, cleaned_html, markdown, extracted_content, success FROM crawled_data WHERE url = ?', (url,))
        result = cursor.fetchone()
        conn.close()
        return result
@@ -69,63 +42,32 @@ def get_cached_url(
        print(f"Error retrieving cached URL: {e}")
        return None

-
-def cache_url(
-    url: str,
-    html: str,
-    cleaned_html: str,
-    markdown: str,
-    extracted_content: str,
-    success: bool,
-    media: str = "{}",
-    links: str = "{}",
-    metadata: str = "{}",
-    screenshot: str = "",
-):
+def cache_url(url: str, html: str, cleaned_html: str, markdown: str, extracted_content: str, success: bool):
    check_db_path()
    try:
        conn = sqlite3.connect(DB_PATH)
        cursor = conn.cursor()
-        cursor.execute(
-            """
-            INSERT INTO crawled_data (url, html, cleaned_html, markdown, extracted_content, success, media, links, metadata, screenshot)
-            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
+        cursor.execute('''
+            INSERT INTO crawled_data (url, html, cleaned_html, markdown, extracted_content, success)
+            VALUES (?, ?, ?, ?, ?, ?)
            ON CONFLICT(url) DO UPDATE SET
                html = excluded.html,
                cleaned_html = excluded.cleaned_html,
                markdown = excluded.markdown,
                extracted_content = excluded.extracted_content,
-                success = excluded.success,
-                media = excluded.media,      
-                links = excluded.links,    
-                metadata = excluded.metadata,      
-                screenshot = excluded.screenshot
-        """,
-            (
-                url,
-                html,
-                cleaned_html,
-                markdown,
-                extracted_content,
-                success,
-                media,
-                links,
-                metadata,
-                screenshot,
-            ),
-        )
+                success = excluded.success
+        ''', (url, html, cleaned_html, markdown, extracted_content, success))
        conn.commit()
        conn.close()
    except Exception as e:
        print(f"Error caching URL: {e}")

-
 def get_total_count() -> int:
    check_db_path()
    try:
        conn = sqlite3.connect(DB_PATH)
        cursor = conn.cursor()
-        cursor.execute("SELECT COUNT(*) FROM crawled_data")
+        cursor.execute('SELECT COUNT(*) FROM crawled_data')
        result = cursor.fetchone()
        conn.close()
        return result[0]
@@ -133,48 +75,24 @@ def get_total_count() -> int:
        print(f"Error getting total count: {e}")
        return 0

-
 def clear_db():
    check_db_path()
    try:
        conn = sqlite3.connect(DB_PATH)
        cursor = conn.cursor()
-        cursor.execute("DELETE FROM crawled_data")
+        cursor.execute('DELETE FROM crawled_data')
        conn.commit()
        conn.close()
    except Exception as e:
        print(f"Error clearing database: {e}")
-
-
+        
 def flush_db():
    check_db_path()
    try:
        conn = sqlite3.connect(DB_PATH)
        cursor = conn.cursor()
-        cursor.execute("DROP TABLE crawled_data")
+        cursor.execute('DROP TABLE crawled_data')
        conn.commit()
        conn.close()
    except Exception as e:
-        print(f"Error flushing database: {e}")
-
-
-def update_existing_records(new_column: str = "media", default_value: str = "{}"):
-    check_db_path()
-    try:
-        conn = sqlite3.connect(DB_PATH)
-        cursor = conn.cursor()
-        cursor.execute(
-            f'UPDATE crawled_data SET {new_column} = "{default_value}" WHERE screenshot IS NULL'
-        )
-        conn.commit()
-        conn.close()
-    except Exception as e:
-        print(f"Error updating existing records: {e}")
-
-
-if __name__ == "__main__":
-    # Delete the existing database file
-    if os.path.exists(DB_PATH):
-        os.remove(DB_PATH)
-    init_db()
-    # alter_db_add_screenshot("COL_NAME")
+        print(f"Error flushing database: {e}")
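Note on `cache_url` above: it relies on SQLite's `INSERT ... ON CONFLICT(url) DO UPDATE` upsert, so caching the same URL twice overwrites the row instead of failing on the primary key. The following is a minimal sketch of that behavior against an in-memory database, assuming an SQLite build that supports upsert (3.24 or later). The reduced three-column table and the helper names `make_conn` and `upsert` are hypothetical.

```python
import sqlite3


def make_conn() -> sqlite3.Connection:
    """Create an in-memory database with a reduced crawled_data table (illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS crawled_data ("
        "url TEXT PRIMARY KEY, html TEXT, success BOOLEAN)"
    )
    return conn


def upsert(conn: sqlite3.Connection, url: str, html: str, success: bool) -> None:
    """Insert a row, or update the existing row when the url is already cached."""
    conn.execute(
        """
        INSERT INTO crawled_data (url, html, success)
        VALUES (?, ?, ?)
        ON CONFLICT(url) DO UPDATE SET
            html = excluded.html,
            success = excluded.success
        """,
        (url, html, success),
    )
    conn.commit()


if __name__ == "__main__":
    conn = make_conn()
    upsert(conn, "https://example.com", "first", True)
    upsert(conn, "https://example.com", "second", True)
    print(conn.execute("SELECT COUNT(*), html FROM crawled_data").fetchone())
    conn.close()
```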
--- a/crawl4ai/deep_crawl/__init__.py
+++ b/crawl4ai/deep_crawl/__init__.py
@@ -1,29 +0,0 @@
-from .bfs_deep_crawl_strategy import BFSDeepCrawlStrategy
-from .filters import (
-    URLFilter,
-    FilterChain,
-    URLPatternFilter,
-    ContentTypeFilter,
-    DomainFilter,
-)
-from .scorers import (
-    KeywordRelevanceScorer,
-    PathDepthScorer,
-    FreshnessScorer,
-    CompositeScorer,
-)
-from .deep_crawl_strategty import DeepCrawlStrategy
-
-__all__ = [
-    "BFSDeepCrawlStrategy",
-    "FilterChain",
-    "URLFilter",
-    "URLPatternFilter",
-    "ContentTypeFilter",
-    "DomainFilter",
-    "KeywordRelevanceScorer",
-    "PathDepthScorer",
-    "FreshnessScorer",
-    "CompositeScorer",
-    "DeepCrawlStrategy",
-]
--- a/crawl4ai/deep_crawl/bfs_deep_crawl_strategy.py
+++ b/crawl4ai/deep_crawl/bfs_deep_crawl_strategy.py
@@ -1,193 +0,0 @@
-from typing import AsyncGenerator, Optional, Dict, Set, List
-from datetime import datetime
-import asyncio
-import logging
-from urllib.parse import urlparse
-from ..models import CrawlResult, TraversalStats
-from .filters import FilterChain
-from .scorers import URLScorer
-from .deep_crawl_strategty import DeepCrawlStrategy
-from ..config import DEEP_CRAWL_BATCH_SIZE
-
-
-class BFSDeepCrawlStrategy(DeepCrawlStrategy):
-    """Best-First Search traversal strategy with filtering and scoring."""
-
-    def __init__(
-        self,
-        max_depth: int,
-        filter_chain: FilterChain,
-        url_scorer: URLScorer,
-        process_external_links: bool = False,
-        logger: Optional[logging.Logger] = None,
-    ):
-        self.max_depth = max_depth
-        self.filter_chain = filter_chain
-        self.url_scorer = url_scorer
-        self.logger = logger or logging.getLogger(__name__)
-
-        # Crawl control
-        self.stats = TraversalStats(start_time=datetime.now())
-        self._cancel_event = asyncio.Event()
-        self.process_external_links = process_external_links
-
-    async def can_process_url(self, url: str, depth: int) -> bool:
-        """Check if URL can be processed based on filters
-        This is our gatekeeper method that determines if a URL should be processed. It:
-            - Validates URL format using a robust built-in method
-            - Applies custom filters from the filter chain
-            - Updates statistics for blocked URLs
-            - Returns False early if any check fails
-        """
-        try:
-            result = urlparse(url)
-            if not all([result.scheme, result.netloc]):
-                raise ValueError("Invalid URL")
-            if result.scheme not in ("http", "https"):
-                raise ValueError("URL must be HTTP or HTTPS")
-            if not result.netloc or "." not in result.netloc:
-                raise ValueError("Invalid domain")
-        except Exception as e:
-            self.logger.warning(f"Invalid URL: {url}. Error: {str(e)}")
-            return False
-
-        # Apply the filter chain unless this is the start page (depth 0)
-        if depth != 0 and not self.filter_chain.apply(url):
-            return False
-
-        return True
-
-    async def _process_links(
-        self,
-        result: CrawlResult,
-        source_url: str,
-        queue: asyncio.PriorityQueue,
-        visited: Set[str],
-        depths: Dict[str, int],
-    ) -> None:
-        """Process extracted links from crawl result.
-        This is our link processor that:
-            Checks depth limits
-            Handles both internal and external links
-            Checks if URL is visited already
-            Checks if URL can be processed - validates URL, applies Filters with can_process_url
-            Scores URLs for priority
-            Updates depth tracking dictionary
-            Adds valid URLs to the queue
-            Updates maximum depth statistics
-        """
-        next_depth = depths[source_url] + 1
-        # If depth limit reached, exit without processing links
-        if next_depth > self.max_depth:
-            return
-        links_to_process = result.links["internal"]
-        if self.process_external_links:
-            links_to_process += result.links["external"]
-        for link in links_to_process:
-            url = link["href"]
-            if url in visited:
-                continue
-            if not await self.can_process_url(url, next_depth):
-                self.stats.urls_skipped += 1
-                continue
-            score = self.url_scorer.score(url) if self.url_scorer else 0
-            await queue.put((score, next_depth, url, source_url))
-            depths[url] = next_depth
-            self.stats.total_depth_reached = max(
-                self.stats.total_depth_reached, next_depth
-            )
-
-    async def arun(
-        self,
-        start_url: str,
-        crawler: "AsyncWebCrawler",
-        crawler_run_config: Optional["CrawlerRunConfig"] = None,
-    ) -> AsyncGenerator[CrawlResult, None]:
-        """Implement BFS traversal strategy"""
-
-        # Initialize traversal state
-        """
-        queue: A priority queue where items are tuples of (score, depth, url, parent_url)
-            Score: Determines traversal priority (lower = higher priority)
-            Depth: Current distance from start_url
-            URL: The actual URL to crawl; parent_url is the page that linked to it
-        visited: Keeps track of URLs we've already seen to avoid cycles
-        depths: Maps URLs to their depths from the start URL
-        active_crawls: Tracks currently running crawl tasks
-        """
-        queue = asyncio.PriorityQueue()
-        await queue.put((0, 0, start_url, None))
-        visited: Set[str] = set()
-        depths = {start_url: 0}
-        active_crawls = {}  # Track URLs currently being processed with depth and score
-        active_crawls_lock = (
-            asyncio.Lock()
-        )  # Create the lock within the same event loop
-        try:
-            while (
-                not queue.empty() or active_crawls
-            ) and not self._cancel_event.is_set():
-                """
-                This sets up our main control loop which:
-                    - Continues while there are URLs to process (not queue.empty())
-                    - Or while there are active crawls still running (arun_many)
-                    - Can be interrupted via cancellation (not self._cancel_event.is_set())
-                """
-                # Collect batch of URLs into active_crawls to process
-                async with active_crawls_lock:
-                    while (
-                        len(active_crawls) < DEEP_CRAWL_BATCH_SIZE and not queue.empty()
-                    ):
-                        score, depth, url, parent_url = await queue.get()
-                        active_crawls[url] = {
-                            "depth": depth,
-                            "score": score,
-                            "parent_url": parent_url,
-                        }
-                        self.stats.current_depth = depth
-
-                if not active_crawls:
-                    # If no active crawls exist, wait a bit and continue
-                    await asyncio.sleep(0.1)
-                    continue
-                # Process batch
-                try:
-                    # Important: clear deep_crawl_strategy on the per-batch config so child crawls do not recurse.
-                    if crawler_run_config:
-                        crawler_run_config = crawler_run_config.clone(
-                            deep_crawl_strategy=None, stream=True
-                        )
-                    async for result in await crawler.arun_many(
-                        urls=list(active_crawls.keys()),
-                        config=crawler_run_config
-                    ):
-                        async with active_crawls_lock:
-                            crawl_info = active_crawls.pop(result.url, None)
-
-                        if crawl_info and result.success:
-                            await self._process_links(
-                                result, result.url, queue, visited, depths
-                            )
-                            result.depth = crawl_info["depth"]
-                            result.score = crawl_info["score"]
-                            result.parent_url = crawl_info["parent_url"]
-                            yield result
-                        else:
-                            self.logger.warning(
-                                f"Failed to crawl {result.url}: {result.error_message}"
-                            )
-                except Exception as e:
-                    self.logger.error(f"Batch processing error: {e}")
-                    # Continue processing other batches
-                    continue
-
-        except Exception as e:
-            self.logger.error(f"Error in crawl process: {e}")
-            raise
-
-        finally:
-            self.stats.end_time = datetime.now()
-
-    async def shutdown(self):
-        """Clean up resources and stop crawling"""
-        self._cancel_event.set()
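Note on the traversal order in `arun` above: the frontier is an `asyncio.PriorityQueue` holding `(score, depth, url, parent_url)` tuples, so the entry with the lowest score is dequeued first, with ties broken by depth and then by URL. The following is a minimal sketch of that ordering only, with no crawling or filtering. The function name `drain_in_priority_order` and the sample tuples are hypothetical.

```python
import asyncio
from typing import List, Tuple

# (score, depth, url, parent_url), mirroring the tuples put on the queue in the diff
QueueItem = Tuple[float, int, str, str]


async def drain_in_priority_order(items: List[QueueItem]) -> List[str]:
    """Put all items on a priority queue and return the URLs in dequeue order."""
    queue: asyncio.PriorityQueue = asyncio.PriorityQueue()
    for item in items:
        await queue.put(item)

    order: List[str] = []
    while not queue.empty():
        score, depth, url, parent_url = await queue.get()
        order.append(url)
    return order


if __name__ == "__main__":
    sample = [
        (2.0, 1, "https://example.com/b", "https://example.com/"),
        (0.0, 0, "https://example.com/", ""),
        (1.0, 1, "https://example.com/c", "https://example.com/"),
    ]
    print(asyncio.run(drain_in_priority_order(sample)))
```

Because tuples compare element by element, a scorer that returns larger values for more relevant URLs would need its score negated before enqueueing for those URLs to be crawled first.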
--- a/crawl4ai/deep_crawl/deep_crawl_strategty.py
+++ b/crawl4ai/deep_crawl/deep_crawl_strategty.py
@@ -1,30 +0,0 @@
-from abc import ABC, abstractmethod
-from typing import AsyncGenerator, Optional
-
-from ..models import CrawlResult
-
-
-class DeepCrawlStrategy(ABC):
-    @abstractmethod
-    async def arun(
-        self,
-        url: str,
-        crawler: "AsyncWebCrawler",
-        crawler_run_config: Optional["CrawlerRunConfig"] = None,
-    ) -> AsyncGenerator[CrawlResult, None]:
-        """Traverse the given URL using the specified crawler.
-
-        Args:
-            url (str): The starting URL for the traversal.
-            crawler (AsyncWebCrawler): The crawler instance to use for traversal.
-            crawler_run_config (CrawlerRunConfig, optional): The configuration for the crawler.
-
-        Returns:
-            AsyncGenerator[CrawlResult, None]: An async generator yielding crawl results.
-        """
-        pass
-
-    @abstractmethod
-    async def shutdown(self):
-        """Clean up resources used by the strategy"""
-        pass
--- a/crawl4ai/deep_crawl/filters.py
+++ b/crawl4ai/deep_crawl/filters.py
@@ -1,868 +0,0 @@
-from abc import ABC, abstractmethod
-from typing import List, Pattern, Set, Union, FrozenSet
-import re, time
-from urllib.parse import urlparse
-from array import array
-import logging
-from functools import lru_cache
-import fnmatch
-from dataclasses import dataclass
-from typing import ClassVar
-import weakref
-import mimetypes
-
-
-@dataclass
-class FilterStats:
-    # PERF: Using dataclass creates overhead with __init__ and property access
-    # PERF: Could use __slots__ to reduce memory footprint
-    # PERF: Consider using array.array('I') for atomic increments
-    total_urls: int = 0
-    rejected_urls: int = 0
-    passed_urls: int = 0
-
-
-class URLFilter(ABC):
-    # PERF: Logger creation is expensive, consider lazy initialization
-    # PERF: stats object creation adds overhead for each filter instance
-    def __init__(self, name: str = None):
-        self.name = name or self.__class__.__name__
-        self.stats = FilterStats()
-        self.logger = logging.getLogger(f"urlfilter.{self.name}")
-
-    @abstractmethod
-    def apply(self, url: str) -> bool:
-        pass
-
-    def _update_stats(self, passed: bool):
-        # PERF: Already optimized but could use bitwise operations
-        # PERF: Consider removing stats entirely in production/fast mode
-        self.stats.total_urls += 1
-        self.stats.passed_urls += passed
-        self.stats.rejected_urls += not passed
-
-
-class FilterChain:
-    # PERF: List traversal for each URL is expensive
-    # PERF: Could use array.array instead of list for filters
-    # PERF: Consider adding fast path for single filter case
-    def __init__(self, filters: List[URLFilter] = None):
-        self.filters = filters or []
-        self.stats = FilterStats()
-        self.logger = logging.getLogger("urlfilter.chain")
-
-    def apply(self, url: str) -> bool:
-        # PERF: Logging on every rejection is expensive
-        # PERF: Could reorder filters by rejection rate
-        # PERF: Consider batch processing mode
-        self.stats.total_urls += 1
-
-        for filter_ in self.filters:
-            if not filter_.apply(url):
-                self.stats.rejected_urls += 1
-                self.logger.debug(f"URL {url} rejected by {filter_.name}")
-                return False
-
-        self.stats.passed_urls += 1
-        return True
-
-
-class URLPatternFilter(URLFilter):
-    # PERF: Converting glob to regex is expensive
-    # PERF: Multiple regex compilation is slow
-    # PERF: List of patterns causes multiple regex evaluations
-    def __init__(
-        self,
-        patterns: Union[str, Pattern, List[Union[str, Pattern]]],
-        use_glob: bool = True,
-    ):
-        super().__init__()
-        self.patterns = [patterns] if isinstance(patterns, (str, Pattern)) else patterns
-        self.use_glob = use_glob
-        self._compiled_patterns = []
-
-        # PERF: This could be consolidated into a single regex with OR conditions
-        # PERF: glob_to_regex creates complex patterns, could be simplified
-        for pattern in self.patterns:
-            if isinstance(pattern, str) and use_glob:
-                self._compiled_patterns.append(self._glob_to_regex(pattern))
-            else:
-                self._compiled_patterns.append(
-                    re.compile(pattern) if isinstance(pattern, str) else pattern
-                )
-
-    def _glob_to_regex(self, pattern: str) -> Pattern:
-        # PERF: fnmatch.translate creates overly complex patterns
-        # PERF: Could cache common translations
-        return re.compile(fnmatch.translate(pattern))
-
-    def apply(self, url: str) -> bool:
-        # PERF: any() with generator is slower than direct loop with early return
-        # PERF: searching entire string is slower than anchored match
-        matches = any(pattern.search(url) for pattern in self._compiled_patterns)
-        self._update_stats(matches)
-        return matches
-
-
-class ContentTypeFilter(URLFilter):
-    # PERF: mimetypes guessing is extremely slow
-    # PERF: URL parsing on every check is expensive
-    # PERF: No caching of results for similar extensions
-    def __init__(
-        self, allowed_types: Union[str, List[str]], check_extension: bool = True
-    ):
-        super().__init__()
-        self.allowed_types = (
-            [allowed_types] if isinstance(allowed_types, str) else allowed_types
-        )
-        self.check_extension = check_extension
-        self._normalize_types()
-
-    def _normalize_types(self):
-        """Normalize content type strings"""
-        self.allowed_types = [t.lower() for t in self.allowed_types]
-
-    def _check_extension(self, url: str) -> bool:
-        # PERF: urlparse is called on every check
-        # PERF: multiple string splits are expensive
-        # PERF: mimetypes.guess_type is very slow
-        ext = (
-            urlparse(url).path.split(".")[-1].lower()
-            if "." in urlparse(url).path
-            else ""
-        )
-        if not ext:
-            return True
-
-        # PERF: guess_type is main bottleneck
-        guessed_type = mimetypes.guess_type(url)[0]
-        return any(
-            allowed in (guessed_type or "").lower() for allowed in self.allowed_types
-        )
-
-    def apply(self, url: str) -> bool:
-        """Check if URL's content type is allowed"""
-        result = True
-        if self.check_extension:
-            result = self._check_extension(url)
-        self._update_stats(result)
-        return result
-
-
-class DomainFilter(URLFilter):
-    # PERF: Set lookups are fast but string normalizations on init are not
-    # PERF: Creating two sets doubles memory usage
-    def __init__(
-        self,
-        allowed_domains: Union[str, List[str]] = None,
-        blocked_domains: Union[str, List[str]] = None,
-    ):
-        super().__init__()
-        # PERF: Normalizing domains on every init is wasteful
-        # PERF: Could use frozenset for immutable lists
-        self.allowed_domains = (
-            set(self._normalize_domains(allowed_domains)) if allowed_domains else None
-        )
-        self.blocked_domains = (
-            set(self._normalize_domains(blocked_domains)) if blocked_domains else set()
-        )
-
-    def _normalize_domains(self, domains: Union[str, List[str]]) -> List[str]:
-        # PERF: strip() and lower() create new strings for each domain
-        # PERF: List comprehension creates intermediate list
-        if isinstance(domains, str):
-            domains = [domains]
-        return [d.lower().strip() for d in domains]
-
-    def _extract_domain(self, url: str) -> str:
-        # PERF: urlparse is called for every URL check
-        # PERF: lower() creates new string every time
-        # PERF: Could cache recent results
-        return urlparse(url).netloc.lower()
-
-    def apply(self, url: str) -> bool:
-        # PERF: Two separate set lookups in worst case
-        # PERF: Domain extraction happens before knowing if we have any filters
-        domain = self._extract_domain(url)
-
-        if domain in self.blocked_domains:
-            self._update_stats(False)
-            return False
-
-        if self.allowed_domains is not None and domain not in self.allowed_domains:
-            self._update_stats(False)
-            return False
-
-        self._update_stats(True)
-        return True
-
-
-# Example usage:
-def create_common_filter_chain() -> FilterChain:
-    """Create a commonly used filter chain"""
-    return FilterChain(
-        [
-            URLPatternFilter(
-                [
-                    "*.html",
-                    "*.htm",  # HTML files
-                    "*/article/*",
-                    "*/blog/*",  # Common content paths
-                ]
-            ),
-            ContentTypeFilter(["text/html", "application/xhtml+xml"]),
-            DomainFilter(blocked_domains=["ads.*", "analytics.*"]),
-        ]
-    )
-
-
-####################################################################################
-# Unclecode: Optimized Version
-####################################################################################
-
-
-# Use __slots__ and array for maximum memory/speed efficiency
-class FastFilterStats:
-    __slots__ = ("_counters",)
-
-    def __init__(self):
-        # Compact array of unsigned ints; note that += on an element is not atomic
-        self._counters = array("I", [0, 0, 0])  # total, passed, rejected
-
-    @property
-    def total_urls(self):
-        return self._counters[0]
-
-    @property
-    def passed_urls(self):
-        return self._counters[1]
-
-    @property
-    def rejected_urls(self):
-        return self._counters[2]
-
-
-class FastURLFilter(ABC):
-    """Optimized base filter class"""
-
-    __slots__ = ("name", "stats", "_logger_ref")
-
-    def __init__(self, name: str = None):
-        self.name = name or self.__class__.__name__
-        self.stats = FastFilterStats()
-        # Lazy logger initialization using weakref
-        self._logger_ref = None
-
-    @property
-    def logger(self):
-        if self._logger_ref is None or self._logger_ref() is None:
-            logger = logging.getLogger(f"urlfilter.{self.name}")
-            self._logger_ref = weakref.ref(logger)
-        return self._logger_ref()
-
-    @abstractmethod
-    def apply(self, url: str) -> bool:
-        pass
-
-    def _update_stats(self, passed: bool):
-        # Use direct array index for speed
-        self.stats._counters[0] += 1  # total
-        self.stats._counters[1] += passed  # passed
-        self.stats._counters[2] += not passed  # rejected
-
-
-class FastFilterChain:
-    """Optimized filter chain"""
-
-    __slots__ = ("filters", "stats", "_logger_ref")
-
-    def __init__(self, filters: List[FastURLFilter] = None):
-        self.filters = tuple(filters or [])  # Immutable tuple for speed
-        self.stats = FastFilterStats()
-        self._logger_ref = None
-
-    @property
-    def logger(self):
-        if self._logger_ref is None or self._logger_ref() is None:
-            logger = logging.getLogger("urlfilter.chain")
-            self._logger_ref = weakref.ref(logger)
-        return self._logger_ref()
-
-    def add_filter(self, filter_: FastURLFilter) -> "FastFilterChain":
-        """Add a filter to the chain"""
-        self.filters = self.filters + (filter_,)  # filters is a tuple, so rebuild it
-        return self  # Enable method chaining
-
-    def apply(self, url: str) -> bool:
-        """Optimized apply with minimal operations"""
-        self.stats._counters[0] += 1  # total
-
-        # Direct tuple iteration is faster than list
-        for f in self.filters:
-            if not f.apply(url):
-                self.stats._counters[2] += 1  # rejected
-                return False
-
-        self.stats._counters[1] += 1  # passed
-        return True
-
-class FastURLPatternFilter(FastURLFilter):
-    """Pattern filter balancing speed and completeness"""
-    __slots__ = ('_simple_suffixes', '_simple_prefixes', '_domain_patterns', '_path_patterns')
-    
-    PATTERN_TYPES = {
-        'SUFFIX': 1,    # *.html
-        'PREFIX': 2,    # /foo/*
-        'DOMAIN': 3,    # *.example.com
-        'PATH': 4,      # Everything else
-        'REGEX': 5,     # Raw regex (starts with ^, ends with $, or contains \d)
-    }
-    
-    def __init__(self, patterns: Union[str, Pattern, List[Union[str, Pattern]]], use_glob: bool = True):
-        super().__init__()
-        patterns = [patterns] if isinstance(patterns, (str, Pattern)) else patterns
-        
-        self._simple_suffixes = set()
-        self._simple_prefixes = set()
-        self._domain_patterns = []
-        self._path_patterns = []
-        
-        for pattern in patterns:
-            pattern_type = self._categorize_pattern(pattern)
-            self._add_pattern(pattern, pattern_type)
-    
-    def _categorize_pattern(self, pattern: str) -> int:
-        """Categorize pattern for specialized handling"""
-        if not isinstance(pattern, str):
-            return self.PATTERN_TYPES['PATH']
-            
-        # Check if it's a regex pattern
-        if pattern.startswith('^') or pattern.endswith('$') or '\\d' in pattern:
-            return self.PATTERN_TYPES['REGEX']
-        
-        if pattern.count('*') == 1:
-            if pattern.startswith('*.'):
-                return self.PATTERN_TYPES['SUFFIX']
-            if pattern.endswith('/*'):
-                return self.PATTERN_TYPES['PREFIX']
-                
-        if '://' in pattern and pattern.startswith('*.'):
-            return self.PATTERN_TYPES['DOMAIN']
-            
-        return self.PATTERN_TYPES['PATH']
-    
-    def _add_pattern(self, pattern: str, pattern_type: int):
-        """Add pattern to appropriate matcher"""
-        if pattern_type == self.PATTERN_TYPES['REGEX']:
-            # For regex patterns, compile directly without glob translation
-            if isinstance(pattern, str) and (pattern.startswith('^') or pattern.endswith('$') or '\\d' in pattern):
-                self._path_patterns.append(re.compile(pattern))
-                return
-        elif pattern_type == self.PATTERN_TYPES['SUFFIX']:
-            self._simple_suffixes.add(pattern[2:])
-        elif pattern_type == self.PATTERN_TYPES['PREFIX']:
-            self._simple_prefixes.add(pattern[:-2])
-        elif pattern_type == self.PATTERN_TYPES['DOMAIN']:
-            self._domain_patterns.append(
-                re.compile(pattern.replace('*.', r'[^/]+\.'))
-            )
-        else:
-            if isinstance(pattern, str):
-                # Handle complex glob patterns
-                if '**' in pattern:
-                    pattern = pattern.replace('**', '.*')
-                if '{' in pattern:
-                    # Convert {a,b} to (a|b)
-                    pattern = re.sub(r'\{([^}]+)\}', 
-                                   lambda m: f'({"|".join(m.group(1).split(","))})',
-                                   pattern)
-                pattern = fnmatch.translate(pattern)
-            self._path_patterns.append(
-                pattern if isinstance(pattern, Pattern) else re.compile(pattern)
-            )
-
-    @lru_cache(maxsize=10000)
-    def apply(self, url: str) -> bool:
-        """Hierarchical pattern matching"""
-        # Quick suffix check (*.html)
-        if self._simple_suffixes:
-            path = url.split('?')[0]
-            if path.split('/')[-1].split('.')[-1] in self._simple_suffixes:
-                self._update_stats(True)
-                return True
-                
-        # Domain check
-        if self._domain_patterns:
-            for pattern in self._domain_patterns:
-                if pattern.match(url):
-                    self._update_stats(True)
-                    return True
-        
-        # Prefix check (/foo/*)
-        if self._simple_prefixes:
-            path = url.split('?')[0]
-            if any(path.startswith(p) for p in self._simple_prefixes):
-                self._update_stats(True)
-                return True
-                
-        # Complex patterns
-        if self._path_patterns:
-            if any(p.search(url) for p in self._path_patterns):
-                self._update_stats(True)
-                return True
-        
-        self._update_stats(False)
-        return False
-
-
-class FastContentTypeFilter(FastURLFilter):
-    """Optimized content type filter using fast lookups"""
-
-    __slots__ = ("allowed_types", "_ext_map", "_check_extension")
-
-    # Fast extension to mime type mapping
-    _MIME_MAP = {
-        # Text Formats
-        "txt": "text/plain",
-        "html": "text/html",
-        "htm": "text/html",
-        "xhtml": "application/xhtml+xml",
-        "css": "text/css",
-        "csv": "text/csv",
-        "ics": "text/calendar",
-        "js": "application/javascript",
-        # Images
-        "bmp": "image/bmp",
-        "gif": "image/gif",
-        "jpeg": "image/jpeg",
-        "jpg": "image/jpeg",
-        "png": "image/png",
-        "svg": "image/svg+xml",
-        "tiff": "image/tiff",
-        "ico": "image/x-icon",
-        "webp": "image/webp",
-        # Audio
-        "mp3": "audio/mpeg",
-        "wav": "audio/wav",
-        "ogg": "audio/ogg",
-        "m4a": "audio/mp4",
-        "aac": "audio/aac",
-        # Video
-        "mp4": "video/mp4",
-        "mpeg": "video/mpeg",
-        "webm": "video/webm",
-        "avi": "video/x-msvideo",
-        "mov": "video/quicktime",
-        "flv": "video/x-flv",
-        "wmv": "video/x-ms-wmv",
-        "mkv": "video/x-matroska",
-        # Applications
-        "json": "application/json",
-        "xml": "application/xml",
-        "pdf": "application/pdf",
-        "zip": "application/zip",
-        "gz": "application/gzip",
-        "tar": "application/x-tar",
-        "rar": "application/vnd.rar",
-        "7z": "application/x-7z-compressed",
-        "exe": "application/vnd.microsoft.portable-executable",
-        "msi": "application/x-msdownload",
-        # Fonts
-        "woff": "font/woff",
-        "woff2": "font/woff2",
-        "ttf": "font/ttf",
-        "otf": "font/otf",
-        # Microsoft Office
-        "doc": "application/msword",
-        "dot": "application/msword",
-        "docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
-        "xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
-        "xls": "application/vnd.ms-excel",
-        "ppt": "application/vnd.ms-powerpoint",
-        "pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
-        # OpenDocument Formats
-        "odt": "application/vnd.oasis.opendocument.text",
-        "ods": "application/vnd.oasis.opendocument.spreadsheet",
-        "odp": "application/vnd.oasis.opendocument.presentation",
-        # Archives
-        "tar.gz": "application/gzip",
-        "tgz": "application/gzip",
-        "bz2": "application/x-bzip2",
-        # Others
-        "rtf": "application/rtf",
-        "apk": "application/vnd.android.package-archive",
-        "epub": "application/epub+zip",
-        "jar": "application/java-archive",
-        "swf": "application/x-shockwave-flash",
-        "midi": "audio/midi",
-        "mid": "audio/midi",
-        "ps": "application/postscript",
-        "ai": "application/postscript",
-        "eps": "application/postscript",
-        # Custom or less common
-        "bin": "application/octet-stream",
-        "dmg": "application/x-apple-diskimage",
-        "iso": "application/x-iso9660-image",
-        "deb": "application/x-debian-package",
-        "rpm": "application/x-rpm",
-        "sqlite": "application/vnd.sqlite3",
-        # Placeholder
-        "unknown": "application/octet-stream",  # Fallback for unknown file types
-    }
-
-    @staticmethod
-    @lru_cache(maxsize=1000)
-    def _extract_extension(path: str) -> str:
-        """Fast extension extraction with caching"""
-        if "." not in path:
-            return ""
-        return path.rpartition(".")[-1].lower()
-
-    def __init__(
-        self, allowed_types: Union[str, List[str]], check_extension: bool = True
-    ):
-        super().__init__()
-        # Normalize and store as frozenset for fast lookup
-        self.allowed_types = frozenset(
-            t.lower()
-            for t in (
-                allowed_types if isinstance(allowed_types, list) else [allowed_types]
-            )
-        )
-        self._check_extension = check_extension
-
-        # Pre-compute extension map for allowed types
-        self._ext_map = frozenset(
-            ext
-            for ext, mime in self._MIME_MAP.items()
-            if any(allowed in mime for allowed in self.allowed_types)
-        )
-
-    @lru_cache(maxsize=1000)
-    def _check_url_cached(self, url: str) -> bool:
-        """Cached URL checking"""
-        if not self._check_extension:
-            return True
-
-        path = url.split("?")[0]  # Fast path split
-        ext = self._extract_extension(path)
-        if not ext:
-            return True
-
-        return ext in self._ext_map
-
-    def apply(self, url: str) -> bool:
-        """Fast extension check with caching"""
-        result = self._check_url_cached(url)
-        self._update_stats(result)
-        return result
-
-
-class FastDomainFilter(FastURLFilter):
-    """Optimized domain filter with fast lookups and caching"""
-
-    __slots__ = ("_allowed_domains", "_blocked_domains", "_domain_cache")
-
-    # Regex for fast domain extraction
-    _DOMAIN_REGEX = re.compile(r"://([^/]+)")
-
-    def __init__(
-        self,
-        allowed_domains: Union[str, List[str]] = None,
-        blocked_domains: Union[str, List[str]] = None,
-    ):
-        super().__init__()
-
-        # Convert inputs to frozensets for immutable, fast lookups
-        self._allowed_domains = (
-            frozenset(self._normalize_domains(allowed_domains))
-            if allowed_domains
-            else None
-        )
-        self._blocked_domains = (
-            frozenset(self._normalize_domains(blocked_domains))
-            if blocked_domains
-            else frozenset()
-        )
-
-    @staticmethod
-    def _normalize_domains(domains: Union[str, List[str]]) -> Set[str]:
-        """Fast domain normalization"""
-        if isinstance(domains, str):
-            return {domains.lower()}
-        return {d.lower() for d in domains}
-
-    @staticmethod
-    @lru_cache(maxsize=10000)
-    def _extract_domain(url: str) -> str:
-        """Ultra-fast domain extraction with regex and caching"""
-        match = FastDomainFilter._DOMAIN_REGEX.search(url)
-        return match.group(1).lower() if match else ""
-
-    def apply(self, url: str) -> bool:
-        """Optimized domain checking with early returns"""
-        # Skip processing if no filters
-        if not self._blocked_domains and self._allowed_domains is None:
-            self._update_stats(True)
-            return True
-
-        domain = self._extract_domain(url)
-
-        # Early return for blocked domains
-        if domain in self._blocked_domains:
-            self._update_stats(False)
-            return False
-
-        # If no allowed domains specified, accept all non-blocked
-        if self._allowed_domains is None:
-            self._update_stats(True)
-            return True
-
-        # Final allowed domains check
-        result = domain in self._allowed_domains
-        self._update_stats(result)
-        return result
-
-
-def create_fast_filter_chain() -> FastFilterChain:
-    """Create an optimized filter chain with filters ordered by rejection rate"""
-    return FastFilterChain(
-        [
-            # Domain filter first (fastest rejection)
-            FastDomainFilter(blocked_domains=["ads.*", "analytics.*"]),
-            # Content filter second (medium speed)
-            FastContentTypeFilter(["text/html", "application/xhtml+xml"]),
-            # Pattern filter last (most expensive)
-            FastURLPatternFilter(
-                [
-                    "*.html",
-                    "*.htm",
-                    "*/article/*",
-                    "*/blog/*",
-                ]
-            ),
-        ]
-    )
-
-
-def run_performance_test():
-    import time
-    import random
-    from itertools import cycle
-
-    # Generate test URLs
-    base_urls = [
-        "https://example.com/article/123",
-        "https://blog.example.com/post/456",
-        "https://ads.example.com/tracking",
-        "https://example.com/about.html",
-        "https://analytics.example.com/script.js",
-        "https://example.com/products.php",
-        "https://subdomain.example.com/blog/post-123",
-        "https://example.com/path/file.pdf",
-    ]
-
-    # Create more varied test data
-    test_urls = []
-    for base in base_urls:
-        # Add original
-        test_urls.append(base)
-        # Add variations
-        parts = base.split("/")
-        for i in range(10):
-            parts[-1] = f"page_{i}.html"
-            test_urls.append("/".join(parts))
-
-    # Multiply to get enough test data
-    test_urls = test_urls * 10000  # Creates 880,000 URLs (8 bases x 11 variants x 10,000)
-
-    def benchmark(name: str, func, *args, warmup=True):
-        if warmup:
-            # Warmup run
-            func(*args)
-
-        # Actual timing
-        start = time.perf_counter_ns()
-        result = func(*args)
-        elapsed = (time.perf_counter_ns() - start) / 1_000_000  # Convert to ms
-        print(
-            f"{name:<30} {elapsed:>8.3f} ms  ({len(result)/elapsed*1000:,.0f} URLs/sec)"
-        )
-        return result
-
-    print("\nBenchmarking original vs optimized implementations...")
-    print("-" * 70)
-
-    # Original implementation
-    pattern_filter = URLPatternFilter(["*.html", "*/article/*"])
-    content_filter = ContentTypeFilter(["text/html"])
-    domain_filter = DomainFilter(blocked_domains=["ads.*", "analytics.*"])
-    chain = FilterChain([pattern_filter, content_filter, domain_filter])
-
-    # Optimized implementation
-    fast_pattern_filter = FastURLPatternFilter(["*.html", "*/article/*"])
-    fast_content_filter = FastContentTypeFilter(["text/html"])
-    fast_domain_filter = FastDomainFilter(blocked_domains=["ads.*", "analytics.*"])
-    fast_chain = FastFilterChain(
-        [fast_domain_filter, fast_content_filter, fast_pattern_filter]
-    )
-
-    # Test individual filters
-    print("\nSingle filter performance (first 1000 URLs):")
-    test_subset = test_urls[:1000]
-
-    print("\nPattern Filters:")
-    benchmark(
-        "Original Pattern Filter",
-        lambda: [pattern_filter.apply(url) for url in test_subset],
-    )
-    benchmark(
-        "Optimized Pattern Filter",
-        lambda: [fast_pattern_filter.apply(url) for url in test_subset],
-    )
-
-    print("\nContent Filters:")
-    benchmark(
-        "Original Content Filter",
-        lambda: [content_filter.apply(url) for url in test_subset],
-    )
-    benchmark(
-        "Optimized Content Filter",
-        lambda: [fast_content_filter.apply(url) for url in test_subset],
-    )
-
-    print("\nDomain Filters:")
-    benchmark(
-        "Original Domain Filter",
-        lambda: [domain_filter.apply(url) for url in test_subset],
-    )
-    benchmark(
-        "Optimized Domain Filter",
-        lambda: [fast_domain_filter.apply(url) for url in test_subset],
-    )
-
-    print("\nFull Chain Performance (all URLs):")
-    # Test chain
-    benchmark("Original Chain", lambda: [chain.apply(url) for url in test_urls])
-    benchmark("Optimized Chain", lambda: [fast_chain.apply(url) for url in test_urls])
-
-    # Memory usage
-    import sys
-
-    print("\nMemory Usage per Filter:")
-    print(f"Original Pattern Filter: {sys.getsizeof(pattern_filter):,} bytes")
-    print(f"Optimized Pattern Filter: {sys.getsizeof(fast_pattern_filter):,} bytes")
-    print(f"Original Content Filter: {sys.getsizeof(content_filter):,} bytes")
-    print(f"Optimized Content Filter: {sys.getsizeof(fast_content_filter):,} bytes")
-    print(f"Original Domain Filter: {sys.getsizeof(domain_filter):,} bytes")
-    print(f"Optimized Domain Filter: {sys.getsizeof(fast_domain_filter):,} bytes")
-
-def test_pattern_filter():
-    import time
-    from itertools import chain
-
-    # Test cases as list of tuples instead of dict for multiple patterns
-    test_cases = [
-        # Simple suffix patterns (*.html)
-        ("*.html", {
-            "https://example.com/page.html": True,
-            "https://example.com/path/doc.html": True,
-            "https://example.com/page.htm": False,
-            "https://example.com/page.html?param=1": True,
-        }),
-        
-        # Path segment patterns (*/foo/*)
-        ("*/article/*", {
-            "https://example.com/article/123": True,
-            "https://example.com/blog/article/456": True,
-            "https://example.com/articles/789": False,
-            "https://example.com/article": False,
-        }),
-        
-        # Complex patterns
-        ("blog-*-[0-9]", {
-            "https://example.com/blog-post-1": True,
-            "https://example.com/blog-test-9": True,
-            "https://example.com/blog-post": False,
-            "https://example.com/blog-post-x": False,
-        }),
-        
-        # Multiple patterns case
-        (["*.pdf", "*/download/*"], {
-            "https://example.com/doc.pdf": True,
-            "https://example.com/download/file.txt": True,
-            "https://example.com/path/download/doc": True,
-            "https://example.com/uploads/file.txt": False,
-        }),
-        
-        # Edge cases
-        ("*", {
-            "https://example.com": True,
-            "": True,
-            "http://test.com/path": True,
-        }),
-        
-        # Complex regex
-        (r"^https?://.*\.example\.com/\d+", {
-            "https://sub.example.com/123": True,
-            "http://test.example.com/456": True,
-            "https://example.com/789": False,
-            "https://sub.example.com/abc": False,
-        })
-    ]
-
-    def run_accuracy_test():
-        print("\nAccuracy Tests:")
-        print("-" * 50)
-        
-        all_passed = True
-        for patterns, test_urls in test_cases:
-            filter_obj = FastURLPatternFilter(patterns)
-            
-            for url, expected in test_urls.items():
-                result = filter_obj.apply(url)
-                if result != expected:
-                    print(f"❌ Failed: Pattern '{patterns}' with URL '{url}'")
-                    print(f"   Expected: {expected}, Got: {result}")
-                    all_passed = False
-                else:
-                    print(f"✅ Passed: Pattern '{patterns}' with URL '{url}'")
-        
-        return all_passed
-
-    def run_speed_test():
-        print("\nSpeed Tests:")
-        print("-" * 50)
-        
-        # Create a large set of test URLs
-        all_urls = list(chain.from_iterable(urls.keys() for _, urls in test_cases))
-        test_urls = all_urls * 10000  # 100K+ URLs
-        
-        # Test both implementations
-        original = URLPatternFilter(["*.html", "*/article/*", "blog-*"])
-        optimized = FastURLPatternFilter(["*.html", "*/article/*", "blog-*"])
-        
-        def benchmark(name, filter_obj):
-            start = time.perf_counter()
-            for url in test_urls:
-                filter_obj.apply(url)
-            elapsed = time.perf_counter() - start
-            urls_per_sec = len(test_urls) / elapsed
-            print(f"{name:<20} {elapsed:.3f}s ({urls_per_sec:,.0f} URLs/sec)")
-        
-        benchmark("Original Filter:", original)
-        benchmark("Optimized Filter:", optimized)
-
-    # Run tests
-    print("Running Pattern Filter Tests...")
-    accuracy_passed = run_accuracy_test()
-    
-    if accuracy_passed:
-        print("\n✨ All accuracy tests passed!")
-        run_speed_test()
-    else:
-        print("\n❌ Some accuracy tests failed!")
-
-if __name__ == "__main__":
-    run_performance_test()
-    # test_pattern_filter()
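
One detail of the filters removed above is worth illustrating: both `DomainFilter` and `FastDomainFilter` test the extracted domain by exact set membership, so entries such as `"ads.*"` in the example chains would not match a host like `ads.example.com`. The sketch below shows glob-style domain blocking, assuming wildcard semantics are what was intended. `WildcardDomainFilter` is a hypothetical name and is not part of the original module.

```python
from fnmatch import fnmatchcase
from typing import Iterable
from urllib.parse import urlparse


class WildcardDomainFilter:  # hypothetical; not the original API
    def __init__(self, blocked_patterns: Iterable[str]):
        # Normalize once so matching can be case-sensitive and cheap.
        self._blocked = tuple(p.lower().strip() for p in blocked_patterns)

    def apply(self, url: str) -> bool:
        # urlparse(...).hostname is lowercased and excludes any port.
        host = urlparse(url).hostname or ""
        return not any(fnmatchcase(host, p) for p in self._blocked)


# Exact membership, as in the removed filters, treats "ads.*" literally.
exact_blocklist = {"ads.*", "analytics.*"}
print("ads.example.com" in exact_blocklist)  # False
```

Glob matching is linear in the number of patterns, so this trades the constant-time set lookup for wildcard support. Whether that trade is acceptable depends on how many patterns a crawl uses.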
--- a/crawl4ai/deep_crawl/scorers.py
+++ b/crawl4ai/deep_crawl/scorers.py
--- a/crawl4ai/docs_manager.py
+++ b/crawl4ai/docs_manager.py
@@ -1,75 +0,0 @@
-import requests
-import shutil
-from pathlib import Path
-from crawl4ai.async_logger import AsyncLogger
-from crawl4ai.llmtxt import AsyncLLMTextManager
-
-
-class DocsManager:
-    def __init__(self, logger=None):
-        self.docs_dir = Path.home() / ".crawl4ai" / "docs"
-        self.local_docs = Path(__file__).parent.parent / "docs" / "llm.txt"
-        self.docs_dir.mkdir(parents=True, exist_ok=True)
-        self.logger = logger or AsyncLogger(verbose=True)
-        self.llm_text = AsyncLLMTextManager(self.docs_dir, self.logger)
-
-    async def ensure_docs_exist(self):
-        """Fetch docs if not present"""
-        if not any(self.docs_dir.iterdir()):
-            await self.fetch_docs()
-
-    async def fetch_docs(self) -> bool:
-        """Copy from local docs or download from GitHub"""
-        try:
-            # Try local first
-            if self.local_docs.exists() and (
-                any(self.local_docs.glob("*.md"))
-                or any(self.local_docs.glob("*.tokens"))
-            ):
-                # Clear previously copied .md files from the user docs directory
-                for file_path in self.docs_dir.glob("*.md"):
-                    file_path.unlink()
-                # for file_path in self.docs_dir.glob("*.tokens"):
-                #     file_path.unlink()
-                for file_path in self.local_docs.glob("*.md"):
-                    shutil.copy2(file_path, self.docs_dir / file_path.name)
-                # for file_path in self.local_docs.glob("*.tokens"):
-                #     shutil.copy2(file_path, self.docs_dir / file_path.name)
-                return True
-
-            # Fallback to GitHub
-            response = requests.get(
-                "https://api.github.com/repos/unclecode/crawl4ai/contents/docs/llm.txt",
-                headers={"Accept": "application/vnd.github.v3+json"},
-            )
-            response.raise_for_status()
-
-            for item in response.json():
-                if item["type"] == "file" and item["name"].endswith(".md"):
-                    content = requests.get(item["download_url"]).text
-                    with open(self.docs_dir / item["name"], "w", encoding="utf-8") as f:
-                        f.write(content)
-            return True
-
-        except Exception as e:
-            self.logger.error(f"Failed to fetch docs: {str(e)}")
-            raise
-
-    def list(self) -> list[str]:
-        """List available topics"""
-        names = [file_path.stem for file_path in self.docs_dir.glob("*.md")]
-        # Remove [0-9]+_ prefix
-        names = [name.split("_", 1)[1] if name[0].isdigit() else name for name in names]
-        # Exclude names that end with .xs.md or .q.md
-        names = [
-            name
-            for name in names
-            if not name.endswith(".xs") and not name.endswith(".q")
-        ]
-        return names
-
-    def generate(self, sections, mode="extended"):
-        return self.llm_text.generate(sections, mode)
-
-    def search(self, query: str, top_k: int = 5):
-        return self.llm_text.search(query, top_k)
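Review note: the removed `DocsManager.list` derived topic names from the `.md` file names in the docs directory. The normalization can be sketched on its own as below. This is only an illustration under assumptions: `topic_names` and the sample file names are hypothetical, and the `"_" in name` guard is an addition that the removed code did not have (without it, a stem such as `2024` would raise `IndexError`).

```python
from pathlib import Path
from typing import Iterable, List


def topic_names(paths: Iterable[str]) -> List[str]:
    """Hypothetical helper mirroring the removed list() logic."""
    # "1_intro.xs.md" -> stem "1_intro.xs" (only the last suffix is dropped)
    names = [Path(p).stem for p in paths]
    # Drop a leading "<digits>_" prefix; the "_" check is an added safeguard
    names = [
        name.split("_", 1)[1] if name[0].isdigit() and "_" in name else name
        for name in names
    ]
    # Skip the condensed (.xs) and question (.q) variants
    return [name for name in names if not name.endswith((".xs", ".q"))]


sample = ["1_introduction.md", "2_config.xs.md", "3_cache.q.md", "faq.md"]
print(topic_names(sample))
```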
--- a/crawl4ai/extraction_strategy.py
+++ b/crawl4ai/extraction_strategy.py
--- a/crawl4ai/html2text/__init__.py
+++ b/crawl4ai/html2text/__init__.py
--- a/crawl4ai/html2text/__main__.py
+++ b/crawl4ai/html2text/__main__.py
@@ -1,3 +0,0 @@
-from .cli import main
-
-main()
--- a/crawl4ai/html2text/_typing.py
+++ b/crawl4ai/html2text/_typing.py
@@ -1,3 +0,0 @@
-class OutCallback:
-    def __call__(self, s: str) -> None:
-        ...
--- a/crawl4ai/html2text/cli.py
+++ b/crawl4ai/html2text/cli.py
@@ -1,330 +0,0 @@
-import argparse
-import sys
-
-from . import HTML2Text, __version__, config
-
-
-def main() -> None:
-    baseurl = ""
-
-    class bcolors:
-        HEADER = "\033[95m"
-        OKBLUE = "\033[94m"
-        OKGREEN = "\033[92m"
-        WARNING = "\033[93m"
-        FAIL = "\033[91m"
-        ENDC = "\033[0m"
-        BOLD = "\033[1m"
-        UNDERLINE = "\033[4m"
-
-    p = argparse.ArgumentParser()
-    p.add_argument(
-        "--default-image-alt",
-        dest="default_image_alt",
-        default=config.DEFAULT_IMAGE_ALT,
-        help="The default alt string for images with missing ones",
-    )
-    p.add_argument(
-        "--pad-tables",
-        dest="pad_tables",
-        action="store_true",
-        default=config.PAD_TABLES,
-        help="pad the cells to equal column width in tables",
-    )
-    p.add_argument(
-        "--no-wrap-links",
-        dest="wrap_links",
-        action="store_false",
-        default=config.WRAP_LINKS,
-        help="don't wrap links during conversion",
-    )
-    p.add_argument(
-        "--wrap-list-items",
-        dest="wrap_list_items",
-        action="store_true",
-        default=config.WRAP_LIST_ITEMS,
-        help="wrap list items during conversion",
-    )
-    p.add_argument(
-        "--wrap-tables",
-        dest="wrap_tables",
-        action="store_true",
-        default=config.WRAP_TABLES,
-        help="wrap tables",
-    )
-    p.add_argument(
-        "--ignore-emphasis",
-        dest="ignore_emphasis",
-        action="store_true",
-        default=config.IGNORE_EMPHASIS,
-        help="don't include any formatting for emphasis",
-    )
-    p.add_argument(
-        "--reference-links",
-        dest="inline_links",
-        action="store_false",
-        default=config.INLINE_LINKS,
-        help="use reference style links instead of inline links",
-    )
-    p.add_argument(
-        "--ignore-links",
-        dest="ignore_links",
-        action="store_true",
-        default=config.IGNORE_ANCHORS,
-        help="don't include any formatting for links",
-    )
-    p.add_argument(
-        "--ignore-mailto-links",
-        action="store_true",
-        dest="ignore_mailto_links",
-        default=config.IGNORE_MAILTO_LINKS,
-        help="don't include mailto: links",
-    )
-    p.add_argument(
-        "--protect-links",
-        dest="protect_links",
-        action="store_true",
-        default=config.PROTECT_LINKS,
-        help="protect links from line breaks surrounding them with angle brackets",
-    )
-    p.add_argument(
-        "--ignore-images",
-        dest="ignore_images",
-        action="store_true",
-        default=config.IGNORE_IMAGES,
-        help="don't include any formatting for images",
-    )
-    p.add_argument(
-        "--images-as-html",
-        dest="images_as_html",
-        action="store_true",
-        default=config.IMAGES_AS_HTML,
-        help=(
-            "Always write image tags as raw html; preserves `height`, `width` and "
-            "`alt` if possible."
-        ),
-    )
-    p.add_argument(
-        "--images-to-alt",
-        dest="images_to_alt",
-        action="store_true",
-        default=config.IMAGES_TO_ALT,
-        help="Discard image data, only keep alt text",
-    )
-    p.add_argument(
-        "--images-with-size",
-        dest="images_with_size",
-        action="store_true",
-        default=config.IMAGES_WITH_SIZE,
-        help=(
-            "Write image tags with height and width attrs as raw html to retain "
-            "dimensions"
-        ),
-    )
-    p.add_argument(
-        "-g",
-        "--google-doc",
-        action="store_true",
-        dest="google_doc",
-        default=False,
-        help="convert an html-exported Google Document",
-    )
-    p.add_argument(
-        "-d",
-        "--dash-unordered-list",
-        action="store_true",
-        dest="ul_style_dash",
-        default=False,
-        help="use a dash rather than a star for unordered list items",
-    )
-    p.add_argument(
-        "-e",
-        "--asterisk-emphasis",
-        action="store_true",
-        dest="em_style_asterisk",
-        default=False,
-        help="use an asterisk rather than an underscore for emphasized text",
-    )
-    p.add_argument(
-        "-b",
-        "--body-width",
-        dest="body_width",
-        type=int,
-        default=config.BODY_WIDTH,
-        help="number of characters per output line, 0 for no wrap",
-    )
-    p.add_argument(
-        "-i",
-        "--google-list-indent",
-        dest="list_indent",
-        type=int,
-        default=config.GOOGLE_LIST_INDENT,
-        help="number of pixels Google indents nested lists",
-    )
-    p.add_argument(
-        "-s",
-        "--hide-strikethrough",
-        action="store_true",
-        dest="hide_strikethrough",
-        default=False,
-        help="hide strike-through text. only relevant when -g is specified as well",
-    )
-    p.add_argument(
-        "--escape-all",
-        action="store_true",
-        dest="escape_snob",
-        default=False,
-        help=(
-            "Escape all special characters.  Output is less readable, but avoids "
-            "corner case formatting issues."
-        ),
-    )
-    p.add_argument(
-        "--bypass-tables",
-        action="store_true",
-        dest="bypass_tables",
-        default=config.BYPASS_TABLES,
-        help="Format tables in HTML rather than Markdown syntax.",
-    )
-    p.add_argument(
-        "--ignore-tables",
-        action="store_true",
-        dest="ignore_tables",
-        default=config.IGNORE_TABLES,
-        help="Ignore table-related tags (table, th, td, tr) while keeping rows.",
-    )
-    p.add_argument(
-        "--single-line-break",
-        action="store_true",
-        dest="single_line_break",
-        default=config.SINGLE_LINE_BREAK,
-        help=(
-            "Use a single line break after a block element rather than two line "
-            "breaks. NOTE: Requires --body-width=0"
-        ),
-    )
-    p.add_argument(
-        "--unicode-snob",
-        action="store_true",
-        dest="unicode_snob",
-        default=config.UNICODE_SNOB,
-        help="Use unicode throughout document",
-    )
-    p.add_argument(
-        "--no-automatic-links",
-        action="store_false",
-        dest="use_automatic_links",
-        default=config.USE_AUTOMATIC_LINKS,
-        help="Do not use automatic links wherever applicable",
-    )
-    p.add_argument(
-        "--no-skip-internal-links",
-        action="store_false",
-        dest="skip_internal_links",
-        default=config.SKIP_INTERNAL_LINKS,
-        help="Do not skip internal links",
-    )
-    p.add_argument(
-        "--links-after-para",
-        action="store_true",
-        dest="links_each_paragraph",
-        default=config.LINKS_EACH_PARAGRAPH,
-        help="Put links after each paragraph instead of document",
-    )
-    p.add_argument(
-        "--mark-code",
-        action="store_true",
-        dest="mark_code",
-        default=config.MARK_CODE,
-        help="Mark program code blocks with [code]...[/code]",
-    )
-    p.add_argument(
-        "--decode-errors",
-        dest="decode_errors",
-        default=config.DECODE_ERRORS,
-        help=(
-            "What to do in case of decode errors. 'ignore', 'strict' and 'replace' are "
-            "acceptable values"
-        ),
-    )
-    p.add_argument(
-        "--open-quote",
-        dest="open_quote",
-        default=config.OPEN_QUOTE,
-        help="The character used to open quotes",
-    )
-    p.add_argument(
-        "--close-quote",
-        dest="close_quote",
-        default=config.CLOSE_QUOTE,
-        help="The character used to close quotes",
-    )
-    p.add_argument(
-        "--version", action="version", version=".".join(map(str, __version__))
-    )
-    p.add_argument("filename", nargs="?")
-    p.add_argument("encoding", nargs="?", default="utf-8")
-    p.add_argument(
-        "--include-sup-sub",
-        dest="include_sup_sub",
-        action="store_true",
-        default=config.INCLUDE_SUP_SUB,
-        help="Include the sup and sub tags",
-    )
-    args = p.parse_args()
-
-    if args.filename and args.filename != "-":
-        with open(args.filename, "rb") as fp:
-            data = fp.read()
-    else:
-        data = sys.stdin.buffer.read()
-
-    try:
-        html = data.decode(args.encoding, args.decode_errors)
-    except UnicodeDecodeError as err:
-        warning = bcolors.WARNING + "Warning:" + bcolors.ENDC
-        warning += " Use the " + bcolors.OKGREEN
-        warning += "--decode-errors=ignore" + bcolors.ENDC + " flag."
-        print(warning)
-        raise err
-
-    h = HTML2Text(baseurl=baseurl)
-    # handle options
-    if args.ul_style_dash:
-        h.ul_item_mark = "-"
-    if args.em_style_asterisk:
-        h.emphasis_mark = "*"
-        h.strong_mark = "__"
-
-    h.body_width = args.body_width
-    h.google_list_indent = args.list_indent
-    h.ignore_emphasis = args.ignore_emphasis
-    h.ignore_links = args.ignore_links
-    h.ignore_mailto_links = args.ignore_mailto_links
-    h.protect_links = args.protect_links
-    h.ignore_images = args.ignore_images
-    h.images_as_html = args.images_as_html
-    h.images_to_alt = args.images_to_alt
-    h.images_with_size = args.images_with_size
-    h.google_doc = args.google_doc
-    h.hide_strikethrough = args.hide_strikethrough
-    h.escape_snob = args.escape_snob
-    h.bypass_tables = args.bypass_tables
-    h.ignore_tables = args.ignore_tables
-    h.single_line_break = args.single_line_break
-    h.inline_links = args.inline_links
-    h.unicode_snob = args.unicode_snob
-    h.use_automatic_links = args.use_automatic_links
-    h.skip_internal_links = args.skip_internal_links
-    h.links_each_paragraph = args.links_each_paragraph
-    h.mark_code = args.mark_code
-    h.wrap_links = args.wrap_links
-    h.wrap_list_items = args.wrap_list_items
-    h.wrap_tables = args.wrap_tables
-    h.pad_tables = args.pad_tables
-    h.default_image_alt = args.default_image_alt
-    h.open_quote = args.open_quote
-    h.close_quote = args.close_quote
-    h.include_sup_sub = args.include_sup_sub
-
-    sys.stdout.write(h.handle(html))
--- a/crawl4ai/html2text/config.py
+++ b/crawl4ai/html2text/config.py
@@ -1,172 +0,0 @@
-import re
-
-# Use Unicode characters instead of their ascii pseudo-replacements
-UNICODE_SNOB = False
-
-# Marker to use for marking tables for padding post processing
-TABLE_MARKER_FOR_PAD = "special_marker_for_table_padding"
-# Escape all special characters.  Output is less readable, but avoids
-# corner case formatting issues.
-ESCAPE_SNOB = False
-ESCAPE_BACKSLASH = False
-ESCAPE_DOT = False
-ESCAPE_PLUS = False
-ESCAPE_DASH = False
-
-# Put the links after each paragraph instead of at the end.
-LINKS_EACH_PARAGRAPH = False
-
-# Wrap long lines at position. 0 for no wrapping.
-BODY_WIDTH = 78
-
-# Don't show internal links (href="#local-anchor") -- corresponding link
-# targets won't be visible in the plain text file anyway.
-SKIP_INTERNAL_LINKS = True
-
-# Use inline, rather than reference, formatting for images and links
-INLINE_LINKS = True
-
-# Protect links from line breaks surrounding them with angle brackets (in
-# addition to their square brackets)
-PROTECT_LINKS = False
-# WRAP_LINKS = True
-WRAP_LINKS = True
-
-# Wrap list items.
-WRAP_LIST_ITEMS = False
-
-# Wrap tables
-WRAP_TABLES = False
-
-# Number of pixels Google indents nested lists
-GOOGLE_LIST_INDENT = 36
-
-# Values Google and others may use to indicate bold text
-BOLD_TEXT_STYLE_VALUES = ("bold", "700", "800", "900")
-
-IGNORE_ANCHORS = False
-IGNORE_MAILTO_LINKS = False
-IGNORE_IMAGES = False
-IMAGES_AS_HTML = False
-IMAGES_TO_ALT = False
-IMAGES_WITH_SIZE = False
-IGNORE_EMPHASIS = False
-MARK_CODE = False
-DECODE_ERRORS = "strict"
-DEFAULT_IMAGE_ALT = ""
-PAD_TABLES = False
-
-# Convert links with same href and text to <href> format
-# if they are absolute links
-USE_AUTOMATIC_LINKS = True
-
-# For checking space-only lines on line 771
-RE_SPACE = re.compile(r"\s\+")
-
-RE_ORDERED_LIST_MATCHER = re.compile(r"\d+\.\s")
-RE_UNORDERED_LIST_MATCHER = re.compile(r"[-\*\+]\s")
-RE_MD_CHARS_MATCHER = re.compile(r"([\\\[\]\(\)])")
-RE_MD_CHARS_MATCHER_ALL = re.compile(r"([`\*_{}\[\]\(\)#!])")
-
-# to find links in the text
-RE_LINK = re.compile(r"(\[.*?\] ?\(.*?\))|(\[.*?\]:.*?)")
-
-# to find table separators
-RE_TABLE = re.compile(r" \| ")
-
-RE_MD_DOT_MATCHER = re.compile(
-    r"""
-    ^             # start of line
-    (\s*\d+)      # optional whitespace and a number
-    (\.)          # dot
-    (?=\s)        # lookahead assert whitespace
-    """,
-    re.MULTILINE | re.VERBOSE,
-)
-RE_MD_PLUS_MATCHER = re.compile(
-    r"""
-    ^
-    (\s*)
-    (\+)
-    (?=\s)
-    """,
-    flags=re.MULTILINE | re.VERBOSE,
-)
-RE_MD_DASH_MATCHER = re.compile(
-    r"""
-    ^
-    (\s*)
-    (-)
-    (?=\s|\-)     # followed by whitespace (bullet list, or spaced out hr)
-                  # or another dash (header or hr)
-    """,
-    flags=re.MULTILINE | re.VERBOSE,
-)
-RE_SLASH_CHARS = r"\`*_{}[]()#+-.!"
-RE_MD_BACKSLASH_MATCHER = re.compile(
-    r"""
-    (\\)          # match one backslash
-    (?=[%s])      # followed by a char that requires escaping
-    """
-    % re.escape(RE_SLASH_CHARS),
-    flags=re.VERBOSE,
-)
-
-UNIFIABLE = {
-    "rsquo": "'",
-    "lsquo": "'",
-    "rdquo": '"',
-    "ldquo": '"',
-    "copy": "(C)",
-    "mdash": "--",
-    "nbsp": " ",
-    "rarr": "->",
-    "larr": "<-",
-    "middot": "*",
-    "ndash": "-",
-    "oelig": "oe",
-    "aelig": "ae",
-    "agrave": "a",
-    "aacute": "a",
-    "acirc": "a",
-    "atilde": "a",
-    "auml": "a",
-    "aring": "a",
-    "egrave": "e",
-    "eacute": "e",
-    "ecirc": "e",
-    "euml": "e",
-    "igrave": "i",
-    "iacute": "i",
-    "icirc": "i",
-    "iuml": "i",
-    "ograve": "o",
-    "oacute": "o",
-    "ocirc": "o",
-    "otilde": "o",
-    "ouml": "o",
-    "ugrave": "u",
-    "uacute": "u",
-    "ucirc": "u",
-    "uuml": "u",
-    "lrm": "",
-    "rlm": "",
-}
-
-# Format tables in HTML rather than Markdown syntax
-BYPASS_TABLES = False
-# Ignore table-related tags (table, th, td, tr) while keeping rows
-IGNORE_TABLES = False
-
-
-# Use a single line break after a block element rather than two line breaks.
-# NOTE: Requires body width setting to be 0.
-SINGLE_LINE_BREAK = False
-
-
-# Use double quotation marks when converting the <q> tag.
-OPEN_QUOTE = '"'
-CLOSE_QUOTE = '"'
-
-# Include the <sup> and <sub> tags
-INCLUDE_SUP_SUB = False
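Review note: `RE_SPACE` in the removed config is compiled from `r"\s\+"`, which matches one whitespace character followed by a literal `+`, not a run of whitespace. The comment above it mentions space-only lines, so the pattern may not do what was intended; that reading is an assumption. The short check below shows the difference. `RE_WHITESPACE_RUN` is a hypothetical name used only for comparison and does not exist in the package.

```python
import re

# Pattern exactly as written in the removed config.py
RE_SPACE = re.compile(r"\s\+")
# Hypothetical alternative for comparison only: one or more whitespace chars
RE_WHITESPACE_RUN = re.compile(r"\s+")

print(bool(RE_SPACE.match(" +")))            # whitespace, then a literal plus
print(bool(RE_SPACE.match("   ")))           # whitespace only
print(bool(RE_WHITESPACE_RUN.match("   ")))  # whitespace only
```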
--- a/crawl4ai/html2text/elements.py
+++ b/crawl4ai/html2text/elements.py
@@ -1,18 +0,0 @@
-from typing import Dict, Optional
-
-
-class AnchorElement:
-    __slots__ = ["attrs", "count", "outcount"]
-
-    def __init__(self, attrs: Dict[str, Optional[str]], count: int, outcount: int):
-        self.attrs = attrs
-        self.count = count
-        self.outcount = outcount
-
-
-class ListElement:
-    __slots__ = ["name", "num"]
-
-    def __init__(self, name: str, num: int):
-        self.name = name
-        self.num = num
--- a/crawl4ai/html2text/utils.py
+++ b/crawl4ai/html2text/utils.py
@@ -1,304 +0,0 @@
-import html.entities
-from typing import Dict, List, Optional
-
-from . import config
-
-unifiable_n = {
-    html.entities.name2codepoint[k]: v
-    for k, v in config.UNIFIABLE.items()
-    if k != "nbsp"
-}
-
-
-def hn(tag: str) -> int:
-    if tag[0] == "h" and len(tag) == 2:
-        n = tag[1]
-        if "0" < n <= "9":
-            return int(n)
-    return 0
-
-
-def dumb_property_dict(style: str) -> Dict[str, str]:
-    """
-    :returns: A hash of css attributes
-    """
-    return {
-        x.strip().lower(): y.strip().lower()
-        for x, y in [z.split(":", 1) for z in style.split(";") if ":" in z]
-    }
-
-
-def dumb_css_parser(data: str) -> Dict[str, Dict[str, str]]:
-    """
-    :type data: str
-
-    :returns: A hash of css selectors, each of which contains a hash of
-    css attributes.
-    :rtype: dict
-    """
-    # remove @import sentences
-    data += ";"
-    importIndex = data.find("@import")
-    while importIndex != -1:
-        data = data[0:importIndex] + data[data.find(";", importIndex) + 1 :]
-        importIndex = data.find("@import")
-
-    # parse the css. reverted from dictionary comprehension in order to
-    # support older pythons
-    pairs = [x.split("{") for x in data.split("}") if "{" in x.strip()]
-    try:
-        elements = {a.strip(): dumb_property_dict(b) for a, b in pairs}
-    except ValueError:
-        elements = {}  # not that important
-
-    return elements
-
-
-def element_style(
-    attrs: Dict[str, Optional[str]],
-    style_def: Dict[str, Dict[str, str]],
-    parent_style: Dict[str, str],
-) -> Dict[str, str]:
-    """
-    :type attrs: dict
-    :type style_def: dict
-    :type parent_style: dict
-
-    :returns: A hash of the 'final' style attributes of the element
-    :rtype: dict
-    """
-    style = parent_style.copy()
-    if "class" in attrs:
-        assert attrs["class"] is not None
-        for css_class in attrs["class"].split():
-            css_style = style_def.get("." + css_class, {})
-            style.update(css_style)
-    if "style" in attrs:
-        assert attrs["style"] is not None
-        immediate_style = dumb_property_dict(attrs["style"])
-        style.update(immediate_style)
-
-    return style
-
-
-def google_list_style(style: Dict[str, str]) -> str:
-    """
-    Finds out whether this is an ordered or unordered list
-
-    :type style: dict
-
-    :rtype: str
-    """
-    if "list-style-type" in style:
-        list_style = style["list-style-type"]
-        if list_style in ["disc", "circle", "square", "none"]:
-            return "ul"
-
-    return "ol"
-
-
-def google_has_height(style: Dict[str, str]) -> bool:
-    """
-    Check if the style of the element has the 'height' attribute
-    explicitly defined
-
-    :type style: dict
-
-    :rtype: bool
-    """
-    return "height" in style
-
-
-def google_text_emphasis(style: Dict[str, str]) -> List[str]:
-    """
-    :type style: dict
-
-    :returns: A list of all emphasis modifiers of the element
-    :rtype: list
-    """
-    emphasis = []
-    if "text-decoration" in style:
-        emphasis.append(style["text-decoration"])
-    if "font-style" in style:
-        emphasis.append(style["font-style"])
-    if "font-weight" in style:
-        emphasis.append(style["font-weight"])
-
-    return emphasis
-
-
-def google_fixed_width_font(style: Dict[str, str]) -> bool:
-    """
-    Check if the css of the current element defines a fixed width font
-
-    :type style: dict
-
-    :rtype: bool
-    """
-    font_family = ""
-    if "font-family" in style:
-        font_family = style["font-family"]
-    return "courier new" == font_family or "consolas" == font_family
-
-
-def list_numbering_start(attrs: Dict[str, Optional[str]]) -> int:
-    """
-    Extract numbering from list element attributes
-
-    :type attrs: dict
-
-    :rtype: int
-    """
-    if "start" in attrs:
-        assert attrs["start"] is not None
-        try:
-            return int(attrs["start"]) - 1
-        except ValueError:
-            pass
-
-    return 0
-
-
-def skipwrap(
-    para: str, wrap_links: bool, wrap_list_items: bool, wrap_tables: bool
-) -> bool:
-    # If it appears to contain a link
-    # don't wrap
-    if not wrap_links and config.RE_LINK.search(para):
-        return True
-    # If the text begins with four spaces or one tab, it's a code block;
-    # don't wrap
-    if para[0:4] == "    " or para[0] == "\t":
-        return True
-
-    # If the text begins with only two "--", possibly preceded by
-    # whitespace, that's an emdash; so wrap.
-    stripped = para.lstrip()
-    if stripped[0:2] == "--" and len(stripped) > 2 and stripped[2] != "-":
-        return False
-
-    # I'm not sure what this is for; I thought it was to detect lists,
-    # but there's a <br>-inside-<span> case in one of the tests that
-    # also depends upon it.
-    if stripped[0:1] in ("-", "*") and not stripped[0:2] == "**":
-        return not wrap_list_items
-
-    # If text contains a pipe character it is likely a table
-    if not wrap_tables and config.RE_TABLE.search(para):
-        return True
-
-    # If the text begins with a single -, *, or +, followed by a space,
-    # or an integer, followed by a ., followed by a space (in either
-    # case optionally preceded by whitespace), it's a list; don't wrap.
-    return bool(
-        config.RE_ORDERED_LIST_MATCHER.match(stripped)
-        or config.RE_UNORDERED_LIST_MATCHER.match(stripped)
-    )
-
-
-def escape_md(text: str) -> str:
-    """
-    Escapes markdown-sensitive characters within other markdown
-    constructs.
-    """
-    return config.RE_MD_CHARS_MATCHER.sub(r"\\\1", text)
-
-
-def escape_md_section(
-    text: str,
-    escape_backslash: bool = True,
-    snob: bool = False,
-    escape_dot: bool = True,
-    escape_plus: bool = True,
-    escape_dash: bool = True,
-) -> str:
-    """
-    Escapes markdown-sensitive characters across whole document sections.
-    Each escaping operation can be controlled individually.
-    """
-    if escape_backslash:
-        text = config.RE_MD_BACKSLASH_MATCHER.sub(r"\\\1", text)
-
-    if snob:
-        text = config.RE_MD_CHARS_MATCHER_ALL.sub(r"\\\1", text)
-
-    if escape_dot:
-        text = config.RE_MD_DOT_MATCHER.sub(r"\1\\\2", text)
-
-    if escape_plus:
-        text = config.RE_MD_PLUS_MATCHER.sub(r"\1\\\2", text)
-
-    if escape_dash:
-        text = config.RE_MD_DASH_MATCHER.sub(r"\1\\\2", text)
-
-    return text
-
-
-def reformat_table(lines: List[str], right_margin: int) -> List[str]:
-    """
-    Given the lines of a table,
-    pads the cells and returns the new lines
-    """
-    # find the maximum width of the columns
-    max_width = [len(x.rstrip()) + right_margin for x in lines[0].split("|")]
-    max_cols = len(max_width)
-    for line in lines:
-        cols = [x.rstrip() for x in line.split("|")]
-        num_cols = len(cols)
-
-        # don't drop any data if colspan attributes result in unequal lengths
-        if num_cols < max_cols:
-            cols += [""] * (max_cols - num_cols)
-        elif max_cols < num_cols:
-            max_width += [len(x) + right_margin for x in cols[-(num_cols - max_cols) :]]
-            max_cols = num_cols
-
-        max_width = [
-            max(len(x) + right_margin, old_len) for x, old_len in zip(cols, max_width)
-        ]
-
-    # reformat
-    new_lines = []
-    for line in lines:
-        cols = [x.rstrip() for x in line.split("|")]
-        if set(line.strip()) == set("-|"):
-            filler = "-"
-            new_cols = [
-                x.rstrip() + (filler * (M - len(x.rstrip())))
-                for x, M in zip(cols, max_width)
-            ]
-            new_lines.append("|-" + "|".join(new_cols) + "|")
-        else:
-            filler = " "
-            new_cols = [
-                x.rstrip() + (filler * (M - len(x.rstrip())))
-                for x, M in zip(cols, max_width)
-            ]
-            new_lines.append("| " + "|".join(new_cols) + "|")
-    return new_lines
-
-
-def pad_tables_in_text(text: str, right_margin: int = 1) -> str:
-    """
-    Provide padding for tables in the text
-    """
-    lines = text.split("\n")
-    table_buffer = []  # type: List[str]
-    table_started = False
-    new_lines = []
-    for line in lines:
-        # Toggle table started
-        if config.TABLE_MARKER_FOR_PAD in line:
-            table_started = not table_started
-            if not table_started:
-                table = reformat_table(table_buffer, right_margin)
-                new_lines.extend(table)
-                table_buffer = []
-                new_lines.append("")
-            continue
-        # Process lines
-        if table_started:
-            table_buffer.append(line)
-        else:
-            new_lines.append(line)
-    return "\n".join(new_lines)
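Review note: several of the removed helpers (`element_style`, `google_text_emphasis`, `google_fixed_width_font`) rely on `dumb_property_dict` to turn an inline `style` attribute into a lowercase dictionary. A minimal standalone sketch of that parsing step follows; `parse_inline_style` is a hypothetical name chosen for this note, and the sample style string is made up.

```python
from typing import Dict


def parse_inline_style(style: str) -> Dict[str, str]:
    """Hypothetical stand-in for the removed dumb_property_dict."""
    return {
        name.strip().lower(): value.strip().lower()
        # Split on the first ":" only, so values containing ":" stay intact
        for name, value in (
            decl.split(":", 1) for decl in style.split(";") if ":" in decl
        )
    }


style = parse_inline_style("Font-Weight: 700; font-family: Courier New;")
print(style)
# A bold check in the spirit of BOLD_TEXT_STYLE_VALUES from the removed config
print(style.get("font-weight") in ("bold", "700", "800", "900"))
```

Declarations without a colon, including the empty fragment after a trailing semicolon, are skipped rather than raising.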
--- a/crawl4ai/install.py
+++ b/crawl4ai/install.py
@@ -1,109 +0,0 @@
-import subprocess
-import sys
-import asyncio
-from .async_logger import AsyncLogger, LogLevel
-
-# Initialize logger
-logger = AsyncLogger(log_level=LogLevel.DEBUG, verbose=True)
-
-
-def post_install():
-    """Run all post-installation tasks"""
-    logger.info("Running post-installation setup...", tag="INIT")
-    install_playwright()
-    run_migration()
-    logger.success("Post-installation setup completed!", tag="COMPLETE")
-
-
-def install_playwright():
-    logger.info("Installing Playwright browsers...", tag="INIT")
-    try:
-        # subprocess.check_call([sys.executable, "-m", "playwright", "install", "--with-deps", "--force", "chrome"])
-        subprocess.check_call(
-            [
-                sys.executable,
-                "-m",
-                "playwright",
-                "install",
-                "--with-deps",
-                "--force",
-                "chromium",
-            ]
-        )
-        logger.success(
-            "Playwright installation completed successfully.", tag="COMPLETE"
-        )
-    except subprocess.CalledProcessError:
-        # logger.error(f"Error during Playwright installation: {e}", tag="ERROR")
-        logger.warning(
-            f"Please run '{sys.executable} -m playwright install --with-deps' manually after the installation."
-        )
-    except Exception:
-        # logger.error(f"Unexpected error during Playwright installation: {e}", tag="ERROR")
-        logger.warning(
-            f"Please run '{sys.executable} -m playwright install --with-deps' manually after the installation."
-        )
-
-
-def run_migration():
-    """Initialize database during installation"""
-    try:
-        logger.info("Starting database initialization...", tag="INIT")
-        from crawl4ai.async_database import async_db_manager
-
-        asyncio.run(async_db_manager.initialize())
-        logger.success(
-            "Database initialization completed successfully.", tag="COMPLETE"
-        )
-    except ImportError:
-        logger.warning("Database module not found. Will initialize on first use.")
-    except Exception as e:
-        logger.warning(f"Database initialization failed: {e}")
-        logger.warning("Database will be initialized on first use")
-
-
-async def run_doctor():
-    """Test if Crawl4AI is working properly"""
-    logger.info("Running Crawl4AI health check...", tag="INIT")
-    try:
-        from .async_webcrawler import (
-            AsyncWebCrawler,
-            BrowserConfig,
-            CrawlerRunConfig,
-            CacheMode,
-        )
-
-        browser_config = BrowserConfig(
-            headless=True,
-            browser_type="chromium",
-            ignore_https_errors=True,
-            light_mode=True,
-            viewport_width=1280,
-            viewport_height=720,
-        )
-
-        run_config = CrawlerRunConfig(
-            cache_mode=CacheMode.BYPASS,
-            screenshot=True,
-        )
-
-        async with AsyncWebCrawler(config=browser_config) as crawler:
-            logger.info("Testing crawling capabilities...", tag="TEST")
-            result = await crawler.arun(url="https://crawl4ai.com", config=run_config)
-
-            if result and result.markdown:
-                logger.success("✅ Crawling test passed!", tag="COMPLETE")
-                return True
-            else:
-                raise Exception("Failed to get content")
-
-    except Exception as e:
-        logger.error(f"❌ Test failed: {e}", tag="ERROR")
-        return False
-
-
-def doctor():
-    """Entry point for the doctor command"""
-    import asyncio
-
-    return asyncio.run(run_doctor())
--- a/crawl4ai/js_snippet/init.py
+++ b/crawl4ai/js_snippet/init.py
@@ -1,18 +0,0 @@
-import os
-
-
-# Load a JS script by name from the folder containing this module and return its content as a string.
-def load_js_script(script_name):
-    # Get the path of the current script
-    current_script_path = os.path.dirname(os.path.realpath(__file__))
-    # Get the path of the script to load
-    script_path = os.path.join(current_script_path, script_name + ".js")
-    # Check if the script exists
-    if not os.path.exists(script_path):
-        raise ValueError(
-            f"Script {script_name} not found in the folder {current_script_path}"
-        )
-    # Load the content of the script
-    with open(script_path, "r") as f:
-        script_content = f.read()
-    return script_content
--- a/crawl4ai/js_snippet/navigator_overrider.js
+++ b/crawl4ai/js_snippet/navigator_overrider.js
@@ -1,25 +0,0 @@
-// Pass the Permissions Test.
-const originalQuery = window.navigator.permissions.query;
-window.navigator.permissions.query = (parameters) =>
-    parameters.name === "notifications"
-        ? Promise.resolve({ state: Notification.permission })
-        : originalQuery(parameters);
-Object.defineProperty(navigator, "webdriver", {
-    get: () => undefined,
-});
-window.navigator.chrome = {
-    runtime: {},
-    // Add other properties if necessary
-};
-Object.defineProperty(navigator, "plugins", {
-    get: () => [1, 2, 3, 4, 5],
-});
-Object.defineProperty(navigator, "languages", {
-    get: () => ["en-US", "en"],
-});
-Object.defineProperty(document, "hidden", {
-    get: () => false,
-});
-Object.defineProperty(document, "visibilityState", {
-    get: () => "visible",
-});
--- a/crawl4ai/js_snippet/remove_overlay_elements.js
+++ b/crawl4ai/js_snippet/remove_overlay_elements.js
@@ -1,119 +0,0 @@
-async () => {
-    // Function to check if element is visible
-    const isVisible = (elem) => {
-        const style = window.getComputedStyle(elem);
-        return style.display !== "none" && style.visibility !== "hidden" && style.opacity !== "0";
-    };
-
-    // Common selectors for popups and overlays
-    const commonSelectors = [
-        // Close buttons first
-        'button[class*="close" i]',
-        'button[class*="dismiss" i]',
-        'button[aria-label*="close" i]',
-        'button[title*="close" i]',
-        'a[class*="close" i]',
-        'span[class*="close" i]',
-
-        // Cookie notices
-        '[class*="cookie-banner" i]',
-        '[id*="cookie-banner" i]',
-        '[class*="cookie-consent" i]',
-        '[id*="cookie-consent" i]',
-
-        // Newsletter/subscription dialogs
-        '[class*="newsletter" i]',
-        '[class*="subscribe" i]',
-
-        // Generic popups/modals
-        '[class*="popup" i]',
-        '[class*="modal" i]',
-        '[class*="overlay" i]',
-        '[class*="dialog" i]',
-        '[role="dialog"]',
-        '[role="alertdialog"]',
-    ];
-
-    // Try to click close buttons first
-    for (const selector of commonSelectors.slice(0, 6)) {
-        const closeButtons = document.querySelectorAll(selector);
-        for (const button of closeButtons) {
-            if (isVisible(button)) {
-                try {
-                    button.click();
-                    await new Promise((resolve) => setTimeout(resolve, 100));
-                } catch (e) {
-                    console.log("Error clicking button:", e);
-                }
-            }
-        }
-    }
-
-    // Remove remaining overlay elements
-    const removeOverlays = () => {
-        // Find elements with high z-index
-        const allElements = document.querySelectorAll("*");
-        for (const elem of allElements) {
-            const style = window.getComputedStyle(elem);
-            const zIndex = parseInt(style.zIndex);
-            const position = style.position;
-
-            if (
-                isVisible(elem) &&
-                (zIndex > 999 || position === "fixed" || position === "absolute") &&
-                (elem.offsetWidth > window.innerWidth * 0.5 ||
-                    elem.offsetHeight > window.innerHeight * 0.5 ||
-                    style.backgroundColor.includes("rgba") ||
-                    parseFloat(style.opacity) < 1)
-            ) {
-                elem.remove();
-            }
-        }
-
-        // Remove elements matching common selectors
-        for (const selector of commonSelectors) {
-            const elements = document.querySelectorAll(selector);
-            elements.forEach((elem) => {
-                if (isVisible(elem)) {
-                    elem.remove();
-                }
-            });
-        }
-    };
-
-    // Remove overlay elements
-    removeOverlays();
-
-    // Remove any fixed/sticky position elements at the top/bottom
-    const removeFixedElements = () => {
-        const elements = document.querySelectorAll("*");
-        elements.forEach((elem) => {
-            const style = window.getComputedStyle(elem);
-            if ((style.position === "fixed" || style.position === "sticky") && isVisible(elem)) {
-                elem.remove();
-            }
-        });
-    };
-
-    removeFixedElements();
-
-    // Remove empty block elements such as div, p, span, etc. (note: defined but never invoked below)
-    const removeEmptyBlockElements = () => {
-        const blockElements = document.querySelectorAll(
-            "div, p, span, section, article, header, footer, aside, nav, main, ul, ol, li, dl, dt, dd, h1, h2, h3, h4, h5, h6"
-        );
-        blockElements.forEach((elem) => {
-            if (elem.innerText.trim() === "") {
-                elem.remove();
-            }
-        });
-    };
-
-    // Remove margin-right and padding-right from body (often added by modal scripts)
-    document.body.style.marginRight = "0px";
-    document.body.style.paddingRight = "0px";
-    document.body.style.overflow = "auto";
-
-    // Wait a bit for any animations to complete
-    await new Promise((resolve) => setTimeout(resolve, 100));
-};
--- a/crawl4ai/js_snippet/update_image_dimensions.js
+++ b/crawl4ai/js_snippet/update_image_dimensions.js
@@ -1,54 +0,0 @@
-() => {
-    return new Promise((resolve) => {
-        const filterImage = (img) => {
-            // Filter out images that are too small
-            if (img.width < 100 && img.height < 100) return false;
-
-            // Filter out images that are not visible
-            const rect = img.getBoundingClientRect();
-            if (rect.width === 0 || rect.height === 0) return false;
-
-            // Filter out images with certain class names (e.g., icons, thumbnails)
-            if (img.classList.contains("icon") || img.classList.contains("thumbnail")) return false;
-
-            // Filter out images with certain patterns in their src (e.g., placeholder images)
-            if (img.src.includes("placeholder") || img.src.includes("icon")) return false;
-
-            return true;
-        };
-
-        const images = Array.from(document.querySelectorAll("img")).filter(filterImage);
-        let imagesLeft = images.length;
-
-        if (imagesLeft === 0) {
-            resolve();
-            return;
-        }
-
-        const checkImage = (img) => {
-            if (img.complete && img.naturalWidth !== 0) {
-                img.setAttribute("width", img.naturalWidth);
-                img.setAttribute("height", img.naturalHeight);
-                imagesLeft--;
-                if (imagesLeft === 0) resolve();
-            }
-        };
-
-        images.forEach((img) => {
-            checkImage(img);
-            if (!img.complete) {
-                img.onload = () => {
-                    checkImage(img);
-                };
-                img.onerror = () => {
-                    imagesLeft--;
-                    if (imagesLeft === 0) resolve();
-                };
-            }
-        });
-
-        // Fallback timeout of 5 seconds is disabled; the promise resolves immediately instead
-        // setTimeout(() => resolve(), 5000);
-        resolve();
-    });
-};
--- a/crawl4ai/llmtxt.py
+++ b/crawl4ai/llmtxt.py
@@ -1,546 +0,0 @@
-import os
-from pathlib import Path
-import re
-from typing import Dict, List, Tuple, Optional, Any
-import json
-from tqdm import tqdm
-import time
-import psutil
-import numpy as np
-from rank_bm25 import BM25Okapi
-from nltk.tokenize import word_tokenize
-from nltk.corpus import stopwords
-from nltk.stem import WordNetLemmatizer
-from litellm import batch_completion
-from .async_logger import AsyncLogger
-import litellm
-import pickle
-import hashlib  # used to detect changed files via content hash
-import glob
-
-litellm.set_verbose = False
-
-
-def _compute_file_hash(file_path: Path) -> str:
-    """Compute MD5 hash for the file's entire content."""
-    hash_md5 = hashlib.md5()
-    with file_path.open("rb") as f:
-        for chunk in iter(lambda: f.read(4096), b""):
-            hash_md5.update(chunk)
-    return hash_md5.hexdigest()
-
-
-class AsyncLLMTextManager:
-    def __init__(
-        self,
-        docs_dir: Path,
-        logger: Optional[AsyncLogger] = None,
-        max_concurrent_calls: int = 5,
-        batch_size: int = 3,
-    ) -> None:
-        self.docs_dir = docs_dir
-        self.logger = logger
-        self.max_concurrent_calls = max_concurrent_calls
-        self.batch_size = batch_size
-        self.bm25_index = None
-        self.document_map: Dict[str, Any] = {}
-        self.tokenized_facts: List[str] = []
-        self.bm25_index_file = self.docs_dir / "bm25_index.pkl"
-
-    async def _process_document_batch(self, doc_batch: List[Path]) -> None:
-        """Process a batch of documents in parallel"""
-        contents = []
-        for file_path in doc_batch:
-            try:
-                with open(file_path, "r", encoding="utf-8") as f:
-                    contents.append(f.read())
-            except Exception as e:
-                self.logger.error(f"Error reading {file_path}: {str(e)}")
-                contents.append("")  # Add empty content to maintain batch alignment
-
-        prompt = """Given a documentation file, generate a list of atomic facts where each fact:
-1. Represents a single piece of knowledge
-2. Contains variations in terminology for the same concept
-3. References relevant code patterns if they exist
-4. Is written in a way that would match natural language queries
-
-Each fact should follow this format:
-<main_concept>: <fact_statement> | <related_terms> | <code_reference>
-
-Example Facts:
-browser_config: Configure headless mode and browser type for AsyncWebCrawler | headless, browser_type, chromium, firefox | BrowserConfig(browser_type="chromium", headless=True)
-redis_connection: Redis client connection requires host and port configuration | redis setup, redis client, connection params | Redis(host='localhost', port=6379, db=0)
-pandas_filtering: Filter DataFrame rows using boolean conditions | dataframe filter, query, boolean indexing | df[df['column'] > 5]
-
-Wrap your response in <index>...</index> tags.
-"""
-
-        # Prepare messages for batch processing
-        messages_list = [
-            [
-                {
-                    "role": "user",
-                    "content": f"{prompt}\n\nGenerate index for this documentation:\n\n{content}",
-                }
-            ]
-            for content in contents
-            if content
-        ]
-
-        try:
-            responses = batch_completion(
-                model="anthropic/claude-3-5-sonnet-latest",
-                messages=messages_list,
-                logger_fn=None,
-            )
-
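-            # Process responses and save index files (pair only with documents that were sent; empty contents were skipped)
-            for response, file_path in zip(responses, [p for p, c in zip(doc_batch, contents) if c]):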
-                try:
-                    index_content_match = re.search(
-                        r"<index>(.*?)</index>",
-                        response.choices[0].message.content,
-                        re.DOTALL,
-                    )
-                    if not index_content_match:
-                        self.logger.warning(
-                            f"No <index>...</index> content found for {file_path}"
-                        )
-                        continue
-
-                    index_content = re.sub(
-                        r"\n\s*\n", "\n", index_content_match.group(1)
-                    ).strip()
-                    if index_content:
-                        index_file = file_path.with_suffix(".q.md")
-                        with open(index_file, "w", encoding="utf-8") as f:
-                            f.write(index_content)
-                        self.logger.info(f"Created index file: {index_file}")
-                    else:
-                        self.logger.warning(
-                            f"No index content found in response for {file_path}"
-                        )
-
-                except Exception as e:
-                    self.logger.error(
-                        f"Error processing response for {file_path}: {str(e)}"
-                    )
-
-        except Exception as e:
-            self.logger.error(f"Error in batch completion: {str(e)}")
-
-    def _validate_fact_line(self, line: str) -> Tuple[bool, Optional[str]]:
-        if "|" not in line:
-            return False, "Missing separator '|'"
-
-        parts = [p.strip() for p in line.split("|")]
-        if len(parts) != 3:
-            return False, f"Expected 3 parts, got {len(parts)}"
-
-        concept_part = parts[0]
-        if ":" not in concept_part:
-            return False, "Missing ':' in concept definition"
-
-        return True, None
-
-    def _load_or_create_token_cache(self, fact_file: Path) -> Dict:
-        """
-        Load token cache from .q.tokens if present and matching file hash.
-        Otherwise return a new structure with updated file-hash.
-        """
-        cache_file = fact_file.with_suffix(".q.tokens")
-        current_hash = _compute_file_hash(fact_file)
-
-        if cache_file.exists():
-            try:
-                with open(cache_file, "r") as f:
-                    cache = json.load(f)
-                # If the hash matches, return it directly
-                if cache.get("content_hash") == current_hash:
-                    return cache
-                # Otherwise, we signal that it's changed
-                self.logger.info(f"Hash changed for {fact_file}, reindex needed.")
-            except json.JSONDecodeError:
-                self.logger.warning(f"Corrupt token cache for {fact_file}, rebuilding.")
-            except Exception as e:
-                self.logger.warning(f"Error reading cache for {fact_file}: {str(e)}")
-
-        # Return a fresh cache
-        return {"facts": {}, "content_hash": current_hash}
-
-    def _save_token_cache(self, fact_file: Path, cache: Dict) -> None:
-        cache_file = fact_file.with_suffix(".q.tokens")
-        # Always ensure we're saving the correct file-hash
-        cache["content_hash"] = _compute_file_hash(fact_file)
-        with open(cache_file, "w") as f:
-            json.dump(cache, f)
-
-    def preprocess_text(self, text: str) -> List[str]:
-        parts = [x.strip() for x in text.split("|")] if "|" in text else [text]
-        # Remove : after the first word of parts[0]
-        parts[0] = re.sub(r"^(.*?):", r"\1", parts[0])
-
-        lemmatizer = WordNetLemmatizer()
-        stop_words = set(stopwords.words("english")) - {
-            "how",
-            "what",
-            "when",
-            "where",
-            "why",
-            "which",
-        }
-
-        tokens = []
-        for part in parts:
-            if "(" in part and ")" in part:
-                code_tokens = re.findall(
-                    r'[\w_]+(?=\()|[\w_]+(?==[\'"]{1}[\w_]+[\'"]{1})', part
-                )
-                tokens.extend(code_tokens)
-
-            words = word_tokenize(part.lower())
-            tokens.extend(
-                [
-                    lemmatizer.lemmatize(token)
-                    for token in words
-                    if token not in stop_words
-                ]
-            )
-
-        return tokens
-
-    def maybe_load_bm25_index(self, clear_cache=False) -> bool:
-        """
-        Load existing BM25 index from disk, if present and clear_cache=False.
-        """
-        if not clear_cache and os.path.exists(self.bm25_index_file):
-            self.logger.info("Loading existing BM25 index from disk.")
-            with open(self.bm25_index_file, "rb") as f:
-                data = pickle.load(f)
-            self.tokenized_facts = data["tokenized_facts"]
-            self.bm25_index = data["bm25_index"]
-            return True
-        return False
-
-    def build_search_index(self, clear_cache=False) -> None:
-        """
-        Checks for new or modified .q.md files by comparing file-hash.
-        If none need reindexing and clear_cache is False, loads existing index if available.
-        Otherwise, reindexes only changed/new files and merges or creates a new index.
-        """
-        # If clear_cache is True, we skip partial logic: rebuild everything from scratch
-        if clear_cache:
-            self.logger.info("Clearing cache and rebuilding full search index.")
-            if self.bm25_index_file.exists():
-                self.bm25_index_file.unlink()
-
-        process = psutil.Process()
-        self.logger.info("Checking which .q.md files need (re)indexing...")
-
-        # Gather all .q.md files
-        q_files = [
-            self.docs_dir / f for f in os.listdir(self.docs_dir) if f.endswith(".q.md")
-        ]
-
-        # We'll store known (unchanged) facts in these lists
-        existing_facts: List[str] = []
-        existing_tokens: List[List[str]] = []
-
-        # Keep track of invalid lines for logging
-        invalid_lines = []
-        needSet = []  # files that must be (re)indexed
-
-        for qf in q_files:
-            token_cache_file = qf.with_suffix(".q.tokens")
-
-            # If no .q.tokens or clear_cache is True → definitely reindex
-            if clear_cache or not token_cache_file.exists():
-                needSet.append(qf)
-                continue
-
-            # Otherwise, load the existing cache and compare hash
-            cache = self._load_or_create_token_cache(qf)
-            # If the .q.tokens was out of date (i.e. changed hash), we reindex
-            if len(cache["facts"]) == 0 or cache.get(
-                "content_hash"
-            ) != _compute_file_hash(qf):
-                needSet.append(qf)
-            else:
-                # File is unchanged → retrieve cached token data
-                for line, cache_data in cache["facts"].items():
-                    existing_facts.append(line)
-                    existing_tokens.append(cache_data["tokens"])
-                    self.document_map[line] = qf  # track the doc for that fact
-
-        if not needSet and not clear_cache:
-            # If no file needs reindexing, try loading existing index
-            if self.maybe_load_bm25_index(clear_cache=False):
-                self.logger.info(
-                    "No new/changed .q.md files found. Using existing BM25 index."
-                )
-                return
-            else:
-                # If there's no existing index, we must build a fresh index from the old caches
-                self.logger.info(
-                    "No existing BM25 index found. Building from cached facts."
-                )
-                if existing_facts:
-                    self.logger.info(
-                        f"Building BM25 index with {len(existing_facts)} cached facts."
-                    )
-                    self.bm25_index = BM25Okapi(existing_tokens)
-                    self.tokenized_facts = existing_facts
-                    with open(self.bm25_index_file, "wb") as f:
-                        pickle.dump(
-                            {
-                                "bm25_index": self.bm25_index,
-                                "tokenized_facts": self.tokenized_facts,
-                            },
-                            f,
-                        )
-                else:
-                    self.logger.warning("No facts found at all. Index remains empty.")
-                return
-
-        # -----------------------------------------------------
-        # If we reach here, we have new or changed .q.md files
-        # We'll parse them, reindex them, and then combine with existing_facts
-        # -----------------------------------------------------
-
-        self.logger.info(f"{len(needSet)} file(s) need reindexing. Parsing now...")
-
-        # 1) Parse the new or changed .q.md files
-        new_facts = []
-        new_tokens = []
-        with tqdm(total=len(needSet), desc="Indexing changed files") as file_pbar:
-            for file in needSet:
-                # We'll build up a fresh cache
-                fresh_cache = {"facts": {}, "content_hash": _compute_file_hash(file)}
-                try:
-                    with open(file, "r", encoding="utf-8") as f_obj:
-                        content = f_obj.read().strip()
-                        lines = [l.strip() for l in content.split("\n") if l.strip()]
-
-                    for line in lines:
-                        is_valid, error = self._validate_fact_line(line)
-                        if not is_valid:
-                            invalid_lines.append((file, line, error))
-                            continue
-
-                        tokens = self.preprocess_text(line)
-                        fresh_cache["facts"][line] = {
-                            "tokens": tokens,
-                            "added": time.time(),
-                        }
-                        new_facts.append(line)
-                        new_tokens.append(tokens)
-                        self.document_map[line] = file
-
-                    # Save the new .q.tokens with updated hash
-                    self._save_token_cache(file, fresh_cache)
-
-                    mem_usage = process.memory_info().rss / 1024 / 1024
-                    self.logger.debug(
-                        f"Memory usage after {file.name}: {mem_usage:.2f}MB"
-                    )
-
-                except Exception as e:
-                    self.logger.error(f"Error processing {file}: {str(e)}")
-
-                file_pbar.update(1)
-
-        if invalid_lines:
-            self.logger.warning(f"Found {len(invalid_lines)} invalid fact lines:")
-            for file, line, error in invalid_lines:
-                self.logger.warning(f"{file}: {error} in line: {line[:50]}...")
-
-        # 2) Merge newly tokenized facts with the existing ones
-        all_facts = existing_facts + new_facts
-        all_tokens = existing_tokens + new_tokens
-
-        # 3) Build BM25 index from combined facts
-        self.logger.info(
-            f"Building BM25 index with {len(all_facts)} total facts (old + new)."
-        )
-        self.bm25_index = BM25Okapi(all_tokens)
-        self.tokenized_facts = all_facts
-
-        # 4) Save the updated BM25 index to disk
-        with open(self.bm25_index_file, "wb") as f:
-            pickle.dump(
-                {
-                    "bm25_index": self.bm25_index,
-                    "tokenized_facts": self.tokenized_facts,
-                },
-                f,
-            )
-
-        final_mem = process.memory_info().rss / 1024 / 1024
-        self.logger.info(f"Search index updated. Final memory usage: {final_mem:.2f}MB")
-
-    async def generate_index_files(
-        self, force_generate_facts: bool = False, clear_bm25_cache: bool = False
-    ) -> None:
-        """
-        Generate index files for all documents in parallel batches
-
-        Args:
-            force_generate_facts (bool): If True, regenerate indexes even if they exist
-            clear_bm25_cache (bool): If True, clear existing BM25 index cache
-        """
-        self.logger.info("Starting index generation for documentation files.")
-
-        md_files = [
-            self.docs_dir / f
-            for f in os.listdir(self.docs_dir)
-            if f.endswith(".md") and not any(f.endswith(x) for x in [".q.md", ".xs.md"])
-        ]
-
-        # Filter out files that already have .q files unless force=True
-        if not force_generate_facts:
-            md_files = [
-                f
-                for f in md_files
-                if not (self.docs_dir / f.name.replace(".md", ".q.md")).exists()
-            ]
-
-        if not md_files:
-            self.logger.info("All index files exist. Use force_generate_facts=True to regenerate.")
-        else:
-            # Process documents in batches
-            for i in range(0, len(md_files), self.batch_size):
-                batch = md_files[i : i + self.batch_size]
-                self.logger.info(
-                    f"Processing batch {i//self.batch_size + 1}/{-(-len(md_files)//self.batch_size)}"
-                )
-                await self._process_document_batch(batch)
-
-        self.logger.info("Index generation complete, building/updating search index.")
-        self.build_search_index(clear_cache=clear_bm25_cache)
-
-    def generate(self, sections: List[str], mode: str = "extended") -> str:
-        # Get all markdown files
-        all_files = glob.glob(str(self.docs_dir / "[0-9]*.md")) + glob.glob(
-            str(self.docs_dir / "[0-9]*.xs.md")
-        )
-
-        # Extract base names without extensions
-        base_docs = {
-            Path(f).name.split(".")[0]
-            for f in all_files
-            if not Path(f).name.endswith(".q.md")
-        }
-
-        # Filter by sections if provided
-        if sections:
-            base_docs = {
-                doc
-                for doc in base_docs
-                if any(section.lower() in doc.lower() for section in sections)
-            }
-
-        # Get file paths based on mode
-        files = []
-        for doc in sorted(
-            base_docs,
-            key=lambda x: int(x.split("_")[0]) if x.split("_")[0].isdigit() else 999999,
-        ):
-            if mode == "condensed":
-                xs_file = self.docs_dir / f"{doc}.xs.md"
-                regular_file = self.docs_dir / f"{doc}.md"
-                files.append(str(xs_file if xs_file.exists() else regular_file))
-            else:
-                files.append(str(self.docs_dir / f"{doc}.md"))
-
-        # Read and format content
-        content = []
-        for file in files:
-            try:
-                with open(file, "r", encoding="utf-8") as f:
-                    fname = Path(file).name
-                    content.append(f"{'#'*20}\n# {fname}\n{'#'*20}\n\n{f.read()}")
-            except Exception as e:
-                self.logger.error(f"Error reading {file}: {str(e)}")
-
-        return "\n\n---\n\n".join(content) if content else ""
-
-    def search(self, query: str, top_k: int = 5) -> str:
-        if not self.bm25_index:
-            return "No search index available. Call build_search_index() first."
-
-        query_tokens = self.preprocess_text(query)
-        doc_scores = self.bm25_index.get_scores(query_tokens)
-
-        mean_score = np.mean(doc_scores)
-        std_score = np.std(doc_scores)
-        score_threshold = mean_score + (0.25 * std_score)
-
-        file_data = self._aggregate_search_scores(
-            doc_scores=doc_scores,
-            score_threshold=score_threshold,
-            query_tokens=query_tokens,
-        )
-
-        ranked_files = sorted(
-            file_data.items(),
-            key=lambda x: (
-                x[1]["code_match_score"] * 2.0
-                + x[1]["match_count"] * 1.5
-                + x[1]["total_score"]
-            ),
-            reverse=True,
-        )[:top_k]
-
-        results = []
-        for file, _ in ranked_files:
-            main_doc = str(file).replace(".q.md", ".md")
-            if os.path.exists(self.docs_dir / main_doc):
-                with open(self.docs_dir / main_doc, "r", encoding="utf-8") as f:
-                    only_file_name = main_doc.split("/")[-1]
-                    content = ["#" * 20, f"# {only_file_name}", "#" * 20, "", f.read()]
-                    results.append("\n".join(content))
-
-        return "\n\n---\n\n".join(results)
-
-    def _aggregate_search_scores(
-        self, doc_scores: List[float], score_threshold: float, query_tokens: List[str]
-    ) -> Dict:
-        file_data = {}
-
-        for idx, score in enumerate(doc_scores):
-            if score <= score_threshold:
-                continue
-
-            fact = self.tokenized_facts[idx]
-            file_path = self.document_map[fact]
-
-            if file_path not in file_data:
-                file_data[file_path] = {
-                    "total_score": 0,
-                    "match_count": 0,
-                    "code_match_score": 0,
-                    "matched_facts": [],
-                }
-
-            components = fact.split("|") if "|" in fact else [fact]
-
-            code_match_score = 0
-            if len(components) == 3:
-                code_ref = components[2].strip()
-                code_tokens = self.preprocess_text(code_ref)
-                code_match_score = len(set(query_tokens) & set(code_tokens)) / len(
-                    query_tokens
-                )
-
-            file_data[file_path]["total_score"] += score
-            file_data[file_path]["match_count"] += 1
-            file_data[file_path]["code_match_score"] = max(
-                file_data[file_path]["code_match_score"], code_match_score
-            )
-            file_data[file_path]["matched_facts"].append(fact)
-
-        return file_data
-
-    def refresh_index(self) -> None:
-        """Convenience method for a full rebuild."""
-        self.build_search_index(clear_cache=True)
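Review note on the removed `search` method: it keeps only facts whose BM25 score is strictly above the mean plus 0.25 population standard deviations, and groups the survivors by file afterwards. The cutoff can be sketched in isolation as below. This is a minimal sketch, not the removed implementation: the helper name `select_above_threshold`, the parameter `k`, and the example scores are hypothetical, and `statistics.pstdev` stands in for `np.std` (both compute the population standard deviation by default).

```python
from statistics import mean, pstdev
from typing import List, Sequence


def select_above_threshold(scores: Sequence[float], k: float = 0.25) -> List[int]:
    """Return indices whose score is strictly above mean + k * population std."""
    if not scores:
        return []
    threshold = mean(scores) + k * pstdev(scores)
    # Strict comparison mirrors the `score <= score_threshold: continue` check
    return [i for i, s in enumerate(scores) if s > threshold]


# Hypothetical BM25 scores for five facts
example_scores = [0.0, 0.0, 1.2, 3.4, 0.1]
selected = select_above_threshold(example_scores)
```

With a strict comparison, a query that matches nothing (all scores equal) selects no facts, so the search returns an empty result rather than arbitrary files.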
--- a/crawl4ai/markdown_generation_strategy.py
+++ b/crawl4ai/markdown_generation_strategy.py
@@ -1,253 +0,0 @@
-from abc import ABC, abstractmethod
-from typing import Optional, Dict, Any, Tuple
-from .models import MarkdownGenerationResult
-from .html2text import CustomHTML2Text
-from .content_filter_strategy import RelevantContentFilter
-import re
-from urllib.parse import urljoin
-
-# Pre-compile the regex pattern
-LINK_PATTERN = re.compile(r'!?\[([^\]]+)\]\(([^)]+?)(?:\s+"([^"]*)")?\)')
-
-
-def fast_urljoin(base: str, url: str) -> str:
-    """Fast URL joining for common cases."""
-    if url.startswith(("http://", "https://", "mailto:", "//")):
-        return url
-    if url.startswith("/"):
-        # Root-relative path: appended to base as-is, so base is assumed to be an origin without a path
-        if base.endswith("/"):
-            return base[:-1] + url
-        return base + url
-    return urljoin(base, url)
-
-
-class MarkdownGenerationStrategy(ABC):
-    """Abstract base class for markdown generation strategies."""
-
-    def __init__(
-        self,
-        content_filter: Optional[RelevantContentFilter] = None,
-        options: Optional[Dict[str, Any]] = None,
-    ):
-        self.content_filter = content_filter
-        self.options = options or {}
-
-    @abstractmethod
-    def generate_markdown(
-        self,
-        cleaned_html: str,
-        base_url: str = "",
-        html2text_options: Optional[Dict[str, Any]] = None,
-        content_filter: Optional[RelevantContentFilter] = None,
-        citations: bool = True,
-        **kwargs,
-    ) -> MarkdownGenerationResult:
-        """Generate markdown from cleaned HTML."""
-        pass
-
-
-class DefaultMarkdownGenerator(MarkdownGenerationStrategy):
-    """
-    Default implementation of markdown generation strategy.
-
-    How it works:
-    1. Generate raw markdown from cleaned HTML.
-    2. Convert links to citations.
-    3. Generate fit markdown if content filter is provided.
-    4. Return MarkdownGenerationResult.
-
-    Args:
-        content_filter (Optional[RelevantContentFilter]): Content filter for generating fit markdown.
-        options (Optional[Dict[str, Any]]): Additional options for markdown generation. Defaults to None.
-
-    Returns:
-        MarkdownGenerationResult: Result containing raw markdown, fit markdown, fit HTML, and references markdown.
-    """
-
-    def __init__(
-        self,
-        content_filter: Optional[RelevantContentFilter] = None,
-        options: Optional[Dict[str, Any]] = None,
-    ):
-        super().__init__(content_filter, options)
-
-    def convert_links_to_citations(
-        self, markdown: str, base_url: str = ""
-    ) -> Tuple[str, str]:
-        """
-        Convert links in markdown to citations.
-
-        How it works:
-        1. Find all links in the markdown.
-        2. Convert links to citations.
-        3. Return converted markdown and references markdown.
-
-        Note:
-        This function uses a regex pattern to find links in markdown.
-
-        Args:
-            markdown (str): Markdown text.
-            base_url (str): Base URL for URL joins.
-
-        Returns:
-            Tuple[str, str]: Converted markdown and references markdown.
-        """
-        link_map = {}
-        url_cache = {}  # Cache for URL joins
-        parts = []
-        last_end = 0
-        counter = 1
-
-        for match in LINK_PATTERN.finditer(markdown):
-            parts.append(markdown[last_end : match.start()])
-            text, url, title = match.groups()
-
-            # Use cached URL if available, otherwise compute and cache
-            if base_url and not url.startswith(("http://", "https://", "mailto:")):
-                if url not in url_cache:
-                    url_cache[url] = fast_urljoin(base_url, url)
-                url = url_cache[url]
-
-            if url not in link_map:
-                desc = []
-                if title:
-                    desc.append(title)
-                if text and text != title:
-                    desc.append(text)
-                link_map[url] = (counter, ": " + " - ".join(desc) if desc else "")
-                counter += 1
-
-            num = link_map[url][0]
-            parts.append(
-                f"{text}⟨{num}⟩"
-                if not match.group(0).startswith("!")
-                else f"![{text}⟨{num}⟩]"
-            )
-            last_end = match.end()
-
-        parts.append(markdown[last_end:])
-        converted_text = "".join(parts)
-
-        # Pre-build reference strings
-        references = ["\n\n## References\n\n"]
-        references.extend(
-            f"⟨{num}⟩ {url}{desc}\n"
-            for url, (num, desc) in sorted(link_map.items(), key=lambda x: x[1][0])
-        )
-
-        return converted_text, "".join(references)
-
-    def generate_markdown(
-        self,
-        cleaned_html: str,
-        base_url: str = "",
-        html2text_options: Optional[Dict[str, Any]] = None,
-        options: Optional[Dict[str, Any]] = None,
-        content_filter: Optional[RelevantContentFilter] = None,
-        citations: bool = True,
-        **kwargs,
-    ) -> MarkdownGenerationResult:
-        """
-        Generate markdown with citations from cleaned HTML.
-
-        How it works:
-        1. Generate raw markdown from cleaned HTML.
-        2. Convert links to citations.
-        3. Generate fit markdown if content filter is provided.
-        4. Return MarkdownGenerationResult.
-
-        Args:
-            cleaned_html (str): Cleaned HTML content.
-            base_url (str): Base URL for URL joins.
-            html2text_options (Optional[Dict[str, Any]]): HTML2Text options.
-            options (Optional[Dict[str, Any]]): Additional options for markdown generation.
-            content_filter (Optional[RelevantContentFilter]): Content filter for generating fit markdown.
-            citations (bool): Whether to generate citations.
-
-        Returns:
-            MarkdownGenerationResult: Result containing raw markdown, fit markdown, fit HTML, and references markdown.
-        """
-        try:
-            # Initialize HTML2Text with default options for better conversion
-            h = CustomHTML2Text(baseurl=base_url)
-            default_options = {
-                "body_width": 0,  # Disable text wrapping
-                "ignore_emphasis": False,
-                "ignore_links": False,
-                "ignore_images": False,
-                "protect_links": True,
-                "single_line_break": True,
-                "mark_code": True,
-                "escape_snob": False,
-            }
-
-            # Update with custom options if provided
-            if html2text_options:
-                default_options.update(html2text_options)
-            elif options:
-                default_options.update(options)
-            elif self.options:
-                default_options.update(self.options)
-
-            h.update_params(**default_options)
-
-            # Ensure we have valid input
-            if not cleaned_html:
-                cleaned_html = ""
-            elif not isinstance(cleaned_html, str):
-                cleaned_html = str(cleaned_html)
-
-            # Generate raw markdown
-            try:
-                raw_markdown = h.handle(cleaned_html)
-            except Exception as e:
-                raw_markdown = f"Error converting HTML to markdown: {str(e)}"
-
-            raw_markdown = raw_markdown.replace("    ```", "```")
-
-            # Convert links to citations
-            markdown_with_citations: str = raw_markdown
-            references_markdown: str = ""
-            if citations:
-                try:
-                    (
-                        markdown_with_citations,
-                        references_markdown,
-                    ) = self.convert_links_to_citations(raw_markdown, base_url)
-                except Exception as e:
-                    markdown_with_citations = raw_markdown
-                    references_markdown = f"Error generating citations: {str(e)}"
-
-            # Generate fit markdown if content filter is provided
-            fit_markdown: Optional[str] = ""
-            filtered_html: Optional[str] = ""
-            if content_filter or self.content_filter:
-                try:
-                    content_filter = content_filter or self.content_filter
-                    filtered_html = content_filter.filter_content(cleaned_html)
-                    filtered_html = "\n".join(
-                        "<div>{}</div>".format(s) for s in filtered_html
-                    )
-                    fit_markdown = h.handle(filtered_html)
-                except Exception as e:
-                    fit_markdown = f"Error generating fit markdown: {str(e)}"
-                    filtered_html = ""
-
-            return MarkdownGenerationResult(
-                raw_markdown=raw_markdown or "",
-                markdown_with_citations=markdown_with_citations or "",
-                references_markdown=references_markdown or "",
-                fit_markdown=fit_markdown or "",
-                fit_html=filtered_html or "",
-            )
-        except Exception as e:
-            # If anything fails, return empty strings with error message
-            error_msg = f"Error in markdown generation: {str(e)}"
-            return MarkdownGenerationResult(
-                raw_markdown=error_msg,
-                markdown_with_citations=error_msg,
-                references_markdown="",
-                fit_markdown="",
-                fit_html="",
-            )
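The removed `convert_links_to_citations` method can be illustrated with a small standalone sketch. It assumes a simplified link regex, since the real `LINK_PATTERN` is not shown in this diff, and it uses the standard library's `urljoin` in place of `fast_urljoin`. The function name `links_to_citations` is hypothetical.

```python
import re
from typing import Dict, List, Tuple
from urllib.parse import urljoin

# Assumption: a simplified pattern for [text](url "title") and ![alt](url).
# The LINK_PATTERN used by the removed code is not visible in this diff.
LINK_PATTERN = re.compile(r'!?\[([^\]]+)\]\(([^)\s]+)(?:\s+"([^"]*)")?\)')


def links_to_citations(markdown: str, base_url: str = "") -> Tuple[str, str]:
    """Replace inline links with numbered citations and build a reference list."""
    link_map: Dict[str, Tuple[int, str]] = {}  # url -> (number, description)
    parts: List[str] = []
    last_end = 0

    for match in LINK_PATTERN.finditer(markdown):
        parts.append(markdown[last_end:match.start()])
        text, url, title = match.groups()

        # Resolve relative URLs against the base URL (urljoin stands in for fast_urljoin).
        if base_url and not url.startswith(("http://", "https://", "mailto:")):
            url = urljoin(base_url, url)

        # Each distinct URL gets one number, assigned on first appearance.
        if url not in link_map:
            desc = []
            if title:
                desc.append(title)
            if text and text != title:
                desc.append(text)
            suffix = ": " + " - ".join(desc) if desc else ""
            link_map[url] = (len(link_map) + 1, suffix)

        num = link_map[url][0]
        is_image = match.group(0).startswith("!")
        parts.append(f"![{text}⟨{num}⟩]" if is_image else f"{text}⟨{num}⟩")
        last_end = match.end()

    parts.append(markdown[last_end:])

    # Dicts preserve insertion order, so references come out in citation order.
    references = "\n\n## References\n\n" + "".join(
        f"⟨{num}⟩ {url}{desc}\n" for url, (num, desc) in link_map.items()
    )
    return "".join(parts), references


body, refs = links_to_citations(
    'See [docs](/guide "Guide") and [docs](/guide).', "https://example.com"
)
```

Repeated links to the same resolved URL share one citation number, which is why the sketch keys `link_map` on the URL rather than on the link text.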
--- a/crawl4ai/migrations.py
+++ b/crawl4ai/migrations.py
@@ -1,194 +0,0 @@
-import os
-import asyncio
-from pathlib import Path
-import aiosqlite
-from typing import Optional
-import xxhash
-import aiofiles
-import shutil
-from datetime import datetime
-from .async_logger import AsyncLogger, LogLevel
-
-# Initialize logger
-logger = AsyncLogger(log_level=LogLevel.DEBUG, verbose=True)
-
-# logging.basicConfig(level=logging.INFO)
-# logger = logging.getLogger(__name__)
-
-
-class DatabaseMigration:
-    def __init__(self, db_path: str):
-        self.db_path = db_path
-        self.content_paths = self._ensure_content_dirs(os.path.dirname(db_path))
-
-    def _ensure_content_dirs(self, base_path: str) -> dict:
-        dirs = {
-            "html": "html_content",
-            "cleaned": "cleaned_html",
-            "markdown": "markdown_content",
-            "extracted": "extracted_content",
-            "screenshots": "screenshots",
-        }
-        content_paths = {}
-        for key, dirname in dirs.items():
-            path = os.path.join(base_path, dirname)
-            os.makedirs(path, exist_ok=True)
-            content_paths[key] = path
-        return content_paths
-
-    def _generate_content_hash(self, content: str) -> str:
-        x = xxhash.xxh64()
-        x.update(content.encode())
-        content_hash = x.hexdigest()
-        return content_hash
-        # return hashlib.sha256(content.encode()).hexdigest()
-
-    async def _store_content(self, content: str, content_type: str) -> str:
-        if not content:
-            return ""
-
-        content_hash = self._generate_content_hash(content)
-        file_path = os.path.join(self.content_paths[content_type], content_hash)
-
-        if not os.path.exists(file_path):
-            async with aiofiles.open(file_path, "w", encoding="utf-8") as f:
-                await f.write(content)
-
-        return content_hash
-
-    async def migrate_database(self):
-        """Migrate existing database to file-based storage"""
-        # logger.info("Starting database migration...")
-        logger.info("Starting database migration...", tag="INIT")
-
-        try:
-            async with aiosqlite.connect(self.db_path) as db:
-                # Get all rows
-                async with db.execute(
-                    """SELECT url, html, cleaned_html, markdown, 
-                       extracted_content, screenshot FROM crawled_data"""
-                ) as cursor:
-                    rows = await cursor.fetchall()
-
-                migrated_count = 0
-                for row in rows:
-                    (
-                        url,
-                        html,
-                        cleaned_html,
-                        markdown,
-                        extracted_content,
-                        screenshot,
-                    ) = row
-
-                    # Store content in files and get hashes
-                    html_hash = await self._store_content(html, "html")
-                    cleaned_hash = await self._store_content(cleaned_html, "cleaned")
-                    markdown_hash = await self._store_content(markdown, "markdown")
-                    extracted_hash = await self._store_content(
-                        extracted_content, "extracted"
-                    )
-                    screenshot_hash = await self._store_content(
-                        screenshot, "screenshots"
-                    )
-
-                    # Update database with hashes
-                    await db.execute(
-                        """
-                        UPDATE crawled_data 
-                        SET html = ?, 
-                            cleaned_html = ?,
-                            markdown = ?,
-                            extracted_content = ?,
-                            screenshot = ?
-                        WHERE url = ?
-                    """,
-                        (
-                            html_hash,
-                            cleaned_hash,
-                            markdown_hash,
-                            extracted_hash,
-                            screenshot_hash,
-                            url,
-                        ),
-                    )
-
-                    migrated_count += 1
-                    if migrated_count % 100 == 0:
-                        logger.info(f"Migrated {migrated_count} records...", tag="INIT")
-
-                await db.commit()
-                logger.success(
-                    f"Migration completed. {migrated_count} records processed.",
-                    tag="COMPLETE",
-                )
-
-        except Exception as e:
-            # logger.error(f"Migration failed: {e}")
-            logger.error(
-                message="Migration failed: {error}",
-                tag="ERROR",
-                params={"error": str(e)},
-            )
-            raise e
-
-
-async def backup_database(db_path: str) -> Optional[str]:
-    """Create backup of existing database"""
-    if not os.path.exists(db_path):
-        logger.info("No existing database found. Skipping backup.", tag="INIT")
-        return None
-
-    # Create backup with timestamp
-    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
-    backup_path = f"{db_path}.backup_{timestamp}"
-
-    try:
-        # Wait for any potential write operations to finish
-        await asyncio.sleep(1)
-
-        # Create backup
-        shutil.copy2(db_path, backup_path)
-        logger.info(f"Database backup created at: {backup_path}", tag="COMPLETE")
-        return backup_path
-    except Exception as e:
-        # logger.error(f"Backup failed: {e}")
-        logger.error(
-            message="Backup failed: {error}", tag="ERROR", params={"error": str(e)}
-        )
-        raise e
-
-
-async def run_migration(db_path: Optional[str] = None):
-    """Run database migration"""
-    if db_path is None:
-        db_path = os.path.join(Path.home(), ".crawl4ai", "crawl4ai.db")
-
-    if not os.path.exists(db_path):
-        logger.info("No existing database found. Skipping migration.", tag="INIT")
-        return
-
-    # Create backup first
-    backup_path = await backup_database(db_path)
-    if not backup_path:
-        return
-
-    migration = DatabaseMigration(db_path)
-    await migration.migrate_database()
-
-
-def main():
-    """CLI entry point for migration"""
-    import argparse
-
-    parser = argparse.ArgumentParser(
-        description="Migrate Crawl4AI database to file-based storage"
-    )
-    parser.add_argument("--db-path", help="Custom database path")
-    args = parser.parse_args()
-
-    asyncio.run(run_migration(args.db_path))
-
-
-if __name__ == "__main__":
-    main()
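The content-addressed storage idea behind the removed `_store_content` helper can be sketched in isolation. This is a synchronous simplification under stated assumptions: `hashlib.sha256` stands in for `xxhash.xxh64` so the example needs only the standard library, and plain `open` replaces `aiofiles`. The function name `store_content` is hypothetical.

```python
import hashlib
import os
import tempfile


def store_content(content: str, directory: str) -> str:
    """Write content to a file named after its hash and return that hash.

    Empty content is not stored and yields an empty string, mirroring the
    behaviour of the removed helper.
    """
    if not content:
        return ""

    # Assumption: sha256 replaces xxhash.xxh64 to avoid a third-party dependency.
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    file_path = os.path.join(directory, content_hash)

    # Identical content maps to the same file name, so it is written only once.
    if not os.path.exists(file_path):
        with open(file_path, "w", encoding="utf-8") as f:
            f.write(content)

    return content_hash


with tempfile.TemporaryDirectory() as tmp:
    first = store_content("<html>hello</html>", tmp)
    second = store_content("<html>hello</html>", tmp)
    files_on_disk = os.listdir(tmp)
```

Because the file name is derived from the content, storing the same page twice costs one file, and the database row only needs to keep the hash.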
--- a/crawl4ai/model_loader.py
+++ b/crawl4ai/model_loader.py
@@ -2,157 +2,147 @@ from functools import lru_cache
 from pathlib import Path
 import subprocess, os
 import shutil
-from .model_loader import *
-import argparse
 from crawl4ai.config import MODEL_REPO_BRANCH
-
+import argparse
+import urllib.request
 __location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))

-
@lru_cache()
 def get_available_memory(device):
    import torch
-
-    if device.type == "cuda":
+    if device.type == 'cuda':
        return torch.cuda.get_device_properties(device).total_memory
-    elif device.type == "mps":
-        return 48 * 1024**3  # Hard-coded assumption of 48 GiB for MPS
+    elif device.type == 'mps':      
+        return 48 * 1024 ** 3  # Hard-coded assumption of 48 GiB for MPS
    else:
        return 0

-
@lru_cache()
 def calculate_batch_size(device):
    available_memory = get_available_memory(device)
-
-    if device.type == "cpu":
+    
+    if device.type == 'cpu':
        return 16
-    elif device.type in ["cuda", "mps"]:
+    elif device.type in ['cuda', 'mps']:
        # Adjust these thresholds based on your model size and available memory
-        if available_memory >= 31 * 1024**3:  # > 32GB
+        if available_memory >= 31 * 1024 ** 3:  # > 32GB
            return 256
-        elif available_memory >= 15 * 1024**3:  # > 16GB to 32GB
+        elif available_memory >= 15 * 1024 ** 3:  # > 16GB to 32GB
            return 128
-        elif available_memory >= 8 * 1024**3:  # 8GB to 16GB
+        elif available_memory >= 8 * 1024 ** 3:  # 8GB to 16GB
            return 64
        else:
            return 32
    else:
        return 16  # Default batch size
-
-
+    
+    
@lru_cache()
 def get_device():
    import torch
-
    if torch.cuda.is_available():
-        device = torch.device("cuda")
+        device = torch.device('cuda')
    elif torch.backends.mps.is_available():
-        device = torch.device("mps")
+        device = torch.device('mps')
    else:
-        device = torch.device("cpu")
-    return device
-
-
+        device = torch.device('cpu')
+    return device   
+    
 def set_model_device(model):
    device = get_device()
-    model.to(device)
+    model.to(device)    
    return model, device

-
@lru_cache()
 def get_home_folder():
-    home_folder = os.path.join(
-        os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai"
-    )
+    home_folder = os.path.join(Path.home(), ".crawl4ai")
    os.makedirs(home_folder, exist_ok=True)
    os.makedirs(f"{home_folder}/cache", exist_ok=True)
    os.makedirs(f"{home_folder}/models", exist_ok=True)
-    return home_folder
-
+    return home_folder 

@lru_cache()
 def load_bert_base_uncased():
-    from transformers import BertTokenizer, BertModel
-
-    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", resume_download=None)
-    model = BertModel.from_pretrained("bert-base-uncased", resume_download=None)
+    from transformers import BertTokenizer, BertModel, AutoTokenizer, AutoModel
+    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', resume_download=None)
+    model = BertModel.from_pretrained('bert-base-uncased', resume_download=None)
    model.eval()
    model, device = set_model_device(model)
    return tokenizer, model

-
@lru_cache()
-def load_HF_embedding_model(model_name="BAAI/bge-small-en-v1.5") -> tuple:
-    """Load the Hugging Face model for embedding.
-
-    Args:
-        model_name (str, optional): The model name to load. Defaults to "BAAI/bge-small-en-v1.5".
-
-    Returns:
-        tuple: The tokenizer and model.
-    """
-    from transformers import AutoTokenizer, AutoModel
-
-    tokenizer = AutoTokenizer.from_pretrained(model_name, resume_download=None)
-    model = AutoModel.from_pretrained(model_name, resume_download=None)
+def load_bge_small_en_v1_5():
+    from transformers import BertTokenizer, BertModel, AutoTokenizer, AutoModel
+    tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-small-en-v1.5', resume_download=None)
+    model = AutoModel.from_pretrained('BAAI/bge-small-en-v1.5', resume_download=None)
    model.eval()
    model, device = set_model_device(model)
    return tokenizer, model

+@lru_cache()
+def load_onnx_all_MiniLM_l6_v2():
+    from crawl4ai.onnx_embedding import DefaultEmbeddingModel
+    model_path = "models/onnx/model.onnx"
+    model_url = "https://unclecode-files.s3.us-west-2.amazonaws.com/model.onnx"
+    download_path = os.path.join(__location__, model_path)
+
+    if not os.path.exists(download_path):
+        # Define a download function with a simple progress display
+        def download_with_progress(url, filename):
+            def reporthook(block_num, block_size, total_size):
+                downloaded = block_num * block_size
+                percentage = 100 * downloaded / total_size
+                if downloaded < total_size:
+                    print(f"\rDownloading: {percentage:.2f}% ({downloaded / (1024 * 1024):.2f} MB of {total_size / (1024 * 1024):.2f} MB)", end='')
+                else:
+                    print("\rDownload complete!                              ")
+
+            urllib.request.urlretrieve(url, filename, reporthook)
+
+        download_with_progress(model_url, download_path)
+
+    model = DefaultEmbeddingModel()
+    return model

@lru_cache()
 def load_text_classifier():
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    from transformers import pipeline
+    import torch

-    tokenizer = AutoTokenizer.from_pretrained(
-        "dstefa/roberta-base_topic_classification_nyt_news"
-    )
-    model = AutoModelForSequenceClassification.from_pretrained(
-        "dstefa/roberta-base_topic_classification_nyt_news"
-    )
+    tokenizer = AutoTokenizer.from_pretrained("dstefa/roberta-base_topic_classification_nyt_news")
+    model = AutoModelForSequenceClassification.from_pretrained("dstefa/roberta-base_topic_classification_nyt_news")
    model.eval()
    model, device = set_model_device(model)
    pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
    return pipe

-
@lru_cache()
 def load_text_multilabel_classifier():
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
+    import numpy as np
    from scipy.special import expit
    import torch

-    # # Check for available device: CUDA, MPS (for Apple Silicon), or CPU
-    # if torch.cuda.is_available():
-    #     device = torch.device("cuda")
-    # elif torch.backends.mps.is_available():
-    #     device = torch.device("mps")
-    # else:
-    #     device = torch.device("cpu")
-    #     # return load_spacy_model(), torch.device("cpu")
+    # Check for available device: CUDA, MPS (for Apple Silicon), or CPU
+    if torch.cuda.is_available():
+        device = torch.device("cuda")
+    elif torch.backends.mps.is_available():
+        device = torch.device("mps")
+    else:
+        return load_spacy_model(), torch.device("cpu")
+

    MODEL = "cardiffnlp/tweet-topic-21-multi"
    tokenizer = AutoTokenizer.from_pretrained(MODEL, resume_download=None)
-    model = AutoModelForSequenceClassification.from_pretrained(
-        MODEL, resume_download=None
-    )
+    model = AutoModelForSequenceClassification.from_pretrained(MODEL, resume_download=None)
    model.eval()
    model, device = set_model_device(model)
    class_mapping = model.config.id2label

    def _classifier(texts, threshold=0.5, max_length=64):
-        tokens = tokenizer(
-            texts,
-            return_tensors="pt",
-            padding=True,
-            truncation=True,
-            max_length=max_length,
-        )
-        tokens = {
-            key: val.to(device) for key, val in tokens.items()
-        }  # Move tokens to the selected device
+        tokens = tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=max_length)
+        tokens = {key: val.to(device) for key, val in tokens.items()}  # Move tokens to the selected device

        with torch.no_grad():
            output = model(**tokens)
@@ -163,91 +153,73 @@ def load_text_multilabel_classifier():

        batch_labels = []
        for prediction in predictions:
-            labels = [
-                class_mapping[i] for i, value in enumerate(prediction) if value == 1
-            ]
+            labels = [class_mapping[i] for i, value in enumerate(prediction) if value == 1]
            batch_labels.append(labels)

        return batch_labels

    return _classifier, device

-
@lru_cache()
 def load_nltk_punkt():
    import nltk
-
    try:
-        nltk.data.find("tokenizers/punkt")
+        nltk.data.find('tokenizers/punkt')
    except LookupError:
-        nltk.download("punkt")
-    return nltk.data.find("tokenizers/punkt")
+        nltk.download('punkt')
+    return nltk.data.find('tokenizers/punkt')


@lru_cache()
 def load_spacy_model():
    import spacy
-
    name = "models/reuters"
    home_folder = get_home_folder()
-    model_folder = Path(home_folder) / name
-
+    model_folder = os.path.join(home_folder, name)
+    
    # Check if the model directory already exists
-    if not (model_folder.exists() and any(model_folder.iterdir())):
+    if not (Path(model_folder).exists() and any(Path(model_folder).iterdir())):
        repo_url = "https://github.com/unclecode/crawl4ai.git"
-        branch = MODEL_REPO_BRANCH
-        repo_folder = Path(home_folder) / "crawl4ai"
+        # branch = "main"
+        branch = MODEL_REPO_BRANCH 
+        repo_folder = os.path.join(home_folder, "crawl4ai")
+        model_folder = os.path.join(home_folder, name)

-        print("[LOG] ⏬ Downloading Spacy model for the first time...")
+        # print("[LOG] ⏬ Downloading Spacy model for the first time...")

        # Remove existing repo folder if it exists
-        if repo_folder.exists():
-            try:
-                shutil.rmtree(repo_folder)
-                if model_folder.exists():
-                    shutil.rmtree(model_folder)
-            except PermissionError:
-                print(
-                    "[WARNING] Unable to remove existing folders. Please manually delete the following folders and try again:"
-                )
-                print(f"- {repo_folder}")
-                print(f"- {model_folder}")
-                return None
+        if Path(repo_folder).exists():
+            shutil.rmtree(repo_folder)
+            shutil.rmtree(model_folder, ignore_errors=True)  # model folder may not exist yet

        try:
            # Clone the repository
            subprocess.run(
-                ["git", "clone", "-b", branch, repo_url, str(repo_folder)],
+                ["git", "clone", "-b", branch, repo_url, repo_folder],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
-                check=True,
+                check=True
            )

            # Create the models directory if it doesn't exist
-            models_folder = Path(home_folder) / "models"
-            models_folder.mkdir(parents=True, exist_ok=True)
+            models_folder = os.path.join(home_folder, "models")
+            os.makedirs(models_folder, exist_ok=True)

            # Copy the reuters model folder to the models directory
-            source_folder = repo_folder / "models" / "reuters"
+            source_folder = os.path.join(repo_folder, "models/reuters")
            shutil.copytree(source_folder, model_folder)

            # Remove the cloned repository
            shutil.rmtree(repo_folder)

-            print("[LOG] ✅ Spacy Model downloaded successfully")
+            # Print completion message
+            # print("[LOG] ✅ Spacy Model downloaded successfully")
        except subprocess.CalledProcessError as e:
            print(f"An error occurred while cloning the repository: {e}")
-            return None
        except Exception as e:
            print(f"An error occurred: {e}")
-            return None
-
-    try:
-        return spacy.load(str(model_folder))
-    except Exception as e:
-        print(f"Error loading spacy model: {e}")
-        return None

+    return spacy.load(model_folder)

 def download_all_models(remove_existing=False):
    """Download all models required for Crawl4AI."""
@@ -268,8 +240,8 @@ def download_all_models(remove_existing=False):
    # load_bert_base_uncased()
    # print("[LOG] Downloading BGE Small EN v1.5...")
    # load_bge_small_en_v1_5()
-    # print("[LOG] Downloading ONNX model...")
-    # load_onnx_all_MiniLM_l6_v2()
+    print("[LOG] Downloading ONNX model...")
+    load_onnx_all_MiniLM_l6_v2()
    print("[LOG] Downloading text classifier...")
    _, device = load_text_multilabel_classifier()
    print(f"[LOG] Text classifier loaded on {device}")
@@ -277,20 +249,14 @@ def download_all_models(remove_existing=False):
    load_nltk_punkt()
    print("[LOG] ✅ All models downloaded successfully.")

-
 def main():
    print("[LOG] Welcome to the Crawl4AI Model Downloader!")
    print("[LOG] This script will download all the models required for Crawl4AI.")
    parser = argparse.ArgumentParser(description="Crawl4AI Model Downloader")
-    parser.add_argument(
-        "--remove-existing",
-        action="store_true",
-        help="Remove existing models before downloading",
-    )
+    parser.add_argument('--remove-existing', action='store_true', help="Remove existing models before downloading")
    args = parser.parse_args()
-
+    
    download_all_models(remove_existing=args.remove_existing)

-
 if __name__ == "__main__":
    main()
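The memory thresholds in `calculate_batch_size` can be exercised without `torch` by separating the decision from the device query. The sketch below is a hypothetical pure-function version: `pick_batch_size` and `GIB` are names introduced here, while the thresholds and return values are copied from the diff. The thresholds sit at 31 and 15 GiB rather than 32 and 16, presumably because a device reports slightly less than its nominal capacity; that reading is an assumption, not something the source states.

```python
GIB = 1024 ** 3  # bytes per gibibyte


def pick_batch_size(device_type: str, available_bytes: int) -> int:
    """Map a device type and its memory (in bytes) to a batch size."""
    if device_type not in ("cuda", "mps"):
        return 16  # CPU and unknown devices use the default batch size

    # Thresholds copied from calculate_batch_size in the diff.
    if available_bytes >= 31 * GIB:
        return 256
    if available_bytes >= 15 * GIB:
        return 128
    if available_bytes >= 8 * GIB:
        return 64
    return 32


sizes = {
    "cpu": pick_batch_size("cpu", 0),
    "cuda_32": pick_batch_size("cuda", 32 * GIB),
    "cuda_16": pick_batch_size("cuda", 16 * GIB),
    "mps_8": pick_batch_size("mps", 8 * GIB),
    "cuda_4": pick_batch_size("cuda", 4 * GIB),
}
```

With the MPS memory hard-coded to `48 * 1024 ** 3` in `get_available_memory`, an MPS device always lands in the 256 bucket under these thresholds.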
--- a/crawl4ai/models.py
+++ b/crawl4ai/models.py
@@ -1,199 +1,16 @@
-from __future__ import annotations
 from pydantic import BaseModel, HttpUrl
-from typing import List, Dict, Optional, Callable, Awaitable, Union, Any
-from enum import Enum
-from dataclasses import dataclass
-from .ssl_certificate import SSLCertificate
-from datetime import datetime
-from datetime import timedelta
-from math import inf
-
-
-###############################
-# Dispatcher Models
-###############################
-@dataclass
-class DomainState:
-    last_request_time: float = 0
-    current_delay: float = 0
-    fail_count: int = 0
-
-
-@dataclass
-class CrawlerTaskResult:
-    task_id: str
-    url: str
-    result: "CrawlResult"
-    memory_usage: float
-    peak_memory: float
-    start_time: datetime
-    end_time: datetime
-    error_message: str = ""
-
-
-class CrawlStatus(Enum):
-    QUEUED = "QUEUED"
-    IN_PROGRESS = "IN_PROGRESS"
-    COMPLETED = "COMPLETED"
-    FAILED = "FAILED"
-
-
-@dataclass
-class CrawlStats:
-    task_id: str
-    url: str
-    status: CrawlStatus
-    start_time: Optional[datetime] = None
-    end_time: Optional[datetime] = None
-    memory_usage: float = 0.0
-    peak_memory: float = 0.0
-    error_message: str = ""
-
-    @property
-    def duration(self) -> str:
-        if not self.start_time:
-            return "0:00"
-        end = self.end_time or datetime.now()
-        duration = end - self.start_time
-        return str(timedelta(seconds=int(duration.total_seconds())))
-
-
-class DisplayMode(Enum):
-    DETAILED = "DETAILED"
-    AGGREGATED = "AGGREGATED"
-
-
-###############################
-# Crawler Models
-###############################
-@dataclass
-class TokenUsage:
-    completion_tokens: int = 0
-    prompt_tokens: int = 0
-    total_tokens: int = 0
-    completion_tokens_details: Optional[dict] = None
-    prompt_tokens_details: Optional[dict] = None
-
+from typing import List

 class UrlModel(BaseModel):
    url: HttpUrl
    forced: bool = False

-
-class MarkdownGenerationResult(BaseModel):
-    raw_markdown: str
-    markdown_with_citations: str
-    references_markdown: str
-    fit_markdown: Optional[str] = None
-    fit_html: Optional[str] = None
-
-
-class DispatchResult(BaseModel):
-    task_id: str
-    memory_usage: float
-    peak_memory: float
-    start_time: datetime
-    end_time: datetime
-    error_message: str = ""
-
-
-@dataclass
-class TraversalStats:
-    """Statistics for the traversal process"""
-
-    start_time: datetime
-    urls_processed: int = 0
-    urls_failed: int = 0
-    urls_skipped: int = 0
-    total_depth_reached: int = 0
-    current_depth: int = 0
-
-
 class CrawlResult(BaseModel):
    url: str
    html: str
    success: bool
-    cleaned_html: Optional[str] = None
-    media: Dict[str, List[Dict]] = {}
-    links: Dict[str, List[Dict]] = {}
-    downloaded_files: Optional[List[str]] = None
-    screenshot: Optional[str] = None
-    pdf: Optional[bytes] = None
-    markdown: Optional[Union[str, MarkdownGenerationResult]] = None
-    markdown_v2: Optional[MarkdownGenerationResult] = None
-    fit_markdown: Optional[str] = None
-    fit_html: Optional[str] = None
-    extracted_content: Optional[str] = None
-    metadata: Optional[dict] = None
-    error_message: Optional[str] = None
-    session_id: Optional[str] = None
-    response_headers: Optional[dict] = None
-    status_code: Optional[int] = None
-    ssl_certificate: Optional[SSLCertificate] = None
-    dispatch_result: Optional[DispatchResult] = None
-    redirected_url: Optional[str] = None
-    # Attributes for position
-    depth: Optional[int] = None
-    score: Optional[float] = -inf
-    parent_url: Optional[str] = None
-
-    class Config:
-        arbitrary_types_allowed = True
-
-class AsyncCrawlResponse(BaseModel):
-    html: str
-    response_headers: Dict[str, str]
-    status_code: int
-    screenshot: Optional[str] = None
-    pdf_data: Optional[bytes] = None
-    get_delayed_content: Optional[Callable[[Optional[float]], Awaitable[str]]] = None
-    downloaded_files: Optional[List[str]] = None
-    ssl_certificate: Optional[SSLCertificate] = None
-    redirected_url: Optional[str] = None
-
-    class Config:
-        arbitrary_types_allowed = True
-
-
-###############################
-# Scraping Models
-###############################
-class MediaItem(BaseModel):
-    src: Optional[str] = ""
-    alt: Optional[str] = ""
-    desc: Optional[str] = ""
-    score: Optional[int] = 0
-    type: str = "image"
-    group_id: Optional[int] = 0
-    format: Optional[str] = None
-    width: Optional[int] = None
-
-
-class Link(BaseModel):
-    href: Optional[str] = ""
-    text: Optional[str] = ""
-    title: Optional[str] = ""
-    base_domain: Optional[str] = ""
-
-
-class Media(BaseModel):
-    images: List[MediaItem] = []
-    videos: List[MediaItem] = (
-        []
-    )  # Using MediaItem model for now, can be extended with Video model if needed
-    audios: List[MediaItem] = (
-        []
-    )  # Using MediaItem model for now, can be extended with Audio model if needed
-
-
-class Links(BaseModel):
-    internal: List[Link] = []
-    external: List[Link] = []
-
-
-class ScrapingResult(BaseModel):
-    cleaned_html: str
-    success: bool
-    media: Media = Media()
-    links: Links = Links()
-    metadata: Dict[str, Any] = {}
+    cleaned_html: str = None
+    markdown: str = None
+    extracted_content: str = None
+    metadata: dict = None
+    error_message: str = None
--- a/crawl4ai/models/onnx/config.json
+++ b/crawl4ai/models/onnx/config.json
@@ -0,0 +1,25 @@
+{
+  "_name_or_path": "sentence-transformers/all-MiniLM-L6-v2",
+  "architectures": [
+    "BertModel"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 384,
+  "initializer_range": 0.02,
+  "intermediate_size": 1536,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 6,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "transformers_version": "4.27.4",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 30522
+}
--- a/crawl4ai/models/onnx/model.onnx
+++ b/crawl4ai/models/onnx/model.onnx
--- a/crawl4ai/models/onnx/special_tokens_map.json
+++ b/crawl4ai/models/onnx/special_tokens_map.json
@@ -0,0 +1,7 @@
+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}
--- a/crawl4ai/models/onnx/tokenizer.json
+++ b/crawl4ai/models/onnx/tokenizer.json
--- a/crawl4ai/models/onnx/tokenizer_config.json
+++ b/crawl4ai/models/onnx/tokenizer_config.json
@@ -0,0 +1,15 @@
+{
+  "cls_token": "[CLS]",
+  "do_basic_tokenize": true,
+  "do_lower_case": true,
+  "mask_token": "[MASK]",
+  "model_max_length": 512,
+  "never_split": null,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "special_tokens_map_file": "/Users/hammad/.cache/huggingface/hub/models--sentence-transformers--all-MiniLM-L6-v2/snapshots/7dbbc90392e2f80f3d3c277d6e90027e55de9125/special_tokens_map.json",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}
--- a/crawl4ai/models/onnx/vocab.txt
+++ b/crawl4ai/models/onnx/vocab.txt
--- a/crawl4ai/onnx_embedding.py
+++ b/crawl4ai/onnx_embedding.py
@@ -0,0 +1,50 @@
+# A dependency-light way to run the onnx model
+
+
+import numpy as np
+from typing import List
+import os
+
+__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
+MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
+
+def normalize(v):
+    norm = np.linalg.norm(v, axis=1)
+    norm[norm == 0] = 1e-12
+    return v / norm[:, np.newaxis]
+
+# Sample implementation of the default sentence-transformers model using ONNX
+class DefaultEmbeddingModel():
+
+    def __init__(self):
+        from tokenizers import Tokenizer
+        import onnxruntime as ort
+        # max_seq_length = 256, for some reason sentence-transformers uses 256 even though the HF config has a max length of 128
+        # https://github.com/UKPLab/sentence-transformers/blob/3e1929fddef16df94f8bc6e3b10598a98f46e62d/docs/_static/html/models_en_sentence_embeddings.html#LL480
+        self.tokenizer = Tokenizer.from_file(os.path.join(__location__, "models/onnx/tokenizer.json"))
+        self.tokenizer.enable_truncation(max_length=256)
+        self.tokenizer.enable_padding(pad_id=0, pad_token="[PAD]", length=256)
+        self.model = ort.InferenceSession(os.path.join(__location__, "models/onnx/model.onnx"))
+        
+
+    def __call__(self, documents: List[str], batch_size: int = 32):
+        all_embeddings = []
+        for i in range(0, len(documents), batch_size):
+            batch = documents[i:i + batch_size]
+            encoded = [self.tokenizer.encode(d) for d in batch]
+            input_ids = np.array([e.ids for e in encoded])
+            attention_mask = np.array([e.attention_mask for e in encoded])
+            onnx_input = {
+                "input_ids": np.array(input_ids, dtype=np.int64),
+                "attention_mask": np.array(attention_mask, dtype=np.int64),
+                "token_type_ids": np.zeros_like(input_ids, dtype=np.int64),
+            }
+            model_output = self.model.run(None, onnx_input)
+            last_hidden_state = model_output[0]
+            # Perform mean pooling with attention weighting
+            input_mask_expanded = np.broadcast_to(np.expand_dims(attention_mask, -1), last_hidden_state.shape)
+            embeddings = np.sum(last_hidden_state * input_mask_expanded, 1) / np.clip(input_mask_expanded.sum(1), a_min=1e-9, a_max=None)
+            embeddings = normalize(embeddings).astype(np.float32)
+            all_embeddings.append(embeddings)
+        return np.concatenate(all_embeddings)
+
--- a/crawl4ai/prompts.py
+++ b/crawl4ai/prompts.py
@@ -1,4 +1,4 @@
-PROMPT_EXTRACT_BLOCKS = """Here is the URL of the webpage:
+PROMPT_EXTRACT_BLOCKS = """Here is the URL of the webpage:
 <url>{URL}</url>

 And here is the cleaned HTML content of that webpage:
@@ -29,7 +29,7 @@ To generate the JSON objects:

 5. Make sure the generated JSON is complete and parsable, with no errors or omissions.

-6. Make sure to escape any special characters in the HTML content, and also single or double quote to avoid JSON parsing issues.
+6. Make sure to escape any special characters in the HTML content, as well as single and double quotes, to avoid JSON parsing issues.

 Please provide your output within <blocks> tags, like this:

@@ -79,7 +79,7 @@ To generate the JSON objects:
 2. For each block:
   a. Assign it an index based on its order in the content.
   b. Analyze the content and generate ONE semantic tag that describe what the block is about.
-   c. Extract the text content, EXACTLY SAME AS THE GIVE DATA, clean it up if needed, and store it as a list of strings in the "content" field.
+   c. Extract the text content, EXACTLY THE SAME AS THE GIVEN DATA, clean it up if needed, and store it as a list of strings in the "content" field.

 3. Ensure that the order of the JSON objects matches the order of the blocks as they appear in the original HTML content.

@@ -87,7 +87,7 @@ To generate the JSON objects:

 5. Make sure the generated JSON is complete and parsable, with no errors or omissions.

-6. Make sure to escape any special characters in the HTML content, and also single or double quote to avoid JSON parsing issues.
+6. Make sure to escape any special characters in the HTML content, as well as single and double quotes, to avoid JSON parsing issues.

 7. Never alter the extracted content, just copy and paste it as it is.

@@ -142,7 +142,7 @@ To generate the JSON objects:

 5. Make sure the generated JSON is complete and parsable, with no errors or omissions.

-6. Make sure to escape any special characters in the HTML content, and also single or double quote to avoid JSON parsing issues.
+6. Make sure to escape any special characters in the HTML content, as well as single and double quotes, to avoid JSON parsing issues.

 7. Never alter the extracted content, just copy and paste it as it is.

@@ -164,846 +164,4 @@ Please provide your output within <blocks> tags, like this:

 **Make sure to follow the user instruction to extract blocks aligin with the instruction.**

-Remember, the output should be a complete, parsable JSON wrapped in <blocks> tags, with no omissions or errors. The JSON objects should semantically break down the content into relevant blocks, maintaining the original order."""
-
-PROMPT_EXTRACT_SCHEMA_WITH_INSTRUCTION = """Here is the content from the URL:
-<url>{URL}</url>
-
-<url_content>
-{HTML}
-</url_content>
-
-The user has made the following request for what information to extract from the above content:
-
-<user_request>
-{REQUEST}
-</user_request>
-
-<schema_block>
-{SCHEMA}
-</schema_block>
-
-Please carefully read the URL content and the user's request. If the user provided a desired JSON schema in the <schema_block> above, extract the requested information from the URL content according to that schema. If no schema was provided, infer an appropriate JSON schema based on the user's request that will best capture the key information they are looking for.
-
-Extraction instructions:
-Return the extracted information as a list of JSON objects, with each object in the list corresponding to a block of content from the URL, in the same order as it appears on the page. Wrap the entire JSON list in <blocks>...</blocks> XML tags.
-
-Quality Reflection:
-Before outputting your final answer, double check that the JSON you are returning is complete, containing all the information requested by the user, and is valid JSON that could be parsed by json.loads() with no errors or omissions. The outputted JSON objects should fully match the schema, either provided or inferred.
-
-Quality Score:
-After reflecting, score the quality and completeness of the JSON data you are about to return on a scale of 1 to 5. Write the score inside <score> tags.
-
-Avoid Common Mistakes:
- Do NOT add any comments using "//" or "#" in the JSON output. It causes parsing errors.
- Make sure the JSON is properly formatted with curly braces, square brackets, and commas in the right places.
- Do not miss closing </blocks> tag at the end of the JSON output.
- Do not generate the Python coee show me how to do the task, this is your task to extract the information and return it in JSON format.
-
-Result
-Output the final list of JSON objects, wrapped in <blocks>...</blocks> XML tags. Make sure to close the tag properly."""
-
-
-PROMPT_FILTER_CONTENT = """Your task is to filter and convert HTML content into clean, focused markdown that's optimized for use with LLMs and information retrieval systems.
-
-INPUT HTML: 
-<|HTML_CONTENT_START|>
-{HTML}
-<|HTML_CONTENT_END|>
-
-
-SPECIFIC INSTRUCTION: 
-<|USER_INSTRUCTION_START|>
-{REQUEST}
-<|USER_INSTRUCTION_END|>
-
-TASK DETAILS:
-1. Content Selection
- DO: Keep essential information, main content, key details
- DO: Preserve hierarchical structure using markdown headers
- DO: Keep code blocks, tables, key lists
- DON'T: Include navigation menus, ads, footers, cookie notices
- DON'T: Keep social media widgets, sidebars, related content
-
-2. Content Transformation
- DO: Use proper markdown syntax (#, ##, **, `, etc)
- DO: Convert tables to markdown tables
- DO: Preserve code formatting with ```language blocks
- DO: Maintain link texts but remove tracking parameters
- DON'T: Include HTML tags in output
- DON'T: Keep class names, ids, or other HTML attributes
-
-3. Content Organization
- DO: Maintain logical flow of information
- DO: Group related content under appropriate headers
- DO: Use consistent header levels
- DON'T: Fragment related content
- DON'T: Duplicate information
-
-Example Input:
-<div class="main-content"><h1>Setup Guide</h1><p>Follow these steps...</p></div>
-<div class="sidebar">Related articles...</div>
-
-Example Output:
-# Setup Guide
-Follow these steps...
-
-IMPORTANT: If specific instruction is provided above, prioritize those requirements over these general guidelines.
-
-OUTPUT FORMAT: 
-Wrap your response in <content> tags. Use proper markdown throughout.
-<content>
-[Your markdown content here]
-</content>
-
-Begin filtering now."""
-
-JSON_SCHEMA_BUILDER= """
-# HTML Schema Generation Instructions
-You are a specialized model designed to analyze HTML patterns and generate extraction schemas. Your primary job is to create structured JSON schemas that can be used to extract data from HTML in a consistent and reliable way. When presented with HTML content, you must analyze its structure and generate a schema that captures all relevant data points.
-
-## Your Core Responsibilities:
-1. Analyze HTML structure to identify repeating patterns and important data points
-2. Generate valid JSON schemas following the specified format
-3. Create appropriate selectors that will work reliably for data extraction
-4. Name fields meaningfully based on their content and purpose
-5. Handle both specific user requests and autonomous pattern detection
-
-## Available Schema Types You Can Generate:
-
-<schema_types>
-1. Basic Single-Level Schema
-   - Use for simple, flat data structures
-   - Example: Product cards, user profiles
-   - Direct field extractions
-
-2. Nested Object Schema
-   - Use for hierarchical data
-   - Example: Articles with author details
-   - Contains objects within objects
-
-3. List Schema
-   - Use for repeating elements
-   - Example: Comment sections, product lists
-   - Handles arrays of similar items
-
-4. Complex Nested Lists
-   - Use for multi-level data
-   - Example: Categories with subcategories
-   - Multiple levels of nesting
-
-5. Transformation Schema
-   - Use for data requiring processing
-   - Supports regex and text transformations
-   - Special attribute handling
-</schema_types>
-
-<schema_structure>
-Your output must always be a JSON object with this structure:
-{
-  "name": "Descriptive name of the pattern",
-  "baseSelector": "CSS selector for the repeating element",
-  "fields": [
-    {
-      "name": "field_name",
-      "selector": "CSS selector",
-      "type": "text|attribute|nested|list|regex",
-      "attribute": "attribute_name",  // Optional
-      "transform": "transformation_type",  // Optional
-      "pattern": "regex_pattern",  // Optional
-      "fields": []  // For nested/list types
-    }
-  ]
-}
-</schema_structure>
-
-<type_definitions>
-Available field types:
- text: Direct text extraction
- attribute: HTML attribute extraction
- nested: Object containing other fields
- list: Array of similar items
- regex: Pattern-based extraction
-</type_definitions>
-
-<behavior_rules>
-1. When given a specific query:
-   - Focus on extracting requested data points
-   - Use most specific selectors possible
-   - Include all fields mentioned in the query
-
-2. When no query is provided:
-   - Identify main content areas
-   - Extract all meaningful data points
-   - Use semantic structure to determine importance
-   - Include prices, dates, titles, and other common data types
-
-3. Always:
-   - Use reliable CSS selectors
-   - Handle dynamic class names appropriately
-   - Create descriptive field names
-   - Follow consistent naming conventions
-</behavior_rules>
-
-<examples>
-1. Basic Product Card Example:
-<html>
-<div class="product-card" data-cat-id="electronics" data-subcat-id="laptops">
-  <h2 class="product-title">Gaming Laptop</h2>
-  <span class="price">$999.99</span>
-  <img src="laptop.jpg" alt="Gaming Laptop">
-</div>
-</html>
-
-Generated Schema:
-{
-  "name": "Product Cards",
-  "baseSelector": ".product-card",
-  "baseFields": [
-    {"name": "data_cat_id", "type": "attribute", "attribute": "data-cat-id"},
-    {"name": "data_subcat_id", "type": "attribute", "attribute": "data-subcat-id"}
-  ],
-  "fields": [
-    {
-      "name": "title",
-      "selector": ".product-title",
-      "type": "text"
-    },
-    {
-      "name": "price",
-      "selector": ".price",
-      "type": "text"
-    },
-    {
-      "name": "image_url",
-      "selector": "img",
-      "type": "attribute",
-      "attribute": "src"
-    }
-  ]
-}
-
-2. Article with Author Details Example:
-<html>
-<article>
-  <h1>The Future of AI</h1>
-  <div class="author-info">
-    <span class="author-name">Dr. Smith</span>
-    <img src="author.jpg" alt="Dr. Smith">
-  </div>
-</article>
-</html>
-
-Generated Schema:
-{
-  "name": "Article Details",
-  "baseSelector": "article",
-  "fields": [
-    {
-      "name": "title",
-      "selector": "h1",
-      "type": "text"
-    },
-    {
-      "name": "author",
-      "type": "nested",
-      "selector": ".author-info",
-      "fields": [
-        {
-          "name": "name",
-          "selector": ".author-name",
-          "type": "text"
-        },
-        {
-          "name": "avatar",
-          "selector": "img",
-          "type": "attribute",
-          "attribute": "src"
-        }
-      ]
-    }
-  ]
-}
-
-3. Comments Section Example:
-<html>
-<div class="comments-container">
-  <div class="comment" data-user-id="123">
-    <div class="user-name">John123</div>
-    <p class="comment-text">Great article!</p>
-  </div>
-  <div class="comment" data-user-id="456">
-    <div class="user-name">Alice456</div>
-    <p class="comment-text">Thanks for sharing.</p>
-  </div>
-</div>
-</html>
-
-Generated Schema:
-{
-  "name": "Comment Section",
-  "baseSelector": ".comments-container",
-  "baseFields": [
-    {"name": "data_user_id", "type": "attribute", "attribute": "data-user-id"}
-  ],
-  "fields": [
-    {
-      "name": "comments",
-      "type": "list",
-      "selector": ".comment",
-      "fields": [
-        {
-          "name": "user",
-          "selector": ".user-name",
-          "type": "text"
-        },
-        {
-          "name": "content",
-          "selector": ".comment-text",
-          "type": "text"
-        }
-      ]
-    }
-  ]
-}
-
-4. E-commerce Categories Example:
-<html>
-<div class="category-section" data-category="electronics">
-  <h2>Electronics</h2>
-  <div class="subcategory">
-    <h3>Laptops</h3>
-    <div class="product">
-      <span class="product-name">MacBook Pro</span>
-      <span class="price">$1299</span>
-    </div>
-    <div class="product">
-      <span class="product-name">Dell XPS</span>
-      <span class="price">$999</span>
-    </div>
-  </div>
-</div>
-</html>
-
-Generated Schema:
-{
-  "name": "E-commerce Categories",
-  "baseSelector": ".category-section",
-  "baseFields": [
-    {"name": "data_category", "type": "attribute", "attribute": "data-category"}
-  ],
-  "fields": [
-    {
-      "name": "category_name",
-      "selector": "h2",
-      "type": "text"
-    },
-    {
-      "name": "subcategories",
-      "type": "nested_list",
-      "selector": ".subcategory",
-      "fields": [
-        {
-          "name": "name",
-          "selector": "h3",
-          "type": "text"
-        },
-        {
-          "name": "products",
-          "type": "list",
-          "selector": ".product",
-          "fields": [
-            {
-              "name": "name",
-              "selector": ".product-name",
-              "type": "text"
-            },
-            {
-              "name": "price",
-              "selector": ".price",
-              "type": "text"
-            }
-          ]
-        }
-      ]
-    }
-  ]
-}
-
-5. Job Listings with Transformations Example:
-<html>
-<div class="job-post">
-  <h3 class="job-title">Senior Developer</h3>
-  <span class="salary-text">Salary: $120,000/year</span>
-  <span class="location">  New York, NY  </span>
-</div>
-</html>
-
-Generated Schema:
-{
-  "name": "Job Listings",
-  "baseSelector": ".job-post",
-  "fields": [
-    {
-      "name": "title",
-      "selector": ".job-title",
-      "type": "text",
-      "transform": "uppercase"
-    },
-    {
-      "name": "salary",
-      "selector": ".salary-text",
-      "type": "regex",
-      "pattern": "\\$([\\d,]+)"
-    },
-    {
-      "name": "location",
-      "selector": ".location",
-      "type": "text",
-      "transform": "strip"
-    }
-  ]
-}
-
-6. Skyscanner Place Card Example:
-<html>
-<div class="PlaceCard_descriptionContainer__M2NjN" data-testid="description-container">
-  <div class="PlaceCard_nameContainer__ZjZmY" tabindex="0" role="link">
-    <div class="PlaceCard_nameContent__ODUwZ">
-      <span class="BpkText_bpk-text__MjhhY BpkText_bpk-text--heading-4__Y2FlY">Doha</span>
-    </div>
-    <span class="BpkText_bpk-text__MjhhY BpkText_bpk-text--heading-4__Y2FlY PlaceCard_subName__NTVkY">Qatar</span>
-  </div>
-  <span class="PlaceCard_advertLabel__YTM0N">Sunny days and the warmest welcome awaits</span>
-  <a class="BpkLink_bpk-link__MmQwY PlaceCard_descriptionLink__NzYwN" href="/flights/del/doha/" data-testid="flights-link">
-    <div class="PriceDescription_container__NjEzM">
-      <span class="BpkText_bpk-text--heading-5__MTRjZ">₹17,559</span>
-    </div>
-  </a>
-</div>
-</html>
-
-Generated Schema:
-{
-  "name": "Skyscanner Place Cards",
-  "baseSelector": "div[class^='PlaceCard_descriptionContainer__']",
-  "baseFields": [
-    {"name": "data_testid", "type": "attribute", "attribute": "data-testid"}
-  ],
-  "fields": [
-    {
-      "name": "city_name",
-      "selector": "div[class^='PlaceCard_nameContent__'] .BpkText_bpk-text--heading-4__",
-      "type": "text"
-    },
-    {
-      "name": "country_name",
-      "selector": "span[class*='PlaceCard_subName__']",
-      "type": "text"
-    },
-    {
-      "name": "description",
-      "selector": "span[class*='PlaceCard_advertLabel__']",
-      "type": "text"
-    },
-    {
-      "name": "flight_price",
-      "selector": "a[data-testid='flights-link'] .BpkText_bpk-text--heading-5__",
-      "type": "text"
-    },
-    {
-      "name": "flight_url",
-      "selector": "a[data-testid='flights-link']",
-      "type": "attribute",
-      "attribute": "href"
-    }
-  ]
-}
-</examples>
-
-
-<output_requirements>
-Your output must:
-1. Be valid JSON only
-2. Include no explanatory text
-3. Follow the exact schema structure provided
-4. Use appropriate field types
-5. Include all required fields
-6. Use valid CSS selectors
-</output_requirements>
-
-"""
-
-JSON_SCHEMA_BUILDER_XPATH = """
-# HTML Schema Generation Instructions
-You are a specialized model designed to analyze HTML patterns and generate extraction schemas. Your primary job is to create structured JSON schemas that can be used to extract data from HTML in a consistent and reliable way. When presented with HTML content, you must analyze its structure and generate a schema that captures all relevant data points.
-
-## Your Core Responsibilities:
-1. Analyze HTML structure to identify repeating patterns and important data points
-2. Generate valid JSON schemas following the specified format
-3. Create appropriate XPath selectors that will work reliably for data extraction
-4. Name fields meaningfully based on their content and purpose
-5. Handle both specific user requests and autonomous pattern detection
-
-## Available Schema Types You Can Generate:
-
-<schema_types>
-1. Basic Single-Level Schema
-  - Use for simple, flat data structures
-  - Example: Product cards, user profiles
-  - Direct field extractions
-
-2. Nested Object Schema
-  - Use for hierarchical data
-  - Example: Articles with author details
-  - Contains objects within objects
-
-3. List Schema
-  - Use for repeating elements
-  - Example: Comment sections, product lists
-  - Handles arrays of similar items
-
-4. Complex Nested Lists
-  - Use for multi-level data
-  - Example: Categories with subcategories
-  - Multiple levels of nesting
-
-5. Transformation Schema
-  - Use for data requiring processing
-  - Supports regex and text transformations
-  - Special attribute handling
-</schema_types>
-
-<schema_structure>
-Your output must always be a JSON object with this structure:
-{
- "name": "Descriptive name of the pattern",
- "baseSelector": "XPath selector for the repeating element",
- "fields": [
-   {
-     "name": "field_name",
-     "selector": "XPath selector",
-     "type": "text|attribute|nested|list|regex",
-     "attribute": "attribute_name",  // Optional
-     "transform": "transformation_type",  // Optional
-     "pattern": "regex_pattern",  // Optional
-     "fields": []  // For nested/list types
-   }
- ]
-}
-</schema_structure>
-
-<type_definitions>
-Available field types:
- text: Direct text extraction
- attribute: HTML attribute extraction
- nested: Object containing other fields
- list: Array of similar items
- regex: Pattern-based extraction
-</type_definitions>
-
-<behavior_rules>
-1. When given a specific query:
-  - Focus on extracting requested data points
-  - Use most specific selectors possible
-  - Include all fields mentioned in the query
-
-2. When no query is provided:
-  - Identify main content areas
-  - Extract all meaningful data points
-  - Use semantic structure to determine importance
-  - Include prices, dates, titles, and other common data types
-
-3. Always:
-  - Use reliable XPath selectors
-  - Handle dynamic element IDs appropriately
-  - Create descriptive field names
-  - Follow consistent naming conventions
-</behavior_rules>
-
-<examples>
-1. Basic Product Card Example:
-<html>
-<div class="product-card" data-cat-id="electronics" data-subcat-id="laptops">
- <h2 class="product-title">Gaming Laptop</h2>
- <span class="price">$999.99</span>
- <img src="laptop.jpg" alt="Gaming Laptop">
-</div>
-</html>
-
-Generated Schema:
-{
- "name": "Product Cards",
- "baseSelector": "//div[@class='product-card']",
- "baseFields": [
-   {"name": "data_cat_id", "type": "attribute", "attribute": "data-cat-id"},
-   {"name": "data_subcat_id", "type": "attribute", "attribute": "data-subcat-id"}
- ],
- "fields": [
-   {
-     "name": "title",
-     "selector": ".//h2[@class='product-title']",
-     "type": "text"
-   },
-   {
-     "name": "price",
-     "selector": ".//span[@class='price']",
-     "type": "text"
-   },
-   {
-     "name": "image_url",
-     "selector": ".//img",
-     "type": "attribute",
-     "attribute": "src"
-   }
- ]
-}
-
-2. Article with Author Details Example:
-<html>
-<article>
- <h1>The Future of AI</h1>
- <div class="author-info">
-   <span class="author-name">Dr. Smith</span>
-   <img src="author.jpg" alt="Dr. Smith">
- </div>
-</article>
-</html>
-
-Generated Schema:
-{
- "name": "Article Details",
- "baseSelector": "//article",
- "fields": [
-   {
-     "name": "title",
-     "selector": ".//h1",
-     "type": "text"
-   },
-   {
-     "name": "author",
-     "type": "nested",
-     "selector": ".//div[@class='author-info']",
-     "fields": [
-       {
-         "name": "name",
-         "selector": ".//span[@class='author-name']",
-         "type": "text"
-       },
-       {
-         "name": "avatar",
-         "selector": ".//img",
-         "type": "attribute",
-         "attribute": "src"
-       }
-     ]
-   }
- ]
-}
-
-3. Comments Section Example:
-<html>
-<div class="comments-container">
- <div class="comment" data-user-id="123">
-   <div class="user-name">John123</div>
-   <p class="comment-text">Great article!</p>
- </div>
- <div class="comment" data-user-id="456">
-   <div class="user-name">Alice456</div>
-   <p class="comment-text">Thanks for sharing.</p>
- </div>
-</div>
-</html>
-
-Generated Schema:
-{
- "name": "Comment Section",
- "baseSelector": "//div[@class='comments-container']",
- "fields": [
-   {
-     "name": "comments",
-     "type": "list",
-     "selector": ".//div[@class='comment']",
-     "baseFields": [
-       {"name": "data_user_id", "type": "attribute", "attribute": "data-user-id"}
-     ],
-     "fields": [
-       {
-         "name": "user",
-         "selector": ".//div[@class='user-name']",
-         "type": "text"
-       },
-       {
-         "name": "content",
-         "selector": ".//p[@class='comment-text']",
-         "type": "text"
-       }
-     ]
-   }
- ]
-}
-
-4. E-commerce Categories Example:
-<html>
-<div class="category-section" data-category="electronics">
- <h2>Electronics</h2>
- <div class="subcategory">
-   <h3>Laptops</h3>
-   <div class="product">
-     <span class="product-name">MacBook Pro</span>
-     <span class="price">$1299</span>
-   </div>
-   <div class="product">
-     <span class="product-name">Dell XPS</span>
-     <span class="price">$999</span>
-   </div>
- </div>
-</div>
-</html>
-
-Generated Schema:
-{
- "name": "E-commerce Categories",
- "baseSelector": "//div[@class='category-section']",
- "baseFields": [
-   {"name": "data_category", "type": "attribute", "attribute": "data-category"}
- ],
- "fields": [
-   {
-     "name": "category_name",
-     "selector": ".//h2",
-     "type": "text"
-   },
-   {
-     "name": "subcategories",
-     "type": "nested_list",
-     "selector": ".//div[@class='subcategory']",
-     "fields": [
-       {
-         "name": "name",
-         "selector": ".//h3",
-         "type": "text"
-       },
-       {
-         "name": "products",
-         "type": "list",
-         "selector": ".//div[@class='product']",
-         "fields": [
-           {
-             "name": "name",
-             "selector": ".//span[@class='product-name']",
-             "type": "text"
-           },
-           {
-             "name": "price",
-             "selector": ".//span[@class='price']",
-             "type": "text"
-           }
-         ]
-       }
-     ]
-   }
- ]
-}
-
-5. Job Listings with Transformations Example:
-<html>
-<div class="job-post">
- <h3 class="job-title">Senior Developer</h3>
- <span class="salary-text">Salary: $120,000/year</span>
- <span class="location">  New York, NY  </span>
-</div>
-</html>
-
-Generated Schema:
-{
- "name": "Job Listings",
- "baseSelector": "//div[@class='job-post']",
- "fields": [
-   {
-     "name": "title",
-     "selector": ".//h3[@class='job-title']",
-     "type": "text",
-     "transform": "uppercase"
-   },
-   {
-     "name": "salary",
-     "selector": ".//span[@class='salary-text']",
-     "type": "regex",
-     "pattern": "\\$([\\d,]+)"
-   },
-   {
-     "name": "location",
-     "selector": ".//span[@class='location']",
-     "type": "text",
-     "transform": "strip"
-   }
- ]
-}
-
-6. Skyscanner Place Card Example:
-<html>
-<div class="PlaceCard_descriptionContainer__M2NjN" data-testid="description-container">
- <div class="PlaceCard_nameContainer__ZjZmY" tabindex="0" role="link">
-   <div class="PlaceCard_nameContent__ODUwZ">
-     <span class="BpkText_bpk-text__MjhhY BpkText_bpk-text--heading-4__Y2FlY">Doha</span>
-   </div>
-   <span class="BpkText_bpk-text__MjhhY BpkText_bpk-text--heading-4__Y2FlY PlaceCard_subName__NTVkY">Qatar</span>
- </div>
- <span class="PlaceCard_advertLabel__YTM0N">Sunny days and the warmest welcome awaits</span>
- <a class="BpkLink_bpk-link__MmQwY PlaceCard_descriptionLink__NzYwN" href="/flights/del/doha/" data-testid="flights-link">
-   <div class="PriceDescription_container__NjEzM">
-     <span class="BpkText_bpk-text--heading-5__MTRjZ">₹17,559</span>
-   </div>
- </a>
-</div>
-</html>
-
-Generated Schema:
-{
- "name": "Skyscanner Place Cards",
- "baseSelector": "//div[contains(@class, 'PlaceCard_descriptionContainer__')]",
- "baseFields": [
-   {"name": "data_testid", "type": "attribute", "attribute": "data-testid"}
- ],
- "fields": [
-   {
-     "name": "city_name",
-     "selector": ".//div[contains(@class, 'PlaceCard_nameContent__')]//span[contains(@class, 'BpkText_bpk-text--heading-4__')]",
-     "type": "text"
-   },
-   {
-     "name": "country_name",
-     "selector": ".//span[contains(@class, 'PlaceCard_subName__')]",
-     "type": "text"
-   },
-   {
-     "name": "description",
-     "selector": ".//span[contains(@class, 'PlaceCard_advertLabel__')]",
-     "type": "text"
-   },
-   {
-     "name": "flight_price",
-     "selector": ".//a[@data-testid='flights-link']//span[contains(@class, 'BpkText_bpk-text--heading-5__')]",
-     "type": "text"
-   },
-   {
-     "name": "flight_url",
-     "selector": ".//a[@data-testid='flights-link']",
-     "type": "attribute",
-     "attribute": "href"
-   }
- ]
-}
-</examples>
-
-<output_requirements>
-Your output must:
-1. Be valid JSON only
-2. Include no explanatory text
-3. Follow the exact schema structure provided
-4. Use appropriate field types
-5. Include all required fields
-6. Use valid XPath selectors
-</output_requirements>
-"""
+Remember, the output should be a complete, parsable JSON wrapped in <blocks> tags, with no omissions or errors. The JSON objects should semantically break down the content into relevant blocks, maintaining the original order."""
--- a/crawl4ai/ssl_certificate.py
+++ b/crawl4ai/ssl_certificate.py
@@ -1,184 +0,0 @@
-"""SSL Certificate class for handling certificate operations."""
-
-import ssl
-import socket
-import base64
-import json
-from typing import Dict, Any, Optional
-from urllib.parse import urlparse
-import OpenSSL.crypto
-from pathlib import Path
-
-
-class SSLCertificate:
-    """
-    A class representing an SSL certificate with methods to export in various formats.
-
-    Attributes:
-        cert_info (Dict[str, Any]): The certificate information.
-
-        Methods:
-            from_url(url: str, timeout: int = 10) -> Optional['SSLCertificate']: Create SSLCertificate instance from a URL.
-            from_file(file_path: str) -> Optional['SSLCertificate']: Create SSLCertificate instance from a file.
-            from_binary(binary_data: bytes) -> Optional['SSLCertificate']: Create SSLCertificate instance from binary data.
-            export_as_pem() -> str: Export the certificate as PEM format.
-            export_as_der() -> bytes: Export the certificate as DER format.
-            export_as_json() -> Dict[str, Any]: Export the certificate as JSON format.
-            export_as_text() -> str: Export the certificate as text format.
-    """
-
-    def __init__(self, cert_info: Dict[str, Any]):
-        self._cert_info = self._decode_cert_data(cert_info)
-
-    @staticmethod
-    def from_url(url: str, timeout: int = 10) -> Optional["SSLCertificate"]:
-        """
-        Create SSLCertificate instance from a URL.
-
-        Args:
-            url (str): URL of the website.
-            timeout (int): Timeout for the connection (default: 10).
-
-        Returns:
-            Optional[SSLCertificate]: SSLCertificate instance if successful, None otherwise.
-        """
-        try:
-            hostname = urlparse(url).netloc
-            if ":" in hostname:
-                hostname = hostname.split(":")[0]
-
-            context = ssl.create_default_context()
-            with socket.create_connection((hostname, 443), timeout=timeout) as sock:
-                with context.wrap_socket(sock, server_hostname=hostname) as ssock:
-                    cert_binary = ssock.getpeercert(binary_form=True)
-                    x509 = OpenSSL.crypto.load_certificate(
-                        OpenSSL.crypto.FILETYPE_ASN1, cert_binary
-                    )
-
-                    cert_info = {
-                        "subject": dict(x509.get_subject().get_components()),
-                        "issuer": dict(x509.get_issuer().get_components()),
-                        "version": x509.get_version(),
-                        "serial_number": hex(x509.get_serial_number()),
-                        "not_before": x509.get_notBefore(),
-                        "not_after": x509.get_notAfter(),
-                        "fingerprint": x509.digest("sha256").hex(),
-                        "signature_algorithm": x509.get_signature_algorithm(),
-                        "raw_cert": base64.b64encode(cert_binary),
-                    }
-
-                    # Add extensions
-                    extensions = []
-                    for i in range(x509.get_extension_count()):
-                        ext = x509.get_extension(i)
-                        extensions.append(
-                            {"name": ext.get_short_name(), "value": str(ext)}
-                        )
-                    cert_info["extensions"] = extensions
-
-                    return SSLCertificate(cert_info)
-
-        except Exception:
-            return None
-
-    @staticmethod
-    def _decode_cert_data(data: Any) -> Any:
-        """Helper method to decode bytes in certificate data."""
-        if isinstance(data, bytes):
-            return data.decode("utf-8")
-        elif isinstance(data, dict):
-            return {
-                (
-                    k.decode("utf-8") if isinstance(k, bytes) else k
-                ): SSLCertificate._decode_cert_data(v)
-                for k, v in data.items()
-            }
-        elif isinstance(data, list):
-            return [SSLCertificate._decode_cert_data(item) for item in data]
-        return data
-
-    def to_json(self, filepath: Optional[str] = None) -> Optional[str]:
-        """
-        Export certificate as JSON.
-
-        Args:
-            filepath (Optional[str]): Path to save the JSON file (default: None).
-
-        Returns:
-            Optional[str]: JSON string if successful, None otherwise.
-        """
-        json_str = json.dumps(self._cert_info, indent=2, ensure_ascii=False)
-        if filepath:
-            Path(filepath).write_text(json_str, encoding="utf-8")
-            return None
-        return json_str
-
-    def to_pem(self, filepath: Optional[str] = None) -> Optional[str]:
-        """
-        Export certificate as PEM.
-
-        Args:
-            filepath (Optional[str]): Path to save the PEM file (default: None).
-
-        Returns:
-            Optional[str]: PEM string if successful, None otherwise.
-        """
-        try:
-            x509 = OpenSSL.crypto.load_certificate(
-                OpenSSL.crypto.FILETYPE_ASN1,
-                base64.b64decode(self._cert_info["raw_cert"]),
-            )
-            pem_data = OpenSSL.crypto.dump_certificate(
-                OpenSSL.crypto.FILETYPE_PEM, x509
-            ).decode("utf-8")
-
-            if filepath:
-                Path(filepath).write_text(pem_data, encoding="utf-8")
-                return None
-            return pem_data
-        except Exception:
-            return None
-
-    def to_der(self, filepath: Optional[str] = None) -> Optional[bytes]:
-        """
-        Export certificate as DER.
-
-        Args:
-            filepath (Optional[str]): Path to save the DER file (default: None).
-
-        Returns:
-            Optional[bytes]: DER bytes if successful, None otherwise.
-        """
-        try:
-            der_data = base64.b64decode(self._cert_info["raw_cert"])
-            if filepath:
-                Path(filepath).write_bytes(der_data)
-                return None
-            return der_data
-        except Exception:
-            return None
-
-    @property
-    def issuer(self) -> Dict[str, str]:
-        """Get certificate issuer information."""
-        return self._cert_info.get("issuer", {})
-
-    @property
-    def subject(self) -> Dict[str, str]:
-        """Get certificate subject information."""
-        return self._cert_info.get("subject", {})
-
-    @property
-    def valid_from(self) -> str:
-        """Get certificate validity start date."""
-        return self._cert_info.get("not_before", "")
-
-    @property
-    def valid_until(self) -> str:
-        """Get certificate validity end date."""
-        return self._cert_info.get("not_after", "")
-
-    @property
-    def fingerprint(self) -> str:
-        """Get certificate fingerprint."""
-        return self._cert_info.get("fingerprint", "")
--- a/crawl4ai/train.py
+++ b/crawl4ai/train.py
@@ -0,0 +1,146 @@
+import spacy
+from spacy.training import Example
+import random
+import nltk
+from nltk.corpus import reuters
+import torch
+
+def save_spacy_model_as_torch(nlp, model_dir="models/reuters"):
+    # Extract the TextCategorizer component
+    textcat = nlp.get_pipe("textcat_multilabel")
+
+    # Convert the weights to a PyTorch state dictionary
+    state_dict = {name: torch.tensor(param.data) for name, param in textcat.model.named_parameters()}
+
+    # Save the state dictionary
+    torch.save(state_dict, f"{model_dir}/model_weights.pth")
+
+    # Extract and save the vocabulary
+    vocab = extract_vocab(nlp)
+    with open(f"{model_dir}/vocab.txt", "w") as vocab_file:
+        for word, idx in vocab.items():
+            vocab_file.write(f"{word}\t{idx}\n")
+    
+    print(f"Model weights and vocabulary saved to: {model_dir}")
+
+def extract_vocab(nlp):
+    # Extract vocabulary from the SpaCy model
+    vocab = {word: i for i, word in enumerate(nlp.vocab.strings)}
+    return vocab
+
+nlp = spacy.load("models/reuters")
+save_spacy_model_as_torch(nlp, model_dir="models")
+
+def train_and_save_reuters_model(model_dir="models/reuters"):
+    # Ensure the Reuters corpus is downloaded
+    nltk.download('reuters')
+    nltk.download('punkt')
+    if not reuters.fileids():
+        print("Reuters corpus not found.")
+        return
+
+    # Load a blank English spaCy model
+    nlp = spacy.blank("en")
+
+    # Create a TextCategorizer with the ensemble model for multi-label classification
+    textcat = nlp.add_pipe("textcat_multilabel")
+
+    # Add labels to text classifier
+    for label in reuters.categories():
+        textcat.add_label(label)
+
+    # Prepare training data
+    train_examples = []
+    for fileid in reuters.fileids():
+        categories = reuters.categories(fileid)
+        text = reuters.raw(fileid)
+        cats = {label: label in categories for label in reuters.categories()}
+        # Prepare spacy Example objects
+        doc = nlp.make_doc(text)
+        example = Example.from_dict(doc, {'cats': cats})
+        train_examples.append(example)
+
+    # Initialize the text categorizer with the example objects
+    nlp.initialize(lambda: train_examples)
+
+    # Train the model
+    random.seed(1)
+    spacy.util.fix_random_seed(1)
+    for i in range(5):  # Adjust iterations for better accuracy
+        random.shuffle(train_examples)
+        losses = {}
+        # Create batches of data
+        batches = spacy.util.minibatch(train_examples, size=8)
+        for batch in batches:
+            nlp.update(batch, drop=0.2, losses=losses)
+        print(f"Losses at iteration {i}: {losses}")
+
+    # Save the trained model
+    nlp.to_disk(model_dir)
+    print(f"Model saved to: {model_dir}")
+
+def train_model(model_dir, additional_epochs=0):
+    # Load the model if it exists, otherwise start with a blank model
+    try:
+        nlp = spacy.load(model_dir)
+        print("Model loaded from disk.")
+    except IOError:
+        print("No existing model found. Starting with a new model.")
+        nlp = spacy.blank("en")
+        textcat = nlp.add_pipe("textcat_multilabel")
+        for label in reuters.categories():
+            textcat.add_label(label)
+
+    # Prepare training data
+    train_examples = []
+    for fileid in reuters.fileids():
+        categories = reuters.categories(fileid)
+        text = reuters.raw(fileid)
+        cats = {label: label in categories for label in reuters.categories()}
+        doc = nlp.make_doc(text)
+        example = Example.from_dict(doc, {'cats': cats})
+        train_examples.append(example)
+
+    # Initialize the model if it was newly created
+    if 'textcat_multilabel' not in nlp.pipe_names:
+        nlp.initialize(lambda: train_examples)
+    else:
+        print("Continuing training with existing model.")
+
+    # Train the model
+    random.seed(1)
+    spacy.util.fix_random_seed(1)
+    num_epochs = 5 + additional_epochs
+    for i in range(num_epochs):
+        random.shuffle(train_examples)
+        losses = {}
+        batches = spacy.util.minibatch(train_examples, size=8)
+        for batch in batches:
+            nlp.update(batch, drop=0.2, losses=losses)
+        print(f"Losses at iteration {i}: {losses}")
+
+    # Save the trained model
+    nlp.to_disk(model_dir)
+    print(f"Model saved to: {model_dir}")
+
+def load_model_and_predict(model_dir, text, tok_k = 3):
+    # Load the trained model from the specified directory
+    nlp = spacy.load(model_dir)
+    
+    # Process the text with the loaded model
+    doc = nlp(text)
+    
+    # Get the top tok_k categories, sorted by score in descending order
+    top_categories = sorted(doc.cats.items(), key=lambda x: x[1], reverse=True)[:tok_k]
+    print(f"Top {tok_k} categories:")
+    
+    return top_categories    
+
+if __name__ == "__main__":
+    train_and_save_reuters_model()
+    train_model("models/reuters", additional_epochs=5)
+    model_directory = "reuters_model_10"
+    print(reuters.categories())
+    example_text = "Apple Inc. is reportedly buying a startup for $1 billion"
+    r = load_model_and_predict(model_directory, example_text)
+    print(r)
--- a/crawl4ai/user_agent_generator.py
+++ b/crawl4ai/user_agent_generator.py
@@ -1,429 +0,0 @@
-import random
-from typing import Optional, Literal, List, Dict, Tuple
-import re
-
-from abc import ABC, abstractmethod
-import random
-from fake_useragent import UserAgent
-import requests
-from lxml import html
-import json
-from typing import Optional, List, Union, Dict
-
-class UAGen(ABC):
-   @abstractmethod
-   def generate(self, 
-               browsers: Optional[List[str]] = None,
-               os: Optional[Union[str, List[str]]] = None,
-               min_version: float = 0.0,
-               platforms: Optional[Union[str, List[str]]] = None,
-               pct_threshold: Optional[float] = None,
-               fallback: str = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36") -> Union[str, Dict]:
-       pass
-   
-   @staticmethod
-   def generate_client_hints( user_agent: str) -> str:
-        """Generate Sec-CH-UA header value based on user agent string"""
-        def _parse_user_agent(user_agent: str) -> Dict[str, str]:
-            """Parse a user agent string to extract browser and version information"""
-            browsers = {
-                "chrome": r"Chrome/(\d+)",
-                "edge": r"Edg/(\d+)",
-                "safari": r"Version/(\d+)",
-                "firefox": r"Firefox/(\d+)",
-            }
-
-            result = {}
-            for browser, pattern in browsers.items():
-                match = re.search(pattern, user_agent)
-                if match:
-                    result[browser] = match.group(1)
-
-            return result
-        browsers = _parse_user_agent(user_agent)
-
-        # Client hints components
-        hints = []
-
-        # Handle different browser combinations
-        if "chrome" in browsers:
-            hints.append(f'"Chromium";v="{browsers["chrome"]}"')
-            hints.append('"Not_A Brand";v="8"')
-
-            if "edge" in browsers:
-                hints.append(f'"Microsoft Edge";v="{browsers["edge"]}"')
-            else:
-                hints.append(f'"Google Chrome";v="{browsers["chrome"]}"')
-
-        elif "firefox" in browsers:
-            # Firefox doesn't typically send Sec-CH-UA
-            return '""'
-
-        elif "safari" in browsers:
-            # Safari's format for client hints
-            hints.append(f'"Safari";v="{browsers["safari"]}"')
-            hints.append('"Not_A Brand";v="8"')
-
-        return ", ".join(hints)
-
-class ValidUAGenerator(UAGen):
-   def __init__(self):
-       self.ua = UserAgent()
-       
-   def generate(self,
-               browsers: Optional[List[str]] = None,
-               os: Optional[Union[str, List[str]]] = None, 
-               min_version: float = 0.0,
-               platforms: Optional[Union[str, List[str]]] = None,
-               pct_threshold: Optional[float] = None,
-               fallback: str = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36") -> str:
-       
-       self.ua = UserAgent(
-           browsers=browsers or ['Chrome', 'Firefox', 'Edge'],
-           os=os or ['Windows', 'Mac OS X'],
-           min_version=min_version,
-           platforms=platforms or ['desktop'],
-           fallback=fallback
-       )
-       return self.ua.random
-
-class OnlineUAGenerator(UAGen):
-   def __init__(self):
-       self.agents = []
-       self._fetch_agents()
-       
-   def _fetch_agents(self):
-       try:
-           response = requests.get(
-               'https://www.useragents.me/',
-               timeout=5,
-               headers={'Accept': 'text/html,application/xhtml+xml'}
-           )
-           response.raise_for_status()
-           
-           tree = html.fromstring(response.content)
-           json_text = tree.cssselect('#most-common-desktop-useragents-json-csv > div:nth-child(1) > textarea')[0].text
-           self.agents = json.loads(json_text)
-       except Exception as e:
-           print(f"Error fetching agents: {e}")
-           
-   def generate(self,
-               browsers: Optional[List[str]] = None,
-               os: Optional[Union[str, List[str]]] = None,
-               min_version: float = 0.0,
-               platforms: Optional[Union[str, List[str]]] = None, 
-               pct_threshold: Optional[float] = None,
-               fallback: str = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36") -> Dict:
-       
-       if not self.agents:
-           self._fetch_agents()
-           
-       filtered_agents = self.agents
-       
-       if pct_threshold:
-           filtered_agents = [a for a in filtered_agents if a['pct'] >= pct_threshold]
-           
-       if browsers:
-           filtered_agents = [a for a in filtered_agents 
-                            if any(b.lower() in a['ua'].lower() for b in browsers)]
-           
-       if os:
-           os_list = [os] if isinstance(os, str) else os
-           filtered_agents = [a for a in filtered_agents 
-                            if any(o.lower() in a['ua'].lower() for o in os_list)]
-           
-       if platforms:
-           platform_list = [platforms] if isinstance(platforms, str) else platforms
-           filtered_agents = [a for a in filtered_agents 
-                            if any(p.lower() in a['ua'].lower() for p in platform_list)]
-           
-       return filtered_agents[0] if filtered_agents else {'ua': fallback, 'pct': 0}
-
-
-
-class UserAgentGenerator():
-    """
-    Generate random user agents with specified constraints.
-
-    Attributes:
-        desktop_platforms (dict): A dictionary of possible desktop platforms and their corresponding user agent strings.
-        mobile_platforms (dict): A dictionary of possible mobile platforms and their corresponding user agent strings.
-        browser_combinations (dict): A dictionary of possible browser combinations and their corresponding user agent strings.
-        rendering_engines (dict): A dictionary of possible rendering engines and their corresponding user agent strings.
-        chrome_versions (list): A list of possible Chrome browser versions.
-        firefox_versions (list): A list of possible Firefox browser versions.
-        edge_versions (list): A list of possible Edge browser versions.
-        safari_versions (list): A list of possible Safari browser versions.
-        ios_versions (list): A list of possible iOS browser versions.
-        android_versions (list): A list of possible Android browser versions.
-
-        Methods:
-            generate_user_agent(
-                platform: Literal["desktop", "mobile"] = "desktop",
-                browser: str = "chrome",
-                rendering_engine: str = "chrome_webkit",
-                chrome_version: Optional[str] = None,
-                firefox_version: Optional[str] = None,
-                edge_version: Optional[str] = None,
-                safari_version: Optional[str] = None,
-                ios_version: Optional[str] = None,
-                android_version: Optional[str] = None
-            ): Generates a random user agent string based on the specified parameters.
-    """
-
-    def __init__(self):
-        # Previous platform definitions remain the same...
-        self.desktop_platforms = {
-            "windows": {
-                "10_64": "(Windows NT 10.0; Win64; x64)",
-                "10_32": "(Windows NT 10.0; WOW64)",
-            },
-            "macos": {
-                "intel": "(Macintosh; Intel Mac OS X 10_15_7)",
-                "newer": "(Macintosh; Intel Mac OS X 10.15; rv:109.0)",
-            },
-            "linux": {
-                "generic": "(X11; Linux x86_64)",
-                "ubuntu": "(X11; Ubuntu; Linux x86_64)",
-                "chrome_os": "(X11; CrOS x86_64 14541.0.0)",
-            },
-        }
-
-        self.mobile_platforms = {
-            "android": {
-                "samsung": "(Linux; Android 13; SM-S901B)",
-                "pixel": "(Linux; Android 12; Pixel 6)",
-                "oneplus": "(Linux; Android 13; OnePlus 9 Pro)",
-                "xiaomi": "(Linux; Android 12; M2102J20SG)",
-            },
-            "ios": {
-                "iphone": "(iPhone; CPU iPhone OS 16_5 like Mac OS X)",
-                "ipad": "(iPad; CPU OS 16_5 like Mac OS X)",
-            },
-        }
-
-        # Browser Combinations
-        self.browser_combinations = {
-            1: [["chrome"], ["firefox"], ["safari"], ["edge"]],
-            2: [["gecko", "firefox"], ["chrome", "safari"], ["webkit", "safari"]],
-            3: [["chrome", "safari", "edge"], ["webkit", "chrome", "safari"]],
-        }
-
-        # Rendering Engines with versions
-        self.rendering_engines = {
-            "chrome_webkit": "AppleWebKit/537.36",
-            "safari_webkit": "AppleWebKit/605.1.15",
-            "gecko": [  # Added Gecko versions
-                "Gecko/20100101",
-                "Gecko/20100101",  # Firefox usually uses this constant version
-                "Gecko/2010010",
-            ],
-        }
-
-        # Browser Versions
-        self.chrome_versions = [
-            "Chrome/119.0.6045.199",
-            "Chrome/118.0.5993.117",
-            "Chrome/117.0.5938.149",
-            "Chrome/116.0.5845.187",
-            "Chrome/115.0.5790.171",
-        ]
-
-        self.edge_versions = [
-            "Edg/119.0.2151.97",
-            "Edg/118.0.2088.76",
-            "Edg/117.0.2045.47",
-            "Edg/116.0.1938.81",
-            "Edg/115.0.1901.203",
-        ]
-
-        self.safari_versions = [
-            "Safari/537.36",  # For Chrome-based
-            "Safari/605.1.15",
-            "Safari/604.1",
-            "Safari/602.1",
-            "Safari/601.5.17",
-        ]
-
-        # Added Firefox versions
-        self.firefox_versions = [
-            "Firefox/119.0",
-            "Firefox/118.0.2",
-            "Firefox/117.0.1",
-            "Firefox/116.0",
-            "Firefox/115.0.3",
-            "Firefox/114.0.2",
-            "Firefox/113.0.1",
-            "Firefox/112.0",
-            "Firefox/111.0.1",
-            "Firefox/110.0",
-        ]
-
-    def get_browser_stack(self, num_browsers: int = 1) -> List[str]:
-        """
-        Get a valid combination of browser versions.
-
-        How it works:
-        1. Check if the number of browsers is supported.
-        2. Randomly choose a combination of browsers.
-        3. Iterate through the combination and add browser versions.
-        4. Return the browser stack.
-
-        Args:
-            num_browsers: Number of browser specifications (1-3)
-
-        Returns:
-            List[str]: A list of browser versions.
-        """
-        if num_browsers not in self.browser_combinations:
-            raise ValueError(f"Unsupported number of browsers: {num_browsers}")
-
-        combination = random.choice(self.browser_combinations[num_browsers])
-        browser_stack = []
-
-        for browser in combination:
-            if browser == "chrome":
-                browser_stack.append(random.choice(self.chrome_versions))
-            elif browser == "firefox":
-                browser_stack.append(random.choice(self.firefox_versions))
-            elif browser == "safari":
-                browser_stack.append(random.choice(self.safari_versions))
-            elif browser == "edge":
-                browser_stack.append(random.choice(self.edge_versions))
-            elif browser == "gecko":
-                browser_stack.append(random.choice(self.rendering_engines["gecko"]))
-            elif browser == "webkit":
-                browser_stack.append(self.rendering_engines["chrome_webkit"])
-
-        return browser_stack
-
-    def generate(
-        self,
-        device_type: Optional[Literal["desktop", "mobile"]] = None,
-        os_type: Optional[str] = None,
-        device_brand: Optional[str] = None,
-        browser_type: Optional[Literal["chrome", "edge", "safari", "firefox"]] = None,
-        num_browsers: int = 3,
-    ) -> str:
-        """
-        Generate a random user agent with specified constraints.
-
-        Args:
-            device_type: 'desktop' or 'mobile'
-            os_type: 'windows', 'macos', 'linux', 'android', 'ios'
-            device_brand: Specific device brand
-            browser_type: 'chrome', 'edge', 'safari', or 'firefox'
-            num_browsers: Number of browser specifications (1-3)
-        """
-        # Get platform string
-        platform = self.get_random_platform(device_type, os_type, device_brand)
-
-        # Start with Mozilla
-        components = ["Mozilla/5.0", platform]
-
-        # Add browser stack
-        browser_stack = self.get_browser_stack(num_browsers)
-
-        # Add appropriate legacy token based on browser stack
-        if "Firefox" in str(browser_stack) or browser_type == "firefox":
-            components.append(random.choice(self.rendering_engines["gecko"]))
-        elif "Chrome" in str(browser_stack) or "Safari" in str(browser_stack) or browser_type == "chrome":
-            components.append(self.rendering_engines["chrome_webkit"])
-            components.append("(KHTML, like Gecko)")
-        elif "Edge" in str(browser_stack) or browser_type == "edge":
-            components.append(self.rendering_engines["safari_webkit"])
-            components.append("(KHTML, like Gecko)")
-        elif "Safari" in str(browser_stack) or browser_type == "safari":
-            components.append(self.rendering_engines["chrome_webkit"])
-            components.append("(KHTML, like Gecko)")
-
-        # Add browser versions
-        components.extend(browser_stack)
-
-        return " ".join(components)
-
-    def generate_with_client_hints(self, **kwargs) -> Tuple[str, str]:
-        """Generate both user agent and matching client hints"""
-        user_agent = self.generate(**kwargs)
-        client_hints = self.generate_client_hints(user_agent)
-        return user_agent, client_hints
-
-    def get_random_platform(self, device_type, os_type, device_brand):
-        """Helper method to get random platform based on constraints"""
-        platforms = (
-            self.desktop_platforms
-            if device_type == "desktop"
-            else self.mobile_platforms
-            if device_type == "mobile"
-            else {**self.desktop_platforms, **self.mobile_platforms}
-        )
-
-        if os_type:
-            for platform_group in [self.desktop_platforms, self.mobile_platforms]:
-                if os_type in platform_group:
-                    platforms = {os_type: platform_group[os_type]}
-                    break
-
-        os_key = random.choice(list(platforms.keys()))
-        if device_brand and device_brand in platforms[os_key]:
-            return platforms[os_key][device_brand]
-        return random.choice(list(platforms[os_key].values()))
-
-    def parse_user_agent(self, user_agent: str) -> Dict[str, str]:
-        """Parse a user agent string to extract browser and version information"""
-        browsers = {
-            "chrome": r"Chrome/(\d+)",
-            "edge": r"Edg/(\d+)",
-            "safari": r"Version/(\d+)",
-            "firefox": r"Firefox/(\d+)",
-        }
-
-        result = {}
-        for browser, pattern in browsers.items():
-            match = re.search(pattern, user_agent)
-            if match:
-                result[browser] = match.group(1)
-
-        return result
-
-    def generate_client_hints(self, user_agent: str) -> str:
-        """Generate Sec-CH-UA header value based on user agent string"""
-        browsers = self.parse_user_agent(user_agent)
-
-        # Client hints components
-        hints = []
-
-        # Handle different browser combinations
-        if "chrome" in browsers:
-            hints.append(f'"Chromium";v="{browsers["chrome"]}"')
-            hints.append('"Not_A Brand";v="8"')
-
-            if "edge" in browsers:
-                hints.append(f'"Microsoft Edge";v="{browsers["edge"]}"')
-            else:
-                hints.append(f'"Google Chrome";v="{browsers["chrome"]}"')
-
-        elif "firefox" in browsers:
-            # Firefox doesn't typically send Sec-CH-UA
-            return '""'
-
-        elif "safari" in browsers:
-            # Safari's format for client hints
-            hints.append(f'"Safari";v="{browsers["safari"]}"')
-            hints.append('"Not_A Brand";v="8"')
-
-        return ", ".join(hints)
-
-
-# Example usage:
-if __name__ == "__main__":
-    
-    # Usage example:
-    generator = ValidUAGenerator()
-    ua = generator.generate()
-    print(ua)
-    
-    generator = OnlineUAGenerator()
-    ua = generator.generate()
-    print(ua)
-
--- a/crawl4ai/utils.py
+++ b/crawl4ai/utils.py
--- a/crawl4ai/version_manager.py
+++ b/crawl4ai/version_manager.py
@@ -1,29 +0,0 @@
-# version_manager.py
-from pathlib import Path
-from packaging import version
-from . import __version__
-
-
-class VersionManager:
-    def __init__(self):
-        self.home_dir = Path.home() / ".crawl4ai"
-        self.version_file = self.home_dir / "version.txt"
-
-    def get_installed_version(self):
-        """Get the version recorded in home directory"""
-        if not self.version_file.exists():
-            return None
-        try:
-            return version.parse(self.version_file.read_text().strip())
-        except:
-            return None
-
-    def update_version(self):
-        """Update the version file to current library version"""
-        self.version_file.write_text(__version__.__version__)
-
-    def needs_update(self):
-        """Check if database needs update based on version"""
-        installed = self.get_installed_version()
-        current = version.parse(__version__.__version__)
-        return installed is None or installed < current
--- a/crawl4ai/web_crawler.py
+++ b/crawl4ai/web_crawler.py
@@ -1,57 +1,56 @@
 import os, time
-
 os.environ["TOKENIZERS_PARALLELISM"] = "false"
 from pathlib import Path

 from .models import UrlModel, CrawlResult
-from .database import init_db, get_cached_url, cache_url
+from .database import init_db, get_cached_url, cache_url, DB_PATH, flush_db
 from .utils import *
 from .chunking_strategy import *
 from .extraction_strategy import *
 from .crawler_strategy import *
 from typing import List
 from concurrent.futures import ThreadPoolExecutor
-from .content_scraping_strategy import WebScrapingStrategy
 from .config import *
-import warnings
-import json
-
-warnings.filterwarnings(
-    "ignore",
-    message='Field "model_name" has conflict with protected namespace "model_".',
-)


 class WebCrawler:
    def __init__(
        self,
+        # db_path: str = None,
        crawler_strategy: CrawlerStrategy = None,
        always_by_pass_cache: bool = False,
        verbose: bool = False,
    ):
-        self.crawler_strategy = crawler_strategy or LocalSeleniumCrawlerStrategy(
-            verbose=verbose
-        )
+        # self.db_path = db_path
+        self.crawler_strategy = crawler_strategy or LocalSeleniumCrawlerStrategy(verbose=verbose)
        self.always_by_pass_cache = always_by_pass_cache
-        self.crawl4ai_folder = os.path.join(
-            os.getenv("CRAWL4_AI_BASE_DIRECTORY", Path.home()), ".crawl4ai"
-        )
+
+        # Create the .crawl4ai folder in the user's home directory if it doesn't exist
+        self.crawl4ai_folder = os.path.join(Path.home(), ".crawl4ai")
        os.makedirs(self.crawl4ai_folder, exist_ok=True)
        os.makedirs(f"{self.crawl4ai_folder}/cache", exist_ok=True)
-        init_db()
-        self.ready = False

+        # If db_path is not provided, use the default path
+        # if not db_path:
+            # self.db_path = f"{self.crawl4ai_folder}/crawl4ai.db"
+        
+        # flush_db()
+        init_db()
+        
+        self.ready = False
+        
    def warmup(self):
        print("[LOG] 🌤️  Warming up the WebCrawler")
-        self.run(
-            url="https://google.com/",
+        result = self.run(
+            url='https://crawl4ai.uccode.io/',
            word_count_threshold=5,
-            extraction_strategy=NoExtractionStrategy(),
+            extraction_strategy= NoExtractionStrategy(),
            bypass_cache=False,
-            verbose=False,
+            verbose = False
        )
        self.ready = True
        print("[LOG] 🌞 WebCrawler is ready to crawl")
+        

    def fetch_page(
        self,
@@ -60,8 +59,6 @@ class WebCrawler:
        api_token: str = None,
        extract_blocks_flag: bool = True,
        word_count_threshold=MIN_WORD_THRESHOLD,
-        css_selector: str = None,
-        screenshot: bool = False,
        use_cached_html: bool = False,
        extraction_strategy: ExtractionStrategy = None,
        chunking_strategy: ChunkingStrategy = RegexChunking(),
@@ -73,12 +70,111 @@ class WebCrawler:
            extraction_strategy or NoExtractionStrategy(),
            chunking_strategy,
            bypass_cache=url_model.forced,
-            css_selector=css_selector,
-            screenshot=screenshot,
            **kwargs,
        )
        pass

+
+    def run(
+        self,
+        url: str,
+        word_count_threshold=MIN_WORD_THRESHOLD,
+        extraction_strategy: ExtractionStrategy = None,
+        chunking_strategy: ChunkingStrategy = RegexChunking(),
+        bypass_cache: bool = False,
+        css_selector: str = None,
+        verbose=True,
+        **kwargs,
+    ) -> CrawlResult:
+        extraction_strategy = extraction_strategy or NoExtractionStrategy()
+        extraction_strategy.verbose = verbose
+        # Raise an error if the extraction or chunking strategy is not a supported type
+        if not isinstance(extraction_strategy, ExtractionStrategy):
+            raise ValueError("Unsupported extraction strategy")
+        if not isinstance(chunking_strategy, ChunkingStrategy):
+            raise ValueError("Unsupported chunking strategy")
+        
+        # Make sure word_count_threshold is not less than MIN_WORD_THRESHOLD
+        if word_count_threshold < MIN_WORD_THRESHOLD:
+            word_count_threshold = MIN_WORD_THRESHOLD
+
+        # Check cache first
+        if not bypass_cache and not self.always_by_pass_cache:
+            cached = get_cached_url(url)
+            if cached:
+                return CrawlResult(
+                    **{
+                        "url": cached[0],
+                        "html": cached[1],
+                        "cleaned_html": cached[2],
+                        "markdown": cached[3],
+                        "extracted_content": cached[4],
+                        "success": cached[5],
+                        "error_message": "",
+                    }
+                )
+
+        # Crawl the page with the configured crawler strategy and time the request
+        t = time.time()
+        html = self.crawler_strategy.crawl(url)
+        success = True
+        error_message = ""
+        # Extract content from HTML
+        try:
+            result = get_content_of_website(html, word_count_threshold, css_selector=css_selector)
+            if result is None:
+                raise ValueError(f"Failed to extract content from the website: {url}")
+        except InvalidCSSSelectorError as e:
+            raise ValueError(str(e))
+        
+        cleaned_html = result.get("cleaned_html", html)
+        markdown = result.get("markdown", "")
+
+        # Print a professional log-style message showing the time taken once crawling is done
+        if verbose:
+            print(
+                f"[LOG] 🚀 Crawling done for {url}, success: {success}, time taken: {time.time() - t} seconds"
+            )
+
+        extracted_content = []
+        if verbose:
+            print(f"[LOG] 🔥 Extracting semantic blocks for {url}, Strategy: {extraction_strategy.name}")
+        t = time.time()
+        # Split markdown into sections
+        sections = chunking_strategy.chunk(markdown)
+        # sections = merge_chunks_based_on_token_threshold(sections, CHUNK_TOKEN_THRESHOLD)
+
+        extracted_content = extraction_strategy.run(
+            url, sections,
+        )
+        extracted_content = json.dumps(extracted_content)
+
+        if verbose:
+            print(
+                f"[LOG] 🚀 Extraction done for {url}, time taken: {time.time() - t} seconds."
+            )
+
+        # Cache the result
+        cleaned_html = beautify_html(cleaned_html)
+        cache_url(
+            url,
+            html,
+            cleaned_html,
+            markdown,
+            extracted_content,
+            success,
+        )
+
+        return CrawlResult(
+            url=url,
+            html=html,
+            cleaned_html=cleaned_html,
+            markdown=markdown,
+            extracted_content=extracted_content,
+            success=success,
+            error_message=error_message,
+        )
+
    def fetch_pages(
        self,
        url_models: List[UrlModel],
@@ -87,14 +183,11 @@ class WebCrawler:
        extract_blocks_flag: bool = True,
        word_count_threshold=MIN_WORD_THRESHOLD,
        use_cached_html: bool = False,
-        css_selector: str = None,
-        screenshot: bool = False,
        extraction_strategy: ExtractionStrategy = None,
        chunking_strategy: ChunkingStrategy = RegexChunking(),
        **kwargs,
    ) -> List[CrawlResult]:
        extraction_strategy = extraction_strategy or NoExtractionStrategy()
-
        def fetch_page_wrapper(url_model, *args, **kwargs):
            return self.fetch_page(url_model, *args, **kwargs)

@@ -107,8 +200,6 @@ class WebCrawler:
                    [api_token] * len(url_models),
                    [extract_blocks_flag] * len(url_models),
                    [word_count_threshold] * len(url_models),
-                    [css_selector] * len(url_models),
-                    [screenshot] * len(url_models),
                    [use_cached_html] * len(url_models),
                    [extraction_strategy] * len(url_models),
                    [chunking_strategy] * len(url_models),
@@ -117,178 +208,3 @@ class WebCrawler:
            )

        return results
-
-    def run(
-        self,
-        url: str,
-        word_count_threshold=MIN_WORD_THRESHOLD,
-        extraction_strategy: ExtractionStrategy = None,
-        chunking_strategy: ChunkingStrategy = RegexChunking(),
-        bypass_cache: bool = False,
-        css_selector: str = None,
-        screenshot: bool = False,
-        user_agent: str = None,
-        verbose=True,
-        **kwargs,
-    ) -> CrawlResult:
-        try:
-            extraction_strategy = extraction_strategy or NoExtractionStrategy()
-            extraction_strategy.verbose = verbose
-            if not isinstance(extraction_strategy, ExtractionStrategy):
-                raise ValueError("Unsupported extraction strategy")
-            if not isinstance(chunking_strategy, ChunkingStrategy):
-                raise ValueError("Unsupported chunking strategy")
-
-            word_count_threshold = max(word_count_threshold, MIN_WORD_THRESHOLD)
-
-            cached = None
-            screenshot_data = None
-            extracted_content = None
-            if not bypass_cache and not self.always_by_pass_cache:
-                cached = get_cached_url(url)
-
-            if kwargs.get("warmup", True) and not self.ready:
-                return None
-
-            if cached:
-                html = sanitize_input_encode(cached[1])
-                extracted_content = sanitize_input_encode(cached[4])
-                if screenshot:
-                    screenshot_data = cached[9]
-                    if not screenshot_data:
-                        cached = None
-
-            if not cached or not html:
-                if user_agent:
-                    self.crawler_strategy.update_user_agent(user_agent)
-                t1 = time.time()
-                html = sanitize_input_encode(self.crawler_strategy.crawl(url, **kwargs))
-                t2 = time.time()
-                if verbose:
-                    print(
-                        f"[LOG] 🚀 Crawling done for {url}, success: {bool(html)}, time taken: {t2 - t1:.2f} seconds"
-                    )
-                if screenshot:
-                    screenshot_data = self.crawler_strategy.take_screenshot()
-
-            crawl_result = self.process_html(
-                url,
-                html,
-                extracted_content,
-                word_count_threshold,
-                extraction_strategy,
-                chunking_strategy,
-                css_selector,
-                screenshot_data,
-                verbose,
-                bool(cached),
-                **kwargs,
-            )
-            crawl_result.success = bool(html)
-            return crawl_result
-        except Exception as e:
-            if not hasattr(e, "msg"):
-                e.msg = str(e)
-            print(f"[ERROR] 🚫 Failed to crawl {url}, error: {e.msg}")
-            return CrawlResult(url=url, html="", success=False, error_message=e.msg)
-
-    def process_html(
-        self,
-        url: str,
-        html: str,
-        extracted_content: str,
-        word_count_threshold: int,
-        extraction_strategy: ExtractionStrategy,
-        chunking_strategy: ChunkingStrategy,
-        css_selector: str,
-        screenshot: bool,
-        verbose: bool,
-        is_cached: bool,
-        **kwargs,
-    ) -> CrawlResult:
-        t = time.time()
-        # Extract content from HTML
-        try:
-            t1 = time.time()
-            scrapping_strategy = WebScrapingStrategy()
-            extra_params = {
-                k: v
-                for k, v in kwargs.items()
-                if k not in ["only_text", "image_description_min_word_threshold"]
-            }
-            result = scrapping_strategy.scrap(
-                url,
-                html,
-                word_count_threshold=word_count_threshold,
-                css_selector=css_selector,
-                only_text=kwargs.get("only_text", False),
-                image_description_min_word_threshold=kwargs.get(
-                    "image_description_min_word_threshold",
-                    IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD,
-                ),
-                **extra_params,
-            )
-
-            # result = get_content_of_website_optimized(url, html, word_count_threshold, css_selector=css_selector, only_text=kwargs.get("only_text", False))
-            if verbose:
-                print(
-                    f"[LOG] 🚀 Content extracted for {url}, success: True, time taken: {time.time() - t1:.2f} seconds"
-                )
-
-            if result is None:
-                raise ValueError(f"Failed to extract content from the website: {url}")
-        except InvalidCSSSelectorError as e:
-            raise ValueError(str(e))
-
-        cleaned_html = sanitize_input_encode(result.get("cleaned_html", ""))
-        markdown = sanitize_input_encode(result.get("markdown", ""))
-        media = result.get("media", [])
-        links = result.get("links", [])
-        metadata = result.get("metadata", {})
-
-        if extracted_content is None:
-            if verbose:
-                print(
-                    f"[LOG] 🔥 Extracting semantic blocks for {url}, Strategy: {extraction_strategy.name}"
-                )
-
-            sections = chunking_strategy.chunk(markdown)
-            extracted_content = extraction_strategy.run(url, sections)
-            extracted_content = json.dumps(
-                extracted_content, indent=4, default=str, ensure_ascii=False
-            )
-
-            if verbose:
-                print(
-                    f"[LOG] 🚀 Extraction done for {url}, time taken: {time.time() - t:.2f} seconds."
-                )
-
-        screenshot = None if not screenshot else screenshot
-
-        if not is_cached:
-            cache_url(
-                url,
-                html,
-                cleaned_html,
-                markdown,
-                extracted_content,
-                True,
-                json.dumps(media),
-                json.dumps(links),
-                json.dumps(metadata),
-                screenshot=screenshot,
-            )
-
-        return CrawlResult(
-            url=url,
-            html=html,
-            cleaned_html=format_html(cleaned_html),
-            markdown=markdown,
-            media=media,
-            links=links,
-            metadata=metadata,
-            screenshot=screenshot,
-            extracted_content=extracted_content,
-            success=True,
-            error_message="",
-        )
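The `run` method added in this hunk checks the cache first, crawls only on a miss, then chunks, extracts, and caches the result. The control flow can be sketched in isolation as below. This is a minimal sketch, not the project's implementation: `run_sketch`, the in-memory `_CACHE` dict, and the `crawl` callable are hypothetical stand-ins for `WebCrawler.run`, `get_cached_url`/`cache_url`, and `crawler_strategy.crawl`, and splitting on blank lines stands in for the chunking and extraction strategies.

```python
import json
from typing import Callable, Dict, Tuple

# Hypothetical in-memory stand-in for the get_cached_url / cache_url helpers.
_CACHE: Dict[str, Tuple[str, str]] = {}


def run_sketch(url: str, crawl: Callable[[str], str], bypass_cache: bool = False) -> dict:
    """Return a cached result when allowed; otherwise crawl, extract, and cache."""
    # 1. Cache lookup, skipped when the caller asks to bypass it.
    if not bypass_cache and url in _CACHE:
        html, extracted = _CACHE[url]
        return {"url": url, "html": html, "extracted_content": extracted, "from_cache": True}

    # 2. Cache miss (or bypass): fetch the page.
    html = crawl(url)

    # 3. Chunk and "extract": blank-line splitting stands in for the real strategies.
    sections = [section for section in html.split("\n\n") if section.strip()]
    extracted = json.dumps(sections)

    # 4. Store the fresh result so the next call can be served from the cache.
    _CACHE[url] = (html, extracted)
    return {"url": url, "html": html, "extracted_content": extracted, "from_cache": False}
```

Calling `run_sketch` twice with the same URL invokes `crawl` once; passing `bypass_cache=True` forces a second fetch, which mirrors the `bypass_cache` parameter in the hunk above.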
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -1,67 +1,10 @@
+version: '3.8'
+
 services:
-  # Local build services for different platforms
-  crawl4ai-amd64:
-    build:
-      context: .
-      dockerfile: Dockerfile
-      args:
-        PYTHON_VERSION: "3.10"
-        INSTALL_TYPE: ${INSTALL_TYPE:-basic}
-        ENABLE_GPU: false
-      platforms:
-        - linux/amd64
-    profiles: ["local-amd64"]
-    extends: &base-config
-      file: docker-compose.yml
-      service: base-config
-
-  crawl4ai-arm64:
-    build:
-      context: .
-      dockerfile: Dockerfile
-      args:
-        PYTHON_VERSION: "3.10"
-        INSTALL_TYPE: ${INSTALL_TYPE:-basic}
-        ENABLE_GPU: false
-      platforms:
-        - linux/arm64
-    profiles: ["local-arm64"]
-    extends: *base-config
-
-  # Hub services for different platforms and versions
-  crawl4ai-hub-amd64:
-    image: unclecode/crawl4ai:${VERSION:-basic}-amd64
-    profiles: ["hub-amd64"]
-    extends: *base-config
-
-  crawl4ai-hub-arm64:
-    image: unclecode/crawl4ai:${VERSION:-basic}-arm64
-    profiles: ["hub-arm64"]
-    extends: *base-config
-
-  # Base configuration to be extended
-  base-config:
+  web:
+    build: .
+    command: sh -c "uvicorn main:app --host 0.0.0.0 --port 80 --workers $$(nproc)"
    ports:
-      - "11235:11235"
-      - "8000:8000"
-      - "9222:9222"
-      - "8080:8080"
+      - "80:80"
    environment:
-      - CRAWL4AI_API_TOKEN=${CRAWL4AI_API_TOKEN:-}
-      - OPENAI_API_KEY=${OPENAI_API_KEY:-}
-      - CLAUDE_API_KEY=${CLAUDE_API_KEY:-}
-    volumes:
-      - /dev/shm:/dev/shm
-    deploy:
-      resources:
-        limits:
-          memory: 4G
-        reservations:
-          memory: 1G
-    restart: unless-stopped
-    healthcheck:
-      test: ["CMD", "curl", "-f", "http://localhost:11235/health"]
-      interval: 30s
-      timeout: 10s
-      retries: 3
-      start_period: 40s
+      - PYTHONUNBUFFERED=1
--- a/docs/assets/pitch-dark.png
+++ b/docs/assets/pitch-dark.png
--- a/docs/assets/pitch-dark.svg
+++ b/docs/assets/pitch-dark.svg
@@ -1,64 +0,0 @@
-<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 800 500">
-    <!-- Background -->
-    <rect width="800" height="500" fill="#1a1a1a"/>
-    
-    <!-- Opportunities Section -->
-    <g transform="translate(50,50)">
-        <!-- Opportunity 1 Box -->
-        <rect x="0" y="0" width="300" height="150" rx="10" fill="#1a2d3d" stroke="#64b5f6" stroke-width="2"/>
-        <text x="150" y="30" text-anchor="middle" font-family="Arial" font-weight="bold" font-size="16" fill="#64b5f6">Data Capitalization Opportunity</text>
-        <text x="150" y="60" text-anchor="middle" font-family="Arial" font-size="12" fill="#e0e0e0">
-            <tspan x="150" dy="0">Transform digital footprints into assets</tspan>
-            <tspan x="150" dy="20">Personal data as capital</tspan>
-            <tspan x="150" dy="20">Enterprise knowledge valuation</tspan>
-            <tspan x="150" dy="20">New form of wealth creation</tspan>
-        </text>
-
-        <!-- Opportunity 2 Box -->
-        <rect x="0" y="200" width="300" height="150" rx="10" fill="#1a2d1a" stroke="#81c784" stroke-width="2"/>
-        <text x="150" y="230" text-anchor="middle" font-family="Arial" font-weight="bold" font-size="16" fill="#81c784">Authentic Data Potential</text>
-        <text x="150" y="260" text-anchor="middle" font-family="Arial" font-size="12" fill="#e0e0e0">
-            <tspan x="150" dy="0">Vast reservoir of real insights</tspan>
-            <tspan x="150" dy="20">Enhanced AI development</tspan>
-            <tspan x="150" dy="20">Diverse human knowledge</tspan>
-            <tspan x="150" dy="20">Willing participation model</tspan>
-        </text>
-    </g>
-
-    <!-- Development Pathway -->
-    <g transform="translate(450,50)">
-        <!-- Step 1 Box -->
-        <rect x="0" y="0" width="300" height="100" rx="10" fill="#2d1a2d" stroke="#ce93d8" stroke-width="2"/>
-        <text x="150" y="35" text-anchor="middle" font-family="Arial" font-weight="bold" font-size="16" fill="#ce93d8">1. Open-Source Foundation</text>
-        <text x="150" y="65" text-anchor="middle" font-family="Arial" font-size="12" fill="#e0e0e0">Data extraction engine &amp; community development</text>
-
-        <!-- Step 2 Box -->
-        <rect x="0" y="125" width="300" height="100" rx="10" fill="#2d1a2d" stroke="#ce93d8" stroke-width="2"/>
-        <text x="150" y="160" text-anchor="middle" font-family="Arial" font-weight="bold" font-size="16" fill="#ce93d8">2. Data Capitalization Platform</text>
-        <text x="150" y="190" text-anchor="middle" font-family="Arial" font-size="12" fill="#e0e0e0">Tools to structure &amp; value digital assets</text>
-
-        <!-- Step 3 Box -->
-        <rect x="0" y="250" width="300" height="100" rx="10" fill="#2d1a2d" stroke="#ce93d8" stroke-width="2"/>
-        <text x="150" y="285" text-anchor="middle" font-family="Arial" font-weight="bold" font-size="16" fill="#ce93d8">3. Shared Data Marketplace</text>
-        <text x="150" y="315" text-anchor="middle" font-family="Arial" font-size="12" fill="#e0e0e0">Economic platform for data exchange</text>
-    </g>
-
-    <!-- Connecting Arrows -->
-    <g transform="translate(400,125)">
-        <path d="M-20,0 L40,0" stroke="#666" stroke-width="2" marker-end="url(#arrowhead)"/>
-        <path d="M-20,200 L40,200" stroke="#666" stroke-width="2" marker-end="url(#arrowhead)"/>
-    </g>
-
-    <!-- Arrow Marker -->
-    <defs>
-        <marker id="arrowhead" markerWidth="10" markerHeight="7" refX="9" refY="3.5" orient="auto">
-            <polygon points="0 0, 10 3.5, 0 7" fill="#666"/>
-        </marker>
-    </defs>
-
-    <!-- Vision Box at Bottom -->
-    <g transform="translate(200,420)">
-        <rect x="0" y="0" width="400" height="60" rx="10" fill="#2d2613" stroke="#ffd54f" stroke-width="2"/>
-        <text x="200" y="35" text-anchor="middle" font-family="Arial" font-weight="bold" font-size="16" fill="#ffd54f">Economic Vision: Shared Data Economy</text>
-    </g>
-</svg>
--- a/docs/chunking_strategies.json
+++ b/docs/chunking_strategies.json
@@ -0,0 +1,12 @@
+{
+    "RegexChunking": "### RegexChunking\n\n`RegexChunking` is a text chunking strategy that splits a given text into smaller parts using regular expressions.\nThis is useful for preparing large texts for processing by language models, ensuring they are divided into manageable segments.\n\n#### Constructor Parameters:\n- `patterns` (list, optional): A list of regular expression patterns used to split the text. Default is to split by double newlines (`['\\n\\n']`).\n\n#### Example usage:\n```python\nchunker = RegexChunking(patterns=[r'\\n\\n', r'\\. '])\nchunks = chunker.chunk(\"This is a sample text. It will be split into chunks.\")\n```",
+    
+    "NlpSentenceChunking": "### NlpSentenceChunking\n\n`NlpSentenceChunking` uses a natural language processing model to chunk a given text into sentences. This approach leverages SpaCy to accurately split text based on sentence boundaries.\n\n#### Constructor Parameters:\n- None.\n\n#### Example usage:\n```python\nchunker = NlpSentenceChunking()\nchunks = chunker.chunk(\"This is a sample text. It will be split into sentences.\")\n```",
+    
+    "TopicSegmentationChunking": "### TopicSegmentationChunking\n\n`TopicSegmentationChunking` uses the TextTiling algorithm to segment a given text into topic-based chunks. This method identifies thematic boundaries in the text.\n\n#### Constructor Parameters:\n- `num_keywords` (int, optional): The number of keywords to extract for each topic segment. Default is `3`.\n\n#### Example usage:\n```python\nchunker = TopicSegmentationChunking(num_keywords=3)\nchunks = chunker.chunk(\"This is a sample text. It will be split into topic-based segments.\")\n```",
+    
+    "FixedLengthWordChunking": "### FixedLengthWordChunking\n\n`FixedLengthWordChunking` splits a given text into chunks of fixed length, based on the number of words.\n\n#### Constructor Parameters:\n- `chunk_size` (int, optional): The number of words in each chunk. Default is `100`.\n\n#### Example usage:\n```python\nchunker = FixedLengthWordChunking(chunk_size=100)\nchunks = chunker.chunk(\"This is a sample text. It will be split into fixed-length word chunks.\")\n```",
+    
+    "SlidingWindowChunking": "### SlidingWindowChunking\n\n`SlidingWindowChunking` uses a sliding window approach to chunk a given text. Each chunk has a fixed length, and the window slides by a specified step size.\n\n#### Constructor Parameters:\n- `window_size` (int, optional): The number of words in each chunk. Default is `100`.\n- `step` (int, optional): The number of words to slide the window. Default is `50`.\n\n#### Example usage:\n```python\nchunker = SlidingWindowChunking(window_size=100, step=50)\nchunks = chunker.chunk(\"This is a sample text. It will be split using a sliding window approach.\")\n```"
+  }
+  
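The descriptions in `chunking_strategies.json` above name the parameters (`patterns`, `window_size`, `step`) but not the mechanics. The sketch below shows one plausible reading of the regex and sliding-window strategies using only the standard library. `regex_chunk` and `sliding_window_chunk` are hypothetical helper names, not the library's classes, and the library's actual handling of edge cases (empty chunks, trailing windows) may differ.

```python
import re
from typing import List, Optional


def regex_chunk(text: str, patterns: Optional[List[str]] = None) -> List[str]:
    """Split text by each pattern in turn; defaults to splitting on blank lines."""
    patterns = patterns or [r"\n\n"]
    chunks = [text]
    for pattern in patterns:
        next_chunks: List[str] = []
        for chunk in chunks:
            next_chunks.extend(re.split(pattern, chunk))
        chunks = next_chunks
    # Drop empty or whitespace-only pieces (an assumption of this sketch).
    return [chunk for chunk in chunks if chunk.strip()]


def sliding_window_chunk(text: str, window_size: int = 100, step: int = 50) -> List[str]:
    """Emit overlapping word windows of window_size words, advancing by step words."""
    words = text.split()
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window_size]))
        # Stop once a window has reached the end of the text.
        if start + window_size >= len(words):
            break
    return chunks
```

With `step` smaller than `window_size`, consecutive windows overlap by `window_size - step` words, which is what lets context carry across chunk boundaries.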
--- a/docs/deep_crawl/bfs_traversal_strategy.md
+++ b/docs/deep_crawl/bfs_traversal_strategy.md
@@ -1,244 +0,0 @@
-# BFS Scraper Strategy: Smart Web Traversal
-
-The BFS (Breadth-First Search) Scraper Strategy provides an intelligent way to traverse websites systematically. It crawls websites level by level, ensuring thorough coverage while respecting web crawling etiquette.
-
-```mermaid
-flowchart TB
-    Start([Start]) --> Init[Initialize BFS Strategy]
-    Init --> InitStats[Initialize CrawlStats]
-    InitStats --> InitQueue[Initialize Priority Queue]
-    InitQueue --> AddStart[Add Start URL to Queue]
-    
-    AddStart --> CheckState{Queue Empty or\nTasks Pending?}
-    CheckState -->|No| Cleanup[Cleanup & Stats]
-    Cleanup --> End([End])
-    
-    CheckState -->|Yes| CheckCancel{Cancel\nRequested?}
-    CheckCancel -->|Yes| Cleanup
-    
-    CheckCancel -->|No| CheckConcurrent{Under Max\nConcurrent?}
-    
-    CheckConcurrent -->|No| WaitComplete[Wait for Task Completion]
-    WaitComplete --> YieldResult[Yield Result]
-    YieldResult --> CheckState
-    
-    CheckConcurrent -->|Yes| GetNextURL[Get Next URL from Queue]
-    
-    GetNextURL --> ValidateURL{Already\nVisited?}
-    ValidateURL -->|Yes| CheckState
-    
-    ValidateURL -->|No| ProcessURL[Process URL]
-    
-    subgraph URL_Processing [URL Processing]
-        ProcessURL --> CheckValid{URL Valid?}
-        CheckValid -->|No| UpdateStats[Update Skip Stats]
-        
-        CheckValid -->|Yes| CheckRobots{Allowed by\nrobots.txt?}
-        CheckRobots -->|No| UpdateRobotStats[Update Robot Stats]
-        
-        CheckRobots -->|Yes| ApplyDelay[Apply Politeness Delay]
-        ApplyDelay --> FetchContent[Fetch Content with Rate Limit]
-        
-        FetchContent --> CheckError{Error?}
-        CheckError -->|Yes| Retry{Retry\nNeeded?}
-        Retry -->|Yes| FetchContent
-        Retry -->|No| UpdateFailStats[Update Fail Stats]
-        
-        CheckError -->|No| ExtractLinks[Extract & Process Links]
-        ExtractLinks --> ScoreURLs[Score New URLs]
-        ScoreURLs --> AddToQueue[Add to Priority Queue]
-    end
-    
-    ProcessURL --> CreateTask{Parallel\nProcessing?}
-    CreateTask -->|Yes| AddTask[Add to Pending Tasks]
-    CreateTask -->|No| DirectProcess[Process Directly]
-    
-    AddTask --> CheckState
-    DirectProcess --> YieldResult
-    
-    UpdateStats --> CheckState
-    UpdateRobotStats --> CheckState
-    UpdateFailStats --> CheckState
-    
-    classDef process fill:#90caf9,stroke:#000,stroke-width:2px;
-    classDef decision fill:#fff59d,stroke:#000,stroke-width:2px;
-    classDef error fill:#ef9a9a,stroke:#000,stroke-width:2px;
-    classDef stats fill:#a5d6a7,stroke:#000,stroke-width:2px;
-    
-    class Start,End stats;
-    class CheckState,CheckCancel,CheckConcurrent,ValidateURL,CheckValid,CheckRobots,CheckError,Retry,CreateTask decision;
-    class UpdateStats,UpdateRobotStats,UpdateFailStats,InitStats,Cleanup stats;
-    class ProcessURL,FetchContent,ExtractLinks,ScoreURLs process;
-```
-
-## How It Works
-
-The BFS strategy crawls a website by:
-1. Starting from a root URL
-2. Processing all URLs at the current depth
-3. Moving to URLs at the next depth level
-4. Continuing until maximum depth is reached
-
-This ensures systematic coverage of the website while maintaining control over the crawling process.
-
-## Key Features
-
-### 1. Smart URL Processing
-```python
-strategy = BFSScraperStrategy(
-    max_depth=2,
-    filter_chain=my_filters,
-    url_scorer=my_scorer,
-    max_concurrent=5
-)
-```
-- Controls crawl depth
-- Filters unwanted URLs
-- Scores URLs for priority
-- Manages concurrent requests
-
-### 2. Polite Crawling
-The strategy automatically implements web crawling best practices:
-- Respects robots.txt
-- Implements rate limiting
-- Adds politeness delays
-- Manages concurrent requests
-
-### 3. Link Processing Control
-```python
-strategy = BFSScraperStrategy(
-    ...,
-    process_external_links=False  # Only process internal links
-)
-```
-- Control whether to follow external links
-- Default: internal links only
-- Enable external links when needed
-
-## Configuration Options
-
-| Parameter | Description | Default |
-|-----------|-------------|---------|
-| max_depth | Maximum crawl depth | Required |
-| filter_chain | URL filtering rules | Required |
-| url_scorer | URL priority scoring | Required |
-| max_concurrent | Max parallel requests | 5 |
-| min_crawl_delay | Seconds between requests | 1 |
-| process_external_links | Follow external links | False |
-
-## Best Practices
-
-1. **Set Appropriate Depth**
-   - Start with smaller depths (2-3)
-   - Increase based on needs
-   - Consider site structure
-
-2. **Configure Filters**
-   - Use URL patterns
-   - Filter by content type
-   - Avoid unwanted sections
-
-3. **Tune Performance**
-   - Adjust max_concurrent
-   - Set appropriate delays
-   - Monitor resource usage
-
-4. **Handle External Links**
-   - Keep process_external_links=False for focused crawls
-   - Enable only when needed
-   - Consider additional filtering
-
-## Example Usage
-
-```python
-from crawl4ai.scraper import BFSScraperStrategy
-from crawl4ai.scraper.filters import FilterChain
-from crawl4ai.scraper.scorers import BasicURLScorer
-
-# Configure strategy
-strategy = BFSScraperStrategy(
-    max_depth=3,
-    filter_chain=FilterChain([
-        URLPatternFilter("*.example.com/*"),
-        ContentTypeFilter(["text/html"])
-    ]),
-    url_scorer=BasicURLScorer(),
-    max_concurrent=5,
-    min_crawl_delay=1,
-    process_external_links=False
-)
-
-# Use with AsyncWebScraper
-scraper = AsyncWebScraper(crawler, strategy)
-results = await scraper.ascrape("https://example.com")
-```
-
-## Common Use Cases
-
-### 1. Site Mapping
-```python
-strategy = BFSScraperStrategy(
-    max_depth=5,
-    filter_chain=site_filter,
-    url_scorer=depth_scorer,
-    process_external_links=False
-)
-```
-Perfect for creating complete site maps or understanding site structure.
-
-### 2. Content Aggregation
-```python
-strategy = BFSScraperStrategy(
-    max_depth=2,
-    filter_chain=content_filter,
-    url_scorer=relevance_scorer,
-    max_concurrent=3
-)
-```
-Ideal for collecting specific types of content (articles, products, etc.).
-
-### 3. Link Analysis
-```python
-strategy = BFSScraperStrategy(
-    max_depth=1,
-    filter_chain=link_filter,
-    url_scorer=link_scorer,
-    process_external_links=True
-)
-```
-Useful for analyzing both internal and external link structures.
-
-## Advanced Features
-
-### Progress Monitoring
-```python
-async for result in scraper.ascrape(url):
-    print(f"Current depth: {strategy.stats.current_depth}")
-    print(f"Processed URLs: {strategy.stats.urls_processed}")
-```
-
-### Custom URL Scoring
-```python
-class CustomScorer(URLScorer):
-    def score(self, url: str) -> float:
-        # Lower scores = higher priority
-        return score_based_on_criteria(url)
-```
-
-## Troubleshooting
-
-1. **Slow Crawling**
-   - Increase max_concurrent
-   - Adjust min_crawl_delay
-   - Check network conditions
-
-2. **Missing Content**
-   - Verify max_depth
-   - Check filter settings
-   - Review URL patterns
-
-3. **High Resource Usage**
-   - Reduce max_concurrent
-   - Increase crawl delay
-   - Add more specific filters
-
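The removed `bfs_traversal_strategy.md` describes a level-by-level traversal: start at a root URL, process every URL at the current depth, then move one level deeper until `max_depth` is reached. The core of that idea can be sketched with a plain FIFO queue, as below. This is an illustrative sketch only: `bfs_crawl_order` and the `links` dictionary (a toy link graph) are hypothetical, no network requests are made, and the priority queue, URL scoring, filters, robots.txt checks, and politeness delays from the document are deliberately left out.

```python
from collections import deque
from typing import Dict, List, Tuple


def bfs_crawl_order(start: str, links: Dict[str, List[str]], max_depth: int) -> List[Tuple[str, int]]:
    """Return (url, depth) pairs in the order a breadth-first crawl would visit them."""
    visited = {start}
    queue = deque([(start, 0)])
    order: List[Tuple[str, int]] = []

    while queue:
        url, depth = queue.popleft()
        order.append((url, depth))

        # Pages at max_depth are visited, but their links are not followed.
        if depth >= max_depth:
            continue

        for next_url in links.get(url, []):
            # Skip URLs already seen so cycles in the link graph terminate.
            if next_url not in visited:
                visited.add(next_url)
                queue.append((next_url, depth + 1))

    return order
```

Because the queue is first-in, first-out, every page at depth `d` is visited before any page at depth `d + 1`. Replacing the deque with a priority queue keyed on a URL score would give the scored variant that the document describes.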
--- a/docs/deep_crawl/deep_crawl_quickstart.py
+++ b/docs/deep_crawl/deep_crawl_quickstart.py
@@ -1,260 +0,0 @@
-from crawl4ai.async_configs import CrawlerRunConfig, BrowserConfig
-from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
-from crawl4ai.deep_crawl import (
-    BFSDeepCrawlStrategy,
-    FilterChain,
-    URLPatternFilter,
-    ContentTypeFilter,
-    DomainFilter,
-    KeywordRelevanceScorer,
-    PathDepthScorer,
-    FreshnessScorer,
-    CompositeScorer,
-)
-from crawl4ai.async_webcrawler import AsyncWebCrawler
-import re
-import time
-import logging
-
-browser_config = BrowserConfig(headless=True, viewport_width=800, viewport_height=600)
-
-
-async def basic_example():
-    """
-    Basic example: Deep crawl a blog site for articles
-    - Crawls only HTML pages
-    - Stays within the blog section
-    - Collects all results at once
-    """
-    # Create a simple filter chain
-    filter_chain = FilterChain(
-        [
-            # Only crawl pages within the blog section
-            URLPatternFilter("*/basic/*"),
-            # Only process HTML pages
-            ContentTypeFilter(["text/html"]),
-        ]
-    )
-
-    # Initialize the strategy with basic configuration
-    bfs_strategy = BFSDeepCrawlStrategy(
-        max_depth=2,  # Only go 2 levels deep
-        filter_chain=filter_chain,
-        url_scorer=None,  # Use default scoring
-        process_external_links=True,
-    )
-
-    # Create the crawler
-    async with AsyncWebCrawler(
-        config=browser_config,
-    ) as crawler:
-        # Start scraping
-        try:
-            results = await crawler.arun(
-                "https://crawl4ai.com/mkdocs",
-                CrawlerRunConfig(deep_crawl_strategy=bfs_strategy),
-            )
-            # Process results
-            print(f"Crawled {len(results)} pages:")
-            for result in results:
-                print(f"- {result.url}: {len(result.html)} bytes")
-
-        except Exception as e:
-            print(f"Error during scraping: {e}")
-
-
-async def advanced_example():
-    """
-    Advanced example: Intelligent news site crawling
-    - Uses all filter types
-    - Implements sophisticated scoring
-    - Streams results
-    - Includes monitoring and logging
-    """
-    # Set up logging
-    logging.basicConfig(level=logging.INFO)
-    logger = logging.getLogger("advanced_deep_crawler")
-
-    # Create sophisticated filter chain
-    filter_chain = FilterChain(
-        [
-            # Domain control
-            DomainFilter(
-                allowed_domains=["techcrunch.com"],
-                blocked_domains=["login.techcrunch.com", "legal.yahoo.com"],
-            ),
-            # URL patterns
-            URLPatternFilter(
-                [
-                    "*/article/*",
-                    "*/news/*",
-                    "*/blog/*",
-                    re.compile(r"\d{4}/\d{2}/.*"),  # Date-based URLs
-                ]
-            ),
-            # Content types
-            ContentTypeFilter(["text/html", "application/xhtml+xml"]),
-        ]
-    )
-
-    # Create composite scorer
-    scorer = CompositeScorer(
-        [
-            # Prioritize by keywords
-            KeywordRelevanceScorer(
-                keywords=["news", "breaking", "update", "latest"], weight=1.0
-            ),
-            # Prefer optimal URL structure
-            PathDepthScorer(optimal_depth=3, weight=0.7),
-            # Prioritize fresh content
-            FreshnessScorer(weight=0.9),
-        ]
-    )
-
-    # Initialize strategy with advanced configuration
-    bfs_strategy = BFSDeepCrawlStrategy(
-        max_depth=2, filter_chain=filter_chain, url_scorer=scorer
-    )
-
-    # Create crawler
-    async with AsyncWebCrawler(
-        config=browser_config,
-    ) as crawler:
-
-        # Track statistics
-        stats = {"processed": 0, "errors": 0, "total_size": 0}
-
-        try:
-            # Use streaming mode
-            results = []
-            result_generator = await crawler.arun(
-                "https://techcrunch.com",
-                config=CrawlerRunConfig(deep_crawl_strategy=bfs_strategy, stream=True),
-            )
-            async for result in result_generator:
-                stats["processed"] += 1
-
-                if result.success:
-                    stats["total_size"] += len(result.html)
-                    logger.info(
-                        f"Processed at depth: {result.depth} with score: {result.score:.3f} : \n {result.url}"
-                    )
-                    results.append(result)
-                else:
-                    stats["errors"] += 1
-                    logger.error(
-                        f"Failed to process {result.url}: {result.error_message}"
-                    )
-
-                # Log progress regularly
-                if stats["processed"] % 10 == 0:
-                    logger.info(f"Progress: {stats['processed']} URLs processed")
-
-        except Exception as e:
-            logger.error(f"Scraping error: {e}")
-
-        finally:
-            # Print final statistics
-            logger.info("Scraping completed:")
-            logger.info(f"- URLs processed: {stats['processed']}")
-            logger.info(f"- Errors: {stats['errors']}")
-            logger.info(f"- Total content size: {stats['total_size'] / 1024:.2f} KB")
-
-            # Print filter statistics
-            for filter_ in filter_chain.filters:
-                logger.info(f"{filter_.name} stats:")
-                logger.info(f"- Passed: {filter_.stats.passed_urls}")
-                logger.info(f"- Rejected: {filter_.stats.rejected_urls}")
-
-            # Print scorer statistics
-            logger.info("Scoring statistics:")
-            logger.info(f"- Average score: {scorer.stats.average_score:.2f}")
-            logger.info(
-                f"- Score range: {scorer.stats.min_score:.2f} - {scorer.stats.max_score:.2f}"
-            )
-
-
-async def basic_example_many_urls():
-    filter_chain = FilterChain(
-        [
-            URLPatternFilter("*/basic/*"),
-            ContentTypeFilter(["text/html"]),
-        ]
-    )
-    # Initialize the strategy with basic configuration
-    bfs_strategy = BFSDeepCrawlStrategy(
-        max_depth=2,  # Only go 2 levels deep
-        filter_chain=filter_chain,
-        url_scorer=None,  # Use default scoring
-        process_external_links=False,
-    )
-
-    # Create the crawler
-    async with AsyncWebCrawler(
-        config=browser_config,
-    ) as crawler:
-        # Start scraping
-        try:
-            results = await crawler.arun_many(
-                urls=["https://crawl4ai.com/mkdocs", "https://aravindkarnam.com"],
-                config=CrawlerRunConfig(deep_crawl_strategy=bfs_strategy),
-            )
-            # Process results
-            print(f"Crawled {sum(len(url_result) for url_result in results)} pages:")
-            for url_result in results:
-                for result in url_result:
-                    print(f"- {result.url}: {len(result.html)} bytes")
-
-        except Exception as e:
-            print(f"Error during scraping: {e}")
-
-async def basic_example_many_urls_stream():
-    filter_chain = FilterChain(
-        [
-            URLPatternFilter("*/basic/*"),
-            ContentTypeFilter(["text/html"]),
-        ]
-    )
-    # Initialize the strategy with basic configuration
-    bfs_strategy = BFSDeepCrawlStrategy(
-        max_depth=2,  # Only go 2 levels deep
-        filter_chain=filter_chain,
-        url_scorer=None,  # Use default scoring
-        process_external_links=False,
-    )
-
-    # Create the crawler
-    async with AsyncWebCrawler(
-        config=browser_config,
-    ) as crawler:
-        # Start scraping
-        try:
-            async for result in await crawler.arun_many(
-                urls=["https://crawl4ai.com/mkdocs", "https://aravindkarnam.com"],
-                config=CrawlerRunConfig(deep_crawl_strategy=bfs_strategy, stream=True),
-            ):
-                # Process results
-                print(f"- {result.url}: {len(result.html)} bytes")
-        except Exception as e:
-            print(f"Error during scraping: {e}")
-
-if __name__ == "__main__":
-    import asyncio
-    import time
-
-    # Run basic example
-    start_time = time.perf_counter()
-    print("Running basic deep crawl example...")
-    asyncio.run(basic_example())
-    end_time = time.perf_counter()
-    print(f"Basic deep crawl example completed in {end_time - start_time:.2f} seconds")
-
-    # Run advanced example
-    print("\nRunning advanced deep crawl example...")
-    asyncio.run(advanced_example())
-
-    print("\nRunning basic deep crawl example with arun_many...")
-    asyncio.run(basic_example_many_urls())
-
-    print("\nRunning basic deep crawl example with arun_many streaming enabled...")
-    asyncio.run(basic_example_many_urls_stream())
--- a/docs/deep_crawl/filters_scrorers.md
+++ b/docs/deep_crawl/filters_scrorers.md
@@ -1,342 +0,0 @@
-# URL Filters and Scorers
-
-The crawl4ai library provides URL filters and scorers: filters decide which URLs are crawled, and scorers decide which of them are crawled first. This guide explains how to configure and combine both.
-
-```mermaid
-flowchart TB
-    Start([URL Input]) --> Chain[Filter Chain]
-    
-    subgraph Chain Process
-        Chain --> Pattern{URL Pattern\nFilter}
-        Pattern -->|Match| Content{Content Type\nFilter}
-        Pattern -->|No Match| Reject1[Reject URL]
-        
-        Content -->|Allowed| Domain{Domain\nFilter}
-        Content -->|Not Allowed| Reject2[Reject URL]
-        
-        Domain -->|Allowed| Accept[Accept URL]
-        Domain -->|Blocked| Reject3[Reject URL]
-    end
-    
-    subgraph Statistics
-        Pattern --> UpdatePattern[Update Pattern Stats]
-        Content --> UpdateContent[Update Content Stats]
-        Domain --> UpdateDomain[Update Domain Stats]
-        Accept --> UpdateChain[Update Chain Stats]
-        Reject1 --> UpdateChain
-        Reject2 --> UpdateChain
-        Reject3 --> UpdateChain
-    end
-    
-    Accept --> End([End])
-    Reject1 --> End
-    Reject2 --> End
-    Reject3 --> End
-    
-    classDef process fill:#90caf9,stroke:#000,stroke-width:2px;
-    classDef decision fill:#fff59d,stroke:#000,stroke-width:2px;
-    classDef reject fill:#ef9a9a,stroke:#000,stroke-width:2px;
-    classDef accept fill:#a5d6a7,stroke:#000,stroke-width:2px;
-    
-    class Start,End accept;
-    class Pattern,Content,Domain decision;
-    class Reject1,Reject2,Reject3 reject;
-    class Chain,UpdatePattern,UpdateContent,UpdateDomain,UpdateChain process;
-```
-
-## URL Filters
-
-URL filters help you control which URLs are crawled. Multiple filters can be chained together to create sophisticated filtering rules.
-
-### Available Filters
-
-1. **URL Pattern Filter**
-```python
-pattern_filter = URLPatternFilter([
-    "*.example.com/*",  # Glob pattern
-    "*/article/*",      # Path pattern
-    re.compile(r"blog-\d+") # Regex pattern
-])
-```
-- Supports glob patterns and regex
-- Multiple patterns per filter
-- Pattern pre-compilation for performance
-
-2. **Content Type Filter**
-```python
-content_filter = ContentTypeFilter([
-    "text/html",
-    "application/pdf"
-], check_extension=True)
-```
-- Filter by MIME types
-- Extension checking
-- Support for multiple content types
-
-3. **Domain Filter**
-```python
-domain_filter = DomainFilter(
-    allowed_domains=["example.com", "blog.example.com"],
-    blocked_domains=["ads.example.com"]
-)
-```
-- Allow/block specific domains
-- Subdomain support
-- Efficient domain matching
-
-### Creating Filter Chains
-
-```python
-# Create and configure a filter chain
-filter_chain = FilterChain([
-    URLPatternFilter(["*.example.com/*"]),
-    ContentTypeFilter(["text/html"]),
-    DomainFilter(blocked_domains=["ads.*"])
-])
-
-# Add more filters
-filter_chain.add_filter(
-    URLPatternFilter(["*/article/*"])
-)
-```
-
-```mermaid
-flowchart TB
-    Start([URL Input]) --> Composite[Composite Scorer]
-    
-    subgraph Scoring Process
-        Composite --> Keywords[Keyword Relevance]
-        Composite --> Path[Path Depth]
-        Composite --> Content[Content Type]
-        Composite --> Fresh[Freshness]
-        Composite --> Domain[Domain Authority]
-        
-        Keywords --> KeywordScore[Calculate Score]
-        Path --> PathScore[Calculate Score]
-        Content --> ContentScore[Calculate Score]
-        Fresh --> FreshScore[Calculate Score]
-        Domain --> DomainScore[Calculate Score]
-        
-        KeywordScore --> Weight1[Apply Weight]
-        PathScore --> Weight2[Apply Weight]
-        ContentScore --> Weight3[Apply Weight]
-        FreshScore --> Weight4[Apply Weight]
-        DomainScore --> Weight5[Apply Weight]
-    end
-    
-    Weight1 --> Combine[Combine Scores]
-    Weight2 --> Combine
-    Weight3 --> Combine
-    Weight4 --> Combine
-    Weight5 --> Combine
-    
-    Combine --> Normalize{Normalize?}
-    Normalize -->|Yes| NormalizeScore[Normalize Combined Score]
-    Normalize -->|No| FinalScore[Final Score]
-    NormalizeScore --> FinalScore
-    
-    FinalScore --> Stats[Update Statistics]
-    Stats --> End([End])
-    
-    classDef process fill:#90caf9,stroke:#000,stroke-width:2px;
-    classDef scorer fill:#fff59d,stroke:#000,stroke-width:2px;
-    classDef calc fill:#a5d6a7,stroke:#000,stroke-width:2px;
-    classDef decision fill:#ef9a9a,stroke:#000,stroke-width:2px;
-    
-    class Start,End calc;
-    class Keywords,Path,Content,Fresh,Domain scorer;
-    class KeywordScore,PathScore,ContentScore,FreshScore,DomainScore process;
-    class Normalize decision;
-```
-
-## URL Scorers
-
-URL scorers help prioritize which URLs to crawl first. Higher scores indicate higher priority.
-
-### Available Scorers
-
-1. **Keyword Relevance Scorer**
-```python
-keyword_scorer = KeywordRelevanceScorer(
-    keywords=["python", "programming"],
-    weight=1.0,
-    case_sensitive=False
-)
-```
-- Score based on keyword matches
-- Case sensitivity options
-- Weighted scoring
-
-2. **Path Depth Scorer**
-```python
-path_scorer = PathDepthScorer(
-    optimal_depth=3,  # Preferred URL depth
-    weight=0.7
-)
-```
-- Score based on URL path depth
-- Configurable optimal depth
-- Diminishing returns for deeper paths
-
-3. **Content Type Scorer**
-```python
-content_scorer = ContentTypeScorer({
-    r'\.html$': 1.0,
-    r'\.pdf$': 0.8,
-    r'\.xml$': 0.6
-})
-```
-- Score based on file types
-- Configurable type weights
-- Pattern matching support
-
-4. **Freshness Scorer**
-```python
-freshness_scorer = FreshnessScorer(weight=0.9)
-```
-- Score based on date indicators in URLs
-- Multiple date format support
-- Recency weighting
-
-5. **Domain Authority Scorer**
-```python
-authority_scorer = DomainAuthorityScorer({
-    "python.org": 1.0,
-    "github.com": 0.9,
-    "medium.com": 0.7
-})
-```
-- Score based on domain importance
-- Configurable domain weights
-- Default weight for unknown domains
-
-### Combining Scorers
-
-```python
-# Create a composite scorer
-composite_scorer = CompositeScorer([
-    KeywordRelevanceScorer(["python"], weight=1.0),
-    PathDepthScorer(optimal_depth=2, weight=0.7),
-    FreshnessScorer(weight=0.8)
-], normalize=True)
-```
-
-## Best Practices
-
-### Filter Configuration
-
-1. **Start Restrictive**
-   ```python
-   # Begin with strict filters
-   filter_chain = FilterChain([
-       DomainFilter(allowed_domains=["example.com"]),
-       ContentTypeFilter(["text/html"])
-   ])
-   ```
-
-2. **Layer Filters**
-   ```python
-   # Add more specific filters
-   filter_chain.add_filter(
-       URLPatternFilter(["*/article/*", "*/blog/*"])
-   )
-   ```
-
-3. **Monitor Filter Statistics**
-   ```python
-   # Check filter performance
-   for filter_ in filter_chain.filters:
-       print(f"{filter_.name}: {filter_.stats.rejected_urls} rejected")
-   ```
-
-### Scorer Configuration
-
-1. **Balance Weights**
-   ```python
-   # Balanced scoring configuration
-   scorer = create_balanced_scorer()
-   ```
-
-2. **Customize for Content**
-   ```python
-   # News site configuration
-   news_scorer = CompositeScorer([
-       KeywordRelevanceScorer(["news", "article"], weight=1.0),
-       FreshnessScorer(weight=1.0),
-       PathDepthScorer(optimal_depth=2, weight=0.5)
-   ])
-   ```
-
-3. **Monitor Scoring Statistics**
-   ```python
-   # Check scoring distribution
-   print(f"Average score: {scorer.stats.average_score}")
-   print(f"Score range: {scorer.stats.min_score} - {scorer.stats.max_score}")
-   ```
-
-## Common Use Cases
-
-### Blog Crawling
-```python
-blog_config = {
-    'filters': FilterChain([
-        URLPatternFilter(["*/blog/*", "*/post/*"]),
-        ContentTypeFilter(["text/html"])
-    ]),
-    'scorer': CompositeScorer([
-        FreshnessScorer(weight=1.0),
-        KeywordRelevanceScorer(["blog", "article"], weight=0.8)
-    ])
-}
-```
-
-### Documentation Sites
-```python
-docs_config = {
-    'filters': FilterChain([
-        URLPatternFilter(["*/docs/*", "*/guide/*"]),
-        ContentTypeFilter(["text/html", "application/pdf"])
-    ]),
-    'scorer': CompositeScorer([
-        PathDepthScorer(optimal_depth=3, weight=1.0),
-        KeywordRelevanceScorer(["guide", "tutorial"], weight=0.9)
-    ])
-}
-```
-
-### E-commerce Sites
-```python
-ecommerce_config = {
-    'filters': FilterChain([
-        URLPatternFilter(["*/product/*", "*/category/*"]),
-        DomainFilter(blocked_domains=["ads.*", "tracker.*"])
-    ]),
-    'scorer': CompositeScorer([
-        PathDepthScorer(optimal_depth=2, weight=1.0),
-        ContentTypeScorer({
-            r'/product/': 1.0,
-            r'/category/': 0.8
-        })
-    ])
-}
-```
-
-## Advanced Topics
-
-### Custom Filters
-```python
-class CustomFilter(URLFilter):
-    def apply(self, url: str) -> bool:
-        # Your custom filtering logic
-        return True
-```
-
-### Custom Scorers
-```python
-class CustomScorer(URLScorer):
-    def _calculate_score(self, url: str) -> float:
-        # Your custom scoring logic
-        return 1.0
-```
-
-For more examples, check our [example repository](https://github.com/example/crawl4ai/examples).
--- a/docs/deep_crawl/how_to_use.md
+++ b/docs/deep_crawl/how_to_use.md
@@ -1,206 +0,0 @@
-# Scraper Examples Guide
-
-This guide provides two complete examples of using the crawl4ai scraper: a basic implementation for simple use cases and an advanced implementation showcasing all features.
-
-## Basic Example
-
-The basic example demonstrates a simple blog scraping scenario:
-
-```python
-from crawl4ai.scraper import AsyncWebScraper, BFSScraperStrategy, FilterChain
-
-# Create simple filter chain
-filter_chain = FilterChain([
-    URLPatternFilter("*/blog/*"),
-    ContentTypeFilter(["text/html"])
-])
-
-# Initialize strategy
-strategy = BFSScraperStrategy(
-    max_depth=2,
-    filter_chain=filter_chain,
-    url_scorer=None,
-    max_concurrent=3
-)
-
-# Create and run scraper
-crawler = AsyncWebCrawler()
-scraper = AsyncWebScraper(crawler, strategy)
-result = await scraper.ascrape("https://example.com/blog/")
-```
-
-### Features Demonstrated
-- Basic URL filtering
-- Simple content type filtering
-- Depth control
-- Concurrent request limiting
-- Result collection
-
-## Advanced Example
-
-The advanced example shows a sophisticated news site scraping setup with all features enabled:
-
-```python
-# Create comprehensive filter chain
-filter_chain = FilterChain([
-    DomainFilter(
-        allowed_domains=["example.com"],
-        blocked_domains=["ads.example.com"]
-    ),
-    URLPatternFilter([
-        "*/article/*",
-        re.compile(r"\d{4}/\d{2}/.*")
-    ]),
-    ContentTypeFilter(["text/html"])
-])
-
-# Create intelligent scorer
-scorer = CompositeScorer([
-    KeywordRelevanceScorer(
-        keywords=["news", "breaking"],
-        weight=1.0
-    ),
-    PathDepthScorer(optimal_depth=3, weight=0.7),
-    FreshnessScorer(weight=0.9)
-])
-
-# Initialize advanced strategy
-strategy = BFSScraperStrategy(
-    max_depth=4,
-    filter_chain=filter_chain,
-    url_scorer=scorer,
-    max_concurrent=5
-)
-```
-
-### Features Demonstrated
-1. **Advanced Filtering**
-   - Domain filtering
-   - Pattern matching
-   - Content type control
-
-2. **Intelligent Scoring**
-   - Keyword relevance
-   - Path optimization
-   - Freshness priority
-
-3. **Monitoring**
-   - Progress tracking
-   - Error handling
-   - Statistics collection
-
-4. **Resource Management**
-   - Concurrent processing
-   - Rate limiting
-   - Cleanup handling
-
-## Running the Examples
-
-```bash
-# Basic usage
-python basic_scraper_example.py
-
-# Advanced usage with logging
-PYTHONPATH=. python advanced_scraper_example.py
-```
-
-## Example Output
-
-### Basic Example
-```
-Crawled 15 pages:
-- https://example.com/blog/post1: 24560 bytes
-- https://example.com/blog/post2: 18920 bytes
-...
-```
-
-### Advanced Example
-```
-INFO: Starting crawl of https://example.com/news/
-INFO: Processed: https://example.com/news/breaking/story1
-DEBUG: KeywordScorer: 0.85
-DEBUG: FreshnessScorer: 0.95
-INFO: Progress: 10 URLs processed
-...
-INFO: Scraping completed:
-INFO: - URLs processed: 50
-INFO: - Errors: 2
-INFO: - Total content size: 1240.50 KB
-```
-
-## Customization
-
-### Adding Custom Filters
-```python
-class CustomFilter(URLFilter):
-    def apply(self, url: str) -> bool:
-        # Your custom filtering logic
-        return True
-
-filter_chain.add_filter(CustomFilter())
-```
-
-### Custom Scoring Logic
-```python
-class CustomScorer(URLScorer):
-    def _calculate_score(self, url: str) -> float:
-        # Your custom scoring logic
-        return 1.0
-
-scorer = CompositeScorer([
-    CustomScorer(weight=1.0),
-    ...
-])
-```
-
-## Best Practices
-
-1. **Start Simple**
-   - Begin with basic filtering
-   - Add features incrementally
-   - Test thoroughly at each step
-
-2. **Monitor Performance**
-   - Watch memory usage
-   - Track processing times
-   - Adjust concurrency as needed
-
-3. **Handle Errors**
-   - Implement proper error handling
-   - Log important events
-   - Track error statistics
-
-4. **Optimize Resources**
-   - Set appropriate delays
-   - Limit concurrent requests
-   - Use streaming for large crawls
-
-## Troubleshooting
-
-Common issues and solutions:
-
-1. **Too Many Requests**
-   ```python
-   strategy = BFSScraperStrategy(
-       max_concurrent=3,  # Reduce concurrent requests
-       min_crawl_delay=2  # Increase delay between requests
-   )
-   ```
-
-2. **Memory Issues**
-   ```python
-   # Use streaming mode for large crawls
-   async for result in scraper.ascrape(url, stream=True):
-       process_result(result)
-   ```
-
-3. **Missing Content**
-   ```python
-   # Check your filter chain
-   filter_chain = FilterChain([
-       URLPatternFilter("*"),  # Broaden patterns
-       ContentTypeFilter(["*"])  # Accept all content
-   ])
-   ```
-
-For more examples and use cases, visit our [GitHub repository](https://github.com/example/crawl4ai/examples).
--- a/docs/deprecated/docker-deployment.md
+++ b/docs/deprecated/docker-deployment.md
@@ -1,189 +0,0 @@
-# 🐳 Using Docker (Legacy)
-
-Crawl4AI is available as Docker images for easy deployment. You can either pull directly from Docker Hub (recommended) or build from the repository.
-
----
-
-<details>
-<summary>🐳 <strong>Option 1: Docker Hub (Recommended)</strong></summary>
-
-Choose the appropriate image based on your platform and needs:
-
-### For AMD64 (Regular Linux/Windows):
-```bash
-# Basic version (recommended)
-docker pull unclecode/crawl4ai:basic-amd64
-docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64
-
-# Full ML/LLM support
-docker pull unclecode/crawl4ai:all-amd64
-docker run -p 11235:11235 unclecode/crawl4ai:all-amd64
-
-# With GPU support
-docker pull unclecode/crawl4ai:gpu-amd64
-docker run -p 11235:11235 unclecode/crawl4ai:gpu-amd64
-```
-
-### For ARM64 (M1/M2 Macs, ARM servers):
-```bash
-# Basic version (recommended)
-docker pull unclecode/crawl4ai:basic-arm64
-docker run -p 11235:11235 unclecode/crawl4ai:basic-arm64
-
-# Full ML/LLM support
-docker pull unclecode/crawl4ai:all-arm64
-docker run -p 11235:11235 unclecode/crawl4ai:all-arm64
-
-# With GPU support
-docker pull unclecode/crawl4ai:gpu-arm64
-docker run -p 11235:11235 unclecode/crawl4ai:gpu-arm64
-```
-
-Need more shared memory for the browser? Add `--shm-size`:
-```bash
-docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic-amd64
-```
-
-Test the installation:
-```bash
-curl http://localhost:11235/health
-```
-
-### For Raspberry Pi (32-bit) (coming soon):
-```bash
-# Pull and run basic version (recommended for Raspberry Pi)
-docker pull unclecode/crawl4ai:basic-armv7
-docker run -p 11235:11235 unclecode/crawl4ai:basic-armv7
-
-# With increased shared memory if needed
-docker run --shm-size=2gb -p 11235:11235 unclecode/crawl4ai:basic-armv7
-```
-
-Note: Due to hardware constraints, only the basic version is recommended for Raspberry Pi.
-
-</details>
-
-<details>
-<summary>🐳 <strong>Option 2: Build from Repository</strong></summary>
-
-Build the image locally based on your platform:
-
-```bash
-# Clone the repository
-git clone https://github.com/unclecode/crawl4ai.git
-cd crawl4ai
-
-# For AMD64 (Regular Linux/Windows)
-docker build --platform linux/amd64 \
-  --tag crawl4ai:local \
-  --build-arg INSTALL_TYPE=basic \
-  .
-
-# For ARM64 (M1/M2 Macs, ARM servers)
-docker build --platform linux/arm64 \
-  --tag crawl4ai:local \
-  --build-arg INSTALL_TYPE=basic \
-  .
-```
-
-Build options:
-- INSTALL_TYPE=basic (default): Basic crawling features
-- INSTALL_TYPE=all: Full ML/LLM support
-- ENABLE_GPU=true: Add GPU support
-
-Example with all options:
-```bash
-docker build --platform linux/amd64 \
-  --tag crawl4ai:local \
-  --build-arg INSTALL_TYPE=all \
-  --build-arg ENABLE_GPU=true \
-  .
-```
-
-Run your local build:
-```bash
-# Regular run
-docker run -p 11235:11235 crawl4ai:local
-
-# With increased shared memory
-docker run --shm-size=2gb -p 11235:11235 crawl4ai:local
-```
-
-Test the installation:
-```bash
-curl http://localhost:11235/health
-```
-
-</details>
-
-<details>
-<summary>🐳 <strong>Option 3: Using Docker Compose</strong></summary>
-
-Docker Compose provides a more structured way to run Crawl4AI, especially when dealing with environment variables and multiple configurations.
-
-```bash
-# Clone the repository
-git clone https://github.com/unclecode/crawl4ai.git
-cd crawl4ai
-```
-
-### For AMD64 (Regular Linux/Windows):
-```bash
-# Build and run locally
-docker-compose --profile local-amd64 up
-
-# Run from Docker Hub
-VERSION=basic docker-compose --profile hub-amd64 up   # Basic version
-VERSION=all docker-compose --profile hub-amd64 up     # Full ML/LLM support
-VERSION=gpu docker-compose --profile hub-amd64 up     # GPU support
-```
-
-### For ARM64 (M1/M2 Macs, ARM servers):
-```bash
-# Build and run locally
-docker-compose --profile local-arm64 up
-
-# Run from Docker Hub
-VERSION=basic docker-compose --profile hub-arm64 up   # Basic version
-VERSION=all docker-compose --profile hub-arm64 up     # Full ML/LLM support
-VERSION=gpu docker-compose --profile hub-arm64 up     # GPU support
-```
-
-Environment variables (optional):
-```bash
-# Create a .env file
-CRAWL4AI_API_TOKEN=your_token
-OPENAI_API_KEY=your_openai_key
-CLAUDE_API_KEY=your_claude_key
-```
-
-The compose file includes:
-- Memory management (4GB limit, 1GB reserved)
-- Shared memory volume for browser support
-- Health checks
-- Auto-restart policy
-- All necessary port mappings
-
-Test the installation:
-```bash
-curl http://localhost:11235/health
-```
-
-</details>
-
-<details>
-<summary>🚀 <strong>One-Click Deployment</strong></summary>
-
-Deploy your own instance of Crawl4AI with one click:
-
-[![DigitalOcean Referral Badge](https://web-platforms.sfo2.cdn.digitaloceanspaces.com/WWW/Badge%203.svg)](https://www.digitalocean.com/?repo=https://github.com/unclecode/crawl4ai/tree/0.3.74&refcode=a0780f1bdb3d&utm_campaign=Referral_Invite&utm_medium=Referral_Program&utm_source=badge)
-
-> 💡 **Recommended specs**: 4GB RAM minimum. Select "professional-xs" or higher when deploying for stable operation.
-
-The deployment will:
-- Set up a Docker container with Crawl4AI
-- Configure Playwright and all dependencies
-- Start the FastAPI server on port `11235`
-- Set up health checks and auto-deployment
-
-</details>
--- a/docs/examples/amazon_product_extraction_direct_url.py
+++ b/docs/examples/amazon_product_extraction_direct_url.py
@@ -1,110 +0,0 @@
-"""
-This example demonstrates how to use JSON CSS extraction to scrape product information 
-from Amazon search results. It shows how to extract structured data like product titles,
-prices, ratings, and other details using CSS selectors.
-"""
-
-from crawl4ai import AsyncWebCrawler
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
-from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
-import json
-
-
-async def extract_amazon_products():
-    # Initialize browser config
-    browser_config = BrowserConfig(browser_type="chromium", headless=True)
-
-    # Initialize crawler config with JSON CSS extraction strategy
-    crawler_config = CrawlerRunConfig(
-        extraction_strategy=JsonCssExtractionStrategy(
-            schema={
-                "name": "Amazon Product Search Results",
-                "baseSelector": "[data-component-type='s-search-result']",
-                "fields": [
-                    {
-                        "name": "asin",
-                        "selector": "",
-                        "type": "attribute",
-                        "attribute": "data-asin",
-                    },
-                    {"name": "title", "selector": "h2 a span", "type": "text"},
-                    {
-                        "name": "url",
-                        "selector": "h2 a",
-                        "type": "attribute",
-                        "attribute": "href",
-                    },
-                    {
-                        "name": "image",
-                        "selector": ".s-image",
-                        "type": "attribute",
-                        "attribute": "src",
-                    },
-                    {
-                        "name": "rating",
-                        "selector": ".a-icon-star-small .a-icon-alt",
-                        "type": "text",
-                    },
-                    {
-                        "name": "reviews_count",
-                        "selector": "[data-csa-c-func-deps='aui-da-a-popover'] ~ span span",
-                        "type": "text",
-                    },
-                    {
-                        "name": "price",
-                        "selector": ".a-price .a-offscreen",
-                        "type": "text",
-                    },
-                    {
-                        "name": "original_price",
-                        "selector": ".a-price.a-text-price .a-offscreen",
-                        "type": "text",
-                    },
-                    {
-                        "name": "sponsored",
-                        "selector": ".puis-sponsored-label-text",
-                        "type": "exists",
-                    },
-                    {
-                        "name": "delivery_info",
-                        "selector": "[data-cy='delivery-recipe'] .a-color-base",
-                        "type": "text",
-                        "multiple": True,
-                    },
-                ],
-            }
-        )
-    )
-
-    # Example search URL (replace it with the Amazon search URL you want to scrape)
-    url = "https://www.amazon.com/s?k=Samsung+Galaxy+Tab"
-
-    # Use context manager for proper resource handling
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        # Extract the data
-        result = await crawler.arun(url=url, config=crawler_config)
-
-        # Process and print the results
-        if result and result.extracted_content:
-            # Parse the JSON string into a list of products
-            products = json.loads(result.extracted_content)
-
-            # Process each product in the list
-            for product in products:
-                print("\nProduct Details:")
-                print(f"ASIN: {product.get('asin')}")
-                print(f"Title: {product.get('title')}")
-                print(f"Price: {product.get('price')}")
-                print(f"Original Price: {product.get('original_price')}")
-                print(f"Rating: {product.get('rating')}")
-                print(f"Reviews: {product.get('reviews_count')}")
-                print(f"Sponsored: {'Yes' if product.get('sponsored') else 'No'}")
-                if product.get("delivery_info"):
-                    print(f"Delivery: {' '.join(product['delivery_info'])}")
-                print("-" * 80)
-
-
-if __name__ == "__main__":
-    import asyncio
-
-    asyncio.run(extract_amazon_products())
--- a/docs/examples/amazon_product_extraction_using_hooks.py
+++ b/docs/examples/amazon_product_extraction_using_hooks.py
@@ -1,150 +0,0 @@
-"""
-This example demonstrates how to use JSON CSS extraction to scrape product information
-from Amazon search results. Instead of opening a search URL directly, it registers an
-`after_goto` hook that types the query into the search box and submits it before extraction.
-"""
-
-from crawl4ai import AsyncWebCrawler, CacheMode
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
-from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
-import json
-from playwright.async_api import Page, BrowserContext
-
-
-async def extract_amazon_products():
-    # Initialize browser config
-    browser_config = BrowserConfig(
-        # browser_type="chromium",
-        headless=True
-    )
-
-    # Initialize crawler config with JSON CSS extraction strategy
-    crawler_config = CrawlerRunConfig(
-        cache_mode=CacheMode.BYPASS,
-        extraction_strategy=JsonCssExtractionStrategy(
-            schema={
-                "name": "Amazon Product Search Results",
-                "baseSelector": "[data-component-type='s-search-result']",
-                "fields": [
-                    {
-                        "name": "asin",
-                        "selector": "",
-                        "type": "attribute",
-                        "attribute": "data-asin",
-                    },
-                    {"name": "title", "selector": "h2 a span", "type": "text"},
-                    {
-                        "name": "url",
-                        "selector": "h2 a",
-                        "type": "attribute",
-                        "attribute": "href",
-                    },
-                    {
-                        "name": "image",
-                        "selector": ".s-image",
-                        "type": "attribute",
-                        "attribute": "src",
-                    },
-                    {
-                        "name": "rating",
-                        "selector": ".a-icon-star-small .a-icon-alt",
-                        "type": "text",
-                    },
-                    {
-                        "name": "reviews_count",
-                        "selector": "[data-csa-c-func-deps='aui-da-a-popover'] ~ span span",
-                        "type": "text",
-                    },
-                    {
-                        "name": "price",
-                        "selector": ".a-price .a-offscreen",
-                        "type": "text",
-                    },
-                    {
-                        "name": "original_price",
-                        "selector": ".a-price.a-text-price .a-offscreen",
-                        "type": "text",
-                    },
-                    {
-                        "name": "sponsored",
-                        "selector": ".puis-sponsored-label-text",
-                        "type": "exists",
-                    },
-                    {
-                        "name": "delivery_info",
-                        "selector": "[data-cy='delivery-recipe'] .a-color-base",
-                        "type": "text",
-                        "multiple": True,
-                    },
-                ],
-            }
-        ),
-    )
-
-    url = "https://www.amazon.com/"
-
-    async def after_goto(
-        page: Page, context: BrowserContext, url: str, response: dict, **kwargs
-    ):
-        """Hook called after navigating to each URL"""
-        print(f"[HOOK] after_goto - Successfully loaded: {url}")
-
-        try:
-            # Wait for search box to be available
-            search_box = await page.wait_for_selector(
-                "#twotabsearchtextbox", timeout=1000
-            )
-
-            # Type the search query
-            await search_box.fill("Samsung Galaxy Tab")
-
-            # Get the search button and prepare for navigation
-            search_button = await page.wait_for_selector(
-                "#nav-search-submit-button", timeout=1000
-            )
-
-            # Click with navigation waiting
-            await search_button.click()
-
-            # Wait for search results to load
-            await page.wait_for_selector(
-                '[data-component-type="s-search-result"]', timeout=10000
-            )
-            print("[HOOK] Search completed and results loaded!")
-
-        except Exception as e:
-            print(f"[HOOK] Error during search operation: {str(e)}")
-
-        return page
-
-    # Use context manager for proper resource handling
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        crawler.crawler_strategy.set_hook("after_goto", after_goto)
-
-        # Extract the data
-        result = await crawler.arun(url=url, config=crawler_config)
-
-        # Process and print the results
-        if result and result.extracted_content:
-            # Parse the JSON string into a list of products
-            products = json.loads(result.extracted_content)
-
-            # Process each product in the list
-            for product in products:
-                print("\nProduct Details:")
-                print(f"ASIN: {product.get('asin')}")
-                print(f"Title: {product.get('title')}")
-                print(f"Price: {product.get('price')}")
-                print(f"Original Price: {product.get('original_price')}")
-                print(f"Rating: {product.get('rating')}")
-                print(f"Reviews: {product.get('reviews_count')}")
-                print(f"Sponsored: {'Yes' if product.get('sponsored') else 'No'}")
-                if product.get("delivery_info"):
-                    print(f"Delivery: {' '.join(product['delivery_info'])}")
-                print("-" * 80)
-
-
-if __name__ == "__main__":
-    import asyncio
-
-    asyncio.run(extract_amazon_products())
--- a/docs/examples/amazon_product_extraction_using_use_javascript.py
+++ b/docs/examples/amazon_product_extraction_using_use_javascript.py
@@ -1,126 +0,0 @@
-"""
-This example demonstrates how to use JSON CSS extraction to scrape product information
-from Amazon search results. It shows how to extract structured data like product titles,
-prices, ratings, and other details using CSS selectors.
-"""
-
-from crawl4ai import AsyncWebCrawler, CacheMode
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
-from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
-import json
-
-
-async def extract_amazon_products():
-    # Initialize browser config
-    browser_config = BrowserConfig(
-        # browser_type="chromium",
-        headless=True
-    )
-
-    js_code_to_search = """
-        const task = async () => {
-            document.querySelector('#twotabsearchtextbox').value = 'Samsung Galaxy Tab';
-            document.querySelector('#nav-search-submit-button').click();
-        }
-        await task();
-    """
-    js_code_to_search_sync = """
-            document.querySelector('#twotabsearchtextbox').value = 'Samsung Galaxy Tab';
-            document.querySelector('#nav-search-submit-button').click();
-    """
-    crawler_config = CrawlerRunConfig(
-        cache_mode=CacheMode.BYPASS,
-        js_code=js_code_to_search,
-        wait_for='css:[data-component-type="s-search-result"]',
-        extraction_strategy=JsonCssExtractionStrategy(
-            schema={
-                "name": "Amazon Product Search Results",
-                "baseSelector": "[data-component-type='s-search-result']",
-                "fields": [
-                    {
-                        "name": "asin",
-                        "selector": "",
-                        "type": "attribute",
-                        "attribute": "data-asin",
-                    },
-                    {"name": "title", "selector": "h2 a span", "type": "text"},
-                    {
-                        "name": "url",
-                        "selector": "h2 a",
-                        "type": "attribute",
-                        "attribute": "href",
-                    },
-                    {
-                        "name": "image",
-                        "selector": ".s-image",
-                        "type": "attribute",
-                        "attribute": "src",
-                    },
-                    {
-                        "name": "rating",
-                        "selector": ".a-icon-star-small .a-icon-alt",
-                        "type": "text",
-                    },
-                    {
-                        "name": "reviews_count",
-                        "selector": "[data-csa-c-func-deps='aui-da-a-popover'] ~ span span",
-                        "type": "text",
-                    },
-                    {
-                        "name": "price",
-                        "selector": ".a-price .a-offscreen",
-                        "type": "text",
-                    },
-                    {
-                        "name": "original_price",
-                        "selector": ".a-price.a-text-price .a-offscreen",
-                        "type": "text",
-                    },
-                    {
-                        "name": "sponsored",
-                        "selector": ".puis-sponsored-label-text",
-                        "type": "exists",
-                    },
-                    {
-                        "name": "delivery_info",
-                        "selector": "[data-cy='delivery-recipe'] .a-color-base",
-                        "type": "text",
-                        "multiple": True,
-                    },
-                ],
-            }
-        ),
-    )
-
-    # Example search URL (you should replace with your actual Amazon URL)
-    url = "https://www.amazon.com/"
-
-    # Use context manager for proper resource handling
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        # Extract the data
-        result = await crawler.arun(url=url, config=crawler_config)
-
-        # Process and print the results
-        if result and result.extracted_content:
-            # Parse the JSON string into a list of products
-            products = json.loads(result.extracted_content)
-
-            # Process each product in the list
-            for product in products:
-                print("\nProduct Details:")
-                print(f"ASIN: {product.get('asin')}")
-                print(f"Title: {product.get('title')}")
-                print(f"Price: {product.get('price')}")
-                print(f"Original Price: {product.get('original_price')}")
-                print(f"Rating: {product.get('rating')}")
-                print(f"Reviews: {product.get('reviews_count')}")
-                print(f"Sponsored: {'Yes' if product.get('sponsored') else 'No'}")
-                if product.get("delivery_info"):
-                    print(f"Delivery: {' '.join(product['delivery_info'])}")
-                print("-" * 80)
-
-
-if __name__ == "__main__":
-    import asyncio
-
-    asyncio.run(extract_amazon_products())
--- a/docs/examples/assets/audio.mp3
+++ b/docs/examples/assets/audio.mp3
--- a/docs/examples/assets/basic.png
+++ b/docs/examples/assets/basic.png
--- a/docs/examples/assets/cosine_extraction.png
+++ b/docs/examples/assets/cosine_extraction.png
--- a/docs/examples/assets/css_js.png
+++ b/docs/examples/assets/css_js.png
--- a/docs/examples/assets/css_selector.png
+++ b/docs/examples/assets/css_selector.png
--- a/docs/examples/assets/exec_script.png
+++ b/docs/examples/assets/exec_script.png
--- a/docs/examples/assets/llm_extraction.png
+++ b/docs/examples/assets/llm_extraction.png
--- a/docs/examples/assets/semantic_extraction_cosine.png
+++ b/docs/examples/assets/semantic_extraction_cosine.png
--- a/docs/examples/assets/semantic_extraction_llm.png
+++ b/docs/examples/assets/semantic_extraction_llm.png
--- a/docs/examples/async_webcrawler_multiple_urls_example.py
+++ b/docs/examples/async_webcrawler_multiple_urls_example.py
@@ -1,55 +0,0 @@
-# File: async_webcrawler_multiple_urls_example.py
-import os, sys
-
-# Add the project root (two directories above this file's folder) to sys.path so crawl4ai can be imported
-parent_dir = os.path.dirname(
-    os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
-)
-sys.path.append(parent_dir)
-
-import asyncio
-from crawl4ai import AsyncWebCrawler
-
-
-async def main():
-    # Initialize the AsyncWebCrawler
-    async with AsyncWebCrawler(verbose=True) as crawler:
-        # List of URLs to crawl
-        urls = [
-            "https://example.com",
-            "https://python.org",
-            "https://github.com",
-            "https://stackoverflow.com",
-            "https://news.ycombinator.com",
-        ]
-
-        # Set up crawling parameters
-        word_count_threshold = 100
-
-        # Run the crawling process for multiple URLs
-        results = await crawler.arun_many(
-            urls=urls,
-            word_count_threshold=word_count_threshold,
-            bypass_cache=True,
-            verbose=True,
-        )
-
-        # Process the results
-        for result in results:
-            if result.success:
-                print(f"Successfully crawled: {result.url}")
-                print(f"Title: {result.metadata.get('title', 'N/A')}")
-                print(f"Word count: {len(result.markdown.split())}")
-                print(
-                    f"Number of links: {len(result.links.get('internal', [])) + len(result.links.get('external', []))}"
-                )
-                print(f"Number of images: {len(result.media.get('images', []))}")
-                print("---")
-            else:
-                print(f"Failed to crawl: {result.url}")
-                print(f"Error: {result.error_message}")
-                print("---")
-
-
-if __name__ == "__main__":
-    asyncio.run(main())
--- a/docs/examples/browser_optimization_example.py
+++ b/docs/examples/browser_optimization_example.py
@@ -1,126 +0,0 @@
-"""
-This example demonstrates optimal browser usage patterns in Crawl4AI:
-1. Sequential crawling with session reuse
-2. Parallel crawling with browser instance reuse
-3. Performance optimization settings
-"""
-
-import asyncio
-from typing import List
-from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
-from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
-
-
-async def crawl_sequential(urls: List[str]):
-    """
-    Sequential crawling using session reuse - most efficient for moderate workloads
-    """
-    print("\n=== Sequential Crawling with Session Reuse ===")
-
-    # Configure browser with optimized settings
-    browser_config = BrowserConfig(
-        headless=True,
-        browser_args=[
-            "--disable-gpu",  # Disable GPU acceleration
-            "--disable-dev-shm-usage",  # Disable /dev/shm usage
-            "--no-sandbox",  # Required for Docker
-        ],
-        viewport={
-            "width": 800,
-            "height": 600,
-        },  # Smaller viewport for better performance
-    )
-
-    # Configure crawl settings
-    crawl_config = CrawlerRunConfig(
-        markdown_generator=DefaultMarkdownGenerator(
-            #  content_filter=PruningContentFilter(), In case you need fit_markdown
-        ),
-    )
-
-    # Create single crawler instance
-    crawler = AsyncWebCrawler(config=browser_config)
-    await crawler.start()
-
-    try:
-        session_id = "session1"  # Use same session for all URLs
-        for url in urls:
-            result = await crawler.arun(
-                url=url,
-                config=crawl_config,
-                session_id=session_id,  # Reuse same browser tab
-            )
-            if result.success:
-                print(f"Successfully crawled {url}")
-                print(f"Content length: {len(result.markdown_v2.raw_markdown)}")
-    finally:
-        await crawler.close()
-
-
-async def crawl_parallel(urls: List[str], max_concurrent: int = 3):
-    """
-    Parallel crawling while reusing browser instance - best for large workloads
-    """
-    print("\n=== Parallel Crawling with Browser Reuse ===")
-
-    browser_config = BrowserConfig(
-        headless=True,
-        browser_args=["--disable-gpu", "--disable-dev-shm-usage", "--no-sandbox"],
-        viewport={"width": 800, "height": 600},
-    )
-
-    crawl_config = CrawlerRunConfig(
-        markdown_generator=DefaultMarkdownGenerator(
-            #  content_filter=PruningContentFilter(), In case you need fit_markdown
-        ),
-    )
-
-    # Create single crawler instance for all parallel tasks
-    crawler = AsyncWebCrawler(config=browser_config)
-    await crawler.start()
-
-    try:
-        # Create tasks in batches to control concurrency
-        for i in range(0, len(urls), max_concurrent):
-            batch = urls[i : i + max_concurrent]
-            tasks = []
-
-            for j, url in enumerate(batch):
-                session_id = (
-                    f"parallel_session_{j}"  # Different session per concurrent task
-                )
-                task = crawler.arun(url=url, config=crawl_config, session_id=session_id)
-                tasks.append(task)
-
-            # Wait for batch to complete
-            results = await asyncio.gather(*tasks, return_exceptions=True)
-
-            # Process results
-            for url, result in zip(batch, results):
-                if isinstance(result, Exception):
-                    print(f"Error crawling {url}: {str(result)}")
-                elif result.success:
-                    print(f"Successfully crawled {url}")
-                    print(f"Content length: {len(result.markdown_v2.raw_markdown)}")
-    finally:
-        await crawler.close()
-
-
-async def main():
-    # Example URLs
-    urls = [
-        "https://example.com/page1",
-        "https://example.com/page2",
-        "https://example.com/page3",
-        "https://example.com/page4",
-    ]
-
-    # Demo sequential crawling
-    await crawl_sequential(urls)
-
-    # Demo parallel crawling
-    await crawl_parallel(urls, max_concurrent=2)
-
-
-if __name__ == "__main__":
-    asyncio.run(main())
--- a/docs/examples/chainlit.md
+++ b/docs/examples/chainlit.md
@@ -1,3 +0,0 @@
-# Welcome to Crawl4AI! 🚀🤖
-
-Hi there, Developer! 👋 Here is an example of a research pipeline: share a URL in your conversation with any LLM, and the content of the crawled pages will be used as context.
--- a/docs/examples/crawlai_vs_firecrawl.py
+++ b/docs/examples/crawlai_vs_firecrawl.py
@@ -1,70 +0,0 @@
-import os, time
-
-# append the path to the root of the project
-import sys
-import asyncio
-
-sys.path.append(os.path.join(os.path.dirname(__file__), "..", ".."))
-from firecrawl import FirecrawlApp
-from crawl4ai import AsyncWebCrawler
-
-__data__ = os.path.join(os.path.dirname(__file__), "..", "..") + "/.data"
-
-
-async def compare():
-    app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
-
-    # Test Firecrawl with a simple crawl
-    start = time.time()
-    scrape_status = app.scrape_url(
-        "https://www.nbcnews.com/business", params={"formats": ["markdown", "html"]}
-    )
-    end = time.time()
-    print(f"Time taken: {end - start} seconds")
-    print(len(scrape_status["markdown"]))
-    # save the markdown content with provider name
-    with open(f"{__data__}/firecrawl_simple.md", "w") as f:
-        f.write(scrape_status["markdown"])
-    # Count how many "cldnry.s-nbcnews.com" are in the markdown
-    print(scrape_status["markdown"].count("cldnry.s-nbcnews.com"))
-
-    async with AsyncWebCrawler() as crawler:
-        start = time.time()
-        result = await crawler.arun(
-            url="https://www.nbcnews.com/business",
-            # js_code=["const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"],
-            word_count_threshold=0,
-            bypass_cache=True,
-            verbose=False,
-        )
-        end = time.time()
-        print(f"Time taken: {end - start} seconds")
-        print(len(result.markdown))
-        # save the markdown content with provider name
-        with open(f"{__data__}/crawl4ai_simple.md", "w") as f:
-            f.write(result.markdown)
-        # count how many "cldnry.s-nbcnews.com" are in the markdown
-        print(result.markdown.count("cldnry.s-nbcnews.com"))
-
-        start = time.time()
-        result = await crawler.arun(
-            url="https://www.nbcnews.com/business",
-            js_code=[
-                "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
-            ],
-            word_count_threshold=0,
-            bypass_cache=True,
-            verbose=False,
-        )
-        end = time.time()
-        print(f"Time taken: {end - start} seconds")
-        print(len(result.markdown))
-        # save the markdown content with provider name
-        with open(f"{__data__}/crawl4ai_js.md", "w") as f:
-            f.write(result.markdown)
-        # count how many "cldnry.s-nbcnews.com" are in the markdown
-        print(result.markdown.count("cldnry.s-nbcnews.com"))
-
-
-if __name__ == "__main__":
-    asyncio.run(compare())
--- a/docs/examples/dispatcher_example.py
+++ b/docs/examples/dispatcher_example.py
@@ -1,136 +0,0 @@
-import asyncio
-import time
-from rich import print
-from rich.table import Table
-from crawl4ai import (
-    AsyncWebCrawler,
-    BrowserConfig,
-    CrawlerRunConfig,
-    MemoryAdaptiveDispatcher,
-    SemaphoreDispatcher,
-    RateLimiter,
-    CrawlerMonitor,
-    DisplayMode,
-    CacheMode,
-    LXMLWebScrapingStrategy,
-)
-
-
-async def memory_adaptive(urls, browser_config, run_config):
-    """Memory adaptive crawler with monitoring"""
-    start = time.perf_counter()
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        dispatcher = MemoryAdaptiveDispatcher(
-            memory_threshold_percent=70.0,
-            max_session_permit=10,
-            monitor=CrawlerMonitor(
-                max_visible_rows=15, display_mode=DisplayMode.DETAILED
-            ),
-        )
-        results = await crawler.arun_many(
-            urls, config=run_config, dispatcher=dispatcher
-        )
-    duration = time.perf_counter() - start
-    return len(results), duration
-
-
-async def memory_adaptive_with_rate_limit(urls, browser_config, run_config):
-    """Memory adaptive crawler with rate limiting"""
-    start = time.perf_counter()
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        dispatcher = MemoryAdaptiveDispatcher(
-            memory_threshold_percent=70.0,
-            max_session_permit=10,
-            rate_limiter=RateLimiter(
-                base_delay=(1.0, 2.0), max_delay=30.0, max_retries=2
-            ),
-            monitor=CrawlerMonitor(
-                max_visible_rows=15, display_mode=DisplayMode.DETAILED
-            ),
-        )
-        results = await crawler.arun_many(
-            urls, config=run_config, dispatcher=dispatcher
-        )
-    duration = time.perf_counter() - start
-    return len(results), duration
-
-
-async def semaphore(urls, browser_config, run_config):
-    """Basic semaphore crawler"""
-    start = time.perf_counter()
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        dispatcher = SemaphoreDispatcher(
-            semaphore_count=5,
-            monitor=CrawlerMonitor(
-                max_visible_rows=15, display_mode=DisplayMode.DETAILED
-            ),
-        )
-        results = await crawler.arun_many(
-            urls, config=run_config, dispatcher=dispatcher
-        )
-    duration = time.perf_counter() - start
-    return len(results), duration
-
-
-async def semaphore_with_rate_limit(urls, browser_config, run_config):
-    """Semaphore crawler with rate limiting"""
-    start = time.perf_counter()
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        dispatcher = SemaphoreDispatcher(
-            semaphore_count=5,
-            rate_limiter=RateLimiter(
-                base_delay=(1.0, 2.0), max_delay=30.0, max_retries=2
-            ),
-            monitor=CrawlerMonitor(
-                max_visible_rows=15, display_mode=DisplayMode.DETAILED
-            ),
-        )
-        results = await crawler.arun_many(
-            urls, config=run_config, dispatcher=dispatcher
-        )
-    duration = time.perf_counter() - start
-    return len(results), duration
-
-
-def create_performance_table(results):
-    """Creates a rich table showing performance results"""
-    table = Table(title="Crawler Strategy Performance Comparison")
-    table.add_column("Strategy", style="cyan")
-    table.add_column("URLs Crawled", justify="right", style="green")
-    table.add_column("Time (seconds)", justify="right", style="yellow")
-    table.add_column("URLs/second", justify="right", style="magenta")
-
-    sorted_results = sorted(results.items(), key=lambda x: x[1][1])
-
-    for strategy, (urls_crawled, duration) in sorted_results:
-        urls_per_second = urls_crawled / duration
-        table.add_row(
-            strategy, str(urls_crawled), f"{duration:.2f}", f"{urls_per_second:.2f}"
-        )
-
-    return table
-
-
-async def main():
-    urls = [f"https://example.com/page{i}" for i in range(1, 40)]
-    browser_config = BrowserConfig(headless=True, verbose=False)
-    run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, scraping_strategy=LXMLWebScrapingStrategy())
-
-    results = {
-        "Memory Adaptive": await memory_adaptive(urls, browser_config, run_config),
-        # "Memory Adaptive + Rate Limit": await memory_adaptive_with_rate_limit(
-        #     urls, browser_config, run_config
-        # ),
-        # "Semaphore": await semaphore(urls, browser_config, run_config),
-        # "Semaphore + Rate Limit": await semaphore_with_rate_limit(
-        #     urls, browser_config, run_config
-        # ),
-    }
-
-    table = create_performance_table(results)
-    print("\nPerformance Summary:")
-    print(table)
-
-
-if __name__ == "__main__":
-    asyncio.run(main())
--- a/docs/examples/docker_example.py
+++ b/docs/examples/docker_example.py
@@ -1,372 +0,0 @@
-import requests
-import json
-import time
-import sys
-import base64
-import os
-from typing import Dict, Any
-
-
-class Crawl4AiTester:
-    def __init__(self, base_url: str = "http://localhost:11235", api_token: str = None):
-        self.base_url = base_url
-        self.api_token = (
-            api_token or os.getenv("CRAWL4AI_API_TOKEN") or "test_api_code"
-        )  # Check environment variable as fallback
-        self.headers = (
-            {"Authorization": f"Bearer {self.api_token}"} if self.api_token else {}
-        )
-
-    def submit_and_wait(
-        self, request_data: Dict[str, Any], timeout: int = 300
-    ) -> Dict[str, Any]:
-        # Submit crawl job
-        response = requests.post(
-            f"{self.base_url}/crawl", json=request_data, headers=self.headers
-        )
-        if response.status_code == 403:
-            raise Exception("API token is invalid or missing")
-        task_id = response.json()["task_id"]
-        print(f"Task ID: {task_id}")
-
-        # Poll for result
-        start_time = time.time()
-        while True:
-            if time.time() - start_time > timeout:
-                raise TimeoutError(
-                    f"Task {task_id} did not complete within {timeout} seconds"
-                )
-
-            result = requests.get(
-                f"{self.base_url}/task/{task_id}", headers=self.headers
-            )
-            status = result.json()
-
-            if status["status"] == "failed":
-                print("Task failed:", status.get("error"))
-                raise Exception(f"Task failed: {status.get('error')}")
-
-            if status["status"] == "completed":
-                return status
-
-            time.sleep(2)
-
-    def submit_sync(self, request_data: Dict[str, Any]) -> Dict[str, Any]:
-        response = requests.post(
-            f"{self.base_url}/crawl_sync",
-            json=request_data,
-            headers=self.headers,
-            timeout=60,
-        )
-        if response.status_code == 408:
-            raise TimeoutError("Task did not complete within server timeout")
-        response.raise_for_status()
-        return response.json()
-
-    def crawl_direct(self, request_data: Dict[str, Any]) -> Dict[str, Any]:
-        """Directly crawl without using task queue"""
-        response = requests.post(
-            f"{self.base_url}/crawl_direct", json=request_data, headers=self.headers
-        )
-        response.raise_for_status()
-        return response.json()
-
-
-def test_docker_deployment(version="basic"):
-    tester = Crawl4AiTester(
-        base_url="http://localhost:11235",
-        # base_url="https://api.crawl4ai.com" # just for example
-        # api_token="test" # just for example
-    )
-    print(f"Testing Crawl4AI Docker {version} version")
-
-    # Health check with timeout and retry
-    max_retries = 5
-    for i in range(max_retries):
-        try:
-            health = requests.get(f"{tester.base_url}/health", timeout=10)
-            print("Health check:", health.json())
-            break
-        except requests.exceptions.RequestException:
-            if i == max_retries - 1:
-                print(f"Failed to connect after {max_retries} attempts")
-                sys.exit(1)
-            print(f"Waiting for service to start (attempt {i+1}/{max_retries})...")
-            time.sleep(5)
-
-    # Test cases based on version
-    test_basic_crawl_direct(tester)
-    test_basic_crawl(tester)
-    test_basic_crawl(tester)
-    test_basic_crawl_sync(tester)
-
-    if version in ["full", "transformer"]:
-        test_cosine_extraction(tester)
-
-    test_js_execution(tester)
-    test_css_selector(tester)
-    test_structured_extraction(tester)
-    test_llm_extraction(tester)
-    test_llm_with_ollama(tester)
-    test_screenshot(tester)
-
-
-def test_basic_crawl(tester: Crawl4AiTester):
-    print("\n=== Testing Basic Crawl ===")
-    request = {
-        "urls": "https://www.nbcnews.com/business",
-        "priority": 10,
-        "session_id": "test",
-    }
-
-    result = tester.submit_and_wait(request)
-    print(f"Basic crawl result length: {len(result['result']['markdown'])}")
-    assert result["result"]["success"]
-    assert len(result["result"]["markdown"]) > 0
-
-
-def test_basic_crawl_sync(tester: Crawl4AiTester):
-    print("\n=== Testing Basic Crawl (Sync) ===")
-    request = {
-        "urls": "https://www.nbcnews.com/business",
-        "priority": 10,
-        "session_id": "test",
-    }
-
-    result = tester.submit_sync(request)
-    print(f"Basic crawl result length: {len(result['result']['markdown'])}")
-    assert result["status"] == "completed"
-    assert result["result"]["success"]
-    assert len(result["result"]["markdown"]) > 0
-
-
-def test_basic_crawl_direct(tester: Crawl4AiTester):
-    print("\n=== Testing Basic Crawl (Direct) ===")
-    request = {
-        "urls": "https://www.nbcnews.com/business",
-        "priority": 10,
-        # "session_id": "test"
-        "cache_mode": "bypass",  # or "enabled", "disabled", "read_only", "write_only"
-    }
-
-    result = tester.crawl_direct(request)
-    print(f"Basic crawl result length: {len(result['result']['markdown'])}")
-    assert result["result"]["success"]
-    assert len(result["result"]["markdown"]) > 0
-
-
-def test_js_execution(tester: Crawl4AiTester):
-    print("\n=== Testing JS Execution ===")
-    request = {
-        "urls": "https://www.nbcnews.com/business",
-        "priority": 8,
-        "js_code": [
-            "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
-        ],
-        "wait_for": "article.tease-card:nth-child(10)",
-        "crawler_params": {"headless": True},
-    }
-
-    result = tester.submit_and_wait(request)
-    print(f"JS execution result length: {len(result['result']['markdown'])}")
-    assert result["result"]["success"]
-
-
-def test_css_selector(tester: Crawl4AiTester):
-    print("\n=== Testing CSS Selector ===")
-    request = {
-        "urls": "https://www.nbcnews.com/business",
-        "priority": 7,
-        "css_selector": ".wide-tease-item__description",
-        "crawler_params": {"headless": True},
-        "extra": {"word_count_threshold": 10},
-    }
-
-    result = tester.submit_and_wait(request)
-    print(f"CSS selector result length: {len(result['result']['markdown'])}")
-    assert result["result"]["success"]
-
-
-def test_structured_extraction(tester: Crawl4AiTester):
-    print("\n=== Testing Structured Extraction ===")
-    schema = {
-        "name": "Coinbase Crypto Prices",
-        "baseSelector": ".cds-tableRow-t45thuk",
-        "fields": [
-            {
-                "name": "crypto",
-                "selector": "td:nth-child(1) h2",
-                "type": "text",
-            },
-            {
-                "name": "symbol",
-                "selector": "td:nth-child(1) p",
-                "type": "text",
-            },
-            {
-                "name": "price",
-                "selector": "td:nth-child(2)",
-                "type": "text",
-            },
-        ],
-    }
-
-    request = {
-        "urls": "https://www.coinbase.com/explore",
-        "priority": 9,
-        "extraction_config": {"type": "json_css", "params": {"schema": schema}},
-    }
-
-    result = tester.submit_and_wait(request)
-    extracted = json.loads(result["result"]["extracted_content"])
-    print(f"Extracted {len(extracted)} items")
-    print("Sample item:", json.dumps(extracted[0], indent=2))
-    assert result["result"]["success"]
-    assert len(extracted) > 0
-
-
-def test_llm_extraction(tester: Crawl4AiTester):
-    print("\n=== Testing LLM Extraction ===")
-    schema = {
-        "type": "object",
-        "properties": {
-            "model_name": {
-                "type": "string",
-                "description": "Name of the OpenAI model.",
-            },
-            "input_fee": {
-                "type": "string",
-                "description": "Fee for input token for the OpenAI model.",
-            },
-            "output_fee": {
-                "type": "string",
-                "description": "Fee for output token for the OpenAI model.",
-            },
-        },
-        "required": ["model_name", "input_fee", "output_fee"],
-    }
-
-    request = {
-        "urls": "https://openai.com/api/pricing",
-        "priority": 8,
-        "extraction_config": {
-            "type": "llm",
-            "params": {
-                "provider": "openai/gpt-4o-mini",
-                "api_token": os.getenv("OPENAI_API_KEY"),
-                "schema": schema,
-                "extraction_type": "schema",
-                "instruction": """From the crawled content, extract all mentioned model names along with their fees for input and output tokens.""",
-            },
-        },
-        "crawler_params": {"word_count_threshold": 1},
-    }
-
-    try:
-        result = tester.submit_and_wait(request)
-        extracted = json.loads(result["result"]["extracted_content"])
-        print(f"Extracted {len(extracted)} model pricing entries")
-        print("Sample entry:", json.dumps(extracted[0], indent=2))
-        assert result["result"]["success"]
-    except Exception as e:
-        print(f"LLM extraction test failed (might be due to missing API key): {str(e)}")
-
-
-def test_llm_with_ollama(tester: Crawl4AiTester):
-    print("\n=== Testing LLM with Ollama ===")
-    schema = {
-        "type": "object",
-        "properties": {
-            "article_title": {
-                "type": "string",
-                "description": "The main title of the news article",
-            },
-            "summary": {
-                "type": "string",
-                "description": "A brief summary of the article content",
-            },
-            "main_topics": {
-                "type": "array",
-                "items": {"type": "string"},
-                "description": "Main topics or themes discussed in the article",
-            },
-        },
-    }
-
-    request = {
-        "urls": "https://www.nbcnews.com/business",
-        "priority": 8,
-        "extraction_config": {
-            "type": "llm",
-            "params": {
-                "provider": "ollama/llama2",
-                "schema": schema,
-                "extraction_type": "schema",
-                "instruction": "Extract the main article information including title, summary, and main topics.",
-            },
-        },
-        "extra": {"word_count_threshold": 1},
-        "crawler_params": {"verbose": True},
-    }
-
-    try:
-        result = tester.submit_and_wait(request)
-        extracted = json.loads(result["result"]["extracted_content"])
-        print("Extracted content:", json.dumps(extracted, indent=2))
-        assert result["result"]["success"]
-    except Exception as e:
-        print(f"Ollama extraction test failed: {str(e)}")
-
-
-def test_cosine_extraction(tester: Crawl4AiTester):
-    print("\n=== Testing Cosine Extraction ===")
-    request = {
-        "urls": "https://www.nbcnews.com/business",
-        "priority": 8,
-        "extraction_config": {
-            "type": "cosine",
-            "params": {
-                "semantic_filter": "business finance economy",
-                "word_count_threshold": 10,
-                "max_dist": 0.2,
-                "top_k": 3,
-            },
-        },
-    }
-
-    try:
-        result = tester.submit_and_wait(request)
-        extracted = json.loads(result["result"]["extracted_content"])
-        print(f"Extracted {len(extracted)} text clusters")
-        print("First cluster tags:", extracted[0]["tags"])
-        assert result["result"]["success"]
-    except Exception as e:
-        print(f"Cosine extraction test failed: {str(e)}")
-
-
-def test_screenshot(tester: Crawl4AiTester):
-    print("\n=== Testing Screenshot ===")
-    request = {
-        "urls": "https://www.nbcnews.com/business",
-        "priority": 5,
-        "screenshot": True,
-        "crawler_params": {"headless": True},
-    }
-
-    result = tester.submit_and_wait(request)
-    print("Screenshot captured:", bool(result["result"]["screenshot"]))
-
-    if result["result"]["screenshot"]:
-        # Save screenshot
-        screenshot_data = base64.b64decode(result["result"]["screenshot"])
-        with open("test_screenshot.jpg", "wb") as f:
-            f.write(screenshot_data)
-        print("Screenshot saved as test_screenshot.jpg")
-
-    assert result["result"]["success"]
-
-
-if __name__ == "__main__":
-    version = sys.argv[1] if len(sys.argv) > 1 else "basic"
-    # version = "full"
-    test_docker_deployment(version)
--- a/docs/examples/extraction_strategies_example.py
+++ b/docs/examples/extraction_strategies_example.py
@@ -1,127 +0,0 @@
-"""
-Example demonstrating different extraction strategies with various input formats.
-This example shows how to:
-1. Use different input formats (markdown, HTML, fit_markdown)
-2. Work with JSON-based extractors (CSS and XPath)
-3. Use LLM-based extraction with different input formats
-4. Configure browser and crawler settings properly
-"""
-
-import asyncio
-import os
-
-from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
-from crawl4ai.extraction_strategy import (
-    LLMExtractionStrategy,
-    JsonCssExtractionStrategy,
-    JsonXPathExtractionStrategy,
-)
-from crawl4ai.content_filter_strategy import PruningContentFilter
-from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
-
-
-async def run_extraction(crawler: AsyncWebCrawler, url: str, strategy, name: str):
-    """Helper function to run extraction with proper configuration"""
-    try:
-        # Configure the crawler run settings
-        config = CrawlerRunConfig(
-            cache_mode=CacheMode.BYPASS,
-            extraction_strategy=strategy,
-            markdown_generator=DefaultMarkdownGenerator(
-                content_filter=PruningContentFilter()  # For fit_markdown support
-            ),
-        )
-
-        # Run the crawler
-        result = await crawler.arun(url=url, config=config)
-
-        if result.success:
-            print(f"\n=== {name} Results ===")
-            print(f"Extracted Content: {result.extracted_content}")
-            print(f"Raw Markdown Length: {len(result.markdown_v2.raw_markdown)}")
-            print(
-                f"Citations Markdown Length: {len(result.markdown_v2.markdown_with_citations)}"
-            )
-        else:
-            print(f"Error in {name}: Crawl failed")
-
-    except Exception as e:
-        print(f"Error in {name}: {str(e)}")
-
-
-async def main():
-    # Example URL (replace with actual URL)
-    url = "https://example.com/product-page"
-
-    # Configure browser settings
-    browser_config = BrowserConfig(headless=True, verbose=True)
-
-    # Initialize extraction strategies
-
-    # 1. LLM Extraction with different input formats
-    markdown_strategy = LLMExtractionStrategy(
-        provider="openai/gpt-4o-mini",
-        api_token=os.getenv("OPENAI_API_KEY"),
-        instruction="Extract product information including name, price, and description",
-    )
-
-    html_strategy = LLMExtractionStrategy(
-        input_format="html",
-        provider="openai/gpt-4o-mini",
-        api_token=os.getenv("OPENAI_API_KEY"),
-        instruction="Extract product information from HTML including structured data",
-    )
-
-    fit_markdown_strategy = LLMExtractionStrategy(
-        input_format="fit_markdown",
-        provider="openai/gpt-4o-mini",
-        api_token=os.getenv("OPENAI_API_KEY"),
-        instruction="Extract product information from cleaned markdown",
-    )
-
-    # 2. JSON CSS Extraction (automatically uses HTML input)
-    css_schema = {
-        "baseSelector": ".product",
-        "fields": [
-            {"name": "title", "selector": "h1.product-title", "type": "text"},
-            {"name": "price", "selector": ".price", "type": "text"},
-            {"name": "description", "selector": ".description", "type": "text"},
-        ],
-    }
-    css_strategy = JsonCssExtractionStrategy(schema=css_schema)
-
-    # 3. JSON XPath Extraction (automatically uses HTML input)
-    xpath_schema = {
-        "baseSelector": "//div[@class='product']",
-        "fields": [
-            {
-                "name": "title",
-                "selector": ".//h1[@class='product-title']/text()",
-                "type": "text",
-            },
-            {
-                "name": "price",
-                "selector": ".//span[@class='price']/text()",
-                "type": "text",
-            },
-            {
-                "name": "description",
-                "selector": ".//div[@class='description']/text()",
-                "type": "text",
-            },
-        ],
-    }
-    xpath_strategy = JsonXPathExtractionStrategy(schema=xpath_schema)
-
-    # Use context manager for proper resource handling
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        # Run all strategies
-        await run_extraction(crawler, url, markdown_strategy, "Markdown LLM")
-        await run_extraction(crawler, url, html_strategy, "HTML LLM")
-        await run_extraction(crawler, url, fit_markdown_strategy, "Fit Markdown LLM")
-        await run_extraction(crawler, url, css_strategy, "CSS Extraction")
-        await run_extraction(crawler, url, xpath_strategy, "XPath Extraction")
-
-
-if __name__ == "__main__":
-    asyncio.run(main())
--- a/docs/examples/full_page_screenshot_and_pdf_export.md
+++ b/docs/examples/full_page_screenshot_and_pdf_export.md
@@ -1,58 +0,0 @@
-# Capturing Full-Page Screenshots and PDFs from Massive Webpages with Crawl4AI
-
-When dealing with very long web pages, traditional full-page screenshots can be slow or fail entirely. For large pages (like extensive Wikipedia articles), generating a single massive screenshot often leads to delays, memory issues, or style differences.
-
-**The New Approach:**
-We’ve introduced a new feature that effortlessly handles even the biggest pages by first exporting them as a PDF, then converting that PDF into a high-quality image. This approach leverages the browser’s built-in PDF rendering, making it both stable and efficient for very long content. You also have the option to directly save the PDF for your own usage—no need for multiple passes or complex stitching logic.
-
-**Key Benefits:**
-- **Reliability:** The PDF export never times out and works regardless of page length.
-- **Versatility:** Get both the PDF and a screenshot in one crawl, without reloading or reprocessing.
-- **Performance:** Skips manual scrolling and stitching images, reducing complexity and runtime.
-
-**Simple Example:**
-```python
-import os, sys
-import asyncio
-from crawl4ai import AsyncWebCrawler, CacheMode
-
-# Adjust paths as needed
-parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
-sys.path.append(parent_dir)
-__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
-
-async def main():
-    async with AsyncWebCrawler() as crawler:
-        # Request both PDF and screenshot
-        result = await crawler.arun(
-            url='https://en.wikipedia.org/wiki/List_of_common_misconceptions',
-            cache_mode=CacheMode.BYPASS,
-            pdf=True,
-            screenshot=True
-        )
-        
-        if result.success:
-            from base64 import b64decode  # imported once so both branches below can use it
-
-            # Save screenshot
-            if result.screenshot:
-                with open(os.path.join(__location__, "screenshot.png"), "wb") as f:
-                    f.write(b64decode(result.screenshot))
-            
-            # Save PDF
-            if result.pdf:
-                with open(os.path.join(__location__, "page.pdf"), "wb") as f:
-                    f.write(b64decode(result.pdf))
-
-if __name__ == "__main__":
-    asyncio.run(main())
-```
-
-**What Happens Under the Hood:**
-- Crawl4AI navigates to the target page.
-- If `pdf=True`, it exports the current page as a full PDF, capturing all of its content no matter the length.
-- If `screenshot=True` and a PDF is already available, it converts the first page of that PDF directly to an image, with no repeated loading or scrolling.
-- Finally, you get your PDF and/or screenshot ready to use.
-
-**Conclusion:**
-With this feature, Crawl4AI becomes even more robust and versatile for large-scale content extraction. Whether you need a PDF snapshot or a quick screenshot, you now have a reliable solution for even the most extensive webpages.
--- a/docs/examples/hello_world.py
+++ b/docs/examples/hello_world.py
@@ -1,23 +0,0 @@
-import asyncio
-from crawl4ai import *
-
-
-async def main():
-    browser_config = BrowserConfig(headless=True, verbose=True)
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        crawler_config = CrawlerRunConfig(
-            cache_mode=CacheMode.BYPASS,
-            markdown_generator=DefaultMarkdownGenerator(
-                content_filter=PruningContentFilter(
-                    threshold=0.48, threshold_type="fixed", min_word_threshold=0
-                )
-            ),
-        )
-        result = await crawler.arun(
-            url="https://www.helloworld.org", config=crawler_config
-        )
-        print(result.markdown_v2.raw_markdown[:500])
-
-
-if __name__ == "__main__":
-    asyncio.run(main())