Fix the model nam ein quick start example

chore: Bump version to 0.3.71 and improve error handling
- Update version number to 0.3.71 - Add sleep_on_close option to AsyncPlaywrightCrawlerStrategy - Enhance context creation with additional options - Improve error message formatting and visibility - Update quickstart documentation
2024-10-18 15:32:25 +08:00 · 2024-10-18 13:31:12 +08:00 · 2024-10-18 12:51:23 +08:00 · 2024-10-18 12:35:09 +08:00 · 2024-10-17 21:37:48 +08:00
24 changed files with 822 additions and 2406 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -205,7 +205,4 @@ pypi_build.sh
 git_issues.py
 git_issues.md

-.tests/
-.issues/
-.docs/
-.issues/
+.tests/
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,79 @@
 # Changelog

+## [v0.3.71] - 2024-10-18
+
+### Changes
+1. **Version Update**:
+   - Updated version number from 0.3.7 to 0.3.71.
+
+2. **Crawler Enhancements**:
+   - Added `sleep_on_close` option to AsyncPlaywrightCrawlerStrategy for delayed browser closure.
+   - Improved context creation with additional options:
+     - Enabled `accept_downloads` and `java_script_enabled`.
+     - Added a cookie to enable cookies by default.
+
+3. **Error Handling Improvements**:
+   - Enhanced error messages in AsyncWebCrawler's `arun` method.
+   - Updated error reporting format for better visibility and consistency.
+
+4. **Performance Optimization**:
+   - Commented out automatic page and context closure in `crawl` method to potentially improve performance in certain scenarios.
+
+### Documentation
+- Updated quickstart notebook:
+  - Changed installation command to use the released package instead of GitHub repository.
+  - Updated kernel display name.
+
+### Developer Notes
+- Minor code refactoring and cleanup.
+
+## [v0.3.7] - 2024-10-17
+
+### New Features
+1. **Enhanced Browser Stealth**: 
+   - Implemented `playwright_stealth` for improved bot detection avoidance.
+   - Added `StealthConfig` for fine-tuned control over stealth parameters.
+
+2. **User Simulation**:
+   - New `simulate_user` option to mimic human-like interactions (mouse movements, clicks, keyboard presses).
+
+3. **Navigator Override**:
+   - Added `override_navigator` option to modify navigator properties, further improving bot detection evasion.
+
+4. **Improved iframe Handling**:
+   - New `process_iframes` parameter to extract and integrate iframe content into the main page.
+
+5. **Flexible Browser Selection**:
+   - Support for choosing between Chromium, Firefox, and WebKit browsers.
+
+6. **Include Links in Markdown**:
+    - Added support for including links in Markdown content, by definin g a new flag `include_links_on_markdown` in `crawl` method.   
+
+### Improvements
+1. **Better Error Handling**:
+   - Enhanced error reporting in WebScrappingStrategy with detailed error messages and suggestions.
+   - Added console message and error logging for better debugging.
+
+2. **Image Processing Enhancements**:
+   - Improved image dimension updating and filtering logic.
+
+3. **Crawling Flexibility**:
+   - Added support for custom viewport sizes.
+   - Implemented delayed content retrieval with `delay_before_return_html` parameter.
+
+4. **Performance Optimization**:
+   - Adjusted default semaphore count for parallel crawling.
+
+### Bug Fixes
+- Fixed an issue where the HTML content could be empty after processing.
+
+### Examples
+- Added new example `crawl_with_user_simulation()` demonstrating the use of user simulation and navigator override features.
+
+### Developer Notes
+- Refactored code for better maintainability and readability.
+- Updated browser launch arguments for improved compatibility and performance.
+
 ## [v0.3.6] - 2024-10-12 

 ### 1. Improved Crawling Control
--- a/crawl4ai/init.py
+++ b/crawl4ai/init.py
@@ -3,7 +3,7 @@
 from .async_webcrawler import AsyncWebCrawler
 from .models import CrawlResult

-__version__ = "0.3.6"
+__version__ = "0.3.71"

 __all__ = [
    "AsyncWebCrawler",
--- a/crawl4ai/async_crawler_strategy
+++ b/crawl4ai/async_crawler_strategy
@@ -0,0 +1,558 @@
+import asyncio
+import base64
+import time
+from abc import ABC, abstractmethod
+from typing import Callable, Dict, Any, List, Optional, Awaitable
+import os
+from playwright.async_api import async_playwright, Page, Browser, Error
+from io import BytesIO
+from PIL import Image, ImageDraw, ImageFont
+from pathlib import Path
+from playwright.async_api import ProxySettings
+from pydantic import BaseModel
+import hashlib
+import json
+import uuid
+from playwright_stealth import stealth_async
+
+class AsyncCrawlResponse(BaseModel):
+    html: str
+    response_headers: Dict[str, str]
+    status_code: int
+    screenshot: Optional[str] = None
+    get_delayed_content: Optional[Callable[[Optional[float]], Awaitable[str]]] = None
+
+    class Config:
+        arbitrary_types_allowed = True
+
+class AsyncCrawlerStrategy(ABC):
+    @abstractmethod
+    async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
+        pass
+    
+    @abstractmethod
+    async def crawl_many(self, urls: List[str], **kwargs) -> List[AsyncCrawlResponse]:
+        pass
+    
+    @abstractmethod
+    async def take_screenshot(self, url: str) -> str:
+        pass
+    
+    @abstractmethod
+    def update_user_agent(self, user_agent: str):
+        pass
+    
+    @abstractmethod
+    def set_hook(self, hook_type: str, hook: Callable):
+        pass
+
+class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
+    def __init__(self, use_cached_html=False, js_code=None, **kwargs):
+        self.use_cached_html = use_cached_html
+        self.user_agent = kwargs.get(
+            "user_agent",
+            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
+            "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
+        )
+        self.proxy = kwargs.get("proxy")
+        self.headless = kwargs.get("headless", True)
+        self.browser_type = kwargs.get("browser_type", "chromium")
+        self.headers = kwargs.get("headers", {})
+        self.sessions = {}
+        self.session_ttl = 1800 
+        self.js_code = js_code
+        self.verbose = kwargs.get("verbose", False)
+        self.playwright = None
+        self.browser = None
+        self.hooks = {
+            'on_browser_created': None,
+            'on_user_agent_updated': None,
+            'on_execution_started': None,
+            'before_goto': None,
+            'after_goto': None,
+            'before_return_html': None,
+            'before_retrieve_html': None
+        }
+
+    async def __aenter__(self):
+        await self.start()
+        return self
+
+    async def __aexit__(self, exc_type, exc_val, exc_tb):
+        await self.close()
+
+    async def start(self):
+        if self.playwright is None:
+            self.playwright = await async_playwright().start()
+        if self.browser is None:
+            browser_args = {
+                "headless": self.headless,
+                "args": [
+                    "--disable-gpu",
+                    "--no-sandbox",
+                    "--disable-dev-shm-usage",
+                    "--disable-blink-features=AutomationControlled",
+                    "--disable-infobars",
+                    "--window-position=0,0",
+                    "--ignore-certificate-errors",
+                    "--ignore-certificate-errors-spki-list",
+                    # "--headless=new",  # Use the new headless mode
+                ]
+            }
+            
+            # Add proxy settings if a proxy is specified
+            if self.proxy:
+                proxy_settings = ProxySettings(server=self.proxy)
+                browser_args["proxy"] = proxy_settings
+                
+            # Select the appropriate browser based on the browser_type
+            if self.browser_type == "firefox":
+                self.browser = await self.playwright.firefox.launch(**browser_args)
+            elif self.browser_type == "webkit":
+                self.browser = await self.playwright.webkit.launch(**browser_args)
+            else:
+                self.browser = await self.playwright.chromium.launch(**browser_args)
+
+            await self.execute_hook('on_browser_created', self.browser)
+
+    async def close(self):
+        if self.browser:
+            await self.browser.close()
+            self.browser = None
+        if self.playwright:
+            await self.playwright.stop()
+            self.playwright = None
+
+    def __del__(self):
+        if self.browser or self.playwright:
+            asyncio.get_event_loop().run_until_complete(self.close())
+
+    def set_hook(self, hook_type: str, hook: Callable):
+        if hook_type in self.hooks:
+            self.hooks[hook_type] = hook
+        else:
+            raise ValueError(f"Invalid hook type: {hook_type}")
+
+    async def execute_hook(self, hook_type: str, *args):
+        hook = self.hooks.get(hook_type)
+        if hook:
+            if asyncio.iscoroutinefunction(hook):
+                return await hook(*args)
+            else:
+                return hook(*args)
+        return args[0] if args else None
+
+    def update_user_agent(self, user_agent: str):
+        self.user_agent = user_agent
+
+    def set_custom_headers(self, headers: Dict[str, str]):
+        self.headers = headers
+
+    async def kill_session(self, session_id: str):
+        if session_id in self.sessions:
+            context, page, _ = self.sessions[session_id]
+            await page.close()
+            await context.close()
+            del self.sessions[session_id]
+
+    def _cleanup_expired_sessions(self):
+        current_time = time.time()
+        expired_sessions = [
+            sid for sid, (_, _, last_used) in self.sessions.items() 
+            if current_time - last_used > self.session_ttl
+        ]
+        for sid in expired_sessions:
+            asyncio.create_task(self.kill_session(sid))
+            
+    async def smart_wait(self, page: Page, wait_for: str, timeout: float = 30000):
+        wait_for = wait_for.strip()
+        
+        if wait_for.startswith('js:'):
+            # Explicitly specified JavaScript
+            js_code = wait_for[3:].strip()
+            return await self.csp_compliant_wait(page, js_code, timeout)
+        elif wait_for.startswith('css:'):
+            # Explicitly specified CSS selector
+            css_selector = wait_for[4:].strip()
+            try:
+                await page.wait_for_selector(css_selector, timeout=timeout)
+            except Error as e:
+                if 'Timeout' in str(e):
+                    raise TimeoutError(f"Timeout after {timeout}ms waiting for selector '{css_selector}'")
+                else:
+                    raise ValueError(f"Invalid CSS selector: '{css_selector}'")
+        else:
+            # Auto-detect based on content
+            if wait_for.startswith('()') or wait_for.startswith('function'):
+                # It's likely a JavaScript function
+                return await self.csp_compliant_wait(page, wait_for, timeout)
+            else:
+                # Assume it's a CSS selector first
+                try:
+                    await page.wait_for_selector(wait_for, timeout=timeout)
+                except Error as e:
+                    if 'Timeout' in str(e):
+                        raise TimeoutError(f"Timeout after {timeout}ms waiting for selector '{wait_for}'")
+                    else:
+                        # If it's not a timeout error, it might be an invalid selector
+                        # Let's try to evaluate it as a JavaScript function as a fallback
+                        try:
+                            return await self.csp_compliant_wait(page, f"() => {{{wait_for}}}", timeout)
+                        except Error:
+                            raise ValueError(f"Invalid wait_for parameter: '{wait_for}'. "
+                                             "It should be either a valid CSS selector, a JavaScript function, "
+                                             "or explicitly prefixed with 'js:' or 'css:'.")
+    
+    async def csp_compliant_wait(self, page: Page, user_wait_function: str, timeout: float = 30000):
+        wrapper_js = f"""
+        async () => {{
+            const userFunction = {user_wait_function};
+            const startTime = Date.now();
+            while (true) {{
+                if (await userFunction()) {{
+                    return true;
+                }}
+                if (Date.now() - startTime > {timeout}) {{
+                    throw new Error('Timeout waiting for condition');
+                }}
+                await new Promise(resolve => setTimeout(resolve, 100));
+            }}
+        }}
+        """
+        
+        try:
+            await page.evaluate(wrapper_js)
+        except TimeoutError:
+            raise TimeoutError(f"Timeout after {timeout}ms waiting for condition")
+        except Exception as e:
+            raise RuntimeError(f"Error in wait condition: {str(e)}")
+
+    async def process_iframes(self, page):
+        # Find all iframes
+        iframes = await page.query_selector_all('iframe')
+        
+        for i, iframe in enumerate(iframes):
+            try:
+                # Add a unique identifier to the iframe
+                await iframe.evaluate(f'(element) => element.id = "iframe-{i}"')
+                
+                # Get the frame associated with this iframe
+                frame = await iframe.content_frame()
+                
+                if frame:
+                    # Wait for the frame to load
+                    await frame.wait_for_load_state('load', timeout=30000)  # 30 seconds timeout
+                    
+                    # Extract the content of the iframe's body
+                    iframe_content = await frame.evaluate('() => document.body.innerHTML')
+                    
+                    # Generate a unique class name for this iframe
+                    class_name = f'extracted-iframe-content-{i}'
+                    
+                    # Replace the iframe with a div containing the extracted content
+                    _iframe = iframe_content.replace('`', '\\`')
+                    await page.evaluate(f"""
+                        () => {{
+                            const iframe = document.getElementById('iframe-{i}');
+                            const div = document.createElement('div');
+                            div.innerHTML = `{_iframe}`;
+                            div.className = '{class_name}';
+                            iframe.replaceWith(div);
+                        }}
+                    """)
+                else:
+                    print(f"Warning: Could not access content frame for iframe {i}")
+            except Exception as e:
+                print(f"Error processing iframe {i}: {str(e)}")
+
+        # Return the page object
+        return page  
+    
+    async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
+        response_headers = {}
+        status_code = None
+        
+        self._cleanup_expired_sessions()
+        session_id = kwargs.get("session_id")
+        if session_id:
+            context, page, _ = self.sessions.get(session_id, (None, None, None))
+            if not context:
+                context = await self.browser.new_context(
+                    user_agent=self.user_agent,
+                    viewport={"width": 1920, "height": 1080},
+                    proxy={"server": self.proxy} if self.proxy else None
+                )
+                await context.set_extra_http_headers(self.headers)
+                page = await context.new_page()
+                self.sessions[session_id] = (context, page, time.time())
+        else:
+            context = await self.browser.new_context(
+                user_agent=self.user_agent,
+                viewport={"width": 1920, "height": 1080},
+                proxy={"server": self.proxy} if self.proxy else None
+            )
+            await context.set_extra_http_headers(self.headers)
+            
+            if kwargs.get("override_navigator", False):
+                # Inject scripts to override navigator properties
+                await context.add_init_script("""
+                    // Pass the Permissions Test.
+                    const originalQuery = window.navigator.permissions.query;
+                    window.navigator.permissions.query = (parameters) => (
+                        parameters.name === 'notifications' ?
+                            Promise.resolve({ state: Notification.permission }) :
+                            originalQuery(parameters)
+                    );
+                    Object.defineProperty(navigator, 'webdriver', {
+                        get: () => undefined
+                    });
+                    window.navigator.chrome = {
+                        runtime: {},
+                        // Add other properties if necessary
+                    };
+                    Object.defineProperty(navigator, 'plugins', {
+                        get: () => [1, 2, 3, 4, 5],
+                    });
+                    Object.defineProperty(navigator, 'languages', {
+                        get: () => ['en-US', 'en'],
+                    });
+                    Object.defineProperty(document, 'hidden', {
+                        get: () => false
+                    });
+                    Object.defineProperty(document, 'visibilityState', {
+                        get: () => 'visible'
+                    });
+                """)
+            
+            page = await context.new_page()
+
+        try:
+            if self.verbose:
+                print(f"[LOG] 🕸️ Crawling {url} using AsyncPlaywrightCrawlerStrategy...")
+
+            if self.use_cached_html:
+                cache_file_path = os.path.join(
+                    Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest()
+                )
+                if os.path.exists(cache_file_path):
+                    html = ""
+                    with open(cache_file_path, "r") as f:
+                        html = f.read()
+                    # retrieve response headers and status code from cache
+                    with open(cache_file_path + ".meta", "r") as f:
+                        meta = json.load(f)
+                        response_headers = meta.get("response_headers", {})
+                        status_code = meta.get("status_code")
+                    response = AsyncCrawlResponse(
+                        html=html, response_headers=response_headers, status_code=status_code
+                    )
+                    return response
+
+            if not kwargs.get("js_only", False):
+                await self.execute_hook('before_goto', page)
+                
+                response = await page.goto("about:blank")
+                await stealth_async(page)
+                response = await page.goto(
+                    url, wait_until="domcontentloaded", timeout=kwargs.get("page_timeout", 60000)
+                )
+                
+                # await stealth_async(page)
+                # response = await page.goto("about:blank")
+                # await stealth_async(page)
+                # await page.evaluate(f"window.location.href = '{url}'")
+                
+                await self.execute_hook('after_goto', page)
+                
+                # Get status code and headers
+                status_code = response.status
+                response_headers = response.headers
+            else:
+                status_code = 200
+                response_headers = {}
+
+            await page.wait_for_selector('body')
+            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
+
+            js_code = kwargs.get("js_code", kwargs.get("js", self.js_code))
+            if js_code:
+                if isinstance(js_code, str):
+                    await page.evaluate(js_code)
+                elif isinstance(js_code, list):
+                    for js in js_code:
+                        await page.evaluate(js)
+                
+                await page.wait_for_load_state('networkidle')
+                # Check for on execution event
+                await self.execute_hook('on_execution_started', page)
+                
+            if kwargs.get("simulate_user", False):
+                # Simulate user interactions
+                await page.mouse.move(100, 100)
+                await page.mouse.down()
+                await page.mouse.up()
+                await page.keyboard.press('ArrowDown')
+
+            # Handle the wait_for parameter
+            wait_for = kwargs.get("wait_for")
+            if wait_for:
+                try:
+                    await self.smart_wait(page, wait_for, timeout=kwargs.get("page_timeout", 60000))
+                except Exception as e:
+                    raise RuntimeError(f"Wait condition failed: {str(e)}")
+
+
+            
+            # Update image dimensions
+            update_image_dimensions_js = """
+            () => {
+                return new Promise((resolve) => {
+                    const filterImage = (img) => {
+                        // Filter out images that are too small
+                        if (img.width < 100 && img.height < 100) return false;
+                        
+                        // Filter out images that are not visible
+                        const rect = img.getBoundingClientRect();
+                        if (rect.width === 0 || rect.height === 0) return false;
+                        
+                        // Filter out images with certain class names (e.g., icons, thumbnails)
+                        if (img.classList.contains('icon') || img.classList.contains('thumbnail')) return false;
+                        
+                        // Filter out images with certain patterns in their src (e.g., placeholder images)
+                        if (img.src.includes('placeholder') || img.src.includes('icon')) return false;
+                        
+                        return true;
+                    };
+
+                    const images = Array.from(document.querySelectorAll('img')).filter(filterImage);
+                    let imagesLeft = images.length;
+                    
+                    if (imagesLeft === 0) {
+                        resolve();
+                        return;
+                    }
+
+                    const checkImage = (img) => {
+                        if (img.complete && img.naturalWidth !== 0) {
+                            img.setAttribute('width', img.naturalWidth);
+                            img.setAttribute('height', img.naturalHeight);
+                            imagesLeft--;
+                            if (imagesLeft === 0) resolve();
+                        }
+                    };
+
+                    images.forEach(img => {
+                        checkImage(img);
+                        if (!img.complete) {
+                            img.onload = () => {
+                                checkImage(img);
+                            };
+                            img.onerror = () => {
+                                imagesLeft--;
+                                if (imagesLeft === 0) resolve();
+                            };
+                        }
+                    });
+
+                    // Fallback timeout of 5 seconds
+                    setTimeout(() => resolve(), 5000);
+                });
+            }
+            """
+            await page.evaluate(update_image_dimensions_js)
+
+            # Wait a bit for any onload events to complete
+            await page.wait_for_timeout(100)
+
+            # Process iframes
+            if kwargs.get("process_iframes", False):
+                page = await self.process_iframes(page)
+            
+            await self.execute_hook('before_retrieve_html', page)
+            # Check if delay_before_return_html is set then wait for that time
+            delay_before_return_html = kwargs.get("delay_before_return_html")
+            if delay_before_return_html:
+                await asyncio.sleep(delay_before_return_html)
+                
+            html = await page.content()
+            await self.execute_hook('before_return_html', page, html)
+            
+            # Check if kwargs has screenshot=True then take screenshot
+            screenshot_data = None
+            if kwargs.get("screenshot"):
+                screenshot_data = await self.take_screenshot(url)            
+
+            if self.verbose:
+                print(f"[LOG] ✅ Crawled {url} successfully!")
+
+            if self.use_cached_html:
+                cache_file_path = os.path.join(
+                    Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest()
+                )
+                with open(cache_file_path, "w", encoding="utf-8") as f:
+                    f.write(html)
+                # store response headers and status code in cache
+                with open(cache_file_path + ".meta", "w", encoding="utf-8") as f:
+                    json.dump({
+                        "response_headers": response_headers,
+                        "status_code": status_code
+                    }, f)
+
+            async def get_delayed_content(delay: float = 5.0) -> str:
+                if self.verbose:
+                    print(f"[LOG] Waiting for {delay} seconds before retrieving content for {url}")
+                await asyncio.sleep(delay)
+                return await page.content()
+                
+            response = AsyncCrawlResponse(
+                html=html, 
+                response_headers=response_headers, 
+                status_code=status_code,
+                screenshot=screenshot_data,
+                get_delayed_content=get_delayed_content
+            )
+            return response
+        except Error as e:
+            raise Error(f"Failed to crawl {url}: {str(e)}")
+        finally:
+            if not session_id:
+                await page.close()
+                await context.close()
+
+    async def crawl_many(self, urls: List[str], **kwargs) -> List[AsyncCrawlResponse]:
+        semaphore_count = kwargs.get('semaphore_count', 5)  # Adjust as needed
+        semaphore = asyncio.Semaphore(semaphore_count)
+
+        async def crawl_with_semaphore(url):
+            async with semaphore:
+                return await self.crawl(url, **kwargs)
+
+        tasks = [crawl_with_semaphore(url) for url in urls]
+        results = await asyncio.gather(*tasks, return_exceptions=True)
+        return [result if not isinstance(result, Exception) else str(result) for result in results]
+
+    async def take_screenshot(self, url: str, wait_time=1000) -> str:
+        async with await self.browser.new_context(user_agent=self.user_agent) as context:
+            page = await context.new_page()
+            try:
+                await page.goto(url, wait_until="domcontentloaded", timeout=30000)
+                # Wait for a specified time (default is 1 second)
+                await page.wait_for_timeout(wait_time)
+                screenshot = await page.screenshot(full_page=True)
+                return base64.b64encode(screenshot).decode('utf-8')
+            except Exception as e:
+                error_message = f"Failed to take screenshot: {str(e)}"
+                print(error_message)
+
+                # Generate an error image
+                img = Image.new('RGB', (800, 600), color='black')
+                draw = ImageDraw.Draw(img)
+                font = ImageFont.load_default()
+                draw.text((10, 10), error_message, fill=(255, 255, 255), font=font)
+                
+                buffered = BytesIO()
+                img.save(buffered, format="JPEG")
+                return base64.b64encode(buffered.getvalue()).decode('utf-8')
+            finally:
+                await page.close()
+
--- a/crawl4ai/async_crawler_strategy.py
+++ b/crawl4ai/async_crawler_strategy.py
@@ -1,17 +1,35 @@
 import asyncio
-import base64, time
+import base64
+import time
 from abc import ABC, abstractmethod
 from typing import Callable, Dict, Any, List, Optional, Awaitable
 import os
 from playwright.async_api import async_playwright, Page, Browser, Error
 from io import BytesIO
 from PIL import Image, ImageDraw, ImageFont
-from .utils import sanitize_input_encode, calculate_semaphore_count
-import json, uuid
-import hashlib
 from pathlib import Path
 from playwright.async_api import ProxySettings
 from pydantic import BaseModel
+import hashlib
+import json
+import uuid
+from playwright_stealth import StealthConfig, stealth_async
+
+stealth_config = StealthConfig(
+    webdriver=True,
+    chrome_app=True,
+    chrome_csi=True,
+    chrome_load_times=True,
+    chrome_runtime=True,
+    navigator_languages=True,
+    navigator_plugins=True,
+    navigator_permissions=True,
+    webgl_vendor=True,
+    outerdimensions=True,
+    navigator_hardware_concurrency=True,
+    media_codecs=True,
+)
+

 class AsyncCrawlResponse(BaseModel):
    html: str
@@ -47,10 +65,14 @@ class AsyncCrawlerStrategy(ABC):
 class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
    def __init__(self, use_cached_html=False, js_code=None, **kwargs):
        self.use_cached_html = use_cached_html
-        self.user_agent = kwargs.get("user_agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
+        self.user_agent = kwargs.get(
+            "user_agent",
+            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
+            "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
+        )
        self.proxy = kwargs.get("proxy")
        self.headless = kwargs.get("headless", True)
-        self.browser_type = kwargs.get("browser_type", "chromium")  # New parameter
+        self.browser_type = kwargs.get("browser_type", "chromium")
        self.headers = kwargs.get("headers", {})
        self.sessions = {}
        self.session_ttl = 1800 
@@ -58,6 +80,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
        self.verbose = kwargs.get("verbose", False)
        self.playwright = None
        self.browser = None
+        self.sleep_on_close = kwargs.get("sleep_on_close", False)
        self.hooks = {
            'on_browser_created': None,
            'on_user_agent_updated': None,
@@ -83,9 +106,14 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                "headless": self.headless,
                "args": [
                    "--disable-gpu",
-                    "--disable-dev-shm-usage",
-                    "--disable-setuid-sandbox",
                    "--no-sandbox",
+                    "--disable-dev-shm-usage",
+                    "--disable-blink-features=AutomationControlled",
+                    "--disable-infobars",
+                    "--window-position=0,0",
+                    "--ignore-certificate-errors",
+                    "--ignore-certificate-errors-spki-list",
+                    # "--headless=new",  # Use the new headless mode
                ]
            }
            
@@ -94,7 +122,6 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                proxy_settings = ProxySettings(server=self.proxy)
                browser_args["proxy"] = proxy_settings
                
-                
            # Select the appropriate browser based on the browser_type
            if self.browser_type == "firefox":
                self.browser = await self.playwright.firefox.launch(**browser_args)
@@ -106,6 +133,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
            await self.execute_hook('on_browser_created', self.browser)

    async def close(self):
+        if self.sleep_on_close:
+            await asyncio.sleep(500)
        if self.browser:
            await self.browser.close()
            self.browser = None
@@ -147,8 +176,10 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):

    def _cleanup_expired_sessions(self):
        current_time = time.time()
-        expired_sessions = [sid for sid, (_, _, last_used) in self.sessions.items() 
-                            if current_time - last_used > self.session_ttl]
+        expired_sessions = [
+            sid for sid, (_, _, last_used) in self.sessions.items() 
+            if current_time - last_used > self.session_ttl
+        ]
        for sid in expired_sessions:
            asyncio.create_task(self.kill_session(sid))
            
@@ -188,8 +219,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                            return await self.csp_compliant_wait(page, f"() => {{{wait_for}}}", timeout)
                        except Error:
                            raise ValueError(f"Invalid wait_for parameter: '{wait_for}'. "
-                                            "It should be either a valid CSS selector, a JavaScript function, "
-                                            "or explicitly prefixed with 'js:' or 'css:'.")
+                                             "It should be either a valid CSS selector, a JavaScript function, "
+                                             "or explicitly prefixed with 'js:' or 'css:'.")
    
    async def csp_compliant_wait(self, page: Page, user_wait_function: str, timeout: float = 30000):
        wrapper_js = f"""
@@ -254,8 +285,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                print(f"Error processing iframe {i}: {str(e)}")

        # Return the page object
-        return page
-    
+        return page  
    
    async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
        response_headers = {}
@@ -268,25 +298,70 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
            if not context:
                context = await self.browser.new_context(
                    user_agent=self.user_agent,
-                    proxy={"server": self.proxy} if self.proxy else None
+                    viewport={"width": 1920, "height": 1080},
+                    proxy={"server": self.proxy} if self.proxy else None,
+                    accept_downloads=True,
+                    java_script_enabled=True
                )
+                await context.add_cookies([{"name": "cookiesEnabled", "value": "true", "url": url}])
                await context.set_extra_http_headers(self.headers)
                page = await context.new_page()
                self.sessions[session_id] = (context, page, time.time())
        else:
            context = await self.browser.new_context(
-                    user_agent=self.user_agent,
-                    proxy={"server": self.proxy} if self.proxy else None
+                user_agent=self.user_agent,
+                viewport={"width": 1920, "height": 1080},
+                proxy={"server": self.proxy} if self.proxy else None
            )
            await context.set_extra_http_headers(self.headers)
+            
+            if kwargs.get("override_navigator", False) or kwargs.get("simulate_user", False) or kwargs.get("magic", False):
+                # Inject scripts to override navigator properties
+                await context.add_init_script("""
+                    // Pass the Permissions Test.
+                    const originalQuery = window.navigator.permissions.query;
+                    window.navigator.permissions.query = (parameters) => (
+                        parameters.name === 'notifications' ?
+                            Promise.resolve({ state: Notification.permission }) :
+                            originalQuery(parameters)
+                    );
+                    Object.defineProperty(navigator, 'webdriver', {
+                        get: () => undefined
+                    });
+                    window.navigator.chrome = {
+                        runtime: {},
+                        // Add other properties if necessary
+                    };
+                    Object.defineProperty(navigator, 'plugins', {
+                        get: () => [1, 2, 3, 4, 5],
+                    });
+                    Object.defineProperty(navigator, 'languages', {
+                        get: () => ['en-US', 'en'],
+                    });
+                    Object.defineProperty(document, 'hidden', {
+                        get: () => false
+                    });
+                    Object.defineProperty(document, 'visibilityState', {
+                        get: () => 'visible'
+                    });
+                """)
+            
            page = await context.new_page()
+            # await stealth_async(page) #, stealth_config)

+        # Add console message and error logging
+        if kwargs.get("log_console", False):
+            page.on("console", lambda msg: print(f"Console: {msg.text}"))
+            page.on("pageerror", lambda exc: print(f"Page Error: {exc}"))
+        
        try:
            if self.verbose:
                print(f"[LOG] 🕸️ Crawling {url} using AsyncPlaywrightCrawlerStrategy...")

            if self.use_cached_html:
-                cache_file_path = os.path.join(Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest())
+                cache_file_path = os.path.join(
+                    Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest()
+                )
                if os.path.exists(cache_file_path):
                    html = ""
                    with open(cache_file_path, "r") as f:
@@ -296,12 +371,21 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                        meta = json.load(f)
                        response_headers = meta.get("response_headers", {})
                        status_code = meta.get("status_code")
-                    response = AsyncCrawlResponse(html=html, response_headers=response_headers, status_code=status_code)
+                    response = AsyncCrawlResponse(
+                        html=html, response_headers=response_headers, status_code=status_code
+                    )
                    return response

            if not kwargs.get("js_only", False):
                await self.execute_hook('before_goto', page)
-                response = await page.goto(url, wait_until="domcontentloaded", timeout=kwargs.get("page_timeout", 60000))
+                
+                response = await page.goto(
+                    url, wait_until="domcontentloaded", timeout=kwargs.get("page_timeout", 60000)
+                )
+                
+                # response = await page.goto("about:blank")
+                # await page.evaluate(f"window.location.href = '{url}'")
+                
                await self.execute_hook('after_goto', page)
                
                # Get status code and headers
@@ -311,37 +395,29 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                status_code = 200
                response_headers = {}

-
            await page.wait_for_selector('body')
            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")

            js_code = kwargs.get("js_code", kwargs.get("js", self.js_code))
            if js_code:
                if isinstance(js_code, str):
-                    r = await page.evaluate(js_code)
+                    await page.evaluate(js_code)
                elif isinstance(js_code, list):
                    for js in js_code:
                        await page.evaluate(js)
                
-                # await page.wait_for_timeout(100)
                await page.wait_for_load_state('networkidle')
-                # Check for on execution even
+                # Check for on execution event
                await self.execute_hook('on_execution_started', page)
                
-            # New code to handle the wait_for parameter
-            # Example usage:
-            # await crawler.crawl(
-            #     url,
-            #     js_code="// some JavaScript code",
-            #     wait_for="""() => {
-            #         return document.querySelector('#my-element') !== null;
-            #     }"""
-            # )
-            # Example of using a CSS selector:
-            # await crawler.crawl(
-            #     url,
-            #     wait_for="#my-element"
-            # )
+            if kwargs.get("simulate_user", False) or kwargs.get("magic", False):
+                # Simulate user interactions
+                await page.mouse.move(100, 100)
+                await page.mouse.down()
+                await page.mouse.up()
+                await page.keyboard.press('ArrowDown')
+
+            # Handle the wait_for parameter
            wait_for = kwargs.get("wait_for")
            if wait_for:
                try:
@@ -349,13 +425,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                except Exception as e:
                    raise RuntimeError(f"Wait condition failed: {str(e)}")

-            # Check if kwargs has screenshot=True then take screenshot
-            screenshot_data = None
-            if kwargs.get("screenshot"):
-                screenshot_data = await self.take_screenshot(url)
-            
-            
-            # New code to update image dimensions
+            # Update image dimensions
            update_image_dimensions_js = """
            () => {
                return new Promise((resolve) => {
@@ -428,12 +498,19 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                
            html = await page.content()
            await self.execute_hook('before_return_html', page, html)
+            
+            # Check if kwargs has screenshot=True then take screenshot
+            screenshot_data = None
+            if kwargs.get("screenshot"):
+                screenshot_data = await self.take_screenshot(url)            

            if self.verbose:
                print(f"[LOG] ✅ Crawled {url} successfully!")

            if self.use_cached_html:
-                cache_file_path = os.path.join(Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest())
+                cache_file_path = os.path.join(
+                    Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest()
+                )
                with open(cache_file_path, "w", encoding="utf-8") as f:
                    f.write(html)
                # store response headers and status code in cache
@@ -443,7 +520,6 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                        "status_code": status_code
                    }, f)

-            
            async def get_delayed_content(delay: float = 5.0) -> str:
                if self.verbose:
                    print(f"[LOG] Waiting for {delay} seconds before retrieving content for {url}")
@@ -459,63 +535,14 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
            )
            return response
        except Error as e:
-            raise Error(f"Failed to crawl {url}: {str(e)}")
-        finally:
-            if not session_id:
-                await page.close()
+            raise Error(f"[ERROR] 🚫 crawl(): Failed to crawl {url}: {str(e)}")
+        # finally:
+        #     if not session_id:
+        #         await page.close()
+        #         await context.close()

-        # try:
-        #     html = await _crawl()
-        #     return sanitize_input_encode(html)
-        # except Error as e:
-        #     raise Error(f"Failed to crawl {url}: {str(e)}")
-        # except Exception as e:
-        #     raise Exception(f"Failed to crawl {url}: {str(e)}")
-
-    async def execute_js(self, session_id: str, js_code: str, wait_for_js: str = None, wait_for_css: str = None) -> AsyncCrawlResponse:
-        """
-        Execute JavaScript code in a specific session and optionally wait for a condition.
-        
-        :param session_id: The ID of the session to execute the JS code in.
-        :param js_code: The JavaScript code to execute.
-        :param wait_for_js: JavaScript condition to wait for after execution.
-        :param wait_for_css: CSS selector to wait for after execution.
-        :return: AsyncCrawlResponse containing the page's HTML and other information.
-        :raises ValueError: If the session does not exist.
-        """
-        if not session_id:
-            raise ValueError("Session ID must be provided")
-        
-        if session_id not in self.sessions:
-            raise ValueError(f"No active session found for session ID: {session_id}")
-        
-        context, page, last_used = self.sessions[session_id]
-        
-        try:
-            await page.evaluate(js_code)
-            
-            if wait_for_js:
-                await page.wait_for_function(wait_for_js)
-            
-            if wait_for_css:
-                await page.wait_for_selector(wait_for_css)
-            
-            # Get the updated HTML content
-            html = await page.content()
-            
-            # Get response headers and status code (assuming these are available)
-            response_headers = await page.evaluate("() => JSON.stringify(performance.getEntriesByType('resource')[0].responseHeaders)")
-            status_code = await page.evaluate("() => performance.getEntriesByType('resource')[0].responseStatus")
-            
-            # Update the last used time for this session
-            self.sessions[session_id] = (context, page, time.time())
-            
-            return AsyncCrawlResponse(html=html, response_headers=response_headers, status_code=status_code)
-        except Error as e:
-            raise Error(f"Failed to execute JavaScript or wait for condition in session {session_id}: {str(e)}")
-    
    async def crawl_many(self, urls: List[str], **kwargs) -> List[AsyncCrawlResponse]:
-        semaphore_count = kwargs.get('semaphore_count', calculate_semaphore_count())
+        semaphore_count = kwargs.get('semaphore_count', 5)  # Adjust as needed
        semaphore = asyncio.Semaphore(semaphore_count)

        async def crawl_with_semaphore(url):
@@ -526,7 +553,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return [result if not isinstance(result, Exception) else str(result) for result in results]

-    async def take_screenshot(self, url: str, wait_time = 1000) -> str:
+    async def take_screenshot(self, url: str, wait_time=1000) -> str:
        async with await self.browser.new_context(user_agent=self.user_agent) as context:
            page = await context.new_page()
            try:
@@ -549,4 +576,5 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                img.save(buffered, format="JPEG")
                return base64.b64encode(buffered.getvalue()).decode('utf-8')
            finally:
-                await page.close()
+                await page.close()
+
--- a/crawl4ai/async_webcrawler.py
+++ b/crawl4ai/async_webcrawler.py
@@ -133,8 +133,8 @@ class AsyncWebCrawler:
        except Exception as e:
            if not hasattr(e, "msg"):
                e.msg = str(e)
-            print(f"[ERROR] 🚫 Failed to crawl {url}, error: {e.msg}")
-            return CrawlResult(url=url, html="", success=False, error_message=e.msg)
+            print(f"[ERROR] 🚫 arun(): Failed to crawl {url}, error: {e.msg}")
+            return CrawlResult(url=url, html="", markdown = f"[ERROR] 🚫 arun(): Failed to crawl {url}, error: {e.msg}", success=False, error_message=e.msg)

    async def arun_many(
        self,
@@ -195,6 +195,7 @@ class AsyncWebCrawler:
                image_description_min_word_threshold=kwargs.get(
                    "image_description_min_word_threshold", IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD
                ),
+                **kwargs,
            )
            if verbose:
                print(
--- a/crawl4ai/content_scrapping_strategy.py
+++ b/crawl4ai/content_scrapping_strategy.py
@@ -33,6 +33,7 @@ class WebScrappingStrategy(ContentScrappingStrategy):
        return await asyncio.to_thread(self._get_content_of_website_optimized, url, html, **kwargs)

    def _get_content_of_website_optimized(self, url: str, html: str, word_count_threshold: int = MIN_WORD_THRESHOLD, css_selector: str = None, **kwargs) -> Dict[str, Any]:
+        success = True
        if not html:
            return None

@@ -273,10 +274,41 @@ class WebScrappingStrategy(ContentScrappingStrategy):
            if base64_pattern.match(src):
                # Replace base64 data with empty string
                img['src'] = base64_pattern.sub('', src)
+                
+        try:
+            str(body)
+        except Exception as e:
+            # Reset body to the original HTML
+            success = False
+            body = BeautifulSoup(html, 'html.parser')
+            
+            # Create a new div with a special ID
+            error_div = body.new_tag('div', id='crawl4ai_error_message')
+            error_div.string = '''
+            Crawl4AI Error: This page is not fully supported.
+            
+            Possible reasons:
+            1. The page may have restrictions that prevent crawling.
+            2. The page might not be fully loaded.
+            
+            Suggestions:
+            - Try calling the crawl function with these parameters:
+            magic=True,
+            - Set headless=False to visualize what's happening on the page.
+            
+            If the issue persists, please check the page's structure and any potential anti-crawling measures.
+            '''
+            
+            # Append the error div to the body
+            body.body.append(error_div)
+            
+            print(f"[LOG] 😧 Error: After processing the crawled HTML and removing irrelevant tags, nothing was left in the page. Check the markdown for further details.")
+
+
        cleaned_html = str(body).replace('\n\n', '\n').replace('  ', ' ')

        h = CustomHTML2Text()
-        h.ignore_links = True
+        h.ignore_links = not kwargs.get('include_links_on_markdown', False)
        h.body_width = 0
        try:
            markdown = h.handle(cleaned_html)
@@ -294,7 +326,7 @@ class WebScrappingStrategy(ContentScrappingStrategy):
        return {
            'markdown': markdown,
            'cleaned_html': cleaned_html,
-            'success': True,
+            'success': success,
            'media': media,
            'links': links,
            'metadata': meta
--- a/crawl4ai/scraper/init.py
+++ b/crawl4ai/scraper/init.py
@@ -1,3 +0,0 @@
-from .async_web_scraper import AsyncWebScraper
-from .bfs_scraper_strategy import BFSScraperStrategy
-from .filters import URLFilter, FilterChain, URLPatternFilter, ContentTypeFilter
--- a/crawl4ai/scraper/async_web_scraper.py
+++ b/crawl4ai/scraper/async_web_scraper.py
@@ -1,123 +0,0 @@
-from typing import Union, AsyncGenerator, Optional
-from .scraper_strategy import ScraperStrategy
-from .models import ScraperResult, CrawlResult
-from ..async_webcrawler import AsyncWebCrawler
-import logging
-from dataclasses import dataclass
-from contextlib import asynccontextmanager
-
-@dataclass
-class ScrapingProgress:
-    """Tracks the progress of a scraping operation."""
-    processed_urls: int = 0
-    failed_urls: int = 0
-    current_url: Optional[str] = None
-
-class AsyncWebScraper:
-    """
-    A high-level web scraper that combines an async crawler with a scraping strategy.
-    
-    Args:
-        crawler (AsyncWebCrawler): The async web crawler implementation
-        strategy (ScraperStrategy): The scraping strategy to use
-        logger (Optional[logging.Logger]): Custom logger for the scraper
-    """
-    
-    def __init__(
-        self, 
-        crawler: AsyncWebCrawler, 
-        strategy: ScraperStrategy,
-        logger: Optional[logging.Logger] = None
-    ):
-        if not isinstance(crawler, AsyncWebCrawler):
-            raise TypeError("crawler must be an instance of AsyncWebCrawler")
-        if not isinstance(strategy, ScraperStrategy):
-            raise TypeError("strategy must be an instance of ScraperStrategy")
-            
-        self.crawler = crawler
-        self.strategy = strategy
-        self.logger = logger or logging.getLogger(__name__)
-        self._progress = ScrapingProgress()
-
-    @property
-    def progress(self) -> ScrapingProgress:
-        """Get current scraping progress."""
-        return self._progress
-
-    @asynccontextmanager
-    async def _error_handling_context(self, url: str):
-        """Context manager for handling errors during scraping."""
-        try:
-            yield
-        except Exception as e:
-            self.logger.error(f"Error scraping {url}: {str(e)}")
-            self._progress.failed_urls += 1
-            raise
-
-    async def ascrape(
-        self, 
-        url: str, 
-        parallel_processing: bool = True, 
-        stream: bool = False
-    ) -> Union[AsyncGenerator[CrawlResult, None], ScraperResult]:
-        """
-        Scrape a website starting from the given URL.
-        
-        Args:
-            url: Starting URL for scraping
-            parallel_processing: Whether to process URLs in parallel
-            stream: If True, yield results as they come; if False, collect all results
-            
-        Returns:
-            Either an async generator yielding CrawlResults or a final ScraperResult
-        """
-        self._progress = ScrapingProgress()  # Reset progress
-        
-        async with self._error_handling_context(url):
-            if stream:
-                return self._ascrape_yielding(url, parallel_processing)
-            return await self._ascrape_collecting(url, parallel_processing)
-
-    async def _ascrape_yielding(
-        self, 
-        url: str, 
-        parallel_processing: bool
-    ) -> AsyncGenerator[CrawlResult, None]:
-        """Stream scraping results as they become available."""
-        try:
-            result_generator = self.strategy.ascrape(url, self.crawler, parallel_processing)
-            async for res in result_generator:
-                self._progress.processed_urls += 1
-                self._progress.current_url = res.url
-                yield res
-        except Exception as e:
-            self.logger.error(f"Error in streaming scrape: {str(e)}")
-            raise
-
-    async def _ascrape_collecting(
-        self, 
-        url: str, 
-        parallel_processing: bool
-    ) -> ScraperResult:
-        """Collect all scraping results before returning."""
-        extracted_data = {}
-        
-        try:
-            result_generator = self.strategy.ascrape(url, self.crawler, parallel_processing)
-            async for res in result_generator:
-                self._progress.processed_urls += 1
-                self._progress.current_url = res.url
-                extracted_data[res.url] = res
-                
-            return ScraperResult(
-                url=url,
-                crawled_urls=list(extracted_data.keys()),
-                extracted_data=extracted_data,
-                stats={
-                    'processed_urls': self._progress.processed_urls,
-                    'failed_urls': self._progress.failed_urls
-                }
-            )
-        except Exception as e:
-            self.logger.error(f"Error in collecting scrape: {str(e)}")
-            raise
--- a/crawl4ai/scraper/bfs_scraper_strategy.py
+++ b/crawl4ai/scraper/bfs_scraper_strategy.py
@@ -1,327 +0,0 @@
-from abc import ABC, abstractmethod
-from typing import Union, AsyncGenerator, Optional, Dict, Set
-from dataclasses import dataclass
-from datetime import datetime
-import asyncio
-import logging
-from urllib.parse import urljoin, urlparse, urlunparse
-from urllib.robotparser import RobotFileParser
-import validators
-import time
-from aiolimiter import AsyncLimiter
-from tenacity import retry, stop_after_attempt, wait_exponential
-from collections import defaultdict
-
-from .models import ScraperResult, CrawlResult
-from .filters import FilterChain
-from .scorers import URLScorer
-from ..async_webcrawler import AsyncWebCrawler
-
-@dataclass
-class CrawlStats:
-    """Statistics for the crawling process"""
-    start_time: datetime
-    urls_processed: int = 0
-    urls_failed: int = 0
-    urls_skipped: int = 0
-    total_depth_reached: int = 0
-    current_depth: int = 0
-    robots_blocked: int = 0
-
-class ScraperStrategy(ABC):
-    """Base class for scraping strategies"""
-    
-    @abstractmethod
-    async def ascrape(
-        self, 
-        url: str, 
-        crawler: AsyncWebCrawler, 
-        parallel_processing: bool = True,
-        stream: bool = False
-    ) -> Union[AsyncGenerator[CrawlResult, None], ScraperResult]:
-        """Abstract method for scraping implementation"""
-        pass
-
-    @abstractmethod
-    async def can_process_url(self, url: str) -> bool:
-        """Check if URL can be processed based on strategy rules"""
-        pass
-
-    @abstractmethod
-    async def shutdown(self):
-        """Clean up resources used by the strategy"""
-        pass
-
-class BFSScraperStrategy(ScraperStrategy):
-    """Breadth-First Search scraping strategy with politeness controls"""
-
-    def __init__(
-        self,
-        max_depth: int,
-        filter_chain: FilterChain,
-        url_scorer: URLScorer,
-        max_concurrent: int = 5,
-        min_crawl_delay: int = 1,
-        timeout: int = 30,
-        logger: Optional[logging.Logger] = None
-    ):
-        self.max_depth = max_depth
-        self.filter_chain = filter_chain
-        self.url_scorer = url_scorer
-        self.max_concurrent = max_concurrent
-        self.min_crawl_delay = min_crawl_delay
-        self.timeout = timeout
-        self.logger = logger or logging.getLogger(__name__)
-        
-        # Crawl control
-        self.stats = CrawlStats(start_time=datetime.now())
-        self._cancel_event = asyncio.Event()
-        self.process_external_links = False
-        
-        # Rate limiting and politeness
-        self.rate_limiter = AsyncLimiter(1, 1)
-        self.last_crawl_time = defaultdict(float)
-        self.robot_parsers: Dict[str, RobotFileParser] = {}
-        self.domain_queues: Dict[str, asyncio.Queue] = defaultdict(asyncio.Queue)
-
-    async def can_process_url(self, url: str) -> bool:
-        """Check if URL can be processed based on robots.txt and filters
-        This is our gatekeeper method that determines if a URL should be processed. It:
-            - Validates URL format using the validators library
-            - Checks robots.txt permissions for the domain
-            - Applies custom filters from the filter chain
-            - Updates statistics for blocked URLs
-            - Returns False early if any check fails
-        """
-        if not validators.url(url):
-            self.logger.warning(f"Invalid URL: {url}")
-            return False
-
-        robot_parser = await self._get_robot_parser(url)
-        if robot_parser and not robot_parser.can_fetch("*", url):
-            self.stats.robots_blocked += 1
-            self.logger.info(f"Blocked by robots.txt: {url}")
-            return False
-
-        return self.filter_chain.apply(url)
-
-    async def _get_robot_parser(self, url: str) -> Optional[RobotFileParser]:
-        """Get or create robots.txt parser for domain.
-            This is our robots.txt manager that:
-                - Uses domain-level caching of robot parsers
-                - Creates and caches new parsers as needed
-                - Handles failed robots.txt fetches gracefully
-                - Returns None if robots.txt can't be fetched, allowing crawling to proceed        
-        """
-        domain = urlparse(url).netloc
-        if domain not in self.robot_parsers:
-            parser = RobotFileParser()
-            try:
-                robots_url = f"{urlparse(url).scheme}://{domain}/robots.txt"
-                parser.set_url(robots_url)
-                parser.read()
-                self.robot_parsers[domain] = parser
-            except Exception as e:
-                self.logger.warning(f"Error fetching robots.txt for {domain}: {e}")
-                return None
-        return self.robot_parsers[domain]
-
-    @retry(stop=stop_after_attempt(3), 
-           wait=wait_exponential(multiplier=1, min=4, max=10))
-    async def _crawl_with_retry(
-        self, 
-        crawler: AsyncWebCrawler, 
-        url: str
-    ) -> CrawlResult:
-        """Crawl URL with retry logic"""
-        try:
-            async with asyncio.timeout(self.timeout):
-                return await crawler.arun(url)
-        except asyncio.TimeoutError:
-            self.logger.error(f"Timeout crawling {url}")
-            raise
-
-    async def process_url(
-        self,
-        url: str,
-        depth: int,
-        crawler: AsyncWebCrawler,
-        queue: asyncio.PriorityQueue,
-        visited: Set[str],
-        depths: Dict[str, int]
-    ) -> Optional[CrawlResult]:
-        """Process a single URL and extract links.
-        This is our main URL processing workhorse that:
-            - Checks for cancellation
-            - Validates URLs through can_process_url
-            - Implements politeness delays per domain
-            - Applies rate limiting
-            - Handles crawling with retries
-            - Updates various statistics
-            - Processes extracted links
-            - Returns the crawl result or None on failure
-        """
-        
-        if self._cancel_event.is_set():
-            return None
-            
-        if not await self.can_process_url(url):
-            self.stats.urls_skipped += 1
-            return None
-
-        # Politeness delay
-        domain = urlparse(url).netloc
-        time_since_last = time.time() - self.last_crawl_time[domain]
-        if time_since_last < self.min_crawl_delay:
-            await asyncio.sleep(self.min_crawl_delay - time_since_last)
-        self.last_crawl_time[domain] = time.time()
-
-        # Crawl with rate limiting
-        try:
-            async with self.rate_limiter:
-                result = await self._crawl_with_retry(crawler, url)
-                self.stats.urls_processed += 1
-        except Exception as e:
-            self.logger.error(f"Error crawling {url}: {e}")
-            self.stats.urls_failed += 1
-            return None
-
-        # Process links
-        await self._process_links(result, url, depth, queue, visited, depths)
-        
-        return result
-
-    async def _process_links(
-        self,
-        result: CrawlResult,
-        source_url: str,
-        depth: int,
-        queue: asyncio.PriorityQueue,
-        visited: Set[str],
-        depths: Dict[str, int]
-    ):
-        """Process extracted links from crawl result.
-        This is our link processor that:
-            Handles both internal and external links
-            Normalizes URLs (removes fragments)
-            Checks depth limits
-            Scores URLs for priority
-            Updates depth tracking
-            Adds valid URLs to the queue
-            Updates maximum depth statistics
-        """
-        links_ro_process = result.links["internal"]
-        if self.process_external_links:
-            links_ro_process += result.links["external"]
-        for link_type in links_ro_process:
-            for link in result.links[link_type]:
-                url = link['href']
-                # url = urljoin(source_url, link['href'])
-                # url = urlunparse(urlparse(url)._replace(fragment=""))
-                
-                if url not in visited and await self.can_process_url(url):
-                    new_depth = depths[source_url] + 1
-                    if new_depth <= self.max_depth:
-                        score = self.url_scorer.score(url)
-                        await queue.put((score, new_depth, url))
-                        depths[url] = new_depth
-                        self.stats.total_depth_reached = max(
-                            self.stats.total_depth_reached, 
-                            new_depth
-                        )
-
-    async def ascrape(
-        self,
-        start_url: str,
-        crawler: AsyncWebCrawler,
-        parallel_processing: bool = True
-    ) -> AsyncGenerator[CrawlResult, None]:
-        """Implement BFS crawling strategy"""
-        
-        # Initialize crawl state
-        """
-        queue: A priority queue where items are tuples of (score, depth, url)
-            Score: Determines crawling priority (lower = higher priority)
-            Depth: Current distance from start_url
-            URL: The actual URL to crawl
-        visited: Keeps track of URLs we've already seen to avoid cycles
-        depths: Maps URLs to their depths from the start URL
-        pending_tasks: Tracks currently running crawl tasks        
-        """
-        queue = asyncio.PriorityQueue()
-        await queue.put((0, 0, start_url))
-        visited: Set[str] = set()
-        depths = {start_url: 0}
-        pending_tasks = set()
-        
-        try:
-            while (not queue.empty() or pending_tasks) and not self._cancel_event.is_set():
-                """
-                This sets up our main control loop which:
-                    - Continues while there are URLs to process (not queue.empty())
-                    - Or while there are tasks still running (pending_tasks)
-                    - Can be interrupted via cancellation (not self._cancel_event.is_set())
-                """
-                # Start new tasks up to max_concurrent
-                while not queue.empty() and len(pending_tasks) < self.max_concurrent:
-                    """
-                    This section manages task creation:
-                        Checks if we can start more tasks (under max_concurrent limit)
-                        Gets the next URL from the priority queue
-                        Marks URLs as visited immediately to prevent duplicates
-                        Updates current depth in stats
-                        Either:
-                            Creates a new async task (parallel mode)
-                            Processes URL directly (sequential mode)
-                    """
-                    _, depth, url = await queue.get()
-                    if url not in visited:
-                        visited.add(url)
-                        self.stats.current_depth = depth
-                        
-                        if parallel_processing:
-                            task = asyncio.create_task(
-                                self.process_url(url, depth, crawler, queue, visited, depths)
-                            )
-                            pending_tasks.add(task)
-                        else:
-                            result = await self.process_url(
-                                url, depth, crawler, queue, visited, depths
-                            )
-                            if result:
-                                yield result
-
-                # Process completed tasks
-                """
-                This section manages completed tasks:
-                    Waits for any task to complete using asyncio.wait
-                    Uses FIRST_COMPLETED to handle results as soon as they're ready
-                    Yields successful results to the caller
-                    Updates pending_tasks to remove completed ones
-                """
-                if pending_tasks:
-                    done, pending_tasks = await asyncio.wait(
-                        pending_tasks,
-                        return_when=asyncio.FIRST_COMPLETED
-                    )
-                    for task in done:
-                        result = await task
-                        if result:
-                            yield result
-                            
-        except Exception as e:
-            self.logger.error(f"Error in crawl process: {e}")
-            raise
-            
-        finally:
-            # Clean up any remaining tasks
-            for task in pending_tasks:
-                task.cancel()
-            self.stats.end_time = datetime.now()
-
-    async def shutdown(self):
-        """Clean up resources and stop crawling"""
-        self._cancel_event.set()
-        # Clear caches and close connections
-        self.robot_parsers.clear()
-        self.domain_queues.clear()
--- a/crawl4ai/scraper/filters.py
+++ b/crawl4ai/scraper/filters.py
@@ -1,205 +0,0 @@
-# from .url_filter import URLFilter, FilterChain
-# from .content_type_filter import ContentTypeFilter
-# from .url_pattern_filter import URLPatternFilter
-
-from abc import ABC, abstractmethod
-from typing import List, Pattern, Set, Union
-import re
-from urllib.parse import urlparse
-import mimetypes
-import logging
-from dataclasses import dataclass
-import fnmatch
-
-@dataclass
-class FilterStats:
-    """Statistics for filter applications"""
-    total_urls: int = 0
-    rejected_urls: int = 0
-    passed_urls: int = 0
-
-class URLFilter(ABC):
-    """Base class for URL filters"""
-    
-    def __init__(self, name: str = None):
-        self.name = name or self.__class__.__name__
-        self.stats = FilterStats()
-        self.logger = logging.getLogger(f"urlfilter.{self.name}")
-
-    @abstractmethod
-    def apply(self, url: str) -> bool:
-        """Apply the filter to a URL"""
-        pass
-
-    def _update_stats(self, passed: bool):
-        """Update filter statistics"""
-        self.stats.total_urls += 1
-        if passed:
-            self.stats.passed_urls += 1
-        else:
-            self.stats.rejected_urls += 1
-
-class FilterChain:
-    """Chain of URL filters."""
-    
-    def __init__(self, filters: List[URLFilter] = None):
-        self.filters = filters or []
-        self.stats = FilterStats()
-        self.logger = logging.getLogger("urlfilter.chain")
-
-    def add_filter(self, filter_: URLFilter) -> 'FilterChain':
-        """Add a filter to the chain"""
-        self.filters.append(filter_)
-        return self  # Enable method chaining
-
-    def apply(self, url: str) -> bool:
-        """Apply all filters in the chain"""
-        self.stats.total_urls += 1
-        
-        for filter_ in self.filters:
-            if not filter_.apply(url):
-                self.stats.rejected_urls += 1
-                self.logger.debug(f"URL {url} rejected by {filter_.name}")
-                return False
-        
-        self.stats.passed_urls += 1
-        return True
-
-class URLPatternFilter(URLFilter):
-    """Filter URLs based on glob patterns or regex.
-    
-    pattern_filter = URLPatternFilter([
-        "*.example.com/*",  # Glob pattern
-        "*/article/*",      # Path pattern
-        re.compile(r"blog-\d+") # Regex pattern
-    ])
-
-    - Supports glob patterns and regex
-    - Multiple patterns per filter
-    - Pattern pre-compilation for performance    
-    """
-    
-    def __init__(self, patterns: Union[str, Pattern, List[Union[str, Pattern]]], 
-                 use_glob: bool = True):
-        super().__init__()
-        self.patterns = [patterns] if isinstance(patterns, (str, Pattern)) else patterns
-        self.use_glob = use_glob
-        self._compiled_patterns = []
-        
-        for pattern in self.patterns:
-            if isinstance(pattern, str) and use_glob:
-                self._compiled_patterns.append(self._glob_to_regex(pattern))
-            else:
-                self._compiled_patterns.append(re.compile(pattern) if isinstance(pattern, str) else pattern)
-
-    def _glob_to_regex(self, pattern: str) -> Pattern:
-        """Convert glob pattern to regex"""
-        return re.compile(fnmatch.translate(pattern))
-
-    def apply(self, url: str) -> bool:
-        """Check if URL matches any of the patterns"""
-        matches = any(pattern.search(url) for pattern in self._compiled_patterns)
-        self._update_stats(matches)
-        return matches
-
-class ContentTypeFilter(URLFilter):
-    """Filter URLs based on expected content type.
-    
-    content_filter = ContentTypeFilter([
-        "text/html",
-        "application/pdf"
-    ], check_extension=True)
-
-    - Filter by MIME types
-    - Extension checking
-    - Support for multiple content types
-    """
-    
-    def __init__(self, allowed_types: Union[str, List[str]], 
-                 check_extension: bool = True):
-        super().__init__()
-        self.allowed_types = [allowed_types] if isinstance(allowed_types, str) else allowed_types
-        self.check_extension = check_extension
-        self._normalize_types()
-
-    def _normalize_types(self):
-        """Normalize content type strings"""
-        self.allowed_types = [t.lower() for t in self.allowed_types]
-
-    def _check_extension(self, url: str) -> bool:
-        """Check URL's file extension"""
-        ext = urlparse(url).path.split('.')[-1].lower() if '.' in urlparse(url).path else ''
-        if not ext:
-            return True  # No extension, might be dynamic content
-            
-        guessed_type = mimetypes.guess_type(url)[0]
-        return any(allowed in (guessed_type or '').lower() for allowed in self.allowed_types)
-
-    def apply(self, url: str) -> bool:
-        """Check if URL's content type is allowed"""
-        result = True
-        if self.check_extension:
-            result = self._check_extension(url)
-        self._update_stats(result)
-        return result
-
-class DomainFilter(URLFilter):
-    """Filter URLs based on allowed/blocked domains.
-    
-    domain_filter = DomainFilter(
-        allowed_domains=["example.com", "blog.example.com"],
-        blocked_domains=["ads.example.com"]
-    )
-
-    - Allow/block specific domains
-    - Subdomain support
-    - Efficient domain matching
-    """
-    
-    def __init__(self, allowed_domains: Union[str, List[str]] = None, 
-                 blocked_domains: Union[str, List[str]] = None):
-        super().__init__()
-        self.allowed_domains = set(self._normalize_domains(allowed_domains)) if allowed_domains else None
-        self.blocked_domains = set(self._normalize_domains(blocked_domains)) if blocked_domains else set()
-
-    def _normalize_domains(self, domains: Union[str, List[str]]) -> List[str]:
-        """Normalize domain strings"""
-        if isinstance(domains, str):
-            domains = [domains]
-        return [d.lower().strip() for d in domains]
-
-    def _extract_domain(self, url: str) -> str:
-        """Extract domain from URL"""
-        return urlparse(url).netloc.lower()
-
-    def apply(self, url: str) -> bool:
-        """Check if URL's domain is allowed"""
-        domain = self._extract_domain(url)
-        
-        if domain in self.blocked_domains:
-            self._update_stats(False)
-            return False
-            
-        if self.allowed_domains is not None and domain not in self.allowed_domains:
-            self._update_stats(False)
-            return False
-            
-        self._update_stats(True)
-        return True
-
-# Example usage:
-def create_common_filter_chain() -> FilterChain:
-    """Create a commonly used filter chain"""
-    return FilterChain([
-        URLPatternFilter([
-            "*.html", "*.htm",  # HTML files
-            "*/article/*", "*/blog/*"  # Common content paths
-        ]),
-        ContentTypeFilter([
-            "text/html",
-            "application/xhtml+xml"
-        ]),
-        DomainFilter(
-            blocked_domains=["ads.*", "analytics.*"]
-        )
-    ])
--- a/crawl4ai/scraper/models.py
+++ b/crawl4ai/scraper/models.py
@@ -1,8 +0,0 @@
-from pydantic import BaseModel
-from typing import List, Dict
-from ..models import CrawlResult
-
-class ScraperResult(BaseModel):
-    url: str
-    crawled_urls: List[str]
-    extracted_data: Dict[str,CrawlResult]
--- a/crawl4ai/scraper/scorers.py
+++ b/crawl4ai/scraper/scorers.py
@@ -1,268 +0,0 @@
-# from .url_scorer import URLScorer
-# from .keyword_relevance_scorer import KeywordRelevanceScorer
-
-from abc import ABC, abstractmethod
-from typing import List, Dict, Optional, Union
-from dataclasses import dataclass
-from urllib.parse import urlparse, unquote
-import re
-from collections import defaultdict
-import math
-import logging
-
-@dataclass
-class ScoringStats:
-    """Statistics for URL scoring"""
-    urls_scored: int = 0
-    total_score: float = 0.0
-    min_score: float = float('inf')
-    max_score: float = float('-inf')
-    
-    def update(self, score: float):
-        """Update scoring statistics"""
-        self.urls_scored += 1
-        self.total_score += score
-        self.min_score = min(self.min_score, score)
-        self.max_score = max(self.max_score, score)
-    
-    @property
-    def average_score(self) -> float:
-        """Calculate average score"""
-        return self.total_score / self.urls_scored if self.urls_scored > 0 else 0.0
-
-class URLScorer(ABC):
-    """Base class for URL scoring strategies"""
-    
-    def __init__(self, weight: float = 1.0, name: str = None):
-        self.weight = weight
-        self.name = name or self.__class__.__name__
-        self.stats = ScoringStats()
-        self.logger = logging.getLogger(f"urlscorer.{self.name}")
-
-    @abstractmethod
-    def _calculate_score(self, url: str) -> float:
-        """Calculate the raw score for a URL"""
-        pass
-
-    def score(self, url: str) -> float:
-        """Calculate the weighted score for a URL"""
-        raw_score = self._calculate_score(url)
-        weighted_score = raw_score * self.weight
-        self.stats.update(weighted_score)
-        return weighted_score
-
-class CompositeScorer(URLScorer):
-    """Combines multiple scorers with weights"""
-    
-    def __init__(self, scorers: List[URLScorer], normalize: bool = True):
-        super().__init__(name="CompositeScorer")
-        self.scorers = scorers
-        self.normalize = normalize
-
-    def _calculate_score(self, url: str) -> float:
-        scores = [scorer.score(url) for scorer in self.scorers]
-        total_score = sum(scores)
-        
-        if self.normalize and scores:
-            total_score /= len(scores)
-            
-        return total_score
-
-class KeywordRelevanceScorer(URLScorer):
-    """Score URLs based on keyword relevance.
-
-    keyword_scorer = KeywordRelevanceScorer(
-        keywords=["python", "programming"],
-        weight=1.0,
-        case_sensitive=False
-    )
-
-    - Score based on keyword matches
-    - Case sensitivity options
-    - Weighted scoring
-    """
-    
-    def __init__(self, keywords: List[str], weight: float = 1.0,
-                 case_sensitive: bool = False):
-        super().__init__(weight=weight)
-        self.keywords = keywords
-        self.case_sensitive = case_sensitive
-        self._compile_keywords()
-
-    def _compile_keywords(self):
-        """Prepare keywords for matching"""
-        flags = 0 if self.case_sensitive else re.IGNORECASE
-        self.patterns = [re.compile(re.escape(k), flags) for k in self.keywords]
-
-    def _calculate_score(self, url: str) -> float:
-        """Calculate score based on keyword matches"""
-        decoded_url = unquote(url)
-        total_matches = sum(
-            1 for pattern in self.patterns
-            if pattern.search(decoded_url)
-        )
-        # Normalize score between 0 and 1
-        return total_matches / len(self.patterns) if self.patterns else 0.0
-
-class PathDepthScorer(URLScorer):
-    """Score URLs based on their path depth.
-        
-    path_scorer = PathDepthScorer(
-        optimal_depth=3,  # Preferred URL depth
-        weight=0.7
-    )
-
-    - Score based on URL path depth
-    - Configurable optimal depth
-    - Diminishing returns for deeper paths
-    """
-    
-    def __init__(self, optimal_depth: int = 3, weight: float = 1.0):
-        super().__init__(weight=weight)
-        self.optimal_depth = optimal_depth
-
-    def _calculate_score(self, url: str) -> float:
-        """Calculate score based on path depth"""
-        path = urlparse(url).path
-        depth = len([x for x in path.split('/') if x])
-        
-        # Score decreases as we move away from optimal depth
-        distance_from_optimal = abs(depth - self.optimal_depth)
-        return 1.0 / (1.0 + distance_from_optimal)
-
-class ContentTypeScorer(URLScorer):
-    """Score URLs based on content type preferences.
-    
-    content_scorer = ContentTypeScorer({
-        r'\.html$': 1.0,
-        r'\.pdf$': 0.8,
-        r'\.xml$': 0.6
-    })
-
-    - Score based on file types
-    - Configurable type weights
-    - Pattern matching support
-    """
-    
-    def __init__(self, type_weights: Dict[str, float], weight: float = 1.0):
-        super().__init__(weight=weight)
-        self.type_weights = type_weights
-        self._compile_patterns()
-
-    def _compile_patterns(self):
-        """Prepare content type patterns"""
-        self.patterns = {
-            re.compile(pattern): weight
-            for pattern, weight in self.type_weights.items()
-        }
-
-    def _calculate_score(self, url: str) -> float:
-        """Calculate score based on content type matching"""
-        for pattern, weight in self.patterns.items():
-            if pattern.search(url):
-                return weight
-        return 0.0
-
-class FreshnessScorer(URLScorer):
-    """Score URLs based on freshness indicators.
-    
-    freshness_scorer = FreshnessScorer(weight=0.9)
-
-    Score based on date indicators in URLs
-    Multiple date format support
-    Recency weighting"""
-    
-    def __init__(self, weight: float = 1.0):
-        super().__init__(weight=weight)
-        self.date_patterns = [
-            r'/(\d{4})/(\d{2})/(\d{2})/',  # yyyy/mm/dd
-            r'(\d{4})[-_](\d{2})[-_](\d{2})',  # yyyy-mm-dd
-            r'/(\d{4})/',  # year only
-        ]
-        self._compile_patterns()
-
-    def _compile_patterns(self):
-        """Prepare date patterns"""
-        self.compiled_patterns = [re.compile(p) for p in self.date_patterns]
-
-    def _calculate_score(self, url: str) -> float:
-        """Calculate score based on date indicators"""
-        for pattern in self.compiled_patterns:
-            if match := pattern.search(url):
-                year = int(match.group(1))
-                # Score higher for more recent years
-                return 1.0 - (2024 - year) * 0.1
-        return 0.5  # Default score for URLs without dates
-
-class DomainAuthorityScorer(URLScorer):
-    """Score URLs based on domain authority.
-
-    authority_scorer = DomainAuthorityScorer({
-        "python.org": 1.0,
-        "github.com": 0.9,
-        "medium.com": 0.7
-    })
-
-    Score based on domain importance
-    Configurable domain weights
-    Default weight for unknown domains"""
-    
-    def __init__(self, domain_weights: Dict[str, float], 
-                 default_weight: float = 0.5, weight: float = 1.0):
-        super().__init__(weight=weight)
-        self.domain_weights = domain_weights
-        self.default_weight = default_weight
-
-    def _calculate_score(self, url: str) -> float:
-        """Calculate score based on domain authority"""
-        domain = urlparse(url).netloc.lower()
-        return self.domain_weights.get(domain, self.default_weight)
-
-def create_balanced_scorer() -> CompositeScorer:
-    """Create a balanced composite scorer"""
-    return CompositeScorer([
-        KeywordRelevanceScorer(
-            keywords=["article", "blog", "news", "research"],
-            weight=1.0
-        ),
-        PathDepthScorer(
-            optimal_depth=3,
-            weight=0.7
-        ),
-        ContentTypeScorer(
-            type_weights={
-                r'\.html?$': 1.0,
-                r'\.pdf$': 0.8,
-                r'\.xml$': 0.6
-            },
-            weight=0.8
-        ),
-        FreshnessScorer(
-            weight=0.9
-        )
-    ])
-
-# Example Usage:
-"""
-# Create a composite scorer
-scorer = CompositeScorer([
-    KeywordRelevanceScorer(["python", "programming"], weight=1.0),
-    PathDepthScorer(optimal_depth=2, weight=0.7),
-    FreshnessScorer(weight=0.8),
-    DomainAuthorityScorer(
-        domain_weights={
-            "python.org": 1.0,
-            "github.com": 0.9,
-            "medium.com": 0.7
-        },
-        weight=0.9
-    )
-])
-
-# Score a URL
-score = scorer.score("https://python.org/article/2024/01/new-features")
-
-# Access statistics
-print(f"Average score: {scorer.stats.average_score}")
-print(f"URLs scored: {scorer.stats.urls_scored}")
-"""
--- a/crawl4ai/scraper/scraper_strategy.py
+++ b/crawl4ai/scraper/scraper_strategy.py
@@ -1,26 +0,0 @@
-from abc import ABC, abstractmethod
-from .models import ScraperResult, CrawlResult
-from ..models import CrawlResult
-from ..async_webcrawler import AsyncWebCrawler
-from typing import Union, AsyncGenerator
-
-class ScraperStrategy(ABC):
-    @abstractmethod
-    async def ascrape(self, url: str, crawler: AsyncWebCrawler, parallel_processing: bool = True, stream: bool = False) -> Union[AsyncGenerator[CrawlResult, None], ScraperResult]:
-        """Scrape the given URL using the specified crawler.
-
-        Args:
-            url (str): The starting URL for the scrape.
-            crawler (AsyncWebCrawler): The web crawler instance.
-            parallel_processing (bool): Whether to use parallel processing. Defaults to True.
-            stream (bool): If True, yields individual crawl results as they are ready; 
-                                if False, accumulates results and returns a final ScraperResult.
-
-        Yields:
-            CrawlResult: Individual crawl results if stream is True.
-
-        Returns:
-            ScraperResult: A summary of the scrape results containing the final extracted data 
-            and the list of crawled URLs if stream is False.
-        """
-        pass
--- a/crawl4ai/utils.py
+++ b/crawl4ai/utils.py
@@ -692,8 +692,8 @@ def get_content_of_website_optimized(url: str, html: str, word_count_threshold:
    for img in imgs:
        src = img.get('src', '')
        if base64_pattern.match(src):
-            # Replace base64 data with empty string
            img['src'] = base64_pattern.sub('', src)
+
    cleaned_html = str(body).replace('\n\n', '\n').replace('  ', ' ')
    cleaned_html = sanitize_html(cleaned_html)

--- a/docs/examples/quickstart.ipynb
+++ b/docs/examples/quickstart.ipynb
@@ -47,8 +47,7 @@
      },
      "outputs": [],
      "source": [
-        "# !pip install \"crawl4ai @ git+https://github.com/unclecode/crawl4ai.git\"\n",
-        "!pip install \"crawl4ai @ git+https://github.com/unclecode/crawl4ai.git@staging\"\n",
+        "!pip install crawl4ai\n",
        "!pip install nest-asyncio\n",
        "!playwright install"
      ]
@@ -714,7 +713,7 @@
      "provenance": []
    },
    "kernelspec": {
-      "display_name": "Python 3",
+      "display_name": "venv",
      "language": "python",
      "name": "python3"
    },
--- a/docs/examples/quickstart_async.py
+++ b/docs/examples/quickstart_async.py
@@ -379,6 +379,18 @@ async def crawl_custom_browser_type():
        print(result.markdown[:500])
        print("Time taken: ", time.time() - start)

+async def crawl_with_user_simultion():
+    async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
+        url = "YOUR-URL-HERE"
+        result = await crawler.arun(
+            url=url,
+            bypass_cache=True,
+            simulate_user = True,# Causes a series of random mouse movements and clicks to simulate user interaction
+            override_navigator = True # Overrides the navigator object to make it look like a real user
+        )
+        
+        print(result.markdown)    
+
 async def speed_comparison():
    # print("\n--- Speed Comparison ---")
    # print("Firecrawl (simulated):")
@@ -455,7 +467,7 @@ async def main():
    # LLM extraction examples
    await extract_structured_data_using_llm()
    await extract_structured_data_using_llm("huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct", os.getenv("HUGGINGFACE_API_KEY"))
-    await extract_structured_data_using_llm("openai/gpt-4", os.getenv("OPENAI_API_KEY"))
+    await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"))
    await extract_structured_data_using_llm("ollama/llama3.2")    

    # You always can pass custom headers to the extraction strategy
--- a/docs/scrapper/async_web_scraper.md
+++ b/docs/scrapper/async_web_scraper.md
@@ -1,166 +0,0 @@
-# AsyncWebScraper: Smart Web Crawling Made Easy
-
-AsyncWebScraper is a powerful and flexible web scraping tool that makes it easy to collect data from websites efficiently. Whether you need to scrape a few pages or an entire website, AsyncWebScraper handles the complexity of web crawling while giving you fine-grained control over the process.
-
-## How It Works
-
-```mermaid
-flowchart TB
-    Start([Start]) --> Init[Initialize AsyncWebScraper\nwith Crawler and Strategy]
-    Init --> InputURL[Receive URL to scrape]
-    InputURL --> Decision{Stream or\nCollect?}
-    
-    %% Streaming Path
-    Decision -->|Stream| StreamInit[Initialize Streaming Mode]
-    StreamInit --> StreamStrategy[Call Strategy.ascrape]
-    StreamStrategy --> AsyncGen[Create Async Generator]
-    AsyncGen --> ProcessURL[Process Next URL]
-    ProcessURL --> FetchContent[Fetch Page Content]
-    FetchContent --> Extract[Extract Data]
-    Extract --> YieldResult[Yield CrawlResult]
-    YieldResult --> CheckMore{More URLs?}
-    CheckMore -->|Yes| ProcessURL
-    CheckMore -->|No| StreamEnd([End Stream])
-    
-    %% Collecting Path
-    Decision -->|Collect| CollectInit[Initialize Collection Mode]
-    CollectInit --> CollectStrategy[Call Strategy.ascrape]
-    CollectStrategy --> CollectGen[Create Async Generator]
-    CollectGen --> ProcessURLColl[Process Next URL]
-    ProcessURLColl --> FetchContentColl[Fetch Page Content]
-    FetchContentColl --> ExtractColl[Extract Data]
-    ExtractColl --> StoreColl[Store in Dictionary]
-    StoreColl --> CheckMoreColl{More URLs?}
-    CheckMoreColl -->|Yes| ProcessURLColl
-    CheckMoreColl -->|No| CreateResult[Create ScraperResult]
-    CreateResult --> ReturnResult([Return Result])
-    
-    %% Parallel Processing
-    subgraph Parallel
-        ProcessURL
-        FetchContent
-        Extract
-        ProcessURLColl
-        FetchContentColl
-        ExtractColl
-    end
-    
-    %% Error Handling
-    FetchContent --> ErrorCheck{Error?}
-    ErrorCheck -->|Yes| LogError[Log Error]
-    LogError --> UpdateStats[Update Error Stats]
-    UpdateStats --> CheckMore
-    ErrorCheck -->|No| Extract
-    
-    FetchContentColl --> ErrorCheckColl{Error?}
-    ErrorCheckColl -->|Yes| LogErrorColl[Log Error]
-    LogErrorColl --> UpdateStatsColl[Update Error Stats]
-    UpdateStatsColl --> CheckMoreColl
-    ErrorCheckColl -->|No| ExtractColl
-    
-    %% Style definitions
-    classDef process fill:#90caf9,stroke:#000,stroke-width:2px;
-    classDef decision fill:#fff59d,stroke:#000,stroke-width:2px;
-    classDef error fill:#ef9a9a,stroke:#000,stroke-width:2px;
-    classDef start fill:#a5d6a7,stroke:#000,stroke-width:2px;
-    
-    class Start,StreamEnd,ReturnResult start;
-    class Decision,CheckMore,CheckMoreColl,ErrorCheck,ErrorCheckColl decision;
-    class LogError,LogErrorColl,UpdateStats,UpdateStatsColl error;
-    class ProcessURL,FetchContent,Extract,ProcessURLColl,FetchContentColl,ExtractColl process;
-```
-
-AsyncWebScraper uses an intelligent crawling system that can navigate through websites following your specified strategy. It supports two main modes of operation:
-
-### 1. Streaming Mode
-```python
-async for result in scraper.ascrape(url, stream=True):
-    print(f"Found data on {result.url}")
-    process_data(result.data)
-```
- Perfect for processing large websites
- Memory efficient - handles one page at a time
- Ideal for real-time data processing
- Great for monitoring or continuous scraping tasks
-
-### 2. Collection Mode
-```python
-result = await scraper.ascrape(url)
-print(f"Scraped {len(result.crawled_urls)} pages")
-process_all_data(result.extracted_data)
-```
- Collects all data before returning
- Best for when you need the complete dataset
- Easier to work with for batch processing
- Includes comprehensive statistics
-
-## Key Features
-
- **Smart Crawling**: Automatically follows relevant links while avoiding duplicates
- **Parallel Processing**: Scrapes multiple pages simultaneously for better performance
- **Memory Efficient**: Choose between streaming and collecting based on your needs
- **Error Resilient**: Continues working even if some pages fail to load
- **Progress Tracking**: Monitor the scraping progress in real-time
- **Customizable**: Configure crawling strategy, filters, and scoring to match your needs
-
-## Quick Start
-
-```python
-from crawl4ai.scraper import AsyncWebScraper, BFSStrategy
-from crawl4ai.async_webcrawler import AsyncWebCrawler
-
-# Initialize the scraper
-crawler = AsyncWebCrawler()
-strategy = BFSStrategy(
-    max_depth=2,  # How deep to crawl
-    url_pattern="*.example.com/*"  # What URLs to follow
-)
-scraper = AsyncWebScraper(crawler, strategy)
-
-# Start scraping
-async def main():
-    # Collect all results
-    result = await scraper.ascrape("https://example.com")
-    print(f"Found {len(result.extracted_data)} pages")
-    
-    # Or stream results
-    async for page in scraper.ascrape("https://example.com", stream=True):
-        print(f"Processing {page.url}")
-
-```
-
-## Best Practices
-
-1. **Choose the Right Mode**
-   - Use streaming for large websites or real-time processing
-   - Use collecting for smaller sites or when you need the complete dataset
-
-2. **Configure Depth**
-   - Start with a small depth (2-3) and increase if needed
-   - Higher depths mean exponentially more pages to crawl
-
-3. **Set Appropriate Filters**
-   - Use URL patterns to stay within relevant sections
-   - Set content type filters to only process useful pages
-
-4. **Handle Resources Responsibly**
-   - Enable parallel processing for faster results
-   - Consider the target website's capacity
-   - Implement appropriate delays between requests
-
-## Common Use Cases
-
- **Content Aggregation**: Collect articles, blog posts, or news from multiple pages
- **Data Extraction**: Gather product information, prices, or specifications
- **Site Mapping**: Create a complete map of a website's structure
- **Content Monitoring**: Track changes or updates across multiple pages
- **Data Mining**: Extract and analyze patterns across web pages
-
-## Advanced Features
-
- Custom scoring algorithms for prioritizing important pages
- URL filters for focusing on specific site sections
- Content type filtering for processing only relevant pages
- Progress tracking for monitoring long-running scrapes
-
-Need more help? Check out our [examples repository](https://github.com/example/crawl4ai/examples) or join our [community Discord](https://discord.gg/example).
--- a/docs/scrapper/bfs_scraper_strategy.md
+++ b/docs/scrapper/bfs_scraper_strategy.md
@@ -1,244 +0,0 @@
-# BFS Scraper Strategy: Smart Web Traversal
-
-The BFS (Breadth-First Search) Scraper Strategy provides an intelligent way to traverse websites systematically. It crawls websites level by level, ensuring thorough coverage while respecting web crawling etiquette.
-
-```mermaid
-flowchart TB
-    Start([Start]) --> Init[Initialize BFS Strategy]
-    Init --> InitStats[Initialize CrawlStats]
-    InitStats --> InitQueue[Initialize Priority Queue]
-    InitQueue --> AddStart[Add Start URL to Queue]
-    
-    AddStart --> CheckState{Queue Empty or\nTasks Pending?}
-    CheckState -->|No| Cleanup[Cleanup & Stats]
-    Cleanup --> End([End])
-    
-    CheckState -->|Yes| CheckCancel{Cancel\nRequested?}
-    CheckCancel -->|Yes| Cleanup
-    
-    CheckCancel -->|No| CheckConcurrent{Under Max\nConcurrent?}
-    
-    CheckConcurrent -->|No| WaitComplete[Wait for Task Completion]
-    WaitComplete --> YieldResult[Yield Result]
-    YieldResult --> CheckState
-    
-    CheckConcurrent -->|Yes| GetNextURL[Get Next URL from Queue]
-    
-    GetNextURL --> ValidateURL{Already\nVisited?}
-    ValidateURL -->|Yes| CheckState
-    
-    ValidateURL -->|No| ProcessURL[Process URL]
-    
-    subgraph URL_Processing [URL Processing]
-        ProcessURL --> CheckValid{URL Valid?}
-        CheckValid -->|No| UpdateStats[Update Skip Stats]
-        
-        CheckValid -->|Yes| CheckRobots{Allowed by\nrobots.txt?}
-        CheckRobots -->|No| UpdateRobotStats[Update Robot Stats]
-        
-        CheckRobots -->|Yes| ApplyDelay[Apply Politeness Delay]
-        ApplyDelay --> FetchContent[Fetch Content with Rate Limit]
-        
-        FetchContent --> CheckError{Error?}
-        CheckError -->|Yes| Retry{Retry\nNeeded?}
-        Retry -->|Yes| FetchContent
-        Retry -->|No| UpdateFailStats[Update Fail Stats]
-        
-        CheckError -->|No| ExtractLinks[Extract & Process Links]
-        ExtractLinks --> ScoreURLs[Score New URLs]
-        ScoreURLs --> AddToQueue[Add to Priority Queue]
-    end
-    
-    ProcessURL --> CreateTask{Parallel\nProcessing?}
-    CreateTask -->|Yes| AddTask[Add to Pending Tasks]
-    CreateTask -->|No| DirectProcess[Process Directly]
-    
-    AddTask --> CheckState
-    DirectProcess --> YieldResult
-    
-    UpdateStats --> CheckState
-    UpdateRobotStats --> CheckState
-    UpdateFailStats --> CheckState
-    
-    classDef process fill:#90caf9,stroke:#000,stroke-width:2px;
-    classDef decision fill:#fff59d,stroke:#000,stroke-width:2px;
-    classDef error fill:#ef9a9a,stroke:#000,stroke-width:2px;
-    classDef stats fill:#a5d6a7,stroke:#000,stroke-width:2px;
-    
-    class Start,End stats;
-    class CheckState,CheckCancel,CheckConcurrent,ValidateURL,CheckValid,CheckRobots,CheckError,Retry,CreateTask decision;
-    class UpdateStats,UpdateRobotStats,UpdateFailStats,InitStats,Cleanup stats;
-    class ProcessURL,FetchContent,ExtractLinks,ScoreURLs process;
-```
-
-## How It Works
-
-The BFS strategy crawls a website by:
-1. Starting from a root URL
-2. Processing all URLs at the current depth
-3. Moving to URLs at the next depth level
-4. Continuing until maximum depth is reached
-
-This ensures systematic coverage of the website while maintaining control over the crawling process.
-
-## Key Features
-
-### 1. Smart URL Processing
-```python
-strategy = BFSScraperStrategy(
-    max_depth=2,
-    filter_chain=my_filters,
-    url_scorer=my_scorer,
-    max_concurrent=5
-)
-```
- Controls crawl depth
- Filters unwanted URLs
- Scores URLs for priority
- Manages concurrent requests
-
-### 2. Polite Crawling
-The strategy automatically implements web crawling best practices:
- Respects robots.txt
- Implements rate limiting
- Adds politeness delays
- Manages concurrent requests
-
-### 3. Link Processing Control
-```python
-strategy = BFSScraperStrategy(
-    ...,
-    process_external_links=False  # Only process internal links
-)
-```
- Control whether to follow external links
- Default: internal links only
- Enable external links when needed
-
-## Configuration Options
-
-| Parameter | Description | Default |
-|-----------|-------------|---------|
-| max_depth | Maximum crawl depth | Required |
-| filter_chain | URL filtering rules | Required |
-| url_scorer | URL priority scoring | Required |
-| max_concurrent | Max parallel requests | 5 |
-| min_crawl_delay | Seconds between requests | 1 |
-| process_external_links | Follow external links | False |
-
-## Best Practices
-
-1. **Set Appropriate Depth**
-   - Start with smaller depths (2-3)
-   - Increase based on needs
-   - Consider site structure
-
-2. **Configure Filters**
-   - Use URL patterns
-   - Filter by content type
-   - Avoid unwanted sections
-
-3. **Tune Performance**
-   - Adjust max_concurrent
-   - Set appropriate delays
-   - Monitor resource usage
-
-4. **Handle External Links**
-   - Keep external_links=False for focused crawls
-   - Enable only when needed
-   - Consider additional filtering
-
-## Example Usage
-
-```python
-from crawl4ai.scraper import BFSScraperStrategy
-from crawl4ai.scraper.filters import FilterChain
-from crawl4ai.scraper.scorers import BasicURLScorer
-
-# Configure strategy
-strategy = BFSScraperStrategy(
-    max_depth=3,
-    filter_chain=FilterChain([
-        URLPatternFilter("*.example.com/*"),
-        ContentTypeFilter(["text/html"])
-    ]),
-    url_scorer=BasicURLScorer(),
-    max_concurrent=5,
-    min_crawl_delay=1,
-    process_external_links=False
-)
-
-# Use with AsyncWebScraper
-scraper = AsyncWebScraper(crawler, strategy)
-results = await scraper.ascrape("https://example.com")
-```
-
-## Common Use Cases
-
-### 1. Site Mapping
-```python
-strategy = BFSScraperStrategy(
-    max_depth=5,
-    filter_chain=site_filter,
-    url_scorer=depth_scorer,
-    process_external_links=False
-)
-```
-Perfect for creating complete site maps or understanding site structure.
-
-### 2. Content Aggregation
-```python
-strategy = BFSScraperStrategy(
-    max_depth=2,
-    filter_chain=content_filter,
-    url_scorer=relevance_scorer,
-    max_concurrent=3
-)
-```
-Ideal for collecting specific types of content (articles, products, etc.).
-
-### 3. Link Analysis
-```python
-strategy = BFSScraperStrategy(
-    max_depth=1,
-    filter_chain=link_filter,
-    url_scorer=link_scorer,
-    process_external_links=True
-)
-```
-Useful for analyzing both internal and external link structures.
-
-## Advanced Features
-
-### Progress Monitoring
-```python
-async for result in scraper.ascrape(url):
-    print(f"Current depth: {strategy.stats.current_depth}")
-    print(f"Processed URLs: {strategy.stats.urls_processed}")
-```
-
-### Custom URL Scoring
-```python
-class CustomScorer(URLScorer):
-    def score(self, url: str) -> float:
-        # Lower scores = higher priority
-        return score_based_on_criteria(url)
-```
-
-## Troubleshooting
-
-1. **Slow Crawling**
-   - Increase max_concurrent
-   - Adjust min_crawl_delay
-   - Check network conditions
-
-2. **Missing Content**
-   - Verify max_depth
-   - Check filter settings
-   - Review URL patterns
-
-3. **High Resource Usage**
-   - Reduce max_concurrent
-   - Increase crawl delay
-   - Add more specific filters
-
--- a/docs/scrapper/filters_scrorers.md
+++ b/docs/scrapper/filters_scrorers.md
@@ -1,342 +0,0 @@
-# URL Filters and Scorers
-
-The crawl4ai library provides powerful URL filtering and scoring capabilities that help you control and prioritize your web crawling. This guide explains how to use these features effectively.
-
-```mermaid
-flowchart TB
-    Start([URL Input]) --> Chain[Filter Chain]
-    
-    subgraph Chain Process
-        Chain --> Pattern{URL Pattern\nFilter}
-        Pattern -->|Match| Content{Content Type\nFilter}
-        Pattern -->|No Match| Reject1[Reject URL]
-        
-        Content -->|Allowed| Domain{Domain\nFilter}
-        Content -->|Not Allowed| Reject2[Reject URL]
-        
-        Domain -->|Allowed| Accept[Accept URL]
-        Domain -->|Blocked| Reject3[Reject URL]
-    end
-    
-    subgraph Statistics
-        Pattern --> UpdatePattern[Update Pattern Stats]
-        Content --> UpdateContent[Update Content Stats]
-        Domain --> UpdateDomain[Update Domain Stats]
-        Accept --> UpdateChain[Update Chain Stats]
-        Reject1 --> UpdateChain
-        Reject2 --> UpdateChain
-        Reject3 --> UpdateChain
-    end
-    
-    Accept --> End([End])
-    Reject1 --> End
-    Reject2 --> End
-    Reject3 --> End
-    
-    classDef process fill:#90caf9,stroke:#000,stroke-width:2px;
-    classDef decision fill:#fff59d,stroke:#000,stroke-width:2px;
-    classDef reject fill:#ef9a9a,stroke:#000,stroke-width:2px;
-    classDef accept fill:#a5d6a7,stroke:#000,stroke-width:2px;
-    
-    class Start,End accept;
-    class Pattern,Content,Domain decision;
-    class Reject1,Reject2,Reject3 reject;
-    class Chain,UpdatePattern,UpdateContent,UpdateDomain,UpdateChain process;
-```
-
-## URL Filters
-
-URL filters help you control which URLs are crawled. Multiple filters can be chained together to create sophisticated filtering rules.
-
-### Available Filters
-
-1. **URL Pattern Filter**
-```python
-pattern_filter = URLPatternFilter([
-    "*.example.com/*",  # Glob pattern
-    "*/article/*",      # Path pattern
-    re.compile(r"blog-\d+") # Regex pattern
-])
-```
- Supports glob patterns and regex
- Multiple patterns per filter
- Pattern pre-compilation for performance
-
-2. **Content Type Filter**
-```python
-content_filter = ContentTypeFilter([
-    "text/html",
-    "application/pdf"
-], check_extension=True)
-```
- Filter by MIME types
- Extension checking
- Support for multiple content types
-
-3. **Domain Filter**
-```python
-domain_filter = DomainFilter(
-    allowed_domains=["example.com", "blog.example.com"],
-    blocked_domains=["ads.example.com"]
-)
-```
- Allow/block specific domains
- Subdomain support
- Efficient domain matching
-
-### Creating Filter Chains
-
-```python
-# Create and configure a filter chain
-filter_chain = FilterChain([
-    URLPatternFilter(["*.example.com/*"]),
-    ContentTypeFilter(["text/html"]),
-    DomainFilter(blocked_domains=["ads.*"])
-])
-
-# Add more filters
-filter_chain.add_filter(
-    URLPatternFilter(["*/article/*"])
-)
-```
-
-```mermaid
-flowchart TB
-    Start([URL Input]) --> Composite[Composite Scorer]
-    
-    subgraph Scoring Process
-        Composite --> Keywords[Keyword Relevance]
-        Composite --> Path[Path Depth]
-        Composite --> Content[Content Type]
-        Composite --> Fresh[Freshness]
-        Composite --> Domain[Domain Authority]
-        
-        Keywords --> KeywordScore[Calculate Score]
-        Path --> PathScore[Calculate Score]
-        Content --> ContentScore[Calculate Score]
-        Fresh --> FreshScore[Calculate Score]
-        Domain --> DomainScore[Calculate Score]
-        
-        KeywordScore --> Weight1[Apply Weight]
-        PathScore --> Weight2[Apply Weight]
-        ContentScore --> Weight3[Apply Weight]
-        FreshScore --> Weight4[Apply Weight]
-        DomainScore --> Weight5[Apply Weight]
-    end
-    
-    Weight1 --> Combine[Combine Scores]
-    Weight2 --> Combine
-    Weight3 --> Combine
-    Weight4 --> Combine
-    Weight5 --> Combine
-    
-    Combine --> Normalize{Normalize?}
-    Normalize -->|Yes| NormalizeScore[Normalize Combined Score]
-    Normalize -->|No| FinalScore[Final Score]
-    NormalizeScore --> FinalScore
-    
-    FinalScore --> Stats[Update Statistics]
-    Stats --> End([End])
-    
-    classDef process fill:#90caf9,stroke:#000,stroke-width:2px;
-    classDef scorer fill:#fff59d,stroke:#000,stroke-width:2px;
-    classDef calc fill:#a5d6a7,stroke:#000,stroke-width:2px;
-    classDef decision fill:#ef9a9a,stroke:#000,stroke-width:2px;
-    
-    class Start,End calc;
-    class Keywords,Path,Content,Fresh,Domain scorer;
-    class KeywordScore,PathScore,ContentScore,FreshScore,DomainScore process;
-    class Normalize decision;
-```
-
-## URL Scorers
-
-URL scorers help prioritize which URLs to crawl first. Higher scores indicate higher priority.
-
-### Available Scorers
-
-1. **Keyword Relevance Scorer**
-```python
-keyword_scorer = KeywordRelevanceScorer(
-    keywords=["python", "programming"],
-    weight=1.0,
-    case_sensitive=False
-)
-```
- Score based on keyword matches
- Case sensitivity options
- Weighted scoring
-
-2. **Path Depth Scorer**
-```python
-path_scorer = PathDepthScorer(
-    optimal_depth=3,  # Preferred URL depth
-    weight=0.7
-)
-```
- Score based on URL path depth
- Configurable optimal depth
- Diminishing returns for deeper paths
-
-3. **Content Type Scorer**
-```python
-content_scorer = ContentTypeScorer({
-    r'\.html$': 1.0,
-    r'\.pdf$': 0.8,
-    r'\.xml$': 0.6
-})
-```
- Score based on file types
- Configurable type weights
- Pattern matching support
-
-4. **Freshness Scorer**
-```python
-freshness_scorer = FreshnessScorer(weight=0.9)
-```
- Score based on date indicators in URLs
- Multiple date format support
- Recency weighting
-
-5. **Domain Authority Scorer**
-```python
-authority_scorer = DomainAuthorityScorer({
-    "python.org": 1.0,
-    "github.com": 0.9,
-    "medium.com": 0.7
-})
-```
- Score based on domain importance
- Configurable domain weights
- Default weight for unknown domains
-
-### Combining Scorers
-
-```python
-# Create a composite scorer
-composite_scorer = CompositeScorer([
-    KeywordRelevanceScorer(["python"], weight=1.0),
-    PathDepthScorer(optimal_depth=2, weight=0.7),
-    FreshnessScorer(weight=0.8)
-], normalize=True)
-```
-
-## Best Practices
-
-### Filter Configuration
-
-1. **Start Restrictive**
-   ```python
-   # Begin with strict filters
-   filter_chain = FilterChain([
-       DomainFilter(allowed_domains=["example.com"]),
-       ContentTypeFilter(["text/html"])
-   ])
-   ```
-
-2. **Layer Filters**
-   ```python
-   # Add more specific filters
-   filter_chain.add_filter(
-       URLPatternFilter(["*/article/*", "*/blog/*"])
-   )
-   ```
-
-3. **Monitor Filter Statistics**
-   ```python
-   # Check filter performance
-   for filter in filter_chain.filters:
-       print(f"{filter.name}: {filter.stats.rejected_urls} rejected")
-   ```
-
-### Scorer Configuration
-
-1. **Balance Weights**
-   ```python
-   # Balanced scoring configuration
-   scorer = create_balanced_scorer()
-   ```
-
-2. **Customize for Content**
-   ```python
-   # News site configuration
-   news_scorer = CompositeScorer([
-       KeywordRelevanceScorer(["news", "article"], weight=1.0),
-       FreshnessScorer(weight=1.0),
-       PathDepthScorer(optimal_depth=2, weight=0.5)
-   ])
-   ```
-
-3. **Monitor Scoring Statistics**
-   ```python
-   # Check scoring distribution
-   print(f"Average score: {scorer.stats.average_score}")
-   print(f"Score range: {scorer.stats.min_score} - {scorer.stats.max_score}")
-   ```
-
-## Common Use Cases
-
-### Blog Crawling
-```python
-blog_config = {
-    'filters': FilterChain([
-        URLPatternFilter(["*/blog/*", "*/post/*"]),
-        ContentTypeFilter(["text/html"])
-    ]),
-    'scorer': CompositeScorer([
-        FreshnessScorer(weight=1.0),
-        KeywordRelevanceScorer(["blog", "article"], weight=0.8)
-    ])
-}
-```
-
-### Documentation Sites
-```python
-docs_config = {
-    'filters': FilterChain([
-        URLPatternFilter(["*/docs/*", "*/guide/*"]),
-        ContentTypeFilter(["text/html", "application/pdf"])
-    ]),
-    'scorer': CompositeScorer([
-        PathDepthScorer(optimal_depth=3, weight=1.0),
-        KeywordRelevanceScorer(["guide", "tutorial"], weight=0.9)
-    ])
-}
-```
-
-### E-commerce Sites
-```python
-ecommerce_config = {
-    'filters': FilterChain([
-        URLPatternFilter(["*/product/*", "*/category/*"]),
-        DomainFilter(blocked_domains=["ads.*", "tracker.*"])
-    ]),
-    'scorer': CompositeScorer([
-        PathDepthScorer(optimal_depth=2, weight=1.0),
-        ContentTypeScorer({
-            r'/product/': 1.0,
-            r'/category/': 0.8
-        })
-    ])
-}
-```
-
-## Advanced Topics
-
-### Custom Filters
-```python
-class CustomFilter(URLFilter):
-    def apply(self, url: str) -> bool:
-        # Your custom filtering logic
-        return True
-```
-
-### Custom Scorers
-```python
-class CustomScorer(URLScorer):
-    def _calculate_score(self, url: str) -> float:
-        # Your custom scoring logic
-        return 1.0
-```
-
-For more examples, check our [example repository](https://github.com/example/crawl4ai/examples).
--- a/docs/scrapper/how_to_use.md
+++ b/docs/scrapper/how_to_use.md
@@ -1,206 +0,0 @@
-# Scraper Examples Guide
-
-This guide provides two complete examples of using the crawl4ai scraper: a basic implementation for simple use cases and an advanced implementation showcasing all features.
-
-## Basic Example
-
-The basic example demonstrates a simple blog scraping scenario:
-
-```python
-from crawl4ai.scraper import AsyncWebScraper, BFSScraperStrategy, FilterChain
-
-# Create simple filter chain
-filter_chain = FilterChain([
-    URLPatternFilter("*/blog/*"),
-    ContentTypeFilter(["text/html"])
-])
-
-# Initialize strategy
-strategy = BFSScraperStrategy(
-    max_depth=2,
-    filter_chain=filter_chain,
-    url_scorer=None,
-    max_concurrent=3
-)
-
-# Create and run scraper
-crawler = AsyncWebCrawler()
-scraper = AsyncWebScraper(crawler, strategy)
-result = await scraper.ascrape("https://example.com/blog/")
-```
-
-### Features Demonstrated
- Basic URL filtering
- Simple content type filtering
- Depth control
- Concurrent request limiting
- Result collection
-
-## Advanced Example
-
-The advanced example shows a sophisticated news site scraping setup with all features enabled:
-
-```python
-# Create comprehensive filter chain
-filter_chain = FilterChain([
-    DomainFilter(
-        allowed_domains=["example.com"],
-        blocked_domains=["ads.example.com"]
-    ),
-    URLPatternFilter([
-        "*/article/*",
-        re.compile(r"\d{4}/\d{2}/.*")
-    ]),
-    ContentTypeFilter(["text/html"])
-])
-
-# Create intelligent scorer
-scorer = CompositeScorer([
-    KeywordRelevanceScorer(
-        keywords=["news", "breaking"],
-        weight=1.0
-    ),
-    PathDepthScorer(optimal_depth=3, weight=0.7),
-    FreshnessScorer(weight=0.9)
-])
-
-# Initialize advanced strategy
-strategy = BFSScraperStrategy(
-    max_depth=4,
-    filter_chain=filter_chain,
-    url_scorer=scorer,
-    max_concurrent=5
-)
-```
-
-### Features Demonstrated
-1. **Advanced Filtering**
-   - Domain filtering
-   - Pattern matching
-   - Content type control
-
-2. **Intelligent Scoring**
-   - Keyword relevance
-   - Path optimization
-   - Freshness priority
-
-3. **Monitoring**
-   - Progress tracking
-   - Error handling
-   - Statistics collection
-
-4. **Resource Management**
-   - Concurrent processing
-   - Rate limiting
-   - Cleanup handling
-
-## Running the Examples
-
-```bash
-# Basic usage
-python basic_scraper_example.py
-
-# Advanced usage with logging
-PYTHONPATH=. python advanced_scraper_example.py
-```
-
-## Example Output
-
-### Basic Example
-```
-Crawled 15 pages:
- https://example.com/blog/post1: 24560 bytes
- https://example.com/blog/post2: 18920 bytes
-...
-```
-
-### Advanced Example
-```
-INFO: Starting crawl of https://example.com/news/
-INFO: Processed: https://example.com/news/breaking/story1
-DEBUG: KeywordScorer: 0.85
-DEBUG: FreshnessScorer: 0.95
-INFO: Progress: 10 URLs processed
-...
-INFO: Scraping completed:
-INFO: - URLs processed: 50
-INFO: - Errors: 2
-INFO: - Total content size: 1240.50 KB
-```
-
-## Customization
-
-### Adding Custom Filters
-```python
-class CustomFilter(URLFilter):
-    def apply(self, url: str) -> bool:
-        # Your custom filtering logic
-        return True
-
-filter_chain.add_filter(CustomFilter())
-```
-
-### Custom Scoring Logic
-```python
-class CustomScorer(URLScorer):
-    def _calculate_score(self, url: str) -> float:
-        # Your custom scoring logic
-        return 1.0
-
-scorer = CompositeScorer([
-    CustomScorer(weight=1.0),
-    ...
-])
-```
-
-## Best Practices
-
-1. **Start Simple**
-   - Begin with basic filtering
-   - Add features incrementally
-   - Test thoroughly at each step
-
-2. **Monitor Performance**
-   - Watch memory usage
-   - Track processing times
-   - Adjust concurrency as needed
-
-3. **Handle Errors**
-   - Implement proper error handling
-   - Log important events
-   - Track error statistics
-
-4. **Optimize Resources**
-   - Set appropriate delays
-   - Limit concurrent requests
-   - Use streaming for large crawls
-
-## Troubleshooting
-
-Common issues and solutions:
-
-1. **Too Many Requests**
-   ```python
-   strategy = BFSScraperStrategy(
-       max_concurrent=3,  # Reduce concurrent requests
-       min_crawl_delay=2  # Increase delay between requests
-   )
-   ```
-
-2. **Memory Issues**
-   ```python
-   # Use streaming mode for large crawls
-   async for result in scraper.ascrape(url, stream=True):
-       process_result(result)
-   ```
-
-3. **Missing Content**
-   ```python
-   # Check your filter chain
-   filter_chain = FilterChain([
-       URLPatternFilter("*"),  # Broaden patterns
-       ContentTypeFilter(["*"])  # Accept all content
-   ])
-   ```
-
-For more examples and use cases, visit our [GitHub repository](https://github.com/example/crawl4ai/examples).
--- a/docs/scrapper/scraper_quickstart.py
+++ b/docs/scrapper/scraper_quickstart.py
@@ -1,184 +0,0 @@
-# basic_scraper_example.py
-from crawl4ai.scraper import (
-    AsyncWebScraper,
-    BFSScraperStrategy,
-    FilterChain,
-    URLPatternFilter,
-    ContentTypeFilter
-)
-from crawl4ai.async_webcrawler import AsyncWebCrawler
-
-async def basic_scraper_example():
-    """
-    Basic example: Scrape a blog site for articles
-    - Crawls only HTML pages
-    - Stays within the blog section
-    - Collects all results at once
-    """
-    # Create a simple filter chain
-    filter_chain = FilterChain([
-        # Only crawl pages within the blog section
-        URLPatternFilter("*/blog/*"),
-        # Only process HTML pages
-        ContentTypeFilter(["text/html"])
-    ])
-
-    # Initialize the strategy with basic configuration
-    strategy = BFSScraperStrategy(
-        max_depth=2,  # Only go 2 levels deep
-        filter_chain=filter_chain,
-        url_scorer=None,  # Use default scoring
-        max_concurrent=3  # Limit concurrent requests
-    )
-
-    # Create the crawler and scraper
-    crawler = AsyncWebCrawler()
-    scraper = AsyncWebScraper(crawler, strategy)
-
-    # Start scraping
-    try:
-        result = await scraper.ascrape("https://example.com/blog/")
-        
-        # Process results
-        print(f"Crawled {len(result.crawled_urls)} pages:")
-        for url, data in result.extracted_data.items():
-            print(f"- {url}: {len(data.html)} bytes")
-            
-    except Exception as e:
-        print(f"Error during scraping: {e}")
-
-# advanced_scraper_example.py
-import logging
-from crawl4ai.scraper import (
-    AsyncWebScraper,
-    BFSScraperStrategy,
-    FilterChain,
-    URLPatternFilter,
-    ContentTypeFilter,
-    DomainFilter,
-    KeywordRelevanceScorer,
-    PathDepthScorer,
-    FreshnessScorer,
-    CompositeScorer
-)
-from crawl4ai.async_webcrawler import AsyncWebCrawler
-
-async def advanced_scraper_example():
-    """
-    Advanced example: Intelligent news site scraping
-    - Uses all filter types
-    - Implements sophisticated scoring
-    - Streams results
-    - Includes monitoring and logging
-    """
-    # Set up logging
-    logging.basicConfig(level=logging.INFO)
-    logger = logging.getLogger("advanced_scraper")
-
-    # Create sophisticated filter chain
-    filter_chain = FilterChain([
-        # Domain control
-        DomainFilter(
-            allowed_domains=["example.com", "blog.example.com"],
-            blocked_domains=["ads.example.com", "tracker.example.com"]
-        ),
-        # URL patterns
-        URLPatternFilter([
-            "*/article/*",
-            "*/news/*",
-            "*/blog/*",
-            re.compile(r"\d{4}/\d{2}/.*")  # Date-based URLs
-        ]),
-        # Content types
-        ContentTypeFilter([
-            "text/html",
-            "application/xhtml+xml"
-        ])
-    ])
-
-    # Create composite scorer
-    scorer = CompositeScorer([
-        # Prioritize by keywords
-        KeywordRelevanceScorer(
-            keywords=["news", "breaking", "update", "latest"],
-            weight=1.0
-        ),
-        # Prefer optimal URL structure
-        PathDepthScorer(
-            optimal_depth=3,
-            weight=0.7
-        ),
-        # Prioritize fresh content
-        FreshnessScorer(weight=0.9)
-    ])
-
-    # Initialize strategy with advanced configuration
-    strategy = BFSScraperStrategy(
-        max_depth=4,
-        filter_chain=filter_chain,
-        url_scorer=scorer,
-        max_concurrent=5,
-        min_crawl_delay=1
-    )
-
-    # Create crawler and scraper
-    crawler = AsyncWebCrawler()
-    scraper = AsyncWebScraper(crawler, strategy)
-
-    # Track statistics
-    stats = {
-        'processed': 0,
-        'errors': 0,
-        'total_size': 0
-    }
-
-    try:
-        # Use streaming mode
-        async for result in scraper.ascrape("https://example.com/news/", stream=True):
-            stats['processed'] += 1
-            
-            if result.success:
-                stats['total_size'] += len(result.html)
-                logger.info(f"Processed: {result.url}")
-                
-                # Print scoring information
-                for scorer_name, score in result.scores.items():
-                    logger.debug(f"{scorer_name}: {score:.2f}")
-            else:
-                stats['errors'] += 1
-                logger.error(f"Failed to process {result.url}: {result.error_message}")
-
-            # Log progress regularly
-            if stats['processed'] % 10 == 0:
-                logger.info(f"Progress: {stats['processed']} URLs processed")
-
-    except Exception as e:
-        logger.error(f"Scraping error: {e}")
-    
-    finally:
-        # Print final statistics
-        logger.info("Scraping completed:")
-        logger.info(f"- URLs processed: {stats['processed']}")
-        logger.info(f"- Errors: {stats['errors']}")
-        logger.info(f"- Total content size: {stats['total_size'] / 1024:.2f} KB")
-        
-        # Print filter statistics
-        for filter_ in filter_chain.filters:
-            logger.info(f"{filter_.name} stats:")
-            logger.info(f"- Passed: {filter_.stats.passed_urls}")
-            logger.info(f"- Rejected: {filter_.stats.rejected_urls}")
-        
-        # Print scorer statistics
-        logger.info("Scoring statistics:")
-        logger.info(f"- Average score: {scorer.stats.average_score:.2f}")
-        logger.info(f"- Score range: {scorer.stats.min_score:.2f} - {scorer.stats.max_score:.2f}")
-
-if __name__ == "__main__":
-    import asyncio
-    
-    # Run basic example
-    print("Running basic scraper example...")
-    asyncio.run(basic_scraper_example())
-    
-    print("\nRunning advanced scraper example...")
-    asyncio.run(advanced_scraper_example())
--- a/requirements.txt
+++ b/requirements.txt
@@ -7,4 +7,5 @@ pillow==10.4.0
 playwright==1.47.0
 python-dotenv==1.0.1
 requests>=2.26.0,<2.32.3
-beautifulsoup4==4.12.3
+beautifulsoup4==4.12.3
+playwright_stealth==1.0.6
--- a/tests/test_scraper.py
+++ b/tests/test_scraper.py
@@ -1,184 +0,0 @@
-# basic_scraper_example.py
-from crawl4ai.scraper import (
-    AsyncWebScraper,
-    BFSScraperStrategy,
-    FilterChain,
-    URLPatternFilter,
-    ContentTypeFilter
-)
-from crawl4ai.async_webcrawler import AsyncWebCrawler
-
-async def basic_scraper_example():
-    """
-    Basic example: Scrape a blog site for articles
-    - Crawls only HTML pages
-    - Stays within the blog section
-    - Collects all results at once
-    """
-    # Create a simple filter chain
-    filter_chain = FilterChain([
-        # Only crawl pages within the blog section
-        URLPatternFilter("*/blog/*"),
-        # Only process HTML pages
-        ContentTypeFilter(["text/html"])
-    ])
-
-    # Initialize the strategy with basic configuration
-    strategy = BFSScraperStrategy(
-        max_depth=2,  # Only go 2 levels deep
-        filter_chain=filter_chain,
-        url_scorer=None,  # Use default scoring
-        max_concurrent=3  # Limit concurrent requests
-    )
-
-    # Create the crawler and scraper
-    crawler = AsyncWebCrawler()
-    scraper = AsyncWebScraper(crawler, strategy)
-
-    # Start scraping
-    try:
-        result = await scraper.ascrape("https://example.com/blog/")
-        
-        # Process results
-        print(f"Crawled {len(result.crawled_urls)} pages:")
-        for url, data in result.extracted_data.items():
-            print(f"- {url}: {len(data.html)} bytes")
-            
-    except Exception as e:
-        print(f"Error during scraping: {e}")
-
-# advanced_scraper_example.py
-import logging
-from crawl4ai.scraper import (
-    AsyncWebScraper,
-    BFSScraperStrategy,
-    FilterChain,
-    URLPatternFilter,
-    ContentTypeFilter,
-    DomainFilter,
-    KeywordRelevanceScorer,
-    PathDepthScorer,
-    FreshnessScorer,
-    CompositeScorer
-)
-from crawl4ai.async_webcrawler import AsyncWebCrawler
-
-async def advanced_scraper_example():
-    """
-    Advanced example: Intelligent news site scraping
-    - Uses all filter types
-    - Implements sophisticated scoring
-    - Streams results
-    - Includes monitoring and logging
-    """
-    # Set up logging
-    logging.basicConfig(level=logging.INFO)
-    logger = logging.getLogger("advanced_scraper")
-
-    # Create sophisticated filter chain
-    filter_chain = FilterChain([
-        # Domain control
-        DomainFilter(
-            allowed_domains=["example.com", "blog.example.com"],
-            blocked_domains=["ads.example.com", "tracker.example.com"]
-        ),
-        # URL patterns
-        URLPatternFilter([
-            "*/article/*",
-            "*/news/*",
-            "*/blog/*",
-            re.compile(r"\d{4}/\d{2}/.*")  # Date-based URLs
-        ]),
-        # Content types
-        ContentTypeFilter([
-            "text/html",
-            "application/xhtml+xml"
-        ])
-    ])
-
-    # Create composite scorer
-    scorer = CompositeScorer([
-        # Prioritize by keywords
-        KeywordRelevanceScorer(
-            keywords=["news", "breaking", "update", "latest"],
-            weight=1.0
-        ),
-        # Prefer optimal URL structure
-        PathDepthScorer(
-            optimal_depth=3,
-            weight=0.7
-        ),
-        # Prioritize fresh content
-        FreshnessScorer(weight=0.9)
-    ])
-
-    # Initialize strategy with advanced configuration
-    strategy = BFSScraperStrategy(
-        max_depth=4,
-        filter_chain=filter_chain,
-        url_scorer=scorer,
-        max_concurrent=5,
-        min_crawl_delay=1
-    )
-
-    # Create crawler and scraper
-    crawler = AsyncWebCrawler()
-    scraper = AsyncWebScraper(crawler, strategy)
-
-    # Track statistics
-    stats = {
-        'processed': 0,
-        'errors': 0,
-        'total_size': 0
-    }
-
-    try:
-        # Use streaming mode
-        async for result in scraper.ascrape("https://example.com/news/", stream=True):
-            stats['processed'] += 1
-            
-            if result.success:
-                stats['total_size'] += len(result.html)
-                logger.info(f"Processed: {result.url}")
-                
-                # Print scoring information
-                for scorer_name, score in result.scores.items():
-                    logger.debug(f"{scorer_name}: {score:.2f}")
-            else:
-                stats['errors'] += 1
-                logger.error(f"Failed to process {result.url}: {result.error_message}")
-
-            # Log progress regularly
-            if stats['processed'] % 10 == 0:
-                logger.info(f"Progress: {stats['processed']} URLs processed")
-
-    except Exception as e:
-        logger.error(f"Scraping error: {e}")
-    
-    finally:
-        # Print final statistics
-        logger.info("Scraping completed:")
-        logger.info(f"- URLs processed: {stats['processed']}")
-        logger.info(f"- Errors: {stats['errors']}")
-        logger.info(f"- Total content size: {stats['total_size'] / 1024:.2f} KB")
-        
-        # Print filter statistics
-        for filter_ in filter_chain.filters:
-            logger.info(f"{filter_.name} stats:")
-            logger.info(f"- Passed: {filter_.stats.passed_urls}")
-            logger.info(f"- Rejected: {filter_.stats.rejected_urls}")
-        
-        # Print scorer statistics
-        logger.info("Scoring statistics:")
-        logger.info(f"- Average score: {scorer.stats.average_score:.2f}")
-        logger.info(f"- Score range: {scorer.stats.min_score:.2f} - {scorer.stats.max_score:.2f}")
-
-if __name__ == "__main__":
-    import asyncio
-    
-    # Run basic example
-    print("Running basic scraper example...")
-    asyncio.run(basic_scraper_example())
-    
-    print("\nRunning advanced scraper example...")
-    asyncio.run(advanced_scraper_example())
Author	SHA1	Message	Date
UncleCode	b309bc34e1	Fix the model nam ein quick start example	2024-10-18 15:32:25 +08:00
UncleCode	b8147b64e0	chore: Bump version to 0.3.71 and improve error handling - Update version number to 0.3.71 - Add sleep_on_close option to AsyncPlaywrightCrawlerStrategy - Enhance context creation with additional options - Improve error message formatting and visibility - Update quickstart documentation	2024-10-18 13:31:12 +08:00
UncleCode	aab6ea022e	Update requirements and switch to 0.3.8	2024-10-18 12:51:23 +08:00
UncleCode	dd17ed0e63	Rename some flags name, introducing magic flag.	2024-10-18 12:35:09 +08:00
UncleCode	768aa06ceb	feat(crawler): Enhance stealth and flexibility, improve error handling - Implement playwright_stealth for better bot detection avoidance - Add user simulation and navigator override options - Improve iframe processing and browser selection - Enhance error reporting and debugging capabilities - Optimize image processing and parallel crawling - Add new example for user simulation feature - Added support for including links in Markdown content, by definin g a new flag `include_links_on_markdown` in `crawl` method.	2024-10-17 21:37:48 +08:00