Fix the model name in quick start example

chore: Bump version to 0.3.71 and improve error handling
- Update version number to 0.3.71
- Add sleep_on_close option to AsyncPlaywrightCrawlerStrategy
- Enhance context creation with additional options
- Improve error message formatting and visibility
- Update quickstart documentation
2024-10-18 15:32:25 +08:00 · 2024-10-18 13:31:12 +08:00 · 2024-10-18 12:51:23 +08:00 · 2024-10-18 12:35:09 +08:00 · 2024-10-17 21:37:48 +08:00
23 changed files with 822 additions and 403 deletions
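
The `sleep_on_close` option listed in the commit message can be pictured with a small stand-in class. This is a minimal sketch, not the crawl4ai implementation: `DemoStrategy`, `close_delay`, and the `events` list are hypothetical names introduced for illustration, and the delay is made configurable here only so the example finishes quickly.

```python
import asyncio


class DemoStrategy:
    """Hypothetical stand-in for a crawler strategy; not part of crawl4ai."""

    def __init__(self, **kwargs):
        # Opt-in flag read from kwargs, defaulting to False.
        self.sleep_on_close = kwargs.get("sleep_on_close", False)
        # Assumption: a short, configurable delay so the sketch runs quickly.
        self.close_delay = kwargs.get("close_delay", 0.01)
        self.events = []

    async def close(self):
        if self.sleep_on_close:
            # Pause before releasing resources.
            await asyncio.sleep(self.close_delay)
            self.events.append("slept")
        self.events.append("closed")


async def main():
    eager = DemoStrategy()
    lazy = DemoStrategy(sleep_on_close=True)
    await eager.close()
    await lazy.close()
    return eager.events, lazy.events


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Reading the flag through `kwargs.get` with a `False` default keeps existing callers unaffected unless they opt in.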
--- a/.gitignore
+++ b/.gitignore
@@ -199,35 +199,10 @@ test_env/
 **/.DS_Store

 todo.md
-todo_executor.md
 git_changes.py
 git_changes.md
 pypi_build.sh
 git_issues.py
 git_issues.md

-.next/
-.tests/
-# .issues/
-.docs/
-.issues/
-.gitboss/
-todo_executor.md
-protect-all-except-feature.sh
-manage-collab.sh
-publish.sh
-combine.sh
-combined_output.txt
-.local
-.scripts
-tree.md
-tree.md
-.scripts
-.local
-.do
-/plans
-.codeiumignore
-todo/
-
-# windsurf rules
-.windsurfrules
+.tests/
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,79 @@
 # Changelog

+## [v0.3.71] - 2024-10-18
+
+### Changes
+1. **Version Update**:
+   - Updated version number from 0.3.7 to 0.3.71.
+
+2. **Crawler Enhancements**:
+   - Added `sleep_on_close` option to AsyncPlaywrightCrawlerStrategy for delayed browser closure.
+   - Improved context creation with additional options:
+     - Enabled `accept_downloads` and `java_script_enabled`.
+     - Added a cookie to enable cookies by default.
+
+3. **Error Handling Improvements**:
+   - Enhanced error messages in AsyncWebCrawler's `arun` method.
+   - Updated error reporting format for better visibility and consistency.
+
+4. **Performance Optimization**:
+   - Commented out automatic page and context closure in `crawl` method to potentially improve performance in certain scenarios.
+
+### Documentation
+- Updated quickstart notebook:
+  - Changed installation command to use the released package instead of GitHub repository.
+  - Updated kernel display name.
+
+### Developer Notes
+- Minor code refactoring and cleanup.
+
+## [v0.3.7] - 2024-10-17
+
+### New Features
+1. **Enhanced Browser Stealth**: 
+   - Implemented `playwright_stealth` for improved bot detection avoidance.
+   - Added `StealthConfig` for fine-tuned control over stealth parameters.
+
+2. **User Simulation**:
+   - New `simulate_user` option to mimic human-like interactions (mouse movements, clicks, keyboard presses).
+
+3. **Navigator Override**:
+   - Added `override_navigator` option to modify navigator properties, further improving bot detection evasion.
+
+4. **Improved iframe Handling**:
+   - New `process_iframes` parameter to extract and integrate iframe content into the main page.
+
+5. **Flexible Browser Selection**:
+   - Support for choosing between Chromium, Firefox, and WebKit browsers.
+
+6. **Include Links in Markdown**:
+    - Added support for including links in Markdown content by defining a new flag `include_links_on_markdown` in the `crawl` method.
+
+### Improvements
+1. **Better Error Handling**:
+   - Enhanced error reporting in WebScrappingStrategy with detailed error messages and suggestions.
+   - Added console message and error logging for better debugging.
+
+2. **Image Processing Enhancements**:
+   - Improved image dimension updating and filtering logic.
+
+3. **Crawling Flexibility**:
+   - Added support for custom viewport sizes.
+   - Implemented delayed content retrieval with `delay_before_return_html` parameter.
+
+4. **Performance Optimization**:
+   - Adjusted default semaphore count for parallel crawling.
+
+### Bug Fixes
+- Fixed an issue where the HTML content could be empty after processing.
+
+### Examples
+- Added new example `crawl_with_user_simulation()` demonstrating the use of user simulation and navigator override features.
+
+### Developer Notes
+- Refactored code for better maintainability and readability.
+- Updated browser launch arguments for improved compatibility and performance.
+
 ## [v0.3.6] - 2024-10-12 

 ### 1. Improved Crawling Control
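
The changelog entry about the semaphore count for parallel crawling refers to bounding how many crawls run at once. A minimal sketch of that pattern with `asyncio.Semaphore` follows; `crawl_all` and `fake_crawl` are hypothetical placeholders for a real page fetch, and the limit of 2 is chosen only for the demonstration.

```python
import asyncio


async def crawl_all(urls, limit=2):
    """Run a placeholder crawl per URL with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)
    running = 0
    peak = 0

    async def fake_crawl(url):
        # Hypothetical stand-in for a real fetch; it only tracks concurrency.
        nonlocal running, peak
        async with semaphore:
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)
            running -= 1
            return f"crawled {url}"

    # return_exceptions=True returns failures as values instead of raising
    # the first one; results keep the order of the input URLs.
    results = await asyncio.gather(
        *(fake_crawl(u) for u in urls), return_exceptions=True
    )
    return results, peak


if __name__ == "__main__":
    print(asyncio.run(crawl_all(["a", "b", "c", "d", "e"])))
```

A lower limit trades throughput for less memory and fewer simultaneous browser pages; the right value depends on the host machine.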
--- a/crawl4ai/__init__.py
+++ b/crawl4ai/__init__.py
@@ -3,7 +3,7 @@
 from .async_webcrawler import AsyncWebCrawler
 from .models import CrawlResult

-__version__ = "0.3.6"
+__version__ = "0.3.71"

 __all__ = [
    "AsyncWebCrawler",
--- a/crawl4ai/async_crawler_strategy
+++ b/crawl4ai/async_crawler_strategy
@@ -0,0 +1,558 @@
+import asyncio
+import base64
+import time
+from abc import ABC, abstractmethod
+from typing import Callable, Dict, Any, List, Optional, Awaitable
+import os
+from playwright.async_api import async_playwright, Page, Browser, Error
+from io import BytesIO
+from PIL import Image, ImageDraw, ImageFont
+from pathlib import Path
+from playwright.async_api import ProxySettings
+from pydantic import BaseModel
+import hashlib
+import json
+import uuid
+from playwright_stealth import stealth_async
+
+class AsyncCrawlResponse(BaseModel):
+    html: str
+    response_headers: Dict[str, str]
+    status_code: int
+    screenshot: Optional[str] = None
+    get_delayed_content: Optional[Callable[[Optional[float]], Awaitable[str]]] = None
+
+    class Config:
+        arbitrary_types_allowed = True
+
+class AsyncCrawlerStrategy(ABC):
+    @abstractmethod
+    async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
+        pass
+    
+    @abstractmethod
+    async def crawl_many(self, urls: List[str], **kwargs) -> List[AsyncCrawlResponse]:
+        pass
+    
+    @abstractmethod
+    async def take_screenshot(self, url: str) -> str:
+        pass
+    
+    @abstractmethod
+    def update_user_agent(self, user_agent: str):
+        pass
+    
+    @abstractmethod
+    def set_hook(self, hook_type: str, hook: Callable):
+        pass
+
+class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
+    def __init__(self, use_cached_html=False, js_code=None, **kwargs):
+        self.use_cached_html = use_cached_html
+        self.user_agent = kwargs.get(
+            "user_agent",
+            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
+            "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
+        )
+        self.proxy = kwargs.get("proxy")
+        self.headless = kwargs.get("headless", True)
+        self.browser_type = kwargs.get("browser_type", "chromium")
+        self.headers = kwargs.get("headers", {})
+        self.sessions = {}
+        self.session_ttl = 1800 
+        self.js_code = js_code
+        self.verbose = kwargs.get("verbose", False)
+        self.playwright = None
+        self.browser = None
+        self.hooks = {
+            'on_browser_created': None,
+            'on_user_agent_updated': None,
+            'on_execution_started': None,
+            'before_goto': None,
+            'after_goto': None,
+            'before_return_html': None,
+            'before_retrieve_html': None
+        }
+
+    async def __aenter__(self):
+        await self.start()
+        return self
+
+    async def __aexit__(self, exc_type, exc_val, exc_tb):
+        await self.close()
+
+    async def start(self):
+        if self.playwright is None:
+            self.playwright = await async_playwright().start()
+        if self.browser is None:
+            browser_args = {
+                "headless": self.headless,
+                "args": [
+                    "--disable-gpu",
+                    "--no-sandbox",
+                    "--disable-dev-shm-usage",
+                    "--disable-blink-features=AutomationControlled",
+                    "--disable-infobars",
+                    "--window-position=0,0",
+                    "--ignore-certificate-errors",
+                    "--ignore-certificate-errors-spki-list",
+                    # "--headless=new",  # Use the new headless mode
+                ]
+            }
+            
+            # Add proxy settings if a proxy is specified
+            if self.proxy:
+                proxy_settings = ProxySettings(server=self.proxy)
+                browser_args["proxy"] = proxy_settings
+                
+            # Select the appropriate browser based on the browser_type
+            if self.browser_type == "firefox":
+                self.browser = await self.playwright.firefox.launch(**browser_args)
+            elif self.browser_type == "webkit":
+                self.browser = await self.playwright.webkit.launch(**browser_args)
+            else:
+                self.browser = await self.playwright.chromium.launch(**browser_args)
+
+            await self.execute_hook('on_browser_created', self.browser)
+
+    async def close(self):
+        if self.browser:
+            await self.browser.close()
+            self.browser = None
+        if self.playwright:
+            await self.playwright.stop()
+            self.playwright = None
+
+    def __del__(self):
+        if self.browser or self.playwright:
+            asyncio.get_event_loop().run_until_complete(self.close())
+
+    def set_hook(self, hook_type: str, hook: Callable):
+        if hook_type in self.hooks:
+            self.hooks[hook_type] = hook
+        else:
+            raise ValueError(f"Invalid hook type: {hook_type}")
+
+    async def execute_hook(self, hook_type: str, *args):
+        hook = self.hooks.get(hook_type)
+        if hook:
+            if asyncio.iscoroutinefunction(hook):
+                return await hook(*args)
+            else:
+                return hook(*args)
+        return args[0] if args else None
+
+    def update_user_agent(self, user_agent: str):
+        self.user_agent = user_agent
+
+    def set_custom_headers(self, headers: Dict[str, str]):
+        self.headers = headers
+
+    async def kill_session(self, session_id: str):
+        if session_id in self.sessions:
+            context, page, _ = self.sessions[session_id]
+            await page.close()
+            await context.close()
+            del self.sessions[session_id]
+
+    def _cleanup_expired_sessions(self):
+        current_time = time.time()
+        expired_sessions = [
+            sid for sid, (_, _, last_used) in self.sessions.items() 
+            if current_time - last_used > self.session_ttl
+        ]
+        for sid in expired_sessions:
+            asyncio.create_task(self.kill_session(sid))
+            
+    async def smart_wait(self, page: Page, wait_for: str, timeout: float = 30000):
+        wait_for = wait_for.strip()
+        
+        if wait_for.startswith('js:'):
+            # Explicitly specified JavaScript
+            js_code = wait_for[3:].strip()
+            return await self.csp_compliant_wait(page, js_code, timeout)
+        elif wait_for.startswith('css:'):
+            # Explicitly specified CSS selector
+            css_selector = wait_for[4:].strip()
+            try:
+                await page.wait_for_selector(css_selector, timeout=timeout)
+            except Error as e:
+                if 'Timeout' in str(e):
+                    raise TimeoutError(f"Timeout after {timeout}ms waiting for selector '{css_selector}'")
+                else:
+                    raise ValueError(f"Invalid CSS selector: '{css_selector}'")
+        else:
+            # Auto-detect based on content
+            if wait_for.startswith('()') or wait_for.startswith('function'):
+                # It's likely a JavaScript function
+                return await self.csp_compliant_wait(page, wait_for, timeout)
+            else:
+                # Assume it's a CSS selector first
+                try:
+                    await page.wait_for_selector(wait_for, timeout=timeout)
+                except Error as e:
+                    if 'Timeout' in str(e):
+                        raise TimeoutError(f"Timeout after {timeout}ms waiting for selector '{wait_for}'")
+                    else:
+                        # If it's not a timeout error, it might be an invalid selector
+                        # Let's try to evaluate it as a JavaScript function as a fallback
+                        try:
+                            return await self.csp_compliant_wait(page, f"() => {{{wait_for}}}", timeout)
+                        except Error:
+                            raise ValueError(f"Invalid wait_for parameter: '{wait_for}'. "
+                                             "It should be either a valid CSS selector, a JavaScript function, "
+                                             "or explicitly prefixed with 'js:' or 'css:'.")
+    
+    async def csp_compliant_wait(self, page: Page, user_wait_function: str, timeout: float = 30000):
+        wrapper_js = f"""
+        async () => {{
+            const userFunction = {user_wait_function};
+            const startTime = Date.now();
+            while (true) {{
+                if (await userFunction()) {{
+                    return true;
+                }}
+                if (Date.now() - startTime > {timeout}) {{
+                    throw new Error('Timeout waiting for condition');
+                }}
+                await new Promise(resolve => setTimeout(resolve, 100));
+            }}
+        }}
+        """
+        
+        try:
+            await page.evaluate(wrapper_js)
+        except TimeoutError:
+            raise TimeoutError(f"Timeout after {timeout}ms waiting for condition")
+        except Exception as e:
+            raise RuntimeError(f"Error in wait condition: {str(e)}")
+
+    async def process_iframes(self, page):
+        # Find all iframes
+        iframes = await page.query_selector_all('iframe')
+        
+        for i, iframe in enumerate(iframes):
+            try:
+                # Add a unique identifier to the iframe
+                await iframe.evaluate(f'(element) => element.id = "iframe-{i}"')
+                
+                # Get the frame associated with this iframe
+                frame = await iframe.content_frame()
+                
+                if frame:
+                    # Wait for the frame to load
+                    await frame.wait_for_load_state('load', timeout=30000)  # 30 seconds timeout
+                    
+                    # Extract the content of the iframe's body
+                    iframe_content = await frame.evaluate('() => document.body.innerHTML')
+                    
+                    # Generate a unique class name for this iframe
+                    class_name = f'extracted-iframe-content-{i}'
+                    
+                    # Replace the iframe with a div containing the extracted content
+                    _iframe = iframe_content.replace('`', '\\`')
+                    await page.evaluate(f"""
+                        () => {{
+                            const iframe = document.getElementById('iframe-{i}');
+                            const div = document.createElement('div');
+                            div.innerHTML = `{_iframe}`;
+                            div.className = '{class_name}';
+                            iframe.replaceWith(div);
+                        }}
+                    """)
+                else:
+                    print(f"Warning: Could not access content frame for iframe {i}")
+            except Exception as e:
+                print(f"Error processing iframe {i}: {str(e)}")
+
+        # Return the page object
+        return page  
+    
+    async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
+        response_headers = {}
+        status_code = None
+        
+        self._cleanup_expired_sessions()
+        session_id = kwargs.get("session_id")
+        if session_id:
+            context, page, _ = self.sessions.get(session_id, (None, None, None))
+            if not context:
+                context = await self.browser.new_context(
+                    user_agent=self.user_agent,
+                    viewport={"width": 1920, "height": 1080},
+                    proxy={"server": self.proxy} if self.proxy else None
+                )
+                await context.set_extra_http_headers(self.headers)
+                page = await context.new_page()
+                self.sessions[session_id] = (context, page, time.time())
+        else:
+            context = await self.browser.new_context(
+                user_agent=self.user_agent,
+                viewport={"width": 1920, "height": 1080},
+                proxy={"server": self.proxy} if self.proxy else None
+            )
+            await context.set_extra_http_headers(self.headers)
+            
+            if kwargs.get("override_navigator", False):
+                # Inject scripts to override navigator properties
+                await context.add_init_script("""
+                    // Pass the Permissions Test.
+                    const originalQuery = window.navigator.permissions.query;
+                    window.navigator.permissions.query = (parameters) => (
+                        parameters.name === 'notifications' ?
+                            Promise.resolve({ state: Notification.permission }) :
+                            originalQuery(parameters)
+                    );
+                    Object.defineProperty(navigator, 'webdriver', {
+                        get: () => undefined
+                    });
+                    window.navigator.chrome = {
+                        runtime: {},
+                        // Add other properties if necessary
+                    };
+                    Object.defineProperty(navigator, 'plugins', {
+                        get: () => [1, 2, 3, 4, 5],
+                    });
+                    Object.defineProperty(navigator, 'languages', {
+                        get: () => ['en-US', 'en'],
+                    });
+                    Object.defineProperty(document, 'hidden', {
+                        get: () => false
+                    });
+                    Object.defineProperty(document, 'visibilityState', {
+                        get: () => 'visible'
+                    });
+                """)
+            
+            page = await context.new_page()
+
+        try:
+            if self.verbose:
+                print(f"[LOG] 🕸️ Crawling {url} using AsyncPlaywrightCrawlerStrategy...")
+
+            if self.use_cached_html:
+                cache_file_path = os.path.join(
+                    Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest()
+                )
+                if os.path.exists(cache_file_path):
+                    html = ""
+                    with open(cache_file_path, "r") as f:
+                        html = f.read()
+                    # retrieve response headers and status code from cache
+                    with open(cache_file_path + ".meta", "r") as f:
+                        meta = json.load(f)
+                        response_headers = meta.get("response_headers", {})
+                        status_code = meta.get("status_code")
+                    response = AsyncCrawlResponse(
+                        html=html, response_headers=response_headers, status_code=status_code
+                    )
+                    return response
+
+            if not kwargs.get("js_only", False):
+                await self.execute_hook('before_goto', page)
+                
+                response = await page.goto("about:blank")
+                await stealth_async(page)
+                response = await page.goto(
+                    url, wait_until="domcontentloaded", timeout=kwargs.get("page_timeout", 60000)
+                )
+                
+                # await stealth_async(page)
+                # response = await page.goto("about:blank")
+                # await stealth_async(page)
+                # await page.evaluate(f"window.location.href = '{url}'")
+                
+                await self.execute_hook('after_goto', page)
+                
+                # Get status code and headers
+                status_code = response.status
+                response_headers = response.headers
+            else:
+                status_code = 200
+                response_headers = {}
+
+            await page.wait_for_selector('body')
+            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
+
+            js_code = kwargs.get("js_code", kwargs.get("js", self.js_code))
+            if js_code:
+                if isinstance(js_code, str):
+                    await page.evaluate(js_code)
+                elif isinstance(js_code, list):
+                    for js in js_code:
+                        await page.evaluate(js)
+                
+                await page.wait_for_load_state('networkidle')
+                # Check for on execution event
+                await self.execute_hook('on_execution_started', page)
+                
+            if kwargs.get("simulate_user", False):
+                # Simulate user interactions
+                await page.mouse.move(100, 100)
+                await page.mouse.down()
+                await page.mouse.up()
+                await page.keyboard.press('ArrowDown')
+
+            # Handle the wait_for parameter
+            wait_for = kwargs.get("wait_for")
+            if wait_for:
+                try:
+                    await self.smart_wait(page, wait_for, timeout=kwargs.get("page_timeout", 60000))
+                except Exception as e:
+                    raise RuntimeError(f"Wait condition failed: {str(e)}")
+
+
+            
+            # Update image dimensions
+            update_image_dimensions_js = """
+            () => {
+                return new Promise((resolve) => {
+                    const filterImage = (img) => {
+                        // Filter out images that are too small
+                        if (img.width < 100 && img.height < 100) return false;
+                        
+                        // Filter out images that are not visible
+                        const rect = img.getBoundingClientRect();
+                        if (rect.width === 0 || rect.height === 0) return false;
+                        
+                        // Filter out images with certain class names (e.g., icons, thumbnails)
+                        if (img.classList.contains('icon') || img.classList.contains('thumbnail')) return false;
+                        
+                        // Filter out images with certain patterns in their src (e.g., placeholder images)
+                        if (img.src.includes('placeholder') || img.src.includes('icon')) return false;
+                        
+                        return true;
+                    };
+
+                    const images = Array.from(document.querySelectorAll('img')).filter(filterImage);
+                    let imagesLeft = images.length;
+                    
+                    if (imagesLeft === 0) {
+                        resolve();
+                        return;
+                    }
+
+                    const checkImage = (img) => {
+                        if (img.complete && img.naturalWidth !== 0) {
+                            img.setAttribute('width', img.naturalWidth);
+                            img.setAttribute('height', img.naturalHeight);
+                            imagesLeft--;
+                            if (imagesLeft === 0) resolve();
+                        }
+                    };
+
+                    images.forEach(img => {
+                        checkImage(img);
+                        if (!img.complete) {
+                            img.onload = () => {
+                                checkImage(img);
+                            };
+                            img.onerror = () => {
+                                imagesLeft--;
+                                if (imagesLeft === 0) resolve();
+                            };
+                        }
+                    });
+
+                    // Fallback timeout of 5 seconds
+                    setTimeout(() => resolve(), 5000);
+                });
+            }
+            """
+            await page.evaluate(update_image_dimensions_js)
+
+            # Wait a bit for any onload events to complete
+            await page.wait_for_timeout(100)
+
+            # Process iframes
+            if kwargs.get("process_iframes", False):
+                page = await self.process_iframes(page)
+            
+            await self.execute_hook('before_retrieve_html', page)
+            # Check if delay_before_return_html is set then wait for that time
+            delay_before_return_html = kwargs.get("delay_before_return_html")
+            if delay_before_return_html:
+                await asyncio.sleep(delay_before_return_html)
+                
+            html = await page.content()
+            await self.execute_hook('before_return_html', page, html)
+            
+            # Check if kwargs has screenshot=True then take screenshot
+            screenshot_data = None
+            if kwargs.get("screenshot"):
+                screenshot_data = await self.take_screenshot(url)            
+
+            if self.verbose:
+                print(f"[LOG] ✅ Crawled {url} successfully!")
+
+            if self.use_cached_html:
+                cache_file_path = os.path.join(
+                    Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest()
+                )
+                with open(cache_file_path, "w", encoding="utf-8") as f:
+                    f.write(html)
+                # store response headers and status code in cache
+                with open(cache_file_path + ".meta", "w", encoding="utf-8") as f:
+                    json.dump({
+                        "response_headers": response_headers,
+                        "status_code": status_code
+                    }, f)
+
+            async def get_delayed_content(delay: float = 5.0) -> str:
+                if self.verbose:
+                    print(f"[LOG] Waiting for {delay} seconds before retrieving content for {url}")
+                await asyncio.sleep(delay)
+                return await page.content()
+                
+            response = AsyncCrawlResponse(
+                html=html, 
+                response_headers=response_headers, 
+                status_code=status_code,
+                screenshot=screenshot_data,
+                get_delayed_content=get_delayed_content
+            )
+            return response
+        except Error as e:
+            raise Error(f"Failed to crawl {url}: {str(e)}")
+        finally:
+            if not session_id:
+                await page.close()
+                await context.close()
+
+    async def crawl_many(self, urls: List[str], **kwargs) -> List[AsyncCrawlResponse]:
+        semaphore_count = kwargs.get('semaphore_count', 5)  # Adjust as needed
+        semaphore = asyncio.Semaphore(semaphore_count)
+
+        async def crawl_with_semaphore(url):
+            async with semaphore:
+                return await self.crawl(url, **kwargs)
+
+        tasks = [crawl_with_semaphore(url) for url in urls]
+        results = await asyncio.gather(*tasks, return_exceptions=True)
+        return [result if not isinstance(result, Exception) else str(result) for result in results]
+
+    async def take_screenshot(self, url: str, wait_time=1000) -> str:
+        async with await self.browser.new_context(user_agent=self.user_agent) as context:
+            page = await context.new_page()
+            try:
+                await page.goto(url, wait_until="domcontentloaded", timeout=30000)
+                # Wait for a specified time (default is 1 second)
+                await page.wait_for_timeout(wait_time)
+                screenshot = await page.screenshot(full_page=True)
+                return base64.b64encode(screenshot).decode('utf-8')
+            except Exception as e:
+                error_message = f"Failed to take screenshot: {str(e)}"
+                print(error_message)
+
+                # Generate an error image
+                img = Image.new('RGB', (800, 600), color='black')
+                draw = ImageDraw.Draw(img)
+                font = ImageFont.load_default()
+                draw.text((10, 10), error_message, fill=(255, 255, 255), font=font)
+                
+                buffered = BytesIO()
+                img.save(buffered, format="JPEG")
+                return base64.b64encode(buffered.getvalue()).decode('utf-8')
+            finally:
+                await page.close()
+
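
The dispatch rules in `smart_wait` above can be summarised as a pure function. This is a simplified sketch rather than the strategy's code: `classify_wait_for` is a hypothetical helper, and it only mirrors the prefix and auto-detection checks, leaving out the fallback that retries an invalid selector as JavaScript.

```python
from typing import Tuple


def classify_wait_for(wait_for: str) -> Tuple[str, str]:
    """Return (kind, expression), where kind is 'js' or 'css'."""
    wait_for = wait_for.strip()
    # Explicit prefixes take priority over auto-detection.
    if wait_for.startswith("js:"):
        return "js", wait_for[3:].strip()
    if wait_for.startswith("css:"):
        return "css", wait_for[4:].strip()
    # No prefix: treat arrow functions and `function` expressions as JavaScript.
    if wait_for.startswith("()") or wait_for.startswith("function"):
        return "js", wait_for
    # Otherwise assume a CSS selector.
    return "css", wait_for


if __name__ == "__main__":
    for value in ["css:div.article", "js:() => window.ready", "() => true", "#main"]:
        print(value, "->", classify_wait_for(value))
```

Using an explicit `js:` or `css:` prefix avoids relying on the auto-detection heuristic for ambiguous inputs.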
--- a/crawl4ai/async_crawler_strategy.py
+++ b/crawl4ai/async_crawler_strategy.py
@@ -1,17 +1,35 @@
 import asyncio
-import base64, time
+import base64
+import time
 from abc import ABC, abstractmethod
 from typing import Callable, Dict, Any, List, Optional, Awaitable
 import os
 from playwright.async_api import async_playwright, Page, Browser, Error
 from io import BytesIO
 from PIL import Image, ImageDraw, ImageFont
-from .utils import sanitize_input_encode, calculate_semaphore_count
-import json, uuid
-import hashlib
 from pathlib import Path
 from playwright.async_api import ProxySettings
 from pydantic import BaseModel
+import hashlib
+import json
+import uuid
+from playwright_stealth import StealthConfig, stealth_async
+
+stealth_config = StealthConfig(
+    webdriver=True,
+    chrome_app=True,
+    chrome_csi=True,
+    chrome_load_times=True,
+    chrome_runtime=True,
+    navigator_languages=True,
+    navigator_plugins=True,
+    navigator_permissions=True,
+    webgl_vendor=True,
+    outerdimensions=True,
+    navigator_hardware_concurrency=True,
+    media_codecs=True,
+)
+

 class AsyncCrawlResponse(BaseModel):
    html: str
@@ -47,10 +65,14 @@ class AsyncCrawlerStrategy(ABC):
 class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
    def __init__(self, use_cached_html=False, js_code=None, **kwargs):
        self.use_cached_html = use_cached_html
-        self.user_agent = kwargs.get("user_agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
+        self.user_agent = kwargs.get(
+            "user_agent",
+            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
+            "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
+        )
        self.proxy = kwargs.get("proxy")
        self.headless = kwargs.get("headless", True)
-        self.browser_type = kwargs.get("browser_type", "chromium")  # New parameter
+        self.browser_type = kwargs.get("browser_type", "chromium")
        self.headers = kwargs.get("headers", {})
        self.sessions = {}
        self.session_ttl = 1800 
@@ -58,6 +80,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
        self.verbose = kwargs.get("verbose", False)
        self.playwright = None
        self.browser = None
+        self.sleep_on_close = kwargs.get("sleep_on_close", False)
        self.hooks = {
            'on_browser_created': None,
            'on_user_agent_updated': None,
@@ -83,9 +106,14 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                "headless": self.headless,
                "args": [
                    "--disable-gpu",
-                    "--disable-dev-shm-usage",
-                    "--disable-setuid-sandbox",
                    "--no-sandbox",
+                    "--disable-dev-shm-usage",
+                    "--disable-blink-features=AutomationControlled",
+                    "--disable-infobars",
+                    "--window-position=0,0",
+                    "--ignore-certificate-errors",
+                    "--ignore-certificate-errors-spki-list",
+                    # "--headless=new",  # Use the new headless mode
                ]
            }
            
@@ -94,7 +122,6 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                proxy_settings = ProxySettings(server=self.proxy)
                browser_args["proxy"] = proxy_settings
                
-                
            # Select the appropriate browser based on the browser_type
            if self.browser_type == "firefox":
                self.browser = await self.playwright.firefox.launch(**browser_args)
@@ -106,6 +133,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
            await self.execute_hook('on_browser_created', self.browser)

    async def close(self):
+        if self.sleep_on_close:
+            await asyncio.sleep(500)  # asyncio.sleep() takes seconds, so this waits ~8 minutes before closing
        if self.browser:
            await self.browser.close()
            self.browser = None
@@ -147,8 +176,10 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):

    def _cleanup_expired_sessions(self):
        current_time = time.time()
-        expired_sessions = [sid for sid, (_, _, last_used) in self.sessions.items() 
-                            if current_time - last_used > self.session_ttl]
+        expired_sessions = [
+            sid for sid, (_, _, last_used) in self.sessions.items() 
+            if current_time - last_used > self.session_ttl
+        ]
        for sid in expired_sessions:
            asyncio.create_task(self.kill_session(sid))
            
@@ -188,8 +219,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                            return await self.csp_compliant_wait(page, f"() => {{{wait_for}}}", timeout)
                        except Error:
                            raise ValueError(f"Invalid wait_for parameter: '{wait_for}'. "
-                                            "It should be either a valid CSS selector, a JavaScript function, "
-                                            "or explicitly prefixed with 'js:' or 'css:'.")
+                                             "It should be either a valid CSS selector, a JavaScript function, "
+                                             "or explicitly prefixed with 'js:' or 'css:'.")
    
    async def csp_compliant_wait(self, page: Page, user_wait_function: str, timeout: float = 30000):
        wrapper_js = f"""
@@ -254,8 +285,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                print(f"Error processing iframe {i}: {str(e)}")

        # Return the page object
-        return page
-    
+        return page
    
    async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
        response_headers = {}
@@ -268,25 +298,70 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
            if not context:
                context = await self.browser.new_context(
                    user_agent=self.user_agent,
-                    proxy={"server": self.proxy} if self.proxy else None
+                    viewport={"width": 1920, "height": 1080},
+                    proxy={"server": self.proxy} if self.proxy else None,
+                    accept_downloads=True,
+                    java_script_enabled=True
                )
+                await context.add_cookies([{"name": "cookiesEnabled", "value": "true", "url": url}])
                await context.set_extra_http_headers(self.headers)
                page = await context.new_page()
                self.sessions[session_id] = (context, page, time.time())
        else:
            context = await self.browser.new_context(
-                    user_agent=self.user_agent,
-                    proxy={"server": self.proxy} if self.proxy else None
+                user_agent=self.user_agent,
+                viewport={"width": 1920, "height": 1080},
+                proxy={"server": self.proxy} if self.proxy else None
            )
            await context.set_extra_http_headers(self.headers)
+            
+            if kwargs.get("override_navigator", False) or kwargs.get("simulate_user", False) or kwargs.get("magic", False):
+                # Inject scripts to override navigator properties
+                await context.add_init_script("""
+                    // Pass the Permissions Test.
+                    const originalQuery = window.navigator.permissions.query;
+                    window.navigator.permissions.query = (parameters) => (
+                        parameters.name === 'notifications' ?
+                            Promise.resolve({ state: Notification.permission }) :
+                            originalQuery(parameters)
+                    );
+                    Object.defineProperty(navigator, 'webdriver', {
+                        get: () => undefined
+                    });
+                    window.navigator.chrome = {
+                        runtime: {},
+                        // Add other properties if necessary
+                    };
+                    Object.defineProperty(navigator, 'plugins', {
+                        get: () => [1, 2, 3, 4, 5],
+                    });
+                    Object.defineProperty(navigator, 'languages', {
+                        get: () => ['en-US', 'en'],
+                    });
+                    Object.defineProperty(document, 'hidden', {
+                        get: () => false
+                    });
+                    Object.defineProperty(document, 'visibilityState', {
+                        get: () => 'visible'
+                    });
+                """)
+            
            page = await context.new_page()
+            # await stealth_async(page) #, stealth_config)

+        # Add console message and error logging
+        if kwargs.get("log_console", False):
+            page.on("console", lambda msg: print(f"Console: {msg.text}"))
+            page.on("pageerror", lambda exc: print(f"Page Error: {exc}"))
+        
        try:
            if self.verbose:
                print(f"[LOG] 🕸️ Crawling {url} using AsyncPlaywrightCrawlerStrategy...")

            if self.use_cached_html:
-                cache_file_path = os.path.join(Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest())
+                cache_file_path = os.path.join(
+                    Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest()
+                )
                if os.path.exists(cache_file_path):
                    html = ""
                    with open(cache_file_path, "r") as f:
@@ -296,12 +371,21 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                        meta = json.load(f)
                        response_headers = meta.get("response_headers", {})
                        status_code = meta.get("status_code")
-                    response = AsyncCrawlResponse(html=html, response_headers=response_headers, status_code=status_code)
+                    response = AsyncCrawlResponse(
+                        html=html, response_headers=response_headers, status_code=status_code
+                    )
                    return response

            if not kwargs.get("js_only", False):
                await self.execute_hook('before_goto', page)
-                response = await page.goto(url, wait_until="domcontentloaded", timeout=kwargs.get("page_timeout", 60000))
+                
+                response = await page.goto(
+                    url, wait_until="domcontentloaded", timeout=kwargs.get("page_timeout", 60000)
+                )
+                
+                # response = await page.goto("about:blank")
+                # await page.evaluate(f"window.location.href = '{url}'")
+                
                await self.execute_hook('after_goto', page)
                
                # Get status code and headers
@@ -311,37 +395,29 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                status_code = 200
                response_headers = {}

-
            await page.wait_for_selector('body')
            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")

            js_code = kwargs.get("js_code", kwargs.get("js", self.js_code))
            if js_code:
                if isinstance(js_code, str):
-                    r = await page.evaluate(js_code)
+                    await page.evaluate(js_code)
                elif isinstance(js_code, list):
                    for js in js_code:
                        await page.evaluate(js)
                
-                # await page.wait_for_timeout(100)
                await page.wait_for_load_state('networkidle')
-                # Check for on execution even
+                # Check for on execution event
                await self.execute_hook('on_execution_started', page)
                
-            # New code to handle the wait_for parameter
-            # Example usage:
-            # await crawler.crawl(
-            #     url,
-            #     js_code="// some JavaScript code",
-            #     wait_for="""() => {
-            #         return document.querySelector('#my-element') !== null;
-            #     }"""
-            # )
-            # Example of using a CSS selector:
-            # await crawler.crawl(
-            #     url,
-            #     wait_for="#my-element"
-            # )
+            if kwargs.get("simulate_user", False) or kwargs.get("magic", False):
+                # Simulate user interactions
+                await page.mouse.move(100, 100)
+                await page.mouse.down()
+                await page.mouse.up()
+                await page.keyboard.press('ArrowDown')
+
+            # Handle the wait_for parameter
            wait_for = kwargs.get("wait_for")
            if wait_for:
                try:
@@ -349,13 +425,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                except Exception as e:
                    raise RuntimeError(f"Wait condition failed: {str(e)}")

-            # Check if kwargs has screenshot=True then take screenshot
-            screenshot_data = None
-            if kwargs.get("screenshot"):
-                screenshot_data = await self.take_screenshot(url)
-            
-            
-            # New code to update image dimensions
+            # Update image dimensions
            update_image_dimensions_js = """
            () => {
                return new Promise((resolve) => {
@@ -428,12 +498,19 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                
            html = await page.content()
            await self.execute_hook('before_return_html', page, html)
+            
+            # Take a screenshot when a truthy `screenshot` kwarg is passed
+            screenshot_data = None
+            if kwargs.get("screenshot"):
+                screenshot_data = await self.take_screenshot(url)

            if self.verbose:
                print(f"[LOG] ✅ Crawled {url} successfully!")

            if self.use_cached_html:
-                cache_file_path = os.path.join(Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest())
+                cache_file_path = os.path.join(
+                    Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest()
+                )
                with open(cache_file_path, "w", encoding="utf-8") as f:
                    f.write(html)
                # store response headers and status code in cache
@@ -443,7 +520,6 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                        "status_code": status_code
                    }, f)

-            
            async def get_delayed_content(delay: float = 5.0) -> str:
                if self.verbose:
                    print(f"[LOG] Waiting for {delay} seconds before retrieving content for {url}")
@@ -459,63 +535,14 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
            )
            return response
        except Error as e:
-            raise Error(f"Failed to crawl {url}: {str(e)}")
-        finally:
-            if not session_id:
-                await page.close()
+            raise Error(f"[ERROR] 🚫 crawl(): Failed to crawl {url}: {str(e)}")
+        # finally:
+        #     if not session_id:
+        #         await page.close()
+        #         await context.close()

-        # try:
-        #     html = await _crawl()
-        #     return sanitize_input_encode(html)
-        # except Error as e:
-        #     raise Error(f"Failed to crawl {url}: {str(e)}")
-        # except Exception as e:
-        #     raise Exception(f"Failed to crawl {url}: {str(e)}")
-
-    async def execute_js(self, session_id: str, js_code: str, wait_for_js: str = None, wait_for_css: str = None) -> AsyncCrawlResponse:
-        """
-        Execute JavaScript code in a specific session and optionally wait for a condition.
-        
-        :param session_id: The ID of the session to execute the JS code in.
-        :param js_code: The JavaScript code to execute.
-        :param wait_for_js: JavaScript condition to wait for after execution.
-        :param wait_for_css: CSS selector to wait for after execution.
-        :return: AsyncCrawlResponse containing the page's HTML and other information.
-        :raises ValueError: If the session does not exist.
-        """
-        if not session_id:
-            raise ValueError("Session ID must be provided")
-        
-        if session_id not in self.sessions:
-            raise ValueError(f"No active session found for session ID: {session_id}")
-        
-        context, page, last_used = self.sessions[session_id]
-        
-        try:
-            await page.evaluate(js_code)
-            
-            if wait_for_js:
-                await page.wait_for_function(wait_for_js)
-            
-            if wait_for_css:
-                await page.wait_for_selector(wait_for_css)
-            
-            # Get the updated HTML content
-            html = await page.content()
-            
-            # Get response headers and status code (assuming these are available)
-            response_headers = await page.evaluate("() => JSON.stringify(performance.getEntriesByType('resource')[0].responseHeaders)")
-            status_code = await page.evaluate("() => performance.getEntriesByType('resource')[0].responseStatus")
-            
-            # Update the last used time for this session
-            self.sessions[session_id] = (context, page, time.time())
-            
-            return AsyncCrawlResponse(html=html, response_headers=response_headers, status_code=status_code)
-        except Error as e:
-            raise Error(f"Failed to execute JavaScript or wait for condition in session {session_id}: {str(e)}")
-    
    async def crawl_many(self, urls: List[str], **kwargs) -> List[AsyncCrawlResponse]:
-        semaphore_count = kwargs.get('semaphore_count', calculate_semaphore_count())
+        semaphore_count = kwargs.get('semaphore_count', 5)  # Adjust as needed
        semaphore = asyncio.Semaphore(semaphore_count)

        async def crawl_with_semaphore(url):
@@ -526,7 +553,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return [result if not isinstance(result, Exception) else str(result) for result in results]

-    async def take_screenshot(self, url: str, wait_time = 1000) -> str:
+    async def take_screenshot(self, url: str, wait_time=1000) -> str:
        async with await self.browser.new_context(user_agent=self.user_agent) as context:
            page = await context.new_page()
            try:
@@ -549,4 +576,5 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                img.save(buffered, format="JPEG")
                return base64.b64encode(buffered.getvalue()).decode('utf-8')
            finally:
-                await page.close()
+                await page.close()
+
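The `crawl_many` change above replaces a computed semaphore size with a fixed default of 5. As a minimal sketch of that bounded-concurrency pattern, the block below limits how many coroutines run at once and collects failures as strings, as `crawl_many` does. The names `run_bounded`, `fake_fetch`, and `demo` are hypothetical and stand in for the real crawl call; they are not part of crawl4ai.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple


async def run_bounded(
    urls: List[str],
    fetch: Callable[[str], Awaitable[str]],
    limit: int = 5,
) -> List[str]:
    """Call fetch(url) for every URL with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url: str) -> str:
        # Each task waits here until one of the `limit` slots is free.
        async with semaphore:
            return await fetch(url)

    # return_exceptions=True keeps one failure from aborting the whole batch;
    # results come back in the same order as `urls`.
    results = await asyncio.gather(
        *(guarded(u) for u in urls), return_exceptions=True
    )
    return [r if not isinstance(r, Exception) else str(r) for r in results]


async def demo() -> Tuple[List[str], int]:
    """Run five fake fetches with limit=2 and record the peak concurrency."""
    active = 0
    peak = 0

    async def fake_fetch(url: str) -> str:  # hypothetical stand-in for crawl()
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)
        active -= 1
        if "bad" in url:
            raise ValueError(f"cannot fetch {url}")
        return f"<html>{url}</html>"

    urls = [f"https://example.com/{i}" for i in range(4)] + ["https://example.com/bad"]
    results = await run_bounded(urls, fake_fetch, limit=2)
    return results, peak


if __name__ == "__main__":
    results, peak = asyncio.run(demo())
    print(peak, results)
```

A fixed default is simpler to reason about than a value derived from system resources, at the cost of not adapting to the host; callers can still pass `semaphore_count` to override it.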
--- a/crawl4ai/async_webcrawler.py
+++ b/crawl4ai/async_webcrawler.py
@@ -133,8 +133,8 @@ class AsyncWebCrawler:
        except Exception as e:
            if not hasattr(e, "msg"):
                e.msg = str(e)
-            print(f"[ERROR] 🚫 Failed to crawl {url}, error: {e.msg}")
-            return CrawlResult(url=url, html="", success=False, error_message=e.msg)
+            print(f"[ERROR] 🚫 arun(): Failed to crawl {url}, error: {e.msg}")
+            return CrawlResult(url=url, html="", markdown=f"[ERROR] 🚫 arun(): Failed to crawl {url}, error: {e.msg}", success=False, error_message=e.msg)

    async def arun_many(
        self,
@@ -195,6 +195,7 @@ class AsyncWebCrawler:
                image_description_min_word_threshold=kwargs.get(
                    "image_description_min_word_threshold", IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD
                ),
+                **kwargs,
            )
            if verbose:
                print(
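The `arun` change above returns a failed result whose `markdown` field carries the error text, instead of leaving it empty. A rough sketch of that "error as result" pattern follows. The `Result` dataclass and `safe_run` helper are hypothetical simplifications for illustration, not the library's `CrawlResult` or `arun`.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Result:  # hypothetical, simplified stand-in for CrawlResult
    url: str
    html: str = ""
    markdown: str = ""
    success: bool = True
    error_message: Optional[str] = None


def safe_run(url: str, crawl: Callable[[str], str]) -> Result:
    """Return a Result either way; failures are reported, never raised."""
    try:
        return Result(url=url, html=crawl(url))
    except Exception as e:
        msg = f"[ERROR] safe_run(): Failed to crawl {url}, error: {e}"
        # Put the message in markdown too, so callers that only read
        # markdown still see why the crawl produced no content.
        return Result(url=url, markdown=msg, success=False, error_message=str(e))


def always_fails(url: str) -> str:  # hypothetical crawl function
    raise TimeoutError("page did not load")


if __name__ == "__main__":
    print(safe_run("https://example.com", always_fails))
```

Callers should still check `success` before using `markdown`, since on failure the field holds a diagnostic message and not page content.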
--- a/crawl4ai/content_scrapping_strategy.py
+++ b/crawl4ai/content_scrapping_strategy.py
@@ -33,6 +33,7 @@ class WebScrappingStrategy(ContentScrappingStrategy):
        return await asyncio.to_thread(self._get_content_of_website_optimized, url, html, **kwargs)

    def _get_content_of_website_optimized(self, url: str, html: str, word_count_threshold: int = MIN_WORD_THRESHOLD, css_selector: str = None, **kwargs) -> Dict[str, Any]:
+        success = True
        if not html:
            return None

@@ -273,10 +274,41 @@ class WebScrappingStrategy(ContentScrappingStrategy):
            if base64_pattern.match(src):
                # Replace base64 data with empty string
                img['src'] = base64_pattern.sub('', src)
+                
+        try:
+            str(body)
+        except Exception as e:
+            # Reset body to the original HTML
+            success = False
+            body = BeautifulSoup(html, 'html.parser')
+            
+            # Create a new div with a special ID
+            error_div = body.new_tag('div', id='crawl4ai_error_message')
+            error_div.string = '''
+            Crawl4AI Error: This page is not fully supported.
+            
+            Possible reasons:
+            1. The page may have restrictions that prevent crawling.
+            2. The page might not be fully loaded.
+            
+            Suggestions:
+            - Try calling the crawl function with this parameter:
+              magic=True
+            - Set headless=False to visualize what's happening on the page.
+            
+            If the issue persists, please check the page's structure and any potential anti-crawling measures.
+            '''
+            
+            # Append the error div to the body
+            body.body.append(error_div)
+            
+            print("[LOG] 😧 Error: After processing the crawled HTML and removing irrelevant tags, nothing was left in the page. Check the markdown for further details.")
+
+
        cleaned_html = str(body).replace('\n\n', '\n').replace('  ', ' ')

        h = CustomHTML2Text()
-        h.ignore_links = True
+        h.ignore_links = not kwargs.get('include_links_on_markdown', False)
        h.body_width = 0
        try:
            markdown = h.handle(cleaned_html)
@@ -294,7 +326,7 @@ class WebScrappingStrategy(ContentScrappingStrategy):
        return {
            'markdown': markdown,
            'cleaned_html': cleaned_html,
-            'success': True,
+            'success': success,
            'media': media,
            'links': links,
            'metadata': meta
--- a/crawl4ai/scraper/init.py
+++ b/crawl4ai/scraper/init.py
@@ -1,2 +0,0 @@
-from .async_web_scraper import AsyncWebScraper
-from .bfs_scraper_strategy import BFSScraperStrategy
--- a/crawl4ai/scraper/async_web_scraper.py
+++ b/crawl4ai/scraper/async_web_scraper.py
@@ -1,33 +0,0 @@
-from .scraper_strategy import ScraperStrategy
-from .models import ScraperResult, CrawlResult
-from ..async_webcrawler import AsyncWebCrawler
-from typing import Union, AsyncGenerator
-
-class AsyncWebScraper:
-    def __init__(self, crawler: AsyncWebCrawler, strategy: ScraperStrategy):
-        self.crawler = crawler
-        self.strategy = strategy
-
-    async def ascrape(self, url: str, parallel_processing: bool = True, stream: bool = False) -> Union[AsyncGenerator[CrawlResult, None], ScraperResult]:
-        if stream:
-            return self._ascrape_yielding(url, parallel_processing)
-        else:
-            return await self._ascrape_collecting(url, parallel_processing)
-
-    async def _ascrape_yielding(self, url: str, parallel_processing: bool) -> AsyncGenerator[CrawlResult, None]:
-        result_generator = self.strategy.ascrape(url, self.crawler, parallel_processing)
-        async for res in result_generator:  # Consume the async generator
-            yield res  # Yielding individual results
-
-    async def _ascrape_collecting(self, url: str, parallel_processing: bool) -> ScraperResult:
-        extracted_data = {}
-        result_generator = self.strategy.ascrape(url, self.crawler, parallel_processing)
-        async for res in result_generator:  # Consume the async generator
-            extracted_data[res.url] = res
-
-        # Return a final ScraperResult
-        return ScraperResult(
-            url=url,
-            crawled_urls=list(extracted_data.keys()),
-            extracted_data=extracted_data
-        )
--- a/crawl4ai/scraper/bfs_scraper_strategy.py
+++ b/crawl4ai/scraper/bfs_scraper_strategy.py
@@ -1,139 +0,0 @@
-from .scraper_strategy import ScraperStrategy
-from .filters import FilterChain
-from .scorers import URLScorer
-from ..models import CrawlResult
-from ..async_webcrawler import AsyncWebCrawler
-import asyncio
-import validators
-from urllib.parse import urljoin,urlparse,urlunparse
-from urllib.robotparser import RobotFileParser
-import time
-from aiolimiter import AsyncLimiter
-from tenacity import retry, stop_after_attempt, wait_exponential
-from collections import defaultdict
-import logging
-from typing import Dict, AsyncGenerator
-logging.basicConfig(level=logging.DEBUG)
-
-rate_limiter = AsyncLimiter(1, 1)  # 1 request per second
-
-class BFSScraperStrategy(ScraperStrategy):
-    def __init__(self, max_depth: int, filter_chain: FilterChain, url_scorer: URLScorer, max_concurrent: int = 5, min_crawl_delay: int=1):
-        self.max_depth = max_depth
-        self.filter_chain = filter_chain
-        self.url_scorer = url_scorer
-        self.max_concurrent = max_concurrent
-        # For Crawl Politeness
-        self.last_crawl_time = defaultdict(float)
-        self.min_crawl_delay = min_crawl_delay  # 1 second delay between requests to the same domain
-        # For Robots.txt Compliance
-        self.robot_parsers = {}
-
-    # Robots.txt Parser
-    def get_robot_parser(self, url: str) -> RobotFileParser:
-        domain = urlparse(url)
-        scheme = domain.scheme if domain.scheme else 'http'  # Default to 'http' if no scheme provided
-        netloc = domain.netloc
-        if netloc not in self.robot_parsers:
-            rp = RobotFileParser()
-            rp.set_url(f"{scheme}://{netloc}/robots.txt")
-            try:
-                rp.read()
-            except Exception as e:
-                # Log the type of error, message, and the URL
-                logging.warning(f"Error {type(e).__name__} occurred while fetching robots.txt for {netloc}: {e}")
-                return None
-            self.robot_parsers[netloc] = rp
-        return self.robot_parsers[netloc]
-
-    
-    # Retry with exponential backoff
-    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
-    async def retry_crawl(self, crawler: AsyncWebCrawler, url: str) -> CrawlResult:
-        return await crawler.arun(url)
-    
-    async def process_url(self, url: str, depth: int, crawler: AsyncWebCrawler, queue: asyncio.PriorityQueue, visited: set, depths: Dict[str, int]) -> AsyncGenerator[CrawlResult, None]:
-        def normalize_url(url: str) -> str:
-            parsed = urlparse(url)
-            return urlunparse(parsed._replace(fragment=""))
-        
-        # URL Validation
-        if not validators.url(url):
-            logging.warning(f"Invalid URL: {url}")
-            return None
-        
-        # Robots.txt Compliance
-        robot_parser = self.get_robot_parser(url)
-        if robot_parser is None:
-            logging.info(f"Could not retrieve robots.txt for {url}, hence proceeding with crawl.")
-        else:
-            # If robots.txt was fetched, check if crawling is allowed
-            if not robot_parser.can_fetch(crawler.crawler_strategy.user_agent, url):
-                logging.info(f"Skipping {url} as per robots.txt")
-                return None
-    
-        # Crawl Politeness
-        domain = urlparse(url).netloc
-        time_since_last_crawl = time.time() - self.last_crawl_time[domain]
-        if time_since_last_crawl < self.min_crawl_delay:
-            await asyncio.sleep(self.min_crawl_delay - time_since_last_crawl)
-        self.last_crawl_time[domain] = time.time()
-
-        # Rate Limiting
-        async with rate_limiter:
-            # Error Handling
-            try:
-                crawl_result = await self.retry_crawl(crawler, url)
-            except Exception as e:
-                logging.error(f"Error crawling {url}: {str(e)}")
-                crawl_result = CrawlResult(url=url, html="", success=False, status_code=0, error_message=str(e))
-        
-        if not crawl_result.success:
-            # Logging and Monitoring
-            logging.error(f"Failed to crawl URL: {url}. Error: {crawl_result.error_message}")
-            return crawl_result
-
-        # Process links
-        for link_type in ["internal", "external"]:
-            for link in crawl_result.links[link_type]:
-                absolute_link = urljoin(url, link['href'])
-                normalized_link = normalize_url(absolute_link)
-                if self.filter_chain.apply(normalized_link) and normalized_link not in visited:
-                    new_depth = depths[url] + 1
-                    if new_depth <= self.max_depth:
-                        # URL Scoring
-                        score = self.url_scorer.score(normalized_link)
-                        await queue.put((score, new_depth, normalized_link))
-                        depths[normalized_link] = new_depth
-        return crawl_result
-
-    async def ascrape(self, start_url: str, crawler: AsyncWebCrawler, parallel_processing:bool = True) -> AsyncGenerator[CrawlResult,None]:
-        queue = asyncio.PriorityQueue()
-        queue.put_nowait((0, 0, start_url))
-        visited = set()
-        depths = {start_url: 0}
-        pending_tasks = set()
-
-        while not queue.empty() or pending_tasks:
-            while not queue.empty() and len(pending_tasks) < self.max_concurrent:
-                _, depth, url = await queue.get()
-                if url not in visited:
-                    # Adding URL to the visited set here itself, (instead of after result generation)
-                    # so that other tasks are not queued for same URL, found at different depth before
-                    # crawling and extraction of this task is completed.
-                    visited.add(url)
-                    if parallel_processing:
-                        task = asyncio.create_task(self.process_url(url, depth, crawler, queue, visited, depths))
-                        pending_tasks.add(task)
-                    else:
-                        result = await self.process_url(url, depth, crawler, queue, visited, depths)
-                        if result:
-                            yield result 
-
-            # Wait for the first task to complete and yield results incrementally as each task is completed
-            if pending_tasks:
-                done, pending_tasks = await asyncio.wait(pending_tasks, return_when=asyncio.FIRST_COMPLETED)
-                for task in done:
-                    result = await task
-                    if result:
-                        yield result
--- a/crawl4ai/scraper/filters/init.py
+++ b/crawl4ai/scraper/filters/init.py
@@ -1,3 +0,0 @@
-from .url_filter import URLFilter, FilterChain
-from .content_type_filter import ContentTypeFilter
-from .url_pattern_filter import URLPatternFilter
--- a/crawl4ai/scraper/filters/content_type_filter.py
+++ b/crawl4ai/scraper/filters/content_type_filter.py
@@ -1,8 +0,0 @@
-from .url_filter import URLFilter
-
-class ContentTypeFilter(URLFilter):
-    def __init__(self, contentType: str):
-        self.contentType = contentType
-    def apply(self, url: str) -> bool:
-        #TODO: This is a stub. Will implement this later
-        return True
--- a/crawl4ai/scraper/filters/url_filter.py
+++ b/crawl4ai/scraper/filters/url_filter.py
@@ -1,16 +0,0 @@
-from abc import ABC, abstractmethod
-
-class URLFilter(ABC):
-    @abstractmethod
-    def apply(self, url: str) -> bool:
-        pass
-
-class FilterChain:
-    def __init__(self):
-        self.filters = []
-
-    def add_filter(self, filter: URLFilter):
-        self.filters.append(filter)
-
-    def apply(self, url: str) -> bool:
-        return all(filter.apply(url) for filter in self.filters)
--- a/crawl4ai/scraper/filters/url_pattern_filter.py
+++ b/crawl4ai/scraper/filters/url_pattern_filter.py
@@ -1,9 +0,0 @@
-from .url_filter import URLFilter
-from re import Pattern
-
-class URLPatternFilter(URLFilter):
-    def __init__(self, pattern: Pattern):
-        self.pattern = pattern
-    def apply(self, url: str) -> bool:
-        #TODO: This is a stub. Will implement this later.
-        return True
--- a/crawl4ai/scraper/models.py
+++ b/crawl4ai/scraper/models.py
@@ -1,8 +0,0 @@
-from pydantic import BaseModel
-from typing import List, Dict
-from ..models import CrawlResult
-
-class ScraperResult(BaseModel):
-    url: str
-    crawled_urls: List[str]
-    extracted_data: Dict[str,CrawlResult]
--- a/crawl4ai/scraper/scorers/init.py
+++ b/crawl4ai/scraper/scorers/init.py
@@ -1,2 +0,0 @@
-from .url_scorer import URLScorer
-from .keyword_relevance_scorer import KeywordRelevanceScorer
--- a/crawl4ai/scraper/scorers/keyword_relevance_scorer.py
+++ b/crawl4ai/scraper/scorers/keyword_relevance_scorer.py
@@ -1,9 +0,0 @@
-from .url_scorer import URLScorer
-from typing import List
-
-class KeywordRelevanceScorer(URLScorer):
-    def __init__(self,keywords: List[str]):
-        self.keyworkds = keywords
-    def score(self, url: str) -> float:
-        #TODO: This is a stub. Will implement this later.
-        return 1
--- a/crawl4ai/scraper/scorers/url_scorer.py
+++ b/crawl4ai/scraper/scorers/url_scorer.py
@@ -1,6 +0,0 @@
-from abc import ABC, abstractmethod
-
-class URLScorer(ABC):
-    @abstractmethod
-    def score(self, url: str) -> float:
-        pass
--- a/crawl4ai/scraper/scraper_strategy.py
+++ b/crawl4ai/scraper/scraper_strategy.py
@@ -1,26 +0,0 @@
-from abc import ABC, abstractmethod
-from .models import ScraperResult, CrawlResult
-from ..models import CrawlResult
-from ..async_webcrawler import AsyncWebCrawler
-from typing import Union, AsyncGenerator
-
-class ScraperStrategy(ABC):
-    @abstractmethod
-    async def ascrape(self, url: str, crawler: AsyncWebCrawler, parallel_processing: bool = True, stream: bool = False) -> Union[AsyncGenerator[CrawlResult, None], ScraperResult]:
-        """Scrape the given URL using the specified crawler.
-
-        Args:
-            url (str): The starting URL for the scrape.
-            crawler (AsyncWebCrawler): The web crawler instance.
-            parallel_processing (bool): Whether to use parallel processing. Defaults to True.
-            stream (bool): If True, yields individual crawl results as they are ready; 
-                                if False, accumulates results and returns a final ScraperResult.
-
-        Yields:
-            CrawlResult: Individual crawl results if stream is True.
-
-        Returns:
-            ScraperResult: A summary of the scrape results containing the final extracted data 
-            and the list of crawled URLs if stream is False.
-        """
-        pass
--- a/crawl4ai/utils.py
+++ b/crawl4ai/utils.py
@@ -692,8 +692,8 @@ def get_content_of_website_optimized(url: str, html: str, word_count_threshold:
    for img in imgs:
        src = img.get('src', '')
        if base64_pattern.match(src):
-            # Replace base64 data with empty string
            img['src'] = base64_pattern.sub('', src)
+
    cleaned_html = str(body).replace('\n\n', '\n').replace('  ', ' ')
    cleaned_html = sanitize_html(cleaned_html)

--- a/docs/examples/quickstart.ipynb
+++ b/docs/examples/quickstart.ipynb
@@ -47,8 +47,7 @@
      },
      "outputs": [],
      "source": [
-        "# !pip install \"crawl4ai @ git+https://github.com/unclecode/crawl4ai.git\"\n",
-        "!pip install \"crawl4ai @ git+https://github.com/unclecode/crawl4ai.git@staging\"\n",
+        "!pip install crawl4ai\n",
        "!pip install nest-asyncio\n",
        "!playwright install"
      ]
@@ -714,7 +713,7 @@
      "provenance": []
    },
    "kernelspec": {
-      "display_name": "Python 3",
+      "display_name": "venv",
      "language": "python",
      "name": "python3"
    },
--- a/docs/examples/quickstart_async.py
+++ b/docs/examples/quickstart_async.py
@@ -379,6 +379,18 @@ async def crawl_custom_browser_type():
        print(result.markdown[:500])
        print("Time taken: ", time.time() - start)

+async def crawl_with_user_simulation():
+    async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
+        url = "YOUR-URL-HERE"
+        result = await crawler.arun(
+            url=url,
+            bypass_cache=True,
+            simulate_user=True,  # Causes a series of random mouse movements and clicks to simulate user interaction
+            override_navigator=True,  # Overrides the navigator object to make it look like a real user
+        )
+        
+        print(result.markdown)    
+
 async def speed_comparison():
    # print("\n--- Speed Comparison ---")
    # print("Firecrawl (simulated):")
@@ -455,7 +467,7 @@ async def main():
    # LLM extraction examples
    await extract_structured_data_using_llm()
    await extract_structured_data_using_llm("huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct", os.getenv("HUGGINGFACE_API_KEY"))
-    await extract_structured_data_using_llm("openai/gpt-4", os.getenv("OPENAI_API_KEY"))
+    await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"))
    await extract_structured_data_using_llm("ollama/llama3.2")    

    # You always can pass custom headers to the extraction strategy
--- a/requirements.txt
+++ b/requirements.txt
@@ -7,4 +7,5 @@ pillow==10.4.0
 playwright==1.47.0
 python-dotenv==1.0.1
 requests>=2.26.0,<2.32.3
-beautifulsoup4==4.12.3
+beautifulsoup4==4.12.3
+playwright_stealth==1.0.6
Author	SHA1	Message	Date
UncleCode	b309bc34e1	Fix the model name in quick start example	2024-10-18 15:32:25 +08:00
UncleCode	b8147b64e0	chore: Bump version to 0.3.71 and improve error handling - Update version number to 0.3.71 - Add sleep_on_close option to AsyncPlaywrightCrawlerStrategy - Enhance context creation with additional options - Improve error message formatting and visibility - Update quickstart documentation	2024-10-18 13:31:12 +08:00
UncleCode	aab6ea022e	Update requirements and switch to 0.3.8	2024-10-18 12:51:23 +08:00
UncleCode	dd17ed0e63	Rename some flags name, introducing magic flag.	2024-10-18 12:35:09 +08:00
UncleCode	768aa06ceb	feat(crawler): Enhance stealth and flexibility, improve error handling - Implement playwright_stealth for better bot detection avoidance - Add user simulation and navigator override options - Improve iframe processing and browser selection - Enhance error reporting and debugging capabilities - Optimize image processing and parallel crawling - Add new example for user simulation feature - Added support for including links in Markdown content, by defining a new flag `include_links_on_markdown` in the `crawl` method.	2024-10-17 21:37:48 +08:00