Compare commits

19 Commits

| Author | SHA1 | Date |
|---|---|---|
|  | b309bc34e1 |  |
|  | b8147b64e0 |  |
|  | aab6ea022e |  |
|  | dd17ed0e63 |  |
|  | 768aa06ceb |  |
|  | 9ffa34b697 |  |
|  | 740802c491 |  |
|  | b9ac96c332 |  |
|  | d06535388a |  |
|  | 2b73bdf6b0 |  |
|  | 6aa803d712 |  |
|  | 320afdea64 |  |
|  | ccbe72cfc1 |  |
|  | b9bbd42373 |  |
|  | 68e9144ce3 |  |
|  | 9b2b267820 |  |
|  | ff3524d9b1 |  |
|  | b99d20b725 |  |
|  | 768b93140f |  |
.gitignore (vendored, 5 changes)

```diff
@@ -201,3 +201,8 @@ test_env/
 todo.md
 git_changes.py
 git_changes.md
+pypi_build.sh
+git_issues.py
+git_issues.md
+
+.tests/
```
CHANGELOG.md (152 additions)

# Changelog

## [v0.3.71] - 2024-10-18

### Changes

1. **Version Update**:
   - Updated version number from 0.3.7 to 0.3.71.

2. **Crawler Enhancements**:
   - Added `sleep_on_close` option to `AsyncPlaywrightCrawlerStrategy` for delayed browser closure.
   - Improved context creation with additional options:
     - Enabled `accept_downloads` and `java_script_enabled`.
     - Added a cookie to enable cookies by default.

3. **Error Handling Improvements**:
   - Enhanced error messages in `AsyncWebCrawler`'s `arun` method.
   - Updated error reporting format for better visibility and consistency.

4. **Performance Optimization**:
   - Commented out automatic page and context closure in the `crawl` method to potentially improve performance in certain scenarios.

### Documentation

- Updated the quickstart notebook:
  - Changed the installation command to use the released package instead of the GitHub repository.
  - Updated the kernel display name.

### Developer Notes

- Minor code refactoring and cleanup.
## [v0.3.7] - 2024-10-17

### New Features

1. **Enhanced Browser Stealth**:
   - Implemented `playwright_stealth` for improved bot detection avoidance.
   - Added `StealthConfig` for fine-tuned control over stealth parameters.

2. **User Simulation**:
   - New `simulate_user` option to mimic human-like interactions (mouse movements, clicks, keyboard presses).

3. **Navigator Override**:
   - Added `override_navigator` option to modify navigator properties, further improving bot detection evasion.

4. **Improved iframe Handling**:
   - New `process_iframes` parameter to extract and integrate iframe content into the main page.

5. **Flexible Browser Selection**:
   - Support for choosing between Chromium, Firefox, and WebKit browsers.

6. **Include Links in Markdown**:
   - Added support for including links in Markdown content by defining a new flag `include_links_on_markdown` in the `crawl` method.

### Improvements

1. **Better Error Handling**:
   - Enhanced error reporting in WebScrappingStrategy with detailed error messages and suggestions.
   - Added console message and error logging for better debugging.

2. **Image Processing Enhancements**:
   - Improved image dimension updating and filtering logic.

3. **Crawling Flexibility**:
   - Added support for custom viewport sizes.
   - Implemented delayed content retrieval with the `delay_before_return_html` parameter.

4. **Performance Optimization**:
   - Adjusted default semaphore count for parallel crawling.
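The semaphore pattern behind parallel crawling can be sketched in plain asyncio; the `fetch` coroutine below is only a stand-in for a real crawl call:

```python
import asyncio

async def fetch(url: str) -> str:
    # Stand-in for a real crawl; just simulates I/O latency.
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"

async def crawl_many(urls, semaphore_count: int = 5):
    # At most `semaphore_count` crawls run concurrently.
    semaphore = asyncio.Semaphore(semaphore_count)

    async def crawl_with_semaphore(url):
        async with semaphore:
            return await fetch(url)

    tasks = [crawl_with_semaphore(u) for u in urls]
    # return_exceptions=True keeps one failure from cancelling the whole batch.
    return await asyncio.gather(*tasks, return_exceptions=True)

results = asyncio.run(
    crawl_many([f"https://example.com/{i}" for i in range(8)], semaphore_count=3)
)
```

Raising `semaphore_count` trades memory and browser load for throughput, which is why it is exposed as a tunable default.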
### Bug Fixes

- Fixed an issue where the HTML content could be empty after processing.

### Examples

- Added new example `crawl_with_user_simulation()` demonstrating the use of user simulation and navigator override features.
### Developer Notes

- Refactored code for better maintainability and readability.
- Updated browser launch arguments for improved compatibility and performance.
## [v0.3.6] - 2024-10-12

### 1. Improved Crawling Control

- **New Hook**: Added `before_retrieve_html` hook in `AsyncPlaywrightCrawlerStrategy`.
- **Delayed HTML Retrieval**: Introduced the `delay_before_return_html` parameter to allow waiting before retrieving HTML content.
  - Useful for pages with delayed content loading.
- **Flexible Timeout**: The `smart_wait` function now uses `page_timeout` (default 60 seconds) instead of a fixed 30-second timeout.
  - Provides better handling for slow-loading pages.
- **How to use**: Set `page_timeout=your_desired_timeout` (in milliseconds) when calling `crawler.arun()`.
### 2. Browser Type Selection

- Added support for different browser types (Chromium, Firefox, WebKit).
- Users can now specify the browser type when initializing `AsyncWebCrawler`.
- **How to use**: Set `browser_type="firefox"` or `browser_type="webkit"` when initializing `AsyncWebCrawler`.
### 3. Screenshot Capture

- Added ability to capture screenshots during crawling.
- Useful for debugging and content verification.
- **How to use**: Set `screenshot=True` when calling `crawler.arun()`.
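The screenshot comes back as a base64-encoded string, so saving it to disk looks roughly like this (the helper name and file path are illustrative; the fake payload stands in for `result.screenshot` from a real `crawler.arun(url=..., screenshot=True)` call):

```python
import base64

def save_screenshot(b64_data: str, path: str) -> int:
    """Decode a base64 screenshot string, write it to disk, return bytes written."""
    raw = base64.b64decode(b64_data)
    with open(path, "wb") as f:
        f.write(raw)
    return len(raw)

# Fake 4-byte "image" payload standing in for result.screenshot.
fake_screenshot = base64.b64encode(b"\x89PNG").decode("utf-8")
written = save_screenshot(fake_screenshot, "screenshot.png")
```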
### 4. Enhanced LLM Extraction Strategy

- Added support for multiple LLM providers (OpenAI, Hugging Face, Ollama).
- **Custom Arguments**: Added support for passing extra arguments to LLM providers via the `extra_args` parameter.
- **Custom Headers**: Users can now pass custom headers to the extraction strategy.
- **How to use**: Specify the desired provider and custom arguments when using `LLMExtractionStrategy`.
### 5. iframe Content Extraction

- New feature to process and extract content from iframes.
- **How to use**: Set `process_iframes=True` in the `crawl` method.

### 6. Delayed Content Retrieval

- Introduced the `get_delayed_content` method in `AsyncCrawlResponse`.
- Allows retrieval of content after a specified delay, useful for dynamically loaded content.
- **How to use**: Access `result.get_delayed_content(delay_in_seconds)` after crawling.
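The mechanism is simply a coroutine closure attached to the response object; a dependency-free sketch of the pattern (names are illustrative):

```python
import asyncio

def make_delayed_getter(get_content):
    # Mirrors AsyncCrawlResponse.get_delayed_content: wait, then re-read the page.
    async def get_delayed_content(delay: float = 5.0) -> str:
        await asyncio.sleep(delay)
        return get_content()
    return get_delayed_content

# Stand-in for page.content(); a real page might hold more HTML by the time we ask.
page_state = {"html": "<p>late content</p>"}
getter = make_delayed_getter(lambda: page_state["html"])
html = asyncio.run(getter(delay=0.01))
```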
## Improvements and Optimizations

### 1. AsyncWebCrawler Enhancements

- **Flexible Initialization**: Now accepts arbitrary keyword arguments, passed directly to the crawler strategy.
  - Allows for more customized setups.

### 2. Image Processing Optimization

- Enhanced image handling in WebScrappingStrategy.
- Added filtering for small, invisible, or irrelevant images.
- Improved image scoring system for better content relevance.
- Implemented JavaScript-based image dimension updating for more accurate representation.

### 3. Database Schema Auto-updates

- Automatic database schema updates ensure compatibility with the latest version.

### 4. Enhanced Error Handling and Logging

- Improved error messages and logging for easier debugging.

### 5. Content Extraction Refinements

- Refined HTML sanitization process.
- Improved handling of base64 encoded images.
- Enhanced Markdown conversion process.
- Optimized content extraction algorithms.

### 6. Utility Function Enhancements

- The `perform_completion_with_backoff` function now supports additional arguments for more customized API calls to LLM providers.

## Bug Fixes

- Fixed an issue where image tags were being prematurely removed during content extraction.

## Examples and Documentation

- Updated `quickstart_async.py` with examples of:
  - Using custom headers in LLM extraction.
  - Different LLM provider usage (OpenAI, Hugging Face, Ollama).
  - Custom browser type usage.

## Developer Notes

- Refactored code for better maintainability, flexibility, and performance.
- Enhanced type hinting throughout the codebase for improved development experience.
- Expanded error handling for more robust operation.

These updates significantly enhance the flexibility, accuracy, and robustness of crawl4ai, providing users with more control and options for their web crawling and content extraction tasks.

## [v0.3.5] - 2024-09-02

Enhance AsyncWebCrawler with smart waiting and screenshot capabilities
README.md (10 changes)

```diff
@@ -10,6 +10,14 @@ Crawl4AI simplifies asynchronous web crawling and data extraction, making it acc
 > Looking for the synchronous version? Check out [README.sync.md](./README.sync.md). You can also access the previous version in the branch [V0.2.76](https://github.com/unclecode/crawl4ai/blob/v0.2.76).
 
+## New update 0.3.6
+- 🌐 Multi-browser support (Chromium, Firefox, WebKit)
+- 🖼️ Improved image processing with lazy-loading detection
+- 🔧 Custom page timeout parameter for better control over crawling behavior
+- 🕰️ Enhanced handling of delayed content loading
+- 🔑 Custom headers support for LLM interactions
+- 🖼️ iframe content extraction for comprehensive page analysis
+- ⏱️ Flexible timeout and delayed content retrieval options
+
 ## Try it Now!
@@ -124,7 +132,7 @@ async def main():
     result = await crawler.arun(
         url="https://www.nbcnews.com/business",
         js_code=js_code,
-        css_selector="article.tease-card",
+        css_selector=".wide-tease-item__description",
         bypass_cache=True
     )
     print(result.extracted_content)
```
```diff
@@ -3,7 +3,7 @@
 from .async_webcrawler import AsyncWebCrawler
 from .models import CrawlResult
 
-__version__ = "0.3.5"
+__version__ = "0.3.71"
 
 __all__ = [
     "AsyncWebCrawler",
```
558
crawl4ai/async_crawler_strategy copy.py
Normal file
558
crawl4ai/async_crawler_strategy copy.py
Normal file
@@ -0,0 +1,558 @@
|
|||||||
|
import asyncio
|
||||||
|
import base64
|
||||||
|
import time
|
||||||
|
from abc import ABC, abstractmethod
|
||||||
|
from typing import Callable, Dict, Any, List, Optional, Awaitable
|
||||||
|
import os
|
||||||
|
from playwright.async_api import async_playwright, Page, Browser, Error
|
||||||
|
from io import BytesIO
|
||||||
|
from PIL import Image, ImageDraw, ImageFont
|
||||||
|
from pathlib import Path
|
||||||
|
from playwright.async_api import ProxySettings
|
||||||
|
from pydantic import BaseModel
|
||||||
|
import hashlib
|
||||||
|
import json
|
||||||
|
import uuid
|
||||||
|
from playwright_stealth import stealth_async
|
||||||
|
|
||||||
|
class AsyncCrawlResponse(BaseModel):
|
||||||
|
html: str
|
||||||
|
response_headers: Dict[str, str]
|
||||||
|
status_code: int
|
||||||
|
screenshot: Optional[str] = None
|
||||||
|
get_delayed_content: Optional[Callable[[Optional[float]], Awaitable[str]]] = None
|
||||||
|
|
||||||
|
class Config:
|
||||||
|
arbitrary_types_allowed = True
|
||||||
|
|
||||||
|
class AsyncCrawlerStrategy(ABC):
|
||||||
|
@abstractmethod
|
||||||
|
async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
|
||||||
|
pass
|
||||||
|
|
||||||
|
@abstractmethod
|
||||||
|
async def crawl_many(self, urls: List[str], **kwargs) -> List[AsyncCrawlResponse]:
|
||||||
|
pass
|
||||||
|
|
||||||
|
@abstractmethod
|
||||||
|
async def take_screenshot(self, url: str) -> str:
|
||||||
|
pass
|
||||||
|
|
||||||
|
@abstractmethod
|
||||||
|
def update_user_agent(self, user_agent: str):
|
||||||
|
pass
|
||||||
|
|
||||||
|
@abstractmethod
|
||||||
|
def set_hook(self, hook_type: str, hook: Callable):
|
||||||
|
pass
|
||||||
|
|
||||||
|
class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||||
|
def __init__(self, use_cached_html=False, js_code=None, **kwargs):
|
||||||
|
self.use_cached_html = use_cached_html
|
||||||
|
self.user_agent = kwargs.get(
|
||||||
|
"user_agent",
|
||||||
|
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
|
||||||
|
"(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
|
||||||
|
)
|
||||||
|
self.proxy = kwargs.get("proxy")
|
||||||
|
self.headless = kwargs.get("headless", True)
|
||||||
|
self.browser_type = kwargs.get("browser_type", "chromium")
|
||||||
|
self.headers = kwargs.get("headers", {})
|
||||||
|
self.sessions = {}
|
||||||
|
self.session_ttl = 1800
|
||||||
|
self.js_code = js_code
|
||||||
|
self.verbose = kwargs.get("verbose", False)
|
||||||
|
self.playwright = None
|
||||||
|
self.browser = None
|
||||||
|
self.hooks = {
|
||||||
|
'on_browser_created': None,
|
||||||
|
'on_user_agent_updated': None,
|
||||||
|
'on_execution_started': None,
|
||||||
|
'before_goto': None,
|
||||||
|
'after_goto': None,
|
||||||
|
'before_return_html': None,
|
||||||
|
'before_retrieve_html': None
|
||||||
|
}
|
||||||
|
|
||||||
|
async def __aenter__(self):
|
||||||
|
await self.start()
|
||||||
|
return self
|
||||||
|
|
||||||
|
async def __aexit__(self, exc_type, exc_val, exc_tb):
|
||||||
|
await self.close()
|
||||||
|
|
||||||
|
async def start(self):
|
||||||
|
if self.playwright is None:
|
||||||
|
self.playwright = await async_playwright().start()
|
||||||
|
if self.browser is None:
|
||||||
|
browser_args = {
|
||||||
|
"headless": self.headless,
|
||||||
|
"args": [
|
||||||
|
"--disable-gpu",
|
||||||
|
"--no-sandbox",
|
||||||
|
"--disable-dev-shm-usage",
|
||||||
|
"--disable-blink-features=AutomationControlled",
|
||||||
|
"--disable-infobars",
|
||||||
|
"--window-position=0,0",
|
||||||
|
"--ignore-certificate-errors",
|
||||||
|
"--ignore-certificate-errors-spki-list",
|
||||||
|
# "--headless=new", # Use the new headless mode
|
||||||
|
]
|
||||||
|
}
|
||||||
|
|
||||||
|
# Add proxy settings if a proxy is specified
|
||||||
|
if self.proxy:
|
||||||
|
proxy_settings = ProxySettings(server=self.proxy)
|
||||||
|
browser_args["proxy"] = proxy_settings
|
||||||
|
|
||||||
|
# Select the appropriate browser based on the browser_type
|
||||||
|
if self.browser_type == "firefox":
|
||||||
|
self.browser = await self.playwright.firefox.launch(**browser_args)
|
||||||
|
elif self.browser_type == "webkit":
|
||||||
|
self.browser = await self.playwright.webkit.launch(**browser_args)
|
||||||
|
else:
|
||||||
|
self.browser = await self.playwright.chromium.launch(**browser_args)
|
||||||
|
|
||||||
|
await self.execute_hook('on_browser_created', self.browser)
|
||||||
|
|
||||||
|
async def close(self):
|
||||||
|
if self.browser:
|
||||||
|
await self.browser.close()
|
||||||
|
self.browser = None
|
||||||
|
if self.playwright:
|
||||||
|
await self.playwright.stop()
|
||||||
|
self.playwright = None
|
||||||
|
|
||||||
|
def __del__(self):
|
||||||
|
if self.browser or self.playwright:
|
||||||
|
asyncio.get_event_loop().run_until_complete(self.close())
|
||||||
|
|
||||||
|
def set_hook(self, hook_type: str, hook: Callable):
|
||||||
|
if hook_type in self.hooks:
|
||||||
|
self.hooks[hook_type] = hook
|
||||||
|
else:
|
||||||
|
raise ValueError(f"Invalid hook type: {hook_type}")
|
||||||
|
|
||||||
|
async def execute_hook(self, hook_type: str, *args):
|
||||||
|
hook = self.hooks.get(hook_type)
|
||||||
|
if hook:
|
||||||
|
if asyncio.iscoroutinefunction(hook):
|
||||||
|
return await hook(*args)
|
||||||
|
else:
|
||||||
|
return hook(*args)
|
||||||
|
return args[0] if args else None
|
||||||
|
|
||||||
|
def update_user_agent(self, user_agent: str):
|
||||||
|
self.user_agent = user_agent
|
||||||
|
|
||||||
|
def set_custom_headers(self, headers: Dict[str, str]):
|
||||||
|
self.headers = headers
|
||||||
|
|
||||||
|
async def kill_session(self, session_id: str):
|
||||||
|
if session_id in self.sessions:
|
||||||
|
context, page, _ = self.sessions[session_id]
|
||||||
|
await page.close()
|
||||||
|
await context.close()
|
||||||
|
del self.sessions[session_id]
|
||||||
|
|
||||||
|
def _cleanup_expired_sessions(self):
|
||||||
|
current_time = time.time()
|
||||||
|
expired_sessions = [
|
||||||
|
sid for sid, (_, _, last_used) in self.sessions.items()
|
||||||
|
if current_time - last_used > self.session_ttl
|
||||||
|
]
|
||||||
|
for sid in expired_sessions:
|
||||||
|
asyncio.create_task(self.kill_session(sid))
|
||||||
|
|
||||||
|
async def smart_wait(self, page: Page, wait_for: str, timeout: float = 30000):
|
||||||
|
wait_for = wait_for.strip()
|
||||||
|
|
||||||
|
if wait_for.startswith('js:'):
|
||||||
|
# Explicitly specified JavaScript
|
||||||
|
js_code = wait_for[3:].strip()
|
||||||
|
return await self.csp_compliant_wait(page, js_code, timeout)
|
||||||
|
elif wait_for.startswith('css:'):
|
||||||
|
# Explicitly specified CSS selector
|
||||||
|
css_selector = wait_for[4:].strip()
|
||||||
|
try:
|
||||||
|
await page.wait_for_selector(css_selector, timeout=timeout)
|
||||||
|
except Error as e:
|
||||||
|
if 'Timeout' in str(e):
|
||||||
|
raise TimeoutError(f"Timeout after {timeout}ms waiting for selector '{css_selector}'")
|
||||||
|
else:
|
||||||
|
raise ValueError(f"Invalid CSS selector: '{css_selector}'")
|
||||||
|
else:
|
||||||
|
# Auto-detect based on content
|
||||||
|
if wait_for.startswith('()') or wait_for.startswith('function'):
|
||||||
|
# It's likely a JavaScript function
|
||||||
|
return await self.csp_compliant_wait(page, wait_for, timeout)
|
||||||
|
else:
|
||||||
|
# Assume it's a CSS selector first
|
||||||
|
try:
|
||||||
|
await page.wait_for_selector(wait_for, timeout=timeout)
|
||||||
|
except Error as e:
|
||||||
|
if 'Timeout' in str(e):
|
||||||
|
raise TimeoutError(f"Timeout after {timeout}ms waiting for selector '{wait_for}'")
|
||||||
|
else:
|
||||||
|
# If it's not a timeout error, it might be an invalid selector
|
||||||
|
# Let's try to evaluate it as a JavaScript function as a fallback
|
||||||
|
try:
|
||||||
|
return await self.csp_compliant_wait(page, f"() => {{{wait_for}}}", timeout)
|
||||||
|
except Error:
|
||||||
|
raise ValueError(f"Invalid wait_for parameter: '{wait_for}'. "
|
||||||
|
"It should be either a valid CSS selector, a JavaScript function, "
|
||||||
|
"or explicitly prefixed with 'js:' or 'css:'.")
|
||||||
|
|
||||||
|
async def csp_compliant_wait(self, page: Page, user_wait_function: str, timeout: float = 30000):
|
||||||
|
wrapper_js = f"""
|
||||||
|
async () => {{
|
||||||
|
const userFunction = {user_wait_function};
|
||||||
|
const startTime = Date.now();
|
||||||
|
while (true) {{
|
||||||
|
if (await userFunction()) {{
|
||||||
|
return true;
|
||||||
|
}}
|
||||||
|
if (Date.now() - startTime > {timeout}) {{
|
||||||
|
throw new Error('Timeout waiting for condition');
|
||||||
|
}}
|
||||||
|
await new Promise(resolve => setTimeout(resolve, 100));
|
||||||
|
}}
|
||||||
|
}}
|
||||||
|
"""
|
||||||
|
|
||||||
|
try:
|
||||||
|
await page.evaluate(wrapper_js)
|
||||||
|
except TimeoutError:
|
||||||
|
raise TimeoutError(f"Timeout after {timeout}ms waiting for condition")
|
||||||
|
except Exception as e:
|
||||||
|
raise RuntimeError(f"Error in wait condition: {str(e)}")
|
||||||
|
|
||||||
|
async def process_iframes(self, page):
|
||||||
|
# Find all iframes
|
||||||
|
iframes = await page.query_selector_all('iframe')
|
||||||
|
|
||||||
|
for i, iframe in enumerate(iframes):
|
||||||
|
try:
|
||||||
|
# Add a unique identifier to the iframe
|
||||||
|
await iframe.evaluate(f'(element) => element.id = "iframe-{i}"')
|
||||||
|
|
||||||
|
# Get the frame associated with this iframe
|
||||||
|
frame = await iframe.content_frame()
|
||||||
|
|
||||||
|
if frame:
|
||||||
|
# Wait for the frame to load
|
||||||
|
await frame.wait_for_load_state('load', timeout=30000) # 30 seconds timeout
|
||||||
|
|
||||||
|
# Extract the content of the iframe's body
|
||||||
|
iframe_content = await frame.evaluate('() => document.body.innerHTML')
|
||||||
|
|
||||||
|
# Generate a unique class name for this iframe
|
||||||
|
class_name = f'extracted-iframe-content-{i}'
|
||||||
|
|
||||||
|
# Replace the iframe with a div containing the extracted content
|
||||||
|
_iframe = iframe_content.replace('`', '\\`')
|
||||||
|
await page.evaluate(f"""
|
||||||
|
() => {{
|
||||||
|
const iframe = document.getElementById('iframe-{i}');
|
||||||
|
const div = document.createElement('div');
|
||||||
|
div.innerHTML = `{_iframe}`;
|
||||||
|
div.className = '{class_name}';
|
||||||
|
iframe.replaceWith(div);
|
||||||
|
}}
|
||||||
|
""")
|
||||||
|
else:
|
||||||
|
print(f"Warning: Could not access content frame for iframe {i}")
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Error processing iframe {i}: {str(e)}")
|
||||||
|
|
||||||
|
# Return the page object
|
||||||
|
return page
|
||||||
|
|
||||||
|
async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
|
||||||
|
response_headers = {}
|
||||||
|
status_code = None
|
||||||
|
|
||||||
|
self._cleanup_expired_sessions()
|
||||||
|
session_id = kwargs.get("session_id")
|
||||||
|
if session_id:
|
||||||
|
context, page, _ = self.sessions.get(session_id, (None, None, None))
|
||||||
|
if not context:
|
||||||
|
context = await self.browser.new_context(
|
||||||
|
user_agent=self.user_agent,
|
||||||
|
viewport={"width": 1920, "height": 1080},
|
||||||
|
proxy={"server": self.proxy} if self.proxy else None
|
||||||
|
)
|
||||||
|
await context.set_extra_http_headers(self.headers)
|
||||||
|
page = await context.new_page()
|
||||||
|
self.sessions[session_id] = (context, page, time.time())
|
||||||
|
else:
|
||||||
|
context = await self.browser.new_context(
|
||||||
|
user_agent=self.user_agent,
|
||||||
|
viewport={"width": 1920, "height": 1080},
|
||||||
|
proxy={"server": self.proxy} if self.proxy else None
|
||||||
|
)
|
||||||
|
await context.set_extra_http_headers(self.headers)
|
||||||
|
|
||||||
|
if kwargs.get("override_navigator", False):
|
||||||
|
# Inject scripts to override navigator properties
|
||||||
|
await context.add_init_script("""
|
||||||
|
// Pass the Permissions Test.
|
||||||
|
const originalQuery = window.navigator.permissions.query;
|
||||||
|
window.navigator.permissions.query = (parameters) => (
|
||||||
|
parameters.name === 'notifications' ?
|
||||||
|
Promise.resolve({ state: Notification.permission }) :
|
||||||
|
originalQuery(parameters)
|
||||||
|
);
|
||||||
|
Object.defineProperty(navigator, 'webdriver', {
|
||||||
|
get: () => undefined
|
||||||
|
});
|
||||||
|
window.navigator.chrome = {
|
||||||
|
runtime: {},
|
||||||
|
// Add other properties if necessary
|
||||||
|
};
|
||||||
|
Object.defineProperty(navigator, 'plugins', {
|
||||||
|
get: () => [1, 2, 3, 4, 5],
|
||||||
|
});
|
||||||
|
Object.defineProperty(navigator, 'languages', {
|
||||||
|
get: () => ['en-US', 'en'],
|
||||||
|
});
|
||||||
|
Object.defineProperty(document, 'hidden', {
|
||||||
|
get: () => false
|
||||||
|
});
|
||||||
|
Object.defineProperty(document, 'visibilityState', {
|
||||||
|
get: () => 'visible'
|
||||||
|
});
|
||||||
|
""")
|
||||||
|
|
||||||
|
page = await context.new_page()
|
||||||
|
|
||||||
|
try:
|
||||||
|
if self.verbose:
|
||||||
|
print(f"[LOG] 🕸️ Crawling {url} using AsyncPlaywrightCrawlerStrategy...")
|
||||||
|
|
||||||
|
if self.use_cached_html:
|
||||||
|
cache_file_path = os.path.join(
|
||||||
|
Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest()
|
||||||
|
)
|
||||||
|
if os.path.exists(cache_file_path):
|
||||||
|
html = ""
|
||||||
|
with open(cache_file_path, "r") as f:
|
||||||
|
html = f.read()
|
||||||
|
# retrieve response headers and status code from cache
|
||||||
|
with open(cache_file_path + ".meta", "r") as f:
|
||||||
|
meta = json.load(f)
|
||||||
|
response_headers = meta.get("response_headers", {})
|
||||||
|
status_code = meta.get("status_code")
|
||||||
|
response = AsyncCrawlResponse(
|
||||||
|
html=html, response_headers=response_headers, status_code=status_code
|
||||||
|
)
|
||||||
|
return response
|
||||||
|
|
||||||
|
if not kwargs.get("js_only", False):
|
||||||
|
await self.execute_hook('before_goto', page)
|
||||||
|
|
||||||
|
response = await page.goto("about:blank")
|
||||||
|
await stealth_async(page)
|
||||||
|
response = await page.goto(
|
||||||
|
url, wait_until="domcontentloaded", timeout=kwargs.get("page_timeout", 60000)
|
||||||
|
)
|
||||||
|
|
||||||
|
# await stealth_async(page)
|
||||||
|
# response = await page.goto("about:blank")
|
||||||
|
# await stealth_async(page)
|
||||||
|
# await page.evaluate(f"window.location.href = '{url}'")
|
||||||
|
|
||||||
|
await self.execute_hook('after_goto', page)
|
||||||
|
|
||||||
|
# Get status code and headers
|
||||||
|
status_code = response.status
|
||||||
|
response_headers = response.headers
|
||||||
|
else:
|
||||||
|
status_code = 200
|
||||||
|
response_headers = {}
|
||||||
|
|
||||||
|
await page.wait_for_selector('body')
|
||||||
|
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
|
||||||
|
|
||||||
|
js_code = kwargs.get("js_code", kwargs.get("js", self.js_code))
|
||||||
|
if js_code:
|
||||||
|
if isinstance(js_code, str):
|
||||||
|
await page.evaluate(js_code)
|
||||||
|
elif isinstance(js_code, list):
|
||||||
|
for js in js_code:
|
||||||
|
await page.evaluate(js)
|
||||||
|
|
||||||
|
await page.wait_for_load_state('networkidle')
|
||||||
|
# Check for on execution event
|
||||||
|
await self.execute_hook('on_execution_started', page)
|
||||||
|
|
||||||
|
if kwargs.get("simulate_user", False):
|
||||||
|
# Simulate user interactions
|
||||||
|
await page.mouse.move(100, 100)
|
||||||
|
await page.mouse.down()
|
||||||
|
await page.mouse.up()
|
||||||
|
await page.keyboard.press('ArrowDown')
|
||||||
|
|
||||||
|
# Handle the wait_for parameter
|
||||||
|
wait_for = kwargs.get("wait_for")
|
||||||
|
if wait_for:
|
||||||
|
try:
|
||||||
|
await self.smart_wait(page, wait_for, timeout=kwargs.get("page_timeout", 60000))
|
||||||
|
except Exception as e:
|
||||||
|
raise RuntimeError(f"Wait condition failed: {str(e)}")
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
# Update image dimensions
|
||||||
|
update_image_dimensions_js = """
|
||||||
|
() => {
|
||||||
|
return new Promise((resolve) => {
|
||||||
|
const filterImage = (img) => {
|
||||||
|
// Filter out images that are too small
|
||||||
|
if (img.width < 100 && img.height < 100) return false;
|
||||||
|
|
||||||
|
// Filter out images that are not visible
|
||||||
|
const rect = img.getBoundingClientRect();
|
||||||
|
if (rect.width === 0 || rect.height === 0) return false;
|
||||||
|
|
||||||
|
// Filter out images with certain class names (e.g., icons, thumbnails)
|
||||||
|
if (img.classList.contains('icon') || img.classList.contains('thumbnail')) return false;
|
||||||
|
|
||||||
|
// Filter out images with certain patterns in their src (e.g., placeholder images)
|
||||||
|
if (img.src.includes('placeholder') || img.src.includes('icon')) return false;
|
||||||
|
|
||||||
|
return true;
|
||||||
|
};
|
||||||
|
|
||||||
|
const images = Array.from(document.querySelectorAll('img')).filter(filterImage);
|
||||||
|
let imagesLeft = images.length;
|
||||||
|
|
||||||
|
if (imagesLeft === 0) {
|
||||||
|
resolve();
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
const checkImage = (img) => {
|
||||||
|
if (img.complete && img.naturalWidth !== 0) {
|
||||||
|
img.setAttribute('width', img.naturalWidth);
|
||||||
|
img.setAttribute('height', img.naturalHeight);
|
||||||
|
imagesLeft--;
|
||||||
|
if (imagesLeft === 0) resolve();
|
||||||
|
}
|
||||||
|
};
|
||||||
|
|
||||||
|
images.forEach(img => {
|
||||||
|
checkImage(img);
|
||||||
|
if (!img.complete) {
|
||||||
|
img.onload = () => {
|
||||||
|
checkImage(img);
|
||||||
|
};
|
||||||
|
img.onerror = () => {
|
||||||
|
imagesLeft--;
|
||||||
|
if (imagesLeft === 0) resolve();
|
||||||
|
};
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
// Fallback timeout of 5 seconds
|
||||||
|
setTimeout(() => resolve(), 5000);
|
||||||
|
});
|
||||||
|
}
|
||||||
|
"""
|
||||||
|
await page.evaluate(update_image_dimensions_js)
|
||||||
|
|
||||||
|
# Wait a bit for any onload events to complete
|
||||||
|
await page.wait_for_timeout(100)
|
||||||
|
|
||||||
|
# Process iframes
|
||||||
|
if kwargs.get("process_iframes", False):
|
||||||
|
page = await self.process_iframes(page)
|
||||||
|
|
||||||
|
await self.execute_hook('before_retrieve_html', page)
|
||||||
|
# Check if delay_before_return_html is set then wait for that time
|
||||||
|
delay_before_return_html = kwargs.get("delay_before_return_html")
|
||||||
|
```python
            if delay_before_return_html:
                await asyncio.sleep(delay_before_return_html)

            html = await page.content()
            await self.execute_hook('before_return_html', page, html)

            # Check if kwargs has screenshot=True then take screenshot
            screenshot_data = None
            if kwargs.get("screenshot"):
                screenshot_data = await self.take_screenshot(url)

            if self.verbose:
                print(f"[LOG] ✅ Crawled {url} successfully!")

            if self.use_cached_html:
                cache_file_path = os.path.join(
                    Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest()
                )
                with open(cache_file_path, "w", encoding="utf-8") as f:
                    f.write(html)
                # store response headers and status code in cache
                with open(cache_file_path + ".meta", "w", encoding="utf-8") as f:
                    json.dump({
                        "response_headers": response_headers,
                        "status_code": status_code
                    }, f)

            async def get_delayed_content(delay: float = 5.0) -> str:
                if self.verbose:
                    print(f"[LOG] Waiting for {delay} seconds before retrieving content for {url}")
                await asyncio.sleep(delay)
                return await page.content()

            response = AsyncCrawlResponse(
                html=html,
                response_headers=response_headers,
                status_code=status_code,
                screenshot=screenshot_data,
                get_delayed_content=get_delayed_content
            )
            return response
        except Error as e:
            raise Error(f"Failed to crawl {url}: {str(e)}")
        finally:
            if not session_id:
                await page.close()
                await context.close()

    async def crawl_many(self, urls: List[str], **kwargs) -> List[AsyncCrawlResponse]:
        semaphore_count = kwargs.get('semaphore_count', 5)  # Adjust as needed
        semaphore = asyncio.Semaphore(semaphore_count)

        async def crawl_with_semaphore(url):
            async with semaphore:
                return await self.crawl(url, **kwargs)

        tasks = [crawl_with_semaphore(url) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return [result if not isinstance(result, Exception) else str(result) for result in results]

    async def take_screenshot(self, url: str, wait_time=1000) -> str:
        async with await self.browser.new_context(user_agent=self.user_agent) as context:
            page = await context.new_page()
            try:
                await page.goto(url, wait_until="domcontentloaded", timeout=30000)
                # Wait for a specified time (default is 1 second)
                await page.wait_for_timeout(wait_time)
                screenshot = await page.screenshot(full_page=True)
                return base64.b64encode(screenshot).decode('utf-8')
            except Exception as e:
                error_message = f"Failed to take screenshot: {str(e)}"
                print(error_message)

                # Generate an error image
                img = Image.new('RGB', (800, 600), color='black')
                draw = ImageDraw.Draw(img)
                font = ImageFont.load_default()
                draw.text((10, 10), error_message, fill=(255, 255, 255), font=font)

                buffered = BytesIO()
                img.save(buffered, format="JPEG")
                return base64.b64encode(buffered.getvalue()).decode('utf-8')
            finally:
                await page.close()
```
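The bounded-concurrency pattern `crawl_many` uses above — an `asyncio.Semaphore` capping in-flight `crawl` calls inside `asyncio.gather` — can be sketched without Playwright (the `fetch` coroutine below is an illustrative stand-in):

```python
import asyncio

# Sketch of crawl_many's bounded concurrency: at most `limit` coroutines
# hold the semaphore at any moment; gather still returns results in order.
async def fetch(url: str) -> str:
    await asyncio.sleep(0)  # stand-in for real page work
    return f"html:{url}"

async def fetch_many(urls, limit=5):
    sem = asyncio.Semaphore(limit)

    async def bounded(url):
        async with sem:
            return await fetch(url)

    return await asyncio.gather(*(bounded(u) for u in urls))

print(asyncio.run(fetch_many(["a", "b", "c"])))  # -> ['html:a', 'html:b', 'html:c']
```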
```diff
@@ -1,23 +1,45 @@
 import asyncio
-import base64, time
+import base64
+import time
 from abc import ABC, abstractmethod
-from typing import Callable, Dict, Any, List, Optional
+from typing import Callable, Dict, Any, List, Optional, Awaitable
 import os
 from playwright.async_api import async_playwright, Page, Browser, Error
 from io import BytesIO
 from PIL import Image, ImageDraw, ImageFont
-from .utils import sanitize_input_encode, calculate_semaphore_count
-import json, uuid
-import hashlib
 from pathlib import Path
 from playwright.async_api import ProxySettings
 from pydantic import BaseModel
+import hashlib
+import json
+import uuid
+from playwright_stealth import StealthConfig, stealth_async
+
+stealth_config = StealthConfig(
+    webdriver=True,
+    chrome_app=True,
+    chrome_csi=True,
+    chrome_load_times=True,
+    chrome_runtime=True,
+    navigator_languages=True,
+    navigator_plugins=True,
+    navigator_permissions=True,
+    webgl_vendor=True,
+    outerdimensions=True,
+    navigator_hardware_concurrency=True,
+    media_codecs=True,
+)
+
 
 class AsyncCrawlResponse(BaseModel):
     html: str
     response_headers: Dict[str, str]
     status_code: int
     screenshot: Optional[str] = None
+    get_delayed_content: Optional[Callable[[Optional[float]], Awaitable[str]]] = None
+
+    class Config:
+        arbitrary_types_allowed = True
+
 
 class AsyncCrawlerStrategy(ABC):
     @abstractmethod
```
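The new `get_delayed_content` field stores an async callable on the response object; this sketch shows how a caller would use it, with a stdlib `dataclass` standing in for the pydantic model (the real `AsyncCrawlResponse` needs `arbitrary_types_allowed` to hold such a field, and the short sleep here is purely for the demo):

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional

# Stand-in for AsyncCrawlResponse's new field; a dataclass replaces the
# pydantic model so the sketch has no third-party dependencies.
@dataclass
class Response:
    html: str
    get_delayed_content: Optional[Callable[[float], Awaitable[str]]] = None

async def demo() -> str:
    async def delayed(delay: float = 5.0) -> str:
        await asyncio.sleep(min(delay, 0.01))  # shortened wait for the demo
        return "<html>late</html>"

    resp = Response(html="<html>early</html>", get_delayed_content=delayed)
    # Caller can re-fetch content after a delay, e.g. for late-loading JS.
    return await resp.get_delayed_content(0.01)

print(asyncio.run(demo()))  # -> <html>late</html>
```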
```diff
@@ -43,23 +65,30 @@ class AsyncCrawlerStrategy(ABC):
 class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
     def __init__(self, use_cached_html=False, js_code=None, **kwargs):
         self.use_cached_html = use_cached_html
-        self.user_agent = kwargs.get("user_agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
+        self.user_agent = kwargs.get(
+            "user_agent",
+            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
+            "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
+        )
         self.proxy = kwargs.get("proxy")
         self.headless = kwargs.get("headless", True)
-        self.headers = {}
+        self.browser_type = kwargs.get("browser_type", "chromium")
+        self.headers = kwargs.get("headers", {})
         self.sessions = {}
         self.session_ttl = 1800
         self.js_code = js_code
         self.verbose = kwargs.get("verbose", False)
         self.playwright = None
         self.browser = None
+        self.sleep_on_close = kwargs.get("sleep_on_close", False)
         self.hooks = {
             'on_browser_created': None,
             'on_user_agent_updated': None,
             'on_execution_started': None,
             'before_goto': None,
             'after_goto': None,
-            'before_return_html': None
+            'before_return_html': None,
+            'before_retrieve_html': None
         }

     async def __aenter__(self):
```
```diff
@@ -75,12 +104,16 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
         if self.browser is None:
             browser_args = {
                 "headless": self.headless,
-                # "headless": False,
                 "args": [
                     "--disable-gpu",
-                    "--disable-dev-shm-usage",
-                    "--disable-setuid-sandbox",
                     "--no-sandbox",
+                    "--disable-dev-shm-usage",
+                    "--disable-blink-features=AutomationControlled",
+                    "--disable-infobars",
+                    "--window-position=0,0",
+                    "--ignore-certificate-errors",
+                    "--ignore-certificate-errors-spki-list",
+                    # "--headless=new", # Use the new headless mode
                 ]
             }
```
```diff
@@ -89,11 +122,19 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                 proxy_settings = ProxySettings(server=self.proxy)
                 browser_args["proxy"] = proxy_settings
-            self.browser = await self.playwright.chromium.launch(**browser_args)
+
+            # Select the appropriate browser based on the browser_type
+            if self.browser_type == "firefox":
+                self.browser = await self.playwright.firefox.launch(**browser_args)
+            elif self.browser_type == "webkit":
+                self.browser = await self.playwright.webkit.launch(**browser_args)
+            else:
+                self.browser = await self.playwright.chromium.launch(**browser_args)
+
             await self.execute_hook('on_browser_created', self.browser)

     async def close(self):
+        if self.sleep_on_close:
+            await asyncio.sleep(500)
         if self.browser:
             await self.browser.close()
             self.browser = None
```
```diff
@@ -135,12 +176,13 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):

     def _cleanup_expired_sessions(self):
         current_time = time.time()
-        expired_sessions = [sid for sid, (_, _, last_used) in self.sessions.items()
-                            if current_time - last_used > self.session_ttl]
+        expired_sessions = [
+            sid for sid, (_, _, last_used) in self.sessions.items()
+            if current_time - last_used > self.session_ttl
+        ]
         for sid in expired_sessions:
             asyncio.create_task(self.kill_session(sid))

     async def smart_wait(self, page: Page, wait_for: str, timeout: float = 30000):
         wait_for = wait_for.strip()
```
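The TTL sweep in `_cleanup_expired_sessions` reduces to a small pure function over `(context, page, last_used)` tuples; plain tuples stand in for the Playwright objects here:

```python
# Minimal sketch of the session-expiry check: a session is expired when
# more than `ttl` seconds have passed since it was last used.
def expired_ids(sessions: dict, ttl: float, now: float) -> list:
    return [sid for sid, (_, _, last_used) in sessions.items()
            if now - last_used > ttl]

sessions = {"a": (None, None, 0.0), "b": (None, None, 95.0)}
print(expired_ids(sessions, ttl=10.0, now=100.0))  # -> ['a']
```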
```diff
@@ -177,8 +219,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                 return await self.csp_compliant_wait(page, f"() => {{{wait_for}}}", timeout)
             except Error:
                 raise ValueError(f"Invalid wait_for parameter: '{wait_for}'. "
                                  "It should be either a valid CSS selector, a JavaScript function, "
                                  "or explicitly prefixed with 'js:' or 'css:'.")

     async def csp_compliant_wait(self, page: Page, user_wait_function: str, timeout: float = 30000):
         wrapper_js = f"""
```
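A hedged sketch of the dispatch the `wait_for` error message implies — an explicit `css:`/`js:` prefix wins, a function-shaped string is treated as JavaScript, and anything else falls back to a CSS selector. The heuristics here are illustrative, not the library's exact rules:

```python
# Illustrative classifier for smart_wait's wait_for argument; the real
# implementation additionally validates the selector/function itself.
def classify_wait_for(wait_for: str) -> str:
    w = wait_for.strip()
    if w.startswith('css:'):
        return 'css'
    if w.startswith('js:'):
        return 'js'
    if w.startswith('()') or w.startswith('function'):
        return 'js'  # looks like a JavaScript function body
    return 'css'     # default: treat the string as a CSS selector

print(classify_wait_for('css:#my-element'))     # -> css
print(classify_wait_for('js:() => true'))       # -> js
print(classify_wait_for('#my-element'))         # -> css
```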
```diff
@@ -204,6 +246,47 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
             except Exception as e:
                 raise RuntimeError(f"Error in wait condition: {str(e)}")

+    async def process_iframes(self, page):
+        # Find all iframes
+        iframes = await page.query_selector_all('iframe')
+
+        for i, iframe in enumerate(iframes):
+            try:
+                # Add a unique identifier to the iframe
+                await iframe.evaluate(f'(element) => element.id = "iframe-{i}"')
+
+                # Get the frame associated with this iframe
+                frame = await iframe.content_frame()
+
+                if frame:
+                    # Wait for the frame to load
+                    await frame.wait_for_load_state('load', timeout=30000)  # 30 seconds timeout
+
+                    # Extract the content of the iframe's body
+                    iframe_content = await frame.evaluate('() => document.body.innerHTML')
+
+                    # Generate a unique class name for this iframe
+                    class_name = f'extracted-iframe-content-{i}'
+
+                    # Replace the iframe with a div containing the extracted content
+                    _iframe = iframe_content.replace('`', '\\`')
+                    await page.evaluate(f"""
+                        () => {{
+                            const iframe = document.getElementById('iframe-{i}');
+                            const div = document.createElement('div');
+                            div.innerHTML = `{_iframe}`;
+                            div.className = '{class_name}';
+                            iframe.replaceWith(div);
+                        }}
+                    """)
+                else:
+                    print(f"Warning: Could not access content frame for iframe {i}")
+            except Exception as e:
+                print(f"Error processing iframe {i}: {str(e)}")
+
+        # Return the page object
+        return page
+
     async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
         response_headers = {}
         status_code = None
```
```diff
@@ -215,25 +298,70 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
             if not context:
                 context = await self.browser.new_context(
                     user_agent=self.user_agent,
-                    proxy={"server": self.proxy} if self.proxy else None
+                    viewport={"width": 1920, "height": 1080},
+                    proxy={"server": self.proxy} if self.proxy else None,
+                    accept_downloads=True,
+                    java_script_enabled=True
                 )
+                await context.add_cookies([{"name": "cookiesEnabled", "value": "true", "url": url}])
                 await context.set_extra_http_headers(self.headers)
                 page = await context.new_page()
                 self.sessions[session_id] = (context, page, time.time())
         else:
             context = await self.browser.new_context(
                 user_agent=self.user_agent,
-                proxy={"server": self.proxy} if self.proxy else None
+                viewport={"width": 1920, "height": 1080},
+                proxy={"server": self.proxy} if self.proxy else None
             )
             await context.set_extra_http_headers(self.headers)
+
+            if kwargs.get("override_navigator", False) or kwargs.get("simulate_user", False) or kwargs.get("magic", False):
+                # Inject scripts to override navigator properties
+                await context.add_init_script("""
+                    // Pass the Permissions Test.
+                    const originalQuery = window.navigator.permissions.query;
+                    window.navigator.permissions.query = (parameters) => (
+                        parameters.name === 'notifications' ?
+                            Promise.resolve({ state: Notification.permission }) :
+                            originalQuery(parameters)
+                    );
+                    Object.defineProperty(navigator, 'webdriver', {
+                        get: () => undefined
+                    });
+                    window.navigator.chrome = {
+                        runtime: {},
+                        // Add other properties if necessary
+                    };
+                    Object.defineProperty(navigator, 'plugins', {
+                        get: () => [1, 2, 3, 4, 5],
+                    });
+                    Object.defineProperty(navigator, 'languages', {
+                        get: () => ['en-US', 'en'],
+                    });
+                    Object.defineProperty(document, 'hidden', {
+                        get: () => false
+                    });
+                    Object.defineProperty(document, 'visibilityState', {
+                        get: () => 'visible'
+                    });
+                """)
+
             page = await context.new_page()
+            # await stealth_async(page) #, stealth_config)
+
+            # Add console message and error logging
+            if kwargs.get("log_console", False):
+                page.on("console", lambda msg: print(f"Console: {msg.text}"))
+                page.on("pageerror", lambda exc: print(f"Page Error: {exc}"))

         try:
             if self.verbose:
                 print(f"[LOG] 🕸️ Crawling {url} using AsyncPlaywrightCrawlerStrategy...")

             if self.use_cached_html:
-                cache_file_path = os.path.join(Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest())
+                cache_file_path = os.path.join(
+                    Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest()
+                )
                 if os.path.exists(cache_file_path):
                     html = ""
                     with open(cache_file_path, "r") as f:
```
```diff
@@ -243,12 +371,21 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                     meta = json.load(f)
                     response_headers = meta.get("response_headers", {})
                     status_code = meta.get("status_code")
-                    response = AsyncCrawlResponse(html=html, response_headers=response_headers, status_code=status_code)
+                    response = AsyncCrawlResponse(
+                        html=html, response_headers=response_headers, status_code=status_code
+                    )
                     return response

             if not kwargs.get("js_only", False):
                 await self.execute_hook('before_goto', page)
-                response = await page.goto(url, wait_until="domcontentloaded", timeout=60000)
+                response = await page.goto(
+                    url, wait_until="domcontentloaded", timeout=kwargs.get("page_timeout", 60000)
+                )
+
+                # response = await page.goto("about:blank")
+                # await page.evaluate(f"window.location.href = '{url}'")
+
                 await self.execute_hook('after_goto', page)

             # Get status code and headers
```
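The cache-key scheme used throughout these hunks hashes the URL with MD5 and uses the hex digest as a filename under `~/.crawl4ai/cache`:

```python
import hashlib
import os
from pathlib import Path

# How the crawler derives a cache file path from a URL: the MD5 hex
# digest (always 32 hex characters) keys a file in the cache directory.
def cache_path(url: str) -> str:
    digest = hashlib.md5(url.encode()).hexdigest()
    return os.path.join(Path.home(), ".crawl4ai", "cache", digest)

digest = os.path.basename(cache_path("https://example.com"))
print(len(digest))  # -> 32
```

Note that MD5 is fine here because it is used only as a cache key, not for security.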
```diff
@@ -264,45 +401,116 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
             js_code = kwargs.get("js_code", kwargs.get("js", self.js_code))
             if js_code:
                 if isinstance(js_code, str):
-                    r = await page.evaluate(js_code)
+                    await page.evaluate(js_code)
                 elif isinstance(js_code, list):
                     for js in js_code:
                         await page.evaluate(js)
-                # await page.wait_for_timeout(100)
+
                 await page.wait_for_load_state('networkidle')
-                # Check for on execution even
+                # Check for on execution event
                 await self.execute_hook('on_execution_started', page)

-            # New code to handle the wait_for parameter
-            # Example usage:
-            # await crawler.crawl(
-            #     url,
-            #     js_code="// some JavaScript code",
-            #     wait_for="""() => {
-            #         return document.querySelector('#my-element') !== null;
-            #     }"""
-            # )
-            # Example of using a CSS selector:
-            # await crawler.crawl(
-            #     url,
-            #     wait_for="#my-element"
-            # )
+            if kwargs.get("simulate_user", False) or kwargs.get("magic", False):
+                # Simulate user interactions
+                await page.mouse.move(100, 100)
+                await page.mouse.down()
+                await page.mouse.up()
+                await page.keyboard.press('ArrowDown')
+
+            # Handle the wait_for parameter
             wait_for = kwargs.get("wait_for")
             if wait_for:
                 try:
-                    await self.smart_wait(page, wait_for, timeout=kwargs.get("timeout", 30000))
+                    await self.smart_wait(page, wait_for, timeout=kwargs.get("page_timeout", 60000))
                 except Exception as e:
                     raise RuntimeError(f"Wait condition failed: {str(e)}")

+            # Update image dimensions
+            update_image_dimensions_js = """
+            () => {
+                return new Promise((resolve) => {
+                    const filterImage = (img) => {
+                        // Filter out images that are too small
+                        if (img.width < 100 && img.height < 100) return false;
+
+                        // Filter out images that are not visible
+                        const rect = img.getBoundingClientRect();
+                        if (rect.width === 0 || rect.height === 0) return false;
+
+                        // Filter out images with certain class names (e.g., icons, thumbnails)
+                        if (img.classList.contains('icon') || img.classList.contains('thumbnail')) return false;
+
+                        // Filter out images with certain patterns in their src (e.g., placeholder images)
+                        if (img.src.includes('placeholder') || img.src.includes('icon')) return false;
+
+                        return true;
+                    };
+
+                    const images = Array.from(document.querySelectorAll('img')).filter(filterImage);
+                    let imagesLeft = images.length;
+
+                    if (imagesLeft === 0) {
+                        resolve();
+                        return;
+                    }
+
+                    const checkImage = (img) => {
+                        if (img.complete && img.naturalWidth !== 0) {
+                            img.setAttribute('width', img.naturalWidth);
+                            img.setAttribute('height', img.naturalHeight);
+                            imagesLeft--;
+                            if (imagesLeft === 0) resolve();
+                        }
+                    };
+
+                    images.forEach(img => {
+                        checkImage(img);
+                        if (!img.complete) {
+                            img.onload = () => {
+                                checkImage(img);
+                            };
+                            img.onerror = () => {
+                                imagesLeft--;
+                                if (imagesLeft === 0) resolve();
+                            };
+                        }
+                    });
+
+                    // Fallback timeout of 5 seconds
+                    setTimeout(() => resolve(), 5000);
+                });
+            }
+            """
+            await page.evaluate(update_image_dimensions_js)
+
+            # Wait a bit for any onload events to complete
+            await page.wait_for_timeout(100)
+
+            # Process iframes
+            if kwargs.get("process_iframes", False):
+                page = await self.process_iframes(page)
+
+            await self.execute_hook('before_retrieve_html', page)
+            # Check if delay_before_return_html is set then wait for that time
+            delay_before_return_html = kwargs.get("delay_before_return_html")
+            if delay_before_return_html:
+                await asyncio.sleep(delay_before_return_html)
+
             html = await page.content()
-            page = await self.execute_hook('before_return_html', page, html)
+            await self.execute_hook('before_return_html', page, html)
+
+            # Check if kwargs has screenshot=True then take screenshot
+            screenshot_data = None
+            if kwargs.get("screenshot"):
+                screenshot_data = await self.take_screenshot(url)

             if self.verbose:
                 print(f"[LOG] ✅ Crawled {url} successfully!")

             if self.use_cached_html:
-                cache_file_path = os.path.join(Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest())
+                cache_file_path = os.path.join(
+                    Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest()
+                )
                 with open(cache_file_path, "w", encoding="utf-8") as f:
                     f.write(html)
                 # store response headers and status code in cache
```
```diff
@@ -312,67 +520,29 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                     "status_code": status_code
                 }, f)

-            response = AsyncCrawlResponse(html=html, response_headers=response_headers, status_code=status_code)
+            async def get_delayed_content(delay: float = 5.0) -> str:
+                if self.verbose:
+                    print(f"[LOG] Waiting for {delay} seconds before retrieving content for {url}")
+                await asyncio.sleep(delay)
+                return await page.content()
+
+            response = AsyncCrawlResponse(
+                html=html,
+                response_headers=response_headers,
+                status_code=status_code,
+                screenshot=screenshot_data,
+                get_delayed_content=get_delayed_content
+            )
             return response
         except Error as e:
-            raise Error(f"Failed to crawl {url}: {str(e)}")
-        finally:
-            if not session_id:
-                await page.close()
-        # try:
-        #     html = await _crawl()
-        #     return sanitize_input_encode(html)
-        # except Error as e:
-        #     raise Error(f"Failed to crawl {url}: {str(e)}")
-        # except Exception as e:
-        #     raise Exception(f"Failed to crawl {url}: {str(e)}")
-
-    async def execute_js(self, session_id: str, js_code: str, wait_for_js: str = None, wait_for_css: str = None) -> AsyncCrawlResponse:
-        """
-        Execute JavaScript code in a specific session and optionally wait for a condition.
-
-        :param session_id: The ID of the session to execute the JS code in.
-        :param js_code: The JavaScript code to execute.
-        :param wait_for_js: JavaScript condition to wait for after execution.
-        :param wait_for_css: CSS selector to wait for after execution.
-        :return: AsyncCrawlResponse containing the page's HTML and other information.
-        :raises ValueError: If the session does not exist.
-        """
-        if not session_id:
-            raise ValueError("Session ID must be provided")
-
-        if session_id not in self.sessions:
-            raise ValueError(f"No active session found for session ID: {session_id}")
-
-        context, page, last_used = self.sessions[session_id]
-
-        try:
-            await page.evaluate(js_code)
-
-            if wait_for_js:
-                await page.wait_for_function(wait_for_js)
-
-            if wait_for_css:
-                await page.wait_for_selector(wait_for_css)
-
-            # Get the updated HTML content
-            html = await page.content()
-
-            # Get response headers and status code (assuming these are available)
-            response_headers = await page.evaluate("() => JSON.stringify(performance.getEntriesByType('resource')[0].responseHeaders)")
-            status_code = await page.evaluate("() => performance.getEntriesByType('resource')[0].responseStatus")
-
-            # Update the last used time for this session
-            self.sessions[session_id] = (context, page, time.time())
-
-            return AsyncCrawlResponse(html=html, response_headers=response_headers, status_code=status_code)
-        except Error as e:
-            raise Error(f"Failed to execute JavaScript or wait for condition in session {session_id}: {str(e)}")
-
+            raise Error(f"[ERROR] 🚫 crawl(): Failed to crawl {url}: {str(e)}")
+        # finally:
+        #     if not session_id:
+        #         await page.close()
+        #         await context.close()

     async def crawl_many(self, urls: List[str], **kwargs) -> List[AsyncCrawlResponse]:
-        semaphore_count = kwargs.get('semaphore_count', calculate_semaphore_count())
+        semaphore_count = kwargs.get('semaphore_count', 5)  # Adjust as needed
         semaphore = asyncio.Semaphore(semaphore_count)

         async def crawl_with_semaphore(url):
```
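The error handling in `crawl_many` relies on `asyncio.gather(..., return_exceptions=True)` so one failing URL cannot abort the batch; exceptions come back as values and are mapped to their string form. A minimal stand-alone version, with a toy `fetch` in place of `crawl`:

```python
import asyncio

# One bad URL raises; gather collects the exception as a value instead
# of propagating it, and the list comprehension stringifies it.
async def fetch(url: str) -> str:
    if "bad" in url:
        raise ValueError(f"boom: {url}")
    return f"html:{url}"

async def fetch_many(urls):
    results = await asyncio.gather(*(fetch(u) for u in urls),
                                   return_exceptions=True)
    return [r if not isinstance(r, Exception) else str(r) for r in results]

print(asyncio.run(fetch_many(["ok1", "bad2"])))  # -> ['html:ok1', 'boom: bad2']
```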
```diff
@@ -383,11 +553,13 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
         results = await asyncio.gather(*tasks, return_exceptions=True)
         return [result if not isinstance(result, Exception) else str(result) for result in results]

-    async def take_screenshot(self, url: str) -> str:
+    async def take_screenshot(self, url: str, wait_time=1000) -> str:
         async with await self.browser.new_context(user_agent=self.user_agent) as context:
             page = await context.new_page()
             try:
-                await page.goto(url, wait_until="domcontentloaded")
+                await page.goto(url, wait_until="domcontentloaded", timeout=30000)
+                # Wait for a specified time (default is 1 second)
+                await page.wait_for_timeout(wait_time)
                 screenshot = await page.screenshot(full_page=True)
                 return base64.b64encode(screenshot).decode('utf-8')
             except Exception as e:
@@ -405,3 +577,4 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                 return base64.b64encode(buffered.getvalue()).decode('utf-8')
             finally:
                 await page.close()
+
```
```diff
@@ -29,14 +29,31 @@ class AsyncDatabaseManager:
             )
             ''')
             await db.commit()
+            await self.update_db_schema()

-    async def aalter_db_add_screenshot(self, new_column: str = "media"):
+    async def update_db_schema(self):
+        async with aiosqlite.connect(self.db_path) as db:
+            # Check if the 'media' column exists
+            cursor = await db.execute("PRAGMA table_info(crawled_data)")
+            columns = await cursor.fetchall()
+            column_names = [column[1] for column in columns]
+
+            if 'media' not in column_names:
+                await self.aalter_db_add_column('media')
+
+            # Check for other missing columns and add them if necessary
+            for column in ['links', 'metadata', 'screenshot']:
+                if column not in column_names:
+                    await self.aalter_db_add_column(column)
+
+    async def aalter_db_add_column(self, new_column: str):
         try:
             async with aiosqlite.connect(self.db_path) as db:
                 await db.execute(f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""')
                 await db.commit()
+                print(f"Added column '{new_column}' to the database.")
         except Exception as e:
-            print(f"Error altering database to add screenshot column: {e}")
+            print(f"Error altering database to add {new_column} column: {e}")

     async def aget_cached_url(self, url: str) -> Optional[Tuple[str, str, str, str, str, str, str, bool, str]]:
         try:
```
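The schema migration in `update_db_schema` — read existing columns via `PRAGMA table_info`, then `ALTER TABLE ... ADD COLUMN` for any that are missing — can be shown synchronously with the stdlib `sqlite3` module (the real code uses `aiosqlite`):

```python
import sqlite3

# Synchronous sketch of update_db_schema: inspect the table's columns
# and add any that are missing, idempotently.
def ensure_columns(conn, table, wanted):
    cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    for col in wanted:
        if col not in cols:
            conn.execute(f'ALTER TABLE {table} ADD COLUMN {col} TEXT DEFAULT ""')

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crawled_data (url TEXT)")
ensure_columns(conn, "crawled_data", ["media", "links", "metadata", "screenshot"])
print([r[1] for r in conn.execute("PRAGMA table_info(crawled_data)")])
# -> ['url', 'media', 'links', 'metadata', 'screenshot']
```

Running it a second time is a no-op, which is why it is safe to call on every startup.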
@@ -23,17 +23,17 @@ class AsyncWebCrawler:
         self,
         crawler_strategy: Optional[AsyncCrawlerStrategy] = None,
         always_by_pass_cache: bool = False,
-        verbose: bool = False,
+        **kwargs,
     ):
         self.crawler_strategy = crawler_strategy or AsyncPlaywrightCrawlerStrategy(
-            verbose=verbose
+            **kwargs
         )
         self.always_by_pass_cache = always_by_pass_cache
         self.crawl4ai_folder = os.path.join(Path.home(), ".crawl4ai")
         os.makedirs(self.crawl4ai_folder, exist_ok=True)
         os.makedirs(f"{self.crawl4ai_folder}/cache", exist_ok=True)
         self.ready = False
-        self.verbose = verbose
+        self.verbose = kwargs.get("verbose", False)

     async def __aenter__(self):
         await self.crawler_strategy.__aenter__()
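The constructor change above forwards arbitrary keyword arguments to the strategy instead of a single `verbose` flag, reading shared flags back out with `kwargs.get`. A minimal sketch of that forwarding pattern, with stand-in classes (not the real crawl4ai API):

```python
class Strategy:
    """Stand-in for AsyncPlaywrightCrawlerStrategy: accepts arbitrary options."""
    def __init__(self, **kwargs):
        self.verbose = kwargs.get("verbose", False)
        self.options = kwargs  # keep everything for later consumers

class Crawler:
    """Stand-in for AsyncWebCrawler: forwards **kwargs instead of a fixed flag list."""
    def __init__(self, strategy=None, **kwargs):
        self.strategy = strategy or Strategy(**kwargs)
        # Read shared flags without consuming them, so the strategy sees them too
        self.verbose = kwargs.get("verbose", False)
```

The benefit is that new strategy options (e.g. `browser_type`) reach the strategy without touching the crawler's signature.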
@@ -133,8 +133,8 @@ class AsyncWebCrawler:
         except Exception as e:
             if not hasattr(e, "msg"):
                 e.msg = str(e)
-            print(f"[ERROR] 🚫 Failed to crawl {url}, error: {e.msg}")
-            return CrawlResult(url=url, html="", success=False, error_message=e.msg)
+            print(f"[ERROR] 🚫 arun(): Failed to crawl {url}, error: {e.msg}")
+            return CrawlResult(url=url, html="", markdown=f"[ERROR] 🚫 arun(): Failed to crawl {url}, error: {e.msg}", success=False, error_message=e.msg)

     async def arun_many(
         self,
@@ -195,6 +195,7 @@ class AsyncWebCrawler:
             image_description_min_word_threshold=kwargs.get(
                 "image_description_min_word_threshold", IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD
             ),
+            **kwargs,
         )
         if verbose:
             print(
@@ -202,11 +203,11 @@ class AsyncWebCrawler:
             )

             if result is None:
-                raise ValueError(f"Failed to extract content from the website: {url}")
+                raise ValueError(f"Process HTML, Failed to extract content from the website: {url}")
         except InvalidCSSSelectorError as e:
             raise ValueError(str(e))
         except Exception as e:
-            raise ValueError(f"Failed to extract content from the website: {url}, error: {str(e)}")
+            raise ValueError(f"Process HTML, Failed to extract content from the website: {url}, error: {str(e)}")

         cleaned_html = sanitize_input_encode(result.get("cleaned_html", ""))
         markdown = sanitize_input_encode(result.get("markdown", ""))
@@ -16,8 +16,6 @@ from .utils import (
     CustomHTML2Text
 )
-
-
 
 class ContentScrappingStrategy(ABC):
     @abstractmethod
     def scrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
@@ -35,6 +33,7 @@ class WebScrappingStrategy(ContentScrappingStrategy):
         return await asyncio.to_thread(self._get_content_of_website_optimized, url, html, **kwargs)

     def _get_content_of_website_optimized(self, url: str, html: str, word_count_threshold: int = MIN_WORD_THRESHOLD, css_selector: str = None, **kwargs) -> Dict[str, Any]:
+        success = True
         if not html:
             return None
@@ -129,7 +128,7 @@ class WebScrappingStrategy(ContentScrappingStrategy):
             image_size = 0  # int(fetch_image_file_size(img, base_url) or 0)
             image_format = os.path.splitext(img.get('src', ''))[1].lower()
             # Remove . from format
-            image_format = image_format.strip('.')
+            image_format = image_format.strip('.').split('?')[0]
             score = 0
             if height_value:
                 if height_unit == 'px' and height_value > 150:
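`os.path.splitext` keeps everything after the last dot, so a src like `pic.png?w=300` used to yield an "extension" of `png?w=300`; the `.split('?')[0]` fix above drops the query string. A small sketch of the corrected extraction:

```python
import os

def image_extension(src):
    """Bare, lowercase file extension of an image src, with any query string removed."""
    ext = os.path.splitext(src)[1].lower()
    return ext.strip('.').split('?')[0]
```

Without the split, format checks like `image_format in ['jpg', 'png', 'webp']` silently fail for CDN URLs that carry resize parameters.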
@@ -158,6 +157,7 @@ class WebScrappingStrategy(ContentScrappingStrategy):
                 return None
             return {
                 'src': img.get('src', ''),
+                'data-src': img.get('data-src', ''),
                 'alt': img.get('alt', ''),
                 'desc': find_closest_parent_with_useful_text(img),
                 'score': score,
@@ -171,9 +171,11 @@ class WebScrappingStrategy(ContentScrappingStrategy):
             element.extract()
             return False

+        # if element.name == 'img':
+        #     process_image(element, url, 0, 1)
+        #     return True
+
         if element.name in ['script', 'style', 'link', 'meta', 'noscript']:
-            if element.name == 'img':
-                process_image(element, url, 0, 1)
             element.decompose()
             return False
@@ -272,12 +274,46 @@ class WebScrappingStrategy(ContentScrappingStrategy):
             if base64_pattern.match(src):
                 # Replace base64 data with empty string
                 img['src'] = base64_pattern.sub('', src)

+        try:
+            str(body)
+        except Exception as e:
+            # Reset body to the original HTML
+            success = False
+            body = BeautifulSoup(html, 'html.parser')
+
+            # Create a new div with a special ID
+            error_div = body.new_tag('div', id='crawl4ai_error_message')
+            error_div.string = '''
+            Crawl4AI Error: This page is not fully supported.
+
+            Possible reasons:
+            1. The page may have restrictions that prevent crawling.
+            2. The page might not be fully loaded.
+
+            Suggestions:
+            - Try calling the crawl function with these parameters:
+            magic=True,
+            - Set headless=False to visualize what's happening on the page.
+
+            If the issue persists, please check the page's structure and any potential anti-crawling measures.
+            '''
+
+            # Append the error div to the body
+            body.body.append(error_div)
+
+            print("[LOG] 😧 Error: After processing the crawled HTML and removing irrelevant tags, nothing was left in the page. Check the markdown for further details.")
+
         cleaned_html = str(body).replace('\n\n', '\n').replace('  ', ' ')
-        cleaned_html = sanitize_html(cleaned_html)

         h = CustomHTML2Text()
-        h.ignore_links = True
-        markdown = h.handle(cleaned_html)
+        h.ignore_links = not kwargs.get('include_links_on_markdown', False)
+        h.body_width = 0
+        try:
+            markdown = h.handle(cleaned_html)
+        except Exception as e:
+            markdown = h.handle(sanitize_html(cleaned_html))
         markdown = markdown.replace('    ```', '```')
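The markdown conversion above gains a fallback path: convert as-is first, and only sanitize when the fast path throws. A generic sketch of that convert-then-retry shape; `convert` and `sanitize` here are hypothetical stand-ins for `CustomHTML2Text.handle` and `sanitize_html`:

```python
def sanitize(html):
    # Hypothetical stand-in for sanitize_html: drop non-printable characters
    return ''.join(ch for ch in html if ch.isprintable() or ch in '\n\t')

def to_markdown(html, convert):
    """Try the fast path first; on failure, convert a sanitized copy instead."""
    try:
        return convert(html)
    except Exception:
        return convert(sanitize(html))
```

The design choice is to pay the sanitization cost only for pages that actually need it, rather than on every crawl.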
@@ -286,10 +322,11 @@ class WebScrappingStrategy(ContentScrappingStrategy):
             print('Error extracting metadata:', str(e))
             meta = {}

+        cleaned_html = sanitize_html(cleaned_html)
         return {
             'markdown': markdown,
             'cleaned_html': cleaned_html,
-            'success': True,
+            'success': success,
             'media': media,
             'links': links,
             'metadata': meta
@@ -80,6 +80,7 @@ class LLMExtractionStrategy(ExtractionStrategy):
         self.word_token_rate = kwargs.get("word_token_rate", WORD_TOKEN_RATE)
         self.apply_chunking = kwargs.get("apply_chunking", True)
         self.base_url = kwargs.get("base_url", None)
+        self.extra_args = kwargs.get("extra_args", {})
         if not self.apply_chunking:
             self.chunk_token_threshold = 1e9
@@ -111,7 +112,13 @@ class LLMExtractionStrategy(ExtractionStrategy):
                 "{" + variable + "}", variable_values[variable]
             )

-        response = perform_completion_with_backoff(self.provider, prompt_with_variables, self.api_token, base_url=self.base_url)  # , json_response=self.extract_type == "schema")
+        response = perform_completion_with_backoff(
+            self.provider,
+            prompt_with_variables,
+            self.api_token,
+            base_url=self.base_url,
+            extra_args=self.extra_args,
+        )  # , json_response=self.extract_type == "schema")
         try:
             blocks = extract_xml_data(["blocks"], response.choices[0].message.content)['blocks']
             blocks = json.loads(blocks)
@@ -1,4 +1,4 @@
-PROMPT_EXTRACT_BLOCKS = """YHere is the URL of the webpage:
+PROMPT_EXTRACT_BLOCKS = """Here is the URL of the webpage:
 <url>{URL}</url>

 And here is the cleaned HTML content of that webpage:
@@ -79,7 +79,7 @@ To generate the JSON objects:
 2. For each block:
    a. Assign it an index based on its order in the content.
    b. Analyze the content and generate ONE semantic tag that describes what the block is about.
-   c. Extract the text content, EXACTLY SAME AS GIVE DATA, clean it up if needed, and store it as a list of strings in the "content" field.
+   c. Extract the text content, EXACTLY THE SAME AS THE GIVEN DATA, clean it up if needed, and store it as a list of strings in the "content" field.

 3. Ensure that the order of the JSON objects matches the order of the blocks as they appear in the original HTML content.

@@ -131,7 +131,7 @@ def split_and_parse_json_objects(json_string):
|
|||||||
return parsed_objects, unparsed_segments
|
return parsed_objects, unparsed_segments
|
||||||
|
|
||||||
def sanitize_html(html):
|
def sanitize_html(html):
|
||||||
# Replace all weird and special characters with an empty string
|
# Replace all unwanted and special characters with an empty string
|
||||||
sanitized_html = html
|
sanitized_html = html
|
||||||
# sanitized_html = re.sub(r'[^\w\s.,;:!?=\[\]{}()<>\/\\\-"]', '', html)
|
# sanitized_html = re.sub(r'[^\w\s.,;:!?=\[\]{}()<>\/\\\-"]', '', html)
|
||||||
|
|
||||||
@@ -301,7 +301,7 @@ def get_content_of_website(url, html, word_count_threshold = MIN_WORD_THRESHOLD,
     if tag.name != 'img':
         tag.attrs = {}

-    # Extract all img tgas inti [{src: '', alt: ''}]
+    # Extract all img tags into [{src: '', alt: ''}]
     media = {
         'images': [],
         'videos': [],
@@ -339,7 +339,7 @@ def get_content_of_website(url, html, word_count_threshold = MIN_WORD_THRESHOLD,
             img.decompose()

-    # Create a function that replace content of all "pre" tage with its inner text
+    # Create a function that replaces the content of all "pre" tags with their inner text
     def replace_pre_tags_with_text(node):
         for child in node.find_all('pre'):
             # set child inner html to its text
@@ -502,7 +502,7 @@ def get_content_of_website_optimized(url: str, html: str, word_count_threshold:
         current_tag = tag
         while current_tag:
             current_tag = current_tag.parent
-            # Get the text content of the parent tag
+            # Get the text content from the parent tag
             if current_tag:
                 text_content = current_tag.get_text(separator=' ', strip=True)
                 # Check if the text content has at least word_count_threshold
@@ -511,88 +511,88 @@ def get_content_of_website_optimized(url: str, html: str, word_count_threshold:
                 return None

     def process_image(img, url, index, total_images):
         # Check if an image has a valid display and is not inside undesired html elements
         def is_valid_image(img, parent, parent_classes):
             style = img.get('style', '')
             src = img.get('src', '')
             classes_to_check = ['button', 'icon', 'logo']
             tags_to_check = ['button', 'input']
             return all([
                 'display:none' not in style,
                 src,
                 not any(s in var for var in [src, img.get('alt', ''), *parent_classes] for s in classes_to_check),
                 parent.name not in tags_to_check
             ])

         # Score an image for its usefulness
         def score_image_for_usefulness(img, base_url, index, images_count):
             # Function to parse image height/width value and units
             def parse_dimension(dimension):
                 if dimension:
                     match = re.match(r"(\d+)(\D*)", dimension)
                     if match:
                         number = int(match.group(1))
                         unit = match.group(2) or 'px'  # Default unit is 'px' if not specified
                         return number, unit
                 return None, None

             # Fetch image file metadata to extract size and extension
             def fetch_image_file_size(img, base_url):
                 # If src is a relative path, construct the full URL; if not, it may be a CDN URL
                 img_url = urljoin(base_url, img.get('src'))
                 try:
                     response = requests.head(img_url)
                     if response.status_code == 200:
                         return response.headers.get('Content-Length', None)
                     else:
                         print(f"Failed to retrieve file size for {img_url}")
                         return None
                 except InvalidSchema as e:
                     return None
                 finally:
                     return

             image_height = img.get('height')
             height_value, height_unit = parse_dimension(image_height)
             image_width = img.get('width')
             width_value, width_unit = parse_dimension(image_width)
             image_size = 0  # int(fetch_image_file_size(img, base_url) or 0)
             image_format = os.path.splitext(img.get('src', ''))[1].lower()
             # Remove . from format
             image_format = image_format.strip('.')
             score = 0
-            if height_value:
-                if height_unit == 'px' and height_value > 150:
-                    score += 1
-            if height_unit in ['%', 'vh', 'vmin', 'vmax'] and height_value > 30:
-                score += 1
-            if width_value:
-                if width_unit == 'px' and width_value > 150:
-                    score += 1
-            if width_unit in ['%', 'vh', 'vmin', 'vmax'] and width_value > 30:
-                score += 1
-            if image_size > 10000:
-                score += 1
-            if img.get('alt') != '':
-                score += 1
-            if any(image_format == format for format in ['jpg', 'png', 'webp']):
-                score += 1
-            if index / images_count < 0.5:
-                score += 1
-            return score
+            if height_value:
+                if height_unit == 'px' and height_value > 150:
+                    score += 1
+                if height_unit in ['%', 'vh', 'vmin', 'vmax'] and height_value > 30:
+                    score += 1
+            if width_value:
+                if width_unit == 'px' and width_value > 150:
+                    score += 1
+                if width_unit in ['%', 'vh', 'vmin', 'vmax'] and width_value > 30:
+                    score += 1
+            if image_size > 10000:
+                score += 1
+            if img.get('alt') != '':
+                score += 1
+            if any(image_format == format for format in ['jpg', 'png', 'webp']):
+                score += 1
+            if index / images_count < 0.5:
+                score += 1
+            return score

         if not is_valid_image(img, img.parent, img.parent.get('class', [])):
             return None
         score = score_image_for_usefulness(img, url, index, total_images)
         if score <= IMAGE_SCORE_THRESHOLD:
             return None
         return {
-            'src': img.get('src', ''),
+            'src': img.get('src', '').replace('\\"', '"').strip(),
             'alt': img.get('alt', ''),
             'desc': find_closest_parent_with_useful_text(img),
             'score': score,
             'type': 'image'
         }

     def process_element(element: element.PageElement) -> bool:
         try:
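The scoring rubric above awards one point per usefulness signal: large pixel or relative dimensions, a non-trivial file size, alt text, a common raster format, and an early position in the page. A dict-based sketch of the same thresholds, free of BeautifulSoup, for illustration only (the tuple-valued `height`/`width` keys are an assumption of this sketch, not the library's data model):

```python
def score_image(img, index, total):
    """One point per usefulness signal, mirroring the heuristic in the diff."""
    score = 0
    h, h_unit = img.get("height", (None, None))
    w, w_unit = img.get("width", (None, None))
    if h:
        if h_unit == "px" and h > 150:
            score += 1
        if h_unit in ("%", "vh", "vmin", "vmax") and h > 30:
            score += 1
    if w:
        if w_unit == "px" and w > 150:
            score += 1
        if w_unit in ("%", "vh", "vmin", "vmax") and w > 30:
            score += 1
    if img.get("size", 0) > 10000:
        score += 1
    if img.get("alt"):
        score += 1
    if img.get("format") in ("jpg", "png", "webp"):
        score += 1
    if index / total < 0.5:  # images early in the page tend to be content, not chrome
        score += 1
    return score
```

Images scoring at or below a threshold are dropped, which filters out icons, logos, and tracking pixels.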
@@ -692,8 +692,8 @@ def get_content_of_website_optimized(url: str, html: str, word_count_threshold:
     for img in imgs:
         src = img.get('src', '')
         if base64_pattern.match(src):
-            # Replace base64 data with empty string
             img['src'] = base64_pattern.sub('', src)

     cleaned_html = str(body).replace('\n\n', '\n').replace('  ', ' ')
     cleaned_html = sanitize_html(cleaned_html)
@@ -775,7 +775,14 @@ def extract_xml_data(tags, string):
     return data

 # Function to perform the completion with exponential backoff
-def perform_completion_with_backoff(provider, prompt_with_variables, api_token, json_response = False, base_url=None):
+def perform_completion_with_backoff(
+    provider,
+    prompt_with_variables,
+    api_token,
+    json_response=False,
+    base_url=None,
+    **kwargs
+):
     from litellm import completion
     from litellm.exceptions import RateLimitError
     max_attempts = 3
@@ -785,6 +792,9 @@ def perform_completion_with_backoff(provider, prompt_with_variables, api_token,
     if json_response:
         extra_args["response_format"] = { "type": "json_object" }

+    if kwargs.get("extra_args"):
+        extra_args.update(kwargs["extra_args"])
+
     for attempt in range(max_attempts):
         try:
             response = completion(
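`perform_completion_with_backoff` now threads caller-supplied `extra_args` (e.g. custom headers) through to the provider call before retrying. A minimal sketch of the merge-then-retry shape, with `call` standing in for litellm's `completion`:

```python
import time

def completion_with_backoff(call, prompt, json_response=False, max_attempts=3, **kwargs):
    """Merge optional extra args, then retry the call with exponential backoff."""
    extra_args = {}
    if json_response:
        extra_args["response_format"] = {"type": "json_object"}
    if kwargs.get("extra_args"):
        extra_args.update(kwargs["extra_args"])

    for attempt in range(max_attempts):
        try:
            return call(prompt, **extra_args)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(2 ** attempt)  # 1s, 2s, ... between attempts
```

Note that user-supplied `extra_args` are applied last, so they can override the defaults derived from `json_response`.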
@@ -12,6 +12,7 @@ from typing import List
 from concurrent.futures import ThreadPoolExecutor
 from .config import *
 import warnings
+import json
 warnings.filterwarnings("ignore", message='Field "model_name" has conflict with protected namespace "model_".')
@@ -47,8 +47,7 @@
    },
    "outputs": [],
    "source": [
-    "# !pip install \"crawl4ai @ git+https://github.com/unclecode/crawl4ai.git\"\n",
-    "!pip install \"crawl4ai @ git+https://github.com/unclecode/crawl4ai.git@staging\"\n",
+    "!pip install crawl4ai\n",
     "!pip install nest-asyncio\n",
     "!playwright install"
    ]
@@ -714,7 +713,7 @@
    "provenance": []
   },
   "kernelspec": {
-   "display_name": "Python 3",
+   "display_name": "venv",
    "language": "python",
    "name": "python3"
   },
@@ -10,6 +10,7 @@ import time
 import json
 import os
 import re
+from typing import Dict
 from bs4 import BeautifulSoup
 from pydantic import BaseModel, Field
 from crawl4ai import AsyncWebCrawler
@@ -18,6 +19,8 @@ from crawl4ai.extraction_strategy import (
     LLMExtractionStrategy,
 )

+__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
+
 print("Crawl4AI: Advanced Web Crawling and Data Extraction")
 print("GitHub Repository: https://github.com/unclecode/crawl4ai")
 print("Twitter: @unclecode")
@@ -30,7 +33,7 @@ async def simple_crawl():
         result = await crawler.arun(url="https://www.nbcnews.com/business")
         print(result.markdown[:500])  # Print first 500 characters

-async def js_and_css():
+async def simple_example_with_running_js_code():
     print("\n--- Executing JavaScript and Using CSS Selectors ---")
     # New code to handle the wait_for parameter
     wait_for = """() => {
@@ -47,12 +50,21 @@ async def simple_example_with_running_js_code():
         result = await crawler.arun(
             url="https://www.nbcnews.com/business",
             js_code=js_code,
-            # css_selector="article.tease-card",
             # wait_for=wait_for,
             bypass_cache=True,
         )
         print(result.markdown[:500])  # Print first 500 characters

+async def simple_example_with_css_selector():
+    print("\n--- Using CSS Selectors ---")
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        result = await crawler.arun(
+            url="https://www.nbcnews.com/business",
+            css_selector=".wide-tease-item__description",
+            bypass_cache=True,
+        )
+        print(result.markdown[:500])  # Print first 500 characters
+
 async def use_proxy():
     print("\n--- Using a Proxy ---")
     print(
@@ -66,6 +78,28 @@ async def use_proxy():
     # )
     # print(result.markdown[:500])  # Print first 500 characters

+async def capture_and_save_screenshot(url: str, output_path: str):
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        result = await crawler.arun(
+            url=url,
+            screenshot=True,
+            bypass_cache=True
+        )
+
+        if result.success and result.screenshot:
+            import base64
+
+            # Decode the base64 screenshot data
+            screenshot_data = base64.b64decode(result.screenshot)
+
+            # Save the screenshot as a JPEG file
+            with open(output_path, 'wb') as f:
+                f.write(screenshot_data)
+
+            print(f"Screenshot saved successfully to {output_path}")
+        else:
+            print("Failed to capture screenshot")
+
 class OpenAIModelFee(BaseModel):
     model_name: str = Field(..., description="Name of the OpenAI model.")
     input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
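Screenshots come back from the crawler base64-encoded, which is why the new example decodes before writing. The decode-and-write step in isolation, as a tiny sketch:

```python
import base64

def save_screenshot(b64_data, output_path):
    """Decode base64 screenshot data and write it to disk; return bytes written."""
    raw = base64.b64decode(b64_data)
    with open(output_path, 'wb') as f:
        return f.write(raw)
```

Forgetting the decode step produces a file full of base64 text that no image viewer can open, so it is worth checking the roundtrip once.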
@@ -73,27 +107,30 @@ class OpenAIModelFee(BaseModel):
     output_fee: str = Field(
         ..., description="Fee for output token for the OpenAI model."
     )

-async def extract_structured_data_using_llm():
-    print("\n--- Extracting Structured Data with OpenAI ---")
-    print(
-        "Note: Set your OpenAI API key as an environment variable to run this example."
-    )
-    if not os.getenv("OPENAI_API_KEY"):
-        print("OpenAI API key not found. Skipping this example.")
+async def extract_structured_data_using_llm(provider: str, api_token: str = None, extra_headers: Dict[str, str] = None):
+    print(f"\n--- Extracting Structured Data with {provider} ---")
+    if api_token is None and provider != "ollama":
+        print(f"API token is required for {provider}. Skipping this example.")
         return

+    extra_args = {}
+    if extra_headers:
+        extra_args["extra_headers"] = extra_headers
+
     async with AsyncWebCrawler(verbose=True) as crawler:
         result = await crawler.arun(
             url="https://openai.com/api/pricing/",
             word_count_threshold=1,
             extraction_strategy=LLMExtractionStrategy(
-                provider="openai/gpt-4o",
-                api_token=os.getenv("OPENAI_API_KEY"),
+                provider=provider,
+                api_token=api_token,
                 schema=OpenAIModelFee.schema(),
                 extraction_type="schema",
                 instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
                 Do not miss any models in the entire content. One extracted model JSON format should look like this:
                 {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}.""",
+                extra_args=extra_args
             ),
             bypass_cache=True,
         )
@@ -320,6 +357,40 @@ async def crawl_dynamic_content_pages_method_3():
         await crawler.crawler_strategy.kill_session(session_id)
     print(f"Successfully crawled {len(all_commits)} commits across 3 pages")

+async def crawl_custom_browser_type():
+    # Use Firefox
+    start = time.time()
+    async with AsyncWebCrawler(browser_type="firefox", verbose=True, headless=True) as crawler:
+        result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
+        print(result.markdown[:500])
+    print("Time taken: ", time.time() - start)
+
+    # Use WebKit
+    start = time.time()
+    async with AsyncWebCrawler(browser_type="webkit", verbose=True, headless=True) as crawler:
+        result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
+        print(result.markdown[:500])
+    print("Time taken: ", time.time() - start)
+
+    # Use Chromium (default)
+    start = time.time()
+    async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
+        result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
+        print(result.markdown[:500])
+    print("Time taken: ", time.time() - start)
+
+async def crawl_with_user_simultion():
+    async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
+        url = "YOUR-URL-HERE"
+        result = await crawler.arun(
+            url=url,
+            bypass_cache=True,
+            simulate_user=True,  # Causes a series of random mouse movements and clicks to simulate user interaction
+            override_navigator=True  # Overrides the navigator object to make it look like a real user
+        )
+
+        print(result.markdown)
+
 async def speed_comparison():
     # print("\n--- Speed Comparison ---")
     # print("Firecrawl (simulated):")
@@ -387,13 +458,31 @@ async def speed_comparison():
|
|||||||
|
|
||||||
async def main():
|
async def main():
|
||||||
await simple_crawl()
|
await simple_crawl()
|
||||||
await js_and_css()
|
await simple_example_with_running_js_code()
|
||||||
|
await simple_example_with_css_selector()
|
||||||
await use_proxy()
|
await use_proxy()
|
||||||
|
await capture_and_save_screenshot("https://www.example.com", os.path.join(__location__, "tmp/example_screenshot.jpg"))
|
||||||
await extract_structured_data_using_css_extractor()
|
await extract_structured_data_using_css_extractor()
|
||||||
|
|
||||||
|
# LLM extraction examples
|
||||||
await extract_structured_data_using_llm()
|
await extract_structured_data_using_llm()
|
||||||
|
await extract_structured_data_using_llm("huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct", os.getenv("HUGGINGFACE_API_KEY"))
|
||||||
|
await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"))
|
||||||
|
await extract_structured_data_using_llm("ollama/llama3.2")
|
||||||
|
|
||||||
|
# You always can pass custom headers to the extraction strategy
|
||||||
|
custom_headers = {
|
||||||
|
"Authorization": "Bearer your-custom-token",
|
||||||
|
"X-Custom-Header": "Some-Value"
|
||||||
|
}
|
||||||
|
await extract_structured_data_using_llm(extra_headers=custom_headers)
|
||||||
|
|
||||||
# await crawl_dynamic_content_pages_method_1()
|
# await crawl_dynamic_content_pages_method_1()
|
||||||
# await crawl_dynamic_content_pages_method_2()
|
# await crawl_dynamic_content_pages_method_2()
|
||||||
await crawl_dynamic_content_pages_method_3()
|
await crawl_dynamic_content_pages_method_3()
|
||||||
|
|
||||||
|
await crawl_custom_browser_type()
|
||||||
|
|
||||||
await speed_comparison()
|
await speed_comparison()
|
||||||
|
|
||||||
|
|
||||||
|
```diff
@@ -8,3 +8,4 @@ playwright==1.47.0
 python-dotenv==1.0.1
 requests>=2.26.0,<2.32.3
 beautifulsoup4==4.12.3
+playwright_stealth==1.0.6
```