feat(scraper): Enhance URL filtering and scoring systems

Implement comprehensive URL filtering and scoring capabilities:

Filters:
- Add URLPatternFilter with glob/regex support
- Implement ContentTypeFilter with MIME type checking
- Add DomainFilter for domain control
- Create FilterChain with stats tracking

Scorers:
- Complete KeywordRelevanceScorer implementation
- Add PathDepthScorer for URL structure scoring
- Implement ContentTypeScorer for file type priorities
- Add FreshnessScorer for date-based scoring
- Add DomainAuthorityScorer for domain weighting
- Create CompositeScorer for combined strategies

Features:
- Add statistics tracking for both filters and scorers
- Implement logging support throughout
- Add resource cleanup methods
- Create comprehensive documentation
- Include performance optimizations

Tests and docs included.

Note: Review URL normalization overlap with recent crawler changes.
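
For reviewers who want a mental model of the filter side of this change, here is a minimal, self-contained sketch of a glob/regex pattern filter, a domain filter, and a chain that tracks pass/reject statistics. It is not the code in `crawl4ai/scraper/filters.py`: the class names (`SketchPatternFilter`, `SketchDomainFilter`, `SketchFilterChain`), the `apply` method, and the `FilterStats` fields are all assumptions made for illustration.

```python
import fnmatch
import re
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class FilterStats:
    """Hypothetical counters; the real stats fields may differ."""
    total: int = 0
    passed: int = 0
    rejected: int = 0


class SketchPatternFilter:
    """Illustrative stand-in for a URL pattern filter (glob strings or compiled regexes)."""

    def __init__(self, patterns):
        self._matchers = []
        for pattern in patterns:
            if isinstance(pattern, re.Pattern):
                # Compiled regexes may match anywhere in the URL.
                self._matchers.append(pattern.search)
            else:
                # fnmatch.translate yields a regex anchored at the end (\Z),
                # so .match() requires the glob to cover the whole URL.
                self._matchers.append(re.compile(fnmatch.translate(pattern)).match)

    def apply(self, url: str) -> bool:
        return any(matcher(url) for matcher in self._matchers)


class SketchDomainFilter:
    """Illustrative stand-in for a domain allow/block filter."""

    def __init__(self, allowed=None, blocked=None):
        self.allowed = set(allowed or [])
        self.blocked = set(blocked or [])

    def apply(self, url: str) -> bool:
        host = urlparse(url).hostname or ""
        if host in self.blocked:
            return False
        # An empty allow-list means "allow every non-blocked host".
        return not self.allowed or host in self.allowed


class SketchFilterChain:
    """Runs every filter in order and records how many URLs passed or were rejected."""

    def __init__(self, filters):
        self.filters = list(filters)
        self.stats = FilterStats()

    def apply(self, url: str) -> bool:
        self.stats.total += 1
        accepted = all(f.apply(url) for f in self.filters)
        if accepted:
            self.stats.passed += 1
        else:
            self.stats.rejected += 1
        return accepted


chain = SketchFilterChain([
    SketchPatternFilter(["*/blog/*", re.compile(r"\.html$")]),
    SketchDomainFilter(allowed=["example.com"]),
])
urls = [
    "https://example.com/blog/post-1",
    "https://example.com/about.html",
    "https://other.org/blog/post-2",
    "https://example.com/contact",
]
kept = [u for u in urls if chain.apply(u)]
print(kept, chain.stats)
```

Because the chain uses `all(...)`, it stops at the first rejecting filter, so putting cheap filters first keeps the per-URL cost low.
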
2024-11-08 19:02:28 +08:00 · 2024-11-08 18:45:12 +08:00 · 2024-11-08 15:57:23 +08:00 · 2024-11-07 18:54:53 +08:00 · 2024-11-06 21:09:47 +08:00 · 2024-11-06 18:44:03 +08:00
26 changed files with 2767 additions and 125 deletions
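
The scorers named in the commit message (keyword relevance, path depth, and a composite that combines them) can be pictured with the sketch below. The function names, the weighting scheme, and the `optimal_depth` parameter are assumptions for illustration only; the actual `scorers` module may compute and combine scores differently.

```python
from typing import Callable, Sequence, Tuple
from urllib.parse import urlparse


def keyword_score(url: str, keywords: Sequence[str]) -> float:
    """Fraction of the keywords that occur in the URL (case-insensitive)."""
    if not keywords:
        return 0.0
    lowered = url.lower()
    return sum(k.lower() in lowered for k in keywords) / len(keywords)


def path_depth_score(url: str, optimal_depth: int = 2) -> float:
    """1.0 at the assumed optimal depth, decaying as the path gets shallower or deeper."""
    depth = len([segment for segment in urlparse(url).path.split("/") if segment])
    return 1.0 / (1.0 + abs(depth - optimal_depth))


def composite_score(
    url: str,
    weighted_scorers: Sequence[Tuple[Callable[[str], float], float]],
) -> float:
    """Weighted sum of the individual scorer outputs."""
    return sum(weight * scorer(url) for scorer, weight in weighted_scorers)


scorers = [
    (lambda u: keyword_score(u, ["python", "tutorial"]), 0.7),
    (lambda u: path_depth_score(u, optimal_depth=2), 0.3),
]
urls = [
    "https://example.com/about",
    "https://example.com/docs/python/tutorial/intro",
]
# Highest-scoring URLs first, as a crawl frontier might order them.
ranked = sorted(urls, key=lambda u: composite_score(u, scorers), reverse=True)
print(ranked)
```

A weighted sum keeps each scorer independent, so weights can be tuned without touching the scorers themselves.
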
--- a/.gitignore
+++ b/.gitignore
@@ -201,3 +201,11 @@ test_env/
 todo.md
 git_changes.py
 git_changes.md
+pypi_build.sh
+git_issues.py
+git_issues.md
+
+.tests/
+.issues/
+.docs/
+.issues/
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,83 @@
 # Changelog

+## [v0.3.6] - 2024-10-12 
+
+### 1. Improved Crawling Control
+- **New Hook**: Added `before_retrieve_html` hook in `AsyncPlaywrightCrawlerStrategy`.
+- **Delayed HTML Retrieval**: Introduced `delay_before_return_html` parameter to allow waiting before retrieving HTML content.
+  - Useful for pages with delayed content loading.
+- **Flexible Timeout**: `smart_wait` function now uses `page_timeout` (default 60 seconds) instead of a fixed 30-second timeout.
+  - Provides better handling for slow-loading pages.
+- **How to use**: Set `page_timeout=your_desired_timeout` (in milliseconds) when calling `crawler.arun()`.
+
+### 2. Browser Type Selection
+- Added support for different browser types (Chromium, Firefox, WebKit).
+- Users can now specify the browser type when initializing AsyncWebCrawler.
+- **How to use**: Set `browser_type="firefox"` or `browser_type="webkit"` when initializing AsyncWebCrawler.
+
+### 3. Screenshot Capture
+- Added ability to capture screenshots during crawling.
+- Useful for debugging and content verification.
+- **How to use**: Set `screenshot=True` when calling `crawler.arun()`.
+
+### 4. Enhanced LLM Extraction Strategy
+- Added support for multiple LLM providers (OpenAI, Hugging Face, Ollama).
+- **Custom Arguments**: Added support for passing extra arguments to LLM providers via `extra_args` parameter.
+- **Custom Headers**: Users can now pass custom headers to the extraction strategy.
+- **How to use**: Specify the desired provider and custom arguments when using `LLMExtractionStrategy`.
+
+### 5. iframe Content Extraction
+- New feature to process and extract content from iframes.
+- **How to use**: Set `process_iframes=True` in the crawl method.
+
+### 6. Delayed Content Retrieval
+- Introduced `get_delayed_content` method in `AsyncCrawlResponse`.
+- Allows retrieval of content after a specified delay, useful for dynamically loaded content.
+- **How to use**: Access `result.get_delayed_content(delay_in_seconds)` after crawling.
+
+## Improvements and Optimizations
+
+### 1. AsyncWebCrawler Enhancements
+- **Flexible Initialization**: Now accepts arbitrary keyword arguments, passed directly to the crawler strategy.
+- Allows for more customized setups.
+
+### 2. Image Processing Optimization
+- Enhanced image handling in WebScrappingStrategy.
+- Added filtering for small, invisible, or irrelevant images.
+- Improved image scoring system for better content relevance.
+- Implemented JavaScript-based image dimension updating for more accurate representation.
+
+### 3. Database Schema Auto-updates
+- Automatic database schema updates ensure compatibility with the latest version.
+
+### 4. Enhanced Error Handling and Logging
+- Improved error messages and logging for easier debugging.
+
+### 5. Content Extraction Refinements
+- Refined HTML sanitization process.
+- Improved handling of base64 encoded images.
+- Enhanced Markdown conversion process.
+- Optimized content extraction algorithms.
+
+### 6. Utility Function Enhancements
+- `perform_completion_with_backoff` function now supports additional arguments for more customized API calls to LLM providers.
+
+## Bug Fixes
+- Fixed an issue where image tags were being prematurely removed during content extraction.
+
+## Examples and Documentation
+- Updated `quickstart_async.py` with examples of:
+  - Using custom headers in LLM extraction.
+  - Different LLM provider usage (OpenAI, Hugging Face, Ollama).
+  - Custom browser type usage.
+
+## Developer Notes
+- Refactored code for better maintainability, flexibility, and performance.
+- Enhanced type hinting throughout the codebase for improved development experience.
+- Expanded error handling for more robust operation.
+
+These updates significantly enhance the flexibility, accuracy, and robustness of crawl4ai, providing users with more control and options for their web crawling and content extraction tasks.
+
 ## [v0.3.5] - 2024-09-02

 Enhance AsyncWebCrawler with smart waiting and screenshot capabilities
--- a/README.md
+++ b/README.md
@@ -10,6 +10,14 @@ Crawl4AI simplifies asynchronous web crawling and data extraction, making it acc

 > Looking for the synchronous version? Check out [README.sync.md](./README.sync.md). You can also access the previous version in the branch [V0.2.76](https://github.com/unclecode/crawl4ai/blob/v0.2.76).

+## New update 0.3.6
+- 🌐 Multi-browser support (Chromium, Firefox, WebKit)
+- 🖼️ Improved image processing with lazy-loading detection
+- 🔧 Custom page timeout parameter for better control over crawling behavior
+- 🕰️ Enhanced handling of delayed content loading
+- 🔑 Custom headers support for LLM interactions
+- 🖼️ iframe content extraction for comprehensive page analysis
+- ⏱️ Flexible timeout and delayed content retrieval options

 ## Try it Now!

@@ -124,7 +132,7 @@ async def main():
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            js_code=js_code,
-            css_selector="article.tease-card",
+            css_selector=".wide-tease-item__description",
            bypass_cache=True
        )
        print(result.extracted_content)
--- a/crawl4ai/__init__.py
+++ b/crawl4ai/__init__.py
@@ -3,7 +3,7 @@
 from .async_webcrawler import AsyncWebCrawler
 from .models import CrawlResult

-__version__ = "0.3.5"
+__version__ = "0.3.6"

 __all__ = [
    "AsyncWebCrawler",
--- a/crawl4ai/async_crawler_strategy.py
+++ b/crawl4ai/async_crawler_strategy.py
@@ -1,7 +1,7 @@
 import asyncio
 import base64, time
 from abc import ABC, abstractmethod
-from typing import Callable, Dict, Any, List, Optional
+from typing import Callable, Dict, Any, List, Optional, Awaitable
 import os
 from playwright.async_api import async_playwright, Page, Browser, Error
 from io import BytesIO
@@ -18,6 +18,10 @@ class AsyncCrawlResponse(BaseModel):
    response_headers: Dict[str, str]
    status_code: int
    screenshot: Optional[str] = None
+    get_delayed_content: Optional[Callable[[Optional[float]], Awaitable[str]]] = None
+
+    class Config:
+        arbitrary_types_allowed = True

 class AsyncCrawlerStrategy(ABC):
    @abstractmethod
@@ -46,7 +50,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
        self.user_agent = kwargs.get("user_agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
        self.proxy = kwargs.get("proxy")
        self.headless = kwargs.get("headless", True)
-        self.headers = {}
+        self.browser_type = kwargs.get("browser_type", "chromium")  # New parameter
+        self.headers = kwargs.get("headers", {})
        self.sessions = {}
        self.session_ttl = 1800 
        self.js_code = js_code
@@ -59,7 +64,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
            'on_execution_started': None,
            'before_goto': None,
            'after_goto': None,
-            'before_return_html': None
+            'before_return_html': None,
+            'before_retrieve_html': None
        }

    async def __aenter__(self):
@@ -75,7 +81,6 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
        if self.browser is None:
            browser_args = {
                "headless": self.headless,
-                # "headless": False,
                "args": [
                    "--disable-gpu",
                    "--disable-dev-shm-usage",
@@ -90,7 +95,14 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                browser_args["proxy"] = proxy_settings
                
                
-            self.browser = await self.playwright.chromium.launch(**browser_args)
+            # Select the appropriate browser based on the browser_type
+            if self.browser_type == "firefox":
+                self.browser = await self.playwright.firefox.launch(**browser_args)
+            elif self.browser_type == "webkit":
+                self.browser = await self.playwright.webkit.launch(**browser_args)
+            else:
+                self.browser = await self.playwright.chromium.launch(**browser_args)
+
            await self.execute_hook('on_browser_created', self.browser)

    async def close(self):
@@ -140,7 +152,6 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
        for sid in expired_sessions:
            asyncio.create_task(self.kill_session(sid))
            
-            
    async def smart_wait(self, page: Page, wait_for: str, timeout: float = 30000):
        wait_for = wait_for.strip()
        
@@ -204,6 +215,48 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
        except Exception as e:
            raise RuntimeError(f"Error in wait condition: {str(e)}")

+    async def process_iframes(self, page):
+        # Find all iframes
+        iframes = await page.query_selector_all('iframe')
+        
+        for i, iframe in enumerate(iframes):
+            try:
+                # Add a unique identifier to the iframe
+                await iframe.evaluate(f'(element) => element.id = "iframe-{i}"')
+                
+                # Get the frame associated with this iframe
+                frame = await iframe.content_frame()
+                
+                if frame:
+                    # Wait for the frame to load
+                    await frame.wait_for_load_state('load', timeout=30000)  # 30 seconds timeout
+                    
+                    # Extract the content of the iframe's body
+                    iframe_content = await frame.evaluate('() => document.body.innerHTML')
+                    
+                    # Generate a unique class name for this iframe
+                    class_name = f'extracted-iframe-content-{i}'
+                    
+                    # Replace the iframe with a div containing the extracted content
+                    _iframe = iframe_content.replace('\\', '\\\\').replace('`', '\\`').replace('${', '\\${')
+                    await page.evaluate(f"""
+                        () => {{
+                            const iframe = document.getElementById('iframe-{i}');
+                            const div = document.createElement('div');
+                            div.innerHTML = `{_iframe}`;
+                            div.className = '{class_name}';
+                            iframe.replaceWith(div);
+                        }}
+                    """)
+                else:
+                    print(f"Warning: Could not access content frame for iframe {i}")
+            except Exception as e:
+                print(f"Error processing iframe {i}: {str(e)}")
+
+        # Return the page object
+        return page
+    
+    
    async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
        response_headers = {}
        status_code = None
@@ -248,7 +301,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):

            if not kwargs.get("js_only", False):
                await self.execute_hook('before_goto', page)
-                response = await page.goto(url, wait_until="domcontentloaded", timeout=60000)
+                response = await page.goto(url, wait_until="domcontentloaded", timeout=kwargs.get("page_timeout", 60000))
                await self.execute_hook('after_goto', page)
                
                # Get status code and headers
@@ -258,6 +311,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                status_code = 200
                response_headers = {}

+
            await page.wait_for_selector('body')
            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")

@@ -291,12 +345,89 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
            wait_for = kwargs.get("wait_for")
            if wait_for:
                try:
-                    await self.smart_wait(page, wait_for, timeout=kwargs.get("timeout", 30000))
+                    await self.smart_wait(page, wait_for, timeout=kwargs.get("page_timeout", 60000))
                except Exception as e:
                    raise RuntimeError(f"Wait condition failed: {str(e)}")

+            # Check if kwargs has screenshot=True then take screenshot
+            screenshot_data = None
+            if kwargs.get("screenshot"):
+                screenshot_data = await self.take_screenshot(url)
+            
+            
+            # New code to update image dimensions
+            update_image_dimensions_js = """
+            () => {
+                return new Promise((resolve) => {
+                    const filterImage = (img) => {
+                        // Filter out images that are too small
+                        if (img.width < 100 && img.height < 100) return false;
+                        
+                        // Filter out images that are not visible
+                        const rect = img.getBoundingClientRect();
+                        if (rect.width === 0 || rect.height === 0) return false;
+                        
+                        // Filter out images with certain class names (e.g., icons, thumbnails)
+                        if (img.classList.contains('icon') || img.classList.contains('thumbnail')) return false;
+                        
+                        // Filter out images with certain patterns in their src (e.g., placeholder images)
+                        if (img.src.includes('placeholder') || img.src.includes('icon')) return false;
+                        
+                        return true;
+                    };
+
+                    const images = Array.from(document.querySelectorAll('img')).filter(filterImage);
+                    let imagesLeft = images.length;
+                    
+                    if (imagesLeft === 0) {
+                        resolve();
+                        return;
+                    }
+
+                    const checkImage = (img) => {
+                        if (img.complete && img.naturalWidth !== 0) {
+                            img.setAttribute('width', img.naturalWidth);
+                            img.setAttribute('height', img.naturalHeight);
+                            imagesLeft--;
+                            if (imagesLeft === 0) resolve();
+                        }
+                    };
+
+                    images.forEach(img => {
+                        checkImage(img);
+                        if (!img.complete) {
+                            img.onload = () => {
+                                checkImage(img);
+                            };
+                            img.onerror = () => {
+                                imagesLeft--;
+                                if (imagesLeft === 0) resolve();
+                            };
+                        }
+                    });
+
+                    // Fallback timeout of 5 seconds
+                    setTimeout(() => resolve(), 5000);
+                });
+            }
+            """
+            await page.evaluate(update_image_dimensions_js)
+
+            # Wait a bit for any onload events to complete
+            await page.wait_for_timeout(100)
+
+            # Process iframes
+            if kwargs.get("process_iframes", False):
+                page = await self.process_iframes(page)
+            
+            await self.execute_hook('before_retrieve_html', page)
+            # Check if delay_before_return_html is set then wait for that time
+            delay_before_return_html = kwargs.get("delay_before_return_html")
+            if delay_before_return_html:
+                await asyncio.sleep(delay_before_return_html)
+                
            html = await page.content()
-            page = await self.execute_hook('before_return_html', page, html)
+            await self.execute_hook('before_return_html', page, html)

            if self.verbose:
                print(f"[LOG] ✅ Crawled {url} successfully!")
@@ -312,7 +443,20 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                        "status_code": status_code
                    }, f)

-            response = AsyncCrawlResponse(html=html, response_headers=response_headers, status_code=status_code)
+            
+            async def get_delayed_content(delay: float = 5.0) -> str:
+                if self.verbose:
+                    print(f"[LOG] Waiting for {delay} seconds before retrieving content for {url}")
+                await asyncio.sleep(delay)
+                return await page.content()
+                
+            response = AsyncCrawlResponse(
+                html=html, 
+                response_headers=response_headers, 
+                status_code=status_code,
+                screenshot=screenshot_data,
+                get_delayed_content=get_delayed_content
+            )
            return response
        except Error as e:
            raise Error(f"Failed to crawl {url}: {str(e)}")
@@ -370,7 +514,6 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
        except Error as e:
            raise Error(f"Failed to execute JavaScript or wait for condition in session {session_id}: {str(e)}")
    
-    
    async def crawl_many(self, urls: List[str], **kwargs) -> List[AsyncCrawlResponse]:
        semaphore_count = kwargs.get('semaphore_count', calculate_semaphore_count())
        semaphore = asyncio.Semaphore(semaphore_count)
@@ -383,11 +526,13 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return [result if not isinstance(result, Exception) else str(result) for result in results]

-    async def take_screenshot(self, url: str) -> str:
+    async def take_screenshot(self, url: str, wait_time = 1000) -> str:
        async with await self.browser.new_context(user_agent=self.user_agent) as context:
            page = await context.new_page()
            try:
-                await page.goto(url, wait_until="domcontentloaded")
+                await page.goto(url, wait_until="domcontentloaded", timeout=30000)
+                # Wait for a specified time (default is 1 second)
+                await page.wait_for_timeout(wait_time)
                screenshot = await page.screenshot(full_page=True)
                return base64.b64encode(screenshot).decode('utf-8')
            except Exception as e:
--- a/crawl4ai/async_database.py
+++ b/crawl4ai/async_database.py
@@ -29,14 +29,31 @@ class AsyncDatabaseManager:
                )
            ''')
            await db.commit()
+        await self.update_db_schema()

-    async def aalter_db_add_screenshot(self, new_column: str = "media"):
+    async def update_db_schema(self):
+        async with aiosqlite.connect(self.db_path) as db:
+            # Check if the 'media' column exists
+            cursor = await db.execute("PRAGMA table_info(crawled_data)")
+            columns = await cursor.fetchall()
+            column_names = [column[1] for column in columns]
+            
+            if 'media' not in column_names:
+                await self.aalter_db_add_column('media')
+            
+            # Check for other missing columns and add them if necessary
+            for column in ['links', 'metadata', 'screenshot']:
+                if column not in column_names:
+                    await self.aalter_db_add_column(column)
+
+    async def aalter_db_add_column(self, new_column: str):
        try:
            async with aiosqlite.connect(self.db_path) as db:
                await db.execute(f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""')
                await db.commit()
+            print(f"Added column '{new_column}' to the database.")
        except Exception as e:
-            print(f"Error altering database to add screenshot column: {e}")
+            print(f"Error altering database to add {new_column} column: {e}")

    async def aget_cached_url(self, url: str) -> Optional[Tuple[str, str, str, str, str, str, str, bool, str]]:
        try:
--- a/crawl4ai/async_webcrawler.py
+++ b/crawl4ai/async_webcrawler.py
@@ -23,17 +23,17 @@ class AsyncWebCrawler:
        self,
        crawler_strategy: Optional[AsyncCrawlerStrategy] = None,
        always_by_pass_cache: bool = False,
-        verbose: bool = False,
+        **kwargs,
    ):
        self.crawler_strategy = crawler_strategy or AsyncPlaywrightCrawlerStrategy(
-            verbose=verbose
+            **kwargs
        )
        self.always_by_pass_cache = always_by_pass_cache
        self.crawl4ai_folder = os.path.join(Path.home(), ".crawl4ai")
        os.makedirs(self.crawl4ai_folder, exist_ok=True)
        os.makedirs(f"{self.crawl4ai_folder}/cache", exist_ok=True)
        self.ready = False
-        self.verbose = verbose
+        self.verbose = kwargs.get("verbose", False)

    async def __aenter__(self):
        await self.crawler_strategy.__aenter__()
@@ -202,11 +202,11 @@ class AsyncWebCrawler:
                )

            if result is None:
-                raise ValueError(f"Failed to extract content from the website: {url}")
+                raise ValueError(f"Process HTML, Failed to extract content from the website: {url}")
        except InvalidCSSSelectorError as e:
            raise ValueError(str(e))
        except Exception as e:
-            raise ValueError(f"Failed to extract content from the website: {url}, error: {str(e)}")
+            raise ValueError(f"Process HTML, Failed to extract content from the website: {url}, error: {str(e)}")

        cleaned_html = sanitize_input_encode(result.get("cleaned_html", ""))
        markdown = sanitize_input_encode(result.get("markdown", ""))
--- a/crawl4ai/content_scrapping_strategy.py
+++ b/crawl4ai/content_scrapping_strategy.py
@@ -16,8 +16,6 @@ from .utils import (
    CustomHTML2Text
 )

-
-
 class ContentScrappingStrategy(ABC):
    @abstractmethod
    def scrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
@@ -129,7 +127,7 @@ class WebScrappingStrategy(ContentScrappingStrategy):
                image_size = 0 #int(fetch_image_file_size(img,base_url) or 0)
                image_format = os.path.splitext(img.get('src',''))[1].lower()
                # Remove . from format
-                image_format = image_format.strip('.')
+                image_format = image_format.strip('.').split('?')[0]
                score = 0
                if height_value:
                    if height_unit == 'px' and height_value > 150:
@@ -158,6 +156,7 @@ class WebScrappingStrategy(ContentScrappingStrategy):
                return None
            return {
                'src': img.get('src', ''),
+                'data-src': img.get('data-src', ''),
                'alt': img.get('alt', ''),
                'desc': find_closest_parent_with_useful_text(img),
                'score': score,
@@ -170,10 +169,12 @@ class WebScrappingStrategy(ContentScrappingStrategy):
                    if isinstance(element, Comment):
                        element.extract()
                    return False
+                
+                # if element.name == 'img':
+                #     process_image(element, url, 0, 1)
+                #     return True

                if element.name in ['script', 'style', 'link', 'meta', 'noscript']:
-                    if element.name == 'img':
-                        process_image(element, url, 0, 1)
                    element.decompose()
                    return False

@@ -273,11 +274,14 @@ class WebScrappingStrategy(ContentScrappingStrategy):
                # Replace base64 data with empty string
                img['src'] = base64_pattern.sub('', src)
        cleaned_html = str(body).replace('\n\n', '\n').replace('  ', ' ')
-        cleaned_html = sanitize_html(cleaned_html)

        h = CustomHTML2Text()
        h.ignore_links = True
-        markdown = h.handle(cleaned_html)
+        h.body_width = 0
+        try:
+            markdown = h.handle(cleaned_html)
+        except Exception:  # fall back to sanitized HTML if conversion fails
+            markdown = h.handle(sanitize_html(cleaned_html))
        markdown = markdown.replace('    ```', '```')

        try:
@@ -286,6 +290,7 @@ class WebScrappingStrategy(ContentScrappingStrategy):
            print('Error extracting metadata:', str(e))
            meta = {}

+        cleaned_html = sanitize_html(cleaned_html)
        return {
            'markdown': markdown,
            'cleaned_html': cleaned_html,
--- a/crawl4ai/extraction_strategy.py
+++ b/crawl4ai/extraction_strategy.py
@@ -80,6 +80,7 @@ class LLMExtractionStrategy(ExtractionStrategy):
        self.word_token_rate = kwargs.get("word_token_rate", WORD_TOKEN_RATE)
        self.apply_chunking = kwargs.get("apply_chunking", True)
        self.base_url = kwargs.get("base_url", None)
+        self.extra_args = kwargs.get("extra_args", {})
        if not self.apply_chunking:
            self.chunk_token_threshold = 1e9
        
@@ -111,7 +112,13 @@ class LLMExtractionStrategy(ExtractionStrategy):
                "{" + variable + "}", variable_values[variable]
            )
        
-        response = perform_completion_with_backoff(self.provider, prompt_with_variables, self.api_token, base_url=self.base_url) # , json_response=self.extract_type == "schema")
+        response = perform_completion_with_backoff(
+            self.provider, 
+            prompt_with_variables, 
+            self.api_token, 
+            base_url=self.base_url,
+            extra_args = self.extra_args
+            ) # , json_response=self.extract_type == "schema")
        try:
            blocks = extract_xml_data(["blocks"], response.choices[0].message.content)['blocks']
            blocks = json.loads(blocks)
--- a/crawl4ai/prompts.py
+++ b/crawl4ai/prompts.py
@@ -1,4 +1,4 @@
-PROMPT_EXTRACT_BLOCKS = """YHere is the URL of the webpage:
+PROMPT_EXTRACT_BLOCKS = """Here is the URL of the webpage:
 <url>{URL}</url>

 And here is the cleaned HTML content of that webpage:
@@ -79,7 +79,7 @@ To generate the JSON objects:
 2. For each block:
   a. Assign it an index based on its order in the content.
   b. Analyze the content and generate ONE semantic tag that describe what the block is about.
-   c. Extract the text content, EXACTLY SAME AS GIVE DATA, clean it up if needed, and store it as a list of strings in the "content" field.
+   c. Extract the text content, EXACTLY THE SAME AS THE GIVEN DATA, clean it up if needed, and store it as a list of strings in the "content" field.

 3. Ensure that the order of the JSON objects matches the order of the blocks as they appear in the original HTML content.

--- a/crawl4ai/scraper/__init__.py
+++ b/crawl4ai/scraper/__init__.py
@@ -0,0 +1,3 @@
+from .async_web_scraper import AsyncWebScraper
+from .bfs_scraper_strategy import BFSScraperStrategy
+from .filters import URLFilter, FilterChain, URLPatternFilter, ContentTypeFilter
--- a/crawl4ai/scraper/async_web_scraper.py
+++ b/crawl4ai/scraper/async_web_scraper.py
@@ -0,0 +1,123 @@
+from typing import Union, AsyncGenerator, Optional
+from .scraper_strategy import ScraperStrategy
+from .models import ScraperResult, CrawlResult
+from ..async_webcrawler import AsyncWebCrawler
+import logging
+from dataclasses import dataclass
+from contextlib import asynccontextmanager
+
+@dataclass
+class ScrapingProgress:
+    """Tracks the progress of a scraping operation."""
+    processed_urls: int = 0
+    failed_urls: int = 0
+    current_url: Optional[str] = None
+
+class AsyncWebScraper:
+    """
+    A high-level web scraper that combines an async crawler with a scraping strategy.
+    
+    Args:
+        crawler (AsyncWebCrawler): The async web crawler implementation
+        strategy (ScraperStrategy): The scraping strategy to use
+        logger (Optional[logging.Logger]): Custom logger for the scraper
+    """
+    
+    def __init__(
+        self, 
+        crawler: AsyncWebCrawler, 
+        strategy: ScraperStrategy,
+        logger: Optional[logging.Logger] = None
+    ):
+        if not isinstance(crawler, AsyncWebCrawler):
+            raise TypeError("crawler must be an instance of AsyncWebCrawler")
+        if not isinstance(strategy, ScraperStrategy):
+            raise TypeError("strategy must be an instance of ScraperStrategy")
+            
+        self.crawler = crawler
+        self.strategy = strategy
+        self.logger = logger or logging.getLogger(__name__)
+        self._progress = ScrapingProgress()
+
+    @property
+    def progress(self) -> ScrapingProgress:
+        """Get current scraping progress."""
+        return self._progress
+
+    @asynccontextmanager
+    async def _error_handling_context(self, url: str):
+        """Context manager for handling errors during scraping."""
+        try:
+            yield
+        except Exception as e:
+            self.logger.error(f"Error scraping {url}: {str(e)}")
+            self._progress.failed_urls += 1
+            raise
+
+    async def ascrape(
+        self, 
+        url: str, 
+        parallel_processing: bool = True, 
+        stream: bool = False
+    ) -> Union[AsyncGenerator[CrawlResult, None], ScraperResult]:
+        """
+        Scrape a website starting from the given URL.
+        
+        Args:
+            url: Starting URL for scraping
+            parallel_processing: Whether to process URLs in parallel
+            stream: If True, yield results as they come; if False, collect all results
+            
+        Returns:
+            Either an async generator yielding CrawlResults or a final ScraperResult
+        """
+        self._progress = ScrapingProgress()  # Reset progress
+        
+        async with self._error_handling_context(url):
+            if stream:
+                return self._ascrape_yielding(url, parallel_processing)
+            return await self._ascrape_collecting(url, parallel_processing)
+
+    async def _ascrape_yielding(
+        self, 
+        url: str, 
+        parallel_processing: bool
+    ) -> AsyncGenerator[CrawlResult, None]:
+        """Stream scraping results as they become available."""
+        try:
+            result_generator = self.strategy.ascrape(url, self.crawler, parallel_processing)
+            async for res in result_generator:
+                self._progress.processed_urls += 1
+                self._progress.current_url = res.url
+                yield res
+        except Exception as e:
+            self.logger.error(f"Error in streaming scrape: {str(e)}")
+            raise
+
+    async def _ascrape_collecting(
+        self, 
+        url: str, 
+        parallel_processing: bool
+    ) -> ScraperResult:
+        """Collect all scraping results before returning."""
+        extracted_data = {}
+        
+        try:
+            result_generator = self.strategy.ascrape(url, self.crawler, parallel_processing)
+            async for res in result_generator:
+                self._progress.processed_urls += 1
+                self._progress.current_url = res.url
+                extracted_data[res.url] = res
+                
+            return ScraperResult(
+                url=url,
+                crawled_urls=list(extracted_data.keys()),
+                extracted_data=extracted_data,
+                stats={
+                    'processed_urls': self._progress.processed_urls,
+                    'failed_urls': self._progress.failed_urls
+                }
+            )
+        except Exception as e:
+            self.logger.error(f"Error in collecting scrape: {str(e)}")
+            raise
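Reviewer sketch: the `stream` switch in `ascrape` above returns either the async generator itself or a collected result. The following minimal, self-contained illustration shows that calling pattern under stated assumptions: `fake_strategy`, `collect`, `scrape`, and `demo` are hypothetical stand-ins, not part of crawl4ai, and plain strings take the place of `CrawlResult` objects.

```python
import asyncio
from typing import AsyncGenerator, Dict, List, Tuple, Union


async def fake_strategy(start_url: str) -> AsyncGenerator[str, None]:
    """Hypothetical stand-in for strategy.ascrape: yields page URLs."""
    for suffix in ("", "/a", "/b"):
        await asyncio.sleep(0)  # hand control back to the event loop
        yield start_url + suffix


async def collect(start_url: str) -> Dict[str, List[str]]:
    """Drain the generator and return everything at once."""
    crawled = [url async for url in fake_strategy(start_url)]
    return {"crawled_urls": crawled}


async def scrape(
    start_url: str, stream: bool = False
) -> Union[AsyncGenerator[str, None], Dict[str, List[str]]]:
    """Return the generator itself when streaming, else the collected result."""
    if stream:
        return fake_strategy(start_url)  # caller iterates with `async for`
    return await collect(start_url)


async def demo() -> Tuple[List[str], Dict[str, List[str]]]:
    streamed = []
    # The coroutine must be awaited first; the awaited value is then iterated.
    async for url in await scrape("https://example.com", stream=True):
        streamed.append(url)
    collected = await scrape("https://example.com", stream=False)
    return streamed, collected


streamed, collected = asyncio.run(demo())
```

Note that in streaming mode the caller needs `async for ... in await scrape(...)`, because the outer coroutine has to be awaited before the generator it returns can be consumed.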
--- a/crawl4ai/scraper/bfs_scraper_strategy.py
+++ b/crawl4ai/scraper/bfs_scraper_strategy.py
@@ -0,0 +1,327 @@
+from abc import ABC, abstractmethod
+from typing import Union, AsyncGenerator, Optional, Dict, Set
+from dataclasses import dataclass
+from datetime import datetime
+import asyncio
+import logging
+from urllib.parse import urljoin, urlparse, urlunparse
+from urllib.robotparser import RobotFileParser
+import validators
+import time
+from aiolimiter import AsyncLimiter
+from tenacity import retry, stop_after_attempt, wait_exponential
+from collections import defaultdict
+
+from .models import ScraperResult, CrawlResult
+from .filters import FilterChain
+from .scorers import URLScorer
+from ..async_webcrawler import AsyncWebCrawler
+
+@dataclass
+class CrawlStats:
+    """Statistics for the crawling process"""
+    start_time: datetime
+    urls_processed: int = 0
+    urls_failed: int = 0
+    urls_skipped: int = 0
+    total_depth_reached: int = 0
+    current_depth: int = 0
+    robots_blocked: int = 0
+    end_time: Optional[datetime] = None  # set when ascrape() finishes
+
+class ScraperStrategy(ABC):
+    """Base class for scraping strategies"""
+    
+    @abstractmethod
+    async def ascrape(
+        self, 
+        url: str, 
+        crawler: AsyncWebCrawler, 
+        parallel_processing: bool = True,
+        stream: bool = False
+    ) -> Union[AsyncGenerator[CrawlResult, None], ScraperResult]:
+        """Abstract method for scraping implementation"""
+        pass
+
+    @abstractmethod
+    async def can_process_url(self, url: str) -> bool:
+        """Check if URL can be processed based on strategy rules"""
+        pass
+
+    @abstractmethod
+    async def shutdown(self):
+        """Clean up resources used by the strategy"""
+        pass
+
+class BFSScraperStrategy(ScraperStrategy):
+    """Breadth-First Search scraping strategy with politeness controls"""
+
+    def __init__(
+        self,
+        max_depth: int,
+        filter_chain: FilterChain,
+        url_scorer: URLScorer,
+        max_concurrent: int = 5,
+        min_crawl_delay: int = 1,
+        timeout: int = 30,
+        logger: Optional[logging.Logger] = None
+    ):
+        self.max_depth = max_depth
+        self.filter_chain = filter_chain
+        self.url_scorer = url_scorer
+        self.max_concurrent = max_concurrent
+        self.min_crawl_delay = min_crawl_delay
+        self.timeout = timeout
+        self.logger = logger or logging.getLogger(__name__)
+        
+        # Crawl control
+        self.stats = CrawlStats(start_time=datetime.now())
+        self._cancel_event = asyncio.Event()
+        self.process_external_links = False
+        
+        # Rate limiting and politeness
+        self.rate_limiter = AsyncLimiter(1, 1)
+        self.last_crawl_time = defaultdict(float)
+        self.robot_parsers: Dict[str, RobotFileParser] = {}
+        self.domain_queues: Dict[str, asyncio.Queue] = defaultdict(asyncio.Queue)
+
+    async def can_process_url(self, url: str) -> bool:
+        """Check if URL can be processed based on robots.txt and filters
+        This is our gatekeeper method that determines if a URL should be processed. It:
+            - Validates URL format using the validators library
+            - Checks robots.txt permissions for the domain
+            - Applies custom filters from the filter chain
+            - Updates statistics for blocked URLs
+            - Returns False early if any check fails
+        """
+        if not validators.url(url):
+            self.logger.warning(f"Invalid URL: {url}")
+            return False
+
+        robot_parser = await self._get_robot_parser(url)
+        if robot_parser and not robot_parser.can_fetch("*", url):
+            self.stats.robots_blocked += 1
+            self.logger.info(f"Blocked by robots.txt: {url}")
+            return False
+
+        return self.filter_chain.apply(url)
+
+    async def _get_robot_parser(self, url: str) -> Optional[RobotFileParser]:
+        """Get or create robots.txt parser for domain.
+            This is our robots.txt manager that:
+                - Uses domain-level caching of robot parsers
+                - Creates and caches new parsers as needed
+                - Handles failed robots.txt fetches gracefully
+                - Returns None if robots.txt can't be fetched, allowing crawling to proceed        
+        """
+        domain = urlparse(url).netloc
+        if domain not in self.robot_parsers:
+            parser = RobotFileParser()
+            try:
+                robots_url = f"{urlparse(url).scheme}://{domain}/robots.txt"
+                parser.set_url(robots_url)
+                parser.read()
+                self.robot_parsers[domain] = parser
+            except Exception as e:
+                self.logger.warning(f"Error fetching robots.txt for {domain}: {e}")
+                return None
+        return self.robot_parsers[domain]
+
+    @retry(stop=stop_after_attempt(3), 
+           wait=wait_exponential(multiplier=1, min=4, max=10))
+    async def _crawl_with_retry(
+        self, 
+        crawler: AsyncWebCrawler, 
+        url: str
+    ) -> CrawlResult:
+        """Crawl URL with retry logic"""
+        try:
+            # asyncio.wait_for also works on Python < 3.11, where asyncio.timeout() is unavailable
+            return await asyncio.wait_for(crawler.arun(url), timeout=self.timeout)
+        except asyncio.TimeoutError:
+            self.logger.error(f"Timeout crawling {url}")
+            raise
+
+    async def process_url(
+        self,
+        url: str,
+        depth: int,
+        crawler: AsyncWebCrawler,
+        queue: asyncio.PriorityQueue,
+        visited: Set[str],
+        depths: Dict[str, int]
+    ) -> Optional[CrawlResult]:
+        """Process a single URL and extract links.
+        This is our main URL processing workhorse that:
+            - Checks for cancellation
+            - Validates URLs through can_process_url
+            - Implements politeness delays per domain
+            - Applies rate limiting
+            - Handles crawling with retries
+            - Updates various statistics
+            - Processes extracted links
+            - Returns the crawl result or None on failure
+        """
+        
+        if self._cancel_event.is_set():
+            return None
+            
+        if not await self.can_process_url(url):
+            self.stats.urls_skipped += 1
+            return None
+
+        # Politeness delay
+        domain = urlparse(url).netloc
+        time_since_last = time.time() - self.last_crawl_time[domain]
+        if time_since_last < self.min_crawl_delay:
+            await asyncio.sleep(self.min_crawl_delay - time_since_last)
+        self.last_crawl_time[domain] = time.time()
+
+        # Crawl with rate limiting
+        try:
+            async with self.rate_limiter:
+                result = await self._crawl_with_retry(crawler, url)
+                self.stats.urls_processed += 1
+        except Exception as e:
+            self.logger.error(f"Error crawling {url}: {e}")
+            self.stats.urls_failed += 1
+            return None
+
+        # Process links
+        await self._process_links(result, url, depth, queue, visited, depths)
+        
+        return result
+
+    async def _process_links(
+        self,
+        result: CrawlResult,
+        source_url: str,
+        depth: int,
+        queue: asyncio.PriorityQueue,
+        visited: Set[str],
+        depths: Dict[str, int]
+    ):
+        """Process extracted links from crawl result.
+        This is our link processor that:
+            - Handles both internal and external links
+            - Normalizes URLs (removes fragments)
+            - Checks depth limits
+            - Scores URLs for priority
+            - Updates depth tracking
+            - Adds valid URLs to the queue
+            - Updates maximum depth statistics
+        """
+        # Copy the list so extending it does not mutate result.links["internal"]
+        links_to_process = list(result.links["internal"])
+        if self.process_external_links:
+            links_to_process += result.links["external"]
+        for link in links_to_process:
+            url = link['href']
+            # url = urljoin(source_url, link['href'])
+            # url = urlunparse(urlparse(url)._replace(fragment=""))
+
+            if url not in visited and await self.can_process_url(url):
+                new_depth = depths[source_url] + 1
+                if new_depth <= self.max_depth:
+                    score = self.url_scorer.score(url)
+                    await queue.put((score, new_depth, url))
+                    depths[url] = new_depth
+                    self.stats.total_depth_reached = max(
+                        self.stats.total_depth_reached,
+                        new_depth
+                    )
+
+    async def ascrape(
+        self,
+        start_url: str,
+        crawler: AsyncWebCrawler,
+        parallel_processing: bool = True
+    ) -> AsyncGenerator[CrawlResult, None]:
+        """Implement BFS crawling strategy"""
+        
+        # Initialize crawl state
+        """
+        queue: A priority queue where items are tuples of (score, depth, url)
+            Score: Determines crawling priority (lower = higher priority)
+            Depth: Current distance from start_url
+            URL: The actual URL to crawl
+        visited: Keeps track of URLs we've already seen to avoid cycles
+        depths: Maps URLs to their depths from the start URL
+        pending_tasks: Tracks currently running crawl tasks        
+        """
+        queue = asyncio.PriorityQueue()
+        await queue.put((0, 0, start_url))
+        visited: Set[str] = set()
+        depths = {start_url: 0}
+        pending_tasks = set()
+        
+        try:
+            while (not queue.empty() or pending_tasks) and not self._cancel_event.is_set():
+                """
+                This sets up our main control loop which:
+                    - Continues while there are URLs to process (not queue.empty())
+                    - Or while there are tasks still running (pending_tasks)
+                    - Can be interrupted via cancellation (not self._cancel_event.is_set())
+                """
+                # Start new tasks up to max_concurrent
+                while not queue.empty() and len(pending_tasks) < self.max_concurrent:
+                    """
+                    This section manages task creation:
+                        Checks if we can start more tasks (under max_concurrent limit)
+                        Gets the next URL from the priority queue
+                        Marks URLs as visited immediately to prevent duplicates
+                        Updates current depth in stats
+                        Either:
+                            Creates a new async task (parallel mode)
+                            Processes URL directly (sequential mode)
+                    """
+                    _, depth, url = await queue.get()
+                    if url not in visited:
+                        visited.add(url)
+                        self.stats.current_depth = depth
+                        
+                        if parallel_processing:
+                            task = asyncio.create_task(
+                                self.process_url(url, depth, crawler, queue, visited, depths)
+                            )
+                            pending_tasks.add(task)
+                        else:
+                            result = await self.process_url(
+                                url, depth, crawler, queue, visited, depths
+                            )
+                            if result:
+                                yield result
+
+                # Process completed tasks
+                """
+                This section manages completed tasks:
+                    Waits for any task to complete using asyncio.wait
+                    Uses FIRST_COMPLETED to handle results as soon as they're ready
+                    Yields successful results to the caller
+                    Updates pending_tasks to remove completed ones
+                """
+                if pending_tasks:
+                    done, pending_tasks = await asyncio.wait(
+                        pending_tasks,
+                        return_when=asyncio.FIRST_COMPLETED
+                    )
+                    for task in done:
+                        result = await task
+                        if result:
+                            yield result
+                            
+        except Exception as e:
+            self.logger.error(f"Error in crawl process: {e}")
+            raise
+            
+        finally:
+            # Clean up any remaining tasks
+            for task in pending_tasks:
+                task.cancel()
+            self.stats.end_time = datetime.now()
+
+    async def shutdown(self):
+        """Clean up resources and stop crawling"""
+        self._cancel_event.set()
+        # Clear caches and close connections
+        self.robot_parsers.clear()
+        self.domain_queues.clear()
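Reviewer sketch: the crawl queue above stores `(score, depth, url)` tuples in an `asyncio.PriorityQueue`, which always dequeues the smallest tuple first. The standalone example below shows what that means for ordering. The scores and URLs are made up for illustration, and `drain` is a hypothetical helper; whether scores should be negated before queuing depends on whether the scorer treats larger values as better, which the source does not settle.

```python
import asyncio
from typing import List, Tuple


async def drain(items: List[Tuple[float, int, str]]) -> List[str]:
    """Put (priority, depth, url) tuples on a PriorityQueue and pop them all."""
    queue: asyncio.PriorityQueue = asyncio.PriorityQueue()
    for item in items:
        await queue.put(item)
    order = []
    while not queue.empty():
        _, _, url = await queue.get()
        order.append(url)
    return order


# Hypothetical scores where a larger number means a more relevant URL.
scored = [
    (0.9, 1, "https://example.com/best"),
    (0.2, 1, "https://example.com/worst"),
    (0.5, 1, "https://example.com/middle"),
]

# Raw scores: the lowest score is dequeued first.
raw_order = asyncio.run(drain(scored))

# Negated scores: the highest score is dequeued first.
negated_order = asyncio.run(drain([(-s, d, u) for s, d, u in scored]))
```

With raw scores the least relevant URL is crawled first; negating the score before `queue.put` reverses that order.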
--- a/crawl4ai/scraper/filters.py
+++ b/crawl4ai/scraper/filters.py
@@ -0,0 +1,205 @@
+# from .url_filter import URLFilter, FilterChain
+# from .content_type_filter import ContentTypeFilter
+# from .url_pattern_filter import URLPatternFilter
+
+from abc import ABC, abstractmethod
+from typing import List, Pattern, Set, Union
+import re
+from urllib.parse import urlparse
+import mimetypes
+import logging
+from dataclasses import dataclass
+import fnmatch
+
+@dataclass
+class FilterStats:
+    """Statistics for filter applications"""
+    total_urls: int = 0
+    rejected_urls: int = 0
+    passed_urls: int = 0
+
+class URLFilter(ABC):
+    """Base class for URL filters"""
+    
+    def __init__(self, name: str = None):
+        self.name = name or self.__class__.__name__
+        self.stats = FilterStats()
+        self.logger = logging.getLogger(f"urlfilter.{self.name}")
+
+    @abstractmethod
+    def apply(self, url: str) -> bool:
+        """Apply the filter to a URL"""
+        pass
+
+    def _update_stats(self, passed: bool):
+        """Update filter statistics"""
+        self.stats.total_urls += 1
+        if passed:
+            self.stats.passed_urls += 1
+        else:
+            self.stats.rejected_urls += 1
+
+class FilterChain:
+    """Chain of URL filters."""
+    
+    def __init__(self, filters: List[URLFilter] = None):
+        self.filters = filters or []
+        self.stats = FilterStats()
+        self.logger = logging.getLogger("urlfilter.chain")
+
+    def add_filter(self, filter_: URLFilter) -> 'FilterChain':
+        """Add a filter to the chain"""
+        self.filters.append(filter_)
+        return self  # Enable method chaining
+
+    def apply(self, url: str) -> bool:
+        """Apply all filters in the chain"""
+        self.stats.total_urls += 1
+        
+        for filter_ in self.filters:
+            if not filter_.apply(url):
+                self.stats.rejected_urls += 1
+                self.logger.debug(f"URL {url} rejected by {filter_.name}")
+                return False
+        
+        self.stats.passed_urls += 1
+        return True
+
+class URLPatternFilter(URLFilter):
+    """Filter URLs based on glob patterns or regex.
+    
+    pattern_filter = URLPatternFilter([
+        "*.example.com/*",  # Glob pattern
+        "*/article/*",      # Path pattern
+        re.compile(r"blog-\d+") # Regex pattern
+    ])
+
+    - Supports glob patterns and regex
+    - Multiple patterns per filter
+    - Pattern pre-compilation for performance    
+    """
+    
+    def __init__(self, patterns: Union[str, Pattern, List[Union[str, Pattern]]], 
+                 use_glob: bool = True):
+        super().__init__()
+        self.patterns = [patterns] if isinstance(patterns, (str, Pattern)) else patterns
+        self.use_glob = use_glob
+        self._compiled_patterns = []
+        
+        for pattern in self.patterns:
+            if isinstance(pattern, str) and use_glob:
+                self._compiled_patterns.append(self._glob_to_regex(pattern))
+            else:
+                self._compiled_patterns.append(re.compile(pattern) if isinstance(pattern, str) else pattern)
+
+    def _glob_to_regex(self, pattern: str) -> Pattern:
+        """Convert glob pattern to regex"""
+        return re.compile(fnmatch.translate(pattern))
+
+    def apply(self, url: str) -> bool:
+        """Check if URL matches any of the patterns"""
+        matches = any(pattern.search(url) for pattern in self._compiled_patterns)
+        self._update_stats(matches)
+        return matches
+
+class ContentTypeFilter(URLFilter):
+    """Filter URLs based on expected content type.
+    
+    content_filter = ContentTypeFilter([
+        "text/html",
+        "application/pdf"
+    ], check_extension=True)
+
+    - Filter by MIME types
+    - Extension checking
+    - Support for multiple content types
+    """
+    
+    def __init__(self, allowed_types: Union[str, List[str]], 
+                 check_extension: bool = True):
+        super().__init__()
+        self.allowed_types = [allowed_types] if isinstance(allowed_types, str) else allowed_types
+        self.check_extension = check_extension
+        self._normalize_types()
+
+    def _normalize_types(self):
+        """Normalize content type strings"""
+        self.allowed_types = [t.lower() for t in self.allowed_types]
+
+    def _check_extension(self, url: str) -> bool:
+        """Check URL's file extension"""
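+        # Look only at the last path segment so dots in directories (e.g. /v1.2/page)
+        # are not mistaken for a file extension
+        path = urlparse(url).path
+        if '.' not in path.rsplit('/', 1)[-1]:
+            return True  # No extension, might be dynamic content
+            
+        # Guess from the path alone so query strings and fragments are ignored
+        guessed_type = mimetypes.guess_type(path)[0]
+        return any(allowed in (guessed_type or '').lower() for allowed in self.allowed_types)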
+
+    def apply(self, url: str) -> bool:
+        """Check if URL's content type is allowed"""
+        result = True
+        if self.check_extension:
+            result = self._check_extension(url)
+        self._update_stats(result)
+        return result
+
+class DomainFilter(URLFilter):
+    """Filter URLs based on allowed/blocked domains.
+    
+    domain_filter = DomainFilter(
+        allowed_domains=["example.com", "blog.example.com"],
+        blocked_domains=["ads.example.com"]
+    )
+
+    - Allow/block specific domains
+    - Exact, case-insensitive host matching (subdomains must be listed explicitly)
+    - Set-based lookup for efficient domain matching
+    """
+    
+    def __init__(self, allowed_domains: Union[str, List[str]] = None, 
+                 blocked_domains: Union[str, List[str]] = None):
+        super().__init__()
+        self.allowed_domains = set(self._normalize_domains(allowed_domains)) if allowed_domains else None
+        self.blocked_domains = set(self._normalize_domains(blocked_domains)) if blocked_domains else set()
+
+    def _normalize_domains(self, domains: Union[str, List[str]]) -> List[str]:
+        """Normalize domain strings"""
+        if isinstance(domains, str):
+            domains = [domains]
+        return [d.lower().strip() for d in domains]
+
+    def _extract_domain(self, url: str) -> str:
+        """Extract domain from URL"""
+        return urlparse(url).netloc.lower()
+
+    def apply(self, url: str) -> bool:
+        """Check if URL's domain is allowed"""
+        domain = self._extract_domain(url)
+        
+        if domain in self.blocked_domains:
+            self._update_stats(False)
+            return False
+            
+        if self.allowed_domains is not None and domain not in self.allowed_domains:
+            self._update_stats(False)
+            return False
+            
+        self._update_stats(True)
+        return True
+
+# Example usage:
+def create_common_filter_chain() -> FilterChain:
+    """Create a commonly used filter chain"""
+    return FilterChain([
+        URLPatternFilter([
+            "*.html", "*.htm",  # HTML files
+            "*/article/*", "*/blog/*"  # Common content paths
+        ]),
+        ContentTypeFilter([
+            "text/html",
+            "application/xhtml+xml"
+        ]),
+        DomainFilter(
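+            # NOTE: DomainFilter compares hosts by exact equality, so these
+            # wildcard entries will not match real hosts such as ads.example.com.
+            blocked_domains=["ads.*", "analytics.*"]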
+        )
+    ])
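Reviewer sketch: two matching behaviours in the filters above are easy to misread. `fnmatch.translate` anchors its regex at the end of the string, so a glob such as `*.html` stops matching once a query string follows. Separately, `DomainFilter` does exact host comparison, so subdomains are not covered automatically. The snippet below illustrates both; `host_matches` is a hypothetical helper showing one possible suffix-based check, not something this change provides.

```python
import fnmatch
import re
from urllib.parse import urlparse

# fnmatch.translate anchors the pattern at the end of the string, so with
# search() the glob must still account for everything up to the last character.
html_glob = re.compile(fnmatch.translate("*.html"))
matches_plain = bool(html_glob.search("https://example.com/index.html"))
matches_query = bool(html_glob.search("https://example.com/index.html?page=2"))


def host_matches(url: str, domain: str) -> bool:
    """Hypothetical helper: True for the domain itself or any of its subdomains."""
    host = (urlparse(url).hostname or "").lower()  # hostname drops any port
    domain = domain.lower().strip()
    return host == domain or host.endswith("." + domain)
```

The leading dot in `endswith("." + domain)` keeps `notexample.com` from being treated as a subdomain of `example.com`.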
--- a/crawl4ai/scraper/models.py
+++ b/crawl4ai/scraper/models.py
@@ -0,0 +1,8 @@
+from pydantic import BaseModel
+from typing import List, Dict
+from ..models import CrawlResult
+
+class ScraperResult(BaseModel):
+    url: str
+    crawled_urls: List[str]
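+    extracted_data: Dict[str,CrawlResult]
+    # Passed by the scraper as stats={...}; without a declared field pydantic ignores it
+    stats: Dict[str, int] = {}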
--- a/crawl4ai/scraper/scorers.py
+++ b/crawl4ai/scraper/scorers.py
@@ -0,0 +1,268 @@
+# from .url_scorer import URLScorer
+# from .keyword_relevance_scorer import KeywordRelevanceScorer
+
+from abc import ABC, abstractmethod
+from typing import List, Dict, Optional, Union
+from dataclasses import dataclass
+from urllib.parse import urlparse, unquote
+import re
+from collections import defaultdict
+import math
+import logging
+
+@dataclass
+class ScoringStats:
+    """Statistics for URL scoring"""
+    urls_scored: int = 0
+    total_score: float = 0.0
+    min_score: float = float('inf')
+    max_score: float = float('-inf')
+    
+    def update(self, score: float):
+        """Update scoring statistics"""
+        self.urls_scored += 1
+        self.total_score += score
+        self.min_score = min(self.min_score, score)
+        self.max_score = max(self.max_score, score)
+    
+    @property
+    def average_score(self) -> float:
+        """Calculate average score"""
+        return self.total_score / self.urls_scored if self.urls_scored > 0 else 0.0
+
+class URLScorer(ABC):
+    """Base class for URL scoring strategies"""
+    
+    def __init__(self, weight: float = 1.0, name: str = None):
+        self.weight = weight
+        self.name = name or self.__class__.__name__
+        self.stats = ScoringStats()
+        self.logger = logging.getLogger(f"urlscorer.{self.name}")
+
+    @abstractmethod
+    def _calculate_score(self, url: str) -> float:
+        """Calculate the raw score for a URL"""
+        pass
+
+    def score(self, url: str) -> float:
+        """Calculate the weighted score for a URL"""
+        raw_score = self._calculate_score(url)
+        weighted_score = raw_score * self.weight
+        self.stats.update(weighted_score)
+        return weighted_score
+
+class CompositeScorer(URLScorer):
+    """Combines multiple scorers with weights"""
+    
+    def __init__(self, scorers: List[URLScorer], normalize: bool = True):
+        super().__init__(name="CompositeScorer")
+        self.scorers = scorers
+        self.normalize = normalize
+
+    def _calculate_score(self, url: str) -> float:
+        scores = [scorer.score(url) for scorer in self.scorers]
+        total_score = sum(scores)
+        
+        if self.normalize and scores:
+            total_score /= len(scores)
+            
+        return total_score
+
+class KeywordRelevanceScorer(URLScorer):
+    """Score URLs based on keyword relevance.
+
+    keyword_scorer = KeywordRelevanceScorer(
+        keywords=["python", "programming"],
+        weight=1.0,
+        case_sensitive=False
+    )
+
+    - Score based on keyword matches
+    - Case sensitivity options
+    - Weighted scoring
+    """
+    
+    def __init__(self, keywords: List[str], weight: float = 1.0,
+                 case_sensitive: bool = False):
+        super().__init__(weight=weight)
+        self.keywords = keywords
+        self.case_sensitive = case_sensitive
+        self._compile_keywords()
+
+    def _compile_keywords(self):
+        """Prepare keywords for matching"""
+        flags = 0 if self.case_sensitive else re.IGNORECASE
+        self.patterns = [re.compile(re.escape(k), flags) for k in self.keywords]
+
+    def _calculate_score(self, url: str) -> float:
+        """Calculate score based on keyword matches"""
+        decoded_url = unquote(url)
+        total_matches = sum(
+            1 for pattern in self.patterns
+            if pattern.search(decoded_url)
+        )
+        # Normalize score between 0 and 1
+        return total_matches / len(self.patterns) if self.patterns else 0.0
+
+class PathDepthScorer(URLScorer):
+    """Score URLs based on their path depth.
+        
+    path_scorer = PathDepthScorer(
+        optimal_depth=3,  # Preferred URL depth
+        weight=0.7
+    )
+
+    - Score based on URL path depth
+    - Configurable optimal depth
+    - Diminishing returns for deeper paths
+    """
+    
+    def __init__(self, optimal_depth: int = 3, weight: float = 1.0):
+        super().__init__(weight=weight)
+        self.optimal_depth = optimal_depth
+
+    def _calculate_score(self, url: str) -> float:
+        """Calculate score based on path depth"""
+        path = urlparse(url).path
+        depth = len([x for x in path.split('/') if x])
+        
+        # Score decreases as we move away from optimal depth
+        distance_from_optimal = abs(depth - self.optimal_depth)
+        return 1.0 / (1.0 + distance_from_optimal)
+
+class ContentTypeScorer(URLScorer):
+    """Score URLs based on content type preferences.
+    
+    content_scorer = ContentTypeScorer({
+        r'\.html$': 1.0,
+        r'\.pdf$': 0.8,
+        r'\.xml$': 0.6
+    })
+
+    - Score based on file types
+    - Configurable type weights
+    - Pattern matching support
+    """
+    
+    def __init__(self, type_weights: Dict[str, float], weight: float = 1.0):
+        super().__init__(weight=weight)
+        self.type_weights = type_weights
+        self._compile_patterns()
+
+    def _compile_patterns(self):
+        """Prepare content type patterns"""
+        self.patterns = {
+            re.compile(pattern): weight
+            for pattern, weight in self.type_weights.items()
+        }
+
+    def _calculate_score(self, url: str) -> float:
+        """Calculate score based on content type matching"""
+        for pattern, weight in self.patterns.items():
+            if pattern.search(url):
+                return weight
+        return 0.0
+
+class FreshnessScorer(URLScorer):
+    """Score URLs based on freshness indicators.
+    
+    freshness_scorer = FreshnessScorer(weight=0.9)
+
+    Score based on date indicators in URLs
+    Multiple date format support
+    Recency weighting"""
+    
+    def __init__(self, weight: float = 1.0):
+        super().__init__(weight=weight)
+        self.date_patterns = [
+            r'/(\d{4})/(\d{2})/(\d{2})/',  # yyyy/mm/dd
+            r'(\d{4})[-_](\d{2})[-_](\d{2})',  # yyyy-mm-dd
+            r'/(\d{4})/',  # year only
+        ]
+        self._compile_patterns()
+
+    def _compile_patterns(self):
+        """Prepare date patterns"""
+        self.compiled_patterns = [re.compile(p) for p in self.date_patterns]
+
+    def _calculate_score(self, url: str) -> float:
+        """Calculate score based on date indicators"""
+        for pattern in self.compiled_patterns:
+            if match := pattern.search(url):
+                year = int(match.group(1))
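+                # Score higher for more recent years (the reference year is hard-coded);
+                # clamp so very old or future years stay within [0, 1]
+                return max(0.0, min(1.0, 1.0 - (2024 - year) * 0.1))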
+        return 0.5  # Default score for URLs without dates
+
+class DomainAuthorityScorer(URLScorer):
+    """Score URLs based on domain authority.
+
+    authority_scorer = DomainAuthorityScorer({
+        "python.org": 1.0,
+        "github.com": 0.9,
+        "medium.com": 0.7
+    })
+
+    Score based on domain importance
+    Configurable domain weights
+    Default weight for unknown domains"""
+    
+    def __init__(self, domain_weights: Dict[str, float], 
+                 default_weight: float = 0.5, weight: float = 1.0):
+        super().__init__(weight=weight)
+        self.domain_weights = domain_weights
+        self.default_weight = default_weight
+
+    def _calculate_score(self, url: str) -> float:
+        """Calculate score based on domain authority"""
+        domain = urlparse(url).netloc.lower()
+        return self.domain_weights.get(domain, self.default_weight)
+
+def create_balanced_scorer() -> CompositeScorer:
+    """Create a balanced composite scorer"""
+    return CompositeScorer([
+        KeywordRelevanceScorer(
+            keywords=["article", "blog", "news", "research"],
+            weight=1.0
+        ),
+        PathDepthScorer(
+            optimal_depth=3,
+            weight=0.7
+        ),
+        ContentTypeScorer(
+            type_weights={
+                r'\.html?$': 1.0,
+                r'\.pdf$': 0.8,
+                r'\.xml$': 0.6
+            },
+            weight=0.8
+        ),
+        FreshnessScorer(
+            weight=0.9
+        )
+    ])
+
+# Example Usage:
+"""
+# Create a composite scorer
+scorer = CompositeScorer([
+    KeywordRelevanceScorer(["python", "programming"], weight=1.0),
+    PathDepthScorer(optimal_depth=2, weight=0.7),
+    FreshnessScorer(weight=0.8),
+    DomainAuthorityScorer(
+        domain_weights={
+            "python.org": 1.0,
+            "github.com": 0.9,
+            "medium.com": 0.7
+        },
+        weight=0.9
+    )
+])
+
+# Score a URL
+score = scorer.score("https://python.org/article/2024/01/new-features")
+
+# Access statistics
+print(f"Average score: {scorer.stats.average_score}")
+print(f"URLs scored: {scorer.stats.urls_scored}")
+"""
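Reviewer sketch: the arithmetic behind the composite score for the example URL above can be checked by hand. The helpers below are simplified, hypothetical re-statements of the formulas in `KeywordRelevanceScorer`, `PathDepthScorer`, and `FreshnessScorer` (only the year-only date pattern is kept). They are not the classes themselves, and the domain weight is filled in as a constant.

```python
import re
from typing import List
from urllib.parse import urlparse


def keyword_score(url: str, keywords: List[str]) -> float:
    """Fraction of keywords that occur in the URL (case-insensitive)."""
    hits = sum(1 for keyword in keywords if keyword.lower() in url.lower())
    return hits / len(keywords) if keywords else 0.0


def path_depth_score(url: str, optimal_depth: int) -> float:
    """1 / (1 + distance from the preferred number of path segments)."""
    depth = len([part for part in urlparse(url).path.split("/") if part])
    return 1.0 / (1.0 + abs(depth - optimal_depth))


def year_score(url: str, reference_year: int = 2024) -> float:
    """Lose 0.1 per year of age; 0.5 when no /yyyy/ segment is present."""
    match = re.search(r"/(\d{4})/", url)
    if not match:
        return 0.5
    return 1.0 - (reference_year - int(match.group(1))) * 0.1


url = "https://python.org/article/2024/01/new-features"
weighted = [
    1.0 * keyword_score(url, ["python", "programming"]),  # 1 of 2 keywords -> 0.5
    0.7 * path_depth_score(url, optimal_depth=2),          # depth 4 -> 1/3
    0.8 * year_score(url),                                 # year 2024 -> 1.0
    0.9 * 1.0,                                             # python.org weight 1.0
]
composite = sum(weighted) / len(weighted)  # normalize=True divides by scorer count
```

Under these assumptions the composite works out to roughly 0.608, since normalization divides the weighted sum by the number of scorers rather than by the sum of their weights.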
--- a/crawl4ai/scraper/scraper_strategy.py
+++ b/crawl4ai/scraper/scraper_strategy.py
@@ -0,0 +1,26 @@
+from abc import ABC, abstractmethod
+from .models import ScraperResult
+from ..models import CrawlResult
+from ..async_webcrawler import AsyncWebCrawler
+from typing import Union, AsyncGenerator
+
+class ScraperStrategy(ABC):
+    @abstractmethod
+    async def ascrape(self, url: str, crawler: AsyncWebCrawler, parallel_processing: bool = True, stream: bool = False) -> Union[AsyncGenerator[CrawlResult, None], ScraperResult]:
+        """Scrape the given URL using the specified crawler.
+
+        Args:
+            url (str): The starting URL for the scrape.
+            crawler (AsyncWebCrawler): The web crawler instance.
+            parallel_processing (bool): Whether to use parallel processing. Defaults to True.
+            stream (bool): If True, yields individual crawl results as they are ready; 
+                                if False, accumulates results and returns a final ScraperResult.
+
+        Yields:
+            CrawlResult: Individual crawl results if stream is True.
+
+        Returns:
+            ScraperResult: A summary of the scrape results containing the final extracted data 
+            and the list of crawled URLs if stream is False.
+        """
+        pass
--- a/crawl4ai/utils.py
+++ b/crawl4ai/utils.py
@@ -131,7 +131,7 @@ def split_and_parse_json_objects(json_string):
    return parsed_objects, unparsed_segments

 def sanitize_html(html):
-    # Replace all weird and special characters with an empty string
+    # Replace all unwanted and special characters with an empty string
    sanitized_html = html
    # sanitized_html = re.sub(r'[^\w\s.,;:!?=\[\]{}()<>\/\\\-"]', '', html)

@@ -301,7 +301,7 @@ def get_content_of_website(url, html, word_count_threshold = MIN_WORD_THRESHOLD,
            if tag.name != 'img':
                tag.attrs = {}

-        # Extract all img tgas inti [{src: '', alt: ''}]
+        # Extract all img tags into [{src: '', alt: ''}]
        media = {
            'images': [],
            'videos': [],
@@ -339,7 +339,7 @@ def get_content_of_website(url, html, word_count_threshold = MIN_WORD_THRESHOLD,
                img.decompose()


-        # Create a function that replace content of all"pre" tage with its inner text
+        # Create a function that replaces the content of all "pre" tags with their inner text
        def replace_pre_tags_with_text(node):
            for child in node.find_all('pre'):
                # set child inner html to its text
@@ -502,7 +502,7 @@ def get_content_of_website_optimized(url: str, html: str, word_count_threshold:
            current_tag = tag
            while current_tag:
                current_tag = current_tag.parent
-                # Get the text content of the parent tag
+                # Get the text content from the parent tag
                if current_tag:
                    text_content = current_tag.get_text(separator=' ',strip=True)
                    # Check if the text content has at least word_count_threshold
@@ -511,88 +511,88 @@ def get_content_of_website_optimized(url: str, html: str, word_count_threshold:
            return None

    def process_image(img, url, index, total_images):
-            #Check if an image has valid display and inside undesired html elements
-            def is_valid_image(img, parent, parent_classes):
-                style = img.get('style', '')
-                src = img.get('src', '')
-                classes_to_check = ['button', 'icon', 'logo']
-                tags_to_check = ['button', 'input']
-                return all([
-                    'display:none' not in style,
-                    src,
-                    not any(s in var for var in [src, img.get('alt', ''), *parent_classes] for s in classes_to_check),
-                    parent.name not in tags_to_check
-                ])
+        # Check that an image is displayed and is not inside undesired HTML elements
+        def is_valid_image(img, parent, parent_classes):
+            style = img.get('style', '')
+            src = img.get('src', '')
+            classes_to_check = ['button', 'icon', 'logo']
+            tags_to_check = ['button', 'input']
+            return all([
+                'display:none' not in style,
+                src,
+                not any(s in var for var in [src, img.get('alt', ''), *parent_classes] for s in classes_to_check),
+                parent.name not in tags_to_check
+            ])

-            #Score an image for it's usefulness
-            def score_image_for_usefulness(img, base_url, index, images_count):
-                # Function to parse image height/width value and units
-                def parse_dimension(dimension):
-                    if dimension:
-                        match = re.match(r"(\d+)(\D*)", dimension)
-                        if match:
-                            number = int(match.group(1))
-                            unit = match.group(2) or 'px'  # Default unit is 'px' if not specified
-                            return number, unit
-                    return None, None
+        # Score an image for its usefulness
+        def score_image_for_usefulness(img, base_url, index, images_count):
+            # Function to parse image height/width value and units
+            def parse_dimension(dimension):
+                if dimension:
+                    match = re.match(r"(\d+)(\D*)", dimension)
+                    if match:
+                        number = int(match.group(1))
+                        unit = match.group(2) or 'px'  # Default unit is 'px' if not specified
+                        return number, unit
+                return None, None

-                # Fetch image file metadata to extract size and extension
-                def fetch_image_file_size(img, base_url):
-                    #If src is relative path construct full URL, if not it may be CDN URL
-                    img_url = urljoin(base_url,img.get('src'))
-                    try:
-                        response = requests.head(img_url)
-                        if response.status_code == 200:
-                            return response.headers.get('Content-Length',None)
-                        else:
-                            print(f"Failed to retrieve file size for {img_url}")
-                            return None
-                    except InvalidSchema as e:
+            # Fetch image file metadata to extract size and extension
+            def fetch_image_file_size(img, base_url):
+                #If src is relative path construct full URL, if not it may be CDN URL
+                img_url = urljoin(base_url,img.get('src'))
+                try:
+                    response = requests.head(img_url)
+                    if response.status_code == 200:
+                        return response.headers.get('Content-Length',None)
+                    else:
+                        print(f"Failed to retrieve file size for {img_url}")
                        return None
-                    finally:
-                        return
+                except InvalidSchema:
+                    return None
+                except requests.RequestException:
+                    return None  # network errors: treat the size as unknown

-                image_height = img.get('height')
-                height_value, height_unit = parse_dimension(image_height)
-                image_width =  img.get('width')
-                width_value, width_unit = parse_dimension(image_width)
-                image_size = 0 #int(fetch_image_file_size(img,base_url) or 0)
-                image_format = os.path.splitext(img.get('src',''))[1].lower()
-                # Remove . from format
-                image_format = image_format.strip('.')
-                score = 0
-                if height_value:
-                    if height_unit == 'px' and height_value > 150:
-                        score += 1
-                    if height_unit in ['%','vh','vmin','vmax'] and height_value >30:
-                        score += 1
-                if width_value:
-                    if width_unit == 'px' and width_value > 150:
-                        score += 1
-                    if width_unit in ['%','vh','vmin','vmax'] and width_value >30:
-                        score += 1
-                if image_size > 10000:
+            image_height = img.get('height')
+            height_value, height_unit = parse_dimension(image_height)
+            image_width =  img.get('width')
+            width_value, width_unit = parse_dimension(image_width)
+            image_size = 0 #int(fetch_image_file_size(img,base_url) or 0)
+            image_format = os.path.splitext(img.get('src',''))[1].lower()
+            # Remove . from format
+            image_format = image_format.strip('.')
+            score = 0
+            if height_value:
+                if height_unit == 'px' and height_value > 150:
                    score += 1
-                if img.get('alt') != '':
-                    score+=1
-                if any(image_format==format for format in ['jpg','png','webp']):
-                    score+=1
-                if index/images_count<0.5:
-                    score+=1
-                return score
+                if height_unit in ['%','vh','vmin','vmax'] and height_value >30:
+                    score += 1
+            if width_value:
+                if width_unit == 'px' and width_value > 150:
+                    score += 1
+                if width_unit in ['%','vh','vmin','vmax'] and width_value >30:
+                    score += 1
+            if image_size > 10000:
+                score += 1
+            if img.get('alt') != '':
+                score+=1
+            if any(image_format==format for format in ['jpg','png','webp']):
+                score+=1
+            if index/images_count<0.5:
+                score+=1
+            return score

-            if not is_valid_image(img, img.parent, img.parent.get('class', [])):
-                return None
-            score = score_image_for_usefulness(img, url, index, total_images)
-            if score <= IMAGE_SCORE_THRESHOLD:
-                return None
-            return {
-                'src': img.get('src', ''),
-                'alt': img.get('alt', ''),
-                'desc': find_closest_parent_with_useful_text(img),
-                'score': score,
-                'type': 'image'
-            }
+        if not is_valid_image(img, img.parent, img.parent.get('class', [])):
+            return None
+        score = score_image_for_usefulness(img, url, index, total_images)
+        if score <= IMAGE_SCORE_THRESHOLD:
+            return None
+        return {
+            'src': img.get('src', '').replace('\\"', '"').strip(),
+            'alt': img.get('alt', ''),
+            'desc': find_closest_parent_with_useful_text(img),
+            'score': score,
+            'type': 'image'
+        }

    def process_element(element: element.PageElement) -> bool:
        try:
@@ -775,7 +775,14 @@ def extract_xml_data(tags, string):
    return data
    
 # Function to perform the completion with exponential backoff
-def perform_completion_with_backoff(provider, prompt_with_variables, api_token, json_response = False, base_url=None):
+def perform_completion_with_backoff(
+    provider, 
+    prompt_with_variables, 
+    api_token, 
+    json_response = False, 
+    base_url=None,
+    **kwargs
+    ):
    from litellm import completion 
    from litellm.exceptions import RateLimitError
    max_attempts = 3
@@ -784,6 +791,9 @@ def perform_completion_with_backoff(provider, prompt_with_variables, api_token,
    extra_args = {}
    if json_response:
        extra_args["response_format"] = { "type": "json_object" }
+        
+    if kwargs.get("extra_args"):
+        extra_args.update(kwargs["extra_args"])
    
    for attempt in range(max_attempts):
        try:
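
Reviewer sketch (not part of the patch): the hunk above merges caller-supplied `extra_args` into the completion arguments before the retry loop. The self-contained example below shows that merge together with a generic exponential backoff. `FakeRateLimitError`, `build_extra_args`, `call_with_backoff`, and the 2-second base delay are assumptions made for illustration; the actual delays and error handling in `perform_completion_with_backoff` are not shown in this diff.

```python
import time
from typing import Any, Callable, Dict


class FakeRateLimitError(Exception):
    """Hypothetical stand-in for litellm's RateLimitError."""


def build_extra_args(json_response: bool = False, **kwargs: Any) -> Dict[str, Any]:
    """Mirror the merge in the hunk: defaults first, then caller overrides."""
    extra_args: Dict[str, Any] = {}
    if json_response:
        extra_args["response_format"] = {"type": "json_object"}
    if kwargs.get("extra_args"):
        extra_args.update(kwargs["extra_args"])
    return extra_args


def call_with_backoff(
    call: Callable[[], Any],
    max_attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Retry `call` on rate-limit errors, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except FakeRateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * (2 ** attempt))
```

Passing `sleep` as a parameter keeps the retry logic testable without real waiting. Because the caller's dictionary is applied last, a caller-supplied `response_format` would override the JSON default.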
--- a/crawl4ai/web_crawler.py
+++ b/crawl4ai/web_crawler.py
@@ -12,6 +12,7 @@ from typing import List
 from concurrent.futures import ThreadPoolExecutor
 from .config import *
 import warnings
+import json
 warnings.filterwarnings("ignore", message='Field "model_name" has conflict with protected namespace "model_".')


--- a/docs/examples/quickstart_async.py
+++ b/docs/examples/quickstart_async.py
@@ -10,6 +10,7 @@ import time
 import json
 import os
 import re
+from typing import Dict
 from bs4 import BeautifulSoup
 from pydantic import BaseModel, Field
 from crawl4ai import AsyncWebCrawler
@@ -18,6 +19,8 @@ from crawl4ai.extraction_strategy import (
    LLMExtractionStrategy,
 )

+__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
+
 print("Crawl4AI: Advanced Web Crawling and Data Extraction")
 print("GitHub Repository: https://github.com/unclecode/crawl4ai")
 print("Twitter: @unclecode")
@@ -30,7 +33,7 @@ async def simple_crawl():
        result = await crawler.arun(url="https://www.nbcnews.com/business")
        print(result.markdown[:500])  # Print first 500 characters

-async def js_and_css():
+async def simple_example_with_running_js_code():
    print("\n--- Executing JavaScript and Using CSS Selectors ---")
    # New code to handle the wait_for parameter
    wait_for = """() => {
@@ -47,12 +50,21 @@ async def js_and_css():
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            js_code=js_code,
-            # css_selector="article.tease-card",
            # wait_for=wait_for,
            bypass_cache=True,
        )
        print(result.markdown[:500])  # Print first 500 characters

+async def simple_example_with_css_selector():
+    print("\n--- Using CSS Selectors ---")
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        result = await crawler.arun(
+            url="https://www.nbcnews.com/business",
+            css_selector=".wide-tease-item__description",
+            bypass_cache=True,
+        )
+        print(result.markdown[:500])  # Print first 500 characters
+
 async def use_proxy():
    print("\n--- Using a Proxy ---")
    print(
@@ -66,6 +78,28 @@ async def use_proxy():
    #     )
    #     print(result.markdown[:500])  # Print first 500 characters

+async def capture_and_save_screenshot(url: str, output_path: str):
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        result = await crawler.arun(
+            url=url,
+            screenshot=True,
+            bypass_cache=True
+        )
+        
+        if result.success and result.screenshot:
+            import base64
+            
+            # Decode the base64 screenshot data
+            screenshot_data = base64.b64decode(result.screenshot)
+            
+            # Save the screenshot as a JPEG file
+            with open(output_path, 'wb') as f:
+                f.write(screenshot_data)
+            
+            print(f"Screenshot saved successfully to {output_path}")
+        else:
+            print("Failed to capture screenshot")
+
 class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
@@ -73,27 +107,30 @@ class OpenAIModelFee(BaseModel):
        ..., description="Fee for output token for the OpenAI model."
    )

-async def extract_structured_data_using_llm():
-    print("\n--- Extracting Structured Data with OpenAI ---")
-    print(
-        "Note: Set your OpenAI API key as an environment variable to run this example."
-    )
-    if not os.getenv("OPENAI_API_KEY"):
-        print("OpenAI API key not found. Skipping this example.")
+async def extract_structured_data_using_llm(provider: str, api_token: str = None, extra_headers: Dict[str, str] = None):
+    print(f"\n--- Extracting Structured Data with {provider} ---")
+    
+    if api_token is None and provider != "ollama":
+        print(f"API token is required for {provider}. Skipping this example.")
        return

+    extra_args = {}
+    if extra_headers:
+        extra_args["extra_headers"] = extra_headers
+
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://openai.com/api/pricing/",
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
-                provider="openai/gpt-4o",
-                api_token=os.getenv("OPENAI_API_KEY"),
+                provider=provider,
+                api_token=api_token,
                schema=OpenAIModelFee.schema(),
                extraction_type="schema",
                instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. 
                Do not miss any models in the entire content. One extracted model JSON format should look like this: 
                {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}.""",
+                extra_args=extra_args
            ),
            bypass_cache=True,
        )
@@ -320,6 +357,28 @@ async def crawl_dynamic_content_pages_method_3():
        await crawler.crawler_strategy.kill_session(session_id)
        print(f"Successfully crawled {len(all_commits)} commits across 3 pages")

+async def crawl_custom_browser_type():
+    # Use Firefox
+    start = time.time()
+    async with AsyncWebCrawler(browser_type="firefox", verbose=True, headless = True) as crawler:
+        result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
+        print(result.markdown[:500])
+        print("Time taken: ", time.time() - start)
+
+    # Use WebKit
+    start = time.time()
+    async with AsyncWebCrawler(browser_type="webkit", verbose=True, headless = True) as crawler:
+        result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
+        print(result.markdown[:500])
+        print("Time taken: ", time.time() - start)
+
+    # Use Chromium (default)
+    start = time.time()
+    async with AsyncWebCrawler(verbose=True, headless = True) as crawler:
+        result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
+        print(result.markdown[:500])
+        print("Time taken: ", time.time() - start)
+
 async def speed_comparison():
    # print("\n--- Speed Comparison ---")
    # print("Firecrawl (simulated):")
@@ -387,13 +446,31 @@ async def speed_comparison():

 async def main():
    await simple_crawl()
-    await js_and_css()
+    await simple_example_with_running_js_code()
+    await simple_example_with_css_selector()
    await use_proxy()
+    await capture_and_save_screenshot("https://www.example.com", os.path.join(__location__, "tmp/example_screenshot.jpg"))
    await extract_structured_data_using_css_extractor()
+
+    # LLM extraction examples
-    await extract_structured_data_using_llm()
+    await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"))
+    await extract_structured_data_using_llm("huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct", os.getenv("HUGGINGFACE_API_KEY"))
+    await extract_structured_data_using_llm("openai/gpt-4", os.getenv("OPENAI_API_KEY"))
+    await extract_structured_data_using_llm("ollama/llama3.2")    
+
+    # You can always pass custom headers to the extraction strategy
+    custom_headers = {
+        "Authorization": "Bearer your-custom-token",
+        "X-Custom-Header": "Some-Value"
+    }
+    await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"), extra_headers=custom_headers)
+    
    # await crawl_dynamic_content_pages_method_1()
    # await crawl_dynamic_content_pages_method_2()
    await crawl_dynamic_content_pages_method_3()
+    
+    await crawl_custom_browser_type()
+    
    await speed_comparison()


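
Reviewer sketch (not part of the patch): `capture_and_save_screenshot` writes to `tmp/example_screenshot.jpg` under `__location__`, and `open()` raises `FileNotFoundError` if that `tmp` directory does not exist yet. A minimal variant of the decode-and-save step that creates the parent directory first could look like this; `save_base64_image` is a hypothetical helper name, not a function in crawl4ai.

```python
import base64
import os


def save_base64_image(encoded: str, output_path: str) -> int:
    """Decode base64 image data, write it to output_path, return bytes written."""
    data = base64.b64decode(encoded)
    parent = os.path.dirname(output_path)
    if parent:
        # Create missing directories such as "tmp/" before opening the file.
        os.makedirs(parent, exist_ok=True)
    with open(output_path, "wb") as f:
        f.write(data)
    return len(data)
```

The helper only covers the file handling; the `result.success and result.screenshot` check from the quickstart would still guard the call.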
--- a/docs/scrapper/async_web_scraper.md
+++ b/docs/scrapper/async_web_scraper.md
@@ -0,0 +1,166 @@
+# AsyncWebScraper: Smart Web Crawling Made Easy
+
+AsyncWebScraper is a powerful and flexible web scraping tool that makes it easy to collect data from websites efficiently. Whether you need to scrape a few pages or an entire website, AsyncWebScraper handles the complexity of web crawling while giving you fine-grained control over the process.
+
+## How It Works
+
+```mermaid
+flowchart TB
+    Start([Start]) --> Init[Initialize AsyncWebScraper\nwith Crawler and Strategy]
+    Init --> InputURL[Receive URL to scrape]
+    InputURL --> Decision{Stream or\nCollect?}
+    
+    %% Streaming Path
+    Decision -->|Stream| StreamInit[Initialize Streaming Mode]
+    StreamInit --> StreamStrategy[Call Strategy.ascrape]
+    StreamStrategy --> AsyncGen[Create Async Generator]
+    AsyncGen --> ProcessURL[Process Next URL]
+    ProcessURL --> FetchContent[Fetch Page Content]
+    FetchContent --> Extract[Extract Data]
+    Extract --> YieldResult[Yield CrawlResult]
+    YieldResult --> CheckMore{More URLs?}
+    CheckMore -->|Yes| ProcessURL
+    CheckMore -->|No| StreamEnd([End Stream])
+    
+    %% Collecting Path
+    Decision -->|Collect| CollectInit[Initialize Collection Mode]
+    CollectInit --> CollectStrategy[Call Strategy.ascrape]
+    CollectStrategy --> CollectGen[Create Async Generator]
+    CollectGen --> ProcessURLColl[Process Next URL]
+    ProcessURLColl --> FetchContentColl[Fetch Page Content]
+    FetchContentColl --> ExtractColl[Extract Data]
+    ExtractColl --> StoreColl[Store in Dictionary]
+    StoreColl --> CheckMoreColl{More URLs?}
+    CheckMoreColl -->|Yes| ProcessURLColl
+    CheckMoreColl -->|No| CreateResult[Create ScraperResult]
+    CreateResult --> ReturnResult([Return Result])
+    
+    %% Parallel Processing
+    subgraph Parallel
+        ProcessURL
+        FetchContent
+        Extract
+        ProcessURLColl
+        FetchContentColl
+        ExtractColl
+    end
+    
+    %% Error Handling
+    FetchContent --> ErrorCheck{Error?}
+    ErrorCheck -->|Yes| LogError[Log Error]
+    LogError --> UpdateStats[Update Error Stats]
+    UpdateStats --> CheckMore
+    ErrorCheck -->|No| Extract
+    
+    FetchContentColl --> ErrorCheckColl{Error?}
+    ErrorCheckColl -->|Yes| LogErrorColl[Log Error]
+    LogErrorColl --> UpdateStatsColl[Update Error Stats]
+    UpdateStatsColl --> CheckMoreColl
+    ErrorCheckColl -->|No| ExtractColl
+    
+    %% Style definitions
+    classDef process fill:#90caf9,stroke:#000,stroke-width:2px;
+    classDef decision fill:#fff59d,stroke:#000,stroke-width:2px;
+    classDef error fill:#ef9a9a,stroke:#000,stroke-width:2px;
+    classDef start fill:#a5d6a7,stroke:#000,stroke-width:2px;
+    
+    class Start,StreamEnd,ReturnResult start;
+    class Decision,CheckMore,CheckMoreColl,ErrorCheck,ErrorCheckColl decision;
+    class LogError,LogErrorColl,UpdateStats,UpdateStatsColl error;
+    class ProcessURL,FetchContent,Extract,ProcessURLColl,FetchContentColl,ExtractColl process;
+```
+
+AsyncWebScraper uses an intelligent crawling system that can navigate through websites following your specified strategy. It supports two main modes of operation:
+
+### 1. Streaming Mode
+```python
+async for result in scraper.ascrape(url, stream=True):
+    print(f"Found data on {result.url}")
+    process_data(result.data)
+```
+- Perfect for processing large websites
+- Memory efficient - handles one page at a time
+- Ideal for real-time data processing
+- Great for monitoring or continuous scraping tasks
+
+### 2. Collection Mode
+```python
+result = await scraper.ascrape(url)
+print(f"Scraped {len(result.crawled_urls)} pages")
+process_all_data(result.extracted_data)
+```
+- Collects all data before returning
+- Best for when you need the complete dataset
+- Easier to work with for batch processing
+- Includes comprehensive statistics
+
+## Key Features
+
+- **Smart Crawling**: Automatically follows relevant links while avoiding duplicates
+- **Parallel Processing**: Scrapes multiple pages simultaneously for better performance
+- **Memory Efficient**: Choose between streaming and collecting based on your needs
+- **Error Resilient**: Continues working even if some pages fail to load
+- **Progress Tracking**: Monitor the scraping progress in real-time
+- **Customizable**: Configure crawling strategy, filters, and scoring to match your needs
+
+## Quick Start
+
+```python
+from crawl4ai.scraper import AsyncWebScraper, BFSScraperStrategy
+from crawl4ai.async_webcrawler import AsyncWebCrawler
+
+# Initialize the scraper
+crawler = AsyncWebCrawler()
+strategy = BFSScraperStrategy(
+    max_depth=2,  # How deep to crawl
+    url_pattern="*.example.com/*"  # What URLs to follow
+)
+scraper = AsyncWebScraper(crawler, strategy)
+
+# Start scraping
+async def main():
+    # Collect all results
+    result = await scraper.ascrape("https://example.com")
+    print(f"Found {len(result.extracted_data)} pages")
+    
+    # Or stream results
+    async for page in scraper.ascrape("https://example.com", stream=True):
+        print(f"Processing {page.url}")
+
+```
+
+## Best Practices
+
+1. **Choose the Right Mode**
+   - Use streaming for large websites or real-time processing
+   - Use collecting for smaller sites or when you need the complete dataset
+
+2. **Configure Depth**
+   - Start with a small depth (2-3) and increase if needed
+   - Higher depths mean exponentially more pages to crawl
+
+3. **Set Appropriate Filters**
+   - Use URL patterns to stay within relevant sections
+   - Set content type filters to only process useful pages
+
+4. **Handle Resources Responsibly**
+   - Enable parallel processing for faster results
+   - Consider the target website's capacity
+   - Implement appropriate delays between requests
+
+## Common Use Cases
+
+- **Content Aggregation**: Collect articles, blog posts, or news from multiple pages
+- **Data Extraction**: Gather product information, prices, or specifications
+- **Site Mapping**: Create a complete map of a website's structure
+- **Content Monitoring**: Track changes or updates across multiple pages
+- **Data Mining**: Extract and analyze patterns across web pages
+
+## Advanced Features
+
+- Custom scoring algorithms for prioritizing important pages
+- URL filters for focusing on specific site sections
+- Content type filtering for processing only relevant pages
+- Progress tracking for monitoring long-running scrapes
+
+Need more help? Check out our [examples repository](https://github.com/example/crawl4ai/examples) or join our [community Discord](https://discord.gg/example).
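
Reviewer sketch (not part of the patch): the document above lists parallel processing, error resilience, and polite delays as features without showing how they interact. The example below is one minimal way to combine them with `asyncio.Semaphore`. `fetch_all`, its parameters, and the fake fetch function used to exercise it are assumptions for illustration and do not reflect how AsyncWebScraper is implemented.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple


async def fetch_all(
    urls: List[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrent: int = 5,
    delay: float = 0.0,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Fetch URLs with bounded concurrency; collect failures instead of raising."""
    semaphore = asyncio.Semaphore(max_concurrent)
    results: Dict[str, str] = {}
    errors: Dict[str, str] = {}

    async def worker(url: str) -> None:
        async with semaphore:  # at most max_concurrent fetches run at once
            try:
                results[url] = await fetch(url)
            except Exception as exc:  # one failed page must not stop the crawl
                errors[url] = str(exc)
            if delay:
                await asyncio.sleep(delay)  # politeness pause while holding the slot

    await asyncio.gather(*(worker(url) for url in urls))
    return results, errors
```

Sleeping while still holding the semaphore slot means the delay throttles overall throughput rather than only spacing out a single worker, which is the more conservative choice for the target site.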
--- a/docs/scrapper/bfs_scraper_strategy.md
+++ b/docs/scrapper/bfs_scraper_strategy.md
@@ -0,0 +1,244 @@
+# BFS Scraper Strategy: Smart Web Traversal
+
+The BFS (Breadth-First Search) Scraper Strategy provides an intelligent way to traverse websites systematically. It crawls websites level by level, ensuring thorough coverage while respecting web crawling etiquette.
+
+```mermaid
+flowchart TB
+    Start([Start]) --> Init[Initialize BFS Strategy]
+    Init --> InitStats[Initialize CrawlStats]
+    InitStats --> InitQueue[Initialize Priority Queue]
+    InitQueue --> AddStart[Add Start URL to Queue]
+    
+    AddStart --> CheckState{Queue Not Empty or\nTasks Pending?}
+    CheckState -->|No| Cleanup[Cleanup & Stats]
+    Cleanup --> End([End])
+    
+    CheckState -->|Yes| CheckCancel{Cancel\nRequested?}
+    CheckCancel -->|Yes| Cleanup
+    
+    CheckCancel -->|No| CheckConcurrent{Under Max\nConcurrent?}
+    
+    CheckConcurrent -->|No| WaitComplete[Wait for Task Completion]
+    WaitComplete --> YieldResult[Yield Result]
+    YieldResult --> CheckState
+    
+    CheckConcurrent -->|Yes| GetNextURL[Get Next URL from Queue]
+    
+    GetNextURL --> ValidateURL{Already\nVisited?}
+    ValidateURL -->|Yes| CheckState
+    
+    ValidateURL -->|No| ProcessURL[Process URL]
+    
+    subgraph URL_Processing [URL Processing]
+        ProcessURL --> CheckValid{URL Valid?}
+        CheckValid -->|No| UpdateStats[Update Skip Stats]
+        
+        CheckValid -->|Yes| CheckRobots{Allowed by\nrobots.txt?}
+        CheckRobots -->|No| UpdateRobotStats[Update Robot Stats]
+        
+        CheckRobots -->|Yes| ApplyDelay[Apply Politeness Delay]
+        ApplyDelay --> FetchContent[Fetch Content with Rate Limit]
+        
+        FetchContent --> CheckError{Error?}
+        CheckError -->|Yes| Retry{Retry\nNeeded?}
+        Retry -->|Yes| FetchContent
+        Retry -->|No| UpdateFailStats[Update Fail Stats]
+        
+        CheckError -->|No| ExtractLinks[Extract & Process Links]
+        ExtractLinks --> ScoreURLs[Score New URLs]
+        ScoreURLs --> AddToQueue[Add to Priority Queue]
+    end
+    
+    ProcessURL --> CreateTask{Parallel\nProcessing?}
+    CreateTask -->|Yes| AddTask[Add to Pending Tasks]
+    CreateTask -->|No| DirectProcess[Process Directly]
+    
+    AddTask --> CheckState
+    DirectProcess --> YieldResult
+    
+    UpdateStats --> CheckState
+    UpdateRobotStats --> CheckState
+    UpdateFailStats --> CheckState
+    
+    classDef process fill:#90caf9,stroke:#000,stroke-width:2px;
+    classDef decision fill:#fff59d,stroke:#000,stroke-width:2px;
+    classDef error fill:#ef9a9a,stroke:#000,stroke-width:2px;
+    classDef stats fill:#a5d6a7,stroke:#000,stroke-width:2px;
+    
+    class Start,End stats;
+    class CheckState,CheckCancel,CheckConcurrent,ValidateURL,CheckValid,CheckRobots,CheckError,Retry,CreateTask decision;
+    class UpdateStats,UpdateRobotStats,UpdateFailStats,InitStats,Cleanup stats;
+    class ProcessURL,FetchContent,ExtractLinks,ScoreURLs process;
+```
+
+## How It Works
+
+The BFS strategy crawls a website by:
+1. Starting from a root URL
+2. Processing all URLs at the current depth
+3. Moving to URLs at the next depth level
+4. Continuing until maximum depth is reached
+
+This ensures systematic coverage of the website while maintaining control over the crawling process.
+
+## Key Features
+
+### 1. Smart URL Processing
+```python
+strategy = BFSScraperStrategy(
+    max_depth=2,
+    filter_chain=my_filters,
+    url_scorer=my_scorer,
+    max_concurrent=5
+)
+```
+- Controls crawl depth
+- Filters unwanted URLs
+- Scores URLs for priority
+- Manages concurrent requests
+
+### 2. Polite Crawling
+The strategy automatically implements web crawling best practices:
+- Respects robots.txt
+- Implements rate limiting
+- Adds politeness delays
+- Manages concurrent requests
+
+### 3. Link Processing Control
+```python
+strategy = BFSScraperStrategy(
+    ...,
+    process_external_links=False  # Only process internal links
+)
+```
+- Control whether to follow external links
+- Default: internal links only
+- Enable external links when needed
+
+## Configuration Options
+
+| Parameter | Description | Default |
+|-----------|-------------|---------|
+| max_depth | Maximum crawl depth | Required |
+| filter_chain | URL filtering rules | Required |
+| url_scorer | URL priority scoring | Required |
+| max_concurrent | Max parallel requests | 5 |
+| min_crawl_delay | Seconds between requests | 1 |
+| process_external_links | Follow external links | False |
+
+## Best Practices
+
+1. **Set Appropriate Depth**
+   - Start with smaller depths (2-3)
+   - Increase based on needs
+   - Consider site structure
+
+2. **Configure Filters**
+   - Use URL patterns
+   - Filter by content type
+   - Avoid unwanted sections
+
+3. **Tune Performance**
+   - Adjust max_concurrent
+   - Set appropriate delays
+   - Monitor resource usage
+
+4. **Handle External Links**
+   - Keep process_external_links=False for focused crawls
+   - Enable only when needed
+   - Consider additional filtering
+
+## Example Usage
+
+```python
+from crawl4ai.scraper import AsyncWebScraper, BFSScraperStrategy
+from crawl4ai.scraper.filters import FilterChain, URLPatternFilter, ContentTypeFilter
+from crawl4ai.scraper.scorers import BasicURLScorer
+
+# Configure strategy
+strategy = BFSScraperStrategy(
+    max_depth=3,
+    filter_chain=FilterChain([
+        URLPatternFilter("*.example.com/*"),
+        ContentTypeFilter(["text/html"])
+    ]),
+    url_scorer=BasicURLScorer(),
+    max_concurrent=5,
+    min_crawl_delay=1,
+    process_external_links=False
+)
+
+# Use with AsyncWebScraper
+scraper = AsyncWebScraper(crawler, strategy)
+results = await scraper.ascrape("https://example.com")
+```
+
+## Common Use Cases
+
+### 1. Site Mapping
+```python
+strategy = BFSScraperStrategy(
+    max_depth=5,
+    filter_chain=site_filter,
+    url_scorer=depth_scorer,
+    process_external_links=False
+)
+```
+Perfect for creating complete site maps or understanding site structure.
+
+### 2. Content Aggregation
+```python
+strategy = BFSScraperStrategy(
+    max_depth=2,
+    filter_chain=content_filter,
+    url_scorer=relevance_scorer,
+    max_concurrent=3
+)
+```
+Ideal for collecting specific types of content (articles, products, etc.).
+
+### 3. Link Analysis
+```python
+strategy = BFSScraperStrategy(
+    max_depth=1,
+    filter_chain=link_filter,
+    url_scorer=link_scorer,
+    process_external_links=True
+)
+```
+Useful for analyzing both internal and external link structures.
+
+## Advanced Features
+
+### Progress Monitoring
+```python
+async for result in scraper.ascrape(url, stream=True):
+    print(f"Current depth: {strategy.stats.current_depth}")
+    print(f"Processed URLs: {strategy.stats.urls_processed}")
+```
+
+### Custom URL Scoring
+```python
+class CustomScorer(URLScorer):
+    def score(self, url: str) -> float:
+        # Lower scores = higher priority
+        return score_based_on_criteria(url)
+```
+
+## Troubleshooting
+
+1. **Slow Crawling**
+   - Increase max_concurrent
+   - Adjust min_crawl_delay
+   - Check network conditions
+
+2. **Missing Content**
+   - Verify max_depth
+   - Check filter settings
+   - Review URL patterns
+
+3. **High Resource Usage**
+   - Reduce max_concurrent
+   - Increase crawl delay
+   - Add more specific filters
+
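
Reviewer sketch (not part of the patch): the document above describes level-by-level traversal driven by a priority queue in which lower scores mean higher priority. The example below shows that ordering on an in-memory link graph. `bfs_order`, the `links` dictionary, and the `(depth, score, url)` priority tuple are assumptions chosen for illustration; the real strategy's queue layout, robots.txt handling, and rate limiting are not reproduced here.

```python
import heapq
from typing import Callable, Dict, List, Set, Tuple


def bfs_order(
    start: str,
    links: Dict[str, List[str]],
    max_depth: int,
    score: Callable[[str], float] = lambda url: 0.0,
) -> List[str]:
    """Return URLs in visit order: shallower first, then lower score first."""
    queue: List[Tuple[int, float, str]] = [(0, score(start), start)]
    visited: Set[str] = set()
    order: List[str] = []
    while queue:
        depth, _, url = heapq.heappop(queue)
        if url in visited:
            continue  # the same URL may have been queued by several parents
        visited.add(url)
        order.append(url)
        if depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for target in links.get(url, []):
            if target not in visited:
                heapq.heappush(queue, (depth + 1, score(target), target))
    return order
```

Putting depth first in the tuple keeps the traversal breadth-first, and the score only decides the order within a level. A queue keyed on score alone would behave as best-first search instead.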
--- a/docs/scrapper/filters_scrorers.md
+++ b/docs/scrapper/filters_scrorers.md
@@ -0,0 +1,342 @@
+# URL Filters and Scorers
+
+crawl4ai provides URL filters, which decide whether a URL is crawled at all, and URL scorers, which rank the URLs that pass so they can be crawled in priority order. This guide shows how to configure both.
+
+```mermaid
+flowchart TB
+    Start([URL Input]) --> Chain[Filter Chain]
+    
+    subgraph Chain Process
+        Chain --> Pattern{URL Pattern\nFilter}
+        Pattern -->|Match| Content{Content Type\nFilter}
+        Pattern -->|No Match| Reject1[Reject URL]
+        
+        Content -->|Allowed| Domain{Domain\nFilter}
+        Content -->|Not Allowed| Reject2[Reject URL]
+        
+        Domain -->|Allowed| Accept[Accept URL]
+        Domain -->|Blocked| Reject3[Reject URL]
+    end
+    
+    subgraph Statistics
+        Pattern --> UpdatePattern[Update Pattern Stats]
+        Content --> UpdateContent[Update Content Stats]
+        Domain --> UpdateDomain[Update Domain Stats]
+        Accept --> UpdateChain[Update Chain Stats]
+        Reject1 --> UpdateChain
+        Reject2 --> UpdateChain
+        Reject3 --> UpdateChain
+    end
+    
+    Accept --> End([End])
+    Reject1 --> End
+    Reject2 --> End
+    Reject3 --> End
+    
+    classDef process fill:#90caf9,stroke:#000,stroke-width:2px;
+    classDef decision fill:#fff59d,stroke:#000,stroke-width:2px;
+    classDef reject fill:#ef9a9a,stroke:#000,stroke-width:2px;
+    classDef accept fill:#a5d6a7,stroke:#000,stroke-width:2px;
+    
+    class Start,End accept;
+    class Pattern,Content,Domain decision;
+    class Reject1,Reject2,Reject3 reject;
+    class Chain,UpdatePattern,UpdateContent,UpdateDomain,UpdateChain process;
+```
+
+## URL Filters
+
+URL filters help you control which URLs are crawled. Multiple filters can be chained together to create sophisticated filtering rules.
+
+### Available Filters
+
+1. **URL Pattern Filter**
+```python
+import re
+
+pattern_filter = URLPatternFilter([
+    "*.example.com/*",  # Glob pattern
+    "*/article/*",      # Path pattern
+    re.compile(r"blog-\d+") # Regex pattern
+])
+```
+- Supports glob patterns and regex
+- Multiple patterns per filter
+- Pattern pre-compilation for performance
+
+2. **Content Type Filter**
+```python
+content_filter = ContentTypeFilter([
+    "text/html",
+    "application/pdf"
+], check_extension=True)
+```
+- Filter by MIME types
+- Extension checking
+- Support for multiple content types
+
+3. **Domain Filter**
+```python
+domain_filter = DomainFilter(
+    allowed_domains=["example.com", "blog.example.com"],
+    blocked_domains=["ads.example.com"]
+)
+```
+- Allow/block specific domains
+- Subdomain support
+- Efficient domain matching
+
+### Creating Filter Chains
+
+```python
+# Create and configure a filter chain
+filter_chain = FilterChain([
+    URLPatternFilter(["*.example.com/*"]),
+    ContentTypeFilter(["text/html"]),
+    DomainFilter(blocked_domains=["ads.*"])
+])
+
+# Add more filters
+filter_chain.add_filter(
+    URLPatternFilter(["*/article/*"])
+)
+```
+
+```mermaid
+flowchart TB
+    Start([URL Input]) --> Composite[Composite Scorer]
+    
+    subgraph Scoring Process
+        Composite --> Keywords[Keyword Relevance]
+        Composite --> Path[Path Depth]
+        Composite --> Content[Content Type]
+        Composite --> Fresh[Freshness]
+        Composite --> Domain[Domain Authority]
+        
+        Keywords --> KeywordScore[Calculate Score]
+        Path --> PathScore[Calculate Score]
+        Content --> ContentScore[Calculate Score]
+        Fresh --> FreshScore[Calculate Score]
+        Domain --> DomainScore[Calculate Score]
+        
+        KeywordScore --> Weight1[Apply Weight]
+        PathScore --> Weight2[Apply Weight]
+        ContentScore --> Weight3[Apply Weight]
+        FreshScore --> Weight4[Apply Weight]
+        DomainScore --> Weight5[Apply Weight]
+    end
+    
+    Weight1 --> Combine[Combine Scores]
+    Weight2 --> Combine
+    Weight3 --> Combine
+    Weight4 --> Combine
+    Weight5 --> Combine
+    
+    Combine --> Normalize{Normalize?}
+    Normalize -->|Yes| NormalizeScore[Normalize Combined Score]
+    Normalize -->|No| FinalScore[Final Score]
+    NormalizeScore --> FinalScore
+    
+    FinalScore --> Stats[Update Statistics]
+    Stats --> End([End])
+    
+    classDef process fill:#90caf9,stroke:#000,stroke-width:2px;
+    classDef scorer fill:#fff59d,stroke:#000,stroke-width:2px;
+    classDef calc fill:#a5d6a7,stroke:#000,stroke-width:2px;
+    classDef decision fill:#ef9a9a,stroke:#000,stroke-width:2px;
+    
+    class Start,End calc;
+    class Keywords,Path,Content,Fresh,Domain scorer;
+    class KeywordScore,PathScore,ContentScore,FreshScore,DomainScore process;
+    class Normalize decision;
+```
+
+## URL Scorers
+
+URL scorers help prioritize which URLs to crawl first. Higher scores indicate higher priority.
+
+### Available Scorers
+
+1. **Keyword Relevance Scorer**
+```python
+keyword_scorer = KeywordRelevanceScorer(
+    keywords=["python", "programming"],
+    weight=1.0,
+    case_sensitive=False
+)
+```
+- Score based on keyword matches
+- Case sensitivity options
+- Weighted scoring
+
+2. **Path Depth Scorer**
+```python
+path_scorer = PathDepthScorer(
+    optimal_depth=3,  # Preferred URL depth
+    weight=0.7
+)
+```
+- Score based on URL path depth
+- Configurable optimal depth
+- Diminishing returns for deeper paths
+
+3. **Content Type Scorer**
+```python
+content_scorer = ContentTypeScorer({
+    r'\.html$': 1.0,
+    r'\.pdf$': 0.8,
+    r'\.xml$': 0.6
+})
+```
+- Score based on file types
+- Configurable type weights
+- Pattern matching support
+
+4. **Freshness Scorer**
+```python
+freshness_scorer = FreshnessScorer(weight=0.9)
+```
+- Score based on date indicators in URLs
+- Multiple date format support
+- Recency weighting
+
+5. **Domain Authority Scorer**
+```python
+authority_scorer = DomainAuthorityScorer({
+    "python.org": 1.0,
+    "github.com": 0.9,
+    "medium.com": 0.7
+})
+```
+- Score based on domain importance
+- Configurable domain weights
+- Default weight for unknown domains
+
+### Combining Scorers
+
+```python
+# Create a composite scorer
+composite_scorer = CompositeScorer([
+    KeywordRelevanceScorer(["python"], weight=1.0),
+    PathDepthScorer(optimal_depth=2, weight=0.7),
+    FreshnessScorer(weight=0.8)
+], normalize=True)
+```
+
+## Best Practices
+
+### Filter Configuration
+
+1. **Start Restrictive**
+   ```python
+   # Begin with strict filters
+   filter_chain = FilterChain([
+       DomainFilter(allowed_domains=["example.com"]),
+       ContentTypeFilter(["text/html"])
+   ])
+   ```
+
+2. **Layer Filters**
+   ```python
+   # Add more specific filters
+   filter_chain.add_filter(
+       URLPatternFilter(["*/article/*", "*/blog/*"])
+   )
+   ```
+
+3. **Monitor Filter Statistics**
+   ```python
+   # Check filter performance
+   for filter_ in filter_chain.filters:
+       print(f"{filter_.name}: {filter_.stats.rejected_urls} rejected")
+   ```
+
+### Scorer Configuration
+
+1. **Balance Weights**
+   ```python
+   # Balanced scoring configuration
+   scorer = create_balanced_scorer()
+   ```
+
+2. **Customize for Content**
+   ```python
+   # News site configuration
+   news_scorer = CompositeScorer([
+       KeywordRelevanceScorer(["news", "article"], weight=1.0),
+       FreshnessScorer(weight=1.0),
+       PathDepthScorer(optimal_depth=2, weight=0.5)
+   ])
+   ```
+
+3. **Monitor Scoring Statistics**
+   ```python
+   # Check scoring distribution
+   print(f"Average score: {scorer.stats.average_score}")
+   print(f"Score range: {scorer.stats.min_score} - {scorer.stats.max_score}")
+   ```
+
+## Common Use Cases
+
+### Blog Crawling
+```python
+blog_config = {
+    'filters': FilterChain([
+        URLPatternFilter(["*/blog/*", "*/post/*"]),
+        ContentTypeFilter(["text/html"])
+    ]),
+    'scorer': CompositeScorer([
+        FreshnessScorer(weight=1.0),
+        KeywordRelevanceScorer(["blog", "article"], weight=0.8)
+    ])
+}
+```
+
+### Documentation Sites
+```python
+docs_config = {
+    'filters': FilterChain([
+        URLPatternFilter(["*/docs/*", "*/guide/*"]),
+        ContentTypeFilter(["text/html", "application/pdf"])
+    ]),
+    'scorer': CompositeScorer([
+        PathDepthScorer(optimal_depth=3, weight=1.0),
+        KeywordRelevanceScorer(["guide", "tutorial"], weight=0.9)
+    ])
+}
+```
+
+### E-commerce Sites
+```python
+ecommerce_config = {
+    'filters': FilterChain([
+        URLPatternFilter(["*/product/*", "*/category/*"]),
+        DomainFilter(blocked_domains=["ads.*", "tracker.*"])
+    ]),
+    'scorer': CompositeScorer([
+        PathDepthScorer(optimal_depth=2, weight=1.0),
+        ContentTypeScorer({
+            r'/product/': 1.0,
+            r'/category/': 0.8
+        })
+    ])
+}
+```
+
+## Advanced Topics
+
+### Custom Filters
+```python
+class CustomFilter(URLFilter):
+    def apply(self, url: str) -> bool:
+        # Your custom filtering logic
+        return True
+```
+
+### Custom Scorers
+```python
+class CustomScorer(URLScorer):
+    def _calculate_score(self, url: str) -> float:
+        # Your custom scoring logic
+        return 1.0
+```
+
+For more examples, check our [example repository](https://github.com/example/crawl4ai/examples).
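The filter-chain flow charted in `filters_scrorers.md` above (each filter is consulted in turn, the URL is rejected at the first failure, and statistics are updated along the way) can be sketched without crawl4ai as follows. This is an illustration under stated assumptions, not the library's implementation: `FilterStats`, `SketchPatternFilter`, `SketchDomainFilter` and `SketchFilterChain` are hypothetical stand-ins, glob patterns are assumed to match the whole URL, and compiled regexes are assumed to match anywhere in it.

```python
import fnmatch
import re
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class FilterStats:
    """Counts how many URLs a filter (or the chain) passed and rejected."""
    passed_urls: int = 0
    rejected_urls: int = 0

    def record(self, passed: bool) -> None:
        if passed:
            self.passed_urls += 1
        else:
            self.rejected_urls += 1


class SketchPatternFilter:
    """Hypothetical glob/regex filter; patterns are compiled once up front."""

    def __init__(self, patterns):
        self.name = "SketchPatternFilter"
        self.stats = FilterStats()
        self._matchers = []
        for pattern in patterns:
            if isinstance(pattern, re.Pattern):
                # Assumption: a regex may match anywhere in the URL.
                self._matchers.append(pattern.search)
            else:
                # Assumption: a glob must match the whole URL.
                compiled = re.compile(fnmatch.translate(pattern))
                self._matchers.append(compiled.match)

    def apply(self, url: str) -> bool:
        passed = any(matcher(url) for matcher in self._matchers)
        self.stats.record(passed)
        return passed


class SketchDomainFilter:
    """Hypothetical exact-host allow/block filter (no wildcard support)."""

    def __init__(self, allowed=None, blocked=None):
        self.name = "SketchDomainFilter"
        self.stats = FilterStats()
        self.allowed = set(allowed or [])
        self.blocked = set(blocked or [])

    def apply(self, url: str) -> bool:
        host = urlparse(url).hostname or ""
        passed = host not in self.blocked and (
            not self.allowed or host in self.allowed
        )
        self.stats.record(passed)
        return passed


class SketchFilterChain:
    """Applies filters in order and stops at the first rejection."""

    def __init__(self, filters):
        self.filters = list(filters)
        self.stats = FilterStats()

    def apply(self, url: str) -> bool:
        # all() short-circuits, so later filters never see a rejected URL.
        passed = all(f.apply(url) for f in self.filters)
        self.stats.record(passed)
        return passed


if __name__ == "__main__":
    chain = SketchFilterChain([
        SketchPatternFilter(["*/article/*", re.compile(r"blog-\d+")]),
        SketchDomainFilter(blocked=["ads.example.com"]),
    ])
    for candidate in ["https://example.com/article/1", "https://example.com/about"]:
        print(candidate, chain.apply(candidate))
```

Because the chain short-circuits, per-filter statistics depend on filter order: a filter placed late in the chain only counts the URLs that every earlier filter accepted.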
--- a/docs/scrapper/how_to_use.md
+++ b/docs/scrapper/how_to_use.md
@@ -0,0 +1,206 @@
+# Scraper Examples Guide
+
+This guide provides two complete examples of using the crawl4ai scraper: a basic implementation for simple use cases and an advanced implementation showcasing all features.
+
+## Basic Example
+
+The basic example demonstrates a simple blog scraping scenario:
+
+```python
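+from crawl4ai.scraper import (
+    AsyncWebScraper,
+    BFSScraperStrategy,
+    FilterChain,
+    URLPatternFilter,
+    ContentTypeFilter
+)
+from crawl4ai.async_webcrawler import AsyncWebCrawler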
+
+# Create simple filter chain
+filter_chain = FilterChain([
+    URLPatternFilter("*/blog/*"),
+    ContentTypeFilter(["text/html"])
+])
+
+# Initialize strategy
+strategy = BFSScraperStrategy(
+    max_depth=2,
+    filter_chain=filter_chain,
+    url_scorer=None,
+    max_concurrent=3
+)
+
+# Create and run scraper
+crawler = AsyncWebCrawler()
+scraper = AsyncWebScraper(crawler, strategy)
+result = await scraper.ascrape("https://example.com/blog/")
+```
+
+### Features Demonstrated
+- Basic URL filtering
+- Simple content type filtering
+- Depth control
+- Concurrent request limiting
+- Result collection
+
+## Advanced Example
+
+The advanced example shows a sophisticated news site scraping setup with all features enabled:
+
+```python
+import re
+
+# Create comprehensive filter chain
+filter_chain = FilterChain([
+    DomainFilter(
+        allowed_domains=["example.com"],
+        blocked_domains=["ads.example.com"]
+    ),
+    URLPatternFilter([
+        "*/article/*",
+        re.compile(r"\d{4}/\d{2}/.*")
+    ]),
+    ContentTypeFilter(["text/html"])
+])
+
+# Create intelligent scorer
+scorer = CompositeScorer([
+    KeywordRelevanceScorer(
+        keywords=["news", "breaking"],
+        weight=1.0
+    ),
+    PathDepthScorer(optimal_depth=3, weight=0.7),
+    FreshnessScorer(weight=0.9)
+])
+
+# Initialize advanced strategy
+strategy = BFSScraperStrategy(
+    max_depth=4,
+    filter_chain=filter_chain,
+    url_scorer=scorer,
+    max_concurrent=5
+)
+```
+
+### Features Demonstrated
+1. **Advanced Filtering**
+   - Domain filtering
+   - Pattern matching
+   - Content type control
+
+2. **Intelligent Scoring**
+   - Keyword relevance
+   - Path optimization
+   - Freshness priority
+
+3. **Monitoring**
+   - Progress tracking
+   - Error handling
+   - Statistics collection
+
+4. **Resource Management**
+   - Concurrent processing
+   - Rate limiting
+   - Cleanup handling
+
+## Running the Examples
+
+```bash
+# Basic usage
+python basic_scraper_example.py
+
+# Advanced usage with logging
+PYTHONPATH=. python advanced_scraper_example.py
+```
+
+## Example Output
+
+### Basic Example
+```
+Crawled 15 pages:
+- https://example.com/blog/post1: 24560 bytes
+- https://example.com/blog/post2: 18920 bytes
+...
+```
+
+### Advanced Example
+```
+INFO: Starting crawl of https://example.com/news/
+INFO: Processed: https://example.com/news/breaking/story1
+DEBUG: KeywordScorer: 0.85
+DEBUG: FreshnessScorer: 0.95
+INFO: Progress: 10 URLs processed
+...
+INFO: Scraping completed:
+INFO: - URLs processed: 50
+INFO: - Errors: 2
+INFO: - Total content size: 1240.50 KB
+```
+
+## Customization
+
+### Adding Custom Filters
+```python
+class CustomFilter(URLFilter):
+    def apply(self, url: str) -> bool:
+        # Your custom filtering logic
+        return True
+
+filter_chain.add_filter(CustomFilter())
+```
+
+### Custom Scoring Logic
+```python
+class CustomScorer(URLScorer):
+    def _calculate_score(self, url: str) -> float:
+        # Your custom scoring logic
+        return 1.0
+
+scorer = CompositeScorer([
+    CustomScorer(weight=1.0),
+    ...
+])
+```
+
+## Best Practices
+
+1. **Start Simple**
+   - Begin with basic filtering
+   - Add features incrementally
+   - Test thoroughly at each step
+
+2. **Monitor Performance**
+   - Watch memory usage
+   - Track processing times
+   - Adjust concurrency as needed
+
+3. **Handle Errors**
+   - Implement proper error handling
+   - Log important events
+   - Track error statistics
+
+4. **Optimize Resources**
+   - Set appropriate delays
+   - Limit concurrent requests
+   - Use streaming for large crawls
+
+## Troubleshooting
+
+Common issues and solutions:
+
+1. **Too Many Requests**
+   ```python
+   strategy = BFSScraperStrategy(
+       max_concurrent=3,  # Reduce concurrent requests
+       min_crawl_delay=2  # Increase delay between requests
+   )
+   ```
+
+2. **Memory Issues**
+   ```python
+   # Use streaming mode for large crawls
+   async for result in scraper.ascrape(url, stream=True):
+       process_result(result)
+   ```
+
+3. **Missing Content**
+   ```python
+   # Check your filter chain
+   filter_chain = FilterChain([
+       URLPatternFilter("*"),  # Broaden patterns
+       ContentTypeFilter(["*"])  # Accept all content
+   ])
+   ```
+
+For more examples and use cases, visit our [GitHub repository](https://github.com/example/crawl4ai/examples).
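The advanced example in `how_to_use.md` above combines weighted scorers through a `CompositeScorer`. The diff does not show the scoring formulas, so the following is only a sketch of one plausible scheme: each scorer returns a raw value in the range 0 to 1, multiplies it by its weight, and the composite sums the weighted values and optionally divides by the number of scorers. Every class name here is hypothetical, and both formulas (fraction of keywords present, and `1 / (1 + distance from optimal depth)`) are assumptions chosen for illustration.

```python
from urllib.parse import urlparse


class SketchScorer:
    """Base class: score() applies the weight to a subclass-defined raw score."""

    def __init__(self, weight: float = 1.0):
        self.weight = weight

    def _calculate_score(self, url: str) -> float:
        raise NotImplementedError

    def score(self, url: str) -> float:
        return self.weight * self._calculate_score(url)


class SketchKeywordScorer(SketchScorer):
    """Assumed formula: fraction of keywords that appear in the URL."""

    def __init__(self, keywords, weight: float = 1.0):
        super().__init__(weight)
        self.keywords = [keyword.lower() for keyword in keywords]

    def _calculate_score(self, url: str) -> float:
        if not self.keywords:
            return 0.0
        lowered = url.lower()
        hits = sum(1 for keyword in self.keywords if keyword in lowered)
        return hits / len(self.keywords)


class SketchPathDepthScorer(SketchScorer):
    """Assumed formula: 1 / (1 + distance from the optimal path depth)."""

    def __init__(self, optimal_depth: int, weight: float = 1.0):
        super().__init__(weight)
        self.optimal_depth = optimal_depth

    def _calculate_score(self, url: str) -> float:
        segments = [s for s in urlparse(url).path.split("/") if s]
        return 1.0 / (1.0 + abs(len(segments) - self.optimal_depth))


class SketchCompositeScorer(SketchScorer):
    """Sums the weighted scores; normalize=True divides by the scorer count."""

    def __init__(self, scorers, normalize: bool = True):
        super().__init__(1.0)
        self.scorers = list(scorers)
        self.normalize = normalize

    def _calculate_score(self, url: str) -> float:
        total = sum(scorer.score(url) for scorer in self.scorers)
        if self.normalize and self.scorers:
            total /= len(self.scorers)
        return total


if __name__ == "__main__":
    composite = SketchCompositeScorer([
        SketchKeywordScorer(["news", "breaking"], weight=1.0),
        SketchPathDepthScorer(optimal_depth=3, weight=0.7),
    ])
    for candidate in [
        "https://example.com/news/breaking/story1",
        "https://example.com/about",
    ]:
        print(candidate, round(composite.score(candidate), 3))
```

This sketch treats a higher score as better, matching the scorers doc. If the crawl frontier pops the lowest value first, as the comment in the BFS strategy doc suggests, the composite value would need to be negated before queuing.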
--- a/docs/scrapper/scraper_quickstart.py
+++ b/docs/scrapper/scraper_quickstart.py
@@ -0,0 +1,184 @@
+# basic_scraper_example.py
+from crawl4ai.scraper import (
+    AsyncWebScraper,
+    BFSScraperStrategy,
+    FilterChain,
+    URLPatternFilter,
+    ContentTypeFilter
+)
+from crawl4ai.async_webcrawler import AsyncWebCrawler
+
+async def basic_scraper_example():
+    """
+    Basic example: Scrape a blog site for articles
+    - Crawls only HTML pages
+    - Stays within the blog section
+    - Collects all results at once
+    """
+    # Create a simple filter chain
+    filter_chain = FilterChain([
+        # Only crawl pages within the blog section
+        URLPatternFilter("*/blog/*"),
+        # Only process HTML pages
+        ContentTypeFilter(["text/html"])
+    ])
+
+    # Initialize the strategy with basic configuration
+    strategy = BFSScraperStrategy(
+        max_depth=2,  # Only go 2 levels deep
+        filter_chain=filter_chain,
+        url_scorer=None,  # Use default scoring
+        max_concurrent=3  # Limit concurrent requests
+    )
+
+    # Create the crawler and scraper
+    crawler = AsyncWebCrawler()
+    scraper = AsyncWebScraper(crawler, strategy)
+
+    # Start scraping
+    try:
+        result = await scraper.ascrape("https://example.com/blog/")
+        
+        # Process results
+        print(f"Crawled {len(result.crawled_urls)} pages:")
+        for url, data in result.extracted_data.items():
+            print(f"- {url}: {len(data.html)} bytes")
+            
+    except Exception as e:
+        print(f"Error during scraping: {e}")
+
+# advanced_scraper_example.py
+import logging
+import re
+from crawl4ai.scraper import (
+    AsyncWebScraper,
+    BFSScraperStrategy,
+    FilterChain,
+    URLPatternFilter,
+    ContentTypeFilter,
+    DomainFilter,
+    KeywordRelevanceScorer,
+    PathDepthScorer,
+    FreshnessScorer,
+    CompositeScorer
+)
+from crawl4ai.async_webcrawler import AsyncWebCrawler
+
+async def advanced_scraper_example():
+    """
+    Advanced example: Intelligent news site scraping
+    - Uses all filter types
+    - Implements sophisticated scoring
+    - Streams results
+    - Includes monitoring and logging
+    """
+    # Set up logging
+    logging.basicConfig(level=logging.INFO)
+    logger = logging.getLogger("advanced_scraper")
+
+    # Create sophisticated filter chain
+    filter_chain = FilterChain([
+        # Domain control
+        DomainFilter(
+            allowed_domains=["example.com", "blog.example.com"],
+            blocked_domains=["ads.example.com", "tracker.example.com"]
+        ),
+        # URL patterns
+        URLPatternFilter([
+            "*/article/*",
+            "*/news/*",
+            "*/blog/*",
+            re.compile(r"\d{4}/\d{2}/.*")  # Date-based URLs
+        ]),
+        # Content types
+        ContentTypeFilter([
+            "text/html",
+            "application/xhtml+xml"
+        ])
+    ])
+
+    # Create composite scorer
+    scorer = CompositeScorer([
+        # Prioritize by keywords
+        KeywordRelevanceScorer(
+            keywords=["news", "breaking", "update", "latest"],
+            weight=1.0
+        ),
+        # Prefer optimal URL structure
+        PathDepthScorer(
+            optimal_depth=3,
+            weight=0.7
+        ),
+        # Prioritize fresh content
+        FreshnessScorer(weight=0.9)
+    ])
+
+    # Initialize strategy with advanced configuration
+    strategy = BFSScraperStrategy(
+        max_depth=4,
+        filter_chain=filter_chain,
+        url_scorer=scorer,
+        max_concurrent=5,
+        min_crawl_delay=1
+    )
+
+    # Create crawler and scraper
+    crawler = AsyncWebCrawler()
+    scraper = AsyncWebScraper(crawler, strategy)
+
+    # Track statistics
+    stats = {
+        'processed': 0,
+        'errors': 0,
+        'total_size': 0
+    }
+
+    try:
+        # Use streaming mode
+        async for result in scraper.ascrape("https://example.com/news/", stream=True):
+            stats['processed'] += 1
+            
+            if result.success:
+                stats['total_size'] += len(result.html)
+                logger.info(f"Processed: {result.url}")
+                
+                # Print scoring information
+                for scorer_name, score in result.scores.items():
+                    logger.debug(f"{scorer_name}: {score:.2f}")
+            else:
+                stats['errors'] += 1
+                logger.error(f"Failed to process {result.url}: {result.error_message}")
+
+            # Log progress regularly
+            if stats['processed'] % 10 == 0:
+                logger.info(f"Progress: {stats['processed']} URLs processed")
+
+    except Exception as e:
+        logger.error(f"Scraping error: {e}")
+    
+    finally:
+        # Print final statistics
+        logger.info("Scraping completed:")
+        logger.info(f"- URLs processed: {stats['processed']}")
+        logger.info(f"- Errors: {stats['errors']}")
+        logger.info(f"- Total content size: {stats['total_size'] / 1024:.2f} KB")
+        
+        # Print filter statistics
+        for filter_ in filter_chain.filters:
+            logger.info(f"{filter_.name} stats:")
+            logger.info(f"- Passed: {filter_.stats.passed_urls}")
+            logger.info(f"- Rejected: {filter_.stats.rejected_urls}")
+        
+        # Print scorer statistics
+        logger.info("Scoring statistics:")
+        logger.info(f"- Average score: {scorer.stats.average_score:.2f}")
+        logger.info(f"- Score range: {scorer.stats.min_score:.2f} - {scorer.stats.max_score:.2f}")
+
+if __name__ == "__main__":
+    import asyncio
+    
+    # Run basic example
+    print("Running basic scraper example...")
+    asyncio.run(basic_scraper_example())
+    
+    print("\nRunning advanced scraper example...")
+    asyncio.run(advanced_scraper_example())
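The statistics bookkeeping in `advanced_scraper_example` above only depends on each streamed result exposing `success`, `html` and `error_message`, so it can be exercised without a network by feeding it a stand-in async generator. The sketch below assumes exactly that shape; `FakeResult`, `fake_stream` and `collect_stats` are hypothetical helpers, not part of crawl4ai.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeResult:
    """Stand-in for a crawl result with only the fields the loop reads."""
    url: str
    success: bool
    html: str = ""
    error_message: str = ""


async def fake_stream(results):
    """Yield results one at a time, as a streaming scraper might."""
    for result in results:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield result


async def collect_stats(stream):
    """Consume an async stream of results and tally the same counters."""
    stats = {"processed": 0, "errors": 0, "total_size": 0}
    async for result in stream:
        stats["processed"] += 1
        if result.success:
            stats["total_size"] += len(result.html)
        else:
            stats["errors"] += 1
    return stats


if __name__ == "__main__":
    sample = [
        FakeResult("https://example.com/news/1", True, html="<html></html>"),
        FakeResult("https://example.com/news/2", False, error_message="timeout"),
    ]
    print(asyncio.run(collect_stats(fake_stream(sample))))
```

Separating the tally into a function that accepts any async iterable keeps the counting logic testable on its own, independent of the crawler and of network conditions.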
--- a/tests/test_scraper.py
+++ b/tests/test_scraper.py
@@ -0,0 +1,184 @@
+# basic_scraper_example.py
+from crawl4ai.scraper import (
+    AsyncWebScraper,
+    BFSScraperStrategy,
+    FilterChain,
+    URLPatternFilter,
+    ContentTypeFilter
+)
+from crawl4ai.async_webcrawler import AsyncWebCrawler
+
+async def basic_scraper_example():
+    """
+    Basic example: Scrape a blog site for articles
+    - Crawls only HTML pages
+    - Stays within the blog section
+    - Collects all results at once
+    """
+    # Create a simple filter chain
+    filter_chain = FilterChain([
+        # Only crawl pages within the blog section
+        URLPatternFilter("*/blog/*"),
+        # Only process HTML pages
+        ContentTypeFilter(["text/html"])
+    ])
+
+    # Initialize the strategy with basic configuration
+    strategy = BFSScraperStrategy(
+        max_depth=2,  # Only go 2 levels deep
+        filter_chain=filter_chain,
+        url_scorer=None,  # Use default scoring
+        max_concurrent=3  # Limit concurrent requests
+    )
+
+    # Create the crawler and scraper
+    crawler = AsyncWebCrawler()
+    scraper = AsyncWebScraper(crawler, strategy)
+
+    # Start scraping
+    try:
+        result = await scraper.ascrape("https://example.com/blog/")
+        
+        # Process results
+        print(f"Crawled {len(result.crawled_urls)} pages:")
+        for url, data in result.extracted_data.items():
+            print(f"- {url}: {len(data.html)} bytes")
+            
+    except Exception as e:
+        print(f"Error during scraping: {e}")
+
+# advanced_scraper_example.py
+import logging
+import re
+from crawl4ai.scraper import (
+    AsyncWebScraper,
+    BFSScraperStrategy,
+    FilterChain,
+    URLPatternFilter,
+    ContentTypeFilter,
+    DomainFilter,
+    KeywordRelevanceScorer,
+    PathDepthScorer,
+    FreshnessScorer,
+    CompositeScorer
+)
+from crawl4ai.async_webcrawler import AsyncWebCrawler
+
+async def advanced_scraper_example():
+    """
+    Advanced example: Intelligent news site scraping
+    - Uses all filter types
+    - Implements sophisticated scoring
+    - Streams results
+    - Includes monitoring and logging
+    """
+    # Set up logging
+    logging.basicConfig(level=logging.INFO)
+    logger = logging.getLogger("advanced_scraper")
+
+    # Create sophisticated filter chain
+    filter_chain = FilterChain([
+        # Domain control
+        DomainFilter(
+            allowed_domains=["example.com", "blog.example.com"],
+            blocked_domains=["ads.example.com", "tracker.example.com"]
+        ),
+        # URL patterns
+        URLPatternFilter([
+            "*/article/*",
+            "*/news/*",
+            "*/blog/*",
+            re.compile(r"\d{4}/\d{2}/.*")  # Date-based URLs
+        ]),
+        # Content types
+        ContentTypeFilter([
+            "text/html",
+            "application/xhtml+xml"
+        ])
+    ])
+
+    # Create composite scorer
+    scorer = CompositeScorer([
+        # Prioritize by keywords
+        KeywordRelevanceScorer(
+            keywords=["news", "breaking", "update", "latest"],
+            weight=1.0
+        ),
+        # Prefer optimal URL structure
+        PathDepthScorer(
+            optimal_depth=3,
+            weight=0.7
+        ),
+        # Prioritize fresh content
+        FreshnessScorer(weight=0.9)
+    ])
+
+    # Initialize strategy with advanced configuration
+    strategy = BFSScraperStrategy(
+        max_depth=4,
+        filter_chain=filter_chain,
+        url_scorer=scorer,
+        max_concurrent=5,
+        min_crawl_delay=1
+    )
+
+    # Create crawler and scraper
+    crawler = AsyncWebCrawler()
+    scraper = AsyncWebScraper(crawler, strategy)
+
+    # Track statistics
+    stats = {
+        'processed': 0,
+        'errors': 0,
+        'total_size': 0
+    }
+
+    try:
+        # Use streaming mode
+        async for result in scraper.ascrape("https://example.com/news/", stream=True):
+            stats['processed'] += 1
+            
+            if result.success:
+                stats['total_size'] += len(result.html)
+                logger.info(f"Processed: {result.url}")
+                
+                # Print scoring information
+                for scorer_name, score in result.scores.items():
+                    logger.debug(f"{scorer_name}: {score:.2f}")
+            else:
+                stats['errors'] += 1
+                logger.error(f"Failed to process {result.url}: {result.error_message}")
+
+            # Log progress regularly
+            if stats['processed'] % 10 == 0:
+                logger.info(f"Progress: {stats['processed']} URLs processed")
+
+    except Exception as e:
+        logger.error(f"Scraping error: {e}")
+    
+    finally:
+        # Print final statistics
+        logger.info("Scraping completed:")
+        logger.info(f"- URLs processed: {stats['processed']}")
+        logger.info(f"- Errors: {stats['errors']}")
+        logger.info(f"- Total content size: {stats['total_size'] / 1024:.2f} KB")
+        
+        # Print filter statistics
+        for filter_ in filter_chain.filters:
+            logger.info(f"{filter_.name} stats:")
+            logger.info(f"- Passed: {filter_.stats.passed_urls}")
+            logger.info(f"- Rejected: {filter_.stats.rejected_urls}")
+        
+        # Print scorer statistics
+        logger.info("Scoring statistics:")
+        logger.info(f"- Average score: {scorer.stats.average_score:.2f}")
+        logger.info(f"- Score range: {scorer.stats.min_score:.2f} - {scorer.stats.max_score:.2f}")
+
+if __name__ == "__main__":
+    import asyncio
+    
+    # Run basic example
+    print("Running basic scraper example...")
+    asyncio.run(basic_scraper_example())
+    
+    print("\nRunning advanced scraper example...")
+    asyncio.run(advanced_scraper_example())
Author	SHA1	Message	Date
UncleCode	0d357ab7d2	feat(scraper): Enhance URL filtering and scoring systems Implement comprehensive URL filtering and scoring capabilities: Filters: - Add URLPatternFilter with glob/regex support - Implement ContentTypeFilter with MIME type checking - Add DomainFilter for domain control - Create FilterChain with stats tracking Scorers: - Complete KeywordRelevanceScorer implementation - Add PathDepthScorer for URL structure scoring - Implement ContentTypeScorer for file type priorities - Add FreshnessScorer for date-based scoring - Add DomainAuthorityScorer for domain weighting - Create CompositeScorer for combined strategies Features: - Add statistics tracking for both filters and scorers - Implement logging support throughout - Add resource cleanup methods - Create comprehensive documentation - Include performance optimizations Tests and docs included. Note: Review URL normalization overlap with recent crawler changes.	2024-11-08 19:02:28 +08:00
UncleCode	bae4665949	feat(scraper): Enhance URL filtering and scoring systems Implement comprehensive URL filtering and scoring capabilities: Filters: - Add URLPatternFilter with glob/regex support - Implement ContentTypeFilter with MIME type checking - Add DomainFilter for domain control - Create FilterChain with stats tracking Scorers: - Complete KeywordRelevanceScorer implementation - Add PathDepthScorer for URL structure scoring - Implement ContentTypeScorer for file type priorities - Add FreshnessScorer for date-based scoring - Add DomainAuthorityScorer for domain weighting - Create CompositeScorer for combined strategies Features: - Add statistics tracking for both filters and scorers - Implement logging support throughout - Add resource cleanup methods - Create comprehensive documentation - Include performance optimizations Tests and docs included. Note: Review URL normalization overlap with recent crawler changes. - Quick Start is created and added	2024-11-08 18:45:12 +08:00
UncleCode	d11c004fbb	Enhanced BFS Strategy: Improved monitoring, resource management & configuration - Added CrawlStats for comprehensive crawl monitoring - Implemented proper resource cleanup with shutdown mechanism - Enhanced URL processing with better validation and politeness controls - Added configuration options (max_concurrent, timeout, external_links) - Improved error handling with retry logic - Added domain-specific queues for better performance - Created comprehensive documentation Note: URL normalization needs review - potential duplicate processing with core crawler for internal links. Currently commented out pending further investigation of edge cases.	2024-11-08 15:57:23 +08:00
UncleCode	3d1c9a8434	Reviewing the BFS strategy.	2024-11-07 18:54:53 +08:00
UncleCode	be472c624c	Refactored AsyncWebScraper to include comprehensive error handling and progress tracking capabilities. Introduced a ScrapingProgress data class to monitor processed and failed URLs. Enhanced scraping methods to log errors and track stats throughout the scraping process.	2024-11-06 21:09:47 +08:00
UncleCode	06b21dcc50	Update .gitignore to include new directories for issues and documentation	2024-11-06 18:44:03 +08:00
UncleCode	0f0f60527d	Merge pull request #172 from aravindkarnam/scraper Scraper	2024-11-06 07:00:44 +01:00
Aravind Karnam	8105fd178e	Removed stubs for remove_from_future_crawls since the visited set is updated soon as the URL was queued, Removed add_to_retry_queue(url) since retry with exponential backoff with help of tenacity is going to take care of it.	2024-10-17 15:42:43 +05:30
Aravind Karnam	ce7fce4b16	1. Moved to asyncio.wait instead of gather so that results can be yielded as soon as they are ready, rather than in batches 2. Moved the visited.add(url) call to before the task is put in the queue rather than after the crawl is completed. This makes sure that duplicate crawls don't happen when the same URL is found at a different depth and gets queued too because the crawl is not yet completed and the visited set is not updated. 3. Renamed the yield_results attribute to stream, since that name seems to be popularly used in other AI libraries for intermediate results.	2024-10-17 12:25:17 +05:30
Aravind Karnam	de28b59aca	removed unused imports	2024-10-16 22:36:48 +05:30
Aravind Karnam	04d8b47b92	Exposed min_crawl_delay for BFSScraperStrategy	2024-10-16 22:34:54 +05:30
Aravind Karnam	2943feeecf	1. Added a flag to yield each crawl result,as they become ready along with the final scraper result as another option 2. Removed ascrape_many method, as I'm currently not focusing on it in the first cut of scraper 3. Added some error handling for cases where robots.txt cannot be fetched or parsed.	2024-10-16 22:05:29 +05:30
Aravind Karnam	8a7d29ce85	updated some comments and removed content type checking functionality from core as it's implemented as a filter	2024-10-16 15:59:37 +05:30
aravind	159bd875bd	Merge pull request #5 from aravindkarnam/main Merging 0.3.6	2024-10-16 10:41:22 +05:30
unclecode	9ffa34b697	Update README	2024-10-14 22:58:27 +08:00
unclecode	740802c491	Merge branch '0.3.6'	2024-10-14 22:55:24 +08:00
unclecode	b9ac96c332	Merge branch 'main' of https://github.com/unclecode/crawl4ai	2024-10-14 22:54:23 +08:00
unclecode	d06535388a	Update gitignore	2024-10-14 22:53:56 +08:00
unclecode	2b73bdf6b0	Update changelog	2024-10-14 21:04:02 +08:00
unclecode	6aa803d712	Update gitignore	2024-10-14 21:03:40 +08:00
unclecode	320afdea64	feat: Enhance crawler flexibility and LLM extraction capabilities - Add browser type selection (Chromium, Firefox, WebKit) - Implement iframe content extraction - Improve image processing and dimension updates - Add custom headers support in AsyncPlaywrightCrawlerStrategy - Enhance delayed content retrieval with new parameter - Optimize HTML sanitization and Markdown conversion - Update examples in quickstart_async.py for new features	2024-10-14 21:03:28 +08:00
UncleCode	ccbe72cfc1	Merge pull request #135 from hitesh22rana/fix/docs-example docs: fixed css_selector for example	2024-10-13 14:39:07 +08:00
unclecode	b9bbd42373	Update Quickstart examples	2024-10-13 14:37:45 +08:00
unclecode	68e9144ce3	feat: Enhance crawling control and LLM extraction flexibility - Add before_retrieve_html hook and delay_before_return_html option - Implement flexible page_timeout for smart_wait function - Support extra_args and custom headers in LLM extraction - Allow arbitrary kwargs in AsyncWebCrawler initialization - Improve perform_completion_with_backoff for custom API calls - Update examples with new features and diverse LLM providers	2024-10-12 14:48:22 +08:00
unclecode	9b2b267820	CHANGELOG UPDATE	2024-10-12 13:42:56 +08:00
unclecode	ff3524d9b1	feat(v0.3.6): Add screenshot capture, delayed content, and custom timeouts - Implement screenshot capture functionality - Add delayed content retrieval method - Introduce custom page timeout parameter - Enhance LLM support with multiple providers - Improve database schema auto-updates - Optimize image processing in WebScrappingStrategy - Update error handling and logging - Expand examples in quickstart_async.py	2024-10-12 13:42:42 +08:00
unclecode	b99d20b725	Add pypi_build.sh to .gitignore	2024-10-08 18:10:57 +08:00
hitesh22rana	768b93140f	docs: fixed css_selector for example	2024-10-05 00:25:41 +09:00
Aravind Karnam	d743adac68	Fixed some bugs in robots.txt processing	2024-10-03 15:58:57 +05:30
Aravind Karnam	7fe220dbd5	1. Introduced a bool flag to ascrape method to switch between sequential and concurrent processing 2. Introduced a dictionary for depth tracking across various tasks 3. Removed redundancy with crawled_urls variable. Instead created a list with visited set variable in returned object.	2024-10-03 11:17:11 +05:30
aravind	65e013d9d1	Merge pull request #3 from aravindkarnam/main Merging latest changes from main branch	2024-10-03 09:52:12 +05:30
Aravind Karnam	7f3e2e47ed	Parallel processing with retry on failure with exponential backoff - Simplified URL validation and normalisation - respecting Robots.txt	2024-09-19 12:34:12 +05:30
aravind	78f26ac263	Merge pull request #2 from aravindkarnam/staging Staging	2024-09-18 18:16:23 +05:30
Aravind Karnam	44ce12c62c	Created scaffolding for Scraper as per the plan. Implemented the ascrape method in bfs_scraper_strategy	2024-09-09 13:13:34 +05:30