fix: show /llm API response in playground. ref #1288

Merge branch '2025-JUN-1' into next-MAY
feat: Add social media and community links to README and index documentation
2025-07-09 16:59:17 +02:00 · 2025-07-09 09:41:03 +02:00 · 2025-07-08 15:48:40 +02:00 · 2025-07-08 12:54:33 +02:00 · 2025-07-08 12:24:33 +02:00 · 2025-07-08 11:46:24 +02:00
42 changed files with 1149 additions and 210 deletions
--- a/README.md
+++ b/README.md
@@ -11,12 +11,17 @@
 [![Python Version](https://img.shields.io/pypi/pyversions/crawl4ai)](https://pypi.org/project/crawl4ai/)
 [![Downloads](https://static.pepy.tech/badge/crawl4ai/month)](https://pepy.tech/project/crawl4ai)

-<!-- [![Documentation Status](https://readthedocs.org/projects/crawl4ai/badge/?version=latest)](https://crawl4ai.readthedocs.io/) -->
-[![License](https://img.shields.io/github/license/unclecode/crawl4ai)](https://github.com/unclecode/crawl4ai/blob/main/LICENSE)
-[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
-[![Security: bandit](https://img.shields.io/badge/security-bandit-yellow.svg)](https://github.com/PyCQA/bandit)
-[![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg)](code_of_conduct.md)
-
+<p align="center">
+    <a href="https://x.com/crawl4ai">
+      <img src="https://img.shields.io/badge/Follow%20on%20X-000000?style=for-the-badge&logo=x&logoColor=white" alt="Follow on X" />
+    </a>
+    <a href="https://www.linkedin.com/company/crawl4ai">
+      <img src="https://img.shields.io/badge/Follow%20on%20LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white" alt="Follow on LinkedIn" />
+    </a>
+    <a href="https://discord.gg/jP8KfhDhyN">
+      <img src="https://img.shields.io/badge/Join%20our%20Discord-5865F2?style=for-the-badge&logo=discord&logoColor=white" alt="Join our Discord" />
+    </a>
+  </p>
 </div>

 Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.  
@@ -291,12 +296,20 @@ import requests
 # Submit a crawl job
 response = requests.post(
    "http://localhost:11235/crawl",
-    json={"urls": "https://example.com", "priority": 10}
+    json={"urls": ["https://example.com"], "priority": 10}
 )
-task_id = response.json()["task_id"]
-
-# Continue polling until the task is complete (status="completed")
-result = requests.get(f"http://localhost:11235/task/{task_id}")
+if response.status_code == 200:
+    print("Crawl job submitted successfully.")
+    
+if "results" in response.json():
+    results = response.json()["results"]
+    print("Crawl job completed. Results:")
+    for result in results:
+        print(result)
+else:
+    task_id = response.json()["task_id"]
+    print(f"Crawl job submitted. Task ID: {task_id}")
+    result = requests.get(f"http://localhost:11235/task/{task_id}")
 ```

 For more examples, see our [Docker Examples](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_example.py). For advanced configuration, environment variables, and usage examples, see our [Docker Deployment Guide](https://docs.crawl4ai.com/basic/docker-deployment/).
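
The removed README comment mentioned polling until the task reports `status="completed"`. A minimal sketch of such a loop is below. The `poll_task` helper, the injected `fetch_status` callable, and the `"failed"` terminal value are assumptions for illustration, not confirmed server behavior.

```python
import time
from typing import Callable, Dict


def poll_task(
    fetch_status: Callable[[], Dict],
    interval: float = 0.5,
    max_attempts: int = 20,
) -> Dict:
    """Call fetch_status until it reports a terminal status or attempts run out."""
    for _ in range(max_attempts):
        payload = fetch_status()
        # "completed" comes from the original README comment; "failed" is assumed.
        if payload.get("status") in ("completed", "failed"):
            return payload
        time.sleep(interval)
    raise TimeoutError(f"task not finished after {max_attempts} attempts")
```

With `requests`, the callable could be something like `lambda: requests.get(f"http://localhost:11235/task/{task_id}").json()`, assuming the endpoint returns a JSON object with a `status` field.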
--- a/crawl4ai/async_configs.py
+++ b/crawl4ai/async_configs.py
@@ -926,6 +926,8 @@ class CrawlerRunConfig():
                               Default: False.
        scroll_delay (float): Delay in seconds between scroll steps if scan_full_page is True.
                              Default: 0.2.
+        max_scroll_steps (Optional[int]): Maximum number of scroll steps to perform during full page scan.
+                                         If None, scrolls until the entire page is loaded. Default: None.
        process_iframes (bool): If True, attempts to process and inline iframe content.
                                Default: False.
        remove_overlay_elements (bool): If True, remove overlays/popups before extracting HTML.
@@ -1066,6 +1068,7 @@ class CrawlerRunConfig():
        ignore_body_visibility: bool = True,
        scan_full_page: bool = False,
        scroll_delay: float = 0.2,
+        max_scroll_steps: Optional[int] = None,
        process_iframes: bool = False,
        remove_overlay_elements: bool = False,
        simulate_user: bool = False,
@@ -1170,6 +1173,7 @@ class CrawlerRunConfig():
        self.ignore_body_visibility = ignore_body_visibility
        self.scan_full_page = scan_full_page
        self.scroll_delay = scroll_delay
+        self.max_scroll_steps = max_scroll_steps
        self.process_iframes = process_iframes
        self.remove_overlay_elements = remove_overlay_elements
        self.simulate_user = simulate_user
@@ -1387,6 +1391,7 @@ class CrawlerRunConfig():
            ignore_body_visibility=kwargs.get("ignore_body_visibility", True),
            scan_full_page=kwargs.get("scan_full_page", False),
            scroll_delay=kwargs.get("scroll_delay", 0.2),
+            max_scroll_steps=kwargs.get("max_scroll_steps"),
            process_iframes=kwargs.get("process_iframes", False),
            remove_overlay_elements=kwargs.get("remove_overlay_elements", False),
            simulate_user=kwargs.get("simulate_user", False),
@@ -1499,6 +1504,7 @@ class CrawlerRunConfig():
            "ignore_body_visibility": self.ignore_body_visibility,
            "scan_full_page": self.scan_full_page,
            "scroll_delay": self.scroll_delay,
+            "max_scroll_steps": self.max_scroll_steps,
            "process_iframes": self.process_iframes,
            "remove_overlay_elements": self.remove_overlay_elements,
            "simulate_user": self.simulate_user,
--- a/crawl4ai/async_crawler_strategy.py
+++ b/crawl4ai/async_crawler_strategy.py
@@ -445,6 +445,9 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
            return await self._crawl_web(url, config)

        elif url.startswith("file://"):
+            # Initialize an empty list for console messages
+            captured_console = []
+            
            # Process local file
            local_file_path = url[7:]  # Remove 'file://' prefix
            if not os.path.exists(local_file_path):
@@ -466,9 +469,15 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                console_messages=captured_console,
            )

-        elif url.startswith("raw:") or url.startswith("raw://"):
+        ##### 
+        # Since both "raw:" and "raw://" start with "raw:", the first condition is always true for both, so "raw://" will be sliced as "//...", which is incorrect.
+        # Fix: Check for "raw://" first, then "raw:"
+        # Also, the prefix "raw://" is actually 6 characters long, not 7, so it should be sliced accordingly: url[6:]
+        #####
+        elif url.startswith("raw://") or url.startswith("raw:"):
            # Process raw HTML content
-            raw_html = url[4:] if url[:4] == "raw:" else url[7:]
+            # raw_html = url[4:] if url[:4] == "raw:" else url[7:]
+            raw_html = url[6:] if url.startswith("raw://") else url[4:]
            html = raw_html
            if config.screenshot:
                screenshot_data = await self._generate_screenshot_from_html(html)
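
The prefix handling in this hunk can be sketched in isolation. The helper name `strip_raw_prefix` is hypothetical; the point is only that the longer prefix is tested first and the slice length is derived from the prefix itself rather than hard-coded.

```python
def strip_raw_prefix(url: str) -> str:
    """Return the HTML payload of a "raw:" or "raw://" URL."""
    # Test the longer prefix first: every "raw://" URL also starts with "raw:".
    for prefix in ("raw://", "raw:"):
        if url.startswith(prefix):
            # len(prefix) is 6 for "raw://" and 4 for "raw:".
            return url[len(prefix):]
    raise ValueError("not a raw URL")
```

Deriving the offset from `len(prefix)` avoids the off-by-one that the old `url[7:]` slice had.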
@@ -741,18 +750,49 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                    )
                    redirected_url = page.url
                except Error as e:
-                    raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
+                    # Allow navigation to be aborted when downloading files
+                    # This is expected behavior for downloads in some browser engines
+                    if 'net::ERR_ABORTED' in str(e) and self.browser_config.accept_downloads:
+                        self.logger.info(
+                            message=f"Navigation aborted, likely due to file download: {url}",
+                            tag="GOTO",
+                            params={"url": url},
+                        )
+                        response = None
+                    else:
+                        raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")

                await self.execute_hook(
                    "after_goto", page, context=context, url=url, response=response, config=config
                )

+                # ──────────────────────────────────────────────────────────────
+                # Walk the redirect chain.  Playwright returns only the last
+                # hop, so we trace the `request.redirected_from` links until the
+                # first response that differs from the final one and surface its
+                # status-code.
+                # ──────────────────────────────────────────────────────────────
                if response is None:
                    status_code = 200
                    response_headers = {}
                else:
-                    status_code = response.status
-                    response_headers = response.headers
+                    first_resp = response
+                    req = response.request
+                    while req and req.redirected_from:
+                        prev_req = req.redirected_from
+                        prev_resp = await prev_req.response()
+                        if prev_resp:                       # keep earliest
+                            first_resp = prev_resp
+                        req = prev_req
+                
+                    status_code = first_resp.status
+                    response_headers = first_resp.headers
+                # if response is None:
+                #     status_code = 200
+                #     response_headers = {}
+                # else:
+                #     status_code = response.status
+                #     response_headers = response.headers

            else:
                status_code = 200
@@ -896,7 +936,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):

            # Handle full page scanning
            if config.scan_full_page:
-                await self._handle_full_page_scan(page, config.scroll_delay)
+                # await self._handle_full_page_scan(page, config.scroll_delay)
+                await self._handle_full_page_scan(page, config.scroll_delay, config.max_scroll_steps)

            # Handle virtual scroll if configured
            if config.virtual_scroll_config:
@@ -1088,7 +1129,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                # Close the page
                await page.close()

-    async def _handle_full_page_scan(self, page: Page, scroll_delay: float = 0.1):
+    # async def _handle_full_page_scan(self, page: Page, scroll_delay: float = 0.1):
+    async def _handle_full_page_scan(self, page: Page, scroll_delay: float = 0.1, max_scroll_steps: Optional[int] = None):
        """
        Helper method to handle full page scanning.

@@ -1103,6 +1145,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
        Args:
            page (Page): The Playwright page object
            scroll_delay (float): The delay between page scrolls
+            max_scroll_steps (Optional[int]): Maximum number of scroll steps to perform. If None, scrolls until end.

        """
        try:
@@ -1127,9 +1170,21 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
            dimensions = await self.get_page_dimensions(page)
            total_height = dimensions["height"]

+            scroll_step_count = 0
            while current_position < total_height:
+                #### 
+                # NEW FEATURE: Check if we've reached the maximum allowed scroll steps
+                # This prevents infinite scrolling on very long pages or infinite scroll scenarios
+                # If max_scroll_steps is None, this check is skipped (unlimited scrolling - original behavior)
+                ####
+                if max_scroll_steps is not None and scroll_step_count >= max_scroll_steps:
+                    break
                current_position = min(current_position + viewport_height, total_height)
                await self.safe_scroll(page, 0, current_position, delay=scroll_delay)
+
+                # Increment the step counter for max_scroll_steps tracking
+                scroll_step_count += 1
+                
                # await page.evaluate(f"window.scrollTo(0, {current_position})")
                # await asyncio.sleep(scroll_delay)
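
To see how `max_scroll_steps` bounds the loop, here is a small simulation without a browser. It assumes a static page height and that scanning starts with the first viewport already visible; both are assumptions, and `count_scroll_steps` is a hypothetical name.

```python
from typing import Optional


def count_scroll_steps(
    total_height: int,
    viewport_height: int,
    max_scroll_steps: Optional[int] = None,
) -> int:
    """Simulate the scan loop and return how many scroll steps it performs."""
    if viewport_height <= 0:
        raise ValueError("viewport_height must be positive")
    current_position = viewport_height  # assumed starting point
    steps = 0
    while current_position < total_height:
        # None means unlimited scrolling, as in the original behavior.
        if max_scroll_steps is not None and steps >= max_scroll_steps:
            break
        current_position = min(current_position + viewport_height, total_height)
        steps += 1
    return steps
```

For a 5000 px page and a 1000 px viewport this performs 4 steps when unlimited, and stops after 2 steps when `max_scroll_steps=2`.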

@@ -1616,12 +1671,32 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
            num_segments = (page_height // viewport_height) + 1
            for i in range(num_segments):
                y_offset = i * viewport_height
+                # Special handling for the last segment
+                if i == num_segments - 1:
+                    last_part_height = page_height % viewport_height
+                    
+                    # If page_height is an exact multiple of viewport_height,
+                    # we don't need an extra segment
+                    if last_part_height == 0:
+                        # Skip last segment if page height is exact multiple of viewport
+                        break
+                    
+                    # Adjust viewport to exactly match the remaining content height
+                    await page.set_viewport_size({"width": page_width, "height": last_part_height})
+                
                await page.evaluate(f"window.scrollTo(0, {y_offset})")
                await asyncio.sleep(0.01)  # wait for render
-                seg_shot = await page.screenshot(full_page=False)
+                
+                # Capture the current segment
+                # Note: compression options (type="jpeg", quality=85) are applied here
+                seg_shot = await page.screenshot(full_page=False, type="jpeg", quality=85)
+                # seg_shot = await page.screenshot(full_page=False)
                img = Image.open(BytesIO(seg_shot)).convert("RGB")
                segments.append(img)

+            # Reset viewport to original size after capturing segments
+            await page.set_viewport_size({"width": page_width, "height": viewport_height})
+
            total_height = sum(img.height for img in segments)
            stitched = Image.new("RGB", (segments[0].width, total_height))
            offset = 0
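
The last-segment rule above amounts to a simple height calculation. A minimal sketch follows; the function name `segment_heights` is hypothetical.

```python
from typing import List


def segment_heights(page_height: int, viewport_height: int) -> List[int]:
    """Heights of the screenshot segments needed to cover the page exactly."""
    full_segments, remainder = divmod(page_height, viewport_height)
    heights = [viewport_height] * full_segments
    # A partial final segment is only needed when the page height is not
    # an exact multiple of the viewport height.
    if remainder:
        heights.append(remainder)
    return heights
```

The heights always sum to `page_height`, which is why the stitched image no longer repeats content at the bottom.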
@@ -1750,12 +1825,31 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
                    # then wait for the new page to load before continuing
                    result = None
                    try:
+                        # OLD VERSION:
+                        # result = await page.evaluate(
+                        #     f"""
+                        # (async () => {{
+                        #     try {{
+                        #         const script_result = {script};
+                        #         return {{ success: true, result: script_result }};
+                        #     }} catch (err) {{
+                        #         return {{ success: false, error: err.toString(), stack: err.stack }};
+                        #     }}
+                        # }})();
+                        # """
+                        # )
+                        
+                        # NEW VERSION:
+                        # When {script} contains statements (e.g., const link = …; link.click();),
+                        # the old wrapper `const script_result = {script};` is invalid JavaScript and
+                        # Playwright fails with: SyntaxError: Unexpected token 'const'.
                        result = await page.evaluate(
                            f"""
                        (async () => {{
                            try {{
-                                const script_result = {script};
-                                return {{ success: true, result: script_result }};
+                                return await (async () => {{
+                                    {script}
+                                }})();
                            }} catch (err) {{
                                return {{ success: false, error: err.toString(), stack: err.stack }};
                            }}
--- a/crawl4ai/async_logger.py
+++ b/crawl4ai/async_logger.py
@@ -39,6 +39,7 @@ class LogColor(str, Enum):
    YELLOW = "yellow"
    MAGENTA = "magenta"
    DIM_MAGENTA = "dim magenta"
+    RED = "red"

    def __str__(self):
        """Automatically convert rich color to string."""
--- a/crawl4ai/async_webcrawler.py
+++ b/crawl4ai/async_webcrawler.py
@@ -363,7 +363,7 @@ class AsyncWebCrawler:
                        pdf_data=pdf_data,
                        verbose=config.verbose,
                        is_raw_html=True if url.startswith("raw:") else False,
-                        redirected_url=async_response.redirected_url, 
+                        redirected_url=async_response.redirected_url,
                        **kwargs,
                    )

@@ -506,7 +506,7 @@ class AsyncWebCrawler:
            tables = media.pop("tables", [])
            links = result.links.model_dump()
            metadata = result.metadata
-            
+
        fit_html = preprocess_html_for_schema(html_content=html, text_threshold= 500, max_size= 300_000)

        ################################
@@ -588,11 +588,13 @@ class AsyncWebCrawler:
            # Choose content based on input_format
            content_format = config.extraction_strategy.input_format
            if content_format == "fit_markdown" and not markdown_result.fit_markdown:
-                self.logger.warning(
-                    message="Fit markdown requested but not available. Falling back to raw markdown.",
-                    tag="EXTRACT",
-                    params={"url": _url},
-                )
+
+                self.logger.url_status(
+                        url=_url,
+                        success=bool(html),
+                        timing=time.perf_counter() - t1,
+                        tag="EXTRACT",
+                    )
                content_format = "markdown"

            content = {
@@ -616,11 +618,12 @@ class AsyncWebCrawler:
            )

            # Log extraction completion
-            self.logger.info(
-                message="Completed for {url:.50}... | Time: {timing}s",
-                tag="EXTRACT",
-                params={"url": _url, "timing": time.perf_counter() - t1},
-            )
+            self.logger.url_status(
+                        url=_url,
+                        success=bool(html),
+                        timing=time.perf_counter() - t1,
+                        tag="EXTRACT",
+                    )

        # Apply HTML formatting if requested
        if config.prettiify:
--- a/crawl4ai/browser_profiler.py
+++ b/crawl4ai/browser_profiler.py
@@ -480,7 +480,7 @@ class BrowserProfiler:
                self.logger.info("4. Exit", tag="MENU", base_color=LogColor.MAGENTA)
                exit_option = "4"
            
-            self.logger.print(f"\n[cyan]Enter your choice (1-{exit_option}): [/cyan]", end="")
+            self.logger.info(f"\n[cyan]Enter your choice (1-{exit_option}): [/cyan]", end="")
            choice = input()
            
            if choice == "1":
@@ -637,9 +637,18 @@ class BrowserProfiler:
        self.logger.info(f"Debugging port: {debugging_port}", tag="CDP")
        self.logger.info(f"Headless mode: {headless}", tag="CDP")
        
+        # create browser config
+        browser_config = BrowserConfig(
+            browser_type=browser_type,
+            headless=headless,
+            user_data_dir=profile_path,
+            debugging_port=debugging_port,
+            verbose=True
+        )
+        
        # Create managed browser instance
        managed_browser = ManagedBrowser(
-            browser_type=browser_type,
+            browser_config=browser_config,
            user_data_dir=profile_path,
            headless=headless,
            logger=self.logger,
--- a/crawl4ai/cli.py
+++ b/crawl4ai/cli.py
@@ -1010,7 +1010,7 @@ def cdp_cmd(user_data_dir: Optional[str], port: int, browser_type: str, headless
@click.option("--crawler", "-c", type=str, callback=parse_key_values, help="Crawler parameters as key1=value1,key2=value2")
@click.option("--output", "-o", type=click.Choice(["all", "json", "markdown", "md", "markdown-fit", "md-fit"]), default="all")
@click.option("--output-file", "-O", type=click.Path(), help="Output file path (default: stdout)")
-@click.option("--bypass-cache", "-b", is_flag=True, default=True, help="Bypass cache when crawling")
+@click.option("--bypass-cache", "-bc", is_flag=True, default=True, help="Bypass cache when crawling")
@click.option("--question", "-q", help="Ask a question about the crawled content")
@click.option("--verbose", "-v", is_flag=True)
@click.option("--profile", "-p", help="Use a specific browser profile (by name)")
--- a/crawl4ai/content_scraping_strategy.py
+++ b/crawl4ai/content_scraping_strategy.py
@@ -720,13 +720,18 @@ class WebScrapingStrategy(ContentScrapingStrategy):

                    # Check flag if we should remove external images
                    if kwargs.get("exclude_external_images", False):
-                        element.decompose()
-                        return False
-                        # src_url_base = src.split('/')[2]
-                        # url_base = url.split('/')[2]
-                        # if url_base not in src_url_base:
-                        #     element.decompose()
-                        #     return False
+                        # Handle relative URLs (which are always from the same domain)
+                        if not src.startswith('http') and not src.startswith('//'):
+                            return True  # Keep relative URLs
+                        
+                        # For absolute URLs, compare the base domains using the existing function
+                        src_base_domain = get_base_domain(src)
+                        url_base_domain = get_base_domain(url)
+                        
+                        # If the domains don't match and both are valid, the image is external
+                        if src_base_domain and url_base_domain and src_base_domain != url_base_domain:
+                            element.decompose()
+                            return False

                    # if kwargs.get('exclude_social_media_links', False):
                    #     if image_src_base_domain in exclude_social_media_domains:
--- a/crawl4ai/deep_crawling/bff_strategy.py
+++ b/crawl4ai/deep_crawling/bff_strategy.py
@@ -150,6 +150,14 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
                self.logger.info(f"Max pages limit ({self.max_pages}) reached, stopping crawl")
                break
                
+            # Calculate how many more URLs we can process in this batch
+            remaining = self.max_pages - self._pages_crawled
+            batch_size = min(BATCH_SIZE, remaining)
+            if batch_size <= 0:
+                # No more pages to crawl
+                self.logger.info(f"Max pages limit ({self.max_pages}) reached, stopping crawl")
+                break
+                
            batch: List[Tuple[float, int, str, Optional[str]]] = []
            # Retrieve up to BATCH_SIZE items from the priority queue.
            for _ in range(BATCH_SIZE):
@@ -184,6 +192,10 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
                # Count only successful crawls toward max_pages limit
                if result.success:
                    self._pages_crawled += 1
+                    # Check if we've reached the limit during batch processing
+                    if self._pages_crawled >= self.max_pages:
+                        self.logger.info(f"Max pages limit ({self.max_pages}) reached during batch, stopping crawl")
+                        break  # Exit the generator
                
                yield result
                
--- a/crawl4ai/deep_crawling/bfs_strategy.py
+++ b/crawl4ai/deep_crawling/bfs_strategy.py
@@ -157,6 +157,11 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
        results: List[CrawlResult] = []

        while current_level and not self._cancel_event.is_set():
+            # Check if we've already reached max_pages before starting a new level
+            if self._pages_crawled >= self.max_pages:
+                self.logger.info(f"Max pages limit ({self.max_pages}) reached, stopping crawl")
+                break
+            
            next_level: List[Tuple[str, Optional[str]]] = []
            urls = [url for url, _ in current_level]

@@ -221,6 +226,10 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
                # Count only successful crawls
                if result.success:
                    self._pages_crawled += 1
+                    # Check if we've reached the limit during batch processing
+                    if self._pages_crawled >= self.max_pages:
+                        self.logger.info(f"Max pages limit ({self.max_pages}) reached during batch, stopping crawl")
+                        break  # Exit the generator
                
                results_count += 1
                yield result
--- a/crawl4ai/deep_crawling/dfs_strategy.py
+++ b/crawl4ai/deep_crawling/dfs_strategy.py
@@ -49,6 +49,10 @@ class DFSDeepCrawlStrategy(BFSDeepCrawlStrategy):
                # Count only successful crawls toward max_pages limit
                if result.success:
                    self._pages_crawled += 1
+                    # Check if we've reached the limit during batch processing
+                    if self._pages_crawled >= self.max_pages:
+                        self.logger.info(f"Max pages limit ({self.max_pages}) reached during batch, stopping crawl")
+                        break  # Exit the generator
                    
                    # Only discover links from successful crawls
                    new_links: List[Tuple[str, Optional[str]]] = []
@@ -94,6 +98,10 @@ class DFSDeepCrawlStrategy(BFSDeepCrawlStrategy):
                # and only discover links from successful crawls
                if result.success:
                    self._pages_crawled += 1
+                    # Check if we've reached the limit during batch processing
+                    if self._pages_crawled >= self.max_pages:
+                        self.logger.info(f"Max pages limit ({self.max_pages}) reached during batch, stopping crawl")
+                        break  # Exit the generator
                    
                    new_links: List[Tuple[str, Optional[str]]] = []
                    await self.link_discovery(result, url, depth, visited, new_links, depths)
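
In the BFF and BFS hunks the `break` appears before `yield result`, so it may be worth checking whether the page that reaches the limit is still emitted. A minimal sketch of counting successes, yielding first, and stopping afterwards is below; `Result` and `crawl_with_limit` are hypothetical stand-ins, not the project's classes.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Result:  # hypothetical stand-in for CrawlResult
    url: str
    success: bool


def crawl_with_limit(results: Iterable[Result], max_pages: int) -> Iterator[Result]:
    """Yield results until max_pages successful ones have been emitted."""
    if max_pages <= 0:
        return
    pages_crawled = 0
    for result in results:
        # Count only successful crawls toward the limit.
        if result.success:
            pages_crawled += 1
        yield result
        # Check after yielding so the page that hits the limit is not dropped.
        if pages_crawled >= max_pages:
            break
```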
--- a/crawl4ai/deep_crawling/filters.py
+++ b/crawl4ai/deep_crawling/filters.py
@@ -227,10 +227,21 @@ class URLPatternFilter(URLFilter):
        # Prefix check (/foo/*)
        if self._simple_prefixes:
            path = url.split("?")[0]
-            if any(path.startswith(p) for p in self._simple_prefixes):
-                result = True
-                self._update_stats(result)
-                return not result if self._reverse else result
+            # if any(path.startswith(p) for p in self._simple_prefixes):
+            #     result = True
+            #     self._update_stats(result)
+            #     return not result if self._reverse else result
+            ####
+            # Modified the prefix matching logic to ensure path boundary checking:
+            # - Check if the matched prefix is followed by a path separator (`/`), query parameter (`?`), fragment (`#`), or is at the end of the path
+            # - This ensures `/api/` only matches complete path segments, not substrings like `/apiv2/`
+            ####
+            for prefix in self._simple_prefixes:
+                if path.startswith(prefix):
+                    if len(path) == len(prefix) or path[len(prefix)] in ['/', '?', '#']:
+                        result = True
+                        self._update_stats(result)
+                        return not result if self._reverse else result

        # Complex patterns
        if self._path_patterns:
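
The boundary check can be exercised on its own. This sketch assumes prefixes are stored without a trailing slash (for example `/api` for the pattern `/api/*`); the helper name `matches_prefix` is hypothetical.

```python
def matches_prefix(path: str, prefix: str) -> bool:
    """True if path starts with prefix and the prefix ends on a path boundary."""
    if not path.startswith(prefix):
        return False
    # Either the whole path is the prefix, or the next character
    # is a separator, a query marker, or a fragment marker.
    return len(path) == len(prefix) or path[len(prefix)] in "/?#"
```

Under that assumption, `/api/users` and `/api` match the prefix `/api`, while `/apiv2/users` does not.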
@@ -337,6 +348,15 @@ class ContentTypeFilter(URLFilter):
        "sqlite": "application/vnd.sqlite3",
        # Placeholder
        "unknown": "application/octet-stream",  # Fallback for unknown file types
+        # php
+        "php": "application/x-httpd-php",
+        "php3": "application/x-httpd-php",
+        "php4": "application/x-httpd-php",
+        "php5": "application/x-httpd-php",
+        "php7": "application/x-httpd-php",
+        "phtml": "application/x-httpd-php",
+        "phps": "application/x-httpd-php-source",
+
    }

    @staticmethod
--- a/crawl4ai/docker_client.py
+++ b/crawl4ai/docker_client.py
@@ -73,6 +73,8 @@ class Crawl4aiDockerClient:
    def _prepare_request(self, urls: List[str], browser_config: Optional[BrowserConfig] = None, 
                       crawler_config: Optional[CrawlerRunConfig] = None) -> Dict[str, Any]:
        """Prepare request data from configs."""
+        if self._token:
+            self._http_client.headers["Authorization"] = f"Bearer {self._token}"
        return {
            "urls": urls,
            "browser_config": browser_config.dump() if browser_config else {},
@@ -103,8 +105,6 @@ class Crawl4aiDockerClient:
        crawler_config: Optional[CrawlerRunConfig] = None
    ) -> Union[CrawlResult, List[CrawlResult], AsyncGenerator[CrawlResult, None]]:
        """Execute a crawl operation."""
-        if not self._token:
-            raise Crawl4aiClientError("Authentication required. Call authenticate() first.")
        await self._check_server()
        
        data = self._prepare_request(urls, browser_config, crawler_config)
@@ -140,8 +140,6 @@ class Crawl4aiDockerClient:

    async def get_schema(self) -> Dict[str, Any]:
        """Retrieve configuration schemas."""
-        if not self._token:
-            raise Crawl4aiClientError("Authentication required. Call authenticate() first.")
        response = await self._request("GET", "/schema")
        return response.json()

@@ -167,4 +165,4 @@ async def main():
        print(schema)

 if __name__ == "__main__":
-    asyncio.run(main())
+    asyncio.run(main())
--- a/crawl4ai/extraction_strategy.py
+++ b/crawl4ai/extraction_strategy.py
@@ -656,11 +656,11 @@ class LLMExtractionStrategy(ExtractionStrategy):
            self.total_usage.total_tokens += usage.total_tokens

            try:
-                response = response.choices[0].message.content
+                content = response.choices[0].message.content
                blocks = None

                if self.force_json_response:
-                    blocks = json.loads(response)
+                    blocks = json.loads(content)
                    if isinstance(blocks, dict):
                         # If it has only one key whose value is a list, assign that list to blocks, e.g. {"news": [..]}
                        if len(blocks) == 1 and isinstance(list(blocks.values())[0], list):
@@ -673,7 +673,7 @@ class LLMExtractionStrategy(ExtractionStrategy):
                        blocks = blocks
                else: 
                    # blocks = extract_xml_data(["blocks"], response.choices[0].message.content)["blocks"]
-                    blocks = extract_xml_data(["blocks"], response)["blocks"]
+                    blocks = extract_xml_data(["blocks"], content)["blocks"]
                    blocks = json.loads(blocks)

                for block in blocks:
--- a/crawl4ai/utils.py
+++ b/crawl4ai/utils.py
@@ -50,6 +50,29 @@ from urllib.parse import (
 )


+# Monkey patch to fix wildcard handling in urllib.robotparser
+from urllib.robotparser import RuleLine
+import re
+
+original_applies_to = RuleLine.applies_to
+
+def patched_applies_to(self, filename):
+   # Handle wildcards in paths
+   if '*' in self.path or '%2A' in self.path or self.path in ("*", "%2A"):
+       pattern = self.path.replace('%2A', '*')
+       pattern = re.escape(pattern).replace('\\*', '.*')
+       pattern = '^' + pattern
+       if pattern.endswith('\\$'):
+           pattern = pattern[:-2] + '$'
+       try:
+           return bool(re.match(pattern, filename))
+       except re.error:
+           return original_applies_to(self, filename)
+   return original_applies_to(self, filename)
+
+RuleLine.applies_to = patched_applies_to
+# Monkey patch ends
+
 def chunk_documents(
    documents: Iterable[str],
    chunk_token_threshold: int,
@@ -318,7 +341,7 @@ class RobotsParser:
                robots_url = f"{scheme}://{domain}/robots.txt"
                
                async with aiohttp.ClientSession() as session:
-                    async with session.get(robots_url, timeout=2) as response:
+                    async with session.get(robots_url, timeout=2, ssl=False) as response:
                        if response.status == 200:
                            rules = await response.text()
                            self._cache_rules(domain, rules)
@@ -1524,6 +1547,14 @@ def extract_metadata_using_lxml(html, doc=None):
        content = tag.get("content", "").strip()
        if property_name and content:
            metadata[property_name] = content
+
+    # Article metadata
+    article_tags = head.xpath('.//meta[starts-with(@property, "article:")]')
+    for tag in article_tags:
+        property_name = tag.get("property", "").strip()
+        content = tag.get("content", "").strip()
+        if property_name and content:
+            metadata[property_name] = content

    return metadata

@@ -1599,7 +1630,15 @@ def extract_metadata(html, soup=None):
        content = tag.get("content", "").strip()
        if property_name and content:
            metadata[property_name] = content
-
+    
+    # Article metadata
+    article_tags = head.find_all("meta", attrs={"property": re.compile(r"^article:")})
+    for tag in article_tags:
+        property_name = tag.get("property", "").strip()
+        content = tag.get("content", "").strip()
+        if property_name and content:
+            metadata[property_name] = content
+    
    return metadata


@@ -2068,14 +2107,16 @@ def normalize_url(href, base_url):
    parsed_base = urlparse(base_url)
    if not parsed_base.scheme or not parsed_base.netloc:
        raise ValueError(f"Invalid base URL format: {base_url}")
-
-    # Ensure base_url ends with a trailing slash if it's a directory path
-    if not base_url.endswith('/'):
-        base_url = base_url + '/'
+    
+    if parsed_base.scheme.lower() not in ["http", "https"]:
+        # Reject base URLs that use any scheme other than http or https
+        raise ValueError(f"Invalid base URL format: {base_url}")
+    cleaned_href = href.strip()

    # Use urljoin to handle all cases
-    normalized = urljoin(base_url, href.strip())
-    return normalized
+    return urljoin(base_url, cleaned_href)
+
+


 def normalize_url(
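Editorial sketch (not part of the patch): the `RuleLine.applies_to` monkey patch earlier in this file's diff turns a robots.txt rule path containing `*` into a regular expression. The same translation can be tried in isolation as below. The function name `wildcard_rule_matches` is hypothetical; the fallback to a plain prefix match mirrors how the patch defers to the original method when no wildcard is present.

```python
import re


def wildcard_rule_matches(rule_path: str, url_path: str) -> bool:
    """Return True if url_path is covered by a robots.txt rule path."""
    # Rule paths may arrive percent-encoded, so '*' can appear as '%2A'.
    decoded = rule_path.replace("%2A", "*")
    if "*" not in decoded:
        # No wildcard: plain prefix match.
        return url_path.startswith(decoded)
    # Escape regex metacharacters, then turn each escaped '*' into '.*'.
    pattern = "^" + re.escape(decoded).replace(r"\*", ".*")
    # A trailing '$' in the rule anchors the match at the end of the path.
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return re.match(pattern, url_path) is not None


if __name__ == "__main__":
    for rule, path in [
        ("/private/*/data", "/private/x/data"),
        ("/*.pdf$", "/docs/a.pdf"),
        ("/*.pdf$", "/docs/a.pdf?x=1"),
    ]:
        print(rule, path, wildcard_rule_matches(rule, path))
```

Escaping first and substituting afterwards keeps characters such as `.` and `?` literal, so only `*` and a trailing `$` carry special meaning.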
--- a/deploy/docker/api.py
+++ b/deploy/docker/api.py
@@ -459,7 +459,7 @@ async def handle_crawl_request(
            #      await crawler.close()
            #  except Exception as close_e:
            #       logger.error(f"Error closing crawler during exception handling: {close_e}")
-            logger.error(f"Error closing crawler during exception handling: {close_e}")
+            logger.error(f"Error closing crawler during exception handling: {str(e)}")

        # Measure memory even on error if possible
        end_mem_mb_error = _get_memory_mb()
@@ -518,7 +518,7 @@ async def handle_stream_crawl_request(
            #       await crawler.close()
            #  except Exception as close_e:
            #       logger.error(f"Error closing crawler during stream setup exception: {close_e}")
-            logger.error(f"Error closing crawler during stream setup exception: {close_e}")
+            logger.error(f"Error closing crawler during stream setup exception: {str(e)}")
        logger.error(f"Stream crawl error: {str(e)}", exc_info=True)
        # Raising HTTPException here will prevent streaming response
        raise HTTPException(
--- a/deploy/docker/c4ai-doc-context.md
+++ b/deploy/docker/c4ai-doc-context.md
@@ -332,7 +332,7 @@ The `clone()` method:
 ### Key fields to note

 1. **`provider`**:  
- Which LLM provoder to use. 
+- Which LLM provider to use. 
 - Possible values are `"ollama/llama3","groq/llama3-70b-8192","groq/llama3-8b-8192", "openai/gpt-4o-mini" ,"openai/gpt-4o","openai/o1-mini","openai/o1-preview","openai/o3-mini","openai/o3-mini-high","anthropic/claude-3-haiku-20240307","anthropic/claude-3-opus-20240229","anthropic/claude-3-sonnet-20240229","anthropic/claude-3-5-sonnet-20240620","gemini/gemini-pro","gemini/gemini-1.5-pro","gemini/gemini-2.0-flash","gemini/gemini-2.0-flash-exp","gemini/gemini-2.0-flash-lite-preview-02-05","deepseek/deepseek-chat"`<br/>*(default: `"openai/gpt-4o-mini"`)*

 2. **`api_token`**:  
@@ -403,7 +403,7 @@ async def main():

    md_generator = DefaultMarkdownGenerator(
    content_filter=filter,
-    options={"ignore_links": True}
+    options={"ignore_links": True})

    # 4) Crawler run config: skip cache, use extraction
    run_conf = CrawlerRunConfig(
@@ -3760,11 +3760,11 @@ To crawl a live web page, provide the URL starting with `http://` or `https://`,

 ```python
 import asyncio
-from crawl4ai import AsyncWebCrawler
+from crawl4ai import AsyncWebCrawler, CacheMode
 from crawl4ai.async_configs import CrawlerRunConfig

 async def crawl_web():
-    config = CrawlerRunConfig(bypass_cache=True)
+    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/apple", 
@@ -3785,13 +3785,13 @@ To crawl a local HTML file, prefix the file path with `file://`.

 ```python
 import asyncio
-from crawl4ai import AsyncWebCrawler
+from crawl4ai import AsyncWebCrawler, CacheMode
 from crawl4ai.async_configs import CrawlerRunConfig

 async def crawl_local_file():
    local_file_path = "/path/to/apple.html"  # Replace with your file path
    file_url = f"file://{local_file_path}"
-    config = CrawlerRunConfig(bypass_cache=True)
+    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=file_url, config=config)
@@ -3810,13 +3810,13 @@ To crawl raw HTML content, prefix the HTML string with `raw:`.

 ```python
 import asyncio
-from crawl4ai import AsyncWebCrawler
+from crawl4ai import AsyncWebCrawler, CacheMode
 from crawl4ai.async_configs import CrawlerRunConfig

 async def crawl_raw_html():
    raw_html = "<html><body><h1>Hello, World!</h1></body></html>"
    raw_html_url = f"raw:{raw_html}"
-    config = CrawlerRunConfig(bypass_cache=True)
+    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=raw_html_url, config=config)
@@ -3845,7 +3845,7 @@ import os
 import sys
 import asyncio
 from pathlib import Path
-from crawl4ai import AsyncWebCrawler
+from crawl4ai import AsyncWebCrawler, CacheMode
 from crawl4ai.async_configs import CrawlerRunConfig

 async def main():
@@ -3856,7 +3856,7 @@ async def main():
    async with AsyncWebCrawler() as crawler:
        # Step 1: Crawl the Web URL
        print("\n=== Step 1: Crawling the Wikipedia URL ===")
-        web_config = CrawlerRunConfig(bypass_cache=True)
+        web_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        result = await crawler.arun(url=wikipedia_url, config=web_config)

        if not result.success:
@@ -3871,7 +3871,7 @@ async def main():
        # Step 2: Crawl from the Local HTML File
        print("=== Step 2: Crawling from the Local HTML File ===")
        file_url = f"file://{html_file_path.resolve()}"
-        file_config = CrawlerRunConfig(bypass_cache=True)
+        file_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        local_result = await crawler.arun(url=file_url, config=file_config)

        if not local_result.success:
@@ -3887,7 +3887,7 @@ async def main():
        with open(html_file_path, 'r', encoding='utf-8') as f:
            raw_html_content = f.read()
        raw_html_url = f"raw:{raw_html_content}"
-        raw_config = CrawlerRunConfig(bypass_cache=True)
+        raw_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        raw_result = await crawler.arun(url=raw_html_url, config=raw_config)

        if not raw_result.success:
@@ -4152,7 +4152,7 @@ prune_filter = PruningContentFilter(
 For intelligent content filtering and high-quality markdown generation, you can use the **LLMContentFilter**. This filter leverages LLMs to generate relevant markdown while preserving the original content's meaning and structure:

 ```python
-from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, LLMConfig
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, LLMConfig, DefaultMarkdownGenerator
 from crawl4ai.content_filter_strategy import LLMContentFilter

 async def main():
@@ -4175,8 +4175,13 @@ async def main():
        verbose=True
    )

+    md_generator = DefaultMarkdownGenerator(
+        content_filter=filter,
+        options={"ignore_links": True}
+    )
+
    config = CrawlerRunConfig(
-        content_filter=filter
+        markdown_generator=md_generator
    )

    async with AsyncWebCrawler() as crawler:
@@ -5428,29 +5433,38 @@ Sometimes you need a visual record of a page or a PDF “printout.” Crawl4AI c
 ```python
 import os, asyncio
 from base64 import b64decode
-from crawl4ai import AsyncWebCrawler, CacheMode
+from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

 async def main():
+    run_config = CrawlerRunConfig(
+        cache_mode=CacheMode.BYPASS,
+        screenshot=True,
+        pdf=True
+    )
+
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/List_of_common_misconceptions",
-            cache_mode=CacheMode.BYPASS,
-            pdf=True,
-            screenshot=True
+            config=run_config
        )
-        
        if result.success:
-            # Save screenshot
+            print(f"Screenshot data present: {result.screenshot is not None}")
+            print(f"PDF data present: {result.pdf is not None}")
+
            if result.screenshot:
+                print(f"[OK] Screenshot captured, base64 length: {len(result.screenshot)} chars")
                with open("wikipedia_screenshot.png", "wb") as f:
                    f.write(b64decode(result.screenshot))
-            
-            # Save PDF
+            else:
+                print("[WARN] Screenshot data is None.")
+
            if result.pdf:
+                print(f"[OK] PDF captured, size: {len(result.pdf)} bytes")
                with open("wikipedia_page.pdf", "wb") as f:
                    f.write(result.pdf)
-            
-            print("[OK] PDF & screenshot captured.")
+            else:
+                print("[WARN] PDF data is None.")
+
        else:
            print("[ERROR]", result.error_message)

--- a/deploy/docker/schemas.py
+++ b/deploy/docker/schemas.py
@@ -12,8 +12,7 @@ class CrawlRequest(BaseModel):
 class MarkdownRequest(BaseModel):
    """Request body for the /md endpoint."""
    url: str                    = Field(...,  description="Absolute http/https URL to fetch")
-    f:   FilterType             = Field(FilterType.FIT,
-                                        description="Content‑filter strategy: FIT, RAW, BM25, or LLM")
+    f:   FilterType             = Field(FilterType.FIT, description="Content‑filter strategy: fit, raw, bm25, or llm")
    q:   Optional[str] = Field(None,  description="Query string used by BM25/LLM filters")
    c:   Optional[str] = Field("0",   description="Cache‑bust / revision counter")

--- a/deploy/docker/static/playground/index.html
+++ b/deploy/docker/static/playground/index.html
@@ -671,6 +671,16 @@
                        method: 'GET',
                        headers: { 'Accept': 'application/json' }
                    });
+                    responseData = await response.json();
+                    const time = Math.round(performance.now() - startTime);
+                    if (!response.ok) {
+                        updateStatus('error', time);
+                        throw new Error(responseData.error || 'Request failed');
+                    }
+                    updateStatus('success', time);
+                    document.querySelector('#response-content code').textContent = JSON.stringify(responseData, null, 2);
+                    document.querySelector('#response-content code').className = 'json hljs';
+                    forceHighlightElement(document.querySelector('#response-content code'));
                } else if (endpoint === 'crawl_stream') {
                    // Stream processing
                    response = await fetch(api, {
--- a/docs/examples/llm_extraction_openai_pricing.py
+++ b/docs/examples/llm_extraction_openai_pricing.py
@@ -1,43 +1,55 @@
-from crawl4ai import LLMConfig
-from crawl4ai import AsyncWebCrawler, LLMExtractionStrategy
 import asyncio
-import os
-import json
 from pydantic import BaseModel, Field
-
-url = "https://openai.com/api/pricing/"
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig, BrowserConfig, CacheMode
+from crawl4ai.extraction_strategy import LLMExtractionStrategy
+from typing import Dict
+import os


 class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
-    output_fee: str = Field(
-        ..., description="Fee for output token for the OpenAI model."
+    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
+
+
+async def extract_structured_data_using_llm(provider: str, api_token: str = None, extra_headers: Dict[str, str] = None):
+    print(f"\n--- Extracting Structured Data with {provider} ---")
+
+    if api_token is None and not provider.startswith("ollama"):
+        print(f"API token is required for {provider}. Skipping this example.")
+        return
+
+    browser_config = BrowserConfig(headless=True)
+
+    extra_args = {"temperature": 0, "top_p": 0.9, "max_tokens": 2000}
+    if extra_headers:
+        extra_args["extra_headers"] = extra_headers
+
+    crawler_config = CrawlerRunConfig(
+        cache_mode=CacheMode.BYPASS,
+        word_count_threshold=1,
+        page_timeout=80000,
+        extraction_strategy=LLMExtractionStrategy(
+            llm_config=LLMConfig(provider=provider, api_token=api_token),
+            schema=OpenAIModelFee.model_json_schema(),
+            extraction_type="schema",
+            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens. 
+            Do not miss any models in the entire content.""",
+            extra_args=extra_args,
+        ),
    )

-async def main():
-    # Use AsyncWebCrawler
-    async with AsyncWebCrawler() as crawler:
+    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
-            url=url,
-            word_count_threshold=1,
-            extraction_strategy=LLMExtractionStrategy(
-                # provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'),
-                llm_config=LLMConfig(provider="groq/llama-3.1-70b-versatile", api_token=os.getenv("GROQ_API_KEY")),
-                schema=OpenAIModelFee.model_json_schema(),
-                extraction_type="schema",
-                instruction="From the crawled content, extract all mentioned model names along with their "
-                "fees for input and output tokens. Make sure not to miss anything in the entire content. "
-                "One extracted model JSON format should look like this: "
-                '{ "model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens" }',
-            ),
+            url="https://openai.com/api/pricing/", 
+            config=crawler_config
        )
-        print("Success:", result.success)
-        model_fees = json.loads(result.extracted_content)
-        print(len(model_fees))
-
-        with open(".data/data.json", "w", encoding="utf-8") as f:
-            f.write(result.extracted_content)
+        print(result.extracted_content)


-asyncio.run(main())
+if __name__ == "__main__":
+    asyncio.run(
+        extract_structured_data_using_llm(
+            provider="openai/gpt-4o", api_token=os.getenv("OPENAI_API_KEY")
+        )
+    )
--- a/docs/md_v2/advanced/advanced-features.md
+++ b/docs/md_v2/advanced/advanced-features.md
@@ -66,29 +66,38 @@ Sometimes you need a visual record of a page or a PDF “printout.” Crawl4AI c
 ```python
 import os, asyncio
 from base64 import b64decode
-from crawl4ai import AsyncWebCrawler, CacheMode
+from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

 async def main():
+    run_config = CrawlerRunConfig(
+        cache_mode=CacheMode.BYPASS,
+        screenshot=True,
+        pdf=True
+    )
+
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/List_of_common_misconceptions",
-            cache_mode=CacheMode.BYPASS,
-            pdf=True,
-            screenshot=True
+            config=run_config
        )
-        
        if result.success:
-            # Save screenshot
+            print(f"Screenshot data present: {result.screenshot is not None}")
+            print(f"PDF data present: {result.pdf is not None}")
+
            if result.screenshot:
+                print(f"[OK] Screenshot captured, base64 length: {len(result.screenshot)} chars")
                with open("wikipedia_screenshot.png", "wb") as f:
                    f.write(b64decode(result.screenshot))
-            
-            # Save PDF
+            else:
+                print("[WARN] Screenshot data is None.")
+
            if result.pdf:
+                print(f"[OK] PDF captured, size: {len(result.pdf)} bytes")
                with open("wikipedia_page.pdf", "wb") as f:
                    f.write(result.pdf)
-            
-            print("[OK] PDF & screenshot captured.")
+            else:
+                print("[WARN] PDF data is None.")
+
        else:
            print("[ERROR]", result.error_message)

--- a/docs/md_v2/advanced/pdf-parsing.md
+++ b/docs/md_v2/advanced/pdf-parsing.md
@@ -0,0 +1,201 @@
+# PDF Processing Strategies
+
+Crawl4AI provides specialized strategies for handling and extracting content from PDF files. These strategies allow you to seamlessly integrate PDF processing into your crawling workflows, whether the PDFs are hosted online or stored locally.
+
+## `PDFCrawlerStrategy`
+
+### Overview
+`PDFCrawlerStrategy` is an implementation of `AsyncCrawlerStrategy` designed specifically for PDF documents. Instead of interpreting the input URL as an HTML webpage, this strategy treats it as a pointer to a PDF file. It doesn't perform deep crawling or HTML parsing itself but rather prepares the PDF source for a dedicated PDF scraping strategy. Its primary role is to identify the PDF source (web URL or local file) and pass it along the processing pipeline in a way that `AsyncWebCrawler` can handle.
+
+### When to Use
+Use `PDFCrawlerStrategy` when you need to:
+- Process PDF files using the `AsyncWebCrawler`.
+- Handle PDFs from both web URLs (e.g., `https://example.com/document.pdf`) and local file paths (e.g., `file:///path/to/your/document.pdf`).
+- Integrate PDF content extraction into a unified `CrawlResult` object, allowing consistent handling of PDF data alongside web page data.
+
+### Key Methods and Their Behavior
+-   **`__init__(self, logger: AsyncLogger = None)`**:
+    -   Initializes the strategy.
+    -   `logger`: An optional `AsyncLogger` instance (from `crawl4ai.async_logger`) for logging purposes.
+-   **`async crawl(self, url: str, **kwargs) -> AsyncCrawlResponse`**:
+    -   This method is called by the `AsyncWebCrawler` during the `arun` process.
+    -   It takes the `url` (which should point to a PDF) and creates a minimal `AsyncCrawlResponse`.
+    -   The `html` attribute of this response is typically empty or a placeholder, as the actual PDF content processing is deferred to the `PDFContentScrapingStrategy` (or a similar PDF-aware scraping strategy).
+    -   It sets `response_headers` to indicate "application/pdf" and `status_code` to 200.
+-   **`async close(self)`**:
+    -   A method for cleaning up any resources used by the strategy. For `PDFCrawlerStrategy`, this is usually minimal.
+-   **`async __aenter__(self)` / `async __aexit__(self, exc_type, exc_val, exc_tb)`**:
+    -   Enables asynchronous context management for the strategy, allowing it to be used with `async with`.
+
+### Example Usage
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+from crawl4ai.processors.pdf import PDFCrawlerStrategy, PDFContentScrapingStrategy
+
+async def main():
+    # Initialize the PDF crawler strategy
+    pdf_crawler_strategy = PDFCrawlerStrategy()
+
+    # PDFCrawlerStrategy is typically used in conjunction with PDFContentScrapingStrategy
+    # The scraping strategy handles the actual PDF content extraction
+    pdf_scraping_strategy = PDFContentScrapingStrategy()
+    run_config = CrawlerRunConfig(scraping_strategy=pdf_scraping_strategy)
+
+    async with AsyncWebCrawler(crawler_strategy=pdf_crawler_strategy) as crawler:
+        # Example with a remote PDF URL
+        pdf_url = "https://arxiv.org/pdf/2310.06825.pdf" # A public PDF from arXiv
+        
+        print(f"Attempting to process PDF: {pdf_url}")
+        result = await crawler.arun(url=pdf_url, config=run_config)
+
+        if result.success:
+            print(f"Successfully processed PDF: {result.url}")
+            print(f"Metadata Title: {result.metadata.get('title', 'N/A')}")
+            # Further processing of result.markdown, result.media, etc.
+            # would be done here, based on what PDFContentScrapingStrategy extracts.
+            if result.markdown and hasattr(result.markdown, 'raw_markdown'):
+                print(f"Extracted text (first 200 chars): {result.markdown.raw_markdown[:200]}...")
+            else:
+                print("No markdown (text) content extracted.")
+        else:
+            print(f"Failed to process PDF: {result.error_message}")
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+### Pros and Cons
+**Pros:**
+-   Enables `AsyncWebCrawler` to handle PDF sources directly using familiar `arun` calls.
+-   Provides a consistent interface for specifying PDF sources (URLs or local paths).
+-   Abstracts the source handling, allowing a separate scraping strategy to focus on PDF content parsing.
+
+**Cons:**
+-   Does not perform any PDF data extraction itself; it strictly relies on a compatible scraping strategy (like `PDFContentScrapingStrategy`) to process the PDF.
+-   Has limited utility on its own; most of its value comes from being paired with a PDF-specific content scraping strategy.
+
+---
+
+## `PDFContentScrapingStrategy`
+
+### Overview
+`PDFContentScrapingStrategy` is an implementation of `ContentScrapingStrategy` designed to extract text, metadata, and optionally images from PDF documents. It is intended to be used in conjunction with a crawler strategy that can provide it with a PDF source, such as `PDFCrawlerStrategy`. This strategy uses the `NaivePDFProcessorStrategy` internally to perform the low-level PDF parsing.
+
+### When to Use
+Use `PDFContentScrapingStrategy` when your `AsyncWebCrawler` (often configured with `PDFCrawlerStrategy`) needs to:
+-   Extract textual content page by page from a PDF document.
+-   Retrieve standard metadata embedded within the PDF (e.g., title, author, subject, creation date, page count).
+-   Optionally, extract images contained within the PDF pages. These images can be saved to a local directory or made available for further processing.
+-   Produce a `ScrapingResult` that can be converted into a `CrawlResult`, making PDF content accessible in a manner similar to HTML web content (e.g., text in `result.markdown`, metadata in `result.metadata`).
+
+### Key Configuration Attributes
+When initializing `PDFContentScrapingStrategy`, you can configure its behavior using the following attributes:
+-   **`extract_images: bool = False`**: If `True`, the strategy will attempt to extract images from the PDF.
+-   **`save_images_locally: bool = False`**: If `True` (and `extract_images` is also `True`), extracted images will be saved to disk in the `image_save_dir`. If `False`, image data might be available in another form (e.g., base64, depending on the underlying processor) but not saved as separate files by this strategy.
+-   **`image_save_dir: str = None`**: Specifies the directory where extracted images should be saved if `save_images_locally` is `True`. If `None`, a default or temporary directory might be used.
+-   **`batch_size: int = 4`**: Defines how many PDF pages are processed in a single batch. This can be useful for managing memory when dealing with very large PDF documents.
+-   **`logger: AsyncLogger = None`**: An optional `AsyncLogger` instance for logging.
+
+### Key Methods and Their Behavior
+-   **`__init__(self, save_images_locally: bool = False, extract_images: bool = False, image_save_dir: str = None, batch_size: int = 4, logger: AsyncLogger = None)`**:
+    -   Initializes the strategy with configurations for image handling, batch processing, and logging. It sets up an internal `NaivePDFProcessorStrategy` instance which performs the actual PDF parsing.
+-   **`scrap(self, url: str, html: str, **params) -> ScrapingResult`**:
+    -   This is the primary synchronous method called by the crawler (via `ascrap`) to process the PDF.
+    -   `url`: The path or URL to the PDF file (provided by `PDFCrawlerStrategy` or similar).
+    -   `html`: Typically an empty string when used with `PDFCrawlerStrategy`, as the content is a PDF, not HTML.
+    -   It first ensures the PDF is accessible locally (downloads it to a temporary file if `url` is remote).
+    -   It then uses its internal PDF processor to extract text, metadata, and images (if configured).
+    -   The extracted information is compiled into a `ScrapingResult` object:
+        -   `cleaned_html`: Contains an HTML-like representation of the PDF, where each page's content is often wrapped in a `<div>` with page number information.
+        -   `media`: A dictionary where `media["images"]` will contain information about extracted images if `extract_images` was `True`.
+        -   `links`: A dictionary where `links["urls"]` can contain URLs found within the PDF content.
+        -   `metadata`: A dictionary holding PDF metadata (e.g., title, author, num_pages).
+-   **`async ascrap(self, url: str, html: str, **kwargs) -> ScrapingResult`**:
+    -   The asynchronous version of `scrap`. Under the hood, it typically runs the synchronous `scrap` method in a separate thread using `asyncio.to_thread` to avoid blocking the event loop.
+-   **`_get_pdf_path(self, url: str) -> str`**:
+    -   A private helper method to manage PDF file access. If the `url` is remote (http/https), it downloads the PDF to a temporary local file and returns its path. If `url` indicates a local file (`file://` or a direct path), it resolves and returns the local path.
+
+### Example Usage
+```python
+import asyncio
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+from crawl4ai.processors.pdf import PDFCrawlerStrategy, PDFContentScrapingStrategy
+import os # For creating image directory
+
+async def main():
+    # Define the directory for saving extracted images
+    image_output_dir = "./my_pdf_images"
+    os.makedirs(image_output_dir, exist_ok=True)
+
+    # Configure the PDF content scraping strategy
+    # Enable image extraction and specify where to save them
+    pdf_scraping_cfg = PDFContentScrapingStrategy(
+        extract_images=True,
+        save_images_locally=True,
+        image_save_dir=image_output_dir,
+        batch_size=2 # Process 2 pages at a time for demonstration
+    )
+
+    # The PDFCrawlerStrategy is needed to tell AsyncWebCrawler how to "crawl" a PDF
+    pdf_crawler_cfg = PDFCrawlerStrategy()
+
+    # Configure the overall crawl run
+    run_cfg = CrawlerRunConfig(
+        scraping_strategy=pdf_scraping_cfg # Use our PDF scraping strategy
+    )
+
+    # Initialize the crawler with the PDF-specific crawler strategy
+    async with AsyncWebCrawler(crawler_strategy=pdf_crawler_cfg) as crawler:
+        pdf_url = "https://arxiv.org/pdf/2310.06825.pdf" # Example PDF
+        
+        print(f"Starting PDF processing for: {pdf_url}")
+        result = await crawler.arun(url=pdf_url, config=run_cfg)
+
+        if result.success:
+            print("\n--- PDF Processing Successful ---")
+            print(f"Processed URL: {result.url}")
+            
+            print("\n--- Metadata ---")
+            for key, value in result.metadata.items():
+                print(f"  {key.replace('_', ' ').title()}: {value}")
+
+            if result.markdown and hasattr(result.markdown, 'raw_markdown'):
+                print(f"\n--- Extracted Text (Markdown Snippet) ---")
+                print(result.markdown.raw_markdown[:500].strip() + "...")
+            else:
+                print("\nNo text (markdown) content extracted.")
+
+            if result.media and result.media.get("images"):
+                print(f"\n--- Image Extraction ---")
+                print(f"Extracted {len(result.media['images'])} image(s).")
+                for i, img_info in enumerate(result.media["images"][:2]): # Show info for first 2 images
+                    print(f"  Image {i+1}:")
+                    print(f"    Page: {img_info.get('page')}")
+                    print(f"    Format: {img_info.get('format', 'N/A')}")
+                    if img_info.get('path'):
+                        print(f"    Saved at: {img_info.get('path')}")
+            else:
+                print("\nNo images were extracted (or extract_images was False).")
+        else:
+            print(f"\n--- PDF Processing Failed ---")
+            print(f"Error: {result.error_message}")
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+### Pros and Cons
+
+**Pros:**
+-   Provides a comprehensive way to extract text, metadata, and (optionally) images from PDF documents.
+-   Handles both remote PDFs (via URL) and local PDF files.
+-   Configurable image extraction allows saving images to disk or accessing their data.
+-   Integrates smoothly with the `CrawlResult` object structure, making PDF-derived data accessible in a way consistent with web-scraped data.
+-   The `batch_size` parameter can help in managing memory consumption when processing large or numerous PDF pages.
+
+**Cons:**
+-   Extraction quality and performance can vary significantly depending on the PDF's complexity, encoding, and whether it's image-based (scanned) or text-based.
+-   Image extraction can be resource-intensive (both CPU and disk space if `save_images_locally` is true).
+-   Relies on `NaivePDFProcessorStrategy` internally, which might have limitations with very complex layouts, encrypted PDFs, or forms compared to more sophisticated PDF parsing libraries. Scanned PDFs will not yield text unless an OCR step is performed (which is not part of this strategy by default).
+-   Link extraction from PDFs can be basic and depends on how hyperlinks are embedded in the document.
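Editorial sketch (not part of the patch): the new page says `ascrap` typically runs the synchronous `scrap` method in a separate thread via `asyncio.to_thread`. The general pattern looks like the following. The functions `scrap_sync` and `ascrap_sketch` and the returned dictionary are hypothetical stand-ins, not Crawl4AI's implementation, and `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio
import time
from typing import Dict, List


def scrap_sync(url: str) -> Dict[str, str]:
    """Hypothetical blocking scrape; the sleep stands in for PDF parsing."""
    time.sleep(0.05)
    return {"url": url, "status": "parsed"}


async def ascrap_sketch(url: str) -> Dict[str, str]:
    # Run the blocking call in a worker thread so the event loop stays free.
    return await asyncio.to_thread(scrap_sync, url)


async def main() -> List[Dict[str, str]]:
    # Both scrapes run in worker threads and can overlap.
    return await asyncio.gather(ascrap_sketch("a.pdf"), ascrap_sketch("b.pdf"))


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The synchronous function stays usable on its own, while async callers get a coroutine that does not block other tasks during parsing.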
--- a/docs/md_v2/advanced/proxy-security.md
+++ b/docs/md_v2/advanced/proxy-security.md
@@ -25,44 +25,70 @@ Use an authenticated proxy with `BrowserConfig`:
 ```python
 from crawl4ai.async_configs import BrowserConfig

-proxy_config = {
-    "server": "http://proxy.example.com:8080",
-    "username": "user",
-    "password": "pass"
-}
-
-browser_config = BrowserConfig(proxy_config=proxy_config)
+browser_config = BrowserConfig(proxy="http://[username]:[password]@[host]:[port]")
 async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(url="https://example.com")
 ```

-Here's the corrected documentation:

 ## Rotating Proxies 

 Example using a proxy rotation service dynamically:

 ```python
-from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
-
-async def get_next_proxy():
-    # Your proxy rotation logic here
-    return {"server": "http://next.proxy.com:8080"}
-
+import asyncio
+import re
+from crawl4ai import (
+    AsyncWebCrawler,
+    BrowserConfig,
+    CacheMode,
+    CrawlerRunConfig,
+    ProxyConfig,
+    RoundRobinProxyStrategy,
+)
 async def main():
-    browser_config = BrowserConfig()
-    run_config = CrawlerRunConfig()
-    
-    async with AsyncWebCrawler(config=browser_config) as crawler:
-        # For each URL, create a new run config with different proxy
-        for url in urls:
-            proxy = await get_next_proxy()
-            # Clone the config and update proxy - this creates a new browser context
-            current_config = run_config.clone(proxy_config=proxy)
-            result = await crawler.arun(url=url, config=current_config)
+    # Load proxies and create rotation strategy
+    proxies = ProxyConfig.from_env()
+    # e.g.: export PROXIES="ip1:port1:username1:password1,ip2:port2:username2:password2"
+    if not proxies:
+        print("No proxies found in environment. Set PROXIES env variable!")
+        return
+
+    proxy_strategy = RoundRobinProxyStrategy(proxies)
+
+    # Create configs
+    browser_config = BrowserConfig(headless=True, verbose=False)
+    run_config = CrawlerRunConfig(
+        cache_mode=CacheMode.BYPASS,
+        proxy_rotation_strategy=proxy_strategy
+    )
+
+    urls = ["https://httpbin.org/ip"] * (len(proxies) * 2)  # Test each proxy twice
+
+    print("\n📈 Initializing crawler with proxy rotation...")
+    # Open a single crawler and reuse it for the whole batch
+    async with AsyncWebCrawler(config=browser_config) as crawler:
+        print("\n🚀 Starting batch crawl with proxy rotation...")
+        results = await crawler.arun_many(
+            urls=urls,
+            config=run_config
+        )
+        for result in results:
+            if result.success:
+                ip_match = re.search(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}', result.html)
+                current_proxy = run_config.proxy_config if run_config.proxy_config else None
+
+                if current_proxy and ip_match:
+                    print(f"URL {result.url}")
+                    print(f"Proxy {current_proxy.server} -> Response IP: {ip_match.group(0)}")
+                    verified = ip_match.group(0) == current_proxy.ip
+                    if verified:
+                        print(f"✅ Proxy working! IP matches: {current_proxy.ip}")
+                    else:
+                        print("❌ Proxy failed or IP mismatch!")
+                print("---")
+
+asyncio.run(main())

-if __name__ == "__main__":
-    import asyncio
-    asyncio.run(main())
 ```

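The rotation idea in the example above can be illustrated without the library. The following is a minimal sketch, assuming the `ip:port:username:password` comma-separated format shown in the `PROXIES` comment; `ProxyEntry`, `parse_proxies`, and `round_robin` are hypothetical names for illustration and are not crawl4ai's `ProxyConfig` or `RoundRobinProxyStrategy`.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Iterator, List


@dataclass(frozen=True)
class ProxyEntry:
    """Hypothetical stand-in for a proxy record (not crawl4ai's ProxyConfig)."""
    ip: str
    port: str
    username: str
    password: str

    @property
    def server(self) -> str:
        return f"http://{self.ip}:{self.port}"


def parse_proxies(raw: str) -> List[ProxyEntry]:
    """Parse comma-separated 'ip:port:username:password' entries."""
    entries = []
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate trailing commas and empty input
        # maxsplit=3 keeps any ':' characters inside the password intact
        ip, port, username, password = item.split(":", 3)
        entries.append(ProxyEntry(ip, port, username, password))
    return entries


def round_robin(proxies: List[ProxyEntry]) -> Iterator[ProxyEntry]:
    """Yield proxies in order, wrapping around indefinitely."""
    return cycle(proxies)


proxies = parse_proxies("10.0.0.1:8080:alice:s3cret,10.0.0.2:8081:bob:hunter2")
rotation = round_robin(proxies)
picked = [next(rotation).server for _ in range(4)]
print(picked)
```

With two proxies and four requests, each proxy is selected twice in alternating order, which mirrors the "test each proxy twice" URL list in the example.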
--- a/docs/md_v2/api/parameters.md
+++ b/docs/md_v2/api/parameters.md
@@ -298,7 +298,7 @@ LLMConfig is useful to pass LLM provider config to strategies and functions that
 ## 3.1 Parameters
 | **Parameter**         | **Type / Default**                     | **What It Does**                                                                                                                     |
 |-----------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------|
-| **`provider`**    | `"ollama/llama3","groq/llama3-70b-8192","groq/llama3-8b-8192", "openai/gpt-4o-mini" ,"openai/gpt-4o","openai/o1-mini","openai/o1-preview","openai/o3-mini","openai/o3-mini-high","anthropic/claude-3-haiku-20240307","anthropic/claude-3-opus-20240229","anthropic/claude-3-sonnet-20240229","anthropic/claude-3-5-sonnet-20240620","gemini/gemini-pro","gemini/gemini-1.5-pro","gemini/gemini-2.0-flash","gemini/gemini-2.0-flash-exp","gemini/gemini-2.0-flash-lite-preview-02-05","deepseek/deepseek-chat"`<br/>*(default: `"openai/gpt-4o-mini"`)* | Which LLM provoder to use. 
+| **`provider`**    | `"ollama/llama3","groq/llama3-70b-8192","groq/llama3-8b-8192", "openai/gpt-4o-mini" ,"openai/gpt-4o","openai/o1-mini","openai/o1-preview","openai/o3-mini","openai/o3-mini-high","anthropic/claude-3-haiku-20240307","anthropic/claude-3-opus-20240229","anthropic/claude-3-sonnet-20240229","anthropic/claude-3-5-sonnet-20240620","gemini/gemini-pro","gemini/gemini-1.5-pro","gemini/gemini-2.0-flash","gemini/gemini-2.0-flash-exp","gemini/gemini-2.0-flash-lite-preview-02-05","deepseek/deepseek-chat"`<br/>*(default: `"openai/gpt-4o-mini"`)* | Which LLM provider to use. 
 | **`api_token`**         |1.Optional. When not provided explicitly, api_token will be read from environment variables based on provider. For example: If a gemini model is passed as provider then,`"GEMINI_API_KEY"` will be read from environment variables  <br/> 2. API token of LLM provider <br/> eg: `api_token = "gsk_1ClHGGJ7Lpn4WGybR7vNWGdyb3FY7zXEw3SCiy0BAVM9lL8CQv"` <br/> 3. Environment variable - use with prefix "env:" <br/> eg:`api_token = "env: GROQ_API_KEY"`              | API token to use for the given provider 
 | **`base_url`**         |Optional. Custom API endpoint | If your provider has a custom endpoint

--- a/docs/md_v2/core/browser-crawler-config.md
+++ b/docs/md_v2/core/browser-crawler-config.md
@@ -252,7 +252,7 @@ The `clone()` method:
 ### Key fields to note

 1. **`provider`**:  
- Which LLM provoder to use. 
+- Which LLM provider to use. 
 - Possible values are `"ollama/llama3","groq/llama3-70b-8192","groq/llama3-8b-8192", "openai/gpt-4o-mini" ,"openai/gpt-4o","openai/o1-mini","openai/o1-preview","openai/o3-mini","openai/o3-mini-high","anthropic/claude-3-haiku-20240307","anthropic/claude-3-opus-20240229","anthropic/claude-3-sonnet-20240229","anthropic/claude-3-5-sonnet-20240620","gemini/gemini-pro","gemini/gemini-1.5-pro","gemini/gemini-2.0-flash","gemini/gemini-2.0-flash-exp","gemini/gemini-2.0-flash-lite-preview-02-05","deepseek/deepseek-chat"`<br/>*(default: `"openai/gpt-4o-mini"`)*

 2. **`api_token`**:  
@@ -273,7 +273,7 @@ In a typical scenario, you define **one** `BrowserConfig` for your crawler sessi

 ```python
 import asyncio
-from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig, LLMContentFilter, DefaultMarkdownGenerator
 from crawl4ai import JsonCssExtractionStrategy

 async def main():
@@ -298,7 +298,7 @@ async def main():
    # 3) Example LLM content filtering

    gemini_config = LLMConfig(
-        provider="gemini/gemini-1.5-pro" 
+        provider="gemini/gemini-1.5-pro", 
        api_token = "env:GEMINI_API_TOKEN"
    )

@@ -322,8 +322,9 @@ async def main():
    )

    md_generator = DefaultMarkdownGenerator(
-    content_filter=filter,
-    options={"ignore_links": True}
+        content_filter=filter,
+        options={"ignore_links": True}
+    )

    # 4) Crawler run config: skip cache, use extraction
    run_conf = CrawlerRunConfig(
--- a/docs/md_v2/core/cli.md
+++ b/docs/md_v2/core/cli.md
@@ -17,6 +17,9 @@
 - [Configuration Reference](#configuration-reference)
 - [Best Practices & Tips](#best-practices--tips)

+## Installation
+The Crawl4AI CLI (`crwl`) is installed automatically when you install the library.
+
 ## Basic Usage

 The Crawl4AI CLI (`crwl`) provides a simple interface to the Crawl4AI library:
--- a/docs/md_v2/core/local-files.md
+++ b/docs/md_v2/core/local-files.md
@@ -8,11 +8,10 @@ To crawl a live web page, provide the URL starting with `http://` or `https://`,

 ```python
 import asyncio
-from crawl4ai import AsyncWebCrawler
-from crawl4ai.async_configs import CrawlerRunConfig
+from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

 async def crawl_web():
-    config = CrawlerRunConfig(bypass_cache=True)
+    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/apple", 
@@ -33,13 +32,12 @@ To crawl a local HTML file, prefix the file path with `file://`.

 ```python
 import asyncio
-from crawl4ai import AsyncWebCrawler
-from crawl4ai.async_configs import CrawlerRunConfig
+from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

 async def crawl_local_file():
    local_file_path = "/path/to/apple.html"  # Replace with your file path
    file_url = f"file://{local_file_path}"
-    config = CrawlerRunConfig(bypass_cache=True)
+    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=file_url, config=config)
@@ -93,8 +91,7 @@ import os
 import sys
 import asyncio
 from pathlib import Path
-from crawl4ai import AsyncWebCrawler
-from crawl4ai.async_configs import CrawlerRunConfig
+from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

 async def main():
    wikipedia_url = "https://en.wikipedia.org/wiki/apple"
@@ -104,7 +101,7 @@ async def main():
    async with AsyncWebCrawler() as crawler:
        # Step 1: Crawl the Web URL
        print("\n=== Step 1: Crawling the Wikipedia URL ===")
-        web_config = CrawlerRunConfig(bypass_cache=True)
+        web_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        result = await crawler.arun(url=wikipedia_url, config=web_config)

        if not result.success:
@@ -119,7 +116,7 @@ async def main():
        # Step 2: Crawl from the Local HTML File
        print("=== Step 2: Crawling from the Local HTML File ===")
        file_url = f"file://{html_file_path.resolve()}"
-        file_config = CrawlerRunConfig(bypass_cache=True)
+        file_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        local_result = await crawler.arun(url=file_url, config=file_config)

        if not local_result.success:
@@ -135,7 +132,7 @@ async def main():
        with open(html_file_path, 'r', encoding='utf-8') as f:
            raw_html_content = f.read()
        raw_html_url = f"raw:{raw_html_content}"
-        raw_config = CrawlerRunConfig(bypass_cache=True)
+        raw_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        raw_result = await crawler.arun(url=raw_html_url, config=raw_config)

        if not raw_result.success:
--- a/docs/md_v2/core/markdown-generation.md
+++ b/docs/md_v2/core/markdown-generation.md
@@ -200,7 +200,8 @@ config = CrawlerRunConfig(markdown_generator=md_generator)

 - **`user_query`**: The term you want to focus on. BM25 tries to keep only content blocks relevant to that query.  
 - **`bm25_threshold`**: Raise it to keep fewer blocks; lower it to keep more.  
- **`use_stemming`** *(default `True`)*: If enabled, variations of words match (e.g., “learn,” “learning,” “learnt”).
+- **`use_stemming`** *(default `True`)*: Whether to apply stemming to the query and content, so that inflected forms such as “learn” and “learning” match.
+- **`language`** *(default `"english"`)*: Language used for stemming.

 **No query provided?** BM25 tries to glean a context from page metadata, or you can simply treat it as a scorched-earth approach that discards text with low generic score. Realistically, you want to supply a query for best results.

@@ -233,7 +234,7 @@ prune_filter = PruningContentFilter(
 For intelligent content filtering and high-quality markdown generation, you can use the **LLMContentFilter**. This filter leverages LLMs to generate relevant markdown while preserving the original content's meaning and structure:

 ```python
-from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, LLMConfig
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, LLMConfig, DefaultMarkdownGenerator
 from crawl4ai.content_filter_strategy import LLMContentFilter

 async def main():
@@ -255,9 +256,12 @@ async def main():
        chunk_token_threshold=4096,  # Adjust based on your needs
        verbose=True
    )
-
+    md_generator = DefaultMarkdownGenerator(
+        content_filter=filter,
+        options={"ignore_links": True}
+    )
    config = CrawlerRunConfig(
-        content_filter=filter
+        markdown_generator=md_generator,
    )

    async with AsyncWebCrawler() as crawler:
--- a/docs/md_v2/extraction/llm-strategies.md
+++ b/docs/md_v2/extraction/llm-strategies.md
@@ -218,7 +218,7 @@ import json
 import asyncio
 from typing import List
 from pydantic import BaseModel, Field
-from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
+from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
 from crawl4ai import LLMExtractionStrategy

 class Entity(BaseModel):
@@ -238,8 +238,8 @@ class KnowledgeGraph(BaseModel):
 async def main():
    # LLM extraction strategy
    llm_strat = LLMExtractionStrategy(
-        llmConfig = LlmConfig(provider="openai/gpt-4", api_token=os.getenv('OPENAI_API_KEY')),
-        schema=KnowledgeGraph.schema_json(),
+        llm_config=LLMConfig(provider="openai/gpt-4", api_token=os.getenv('OPENAI_API_KEY')),
+        schema=KnowledgeGraph.model_json_schema(),
        extraction_type="schema",
        instruction="Extract entities and relationships from the content. Return valid JSON.",
        chunk_token_threshold=1400,
@@ -258,6 +258,10 @@ async def main():
        url = "https://www.nbcnews.com/business"
        result = await crawler.arun(url=url, config=crawl_config)

+        print("--- LLM RAW RESPONSE ---")
+        print(result.extracted_content)
+        print("--- END LLM RAW RESPONSE ---")
+
        if result.success:
            with open("kb_result.json", "w", encoding="utf-8") as f:
                f.write(result.extracted_content)
--- a/docs/md_v2/index.md
+++ b/docs/md_v2/index.md
@@ -41,6 +41,17 @@
           alt="License"/>
    </a>
  </p>
+  <p align="center">
+    <a href="https://x.com/crawl4ai">
+      <img src="https://img.shields.io/badge/Follow%20on%20X-000000?style=for-the-badge&logo=x&logoColor=white" alt="Follow on X" />
+    </a>
+    <a href="https://www.linkedin.com/company/crawl4ai">
+      <img src="https://img.shields.io/badge/Follow%20on%20LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white" alt="Follow on LinkedIn" />
+    </a>
+    <a href="https://discord.gg/jP8KfhDhyN">
+      <img src="https://img.shields.io/badge/Join%20our%20Discord-5865F2?style=for-the-badge&logo=discord&logoColor=white" alt="Join our Discord" />
+    </a>
+  </p>
  
 </div>

--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -48,6 +48,7 @@ nav:
    - "Identity Based Crawling": "advanced/identity-based-crawling.md"
    - "SSL Certificate": "advanced/ssl-certificate.md"
    - "Network & Console Capture": "advanced/network-console-capture.md"
+    - "PDF Parsing": "advanced/pdf-parsing.md"
  - Extraction:
    - "LLM-Free Strategies": "extraction/no-llm-strategies.md"
    - "LLM Strategies": "extraction/llm-strategies.md"
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -17,7 +17,7 @@ dependencies = [
    "lxml~=5.3",
    "litellm>=1.53.1",
    "numpy>=1.26.0,<3",
-    "pillow~=10.4",
+    "pillow>=10.4",
    "playwright>=1.49.0",
    "python-dotenv~=1.0",
    "requests~=2.26",
@@ -32,7 +32,6 @@ dependencies = [
    "psutil>=6.1.1",
    "nltk>=3.9.1",
    "playwright",
-    "aiofiles",
    "rich>=13.9.4",
    "cssselect>=1.2.0",
    "httpx>=0.27.2",
--- a/requirements.txt
+++ b/requirements.txt
@@ -4,7 +4,7 @@ aiosqlite~=0.20
 lxml~=5.3
 litellm>=1.53.1
 numpy>=1.26.0,<3
-pillow~=10.4
+pillow>=10.4
 playwright>=1.49.0
 python-dotenv~=1.0
 requests~=2.26
@@ -27,3 +27,7 @@ httpx[http2]>=0.27.2
 sentence-transformers>=2.2.0
 alphashape>=1.3.1
 shapely>=2.0.0
+
+fake-useragent>=2.2.0
+pdf2image>=1.17.0
+PyPDF2>=3.0.1
--- a/tests/deep_crwaling/test_filter.py
+++ b/tests/deep_crwaling/test_filter.py
@@ -0,0 +1,75 @@
+# File: tests/deep_crwaling/test_filter.py
+import pytest
+from urllib.parse import urlparse
+from crawl4ai import ContentTypeFilter, URLFilter 
+
+# Minimal URLFilter base class stub if not already importable directly for tests
+# In a real scenario, this would be imported from the library
+if not hasattr(URLFilter, '_update_stats'): # Check if it's a basic stub
+    class URLFilter: # Basic stub for testing if needed
+        def __init__(self, name=None): self.name = name
+        def apply(self, url: str) -> bool: raise NotImplementedError
+        def _update_stats(self, passed: bool): pass # Mock implementation
+
+# ContentTypeFilter is imported from the library and exercised directly below.
+# If it cannot be imported in the test environment, provide a stub that
+# implements `apply()` and `_extract_extension()` with the same signatures
+# before running this module.
+
+class TestContentTypeFilter:
+    @pytest.mark.parametrize(
+        "url, allowed_types, expected",
+        [
+            # Existing tests (examples)
+            ("http://example.com/page.html", ["text/html"], True),
+            ("http://example.com/page.json", ["application/json"], True),
+            ("http://example.com/image.png", ["text/html"], False),
+            ("http://example.com/document.pdf", ["application/pdf"], True),
+            ("http://example.com/page", ["text/html"], True), # No extension, allowed
+            ("http://example.com/page", ["application/json"], True), # No extension: assumed to pass regardless of allowed_types
+            ("http://example.com/page.unknown", ["text/html"], False), # Unknown extension
+            
+            # Tests for PHP extensions
+            ("http://example.com/index.php", ["application/x-httpd-php"], True),
+            ("http://example.com/script.php3", ["application/x-httpd-php"], True),
+            ("http://example.com/legacy.php4", ["application/x-httpd-php"], True),
+            ("http://example.com/main.php5", ["application/x-httpd-php"], True),
+            ("http://example.com/api.php7", ["application/x-httpd-php"], True),
+            ("http://example.com/index.phtml", ["application/x-httpd-php"], True),
+            ("http://example.com/source.phps", ["application/x-httpd-php-source"], True),
+
+            # Test rejection of PHP extensions
+            ("http://example.com/index.php", ["text/html"], False),
+            ("http://example.com/script.php3", ["text/plain"], False),
+            ("http://example.com/source.phps", ["application/x-httpd-php"], False), # Mismatch MIME
+            ("http://example.com/source.php", ["application/x-httpd-php-source"], False), # Mismatch MIME for .php
+
+            # Test case-insensitivity of extensions in URL
+            ("http://example.com/PAGE.HTML", ["text/html"], True),
+            ("http://example.com/INDEX.PHP", ["application/x-httpd-php"], True),
+            ("http://example.com/SOURCE.PHPS", ["application/x-httpd-php-source"], True),
+
+            # Test case-insensitivity of allowed_types
+            ("http://example.com/index.php", ["APPLICATION/X-HTTPD-PHP"], True),
+        ],
+    )
+    def test_apply(self, url, allowed_types, expected):
+        content_filter = ContentTypeFilter(
+            allowed_types=allowed_types
+        )
+        assert content_filter.apply(url) == expected
+
+    @pytest.mark.parametrize(
+        "url, expected_extension",
+        [
+            ("http://example.com/file.html", "html"),
+            ("http://example.com/file.tar.gz", "gz"),
+            ("http://example.com/path/", ""),
+            ("http://example.com/nodot", ""),
+            ("http://example.com/.config", "config"), # hidden file with extension
+            ("http://example.com/path/to/archive.BIG.zip", "zip"), # Case test
+        ]
+    )
+    def test_extract_extension(self, url, expected_extension):
+        # Test the static method directly
+        assert ContentTypeFilter._extract_extension(url) == expected_extension
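The expectations in `test_extract_extension` pin down a small, self-contained rule: take the last path segment, and return the lowercased text after its final dot, or an empty string when there is no dot. The following is a minimal sketch of a function that satisfies those cases; `extract_extension` is a hypothetical standalone helper written for illustration and is not necessarily how `ContentTypeFilter._extract_extension` is implemented.

```python
from urllib.parse import urlparse


def extract_extension(url: str) -> str:
    """Return the lowercased extension of the URL's last path segment, or ''."""
    # urlparse drops the query string and fragment from the path
    path = urlparse(url).path
    # The last path segment is everything after the final '/'
    filename = path.rsplit("/", 1)[-1]
    if "." not in filename:
        return ""
    # Only the text after the final dot counts, so 'file.tar.gz' yields 'gz'
    return filename.rsplit(".", 1)[-1].lower()


for sample in (
    "http://example.com/file.tar.gz",
    "http://example.com/path/",
    "http://example.com/.config",
    "http://example.com/INDEX.PHP",
):
    print(repr(extract_extension(sample)), sample)
```

Parsing the URL first means a dot inside a query string (for example `?file=a.b`) is not mistaken for an extension.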
--- a/tests/docker_example.py
+++ b/tests/docker_example.py
@@ -105,7 +105,7 @@ def test_docker_deployment(version="basic"):
 def test_basic_crawl(tester: Crawl4AiTester):
    print("\n=== Testing Basic Crawl ===")
    request = {
-        "urls": "https://www.nbcnews.com/business",
+        "urls": ["https://www.nbcnews.com/business"],
        "priority": 10,
        "session_id": "test",
    }
@@ -119,7 +119,7 @@ def test_basic_crawl(tester: Crawl4AiTester):
 def test_basic_crawl_sync(tester: Crawl4AiTester):
    print("\n=== Testing Basic Crawl (Sync) ===")
    request = {
-        "urls": "https://www.nbcnews.com/business",
+        "urls": ["https://www.nbcnews.com/business"],
        "priority": 10,
        "session_id": "test",
    }
@@ -134,7 +134,7 @@ def test_basic_crawl_sync(tester: Crawl4AiTester):
 def test_js_execution(tester: Crawl4AiTester):
    print("\n=== Testing JS Execution ===")
    request = {
-        "urls": "https://www.nbcnews.com/business",
+        "urls": ["https://www.nbcnews.com/business"],
        "priority": 8,
        "js_code": [
            "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
@@ -151,7 +151,7 @@ def test_js_execution(tester: Crawl4AiTester):
 def test_css_selector(tester: Crawl4AiTester):
    print("\n=== Testing CSS Selector ===")
    request = {
-        "urls": "https://www.nbcnews.com/business",
+        "urls": ["https://www.nbcnews.com/business"],
        "priority": 7,
        "css_selector": ".wide-tease-item__description",
        "crawler_params": {"headless": True},
@@ -188,7 +188,7 @@ def test_structured_extraction(tester: Crawl4AiTester):
    }

    request = {
-        "urls": "https://www.coinbase.com/explore",
+        "urls": ["https://www.coinbase.com/explore"],
        "priority": 9,
        "extraction_config": {"type": "json_css", "params": {"schema": schema}},
    }
@@ -223,7 +223,7 @@ def test_llm_extraction(tester: Crawl4AiTester):
    }

    request = {
-        "urls": "https://openai.com/api/pricing",
+        "urls": ["https://openai.com/api/pricing"],
        "priority": 8,
        "extraction_config": {
            "type": "llm",
@@ -270,7 +270,7 @@ def test_llm_with_ollama(tester: Crawl4AiTester):
    }

    request = {
-        "urls": "https://www.nbcnews.com/business",
+        "urls": ["https://www.nbcnews.com/business"],
        "priority": 8,
        "extraction_config": {
            "type": "llm",
@@ -297,7 +297,7 @@ def test_llm_with_ollama(tester: Crawl4AiTester):
 def test_cosine_extraction(tester: Crawl4AiTester):
    print("\n=== Testing Cosine Extraction ===")
    request = {
-        "urls": "https://www.nbcnews.com/business",
+        "urls": ["https://www.nbcnews.com/business"],
        "priority": 8,
        "extraction_config": {
            "type": "cosine",
@@ -323,7 +323,7 @@ def test_cosine_extraction(tester: Crawl4AiTester):
 def test_screenshot(tester: Crawl4AiTester):
    print("\n=== Testing Screenshot ===")
    request = {
-        "urls": "https://www.nbcnews.com/business",
+        "urls": ["https://www.nbcnews.com/business"],
        "priority": 5,
        "screenshot": True,
        "crawler_params": {"headless": True},
--- a/tests/general/test_async_crawler_strategy.py
+++ b/tests/general/test_async_crawler_strategy.py
@@ -15,6 +15,24 @@ CRAWL4AI_HOME_DIR = Path(os.path.expanduser("~")).joinpath(".crawl4ai")
 if not CRAWL4AI_HOME_DIR.joinpath("profiles", "test_profile").exists():
    CRAWL4AI_HOME_DIR.joinpath("profiles", "test_profile").mkdir(parents=True)

+@pytest.fixture
+def basic_html():
+    return """
+    <html lang="en">
+    <head>
+        <title>Basic HTML</title>
+    </head>
+    <body>
+        <h1>Main Heading</h1>
+        <main>
+            <div class="container">
+                <p>Basic HTML document for testing purposes.</p>
+            </div>
+        </main>
+    </body>
+    </html>
+    """
+
 # Test Config Files
 @pytest.fixture
 def basic_browser_config():
@@ -325,6 +343,13 @@ async def test_stealth_mode(crawler_strategy):
    )
    assert response.status_code == 200

+@pytest.mark.asyncio
+@pytest.mark.parametrize("prefix", ("raw:", "raw://"))
+async def test_raw_urls(crawler_strategy, basic_html, prefix):
+    url = f"{prefix}{basic_html}"
+    response = await crawler_strategy.crawl(url, CrawlerRunConfig())
+    assert response.html == basic_html
+
 # Error Handling Tests  
 @pytest.mark.asyncio
 async def test_invalid_url():
--- a/tests/general/test_download_file.py
+++ b/tests/general/test_download_file.py
@@ -0,0 +1,34 @@
+import asyncio
+from crawl4ai import CrawlerRunConfig, AsyncWebCrawler, BrowserConfig
+from pathlib import Path
+import os
+
+async def test_basic_download():
+    
+    # Custom folder (otherwise defaults to ~/.crawl4ai/downloads)
+    downloads_path = os.path.join(Path.home(), ".crawl4ai", "downloads")
+    os.makedirs(downloads_path, exist_ok=True)
+    browser_config = BrowserConfig(
+        accept_downloads=True,
+        downloads_path=downloads_path
+    )
+    async with AsyncWebCrawler(config=browser_config) as crawler:
+        run_config = CrawlerRunConfig(
+            js_code="""
+                const link = document.querySelector('a[href$=".exe"]');
+                if (link) { link.click(); }
+            """,
+            delay_before_return_html=5  
+        )
+        result = await crawler.arun("https://www.python.org/downloads/", config=run_config)
+
+        if result.downloaded_files:
+            print("Downloaded files:")
+            for file_path in result.downloaded_files:
+                print("•", file_path)
+        else:
+            print("No files downloaded.")
+
+if __name__ == "__main__":
+    asyncio.run(test_basic_download())
+ 
--- a/tests/general/test_max_scroll.py
+++ b/tests/general/test_max_scroll.py
@@ -0,0 +1,115 @@
+"""
+Sample script to test the max_scroll_steps parameter implementation
+"""
+import asyncio
+import os
+import sys
+
+# Get the grandparent directory
+grandparent_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+sys.path.append(grandparent_dir)
+__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
+
+
+
+from crawl4ai import AsyncWebCrawler
+from crawl4ai.async_configs import CrawlerRunConfig
+
+async def test_max_scroll_steps():
+    """
+    Test the max_scroll_steps parameter with different configurations
+    """
+    print("🚀 Testing max_scroll_steps parameter implementation")
+    print("=" * 60)
+    
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        
+        # Test 1: Without max_scroll_steps (unlimited scrolling)
+        print("\n📋 Test 1: Unlimited scrolling (max_scroll_steps=None)")
+        config1 = CrawlerRunConfig(
+            scan_full_page=True,
+            scroll_delay=0.1,
+            max_scroll_steps=None,  # Default behavior
+            verbose=True
+        )
+        
+        print(f"Config: scan_full_page={config1.scan_full_page}, max_scroll_steps={config1.max_scroll_steps}")
+        
+        try:
+            result1 = await crawler.arun(
+                url="https://example.com",  # Simple page for testing
+                config=config1
+            )
+            print(f"✅ Test 1 Success: Crawled {len(result1.markdown)} characters")
+        except Exception as e:
+            print(f"❌ Test 1 Failed: {e}")
+        
+        # Test 2: With limited scroll steps
+        print("\n📋 Test 2: Limited scrolling (max_scroll_steps=3)")
+        config2 = CrawlerRunConfig(
+            scan_full_page=True,
+            scroll_delay=0.1,
+            max_scroll_steps=3,  # Limit to 3 scroll steps
+            verbose=True
+        )
+        
+        print(f"Config: scan_full_page={config2.scan_full_page}, max_scroll_steps={config2.max_scroll_steps}")
+        
+        try:
+            result2 = await crawler.arun(
+                url="https://techcrunch.com/",  # Another test page
+                config=config2
+            )
+            print(f"✅ Test 2 Success: Crawled {len(result2.markdown)} characters")
+        except Exception as e:
+            print(f"❌ Test 2 Failed: {e}")
+        
+        # Test 3: Test serialization/deserialization
+        print("\n📋 Test 3: Configuration serialization test")
+        config3 = CrawlerRunConfig(
+            scan_full_page=True,
+            max_scroll_steps=5,
+            scroll_delay=0.2
+        )
+        
+        # Test to_dict
+        config_dict = config3.to_dict()
+        print(f"Serialized max_scroll_steps: {config_dict.get('max_scroll_steps')}")
+        
+        # Test from_kwargs
+        config4 = CrawlerRunConfig.from_kwargs({
+            'scan_full_page': True,
+            'max_scroll_steps': 7,
+            'scroll_delay': 0.3
+        })
+        print(f"Deserialized max_scroll_steps: {config4.max_scroll_steps}")
+        print("✅ Test 3 Success: Serialization works correctly")
+        
+        # Test 4: Edge case - max_scroll_steps = 0
+        print("\n📋 Test 4: Edge case (max_scroll_steps=0)")
+        config5 = CrawlerRunConfig(
+            scan_full_page=True,
+            max_scroll_steps=0,  # Should not scroll at all
+            verbose=True
+        )
+        
+        try:
+            result5 = await crawler.arun(
+                url="https://techcrunch.com/",
+                config=config5
+            )
+            print(f"✅ Test 4 Success: No scrolling performed, crawled {len(result5.markdown)} characters")
+        except Exception as e:
+            print(f"❌ Test 4 Failed: {e}")
+    
+    print("\n" + "=" * 60)
+    print("🎉 All tests completed!")
+    print("\nThe max_scroll_steps parameter is working correctly:")
+    print("- None: Unlimited scrolling (default behavior)")
+    print("- Positive integer: Limits scroll steps to that number")
+    print("- 0: No scrolling performed")
+    print("- Properly serializes/deserializes in config")
+
+if __name__ == "__main__":
+    print("Starting max_scroll_steps test...")
+    asyncio.run(test_max_scroll_steps())
--- a/tests/general/test_url_pattern.py
+++ b/tests/general/test_url_pattern.py
@@ -0,0 +1,85 @@
+import sys
+import os
+
+# Get the grandparent directory
+grandparent_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+sys.path.append(grandparent_dir)
+__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
+
+import asyncio
+from crawl4ai.deep_crawling.filters import URLPatternFilter
+
+
+def test_prefix_boundary_matching():
+    """Test that prefix patterns respect path boundaries"""
+    print("=== Testing URLPatternFilter Prefix Boundary Fix ===")
+    
+    filter_obj = URLPatternFilter(patterns=['https://langchain-ai.github.io/langgraph/*'])
+    
+    test_cases = [
+        ('https://langchain-ai.github.io/langgraph/', True),
+        ('https://langchain-ai.github.io/langgraph/concepts/', True),
+        ('https://langchain-ai.github.io/langgraph/tutorials/', True),
+        ('https://langchain-ai.github.io/langgraph?param=1', True),
+        ('https://langchain-ai.github.io/langgraph#section', True),
+        ('https://langchain-ai.github.io/langgraphjs/', False),
+        ('https://langchain-ai.github.io/langgraphjs/concepts/', False),
+        ('https://other-site.com/langgraph/', False),
+    ]
+    
+    all_passed = True
+    for url, expected in test_cases:
+        result = filter_obj.apply(url)
+        status = "PASS" if result == expected else "FAIL"
+        if result != expected:
+            all_passed = False
+        print(f"{status:4} | Expected: {expected!s:5} | Got: {result!s:5} | {url}")
+    
+    return all_passed
+
+
+def test_edge_cases():
+    """Test edge cases for path boundary matching"""
+    print("\n=== Testing Edge Cases ===")
+    
+    test_patterns = [
+        ('/api/*', [
+            ('/api/', True),
+            ('/api/v1', True),
+            ('/api?param=1', True),
+            ('/apiv2/', False),
+            ('/api_old/', False),
+        ]),
+        
+        ('*/docs/*', [
+            ('example.com/docs/', True),
+            ('example.com/docs/guide', True),
+            ('example.com/documentation/', False),
+            ('example.com/docs_old/', False),
+        ]),
+    ]
+    
+    all_passed = True
+    for pattern, test_cases in test_patterns:
+        print(f"\nPattern: {pattern}")
+        filter_obj = URLPatternFilter(patterns=[pattern])
+        
+        for url, expected in test_cases:
+            result = filter_obj.apply(url)
+            status = "PASS" if result == expected else "FAIL"
+            if result != expected:
+                all_passed = False
+            print(f"  {status:4} | Expected: {expected!s:5} | Got: {result!s:5} | {url}")
+    
+    return all_passed
+
+if __name__ == "__main__":
+    test1_passed = test_prefix_boundary_matching()
+    test2_passed = test_edge_cases()
+    
+    if test1_passed and test2_passed:
+        print("\n✅ All tests passed!")
+        sys.exit(0)
+    else:
+        print("\n❌ Some tests failed!")
+        sys.exit(1)
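
The prefix cases exercised by tests/general/test_url_pattern.py can be illustrated with a minimal standalone sketch. This is not the `URLPatternFilter` implementation: `matches_prefix` is a hypothetical helper, it only handles patterns that end in `/*`, and treating a URL equal to the bare prefix as a match is an assumption the tests do not cover. Patterns with a leading wildcard such as `*/docs/*` are out of scope here.

```python
def matches_prefix(url: str, pattern: str) -> bool:
    """Hypothetical helper: match 'prefix/*' patterns on a path boundary."""
    if not pattern.endswith("/*"):
        raise ValueError("this sketch only handles patterns ending in '/*'")
    prefix = pattern[:-2]  # drop the trailing '/*'
    if not url.startswith(prefix):
        return False
    rest = url[len(prefix):]
    # The prefix must end the URL or be followed by a path, query, or
    # fragment delimiter; otherwise '/langgraphjs/' would match '/langgraph/*'.
    return rest == "" or rest[0] in "/?#"


if __name__ == "__main__":
    pattern = "https://langchain-ai.github.io/langgraph/*"
    for url in (
        "https://langchain-ai.github.io/langgraph/concepts/",
        "https://langchain-ai.github.io/langgraph?param=1",
        "https://langchain-ai.github.io/langgraphjs/",
    ):
        print(f"{matches_prefix(url, pattern)!s:5} | {url}")
```

A plain `startswith` check would accept `langgraphjs/`; the boundary check on the next character is what the test cases above distinguish.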
--- a/tests/test_docker.py
+++ b/tests/test_docker.py
@@ -74,7 +74,7 @@ def test_docker_deployment(version="basic"):

 def test_basic_crawl(tester: Crawl4AiTester):
    print("\n=== Testing Basic Crawl ===")
-    request = {"urls": "https://www.nbcnews.com/business", "priority": 10}
+    request = {"urls": ["https://www.nbcnews.com/business"], "priority": 10}

    result = tester.submit_and_wait(request)
    print(f"Basic crawl result length: {len(result['result']['markdown'])}")
@@ -85,7 +85,7 @@ def test_basic_crawl(tester: Crawl4AiTester):
 def test_js_execution(tester: Crawl4AiTester):
    print("\n=== Testing JS Execution ===")
    request = {
-        "urls": "https://www.nbcnews.com/business",
+        "urls": ["https://www.nbcnews.com/business"],
        "priority": 8,
        "js_code": [
            "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
@@ -102,7 +102,7 @@ def test_js_execution(tester: Crawl4AiTester):
 def test_css_selector(tester: Crawl4AiTester):
    print("\n=== Testing CSS Selector ===")
    request = {
-        "urls": "https://www.nbcnews.com/business",
+        "urls": ["https://www.nbcnews.com/business"],
        "priority": 7,
        "css_selector": ".wide-tease-item__description",
        "crawler_params": {"headless": True},
@@ -139,7 +139,7 @@ def test_structured_extraction(tester: Crawl4AiTester):
    }

    request = {
-        "urls": "https://www.coinbase.com/explore",
+        "urls": ["https://www.coinbase.com/explore"],
        "priority": 9,
        "extraction_config": {"type": "json_css", "params": {"schema": schema}},
    }
@@ -174,7 +174,7 @@ def test_llm_extraction(tester: Crawl4AiTester):
    }

    request = {
-        "urls": "https://openai.com/api/pricing",
+        "urls": ["https://openai.com/api/pricing"],
        "priority": 8,
        "extraction_config": {
            "type": "llm",
@@ -221,7 +221,7 @@ def test_llm_with_ollama(tester: Crawl4AiTester):
    }

    request = {
-        "urls": "https://www.nbcnews.com/business",
+        "urls": ["https://www.nbcnews.com/business"],
        "priority": 8,
        "extraction_config": {
            "type": "llm",
@@ -248,7 +248,7 @@ def test_llm_with_ollama(tester: Crawl4AiTester):
 def test_cosine_extraction(tester: Crawl4AiTester):
    print("\n=== Testing Cosine Extraction ===")
    request = {
-        "urls": "https://www.nbcnews.com/business",
+        "urls": ["https://www.nbcnews.com/business"],
        "priority": 8,
        "extraction_config": {
            "type": "cosine",
@@ -274,7 +274,7 @@ def test_cosine_extraction(tester: Crawl4AiTester):
 def test_screenshot(tester: Crawl4AiTester):
    print("\n=== Testing Screenshot ===")
    request = {
-        "urls": "https://www.nbcnews.com/business",
+        "urls": ["https://www.nbcnews.com/business"],
        "priority": 5,
        "screenshot": True,
        "crawler_params": {"headless": True},
--- a/tests/test_main.py
+++ b/tests/test_main.py
@@ -54,7 +54,7 @@ class NBCNewsAPITest:
 async def test_basic_crawl():
    print("\n=== Testing Basic Crawl ===")
    async with NBCNewsAPITest() as api:
-        request = {"urls": "https://www.nbcnews.com/business", "priority": 10}
+        request = {"urls": ["https://www.nbcnews.com/business"], "priority": 10}
        task_id = await api.submit_crawl(request)
        result = await api.wait_for_task(task_id)
        print(f"Basic crawl result length: {len(result['result']['markdown'])}")
@@ -67,7 +67,7 @@ async def test_js_execution():
    print("\n=== Testing JS Execution ===")
    async with NBCNewsAPITest() as api:
        request = {
-            "urls": "https://www.nbcnews.com/business",
+            "urls": ["https://www.nbcnews.com/business"],
            "priority": 8,
            "js_code": [
                "const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More')); loadMoreButton && loadMoreButton.click();"
@@ -86,7 +86,7 @@ async def test_css_selector():
    print("\n=== Testing CSS Selector ===")
    async with NBCNewsAPITest() as api:
        request = {
-            "urls": "https://www.nbcnews.com/business",
+            "urls": ["https://www.nbcnews.com/business"],
            "priority": 7,
            "css_selector": ".wide-tease-item__description",
        }
@@ -120,7 +120,7 @@ async def test_structured_extraction():
        }

        request = {
-            "urls": "https://www.nbcnews.com/business",
+            "urls": ["https://www.nbcnews.com/business"],
            "priority": 9,
            "extraction_config": {"type": "json_css", "params": {"schema": schema}},
        }
@@ -177,7 +177,7 @@ async def test_llm_extraction():
        }

        request = {
-            "urls": "https://www.nbcnews.com/business",
+            "urls": ["https://www.nbcnews.com/business"],
            "priority": 8,
            "extraction_config": {
                "type": "llm",
@@ -209,7 +209,7 @@ async def test_screenshot():
    print("\n=== Testing Screenshot ===")
    async with NBCNewsAPITest() as api:
        request = {
-            "urls": "https://www.nbcnews.com/business",
+            "urls": ["https://www.nbcnews.com/business"],
            "priority": 5,
            "screenshot": True,
            "crawler_params": {"headless": True},
@@ -227,7 +227,7 @@ async def test_priority_handling():
    async with NBCNewsAPITest() as api:
        # Submit low priority task first
        low_priority = {
-            "urls": "https://www.nbcnews.com/business",
+            "urls": ["https://www.nbcnews.com/business"],
            "priority": 1,
            "crawler_params": {"headless": True},
        }
@@ -235,7 +235,7 @@ async def test_priority_handling():

        # Submit high priority task
        high_priority = {
-            "urls": "https://www.nbcnews.com/business/consumer",
+            "urls": ["https://www.nbcnews.com/business/consumer"],
            "priority": 10,
            "crawler_params": {"headless": True},
        }
--- a/tests/test_normalize_url.py
+++ b/tests/test_normalize_url.py
@@ -0,0 +1,91 @@
+import unittest
+from crawl4ai.utils import normalize_url
+
+class TestNormalizeUrl(unittest.TestCase):
+
+    def test_basic_relative_path(self):
+        self.assertEqual(normalize_url("path/to/page.html", "http://example.com/base/"), "http://example.com/base/path/to/page.html")
+
+    def test_base_url_with_trailing_slash(self):
+        self.assertEqual(normalize_url("page.html", "http://example.com/base/"), "http://example.com/base/page.html")
+
+    def test_base_url_without_trailing_slash(self):
+        # If normalize_url correctly uses urljoin, "base" is treated as a file.
+        self.assertEqual(normalize_url("page.html", "http://example.com/base"), "http://example.com/page.html")
+
+    def test_absolute_url_as_href(self):
+        self.assertEqual(normalize_url("http://another.com/page.html", "http://example.com/"), "http://another.com/page.html")
+
+    def test_href_with_leading_trailing_spaces(self):
+        self.assertEqual(normalize_url("  page.html  ", "http://example.com/"), "http://example.com/page.html")
+
+    def test_empty_href(self):
+        # urljoin with an empty href and base ending in '/' returns the base.
+        self.assertEqual(normalize_url("", "http://example.com/base/"), "http://example.com/base/")
+        # urljoin with an empty href and base not ending in '/' also returns base.
+        self.assertEqual(normalize_url("", "http://example.com/base"), "http://example.com/base")
+
+    def test_href_with_query_parameters(self):
+        self.assertEqual(normalize_url("page.html?query=test", "http://example.com/"), "http://example.com/page.html?query=test")
+
+    def test_href_with_fragment(self):
+        self.assertEqual(normalize_url("page.html#section", "http://example.com/"), "http://example.com/page.html#section")
+
+    def test_different_scheme_in_href(self):
+        self.assertEqual(normalize_url("https://secure.example.com/page.html", "http://example.com/"), "https://secure.example.com/page.html")
+
+    def test_parent_directory_in_href(self):
+        self.assertEqual(normalize_url("../otherpage.html", "http://example.com/base/current/"), "http://example.com/base/otherpage.html")
+
+    def test_root_relative_href(self):
+        self.assertEqual(normalize_url("/otherpage.html", "http://example.com/base/current/"), "http://example.com/otherpage.html")
+
+    def test_base_url_with_path_and_no_trailing_slash(self):
+        # If normalize_url correctly uses urljoin, "path" is treated as a file.
+        self.assertEqual(normalize_url("file.html", "http://example.com/path"), "http://example.com/file.html")
+
+    def test_base_url_is_just_domain(self):
+        self.assertEqual(normalize_url("page.html", "http://example.com"), "http://example.com/page.html")
+
+    def test_href_is_only_query(self):
+        self.assertEqual(normalize_url("?query=true", "http://example.com/page.html"), "http://example.com/page.html?query=true")
+
+    def test_href_is_only_fragment(self):
+        self.assertEqual(normalize_url("#fragment", "http://example.com/page.html"), "http://example.com/page.html#fragment")
+
+    def test_relative_link_from_base_file_url(self):
+        """
+        Tests the specific bug report: relative links from a base URL that is a file.
+        Example:
+        Page URL: http://example.com/path/to/document.html
+        Link on page: <a href="./file.xlsx">
+        Expected: http://example.com/path/to/file.xlsx
+        """
+        base_url_file = "http://example.com/zwgk/fdzdgk/zdxx/spaq/t19360680.shtml"
+        href_relative_current_dir = "./P020241203375994691134.xlsx"
+        expected_url1 = "http://example.com/zwgk/fdzdgk/zdxx/spaq/P020241203375994691134.xlsx"
+        self.assertEqual(normalize_url(href_relative_current_dir, base_url_file), expected_url1)
+
+        # Test with a relative link that doesn't start with "./"
+        href_relative_no_dot_slash = "another.doc"
+        expected_url2 = "http://example.com/zwgk/fdzdgk/zdxx/spaq/another.doc"
+        self.assertEqual(normalize_url(href_relative_no_dot_slash, base_url_file), expected_url2)
+
+    def test_invalid_base_url_scheme(self):
+        with self.assertRaises(ValueError) as context:
+            normalize_url("page.html", "ftp://example.com/")
+        self.assertIn("Invalid base URL format", str(context.exception))
+
+    def test_invalid_base_url_netloc(self):
+        with self.assertRaises(ValueError) as context:
+            normalize_url("page.html", "http:///path/")
+        self.assertIn("Invalid base URL format", str(context.exception))
+        
+    def test_base_url_with_port(self):
+        self.assertEqual(normalize_url("path/file.html", "http://example.com:8080/base/"), "http://example.com:8080/base/path/file.html")
+
+    def test_href_with_special_characters(self):
+        self.assertEqual(normalize_url("path%20with%20spaces/file.html", "http://example.com/"), "http://example.com/path%20with%20spaces/file.html")
+
+if __name__ == '__main__':
+    unittest.main()
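
The expectations in tests/test_normalize_url.py follow the standard `urllib.parse.urljoin` resolution rules. The sketch below is not `crawl4ai.utils.normalize_url` itself: `resolve_href` is a hypothetical stand-in, and its base-URL validation (http/https scheme plus a non-empty host) is an assumption inferred from the two `ValueError` tests.

```python
from urllib.parse import urljoin, urlparse


def resolve_href(href: str, base_url: str) -> str:
    """Hypothetical helper: resolve href against base_url with urljoin."""
    parsed = urlparse(base_url)
    # Assumption inferred from the tests: only http(s) bases with a host are valid.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Invalid base URL format: {base_url}")
    # urljoin treats the last path segment of a base without a trailing
    # slash as a file name and replaces it with the relative href.
    return urljoin(base_url, href.strip())


if __name__ == "__main__":
    base = "http://example.com/zwgk/fdzdgk/zdxx/spaq/t19360680.shtml"
    print(resolve_href("./P020241203375994691134.xlsx", base))
    print(resolve_href("page.html", "http://example.com/base"))
    print(resolve_href("page.html", "http://example.com/base/"))
```

The trailing slash on the base decides whether `base` is kept as a directory or replaced as a file, which is the distinction the bug-report test relies on.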
Author	SHA1	Message	Date
ntohidi	afe852935e	fix: show /llm API response in playground. ref #1288	2025-07-09 16:59:17 +02:00
ntohidi	0ebce590f8	Merge branch '2025-JUN-1' into next-MAY	2025-07-09 09:41:03 +02:00
ntohidi	026e96a2df	feat: Add social media and community links to README and index documentation	2025-07-08 15:48:40 +02:00
ntohidi	36429a63de	fix: Improve comments for article metadata extraction in extract_metadata functions. ref #1105	2025-07-08 12:54:33 +02:00
ntohidi	a3d41c7951	fix: Clarify description of 'use_stemming' parameter in markdown generation documentation ref #1086	2025-07-08 12:24:33 +02:00
ntohidi	fee4c5c783	fix: Consolidate import statements in local-files.md for clarity	2025-07-08 11:46:24 +02:00
ntohidi	0f210f6e02	Merge branch '2025-MAY-2' into next-MAY	2025-07-08 11:46:13 +02:00
ntohidi	414f16e975	fix: Update pdf and screenshot usage documentation. ref #1230	2025-06-18 19:05:44 +02:00
ntohidi	b7a6e02236	fix: Update pdf and screenshot usage documentation. ref #1230	2025-06-18 19:04:32 +02:00
AHMET YILMAZ	9332326457	feat: Add PDF parsing documentation and navigation entry	2025-06-16 18:18:32 +08:00
ntohidi	6cd34b3157	Merge branch '2025-MAY-2' of https://github.com/unclecode/crawl4ai into 2025-MAY-2	2025-06-13 11:26:17 +02:00
ntohidi	871d4f1158	fix(extraction_strategy): rename response variable to content for clarity in LLMExtractionStrategy. ref #1146	2025-06-13 11:26:05 +02:00
ntohidi	dc85481180	refactor: Update LLM extraction example with the updated structure	2025-06-12 12:23:03 +02:00
ntohidi	5d9213a0e9	fix: Update JavaScript execution in AsyncPlaywrightCrawlerStrategy to handle script errors and add basic download test case. ref #1215	2025-06-12 12:21:40 +02:00
ntohidi	4679ee023d	fix: Enhance URLPatternFilter to enforce path boundary checks for prefix matching. ref #1003	2025-06-10 11:19:18 +02:00
Nasrin	f9b7090084	Merge pull request #1186 from zimmski/fix-typo-provoder fix, Typo	2025-06-10 10:26:45 +02:00
AHMET YILMAZ	9442597f81	#1127 : Improve URL handling and normalization in scraping strategies	2025-06-10 11:57:06 +08:00
AHMET YILMAZ	74b06d4b80	#1167 Add PHP MIME types to ContentTypeFilter for better file handling	2025-06-09 11:49:33 +08:00
ntohidi	5ac19a61d7	feat: Implement max_scroll_steps parameter for full page scanning. ref: #1168	2025-06-05 16:40:34 +02:00
Markus Zimmermann	022cc2d92a	fix, Typo	2025-06-05 15:30:38 +02:00
ntohidi	fcc2abe4db	(fix): Update document about LLM extraction strategy to use LLMConfig. REF #1146	2025-06-03 12:53:59 +02:00
ntohidi	cc95d3abd4	Fix raw URL parsing logic to correctly handle "raw://" and "raw:" prefixes. REF #1118	2025-06-03 11:19:08 +02:00
Nasrin	5ce3e682f3	Merge pull request #752 from jl-martins/fix-raw-url-parsing Fix `raw://` URL parsing logic. issue ref #1118	2025-06-03 11:10:29 +02:00
ntohidi	28125c1980	Merge branch 'next' into 2025-MAY-2	2025-06-02 20:26:40 +02:00
ntohidi	773ed7b281	Merge branch '2025-APR-1' into 2025-MAY-2	2025-06-02 20:25:58 +02:00
João Martins	58c1e17170	Merge branch 'main' into fix-raw-url-parsing	2025-05-30 13:03:25 +01:00
ntohidi	b55e27d2ef	fix: chanegd error variable name handle_crawl_request, docker api	2025-05-26 11:08:23 +02:00
Aravind Karnam	3d46d89759	docs: fix https://github.com/unclecode/crawl4ai/issues/1109	2025-05-22 17:21:42 +05:30
ntohidi	da8f0dbb93	fix(browser_profiler): change logger print to info for consistent logging in interactive manager	2025-05-22 11:25:51 +02:00
ntohidi	33a0c7a17a	fix(logger): add RED color to LogColor enum for enhanced logging options	2025-05-22 11:17:28 +02:00
Ahmed-Tawfik94	984524ca1c	fix(auth): add token authorization header in request preparation to ensure authenticated requests are made	2025-05-21 13:27:17 +08:00
ntohidi	cb8d581e47	fix(docs): update CrawlerRunConfig to use CacheMode for bypassing cache. REF: #1125	2025-05-19 18:03:05 +02:00
Ahmed-Tawfik94	a55c2b3f88	refactor(logging): update extraction logging to use url_status method	2025-05-19 16:32:22 +08:00
Ahmed Tawfik	ce09648af1	Merge pull request #1054 from Sacristaan/feature/readme_example Fix: README.md urls list	2025-05-19 14:20:21 +08:00
Ahmed-Tawfik94	a97654270b	#1086 fix(markdown): update BM25 filter to use language parameter for stemming	2025-05-19 14:11:46 +08:00
Ahmed-Tawfik94	b4fc60a555	#1103 fix(url): enhance URL normalization to handle invalid schemes and trailing slashes	2025-05-19 13:51:16 +08:00
Ahmed-Tawfik94	137ac014fb	#1105 :fix(metadata): optimize article metadata extraction using XPath for improved performance	2025-05-19 13:48:02 +08:00
Ahmed-Tawfik94	faa98eefbc	#1105 got fixed (metadata now matches with meta property article:*	2025-05-19 11:35:13 +08:00
ntohidi	22725ca87b	fix(crawler): initialize `captured_console` to prevent unbound local error for local HTML files. REF: #1072 Resolved a bug where running the crawler on local HTML files with `capture_console_messages=False` (default) raised `UnboundLocalError` due to `captured_console` being accessed before assignment.	2025-05-15 11:29:36 +02:00
ntohidi	e0fbd2b0a0	fix(schema): update `f` parameter description to use lowercase enum values. REF: #1070 Revised the description for the `f` parameter in the `/mcp/md` tool schema to use lowercase enum values (`raw`, `fit`, `bm25`, `llm`) for consistency with the actual `enum` definition. This change prevents LLM-based clients (e.g., Gemini via LibreChat) from generating uppercase values like `"FIT"`, which caused 422 validation errors due to strict case-sensitive matching.	2025-05-15 10:45:23 +02:00
ntohidi	32966bea11	fix(extraction): resolve `'str' object has no attribute 'choices'` error in LLMExtractionStrategy. Refs: #979 This patch ensures consistent handling of `response.choices[0].message.content` by avoiding redefinition of the `response` variable, which caused downstream exceptions during error handling.	2025-05-15 10:09:19 +02:00
Ahmed-Tawfik94	a3b0cab52a	#1088 is sloved flag -bc now if for --byPass-cache	2025-05-15 11:25:06 +08:00
medo94my	137556b3dc	fix the EXTRACT to match the styling of the other methods	2025-05-14 16:01:10 +08:00
ntohidi	260e2dc347	fix(browser): create browser config before launching managed browser instance. REF: https://discord.com/channels/1278297938551902308/1278298697540567132/1371683009459392716	2025-05-13 14:03:20 +02:00
ntohidi	25d97d56e4	fix(dependencies): remove duplicated aiofiles from project dependencies. REF #1045	2025-05-13 13:56:12 +02:00
Aravind Karnam	98a56e6e01	Merge next branch	2025-05-13 17:12:11 +05:30
ntohidi	1af3d1c2e0	Merge branch '2025-APR-1' of https://github.com/unclecode/crawl4ai into 2025-APR-1	2025-05-08 11:11:32 +02:00
Aravind Karnam	c1041b9bbe	fix: exclude_external_images flag simply discards elements ref:https://github.com/unclecode/crawl4ai/issues/345	2025-05-07 18:43:29 +05:30
Aravind Karnam	f6e25e2a6b	fix: check_robots_txt to support wildcard rules ref: #699	2025-05-07 17:53:30 +05:30
ntohidi	ee93acbd06	fix(async_playwright_crawler): use config directly instead of self.config for verbosity check	2025-05-07 12:32:38 +02:00
Aravind Karnam	2b17f234f8	docs: update direct passing of content_filter to CrawlerRunConfig and instead pass it via MarkdownGenerator. Ref: #603	2025-05-07 15:20:36 +05:30
ntohidi	eebb8c84f0	fix(requirements): add PyPDF2 dependency for PDF processing	2025-05-07 11:18:44 +02:00
ntohidi	12783fabda	fix(dependencies): update pillow version constraint to allow newer releases. ref #709	2025-05-07 11:18:13 +02:00
Aravind Karnam	39e3b792a1	Merge branch 'next' into 2025-APR-1	2025-05-07 10:25:25 +05:30
ntohidi	e0cd3e10de	fix(crawler): initialize captured_console variable for local file processing	2025-05-02 10:35:35 +02:00
ntohidi	1d6a2b9979	fix(crawler): surface real redirect status codes and keep redirect chain. the 30x response instead of always returning 200. Refs #660	2025-04-30 12:29:17 +02:00
ntohidi	039be1b1ce	feat: add pdf2image dependency to requirements	2025-04-30 11:41:35 +02:00
Marc Sacristán	53245e4e0e	Fix: README.md urls list	2025-04-29 16:26:35 +02:00
Aravind Karnam	094201ab2a	Merge next + resolve conflicts	2025-04-23 19:44:50 +05:30
ntohidi	14a31456ef	fix(docs): update browser-crawler-config example to include LLMContentFilter and DefaultMarkdownGenerator, fix syntax errors	2025-04-21 13:59:49 +02:00
ntohidi	0886153d6a	fix(async_playwright_crawler): improve segment handling and viewport adjustments during screenshot capture (Fixed bug: Capturing Screenshot Twice and Increasing Image Size)	2025-04-17 12:48:11 +02:00
ntohidi	0ec3c4a788	fix(crawler): handle navigation aborts during file downloads in AsyncPlaywrightCrawlerStrategy	2025-04-17 12:11:12 +02:00
ntohidi	05085b6e3d	fix(requirements): add fake-useragent to requirements	2025-04-15 13:05:19 +02:00
ntohidi	1f3b1251d0	docs(cli): add Crawl4AI CLI installation instructions to the CLI guide	2025-04-14 12:16:31 +02:00
ntohidi	7b9aabc64a	fix(crawler): ensure max_pages limit is respected during batch processing in crawling strategies	2025-04-14 12:11:22 +02:00
João Martins	27af4cc27b	Fix "raw://" URL parsing logic Closes https://github.com/unclecode/crawl4ai/issues/686	2025-02-15 15:34:59 +00:00