refactor(deep-crawl): add max_pages limit and improve crawl control

Add max_pages parameter to all deep crawling strategies to limit total pages crawled.
Add score_threshold parameter to BFS/DFS strategies for quality control.
Remove legacy parameter handling in AsyncWebCrawler.
Improve error handling and logging in crawl strategies.

BREAKING CHANGE: Removed support for legacy parameters in AsyncWebCrawler.run_many()
UncleCode
2025-03-03 21:51:11 +08:00
parent c612f9a852
commit d024749633
7 changed files with 372 additions and 91 deletions
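Taken together, the two new controls compose as sketched below — `max_pages` hard-caps the crawl while `score_threshold` filters low-scoring URLs (BFS/DFS only). The parameter names come from this commit; the function and data shapes here are invented for illustration, not the library's code:

```python
# Illustrative-only: how max_pages and score_threshold combine when choosing
# which discovered URLs to crawl next. select_urls is a hypothetical helper.
def select_urls(scored_urls, pages_crawled, max_pages, score_threshold):
    remaining = max_pages - pages_crawled
    if remaining <= 0:
        return []  # page budget exhausted: discover no further links
    # Drop URLs under the quality threshold, then keep the best that still fit.
    kept = sorted(
        ((url, score) for url, score in scored_urls if score >= score_threshold),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [url for url, _ in kept[:remaining]]
```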

View File

@@ -5,6 +5,27 @@ All notable changes to Crawl4AI will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## Version 0.5.0 (2025-03-02)
+
+### Added
+
+- *(profiles)* Add BrowserProfiler class for dedicated browser profile management
+- *(cli)* Add interactive profile management to CLI with rich UI
+- *(profiles)* Add ability to crawl directly from profile management interface
+- *(browser)* Support identity-based browsing with persistent profiles
+
+### Changed
+
+- *(browser)* Refactor profile management from ManagedBrowser to BrowserProfiler class
+- *(cli)* Enhance CLI with profile selection and status display for crawling
+- *(examples)* Update identity-based browsing example to use BrowserProfiler class
+- *(docs)* Update identity-based crawling documentation
+
+### Fixed
+
+- *(browser)* Fix profile detection and management on different platforms
+- *(cli)* Fix CLI command structure for better user experience
+
 ## Version 0.5.0 (2025-02-21)

View File

@@ -224,22 +224,22 @@ class AsyncWebCrawler:
         url: str,
         config: CrawlerRunConfig = None,
         # Legacy parameters maintained for backwards compatibility
-        word_count_threshold=MIN_WORD_THRESHOLD,
-        extraction_strategy: ExtractionStrategy = None,
-        chunking_strategy: ChunkingStrategy = RegexChunking(),
-        content_filter: RelevantContentFilter = None,
-        cache_mode: Optional[CacheMode] = None,
+        # word_count_threshold=MIN_WORD_THRESHOLD,
+        # extraction_strategy: ExtractionStrategy = None,
+        # chunking_strategy: ChunkingStrategy = RegexChunking(),
+        # content_filter: RelevantContentFilter = None,
+        # cache_mode: Optional[CacheMode] = None,
         # Deprecated cache parameters
-        bypass_cache: bool = False,
-        disable_cache: bool = False,
-        no_cache_read: bool = False,
-        no_cache_write: bool = False,
+        # bypass_cache: bool = False,
+        # disable_cache: bool = False,
+        # no_cache_read: bool = False,
+        # no_cache_write: bool = False,
         # Other legacy parameters
-        css_selector: str = None,
-        screenshot: bool = False,
-        pdf: bool = False,
-        user_agent: str = None,
-        verbose=True,
+        # css_selector: str = None,
+        # screenshot: bool = False,
+        # pdf: bool = False,
+        # user_agent: str = None,
+        # verbose=True,
         **kwargs,
     ) -> RunManyReturn:
         """

@@ -276,39 +276,41 @@ class AsyncWebCrawler:
         async with self._lock or self.nullcontext():
             try:
+                self.logger.verbose = crawler_config.verbose
+
                 # Handle configuration
                 if crawler_config is not None:
                     config = crawler_config
                 else:
                     # Merge all parameters into a single kwargs dict for config creation
-                    config_kwargs = {
-                        "word_count_threshold": word_count_threshold,
-                        "extraction_strategy": extraction_strategy,
-                        "chunking_strategy": chunking_strategy,
-                        "content_filter": content_filter,
-                        "cache_mode": cache_mode,
-                        "bypass_cache": bypass_cache,
-                        "disable_cache": disable_cache,
-                        "no_cache_read": no_cache_read,
-                        "no_cache_write": no_cache_write,
-                        "css_selector": css_selector,
-                        "screenshot": screenshot,
-                        "pdf": pdf,
-                        "verbose": verbose,
-                        **kwargs,
-                    }
-                    config = CrawlerRunConfig.from_kwargs(config_kwargs)
+                    # config_kwargs = {
+                    #     "word_count_threshold": word_count_threshold,
+                    #     "extraction_strategy": extraction_strategy,
+                    #     "chunking_strategy": chunking_strategy,
+                    #     "content_filter": content_filter,
+                    #     "cache_mode": cache_mode,
+                    #     "bypass_cache": bypass_cache,
+                    #     "disable_cache": disable_cache,
+                    #     "no_cache_read": no_cache_read,
+                    #     "no_cache_write": no_cache_write,
+                    #     "css_selector": css_selector,
+                    #     "screenshot": screenshot,
+                    #     "pdf": pdf,
+                    #     "verbose": verbose,
+                    #     **kwargs,
+                    # }
+                    # config = CrawlerRunConfig.from_kwargs(config_kwargs)
+                    pass

                 # Handle deprecated cache parameters
-                if any([bypass_cache, disable_cache, no_cache_read, no_cache_write]):
-                    # Convert legacy parameters if cache_mode not provided
-                    if config.cache_mode is None:
-                        config.cache_mode = _legacy_to_cache_mode(
-                            disable_cache=disable_cache,
-                            bypass_cache=bypass_cache,
-                            no_cache_read=no_cache_read,
-                            no_cache_write=no_cache_write,
-                        )
+                # if any([bypass_cache, disable_cache, no_cache_read, no_cache_write]):
+                #     # Convert legacy parameters if cache_mode not provided
+                #     if config.cache_mode is None:
+                #         config.cache_mode = _legacy_to_cache_mode(
+                #             disable_cache=disable_cache,
+                #             bypass_cache=bypass_cache,
+                #             no_cache_read=no_cache_read,
+                #             no_cache_write=no_cache_write,
+                #         )

                 # Default to ENABLED if no cache mode specified
                 if config.cache_mode is None:

@@ -344,7 +346,11 @@ class AsyncWebCrawler:
                     # If screenshot is requested but its not in cache, then set cache_result to None
                     screenshot_data = cached_result.screenshot
                     pdf_data = cached_result.pdf
-                    if config.screenshot and not screenshot or config.pdf and not pdf:
+                    # if config.screenshot and not screenshot or config.pdf and not pdf:
+                    if config.screenshot and not screenshot_data:
+                        cached_result = None
+                    if config.pdf and not pdf_data:
                         cached_result = None

                     self.logger.url_status(

@@ -358,12 +364,11 @@ class AsyncWebCrawler:
                 if config and config.proxy_rotation_strategy:
                     next_proxy = await config.proxy_rotation_strategy.get_next_proxy()
                     if next_proxy:
-                        if verbose:
-                            self.logger.info(
-                                message="Switch proxy: {proxy}",
-                                tag="PROXY",
-                                params={"proxy": next_proxy.server},
-                            )
+                        self.logger.info(
+                            message="Switch proxy: {proxy}",
+                            tag="PROXY",
+                            params={"proxy": next_proxy.server},
+                        )
                         config.proxy_config = next_proxy
                         # config = config.clone(proxy_config=next_proxy)

@@ -371,8 +376,8 @@ class AsyncWebCrawler:
                 if not cached_result or not html:
                     t1 = time.perf_counter()
-                    if user_agent:
-                        self.crawler_strategy.update_user_agent(user_agent)
+                    if config.user_agent:
+                        self.crawler_strategy.update_user_agent(config.user_agent)

                     # Check robots.txt if enabled
                     if config and config.check_robots_txt:
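The cache fix in the hunk above replaces one combined condition — which tested the wrong variables (`screenshot`/`pdf` flags instead of the cached data) — with two explicit per-artifact checks. A hypothetical stand-alone version of the corrected logic (function name and signature invented):

```python
# Sketch of the corrected cache-invalidation rule: a cached result is only
# reusable if every requested artifact is actually present in the cache.
def cache_still_valid(want_screenshot, screenshot_data, want_pdf, pdf_data):
    if want_screenshot and not screenshot_data:
        return False  # screenshot requested but missing from cache
    if want_pdf and not pdf_data:
        return False  # pdf requested but missing from cache
    return True
```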

View File

@@ -37,15 +37,18 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
         filter_chain: FilterChain = FilterChain(),
         url_scorer: Optional[URLScorer] = None,
         include_external: bool = False,
+        max_pages: int = float('inf'),
         logger: Optional[logging.Logger] = None,
     ):
         self.max_depth = max_depth
         self.filter_chain = filter_chain
         self.url_scorer = url_scorer
         self.include_external = include_external
+        self.max_pages = max_pages
         self.logger = logger or logging.getLogger(__name__)
         self.stats = TraversalStats(start_time=datetime.now())
         self._cancel_event = asyncio.Event()
+        self._pages_crawled = 0

     async def can_process_url(self, url: str, depth: int) -> bool:
         """

@@ -87,11 +90,19 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
         if new_depth > self.max_depth:
             return

+        # If we've reached the max pages limit, don't discover new links
+        remaining_capacity = self.max_pages - self._pages_crawled
+        if remaining_capacity <= 0:
+            self.logger.info(f"Max pages limit ({self.max_pages}) reached, stopping link discovery")
+            return
+
         # Retrieve internal links; include external links if enabled.
         links = result.links.get("internal", [])
         if self.include_external:
             links += result.links.get("external", [])

+        # If we have more links than remaining capacity, limit how many we'll process
+        valid_links = []
+
         for link in links:
             url = link.get("href")
             if url in visited:

@@ -100,7 +111,15 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
                 self.stats.urls_skipped += 1
                 continue

-            # Record the new depth.
+            valid_links.append(url)
+
+        # If we have more valid links than capacity, limit them
+        if len(valid_links) > remaining_capacity:
+            valid_links = valid_links[:remaining_capacity]
+            self.logger.info(f"Limiting to {remaining_capacity} URLs due to max_pages limit")
+
+        # Record the new depths and add to next_links
+        for url in valid_links:
             depths[url] = new_depth
             next_links.append((url, source_url))

@@ -123,6 +142,11 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
         depths: Dict[str, int] = {start_url: 0}

         while not queue.empty() and not self._cancel_event.is_set():
+            # Stop if we've reached the max pages limit
+            if self._pages_crawled >= self.max_pages:
+                self.logger.info(f"Max pages limit ({self.max_pages}) reached, stopping crawl")
+                break
+
             batch: List[Tuple[float, int, str, Optional[str]]] = []

             # Retrieve up to BATCH_SIZE items from the priority queue.
             for _ in range(BATCH_SIZE):

@@ -153,14 +177,23 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
                     result.metadata["depth"] = depth
                     result.metadata["parent_url"] = parent_url
                     result.metadata["score"] = score
+
+                    # Count only successful crawls toward max_pages limit
+                    if result.success:
+                        self._pages_crawled += 1
+
                     yield result

-                    # Discover new links from this result.
-                    new_links: List[Tuple[str, Optional[str]]] = []
-                    await self.link_discovery(result, result_url, depth, visited, new_links, depths)
-                    for new_url, new_parent in new_links:
-                        new_depth = depths.get(new_url, depth + 1)
-                        new_score = self.url_scorer.score(new_url) if self.url_scorer else 0
-                        await queue.put((new_score, new_depth, new_url, new_parent))
+                    # Only discover links from successful crawls
+                    if result.success:
+                        # Discover new links from this result
+                        new_links: List[Tuple[str, Optional[str]]] = []
+                        await self.link_discovery(result, result_url, depth, visited, new_links, depths)
+
+                        for new_url, new_parent in new_links:
+                            new_depth = depths.get(new_url, depth + 1)
+                            new_score = self.url_scorer.score(new_url) if self.url_scorer else 0
+                            await queue.put((new_score, new_depth, new_url, new_parent))

         # End of crawl.
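The best-first loop with its new hard stop can be sketched in miniature, independent of crawl4ai — the graph, scores, and helper name below are invented; only the shape (priority queue, stop once the page budget is spent) mirrors the strategy above:

```python
import heapq

# Toy best-first traversal: pop the highest-scoring known URL, "crawl" it,
# enqueue its neighbors, and stop as soon as max_pages pages are crawled.
def best_first(start, neighbors, score, max_pages):
    heap = [(-score(start), start)]  # negate scores so heapq pops highest first
    visited, order = {start}, []
    while heap and len(order) < max_pages:  # the max_pages hard stop
        _, url = heapq.heappop(heap)
        order.append(url)  # stands in for actually fetching the page
        for nxt in neighbors.get(url, []):
            if nxt not in visited:
                visited.add(nxt)
                heapq.heappush(heap, (-score(nxt), nxt))
    return order
```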

View File

@@ -26,15 +26,20 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
         filter_chain: FilterChain = FilterChain(),
         url_scorer: Optional[URLScorer] = None,
         include_external: bool = False,
+        score_threshold: float = float('-inf'),
+        max_pages: int = float('inf'),
         logger: Optional[logging.Logger] = None,
     ):
         self.max_depth = max_depth
         self.filter_chain = filter_chain
         self.url_scorer = url_scorer
         self.include_external = include_external
+        self.score_threshold = score_threshold
+        self.max_pages = max_pages
         self.logger = logger or logging.getLogger(__name__)
         self.stats = TraversalStats(start_time=datetime.now())
         self._cancel_event = asyncio.Event()
+        self._pages_crawled = 0

     async def can_process_url(self, url: str, depth: int) -> bool:
         """

@@ -77,11 +82,20 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
         if next_depth > self.max_depth:
             return

+        # If we've reached the max pages limit, don't discover new links
+        remaining_capacity = self.max_pages - self._pages_crawled
+        if remaining_capacity <= 0:
+            self.logger.info(f"Max pages limit ({self.max_pages}) reached, stopping link discovery")
+            return
+
         # Get internal links and, if enabled, external links.
         links = result.links.get("internal", [])
         if self.include_external:
             links += result.links.get("external", [])

+        valid_links = []
+        # First collect all valid links
         for link in links:
             url = link.get("href")
             if url in visited:

@@ -90,10 +104,29 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
                 self.stats.urls_skipped += 1
                 continue

-            # Score the URL if a scorer is provided. In this simple BFS
-            # the score is not used for ordering.
+            # Score the URL if a scorer is provided
             score = self.url_scorer.score(url) if self.url_scorer else 0
-            # attach the score to metadata if needed.

+            # Skip URLs with scores below the threshold
+            if score < self.score_threshold:
+                self.logger.debug(f"URL {url} skipped: score {score} below threshold {self.score_threshold}")
+                self.stats.urls_skipped += 1
+                continue
+
+            valid_links.append((url, score))
+
+        # If we have more valid links than capacity, sort by score and take the top ones
+        if len(valid_links) > remaining_capacity:
+            if self.url_scorer:
+                # Sort by score in descending order
+                valid_links.sort(key=lambda x: x[1], reverse=True)
+            # Take only as many as we have capacity for
+            valid_links = valid_links[:remaining_capacity]
+            self.logger.info(f"Limiting to {remaining_capacity} URLs due to max_pages limit")
+
+        # Process the final selected links
+        for url, score in valid_links:
+            # attach the score to metadata if needed
             if score:
                 result.metadata = result.metadata or {}
                 result.metadata["score"] = score

@@ -126,6 +159,10 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
             batch_config = config.clone(deep_crawl_strategy=None, stream=False)
             batch_results = await crawler.arun_many(urls=urls, config=batch_config)

+            # Update pages crawled counter - count only successful crawls
+            successful_results = [r for r in batch_results if r.success]
+            self._pages_crawled += len(successful_results)
+
             for result in batch_results:
                 url = result.url
                 depth = depths.get(url, 0)

@@ -134,7 +171,11 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
                 parent_url = next((parent for (u, parent) in current_level if u == url), None)
                 result.metadata["parent_url"] = parent_url
                 results.append(result)
-                await self.link_discovery(result, url, depth, visited, next_level, depths)
+
+                # Only discover links from successful crawls
+                if result.success:
+                    # Link discovery will handle the max pages limit internally
+                    await self.link_discovery(result, url, depth, visited, next_level, depths)

             current_level = next_level

@@ -161,6 +202,9 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
             stream_config = config.clone(deep_crawl_strategy=None, stream=True)
             stream_gen = await crawler.arun_many(urls=urls, config=stream_config)

+            # Keep track of processed results for this batch
+            results_count = 0
+
             async for result in stream_gen:
                 url = result.url
                 depth = depths.get(url, 0)

@@ -168,8 +212,23 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
                 result.metadata["depth"] = depth
                 parent_url = next((parent for (u, parent) in current_level if u == url), None)
                 result.metadata["parent_url"] = parent_url
+
+                # Count only successful crawls
+                if result.success:
+                    self._pages_crawled += 1
+                    results_count += 1
+
                 yield result

-                await self.link_discovery(result, url, depth, visited, next_level, depths)
+                # Only discover links from successful crawls
+                if result.success:
+                    # Link discovery will handle the max pages limit internally
+                    await self.link_discovery(result, url, depth, visited, next_level, depths)
+
+            # If we didn't get results back (e.g. due to errors), avoid getting stuck in an infinite loop
+            # by considering these URLs as visited but not counting them toward the max_pages limit
+            if results_count == 0 and urls:
+                self.logger.warning(f"No results returned for {len(urls)} URLs, marking as visited")

             current_level = next_level
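The new BFS link selection — threshold filter first, then keep only the top scorers that fit the remaining page budget — can be written as a self-contained sketch. The function name and inputs below are invented; only the two selection rules mirror the `valid_links` logic in the diff (which additionally skips the sort when no scorer is configured):

```python
# Stand-in for BFS link selection: drop URLs under score_threshold, then keep
# the highest-scoring survivors that still fit the remaining capacity.
def filter_and_cap(scored_links, score_threshold, remaining_capacity):
    valid = [(url, score) for url, score in scored_links if score >= score_threshold]
    if len(valid) > remaining_capacity:
        valid.sort(key=lambda pair: pair[1], reverse=True)  # best scores survive
        valid = valid[:remaining_capacity]
    return valid
```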

View File

@@ -37,6 +37,7 @@ class DFSDeepCrawlStrategy(BFSDeepCrawlStrategy):
             # Clone config to disable recursive deep crawling.
             batch_config = config.clone(deep_crawl_strategy=None, stream=False)
             url_results = await crawler.arun_many(urls=[url], config=batch_config)
+
             for result in url_results:
                 result.metadata = result.metadata or {}
                 result.metadata["depth"] = depth

@@ -45,12 +46,18 @@ class DFSDeepCrawlStrategy(BFSDeepCrawlStrategy):
                     result.metadata["score"] = self.url_scorer.score(url)
                 results.append(result)

-                new_links: List[Tuple[str, Optional[str]]] = []
-                await self.link_discovery(result, url, depth, visited, new_links, depths)
-                # Push new links in reverse order so the first discovered is processed next.
-                for new_url, new_parent in reversed(new_links):
-                    new_depth = depths.get(new_url, depth + 1)
-                    stack.append((new_url, new_parent, new_depth))
+                # Count only successful crawls toward max_pages limit
+                if result.success:
+                    self._pages_crawled += 1
+
+                    # Only discover links from successful crawls
+                    new_links: List[Tuple[str, Optional[str]]] = []
+                    await self.link_discovery(result, url, depth, visited, new_links, depths)
+
+                    # Push new links in reverse order so the first discovered is processed next.
+                    for new_url, new_parent in reversed(new_links):
+                        new_depth = depths.get(new_url, depth + 1)
+                        stack.append((new_url, new_parent, new_depth))

         return results

     async def _arun_stream(

@@ -83,8 +90,13 @@ class DFSDeepCrawlStrategy(BFSDeepCrawlStrategy):
                     result.metadata["score"] = self.url_scorer.score(url)
                 yield result

-                new_links: List[Tuple[str, Optional[str]]] = []
-                await self.link_discovery(result, url, depth, visited, new_links, depths)
-                for new_url, new_parent in reversed(new_links):
-                    new_depth = depths.get(new_url, depth + 1)
-                    stack.append((new_url, new_parent, new_depth))
+                # Only count successful crawls toward max_pages limit
+                # and only discover links from successful crawls
+                if result.success:
+                    self._pages_crawled += 1
+
+                    new_links: List[Tuple[str, Optional[str]]] = []
+                    await self.link_discovery(result, url, depth, visited, new_links, depths)
+                    for new_url, new_parent in reversed(new_links):
+                        new_depth = depths.get(new_url, depth + 1)
+                        stack.append((new_url, new_parent, new_depth))
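Why the DFS code pushes new links in *reversed* order: with a LIFO stack, reversing puts the first-discovered link on top, so it is the next one popped. A minimal sketch of that ordering, with invented names and graph:

```python
# Toy DFS showing the reversed-push trick: first-discovered links are
# explored first even though the frontier is a LIFO stack.
def dfs_order(start, neighbors):
    stack, visited, order = [start], {start}, []
    while stack:
        url = stack.pop()
        order.append(url)  # stands in for crawling the page
        new_links = [u for u in neighbors.get(url, []) if u not in visited]
        visited.update(new_links)
        for u in reversed(new_links):  # first discovered ends up on top
            stack.append(u)
    return order
```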

View File

@@ -80,7 +80,7 @@ async def stream_vs_nonstream():
     base_config = CrawlerRunConfig(
         deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1, include_external=False),
         scraping_strategy=LXMLWebScrapingStrategy(),
-        verbose=True,
+        verbose=False,
     )

     async with AsyncWebCrawler() as crawler:

@@ -212,11 +212,11 @@ async def filters_and_scorers():
     # Create a keyword relevance scorer
     keyword_scorer = KeywordRelevanceScorer(
-        keywords=["crawl", "example", "async", "configuration","javascript","css"], weight=0.3
+        keywords=["crawl", "example", "async", "configuration","javascript","css"], weight=1
     )

     config = CrawlerRunConfig(
-        deep_crawl_strategy=BestFirstCrawlingStrategy(  # Note: Changed to BestFirst
+        deep_crawl_strategy=BestFirstCrawlingStrategy(
             max_depth=1, include_external=False, url_scorer=keyword_scorer
         ),
         scraping_strategy=LXMLWebScrapingStrategy(),

@@ -373,6 +373,104 @@

 # Main function to run the entire tutorial
+async def max_pages_and_thresholds():
+    """
+    PART 6: Demonstrates using max_pages and score_threshold parameters with different strategies.
+
+    This function shows:
+    - How to limit the number of pages crawled
+    - How to set score thresholds for more targeted crawling
+    - Comparing BFS, DFS, and Best-First strategies with these parameters
+    """
+    print("\n===== MAX PAGES AND SCORE THRESHOLDS =====")
+    from crawl4ai.deep_crawling import DFSDeepCrawlStrategy
+
+    async with AsyncWebCrawler() as crawler:
+        # Define a common keyword scorer for all examples
+        keyword_scorer = KeywordRelevanceScorer(
+            keywords=["browser", "crawler", "web", "automation"],
+            weight=1.0
+        )
+
+        # EXAMPLE 1: BFS WITH MAX PAGES
+        print("\n📊 EXAMPLE 1: BFS STRATEGY WITH MAX PAGES LIMIT")
+        print("   Limit the crawler to a maximum of 5 pages")
+        bfs_config = CrawlerRunConfig(
+            deep_crawl_strategy=BFSDeepCrawlStrategy(
+                max_depth=2,
+                include_external=False,
+                url_scorer=keyword_scorer,
+                max_pages=5  # Only crawl 5 pages
+            ),
+            scraping_strategy=LXMLWebScrapingStrategy(),
+            verbose=True,
+            cache_mode=CacheMode.BYPASS,
+        )
+
+        results = await crawler.arun(url="https://docs.crawl4ai.com", config=bfs_config)
+        print(f"   ✅ Crawled {len(results)} pages, bounded by max_pages")
+        for result in results:
+            depth = result.metadata.get("depth", 0)
+            print(f"   → Depth: {depth} | {result.url}")
+
+        # EXAMPLE 2: DFS WITH SCORE THRESHOLD
+        print("\n📊 EXAMPLE 2: DFS STRATEGY WITH SCORE THRESHOLD")
+        print("   Only crawl pages with a relevance score above 0.7")
+        dfs_config = CrawlerRunConfig(
+            deep_crawl_strategy=DFSDeepCrawlStrategy(
+                max_depth=2,
+                include_external=False,
+                url_scorer=keyword_scorer,
+                score_threshold=0.7,  # Only process URLs with scores above 0.7
+                max_pages=10
+            ),
+            scraping_strategy=LXMLWebScrapingStrategy(),
+            verbose=True,
+            cache_mode=CacheMode.BYPASS,
+        )
+
+        results = await crawler.arun(url="https://docs.crawl4ai.com", config=dfs_config)
+        print(f"   ✅ Crawled {len(results)} pages with scores above threshold")
+        for result in results:
+            score = result.metadata.get("score", 0)
+            depth = result.metadata.get("depth", 0)
+            print(f"   → Depth: {depth} | Score: {score:.2f} | {result.url}")

+        # EXAMPLE 3: BEST-FIRST WITH BOTH CONSTRAINTS
+        print("\n📊 EXAMPLE 3: BEST-FIRST STRATEGY WITH BOTH CONSTRAINTS")
+        print("   Limit to 7 pages, prioritizing the highest-scoring ones")
+        bf_config = CrawlerRunConfig(
+            deep_crawl_strategy=BestFirstCrawlingStrategy(
+                max_depth=2,
+                include_external=False,
+                url_scorer=keyword_scorer,
+                max_pages=7,  # Limit to 7 pages total
+            ),
+            scraping_strategy=LXMLWebScrapingStrategy(),
+            verbose=True,
+            cache_mode=CacheMode.BYPASS,
+            stream=True,
+        )
+
+        results = []
+        async for result in await crawler.arun(url="https://docs.crawl4ai.com", config=bf_config):
+            results.append(result)
+            score = result.metadata.get("score", 0)
+            depth = result.metadata.get("depth", 0)
+            print(f"   → Depth: {depth} | Score: {score:.2f} | {result.url}")
+
+        print(f"   ✅ Crawled {len(results)} high-value pages, highest scores first")
+        if results:
+            avg_score = sum(r.metadata.get('score', 0) for r in results) / len(results)
+            print(f"   ✅ Average score: {avg_score:.2f}")
+        print("   🔍 Note: BestFirstCrawlingStrategy visited highest-scoring pages first")
+
 async def run_tutorial():
     """
     Executes all tutorial sections in sequence.

@@ -384,9 +482,10 @@ async def run_tutorial():
     # Define sections - uncomment to run specific parts during development
     tutorial_sections = [
-        basic_deep_crawl,
-        stream_vs_nonstream,
-        filters_and_scorers,
+        # basic_deep_crawl,
+        # stream_vs_nonstream,
+        # filters_and_scorers,
+        max_pages_and_thresholds,  # Added new section
         wrap_up,
         advanced_filters,
     ]

View File

@@ -73,12 +73,18 @@ from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
 strategy = BFSDeepCrawlStrategy(
     max_depth=2,              # Crawl initial page + 2 levels deep
     include_external=False,   # Stay within the same domain
+    max_pages=50,             # Maximum number of pages to crawl (optional)
+    score_threshold=0.3,      # Minimum score for URLs to be crawled (optional)
 )
 ```

 **Key parameters:**

 - **`max_depth`**: Number of levels to crawl beyond the starting page
 - **`include_external`**: Whether to follow links to other domains
+- **`max_pages`**: Maximum number of pages to crawl (default: infinite)
+- **`score_threshold`**: Minimum score for URLs to be crawled (default: -inf)
+- **`filter_chain`**: FilterChain instance for URL filtering
+- **`url_scorer`**: Scorer instance for evaluating URLs

 ### 2.2 DFSDeepCrawlStrategy (Depth-First Search)

@@ -91,12 +97,18 @@ from crawl4ai.deep_crawling import DFSDeepCrawlStrategy
 strategy = DFSDeepCrawlStrategy(
     max_depth=2,              # Crawl initial page + 2 levels deep
     include_external=False,   # Stay within the same domain
+    max_pages=30,             # Maximum number of pages to crawl (optional)
+    score_threshold=0.5,      # Minimum score for URLs to be crawled (optional)
 )
 ```

 **Key parameters:**

 - **`max_depth`**: Number of levels to crawl beyond the starting page
 - **`include_external`**: Whether to follow links to other domains
+- **`max_pages`**: Maximum number of pages to crawl (default: infinite)
+- **`score_threshold`**: Minimum score for URLs to be crawled (default: -inf)
+- **`filter_chain`**: FilterChain instance for URL filtering
+- **`url_scorer`**: Scorer instance for evaluating URLs

 ### 2.3 BestFirstCrawlingStrategy (⭐️ - Recommended Deep crawl strategy)

@@ -116,7 +128,8 @@ scorer = KeywordRelevanceScorer(
 strategy = BestFirstCrawlingStrategy(
     max_depth=2,
     include_external=False,
-    url_scorer=scorer
+    url_scorer=scorer,
+    max_pages=25,             # Maximum number of pages to crawl (optional)
 )
 ```

@@ -124,6 +137,8 @@ This crawling approach:
 - Evaluates each discovered URL based on scorer criteria
 - Visits higher-scoring pages first
 - Helps focus crawl resources on the most relevant content
+- Can limit total pages crawled with `max_pages`
+- Does not need `score_threshold` as it naturally prioritizes by score

 ---

@@ -410,27 +425,64 @@ if __name__ == "__main__":
 ---

-## 8. Common Pitfalls & Tips
+## 8. Limiting and Controlling Crawl Size

-1. **Set realistic depth limits.** Be cautious with `max_depth` values > 3, which can exponentially increase crawl size.
+### 8.1 Using max_pages
+
+You can limit the total number of pages crawled with the `max_pages` parameter:
+
+```python
+# Limit to exactly 20 pages regardless of depth
+strategy = BFSDeepCrawlStrategy(
+    max_depth=3,
+    max_pages=20
+)
+```
+
+This feature is useful for:
+
+- Controlling API costs
+- Setting predictable execution times
+- Focusing on the most important content
+- Testing crawl configurations before full execution
+
+### 8.2 Using score_threshold
+
+For BFS and DFS strategies, you can set a minimum score threshold to only crawl high-quality pages:
+
+```python
+# Only follow links with scores above 0.4
+strategy = DFSDeepCrawlStrategy(
+    max_depth=2,
+    url_scorer=KeywordRelevanceScorer(keywords=["api", "guide", "reference"]),
+    score_threshold=0.4  # Skip URLs with scores below this value
+)
+```
+
+Note that for BestFirstCrawlingStrategy, score_threshold is not needed since pages are already processed in order of highest score first.
+
+## 9. Common Pitfalls & Tips
+
+1. **Set realistic limits.** Be cautious with `max_depth` values > 3, which can exponentially increase crawl size. Use `max_pages` to set hard limits.
 2. **Don't neglect the scoring component.** BestFirstCrawling works best with well-tuned scorers. Experiment with keyword weights for optimal prioritization.
 3. **Be a good web citizen.** Respect robots.txt. (disabled by default)
-4. **Handle page errors gracefully.** Not all pages will be accessible. Check `result.status` when processing results.
+4. **Handle page errors gracefully.** Not all pages will be accessible. Check `result.success` and `result.error_message` when processing results.
+5. **Balance breadth vs. depth.** Choose your strategy wisely - BFS for comprehensive coverage, DFS for deep exploration, BestFirst for focused relevance-based crawling.

 ---

-## 9. Summary & Next Steps
+## 10. Summary & Next Steps

 In this **Deep Crawling with Crawl4AI** tutorial, you learned to:

-- Configure **BFSDeepCrawlStrategy** and **BestFirstCrawlingStrategy**
+- Configure **BFSDeepCrawlStrategy**, **DFSDeepCrawlStrategy**, and **BestFirstCrawlingStrategy**
 - Process results in streaming or non-streaming mode
 - Apply filters to target specific content
 - Use scorers to prioritize the most relevant pages
+- Limit crawls with `max_pages` and `score_threshold` parameters
 - Build a complete advanced crawler with combined techniques

 With these tools, you can efficiently extract structured data from websites at scale, focusing precisely on the content you need for your specific use case.