refactor(deep-crawl): add max_pages limit and improve crawl control

Add max_pages parameter to all deep crawling strategies to limit total pages crawled.
Add score_threshold parameter to BFS/DFS strategies for quality control.
Remove legacy parameter handling in AsyncWebCrawler.
Improve error handling and logging in crawl strategies.

BREAKING CHANGE: Removed support for legacy parameters in AsyncWebCrawler.run_many()
UncleCode
2025-03-03 21:51:11 +08:00
parent c612f9a852
commit d024749633
7 changed files with 372 additions and 91 deletions
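Taken together, the two new controls compose as sketched below — `max_pages` hard-caps the crawl while `score_threshold` filters low-scoring URLs (BFS/DFS only). The parameter names come from this commit; the function and data shapes here are invented for illustration, not the library's code:

```python
# Illustrative-only: how max_pages and score_threshold combine when choosing
# which discovered URLs to crawl next. select_urls is a hypothetical helper.
def select_urls(scored_urls, pages_crawled, max_pages, score_threshold):
    remaining = max_pages - pages_crawled
    if remaining <= 0:
        return []  # page budget exhausted: discover no further links
    # Drop URLs under the quality threshold, then keep the best that still fit.
    kept = sorted(
        ((url, score) for url, score in scored_urls if score >= score_threshold),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [url for url, _ in kept[:remaining]]
```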

View File

@@ -5,6 +5,27 @@ All notable changes to Crawl4AI will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## Version 0.5.0 (2025-03-02)
+
+### Added
+
+- *(profiles)* Add BrowserProfiler class for dedicated browser profile management
+- *(cli)* Add interactive profile management to CLI with rich UI
+- *(profiles)* Add ability to crawl directly from profile management interface
+- *(browser)* Support identity-based browsing with persistent profiles
+
+### Changed
+
+- *(browser)* Refactor profile management from ManagedBrowser to BrowserProfiler class
+- *(cli)* Enhance CLI with profile selection and status display for crawling
+- *(examples)* Update identity-based browsing example to use BrowserProfiler class
+- *(docs)* Update identity-based crawling documentation
+
+### Fixed
+
+- *(browser)* Fix profile detection and management on different platforms
+- *(cli)* Fix CLI command structure for better user experience
+
 ## Version 0.5.0 (2025-02-21)

View File

@@ -224,22 +224,22 @@ class AsyncWebCrawler:
         url: str,
         config: CrawlerRunConfig = None,
         # Legacy parameters maintained for backwards compatibility
-        word_count_threshold=MIN_WORD_THRESHOLD,
-        extraction_strategy: ExtractionStrategy = None,
-        chunking_strategy: ChunkingStrategy = RegexChunking(),
-        content_filter: RelevantContentFilter = None,
-        cache_mode: Optional[CacheMode] = None,
+        # word_count_threshold=MIN_WORD_THRESHOLD,
+        # extraction_strategy: ExtractionStrategy = None,
+        # chunking_strategy: ChunkingStrategy = RegexChunking(),
+        # content_filter: RelevantContentFilter = None,
+        # cache_mode: Optional[CacheMode] = None,
         # Deprecated cache parameters
-        bypass_cache: bool = False,
-        disable_cache: bool = False,
-        no_cache_read: bool = False,
-        no_cache_write: bool = False,
+        # bypass_cache: bool = False,
+        # disable_cache: bool = False,
+        # no_cache_read: bool = False,
+        # no_cache_write: bool = False,
         # Other legacy parameters
-        css_selector: str = None,
-        screenshot: bool = False,
-        pdf: bool = False,
-        user_agent: str = None,
-        verbose=True,
+        # css_selector: str = None,
+        # screenshot: bool = False,
+        # pdf: bool = False,
+        # user_agent: str = None,
+        # verbose=True,
         **kwargs,
     ) -> RunManyReturn:
         """

@@ -276,39 +276,41 @@ class AsyncWebCrawler:
         async with self._lock or self.nullcontext():
             try:
+                self.logger.verbose = crawler_config.verbose
+
                 # Handle configuration
                 if crawler_config is not None:
                     config = crawler_config
                 else:
                     # Merge all parameters into a single kwargs dict for config creation
-                    config_kwargs = {
-                        "word_count_threshold": word_count_threshold,
-                        "extraction_strategy": extraction_strategy,
-                        "chunking_strategy": chunking_strategy,
-                        "content_filter": content_filter,
-                        "cache_mode": cache_mode,
-                        "bypass_cache": bypass_cache,
-                        "disable_cache": disable_cache,
-                        "no_cache_read": no_cache_read,
-                        "no_cache_write": no_cache_write,
-                        "css_selector": css_selector,
-                        "screenshot": screenshot,
-                        "pdf": pdf,
-                        "verbose": verbose,
-                        **kwargs,
-                    }
-                    config = CrawlerRunConfig.from_kwargs(config_kwargs)
+                    # config_kwargs = {
+                    #     "word_count_threshold": word_count_threshold,
+                    #     "extraction_strategy": extraction_strategy,
+                    #     "chunking_strategy": chunking_strategy,
+                    #     "content_filter": content_filter,
+                    #     "cache_mode": cache_mode,
+                    #     "bypass_cache": bypass_cache,
+                    #     "disable_cache": disable_cache,
+                    #     "no_cache_read": no_cache_read,
+                    #     "no_cache_write": no_cache_write,
+                    #     "css_selector": css_selector,
+                    #     "screenshot": screenshot,
+                    #     "pdf": pdf,
+                    #     "verbose": verbose,
+                    #     **kwargs,
+                    # }
+                    # config = CrawlerRunConfig.from_kwargs(config_kwargs)
+                    pass

                 # Handle deprecated cache parameters
-                if any([bypass_cache, disable_cache, no_cache_read, no_cache_write]):
-                    # Convert legacy parameters if cache_mode not provided
-                    if config.cache_mode is None:
-                        config.cache_mode = _legacy_to_cache_mode(
-                            disable_cache=disable_cache,
-                            bypass_cache=bypass_cache,
-                            no_cache_read=no_cache_read,
-                            no_cache_write=no_cache_write,
-                        )
+                # if any([bypass_cache, disable_cache, no_cache_read, no_cache_write]):
+                #     # Convert legacy parameters if cache_mode not provided
+                #     if config.cache_mode is None:
+                #         config.cache_mode = _legacy_to_cache_mode(
+                #             disable_cache=disable_cache,
+                #             bypass_cache=bypass_cache,
+                #             no_cache_read=no_cache_read,
+                #             no_cache_write=no_cache_write,
+                #         )

                 # Default to ENABLED if no cache mode specified
                 if config.cache_mode is None:

@@ -344,7 +346,11 @@ class AsyncWebCrawler:
                     # If screenshot is requested but its not in cache, then set cache_result to None
                     screenshot_data = cached_result.screenshot
                     pdf_data = cached_result.pdf
-                    if config.screenshot and not screenshot or config.pdf and not pdf:
+                    # if config.screenshot and not screenshot or config.pdf and not pdf:
+                    if config.screenshot and not screenshot_data:
+                        cached_result = None
+                    if config.pdf and not pdf_data:
                         cached_result = None

                     self.logger.url_status(

@@ -358,12 +364,11 @@ class AsyncWebCrawler:
                 if config and config.proxy_rotation_strategy:
                     next_proxy = await config.proxy_rotation_strategy.get_next_proxy()
                     if next_proxy:
-                        if verbose:
-                            self.logger.info(
-                                message="Switch proxy: {proxy}",
-                                tag="PROXY",
-                                params={"proxy": next_proxy.server},
-                            )
+                        self.logger.info(
+                            message="Switch proxy: {proxy}",
+                            tag="PROXY",
+                            params={"proxy": next_proxy.server},
+                        )
                         config.proxy_config = next_proxy
                         # config = config.clone(proxy_config=next_proxy)

@@ -371,8 +376,8 @@ class AsyncWebCrawler:
                 if not cached_result or not html:
                     t1 = time.perf_counter()
-                    if user_agent:
-                        self.crawler_strategy.update_user_agent(user_agent)
+                    if config.user_agent:
+                        self.crawler_strategy.update_user_agent(config.user_agent)

                     # Check robots.txt if enabled
                     if config and config.check_robots_txt:
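The cache fix in the hunk above replaces one combined condition — which tested the wrong variables (`screenshot`/`pdf` flags instead of the cached data) — with two explicit per-artifact checks. A hypothetical stand-alone version of the corrected logic (function name and signature invented):

```python
# Sketch of the corrected cache-invalidation rule: a cached result is only
# reusable if every requested artifact is actually present in the cache.
def cache_still_valid(want_screenshot, screenshot_data, want_pdf, pdf_data):
    if want_screenshot and not screenshot_data:
        return False  # screenshot requested but missing from cache
    if want_pdf and not pdf_data:
        return False  # pdf requested but missing from cache
    return True
```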

View File

@@ -37,15 +37,18 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
         filter_chain: FilterChain = FilterChain(),
         url_scorer: Optional[URLScorer] = None,
         include_external: bool = False,
+        max_pages: int = float('inf'),
         logger: Optional[logging.Logger] = None,
     ):
         self.max_depth = max_depth
         self.filter_chain = filter_chain
         self.url_scorer = url_scorer
         self.include_external = include_external
+        self.max_pages = max_pages
         self.logger = logger or logging.getLogger(__name__)
         self.stats = TraversalStats(start_time=datetime.now())
         self._cancel_event = asyncio.Event()
+        self._pages_crawled = 0

     async def can_process_url(self, url: str, depth: int) -> bool:
         """

@@ -87,11 +90,19 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
         if new_depth > self.max_depth:
             return

+        # If we've reached the max pages limit, don't discover new links
+        remaining_capacity = self.max_pages - self._pages_crawled
+        if remaining_capacity <= 0:
+            self.logger.info(f"Max pages limit ({self.max_pages}) reached, stopping link discovery")
+            return
+
         # Retrieve internal links; include external links if enabled.
         links = result.links.get("internal", [])
         if self.include_external:
             links += result.links.get("external", [])

+        # If we have more links than remaining capacity, limit how many we'll process
+        valid_links = []
+
         for link in links:
             url = link.get("href")
             if url in visited:

@@ -100,7 +111,15 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
                 self.stats.urls_skipped += 1
                 continue

-            # Record the new depth.
+            valid_links.append(url)
+
+        # If we have more valid links than capacity, limit them
+        if len(valid_links) > remaining_capacity:
+            valid_links = valid_links[:remaining_capacity]
+            self.logger.info(f"Limiting to {remaining_capacity} URLs due to max_pages limit")
+
+        # Record the new depths and add to next_links
+        for url in valid_links:
             depths[url] = new_depth
             next_links.append((url, source_url))

@@ -123,6 +142,11 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
         depths: Dict[str, int] = {start_url: 0}

         while not queue.empty() and not self._cancel_event.is_set():
+            # Stop if we've reached the max pages limit
+            if self._pages_crawled >= self.max_pages:
+                self.logger.info(f"Max pages limit ({self.max_pages}) reached, stopping crawl")
+                break
+
             batch: List[Tuple[float, int, str, Optional[str]]] = []

             # Retrieve up to BATCH_SIZE items from the priority queue.
             for _ in range(BATCH_SIZE):

@@ -153,14 +177,23 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
                     result.metadata["depth"] = depth
                     result.metadata["parent_url"] = parent_url
                     result.metadata["score"] = score
+
+                    # Count only successful crawls toward max_pages limit
+                    if result.success:
+                        self._pages_crawled += 1
+
                     yield result

-                    # Discover new links from this result.
-                    new_links: List[Tuple[str, Optional[str]]] = []
-                    await self.link_discovery(result, result_url, depth, visited, new_links, depths)
-                    for new_url, new_parent in new_links:
-                        new_depth = depths.get(new_url, depth + 1)
-                        new_score = self.url_scorer.score(new_url) if self.url_scorer else 0
-                        await queue.put((new_score, new_depth, new_url, new_parent))
+                    # Only discover links from successful crawls
+                    if result.success:
+                        # Discover new links from this result
+                        new_links: List[Tuple[str, Optional[str]]] = []
+                        await self.link_discovery(result, result_url, depth, visited, new_links, depths)
+
+                        for new_url, new_parent in new_links:
+                            new_depth = depths.get(new_url, depth + 1)
+                            new_score = self.url_scorer.score(new_url) if self.url_scorer else 0
+                            await queue.put((new_score, new_depth, new_url, new_parent))

         # End of crawl.
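The best-first loop with its new hard stop can be sketched in miniature, independent of crawl4ai — the graph, scores, and helper name below are invented; only the shape (priority queue, stop once the page budget is spent) mirrors the strategy above:

```python
import heapq

# Toy best-first traversal: pop the highest-scoring known URL, "crawl" it,
# enqueue its neighbors, and stop as soon as max_pages pages are crawled.
def best_first(start, neighbors, score, max_pages):
    heap = [(-score(start), start)]  # negate scores so heapq pops highest first
    visited, order = {start}, []
    while heap and len(order) < max_pages:  # the max_pages hard stop
        _, url = heapq.heappop(heap)
        order.append(url)  # stands in for actually fetching the page
        for nxt in neighbors.get(url, []):
            if nxt not in visited:
                visited.add(nxt)
                heapq.heappush(heap, (-score(nxt), nxt))
    return order
```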

View File

@@ -26,15 +26,20 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
         filter_chain: FilterChain = FilterChain(),
         url_scorer: Optional[URLScorer] = None,
         include_external: bool = False,
+        score_threshold: float = float('-inf'),
+        max_pages: int = float('inf'),
         logger: Optional[logging.Logger] = None,
     ):
         self.max_depth = max_depth
         self.filter_chain = filter_chain
         self.url_scorer = url_scorer
         self.include_external = include_external
+        self.score_threshold = score_threshold
+        self.max_pages = max_pages
         self.logger = logger or logging.getLogger(__name__)
         self.stats = TraversalStats(start_time=datetime.now())
         self._cancel_event = asyncio.Event()
+        self._pages_crawled = 0

     async def can_process_url(self, url: str, depth: int) -> bool:
         """

@@ -77,11 +82,20 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
         if next_depth > self.max_depth:
             return

+        # If we've reached the max pages limit, don't discover new links
+        remaining_capacity = self.max_pages - self._pages_crawled
+        if remaining_capacity <= 0:
+            self.logger.info(f"Max pages limit ({self.max_pages}) reached, stopping link discovery")
+            return
+
         # Get internal links and, if enabled, external links.
         links = result.links.get("internal", [])
         if self.include_external:
             links += result.links.get("external", [])

+        valid_links = []
+        # First collect all valid links
         for link in links:
             url = link.get("href")
             if url in visited:

@@ -90,10 +104,29 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
                 self.stats.urls_skipped += 1
                 continue

-            # Score the URL if a scorer is provided. In this simple BFS
-            # the score is not used for ordering.
+            # Score the URL if a scorer is provided
             score = self.url_scorer.score(url) if self.url_scorer else 0
-            # attach the score to metadata if needed.

+            # Skip URLs with scores below the threshold
+            if score < self.score_threshold:
+                self.logger.debug(f"URL {url} skipped: score {score} below threshold {self.score_threshold}")
+                self.stats.urls_skipped += 1
+                continue
+
+            valid_links.append((url, score))
+
+        # If we have more valid links than capacity, sort by score and take the top ones
+        if len(valid_links) > remaining_capacity:
+            if self.url_scorer:
+                # Sort by score in descending order
+                valid_links.sort(key=lambda x: x[1], reverse=True)
+            # Take only as many as we have capacity for
+            valid_links = valid_links[:remaining_capacity]
+            self.logger.info(f"Limiting to {remaining_capacity} URLs due to max_pages limit")
+
+        # Process the final selected links
+        for url, score in valid_links:
+            # attach the score to metadata if needed
             if score:
                 result.metadata = result.metadata or {}
                 result.metadata["score"] = score

@@ -126,6 +159,10 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
             batch_config = config.clone(deep_crawl_strategy=None, stream=False)
             batch_results = await crawler.arun_many(urls=urls, config=batch_config)

+            # Update pages crawled counter - count only successful crawls
+            successful_results = [r for r in batch_results if r.success]
+            self._pages_crawled += len(successful_results)
+
             for result in batch_results:
                 url = result.url
                 depth = depths.get(url, 0)

@@ -134,7 +171,11 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
                 parent_url = next((parent for (u, parent) in current_level if u == url), None)
                 result.metadata["parent_url"] = parent_url
                 results.append(result)
-                await self.link_discovery(result, url, depth, visited, next_level, depths)
+
+                # Only discover links from successful crawls
+                if result.success:
+                    # Link discovery will handle the max pages limit internally
+                    await self.link_discovery(result, url, depth, visited, next_level, depths)

             current_level = next_level

@@ -161,6 +202,9 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
             stream_config = config.clone(deep_crawl_strategy=None, stream=True)
             stream_gen = await crawler.arun_many(urls=urls, config=stream_config)

+            # Keep track of processed results for this batch
+            results_count = 0
+
             async for result in stream_gen:
                 url = result.url
                 depth = depths.get(url, 0)

@@ -168,8 +212,23 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
                 result.metadata["depth"] = depth
                 parent_url = next((parent for (u, parent) in current_level if u == url), None)
                 result.metadata["parent_url"] = parent_url
+
+                # Count only successful crawls
+                if result.success:
+                    self._pages_crawled += 1
+                    results_count += 1
+
                 yield result

-                await self.link_discovery(result, url, depth, visited, next_level, depths)
+                # Only discover links from successful crawls
+                if result.success:
+                    # Link discovery will handle the max pages limit internally
+                    await self.link_discovery(result, url, depth, visited, next_level, depths)
+
+            # If we didn't get results back (e.g. due to errors), avoid getting stuck in an infinite loop
+            # by considering these URLs as visited but not counting them toward the max_pages limit
+            if results_count == 0 and urls:
+                self.logger.warning(f"No results returned for {len(urls)} URLs, marking as visited")

             current_level = next_level
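The new BFS link selection — threshold filter first, then keep only the top scorers that fit the remaining page budget — can be written as a self-contained sketch. The function name and inputs below are invented; only the two selection rules mirror the `valid_links` logic in the diff (which additionally skips the sort when no scorer is configured):

```python
# Stand-in for BFS link selection: drop URLs under score_threshold, then keep
# the highest-scoring survivors that still fit the remaining capacity.
def filter_and_cap(scored_links, score_threshold, remaining_capacity):
    valid = [(url, score) for url, score in scored_links if score >= score_threshold]
    if len(valid) > remaining_capacity:
        valid.sort(key=lambda pair: pair[1], reverse=True)  # best scores survive
        valid = valid[:remaining_capacity]
    return valid
```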

View File

@@ -37,6 +37,7 @@ class DFSDeepCrawlStrategy(BFSDeepCrawlStrategy):
             # Clone config to disable recursive deep crawling.
             batch_config = config.clone(deep_crawl_strategy=None, stream=False)
             url_results = await crawler.arun_many(urls=[url], config=batch_config)
+
             for result in url_results:
                 result.metadata = result.metadata or {}
                 result.metadata["depth"] = depth

@@ -45,12 +46,18 @@ class DFSDeepCrawlStrategy(BFSDeepCrawlStrategy):
                     result.metadata["score"] = self.url_scorer.score(url)
                 results.append(result)

-                new_links: List[Tuple[str, Optional[str]]] = []
-                await self.link_discovery(result, url, depth, visited, new_links, depths)
-                # Push new links in reverse order so the first discovered is processed next.
-                for new_url, new_parent in reversed(new_links):
-                    new_depth = depths.get(new_url, depth + 1)
-                    stack.append((new_url, new_parent, new_depth))
+                # Count only successful crawls toward max_pages limit
+                if result.success:
+                    self._pages_crawled += 1
+
+                    # Only discover links from successful crawls
+                    new_links: List[Tuple[str, Optional[str]]] = []
+                    await self.link_discovery(result, url, depth, visited, new_links, depths)
+
+                    # Push new links in reverse order so the first discovered is processed next.
+                    for new_url, new_parent in reversed(new_links):
+                        new_depth = depths.get(new_url, depth + 1)
+                        stack.append((new_url, new_parent, new_depth))

         return results

     async def _arun_stream(

@@ -83,8 +90,13 @@ class DFSDeepCrawlStrategy(BFSDeepCrawlStrategy):
                     result.metadata["score"] = self.url_scorer.score(url)
                 yield result

-                new_links: List[Tuple[str, Optional[str]]] = []
-                await self.link_discovery(result, url, depth, visited, new_links, depths)
-                for new_url, new_parent in reversed(new_links):
-                    new_depth = depths.get(new_url, depth + 1)
-                    stack.append((new_url, new_parent, new_depth))
+                # Only count successful crawls toward max_pages limit
+                # and only discover links from successful crawls
+                if result.success:
+                    self._pages_crawled += 1
+
+                    new_links: List[Tuple[str, Optional[str]]] = []
+                    await self.link_discovery(result, url, depth, visited, new_links, depths)
+                    for new_url, new_parent in reversed(new_links):
+                        new_depth = depths.get(new_url, depth + 1)
+                        stack.append((new_url, new_parent, new_depth))
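Why the DFS code pushes new links in *reversed* order: with a LIFO stack, reversing puts the first-discovered link on top, so it is the next one popped. A minimal sketch of that ordering, with invented names and graph:

```python
# Toy DFS showing the reversed-push trick: first-discovered links are
# explored first even though the frontier is a LIFO stack.
def dfs_order(start, neighbors):
    stack, visited, order = [start], {start}, []
    while stack:
        url = stack.pop()
        order.append(url)  # stands in for crawling the page
        new_links = [u for u in neighbors.get(url, []) if u not in visited]
        visited.update(new_links)
        for u in reversed(new_links):  # first discovered ends up on top
            stack.append(u)
    return order
```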

View File

@@ -80,7 +80,7 @@ async def stream_vs_nonstream():
     base_config = CrawlerRunConfig(
         deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1, include_external=False),
         scraping_strategy=LXMLWebScrapingStrategy(),
-        verbose=True,
+        verbose=False,
     )

     async with AsyncWebCrawler() as crawler:

@@ -212,11 +212,11 @@ async def filters_and_scorers():
     # Create a keyword relevance scorer
     keyword_scorer = KeywordRelevanceScorer(
-        keywords=["crawl", "example", "async", "configuration","javascript","css"], weight=0.3
+        keywords=["crawl", "example", "async", "configuration","javascript","css"], weight=1
     )

     config = CrawlerRunConfig(
-        deep_crawl_strategy=BestFirstCrawlingStrategy(  # Note: Changed to BestFirst
+        deep_crawl_strategy=BestFirstCrawlingStrategy(
             max_depth=1, include_external=False, url_scorer=keyword_scorer
         ),
         scraping_strategy=LXMLWebScrapingStrategy(),

@@ -373,6 +373,104 @@

 # Main function to run the entire tutorial
+async def max_pages_and_thresholds():
+    """
+    PART 6: Demonstrates using max_pages and score_threshold parameters with different strategies.
+
+    This function shows:
+    - How to limit the number of pages crawled
+    - How to set score thresholds for more targeted crawling
+    - Comparing BFS, DFS, and Best-First strategies with these parameters
+    """
+    print("\n===== MAX PAGES AND SCORE THRESHOLDS =====")
+    from crawl4ai.deep_crawling import DFSDeepCrawlStrategy
+
+    async with AsyncWebCrawler() as crawler:
+        # Define a common keyword scorer for all examples
+        keyword_scorer = KeywordRelevanceScorer(
+            keywords=["browser", "crawler", "web", "automation"],
+            weight=1.0
+        )
+
+        # EXAMPLE 1: BFS WITH MAX PAGES
+        print("\n📊 EXAMPLE 1: BFS STRATEGY WITH MAX PAGES LIMIT")
+        print("   Limit the crawler to a maximum of 5 pages")
+        bfs_config = CrawlerRunConfig(
+            deep_crawl_strategy=BFSDeepCrawlStrategy(
+                max_depth=2,
+                include_external=False,
+                url_scorer=keyword_scorer,
+                max_pages=5  # Only crawl 5 pages
+            ),
+            scraping_strategy=LXMLWebScrapingStrategy(),
+            verbose=True,
+            cache_mode=CacheMode.BYPASS,
+        )
+
+        results = await crawler.arun(url="https://docs.crawl4ai.com", config=bfs_config)
+        print(f"   ✅ Crawled {len(results)} pages, bounded by max_pages")
+        for result in results:
+            depth = result.metadata.get("depth", 0)
+            print(f"   → Depth: {depth} | {result.url}")
+
+        # EXAMPLE 2: DFS WITH SCORE THRESHOLD
+        print("\n📊 EXAMPLE 2: DFS STRATEGY WITH SCORE THRESHOLD")
+        print("   Only crawl pages with a relevance score above 0.7")
+        dfs_config = CrawlerRunConfig(
+            deep_crawl_strategy=DFSDeepCrawlStrategy(
+                max_depth=2,
+                include_external=False,
+                url_scorer=keyword_scorer,
+                score_threshold=0.7,  # Only process URLs with scores above 0.7
+                max_pages=10
+            ),
+            scraping_strategy=LXMLWebScrapingStrategy(),
+            verbose=True,
+            cache_mode=CacheMode.BYPASS,
+        )
+
+        results = await crawler.arun(url="https://docs.crawl4ai.com", config=dfs_config)
+        print(f"   ✅ Crawled {len(results)} pages with scores above threshold")
+        for result in results:
+            score = result.metadata.get("score", 0)
+            depth = result.metadata.get("depth", 0)
+            print(f"   → Depth: {depth} | Score: {score:.2f} | {result.url}")

+        # EXAMPLE 3: BEST-FIRST WITH BOTH CONSTRAINTS
+        print("\n📊 EXAMPLE 3: BEST-FIRST STRATEGY WITH BOTH CONSTRAINTS")
+        print("   Limit to 7 pages, prioritizing the highest-scoring ones")
+        bf_config = CrawlerRunConfig(
+            deep_crawl_strategy=BestFirstCrawlingStrategy(
+                max_depth=2,
+                include_external=False,
+                url_scorer=keyword_scorer,
+                max_pages=7,  # Limit to 7 pages total
+            ),
+            scraping_strategy=LXMLWebScrapingStrategy(),
+            verbose=True,
+            cache_mode=CacheMode.BYPASS,
+            stream=True,
+        )
+
+        results = []
+        async for result in await crawler.arun(url="https://docs.crawl4ai.com", config=bf_config):
+            results.append(result)
+            score = result.metadata.get("score", 0)
+            depth = result.metadata.get("depth", 0)
+            print(f"   → Depth: {depth} | Score: {score:.2f} | {result.url}")
+
+        print(f"   ✅ Crawled {len(results)} high-value pages, highest scores first")
+        if results:
+            avg_score = sum(r.metadata.get('score', 0) for r in results) / len(results)
+            print(f"   ✅ Average score: {avg_score:.2f}")
+        print("   🔍 Note: BestFirstCrawlingStrategy visited highest-scoring pages first")
+
 async def run_tutorial():
     """
     Executes all tutorial sections in sequence.

@@ -384,9 +482,10 @@ async def run_tutorial():
     # Define sections - uncomment to run specific parts during development
     tutorial_sections = [
-        basic_deep_crawl,
-        stream_vs_nonstream,
-        filters_and_scorers,
+        # basic_deep_crawl,
+        # stream_vs_nonstream,
+        # filters_and_scorers,
+        max_pages_and_thresholds,  # Added new section
         wrap_up,
         advanced_filters,
     ]

View File

@@ -73,12 +73,18 @@ from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
 strategy = BFSDeepCrawlStrategy(
     max_depth=2,              # Crawl initial page + 2 levels deep
     include_external=False,   # Stay within the same domain
+    max_pages=50,             # Maximum number of pages to crawl (optional)
+    score_threshold=0.3,      # Minimum score for URLs to be crawled (optional)
 )
 ```

 **Key parameters:**

 - **`max_depth`**: Number of levels to crawl beyond the starting page
 - **`include_external`**: Whether to follow links to other domains
+- **`max_pages`**: Maximum number of pages to crawl (default: infinite)
+- **`score_threshold`**: Minimum score for URLs to be crawled (default: -inf)
+- **`filter_chain`**: FilterChain instance for URL filtering
+- **`url_scorer`**: Scorer instance for evaluating URLs

 ### 2.2 DFSDeepCrawlStrategy (Depth-First Search)

@@ -91,12 +97,18 @@ from crawl4ai.deep_crawling import DFSDeepCrawlStrategy
 strategy = DFSDeepCrawlStrategy(
     max_depth=2,              # Crawl initial page + 2 levels deep
     include_external=False,   # Stay within the same domain
+    max_pages=30,             # Maximum number of pages to crawl (optional)
+    score_threshold=0.5,      # Minimum score for URLs to be crawled (optional)
 )
 ```

 **Key parameters:**

 - **`max_depth`**: Number of levels to crawl beyond the starting page
 - **`include_external`**: Whether to follow links to other domains
+- **`max_pages`**: Maximum number of pages to crawl (default: infinite)
+- **`score_threshold`**: Minimum score for URLs to be crawled (default: -inf)
+- **`filter_chain`**: FilterChain instance for URL filtering
+- **`url_scorer`**: Scorer instance for evaluating URLs

 ### 2.3 BestFirstCrawlingStrategy (⭐️ - Recommended Deep crawl strategy)

@@ -116,7 +128,8 @@ scorer = KeywordRelevanceScorer(
 strategy = BestFirstCrawlingStrategy(
     max_depth=2,
     include_external=False,
-    url_scorer=scorer
+    url_scorer=scorer,
+    max_pages=25,             # Maximum number of pages to crawl (optional)
 )
 ```

@@ -124,6 +137,8 @@ This crawling approach:
 - Evaluates each discovered URL based on scorer criteria
 - Visits higher-scoring pages first
 - Helps focus crawl resources on the most relevant content
+- Can limit total pages crawled with `max_pages`
+- Does not need `score_threshold` as it naturally prioritizes by score

 ---

@@ -410,27 +425,64 @@ if __name__ == "__main__":
 ---

-## 8. Common Pitfalls & Tips
+## 8. Limiting and Controlling Crawl Size

-1. **Set realistic depth limits.** Be cautious with `max_depth` values > 3, which can exponentially increase crawl size.
+### 8.1 Using max_pages
+
+You can limit the total number of pages crawled with the `max_pages` parameter:
+
+```python
+# Limit to exactly 20 pages regardless of depth
+strategy = BFSDeepCrawlStrategy(
+    max_depth=3,
+    max_pages=20
+)
+```
+
+This feature is useful for:
+
+- Controlling API costs
+- Setting predictable execution times
+- Focusing on the most important content
+- Testing crawl configurations before full execution
+
+### 8.2 Using score_threshold
+
+For BFS and DFS strategies, you can set a minimum score threshold to only crawl high-quality pages:
+
+```python
+# Only follow links with scores above 0.4
+strategy = DFSDeepCrawlStrategy(
+    max_depth=2,
+    url_scorer=KeywordRelevanceScorer(keywords=["api", "guide", "reference"]),
+    score_threshold=0.4  # Skip URLs with scores below this value
+)
+```
+
+Note that for BestFirstCrawlingStrategy, score_threshold is not needed since pages are already processed in order of highest score first.
+
+## 9. Common Pitfalls & Tips
+
+1. **Set realistic limits.** Be cautious with `max_depth` values > 3, which can exponentially increase crawl size. Use `max_pages` to set hard limits.
 2. **Don't neglect the scoring component.** BestFirstCrawling works best with well-tuned scorers. Experiment with keyword weights for optimal prioritization.
 3. **Be a good web citizen.** Respect robots.txt. (disabled by default)
-4. **Handle page errors gracefully.** Not all pages will be accessible. Check `result.status` when processing results.
+4. **Handle page errors gracefully.** Not all pages will be accessible. Check `result.success` and `result.error_message` when processing results.
+5. **Balance breadth vs. depth.** Choose your strategy wisely - BFS for comprehensive coverage, DFS for deep exploration, BestFirst for focused relevance-based crawling.

 ---

-## 9. Summary & Next Steps
+## 10. Summary & Next Steps

 In this **Deep Crawling with Crawl4AI** tutorial, you learned to:

-- Configure **BFSDeepCrawlStrategy** and **BestFirstCrawlingStrategy**
+- Configure **BFSDeepCrawlStrategy**, **DFSDeepCrawlStrategy**, and **BestFirstCrawlingStrategy**
 - Process results in streaming or non-streaming mode
 - Apply filters to target specific content
 - Use scorers to prioritize the most relevant pages
+- Limit crawls with `max_pages` and `score_threshold` parameters
 - Build a complete advanced crawler with combined techniques

 With these tools, you can efficiently extract structured data from websites at scale, focusing precisely on the content you need for your specific use case.