refactor(deep-crawl): add max_pages limit and improve crawl control

Add max_pages parameter to all deep crawling strategies to limit total pages crawled.
Add score_threshold parameter to BFS/DFS strategies for quality control.
Remove legacy parameter handling in AsyncWebCrawler.
Improve error handling and logging in crawl strategies.

BREAKING CHANGE: Removed support for legacy parameters in AsyncWebCrawler.run_many()
This commit is contained in:
UncleCode
2025-03-03 21:51:11 +08:00
parent c612f9a852
commit d024749633
7 changed files with 372 additions and 91 deletions

View File

@@ -5,6 +5,27 @@ All notable changes to Crawl4AI will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## Version 0.5.0 (2025-03-02)
### Added
- *(profiles)* Add BrowserProfiler class for dedicated browser profile management
- *(cli)* Add interactive profile management to CLI with rich UI
- *(profiles)* Add ability to crawl directly from profile management interface
- *(browser)* Support identity-based browsing with persistent profiles
### Changed
- *(browser)* Refactor profile management from ManagedBrowser to BrowserProfiler class
- *(cli)* Enhance CLI with profile selection and status display for crawling
- *(examples)* Update identity-based browsing example to use BrowserProfiler class
- *(docs)* Update identity-based crawling documentation
### Fixed
- *(browser)* Fix profile detection and management on different platforms
- *(cli)* Fix CLI command structure for better user experience
## Version 0.5.0 (2025-02-21)

View File

@@ -224,22 +224,22 @@ class AsyncWebCrawler:
url: str,
config: CrawlerRunConfig = None,
# Legacy parameters maintained for backwards compatibility
word_count_threshold=MIN_WORD_THRESHOLD,
extraction_strategy: ExtractionStrategy = None,
chunking_strategy: ChunkingStrategy = RegexChunking(),
content_filter: RelevantContentFilter = None,
cache_mode: Optional[CacheMode] = None,
# word_count_threshold=MIN_WORD_THRESHOLD,
# extraction_strategy: ExtractionStrategy = None,
# chunking_strategy: ChunkingStrategy = RegexChunking(),
# content_filter: RelevantContentFilter = None,
# cache_mode: Optional[CacheMode] = None,
# Deprecated cache parameters
bypass_cache: bool = False,
disable_cache: bool = False,
no_cache_read: bool = False,
no_cache_write: bool = False,
# bypass_cache: bool = False,
# disable_cache: bool = False,
# no_cache_read: bool = False,
# no_cache_write: bool = False,
# Other legacy parameters
css_selector: str = None,
screenshot: bool = False,
pdf: bool = False,
user_agent: str = None,
verbose=True,
# css_selector: str = None,
# screenshot: bool = False,
# pdf: bool = False,
# user_agent: str = None,
# verbose=True,
**kwargs,
) -> RunManyReturn:
"""
@@ -276,39 +276,41 @@ class AsyncWebCrawler:
async with self._lock or self.nullcontext():
try:
self.logger.verbose = crawler_config.verbose
# Handle configuration
if crawler_config is not None:
config = crawler_config
else:
# Merge all parameters into a single kwargs dict for config creation
config_kwargs = {
"word_count_threshold": word_count_threshold,
"extraction_strategy": extraction_strategy,
"chunking_strategy": chunking_strategy,
"content_filter": content_filter,
"cache_mode": cache_mode,
"bypass_cache": bypass_cache,
"disable_cache": disable_cache,
"no_cache_read": no_cache_read,
"no_cache_write": no_cache_write,
"css_selector": css_selector,
"screenshot": screenshot,
"pdf": pdf,
"verbose": verbose,
**kwargs,
}
config = CrawlerRunConfig.from_kwargs(config_kwargs)
# config_kwargs = {
# "word_count_threshold": word_count_threshold,
# "extraction_strategy": extraction_strategy,
# "chunking_strategy": chunking_strategy,
# "content_filter": content_filter,
# "cache_mode": cache_mode,
# "bypass_cache": bypass_cache,
# "disable_cache": disable_cache,
# "no_cache_read": no_cache_read,
# "no_cache_write": no_cache_write,
# "css_selector": css_selector,
# "screenshot": screenshot,
# "pdf": pdf,
# "verbose": verbose,
# **kwargs,
# }
# config = CrawlerRunConfig.from_kwargs(config_kwargs)
pass
# Handle deprecated cache parameters
if any([bypass_cache, disable_cache, no_cache_read, no_cache_write]):
# Convert legacy parameters if cache_mode not provided
if config.cache_mode is None:
config.cache_mode = _legacy_to_cache_mode(
disable_cache=disable_cache,
bypass_cache=bypass_cache,
no_cache_read=no_cache_read,
no_cache_write=no_cache_write,
)
# if any([bypass_cache, disable_cache, no_cache_read, no_cache_write]):
# # Convert legacy parameters if cache_mode not provided
# if config.cache_mode is None:
# config.cache_mode = _legacy_to_cache_mode(
# disable_cache=disable_cache,
# bypass_cache=bypass_cache,
# no_cache_read=no_cache_read,
# no_cache_write=no_cache_write,
# )
# Default to ENABLED if no cache mode specified
if config.cache_mode is None:
@@ -344,7 +346,11 @@ class AsyncWebCrawler:
# If screenshot is requested but its not in cache, then set cache_result to None
screenshot_data = cached_result.screenshot
pdf_data = cached_result.pdf
if config.screenshot and not screenshot or config.pdf and not pdf:
# if config.screenshot and not screenshot or config.pdf and not pdf:
if config.screenshot and not screenshot_data:
cached_result = None
if config.pdf and not pdf_data:
cached_result = None
self.logger.url_status(
@@ -358,12 +364,11 @@ class AsyncWebCrawler:
if config and config.proxy_rotation_strategy:
next_proxy = await config.proxy_rotation_strategy.get_next_proxy()
if next_proxy:
if verbose:
self.logger.info(
message="Switch proxy: {proxy}",
tag="PROXY",
params={"proxy": next_proxy.server},
)
self.logger.info(
message="Switch proxy: {proxy}",
tag="PROXY",
params={"proxy": next_proxy.server},
)
config.proxy_config = next_proxy
# config = config.clone(proxy_config=next_proxy)
@@ -371,8 +376,8 @@ class AsyncWebCrawler:
if not cached_result or not html:
t1 = time.perf_counter()
if user_agent:
self.crawler_strategy.update_user_agent(user_agent)
if config.user_agent:
self.crawler_strategy.update_user_agent(config.user_agent)
# Check robots.txt if enabled
if config and config.check_robots_txt:

View File

@@ -37,15 +37,18 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
filter_chain: FilterChain = FilterChain(),
url_scorer: Optional[URLScorer] = None,
include_external: bool = False,
max_pages: int = float('inf'),
logger: Optional[logging.Logger] = None,
):
self.max_depth = max_depth
self.filter_chain = filter_chain
self.url_scorer = url_scorer
self.include_external = include_external
self.max_pages = max_pages
self.logger = logger or logging.getLogger(__name__)
self.stats = TraversalStats(start_time=datetime.now())
self._cancel_event = asyncio.Event()
self._pages_crawled = 0
async def can_process_url(self, url: str, depth: int) -> bool:
"""
@@ -86,12 +89,20 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
new_depth = current_depth + 1
if new_depth > self.max_depth:
return
# If we've reached the max pages limit, don't discover new links
remaining_capacity = self.max_pages - self._pages_crawled
if remaining_capacity <= 0:
self.logger.info(f"Max pages limit ({self.max_pages}) reached, stopping link discovery")
return
# Retrieve internal links; include external links if enabled.
links = result.links.get("internal", [])
if self.include_external:
links += result.links.get("external", [])
# If we have more links than remaining capacity, limit how many we'll process
valid_links = []
for link in links:
url = link.get("href")
if url in visited:
@@ -99,8 +110,16 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
if not await self.can_process_url(url, new_depth):
self.stats.urls_skipped += 1
continue
# Record the new depth.
valid_links.append(url)
# If we have more valid links than capacity, limit them
if len(valid_links) > remaining_capacity:
valid_links = valid_links[:remaining_capacity]
self.logger.info(f"Limiting to {remaining_capacity} URLs due to max_pages limit")
# Record the new depths and add to next_links
for url in valid_links:
depths[url] = new_depth
next_links.append((url, source_url))
@@ -123,6 +142,11 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
depths: Dict[str, int] = {start_url: 0}
while not queue.empty() and not self._cancel_event.is_set():
# Stop if we've reached the max pages limit
if self._pages_crawled >= self.max_pages:
self.logger.info(f"Max pages limit ({self.max_pages}) reached, stopping crawl")
break
batch: List[Tuple[float, int, str, Optional[str]]] = []
# Retrieve up to BATCH_SIZE items from the priority queue.
for _ in range(BATCH_SIZE):
@@ -153,14 +177,23 @@ class BestFirstCrawlingStrategy(DeepCrawlStrategy):
result.metadata["depth"] = depth
result.metadata["parent_url"] = parent_url
result.metadata["score"] = score
# Count only successful crawls toward max_pages limit
if result.success:
self._pages_crawled += 1
yield result
# Discover new links from this result.
new_links: List[Tuple[str, Optional[str]]] = []
await self.link_discovery(result, result_url, depth, visited, new_links, depths)
for new_url, new_parent in new_links:
new_depth = depths.get(new_url, depth + 1)
new_score = self.url_scorer.score(new_url) if self.url_scorer else 0
await queue.put((new_score, new_depth, new_url, new_parent))
# Only discover links from successful crawls
if result.success:
# Discover new links from this result
new_links: List[Tuple[str, Optional[str]]] = []
await self.link_discovery(result, result_url, depth, visited, new_links, depths)
for new_url, new_parent in new_links:
new_depth = depths.get(new_url, depth + 1)
new_score = self.url_scorer.score(new_url) if self.url_scorer else 0
await queue.put((new_score, new_depth, new_url, new_parent))
# End of crawl.

View File

@@ -24,17 +24,22 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
self,
max_depth: int,
filter_chain: FilterChain = FilterChain(),
url_scorer: Optional[URLScorer] = None,
url_scorer: Optional[URLScorer] = None,
include_external: bool = False,
score_threshold: float = float('-inf'),
max_pages: int = float('inf'),
logger: Optional[logging.Logger] = None,
):
self.max_depth = max_depth
self.filter_chain = filter_chain
self.url_scorer = url_scorer
self.include_external = include_external
self.score_threshold = score_threshold
self.max_pages = max_pages
self.logger = logger or logging.getLogger(__name__)
self.stats = TraversalStats(start_time=datetime.now())
self._cancel_event = asyncio.Event()
self._pages_crawled = 0
async def can_process_url(self, url: str, depth: int) -> bool:
"""
@@ -72,16 +77,25 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
prepares the next level of URLs.
Each valid URL is appended to next_level as a tuple (url, parent_url)
and its depth is tracked.
"""
"""
next_depth = current_depth + 1
if next_depth > self.max_depth:
return
# If we've reached the max pages limit, don't discover new links
remaining_capacity = self.max_pages - self._pages_crawled
if remaining_capacity <= 0:
self.logger.info(f"Max pages limit ({self.max_pages}) reached, stopping link discovery")
return
# Get internal links and, if enabled, external links.
links = result.links.get("internal", [])
if self.include_external:
links += result.links.get("external", [])
valid_links = []
# First collect all valid links
for link in links:
url = link.get("href")
if url in visited:
@@ -90,10 +104,29 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
self.stats.urls_skipped += 1
continue
# Score the URL if a scorer is provided. In this simple BFS
# the score is not used for ordering.
# Score the URL if a scorer is provided
score = self.url_scorer.score(url) if self.url_scorer else 0
# attach the score to metadata if needed.
# Skip URLs with scores below the threshold
if score < self.score_threshold:
self.logger.debug(f"URL {url} skipped: score {score} below threshold {self.score_threshold}")
self.stats.urls_skipped += 1
continue
valid_links.append((url, score))
# If we have more valid links than capacity, sort by score and take the top ones
if len(valid_links) > remaining_capacity:
if self.url_scorer:
# Sort by score in descending order
valid_links.sort(key=lambda x: x[1], reverse=True)
# Take only as many as we have capacity for
valid_links = valid_links[:remaining_capacity]
self.logger.info(f"Limiting to {remaining_capacity} URLs due to max_pages limit")
# Process the final selected links
for url, score in valid_links:
# attach the score to metadata if needed
if score:
result.metadata = result.metadata or {}
result.metadata["score"] = score
@@ -125,7 +158,11 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
# Clone the config to disable deep crawling recursion and enforce batch mode.
batch_config = config.clone(deep_crawl_strategy=None, stream=False)
batch_results = await crawler.arun_many(urls=urls, config=batch_config)
# Update pages crawled counter - count only successful crawls
successful_results = [r for r in batch_results if r.success]
self._pages_crawled += len(successful_results)
for result in batch_results:
url = result.url
depth = depths.get(url, 0)
@@ -134,7 +171,11 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
parent_url = next((parent for (u, parent) in current_level if u == url), None)
result.metadata["parent_url"] = parent_url
results.append(result)
await self.link_discovery(result, url, depth, visited, next_level, depths)
# Only discover links from successful crawls
if result.success:
# Link discovery will handle the max pages limit internally
await self.link_discovery(result, url, depth, visited, next_level, depths)
current_level = next_level
@@ -161,6 +202,9 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
stream_config = config.clone(deep_crawl_strategy=None, stream=True)
stream_gen = await crawler.arun_many(urls=urls, config=stream_config)
# Keep track of processed results for this batch
results_count = 0
async for result in stream_gen:
url = result.url
depth = depths.get(url, 0)
@@ -168,9 +212,24 @@ class BFSDeepCrawlStrategy(DeepCrawlStrategy):
result.metadata["depth"] = depth
parent_url = next((parent for (u, parent) in current_level if u == url), None)
result.metadata["parent_url"] = parent_url
# Count only successful crawls
if result.success:
self._pages_crawled += 1
results_count += 1
yield result
await self.link_discovery(result, url, depth, visited, next_level, depths)
# Only discover links from successful crawls
if result.success:
# Link discovery will handle the max pages limit internally
await self.link_discovery(result, url, depth, visited, next_level, depths)
# If we didn't get results back (e.g. due to errors), avoid getting stuck in an infinite loop
# by considering these URLs as visited but not counting them toward the max_pages limit
if results_count == 0 and urls:
self.logger.warning(f"No results returned for {len(urls)} URLs, marking as visited")
current_level = next_level
async def shutdown(self) -> None:

View File

@@ -37,6 +37,7 @@ class DFSDeepCrawlStrategy(BFSDeepCrawlStrategy):
# Clone config to disable recursive deep crawling.
batch_config = config.clone(deep_crawl_strategy=None, stream=False)
url_results = await crawler.arun_many(urls=[url], config=batch_config)
for result in url_results:
result.metadata = result.metadata or {}
result.metadata["depth"] = depth
@@ -44,13 +45,19 @@ class DFSDeepCrawlStrategy(BFSDeepCrawlStrategy):
if self.url_scorer:
result.metadata["score"] = self.url_scorer.score(url)
results.append(result)
new_links: List[Tuple[str, Optional[str]]] = []
await self.link_discovery(result, url, depth, visited, new_links, depths)
# Push new links in reverse order so the first discovered is processed next.
for new_url, new_parent in reversed(new_links):
new_depth = depths.get(new_url, depth + 1)
stack.append((new_url, new_parent, new_depth))
# Count only successful crawls toward max_pages limit
if result.success:
self._pages_crawled += 1
# Only discover links from successful crawls
new_links: List[Tuple[str, Optional[str]]] = []
await self.link_discovery(result, url, depth, visited, new_links, depths)
# Push new links in reverse order so the first discovered is processed next.
for new_url, new_parent in reversed(new_links):
new_depth = depths.get(new_url, depth + 1)
stack.append((new_url, new_parent, new_depth))
return results
async def _arun_stream(
@@ -83,8 +90,13 @@ class DFSDeepCrawlStrategy(BFSDeepCrawlStrategy):
result.metadata["score"] = self.url_scorer.score(url)
yield result
new_links: List[Tuple[str, Optional[str]]] = []
await self.link_discovery(result, url, depth, visited, new_links, depths)
for new_url, new_parent in reversed(new_links):
new_depth = depths.get(new_url, depth + 1)
stack.append((new_url, new_parent, new_depth))
# Only count successful crawls toward max_pages limit
# and only discover links from successful crawls
if result.success:
self._pages_crawled += 1
new_links: List[Tuple[str, Optional[str]]] = []
await self.link_discovery(result, url, depth, visited, new_links, depths)
for new_url, new_parent in reversed(new_links):
new_depth = depths.get(new_url, depth + 1)
stack.append((new_url, new_parent, new_depth))

View File

@@ -80,7 +80,7 @@ async def stream_vs_nonstream():
base_config = CrawlerRunConfig(
deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1, include_external=False),
scraping_strategy=LXMLWebScrapingStrategy(),
verbose=True,
verbose=False,
)
async with AsyncWebCrawler() as crawler:
@@ -212,11 +212,11 @@ async def filters_and_scorers():
# Create a keyword relevance scorer
keyword_scorer = KeywordRelevanceScorer(
keywords=["crawl", "example", "async", "configuration","javascript","css"], weight=0.3
keywords=["crawl", "example", "async", "configuration","javascript","css"], weight=1
)
config = CrawlerRunConfig(
deep_crawl_strategy=BestFirstCrawlingStrategy( # Note: Changed to BestFirst
deep_crawl_strategy=BestFirstCrawlingStrategy(
max_depth=1, include_external=False, url_scorer=keyword_scorer
),
scraping_strategy=LXMLWebScrapingStrategy(),
@@ -373,6 +373,104 @@ async def advanced_filters():
# Main function to run the entire tutorial
async def max_pages_and_thresholds():
"""
PART 6: Demonstrates using max_pages and score_threshold parameters with different strategies.
This function shows:
- How to limit the number of pages crawled
- How to set score thresholds for more targeted crawling
- Comparing BFS, DFS, and Best-First strategies with these parameters
"""
print("\n===== MAX PAGES AND SCORE THRESHOLDS =====")
from crawl4ai.deep_crawling import DFSDeepCrawlStrategy
async with AsyncWebCrawler() as crawler:
# Define a common keyword scorer for all examples
keyword_scorer = KeywordRelevanceScorer(
keywords=["browser", "crawler", "web", "automation"],
weight=1.0
)
# EXAMPLE 1: BFS WITH MAX PAGES
print("\n📊 EXAMPLE 1: BFS STRATEGY WITH MAX PAGES LIMIT")
print(" Limit the crawler to a maximum of 5 pages")
bfs_config = CrawlerRunConfig(
deep_crawl_strategy=BFSDeepCrawlStrategy(
max_depth=2,
include_external=False,
url_scorer=keyword_scorer,
max_pages=5 # Only crawl 5 pages
),
scraping_strategy=LXMLWebScrapingStrategy(),
verbose=True,
cache_mode=CacheMode.BYPASS,
)
results = await crawler.arun(url="https://docs.crawl4ai.com", config=bfs_config)
print(f" ✅ Crawled exactly {len(results)} pages as specified by max_pages")
for result in results:
depth = result.metadata.get("depth", 0)
print(f" → Depth: {depth} | {result.url}")
# EXAMPLE 2: DFS WITH SCORE THRESHOLD
print("\n📊 EXAMPLE 2: DFS STRATEGY WITH SCORE THRESHOLD")
print(" Only crawl pages with a relevance score above 0.5")
dfs_config = CrawlerRunConfig(
deep_crawl_strategy=DFSDeepCrawlStrategy(
max_depth=2,
include_external=False,
url_scorer=keyword_scorer,
score_threshold=0.7, # Only process URLs with scores above 0.5
max_pages=10
),
scraping_strategy=LXMLWebScrapingStrategy(),
verbose=True,
cache_mode=CacheMode.BYPASS,
)
results = await crawler.arun(url="https://docs.crawl4ai.com", config=dfs_config)
print(f" ✅ Crawled {len(results)} pages with scores above threshold")
for result in results:
score = result.metadata.get("score", 0)
depth = result.metadata.get("depth", 0)
print(f" → Depth: {depth} | Score: {score:.2f} | {result.url}")
# EXAMPLE 3: BEST-FIRST WITH BOTH CONSTRAINTS
print("\n📊 EXAMPLE 3: BEST-FIRST STRATEGY WITH BOTH CONSTRAINTS")
print(" Limit to 7 pages with scores above 0.3, prioritizing highest scores")
bf_config = CrawlerRunConfig(
deep_crawl_strategy=BestFirstCrawlingStrategy(
max_depth=2,
include_external=False,
url_scorer=keyword_scorer,
max_pages=7, # Limit to 7 pages total
),
scraping_strategy=LXMLWebScrapingStrategy(),
verbose=True,
cache_mode=CacheMode.BYPASS,
stream=True,
)
results = []
async for result in await crawler.arun(url="https://docs.crawl4ai.com", config=bf_config):
results.append(result)
score = result.metadata.get("score", 0)
depth = result.metadata.get("depth", 0)
print(f" → Depth: {depth} | Score: {score:.2f} | {result.url}")
print(f" ✅ Crawled {len(results)} high-value pages with scores above 0.3")
if results:
avg_score = sum(r.metadata.get('score', 0) for r in results) / len(results)
print(f" ✅ Average score: {avg_score:.2f}")
print(" 🔍 Note: BestFirstCrawlingStrategy visited highest-scoring pages first")
async def run_tutorial():
"""
Executes all tutorial sections in sequence.
@@ -384,9 +482,10 @@ async def run_tutorial():
# Define sections - uncomment to run specific parts during development
tutorial_sections = [
basic_deep_crawl,
stream_vs_nonstream,
filters_and_scorers,
# basic_deep_crawl,
# stream_vs_nonstream,
# filters_and_scorers,
max_pages_and_thresholds, # Added new section
wrap_up,
advanced_filters,
]

View File

@@ -73,12 +73,18 @@ from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
strategy = BFSDeepCrawlStrategy(
max_depth=2, # Crawl initial page + 2 levels deep
include_external=False, # Stay within the same domain
max_pages=50, # Maximum number of pages to crawl (optional)
score_threshold=0.3, # Minimum score for URLs to be crawled (optional)
)
```
**Key parameters:**
- **`max_depth`**: Number of levels to crawl beyond the starting page
- **`include_external`**: Whether to follow links to other domains
- **`max_pages`**: Maximum number of pages to crawl (default: infinite)
- **`score_threshold`**: Minimum score for URLs to be crawled (default: -inf)
- **`filter_chain`**: FilterChain instance for URL filtering
- **`url_scorer`**: Scorer instance for evaluating URLs
### 2.2 DFSDeepCrawlStrategy (Depth-First Search)
@@ -91,12 +97,18 @@ from crawl4ai.deep_crawling import DFSDeepCrawlStrategy
strategy = DFSDeepCrawlStrategy(
max_depth=2, # Crawl initial page + 2 levels deep
include_external=False, # Stay within the same domain
max_pages=30, # Maximum number of pages to crawl (optional)
score_threshold=0.5, # Minimum score for URLs to be crawled (optional)
)
```
**Key parameters:**
- **`max_depth`**: Number of levels to crawl beyond the starting page
- **`include_external`**: Whether to follow links to other domains
- **`max_pages`**: Maximum number of pages to crawl (default: infinite)
- **`score_threshold`**: Minimum score for URLs to be crawled (default: -inf)
- **`filter_chain`**: FilterChain instance for URL filtering
- **`url_scorer`**: Scorer instance for evaluating URLs
### 2.3 BestFirstCrawlingStrategy (⭐️ - Recommended Deep crawl strategy)
@@ -116,7 +128,8 @@ scorer = KeywordRelevanceScorer(
strategy = BestFirstCrawlingStrategy(
max_depth=2,
include_external=False,
url_scorer=scorer
url_scorer=scorer,
max_pages=25, # Maximum number of pages to crawl (optional)
)
```
@@ -124,6 +137,8 @@ This crawling approach:
- Evaluates each discovered URL based on scorer criteria
- Visits higher-scoring pages first
- Helps focus crawl resources on the most relevant content
- Can limit total pages crawled with `max_pages`
- Does not need `score_threshold` as it naturally prioritizes by score
---
@@ -410,27 +425,64 @@ if __name__ == "__main__":
---
## 8. Common Pitfalls & Tips
## 8. Limiting and Controlling Crawl Size
1.**Set realistic depth limits.** Be cautious with `max_depth` values > 3, which can exponentially increase crawl size.
### 8.1 Using max_pages
You can limit the total number of pages crawled with the `max_pages` parameter:
```python
# Limit to exactly 20 pages regardless of depth
strategy = BFSDeepCrawlStrategy(
max_depth=3,
max_pages=20
)
```
This feature is useful for:
- Controlling API costs
- Setting predictable execution times
- Focusing on the most important content
- Testing crawl configurations before full execution
### 8.2 Using score_threshold
For BFS and DFS strategies, you can set a minimum score threshold to only crawl high-quality pages:
```python
# Only follow links with scores above 0.4
strategy = DFSDeepCrawlStrategy(
max_depth=2,
url_scorer=KeywordRelevanceScorer(keywords=["api", "guide", "reference"]),
score_threshold=0.4 # Skip URLs with scores below this value
)
```
Note that for BestFirstCrawlingStrategy, score_threshold is not needed since pages are already processed in order of highest score first.
## 9. Common Pitfalls & Tips
1.**Set realistic limits.** Be cautious with `max_depth` values > 3, which can exponentially increase crawl size. Use `max_pages` to set hard limits.
2.**Don't neglect the scoring component.** BestFirstCrawling works best with well-tuned scorers. Experiment with keyword weights for optimal prioritization.
3.**Be a good web citizen.** Respect robots.txt. (disabled by default)
4.**Handle page errors gracefully.** Not all pages will be accessible. Check `result.status` when processing results.
4.**Handle page errors gracefully.** Not all pages will be accessible. Check `result.success` and `result.error_message` when processing results.
5.**Balance breadth vs. depth.** Choose your strategy wisely - BFS for comprehensive coverage, DFS for deep exploration, BestFirst for focused relevance-based crawling.
---
## 9. Summary & Next Steps
## 10. Summary & Next Steps
In this **Deep Crawling with Crawl4AI** tutorial, you learned to:
- Configure **BFSDeepCrawlStrategy** and **BestFirstCrawlingStrategy**
- Configure **BFSDeepCrawlStrategy**, **DFSDeepCrawlStrategy**, and **BestFirstCrawlingStrategy**
- Process results in streaming or non-streaming mode
- Apply filters to target specific content
- Use scorers to prioritize the most relevant pages
- Limit crawls with `max_pages` and `score_threshold` parameters
- Build a complete advanced crawler with combined techniques
With these tools, you can efficiently extract structured data from websites at scale, focusing precisely on the content you need for your specific use case.