Release prep (#749)
* fix: Update export of URLPatternFilter
* chore: Add dependency for cchardet in requirements
* docs: Update example for deep crawl in release note for v0.5
* docs: Update the example for memory dispatcher
* docs: Updated example for crawl strategies
* refactor: Removed wrapping in `if __name__ == "__main__"` block since this is a markdown file
* chore: Removed cchardet from dependency list, since unclecode is planning to remove it
* docs: Updated the example for proxy rotation to a working example
* feat: Introduced ProxyConfig param
* Add tutorial for deep crawl & update contributor list for bug fixes in Feb alpha-1
* chore: Update and test new dependencies
* feat: Make PyPDF2 a conditional dependency
* Updated tutorial and release note for v0.5
* docs: Update docs for deep crawl, and fix a typo in the docker-deployment markdown filename
* refactor: 1. Deprecate markdown_v2; 2. Make markdown backward compatible to behave as a string when needed; 3. Fix LlmConfig usage in CLI; 4. Deprecate markdown_v2 in CLI; 5. Update AsyncWebCrawler for changes in CrawlResult
* fix: Bug in serialisation of markdown in acache_url
* refactor: Added deprecation errors for fit_html and fit_markdown accessed directly on the result. Now access them via markdown
* fix: Remove deprecated markdown_v2 from docker
* refactor: Remove deprecated fit_markdown and fit_html from result
* refactor: Fix cache retrieval for markdown as a string
* chore: Update all docs, examples and tests with deprecation announcements for markdown_v2, fit_html, fit_markdown
docs/md_v2/core/deep-crawling.md (new file, 436 lines)
# Deep Crawling

One of Crawl4AI's most powerful features is its ability to perform **configurable deep crawling** that can explore websites beyond a single page. With fine-tuned control over crawl depth, domain boundaries, and content filtering, Crawl4AI gives you the tools to extract precisely the content you need.

In this tutorial, you'll learn:

1. How to set up a **Basic Deep Crawler** with BFS strategy
2. The difference between **streamed and non-streamed** output
3. How to implement **filters and scorers** to target specific content
4. How to create **advanced filtering chains** for sophisticated crawls
5. How to use **BestFirstCrawling** for intelligent exploration prioritization

> **Prerequisites**
> - You’ve completed or read [AsyncWebCrawler Basics](../core/simple-crawling.md) to understand how to run a simple crawl.
> - You know how to configure `CrawlerRunConfig`.

---
## 1. Quick Example

Here's a minimal code snippet that implements a basic deep crawl using the **BFSDeepCrawlStrategy**:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy

async def main():
    # Configure a 2-level deep crawl
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,
            include_external=False
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        verbose=True
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://example.com", config=config)

        print(f"Crawled {len(results)} pages in total")

        # Access individual results
        for result in results[:3]:  # Show first 3 results
            print(f"URL: {result.url}")
            print(f"Depth: {result.metadata.get('depth', 0)}")

if __name__ == "__main__":
    asyncio.run(main())
```

**What's happening?**
- `BFSDeepCrawlStrategy(max_depth=2, include_external=False)` instructs Crawl4AI to:
  - Crawl the starting page (depth 0) plus 2 more levels
  - Stay within the same domain (don't follow external links)
- Each result contains metadata like the crawl depth
- Results are returned as a list after all crawling is complete

---
## 2. Understanding Deep Crawling Strategy Options

### 2.1 BFSDeepCrawlStrategy (Breadth-First Search)

The **BFSDeepCrawlStrategy** uses a breadth-first approach, exploring all links at one depth before moving deeper:

```python
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

# Basic configuration
strategy = BFSDeepCrawlStrategy(
    max_depth=2,              # Crawl initial page + 2 levels deep
    include_external=False,   # Stay within the same domain
)
```

**Key parameters:**
- **`max_depth`**: Number of levels to crawl beyond the starting page
- **`include_external`**: Whether to follow links to other domains

### 2.2 DFSDeepCrawlStrategy (Depth-First Search)

The **DFSDeepCrawlStrategy** uses a depth-first approach, exploring as far down a branch as possible before backtracking:

```python
from crawl4ai.deep_crawling import DFSDeepCrawlStrategy

# Basic configuration
strategy = DFSDeepCrawlStrategy(
    max_depth=2,              # Crawl initial page + 2 levels deep
    include_external=False,   # Stay within the same domain
)
```

**Key parameters:**
- **`max_depth`**: Number of levels to crawl beyond the starting page
- **`include_external`**: Whether to follow links to other domains

### 2.3 BestFirstCrawlingStrategy (⭐️ Recommended deep crawl strategy)

For more intelligent crawling, use **BestFirstCrawlingStrategy** with scorers to prioritize the most relevant pages:

```python
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

# Create a scorer
scorer = KeywordRelevanceScorer(
    keywords=["crawl", "example", "async", "configuration"],
    weight=0.7
)

# Configure the strategy
strategy = BestFirstCrawlingStrategy(
    max_depth=2,
    include_external=False,
    url_scorer=scorer
)
```

This crawling approach:
- Evaluates each discovered URL based on scorer criteria
- Visits higher-scoring pages first
- Helps focus crawl resources on the most relevant content

---
## 3. Streaming vs. Non-Streaming Results

Crawl4AI can return results in two modes:

### 3.1 Non-Streaming Mode (Default)

```python
config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
    stream=False  # Default behavior
)

async with AsyncWebCrawler() as crawler:
    # Wait for ALL results to be collected before returning
    results = await crawler.arun("https://example.com", config=config)

    for result in results:
        process_result(result)
```

**When to use non-streaming mode:**
- You need the complete dataset before processing
- You're performing batch operations on all results together
- Crawl time isn't a critical factor

### 3.2 Streaming Mode

```python
config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
    stream=True  # Enable streaming
)

async with AsyncWebCrawler() as crawler:
    # Returns an async iterator
    async for result in await crawler.arun("https://example.com", config=config):
        # Process each result as it becomes available
        process_result(result)
```
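
To make the streaming workflow concrete, here is a minimal sketch (not part of the example above) that writes each page's markdown to disk as soon as it arrives, so early results are usable while the crawl is still running. The `crawl_output` directory and the filename scheme are illustrative assumptions, not part of the Crawl4AI API:

```python
import asyncio
from pathlib import Path
from urllib.parse import urlparse

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def stream_and_save(start_url: str, out_dir: str = "crawl_output"):
    Path(out_dir).mkdir(exist_ok=True)  # illustrative output location

    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
        stream=True
    )

    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun(start_url, config=config):
            if not result.success:
                continue  # skip pages that failed to load (see Section 8)

            # Derive a crude filename from the URL path -- good enough for a demo
            name = urlparse(result.url).path.strip("/").replace("/", "_") or "index"
            # result.markdown can be used like a string in v0.5+
            Path(out_dir, f"{name}.md").write_text(str(result.markdown), encoding="utf-8")
            print(f"Saved {result.url} (depth {result.metadata.get('depth', 0)})")

if __name__ == "__main__":
    asyncio.run(stream_and_save("https://example.com"))
```

Because results arrive as they are crawled, the same loop could just as easily feed a queue or a downstream pipeline instead of writing files.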
**Benefits of streaming mode:**
- Process results immediately as they're discovered
- Start working with early results while crawling continues
- Better for real-time applications or progressive display
- Reduces memory pressure when handling many pages

---
## 4. Filtering Content with Filter Chains

Filters help you narrow down which pages to crawl. Combine multiple filters using **FilterChain** for powerful targeting.

### 4.1 Basic URL Pattern Filter

```python
from crawl4ai.deep_crawling.filters import FilterChain, URLPatternFilter

# Only follow URLs containing "blog" or "docs"
url_filter = URLPatternFilter(patterns=["*blog*", "*docs*"])

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=1,
        filter_chain=FilterChain([url_filter])
    )
)
```

### 4.2 Combining Multiple Filters

```python
from crawl4ai.deep_crawling.filters import (
    FilterChain,
    URLPatternFilter,
    DomainFilter,
    ContentTypeFilter
)

# Create a chain of filters
filter_chain = FilterChain([
    # Only follow URLs with specific patterns
    URLPatternFilter(patterns=["*guide*", "*tutorial*"]),

    # Only crawl specific domains
    DomainFilter(
        allowed_domains=["docs.example.com"],
        blocked_domains=["old.docs.example.com"]
    ),

    # Only include specific content types
    ContentTypeFilter(allowed_types=["text/html"])
])

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=2,
        filter_chain=filter_chain
    )
)
```

### 4.3 Available Filter Types

Crawl4AI includes several specialized filters:

- **`URLPatternFilter`**: Matches URL patterns using wildcard syntax
- **`DomainFilter`**: Controls which domains to include or exclude
- **`ContentTypeFilter`**: Filters based on HTTP Content-Type
- **`ContentRelevanceFilter`**: Uses similarity to a text query
- **`SEOFilter`**: Evaluates SEO elements (meta tags, headers, etc.)

---
## 5. Using Scorers for Prioritized Crawling

Scorers assign priority values to discovered URLs, helping the crawler focus on the most relevant content first.

### 5.1 KeywordRelevanceScorer

```python
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy

# Create a keyword relevance scorer
keyword_scorer = KeywordRelevanceScorer(
    keywords=["crawl", "example", "async", "configuration"],
    weight=0.7  # Importance of this scorer (0.0 to 1.0)
)

config = CrawlerRunConfig(
    deep_crawl_strategy=BestFirstCrawlingStrategy(
        max_depth=2,
        url_scorer=keyword_scorer
    ),
    stream=True  # Recommended with BestFirstCrawling
)

# Results will come in order of relevance score
async with AsyncWebCrawler() as crawler:
    async for result in await crawler.arun("https://example.com", config=config):
        score = result.metadata.get("score", 0)
        print(f"Score: {score:.2f} | {result.url}")
```

**How scorers work:**
- Evaluate each discovered URL before crawling
- Calculate relevance based on various signals
- Help the crawler make intelligent choices about traversal order

---
## 6. Advanced Filtering Techniques

### 6.1 SEO Filter for Quality Assessment

The **SEOFilter** helps you identify pages with strong SEO characteristics:

```python
from crawl4ai.deep_crawling.filters import FilterChain, SEOFilter

# Create an SEO filter that looks for specific keywords in page metadata
seo_filter = SEOFilter(
    threshold=0.5,  # Minimum score (0.0 to 1.0)
    keywords=["tutorial", "guide", "documentation"]
)

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=1,
        filter_chain=FilterChain([seo_filter])
    )
)
```

### 6.2 Content Relevance Filter

The **ContentRelevanceFilter** analyzes the actual content of pages:

```python
from crawl4ai.deep_crawling.filters import FilterChain, ContentRelevanceFilter

# Create a content relevance filter
relevance_filter = ContentRelevanceFilter(
    query="Web crawling and data extraction with Python",
    threshold=0.7  # Minimum similarity score (0.0 to 1.0)
)

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=1,
        filter_chain=FilterChain([relevance_filter])
    )
)
```

This filter:
- Measures semantic similarity between the query and each page's content
- Uses BM25-based relevance scoring computed over the page's head section content

---
## 7. Building a Complete Advanced Crawler

This example combines multiple techniques for a sophisticated crawl:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.filters import (
    FilterChain,
    DomainFilter,
    URLPatternFilter,
    ContentTypeFilter
)
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

async def run_advanced_crawler():
    # Create a sophisticated filter chain
    filter_chain = FilterChain([
        # Domain boundaries
        DomainFilter(
            allowed_domains=["docs.example.com"],
            blocked_domains=["old.docs.example.com"]
        ),

        # URL patterns to include
        URLPatternFilter(patterns=["*guide*", "*tutorial*", "*blog*"]),

        # Content type filtering
        ContentTypeFilter(allowed_types=["text/html"])
    ])

    # Create a relevance scorer
    keyword_scorer = KeywordRelevanceScorer(
        keywords=["crawl", "example", "async", "configuration"],
        weight=0.7
    )

    # Set up the configuration
    config = CrawlerRunConfig(
        deep_crawl_strategy=BestFirstCrawlingStrategy(
            max_depth=2,
            include_external=False,
            filter_chain=filter_chain,
            url_scorer=keyword_scorer
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        stream=True,
        verbose=True
    )

    # Execute the crawl
    results = []
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://docs.example.com", config=config):
            results.append(result)
            score = result.metadata.get("score", 0)
            depth = result.metadata.get("depth", 0)
            print(f"Depth: {depth} | Score: {score:.2f} | {result.url}")

    # Analyze the results
    print(f"Crawled {len(results)} high-value pages")
    print(f"Average score: {sum(r.metadata.get('score', 0) for r in results) / len(results):.2f}")

    # Group by depth
    depth_counts = {}
    for result in results:
        depth = result.metadata.get("depth", 0)
        depth_counts[depth] = depth_counts.get(depth, 0) + 1

    print("Pages crawled by depth:")
    for depth, count in sorted(depth_counts.items()):
        print(f"  Depth {depth}: {count} pages")

if __name__ == "__main__":
    asyncio.run(run_advanced_crawler())
```

---
## 8. Common Pitfalls & Tips

1. **Set realistic depth limits.** Be cautious with `max_depth` values greater than 3, which can exponentially increase crawl size.

2. **Don't neglect the scoring component.** BestFirstCrawling works best with well-tuned scorers. Experiment with keyword weights for optimal prioritization.

3. **Be a good web citizen.** Respect robots.txt where appropriate; note that robots.txt checking is disabled by default.

4. **Handle page errors gracefully.** Not all pages will be accessible. Check `result.success` and `result.error_message` when processing results, as in the sketch after this list.
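
The sketch below (illustrative, not canonical) ties tips 3 and 4 together. It assumes `check_robots_txt` is the `CrawlerRunConfig` flag that enables robots.txt compliance in your installed version; verify the parameter name before relying on it.

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def polite_crawl(start_url: str):
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
        # Assumption: this flag turns on robots.txt checking (off by default);
        # confirm the exact name in your Crawl4AI version.
        check_robots_txt=True,
        stream=True
    )

    failed = []
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun(start_url, config=config):
            if result.success:
                print(f"OK   {result.url}")
            else:
                # Not every discovered URL will be reachable; keep the error for review
                failed.append((result.url, result.error_message))
                print(f"FAIL {result.url}: {result.error_message}")
    return failed

if __name__ == "__main__":
    asyncio.run(polite_crawl("https://example.com"))
```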
---

## 9. Summary & Next Steps

In this **Deep Crawling with Crawl4AI** tutorial, you learned to:

- Configure **BFSDeepCrawlStrategy** and **BestFirstCrawlingStrategy**
- Process results in streaming or non-streaming mode
- Apply filters to target specific content
- Use scorers to prioritize the most relevant pages
- Build a complete advanced crawler with combined techniques

With these tools, you can efficiently extract structured data from websites at scale, focusing precisely on the content you need for your specific use case.