feat: Add URL-specific crawler configurations for multi-URL crawling

Implement dynamic configuration selection based on URL patterns to optimize crawling for different content types. This feature enables users to apply different crawling strategies (PDF extraction, content filtering, JavaScript execution) based on URL matching patterns.

Key additions:
- Add url_matcher and match_mode parameters to CrawlerRunConfig
- Implement is_match() method supporting string patterns, functions, and mixed lists
- Add MatchMode enum for OR/AND logic when combining multiple matchers
- Update AsyncWebCrawler.arun_many() to accept List[CrawlerRunConfig]
- Add select_config() method to dispatchers for runtime config selection
- First matching config wins, with fallback to default

Pattern matching supports:
- Glob-style strings: *.pdf, */blog/*, *api*
- Lambda functions: lambda url: 'github.com' in url
- Mixed patterns with AND/OR logic for complex matching

This enables optimal per-URL configuration:
- PDFs: Use PDFContentScrapingStrategy without JavaScript
- Blogs: Apply content filtering to reduce noise
- APIs: Skip JavaScript, use JSON extraction
- Dynamic sites: Execute only necessary JavaScript

Breaking changes: None - fully backward compatible
ntohidi
2025-08-02 19:10:36 +08:00
parent 864d87afb2
commit a03e68fa2f
13 changed files with 1096 additions and 20 deletions

@@ -0,0 +1,304 @@
"""
🎯 Multi-Config URL Matching Demo
=================================
Learn how to use different crawler configurations for different URL patterns
in a single crawl batch with Crawl4AI's multi-config feature.
Part 1: Understanding URL Matching (Pattern Testing)
Part 2: Practical Example with Real Crawling
"""
import asyncio
from crawl4ai import (
    AsyncWebCrawler,
    CrawlerRunConfig,
    MatchMode
)
from crawl4ai.processors.pdf import PDFContentScrapingStrategy
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator


def print_section(title):
    """Print a formatted section header"""
    print(f"\n{'=' * 60}")
    print(f"{title}")
    print(f"{'=' * 60}\n")


def test_url_matching(config, test_urls, config_name):
    """Test URL matching for a config and show results"""
    print(f"Config: {config_name}")
    print(f"Matcher: {config.url_matcher}")
    if hasattr(config, 'match_mode'):
        print(f"Mode: {config.match_mode.value}")
    print("-" * 40)
    for url in test_urls:
        matches = config.is_match(url)
        # Mark each URL as matched or not matched
        symbol = "✓" if matches else "✗"
        print(f"{symbol} {url}")
    print()


# ==============================================================================
# PART 1: Understanding URL Matching
# ==============================================================================
def demo_part1_pattern_matching():
    """Part 1: Learn how URL matching works without crawling"""
    print_section("PART 1: Understanding URL Matching")
    print("Let's explore different ways to match URLs with configs.\n")

    # Test URLs we'll use throughout
    test_urls = [
        "https://example.com/report.pdf",
        "https://example.com/data.json",
        "https://example.com/blog/post-1",
        "https://example.com/article/news",
        "https://api.example.com/v1/users",
        "https://example.com/about"
    ]

    # 1.1 Simple String Pattern
    print("1.1 Simple String Pattern Matching")
    print("-" * 40)
    pdf_config = CrawlerRunConfig(
        url_matcher="*.pdf"
    )
    test_url_matching(pdf_config, test_urls, "PDF Config")

    # 1.2 Multiple String Patterns
    print("1.2 Multiple String Patterns (OR logic)")
    print("-" * 40)
    blog_config = CrawlerRunConfig(
        url_matcher=["*/blog/*", "*/article/*", "*/news/*"],
        match_mode=MatchMode.OR  # This is default, shown for clarity
    )
    test_url_matching(blog_config, test_urls, "Blog/Article Config")

    # 1.3 Single Function Matcher
    print("1.3 Function-based Matching")
    print("-" * 40)
    api_config = CrawlerRunConfig(
        url_matcher=lambda url: 'api' in url or url.endswith('.json')
    )
    test_url_matching(api_config, test_urls, "API Config")

    # 1.4 List of Functions
    print("1.4 Multiple Functions with AND Logic")
    print("-" * 40)
    # Must be HTTPS AND contain 'api' AND have version number
    secure_api_config = CrawlerRunConfig(
        url_matcher=[
            lambda url: url.startswith('https://'),
            lambda url: 'api' in url,
            lambda url: '/v' in url  # Version indicator
        ],
        match_mode=MatchMode.AND
    )
    test_url_matching(secure_api_config, test_urls, "Secure API Config")

    # 1.5 Mixed: String and Function Together
    print("1.5 Mixed Patterns: String + Function")
    print("-" * 40)
    # Match JSON files OR any API endpoint
    json_or_api_config = CrawlerRunConfig(
        url_matcher=[
            "*.json",  # String pattern
            lambda url: 'api' in url  # Function
        ],
        match_mode=MatchMode.OR
    )
    test_url_matching(json_or_api_config, test_urls, "JSON or API Config")

    # 1.6 Complex: Multiple Strings + Multiple Functions
    print("1.6 Complex Matcher: Mixed Types with AND Logic")
    print("-" * 40)
    # Must be: HTTPS AND (.com domain) AND (blog OR article) AND NOT a PDF
    complex_config = CrawlerRunConfig(
        url_matcher=[
            lambda url: url.startswith('https://'),  # Function: HTTPS check
            "*.com/*",  # String: .com domain
            lambda url: any(pattern in url for pattern in ['/blog/', '/article/']),  # Function: Blog OR article
            lambda url: not url.endswith('.pdf')  # Function: Not PDF
        ],
        match_mode=MatchMode.AND
    )
    test_url_matching(complex_config, test_urls, "Complex Mixed Config")

    print("\n✅ Key Takeaway: First matching config wins when passed to arun_many()!")


# ==============================================================================
# PART 2: Practical Multi-URL Crawling
# ==============================================================================
async def demo_part2_practical_crawling():
    """Part 2: Real-world example with different content types"""
    print_section("PART 2: Practical Multi-URL Crawling")
    print("Now let's see multi-config in action with real URLs.\n")

    # Create specialized configs for different content types
    configs = [
        # Config 1: PDF documents - only match files ending with .pdf
        CrawlerRunConfig(
            url_matcher="*.pdf",
            scraping_strategy=PDFContentScrapingStrategy()
        ),
        # Config 2: Blog/article pages with content filtering
        # The python.org pattern is limited to blog.python.org: a broader
        # "*python.org*" would also match docs.python.org and, because the
        # first match wins, keep that URL from ever reaching Config 5.
        CrawlerRunConfig(
            url_matcher=["*/blog/*", "*/article/*", "*blog.python.org*"],
            markdown_generator=DefaultMarkdownGenerator(
                content_filter=PruningContentFilter(threshold=0.48)
            )
        ),
        # Config 3: Dynamic pages requiring JavaScript
        CrawlerRunConfig(
            url_matcher=lambda url: 'github.com' in url,
            js_code="window.scrollTo(0, 500);"  # Scroll to load content
        ),
        # Config 4: Mixed matcher - API endpoints (string OR function)
        CrawlerRunConfig(
            url_matcher=[
                "*.json",  # String pattern for JSON files
                lambda url: 'api' in url or 'httpbin.org' in url  # Function for API endpoints
            ],
            match_mode=MatchMode.OR,
            extraction_strategy=JsonCssExtractionStrategy({"data": "body"})
        ),
        # Config 5: Complex matcher - Secure documentation sites
        CrawlerRunConfig(
            url_matcher=[
                lambda url: url.startswith('https://'),  # Must be HTTPS
                "*.org/*",  # String: .org domain
                lambda url: any(doc in url for doc in ['docs', 'documentation', 'reference']),  # Has docs
                lambda url: not url.endswith(('.pdf', '.json'))  # Not PDF or JSON
            ],
            match_mode=MatchMode.AND,
            wait_for="css:.content, css:article"  # Wait for content to load
        ),
        # Default config for everything else
        CrawlerRunConfig()  # No url_matcher means it never matches (except as fallback)
    ]

    # URLs to crawl - each will use a different config
    urls = [
        "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",  # → PDF config
        "https://blog.python.org/",  # → Blog config with content filter
        "https://github.com/microsoft/playwright",  # → JS config
        "https://httpbin.org/json",  # → Mixed matcher config (API)
        "https://docs.python.org/3/reference/",  # → Complex matcher config
        "https://example.com/",  # → Default config
    ]

    print("URLs to crawl:")
    for i, url in enumerate(urls, 1):
        print(f"{i}. {url}")

    print("\nCrawling with appropriate config for each URL...\n")
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=urls,
            config=configs
        )

    # Display results
    print("Results:")
    print("-" * 60)
    for result in results:
        if result.success:
            # Infer which config was used from the URL (display only)
            config_type = "Default"
            if result.url.endswith('.pdf'):
                config_type = "PDF Strategy"
            elif any(pattern in result.url for pattern in ['blog', 'python.org']) and 'docs' not in result.url:
                config_type = "Blog + Content Filter"
            elif 'github.com' in result.url:
                config_type = "JavaScript Enabled"
            elif 'httpbin.org' in result.url or result.url.endswith('.json'):
                config_type = "Mixed Matcher (API)"
            elif 'docs.python.org' in result.url:
                config_type = "Complex Matcher (Secure Docs)"

            print(f"\n{result.url}")
            print(f" Config used: {config_type}")
            print(f" Content size: {len(result.markdown)} chars")

            # Show if we have fit_markdown (from content filter)
            if hasattr(result.markdown, 'fit_markdown') and result.markdown.fit_markdown:
                print(f" Fit markdown size: {len(result.markdown.fit_markdown)} chars")
                reduction = (1 - len(result.markdown.fit_markdown) / len(result.markdown)) * 100
                print(f" Content reduced by: {reduction:.1f}%")

            # Show extracted data if using extraction strategy
            if hasattr(result, 'extracted_content') and result.extracted_content:
                print(f" Extracted data: {str(result.extracted_content)[:100]}...")
        else:
            print(f"\n{result.url}")
            print(f" Error: {result.error_message}")

    print("\n" + "=" * 60)
    print("✅ Multi-config crawling complete!")
    print("\nBenefits demonstrated:")
    print("- PDFs handled with specialized scraper")
    print("- Blog content filtered for relevance")
    print("- JavaScript executed only where needed")
    print("- Mixed matchers (string + function) for flexible matching")
    print("- Complex matchers for precise URL targeting")
    print("- Each URL got optimal configuration automatically!")


async def main():
    """Run both parts of the demo"""
    print("""
    🎯 Multi-Config URL Matching Demo
    =================================
    Learn how Crawl4AI can use different configurations
    for different URLs in a single batch.
    """)

    # Part 1: Pattern matching
    demo_part1_pattern_matching()

    print("\nPress Enter to continue to Part 2...")
    try:
        input()
    except EOFError:
        # Running in non-interactive mode, skip input
        pass

    # Part 2: Practical crawling
    await demo_part2_practical_crawling()


if __name__ == "__main__":
    asyncio.run(main())

@@ -404,7 +404,174 @@ for result in results:
print(f"Duration: {dr.end_time - dr.start_time}")
```
-## 6. Summary
## 6. URL-Specific Configurations
When crawling diverse content types, you often need different configurations for different URLs. For example:
- PDFs need specialized extraction
- Blog pages benefit from content filtering
- Dynamic sites need JavaScript execution
- API endpoints need JSON parsing
### 6.1 Basic URL Pattern Matching
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, MatchMode
from crawl4ai.processors.pdf import PDFContentScrapingStrategy
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def crawl_mixed_content():
    # Configure different strategies for different content
    configs = [
        # PDF files - specialized extraction
        CrawlerRunConfig(
            url_matcher="*.pdf",
            scraping_strategy=PDFContentScrapingStrategy()
        ),
        # Blog/article pages - content filtering
        CrawlerRunConfig(
            url_matcher=["*/blog/*", "*/article/*"],
            markdown_generator=DefaultMarkdownGenerator(
                content_filter=PruningContentFilter(threshold=0.48)
            )
        ),
        # Dynamic pages - JavaScript execution
        CrawlerRunConfig(
            url_matcher=lambda url: 'github.com' in url,
            js_code="window.scrollTo(0, 500);"
        ),
        # API endpoints - JSON extraction
        CrawlerRunConfig(
            url_matcher=lambda url: 'api' in url or url.endswith('.json'),
            extraction_strategy=JsonCssExtractionStrategy({"data": "body"})
        ),
        # Default config for everything else
        CrawlerRunConfig()  # No url_matcher = fallback
    ]

    # Mixed URLs
    urls = [
        "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
        "https://blog.python.org/",
        "https://github.com/microsoft/playwright",
        "https://httpbin.org/json",
        "https://example.com/"
    ]

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=urls,
            config=configs  # Pass list of configs
        )

    for result in results:
        print(f"{result.url}: {len(result.markdown)} chars")
```
### 6.2 Advanced Pattern Matching
The `url_matcher` parameter supports three types of patterns:
#### Glob Patterns (Strings)
```python
# Simple patterns
"*.pdf" # Any PDF file
"*/api/*" # Any URL with /api/ in path
"https://*.example.com/*" # Subdomain matching
"*://example.com/blog/*" # Any protocol
```
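
A quick way to sanity-check glob strings like these without a crawler is Python's standard `fnmatch` module. This is only a sketch: it assumes the library's glob matching behaves like `fnmatch`, where `*` also matches `/`, and the sample URLs are made up for illustration.

```python
from fnmatch import fnmatchcase

# Assumption: url_matcher glob strings behave like fnmatch patterns.
checks = [
    ("*.pdf", "https://example.com/report.pdf"),
    ("*.pdf", "https://example.com/report.pdf?download=1"),
    ("*/api/*", "https://example.com/api/users"),
    ("https://*.example.com/*", "https://docs.example.com/guide"),
    ("https://*.example.com/*", "https://example.com/guide"),
    ("*://example.com/blog/*", "http://example.com/blog/post-1"),
]

# Map each (pattern, url) pair to whether the pattern matches the whole URL.
results = {(pattern, url): fnmatchcase(url, pattern) for pattern, url in checks}

for (pattern, url), matched in results.items():
    print(f"{pattern!r:28} {url:45} {matched}")
```

Under that assumption, two cases are worth noticing: `*.pdf` does not match a PDF URL that carries a query string, and `https://*.example.com/*` does not match the bare `example.com` domain.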
#### Custom Functions
```python
# Complex logic with lambdas
lambda url: url.startswith('https://') and 'secure' in url
lambda url: len(url) > 50 and url.count('/') > 5
lambda url: any(domain in url for domain in ['api.', 'data.', 'feed.'])
```
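
Substring checks such as `'api.' in url` also fire when the text appears in a path or query string. Where that matters, a function matcher can compare the parsed hostname instead. The sketch below uses only the standard library; `host_is` is a hypothetical helper name, not part of Crawl4AI.

```python
from urllib.parse import urlparse


def host_is(domain):
    """Build a matcher that is true when the URL's host is `domain` or one of its subdomains."""
    def matcher(url: str) -> bool:
        host = (urlparse(url).hostname or "").lower()
        return host == domain or host.endswith("." + domain)
    return matcher


# Hypothetical matcher; any callable taking a URL and returning a bool has the same shape.
is_github = host_is("github.com")

print(is_github("https://github.com/microsoft/playwright"))
print(is_github("https://gist.github.com/someone"))
print(is_github("https://example.com/?ref=github.com"))
```

The returned function has the same `url -> bool` shape as the lambdas above, so it should be usable wherever a function matcher is accepted.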
#### Mixed Lists with AND/OR Logic
```python
# Combine multiple conditions
CrawlerRunConfig(
    url_matcher=[
        "https://*",  # Must be HTTPS
        lambda url: 'internal' in url,  # Must contain 'internal'
        lambda url: not url.endswith('.pdf')  # Must not be PDF
    ],
    match_mode=MatchMode.AND  # ALL conditions must match
)
```
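
To make the AND/OR semantics concrete, here is a minimal standalone sketch of how a mixed list could be evaluated. It is not the library's implementation: `Mode`, `check_one`, and `matches` are hypothetical names, and string patterns are assumed to follow `fnmatch` rules.

```python
from enum import Enum
from fnmatch import fnmatchcase
from typing import Callable, List, Union


class Mode(Enum):
    """Hypothetical stand-in for MatchMode."""
    OR = "or"
    AND = "and"


Matcher = Union[str, Callable[[str], bool]]


def check_one(matcher: Matcher, url: str) -> bool:
    """Evaluate a single matcher: call functions, glob-match strings."""
    if callable(matcher):
        return bool(matcher(url))
    return fnmatchcase(url, matcher)


def matches(url: str, matcher: Union[Matcher, List[Matcher]], mode: Mode = Mode.OR) -> bool:
    """Combine a list of matchers with all() for AND, any() for OR."""
    if isinstance(matcher, list):
        combine = all if mode is Mode.AND else any
        return combine(check_one(m, url) for m in matcher)
    return check_one(matcher, url)


conditions = [
    "https://*",                           # must be HTTPS
    lambda url: "internal" in url,         # must contain 'internal'
    lambda url: not url.endswith(".pdf"),  # must not be a PDF
]

print(matches("https://internal.example.com/wiki", conditions, Mode.AND))
print(matches("https://internal.example.com/spec.pdf", conditions, Mode.AND))
print(matches("https://internal.example.com/spec.pdf", conditions, Mode.OR))
```

With AND, the PDF URL is rejected because one condition fails. With OR, the same URL is accepted because the HTTPS pattern alone is enough.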
### 6.3 Practical Example: News Site Crawler
```python
async def crawl_news_site():
    # `news_urls` is assumed to be a list of URL strings defined by the caller
    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=70.0,
        rate_limiter=RateLimiter(base_delay=(1.0, 2.0))
    )

    configs = [
        # Homepage - light extraction
        CrawlerRunConfig(
            url_matcher=lambda url: url.rstrip('/') == 'https://news.ycombinator.com',
            css_selector="nav, .headline",
            extraction_strategy=None
        ),
        # Article pages - full extraction
        CrawlerRunConfig(
            url_matcher="*/article/*",
            extraction_strategy=CosineStrategy(
                semantic_filter="article content",
                word_count_threshold=100
            ),
            screenshot=True,
            excluded_tags=["nav", "aside", "footer"]
        ),
        # Author pages - metadata focus
        CrawlerRunConfig(
            url_matcher="*/author/*",
            extraction_strategy=JsonCssExtractionStrategy({
                "name": "h1.author-name",
                "bio": ".author-bio",
                "articles": "article.post-card h2"
            })
        ),
        # Everything else
        CrawlerRunConfig()
    ]

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=news_urls,
            config=configs,
            dispatcher=dispatcher
        )
```
### 6.4 Best Practices
1. **Order Matters**: Configs are evaluated in order - put specific patterns before general ones
2. **Always Include a Default**: Last config should have no `url_matcher` as a fallback
3. **Test Your Patterns**: Use the config's `is_match()` method to test patterns:
   ```python
   config = CrawlerRunConfig(url_matcher="*/api/*")
   print(config.is_match("https://example.com/api/users"))  # True
   ```
4. **Optimize for Performance**:
   - Disable JS for static content
   - Skip screenshots for data APIs
   - Use appropriate extraction strategies
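
The first two practices can be illustrated with a small standalone sketch of first-match selection. It stands in for the real config list with `(label, pattern)` pairs, treats a `None` pattern as the fallback entry, and assumes `fnmatch`-style globbing. `first_match` is a hypothetical helper, not the dispatcher's `select_config()`.

```python
from fnmatch import fnmatchcase


def first_match(url, rules):
    """Return the label of the first rule whose pattern matches, else the fallback label."""
    for label, pattern in rules:
        if pattern is not None and fnmatchcase(url, pattern):
            return label
    # Nothing matched: use the entry without a pattern, if there is one.
    for label, pattern in rules:
        if pattern is None:
            return label
    return None


# Same rules, different order (labels and patterns are illustrative only).
general_first = [("site", "*example.com/*"), ("pdf", "*.pdf"), ("default", None)]
specific_first = [("pdf", "*.pdf"), ("site", "*example.com/*"), ("default", None)]

url = "https://example.com/files/report.pdf"
print(first_match(url, general_first))
print(first_match(url, specific_first))
print(first_match("https://other.org/", specific_first))
```

With the general pattern listed first, the PDF URL is claimed by the `site` rule and never reaches the `pdf` rule. Listing the specific rule first gives the intended result, and the unmatched URL falls through to the fallback entry.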
## 7. Summary
1. **Two Dispatcher Types**:

@@ -7,7 +7,7 @@
```python
async def arun_many(
    urls: Union[List[str], List[Any]],
-    config: Optional[CrawlerRunConfig] = None,
    config: Optional[Union[CrawlerRunConfig, List[CrawlerRunConfig]]] = None,
    dispatcher: Optional[BaseDispatcher] = None,
    ...
) -> Union[List[CrawlResult], AsyncGenerator[CrawlResult, None]]:
@@ -15,7 +15,9 @@ async def arun_many(
Crawl multiple URLs concurrently or in batches.
:param urls: A list of URLs (or tasks) to crawl.
-:param config: (Optional) A default `CrawlerRunConfig` applying to each crawl.
:param config: (Optional) Either:
    - A single `CrawlerRunConfig` applying to all URLs
    - A list of `CrawlerRunConfig` objects with url_matcher patterns
:param dispatcher: (Optional) A concurrency controller (e.g. MemoryAdaptiveDispatcher).
...
:return: Either a list of `CrawlResult` objects, or an async generator if streaming is enabled.
@@ -95,6 +97,65 @@ results = await crawler.arun_many(
)
```
### URL-Specific Configurations
Instead of using one config for all URLs, provide a list of configs with `url_matcher` patterns:
```python
from crawl4ai import CrawlerRunConfig, MatchMode
from crawl4ai.processors.pdf import PDFContentScrapingStrategy
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# PDF files - specialized extraction
pdf_config = CrawlerRunConfig(
    url_matcher="*.pdf",
    scraping_strategy=PDFContentScrapingStrategy()
)

# Blog/article pages - content filtering
blog_config = CrawlerRunConfig(
    url_matcher=["*/blog/*", "*/article/*", "*python.org*"],
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.48)
    )
)

# Dynamic pages - JavaScript execution
github_config = CrawlerRunConfig(
    url_matcher=lambda url: 'github.com' in url,
    js_code="window.scrollTo(0, 500);"
)

# API endpoints - JSON extraction
api_config = CrawlerRunConfig(
    url_matcher=lambda url: 'api' in url or url.endswith('.json'),
    extraction_strategy=JsonCssExtractionStrategy({"data": "body"})
)

# Default fallback config
default_config = CrawlerRunConfig()  # No url_matcher means it never matches except as fallback

# Pass the list of configs - first match wins!
results = await crawler.arun_many(
    urls=[
        "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",  # → pdf_config
        "https://blog.python.org/",  # → blog_config
        "https://github.com/microsoft/playwright",  # → github_config
        "https://httpbin.org/json",  # → api_config
        "https://example.com/"  # → default_config
    ],
    config=[pdf_config, blog_config, github_config, api_config, default_config]
)
```
**URL Matching Features**:
- **String patterns**: `"*.pdf"`, `"*/blog/*"`, `"*python.org*"`
- **Function matchers**: `lambda url: 'api' in url`
- **Mixed patterns**: Combine strings and functions with `MatchMode.OR` or `MatchMode.AND`
- **First match wins**: Configs are evaluated in order
**Key Points**:
- Each URL is processed by the same or separate sessions, depending on the dispatcher's strategy.
- `dispatch_result` in each `CrawlResult` (if using concurrency) can hold memory and timing info.

@@ -208,6 +208,64 @@ config = CrawlerRunConfig(
See [Virtual Scroll documentation](../../advanced/virtual-scroll.md) for detailed examples.
---
### I) **URL Matching Configuration**
| **Parameter** | **Type / Default** | **What It Does** |
|------------------------|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
| **`url_matcher`** | `UrlMatcher` (None) | Pattern(s) to match URLs against. Can be: string (glob), function, or list of mixed types |
| **`match_mode`** | `MatchMode` (MatchMode.OR) | How to combine multiple matchers in a list: `MatchMode.OR` (any match) or `MatchMode.AND` (all must match) |
The `url_matcher` parameter enables URL-specific configurations when used with `arun_many()`:
```python
from crawl4ai import CrawlerRunConfig, MatchMode
from crawl4ai.processors.pdf import PDFContentScrapingStrategy
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Simple string pattern (glob-style)
pdf_config = CrawlerRunConfig(
    url_matcher="*.pdf",
    scraping_strategy=PDFContentScrapingStrategy()
)

# Multiple patterns with OR logic (default)
blog_config = CrawlerRunConfig(
    url_matcher=["*/blog/*", "*/article/*", "*/news/*"],
    match_mode=MatchMode.OR  # Any pattern matches
)

# Function matcher
api_config = CrawlerRunConfig(
    url_matcher=lambda url: 'api' in url or url.endswith('.json'),
    extraction_strategy=JsonCssExtractionStrategy({"data": "body"})
)

# Mixed: String + Function with AND logic
complex_config = CrawlerRunConfig(
    url_matcher=[
        lambda url: url.startswith('https://'),  # Must be HTTPS
        "*.org/*",  # Must be .org domain
        lambda url: 'docs' in url  # Must contain 'docs'
    ],
    match_mode=MatchMode.AND  # ALL conditions must match
)

# Combined patterns and functions with AND logic
secure_docs = CrawlerRunConfig(
    url_matcher=["https://*", lambda url: '.doc' in url],
    match_mode=MatchMode.AND  # Must be HTTPS AND contain .doc
)
```
**UrlMatcher Types:**
- **String patterns**: Glob-style patterns like `"*.pdf"`, `"*/api/*"`, `"https://*.example.com/*"`
- **Functions**: `lambda url: bool` - Custom logic for complex matching
- **Lists**: Mix strings and functions, combined with `MatchMode.OR` or `MatchMode.AND`
When passing a list of configs to `arun_many()`, URLs are matched against each config's `url_matcher` in order. First match wins!
---
## 2.2 Helper Methods
Both `BrowserConfig` and `CrawlerRunConfig` provide a `clone()` method to create modified copies:

@@ -209,7 +209,13 @@ class CrawlerRunConfig:
    - The maximum number of concurrent crawl sessions.
    - Helps prevent overwhelming the system.
-14. **`display_mode`**:
14. **`url_matcher`** & **`match_mode`**:
    - Enable URL-specific configurations when used with `arun_many()`.
    - Set `url_matcher` to patterns (glob, function, or list) to match specific URLs.
    - Use `match_mode` (OR/AND) to control how multiple patterns combine.
    - See [URL-Specific Configurations](../api/arun_many.md#url-specific-configurations) for examples.
15. **`display_mode`**:
    - The display mode for progress information (`DETAILED`, `BRIEF`, etc.).
    - Affects how much information is printed during the crawl.