feat: Add URL-specific crawler configurations for multi-URL crawling

Implement dynamic configuration selection based on URL patterns to optimize crawling for different content types. This feature enables users to apply different crawling strategies (PDF extraction, content filtering, JavaScript execution) based on URL matching patterns.

Key additions:
- Add url_matcher and match_mode parameters to CrawlerRunConfig
- Implement is_match() method supporting string patterns, functions, and mixed lists
- Add MatchMode enum for OR/AND logic when combining multiple matchers
- Update AsyncWebCrawler.arun_many() to accept List[CrawlerRunConfig]
- Add select_config() method to dispatchers for runtime config selection
- First matching config wins, with fallback to default

Pattern matching supports:
- Glob-style strings: *.pdf, */blog/*, *api*
- Lambda functions: lambda url: 'github.com' in url
- Mixed patterns with AND/OR logic for complex matching

This enables optimal per-URL configuration:
- PDFs: Use PDFContentScrapingStrategy without JavaScript
- Blogs: Apply content filtering to reduce noise
- APIs: Skip JavaScript, use JSON extraction
- Dynamic sites: Execute only necessary JavaScript

Breaking changes: None - fully backward compatible
This commit is contained in:
ntohidi
2025-08-02 19:10:36 +08:00
parent 864d87afb2
commit a03e68fa2f
13 changed files with 1096 additions and 20 deletions

View File

@@ -7,7 +7,7 @@
```python
async def arun_many(
urls: Union[List[str], List[Any]],
config: Optional[CrawlerRunConfig] = None,
config: Optional[Union[CrawlerRunConfig, List[CrawlerRunConfig]]] = None,
dispatcher: Optional[BaseDispatcher] = None,
...
) -> Union[List[CrawlResult], AsyncGenerator[CrawlResult, None]]:
@@ -15,7 +15,9 @@ async def arun_many(
Crawl multiple URLs concurrently or in batches.
:param urls: A list of URLs (or tasks) to crawl.
:param config: (Optional) A default `CrawlerRunConfig` applying to each crawl.
:param config: (Optional) Either:
- A single `CrawlerRunConfig` applying to all URLs
- A list of `CrawlerRunConfig` objects with url_matcher patterns
:param dispatcher: (Optional) A concurrency controller (e.g. MemoryAdaptiveDispatcher).
...
:return: Either a list of `CrawlResult` objects, or an async generator if streaming is enabled.
@@ -95,6 +97,65 @@ results = await crawler.arun_many(
)
```
### URL-Specific Configurations
Instead of using one config for all URLs, provide a list of configs with `url_matcher` patterns:
```python
from crawl4ai import CrawlerRunConfig, MatchMode
from crawl4ai.processors.pdf import PDFContentScrapingStrategy
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
# PDF files - specialized extraction
pdf_config = CrawlerRunConfig(
url_matcher="*.pdf",
scraping_strategy=PDFContentScrapingStrategy()
)
# Blog/article pages - content filtering
blog_config = CrawlerRunConfig(
url_matcher=["*/blog/*", "*/article/*", "*python.org*"],
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(threshold=0.48)
)
)
# Dynamic pages - JavaScript execution
github_config = CrawlerRunConfig(
url_matcher=lambda url: 'github.com' in url,
js_code="window.scrollTo(0, 500);"
)
# API endpoints - JSON extraction
api_config = CrawlerRunConfig(
url_matcher=lambda url: 'api' in url or url.endswith('.json'),
extraction_strategy=JsonCssExtractionStrategy({"data": "body"})
)
# Default fallback config
default_config = CrawlerRunConfig() # No url_matcher means it never matches except as fallback
# Pass the list of configs - first match wins!
results = await crawler.arun_many(
urls=[
"https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf", # → pdf_config
"https://blog.python.org/", # → blog_config
"https://github.com/microsoft/playwright", # → github_config
"https://httpbin.org/json", # → api_config
"https://example.com/" # → default_config
],
config=[pdf_config, blog_config, github_config, api_config, default_config]
)
```
**URL Matching Features**:
- **String patterns**: `"*.pdf"`, `"*/blog/*"`, `"*python.org*"`
- **Function matchers**: `lambda url: 'api' in url`
- **Mixed patterns**: Combine strings and functions with `MatchMode.OR` or `MatchMode.AND`
- **First match wins**: Configs are evaluated in order
**Key Points**:
- Each URL is processed by the same or separate sessions, depending on the dispatchers strategy.
- `dispatch_result` in each `CrawlResult` (if using concurrency) can hold memory and timing info.