feat: Add URL-specific crawler configurations for multi-URL crawling

Implement dynamic configuration selection based on URL patterns to optimize crawling for different content types. This feature enables users to apply different crawling strategies (PDF extraction, content filtering, JavaScript execution) based on URL matching patterns.

Key additions:
- Add url_matcher and match_mode parameters to CrawlerRunConfig
- Implement is_match() method supporting string patterns, functions, and mixed lists
- Add MatchMode enum for OR/AND logic when combining multiple matchers
- Update AsyncWebCrawler.arun_many() to accept List[CrawlerRunConfig]
- Add select_config() method to dispatchers for runtime config selection
- First matching config wins, with fallback to default

Pattern matching supports:
- Glob-style strings: *.pdf, */blog/*, *api*
- Lambda functions: lambda url: 'github.com' in url
- Mixed patterns with AND/OR logic for complex matching

This enables optimal per-URL configuration:
- PDFs: Use PDFContentScrapingStrategy without JavaScript
- Blogs: Apply content filtering to reduce noise
- APIs: Skip JavaScript, use JSON extraction
- Dynamic sites: Execute only necessary JavaScript

Breaking changes: None - fully backward compatible
2025-08-02 19:10:36 +08:00
parent 864d87afb2
commit a03e68fa2f
13 changed files with 1096 additions and 20 deletions
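
Reviewer note: the matching semantics described in the commit message (glob strings, function matchers, OR/AND combination, first match wins with a fallback) can be sketched in plain Python. This is not the code from this commit. `Mode`, `matches()`, and `pick_first()` are hypothetical stand-ins for `MatchMode`, `is_match()`, and `select_config()`, and the use of `fnmatch` for glob matching is an assumption about how string patterns might be evaluated.

```python
from enum import Enum
from fnmatch import fnmatchcase
from typing import Callable, List, Optional, Tuple, Union

# A matcher is a glob string, a predicate, or a list mixing both (assumed shape).
Single = Union[str, Callable[[str], bool]]
Matcher = Union[Single, List[Single], None]


class Mode(Enum):
    """Hypothetical stand-in for MatchMode."""
    OR = "or"
    AND = "and"


def matches(url: str, matcher: Matcher, mode: Mode = Mode.OR) -> bool:
    """Return True if url satisfies the matcher under the given mode."""
    if matcher is None:
        return False  # no matcher: never an explicit match
    items = matcher if isinstance(matcher, list) else [matcher]
    if not items:
        return False  # an empty list matches nothing
    results = (m(url) if callable(m) else fnmatchcase(url, m) for m in items)
    return all(results) if mode is Mode.AND else any(results)


def pick_first(
    url: str,
    rules: List[Tuple[str, Matcher, Mode]],
    default: Optional[str] = None,
) -> Optional[str]:
    """Return the name of the first rule that matches, else the default."""
    for name, matcher, mode in rules:
        if matches(url, matcher, mode):
            return name
    return default


rules = [
    ("pdf", "*.pdf", Mode.OR),
    ("blog", ["*/blog/*", "*python.org*"], Mode.OR),
    ("secure_docs", ["https://*", lambda u: "docs" in u], Mode.AND),
]
```

Because rules are evaluated in order, a URL that satisfies several rules resolves to the earliest one, so more specific rules should be listed first and the catch-all last.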
--- a/docs/md_v2/api/arun_many.md
+++ b/docs/md_v2/api/arun_many.md
@@ -7,7 +7,7 @@
 ```python
 async def arun_many(
    urls: Union[List[str], List[Any]],
-    config: Optional[CrawlerRunConfig] = None,
+    config: Optional[Union[CrawlerRunConfig, List[CrawlerRunConfig]]] = None,
    dispatcher: Optional[BaseDispatcher] = None,
    ...
 ) -> Union[List[CrawlResult], AsyncGenerator[CrawlResult, None]]:
@@ -15,7 +15,9 @@ async def arun_many(
    Crawl multiple URLs concurrently or in batches.

    :param urls: A list of URLs (or tasks) to crawl.
-    :param config: (Optional) A default `CrawlerRunConfig` applying to each crawl.
+    :param config: (Optional) Either:
+        - A single `CrawlerRunConfig` applying to all URLs
+        - A list of `CrawlerRunConfig` objects with url_matcher patterns
    :param dispatcher: (Optional) A concurrency controller (e.g. MemoryAdaptiveDispatcher).
    ...
    :return: Either a list of `CrawlResult` objects, or an async generator if streaming is enabled.
@@ -95,6 +97,65 @@ results = await crawler.arun_many(
 )
 ```

+### URL-Specific Configurations
+
+Instead of using one config for all URLs, provide a list of configs with `url_matcher` patterns:
+
+```python
+from crawl4ai import CrawlerRunConfig, MatchMode
+from crawl4ai.processors.pdf import PDFContentScrapingStrategy
+from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+from crawl4ai.content_filter_strategy import PruningContentFilter
+from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
+
+# PDF files - specialized extraction
+pdf_config = CrawlerRunConfig(
+    url_matcher="*.pdf",
+    scraping_strategy=PDFContentScrapingStrategy()
+)
+
+# Blog/article pages - content filtering
+blog_config = CrawlerRunConfig(
+    url_matcher=["*/blog/*", "*/article/*", "*python.org*"],
+    markdown_generator=DefaultMarkdownGenerator(
+        content_filter=PruningContentFilter(threshold=0.48)
+    )
+)
+
+# Dynamic pages - JavaScript execution
+github_config = CrawlerRunConfig(
+    url_matcher=lambda url: 'github.com' in url,
+    js_code="window.scrollTo(0, 500);"
+)
+
+# API endpoints - JSON extraction
+api_config = CrawlerRunConfig(
+    url_matcher=lambda url: 'api' in url or url.endswith(('.json', '/json')),
+    extraction_strategy=JsonCssExtractionStrategy({"data": "body"})
+)
+
+# Default fallback config
+default_config = CrawlerRunConfig()  # No url_matcher means it never matches except as fallback
+
+# Pass the list of configs - first match wins!
+results = await crawler.arun_many(
+    urls=[
+        "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",  # → pdf_config
+        "https://blog.python.org/",  # → blog_config
+        "https://github.com/microsoft/playwright",  # → github_config
+        "https://httpbin.org/json",  # → api_config
+        "https://example.com/"  # → default_config
+    ],
+    config=[pdf_config, blog_config, github_config, api_config, default_config]
+)
+```
+
+**URL Matching Features**:
+- **String patterns**: `"*.pdf"`, `"*/blog/*"`, `"*python.org*"`
+- **Function matchers**: `lambda url: 'api' in url`
+- **Mixed patterns**: Combine strings and functions with `MatchMode.OR` or `MatchMode.AND`
+- **First match wins**: Configs are evaluated in order
+
 **Key Points**:
 - Each URL is processed by the same or separate sessions, depending on the dispatcher’s strategy.
 - `dispatch_result` in each `CrawlResult` (if using concurrency) can hold memory and timing info.  
--- a/docs/md_v2/api/parameters.md
+++ b/docs/md_v2/api/parameters.md
@@ -208,6 +208,64 @@ config = CrawlerRunConfig(

 See [Virtual Scroll documentation](../../advanced/virtual-scroll.md) for detailed examples.

+---
+
+### I) **URL Matching Configuration**
+
+| **Parameter**          | **Type / Default**           | **What It Does**                                                                                                                    |
+|------------------------|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
+| **`url_matcher`**      | `UrlMatcher` (None)          | Pattern(s) to match URLs against. Can be: string (glob), function, or list of mixed types                                         |
+| **`match_mode`**       | `MatchMode` (MatchMode.OR)   | How to combine multiple matchers in a list: `MatchMode.OR` (any match) or `MatchMode.AND` (all must match)                       |
+
+The `url_matcher` parameter enables URL-specific configurations when used with `arun_many()`:
+
+```python
+from crawl4ai import CrawlerRunConfig, MatchMode
+from crawl4ai.processors.pdf import PDFContentScrapingStrategy
+from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+
+# Simple string pattern (glob-style)
+pdf_config = CrawlerRunConfig(
+    url_matcher="*.pdf",
+    scraping_strategy=PDFContentScrapingStrategy()
+)
+
+# Multiple patterns with OR logic (default)
+blog_config = CrawlerRunConfig(
+    url_matcher=["*/blog/*", "*/article/*", "*/news/*"],
+    match_mode=MatchMode.OR  # Any pattern matches
+)
+
+# Function matcher
+api_config = CrawlerRunConfig(
+    url_matcher=lambda url: 'api' in url or url.endswith('.json'),
+    extraction_strategy=JsonCssExtractionStrategy({"data": "body"})
+)
+
+# Mixed: String + Function with AND logic
+complex_config = CrawlerRunConfig(
+    url_matcher=[
+        lambda url: url.startswith('https://'),  # Must be HTTPS
+        "*.org/*",                               # Must be .org domain
+        lambda url: 'docs' in url                # Must contain 'docs'
+    ],
+    match_mode=MatchMode.AND  # ALL conditions must match
+)
+
+# Combined patterns and functions with AND logic
+secure_docs = CrawlerRunConfig(
+    url_matcher=["https://*", lambda url: '.doc' in url],
+    match_mode=MatchMode.AND  # Must be HTTPS AND contain .doc
+)
+```
+
+**UrlMatcher Types:**
+- **String patterns**: Glob-style patterns like `"*.pdf"`, `"*/api/*"`, `"https://*.example.com/*"`
+- **Functions**: `lambda url: bool` - Custom logic for complex matching
+- **Lists**: Mix strings and functions, combined with `MatchMode.OR` or `MatchMode.AND`
+
+When passing a list of configs to `arun_many()`, URLs are matched against each config's `url_matcher` in order. First match wins!
+
 ---## 2.2 Helper Methods

 Both `BrowserConfig` and `CrawlerRunConfig` provide a `clone()` method to create modified copies: