fix: Correct URL matcher fallback behavior and improve memory monitoring

Fix critical issue where unmatched URLs incorrectly used the first config instead of failing safely. Also clarify that configs without url_matcher match ALL URLs by design, and improve memory usage monitoring.

Bug fixes:
- Change select_config() to return None when no config matches instead of using first config
- Add proper error handling in dispatchers when no config matches a URL
- Return failed CrawlResult with "No matching configuration found" error message
- Fix is_match() to return True when url_matcher is None (matches all URLs)
- Import and use get_true_memory_usage_percent() for more accurate memory monitoring

Behavior clarification:
- CrawlerRunConfig with url_matcher=None matches ALL URLs (not nothing)
- This is the intended behavior for default/fallback configurations
- Enables clean pattern: specific configs first, default config last

Documentation updates:
- Clarify that configs without url_matcher match everything
- Explain "No matching configuration found" error when no default config
- Add examples showing proper default config usage
- Update all relevant docs: multi-url-crawling.md, arun_many.md, parameters.md
- Simplify API config examples by removing extraction_strategy

Demo and test updates:
- Update demo_multi_config_clean.py with commented default config to show behavior
- Change example URL to w3schools.com to demonstrate no-match scenario
- Uncomment all test URLs in test_multi_config.py for comprehensive testing

Breaking changes: None - this restores the intended behavior

This ensures URLs only get processed with appropriate configs, preventing issues like HTML pages being processed with PDF extraction strategies.
2025-08-03 16:50:54 +08:00
parent a03e68fa2f
commit 307fe28b32
9 changed files with 251 additions and 29 deletions
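
The selection behavior described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the library's code: `SketchConfig`, its `name` field, and `describe()` are hypothetical stand-ins, only string globs and callables are modeled, and glob matching via `fnmatch` is an assumption about how string patterns behave.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import Callable, List, Optional, Union

# Hypothetical matcher type: a glob string, a predicate, or None.
Matcher = Union[str, Callable[[str], bool], None]


@dataclass
class SketchConfig:
    """Hypothetical stand-in for CrawlerRunConfig; models only url_matcher."""
    name: str
    url_matcher: Matcher = None

    def is_match(self, url: str) -> bool:
        # No matcher means "match every URL" (default/fallback config).
        if self.url_matcher is None:
            return True
        if callable(self.url_matcher):
            return bool(self.url_matcher(url))
        # Assumption: string patterns are treated as shell-style globs.
        return fnmatch(url, self.url_matcher)


def select_config(url: str, configs: List[SketchConfig]) -> Optional[SketchConfig]:
    """Return the first config that matches, or None if nothing matches."""
    for config in configs:
        if config.is_match(url):
            return config
    return None  # Previously the first config was used here; now fail safely.


def describe(url: str, configs: List[SketchConfig]) -> str:
    """Hypothetical helper mirroring the dispatcher's error path."""
    config = select_config(url, configs)
    if config is None:
        return "No matching configuration found"
    return config.name


pdf = SketchConfig("pdf", "*.pdf")
api = SketchConfig("api", lambda url: "api" in url)
default = SketchConfig("default")  # no url_matcher

print(describe("https://example.com/doc.pdf", [pdf, api]))
print(describe("https://www.w3schools.com/", [pdf, api]))
print(describe("https://www.w3schools.com/", [pdf, api, default]))
```

Placing the matcher-less config last keeps the specific configs reachable; if it came first, it would match every URL before the others were tried.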
--- a/docs/md_v2/advanced/multi-url-crawling.md
+++ b/docs/md_v2/advanced/multi-url-crawling.md
@@ -447,11 +447,11 @@ async def crawl_mixed_content():
        # API endpoints - JSON extraction
        CrawlerRunConfig(
            url_matcher=lambda url: 'api' in url or url.endswith('.json'),
-            extraction_strategy=JsonCssExtractionStrategy({"data": "body"})
+            # Custom settings for JSON extraction
        ),
        
        # Default config for everything else
-        CrawlerRunConfig()  # No url_matcher = fallback
+        CrawlerRunConfig()  # No url_matcher means it matches ALL URLs (fallback)
    ]
    
    # Mixed URLs
@@ -475,6 +475,8 @@ async def crawl_mixed_content():

 ### 6.2 Advanced Pattern Matching

+**Important**: A `CrawlerRunConfig` without `url_matcher` (or with `url_matcher=None`) matches ALL URLs. This makes it perfect as a default/fallback configuration.
+
 The `url_matcher` parameter supports three types of patterns:

 #### Glob Patterns (Strings)
@@ -560,11 +562,17 @@ async def crawl_news_site():
 ### 6.4 Best Practices

 1. **Order Matters**: Configs are evaluated in order - put specific patterns before general ones
-2. **Always Include a Default**: Last config should have no `url_matcher` as a fallback
+2. **Default Config Behavior**: 
+   - A config without `url_matcher` matches ALL URLs
+   - Always include a default config as the last item if you want to handle all URLs
+   - Without a default config, unmatched URLs will fail with "No matching configuration found"
 3. **Test Your Patterns**: Use the config's `is_match()` method to test patterns:
   ```python
-   config = CrawlerRunConfig(url_matcher="*/api/*")
-   print(config.is_match("https://example.com/api/users"))  # True
+   config = CrawlerRunConfig(url_matcher="*.pdf")
+   print(config.is_match("https://example.com/doc.pdf"))  # True
+   
+   default_config = CrawlerRunConfig()  # No url_matcher
+   print(default_config.is_match("https://any-url.com"))  # True - matches everything!
   ```
 4. **Optimize for Performance**: 
   - Disable JS for static content