fix: Correct URL matcher fallback behavior and improve memory monitoring

Fix a critical issue where unmatched URLs incorrectly fell back to the first config instead of failing safely. Also clarify that configs without a url_matcher match ALL URLs by design, and improve memory usage monitoring.

Bug fixes:
- Change select_config() to return None when no config matches instead of using first config
- Add proper error handling in dispatchers when no config matches a URL
- Return failed CrawlResult with "No matching configuration found" error message
- Fix is_match() to return True when url_matcher is None (matches all URLs)
- Import and use get_true_memory_usage_percent() for more accurate memory monitoring
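The fixed matching logic amounts to the following sketch. This is a standalone model of the behavior described above, not the actual crawl4ai code: `Config` and its fields are placeholders for `CrawlerRunConfig`, and the glob/callable matcher handling is an assumption for illustration.

```python
import fnmatch
from typing import Callable, Optional, Union

Matcher = Union[str, Callable[[str], bool]]

class Config:
    """Placeholder for CrawlerRunConfig; only the matching fields are modeled."""
    def __init__(self, url_matcher: Optional[Matcher] = None, name: str = "config"):
        self.url_matcher = url_matcher
        self.name = name

    def is_match(self, url: str) -> bool:
        # Fixed behavior: a config without url_matcher matches ALL URLs.
        if self.url_matcher is None:
            return True
        if callable(self.url_matcher):
            return self.url_matcher(url)
        return fnmatch.fnmatch(url, self.url_matcher)

def select_config(configs: list, url: str) -> Optional[Config]:
    # Fixed behavior: return the first matching config, or None
    # (previously, the first config was used even when nothing matched).
    for cfg in configs:
        if cfg.is_match(url):
            return cfg
    return None
```

The dispatcher then turns a `None` result into a failed CrawlResult with the "No matching configuration found" error instead of crawling with an arbitrary config.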

Behavior clarification:
- CrawlerRunConfig with url_matcher=None matches ALL URLs (not nothing)
- This is the intended behavior for default/fallback configurations
- Enables clean pattern: specific configs first, default config last
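The specific-first/default-last ordering can be sketched like this. It is a toy model, not the crawl4ai API: the pattern strings and labels are made up, and `None` stands in for `url_matcher=None`.

```python
import fnmatch

# Hypothetical (pattern, label) pairs standing in for CrawlerRunConfig objects.
configs = [
    ("*.pdf", "pdf-config"),     # specific configs first
    ("*docs*", "docs-config"),
    (None, "default-config"),    # url_matcher=None matches ALL URLs: keep it last
]

def select(url: str):
    """First match wins; None means no config matched the URL."""
    for pattern, label in configs:
        if pattern is None or fnmatch.fnmatch(url, pattern):
            return label
    # Without a trailing default config, unmatched URLs fail with
    # "No matching configuration found" instead of silently using config #1.
    return None
```

Because matching is first-match-wins, placing the catch-all config anywhere but last would shadow every config after it.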

Documentation updates:
- Clarify that configs without url_matcher match everything
- Explain "No matching configuration found" error when no default config
- Add examples showing proper default config usage
- Update all relevant docs: multi-url-crawling.md, arun_many.md, parameters.md
- Simplify API config examples by removing extraction_strategy

Demo and test updates:
- Update demo_multi_config_clean.py with commented default config to show behavior
- Change example URL to w3schools.com to demonstrate no-match scenario
- Uncomment all test URLs in test_multi_config.py for comprehensive testing

Breaking changes: None - this restores the intended behavior

This ensures URLs only get processed with appropriate configs, preventing
issues like HTML pages being processed with PDF extraction strategies.
Author: ntohidi
Date: 2025-08-03 16:50:54 +08:00
Parent: a03e68fa2f
Commit: 307fe28b32
9 changed files with 251 additions and 29 deletions


@@ -188,7 +188,6 @@ async def demo_part2_practical_crawling():
 lambda url: 'api' in url or 'httpbin.org' in url # Function for API endpoints
 ],
 match_mode=MatchMode.OR,
-extraction_strategy=JsonCssExtractionStrategy({"data": "body"})
 ),
 # Config 5: Complex matcher - Secure documentation sites
@@ -200,11 +199,11 @@ async def demo_part2_practical_crawling():
 lambda url: not url.endswith(('.pdf', '.json')) # Not PDF or JSON
 ],
 match_mode=MatchMode.AND,
-wait_for="css:.content, css:article" # Wait for content to load
+# wait_for="css:.content, css:article" # Wait for content to load
 ),
 # Default config for everything else
-CrawlerRunConfig() # No url_matcher means it never matches (except as fallback)
+# CrawlerRunConfig() # No url_matcher means it matches everything (use it as fallback)
 ]
 # URLs to crawl - each will use a different config
# URLs to crawl - each will use a different config
@@ -214,7 +213,7 @@ async def demo_part2_practical_crawling():
 "https://github.com/microsoft/playwright", # → JS config
 "https://httpbin.org/json", # → Mixed matcher config (API)
 "https://docs.python.org/3/reference/", # → Complex matcher config
-"https://example.com/", # → Default config
+"https://www.w3schools.com/", # → Default config if the default config above is uncommented; otherwise you will see `Error: No matching configuration found`
 ]
print("URLs to crawl:")