fix: Correct URL matcher fallback behavior and improve memory monitoring

Fix critical issue where unmatched URLs incorrectly used the first config instead of failing safely. Also clarify that configs without url_matcher match ALL URLs by design, and improve memory usage monitoring.

Bug fixes:
- Change select_config() to return None when no config matches instead of using first config
- Add proper error handling in dispatchers when no config matches a URL
- Return failed CrawlResult with "No matching configuration found" error message
- Fix is_match() to return True when url_matcher is None (matches all URLs)
- Import and use get_true_memory_usage_percent() for more accurate memory monitoring

Behavior clarification:
- CrawlerRunConfig with url_matcher=None matches ALL URLs (not nothing)
- This is the intended behavior for default/fallback configurations
- Enables clean pattern: specific configs first, default config last

Documentation updates:
- Clarify that configs without url_matcher match everything
- Explain "No matching configuration found" error when no default config
- Add examples showing proper default config usage
- Update all relevant docs: multi-url-crawling.md, arun_many.md, parameters.md
- Simplify API config examples by removing extraction_strategy

Demo and test updates:
- Update demo_multi_config_clean.py with commented default config to show behavior
- Change example URL to w3schools.com to demonstrate no-match scenario
- Uncomment all test URLs in test_multi_config.py for comprehensive testing

Breaking changes: None - this restores the intended behavior

This ensures URLs only get processed with appropriate configs, preventing issues like HTML pages being processed with PDF extraction strategies.
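
The selection rules described above can be sketched in isolation. This is a minimal illustration, not the library's code: `SketchConfig`, `crawl_or_fail`, and the dict-shaped result are hypothetical stand-ins for CrawlerRunConfig, the dispatcher, and CrawlResult, and glob-style string matching via `fnmatchcase` is an assumption made only for this example.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Callable, List, Optional, Union

Matcher = Union[str, Callable[[str], bool]]


@dataclass
class SketchConfig:
    """Hypothetical stand-in for CrawlerRunConfig; only matching is modelled."""
    name: str
    url_matcher: Optional[Matcher] = None

    def is_match(self, url: str) -> bool:
        # No matcher means "match every URL", which makes this a default config.
        if self.url_matcher is None:
            return True
        if callable(self.url_matcher):
            return bool(self.url_matcher(url))
        # Assumption for this sketch: string matchers are glob patterns.
        return fnmatchcase(url, self.url_matcher)


def select_config(url: str, configs: List[SketchConfig]) -> Optional[SketchConfig]:
    """Return the first config that matches, or None when nothing matches."""
    for config in configs:
        if config.is_match(url):
            return config
    return None  # no silent fallback to configs[0]


def crawl_or_fail(url: str, configs: List[SketchConfig]) -> dict:
    """Hypothetical dispatcher step: fail safely when no config matches."""
    config = select_config(url, configs)
    if config is None:
        return {"url": url, "success": False,
                "error_message": "No matching configuration found"}
    return {"url": url, "success": True, "config": config.name}


# Specific configs first, default config (no url_matcher) last.
pdf_config = SketchConfig("pdf", "*.pdf")
blog_config = SketchConfig("blog", lambda u: "blog" in u)
default_config = SketchConfig("default")
```

With `[pdf_config, blog_config]` alone, a URL such as https://www.w3schools.com/ yields a failed result; appending `default_config` makes the same URL resolve to the default.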
2025-08-03 16:50:54 +08:00
parent a03e68fa2f
commit 307fe28b32
9 changed files with 251 additions and 29 deletions
--- a/tests/test_multi_config.py
+++ b/tests/test_multi_config.py
@@ -55,13 +55,13 @@ async def test_multi_config():
    
    # Test URLs - using real URLs that exist
    test_urls = [
-        # "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",  # Real PDF
-        # "https://www.bbc.com/news/articles/c5y3e3glnldo",  # News article
-        # "https://blog.python.org/",  # Blog URL  
-        # "https://api.github.com/users/github",  # GitHub API (returns JSON)
-        # "https://httpbin.org/json",  # API endpoint that returns JSON
-        # "https://www.python.org/",  # Generic HTTPS page
-        # "http://info.cern.ch/",  # HTTP (not HTTPS) page
+        "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",  # Real PDF
+        "https://www.bbc.com/news/articles/c5y3e3glnldo",  # News article
+        "https://blog.python.org/",  # Blog URL  
+        "https://api.github.com/users/github",  # GitHub API (returns JSON)
+        "https://httpbin.org/json",  # API endpoint that returns JSON
+        "https://www.python.org/",  # Generic HTTPS page
+        "http://info.cern.ch/",  # HTTP (not HTTPS) page
        "https://example.com/",  # → Default config
    ]