fix: Correct URL matcher fallback behavior and improve memory monitoring
Fix critical issue where unmatched URLs incorrectly used the first config instead of failing safely. Also clarify that configs without url_matcher match ALL URLs by design, and improve memory usage monitoring.

Bug fixes:
- Change select_config() to return None when no config matches instead of using first config
- Add proper error handling in dispatchers when no config matches a URL
- Return failed CrawlResult with "No matching configuration found" error message
- Fix is_match() to return True when url_matcher is None (matches all URLs)
- Import and use get_true_memory_usage_percent() for more accurate memory monitoring

Behavior clarification:
- CrawlerRunConfig with url_matcher=None matches ALL URLs (not nothing)
- This is the intended behavior for default/fallback configurations
- Enables clean pattern: specific configs first, default config last

Documentation updates:
- Clarify that configs without url_matcher match everything
- Explain "No matching configuration found" error when no default config
- Add examples showing proper default config usage
- Update all relevant docs: multi-url-crawling.md, arun_many.md, parameters.md
- Simplify API config examples by removing extraction_strategy

Demo and test updates:
- Update demo_multi_config_clean.py with commented default config to show behavior
- Change example URL to w3schools.com to demonstrate no-match scenario
- Uncomment all test URLs in test_multi_config.py for comprehensive testing

Breaking changes: None - this restores the intended behavior

This ensures URLs only get processed with appropriate configs, preventing issues like HTML pages being processed with PDF extraction strategies.
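The matching rules described above can be sketched in isolation. This is a minimal illustration, not the library's implementation: SketchConfig, SketchResult, and dispatch are hypothetical stand-ins, and the matcher is assumed to be either a glob string or a predicate. Only the names select_config, is_match, url_matcher, and the error text come from this commit message.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Callable, List, Optional, Union

# Assumption: a matcher is a glob string or a predicate on the URL.
Matcher = Union[str, Callable[[str], bool]]


@dataclass
class SketchConfig:
    """Hypothetical stand-in for CrawlerRunConfig; models matching only."""
    name: str
    url_matcher: Optional[Matcher] = None

    def is_match(self, url: str) -> bool:
        # No matcher means "match every URL", so this acts as a default config.
        if self.url_matcher is None:
            return True
        if callable(self.url_matcher):
            return bool(self.url_matcher(url))
        return fnmatchcase(url, self.url_matcher)


def select_config(url: str, configs: List[SketchConfig]) -> Optional[SketchConfig]:
    """Return the first matching config, or None (no fallback to configs[0])."""
    for config in configs:
        if config.is_match(url):
            return config
    return None


@dataclass
class SketchResult:
    """Hypothetical stand-in for a crawl result."""
    url: str
    success: bool
    error_message: str = ""
    config_name: str = ""


def dispatch(url: str, configs: List[SketchConfig]) -> SketchResult:
    """Fail safely when nothing matches instead of crawling with the wrong config."""
    config = select_config(url, configs)
    if config is None:
        return SketchResult(url, False, "No matching configuration found")
    return SketchResult(url, True, config_name=config.name)


# Specific configs first, default config (url_matcher=None) last.
pdf_config = SketchConfig("pdf", url_matcher="*.pdf")
blog_config = SketchConfig("blog", url_matcher=lambda u: "blog." in u)
default_config = SketchConfig("default")
```

With the default config at the end of the list, every URL resolves to some config. Without it, an unmatched URL such as https://example.com/ yields a failed result carrying the error message rather than being processed with the PDF config.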
@@ -55,13 +55,13 @@ async def test_multi_config():
     # Test URLs - using real URLs that exist
     test_urls = [
-        # "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",  # Real PDF
-        # "https://www.bbc.com/news/articles/c5y3e3glnldo",  # News article
-        # "https://blog.python.org/",  # Blog URL
-        # "https://api.github.com/users/github",  # GitHub API (returns JSON)
-        # "https://httpbin.org/json",  # API endpoint that returns JSON
-        # "https://www.python.org/",  # Generic HTTPS page
-        # "http://info.cern.ch/",  # HTTP (not HTTPS) page
+        "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",  # Real PDF
+        "https://www.bbc.com/news/articles/c5y3e3glnldo",  # News article
+        "https://blog.python.org/",  # Blog URL
+        "https://api.github.com/users/github",  # GitHub API (returns JSON)
+        "https://httpbin.org/json",  # API endpoint that returns JSON
+        "https://www.python.org/",  # Generic HTTPS page
+        "http://info.cern.ch/",  # HTTP (not HTTPS) page
         "https://example.com/",  # → Default config
     ]