fix: Correct URL matcher fallback behavior and improve memory monitoring

Fix a critical issue where unmatched URLs incorrectly fell back to the first config instead of failing safely. Also clarify that configs without a url_matcher match ALL URLs by design, and improve memory usage monitoring.

Bug fixes:
- Change select_config() to return None when no config matches instead of using first config
- Add proper error handling in dispatchers when no config matches a URL
- Return failed CrawlResult with "No matching configuration found" error message
- Fix is_match() to return True when url_matcher is None (matches all URLs)
- Import and use get_true_memory_usage_percent() for more accurate memory monitoring
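The fixed matching logic amounts to the following sketch. This is a standalone model of the behavior described above, not the actual crawl4ai code: `Config` and its fields are placeholders for `CrawlerRunConfig`, and the glob/callable matcher handling is an assumption for illustration.

```python
import fnmatch
from typing import Callable, Optional, Union

Matcher = Union[str, Callable[[str], bool]]

class Config:
    """Placeholder for CrawlerRunConfig; only the matching fields are modeled."""
    def __init__(self, url_matcher: Optional[Matcher] = None, name: str = "config"):
        self.url_matcher = url_matcher
        self.name = name

    def is_match(self, url: str) -> bool:
        # Fixed behavior: a config without url_matcher matches ALL URLs.
        if self.url_matcher is None:
            return True
        if callable(self.url_matcher):
            return self.url_matcher(url)
        return fnmatch.fnmatch(url, self.url_matcher)

def select_config(configs: list, url: str) -> Optional[Config]:
    # Fixed behavior: return the first matching config, or None
    # (previously, the first config was used even when nothing matched).
    for cfg in configs:
        if cfg.is_match(url):
            return cfg
    return None
```

The dispatcher then turns a `None` result into a failed CrawlResult with the "No matching configuration found" error instead of crawling with an arbitrary config.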

Behavior clarification:
- CrawlerRunConfig with url_matcher=None matches ALL URLs (not nothing)
- This is the intended behavior for default/fallback configurations
- Enables clean pattern: specific configs first, default config last
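The specific-first/default-last ordering can be sketched like this. It is a toy model, not the crawl4ai API: the pattern strings and labels are made up, and `None` stands in for `url_matcher=None`.

```python
import fnmatch

# Hypothetical (pattern, label) pairs standing in for CrawlerRunConfig objects.
configs = [
    ("*.pdf", "pdf-config"),     # specific configs first
    ("*docs*", "docs-config"),
    (None, "default-config"),    # url_matcher=None matches ALL URLs: keep it last
]

def select(url: str):
    """First match wins; None means no config matched the URL."""
    for pattern, label in configs:
        if pattern is None or fnmatch.fnmatch(url, pattern):
            return label
    # Without a trailing default config, unmatched URLs fail with
    # "No matching configuration found" instead of silently using config #1.
    return None
```

Because matching is first-match-wins, placing the catch-all config anywhere but last would shadow every config after it.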

Documentation updates:
- Clarify that configs without url_matcher match everything
- Explain "No matching configuration found" error when no default config
- Add examples showing proper default config usage
- Update all relevant docs: multi-url-crawling.md, arun_many.md, parameters.md
- Simplify API config examples by removing extraction_strategy

Demo and test updates:
- Update demo_multi_config_clean.py with commented default config to show behavior
- Change example URL to w3schools.com to demonstrate no-match scenario
- Uncomment all test URLs in test_multi_config.py for comprehensive testing

Breaking changes: None - this restores the intended behavior

This ensures URLs only get processed with appropriate configs, preventing
issues like HTML pages being processed with PDF extraction strategies.
Author: ntohidi
Date: 2025-08-03 16:50:54 +08:00
Parent: a03e68fa2f
Commit: 307fe28b32
9 changed files with 251 additions and 29 deletions


@@ -188,7 +188,6 @@ async def demo_part2_practical_crawling():
 lambda url: 'api' in url or 'httpbin.org' in url # Function for API endpoints
 ],
 match_mode=MatchMode.OR,
-extraction_strategy=JsonCssExtractionStrategy({"data": "body"})
 ),
 # Config 5: Complex matcher - Secure documentation sites
@@ -200,11 +199,11 @@ async def demo_part2_practical_crawling():
 lambda url: not url.endswith(('.pdf', '.json')) # Not PDF or JSON
 ],
 match_mode=MatchMode.AND,
-wait_for="css:.content, css:article" # Wait for content to load
+# wait_for="css:.content, css:article" # Wait for content to load
 ),
 # Default config for everything else
-CrawlerRunConfig() # No url_matcher means it never matches (except as fallback)
+# CrawlerRunConfig() # No url_matcher means it matches everything (use it as fallback)
 ]
 # URLs to crawl - each will use a different config
# URLs to crawl - each will use a different config
@@ -214,7 +213,7 @@ async def demo_part2_practical_crawling():
 "https://github.com/microsoft/playwright", # → JS config
 "https://httpbin.org/json", # → Mixed matcher config (API)
 "https://docs.python.org/3/reference/", # → Complex matcher config
-"https://example.com/", # → Default config
+"https://www.w3schools.com/", # → Default config if the default config above is uncommented; otherwise you will see `Error: No matching configuration found`
 ]
print("URLs to crawl:")