fix: Correct URL matcher fallback behavior and improve memory monitoring

Fix critical issue where unmatched URLs incorrectly used the first config instead of failing safely. Also clarify that configs without url_matcher match ALL URLs by design, and improve memory usage monitoring.

Bug fixes:
- Change select_config() to return None when no config matches instead of using first config
- Add proper error handling in dispatchers when no config matches a URL
- Return failed CrawlResult with "No matching configuration found" error message
- Fix is_match() to return True when url_matcher is None (matches all URLs)
- Import and use get_true_memory_usage_percent() for more accurate memory monitoring

Behavior clarification:
- CrawlerRunConfig with url_matcher=None matches ALL URLs (not nothing)
- This is the intended behavior for default/fallback configurations
- Enables clean pattern: specific configs first, default config last

Documentation updates:
- Clarify that configs without url_matcher match everything
- Explain "No matching configuration found" error when no default config
- Add examples showing proper default config usage
- Update all relevant docs: multi-url-crawling.md, arun_many.md, parameters.md
- Simplify API config examples by removing extraction_strategy

Demo and test updates:
- Update demo_multi_config_clean.py with commented default config to show behavior
- Change example URL to w3schools.com to demonstrate no-match scenario
- Uncomment all test URLs in test_multi_config.py for comprehensive testing

Breaking changes: None - this restores the intended behavior

This ensures URLs only get processed with appropriate configs, preventing issues like HTML pages being processed with PDF extraction strategies.
2025-08-03 16:50:54 +08:00
parent a03e68fa2f
commit 307fe28b32
9 changed files with 251 additions and 29 deletions
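
The selection rules described in the commit message (first match wins, `url_matcher=None` matches everything, no match yields `None`) can be illustrated with a small standalone model. This is a sketch only, not the library's implementation: `SketchConfig`, its `name` and `match_all` fields, and the use of `fnmatchcase` for glob matching are assumptions made for illustration. Only the names `select_config` and `is_match` come from the commit message, and their bodies here are stand-ins.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Callable, List, Optional, Union

# A matcher is either a glob string or a predicate on the URL.
Matcher = Union[str, Callable[[str], bool]]


@dataclass
class SketchConfig:
    """Hypothetical stand-in for CrawlerRunConfig (matching fields only)."""
    name: str
    url_matcher: Union[Matcher, List[Matcher], None] = None
    match_all: bool = False  # assumption: False models OR, True models AND

    def is_match(self, url: str) -> bool:
        # No matcher means "default config": it matches every URL.
        if self.url_matcher is None:
            return True
        matchers = (
            self.url_matcher
            if isinstance(self.url_matcher, list)
            else [self.url_matcher]
        )
        results = [
            m(url) if callable(m) else fnmatchcase(url, m) for m in matchers
        ]
        return all(results) if self.match_all else any(results)


def select_config(url: str, configs: List[SketchConfig]) -> Optional[SketchConfig]:
    """Return the first config that matches, or None when nothing matches."""
    for config in configs:
        if config.is_match(url):
            return config
    return None  # caller is expected to report a failed result


pdf = SketchConfig("pdf", "*.pdf")
blog = SketchConfig("blog", "*/blog/*")
default = SketchConfig("default")  # no url_matcher

# Without a default config, an unrelated URL selects nothing.
unmatched = select_config("https://www.w3schools.com/", [pdf, blog])
# With the default config placed last, the same URL falls through to it.
fallback = select_config("https://www.w3schools.com/", [pdf, blog, default])
```

Placing the default config last matters in this model: because the first match wins, a config without a matcher placed earlier in the list would shadow every config after it.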
--- a/docs/md_v2/api/parameters.md
+++ b/docs/md_v2/api/parameters.md
@@ -214,7 +214,7 @@ See [Virtual Scroll documentation](../../advanced/virtual-scroll.md) for detaile

 | **Parameter**          | **Type / Default**           | **What It Does**                                                                                                                    |
 |------------------------|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
-| **`url_matcher`**      | `UrlMatcher` (None)          | Pattern(s) to match URLs against. Can be: string (glob), function, or list of mixed types                                         |
+| **`url_matcher`**      | `UrlMatcher` (None)          | Pattern(s) to match URLs against. Can be: string (glob), function, or list of mixed types. **None means match ALL URLs**         |
 | **`match_mode`**       | `MatchMode` (MatchMode.OR)   | How to combine multiple matchers in a list: `MatchMode.OR` (any match) or `MatchMode.AND` (all must match)                       |

 The `url_matcher` parameter enables URL-specific configurations when used with `arun_many()`:
@@ -239,7 +239,7 @@ blog_config = CrawlerRunConfig(
 # Function matcher
 api_config = CrawlerRunConfig(
    url_matcher=lambda url: 'api' in url or url.endswith('.json'),
-    extraction_strategy=JsonCssExtractionStrategy({"data": "body"})
+    # Other settings like extraction_strategy
 )

 # Mixed: String + Function with AND logic
@@ -257,14 +257,21 @@ secure_docs = CrawlerRunConfig(
    url_matcher=["https://*", lambda url: '.doc' in url],
    match_mode=MatchMode.AND  # Must be HTTPS AND contain .doc
 )
+
+# Default config - matches ALL URLs
+default_config = CrawlerRunConfig()  # No url_matcher = matches everything
 ```

 **UrlMatcher Types:**
+- **None (default)**: When `url_matcher` is None or not set, the config matches ALL URLs
 - **String patterns**: Glob-style patterns like `"*.pdf"`, `"*/api/*"`, `"https://*.example.com/*"`
 - **Functions**: `lambda url: bool` - Custom logic for complex matching
 - **Lists**: Mix strings and functions, combined with `MatchMode.OR` or `MatchMode.AND`

-When passing a list of configs to `arun_many()`, URLs are matched against each config's `url_matcher` in order. First match wins!
+**Important Behavior:**
+- When passing a list of configs to `arun_many()`, URLs are matched against each config's `url_matcher` in order. First match wins!
+- If no config matches a URL and there's no default config (one without `url_matcher`), the URL will fail with "No matching configuration found"
+- Always include a default config as the last item if you want to handle all URLs

 ---
 ## 2.2 Helper Methods