feat(async_url_seeder): add smart URL filtering to exclude nonsense URLs

This update introduces a new feature in the URL seeding process that automatically filters out utility URLs, such as robots.txt and sitemap.xml, which are not useful for content crawling. The `SeedingConfig` class has been enhanced with a new parameter, `filter_nonsense_urls`, which is enabled by default. This change improves crawling efficiency by reducing the number of irrelevant URLs processed.

Significant modifications include:
- Added `filter_nonsense_urls` parameter to `SeedingConfig`.
- Implemented logic to check and filter out nonsense URLs during the seeding process.
- Updated documentation to reflect the new filtering feature and provide usage examples.

This change enhances the overall functionality of the URL seeder, making it smarter and more efficient in identifying and excluding non-content URLs.

BREAKING CHANGE: `SeedingConfig` now filters utility URLs by default; `filter_nonsense_urls` must be explicitly set to `False` to restore the previous behavior of including all URLs.

Related issues: #123
UncleCode
2025-06-05 15:46:24 +08:00
parent c6fc5c0518
commit 82a25c037a
3 changed files with 144 additions and 4 deletions


@@ -253,6 +253,7 @@ The `SeedingConfig` object is your control panel. Here's everything you can conf
| `query` | str | None | Search query for BM25 scoring |
| `scoring_method` | str | None | Scoring method (currently "bm25") |
| `score_threshold` | float | None | Minimum score to include URL |
| `filter_nonsense_urls` | bool | True | Filter out utility URLs (robots.txt, etc.) |
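The scoring options above can be combined in a single config; the values below are illustrative, not recommendations:

```python
# Combining scoring options from the table above (values are illustrative).
config = SeedingConfig(
    source="sitemap",
    query="python tutorials",   # terms used for BM25 scoring
    scoring_method="bm25",      # currently the only scoring method
    score_threshold=0.3,        # drop URLs scoring below this
    filter_nonsense_urls=True,  # default; filters robots.txt etc.
)
```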
#### Pattern Matching Examples
@@ -1078,12 +1079,43 @@ URL seeding transforms web crawling from a blind expedition into a surgical stri
Whether you're building a research tool, monitoring competitors, or creating a content aggregator, URL seeding gives you the intelligence to crawl smarter, not harder.
### Smart URL Filtering
The seeder automatically filters out nonsense URLs that aren't useful for content crawling:
```python
# Enabled by default
config = SeedingConfig(
source="sitemap",
filter_nonsense_urls=True # Default: True
)
# URLs that get filtered:
# - robots.txt, sitemap.xml, ads.txt
# - API endpoints (/api/, /v1/, .json)
# - Media files (.jpg, .mp4, .pdf)
# - Archives (.zip, .tar.gz)
# - Source code (.js, .css)
# - Admin/login pages
# - And many more...
```
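As a rough illustration of the kind of rules such a filter applies (a hypothetical sketch, not the library's actual implementation — the real rule set covers many more cases):

```python
from urllib.parse import urlparse

# Hypothetical rule tables mirroring the categories listed above.
UTILITY_FILES = {"robots.txt", "sitemap.xml", "ads.txt", "favicon.ico"}
SKIP_EXTENSIONS = (".jpg", ".mp4", ".pdf", ".zip", ".tar.gz", ".js", ".css", ".json")
SKIP_PATH_PARTS = ("/api/", "/v1/", "/admin/", "/login")

def is_nonsense_url(url: str) -> bool:
    """Return True if the URL looks like a utility/non-content URL."""
    path = urlparse(url).path.lower()
    filename = path.rsplit("/", 1)[-1]
    if filename in UTILITY_FILES:
        return True
    if path.endswith(SKIP_EXTENSIONS):
        return True
    if any(part in path for part in SKIP_PATH_PARTS):
        return True
    return False
```

Content pages pass through untouched, while utility URLs are dropped before any crawling happens.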
To disable filtering (not recommended):
```python
config = SeedingConfig(
source="sitemap",
filter_nonsense_urls=False # Include ALL URLs
)
```
### Key Features Summary
1. **Parallel Sitemap Index Processing**: Automatically detects and processes sitemap indexes in parallel
2. **Memory Protection**: Bounded queues prevent RAM issues with large domains (1M+ URLs)
3. **Context Manager Support**: Automatic cleanup with `async with` statement
4. **URL-Based Scoring**: Smart filtering even without head extraction
5. **Smart URL Filtering**: Automatically excludes utility/nonsense URLs
6. **Dual Caching**: Separate caches for URL lists and metadata
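The URL-based scoring in point 4 can be pictured with a toy scorer that checks how many query terms appear in the URL path; the library's BM25 scorer is more sophisticated, and this sketch only conveys the idea:

```python
import re
from urllib.parse import urlparse

def url_score(url: str, query: str) -> float:
    """Score a URL by the fraction of query terms found in its path segments."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    # Split the path on common separators to get word-like segments.
    segments = re.split(r"[/\-_.]", urlparse(url).path.lower())
    hits = sum(1 for t in terms if t in segments)
    return hits / len(terms)
```

A URL like `/python-tutorial/intro` scores 1.0 against the query "python tutorial", while `/about` scores 0.0 — no page fetch or head extraction required.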
Now go forth and seed intelligently! 🌱🚀