feat(deep-crawling): improve URL normalization and domain filtering

Enhance URL handling in deep crawling:
- Add URL normalization functions for consistent URL formats
- Improve domain filtering with subdomain support
- Export URLPatternFilter in the public API
- Improve URL deduplication in the BFS strategy

These changes improve crawling accuracy and reduce duplicate visits.
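
To illustrate the idea behind the normalization, subdomain-aware filtering, and deduplication described above, here is a minimal standard-library sketch. It is not the code from this commit: the names normalize_url, is_allowed_domain, and dedupe are hypothetical, and the specific rules (drop fragments, drop default ports, strip trailing slashes) are assumptions about what "consistent URL formats" could mean.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed default ports; only these two schemes are handled in this sketch.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Hypothetical helper: canonicalize a URL so equivalent forms compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    # Treat "/path/" and "/path" as the same page, but keep the root "/".
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def is_allowed_domain(url, allowed):
    """Hypothetical helper: accept a host if it equals an allowed domain or is a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    # Matching on "." + domain avoids accepting "notexample.com" for "example.com".
    return any(host == d or host.endswith("." + d) for d in allowed)


def dedupe(urls, allowed):
    """Hypothetical helper: return normalized, allowed URLs in first-seen order."""
    seen = set()
    result = []
    for url in urls:
        key = normalize_url(url)
        if key in seen or not is_allowed_domain(key, allowed):
            continue
        seen.add(key)
        result.append(key)
    return result


if __name__ == "__main__":
    links = [
        "https://example.com/a/",
        "https://EXAMPLE.com/a#top",
        "https://blog.example.com/b",
        "https://other.org/c",
    ]
    print(dedupe(links, {"example.com"}))
```

A breadth-first crawler could apply a check like dedupe to each level's discovered links before enqueueing them, so that trivially different spellings of one URL are visited once.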

Author: UncleCode
Date: 2025-03-06 22:45:57 +08:00
Parent: 1b72880007
Commit: f78c46446b
6 changed files with 186 additions and 14 deletions

@@ -48,8 +48,9 @@ from .deep_crawling import (
DeepCrawlStrategy,
BFSDeepCrawlStrategy,
FilterChain,
ContentTypeFilter,
URLPatternFilter,
DomainFilter,
ContentTypeFilter,
URLFilter,
FilterStats,
SEOFilter,
@@ -75,6 +76,7 @@ __all__ = [
"BestFirstCrawlingStrategy",
"DFSDeepCrawlStrategy",
"FilterChain",
"URLPatternFilter",
"ContentTypeFilter",
"DomainFilter",
"FilterStats",