feat(deep-crawling): improve URL normalization and domain filtering
Enhance URL handling in deep crawling with:
- New URL normalization functions for consistent URL formats
- Improved domain filtering with subdomain support
- Added URLPatternFilter to public API
- Better URL deduplication in BFS strategy

These changes improve crawling accuracy and reduce duplicate visits.
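The diff on this page only shows the export changes, not the normalization or filtering code itself. As a rough illustration of what the commit message describes, here is a minimal sketch using only the standard library. The helper names `normalize_url`, `domain_matches`, and `dedupe` are hypothetical and are not taken from the commit; the actual implementation may apply different rules.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing the resource identified.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Hypothetical helper: reduce a URL to one consistent form.

    Lowercases scheme and host, drops the default port, the fragment,
    and a trailing slash. Userinfo and IPv6 literals are not handled.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is not the default for the scheme.
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def domain_matches(url: str, allowed: str) -> bool:
    """Hypothetical helper: True for the allowed domain or any subdomain of it."""
    host = (urlsplit(url).hostname or "").lower()
    allowed = allowed.lower()
    # The leading dot prevents "notexample.com" from matching "example.com".
    return host == allowed or host.endswith("." + allowed)


def dedupe(urls):
    """Hypothetical helper: keep the first occurrence of each normalized URL."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

In a breadth-first crawl, a `seen` set keyed on the normalized form like the one in `dedupe` is what keeps `https://example.com/a/` and `https://EXAMPLE.com/a#top` from being visited twice.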
@@ -48,8 +48,9 @@ from .deep_crawling import (
    DeepCrawlStrategy,
    BFSDeepCrawlStrategy,
    FilterChain,
    ContentTypeFilter,
    URLPatternFilter,
    DomainFilter,
    ContentTypeFilter,
    URLFilter,
    FilterStats,
    SEOFilter,
@@ -75,6 +76,7 @@ __all__ = [
    "BestFirstCrawlingStrategy",
    "DFSDeepCrawlStrategy",
    "FilterChain",
    "URLPatternFilter",
    "ContentTypeFilter",
    "DomainFilter",
    "FilterStats",