UncleCode
0d357ab7d2
feat(scraper): Enhance URL filtering and scoring systems

Implement comprehensive URL filtering and scoring capabilities:
Filters:
- Add URLPatternFilter with glob/regex support
- Implement ContentTypeFilter with MIME type checking
- Add DomainFilter for domain control
- Create FilterChain with stats tracking
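
To show how the filters listed above might fit together, here is a minimal sketch. The class names come from this commit message, but every method name (`apply`), constructor parameter (`patterns`, `allowed`, `blocked`), and the `FilterStats` container are assumptions made for illustration and may not match the committed code. ContentTypeFilter is omitted.

```python
import fnmatch
import re
from dataclasses import dataclass
from urllib.parse import urlparse


class URLPatternFilter:
    """Hypothetical sketch: accept a URL if any glob or compiled regex matches."""

    def __init__(self, patterns):
        # Strings are treated as globs (matched against the whole URL);
        # compiled regexes are searched anywhere in the URL.
        self._matchers = [
            p.search if isinstance(p, re.Pattern)
            else re.compile(fnmatch.translate(p)).match
            for p in patterns
        ]

    def apply(self, url: str) -> bool:
        return any(m(url) for m in self._matchers)


class DomainFilter:
    """Hypothetical sketch: allow/block URLs by exact hostname."""

    def __init__(self, allowed=None, blocked=None):
        self.allowed = set(allowed or [])
        self.blocked = set(blocked or [])

    def apply(self, url: str) -> bool:
        host = (urlparse(url).hostname or "").lower()
        if host in self.blocked:
            return False
        # An empty allow-list means "allow every non-blocked domain".
        return not self.allowed or host in self.allowed


@dataclass
class FilterStats:
    """Assumed stats container: counts of URLs seen, passed, and rejected."""
    total: int = 0
    passed: int = 0
    rejected: int = 0


class FilterChain:
    """Hypothetical sketch: a URL passes only if every filter accepts it."""

    def __init__(self, filters):
        self.filters = list(filters)
        self.stats = FilterStats()

    def apply(self, url: str) -> bool:
        self.stats.total += 1
        ok = all(f.apply(url) for f in self.filters)
        if ok:
            self.stats.passed += 1
        else:
            self.stats.rejected += 1
        return ok


chain = FilterChain([
    URLPatternFilter(["*/docs/*", re.compile(r"\.html$")]),
    DomainFilter(allowed=["example.com"]),
])
results = [chain.apply(u) for u in [
    "https://example.com/docs/intro",   # matches glob, allowed domain
    "https://example.com/blog/post",    # matches no pattern
    "https://other.org/docs/intro",     # matches glob, domain not allowed
]]
print(results, chain.stats)
```

Because `all()` short-circuits, a URL rejected by an early filter never reaches the later ones, so cheaper filters are best placed first in the chain.
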
Scorers:
- Complete KeywordRelevanceScorer implementation
- Add PathDepthScorer for URL structure scoring
- Implement ContentTypeScorer for file type priorities
- Add FreshnessScorer for date-based scoring
- Add DomainAuthorityScorer for domain weighting
- Create CompositeScorer for combined strategies
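
As a rough illustration of how the scorers above could combine, the sketch below implements three of them. Only the class names are taken from this commit message. The `score` method, the `weight` and `optimal_depth` parameters, and both scoring formulas are assumptions, not the committed implementation.

```python
from urllib.parse import urlparse


class KeywordRelevanceScorer:
    """Hypothetical sketch: fraction of keywords found in the URL, times a weight."""

    def __init__(self, keywords, weight=1.0):
        self.keywords = [k.lower() for k in keywords]
        self.weight = weight

    def score(self, url: str) -> float:
        if not self.keywords:
            return 0.0
        lowered = url.lower()
        hits = sum(1 for k in self.keywords if k in lowered)
        return self.weight * hits / len(self.keywords)


class PathDepthScorer:
    """Hypothetical sketch: score decays as path depth moves away from an optimum."""

    def __init__(self, optimal_depth=2, weight=1.0):
        self.optimal_depth = optimal_depth
        self.weight = weight

    def score(self, url: str) -> float:
        # Depth is the number of non-empty path segments.
        segments = [s for s in urlparse(url).path.split("/") if s]
        distance = abs(len(segments) - self.optimal_depth)
        return self.weight / (1 + distance)


class CompositeScorer:
    """Hypothetical sketch: sum the scores of all child scorers."""

    def __init__(self, scorers):
        self.scorers = list(scorers)

    def score(self, url: str) -> float:
        return sum(s.score(url) for s in self.scorers)


composite = CompositeScorer([
    KeywordRelevanceScorer(["python", "tutorial"]),
    PathDepthScorer(optimal_depth=2),
])
on_topic = composite.score("https://example.com/python/tutorial")
off_topic = composite.score("https://example.com/a/b/c/d")
print(on_topic, off_topic)
```

A crawler could use such a composite score to order its URL frontier, visiting higher-scoring URLs first. Summation is only one possible combination strategy. A weighted average or a maximum would work equally well in this sketch.
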
Features:
- Add statistics tracking for both filters and scorers
- Implement logging support throughout
- Add resource cleanup methods
- Create comprehensive documentation
- Include performance optimizations
Tests and docs included.
Note: Review URL normalization overlap with recent crawler changes.
2024-11-08 19:02:28 +08:00