This update introduces a new feature in the URL seeding process that allows for the automatic filtering of utility URLs, such as robots.txt and sitemap.xml, which are not useful for content crawling. The class has been enhanced with a new parameter, , which is enabled by default. This change aims to improve the efficiency of the crawling process by reducing the number of irrelevant URLs processed.
Significant modifications include:
- Added parameter to in .
- Implemented logic in to check and filter out nonsense URLs during the seeding process in .
- Updated documentation to reflect the new filtering feature and provide examples of its usage in .
This change makes the URL seeder more efficient at identifying and excluding non-content URLs.
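As a rough illustration of the kind of check described above, the sketch below flags utility URLs by the last segment of their path. The names `UTILITY_FILENAMES`, `is_utility_url`, and `filter_utility_urls` are hypothetical and chosen for this example only; the seeder's actual filter may use different rules.

```python
from urllib.parse import urlparse

# Hypothetical list for illustration; only the two files named in this
# commit message are included.
UTILITY_FILENAMES = {"robots.txt", "sitemap.xml"}


def is_utility_url(url: str) -> bool:
    """Return True when the URL path ends in a known utility file name."""
    path = urlparse(url).path
    # Take the last path segment, ignoring a trailing slash.
    filename = path.rstrip("/").rsplit("/", 1)[-1].lower()
    return filename in UTILITY_FILENAMES


def filter_utility_urls(urls: list[str]) -> list[str]:
    """Keep only URLs that do not point at utility files."""
    return [u for u in urls if not is_utility_url(u)]
```

Matching on the final path segment rather than on a substring avoids dropping content pages whose URLs merely mention one of these names.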
BREAKING CHANGE: The now requires the parameter to be explicitly set if the default behavior is to be altered.
Related issues: #123
This commit introduces significant updates to the LinkedIn data discovery documentation by adding two new Jupyter notebooks that provide detailed insights into data discovery processes. The previous workshop notebook has been removed to streamline the content and avoid redundancy. Additionally, the URL seeder documentation has been expanded with a new tutorial and several enhancements to existing scripts, improving usability and clarity.
The changes include:
- Added and for comprehensive LinkedIn data discovery.
- Removed to eliminate outdated content.
- Updated to reflect new data visualization requirements.
- Introduced and to facilitate easier access to URL seeding techniques.
- Enhanced existing Python scripts and markdown files in the URL seeder section for better documentation and examples.
These changes aim to improve the overall documentation quality and user experience for developers working with LinkedIn data and URL seeding techniques.
This commit introduces AsyncUrlSeeder, a high-performance URL discovery system that enables intelligent crawling at scale by pre-discovering and filtering URLs before crawling.
## Core Features
### AsyncUrlSeeder Component
- Discovers URLs from multiple sources:
  - Sitemaps (including nested and gzipped)
  - Common Crawl index
  - Combined sources for maximum coverage
- Extracts page metadata without full crawling:
  - Title, description, keywords
  - Open Graph and Twitter Card tags
  - JSON-LD structured data
  - Language and charset information
- BM25 relevance scoring for intelligent filtering:
  - Query-based URL discovery
  - Configurable score thresholds
  - Automatic ranking by relevance
- Performance optimizations:
  - Async/concurrent processing with configurable workers
  - Rate limiting (hits per second)
  - Automatic caching with TTL
  - Streaming results for large datasets
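To make the BM25 scoring, threshold, and ranking bullets above concrete, here is a minimal pure-Python sketch of Okapi BM25. It is not the seeder's implementation (which lists rank_bm25 as an optional dependency); the function names, the whitespace tokenizer, and the assumption that the scored text is a page's extracted title or description are all illustrative.

```python
import math
from collections import Counter


def bm25_scores(query: str, documents: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs if n_docs else 0.0
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in tokenized for term in set(d))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Smoothed IDF; the "+ 1" inside the log keeps it non-negative.
            idf = math.log(
                (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0
            )
            tf = term_freq[term]
            # Length normalization: longer documents are penalized.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


def filter_by_score(items, scores, threshold):
    """Keep items scoring at or above the threshold, best first."""
    ranked = sorted(zip(items, scores), key=lambda pair: pair[1], reverse=True)
    return [(item, s) for item, s in ranked if s >= threshold]
```

With a query such as `"python async"`, documents that contain neither term score 0.0 and fall below any positive threshold, while the remaining documents come back ordered by relevance.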
### SeedingConfig
- Comprehensive configuration for URL seeding:
  - Source selection (sitemap, cc, or both)
  - URL pattern filtering with wildcards
  - Live URL validation options
  - Metadata extraction controls
  - BM25 scoring parameters
  - Concurrency and rate limiting
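Wildcard pattern filtering of the kind listed above can be sketched with the standard library's `fnmatch` module. Whether the seeder itself uses `fnmatch` is an assumption, and `match_pattern` is a hypothetical helper for this example.

```python
from fnmatch import fnmatchcase


def match_pattern(urls: list[str], pattern: str) -> list[str]:
    """Return the URLs matching a shell-style wildcard pattern.

    fnmatchcase is used so matching is case-sensitive on every platform.
    Note that "*" also matches "/" characters here.
    """
    return [u for u in urls if fnmatchcase(u, pattern)]


candidates = [
    "https://example.com/blog/intro",
    "https://example.com/docs/api",
    "https://example.com/blog/2024/update",
]
blog_urls = match_pattern(candidates, "*/blog/*")
```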
### Integration with AsyncWebCrawler
- Seamless pipeline: discover → filter → crawl
- Direct compatibility with arun_many()
- Significant resource savings by pre-filtering URLs
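The discover → filter → crawl flow can be outlined as follows. This is a shape-only sketch: `discover` and `crawl_many` are stand-ins written for this example, not the real AsyncUrlSeeder or `arun_many()` calls, and the result shape (dicts with `"url"` and `"status"` keys, with `"valid"` marking a live URL) is an assumption.

```python
import asyncio


async def discover(domain: str) -> list[dict]:
    """Stand-in for the seeder: returns one dict per discovered URL."""
    return [
        {"url": f"https://{domain}/blog/intro", "status": "valid"},
        {"url": f"https://{domain}/old-page", "status": "not_valid"},
    ]


async def crawl_many(urls: list[str]) -> list[str]:
    """Stand-in for a batch crawl that accepts plain URL strings."""
    return [f"crawled {u}" for u in urls]


async def pipeline(domain: str) -> list[str]:
    results = await discover(domain)
    # Filter before crawling so no crawl resources go to dead URLs.
    live = [r for r in results if r["status"] == "valid"]
    # Extract plain URL strings; the batch crawl takes strings, not dicts.
    urls = [r["url"] for r in live]
    return await crawl_many(urls)


if __name__ == "__main__":
    print(asyncio.run(pipeline("example.com")))
```

The explicit extraction step matters because the crawl call expects URL strings rather than the richer result records produced during discovery.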
## Documentation
- Comprehensive guide comparing URL seeding vs deep crawling
- Complete API reference with parameter tables
- Practical examples showing all features
- Performance benchmarks and best practices
- Integration patterns with AsyncWebCrawler
## Examples
- url_seeder_demo.py: Interactive Rich-based demo with:
  - Basic discovery
  - Cache management
  - Live validation
  - BM25 scoring
  - Multi-domain discovery
  - Complete pipeline integration
- url_seeder_quick_demo.py: Screenshot-friendly examples:
  - Pattern-based filtering
  - Metadata exploration
  - Smart search with BM25
## Testing
- Comprehensive test suite (test_async_url_seeder_bm25.py)
- Coverage of all major features
- Edge cases and error handling
- Performance and consistency tests
## Implementation Details
- Built on httpx with HTTP/2 support
- Optional dependencies: lxml, brotli, rank_bm25
- Cache management in ~/.crawl4ai/seeder_cache/
- Logger integration with AsyncLoggerBase
- Proper error handling and retry logic
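A TTL-based cache check like the one mentioned above might look roughly like this. The helper names and the use of file modification time as the "written at" timestamp are assumptions for illustration; the actual cache layout under ~/.crawl4ai/seeder_cache/ is not described in this commit.

```python
import time
from pathlib import Path
from typing import Optional


def is_entry_fresh(written_at: float, ttl_seconds: float,
                   now: Optional[float] = None) -> bool:
    """Return True while the entry is younger than its TTL."""
    current = time.time() if now is None else now
    return (current - written_at) < ttl_seconds


def is_cache_file_fresh(path: Path, ttl_seconds: float) -> bool:
    """Treat a missing file as stale; otherwise compare its mtime to the TTL."""
    if not path.exists():
        return False
    return is_entry_fresh(path.stat().st_mtime, ttl_seconds)
```

Passing `now` explicitly keeps the freshness rule easy to test without touching the filesystem or waiting for real time to pass.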
## Bug Fixes
- Fixed logger color compatibility (lightblack → bright_black)
- Corrected URL extraction from seeder results for arun_many()
- Updated all examples and documentation with proper usage
This feature enables users to crawl smarter, not harder, by discovering
and analyzing URLs before committing resources to crawling them.