- Created manifest.json for the Crawl4AI Assistant extension.
- Added popup HTML, CSS, and JS files for the extension interface.
- Included icons and favicon for the extension.
- Implemented functionality for schema capture and code generation.
- Updated index.md to reflect the availability of the new extension.
- Enhanced LLM Context Builder layout and styles for consistency.
- Adjusted global styles for better branding and responsiveness.
- Generate OneShot js code geenrator
- Introduced a new C4A-Script tutorial example for login flow using Blockly.
- Updated index.html to include Blockly theme and event editor modal for script editing.
- Created a test HTML file for testing Blockly integration.
- Added comprehensive C4A-Script API reference documentation covering commands, syntax, and examples.
- Developed core documentation for C4A-Script, detailing its features, commands, and real-world examples.
- Updated mkdocs.yml to include new C4A-Script documentation in navigation.
This commit introduces AsyncUrlSeeder, a high-performance URL discovery system that enables intelligent crawling at scale by pre-discovering and filtering URLs before crawling.
## Core Features
### AsyncUrlSeeder Component
- Discovers URLs from multiple sources:
- Sitemaps (including nested and gzipped)
- Common Crawl index
- Combined sources for maximum coverage
- Extracts page metadata without full crawling:
- Title, description, keywords
- Open Graph and Twitter Card tags
- JSON-LD structured data
- Language and charset information
- BM25 relevance scoring for intelligent filtering:
- Query-based URL discovery
- Configurable score thresholds
- Automatic ranking by relevance
- Performance optimizations:
- Async/concurrent processing with configurable workers
- Rate limiting (hits per second)
- Automatic caching with TTL
- Streaming results for large datasets
### SeedingConfig
- Comprehensive configuration for URL seeding:
- Source selection (sitemap, cc, or both)
- URL pattern filtering with wildcards
- Live URL validation options
- Metadata extraction controls
- BM25 scoring parameters
- Concurrency and rate limiting
### Integration with AsyncWebCrawler
- Seamless pipeline: discover → filter → crawl
- Direct compatibility with arun_many()
- Significant resource savings by pre-filtering URLs
## Documentation
- Comprehensive guide comparing URL seeding vs deep crawling
- Complete API reference with parameter tables
- Practical examples showing all features
- Performance benchmarks and best practices
- Integration patterns with AsyncWebCrawler
## Examples
- url_seeder_demo.py: Interactive Rich-based demo with:
- Basic discovery
- Cache management
- Live validation
- BM25 scoring
- Multi-domain discovery
- Complete pipeline integration
- url_seeder_quick_demo.py: Screenshot-friendly examples:
- Pattern-based filtering
- Metadata exploration
- Smart search with BM25
## Testing
- Comprehensive test suite (test_async_url_seeder_bm25.py)
- Coverage of all major features
- Edge cases and error handling
- Performance and consistency tests
## Implementation Details
- Built on httpx with HTTP/2 support
- Optional dependencies: lxml, brotli, rank_bm25
- Cache management in ~/.crawl4ai/seeder_cache/
- Logger integration with AsyncLoggerBase
- Proper error handling and retry logic
## Bug Fixes
- Fixed logger color compatibility (lightblack → bright_black)
- Corrected URL extraction from seeder results for arun_many()
- Updated all examples and documentation with proper usage
This feature enables users to crawl smarter, not harder, by discovering
and analyzing URLs before committing resources to crawling them.