Commit Graph

16 Commits

Author SHA1 Message Date
ntohidi
0f210f6e02 Merge branch '2025-MAY-2' into next-MAY 2025-07-08 11:46:13 +02:00
UncleCode
3048cc1ff9 feat: Add AsyncUrlSeeder for intelligent URL discovery and filtering
This commit introduces AsyncUrlSeeder, a high-performance URL discovery system that enables intelligent crawling at scale by pre-discovering and filtering URLs before crawling.

## Core Features

### AsyncUrlSeeder Component
- Discovers URLs from multiple sources:
  - Sitemaps (including nested and gzipped)
  - Common Crawl index
  - Combined sources for maximum coverage
- Extracts page metadata without full crawling:
  - Title, description, keywords
  - Open Graph and Twitter Card tags
  - JSON-LD structured data
  - Language and charset information
- BM25 relevance scoring for intelligent filtering:
  - Query-based URL discovery
  - Configurable score thresholds
  - Automatic ranking by relevance
- Performance optimizations:
  - Async/concurrent processing with configurable workers
  - Rate limiting (hits per second)
  - Automatic caching with TTL
  - Streaming results for large datasets

### SeedingConfig
- Comprehensive configuration for URL seeding:
  - Source selection (sitemap, cc, or both)
  - URL pattern filtering with wildcards
  - Live URL validation options
  - Metadata extraction controls
  - BM25 scoring parameters
  - Concurrency and rate limiting

### Integration with AsyncWebCrawler
- Seamless pipeline: discover → filter → crawl
- Direct compatibility with arun_many()
- Significant resource savings by pre-filtering URLs

## Documentation
- Comprehensive guide comparing URL seeding vs deep crawling
- Complete API reference with parameter tables
- Practical examples showing all features
- Performance benchmarks and best practices
- Integration patterns with AsyncWebCrawler

## Examples
- url_seeder_demo.py: Interactive Rich-based demo with:
  - Basic discovery
  - Cache management
  - Live validation
  - BM25 scoring
  - Multi-domain discovery
  - Complete pipeline integration
- url_seeder_quick_demo.py: Screenshot-friendly examples:
  - Pattern-based filtering
  - Metadata exploration
  - Smart search with BM25

## Testing
- Comprehensive test suite (test_async_url_seeder_bm25.py)
- Coverage of all major features
- Edge cases and error handling
- Performance and consistency tests

## Implementation Details
- Built on httpx with HTTP/2 support
- Optional dependencies: lxml, brotli, rank_bm25
- Cache management in ~/.crawl4ai/seeder_cache/
- Logger integration with AsyncLoggerBase
- Proper error handling and retry logic

## Bug Fixes
- Fixed logger color compatibility (lightblack → bright_black)
- Corrected URL extraction from seeder results for arun_many()
- Updated all examples and documentation with proper usage

This feature enables users to crawl smarter, not harder, by discovering
and analyzing URLs before committing resources to crawling them.
2025-06-03 23:27:12 +08:00
ntohidi
33a0c7a17a fix(logger): add RED color to LogColor enum for enhanced logging options 2025-05-22 11:17:28 +02:00
UncleCode
0e5d672763 Merge branch 'pr-971' into merge-pr971 2025-05-01 18:57:28 +08:00
wakaka6
cd2b490b40 refactor(logger): Apply the Enumeration for color 2025-05-01 17:04:44 +08:00
unclecode
f3ebb38edf Merge PR #899 into next, resolve conflicts in server.py and docs/browser-crawler-config.md 2025-04-22 14:56:47 +08:00
UncleCode
a58c8000aa refactor(server): migrate to pool-based crawler management
Replace crawler_manager.py with simpler crawler_pool.py implementation:
- Add global page semaphore for hard concurrency cap
- Implement browser pool with idle cleanup
- Add playground UI for testing and stress testing
- Update API handlers to use pooled crawlers
- Enhance logging levels and symbols

BREAKING CHANGE: Removes CrawlerManager class in favor of simpler pool-based approach
2025-04-20 20:14:26 +08:00
Aravind Karnam
022f5c9e25 Merged next branch 2025-04-12 10:47:02 +05:30
wakaka6
b2f3cb0dfa WIP: logger migriate to rich 2025-04-11 00:44:43 +08:00
UncleCode
14894b4d70 feat(config): set DefaultMarkdownGenerator as the default markdown generator in CrawlerRunConfig
feat(logger): add color mapping for log message formatting options
2025-04-03 20:34:19 +08:00
Aravind Karnam
8b761f232b fix: improve logged url readability by decoding encoded urls 2025-03-21 13:40:23 +05:30
Aravind Karnam
eedda1ae5c fix: Truncate long urls in middle than end since users are confused that same url is being scraped several times. Also remove labels on status and timer to be replaced with symbols to save space and display more URL 2025-03-20 18:56:19 +05:30
UncleCode
c6d48080a4 feat(logger): add abstract logger base class and file logger implementation
Add AsyncLoggerBase abstract class to standardize logger interface and introduce AsyncFileLogger for file-only logging. Remove deprecated always_bypass_cache parameter and clean up AsyncWebCrawler initialization.

BREAKING CHANGE: Removed deprecated 'always_by_pass_cache' parameter. Use BrowserConfig cache settings instead.
2025-02-23 21:23:41 +08:00
UncleCode
8ec12d7d68 Apply Ruff Corrections 2025-01-13 19:19:58 +08:00
UncleCode
a11d9646e3 Enhance crawler features and improve documentation
- Added detailed CrawlerRunConfig parameters documentation.
  - Introduced plans for real-time event-driven crawling.
  - Updated async logger default level to DEBUG for better insights.
  - Improved structure and readability in configuration file.
  - Enhanced documentation on future capabilities in new blog entries.
2024-12-16 18:52:51 +08:00
UncleCode
852729ff38 feat(docker): add Docker Compose configurations for local and hub deployment; enhance GPU support checks in Dockerfile
feat(requirements): update requirements.txt to include snowballstemmer
fix(version_manager): correct version parsing to use __version__.__version__
feat(main): introduce chunking strategy and content filter in CrawlRequest model
feat(content_filter): enhance BM25 algorithm with priority tag scoring for improved content relevance
feat(logger): implement new async logger engine replacing print statements throughout library
fix(database): resolve version-related deadlock and circular lock issues in database operations
docs(docker): expand Docker deployment documentation with usage instructions for Docker Compose
2024-11-18 21:00:06 +08:00