feat: Add AsyncUrlSeeder for intelligent URL discovery and filtering

This commit introduces AsyncUrlSeeder, a high-performance URL discovery system that enables intelligent crawling at scale by pre-discovering and filtering URLs before crawling.

## Core Features

### AsyncUrlSeeder Component
- Discovers URLs from multiple sources:
  - Sitemaps (including nested and gzipped)
  - Common Crawl index
  - Combined sources for maximum coverage
- Extracts page metadata without full crawling:
  - Title, description, keywords
  - Open Graph and Twitter Card tags
  - JSON-LD structured data
  - Language and charset information
- BM25 relevance scoring for intelligent filtering:
  - Query-based URL discovery
  - Configurable score thresholds
  - Automatic ranking by relevance
- Performance optimizations:
  - Async/concurrent processing with configurable workers
  - Rate limiting (hits per second)
  - Automatic caching with TTL
  - Streaming results for large datasets

### SeedingConfig
- Comprehensive configuration for URL seeding:
  - Source selection (sitemap, cc, or both)
  - URL pattern filtering with wildcards
  - Live URL validation options
  - Metadata extraction controls
  - BM25 scoring parameters
  - Concurrency and rate limiting

### Integration with AsyncWebCrawler
- Seamless pipeline: discover → filter → crawl
- Direct compatibility with arun_many()
- Significant resource savings by pre-filtering URLs

## Documentation
- Comprehensive guide comparing URL seeding vs deep crawling
- Complete API reference with parameter tables
- Practical examples showing all features
- Performance benchmarks and best practices
- Integration patterns with AsyncWebCrawler

## Examples
- url_seeder_demo.py: Interactive Rich-based demo with:
  - Basic discovery
  - Cache management
  - Live validation
  - BM25 scoring
  - Multi-domain discovery
  - Complete pipeline integration
- url_seeder_quick_demo.py: Screenshot-friendly examples:
  - Pattern-based filtering
  - Metadata exploration
  - Smart search with BM25

## Testing
- Comprehensive test suite (test_async_url_seeder_bm25.py)
- Coverage of all major features
- Edge cases and error handling
- Performance and consistency tests

## Implementation Details
- Built on httpx with HTTP/2 support
- Optional dependencies: lxml, brotli, rank_bm25
- Cache management in ~/.crawl4ai/seeder_cache/
- Logger integration with AsyncLoggerBase
- Proper error handling and retry logic

## Bug Fixes
- Fixed logger color compatibility (lightblack → bright_black)
- Corrected URL extraction from seeder results for arun_many()
- Updated all examples and documentation with proper usage

This feature enables users to crawl smarter, not harder, by discovering and analyzing URLs before committing resources to crawling them.
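
To illustrate the BM25 filtering step described above (query-based scoring, a configurable threshold, ranking by relevance), here is a minimal standalone sketch. It is not the AsyncUrlSeeder implementation, which this message says relies on the optional rank_bm25 package. The names `tokenize`, `bm25_scores`, and `filter_by_relevance` are hypothetical, the input is assumed to be a mapping from URL to already-extracted head text, and the IDF uses the non-negative `log(1 + ...)` variant, so absolute scores may differ from other BM25 implementations.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with an Okapi BM25 variant."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    # Average document length; fall back to 1.0 if every document is empty.
    avg_len = sum(len(d) for d in tokenized) / n_docs or 1.0

    # Document frequency: in how many documents each term appears.
    df: Counter = Counter()
    for doc in tokenized:
        df.update(set(doc))

    query_terms = set(tokenize(query))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Non-negative IDF variant (assumption, see lead-in).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def filter_by_relevance(
    pages: Dict[str, str], query: str, threshold: float
) -> List[Tuple[str, float]]:
    """Return (url, score) pairs at or above the threshold, best first."""
    urls = list(pages)
    scores = bm25_scores(query, [pages[u] for u in urls])
    ranked = sorted(zip(urls, scores), key=lambda pair: pair[1], reverse=True)
    return [(u, s) for u, s in ranked if s >= threshold]
```

A caller would pass the metadata text per URL (for example, title plus description) and keep only the URLs returned, which is the "filter" stage of discover → filter → crawl.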
2025-06-03 23:27:12 +08:00
parent 3b766e1aac
commit 3048cc1ff9
12 changed files with 3209 additions and 23 deletions
--- a/crawl4ai/async_webcrawler.py
+++ b/crawl4ai/async_webcrawler.py
@@ -35,9 +35,10 @@ from .markdown_generation_strategy import (
 )
 from .deep_crawling import DeepCrawlDecorator
 from .async_logger import AsyncLogger, AsyncLoggerBase
-from .async_configs import BrowserConfig, CrawlerRunConfig, ProxyConfig
+from .async_configs import BrowserConfig, CrawlerRunConfig, ProxyConfig, SeedingConfig
 from .async_dispatcher import *  # noqa: F403
 from .async_dispatcher import BaseDispatcher, MemoryAdaptiveDispatcher, RateLimiter
+from .async_url_seeder import AsyncUrlSeeder

 from .utils import (
    sanitize_input_encode,
@@ -163,6 +164,8 @@ class AsyncWebCrawler:
        # Decorate arun method with deep crawling capabilities
        self._deep_handler = DeepCrawlDecorator(self)
        self.arun = self._deep_handler(self.arun)
+
+        self.url_seeder: Optional[AsyncUrlSeeder] = None

    async def start(self):
        """
@@ -744,3 +747,94 @@ class AsyncWebCrawler:
        else:
            _results = await dispatcher.run_urls(crawler=self, urls=urls, config=config)
            return [transform_result(res) for res in _results]
+
+    async def aseed_urls(
+        self,
+        domain_or_domains: Union[str, List[str]],
+        config: Optional[SeedingConfig] = None,
+        **kwargs
+    ) -> Union[List[str], List[Dict[str, Any]], Dict[str, List[Union[str, Dict[str, Any]]]]]:
+        """
+        Discovers, filters, and optionally validates URLs for one or more domains
+        using sitemaps and Common Crawl archives.
+
+        Args:
+            domain_or_domains: A single domain string (e.g., "iana.org") or a list of domains.
+            config: A SeedingConfig object to control the seeding process.
+                    Parameters passed directly via kwargs will override those in 'config'.
+            **kwargs: Additional parameters (e.g., `source`, `live_check`, `extract_head`,
+                      `pattern`, `concurrency`, `hits_per_sec`, `force_refresh`, `verbose`)
+                      that will be used to construct or update the SeedingConfig.
+
+        Returns:
+            If `extract_head` is False:
+                - For a single domain: `List[str]` of discovered URLs.
+                - For multiple domains: `Dict[str, List[str]]` mapping each domain to its URLs.
+            If `extract_head` is True:
+                - For a single domain: `List[Dict[str, Any]]` where each dict contains 'url'
+                  and 'head_data' (parsed <head> metadata).
+                - For multiple domains: `Dict[str, List[Dict[str, Any]]]` mapping each domain
+                  to a list of URL data dictionaries.
+
+        Raises:
+            ValueError: If `domain_or_domains` is not a string or a list of strings.
+            Exception: Any underlying exceptions from AsyncUrlSeeder or network operations.
+
+        Example:
+            >>> # Discover URLs from sitemap with live check for 'example.com'
+            >>> result = await crawler.aseed_urls("example.com", source="sitemap", live_check=True, hits_per_sec=10)
+
+            >>> # Discover URLs from Common Crawl, extract head data for 'example.com' and 'python.org'
+            >>> multi_domain_result = await crawler.aseed_urls(
+            ...     ["example.com", "python.org"],
+            ...     source="cc", extract_head=True, concurrency=200, hits_per_sec=50
+            ... )
+        """
+        # Initialize AsyncUrlSeeder here if it hasn't been already
+        if not self.url_seeder:
+            # Pass the crawler's base_directory for seeder's cache management
+            # Pass the crawler's logger for consistent logging
+            self.url_seeder = AsyncUrlSeeder(
+                base_directory=self.crawl4ai_folder,
+                logger=self.logger
+            )
+
+        # Merge config object with direct kwargs, giving kwargs precedence
+        seeding_config = config.clone(**kwargs) if config else SeedingConfig.from_kwargs(kwargs)
+
+        # Ensure base_directory is set for the seeder's cache
+        seeding_config.base_directory = seeding_config.base_directory or self.crawl4ai_folder
+        # Ensure the seeder uses the crawler's logger (if not already set)
+        if not self.url_seeder.logger:
+            self.url_seeder.logger = self.logger
+
+        # Pass verbose setting if explicitly provided in SeedingConfig or kwargs
+        if seeding_config.verbose is not None:
+            self.url_seeder.logger.verbose = seeding_config.verbose
+        else: # Default to crawler's verbose setting
+            self.url_seeder.logger.verbose = self.logger.verbose
+
+
+        if isinstance(domain_or_domains, str):
+            self.logger.info(
+                message="Starting URL seeding for domain: {domain}",
+                tag="SEED",
+                params={"domain": domain_or_domains}
+            )
+            return await self.url_seeder.urls(
+                domain_or_domains,
+                seeding_config
+            )
+        elif isinstance(domain_or_domains, (list, tuple)):
+            self.logger.info(
+                message="Starting URL seeding for {count} domains",
+                tag="SEED",
+                params={"count": len(domain_or_domains)}
+            )
+            # AsyncUrlSeeder.many_urls accepts the list of domains and the merged SeedingConfig.
+            return await self.url_seeder.many_urls(
+                domain_or_domains,
+                seeding_config
+            )
+        else:
+            raise ValueError("`domain_or_domains` must be a string or a list of strings.")
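
The docstring of `aseed_urls` above lists four return shapes, depending on `extract_head` and on whether one or several domains were passed. A caller that wants a flat URL list, for example to hand to `arun_many()`, has to normalize them. The helper below is a minimal sketch of that normalization; `flatten_seed_urls` is a hypothetical name and is not part of this commit. It assumes only what the docstring states: items are either plain strings or dicts carrying a `'url'` key.

```python
from typing import Any, Dict, List, Union

SeedItem = Union[str, Dict[str, Any]]
SeedResult = Union[List[SeedItem], Dict[str, List[SeedItem]]]


def flatten_seed_urls(result: SeedResult) -> List[str]:
    """Collect unique URL strings from any of the documented return shapes."""
    # Multi-domain results map each domain to its own list of items;
    # single-domain results are already one list.
    groups = result.values() if isinstance(result, dict) else [result]

    urls: List[str] = []
    seen = set()
    for items in groups:
        for item in items:
            # With extract_head=True each item is a dict with a 'url' key.
            url = item["url"] if isinstance(item, dict) else item
            if url not in seen:  # keep first occurrence, preserve order
                seen.add(url)
                urls.append(url)
    return urls
```

The returned list preserves discovery order and drops duplicates that appear under more than one domain.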