# URL Seeding: The Smart Way to Crawl at Scale

## Why URL Seeding?

Web crawling comes in different flavors, each with its own strengths. Let's understand when to use URL seeding versus deep crawling.

### Deep Crawling: Real-Time Discovery

Deep crawling is perfect when you need:

- **Fresh, real-time data** - discovering pages as they're created
- **Dynamic exploration** - following links based on content
- **Selective extraction** - stopping when you find what you need

```python
# Deep crawling example: Explore a website dynamically
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def deep_crawl_example():
    # Configure a 2-level deep crawl
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,             # Crawl 2 levels deep
            include_external=False,  # Stay within domain
            max_pages=50             # Limit for efficiency
        ),
        verbose=True
    )

    async with AsyncWebCrawler() as crawler:
        # Start crawling and follow links dynamically
        results = await crawler.arun("https://example.com", config=config)

        print(f"Discovered and crawled {len(results)} pages")
        for result in results[:3]:
            print(f"Found: {result.url} at depth {result.metadata.get('depth', 0)}")

asyncio.run(deep_crawl_example())
```

### URL Seeding: Bulk Discovery

URL seeding shines when you want:

- **Comprehensive coverage** - get thousands of URLs in seconds
- **Bulk processing** - filter before crawling
- **Resource efficiency** - know exactly what you'll crawl

```python
# URL seeding example: Analyze all documentation
from crawl4ai import AsyncUrlSeeder, SeedingConfig

seeder = AsyncUrlSeeder()
config = SeedingConfig(
    source="sitemap",
    extract_head=True,
    pattern="*/docs/*"
)

# Get ALL documentation URLs instantly
urls = await seeder.urls("example.com", config)
# 1000+ URLs discovered in seconds!
```

### The Trade-offs

| Aspect | Deep Crawling | URL Seeding |
|--------|---------------|-------------|
| **Coverage** | Discovers pages dynamically | Gets most existing URLs instantly |
| **Freshness** | Finds brand new pages | May miss very recent pages |
| **Speed** | Slower, page by page | Extremely fast bulk discovery |
| **Resource Usage** | Higher - crawls to discover | Lower - discovers then crawls |
| **Control** | Can stop mid-process | Pre-filters before crawling |

### When to Use Each

**Choose Deep Crawling when:**

- You need the absolute latest content
- You're searching for specific information
- The site structure is unknown or dynamic
- You want to stop as soon as you find what you need

**Choose URL Seeding when:**

- You need to analyze large portions of a site
- You want to filter URLs before crawling
- You're doing comparative analysis
- You need to optimize resource usage

The magic happens when you understand both approaches and choose the right tool for your task. Sometimes, you might even combine them - use URL seeding for bulk discovery, then deep crawl specific sections for the latest updates, as sketched below.
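Here is a minimal sketch of that hybrid pattern. The domain, pattern, and depth values are illustrative assumptions; everything else uses the APIs shown above:

```python
# Hybrid sketch: seed in bulk, then deep crawl a few seed pages for fresh links
import asyncio
from crawl4ai import AsyncUrlSeeder, AsyncWebCrawler, SeedingConfig, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def hybrid_discovery():
    # 1) Bulk discovery: collect candidate section URLs from the sitemap
    async with AsyncUrlSeeder() as seeder:
        seed_config = SeedingConfig(source="sitemap", pattern="*/blog/*", max_urls=20)
        seeds = await seeder.urls("example.com", seed_config)

    # 2) Deep crawl only a few of those seeds to pick up the newest links
    crawl_config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1, include_external=False, max_pages=10)
    )
    async with AsyncWebCrawler() as crawler:
        for seed in seeds[:3]:  # kept small for the example
            results = await crawler.arun(seed["url"], config=crawl_config)
            print(f"{seed['url']}: crawled {len(results)} pages")

asyncio.run(hybrid_discovery())
```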
## Your First URL Seeding Adventure

Let's see the magic in action. We'll discover blog posts about Python, filter for tutorials, and crawl only those pages.

```python
import asyncio
from crawl4ai import AsyncUrlSeeder, AsyncWebCrawler, SeedingConfig, CrawlerRunConfig

async def smart_blog_crawler():
    # Step 1: Create our URL discoverer
    seeder = AsyncUrlSeeder()

    # Step 2: Configure discovery - let's find all blog posts
    config = SeedingConfig(
        source="sitemap+cc",    # Use the sitemap and Common Crawl
        pattern="*/courses/*",  # Only course-related pages
        extract_head=True,      # Get page metadata
        max_urls=100            # Limit for this example
    )

    # Step 3: Discover URLs from Real Python
    print("šŸ” Discovering course posts...")
    urls = await seeder.urls("realpython.com", config)
    print(f"āœ… Found {len(urls)} course posts")

    # Step 4: Filter for Python tutorials (using metadata!)
    tutorials = [
        url for url in urls
        if url["status"] == "valid" and any(
            keyword in str(url["head_data"]).lower()
            for keyword in ["tutorial", "guide", "how to"]
        )
    ]
    print(f"šŸ“š Filtered to {len(tutorials)} tutorials")

    # Step 5: Show what we found
    print("\nšŸŽÆ Found these tutorials:")
    for tutorial in tutorials[:5]:  # First 5
        title = tutorial["head_data"].get("title", "No title")
        print(f"  - {title}")
        print(f"    {tutorial['url']}")

    # Step 6: Now crawl ONLY these relevant pages
    print("\nšŸš€ Crawling tutorials...")
    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            only_text=True,
            word_count_threshold=300,  # Only substantial articles
            stream=True
        )

        # Extract URLs and crawl them
        tutorial_urls = [t["url"] for t in tutorials[:10]]
        results = await crawler.arun_many(tutorial_urls, config=config)

        successful = 0
        async for result in results:
            if result.success:
                successful += 1
                print(f"āœ“ Crawled: {result.url[:60]}...")

        print(f"\n✨ Successfully crawled {successful} tutorials!")

# Run it!
asyncio.run(smart_blog_crawler())
```

**What just happened?**

1. We discovered all the course URLs from the sitemap and Common Crawl
2. We filtered using metadata (no crawling needed!)
3. We crawled only the relevant tutorials
4. We saved tons of time and bandwidth

This is the power of URL seeding - you see everything before you crawl anything.

## Understanding the URL Seeder

Now that you've seen the magic, let's understand how it works.

### Basic Usage

Creating a URL seeder is simple:

```python
from crawl4ai import AsyncUrlSeeder, SeedingConfig

# Method 1: Manual cleanup
seeder = AsyncUrlSeeder()
try:
    config = SeedingConfig(source="sitemap")
    urls = await seeder.urls("example.com", config)
finally:
    await seeder.close()

# Method 2: Context manager (recommended)
async with AsyncUrlSeeder() as seeder:
    config = SeedingConfig(source="sitemap")
    urls = await seeder.urls("example.com", config)
    # Automatically cleaned up on exit
```

The seeder can discover URLs from two powerful sources:

#### 1. Sitemaps (Fastest)

```python
# Discover from sitemap
config = SeedingConfig(source="sitemap")
urls = await seeder.urls("example.com", config)
```

Sitemaps are XML files that websites create specifically to list all their URLs. It's like getting a menu at a restaurant - everything is listed upfront.

**Sitemap Index Support**: For large websites like TechCrunch that use sitemap indexes (a sitemap of sitemaps), the seeder automatically detects and processes all sub-sitemaps in parallel:

```xml
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://techcrunch.com/sitemap-1.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://techcrunch.com/sitemap-2.xml</loc>
  </sitemap>
</sitemapindex>
```

The seeder handles this transparently - you'll get all URLs from all sub-sitemaps automatically!
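As a quick illustration, one call is all it takes even when a site's URLs are spread across many sub-sitemaps (TechCrunch is just the example from above; the `verbose` flag prints discovery progress):

```python
# One call covers the sitemap index and every sub-sitemap behind it
config = SeedingConfig(source="sitemap", verbose=True)
async with AsyncUrlSeeder() as seeder:
    urls = await seeder.urls("techcrunch.com", config)
print(f"Collected {len(urls)} URLs across all sub-sitemaps")
```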
#### 2. Common Crawl (Most Comprehensive)

```python
# Discover from Common Crawl
config = SeedingConfig(source="cc")
urls = await seeder.urls("example.com", config)
```

Common Crawl is a massive public dataset that regularly crawls the entire web. It's like having access to a pre-built index of the internet.

#### 3. Both Sources (Maximum Coverage)

```python
# Use both sources
config = SeedingConfig(source="sitemap+cc")
urls = await seeder.urls("example.com", config)
```

### Configuration Magic: SeedingConfig

The `SeedingConfig` object is your control panel. Here's everything you can configure:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `source` | str | "sitemap+cc" | URL source: "cc" (Common Crawl), "sitemap", or "sitemap+cc" |
| `pattern` | str | "*" | URL pattern filter (e.g., "*/blog/*", "*.html") |
| `extract_head` | bool | False | Extract metadata from the page `<head>` |
| `live_check` | bool | False | Verify URLs are accessible |
| `max_urls` | int | -1 | Maximum URLs to return (-1 = unlimited) |
| `concurrency` | int | 10 | Parallel workers for fetching |
| `hits_per_sec` | int | 5 | Rate limit for requests |
| `force` | bool | False | Bypass cache, fetch fresh data |
| `verbose` | bool | False | Show detailed progress |
| `query` | str | None | Search query for BM25 scoring |
| `scoring_method` | str | None | Scoring method (currently "bm25") |
| `score_threshold` | float | None | Minimum score to include URL |
| `filter_nonsense_urls` | bool | True | Filter out utility URLs (robots.txt, etc.) |
| `cache_ttl_hours` | int | 24 | Hours before sitemap cache expires (0 = no TTL) |
| `validate_sitemap_lastmod` | bool | True | Check sitemap's lastmod and refetch if newer |

#### Pattern Matching Examples

```python
# Match all blog posts
config = SeedingConfig(pattern="*/blog/*")

# Match only HTML files
config = SeedingConfig(pattern="*.html")

# Match product pages
config = SeedingConfig(pattern="*/product/*")

# Match everything except admin pages
config = SeedingConfig(pattern="*")
# Then filter:
urls = [u for u in urls if "/admin/" not in u["url"]]
```

### URL Validation: Live Checking

Sometimes you need to know if URLs are actually accessible. That's where live checking comes in:

```python
config = SeedingConfig(
    source="sitemap",
    live_check=True,  # Verify each URL is accessible
    concurrency=20    # Check 20 URLs in parallel
)

async with AsyncUrlSeeder() as seeder:
    urls = await seeder.urls("example.com", config)

# Now you can filter by status
live_urls = [u for u in urls if u["status"] == "valid"]
dead_urls = [u for u in urls if u["status"] == "not_valid"]

print(f"Live URLs: {len(live_urls)}")
print(f"Dead URLs: {len(dead_urls)}")
```

**When to use live checking:**

- Before a large crawling operation
- When working with older sitemaps
- When data freshness is critical

**When to skip it:**

- Quick explorations
- When you trust the source
- When speed is more important than accuracy
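If you do run a live check before a large crawl, the natural next step is to hand only the live URLs to the crawler. A minimal sketch, reusing `live_urls` from the snippet above and the `arun_many` call shown later in this guide:

```python
# Sketch: crawl only the URLs that passed the live check
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async with AsyncWebCrawler() as crawler:
    crawl_config = CrawlerRunConfig(only_text=True, stream=True)
    crawl_results = await crawler.arun_many([u["url"] for u in live_urls], config=crawl_config)
    async for result in crawl_results:
        if result.success:
            print(f"Crawled: {result.url}")
```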
### The Power of Metadata: Head Extraction

This is where URL seeding gets really powerful. Instead of crawling entire pages, you can extract just the metadata:

```python
config = SeedingConfig(
    extract_head=True  # Extract metadata from the <head> section
)

async with AsyncUrlSeeder() as seeder:
    urls = await seeder.urls("example.com", config)

# Now each URL has rich metadata
for url in urls[:3]:
    print(f"\nURL: {url['url']}")
    print(f"Title: {url['head_data'].get('title')}")

    meta = url['head_data'].get('meta', {})
    print(f"Description: {meta.get('description')}")
    print(f"Keywords: {meta.get('keywords')}")

    # Even Open Graph data!
    print(f"OG Image: {meta.get('og:image')}")
```

#### What Can We Extract?

The head extraction gives you a treasure trove of information:

```python
# Example of extracted head_data
{
    "title": "10 Python Tips for Beginners",
    "charset": "utf-8",
    "lang": "en",
    "meta": {
        "description": "Learn essential Python tips...",
        "keywords": "python, programming, tutorial",
        "author": "Jane Developer",
        "viewport": "width=device-width, initial-scale=1",

        # Open Graph tags
        "og:title": "10 Python Tips for Beginners",
        "og:description": "Essential Python tips for new programmers",
        "og:image": "https://example.com/python-tips.jpg",
        "og:type": "article",

        # Twitter Card tags
        "twitter:card": "summary_large_image",
        "twitter:title": "10 Python Tips",

        # Dublin Core metadata
        "dc.creator": "Jane Developer",
        "dc.date": "2024-01-15"
    },
    "link": {
        "canonical": [{"href": "https://example.com/blog/python-tips"}],
        "alternate": [{"href": "/feed.xml", "type": "application/rss+xml"}]
    },
    "jsonld": [
        {
            "@type": "Article",
            "headline": "10 Python Tips for Beginners",
            "datePublished": "2024-01-15",
            "author": {"@type": "Person", "name": "Jane Developer"}
        }
    ]
}
```

This metadata is gold for filtering! You can find exactly what you need without crawling a single page.

### Smart URL-Based Filtering (No Head Extraction)

When `extract_head=False` but you still provide a query, the seeder uses intelligent URL-based scoring:

```python
# Fast filtering based on URL structure alone
config = SeedingConfig(
    source="sitemap",
    extract_head=False,  # Don't fetch page metadata
    query="python tutorial async",
    scoring_method="bm25",
    score_threshold=0.3
)

async with AsyncUrlSeeder() as seeder:
    urls = await seeder.urls("example.com", config)

# URLs are scored based on:
# 1. Domain parts matching (e.g., 'python' in python.example.com)
# 2. Path segments (e.g., '/tutorials/python-async/')
# 3. Query parameters (e.g., '?topic=python')
# 4. Fuzzy matching using character n-grams

# Example URL scoring:
# https://example.com/tutorials/python/async-guide.html - High score
# https://example.com/blog/javascript-tips.html - Low score
```

This approach is much faster than head extraction while still providing intelligent filtering!

### Understanding Results

Each URL in the results has this structure:

```python
{
    "url": "https://example.com/blog/python-tips.html",
    "status": "valid",  # "valid", "not_valid", or "unknown"
    "head_data": {      # Only if extract_head=True
        "title": "Page Title",
        "meta": {...},
        "link": {...},
        "jsonld": [...]
}, "relevance_score": 0.85 # Only if using BM25 scoring } ``` Let's see a real example: ```python config = SeedingConfig( source="sitemap", extract_head=True, live_check=True ) async with AsyncUrlSeeder() as seeder: urls = await seeder.urls("blog.example.com", config) # Analyze the results for url in urls[:5]: print(f"\n{'='*60}") print(f"URL: {url['url']}") print(f"Status: {url['status']}") if url['head_data']: data = url['head_data'] print(f"Title: {data.get('title', 'No title')}") # Check content type meta = data.get('meta', {}) content_type = meta.get('og:type', 'unknown') print(f"Content Type: {content_type}") # Publication date pub_date = None for jsonld in data.get('jsonld', []): if isinstance(jsonld, dict): pub_date = jsonld.get('datePublished') if pub_date: break if pub_date: print(f"Published: {pub_date}") # Word count (if available) word_count = meta.get('word_count') if word_count: print(f"Word Count: {word_count}") ``` ## Smart Filtering with BM25 Scoring Now for the really cool part - intelligent filtering based on relevance! ### Introduction to Relevance Scoring BM25 is a ranking algorithm that scores how relevant a document is to a search query. With URL seeding, we can score URLs based on their metadata *before* crawling them. Think of it like this: - Traditional way: Read every book in the library to find ones about Python - Smart way: Check the titles and descriptions, score them, read only the most relevant ### Query-Based Discovery Here's how to use BM25 scoring: ```python config = SeedingConfig( source="sitemap", extract_head=True, # Required for scoring query="python async tutorial", # What we're looking for scoring_method="bm25", # Use BM25 algorithm score_threshold=0.3 # Minimum relevance score ) async with AsyncUrlSeeder() as seeder: urls = await seeder.urls("realpython.com", config) # Results are automatically sorted by relevance! for url in urls[:5]: print(f"Score: {url['relevance_score']:.2f} - {url['url']}") print(f" Title: {url['head_data']['title']}") ``` ### Real Examples #### Finding Documentation Pages ```python # Find API documentation config = SeedingConfig( source="sitemap", extract_head=True, query="API reference documentation endpoints", scoring_method="bm25", score_threshold=0.5, max_urls=20 ) async with AsyncUrlSeeder() as seeder: urls = await seeder.urls("docs.example.com", config) # The highest scoring URLs will be API docs! 
```

#### Discovering Product Pages

```python
# Find specific products
config = SeedingConfig(
    source="sitemap+cc",   # Use both sources
    extract_head=True,
    query="wireless headphones noise canceling",
    scoring_method="bm25",
    score_threshold=0.4,
    pattern="*/product/*"  # Combine with pattern matching
)

async with AsyncUrlSeeder() as seeder:
    urls = await seeder.urls("shop.example.com", config)

# Filter further by price (from metadata)
affordable = [
    u for u in urls
    if float(u['head_data'].get('meta', {}).get('product:price', '0')) < 200
]
```

#### Filtering News Articles

```python
# Find recent news about AI
config = SeedingConfig(
    source="sitemap",
    extract_head=True,
    query="artificial intelligence machine learning breakthrough",
    scoring_method="bm25",
    score_threshold=0.35
)

async with AsyncUrlSeeder() as seeder:
    urls = await seeder.urls("technews.com", config)

# Filter by date
from datetime import datetime, timedelta, timezone

recent = []
cutoff = datetime.now(timezone.utc) - timedelta(days=7)

for url in urls:
    # Check JSON-LD for publication date
    for jsonld in url['head_data'].get('jsonld', []):
        if isinstance(jsonld, dict) and 'datePublished' in jsonld:
            pub_date = datetime.fromisoformat(jsonld['datePublished'].replace('Z', '+00:00'))
            if pub_date > cutoff:
                recent.append(url)
            break
```

#### Complex Query Patterns

```python
# Multi-concept queries
queries = [
    "python async await concurrency tutorial",
    "data science pandas numpy visualization",
    "web scraping beautifulsoup selenium automation",
    "machine learning tensorflow keras deep learning"
]

all_tutorials = []
for query in queries:
    config = SeedingConfig(
        source="sitemap",
        extract_head=True,
        query=query,
        scoring_method="bm25",
        score_threshold=0.4,
        max_urls=10  # Top 10 per topic
    )

    async with AsyncUrlSeeder() as seeder:
        urls = await seeder.urls("learning-platform.com", config)
    all_tutorials.extend(urls)

# Remove duplicates while preserving order
seen = set()
unique_tutorials = []
for url in all_tutorials:
    if url['url'] not in seen:
        seen.add(url['url'])
        unique_tutorials.append(url)

print(f"Found {len(unique_tutorials)} unique tutorials across all topics")
```

## Scaling Up: Multiple Domains

When you need to discover URLs across multiple websites, URL seeding really shines.
### The `many_urls` Method

```python
# Discover URLs from multiple domains in parallel
domains = ["site1.com", "site2.com", "site3.com"]

config = SeedingConfig(
    source="sitemap",
    extract_head=True,
    query="python tutorial",
    scoring_method="bm25",
    score_threshold=0.3
)

# Returns a dictionary: {domain: [urls]}
async with AsyncUrlSeeder() as seeder:
    results = await seeder.many_urls(domains, config)

# Process results
for domain, urls in results.items():
    print(f"\n{domain}: Found {len(urls)} relevant URLs")
    if urls:
        top = urls[0]  # Highest scoring
        print(f"  Top result: {top['url']}")
        print(f"  Score: {top['relevance_score']:.2f}")
```

### Cross-Domain Examples

#### Competitor Analysis

```python
# Analyze content strategies across competitors
competitors = [
    "competitor1.com",
    "competitor2.com",
    "competitor3.com"
]

config = SeedingConfig(
    source="sitemap",
    extract_head=True,
    pattern="*/blog/*",
    max_urls=100
)

async with AsyncUrlSeeder() as seeder:
    results = await seeder.many_urls(competitors, config)

# Analyze content types
for domain, urls in results.items():
    content_types = {}
    for url in urls:
        # Extract content type from metadata
        og_type = url['head_data'].get('meta', {}).get('og:type', 'unknown')
        content_types[og_type] = content_types.get(og_type, 0) + 1

    print(f"\n{domain} content distribution:")
    for ctype, count in sorted(content_types.items(), key=lambda x: x[1], reverse=True):
        print(f"  {ctype}: {count}")
```

#### Industry Research

```python
# Research Python tutorials across educational sites
educational_sites = [
    "realpython.com",
    "pythontutorial.net",
    "learnpython.org",
    "python.org"
]

config = SeedingConfig(
    source="sitemap",
    extract_head=True,
    query="beginner python tutorial basics",
    scoring_method="bm25",
    score_threshold=0.3,
    max_urls=20  # Per site
)

async with AsyncUrlSeeder() as seeder:
    results = await seeder.many_urls(educational_sites, config)

# Find the best beginner tutorials
all_tutorials = []
for domain, urls in results.items():
    for url in urls:
        url['domain'] = domain  # Add domain info
        all_tutorials.append(url)

# Sort by relevance across all domains
all_tutorials.sort(key=lambda x: x['relevance_score'], reverse=True)

print("Top 10 Python tutorials for beginners across all sites:")
for i, tutorial in enumerate(all_tutorials[:10], 1):
    print(f"{i}. [{tutorial['relevance_score']:.2f}] {tutorial['head_data']['title']}")
    print(f"   {tutorial['url']}")
    print(f"   From: {tutorial['domain']}")
```

#### Multi-Site Monitoring

```python
# Monitor news about your company across multiple sources
news_sites = [
    "techcrunch.com",
    "theverge.com",
    "wired.com",
    "arstechnica.com"
]

company_name = "YourCompany"

config = SeedingConfig(
    source="cc",  # Common Crawl for recent content
    extract_head=True,
    query=f"{company_name} announcement news",
    scoring_method="bm25",
    score_threshold=0.5,  # High threshold for relevance
    max_urls=10
)

async with AsyncUrlSeeder() as seeder:
    results = await seeder.many_urls(news_sites, config)

# Collect all mentions
mentions = []
for domain, urls in results.items():
    mentions.extend(urls)

if mentions:
    print(f"Found {len(mentions)} mentions of {company_name}:")
    for mention in mentions:
        print(f"\n- {mention['head_data']['title']}")
        print(f"  {mention['url']}")
        print(f"  Score: {mention['relevance_score']:.2f}")
else:
    print(f"No recent mentions of {company_name} found")
```

## Advanced Integration Patterns

Let's put everything together in a real-world example.
### Building a Research Assistant

Here's a complete example that discovers, scores, filters, and crawls intelligently:

```python
import asyncio
from datetime import datetime
from crawl4ai import AsyncUrlSeeder, AsyncWebCrawler, SeedingConfig, CrawlerRunConfig

class ResearchAssistant:
    def __init__(self):
        self.seeder = None

    async def __aenter__(self):
        self.seeder = AsyncUrlSeeder()
        await self.seeder.__aenter__()
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        if self.seeder:
            await self.seeder.__aexit__(exc_type, exc_val, exc_tb)

    async def research_topic(self, topic, domains, max_articles=20):
        """Research a topic across multiple domains."""
        print(f"šŸ”¬ Researching '{topic}' across {len(domains)} domains...")

        # Step 1: Discover relevant URLs
        config = SeedingConfig(
            source="sitemap+cc",    # Maximum coverage
            extract_head=True,      # Get metadata
            query=topic,            # Research topic
            scoring_method="bm25",  # Smart scoring
            score_threshold=0.4,    # Quality threshold
            max_urls=10,            # Per domain
            concurrency=20,         # Fast discovery
            verbose=True
        )

        # Discover across all domains
        discoveries = await self.seeder.many_urls(domains, config)

        # Step 2: Collect and rank all articles
        all_articles = []
        for domain, urls in discoveries.items():
            for url in urls:
                url['domain'] = domain
                all_articles.append(url)

        # Sort by relevance
        all_articles.sort(key=lambda x: x['relevance_score'], reverse=True)

        # Take top articles
        top_articles = all_articles[:max_articles]

        print(f"\nšŸ“Š Found {len(all_articles)} relevant articles")
        print(f"šŸ“Œ Selected top {len(top_articles)} for deep analysis")

        # Step 3: Show what we're about to crawl
        print("\nšŸŽÆ Articles to analyze:")
        for i, article in enumerate(top_articles[:5], 1):
            print(f"\n{i}. {article['head_data']['title']}")
            print(f"   Score: {article['relevance_score']:.2f}")
            print(f"   Source: {article['domain']}")
            print(f"   URL: {article['url'][:60]}...")

        # Step 4: Crawl the selected articles
        print(f"\nšŸš€ Deep crawling {len(top_articles)} articles...")

        async with AsyncWebCrawler() as crawler:
            config = CrawlerRunConfig(
                only_text=True,
                word_count_threshold=200,  # Substantial content only
                stream=True
            )

            # Extract URLs and crawl all articles
            article_urls = [article['url'] for article in top_articles]
            results = []

            crawl_results = await crawler.arun_many(article_urls, config=config)
            async for result in crawl_results:
                if result.success:
                    results.append({
                        'url': result.url,
                        'title': result.metadata.get('title', 'No title'),
                        'content': result.markdown.raw_markdown,
                        'domain': next(a['domain'] for a in top_articles if a['url'] == result.url),
                        'score': next(a['relevance_score'] for a in top_articles if a['url'] == result.url)
                    })
                    print(f"āœ“ Crawled: {result.url[:60]}...")

        # Step 5: Analyze and summarize
        print(f"\nšŸ“ Analysis complete! Crawled {len(results)} articles")

        return self.create_research_summary(topic, results)

    def create_research_summary(self, topic, articles):
        """Create a research summary from crawled articles."""
        summary = {
            'topic': topic,
            'timestamp': datetime.now().isoformat(),
            'total_articles': len(articles),
            'sources': {}
        }

        # Group by domain
        for article in articles:
            domain = article['domain']
            if domain not in summary['sources']:
                summary['sources'][domain] = []
            summary['sources'][domain].append({
                'title': article['title'],
                'url': article['url'],
                'score': article['score'],
                'excerpt': article['content'][:500] + '...'
                           if len(article['content']) > 500 else article['content']
            })

        return summary

# Use the research assistant
async def main():
    async with ResearchAssistant() as assistant:
        # Research Python async programming across multiple sources
        topic = "python asyncio best practices performance optimization"
        domains = [
            "realpython.com",
            "python.org",
            "stackoverflow.com",
            "medium.com"
        ]

        summary = await assistant.research_topic(topic, domains, max_articles=15)

        # Display results
        print("\n" + "="*60)
        print("RESEARCH SUMMARY")
        print("="*60)
        print(f"Topic: {summary['topic']}")
        print(f"Date: {summary['timestamp']}")
        print(f"Total Articles Analyzed: {summary['total_articles']}")

        print("\nKey Findings by Source:")
        for domain, articles in summary['sources'].items():
            print(f"\nšŸ“š {domain} ({len(articles)} articles)")
            for article in articles[:2]:  # Top 2 per domain
                print(f"\n  Title: {article['title']}")
                print(f"  Relevance: {article['score']:.2f}")
                print(f"  Preview: {article['excerpt'][:200]}...")

asyncio.run(main())
```

### Performance Optimization Tips

1. **Use caching wisely**

   ```python
   # First run - populate cache
   config = SeedingConfig(source="sitemap", extract_head=True, force=True)
   urls = await seeder.urls("example.com", config)

   # Subsequent runs - use cache (much faster)
   config = SeedingConfig(source="sitemap", extract_head=True, force=False)
   urls = await seeder.urls("example.com", config)
   ```

2. **Optimize concurrency**

   ```python
   # For many small requests (like HEAD checks)
   config = SeedingConfig(concurrency=50, hits_per_sec=20)

   # For fewer large requests (like full head extraction)
   config = SeedingConfig(concurrency=10, hits_per_sec=5)
   ```

3. **Stream large result sets**

   ```python
   # When crawling many URLs
   async with AsyncWebCrawler() as crawler:
       # Assuming urls is a list of URL strings
       crawl_results = await crawler.arun_many(urls, config=config)

       # Process as they arrive
       async for result in crawl_results:
           process_immediately(result)  # Don't wait for all
   ```

4. **Memory protection for large domains**

   The seeder uses bounded queues to prevent memory issues when processing domains with millions of URLs:

   ```python
   # Safe for domains with 1M+ URLs
   config = SeedingConfig(
       source="cc+sitemap",
       concurrency=50,   # Queue size adapts to concurrency
       max_urls=100000   # Process in batches if needed
   )

   # The seeder automatically manages memory by:
   # - Using bounded queues (prevents RAM spikes)
   # - Applying backpressure when queue is full
   # - Processing URLs as they're discovered
   ```

## Best Practices & Tips

### Cache Management

The seeder automatically caches results to speed up repeated operations:

- **Common Crawl cache**: `~/.crawl4ai/seeder_cache/[index]_[domain]_[hash].jsonl`
- **Sitemap cache**: `~/.crawl4ai/seeder_cache/sitemap_[domain]_[hash].json`
- **HEAD data cache**: `~/.cache/url_seeder/head/[hash].json`

#### Smart TTL Cache for Sitemaps

Sitemap caches now include intelligent validation:

```python
# Default: 24-hour TTL with lastmod validation
config = SeedingConfig(
    source="sitemap",
    cache_ttl_hours=24,            # Cache expires after 24 hours
    validate_sitemap_lastmod=True  # Also check if sitemap was updated
)

# Aggressive caching (1 week, no lastmod check)
config = SeedingConfig(
    source="sitemap",
    cache_ttl_hours=168,            # 7 days
    validate_sitemap_lastmod=False  # Trust TTL only
)

# Always validate (no TTL, only lastmod)
config = SeedingConfig(
    source="sitemap",
    cache_ttl_hours=0,             # Disable TTL
    validate_sitemap_lastmod=True  # Refetch if sitemap has newer lastmod
)

# Always fresh (bypass cache completely)
config = SeedingConfig(
    source="sitemap",
    force=True  # Ignore all caching
)
```

**Cache validation priority:**

1. `force=True` → Always refetch
2. Cache doesn't exist → Fetch fresh
3. `validate_sitemap_lastmod=True` and sitemap has newer `<lastmod>` → Refetch
4. `cache_ttl_hours > 0` and cache is older than TTL → Refetch
5. Cache corrupted → Refetch (automatic recovery)
6. Otherwise → Use cache

### Pattern Matching Strategies

```python
# Be specific when possible
good_pattern = "*/blog/2024/*.html"  # Specific
bad_pattern = "*"                    # Too broad

# Combine patterns with metadata filtering
config = SeedingConfig(
    pattern="*/articles/*",
    extract_head=True
)
urls = await seeder.urls("news.com", config)

# Further filter by publish date, author, category, etc.
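# Illustrative helper (not part of crawl4ai) - define your own recency rule,
# for example based on the JSON-LD datePublished field in head_data:
from datetime import datetime, timedelta, timezone

def is_recent(head_data, days=30):
    for item in head_data.get('jsonld', []):
        if isinstance(item, dict) and item.get('datePublished'):
            published = datetime.fromisoformat(item['datePublished'].replace('Z', '+00:00'))
            if published.tzinfo is None:
                published = published.replace(tzinfo=timezone.utc)
            return published > datetime.now(timezone.utc) - timedelta(days=days)
    return False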
recent = [u for u in urls if is_recent(u['head_data'])]
```

### Rate Limiting Considerations

```python
# Be respectful of servers
config = SeedingConfig(
    hits_per_sec=10,  # Max 10 requests per second
    concurrency=20    # But use 20 workers
)

# For your own servers
config = SeedingConfig(
    hits_per_sec=None,  # No limit
    concurrency=100     # Go fast
)
```

## Quick Reference

### Common Patterns

```python
# Blog post discovery
config = SeedingConfig(
    source="sitemap",
    pattern="*/blog/*",
    extract_head=True,
    query="your topic",
    scoring_method="bm25"
)

# E-commerce product discovery
config = SeedingConfig(
    source="sitemap+cc",
    pattern="*/product/*",
    extract_head=True,
    live_check=True
)

# Documentation search
config = SeedingConfig(
    source="sitemap",
    pattern="*/docs/*",
    extract_head=True,
    query="API reference",
    scoring_method="bm25",
    score_threshold=0.5
)

# News monitoring
config = SeedingConfig(
    source="cc",
    extract_head=True,
    query="company name",
    scoring_method="bm25",
    max_urls=50
)
```

### Troubleshooting Guide

| Issue | Solution |
|-------|----------|
| No URLs found | Try `source="cc+sitemap"`, check domain spelling |
| Slow discovery | Reduce `concurrency`, add `hits_per_sec` limit |
| Missing metadata | Ensure `extract_head=True` |
| Low relevance scores | Refine query, lower `score_threshold` |
| Rate limit errors | Reduce `hits_per_sec` and `concurrency` |
| Memory issues with large sites | Use `max_urls` to limit results, reduce `concurrency` |
| Connection not closed | Use context manager or call `await seeder.close()` |
| Stale/outdated URLs | Set `cache_ttl_hours=0` or use `force=True` |
| Cache not updating | Check `validate_sitemap_lastmod=True`, or use `force=True` |
| Incomplete URL list | Delete cache file and refetch, or use `force=True` |

### Performance Benchmarks

Typical performance on a standard connection:

- **Sitemap discovery**: 100-1,000 URLs/second
- **Common Crawl discovery**: 50-500 URLs/second
- **HEAD checking**: 10-50 URLs/second
- **Head extraction**: 5-20 URLs/second
- **BM25 scoring**: 10,000+ URLs/second

## Conclusion

URL seeding transforms web crawling from a blind expedition into a surgical strike. By discovering and analyzing URLs before crawling, you can:

- Save hours of crawling time
- Reduce bandwidth usage by 90%+
- Find exactly what you need
- Scale across multiple domains effortlessly

Whether you're building a research tool, monitoring competitors, or creating a content aggregator, URL seeding gives you the intelligence to crawl smarter, not harder.

### Smart URL Filtering

The seeder automatically filters out nonsense URLs that aren't useful for content crawling:

```python
# Enabled by default
config = SeedingConfig(
    source="sitemap",
    filter_nonsense_urls=True  # Default: True
)

# URLs that get filtered:
# - robots.txt, sitemap.xml, ads.txt
# - API endpoints (/api/, /v1/, .json)
# - Media files (.jpg, .mp4, .pdf)
# - Archives (.zip, .tar.gz)
# - Source code (.js, .css)
# - Admin/login pages
# - And many more...
```

To disable filtering (not recommended):

```python
config = SeedingConfig(
    source="sitemap",
    filter_nonsense_urls=False  # Include ALL URLs
)
```

### Key Features Summary

1. **Parallel Sitemap Index Processing**: Automatically detects and processes sitemap indexes in parallel
2. **Memory Protection**: Bounded queues prevent RAM issues with large domains (1M+ URLs)
3. **Context Manager Support**: Automatic cleanup with `async with` statement
4. **URL-Based Scoring**: Smart filtering even without head extraction
5. **Smart URL Filtering**: Automatically excludes utility/nonsense URLs
6. **Smart TTL Cache**: Sitemap caches with TTL expiry and lastmod validation
7. **Automatic Cache Recovery**: Corrupted or incomplete caches are automatically refreshed

Now go forth and seed intelligently!