URL Seeding: The Smart Way to Crawl at Scale
Why URL Seeding?
Web crawling comes in different flavors, each with its own strengths. Let's understand when to use URL seeding versus deep crawling.
Deep Crawling: Real-Time Discovery
Deep crawling is perfect when you need:
- Fresh, real-time data - discovering pages as they're created
- Dynamic exploration - following links based on content
- Selective extraction - stopping when you find what you need
# Deep crawling example: Explore a website dynamically
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def deep_crawl_example():
    # Configure a 2-level deep crawl
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,             # Crawl 2 levels deep
            include_external=False,  # Stay within domain
            max_pages=50             # Limit for efficiency
        ),
        verbose=True
    )

    async with AsyncWebCrawler() as crawler:
        # Start crawling and follow links dynamically
        results = await crawler.arun("https://example.com", config=config)

        print(f"Discovered and crawled {len(results)} pages")
        for result in results[:3]:
            print(f"Found: {result.url} at depth {result.metadata.get('depth', 0)}")

asyncio.run(deep_crawl_example())
URL Seeding: Bulk Discovery
URL seeding shines when you want:
- Comprehensive coverage - get thousands of URLs in seconds
- Bulk processing - filter before crawling
- Resource efficiency - know exactly what you'll crawl
# URL seeding example: Analyze all documentation
from crawl4ai import AsyncUrlSeeder, SeedingConfig
seeder = AsyncUrlSeeder()
config = SeedingConfig(
source="sitemap",
extract_head=True,
pattern="*/docs/*"
)
# Get ALL documentation URLs instantly
urls = await seeder.urls("example.com", config)
# 1000+ URLs discovered in seconds!
The Trade-offs
| Aspect | Deep Crawling | URL Seeding |
|---|---|---|
| Coverage | Discovers pages dynamically | Gets most existing URLs instantly |
| Freshness | Finds brand new pages | May miss very recent pages |
| Speed | Slower, page by page | Extremely fast bulk discovery |
| Resource Usage | Higher - crawls to discover | Lower - discovers then crawls |
| Control | Can stop mid-process | Pre-filters before crawling |
When to Use Each
Choose Deep Crawling when:
- You need the absolute latest content
- You're searching for specific information
- The site structure is unknown or dynamic
- You want to stop as soon as you find what you need
Choose URL Seeding when:
- You need to analyze large portions of a site
- You want to filter URLs before crawling
- You're doing comparative analysis
- You need to optimize resource usage
The magic happens when you understand both approaches and choose the right tool for your task. Sometimes, you might even combine them - use URL seeding for bulk discovery, then deep crawl specific sections for the latest updates.
Your First URL Seeding Adventure
Let's see the magic in action. We'll discover blog posts about Python, filter for tutorials, and crawl only those pages.
import asyncio
from crawl4ai import AsyncUrlSeeder, AsyncWebCrawler, SeedingConfig, CrawlerRunConfig

async def smart_blog_crawler():
    # Step 1: Create our URL discoverer
    seeder = AsyncUrlSeeder()

    # Step 2: Configure discovery - let's find course posts
    config = SeedingConfig(
        source="sitemap+cc",    # Use both the sitemap and Common Crawl
        pattern="*/courses/*",  # Only course-related posts
        extract_head=True,      # Get page metadata
        max_urls=100            # Limit for this example
    )

    # Step 3: Discover URLs from realpython.com
    print("🔍 Discovering course posts...")
    urls = await seeder.urls("realpython.com", config)
    print(f"✅ Found {len(urls)} course posts")

    # Step 4: Filter for Python tutorials (using metadata!)
    tutorials = [
        url for url in urls
        if url["status"] == "valid" and
        any(keyword in str(url["head_data"]).lower()
            for keyword in ["tutorial", "guide", "how to"])
    ]
    print(f"📚 Filtered to {len(tutorials)} tutorials")

    # Step 5: Show what we found
    print("\n🎯 Found these tutorials:")
    for tutorial in tutorials[:5]:  # First 5
        title = tutorial["head_data"].get("title", "No title")
        print(f" - {title}")
        print(f" {tutorial['url']}")

    # Step 6: Now crawl ONLY these relevant pages
    print("\n🚀 Crawling tutorials...")
    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            only_text=True,
            word_count_threshold=300,  # Only substantial articles
            stream=True
        )

        # Extract URLs and crawl them
        tutorial_urls = [t["url"] for t in tutorials[:10]]
        results = await crawler.arun_many(tutorial_urls, config=config)

        successful = 0
        async for result in results:
            if result.success:
                successful += 1
                print(f"✓ Crawled: {result.url[:60]}...")

        print(f"\n✨ Successfully crawled {successful} tutorials!")

# Run it!
asyncio.run(smart_blog_crawler())
What just happened?
- We discovered course URLs from the sitemap and Common Crawl
- We filtered using metadata (no crawling needed!)
- We crawled only the relevant tutorials
- We saved tons of time and bandwidth
This is the power of URL seeding - you see everything before you crawl anything.
Understanding the URL Seeder
Now that you've seen the magic, let's understand how it works.
Basic Usage
Creating a URL seeder is simple:
from crawl4ai import AsyncUrlSeeder, SeedingConfig

# Method 1: Manual cleanup
seeder = AsyncUrlSeeder()
try:
    config = SeedingConfig(source="sitemap")
    urls = await seeder.urls("example.com", config)
finally:
    await seeder.close()

# Method 2: Context manager (recommended)
async with AsyncUrlSeeder() as seeder:
    config = SeedingConfig(source="sitemap")
    urls = await seeder.urls("example.com", config)
    # Automatically cleaned up on exit
The seeder can discover URLs from two powerful sources:
1. Sitemaps (Fastest)
# Discover from sitemap
config = SeedingConfig(source="sitemap")
urls = await seeder.urls("example.com", config)
Sitemaps are XML files that websites create specifically to list all their URLs. It's like getting a menu at a restaurant - everything is listed upfront.
Sitemap Index Support: For large websites like TechCrunch that use sitemap indexes (a sitemap of sitemaps), the seeder automatically detects and processes all sub-sitemaps in parallel:
<!-- Example sitemap index -->
<sitemapindex>
<sitemap>
<loc>https://techcrunch.com/sitemap-1.xml</loc>
</sitemap>
<sitemap>
<loc>https://techcrunch.com/sitemap-2.xml</loc>
</sitemap>
<!-- ... more sitemaps ... -->
</sitemapindex>
The seeder handles this transparently - you'll get all URLs from all sub-sitemaps automatically!
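To make the difference between a sitemap index and a regular sitemap concrete, here is a minimal standard-library sketch that tells the two apart and collects their `<loc>` entries. The function name `parse_sitemap` and the sample XML are illustrative assumptions; this is not the seeder's actual implementation, which also fetches the sub-sitemaps in parallel.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a sitemap index using the standard sitemaps.org namespace
SAMPLE_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-1.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-2.xml</loc></sitemap>
</sitemapindex>"""


def local_name(tag: str) -> str:
    """Strip an XML namespace such as '{http://...}loc' down to 'loc'."""
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(xml_text: str) -> tuple[str, list[str]]:
    """Return ('index' or 'urlset', list of <loc> values) for a sitemap document."""
    root = ET.fromstring(xml_text)
    kind = "index" if local_name(root.tag) == "sitemapindex" else "urlset"
    locs = [
        el.text.strip()
        for el in root.iter()
        if local_name(el.tag) == "loc" and el.text
    ]
    return kind, locs


kind, locs = parse_sitemap(SAMPLE_INDEX)
print(kind, locs)
```

For an index, each returned location is itself a sitemap that would need to be fetched and parsed in turn; for a urlset, the locations are the page URLs.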
2. Common Crawl (Most Comprehensive)
# Discover from Common Crawl
config = SeedingConfig(source="cc")
urls = await seeder.urls("example.com", config)
Common Crawl is a massive public dataset built from regular crawls of a large portion of the web. It's like having access to a pre-built index of the internet.
3. Both Sources (Maximum Coverage)
# Use both sources
config = SeedingConfig(source="sitemap+cc")
urls = await seeder.urls("example.com", config)
Configuration Magic: SeedingConfig
The SeedingConfig object is your control panel. Here's everything you can configure:
| Parameter | Type | Default | Description |
|---|---|---|---|
| source | str | "sitemap+cc" | URL source: "cc" (Common Crawl), "sitemap", or "sitemap+cc" |
| pattern | str | "*" | URL pattern filter (e.g., "*/blog/*", "*.html") |
| extract_head | bool | False | Extract metadata from page <head> |
| live_check | bool | False | Verify URLs are accessible |
| max_urls | int | -1 | Maximum URLs to return (-1 = unlimited) |
| concurrency | int | 10 | Parallel workers for fetching |
| hits_per_sec | int | 5 | Rate limit for requests |
| force | bool | False | Bypass cache, fetch fresh data |
| verbose | bool | False | Show detailed progress |
| query | str | None | Search query for BM25 scoring |
| scoring_method | str | None | Scoring method (currently "bm25") |
| score_threshold | float | None | Minimum score to include URL |
| filter_nonsense_urls | bool | True | Filter out utility URLs (robots.txt, etc.) |
| cache_ttl_hours | int | 24 | Hours before sitemap cache expires (0 = no TTL) |
| validate_sitemap_lastmod | bool | True | Check sitemap's lastmod and refetch if newer |
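
The two cache parameters interact, so a small sketch may help. Assuming the cache records when it was created and the sitemap's lastmod at that time, a freshness check could look roughly like this. The function `cache_is_fresh` and its arguments are hypothetical names chosen for illustration, not the seeder's internal API.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def cache_is_fresh(
    created_at: datetime,
    ttl_hours: int,
    cached_lastmod: Optional[datetime] = None,
    current_lastmod: Optional[datetime] = None,
    now: Optional[datetime] = None,
) -> bool:
    """Return True if a cached sitemap result can still be used."""
    now = now or datetime.now(timezone.utc)

    # TTL check: ttl_hours == 0 means "no TTL", so the cache never expires by age
    if ttl_hours > 0 and now - created_at > timedelta(hours=ttl_hours):
        return False

    # lastmod check: refetch if the sitemap reports a newer modification time
    if cached_lastmod and current_lastmod and current_lastmod > cached_lastmod:
        return False

    return True


created = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
later = created + timedelta(hours=30)
print(cache_is_fresh(created, ttl_hours=24, now=later))  # expired by TTL
print(cache_is_fresh(created, ttl_hours=0, now=later))   # no TTL, still fresh
```

In this sketch either condition alone is enough to invalidate the cache, which matches the intent of the table: age and sitemap changes are independent reasons to refetch.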
Pattern Matching Examples
# Match all blog posts
config = SeedingConfig(pattern="*/blog/*")
# Match only HTML files
config = SeedingConfig(pattern="*.html")
# Match product pages
config = SeedingConfig(pattern="*/product/*")
# Match everything except admin pages
config = SeedingConfig(pattern="*")
# Then filter: urls = [u for u in urls if "/admin/" not in u["url"]]
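The patterns above are glob-style wildcards. As a rough way to preview what a pattern would match before running a discovery, the standard library's `fnmatch` module can serve as a stand-in. This is a sketch under the assumption that the seeder's matching behaves like ordinary glob matching; the helper `filter_urls` and the sample URLs are made up for illustration.

```python
from fnmatch import fnmatchcase


def filter_urls(urls, include="*", exclude_substrings=()):
    """Keep URLs matching the glob `include` and containing none of the excluded substrings."""
    return [
        u for u in urls
        if fnmatchcase(u, include)
        and not any(s in u for s in exclude_substrings)
    ]


sample = [
    "https://example.com/blog/post-1",
    "https://example.com/admin/login",
    "https://example.com/index.html",
    "https://example.com/product/42",
]

print(filter_urls(sample, include="*/blog/*"))
print(filter_urls(sample, include="*.html"))
print(filter_urls(sample, include="*", exclude_substrings=("/admin/",)))
```

Because glob patterns have no negation, exclusions such as admin pages are applied as a second step, which mirrors the post-filtering comment in the example above.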
URL Validation: Live Checking
Sometimes you need to know if URLs are actually accessible. That's where live checking comes in:
config = SeedingConfig(
    source="sitemap",
    live_check=True,  # Verify each URL is accessible
    concurrency=20    # Check 20 URLs in parallel
)

async with AsyncUrlSeeder() as seeder:
    urls = await seeder.urls("example.com", config)

# Now you can filter by status
live_urls = [u for u in urls if u["status"] == "valid"]
dead_urls = [u for u in urls if u["status"] == "not_valid"]

print(f"Live URLs: {len(live_urls)}")
print(f"Dead URLs: {len(dead_urls)}")
When to use live checking:
- Before a large crawling operation
- When working with older sitemaps
- When data freshness is critical
When to skip it:
- Quick explorations
- When you trust the source
- When speed is more important than accuracy
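The cost of live checking is governed mainly by the concurrency setting. The following sketch shows the general bounded-concurrency pattern with `asyncio.Semaphore`, using a fake checker in place of real HTTP requests. The names `check_all` and `fake_check` are hypothetical, and this is only an illustration of the idea, not how the seeder performs its checks.

```python
import asyncio


async def fake_check(url: str) -> bool:
    """Hypothetical stand-in for an HTTP request; treats URLs containing 'dead' as unreachable."""
    await asyncio.sleep(0)
    return "dead" not in url


async def check_all(urls, check, concurrency=20):
    """Check every URL, running at most `concurrency` checks at the same time."""
    sem = asyncio.Semaphore(concurrency)

    async def check_one(url):
        async with sem:  # Wait here if too many checks are already in flight
            ok = await check(url)
        return {"url": url, "status": "valid" if ok else "not_valid"}

    # gather() returns results in the same order as the input URLs
    return await asyncio.gather(*(check_one(u) for u in urls))


results = asyncio.run(
    check_all(
        ["https://example.com/a", "https://example.com/dead-page"],
        fake_check,
        concurrency=2,
    )
)
print(results)
```

Raising the limit shortens total wall-clock time but puts more simultaneous load on the target site, which is why it is worth tuning alongside the rate limit.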
The Power of Metadata: Head Extraction
This is where URL seeding gets really powerful. Instead of crawling entire pages, you can extract just the metadata:
config = SeedingConfig(
    extract_head=True  # Extract metadata from <head> section
)

async with AsyncUrlSeeder() as seeder:
    urls = await seeder.urls("example.com", config)

# Now each URL has rich metadata
for url in urls[:3]:
    print(f"\nURL: {url['url']}")
    print(f"Title: {url['head_data'].get('title')}")

    meta = url['head_data'].get('meta', {})
    print(f"Description: {meta.get('description')}")
    print(f"Keywords: {meta.get('keywords')}")

    # Even Open Graph data!
    print(f"OG Image: {meta.get('og:image')}")
What Can We Extract?
The head extraction gives you a treasure trove of information:
# Example of extracted head_data
{
"title": "10 Python Tips for Beginners",
"charset": "utf-8",
"lang": "en",
"meta": {
"description": "Learn essential Python tips...",
"keywords": "python, programming, tutorial",
"author": "Jane Developer",
"viewport": "width=device-width, initial-scale=1",
# Open Graph tags
"og:title": "10 Python Tips for Beginners",
"og:description": "Essential Python tips for new programmers",
"og:image": "https://example.com/python-tips.jpg",
"og:type": "article",
# Twitter Card tags
"twitter:card": "summary_large_image",
"twitter:title": "10 Python Tips",
# Dublin Core metadata
"dc.creator": "Jane Developer",
"dc.date": "2024-01-15"
},
"link": {
"canonical": [{"href": "https://example.com/blog/python-tips"}],
"alternate": [{"href": "/feed.xml", "type": "application/rss+xml"}]
},
"jsonld": [
{
"@type": "Article",
"headline": "10 Python Tips for Beginners",
"datePublished": "2024-01-15",
"author": {"@type": "Person", "name": "Jane Developer"}
}
]
}
This metadata is gold for filtering! You can find exactly what you need without crawling a single page.
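As a small illustration of metadata-only filtering, the sketch below selects entries whose Open Graph type is "article" and whose title, description, or keywords mention a given word. It works on plain dictionaries shaped like the head_data example above; the helper name `is_article_about` and the sample entries are assumptions made for this example.

```python
def is_article_about(entry: dict, keyword: str) -> bool:
    """Return True if the entry is an og:type article that mentions `keyword` in its metadata."""
    head = entry.get("head_data") or {}  # head_data may be missing or empty
    meta = head.get("meta", {})
    if meta.get("og:type") != "article":
        return False
    text = " ".join([
        head.get("title") or "",
        meta.get("description") or "",
        meta.get("keywords") or "",
    ]).lower()
    return keyword.lower() in text


entries = [
    {
        "url": "https://example.com/blog/python-tips",
        "head_data": {
            "title": "10 Python Tips for Beginners",
            "meta": {"og:type": "article", "keywords": "python, programming, tutorial"},
        },
    },
    {
        "url": "https://example.com/about",
        "head_data": {"title": "About Us", "meta": {"og:type": "website"}},
    },
    {"url": "https://example.com/no-head", "head_data": {}},
]

matches = [e["url"] for e in entries if is_article_about(e, "tutorial")]
print(matches)
```

Using `.get()` with fallbacks keeps the filter from failing on pages that expose only part of the metadata.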
Smart URL-Based Filtering (No Head Extraction)
When extract_head=False but you still provide a query, the seeder uses intelligent URL-based scoring:
# Fast filtering based on URL structure alone
config = SeedingConfig(
    source="sitemap",
    extract_head=False,  # Don't fetch page metadata
    query="python tutorial async",
    scoring_method="bm25",
    score_threshold=0.3
)

async with AsyncUrlSeeder() as seeder:
    urls = await seeder.urls("example.com", config)

# URLs are scored based on:
# 1. Domain parts matching (e.g., 'python' in python.example.com)
# 2. Path segments (e.g., '/tutorials/python-async/')
# 3. Query parameters (e.g., '?topic=python')
# 4. Fuzzy matching using character n-grams

# Example URL scoring:
# https://example.com/tutorials/python/async-guide.html - High score
# https://example.com/blog/javascript-tips.html - Low score
This approach is much faster than head extraction while still providing intelligent filtering!
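To give a feel for what URL-only scoring has to work with, here is a deliberately simplified sketch that breaks a URL into tokens from its domain, path, and query string, then measures what fraction of the query terms appear among them. It uses exact token matching only, without the fuzzy n-gram matching mentioned above, and the functions `url_tokens` and `overlap_score` are illustrative assumptions rather than the seeder's scoring code.

```python
import re
from urllib.parse import urlparse


def url_tokens(url: str) -> set[str]:
    """Split a URL's domain, path, and query string into lowercase alphanumeric tokens."""
    parsed = urlparse(url.lower())
    raw = " ".join([parsed.netloc, parsed.path, parsed.query])
    return {tok for tok in re.split(r"[^a-z0-9]+", raw) if tok}


def overlap_score(query: str, url: str) -> float:
    """Fraction of query terms that appear as exact tokens in the URL (0.0 to 1.0)."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    tokens = url_tokens(url)
    return sum(term in tokens for term in terms) / len(terms)


query = "python tutorial async"
for url in [
    "https://example.com/tutorials/python/async-guide.html",
    "https://example.com/blog/javascript-tips.html",
]:
    print(f"{overlap_score(query, url):.2f}  {url}")
```

Note that "tutorial" does not match the path segment "tutorials" here, which is exactly the kind of near miss that fuzzy matching is meant to catch.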
Understanding Results
Each URL in the results has this structure:
{
"url": "https://example.com/blog/python-tips.html",
"status": "valid", # "valid", "not_valid", or "unknown"
"head_data": { # Only if extract_head=True
"title": "Page Title",
"meta": {...},
"link": {...},
"jsonld": [...]
},
"relevance_score": 0.85 # Only if using BM25 scoring
}
Let's see a real example:
config = SeedingConfig(
    source="sitemap",
    extract_head=True,
    live_check=True
)

async with AsyncUrlSeeder() as seeder:
    urls = await seeder.urls("blog.example.com", config)

# Analyze the results
for url in urls[:5]:
    print(f"\n{'='*60}")
    print(f"URL: {url['url']}")
    print(f"Status: {url['status']}")

    if url['head_data']:
        data = url['head_data']
        print(f"Title: {data.get('title', 'No title')}")

        # Check content type
        meta = data.get('meta', {})
        content_type = meta.get('og:type', 'unknown')
        print(f"Content Type: {content_type}")

        # Publication date
        pub_date = None
        for jsonld in data.get('jsonld', []):
            if isinstance(jsonld, dict):
                pub_date = jsonld.get('datePublished')
                if pub_date:
                    break

        if pub_date:
            print(f"Published: {pub_date}")

        # Word count (if available)
        word_count = meta.get('word_count')
        if word_count:
            print(f"Word Count: {word_count}")
Smart Filtering with BM25 Scoring
Now for the really cool part - intelligent filtering based on relevance!
Introduction to Relevance Scoring
BM25 is a ranking algorithm that scores how relevant a document is to a search query. With URL seeding, we can score URLs based on their metadata before crawling them.
Think of it like this:
- Traditional way: Read every book in the library to find ones about Python
- Smart way: Check the titles and descriptions, score them, read only the most relevant
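For readers who want to see the idea behind the scores, here is a compact sketch of the classic Okapi BM25 formula applied to short text snippets such as titles and descriptions. It is a textbook version with whitespace tokenization and the common defaults k1=1.5 and b=0.75; the function name `bm25_scores` is an assumption, and the raw scores it produces are not normalized, so they should not be compared directly with the score_threshold values used in this guide.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25 (higher means more relevant)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avgdl = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates with k1 and is normalized by document length via b
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "python async tutorial for beginners",
    "javascript tips and tricks",
    "python packaging guide",
]
for doc, score in zip(docs, bm25_scores("python async tutorial", docs)):
    print(f"{score:.2f}  {doc}")
```

The first snippet matches all three query terms and ranks highest, the third matches only "python", and the second matches nothing and scores zero.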
Query-Based Discovery
Here's how to use BM25 scoring:
config = SeedingConfig(
    source="sitemap",
    extract_head=True,              # Score against page metadata (title, description, ...)
    query="python async tutorial",  # What we're looking for
    scoring_method="bm25",          # Use BM25 algorithm
    score_threshold=0.3             # Minimum relevance score
)

async with AsyncUrlSeeder() as seeder:
    urls = await seeder.urls("realpython.com", config)

# Results are automatically sorted by relevance!
for url in urls[:5]:
    print(f"Score: {url['relevance_score']:.2f} - {url['url']}")
    print(f" Title: {url['head_data']['title']}")
Real Examples
Finding Documentation Pages
# Find API documentation
config = SeedingConfig(
    source="sitemap",
    extract_head=True,
    query="API reference documentation endpoints",
    scoring_method="bm25",
    score_threshold=0.5,
    max_urls=20
)

async with AsyncUrlSeeder() as seeder:
    urls = await seeder.urls("docs.example.com", config)

# The highest scoring URLs will be API docs!
Discovering Product Pages
# Find specific products
config = SeedingConfig(
    source="sitemap+cc",  # Use both sources
    extract_head=True,
    query="wireless headphones noise canceling",
    scoring_method="bm25",
    score_threshold=0.4,
    pattern="*/product/*"  # Combine with pattern matching
)

async with AsyncUrlSeeder() as seeder:
    urls = await seeder.urls("shop.example.com", config)

def price_of(u):
    """Return the product price as a float, or None if missing or unparseable."""
    raw = u['head_data'].get('meta', {}).get('product:price')
    try:
        return float(raw)
    except (TypeError, ValueError):
        return None

# Filter further by price (from metadata); entries without a usable price are skipped
affordable = [
    u for u in urls
    if (price := price_of(u)) is not None and price < 200
]
Filtering News Articles
# Find recent news about AI
from datetime import datetime, timedelta, timezone

config = SeedingConfig(
    source="sitemap",
    extract_head=True,
    query="artificial intelligence machine learning breakthrough",
    scoring_method="bm25",
    score_threshold=0.35
)

async with AsyncUrlSeeder() as seeder:
    urls = await seeder.urls("technews.com", config)

# Filter by date (use timezone-aware datetimes on both sides of the comparison)
recent = []
cutoff = datetime.now(timezone.utc) - timedelta(days=7)
for url in urls:
    # Check JSON-LD for publication date
    for jsonld in url['head_data'].get('jsonld', []):
        if isinstance(jsonld, dict) and 'datePublished' in jsonld:
            pub_date = datetime.fromisoformat(jsonld['datePublished'].replace('Z', '+00:00'))
            if pub_date.tzinfo is None:
                # Dates without an offset are treated as UTC
                pub_date = pub_date.replace(tzinfo=timezone.utc)
            if pub_date > cutoff:
                recent.append(url)
            break
Complex Query Patterns
# Multi-concept queries
queries = [
    "python async await concurrency tutorial",
    "data science pandas numpy visualization",
    "web scraping beautifulsoup selenium automation",
    "machine learning tensorflow keras deep learning"
]

all_tutorials = []
for query in queries:
    config = SeedingConfig(
        source="sitemap",
        extract_head=True,
        query=query,
        scoring_method="bm25",
        score_threshold=0.4,
        max_urls=10  # Top 10 per topic
    )

    async with AsyncUrlSeeder() as seeder:
        urls = await seeder.urls("learning-platform.com", config)
        all_tutorials.extend(urls)

# Remove duplicates while preserving order
seen = set()
unique_tutorials = []
for url in all_tutorials:
    if url['url'] not in seen:
        seen.add(url['url'])
        unique_tutorials.append(url)

print(f"Found {len(unique_tutorials)} unique tutorials across all topics")
Scaling Up: Multiple Domains
When you need to discover URLs across multiple websites, URL seeding really shines.
The many_urls Method
# Discover URLs from multiple domains in parallel
domains = ["site1.com", "site2.com", "site3.com"]

config = SeedingConfig(
    source="sitemap",
    extract_head=True,
    query="python tutorial",
    scoring_method="bm25",
    score_threshold=0.3
)

# Returns a dictionary: {domain: [urls]}
async with AsyncUrlSeeder() as seeder:
    results = await seeder.many_urls(domains, config)

# Process results
for domain, urls in results.items():
    print(f"\n{domain}: Found {len(urls)} relevant URLs")
    if urls:
        top = urls[0]  # Highest scoring
        print(f" Top result: {top['url']}")
        print(f" Score: {top['relevance_score']:.2f}")
Cross-Domain Examples
Competitor Analysis
# Analyze content strategies across competitors
competitors = [
    "competitor1.com",
    "competitor2.com",
    "competitor3.com"
]

config = SeedingConfig(
    source="sitemap",
    extract_head=True,
    pattern="*/blog/*",
    max_urls=100
)

async with AsyncUrlSeeder() as seeder:
    results = await seeder.many_urls(competitors, config)

# Analyze content types
for domain, urls in results.items():
    content_types = {}
    for url in urls:
        # Extract content type from metadata
        og_type = url['head_data'].get('meta', {}).get('og:type', 'unknown')
        content_types[og_type] = content_types.get(og_type, 0) + 1

    print(f"\n{domain} content distribution:")
    for ctype, count in sorted(content_types.items(), key=lambda x: x[1], reverse=True):
        print(f" {ctype}: {count}")
Industry Research
# Research Python tutorials across educational sites
educational_sites = [
    "realpython.com",
    "pythontutorial.net",
    "learnpython.org",
    "python.org"
]

config = SeedingConfig(
    source="sitemap",
    extract_head=True,
    query="beginner python tutorial basics",
    scoring_method="bm25",
    score_threshold=0.3,
    max_urls=20  # Per site
)

async with AsyncUrlSeeder() as seeder:
    results = await seeder.many_urls(educational_sites, config)

# Find the best beginner tutorials
all_tutorials = []
for domain, urls in results.items():
    for url in urls:
        url['domain'] = domain  # Add domain info
        all_tutorials.append(url)

# Sort by relevance across all domains
all_tutorials.sort(key=lambda x: x['relevance_score'], reverse=True)

print("Top 10 Python tutorials for beginners across all sites:")
for i, tutorial in enumerate(all_tutorials[:10], 1):
    print(f"{i}. [{tutorial['relevance_score']:.2f}] {tutorial['head_data']['title']}")
    print(f" {tutorial['url']}")
    print(f" From: {tutorial['domain']}")
Multi-Site Monitoring
# Monitor news about your company across multiple sources
news_sites = [
    "techcrunch.com",
    "theverge.com",
    "wired.com",
    "arstechnica.com"
]

company_name = "YourCompany"

config = SeedingConfig(
    source="cc",  # Common Crawl index (may lag behind the live site)
    extract_head=True,
    query=f"{company_name} announcement news",
    scoring_method="bm25",
    score_threshold=0.5,  # High threshold for relevance
    max_urls=10
)

async with AsyncUrlSeeder() as seeder:
    results = await seeder.many_urls(news_sites, config)

# Collect all mentions
mentions = []
for domain, urls in results.items():
    mentions.extend(urls)

if mentions:
    print(f"Found {len(mentions)} mentions of {company_name}:")
    for mention in mentions:
        print(f"\n- {mention['head_data']['title']}")
        print(f" {mention['url']}")
        print(f" Score: {mention['relevance_score']:.2f}")
else:
    print(f"No mentions of {company_name} found")
Advanced Integration Patterns
Let's put everything together in a real-world example.
Building a Research Assistant
Here's a complete example that discovers, scores, filters, and crawls intelligently:
import asyncio
from datetime import datetime
from crawl4ai import AsyncUrlSeeder, AsyncWebCrawler, SeedingConfig, CrawlerRunConfig
class ResearchAssistant:
    def __init__(self):
        self.seeder = None
    async def __aenter__(self):
        self.seeder = AsyncUrlSeeder()
        await self.seeder.__aenter__()
        return self
    async def __aexit__(self, exc_type, exc_val, exc_tb):
        if self.seeder:
            await self.seeder.__aexit__(exc_type, exc_val, exc_tb)
    async def research_topic(self, topic, domains, max_articles=20):
        """Research a topic across multiple domains."""
        print(f"🔬 Researching '{topic}' across {len(domains)} domains...")
        # Step 1: Discover relevant URLs
        config = SeedingConfig(
            source="sitemap+cc",  # Maximum coverage
            extract_head=True,  # Get metadata
            query=topic,  # Research topic
            scoring_method="bm25",  # Smart scoring
            score_threshold=0.4,  # Quality threshold
            max_urls=10,  # Per domain
            concurrency=20,  # Fast discovery
            verbose=True
        )
        # Discover across all domains
        discoveries = await self.seeder.many_urls(domains, config)
        # Step 2: Collect and rank all articles
        all_articles = []
        for domain, urls in discoveries.items():
            for url in urls:
                url['domain'] = domain
                all_articles.append(url)
        # Sort by relevance
        all_articles.sort(key=lambda x: x['relevance_score'], reverse=True)
        # Take top articles
        top_articles = all_articles[:max_articles]
        print(f"\n📊 Found {len(all_articles)} relevant articles")
        print(f"📌 Selected top {len(top_articles)} for deep analysis")
        # Step 3: Show what we're about to crawl
        print("\n🎯 Articles to analyze:")
        for i, article in enumerate(top_articles[:5], 1):
            print(f"\n{i}. {article['head_data']['title']}")
            print(f" Score: {article['relevance_score']:.2f}")
            print(f" Source: {article['domain']}")
            print(f" URL: {article['url'][:60]}...")
        # Step 4: Crawl the selected articles
        print(f"\n🚀 Deep crawling {len(top_articles)} articles...")
        async with AsyncWebCrawler() as crawler:
            config = CrawlerRunConfig(
                only_text=True,
                word_count_threshold=200,  # Substantial content only
                stream=True
            )
            # Extract URLs and crawl all articles
            article_urls = [article['url'] for article in top_articles]
            # Index seeding metadata by URL so a result with an unexpected URL
            # falls back to defaults instead of raising inside the loop
            articles_by_url = {article['url']: article for article in top_articles}
            results = []
            crawl_results = await crawler.arun_many(article_urls, config=config)
            async for result in crawl_results:
                if result.success:
                    source = articles_by_url.get(result.url, {})
                    results.append({
                        'url': result.url,
                        'title': result.metadata.get('title', 'No title'),
                        'content': result.markdown.raw_markdown,
                        'domain': source.get('domain', 'unknown'),
                        'score': source.get('relevance_score', 0.0)
                    })
                    print(f"✓ Crawled: {result.url[:60]}...")
        # Step 5: Analyze and summarize
        print(f"\n📝 Analysis complete! Crawled {len(results)} articles")
        return self.create_research_summary(topic, results)
    def create_research_summary(self, topic, articles):
        """Create a research summary from crawled articles."""
        summary = {
            'topic': topic,
            'timestamp': datetime.now().isoformat(),
            'total_articles': len(articles),
            'sources': {}
        }
        # Group by domain
        for article in articles:
            domain = article['domain']
            if domain not in summary['sources']:
                summary['sources'][domain] = []
            summary['sources'][domain].append({
                'title': article['title'],
                'url': article['url'],
                'score': article['score'],
                'excerpt': article['content'][:500] + '...' if len(article['content']) > 500 else article['content']
            })
        return summary
# Use the research assistant
async def main():
    async with ResearchAssistant() as assistant:
        # Research Python async programming across multiple sources
        topic = "python asyncio best practices performance optimization"
        domains = [
            "realpython.com",
            "python.org",
            "stackoverflow.com",
            "medium.com"
        ]
        summary = await assistant.research_topic(topic, domains, max_articles=15)
    # Display results
    print("\n" + "="*60)
    print("RESEARCH SUMMARY")
    print("="*60)
    print(f"Topic: {summary['topic']}")
    print(f"Date: {summary['timestamp']}")
    print(f"Total Articles Analyzed: {summary['total_articles']}")
    print("\nKey Findings by Source:")
    for domain, articles in summary['sources'].items():
        print(f"\n📚 {domain} ({len(articles)} articles)")
        for article in articles[:2]:  # Top 2 per domain
            print(f"\n Title: {article['title']}")
            print(f" Relevance: {article['score']:.2f}")
            print(f" Preview: {article['excerpt'][:200]}...")
asyncio.run(main())
Performance Optimization Tips
- Use caching wisely
# First run - populate cache
config = SeedingConfig(source="sitemap", extract_head=True, force=True)
urls = await seeder.urls("example.com", config)
# Subsequent runs - use cache (much faster)
config = SeedingConfig(source="sitemap", extract_head=True, force=False)
urls = await seeder.urls("example.com", config)
- Optimize concurrency
# For many small requests (like HEAD checks)
config = SeedingConfig(concurrency=50, hits_per_sec=20)
# For fewer large requests (like full head extraction)
config = SeedingConfig(concurrency=10, hits_per_sec=5)
- Stream large result sets
# When crawling many URLs
async with AsyncWebCrawler() as crawler:
    # Assuming urls is a list of URL strings and config is a
    # CrawlerRunConfig created with stream=True
    crawl_results = await crawler.arun_many(urls, config=config)
    # Process as they arrive
    async for result in crawl_results:
        process_immediately(result)  # Don't wait for all
- Memory protection for large domains
The seeder uses bounded queues to prevent memory issues when processing domains with millions of URLs:
# Safe for domains with 1M+ URLs
config = SeedingConfig(
source="cc+sitemap",
concurrency=50, # Queue size adapts to concurrency
max_urls=100000 # Process in batches if needed
)
# The seeder automatically manages memory by:
# - Using bounded queues (prevents RAM spikes)
# - Applying backpressure when queue is full
# - Processing URLs as they're discovered
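To make the backpressure idea concrete, here is a minimal sketch using a bounded `asyncio.Queue`. It illustrates the general technique only and is not the seeder's actual implementation. The `producer`, `consumer`, and `run` names, the example URLs, and the queue size of 50 are all assumptions.

```python
import asyncio


async def producer(queue: asyncio.Queue, n_urls: int) -> None:
    for i in range(n_urls):
        # put() suspends while the queue is full: this is the backpressure
        await queue.put(f"https://example.com/page/{i}")
    await queue.put(None)  # Sentinel: no more URLs


async def consumer(queue: asyncio.Queue, stats: dict) -> None:
    while True:
        url = await queue.get()
        if url is None:
            break
        # Items held at once = what is still queued plus the one in hand
        stats["peak"] = max(stats["peak"], queue.qsize() + 1)
        stats["processed"] += 1
        await asyncio.sleep(0)  # Stand-in for real work on the URL


async def run(n_urls: int = 1000, maxsize: int = 50) -> dict:
    queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
    stats = {"peak": 0, "processed": 0}
    await asyncio.gather(producer(queue, n_urls), consumer(queue, stats))
    return stats


stats = asyncio.run(run())
print(stats)
```

However many URLs the producer discovers, the number held in memory at once never exceeds `maxsize`, because the producer pauses until the consumer catches up.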
Best Practices & Tips
Cache Management
The seeder automatically caches results to speed up repeated operations:
- Common Crawl cache: ~/.crawl4ai/seeder_cache/[index]_[domain]_[hash].jsonl
- Sitemap cache: ~/.crawl4ai/seeder_cache/sitemap_[domain]_[hash].json
- HEAD data cache: ~/.cache/url_seeder/head/[hash].json
Smart TTL Cache for Sitemaps
Sitemap caches are validated against both a TTL and the sitemap's lastmod timestamp:
# Default: 24-hour TTL with lastmod validation
config = SeedingConfig(
source="sitemap",
cache_ttl_hours=24, # Cache expires after 24 hours
validate_sitemap_lastmod=True # Also check if sitemap was updated
)
# Aggressive caching (1 week, no lastmod check)
config = SeedingConfig(
source="sitemap",
cache_ttl_hours=168, # 7 days
validate_sitemap_lastmod=False # Trust TTL only
)
# Always validate (no TTL, only lastmod)
config = SeedingConfig(
source="sitemap",
cache_ttl_hours=0, # Disable TTL
validate_sitemap_lastmod=True # Refetch if sitemap has newer lastmod
)
# Always fresh (bypass cache completely)
config = SeedingConfig(
source="sitemap",
force=True # Ignore all caching
)
Cache validation priority:
- force=True → Always refetch
- Cache doesn't exist → Fetch fresh
- validate_sitemap_lastmod=True and sitemap has newer <lastmod> → Refetch
- cache_ttl_hours > 0 and cache is older than TTL → Refetch
- Cache corrupted → Refetch (automatic recovery)
- Otherwise → Use cache
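The priority order above reads as a chain of early returns. The function below is a hypothetical sketch of that decision logic, not the seeder's code. `cache_decision` and its parameters are illustrative names, and it assumes the caller has already determined the cache age and whether the sitemap's lastmod is newer.

```python
def cache_decision(
    *,
    force: bool = False,
    cache_exists: bool = True,
    validate_sitemap_lastmod: bool = True,
    sitemap_is_newer: bool = False,
    cache_ttl_hours: float = 24,
    cache_age_hours: float = 0.0,
    cache_corrupted: bool = False,
) -> str:
    """Apply the checks in the documented order; the first match wins."""
    if force:
        return "refetch: force=True"
    if not cache_exists:
        return "fetch: no cache"
    if validate_sitemap_lastmod and sitemap_is_newer:
        return "refetch: sitemap has newer lastmod"
    if cache_ttl_hours > 0 and cache_age_hours > cache_ttl_hours:
        return "refetch: TTL expired"
    if cache_corrupted:
        return "refetch: cache corrupted"
    return "use cache"


# A 30-hour-old cache with the default 24-hour TTL
print(cache_decision(cache_age_hours=30))
# Same cache with TTL disabled and an unchanged sitemap
print(cache_decision(cache_age_hours=30, cache_ttl_hours=0))
```

With `cache_ttl_hours=0` the age check is skipped entirely, which matches the "Always validate" configuration: only a newer lastmod triggers a refetch.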
Pattern Matching Strategies
# Be specific when possible
good_pattern = "*/blog/2024/*.html" # Specific
bad_pattern = "*" # Too broad
# Combine patterns with metadata filtering
config = SeedingConfig(
pattern="*/articles/*",
extract_head=True
)
urls = await seeder.urls("news.com", config)
# Further filter by publish date, author, category, etc.
recent = [u for u in urls if is_recent(u['head_data'])]
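The `is_recent` helper in the snippet above is left undefined. One possible version is sketched below. It assumes the page exposes an Open Graph `article:published_time` value under `head_data['meta']`, which many sites do not, so treat the key name, the 30-day window, and the sample values as assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_recent(head_data: dict, max_age_days: int = 30,
              now: Optional[datetime] = None) -> bool:
    """Return True if the page reports a publish date within max_age_days."""
    # Assumed key: not every page provides article:published_time
    published = head_data.get("meta", {}).get("article:published_time")
    if not isinstance(published, str) or not published:
        return False
    try:
        # Older Python versions reject a trailing "Z", so normalize it first
        ts = datetime.fromisoformat(published.replace("Z", "+00:00"))
    except ValueError:
        return False
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)  # Treat naive timestamps as UTC
    now = now or datetime.now(timezone.utc)
    return now - ts <= timedelta(days=max_age_days)


# Hypothetical head data, checked against a fixed reference time
reference = datetime(2024, 6, 1, tzinfo=timezone.utc)
fresh = {"meta": {"article:published_time": "2024-05-20T08:00:00Z"}}
stale = {"meta": {"article:published_time": "2023-01-15T08:00:00+00:00"}}
print(is_recent(fresh, now=reference), is_recent(stale, now=reference))
```

Pages with no publish date are treated as not recent here. Flip that default if you would rather keep undated pages.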
Rate Limiting Considerations
# Be respectful of servers
config = SeedingConfig(
hits_per_sec=10, # Max 10 requests per second
concurrency=20 # But use 20 workers
)
# For your own servers
config = SeedingConfig(
hits_per_sec=None, # No limit
concurrency=100 # Go fast
)
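The two settings limit different things: concurrency caps how many requests are in flight, while hits_per_sec caps how often a new one may start. The sketch below shows one common way to combine a semaphore with a simple pacing limiter. It is an illustration under assumed names (`RateLimiter`, `worker`, `run_demo`), not how the seeder is implemented.

```python
import asyncio
import time


class RateLimiter:
    """Spaces out calls so that at most hits_per_sec start per second."""

    def __init__(self, hits_per_sec: float):
        self._interval = 1.0 / hits_per_sec
        self._next_slot = 0.0

    async def wait(self) -> None:
        now = time.monotonic()
        slot = max(now, self._next_slot)  # Reserve the next free start time
        self._next_slot = slot + self._interval
        await asyncio.sleep(slot - now)


async def worker(limiter: RateLimiter, semaphore: asyncio.Semaphore,
                 started: list) -> None:
    async with semaphore:      # Concurrency cap: requests in flight
        await limiter.wait()   # Rate cap: request starts per second
        started.append(time.monotonic())  # A real worker would send the request here


async def run_demo(n_requests: int = 6, hits_per_sec: float = 50,
                   concurrency: int = 3) -> list:
    limiter = RateLimiter(hits_per_sec)
    semaphore = asyncio.Semaphore(concurrency)
    started: list = []
    await asyncio.gather(
        *(worker(limiter, semaphore, started) for _ in range(n_requests))
    )
    return started


started = asyncio.run(run_demo())
print(f"{len(started)} requests started over {started[-1] - started[0]:.3f}s")
```

Raising `concurrency` alone does not make the demo finish sooner, because each start still has to wait for its slot. That is why a high worker count paired with a modest hits_per_sec remains polite to the server.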
Quick Reference
Common Patterns
# Blog post discovery
config = SeedingConfig(
source="sitemap",
pattern="*/blog/*",
extract_head=True,
query="your topic",
scoring_method="bm25"
)
# E-commerce product discovery
config = SeedingConfig(
source="sitemap+cc",
pattern="*/product/*",
extract_head=True,
live_check=True
)
# Documentation search
config = SeedingConfig(
source="sitemap",
pattern="*/docs/*",
extract_head=True,
query="API reference",
scoring_method="bm25",
score_threshold=0.5
)
# News monitoring
config = SeedingConfig(
source="cc",
extract_head=True,
query="company name",
scoring_method="bm25",
max_urls=50
)
Troubleshooting Guide
| Issue | Solution |
|---|---|
| No URLs found | Try source="cc+sitemap", check domain spelling |
| Slow discovery | Reduce concurrency, add hits_per_sec limit |
| Missing metadata | Ensure extract_head=True |
| Low relevance scores | Refine query, lower score_threshold |
| Rate limit errors | Reduce hits_per_sec and concurrency |
| Memory issues with large sites | Use max_urls to limit results, reduce concurrency |
| Connection not closed | Use context manager or call await seeder.close() |
| Stale/outdated URLs | Set cache_ttl_hours=0 or use force=True |
| Cache not updating | Check validate_sitemap_lastmod=True, or use force=True |
| Incomplete URL list | Delete cache file and refetch, or use force=True |
Performance Benchmarks
Typical performance on a standard connection:
- Sitemap discovery: 100-1,000 URLs/second
- Common Crawl discovery: 50-500 URLs/second
- HEAD checking: 10-50 URLs/second
- Head extraction: 5-20 URLs/second
- BM25 scoring: 10,000+ URLs/second
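These ranges translate directly into rough time budgets. The sketch below does the arithmetic using the figures listed above. `estimate_seconds` is a hypothetical helper, and real throughput depends on the target servers and your connection.

```python
# Throughput ranges in URLs/second, copied from the list above
RATES = {
    "sitemap_discovery": (100, 1000),
    "common_crawl_discovery": (50, 500),
    "head_check": (10, 50),
    "head_extraction": (5, 20),
}


def estimate_seconds(n_urls: int, stage: str) -> tuple:
    """Return (best_case, worst_case) seconds to process n_urls at a stage."""
    slow, fast = RATES[stage]
    return n_urls / fast, n_urls / slow


best, worst = estimate_seconds(10_000, "head_extraction")
print(f"Head extraction for 10,000 URLs: {best / 60:.0f}-{worst / 60:.0f} minutes")
```

At 5-20 URLs/second, extracting head data for 10,000 URLs takes roughly 8 to 33 minutes, which is why narrowing the set with pattern and max_urls before enabling extract_head pays off.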
Conclusion
URL seeding transforms web crawling from a blind expedition into a surgical strike. By discovering and analyzing URLs before crawling, you can:
- Save hours of crawling time
- Reduce bandwidth usage by 90%+
- Find exactly what you need
- Scale across multiple domains effortlessly
Whether you're building a research tool, monitoring competitors, or creating a content aggregator, URL seeding gives you the intelligence to crawl smarter, not harder.
Smart URL Filtering
The seeder automatically filters out nonsense URLs that aren't useful for content crawling:
# Enabled by default
config = SeedingConfig(
source="sitemap",
filter_nonsense_urls=True # Default: True
)
# URLs that get filtered:
# - robots.txt, sitemap.xml, ads.txt
# - API endpoints (/api/, /v1/, .json)
# - Media files (.jpg, .mp4, .pdf)
# - Archives (.zip, .tar.gz)
# - Source code (.js, .css)
# - Admin/login pages
# - And many more...
To disable filtering (not recommended):
config = SeedingConfig(
source="sitemap",
filter_nonsense_urls=False # Include ALL URLs
)
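To show the kind of rule involved, here is a small stand-alone sketch of a URL filter built from the categories listed above. It is not the seeder's actual rule set. `looks_like_nonsense`, the constant names, and the choice of `admin` and `login` as path segments are assumptions.

```python
from urllib.parse import urlparse

# Rules drawn from the categories listed above (the real list is longer)
UTILITY_FILES = {"robots.txt", "sitemap.xml", "ads.txt"}
SKIPPED_SUFFIXES = (".json", ".jpg", ".mp4", ".pdf", ".zip", ".tar.gz", ".js", ".css")
SKIPPED_SEGMENTS = {"api", "v1", "admin", "login"}  # Assumed segment names


def looks_like_nonsense(url: str) -> bool:
    """Return True for URLs unlikely to hold crawlable page content."""
    path = urlparse(url).path.lower()
    segments = [s for s in path.split("/") if s]
    if segments and segments[-1] in UTILITY_FILES:
        return True
    if path.endswith(SKIPPED_SUFFIXES):
        return True
    return any(segment in SKIPPED_SEGMENTS for segment in segments)


candidates = [
    "https://example.com/blog/hello-world",
    "https://example.com/robots.txt",
    "https://example.com/api/users",
    "https://example.com/static/app.js",
]
kept = [u for u in candidates if not looks_like_nonsense(u)]
print(kept)
```

Matching whole path segments rather than substrings keeps a page such as `/blog/api-design` from being dropped just because it mentions "api".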
Key Features Summary
- Parallel Sitemap Index Processing: Automatically detects and processes sitemap indexes in parallel
- Memory Protection: Bounded queues prevent RAM issues with large domains (1M+ URLs)
- Context Manager Support: Automatic cleanup with async with statement
- URL-Based Scoring: Smart filtering even without head extraction
- Smart URL Filtering: Automatically excludes utility/nonsense URLs
- Smart TTL Cache: Sitemap caches with TTL expiry and lastmod validation
- Automatic Cache Recovery: Corrupted or incomplete caches are automatically refreshed
Now go forth and seed intelligently!