# Crawl4AI

> Open-source LLM-friendly web crawler and scraper for AI applications

Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. Built with Python and Playwright for high-performance crawling with structured data extraction.

**Key Features:**

- Asynchronous crawling with high concurrency
- Multiple extraction strategies (CSS, XPath, LLM-based)
- Built-in markdown generation with content filtering
- Docker deployment with REST API
- Session management and browser automation
- Advanced anti-detection capabilities

**Quick Links:**

- [GitHub Repository](https://github.com/unclecode/crawl4ai)
- [Documentation](https://docs.crawl4ai.com)
- [Examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples)

---

## Installation

Multiple installation options for different environments and use cases.

### Basic Installation

```bash
# Install core library
pip install crawl4ai

# Initial setup (installs Playwright browsers)
crawl4ai-setup

# Verify installation
crawl4ai-doctor
```

### Quick Verification

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:300])

if __name__ == "__main__":
    asyncio.run(main())
```

**📖 Learn more:** [Basic Usage Guide](https://docs.crawl4ai.com/core/quickstart.md)

### Advanced Features (Optional)

```bash
# PyTorch-based features (text clustering, semantic chunking)
pip install crawl4ai[torch]
crawl4ai-setup

# Transformers (Hugging Face models)
pip install crawl4ai[transformer]
crawl4ai-setup

# All features (large download)
pip install crawl4ai[all]
crawl4ai-setup

# Pre-download models (optional)
crawl4ai-download-models
```

**📖 Learn more:** [Advanced Features Documentation](https://docs.crawl4ai.com/extraction/llm-strategies.md)
### Docker Deployment

```bash
# Pull pre-built image (specify platform for consistency)
docker pull --platform linux/amd64 unclecode/crawl4ai:latest
# For ARM (M1/M2 Macs): docker pull --platform linux/arm64 unclecode/crawl4ai:latest

# Set up environment for LLM support
cat > .llm.env << EOL
OPENAI_API_KEY=sk-your-key
ANTHROPIC_API_KEY=your-anthropic-key
EOL

# Run with LLM support (specify platform)
docker run -d \
  --platform linux/amd64 \
  -p 11235:11235 \
  --name crawl4ai \
  --env-file .llm.env \
  --shm-size=1g \
  unclecode/crawl4ai:latest

# For ARM Macs, use: --platform linux/arm64

# Basic run (no LLM)
docker run -d \
  --platform linux/amd64 \
  -p 11235:11235 \
  --name crawl4ai \
  --shm-size=1g \
  unclecode/crawl4ai:latest
```

**📖 Learn more:** [Complete Docker Guide](https://docs.crawl4ai.com/core/docker-deployment.md)
### Docker Compose

```bash
# Clone repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai

# Copy environment template
cp deploy/docker/.llm.env.example .llm.env
# Edit .llm.env with your API keys

# Run pre-built image
IMAGE=unclecode/crawl4ai:latest docker compose up -d

# Build and run locally
docker compose up --build -d

# Build with all features
INSTALL_TYPE=all docker compose up --build -d

# Stop service
docker compose down
```

**📖 Learn more:** [Docker Compose Configuration](https://docs.crawl4ai.com/core/docker-deployment.md#option-2-using-docker-compose)
### Manual Docker Build

```bash
# Build multi-architecture image (specify platform)
docker buildx build --platform linux/amd64 -t crawl4ai-local:latest --load .
# For ARM: docker buildx build --platform linux/arm64 -t crawl4ai-local:latest --load .

# Build with specific features
docker buildx build \
  --platform linux/amd64 \
  --build-arg INSTALL_TYPE=all \
  --build-arg ENABLE_GPU=false \
  -t crawl4ai-local:latest --load .

# Run custom build (specify platform)
docker run -d \
  --platform linux/amd64 \
  -p 11235:11235 \
  --name crawl4ai-custom \
  --env-file .llm.env \
  --shm-size=1g \
  crawl4ai-local:latest
```

**📖 Learn more:** [Manual Build Guide](https://docs.crawl4ai.com/core/docker-deployment.md#option-3-manual-local-build--run)
### Google Colab

```python
# Install in Colab
!pip install crawl4ai
!crawl4ai-setup

# If setup fails, manually install Playwright browsers
!playwright install chromium

# Install with all features (may take 5-10 minutes)
!pip install crawl4ai[all]
!crawl4ai-setup
!crawl4ai-download-models

# If still having issues, force Playwright install
!playwright install chromium --force

# Quick test
import asyncio
from crawl4ai import AsyncWebCrawler

async def test_crawl():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print("✅ Installation successful!")
        print(f"Content length: {len(result.markdown)}")

# Run test in Colab (notebooks support top-level await)
await test_crawl()
```

**📖 Learn more:** [Colab Examples Notebook](https://colab.research.google.com/github/unclecode/crawl4ai/blob/main/docs/examples/quickstart.ipynb)
### Docker API Usage

```python
# Using the Docker SDK
import asyncio
from crawl4ai.docker_client import Crawl4aiDockerClient
from crawl4ai import BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    async with Crawl4aiDockerClient(base_url="http://localhost:11235") as client:
        results = await client.crawl(
            ["https://example.com"],
            browser_config=BrowserConfig(headless=True),
            crawler_config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        )
        for result in results:
            print(f"Success: {result.success}, Length: {len(result.markdown)}")

asyncio.run(main())
```

**📖 Learn more:** [Docker Client API](https://docs.crawl4ai.com/core/docker-deployment.md#python-sdk)
### Direct API Calls

```python
# REST API example
import requests

payload = {
    "urls": ["https://example.com"],
    "browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
    "crawler_config": {"type": "CrawlerRunConfig", "params": {"cache_mode": "bypass"}}
}

response = requests.post("http://localhost:11235/crawl", json=payload)
print(response.json())
```

**📖 Learn more:** [REST API Reference](https://docs.crawl4ai.com/core/docker-deployment.md#rest-api-examples)
### Health Check

```bash
# Check Docker service
curl http://localhost:11235/health

# Access playground
open http://localhost:11235/playground

# View metrics
curl http://localhost:11235/metrics
```

**📖 Learn more:** [Monitoring & Metrics](https://docs.crawl4ai.com/core/docker-deployment.md#metrics--monitoring)

---

## Simple Crawling

Basic web crawling operations with AsyncWebCrawler, configurations, and response handling.

### Basic Setup

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_config = BrowserConfig()  # Default browser settings
    run_config = CrawlerRunConfig()   # Default crawl settings

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```

### Understanding CrawlResult

```python
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter

config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.6),
        options={"ignore_links": True}
    )
)

result = await crawler.arun("https://example.com", config=config)

# Different content formats
print(result.html)                   # Raw HTML
print(result.cleaned_html)           # Cleaned HTML
print(result.markdown.raw_markdown)  # Raw markdown
print(result.markdown.fit_markdown)  # Filtered markdown

# Status information
print(result.success)      # True/False
print(result.status_code)  # HTTP status (200, 404, etc.)

# Extracted content
print(result.media)  # Images, videos, audio
print(result.links)  # Internal/external links
```

### Basic Configuration Options

```python
run_config = CrawlerRunConfig(
    word_count_threshold=10,          # Min words per block
    exclude_external_links=True,      # Remove external links
    remove_overlay_elements=True,     # Remove popups/modals
    process_iframes=True,             # Process iframe content
    excluded_tags=['form', 'header']  # Skip these tags
)

result = await crawler.arun("https://example.com", config=run_config)
```

### Error Handling

```python
result = await crawler.arun("https://example.com", config=run_config)

if not result.success:
    print(f"Crawl failed: {result.error_message}")
    print(f"Status code: {result.status_code}")
else:
    print(f"Success! Content length: {len(result.markdown)}")
```
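Transient failures (timeouts, rate limits, flaky pages) are common in crawling, so it often pays to retry before giving up. A minimal retry sketch — the helper name, backoff values, and `max_attempts` are illustrative, not part of the Crawl4AI API; it works with any async callable returning an object with a `success` flag:

```python
import asyncio

async def crawl_with_retry(run_once, max_attempts=3, base_delay=1.0):
    """Call an async crawl function until result.success, with exponential backoff."""
    last_result = None
    for attempt in range(max_attempts):
        last_result = await run_once()
        if last_result.success:
            return last_result
        # Back off before the next attempt: base_delay, 2x, 4x, ...
        if attempt < max_attempts - 1:
            await asyncio.sleep(base_delay * (2 ** attempt))
    return last_result

# With Crawl4AI this could be used as:
#   result = await crawl_with_retry(lambda: crawler.arun(url, config=run_config))
```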
### Debugging with Verbose Logging

```python
browser_config = BrowserConfig(verbose=True)

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://example.com")
    # Detailed logging output will be displayed
```

### Complete Example

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def comprehensive_crawl():
    browser_config = BrowserConfig(verbose=True)

    run_config = CrawlerRunConfig(
        # Content filtering
        word_count_threshold=10,
        excluded_tags=['form', 'header', 'nav'],
        exclude_external_links=True,

        # Content processing
        process_iframes=True,
        remove_overlay_elements=True,

        # Cache control
        cache_mode=CacheMode.ENABLED
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )

        if result.success:
            # Display content summary
            print(f"Title: {result.metadata.get('title', 'No title')}")
            print(f"Content: {result.markdown[:500]}...")

            # Process media
            images = result.media.get("images", [])
            print(f"Found {len(images)} images")
            for img in images[:3]:  # First 3 images
                print(f"  - {img.get('src', 'No src')}")

            # Process links
            internal_links = result.links.get("internal", [])
            print(f"Found {len(internal_links)} internal links")
            for link in internal_links[:3]:  # First 3 links
                print(f"  - {link.get('href', 'No href')}")
        else:
            print(f"❌ Crawl failed: {result.error_message}")
            print(f"Status: {result.status_code}")

if __name__ == "__main__":
    asyncio.run(comprehensive_crawl())
```

### Working with Raw HTML and Local Files

```python
# Crawl raw HTML
raw_html = "<html><body><h1>Test</h1><p>Content</p></body></html>"
result = await crawler.arun(f"raw://{raw_html}")

# Crawl local file
result = await crawler.arun("file:///path/to/local/file.html")

# Both return standard CrawlResult objects
print(result.markdown)
```
## Table Extraction

Extract structured data from HTML tables with automatic detection and scoring.

### Basic Table Extraction

```python
import asyncio
import pandas as pd
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def extract_tables():
    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            table_score_threshold=7,  # Higher = stricter detection
            cache_mode=CacheMode.BYPASS
        )

        result = await crawler.arun("https://example.com/tables", config=config)

        if result.success and result.tables:
            # New tables field (v0.6+)
            for i, table in enumerate(result.tables):
                print(f"Table {i+1}:")
                print(f"Headers: {table['headers']}")
                print(f"Rows: {len(table['rows'])}")
                print(f"Caption: {table.get('caption', 'No caption')}")

                # Convert to DataFrame
                df = pd.DataFrame(table['rows'], columns=table['headers'])
                print(df.head())

asyncio.run(extract_tables())
```
### Advanced Table Processing

```python
import re

import pandas as pd
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LXMLWebScrapingStrategy

async def process_financial_tables():
    config = CrawlerRunConfig(
        table_score_threshold=8,  # Strict detection for data tables
        scraping_strategy=LXMLWebScrapingStrategy(),
        keep_data_attributes=True,
        scan_full_page=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://coinmarketcap.com", config=config)

        if result.tables:
            # Get the main data table (usually first/largest)
            main_table = result.tables[0]

            # Create DataFrame
            df = pd.DataFrame(
                main_table['rows'],
                columns=main_table['headers']
            )

            # Clean and process data
            df = clean_financial_data(df)

            # Save for analysis
            df.to_csv("market_data.csv", index=False)
            return df

def clean_financial_data(df):
    """Clean currency symbols, percentages, and large numbers"""
    for col in df.columns:
        if 'price' in col.lower():
            # Remove currency symbols
            df[col] = df[col].str.replace(r'[^\d.]', '', regex=True)
            df[col] = pd.to_numeric(df[col], errors='coerce')

        elif '%' in str(df[col].iloc[0]):
            # Convert percentages
            df[col] = df[col].str.replace('%', '').astype(float) / 100

        elif any(suffix in str(df[col].iloc[0]) for suffix in ['B', 'M', 'K']):
            # Handle large numbers (Billions, Millions, etc.)
            df[col] = df[col].apply(convert_large_numbers)

    return df

def convert_large_numbers(value):
    """Convert 1.5B -> 1500000000"""
    if pd.isna(value):
        return float('nan')

    value = str(value)
    multiplier = 1
    if 'B' in value:
        multiplier = 1e9
    elif 'M' in value:
        multiplier = 1e6
    elif 'K' in value:
        multiplier = 1e3

    number = float(re.sub(r'[^\d.]', '', value))
    return number * multiplier
```
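The suffix-conversion rule used by `convert_large_numbers` is easy to sanity-check in isolation. A standalone sketch of the same logic (no pandas dependency; the function name is illustrative) that doubles as a quick unit test:

```python
import re

def parse_abbreviated_number(value: str) -> float:
    """Convert strings like '1.5B', '$24.3M', or '980K' to plain floats."""
    multiplier = 1
    if 'B' in value:
        multiplier = 1e9
    elif 'M' in value:
        multiplier = 1e6
    elif 'K' in value:
        multiplier = 1e3
    # Strip currency symbols and suffix letters before parsing
    return float(re.sub(r'[^\d.]', '', value)) * multiplier

assert parse_abbreviated_number("1.5B") == 1.5e9
assert parse_abbreviated_number("$24.3M") == 24.3e6
assert parse_abbreviated_number("980K") == 980_000.0
assert parse_abbreviated_number("42") == 42.0
```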
### Table Detection Configuration

```python
# Strict table detection (data-heavy pages)
strict_config = CrawlerRunConfig(
    table_score_threshold=9,         # Only high-quality tables
    word_count_threshold=5,          # Ignore sparse content
    excluded_tags=['nav', 'footer']  # Skip navigation tables
)

# Lenient detection (mixed content pages)
lenient_config = CrawlerRunConfig(
    table_score_threshold=5,  # Include layout tables
    process_iframes=True,     # Check embedded tables
    scan_full_page=True       # Scroll to load dynamic tables
)

# Financial/data site optimization
financial_config = CrawlerRunConfig(
    table_score_threshold=8,
    scraping_strategy=LXMLWebScrapingStrategy(),
    wait_for="css:table",  # Wait for tables to load
    scan_full_page=True,
    scroll_delay=0.2
)
```

### Multi-Table Processing

```python
async def extract_all_tables():
    config = CrawlerRunConfig(table_score_threshold=7)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/data", config=config)

        tables_data = {}

        for i, table in enumerate(result.tables):
            # Create meaningful names based on content
            table_name = (
                table.get('caption') or
                f"table_{i+1}_{table['headers'][0]}"
            ).replace(' ', '_').lower()

            df = pd.DataFrame(table['rows'], columns=table['headers'])

            # Store with metadata
            tables_data[table_name] = {
                'dataframe': df,
                'headers': table['headers'],
                'row_count': len(table['rows']),
                'caption': table.get('caption'),
                'summary': table.get('summary')
            }

        return tables_data

# Usage
tables = await extract_all_tables()
for name, data in tables.items():
    print(f"{name}: {data['row_count']} rows")
    data['dataframe'].to_csv(f"{name}.csv")
```
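The naming expression above can produce awkward keys when captions contain punctuation, since only spaces are replaced. A slightly more defensive slug helper (a sketch, not part of Crawl4AI) that the loop could call instead:

```python
import re

def table_slug(caption, index, headers):
    """Build a filesystem-friendly key for an extracted table."""
    raw = caption or f"table_{index + 1}_{headers[0] if headers else 'untitled'}"
    # Lowercase, then collapse runs of non-alphanumerics into single underscores
    slug = re.sub(r'[^a-z0-9]+', '_', raw.lower()).strip('_')
    return slug or f"table_{index + 1}"

assert table_slug("Market Cap (USD)", 0, []) == "market_cap_usd"
assert table_slug(None, 1, ["Price"]) == "table_2_price"
```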
### Backward Compatibility

```python
# Support both new and old table formats
def get_tables(result):
    # New format (v0.6+)
    if hasattr(result, 'tables') and result.tables:
        return result.tables

    # Fallback to media.tables (older versions)
    return result.media.get('tables', [])

# Usage in existing code
result = await crawler.arun(url, config=config)
tables = get_tables(result)

for table in tables:
    df = pd.DataFrame(table['rows'], columns=table['headers'])
    # Process table data...
```

### Table Quality Scoring

```python
# Understanding table_score_threshold values:
#  10:  Only perfect data tables (headers + data rows)
#  8-9: High-quality tables (recommended for financial/data sites)
#  6-7: Mixed content tables (news sites, wikis)
#  4-5: Layout tables included (broader detection)
#  1-3: All table-like structures (very permissive)

config = CrawlerRunConfig(
    table_score_threshold=8,  # Balanced detection
    verbose=True              # See scoring details in logs
)
```

**📖 Learn more:** [CrawlResult API Reference](https://docs.crawl4ai.com/api/crawl-result/), [Browser & Crawler Configuration](https://docs.crawl4ai.com/core/browser-crawler-config/), [Cache Modes](https://docs.crawl4ai.com/core/cache-modes/)

---
## Browser, Crawler & LLM Configuration

Core configuration classes for controlling browser behavior, crawl operations, LLM providers, and understanding crawl results.

### BrowserConfig - Browser Environment Setup

```python
from crawl4ai import BrowserConfig, AsyncWebCrawler

# Basic browser configuration
browser_config = BrowserConfig(
    browser_type="chromium",  # "chromium", "firefox", "webkit"
    headless=True,            # False for visible browser (debugging)
    viewport_width=1280,
    viewport_height=720,
    verbose=True
)

# Advanced browser setup with proxy and persistence
browser_config = BrowserConfig(
    headless=False,
    proxy="http://user:pass@proxy:8080",
    use_persistent_context=True,
    user_data_dir="./browser_data",
    cookies=[
        {"name": "session", "value": "abc123", "domain": "example.com"}
    ],
    headers={"Accept-Language": "en-US,en;q=0.9"},
    user_agent="Mozilla/5.0 (X11; Linux x86_64) Chrome/116.0.0.0 Safari/537.36",
    text_mode=True,  # Disable images for faster crawling
    extra_args=["--disable-extensions", "--no-sandbox"]
)

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://example.com")
```

### CrawlerRunConfig - Crawl Operation Control

```python
from crawl4ai import CrawlerRunConfig, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter

# Basic crawl configuration
run_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    word_count_threshold=10,
    excluded_tags=["nav", "footer", "script"],
    exclude_external_links=True,
    screenshot=True,
    pdf=True
)

# Advanced content processing
md_generator = DefaultMarkdownGenerator(
    content_filter=PruningContentFilter(threshold=0.6),
    options={"citations": True, "ignore_links": False}
)

run_config = CrawlerRunConfig(
    # Content processing
    markdown_generator=md_generator,
    css_selector="main.content",            # Focus on specific content
    target_elements=[".article", ".post"],  # Multiple target selectors
    process_iframes=True,
    remove_overlay_elements=True,

    # Page interaction
    js_code=[
        "window.scrollTo(0, document.body.scrollHeight);",
        "document.querySelector('.load-more')?.click();"
    ],
    wait_for="css:.content-loaded",
    wait_for_timeout=10000,
    scan_full_page=True,

    # Session management
    session_id="persistent_session",

    # Media handling
    screenshot=True,
    pdf=True,
    capture_mhtml=True,
    image_score_threshold=5,

    # Advanced options
    simulate_user=True,
    magic=True,  # Auto-handle popups
    verbose=True
)
```
### CrawlerRunConfig Parameters by Category

```python
# Content Processing
config = CrawlerRunConfig(
    word_count_threshold=10,                # Min words per content block
    css_selector="main.article",            # Focus on specific content
    target_elements=[".post", ".content"],  # Multiple target selectors
    excluded_tags=["nav", "footer"],        # Remove these tags
    excluded_selector="#ads, .tracker",     # Remove by selector
    only_text=True,                         # Text-only extraction
    keep_data_attributes=True,              # Preserve data-* attributes
    remove_forms=True,                      # Remove all forms
    process_iframes=True                    # Include iframe content
)

# Page Navigation & Timing
config = CrawlerRunConfig(
    wait_until="networkidle",      # Wait condition
    page_timeout=60000,            # 60 second timeout
    wait_for="css:.loaded",        # Wait for specific element
    wait_for_images=True,          # Wait for images to load
    delay_before_return_html=0.5,  # Final delay before capture
    semaphore_count=10             # Max concurrent operations
)

# Page Interaction
config = CrawlerRunConfig(
    js_code="document.querySelector('button').click();",
    scan_full_page=True,           # Auto-scroll page
    scroll_delay=0.3,              # Delay between scrolls
    remove_overlay_elements=True,  # Remove popups/modals
    simulate_user=True,            # Simulate human behavior
    override_navigator=True,       # Override navigator properties
    magic=True                     # Auto-handle common patterns
)

# Caching & Session
config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,      # Cache behavior
    session_id="my_session",          # Persistent session
    shared_data={"context": "value"}  # Share data between hooks
)

# Media & Output
config = CrawlerRunConfig(
    screenshot=True,              # Capture screenshot
    pdf=True,                     # Generate PDF
    capture_mhtml=True,           # Capture MHTML archive
    image_score_threshold=3,      # Filter low-quality images
    exclude_external_images=True  # Remove external images
)

# Link & Domain Filtering
config = CrawlerRunConfig(
    exclude_external_links=True,                # Remove external links
    exclude_social_media_links=True,            # Remove social media links
    exclude_domains=["ads.com", "tracker.io"],  # Custom domain filter
    exclude_internal_links=False                # Keep internal links
)
```
### LLMConfig - Language Model Setup

```python
import os

from crawl4ai import LLMConfig

# OpenAI configuration
llm_config = LLMConfig(
    provider="openai/gpt-4o-mini",
    api_token=os.getenv("OPENAI_API_KEY"),  # or "env:OPENAI_API_KEY"
    temperature=0.1,
    max_tokens=2000
)

# Local model with Ollama
llm_config = LLMConfig(
    provider="ollama/llama3.3",
    api_token=None,  # Not needed for Ollama
    base_url="http://localhost:11434"  # Custom endpoint
)

# Anthropic Claude
llm_config = LLMConfig(
    provider="anthropic/claude-3-5-sonnet-20240620",
    api_token="env:ANTHROPIC_API_KEY",
    max_tokens=4000
)

# Google Gemini
llm_config = LLMConfig(
    provider="gemini/gemini-1.5-pro",
    api_token="env:GEMINI_API_KEY"
)

# Groq (fast inference)
llm_config = LLMConfig(
    provider="groq/llama3-70b-8192",
    api_token="env:GROQ_API_KEY"
)
```
### CrawlResult - Understanding Output

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def inspect_result(run_config: CrawlerRunConfig):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=run_config)

        # Basic status information
        print(f"Success: {result.success}")
        print(f"Status: {result.status_code}")
        print(f"URL: {result.url}")

        if not result.success:
            print(f"Error: {result.error_message}")
            return

        # HTML content variants
        print(f"Original HTML: {len(result.html)} chars")
        print(f"Cleaned HTML: {len(result.cleaned_html or '')} chars")

        # Markdown output (MarkdownGenerationResult)
        if result.markdown:
            print(f"Raw markdown: {len(result.markdown.raw_markdown)} chars")
            print(f"With citations: {len(result.markdown.markdown_with_citations)} chars")

            # Filtered content (if a content filter was used)
            if result.markdown.fit_markdown:
                print(f"Fit markdown: {len(result.markdown.fit_markdown)} chars")
                print(f"Fit HTML: {len(result.markdown.fit_html)} chars")

        # Extracted structured data
        if result.extracted_content:
            import json
            data = json.loads(result.extracted_content)
            print(f"Extracted {len(data)} items")

        # Media and links
        images = result.media.get("images", [])
        print(f"Found {len(images)} images")
        for img in images[:3]:  # First 3 images
            print(f"  {img.get('src')} (score: {img.get('score', 0)})")

        internal_links = result.links.get("internal", [])
        external_links = result.links.get("external", [])
        print(f"Links: {len(internal_links)} internal, {len(external_links)} external")

        # Generated files
        if result.screenshot:
            print(f"Screenshot captured: {len(result.screenshot)} chars (base64)")
            # Save screenshot
            import base64
            with open("page.png", "wb") as f:
                f.write(base64.b64decode(result.screenshot))

        if result.pdf:
            print(f"PDF generated: {len(result.pdf)} bytes")
            with open("page.pdf", "wb") as f:
                f.write(result.pdf)

        if result.mhtml:
            print(f"MHTML captured: {len(result.mhtml)} chars")
            with open("page.mhtml", "w", encoding="utf-8") as f:
                f.write(result.mhtml)

        # SSL certificate information
        if result.ssl_certificate:
            print(f"SSL Issuer: {result.ssl_certificate.issuer}")
            print(f"Valid until: {result.ssl_certificate.valid_until}")

        # Network and console data (if captured)
        if result.network_requests:
            requests = [r for r in result.network_requests if r.get("event_type") == "request"]
            print(f"Network requests captured: {len(requests)}")

        if result.console_messages:
            errors = [m for m in result.console_messages if m.get("type") == "error"]
            print(f"Console messages: {len(result.console_messages)} ({len(errors)} errors)")

        # Session and metadata
        if result.session_id:
            print(f"Session ID: {result.session_id}")

        if result.metadata:
            print(f"Metadata: {result.metadata.get('title', 'No title')}")
```
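When logging many crawls, it helps to reduce each result to a small dict before storing or printing it. A hedged sketch — the helper is not part of Crawl4AI; it only reads the fields documented above and works with any object exposing them:

```python
def summarize_result(result) -> dict:
    """Condense the commonly used CrawlResult fields into one dict."""
    summary = {
        "url": getattr(result, "url", None),
        "success": getattr(result, "success", False),
        "status_code": getattr(result, "status_code", None),
        "images": len(getattr(result, "media", {}).get("images", [])),
        "internal_links": len(getattr(result, "links", {}).get("internal", [])),
    }
    markdown = getattr(result, "markdown", None)
    summary["markdown_chars"] = len(markdown.raw_markdown) if markdown else 0
    return summary
```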
### Configuration Helpers and Best Practices

```python
# Clone configurations for variations
base_config = CrawlerRunConfig(
    cache_mode=CacheMode.ENABLED,
    word_count_threshold=200,
    verbose=True
)

# Create streaming version
stream_config = base_config.clone(
    stream=True,
    cache_mode=CacheMode.BYPASS
)

# Create debug version with a longer timeout
debug_config = base_config.clone(
    page_timeout=120000,
    verbose=True
)

# Serialize/deserialize configurations
config_dict = base_config.dump()  # Convert to dict
restored_config = CrawlerRunConfig.load(config_dict)  # Restore from dict

# Browser configuration management
# (headless is a BrowserConfig option, so debug it on the browser side)
browser_config = BrowserConfig(headless=True, text_mode=True)
browser_dict = browser_config.to_dict()
cloned_browser = browser_config.clone(headless=False, verbose=True)
```
### Common Configuration Patterns

```python
# Fast text-only crawling
# (pair with BrowserConfig(text_mode=True) to disable image loading)
fast_config = CrawlerRunConfig(
    cache_mode=CacheMode.ENABLED,
    exclude_external_links=True,
    exclude_external_images=True,
    word_count_threshold=50
)

# Comprehensive data extraction
comprehensive_config = CrawlerRunConfig(
    process_iframes=True,
    scan_full_page=True,
    wait_for_images=True,
    screenshot=True,
    capture_network_requests=True,
    capture_console_messages=True,
    magic=True
)

# Stealth crawling
stealth_config = CrawlerRunConfig(
    simulate_user=True,
    override_navigator=True,
    mean_delay=2.0,
    max_range=1.0,
    user_agent_mode="random"
)
```

**📖 Learn more:** [Complete Parameter Reference](https://docs.crawl4ai.com/api/parameters/), [Content Filtering](https://docs.crawl4ai.com/core/markdown-generation/), [Session Management](https://docs.crawl4ai.com/advanced/session-management/), [Network Capture](https://docs.crawl4ai.com/advanced/network-console-capture/)

---
|
||
|
||
|
||
## Extraction Strategies

Powerful data extraction from web pages using LLM-based intelligent parsing or fast schema/pattern-based approaches.

### LLM-Based Extraction - Intelligent Content Understanding

```python
import os
import asyncio
import json
from pydantic import BaseModel, Field
from typing import List
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Define structured data model
class Product(BaseModel):
    name: str = Field(description="Product name")
    price: str = Field(description="Product price")
    description: str = Field(description="Product description")
    features: List[str] = Field(description="List of product features")
    rating: float = Field(description="Product rating out of 5")

# Configure LLM provider
llm_config = LLMConfig(
    provider="openai/gpt-4o-mini",  # or "ollama/llama3.3", "anthropic/claude-3-5-sonnet"
    api_token=os.getenv("OPENAI_API_KEY"),  # or "env:OPENAI_API_KEY"
    temperature=0.1,
    max_tokens=2000
)

# Create LLM extraction strategy
llm_strategy = LLMExtractionStrategy(
    llm_config=llm_config,
    schema=Product.model_json_schema(),
    extraction_type="schema",  # or "block" for freeform text
    instruction="""
    Extract product information from the webpage content.
    Focus on finding complete product details including:
    - Product name and price
    - Detailed description
    - All listed features
    - Customer rating if available
    Return a valid JSON array of products.
    """,
    chunk_token_threshold=1200,  # Split content if too large
    overlap_rate=0.1,  # 10% overlap between chunks
    apply_chunking=True,  # Enable automatic chunking
    input_format="markdown",  # "html", "fit_markdown", or "markdown"
    extra_args={"temperature": 0.0, "max_tokens": 800},
    verbose=True
)

async def extract_with_llm():
    browser_config = BrowserConfig(headless=True)

    crawl_config = CrawlerRunConfig(
        extraction_strategy=llm_strategy,
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=10
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com/products",
            config=crawl_config
        )

        if result.success:
            # Parse extracted JSON
            products = json.loads(result.extracted_content)
            print(f"Extracted {len(products)} products")

            for product in products[:3]:  # Show first 3
                print(f"Product: {product['name']}")
                print(f"Price: {product['price']}")
                print(f"Rating: {product.get('rating', 'N/A')}")

            # Show token usage and cost
            llm_strategy.show_usage()
        else:
            print(f"Extraction failed: {result.error_message}")

asyncio.run(extract_with_llm())
```

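The `chunk_token_threshold` / `overlap_rate` parameters above describe a sliding-window split: each chunk holds at most the threshold number of tokens, and consecutive chunks share a fraction of tokens for context. A rough sketch of that splitting logic (illustrative only; the library's actual chunker differs in detail):

```python
def chunk_tokens(tokens, threshold=1200, overlap_rate=0.1):
    """Split a token list into chunks of at most `threshold` tokens,
    where consecutive chunks share `overlap_rate * threshold` tokens."""
    overlap = int(threshold * overlap_rate)
    step = threshold - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + threshold])
        if start + threshold >= len(tokens):
            break
    return chunks

tokens = list(range(3000))  # pretend each int is one token
chunks = chunk_tokens(tokens, threshold=1200, overlap_rate=0.1)
print(len(chunks))      # 3
print(len(chunks[0]))   # 1200
# neighbouring chunks share 120 tokens (10% of 1200)
print(chunks[0][-120:] == chunks[1][:120])  # True
```

Higher `overlap_rate` preserves more cross-chunk context at the cost of re-sending tokens to the LLM, which is why the advanced example below raises it to 0.15 for complex content.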
### LLM Strategy Advanced Configuration

```python
# Multiple provider configurations
providers = {
    "openai": LLMConfig(
        provider="openai/gpt-4o",
        api_token="env:OPENAI_API_KEY",
        temperature=0.1
    ),
    "anthropic": LLMConfig(
        provider="anthropic/claude-3-5-sonnet-20240620",
        api_token="env:ANTHROPIC_API_KEY",
        max_tokens=4000
    ),
    "ollama": LLMConfig(
        provider="ollama/llama3.3",
        api_token=None,  # Not needed for Ollama
        base_url="http://localhost:11434"
    ),
    "groq": LLMConfig(
        provider="groq/llama3-70b-8192",
        api_token="env:GROQ_API_KEY"
    )
}

# Advanced chunking for large content
large_content_strategy = LLMExtractionStrategy(
    llm_config=providers["openai"],
    schema=YourModel.model_json_schema(),
    extraction_type="schema",
    instruction="Extract detailed information...",

    # Chunking parameters
    chunk_token_threshold=2000,  # Larger chunks for complex content
    overlap_rate=0.15,  # More overlap for context preservation
    apply_chunking=True,

    # Input format selection
    input_format="fit_markdown",  # Use filtered content if available

    # LLM parameters
    extra_args={
        "temperature": 0.0,  # Deterministic output
        "top_p": 0.9,
        "frequency_penalty": 0.1,
        "presence_penalty": 0.1,
        "max_tokens": 1500
    },
    verbose=True
)

# Knowledge graph extraction
class Entity(BaseModel):
    name: str
    type: str  # "person", "organization", "location", etc.
    description: str

class Relationship(BaseModel):
    source: str
    target: str
    relationship: str
    confidence: float

class KnowledgeGraph(BaseModel):
    entities: List[Entity]
    relationships: List[Relationship]
    summary: str

knowledge_strategy = LLMExtractionStrategy(
    llm_config=providers["anthropic"],
    schema=KnowledgeGraph.model_json_schema(),
    extraction_type="schema",
    instruction="""
    Create a knowledge graph from the content by:
    1. Identifying key entities (people, organizations, locations, concepts)
    2. Finding relationships between entities
    3. Providing confidence scores for relationships
    4. Summarizing the main topics
    """,
    input_format="html",  # Use HTML for better structure preservation
    apply_chunking=True,
    chunk_token_threshold=1500
)
```

### JSON CSS Extraction - Fast Schema-Based Extraction

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Basic CSS extraction schema
simple_schema = {
    "name": "Product Listings",
    "baseSelector": "div.product-card",
    "fields": [
        {
            "name": "title",
            "selector": "h2.product-title",
            "type": "text"
        },
        {
            "name": "price",
            "selector": ".price",
            "type": "text"
        },
        {
            "name": "image_url",
            "selector": "img.product-image",
            "type": "attribute",
            "attribute": "src"
        },
        {
            "name": "product_url",
            "selector": "a.product-link",
            "type": "attribute",
            "attribute": "href"
        }
    ]
}

# Complex nested schema with multiple data types
complex_schema = {
    "name": "E-commerce Product Catalog",
    "baseSelector": "div.category",
    "baseFields": [
        {
            "name": "category_id",
            "type": "attribute",
            "attribute": "data-category-id"
        },
        {
            "name": "category_url",
            "type": "attribute",
            "attribute": "data-url"
        }
    ],
    "fields": [
        {
            "name": "category_name",
            "selector": "h2.category-title",
            "type": "text"
        },
        {
            "name": "products",
            "selector": "div.product",
            "type": "nested_list",  # Array of complex objects
            "fields": [
                {
                    "name": "name",
                    "selector": "h3.product-name",
                    "type": "text",
                    "default": "Unknown Product"
                },
                {
                    "name": "price",
                    "selector": "span.price",
                    "type": "text"
                },
                {
                    "name": "details",
                    "selector": "div.product-details",
                    "type": "nested",  # Single complex object
                    "fields": [
                        {
                            "name": "brand",
                            "selector": "span.brand",
                            "type": "text"
                        },
                        {
                            "name": "model",
                            "selector": "span.model",
                            "type": "text"
                        },
                        {
                            "name": "specs",
                            "selector": "div.specifications",
                            "type": "html"  # Preserve HTML structure
                        }
                    ]
                },
                {
                    "name": "features",
                    "selector": "ul.features li",
                    "type": "list",  # Simple array of strings
                    "fields": [
                        {"name": "feature", "type": "text"}
                    ]
                },
                {
                    "name": "reviews",
                    "selector": "div.review",
                    "type": "nested_list",
                    "fields": [
                        {
                            "name": "reviewer",
                            "selector": "span.reviewer-name",
                            "type": "text"
                        },
                        {
                            "name": "rating",
                            "selector": "span.rating",
                            "type": "attribute",
                            "attribute": "data-rating"
                        },
                        {
                            "name": "comment",
                            "selector": "p.review-text",
                            "type": "text"
                        },
                        {
                            "name": "date",
                            "selector": "time.review-date",
                            "type": "attribute",
                            "attribute": "datetime"
                        }
                    ]
                }
            ]
        }
    ]
}

async def extract_with_css_schema():
    strategy = JsonCssExtractionStrategy(complex_schema, verbose=True)

    config = CrawlerRunConfig(
        extraction_strategy=strategy,
        cache_mode=CacheMode.BYPASS,
        # Enable dynamic content loading if needed
        js_code="window.scrollTo(0, document.body.scrollHeight);",
        wait_for="css:.product:nth-child(10)",  # Wait for products to load
        process_iframes=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/catalog",
            config=config
        )

        if result.success:
            data = json.loads(result.extracted_content)
            print(f"Extracted {len(data)} categories")

            for category in data:
                print(f"Category: {category['category_name']}")
                print(f"Products: {len(category.get('products', []))}")

                # Show first product details
                if category.get('products'):
                    product = category['products'][0]
                    print(f"  First product: {product.get('name')}")
                    print(f"  Features: {len(product.get('features', []))}")
                    print(f"  Reviews: {len(product.get('reviews', []))}")

asyncio.run(extract_with_css_schema())
```

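Schemas like the ones above are easy to get subtly wrong: a `type: "attribute"` field missing its `attribute` key, or a typo in a field type, fails silently at crawl time. A small validator for this schema shape can catch such mistakes up front. This is purely illustrative, not a library feature:

```python
VALID_TYPES = {"text", "attribute", "html", "list", "nested", "nested_list"}

def validate_schema(schema, path="schema"):
    """Return a list of problems found in a JSON-CSS style schema dict."""
    errors = []
    if "baseSelector" not in schema:
        errors.append(f"{path}: missing 'baseSelector'")
    for i, field in enumerate(schema.get("fields", [])):
        fpath = f"{path}.fields[{i}]"
        ftype = field.get("type")
        if ftype not in VALID_TYPES:
            errors.append(f"{fpath}: unknown type {ftype!r}")
        if ftype == "attribute" and "attribute" not in field:
            errors.append(f"{fpath}: type 'attribute' requires an 'attribute' key")
        if ftype in {"list", "nested", "nested_list"}:
            # nested structures carry their own field lists; recurse into them
            sub = {"baseSelector": field.get("selector", ""),
                   "fields": field.get("fields", [])}
            errors.extend(validate_schema(sub, fpath))
    return errors

bad_schema = {
    "baseSelector": "div.product",
    "fields": [
        {"name": "image", "selector": "img", "type": "attribute"},  # missing 'attribute'
        {"name": "title", "selector": "h2", "type": "txet"},        # typo in type
    ],
}
for problem in validate_schema(bad_schema):
    print(problem)
```

Running the validator before a large batch crawl is cheap insurance against a schema that extracts nothing.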
### Automatic Schema Generation - One-Time LLM, Unlimited Use

```python
import json
import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def generate_and_use_schema():
    """
    1. Use LLM once to generate schema from sample HTML
    2. Cache the schema for reuse
    3. Use cached schema for fast extraction without LLM calls
    """

    cache_dir = Path("./schema_cache")
    cache_dir.mkdir(exist_ok=True)
    schema_file = cache_dir / "ecommerce_schema.json"

    # Step 1: Generate or load cached schema
    if schema_file.exists():
        schema = json.load(schema_file.open())
        print("Using cached schema")
    else:
        print("Generating schema using LLM...")

        # Configure LLM for schema generation
        llm_config = LLMConfig(
            provider="openai/gpt-4o",  # or "ollama/llama3.3" for local
            api_token="env:OPENAI_API_KEY"
        )

        # Get sample HTML from target site
        async with AsyncWebCrawler() as crawler:
            sample_result = await crawler.arun(
                url="https://example.com/products",
                config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
            )
            sample_html = sample_result.cleaned_html[:5000]  # Use first 5k chars

        # Generate schema automatically (ONE-TIME LLM COST)
        schema = JsonCssExtractionStrategy.generate_schema(
            html=sample_html,
            schema_type="css",
            llm_config=llm_config,
            instruction="Extract product information including name, price, description, and features"
        )

        # Cache schema for future use (NO MORE LLM CALLS)
        json.dump(schema, schema_file.open("w"), indent=2)
        print("Schema generated and cached")

    # Step 2: Use schema for fast extraction (NO LLM CALLS)
    strategy = JsonCssExtractionStrategy(schema, verbose=True)

    config = CrawlerRunConfig(
        extraction_strategy=strategy,
        cache_mode=CacheMode.BYPASS
    )

    # Step 3: Extract from multiple pages using same schema
    urls = [
        "https://example.com/products",
        "https://example.com/electronics",
        "https://example.com/books"
    ]

    async with AsyncWebCrawler() as crawler:
        for url in urls:
            result = await crawler.arun(url=url, config=config)

            if result.success:
                data = json.loads(result.extracted_content)
                print(f"{url}: Extracted {len(data)} items")
            else:
                print(f"{url}: Failed - {result.error_message}")

asyncio.run(generate_and_use_schema())
```

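The generate-once-then-cache flow above is an instance of a general pattern: pay an expensive call once, persist the result as JSON, and serve it from disk on later runs. A generic sketch of that helper (the file names and the `expensive_schema_generator` stand-in are hypothetical):

```python
import json
import tempfile
from pathlib import Path

def cached_json(cache_file: Path, generate):
    """Load `cache_file` if it exists; otherwise call `generate()`,
    persist its result as JSON, and return it."""
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    result = generate()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(result, indent=2))
    return result

calls = []
def expensive_schema_generator():
    calls.append(1)  # stands in for a paid LLM call
    return {"baseSelector": "div.product", "fields": []}

cache = Path(tempfile.mkdtemp()) / "demo_schema.json"
first = cached_json(cache, expensive_schema_generator)
second = cached_json(cache, expensive_schema_generator)  # served from disk
print(len(calls))  # 1 -- the generator ran only once
```

The same helper works for cached regex patterns later in this document, since both are plain JSON-serializable dicts.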
### XPath Extraction Strategy

```python
from crawl4ai import LLMConfig
from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy

# XPath-based schema (alternative to CSS)
xpath_schema = {
    "name": "News Articles",
    "baseSelector": "//article[@class='news-item']",
    "baseFields": [
        {
            "name": "article_id",
            "type": "attribute",
            "attribute": "data-id"
        }
    ],
    "fields": [
        {
            "name": "headline",
            "selector": ".//h2[@class='headline']",
            "type": "text"
        },
        {
            "name": "author",
            "selector": ".//span[@class='author']/text()",
            "type": "text"
        },
        {
            "name": "publish_date",
            "selector": ".//time/@datetime",
            "type": "text"
        },
        {
            "name": "content",
            "selector": ".//div[@class='article-body']",
            "type": "html"
        },
        {
            "name": "tags",
            "selector": ".//div[@class='tags']/span[@class='tag']",
            "type": "list",
            "fields": [
                {"name": "tag", "type": "text"}
            ]
        }
    ]
}

# Generate XPath schema automatically
async def generate_xpath_schema():
    llm_config = LLMConfig(provider="ollama/llama3.3", api_token=None)

    sample_html = """
    <article class="news-item" data-id="123">
        <h2 class="headline">Breaking News</h2>
        <span class="author">John Doe</span>
        <time datetime="2024-01-01">Today</time>
        <div class="article-body"><p>Content here...</p></div>
    </article>
    """

    schema = JsonXPathExtractionStrategy.generate_schema(
        html=sample_html,
        schema_type="xpath",
        llm_config=llm_config
    )

    return schema

# Use XPath strategy
xpath_strategy = JsonXPathExtractionStrategy(xpath_schema, verbose=True)
```

### Regex Extraction Strategy - Pattern-Based Fast Extraction

```python
import asyncio
import json
from pathlib import Path
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import RegexExtractionStrategy

# Built-in patterns for common data types
async def extract_with_builtin_patterns():
    # Use multiple built-in patterns
    strategy = RegexExtractionStrategy(
        pattern=(
            RegexExtractionStrategy.Email |
            RegexExtractionStrategy.PhoneUS |
            RegexExtractionStrategy.Url |
            RegexExtractionStrategy.Currency |
            RegexExtractionStrategy.DateIso
        )
    )

    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/contact",
            config=config
        )

        if result.success:
            matches = json.loads(result.extracted_content)

            # Group by pattern type
            by_type = {}
            for match in matches:
                label = match['label']
                if label not in by_type:
                    by_type[label] = []
                by_type[label].append(match['value'])

            for pattern_type, values in by_type.items():
                print(f"{pattern_type}: {len(values)} matches")
                for value in values[:3]:  # Show first 3
                    print(f"  {value}")

# Custom regex patterns
custom_patterns = {
    "product_code": r"SKU-\d{4,6}",
    "discount": r"\d{1,2}%\s*off",
    "model_number": r"Model:\s*([A-Z0-9-]+)"
}

async def extract_with_custom_patterns():
    strategy = RegexExtractionStrategy(custom=custom_patterns)

    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/products",
            config=config
        )

        if result.success:
            data = json.loads(result.extracted_content)
            for item in data:
                print(f"{item['label']}: {item['value']}")

# LLM-generated patterns (one-time cost)
async def generate_custom_patterns():
    cache_file = Path("./patterns/price_patterns.json")

    if cache_file.exists():
        patterns = json.load(cache_file.open())
    else:
        llm_config = LLMConfig(
            provider="openai/gpt-4o-mini",
            api_token="env:OPENAI_API_KEY"
        )

        # Get sample content
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun("https://example.com/pricing")
            sample_html = result.cleaned_html

        # Generate optimized patterns
        patterns = RegexExtractionStrategy.generate_pattern(
            label="pricing_info",
            html=sample_html,
            query="Extract all pricing information including discounts and special offers",
            llm_config=llm_config
        )

        # Cache for reuse
        cache_file.parent.mkdir(exist_ok=True)
        json.dump(patterns, cache_file.open("w"), indent=2)

    # Use cached patterns (no more LLM calls)
    strategy = RegexExtractionStrategy(custom=patterns)
    return strategy

asyncio.run(extract_with_builtin_patterns())
asyncio.run(extract_with_custom_patterns())
```

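Conceptually, the regex strategy runs a set of labeled patterns over the page text and returns `{label, value}` matches, which the example above then groups by label. That core loop is easy to sketch with the standard `re` module. The patterns below are simplified illustrations, not the library's built-ins:

```python
import re

# Simplified stand-ins for built-in patterns such as Email or Currency
PATTERNS = {
    "email": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "currency": r"\$\d+(?:\.\d{2})?",
    "sku": r"SKU-\d{4,6}",  # custom pattern, as in the example above
}

def regex_extract(text, patterns):
    """Return a list of {'label', 'value'} dicts for every match."""
    matches = []
    for label, pattern in patterns.items():
        for m in re.finditer(pattern, text):
            matches.append({"label": label, "value": m.group(0)})
    return matches

page = "Contact sales@example.com about SKU-12345, now $19.99 (was $25.00)."
for item in regex_extract(page, PATTERNS):
    print(f"{item['label']}: {item['value']}")
```

Because this is plain pattern matching with no DOM parsing or LLM calls, it is the fastest strategy in this document, at the cost of only finding what the patterns anticipate.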
### Complete Extraction Workflow - Combining Strategies

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import (
    JsonCssExtractionStrategy,
    RegexExtractionStrategy,
    LLMExtractionStrategy
)

async def multi_strategy_extraction():
    """
    Demonstrate using multiple extraction strategies in sequence:
    1. Fast regex for common patterns
    2. Schema-based for structured data
    3. LLM for complex reasoning
    """

    browser_config = BrowserConfig(headless=True)

    # Strategy 1: Fast regex extraction
    regex_strategy = RegexExtractionStrategy(
        pattern=RegexExtractionStrategy.Email | RegexExtractionStrategy.PhoneUS
    )

    # Strategy 2: Schema-based structured extraction
    product_schema = {
        "name": "Products",
        "baseSelector": "div.product",
        "fields": [
            {"name": "name", "selector": "h3", "type": "text"},
            {"name": "price", "selector": ".price", "type": "text"},
            {"name": "rating", "selector": ".rating", "type": "attribute", "attribute": "data-rating"}
        ]
    }
    css_strategy = JsonCssExtractionStrategy(product_schema)

    # Strategy 3: LLM for complex analysis
    llm_strategy = LLMExtractionStrategy(
        llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token="env:OPENAI_API_KEY"),
        schema={
            "type": "object",
            "properties": {
                "sentiment": {"type": "string"},
                "key_topics": {"type": "array", "items": {"type": "string"}},
                "summary": {"type": "string"}
            }
        },
        extraction_type="schema",
        instruction="Analyze the content sentiment, extract key topics, and provide a summary"
    )

    url = "https://example.com/product-reviews"

    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Extract contact info with regex
        regex_config = CrawlerRunConfig(extraction_strategy=regex_strategy)
        regex_result = await crawler.arun(url=url, config=regex_config)

        # Extract structured product data
        css_config = CrawlerRunConfig(extraction_strategy=css_strategy)
        css_result = await crawler.arun(url=url, config=css_config)

        # Extract insights with LLM
        llm_run_config = CrawlerRunConfig(extraction_strategy=llm_strategy)
        llm_result = await crawler.arun(url=url, config=llm_run_config)

        # Combine results
        results = {
            "contacts": json.loads(regex_result.extracted_content) if regex_result.success else [],
            "products": json.loads(css_result.extracted_content) if css_result.success else [],
            "analysis": json.loads(llm_result.extracted_content) if llm_result.success else {}
        }

        print(f"Found {len(results['contacts'])} contact entries")
        print(f"Found {len(results['products'])} products")
        print(f"Sentiment: {results['analysis'].get('sentiment', 'N/A')}")

        return results

# Performance comparison
async def compare_extraction_performance():
    """Compare speed and accuracy of different strategies"""
    import time

    url = "https://example.com/large-catalog"

    strategies = {
        "regex": RegexExtractionStrategy(pattern=RegexExtractionStrategy.Currency),
        "css": JsonCssExtractionStrategy({
            "name": "Prices",
            "baseSelector": ".price",
            "fields": [{"name": "amount", "selector": "span", "type": "text"}]
        }),
        "llm": LLMExtractionStrategy(
            llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token="env:OPENAI_API_KEY"),
            instruction="Extract all prices from the content",
            extraction_type="block"
        )
    }

    async with AsyncWebCrawler() as crawler:
        for name, strategy in strategies.items():
            start_time = time.time()

            config = CrawlerRunConfig(extraction_strategy=strategy)
            result = await crawler.arun(url=url, config=config)

            duration = time.time() - start_time

            if result.success:
                data = json.loads(result.extracted_content)
                print(f"{name}: {len(data)} items in {duration:.2f}s")
            else:
                print(f"{name}: Failed in {duration:.2f}s")

asyncio.run(multi_strategy_extraction())
asyncio.run(compare_extraction_performance())
```

### Best Practices and Strategy Selection

```python
# Strategy selection guide
def choose_extraction_strategy(use_case):
    """
    Guide for selecting the right extraction strategy
    """

    strategies = {
        # Fast pattern matching for common data types
        "contact_info": RegexExtractionStrategy(
            pattern=RegexExtractionStrategy.Email | RegexExtractionStrategy.PhoneUS
        ),

        # Structured data from consistent HTML
        "product_catalogs": JsonCssExtractionStrategy,

        # Complex reasoning and semantic understanding
        "content_analysis": LLMExtractionStrategy,

        # Mixed approach for comprehensive extraction
        "complete_site_analysis": "multi_strategy"
    }

    recommendations = {
        "speed_priority": "Use RegexExtractionStrategy for simple patterns, JsonCssExtractionStrategy for structured data",
        "accuracy_priority": "Use LLMExtractionStrategy for complex content, JsonCssExtractionStrategy for predictable structure",
        "cost_priority": "Avoid LLM strategies, use schema generation once then JsonCssExtractionStrategy",
        "scale_priority": "Cache schemas, use regex for simple patterns, avoid LLM for high-volume extraction"
    }

    return recommendations.get(use_case, "Combine strategies based on content complexity")

# Error handling and validation
async def robust_extraction():
    strategies = [
        RegexExtractionStrategy(pattern=RegexExtractionStrategy.Email),
        JsonCssExtractionStrategy(simple_schema),
        # LLM as fallback for complex cases
    ]

    async with AsyncWebCrawler() as crawler:
        for strategy in strategies:
            try:
                config = CrawlerRunConfig(extraction_strategy=strategy)
                result = await crawler.arun(url="https://example.com", config=config)

                if result.success and result.extracted_content:
                    data = json.loads(result.extracted_content)
                    if data:  # Validate non-empty results
                        print(f"Success with {strategy.__class__.__name__}")
                        return data

            except Exception as e:
                print(f"Strategy {strategy.__class__.__name__} failed: {e}")
                continue

        print("All strategies failed")
        return None
```

**📖 Learn more:** [LLM Strategies Deep Dive](https://docs.crawl4ai.com/extraction/llm-strategies/), [Schema-Based Extraction](https://docs.crawl4ai.com/extraction/no-llm-strategies/), [Regex Patterns](https://docs.crawl4ai.com/extraction/no-llm-strategies/#regexextractionstrategy), [Performance Optimization](https://docs.crawl4ai.com/advanced/multi-url-crawling/)

---

## Multi-URL Crawling

Concurrent crawling of multiple URLs with intelligent resource management, rate limiting, and real-time monitoring.

### Basic Multi-URL Crawling

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
]

# Batch processing (default) - get all results at once
async def batch_crawl():
    config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        stream=False  # Default: batch mode
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls, config=config)

        for result in results:
            if result.success:
                print(f"✅ {result.url}: {len(result.markdown)} chars")
            else:
                print(f"❌ {result.url}: {result.error_message}")

# Streaming processing - handle results as they complete
async def streaming_crawl():
    config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        stream=True  # Enable streaming
    )

    async with AsyncWebCrawler() as crawler:
        # Process results as they become available
        async for result in await crawler.arun_many(urls, config=config):
            if result.success:
                print(f"🔥 Just completed: {result.url}")
                await process_result_immediately(result)
            else:
                print(f"❌ Failed: {result.url}")
```

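The batch-versus-streaming distinction above mirrors a standard asyncio pattern: `gather` waits for everything and returns results in input order, while `as_completed` yields each result as soon as it finishes. A self-contained sketch, with `fake_fetch` standing in for a crawl:

```python
import asyncio
import random

async def fake_fetch(url):
    # Stand-in for a crawl: sleep a random, per-URL amount of time
    await asyncio.sleep(random.uniform(0.01, 0.05))
    return f"content of {url}"

async def batch(urls):
    # Batch mode: wait for all URLs, then get results in input order
    return await asyncio.gather(*(fake_fetch(u) for u in urls))

async def streaming(urls):
    # Streaming mode: handle each result the moment it completes
    results = []
    for coro in asyncio.as_completed([fake_fetch(u) for u in urls]):
        results.append(await coro)  # completion order, not input order
    return results

demo_urls = ["https://example.com/page1", "https://example.com/page2", "https://example.com/page3"]
print(asyncio.run(batch(demo_urls)))
print(asyncio.run(streaming(demo_urls)))
```

Streaming keeps memory flat for large URL lists, since each result can be processed and dropped instead of held until the whole batch finishes.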
### Memory-Adaptive Dispatching

```python
from crawl4ai import (
    AsyncWebCrawler, CrawlerRunConfig, CacheMode,
    MemoryAdaptiveDispatcher, CrawlerMonitor, DisplayMode
)

# Automatically manages concurrency based on system memory
async def memory_adaptive_crawl():
    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=80.0,  # Pause if memory exceeds 80%
        check_interval=1.0,  # Check memory every second
        max_session_permit=15,  # Max concurrent tasks
        memory_wait_timeout=300.0  # Wait up to 5 minutes for memory
    )

    config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=50
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=large_url_list,
            config=config,
            dispatcher=dispatcher
        )

        # Each result includes dispatch information
        for result in results:
            if result.dispatch_result:
                dr = result.dispatch_result
                print(f"Memory used: {dr.memory_usage:.1f}MB")
                print(f"Duration: {dr.end_time - dr.start_time}")
```

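The memory-adaptive idea boils down to: before launching the next task, poll a memory probe and wait while usage is above the threshold. A toy sketch with an injectable probe (`get_memory_percent` is a placeholder; a real implementation might use `psutil`, and would run tasks concurrently rather than one by one):

```python
import time

def run_memory_adaptive(tasks, get_memory_percent, threshold=80.0,
                        check_interval=0.01, max_waits=100):
    """Run tasks one by one, pausing whenever the memory probe
    reports usage above `threshold` percent."""
    results = []
    for task in tasks:
        waits = 0
        while get_memory_percent() > threshold and waits < max_waits:
            time.sleep(check_interval)  # back off until memory frees up
            waits += 1
        results.append(task())
    return results

# Simulated probe: memory is high for the first three checks, then drops
readings = iter([95.0, 92.0, 85.0] + [40.0] * 100)
probe = lambda: next(readings)

results = run_memory_adaptive([lambda: "a", lambda: "b"], probe, threshold=80.0)
print(results)  # ['a', 'b']
```

The `memory_wait_timeout` parameter above plays the role of `max_waits` here: it bounds how long the dispatcher stalls before giving up.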
### Rate-Limited Crawling

```python
from crawl4ai import RateLimiter, SemaphoreDispatcher

# Control request pacing and handle server rate limits
async def rate_limited_crawl():
    rate_limiter = RateLimiter(
        base_delay=(1.0, 3.0),  # Random delay 1-3 seconds
        max_delay=60.0,  # Cap backoff at 60 seconds
        max_retries=3,  # Retry failed requests 3 times
        rate_limit_codes=[429, 503]  # Handle these status codes
    )

    dispatcher = SemaphoreDispatcher(
        max_session_permit=5,  # Fixed concurrency limit
        rate_limiter=rate_limiter
    )

    config = CrawlerRunConfig(
        user_agent_mode="random",  # Randomize user agents
        simulate_user=True  # Simulate human behavior
    )

    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun_many(
            urls=urls,
            config=config,
            dispatcher=dispatcher
        ):
            print(f"Processed: {result.url}")
```

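The `RateLimiter` parameters above (random base delay, capped exponential backoff on 429/503, bounded retries) follow the classic retry-with-jitter recipe. A standalone sketch of that delay schedule, illustrative rather than the library's code:

```python
import random

def next_delay(attempt, base_delay=(1.0, 3.0), max_delay=60.0):
    """Delay before retry number `attempt` (0-based): a random base
    delay, doubled per retry, capped at `max_delay` seconds."""
    low, high = base_delay
    delay = random.uniform(low, high) * (2 ** attempt)
    return min(delay, max_delay)

def should_retry(status_code, attempt, max_retries=3,
                 rate_limit_codes=(429, 503)):
    # Only rate-limit responses are retried, and only while budget remains
    return status_code in rate_limit_codes and attempt < max_retries

random.seed(0)
for attempt in range(5):
    print(f"attempt {attempt}: wait {next_delay(attempt):.2f}s, "
          f"retry on 429: {should_retry(429, attempt)}")
```

The random base delay (jitter) keeps many concurrent sessions from retrying in lockstep, which is exactly what an overloaded server's 429 is asking you to avoid.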
### Real-Time Monitoring

```python
from crawl4ai import CrawlerMonitor, DisplayMode

# Monitor crawling progress in real-time
async def monitored_crawl():
    monitor = CrawlerMonitor(
        max_visible_rows=20,  # Show 20 tasks in display
        display_mode=DisplayMode.DETAILED  # Show individual task details
    )

    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=75.0,
        max_session_permit=10,
        monitor=monitor  # Attach monitor to dispatcher
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=urls,
            dispatcher=dispatcher
        )
```

### Advanced Dispatcher Configurations

```python
# Memory-adaptive with comprehensive monitoring
memory_dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=85.0,  # Higher memory tolerance
    check_interval=0.5,  # Check memory more frequently
    max_session_permit=20,  # More concurrent tasks
    memory_wait_timeout=600.0,  # Wait longer for memory
    rate_limiter=RateLimiter(
        base_delay=(0.5, 1.5),
        max_delay=30.0,
        max_retries=5
    ),
    monitor=CrawlerMonitor(
        max_visible_rows=15,
        display_mode=DisplayMode.AGGREGATED  # Summary view
    )
)

# Simple semaphore-based dispatcher
semaphore_dispatcher = SemaphoreDispatcher(
    max_session_permit=8,  # Fixed concurrency
    rate_limiter=RateLimiter(
        base_delay=(1.0, 2.0),
        max_delay=20.0
    )
)

# Usage with custom dispatcher
async with AsyncWebCrawler() as crawler:
    results = await crawler.arun_many(
        urls=urls,
        config=config,
        dispatcher=memory_dispatcher  # or semaphore_dispatcher
    )
```

### Handling Large-Scale Crawling

```python
async def large_scale_crawl():
    # For thousands of URLs
    urls = load_urls_from_file("large_url_list.txt")  # 10,000+ URLs

    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=70.0,  # Conservative memory usage
        max_session_permit=25,  # Higher concurrency
        rate_limiter=RateLimiter(
            base_delay=(0.1, 0.5),  # Faster for large batches
            max_retries=2  # Fewer retries for speed
        ),
        monitor=CrawlerMonitor(display_mode=DisplayMode.AGGREGATED)
    )

    config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,  # Use caching for efficiency
        stream=True,  # Stream for memory efficiency
        word_count_threshold=100,  # Skip short content
        exclude_external_links=True  # Reduce processing overhead
    )

    successful_crawls = 0
    failed_crawls = 0

    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun_many(
            urls=urls,
            config=config,
            dispatcher=dispatcher
        ):
            if result.success:
                successful_crawls += 1
                await save_result_to_database(result)
            else:
                failed_crawls += 1
                await log_failure(result.url, result.error_message)

            # Progress reporting
            if (successful_crawls + failed_crawls) % 100 == 0:
                print(f"Progress: {successful_crawls + failed_crawls}/{len(urls)}")

    print(f"Completed: {successful_crawls} successful, {failed_crawls} failed")
```

### Robots.txt Compliance

```python
async def compliant_crawl():
    config = CrawlerRunConfig(
        check_robots_txt=True,  # Respect robots.txt
        user_agent="MyBot/1.0",  # Identify your bot
        mean_delay=2.0,  # Be polite with delays
        max_range=1.0
    )

    dispatcher = SemaphoreDispatcher(
        max_session_permit=3,  # Conservative concurrency
        rate_limiter=RateLimiter(
            base_delay=(2.0, 5.0),  # Slower, more respectful
            max_retries=1
        )
    )

    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun_many(
            urls=urls,
            config=config,
            dispatcher=dispatcher
        ):
            if result.success:
                print(f"✅ Crawled: {result.url}")
            elif "robots.txt" in result.error_message:
                print(f"🚫 Blocked by robots.txt: {result.url}")
            else:
                print(f"❌ Error: {result.url}")
```

### Performance Analysis

```python
import time

async def analyze_crawl_performance():
    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=80.0,
        max_session_permit=12,
        monitor=CrawlerMonitor(display_mode=DisplayMode.DETAILED)
    )

    start_time = time.time()

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=urls,
            dispatcher=dispatcher
        )

    end_time = time.time()

    # Analyze results
    successful = [r for r in results if r.success]
    failed = [r for r in results if not r.success]

    print(f"Total time: {end_time - start_time:.2f}s")
    print(f"Success rate: {len(successful)}/{len(results)} ({len(successful)/len(results)*100:.1f}%)")
    print(f"Avg time per URL: {(end_time - start_time)/len(results):.2f}s")

    # Memory usage analysis
    if successful and successful[0].dispatch_result:
        memory_usage = [r.dispatch_result.memory_usage for r in successful if r.dispatch_result]
        peak_memory = [r.dispatch_result.peak_memory for r in successful if r.dispatch_result]

        print(f"Avg memory usage: {sum(memory_usage)/len(memory_usage):.1f}MB")
        print(f"Peak memory usage: {max(peak_memory):.1f}MB")
```

### Error Handling and Recovery

```python
async def robust_multi_crawl():
    failed_urls = []

    config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        stream=True,
        page_timeout=30000  # 30 second timeout
    )

    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=85.0,
        max_session_permit=10
    )

    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun_many(
            urls=urls,
            config=config,
            dispatcher=dispatcher
        ):
            if result.success:
                await process_successful_result(result)
            else:
                failed_urls.append({
                    'url': result.url,
                    'error': result.error_message,
                    'status_code': result.status_code
                })

                # Retry logic for transient errors
                if result.status_code in [503, 429]:  # Service unavailable / rate limited
                    await schedule_retry(result.url)

    # Report failures
    if failed_urls:
        print(f"Failed to crawl {len(failed_urls)} URLs:")
        for failure in failed_urls[:10]:  # Show first 10
            print(f"  {failure['url']}: {failure['error']}")
```

**📖 Learn more:** [Advanced Multi-URL Crawling](https://docs.crawl4ai.com/advanced/multi-url-crawling/), [Crawl Dispatcher](https://docs.crawl4ai.com/advanced/crawl-dispatcher/), [arun_many() API Reference](https://docs.crawl4ai.com/api/arun_many/)

---

## Deep Crawling

Multi-level website exploration with intelligent filtering, scoring, and prioritization strategies.

### Basic Deep Crawl Setup

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy

# Basic breadth-first deep crawling
async def basic_deep_crawl():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,  # Initial page + 2 levels
            include_external=False  # Stay within same domain
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        verbose=True
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://docs.crawl4ai.com", config=config)

        # Group results by depth
        pages_by_depth = {}
        for result in results:
            depth = result.metadata.get("depth", 0)
            if depth not in pages_by_depth:
                pages_by_depth[depth] = []
            pages_by_depth[depth].append(result.url)

        print(f"Crawled {len(results)} pages total")
        for depth, urls in sorted(pages_by_depth.items()):
            print(f"Depth {depth}: {len(urls)} pages")
```

### Deep Crawl Strategies

```python
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy, DFSDeepCrawlStrategy, BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

# Breadth-First Search - explores all links at one depth before going deeper
bfs_strategy = BFSDeepCrawlStrategy(
    max_depth=2,
    include_external=False,
    max_pages=50,  # Limit total pages
    score_threshold=0.3  # Minimum score for URLs
)

# Depth-First Search - explores as deep as possible before backtracking
dfs_strategy = DFSDeepCrawlStrategy(
    max_depth=2,
    include_external=False,
    max_pages=30,
    score_threshold=0.5
)

# Best-First - prioritizes highest scoring pages (recommended)
keyword_scorer = KeywordRelevanceScorer(
    keywords=["crawl", "example", "async", "configuration"],
    weight=0.7
)

best_first_strategy = BestFirstCrawlingStrategy(
    max_depth=2,
    include_external=False,
    url_scorer=keyword_scorer,
    max_pages=25  # No score_threshold needed - naturally prioritizes
)

# Usage
config = CrawlerRunConfig(
    deep_crawl_strategy=best_first_strategy,  # Choose your strategy
    scraping_strategy=LXMLWebScrapingStrategy()
)
```

### Streaming vs Batch Processing

```python
# Batch mode - wait for all results
async def batch_deep_crawl():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
        stream=False  # Default - collect all results first
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://example.com", config=config)

        # Process all results at once
        for result in results:
            print(f"Batch processed: {result.url}")

# Streaming mode - process results as they arrive
async def streaming_deep_crawl():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
        stream=True  # Process results immediately
    )

    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://example.com", config=config):
            depth = result.metadata.get("depth", 0)
            print(f"Stream processed depth {depth}: {result.url}")
```

### Filtering with Filter Chains

```python
from crawl4ai.deep_crawling.filters import (
    FilterChain,
    URLPatternFilter,
    DomainFilter,
    ContentTypeFilter,
    SEOFilter,
    ContentRelevanceFilter
)

# Single URL pattern filter
url_filter = URLPatternFilter(patterns=["*core*", "*guide*"])

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=1,
        filter_chain=FilterChain([url_filter])
    )
)

# Multiple filters in chain
advanced_filter_chain = FilterChain([
    # Domain filtering
    DomainFilter(
        allowed_domains=["docs.example.com"],
        blocked_domains=["old.docs.example.com", "staging.example.com"]
    ),

    # URL pattern matching
    URLPatternFilter(patterns=["*tutorial*", "*guide*", "*blog*"]),

    # Content type filtering
    ContentTypeFilter(allowed_types=["text/html"]),

    # SEO quality filter
    SEOFilter(
        threshold=0.5,
        keywords=["tutorial", "guide", "documentation"]
    ),

    # Content relevance filter
    ContentRelevanceFilter(
        query="Web crawling and data extraction with Python",
        threshold=0.7
    )
])

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=2,
        filter_chain=advanced_filter_chain
    )
)
```

### Intelligent Crawling with Scorers

```python
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

# Keyword relevance scoring
async def scored_deep_crawl():
    keyword_scorer = KeywordRelevanceScorer(
        keywords=["browser", "crawler", "web", "automation"],
        weight=1.0
    )

    config = CrawlerRunConfig(
        deep_crawl_strategy=BestFirstCrawlingStrategy(
            max_depth=2,
            include_external=False,
            url_scorer=keyword_scorer
        ),
        stream=True,  # Recommended with BestFirst
        verbose=True
    )

    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://docs.crawl4ai.com", config=config):
            score = result.metadata.get("score", 0)
            depth = result.metadata.get("depth", 0)
            print(f"Depth: {depth} | Score: {score:.2f} | {result.url}")
```

### Limiting Crawl Size

```python
# Max pages limitation across strategies
async def limited_crawls():
    # BFS with page limit
    bfs_config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,
            max_pages=5,  # Only crawl 5 pages total
            url_scorer=KeywordRelevanceScorer(keywords=["browser", "crawler"], weight=1.0)
        )
    )

    # DFS with score threshold
    dfs_config = CrawlerRunConfig(
        deep_crawl_strategy=DFSDeepCrawlStrategy(
            max_depth=2,
            score_threshold=0.7,  # Only URLs with scores above 0.7
            max_pages=10,
            url_scorer=KeywordRelevanceScorer(keywords=["web", "automation"], weight=1.0)
        )
    )

    # Best-First with both constraints
    bf_config = CrawlerRunConfig(
        deep_crawl_strategy=BestFirstCrawlingStrategy(
            max_depth=2,
            max_pages=7,  # Automatically gets highest scored pages
            url_scorer=KeywordRelevanceScorer(keywords=["crawl", "example"], weight=1.0)
        ),
        stream=True
    )

    async with AsyncWebCrawler() as crawler:
        # Use any of the configs
        async for result in await crawler.arun("https://docs.crawl4ai.com", config=bf_config):
            score = result.metadata.get("score", 0)
            print(f"Score: {score:.2f} | {result.url}")
```

### Complete Advanced Deep Crawler

```python
async def comprehensive_deep_crawl():
    # Sophisticated filter chain
    filter_chain = FilterChain([
        DomainFilter(
            allowed_domains=["docs.crawl4ai.com"],
            blocked_domains=["old.docs.crawl4ai.com"]
        ),
        URLPatternFilter(patterns=["*core*", "*advanced*", "*blog*"]),
        ContentTypeFilter(allowed_types=["text/html"]),
        SEOFilter(threshold=0.4, keywords=["crawl", "tutorial", "guide"])
    ])

    # Multi-keyword scorer
    keyword_scorer = KeywordRelevanceScorer(
        keywords=["crawl", "example", "async", "configuration", "browser"],
        weight=0.8
    )

    # Complete configuration
    config = CrawlerRunConfig(
        deep_crawl_strategy=BestFirstCrawlingStrategy(
            max_depth=2,
            include_external=False,
            filter_chain=filter_chain,
            url_scorer=keyword_scorer,
            max_pages=20
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        stream=True,
        verbose=True,
        cache_mode=CacheMode.BYPASS
    )

    # Execute and analyze
    results = []
    start_time = time.time()

    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://docs.crawl4ai.com", config=config):
            results.append(result)
            score = result.metadata.get("score", 0)
            depth = result.metadata.get("depth", 0)
            print(f"→ Depth: {depth} | Score: {score:.2f} | {result.url}")

    # Performance analysis
    duration = time.time() - start_time
    avg_score = sum(r.metadata.get('score', 0) for r in results) / len(results) if results else 0.0

    print(f"✅ Crawled {len(results)} pages in {duration:.2f}s")
    print(f"✅ Average relevance score: {avg_score:.2f}")

    # Depth distribution
    depth_counts = {}
    for result in results:
        depth = result.metadata.get("depth", 0)
        depth_counts[depth] = depth_counts.get(depth, 0) + 1

    for depth, count in sorted(depth_counts.items()):
        print(f"📊 Depth {depth}: {count} pages")
```

### Error Handling and Robustness

```python
async def robust_deep_crawl():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BestFirstCrawlingStrategy(
            max_depth=2,
            max_pages=15,
            url_scorer=KeywordRelevanceScorer(keywords=["guide", "tutorial"])
        ),
        stream=True,
        page_timeout=30000  # 30 second timeout per page
    )

    successful_pages = []
    failed_pages = []

    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://docs.crawl4ai.com", config=config):
            if result.success:
                successful_pages.append(result)
                depth = result.metadata.get("depth", 0)
                score = result.metadata.get("score", 0)
                print(f"✅ Depth {depth} | Score: {score:.2f} | {result.url}")
            else:
                failed_pages.append({
                    'url': result.url,
                    'error': result.error_message,
                    'depth': result.metadata.get("depth", 0)
                })
                print(f"❌ Failed: {result.url} - {result.error_message}")

    print(f"📊 Results: {len(successful_pages)} successful, {len(failed_pages)} failed")

    # Analyze failures by depth
    if failed_pages:
        failure_by_depth = {}
        for failure in failed_pages:
            depth = failure['depth']
            failure_by_depth[depth] = failure_by_depth.get(depth, 0) + 1

        print("❌ Failures by depth:")
        for depth, count in sorted(failure_by_depth.items()):
            print(f"  Depth {depth}: {count} failures")
```

**📖 Learn more:** [Deep Crawling Guide](https://docs.crawl4ai.com/core/deep-crawling/), [Filter Documentation](https://docs.crawl4ai.com/core/content-selection/), [Scoring Strategies](https://docs.crawl4ai.com/advanced/advanced-features/)

---

## Docker Deployment

Complete Docker deployment guide with pre-built images, API endpoints, configuration, and MCP integration.

### Quick Start with Pre-built Images

```bash
# Pull latest image
docker pull unclecode/crawl4ai:latest

# Setup LLM API keys
cat > .llm.env << EOL
OPENAI_API_KEY=sk-your-key
ANTHROPIC_API_KEY=your-anthropic-key
GROQ_API_KEY=your-groq-key
GEMINI_API_TOKEN=your-gemini-token
EOL

# Run with LLM support
docker run -d \
  -p 11235:11235 \
  --name crawl4ai \
  --env-file .llm.env \
  --shm-size=1g \
  unclecode/crawl4ai:latest

# Basic run (no LLM)
docker run -d \
  -p 11235:11235 \
  --name crawl4ai \
  --shm-size=1g \
  unclecode/crawl4ai:latest

# Check health
curl http://localhost:11235/health
```

### Docker Compose Deployment

```bash
# Clone and setup
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
cp deploy/docker/.llm.env.example .llm.env
# Edit .llm.env with your API keys

# Run pre-built image
IMAGE=unclecode/crawl4ai:latest docker compose up -d

# Build locally
docker compose up --build -d

# Build with all features
INSTALL_TYPE=all docker compose up --build -d

# Build with GPU support
ENABLE_GPU=true docker compose up --build -d

# Stop service
docker compose down
```

### Manual Build with Multi-Architecture

```bash
# Clone repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai

# Build for current architecture
docker buildx build -t crawl4ai-local:latest --load .

# Build for multiple architectures
# (--load only supports a single platform; push multi-arch images to a registry)
docker buildx build --platform linux/amd64,linux/arm64 \
  -t crawl4ai-local:latest --push .

# Build with specific features
docker buildx build \
  --build-arg INSTALL_TYPE=all \
  --build-arg ENABLE_GPU=false \
  -t crawl4ai-local:latest --load .

# Run custom build
docker run -d \
  -p 11235:11235 \
  --name crawl4ai-custom \
  --env-file .llm.env \
  --shm-size=1g \
  crawl4ai-local:latest
```

### Build Arguments

```bash
# Available build options:
#   INSTALL_TYPE    default|all|torch|transformer
#   ENABLE_GPU      true|false
#   APP_HOME        Install path
#   USE_LOCAL       Use local source (true|false)
#   GITHUB_REPO     Git repo URL if USE_LOCAL=false
#   GITHUB_BRANCH   Git branch to build
docker buildx build \
  --build-arg INSTALL_TYPE=all \
  --build-arg ENABLE_GPU=true \
  --build-arg APP_HOME=/app \
  --build-arg USE_LOCAL=true \
  --build-arg GITHUB_BRANCH=main \
  -t crawl4ai-custom:latest --load .
```

### Core API Endpoints

```python
# Main crawling endpoints
import requests
import json

# Basic crawl
payload = {
    "urls": ["https://example.com"],
    "browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
    "crawler_config": {"type": "CrawlerRunConfig", "params": {"cache_mode": "bypass"}}
}
response = requests.post("http://localhost:11235/crawl", json=payload)

# Streaming crawl
payload["crawler_config"]["params"]["stream"] = True
response = requests.post("http://localhost:11235/crawl/stream", json=payload)

# Health check
response = requests.get("http://localhost:11235/health")

# API schema
response = requests.get("http://localhost:11235/schema")

# Metrics (Prometheus format)
response = requests.get("http://localhost:11235/metrics")
```

### Specialized Endpoints

```python
# HTML extraction (preprocessed for schema)
response = requests.post("http://localhost:11235/html",
                         json={"url": "https://example.com"})

# Screenshot capture
response = requests.post("http://localhost:11235/screenshot", json={
    "url": "https://example.com",
    "screenshot_wait_for": 2,
    "output_path": "/path/to/save/screenshot.png"
})

# PDF generation
response = requests.post("http://localhost:11235/pdf", json={
    "url": "https://example.com",
    "output_path": "/path/to/save/document.pdf"
})

# JavaScript execution
response = requests.post("http://localhost:11235/execute_js", json={
    "url": "https://example.com",
    "scripts": [
        "return document.title",
        "return Array.from(document.querySelectorAll('a')).map(a => a.href)"
    ]
})

# Markdown generation
response = requests.post("http://localhost:11235/md", json={
    "url": "https://example.com",
    "f": "fit",  # raw|fit|bm25|llm
    "q": "extract main content",  # query for filtering
    "c": "0"  # cache: 0=bypass, 1=use
})

# LLM Q&A
response = requests.get("http://localhost:11235/llm/https://example.com?q=What is this page about?")

# Library context (for AI assistants)
response = requests.get("http://localhost:11235/ask", params={
    "context_type": "all",  # code|doc|all
    "query": "how to use extraction strategies",
    "score_ratio": 0.5,
    "max_results": 20
})
```

### Python SDK Usage

```python
import asyncio
from crawl4ai.docker_client import Crawl4aiDockerClient
from crawl4ai import BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    async with Crawl4aiDockerClient(base_url="http://localhost:11235") as client:
        # Non-streaming crawl
        results = await client.crawl(
            ["https://example.com"],
            browser_config=BrowserConfig(headless=True),
            crawler_config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        )

        for result in results:
            print(f"URL: {result.url}, Success: {result.success}")
            print(f"Content length: {len(result.markdown)}")

        # Streaming crawl
        stream_config = CrawlerRunConfig(stream=True, cache_mode=CacheMode.BYPASS)
        async for result in await client.crawl(
            ["https://example.com", "https://python.org"],
            browser_config=BrowserConfig(headless=True),
            crawler_config=stream_config
        ):
            print(f"Streamed: {result.url} - {result.success}")

        # Get API schema
        schema = await client.get_schema()
        print(f"Schema available: {bool(schema)}")

asyncio.run(main())
```

### Advanced API Configuration

```python
# Complex extraction with LLM
payload = {
    "urls": ["https://example.com"],
    "browser_config": {
        "type": "BrowserConfig",
        "params": {
            "headless": True,
            "viewport": {"type": "dict", "value": {"width": 1200, "height": 800}}
        }
    },
    "crawler_config": {
        "type": "CrawlerRunConfig",
        "params": {
            "extraction_strategy": {
                "type": "LLMExtractionStrategy",
                "params": {
                    "llm_config": {
                        "type": "LLMConfig",
                        "params": {
                            "provider": "openai/gpt-4o-mini",
                            "api_token": "env:OPENAI_API_KEY"
                        }
                    },
                    "schema": {
                        "type": "dict",
                        "value": {
                            "type": "object",
                            "properties": {
                                "title": {"type": "string"},
                                "content": {"type": "string"}
                            }
                        }
                    },
                    "instruction": "Extract title and main content"
                }
            },
            "markdown_generator": {
                "type": "DefaultMarkdownGenerator",
                "params": {
                    "content_filter": {
                        "type": "PruningContentFilter",
                        "params": {"threshold": 0.6}
                    }
                }
            }
        }
    }
}

response = requests.post("http://localhost:11235/crawl", json=payload)
```

### CSS Extraction Strategy

```python
# CSS-based structured extraction
schema = {
    "name": "ProductList",
    "baseSelector": ".product",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
    ]
}

payload = {
    "urls": ["https://example-shop.com"],
    "browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
    "crawler_config": {
        "type": "CrawlerRunConfig",
        "params": {
            "extraction_strategy": {
                "type": "JsonCssExtractionStrategy",
                "params": {
                    "schema": {"type": "dict", "value": schema}
                }
            }
        }
    }
}

response = requests.post("http://localhost:11235/crawl", json=payload)
data = response.json()
extracted = json.loads(data["results"][0]["extracted_content"])
```

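The last line above assumes the request succeeded and produced at least one result. A small defensive helper, based only on the response shape shown in this example (`results[0]["extracted_content"]` as a JSON string), avoids `KeyError`/`IndexError` on empty responses:

```python
import json

def first_extracted(data: dict):
    """Return the first result's parsed extracted_content, or None if absent."""
    results = data.get("results") or []
    if not results:
        return None
    raw = results[0].get("extracted_content")
    return json.loads(raw) if raw else None

# Stubbed server response for illustration (not a live call)
fake_response = {"results": [{"extracted_content": json.dumps([{"title": "Widget"}])}]}
print(first_extracted(fake_response))  # [{'title': 'Widget'}]
```
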
### MCP (Model Context Protocol) Integration

```bash
# Add Crawl4AI as MCP provider to Claude Code
claude mcp add --transport sse c4ai-sse http://localhost:11235/mcp/sse

# List MCP providers
claude mcp list

# Test MCP connection
python tests/mcp/test_mcp_socket.py

# Available MCP endpoints
# SSE: http://localhost:11235/mcp/sse
# WebSocket: ws://localhost:11235/mcp/ws
# Schema: http://localhost:11235/mcp/schema
```

Available MCP tools:
- `md` - Generate markdown from web content
- `html` - Extract preprocessed HTML
- `screenshot` - Capture webpage screenshots
- `pdf` - Generate PDF documents
- `execute_js` - Run JavaScript on web pages
- `crawl` - Perform multi-URL crawling
- `ask` - Query Crawl4AI library context

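For clients that speak MCP directly (rather than through Claude Code), tool invocations are JSON-RPC 2.0 `tools/call` requests. A minimal sketch of building such a payload for the `md` tool listed above — the argument names (`url`, `f`) are assumptions mirroring the `/md` endpoint, not a confirmed tool schema:

```python
import json

def build_mcp_tool_call(tool: str, arguments: dict, request_id: int = 1) -> str:
    """Build a JSON-RPC 2.0 `tools/call` request body for an MCP tool."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(payload)

# Hypothetical invocation of the `md` tool
message = build_mcp_tool_call("md", {"url": "https://example.com", "f": "fit"})
print(message)
```
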
### Configuration Management

```yaml
# config.yml structure
app:
  title: "Crawl4AI API"
  version: "1.0.0"
  host: "0.0.0.0"
  port: 11235
  timeout_keep_alive: 300

llm:
  provider: "openai/gpt-4o-mini"
  api_key_env: "OPENAI_API_KEY"

security:
  enabled: false
  jwt_enabled: false
  trusted_hosts: ["*"]

crawler:
  memory_threshold_percent: 95.0
  rate_limiter:
    base_delay: [1.0, 2.0]
  timeouts:
    stream_init: 30.0
    batch_process: 300.0
  pool:
    max_pages: 40
    idle_ttl_sec: 1800

rate_limiting:
  enabled: true
  default_limit: "1000/minute"
  storage_uri: "memory://"

logging:
  level: "INFO"
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
```

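When only a few of these settings need to change, it can be cleaner to merge a small override dict over the defaults than to maintain a full copy of `config.yml`. A generic sketch of such a deep merge (plain Python, not part of the Crawl4AI API):

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into a copy of `base`; override wins on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Override just crawler.pool.max_pages, keep the rest of the defaults
defaults = {"crawler": {"pool": {"max_pages": 40, "idle_ttl_sec": 1800}}}
overrides = {"crawler": {"pool": {"max_pages": 20}}}
merged = deep_merge(defaults, overrides)
print(merged["crawler"]["pool"])  # {'max_pages': 20, 'idle_ttl_sec': 1800}
```
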
### Custom Configuration Deployment

```bash
# Method 1: Mount custom config
docker run -d -p 11235:11235 \
  --name crawl4ai-custom \
  --env-file .llm.env \
  --shm-size=1g \
  -v $(pwd)/my-config.yml:/app/config.yml \
  unclecode/crawl4ai:latest

# Method 2: Build with custom config
# Edit deploy/docker/config.yml then build
docker buildx build -t crawl4ai-custom:latest --load .
```

### Monitoring and Health Checks

```bash
# Health endpoint
curl http://localhost:11235/health

# Prometheus metrics
curl http://localhost:11235/metrics

# Configuration validation
curl -X POST http://localhost:11235/config/dump \
  -H "Content-Type: application/json" \
  -d '{"code": "CrawlerRunConfig(cache_mode=\"BYPASS\", screenshot=True)"}'
```

### Playground Interface

Access the interactive playground at `http://localhost:11235/playground` for:
- Testing configurations with visual interface
- Generating JSON payloads for REST API
- Converting Python config to JSON format
- Testing crawl operations directly in browser

### Async Job Processing

```python
# Submit job for async processing
import time

# Submit crawl job
response = requests.post("http://localhost:11235/crawl/job", json=payload)
task_id = response.json()["task_id"]

# Poll for completion
while True:
    result = requests.get(f"http://localhost:11235/crawl/job/{task_id}")
    status = result.json()

    if status["status"] in ["COMPLETED", "FAILED"]:
        break
    time.sleep(1.5)

print("Final result:", status)
```

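The busy-wait loop above runs forever if a job hangs. A small helper with a deadline makes the polling reusable; `fetch_status` is any zero-argument callable returning the job-status dict (for example, a lambda wrapping the `GET /crawl/job/{task_id}` call shown above):

```python
import time

def poll_until_done(fetch_status, interval: float = 1.5, timeout: float = 120.0) -> dict:
    """Poll a job-status callable until it reports COMPLETED/FAILED or the deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status.get("status") in ("COMPLETED", "FAILED"):
            return status
        time.sleep(interval)
    raise TimeoutError("job did not finish before the timeout")

# Example with a stubbed status sequence instead of a live server
states = iter([{"status": "PROCESSING"}, {"status": "COMPLETED", "result": "ok"}])
final = poll_until_done(lambda: next(states), interval=0.0)
print(final["status"])  # COMPLETED
```
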
### Production Deployment

```bash
# Production-ready deployment
docker run -d \
  --name crawl4ai-prod \
  --restart unless-stopped \
  -p 11235:11235 \
  --env-file .llm.env \
  --shm-size=2g \
  --memory=8g \
  --cpus=4 \
  -v /path/to/custom-config.yml:/app/config.yml \
  unclecode/crawl4ai:latest
```

Or with Docker Compose for production (`docker-compose.yml`):

```yaml
version: '3.8'
services:
  crawl4ai:
    image: unclecode/crawl4ai:latest
    ports:
      - "11235:11235"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    volumes:
      - ./config.yml:/app/config.yml
    shm_size: 2g
    deploy:
      resources:
        limits:
          memory: 8G
          cpus: '4'
    restart: unless-stopped
```

### Configuration Validation and JSON Structure

```python
# Method 1: Create config objects and dump to see expected JSON structure
from crawl4ai import BrowserConfig, CrawlerRunConfig, LLMConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, LLMExtractionStrategy
import json

# Create browser config and see JSON structure
browser_config = BrowserConfig(
    headless=True,
    viewport_width=1280,
    viewport_height=720,
    proxy="http://user:pass@proxy:8080"
)

# Get JSON structure
browser_json = browser_config.dump()
print("BrowserConfig JSON structure:")
print(json.dumps(browser_json, indent=2))

# Create crawler config with extraction strategy
schema = {
    "name": "Articles",
    "baseSelector": ".article",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "content", "selector": ".content", "type": "html"}
    ]
}

crawler_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    screenshot=True,
    extraction_strategy=JsonCssExtractionStrategy(schema),
    js_code=["window.scrollTo(0, document.body.scrollHeight);"],
    wait_for="css:.loaded"
)

crawler_json = crawler_config.dump()
print("\nCrawlerRunConfig JSON structure:")
print(json.dumps(crawler_json, indent=2))
```

### Reverse Validation - JSON to Objects

```python
# Method 2: Load JSON back to config objects for validation
from crawl4ai.async_configs import from_serializable_dict

# Test JSON structure by converting back to objects
test_browser_json = {
    "type": "BrowserConfig",
    "params": {
        "headless": True,
        "viewport_width": 1280,
        "proxy": "http://user:pass@proxy:8080"
    }
}

try:
    # Convert JSON back to object
    restored_browser = from_serializable_dict(test_browser_json)
    print(f"✅ Valid BrowserConfig: {type(restored_browser)}")
    print(f"Headless: {restored_browser.headless}")
    print(f"Proxy: {restored_browser.proxy}")
except Exception as e:
    print(f"❌ Invalid BrowserConfig JSON: {e}")

# Test complex crawler config JSON
test_crawler_json = {
    "type": "CrawlerRunConfig",
    "params": {
        "cache_mode": "bypass",
        "screenshot": True,
        "extraction_strategy": {
            "type": "JsonCssExtractionStrategy",
            "params": {
                "schema": {
                    "type": "dict",
                    "value": {
                        "name": "Products",
                        "baseSelector": ".product",
                        "fields": [
                            {"name": "title", "selector": "h3", "type": "text"}
                        ]
                    }
                }
            }
        }
    }
}

try:
    restored_crawler = from_serializable_dict(test_crawler_json)
    print(f"✅ Valid CrawlerRunConfig: {type(restored_crawler)}")
    print(f"Cache mode: {restored_crawler.cache_mode}")
    print(f"Has extraction strategy: {restored_crawler.extraction_strategy is not None}")
except Exception as e:
    print(f"❌ Invalid CrawlerRunConfig JSON: {e}")
```

### Using Server's /config/dump Endpoint for Validation

```python
import json
import requests

# Method 3: Use server endpoint to validate configuration syntax
def validate_config_with_server(config_code: str) -> dict:
    """Validate configuration using server's /config/dump endpoint"""
    response = requests.post(
        "http://localhost:11235/config/dump",
        json={"code": config_code}
    )

    if response.status_code == 200:
        print("✅ Valid configuration syntax")
        return response.json()
    else:
        print(f"❌ Invalid configuration: {response.status_code}")
        print(response.json())
        return None

# Test valid configuration
valid_config = """
CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    screenshot=True,
    js_code=["window.scrollTo(0, document.body.scrollHeight);"],
    wait_for="css:.content-loaded"
)
"""

result = validate_config_with_server(valid_config)
if result:
    print("Generated JSON structure:")
    print(json.dumps(result, indent=2))

# Test invalid configuration (should fail)
invalid_config = """
CrawlerRunConfig(
    cache_mode="invalid_mode",
    screenshot=True,
    js_code=some_function()  # This will fail
)
"""

validate_config_with_server(invalid_config)
```

### Configuration Builder Helper

```python
def build_and_validate_request(urls, browser_params=None, crawler_params=None):
    """Helper to build and validate complete request payload"""

    # Create configurations
    browser_config = BrowserConfig(**(browser_params or {}))
    crawler_config = CrawlerRunConfig(**(crawler_params or {}))

    # Build complete request payload
    payload = {
        "urls": urls if isinstance(urls, list) else [urls],
        "browser_config": browser_config.dump(),
        "crawler_config": crawler_config.dump()
    }

    print("✅ Complete request payload:")
    print(json.dumps(payload, indent=2))

    # Validate by attempting to reconstruct
    try:
        test_browser = from_serializable_dict(payload["browser_config"])
        test_crawler = from_serializable_dict(payload["crawler_config"])
        print("✅ Payload validation successful")
        return payload
    except Exception as e:
        print(f"❌ Payload validation failed: {e}")
        return None

# Example usage
payload = build_and_validate_request(
    urls=["https://example.com"],
    browser_params={"headless": True, "viewport_width": 1280},
    crawler_params={
        "cache_mode": CacheMode.BYPASS,
        "screenshot": True,
        "word_count_threshold": 10
    }
)

if payload:
    # Send to server
    response = requests.post("http://localhost:11235/crawl", json=payload)
    print(f"Server response: {response.status_code}")
```

### Common JSON Structure Patterns

```python
# Pattern 1: Simple primitive values
simple_config = {
    "type": "CrawlerRunConfig",
    "params": {
        "cache_mode": "bypass",   # String enum value
        "screenshot": True,       # Boolean
        "page_timeout": 60000     # Integer
    }
}

# Pattern 2: Nested objects
nested_config = {
    "type": "CrawlerRunConfig",
    "params": {
        "extraction_strategy": {
            "type": "LLMExtractionStrategy",
            "params": {
                "llm_config": {
                    "type": "LLMConfig",
                    "params": {
                        "provider": "openai/gpt-4o-mini",
                        "api_token": "env:OPENAI_API_KEY"
                    }
                },
                "instruction": "Extract main content"
            }
        }
    }
}

# Pattern 3: Dictionary values (must use type: dict wrapper)
dict_config = {
    "type": "CrawlerRunConfig",
    "params": {
        "extraction_strategy": {
            "type": "JsonCssExtractionStrategy",
            "params": {
                "schema": {
                    "type": "dict",   # Required wrapper
                    "value": {        # Actual dictionary content
                        "name": "Products",
                        "baseSelector": ".product",
                        "fields": [
                            {"name": "title", "selector": "h2", "type": "text"}
                        ]
                    }
                }
            }
        }
    }
}

# Pattern 4: Lists and arrays
list_config = {
    "type": "CrawlerRunConfig",
    "params": {
        "js_code": [  # Lists are handled directly
            "window.scrollTo(0, document.body.scrollHeight);",
            "document.querySelector('.load-more')?.click();"
        ],
        "excluded_tags": ["script", "style", "nav"]
    }
}
```

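Pattern 3's envelope can be applied mechanically before a payload is sent. A minimal sketch of that step (the `wrap_dict_param` helper is illustrative, not part of the Crawl4AI API):

```python
def wrap_dict_param(value):
    """Wrap a plain dict parameter in the {"type": "dict", "value": ...}
    envelope shown in Pattern 3. Already-wrapped values pass through."""
    if isinstance(value, dict) and value.get("type") != "dict":
        return {"type": "dict", "value": value}
    return value  # non-dicts and wrapped dicts are left unchanged

schema = wrap_dict_param({"name": "Products", "baseSelector": ".product"})
print(schema)
```

Note the helper is idempotent: wrapping an already-wrapped value is a no-op, so it is safe to apply defensively.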
### Troubleshooting Common JSON Errors

```python
def diagnose_json_errors():
    """Common JSON structure errors and fixes"""

    # ❌ WRONG: Missing type wrapper for objects
    wrong_config = {
        "browser_config": {
            "headless": True  # Missing type wrapper
        }
    }

    # ✅ CORRECT: Proper type wrapper
    correct_config = {
        "browser_config": {
            "type": "BrowserConfig",
            "params": {
                "headless": True
            }
        }
    }

    # ❌ WRONG: Dictionary without type: dict wrapper
    wrong_dict = {
        "schema": {
            "name": "Products"  # Raw dict, should be wrapped
        }
    }

    # ✅ CORRECT: Dictionary with proper wrapper
    correct_dict = {
        "schema": {
            "type": "dict",
            "value": {
                "name": "Products"
            }
        }
    }

    # ❌ WRONG: Invalid enum string
    wrong_enum = {
        "cache_mode": "DISABLED"  # Wrong case/value
    }

    # ✅ CORRECT: Valid enum string
    correct_enum = {
        "cache_mode": "bypass"  # or "enabled", "disabled", etc.
    }

    print("Common error patterns documented above")

# Validate your JSON structure before sending
def pre_flight_check(payload):
    """Run checks before sending to the server"""
    required_keys = ["urls", "browser_config", "crawler_config"]

    for key in required_keys:
        if key not in payload:
            print(f"❌ Missing required key: {key}")
            return False

    # Check type wrappers
    for config_key in ["browser_config", "crawler_config"]:
        config = payload[config_key]
        if not isinstance(config, dict) or "type" not in config:
            print(f"❌ {config_key} missing type wrapper")
            return False
        if "params" not in config:
            print(f"❌ {config_key} missing params")
            return False

    print("✅ Pre-flight check passed")
    return True

# Example usage
payload = {
    "urls": ["https://example.com"],
    "browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
    "crawler_config": {"type": "CrawlerRunConfig", "params": {"cache_mode": "bypass"}}
}

if pre_flight_check(payload):
    # Safe to send to server
    pass
```

**📖 Learn more:** [Complete Docker Guide](https://docs.crawl4ai.com/core/docker-deployment/), [API Reference](https://docs.crawl4ai.com/api/), [MCP Integration](https://docs.crawl4ai.com/core/docker-deployment/#mcp-model-context-protocol-support), [Configuration Options](https://docs.crawl4ai.com/core/docker-deployment/#server-configuration)

---


## CLI & Identity-Based Browsing

Command-line interface for web crawling with persistent browser profiles, authentication, and identity management.

### Basic CLI Usage

```bash
# Simple crawling
crwl https://example.com

# Get markdown output
crwl https://example.com -o markdown

# JSON output with cache bypass
crwl https://example.com -o json --bypass-cache

# Verbose mode with specific browser settings
crwl https://example.com -b "headless=false,viewport_width=1280" -v
```

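The `-b`/`-c` flags take comma-separated `key=value` pairs. A rough sketch of how such a string maps to typed Python values (this parser is illustrative only; it is not the `crwl` CLI's actual implementation):

```python
def parse_kv_flags(spec: str) -> dict:
    """Parse a comma-separated key=value string like the -b/-c flag values.
    Illustrative sketch -- booleans and ints are coerced, rest stay strings."""
    out = {}
    for pair in spec.split(","):
        key, _, raw = pair.partition("=")
        raw = raw.strip()
        if raw.lower() in ("true", "false"):   # booleans
            value = raw.lower() == "true"
        elif raw.lstrip("-").isdigit():        # integers
            value = int(raw)
        else:                                  # plain strings
            value = raw
        out[key.strip()] = value
    return out

print(parse_kv_flags("headless=false,viewport_width=1280"))
```

The resulting dict has the same shape you would pass to `BrowserConfig(**flags)` in the Python API.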
### Profile Management Commands

```bash
# Launch interactive profile manager
crwl profiles

# Create, list, and manage browser profiles
# This opens a menu where you can:
# 1. List existing profiles
# 2. Create new profile (opens browser for setup)
# 3. Delete profiles
# 4. Use profile to crawl a website

# Use a specific profile for crawling
crwl https://example.com -p my-profile-name

# Example workflow for authenticated sites:
# 1. Create profile and log in
crwl profiles  # Select "Create new profile"
# 2. Use profile for crawling authenticated content
crwl https://site-requiring-login.com/dashboard -p my-profile-name
```

### CDP Browser Management

```bash
# Launch browser with CDP debugging (default port 9222)
crwl cdp

# Use specific profile and custom port
crwl cdp -p my-profile -P 9223

# Launch headless browser with CDP
crwl cdp --headless

# Launch in incognito mode (ignores profile)
crwl cdp --incognito

# Use custom user data directory
crwl cdp --user-data-dir ~/my-browser-data --port 9224
```

### Builtin Browser Management

```bash
# Start persistent browser instance
crwl browser start

# Check browser status
crwl browser status

# Open visible window to see the browser
crwl browser view --url https://example.com

# Stop the browser
crwl browser stop

# Restart with different options
crwl browser restart --browser-type chromium --port 9223 --no-headless

# Use builtin browser in crawling
crwl https://example.com -b "browser_mode=builtin"
```

### Authentication Workflow Examples

```bash
# Complete workflow for LinkedIn scraping
# 1. Create authenticated profile
crwl profiles
# Select "Create new profile" → log in to LinkedIn in the browser → press 'q' to save

# 2. Use profile for crawling
crwl https://linkedin.com/in/someone -p linkedin-profile -o markdown

# 3. Extract structured data with authentication
crwl https://linkedin.com/search/results/people/ \
    -p linkedin-profile \
    -j "Extract people profiles with names, titles, and companies" \
    -b "headless=false"

# GitHub authenticated crawling
crwl profiles  # Create github-profile
crwl https://github.com/settings/profile -p github-profile

# Twitter/X authenticated access
crwl profiles  # Create twitter-profile
crwl https://twitter.com/home -p twitter-profile -o markdown
```

### Advanced CLI Configuration

```bash
# Complex crawling with multiple configs
crwl https://example.com \
    -B browser.yml \
    -C crawler.yml \
    -e extract_llm.yml \
    -s llm_schema.json \
    -p my-auth-profile \
    -o json \
    -v

# Quick LLM extraction with authentication
crwl https://private-site.com/dashboard \
    -p auth-profile \
    -j "Extract user dashboard data including metrics and notifications" \
    -b "headless=true,viewport_width=1920"

# Content filtering with authentication
crwl https://members-only-site.com \
    -p member-profile \
    -f filter_bm25.yml \
    -c "css_selector=.member-content,scan_full_page=true" \
    -o markdown-fit
```

### Configuration Files for Identity Browsing

```yaml
# browser_auth.yml
headless: false
use_managed_browser: true
user_data_dir: "/path/to/profile"
viewport_width: 1280
viewport_height: 720
simulate_user: true
override_navigator: true

# crawler_auth.yml
magic: true
remove_overlay_elements: true
simulate_user: true
wait_for: "css:.authenticated-content"
page_timeout: 60000
delay_before_return_html: 2
scan_full_page: true
```

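These files are flat `key: value` scalars, so they load cleanly with PyYAML's `yaml.safe_load`. For a dependency-free sketch of what that loading amounts to (the hand parser below handles only the flat scalar style shown above, nothing richer):

```python
def load_flat_yaml(text: str) -> dict:
    """Parse flat `key: value` YAML like the config files above.
    Handles booleans, ints, and quoted/plain strings -- a sketch,
    not a YAML implementation; use PyYAML for anything richer."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, raw = line.partition(":")
        raw = raw.strip().strip('"')
        if raw in ("true", "false"):
            value = raw == "true"
        elif raw.lstrip("-").isdigit():
            value = int(raw)
        else:
            value = raw
        config[key.strip()] = value
    return config

browser_yaml = """
# browser_auth.yml
headless: false
viewport_width: 1280
user_data_dir: "/path/to/profile"
"""
print(load_flat_yaml(browser_yaml))
```

The resulting dict can then be splatted into the Python API, e.g. `BrowserConfig(**load_flat_yaml(browser_yaml))`.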
### Global Configuration Management

```bash
# List all configuration settings
crwl config list

# Set default LLM provider
crwl config set DEFAULT_LLM_PROVIDER "anthropic/claude-3-sonnet"
crwl config set DEFAULT_LLM_PROVIDER_TOKEN "your-api-token"

# Set browser defaults
crwl config set BROWSER_HEADLESS false  # Always show browser
crwl config set USER_AGENT_MODE random  # Random user agents

# Enable verbose mode globally
crwl config set VERBOSE true
```

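`config list`/`config set` behave like a persistent key-value store. A minimal sketch of that pattern (the file location and function names are illustrative; this is not how `crwl` actually stores its settings):

```python
import json
import tempfile
from pathlib import Path

def config_set(path: Path, key: str, value) -> None:
    """Persist one setting in a JSON file (created on first write)."""
    settings = json.loads(path.read_text()) if path.exists() else {}
    settings[key] = value
    path.write_text(json.dumps(settings, indent=2))

def config_list(path: Path) -> dict:
    """Return every stored setting (empty dict if no file yet)."""
    return json.loads(path.read_text()) if path.exists() else {}

# Demo against a throwaway temp file
cfg_path = Path(tempfile.mkdtemp()) / "settings.json"
config_set(cfg_path, "VERBOSE", True)
config_set(cfg_path, "DEFAULT_LLM_PROVIDER", "anthropic/claude-3-sonnet")
print(config_list(cfg_path))
```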
### Q&A with Authenticated Content

```bash
# Ask questions about authenticated content
crwl https://private-dashboard.com -p dashboard-profile \
    -q "What are the key metrics shown in my dashboard?"

# Multiple questions workflow
crwl https://company-intranet.com -p work-profile -o markdown  # View content
crwl https://company-intranet.com -p work-profile \
    -q "Summarize this week's announcements"
crwl https://company-intranet.com -p work-profile \
    -q "What are the upcoming deadlines?"
```

### Programmatic Profile Creation

```python
# Create profiles via the Python API
import asyncio
from crawl4ai import BrowserProfiler

async def create_auth_profile():
    profiler = BrowserProfiler()

    # Create profile interactively (opens browser)
    profile_path = await profiler.create_profile("linkedin-auth")
    print(f"Profile created at: {profile_path}")

    # List all profiles
    profiles = profiler.list_profiles()
    for profile in profiles:
        print(f"Profile: {profile['name']} at {profile['path']}")

    # Use profile for crawling
    from crawl4ai import AsyncWebCrawler, BrowserConfig

    browser_config = BrowserConfig(
        headless=True,
        use_managed_browser=True,
        user_data_dir=profile_path
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://linkedin.com/feed")
        return result

# asyncio.run(create_auth_profile())
```

### Identity Browsing Best Practices

```bash
# 1. Create specific profiles for different sites
crwl profiles  # Create "linkedin-work"
crwl profiles  # Create "github-personal"
crwl profiles  # Create "company-intranet"

# 2. Use descriptive profile names
crwl https://site1.com -p site1-admin-account
crwl https://site2.com -p site2-user-account

# 3. Combine with appropriate browser settings
crwl https://secure-site.com \
    -p secure-profile \
    -b "headless=false,simulate_user=true,magic=true" \
    -c "wait_for=.logged-in-indicator,page_timeout=30000"

# 4. Test profile before automated crawling
crwl cdp -p test-profile  # Manually verify login status
crwl https://test-url.com -p test-profile -v  # Verbose test crawl
```

### Troubleshooting Authentication Issues

```bash
# Debug authentication problems
crwl https://auth-site.com -p auth-profile \
    -b "headless=false,verbose=true" \
    -c "verbose=true,page_timeout=60000" \
    -v

# Check profile status
crwl profiles  # List profiles and check creation dates

# Recreate problematic profiles
crwl profiles  # Delete old profile, create new one

# Test with visible browser
crwl https://problem-site.com -p profile-name \
    -b "headless=false" \
    -c "delay_before_return_html=5"
```

### Common Use Cases

```bash
# Social media monitoring (after authentication)
crwl https://twitter.com/home -p twitter-monitor \
    -j "Extract latest tweets with sentiment and engagement metrics"

# E-commerce competitor analysis (with account access)
crwl https://competitor-site.com/products -p competitor-account \
    -j "Extract product prices, availability, and descriptions"

# Company dashboard monitoring
crwl https://company-dashboard.com -p work-profile \
    -c "css_selector=.dashboard-content" \
    -q "What alerts or notifications need attention?"

# Research data collection (authenticated access)
crwl https://research-platform.com/data -p research-profile \
    -e extract_research.yml \
    -s research_schema.json \
    -o json
```

**📖 Learn more:** [Identity-Based Crawling Documentation](https://docs.crawl4ai.com/advanced/identity-based-crawling/), [Browser Profile Management](https://docs.crawl4ai.com/advanced/session-management/), [CLI Examples](https://docs.crawl4ai.com/core/cli/)

---


## HTTP Crawler Strategy

Fast, lightweight HTTP-only crawling without browser overhead for cases where JavaScript execution isn't needed.

### Basic HTTP Crawler Setup

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, HTTPCrawlerConfig, CacheMode
from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy
from crawl4ai.async_logger import AsyncLogger

async def main():
    # Initialize HTTP strategy
    http_strategy = AsyncHTTPCrawlerStrategy(
        browser_config=HTTPCrawlerConfig(
            method="GET",
            verify_ssl=True,
            follow_redirects=True
        ),
        logger=AsyncLogger(verbose=True)
    )

    # Use with AsyncWebCrawler
    async with AsyncWebCrawler(crawler_strategy=http_strategy) as crawler:
        result = await crawler.arun("https://example.com")
        print(f"Status: {result.status_code}")
        print(f"Content: {len(result.html)} chars")

if __name__ == "__main__":
    asyncio.run(main())
```

### HTTP Request Types

```python
# GET request (default)
http_config = HTTPCrawlerConfig(
    method="GET",
    headers={"Accept": "application/json"}
)

# POST with JSON data
http_config = HTTPCrawlerConfig(
    method="POST",
    json={"key": "value", "data": [1, 2, 3]},
    headers={"Content-Type": "application/json"}
)

# POST with form data
http_config = HTTPCrawlerConfig(
    method="POST",
    data={"username": "user", "password": "pass"},
    headers={"Content-Type": "application/x-www-form-urlencoded"}
)

# Advanced configuration
http_config = HTTPCrawlerConfig(
    method="GET",
    headers={"User-Agent": "Custom Bot/1.0"},
    follow_redirects=True,
    verify_ssl=False  # For testing environments
)

strategy = AsyncHTTPCrawlerStrategy(browser_config=http_config)
```

### File and Raw Content Handling

```python
async def test_content_types():
    strategy = AsyncHTTPCrawlerStrategy()

    # Web URLs
    result = await strategy.crawl("https://httpbin.org/get")
    print(f"Web content: {result.status_code}")

    # Local files
    result = await strategy.crawl("file:///path/to/local/file.html")
    print(f"File content: {len(result.html)}")

    # Raw HTML content
    raw_html = "raw://<html><body><h1>Test</h1><p>Content</p></body></html>"
    result = await strategy.crawl(raw_html)
    print(f"Raw content: {result.html}")

    # Raw content with complex HTML
    complex_html = """raw://<!DOCTYPE html>
<html>
<head><title>Test Page</title></head>
<body>
    <div class="content">
        <h1>Main Title</h1>
        <p>Paragraph content</p>
        <ul><li>Item 1</li><li>Item 2</li></ul>
    </div>
</body>
</html>"""
    result = await strategy.crawl(complex_html)
```

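The three input styles above are distinguished purely by URL prefix. A sketch of that dispatch logic (illustrative; the strategy's internal routing may differ):

```python
def classify_input(url: str) -> str:
    """Classify a crawl target by its prefix, mirroring the three
    input styles above. Illustrative, not Crawl4AI's internal code."""
    if url.startswith("raw://"):
        return "raw"    # inline HTML after the prefix
    if url.startswith("file://"):
        return "file"   # local filesystem path
    if url.startswith(("http://", "https://")):
        return "web"    # normal HTTP fetch
    raise ValueError(f"Unsupported input: {url!r}")

print(classify_input("raw://<h1>Test</h1>"))
```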
### Custom Hooks and Request Handling

```python
async def setup_hooks():
    strategy = AsyncHTTPCrawlerStrategy()

    # Before request hook
    async def before_request(url, kwargs):
        print(f"Requesting: {url}")
        kwargs['headers']['X-Custom-Header'] = 'crawl4ai'
        kwargs['headers']['Authorization'] = 'Bearer token123'

    # After request hook
    async def after_request(response):
        print(f"Response: {response.status_code}")
        if hasattr(response, 'redirected_url'):
            print(f"Redirected to: {response.redirected_url}")

    # Error handling hook
    async def on_error(error):
        print(f"Request failed: {error}")

    # Set hooks
    strategy.set_hook('before_request', before_request)
    strategy.set_hook('after_request', after_request)
    strategy.set_hook('on_error', on_error)

    # Use with hooks
    result = await strategy.crawl("https://httpbin.org/headers")
    return result
```

### Performance Configuration

```python
# High-performance setup
strategy = AsyncHTTPCrawlerStrategy(
    max_connections=50,     # Concurrent connections
    dns_cache_ttl=300,      # DNS cache timeout
    chunk_size=128 * 1024   # 128KB chunks for large files
)

# Memory-efficient setup for large files
strategy = AsyncHTTPCrawlerStrategy(
    max_connections=10,
    chunk_size=32 * 1024,   # Smaller chunks
    dns_cache_ttl=600
)

# Custom timeout configuration
config = CrawlerRunConfig(
    page_timeout=30000,     # 30 second timeout
    cache_mode=CacheMode.BYPASS
)

result = await strategy.crawl("https://slow-server.com", config=config)
```

### Error Handling and Retries

```python
from crawl4ai.async_crawler_strategy import (
    ConnectionTimeoutError,
    HTTPStatusError,
    HTTPCrawlerError
)

async def robust_crawling():
    strategy = AsyncHTTPCrawlerStrategy()

    urls = [
        "https://example.com",
        "https://httpbin.org/status/404",
        "https://nonexistent.domain.test"
    ]

    for url in urls:
        try:
            result = await strategy.crawl(url)
            print(f"✓ {url}: {result.status_code}")

        except HTTPStatusError as e:
            print(f"✗ {url}: HTTP {e.status_code}")

        except ConnectionTimeoutError as e:
            print(f"✗ {url}: Timeout - {e}")

        except HTTPCrawlerError as e:
            print(f"✗ {url}: Crawler error - {e}")

        except Exception as e:
            print(f"✗ {url}: Unexpected error - {e}")

# Retry mechanism
async def crawl_with_retry(url, max_retries=3):
    strategy = AsyncHTTPCrawlerStrategy()

    for attempt in range(max_retries):
        try:
            return await strategy.crawl(url)
        except (ConnectionTimeoutError, HTTPCrawlerError) as e:
            if attempt == max_retries - 1:
                raise
            print(f"Retry {attempt + 1}/{max_retries}: {e}")
            await asyncio.sleep(2 ** attempt)  # Exponential backoff
```

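The `2 ** attempt` sleep above yields 1s, 2s, 4s waits across three retries. Isolating that schedule makes it easy to test and to cap (the `cap` parameter is an addition for illustration, not part of the retry code above):

```python
def backoff_delays(max_retries: int, base: float = 1.0, cap: float = 60.0) -> list:
    """Exponential backoff schedule: base * 2**attempt, capped at `cap` seconds."""
    return [min(base * (2 ** attempt), cap) for attempt in range(max_retries)]

print(backoff_delays(3))  # delays before each retry
```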
### Batch Processing with HTTP Strategy

```python
async def batch_http_crawling():
    strategy = AsyncHTTPCrawlerStrategy(max_connections=20)

    urls = [
        "https://httpbin.org/get",
        "https://httpbin.org/user-agent",
        "https://httpbin.org/headers",
        "https://example.com",
        "https://httpbin.org/json"
    ]

    # Sequential processing
    results = []
    async with strategy:
        for url in urls:
            try:
                result = await strategy.crawl(url)
                results.append((url, result.status_code, len(result.html)))
            except Exception as e:
                results.append((url, "ERROR", str(e)))

    for url, status, content_info in results:
        print(f"{url}: {status} - {content_info}")

# Concurrent processing
async def concurrent_http_crawling():
    strategy = AsyncHTTPCrawlerStrategy()
    urls = ["https://httpbin.org/delay/1"] * 5

    async def crawl_single(url):
        try:
            result = await strategy.crawl(url)
            return f"✓ {result.status_code}"
        except Exception as e:
            return f"✗ {e}"

    async with strategy:
        tasks = [crawl_single(url) for url in urls]
        results = await asyncio.gather(*tasks, return_exceptions=True)

    for i, result in enumerate(results):
        print(f"URL {i+1}: {result}")
```

### Integration with Content Processing

```python
from crawl4ai import DefaultMarkdownGenerator, PruningContentFilter

async def http_with_processing():
    # HTTP strategy with content processing
    http_strategy = AsyncHTTPCrawlerStrategy(
        browser_config=HTTPCrawlerConfig(verify_ssl=True)
    )

    # Configure markdown generation
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=PruningContentFilter(
                threshold=0.48,
                threshold_type="fixed",
                min_word_threshold=10
            )
        ),
        word_count_threshold=5,
        excluded_tags=['script', 'style', 'nav'],
        exclude_external_links=True
    )

    async with AsyncWebCrawler(crawler_strategy=http_strategy) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=crawler_config
        )

        print(f"Status: {result.status_code}")
        print(f"Raw HTML: {len(result.html)} chars")
        if result.markdown:
            print(f"Markdown: {len(result.markdown.raw_markdown)} chars")
            if result.markdown.fit_markdown:
                print(f"Filtered: {len(result.markdown.fit_markdown)} chars")
```

### HTTP vs Browser Strategy Comparison

```python
import time

async def strategy_comparison():
    # Same URL with different strategies
    url = "https://example.com"

    # HTTP Strategy (fast, no JS)
    http_strategy = AsyncHTTPCrawlerStrategy()
    start_time = time.time()
    http_result = await http_strategy.crawl(url)
    http_time = time.time() - start_time

    # Browser Strategy (full features)
    from crawl4ai import BrowserConfig
    browser_config = BrowserConfig(headless=True)
    start_time = time.time()
    async with AsyncWebCrawler(config=browser_config) as crawler:
        browser_result = await crawler.arun(url)
    browser_time = time.time() - start_time

    print("HTTP Strategy:")
    print(f"  Time: {http_time:.2f}s")
    print(f"  Content: {len(http_result.html)} chars")
    print("  Features: Fast, lightweight, no JS")

    print("Browser Strategy:")
    print(f"  Time: {browser_time:.2f}s")
    print(f"  Content: {len(browser_result.html)} chars")
    print("  Features: Full browser, JS, screenshots, etc.")

# When to use HTTP strategy:
# - Static content sites
# - APIs returning HTML
# - Fast bulk processing
# - No JavaScript required
# - Memory/resource constraints

# When to use Browser strategy:
# - Dynamic content (SPA, AJAX)
# - JavaScript-heavy sites
# - Screenshots/PDFs needed
# - Complex interactions required
```

### Advanced Configuration

```python
# Custom session configuration
import aiohttp

async def advanced_http_setup():
    # Custom connector with specific settings
    # (created for illustration of the available knobs; it is not
    # passed to the strategy below, which manages its own session)
    connector = aiohttp.TCPConnector(
        limit=100,             # Connection pool size
        ttl_dns_cache=600,     # DNS cache TTL
        use_dns_cache=True,    # Enable DNS caching
        keepalive_timeout=30,  # Keep-alive timeout
        force_close=False      # Reuse connections
    )

    strategy = AsyncHTTPCrawlerStrategy(
        max_connections=50,
        dns_cache_ttl=600,
        chunk_size=64 * 1024
    )

    # Custom headers for all requests
    http_config = HTTPCrawlerConfig(
        headers={
            "User-Agent": "Crawl4AI-HTTP/1.0",
            "Accept": "text/html,application/xhtml+xml",
            "Accept-Language": "en-US,en;q=0.9",
            "Accept-Encoding": "gzip, deflate, br",
            "DNT": "1"
        },
        verify_ssl=True,
        follow_redirects=True
    )

    strategy.browser_config = http_config

    # Use with custom timeout
    config = CrawlerRunConfig(
        page_timeout=45000,  # 45 seconds
        cache_mode=CacheMode.ENABLED
    )

    result = await strategy.crawl("https://example.com", config=config)
    await strategy.close()
```

**📖 Learn more:** [AsyncWebCrawler API](https://docs.crawl4ai.com/api/async-webcrawler/), [Browser vs HTTP Strategy](https://docs.crawl4ai.com/core/browser-crawler-config/), [Performance Optimization](https://docs.crawl4ai.com/advanced/multi-url-crawling/)

---


## URL Seeding

Smart URL discovery for efficient large-scale crawling. Discover thousands of URLs instantly, filter by relevance, then crawl only what matters.

### Why URL Seeding vs Deep Crawling

```python
# Deep Crawling: Real-time discovery (page by page)
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def deep_crawl_example():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,
            include_external=False,
            max_pages=50
        )
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://example.com", config=config)
        print(f"Discovered {len(results)} pages dynamically")

# URL Seeding: Bulk discovery (thousands instantly)
from crawl4ai import AsyncUrlSeeder, SeedingConfig

async def url_seeding_example():
    config = SeedingConfig(
        source="sitemap+cc",
        pattern="*/docs/*",
        extract_head=True,
        query="API documentation",
        scoring_method="bm25",
        max_urls=1000
    )

    async with AsyncUrlSeeder() as seeder:
        urls = await seeder.urls("example.com", config)
        print(f"Discovered {len(urls)} URLs instantly")
        # Now crawl only the most relevant ones
```

### Basic URL Discovery

```python
import asyncio
from crawl4ai import AsyncUrlSeeder, SeedingConfig

async def basic_discovery():
    # Context manager handles cleanup automatically
    async with AsyncUrlSeeder() as seeder:

        # Simple discovery from sitemaps
        config = SeedingConfig(source="sitemap")
        urls = await seeder.urls("example.com", config)

        print(f"Found {len(urls)} URLs from sitemap")
        for url in urls[:5]:
            print(f"  - {url['url']} (status: {url['status']})")

# Manual cleanup (if needed)
async def manual_cleanup():
    seeder = AsyncUrlSeeder()
    try:
        config = SeedingConfig(source="cc")  # Common Crawl
        urls = await seeder.urls("example.com", config)
        print(f"Found {len(urls)} URLs from Common Crawl")
    finally:
        await seeder.close()

asyncio.run(basic_discovery())
```

### Data Sources and Patterns

```python
# Different data sources
configs = [
    SeedingConfig(source="sitemap"),     # Fastest, official URLs
    SeedingConfig(source="cc"),          # Most comprehensive
    SeedingConfig(source="sitemap+cc"),  # Maximum coverage
]

# URL pattern filtering
patterns = [
    SeedingConfig(pattern="*/blog/*"),      # Blog posts only
    SeedingConfig(pattern="*.html"),        # HTML files only
    SeedingConfig(pattern="*/product/*"),   # Product pages
    SeedingConfig(pattern="*/docs/api/*"),  # API documentation
    SeedingConfig(pattern="*"),             # Everything
]

# Advanced pattern usage
async def pattern_filtering():
    async with AsyncUrlSeeder() as seeder:
        # Find all blog posts from 2024
        config = SeedingConfig(
            source="sitemap",
            pattern="*/blog/2024/*.html",
            max_urls=100
        )

        blog_urls = await seeder.urls("example.com", config)

        # Further filter by keywords in URL
        python_posts = [
            url for url in blog_urls
            if "python" in url['url'].lower()
        ]

        print(f"Found {len(python_posts)} Python blog posts")
```

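The `pattern` strings follow shell-glob style. Python's stdlib `fnmatch` implements the same wildcard semantics, which is handy for post-filtering URL lists on the client side (whether the seeder uses `fnmatch` internally is an implementation detail; this is purely local filtering):

```python
from fnmatch import fnmatch

urls = [
    "https://example.com/blog/2024/async-tips.html",
    "https://example.com/product/widget",
    "https://example.com/docs/api/reference.html",
]

# Same glob patterns as SeedingConfig(pattern=...)
blog_posts = [u for u in urls if fnmatch(u, "*/blog/*")]
api_docs = [u for u in urls if fnmatch(u, "*/docs/api/*")]
print(blog_posts)
```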
### SeedingConfig Parameters

```python
from crawl4ai import SeedingConfig

# Comprehensive configuration
config = SeedingConfig(
    # Data sources
    source="sitemap+cc",              # "sitemap", "cc", "sitemap+cc"
    pattern="*/docs/*",               # URL pattern filter

    # Metadata extraction
    extract_head=True,                # Get <head> metadata
    live_check=True,                  # Verify URLs are accessible

    # Performance controls
    max_urls=1000,                    # Limit results (-1 = unlimited)
    concurrency=20,                   # Parallel workers
    hits_per_sec=10,                  # Rate limiting

    # Relevance scoring
    query="API documentation guide",  # Search query
    scoring_method="bm25",            # Scoring algorithm
    score_threshold=0.3,              # Minimum relevance (0.0-1.0)

    # Cache and filtering
    force=False,                      # Bypass cache
    filter_nonsense_urls=True,        # Remove utility URLs
    verbose=True                      # Debug output
)

# Quick configurations for common use cases
blog_config = SeedingConfig(
    source="sitemap",
    pattern="*/blog/*",
    extract_head=True
)

api_docs_config = SeedingConfig(
    source="sitemap+cc",
    pattern="*/docs/*",
    query="API reference documentation",
    scoring_method="bm25",
    score_threshold=0.5
)

product_pages_config = SeedingConfig(
    source="cc",
    pattern="*/product/*",
    live_check=True,
    max_urls=500
)
```

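`scoring_method="bm25"` ranks candidates with the classic BM25 formula over the query terms. A toy scorer to show the shape of that computation (illustrative only; the library's tokenization, parameters, and normalization may differ):

```python
import math

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc against the query with a toy BM25 implementation."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    scores = []
    for toks in tokenized:
        score = 0.0
        for term in query.lower().split():
            df = sum(term in t for t in tokenized)          # document frequency
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            tf = toks.count(term)                           # term frequency
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(toks) / avgdl))
        scores.append(score)
    return scores

docs = [
    "api reference documentation for the crawler",
    "company history and founding story",
]
scores = bm25_scores("api documentation", docs)
```

The doc matching the query scores higher; a document sharing no query terms scores zero, which is why a `score_threshold` can prune it.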
### Metadata Extraction and Analysis
|
||
|
||
```python
|
||
async def metadata_extraction():
|
||
async with AsyncUrlSeeder() as seeder:
|
||
config = SeedingConfig(
|
||
source="sitemap",
|
||
extract_head=True, # Extract <head> metadata
|
||
pattern="*/blog/*",
|
||
max_urls=50
|
||
)
|
||
|
||
urls = await seeder.urls("example.com", config)
|
||
|
||
# Analyze extracted metadata
|
||
for url in urls[:5]:
|
||
head_data = url['head_data']
|
||
print(f"\nURL: {url['url']}")
|
||
print(f"Title: {head_data.get('title', 'No title')}")
|
||
|
||
# Standard meta tags
|
||
meta = head_data.get('meta', {})
|
||
print(f"Description: {meta.get('description', 'N/A')}")
|
||
print(f"Keywords: {meta.get('keywords', 'N/A')}")
|
||
print(f"Author: {meta.get('author', 'N/A')}")
|
||
|
||
# Open Graph data
|
||
print(f"OG Image: {meta.get('og:image', 'N/A')}")
|
||
print(f"OG Type: {meta.get('og:type', 'N/A')}")
|
||
|
||
# JSON-LD structured data
|
||
jsonld = head_data.get('jsonld', [])
|
||
if jsonld:
|
||
print(f"Structured data: {len(jsonld)} items")
|
||
for item in jsonld[:2]:
|
||
if isinstance(item, dict):
|
||
print(f" Type: {item.get('@type', 'Unknown')}")
|
||
print(f" Name: {item.get('name', 'N/A')}")
|
||
|
||
# Filter by metadata
|
||
async def metadata_filtering():
|
||
async with AsyncUrlSeeder() as seeder:
|
||
config = SeedingConfig(
|
||
source="sitemap",
|
||
extract_head=True,
|
||
max_urls=100
|
||
)
|
||
|
||
urls = await seeder.urls("news.example.com", config)
|
||
|
||
# Filter by publication date (from JSON-LD)
|
||
from datetime import datetime, timedelta
|
||
recent_cutoff = datetime.now() - timedelta(days=7)
|
||
|
||
recent_articles = []
|
||
for url in urls:
|
||
for jsonld in url['head_data'].get('jsonld', []):
|
||
if isinstance(jsonld, dict) and 'datePublished' in jsonld:
|
||
try:
|
||
pub_date = datetime.fromisoformat(
|
||
jsonld['datePublished'].replace('Z', '+00:00')
|
||
)
|
||
if pub_date > recent_cutoff:
|
||
recent_articles.append(url)
|
||
break
|
||
except:
|
||
continue
|
||
|
||
print(f"Found {len(recent_articles)} recent articles")
|
||
```
|
||
|
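The `replace('Z', '+00:00')` step above exists because `datetime.fromisoformat` only accepts a trailing `Z` from Python 3.11 onward. A standalone sketch of the same parsing helper (the function name is illustrative, not part of Crawl4AI):

```python
from datetime import datetime

def parse_published(ts: str) -> datetime:
    """Parse an ISO-8601 timestamp, tolerating a trailing 'Z' (UTC)."""
    return datetime.fromisoformat(ts.replace('Z', '+00:00'))

dt = parse_published("2024-03-01T12:30:00Z")
print(dt.isoformat())  # 2024-03-01T12:30:00+00:00
```

The result is timezone-aware, so comparisons against other aware datetimes (like a UTC cutoff) work without `TypeError`.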
### BM25 Relevance Scoring

```python
async def relevance_scoring():
    async with AsyncUrlSeeder() as seeder:
        # Find pages about Python async programming
        config = SeedingConfig(
            source="sitemap",
            extract_head=True,  # Required for content-based scoring
            query="python async await concurrency",
            scoring_method="bm25",
            score_threshold=0.3,  # Only 30%+ relevant pages
            max_urls=20
        )

        urls = await seeder.urls("docs.python.org", config)

        # Results are automatically sorted by relevance
        print("Most relevant Python async content:")
        for url in urls[:5]:
            score = url['relevance_score']
            title = url['head_data'].get('title', 'No title')
            print(f"[{score:.2f}] {title}")
            print(f"    {url['url']}")

# URL-based scoring (when extract_head=False)
async def url_based_scoring():
    async with AsyncUrlSeeder() as seeder:
        config = SeedingConfig(
            source="sitemap",
            extract_head=False,  # Fast URL-only scoring
            query="machine learning tutorial",
            scoring_method="bm25",
            score_threshold=0.2
        )

        urls = await seeder.urls("example.com", config)

        # Scoring based on URL structure, domain, path segments
        for url in urls[:5]:
            print(f"[{url['relevance_score']:.2f}] {url['url']}")

# Multi-concept queries
async def complex_queries():
    queries = [
        "data science pandas numpy visualization",
        "web scraping automation selenium",
        "machine learning tensorflow pytorch",
        "api documentation rest graphql"
    ]

    async with AsyncUrlSeeder() as seeder:
        all_results = []

        for query in queries:
            config = SeedingConfig(
                source="sitemap",
                extract_head=True,
                query=query,
                scoring_method="bm25",
                score_threshold=0.4,
                max_urls=10
            )

            urls = await seeder.urls("learning-site.com", config)
            all_results.extend(urls)

        # Remove duplicates while preserving order
        seen = set()
        unique_results = []
        for url in all_results:
            if url['url'] not in seen:
                seen.add(url['url'])
                unique_results.append(url)

        print(f"Found {len(unique_results)} unique pages across all topics")
```

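For intuition about what the `scoring_method="bm25"` option computes, here is a minimal, self-contained BM25 ranker over tokenized text. This is an illustrative sketch, not Crawl4AI's actual scorer; the function name, parameters `k1`/`b`, and sample documents are all assumptions:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc (a token list) against query tokens with BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: in how many docs each term appears
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "python async await concurrency tutorial".split(),
    "gardening tips for spring".split(),
    "asyncio python await patterns".split(),
]
query = "python async await".split()
scores = bm25_scores(query, docs)
print(scores)  # first doc scores highest; the gardening doc scores 0
```

Documents sharing more (and rarer) query terms score higher, and a `score_threshold` simply cuts this list at a minimum value.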
### Live URL Validation

```python
async def url_validation():
    async with AsyncUrlSeeder() as seeder:
        config = SeedingConfig(
            source="sitemap",
            live_check=True,  # Verify URLs are accessible
            concurrency=15,   # Parallel HEAD requests
            hits_per_sec=8,   # Rate limiting
            max_urls=100
        )

        urls = await seeder.urls("example.com", config)

        # Analyze results
        valid_urls = [u for u in urls if u['status'] == 'valid']
        invalid_urls = [u for u in urls if u['status'] == 'not_valid']

        print(f"✅ Valid URLs: {len(valid_urls)}")
        print(f"❌ Invalid URLs: {len(invalid_urls)}")
        print(f"📊 Success rate: {len(valid_urls)/len(urls)*100:.1f}%")

        # Show some invalid URLs for debugging
        if invalid_urls:
            print("\nSample invalid URLs:")
            for url in invalid_urls[:3]:
                print(f"  - {url['url']}")

# Combined validation and metadata
async def comprehensive_validation():
    async with AsyncUrlSeeder() as seeder:
        config = SeedingConfig(
            source="sitemap",
            live_check=True,         # Verify accessibility
            extract_head=True,       # Get metadata
            query="tutorial guide",  # Relevance scoring
            scoring_method="bm25",
            score_threshold=0.2,
            concurrency=10,
            max_urls=50
        )

        urls = await seeder.urls("docs.example.com", config)

        # Filter for valid, relevant tutorials
        good_tutorials = [
            url for url in urls
            if url['status'] == 'valid' and
            url['relevance_score'] > 0.3 and
            'tutorial' in url['head_data'].get('title', '').lower()
        ]

        print(f"Found {len(good_tutorials)} high-quality tutorials")
```

### Multi-Domain Discovery

```python
async def multi_domain_research():
    async with AsyncUrlSeeder() as seeder:
        # Research Python tutorials across multiple sites
        domains = [
            "docs.python.org",
            "realpython.com",
            "python-course.eu",
            "tutorialspoint.com"
        ]

        config = SeedingConfig(
            source="sitemap",
            extract_head=True,
            query="python beginner tutorial basics",
            scoring_method="bm25",
            score_threshold=0.3,
            max_urls=15  # Per domain
        )

        # Discover across all domains in parallel
        results = await seeder.many_urls(domains, config)

        # Collect and rank all tutorials
        all_tutorials = []
        for domain, urls in results.items():
            for url in urls:
                url['domain'] = domain
                all_tutorials.append(url)

        # Sort by relevance across all domains
        all_tutorials.sort(key=lambda x: x['relevance_score'], reverse=True)

        print(f"Top 10 Python tutorials across {len(domains)} sites:")
        for i, tutorial in enumerate(all_tutorials[:10], 1):
            score = tutorial['relevance_score']
            title = tutorial['head_data'].get('title', 'No title')[:60]
            domain = tutorial['domain']
            print(f"{i:2d}. [{score:.2f}] {title}")
            print(f"    {domain}")

# Competitor analysis
async def competitor_analysis():
    competitors = ["competitor1.com", "competitor2.com", "competitor3.com"]

    async with AsyncUrlSeeder() as seeder:
        config = SeedingConfig(
            source="sitemap",
            extract_head=True,
            pattern="*/blog/*",
            max_urls=50
        )

        results = await seeder.many_urls(competitors, config)

        # Analyze content strategies
        for domain, urls in results.items():
            content_types = {}

            for url in urls:
                # Extract content type from metadata
                meta = url['head_data'].get('meta', {})
                og_type = meta.get('og:type', 'unknown')
                content_types[og_type] = content_types.get(og_type, 0) + 1

            print(f"\n{domain} content distribution:")
            for ctype, count in sorted(content_types.items(),
                                       key=lambda x: x[1], reverse=True):
                print(f"  {ctype}: {count}")
```

### Complete Pipeline: Discovery → Filter → Crawl

```python
import asyncio

async def smart_research_pipeline():
    """Complete pipeline: discover URLs, filter by relevance, crawl top results"""

    async with AsyncUrlSeeder() as seeder:
        # Step 1: Discover relevant URLs
        print("🔍 Discovering URLs...")
        config = SeedingConfig(
            source="sitemap+cc",
            extract_head=True,
            query="machine learning deep learning tutorial",
            scoring_method="bm25",
            score_threshold=0.4,
            max_urls=100
        )

        urls = await seeder.urls("example.com", config)
        print(f"   Found {len(urls)} relevant URLs")

        # Step 2: Select top articles
        top_articles = sorted(urls,
                              key=lambda x: x['relevance_score'],
                              reverse=True)[:10]

        print(f"   Selected top {len(top_articles)} for crawling")

        # Step 3: Show what we're about to crawl
        print("\n📋 Articles to crawl:")
        for i, article in enumerate(top_articles, 1):
            score = article['relevance_score']
            title = article['head_data'].get('title', 'No title')[:60]
            print(f"  {i}. [{score:.2f}] {title}")

        # Step 4: Crawl selected articles
        from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

        print(f"\n🕷️ Crawling {len(top_articles)} articles...")

        async with AsyncWebCrawler() as crawler:
            config = CrawlerRunConfig(
                only_text=True,
                word_count_threshold=200,
                stream=True  # Process results as they come
            )

            # Extract URLs and crawl
            article_urls = [article['url'] for article in top_articles]

            crawled_count = 0
            async for result in await crawler.arun_many(article_urls, config=config):
                if result.success:
                    crawled_count += 1
                    word_count = len(result.markdown.raw_markdown.split())
                    print(f"  ✅ [{crawled_count}/{len(article_urls)}] "
                          f"{word_count} words from {result.url[:50]}...")
                else:
                    print(f"  ❌ Failed: {result.url[:50]}...")

        print(f"\n✨ Successfully crawled {crawled_count} articles!")

asyncio.run(smart_research_pipeline())
```

### Advanced Features and Performance

```python
# Cache management
async def cache_management():
    async with AsyncUrlSeeder() as seeder:
        # First run - populate cache
        config = SeedingConfig(
            source="sitemap",
            extract_head=True,
            force=True  # Bypass cache, fetch fresh
        )
        urls = await seeder.urls("example.com", config)

        # Subsequent runs - use cache (much faster)
        config = SeedingConfig(
            source="sitemap",
            extract_head=True,
            force=False  # Use cache
        )
        urls = await seeder.urls("example.com", config)

# Performance optimization
async def performance_tuning():
    async with AsyncUrlSeeder() as seeder:
        # High-performance configuration
        config = SeedingConfig(
            source="cc",
            concurrency=50,            # Many parallel workers
            hits_per_sec=20,           # High rate limit
            max_urls=10000,            # Large dataset
            extract_head=False,        # Skip metadata for speed
            filter_nonsense_urls=True  # Auto-filter utility URLs
        )

        import time
        start = time.time()
        urls = await seeder.urls("large-site.com", config)
        elapsed = time.time() - start

        print(f"Processed {len(urls)} URLs in {elapsed:.2f}s")
        print(f"Speed: {len(urls)/elapsed:.0f} URLs/second")

# Memory-safe processing for large domains
async def large_domain_processing():
    async with AsyncUrlSeeder() as seeder:
        # Safe for domains with 1M+ URLs
        config = SeedingConfig(
            source="cc+sitemap",
            concurrency=50,   # Bounded queue adapts to this
            max_urls=100000,  # Process in batches
            filter_nonsense_urls=True
        )

        # The seeder automatically manages memory by:
        # - Using bounded queues (prevents RAM spikes)
        # - Applying backpressure when queue is full
        # - Processing URLs as they're discovered
        urls = await seeder.urls("huge-site.com", config)

# Configuration cloning and reuse
config_base = SeedingConfig(
    source="sitemap",
    extract_head=True,
    concurrency=20
)

# Create variations
blog_config = config_base.clone(pattern="*/blog/*")
docs_config = config_base.clone(
    pattern="*/docs/*",
    query="API documentation",
    scoring_method="bm25"
)
fast_config = config_base.clone(
    extract_head=False,
    concurrency=100,
    hits_per_sec=50
)
```

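The `clone()` pattern above — copy a base config, overriding only the fields you pass — is the same idea as `dataclasses.replace`. A minimal sketch with a hypothetical `Cfg` stand-in (not the real `SeedingConfig` class):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Cfg:  # hypothetical stand-in for SeedingConfig
    source: str = "sitemap"
    extract_head: bool = True
    concurrency: int = 20
    pattern: str = "*"

base = Cfg()
blog = replace(base, pattern="*/blog/*")                  # only the override changes
fast = replace(base, extract_head=False, concurrency=100)

print(blog.pattern, blog.concurrency)  # */blog/* 20
print(fast.concurrency, fast.source)   # 100 sitemap
```

Freezing the dataclass makes each variation an independent value, so reusing `base` for many derived configs cannot cause accidental shared-state mutations.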
### Troubleshooting and Best Practices

```python
# Common issues and solutions
async def troubleshooting_guide():
    async with AsyncUrlSeeder() as seeder:
        # Issue: No URLs found
        try:
            config = SeedingConfig(source="sitemap", pattern="*/nonexistent/*")
            urls = await seeder.urls("example.com", config)
            if not urls:
                # Solution: Try broader pattern or different source
                config = SeedingConfig(source="cc+sitemap", pattern="*")
                urls = await seeder.urls("example.com", config)
        except Exception as e:
            print(f"Discovery failed: {e}")

        # Issue: Slow performance
        config = SeedingConfig(
            source="sitemap",    # Faster than CC
            concurrency=10,      # Reduce if hitting rate limits
            hits_per_sec=5,      # Add rate limiting
            extract_head=False   # Skip if metadata not needed
        )

        # Issue: Low relevance scores
        config = SeedingConfig(
            query="specific detailed query terms",
            score_threshold=0.1,  # Lower threshold
            scoring_method="bm25"
        )

        # Issue: Memory issues with large sites
        config = SeedingConfig(
            max_urls=10000,   # Limit results
            concurrency=20,   # Reduce concurrency
            source="sitemap"  # Use sitemap only
        )

# Performance benchmarks
print("""
Typical performance on standard connection:
- Sitemap discovery: 100-1,000 URLs/second
- Common Crawl discovery: 50-500 URLs/second
- HEAD checking: 10-50 URLs/second
- Head extraction: 5-20 URLs/second
- BM25 scoring: 10,000+ URLs/second
""")

# Best practices
best_practices = """
✅ Use context manager: async with AsyncUrlSeeder() as seeder
✅ Start with sitemaps (faster), add CC if needed
✅ Use extract_head=True only when you need metadata
✅ Set reasonable max_urls to limit processing
✅ Add rate limiting for respectful crawling
✅ Cache results with force=False for repeated operations
✅ Filter nonsense URLs (enabled by default)
✅ Use specific patterns to reduce irrelevant results
"""
```

**📖 Learn more:** [Complete URL Seeding Guide](https://docs.crawl4ai.com/core/url-seeding/), [SeedingConfig Reference](https://docs.crawl4ai.com/api/parameters/), [Multi-URL Crawling](https://docs.crawl4ai.com/advanced/multi-url-crawling/)

---

### Advanced Configuration Features

#### User Agent Management & Bot Detection Avoidance

```python
from crawl4ai import CrawlerRunConfig

# Random user agent generation
config = CrawlerRunConfig(
    user_agent_mode="random",
    user_agent_generator_config={
        "platform": "windows",    # "windows", "macos", "linux", "android", "ios"
        "browser": "chrome",      # "chrome", "firefox", "safari", "edge"
        "device_type": "desktop"  # "desktop", "mobile", "tablet"
    }
)

# Custom user agent with stealth features
config = CrawlerRunConfig(
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    simulate_user=True,       # Simulate human mouse movements
    override_navigator=True,  # Override navigator properties
    mean_delay=1.5,           # Random delays between actions
    max_range=2.0
)

# Combined anti-detection approach
stealth_config = CrawlerRunConfig(
    user_agent_mode="random",
    simulate_user=True,
    override_navigator=True,
    magic=True,  # Auto-handle common bot detection patterns
    delay_before_return_html=2.0
)
```

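For comparison, the simplest form of user-agent rotation is picking from a fixed pool per request. The sketch below is a hand-rolled stand-in, not `user_agent_mode="random"`; the UA strings in the pool are illustrative and should be kept current in real use:

```python
import random

# Illustrative pool; not exhaustive and not guaranteed current
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def pick_user_agent(rng: random.Random = random) -> str:
    """Return a random UA string for per-request rotation."""
    return rng.choice(USER_AGENTS)

print(pick_user_agent(random.Random(0)))
```

A seeded `random.Random` makes the choice reproducible in tests, while the module-level default gives true per-request variation.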
#### Proxy Configuration with ProxyConfig

```python
from crawl4ai import CrawlerRunConfig, ProxyConfig, ProxyRotationStrategy

# Single proxy configuration
proxy_config = ProxyConfig(
    server="http://proxy.example.com:8080",
    username="proxy_user",
    password="proxy_pass"
)

# From proxy string format
proxy_config = ProxyConfig.from_string("192.168.1.100:8080:username:password")

# Multiple proxies with rotation
proxies = [
    ProxyConfig(server="http://proxy1.com:8080", username="user1", password="pass1"),
    ProxyConfig(server="http://proxy2.com:8080", username="user2", password="pass2"),
    ProxyConfig(server="http://proxy3.com:8080", username="user3", password="pass3")
]

rotation_strategy = ProxyRotationStrategy(
    proxies=proxies,
    rotation_method="round_robin"  # or "random", "least_used"
)

config = CrawlerRunConfig(
    proxy_config=proxy_config,
    proxy_rotation_strategy=rotation_strategy
)

# Load proxies from environment variable
proxies_from_env = ProxyConfig.from_env("MY_PROXIES")  # comma-separated proxy strings
```

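To make the `host:port:user:pass` string format above concrete, here is a standalone parser for it. This is an illustrative helper, not the library's `from_string` implementation; the `ProxyParts` type and `http://` default scheme are assumptions:

```python
from typing import NamedTuple, Optional

class ProxyParts(NamedTuple):
    server: str
    username: Optional[str]
    password: Optional[str]

def parse_proxy_string(s: str) -> ProxyParts:
    """Parse 'host:port' or 'host:port:user:pass' into parts."""
    fields = s.split(":")
    if len(fields) == 2:
        host, port = fields
        return ProxyParts(f"http://{host}:{port}", None, None)
    if len(fields) == 4:
        host, port, user, pwd = fields
        return ProxyParts(f"http://{host}:{port}", user, pwd)
    raise ValueError(f"Unrecognized proxy string: {s!r}")

print(parse_proxy_string("192.168.1.100:8080:username:password"))
```

Note this simple colon-split would need adjusting for strings that already carry a scheme (`http://...`), since the scheme separator adds an extra colon.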
#### Content Selection: css_selector vs target_elements

```python
from crawl4ai import CrawlerRunConfig

# css_selector: Extracts HTML at top level, affects entire processing
config = CrawlerRunConfig(
    css_selector="main.article, .content-area",  # Can be a list of selectors
    # Everything else (markdown, extraction, links) works only on this HTML subset
)

# target_elements: Focuses extraction within already processed HTML
config = CrawlerRunConfig(
    css_selector="body",  # First extract entire body
    target_elements=[     # Then focus extraction on these elements
        ".article-content",
        ".post-body",
        ".main-text"
    ],
    # Links and media come from the entire body, but markdown/extraction only from target_elements
)

# Hierarchical content selection
config = CrawlerRunConfig(
    css_selector=["#main-content", ".article-wrapper"],  # Top-level extraction
    target_elements=[  # Subset for processing
        ".article-title",
        ".article-body",
        ".article-metadata"
    ],
    excluded_selector="#sidebar, .ads, .comments"  # Remove these from selection
)
```

#### Advanced wait_for Conditions

```python
from crawl4ai import CrawlerRunConfig

# CSS selector waiting
config = CrawlerRunConfig(
    wait_for="css:.content-loaded",  # Wait for element to appear
    wait_for_timeout=15000
)

# JavaScript boolean expression waiting
config = CrawlerRunConfig(
    wait_for="js:() => window.dataLoaded === true",  # Custom JS condition
    wait_for_timeout=20000
)

# Complex JavaScript conditions
config = CrawlerRunConfig(
    wait_for="js:() => document.querySelectorAll('.item').length >= 10",
    js_code=[
        "document.querySelector('.load-more')?.click();",
        "window.scrollTo(0, document.body.scrollHeight);"
    ]
)

# Multiple conditions with JavaScript
config = CrawlerRunConfig(
    wait_for="js:() => !document.querySelector('.loading') && document.querySelector('.results')",
    page_timeout=30000
)
```

#### Session Management for Multi-Step Crawling

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

# Persistent session across multiple arun() calls
async def multi_step_crawling():
    async with AsyncWebCrawler() as crawler:
        # Step 1: Login page
        login_config = CrawlerRunConfig(
            session_id="user_session",  # Create persistent session
            js_code="document.querySelector('#username').value = 'user'; document.querySelector('#password').value = 'pass'; document.querySelector('#login').click();",
            wait_for="css:.dashboard",
            cache_mode=CacheMode.BYPASS
        )

        result1 = await crawler.arun("https://example.com/login", config=login_config)

        # Step 2: Navigate to protected area (reuses same browser page)
        nav_config = CrawlerRunConfig(
            session_id="user_session",  # Same session = same browser page
            js_only=True,               # No page reload, just JS navigation
            js_code="window.location.href = '/dashboard/data';",
            wait_for="css:.data-table"
        )

        result2 = await crawler.arun("https://example.com/dashboard/data", config=nav_config)

        # Step 3: Extract data from multiple pages
        for page in range(1, 6):
            page_config = CrawlerRunConfig(
                session_id="user_session",
                js_only=True,
                js_code=f"document.querySelector('.page-{page}').click();",
                wait_for=f"js:() => document.querySelector('.page-{page}').classList.contains('active')"
            )

            result = await crawler.arun(f"https://example.com/data/page/{page}", config=page_config)
            print(f"Page {page} data extracted: {len(result.extracted_content)}")

        # Important: kill the session when done
        await crawler.kill_session("user_session")

# Session with shared data between steps
async def session_with_shared_data():
    shared_context = {"user_id": "12345", "preferences": {"theme": "dark"}}

    config = CrawlerRunConfig(
        session_id="persistent_session",
        shared_data=shared_context,  # Available across all session calls
        js_code="console.log('User ID:', window.sharedData.user_id);"
    )
```

#### Identity-Based Crawling Parameters

```python
from crawl4ai import CrawlerRunConfig, GeolocationConfig

# Locale and timezone simulation
config = CrawlerRunConfig(
    locale="en-US",                  # Browser language preference
    timezone_id="America/New_York",  # Timezone setting
    user_agent_mode="random",
    user_agent_generator_config={
        "platform": "windows",
        "locale": "en-US"
    }
)

# Geolocation simulation
geo_config = GeolocationConfig(
    latitude=40.7128,  # New York coordinates
    longitude=-74.0060,
    accuracy=100.0
)

config = CrawlerRunConfig(
    geolocation=geo_config,
    locale="en-US",
    timezone_id="America/New_York"
)

# Complete identity simulation
identity_config = CrawlerRunConfig(
    # Location identity
    locale="fr-FR",
    timezone_id="Europe/Paris",
    geolocation=GeolocationConfig(latitude=48.8566, longitude=2.3522),

    # Browser identity
    user_agent_mode="random",
    user_agent_generator_config={
        "platform": "windows",
        "locale": "fr-FR",
        "browser": "chrome"
    },

    # Behavioral identity
    simulate_user=True,
    override_navigator=True,
    mean_delay=2.0,
    max_range=1.5
)
```

#### Simplified Import Pattern

```python
# Almost everything comes from the crawl4ai main package
from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CrawlerRunConfig,
    LLMConfig,
    CacheMode,
    ProxyConfig,
    GeolocationConfig
)

# Specialized strategies (still from crawl4ai)
from crawl4ai import (
    JsonCssExtractionStrategy,
    LLMExtractionStrategy,
    DefaultMarkdownGenerator,
    PruningContentFilter,
    RegexChunking
)

# Complete example with simplified imports
async def example_crawl():
    browser_config = BrowserConfig(headless=True)

    run_config = CrawlerRunConfig(
        user_agent_mode="random",
        proxy_config=ProxyConfig.from_string("192.168.1.1:8080:user:pass"),
        css_selector="main.content",
        target_elements=[".article", ".post"],
        wait_for="js:() => document.querySelector('.loaded')",
        session_id="my_session",
        simulate_user=True
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://example.com", config=run_config)
        return result
```

**📖 Learn more:** [Identity-Based Crawling](https://docs.crawl4ai.com/advanced/identity-based-crawling/), [Proxy & Security](https://docs.crawl4ai.com/advanced/proxy-security/), [Session Management](https://docs.crawl4ai.com/advanced/session-management/), [Content Selection](https://docs.crawl4ai.com/core/content-selection/)

---

## Advanced Features

Comprehensive guide to advanced crawling capabilities including file handling, authentication, dynamic content, monitoring, and session management.

### File Download Handling

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
import os

# Enable downloads with custom path
downloads_path = os.path.join(os.getcwd(), "my_downloads")
os.makedirs(downloads_path, exist_ok=True)

browser_config = BrowserConfig(
    accept_downloads=True,
    downloads_path=downloads_path
)

# Trigger downloads with JavaScript
async def download_files():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        config = CrawlerRunConfig(
            js_code="""
            // Click download links
            const downloadLinks = document.querySelectorAll('a[href$=".pdf"]');
            for (const link of downloadLinks) {
                link.click();
                await new Promise(r => setTimeout(r, 2000)); // Delay between downloads
            }
            """,
            wait_for=5  # Wait for downloads to start
        )

        result = await crawler.arun("https://example.com/downloads", config=config)

        if result.downloaded_files:
            print("Downloaded files:")
            for file_path in result.downloaded_files:
                print(f"- {file_path} ({os.path.getsize(file_path)} bytes)")
```

### Hooks & Authentication

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from playwright.async_api import Page, BrowserContext

async def advanced_crawler_with_hooks():
    browser_config = BrowserConfig(headless=True, verbose=True)
    crawler = AsyncWebCrawler(config=browser_config)

    # Hook functions for different stages
    async def on_browser_created(browser, **kwargs):
        print("[HOOK] Browser created successfully")
        return browser

    async def on_page_context_created(page: Page, context: BrowserContext, **kwargs):
        print("[HOOK] Setting up page & context")

        # Block images for faster crawling
        async def route_filter(route):
            if route.request.resource_type == "image":
                await route.abort()
            else:
                await route.continue_()

        await context.route("**", route_filter)

        # Simulate login if needed
        # await page.goto("https://example.com/login")
        # await page.fill("input[name='username']", "testuser")
        # await page.fill("input[name='password']", "password123")
        # await page.click("button[type='submit']")

        await page.set_viewport_size({"width": 1080, "height": 600})
        return page

    async def before_goto(page: Page, context: BrowserContext, url: str, **kwargs):
        print(f"[HOOK] About to navigate to: {url}")
        await page.set_extra_http_headers({"Custom-Header": "my-value"})
        return page

    async def after_goto(page: Page, context: BrowserContext, url: str, response, **kwargs):
        print(f"[HOOK] Successfully loaded: {url}")
        try:
            await page.wait_for_selector('.content', timeout=1000)
            print("[HOOK] Content found!")
        except Exception:
            print("[HOOK] Content not found, continuing")
        return page

    async def before_retrieve_html(page: Page, context: BrowserContext, **kwargs):
        print("[HOOK] Final actions before HTML retrieval")
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
        return page

    # Attach hooks
    crawler.crawler_strategy.set_hook("on_browser_created", on_browser_created)
    crawler.crawler_strategy.set_hook("on_page_context_created", on_page_context_created)
    crawler.crawler_strategy.set_hook("before_goto", before_goto)
    crawler.crawler_strategy.set_hook("after_goto", after_goto)
    crawler.crawler_strategy.set_hook("before_retrieve_html", before_retrieve_html)

    await crawler.start()

    config = CrawlerRunConfig()
    result = await crawler.arun("https://example.com", config=config)

    if result.success:
        print(f"Crawled successfully: {len(result.html)} chars")

    await crawler.close()
```

### Lazy Loading & Dynamic Content

```python
# Handle lazy-loaded images and infinite scroll
async def handle_lazy_loading():
    config = CrawlerRunConfig(
        # Wait for images to fully load
        wait_for_images=True,

        # Automatically scroll entire page to trigger lazy loading
        scan_full_page=True,
        scroll_delay=0.5,  # Delay between scroll steps

        # JavaScript for custom lazy loading
        js_code="""
        // Scroll and wait for content to load
        window.scrollTo(0, document.body.scrollHeight);

        // Click "Load More" if available
        const loadMoreBtn = document.querySelector('.load-more');
        if (loadMoreBtn) {
            loadMoreBtn.click();
        }
        """,

        # Wait for specific content to appear
        wait_for="css:.lazy-content:nth-child(20)",  # Wait for 20 items

        # Exclude external images to focus on main content
        exclude_external_images=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/gallery", config=config)

        if result.success:
            images = result.media.get("images", [])
            print(f"Loaded {len(images)} images after lazy loading")
            for img in images[:3]:
                print(f"- {img.get('src')} (score: {img.get('score', 'N/A')})")
```

### Network & Console Monitoring

```python
# Capture all network requests and console messages for debugging
async def monitor_network_and_console():
    config = CrawlerRunConfig(
        capture_network_requests=True,
        capture_console_messages=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)

        if result.success:
            # Analyze network requests
            if result.network_requests:
                requests = [r for r in result.network_requests if r.get("event_type") == "request"]
                responses = [r for r in result.network_requests if r.get("event_type") == "response"]
                failures = [r for r in result.network_requests if r.get("event_type") == "request_failed"]

                print(f"Network activity: {len(requests)} requests, {len(responses)} responses, {len(failures)} failures")

                # Find API calls
                api_calls = [r for r in requests if "api" in r.get("url", "")]
                print(f"API calls detected: {len(api_calls)}")

                # Show failed requests
                for failure in failures[:3]:
                    print(f"Failed: {failure.get('url')} - {failure.get('failure_text')}")

            # Analyze console messages
            if result.console_messages:
                message_types = {}
                for msg in result.console_messages:
                    msg_type = msg.get("type", "unknown")
                    message_types[msg_type] = message_types.get(msg_type, 0) + 1

                print(f"Console messages: {message_types}")

                # Show errors
                errors = [msg for msg in result.console_messages if msg.get("type") == "error"]
                for error in errors[:2]:
                    print(f"JS Error: {error.get('text', '')[:100]}")
```

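The manual tally of console message types above is the classic counting pattern; `collections.Counter` expresses it directly. A standalone sketch over hypothetical captured messages shaped like the dicts above:

```python
from collections import Counter

# Hypothetical captured console messages (same dict shape as above)
messages = [
    {"type": "log", "text": "page ready"},
    {"type": "error", "text": "Uncaught TypeError"},
    {"type": "log", "text": "fetch done"},
]

message_types = Counter(m.get("type", "unknown") for m in messages)
print(dict(message_types))  # {'log': 2, 'error': 1}
```

`Counter.most_common()` then gives the same sorted-by-frequency view the competitor-analysis example builds by hand.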
||
### Session Management for Multi-Step Workflows
|
||
|
||
```python
|
||
# Maintain state across multiple requests for complex workflows
|
||
async def multi_step_session_workflow():
|
||
session_id = "workflow_session"
|
||
|
||
async with AsyncWebCrawler() as crawler:
|
||
# Step 1: Initial page load
|
||
config1 = CrawlerRunConfig(
|
||
session_id=session_id,
|
||
wait_for="css:.content-loaded"
|
||
)
|
||
|
||
result1 = await crawler.arun("https://example.com/step1", config=config1)
|
||
print("Step 1 completed")
|
||
|
||
# Step 2: Navigate and interact (same browser tab)
|
||
config2 = CrawlerRunConfig(
|
||
session_id=session_id,
|
||
js_only=True, # Don't reload page, just run JS
|
||
js_code="""
|
||
document.querySelector('#next-button').click();
|
||
""",
|
||
wait_for="css:.step2-content"
|
||
)
|
||
|
||
result2 = await crawler.arun("https://example.com/step2", config=config2)
|
||
print("Step 2 completed")
|
||
|
||
# Step 3: Form submission
|
||
config3 = CrawlerRunConfig(
|
||
session_id=session_id,
|
||
js_only=True,
|
||
js_code="""
|
||
document.querySelector('#form-field').value = 'test data';
|
||
document.querySelector('#submit-btn').click();
|
||
""",
|
||
wait_for="css:.results"
|
||
)
|
||
|
||
result3 = await crawler.arun("https://example.com/submit", config=config3)
|
||
print("Step 3 completed")
|
||
|
||
# Clean up session
|
||
await crawler.crawler_strategy.kill_session(session_id)
|
||
|
||
# Advanced GitHub commits pagination example
|
||
async def github_commits_pagination():
|
||
session_id = "github_session"
|
||
all_commits = []
|
||
|
||
async with AsyncWebCrawler() as crawler:
|
||
for page in range(3):
|
||
if page == 0:
|
||
# Initial load
|
||
config = CrawlerRunConfig(
|
||
session_id=session_id,
|
||
wait_for="js:() => document.querySelectorAll('li.Box-sc-g0xbh4-0').length > 0"
|
||
)
|
||
else:
|
||
# Navigate to next page
|
||
config = CrawlerRunConfig(
|
||
session_id=session_id,
|
||
js_only=True,
|
||
js_code='document.querySelector(\'a[data-testid="pagination-next-button"]\').click();',
|
||
wait_for="js:() => document.querySelectorAll('li.Box-sc-g0xbh4-0').length > 0"
|
||
)
|
||
|
||
result = await crawler.arun(
|
||
"https://github.com/microsoft/TypeScript/commits/main",
|
||
config=config
|
||
)
|
||
|
||
if result.success:
|
||
commit_count = result.cleaned_html.count('li.Box-sc-g0xbh4-0')
|
||
print(f"Page {page + 1}: Found {commit_count} commits")
|
||
|
||
await crawler.crawler_strategy.kill_session(session_id)
|
||
```

### SSL Certificate Analysis

```python
# Fetch and analyze SSL certificates
async def analyze_ssl_certificates():
    config = CrawlerRunConfig(
        fetch_ssl_certificate=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)

        if result.success and result.ssl_certificate:
            cert = result.ssl_certificate

            # Basic certificate info
            print(f"Issuer: {cert.issuer.get('CN', 'Unknown')}")
            print(f"Subject: {cert.subject.get('CN', 'Unknown')}")
            print(f"Valid from: {cert.valid_from}")
            print(f"Valid until: {cert.valid_until}")
            print(f"Fingerprint: {cert.fingerprint}")

            # Export certificate in different formats
            import os
            os.makedirs("certificates", exist_ok=True)

            cert.to_json("certificates/cert.json")
            cert.to_pem("certificates/cert.pem")
            cert.to_der("certificates/cert.der")

            print("Certificate exported in multiple formats")
```

### Advanced Page Interaction

```python
# Complex page interactions with dynamic content
async def advanced_page_interaction():
    async with AsyncWebCrawler() as crawler:
        # Multi-step interaction with waiting
        config = CrawlerRunConfig(
            js_code=[
                # Step 1: Scroll to load content
                "window.scrollTo(0, document.body.scrollHeight);",

                # Step 2: Wait and click load more
                """
                (async () => {
                    await new Promise(resolve => setTimeout(resolve, 2000));
                    const loadMore = document.querySelector('.load-more');
                    if (loadMore) loadMore.click();
                })();
                """
            ],

            # Wait for new content to appear
            wait_for="js:() => document.querySelectorAll('.item').length > 20",

            # Additional timing controls
            page_timeout=60000,  # 60 second timeout
            delay_before_return_html=2.0,  # Wait before final capture

            # Handle overlays automatically
            remove_overlay_elements=True,
            magic=True,  # Auto-handle common popup patterns

            # Simulate human behavior
            simulate_user=True,
            override_navigator=True
        )

        result = await crawler.arun("https://example.com/dynamic", config=config)

        if result.success:
            print(f"Interactive crawl completed: {len(result.cleaned_html)} chars")

# Form interaction example
async def form_interaction_example():
    config = CrawlerRunConfig(
        js_code="""
            // Fill search form
            document.querySelector('#search-input').value = 'machine learning';
            document.querySelector('#category-select').value = 'technology';
            document.querySelector('#search-form').submit();
        """,
        wait_for="css:.search-results",
        session_id="search_session"
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/search", config=config)
        print("Search completed, results loaded")
```

### Local File & Raw HTML Processing

```python
# Handle different input types: URLs, local files, raw HTML
async def handle_different_inputs():
    async with AsyncWebCrawler() as crawler:
        # 1. Regular web URL
        result1 = await crawler.arun("https://example.com")

        # 2. Local HTML file
        local_file_path = "/path/to/file.html"
        result2 = await crawler.arun(f"file://{local_file_path}")

        # 3. Raw HTML content
        raw_html = "<html><body><h1>Test Content</h1><p>Sample text</p></body></html>"
        result3 = await crawler.arun(f"raw:{raw_html}")

        # All return the same CrawlResult structure
        for i, result in enumerate([result1, result2, result3], 1):
            if result.success:
                print(f"Input {i}: {len(result.markdown)} chars of markdown")

# Save and re-process HTML example
async def save_and_reprocess():
    async with AsyncWebCrawler() as crawler:
        # Original crawl
        result = await crawler.arun("https://example.com")

        if result.success:
            # Save HTML to file
            with open("saved_page.html", "w", encoding="utf-8") as f:
                f.write(result.html)

            # Re-process from file
            file_result = await crawler.arun("file://./saved_page.html")

            # Process as raw HTML
            raw_result = await crawler.arun(f"raw:{result.html}")

            # Verify consistency
            assert len(result.markdown) == len(file_result.markdown) == len(raw_result.markdown)
            print("✅ All processing methods produced identical results")
```

### Advanced Link & Media Handling

```python
# Comprehensive link and media extraction with filtering
async def advanced_link_media_handling():
    config = CrawlerRunConfig(
        # Link filtering
        exclude_external_links=False,  # Keep external links for analysis
        exclude_social_media_links=True,
        exclude_domains=["ads.com", "tracker.io", "spammy.net"],

        # Media handling
        exclude_external_images=True,
        image_score_threshold=5,  # Only high-quality images
        table_score_threshold=7,  # Only well-structured tables
        wait_for_images=True,

        # Capture additional formats
        screenshot=True,
        pdf=True,
        capture_mhtml=True  # Full page archive
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)

        if result.success:
            # Analyze links
            internal_links = result.links.get("internal", [])
            external_links = result.links.get("external", [])
            print(f"Links: {len(internal_links)} internal, {len(external_links)} external")

            # Analyze media
            images = result.media.get("images", [])
            tables = result.media.get("tables", [])
            print(f"Media: {len(images)} images, {len(tables)} tables")

            # High-quality images only
            quality_images = [img for img in images if img.get("score", 0) >= 5]
            print(f"High-quality images: {len(quality_images)}")

            # Table analysis
            for i, table in enumerate(tables[:2]):
                print(f"Table {i+1}: {len(table.get('headers', []))} columns, {len(table.get('rows', []))} rows")

            # Save captured files
            if result.screenshot:
                import base64
                with open("page_screenshot.png", "wb") as f:
                    f.write(base64.b64decode(result.screenshot))

            if result.pdf:
                with open("page.pdf", "wb") as f:
                    f.write(result.pdf)

            if result.mhtml:
                with open("page_archive.mhtml", "w", encoding="utf-8") as f:
                    f.write(result.mhtml)

            print("Additional formats saved: screenshot, PDF, MHTML archive")
```

### Performance & Resource Management

```python
# Optimize performance for large-scale crawling
async def performance_optimized_crawling():
    # Lightweight browser config
    browser_config = BrowserConfig(
        headless=True,
        text_mode=True,   # Disable images for speed
        light_mode=True,  # Reduce background features
        extra_args=["--disable-extensions", "--no-sandbox"]
    )

    # Efficient crawl config
    config = CrawlerRunConfig(
        # Content filtering for speed
        excluded_tags=["script", "style", "nav", "footer"],
        exclude_external_links=True,
        exclude_all_images=True,  # Remove all images for max speed
        word_count_threshold=50,

        # Timing optimizations
        page_timeout=30000,  # Faster timeout
        delay_before_return_html=0.1,

        # Resource monitoring
        capture_network_requests=False,  # Disable unless needed
        capture_console_messages=False,

        # Cache for repeated URLs
        cache_mode=CacheMode.ENABLED
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        urls = ["https://example.com/page1", "https://example.com/page2", "https://example.com/page3"]

        # Efficient batch processing
        batch_config = config.clone(
            stream=True,       # Stream results as they complete
            semaphore_count=3  # Control concurrency
        )

        async for result in await crawler.arun_many(urls, config=batch_config):
            if result.success:
                print(f"✅ {result.url}: {len(result.markdown)} chars")
            else:
                print(f"❌ {result.url}: {result.error_message}")
```

**📖 Learn more:** [Hooks & Authentication](https://docs.crawl4ai.com/advanced/hooks-auth/), [Session Management](https://docs.crawl4ai.com/advanced/session-management/), [Network Monitoring](https://docs.crawl4ai.com/advanced/network-console-capture/), [Page Interaction](https://docs.crawl4ai.com/core/page-interaction/), [File Downloads](https://docs.crawl4ai.com/advanced/file-downloading/)

---

## Deep Crawling Filters & Scorers

Advanced URL filtering and scoring strategies for intelligent deep crawling with performance optimization.

### URL Filters - Content and Domain Control

```python
from crawl4ai.deep_crawling.filters import (
    URLPatternFilter, DomainFilter, ContentTypeFilter,
    FilterChain, ContentRelevanceFilter, SEOFilter
)

# Pattern-based filtering
pattern_filter = URLPatternFilter(
    patterns=[
        "*.html",                       # HTML pages only
        "*/blog/*",                     # Blog posts
        "*/articles/*",                 # Article pages
        "*2024*",                       # Recent content
        "^https://example.com/docs/.*"  # Regex pattern
    ],
    use_glob=True,
    reverse=False  # False = include matching, True = exclude matching
)

# Domain filtering with subdomains
domain_filter = DomainFilter(
    allowed_domains=["example.com", "docs.example.com"],
    blocked_domains=["ads.example.com", "tracker.com"]
)

# Content type filtering
content_filter = ContentTypeFilter(
    allowed_types=["text/html", "application/pdf"],
    check_extension=True
)

# Apply individual filters
url = "https://example.com/blog/2024/article.html"
print(f"Pattern filter: {pattern_filter.apply(url)}")
print(f"Domain filter: {domain_filter.apply(url)}")
print(f"Content filter: {content_filter.apply(url)}")
```
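As a rough mental model for the glob patterns above, Python's stdlib `fnmatch` shows how a pattern such as `*/blog/*` matches any URL containing a `/blog/` segment. This is a standalone illustration of glob semantics, not `URLPatternFilter`'s internal matcher:

```python
from fnmatch import fnmatch

# Hypothetical standalone matcher illustrating glob-pattern semantics
patterns = ["*.html", "*/blog/*", "*2024*"]

def matches_any(url: str) -> bool:
    return any(fnmatch(url, p) for p in patterns)

print(matches_any("https://example.com/blog/post"))  # True  ('*/blog/*')
print(matches_any("https://example.com/about"))      # False (no pattern matches)
```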

### Filter Chaining - Combine Multiple Filters

```python
# Create filter chain for comprehensive filtering
filter_chain = FilterChain([
    DomainFilter(allowed_domains=["example.com"]),
    URLPatternFilter(patterns=["*/blog/*", "*/docs/*"]),
    ContentTypeFilter(allowed_types=["text/html"])
])

# Apply chain to URLs
urls = [
    "https://example.com/blog/post1.html",
    "https://spam.com/content.html",
    "https://example.com/blog/image.jpg",
    "https://example.com/docs/guide.html"
]

async def filter_urls(urls, filter_chain):
    filtered = []
    for url in urls:
        if await filter_chain.apply(url):
            filtered.append(url)
    return filtered

# Usage (inside an async context)
filtered_urls = await filter_urls(urls, filter_chain)
print(f"Filtered URLs: {filtered_urls}")

# Check filter statistics
for filter_obj in filter_chain.filters:
    stats = filter_obj.stats
    print(f"{filter_obj.name}: {stats.passed_urls}/{stats.total_urls} passed")
```

### Advanced Content Filters

```python
# BM25-based content relevance filtering
relevance_filter = ContentRelevanceFilter(
    query="python machine learning tutorial",
    threshold=0.5,  # Minimum relevance score
    k1=1.2,         # TF saturation parameter
    b=0.75,         # Length normalization
    avgdl=1000      # Average document length
)

# SEO quality filtering
seo_filter = SEOFilter(
    threshold=0.65,  # Minimum SEO score
    keywords=["python", "tutorial", "guide"],
    weights={
        "title_length": 0.15,
        "title_kw": 0.18,
        "meta_description": 0.12,
        "canonical": 0.10,
        "robot_ok": 0.20,
        "schema_org": 0.10,
        "url_quality": 0.15
    }
)

# Apply advanced filters (inside an async context)
url = "https://example.com/python-ml-tutorial"
relevance_score = await relevance_filter.apply(url)
seo_score = await seo_filter.apply(url)

print(f"Relevance: {relevance_score}, SEO: {seo_score}")
```
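For intuition about the `k1` and `b` parameters above: in the textbook BM25 formulation, `k1` caps how much repeated term occurrences can raise a score and `b` controls length normalization against `avgdl`. A minimal standalone sketch of one term's contribution, assuming the standard formula rather than `ContentRelevanceFilter`'s exact code:

```python
import math

def bm25_term_score(tf, doc_len, avgdl, df, n_docs, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one query term (illustration only)."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avgdl))
    return idf * norm

# Saturation: five occurrences score more than one, but far less than 5x more
s1 = bm25_term_score(tf=1, doc_len=1000, avgdl=1000, df=10, n_docs=1000)
s5 = bm25_term_score(tf=5, doc_len=1000, avgdl=1000, df=10, n_docs=1000)
print(s1 < s5 < 5 * s1)  # True
```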

### URL Scorers - Quality and Relevance Scoring

```python
from crawl4ai.deep_crawling.scorers import (
    KeywordRelevanceScorer, PathDepthScorer, ContentTypeScorer,
    FreshnessScorer, DomainAuthorityScorer, CompositeScorer
)

# Keyword relevance scoring
keyword_scorer = KeywordRelevanceScorer(
    keywords=["python", "tutorial", "guide", "machine", "learning"],
    weight=1.0,
    case_sensitive=False
)

# Path depth scoring (optimal depth = 3)
depth_scorer = PathDepthScorer(
    optimal_depth=3,  # /category/subcategory/article
    weight=0.8
)

# Content type scoring
content_type_scorer = ContentTypeScorer(
    type_weights={
        "html": 1.0,  # Highest priority
        "pdf": 0.8,   # Medium priority
        "txt": 0.6,   # Lower priority
        "doc": 0.4    # Lowest priority
    },
    weight=0.9
)

# Freshness scoring
freshness_scorer = FreshnessScorer(
    weight=0.7,
    current_year=2024
)

# Domain authority scoring
domain_scorer = DomainAuthorityScorer(
    domain_weights={
        "python.org": 1.0,
        "github.com": 0.9,
        "stackoverflow.com": 0.85,
        "medium.com": 0.7,
        "personal-blog.com": 0.3
    },
    default_weight=0.5,
    weight=1.0
)

# Score individual URLs
url = "https://python.org/tutorial/2024/machine-learning.html"
scores = {
    "keyword": keyword_scorer.score(url),
    "depth": depth_scorer.score(url),
    "content": content_type_scorer.score(url),
    "freshness": freshness_scorer.score(url),
    "domain": domain_scorer.score(url)
}

print(f"Individual scores: {scores}")
```
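`PathDepthScorer` rewards URLs near the optimal path depth. A back-of-envelope sketch of depth counting plus a score that peaks at the optimum — the decay curve here is hypothetical, not the library's exact formula:

```python
from urllib.parse import urlparse

def path_depth(url: str) -> int:
    """Count non-empty path segments: /tutorial/2024/page.html -> 3."""
    return len([seg for seg in urlparse(url).path.split("/") if seg])

def depth_score(url: str, optimal: int = 3) -> float:
    # Hypothetical peaked curve: 1.0 at the optimal depth, decaying with distance
    return 1.0 / (1.0 + abs(path_depth(url) - optimal))

print(path_depth("https://python.org/tutorial/2024/machine-learning.html"))   # 3
print(depth_score("https://python.org/tutorial/2024/machine-learning.html"))  # 1.0
```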

### Composite Scoring - Combine Multiple Scorers

```python
# Create composite scorer combining all strategies
composite_scorer = CompositeScorer(
    scorers=[
        KeywordRelevanceScorer(["python", "tutorial"], weight=1.5),
        PathDepthScorer(optimal_depth=3, weight=1.0),
        ContentTypeScorer({"html": 1.0, "pdf": 0.8}, weight=1.2),
        FreshnessScorer(weight=0.8, current_year=2024),
        DomainAuthorityScorer({
            "python.org": 1.0,
            "github.com": 0.9
        }, weight=1.3)
    ],
    normalize=True  # Normalize by number of scorers
)

# Score multiple URLs
urls_to_score = [
    "https://python.org/tutorial/2024/basics.html",
    "https://github.com/user/python-guide/blob/main/README.md",
    "https://random-blog.com/old/2018/python-stuff.html",
    "https://python.org/docs/deep/nested/advanced/guide.html"
]

scored_urls = []
for url in urls_to_score:
    score = composite_scorer.score(url)
    scored_urls.append((url, score))

# Sort by score (highest first)
scored_urls.sort(key=lambda x: x[1], reverse=True)

for url, score in scored_urls:
    print(f"Score: {score:.3f} - {url}")

# Check scorer statistics
print("\nScoring statistics:")
print(f"URLs scored: {composite_scorer.stats._urls_scored}")
print(f"Average score: {composite_scorer.stats.get_average():.3f}")
```
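What `normalize=True` does above can be shown with plain arithmetic: each scorer's score is multiplied by its weight, the products are summed, and the total is divided by the number of scorers so results stay comparable across differently sized scorer lists. A hypothetical standalone helper, not the library's `CompositeScorer`:

```python
def composite_score(pairs: list[tuple[float, float]], normalize: bool = True) -> float:
    """Weighted sum of (score, weight) pairs, optionally divided by pair count."""
    total = sum(score * weight for score, weight in pairs)
    return total / len(pairs) if normalize else total

# (score, weight) for e.g. keyword, depth, and freshness scorers
pairs = [(0.8, 1.5), (1.0, 1.0), (0.5, 0.8)]
print(round(composite_score(pairs), 3))                   # 0.867 (normalized)
print(round(composite_score(pairs, normalize=False), 3))  # 2.6 (raw weighted sum)
```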

### Advanced Filter Patterns

```python
# Complex pattern matching
advanced_patterns = URLPatternFilter(
    patterns=[
        r"^https://docs\.python\.org/\d+/",     # Python docs with version
        r".*/tutorial/.*\.html$",               # Tutorial pages
        r".*/guide/(?!deprecated).*",           # Guides but not deprecated
        "*/blog/{2020,2021,2022,2023,2024}/*",  # Recent blog posts
        "**/{api,reference}/**/*.html"          # API/reference docs
    ],
    use_glob=True
)

# Exclude patterns (reverse=True)
exclude_filter = URLPatternFilter(
    patterns=[
        "*/admin/*",
        "*/login/*",
        "*/private/*",
        "**/.*",                   # Hidden files
        "*.{jpg,png,gif,css,js}$"  # Media and assets
    ],
    reverse=True  # Exclude matching patterns
)

# Content type with extension mapping
detailed_content_filter = ContentTypeFilter(
    allowed_types=["text", "application"],
    check_extension=True,
    ext_map={
        "html": "text/html",
        "htm": "text/html",
        "md": "text/markdown",
        "pdf": "application/pdf",
        "doc": "application/msword",
        "docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    }
)
```

### Performance-Optimized Filtering

```python
# High-performance filter chain for large-scale crawling
class OptimizedFilterChain:
    def __init__(self):
        # Fast filters first (domain, patterns)
        self.fast_filters = [
            DomainFilter(
                allowed_domains=["example.com", "docs.example.com"],
                blocked_domains=["ads.example.com"]
            ),
            URLPatternFilter([
                "*.html", "*.pdf", "*/blog/*", "*/docs/*"
            ])
        ]

        # Slower filters last (content analysis)
        self.slow_filters = [
            ContentRelevanceFilter(
                query="important content",
                threshold=0.3
            )
        ]

    # Named `apply` so the chain is interchangeable with FilterChain below
    async def apply(self, url: str) -> bool:
        # Apply fast filters first
        for filter_obj in self.fast_filters:
            if not filter_obj.apply(url):
                return False

        # Only apply slow filters if fast filters pass
        for filter_obj in self.slow_filters:
            if not await filter_obj.apply(url):
                return False

        return True

# Batch filtering with concurrency
async def batch_filter_urls(urls, filter_chain, max_concurrent=50):
    import asyncio
    semaphore = asyncio.Semaphore(max_concurrent)

    async def filter_single(url):
        async with semaphore:
            return await filter_chain.apply(url), url

    tasks = [filter_single(url) for url in urls]
    results = await asyncio.gather(*tasks)

    return [url for passed, url in results if passed]

# Usage with 1000 URLs (inside an async context)
large_url_list = [f"https://example.com/page{i}.html" for i in range(1000)]
optimized_chain = OptimizedFilterChain()
filtered = await batch_filter_urls(large_url_list, optimized_chain)
```
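The semaphore pattern in `batch_filter_urls` generalizes to any async predicate: at most `max_concurrent` checks run at once, and each result keeps its URL attached. A self-contained sketch with a dummy predicate standing in for a real filter chain:

```python
import asyncio

async def bounded_filter(urls, predicate, max_concurrent=10):
    """Run an async predicate over URLs with at most max_concurrent in flight."""
    sem = asyncio.Semaphore(max_concurrent)

    async def check(url):
        async with sem:
            return await predicate(url), url

    results = await asyncio.gather(*(check(u) for u in urls))
    return [url for ok, url in results if ok]

async def keep_html(url: str) -> bool:
    await asyncio.sleep(0)        # stand-in for real async filtering work
    return url.endswith(".html")  # dummy rule: keep only .html pages

urls = [f"https://example.com/p{i}" + (".html" if i % 2 else "") for i in range(6)]
kept = asyncio.run(bounded_filter(urls, keep_html, max_concurrent=3))
print(kept)  # the three odd-indexed .html URLs, in order
```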

### Custom Filter Implementation

```python
from crawl4ai.deep_crawling.filters import URLFilter
import re

class CustomLanguageFilter(URLFilter):
    """Filter URLs by language indicators"""

    def __init__(self, allowed_languages=["en"], weight=1.0):
        super().__init__()
        self.allowed_languages = set(allowed_languages)
        self.lang_patterns = {
            "en": re.compile(r"/en/|/english/|lang=en"),
            "es": re.compile(r"/es/|/spanish/|lang=es"),
            "fr": re.compile(r"/fr/|/french/|lang=fr"),
            "de": re.compile(r"/de/|/german/|lang=de")
        }

    def apply(self, url: str) -> bool:
        # Default to English if no language indicators
        if not any(pattern.search(url) for pattern in self.lang_patterns.values()):
            result = "en" in self.allowed_languages
            self._update_stats(result)
            return result

        # Check for allowed languages
        for lang in self.allowed_languages:
            if lang in self.lang_patterns:
                if self.lang_patterns[lang].search(url):
                    self._update_stats(True)
                    return True

        self._update_stats(False)
        return False

# Custom scorer implementation
from crawl4ai.deep_crawling.scorers import URLScorer

class CustomComplexityScorer(URLScorer):
    """Score URLs by content complexity indicators"""

    def __init__(self, weight=1.0):
        super().__init__(weight)
        self.complexity_indicators = {
            "tutorial": 0.9,
            "guide": 0.8,
            "example": 0.7,
            "reference": 0.6,
            "api": 0.5
        }

    def _calculate_score(self, url: str) -> float:
        url_lower = url.lower()
        max_score = 0.0

        for indicator, score in self.complexity_indicators.items():
            if indicator in url_lower:
                max_score = max(max_score, score)

        return max_score

# Use custom filters and scorers
custom_filter = CustomLanguageFilter(allowed_languages=["en", "es"])
custom_scorer = CustomComplexityScorer(weight=1.2)

url = "https://example.com/en/tutorial/advanced-guide.html"
passes_filter = custom_filter.apply(url)
complexity_score = custom_scorer.score(url)

print(f"Passes language filter: {passes_filter}")
print(f"Complexity score: {complexity_score}")
```

### Integration with Deep Crawling

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import DeepCrawlStrategy

async def deep_crawl_with_filtering():
    # Create comprehensive filter chain
    filter_chain = FilterChain([
        DomainFilter(allowed_domains=["python.org"]),
        URLPatternFilter(["*/tutorial/*", "*/guide/*", "*/docs/*"]),
        ContentTypeFilter(["text/html"]),
        SEOFilter(threshold=0.6, keywords=["python", "programming"])
    ])

    # Create composite scorer
    scorer = CompositeScorer([
        KeywordRelevanceScorer(["python", "tutorial"], weight=1.5),
        FreshnessScorer(weight=0.8),
        PathDepthScorer(optimal_depth=3, weight=1.0)
    ], normalize=True)

    # Configure deep crawl strategy with filters and scorers
    deep_strategy = DeepCrawlStrategy(
        max_depth=3,
        max_pages=100,
        url_filter=filter_chain,
        url_scorer=scorer,
        score_threshold=0.6  # Only crawl URLs scoring above 0.6
    )

    config = CrawlerRunConfig(
        deep_crawl_strategy=deep_strategy,
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://python.org",
            config=config
        )

        print(f"Deep crawl completed: {result.success}")
        if hasattr(result, 'deep_crawl_results'):
            print(f"Pages crawled: {len(result.deep_crawl_results)}")

# Run the deep crawl
asyncio.run(deep_crawl_with_filtering())
```

**📖 Learn more:** [Deep Crawling Strategy](https://docs.crawl4ai.com/core/deep-crawling/), [Custom Filter Development](https://docs.crawl4ai.com/advanced/custom-filters/), [Performance Optimization](https://docs.crawl4ai.com/advanced/performance-tuning/)

---

## Summary

Crawl4AI provides a comprehensive solution for web crawling and data extraction optimized for AI applications. From simple page crawling to complex multi-URL operations with advanced filtering, the library offers the flexibility and performance needed for modern data extraction workflows.

**Key Takeaways:**
- Start with basic installation and simple crawling patterns
- Use configuration objects for consistent, maintainable code
- Choose appropriate extraction strategies based on your data structure
- Leverage Docker for production deployments
- Implement advanced features like deep crawling and custom filters as needed

**Next Steps:**
- Explore the [GitHub repository](https://github.com/unclecode/crawl4ai) for latest updates
- Join the [Discord community](https://discord.gg/jP8KfhDhyN) for support
- Check out [example projects](https://github.com/unclecode/crawl4ai/tree/main/docs/examples) for inspiration

Happy crawling! 🕷️