Update Documentation

UncleCode
2024-10-27 19:24:46 +08:00
parent 38474bd66a
commit 4239654722
111 changed files with 7680 additions and 53 deletions

docs/details/extraction.md Normal file

@@ -0,0 +1,157 @@
### Extraction Strategies
#### 1. LLMExtractionStrategy
```python
LLMExtractionStrategy(
# Core Parameters
provider: str = DEFAULT_PROVIDER, # LLM provider (e.g., "openai/gpt-4", "huggingface/...", "ollama/...")
api_token: Optional[str] = None, # API token for the provider
instruction: str = None, # Custom instruction for extraction
schema: Dict = None, # Pydantic model schema for structured extraction
extraction_type: str = "block", # Type of extraction: "block" or "schema"
# Chunking Parameters
chunk_token_threshold: int = CHUNK_TOKEN_THRESHOLD, # Maximum tokens per chunk
overlap_rate: float = OVERLAP_RATE, # Overlap between chunks
word_token_rate: float = WORD_TOKEN_RATE, # Conversion rate from words to tokens
apply_chunking: bool = True, # Whether to apply text chunking
# API Configuration
base_url: str = None, # Base URL for API calls
api_base: str = None, # Alternative base URL
extra_args: Dict = {}, # Additional provider-specific arguments
verbose: bool = False # Enable verbose logging
)
```
Usage Example:
```python
from pydantic import BaseModel
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class NewsArticle(BaseModel):
    title: str
    content: str

strategy = LLMExtractionStrategy(
    provider="ollama/nemotron",
    api_token="your-token",
    schema=NewsArticle.schema(),
    extraction_type="schema",  # extract into the schema rather than free-form blocks
    instruction="Extract news article content with title and main text"
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com", extraction_strategy=strategy)
```
#### 2. JsonCssExtractionStrategy
```python
JsonCssExtractionStrategy(
schema: Dict[str, Any], # Schema defining extraction rules
verbose: bool = False # Enable verbose logging
)
# Schema Structure
schema = {
"name": str, # Name of the extraction schema
"baseSelector": str, # CSS selector for base elements
"fields": [
{
"name": str, # Field name
"selector": str, # CSS selector
"type": str, # Field type: "text", "attribute", "html", "regex", "nested", "list", "nested_list"
"attribute": str, # For type="attribute"
"pattern": str, # For type="regex"
"transform": str, # Optional: "lowercase", "uppercase", "strip"
"default": Any, # Default value if extraction fails
"fields": List[Dict], # For nested/list types
}
]
}
```
Usage Example:
```python
schema = {
"name": "News Articles",
"baseSelector": "article.news-item",
"fields": [
{
"name": "title",
"selector": "h1",
"type": "text",
"transform": "strip"
},
{
"name": "date",
"selector": ".date",
"type": "attribute",
"attribute": "datetime"
}
]
}
strategy = JsonCssExtractionStrategy(schema)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com", extraction_strategy=strategy)
```
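The `nested`, `list`, and `nested_list` field types take a `fields` list of their own, so one schema can capture repeated sub-structures. A sketch, assuming each article carries `.comment` child elements (the selectors and field names are illustrative):
```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Hypothetical page layout: each article contains several comment elements.
comment_schema = {
    "name": "Articles with Comments",
    "baseSelector": "article.news-item",
    "fields": [
        {"name": "title", "selector": "h1", "type": "text"},
        {
            "name": "comments",
            "selector": ".comment",      # one entry per matched element
            "type": "nested_list",       # list of objects, each with its own fields
            "fields": [
                {"name": "author", "selector": ".author", "type": "text"},
                {"name": "body", "selector": ".body", "type": "text"}
            ]
        }
    ]
}
strategy = JsonCssExtractionStrategy(comment_schema)
```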
#### 3. CosineStrategy
```python
CosineStrategy(
# Content Filtering
semantic_filter: str = None, # Keywords used to keep only semantically relevant clusters
word_count_threshold: int = 10, # Minimum words per cluster
sim_threshold: float = 0.3, # Similarity threshold for filtering
# Clustering Parameters
max_dist: float = 0.2, # Maximum distance for clustering
linkage_method: str = 'ward', # Clustering linkage method
top_k: int = 3, # Number of top categories to extract
# Model Configuration
model_name: str = 'sentence-transformers/all-MiniLM-L6-v2', # Embedding model
verbose: bool = False # Enable verbose logging
)
```
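Usage Example (a minimal sketch; the filter keywords and URL are illustrative):
```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import CosineStrategy

strategy = CosineStrategy(
    semantic_filter="technology, gadgets",  # keep clusters related to these keywords
    word_count_threshold=10,
    top_k=3
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com", extraction_strategy=strategy)
    print(result.extracted_content)
```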
### Chunking Strategies
#### 1. RegexChunking
```python
RegexChunking(
patterns: List[str] = None # List of regex patterns for splitting text
# Default pattern: [r'\n\n']
)
```
Usage Example:
```python
chunker = RegexChunking(patterns=[r'\n\n', r'\.\s+']) # Split on double newlines and sentences
chunks = chunker.chunk(text)
```
#### 2. SlidingWindowChunking
```python
SlidingWindowChunking(
window_size: int = 100, # Size of the window in words
step: int = 50, # Number of words to slide the window
)
```
Usage Example:
```python
chunker = SlidingWindowChunking(window_size=200, step=100)
chunks = chunker.chunk(text) # Creates overlapping chunks of 200 words, moving 100 words at a time
```
#### 3. OverlappingWindowChunking
```python
OverlappingWindowChunking(
window_size: int = 1000, # Size of each chunk in words
overlap: int = 100 # Number of words to overlap between chunks
)
```
Usage Example:
```python
chunker = OverlappingWindowChunking(window_size=500, overlap=50)
chunks = chunker.chunk(text) # Creates 500-word chunks with 50-word overlap
```


@@ -0,0 +1,175 @@
# Features
## Current Features
1. Async-first architecture for high-performance web crawling
2. Built-in anti-bot detection bypass ("magic mode")
3. Multiple browser engine support (Chromium, Firefox, WebKit)
4. Smart session management with automatic cleanup
5. Automatic content cleaning and relevance scoring
6. Built-in markdown generation with formatting preservation
7. Intelligent image scoring and filtering
8. Automatic popup and overlay removal
9. Smart wait conditions (CSS/JavaScript based)
10. Multi-provider LLM integration (OpenAI, HuggingFace, Ollama)
11. Schema-based structured data extraction
12. Automated iframe content processing
13. Intelligent link categorization (internal/external)
14. Multiple chunking strategies for large content
15. Real-time HTML cleaning and sanitization
16. Automatic screenshot capabilities
17. Social media link filtering
18. Semantic similarity-based content clustering
19. Human behavior simulation for anti-bot bypass
20. Proxy support with authentication
21. Automatic resource cleanup
22. Custom CSS selector-based extraction
23. Automatic content relevance scoring ("fit" content)
24. Recursive website crawling capabilities
25. Flexible hook system for customization
26. Built-in caching system
27. Domain-based content filtering
28. Dynamic content handling with JavaScript execution
29. Automatic media content extraction and classification
30. Metadata extraction and processing
31. Customizable HTML to Markdown conversion
32. Token-aware content chunking for LLM processing
33. Automatic response header and status code handling
34. Browser fingerprint customization
35. Multiple extraction strategies (LLM, CSS, Cosine, XPATH)
36. Automatic error image generation for failed screenshots
37. Smart content overlap handling for large texts
38. Built-in rate limiting for batch processing
39. Automatic cookie handling
40. Browser console logging and debugging capabilities
## Feature Technical Details
• Browser Management
- Asynchronous browser control
- Multi-browser support (Chromium, Firefox, WebKit)
- Headless mode support
- Browser cleanup and resource management
- Custom browser arguments and configuration
- Context management with `__aenter__` and `__aexit__`
• Session Handling
- Session management with TTL (Time To Live)
- Session reuse capabilities
- Session cleanup for expired sessions
- Session-based context preservation
• Stealth Features
- Playwright stealth configuration
- Navigator properties override
- WebDriver detection evasion
- Chrome app simulation
- Plugin simulation
- Language preferences simulation
- Hardware concurrency simulation
- Media codecs simulation
• Network Features
- Proxy support with authentication
- Custom headers management
- Cookie handling
- Response header capture
- Status code tracking
- Network idle detection
• Page Interaction
- Smart wait functionality for multiple conditions
- CSS selector-based waiting
- JavaScript condition waiting
- Custom JavaScript execution
- User interaction simulation (mouse/keyboard)
- Page scrolling
- Timeout management
- Load state monitoring
• Content Processing
- HTML content extraction
- Iframe processing and content extraction
- Delayed content retrieval
- Content caching
- Cache file management
- HTML cleaning and processing
• Image Handling
- Screenshot capabilities (full page)
- Base64 encoding of screenshots
- Image dimension updating
- Image filtering (size/visibility)
- Error image generation
- Natural width/height preservation
• Overlay Management
- Popup removal
- Cookie notice removal
- Newsletter dialog removal
- Modal removal
- Fixed position element removal
- Z-index based overlay detection
- Visibility checking
• Hook System (see the sketch after this list)
- Browser creation hooks
- User agent update hooks
- Execution start hooks
- Navigation hooks (before/after goto)
- HTML retrieval hooks
- HTML return hooks
• Error Handling
- Browser error catching
- Network error handling
- Timeout handling
- Screenshot error recovery
- Invalid selector handling
- General exception management
• Performance Features
- Concurrent URL processing
- Semaphore-based rate limiting
- Async gathering of results
- Resource cleanup
- Memory management
• Debug Features
- Console logging
- Page error logging
- Verbose mode
- Error message generation
- Warning system
• Security Features
- Certificate error handling
- Sandbox configuration
- GPU handling
- CSP (Content Security Policy) compliant waiting
• Configuration
- User agent customization
- Viewport configuration
- Timeout configuration
- Browser type selection
- Proxy configuration
- Header configuration
• Data Models
- Pydantic model for responses
- Type hints throughout code
- Structured response format
- Optional response fields
• File System Integration
- Cache directory management
- File path handling
- Cache metadata storage
- File read/write operations
• Metadata Handling
- Response headers capture
- Status code tracking
- Cache metadata
- Session tracking
- Timestamp management
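The hooks above are registered on the crawler strategy. A minimal sketch, assuming a `set_hook` method on the Playwright crawler strategy and the hook names listed in the Hook System item; exact hook signatures may differ between versions:
```python
from crawl4ai import AsyncWebCrawler

async def before_goto(page, **kwargs):
    # Runs just before navigation; here we attach an extra header (illustrative).
    await page.set_extra_http_headers({"X-Docs-Example": "1"})
    return page

async def before_return_html(page, html, **kwargs):
    # Last chance to inspect the page before its HTML is handed back.
    print(f"Returning {len(html)} characters of HTML")
    return page

async with AsyncWebCrawler() as crawler:
    crawler.crawler_strategy.set_hook("before_goto", before_goto)
    crawler.crawler_strategy.set_hook("before_return_html", before_return_html)
    result = await crawler.arun(url="https://example.com")
```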

docs/details/features.md Normal file

@@ -0,0 +1,150 @@
### 1. Basic Web Crawling
```python
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com")
print(result.markdown) # Get clean markdown content
print(result.html) # Get raw HTML
print(result.cleaned_html) # Get cleaned HTML
```
### 2. Browser Control Options
- Multiple Browser Support
```python
# Choose between different browser engines
crawler = AsyncWebCrawler(browser_type="firefox") # or "chromium", "webkit"
crawler = AsyncWebCrawler(headless=False) # For visible browser
```
- Proxy Configuration
```python
crawler = AsyncWebCrawler(proxy="http://proxy.example.com:8080")
# Or with authentication
crawler = AsyncWebCrawler(proxy_config={
"server": "http://proxy.example.com:8080",
"username": "user",
"password": "pass"
})
```
### 3. Content Selection & Filtering
- CSS Selector Support
```python
result = await crawler.arun(
url="https://example.com",
css_selector=".main-content" # Extract specific content
)
```
- Content Filtering Options
```python
result = await crawler.arun(
url="https://example.com",
word_count_threshold=10, # Minimum words per block
excluded_tags=['form', 'header'], # Tags to exclude
exclude_external_links=True, # Remove external links
exclude_social_media_links=True, # Remove social media links
exclude_external_images=True # Remove external images
)
```
### 4. Dynamic Content Handling
- JavaScript Execution
```python
result = await crawler.arun(
url="https://example.com",
js_code="window.scrollTo(0, document.body.scrollHeight)" # Execute custom JS
)
```
- Wait Conditions
```python
# Wait for an element to appear
result = await crawler.arun(
    url="https://example.com",
    wait_for="css:.my-element"
)
# Or wait for a JavaScript condition
result = await crawler.arun(
    url="https://example.com",
    wait_for="js:() => document.readyState === 'complete'"
)
```
### 5. Anti-Bot Protection Handling
```python
result = await crawler.arun(
url="https://example.com",
simulate_user=True, # Simulate human behavior
override_navigator=True, # Mask automation signals
magic=True # Enable all anti-detection features
)
```
### 6. Session Management
```python
session_id = "my_session"
result1 = await crawler.arun(url="https://example.com/page1", session_id=session_id)
result2 = await crawler.arun(url="https://example.com/page2", session_id=session_id)
await crawler.crawler_strategy.kill_session(session_id)
```
### 7. Media Handling
- Screenshot Capture
```python
result = await crawler.arun(
url="https://example.com",
screenshot=True
)
base64_screenshot = result.screenshot
```
- Media Extraction
```python
result = await crawler.arun(url="https://example.com")
print(result.media['images']) # List of images
print(result.media['videos']) # List of videos
print(result.media['audios']) # List of audio files
```
### 8. Structured Data Extraction
- CSS-based Extraction
```python
schema = {
"name": "News Articles",
"baseSelector": "article",
"fields": [
{"name": "title", "selector": "h1", "type": "text"},
{"name": "date", "selector": ".date", "type": "text"}
]
}
extraction_strategy = JsonCssExtractionStrategy(schema)
result = await crawler.arun(
url="https://example.com",
extraction_strategy=extraction_strategy
)
structured_data = json.loads(result.extracted_content)
```
- LLM-based Extraction (Multiple Providers)
```python
class NewsArticle(BaseModel):
title: str
summary: str
strategy = LLMExtractionStrategy(
provider="ollama/nemotron", # or "openai/gpt-4", "huggingface/..."
api_token="your-token",
schema=NewsArticle.schema(),
instruction="Extract news article details..."
)
result = await crawler.arun(
url="https://example.com",
extraction_strategy=strategy
)
```
### 9. Content Cleaning & Processing
```python
result = await crawler.arun(
url="https://example.com",
remove_overlay_elements=True, # Remove popups/modals
process_iframes=True, # Process iframe content
)
print(result.fit_markdown) # Get most relevant content
print(result.fit_html) # Get cleaned HTML
```


@@ -0,0 +1,457 @@
### 1. Basic Web Crawling
Basic web crawling provides the foundation for extracting content from websites. The library supports both simple single-page crawling and recursive website crawling.
```python
# Simple page crawling
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com")
print(result.html) # Raw HTML
print(result.markdown) # Cleaned markdown
print(result.cleaned_html) # Cleaned HTML
# Recursive website crawling (sketch — scrape_recursive is left to implement,
# e.g., by following result.links['internal'] until max_depth is reached)
class SimpleWebsiteScraper:
    def __init__(self, crawler: AsyncWebCrawler):
        self.crawler = crawler

    async def scrape(self, start_url: str, max_depth: int):
        results = await self.scrape_recursive(start_url, max_depth)
        return results

# Usage
async with AsyncWebCrawler() as crawler:
    scraper = SimpleWebsiteScraper(crawler)
    results = await scraper.scrape("https://example.com", max_depth=2)
```
### 2. Browser Control Options
The library provides extensive control over browser behavior, allowing customization of browser type, headless mode, and proxy settings.
```python
# Browser Type Selection
async with AsyncWebCrawler(
browser_type="firefox", # Options: "chromium", "firefox", "webkit"
headless=False, # For visible browser
verbose=True # Enable logging
) as crawler:
result = await crawler.arun(url="https://example.com")
# Proxy Configuration
async with AsyncWebCrawler(
proxy_config={
"server": "http://proxy.example.com:8080",
"username": "user",
"password": "pass"
},
headers={
"User-Agent": "Custom User Agent",
"Accept-Language": "en-US,en;q=0.9"
}
) as crawler:
result = await crawler.arun(url="https://example.com")
```
### 3. Content Selection & Filtering
The library offers multiple ways to select and filter content, from CSS selectors to word count thresholds.
```python
# CSS Selector and Content Filtering
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com",
css_selector="article.main-content", # Extract specific content
word_count_threshold=10, # Minimum words per block
excluded_tags=['form', 'header'], # Tags to exclude
exclude_external_links=True, # Remove external links
exclude_social_media_links=True, # Remove social media links
exclude_domains=["pinterest.com", "facebook.com"] # Exclude specific domains
)
# Custom HTML to Text Options
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com",
html2text={
"escape_dot": False,
"links_each_paragraph": True,
"protect_links": True
}
)
```
### 4. Dynamic Content Handling
The library provides sophisticated handling of dynamic content with JavaScript execution and wait conditions.
```python
# JavaScript Execution and Wait Conditions
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com",
js_code=[
"window.scrollTo(0, document.body.scrollHeight);",
"document.querySelector('.load-more').click();"
],
wait_for="css:.dynamic-content", # Wait for element
delay_before_return_html=2.0 # Wait after JS execution
)
# Smart Wait Conditions
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com",
wait_for="""() => {
return document.querySelectorAll('.item').length > 10;
}""",
page_timeout=60000 # 60 seconds timeout
)
```
### 5. Advanced Link Analysis
The library provides comprehensive link analysis capabilities, distinguishing between internal and external links, with options for filtering and processing.
```python
# Basic Link Analysis
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com")
# Access internal and external links
for internal_link in result.links['internal']:
print(f"Internal: {internal_link['href']} - {internal_link['text']}")
for external_link in result.links['external']:
print(f"External: {external_link['href']} - {external_link['text']}")
# Advanced Link Filtering
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com",
exclude_external_links=True, # Remove all external links
exclude_social_media_links=True, # Remove social media links
exclude_social_media_domains=[ # Custom social media domains
"facebook.com", "twitter.com", "instagram.com"
],
exclude_domains=["pinterest.com"] # Specific domains to exclude
)
```
### 6. Anti-Bot Protection Handling
The library includes sophisticated anti-detection mechanisms to handle websites with bot protection.
```python
# Basic Anti-Detection
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com",
simulate_user=True, # Simulate human behavior
override_navigator=True # Override navigator properties
)
# Advanced Anti-Detection with Magic Mode
async with AsyncWebCrawler(headless=False) as crawler:
result = await crawler.arun(
url="https://example.com",
magic=True, # Enable all anti-detection features
remove_overlay_elements=True, # Remove popups/modals automatically
# Custom navigator properties
js_code="""
Object.defineProperty(navigator, 'webdriver', {
get: () => undefined
});
"""
)
```
### 7. Session Management
Session management allows maintaining state across multiple requests and handling cookies.
```python
# Basic Session Management
async with AsyncWebCrawler() as crawler:
session_id = "my_session"
# Login
login_result = await crawler.arun(
url="https://example.com/login",
session_id=session_id,
js_code="document.querySelector('form').submit();"
)
# Use same session for subsequent requests
protected_result = await crawler.arun(
url="https://example.com/protected",
session_id=session_id
)
# Clean up session
await crawler.crawler_strategy.kill_session(session_id)
# Advanced Session with Custom Cookies
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com",
session_id="custom_session",
cookies=[{
"name": "sessionId",
"value": "abc123",
"domain": "example.com"
}]
)
```
### 8. Screenshot and Media Handling
The library provides comprehensive media handling capabilities, including screenshots and media content extraction.
```python
# Screenshot Capture
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com",
screenshot=True,
screenshot_wait_for=2.0 # Wait before taking screenshot
)
# Save the decoded screenshot (requires `import base64`)
if result.screenshot:
with open("screenshot.png", "wb") as f:
f.write(base64.b64decode(result.screenshot))
# Media Extraction
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com")
# Process images with metadata
for image in result.media['images']:
print(f"Image: {image['src']}")
print(f"Alt text: {image['alt']}")
print(f"Context: {image['desc']}")
print(f"Relevance score: {image['score']}")
# Process videos and audio
for video in result.media['videos']:
print(f"Video: {video['src']}")
for audio in result.media['audios']:
print(f"Audio: {audio['src']}")
```
### 9. Structured Data Extraction & Chunking
The library supports multiple strategies for structured data extraction and content chunking.
```python
# LLM-based Extraction
class NewsArticle(BaseModel):
title: str
content: str
author: str
extraction_strategy = LLMExtractionStrategy(
provider='openai/gpt-4',
api_token="your-token",
schema=NewsArticle.schema(),
instruction="Extract news article details",
chunk_token_threshold=1000,
overlap_rate=0.1
)
# CSS-based Extraction
schema = {
"name": "Product Listing",
"baseSelector": ".product-card",
"fields": [
{
"name": "title",
"selector": "h2",
"type": "text"
},
{
"name": "price",
"selector": ".price",
"type": "text",
"transform": "strip"
}
]
}
css_strategy = JsonCssExtractionStrategy(schema)
# Text Chunking
from crawl4ai.chunking_strategy import OverlappingWindowChunking
chunking_strategy = OverlappingWindowChunking(
window_size=1000,
overlap=100
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com",
extraction_strategy=extraction_strategy,
chunking_strategy=chunking_strategy
)
```
### 10. Content Cleaning & Processing
The library provides extensive content cleaning and processing capabilities, ensuring high-quality output in various formats.
```python
# Basic Content Cleaning
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com",
remove_overlay_elements=True, # Remove popups/modals
process_iframes=True, # Process iframe content
word_count_threshold=10 # Minimum words per block
)
print(result.cleaned_html) # Clean HTML
print(result.fit_html) # Most relevant HTML content
print(result.fit_markdown) # Most relevant markdown content
# Advanced Content Processing
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com",
excluded_tags=['form', 'header', 'footer', 'nav'],
html2text={
"escape_dot": False,
"body_width": 0,
"protect_links": True,
"unicode_snob": True,
"ignore_links": False,
"ignore_images": False,
"ignore_emphasis": False,
"bypass_tables": False,
"ignore_tables": False
}
)
```
### Advanced Usage Patterns
#### 1. Combining Multiple Features
```python
async with AsyncWebCrawler(
browser_type="chromium",
headless=False,
verbose=True
) as crawler:
result = await crawler.arun(
url="https://example.com",
# Anti-bot measures
magic=True,
simulate_user=True,
# Content selection
css_selector="article.main",
word_count_threshold=10,
# Dynamic content handling
js_code="window.scrollTo(0, document.body.scrollHeight);",
wait_for="css:.dynamic-content",
# Content filtering
exclude_external_links=True,
exclude_social_media_links=True,
# Media handling
screenshot=True,
process_iframes=True,
# Content cleaning
remove_overlay_elements=True
)
```
#### 2. Custom Extraction Pipeline
```python
# Define custom schemas and strategies
class Article(BaseModel):
title: str
content: str
date: str
# CSS extraction for initial content
css_schema = {
"name": "Article Extraction",
"baseSelector": "article",
"fields": [
{"name": "title", "selector": "h1", "type": "text"},
{"name": "content", "selector": ".content", "type": "html"},
{"name": "date", "selector": ".date", "type": "text"}
]
}
# LLM processing for semantic analysis
llm_strategy = LLMExtractionStrategy(
provider="ollama/nemotron",
api_token="your-token",
schema=Article.schema(),
instruction="Extract and clean article content"
)
# Chunking strategy for large content
chunking = OverlappingWindowChunking(window_size=1000, overlap=100)
async with AsyncWebCrawler() as crawler:
# First pass: Extract structure
css_result = await crawler.arun(
url="https://example.com",
extraction_strategy=JsonCssExtractionStrategy(css_schema)
)
# Second pass: Semantic processing
llm_result = await crawler.arun(
url="https://example.com",
extraction_strategy=llm_strategy,
chunking_strategy=chunking
)
```
#### 3. Website Crawling with Custom Processing
```python
from typing import Dict
from urllib.parse import urlparse

class CustomWebsiteCrawler:
def __init__(self, crawler: AsyncWebCrawler):
self.crawler = crawler
self.results = {}
async def process_page(self, url: str) -> Dict:
result = await self.crawler.arun(
url=url,
magic=True,
word_count_threshold=10,
exclude_external_links=True,
process_iframes=True,
remove_overlay_elements=True
)
# Process internal links
internal_links = [
link['href'] for link in result.links['internal']
if self._is_valid_link(link['href'])
]
# Extract media
media_urls = [img['src'] for img in result.media['images']]
return {
'content': result.markdown,
'links': internal_links,
'media': media_urls,
'metadata': result.metadata
}
    def _is_valid_link(self, href: str) -> bool:
        # Minimal validity check used above (sketch): keep only http(s) URLs
        return urlparse(href).scheme in ("http", "https")

    async def crawl_website(self, start_url: str, max_depth: int = 2):
visited = set()
queue = [(start_url, 0)]
while queue:
url, depth = queue.pop(0)
if depth > max_depth or url in visited:
continue
            visited.add(url)
            page_data = await self.process_page(url)
            self.results[url] = page_data
            # Follow discovered internal links down to max_depth
            queue.extend((link, depth + 1) for link in page_data['links'])
```


@@ -0,0 +1,282 @@
### AsyncWebCrawler Constructor Parameters
```python
AsyncWebCrawler(
# Core Browser Settings
browser_type: str = "chromium", # Options: "chromium", "firefox", "webkit"
headless: bool = True, # Whether to run browser in headless mode
verbose: bool = False, # Enable verbose logging
# Cache Settings
always_by_pass_cache: bool = False, # Always bypass cache regardless of run settings
base_directory: str = str(Path.home()), # Base directory for cache storage
# Network Settings
proxy: str = None, # Simple proxy URL (e.g., "http://proxy.example.com:8080")
proxy_config: Dict = None, # Advanced proxy settings with auth: {"server": str, "username": str, "password": str}
# Browser Behavior
sleep_on_close: bool = False, # Wait before closing browser
# Other Settings passed to AsyncPlaywrightCrawlerStrategy
user_agent: str = None, # Custom user agent string
headers: Dict[str, str] = {}, # Custom HTTP headers
js_code: Union[str, List[str]] = None, # Default JavaScript to execute
)
```
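A quick instantiation sketch using only parameters from the signature above; the proxy URL and header values are placeholders:
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(
        browser_type="firefox",
        headless=False,                                # show the browser window
        verbose=True,
        proxy="http://proxy.example.com:8080",         # placeholder proxy
        headers={"Accept-Language": "en-US,en;q=0.9"},
    ) as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)

asyncio.run(main())
```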
### arun() Method Parameters
```python
arun(
# Core Parameters
url: str, # Required: URL to crawl
# Content Selection
css_selector: str = None, # CSS selector to extract specific content
word_count_threshold: int = MIN_WORD_THRESHOLD, # Minimum words for content blocks
# Cache Control
bypass_cache: bool = False, # Bypass cache for this request
# Session Management
session_id: str = None, # Session identifier for persistent browsing
# Screenshot Options
screenshot: bool = False, # Take page screenshot
screenshot_wait_for: float = None, # Wait time before screenshot
# Content Processing
process_iframes: bool = False, # Process iframe content
remove_overlay_elements: bool = False, # Remove popups/modals
# Anti-Bot/Detection
simulate_user: bool = False, # Simulate human-like behavior
override_navigator: bool = False, # Override navigator properties
magic: bool = False, # Enable all anti-detection features
# Content Filtering
excluded_tags: List[str] = None, # HTML tags to exclude
exclude_external_links: bool = False, # Remove external links
exclude_social_media_links: bool = False, # Remove social media links
exclude_external_images: bool = False, # Remove external images
exclude_social_media_domains: List[str] = None, # Additional social media domains to exclude
remove_forms: bool = False, # Remove all form elements
# JavaScript Handling
js_code: Union[str, List[str]] = None, # JavaScript to execute
js_only: bool = False, # Only execute JavaScript without reloading page
wait_for: str = None, # Wait condition (CSS selector or JS function)
# Page Loading
page_timeout: int = 60000, # Page load timeout in milliseconds
delay_before_return_html: float = None, # Wait before returning HTML
# Debug Options
log_console: bool = False, # Log browser console messages
# Content Format Control
only_text: bool = False, # Extract only text content
keep_data_attributes: bool = False, # Keep data-* attributes in HTML
# Markdown Options
include_links_on_markdown: bool = False, # Include links in markdown output
html2text: Dict = {}, # HTML to text conversion options
# Extraction Strategy
extraction_strategy: ExtractionStrategy = None, # Strategy for structured data extraction
# Advanced Browser Control
user_agent: str = None, # Override user agent for this request
)
```
### Extraction Strategy Parameters
```python
# JsonCssExtractionStrategy
{
"name": str, # Name of extraction schema
"baseSelector": str, # Base CSS selector
"fields": [
{
"name": str, # Field name
"selector": str, # CSS selector
"type": str, # Data type ("text", etc.)
"transform": str = None # Optional transformation
}
]
}
# LLMExtractionStrategy
{
"provider": str, # LLM provider (e.g., "openai/gpt-4", "huggingface/...", "ollama/...")
"api_token": str, # API token
"schema": dict, # Pydantic model schema
"extraction_type": str, # Type of extraction ("schema", etc.)
"instruction": str, # Extraction instruction
"extra_args": dict = None, # Additional provider-specific arguments
"extra_headers": dict = None # Additional HTTP headers
}
```
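As a sketch of the `extra_args` parameter, provider-specific options can be forwarded through it; the `temperature` and `max_tokens` keys below are assumptions about what the chosen provider accepts:
```python
from pydantic import BaseModel
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class Article(BaseModel):
    title: str
    summary: str

strategy = LLMExtractionStrategy(
    provider="openai/gpt-4",
    api_token="your-token",
    schema=Article.schema(),
    extraction_type="schema",
    instruction="Extract the article title and a short summary",
    extra_args={"temperature": 0.0, "max_tokens": 800}  # assumed provider options
)
```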
### HTML to Text Conversion Options (html2text parameter)
```python
{
"escape_dot": bool = True, # Escape dots in text
# Other html2text library options
}
```
### CrawlResult Fields
```python
class CrawlResult(BaseModel):
# Basic Information
url: str # The crawled URL
# Example: "https://example.com"
success: bool # Whether the crawl was successful
# Example: True/False
status_code: Optional[int] # HTTP status code
# Example: 200, 404, 500
# Content Fields
html: str # Raw HTML content
# Example: "<html><body>...</body></html>"
cleaned_html: Optional[str] # HTML after cleaning and processing
# Example: "<article><p>Clean content...</p></article>"
fit_html: Optional[str] # Most relevant HTML content after content cleaning strategy
# Example: "<div><p>Most relevant content...</p></div>"
markdown: Optional[str] # HTML converted to markdown
# Example: "# Title\n\nContent paragraph..."
fit_markdown: Optional[str] # Most relevant content in markdown
# Example: "# Main Article\n\nKey content..."
# Media Content
media: Dict[str, List[Dict]] = {} # Extracted media information
# Example: {
# "images": [
# {
# "src": "https://example.com/image.jpg",
# "alt": "Image description",
# "desc": "Contextual description",
# "score": 5, # Relevance score
# "type": "image"
# }
# ],
# "videos": [
# {
# "src": "https://example.com/video.mp4",
# "alt": "Video title",
# "type": "video",
# "description": "Video context"
# }
# ],
# "audios": [
# {
# "src": "https://example.com/audio.mp3",
# "alt": "Audio title",
# "type": "audio",
# "description": "Audio context"
# }
# ]
# }
# Link Information
links: Dict[str, List[Dict]] = {} # Extracted links
# Example: {
# "internal": [
# {
# "href": "https://example.com/page",
# "text": "Link text",
# "title": "Link title"
# }
# ],
# "external": [
# {
# "href": "https://external.com",
# "text": "External link text",
# "title": "External link title"
# }
# ]
# }
# Extraction Results
extracted_content: Optional[str] # Content from extraction strategy
# Example for JsonCssExtractionStrategy:
# '[{"title": "Article 1", "date": "2024-03-20"}, ...]'
# Example for LLMExtractionStrategy:
# '{"entities": [...], "relationships": [...]}'
# Additional Information
metadata: Optional[dict] = None # Page metadata
# Example: {
# "title": "Page Title",
# "description": "Meta description",
# "keywords": ["keyword1", "keyword2"],
# "author": "Author Name",
# "published_date": "2024-03-20"
# }
screenshot: Optional[str] = None # Base64 encoded screenshot
# Example: "iVBORw0KGgoAAAANSUhEUgAA..."
error_message: Optional[str] = None # Error message if crawl failed
# Example: "Failed to load page: timeout"
session_id: Optional[str] = None # Session identifier
# Example: "session_123456"
response_headers: Optional[dict] = None # HTTP response headers
# Example: {
# "content-type": "text/html",
# "server": "nginx/1.18.0",
# "date": "Wed, 20 Mar 2024 12:00:00 GMT"
# }
```
### Common Usage Patterns
1. Basic Content Extraction:
```python
result = await crawler.arun(url="https://example.com")
print(result.markdown) # Clean, readable content
print(result.cleaned_html) # Cleaned HTML
```
2. Media Analysis:
```python
result = await crawler.arun(url="https://example.com")
for image in result.media["images"]:
if image["score"] > 3: # High-relevance images
print(f"High-quality image: {image['src']}")
```
3. Link Analysis:
```python
result = await crawler.arun(url="https://example.com")
internal_links = [link["href"] for link in result.links["internal"]]
external_links = [link["href"] for link in result.links["external"]]
```
4. Structured Data Extraction:
```python
result = await crawler.arun(
url="https://example.com",
extraction_strategy=my_strategy
)
structured_data = json.loads(result.extracted_content)
```
5. Error Handling:
```python
result = await crawler.arun(url="https://example.com")
if not result.success:
print(f"Crawl failed: {result.error_message}")
print(f"Status code: {result.status_code}")
```


@@ -0,0 +1,67 @@
1. **E-commerce Product Monitor**
- Scraping product details from multiple e-commerce sites
- Price tracking with structured data extraction
- Handling dynamic content and anti-bot measures
- Features: JsonCssExtraction, session management, anti-bot
2. **News Aggregator & Summarizer**
- Crawling news websites
- Content extraction and summarization
- Topic classification
- Features: LLMExtraction, CosineStrategy, content cleaning
3. **Academic Paper Research Assistant**
- Crawling research papers from academic sites
- Extracting citations and references
- Building knowledge graphs
- Features: structured extraction, link analysis, chunking
4. **Social Media Content Analyzer**
- Handling JavaScript-heavy sites
- Dynamic content loading
- Sentiment analysis integration
- Features: dynamic content handling, session management
5. **Real Estate Market Analyzer**
- Scraping property listings
- Processing image galleries
- Geolocation data extraction
- Features: media handling, structured data extraction
6. **Documentation Site Generator**
- Recursive website crawling
- Markdown generation
- Link validation
- Features: website crawling, content cleaning
7. **Job Board Aggregator**
- Handling pagination
- Structured job data extraction
- Filtering and categorization
- Features: session management, JsonCssExtraction
8. **Recipe Database Builder**
- Schema-based extraction
- Image processing
- Ingredient parsing
- Features: structured extraction, media handling
9. **Travel Blog Content Analyzer**
- Location extraction
- Image and map processing
- Content categorization
- Features: CosineStrategy, media handling
10. **Technical Documentation Scraper**
- API documentation extraction
- Code snippet processing
- Version tracking
- Features: content cleaning, structured extraction
Each example will include:
- Problem description
- Technical requirements
- Complete implementation
- Error handling
- Output processing
- Performance considerations
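As an illustration of the shape these examples will take, below is a minimal sketch of the first one, the e-commerce product monitor. The URL, selectors, and field names are placeholders; the parameters used (`magic`, `session_id`, `extraction_strategy`) are described in the sections above.
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Placeholder schema — real selectors depend on the target store's markup
product_schema = {
    "name": "Product Listing",
    "baseSelector": ".product-card",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
    ],
}

async def monitor(url: str):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            extraction_strategy=JsonCssExtractionStrategy(product_schema),
            magic=True,          # anti-bot features, as described above
            session_id="shop",   # reuse one browser session across runs
        )
        if result.success:
            return json.loads(result.extracted_content)
        raise RuntimeError(result.error_message)

# products = asyncio.run(monitor("https://shop.example.com/deals"))
```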