Update Documentation

UncleCode
2024-10-27 19:24:46 +08:00
parent 38474bd66a
commit 4239654722
111 changed files with 7680 additions and 53 deletions

docs/details/extraction.md Normal file

@@ -0,0 +1,157 @@
### Extraction Strategies
#### 1. LLMExtractionStrategy
```python
LLMExtractionStrategy(
# Core Parameters
provider: str = DEFAULT_PROVIDER, # LLM provider (e.g., "openai/gpt-4", "huggingface/...", "ollama/...")
api_token: Optional[str] = None, # API token for the provider
instruction: str = None, # Custom instruction for extraction
schema: Dict = None, # Pydantic model schema for structured extraction
extraction_type: str = "block", # Type of extraction: "block" or "schema"
# Chunking Parameters
chunk_token_threshold: int = CHUNK_TOKEN_THRESHOLD, # Maximum tokens per chunk
overlap_rate: float = OVERLAP_RATE, # Overlap between chunks
word_token_rate: float = WORD_TOKEN_RATE, # Conversion rate from words to tokens
apply_chunking: bool = True, # Whether to apply text chunking
# API Configuration
base_url: str = None, # Base URL for API calls
api_base: str = None, # Alternative base URL
extra_args: Dict = {}, # Additional provider-specific arguments
verbose: bool = False # Enable verbose logging
)
```
Usage Example:
```python
from pydantic import BaseModel
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class NewsArticle(BaseModel):
    title: str
    content: str

strategy = LLMExtractionStrategy(
    provider="ollama/nemotron",
    api_token="your-token",
    schema=NewsArticle.schema(),
    extraction_type="schema",  # extract into the schema rather than free-form blocks
    instruction="Extract news article content with title and main text"
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com", extraction_strategy=strategy)
```
#### 2. JsonCssExtractionStrategy
```python
JsonCssExtractionStrategy(
schema: Dict[str, Any], # Schema defining extraction rules
verbose: bool = False # Enable verbose logging
)
# Schema Structure
schema = {
"name": str, # Name of the extraction schema
"baseSelector": str, # CSS selector for base elements
"fields": [
{
"name": str, # Field name
"selector": str, # CSS selector
"type": str, # Field type: "text", "attribute", "html", "regex", "nested", "list", "nested_list"
"attribute": str, # For type="attribute"
"pattern": str, # For type="regex"
"transform": str, # Optional: "lowercase", "uppercase", "strip"
"default": Any, # Default value if extraction fails
"fields": List[Dict], # For nested/list types
}
]
}
```
Usage Example:
```python
schema = {
"name": "News Articles",
"baseSelector": "article.news-item",
"fields": [
{
"name": "title",
"selector": "h1",
"type": "text",
"transform": "strip"
},
{
"name": "date",
"selector": ".date",
"type": "attribute",
"attribute": "datetime"
}
]
}
strategy = JsonCssExtractionStrategy(schema)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com", extraction_strategy=strategy)
```
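The `nested`, `list`, and `nested_list` field types take a `fields` list of their own, so one schema can capture repeated sub-structures. A sketch, assuming each article carries `.comment` child elements (the selectors and field names are illustrative):
```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Hypothetical page layout: each article contains several comment elements.
comment_schema = {
    "name": "Articles with Comments",
    "baseSelector": "article.news-item",
    "fields": [
        {"name": "title", "selector": "h1", "type": "text"},
        {
            "name": "comments",
            "selector": ".comment",      # one entry per matched element
            "type": "nested_list",       # list of objects, each with its own fields
            "fields": [
                {"name": "author", "selector": ".author", "type": "text"},
                {"name": "body", "selector": ".body", "type": "text"}
            ]
        }
    ]
}
strategy = JsonCssExtractionStrategy(comment_schema)
```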
#### 3. CosineStrategy
```python
CosineStrategy(
# Content Filtering
semantic_filter: str = None, # Keywords used to keep only semantically relevant clusters
word_count_threshold: int = 10, # Minimum words per cluster
sim_threshold: float = 0.3, # Similarity threshold for filtering
# Clustering Parameters
max_dist: float = 0.2, # Maximum distance for clustering
linkage_method: str = 'ward', # Clustering linkage method
top_k: int = 3, # Number of top categories to extract
# Model Configuration
model_name: str = 'sentence-transformers/all-MiniLM-L6-v2', # Embedding model
verbose: bool = False # Enable verbose logging
)
```
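Usage Example (a minimal sketch; the filter keywords and URL are illustrative):
```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import CosineStrategy

strategy = CosineStrategy(
    semantic_filter="technology, gadgets",  # keep clusters related to these keywords
    word_count_threshold=10,
    top_k=3
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com", extraction_strategy=strategy)
    print(result.extracted_content)
```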
### Chunking Strategies
#### 1. RegexChunking
```python
RegexChunking(
patterns: List[str] = None # List of regex patterns for splitting text
# Default pattern: [r'\n\n']
)
```
Usage Example:
```python
chunker = RegexChunking(patterns=[r'\n\n', r'\.\s+']) # Split on double newlines and sentences
chunks = chunker.chunk(text)
```
#### 2. SlidingWindowChunking
```python
SlidingWindowChunking(
window_size: int = 100, # Size of the window in words
step: int = 50, # Number of words to slide the window
)
```
Usage Example:
```python
chunker = SlidingWindowChunking(window_size=200, step=100)
chunks = chunker.chunk(text) # Creates overlapping chunks of 200 words, moving 100 words at a time
```
#### 3. OverlappingWindowChunking
```python
OverlappingWindowChunking(
window_size: int = 1000, # Size of each chunk in words
overlap: int = 100 # Number of words to overlap between chunks
)
```
Usage Example:
```python
chunker = OverlappingWindowChunking(window_size=500, overlap=50)
chunks = chunker.chunk(text) # Creates 500-word chunks with 50-word overlap
```


@@ -0,0 +1,175 @@
# Features
## Current Features
1. Async-first architecture for high-performance web crawling
2. Built-in anti-bot detection bypass ("magic mode")
3. Multiple browser engine support (Chromium, Firefox, WebKit)
4. Smart session management with automatic cleanup
5. Automatic content cleaning and relevance scoring
6. Built-in markdown generation with formatting preservation
7. Intelligent image scoring and filtering
8. Automatic popup and overlay removal
9. Smart wait conditions (CSS/JavaScript based)
10. Multi-provider LLM integration (OpenAI, HuggingFace, Ollama)
11. Schema-based structured data extraction
12. Automated iframe content processing
13. Intelligent link categorization (internal/external)
14. Multiple chunking strategies for large content
15. Real-time HTML cleaning and sanitization
16. Automatic screenshot capabilities
17. Social media link filtering
18. Semantic similarity-based content clustering
19. Human behavior simulation for anti-bot bypass
20. Proxy support with authentication
21. Automatic resource cleanup
22. Custom CSS selector-based extraction
23. Automatic content relevance scoring ("fit" content)
24. Recursive website crawling capabilities
25. Flexible hook system for customization
26. Built-in caching system
27. Domain-based content filtering
28. Dynamic content handling with JavaScript execution
29. Automatic media content extraction and classification
30. Metadata extraction and processing
31. Customizable HTML to Markdown conversion
32. Token-aware content chunking for LLM processing
33. Automatic response header and status code handling
34. Browser fingerprint customization
35. Multiple extraction strategies (LLM, CSS, Cosine, XPATH)
36. Automatic error image generation for failed screenshots
37. Smart content overlap handling for large texts
38. Built-in rate limiting for batch processing
39. Automatic cookie handling
40. Browser console logging and debugging capabilities
## Feature Technical Details
• Browser Management
- Asynchronous browser control
- Multi-browser support (Chromium, Firefox, WebKit)
- Headless mode support
- Browser cleanup and resource management
- Custom browser arguments and configuration
- Context management with `__aenter__` and `__aexit__`
• Session Handling
- Session management with TTL (Time To Live)
- Session reuse capabilities
- Session cleanup for expired sessions
- Session-based context preservation
• Stealth Features
- Playwright stealth configuration
- Navigator properties override
- WebDriver detection evasion
- Chrome app simulation
- Plugin simulation
- Language preferences simulation
- Hardware concurrency simulation
- Media codecs simulation
• Network Features
- Proxy support with authentication
- Custom headers management
- Cookie handling
- Response header capture
- Status code tracking
- Network idle detection
• Page Interaction
- Smart wait functionality for multiple conditions
- CSS selector-based waiting
- JavaScript condition waiting
- Custom JavaScript execution
- User interaction simulation (mouse/keyboard)
- Page scrolling
- Timeout management
- Load state monitoring
• Content Processing
- HTML content extraction
- Iframe processing and content extraction
- Delayed content retrieval
- Content caching
- Cache file management
- HTML cleaning and processing
• Image Handling
- Screenshot capabilities (full page)
- Base64 encoding of screenshots
- Image dimension updating
- Image filtering (size/visibility)
- Error image generation
- Natural width/height preservation
• Overlay Management
- Popup removal
- Cookie notice removal
- Newsletter dialog removal
- Modal removal
- Fixed position element removal
- Z-index based overlay detection
- Visibility checking
• Hook System (see the sketch after this list)
- Browser creation hooks
- User agent update hooks
- Execution start hooks
- Navigation hooks (before/after goto)
- HTML retrieval hooks
- HTML return hooks
• Error Handling
- Browser error catching
- Network error handling
- Timeout handling
- Screenshot error recovery
- Invalid selector handling
- General exception management
• Performance Features
- Concurrent URL processing
- Semaphore-based rate limiting
- Async gathering of results
- Resource cleanup
- Memory management
• Debug Features
- Console logging
- Page error logging
- Verbose mode
- Error message generation
- Warning system
• Security Features
- Certificate error handling
- Sandbox configuration
- GPU handling
- CSP (Content Security Policy) compliant waiting
• Configuration
- User agent customization
- Viewport configuration
- Timeout configuration
- Browser type selection
- Proxy configuration
- Header configuration
• Data Models
- Pydantic model for responses
- Type hints throughout code
- Structured response format
- Optional response fields
• File System Integration
- Cache directory management
- File path handling
- Cache metadata storage
- File read/write operations
• Metadata Handling
- Response headers capture
- Status code tracking
- Cache metadata
- Session tracking
- Timestamp management
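The hooks above are registered on the crawler strategy. A minimal sketch, assuming a `set_hook` method on the Playwright crawler strategy and the hook names listed in the Hook System item; exact hook signatures may differ between versions:
```python
from crawl4ai import AsyncWebCrawler

async def before_goto(page, **kwargs):
    # Runs just before navigation; here we attach an extra header (illustrative).
    await page.set_extra_http_headers({"X-Docs-Example": "1"})
    return page

async def before_return_html(page, html, **kwargs):
    # Last chance to inspect the page before its HTML is handed back.
    print(f"Returning {len(html)} characters of HTML")
    return page

async with AsyncWebCrawler() as crawler:
    crawler.crawler_strategy.set_hook("before_goto", before_goto)
    crawler.crawler_strategy.set_hook("before_return_html", before_return_html)
    result = await crawler.arun(url="https://example.com")
```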

docs/details/features.md Normal file

@@ -0,0 +1,150 @@
### 1. Basic Web Crawling
```python
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com")
print(result.markdown) # Get clean markdown content
print(result.html) # Get raw HTML
print(result.cleaned_html) # Get cleaned HTML
```
### 2. Browser Control Options
- Multiple Browser Support
```python
# Choose between different browser engines
crawler = AsyncWebCrawler(browser_type="firefox") # or "chromium", "webkit"
crawler = AsyncWebCrawler(headless=False) # For visible browser
```
- Proxy Configuration
```python
crawler = AsyncWebCrawler(proxy="http://proxy.example.com:8080")
# Or with authentication
crawler = AsyncWebCrawler(proxy_config={
"server": "http://proxy.example.com:8080",
"username": "user",
"password": "pass"
})
```
### 3. Content Selection & Filtering
- CSS Selector Support
```python
result = await crawler.arun(
url="https://example.com",
css_selector=".main-content" # Extract specific content
)
```
- Content Filtering Options
```python
result = await crawler.arun(
url="https://example.com",
word_count_threshold=10, # Minimum words per block
excluded_tags=['form', 'header'], # Tags to exclude
exclude_external_links=True, # Remove external links
exclude_social_media_links=True, # Remove social media links
exclude_external_images=True # Remove external images
)
```
### 4. Dynamic Content Handling
- JavaScript Execution
```python
result = await crawler.arun(
url="https://example.com",
js_code="window.scrollTo(0, document.body.scrollHeight)" # Execute custom JS
)
```
- Wait Conditions
```python
# Wait for an element to appear
result = await crawler.arun(
    url="https://example.com",
    wait_for="css:.my-element"
)
# Or wait for a JavaScript condition
result = await crawler.arun(
    url="https://example.com",
    wait_for="js:() => document.readyState === 'complete'"
)
```
### 5. Anti-Bot Protection Handling
```python
result = await crawler.arun(
url="https://example.com",
simulate_user=True, # Simulate human behavior
override_navigator=True, # Mask automation signals
magic=True # Enable all anti-detection features
)
```
### 6. Session Management
```python
session_id = "my_session"
result1 = await crawler.arun(url="https://example.com/page1", session_id=session_id)
result2 = await crawler.arun(url="https://example.com/page2", session_id=session_id)
await crawler.crawler_strategy.kill_session(session_id)
```
### 7. Media Handling
- Screenshot Capture
```python
result = await crawler.arun(
url="https://example.com",
screenshot=True
)
base64_screenshot = result.screenshot
```
- Media Extraction
```python
result = await crawler.arun(url="https://example.com")
print(result.media['images']) # List of images
print(result.media['videos']) # List of videos
print(result.media['audios']) # List of audio files
```
### 8. Structured Data Extraction
- CSS-based Extraction
```python
schema = {
"name": "News Articles",
"baseSelector": "article",
"fields": [
{"name": "title", "selector": "h1", "type": "text"},
{"name": "date", "selector": ".date", "type": "text"}
]
}
extraction_strategy = JsonCssExtractionStrategy(schema)
result = await crawler.arun(
url="https://example.com",
extraction_strategy=extraction_strategy
)
structured_data = json.loads(result.extracted_content)
```
- LLM-based Extraction (Multiple Providers)
```python
class NewsArticle(BaseModel):
title: str
summary: str
strategy = LLMExtractionStrategy(
provider="ollama/nemotron", # or "openai/gpt-4", "huggingface/..."
api_token="your-token",
schema=NewsArticle.schema(),
instruction="Extract news article details..."
)
result = await crawler.arun(
url="https://example.com",
extraction_strategy=strategy
)
```
### 9. Content Cleaning & Processing
```python
result = await crawler.arun(
url="https://example.com",
remove_overlay_elements=True, # Remove popups/modals
process_iframes=True, # Process iframe content
)
print(result.fit_markdown) # Get most relevant content
print(result.fit_html) # Get cleaned HTML
```


@@ -0,0 +1,457 @@
### 1. Basic Web Crawling
Basic web crawling provides the foundation for extracting content from websites. The library supports both simple single-page crawling and recursive website crawling.
```python
# Simple page crawling
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com")
print(result.html) # Raw HTML
print(result.markdown) # Cleaned markdown
print(result.cleaned_html) # Cleaned HTML
# Recursive website crawling (sketch — scrape_recursive is left to implement,
# e.g., by following result.links['internal'] until max_depth is reached)
class SimpleWebsiteScraper:
    def __init__(self, crawler: AsyncWebCrawler):
        self.crawler = crawler

    async def scrape(self, start_url: str, max_depth: int):
        results = await self.scrape_recursive(start_url, max_depth)
        return results

# Usage
async with AsyncWebCrawler() as crawler:
    scraper = SimpleWebsiteScraper(crawler)
    results = await scraper.scrape("https://example.com", max_depth=2)
```
### 2. Browser Control Options
The library provides extensive control over browser behavior, allowing customization of browser type, headless mode, and proxy settings.
```python
# Browser Type Selection
async with AsyncWebCrawler(
browser_type="firefox", # Options: "chromium", "firefox", "webkit"
headless=False, # For visible browser
verbose=True # Enable logging
) as crawler:
result = await crawler.arun(url="https://example.com")
# Proxy Configuration
async with AsyncWebCrawler(
proxy_config={
"server": "http://proxy.example.com:8080",
"username": "user",
"password": "pass"
},
headers={
"User-Agent": "Custom User Agent",
"Accept-Language": "en-US,en;q=0.9"
}
) as crawler:
result = await crawler.arun(url="https://example.com")
```
### 3. Content Selection & Filtering
The library offers multiple ways to select and filter content, from CSS selectors to word count thresholds.
```python
# CSS Selector and Content Filtering
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com",
css_selector="article.main-content", # Extract specific content
word_count_threshold=10, # Minimum words per block
excluded_tags=['form', 'header'], # Tags to exclude
exclude_external_links=True, # Remove external links
exclude_social_media_links=True, # Remove social media links
exclude_domains=["pinterest.com", "facebook.com"] # Exclude specific domains
)
# Custom HTML to Text Options
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com",
html2text={
"escape_dot": False,
"links_each_paragraph": True,
"protect_links": True
}
)
```
### 4. Dynamic Content Handling
The library provides sophisticated handling of dynamic content with JavaScript execution and wait conditions.
```python
# JavaScript Execution and Wait Conditions
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com",
js_code=[
"window.scrollTo(0, document.body.scrollHeight);",
"document.querySelector('.load-more').click();"
],
wait_for="css:.dynamic-content", # Wait for element
delay_before_return_html=2.0 # Wait after JS execution
)
# Smart Wait Conditions
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com",
wait_for="""() => {
return document.querySelectorAll('.item').length > 10;
}""",
page_timeout=60000 # 60 seconds timeout
)
```
### 5. Advanced Link Analysis
The library provides comprehensive link analysis capabilities, distinguishing between internal and external links, with options for filtering and processing.
```python
# Basic Link Analysis
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com")
# Access internal and external links
for internal_link in result.links['internal']:
print(f"Internal: {internal_link['href']} - {internal_link['text']}")
for external_link in result.links['external']:
print(f"External: {external_link['href']} - {external_link['text']}")
# Advanced Link Filtering
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com",
exclude_external_links=True, # Remove all external links
exclude_social_media_links=True, # Remove social media links
exclude_social_media_domains=[ # Custom social media domains
"facebook.com", "twitter.com", "instagram.com"
],
exclude_domains=["pinterest.com"] # Specific domains to exclude
)
```
### 6. Anti-Bot Protection Handling
The library includes sophisticated anti-detection mechanisms to handle websites with bot protection.
```python
# Basic Anti-Detection
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com",
simulate_user=True, # Simulate human behavior
override_navigator=True # Override navigator properties
)
# Advanced Anti-Detection with Magic Mode
async with AsyncWebCrawler(headless=False) as crawler:
result = await crawler.arun(
url="https://example.com",
magic=True, # Enable all anti-detection features
remove_overlay_elements=True, # Remove popups/modals automatically
# Custom navigator properties
js_code="""
Object.defineProperty(navigator, 'webdriver', {
get: () => undefined
});
"""
)
```
### 7. Session Management
Session management allows maintaining state across multiple requests and handling cookies.
```python
# Basic Session Management
async with AsyncWebCrawler() as crawler:
session_id = "my_session"
# Login
login_result = await crawler.arun(
url="https://example.com/login",
session_id=session_id,
js_code="document.querySelector('form').submit();"
)
# Use same session for subsequent requests
protected_result = await crawler.arun(
url="https://example.com/protected",
session_id=session_id
)
# Clean up session
await crawler.crawler_strategy.kill_session(session_id)
# Advanced Session with Custom Cookies
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com",
session_id="custom_session",
cookies=[{
"name": "sessionId",
"value": "abc123",
"domain": "example.com"
}]
)
```
### 8. Screenshot and Media Handling
The library provides comprehensive media handling capabilities, including screenshots and media content extraction.
```python
# Screenshot Capture
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com",
screenshot=True,
screenshot_wait_for=2.0 # Wait before taking screenshot
)
# Save the decoded screenshot (requires `import base64`)
if result.screenshot:
with open("screenshot.png", "wb") as f:
f.write(base64.b64decode(result.screenshot))
# Media Extraction
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://example.com")
# Process images with metadata
for image in result.media['images']:
print(f"Image: {image['src']}")
print(f"Alt text: {image['alt']}")
print(f"Context: {image['desc']}")
print(f"Relevance score: {image['score']}")
# Process videos and audio
for video in result.media['videos']:
print(f"Video: {video['src']}")
for audio in result.media['audios']:
print(f"Audio: {audio['src']}")
```
### 9. Structured Data Extraction & Chunking
The library supports multiple strategies for structured data extraction and content chunking.
```python
# LLM-based Extraction
class NewsArticle(BaseModel):
title: str
content: str
author: str
extraction_strategy = LLMExtractionStrategy(
provider='openai/gpt-4',
api_token="your-token",
schema=NewsArticle.schema(),
instruction="Extract news article details",
chunk_token_threshold=1000,
overlap_rate=0.1
)
# CSS-based Extraction
schema = {
"name": "Product Listing",
"baseSelector": ".product-card",
"fields": [
{
"name": "title",
"selector": "h2",
"type": "text"
},
{
"name": "price",
"selector": ".price",
"type": "text",
"transform": "strip"
}
]
}
css_strategy = JsonCssExtractionStrategy(schema)
# Text Chunking
from crawl4ai.chunking_strategy import OverlappingWindowChunking
chunking_strategy = OverlappingWindowChunking(
window_size=1000,
overlap=100
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com",
extraction_strategy=extraction_strategy,
chunking_strategy=chunking_strategy
)
```
### 10. Content Cleaning & Processing
The library provides extensive content cleaning and processing capabilities, ensuring high-quality output in various formats.
```python
# Basic Content Cleaning
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com",
remove_overlay_elements=True, # Remove popups/modals
process_iframes=True, # Process iframe content
word_count_threshold=10 # Minimum words per block
)
print(result.cleaned_html) # Clean HTML
print(result.fit_html) # Most relevant HTML content
print(result.fit_markdown) # Most relevant markdown content
# Advanced Content Processing
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com",
excluded_tags=['form', 'header', 'footer', 'nav'],
html2text={
"escape_dot": False,
"body_width": 0,
"protect_links": True,
"unicode_snob": True,
"ignore_links": False,
"ignore_images": False,
"ignore_emphasis": False,
"bypass_tables": False,
"ignore_tables": False
}
)
```
### Advanced Usage Patterns
#### 1. Combining Multiple Features
```python
async with AsyncWebCrawler(
browser_type="chromium",
headless=False,
verbose=True
) as crawler:
result = await crawler.arun(
url="https://example.com",
# Anti-bot measures
magic=True,
simulate_user=True,
# Content selection
css_selector="article.main",
word_count_threshold=10,
# Dynamic content handling
js_code="window.scrollTo(0, document.body.scrollHeight);",
wait_for="css:.dynamic-content",
# Content filtering
exclude_external_links=True,
exclude_social_media_links=True,
# Media handling
screenshot=True,
process_iframes=True,
# Content cleaning
remove_overlay_elements=True
)
```
#### 2. Custom Extraction Pipeline
```python
# Define custom schemas and strategies
class Article(BaseModel):
title: str
content: str
date: str
# CSS extraction for initial content
css_schema = {
"name": "Article Extraction",
"baseSelector": "article",
"fields": [
{"name": "title", "selector": "h1", "type": "text"},
{"name": "content", "selector": ".content", "type": "html"},
{"name": "date", "selector": ".date", "type": "text"}
]
}
# LLM processing for semantic analysis
llm_strategy = LLMExtractionStrategy(
provider="ollama/nemotron",
api_token="your-token",
schema=Article.schema(),
instruction="Extract and clean article content"
)
# Chunking strategy for large content
chunking = OverlappingWindowChunking(window_size=1000, overlap=100)
async with AsyncWebCrawler() as crawler:
# First pass: Extract structure
css_result = await crawler.arun(
url="https://example.com",
extraction_strategy=JsonCssExtractionStrategy(css_schema)
)
# Second pass: Semantic processing
llm_result = await crawler.arun(
url="https://example.com",
extraction_strategy=llm_strategy,
chunking_strategy=chunking
)
```
#### 3. Website Crawling with Custom Processing
```python
from typing import Dict
from urllib.parse import urlparse

class CustomWebsiteCrawler:
def __init__(self, crawler: AsyncWebCrawler):
self.crawler = crawler
self.results = {}
async def process_page(self, url: str) -> Dict:
result = await self.crawler.arun(
url=url,
magic=True,
word_count_threshold=10,
exclude_external_links=True,
process_iframes=True,
remove_overlay_elements=True
)
# Process internal links
internal_links = [
link['href'] for link in result.links['internal']
if self._is_valid_link(link['href'])
]
# Extract media
media_urls = [img['src'] for img in result.media['images']]
return {
'content': result.markdown,
'links': internal_links,
'media': media_urls,
'metadata': result.metadata
}
    def _is_valid_link(self, href: str) -> bool:
        # Minimal validity check used above (sketch): keep only http(s) URLs
        return urlparse(href).scheme in ("http", "https")

    async def crawl_website(self, start_url: str, max_depth: int = 2):
visited = set()
queue = [(start_url, 0)]
while queue:
url, depth = queue.pop(0)
if depth > max_depth or url in visited:
continue
            visited.add(url)
            page_data = await self.process_page(url)
            self.results[url] = page_data
            # Follow discovered internal links down to max_depth
            queue.extend((link, depth + 1) for link in page_data['links'])
```


@@ -0,0 +1,282 @@
### AsyncWebCrawler Constructor Parameters
```python
AsyncWebCrawler(
# Core Browser Settings
browser_type: str = "chromium", # Options: "chromium", "firefox", "webkit"
headless: bool = True, # Whether to run browser in headless mode
verbose: bool = False, # Enable verbose logging
# Cache Settings
always_by_pass_cache: bool = False, # Always bypass cache regardless of run settings
base_directory: str = str(Path.home()), # Base directory for cache storage
# Network Settings
proxy: str = None, # Simple proxy URL (e.g., "http://proxy.example.com:8080")
proxy_config: Dict = None, # Advanced proxy settings with auth: {"server": str, "username": str, "password": str}
# Browser Behavior
sleep_on_close: bool = False, # Wait before closing browser
# Other Settings passed to AsyncPlaywrightCrawlerStrategy
user_agent: str = None, # Custom user agent string
headers: Dict[str, str] = {}, # Custom HTTP headers
js_code: Union[str, List[str]] = None, # Default JavaScript to execute
)
```
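A quick instantiation sketch using only parameters from the signature above; the proxy URL and header values are placeholders:
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(
        browser_type="firefox",
        headless=False,                                # show the browser window
        verbose=True,
        proxy="http://proxy.example.com:8080",         # placeholder proxy
        headers={"Accept-Language": "en-US,en;q=0.9"},
    ) as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)

asyncio.run(main())
```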
### arun() Method Parameters
```python
arun(
# Core Parameters
url: str, # Required: URL to crawl
# Content Selection
css_selector: str = None, # CSS selector to extract specific content
word_count_threshold: int = MIN_WORD_THRESHOLD, # Minimum words for content blocks
# Cache Control
bypass_cache: bool = False, # Bypass cache for this request
# Session Management
session_id: str = None, # Session identifier for persistent browsing
# Screenshot Options
screenshot: bool = False, # Take page screenshot
screenshot_wait_for: float = None, # Wait time before screenshot
# Content Processing
process_iframes: bool = False, # Process iframe content
remove_overlay_elements: bool = False, # Remove popups/modals
# Anti-Bot/Detection
simulate_user: bool = False, # Simulate human-like behavior
override_navigator: bool = False, # Override navigator properties
magic: bool = False, # Enable all anti-detection features
# Content Filtering
excluded_tags: List[str] = None, # HTML tags to exclude
exclude_external_links: bool = False, # Remove external links
exclude_social_media_links: bool = False, # Remove social media links
exclude_external_images: bool = False, # Remove external images
exclude_social_media_domains: List[str] = None, # Additional social media domains to exclude
remove_forms: bool = False, # Remove all form elements
# JavaScript Handling
js_code: Union[str, List[str]] = None, # JavaScript to execute
js_only: bool = False, # Only execute JavaScript without reloading page
wait_for: str = None, # Wait condition (CSS selector or JS function)
# Page Loading
page_timeout: int = 60000, # Page load timeout in milliseconds
delay_before_return_html: float = None, # Wait before returning HTML
# Debug Options
log_console: bool = False, # Log browser console messages
# Content Format Control
only_text: bool = False, # Extract only text content
keep_data_attributes: bool = False, # Keep data-* attributes in HTML
# Markdown Options
include_links_on_markdown: bool = False, # Include links in markdown output
html2text: Dict = {}, # HTML to text conversion options
# Extraction Strategy
extraction_strategy: ExtractionStrategy = None, # Strategy for structured data extraction
# Advanced Browser Control
user_agent: str = None, # Override user agent for this request
)
```
### Extraction Strategy Parameters
```python
# JsonCssExtractionStrategy
{
"name": str, # Name of extraction schema
"baseSelector": str, # Base CSS selector
"fields": [
{
"name": str, # Field name
"selector": str, # CSS selector
"type": str, # Data type ("text", etc.)
"transform": str = None # Optional transformation
}
]
}
# LLMExtractionStrategy
{
"provider": str, # LLM provider (e.g., "openai/gpt-4", "huggingface/...", "ollama/...")
"api_token": str, # API token
"schema": dict, # Pydantic model schema
"extraction_type": str, # Type of extraction ("schema", etc.)
"instruction": str, # Extraction instruction
"extra_args": dict = None, # Additional provider-specific arguments
"extra_headers": dict = None # Additional HTTP headers
}
```
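As a sketch of the `extra_args` parameter, provider-specific options can be forwarded through it; the `temperature` and `max_tokens` keys below are assumptions about what the chosen provider accepts:
```python
from pydantic import BaseModel
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class Article(BaseModel):
    title: str
    summary: str

strategy = LLMExtractionStrategy(
    provider="openai/gpt-4",
    api_token="your-token",
    schema=Article.schema(),
    extraction_type="schema",
    instruction="Extract the article title and a short summary",
    extra_args={"temperature": 0.0, "max_tokens": 800}  # assumed provider options
)
```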
### HTML to Text Conversion Options (html2text parameter)
```python
{
"escape_dot": bool = True, # Escape dots in text
# Other html2text library options
}
```
### CrawlResult Fields
```python
class CrawlResult(BaseModel):
# Basic Information
url: str # The crawled URL
# Example: "https://example.com"
success: bool # Whether the crawl was successful
# Example: True/False
status_code: Optional[int] # HTTP status code
# Example: 200, 404, 500
# Content Fields
html: str # Raw HTML content
# Example: "<html><body>...</body></html>"
cleaned_html: Optional[str] # HTML after cleaning and processing
# Example: "<article><p>Clean content...</p></article>"
fit_html: Optional[str] # Most relevant HTML content after content cleaning strategy
# Example: "<div><p>Most relevant content...</p></div>"
markdown: Optional[str] # HTML converted to markdown
# Example: "# Title\n\nContent paragraph..."
fit_markdown: Optional[str] # Most relevant content in markdown
# Example: "# Main Article\n\nKey content..."
# Media Content
media: Dict[str, List[Dict]] = {} # Extracted media information
# Example: {
# "images": [
# {
# "src": "https://example.com/image.jpg",
# "alt": "Image description",
# "desc": "Contextual description",
# "score": 5, # Relevance score
# "type": "image"
# }
# ],
# "videos": [
# {
# "src": "https://example.com/video.mp4",
# "alt": "Video title",
# "type": "video",
# "description": "Video context"
# }
# ],
# "audios": [
# {
# "src": "https://example.com/audio.mp3",
# "alt": "Audio title",
# "type": "audio",
# "description": "Audio context"
# }
# ]
# }
# Link Information
links: Dict[str, List[Dict]] = {} # Extracted links
# Example: {
# "internal": [
# {
# "href": "https://example.com/page",
# "text": "Link text",
# "title": "Link title"
# }
# ],
# "external": [
# {
# "href": "https://external.com",
# "text": "External link text",
# "title": "External link title"
# }
# ]
# }
# Extraction Results
extracted_content: Optional[str] # Content from extraction strategy
# Example for JsonCssExtractionStrategy:
# '[{"title": "Article 1", "date": "2024-03-20"}, ...]'
# Example for LLMExtractionStrategy:
# '{"entities": [...], "relationships": [...]}'
# Additional Information
metadata: Optional[dict] = None # Page metadata
# Example: {
# "title": "Page Title",
# "description": "Meta description",
# "keywords": ["keyword1", "keyword2"],
# "author": "Author Name",
# "published_date": "2024-03-20"
# }
screenshot: Optional[str] = None # Base64 encoded screenshot
# Example: "iVBORw0KGgoAAAANSUhEUgAA..."
error_message: Optional[str] = None # Error message if crawl failed
# Example: "Failed to load page: timeout"
session_id: Optional[str] = None # Session identifier
# Example: "session_123456"
response_headers: Optional[dict] = None # HTTP response headers
# Example: {
# "content-type": "text/html",
# "server": "nginx/1.18.0",
# "date": "Wed, 20 Mar 2024 12:00:00 GMT"
# }
```
### Common Usage Patterns
1. Basic Content Extraction:
```python
result = await crawler.arun(url="https://example.com")
print(result.markdown) # Clean, readable content
print(result.cleaned_html) # Cleaned HTML
```
2. Media Analysis:
```python
result = await crawler.arun(url="https://example.com")
for image in result.media["images"]:
if image["score"] > 3: # High-relevance images
print(f"High-quality image: {image['src']}")
```
3. Link Analysis:
```python
result = await crawler.arun(url="https://example.com")
internal_links = [link["href"] for link in result.links["internal"]]
external_links = [link["href"] for link in result.links["external"]]
```
4. Structured Data Extraction:
```python
result = await crawler.arun(
url="https://example.com",
extraction_strategy=my_strategy
)
structured_data = json.loads(result.extracted_content)
```
5. Error Handling:
```python
result = await crawler.arun(url="https://example.com")
if not result.success:
print(f"Crawl failed: {result.error_message}")
print(f"Status code: {result.status_code}")
```


@@ -0,0 +1,67 @@
1. **E-commerce Product Monitor**
- Scraping product details from multiple e-commerce sites
- Price tracking with structured data extraction
- Handling dynamic content and anti-bot measures
- Features: JsonCssExtraction, session management, anti-bot
2. **News Aggregator & Summarizer**
- Crawling news websites
- Content extraction and summarization
- Topic classification
- Features: LLMExtraction, CosineStrategy, content cleaning
3. **Academic Paper Research Assistant**
- Crawling research papers from academic sites
- Extracting citations and references
- Building knowledge graphs
- Features: structured extraction, link analysis, chunking
4. **Social Media Content Analyzer**
- Handling JavaScript-heavy sites
- Dynamic content loading
- Sentiment analysis integration
- Features: dynamic content handling, session management
5. **Real Estate Market Analyzer**
- Scraping property listings
- Processing image galleries
- Geolocation data extraction
- Features: media handling, structured data extraction
6. **Documentation Site Generator**
- Recursive website crawling
- Markdown generation
- Link validation
- Features: website crawling, content cleaning
7. **Job Board Aggregator**
- Handling pagination
- Structured job data extraction
- Filtering and categorization
- Features: session management, JsonCssExtraction
8. **Recipe Database Builder**
- Schema-based extraction
- Image processing
- Ingredient parsing
- Features: structured extraction, media handling
9. **Travel Blog Content Analyzer**
- Location extraction
- Image and map processing
- Content categorization
- Features: CosineStrategy, media handling
10. **Technical Documentation Scraper**
- API documentation extraction
- Code snippet processing
- Version tracking
- Features: content cleaning, structured extraction
Each example will include:
- Problem description
- Technical requirements
- Complete implementation
- Error handling
- Output processing
- Performance considerations
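As an illustration of the shape these examples will take, below is a minimal sketch of the first one, the e-commerce product monitor. The URL, selectors, and field names are placeholders; the parameters used (`magic`, `session_id`, `extraction_strategy`) are described in the sections above.
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Placeholder schema — real selectors depend on the target store's markup
product_schema = {
    "name": "Product Listing",
    "baseSelector": ".product-card",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
    ],
}

async def monitor(url: str):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            extraction_strategy=JsonCssExtractionStrategy(product_schema),
            magic=True,          # anti-bot features, as described above
            session_id="shop",   # reuse one browser session across runs
        )
        if result.success:
            return json.loads(result.extracted_content)
        raise RuntimeError(result.error_message)

# products = asyncio.run(monitor("https://shop.example.com/deals"))
```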