feat: add Script Builder to Chrome Extension and reorganize LLM context files

This commit introduces significant enhancements to the Crawl4AI ecosystem:

  Chrome Extension - Script Builder (Alpha):
  - Add recording functionality to capture user interactions (clicks, typing, scrolling)
  - Implement smart event grouping for cleaner script generation
  - Support export to both JavaScript and C4A script formats
  - Add timeline view for visualizing and editing recorded actions
  - Include wait commands (time-based and element-based)
  - Add saved flows functionality for reusing automation scripts
  - Update UI with consistent dark terminal theme (Dank Mono font, green/pink accents)
  - Release new extension versions: v1.1.0, v1.2.0, v1.2.1

  LLM Context Builder Improvements:
  - Reorganize context files from llmtxt/ to llm.txt/ with better structure
  - Separate diagram templates from text content (diagrams/ and txt/ subdirectories)
  - Add comprehensive context files for all major Crawl4AI components
  - Improve file naming convention for better discoverability

  Documentation Updates:
  - Update apps index page to match main documentation theme
  - Standardize color scheme: "Available" tags use primary color (#50ffff)
  - Change "Coming Soon" tags to dark gray for better visual hierarchy
  - Add interactive two-column layout for extension landing page
  - Include code examples for both Schema Builder and Script Builder features

  Technical Improvements:
  - Enhance event capture mechanism with better element selection
  - Add support for contenteditable elements and complex form interactions
  - Implement proper scroll event handling for both window and element scrolling
  - Add meta key support for keyboard shortcuts
  - Improve selector generation for more reliable element targeting

  The Script Builder is released as Alpha, acknowledging potential bugs while providing
  early access to this powerful automation recording feature.
Author: UncleCode
Date: 2025-06-08 22:02:12 +08:00
parent 926592649e
commit 40640badad
72 changed files with 28600 additions and 100986 deletions


@@ -0,0 +1,425 @@
## CLI Workflows and Profile Management
Visual representations of command-line interface operations, browser profile management, and identity-based crawling workflows.
### CLI Command Flow Architecture
```mermaid
flowchart TD
A[crwl command] --> B{Command Type?}
B -->|URL Crawling| C[Parse URL & Options]
B -->|Profile Management| D[profiles subcommand]
B -->|CDP Browser| E[cdp subcommand]
B -->|Browser Control| F[browser subcommand]
B -->|Configuration| G[config subcommand]
C --> C1{Output Format?}
C1 -->|Default| C2[HTML/Markdown]
C1 -->|JSON| C3[Structured Data]
C1 -->|markdown| C4[Clean Markdown]
C1 -->|markdown-fit| C5[Filtered Content]
C --> C6{Authentication?}
C6 -->|Profile Specified| C7[Load Browser Profile]
C6 -->|No Profile| C8[Anonymous Session]
C7 --> C9[Launch with User Data]
C8 --> C10[Launch Clean Browser]
C9 --> C11[Execute Crawl]
C10 --> C11
C11 --> C12{Success?}
C12 -->|Yes| C13[Return Results]
C12 -->|No| C14[Error Handling]
D --> D1[Interactive Profile Menu]
D1 --> D2{Menu Choice?}
D2 -->|Create| D3[Open Browser for Setup]
D2 -->|List| D4[Show Existing Profiles]
D2 -->|Delete| D5[Remove Profile]
D2 -->|Use| D6[Crawl with Profile]
E --> E1[Launch CDP Browser]
E1 --> E2[Remote Debugging Active]
F --> F1{Browser Action?}
F1 -->|start| F2[Start Builtin Browser]
F1 -->|stop| F3[Stop Builtin Browser]
F1 -->|status| F4[Check Browser Status]
F1 -->|view| F5[Open Browser Window]
G --> G1{Config Action?}
G1 -->|list| G2[Show All Settings]
G1 -->|set| G3[Update Setting]
G1 -->|get| G4[Read Setting]
style A fill:#e1f5fe
style C13 fill:#c8e6c9
style C14 fill:#ffcdd2
style D3 fill:#fff3e0
style E2 fill:#f3e5f5
```
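For reference, the Python SDK equivalent of a basic `crwl https://example.com` run is only a few lines. A minimal sketch using the library's documented entry points (`AsyncWebCrawler`, `CrawlerRunConfig`, `CacheMode`):

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    # Roughly `crwl https://example.com -o markdown`: one URL in, markdown out.
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        if result.success:
            print(result.markdown)  # clean markdown of the page
        else:
            print("Crawl failed:", result.error_message)

asyncio.run(main())
```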
### Profile Management Workflow
```mermaid
sequenceDiagram
participant User
participant CLI
participant ProfileManager
participant Browser
participant FileSystem
User->>CLI: crwl profiles
CLI->>ProfileManager: Initialize profile manager
ProfileManager->>FileSystem: Scan for existing profiles
FileSystem-->>ProfileManager: Profile list
ProfileManager-->>CLI: Show interactive menu
CLI-->>User: Display options
Note over User: User selects "Create new profile"
User->>CLI: Create profile "linkedin-auth"
CLI->>ProfileManager: create_profile("linkedin-auth")
ProfileManager->>FileSystem: Create profile directory
ProfileManager->>Browser: Launch with new user data dir
Browser-->>User: Opens browser window
Note over User: User manually logs in to LinkedIn
User->>Browser: Navigate and authenticate
Browser->>FileSystem: Save cookies, session data
User->>CLI: Press 'q' to save profile
CLI->>ProfileManager: finalize_profile()
ProfileManager->>FileSystem: Lock profile settings
ProfileManager-->>CLI: Profile saved
CLI-->>User: Profile "linkedin-auth" created
Note over User: Later usage
User->>CLI: crwl https://linkedin.com/feed -p linkedin-auth
CLI->>ProfileManager: load_profile("linkedin-auth")
ProfileManager->>FileSystem: Read profile data
FileSystem-->>ProfileManager: User data directory
ProfileManager-->>CLI: Profile configuration
CLI->>Browser: Launch with existing profile
Browser-->>CLI: Authenticated session ready
CLI->>Browser: Navigate to target URL
Browser-->>CLI: Crawl results with auth context
CLI-->>User: Authenticated content
```
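Once a profile like `linkedin-auth` has been saved, the same authenticated session can be reused from Python. A sketch, assuming profiles live under `~/.crawl4ai/profiles/` (the default location may differ by version; `crwl profiles` shows the real path):

```python
import asyncio
from pathlib import Path

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

# Assumed default profile location for a profile created via `crwl profiles`.
PROFILE_DIR = Path.home() / ".crawl4ai" / "profiles" / "linkedin-auth"

async def main():
    browser_config = BrowserConfig(
        headless=True,
        use_managed_browser=True,        # persistent, identity-aware browser
        user_data_dir=str(PROFILE_DIR),  # cookies/session saved during setup
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://linkedin.com/feed",
                                    config=CrawlerRunConfig())
        print("authenticated crawl ok:", result.success)

asyncio.run(main())
```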
### Browser Management State Machine
```mermaid
stateDiagram-v2
[*] --> Stopped: Initial state
Stopped --> Starting: crwl browser start
Starting --> Running: Browser launched
Running --> Viewing: crwl browser view
Viewing --> Running: Close window
Running --> Stopping: crwl browser stop
Stopping --> Stopped: Cleanup complete
Running --> Restarting: crwl browser restart
Restarting --> Running: New browser instance
Stopped --> CDP_Mode: crwl cdp
CDP_Mode --> CDP_Running: Remote debugging active
CDP_Running --> CDP_Mode: Manual close
CDP_Mode --> Stopped: Exit CDP
Running --> StatusCheck: crwl browser status
StatusCheck --> Running: Return status
note right of Running : Port 9222 active<br/>Builtin browser available
note right of CDP_Running : Remote debugging<br/>Manual control enabled
note right of Viewing : Visual browser window<br/>Direct interaction
```
### Authentication Workflow for Protected Sites
```mermaid
flowchart TD
A[Protected Site Access Needed] --> B[Create Profile Strategy]
B --> C{Existing Profile?}
C -->|Yes| D[Test Profile Validity]
C -->|No| E[Create New Profile]
D --> D1{Profile Valid?}
D1 -->|Yes| F[Use Existing Profile]
D1 -->|No| E
E --> E1[crwl profiles]
E1 --> E2[Select Create New Profile]
E2 --> E3[Enter Profile Name]
E3 --> E4[Browser Opens for Auth]
E4 --> E5{Authentication Method?}
E5 -->|Login Form| E6[Fill Username/Password]
E5 -->|OAuth| E7[OAuth Flow]
E5 -->|2FA| E8[Handle 2FA]
E5 -->|Session Cookie| E9[Import Cookies]
E6 --> E10[Manual Login Process]
E7 --> E10
E8 --> E10
E9 --> E10
E10 --> E11[Verify Authentication]
E11 --> E12{Auth Successful?}
E12 -->|Yes| E13[Save Profile - Press q]
E12 -->|No| E10
E13 --> F
F --> G[Execute Authenticated Crawl]
G --> H[crwl URL -p profile-name]
H --> I[Load Profile Data]
I --> J[Launch Browser with Auth]
J --> K[Navigate to Protected Content]
K --> L[Extract Authenticated Data]
L --> M[Return Results]
style E4 fill:#fff3e0
style E10 fill:#e3f2fd
style F fill:#e8f5e8
style M fill:#c8e6c9
```
### CDP Browser Architecture
```mermaid
graph TB
subgraph "CLI Layer"
A[crwl cdp command] --> B[CDP Manager]
B --> C[Port Configuration]
B --> D[Profile Selection]
end
subgraph "Browser Process"
E[Chromium/Firefox] --> F[Remote Debugging]
F --> G[WebSocket Endpoint]
G --> H[ws://localhost:9222]
end
subgraph "Client Connections"
I[Manual Browser Control] --> H
J[DevTools Interface] --> H
K[External Automation] --> H
L[Crawl4AI Crawler] --> H
end
subgraph "Profile Data"
M[User Data Directory] --> E
N[Cookies & Sessions] --> M
O[Extensions] --> M
P[Browser State] --> M
end
A --> E
C --> H
D --> M
style H fill:#e3f2fd
style E fill:#f3e5f5
style M fill:#e8f5e8
```
### Configuration Management Hierarchy
```mermaid
graph TD
subgraph "Global Configuration"
A[~/.crawl4ai/config.yml] --> B[Default Settings]
B --> C[LLM Providers]
B --> D[Browser Defaults]
B --> E[Output Preferences]
end
subgraph "Profile Configuration"
F[Profile Directory] --> G[Browser State]
F --> H[Authentication Data]
F --> I[Site-Specific Settings]
end
subgraph "Command-Line Overrides"
J[-b browser_config] --> K[Runtime Browser Settings]
L[-c crawler_config] --> M[Runtime Crawler Settings]
N[-o output_format] --> O[Runtime Output Format]
end
subgraph "Configuration Files"
P[browser.yml] --> Q[Browser Config Template]
R[crawler.yml] --> S[Crawler Config Template]
T[extract.yml] --> U[Extraction Config]
end
subgraph "Resolution Order"
V[Command Line Args] --> W[Config Files]
W --> X[Profile Settings]
X --> Y[Global Defaults]
end
J --> V
L --> V
N --> V
P --> W
R --> W
T --> W
F --> X
A --> Y
style V fill:#ffcdd2
style W fill:#fff3e0
style X fill:#e3f2fd
style Y fill:#e8f5e8
```
### Identity-Based Crawling Decision Tree
```mermaid
flowchart TD
A[Target Website Assessment] --> B{Authentication Required?}
B -->|No| C[Standard Anonymous Crawl]
B -->|Yes| D{Authentication Type?}
D -->|Login Form| E[Create Login Profile]
D -->|OAuth/SSO| F[Create OAuth Profile]
D -->|API Key/Token| G[Use Headers/Config]
D -->|Session Cookies| H[Import Cookie Profile]
E --> E1[crwl profiles → Manual login]
F --> F1[crwl profiles → OAuth flow]
G --> G1[Configure headers in crawler config]
H --> H1[Import cookies to profile]
E1 --> I[Test Authentication]
F1 --> I
G1 --> I
H1 --> I
I --> J{Auth Test Success?}
J -->|Yes| K[Production Crawl Setup]
J -->|No| L[Debug Authentication]
L --> L1{Common Issues?}
L1 -->|Rate Limiting| L2[Add delays/user simulation]
L1 -->|Bot Detection| L3[Enable stealth mode]
L1 -->|Session Expired| L4[Refresh authentication]
L1 -->|CAPTCHA| L5[Manual intervention needed]
L2 --> M[Retry with Adjustments]
L3 --> M
L4 --> E1
L5 --> N[Semi-automated approach]
M --> I
N --> O[Manual auth + automated crawl]
K --> P[Automated Authenticated Crawling]
O --> P
C --> P
P --> Q[Monitor & Maintain Profiles]
style I fill:#fff3e0
style K fill:#e8f5e8
style P fill:#c8e6c9
style L fill:#ffcdd2
style N fill:#f3e5f5
```
### CLI Usage Patterns and Best Practices
```mermaid
timeline
title CLI Workflow Evolution
section Setup Phase
Installation : pip install crawl4ai
: crawl4ai-setup
Basic Test : crwl https://example.com
Config Setup : crwl config set defaults
section Profile Creation
Site Analysis : Identify auth requirements
Profile Creation : crwl profiles
Manual Login : Authenticate in browser
Profile Save : Press 'q' to save
section Development Phase
Test Crawls : crwl URL -p profile -v
Config Tuning : Adjust browser/crawler settings
Output Testing : Try different output formats
Error Handling : Debug authentication issues
section Production Phase
Automated Crawls : crwl URL -p profile -o json
Batch Processing : Multiple URLs with same profile
Monitoring : Check profile validity
Maintenance : Update profiles as needed
```
### Multi-Profile Management Strategy
```mermaid
graph LR
subgraph "Profile Categories"
A[Social Media Profiles]
B[Work/Enterprise Profiles]
C[E-commerce Profiles]
D[Research Profiles]
end
subgraph "Social Media"
A --> A1[linkedin-personal]
A --> A2[twitter-monitor]
A --> A3[facebook-research]
A --> A4[instagram-brand]
end
subgraph "Enterprise"
B --> B1[company-intranet]
B --> B2[github-enterprise]
B --> B3[confluence-docs]
B --> B4[jira-tickets]
end
subgraph "E-commerce"
C --> C1[amazon-seller]
C --> C2[shopify-admin]
C --> C3[ebay-monitor]
C --> C4[marketplace-competitor]
end
subgraph "Research"
D --> D1[academic-journals]
D --> D2[data-platforms]
D --> D3[survey-tools]
D --> D4[government-portals]
end
subgraph "Usage Patterns"
E[Daily Monitoring] --> A2
E --> B1
F[Weekly Reports] --> C3
F --> D2
G[On-Demand Research] --> D1
G --> D4
H[Competitive Analysis] --> C4
H --> A4
end
style A1 fill:#e3f2fd
style B1 fill:#f3e5f5
style C1 fill:#e8f5e8
style D1 fill:#fff3e0
```
**📖 Learn more:** [CLI Reference](https://docs.crawl4ai.com/core/cli/), [Identity-Based Crawling](https://docs.crawl4ai.com/advanced/identity-based-crawling/), [Profile Management](https://docs.crawl4ai.com/advanced/session-management/), [Authentication Strategies](https://docs.crawl4ai.com/advanced/hooks-auth/)

File diff suppressed because it is too large


@@ -0,0 +1,401 @@
## Deep Crawling Filters & Scorers Architecture
Visual representations of advanced URL filtering, scoring strategies, and performance optimization workflows for intelligent deep crawling.
### Filter Chain Processing Pipeline
```mermaid
flowchart TD
A[URL Input] --> B{Domain Filter}
B -->|✓ Pass| C{Pattern Filter}
B -->|✗ Fail| X1[Reject: Invalid Domain]
C -->|✓ Pass| D{Content Type Filter}
C -->|✗ Fail| X2[Reject: Pattern Mismatch]
D -->|✓ Pass| E{SEO Filter}
D -->|✗ Fail| X3[Reject: Wrong Content Type]
E -->|✓ Pass| F{Content Relevance Filter}
E -->|✗ Fail| X4[Reject: Low SEO Score]
F -->|✓ Pass| G[URL Accepted]
F -->|✗ Fail| X5[Reject: Low Relevance]
G --> H[Add to Crawl Queue]
subgraph "Fast Filters"
B
C
D
end
subgraph "Slow Filters"
E
F
end
style A fill:#e3f2fd
style G fill:#c8e6c9
style H fill:#e8f5e8
style X1 fill:#ffcdd2
style X2 fill:#ffcdd2
style X3 fill:#ffcdd2
style X4 fill:#ffcdd2
style X5 fill:#ffcdd2
```
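A chain like the one above is assembled by listing filters cheapest-first, so most URLs are rejected by string checks before any content-aware filter runs. A minimal sketch using the filter classes named in the diagram:

```python
from crawl4ai.deep_crawling.filters import (
    FilterChain, DomainFilter, URLPatternFilter, ContentTypeFilter,
)

# Order matters: cheap string checks first, expensive analysis last.
filter_chain = FilterChain([
    DomainFilter(allowed_domains=["docs.example.com"],
                 blocked_domains=["ads.example.com"]),
    URLPatternFilter(patterns=["*guide*", "*tutorial*"]),
    ContentTypeFilter(allowed_types=["text/html"]),
])
# Plugged into a deep crawl via, e.g.,
# BFSDeepCrawlStrategy(max_depth=2, filter_chain=filter_chain)
```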
### URL Scoring System Architecture
```mermaid
graph TB
subgraph "Input URL"
A[https://python.org/tutorial/2024/ml-guide.html]
end
subgraph "Individual Scorers"
B[Keyword Relevance Scorer]
C[Path Depth Scorer]
D[Content Type Scorer]
E[Freshness Scorer]
F[Domain Authority Scorer]
end
subgraph "Scoring Process"
B --> B1[Keywords: python, tutorial, ml<br/>Score: 0.85]
C --> C1[Depth: 4 levels<br/>Optimal: 3<br/>Score: 0.75]
D --> D1[Content: HTML<br/>Score: 1.0]
E --> E1[Year: 2024<br/>Score: 1.0]
F --> F1[Domain: python.org<br/>Score: 1.0]
end
subgraph "Composite Scoring"
G[Weighted Combination]
B1 --> G
C1 --> G
D1 --> G
E1 --> G
F1 --> G
end
subgraph "Final Result"
H[Composite Score: 0.92]
I{Score > Threshold?}
J[Accept URL]
K[Reject URL]
end
A --> B
A --> C
A --> D
A --> E
A --> F
G --> H
H --> I
I -->|✓ 0.92 > 0.6| J
I -->|✗ Score too low| K
style A fill:#e3f2fd
style G fill:#fff3e0
style H fill:#e8f5e8
style J fill:#c8e6c9
style K fill:#ffcdd2
```
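In code, the weighted combination looks roughly like the sketch below. The scorer class names follow the diagram, but treat the exact constructor signatures as assumptions to verify against your installed version:

```python
from crawl4ai.deep_crawling.scorers import (
    CompositeScorer,
    ContentTypeScorer,
    DomainAuthorityScorer,
    FreshnessScorer,
    KeywordRelevanceScorer,
    PathDepthScorer,
)

# Weights mirror the example: keywords 0.30, authority 0.25,
# content type 0.20, freshness 0.15, path depth 0.10.
scorer = CompositeScorer([
    KeywordRelevanceScorer(keywords=["python", "tutorial", "ml"], weight=0.30),
    DomainAuthorityScorer(weight=0.25),
    ContentTypeScorer(type_weights={".html": 1.0, ".pdf": 0.7}, weight=0.20),
    FreshnessScorer(weight=0.15),
    PathDepthScorer(optimal_depth=3, weight=0.10),
])

print(scorer.score("https://python.org/tutorial/2024/ml-guide.html"))
```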
### Filter vs Scorer Decision Matrix
```mermaid
flowchart TD
A[URL Processing Decision] --> B{Binary Decision Needed?}
B -->|Yes - Include/Exclude| C[Use Filters]
B -->|No - Quality Rating| D[Use Scorers]
C --> C1{Filter Type Needed?}
C1 -->|Domain Control| C2[DomainFilter]
C1 -->|Pattern Matching| C3[URLPatternFilter]
C1 -->|Content Type| C4[ContentTypeFilter]
C1 -->|SEO Quality| C5[SEOFilter]
C1 -->|Content Relevance| C6[ContentRelevanceFilter]
D --> D1{Scoring Criteria?}
D1 -->|Keyword Relevance| D2[KeywordRelevanceScorer]
D1 -->|URL Structure| D3[PathDepthScorer]
D1 -->|Content Quality| D4[ContentTypeScorer]
D1 -->|Time Sensitivity| D5[FreshnessScorer]
D1 -->|Source Authority| D6[DomainAuthorityScorer]
C2 --> E[Chain Filters]
C3 --> E
C4 --> E
C5 --> E
C6 --> E
D2 --> F[Composite Scorer]
D3 --> F
D4 --> F
D5 --> F
D6 --> F
E --> G[Binary Output: Pass/Fail]
F --> H[Numeric Score: 0.0-1.0]
G --> I[Apply to URL Queue]
H --> J[Priority Ranking]
style C fill:#e8f5e8
style D fill:#fff3e0
style E fill:#f3e5f5
style F fill:#e3f2fd
style G fill:#c8e6c9
style H fill:#ffecb3
```
### Performance Optimization Strategy
```mermaid
sequenceDiagram
participant Queue as URL Queue
participant Fast as Fast Filters
participant Slow as Slow Filters
participant Score as Scorers
participant Output as Filtered URLs
Note over Queue, Output: Batch Processing (1000 URLs)
Queue->>Fast: Apply Domain Filter
Fast-->>Queue: 60% passed (600 URLs)
Queue->>Fast: Apply Pattern Filter
Fast-->>Queue: 70% passed (420 URLs)
Queue->>Fast: Apply Content Type Filter
Fast-->>Queue: 90% passed (378 URLs)
Note over Fast: Fast filters eliminate 62% of URLs
Queue->>Slow: Apply SEO Filter (378 URLs)
Slow-->>Queue: 80% passed (302 URLs)
Queue->>Slow: Apply Relevance Filter
Slow-->>Queue: 75% passed (227 URLs)
Note over Slow: Content analysis on remaining URLs
Queue->>Score: Calculate Composite Scores
Score-->>Queue: Scored and ranked
Queue->>Output: Top 100 URLs by score
Output-->>Queue: Processing complete
Note over Queue, Output: Total: 90% filtered out, 10% high-quality URLs retained
```
### Custom Filter Implementation Flow
```mermaid
stateDiagram-v2
[*] --> Planning
Planning --> IdentifyNeeds: Define filtering criteria
IdentifyNeeds --> ChooseType: Binary vs Scoring decision
ChooseType --> FilterImpl: Binary decision needed
ChooseType --> ScorerImpl: Quality rating needed
FilterImpl --> InheritURLFilter: Extend URLFilter base class
ScorerImpl --> InheritURLScorer: Extend URLScorer base class
InheritURLFilter --> ImplementApply: def apply(url) -> bool
InheritURLScorer --> ImplementScore: def _calculate_score(url) -> float
ImplementApply --> AddLogic: Add custom filtering logic
ImplementScore --> AddLogic
AddLogic --> TestFilter: Unit testing
TestFilter --> OptimizePerf: Performance optimization
OptimizePerf --> Integration: Integrate with FilterChain
Integration --> Production: Deploy to production
Production --> Monitor: Monitor performance
Monitor --> Tune: Tune parameters
Tune --> Production
note right of Planning : Consider performance impact
note right of AddLogic : Handle edge cases
note right of OptimizePerf : Cache frequently accessed data
```
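Following the flow above, a custom filter is just a subclass implementing `apply(url) -> bool`. A hypothetical example (the `QueryStringFilter` name and its blocklist are illustrative, not part of the library):

```python
from crawl4ai.deep_crawling.filters import URLFilter

class QueryStringFilter(URLFilter):
    """Hypothetical filter: reject URLs whose query string carries
    session/tracking noise. Cheap check, so it belongs early in the chain."""

    BLOCKED_PARAMS = ("sessionid=", "utm_", "ref=")

    def apply(self, url: str) -> bool:
        # Return True to keep the URL, False to reject it.
        query = url.partition("?")[2]
        return not any(p in query for p in self.BLOCKED_PARAMS)
```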
### Filter Chain Optimization Patterns
```mermaid
graph TB
subgraph "Naive Approach - Poor Performance"
A1[All URLs] --> B1[Slow Filter 1]
B1 --> C1[Slow Filter 2]
C1 --> D1[Fast Filter 1]
D1 --> E1[Fast Filter 2]
E1 --> F1[Final Results]
B1 -.->|High CPU| G1[Performance Issues]
C1 -.->|Network Calls| G1
end
subgraph "Optimized Approach - High Performance"
A2[All URLs] --> B2[Fast Filter 1]
B2 --> C2[Fast Filter 2]
C2 --> D2[Batch Process]
D2 --> E2[Slow Filter 1]
E2 --> F2[Slow Filter 2]
F2 --> G2[Final Results]
D2 --> H2[Concurrent Processing]
H2 --> I2[Semaphore Control]
end
subgraph "Performance Metrics"
J[Processing Time]
K[Memory Usage]
L[CPU Utilization]
M[Network Requests]
end
G1 -.-> J
G1 -.-> K
G1 -.-> L
G1 -.-> M
G2 -.-> J
G2 -.-> K
G2 -.-> L
G2 -.-> M
style A1 fill:#ffcdd2
style G1 fill:#ffcdd2
style A2 fill:#c8e6c9
style G2 fill:#c8e6c9
style H2 fill:#e8f5e8
style I2 fill:#e8f5e8
```
### Composite Scoring Weight Distribution
```mermaid
pie title Composite Scorer Weight Distribution
"Keyword Relevance (30%)" : 30
"Domain Authority (25%)" : 25
"Content Type (20%)" : 20
"Freshness (15%)" : 15
"Path Depth (10%)" : 10
```
### Deep Crawl Integration Architecture
```mermaid
graph TD
subgraph "Deep Crawl Strategy"
A[Start URL] --> B[Extract Links]
B --> C[Apply Filter Chain]
C --> D[Calculate Scores]
D --> E[Priority Queue]
E --> F[Crawl Next URL]
F --> B
end
subgraph "Filter Chain Components"
C --> C1[Domain Filter]
C --> C2[Pattern Filter]
C --> C3[Content Filter]
C --> C4[SEO Filter]
C --> C5[Relevance Filter]
end
subgraph "Scoring Components"
D --> D1[Keyword Scorer]
D --> D2[Depth Scorer]
D --> D3[Freshness Scorer]
D --> D4[Authority Scorer]
D --> D5[Composite Score]
end
subgraph "Queue Management"
E --> E1{Score > Threshold?}
E1 -->|Yes| E2[High Priority Queue]
E1 -->|No| E3[Low Priority Queue]
E2 --> F
E3 --> G[Delayed Processing]
end
subgraph "Control Flow"
H{Max Depth Reached?}
I{Max Pages Reached?}
J[Stop Crawling]
end
F --> H
H -->|No| I
H -->|Yes| J
I -->|No| B
I -->|Yes| J
style A fill:#e3f2fd
style E2 fill:#c8e6c9
style E3 fill:#fff3e0
style J fill:#ffcdd2
```
### Filter Performance Comparison
```mermaid
xychart-beta
title "Filter Performance Comparison (1000 URLs)"
x-axis [Domain, Pattern, ContentType, SEO, Relevance]
y-axis "Processing Time (ms)" 0 --> 1000
bar [50, 80, 45, 300, 800]
```
### Scoring Algorithm Workflow
```mermaid
flowchart TD
A[Input URL] --> B[Parse URL Components]
B --> C[Extract Features]
C --> D[Domain Analysis]
C --> E[Path Analysis]
C --> F[Content Type Detection]
C --> G[Keyword Extraction]
C --> H[Freshness Detection]
D --> I[Domain Authority Score]
E --> J[Path Depth Score]
F --> K[Content Type Score]
G --> L[Keyword Relevance Score]
H --> M[Freshness Score]
I --> N[Apply Weights]
J --> N
K --> N
L --> N
M --> N
N --> O[Normalize Scores]
O --> P[Calculate Final Score]
P --> Q{Score >= Threshold?}
Q -->|Yes| R[Accept for Crawling]
Q -->|No| S[Reject URL]
R --> T[Add to Priority Queue]
S --> U[Log Rejection Reason]
style A fill:#e3f2fd
style P fill:#fff3e0
style R fill:#c8e6c9
style S fill:#ffcdd2
style T fill:#e8f5e8
```
**📖 Learn more:** [Deep Crawling Strategy](https://docs.crawl4ai.com/core/deep-crawling/), [Performance Optimization](https://docs.crawl4ai.com/advanced/performance-tuning/), [Custom Implementations](https://docs.crawl4ai.com/advanced/custom-filters/)


@@ -0,0 +1,428 @@
## Deep Crawling Workflows and Architecture
Visual representations of multi-level website exploration, filtering strategies, and intelligent crawling patterns.
### Deep Crawl Strategy Overview
```mermaid
flowchart TD
A[Start Deep Crawl] --> B{Strategy Selection}
B -->|Explore All Levels| C[BFS Strategy]
B -->|Dive Deep Fast| D[DFS Strategy]
B -->|Smart Prioritization| E[Best-First Strategy]
C --> C1[Breadth-First Search]
C1 --> C2[Process all depth 0 links]
C2 --> C3[Process all depth 1 links]
C3 --> C4[Continue by depth level]
D --> D1[Depth-First Search]
D1 --> D2[Follow first link deeply]
D2 --> D3[Backtrack when max depth reached]
D3 --> D4[Continue with next branch]
E --> E1[Best-First Search]
E1 --> E2[Score all discovered URLs]
E2 --> E3[Process highest scoring URLs first]
E3 --> E4[Continuously re-prioritize queue]
C4 --> F[Apply Filters]
D4 --> F
E4 --> F
F --> G{Filter Chain Processing}
G -->|Domain Filter| G1[Check allowed/blocked domains]
G -->|URL Pattern Filter| G2[Match URL patterns]
G -->|Content Type Filter| G3[Verify content types]
G -->|SEO Filter| G4[Evaluate SEO quality]
G -->|Content Relevance| G5[Score content relevance]
G1 --> H{Passed All Filters?}
G2 --> H
G3 --> H
G4 --> H
G5 --> H
H -->|Yes| I[Add to Crawl Queue]
H -->|No| J[Discard URL]
I --> K{Processing Mode}
K -->|Streaming| L[Process Immediately]
K -->|Batch| M[Collect All Results]
L --> N[Stream Result to User]
M --> O[Return Complete Result Set]
J --> P{More URLs in Queue?}
N --> P
O --> P
P -->|Yes| Q{Within Limits?}
P -->|No| R[Deep Crawl Complete]
Q -->|Max Depth OK| S{Max Pages OK?}
Q -->|Max Depth Exceeded| T[Skip Deeper URLs]
S -->|Under Limit| U[Continue Crawling]
S -->|Limit Reached| R
T --> P
U --> F
style A fill:#e1f5fe
style R fill:#c8e6c9
style C fill:#fff3e0
style D fill:#f3e5f5
style E fill:#e8f5e8
```
### Deep Crawl Strategy Comparison
```mermaid
graph TB
subgraph "BFS - Breadth-First Search"
BFS1[Level 0: Start URL]
BFS2[Level 1: All direct links]
BFS3[Level 2: All second-level links]
BFS4[Level 3: All third-level links]
BFS1 --> BFS2
BFS2 --> BFS3
BFS3 --> BFS4
BFS_NOTE[Complete each depth before going deeper<br/>Good for site mapping<br/>Memory intensive for wide sites]
end
subgraph "DFS - Depth-First Search"
DFS1[Start URL]
DFS2[First Link → Deep]
DFS3[Follow until max depth]
DFS4[Backtrack and try next]
DFS1 --> DFS2
DFS2 --> DFS3
DFS3 --> DFS4
DFS4 --> DFS2
DFS_NOTE[Go deep on first path<br/>Memory efficient<br/>May miss important pages]
end
subgraph "Best-First - Priority Queue"
BF1[Start URL]
BF2[Score all discovered links]
BF3[Process highest scoring first]
BF4[Continuously re-prioritize]
BF1 --> BF2
BF2 --> BF3
BF3 --> BF4
BF4 --> BF2
BF_NOTE[Intelligent prioritization<br/>Finds relevant content fast<br/>Recommended for most use cases]
end
style BFS1 fill:#e3f2fd
style DFS1 fill:#f3e5f5
style BF1 fill:#e8f5e8
style BFS_NOTE fill:#fff3e0
style DFS_NOTE fill:#fff3e0
style BF_NOTE fill:#fff3e0
```
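Wiring the recommended Best-First strategy into a crawl takes a few lines. A sketch using the documented `BestFirstCrawlingStrategy` with a keyword scorer:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

async def main():
    strategy = BestFirstCrawlingStrategy(
        max_depth=2,
        max_pages=50,
        include_external=False,
        url_scorer=KeywordRelevanceScorer(keywords=["crawl", "async"]),
    )
    config = CrawlerRunConfig(deep_crawl_strategy=strategy)
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://docs.crawl4ai.com", config=config)
        for r in results:  # batch mode: all results returned together
            print(r.url, "depth:", r.metadata.get("depth", 0))

asyncio.run(main())
```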
### Filter Chain Processing Sequence
```mermaid
sequenceDiagram
participant URL as Discovered URL
participant Chain as Filter Chain
participant Domain as Domain Filter
participant Pattern as URL Pattern Filter
participant Content as Content Type Filter
participant SEO as SEO Filter
participant Relevance as Content Relevance Filter
participant Queue as Crawl Queue
URL->>Chain: Process URL
Chain->>Domain: Check domain rules
alt Domain Allowed
Domain-->>Chain: ✓ Pass
Chain->>Pattern: Check URL patterns
alt Pattern Matches
Pattern-->>Chain: ✓ Pass
Chain->>Content: Check content type
alt Content Type Valid
Content-->>Chain: ✓ Pass
Chain->>SEO: Evaluate SEO quality
alt SEO Score Above Threshold
SEO-->>Chain: ✓ Pass
Chain->>Relevance: Score content relevance
alt Relevance Score High
Relevance-->>Chain: ✓ Pass
Chain->>Queue: Add to crawl queue
Queue-->>URL: Queued for crawling
else Relevance Score Low
Relevance-->>Chain: ✗ Reject
Chain-->>URL: Filtered out - Low relevance
end
else SEO Score Low
SEO-->>Chain: ✗ Reject
Chain-->>URL: Filtered out - Poor SEO
end
else Invalid Content Type
Content-->>Chain: ✗ Reject
Chain-->>URL: Filtered out - Wrong content type
end
else Pattern Mismatch
Pattern-->>Chain: ✗ Reject
Chain-->>URL: Filtered out - Pattern mismatch
end
else Domain Blocked
Domain-->>Chain: ✗ Reject
Chain-->>URL: Filtered out - Blocked domain
end
```
### URL Lifecycle State Machine
```mermaid
stateDiagram-v2
[*] --> Discovered: Found on page
Discovered --> FilterPending: Enter filter chain
FilterPending --> DomainCheck: Apply domain filter
DomainCheck --> PatternCheck: Domain allowed
DomainCheck --> Rejected: Domain blocked
PatternCheck --> ContentCheck: Pattern matches
PatternCheck --> Rejected: Pattern mismatch
ContentCheck --> SEOCheck: Content type valid
ContentCheck --> Rejected: Invalid content
SEOCheck --> RelevanceCheck: SEO score sufficient
SEOCheck --> Rejected: Poor SEO score
RelevanceCheck --> Scored: Relevance score calculated
RelevanceCheck --> Rejected: Low relevance
Scored --> Queued: Added to priority queue
Queued --> Crawling: Selected for processing
Crawling --> Success: Page crawled successfully
Crawling --> Failed: Crawl failed
Success --> LinkExtraction: Extract new links
LinkExtraction --> [*]: Process complete
Failed --> [*]: Record failure
Rejected --> [*]: Log rejection reason
note right of Scored : Score determines priority<br/>in Best-First strategy
note right of Failed : Errors logged with<br/>depth and reason
```
### Streaming vs Batch Processing Architecture
```mermaid
graph TB
subgraph "Input"
A[Start URL] --> B[Deep Crawl Strategy]
end
subgraph "Crawl Engine"
B --> C[URL Discovery]
C --> D[Filter Chain]
D --> E[Priority Queue]
E --> F[Page Processor]
end
subgraph "Streaming Mode stream=True"
F --> G1[Process Page]
G1 --> H1[Extract Content]
H1 --> I1[Yield Result Immediately]
I1 --> J1[async for result]
J1 --> K1[Real-time Processing]
G1 --> L1[Extract Links]
L1 --> M1[Add to Queue]
M1 --> F
end
subgraph "Batch Mode stream=False"
F --> G2[Process Page]
G2 --> H2[Extract Content]
H2 --> I2[Store Result]
I2 --> N2[Result Collection]
G2 --> L2[Extract Links]
L2 --> M2[Add to Queue]
M2 --> O2{More URLs?}
O2 -->|Yes| F
O2 -->|No| P2[Return All Results]
P2 --> Q2[Batch Processing]
end
style I1 fill:#e8f5e8
style K1 fill:#e8f5e8
style P2 fill:#e3f2fd
style Q2 fill:#e3f2fd
```
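The two modes differ only in the `stream` flag. A minimal streaming sketch, consuming each result as soon as its page finishes:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def main():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1,
                                                 include_external=False),
        stream=True,  # yield each result immediately instead of collecting
    )
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://docs.crawl4ai.com",
                                               config=config):
            print(result.url, result.success)

asyncio.run(main())
```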
### Advanced Scoring and Prioritization System
```mermaid
flowchart LR
subgraph "URL Discovery"
A[Page Links] --> B[Extract URLs]
B --> C[Normalize URLs]
end
subgraph "Scoring System"
C --> D[Keyword Relevance Scorer]
D --> D1[URL Text Analysis]
D --> D2[Keyword Matching]
D --> D3[Calculate Base Score]
D3 --> E[Additional Scoring Factors]
E --> E1[URL Structure weight: 0.2]
E --> E2[Link Context weight: 0.3]
E --> E3[Page Depth Penalty weight: 0.1]
E --> E4[Domain Authority weight: 0.4]
D1 --> F[Combined Score]
D2 --> F
D3 --> F
E1 --> F
E2 --> F
E3 --> F
E4 --> F
end
subgraph "Prioritization"
F --> G{Score Threshold}
G -->|Above Threshold| H[Priority Queue]
G -->|Below Threshold| I[Discard URL]
H --> J[Best-First Selection]
J --> K[Highest Score First]
K --> L[Process Page]
L --> M[Update Scores]
M --> N[Re-prioritize Queue]
N --> J
end
style F fill:#fff3e0
style H fill:#e8f5e8
style L fill:#e3f2fd
```
### Deep Crawl Performance and Limits
```mermaid
graph TD
subgraph "Crawl Constraints"
A[Max Depth: 2] --> A1[Prevents infinite crawling]
B[Max Pages: 50] --> B1[Controls resource usage]
C[Score Threshold: 0.3] --> C1[Quality filtering]
D[Domain Limits] --> D1[Scope control]
end
subgraph "Performance Monitoring"
E[Pages Crawled] --> F[Depth Distribution]
E --> G[Success Rate]
E --> H[Average Score]
E --> I[Processing Time]
F --> J[Performance Report]
G --> J
H --> J
I --> J
end
subgraph "Resource Management"
K[Memory Usage] --> L{Memory Threshold}
L -->|Under Limit| M[Continue Crawling]
L -->|Over Limit| N[Reduce Concurrency]
O[CPU Usage] --> P{CPU Threshold}
P -->|Normal| M
P -->|High| Q[Add Delays]
R[Network Load] --> S{Rate Limits}
S -->|OK| M
S -->|Exceeded| T[Throttle Requests]
end
M --> U[Optimal Performance]
N --> V[Reduced Performance]
Q --> V
T --> V
style U fill:#c8e6c9
style V fill:#fff3e0
style J fill:#e3f2fd
```
### Error Handling and Recovery Flow
```mermaid
sequenceDiagram
participant Strategy as Deep Crawl Strategy
participant Queue as Priority Queue
participant Crawler as Page Crawler
participant Error as Error Handler
participant Result as Result Collector
Strategy->>Queue: Get next URL
Queue-->>Strategy: Return highest priority URL
Strategy->>Crawler: Crawl page
alt Successful Crawl
Crawler-->>Strategy: Return page content
Strategy->>Result: Store successful result
Strategy->>Strategy: Extract new links
Strategy->>Queue: Add new URLs to queue
else Network Error
Crawler-->>Error: Network timeout/failure
Error->>Error: Log error with details
Error->>Queue: Mark URL as failed
Error-->>Strategy: Skip to next URL
else Parse Error
Crawler-->>Error: HTML parsing failed
Error->>Error: Log parse error
Error->>Result: Store failed result
Error-->>Strategy: Continue with next URL
else Rate Limit Hit
Crawler-->>Error: Rate limit exceeded
Error->>Error: Apply backoff strategy
Error->>Queue: Re-queue URL with delay
Error-->>Strategy: Wait before retry
else Depth Limit
Strategy->>Strategy: Check depth constraint
Strategy-->>Queue: Skip URL - too deep
else Page Limit
Strategy->>Strategy: Check page count
Strategy-->>Result: Stop crawling - limit reached
end
Strategy->>Queue: Request next URL
Queue-->>Strategy: More URLs available?
alt Queue Empty
Queue-->>Result: Crawl complete
else Queue Has URLs
Queue-->>Strategy: Continue crawling
end
```
**📖 Learn more:** [Deep Crawling Strategies](https://docs.crawl4ai.com/core/deep-crawling/), [Content Filtering](https://docs.crawl4ai.com/core/content-selection/), [Advanced Crawling Patterns](https://docs.crawl4ai.com/advanced/advanced-features/)


@@ -0,0 +1,603 @@
## Docker Deployment Architecture and Workflows
Visual representations of Crawl4AI Docker deployment, API architecture, configuration management, and service interactions.
### Docker Deployment Decision Flow
```mermaid
flowchart TD
A[Start Docker Deployment] --> B{Deployment Type?}
B -->|Quick Start| C[Pre-built Image]
B -->|Development| D[Docker Compose]
B -->|Custom Build| E[Manual Build]
B -->|Production| F[Production Setup]
C --> C1[docker pull unclecode/crawl4ai]
C1 --> C2{Need LLM Support?}
C2 -->|Yes| C3[Setup .llm.env]
C2 -->|No| C4[Basic run]
C3 --> C5[docker run with --env-file]
C4 --> C6[docker run basic]
D --> D1[git clone repository]
D1 --> D2[cp .llm.env.example .llm.env]
D2 --> D3{Build Type?}
D3 -->|Pre-built| D4[IMAGE=latest docker compose up]
D3 -->|Local Build| D5[docker compose up --build]
D3 -->|All Features| D6[INSTALL_TYPE=all docker compose up]
E --> E1[docker buildx build]
E1 --> E2{Architecture?}
E2 -->|Single| E3[--platform linux/amd64]
E2 -->|Multi| E4[--platform linux/amd64,linux/arm64]
E3 --> E5[Build complete]
E4 --> E5
F --> F1[Production configuration]
F1 --> F2[Custom config.yml]
F2 --> F3[Resource limits]
F3 --> F4[Health monitoring]
F4 --> F5[Production ready]
C5 --> G[Service running on :11235]
C6 --> G
D4 --> G
D5 --> G
D6 --> G
E5 --> H[docker run custom image]
H --> G
F5 --> I[Production deployment]
G --> J[Access playground at /playground]
G --> K[Health check at /health]
I --> L[Production monitoring]
style A fill:#e1f5fe
style G fill:#c8e6c9
style I fill:#c8e6c9
style J fill:#fff3e0
style K fill:#fff3e0
style L fill:#e8f5e8
```
### Docker Container Architecture
```mermaid
graph TB
subgraph "Host Environment"
A[Docker Engine] --> B[Crawl4AI Container]
C[.llm.env] --> B
D[Custom config.yml] --> B
E[Port 11235] --> B
F[Shared Memory 1GB+] --> B
end
subgraph "Container Services"
B --> G[FastAPI Server :8020]
B --> H[Gunicorn WSGI]
B --> I[Supervisord Process Manager]
B --> J[Redis Cache :6379]
G --> K[REST API Endpoints]
G --> L[WebSocket Connections]
G --> M[MCP Protocol]
H --> N[Worker Processes]
I --> O[Service Monitoring]
J --> P[Request Caching]
end
subgraph "Browser Management"
B --> Q[Playwright Framework]
Q --> R[Chromium Browser]
Q --> S[Firefox Browser]
Q --> T[WebKit Browser]
R --> U[Browser Pool]
S --> U
T --> U
U --> V[Page Sessions]
U --> W[Context Management]
end
subgraph "External Services"
X[OpenAI API] -.-> K
Y[Anthropic Claude] -.-> K
Z[Local Ollama] -.-> K
AA[Groq API] -.-> K
BB[Google Gemini] -.-> K
end
subgraph "Client Interactions"
CC[Python SDK] --> K
DD[REST API Calls] --> K
EE[MCP Clients] --> M
FF[Web Browser] --> G
GG[Monitoring Tools] --> K
end
style B fill:#e3f2fd
style G fill:#f3e5f5
style Q fill:#e8f5e8
style K fill:#fff3e0
```
### API Endpoints Architecture
```mermaid
graph LR
subgraph "Core Endpoints"
A[/crawl] --> A1[Single URL crawl]
A2[/crawl/stream] --> A3[Streaming multi-URL]
A4[/crawl/job] --> A5[Async job submission]
A6[/crawl/job/{id}] --> A7[Job status check]
end
subgraph "Specialized Endpoints"
B[/html] --> B1[Preprocessed HTML]
B2[/screenshot] --> B3[PNG capture]
B4[/pdf] --> B5[PDF generation]
B6[/execute_js] --> B7[JavaScript execution]
B8[/md] --> B9[Markdown extraction]
end
subgraph "Utility Endpoints"
C[/health] --> C1[Service status]
C2[/metrics] --> C3[Prometheus metrics]
C4[/schema] --> C5[API documentation]
C6[/playground] --> C7[Interactive testing]
end
subgraph "LLM Integration"
D[/llm/{url}] --> D1[Q&A over URL]
D2[/ask] --> D3[Library context search]
D4[/config/dump] --> D5[Config validation]
end
subgraph "MCP Protocol"
E[/mcp/sse] --> E1[Server-Sent Events]
E2[/mcp/ws] --> E3[WebSocket connection]
E4[/mcp/schema] --> E5[MCP tool definitions]
end
style A fill:#e3f2fd
style B fill:#f3e5f5
style C fill:#e8f5e8
style D fill:#fff3e0
style E fill:#fce4ec
```
### Request Processing Flow
```mermaid
sequenceDiagram
participant Client
participant FastAPI
participant RequestValidator
participant BrowserPool
participant Playwright
participant ExtractionEngine
participant LLMProvider
Client->>FastAPI: POST /crawl with config
FastAPI->>RequestValidator: Validate JSON structure
alt Valid Request
RequestValidator-->>FastAPI: ✓ Validated
FastAPI->>BrowserPool: Request browser instance
BrowserPool->>Playwright: Launch browser/reuse session
Playwright-->>BrowserPool: Browser ready
BrowserPool-->>FastAPI: Browser allocated
FastAPI->>Playwright: Navigate to URL
Playwright->>Playwright: Execute JS, wait conditions
Playwright-->>FastAPI: Page content ready
FastAPI->>ExtractionEngine: Process content
alt LLM Extraction
ExtractionEngine->>LLMProvider: Send content + schema
LLMProvider-->>ExtractionEngine: Structured data
else CSS Extraction
ExtractionEngine->>ExtractionEngine: Apply CSS selectors
end
ExtractionEngine-->>FastAPI: Extraction complete
FastAPI->>BrowserPool: Release browser
FastAPI-->>Client: CrawlResult response
else Invalid Request
RequestValidator-->>FastAPI: ✗ Validation error
FastAPI-->>Client: 400 Bad Request
end
```
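From a client's perspective, the whole sequence above is a single HTTP call. A sketch against a locally running container; the `type`/`params` envelope follows the Docker API's config serialization format:

```python
import requests

# Assumes a container started with: docker run -p 11235:11235 unclecode/crawl4ai
payload = {
    "urls": ["https://example.com"],
    "browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
    "crawler_config": {"type": "CrawlerRunConfig",
                       "params": {"cache_mode": "bypass"}},
}
resp = requests.post("http://localhost:11235/crawl", json=payload, timeout=120)
resp.raise_for_status()
for result in resp.json()["results"]:
    print(result["url"], result["success"])
```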
### Configuration Management Flow
```mermaid
stateDiagram-v2
[*] --> ConfigLoading
ConfigLoading --> DefaultConfig: Load default config.yml
ConfigLoading --> CustomConfig: Custom config mounted
ConfigLoading --> EnvOverrides: Environment variables
DefaultConfig --> ConfigMerging
CustomConfig --> ConfigMerging
EnvOverrides --> ConfigMerging
ConfigMerging --> ConfigValidation
ConfigValidation --> Valid: Schema validation passes
ConfigValidation --> Invalid: Validation errors
Invalid --> ConfigError: Log errors and exit
ConfigError --> [*]
Valid --> ServiceInitialization
ServiceInitialization --> FastAPISetup
ServiceInitialization --> BrowserPoolInit
ServiceInitialization --> CacheSetup
FastAPISetup --> Running
BrowserPoolInit --> Running
CacheSetup --> Running
Running --> ConfigReload: Config change detected
ConfigReload --> ConfigValidation
Running --> [*]: Service shutdown
note right of ConfigMerging : Priority: ENV > Custom > Default
note right of ServiceInitialization : All services must initialize successfully
```
### Multi-Architecture Build Process
```mermaid
flowchart TD
A[Developer Push] --> B[GitHub Repository]
B --> C[Docker Buildx]
C --> D{Build Strategy}
D -->|Multi-arch| E[Parallel Builds]
D -->|Single-arch| F[Platform-specific Build]
E --> G[AMD64 Build]
E --> H[ARM64 Build]
F --> I[Target Platform Build]
subgraph "AMD64 Build Process"
G --> G1[Ubuntu base image]
G1 --> G2[Python 3.11 install]
G2 --> G3[System dependencies]
G3 --> G4[Crawl4AI installation]
G4 --> G5[Playwright setup]
G5 --> G6[FastAPI configuration]
G6 --> G7[AMD64 image ready]
end
subgraph "ARM64 Build Process"
H --> H1[Ubuntu ARM64 base]
H1 --> H2[Python 3.11 install]
H2 --> H3[ARM-specific deps]
H3 --> H4[Crawl4AI installation]
H4 --> H5[Playwright setup]
H5 --> H6[FastAPI configuration]
H6 --> H7[ARM64 image ready]
end
subgraph "Single Architecture"
I --> I1[Base image selection]
I1 --> I2[Platform dependencies]
I2 --> I3[Application setup]
I3 --> I4[Platform image ready]
end
G7 --> J[Multi-arch Manifest]
H7 --> J
I4 --> K[Platform Image]
J --> L[Docker Hub Registry]
K --> L
L --> M[docker pull auto-selects architecture]
style A fill:#e1f5fe
style J fill:#c8e6c9
style K fill:#c8e6c9
style L fill:#f3e5f5
style M fill:#e8f5e8
```
### MCP Integration Architecture
```mermaid
graph TB
subgraph "MCP Client Applications"
A[Claude Code] --> B[MCP Protocol]
C[Cursor IDE] --> B
D[Windsurf] --> B
E[Custom MCP Client] --> B
end
subgraph "Crawl4AI MCP Server"
B --> F[MCP Endpoint Router]
F --> G[SSE Transport /mcp/sse]
F --> H[WebSocket Transport /mcp/ws]
F --> I[Schema Endpoint /mcp/schema]
G --> J[MCP Tool Handler]
H --> J
J --> K[Tool: md]
J --> L[Tool: html]
J --> M[Tool: screenshot]
J --> N[Tool: pdf]
J --> O[Tool: execute_js]
J --> P[Tool: crawl]
J --> Q[Tool: ask]
end
subgraph "Crawl4AI Core Services"
K --> R[Markdown Generator]
L --> S[HTML Preprocessor]
M --> T[Screenshot Service]
N --> U[PDF Generator]
O --> V[JavaScript Executor]
P --> W[Batch Crawler]
Q --> X[Context Search]
R --> Y[Browser Pool]
S --> Y
T --> Y
U --> Y
V --> Y
W --> Y
X --> Z[Knowledge Base]
end
subgraph "External Resources"
Y --> AA[Playwright Browsers]
Z --> BB[Library Documentation]
Z --> CC[Code Examples]
AA --> DD[Web Pages]
end
style B fill:#e3f2fd
style J fill:#f3e5f5
style Y fill:#e8f5e8
style Z fill:#fff3e0
```
### API Request/Response Flow Patterns
```mermaid
sequenceDiagram
participant Client
participant LoadBalancer
participant FastAPI
participant ConfigValidator
participant BrowserManager
participant CrawlEngine
participant ResponseBuilder
Note over Client,ResponseBuilder: Basic Crawl Request
Client->>LoadBalancer: POST /crawl
LoadBalancer->>FastAPI: Route request
FastAPI->>ConfigValidator: Validate browser_config
ConfigValidator-->>FastAPI: ✓ Valid BrowserConfig
FastAPI->>ConfigValidator: Validate crawler_config
ConfigValidator-->>FastAPI: ✓ Valid CrawlerRunConfig
FastAPI->>BrowserManager: Allocate browser
BrowserManager-->>FastAPI: Browser instance
FastAPI->>CrawlEngine: Execute crawl
Note over CrawlEngine: Page processing
CrawlEngine->>CrawlEngine: Navigate & wait
CrawlEngine->>CrawlEngine: Extract content
CrawlEngine->>CrawlEngine: Apply strategies
CrawlEngine-->>FastAPI: CrawlResult
FastAPI->>ResponseBuilder: Format response
ResponseBuilder-->>FastAPI: JSON response
FastAPI->>BrowserManager: Release browser
FastAPI-->>LoadBalancer: Response ready
LoadBalancer-->>Client: 200 OK + CrawlResult
Note over Client,ResponseBuilder: Streaming Request
Client->>FastAPI: POST /crawl/stream
FastAPI-->>Client: 200 OK (stream start)
loop For each URL
FastAPI->>CrawlEngine: Process URL
CrawlEngine-->>FastAPI: Result ready
FastAPI-->>Client: NDJSON line
end
FastAPI-->>Client: Stream completed
```
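The streaming variant returns NDJSON, one result object per line. A hedged client sketch:

```python
import json

import requests

payload = {
    "urls": ["https://example.com", "https://httpbin.org/html"],
    "browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
    "crawler_config": {"type": "CrawlerRunConfig", "params": {"stream": True}},
}
with requests.post("http://localhost:11235/crawl/stream",
                   json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:  # each non-empty line is one JSON result
            result = json.loads(line)
            print(result.get("url"), result.get("success"))
```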
### Configuration Validation Workflow
```mermaid
flowchart TD
A[Client Request] --> B[JSON Payload]
B --> C{Pre-validation}
C -->|✓ Valid JSON| D[Extract Configurations]
C -->|✗ Invalid JSON| E[Return 400 Bad Request]
D --> F[BrowserConfig Validation]
D --> G[CrawlerRunConfig Validation]
F --> H{BrowserConfig Valid?}
G --> I{CrawlerRunConfig Valid?}
H -->|✓ Valid| J[Browser Setup]
H -->|✗ Invalid| K[Log Browser Config Errors]
I -->|✓ Valid| L[Crawler Setup]
I -->|✗ Invalid| M[Log Crawler Config Errors]
K --> N[Collect All Errors]
M --> N
N --> O[Return 422 Validation Error]
J --> P{Both Configs Valid?}
L --> P
P -->|✓ Yes| Q[Proceed to Crawling]
P -->|✗ No| O
Q --> R[Execute Crawl Pipeline]
R --> S[Return CrawlResult]
E --> T[Client Error Response]
O --> T
S --> U[Client Success Response]
style A fill:#e1f5fe
style Q fill:#c8e6c9
style S fill:#c8e6c9
style U fill:#c8e6c9
style E fill:#ffcdd2
style O fill:#ffcdd2
style T fill:#ffcdd2
```
### Production Deployment Architecture
```mermaid
graph TB
subgraph "Load Balancer Layer"
A[NGINX/HAProxy] --> B[Health Check]
A --> C[Request Routing]
A --> D[SSL Termination]
end
subgraph "Application Layer"
C --> E[Crawl4AI Instance 1]
C --> F[Crawl4AI Instance 2]
C --> G[Crawl4AI Instance N]
E --> H[FastAPI Server]
F --> I[FastAPI Server]
G --> J[FastAPI Server]
H --> K[Browser Pool 1]
I --> L[Browser Pool 2]
J --> M[Browser Pool N]
end
subgraph "Shared Services"
N[Redis Cluster] --> E
N --> F
N --> G
O[Monitoring Stack] --> P[Prometheus]
O --> Q[Grafana]
O --> R[AlertManager]
P --> E
P --> F
P --> G
end
subgraph "External Dependencies"
S[OpenAI API] -.-> H
T[Anthropic API] -.-> I
U[Local LLM Cluster] -.-> J
end
subgraph "Persistent Storage"
V[Configuration Volume] --> E
V --> F
V --> G
W[Cache Volume] --> N
X[Logs Volume] --> O
end
style A fill:#e3f2fd
style E fill:#f3e5f5
style F fill:#f3e5f5
style G fill:#f3e5f5
style N fill:#e8f5e8
style O fill:#fff3e0
```
### Docker Resource Management
```mermaid
graph TD
subgraph "Resource Allocation"
A[Host Resources] --> B[CPU Cores]
A --> C[Memory GB]
A --> D[Disk Space]
A --> E[Network Bandwidth]
B --> F[Container Limits]
C --> F
D --> F
E --> F
end
subgraph "Container Configuration"
F --> G[--cpus=4]
F --> H[--memory=8g]
F --> I[--shm-size=2g]
F --> J[Volume Mounts]
G --> K[Browser Processes]
H --> L[Browser Memory]
I --> M[Shared Memory for Browsers]
J --> N[Config & Cache Storage]
end
subgraph "Monitoring & Scaling"
O[Resource Monitor] --> P[CPU Usage %]
O --> Q[Memory Usage %]
O --> R[Request Queue Length]
P --> S{CPU > 80%?}
Q --> T{Memory > 90%?}
R --> U{Queue > 100?}
S -->|Yes| V[Scale Up]
T -->|Yes| V
U -->|Yes| V
V --> W[Add Container Instance]
W --> X[Update Load Balancer]
end
subgraph "Performance Optimization"
Y[Browser Pool Tuning] --> Z[Max Pages: 40]
Y --> AA[Idle TTL: 30min]
Y --> BB[Concurrency Limits]
Z --> CC[Memory Efficiency]
AA --> DD[Resource Cleanup]
BB --> EE[Throughput Control]
end
style A fill:#e1f5fe
style F fill:#f3e5f5
style O fill:#e8f5e8
style Y fill:#fff3e0
```
**📖 Learn more:** [Docker Deployment Guide](https://docs.crawl4ai.com/core/docker-deployment/), [API Reference](https://docs.crawl4ai.com/api/), [MCP Integration](https://docs.crawl4ai.com/core/docker-deployment/#mcp-model-context-protocol-support), [Production Configuration](https://docs.crawl4ai.com/core/docker-deployment/#production-deployment)


@@ -0,0 +1,478 @@
## Extraction Strategy Workflows and Architecture
Visual representations of Crawl4AI's data extraction approaches, strategy selection, and processing workflows.
### Extraction Strategy Decision Tree
```mermaid
flowchart TD
A[Content to Extract] --> B{Content Type?}
B -->|Simple Patterns| C[Common Data Types]
B -->|Structured HTML| D[Predictable Structure]
B -->|Complex Content| E[Requires Reasoning]
B -->|Mixed Content| F[Multiple Data Types]
C --> C1{Pattern Type?}
C1 -->|Email, Phone, URLs| C2[Built-in Regex Patterns]
C1 -->|Custom Patterns| C3[Custom Regex Strategy]
C1 -->|LLM-Generated| C4[One-time Pattern Generation]
D --> D1{Selector Type?}
D1 -->|CSS Selectors| D2[JsonCssExtractionStrategy]
D1 -->|XPath Expressions| D3[JsonXPathExtractionStrategy]
D1 -->|Need Schema?| D4[Auto-generate Schema with LLM]
E --> E1{LLM Provider?}
E1 -->|OpenAI/Anthropic| E2[Cloud LLM Strategy]
E1 -->|Local Ollama| E3[Local LLM Strategy]
E1 -->|Cost-sensitive| E4[Hybrid: Generate Schema Once]
F --> F1[Multi-Strategy Approach]
F1 --> F2[1. Regex for Patterns]
F1 --> F3[2. CSS for Structure]
F1 --> F4[3. LLM for Complex Analysis]
C2 --> G[Fast Extraction ⚡]
C3 --> G
C4 --> H[Cached Pattern Reuse]
D2 --> I[Schema-based Extraction 🏗️]
D3 --> I
D4 --> J[Generated Schema Cache]
E2 --> K[Intelligent Parsing 🧠]
E3 --> K
E4 --> L[Hybrid Cost-Effective]
F2 --> M[Comprehensive Results 📊]
F3 --> M
F4 --> M
style G fill:#c8e6c9
style I fill:#e3f2fd
style K fill:#fff3e0
style M fill:#f3e5f5
style H fill:#e8f5e8
style J fill:#e8f5e8
style L fill:#ffecb3
```
### LLM Extraction Strategy Workflow
```mermaid
sequenceDiagram
participant User
participant Crawler
participant LLMStrategy
participant Chunker
participant LLMProvider
participant Parser
User->>Crawler: Configure LLMExtractionStrategy
User->>Crawler: arun(url, config)
Crawler->>Crawler: Navigate to URL
Crawler->>Crawler: Extract content (HTML/Markdown)
Crawler->>LLMStrategy: Process content
LLMStrategy->>LLMStrategy: Check content size
alt Content > chunk_threshold
LLMStrategy->>Chunker: Split into chunks with overlap
Chunker-->>LLMStrategy: Return chunks[]
loop For each chunk
LLMStrategy->>LLMProvider: Send chunk + schema + instruction
LLMProvider-->>LLMStrategy: Return structured JSON
end
LLMStrategy->>LLMStrategy: Merge chunk results
else Content <= threshold
LLMStrategy->>LLMProvider: Send full content + schema
LLMProvider-->>LLMStrategy: Return structured JSON
end
LLMStrategy->>Parser: Validate JSON schema
Parser-->>LLMStrategy: Validated data
LLMStrategy->>LLMStrategy: Track token usage
LLMStrategy-->>Crawler: Return extracted_content
Crawler-->>User: CrawlResult with JSON data
User->>LLMStrategy: show_usage()
LLMStrategy-->>User: Token count & estimated cost
```
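Configured in code, the workflow above looks roughly like this. Parameter names such as `chunk_token_threshold` and `overlap_rate` follow the current documentation; verify them against your installed version:

```python
import asyncio
import os

from pydantic import BaseModel
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class Product(BaseModel):
    name: str
    price: str

async def main():
    strategy = LLMExtractionStrategy(
        llm_config=LLMConfig(provider="openai/gpt-4o-mini",
                             api_token=os.getenv("OPENAI_API_KEY")),
        schema=Product.model_json_schema(),
        extraction_type="schema",
        instruction="Extract every product name and price.",
        chunk_token_threshold=1200,  # above this, content is chunked
        overlap_rate=0.1,            # overlap between adjacent chunks
    )
    config = CrawlerRunConfig(extraction_strategy=strategy)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/products",
                                    config=config)
        print(result.extracted_content)
        strategy.show_usage()  # token counts and estimated cost

asyncio.run(main())
```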
### Schema-Based Extraction Architecture
```mermaid
graph TB
subgraph "Schema Definition"
A[JSON Schema] --> A1[baseSelector]
A --> A2[fields[]]
A --> A3[nested structures]
A2 --> A4[CSS/XPath selectors]
A2 --> A5[Data types: text, html, attribute]
A2 --> A6[Default values]
A3 --> A7[nested objects]
A3 --> A8[nested_list arrays]
A3 --> A9[simple lists]
end
subgraph "Extraction Engine"
B[HTML Content] --> C[Selector Engine]
C --> C1[CSS Selector Parser]
C --> C2[XPath Evaluator]
C1 --> D[Element Matcher]
C2 --> D
D --> E[Type Converter]
E --> E1[Text Extraction]
E --> E2[HTML Preservation]
E --> E3[Attribute Extraction]
E --> E4[Nested Processing]
end
subgraph "Result Processing"
F[Raw Extracted Data] --> G[Structure Builder]
G --> G1[Object Construction]
G --> G2[Array Assembly]
G --> G3[Type Validation]
G1 --> H[JSON Output]
G2 --> H
G3 --> H
end
A --> C
E --> F
H --> I[extracted_content]
style A fill:#e3f2fd
style C fill:#f3e5f5
style G fill:#e8f5e8
style H fill:#c8e6c9
```
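A schema with the pieces shown above (`baseSelector`, typed `fields`, `nested_list`) is a plain dict; the selectors here are hypothetical placeholders for a product listing:

```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Products",
    "baseSelector": "div.product",  # one match per extracted record
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
        {"name": "url", "selector": "a", "type": "attribute",
         "attribute": "href"},
        {"name": "reviews", "selector": ".review", "type": "nested_list",
         "fields": [
             {"name": "rating", "selector": ".stars", "type": "attribute",
              "attribute": "data-rating"},
             {"name": "comment", "selector": "p", "type": "text"},
         ]},
    ],
}
strategy = JsonCssExtractionStrategy(schema, verbose=True)
# Pass via CrawlerRunConfig(extraction_strategy=strategy);
# no LLM calls are made at crawl time.
```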
### Automatic Schema Generation Process
```mermaid
stateDiagram-v2
[*] --> CheckCache
CheckCache --> CacheHit: Schema exists
CheckCache --> SamplePage: Schema missing
CacheHit --> LoadSchema
LoadSchema --> FastExtraction
SamplePage --> ExtractHTML: Crawl sample URL
ExtractHTML --> LLMAnalysis: Send HTML to LLM
LLMAnalysis --> GenerateSchema: Create CSS/XPath selectors
GenerateSchema --> ValidateSchema: Test generated schema
ValidateSchema --> SchemaWorks: Valid selectors
ValidateSchema --> RefineSchema: Invalid selectors
RefineSchema --> LLMAnalysis: Iterate with feedback
SchemaWorks --> CacheSchema: Save for reuse
CacheSchema --> FastExtraction: Use cached schema
FastExtraction --> [*]: No more LLM calls needed
note right of CheckCache : One-time LLM cost
note right of FastExtraction : Unlimited fast reuse
note right of CacheSchema : JSON file storage
```
### Multi-Strategy Extraction Pipeline
```mermaid
flowchart LR
A[Web Page Content] --> B[Strategy Pipeline]
subgraph B["Extraction Pipeline"]
B1[Stage 1: Regex Patterns]
B2[Stage 2: Schema-based CSS]
B3[Stage 3: LLM Analysis]
B1 --> B1a[Email addresses]
B1 --> B1b[Phone numbers]
B1 --> B1c[URLs and links]
B1 --> B1d[Currency amounts]
B2 --> B2a[Structured products]
B2 --> B2b[Article metadata]
B2 --> B2c[User reviews]
B2 --> B2d[Navigation links]
B3 --> B3a[Sentiment analysis]
B3 --> B3b[Key topics]
B3 --> B3c[Entity recognition]
B3 --> B3d[Content summary]
end
B1a --> C[Result Merger]
B1b --> C
B1c --> C
B1d --> C
B2a --> C
B2b --> C
B2c --> C
B2d --> C
B3a --> C
B3b --> C
B3c --> C
B3d --> C
C --> D[Combined JSON Output]
D --> E[Final CrawlResult]
style B1 fill:#c8e6c9
style B2 fill:#e3f2fd
style B3 fill:#fff3e0
style C fill:#f3e5f5
```
### Performance Comparison Matrix
```mermaid
graph TD
subgraph "Strategy Performance"
A[Extraction Strategy Comparison]
subgraph "Speed ⚡"
S1[Regex: ~10ms]
S2[CSS Schema: ~50ms]
S3[XPath: ~100ms]
S4[LLM: ~2-10s]
end
subgraph "Accuracy 🎯"
A1[Regex: Pattern-dependent]
A2[CSS: High for structured]
A3[XPath: Very high]
A4[LLM: Excellent for complex]
end
subgraph "Cost 💰"
C1[Regex: Free]
C2[CSS: Free]
C3[XPath: Free]
C4[LLM: $0.001-0.01 per page]
end
subgraph "Complexity 🔧"
X1[Regex: Simple patterns only]
X2[CSS: Structured HTML]
X3[XPath: Complex selectors]
X4[LLM: Any content type]
end
end
style S1 fill:#c8e6c9
style S2 fill:#e8f5e8
style S3 fill:#fff3e0
style S4 fill:#ffcdd2
style A2 fill:#e8f5e8
style A3 fill:#c8e6c9
style A4 fill:#c8e6c9
style C1 fill:#c8e6c9
style C2 fill:#c8e6c9
style C3 fill:#c8e6c9
style C4 fill:#fff3e0
style X1 fill:#ffcdd2
style X2 fill:#e8f5e8
style X3 fill:#c8e6c9
style X4 fill:#c8e6c9
```
### Regex Pattern Strategy Flow
```mermaid
flowchart TD
A[Regex Extraction] --> B{Pattern Source?}
B -->|Built-in| C[Use Predefined Patterns]
B -->|Custom| D[Define Custom Regex]
B -->|LLM-Generated| E[Generate with AI]
C --> C1[Email Pattern]
C --> C2[Phone Pattern]
C --> C3[URL Pattern]
C --> C4[Currency Pattern]
C --> C5[Date Pattern]
D --> D1[Write Custom Regex]
D --> D2[Test Pattern]
D --> D3{Pattern Works?}
D3 -->|No| D1
D3 -->|Yes| D4[Use Pattern]
E --> E1[Provide Sample Content]
E --> E2[LLM Analyzes Content]
E --> E3[Generate Optimized Regex]
E --> E4[Cache Pattern for Reuse]
C1 --> F[Pattern Matching]
C2 --> F
C3 --> F
C4 --> F
C5 --> F
D4 --> F
E4 --> F
F --> G[Extract Matches]
G --> H[Group by Pattern Type]
H --> I[JSON Output with Labels]
style C fill:#e8f5e8
style D fill:#e3f2fd
style E fill:#fff3e0
style F fill:#f3e5f5
```
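In recent releases the built-in patterns are exposed as combinable flags on `RegexExtractionStrategy`; treat the exact flag names as assumptions to check against your version. A sketch:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import RegexExtractionStrategy

async def main():
    # Combine built-in patterns with bitwise OR; no LLM involved.
    strategy = RegexExtractionStrategy(
        pattern=RegexExtractionStrategy.Email | RegexExtractionStrategy.Url
    )
    config = CrawlerRunConfig(extraction_strategy=strategy)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/contact",
                                    config=config)
        print(result.extracted_content)  # JSON matches labeled by pattern type

asyncio.run(main())
```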
### Complex Schema Structure Visualization
```mermaid
graph TB
subgraph "E-commerce Schema Example"
A[Category baseSelector] --> B[Category Fields]
A --> C[Products nested_list]
B --> B1[category_name]
B --> B2[category_id attribute]
B --> B3[category_url attribute]
C --> C1[Product baseSelector]
C1 --> C2[name text]
C1 --> C3[price text]
C1 --> C4[Details nested object]
C1 --> C5[Features list]
C1 --> C6[Reviews nested_list]
C4 --> C4a[brand text]
C4 --> C4b[model text]
C4 --> C4c[specs html]
C5 --> C5a[feature text array]
C6 --> C6a[reviewer text]
C6 --> C6b[rating attribute]
C6 --> C6c[comment text]
C6 --> C6d[date attribute]
end
subgraph "JSON Output Structure"
D[categories array] --> D1[category object]
D1 --> D2[category_name]
D1 --> D3[category_id]
D1 --> D4[products array]
D4 --> D5[product object]
D5 --> D6[name, price]
D5 --> D7[details object]
D5 --> D8[features array]
D5 --> D9[reviews array]
D7 --> D7a[brand, model, specs]
D8 --> D8a[feature strings]
D9 --> D9a[review objects]
end
A -.-> D
B1 -.-> D2
C2 -.-> D6
C4 -.-> D7
C5 -.-> D8
C6 -.-> D9
style A fill:#e3f2fd
style C fill:#f3e5f5
style C4 fill:#e8f5e8
style D fill:#fff3e0
```
### Error Handling and Fallback Strategy
```mermaid
stateDiagram-v2
[*] --> PrimaryStrategy
PrimaryStrategy --> Success: Extraction successful
PrimaryStrategy --> ValidationFailed: Invalid data
PrimaryStrategy --> ExtractionFailed: No matches found
PrimaryStrategy --> TimeoutError: LLM timeout
ValidationFailed --> FallbackStrategy: Try alternative
ExtractionFailed --> FallbackStrategy: Try alternative
TimeoutError --> FallbackStrategy: Try alternative
FallbackStrategy --> FallbackSuccess: Fallback works
FallbackStrategy --> FallbackFailed: All strategies failed
FallbackSuccess --> Success: Return results
FallbackFailed --> ErrorReport: Log failure details
Success --> [*]: Complete
ErrorReport --> [*]: Return empty results
note right of PrimaryStrategy : Try fastest/most accurate first
note right of FallbackStrategy : Use simpler but reliable method
note left of ErrorReport : Provide debugging information
```
### Token Usage and Cost Optimization
```mermaid
flowchart TD
A[LLM Extraction Request] --> B{Content Size Check}
B -->|Small < 1200 tokens| C[Single LLM Call]
B -->|Large > 1200 tokens| D[Chunking Strategy]
C --> C1[Send full content]
C1 --> C2[Parse JSON response]
C2 --> C3[Track token usage]
D --> D1[Split into chunks]
D1 --> D2[Add overlap between chunks]
D2 --> D3[Process chunks in parallel]
D3 --> D4[Chunk 1 → LLM]
D3 --> D5[Chunk 2 → LLM]
D3 --> D6[Chunk N → LLM]
D4 --> D7[Merge results]
D5 --> D7
D6 --> D7
D7 --> D8[Deduplicate data]
D8 --> D9[Aggregate token usage]
C3 --> E[Cost Calculation]
D9 --> E
E --> F[Usage Report]
F --> F1[Prompt tokens: X]
F --> F2[Completion tokens: Y]
F --> F3[Total cost: $Z]
style C fill:#c8e6c9
style D fill:#fff3e0
style E fill:#e3f2fd
style F fill:#f3e5f5
```
**📖 Learn more:** [LLM Strategies](https://docs.crawl4ai.com/extraction/llm-strategies/), [Schema-Based Extraction](https://docs.crawl4ai.com/extraction/no-llm-strategies/), [Pattern Matching](https://docs.crawl4ai.com/extraction/no-llm-strategies/#regexextractionstrategy), [Performance Optimization](https://docs.crawl4ai.com/advanced/multi-url-crawling/)


@@ -0,0 +1,472 @@
## HTTP Crawler Strategy Workflows
Visual representations of HTTP-based crawling architecture, request flows, and performance characteristics compared to browser-based strategies.
### HTTP vs Browser Strategy Decision Tree
```mermaid
flowchart TD
A[Content Crawling Need] --> B{Content Type Analysis}
B -->|Static HTML| C{JavaScript Required?}
B -->|Dynamic SPA| D[Browser Strategy Required]
B -->|API Endpoints| E[HTTP Strategy Optimal]
B -->|Mixed Content| F{Primary Content Source?}
C -->|No JS Needed| G[HTTP Strategy Recommended]
C -->|JS Required| H[Browser Strategy Required]
C -->|Unknown| I{Performance Priority?}
I -->|Speed Critical| J[Try HTTP First]
I -->|Accuracy Critical| K[Use Browser Strategy]
F -->|Mostly Static| G
F -->|Mostly Dynamic| D
G --> L{Resource Constraints?}
L -->|Memory Limited| M[HTTP Strategy - Lightweight]
L -->|CPU Limited| N[HTTP Strategy - No Browser]
L -->|Network Limited| O[HTTP Strategy - Efficient]
L -->|No Constraints| P[Either Strategy Works]
J --> Q[Test HTTP Results]
Q --> R{Content Complete?}
R -->|Yes| S[Continue with HTTP]
R -->|No| T[Switch to Browser Strategy]
D --> U[Browser Strategy Features]
H --> U
K --> U
T --> U
U --> V[JavaScript Execution]
U --> W[Screenshots/PDFs]
U --> X[Complex Interactions]
U --> Y[Session Management]
M --> Z[HTTP Strategy Benefits]
N --> Z
O --> Z
S --> Z
Z --> AA[10x Faster Processing]
Z --> BB[Lower Memory Usage]
Z --> CC[Higher Concurrency]
Z --> DD[Simpler Deployment]
style G fill:#c8e6c9
style M fill:#c8e6c9
style N fill:#c8e6c9
style O fill:#c8e6c9
style S fill:#c8e6c9
style D fill:#e3f2fd
style H fill:#e3f2fd
style K fill:#e3f2fd
style T fill:#e3f2fd
style U fill:#e3f2fd
```
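When the tree lands on the HTTP strategy, swapping it in is a one-line change on the crawler. A sketch assuming the `AsyncHTTPCrawlerStrategy`/`HTTPCrawlerConfig` interface from the current docs (note the strategy reuses the `browser_config` parameter name for its HTTP config):

```python
import asyncio

from crawl4ai import AsyncWebCrawler, HTTPCrawlerConfig
from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy

async def main():
    http_strategy = AsyncHTTPCrawlerStrategy(
        browser_config=HTTPCrawlerConfig(
            method="GET",
            headers={"User-Agent": "MyBot/1.0"},
            follow_redirects=True,
            verify_ssl=True,
        )
    )
    # No browser is launched: plain HTTP fetch + the same content pipeline.
    async with AsyncWebCrawler(crawler_strategy=http_strategy) as crawler:
        result = await crawler.arun("https://example.com")
        print(result.status_code, result.success)

asyncio.run(main())
```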
### HTTP Request Lifecycle Sequence
```mermaid
sequenceDiagram
participant Client
participant HTTPStrategy as HTTP Strategy
participant Session as HTTP Session
participant Server as Target Server
participant Processor as Content Processor
Client->>HTTPStrategy: crawl(url, config)
HTTPStrategy->>HTTPStrategy: validate_url()
alt file:// URL
HTTPStrategy->>HTTPStrategy: handle_file_url()
Note over HTTPStrategy: Read local file directly
else raw:// content
HTTPStrategy->>HTTPStrategy: handle_raw_content()
Note over HTTPStrategy: Use supplied content as-is
else http(s):// URL
HTTPStrategy->>Session: prepare_request()
Session->>Session: apply_config()
Session->>Session: set_headers()
Session->>Session: setup_auth()
Session->>Server: HTTP Request
Note over Session,Server: GET/POST/PUT with headers
alt Success Response
Server-->>Session: HTTP 200 + Content
Session-->>HTTPStrategy: response_data
else Redirect Response
Server-->>Session: HTTP 3xx + Location
Session->>Server: Follow redirect
Server-->>Session: HTTP 200 + Content
Session-->>HTTPStrategy: final_response
else Error Response
Server-->>Session: HTTP 4xx/5xx
Session-->>HTTPStrategy: error_response
end
end
HTTPStrategy->>Processor: process_content()
Processor->>Processor: clean_html()
Processor->>Processor: extract_metadata()
Processor->>Processor: generate_markdown()
Processor-->>HTTPStrategy: processed_result
HTTPStrategy-->>Client: CrawlResult
Note over Client,Processor: Fast, lightweight processing
Note over HTTPStrategy: No browser overhead
```
### HTTP Strategy Architecture
```mermaid
graph TB
subgraph "HTTP Crawler Strategy"
A[AsyncHTTPCrawlerStrategy] --> B[Session Manager]
A --> C[Request Builder]
A --> D[Response Handler]
A --> E[Error Manager]
B --> B1[Connection Pool]
B --> B2[DNS Cache]
B --> B3[SSL Context]
C --> C1[Headers Builder]
C --> C2[Auth Handler]
C --> C3[Payload Encoder]
D --> D1[Content Decoder]
D --> D2[Redirect Handler]
D --> D3[Status Validator]
E --> E1[Retry Logic]
E --> E2[Timeout Handler]
E --> E3[Exception Mapper]
end
subgraph "Content Processing"
F[Raw HTML] --> G[HTML Cleaner]
G --> H[Markdown Generator]
H --> I[Link Extractor]
I --> J[Media Extractor]
J --> K[Metadata Parser]
end
subgraph "External Resources"
L[Target Websites]
M[Local Files]
N[Raw Content]
end
subgraph "Output"
O[CrawlResult]
O --> O1[HTML Content]
O --> O2[Markdown Text]
O --> O3[Extracted Links]
O --> O4[Media References]
O --> O5[Status Information]
end
A --> F
L --> A
M --> A
N --> A
K --> O
style A fill:#e3f2fd
style B fill:#f3e5f5
style F fill:#e8f5e8
style O fill:#fff3e0
```
### Performance Comparison Flow
```mermaid
graph LR
subgraph "HTTP Strategy Performance"
A1[Request Start] --> A2[DNS Lookup: 50ms]
A2 --> A3[TCP Connect: 100ms]
A3 --> A4[HTTP Request: 200ms]
A4 --> A5[Content Download: 300ms]
A5 --> A6[Processing: 50ms]
A6 --> A7[Total: ~700ms]
end
subgraph "Browser Strategy Performance"
B1[Request Start] --> B2[Browser Launch: 2000ms]
B2 --> B3[Page Navigation: 1000ms]
B3 --> B4[JS Execution: 500ms]
B4 --> B5[Content Rendering: 300ms]
B5 --> B6[Processing: 100ms]
B6 --> B7[Total: ~3900ms]
end
subgraph "Resource Usage"
C1[HTTP Memory: ~50MB]
C2[Browser Memory: ~500MB]
C3[HTTP CPU: Low]
C4[Browser CPU: High]
C5[HTTP Concurrency: 100+]
C6[Browser Concurrency: 10-20]
end
A7 --> D[5.5x Faster]
B7 --> D
C1 --> E[10x Less Memory]
C2 --> E
C5 --> F[5x More Concurrent]
C6 --> F
style A7 fill:#c8e6c9
style B7 fill:#ffcdd2
style C1 fill:#c8e6c9
style C2 fill:#ffcdd2
style C5 fill:#c8e6c9
style C6 fill:#ffcdd2
```
### HTTP Request Types and Configuration
```mermaid
stateDiagram-v2
[*] --> HTTPConfigSetup
HTTPConfigSetup --> MethodSelection
MethodSelection --> GET: Simple data retrieval
MethodSelection --> POST: Form submission
MethodSelection --> PUT: Data upload
MethodSelection --> DELETE: Resource removal
GET --> HeaderSetup: Set Accept headers
POST --> PayloadSetup: JSON or form data
PUT --> PayloadSetup: File or data upload
DELETE --> AuthSetup: Authentication required
PayloadSetup --> JSONPayload: application/json
PayloadSetup --> FormPayload: form-data
PayloadSetup --> RawPayload: custom content
JSONPayload --> HeaderSetup
FormPayload --> HeaderSetup
RawPayload --> HeaderSetup
HeaderSetup --> AuthSetup
AuthSetup --> SSLSetup
SSLSetup --> RedirectSetup
RedirectSetup --> RequestExecution
RequestExecution --> [*]: Request complete
note right of GET : Default method for most crawling
note right of POST : API interactions, form submissions
note right of JSONPayload : Structured data transmission
note right of HeaderSetup : User-Agent, Accept, Custom headers
```
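A hedged sketch of how these states map onto request configuration, assuming the `HTTPCrawlerConfig` field names shown in the diagram (`method`, `headers`, `json`, `follow_redirects`, `verify_ssl`) and that the strategy accepts the config through a `browser_config` parameter:

```python
from crawl4ai import AsyncHTTPCrawlerStrategy, HTTPCrawlerConfig  # names assumed

# POST with a JSON payload, custom headers, redirects and SSL verification on
http_config = HTTPCrawlerConfig(
    method="POST",                   # MethodSelection
    headers={"Accept": "application/json", "User-Agent": "crawl4ai/1.0"},  # HeaderSetup
    json={"query": "example"},       # JSONPayload path
    follow_redirects=True,           # RedirectSetup
    verify_ssl=True,                 # SSLSetup
)
strategy = AsyncHTTPCrawlerStrategy(browser_config=http_config)  # parameter name assumed
```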
### Error Handling and Retry Workflow
```mermaid
flowchart TD
A[HTTP Request] --> B{Response Received?}
B -->|No| C[Connection Error]
B -->|Yes| D{Status Code Check}
C --> C1{Timeout Error?}
C1 -->|Yes| C2[ConnectionTimeoutError]
C1 -->|No| C3[Network Error]
D -->|2xx| E[Success Response]
D -->|3xx| F[Redirect Response]
D -->|4xx| G[Client Error]
D -->|5xx| H[Server Error]
F --> F1{Follow Redirects?}
F1 -->|Yes| F2[Follow Redirect]
F1 -->|No| F3[Return Redirect Response]
F2 --> A
G --> G1{Retry on 4xx?}
G1 -->|No| G2[HTTPStatusError]
G1 -->|Yes| I[Check Retry Count]
H --> H1{Retry on 5xx?}
H1 -->|Yes| I
H1 -->|No| H2[HTTPStatusError]
C2 --> I
C3 --> I
I --> J{Retries < Max?}
J -->|No| K[Final Error]
J -->|Yes| L[Calculate Backoff]
L --> M[Wait Backoff Time]
M --> N[Increment Retry Count]
N --> A
E --> O[Process Content]
F3 --> O
O --> P[Return CrawlResult]
G2 --> Q[Error CrawlResult]
H2 --> Q
K --> Q
style E fill:#c8e6c9
style P fill:#c8e6c9
style G2 fill:#ffcdd2
style H2 fill:#ffcdd2
style K fill:#ffcdd2
style Q fill:#ffcdd2
```
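The retry loop can be sketched generically; this illustrates the backoff pattern in the flowchart, not the library's internal implementation:

```python
import asyncio
import random

async def fetch_with_retry(fetch, url, max_retries=3, base_delay=1.0, max_delay=30.0):
    """Generic exponential-backoff loop mirroring the flowchart above."""
    for attempt in range(max_retries + 1):
        try:
            return await fetch(url)
        except Exception:  # stand-in for timeouts, 429s, and 5xx responses
            if attempt == max_retries:
                raise  # "Final Error" branch
            delay = min(max_delay, base_delay * (2 ** attempt))  # Calculate Backoff
            await asyncio.sleep(delay + random.uniform(0, 0.5))  # Wait Backoff + jitter
```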
### Batch Processing Architecture
```mermaid
sequenceDiagram
participant Client
participant BatchManager as Batch Manager
participant HTTPPool as Connection Pool
participant Workers as HTTP Workers
participant Targets as Target Servers
Client->>BatchManager: batch_crawl(urls)
BatchManager->>BatchManager: create_semaphore(max_concurrent)
loop For each URL batch
BatchManager->>HTTPPool: acquire_connection()
HTTPPool->>Workers: assign_worker()
par Concurrent Processing
Workers->>Targets: HTTP Request 1
Workers->>Targets: HTTP Request 2
Workers->>Targets: HTTP Request N
end
par Response Handling
Targets-->>Workers: Response 1
Targets-->>Workers: Response 2
Targets-->>Workers: Response N
end
Workers->>HTTPPool: return_connection()
HTTPPool->>BatchManager: batch_results()
end
BatchManager->>BatchManager: aggregate_results()
BatchManager-->>Client: final_results()
Note over Workers,Targets: 20-100 concurrent connections
Note over BatchManager: Memory-efficient processing
Note over HTTPPool: Connection reuse optimization
```
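The semaphore-and-pool pattern in this sequence reduces to a small amount of asyncio code; this is an illustration of bounded fan-out, not the library's internals:

```python
import asyncio

async def bounded_batch(crawler, urls, max_concurrent=20):
    """Semaphore-bounded fan-out matching the batch flow above."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def crawl_one(url):
        async with semaphore:  # acquire_connection() analogue
            return await crawler.arun(url)

    return await asyncio.gather(*(crawl_one(u) for u in urls))
```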
### Content Type Processing Pipeline
```mermaid
graph TD
A[HTTP Response] --> B{Content-Type Detection}
B -->|text/html| C[HTML Processing]
B -->|application/json| D[JSON Processing]
B -->|text/plain| E[Text Processing]
B -->|application/xml| F[XML Processing]
B -->|Other| G[Binary Processing]
C --> C1[Parse HTML Structure]
C1 --> C2[Extract Text Content]
C2 --> C3[Generate Markdown]
C3 --> C4[Extract Links/Media]
D --> D1[Parse JSON Structure]
D1 --> D2[Extract Data Fields]
D2 --> D3[Format as Readable Text]
E --> E1[Clean Text Content]
E1 --> E2[Basic Formatting]
F --> F1[Parse XML Structure]
F1 --> F2[Extract Text Nodes]
F2 --> F3[Convert to Markdown]
G --> G1[Save Binary Content]
G1 --> G2[Generate Metadata]
C4 --> H[Content Analysis]
D3 --> H
E2 --> H
F3 --> H
G2 --> H
H --> I[Link Extraction]
H --> J[Media Detection]
H --> K[Metadata Parsing]
I --> L[CrawlResult Assembly]
J --> L
K --> L
L --> M[Final Output]
style C fill:#e8f5e8
style H fill:#fff3e0
style L fill:#e3f2fd
style M fill:#c8e6c9
```
### Integration with Processing Strategies
```mermaid
graph LR
subgraph "HTTP Strategy Core"
A[HTTP Request] --> B[Raw Content]
B --> C[Content Decoder]
end
subgraph "Processing Pipeline"
C --> D[HTML Cleaner]
D --> E[Markdown Generator]
E --> F{Content Filter?}
F -->|Yes| G[Pruning Filter]
F -->|Yes| H[BM25 Filter]
F -->|No| I[Raw Markdown]
G --> J[Fit Markdown]
H --> J
end
subgraph "Extraction Strategies"
I --> K[CSS Extraction]
J --> K
I --> L[XPath Extraction]
J --> L
I --> M[LLM Extraction]
J --> M
end
subgraph "Output Generation"
K --> N[Structured JSON]
L --> N
M --> N
I --> O[Clean Markdown]
J --> P[Filtered Content]
N --> Q[Final CrawlResult]
O --> Q
P --> Q
end
style A fill:#e3f2fd
style C fill:#f3e5f5
style E fill:#e8f5e8
style Q fill:#c8e6c9
```
**📖 Learn more:** [HTTP vs Browser Strategies](https://docs.crawl4ai.com/core/browser-crawler-config/), [Performance Optimization](https://docs.crawl4ai.com/advanced/multi-url-crawling/), [Error Handling](https://docs.crawl4ai.com/api/async-webcrawler/)

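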

@@ -0,0 +1,368 @@
## Installation Workflows and Architecture
Visual representations of Crawl4AI installation processes, deployment options, and system interactions.
### Installation Decision Flow
```mermaid
flowchart TD
A[Start Installation] --> B{Environment Type?}
B -->|Local Development| C[Basic Python Install]
B -->|Production| D[Docker Deployment]
B -->|Research/Testing| E[Google Colab]
B -->|CI/CD Pipeline| F[Automated Setup]
C --> C1[pip install crawl4ai]
C1 --> C2[crawl4ai-setup]
C2 --> C3{Need Advanced Features?}
C3 -->|No| C4[Basic Installation Complete]
C3 -->|Text Clustering| C5[pip install crawl4ai with torch]
C3 -->|Transformers| C6[pip install crawl4ai with transformer]
C3 -->|All Features| C7[pip install crawl4ai with all]
C5 --> C8[crawl4ai-download-models]
C6 --> C8
C7 --> C8
C8 --> C9[Advanced Installation Complete]
D --> D1{Deployment Method?}
D1 -->|Pre-built Image| D2[docker pull unclecode/crawl4ai]
D1 -->|Docker Compose| D3[Clone repo + docker compose]
D1 -->|Custom Build| D4[docker buildx build]
D2 --> D5[Configure .llm.env]
D3 --> D5
D4 --> D5
D5 --> D6[docker run with ports]
D6 --> D7[Docker Deployment Complete]
E --> E1[Colab pip install]
E1 --> E2[playwright install chromium]
E2 --> E3[Test basic crawl]
E3 --> E4[Colab Setup Complete]
F --> F1[Automated pip install]
F1 --> F2[Automated setup scripts]
F2 --> F3[CI/CD Integration Complete]
C4 --> G[Verify with crawl4ai-doctor]
C9 --> G
D7 --> H[Health check via API]
E4 --> I[Run test crawl]
F3 --> G
G --> J[Installation Verified]
H --> J
I --> J
style A fill:#e1f5fe
style J fill:#c8e6c9
style C4 fill:#fff3e0
style C9 fill:#fff3e0
style D7 fill:#f3e5f5
style E4 fill:#fce4ec
style F3 fill:#e8f5e8
```
### Basic Installation Sequence
```mermaid
sequenceDiagram
participant User
participant PyPI
participant System
participant Playwright
participant Crawler
User->>PyPI: pip install crawl4ai
PyPI-->>User: Package downloaded
User->>System: crawl4ai-setup
System->>Playwright: Install browser binaries
Playwright-->>System: Chromium, Firefox installed
System-->>User: Setup complete
User->>System: crawl4ai-doctor
System->>System: Check Python version
System->>System: Verify Playwright installation
System->>System: Test browser launch
System-->>User: Diagnostics report
User->>Crawler: Basic crawl test
Crawler->>Playwright: Launch browser
Playwright-->>Crawler: Browser ready
Crawler->>Crawler: Navigate to test URL
Crawler-->>User: Success confirmation
```
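The final test crawl in this sequence corresponds to the standard quickstart; `result.markdown.raw_markdown` follows the CrawlResult structure shown later in this document:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        if result.success:
            print(result.markdown.raw_markdown[:300])
        else:
            print("Crawl failed:", result.error_message)

asyncio.run(main())
```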
### Docker Deployment Architecture
```mermaid
graph TB
subgraph "Host System"
A[Docker Engine] --> B[Crawl4AI Container]
C[.llm.env File] --> B
D[Port 11235] --> B
end
subgraph "Container Environment"
B --> E[FastAPI Server]
B --> F[Playwright Browsers]
B --> G[Python Runtime]
E --> H[/crawl Endpoint]
E --> I[/playground Interface]
E --> J[/health Monitoring]
E --> K[/metrics Prometheus]
F --> L[Chromium Browser]
F --> M[Firefox Browser]
F --> N[WebKit Browser]
end
subgraph "External Services"
O[OpenAI API] --> B
P[Anthropic API] --> B
Q[Local LLM Ollama] --> B
end
subgraph "Client Applications"
R[Python SDK] --> H
S[REST API Calls] --> H
T[Web Browser] --> I
U[Monitoring Tools] --> J
V[Prometheus] --> K
end
style B fill:#e3f2fd
style E fill:#f3e5f5
style F fill:#e8f5e8
style G fill:#fff3e0
```
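From a client's perspective, the container surface above can be exercised with plain HTTP. The `/health` and `/crawl` endpoints appear in the diagram; the exact `/crawl` payload shape used here is an assumption and may differ between server versions:

```python
import requests

# Assumes the container is running locally on the default port 11235
health = requests.get("http://localhost:11235/health", timeout=10)
print(health.status_code, health.json())

# Minimal crawl request against the /crawl endpoint (payload shape assumed)
resp = requests.post(
    "http://localhost:11235/crawl",
    json={"urls": ["https://example.com"]},
    timeout=120,
)
print(resp.json())
```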
### Advanced Features Installation Flow
```mermaid
stateDiagram-v2
[*] --> BasicInstall
BasicInstall --> FeatureChoice: crawl4ai installed
FeatureChoice --> TorchInstall: Need text clustering
FeatureChoice --> TransformerInstall: Need HuggingFace models
FeatureChoice --> AllInstall: Need everything
FeatureChoice --> Complete: Basic features sufficient
TorchInstall --> TorchSetup: pip install crawl4ai with torch
TransformerInstall --> TransformerSetup: pip install crawl4ai with transformer
AllInstall --> AllSetup: pip install crawl4ai with all
TorchSetup --> ModelDownload: crawl4ai-setup
TransformerSetup --> ModelDownload: crawl4ai-setup
AllSetup --> ModelDownload: crawl4ai-setup
ModelDownload --> PreDownload: crawl4ai-download-models
PreDownload --> Complete: All models cached
Complete --> Verification: crawl4ai-doctor
Verification --> [*]: Installation verified
note right of TorchInstall : PyTorch for semantic operations
note right of TransformerInstall : HuggingFace for LLM features
note right of AllInstall : Complete feature set
```
### Platform-Specific Installation Matrix
```mermaid
graph LR
subgraph "Installation Methods"
A[Python Package] --> A1[pip install]
B[Docker Image] --> B1[docker pull]
C[Source Build] --> C1[git clone + build]
D[Cloud Platform] --> D1[Colab/Kaggle]
end
subgraph "Operating Systems"
E[Linux x86_64]
F[Linux ARM64]
G[macOS Intel]
H[macOS Apple Silicon]
I[Windows x86_64]
end
subgraph "Feature Sets"
J[Basic crawling]
K[Text clustering torch]
L[LLM transformers]
M[All features]
end
A1 --> E
A1 --> F
A1 --> G
A1 --> H
A1 --> I
B1 --> E
B1 --> F
B1 --> G
B1 --> H
C1 --> E
C1 --> F
C1 --> G
C1 --> H
C1 --> I
D1 --> E
D1 --> I
E --> J
E --> K
E --> L
E --> M
F --> J
F --> K
F --> L
F --> M
G --> J
G --> K
G --> L
G --> M
H --> J
H --> K
H --> L
H --> M
I --> J
I --> K
I --> L
I --> M
style A1 fill:#e3f2fd
style B1 fill:#f3e5f5
style C1 fill:#e8f5e8
style D1 fill:#fff3e0
```
### Docker Multi-Stage Build Process
```mermaid
sequenceDiagram
participant Dev as Developer
participant Git as GitHub Repo
participant Docker as Docker Engine
participant Registry as Docker Hub
participant User as End User
Dev->>Git: Push code changes
Docker->>Git: Clone repository
Docker->>Docker: Stage 1 - Base Python image
Docker->>Docker: Stage 2 - Install dependencies
Docker->>Docker: Stage 3 - Install Playwright
Docker->>Docker: Stage 4 - Copy application code
Docker->>Docker: Stage 5 - Setup FastAPI server
Note over Docker: Multi-architecture build
Docker->>Docker: Build for linux/amd64
Docker->>Docker: Build for linux/arm64
Docker->>Registry: Push multi-arch manifest
Registry-->>Docker: Build complete
User->>Registry: docker pull unclecode/crawl4ai
Registry-->>User: Download appropriate architecture
User->>Docker: docker run with configuration
Docker->>Docker: Start container
Docker->>Docker: Initialize FastAPI server
Docker->>Docker: Setup Playwright browsers
Docker-->>User: Service ready on port 11235
```
### Installation Verification Workflow
```mermaid
flowchart TD
A[Installation Complete] --> B[Run crawl4ai-doctor]
B --> C{Python Version Check}
C -->|✓ 3.10+| D{Playwright Check}
C -->|✗ < 3.10| C1[Upgrade Python]
C1 --> D
D -->|✓ Installed| E{Browser Binaries}
D -->|✗ Missing| D1[Run crawl4ai-setup]
D1 --> E
E -->|✓ Available| F{Test Browser Launch}
E -->|✗ Missing| E1[playwright install]
E1 --> F
F -->|✓ Success| G[Test Basic Crawl]
F -->|✗ Failed| F1[Check system dependencies]
F1 --> F
G --> H{Crawl Test Result}
H -->|✓ Success| I[Installation Verified ✓]
H -->|✗ Failed| H1[Check network/permissions]
H1 --> G
I --> J[Ready for Production Use]
style I fill:#c8e6c9
style J fill:#e8f5e8
style C1 fill:#ffcdd2
style D1 fill:#fff3e0
style E1 fill:#fff3e0
style F1 fill:#ffcdd2
style H1 fill:#ffcdd2
```
### Resource Requirements by Installation Type
```mermaid
graph TD
subgraph "Basic Installation"
A1[Memory: 512MB]
A2[Disk: 2GB]
A3[CPU: 1 core]
A4[Network: Required for setup]
end
subgraph "Advanced Features torch"
B1[Memory: 2GB+]
B2[Disk: 5GB+]
B3[CPU: 2+ cores]
B4[GPU: Optional CUDA]
end
subgraph "All Features"
C1[Memory: 4GB+]
C2[Disk: 10GB+]
C3[CPU: 4+ cores]
C4[GPU: Recommended]
end
subgraph "Docker Deployment"
D1[Memory: 1GB+]
D2[Disk: 3GB+]
D3[CPU: 2+ cores]
D4[Ports: 11235]
D5[Shared Memory: 1GB]
end
style A1 fill:#e8f5e8
style B1 fill:#fff3e0
style C1 fill:#ffecb3
style D1 fill:#e3f2fd
```
**📖 Learn more:** [Installation Guide](https://docs.crawl4ai.com/core/installation/), [Docker Deployment](https://docs.crawl4ai.com/core/docker-deployment/), [System Requirements](https://docs.crawl4ai.com/core/installation/#prerequisites)

File diff suppressed because it is too large


@@ -0,0 +1,392 @@
## Multi-URL Crawling Workflows and Architecture
Visual representations of concurrent crawling patterns, resource management, and monitoring systems for handling multiple URLs efficiently.
### Multi-URL Processing Modes
```mermaid
flowchart TD
A[Multi-URL Crawling Request] --> B{Processing Mode?}
B -->|Batch Mode| C[Collect All URLs]
B -->|Streaming Mode| D[Process URLs Individually]
C --> C1[Queue All URLs]
C1 --> C2[Execute Concurrently]
C2 --> C3[Wait for All Completion]
C3 --> C4[Return Complete Results Array]
D --> D1[Queue URLs]
D1 --> D2[Start First Batch]
D2 --> D3[Yield Results as Available]
D3 --> D4{More URLs?}
D4 -->|Yes| D5[Start Next URLs]
D4 -->|No| D6[Stream Complete]
D5 --> D3
C4 --> E[Process Results]
D6 --> E
E --> F[Success/Failure Analysis]
F --> G[End]
style C fill:#e3f2fd
style D fill:#f3e5f5
style C4 fill:#c8e6c9
style D6 fill:#c8e6c9
```
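Both modes are driven through `arun_many`; a minimal sketch, assuming `stream` is toggled on `CrawlerRunConfig` as in the current API:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
    async with AsyncWebCrawler() as crawler:
        # Batch mode: block until every URL is done, get a complete list back
        results = await crawler.arun_many(urls, config=CrawlerRunConfig(stream=False))
        print(f"Batch: {sum(r.success for r in results)}/{len(results)} succeeded")

        # Streaming mode: yield each result as soon as it is available
        async for result in await crawler.arun_many(urls, config=CrawlerRunConfig(stream=True)):
            print("Streamed:", result.url, result.success)

asyncio.run(main())
```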
### Memory-Adaptive Dispatcher Flow
```mermaid
stateDiagram-v2
[*] --> Initializing
Initializing --> MonitoringMemory: Start dispatcher
MonitoringMemory --> CheckingMemory: Every check_interval
CheckingMemory --> MemoryOK: Memory < threshold
CheckingMemory --> MemoryHigh: Memory >= threshold
MemoryOK --> DispatchingTasks: Start new crawls
MemoryHigh --> WaitingForMemory: Pause dispatching
DispatchingTasks --> TaskRunning: Launch crawler
TaskRunning --> TaskCompleted: Crawl finished
TaskRunning --> TaskFailed: Crawl error
TaskCompleted --> MonitoringMemory: Update stats
TaskFailed --> MonitoringMemory: Update stats
WaitingForMemory --> CheckingMemory: Wait timeout
WaitingForMemory --> MonitoringMemory: Memory freed
note right of MemoryHigh: Prevents OOM crashes
note right of DispatchingTasks: Respects max_session_permit
note right of WaitingForMemory: Configurable timeout
```
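A sketch of wiring this dispatcher up, assuming the `MemoryAdaptiveDispatcher` and `RateLimiter` parameter names referenced in the state diagram's notes (`check_interval`, `max_session_permit`); import paths may vary between releases:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai import MemoryAdaptiveDispatcher, RateLimiter  # import path assumed

dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=70.0,   # pause new tasks above this usage
    check_interval=1.0,              # seconds between memory checks
    max_session_permit=10,           # hard cap on concurrent crawler sessions
    rate_limiter=RateLimiter(
        base_delay=(1.0, 3.0),       # random base delay per request
        max_delay=60.0,              # ceiling for exponential backoff
        max_retries=3,
    ),
)

async def main():
    urls = [f"https://example.com/page{i}" for i in range(100)]
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls, config=CrawlerRunConfig(), dispatcher=dispatcher)
        print(f"{sum(r.success for r in results)}/{len(results)} succeeded")

asyncio.run(main())
```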
### Concurrent Crawling Architecture
```mermaid
graph TB
subgraph "URL Queue Management"
A[URL Input List] --> B[URL Queue]
B --> C[Priority Scheduler]
C --> D[Batch Assignment]
end
subgraph "Dispatcher Layer"
E[Memory Adaptive Dispatcher]
F[Semaphore Dispatcher]
G[Rate Limiter]
H[Resource Monitor]
E --> I[Memory Checker]
F --> J[Concurrency Controller]
G --> K[Delay Calculator]
H --> L[System Stats]
end
subgraph "Crawler Pool"
M[Crawler Instance 1]
N[Crawler Instance 2]
O[Crawler Instance 3]
P[Crawler Instance N]
M --> Q[Browser Session 1]
N --> R[Browser Session 2]
O --> S[Browser Session 3]
P --> T[Browser Session N]
end
subgraph "Result Processing"
U[Result Collector]
V[Success Handler]
W[Error Handler]
X[Retry Queue]
Y[Final Results]
end
D --> E
D --> F
E --> M
F --> N
G --> O
H --> P
Q --> U
R --> U
S --> U
T --> U
U --> V
U --> W
W --> X
X --> B
V --> Y
style E fill:#e3f2fd
style F fill:#f3e5f5
style G fill:#e8f5e8
style H fill:#fff3e0
```
### Rate Limiting and Backoff Strategy
```mermaid
sequenceDiagram
participant C as Crawler
participant RL as Rate Limiter
participant S as Server
participant D as Dispatcher
C->>RL: Request to crawl URL
RL->>RL: Calculate delay
RL->>RL: Apply base delay (1-3s)
RL->>C: Delay applied
C->>S: HTTP Request
alt Success Response
S-->>C: 200 OK + Content
C->>RL: Report success
RL->>RL: Reset failure count
C->>D: Return successful result
else Rate Limited
S-->>C: 429 Too Many Requests
C->>RL: Report rate limit
RL->>RL: Exponential backoff
RL->>RL: Increase delay (up to max_delay)
RL->>C: Apply longer delay
C->>S: Retry request after delay
else Server Error
S-->>C: 503 Service Unavailable
C->>RL: Report server error
RL->>RL: Moderate backoff
RL->>C: Retry with backoff
else Max Retries Exceeded
RL->>C: Stop retrying
C->>D: Return failed result
end
```
### Large-Scale Crawling Workflow
```mermaid
flowchart TD
A[Load URL List 10k+ URLs] --> B[Initialize Dispatcher]
B --> C{Select Dispatcher Type}
C -->|Memory Constrained| D[Memory Adaptive]
C -->|Fixed Resources| E[Semaphore Based]
D --> F[Set Memory Threshold 70%]
E --> G[Set Concurrency Limit]
F --> H[Configure Monitoring]
G --> H
H --> I[Start Crawling Process]
I --> J[Monitor System Resources]
J --> K{Memory Usage?}
K -->|< Threshold| L[Continue Dispatching]
K -->|>= Threshold| M[Pause New Tasks]
L --> N[Process Results Stream]
M --> O[Wait for Memory]
O --> K
N --> P{Result Type?}
P -->|Success| Q[Save to Database]
P -->|Failure| R[Log Error]
Q --> S[Update Progress Counter]
R --> S
S --> T{More URLs?}
T -->|Yes| U[Get Next Batch]
T -->|No| V[Generate Final Report]
U --> L
V --> W[Analysis Complete]
style A fill:#e1f5fe
style D fill:#e8f5e8
style E fill:#f3e5f5
style V fill:#c8e6c9
style W fill:#a5d6a7
```
### Real-Time Monitoring Dashboard Flow
```mermaid
graph LR
subgraph "Data Collection"
A[Crawler Tasks] --> B[Performance Metrics]
A --> C[Memory Usage]
A --> D[Success/Failure Rates]
A --> E[Response Times]
end
subgraph "Monitor Processing"
F[CrawlerMonitor] --> G[Aggregate Statistics]
F --> H[Display Formatter]
F --> I[Update Scheduler]
end
subgraph "Display Modes"
J[DETAILED Mode]
K[AGGREGATED Mode]
J --> L[Individual Task Status]
J --> M[Task-Level Metrics]
K --> N[Summary Statistics]
K --> O[Overall Progress]
end
subgraph "Output Interface"
P[Console Display]
Q[Progress Bars]
R[Status Tables]
S[Real-time Updates]
end
B --> F
C --> F
D --> F
E --> F
G --> J
G --> K
H --> J
H --> K
I --> J
I --> K
L --> P
M --> Q
N --> R
O --> S
style F fill:#e3f2fd
style J fill:#f3e5f5
style K fill:#e8f5e8
```
### Error Handling and Recovery Pattern
```mermaid
stateDiagram-v2
[*] --> ProcessingURL
ProcessingURL --> CrawlAttempt: Start crawl
CrawlAttempt --> Success: HTTP 200
CrawlAttempt --> NetworkError: Connection failed
CrawlAttempt --> RateLimit: HTTP 429
CrawlAttempt --> ServerError: HTTP 5xx
CrawlAttempt --> Timeout: Request timeout
Success --> [*]: Return result
NetworkError --> RetryCheck: Check retry count
RateLimit --> BackoffWait: Apply exponential backoff
ServerError --> RetryCheck: Check retry count
Timeout --> RetryCheck: Check retry count
BackoffWait --> RetryCheck: After delay
RetryCheck --> CrawlAttempt: retries < max_retries
RetryCheck --> Failed: retries >= max_retries
Failed --> ErrorLog: Log failure details
ErrorLog --> [*]: Return failed result
note right of BackoffWait: Exponential backoff for rate limits
note right of RetryCheck: Configurable max_retries
note right of ErrorLog: Detailed error tracking
```
### Resource Management Timeline
```mermaid
gantt
title Multi-URL Crawling Resource Management
dateFormat X
axisFormat %s
section Memory Usage
Initialize Dispatcher :0, 1
Memory Monitoring :1, 10
Peak Usage Period :3, 7
Memory Cleanup :7, 9
section Task Execution
URL Queue Setup :0, 2
Batch 1 Processing :2, 5
Batch 2 Processing :4, 7
Batch 3 Processing :6, 9
Final Results :9, 10
section Rate Limiting
Normal Delays :2, 4
Backoff Period :4, 6
Recovery Period :6, 8
section Monitoring
System Health Check :0, 10
Progress Updates :1, 9
Performance Metrics :2, 8
```
### Concurrent Processing Performance Matrix
```mermaid
graph TD
subgraph "Input Factors"
A[Number of URLs]
B[Concurrency Level]
C[Memory Threshold]
D[Rate Limiting]
end
subgraph "Processing Characteristics"
A --> E[Low 1-100 URLs]
A --> F[Medium 100-1k URLs]
A --> G[High 1k-10k URLs]
A --> H[Very High 10k+ URLs]
B --> I[Conservative 1-5]
B --> J[Moderate 5-15]
B --> K[Aggressive 15-30]
C --> L[Strict 60-70%]
C --> M[Balanced 70-80%]
C --> N[Relaxed 80-90%]
end
subgraph "Recommended Configurations"
E --> O[Simple Semaphore]
F --> P[Memory Adaptive Basic]
G --> Q[Memory Adaptive Advanced]
H --> R[Memory Adaptive + Monitoring]
I --> O
J --> P
K --> Q
K --> R
L --> Q
M --> P
N --> O
end
style O fill:#c8e6c9
style P fill:#fff3e0
style Q fill:#ffecb3
style R fill:#ffcdd2
```
**📖 Learn more:** [Multi-URL Crawling Guide](https://docs.crawl4ai.com/advanced/multi-url-crawling/), [Dispatcher Configuration](https://docs.crawl4ai.com/advanced/crawl-dispatcher/), [Performance Optimization](https://docs.crawl4ai.com/advanced/multi-url-crawling/#performance-optimization)


@@ -0,0 +1,411 @@
## Simple Crawling Workflows and Data Flow
Visual representations of basic web crawling operations, configuration patterns, and result processing workflows.
### Basic Crawling Sequence
```mermaid
sequenceDiagram
participant User
participant Crawler as AsyncWebCrawler
participant Browser as Browser Instance
participant Page as Web Page
participant Processor as Content Processor
User->>Crawler: Create with BrowserConfig
Crawler->>Browser: Launch browser instance
Browser-->>Crawler: Browser ready
User->>Crawler: arun(url, CrawlerRunConfig)
Crawler->>Browser: Create new page/context
Browser->>Page: Navigate to URL
Page-->>Browser: Page loaded
Browser->>Processor: Extract raw HTML
Processor->>Processor: Clean HTML
Processor->>Processor: Generate markdown
Processor->>Processor: Extract media/links
Processor-->>Crawler: CrawlResult created
Crawler-->>User: Return CrawlResult
Note over User,Processor: All processing happens asynchronously
```
### Crawling Configuration Flow
```mermaid
flowchart TD
A[Start Crawling] --> B{Browser Config Set?}
B -->|No| B1[Use Default BrowserConfig]
B -->|Yes| B2[Custom BrowserConfig]
B1 --> C[Launch Browser]
B2 --> C
C --> D{Crawler Run Config Set?}
D -->|No| D1[Use Default CrawlerRunConfig]
D -->|Yes| D2[Custom CrawlerRunConfig]
D1 --> E[Navigate to URL]
D2 --> E
E --> F{Page Load Success?}
F -->|No| F1[Return Error Result]
F -->|Yes| G[Apply Content Filters]
G --> G1{excluded_tags set?}
G1 -->|Yes| G2[Remove specified tags]
G1 -->|No| G3[Keep all tags]
G2 --> G4{css_selector set?}
G3 --> G4
G4 -->|Yes| G5[Extract selected elements]
G4 -->|No| G6[Process full page]
G5 --> H[Generate Markdown]
G6 --> H
H --> H1{markdown_generator set?}
H1 -->|Yes| H2[Use custom generator]
H1 -->|No| H3[Use default generator]
H2 --> I[Extract Media and Links]
H3 --> I
I --> I1{process_iframes?}
I1 -->|Yes| I2[Include iframe content]
I1 -->|No| I3[Skip iframes]
I2 --> J[Create CrawlResult]
I3 --> J
J --> K[Return Result]
style A fill:#e1f5fe
style K fill:#c8e6c9
style F1 fill:#ffcdd2
```
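The branches above correspond one-to-one to `BrowserConfig` and `CrawlerRunConfig` fields; a minimal sketch:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_config = BrowserConfig(headless=True)
    run_config = CrawlerRunConfig(
        excluded_tags=["nav", "footer", "aside"],  # removed before markdown generation
        css_selector="main.content",               # process only the selected elements
        word_count_threshold=10,                   # drop very short text blocks
        process_iframes=True,                      # include iframe content
        exclude_external_links=True,               # clean the links dictionary
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://example.com", config=run_config)
        print(result.markdown.raw_markdown[:300])

asyncio.run(main())
```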
### CrawlResult Data Structure
```mermaid
graph TB
subgraph "CrawlResult Object"
A[CrawlResult] --> B[Basic Info]
A --> C[Content Variants]
A --> D[Extracted Data]
A --> E[Media Assets]
A --> F[Optional Outputs]
B --> B1[url: Final URL]
B --> B2[success: Boolean]
B --> B3[status_code: HTTP Status]
B --> B4[error_message: Error Details]
C --> C1[html: Raw HTML]
C --> C2[cleaned_html: Sanitized HTML]
C --> C3[markdown: MarkdownGenerationResult]
C3 --> C3A[raw_markdown: Basic conversion]
C3 --> C3B[markdown_with_citations: With references]
C3 --> C3C[fit_markdown: Filtered content]
C3 --> C3D[references_markdown: Citation list]
D --> D1[links: Internal/External]
D --> D2[media: Images/Videos/Audio]
D --> D3[metadata: Page info]
D --> D4[extracted_content: JSON data]
D --> D5[tables: Structured table data]
E --> E1[screenshot: Base64 image]
E --> E2[pdf: PDF bytes]
E --> E3[mhtml: Archive file]
E --> E4[downloaded_files: File paths]
F --> F1[session_id: Browser session]
F --> F2[ssl_certificate: Security info]
F --> F3[response_headers: HTTP headers]
F --> F4[network_requests: Traffic log]
F --> F5[console_messages: Browser logs]
end
style A fill:#e3f2fd
style C3 fill:#f3e5f5
style D5 fill:#e8f5e8
```
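Accessing the fields from this structure looks like the following; the names follow the diagram above:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        if not result.success:
            print("Crawl failed:", result.status_code, result.error_message)
            return
        print(result.url, result.status_code)
        print(result.markdown.raw_markdown[:200])            # basic conversion
        print(len(result.links.get("internal", [])), "internal links")
        print(len(result.media.get("images", [])), "images")
        print(result.metadata.get("title"))                  # page info

asyncio.run(main())
```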
### Content Processing Pipeline
```mermaid
flowchart LR
subgraph "Input Sources"
A1[Web URL]
A2[Raw HTML]
A3[Local File]
end
A1 --> B[Browser Navigation]
A2 --> C[Direct Processing]
A3 --> C
B --> D[Raw HTML Capture]
C --> D
D --> E{Content Filtering}
E --> E1[Remove Scripts/Styles]
E --> E2[Apply excluded_tags]
E --> E3[Apply css_selector]
E --> E4[Remove overlay elements]
E1 --> F[Cleaned HTML]
E2 --> F
E3 --> F
E4 --> F
F --> G{Markdown Generation}
G --> G1[HTML to Markdown]
G --> G2[Apply Content Filter]
G --> G3[Generate Citations]
G1 --> H[MarkdownGenerationResult]
G2 --> H
G3 --> H
F --> I{Media Extraction}
I --> I1[Find Images]
I --> I2[Find Videos/Audio]
I --> I3[Score Relevance]
I1 --> J[Media Dictionary]
I2 --> J
I3 --> J
F --> K{Link Extraction}
K --> K1[Internal Links]
K --> K2[External Links]
K --> K3[Apply Link Filters]
K1 --> L[Links Dictionary]
K2 --> L
K3 --> L
H --> M[Final CrawlResult]
J --> M
L --> M
style D fill:#e3f2fd
style F fill:#f3e5f5
style H fill:#e8f5e8
style M fill:#c8e6c9
```
### Table Extraction Workflow
```mermaid
stateDiagram-v2
[*] --> DetectTables
DetectTables --> ScoreTables: Find table elements
ScoreTables --> EvaluateThreshold: Calculate quality scores
EvaluateThreshold --> PassThreshold: score >= table_score_threshold
EvaluateThreshold --> RejectTable: score < threshold
PassThreshold --> ExtractHeaders: Parse table structure
ExtractHeaders --> ExtractRows: Get header cells
ExtractRows --> ExtractMetadata: Get data rows
ExtractMetadata --> CreateTableObject: Get caption/summary
CreateTableObject --> AddToResult: {headers, rows, caption, summary}
AddToResult --> [*]: Table extraction complete
RejectTable --> [*]: Table skipped
note right of ScoreTables : Factors: header presence, data density, structure quality
note right of EvaluateThreshold : Threshold 1-10, higher = stricter
```
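A sketch of tuning and reading table extraction; `result.tables` with `headers`/`rows`/`caption` keys is assumed here (some releases expose tables via `result.media` instead):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(table_score_threshold=7)  # 1-10, higher = stricter
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/data", config=config)
        for table in result.tables or []:  # attribute name assumed
            print(table.get("caption"), table.get("headers"))
            for row in table.get("rows", [])[:3]:
                print(row)

asyncio.run(main())
```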
### Error Handling Decision Tree
```mermaid
flowchart TD
A[Start Crawl] --> B[Navigate to URL]
B --> C{Navigation Success?}
C -->|Network Error| C1[Set error_message: Network failure]
C -->|Timeout| C2[Set error_message: Page timeout]
C -->|Invalid URL| C3[Set error_message: Invalid URL format]
C -->|Success| D[Process Page Content]
C1 --> E[success = False]
C2 --> E
C3 --> E
D --> F{Content Processing OK?}
F -->|Parser Error| F1[Set error_message: HTML parsing failed]
F -->|Memory Error| F2[Set error_message: Insufficient memory]
F -->|Success| G[Generate Outputs]
F1 --> E
F2 --> E
G --> H{Output Generation OK?}
H -->|Markdown Error| H1[Partial success with warnings]
H -->|Extraction Error| H2[Partial success with warnings]
H -->|Success| I[success = True]
H1 --> I
H2 --> I
E --> J[Return Failed CrawlResult]
I --> K[Return Successful CrawlResult]
J --> L[User Error Handling]
K --> M[User Result Processing]
L --> L1{Check error_message}
L1 -->|Network| L2[Retry with different config]
L1 -->|Timeout| L3[Increase page_timeout]
L1 -->|Parser| L4[Try different scraping_strategy]
style E fill:#ffcdd2
style I fill:#c8e6c9
style J fill:#ffcdd2
style K fill:#c8e6c9
```
### Configuration Impact Matrix
```mermaid
graph TB
subgraph "Configuration Categories"
A[Content Processing]
B[Page Interaction]
C[Output Generation]
D[Performance]
end
subgraph "Configuration Options"
A --> A1[word_count_threshold]
A --> A2[excluded_tags]
A --> A3[css_selector]
A --> A4[exclude_external_links]
B --> B1[process_iframes]
B --> B2[remove_overlay_elements]
B --> B3[scan_full_page]
B --> B4[wait_for]
C --> C1[screenshot]
C --> C2[pdf]
C --> C3[markdown_generator]
C --> C4[table_score_threshold]
D --> D1[cache_mode]
D --> D2[verbose]
D --> D3[page_timeout]
D --> D4[semaphore_count]
end
subgraph "Result Impact"
A1 --> R1[Filters short text blocks]
A2 --> R2[Removes specified HTML tags]
A3 --> R3[Focuses on selected content]
A4 --> R4[Cleans links dictionary]
B1 --> R5[Includes iframe content]
B2 --> R6[Removes popups/modals]
B3 --> R7[Loads dynamic content]
B4 --> R8[Waits for specific elements]
C1 --> R9[Adds screenshot field]
C2 --> R10[Adds pdf field]
C3 --> R11[Custom markdown processing]
C4 --> R12[Filters table quality]
D1 --> R13[Controls caching behavior]
D2 --> R14[Detailed logging output]
D3 --> R15[Prevents timeout errors]
D4 --> R16[Limits concurrent operations]
end
style A fill:#e3f2fd
style B fill:#f3e5f5
style C fill:#e8f5e8
style D fill:#fff3e0
```
### Raw HTML and Local File Processing
```mermaid
sequenceDiagram
participant User
participant Crawler
participant Processor
participant FileSystem
Note over User,FileSystem: Raw HTML Processing
User->>Crawler: arun("raw://html_content")
Crawler->>Processor: Parse raw HTML directly
Processor->>Processor: Apply same content filters
Processor-->>Crawler: Standard CrawlResult
Crawler-->>User: Result with markdown
Note over User,FileSystem: Local File Processing
User->>Crawler: arun("file:///path/to/file.html")
Crawler->>FileSystem: Read local file
FileSystem-->>Crawler: File content
Crawler->>Processor: Process file HTML
Processor->>Processor: Apply content processing
Processor-->>Crawler: Standard CrawlResult
Crawler-->>User: Result with markdown
Note over User,FileSystem: Both return identical CrawlResult structure
```
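Both input forms use the same `arun` call; a minimal sketch using the prefixes from the diagram (the file path is a placeholder):

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        # Raw HTML: prefix the content itself with raw://
        raw = await crawler.arun("raw://<html><body><h1>Test</h1></body></html>")
        # Local file: absolute path behind file://
        local = await crawler.arun("file:///path/to/file.html")
        # Both come back as the same CrawlResult structure
        print(raw.markdown.raw_markdown)
        print(local.success, local.markdown.raw_markdown[:100])

asyncio.run(main())
```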
### Comprehensive Processing Example Flow
```mermaid
flowchart TD
A[Input: example.com] --> B[Create Configurations]
B --> B1[BrowserConfig verbose=True]
B --> B2[CrawlerRunConfig with filters]
B1 --> C[Launch AsyncWebCrawler]
B2 --> C
C --> D[Navigate and Process]
D --> E{Check Success}
E -->|Failed| E1[Print Error Message]
E -->|Success| F[Extract Content Summary]
F --> F1[Get Page Title]
F --> F2[Get Content Preview]
F --> F3[Process Media Items]
F --> F4[Process Links]
F3 --> F3A[Count Images]
F3 --> F3B[Show First 3 Images]
F4 --> F4A[Count Internal Links]
F4 --> F4B[Show First 3 Links]
F1 --> G[Display Results]
F2 --> G
F3A --> G
F3B --> G
F4A --> G
F4B --> G
E1 --> H[End with Error]
G --> I[End with Success]
style E1 fill:#ffcdd2
style G fill:#c8e6c9
style H fill:#ffcdd2
style I fill:#c8e6c9
```
**📖 Learn more:** [Simple Crawling Guide](https://docs.crawl4ai.com/core/simple-crawling/), [Configuration Options](https://docs.crawl4ai.com/core/browser-crawler-config/), [Result Processing](https://docs.crawl4ai.com/core/crawler-result/), [Table Extraction](https://docs.crawl4ai.com/extraction/no-llm-strategies/)


@@ -0,0 +1,441 @@
## URL Seeding Workflows and Architecture
Visual representations of URL discovery strategies, filtering pipelines, and smart crawling workflows.
### URL Seeding vs Deep Crawling Strategy Comparison
```mermaid
graph TB
subgraph "Deep Crawling Approach"
A1[Start URL] --> A2[Load Page]
A2 --> A3[Extract Links]
A3 --> A4{More Links?}
A4 -->|Yes| A5[Queue Next Page]
A5 --> A2
A4 -->|No| A6[Complete]
A7[⏱️ Real-time Discovery]
A8[🐌 Sequential Processing]
A9[🔍 Limited by Page Structure]
A10[💾 High Memory Usage]
end
subgraph "URL Seeding Approach"
B1[Domain Input] --> B2[Query Sitemap]
B1 --> B3[Query Common Crawl]
B2 --> B4[Merge Results]
B3 --> B4
B4 --> B5[Apply Filters]
B5 --> B6[Score Relevance]
B6 --> B7[Rank Results]
B7 --> B8[Select Top URLs]
B9[⚡ Instant Discovery]
B10[🚀 Parallel Processing]
B11[🎯 Pattern-based Filtering]
B12[💡 Smart Relevance Scoring]
end
style A1 fill:#ffecb3
style B1 fill:#e8f5e8
style A6 fill:#ffcdd2
style B8 fill:#c8e6c9
```
### URL Discovery Data Flow
```mermaid
sequenceDiagram
participant User
participant Seeder as AsyncUrlSeeder
participant SM as Sitemap
participant CC as Common Crawl
participant Filter as URL Filter
participant Scorer as BM25 Scorer
User->>Seeder: urls("example.com", config)
par Parallel Data Sources
Seeder->>SM: Fetch sitemap.xml
SM-->>Seeder: 500 URLs
and
Seeder->>CC: Query Common Crawl
CC-->>Seeder: 2000 URLs
end
Seeder->>Seeder: Merge and deduplicate
Note over Seeder: 2200 unique URLs
Seeder->>Filter: Apply pattern filter
Filter-->>Seeder: 800 matching URLs
alt extract_head=True
loop For each URL
Seeder->>Seeder: Extract <head> metadata
end
Note over Seeder: Title, description, keywords
end
alt query provided
Seeder->>Scorer: Calculate relevance scores
Scorer-->>Seeder: Scored URLs
Seeder->>Seeder: Filter by score_threshold
Note over Seeder: 200 relevant URLs
end
Seeder->>Seeder: Sort by relevance
Seeder->>Seeder: Apply max_urls limit
Seeder-->>User: Top 100 URLs ready for crawling
```
### SeedingConfig Decision Tree
```mermaid
flowchart TD
A[SeedingConfig Setup] --> B{Data Source Strategy?}
B -->|Fast & Official| C[source="sitemap"]
B -->|Comprehensive| D[source="cc"]
B -->|Maximum Coverage| E[source="sitemap+cc"]
C --> F{Need Filtering?}
D --> F
E --> F
F -->|Yes| G[Set URL Pattern]
F -->|No| H[pattern="*"]
G --> I{Pattern Examples}
I --> I1[pattern="*/blog/*"]
I --> I2[pattern="*/docs/api/*"]
I --> I3[pattern="*.pdf"]
I --> I4[pattern="*/product/*"]
H --> J{Need Metadata?}
I1 --> J
I2 --> J
I3 --> J
I4 --> J
J -->|Yes| K[extract_head=True]
J -->|No| L[extract_head=False]
K --> M{Need Validation?}
L --> M
M -->|Yes| N[live_check=True]
M -->|No| O[live_check=False]
N --> P{Need Relevance Scoring?}
O --> P
P -->|Yes| Q[Set Query + BM25]
P -->|No| R[Skip Scoring]
Q --> S[query="search terms"]
S --> T[scoring_method="bm25"]
T --> U[score_threshold=0.3]
R --> V[Performance Tuning]
U --> V
V --> W[Set max_urls]
W --> X[Set concurrency]
X --> Y[Set hits_per_sec]
Y --> Z[Configuration Complete]
style A fill:#e3f2fd
style Z fill:#c8e6c9
style K fill:#fff3e0
style N fill:#fff3e0
style Q fill:#f3e5f5
```
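Putting the chosen options together, a hedged sketch of the resulting configuration, assuming the `AsyncUrlSeeder`/`SeedingConfig` names used throughout this section and that results are dicts carrying a `relevance_score` when a query is set:

```python
import asyncio
from crawl4ai import AsyncUrlSeeder, SeedingConfig  # names assumed from the current API

async def main():
    config = SeedingConfig(
        source="sitemap+cc",            # "Maximum Coverage" branch above
        pattern="*/blog/*",             # URL pattern filter
        extract_head=True,              # pull <head> metadata for scoring
        query="python async tutorial",  # enables relevance scoring
        scoring_method="bm25",
        score_threshold=0.3,
        max_urls=100,
    )
    async with AsyncUrlSeeder() as seeder:
        urls = await seeder.urls("example.com", config)
        for item in urls[:5]:
            print(item["url"], item.get("relevance_score"))

asyncio.run(main())
```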
### BM25 Relevance Scoring Pipeline
```mermaid
graph TB
subgraph "Text Corpus Preparation"
A1[URL Collection] --> A2[Extract Metadata]
A2 --> A3[Title + Description + Keywords]
A3 --> A4[Tokenize Text]
A4 --> A5[Remove Stop Words]
A5 --> A6[Create Document Corpus]
end
subgraph "BM25 Algorithm"
B1[Query Terms] --> B2[Term Frequency Calculation]
A6 --> B2
B2 --> B3[Inverse Document Frequency]
B3 --> B4[BM25 Score Calculation]
B4 --> B5[Score = Σ IDF × TF×(k1+1) / (TF + k1×(1-b+b×|d|/avgdl))]
end
subgraph "Scoring Results"
B5 --> C1[URL Relevance Scores]
C1 --> C2{Score ≥ Threshold?}
C2 -->|Yes| C3[Include in Results]
C2 -->|No| C4[Filter Out]
C3 --> C5[Sort by Score DESC]
C5 --> C6[Return Top URLs]
end
subgraph "Example Scores"
D1["python async tutorial" → 0.85]
D2["python documentation" → 0.72]
D3["javascript guide" → 0.23]
D4["contact us page" → 0.05]
end
style B5 fill:#e3f2fd
style C6 fill:#c8e6c9
style D1 fill:#c8e6c9
style D2 fill:#c8e6c9
style D3 fill:#ffecb3
style D4 fill:#ffcdd2
```
### Multi-Domain Discovery Architecture
```mermaid
graph TB
subgraph "Input Layer"
A1[Domain List]
A2[SeedingConfig]
A3[Query Terms]
end
subgraph "Discovery Engine"
B1[AsyncUrlSeeder]
B2[Parallel Workers]
B3[Rate Limiter]
B4[Memory Manager]
end
subgraph "Data Sources"
C1[Sitemap Fetcher]
C2[Common Crawl API]
C3[Live URL Checker]
C4[Metadata Extractor]
end
subgraph "Processing Pipeline"
D1[URL Deduplication]
D2[Pattern Filtering]
D3[Relevance Scoring]
D4[Quality Assessment]
end
subgraph "Output Layer"
E1[Scored URL Lists]
E2[Domain Statistics]
E3[Performance Metrics]
E4[Cache Storage]
end
A1 --> B1
A2 --> B1
A3 --> B1
B1 --> B2
B2 --> B3
B3 --> B4
B2 --> C1
B2 --> C2
B2 --> C3
B2 --> C4
C1 --> D1
C2 --> D1
C3 --> D2
C4 --> D3
D1 --> D2
D2 --> D3
D3 --> D4
D4 --> E1
B4 --> E2
B3 --> E3
D1 --> E4
style B1 fill:#e3f2fd
style D3 fill:#f3e5f5
style E1 fill:#c8e6c9
```
### Complete Discovery-to-Crawl Pipeline
```mermaid
stateDiagram-v2
[*] --> Discovery
Discovery --> SourceSelection: Configure data sources
SourceSelection --> Sitemap: source="sitemap"
SourceSelection --> CommonCrawl: source="cc"
SourceSelection --> Both: source="sitemap+cc"
Sitemap --> URLCollection
CommonCrawl --> URLCollection
Both --> URLCollection
URLCollection --> Filtering: Apply patterns
Filtering --> MetadataExtraction: extract_head=True
Filtering --> LiveValidation: extract_head=False
MetadataExtraction --> LiveValidation: live_check=True
MetadataExtraction --> RelevanceScoring: live_check=False
LiveValidation --> RelevanceScoring
RelevanceScoring --> ResultRanking: query provided
RelevanceScoring --> ResultLimiting: no query
ResultRanking --> ResultLimiting: apply score_threshold
ResultLimiting --> URLSelection: apply max_urls
URLSelection --> CrawlPreparation: URLs ready
CrawlPreparation --> CrawlExecution: AsyncWebCrawler
CrawlExecution --> StreamProcessing: stream=True
CrawlExecution --> BatchProcessing: stream=False
StreamProcessing --> [*]
BatchProcessing --> [*]
note right of Discovery : 🔍 Smart URL Discovery
note right of URLCollection : 📚 Merge & Deduplicate
note right of RelevanceScoring : 🎯 BM25 Algorithm
note right of CrawlExecution : 🕷️ High-Performance Crawling
```
### Performance Optimization Strategies
```mermaid
graph LR
subgraph "Input Optimization"
A1[Smart Source Selection] --> A2[Sitemap First]
A2 --> A3[Add CC if Needed]
A3 --> A4[Pattern Filtering Early]
end
subgraph "Processing Optimization"
B1[Parallel Workers] --> B2[Bounded Queues]
B2 --> B3[Rate Limiting]
B3 --> B4[Memory Management]
B4 --> B5[Lazy Evaluation]
end
subgraph "Output Optimization"
C1[Relevance Threshold] --> C2[Max URL Limits]
C2 --> C3[Caching Strategy]
C3 --> C4[Streaming Results]
end
subgraph "Performance Metrics"
D1[URLs/Second: 100-1000]
D2[Memory Usage: Bounded]
D3[Network Efficiency: 95%+]
D4[Cache Hit Rate: 80%+]
end
A4 --> B1
B5 --> C1
C4 --> D1
style A2 fill:#e8f5e8
style B2 fill:#e3f2fd
style C3 fill:#f3e5f5
style D3 fill:#c8e6c9
```
### URL Discovery vs Traditional Crawling Comparison
```mermaid
graph TB
subgraph "Traditional Approach"
T1[Start URL] --> T2[Crawl Page]
T2 --> T3[Extract Links]
T3 --> T4[Queue New URLs]
T4 --> T2
T5[❌ Time: Hours/Days]
T6[❌ Resource Heavy]
T7[❌ Depth Limited]
T8[❌ Discovery Bias]
end
subgraph "URL Seeding Approach"
S1[Domain Input] --> S2[Query All Sources]
S2 --> S3[Pattern Filter]
S3 --> S4[Relevance Score]
S4 --> S5[Select Best URLs]
S5 --> S6[Ready to Crawl]
S7[✅ Time: Seconds/Minutes]
S8[✅ Resource Efficient]
S9[✅ Complete Coverage]
S10[✅ Quality Focused]
end
subgraph "Use Case Decision Matrix"
U1[Small Sites < 1000 pages] --> U2[Use Deep Crawling]
U3[Large Sites > 10000 pages] --> U4[Use URL Seeding]
U5[Unknown Structure] --> U6[Start with Seeding]
U7[Real-time Discovery] --> U8[Use Deep Crawling]
U9[Quality over Quantity] --> U10[Use URL Seeding]
end
style S6 fill:#c8e6c9
style S7 fill:#c8e6c9
style S8 fill:#c8e6c9
style S9 fill:#c8e6c9
style S10 fill:#c8e6c9
style T5 fill:#ffcdd2
style T6 fill:#ffcdd2
style T7 fill:#ffcdd2
style T8 fill:#ffcdd2
```
### Data Source Characteristics and Selection
```mermaid
graph TB
subgraph "Sitemap Source"
SM1[📋 Official URL List]
SM2[⚡ Fast Response]
SM3[📅 Recently Updated]
SM4[🎯 High Quality URLs]
SM5[❌ May Miss Some Pages]
end
subgraph "Common Crawl Source"
CC1[🌐 Comprehensive Coverage]
CC2[📚 Historical Data]
CC3[🔍 Deep Discovery]
CC4[⏳ Slower Response]
CC5[🧹 May Include Noise]
end
subgraph "Combined Strategy"
CB1[🚀 Best of Both]
CB2[📊 Maximum Coverage]
CB3[✨ Automatic Deduplication]
CB4[⚖️ Balanced Performance]
end
subgraph "Selection Guidelines"
G1[Speed Critical → Sitemap Only]
G2[Coverage Critical → Common Crawl]
G3[Best Quality → Combined]
G4[Unknown Domain → Combined]
end
style SM2 fill:#c8e6c9
style SM4 fill:#c8e6c9
style CC1 fill:#e3f2fd
style CC3 fill:#e3f2fd
style CB1 fill:#f3e5f5
style CB3 fill:#f3e5f5
```
**📖 Learn more:** [URL Seeding Guide](https://docs.crawl4ai.com/core/url-seeding/), [Performance Optimization](https://docs.crawl4ai.com/advanced/optimization/), [Multi-URL Crawling](https://docs.crawl4ai.com/advanced/multi-url-crawling/)


@@ -0,0 +1,295 @@
## CLI & Identity-Based Browsing
Command-line interface for web crawling with persistent browser profiles, authentication, and identity management.
### Basic CLI Usage
```bash
# Simple crawling
crwl https://example.com
# Get markdown output
crwl https://example.com -o markdown
# JSON output with cache bypass
crwl https://example.com -o json --bypass-cache
# Verbose mode with specific browser settings
crwl https://example.com -b "headless=false,viewport_width=1280" -v
```
### Profile Management Commands
```bash
# Launch interactive profile manager
crwl profiles
# Create, list, and manage browser profiles
# This opens a menu where you can:
# 1. List existing profiles
# 2. Create new profile (opens browser for setup)
# 3. Delete profiles
# 4. Use profile to crawl a website
# Use a specific profile for crawling
crwl https://example.com -p my-profile-name
# Example workflow for authenticated sites:
# 1. Create profile and log in
crwl profiles # Select "Create new profile"
# 2. Use profile for crawling authenticated content
crwl https://site-requiring-login.com/dashboard -p my-profile-name
```
### CDP Browser Management
```bash
# Launch browser with CDP debugging (default port 9222)
crwl cdp
# Use specific profile and custom port
crwl cdp -p my-profile -P 9223
# Launch headless browser with CDP
crwl cdp --headless
# Launch in incognito mode (ignores profile)
crwl cdp --incognito
# Use custom user data directory
crwl cdp --user-data-dir ~/my-browser-data --port 9224
```
### Builtin Browser Management
```bash
# Start persistent browser instance
crwl browser start
# Check browser status
crwl browser status
# Open visible window to see the browser
crwl browser view --url https://example.com
# Stop the browser
crwl browser stop
# Restart with different options
crwl browser restart --browser-type chromium --port 9223 --no-headless
# Use builtin browser in crawling
crwl https://example.com -b "browser_mode=builtin"
```
### Authentication Workflow Examples
```bash
# Complete workflow for LinkedIn scraping
# 1. Create authenticated profile
crwl profiles
# Select "Create new profile" → log in to LinkedIn in the browser → press 'q' to save
# 2. Use profile for crawling
crwl https://linkedin.com/in/someone -p linkedin-profile -o markdown
# 3. Extract structured data with authentication
crwl https://linkedin.com/search/results/people/ \
-p linkedin-profile \
-j "Extract people profiles with names, titles, and companies" \
-b "headless=false"
# GitHub authenticated crawling
crwl profiles # Create github-profile
crwl https://github.com/settings/profile -p github-profile
# Twitter/X authenticated access
crwl profiles # Create twitter-profile
crwl https://twitter.com/home -p twitter-profile -o markdown
```
### Advanced CLI Configuration
```bash
# Complex crawling with multiple configs
crwl https://example.com \
-B browser.yml \
-C crawler.yml \
-e extract_llm.yml \
-s llm_schema.json \
-p my-auth-profile \
-o json \
-v
# Quick LLM extraction with authentication
crwl https://private-site.com/dashboard \
-p auth-profile \
-j "Extract user dashboard data including metrics and notifications" \
-b "headless=true,viewport_width=1920"
# Content filtering with authentication
crwl https://members-only-site.com \
-p member-profile \
-f filter_bm25.yml \
-c "css_selector=.member-content,scan_full_page=true" \
-o markdown-fit
```
### Configuration Files for Identity Browsing
```yaml
# browser_auth.yml
headless: false
use_managed_browser: true
user_data_dir: "/path/to/profile"
viewport_width: 1280
viewport_height: 720
simulate_user: true
override_navigator: true
# crawler_auth.yml
magic: true
remove_overlay_elements: true
simulate_user: true
wait_for: "css:.authenticated-content"
page_timeout: 60000
delay_before_return_html: 2
scan_full_page: true
```
### Global Configuration Management
```bash
# List all configuration settings
crwl config list
# Set default LLM provider
crwl config set DEFAULT_LLM_PROVIDER "anthropic/claude-3-sonnet"
crwl config set DEFAULT_LLM_PROVIDER_TOKEN "your-api-token"
# Set browser defaults
crwl config set BROWSER_HEADLESS false # Always show browser
crwl config set USER_AGENT_MODE random # Random user agents
# Enable verbose mode globally
crwl config set VERBOSE true
```
### Q&A with Authenticated Content
```bash
# Ask questions about authenticated content
crwl https://private-dashboard.com -p dashboard-profile \
-q "What are the key metrics shown in my dashboard?"
# Multiple questions workflow
crwl https://company-intranet.com -p work-profile -o markdown # View content
crwl https://company-intranet.com -p work-profile \
-q "Summarize this week's announcements"
crwl https://company-intranet.com -p work-profile \
-q "What are the upcoming deadlines?"
```
### Profile Creation Programmatically
```python
# Create profiles via Python API
import asyncio
from crawl4ai import BrowserProfiler
async def create_auth_profile():
    profiler = BrowserProfiler()

    # Create profile interactively (opens browser)
    profile_path = await profiler.create_profile("linkedin-auth")
    print(f"Profile created at: {profile_path}")

    # List all profiles
    profiles = profiler.list_profiles()
    for profile in profiles:
        print(f"Profile: {profile['name']} at {profile['path']}")

    # Use profile for crawling
    from crawl4ai import AsyncWebCrawler, BrowserConfig
    browser_config = BrowserConfig(
        headless=True,
        use_managed_browser=True,
        user_data_dir=profile_path
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://linkedin.com/feed")
        return result

# asyncio.run(create_auth_profile())
```
### Identity Browsing Best Practices
```bash
# 1. Create specific profiles for different sites
crwl profiles # Create "linkedin-work"
crwl profiles # Create "github-personal"
crwl profiles # Create "company-intranet"
# 2. Use descriptive profile names
crwl https://site1.com -p site1-admin-account
crwl https://site2.com -p site2-user-account
# 3. Combine with appropriate browser settings
crwl https://secure-site.com \
-p secure-profile \
-b "headless=false,simulate_user=true,magic=true" \
-c "wait_for=.logged-in-indicator,page_timeout=30000"
# 4. Test profile before automated crawling
crwl cdp -p test-profile # Manually verify login status
crwl https://test-url.com -p test-profile -v # Verbose test crawl
```
### Troubleshooting Authentication Issues
```bash
# Debug authentication problems
crwl https://auth-site.com -p auth-profile \
-b "headless=false,verbose=true" \
-c "verbose=true,page_timeout=60000" \
-v
# Check profile status
crwl profiles # List profiles and check creation dates
# Recreate problematic profiles
crwl profiles # Delete old profile, create new one
# Test with visible browser
crwl https://problem-site.com -p profile-name \
-b "headless=false" \
-c "delay_before_return_html=5"
```
### Common Use Cases
```bash
# Social media monitoring (after authentication)
crwl https://twitter.com/home -p twitter-monitor \
-j "Extract latest tweets with sentiment and engagement metrics"
# E-commerce competitor analysis (with account access)
crwl https://competitor-site.com/products -p competitor-account \
-j "Extract product prices, availability, and descriptions"
# Company dashboard monitoring
crwl https://company-dashboard.com -p work-profile \
-c "css_selector=.dashboard-content" \
-q "What alerts or notifications need attention?"
# Research data collection (authenticated access)
crwl https://research-platform.com/data -p research-profile \
-e extract_research.yml \
-s research_schema.json \
-o json
```
**📖 Learn more:** [Identity-Based Crawling Documentation](https://docs.crawl4ai.com/advanced/identity-based-crawling/), [Browser Profile Management](https://docs.crawl4ai.com/advanced/session-management/), [CLI Examples](https://docs.crawl4ai.com/core/cli/)

File diff suppressed because it is too large


@@ -0,0 +1,446 @@
## Deep Crawling Filters & Scorers
Advanced URL filtering and scoring strategies for intelligent deep crawling with performance optimization.
### URL Filters - Content and Domain Control
```python
from crawl4ai.deep_crawling.filters import (
    URLPatternFilter, DomainFilter, ContentTypeFilter,
    FilterChain, ContentRelevanceFilter, SEOFilter
)

# Pattern-based filtering
pattern_filter = URLPatternFilter(
    patterns=[
        "*.html",                       # HTML pages only
        "*/blog/*",                     # Blog posts
        "*/articles/*",                 # Article pages
        "*2024*",                       # Recent content
        "^https://example.com/docs/.*"  # Regex pattern
    ],
    use_glob=True,
    reverse=False  # False = include matching, True = exclude matching
)

# Domain filtering with subdomains
domain_filter = DomainFilter(
    allowed_domains=["example.com", "docs.example.com"],
    blocked_domains=["ads.example.com", "tracker.com"]
)

# Content type filtering
content_filter = ContentTypeFilter(
    allowed_types=["text/html", "application/pdf"],
    check_extension=True
)
# Apply individual filters
url = "https://example.com/blog/2024/article.html"
print(f"Pattern filter: {pattern_filter.apply(url)}")
print(f"Domain filter: {domain_filter.apply(url)}")
print(f"Content filter: {content_filter.apply(url)}")
```
### Filter Chaining - Combine Multiple Filters
```python
# Create filter chain for comprehensive filtering
filter_chain = FilterChain([
    DomainFilter(allowed_domains=["example.com"]),
    URLPatternFilter(patterns=["*/blog/*", "*/docs/*"]),
    ContentTypeFilter(allowed_types=["text/html"])
])

# Apply chain to URLs
urls = [
    "https://example.com/blog/post1.html",
    "https://spam.com/content.html",
    "https://example.com/blog/image.jpg",
    "https://example.com/docs/guide.html"
]

async def filter_urls(urls, filter_chain):
    filtered = []
    for url in urls:
        if await filter_chain.apply(url):
            filtered.append(url)
    return filtered

# Usage
filtered_urls = await filter_urls(urls, filter_chain)
print(f"Filtered URLs: {filtered_urls}")

# Check filter statistics
for filter_obj in filter_chain.filters:
    stats = filter_obj.stats
    print(f"{filter_obj.name}: {stats.passed_urls}/{stats.total_urls} passed")
```
### Advanced Content Filters
```python
# BM25-based content relevance filtering
relevance_filter = ContentRelevanceFilter(
    query="python machine learning tutorial",
    threshold=0.5,  # Minimum relevance score
    k1=1.2,         # TF saturation parameter
    b=0.75,         # Length normalization
    avgdl=1000      # Average document length
)

# SEO quality filtering
seo_filter = SEOFilter(
    threshold=0.65,  # Minimum SEO score
    keywords=["python", "tutorial", "guide"],
    weights={
        "title_length": 0.15,
        "title_kw": 0.18,
        "meta_description": 0.12,
        "canonical": 0.10,
        "robot_ok": 0.20,
        "schema_org": 0.10,
        "url_quality": 0.15
    }
)

# Apply advanced filters
url = "https://example.com/python-ml-tutorial"
relevance_score = await relevance_filter.apply(url)
seo_score = await seo_filter.apply(url)
print(f"Relevance: {relevance_score}, SEO: {seo_score}")
```
### URL Scorers - Quality and Relevance Scoring
```python
from crawl4ai.deep_crawling.scorers import (
    KeywordRelevanceScorer, PathDepthScorer, ContentTypeScorer,
    FreshnessScorer, DomainAuthorityScorer, CompositeScorer
)

# Keyword relevance scoring
keyword_scorer = KeywordRelevanceScorer(
    keywords=["python", "tutorial", "guide", "machine", "learning"],
    weight=1.0,
    case_sensitive=False
)

# Path depth scoring (optimal depth = 3)
depth_scorer = PathDepthScorer(
    optimal_depth=3,  # /category/subcategory/article
    weight=0.8
)

# Content type scoring
content_type_scorer = ContentTypeScorer(
    type_weights={
        "html": 1.0,  # Highest priority
        "pdf": 0.8,   # Medium priority
        "txt": 0.6,   # Lower priority
        "doc": 0.4    # Lowest priority
    },
    weight=0.9
)

# Freshness scoring
freshness_scorer = FreshnessScorer(
    weight=0.7,
    current_year=2024
)

# Domain authority scoring
domain_scorer = DomainAuthorityScorer(
    domain_weights={
        "python.org": 1.0,
        "github.com": 0.9,
        "stackoverflow.com": 0.85,
        "medium.com": 0.7,
        "personal-blog.com": 0.3
    },
    default_weight=0.5,
    weight=1.0
)

# Score individual URLs
url = "https://python.org/tutorial/2024/machine-learning.html"
scores = {
    "keyword": keyword_scorer.score(url),
    "depth": depth_scorer.score(url),
    "content": content_type_scorer.score(url),
    "freshness": freshness_scorer.score(url),
    "domain": domain_scorer.score(url)
}
print(f"Individual scores: {scores}")
```
### Composite Scoring - Combine Multiple Scorers
```python
# Create composite scorer combining all strategies
composite_scorer = CompositeScorer(
    scorers=[
        KeywordRelevanceScorer(["python", "tutorial"], weight=1.5),
        PathDepthScorer(optimal_depth=3, weight=1.0),
        ContentTypeScorer({"html": 1.0, "pdf": 0.8}, weight=1.2),
        FreshnessScorer(weight=0.8, current_year=2024),
        DomainAuthorityScorer({
            "python.org": 1.0,
            "github.com": 0.9
        }, weight=1.3)
    ],
    normalize=True  # Normalize by number of scorers
)

# Score multiple URLs
urls_to_score = [
    "https://python.org/tutorial/2024/basics.html",
    "https://github.com/user/python-guide/blob/main/README.md",
    "https://random-blog.com/old/2018/python-stuff.html",
    "https://python.org/docs/deep/nested/advanced/guide.html"
]

scored_urls = []
for url in urls_to_score:
    score = composite_scorer.score(url)
    scored_urls.append((url, score))

# Sort by score (highest first)
scored_urls.sort(key=lambda x: x[1], reverse=True)
for url, score in scored_urls:
    print(f"Score: {score:.3f} - {url}")

# Check scorer statistics
print("\nScoring statistics:")
print(f"URLs scored: {composite_scorer.stats._urls_scored}")
print(f"Average score: {composite_scorer.stats.get_average():.3f}")
```
### Advanced Filter Patterns
```python
# Complex pattern matching
advanced_patterns = URLPatternFilter(
    patterns=[
        r"^https://docs\.python\.org/\d+/",     # Python docs with version
        r".*/tutorial/.*\.html$",               # Tutorial pages
        r".*/guide/(?!deprecated).*",           # Guides but not deprecated
        "*/blog/{2020,2021,2022,2023,2024}/*",  # Recent blog posts
        "**/{api,reference}/**/*.html"          # API/reference docs
    ],
    use_glob=True
)

# Exclude patterns (reverse=True)
exclude_filter = URLPatternFilter(
    patterns=[
        "*/admin/*",
        "*/login/*",
        "*/private/*",
        "**/.*",                  # Hidden files
        "*.{jpg,png,gif,css,js}"  # Media and assets (glob, no regex anchor)
    ],
    reverse=True  # Exclude matching patterns
)

# Content type with extension mapping
detailed_content_filter = ContentTypeFilter(
    allowed_types=["text", "application"],
    check_extension=True,
    ext_map={
        "html": "text/html",
        "htm": "text/html",
        "md": "text/markdown",
        "pdf": "application/pdf",
        "doc": "application/msword",
        "docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    }
)
```
### Performance-Optimized Filtering
```python
# High-performance filter chain for large-scale crawling
class OptimizedFilterChain:
def __init__(self):
# Fast filters first (domain, patterns)
self.fast_filters = [
DomainFilter(
allowed_domains=["example.com", "docs.example.com"],
blocked_domains=["ads.example.com"]
),
URLPatternFilter([
"*.html", "*.pdf", "*/blog/*", "*/docs/*"
])
]
# Slower filters last (content analysis)
self.slow_filters = [
ContentRelevanceFilter(
query="important content",
threshold=0.3
)
]
async def apply_optimized(self, url: str) -> bool:
# Apply fast filters first
for filter_obj in self.fast_filters:
if not filter_obj.apply(url):
return False
# Only apply slow filters if fast filters pass
for filter_obj in self.slow_filters:
if not await filter_obj.apply(url):
return False
return True
# Batch filtering with concurrency
async def batch_filter_urls(urls, filter_chain, max_concurrent=50):
import asyncio
semaphore = asyncio.Semaphore(max_concurrent)
async def filter_single(url):
async with semaphore:
            return await filter_chain.apply_optimized(url), url
tasks = [filter_single(url) for url in urls]
results = await asyncio.gather(*tasks)
return [url for passed, url in results if passed]
# Usage with 1000 URLs
large_url_list = [f"https://example.com/page{i}.html" for i in range(1000)]
optimized_chain = OptimizedFilterChain()
filtered = await batch_filter_urls(large_url_list, optimized_chain)
```
### Custom Filter Implementation
```python
from crawl4ai.deep_crawling.filters import URLFilter
import re
class CustomLanguageFilter(URLFilter):
"""Filter URLs by language indicators"""
def __init__(self, allowed_languages=["en"], weight=1.0):
super().__init__()
self.allowed_languages = set(allowed_languages)
self.lang_patterns = {
"en": re.compile(r"/en/|/english/|lang=en"),
"es": re.compile(r"/es/|/spanish/|lang=es"),
"fr": re.compile(r"/fr/|/french/|lang=fr"),
"de": re.compile(r"/de/|/german/|lang=de")
}
def apply(self, url: str) -> bool:
# Default to English if no language indicators
if not any(pattern.search(url) for pattern in self.lang_patterns.values()):
result = "en" in self.allowed_languages
self._update_stats(result)
return result
# Check for allowed languages
for lang in self.allowed_languages:
if lang in self.lang_patterns:
if self.lang_patterns[lang].search(url):
self._update_stats(True)
return True
self._update_stats(False)
return False
# Custom scorer implementation
from crawl4ai.deep_crawling.scorers import URLScorer
class CustomComplexityScorer(URLScorer):
"""Score URLs by content complexity indicators"""
def __init__(self, weight=1.0):
super().__init__(weight)
self.complexity_indicators = {
"tutorial": 0.9,
"guide": 0.8,
"example": 0.7,
"reference": 0.6,
"api": 0.5
}
def _calculate_score(self, url: str) -> float:
url_lower = url.lower()
max_score = 0.0
for indicator, score in self.complexity_indicators.items():
if indicator in url_lower:
max_score = max(max_score, score)
return max_score
# Use custom filters and scorers
custom_filter = CustomLanguageFilter(allowed_languages=["en", "es"])
custom_scorer = CustomComplexityScorer(weight=1.2)
url = "https://example.com/en/tutorial/advanced-guide.html"
passes_filter = custom_filter.apply(url)
complexity_score = custom_scorer.score(url)
print(f"Passes language filter: {passes_filter}")
print(f"Complexity score: {complexity_score}")
```
### Integration with Deep Crawling
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
async def deep_crawl_with_filtering():
# Create comprehensive filter chain
filter_chain = FilterChain([
DomainFilter(allowed_domains=["python.org"]),
URLPatternFilter(["*/tutorial/*", "*/guide/*", "*/docs/*"]),
ContentTypeFilter(["text/html"]),
SEOFilter(threshold=0.6, keywords=["python", "programming"])
])
# Create composite scorer
scorer = CompositeScorer([
KeywordRelevanceScorer(["python", "tutorial"], weight=1.5),
FreshnessScorer(weight=0.8),
PathDepthScorer(optimal_depth=3, weight=1.0)
], normalize=True)
# Configure deep crawl strategy with filters and scorers
    deep_strategy = BFSDeepCrawlStrategy(
        max_depth=3,
        max_pages=100,
        filter_chain=filter_chain,
        url_scorer=scorer,
        score_threshold=0.6  # Only crawl URLs scoring above 0.6
    )
config = CrawlerRunConfig(
deep_crawl_strategy=deep_strategy,
cache_mode=CacheMode.BYPASS
)
    async with AsyncWebCrawler() as crawler:
        # Non-streaming deep crawls return a list of CrawlResult objects
        results = await crawler.arun(
            url="https://python.org",
            config=config
        )
        print(f"Deep crawl completed: {len(results)} pages crawled")
# Run the deep crawl
await deep_crawl_with_filtering()
```
**📖 Learn more:** [Deep Crawling Strategy](https://docs.crawl4ai.com/core/deep-crawling/), [Custom Filter Development](https://docs.crawl4ai.com/advanced/custom-filters/), [Performance Optimization](https://docs.crawl4ai.com/advanced/performance-tuning/)

View File

@@ -0,0 +1,348 @@
## Deep Crawling
Multi-level website exploration with intelligent filtering, scoring, and prioritization strategies.
### Basic Deep Crawl Setup
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
# Basic breadth-first deep crawling
async def basic_deep_crawl():
config = CrawlerRunConfig(
deep_crawl_strategy=BFSDeepCrawlStrategy(
max_depth=2, # Initial page + 2 levels
include_external=False # Stay within same domain
),
scraping_strategy=LXMLWebScrapingStrategy(),
verbose=True
)
async with AsyncWebCrawler() as crawler:
results = await crawler.arun("https://docs.crawl4ai.com", config=config)
# Group results by depth
pages_by_depth = {}
for result in results:
depth = result.metadata.get("depth", 0)
if depth not in pages_by_depth:
pages_by_depth[depth] = []
pages_by_depth[depth].append(result.url)
print(f"Crawled {len(results)} pages total")
for depth, urls in sorted(pages_by_depth.items()):
print(f"Depth {depth}: {len(urls)} pages")
```
### Deep Crawl Strategies
```python
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy, DFSDeepCrawlStrategy, BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer
# Breadth-First Search - explores all links at one depth before going deeper
bfs_strategy = BFSDeepCrawlStrategy(
max_depth=2,
include_external=False,
max_pages=50, # Limit total pages
score_threshold=0.3 # Minimum score for URLs
)
# Depth-First Search - explores as deep as possible before backtracking
dfs_strategy = DFSDeepCrawlStrategy(
max_depth=2,
include_external=False,
max_pages=30,
score_threshold=0.5
)
# Best-First - prioritizes highest scoring pages (recommended)
keyword_scorer = KeywordRelevanceScorer(
keywords=["crawl", "example", "async", "configuration"],
weight=0.7
)
best_first_strategy = BestFirstCrawlingStrategy(
max_depth=2,
include_external=False,
url_scorer=keyword_scorer,
max_pages=25 # No score_threshold needed - naturally prioritizes
)
# Usage
config = CrawlerRunConfig(
deep_crawl_strategy=best_first_strategy, # Choose your strategy
scraping_strategy=LXMLWebScrapingStrategy()
)
```
### Streaming vs Batch Processing
```python
# Batch mode - wait for all results
async def batch_deep_crawl():
config = CrawlerRunConfig(
deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
stream=False # Default - collect all results first
)
async with AsyncWebCrawler() as crawler:
results = await crawler.arun("https://example.com", config=config)
# Process all results at once
for result in results:
print(f"Batch processed: {result.url}")
# Streaming mode - process results as they arrive
async def streaming_deep_crawl():
config = CrawlerRunConfig(
deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
stream=True # Process results immediately
)
async with AsyncWebCrawler() as crawler:
async for result in await crawler.arun("https://example.com", config=config):
depth = result.metadata.get("depth", 0)
print(f"Stream processed depth {depth}: {result.url}")
```
### Filtering with Filter Chains
```python
from crawl4ai.deep_crawling.filters import (
FilterChain,
URLPatternFilter,
DomainFilter,
ContentTypeFilter,
SEOFilter,
ContentRelevanceFilter
)
# Single URL pattern filter
url_filter = URLPatternFilter(patterns=["*core*", "*guide*"])
config = CrawlerRunConfig(
deep_crawl_strategy=BFSDeepCrawlStrategy(
max_depth=1,
filter_chain=FilterChain([url_filter])
)
)
# Multiple filters in chain
advanced_filter_chain = FilterChain([
# Domain filtering
DomainFilter(
allowed_domains=["docs.example.com"],
blocked_domains=["old.docs.example.com", "staging.example.com"]
),
# URL pattern matching
URLPatternFilter(patterns=["*tutorial*", "*guide*", "*blog*"]),
# Content type filtering
ContentTypeFilter(allowed_types=["text/html"]),
# SEO quality filter
SEOFilter(
threshold=0.5,
keywords=["tutorial", "guide", "documentation"]
),
# Content relevance filter
ContentRelevanceFilter(
query="Web crawling and data extraction with Python",
threshold=0.7
)
])
config = CrawlerRunConfig(
deep_crawl_strategy=BFSDeepCrawlStrategy(
max_depth=2,
filter_chain=advanced_filter_chain
)
)
```
### Intelligent Crawling with Scorers
```python
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer
# Keyword relevance scoring
async def scored_deep_crawl():
keyword_scorer = KeywordRelevanceScorer(
keywords=["browser", "crawler", "web", "automation"],
weight=1.0
)
config = CrawlerRunConfig(
deep_crawl_strategy=BestFirstCrawlingStrategy(
max_depth=2,
include_external=False,
url_scorer=keyword_scorer
),
stream=True, # Recommended with BestFirst
verbose=True
)
async with AsyncWebCrawler() as crawler:
async for result in await crawler.arun("https://docs.crawl4ai.com", config=config):
score = result.metadata.get("score", 0)
depth = result.metadata.get("depth", 0)
print(f"Depth: {depth} | Score: {score:.2f} | {result.url}")
```
### Limiting Crawl Size
```python
# Max pages limitation across strategies
async def limited_crawls():
# BFS with page limit
bfs_config = CrawlerRunConfig(
deep_crawl_strategy=BFSDeepCrawlStrategy(
max_depth=2,
max_pages=5, # Only crawl 5 pages total
url_scorer=KeywordRelevanceScorer(keywords=["browser", "crawler"], weight=1.0)
)
)
# DFS with score threshold
dfs_config = CrawlerRunConfig(
deep_crawl_strategy=DFSDeepCrawlStrategy(
max_depth=2,
score_threshold=0.7, # Only URLs with scores above 0.7
max_pages=10,
url_scorer=KeywordRelevanceScorer(keywords=["web", "automation"], weight=1.0)
)
)
# Best-First with both constraints
bf_config = CrawlerRunConfig(
deep_crawl_strategy=BestFirstCrawlingStrategy(
max_depth=2,
max_pages=7, # Automatically gets highest scored pages
url_scorer=KeywordRelevanceScorer(keywords=["crawl", "example"], weight=1.0)
),
stream=True
)
async with AsyncWebCrawler() as crawler:
# Use any of the configs
async for result in await crawler.arun("https://docs.crawl4ai.com", config=bf_config):
score = result.metadata.get("score", 0)
print(f"Score: {score:.2f} | {result.url}")
```
### Complete Advanced Deep Crawler
```python
import time
from crawl4ai import CacheMode

async def comprehensive_deep_crawl():
# Sophisticated filter chain
filter_chain = FilterChain([
DomainFilter(
allowed_domains=["docs.crawl4ai.com"],
blocked_domains=["old.docs.crawl4ai.com"]
),
URLPatternFilter(patterns=["*core*", "*advanced*", "*blog*"]),
ContentTypeFilter(allowed_types=["text/html"]),
SEOFilter(threshold=0.4, keywords=["crawl", "tutorial", "guide"])
])
# Multi-keyword scorer
keyword_scorer = KeywordRelevanceScorer(
keywords=["crawl", "example", "async", "configuration", "browser"],
weight=0.8
)
# Complete configuration
config = CrawlerRunConfig(
deep_crawl_strategy=BestFirstCrawlingStrategy(
max_depth=2,
include_external=False,
filter_chain=filter_chain,
url_scorer=keyword_scorer,
max_pages=20
),
scraping_strategy=LXMLWebScrapingStrategy(),
stream=True,
verbose=True,
cache_mode=CacheMode.BYPASS
)
# Execute and analyze
results = []
start_time = time.time()
async with AsyncWebCrawler() as crawler:
async for result in await crawler.arun("https://docs.crawl4ai.com", config=config):
results.append(result)
score = result.metadata.get("score", 0)
depth = result.metadata.get("depth", 0)
print(f"→ Depth: {depth} | Score: {score:.2f} | {result.url}")
# Performance analysis
duration = time.time() - start_time
    avg_score = sum(r.metadata.get('score', 0) for r in results) / max(len(results), 1)
print(f"✅ Crawled {len(results)} pages in {duration:.2f}s")
print(f"✅ Average relevance score: {avg_score:.2f}")
# Depth distribution
depth_counts = {}
for result in results:
depth = result.metadata.get("depth", 0)
depth_counts[depth] = depth_counts.get(depth, 0) + 1
for depth, count in sorted(depth_counts.items()):
print(f"📊 Depth {depth}: {count} pages")
```
### Error Handling and Robustness
```python
async def robust_deep_crawl():
config = CrawlerRunConfig(
deep_crawl_strategy=BestFirstCrawlingStrategy(
max_depth=2,
max_pages=15,
url_scorer=KeywordRelevanceScorer(keywords=["guide", "tutorial"])
),
stream=True,
page_timeout=30000 # 30 second timeout per page
)
successful_pages = []
failed_pages = []
async with AsyncWebCrawler() as crawler:
async for result in await crawler.arun("https://docs.crawl4ai.com", config=config):
if result.success:
successful_pages.append(result)
depth = result.metadata.get("depth", 0)
score = result.metadata.get("score", 0)
print(f"✅ Depth {depth} | Score: {score:.2f} | {result.url}")
else:
failed_pages.append({
'url': result.url,
'error': result.error_message,
'depth': result.metadata.get("depth", 0)
})
print(f"❌ Failed: {result.url} - {result.error_message}")
print(f"📊 Results: {len(successful_pages)} successful, {len(failed_pages)} failed")
# Analyze failures by depth
if failed_pages:
failure_by_depth = {}
for failure in failed_pages:
depth = failure['depth']
failure_by_depth[depth] = failure_by_depth.get(depth, 0) + 1
print("❌ Failures by depth:")
for depth, count in sorted(failure_by_depth.items()):
print(f" Depth {depth}: {count} failures")
```
**📖 Learn more:** [Deep Crawling Guide](https://docs.crawl4ai.com/core/deep-crawling/), [Filter Documentation](https://docs.crawl4ai.com/core/content-selection/), [Scoring Strategies](https://docs.crawl4ai.com/advanced/advanced-features/)

View File

@@ -0,0 +1,826 @@
## Docker Deployment
Complete Docker deployment guide with pre-built images, API endpoints, configuration, and MCP integration.
### Quick Start with Pre-built Images
```bash
# Pull latest image
docker pull unclecode/crawl4ai:latest
# Setup LLM API keys
cat > .llm.env << EOL
OPENAI_API_KEY=sk-your-key
ANTHROPIC_API_KEY=your-anthropic-key
GROQ_API_KEY=your-groq-key
GEMINI_API_TOKEN=your-gemini-token
EOL
# Run with LLM support
docker run -d \
-p 11235:11235 \
--name crawl4ai \
--env-file .llm.env \
--shm-size=1g \
unclecode/crawl4ai:latest
# Basic run (no LLM)
docker run -d \
-p 11235:11235 \
--name crawl4ai \
--shm-size=1g \
unclecode/crawl4ai:latest
# Check health
curl http://localhost:11235/health
```
### Docker Compose Deployment
```bash
# Clone and setup
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
cp deploy/docker/.llm.env.example .llm.env
# Edit .llm.env with your API keys
# Run pre-built image
IMAGE=unclecode/crawl4ai:latest docker compose up -d
# Build locally
docker compose up --build -d
# Build with all features
INSTALL_TYPE=all docker compose up --build -d
# Build with GPU support
ENABLE_GPU=true docker compose up --build -d
# Stop service
docker compose down
```
### Manual Build with Multi-Architecture
```bash
# Clone repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
# Build for current architecture
docker buildx build -t crawl4ai-local:latest --load .
# Build for multiple architectures
docker buildx build --platform linux/amd64,linux/arm64 \
-t crawl4ai-local:latest --load .
# Build with specific features
docker buildx build \
--build-arg INSTALL_TYPE=all \
--build-arg ENABLE_GPU=false \
-t crawl4ai-local:latest --load .
# Run custom build
docker run -d \
-p 11235:11235 \
--name crawl4ai-custom \
--env-file .llm.env \
--shm-size=1g \
crawl4ai-local:latest
```
### Build Arguments
```bash
# Available build options
docker buildx build \
--build-arg INSTALL_TYPE=all \ # default|all|torch|transformer
--build-arg ENABLE_GPU=true \ # true|false
--build-arg APP_HOME=/app \ # Install path
--build-arg USE_LOCAL=true \ # Use local source
--build-arg GITHUB_REPO=url \ # Git repo if USE_LOCAL=false
--build-arg GITHUB_BRANCH=main \ # Git branch
-t crawl4ai-custom:latest --load .
```
### Core API Endpoints
```python
# Main crawling endpoints
import requests
import json
# Basic crawl
payload = {
"urls": ["https://example.com"],
"browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
"crawler_config": {"type": "CrawlerRunConfig", "params": {"cache_mode": "bypass"}}
}
response = requests.post("http://localhost:11235/crawl", json=payload)
# Streaming crawl
payload["crawler_config"]["params"]["stream"] = True
response = requests.post("http://localhost:11235/crawl/stream", json=payload)
# Health check
response = requests.get("http://localhost:11235/health")
# API schema
response = requests.get("http://localhost:11235/schema")
# Metrics (Prometheus format)
response = requests.get("http://localhost:11235/metrics")
```
### Specialized Endpoints
```python
# HTML extraction (preprocessed for schema)
response = requests.post("http://localhost:11235/html",
json={"url": "https://example.com"})
# Screenshot capture
response = requests.post("http://localhost:11235/screenshot", json={
"url": "https://example.com",
"screenshot_wait_for": 2,
"output_path": "/path/to/save/screenshot.png"
})
# PDF generation
response = requests.post("http://localhost:11235/pdf", json={
"url": "https://example.com",
"output_path": "/path/to/save/document.pdf"
})
# JavaScript execution
response = requests.post("http://localhost:11235/execute_js", json={
"url": "https://example.com",
"scripts": [
"return document.title",
"return Array.from(document.querySelectorAll('a')).map(a => a.href)"
]
})
# Markdown generation
response = requests.post("http://localhost:11235/md", json={
"url": "https://example.com",
"f": "fit", # raw|fit|bm25|llm
"q": "extract main content", # query for filtering
"c": "0" # cache: 0=bypass, 1=use
})
# LLM Q&A
response = requests.get("http://localhost:11235/llm/https://example.com?q=What is this page about?")
# Library context (for AI assistants)
response = requests.get("http://localhost:11235/ask", params={
"context_type": "all", # code|doc|all
"query": "how to use extraction strategies",
"score_ratio": 0.5,
"max_results": 20
})
```
### Python SDK Usage
```python
import asyncio
from crawl4ai.docker_client import Crawl4aiDockerClient
from crawl4ai import BrowserConfig, CrawlerRunConfig, CacheMode
async def main():
async with Crawl4aiDockerClient(base_url="http://localhost:11235") as client:
# Non-streaming crawl
results = await client.crawl(
["https://example.com"],
browser_config=BrowserConfig(headless=True),
crawler_config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
)
for result in results:
print(f"URL: {result.url}, Success: {result.success}")
print(f"Content length: {len(result.markdown)}")
# Streaming crawl
stream_config = CrawlerRunConfig(stream=True, cache_mode=CacheMode.BYPASS)
async for result in await client.crawl(
["https://example.com", "https://python.org"],
browser_config=BrowserConfig(headless=True),
crawler_config=stream_config
):
print(f"Streamed: {result.url} - {result.success}")
# Get API schema
schema = await client.get_schema()
print(f"Schema available: {bool(schema)}")
asyncio.run(main())
```
### Advanced API Configuration
```python
# Complex extraction with LLM
payload = {
"urls": ["https://example.com"],
"browser_config": {
"type": "BrowserConfig",
"params": {
"headless": True,
"viewport": {"type": "dict", "value": {"width": 1200, "height": 800}}
}
},
"crawler_config": {
"type": "CrawlerRunConfig",
"params": {
"extraction_strategy": {
"type": "LLMExtractionStrategy",
"params": {
"llm_config": {
"type": "LLMConfig",
"params": {
"provider": "openai/gpt-4o-mini",
"api_token": "env:OPENAI_API_KEY"
}
},
"schema": {
"type": "dict",
"value": {
"type": "object",
"properties": {
"title": {"type": "string"},
"content": {"type": "string"}
}
}
},
"instruction": "Extract title and main content"
}
},
"markdown_generator": {
"type": "DefaultMarkdownGenerator",
"params": {
"content_filter": {
"type": "PruningContentFilter",
"params": {"threshold": 0.6}
}
}
}
}
}
}
response = requests.post("http://localhost:11235/crawl", json=payload)
```
### CSS Extraction Strategy
```python
# CSS-based structured extraction
schema = {
"name": "ProductList",
"baseSelector": ".product",
"fields": [
{"name": "title", "selector": "h2", "type": "text"},
{"name": "price", "selector": ".price", "type": "text"},
{"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
]
}
payload = {
"urls": ["https://example-shop.com"],
"browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
"crawler_config": {
"type": "CrawlerRunConfig",
"params": {
"extraction_strategy": {
"type": "JsonCssExtractionStrategy",
"params": {
"schema": {"type": "dict", "value": schema}
}
}
}
}
}
response = requests.post("http://localhost:11235/crawl", json=payload)
data = response.json()
extracted = json.loads(data["results"][0]["extracted_content"])
```
### MCP (Model Context Protocol) Integration
```bash
# Add Crawl4AI as MCP provider to Claude Code
claude mcp add --transport sse c4ai-sse http://localhost:11235/mcp/sse
# List MCP providers
claude mcp list
# Test MCP connection
python tests/mcp/test_mcp_socket.py
# Available MCP endpoints
# SSE: http://localhost:11235/mcp/sse
# WebSocket: ws://localhost:11235/mcp/ws
# Schema: http://localhost:11235/mcp/schema
```
Available MCP tools:
- `md` - Generate markdown from web content
- `html` - Extract preprocessed HTML
- `screenshot` - Capture webpage screenshots
- `pdf` - Generate PDF documents
- `execute_js` - Run JavaScript on web pages
- `crawl` - Perform multi-URL crawling
- `ask` - Query Crawl4AI library context
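For programmatic access outside Claude, here is a minimal sketch that calls the `md` tool over the SSE endpoint using the official `mcp` Python SDK (`pip install mcp`); the `url` argument name is an assumption, so check `http://localhost:11235/mcp/schema` for exact tool signatures:
```python
import asyncio
from mcp import ClientSession
from mcp.client.sse import sse_client

async def call_md_tool():
    # Connect to the Crawl4AI MCP server over SSE
    async with sse_client("http://localhost:11235/mcp/sse") as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Discover the tools listed above (md, html, screenshot, ...)
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])
            # Invoke the markdown tool (argument name assumed)
            result = await session.call_tool("md", {"url": "https://example.com"})
            print(result.content)

asyncio.run(call_md_tool())
```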
### Configuration Management
```yaml
# config.yml structure
app:
title: "Crawl4AI API"
version: "1.0.0"
host: "0.0.0.0"
port: 11235
timeout_keep_alive: 300
llm:
provider: "openai/gpt-4o-mini"
api_key_env: "OPENAI_API_KEY"
security:
enabled: false
jwt_enabled: false
trusted_hosts: ["*"]
crawler:
memory_threshold_percent: 95.0
rate_limiter:
base_delay: [1.0, 2.0]
timeouts:
stream_init: 30.0
batch_process: 300.0
pool:
max_pages: 40
idle_ttl_sec: 1800
rate_limiting:
enabled: true
default_limit: "1000/minute"
storage_uri: "memory://"
logging:
level: "INFO"
format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
```
### Custom Configuration Deployment
```bash
# Method 1: Mount custom config
docker run -d -p 11235:11235 \
--name crawl4ai-custom \
--env-file .llm.env \
--shm-size=1g \
-v $(pwd)/my-config.yml:/app/config.yml \
unclecode/crawl4ai:latest
# Method 2: Build with custom config
# Edit deploy/docker/config.yml then build
docker buildx build -t crawl4ai-custom:latest --load .
```
### Monitoring and Health Checks
```bash
# Health endpoint
curl http://localhost:11235/health
# Prometheus metrics
curl http://localhost:11235/metrics
# Configuration validation
curl -X POST http://localhost:11235/config/dump \
-H "Content-Type: application/json" \
-d '{"code": "CrawlerRunConfig(cache_mode=\"BYPASS\", screenshot=True)"}'
```
### Playground Interface
Access the interactive playground at `http://localhost:11235/playground` for:
- Testing configurations with visual interface
- Generating JSON payloads for REST API
- Converting Python config to JSON format
- Testing crawl operations directly in browser
### Async Job Processing
```python
# Submit job for async processing
import time
# Submit crawl job
response = requests.post("http://localhost:11235/crawl/job", json=payload)
task_id = response.json()["task_id"]
# Poll for completion
while True:
result = requests.get(f"http://localhost:11235/crawl/job/{task_id}")
status = result.json()
if status["status"] in ["COMPLETED", "FAILED"]:
break
time.sleep(1.5)
print("Final result:", status)
```
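If you are polling from an async application, the same submit-and-poll loop can run without blocking the event loop. A sketch using `httpx` as the async HTTP client (an assumption; any async client works):
```python
import asyncio
import httpx

async def crawl_job_async(payload: dict) -> dict:
    async with httpx.AsyncClient(base_url="http://localhost:11235") as client:
        # Submit the crawl job
        resp = await client.post("/crawl/job", json=payload)
        task_id = resp.json()["task_id"]
        # Poll without blocking other coroutines
        while True:
            status = (await client.get(f"/crawl/job/{task_id}")).json()
            if status["status"] in ("COMPLETED", "FAILED"):
                return status
            await asyncio.sleep(1.5)
```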
### Production Deployment
```bash
# Production-ready deployment
docker run -d \
--name crawl4ai-prod \
--restart unless-stopped \
-p 11235:11235 \
--env-file .llm.env \
--shm-size=2g \
--memory=8g \
--cpus=4 \
-v /path/to/custom-config.yml:/app/config.yml \
unclecode/crawl4ai:latest
```

With Docker Compose for production, use a `docker-compose.yml` like:

```yaml
version: '3.8'
services:
  crawl4ai:
    image: unclecode/crawl4ai:latest
    ports:
      - "11235:11235"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    volumes:
      - ./config.yml:/app/config.yml
    shm_size: 2g
    deploy:
      resources:
        limits:
          memory: 8G
          cpus: '4'
    restart: unless-stopped
```
### Configuration Validation and JSON Structure
```python
# Method 1: Create config objects and dump to see expected JSON structure
from crawl4ai import BrowserConfig, CrawlerRunConfig, LLMConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, LLMExtractionStrategy
import json
# Create browser config and see JSON structure
browser_config = BrowserConfig(
headless=True,
viewport_width=1280,
viewport_height=720,
proxy="http://user:pass@proxy:8080"
)
# Get JSON structure
browser_json = browser_config.dump()
print("BrowserConfig JSON structure:")
print(json.dumps(browser_json, indent=2))
# Create crawler config with extraction strategy
schema = {
"name": "Articles",
"baseSelector": ".article",
"fields": [
{"name": "title", "selector": "h2", "type": "text"},
{"name": "content", "selector": ".content", "type": "html"}
]
}
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
screenshot=True,
extraction_strategy=JsonCssExtractionStrategy(schema),
js_code=["window.scrollTo(0, document.body.scrollHeight);"],
wait_for="css:.loaded"
)
crawler_json = crawler_config.dump()
print("\nCrawlerRunConfig JSON structure:")
print(json.dumps(crawler_json, indent=2))
```
### Reverse Validation - JSON to Objects
```python
# Method 2: Load JSON back to config objects for validation
from crawl4ai.async_configs import from_serializable_dict
# Test JSON structure by converting back to objects
test_browser_json = {
"type": "BrowserConfig",
"params": {
"headless": True,
"viewport_width": 1280,
"proxy": "http://user:pass@proxy:8080"
}
}
try:
# Convert JSON back to object
restored_browser = from_serializable_dict(test_browser_json)
print(f"✅ Valid BrowserConfig: {type(restored_browser)}")
print(f"Headless: {restored_browser.headless}")
print(f"Proxy: {restored_browser.proxy}")
except Exception as e:
print(f"❌ Invalid BrowserConfig JSON: {e}")
# Test complex crawler config JSON
test_crawler_json = {
"type": "CrawlerRunConfig",
"params": {
"cache_mode": "bypass",
"screenshot": True,
"extraction_strategy": {
"type": "JsonCssExtractionStrategy",
"params": {
"schema": {
"type": "dict",
"value": {
"name": "Products",
"baseSelector": ".product",
"fields": [
{"name": "title", "selector": "h3", "type": "text"}
]
}
}
}
}
}
}
try:
restored_crawler = from_serializable_dict(test_crawler_json)
print(f"✅ Valid CrawlerRunConfig: {type(restored_crawler)}")
print(f"Cache mode: {restored_crawler.cache_mode}")
print(f"Has extraction strategy: {restored_crawler.extraction_strategy is not None}")
except Exception as e:
print(f"❌ Invalid CrawlerRunConfig JSON: {e}")
```
### Using Server's /config/dump Endpoint for Validation
```python
import requests
# Method 3: Use server endpoint to validate configuration syntax
def validate_config_with_server(config_code: str) -> dict:
"""Validate configuration using server's /config/dump endpoint"""
response = requests.post(
"http://localhost:11235/config/dump",
json={"code": config_code}
)
if response.status_code == 200:
print("✅ Valid configuration syntax")
return response.json()
else:
print(f"❌ Invalid configuration: {response.status_code}")
print(response.json())
return None
# Test valid configuration
valid_config = """
CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
screenshot=True,
js_code=["window.scrollTo(0, document.body.scrollHeight);"],
wait_for="css:.content-loaded"
)
"""
result = validate_config_with_server(valid_config)
if result:
print("Generated JSON structure:")
print(json.dumps(result, indent=2))
# Test invalid configuration (should fail)
invalid_config = """
CrawlerRunConfig(
cache_mode="invalid_mode",
screenshot=True,
js_code=some_function() # This will fail
)
"""
validate_config_with_server(invalid_config)
```
### Configuration Builder Helper
```python
def build_and_validate_request(urls, browser_params=None, crawler_params=None):
"""Helper to build and validate complete request payload"""
# Create configurations
browser_config = BrowserConfig(**(browser_params or {}))
crawler_config = CrawlerRunConfig(**(crawler_params or {}))
# Build complete request payload
payload = {
"urls": urls if isinstance(urls, list) else [urls],
"browser_config": browser_config.dump(),
"crawler_config": crawler_config.dump()
}
print("✅ Complete request payload:")
print(json.dumps(payload, indent=2))
# Validate by attempting to reconstruct
try:
test_browser = from_serializable_dict(payload["browser_config"])
test_crawler = from_serializable_dict(payload["crawler_config"])
print("✅ Payload validation successful")
return payload
except Exception as e:
print(f"❌ Payload validation failed: {e}")
return None
# Example usage
payload = build_and_validate_request(
urls=["https://example.com"],
browser_params={"headless": True, "viewport_width": 1280},
crawler_params={
"cache_mode": CacheMode.BYPASS,
"screenshot": True,
"word_count_threshold": 10
}
)
if payload:
# Send to server
response = requests.post("http://localhost:11235/crawl", json=payload)
print(f"Server response: {response.status_code}")
```
### Common JSON Structure Patterns
```python
# Pattern 1: Simple primitive values
simple_config = {
"type": "CrawlerRunConfig",
"params": {
"cache_mode": "bypass", # String enum value
"screenshot": True, # Boolean
"page_timeout": 60000 # Integer
}
}
# Pattern 2: Nested objects
nested_config = {
"type": "CrawlerRunConfig",
"params": {
"extraction_strategy": {
"type": "LLMExtractionStrategy",
"params": {
"llm_config": {
"type": "LLMConfig",
"params": {
"provider": "openai/gpt-4o-mini",
"api_token": "env:OPENAI_API_KEY"
}
},
"instruction": "Extract main content"
}
}
}
}
# Pattern 3: Dictionary values (must use type: dict wrapper)
dict_config = {
"type": "CrawlerRunConfig",
"params": {
"extraction_strategy": {
"type": "JsonCssExtractionStrategy",
"params": {
"schema": {
"type": "dict", # Required wrapper
"value": { # Actual dictionary content
"name": "Products",
"baseSelector": ".product",
"fields": [
{"name": "title", "selector": "h2", "type": "text"}
]
}
}
}
}
}
}
# Pattern 4: Lists and arrays
list_config = {
"type": "CrawlerRunConfig",
"params": {
"js_code": [ # Lists are handled directly
"window.scrollTo(0, document.body.scrollHeight);",
"document.querySelector('.load-more')?.click();"
],
"excluded_tags": ["script", "style", "nav"]
}
}
```
### Troubleshooting Common JSON Errors
```python
def diagnose_json_errors():
"""Common JSON structure errors and fixes"""
# ❌ WRONG: Missing type wrapper for objects
wrong_config = {
"browser_config": {
"headless": True # Missing type wrapper
}
}
# ✅ CORRECT: Proper type wrapper
correct_config = {
"browser_config": {
"type": "BrowserConfig",
"params": {
"headless": True
}
}
}
# ❌ WRONG: Dictionary without type: dict wrapper
wrong_dict = {
"schema": {
"name": "Products" # Raw dict, should be wrapped
}
}
# ✅ CORRECT: Dictionary with proper wrapper
correct_dict = {
"schema": {
"type": "dict",
"value": {
"name": "Products"
}
}
}
# ❌ WRONG: Invalid enum string
wrong_enum = {
"cache_mode": "DISABLED" # Wrong case/value
}
# ✅ CORRECT: Valid enum string
correct_enum = {
"cache_mode": "bypass" # or "enabled", "disabled", etc.
}
print("Common error patterns documented above")
# Validate your JSON structure before sending
def pre_flight_check(payload):
"""Run checks before sending to server"""
required_keys = ["urls", "browser_config", "crawler_config"]
for key in required_keys:
if key not in payload:
print(f"❌ Missing required key: {key}")
return False
# Check type wrappers
for config_key in ["browser_config", "crawler_config"]:
config = payload[config_key]
if not isinstance(config, dict) or "type" not in config:
print(f"❌ {config_key} missing type wrapper")
return False
if "params" not in config:
print(f"❌ {config_key} missing params")
return False
print("✅ Pre-flight check passed")
return True
# Example usage
payload = {
"urls": ["https://example.com"],
"browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
"crawler_config": {"type": "CrawlerRunConfig", "params": {"cache_mode": "bypass"}}
}
if pre_flight_check(payload):
# Safe to send to server
pass
```
**📖 Learn more:** [Complete Docker Guide](https://docs.crawl4ai.com/core/docker-deployment/), [API Reference](https://docs.crawl4ai.com/api/), [MCP Integration](https://docs.crawl4ai.com/core/docker-deployment/#mcp-model-context-protocol-support), [Configuration Options](https://docs.crawl4ai.com/core/docker-deployment/#server-configuration)

View File

@@ -0,0 +1,788 @@
## Extraction Strategies
Powerful data extraction from web pages using LLM-based intelligent parsing or fast schema/pattern-based approaches.
### LLM-Based Extraction - Intelligent Content Understanding
```python
import os
import asyncio
import json
from pydantic import BaseModel, Field
from typing import List
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, LLMConfig, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy
# Define structured data model
class Product(BaseModel):
name: str = Field(description="Product name")
price: str = Field(description="Product price")
description: str = Field(description="Product description")
features: List[str] = Field(description="List of product features")
rating: float = Field(description="Product rating out of 5")
# Configure LLM provider
llm_config = LLMConfig(
provider="openai/gpt-4o-mini", # or "ollama/llama3.3", "anthropic/claude-3-5-sonnet"
api_token=os.getenv("OPENAI_API_KEY"), # or "env:OPENAI_API_KEY"
temperature=0.1,
max_tokens=2000
)
# Create LLM extraction strategy
llm_strategy = LLMExtractionStrategy(
llm_config=llm_config,
schema=Product.model_json_schema(),
extraction_type="schema", # or "block" for freeform text
instruction="""
Extract product information from the webpage content.
Focus on finding complete product details including:
- Product name and price
- Detailed description
- All listed features
- Customer rating if available
Return valid JSON array of products.
""",
chunk_token_threshold=1200, # Split content if too large
overlap_rate=0.1, # 10% overlap between chunks
apply_chunking=True, # Enable automatic chunking
input_format="markdown", # "html", "fit_markdown", or "markdown"
extra_args={"temperature": 0.0, "max_tokens": 800},
verbose=True
)
async def extract_with_llm():
browser_config = BrowserConfig(headless=True)
crawl_config = CrawlerRunConfig(
extraction_strategy=llm_strategy,
cache_mode=CacheMode.BYPASS,
word_count_threshold=10
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="https://example.com/products",
config=crawl_config
)
if result.success:
# Parse extracted JSON
products = json.loads(result.extracted_content)
print(f"Extracted {len(products)} products")
for product in products[:3]: # Show first 3
print(f"Product: {product['name']}")
print(f"Price: {product['price']}")
print(f"Rating: {product.get('rating', 'N/A')}")
# Show token usage and cost
llm_strategy.show_usage()
else:
print(f"Extraction failed: {result.error_message}")
asyncio.run(extract_with_llm())
```
### LLM Strategy Advanced Configuration
```python
# Multiple provider configurations
providers = {
"openai": LLMConfig(
provider="openai/gpt-4o",
api_token="env:OPENAI_API_KEY",
temperature=0.1
),
"anthropic": LLMConfig(
provider="anthropic/claude-3-5-sonnet-20240620",
api_token="env:ANTHROPIC_API_KEY",
max_tokens=4000
),
"ollama": LLMConfig(
provider="ollama/llama3.3",
api_token=None, # Not needed for Ollama
base_url="http://localhost:11434"
),
"groq": LLMConfig(
provider="groq/llama3-70b-8192",
api_token="env:GROQ_API_KEY"
)
}
# Advanced chunking for large content
large_content_strategy = LLMExtractionStrategy(
llm_config=providers["openai"],
schema=YourModel.model_json_schema(),
extraction_type="schema",
instruction="Extract detailed information...",
# Chunking parameters
chunk_token_threshold=2000, # Larger chunks for complex content
overlap_rate=0.15, # More overlap for context preservation
apply_chunking=True,
# Input format selection
input_format="fit_markdown", # Use filtered content if available
# LLM parameters
extra_args={
"temperature": 0.0, # Deterministic output
"top_p": 0.9,
"frequency_penalty": 0.1,
"presence_penalty": 0.1,
"max_tokens": 1500
},
verbose=True
)
# Knowledge graph extraction
class Entity(BaseModel):
name: str
type: str # "person", "organization", "location", etc.
description: str
class Relationship(BaseModel):
source: str
target: str
relationship: str
confidence: float
class KnowledgeGraph(BaseModel):
entities: List[Entity]
relationships: List[Relationship]
summary: str
knowledge_strategy = LLMExtractionStrategy(
llm_config=providers["anthropic"],
schema=KnowledgeGraph.model_json_schema(),
extraction_type="schema",
instruction="""
Create a knowledge graph from the content by:
1. Identifying key entities (people, organizations, locations, concepts)
2. Finding relationships between entities
3. Providing confidence scores for relationships
4. Summarizing the main topics
""",
input_format="html", # Use HTML for better structure preservation
apply_chunking=True,
chunk_token_threshold=1500
)
```
### JSON CSS Extraction - Fast Schema-Based Extraction
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
# Basic CSS extraction schema
simple_schema = {
"name": "Product Listings",
"baseSelector": "div.product-card",
"fields": [
{
"name": "title",
"selector": "h2.product-title",
"type": "text"
},
{
"name": "price",
"selector": ".price",
"type": "text"
},
{
"name": "image_url",
"selector": "img.product-image",
"type": "attribute",
"attribute": "src"
},
{
"name": "product_url",
"selector": "a.product-link",
"type": "attribute",
"attribute": "href"
}
]
}
# Complex nested schema with multiple data types
complex_schema = {
"name": "E-commerce Product Catalog",
"baseSelector": "div.category",
"baseFields": [
{
"name": "category_id",
"type": "attribute",
"attribute": "data-category-id"
},
{
"name": "category_url",
"type": "attribute",
"attribute": "data-url"
}
],
"fields": [
{
"name": "category_name",
"selector": "h2.category-title",
"type": "text"
},
{
"name": "products",
"selector": "div.product",
"type": "nested_list", # Array of complex objects
"fields": [
{
"name": "name",
"selector": "h3.product-name",
"type": "text",
"default": "Unknown Product"
},
{
"name": "price",
"selector": "span.price",
"type": "text"
},
{
"name": "details",
"selector": "div.product-details",
"type": "nested", # Single complex object
"fields": [
{
"name": "brand",
"selector": "span.brand",
"type": "text"
},
{
"name": "model",
"selector": "span.model",
"type": "text"
},
{
"name": "specs",
"selector": "div.specifications",
"type": "html" # Preserve HTML structure
}
]
},
{
"name": "features",
"selector": "ul.features li",
"type": "list", # Simple array of strings
"fields": [
{"name": "feature", "type": "text"}
]
},
{
"name": "reviews",
"selector": "div.review",
"type": "nested_list",
"fields": [
{
"name": "reviewer",
"selector": "span.reviewer-name",
"type": "text"
},
{
"name": "rating",
"selector": "span.rating",
"type": "attribute",
"attribute": "data-rating"
},
{
"name": "comment",
"selector": "p.review-text",
"type": "text"
},
{
"name": "date",
"selector": "time.review-date",
"type": "attribute",
"attribute": "datetime"
}
]
}
]
}
]
}
async def extract_with_css_schema():
strategy = JsonCssExtractionStrategy(complex_schema, verbose=True)
config = CrawlerRunConfig(
extraction_strategy=strategy,
cache_mode=CacheMode.BYPASS,
# Enable dynamic content loading if needed
js_code="window.scrollTo(0, document.body.scrollHeight);",
wait_for="css:.product:nth-child(10)", # Wait for products to load
process_iframes=True
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com/catalog",
config=config
)
if result.success:
data = json.loads(result.extracted_content)
print(f"Extracted {len(data)} categories")
for category in data:
print(f"Category: {category['category_name']}")
print(f"Products: {len(category.get('products', []))}")
# Show first product details
if category.get('products'):
product = category['products'][0]
print(f" First product: {product.get('name')}")
print(f" Features: {len(product.get('features', []))}")
print(f" Reviews: {len(product.get('reviews', []))}")
asyncio.run(extract_with_css_schema())
```
### Automatic Schema Generation - One-Time LLM, Unlimited Use
```python
import json
import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
async def generate_and_use_schema():
"""
1. Use LLM once to generate schema from sample HTML
2. Cache the schema for reuse
3. Use cached schema for fast extraction without LLM calls
"""
cache_dir = Path("./schema_cache")
cache_dir.mkdir(exist_ok=True)
schema_file = cache_dir / "ecommerce_schema.json"
# Step 1: Generate or load cached schema
if schema_file.exists():
schema = json.load(schema_file.open())
print("Using cached schema")
else:
print("Generating schema using LLM...")
# Configure LLM for schema generation
llm_config = LLMConfig(
provider="openai/gpt-4o", # or "ollama/llama3.3" for local
api_token="env:OPENAI_API_KEY"
)
# Get sample HTML from target site
async with AsyncWebCrawler() as crawler:
sample_result = await crawler.arun(
url="https://example.com/products",
config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
)
sample_html = sample_result.cleaned_html[:5000] # Use first 5k chars
# Generate schema automatically (ONE-TIME LLM COST)
schema = JsonCssExtractionStrategy.generate_schema(
html=sample_html,
schema_type="css",
llm_config=llm_config,
instruction="Extract product information including name, price, description, and features"
)
# Cache schema for future use (NO MORE LLM CALLS)
json.dump(schema, schema_file.open("w"), indent=2)
print("Schema generated and cached")
# Step 2: Use schema for fast extraction (NO LLM CALLS)
strategy = JsonCssExtractionStrategy(schema, verbose=True)
config = CrawlerRunConfig(
extraction_strategy=strategy,
cache_mode=CacheMode.BYPASS
)
# Step 3: Extract from multiple pages using same schema
urls = [
"https://example.com/products",
"https://example.com/electronics",
"https://example.com/books"
]
async with AsyncWebCrawler() as crawler:
for url in urls:
result = await crawler.arun(url=url, config=config)
if result.success:
data = json.loads(result.extracted_content)
print(f"{url}: Extracted {len(data)} items")
else:
print(f"{url}: Failed - {result.error_message}")
asyncio.run(generate_and_use_schema())
```
### XPath Extraction Strategy
```python
from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy
# XPath-based schema (alternative to CSS)
xpath_schema = {
"name": "News Articles",
"baseSelector": "//article[@class='news-item']",
"baseFields": [
{
"name": "article_id",
"type": "attribute",
"attribute": "data-id"
}
],
"fields": [
{
"name": "headline",
"selector": ".//h2[@class='headline']",
"type": "text"
},
{
"name": "author",
"selector": ".//span[@class='author']/text()",
"type": "text"
},
{
"name": "publish_date",
"selector": ".//time/@datetime",
"type": "text"
},
{
"name": "content",
"selector": ".//div[@class='article-body']",
"type": "html"
},
{
"name": "tags",
"selector": ".//div[@class='tags']/span[@class='tag']",
"type": "list",
"fields": [
{"name": "tag", "type": "text"}
]
}
]
}
# Generate XPath schema automatically
async def generate_xpath_schema():
llm_config = LLMConfig(provider="ollama/llama3.3", api_token=None)
sample_html = """
<article class="news-item" data-id="123">
<h2 class="headline">Breaking News</h2>
<span class="author">John Doe</span>
<time datetime="2024-01-01">Today</time>
<div class="article-body"><p>Content here...</p></div>
</article>
"""
schema = JsonXPathExtractionStrategy.generate_schema(
html=sample_html,
schema_type="xpath",
llm_config=llm_config
)
return schema
# Use XPath strategy
xpath_strategy = JsonXPathExtractionStrategy(xpath_schema, verbose=True)
```
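As with the CSS strategy, plug it into a run config, e.g. `CrawlerRunConfig(extraction_strategy=xpath_strategy)`, and parse `result.extracted_content` as JSON.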
### Regex Extraction Strategy - Pattern-Based Fast Extraction
```python
from crawl4ai.extraction_strategy import RegexExtractionStrategy
# Built-in patterns for common data types
async def extract_with_builtin_patterns():
# Use multiple built-in patterns
strategy = RegexExtractionStrategy(
pattern=(
RegexExtractionStrategy.Email |
RegexExtractionStrategy.PhoneUS |
RegexExtractionStrategy.Url |
RegexExtractionStrategy.Currency |
RegexExtractionStrategy.DateIso
)
)
config = CrawlerRunConfig(extraction_strategy=strategy)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com/contact",
config=config
)
if result.success:
matches = json.loads(result.extracted_content)
# Group by pattern type
by_type = {}
for match in matches:
label = match['label']
if label not in by_type:
by_type[label] = []
by_type[label].append(match['value'])
for pattern_type, values in by_type.items():
print(f"{pattern_type}: {len(values)} matches")
for value in values[:3]: # Show first 3
print(f" {value}")
# Custom regex patterns
custom_patterns = {
"product_code": r"SKU-\d{4,6}",
"discount": r"\d{1,2}%\s*off",
"model_number": r"Model:\s*([A-Z0-9-]+)"
}
async def extract_with_custom_patterns():
strategy = RegexExtractionStrategy(custom=custom_patterns)
config = CrawlerRunConfig(extraction_strategy=strategy)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com/products",
config=config
)
if result.success:
data = json.loads(result.extracted_content)
for item in data:
print(f"{item['label']}: {item['value']}")
# LLM-generated patterns (one-time cost)
async def generate_custom_patterns():
cache_file = Path("./patterns/price_patterns.json")
if cache_file.exists():
patterns = json.load(cache_file.open())
else:
llm_config = LLMConfig(
provider="openai/gpt-4o-mini",
api_token="env:OPENAI_API_KEY"
)
# Get sample content
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com/pricing")
sample_html = result.cleaned_html
# Generate optimized patterns
patterns = RegexExtractionStrategy.generate_pattern(
label="pricing_info",
html=sample_html,
query="Extract all pricing information including discounts and special offers",
llm_config=llm_config
)
# Cache for reuse
cache_file.parent.mkdir(exist_ok=True)
json.dump(patterns, cache_file.open("w"), indent=2)
# Use cached patterns (no more LLM calls)
strategy = RegexExtractionStrategy(custom=patterns)
return strategy
asyncio.run(extract_with_builtin_patterns())
asyncio.run(extract_with_custom_patterns())
```
### Complete Extraction Workflow - Combining Strategies
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import (
JsonCssExtractionStrategy,
RegexExtractionStrategy,
LLMExtractionStrategy
)
async def multi_strategy_extraction():
"""
Demonstrate using multiple extraction strategies in sequence:
1. Fast regex for common patterns
2. Schema-based for structured data
3. LLM for complex reasoning
"""
browser_config = BrowserConfig(headless=True)
# Strategy 1: Fast regex extraction
regex_strategy = RegexExtractionStrategy(
pattern=RegexExtractionStrategy.Email | RegexExtractionStrategy.PhoneUS
)
# Strategy 2: Schema-based structured extraction
product_schema = {
"name": "Products",
"baseSelector": "div.product",
"fields": [
{"name": "name", "selector": "h3", "type": "text"},
{"name": "price", "selector": ".price", "type": "text"},
{"name": "rating", "selector": ".rating", "type": "attribute", "attribute": "data-rating"}
]
}
css_strategy = JsonCssExtractionStrategy(product_schema)
# Strategy 3: LLM for complex analysis
llm_strategy = LLMExtractionStrategy(
llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token="env:OPENAI_API_KEY"),
schema={
"type": "object",
"properties": {
"sentiment": {"type": "string"},
"key_topics": {"type": "array", "items": {"type": "string"}},
"summary": {"type": "string"}
}
},
extraction_type="schema",
instruction="Analyze the content sentiment, extract key topics, and provide a summary"
)
url = "https://example.com/product-reviews"
async with AsyncWebCrawler(config=browser_config) as crawler:
# Extract contact info with regex
regex_config = CrawlerRunConfig(extraction_strategy=regex_strategy)
regex_result = await crawler.arun(url=url, config=regex_config)
# Extract structured product data
css_config = CrawlerRunConfig(extraction_strategy=css_strategy)
css_result = await crawler.arun(url=url, config=css_config)
        # Extract insights with LLM
        llm_run_config = CrawlerRunConfig(extraction_strategy=llm_strategy)
        llm_result = await crawler.arun(url=url, config=llm_run_config)
# Combine results
results = {
"contacts": json.loads(regex_result.extracted_content) if regex_result.success else [],
"products": json.loads(css_result.extracted_content) if css_result.success else [],
"analysis": json.loads(llm_result.extracted_content) if llm_result.success else {}
}
print(f"Found {len(results['contacts'])} contact entries")
print(f"Found {len(results['products'])} products")
print(f"Sentiment: {results['analysis'].get('sentiment', 'N/A')}")
return results
# Performance comparison
async def compare_extraction_performance():
"""Compare speed and accuracy of different strategies"""
import time
url = "https://example.com/large-catalog"
strategies = {
"regex": RegexExtractionStrategy(pattern=RegexExtractionStrategy.Currency),
"css": JsonCssExtractionStrategy({
"name": "Prices",
"baseSelector": ".price",
"fields": [{"name": "amount", "selector": "span", "type": "text"}]
}),
"llm": LLMExtractionStrategy(
llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token="env:OPENAI_API_KEY"),
instruction="Extract all prices from the content",
extraction_type="block"
)
}
async with AsyncWebCrawler() as crawler:
for name, strategy in strategies.items():
start_time = time.time()
config = CrawlerRunConfig(extraction_strategy=strategy)
result = await crawler.arun(url=url, config=config)
duration = time.time() - start_time
if result.success:
data = json.loads(result.extracted_content)
print(f"{name}: {len(data)} items in {duration:.2f}s")
else:
print(f"{name}: Failed in {duration:.2f}s")
asyncio.run(multi_strategy_extraction())
asyncio.run(compare_extraction_performance())
```
### Best Practices and Strategy Selection
```python
# Strategy selection guide
def choose_extraction_strategy(use_case):
"""
Guide for selecting the right extraction strategy
"""
strategies = {
# Fast pattern matching for common data types
"contact_info": RegexExtractionStrategy(
pattern=RegexExtractionStrategy.Email | RegexExtractionStrategy.PhoneUS
),
# Structured data from consistent HTML
"product_catalogs": JsonCssExtractionStrategy,
# Complex reasoning and semantic understanding
"content_analysis": LLMExtractionStrategy,
# Mixed approach for comprehensive extraction
"complete_site_analysis": "multi_strategy"
}
recommendations = {
"speed_priority": "Use RegexExtractionStrategy for simple patterns, JsonCssExtractionStrategy for structured data",
"accuracy_priority": "Use LLMExtractionStrategy for complex content, JsonCssExtractionStrategy for predictable structure",
"cost_priority": "Avoid LLM strategies, use schema generation once then JsonCssExtractionStrategy",
"scale_priority": "Cache schemas, use regex for simple patterns, avoid LLM for high-volume extraction"
}
return recommendations.get(use_case, "Combine strategies based on content complexity")
# Error handling and validation
async def robust_extraction():
strategies = [
RegexExtractionStrategy(pattern=RegexExtractionStrategy.Email),
JsonCssExtractionStrategy(simple_schema),
# LLM as fallback for complex cases
]
async with AsyncWebCrawler() as crawler:
for strategy in strategies:
try:
config = CrawlerRunConfig(extraction_strategy=strategy)
result = await crawler.arun(url="https://example.com", config=config)
if result.success and result.extracted_content:
data = json.loads(result.extracted_content)
if data: # Validate non-empty results
print(f"Success with {strategy.__class__.__name__}")
return data
except Exception as e:
print(f"Strategy {strategy.__class__.__name__} failed: {e}")
continue
print("All strategies failed")
return None
```
**📖 Learn more:** [LLM Strategies Deep Dive](https://docs.crawl4ai.com/extraction/llm-strategies/), [Schema-Based Extraction](https://docs.crawl4ai.com/extraction/no-llm-strategies/), [Regex Patterns](https://docs.crawl4ai.com/extraction/no-llm-strategies/#regexextractionstrategy), [Performance Optimization](https://docs.crawl4ai.com/advanced/multi-url-crawling/)

View File

@@ -0,0 +1,388 @@
## HTTP Crawler Strategy
Fast, lightweight HTTP-only crawling without browser overhead for cases where JavaScript execution isn't needed.
### Basic HTTP Crawler Setup
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, HTTPCrawlerConfig, CacheMode
from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy
from crawl4ai.async_logger import AsyncLogger
async def main():
# Initialize HTTP strategy
http_strategy = AsyncHTTPCrawlerStrategy(
browser_config=HTTPCrawlerConfig(
method="GET",
verify_ssl=True,
follow_redirects=True
),
logger=AsyncLogger(verbose=True)
)
# Use with AsyncWebCrawler
async with AsyncWebCrawler(crawler_strategy=http_strategy) as crawler:
result = await crawler.arun("https://example.com")
print(f"Status: {result.status_code}")
print(f"Content: {len(result.html)} chars")
if __name__ == "__main__":
asyncio.run(main())
```
### HTTP Request Types
```python
# GET request (default)
http_config = HTTPCrawlerConfig(
method="GET",
headers={"Accept": "application/json"}
)
# POST with JSON data
http_config = HTTPCrawlerConfig(
method="POST",
json={"key": "value", "data": [1, 2, 3]},
headers={"Content-Type": "application/json"}
)
# POST with form data
http_config = HTTPCrawlerConfig(
method="POST",
data={"username": "user", "password": "pass"},
headers={"Content-Type": "application/x-www-form-urlencoded"}
)
# Advanced configuration
http_config = HTTPCrawlerConfig(
method="GET",
headers={"User-Agent": "Custom Bot/1.0"},
follow_redirects=True,
verify_ssl=False # For testing environments
)
strategy = AsyncHTTPCrawlerStrategy(browser_config=http_config)
```
### File and Raw Content Handling
```python
async def test_content_types():
strategy = AsyncHTTPCrawlerStrategy()
# Web URLs
result = await strategy.crawl("https://httpbin.org/get")
print(f"Web content: {result.status_code}")
# Local files
result = await strategy.crawl("file:///path/to/local/file.html")
print(f"File content: {len(result.html)}")
# Raw HTML content
raw_html = "raw://<html><body><h1>Test</h1><p>Content</p></body></html>"
result = await strategy.crawl(raw_html)
print(f"Raw content: {result.html}")
# Raw content with complex HTML
complex_html = """raw://<!DOCTYPE html>
<html>
<head><title>Test Page</title></head>
<body>
<div class="content">
<h1>Main Title</h1>
<p>Paragraph content</p>
<ul><li>Item 1</li><li>Item 2</li></ul>
</div>
</body>
</html>"""
result = await strategy.crawl(complex_html)
```
### Custom Hooks and Request Handling
```python
async def setup_hooks():
strategy = AsyncHTTPCrawlerStrategy()
# Before request hook
async def before_request(url, kwargs):
print(f"Requesting: {url}")
        kwargs.setdefault('headers', {})
        kwargs['headers']['X-Custom-Header'] = 'crawl4ai'
        kwargs['headers']['Authorization'] = 'Bearer token123'
# After request hook
async def after_request(response):
print(f"Response: {response.status_code}")
if hasattr(response, 'redirected_url'):
print(f"Redirected to: {response.redirected_url}")
# Error handling hook
async def on_error(error):
print(f"Request failed: {error}")
# Set hooks
strategy.set_hook('before_request', before_request)
strategy.set_hook('after_request', after_request)
strategy.set_hook('on_error', on_error)
# Use with hooks
result = await strategy.crawl("https://httpbin.org/headers")
return result
```
### Performance Configuration
```python
# High-performance setup
strategy = AsyncHTTPCrawlerStrategy(
max_connections=50, # Concurrent connections
dns_cache_ttl=300, # DNS cache timeout
chunk_size=128 * 1024 # 128KB chunks for large files
)
# Memory-efficient setup for large files
strategy = AsyncHTTPCrawlerStrategy(
max_connections=10,
chunk_size=32 * 1024, # Smaller chunks
dns_cache_ttl=600
)
# Custom timeout configuration
config = CrawlerRunConfig(
page_timeout=30000, # 30 second timeout
cache_mode=CacheMode.BYPASS
)
result = await strategy.crawl("https://slow-server.com", config=config)
```
### Error Handling and Retries
```python
from crawl4ai.async_crawler_strategy import (
ConnectionTimeoutError,
HTTPStatusError,
HTTPCrawlerError
)
async def robust_crawling():
strategy = AsyncHTTPCrawlerStrategy()
urls = [
"https://example.com",
"https://httpbin.org/status/404",
"https://nonexistent.domain.test"
]
for url in urls:
try:
result = await strategy.crawl(url)
print(f"✓ {url}: {result.status_code}")
except HTTPStatusError as e:
print(f"✗ {url}: HTTP {e.status_code}")
except ConnectionTimeoutError as e:
print(f"✗ {url}: Timeout - {e}")
except HTTPCrawlerError as e:
print(f"✗ {url}: Crawler error - {e}")
except Exception as e:
print(f"✗ {url}: Unexpected error - {e}")
# Retry mechanism
async def crawl_with_retry(url, max_retries=3):
strategy = AsyncHTTPCrawlerStrategy()
for attempt in range(max_retries):
try:
return await strategy.crawl(url)
except (ConnectionTimeoutError, HTTPCrawlerError) as e:
if attempt == max_retries - 1:
raise
print(f"Retry {attempt + 1}/{max_retries}: {e}")
await asyncio.sleep(2 ** attempt) # Exponential backoff
```
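For completeness, a usage sketch of the retry helper above (the flaky URL is a placeholder):
```python
async def retry_demo():
    try:
        result = await crawl_with_retry("https://flaky.example.com", max_retries=3)
        print(f"Recovered with status {result.status_code}")
    except Exception as e:
        print(f"Gave up after retries: {e}")
```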
### Batch Processing with HTTP Strategy
```python
async def batch_http_crawling():
strategy = AsyncHTTPCrawlerStrategy(max_connections=20)
urls = [
"https://httpbin.org/get",
"https://httpbin.org/user-agent",
"https://httpbin.org/headers",
"https://example.com",
"https://httpbin.org/json"
]
# Sequential processing
results = []
async with strategy:
for url in urls:
try:
result = await strategy.crawl(url)
results.append((url, result.status_code, len(result.html)))
except Exception as e:
results.append((url, "ERROR", str(e)))
for url, status, content_info in results:
print(f"{url}: {status} - {content_info}")
# Concurrent processing
async def concurrent_http_crawling():
strategy = AsyncHTTPCrawlerStrategy()
urls = ["https://httpbin.org/delay/1"] * 5
async def crawl_single(url):
try:
result = await strategy.crawl(url)
return f"✓ {result.status_code}"
except Exception as e:
return f"✗ {e}"
async with strategy:
tasks = [crawl_single(url) for url in urls]
results = await asyncio.gather(*tasks, return_exceptions=True)
for i, result in enumerate(results):
print(f"URL {i+1}: {result}")
```
### Integration with Content Processing
```python
from crawl4ai import DefaultMarkdownGenerator, PruningContentFilter
async def http_with_processing():
# HTTP strategy with content processing
http_strategy = AsyncHTTPCrawlerStrategy(
browser_config=HTTPCrawlerConfig(verify_ssl=True)
)
# Configure markdown generation
crawler_config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(
threshold=0.48,
threshold_type="fixed",
min_word_threshold=10
)
),
word_count_threshold=5,
excluded_tags=['script', 'style', 'nav'],
exclude_external_links=True
)
async with AsyncWebCrawler(crawler_strategy=http_strategy) as crawler:
result = await crawler.arun(
url="https://example.com",
config=crawler_config
)
print(f"Status: {result.status_code}")
print(f"Raw HTML: {len(result.html)} chars")
if result.markdown:
print(f"Markdown: {len(result.markdown.raw_markdown)} chars")
if result.markdown.fit_markdown:
print(f"Filtered: {len(result.markdown.fit_markdown)} chars")
```
### HTTP vs Browser Strategy Comparison
```python
import time
async def strategy_comparison():
# Same URL with different strategies
url = "https://example.com"
# HTTP Strategy (fast, no JS)
http_strategy = AsyncHTTPCrawlerStrategy()
start_time = time.time()
http_result = await http_strategy.crawl(url)
http_time = time.time() - start_time
# Browser Strategy (full features)
from crawl4ai import BrowserConfig
browser_config = BrowserConfig(headless=True)
start_time = time.time()
async with AsyncWebCrawler(config=browser_config) as crawler:
browser_result = await crawler.arun(url)
browser_time = time.time() - start_time
print(f"HTTP Strategy:")
print(f" Time: {http_time:.2f}s")
print(f" Content: {len(http_result.html)} chars")
print(f" Features: Fast, lightweight, no JS")
print(f"Browser Strategy:")
print(f" Time: {browser_time:.2f}s")
print(f" Content: {len(browser_result.html)} chars")
print(f" Features: Full browser, JS, screenshots, etc.")
# When to use HTTP strategy:
# - Static content sites
# - APIs returning HTML
# - Fast bulk processing
# - No JavaScript required
# - Memory/resource constraints
# When to use Browser strategy:
# - Dynamic content (SPA, AJAX)
# - JavaScript-heavy sites
# - Screenshots/PDFs needed
# - Complex interactions required
```
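Those trade-offs fold naturally into a small routing helper. A sketch, assuming you already know per URL whether JavaScript is required:
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig
from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy
def make_crawler(needs_js: bool) -> AsyncWebCrawler:
    # Static pages -> lightweight HTTP strategy; dynamic pages -> full browser
    if needs_js:
        return AsyncWebCrawler(config=BrowserConfig(headless=True))
    return AsyncWebCrawler(crawler_strategy=AsyncHTTPCrawlerStrategy())
async def routed_crawl(url: str, needs_js: bool):
    async with make_crawler(needs_js) as crawler:
        return await crawler.arun(url)
```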
### Advanced Configuration
```python
# Custom session configuration
import aiohttp
async def advanced_http_setup():
    # Reference connector settings (illustrative only — the strategy builds
    # its own session from the arguments below, so this connector is not passed in)
    connector = aiohttp.TCPConnector(
limit=100, # Connection pool size
ttl_dns_cache=600, # DNS cache TTL
use_dns_cache=True, # Enable DNS caching
keepalive_timeout=30, # Keep-alive timeout
force_close=False # Reuse connections
)
strategy = AsyncHTTPCrawlerStrategy(
max_connections=50,
dns_cache_ttl=600,
chunk_size=64 * 1024
)
# Custom headers for all requests
http_config = HTTPCrawlerConfig(
headers={
"User-Agent": "Crawl4AI-HTTP/1.0",
"Accept": "text/html,application/xhtml+xml",
"Accept-Language": "en-US,en;q=0.9",
"Accept-Encoding": "gzip, deflate, br",
"DNT": "1"
},
verify_ssl=True,
follow_redirects=True
)
strategy.browser_config = http_config
# Use with custom timeout
config = CrawlerRunConfig(
page_timeout=45000, # 45 seconds
cache_mode=CacheMode.ENABLED
)
result = await strategy.crawl("https://example.com", config=config)
await strategy.close()
```
**📖 Learn more:** [AsyncWebCrawler API](https://docs.crawl4ai.com/api/async-webcrawler/), [Browser vs HTTP Strategy](https://docs.crawl4ai.com/core/browser-crawler-config/), [Performance Optimization](https://docs.crawl4ai.com/advanced/multi-url-crawling/)

View File

@@ -0,0 +1,231 @@
## Installation
Multiple installation options for different environments and use cases.
### Basic Installation
```bash
# Install core library
pip install crawl4ai
# Initial setup (installs Playwright browsers)
crawl4ai-setup
# Verify installation
crawl4ai-doctor
```
### Quick Verification
```python
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com")
print(result.markdown[:300])
if __name__ == "__main__":
asyncio.run(main())
```
**📖 Learn more:** [Basic Usage Guide](https://docs.crawl4ai.com/core/quickstart.md)
### Advanced Features (Optional)
```bash
# PyTorch-based features (text clustering, semantic chunking)
pip install crawl4ai[torch]
crawl4ai-setup
# Transformers (Hugging Face models)
pip install crawl4ai[transformer]
crawl4ai-setup
# All features (large download)
pip install crawl4ai[all]
crawl4ai-setup
# Pre-download models (optional)
crawl4ai-download-models
```
**📖 Learn more:** [Advanced Features Documentation](https://docs.crawl4ai.com/extraction/llm-strategies.md)
### Docker Deployment
```bash
# Pull pre-built image (specify platform for consistency)
docker pull --platform linux/amd64 unclecode/crawl4ai:latest
# For ARM (M1/M2 Macs): docker pull --platform linux/arm64 unclecode/crawl4ai:latest
# Setup environment for LLM support
cat > .llm.env << EOL
OPENAI_API_KEY=sk-your-key
ANTHROPIC_API_KEY=your-anthropic-key
EOL
# Run with LLM support (specify platform)
docker run -d \
--platform linux/amd64 \
-p 11235:11235 \
--name crawl4ai \
--env-file .llm.env \
--shm-size=1g \
unclecode/crawl4ai:latest
# For ARM Macs, use: --platform linux/arm64
# Basic run (no LLM)
docker run -d \
--platform linux/amd64 \
-p 11235:11235 \
--name crawl4ai \
--shm-size=1g \
unclecode/crawl4ai:latest
```
**📖 Learn more:** [Complete Docker Guide](https://docs.crawl4ai.com/core/docker-deployment.md)
### Docker Compose
```bash
# Clone repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
# Copy environment template
cp deploy/docker/.llm.env.example .llm.env
# Edit .llm.env with your API keys
# Run pre-built image
IMAGE=unclecode/crawl4ai:latest docker compose up -d
# Build and run locally
docker compose up --build -d
# Build with all features
INSTALL_TYPE=all docker compose up --build -d
# Stop service
docker compose down
```
**📖 Learn more:** [Docker Compose Configuration](https://docs.crawl4ai.com/core/docker-deployment.md#option-2-using-docker-compose)
### Manual Docker Build
```bash
# Build multi-architecture image (specify platform)
docker buildx build --platform linux/amd64 -t crawl4ai-local:latest --load .
# For ARM: docker buildx build --platform linux/arm64 -t crawl4ai-local:latest --load .
# Build with specific features
docker buildx build \
--platform linux/amd64 \
--build-arg INSTALL_TYPE=all \
--build-arg ENABLE_GPU=false \
-t crawl4ai-local:latest --load .
# Run custom build (specify platform)
docker run -d \
--platform linux/amd64 \
-p 11235:11235 \
--name crawl4ai-custom \
--env-file .llm.env \
--shm-size=1g \
crawl4ai-local:latest
```
**📖 Learn more:** [Manual Build Guide](https://docs.crawl4ai.com/core/docker-deployment.md#option-3-manual-local-build--run)
### Google Colab
```python
# Install in Colab
!pip install crawl4ai
!crawl4ai-setup
# If setup fails, manually install Playwright browsers
!playwright install chromium
# Install with all features (may take 5-10 minutes)
!pip install crawl4ai[all]
!crawl4ai-setup
!crawl4ai-download-models
# If still having issues, force Playwright install
!playwright install chromium --force
# Quick test
import asyncio
from crawl4ai import AsyncWebCrawler
async def test_crawl():
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com")
print("✅ Installation successful!")
print(f"Content length: {len(result.markdown)}")
# Run test in Colab
await test_crawl()
```
**📖 Learn more:** [Colab Examples Notebook](https://colab.research.google.com/github/unclecode/crawl4ai/blob/main/docs/examples/quickstart.ipynb)
### Docker API Usage
```python
# Using Docker SDK
import asyncio
from crawl4ai.docker_client import Crawl4aiDockerClient
from crawl4ai import BrowserConfig, CrawlerRunConfig, CacheMode
async def main():
async with Crawl4aiDockerClient(base_url="http://localhost:11235") as client:
results = await client.crawl(
["https://example.com"],
browser_config=BrowserConfig(headless=True),
crawler_config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
)
for result in results:
print(f"Success: {result.success}, Length: {len(result.markdown)}")
asyncio.run(main())
```
**📖 Learn more:** [Docker Client API](https://docs.crawl4ai.com/core/docker-deployment.md#python-sdk)
### Direct API Calls
```python
# REST API example
import requests
payload = {
"urls": ["https://example.com"],
"browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
"crawler_config": {"type": "CrawlerRunConfig", "params": {"cache_mode": "bypass"}}
}
response = requests.post("http://localhost:11235/crawl", json=payload)
print(response.json())
```
**📖 Learn more:** [REST API Reference](https://docs.crawl4ai.com/core/docker-deployment.md#rest-api-examples)
### Health Check
```bash
# Check Docker service
curl http://localhost:11235/health
# Access playground
open http://localhost:11235/playground
# View metrics
curl http://localhost:11235/metrics
```
**📖 Learn more:** [Monitoring & Metrics](https://docs.crawl4ai.com/core/docker-deployment.md#metrics--monitoring)

File diff suppressed because it is too large

View File

@@ -0,0 +1,339 @@
## Multi-URL Crawling
Concurrent crawling of multiple URLs with intelligent resource management, rate limiting, and real-time monitoring.
### Basic Multi-URL Crawling
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
# Batch processing (default) - get all results at once
async def batch_crawl():
urls = [
"https://example.com/page1",
"https://example.com/page2",
"https://example.com/page3"
]
config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
stream=False # Default: batch mode
)
async with AsyncWebCrawler() as crawler:
results = await crawler.arun_many(urls, config=config)
for result in results:
if result.success:
print(f"✅ {result.url}: {len(result.markdown)} chars")
else:
print(f"❌ {result.url}: {result.error_message}")
# Streaming processing - handle results as they complete
async def streaming_crawl():
    # urls: same list as in batch_crawl above
config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
stream=True # Enable streaming
)
async with AsyncWebCrawler() as crawler:
# Process results as they become available
async for result in await crawler.arun_many(urls, config=config):
if result.success:
print(f"🔥 Just completed: {result.url}")
await process_result_immediately(result)
else:
print(f"❌ Failed: {result.url}")
```
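`process_result_immediately` above is a placeholder. One minimal sketch, assuming a local `crawl_output/` directory as the sink:
```python
import json
from pathlib import Path
async def process_result_immediately(result):
    # Hypothetical sink: one JSON file per crawled page
    out_dir = Path("crawl_output")
    out_dir.mkdir(exist_ok=True)
    safe_name = result.url.replace("://", "_").replace("/", "_")[:100]
    payload = {
        "url": result.url,
        "status": result.status_code,
        "markdown": str(result.markdown)
    }
    (out_dir / f"{safe_name}.json").write_text(json.dumps(payload))
```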
### Memory-Adaptive Dispatching
```python
from crawl4ai import AsyncWebCrawler, MemoryAdaptiveDispatcher, CrawlerMonitor, DisplayMode
# Automatically manages concurrency based on system memory
async def memory_adaptive_crawl():
dispatcher = MemoryAdaptiveDispatcher(
memory_threshold_percent=80.0, # Pause if memory exceeds 80%
check_interval=1.0, # Check memory every second
max_session_permit=15, # Max concurrent tasks
memory_wait_timeout=300.0 # Wait up to 5 minutes for memory
)
config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
word_count_threshold=50
)
async with AsyncWebCrawler() as crawler:
results = await crawler.arun_many(
            urls=large_url_list,  # large_url_list: your own list of URLs
config=config,
dispatcher=dispatcher
)
# Each result includes dispatch information
for result in results:
if result.dispatch_result:
dr = result.dispatch_result
print(f"Memory used: {dr.memory_usage:.1f}MB")
print(f"Duration: {dr.end_time - dr.start_time}")
```
### Rate-Limited Crawling
```python
from crawl4ai import RateLimiter, SemaphoreDispatcher
# Control request pacing and handle server rate limits
async def rate_limited_crawl():
rate_limiter = RateLimiter(
base_delay=(1.0, 3.0), # Random delay 1-3 seconds
max_delay=60.0, # Cap backoff at 60 seconds
max_retries=3, # Retry failed requests 3 times
rate_limit_codes=[429, 503] # Handle these status codes
)
dispatcher = SemaphoreDispatcher(
max_session_permit=5, # Fixed concurrency limit
rate_limiter=rate_limiter
)
config = CrawlerRunConfig(
user_agent_mode="random", # Randomize user agents
simulate_user=True # Simulate human behavior
)
async with AsyncWebCrawler() as crawler:
async for result in await crawler.arun_many(
urls=urls,
config=config,
dispatcher=dispatcher
):
print(f"Processed: {result.url}")
```
### Real-Time Monitoring
```python
from crawl4ai import CrawlerMonitor, DisplayMode
# Monitor crawling progress in real-time
async def monitored_crawl():
monitor = CrawlerMonitor(
max_visible_rows=20, # Show 20 tasks in display
display_mode=DisplayMode.DETAILED # Show individual task details
)
dispatcher = MemoryAdaptiveDispatcher(
memory_threshold_percent=75.0,
max_session_permit=10,
monitor=monitor # Attach monitor to dispatcher
)
async with AsyncWebCrawler() as crawler:
results = await crawler.arun_many(
urls=urls,
dispatcher=dispatcher
)
```
### Advanced Dispatcher Configurations
```python
# Memory-adaptive with comprehensive monitoring
memory_dispatcher = MemoryAdaptiveDispatcher(
memory_threshold_percent=85.0, # Higher memory tolerance
check_interval=0.5, # Check memory more frequently
max_session_permit=20, # More concurrent tasks
memory_wait_timeout=600.0, # Wait longer for memory
rate_limiter=RateLimiter(
base_delay=(0.5, 1.5),
max_delay=30.0,
max_retries=5
),
monitor=CrawlerMonitor(
max_visible_rows=15,
display_mode=DisplayMode.AGGREGATED # Summary view
)
)
# Simple semaphore-based dispatcher
semaphore_dispatcher = SemaphoreDispatcher(
max_session_permit=8, # Fixed concurrency
rate_limiter=RateLimiter(
base_delay=(1.0, 2.0),
max_delay=20.0
)
)
# Usage with custom dispatcher
async with AsyncWebCrawler() as crawler:
results = await crawler.arun_many(
urls=urls,
config=config,
dispatcher=memory_dispatcher # or semaphore_dispatcher
)
```
### Handling Large-Scale Crawling
```python
async def large_scale_crawl():
# For thousands of URLs
urls = load_urls_from_file("large_url_list.txt") # 10,000+ URLs
dispatcher = MemoryAdaptiveDispatcher(
memory_threshold_percent=70.0, # Conservative memory usage
max_session_permit=25, # Higher concurrency
rate_limiter=RateLimiter(
base_delay=(0.1, 0.5), # Faster for large batches
max_retries=2 # Fewer retries for speed
),
monitor=CrawlerMonitor(display_mode=DisplayMode.AGGREGATED)
)
config = CrawlerRunConfig(
cache_mode=CacheMode.ENABLED, # Use caching for efficiency
stream=True, # Stream for memory efficiency
word_count_threshold=100, # Skip short content
exclude_external_links=True # Reduce processing overhead
)
successful_crawls = 0
failed_crawls = 0
async with AsyncWebCrawler() as crawler:
async for result in await crawler.arun_many(
urls=urls,
config=config,
dispatcher=dispatcher
):
if result.success:
successful_crawls += 1
await save_result_to_database(result)
else:
failed_crawls += 1
await log_failure(result.url, result.error_message)
# Progress reporting
if (successful_crawls + failed_crawls) % 100 == 0:
print(f"Progress: {successful_crawls + failed_crawls}/{len(urls)}")
print(f"Completed: {successful_crawls} successful, {failed_crawls} failed")
```
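`load_urls_from_file` is assumed above; one plausible implementation reads one URL per line, skipping blanks and comments:
```python
def load_urls_from_file(path: str) -> list[str]:
    # One URL per line; ignore blank lines and '#' comments
    with open(path) as f:
        return [
            line.strip() for line in f
            if line.strip() and not line.lstrip().startswith("#")
        ]
```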
### Robots.txt Compliance
```python
async def compliant_crawl():
config = CrawlerRunConfig(
check_robots_txt=True, # Respect robots.txt
user_agent="MyBot/1.0", # Identify your bot
mean_delay=2.0, # Be polite with delays
max_range=1.0
)
dispatcher = SemaphoreDispatcher(
max_session_permit=3, # Conservative concurrency
rate_limiter=RateLimiter(
base_delay=(2.0, 5.0), # Slower, more respectful
max_retries=1
)
)
async with AsyncWebCrawler() as crawler:
async for result in await crawler.arun_many(
urls=urls,
config=config,
dispatcher=dispatcher
):
if result.success:
print(f"✅ Crawled: {result.url}")
elif "robots.txt" in result.error_message:
print(f"🚫 Blocked by robots.txt: {result.url}")
else:
print(f"❌ Error: {result.url}")
```
### Performance Analysis
```python
async def analyze_crawl_performance():
dispatcher = MemoryAdaptiveDispatcher(
memory_threshold_percent=80.0,
max_session_permit=12,
monitor=CrawlerMonitor(display_mode=DisplayMode.DETAILED)
)
start_time = time.time()
async with AsyncWebCrawler() as crawler:
results = await crawler.arun_many(
urls=urls,
dispatcher=dispatcher
)
end_time = time.time()
# Analyze results
successful = [r for r in results if r.success]
failed = [r for r in results if not r.success]
print(f"Total time: {end_time - start_time:.2f}s")
print(f"Success rate: {len(successful)}/{len(results)} ({len(successful)/len(results)*100:.1f}%)")
print(f"Avg time per URL: {(end_time - start_time)/len(results):.2f}s")
# Memory usage analysis
if successful and successful[0].dispatch_result:
memory_usage = [r.dispatch_result.memory_usage for r in successful if r.dispatch_result]
peak_memory = [r.dispatch_result.peak_memory for r in successful if r.dispatch_result]
print(f"Avg memory usage: {sum(memory_usage)/len(memory_usage):.1f}MB")
print(f"Peak memory usage: {max(peak_memory):.1f}MB")
```
### Error Handling and Recovery
```python
async def robust_multi_crawl():
failed_urls = []
config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
stream=True,
page_timeout=30000 # 30 second timeout
)
dispatcher = MemoryAdaptiveDispatcher(
memory_threshold_percent=85.0,
max_session_permit=10
)
async with AsyncWebCrawler() as crawler:
async for result in await crawler.arun_many(
urls=urls,
config=config,
dispatcher=dispatcher
):
if result.success:
await process_successful_result(result)
else:
failed_urls.append({
'url': result.url,
'error': result.error_message,
'status_code': result.status_code
})
# Retry logic for specific errors
if result.status_code in [503, 429]: # Server errors
await schedule_retry(result.url)
# Report failures
if failed_urls:
print(f"Failed to crawl {len(failed_urls)} URLs:")
for failure in failed_urls[:10]: # Show first 10
print(f" {failure['url']}: {failure['error']}")
```
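The helpers referenced above (`process_successful_result`, `log_failure`, `schedule_retry`) are placeholders; minimal sketches under that assumption:
```python
retry_queue: list[str] = []
async def process_successful_result(result):
    # Swap in your real storage layer here
    print(f"Stored {result.url} ({len(result.markdown)} chars)")
async def log_failure(url, error_message):
    print(f"FAILED {url}: {error_message}")
async def schedule_retry(url):
    # Collect for a later retry pass with arun_many
    retry_queue.append(url)
```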
**📖 Learn more:** [Advanced Multi-URL Crawling](https://docs.crawl4ai.com/advanced/multi-url-crawling/), [Crawl Dispatcher](https://docs.crawl4ai.com/advanced/crawl-dispatcher/), [arun_many() API Reference](https://docs.crawl4ai.com/api/arun_many/)

View File

@@ -0,0 +1,365 @@
## Simple Crawling
Basic web crawling operations with AsyncWebCrawler, configurations, and response handling.
### Basic Setup
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
async def main():
browser_config = BrowserConfig() # Default browser settings
run_config = CrawlerRunConfig() # Default crawl settings
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="https://example.com",
config=run_config
)
print(result.markdown)
if __name__ == "__main__":
asyncio.run(main())
```
### Understanding CrawlResult
```python
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter
config = CrawlerRunConfig(
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(threshold=0.6),
options={"ignore_links": True}
)
)
result = await crawler.arun("https://example.com", config=config)
# Different content formats
print(result.html) # Raw HTML
print(result.cleaned_html) # Cleaned HTML
print(result.markdown.raw_markdown) # Raw markdown
print(result.markdown.fit_markdown) # Filtered markdown
# Status information
print(result.success) # True/False
print(result.status_code) # HTTP status (200, 404, etc.)
# Extracted content
print(result.media) # Images, videos, audio
print(result.links) # Internal/external links
```
### Basic Configuration Options
```python
run_config = CrawlerRunConfig(
word_count_threshold=10, # Min words per block
exclude_external_links=True, # Remove external links
remove_overlay_elements=True, # Remove popups/modals
process_iframes=True, # Process iframe content
excluded_tags=['form', 'header'] # Skip these tags
)
result = await crawler.arun("https://example.com", config=run_config)
```
### Error Handling
```python
result = await crawler.arun("https://example.com", config=run_config)
if not result.success:
print(f"Crawl failed: {result.error_message}")
print(f"Status code: {result.status_code}")
else:
print(f"Success! Content length: {len(result.markdown)}")
```
### Debugging with Verbose Logging
```python
browser_config = BrowserConfig(verbose=True)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun("https://example.com")
# Detailed logging output will be displayed
```
### Complete Example
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
async def comprehensive_crawl():
browser_config = BrowserConfig(verbose=True)
run_config = CrawlerRunConfig(
# Content filtering
word_count_threshold=10,
excluded_tags=['form', 'header', 'nav'],
exclude_external_links=True,
# Content processing
process_iframes=True,
remove_overlay_elements=True,
# Cache control
cache_mode=CacheMode.ENABLED
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="https://example.com",
config=run_config
)
if result.success:
# Display content summary
print(f"Title: {result.metadata.get('title', 'No title')}")
print(f"Content: {result.markdown[:500]}...")
# Process media
images = result.media.get("images", [])
print(f"Found {len(images)} images")
for img in images[:3]: # First 3 images
print(f" - {img.get('src', 'No src')}")
# Process links
internal_links = result.links.get("internal", [])
print(f"Found {len(internal_links)} internal links")
for link in internal_links[:3]: # First 3 links
print(f" - {link.get('href', 'No href')}")
else:
print(f"❌ Crawl failed: {result.error_message}")
print(f"Status: {result.status_code}")
if __name__ == "__main__":
asyncio.run(comprehensive_crawl())
```
### Working with Raw HTML and Local Files
```python
# Crawl raw HTML
raw_html = "<html><body><h1>Test</h1><p>Content</p></body></html>"
result = await crawler.arun(f"raw://{raw_html}")
# Crawl local file
result = await crawler.arun("file:///path/to/local/file.html")
# Both return standard CrawlResult objects
print(result.markdown)
```
## Table Extraction
Extract structured data from HTML tables with automatic detection and scoring.
### Basic Table Extraction
```python
import asyncio
import pandas as pd
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
async def extract_tables():
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(
table_score_threshold=7, # Higher = stricter detection
cache_mode=CacheMode.BYPASS
)
result = await crawler.arun("https://example.com/tables", config=config)
if result.success and result.tables:
# New tables field (v0.6+)
for i, table in enumerate(result.tables):
print(f"Table {i+1}:")
print(f"Headers: {table['headers']}")
print(f"Rows: {len(table['rows'])}")
print(f"Caption: {table.get('caption', 'No caption')}")
# Convert to DataFrame
df = pd.DataFrame(table['rows'], columns=table['headers'])
print(df.head())
asyncio.run(extract_tables())
```
### Advanced Table Processing
```python
import re
from crawl4ai import LXMLWebScrapingStrategy
async def process_financial_tables():
config = CrawlerRunConfig(
table_score_threshold=8, # Strict detection for data tables
scraping_strategy=LXMLWebScrapingStrategy(),
keep_data_attributes=True,
scan_full_page=True
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://coinmarketcap.com", config=config)
if result.tables:
# Get the main data table (usually first/largest)
main_table = result.tables[0]
# Create DataFrame
df = pd.DataFrame(
main_table['rows'],
columns=main_table['headers']
)
# Clean and process data
df = clean_financial_data(df)
# Save for analysis
df.to_csv("market_data.csv", index=False)
return df
def clean_financial_data(df):
"""Clean currency symbols, percentages, and large numbers"""
for col in df.columns:
if 'price' in col.lower():
# Remove currency symbols
df[col] = df[col].str.replace(r'[^\d.]', '', regex=True)
df[col] = pd.to_numeric(df[col], errors='coerce')
elif '%' in str(df[col].iloc[0]):
# Convert percentages
df[col] = df[col].str.replace('%', '').astype(float) / 100
elif any(suffix in str(df[col].iloc[0]) for suffix in ['B', 'M', 'K']):
# Handle large numbers (Billions, Millions, etc.)
df[col] = df[col].apply(convert_large_numbers)
return df
def convert_large_numbers(value):
"""Convert 1.5B -> 1500000000"""
if pd.isna(value):
return float('nan')
value = str(value)
multiplier = 1
if 'B' in value:
multiplier = 1e9
elif 'M' in value:
multiplier = 1e6
elif 'K' in value:
multiplier = 1e3
number = float(re.sub(r'[^\d.]', '', value))
return number * multiplier
```
### Table Detection Configuration
```python
# Strict table detection (data-heavy pages)
strict_config = CrawlerRunConfig(
table_score_threshold=9, # Only high-quality tables
word_count_threshold=5, # Ignore sparse content
excluded_tags=['nav', 'footer'] # Skip navigation tables
)
# Lenient detection (mixed content pages)
lenient_config = CrawlerRunConfig(
table_score_threshold=5, # Include layout tables
process_iframes=True, # Check embedded tables
scan_full_page=True # Scroll to load dynamic tables
)
# Financial/data site optimization
financial_config = CrawlerRunConfig(
table_score_threshold=8,
scraping_strategy=LXMLWebScrapingStrategy(),
wait_for="css:table", # Wait for tables to load
scan_full_page=True,
scroll_delay=0.2
)
```
### Multi-Table Processing
```python
async def extract_all_tables():
async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/data", config=config)  # config: e.g. lenient_config from above
tables_data = {}
for i, table in enumerate(result.tables):
# Create meaningful names based on content
table_name = (
table.get('caption') or
f"table_{i+1}_{table['headers'][0]}"
).replace(' ', '_').lower()
df = pd.DataFrame(table['rows'], columns=table['headers'])
# Store with metadata
tables_data[table_name] = {
'dataframe': df,
'headers': table['headers'],
'row_count': len(table['rows']),
'caption': table.get('caption'),
'summary': table.get('summary')
}
return tables_data
# Usage
tables = await extract_all_tables()
for name, data in tables.items():
print(f"{name}: {data['row_count']} rows")
data['dataframe'].to_csv(f"{name}.csv")
```
### Backward Compatibility
```python
# Support both new and old table formats
def get_tables(result):
# New format (v0.6+)
if hasattr(result, 'tables') and result.tables:
return result.tables
# Fallback to media.tables (older versions)
return result.media.get('tables', [])
# Usage in existing code
result = await crawler.arun(url, config=config)
tables = get_tables(result)
for table in tables:
df = pd.DataFrame(table['rows'], columns=table['headers'])
# Process table data...
```
### Table Quality Scoring
```python
# Understanding table_score_threshold values:
# 10: Only perfect data tables (headers + data rows)
# 8-9: High-quality tables (recommended for financial/data sites)
# 6-7: Mixed content tables (news sites, wikis)
# 4-5: Layout tables included (broader detection)
# 1-3: All table-like structures (very permissive)
config = CrawlerRunConfig(
table_score_threshold=8, # Balanced detection
verbose=True # See scoring details in logs
)
```
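If a strict threshold finds nothing, relax it stepwise. A small fallback sketch using only parameters shown above:
```python
async def extract_with_fallback(url, thresholds=(9, 7, 5)):
    async with AsyncWebCrawler() as crawler:
        for t in thresholds:
            config = CrawlerRunConfig(
                table_score_threshold=t,
                cache_mode=CacheMode.BYPASS
            )
            result = await crawler.arun(url, config=config)
            if result.success and result.tables:
                return result.tables  # first threshold that yields tables
    return []
```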
**📖 Learn more:** [CrawlResult API Reference](https://docs.crawl4ai.com/api/crawl-result/), [Browser & Crawler Configuration](https://docs.crawl4ai.com/core/browser-crawler-config/), [Cache Modes](https://docs.crawl4ai.com/core/cache-modes/)

View File

@@ -0,0 +1,655 @@
## URL Seeding
Smart URL discovery for efficient large-scale crawling. Discover thousands of URLs instantly, filter by relevance, then crawl only what matters.
### Why URL Seeding vs Deep Crawling
```python
# Deep Crawling: Real-time discovery (page by page)
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
async def deep_crawl_example():
config = CrawlerRunConfig(
deep_crawl_strategy=BFSDeepCrawlStrategy(
max_depth=2,
include_external=False,
max_pages=50
)
)
async with AsyncWebCrawler() as crawler:
results = await crawler.arun("https://example.com", config=config)
print(f"Discovered {len(results)} pages dynamically")
# URL Seeding: Bulk discovery (thousands instantly)
from crawl4ai import AsyncUrlSeeder, SeedingConfig
async def url_seeding_example():
config = SeedingConfig(
source="sitemap+cc",
pattern="*/docs/*",
extract_head=True,
query="API documentation",
scoring_method="bm25",
max_urls=1000
)
async with AsyncUrlSeeder() as seeder:
urls = await seeder.urls("example.com", config)
print(f"Discovered {len(urls)} URLs instantly")
# Now crawl only the most relevant ones
```
### Basic URL Discovery
```python
import asyncio
from crawl4ai import AsyncUrlSeeder, SeedingConfig
async def basic_discovery():
# Context manager handles cleanup automatically
async with AsyncUrlSeeder() as seeder:
# Simple discovery from sitemaps
config = SeedingConfig(source="sitemap")
urls = await seeder.urls("example.com", config)
print(f"Found {len(urls)} URLs from sitemap")
for url in urls[:5]:
print(f" - {url['url']} (status: {url['status']})")
# Manual cleanup (if needed)
async def manual_cleanup():
seeder = AsyncUrlSeeder()
try:
config = SeedingConfig(source="cc") # Common Crawl
urls = await seeder.urls("example.com", config)
print(f"Found {len(urls)} URLs from Common Crawl")
finally:
await seeder.close()
asyncio.run(basic_discovery())
```
### Data Sources and Patterns
```python
# Different data sources
configs = [
SeedingConfig(source="sitemap"), # Fastest, official URLs
SeedingConfig(source="cc"), # Most comprehensive
SeedingConfig(source="sitemap+cc"), # Maximum coverage
]
# URL pattern filtering
patterns = [
SeedingConfig(pattern="*/blog/*"), # Blog posts only
SeedingConfig(pattern="*.html"), # HTML files only
SeedingConfig(pattern="*/product/*"), # Product pages
SeedingConfig(pattern="*/docs/api/*"), # API documentation
SeedingConfig(pattern="*"), # Everything
]
# Advanced pattern usage
async def pattern_filtering():
async with AsyncUrlSeeder() as seeder:
# Find all blog posts from 2024
config = SeedingConfig(
source="sitemap",
pattern="*/blog/2024/*.html",
max_urls=100
)
blog_urls = await seeder.urls("example.com", config)
# Further filter by keywords in URL
python_posts = [
url for url in blog_urls
if "python" in url['url'].lower()
]
print(f"Found {len(python_posts)} Python blog posts")
```
### SeedingConfig Parameters
```python
from crawl4ai import SeedingConfig
# Comprehensive configuration
config = SeedingConfig(
# Data sources
source="sitemap+cc", # "sitemap", "cc", "sitemap+cc"
pattern="*/docs/*", # URL pattern filter
# Metadata extraction
extract_head=True, # Get <head> metadata
live_check=True, # Verify URLs are accessible
# Performance controls
max_urls=1000, # Limit results (-1 = unlimited)
concurrency=20, # Parallel workers
hits_per_sec=10, # Rate limiting
# Relevance scoring
query="API documentation guide", # Search query
scoring_method="bm25", # Scoring algorithm
score_threshold=0.3, # Minimum relevance (0.0-1.0)
# Cache and filtering
force=False, # Bypass cache
filter_nonsense_urls=True, # Remove utility URLs
verbose=True # Debug output
)
# Quick configurations for common use cases
blog_config = SeedingConfig(
source="sitemap",
pattern="*/blog/*",
extract_head=True
)
api_docs_config = SeedingConfig(
source="sitemap+cc",
pattern="*/docs/*",
query="API reference documentation",
scoring_method="bm25",
score_threshold=0.5
)
product_pages_config = SeedingConfig(
source="cc",
pattern="*/product/*",
live_check=True,
max_urls=500
)
```
### Metadata Extraction and Analysis
```python
async def metadata_extraction():
async with AsyncUrlSeeder() as seeder:
config = SeedingConfig(
source="sitemap",
extract_head=True, # Extract <head> metadata
pattern="*/blog/*",
max_urls=50
)
urls = await seeder.urls("example.com", config)
# Analyze extracted metadata
for url in urls[:5]:
head_data = url['head_data']
print(f"\nURL: {url['url']}")
print(f"Title: {head_data.get('title', 'No title')}")
# Standard meta tags
meta = head_data.get('meta', {})
print(f"Description: {meta.get('description', 'N/A')}")
print(f"Keywords: {meta.get('keywords', 'N/A')}")
print(f"Author: {meta.get('author', 'N/A')}")
# Open Graph data
print(f"OG Image: {meta.get('og:image', 'N/A')}")
print(f"OG Type: {meta.get('og:type', 'N/A')}")
# JSON-LD structured data
jsonld = head_data.get('jsonld', [])
if jsonld:
print(f"Structured data: {len(jsonld)} items")
for item in jsonld[:2]:
if isinstance(item, dict):
print(f" Type: {item.get('@type', 'Unknown')}")
print(f" Name: {item.get('name', 'N/A')}")
# Filter by metadata
async def metadata_filtering():
async with AsyncUrlSeeder() as seeder:
config = SeedingConfig(
source="sitemap",
extract_head=True,
max_urls=100
)
urls = await seeder.urls("news.example.com", config)
# Filter by publication date (from JSON-LD)
from datetime import datetime, timedelta
recent_cutoff = datetime.now() - timedelta(days=7)
recent_articles = []
for url in urls:
for jsonld in url['head_data'].get('jsonld', []):
if isinstance(jsonld, dict) and 'datePublished' in jsonld:
try:
pub_date = datetime.fromisoformat(
jsonld['datePublished'].replace('Z', '+00:00')
)
if pub_date > recent_cutoff:
recent_articles.append(url)
break
                    except (ValueError, TypeError):
                        continue
print(f"Found {len(recent_articles)} recent articles")
```
### BM25 Relevance Scoring
```python
async def relevance_scoring():
async with AsyncUrlSeeder() as seeder:
# Find pages about Python async programming
config = SeedingConfig(
source="sitemap",
extract_head=True, # Required for content-based scoring
query="python async await concurrency",
scoring_method="bm25",
score_threshold=0.3, # Only 30%+ relevant pages
max_urls=20
)
urls = await seeder.urls("docs.python.org", config)
# Results are automatically sorted by relevance
print("Most relevant Python async content:")
for url in urls[:5]:
score = url['relevance_score']
title = url['head_data'].get('title', 'No title')
print(f"[{score:.2f}] {title}")
print(f" {url['url']}")
# URL-based scoring (when extract_head=False)
async def url_based_scoring():
async with AsyncUrlSeeder() as seeder:
config = SeedingConfig(
source="sitemap",
extract_head=False, # Fast URL-only scoring
query="machine learning tutorial",
scoring_method="bm25",
score_threshold=0.2
)
urls = await seeder.urls("example.com", config)
# Scoring based on URL structure, domain, path segments
for url in urls[:5]:
print(f"[{url['relevance_score']:.2f}] {url['url']}")
# Multi-concept queries
async def complex_queries():
queries = [
"data science pandas numpy visualization",
"web scraping automation selenium",
"machine learning tensorflow pytorch",
"api documentation rest graphql"
]
async with AsyncUrlSeeder() as seeder:
all_results = []
for query in queries:
config = SeedingConfig(
source="sitemap",
extract_head=True,
query=query,
scoring_method="bm25",
score_threshold=0.4,
max_urls=10
)
urls = await seeder.urls("learning-site.com", config)
all_results.extend(urls)
# Remove duplicates while preserving order
seen = set()
unique_results = []
for url in all_results:
if url['url'] not in seen:
seen.add(url['url'])
unique_results.append(url)
print(f"Found {len(unique_results)} unique pages across all topics")
```
### Live URL Validation
```python
async def url_validation():
async with AsyncUrlSeeder() as seeder:
config = SeedingConfig(
source="sitemap",
live_check=True, # Verify URLs are accessible
concurrency=15, # Parallel HEAD requests
hits_per_sec=8, # Rate limiting
max_urls=100
)
urls = await seeder.urls("example.com", config)
# Analyze results
valid_urls = [u for u in urls if u['status'] == 'valid']
invalid_urls = [u for u in urls if u['status'] == 'not_valid']
print(f"✅ Valid URLs: {len(valid_urls)}")
print(f"❌ Invalid URLs: {len(invalid_urls)}")
print(f"📊 Success rate: {len(valid_urls)/len(urls)*100:.1f}%")
# Show some invalid URLs for debugging
if invalid_urls:
print("\nSample invalid URLs:")
for url in invalid_urls[:3]:
print(f" - {url['url']}")
# Combined validation and metadata
async def comprehensive_validation():
async with AsyncUrlSeeder() as seeder:
config = SeedingConfig(
source="sitemap",
live_check=True, # Verify accessibility
extract_head=True, # Get metadata
query="tutorial guide", # Relevance scoring
scoring_method="bm25",
score_threshold=0.2,
concurrency=10,
max_urls=50
)
urls = await seeder.urls("docs.example.com", config)
# Filter for valid, relevant tutorials
good_tutorials = [
url for url in urls
if url['status'] == 'valid' and
url['relevance_score'] > 0.3 and
'tutorial' in url['head_data'].get('title', '').lower()
]
print(f"Found {len(good_tutorials)} high-quality tutorials")
```
### Multi-Domain Discovery
```python
async def multi_domain_research():
async with AsyncUrlSeeder() as seeder:
# Research Python tutorials across multiple sites
domains = [
"docs.python.org",
"realpython.com",
"python-course.eu",
"tutorialspoint.com"
]
config = SeedingConfig(
source="sitemap",
extract_head=True,
query="python beginner tutorial basics",
scoring_method="bm25",
score_threshold=0.3,
max_urls=15 # Per domain
)
# Discover across all domains in parallel
results = await seeder.many_urls(domains, config)
# Collect and rank all tutorials
all_tutorials = []
for domain, urls in results.items():
for url in urls:
url['domain'] = domain
all_tutorials.append(url)
# Sort by relevance across all domains
all_tutorials.sort(key=lambda x: x['relevance_score'], reverse=True)
print(f"Top 10 Python tutorials across {len(domains)} sites:")
for i, tutorial in enumerate(all_tutorials[:10], 1):
score = tutorial['relevance_score']
title = tutorial['head_data'].get('title', 'No title')[:60]
domain = tutorial['domain']
print(f"{i:2d}. [{score:.2f}] {title}")
print(f" {domain}")
# Competitor analysis
async def competitor_analysis():
competitors = ["competitor1.com", "competitor2.com", "competitor3.com"]
async with AsyncUrlSeeder() as seeder:
config = SeedingConfig(
source="sitemap",
extract_head=True,
pattern="*/blog/*",
max_urls=50
)
results = await seeder.many_urls(competitors, config)
# Analyze content strategies
for domain, urls in results.items():
content_types = {}
for url in urls:
# Extract content type from metadata
meta = url['head_data'].get('meta', {})
og_type = meta.get('og:type', 'unknown')
content_types[og_type] = content_types.get(og_type, 0) + 1
print(f"\n{domain} content distribution:")
for ctype, count in sorted(content_types.items(),
key=lambda x: x[1], reverse=True):
print(f" {ctype}: {count}")
```
### Complete Pipeline: Discovery → Filter → Crawl
```python
async def smart_research_pipeline():
"""Complete pipeline: discover URLs, filter by relevance, crawl top results"""
async with AsyncUrlSeeder() as seeder:
# Step 1: Discover relevant URLs
print("🔍 Discovering URLs...")
config = SeedingConfig(
source="sitemap+cc",
extract_head=True,
query="machine learning deep learning tutorial",
scoring_method="bm25",
score_threshold=0.4,
max_urls=100
)
urls = await seeder.urls("example.com", config)
print(f" Found {len(urls)} relevant URLs")
# Step 2: Select top articles
top_articles = sorted(urls,
key=lambda x: x['relevance_score'],
reverse=True)[:10]
print(f" Selected top {len(top_articles)} for crawling")
# Step 3: Show what we're about to crawl
print("\n📋 Articles to crawl:")
for i, article in enumerate(top_articles, 1):
score = article['relevance_score']
title = article['head_data'].get('title', 'No title')[:60]
print(f" {i}. [{score:.2f}] {title}")
# Step 4: Crawl selected articles
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
print(f"\n🕷 Crawling {len(top_articles)} articles...")
async with AsyncWebCrawler() as crawler:
config = CrawlerRunConfig(
only_text=True,
word_count_threshold=200,
stream=True # Process results as they come
)
# Extract URLs and crawl
article_urls = [article['url'] for article in top_articles]
crawled_count = 0
async for result in await crawler.arun_many(article_urls, config=config):
if result.success:
crawled_count += 1
word_count = len(result.markdown.raw_markdown.split())
print(f" ✅ [{crawled_count}/{len(article_urls)}] "
f"{word_count} words from {result.url[:50]}...")
else:
print(f" ❌ Failed: {result.url[:50]}...")
print(f"\n✨ Successfully crawled {crawled_count} articles!")
asyncio.run(smart_research_pipeline())
```
### Advanced Features and Performance
```python
# Cache management
async def cache_management():
async with AsyncUrlSeeder() as seeder:
# First run - populate cache
config = SeedingConfig(
source="sitemap",
extract_head=True,
force=True # Bypass cache, fetch fresh
)
urls = await seeder.urls("example.com", config)
# Subsequent runs - use cache (much faster)
config = SeedingConfig(
source="sitemap",
extract_head=True,
force=False # Use cache
)
urls = await seeder.urls("example.com", config)
# Performance optimization
async def performance_tuning():
async with AsyncUrlSeeder() as seeder:
# High-performance configuration
config = SeedingConfig(
source="cc",
concurrency=50, # Many parallel workers
hits_per_sec=20, # High rate limit
max_urls=10000, # Large dataset
extract_head=False, # Skip metadata for speed
filter_nonsense_urls=True # Auto-filter utility URLs
)
import time
start = time.time()
urls = await seeder.urls("large-site.com", config)
elapsed = time.time() - start
print(f"Processed {len(urls)} URLs in {elapsed:.2f}s")
print(f"Speed: {len(urls)/elapsed:.0f} URLs/second")
# Memory-safe processing for large domains
async def large_domain_processing():
async with AsyncUrlSeeder() as seeder:
# Safe for domains with 1M+ URLs
config = SeedingConfig(
source="cc+sitemap",
concurrency=50, # Bounded queue adapts to this
max_urls=100000, # Process in batches
filter_nonsense_urls=True
)
# The seeder automatically manages memory by:
# - Using bounded queues (prevents RAM spikes)
# - Applying backpressure when queue is full
# - Processing URLs as they're discovered
urls = await seeder.urls("huge-site.com", config)
# Configuration cloning and reuse
config_base = SeedingConfig(
source="sitemap",
extract_head=True,
concurrency=20
)
# Create variations
blog_config = config_base.clone(pattern="*/blog/*")
docs_config = config_base.clone(
pattern="*/docs/*",
query="API documentation",
scoring_method="bm25"
)
fast_config = config_base.clone(
extract_head=False,
concurrency=100,
hits_per_sec=50
)
```
### Troubleshooting and Best Practices
```python
# Common issues and solutions
async def troubleshooting_guide():
async with AsyncUrlSeeder() as seeder:
# Issue: No URLs found
try:
config = SeedingConfig(source="sitemap", pattern="*/nonexistent/*")
urls = await seeder.urls("example.com", config)
if not urls:
# Solution: Try broader pattern or different source
config = SeedingConfig(source="cc+sitemap", pattern="*")
urls = await seeder.urls("example.com", config)
except Exception as e:
print(f"Discovery failed: {e}")
# Issue: Slow performance
config = SeedingConfig(
source="sitemap", # Faster than CC
concurrency=10, # Reduce if hitting rate limits
hits_per_sec=5, # Add rate limiting
extract_head=False # Skip if metadata not needed
)
# Issue: Low relevance scores
config = SeedingConfig(
query="specific detailed query terms",
score_threshold=0.1, # Lower threshold
scoring_method="bm25"
)
# Issue: Memory issues with large sites
config = SeedingConfig(
max_urls=10000, # Limit results
concurrency=20, # Reduce concurrency
source="sitemap" # Use sitemap only
)
# Performance benchmarks
print("""
Typical performance on standard connection:
- Sitemap discovery: 100-1,000 URLs/second
- Common Crawl discovery: 50-500 URLs/second
- HEAD checking: 10-50 URLs/second
- Head extraction: 5-20 URLs/second
- BM25 scoring: 10,000+ URLs/second
""")
# Best practices
best_practices = """
✅ Use context manager: async with AsyncUrlSeeder() as seeder
✅ Start with sitemaps (faster), add CC if needed
✅ Use extract_head=True only when you need metadata
✅ Set reasonable max_urls to limit processing
✅ Add rate limiting for respectful crawling
✅ Cache results with force=False for repeated operations
✅ Filter nonsense URLs (enabled by default)
✅ Use specific patterns to reduce irrelevant results
"""
```
**📖 Learn more:** [Complete URL Seeding Guide](https://docs.crawl4ai.com/core/url-seeding/), [SeedingConfig Reference](https://docs.crawl4ai.com/api/parameters/), [Multi-URL Crawling](https://docs.crawl4ai.com/advanced/multi-url-crawling/)