feat: add Script Builder to Chrome Extension and reorganize LLM context files
This commit introduces significant enhancements to the Crawl4AI ecosystem:

Chrome Extension - Script Builder (Alpha):
- Add recording functionality to capture user interactions (clicks, typing, scrolling)
- Implement smart event grouping for cleaner script generation
- Support export to both JavaScript and C4A script formats
- Add timeline view for visualizing and editing recorded actions
- Include wait commands (time-based and element-based)
- Add saved flows functionality for reusing automation scripts
- Update UI with consistent dark terminal theme (Dank Mono font, green/pink accents)
- Release new extension versions: v1.1.0, v1.2.0, v1.2.1

LLM Context Builder Improvements:
- Reorganize context files from llmtxt/ to llm.txt/ with better structure
- Separate diagram templates from text content (diagrams/ and txt/ subdirectories)
- Add comprehensive context files for all major Crawl4AI components
- Improve file naming convention for better discoverability

Documentation Updates:
- Update apps index page to match main documentation theme
- Standardize color scheme: "Available" tags use primary color (#50ffff)
- Change "Coming Soon" tags to dark gray for better visual hierarchy
- Add interactive two-column layout for extension landing page
- Include code examples for both Schema Builder and Script Builder features

Technical Improvements:
- Enhance event capture mechanism with better element selection
- Add support for contenteditable elements and complex form interactions
- Implement proper scroll event handling for both window and element scrolling
- Add meta key support for keyboard shortcuts
- Improve selector generation for more reliable element targeting

The Script Builder is released as Alpha, acknowledging potential bugs while providing early access to this powerful automation recording feature.
This commit is contained in:
docs/md_v2/assets/llm.txt/diagrams/cli.txt (new file, 425 lines)
@@ -0,0 +1,425 @@
## CLI Workflows and Profile Management

Visual representations of command-line interface operations, browser profile management, and identity-based crawling workflows.

### CLI Command Flow Architecture

```mermaid
flowchart TD
    A[crwl command] --> B{Command Type?}

    B -->|URL Crawling| C[Parse URL & Options]
    B -->|Profile Management| D[profiles subcommand]
    B -->|CDP Browser| E[cdp subcommand]
    B -->|Browser Control| F[browser subcommand]
    B -->|Configuration| G[config subcommand]

    C --> C1{Output Format?}
    C1 -->|Default| C2[HTML/Markdown]
    C1 -->|JSON| C3[Structured Data]
    C1 -->|markdown| C4[Clean Markdown]
    C1 -->|markdown-fit| C5[Filtered Content]

    C --> C6{Authentication?}
    C6 -->|Profile Specified| C7[Load Browser Profile]
    C6 -->|No Profile| C8[Anonymous Session]

    C7 --> C9[Launch with User Data]
    C8 --> C10[Launch Clean Browser]

    C9 --> C11[Execute Crawl]
    C10 --> C11

    C11 --> C12{Success?}
    C12 -->|Yes| C13[Return Results]
    C12 -->|No| C14[Error Handling]

    D --> D1[Interactive Profile Menu]
    D1 --> D2{Menu Choice?}
    D2 -->|Create| D3[Open Browser for Setup]
    D2 -->|List| D4[Show Existing Profiles]
    D2 -->|Delete| D5[Remove Profile]
    D2 -->|Use| D6[Crawl with Profile]

    E --> E1[Launch CDP Browser]
    E1 --> E2[Remote Debugging Active]

    F --> F1{Browser Action?}
    F1 -->|start| F2[Start Builtin Browser]
    F1 -->|stop| F3[Stop Builtin Browser]
    F1 -->|status| F4[Check Browser Status]
    F1 -->|view| F5[Open Browser Window]

    G --> G1{Config Action?}
    G1 -->|list| G2[Show All Settings]
    G1 -->|set| G3[Update Setting]
    G1 -->|get| G4[Read Setting]

    style A fill:#e1f5fe
    style C13 fill:#c8e6c9
    style C14 fill:#ffcdd2
    style D3 fill:#fff3e0
    style E2 fill:#f3e5f5
```
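
Every branch of this flow starts from a single `crwl` invocation, so it can also be driven from Python. A minimal sketch via `subprocess`; the `-o` and `-p` flags are taken from the examples later in this document and should be verified against `crwl --help` for your installed version:

```python
import subprocess

# Hedged sketch: drive the crwl CLI from Python. Flags mirror the
# examples in this document (crwl URL -p profile -o json).
def crawl(url: str, profile: str | None = None, output: str = "markdown") -> str:
    cmd = ["crwl", url, "-o", output]
    if profile:
        cmd += ["-p", profile]  # authenticated session via a saved profile
    proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return proc.stdout

if __name__ == "__main__":
    print(crawl("https://example.com", output="json"))
```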

### Profile Management Workflow

```mermaid
sequenceDiagram
    participant User
    participant CLI
    participant ProfileManager
    participant Browser
    participant FileSystem

    User->>CLI: crwl profiles
    CLI->>ProfileManager: Initialize profile manager
    ProfileManager->>FileSystem: Scan for existing profiles
    FileSystem-->>ProfileManager: Profile list
    ProfileManager-->>CLI: Show interactive menu
    CLI-->>User: Display options

    Note over User: User selects "Create new profile"

    User->>CLI: Create profile "linkedin-auth"
    CLI->>ProfileManager: create_profile("linkedin-auth")
    ProfileManager->>FileSystem: Create profile directory
    ProfileManager->>Browser: Launch with new user data dir
    Browser-->>User: Opens browser window

    Note over User: User manually logs in to LinkedIn

    User->>Browser: Navigate and authenticate
    Browser->>FileSystem: Save cookies, session data
    User->>CLI: Press 'q' to save profile
    CLI->>ProfileManager: finalize_profile()
    ProfileManager->>FileSystem: Lock profile settings
    ProfileManager-->>CLI: Profile saved
    CLI-->>User: Profile "linkedin-auth" created

    Note over User: Later usage

    User->>CLI: crwl https://linkedin.com/feed -p linkedin-auth
    CLI->>ProfileManager: load_profile("linkedin-auth")
    ProfileManager->>FileSystem: Read profile data
    FileSystem-->>ProfileManager: User data directory
    ProfileManager-->>CLI: Profile configuration
    CLI->>Browser: Launch with existing profile
    Browser-->>CLI: Authenticated session ready
    CLI->>Browser: Navigate to target URL
    Browser-->>CLI: Crawl results with auth context
    CLI-->>User: Authenticated content
```
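
In the Python SDK the same idea maps to a persistent user-data directory. A minimal sketch, assuming the `BrowserConfig(user_data_dir=..., use_persistent_context=True)` parameters from the Crawl4AI docs and a profile previously created with `crwl profiles`; the profile path is an assumption:

```python
import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler, BrowserConfig

# Hedged sketch: reuse a profile created via `crwl profiles`.
# The ~/.crawl4ai/profiles location is an assumption; check your version.
profile_dir = Path.home() / ".crawl4ai" / "profiles" / "linkedin-auth"

browser_config = BrowserConfig(
    headless=True,
    use_persistent_context=True,   # keep cookies/session between runs
    user_data_dir=str(profile_dir),
)

async def main():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://linkedin.com/feed")
        print(result.markdown[:500])

asyncio.run(main())
```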

### Browser Management State Machine

```mermaid
stateDiagram-v2
    [*] --> Stopped: Initial state

    Stopped --> Starting: crwl browser start
    Starting --> Running: Browser launched
    Running --> Viewing: crwl browser view
    Viewing --> Running: Close window
    Running --> Stopping: crwl browser stop
    Stopping --> Stopped: Cleanup complete

    Running --> Restarting: crwl browser restart
    Restarting --> Running: New browser instance

    Stopped --> CDP_Mode: crwl cdp
    CDP_Mode --> CDP_Running: Remote debugging active
    CDP_Running --> CDP_Mode: Manual close
    CDP_Mode --> Stopped: Exit CDP

    Running --> StatusCheck: crwl browser status
    StatusCheck --> Running: Return status

    note right of Running : Port 9222 active\nBuiltin browser available
    note right of CDP_Running : Remote debugging\nManual control enabled
    note right of Viewing : Visual browser window\nDirect interaction
```

### Authentication Workflow for Protected Sites

```mermaid
flowchart TD
    A[Protected Site Access Needed] --> B[Create Profile Strategy]

    B --> C{Existing Profile?}
    C -->|Yes| D[Test Profile Validity]
    C -->|No| E[Create New Profile]

    D --> D1{Profile Valid?}
    D1 -->|Yes| F[Use Existing Profile]
    D1 -->|No| E

    E --> E1[crwl profiles]
    E1 --> E2[Select Create New Profile]
    E2 --> E3[Enter Profile Name]
    E3 --> E4[Browser Opens for Auth]

    E4 --> E5{Authentication Method?}
    E5 -->|Login Form| E6[Fill Username/Password]
    E5 -->|OAuth| E7[OAuth Flow]
    E5 -->|2FA| E8[Handle 2FA]
    E5 -->|Session Cookie| E9[Import Cookies]

    E6 --> E10[Manual Login Process]
    E7 --> E10
    E8 --> E10
    E9 --> E10

    E10 --> E11[Verify Authentication]
    E11 --> E12{Auth Successful?}
    E12 -->|Yes| E13[Save Profile - Press q]
    E12 -->|No| E10

    E13 --> F
    F --> G[Execute Authenticated Crawl]

    G --> H[crwl URL -p profile-name]
    H --> I[Load Profile Data]
    I --> J[Launch Browser with Auth]
    J --> K[Navigate to Protected Content]
    K --> L[Extract Authenticated Data]
    L --> M[Return Results]

    style E4 fill:#fff3e0
    style E10 fill:#e3f2fd
    style F fill:#e8f5e8
    style M fill:#c8e6c9
```

### CDP Browser Architecture

```mermaid
graph TB
    subgraph "CLI Layer"
        A[crwl cdp command] --> B[CDP Manager]
        B --> C[Port Configuration]
        B --> D[Profile Selection]
    end

    subgraph "Browser Process"
        E[Chromium/Firefox] --> F[Remote Debugging]
        F --> G[WebSocket Endpoint]
        G --> H[ws://localhost:9222]
    end

    subgraph "Client Connections"
        I[Manual Browser Control] --> H
        J[DevTools Interface] --> H
        K[External Automation] --> H
        L[Crawl4AI Crawler] --> H
    end

    subgraph "Profile Data"
        M[User Data Directory] --> E
        N[Cookies & Sessions] --> M
        O[Extensions] --> M
        P[Browser State] --> M
    end

    A --> E
    C --> H
    D --> M

    style H fill:#e3f2fd
    style E fill:#f3e5f5
    style M fill:#e8f5e8
```
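
Any CDP-capable client can attach to the debugging endpoint shown above. A minimal sketch using Playwright (the framework Crawl4AI runs on, per the Docker section below), assuming a browser started by `crwl cdp` is listening on port 9222:

```python
import asyncio
from playwright.async_api import async_playwright

# Hedged sketch: attach an external automation client to the CDP
# endpoint exposed by `crwl cdp` (port 9222 per the diagram above).
async def main():
    async with async_playwright() as p:
        browser = await p.chromium.connect_over_cdp("http://localhost:9222")
        context = browser.contexts[0] if browser.contexts else await browser.new_context()
        page = await context.new_page()
        await page.goto("https://example.com")
        print(await page.title())

asyncio.run(main())
```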

### Configuration Management Hierarchy

```mermaid
graph TD
    subgraph "Global Configuration"
        A[~/.crawl4ai/config.yml] --> B[Default Settings]
        B --> C[LLM Providers]
        B --> D[Browser Defaults]
        B --> E[Output Preferences]
    end

    subgraph "Profile Configuration"
        F[Profile Directory] --> G[Browser State]
        F --> H[Authentication Data]
        F --> I[Site-Specific Settings]
    end

    subgraph "Command-Line Overrides"
        J[-b browser_config] --> K[Runtime Browser Settings]
        L[-c crawler_config] --> M[Runtime Crawler Settings]
        N[-o output_format] --> O[Runtime Output Format]
    end

    subgraph "Configuration Files"
        P[browser.yml] --> Q[Browser Config Template]
        R[crawler.yml] --> S[Crawler Config Template]
        T[extract.yml] --> U[Extraction Config]
    end

    subgraph "Resolution Order"
        V[Command Line Args] --> W[Config Files]
        W --> X[Profile Settings]
        X --> Y[Global Defaults]
    end

    J --> V
    L --> V
    N --> V
    P --> W
    R --> W
    T --> W
    F --> X
    A --> Y

    style V fill:#ffcdd2
    style W fill:#fff3e0
    style X fill:#e3f2fd
    style Y fill:#e8f5e8
```
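
The resolution order above (command-line args over config files over profile settings over global defaults) is just a layered merge. An illustrative sketch; the keys are made up for demonstration, not the actual config schema:

```python
# Hedged sketch of the resolution order: merge from the bottom of the
# hierarchy upward so higher-priority layers overwrite lower ones.
def resolve(cli_args: dict, config_files: dict, profile: dict, global_defaults: dict) -> dict:
    settings: dict = {}
    for layer in (global_defaults, profile, config_files, cli_args):
        settings.update({k: v for k, v in layer.items() if v is not None})
    return settings

effective = resolve(
    cli_args={"output_format": "json"},          # -o json
    config_files={"headless": True},             # browser.yml
    profile={"user_data_dir": "~/.crawl4ai/profiles/linkedin-auth"},
    global_defaults={"output_format": "markdown", "headless": False},
)
assert effective["output_format"] == "json"      # CLI args win over defaults
```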

### Identity-Based Crawling Decision Tree

```mermaid
flowchart TD
    A[Target Website Assessment] --> B{Authentication Required?}

    B -->|No| C[Standard Anonymous Crawl]
    B -->|Yes| D{Authentication Type?}

    D -->|Login Form| E[Create Login Profile]
    D -->|OAuth/SSO| F[Create OAuth Profile]
    D -->|API Key/Token| G[Use Headers/Config]
    D -->|Session Cookies| H[Import Cookie Profile]

    E --> E1[crwl profiles → Manual login]
    F --> F1[crwl profiles → OAuth flow]
    G --> G1[Configure headers in crawler config]
    H --> H1[Import cookies to profile]

    E1 --> I[Test Authentication]
    F1 --> I
    G1 --> I
    H1 --> I

    I --> J{Auth Test Success?}
    J -->|Yes| K[Production Crawl Setup]
    J -->|No| L[Debug Authentication]

    L --> L1{Common Issues?}
    L1 -->|Rate Limiting| L2[Add delays/user simulation]
    L1 -->|Bot Detection| L3[Enable stealth mode]
    L1 -->|Session Expired| L4[Refresh authentication]
    L1 -->|CAPTCHA| L5[Manual intervention needed]

    L2 --> M[Retry with Adjustments]
    L3 --> M
    L4 --> E1
    L5 --> N[Semi-automated approach]

    M --> I
    N --> O[Manual auth + automated crawl]

    K --> P[Automated Authenticated Crawling]
    O --> P
    C --> P

    P --> Q[Monitor & Maintain Profiles]

    style I fill:#fff3e0
    style K fill:#e8f5e8
    style P fill:#c8e6c9
    style L fill:#ffcdd2
    style N fill:#f3e5f5
```

### CLI Usage Patterns and Best Practices

```mermaid
timeline
    title CLI Workflow Evolution

    section Setup Phase
        Installation : pip install crawl4ai
                     : crawl4ai-setup
        Basic Test : crwl https://example.com
        Config Setup : crwl config set defaults

    section Profile Creation
        Site Analysis : Identify auth requirements
        Profile Creation : crwl profiles
        Manual Login : Authenticate in browser
        Profile Save : Press 'q' to save

    section Development Phase
        Test Crawls : crwl URL -p profile -v
        Config Tuning : Adjust browser/crawler settings
        Output Testing : Try different output formats
        Error Handling : Debug authentication issues

    section Production Phase
        Automated Crawls : crwl URL -p profile -o json
        Batch Processing : Multiple URLs with same profile
        Monitoring : Check profile validity
        Maintenance : Update profiles as needed
```
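
The production phase above mentions batch processing of multiple URLs against one profile. In the Python SDK that maps to `arun_many`; a minimal sketch, reusing the persistent-profile configuration from earlier (the profile path is still an assumption):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

# Hedged sketch of the "Batch Processing" step: many URLs, one profile.
config = BrowserConfig(
    headless=True,
    use_persistent_context=True,
    user_data_dir="~/.crawl4ai/profiles/linkedin-auth",  # illustrative path
)

async def main():
    urls = [
        "https://linkedin.com/feed",
        "https://linkedin.com/notifications",
    ]
    async with AsyncWebCrawler(config=config) as crawler:
        results = await crawler.arun_many(urls)
        for r in results:
            print(r.url, "ok" if r.success else r.error_message)

asyncio.run(main())
```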

### Multi-Profile Management Strategy

```mermaid
graph LR
    subgraph "Profile Categories"
        A[Social Media Profiles]
        B[Work/Enterprise Profiles]
        C[E-commerce Profiles]
        D[Research Profiles]
    end

    subgraph "Social Media"
        A --> A1[linkedin-personal]
        A --> A2[twitter-monitor]
        A --> A3[facebook-research]
        A --> A4[instagram-brand]
    end

    subgraph "Enterprise"
        B --> B1[company-intranet]
        B --> B2[github-enterprise]
        B --> B3[confluence-docs]
        B --> B4[jira-tickets]
    end

    subgraph "E-commerce"
        C --> C1[amazon-seller]
        C --> C2[shopify-admin]
        C --> C3[ebay-monitor]
        C --> C4[marketplace-competitor]
    end

    subgraph "Research"
        D --> D1[academic-journals]
        D --> D2[data-platforms]
        D --> D3[survey-tools]
        D --> D4[government-portals]
    end

    subgraph "Usage Patterns"
        E[Daily Monitoring] --> A2
        E --> B1
        F[Weekly Reports] --> C3
        F --> D2
        G[On-Demand Research] --> D1
        G --> D4
        H[Competitive Analysis] --> C4
        H --> A4
    end

    style A1 fill:#e3f2fd
    style B1 fill:#f3e5f5
    style C1 fill:#e8f5e8
    style D1 fill:#fff3e0
```

**📖 Learn more:** [CLI Reference](https://docs.crawl4ai.com/core/cli/), [Identity-Based Crawling](https://docs.crawl4ai.com/advanced/identity-based-crawling/), [Profile Management](https://docs.crawl4ai.com/advanced/session-management/), [Authentication Strategies](https://docs.crawl4ai.com/advanced/hooks-auth/)

docs/md_v2/assets/llm.txt/diagrams/config_objects.txt (new file, 1421 lines)
File diff suppressed because it is too large

@@ -0,0 +1,401 @@
## Deep Crawling Filters & Scorers Architecture

Visual representations of advanced URL filtering, scoring strategies, and performance optimization workflows for intelligent deep crawling.

### Filter Chain Processing Pipeline

```mermaid
flowchart TD
    A[URL Input] --> B{Domain Filter}
    B -->|✓ Pass| C{Pattern Filter}
    B -->|✗ Fail| X1[Reject: Invalid Domain]

    C -->|✓ Pass| D{Content Type Filter}
    C -->|✗ Fail| X2[Reject: Pattern Mismatch]

    D -->|✓ Pass| E{SEO Filter}
    D -->|✗ Fail| X3[Reject: Wrong Content Type]

    E -->|✓ Pass| F{Content Relevance Filter}
    E -->|✗ Fail| X4[Reject: Low SEO Score]

    F -->|✓ Pass| G[URL Accepted]
    F -->|✗ Fail| X5[Reject: Low Relevance]

    G --> H[Add to Crawl Queue]

    subgraph "Fast Filters"
        B
        C
        D
    end

    subgraph "Slow Filters"
        E
        F
    end

    style A fill:#e3f2fd
    style G fill:#c8e6c9
    style H fill:#e8f5e8
    style X1 fill:#ffcdd2
    style X2 fill:#ffcdd2
    style X3 fill:#ffcdd2
    style X4 fill:#ffcdd2
    style X5 fill:#ffcdd2
```
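
A minimal sketch of this pipeline in the Python SDK, assuming the filter class names shown in the diagram live under `crawl4ai.deep_crawling.filters`; parameter names follow the Crawl4AI docs and should be checked against your installed version:

```python
from crawl4ai.deep_crawling.filters import (
    FilterChain,
    DomainFilter,
    URLPatternFilter,
    ContentTypeFilter,
)

# Hedged sketch: chain the three "fast filters" from the diagram.
filter_chain = FilterChain([
    DomainFilter(allowed_domains=["docs.crawl4ai.com"]),   # domain gate
    URLPatternFilter(patterns=["*core*", "*advanced*"]),   # URL shape gate
    ContentTypeFilter(allowed_types=["text/html"]),        # MIME gate
])
```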

### URL Scoring System Architecture

```mermaid
graph TB
    subgraph "Input URL"
        A[https://python.org/tutorial/2024/ml-guide.html]
    end

    subgraph "Individual Scorers"
        B[Keyword Relevance Scorer]
        C[Path Depth Scorer]
        D[Content Type Scorer]
        E[Freshness Scorer]
        F[Domain Authority Scorer]
    end

    subgraph "Scoring Process"
        B --> B1[Keywords: python, tutorial, ml<br/>Score: 0.85]
        C --> C1[Depth: 4 levels<br/>Optimal: 3<br/>Score: 0.75]
        D --> D1[Content: HTML<br/>Score: 1.0]
        E --> E1[Year: 2024<br/>Score: 1.0]
        F --> F1[Domain: python.org<br/>Score: 1.0]
    end

    subgraph "Composite Scoring"
        G[Weighted Combination]
        B1 --> G
        C1 --> G
        D1 --> G
        E1 --> G
        F1 --> G
    end

    subgraph "Final Result"
        H[Composite Score: 0.92]
        I{Score > Threshold?}
        J[Accept URL]
        K[Reject URL]
    end

    A --> B
    A --> C
    A --> D
    A --> E
    A --> F

    G --> H
    H --> I
    I -->|✓ 0.92 > 0.6| J
    I -->|✗ Score too low| K

    style A fill:#e3f2fd
    style G fill:#fff3e0
    style H fill:#e8f5e8
    style J fill:#c8e6c9
    style K fill:#ffcdd2
```
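
The keyword scorer in the top-left of the diagram is typically the one configured directly. A minimal sketch, assuming the constructor signature from the Crawl4AI docs:

```python
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

# Hedged sketch: score URLs by keyword hits, as in the diagram above.
# The weight parameter controls this scorer's share in a composite.
scorer = KeywordRelevanceScorer(
    keywords=["python", "tutorial", "ml"],
    weight=0.7,
)
```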

### Filter vs Scorer Decision Matrix

```mermaid
flowchart TD
    A[URL Processing Decision] --> B{Binary Decision Needed?}

    B -->|Yes - Include/Exclude| C[Use Filters]
    B -->|No - Quality Rating| D[Use Scorers]

    C --> C1{Filter Type Needed?}
    C1 -->|Domain Control| C2[DomainFilter]
    C1 -->|Pattern Matching| C3[URLPatternFilter]
    C1 -->|Content Type| C4[ContentTypeFilter]
    C1 -->|SEO Quality| C5[SEOFilter]
    C1 -->|Content Relevance| C6[ContentRelevanceFilter]

    D --> D1{Scoring Criteria?}
    D1 -->|Keyword Relevance| D2[KeywordRelevanceScorer]
    D1 -->|URL Structure| D3[PathDepthScorer]
    D1 -->|Content Quality| D4[ContentTypeScorer]
    D1 -->|Time Sensitivity| D5[FreshnessScorer]
    D1 -->|Source Authority| D6[DomainAuthorityScorer]

    C2 --> E[Chain Filters]
    C3 --> E
    C4 --> E
    C5 --> E
    C6 --> E

    D2 --> F[Composite Scorer]
    D3 --> F
    D4 --> F
    D5 --> F
    D6 --> F

    E --> G[Binary Output: Pass/Fail]
    F --> H[Numeric Score: 0.0-1.0]

    G --> I[Apply to URL Queue]
    H --> J[Priority Ranking]

    style C fill:#e8f5e8
    style D fill:#fff3e0
    style E fill:#f3e5f5
    style F fill:#e3f2fd
    style G fill:#c8e6c9
    style H fill:#ffecb3
```

### Performance Optimization Strategy

```mermaid
sequenceDiagram
    participant Queue as URL Queue
    participant Fast as Fast Filters
    participant Slow as Slow Filters
    participant Score as Scorers
    participant Output as Filtered URLs

    Note over Queue, Output: Batch Processing (1000 URLs)

    Queue->>Fast: Apply Domain Filter
    Fast-->>Queue: 60% passed (600 URLs)

    Queue->>Fast: Apply Pattern Filter
    Fast-->>Queue: 70% passed (420 URLs)

    Queue->>Fast: Apply Content Type Filter
    Fast-->>Queue: 90% passed (378 URLs)

    Note over Fast: Fast filters eliminate 62% of URLs

    Queue->>Slow: Apply SEO Filter (378 URLs)
    Slow-->>Queue: 80% passed (302 URLs)

    Queue->>Slow: Apply Relevance Filter
    Slow-->>Queue: 75% passed (227 URLs)

    Note over Slow: Content analysis on remaining URLs

    Queue->>Score: Calculate Composite Scores
    Score-->>Queue: Scored and ranked

    Queue->>Output: Top 100 URLs by score
    Output-->>Queue: Processing complete

    Note over Queue, Output: Total: 90% filtered out, 10% high-quality URLs retained
```

### Custom Filter Implementation Flow

```mermaid
stateDiagram-v2
    [*] --> Planning

    Planning --> IdentifyNeeds: Define filtering criteria
    IdentifyNeeds --> ChooseType: Binary vs Scoring decision

    ChooseType --> FilterImpl: Binary decision needed
    ChooseType --> ScorerImpl: Quality rating needed

    FilterImpl --> InheritURLFilter: Extend URLFilter base class
    ScorerImpl --> InheritURLScorer: Extend URLScorer base class

    InheritURLFilter --> ImplementApply: def apply(url) -> bool
    InheritURLScorer --> ImplementScore: def _calculate_score(url) -> float

    ImplementApply --> AddLogic: Add custom filtering logic
    ImplementScore --> AddLogic

    AddLogic --> TestFilter: Unit testing
    TestFilter --> OptimizePerf: Performance optimization

    OptimizePerf --> Integration: Integrate with FilterChain
    Integration --> Production: Deploy to production

    Production --> Monitor: Monitor performance
    Monitor --> Tune: Tune parameters
    Tune --> Production

    note right of Planning : Consider performance impact
    note right of AddLogic : Handle edge cases
    note right of OptimizePerf : Cache frequently accessed data
```
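
A minimal sketch of the "extend URLFilter, implement `apply(url) -> bool`" path from the state machine above. The base-class import path is an assumption; the filtering logic itself is illustrative:

```python
from urllib.parse import urlparse

from crawl4ai.deep_crawling.filters import URLFilter

# Hedged sketch: a custom binary filter per the diagram's contract.
class BlockTrackingParamsFilter(URLFilter):
    """Reject URLs that carry common tracking query parameters."""

    TRACKING_KEYS = ("utm_", "fbclid", "gclid")

    def apply(self, url: str) -> bool:
        # Return True to keep the URL, False to reject it.
        query = urlparse(url).query.lower()
        return not any(key in query for key in self.TRACKING_KEYS)
```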

### Filter Chain Optimization Patterns

```mermaid
graph TB
    subgraph "Naive Approach - Poor Performance"
        A1[All URLs] --> B1[Slow Filter 1]
        B1 --> C1[Slow Filter 2]
        C1 --> D1[Fast Filter 1]
        D1 --> E1[Fast Filter 2]
        E1 --> F1[Final Results]

        B1 -.->|High CPU| G1[Performance Issues]
        C1 -.->|Network Calls| G1
    end

    subgraph "Optimized Approach - High Performance"
        A2[All URLs] --> B2[Fast Filter 1]
        B2 --> C2[Fast Filter 2]
        C2 --> D2[Batch Process]
        D2 --> E2[Slow Filter 1]
        E2 --> F2[Slow Filter 2]
        F2 --> G2[Final Results]

        D2 --> H2[Concurrent Processing]
        H2 --> I2[Semaphore Control]
    end

    subgraph "Performance Metrics"
        J[Processing Time]
        K[Memory Usage]
        L[CPU Utilization]
        M[Network Requests]
    end

    G1 -.-> J
    G1 -.-> K
    G1 -.-> L
    G1 -.-> M

    G2 -.-> J
    G2 -.-> K
    G2 -.-> L
    G2 -.-> M

    style A1 fill:#ffcdd2
    style G1 fill:#ffcdd2
    style A2 fill:#c8e6c9
    style G2 fill:#c8e6c9
    style H2 fill:#e8f5e8
    style I2 fill:#e8f5e8
```

### Composite Scoring Weight Distribution

```mermaid
pie title Composite Scorer Weight Distribution
    "Keyword Relevance (30%)" : 30
    "Domain Authority (25%)" : 25
    "Content Type (20%)" : 20
    "Freshness (15%)" : 15
    "Path Depth (10%)" : 10
```
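
The composite score is a plain weighted sum. A worked sketch using the exact weights from the pie chart and the per-scorer values from the "URL Scoring System Architecture" diagram (the 0.92 shown there implies slightly different weights, so this combination lands at 0.93):

```python
# Hedged sketch: weighted combination with the pie chart's weights.
WEIGHTS = {
    "keyword_relevance": 0.30,
    "domain_authority": 0.25,
    "content_type": 0.20,
    "freshness": 0.15,
    "path_depth": 0.10,
}

# Example per-scorer values from the scoring architecture diagram.
scores = {
    "keyword_relevance": 0.85,
    "domain_authority": 1.0,
    "content_type": 1.0,
    "freshness": 1.0,
    "path_depth": 0.75,
}

composite = sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)
print(round(composite, 3))  # 0.93 with these weights
```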

### Deep Crawl Integration Architecture

```mermaid
graph TD
    subgraph "Deep Crawl Strategy"
        A[Start URL] --> B[Extract Links]
        B --> C[Apply Filter Chain]
        C --> D[Calculate Scores]
        D --> E[Priority Queue]
        E --> F[Crawl Next URL]
        F --> B
    end

    subgraph "Filter Chain Components"
        C --> C1[Domain Filter]
        C --> C2[Pattern Filter]
        C --> C3[Content Filter]
        C --> C4[SEO Filter]
        C --> C5[Relevance Filter]
    end

    subgraph "Scoring Components"
        D --> D1[Keyword Scorer]
        D --> D2[Depth Scorer]
        D --> D3[Freshness Scorer]
        D --> D4[Authority Scorer]
        D --> D5[Composite Score]
    end

    subgraph "Queue Management"
        E --> E1{Score > Threshold?}
        E1 -->|Yes| E2[High Priority Queue]
        E1 -->|No| E3[Low Priority Queue]
        E2 --> F
        E3 --> G[Delayed Processing]
    end

    subgraph "Control Flow"
        H{Max Depth Reached?}
        I{Max Pages Reached?}
        J[Stop Crawling]
    end

    F --> H
    H -->|No| I
    H -->|Yes| J
    I -->|No| B
    I -->|Yes| J

    style A fill:#e3f2fd
    style E2 fill:#c8e6c9
    style E3 fill:#fff3e0
    style J fill:#ffcdd2
```

### Filter Performance Comparison

```mermaid
xychart-beta
    title "Filter Performance Comparison (1000 URLs)"
    x-axis [Domain, Pattern, ContentType, SEO, Relevance]
    y-axis "Processing Time (ms)" 0 --> 1000
    bar [50, 80, 45, 300, 800]
```

### Scoring Algorithm Workflow

```mermaid
flowchart TD
    A[Input URL] --> B[Parse URL Components]
    B --> C[Extract Features]

    C --> D[Domain Analysis]
    C --> E[Path Analysis]
    C --> F[Content Type Detection]
    C --> G[Keyword Extraction]
    C --> H[Freshness Detection]

    D --> I[Domain Authority Score]
    E --> J[Path Depth Score]
    F --> K[Content Type Score]
    G --> L[Keyword Relevance Score]
    H --> M[Freshness Score]

    I --> N[Apply Weights]
    J --> N
    K --> N
    L --> N
    M --> N

    N --> O[Normalize Scores]
    O --> P[Calculate Final Score]
    P --> Q{Score >= Threshold?}

    Q -->|Yes| R[Accept for Crawling]
    Q -->|No| S[Reject URL]

    R --> T[Add to Priority Queue]
    S --> U[Log Rejection Reason]

    style A fill:#e3f2fd
    style P fill:#fff3e0
    style R fill:#c8e6c9
    style S fill:#ffcdd2
    style T fill:#e8f5e8
```

**📖 Learn more:** [Deep Crawling Strategy](https://docs.crawl4ai.com/core/deep-crawling/), [Performance Optimization](https://docs.crawl4ai.com/advanced/performance-tuning/), [Custom Implementations](https://docs.crawl4ai.com/advanced/custom-filters/)

docs/md_v2/assets/llm.txt/diagrams/deep_crawling.txt (new file, 428 lines)
@@ -0,0 +1,428 @@
## Deep Crawling Workflows and Architecture

Visual representations of multi-level website exploration, filtering strategies, and intelligent crawling patterns.

### Deep Crawl Strategy Overview

```mermaid
flowchart TD
    A[Start Deep Crawl] --> B{Strategy Selection}

    B -->|Explore All Levels| C[BFS Strategy]
    B -->|Dive Deep Fast| D[DFS Strategy]
    B -->|Smart Prioritization| E[Best-First Strategy]

    C --> C1[Breadth-First Search]
    C1 --> C2[Process all depth 0 links]
    C2 --> C3[Process all depth 1 links]
    C3 --> C4[Continue by depth level]

    D --> D1[Depth-First Search]
    D1 --> D2[Follow first link deeply]
    D2 --> D3[Backtrack when max depth reached]
    D3 --> D4[Continue with next branch]

    E --> E1[Best-First Search]
    E1 --> E2[Score all discovered URLs]
    E2 --> E3[Process highest scoring URLs first]
    E3 --> E4[Continuously re-prioritize queue]

    C4 --> F[Apply Filters]
    D4 --> F
    E4 --> F

    F --> G{Filter Chain Processing}
    G -->|Domain Filter| G1[Check allowed/blocked domains]
    G -->|URL Pattern Filter| G2[Match URL patterns]
    G -->|Content Type Filter| G3[Verify content types]
    G -->|SEO Filter| G4[Evaluate SEO quality]
    G -->|Content Relevance| G5[Score content relevance]

    G1 --> H{Passed All Filters?}
    G2 --> H
    G3 --> H
    G4 --> H
    G5 --> H

    H -->|Yes| I[Add to Crawl Queue]
    H -->|No| J[Discard URL]

    I --> K{Processing Mode}
    K -->|Streaming| L[Process Immediately]
    K -->|Batch| M[Collect All Results]

    L --> N[Stream Result to User]
    M --> O[Return Complete Result Set]

    J --> P{More URLs in Queue?}
    N --> P
    O --> P

    P -->|Yes| Q{Within Limits?}
    P -->|No| R[Deep Crawl Complete]

    Q -->|Max Depth OK| S{Max Pages OK}
    Q -->|Max Depth Exceeded| T[Skip Deeper URLs]

    S -->|Under Limit| U[Continue Crawling]
    S -->|Limit Reached| R

    T --> P
    U --> F

    style A fill:#e1f5fe
    style R fill:#c8e6c9
    style C fill:#fff3e0
    style D fill:#f3e5f5
    style E fill:#e8f5e8
```
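
A minimal sketch of selecting the Best-First branch from this flow in the Python SDK. Class and parameter names follow the Crawl4AI docs, and the limits echo the "Deep Crawl Performance and Limits" diagram below (max depth 2, max pages 50); verify against your installed version:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

# Hedged sketch: Best-First strategy with a keyword scorer and limits.
strategy = BestFirstCrawlingStrategy(
    max_depth=2,
    max_pages=50,
    url_scorer=KeywordRelevanceScorer(keywords=["crawl", "async"], weight=0.7),
)

async def main():
    config = CrawlerRunConfig(deep_crawl_strategy=strategy)
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://docs.crawl4ai.com", config=config)
        print(len(results), "pages crawled")

asyncio.run(main())
```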

### Deep Crawl Strategy Comparison

```mermaid
graph TB
    subgraph "BFS - Breadth-First Search"
        BFS1[Level 0: Start URL]
        BFS2[Level 1: All direct links]
        BFS3[Level 2: All second-level links]
        BFS4[Level 3: All third-level links]

        BFS1 --> BFS2
        BFS2 --> BFS3
        BFS3 --> BFS4

        BFS_NOTE[Complete each depth before going deeper<br/>Good for site mapping<br/>Memory intensive for wide sites]
    end

    subgraph "DFS - Depth-First Search"
        DFS1[Start URL]
        DFS2[First Link → Deep]
        DFS3[Follow until max depth]
        DFS4[Backtrack and try next]

        DFS1 --> DFS2
        DFS2 --> DFS3
        DFS3 --> DFS4
        DFS4 --> DFS2

        DFS_NOTE[Go deep on first path<br/>Memory efficient<br/>May miss important pages]
    end

    subgraph "Best-First - Priority Queue"
        BF1[Start URL]
        BF2[Score all discovered links]
        BF3[Process highest scoring first]
        BF4[Continuously re-prioritize]

        BF1 --> BF2
        BF2 --> BF3
        BF3 --> BF4
        BF4 --> BF2

        BF_NOTE[Intelligent prioritization<br/>Finds relevant content fast<br/>Recommended for most use cases]
    end

    style BFS1 fill:#e3f2fd
    style DFS1 fill:#f3e5f5
    style BF1 fill:#e8f5e8
    style BFS_NOTE fill:#fff3e0
    style DFS_NOTE fill:#fff3e0
    style BF_NOTE fill:#fff3e0
```

### Filter Chain Processing Sequence

```mermaid
sequenceDiagram
    participant URL as Discovered URL
    participant Chain as Filter Chain
    participant Domain as Domain Filter
    participant Pattern as URL Pattern Filter
    participant Content as Content Type Filter
    participant SEO as SEO Filter
    participant Relevance as Content Relevance Filter
    participant Queue as Crawl Queue

    URL->>Chain: Process URL
    Chain->>Domain: Check domain rules

    alt Domain Allowed
        Domain-->>Chain: ✓ Pass
        Chain->>Pattern: Check URL patterns

        alt Pattern Matches
            Pattern-->>Chain: ✓ Pass
            Chain->>Content: Check content type

            alt Content Type Valid
                Content-->>Chain: ✓ Pass
                Chain->>SEO: Evaluate SEO quality

                alt SEO Score Above Threshold
                    SEO-->>Chain: ✓ Pass
                    Chain->>Relevance: Score content relevance

                    alt Relevance Score High
                        Relevance-->>Chain: ✓ Pass
                        Chain->>Queue: Add to crawl queue
                        Queue-->>URL: Queued for crawling
                    else Relevance Score Low
                        Relevance-->>Chain: ✗ Reject
                        Chain-->>URL: Filtered out - Low relevance
                    end
                else SEO Score Low
                    SEO-->>Chain: ✗ Reject
                    Chain-->>URL: Filtered out - Poor SEO
                end
            else Invalid Content Type
                Content-->>Chain: ✗ Reject
                Chain-->>URL: Filtered out - Wrong content type
            end
        else Pattern Mismatch
            Pattern-->>Chain: ✗ Reject
            Chain-->>URL: Filtered out - Pattern mismatch
        end
    else Domain Blocked
        Domain-->>Chain: ✗ Reject
        Chain-->>URL: Filtered out - Blocked domain
    end
```

### URL Lifecycle State Machine

```mermaid
stateDiagram-v2
    [*] --> Discovered: Found on page

    Discovered --> FilterPending: Enter filter chain

    FilterPending --> DomainCheck: Apply domain filter
    DomainCheck --> PatternCheck: Domain allowed
    DomainCheck --> Rejected: Domain blocked

    PatternCheck --> ContentCheck: Pattern matches
    PatternCheck --> Rejected: Pattern mismatch

    ContentCheck --> SEOCheck: Content type valid
    ContentCheck --> Rejected: Invalid content

    SEOCheck --> RelevanceCheck: SEO score sufficient
    SEOCheck --> Rejected: Poor SEO score

    RelevanceCheck --> Scored: Relevance score calculated
    RelevanceCheck --> Rejected: Low relevance

    Scored --> Queued: Added to priority queue

    Queued --> Crawling: Selected for processing
    Crawling --> Success: Page crawled successfully
    Crawling --> Failed: Crawl failed

    Success --> LinkExtraction: Extract new links
    LinkExtraction --> [*]: Process complete

    Failed --> [*]: Record failure
    Rejected --> [*]: Log rejection reason

    note right of Scored : Score determines priority<br/>in Best-First strategy

    note right of Failed : Errors logged with<br/>depth and reason
```

### Streaming vs Batch Processing Architecture

```mermaid
graph TB
    subgraph "Input"
        A[Start URL] --> B[Deep Crawl Strategy]
    end

    subgraph "Crawl Engine"
        B --> C[URL Discovery]
        C --> D[Filter Chain]
        D --> E[Priority Queue]
        E --> F[Page Processor]
    end

    subgraph "Streaming Mode stream=True"
        F --> G1[Process Page]
        G1 --> H1[Extract Content]
        H1 --> I1[Yield Result Immediately]
        I1 --> J1[async for result]
        J1 --> K1[Real-time Processing]

        G1 --> L1[Extract Links]
        L1 --> M1[Add to Queue]
        M1 --> F
    end

    subgraph "Batch Mode stream=False"
        F --> G2[Process Page]
        G2 --> H2[Extract Content]
        H2 --> I2[Store Result]
        I2 --> N2[Result Collection]

        G2 --> L2[Extract Links]
        L2 --> M2[Add to Queue]
        M2 --> O2{More URLs?}
        O2 -->|Yes| F
        O2 -->|No| P2[Return All Results]
        P2 --> Q2[Batch Processing]
    end

    style I1 fill:#e8f5e8
    style K1 fill:#e8f5e8
    style P2 fill:#e3f2fd
    style Q2 fill:#e3f2fd
```
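
A minimal sketch of both modes side by side, assuming the `stream` flag on `CrawlerRunConfig` and the `async for` consumption pattern from the diagram:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

# Hedged sketch: stream=True yields results as pages finish;
# stream=False returns one list at the end.
async def main():
    strategy = BFSDeepCrawlStrategy(max_depth=1)

    stream_cfg = CrawlerRunConfig(deep_crawl_strategy=strategy, stream=True)
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://docs.crawl4ai.com", config=stream_cfg):
            print("streamed:", result.url)

    batch_cfg = CrawlerRunConfig(deep_crawl_strategy=strategy, stream=False)
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://docs.crawl4ai.com", config=batch_cfg)
        print("batch size:", len(results))

asyncio.run(main())
```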

### Advanced Scoring and Prioritization System

```mermaid
flowchart LR
    subgraph "URL Discovery"
        A[Page Links] --> B[Extract URLs]
        B --> C[Normalize URLs]
    end

    subgraph "Scoring System"
        C --> D[Keyword Relevance Scorer]
        D --> D1[URL Text Analysis]
        D --> D2[Keyword Matching]
        D --> D3[Calculate Base Score]

        D3 --> E[Additional Scoring Factors]
        E --> E1[URL Structure weight: 0.2]
        E --> E2[Link Context weight: 0.3]
        E --> E3[Page Depth Penalty weight: 0.1]
        E --> E4[Domain Authority weight: 0.4]

        D1 --> F[Combined Score]
        D2 --> F
        D3 --> F
        E1 --> F
        E2 --> F
        E3 --> F
        E4 --> F
    end

    subgraph "Prioritization"
        F --> G{Score Threshold}
        G -->|Above Threshold| H[Priority Queue]
        G -->|Below Threshold| I[Discard URL]

        H --> J[Best-First Selection]
        J --> K[Highest Score First]
        K --> L[Process Page]

        L --> M[Update Scores]
        M --> N[Re-prioritize Queue]
        N --> J
    end

    style F fill:#fff3e0
    style H fill:#e8f5e8
    style L fill:#e3f2fd
```

### Deep Crawl Performance and Limits

```mermaid
graph TD
    subgraph "Crawl Constraints"
        A[Max Depth: 2] --> A1[Prevents infinite crawling]
        B[Max Pages: 50] --> B1[Controls resource usage]
        C[Score Threshold: 0.3] --> C1[Quality filtering]
        D[Domain Limits] --> D1[Scope control]
    end

    subgraph "Performance Monitoring"
        E[Pages Crawled] --> F[Depth Distribution]
        E --> G[Success Rate]
        E --> H[Average Score]
        E --> I[Processing Time]

        F --> J[Performance Report]
        G --> J
        H --> J
        I --> J
    end

    subgraph "Resource Management"
        K[Memory Usage] --> L{Memory Threshold}
        L -->|Under Limit| M[Continue Crawling]
        L -->|Over Limit| N[Reduce Concurrency]

        O[CPU Usage] --> P{CPU Threshold}
        P -->|Normal| M
        P -->|High| Q[Add Delays]

        R[Network Load] --> S{Rate Limits}
        S -->|OK| M
        S -->|Exceeded| T[Throttle Requests]
    end

    M --> U[Optimal Performance]
    N --> V[Reduced Performance]
    Q --> V
    T --> V

    style U fill:#c8e6c9
    style V fill:#fff3e0
    style J fill:#e3f2fd
```

### Error Handling and Recovery Flow

```mermaid
sequenceDiagram
    participant Strategy as Deep Crawl Strategy
    participant Queue as Priority Queue
    participant Crawler as Page Crawler
    participant Error as Error Handler
    participant Result as Result Collector

    Strategy->>Queue: Get next URL
    Queue-->>Strategy: Return highest priority URL

    Strategy->>Crawler: Crawl page

    alt Successful Crawl
        Crawler-->>Strategy: Return page content
        Strategy->>Result: Store successful result
        Strategy->>Strategy: Extract new links
        Strategy->>Queue: Add new URLs to queue
    else Network Error
        Crawler-->>Error: Network timeout/failure
        Error->>Error: Log error with details
        Error->>Queue: Mark URL as failed
        Error-->>Strategy: Skip to next URL
    else Parse Error
        Crawler-->>Error: HTML parsing failed
        Error->>Error: Log parse error
        Error->>Result: Store failed result
        Error-->>Strategy: Continue with next URL
    else Rate Limit Hit
        Crawler-->>Error: Rate limit exceeded
        Error->>Error: Apply backoff strategy
        Error->>Queue: Re-queue URL with delay
        Error-->>Strategy: Wait before retry
    else Depth Limit
        Strategy->>Strategy: Check depth constraint
        Strategy-->>Queue: Skip URL - too deep
    else Page Limit
        Strategy->>Strategy: Check page count
        Strategy-->>Result: Stop crawling - limit reached
    end

    Strategy->>Queue: Request next URL
    Queue-->>Strategy: More URLs available?

    alt Queue Empty
        Queue-->>Result: Crawl complete
    else Queue Has URLs
        Queue-->>Strategy: Continue crawling
    end
```

**📖 Learn more:** [Deep Crawling Strategies](https://docs.crawl4ai.com/core/deep-crawling/), [Content Filtering](https://docs.crawl4ai.com/core/content-selection/), [Advanced Crawling Patterns](https://docs.crawl4ai.com/advanced/advanced-features/)

docs/md_v2/assets/llm.txt/diagrams/docker.txt (new file, 603 lines)
@@ -0,0 +1,603 @@
## Docker Deployment Architecture and Workflows

Visual representations of Crawl4AI Docker deployment, API architecture, configuration management, and service interactions.

### Docker Deployment Decision Flow

```mermaid
flowchart TD
    A[Start Docker Deployment] --> B{Deployment Type?}

    B -->|Quick Start| C[Pre-built Image]
    B -->|Development| D[Docker Compose]
    B -->|Custom Build| E[Manual Build]
    B -->|Production| F[Production Setup]

    C --> C1[docker pull unclecode/crawl4ai]
    C1 --> C2{Need LLM Support?}
    C2 -->|Yes| C3[Setup .llm.env]
    C2 -->|No| C4[Basic run]
    C3 --> C5[docker run with --env-file]
    C4 --> C6[docker run basic]

    D --> D1[git clone repository]
    D1 --> D2[cp .llm.env.example .llm.env]
    D2 --> D3{Build Type?}
    D3 -->|Pre-built| D4[IMAGE=latest docker compose up]
    D3 -->|Local Build| D5[docker compose up --build]
    D3 -->|All Features| D6[INSTALL_TYPE=all docker compose up]

    E --> E1[docker buildx build]
    E1 --> E2{Architecture?}
    E2 -->|Single| E3[--platform linux/amd64]
    E2 -->|Multi| E4[--platform linux/amd64,linux/arm64]
    E3 --> E5[Build complete]
    E4 --> E5

    F --> F1[Production configuration]
    F1 --> F2[Custom config.yml]
    F2 --> F3[Resource limits]
    F3 --> F4[Health monitoring]
    F4 --> F5[Production ready]

    C5 --> G[Service running on :11235]
    C6 --> G
    D4 --> G
    D5 --> G
    D6 --> G
    E5 --> H[docker run custom image]
    H --> G
    F5 --> I[Production deployment]

    G --> J[Access playground at /playground]
    G --> K[Health check at /health]
    I --> L[Production monitoring]

    style A fill:#e1f5fe
    style G fill:#c8e6c9
    style I fill:#c8e6c9
    style J fill:#fff3e0
    style K fill:#fff3e0
    style L fill:#e8f5e8
```
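
Whichever path you take, the flow converges on a service listening on port 11235. A minimal sketch of verifying it via the `/health` endpoint named above:

```python
import requests

# Hedged sketch: once the container is up on :11235, confirm liveness.
resp = requests.get("http://localhost:11235/health", timeout=10)
resp.raise_for_status()
print(resp.json())
```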

### Docker Container Architecture

```mermaid
graph TB
    subgraph "Host Environment"
        A[Docker Engine] --> B[Crawl4AI Container]
        C[.llm.env] --> B
        D[Custom config.yml] --> B
        E[Port 11235] --> B
        F[Shared Memory 1GB+] --> B
    end

    subgraph "Container Services"
        B --> G[FastAPI Server :8020]
        B --> H[Gunicorn WSGI]
        B --> I[Supervisord Process Manager]
        B --> J[Redis Cache :6379]

        G --> K[REST API Endpoints]
        G --> L[WebSocket Connections]
        G --> M[MCP Protocol]

        H --> N[Worker Processes]
        I --> O[Service Monitoring]
        J --> P[Request Caching]
    end

    subgraph "Browser Management"
        B --> Q[Playwright Framework]
        Q --> R[Chromium Browser]
        Q --> S[Firefox Browser]
        Q --> T[WebKit Browser]

        R --> U[Browser Pool]
        S --> U
        T --> U

        U --> V[Page Sessions]
        U --> W[Context Management]
    end

    subgraph "External Services"
        X[OpenAI API] -.-> K
        Y[Anthropic Claude] -.-> K
        Z[Local Ollama] -.-> K
        AA[Groq API] -.-> K
        BB[Google Gemini] -.-> K
    end

    subgraph "Client Interactions"
        CC[Python SDK] --> K
        DD[REST API Calls] --> K
        EE[MCP Clients] --> M
        FF[Web Browser] --> G
        GG[Monitoring Tools] --> K
    end

    style B fill:#e3f2fd
    style G fill:#f3e5f5
    style Q fill:#e8f5e8
    style K fill:#fff3e0
```

### API Endpoints Architecture

```mermaid
graph LR
    subgraph "Core Endpoints"
        A[/crawl] --> A1[Single URL crawl]
        A2[/crawl/stream] --> A3[Streaming multi-URL]
        A4[/crawl/job] --> A5[Async job submission]
        A6[/crawl/job/{id}] --> A7[Job status check]
    end

    subgraph "Specialized Endpoints"
        B[/html] --> B1[Preprocessed HTML]
        B2[/screenshot] --> B3[PNG capture]
        B4[/pdf] --> B5[PDF generation]
        B6[/execute_js] --> B7[JavaScript execution]
        B8[/md] --> B9[Markdown extraction]
    end

    subgraph "Utility Endpoints"
        C[/health] --> C1[Service status]
        C2[/metrics] --> C3[Prometheus metrics]
        C4[/schema] --> C5[API documentation]
        C6[/playground] --> C7[Interactive testing]
    end

    subgraph "LLM Integration"
        D[/llm/{url}] --> D1[Q&A over URL]
        D2[/ask] --> D3[Library context search]
        D4[/config/dump] --> D5[Config validation]
    end

    subgraph "MCP Protocol"
        E[/mcp/sse] --> E1[Server-Sent Events]
        E2[/mcp/ws] --> E3[WebSocket connection]
        E4[/mcp/schema] --> E5[MCP tool definitions]
    end

    style A fill:#e3f2fd
    style B fill:#f3e5f5
    style C fill:#e8f5e8
    style D fill:#fff3e0
    style E fill:#fce4ec
```
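
A minimal sketch of calling the core `/crawl` endpoint. The payload shape (a `urls` list plus serialized `browser_config`/`crawler_config` objects) follows the validation flow described later in this document; exact field names vary by version and should be checked against `/schema`:

```python
import requests

# Hedged sketch: single-URL crawl via the REST API on :11235.
payload = {
    "urls": ["https://example.com"],
    "browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
    "crawler_config": {"type": "CrawlerRunConfig", "params": {}},
}

resp = requests.post("http://localhost:11235/crawl", json=payload, timeout=60)
resp.raise_for_status()
for result in resp.json().get("results", []):
    print(result.get("url"), result.get("success"))
```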

### Request Processing Flow

```mermaid
sequenceDiagram
    participant Client
    participant FastAPI
    participant RequestValidator
    participant BrowserPool
    participant Playwright
    participant ExtractionEngine
    participant LLMProvider

    Client->>FastAPI: POST /crawl with config
    FastAPI->>RequestValidator: Validate JSON structure

    alt Valid Request
        RequestValidator-->>FastAPI: ✓ Validated
        FastAPI->>BrowserPool: Request browser instance
        BrowserPool->>Playwright: Launch browser/reuse session
        Playwright-->>BrowserPool: Browser ready
        BrowserPool-->>FastAPI: Browser allocated

        FastAPI->>Playwright: Navigate to URL
        Playwright->>Playwright: Execute JS, wait conditions
        Playwright-->>FastAPI: Page content ready

        FastAPI->>ExtractionEngine: Process content

        alt LLM Extraction
            ExtractionEngine->>LLMProvider: Send content + schema
            LLMProvider-->>ExtractionEngine: Structured data
        else CSS Extraction
            ExtractionEngine->>ExtractionEngine: Apply CSS selectors
        end

        ExtractionEngine-->>FastAPI: Extraction complete
        FastAPI->>BrowserPool: Release browser
        FastAPI-->>Client: CrawlResult response

    else Invalid Request
        RequestValidator-->>FastAPI: ✗ Validation error
        FastAPI-->>Client: 400 Bad Request
    end
```

### Configuration Management Flow

```mermaid
stateDiagram-v2
    [*] --> ConfigLoading

    ConfigLoading --> DefaultConfig: Load default config.yml
    ConfigLoading --> CustomConfig: Custom config mounted
    ConfigLoading --> EnvOverrides: Environment variables

    DefaultConfig --> ConfigMerging
    CustomConfig --> ConfigMerging
    EnvOverrides --> ConfigMerging

    ConfigMerging --> ConfigValidation

    ConfigValidation --> Valid: Schema validation passes
    ConfigValidation --> Invalid: Validation errors

    Invalid --> ConfigError: Log errors and exit
    ConfigError --> [*]

    Valid --> ServiceInitialization
    ServiceInitialization --> FastAPISetup
    ServiceInitialization --> BrowserPoolInit
    ServiceInitialization --> CacheSetup

    FastAPISetup --> Running
    BrowserPoolInit --> Running
    CacheSetup --> Running

    Running --> ConfigReload: Config change detected
    ConfigReload --> ConfigValidation

    Running --> [*]: Service shutdown

    note right of ConfigMerging : Priority: ENV > Custom > Default
    note right of ServiceInitialization : All services must initialize successfully
```

### Multi-Architecture Build Process

```mermaid
flowchart TD
    A[Developer Push] --> B[GitHub Repository]

    B --> C[Docker Buildx]
    C --> D{Build Strategy}

    D -->|Multi-arch| E[Parallel Builds]
    D -->|Single-arch| F[Platform-specific Build]

    E --> G[AMD64 Build]
    E --> H[ARM64 Build]

    F --> I[Target Platform Build]

    subgraph "AMD64 Build Process"
        G --> G1[Ubuntu base image]
        G1 --> G2[Python 3.11 install]
        G2 --> G3[System dependencies]
        G3 --> G4[Crawl4AI installation]
        G4 --> G5[Playwright setup]
        G5 --> G6[FastAPI configuration]
        G6 --> G7[AMD64 image ready]
    end

    subgraph "ARM64 Build Process"
        H --> H1[Ubuntu ARM64 base]
        H1 --> H2[Python 3.11 install]
        H2 --> H3[ARM-specific deps]
        H3 --> H4[Crawl4AI installation]
        H4 --> H5[Playwright setup]
        H5 --> H6[FastAPI configuration]
        H6 --> H7[ARM64 image ready]
    end

    subgraph "Single Architecture"
        I --> I1[Base image selection]
        I1 --> I2[Platform dependencies]
        I2 --> I3[Application setup]
        I3 --> I4[Platform image ready]
    end

    G7 --> J[Multi-arch Manifest]
    H7 --> J
    I4 --> K[Platform Image]

    J --> L[Docker Hub Registry]
    K --> L

    L --> M[Pull Request Auto-selects Architecture]

    style A fill:#e1f5fe
    style J fill:#c8e6c9
    style K fill:#c8e6c9
    style L fill:#f3e5f5
    style M fill:#e8f5e8
```

### MCP Integration Architecture

```mermaid
graph TB
    subgraph "MCP Client Applications"
        A[Claude Code] --> B[MCP Protocol]
        C[Cursor IDE] --> B
        D[Windsurf] --> B
        E[Custom MCP Client] --> B
    end

    subgraph "Crawl4AI MCP Server"
        B --> F[MCP Endpoint Router]
        F --> G[SSE Transport /mcp/sse]
        F --> H[WebSocket Transport /mcp/ws]
        F --> I[Schema Endpoint /mcp/schema]

        G --> J[MCP Tool Handler]
        H --> J

        J --> K[Tool: md]
        J --> L[Tool: html]
        J --> M[Tool: screenshot]
        J --> N[Tool: pdf]
        J --> O[Tool: execute_js]
        J --> P[Tool: crawl]
        J --> Q[Tool: ask]
    end

    subgraph "Crawl4AI Core Services"
        K --> R[Markdown Generator]
        L --> S[HTML Preprocessor]
        M --> T[Screenshot Service]
        N --> U[PDF Generator]
        O --> V[JavaScript Executor]
        P --> W[Batch Crawler]
        Q --> X[Context Search]

        R --> Y[Browser Pool]
        S --> Y
        T --> Y
        U --> Y
        V --> Y
        W --> Y
        X --> Z[Knowledge Base]
    end

    subgraph "External Resources"
        Y --> AA[Playwright Browsers]
        Z --> BB[Library Documentation]
        Z --> CC[Code Examples]
        AA --> DD[Web Pages]
    end

    style B fill:#e3f2fd
    style J fill:#f3e5f5
    style Y fill:#e8f5e8
    style Z fill:#fff3e0
```

### API Request/Response Flow Patterns

```mermaid
sequenceDiagram
    participant Client
    participant LoadBalancer
    participant FastAPI
    participant ConfigValidator
    participant BrowserManager
    participant CrawlEngine
    participant ResponseBuilder

    Note over Client,ResponseBuilder: Basic Crawl Request

    Client->>LoadBalancer: POST /crawl
    LoadBalancer->>FastAPI: Route request

    FastAPI->>ConfigValidator: Validate browser_config
    ConfigValidator-->>FastAPI: ✓ Valid BrowserConfig

    FastAPI->>ConfigValidator: Validate crawler_config
    ConfigValidator-->>FastAPI: ✓ Valid CrawlerRunConfig

    FastAPI->>BrowserManager: Allocate browser
    BrowserManager-->>FastAPI: Browser instance

    FastAPI->>CrawlEngine: Execute crawl

    Note over CrawlEngine: Page processing
    CrawlEngine->>CrawlEngine: Navigate & wait
    CrawlEngine->>CrawlEngine: Extract content
    CrawlEngine->>CrawlEngine: Apply strategies

    CrawlEngine-->>FastAPI: CrawlResult

    FastAPI->>ResponseBuilder: Format response
    ResponseBuilder-->>FastAPI: JSON response

    FastAPI->>BrowserManager: Release browser
    FastAPI-->>LoadBalancer: Response ready
    LoadBalancer-->>Client: 200 OK + CrawlResult

    Note over Client,ResponseBuilder: Streaming Request

    Client->>FastAPI: POST /crawl/stream
    FastAPI-->>Client: 200 OK (stream start)

    loop For each URL
        FastAPI->>CrawlEngine: Process URL
        CrawlEngine-->>FastAPI: Result ready
        FastAPI-->>Client: NDJSON line
    end

    FastAPI-->>Client: Stream completed
```
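
The streaming half of this flow emits one NDJSON line per finished URL. A minimal sketch of consuming that stream, using the same hedged payload shape as the `/crawl` example above:

```python
import json
import requests

# Hedged sketch: read /crawl/stream line by line as results arrive.
payload = {
    "urls": ["https://example.com", "https://docs.crawl4ai.com"],
    "browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
    "crawler_config": {"type": "CrawlerRunConfig", "params": {"stream": True}},
}

with requests.post("http://localhost:11235/crawl/stream",
                   json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:  # each non-empty line is one JSON result object
            result = json.loads(line)
            print(result.get("url"), result.get("success"))
```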
|
||||
|
||||
### Configuration Validation Workflow
|
||||
|
||||
```mermaid
|
||||
flowchart TD
|
||||
A[Client Request] --> B[JSON Payload]
|
||||
B --> C{Pre-validation}
|
||||
|
||||
C -->|✓ Valid JSON| D[Extract Configurations]
|
||||
C -->|✗ Invalid JSON| E[Return 400 Bad Request]
|
||||
|
||||
D --> F[BrowserConfig Validation]
|
||||
D --> G[CrawlerRunConfig Validation]
|
||||
|
||||
F --> H{BrowserConfig Valid?}
|
||||
G --> I{CrawlerRunConfig Valid?}
|
||||
|
||||
H -->|✓ Valid| J[Browser Setup]
|
||||
H -->|✗ Invalid| K[Log Browser Config Errors]
|
||||
|
||||
I -->|✓ Valid| L[Crawler Setup]
|
||||
I -->|✗ Invalid| M[Log Crawler Config Errors]
|
||||
|
||||
K --> N[Collect All Errors]
|
||||
M --> N
|
||||
N --> O[Return 422 Validation Error]
|
||||
|
||||
J --> P{Both Configs Valid?}
|
||||
L --> P
|
||||
|
||||
P -->|✓ Yes| Q[Proceed to Crawling]
|
||||
P -->|✗ No| O
|
||||
|
||||
Q --> R[Execute Crawl Pipeline]
|
||||
R --> S[Return CrawlResult]
|
||||
|
||||
E --> T[Client Error Response]
|
||||
O --> T
|
||||
S --> U[Client Success Response]
|
||||
|
||||
style A fill:#e1f5fe
|
||||
style Q fill:#c8e6c9
|
||||
style S fill:#c8e6c9
|
||||
style U fill:#c8e6c9
|
||||
style E fill:#ffcdd2
|
||||
style O fill:#ffcdd2
|
||||
style T fill:#ffcdd2
|
||||
```
|
||||
|
||||
### Production Deployment Architecture
|
||||
|
||||
```mermaid
|
||||
graph TB
|
||||
subgraph "Load Balancer Layer"
|
||||
A[NGINX/HAProxy] --> B[Health Check]
|
||||
A --> C[Request Routing]
|
||||
A --> D[SSL Termination]
|
||||
end
|
||||
|
||||
subgraph "Application Layer"
|
||||
C --> E[Crawl4AI Instance 1]
|
||||
C --> F[Crawl4AI Instance 2]
|
||||
C --> G[Crawl4AI Instance N]
|
||||
|
||||
E --> H[FastAPI Server]
|
||||
F --> I[FastAPI Server]
|
||||
G --> J[FastAPI Server]
|
||||
|
||||
H --> K[Browser Pool 1]
|
||||
I --> L[Browser Pool 2]
|
||||
J --> M[Browser Pool N]
|
||||
end
|
||||
|
||||
subgraph "Shared Services"
|
||||
N[Redis Cluster] --> E
|
||||
N --> F
|
||||
N --> G
|
||||
|
||||
O[Monitoring Stack] --> P[Prometheus]
|
||||
O --> Q[Grafana]
|
||||
O --> R[AlertManager]
|
||||
|
||||
P --> E
|
||||
P --> F
|
||||
P --> G
|
||||
end
|
||||
|
||||
subgraph "External Dependencies"
|
||||
S[OpenAI API] -.-> H
|
||||
T[Anthropic API] -.-> I
|
||||
U[Local LLM Cluster] -.-> J
|
||||
end
|
||||
|
||||
subgraph "Persistent Storage"
|
||||
V[Configuration Volume] --> E
|
||||
V --> F
|
||||
V --> G
|
||||
|
||||
W[Cache Volume] --> N
|
||||
X[Logs Volume] --> O
|
||||
end
|
||||
|
||||
style A fill:#e3f2fd
|
||||
style E fill:#f3e5f5
|
||||
style F fill:#f3e5f5
|
||||
style G fill:#f3e5f5
|
||||
style N fill:#e8f5e8
|
||||
style O fill:#fff3e0
|
||||
```
|
||||
|
||||
### Docker Resource Management
|
||||
|
||||
```mermaid
|
||||
graph TD
|
||||
subgraph "Resource Allocation"
|
||||
A[Host Resources] --> B[CPU Cores]
|
||||
A --> C[Memory GB]
|
||||
A --> D[Disk Space]
|
||||
A --> E[Network Bandwidth]
|
||||
|
||||
B --> F[Container Limits]
|
||||
C --> F
|
||||
D --> F
|
||||
E --> F
|
||||
end
|
||||
|
||||
subgraph "Container Configuration"
|
||||
F --> G[--cpus=4]
|
||||
F --> H[--memory=8g]
|
||||
F --> I[--shm-size=2g]
|
||||
F --> J[Volume Mounts]
|
||||
|
||||
G --> K[Browser Processes]
|
||||
H --> L[Browser Memory]
|
||||
I --> M[Shared Memory for Browsers]
|
||||
J --> N[Config & Cache Storage]
|
||||
end
|
||||
|
||||
subgraph "Monitoring & Scaling"
|
||||
O[Resource Monitor] --> P[CPU Usage %]
|
||||
O --> Q[Memory Usage %]
|
||||
O --> R[Request Queue Length]
|
||||
|
||||
P --> S{CPU > 80%?}
|
||||
Q --> T{Memory > 90%?}
|
||||
R --> U{Queue > 100?}
|
||||
|
||||
S -->|Yes| V[Scale Up]
|
||||
T -->|Yes| V
|
||||
U -->|Yes| V
|
||||
|
||||
V --> W[Add Container Instance]
|
||||
W --> X[Update Load Balancer]
|
||||
end
|
||||
|
||||
subgraph "Performance Optimization"
|
||||
Y[Browser Pool Tuning] --> Z[Max Pages: 40]
|
||||
Y --> AA[Idle TTL: 30min]
|
||||
Y --> BB[Concurrency Limits]
|
||||
|
||||
Z --> CC[Memory Efficiency]
|
||||
AA --> DD[Resource Cleanup]
|
||||
BB --> EE[Throughput Control]
|
||||
end
|
||||
|
||||
style A fill:#e1f5fe
|
||||
style F fill:#f3e5f5
|
||||
style O fill:#e8f5e8
|
||||
style Y fill:#fff3e0
|
||||
```
|
||||
|
||||
**📖 Learn more:** [Docker Deployment Guide](https://docs.crawl4ai.com/core/docker-deployment/), [API Reference](https://docs.crawl4ai.com/api/), [MCP Integration](https://docs.crawl4ai.com/core/docker-deployment/#mcp-model-context-protocol-support), [Production Configuration](https://docs.crawl4ai.com/core/docker-deployment/#production-deployment)
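
For a deployment like the one above, a readiness probe can gate traffic at the load balancer. Below is a minimal sketch that polls the `/health` endpoint on the default port 11235; the endpoint path and port follow the deployment docs, while the retry counts are illustrative.

```python
import time
import requests

def wait_until_healthy(base_url: str = "http://localhost:11235", attempts: int = 10) -> bool:
    """Poll the service's /health endpoint until it responds or attempts run out."""
    for _ in range(attempts):
        try:
            if requests.get(f"{base_url}/health", timeout=5).ok:
                return True
        except requests.RequestException:
            pass  # Container may still be starting; try again after a short wait
        time.sleep(3)
    return False
```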
|
||||
478
docs/md_v2/assets/llm.txt/diagrams/extraction.txt
Normal file
@@ -0,0 +1,478 @@
|
||||
## Extraction Strategy Workflows and Architecture
|
||||
|
||||
Visual representations of Crawl4AI's data extraction approaches, strategy selection, and processing workflows.
|
||||
|
||||
### Extraction Strategy Decision Tree
|
||||
|
||||
```mermaid
|
||||
flowchart TD
|
||||
A[Content to Extract] --> B{Content Type?}
|
||||
|
||||
B -->|Simple Patterns| C[Common Data Types]
|
||||
B -->|Structured HTML| D[Predictable Structure]
|
||||
B -->|Complex Content| E[Requires Reasoning]
|
||||
B -->|Mixed Content| F[Multiple Data Types]
|
||||
|
||||
C --> C1{Pattern Type?}
|
||||
C1 -->|Email, Phone, URLs| C2[Built-in Regex Patterns]
|
||||
C1 -->|Custom Patterns| C3[Custom Regex Strategy]
|
||||
C1 -->|LLM-Generated| C4[One-time Pattern Generation]
|
||||
|
||||
D --> D1{Selector Type?}
|
||||
D1 -->|CSS Selectors| D2[JsonCssExtractionStrategy]
|
||||
D1 -->|XPath Expressions| D3[JsonXPathExtractionStrategy]
|
||||
D1 -->|Need Schema?| D4[Auto-generate Schema with LLM]
|
||||
|
||||
E --> E1{LLM Provider?}
|
||||
E1 -->|OpenAI/Anthropic| E2[Cloud LLM Strategy]
|
||||
E1 -->|Local Ollama| E3[Local LLM Strategy]
|
||||
E1 -->|Cost-sensitive| E4[Hybrid: Generate Schema Once]
|
||||
|
||||
F --> F1[Multi-Strategy Approach]
|
||||
F1 --> F2[1. Regex for Patterns]
|
||||
F1 --> F3[2. CSS for Structure]
|
||||
F1 --> F4[3. LLM for Complex Analysis]
|
||||
|
||||
C2 --> G[Fast Extraction ⚡]
|
||||
C3 --> G
|
||||
C4 --> H[Cached Pattern Reuse]
|
||||
|
||||
D2 --> I[Schema-based Extraction 🏗️]
|
||||
D3 --> I
|
||||
D4 --> J[Generated Schema Cache]
|
||||
|
||||
E2 --> K[Intelligent Parsing 🧠]
|
||||
E3 --> K
|
||||
E4 --> L[Hybrid Cost-Effective]
|
||||
|
||||
F2 --> M[Comprehensive Results 📊]
|
||||
F3 --> M
|
||||
F4 --> M
|
||||
|
||||
style G fill:#c8e6c9
|
||||
style I fill:#e3f2fd
|
||||
style K fill:#fff3e0
|
||||
style M fill:#f3e5f5
|
||||
style H fill:#e8f5e8
|
||||
style J fill:#e8f5e8
|
||||
style L fill:#ffecb3
|
||||
```
|
||||
|
||||
### LLM Extraction Strategy Workflow
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant User
|
||||
participant Crawler
|
||||
participant LLMStrategy
|
||||
participant Chunker
|
||||
participant LLMProvider
|
||||
participant Parser
|
||||
|
||||
User->>Crawler: Configure LLMExtractionStrategy
|
||||
User->>Crawler: arun(url, config)
|
||||
|
||||
Crawler->>Crawler: Navigate to URL
|
||||
Crawler->>Crawler: Extract content (HTML/Markdown)
|
||||
Crawler->>LLMStrategy: Process content
|
||||
|
||||
LLMStrategy->>LLMStrategy: Check content size
|
||||
|
||||
alt Content > chunk_threshold
|
||||
LLMStrategy->>Chunker: Split into chunks with overlap
|
||||
Chunker-->>LLMStrategy: Return chunks[]
|
||||
|
||||
loop For each chunk
|
||||
LLMStrategy->>LLMProvider: Send chunk + schema + instruction
|
||||
LLMProvider-->>LLMStrategy: Return structured JSON
|
||||
end
|
||||
|
||||
LLMStrategy->>LLMStrategy: Merge chunk results
|
||||
else Content <= threshold
|
||||
LLMStrategy->>LLMProvider: Send full content + schema
|
||||
LLMProvider-->>LLMStrategy: Return structured JSON
|
||||
end
|
||||
|
||||
LLMStrategy->>Parser: Validate JSON schema
|
||||
Parser-->>LLMStrategy: Validated data
|
||||
|
||||
LLMStrategy->>LLMStrategy: Track token usage
|
||||
LLMStrategy-->>Crawler: Return extracted_content
|
||||
|
||||
Crawler-->>User: CrawlResult with JSON data
|
||||
|
||||
User->>LLMStrategy: show_usage()
|
||||
LLMStrategy-->>User: Token count & estimated cost
|
||||
```
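
A sketch of this workflow in code, assuming a recent release where `LLMExtractionStrategy` accepts an `LLMConfig`; the schema, instruction, model, and URL are placeholders.

```python
import asyncio
import os
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(provider="openai/gpt-4o-mini",
                         api_token=os.getenv("OPENAI_API_KEY")),
    schema={"type": "object", "properties": {"title": {"type": "string"}}},
    extraction_type="schema",
    instruction="Extract the article title.",
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            "https://example.com",
            config=CrawlerRunConfig(extraction_strategy=strategy),
        )
        print(result.extracted_content)  # Structured JSON from the LLM
        strategy.show_usage()            # Token counts, as in the final step above

asyncio.run(main())
```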
|
||||
|
||||
### Schema-Based Extraction Architecture
|
||||
|
||||
```mermaid
|
||||
graph TB
|
||||
subgraph "Schema Definition"
|
||||
A[JSON Schema] --> A1[baseSelector]
|
||||
A --> A2[fields[]]
|
||||
A --> A3[nested structures]
|
||||
|
||||
A2 --> A4[CSS/XPath selectors]
|
||||
A2 --> A5[Data types: text, html, attribute]
|
||||
A2 --> A6[Default values]
|
||||
|
||||
A3 --> A7[nested objects]
|
||||
A3 --> A8[nested_list arrays]
|
||||
A3 --> A9[simple lists]
|
||||
end
|
||||
|
||||
subgraph "Extraction Engine"
|
||||
B[HTML Content] --> C[Selector Engine]
|
||||
C --> C1[CSS Selector Parser]
|
||||
C --> C2[XPath Evaluator]
|
||||
|
||||
C1 --> D[Element Matcher]
|
||||
C2 --> D
|
||||
|
||||
D --> E[Type Converter]
|
||||
E --> E1[Text Extraction]
|
||||
E --> E2[HTML Preservation]
|
||||
E --> E3[Attribute Extraction]
|
||||
E --> E4[Nested Processing]
|
||||
end
|
||||
|
||||
subgraph "Result Processing"
|
||||
F[Raw Extracted Data] --> G[Structure Builder]
|
||||
G --> G1[Object Construction]
|
||||
G --> G2[Array Assembly]
|
||||
G --> G3[Type Validation]
|
||||
|
||||
G1 --> H[JSON Output]
|
||||
G2 --> H
|
||||
G3 --> H
|
||||
end
|
||||
|
||||
A --> C
|
||||
E --> F
|
||||
H --> I[extracted_content]
|
||||
|
||||
style A fill:#e3f2fd
|
||||
style C fill:#f3e5f5
|
||||
style G fill:#e8f5e8
|
||||
style H fill:#c8e6c9
|
||||
```
|
||||
|
||||
### Automatic Schema Generation Process
|
||||
|
||||
```mermaid
|
||||
stateDiagram-v2
|
||||
[*] --> CheckCache
|
||||
|
||||
CheckCache --> CacheHit: Schema exists
|
||||
CheckCache --> SamplePage: Schema missing
|
||||
|
||||
CacheHit --> LoadSchema
|
||||
LoadSchema --> FastExtraction
|
||||
|
||||
SamplePage --> ExtractHTML: Crawl sample URL
|
||||
ExtractHTML --> LLMAnalysis: Send HTML to LLM
|
||||
LLMAnalysis --> GenerateSchema: Create CSS/XPath selectors
|
||||
GenerateSchema --> ValidateSchema: Test generated schema
|
||||
|
||||
ValidateSchema --> SchemaWorks: Valid selectors
|
||||
ValidateSchema --> RefineSchema: Invalid selectors
|
||||
|
||||
RefineSchema --> LLMAnalysis: Iterate with feedback
|
||||
|
||||
SchemaWorks --> CacheSchema: Save for reuse
|
||||
CacheSchema --> FastExtraction: Use cached schema
|
||||
|
||||
FastExtraction --> [*]: No more LLM calls needed
|
||||
|
||||
note right of CheckCache : One-time LLM cost
|
||||
note right of FastExtraction : Unlimited fast reuse
|
||||
note right of CacheSchema : JSON file storage
|
||||
```
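
The cache-then-generate loop above can be sketched as follows, assuming the `generate_schema` helper available on `JsonCssExtractionStrategy` in recent releases; the cache path and provider are placeholders.

```python
import json
import os
from crawl4ai import LLMConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

CACHE_PATH = "product_schema.json"

def get_schema(sample_html: str) -> dict:
    if os.path.exists(CACHE_PATH):              # Cache hit: zero LLM cost
        with open(CACHE_PATH) as f:
            return json.load(f)
    schema = JsonCssExtractionStrategy.generate_schema(  # One-time LLM call
        html=sample_html,
        llm_config=LLMConfig(provider="openai/gpt-4o-mini",
                             api_token=os.getenv("OPENAI_API_KEY")),
    )
    with open(CACHE_PATH, "w") as f:            # Save for unlimited fast reuse
        json.dump(schema, f)
    return schema
```

The returned schema is then passed to `JsonCssExtractionStrategy(schema)` for the fast, LLM-free extraction path.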
|
||||
|
||||
### Multi-Strategy Extraction Pipeline
|
||||
|
||||
```mermaid
|
||||
flowchart LR
|
||||
A[Web Page Content] --> B[Strategy Pipeline]
|
||||
|
||||
subgraph B["Extraction Pipeline"]
|
||||
B1[Stage 1: Regex Patterns]
|
||||
B2[Stage 2: Schema-based CSS]
|
||||
B3[Stage 3: LLM Analysis]
|
||||
|
||||
B1 --> B1a[Email addresses]
|
||||
B1 --> B1b[Phone numbers]
|
||||
B1 --> B1c[URLs and links]
|
||||
B1 --> B1d[Currency amounts]
|
||||
|
||||
B2 --> B2a[Structured products]
|
||||
B2 --> B2b[Article metadata]
|
||||
B2 --> B2c[User reviews]
|
||||
B2 --> B2d[Navigation links]
|
||||
|
||||
B3 --> B3a[Sentiment analysis]
|
||||
B3 --> B3b[Key topics]
|
||||
B3 --> B3c[Entity recognition]
|
||||
B3 --> B3d[Content summary]
|
||||
end
|
||||
|
||||
B1a --> C[Result Merger]
|
||||
B1b --> C
|
||||
B1c --> C
|
||||
B1d --> C
|
||||
|
||||
B2a --> C
|
||||
B2b --> C
|
||||
B2c --> C
|
||||
B2d --> C
|
||||
|
||||
B3a --> C
|
||||
B3b --> C
|
||||
B3c --> C
|
||||
B3d --> C
|
||||
|
||||
C --> D[Combined JSON Output]
|
||||
D --> E[Final CrawlResult]
|
||||
|
||||
style B1 fill:#c8e6c9
|
||||
style B2 fill:#e3f2fd
|
||||
style B3 fill:#fff3e0
|
||||
style C fill:#f3e5f5
|
||||
```
|
||||
|
||||
### Performance Comparison Matrix
|
||||
|
||||
```mermaid
|
||||
graph TD
|
||||
subgraph "Strategy Performance"
|
||||
A[Extraction Strategy Comparison]
|
||||
|
||||
subgraph "Speed ⚡"
|
||||
S1[Regex: ~10ms]
|
||||
S2[CSS Schema: ~50ms]
|
||||
S3[XPath: ~100ms]
|
||||
S4[LLM: ~2-10s]
|
||||
end
|
||||
|
||||
subgraph "Accuracy 🎯"
|
||||
A1[Regex: Pattern-dependent]
|
||||
A2[CSS: High for structured]
|
||||
A3[XPath: Very high]
|
||||
A4[LLM: Excellent for complex]
|
||||
end
|
||||
|
||||
subgraph "Cost 💰"
|
||||
C1[Regex: Free]
|
||||
C2[CSS: Free]
|
||||
C3[XPath: Free]
|
||||
C4[LLM: $0.001-0.01 per page]
|
||||
end
|
||||
|
||||
subgraph "Complexity 🔧"
|
||||
X1[Regex: Simple patterns only]
|
||||
X2[CSS: Structured HTML]
|
||||
X3[XPath: Complex selectors]
|
||||
X4[LLM: Any content type]
|
||||
end
|
||||
end
|
||||
|
||||
style S1 fill:#c8e6c9
|
||||
style S2 fill:#e8f5e8
|
||||
style S3 fill:#fff3e0
|
||||
style S4 fill:#ffcdd2
|
||||
|
||||
style A2 fill:#e8f5e8
|
||||
style A3 fill:#c8e6c9
|
||||
style A4 fill:#c8e6c9
|
||||
|
||||
style C1 fill:#c8e6c9
|
||||
style C2 fill:#c8e6c9
|
||||
style C3 fill:#c8e6c9
|
||||
style C4 fill:#fff3e0
|
||||
|
||||
style X1 fill:#ffcdd2
|
||||
style X2 fill:#e8f5e8
|
||||
style X3 fill:#c8e6c9
|
||||
style X4 fill:#c8e6c9
|
||||
```
|
||||
|
||||
### Regex Pattern Strategy Flow
|
||||
|
||||
```mermaid
|
||||
flowchart TD
|
||||
A[Regex Extraction] --> B{Pattern Source?}
|
||||
|
||||
B -->|Built-in| C[Use Predefined Patterns]
|
||||
B -->|Custom| D[Define Custom Regex]
|
||||
B -->|LLM-Generated| E[Generate with AI]
|
||||
|
||||
C --> C1[Email Pattern]
|
||||
C --> C2[Phone Pattern]
|
||||
C --> C3[URL Pattern]
|
||||
C --> C4[Currency Pattern]
|
||||
C --> C5[Date Pattern]
|
||||
|
||||
D --> D1[Write Custom Regex]
|
||||
D --> D2[Test Pattern]
|
||||
D --> D3{Pattern Works?}
|
||||
D3 -->|No| D1
|
||||
D3 -->|Yes| D4[Use Pattern]
|
||||
|
||||
E --> E1[Provide Sample Content]
|
||||
E --> E2[LLM Analyzes Content]
|
||||
E --> E3[Generate Optimized Regex]
|
||||
E --> E4[Cache Pattern for Reuse]
|
||||
|
||||
C1 --> F[Pattern Matching]
|
||||
C2 --> F
|
||||
C3 --> F
|
||||
C4 --> F
|
||||
C5 --> F
|
||||
D4 --> F
|
||||
E4 --> F
|
||||
|
||||
F --> G[Extract Matches]
|
||||
G --> H[Group by Pattern Type]
|
||||
H --> I[JSON Output with Labels]
|
||||
|
||||
style C fill:#e8f5e8
|
||||
style D fill:#e3f2fd
|
||||
style E fill:#fff3e0
|
||||
style F fill:#f3e5f5
|
||||
```
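
A minimal sketch of the built-in pattern path, assuming the `RegexExtractionStrategy` shipped in recent releases, whose predefined patterns can be OR-ed together:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import RegexExtractionStrategy

# Combine built-in patterns; custom regexes can also be supplied
strategy = RegexExtractionStrategy(
    pattern=RegexExtractionStrategy.Email | RegexExtractionStrategy.Url
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            "https://example.com",
            config=CrawlerRunConfig(extraction_strategy=strategy),
        )
        print(result.extracted_content)  # JSON list of matches grouped by label

asyncio.run(main())
```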
|
||||
|
||||
### Complex Schema Structure Visualization
|
||||
|
||||
```mermaid
|
||||
graph TB
|
||||
subgraph "E-commerce Schema Example"
|
||||
A[Category baseSelector] --> B[Category Fields]
|
||||
A --> C[Products nested_list]
|
||||
|
||||
B --> B1[category_name]
|
||||
B --> B2[category_id attribute]
|
||||
B --> B3[category_url attribute]
|
||||
|
||||
C --> C1[Product baseSelector]
|
||||
C1 --> C2[name text]
|
||||
C1 --> C3[price text]
|
||||
C1 --> C4[Details nested object]
|
||||
C1 --> C5[Features list]
|
||||
C1 --> C6[Reviews nested_list]
|
||||
|
||||
C4 --> C4a[brand text]
|
||||
C4 --> C4b[model text]
|
||||
C4 --> C4c[specs html]
|
||||
|
||||
C5 --> C5a[feature text array]
|
||||
|
||||
C6 --> C6a[reviewer text]
|
||||
C6 --> C6b[rating attribute]
|
||||
C6 --> C6c[comment text]
|
||||
C6 --> C6d[date attribute]
|
||||
end
|
||||
|
||||
subgraph "JSON Output Structure"
|
||||
D[categories array] --> D1[category object]
|
||||
D1 --> D2[category_name]
|
||||
D1 --> D3[category_id]
|
||||
D1 --> D4[products array]
|
||||
|
||||
D4 --> D5[product object]
|
||||
D5 --> D6[name, price]
|
||||
D5 --> D7[details object]
|
||||
D5 --> D8[features array]
|
||||
D5 --> D9[reviews array]
|
||||
|
||||
D7 --> D7a[brand, model, specs]
|
||||
D8 --> D8a[feature strings]
|
||||
D9 --> D9a[review objects]
|
||||
end
|
||||
|
||||
A -.-> D
|
||||
B1 -.-> D2
|
||||
C2 -.-> D6
|
||||
C4 -.-> D7
|
||||
C5 -.-> D8
|
||||
C6 -.-> D9
|
||||
|
||||
style A fill:#e3f2fd
|
||||
style C fill:#f3e5f5
|
||||
style C4 fill:#e8f5e8
|
||||
style D fill:#fff3e0
|
||||
```
|
||||
|
||||
### Error Handling and Fallback Strategy
|
||||
|
||||
```mermaid
|
||||
stateDiagram-v2
|
||||
[*] --> PrimaryStrategy
|
||||
|
||||
PrimaryStrategy --> Success: Extraction successful
|
||||
PrimaryStrategy --> ValidationFailed: Invalid data
|
||||
PrimaryStrategy --> ExtractionFailed: No matches found
|
||||
PrimaryStrategy --> TimeoutError: LLM timeout
|
||||
|
||||
ValidationFailed --> FallbackStrategy: Try alternative
|
||||
ExtractionFailed --> FallbackStrategy: Try alternative
|
||||
TimeoutError --> FallbackStrategy: Try alternative
|
||||
|
||||
FallbackStrategy --> FallbackSuccess: Fallback works
|
||||
FallbackStrategy --> FallbackFailed: All strategies failed
|
||||
|
||||
FallbackSuccess --> Success: Return results
|
||||
FallbackFailed --> ErrorReport: Log failure details
|
||||
|
||||
Success --> [*]: Complete
|
||||
ErrorReport --> [*]: Return empty results
|
||||
|
||||
note right of PrimaryStrategy : Try fastest/most accurate first
|
||||
note right of FallbackStrategy : Use simpler but reliable method
|
||||
note left of ErrorReport : Provide debugging information
|
||||
```
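
This fallback state machine is not built into the library; a thin wrapper like the sketch below (all names illustrative) implements it on top of `arun`:

```python
import json

async def extract_with_fallback(crawler, url, primary_cfg, fallback_cfg):
    result = await crawler.arun(url, config=primary_cfg)
    if result.success and result.extracted_content:
        data = json.loads(result.extracted_content)
        if data:                                  # Primary strategy produced data
            return data
    result = await crawler.arun(url, config=fallback_cfg)  # Simpler, reliable method
    if result.success and result.extracted_content:
        return json.loads(result.extracted_content)
    print(f"All strategies failed for {url}: {result.error_message}")  # Debug details
    return []                                     # Empty results, as in the diagram
```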
|
||||
|
||||
### Token Usage and Cost Optimization
|
||||
|
||||
```mermaid
|
||||
flowchart TD
|
||||
A[LLM Extraction Request] --> B{Content Size Check}
|
||||
|
||||
B -->|Small < 1200 tokens| C[Single LLM Call]
|
||||
    B -->|Large >= 1200 tokens| D[Chunking Strategy]
|
||||
|
||||
C --> C1[Send full content]
|
||||
C1 --> C2[Parse JSON response]
|
||||
C2 --> C3[Track token usage]
|
||||
|
||||
D --> D1[Split into chunks]
|
||||
D1 --> D2[Add overlap between chunks]
|
||||
D2 --> D3[Process chunks in parallel]
|
||||
|
||||
D3 --> D4[Chunk 1 → LLM]
|
||||
D3 --> D5[Chunk 2 → LLM]
|
||||
D3 --> D6[Chunk N → LLM]
|
||||
|
||||
D4 --> D7[Merge results]
|
||||
D5 --> D7
|
||||
D6 --> D7
|
||||
|
||||
D7 --> D8[Deduplicate data]
|
||||
D8 --> D9[Aggregate token usage]
|
||||
|
||||
C3 --> E[Cost Calculation]
|
||||
D9 --> E
|
||||
|
||||
E --> F[Usage Report]
|
||||
F --> F1[Prompt tokens: X]
|
||||
F --> F2[Completion tokens: Y]
|
||||
F --> F3[Total cost: $Z]
|
||||
|
||||
style C fill:#c8e6c9
|
||||
style D fill:#fff3e0
|
||||
style E fill:#e3f2fd
|
||||
style F fill:#f3e5f5
|
||||
```
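
The knobs controlling this flow sit on the strategy itself. A hedged sketch follows, where the 1200-token threshold mirrors the diagram rather than any library default:

```python
import os
from crawl4ai import LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(provider="openai/gpt-4o-mini",
                         api_token=os.getenv("OPENAI_API_KEY")),
    extraction_type="block",
    instruction="Summarize each section.",
    apply_chunking=True,            # Split oversized pages before the LLM call
    chunk_token_threshold=1200,     # Chunk content above this size (illustrative)
    overlap_rate=0.1,               # 10% overlap between adjacent chunks
)
# After a crawl, strategy.show_usage() reports prompt/completion tokens and cost
```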
|
||||
|
||||
**📖 Learn more:** [LLM Strategies](https://docs.crawl4ai.com/extraction/llm-strategies/), [Schema-Based Extraction](https://docs.crawl4ai.com/extraction/no-llm-strategies/), [Pattern Matching](https://docs.crawl4ai.com/extraction/no-llm-strategies/#regexextractionstrategy), [Performance Optimization](https://docs.crawl4ai.com/advanced/multi-url-crawling/)
|
||||
@@ -0,0 +1,472 @@
|
||||
## HTTP Crawler Strategy Workflows
|
||||
|
||||
Visual representations of HTTP-based crawling architecture, request flows, and performance characteristics compared to browser-based strategies.
|
||||
|
||||
### HTTP vs Browser Strategy Decision Tree
|
||||
|
||||
```mermaid
|
||||
flowchart TD
|
||||
A[Content Crawling Need] --> B{Content Type Analysis}
|
||||
|
||||
B -->|Static HTML| C{JavaScript Required?}
|
||||
B -->|Dynamic SPA| D[Browser Strategy Required]
|
||||
B -->|API Endpoints| E[HTTP Strategy Optimal]
|
||||
B -->|Mixed Content| F{Primary Content Source?}
|
||||
|
||||
C -->|No JS Needed| G[HTTP Strategy Recommended]
|
||||
C -->|JS Required| H[Browser Strategy Required]
|
||||
C -->|Unknown| I{Performance Priority?}
|
||||
|
||||
I -->|Speed Critical| J[Try HTTP First]
|
||||
I -->|Accuracy Critical| K[Use Browser Strategy]
|
||||
|
||||
F -->|Mostly Static| G
|
||||
F -->|Mostly Dynamic| D
|
||||
|
||||
G --> L{Resource Constraints?}
|
||||
L -->|Memory Limited| M[HTTP Strategy - Lightweight]
|
||||
L -->|CPU Limited| N[HTTP Strategy - No Browser]
|
||||
L -->|Network Limited| O[HTTP Strategy - Efficient]
|
||||
L -->|No Constraints| P[Either Strategy Works]
|
||||
|
||||
J --> Q[Test HTTP Results]
|
||||
Q --> R{Content Complete?}
|
||||
R -->|Yes| S[Continue with HTTP]
|
||||
R -->|No| T[Switch to Browser Strategy]
|
||||
|
||||
D --> U[Browser Strategy Features]
|
||||
H --> U
|
||||
K --> U
|
||||
T --> U
|
||||
|
||||
U --> V[JavaScript Execution]
|
||||
U --> W[Screenshots/PDFs]
|
||||
U --> X[Complex Interactions]
|
||||
U --> Y[Session Management]
|
||||
|
||||
M --> Z[HTTP Strategy Benefits]
|
||||
N --> Z
|
||||
O --> Z
|
||||
S --> Z
|
||||
|
||||
Z --> AA[10x Faster Processing]
|
||||
Z --> BB[Lower Memory Usage]
|
||||
Z --> CC[Higher Concurrency]
|
||||
Z --> DD[Simpler Deployment]
|
||||
|
||||
style G fill:#c8e6c9
|
||||
style M fill:#c8e6c9
|
||||
style N fill:#c8e6c9
|
||||
style O fill:#c8e6c9
|
||||
style S fill:#c8e6c9
|
||||
style D fill:#e3f2fd
|
||||
style H fill:#e3f2fd
|
||||
style K fill:#e3f2fd
|
||||
style T fill:#e3f2fd
|
||||
style U fill:#e3f2fd
|
||||
```
|
||||
|
||||
### HTTP Request Lifecycle Sequence
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant Client
|
||||
participant HTTPStrategy as HTTP Strategy
|
||||
participant Session as HTTP Session
|
||||
participant Server as Target Server
|
||||
participant Processor as Content Processor
|
||||
|
||||
Client->>HTTPStrategy: crawl(url, config)
|
||||
HTTPStrategy->>HTTPStrategy: validate_url()
|
||||
|
||||
alt URL Type Check
|
||||
HTTPStrategy->>HTTPStrategy: handle_file_url()
|
||||
Note over HTTPStrategy: file:// URLs
|
||||
else
|
||||
HTTPStrategy->>HTTPStrategy: handle_raw_content()
|
||||
Note over HTTPStrategy: raw:// content
|
||||
else
|
||||
HTTPStrategy->>Session: prepare_request()
|
||||
Session->>Session: apply_config()
|
||||
Session->>Session: set_headers()
|
||||
Session->>Session: setup_auth()
|
||||
|
||||
Session->>Server: HTTP Request
|
||||
Note over Session,Server: GET/POST/PUT with headers
|
||||
|
||||
alt Success Response
|
||||
Server-->>Session: HTTP 200 + Content
|
||||
Session-->>HTTPStrategy: response_data
|
||||
else Redirect Response
|
||||
Server-->>Session: HTTP 3xx + Location
|
||||
Session->>Server: Follow redirect
|
||||
Server-->>Session: HTTP 200 + Content
|
||||
Session-->>HTTPStrategy: final_response
|
||||
else Error Response
|
||||
Server-->>Session: HTTP 4xx/5xx
|
||||
Session-->>HTTPStrategy: error_response
|
||||
end
|
||||
end
|
||||
|
||||
HTTPStrategy->>Processor: process_content()
|
||||
Processor->>Processor: clean_html()
|
||||
Processor->>Processor: extract_metadata()
|
||||
Processor->>Processor: generate_markdown()
|
||||
Processor-->>HTTPStrategy: processed_result
|
||||
|
||||
HTTPStrategy-->>Client: CrawlResult
|
||||
|
||||
Note over Client,Processor: Fast, lightweight processing
|
||||
Note over HTTPStrategy: No browser overhead
|
||||
```
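
In code, selecting the HTTP path means swapping the crawler strategy. A sketch assuming the `AsyncHTTPCrawlerStrategy` and `HTTPCrawlerConfig` available in recent releases:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import HTTPCrawlerConfig
from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy

http_strategy = AsyncHTTPCrawlerStrategy(
    browser_config=HTTPCrawlerConfig(
        method="GET",
        headers={"User-Agent": "MyCrawler/1.0"},  # Illustrative header
        follow_redirects=True,
        verify_ssl=True,
    )
)

async def main():
    # No browser launches; pages are fetched and processed over plain HTTP
    async with AsyncWebCrawler(crawler_strategy=http_strategy) as crawler:
        result = await crawler.arun("https://example.com")
        print(result.status_code, len(result.cleaned_html or ""))

asyncio.run(main())
```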
|
||||
|
||||
### HTTP Strategy Architecture
|
||||
|
||||
```mermaid
|
||||
graph TB
|
||||
subgraph "HTTP Crawler Strategy"
|
||||
A[AsyncHTTPCrawlerStrategy] --> B[Session Manager]
|
||||
A --> C[Request Builder]
|
||||
A --> D[Response Handler]
|
||||
A --> E[Error Manager]
|
||||
|
||||
B --> B1[Connection Pool]
|
||||
B --> B2[DNS Cache]
|
||||
B --> B3[SSL Context]
|
||||
|
||||
C --> C1[Headers Builder]
|
||||
C --> C2[Auth Handler]
|
||||
C --> C3[Payload Encoder]
|
||||
|
||||
D --> D1[Content Decoder]
|
||||
D --> D2[Redirect Handler]
|
||||
D --> D3[Status Validator]
|
||||
|
||||
E --> E1[Retry Logic]
|
||||
E --> E2[Timeout Handler]
|
||||
E --> E3[Exception Mapper]
|
||||
end
|
||||
|
||||
subgraph "Content Processing"
|
||||
F[Raw HTML] --> G[HTML Cleaner]
|
||||
G --> H[Markdown Generator]
|
||||
H --> I[Link Extractor]
|
||||
I --> J[Media Extractor]
|
||||
J --> K[Metadata Parser]
|
||||
end
|
||||
|
||||
subgraph "External Resources"
|
||||
L[Target Websites]
|
||||
M[Local Files]
|
||||
N[Raw Content]
|
||||
end
|
||||
|
||||
subgraph "Output"
|
||||
O[CrawlResult]
|
||||
O --> O1[HTML Content]
|
||||
O --> O2[Markdown Text]
|
||||
O --> O3[Extracted Links]
|
||||
O --> O4[Media References]
|
||||
O --> O5[Status Information]
|
||||
end
|
||||
|
||||
A --> F
|
||||
L --> A
|
||||
M --> A
|
||||
N --> A
|
||||
K --> O
|
||||
|
||||
style A fill:#e3f2fd
|
||||
style B fill:#f3e5f5
|
||||
style F fill:#e8f5e8
|
||||
style O fill:#fff3e0
|
||||
```
|
||||
|
||||
### Performance Comparison Flow
|
||||
|
||||
```mermaid
|
||||
graph LR
|
||||
subgraph "HTTP Strategy Performance"
|
||||
A1[Request Start] --> A2[DNS Lookup: 50ms]
|
||||
A2 --> A3[TCP Connect: 100ms]
|
||||
A3 --> A4[HTTP Request: 200ms]
|
||||
A4 --> A5[Content Download: 300ms]
|
||||
A5 --> A6[Processing: 50ms]
|
||||
A6 --> A7[Total: ~700ms]
|
||||
end
|
||||
|
||||
subgraph "Browser Strategy Performance"
|
||||
B1[Request Start] --> B2[Browser Launch: 2000ms]
|
||||
B2 --> B3[Page Navigation: 1000ms]
|
||||
B3 --> B4[JS Execution: 500ms]
|
||||
B4 --> B5[Content Rendering: 300ms]
|
||||
B5 --> B6[Processing: 100ms]
|
||||
B6 --> B7[Total: ~3900ms]
|
||||
end
|
||||
|
||||
subgraph "Resource Usage"
|
||||
C1[HTTP Memory: ~50MB]
|
||||
C2[Browser Memory: ~500MB]
|
||||
C3[HTTP CPU: Low]
|
||||
C4[Browser CPU: High]
|
||||
C5[HTTP Concurrency: 100+]
|
||||
C6[Browser Concurrency: 10-20]
|
||||
end
|
||||
|
||||
    A7 --> D[~5.6x Faster]
|
||||
B7 --> D
|
||||
C1 --> E[10x Less Memory]
|
||||
C2 --> E
|
||||
C5 --> F[5x More Concurrent]
|
||||
C6 --> F
|
||||
|
||||
style A7 fill:#c8e6c9
|
||||
style B7 fill:#ffcdd2
|
||||
style C1 fill:#c8e6c9
|
||||
style C2 fill:#ffcdd2
|
||||
style C5 fill:#c8e6c9
|
||||
style C6 fill:#ffcdd2
|
||||
```
|
||||
|
||||
### HTTP Request Types and Configuration
|
||||
|
||||
```mermaid
|
||||
stateDiagram-v2
|
||||
[*] --> HTTPConfigSetup
|
||||
|
||||
HTTPConfigSetup --> MethodSelection
|
||||
|
||||
MethodSelection --> GET: Simple data retrieval
|
||||
MethodSelection --> POST: Form submission
|
||||
MethodSelection --> PUT: Data upload
|
||||
MethodSelection --> DELETE: Resource removal
|
||||
|
||||
GET --> HeaderSetup: Set Accept headers
|
||||
POST --> PayloadSetup: JSON or form data
|
||||
PUT --> PayloadSetup: File or data upload
|
||||
DELETE --> AuthSetup: Authentication required
|
||||
|
||||
PayloadSetup --> JSONPayload: application/json
|
||||
PayloadSetup --> FormPayload: form-data
|
||||
PayloadSetup --> RawPayload: custom content
|
||||
|
||||
JSONPayload --> HeaderSetup
|
||||
FormPayload --> HeaderSetup
|
||||
RawPayload --> HeaderSetup
|
||||
|
||||
HeaderSetup --> AuthSetup
|
||||
AuthSetup --> SSLSetup
|
||||
SSLSetup --> RedirectSetup
|
||||
RedirectSetup --> RequestExecution
|
||||
|
||||
RequestExecution --> [*]: Request complete
|
||||
|
||||
note right of GET : Default method for most crawling
|
||||
note right of POST : API interactions, form submissions
|
||||
note right of JSONPayload : Structured data transmission
|
||||
note right of HeaderSetup : User-Agent, Accept, Custom headers
|
||||
```
|
||||
|
||||
### Error Handling and Retry Workflow
|
||||
|
||||
```mermaid
|
||||
flowchart TD
|
||||
A[HTTP Request] --> B{Response Received?}
|
||||
|
||||
B -->|No| C[Connection Error]
|
||||
B -->|Yes| D{Status Code Check}
|
||||
|
||||
C --> C1{Timeout Error?}
|
||||
C1 -->|Yes| C2[ConnectionTimeoutError]
|
||||
C1 -->|No| C3[Network Error]
|
||||
|
||||
D -->|2xx| E[Success Response]
|
||||
D -->|3xx| F[Redirect Response]
|
||||
D -->|4xx| G[Client Error]
|
||||
D -->|5xx| H[Server Error]
|
||||
|
||||
F --> F1{Follow Redirects?}
|
||||
F1 -->|Yes| F2[Follow Redirect]
|
||||
F1 -->|No| F3[Return Redirect Response]
|
||||
F2 --> A
|
||||
|
||||
G --> G1{Retry on 4xx?}
|
||||
G1 -->|No| G2[HTTPStatusError]
|
||||
G1 -->|Yes| I[Check Retry Count]
|
||||
|
||||
H --> H1{Retry on 5xx?}
|
||||
H1 -->|Yes| I
|
||||
H1 -->|No| H2[HTTPStatusError]
|
||||
|
||||
C2 --> I
|
||||
C3 --> I
|
||||
|
||||
I --> J{Retries < Max?}
|
||||
J -->|No| K[Final Error]
|
||||
J -->|Yes| L[Calculate Backoff]
|
||||
|
||||
L --> M[Wait Backoff Time]
|
||||
M --> N[Increment Retry Count]
|
||||
N --> A
|
||||
|
||||
E --> O[Process Content]
|
||||
F3 --> O
|
||||
O --> P[Return CrawlResult]
|
||||
|
||||
G2 --> Q[Error CrawlResult]
|
||||
H2 --> Q
|
||||
K --> Q
|
||||
|
||||
style E fill:#c8e6c9
|
||||
style P fill:#c8e6c9
|
||||
style G2 fill:#ffcdd2
|
||||
style H2 fill:#ffcdd2
|
||||
style K fill:#ffcdd2
|
||||
style Q fill:#ffcdd2
|
||||
```
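
A generic retry wrapper implementing this flow on top of `arun`; the retryable status set and backoff constants are illustrative:

```python
import asyncio

RETRYABLE = {429, 500, 502, 503, 504}

async def crawl_with_retries(crawler, url, config, max_retries=3, base_delay=1.0):
    result = None
    for attempt in range(max_retries + 1):
        result = await crawler.arun(url, config=config)
        if result.success:
            return result
        # Fail fast on non-retryable client errors (e.g. 404)
        if result.status_code is not None and result.status_code not in RETRYABLE:
            return result
        if attempt < max_retries:
            await asyncio.sleep(base_delay * 2 ** attempt)  # Exponential backoff
    return result                                           # Final error after max retries
```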
|
||||
|
||||
### Batch Processing Architecture
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant Client
|
||||
participant BatchManager as Batch Manager
|
||||
participant HTTPPool as Connection Pool
|
||||
participant Workers as HTTP Workers
|
||||
participant Targets as Target Servers
|
||||
|
||||
Client->>BatchManager: batch_crawl(urls)
|
||||
BatchManager->>BatchManager: create_semaphore(max_concurrent)
|
||||
|
||||
loop For each URL batch
|
||||
BatchManager->>HTTPPool: acquire_connection()
|
||||
HTTPPool->>Workers: assign_worker()
|
||||
|
||||
par Concurrent Processing
|
||||
Workers->>Targets: HTTP Request 1
|
||||
Workers->>Targets: HTTP Request 2
|
||||
Workers->>Targets: HTTP Request N
|
||||
end
|
||||
|
||||
par Response Handling
|
||||
Targets-->>Workers: Response 1
|
||||
Targets-->>Workers: Response 2
|
||||
Targets-->>Workers: Response N
|
||||
end
|
||||
|
||||
Workers->>HTTPPool: return_connection()
|
||||
HTTPPool->>BatchManager: batch_results()
|
||||
end
|
||||
|
||||
BatchManager->>BatchManager: aggregate_results()
|
||||
BatchManager-->>Client: final_results()
|
||||
|
||||
Note over Workers,Targets: 20-100 concurrent connections
|
||||
Note over BatchManager: Memory-efficient processing
|
||||
Note over HTTPPool: Connection reuse optimization
|
||||
```
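
The same pooling idea, sketched directly with `aiohttp` to make the connection reuse explicit; the limits are illustrative, and this bypasses Crawl4AI's own dispatchers:

```python
import asyncio
import aiohttp

async def fetch_all(urls, max_concurrent=20):
    semaphore = asyncio.Semaphore(max_concurrent)           # Concurrency cap
    connector = aiohttp.TCPConnector(limit=max_concurrent)  # Pooled, reused connections

    async with aiohttp.ClientSession(connector=connector) as session:
        async def fetch(url):
            async with semaphore:
                async with session.get(url) as resp:
                    return url, resp.status, await resp.text()
        return await asyncio.gather(*(fetch(u) for u in urls), return_exceptions=True)
```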
|
||||
|
||||
### Content Type Processing Pipeline
|
||||
|
||||
```mermaid
|
||||
graph TD
|
||||
A[HTTP Response] --> B{Content-Type Detection}
|
||||
|
||||
B -->|text/html| C[HTML Processing]
|
||||
B -->|application/json| D[JSON Processing]
|
||||
B -->|text/plain| E[Text Processing]
|
||||
B -->|application/xml| F[XML Processing]
|
||||
B -->|Other| G[Binary Processing]
|
||||
|
||||
C --> C1[Parse HTML Structure]
|
||||
C1 --> C2[Extract Text Content]
|
||||
C2 --> C3[Generate Markdown]
|
||||
C3 --> C4[Extract Links/Media]
|
||||
|
||||
D --> D1[Parse JSON Structure]
|
||||
D1 --> D2[Extract Data Fields]
|
||||
D2 --> D3[Format as Readable Text]
|
||||
|
||||
E --> E1[Clean Text Content]
|
||||
E1 --> E2[Basic Formatting]
|
||||
|
||||
F --> F1[Parse XML Structure]
|
||||
F1 --> F2[Extract Text Nodes]
|
||||
F2 --> F3[Convert to Markdown]
|
||||
|
||||
G --> G1[Save Binary Content]
|
||||
G1 --> G2[Generate Metadata]
|
||||
|
||||
C4 --> H[Content Analysis]
|
||||
D3 --> H
|
||||
E2 --> H
|
||||
F3 --> H
|
||||
G2 --> H
|
||||
|
||||
H --> I[Link Extraction]
|
||||
H --> J[Media Detection]
|
||||
H --> K[Metadata Parsing]
|
||||
|
||||
I --> L[CrawlResult Assembly]
|
||||
J --> L
|
||||
K --> L
|
||||
|
||||
L --> M[Final Output]
|
||||
|
||||
style C fill:#e8f5e8
|
||||
style H fill:#fff3e0
|
||||
style L fill:#e3f2fd
|
||||
style M fill:#c8e6c9
|
||||
```
|
||||
|
||||
### Integration with Processing Strategies
|
||||
|
||||
```mermaid
|
||||
graph LR
|
||||
subgraph "HTTP Strategy Core"
|
||||
A[HTTP Request] --> B[Raw Content]
|
||||
B --> C[Content Decoder]
|
||||
end
|
||||
|
||||
subgraph "Processing Pipeline"
|
||||
C --> D[HTML Cleaner]
|
||||
D --> E[Markdown Generator]
|
||||
E --> F{Content Filter?}
|
||||
|
||||
F -->|Yes| G[Pruning Filter]
|
||||
F -->|Yes| H[BM25 Filter]
|
||||
F -->|No| I[Raw Markdown]
|
||||
|
||||
G --> J[Fit Markdown]
|
||||
H --> J
|
||||
end
|
||||
|
||||
subgraph "Extraction Strategies"
|
||||
I --> K[CSS Extraction]
|
||||
J --> K
|
||||
I --> L[XPath Extraction]
|
||||
J --> L
|
||||
I --> M[LLM Extraction]
|
||||
J --> M
|
||||
end
|
||||
|
||||
subgraph "Output Generation"
|
||||
K --> N[Structured JSON]
|
||||
L --> N
|
||||
M --> N
|
||||
|
||||
I --> O[Clean Markdown]
|
||||
J --> P[Filtered Content]
|
||||
|
||||
N --> Q[Final CrawlResult]
|
||||
O --> Q
|
||||
P --> Q
|
||||
end
|
||||
|
||||
style A fill:#e3f2fd
|
||||
style C fill:#f3e5f5
|
||||
style E fill:#e8f5e8
|
||||
style Q fill:#c8e6c9
|
||||
```
|
||||
|
||||
**📖 Learn more:** [HTTP vs Browser Strategies](https://docs.crawl4ai.com/core/browser-crawler-config/), [Performance Optimization](https://docs.crawl4ai.com/advanced/multi-url-crawling/), [Error Handling](https://docs.crawl4ai.com/api/async-webcrawler/)
|
||||
368
docs/md_v2/assets/llm.txt/diagrams/installation.txt
Normal file
@@ -0,0 +1,368 @@
|
||||
## Installation Workflows and Architecture
|
||||
|
||||
Visual representations of Crawl4AI installation processes, deployment options, and system interactions.
|
||||
|
||||
### Installation Decision Flow
|
||||
|
||||
```mermaid
|
||||
flowchart TD
|
||||
A[Start Installation] --> B{Environment Type?}
|
||||
|
||||
B -->|Local Development| C[Basic Python Install]
|
||||
B -->|Production| D[Docker Deployment]
|
||||
B -->|Research/Testing| E[Google Colab]
|
||||
B -->|CI/CD Pipeline| F[Automated Setup]
|
||||
|
||||
C --> C1[pip install crawl4ai]
|
||||
C1 --> C2[crawl4ai-setup]
|
||||
C2 --> C3{Need Advanced Features?}
|
||||
|
||||
C3 -->|No| C4[Basic Installation Complete]
|
||||
C3 -->|Text Clustering| C5[pip install crawl4ai with torch]
|
||||
C3 -->|Transformers| C6[pip install crawl4ai with transformer]
|
||||
C3 -->|All Features| C7[pip install crawl4ai with all]
|
||||
|
||||
C5 --> C8[crawl4ai-download-models]
|
||||
C6 --> C8
|
||||
C7 --> C8
|
||||
C8 --> C9[Advanced Installation Complete]
|
||||
|
||||
D --> D1{Deployment Method?}
|
||||
D1 -->|Pre-built Image| D2[docker pull unclecode/crawl4ai]
|
||||
D1 -->|Docker Compose| D3[Clone repo + docker compose]
|
||||
D1 -->|Custom Build| D4[docker buildx build]
|
||||
|
||||
D2 --> D5[Configure .llm.env]
|
||||
D3 --> D5
|
||||
D4 --> D5
|
||||
D5 --> D6[docker run with ports]
|
||||
D6 --> D7[Docker Deployment Complete]
|
||||
|
||||
E --> E1[Colab pip install]
|
||||
E1 --> E2[playwright install chromium]
|
||||
E2 --> E3[Test basic crawl]
|
||||
E3 --> E4[Colab Setup Complete]
|
||||
|
||||
F --> F1[Automated pip install]
|
||||
F1 --> F2[Automated setup scripts]
|
||||
F2 --> F3[CI/CD Integration Complete]
|
||||
|
||||
C4 --> G[Verify with crawl4ai-doctor]
|
||||
C9 --> G
|
||||
D7 --> H[Health check via API]
|
||||
E4 --> I[Run test crawl]
|
||||
F3 --> G
|
||||
|
||||
G --> J[Installation Verified]
|
||||
H --> J
|
||||
I --> J
|
||||
|
||||
style A fill:#e1f5fe
|
||||
style J fill:#c8e6c9
|
||||
style C4 fill:#fff3e0
|
||||
style C9 fill:#fff3e0
|
||||
style D7 fill:#f3e5f5
|
||||
style E4 fill:#fce4ec
|
||||
style F3 fill:#e8f5e8
|
||||
```
|
||||
|
||||
### Basic Installation Sequence
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant User
|
||||
participant PyPI
|
||||
participant System
|
||||
participant Playwright
|
||||
participant Crawler
|
||||
|
||||
User->>PyPI: pip install crawl4ai
|
||||
PyPI-->>User: Package downloaded
|
||||
|
||||
User->>System: crawl4ai-setup
|
||||
System->>Playwright: Install browser binaries
|
||||
Playwright-->>System: Chromium, Firefox installed
|
||||
System-->>User: Setup complete
|
||||
|
||||
User->>System: crawl4ai-doctor
|
||||
System->>System: Check Python version
|
||||
System->>System: Verify Playwright installation
|
||||
System->>System: Test browser launch
|
||||
System-->>User: Diagnostics report
|
||||
|
||||
User->>Crawler: Basic crawl test
|
||||
Crawler->>Playwright: Launch browser
|
||||
Playwright-->>Crawler: Browser ready
|
||||
Crawler->>Crawler: Navigate to test URL
|
||||
Crawler-->>User: Success confirmation
|
||||
```
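
The final smoke test in this sequence amounts to a few lines; if this prints markdown, Playwright, the browser binaries, and the crawler pipeline are all wired up:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(str(result.markdown)[:300] if result.success else result.error_message)

asyncio.run(main())
```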
|
||||
|
||||
### Docker Deployment Architecture
|
||||
|
||||
```mermaid
|
||||
graph TB
|
||||
subgraph "Host System"
|
||||
A[Docker Engine] --> B[Crawl4AI Container]
|
||||
C[.llm.env File] --> B
|
||||
D[Port 11235] --> B
|
||||
end
|
||||
|
||||
subgraph "Container Environment"
|
||||
B --> E[FastAPI Server]
|
||||
B --> F[Playwright Browsers]
|
||||
B --> G[Python Runtime]
|
||||
|
||||
E --> H[/crawl Endpoint]
|
||||
E --> I[/playground Interface]
|
||||
E --> J[/health Monitoring]
|
||||
E --> K[/metrics Prometheus]
|
||||
|
||||
F --> L[Chromium Browser]
|
||||
F --> M[Firefox Browser]
|
||||
F --> N[WebKit Browser]
|
||||
end
|
||||
|
||||
subgraph "External Services"
|
||||
O[OpenAI API] --> B
|
||||
P[Anthropic API] --> B
|
||||
Q[Local LLM Ollama] --> B
|
||||
end
|
||||
|
||||
subgraph "Client Applications"
|
||||
R[Python SDK] --> H
|
||||
S[REST API Calls] --> H
|
||||
T[Web Browser] --> I
|
||||
U[Monitoring Tools] --> J
|
||||
V[Prometheus] --> K
|
||||
end
|
||||
|
||||
style B fill:#e3f2fd
|
||||
style E fill:#f3e5f5
|
||||
style F fill:#e8f5e8
|
||||
style G fill:#fff3e0
|
||||
```
|
||||
|
||||
### Advanced Features Installation Flow
|
||||
|
||||
```mermaid
|
||||
stateDiagram-v2
|
||||
[*] --> BasicInstall
|
||||
|
||||
BasicInstall --> FeatureChoice: crawl4ai installed
|
||||
|
||||
FeatureChoice --> TorchInstall: Need text clustering
|
||||
FeatureChoice --> TransformerInstall: Need HuggingFace models
|
||||
FeatureChoice --> AllInstall: Need everything
|
||||
FeatureChoice --> Complete: Basic features sufficient
|
||||
|
||||
TorchInstall --> TorchSetup: pip install crawl4ai with torch
|
||||
TransformerInstall --> TransformerSetup: pip install crawl4ai with transformer
|
||||
AllInstall --> AllSetup: pip install crawl4ai with all
|
||||
|
||||
TorchSetup --> ModelDownload: crawl4ai-setup
|
||||
TransformerSetup --> ModelDownload: crawl4ai-setup
|
||||
AllSetup --> ModelDownload: crawl4ai-setup
|
||||
|
||||
ModelDownload --> PreDownload: crawl4ai-download-models
|
||||
PreDownload --> Complete: All models cached
|
||||
|
||||
Complete --> Verification: crawl4ai-doctor
|
||||
Verification --> [*]: Installation verified
|
||||
|
||||
note right of TorchInstall : PyTorch for semantic operations
|
||||
note right of TransformerInstall : HuggingFace for LLM features
|
||||
note right of AllInstall : Complete feature set
|
||||
```
|
||||
|
||||
### Platform-Specific Installation Matrix
|
||||
|
||||
```mermaid
|
||||
graph LR
|
||||
subgraph "Installation Methods"
|
||||
A[Python Package] --> A1[pip install]
|
||||
B[Docker Image] --> B1[docker pull]
|
||||
C[Source Build] --> C1[git clone + build]
|
||||
D[Cloud Platform] --> D1[Colab/Kaggle]
|
||||
end
|
||||
|
||||
subgraph "Operating Systems"
|
||||
E[Linux x86_64]
|
||||
F[Linux ARM64]
|
||||
G[macOS Intel]
|
||||
H[macOS Apple Silicon]
|
||||
I[Windows x86_64]
|
||||
end
|
||||
|
||||
subgraph "Feature Sets"
|
||||
J[Basic crawling]
|
||||
K[Text clustering torch]
|
||||
L[LLM transformers]
|
||||
M[All features]
|
||||
end
|
||||
|
||||
A1 --> E
|
||||
A1 --> F
|
||||
A1 --> G
|
||||
A1 --> H
|
||||
A1 --> I
|
||||
|
||||
B1 --> E
|
||||
B1 --> F
|
||||
B1 --> G
|
||||
B1 --> H
|
||||
|
||||
C1 --> E
|
||||
C1 --> F
|
||||
C1 --> G
|
||||
C1 --> H
|
||||
C1 --> I
|
||||
|
||||
D1 --> E
|
||||
D1 --> I
|
||||
|
||||
E --> J
|
||||
E --> K
|
||||
E --> L
|
||||
E --> M
|
||||
|
||||
F --> J
|
||||
F --> K
|
||||
F --> L
|
||||
F --> M
|
||||
|
||||
G --> J
|
||||
G --> K
|
||||
G --> L
|
||||
G --> M
|
||||
|
||||
H --> J
|
||||
H --> K
|
||||
H --> L
|
||||
H --> M
|
||||
|
||||
I --> J
|
||||
I --> K
|
||||
I --> L
|
||||
I --> M
|
||||
|
||||
style A1 fill:#e3f2fd
|
||||
style B1 fill:#f3e5f5
|
||||
style C1 fill:#e8f5e8
|
||||
style D1 fill:#fff3e0
|
||||
```
|
||||
|
||||
### Docker Multi-Stage Build Process
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant Dev as Developer
|
||||
participant Git as GitHub Repo
|
||||
participant Docker as Docker Engine
|
||||
participant Registry as Docker Hub
|
||||
participant User as End User
|
||||
|
||||
Dev->>Git: Push code changes
|
||||
|
||||
Docker->>Git: Clone repository
|
||||
Docker->>Docker: Stage 1 - Base Python image
|
||||
Docker->>Docker: Stage 2 - Install dependencies
|
||||
Docker->>Docker: Stage 3 - Install Playwright
|
||||
Docker->>Docker: Stage 4 - Copy application code
|
||||
Docker->>Docker: Stage 5 - Setup FastAPI server
|
||||
|
||||
Note over Docker: Multi-architecture build
|
||||
Docker->>Docker: Build for linux/amd64
|
||||
Docker->>Docker: Build for linux/arm64
|
||||
|
||||
Docker->>Registry: Push multi-arch manifest
|
||||
Registry-->>Docker: Build complete
|
||||
|
||||
User->>Registry: docker pull unclecode/crawl4ai
|
||||
Registry-->>User: Download appropriate architecture
|
||||
|
||||
User->>Docker: docker run with configuration
|
||||
Docker->>Docker: Start container
|
||||
Docker->>Docker: Initialize FastAPI server
|
||||
Docker->>Docker: Setup Playwright browsers
|
||||
Docker-->>User: Service ready on port 11235
|
||||
```
|
||||
|
||||
### Installation Verification Workflow
|
||||
|
||||
```mermaid
|
||||
flowchart TD
|
||||
A[Installation Complete] --> B[Run crawl4ai-doctor]
|
||||
|
||||
B --> C{Python Version Check}
|
||||
C -->|✓ 3.10+| D{Playwright Check}
|
||||
C -->|✗ < 3.10| C1[Upgrade Python]
|
||||
C1 --> D
|
||||
|
||||
D -->|✓ Installed| E{Browser Binaries}
|
||||
D -->|✗ Missing| D1[Run crawl4ai-setup]
|
||||
D1 --> E
|
||||
|
||||
E -->|✓ Available| F{Test Browser Launch}
|
||||
E -->|✗ Missing| E1[playwright install]
|
||||
E1 --> F
|
||||
|
||||
F -->|✓ Success| G[Test Basic Crawl]
|
||||
F -->|✗ Failed| F1[Check system dependencies]
|
||||
F1 --> F
|
||||
|
||||
G --> H{Crawl Test Result}
|
||||
H -->|✓ Success| I[Installation Verified ✓]
|
||||
H -->|✗ Failed| H1[Check network/permissions]
|
||||
H1 --> G
|
||||
|
||||
I --> J[Ready for Production Use]
|
||||
|
||||
style I fill:#c8e6c9
|
||||
style J fill:#e8f5e8
|
||||
style C1 fill:#ffcdd2
|
||||
style D1 fill:#fff3e0
|
||||
style E1 fill:#fff3e0
|
||||
style F1 fill:#ffcdd2
|
||||
style H1 fill:#ffcdd2
|
||||
```
|
||||
|
||||
### Resource Requirements by Installation Type
|
||||
|
||||
```mermaid
|
||||
graph TD
|
||||
subgraph "Basic Installation"
|
||||
A1[Memory: 512MB]
|
||||
A2[Disk: 2GB]
|
||||
A3[CPU: 1 core]
|
||||
A4[Network: Required for setup]
|
||||
end
|
||||
|
||||
subgraph "Advanced Features torch"
|
||||
B1[Memory: 2GB+]
|
||||
B2[Disk: 5GB+]
|
||||
B3[CPU: 2+ cores]
|
||||
B4[GPU: Optional CUDA]
|
||||
end
|
||||
|
||||
subgraph "All Features"
|
||||
C1[Memory: 4GB+]
|
||||
C2[Disk: 10GB+]
|
||||
C3[CPU: 4+ cores]
|
||||
C4[GPU: Recommended]
|
||||
end
|
||||
|
||||
subgraph "Docker Deployment"
|
||||
D1[Memory: 1GB+]
|
||||
D2[Disk: 3GB+]
|
||||
D3[CPU: 2+ cores]
|
||||
D4[Ports: 11235]
|
||||
D5[Shared Memory: 1GB]
|
||||
end
|
||||
|
||||
style A1 fill:#e8f5e8
|
||||
style B1 fill:#fff3e0
|
||||
style C1 fill:#ffecb3
|
||||
style D1 fill:#e3f2fd
|
||||
```
|
||||
|
||||
**📖 Learn more:** [Installation Guide](https://docs.crawl4ai.com/core/installation/), [Docker Deployment](https://docs.crawl4ai.com/core/docker-deployment/), [System Requirements](https://docs.crawl4ai.com/core/installation/#prerequisites)
|
||||
5912
docs/md_v2/assets/llm.txt/diagrams/llms-diagram.txt
Normal file
File diff suppressed because it is too large
392
docs/md_v2/assets/llm.txt/diagrams/multi_urls_crawling.txt
Normal file
@@ -0,0 +1,392 @@
|
||||
## Multi-URL Crawling Workflows and Architecture
|
||||
|
||||
Visual representations of concurrent crawling patterns, resource management, and monitoring systems for handling multiple URLs efficiently.
|
||||
|
||||
### Multi-URL Processing Modes
|
||||
|
||||
```mermaid
|
||||
flowchart TD
|
||||
A[Multi-URL Crawling Request] --> B{Processing Mode?}
|
||||
|
||||
B -->|Batch Mode| C[Collect All URLs]
|
||||
B -->|Streaming Mode| D[Process URLs Individually]
|
||||
|
||||
C --> C1[Queue All URLs]
|
||||
C1 --> C2[Execute Concurrently]
|
||||
C2 --> C3[Wait for All Completion]
|
||||
C3 --> C4[Return Complete Results Array]
|
||||
|
||||
D --> D1[Queue URLs]
|
||||
D1 --> D2[Start First Batch]
|
||||
D2 --> D3[Yield Results as Available]
|
||||
D3 --> D4{More URLs?}
|
||||
D4 -->|Yes| D5[Start Next URLs]
|
||||
D4 -->|No| D6[Stream Complete]
|
||||
D5 --> D3
|
||||
|
||||
C4 --> E[Process Results]
|
||||
D6 --> E
|
||||
|
||||
E --> F[Success/Failure Analysis]
|
||||
F --> G[End]
|
||||
|
||||
style C fill:#e3f2fd
|
||||
style D fill:#f3e5f5
|
||||
style C4 fill:#c8e6c9
|
||||
style D6 fill:#c8e6c9
|
||||
```
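
Both modes map onto `arun_many`; streaming is selected per run via `stream=True` in the run config:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

urls = ["https://example.com/a", "https://example.com/b"]

async def main():
    async with AsyncWebCrawler() as crawler:
        # Batch mode: one awaited call returns the complete results array
        results = await crawler.arun_many(urls, config=CrawlerRunConfig())
        print(sum(r.success for r in results), "of", len(results), "succeeded")

        # Streaming mode: consume each result as soon as its crawl finishes
        async for result in await crawler.arun_many(
            urls, config=CrawlerRunConfig(stream=True)
        ):
            print(result.url, result.success)

asyncio.run(main())
```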
|
||||
|
||||
### Memory-Adaptive Dispatcher Flow
|
||||
|
||||
```mermaid
|
||||
stateDiagram-v2
|
||||
[*] --> Initializing
|
||||
|
||||
Initializing --> MonitoringMemory: Start dispatcher
|
||||
|
||||
MonitoringMemory --> CheckingMemory: Every check_interval
|
||||
CheckingMemory --> MemoryOK: Memory < threshold
|
||||
CheckingMemory --> MemoryHigh: Memory >= threshold
|
||||
|
||||
MemoryOK --> DispatchingTasks: Start new crawls
|
||||
MemoryHigh --> WaitingForMemory: Pause dispatching
|
||||
|
||||
DispatchingTasks --> TaskRunning: Launch crawler
|
||||
TaskRunning --> TaskCompleted: Crawl finished
|
||||
TaskRunning --> TaskFailed: Crawl error
|
||||
|
||||
TaskCompleted --> MonitoringMemory: Update stats
|
||||
TaskFailed --> MonitoringMemory: Update stats
|
||||
|
||||
WaitingForMemory --> CheckingMemory: Wait timeout
|
||||
WaitingForMemory --> MonitoringMemory: Memory freed
|
||||
|
||||
note right of MemoryHigh: Prevents OOM crashes
|
||||
note right of DispatchingTasks: Respects max_session_permit
|
||||
note right of WaitingForMemory: Configurable timeout
|
||||
```
|
||||
|
||||
### Concurrent Crawling Architecture
|
||||
|
||||
```mermaid
|
||||
graph TB
|
||||
subgraph "URL Queue Management"
|
||||
A[URL Input List] --> B[URL Queue]
|
||||
B --> C[Priority Scheduler]
|
||||
C --> D[Batch Assignment]
|
||||
end
|
||||
|
||||
subgraph "Dispatcher Layer"
|
||||
E[Memory Adaptive Dispatcher]
|
||||
F[Semaphore Dispatcher]
|
||||
G[Rate Limiter]
|
||||
H[Resource Monitor]
|
||||
|
||||
E --> I[Memory Checker]
|
||||
F --> J[Concurrency Controller]
|
||||
G --> K[Delay Calculator]
|
||||
H --> L[System Stats]
|
||||
end
|
||||
|
||||
subgraph "Crawler Pool"
|
||||
M[Crawler Instance 1]
|
||||
N[Crawler Instance 2]
|
||||
O[Crawler Instance 3]
|
||||
P[Crawler Instance N]
|
||||
|
||||
M --> Q[Browser Session 1]
|
||||
N --> R[Browser Session 2]
|
||||
O --> S[Browser Session 3]
|
||||
P --> T[Browser Session N]
|
||||
end
|
||||
|
||||
subgraph "Result Processing"
|
||||
U[Result Collector]
|
||||
V[Success Handler]
|
||||
W[Error Handler]
|
||||
X[Retry Queue]
|
||||
Y[Final Results]
|
||||
end
|
||||
|
||||
D --> E
|
||||
D --> F
|
||||
E --> M
|
||||
F --> N
|
||||
G --> O
|
||||
H --> P
|
||||
|
||||
Q --> U
|
||||
R --> U
|
||||
S --> U
|
||||
T --> U
|
||||
|
||||
U --> V
|
||||
U --> W
|
||||
W --> X
|
||||
X --> B
|
||||
V --> Y
|
||||
|
||||
style E fill:#e3f2fd
|
||||
style F fill:#f3e5f5
|
||||
style G fill:#e8f5e8
|
||||
style H fill:#fff3e0
|
||||
```
|
||||
|
||||
### Rate Limiting and Backoff Strategy
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant C as Crawler
|
||||
participant RL as Rate Limiter
|
||||
participant S as Server
|
||||
participant D as Dispatcher
|
||||
|
||||
C->>RL: Request to crawl URL
|
||||
RL->>RL: Calculate delay
|
||||
RL->>RL: Apply base delay (1-3s)
|
||||
RL->>C: Delay applied
|
||||
|
||||
C->>S: HTTP Request
|
||||
|
||||
alt Success Response
|
||||
S-->>C: 200 OK + Content
|
||||
C->>RL: Report success
|
||||
RL->>RL: Reset failure count
|
||||
C->>D: Return successful result
|
||||
else Rate Limited
|
||||
S-->>C: 429 Too Many Requests
|
||||
C->>RL: Report rate limit
|
||||
RL->>RL: Exponential backoff
|
||||
RL->>RL: Increase delay (up to max_delay)
|
||||
RL->>C: Apply longer delay
|
||||
C->>S: Retry request after delay
|
||||
else Server Error
|
||||
S-->>C: 503 Service Unavailable
|
||||
C->>RL: Report server error
|
||||
RL->>RL: Moderate backoff
|
||||
RL->>C: Retry with backoff
|
||||
else Max Retries Exceeded
|
||||
RL->>C: Stop retrying
|
||||
C->>D: Return failed result
|
||||
end
|
||||
```
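
The limiter in this sequence corresponds to the `RateLimiter` helper (import path may vary by version); the delays below match the diagram's 1-3s base delay and exponential backoff on 429/503:

```python
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher, RateLimiter

rate_limiter = RateLimiter(
    base_delay=(1.0, 3.0),        # Random per-request delay window
    max_delay=60.0,               # Ceiling for exponential backoff
    max_retries=3,                # Give up and report failure after this
    rate_limit_codes=[429, 503],  # Statuses that trigger backoff
)
dispatcher = MemoryAdaptiveDispatcher(rate_limiter=rate_limiter)
```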
|
||||
|
||||
### Large-Scale Crawling Workflow
|
||||
|
||||
```mermaid
|
||||
flowchart TD
|
||||
A[Load URL List 10k+ URLs] --> B[Initialize Dispatcher]
|
||||
|
||||
B --> C{Select Dispatcher Type}
|
||||
C -->|Memory Constrained| D[Memory Adaptive]
|
||||
C -->|Fixed Resources| E[Semaphore Based]
|
||||
|
||||
D --> F[Set Memory Threshold 70%]
|
||||
E --> G[Set Concurrency Limit]
|
||||
|
||||
F --> H[Configure Monitoring]
|
||||
G --> H
|
||||
|
||||
H --> I[Start Crawling Process]
|
||||
I --> J[Monitor System Resources]
|
||||
|
||||
J --> K{Memory Usage?}
|
||||
K -->|< Threshold| L[Continue Dispatching]
|
||||
K -->|>= Threshold| M[Pause New Tasks]
|
||||
|
||||
L --> N[Process Results Stream]
|
||||
M --> O[Wait for Memory]
|
||||
O --> K
|
||||
|
||||
N --> P{Result Type?}
|
||||
P -->|Success| Q[Save to Database]
|
||||
P -->|Failure| R[Log Error]
|
||||
|
||||
Q --> S[Update Progress Counter]
|
||||
R --> S
|
||||
|
||||
S --> T{More URLs?}
|
||||
T -->|Yes| U[Get Next Batch]
|
||||
T -->|No| V[Generate Final Report]
|
||||
|
||||
U --> L
|
||||
V --> W[Analysis Complete]
|
||||
|
||||
style A fill:#e1f5fe
|
||||
style D fill:#e8f5e8
|
||||
style E fill:#f3e5f5
|
||||
style V fill:#c8e6c9
|
||||
style W fill:#a5d6a7
|
||||
```
|
||||
|
||||
### Real-Time Monitoring Dashboard Flow
|
||||
|
||||
```mermaid
|
||||
graph LR
|
||||
subgraph "Data Collection"
|
||||
A[Crawler Tasks] --> B[Performance Metrics]
|
||||
A --> C[Memory Usage]
|
||||
A --> D[Success/Failure Rates]
|
||||
A --> E[Response Times]
|
||||
end
|
||||
|
||||
subgraph "Monitor Processing"
|
||||
F[CrawlerMonitor] --> G[Aggregate Statistics]
|
||||
F --> H[Display Formatter]
|
||||
F --> I[Update Scheduler]
|
||||
end
|
||||
|
||||
subgraph "Display Modes"
|
||||
J[DETAILED Mode]
|
||||
K[AGGREGATED Mode]
|
||||
|
||||
J --> L[Individual Task Status]
|
||||
J --> M[Task-Level Metrics]
|
||||
K --> N[Summary Statistics]
|
||||
K --> O[Overall Progress]
|
||||
end
|
||||
|
||||
subgraph "Output Interface"
|
||||
P[Console Display]
|
||||
Q[Progress Bars]
|
||||
R[Status Tables]
|
||||
S[Real-time Updates]
|
||||
end
|
||||
|
||||
B --> F
|
||||
C --> F
|
||||
D --> F
|
||||
E --> F
|
||||
|
||||
G --> J
|
||||
G --> K
|
||||
H --> J
|
||||
H --> K
|
||||
I --> J
|
||||
I --> K
|
||||
|
||||
L --> P
|
||||
M --> Q
|
||||
N --> R
|
||||
O --> S
|
||||
|
||||
style F fill:#e3f2fd
|
||||
style J fill:#f3e5f5
|
||||
style K fill:#e8f5e8
|
||||
```
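
Wiring a monitor in looks roughly like this; the `CrawlerMonitor` constructor has changed across releases, so treat the sketch as illustrative of the two display modes rather than an exact signature:

```python
from crawl4ai import CrawlerMonitor, DisplayMode
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher

monitor = CrawlerMonitor(display_mode=DisplayMode.DETAILED)  # or DisplayMode.AGGREGATED
dispatcher = MemoryAdaptiveDispatcher(monitor=monitor)       # Live console statistics
```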
|
||||
|
||||
### Error Handling and Recovery Pattern
|
||||
|
||||
```mermaid
|
||||
stateDiagram-v2
|
||||
[*] --> ProcessingURL
|
||||
|
||||
ProcessingURL --> CrawlAttempt: Start crawl
|
||||
|
||||
CrawlAttempt --> Success: HTTP 200
|
||||
CrawlAttempt --> NetworkError: Connection failed
|
||||
CrawlAttempt --> RateLimit: HTTP 429
|
||||
CrawlAttempt --> ServerError: HTTP 5xx
|
||||
CrawlAttempt --> Timeout: Request timeout
|
||||
|
||||
Success --> [*]: Return result
|
||||
|
||||
NetworkError --> RetryCheck: Check retry count
|
||||
RateLimit --> BackoffWait: Apply exponential backoff
|
||||
ServerError --> RetryCheck: Check retry count
|
||||
Timeout --> RetryCheck: Check retry count
|
||||
|
||||
BackoffWait --> RetryCheck: After delay
|
||||
|
||||
RetryCheck --> CrawlAttempt: retries < max_retries
|
||||
RetryCheck --> Failed: retries >= max_retries
|
||||
|
||||
Failed --> ErrorLog: Log failure details
|
||||
ErrorLog --> [*]: Return failed result
|
||||
|
||||
note right of BackoffWait: Exponential backoff for rate limits
|
||||
note right of RetryCheck: Configurable max_retries
|
||||
note right of ErrorLog: Detailed error tracking
|
||||
```
|
||||
|
||||
### Resource Management Timeline
|
||||
|
||||
```mermaid
|
||||
gantt
|
||||
title Multi-URL Crawling Resource Management
|
||||
dateFormat X
|
||||
axisFormat %s
|
||||
|
||||
section Memory Usage
|
||||
Initialize Dispatcher :0, 1
|
||||
Memory Monitoring :1, 10
|
||||
Peak Usage Period :3, 7
|
||||
Memory Cleanup :7, 9
|
||||
|
||||
section Task Execution
|
||||
URL Queue Setup :0, 2
|
||||
Batch 1 Processing :2, 5
|
||||
Batch 2 Processing :4, 7
|
||||
Batch 3 Processing :6, 9
|
||||
Final Results :9, 10
|
||||
|
||||
section Rate Limiting
|
||||
Normal Delays :2, 4
|
||||
Backoff Period :4, 6
|
||||
Recovery Period :6, 8
|
||||
|
||||
section Monitoring
|
||||
System Health Check :0, 10
|
||||
Progress Updates :1, 9
|
||||
Performance Metrics :2, 8
|
||||
```
|
||||
|
||||
### Concurrent Processing Performance Matrix
|
||||
|
||||
```mermaid
|
||||
graph TD
|
||||
subgraph "Input Factors"
|
||||
A[Number of URLs]
|
||||
B[Concurrency Level]
|
||||
C[Memory Threshold]
|
||||
D[Rate Limiting]
|
||||
end
|
||||
|
||||
subgraph "Processing Characteristics"
|
||||
A --> E[Low 1-100 URLs]
|
||||
A --> F[Medium 100-1k URLs]
|
||||
A --> G[High 1k-10k URLs]
|
||||
A --> H[Very High 10k+ URLs]
|
||||
|
||||
B --> I[Conservative 1-5]
|
||||
B --> J[Moderate 5-15]
|
||||
B --> K[Aggressive 15-30]
|
||||
|
||||
C --> L[Strict 60-70%]
|
||||
C --> M[Balanced 70-80%]
|
||||
C --> N[Relaxed 80-90%]
|
||||
end
|
||||
|
||||
subgraph "Recommended Configurations"
|
||||
E --> O[Simple Semaphore]
|
||||
F --> P[Memory Adaptive Basic]
|
||||
G --> Q[Memory Adaptive Advanced]
|
||||
H --> R[Memory Adaptive + Monitoring]
|
||||
|
||||
I --> O
|
||||
J --> P
|
||||
K --> Q
|
||||
K --> R
|
||||
|
||||
L --> Q
|
||||
M --> P
|
||||
N --> O
|
||||
end
|
||||
|
||||
style O fill:#c8e6c9
|
||||
style P fill:#fff3e0
|
||||
style Q fill:#ffecb3
|
||||
style R fill:#ffcdd2
|
||||
```
|
||||
|
||||
**📖 Learn more:** [Multi-URL Crawling Guide](https://docs.crawl4ai.com/advanced/multi-url-crawling/), [Dispatcher Configuration](https://docs.crawl4ai.com/advanced/crawl-dispatcher/), [Performance Optimization](https://docs.crawl4ai.com/advanced/multi-url-crawling/#performance-optimization)
|
||||
411
docs/md_v2/assets/llm.txt/diagrams/simple_crawling.txt
Normal file
@@ -0,0 +1,411 @@
|
||||
## Simple Crawling Workflows and Data Flow
|
||||
|
||||
Visual representations of basic web crawling operations, configuration patterns, and result processing workflows.
|
||||
|
||||
### Basic Crawling Sequence
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant User
|
||||
participant Crawler as AsyncWebCrawler
|
||||
participant Browser as Browser Instance
|
||||
participant Page as Web Page
|
||||
participant Processor as Content Processor
|
||||
|
||||
User->>Crawler: Create with BrowserConfig
|
||||
Crawler->>Browser: Launch browser instance
|
||||
Browser-->>Crawler: Browser ready
|
||||
|
||||
User->>Crawler: arun(url, CrawlerRunConfig)
|
||||
Crawler->>Browser: Create new page/context
|
||||
Browser->>Page: Navigate to URL
|
||||
Page-->>Browser: Page loaded
|
||||
|
||||
Browser->>Processor: Extract raw HTML
|
||||
Processor->>Processor: Clean HTML
|
||||
Processor->>Processor: Generate markdown
|
||||
Processor->>Processor: Extract media/links
|
||||
Processor-->>Crawler: CrawlResult created
|
||||
|
||||
Crawler-->>User: Return CrawlResult
|
||||
|
||||
Note over User,Processor: All processing happens asynchronously
|
||||
```
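
The sequence above is the canonical minimal crawl:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_config = BrowserConfig(headless=True)  # Browser-level settings
    run_config = CrawlerRunConfig()                # Per-crawl settings
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://example.com", config=run_config)
        print(result.markdown)                     # Clean markdown output

asyncio.run(main())
```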
|
||||
|
||||
### Crawling Configuration Flow
|
||||
|
||||
```mermaid
|
||||
flowchart TD
|
||||
A[Start Crawling] --> B{Browser Config Set?}
|
||||
|
||||
B -->|No| B1[Use Default BrowserConfig]
|
||||
B -->|Yes| B2[Custom BrowserConfig]
|
||||
|
||||
B1 --> C[Launch Browser]
|
||||
B2 --> C
|
||||
|
||||
C --> D{Crawler Run Config Set?}
|
||||
|
||||
D -->|No| D1[Use Default CrawlerRunConfig]
|
||||
D -->|Yes| D2[Custom CrawlerRunConfig]
|
||||
|
||||
D1 --> E[Navigate to URL]
|
||||
D2 --> E
|
||||
|
||||
E --> F{Page Load Success?}
|
||||
F -->|No| F1[Return Error Result]
|
||||
F -->|Yes| G[Apply Content Filters]
|
||||
|
||||
G --> G1{excluded_tags set?}
|
||||
G1 -->|Yes| G2[Remove specified tags]
|
||||
G1 -->|No| G3[Keep all tags]
|
||||
G2 --> G4{css_selector set?}
|
||||
G3 --> G4
|
||||
|
||||
G4 -->|Yes| G5[Extract selected elements]
|
||||
G4 -->|No| G6[Process full page]
|
||||
G5 --> H[Generate Markdown]
|
||||
G6 --> H
|
||||
|
||||
H --> H1{markdown_generator set?}
|
||||
H1 -->|Yes| H2[Use custom generator]
|
||||
H1 -->|No| H3[Use default generator]
|
||||
H2 --> I[Extract Media and Links]
|
||||
H3 --> I
|
||||
|
||||
I --> I1{process_iframes?}
|
||||
I1 -->|Yes| I2[Include iframe content]
|
||||
I1 -->|No| I3[Skip iframes]
|
||||
I2 --> J[Create CrawlResult]
|
||||
I3 --> J
|
||||
|
||||
J --> K[Return Result]
|
||||
|
||||
style A fill:#e1f5fe
|
||||
style K fill:#c8e6c9
|
||||
style F1 fill:#ffcdd2
|
||||
```
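
The filter branches above correspond to `CrawlerRunConfig` fields; the values here are illustrative:

```python
from crawl4ai import CrawlerRunConfig

config = CrawlerRunConfig(
    excluded_tags=["nav", "footer", "aside"],  # Remove these tags before processing
    css_selector="main.article",               # Focus on one region (placeholder selector)
    word_count_threshold=10,                   # Drop very short text blocks
    process_iframes=True,                      # Include iframe content
    remove_overlay_elements=True,              # Strip popups and modals
)
```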
|
||||
|
||||
### CrawlResult Data Structure
|
||||
|
||||
```mermaid
|
||||
graph TB
|
||||
subgraph "CrawlResult Object"
|
||||
A[CrawlResult] --> B[Basic Info]
|
||||
A --> C[Content Variants]
|
||||
A --> D[Extracted Data]
|
||||
A --> E[Media Assets]
|
||||
A --> F[Optional Outputs]
|
||||
|
||||
B --> B1[url: Final URL]
|
||||
B --> B2[success: Boolean]
|
||||
B --> B3[status_code: HTTP Status]
|
||||
B --> B4[error_message: Error Details]
|
||||
|
||||
C --> C1[html: Raw HTML]
|
||||
C --> C2[cleaned_html: Sanitized HTML]
|
||||
C --> C3[markdown: MarkdownGenerationResult]
|
||||
|
||||
C3 --> C3A[raw_markdown: Basic conversion]
|
||||
C3 --> C3B[markdown_with_citations: With references]
|
||||
C3 --> C3C[fit_markdown: Filtered content]
|
||||
C3 --> C3D[references_markdown: Citation list]
|
||||
|
||||
D --> D1[links: Internal/External]
|
||||
D --> D2[media: Images/Videos/Audio]
|
||||
D --> D3[metadata: Page info]
|
||||
D --> D4[extracted_content: JSON data]
|
||||
D --> D5[tables: Structured table data]
|
||||
|
||||
E --> E1[screenshot: Base64 image]
|
||||
E --> E2[pdf: PDF bytes]
|
||||
E --> E3[mhtml: Archive file]
|
||||
E --> E4[downloaded_files: File paths]
|
||||
|
||||
F --> F1[session_id: Browser session]
|
||||
F --> F2[ssl_certificate: Security info]
|
||||
F --> F3[response_headers: HTTP headers]
|
||||
F --> F4[network_requests: Traffic log]
|
||||
F --> F5[console_messages: Browser logs]
|
||||
end
|
||||
|
||||
style A fill:#e3f2fd
|
||||
style C3 fill:#f3e5f5
|
||||
style D5 fill:#e8f5e8
|
||||
```
|
||||
|
||||
### Content Processing Pipeline
|
||||
|
||||
```mermaid
|
||||
flowchart LR
|
||||
subgraph "Input Sources"
|
||||
A1[Web URL]
|
||||
A2[Raw HTML]
|
||||
A3[Local File]
|
||||
end
|
||||
|
||||
A1 --> B[Browser Navigation]
|
||||
A2 --> C[Direct Processing]
|
||||
A3 --> C
|
||||
|
||||
B --> D[Raw HTML Capture]
|
||||
C --> D
|
||||
|
||||
D --> E{Content Filtering}
|
||||
|
||||
E --> E1[Remove Scripts/Styles]
|
||||
E --> E2[Apply excluded_tags]
|
||||
E --> E3[Apply css_selector]
|
||||
E --> E4[Remove overlay elements]
|
||||
|
||||
E1 --> F[Cleaned HTML]
|
||||
E2 --> F
|
||||
E3 --> F
|
||||
E4 --> F
|
||||
|
||||
F --> G{Markdown Generation}
|
||||
|
||||
G --> G1[HTML to Markdown]
|
||||
G --> G2[Apply Content Filter]
|
||||
G --> G3[Generate Citations]
|
||||
|
||||
G1 --> H[MarkdownGenerationResult]
|
||||
G2 --> H
|
||||
G3 --> H
|
||||
|
||||
F --> I{Media Extraction}
|
||||
I --> I1[Find Images]
|
||||
I --> I2[Find Videos/Audio]
|
||||
I --> I3[Score Relevance]
|
||||
I1 --> J[Media Dictionary]
|
||||
I2 --> J
|
||||
I3 --> J
|
||||
|
||||
F --> K{Link Extraction}
|
||||
K --> K1[Internal Links]
|
||||
K --> K2[External Links]
|
||||
K --> K3[Apply Link Filters]
|
||||
K1 --> L[Links Dictionary]
|
||||
K2 --> L
|
||||
K3 --> L
|
||||
|
||||
H --> M[Final CrawlResult]
|
||||
J --> M
|
||||
L --> M
|
||||
|
||||
style D fill:#e3f2fd
|
||||
style F fill:#f3e5f5
|
||||
style H fill:#e8f5e8
|
||||
style M fill:#c8e6c9
|
||||
```
|
||||
|
||||
### Table Extraction Workflow
|
||||
|
||||
```mermaid
|
||||
stateDiagram-v2
|
||||
[*] --> DetectTables
|
||||
|
||||
DetectTables --> ScoreTables: Find table elements
|
||||
|
||||
ScoreTables --> EvaluateThreshold: Calculate quality scores
|
||||
EvaluateThreshold --> PassThreshold: score >= table_score_threshold
|
||||
EvaluateThreshold --> RejectTable: score < threshold
|
||||
|
||||
PassThreshold --> ExtractHeaders: Parse table structure
|
||||
ExtractHeaders --> ExtractRows: Get header cells
|
||||
ExtractRows --> ExtractMetadata: Get data rows
|
||||
ExtractMetadata --> CreateTableObject: Get caption/summary
|
||||
|
||||
CreateTableObject --> AddToResult: {headers, rows, caption, summary}
|
||||
AddToResult --> [*]: Table extraction complete
|
||||
|
||||
RejectTable --> [*]: Table skipped
|
||||
|
||||
note right of ScoreTables : Factors: header presence, data density, structure quality
|
||||
note right of EvaluateThreshold : Threshold 1-10, higher = stricter
|
||||
```
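
Table extraction is driven by `table_score_threshold`; where the tables land on the result has varied by version (`result.media["tables"]` in the versions this diagram describes, `result.tables` in newer ones):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(table_score_threshold=7)  # Higher = stricter filtering
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        for table in result.media.get("tables", []):
            print(table["headers"], "-", len(table["rows"]), "rows")

asyncio.run(main())
```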
|
||||
|
||||
### Error Handling Decision Tree

```mermaid
flowchart TD
    A[Start Crawl] --> B[Navigate to URL]

    B --> C{Navigation Success?}
    C -->|Network Error| C1[Set error_message: Network failure]
    C -->|Timeout| C2[Set error_message: Page timeout]
    C -->|Invalid URL| C3[Set error_message: Invalid URL format]
    C -->|Success| D[Process Page Content]

    C1 --> E[success = False]
    C2 --> E
    C3 --> E

    D --> F{Content Processing OK?}
    F -->|Parser Error| F1[Set error_message: HTML parsing failed]
    F -->|Memory Error| F2[Set error_message: Insufficient memory]
    F -->|Success| G[Generate Outputs]

    F1 --> E
    F2 --> E

    G --> H{Output Generation OK?}
    H -->|Markdown Error| H1[Partial success with warnings]
    H -->|Extraction Error| H2[Partial success with warnings]
    H -->|Success| I[success = True]

    H1 --> I
    H2 --> I

    E --> J[Return Failed CrawlResult]
    I --> K[Return Successful CrawlResult]

    J --> L[User Error Handling]
    K --> M[User Result Processing]

    L --> L1{Check error_message}
    L1 -->|Network| L2[Retry with different config]
    L1 -->|Timeout| L3[Increase page_timeout]
    L1 -->|Parser| L4[Try different scraping_strategy]

    style E fill:#ffcdd2
    style I fill:#c8e6c9
    style J fill:#ffcdd2
    style K fill:#c8e6c9
```

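The user-side branch at the bottom of the tree can be scripted. A minimal sketch of the timeout branch (retry once with a larger `page_timeout`; the substring match on `error_message` is an assumption, not a stable error code):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def crawl_with_retry(url: str):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url, config=CrawlerRunConfig(page_timeout=30000))
        if not result.success and "timeout" in (result.error_message or "").lower():
            # Timeout branch: retry with triple the page_timeout
            result = await crawler.arun(url, config=CrawlerRunConfig(page_timeout=90000))
        return result

result = asyncio.run(crawl_with_retry("https://example.com"))
print(result.success)
```
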
### Configuration Impact Matrix

```mermaid
graph TB
    subgraph "Configuration Categories"
        A[Content Processing]
        B[Page Interaction]
        C[Output Generation]
        D[Performance]
    end

    subgraph "Configuration Options"
        A --> A1[word_count_threshold]
        A --> A2[excluded_tags]
        A --> A3[css_selector]
        A --> A4[exclude_external_links]

        B --> B1[process_iframes]
        B --> B2[remove_overlay_elements]
        B --> B3[scan_full_page]
        B --> B4[wait_for]

        C --> C1[screenshot]
        C --> C2[pdf]
        C --> C3[markdown_generator]
        C --> C4[table_score_threshold]

        D --> D1[cache_mode]
        D --> D2[verbose]
        D --> D3[page_timeout]
        D --> D4[semaphore_count]
    end

    subgraph "Result Impact"
        A1 --> R1[Filters short text blocks]
        A2 --> R2[Removes specified HTML tags]
        A3 --> R3[Focuses on selected content]
        A4 --> R4[Cleans links dictionary]

        B1 --> R5[Includes iframe content]
        B2 --> R6[Removes popups/modals]
        B3 --> R7[Loads dynamic content]
        B4 --> R8[Waits for specific elements]

        C1 --> R9[Adds screenshot field]
        C2 --> R10[Adds pdf field]
        C3 --> R11[Custom markdown processing]
        C4 --> R12[Filters table quality]

        D1 --> R13[Controls caching behavior]
        D2 --> R14[Detailed logging output]
        D3 --> R15[Prevents timeout errors]
        D4 --> R16[Limits concurrent operations]
    end

    style A fill:#e3f2fd
    style B fill:#f3e5f5
    style C fill:#e8f5e8
    style D fill:#fff3e0
```

### Raw HTML and Local File Processing

```mermaid
sequenceDiagram
    participant User
    participant Crawler
    participant Processor
    participant FileSystem

    Note over User,FileSystem: Raw HTML Processing
    User->>Crawler: arun("raw://html_content")
    Crawler->>Processor: Parse raw HTML directly
    Processor->>Processor: Apply same content filters
    Processor-->>Crawler: Standard CrawlResult
    Crawler-->>User: Result with markdown

    Note over User,FileSystem: Local File Processing
    User->>Crawler: arun("file:///path/to/file.html")
    Crawler->>FileSystem: Read local file
    FileSystem-->>Crawler: File content
    Crawler->>Processor: Process file HTML
    Processor->>Processor: Apply content processing
    Processor-->>Crawler: Standard CrawlResult
    Crawler-->>User: Result with markdown

    Note over User,FileSystem: Both return identical CrawlResult structure
```

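Both flows use the same `arun()` entry point; only the URL scheme changes. A minimal sketch (the local path is a placeholder):

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def non_web_inputs():
    raw_html = "<html><body><h1>Hello</h1><p>Inline content.</p></body></html>"
    async with AsyncWebCrawler() as crawler:
        r1 = await crawler.arun("raw://" + raw_html)      # raw HTML input
        r2 = await crawler.arun("file:///tmp/page.html")  # local file input
        print(r1.markdown[:100])
        print(r2.markdown[:100])

asyncio.run(non_web_inputs())
```
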
### Comprehensive Processing Example Flow

```mermaid
flowchart TD
    A[Input: example.com] --> B[Create Configurations]

    B --> B1[BrowserConfig verbose=True]
    B --> B2[CrawlerRunConfig with filters]

    B1 --> C[Launch AsyncWebCrawler]
    B2 --> C

    C --> D[Navigate and Process]

    D --> E{Check Success}
    E -->|Failed| E1[Print Error Message]
    E -->|Success| F[Extract Content Summary]

    F --> F1[Get Page Title]
    F --> F2[Get Content Preview]
    F --> F3[Process Media Items]
    F --> F4[Process Links]

    F3 --> F3A[Count Images]
    F3 --> F3B[Show First 3 Images]

    F4 --> F4A[Count Internal Links]
    F4 --> F4B[Show First 3 Links]

    F1 --> G[Display Results]
    F2 --> G
    F3A --> G
    F3B --> G
    F4A --> G
    F4B --> G

    E1 --> H[End with Error]
    G --> I[End with Success]

    style E1 fill:#ffcdd2
    style G fill:#c8e6c9
    style H fill:#ffcdd2
    style I fill:#c8e6c9
```

**📖 Learn more:** [Simple Crawling Guide](https://docs.crawl4ai.com/core/simple-crawling/), [Configuration Options](https://docs.crawl4ai.com/core/browser-crawler-config/), [Result Processing](https://docs.crawl4ai.com/core/crawler-result/), [Table Extraction](https://docs.crawl4ai.com/extraction/no-llm-strategies/)

441
docs/md_v2/assets/llm.txt/diagrams/url_seeder.txt
Normal file
@@ -0,0 +1,441 @@
## URL Seeding Workflows and Architecture

Visual representations of URL discovery strategies, filtering pipelines, and smart crawling workflows.

### URL Seeding vs Deep Crawling Strategy Comparison

```mermaid
graph TB
    subgraph "Deep Crawling Approach"
        A1[Start URL] --> A2[Load Page]
        A2 --> A3[Extract Links]
        A3 --> A4{More Links?}
        A4 -->|Yes| A5[Queue Next Page]
        A5 --> A2
        A4 -->|No| A6[Complete]

        A7[⏱️ Real-time Discovery]
        A8[🐌 Sequential Processing]
        A9[🔍 Limited by Page Structure]
        A10[💾 High Memory Usage]
    end

    subgraph "URL Seeding Approach"
        B1[Domain Input] --> B2[Query Sitemap]
        B1 --> B3[Query Common Crawl]
        B2 --> B4[Merge Results]
        B3 --> B4
        B4 --> B5[Apply Filters]
        B5 --> B6[Score Relevance]
        B6 --> B7[Rank Results]
        B7 --> B8[Select Top URLs]

        B9[⚡ Instant Discovery]
        B10[🚀 Parallel Processing]
        B11[🎯 Pattern-based Filtering]
        B12[💡 Smart Relevance Scoring]
    end

    style A1 fill:#ffecb3
    style B1 fill:#e8f5e8
    style A6 fill:#ffcdd2
    style B8 fill:#c8e6c9
```

### URL Discovery Data Flow

```mermaid
sequenceDiagram
    participant User
    participant Seeder as AsyncUrlSeeder
    participant SM as Sitemap
    participant CC as Common Crawl
    participant Filter as URL Filter
    participant Scorer as BM25 Scorer

    User->>Seeder: urls("example.com", config)

    par Parallel Data Sources
        Seeder->>SM: Fetch sitemap.xml
        SM-->>Seeder: 500 URLs
    and
        Seeder->>CC: Query Common Crawl
        CC-->>Seeder: 2000 URLs
    end

    Seeder->>Seeder: Merge and deduplicate
    Note over Seeder: 2200 unique URLs

    Seeder->>Filter: Apply pattern filter
    Filter-->>Seeder: 800 matching URLs

    alt extract_head=True
        loop For each URL
            Seeder->>Seeder: Extract <head> metadata
        end
        Note over Seeder: Title, description, keywords
    end

    alt query provided
        Seeder->>Scorer: Calculate relevance scores
        Scorer-->>Seeder: Scored URLs
        Seeder->>Seeder: Filter by score_threshold
        Note over Seeder: 200 relevant URLs
    end

    Seeder->>Seeder: Sort by relevance
    Seeder->>Seeder: Apply max_urls limit
    Seeder-->>User: Top 100 URLs ready for crawling
```

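A sketch of the same flow in code, using the `AsyncUrlSeeder` API the diagram names. Parameter and result-field names (`relevance_score` in particular) follow the URL seeding docs and should be checked against your installed version:

```python
import asyncio
from crawl4ai import AsyncUrlSeeder, SeedingConfig

async def discover_urls():
    config = SeedingConfig(
        source="sitemap+cc",       # query both sources in parallel
        pattern="*/blog/*",        # pattern filter
        extract_head=True,         # pull <head> metadata
        query="python tutorials",  # enables BM25 scoring
        scoring_method="bm25",
        score_threshold=0.3,
        max_urls=100
    )
    seeder = AsyncUrlSeeder()
    urls = await seeder.urls("example.com", config)
    for item in urls[:5]:
        print(item["url"], item.get("relevance_score"))

asyncio.run(discover_urls())
```
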
### SeedingConfig Decision Tree

```mermaid
flowchart TD
    A[SeedingConfig Setup] --> B{Data Source Strategy?}

    B -->|Fast & Official| C[source="sitemap"]
    B -->|Comprehensive| D[source="cc"]
    B -->|Maximum Coverage| E[source="sitemap+cc"]

    C --> F{Need Filtering?}
    D --> F
    E --> F

    F -->|Yes| G[Set URL Pattern]
    F -->|No| H[pattern="*"]

    G --> I{Pattern Examples}
    I --> I1[pattern="*/blog/*"]
    I --> I2[pattern="*/docs/api/*"]
    I --> I3[pattern="*.pdf"]
    I --> I4[pattern="*/product/*"]

    H --> J{Need Metadata?}
    I1 --> J
    I2 --> J
    I3 --> J
    I4 --> J

    J -->|Yes| K[extract_head=True]
    J -->|No| L[extract_head=False]

    K --> M{Need Validation?}
    L --> M

    M -->|Yes| N[live_check=True]
    M -->|No| O[live_check=False]

    N --> P{Need Relevance Scoring?}
    O --> P

    P -->|Yes| Q[Set Query + BM25]
    P -->|No| R[Skip Scoring]

    Q --> S[query="search terms"]
    S --> T[scoring_method="bm25"]
    T --> U[score_threshold=0.3]

    R --> V[Performance Tuning]
    U --> V

    V --> W[Set max_urls]
    W --> X[Set concurrency]
    X --> Y[Set hits_per_sec]
    Y --> Z[Configuration Complete]

    style A fill:#e3f2fd
    style Z fill:#c8e6c9
    style K fill:#fff3e0
    style N fill:#fff3e0
    style Q fill:#f3e5f5
```

### BM25 Relevance Scoring Pipeline

```mermaid
graph TB
    subgraph "Text Corpus Preparation"
        A1[URL Collection] --> A2[Extract Metadata]
        A2 --> A3[Title + Description + Keywords]
        A3 --> A4[Tokenize Text]
        A4 --> A5[Remove Stop Words]
        A5 --> A6[Create Document Corpus]
    end

    subgraph "BM25 Algorithm"
        B1[Query Terms] --> B2[Term Frequency Calculation]
        A6 --> B2
        B2 --> B3[Inverse Document Frequency]
        B3 --> B4[BM25 Score Calculation]
        B4 --> B5["Score = Σ IDF × TF × (k1 + 1) / (TF + k1 × (1 - b + b × |d|/avgdl))"]
    end

    subgraph "Scoring Results"
        B5 --> C1[URL Relevance Scores]
        C1 --> C2{Score ≥ Threshold?}
        C2 -->|Yes| C3[Include in Results]
        C2 -->|No| C4[Filter Out]
        C3 --> C5[Sort by Score DESC]
        C5 --> C6[Return Top URLs]
    end

    subgraph "Example Scores"
        D1["python async tutorial → 0.85"]
        D2["python documentation → 0.72"]
        D3["javascript guide → 0.23"]
        D4["contact us page → 0.05"]
    end

    style B5 fill:#e3f2fd
    style C6 fill:#c8e6c9
    style D1 fill:#c8e6c9
    style D2 fill:#c8e6c9
    style D3 fill:#ffecb3
    style D4 fill:#ffcdd2
```

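The score formula in the middle panel, rendered as self-contained Python. This is an illustrative re-implementation of Okapi BM25 (with the usual smoothed IDF), not the library's internal scorer:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query using the formula above."""
    avgdl = sum(len(d) for d in corpus) / len(corpus)
    n_docs = len(corpus)
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0:
            continue
        df = sum(1 for d in corpus if term in d)  # document frequency
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

corpus = [doc.split() for doc in [
    "python async tutorial", "python documentation",
    "javascript guide", "contact us page",
]]
query = "python tutorial".split()
for doc in corpus:
    print(f"{' '.join(doc):22s} -> {bm25_score(query, doc, corpus):.3f}")
```
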
### Multi-Domain Discovery Architecture

```mermaid
graph TB
    subgraph "Input Layer"
        A1[Domain List]
        A2[SeedingConfig]
        A3[Query Terms]
    end

    subgraph "Discovery Engine"
        B1[AsyncUrlSeeder]
        B2[Parallel Workers]
        B3[Rate Limiter]
        B4[Memory Manager]
    end

    subgraph "Data Sources"
        C1[Sitemap Fetcher]
        C2[Common Crawl API]
        C3[Live URL Checker]
        C4[Metadata Extractor]
    end

    subgraph "Processing Pipeline"
        D1[URL Deduplication]
        D2[Pattern Filtering]
        D3[Relevance Scoring]
        D4[Quality Assessment]
    end

    subgraph "Output Layer"
        E1[Scored URL Lists]
        E2[Domain Statistics]
        E3[Performance Metrics]
        E4[Cache Storage]
    end

    A1 --> B1
    A2 --> B1
    A3 --> B1

    B1 --> B2
    B2 --> B3
    B3 --> B4

    B2 --> C1
    B2 --> C2
    B2 --> C3
    B2 --> C4

    C1 --> D1
    C2 --> D1
    C3 --> D2
    C4 --> D3

    D1 --> D2
    D2 --> D3
    D3 --> D4

    D4 --> E1
    B4 --> E2
    B3 --> E3
    D1 --> E4

    style B1 fill:#e3f2fd
    style D3 fill:#f3e5f5
    style E1 fill:#c8e6c9
```

### Complete Discovery-to-Crawl Pipeline

```mermaid
stateDiagram-v2
    [*] --> Discovery

    Discovery --> SourceSelection: Configure data sources
    SourceSelection --> Sitemap: source="sitemap"
    SourceSelection --> CommonCrawl: source="cc"
    SourceSelection --> Both: source="sitemap+cc"

    Sitemap --> URLCollection
    CommonCrawl --> URLCollection
    Both --> URLCollection

    URLCollection --> Filtering: Apply patterns
    Filtering --> MetadataExtraction: extract_head=True
    Filtering --> LiveValidation: extract_head=False

    MetadataExtraction --> LiveValidation: live_check=True
    MetadataExtraction --> RelevanceScoring: live_check=False
    LiveValidation --> RelevanceScoring

    RelevanceScoring --> ResultRanking: query provided
    RelevanceScoring --> ResultLimiting: no query

    ResultRanking --> ResultLimiting: apply score_threshold
    ResultLimiting --> URLSelection: apply max_urls

    URLSelection --> CrawlPreparation: URLs ready
    CrawlPreparation --> CrawlExecution: AsyncWebCrawler

    CrawlExecution --> StreamProcessing: stream=True
    CrawlExecution --> BatchProcessing: stream=False

    StreamProcessing --> [*]
    BatchProcessing --> [*]

    note right of Discovery : 🔍 Smart URL Discovery
    note right of URLCollection : 📚 Merge & Deduplicate
    note right of RelevanceScoring : 🎯 BM25 Algorithm
    note right of CrawlExecution : 🕷️ High-Performance Crawling
```

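End-to-end, the state machine above reduces to a few lines: discover, then hand the selected URLs to the crawler. A minimal sketch (batch mode; `arun_many` per the multi-URL crawling docs):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, AsyncUrlSeeder, SeedingConfig, CrawlerRunConfig

async def discover_then_crawl():
    seeder = AsyncUrlSeeder()
    seed_config = SeedingConfig(source="sitemap", pattern="*/docs/*", max_urls=20)
    discovered = await seeder.urls("example.com", seed_config)

    urls = [item["url"] for item in discovered]
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls, config=CrawlerRunConfig(stream=False))
        for result in results:
            print(result.url, "ok" if result.success else result.error_message)

asyncio.run(discover_then_crawl())
```
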
### Performance Optimization Strategies

```mermaid
graph LR
    subgraph "Input Optimization"
        A1[Smart Source Selection] --> A2[Sitemap First]
        A2 --> A3[Add CC if Needed]
        A3 --> A4[Pattern Filtering Early]
    end

    subgraph "Processing Optimization"
        B1[Parallel Workers] --> B2[Bounded Queues]
        B2 --> B3[Rate Limiting]
        B3 --> B4[Memory Management]
        B4 --> B5[Lazy Evaluation]
    end

    subgraph "Output Optimization"
        C1[Relevance Threshold] --> C2[Max URL Limits]
        C2 --> C3[Caching Strategy]
        C3 --> C4[Streaming Results]
    end

    subgraph "Performance Metrics"
        D1[URLs/Second: 100-1000]
        D2[Memory Usage: Bounded]
        D3[Network Efficiency: 95%+]
        D4[Cache Hit Rate: 80%+]
    end

    A4 --> B1
    B5 --> C1
    C4 --> D1

    style A2 fill:#e8f5e8
    style B2 fill:#e3f2fd
    style C3 fill:#f3e5f5
    style D3 fill:#c8e6c9
```

### URL Discovery vs Traditional Crawling Comparison

```mermaid
graph TB
    subgraph "Traditional Approach"
        T1[Start URL] --> T2[Crawl Page]
        T2 --> T3[Extract Links]
        T3 --> T4[Queue New URLs]
        T4 --> T2
        T5[❌ Time: Hours/Days]
        T6[❌ Resource Heavy]
        T7[❌ Depth Limited]
        T8[❌ Discovery Bias]
    end

    subgraph "URL Seeding Approach"
        S1[Domain Input] --> S2[Query All Sources]
        S2 --> S3[Pattern Filter]
        S3 --> S4[Relevance Score]
        S4 --> S5[Select Best URLs]
        S5 --> S6[Ready to Crawl]

        S7[✅ Time: Seconds/Minutes]
        S8[✅ Resource Efficient]
        S9[✅ Complete Coverage]
        S10[✅ Quality Focused]
    end

    subgraph "Use Case Decision Matrix"
        U1[Small Sites < 1000 pages] --> U2[Use Deep Crawling]
        U3[Large Sites > 10000 pages] --> U4[Use URL Seeding]
        U5[Unknown Structure] --> U6[Start with Seeding]
        U7[Real-time Discovery] --> U8[Use Deep Crawling]
        U9[Quality over Quantity] --> U10[Use URL Seeding]
    end

    style S6 fill:#c8e6c9
    style S7 fill:#c8e6c9
    style S8 fill:#c8e6c9
    style S9 fill:#c8e6c9
    style S10 fill:#c8e6c9
    style T5 fill:#ffcdd2
    style T6 fill:#ffcdd2
    style T7 fill:#ffcdd2
    style T8 fill:#ffcdd2
```

### Data Source Characteristics and Selection

```mermaid
graph TB
    subgraph "Sitemap Source"
        SM1[📋 Official URL List]
        SM2[⚡ Fast Response]
        SM3[📅 Recently Updated]
        SM4[🎯 High Quality URLs]
        SM5[❌ May Miss Some Pages]
    end

    subgraph "Common Crawl Source"
        CC1[🌐 Comprehensive Coverage]
        CC2[📚 Historical Data]
        CC3[🔍 Deep Discovery]
        CC4[⏳ Slower Response]
        CC5[🧹 May Include Noise]
    end

    subgraph "Combined Strategy"
        CB1[🚀 Best of Both]
        CB2[📊 Maximum Coverage]
        CB3[✨ Automatic Deduplication]
        CB4[⚖️ Balanced Performance]
    end

    subgraph "Selection Guidelines"
        G1[Speed Critical → Sitemap Only]
        G2[Coverage Critical → Common Crawl]
        G3[Best Quality → Combined]
        G4[Unknown Domain → Combined]
    end

    style SM2 fill:#c8e6c9
    style SM4 fill:#c8e6c9
    style CC1 fill:#e3f2fd
    style CC3 fill:#e3f2fd
    style CB1 fill:#f3e5f5
    style CB3 fill:#f3e5f5
```

**📖 Learn more:** [URL Seeding Guide](https://docs.crawl4ai.com/core/url-seeding/), [Performance Optimization](https://docs.crawl4ai.com/advanced/optimization/), [Multi-URL Crawling](https://docs.crawl4ai.com/advanced/multi-url-crawling/)

295
docs/md_v2/assets/llm.txt/txt/cli.txt
Normal file
@@ -0,0 +1,295 @@
## CLI & Identity-Based Browsing

Command-line interface for web crawling with persistent browser profiles, authentication, and identity management.

### Basic CLI Usage

```bash
# Simple crawling
crwl https://example.com

# Get markdown output
crwl https://example.com -o markdown

# JSON output with cache bypass
crwl https://example.com -o json --bypass-cache

# Verbose mode with specific browser settings
crwl https://example.com -b "headless=false,viewport_width=1280" -v
```

### Profile Management Commands

```bash
# Launch interactive profile manager
crwl profiles

# Create, list, and manage browser profiles
# This opens a menu where you can:
# 1. List existing profiles
# 2. Create new profile (opens browser for setup)
# 3. Delete profiles
# 4. Use profile to crawl a website

# Use a specific profile for crawling
crwl https://example.com -p my-profile-name

# Example workflow for authenticated sites:
# 1. Create profile and log in
crwl profiles  # Select "Create new profile"
# 2. Use profile for crawling authenticated content
crwl https://site-requiring-login.com/dashboard -p my-profile-name
```

### CDP Browser Management

```bash
# Launch browser with CDP debugging (default port 9222)
crwl cdp

# Use specific profile and custom port
crwl cdp -p my-profile -P 9223

# Launch headless browser with CDP
crwl cdp --headless

# Launch in incognito mode (ignores profile)
crwl cdp --incognito

# Use custom user data directory
crwl cdp --user-data-dir ~/my-browser-data --port 9224
```

### Builtin Browser Management

```bash
# Start persistent browser instance
crwl browser start

# Check browser status
crwl browser status

# Open visible window to see the browser
crwl browser view --url https://example.com

# Stop the browser
crwl browser stop

# Restart with different options
crwl browser restart --browser-type chromium --port 9223 --no-headless

# Use builtin browser in crawling
crwl https://example.com -b "browser_mode=builtin"
```

### Authentication Workflow Examples

```bash
# Complete workflow for LinkedIn scraping
# 1. Create authenticated profile
crwl profiles
# Select "Create new profile" → login to LinkedIn in browser → press 'q' to save

# 2. Use profile for crawling
crwl https://linkedin.com/in/someone -p linkedin-profile -o markdown

# 3. Extract structured data with authentication
crwl https://linkedin.com/search/results/people/ \
  -p linkedin-profile \
  -j "Extract people profiles with names, titles, and companies" \
  -b "headless=false"

# GitHub authenticated crawling
crwl profiles  # Create github-profile
crwl https://github.com/settings/profile -p github-profile

# Twitter/X authenticated access
crwl profiles  # Create twitter-profile
crwl https://twitter.com/home -p twitter-profile -o markdown
```

### Advanced CLI Configuration

```bash
# Complex crawling with multiple configs
crwl https://example.com \
  -B browser.yml \
  -C crawler.yml \
  -e extract_llm.yml \
  -s llm_schema.json \
  -p my-auth-profile \
  -o json \
  -v

# Quick LLM extraction with authentication
crwl https://private-site.com/dashboard \
  -p auth-profile \
  -j "Extract user dashboard data including metrics and notifications" \
  -b "headless=true,viewport_width=1920"

# Content filtering with authentication
crwl https://members-only-site.com \
  -p member-profile \
  -f filter_bm25.yml \
  -c "css_selector=.member-content,scan_full_page=true" \
  -o markdown-fit
```

### Configuration Files for Identity Browsing

```yaml
# browser_auth.yml
headless: false
use_managed_browser: true
user_data_dir: "/path/to/profile"
viewport_width: 1280
viewport_height: 720
simulate_user: true
override_navigator: true

# crawler_auth.yml
magic: true
remove_overlay_elements: true
simulate_user: true
wait_for: "css:.authenticated-content"
page_timeout: 60000
delay_before_return_html: 2
scan_full_page: true
```

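These files plug into the `-B`/`-C` flags shown earlier; for example (file names match the comments in the YAML above):

```bash
crwl https://site-requiring-login.com/dashboard \
  -B browser_auth.yml \
  -C crawler_auth.yml \
  -p my-auth-profile \
  -o markdown
```
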
### Global Configuration Management

```bash
# List all configuration settings
crwl config list

# Set default LLM provider
crwl config set DEFAULT_LLM_PROVIDER "anthropic/claude-3-sonnet"
crwl config set DEFAULT_LLM_PROVIDER_TOKEN "your-api-token"

# Set browser defaults
crwl config set BROWSER_HEADLESS false  # Always show browser
crwl config set USER_AGENT_MODE random  # Random user agents

# Enable verbose mode globally
crwl config set VERBOSE true
```

### Q&A with Authenticated Content

```bash
# Ask questions about authenticated content
crwl https://private-dashboard.com -p dashboard-profile \
  -q "What are the key metrics shown in my dashboard?"

# Multiple questions workflow
crwl https://company-intranet.com -p work-profile -o markdown  # View content
crwl https://company-intranet.com -p work-profile \
  -q "Summarize this week's announcements"
crwl https://company-intranet.com -p work-profile \
  -q "What are the upcoming deadlines?"
```

### Profile Creation Programmatically

```python
# Create profiles via Python API
import asyncio
from crawl4ai import BrowserProfiler

async def create_auth_profile():
    profiler = BrowserProfiler()

    # Create profile interactively (opens browser)
    profile_path = await profiler.create_profile("linkedin-auth")
    print(f"Profile created at: {profile_path}")

    # List all profiles
    profiles = profiler.list_profiles()
    for profile in profiles:
        print(f"Profile: {profile['name']} at {profile['path']}")

    # Use profile for crawling
    from crawl4ai import AsyncWebCrawler, BrowserConfig

    browser_config = BrowserConfig(
        headless=True,
        use_managed_browser=True,
        user_data_dir=profile_path
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://linkedin.com/feed")
        return result

# asyncio.run(create_auth_profile())
```

### Identity Browsing Best Practices

```bash
# 1. Create specific profiles for different sites
crwl profiles  # Create "linkedin-work"
crwl profiles  # Create "github-personal"
crwl profiles  # Create "company-intranet"

# 2. Use descriptive profile names
crwl https://site1.com -p site1-admin-account
crwl https://site2.com -p site2-user-account

# 3. Combine with appropriate browser settings
crwl https://secure-site.com \
  -p secure-profile \
  -b "headless=false,simulate_user=true,magic=true" \
  -c "wait_for=.logged-in-indicator,page_timeout=30000"

# 4. Test profile before automated crawling
crwl cdp -p test-profile  # Manually verify login status
crwl https://test-url.com -p test-profile -v  # Verbose test crawl
```

### Troubleshooting Authentication Issues

```bash
# Debug authentication problems
crwl https://auth-site.com -p auth-profile \
  -b "headless=false,verbose=true" \
  -c "verbose=true,page_timeout=60000" \
  -v

# Check profile status
crwl profiles  # List profiles and check creation dates

# Recreate problematic profiles
crwl profiles  # Delete old profile, create new one

# Test with visible browser
crwl https://problem-site.com -p profile-name \
  -b "headless=false" \
  -c "delay_before_return_html=5"
```

### Common Use Cases

```bash
# Social media monitoring (after authentication)
crwl https://twitter.com/home -p twitter-monitor \
  -j "Extract latest tweets with sentiment and engagement metrics"

# E-commerce competitor analysis (with account access)
crwl https://competitor-site.com/products -p competitor-account \
  -j "Extract product prices, availability, and descriptions"

# Company dashboard monitoring
crwl https://company-dashboard.com -p work-profile \
  -c "css_selector=.dashboard-content" \
  -q "What alerts or notifications need attention?"

# Research data collection (authenticated access)
crwl https://research-platform.com/data -p research-profile \
  -e extract_research.yml \
  -s research_schema.json \
  -o json
```

**📖 Learn more:** [Identity-Based Crawling Documentation](https://docs.crawl4ai.com/advanced/identity-based-crawling/), [Browser Profile Management](https://docs.crawl4ai.com/advanced/session-management/), [CLI Examples](https://docs.crawl4ai.com/core/cli/)

1171
docs/md_v2/assets/llm.txt/txt/config_objects.txt
Normal file
File diff suppressed because it is too large

@@ -0,0 +1,446 @@
## Deep Crawling Filters & Scorers

Advanced URL filtering and scoring strategies for intelligent deep crawling with performance optimization.

### URL Filters - Content and Domain Control

```python
from crawl4ai.deep_crawling.filters import (
    URLPatternFilter, DomainFilter, ContentTypeFilter,
    FilterChain, ContentRelevanceFilter, SEOFilter
)

# Pattern-based filtering
pattern_filter = URLPatternFilter(
    patterns=[
        "*.html",                       # HTML pages only
        "*/blog/*",                     # Blog posts
        "*/articles/*",                 # Article pages
        "*2024*",                       # Recent content
        "^https://example.com/docs/.*"  # Regex pattern
    ],
    use_glob=True,
    reverse=False  # False = include matching, True = exclude matching
)

# Domain filtering with subdomains
domain_filter = DomainFilter(
    allowed_domains=["example.com", "docs.example.com"],
    blocked_domains=["ads.example.com", "tracker.com"]
)

# Content type filtering
content_filter = ContentTypeFilter(
    allowed_types=["text/html", "application/pdf"],
    check_extension=True
)

# Apply individual filters
url = "https://example.com/blog/2024/article.html"
print(f"Pattern filter: {pattern_filter.apply(url)}")
print(f"Domain filter: {domain_filter.apply(url)}")
print(f"Content filter: {content_filter.apply(url)}")
```

### Filter Chaining - Combine Multiple Filters

```python
# Create filter chain for comprehensive filtering
filter_chain = FilterChain([
    DomainFilter(allowed_domains=["example.com"]),
    URLPatternFilter(patterns=["*/blog/*", "*/docs/*"]),
    ContentTypeFilter(allowed_types=["text/html"])
])

# Apply chain to URLs
urls = [
    "https://example.com/blog/post1.html",
    "https://spam.com/content.html",
    "https://example.com/blog/image.jpg",
    "https://example.com/docs/guide.html"
]

async def filter_urls(urls, filter_chain):
    filtered = []
    for url in urls:
        if await filter_chain.apply(url):
            filtered.append(url)
    return filtered

# Usage
filtered_urls = await filter_urls(urls, filter_chain)
print(f"Filtered URLs: {filtered_urls}")

# Check filter statistics
for filter_obj in filter_chain.filters:
    stats = filter_obj.stats
    print(f"{filter_obj.name}: {stats.passed_urls}/{stats.total_urls} passed")
```

### Advanced Content Filters

```python
# BM25-based content relevance filtering
relevance_filter = ContentRelevanceFilter(
    query="python machine learning tutorial",
    threshold=0.5,  # Minimum relevance score
    k1=1.2,         # TF saturation parameter
    b=0.75,         # Length normalization
    avgdl=1000      # Average document length
)

# SEO quality filtering
seo_filter = SEOFilter(
    threshold=0.65,  # Minimum SEO score
    keywords=["python", "tutorial", "guide"],
    weights={
        "title_length": 0.15,
        "title_kw": 0.18,
        "meta_description": 0.12,
        "canonical": 0.10,
        "robot_ok": 0.20,
        "schema_org": 0.10,
        "url_quality": 0.15
    }
)

# Apply advanced filters
url = "https://example.com/python-ml-tutorial"
relevance_score = await relevance_filter.apply(url)
seo_score = await seo_filter.apply(url)

print(f"Relevance: {relevance_score}, SEO: {seo_score}")
```

### URL Scorers - Quality and Relevance Scoring

```python
from crawl4ai.deep_crawling.scorers import (
    KeywordRelevanceScorer, PathDepthScorer, ContentTypeScorer,
    FreshnessScorer, DomainAuthorityScorer, CompositeScorer
)

# Keyword relevance scoring
keyword_scorer = KeywordRelevanceScorer(
    keywords=["python", "tutorial", "guide", "machine", "learning"],
    weight=1.0,
    case_sensitive=False
)

# Path depth scoring (optimal depth = 3)
depth_scorer = PathDepthScorer(
    optimal_depth=3,  # /category/subcategory/article
    weight=0.8
)

# Content type scoring
content_type_scorer = ContentTypeScorer(
    type_weights={
        "html": 1.0,  # Highest priority
        "pdf": 0.8,   # Medium priority
        "txt": 0.6,   # Lower priority
        "doc": 0.4    # Lowest priority
    },
    weight=0.9
)

# Freshness scoring
freshness_scorer = FreshnessScorer(
    weight=0.7,
    current_year=2024
)

# Domain authority scoring
domain_scorer = DomainAuthorityScorer(
    domain_weights={
        "python.org": 1.0,
        "github.com": 0.9,
        "stackoverflow.com": 0.85,
        "medium.com": 0.7,
        "personal-blog.com": 0.3
    },
    default_weight=0.5,
    weight=1.0
)

# Score individual URLs
url = "https://python.org/tutorial/2024/machine-learning.html"
scores = {
    "keyword": keyword_scorer.score(url),
    "depth": depth_scorer.score(url),
    "content": content_type_scorer.score(url),
    "freshness": freshness_scorer.score(url),
    "domain": domain_scorer.score(url)
}

print(f"Individual scores: {scores}")
```

### Composite Scoring - Combine Multiple Scorers

```python
# Create composite scorer combining all strategies
composite_scorer = CompositeScorer(
    scorers=[
        KeywordRelevanceScorer(["python", "tutorial"], weight=1.5),
        PathDepthScorer(optimal_depth=3, weight=1.0),
        ContentTypeScorer({"html": 1.0, "pdf": 0.8}, weight=1.2),
        FreshnessScorer(weight=0.8, current_year=2024),
        DomainAuthorityScorer({
            "python.org": 1.0,
            "github.com": 0.9
        }, weight=1.3)
    ],
    normalize=True  # Normalize by number of scorers
)

# Score multiple URLs
urls_to_score = [
    "https://python.org/tutorial/2024/basics.html",
    "https://github.com/user/python-guide/blob/main/README.md",
    "https://random-blog.com/old/2018/python-stuff.html",
    "https://python.org/docs/deep/nested/advanced/guide.html"
]

scored_urls = []
for url in urls_to_score:
    score = composite_scorer.score(url)
    scored_urls.append((url, score))

# Sort by score (highest first)
scored_urls.sort(key=lambda x: x[1], reverse=True)

for url, score in scored_urls:
    print(f"Score: {score:.3f} - {url}")

# Check scorer statistics
print("\nScoring statistics:")
print(f"URLs scored: {composite_scorer.stats._urls_scored}")
print(f"Average score: {composite_scorer.stats.get_average():.3f}")
```

### Advanced Filter Patterns

```python
# Complex pattern matching
advanced_patterns = URLPatternFilter(
    patterns=[
        r"^https://docs\.python\.org/\d+/",     # Python docs with version
        r".*/tutorial/.*\.html$",               # Tutorial pages
        r".*/guide/(?!deprecated).*",           # Guides but not deprecated
        "*/blog/{2020,2021,2022,2023,2024}/*",  # Recent blog posts
        "**/{api,reference}/**/*.html"          # API/reference docs
    ],
    use_glob=True
)

# Exclude patterns (reverse=True)
exclude_filter = URLPatternFilter(
    patterns=[
        "*/admin/*",
        "*/login/*",
        "*/private/*",
        "**/.*",                   # Hidden files
        "*.{jpg,png,gif,css,js}$"  # Media and assets
    ],
    reverse=True  # Exclude matching patterns
)

# Content type with extension mapping
detailed_content_filter = ContentTypeFilter(
    allowed_types=["text", "application"],
    check_extension=True,
    ext_map={
        "html": "text/html",
        "htm": "text/html",
        "md": "text/markdown",
        "pdf": "application/pdf",
        "doc": "application/msword",
        "docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    }
)
```

### Performance-Optimized Filtering

```python
import asyncio

# High-performance filter chain for large-scale crawling
class OptimizedFilterChain:
    def __init__(self):
        # Fast filters first (domain, patterns)
        self.fast_filters = [
            DomainFilter(
                allowed_domains=["example.com", "docs.example.com"],
                blocked_domains=["ads.example.com"]
            ),
            URLPatternFilter([
                "*.html", "*.pdf", "*/blog/*", "*/docs/*"
            ])
        ]

        # Slower filters last (content analysis)
        self.slow_filters = [
            ContentRelevanceFilter(
                query="important content",
                threshold=0.3
            )
        ]

    async def apply(self, url: str) -> bool:
        # Apply fast filters first
        for filter_obj in self.fast_filters:
            if not filter_obj.apply(url):
                return False

        # Only apply slow filters if fast filters pass
        for filter_obj in self.slow_filters:
            if not await filter_obj.apply(url):
                return False

        return True

# Batch filtering with concurrency
async def batch_filter_urls(urls, filter_chain, max_concurrent=50):
    semaphore = asyncio.Semaphore(max_concurrent)

    async def filter_single(url):
        async with semaphore:
            return await filter_chain.apply(url), url

    tasks = [filter_single(url) for url in urls]
    results = await asyncio.gather(*tasks)

    return [url for passed, url in results if passed]

# Usage with 1000 URLs
large_url_list = [f"https://example.com/page{i}.html" for i in range(1000)]
optimized_chain = OptimizedFilterChain()
filtered = await batch_filter_urls(large_url_list, optimized_chain)
```

### Custom Filter Implementation

```python
from crawl4ai.deep_crawling.filters import URLFilter
import re

class CustomLanguageFilter(URLFilter):
    """Filter URLs by language indicators"""

    def __init__(self, allowed_languages=("en",), weight=1.0):
        super().__init__()
        self.allowed_languages = set(allowed_languages)
        self.lang_patterns = {
            "en": re.compile(r"/en/|/english/|lang=en"),
            "es": re.compile(r"/es/|/spanish/|lang=es"),
            "fr": re.compile(r"/fr/|/french/|lang=fr"),
            "de": re.compile(r"/de/|/german/|lang=de")
        }

    def apply(self, url: str) -> bool:
        # Default to English if no language indicators
        if not any(pattern.search(url) for pattern in self.lang_patterns.values()):
            result = "en" in self.allowed_languages
            self._update_stats(result)
            return result

        # Check for allowed languages
        for lang in self.allowed_languages:
            if lang in self.lang_patterns:
                if self.lang_patterns[lang].search(url):
                    self._update_stats(True)
                    return True

        self._update_stats(False)
        return False

# Custom scorer implementation
from crawl4ai.deep_crawling.scorers import URLScorer

class CustomComplexityScorer(URLScorer):
    """Score URLs by content complexity indicators"""

    def __init__(self, weight=1.0):
        super().__init__(weight)
        self.complexity_indicators = {
            "tutorial": 0.9,
            "guide": 0.8,
            "example": 0.7,
            "reference": 0.6,
            "api": 0.5
        }

    def _calculate_score(self, url: str) -> float:
        url_lower = url.lower()
        max_score = 0.0

        for indicator, score in self.complexity_indicators.items():
            if indicator in url_lower:
                max_score = max(max_score, score)

        return max_score

# Use custom filters and scorers
custom_filter = CustomLanguageFilter(allowed_languages=["en", "es"])
custom_scorer = CustomComplexityScorer(weight=1.2)

url = "https://example.com/en/tutorial/advanced-guide.html"
passes_filter = custom_filter.apply(url)
complexity_score = custom_scorer.score(url)

print(f"Passes language filter: {passes_filter}")
print(f"Complexity score: {complexity_score}")
```

### Integration with Deep Crawling

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def deep_crawl_with_filtering():
    # Create comprehensive filter chain
    filter_chain = FilterChain([
        DomainFilter(allowed_domains=["python.org"]),
        URLPatternFilter(["*/tutorial/*", "*/guide/*", "*/docs/*"]),
        ContentTypeFilter(["text/html"]),
        SEOFilter(threshold=0.6, keywords=["python", "programming"])
    ])

    # Create composite scorer
    scorer = CompositeScorer([
        KeywordRelevanceScorer(["python", "tutorial"], weight=1.5),
        FreshnessScorer(weight=0.8),
        PathDepthScorer(optimal_depth=3, weight=1.0)
    ], normalize=True)

    # Configure deep crawl strategy with filters and scorers
    deep_strategy = BFSDeepCrawlStrategy(
        max_depth=3,
        max_pages=100,
        filter_chain=filter_chain,
        url_scorer=scorer,
        score_threshold=0.6  # Only crawl URLs scoring above 0.6
    )

    config = CrawlerRunConfig(
        deep_crawl_strategy=deep_strategy,
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun(
            url="https://python.org",
            config=config
        )

        print(f"Pages crawled: {len(results)}")

# Run the deep crawl
await deep_crawl_with_filtering()
```

**📖 Learn more:** [Deep Crawling Strategy](https://docs.crawl4ai.com/core/deep-crawling/), [Custom Filter Development](https://docs.crawl4ai.com/advanced/custom-filters/), [Performance Optimization](https://docs.crawl4ai.com/advanced/performance-tuning/)

348
docs/md_v2/assets/llm.txt/txt/deep_crawling.txt
Normal file
@@ -0,0 +1,348 @@
## Deep Crawling

Multi-level website exploration with intelligent filtering, scoring, and prioritization strategies.

### Basic Deep Crawl Setup

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy

# Basic breadth-first deep crawling
async def basic_deep_crawl():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,            # Initial page + 2 levels
            include_external=False  # Stay within same domain
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        verbose=True
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://docs.crawl4ai.com", config=config)

        # Group results by depth
        pages_by_depth = {}
        for result in results:
            depth = result.metadata.get("depth", 0)
            if depth not in pages_by_depth:
                pages_by_depth[depth] = []
            pages_by_depth[depth].append(result.url)

        print(f"Crawled {len(results)} pages total")
        for depth, urls in sorted(pages_by_depth.items()):
            print(f"Depth {depth}: {len(urls)} pages")
```

### Deep Crawl Strategies

```python
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy, DFSDeepCrawlStrategy, BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

# Breadth-First Search - explores all links at one depth before going deeper
bfs_strategy = BFSDeepCrawlStrategy(
    max_depth=2,
    include_external=False,
    max_pages=50,        # Limit total pages
    score_threshold=0.3  # Minimum score for URLs
)

# Depth-First Search - explores as deep as possible before backtracking
dfs_strategy = DFSDeepCrawlStrategy(
    max_depth=2,
    include_external=False,
    max_pages=30,
    score_threshold=0.5
)

# Best-First - prioritizes highest scoring pages (recommended)
keyword_scorer = KeywordRelevanceScorer(
    keywords=["crawl", "example", "async", "configuration"],
    weight=0.7
)

best_first_strategy = BestFirstCrawlingStrategy(
    max_depth=2,
    include_external=False,
    url_scorer=keyword_scorer,
    max_pages=25  # No score_threshold needed - naturally prioritizes
)

# Usage
config = CrawlerRunConfig(
    deep_crawl_strategy=best_first_strategy,  # Choose your strategy
    scraping_strategy=LXMLWebScrapingStrategy()
)
```

### Streaming vs Batch Processing

```python
# Batch mode - wait for all results
async def batch_deep_crawl():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
        stream=False  # Default - collect all results first
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://example.com", config=config)

        # Process all results at once
        for result in results:
            print(f"Batch processed: {result.url}")

# Streaming mode - process results as they arrive
async def streaming_deep_crawl():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
        stream=True  # Process results immediately
    )

    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://example.com", config=config):
            depth = result.metadata.get("depth", 0)
            print(f"Stream processed depth {depth}: {result.url}")
```

### Filtering with Filter Chains

```python
from crawl4ai.deep_crawling.filters import (
    FilterChain,
    URLPatternFilter,
    DomainFilter,
    ContentTypeFilter,
    SEOFilter,
    ContentRelevanceFilter
)

# Single URL pattern filter
url_filter = URLPatternFilter(patterns=["*core*", "*guide*"])

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=1,
        filter_chain=FilterChain([url_filter])
    )
)

# Multiple filters in chain
advanced_filter_chain = FilterChain([
    # Domain filtering
    DomainFilter(
        allowed_domains=["docs.example.com"],
        blocked_domains=["old.docs.example.com", "staging.example.com"]
    ),

    # URL pattern matching
    URLPatternFilter(patterns=["*tutorial*", "*guide*", "*blog*"]),

    # Content type filtering
    ContentTypeFilter(allowed_types=["text/html"]),

    # SEO quality filter
    SEOFilter(
        threshold=0.5,
        keywords=["tutorial", "guide", "documentation"]
    ),

    # Content relevance filter
    ContentRelevanceFilter(
        query="Web crawling and data extraction with Python",
        threshold=0.7
    )
])

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=2,
        filter_chain=advanced_filter_chain
    )
)
```

### Intelligent Crawling with Scorers

```python
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

# Keyword relevance scoring
async def scored_deep_crawl():
    keyword_scorer = KeywordRelevanceScorer(
        keywords=["browser", "crawler", "web", "automation"],
        weight=1.0
    )

    config = CrawlerRunConfig(
        deep_crawl_strategy=BestFirstCrawlingStrategy(
            max_depth=2,
            include_external=False,
            url_scorer=keyword_scorer
        ),
        stream=True,  # Recommended with BestFirst
        verbose=True
    )

    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://docs.crawl4ai.com", config=config):
            score = result.metadata.get("score", 0)
            depth = result.metadata.get("depth", 0)
            print(f"Depth: {depth} | Score: {score:.2f} | {result.url}")
```

### Limiting Crawl Size

```python
# Max pages limitation across strategies
async def limited_crawls():
    # BFS with page limit
    bfs_config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,
            max_pages=5,  # Only crawl 5 pages total
            url_scorer=KeywordRelevanceScorer(keywords=["browser", "crawler"], weight=1.0)
        )
    )

    # DFS with score threshold
    dfs_config = CrawlerRunConfig(
        deep_crawl_strategy=DFSDeepCrawlStrategy(
            max_depth=2,
            score_threshold=0.7,  # Only URLs with scores above 0.7
            max_pages=10,
            url_scorer=KeywordRelevanceScorer(keywords=["web", "automation"], weight=1.0)
        )
    )

    # Best-First with both constraints
    bf_config = CrawlerRunConfig(
        deep_crawl_strategy=BestFirstCrawlingStrategy(
            max_depth=2,
            max_pages=7,  # Automatically gets highest scored pages
            url_scorer=KeywordRelevanceScorer(keywords=["crawl", "example"], weight=1.0)
        ),
        stream=True
    )

    async with AsyncWebCrawler() as crawler:
        # Use any of the configs
        async for result in await crawler.arun("https://docs.crawl4ai.com", config=bf_config):
            score = result.metadata.get("score", 0)
            print(f"Score: {score:.2f} | {result.url}")
```

### Complete Advanced Deep Crawler

```python
import time

async def comprehensive_deep_crawl():
    # Sophisticated filter chain
    filter_chain = FilterChain([
        DomainFilter(
            allowed_domains=["docs.crawl4ai.com"],
            blocked_domains=["old.docs.crawl4ai.com"]
        ),
        URLPatternFilter(patterns=["*core*", "*advanced*", "*blog*"]),
        ContentTypeFilter(allowed_types=["text/html"]),
        SEOFilter(threshold=0.4, keywords=["crawl", "tutorial", "guide"])
    ])

    # Multi-keyword scorer
    keyword_scorer = KeywordRelevanceScorer(
        keywords=["crawl", "example", "async", "configuration", "browser"],
        weight=0.8
    )

    # Complete configuration
    config = CrawlerRunConfig(
        deep_crawl_strategy=BestFirstCrawlingStrategy(
            max_depth=2,
            include_external=False,
            filter_chain=filter_chain,
            url_scorer=keyword_scorer,
            max_pages=20
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        stream=True,
        verbose=True,
        cache_mode=CacheMode.BYPASS
    )

    # Execute and analyze
    results = []
    start_time = time.time()

    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://docs.crawl4ai.com", config=config):
            results.append(result)
            score = result.metadata.get("score", 0)
            depth = result.metadata.get("depth", 0)
            print(f"→ Depth: {depth} | Score: {score:.2f} | {result.url}")

    # Performance analysis
    duration = time.time() - start_time
    avg_score = sum(r.metadata.get('score', 0) for r in results) / len(results)

    print(f"✅ Crawled {len(results)} pages in {duration:.2f}s")
    print(f"✅ Average relevance score: {avg_score:.2f}")

    # Depth distribution
    depth_counts = {}
    for result in results:
        depth = result.metadata.get("depth", 0)
        depth_counts[depth] = depth_counts.get(depth, 0) + 1

    for depth, count in sorted(depth_counts.items()):
        print(f"📊 Depth {depth}: {count} pages")
```

### Error Handling and Robustness

```python
async def robust_deep_crawl():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BestFirstCrawlingStrategy(
            max_depth=2,
            max_pages=15,
            url_scorer=KeywordRelevanceScorer(keywords=["guide", "tutorial"])
        ),
        stream=True,
        page_timeout=30000  # 30 second timeout per page
    )

    successful_pages = []
    failed_pages = []

    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://docs.crawl4ai.com", config=config):
            if result.success:
                successful_pages.append(result)
                depth = result.metadata.get("depth", 0)
                score = result.metadata.get("score", 0)
                print(f"✅ Depth {depth} | Score: {score:.2f} | {result.url}")
            else:
                failed_pages.append({
                    'url': result.url,
                    'error': result.error_message,
                    'depth': result.metadata.get("depth", 0)
                })
                print(f"❌ Failed: {result.url} - {result.error_message}")

    print(f"📊 Results: {len(successful_pages)} successful, {len(failed_pages)} failed")

    # Analyze failures by depth
    if failed_pages:
        failure_by_depth = {}
        for failure in failed_pages:
            depth = failure['depth']
            failure_by_depth[depth] = failure_by_depth.get(depth, 0) + 1

        print("❌ Failures by depth:")
        for depth, count in sorted(failure_by_depth.items()):
            print(f"  Depth {depth}: {count} failures")
```

**📖 Learn more:** [Deep Crawling Guide](https://docs.crawl4ai.com/core/deep-crawling/), [Filter Documentation](https://docs.crawl4ai.com/core/content-selection/), [Scoring Strategies](https://docs.crawl4ai.com/advanced/advanced-features/)

826
docs/md_v2/assets/llm.txt/txt/docker.txt
Normal file
@@ -0,0 +1,826 @@
## Docker Deployment

Complete Docker deployment guide with pre-built images, API endpoints, configuration, and MCP integration.

### Quick Start with Pre-built Images

```bash
# Pull latest image
docker pull unclecode/crawl4ai:latest

# Setup LLM API keys
cat > .llm.env << EOL
OPENAI_API_KEY=sk-your-key
ANTHROPIC_API_KEY=your-anthropic-key
GROQ_API_KEY=your-groq-key
GEMINI_API_TOKEN=your-gemini-token
EOL

# Run with LLM support
docker run -d \
  -p 11235:11235 \
  --name crawl4ai \
  --env-file .llm.env \
  --shm-size=1g \
  unclecode/crawl4ai:latest

# Basic run (no LLM)
docker run -d \
  -p 11235:11235 \
  --name crawl4ai \
  --shm-size=1g \
  unclecode/crawl4ai:latest

# Check health
curl http://localhost:11235/health
```

### Docker Compose Deployment

```bash
# Clone and setup
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
cp deploy/docker/.llm.env.example .llm.env
# Edit .llm.env with your API keys

# Run pre-built image
IMAGE=unclecode/crawl4ai:latest docker compose up -d

# Build locally
docker compose up --build -d

# Build with all features
INSTALL_TYPE=all docker compose up --build -d

# Build with GPU support
ENABLE_GPU=true docker compose up --build -d

# Stop service
docker compose down
```

### Manual Build with Multi-Architecture

```bash
# Clone repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai

# Build for current architecture
docker buildx build -t crawl4ai-local:latest --load .

# Build for multiple architectures
docker buildx build --platform linux/amd64,linux/arm64 \
  -t crawl4ai-local:latest --load .

# Build with specific features
docker buildx build \
  --build-arg INSTALL_TYPE=all \
  --build-arg ENABLE_GPU=false \
  -t crawl4ai-local:latest --load .

# Run custom build
docker run -d \
  -p 11235:11235 \
  --name crawl4ai-custom \
  --env-file .llm.env \
  --shm-size=1g \
  crawl4ai-local:latest
```

### Build Arguments
|
||||
|
||||
```bash
|
||||
# Available build options
|
||||
docker buildx build \
|
||||
--build-arg INSTALL_TYPE=all \ # default|all|torch|transformer
|
||||
--build-arg ENABLE_GPU=true \ # true|false
|
||||
--build-arg APP_HOME=/app \ # Install path
|
||||
--build-arg USE_LOCAL=true \ # Use local source
|
||||
--build-arg GITHUB_REPO=url \ # Git repo if USE_LOCAL=false
|
||||
--build-arg GITHUB_BRANCH=main \ # Git branch
|
||||
-t crawl4ai-custom:latest --load .
|
||||
```
|
||||
|
||||
### Core API Endpoints
|
||||
|
||||
```python
|
||||
# Main crawling endpoints
|
||||
import requests
|
||||
import json
|
||||
|
||||
# Basic crawl
|
||||
payload = {
|
||||
"urls": ["https://example.com"],
|
||||
"browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
|
||||
"crawler_config": {"type": "CrawlerRunConfig", "params": {"cache_mode": "bypass"}}
|
||||
}
|
||||
response = requests.post("http://localhost:11235/crawl", json=payload)
|
||||
|
||||
# Streaming crawl
|
||||
payload["crawler_config"]["params"]["stream"] = True
|
||||
response = requests.post("http://localhost:11235/crawl/stream", json=payload)
|
||||
|
||||
# Health check
|
||||
response = requests.get("http://localhost:11235/health")
|
||||
|
||||
# API schema
|
||||
response = requests.get("http://localhost:11235/schema")
|
||||
|
||||
# Metrics (Prometheus format)
|
||||
response = requests.get("http://localhost:11235/metrics")
|
||||
```
|
||||
|
||||
### Specialized Endpoints
|
||||
|
||||
```python
|
||||
# HTML extraction (preprocessed for schema)
|
||||
response = requests.post("http://localhost:11235/html",
|
||||
json={"url": "https://example.com"})
|
||||
|
||||
# Screenshot capture
|
||||
response = requests.post("http://localhost:11235/screenshot", json={
|
||||
"url": "https://example.com",
|
||||
"screenshot_wait_for": 2,
|
||||
"output_path": "/path/to/save/screenshot.png"
|
||||
})
|
||||
|
||||
# PDF generation
|
||||
response = requests.post("http://localhost:11235/pdf", json={
|
||||
"url": "https://example.com",
|
||||
"output_path": "/path/to/save/document.pdf"
|
||||
})
|
||||
|
||||
# JavaScript execution
|
||||
response = requests.post("http://localhost:11235/execute_js", json={
|
||||
"url": "https://example.com",
|
||||
"scripts": [
|
||||
"return document.title",
|
||||
"return Array.from(document.querySelectorAll('a')).map(a => a.href)"
|
||||
]
|
||||
})
|
||||
|
||||
# Markdown generation
|
||||
response = requests.post("http://localhost:11235/md", json={
|
||||
"url": "https://example.com",
|
||||
"f": "fit", # raw|fit|bm25|llm
|
||||
"q": "extract main content", # query for filtering
|
||||
"c": "0" # cache: 0=bypass, 1=use
|
||||
})
|
||||
|
||||
# LLM Q&A
|
||||
response = requests.get("http://localhost:11235/llm/https://example.com?q=What is this page about?")
|
||||
|
||||
# Library context (for AI assistants)
|
||||
response = requests.get("http://localhost:11235/ask", params={
|
||||
"context_type": "all", # code|doc|all
|
||||
"query": "how to use extraction strategies",
|
||||
"score_ratio": 0.5,
|
||||
"max_results": 20
|
||||
})
|
||||
```
|
||||
|
||||
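The same `/md` call can also be issued from the shell; a minimal curl equivalent of the request above (same parameters, default port assumed):

```bash
curl -X POST http://localhost:11235/md \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "f": "fit", "q": "extract main content", "c": "0"}'
```
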
### Python SDK Usage

```python
import asyncio
from crawl4ai.docker_client import Crawl4aiDockerClient
from crawl4ai import BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    async with Crawl4aiDockerClient(base_url="http://localhost:11235") as client:
        # Non-streaming crawl
        results = await client.crawl(
            ["https://example.com"],
            browser_config=BrowserConfig(headless=True),
            crawler_config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        )

        for result in results:
            print(f"URL: {result.url}, Success: {result.success}")
            print(f"Content length: {len(result.markdown)}")

        # Streaming crawl
        stream_config = CrawlerRunConfig(stream=True, cache_mode=CacheMode.BYPASS)
        async for result in await client.crawl(
            ["https://example.com", "https://python.org"],
            browser_config=BrowserConfig(headless=True),
            crawler_config=stream_config
        ):
            print(f"Streamed: {result.url} - {result.success}")

        # Get API schema
        schema = await client.get_schema()
        print(f"Schema available: {bool(schema)}")

asyncio.run(main())
```

### Advanced API Configuration

```python
# Complex extraction with LLM
payload = {
    "urls": ["https://example.com"],
    "browser_config": {
        "type": "BrowserConfig",
        "params": {
            "headless": True,
            "viewport": {"type": "dict", "value": {"width": 1200, "height": 800}}
        }
    },
    "crawler_config": {
        "type": "CrawlerRunConfig",
        "params": {
            "extraction_strategy": {
                "type": "LLMExtractionStrategy",
                "params": {
                    "llm_config": {
                        "type": "LLMConfig",
                        "params": {
                            "provider": "openai/gpt-4o-mini",
                            "api_token": "env:OPENAI_API_KEY"
                        }
                    },
                    "schema": {
                        "type": "dict",
                        "value": {
                            "type": "object",
                            "properties": {
                                "title": {"type": "string"},
                                "content": {"type": "string"}
                            }
                        }
                    },
                    "instruction": "Extract title and main content"
                }
            },
            "markdown_generator": {
                "type": "DefaultMarkdownGenerator",
                "params": {
                    "content_filter": {
                        "type": "PruningContentFilter",
                        "params": {"threshold": 0.6}
                    }
                }
            }
        }
    }
}

response = requests.post("http://localhost:11235/crawl", json=payload)
```

### CSS Extraction Strategy

```python
# CSS-based structured extraction
schema = {
    "name": "ProductList",
    "baseSelector": ".product",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
    ]
}

payload = {
    "urls": ["https://example-shop.com"],
    "browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
    "crawler_config": {
        "type": "CrawlerRunConfig",
        "params": {
            "extraction_strategy": {
                "type": "JsonCssExtractionStrategy",
                "params": {
                    "schema": {"type": "dict", "value": schema}
                }
            }
        }
    }
}

response = requests.post("http://localhost:11235/crawl", json=payload)
data = response.json()
extracted = json.loads(data["results"][0]["extracted_content"])
```

### MCP (Model Context Protocol) Integration

```bash
# Add Crawl4AI as MCP provider to Claude Code
claude mcp add --transport sse c4ai-sse http://localhost:11235/mcp/sse

# List MCP providers
claude mcp list

# Test MCP connection
python tests/mcp/test_mcp_socket.py

# Available MCP endpoints
# SSE:       http://localhost:11235/mcp/sse
# WebSocket: ws://localhost:11235/mcp/ws
# Schema:    http://localhost:11235/mcp/schema
```

Available MCP tools:
- `md` - Generate markdown from web content
- `html` - Extract preprocessed HTML
- `screenshot` - Capture webpage screenshots
- `pdf` - Generate PDF documents
- `execute_js` - Run JavaScript on web pages
- `crawl` - Perform multi-URL crawling
- `ask` - Query Crawl4AI library context

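Before pointing an MCP client at the server, the tool definitions can be inspected through the schema endpoint listed above; a quick check, assuming the endpoint returns JSON describing the tools:

```bash
# Dump the MCP tool definitions (names, parameters) for inspection
curl http://localhost:11235/mcp/schema
```
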
### Configuration Management

```yaml
# config.yml structure
app:
  title: "Crawl4AI API"
  version: "1.0.0"
  host: "0.0.0.0"
  port: 11235
  timeout_keep_alive: 300

llm:
  provider: "openai/gpt-4o-mini"
  api_key_env: "OPENAI_API_KEY"

security:
  enabled: false
  jwt_enabled: false
  trusted_hosts: ["*"]

crawler:
  memory_threshold_percent: 95.0
  rate_limiter:
    base_delay: [1.0, 2.0]
  timeouts:
    stream_init: 30.0
    batch_process: 300.0
  pool:
    max_pages: 40
    idle_ttl_sec: 1800

rate_limiting:
  enabled: true
  default_limit: "1000/minute"
  storage_uri: "memory://"

logging:
  level: "INFO"
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
```

### Custom Configuration Deployment

```bash
# Method 1: Mount custom config
docker run -d -p 11235:11235 \
  --name crawl4ai-custom \
  --env-file .llm.env \
  --shm-size=1g \
  -v $(pwd)/my-config.yml:/app/config.yml \
  unclecode/crawl4ai:latest

# Method 2: Build with custom config
# Edit deploy/docker/config.yml then build
docker buildx build -t crawl4ai-custom:latest --load .
```

### Monitoring and Health Checks

```bash
# Health endpoint
curl http://localhost:11235/health

# Prometheus metrics
curl http://localhost:11235/metrics

# Configuration validation
curl -X POST http://localhost:11235/config/dump \
  -H "Content-Type: application/json" \
  -d '{"code": "CrawlerRunConfig(cache_mode=\"BYPASS\", screenshot=True)"}'
```

### Playground Interface

Access the interactive playground at `http://localhost:11235/playground` for:
- Testing configurations with a visual interface
- Generating JSON payloads for the REST API (see the example below)
- Converting Python config to JSON format
- Testing crawl operations directly in the browser

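Payloads generated in the playground have the same shape as the `/crawl` requests shown earlier, so they can be replayed directly from the shell; a minimal sketch, assuming the default port:

```bash
curl -X POST http://localhost:11235/crawl \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com"],
       "browser_config": {"type": "BrowserConfig", "params": {"headless": true}},
       "crawler_config": {"type": "CrawlerRunConfig", "params": {"cache_mode": "bypass"}}}'
```
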
### Async Job Processing

```python
# Submit job for async processing
import time
import requests  # `payload` below reuses the structure from the examples above

# Submit crawl job
response = requests.post("http://localhost:11235/crawl/job", json=payload)
task_id = response.json()["task_id"]

# Poll for completion
while True:
    result = requests.get(f"http://localhost:11235/crawl/job/{task_id}")
    status = result.json()

    if status["status"] in ["COMPLETED", "FAILED"]:
        break
    time.sleep(1.5)

print("Final result:", status)
```

### Production Deployment

```bash
# Production-ready deployment
docker run -d \
  --name crawl4ai-prod \
  --restart unless-stopped \
  -p 11235:11235 \
  --env-file .llm.env \
  --shm-size=2g \
  --memory=8g \
  --cpus=4 \
  -v /path/to/custom-config.yml:/app/config.yml \
  unclecode/crawl4ai:latest
```

```yaml
# With Docker Compose for production
version: '3.8'
services:
  crawl4ai:
    image: unclecode/crawl4ai:latest
    ports:
      - "11235:11235"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    volumes:
      - ./config.yml:/app/config.yml
    shm_size: 2g
    deploy:
      resources:
        limits:
          memory: 8G
          cpus: '4'
    restart: unless-stopped
```

### Configuration Validation and JSON Structure

```python
# Method 1: Create config objects and dump to see expected JSON structure
from crawl4ai import BrowserConfig, CrawlerRunConfig, LLMConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, LLMExtractionStrategy
import json

# Create browser config and see JSON structure
browser_config = BrowserConfig(
    headless=True,
    viewport_width=1280,
    viewport_height=720,
    proxy="http://user:pass@proxy:8080"
)

# Get JSON structure
browser_json = browser_config.dump()
print("BrowserConfig JSON structure:")
print(json.dumps(browser_json, indent=2))

# Create crawler config with extraction strategy
schema = {
    "name": "Articles",
    "baseSelector": ".article",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "content", "selector": ".content", "type": "html"}
    ]
}

crawler_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    screenshot=True,
    extraction_strategy=JsonCssExtractionStrategy(schema),
    js_code=["window.scrollTo(0, document.body.scrollHeight);"],
    wait_for="css:.loaded"
)

crawler_json = crawler_config.dump()
print("\nCrawlerRunConfig JSON structure:")
print(json.dumps(crawler_json, indent=2))
```

### Reverse Validation - JSON to Objects

```python
# Method 2: Load JSON back to config objects for validation
from crawl4ai.async_configs import from_serializable_dict

# Test JSON structure by converting back to objects
test_browser_json = {
    "type": "BrowserConfig",
    "params": {
        "headless": True,
        "viewport_width": 1280,
        "proxy": "http://user:pass@proxy:8080"
    }
}

try:
    # Convert JSON back to object
    restored_browser = from_serializable_dict(test_browser_json)
    print(f"✅ Valid BrowserConfig: {type(restored_browser)}")
    print(f"Headless: {restored_browser.headless}")
    print(f"Proxy: {restored_browser.proxy}")
except Exception as e:
    print(f"❌ Invalid BrowserConfig JSON: {e}")

# Test complex crawler config JSON
test_crawler_json = {
    "type": "CrawlerRunConfig",
    "params": {
        "cache_mode": "bypass",
        "screenshot": True,
        "extraction_strategy": {
            "type": "JsonCssExtractionStrategy",
            "params": {
                "schema": {
                    "type": "dict",
                    "value": {
                        "name": "Products",
                        "baseSelector": ".product",
                        "fields": [
                            {"name": "title", "selector": "h3", "type": "text"}
                        ]
                    }
                }
            }
        }
    }
}

try:
    restored_crawler = from_serializable_dict(test_crawler_json)
    print(f"✅ Valid CrawlerRunConfig: {type(restored_crawler)}")
    print(f"Cache mode: {restored_crawler.cache_mode}")
    print(f"Has extraction strategy: {restored_crawler.extraction_strategy is not None}")
except Exception as e:
    print(f"❌ Invalid CrawlerRunConfig JSON: {e}")
```

### Using Server's /config/dump Endpoint for Validation

```python
import requests
import json

# Method 3: Use server endpoint to validate configuration syntax
def validate_config_with_server(config_code: str) -> dict:
    """Validate configuration using server's /config/dump endpoint"""
    response = requests.post(
        "http://localhost:11235/config/dump",
        json={"code": config_code}
    )

    if response.status_code == 200:
        print("✅ Valid configuration syntax")
        return response.json()
    else:
        print(f"❌ Invalid configuration: {response.status_code}")
        print(response.json())
        return None

# Test valid configuration
valid_config = """
CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    screenshot=True,
    js_code=["window.scrollTo(0, document.body.scrollHeight);"],
    wait_for="css:.content-loaded"
)
"""

result = validate_config_with_server(valid_config)
if result:
    print("Generated JSON structure:")
    print(json.dumps(result, indent=2))

# Test invalid configuration (should fail)
invalid_config = """
CrawlerRunConfig(
    cache_mode="invalid_mode",
    screenshot=True,
    js_code=some_function()  # This will fail
)
"""

validate_config_with_server(invalid_config)
```

### Configuration Builder Helper

```python
def build_and_validate_request(urls, browser_params=None, crawler_params=None):
    """Helper to build and validate a complete request payload"""

    # Create configurations
    browser_config = BrowserConfig(**(browser_params or {}))
    crawler_config = CrawlerRunConfig(**(crawler_params or {}))

    # Build complete request payload
    payload = {
        "urls": urls if isinstance(urls, list) else [urls],
        "browser_config": browser_config.dump(),
        "crawler_config": crawler_config.dump()
    }

    print("✅ Complete request payload:")
    print(json.dumps(payload, indent=2))

    # Validate by attempting to reconstruct
    try:
        test_browser = from_serializable_dict(payload["browser_config"])
        test_crawler = from_serializable_dict(payload["crawler_config"])
        print("✅ Payload validation successful")
        return payload
    except Exception as e:
        print(f"❌ Payload validation failed: {e}")
        return None

# Example usage
payload = build_and_validate_request(
    urls=["https://example.com"],
    browser_params={"headless": True, "viewport_width": 1280},
    crawler_params={
        "cache_mode": CacheMode.BYPASS,
        "screenshot": True,
        "word_count_threshold": 10
    }
)

if payload:
    # Send to server
    response = requests.post("http://localhost:11235/crawl", json=payload)
    print(f"Server response: {response.status_code}")
```

### Common JSON Structure Patterns

```python
# Pattern 1: Simple primitive values
simple_config = {
    "type": "CrawlerRunConfig",
    "params": {
        "cache_mode": "bypass",  # String enum value
        "screenshot": True,      # Boolean
        "page_timeout": 60000    # Integer
    }
}

# Pattern 2: Nested objects
nested_config = {
    "type": "CrawlerRunConfig",
    "params": {
        "extraction_strategy": {
            "type": "LLMExtractionStrategy",
            "params": {
                "llm_config": {
                    "type": "LLMConfig",
                    "params": {
                        "provider": "openai/gpt-4o-mini",
                        "api_token": "env:OPENAI_API_KEY"
                    }
                },
                "instruction": "Extract main content"
            }
        }
    }
}

# Pattern 3: Dictionary values (must use type: dict wrapper)
dict_config = {
    "type": "CrawlerRunConfig",
    "params": {
        "extraction_strategy": {
            "type": "JsonCssExtractionStrategy",
            "params": {
                "schema": {
                    "type": "dict",  # Required wrapper
                    "value": {       # Actual dictionary content
                        "name": "Products",
                        "baseSelector": ".product",
                        "fields": [
                            {"name": "title", "selector": "h2", "type": "text"}
                        ]
                    }
                }
            }
        }
    }
}

# Pattern 4: Lists and arrays
list_config = {
    "type": "CrawlerRunConfig",
    "params": {
        "js_code": [  # Lists are handled directly
            "window.scrollTo(0, document.body.scrollHeight);",
            "document.querySelector('.load-more')?.click();"
        ],
        "excluded_tags": ["script", "style", "nav"]
    }
}
```

### Troubleshooting Common JSON Errors

```python
def diagnose_json_errors():
    """Common JSON structure errors and fixes"""

    # ❌ WRONG: Missing type wrapper for objects
    wrong_config = {
        "browser_config": {
            "headless": True  # Missing type wrapper
        }
    }

    # ✅ CORRECT: Proper type wrapper
    correct_config = {
        "browser_config": {
            "type": "BrowserConfig",
            "params": {
                "headless": True
            }
        }
    }

    # ❌ WRONG: Dictionary without type: dict wrapper
    wrong_dict = {
        "schema": {
            "name": "Products"  # Raw dict, should be wrapped
        }
    }

    # ✅ CORRECT: Dictionary with proper wrapper
    correct_dict = {
        "schema": {
            "type": "dict",
            "value": {
                "name": "Products"
            }
        }
    }

    # ❌ WRONG: Invalid enum string
    wrong_enum = {
        "cache_mode": "DISABLED"  # Wrong case/value
    }

    # ✅ CORRECT: Valid enum string
    correct_enum = {
        "cache_mode": "bypass"  # or "enabled", "disabled", etc.
    }

    print("Common error patterns documented above")

# Validate your JSON structure before sending
def pre_flight_check(payload):
    """Run checks before sending to server"""
    required_keys = ["urls", "browser_config", "crawler_config"]

    for key in required_keys:
        if key not in payload:
            print(f"❌ Missing required key: {key}")
            return False

    # Check type wrappers
    for config_key in ["browser_config", "crawler_config"]:
        config = payload[config_key]
        if not isinstance(config, dict) or "type" not in config:
            print(f"❌ {config_key} missing type wrapper")
            return False
        if "params" not in config:
            print(f"❌ {config_key} missing params")
            return False

    print("✅ Pre-flight check passed")
    return True

# Example usage
payload = {
    "urls": ["https://example.com"],
    "browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
    "crawler_config": {"type": "CrawlerRunConfig", "params": {"cache_mode": "bypass"}}
}

if pre_flight_check(payload):
    # Safe to send to server
    pass
```

**📖 Learn more:** [Complete Docker Guide](https://docs.crawl4ai.com/core/docker-deployment/), [API Reference](https://docs.crawl4ai.com/api/), [MCP Integration](https://docs.crawl4ai.com/core/docker-deployment/#mcp-model-context-protocol-support), [Configuration Options](https://docs.crawl4ai.com/core/docker-deployment/#server-configuration)

788
docs/md_v2/assets/llm.txt/txt/extraction.txt
Normal file
@@ -0,0 +1,788 @@
## Extraction Strategies

Powerful data extraction from web pages using LLM-based intelligent parsing or fast schema/pattern-based approaches.

### LLM-Based Extraction - Intelligent Content Understanding

```python
import os
import asyncio
import json
from pydantic import BaseModel, Field
from typing import List
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, LLMConfig, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Define structured data model
class Product(BaseModel):
    name: str = Field(description="Product name")
    price: str = Field(description="Product price")
    description: str = Field(description="Product description")
    features: List[str] = Field(description="List of product features")
    rating: float = Field(description="Product rating out of 5")

# Configure LLM provider
llm_config = LLMConfig(
    provider="openai/gpt-4o-mini",  # or "ollama/llama3.3", "anthropic/claude-3-5-sonnet"
    api_token=os.getenv("OPENAI_API_KEY"),  # or "env:OPENAI_API_KEY"
    temperature=0.1,
    max_tokens=2000
)

# Create LLM extraction strategy
llm_strategy = LLMExtractionStrategy(
    llm_config=llm_config,
    schema=Product.model_json_schema(),
    extraction_type="schema",  # or "block" for freeform text
    instruction="""
    Extract product information from the webpage content.
    Focus on finding complete product details including:
    - Product name and price
    - Detailed description
    - All listed features
    - Customer rating if available
    Return valid JSON array of products.
    """,
    chunk_token_threshold=1200,  # Split content if too large
    overlap_rate=0.1,            # 10% overlap between chunks
    apply_chunking=True,         # Enable automatic chunking
    input_format="markdown",     # "html", "fit_markdown", or "markdown"
    extra_args={"temperature": 0.0, "max_tokens": 800},
    verbose=True
)

async def extract_with_llm():
    browser_config = BrowserConfig(headless=True)

    crawl_config = CrawlerRunConfig(
        extraction_strategy=llm_strategy,
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=10
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com/products",
            config=crawl_config
        )

        if result.success:
            # Parse extracted JSON
            products = json.loads(result.extracted_content)
            print(f"Extracted {len(products)} products")

            for product in products[:3]:  # Show first 3
                print(f"Product: {product['name']}")
                print(f"Price: {product['price']}")
                print(f"Rating: {product.get('rating', 'N/A')}")

            # Show token usage and cost
            llm_strategy.show_usage()
        else:
            print(f"Extraction failed: {result.error_message}")

asyncio.run(extract_with_llm())
```

### LLM Strategy Advanced Configuration

```python
# (continues from the previous example; uses LLMConfig, LLMExtractionStrategy,
#  BaseModel, and List imported above)

# Multiple provider configurations
providers = {
    "openai": LLMConfig(
        provider="openai/gpt-4o",
        api_token="env:OPENAI_API_KEY",
        temperature=0.1
    ),
    "anthropic": LLMConfig(
        provider="anthropic/claude-3-5-sonnet-20240620",
        api_token="env:ANTHROPIC_API_KEY",
        max_tokens=4000
    ),
    "ollama": LLMConfig(
        provider="ollama/llama3.3",
        api_token=None,  # Not needed for Ollama
        base_url="http://localhost:11434"
    ),
    "groq": LLMConfig(
        provider="groq/llama3-70b-8192",
        api_token="env:GROQ_API_KEY"
    )
}

# Advanced chunking for large content
large_content_strategy = LLMExtractionStrategy(
    llm_config=providers["openai"],
    schema=YourModel.model_json_schema(),  # YourModel: your own Pydantic model
    extraction_type="schema",
    instruction="Extract detailed information...",

    # Chunking parameters
    chunk_token_threshold=2000,  # Larger chunks for complex content
    overlap_rate=0.15,           # More overlap for context preservation
    apply_chunking=True,

    # Input format selection
    input_format="fit_markdown",  # Use filtered content if available

    # LLM parameters
    extra_args={
        "temperature": 0.0,  # Deterministic output
        "top_p": 0.9,
        "frequency_penalty": 0.1,
        "presence_penalty": 0.1,
        "max_tokens": 1500
    },
    verbose=True
)

# Knowledge graph extraction
class Entity(BaseModel):
    name: str
    type: str  # "person", "organization", "location", etc.
    description: str

class Relationship(BaseModel):
    source: str
    target: str
    relationship: str
    confidence: float

class KnowledgeGraph(BaseModel):
    entities: List[Entity]
    relationships: List[Relationship]
    summary: str

knowledge_strategy = LLMExtractionStrategy(
    llm_config=providers["anthropic"],
    schema=KnowledgeGraph.model_json_schema(),
    extraction_type="schema",
    instruction="""
    Create a knowledge graph from the content by:
    1. Identifying key entities (people, organizations, locations, concepts)
    2. Finding relationships between entities
    3. Providing confidence scores for relationships
    4. Summarizing the main topics
    """,
    input_format="html",  # Use HTML for better structure preservation
    apply_chunking=True,
    chunk_token_threshold=1500
)
```

### JSON CSS Extraction - Fast Schema-Based Extraction

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Basic CSS extraction schema
simple_schema = {
    "name": "Product Listings",
    "baseSelector": "div.product-card",
    "fields": [
        {
            "name": "title",
            "selector": "h2.product-title",
            "type": "text"
        },
        {
            "name": "price",
            "selector": ".price",
            "type": "text"
        },
        {
            "name": "image_url",
            "selector": "img.product-image",
            "type": "attribute",
            "attribute": "src"
        },
        {
            "name": "product_url",
            "selector": "a.product-link",
            "type": "attribute",
            "attribute": "href"
        }
    ]
}

# Complex nested schema with multiple data types
complex_schema = {
    "name": "E-commerce Product Catalog",
    "baseSelector": "div.category",
    "baseFields": [
        {
            "name": "category_id",
            "type": "attribute",
            "attribute": "data-category-id"
        },
        {
            "name": "category_url",
            "type": "attribute",
            "attribute": "data-url"
        }
    ],
    "fields": [
        {
            "name": "category_name",
            "selector": "h2.category-title",
            "type": "text"
        },
        {
            "name": "products",
            "selector": "div.product",
            "type": "nested_list",  # Array of complex objects
            "fields": [
                {
                    "name": "name",
                    "selector": "h3.product-name",
                    "type": "text",
                    "default": "Unknown Product"
                },
                {
                    "name": "price",
                    "selector": "span.price",
                    "type": "text"
                },
                {
                    "name": "details",
                    "selector": "div.product-details",
                    "type": "nested",  # Single complex object
                    "fields": [
                        {
                            "name": "brand",
                            "selector": "span.brand",
                            "type": "text"
                        },
                        {
                            "name": "model",
                            "selector": "span.model",
                            "type": "text"
                        },
                        {
                            "name": "specs",
                            "selector": "div.specifications",
                            "type": "html"  # Preserve HTML structure
                        }
                    ]
                },
                {
                    "name": "features",
                    "selector": "ul.features li",
                    "type": "list",  # Simple array of strings
                    "fields": [
                        {"name": "feature", "type": "text"}
                    ]
                },
                {
                    "name": "reviews",
                    "selector": "div.review",
                    "type": "nested_list",
                    "fields": [
                        {
                            "name": "reviewer",
                            "selector": "span.reviewer-name",
                            "type": "text"
                        },
                        {
                            "name": "rating",
                            "selector": "span.rating",
                            "type": "attribute",
                            "attribute": "data-rating"
                        },
                        {
                            "name": "comment",
                            "selector": "p.review-text",
                            "type": "text"
                        },
                        {
                            "name": "date",
                            "selector": "time.review-date",
                            "type": "attribute",
                            "attribute": "datetime"
                        }
                    ]
                }
            ]
        }
    ]
}

async def extract_with_css_schema():
    strategy = JsonCssExtractionStrategy(complex_schema, verbose=True)

    config = CrawlerRunConfig(
        extraction_strategy=strategy,
        cache_mode=CacheMode.BYPASS,
        # Enable dynamic content loading if needed
        js_code="window.scrollTo(0, document.body.scrollHeight);",
        wait_for="css:.product:nth-child(10)",  # Wait for products to load
        process_iframes=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/catalog",
            config=config
        )

        if result.success:
            data = json.loads(result.extracted_content)
            print(f"Extracted {len(data)} categories")

            for category in data:
                print(f"Category: {category['category_name']}")
                print(f"Products: {len(category.get('products', []))}")

                # Show first product details
                if category.get('products'):
                    product = category['products'][0]
                    print(f"  First product: {product.get('name')}")
                    print(f"  Features: {len(product.get('features', []))}")
                    print(f"  Reviews: {len(product.get('reviews', []))}")

asyncio.run(extract_with_css_schema())
```

### Automatic Schema Generation - One-Time LLM, Unlimited Use

```python
import json
import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def generate_and_use_schema():
    """
    1. Use LLM once to generate schema from sample HTML
    2. Cache the schema for reuse
    3. Use cached schema for fast extraction without LLM calls
    """

    cache_dir = Path("./schema_cache")
    cache_dir.mkdir(exist_ok=True)
    schema_file = cache_dir / "ecommerce_schema.json"

    # Step 1: Generate or load cached schema
    if schema_file.exists():
        schema = json.load(schema_file.open())
        print("Using cached schema")
    else:
        print("Generating schema using LLM...")

        # Configure LLM for schema generation
        llm_config = LLMConfig(
            provider="openai/gpt-4o",  # or "ollama/llama3.3" for local
            api_token="env:OPENAI_API_KEY"
        )

        # Get sample HTML from target site
        async with AsyncWebCrawler() as crawler:
            sample_result = await crawler.arun(
                url="https://example.com/products",
                config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
            )
            sample_html = sample_result.cleaned_html[:5000]  # Use first 5k chars

        # Generate schema automatically (ONE-TIME LLM COST)
        schema = JsonCssExtractionStrategy.generate_schema(
            html=sample_html,
            schema_type="css",
            llm_config=llm_config,
            instruction="Extract product information including name, price, description, and features"
        )

        # Cache schema for future use (NO MORE LLM CALLS)
        json.dump(schema, schema_file.open("w"), indent=2)
        print("Schema generated and cached")

    # Step 2: Use schema for fast extraction (NO LLM CALLS)
    strategy = JsonCssExtractionStrategy(schema, verbose=True)

    config = CrawlerRunConfig(
        extraction_strategy=strategy,
        cache_mode=CacheMode.BYPASS
    )

    # Step 3: Extract from multiple pages using same schema
    urls = [
        "https://example.com/products",
        "https://example.com/electronics",
        "https://example.com/books"
    ]

    async with AsyncWebCrawler() as crawler:
        for url in urls:
            result = await crawler.arun(url=url, config=config)

            if result.success:
                data = json.loads(result.extracted_content)
                print(f"{url}: Extracted {len(data)} items")
            else:
                print(f"{url}: Failed - {result.error_message}")

asyncio.run(generate_and_use_schema())
```

### XPath Extraction Strategy

```python
from crawl4ai import LLMConfig
from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy

# XPath-based schema (alternative to CSS)
xpath_schema = {
    "name": "News Articles",
    "baseSelector": "//article[@class='news-item']",
    "baseFields": [
        {
            "name": "article_id",
            "type": "attribute",
            "attribute": "data-id"
        }
    ],
    "fields": [
        {
            "name": "headline",
            "selector": ".//h2[@class='headline']",
            "type": "text"
        },
        {
            "name": "author",
            "selector": ".//span[@class='author']/text()",
            "type": "text"
        },
        {
            "name": "publish_date",
            "selector": ".//time/@datetime",
            "type": "text"
        },
        {
            "name": "content",
            "selector": ".//div[@class='article-body']",
            "type": "html"
        },
        {
            "name": "tags",
            "selector": ".//div[@class='tags']/span[@class='tag']",
            "type": "list",
            "fields": [
                {"name": "tag", "type": "text"}
            ]
        }
    ]
}

# Generate XPath schema automatically
async def generate_xpath_schema():
    llm_config = LLMConfig(provider="ollama/llama3.3", api_token=None)

    sample_html = """
    <article class="news-item" data-id="123">
        <h2 class="headline">Breaking News</h2>
        <span class="author">John Doe</span>
        <time datetime="2024-01-01">Today</time>
        <div class="article-body"><p>Content here...</p></div>
    </article>
    """

    schema = JsonXPathExtractionStrategy.generate_schema(
        html=sample_html,
        schema_type="xpath",
        llm_config=llm_config
    )

    return schema

# Use XPath strategy
xpath_strategy = JsonXPathExtractionStrategy(xpath_schema, verbose=True)
```

### Regex Extraction Strategy - Pattern-Based Fast Extraction

```python
import asyncio
import json
from pathlib import Path
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import RegexExtractionStrategy

# Built-in patterns for common data types
async def extract_with_builtin_patterns():
    # Use multiple built-in patterns
    strategy = RegexExtractionStrategy(
        pattern=(
            RegexExtractionStrategy.Email |
            RegexExtractionStrategy.PhoneUS |
            RegexExtractionStrategy.Url |
            RegexExtractionStrategy.Currency |
            RegexExtractionStrategy.DateIso
        )
    )

    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/contact",
            config=config
        )

        if result.success:
            matches = json.loads(result.extracted_content)

            # Group by pattern type
            by_type = {}
            for match in matches:
                label = match['label']
                if label not in by_type:
                    by_type[label] = []
                by_type[label].append(match['value'])

            for pattern_type, values in by_type.items():
                print(f"{pattern_type}: {len(values)} matches")
                for value in values[:3]:  # Show first 3
                    print(f"  {value}")

# Custom regex patterns
custom_patterns = {
    "product_code": r"SKU-\d{4,6}",
    "discount": r"\d{1,2}%\s*off",
    "model_number": r"Model:\s*([A-Z0-9-]+)"
}

async def extract_with_custom_patterns():
    strategy = RegexExtractionStrategy(custom=custom_patterns)

    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/products",
            config=config
        )

        if result.success:
            data = json.loads(result.extracted_content)
            for item in data:
                print(f"{item['label']}: {item['value']}")

# LLM-generated patterns (one-time cost)
async def generate_custom_patterns():
    cache_file = Path("./patterns/price_patterns.json")

    if cache_file.exists():
        patterns = json.load(cache_file.open())
    else:
        llm_config = LLMConfig(
            provider="openai/gpt-4o-mini",
            api_token="env:OPENAI_API_KEY"
        )

        # Get sample content
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun("https://example.com/pricing")
            sample_html = result.cleaned_html

        # Generate optimized patterns
        patterns = RegexExtractionStrategy.generate_pattern(
            label="pricing_info",
            html=sample_html,
            query="Extract all pricing information including discounts and special offers",
            llm_config=llm_config
        )

        # Cache for reuse
        cache_file.parent.mkdir(exist_ok=True)
        json.dump(patterns, cache_file.open("w"), indent=2)

    # Use cached patterns (no more LLM calls)
    strategy = RegexExtractionStrategy(custom=patterns)
    return strategy

asyncio.run(extract_with_builtin_patterns())
asyncio.run(extract_with_custom_patterns())
```

### Complete Extraction Workflow - Combining Strategies

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import (
    JsonCssExtractionStrategy,
    RegexExtractionStrategy,
    LLMExtractionStrategy
)

async def multi_strategy_extraction():
    """
    Demonstrate using multiple extraction strategies in sequence:
    1. Fast regex for common patterns
    2. Schema-based for structured data
    3. LLM for complex reasoning
    """

    browser_config = BrowserConfig(headless=True)

    # Strategy 1: Fast regex extraction
    regex_strategy = RegexExtractionStrategy(
        pattern=RegexExtractionStrategy.Email | RegexExtractionStrategy.PhoneUS
    )

    # Strategy 2: Schema-based structured extraction
    product_schema = {
        "name": "Products",
        "baseSelector": "div.product",
        "fields": [
            {"name": "name", "selector": "h3", "type": "text"},
            {"name": "price", "selector": ".price", "type": "text"},
            {"name": "rating", "selector": ".rating", "type": "attribute", "attribute": "data-rating"}
        ]
    }
    css_strategy = JsonCssExtractionStrategy(product_schema)

    # Strategy 3: LLM for complex analysis
    llm_strategy = LLMExtractionStrategy(
        llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token="env:OPENAI_API_KEY"),
        schema={
            "type": "object",
            "properties": {
                "sentiment": {"type": "string"},
                "key_topics": {"type": "array", "items": {"type": "string"}},
                "summary": {"type": "string"}
            }
        },
        extraction_type="schema",
        instruction="Analyze the content sentiment, extract key topics, and provide a summary"
    )

    url = "https://example.com/product-reviews"

    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Extract contact info with regex
        regex_config = CrawlerRunConfig(extraction_strategy=regex_strategy)
        regex_result = await crawler.arun(url=url, config=regex_config)

        # Extract structured product data
        css_config = CrawlerRunConfig(extraction_strategy=css_strategy)
        css_result = await crawler.arun(url=url, config=css_config)

        # Extract insights with LLM
        llm_run_config = CrawlerRunConfig(extraction_strategy=llm_strategy)
        llm_result = await crawler.arun(url=url, config=llm_run_config)

        # Combine results
        results = {
            "contacts": json.loads(regex_result.extracted_content) if regex_result.success else [],
            "products": json.loads(css_result.extracted_content) if css_result.success else [],
            "analysis": json.loads(llm_result.extracted_content) if llm_result.success else {}
        }

        print(f"Found {len(results['contacts'])} contact entries")
        print(f"Found {len(results['products'])} products")
        print(f"Sentiment: {results['analysis'].get('sentiment', 'N/A')}")

        return results

# Performance comparison
async def compare_extraction_performance():
    """Compare speed and accuracy of different strategies"""
    import time

    url = "https://example.com/large-catalog"

    strategies = {
        "regex": RegexExtractionStrategy(pattern=RegexExtractionStrategy.Currency),
        "css": JsonCssExtractionStrategy({
            "name": "Prices",
            "baseSelector": ".price",
            "fields": [{"name": "amount", "selector": "span", "type": "text"}]
        }),
        "llm": LLMExtractionStrategy(
            llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token="env:OPENAI_API_KEY"),
            instruction="Extract all prices from the content",
            extraction_type="block"
        )
    }

    async with AsyncWebCrawler() as crawler:
        for name, strategy in strategies.items():
            start_time = time.time()

            config = CrawlerRunConfig(extraction_strategy=strategy)
            result = await crawler.arun(url=url, config=config)

            duration = time.time() - start_time

            if result.success:
                data = json.loads(result.extracted_content)
                print(f"{name}: {len(data)} items in {duration:.2f}s")
            else:
                print(f"{name}: Failed in {duration:.2f}s")

asyncio.run(multi_strategy_extraction())
asyncio.run(compare_extraction_performance())
```

### Best Practices and Strategy Selection

```python
# Strategy selection guide
def choose_extraction_strategy(use_case):
    """
    Guide for selecting the right extraction strategy
    """

    strategies = {
        # Fast pattern matching for common data types
        "contact_info": RegexExtractionStrategy(
            pattern=RegexExtractionStrategy.Email | RegexExtractionStrategy.PhoneUS
        ),

        # Structured data from consistent HTML
        "product_catalogs": JsonCssExtractionStrategy,

        # Complex reasoning and semantic understanding
        "content_analysis": LLMExtractionStrategy,

        # Mixed approach for comprehensive extraction
        "complete_site_analysis": "multi_strategy"
    }

    recommendations = {
        "speed_priority": "Use RegexExtractionStrategy for simple patterns, JsonCssExtractionStrategy for structured data",
        "accuracy_priority": "Use LLMExtractionStrategy for complex content, JsonCssExtractionStrategy for predictable structure",
        "cost_priority": "Avoid LLM strategies, use schema generation once then JsonCssExtractionStrategy",
        "scale_priority": "Cache schemas, use regex for simple patterns, avoid LLM for high-volume extraction"
    }

    return recommendations.get(use_case, "Combine strategies based on content complexity")

# Error handling and validation
async def robust_extraction():
    strategies = [
        RegexExtractionStrategy(pattern=RegexExtractionStrategy.Email),
        JsonCssExtractionStrategy(simple_schema),  # schema defined in the CSS section above
        # LLM as fallback for complex cases
    ]

    async with AsyncWebCrawler() as crawler:
        for strategy in strategies:
            try:
                config = CrawlerRunConfig(extraction_strategy=strategy)
                result = await crawler.arun(url="https://example.com", config=config)

                if result.success and result.extracted_content:
                    data = json.loads(result.extracted_content)
                    if data:  # Validate non-empty results
                        print(f"Success with {strategy.__class__.__name__}")
                        return data

            except Exception as e:
                print(f"Strategy {strategy.__class__.__name__} failed: {e}")
                continue

    print("All strategies failed")
    return None
```

**📖 Learn more:** [LLM Strategies Deep Dive](https://docs.crawl4ai.com/extraction/llm-strategies/), [Schema-Based Extraction](https://docs.crawl4ai.com/extraction/no-llm-strategies/), [Regex Patterns](https://docs.crawl4ai.com/extraction/no-llm-strategies/#regexextractionstrategy), [Performance Optimization](https://docs.crawl4ai.com/advanced/multi-url-crawling/)

388
docs/md_v2/assets/llm.txt/txt/http_based_crawler_strategy.txt
Normal file
@@ -0,0 +1,388 @@
|
||||
## HTTP Crawler Strategy
|
||||
|
||||
Fast, lightweight HTTP-only crawling without browser overhead for cases where JavaScript execution isn't needed.
|
||||
|
||||
### Basic HTTP Crawler Setup
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, HTTPCrawlerConfig, CacheMode
|
||||
from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy
|
||||
from crawl4ai.async_logger import AsyncLogger
|
||||
|
||||
async def main():
|
||||
# Initialize HTTP strategy
|
||||
http_strategy = AsyncHTTPCrawlerStrategy(
|
||||
browser_config=HTTPCrawlerConfig(
|
||||
method="GET",
|
||||
verify_ssl=True,
|
||||
follow_redirects=True
|
||||
),
|
||||
logger=AsyncLogger(verbose=True)
|
||||
)
|
||||
|
||||
# Use with AsyncWebCrawler
|
||||
async with AsyncWebCrawler(crawler_strategy=http_strategy) as crawler:
|
||||
result = await crawler.arun("https://example.com")
|
||||
print(f"Status: {result.status_code}")
|
||||
print(f"Content: {len(result.html)} chars")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
### HTTP Request Types
|
||||
|
||||
```python
|
||||
# GET request (default)
|
||||
http_config = HTTPCrawlerConfig(
|
||||
method="GET",
|
||||
headers={"Accept": "application/json"}
|
||||
)
|
||||
|
||||
# POST with JSON data
|
||||
http_config = HTTPCrawlerConfig(
|
||||
method="POST",
|
||||
json={"key": "value", "data": [1, 2, 3]},
|
||||
headers={"Content-Type": "application/json"}
|
||||
)
|
||||
|
||||
# POST with form data
|
||||
http_config = HTTPCrawlerConfig(
|
||||
method="POST",
|
||||
data={"username": "user", "password": "pass"},
|
||||
headers={"Content-Type": "application/x-www-form-urlencoded"}
|
||||
)
|
||||
|
||||
# Advanced configuration
|
||||
http_config = HTTPCrawlerConfig(
|
||||
method="GET",
|
||||
headers={"User-Agent": "Custom Bot/1.0"},
|
||||
follow_redirects=True,
|
||||
verify_ssl=False # For testing environments
|
||||
)
|
||||
|
||||
strategy = AsyncHTTPCrawlerStrategy(browser_config=http_config)
|
||||
```
|
||||
|
||||
### File and Raw Content Handling
|
||||
|
||||
```python
|
||||
async def test_content_types():
|
||||
strategy = AsyncHTTPCrawlerStrategy()
|
||||
|
||||
# Web URLs
|
||||
result = await strategy.crawl("https://httpbin.org/get")
|
||||
print(f"Web content: {result.status_code}")
|
||||
|
||||
# Local files
|
||||
result = await strategy.crawl("file:///path/to/local/file.html")
|
||||
print(f"File content: {len(result.html)}")
|
||||
|
||||
# Raw HTML content
|
||||
raw_html = "raw://<html><body><h1>Test</h1><p>Content</p></body></html>"
|
||||
result = await strategy.crawl(raw_html)
|
||||
print(f"Raw content: {result.html}")
|
||||
|
||||
# Raw content with complex HTML
|
||||
complex_html = """raw://<!DOCTYPE html>
|
||||
<html>
|
||||
<head><title>Test Page</title></head>
|
||||
<body>
|
||||
<div class="content">
|
||||
<h1>Main Title</h1>
|
||||
<p>Paragraph content</p>
|
||||
<ul><li>Item 1</li><li>Item 2</li></ul>
|
||||
</div>
|
||||
</body>
|
||||
</html>"""
|
||||
result = await strategy.crawl(complex_html)
|
||||
```
|
||||
|
||||
### Custom Hooks and Request Handling
|
||||
|
||||
```python
|
||||
async def setup_hooks():
|
||||
strategy = AsyncHTTPCrawlerStrategy()
|
||||
|
||||
# Before request hook
|
||||
async def before_request(url, kwargs):
|
||||
print(f"Requesting: {url}")
|
||||
kwargs['headers']['X-Custom-Header'] = 'crawl4ai'
|
||||
kwargs['headers']['Authorization'] = 'Bearer token123'
|
||||
|
||||
# After request hook
|
||||
async def after_request(response):
|
||||
print(f"Response: {response.status_code}")
|
||||
if hasattr(response, 'redirected_url'):
|
||||
print(f"Redirected to: {response.redirected_url}")
|
||||
|
||||
# Error handling hook
|
||||
async def on_error(error):
|
||||
print(f"Request failed: {error}")
|
||||
|
||||
# Set hooks
|
||||
strategy.set_hook('before_request', before_request)
|
||||
strategy.set_hook('after_request', after_request)
|
||||
strategy.set_hook('on_error', on_error)
|
||||
|
||||
# Use with hooks
|
||||
result = await strategy.crawl("https://httpbin.org/headers")
|
||||
return result
|
||||
```
|
||||
|
||||
### Performance Configuration
|
||||
|
||||
```python
|
||||
# High-performance setup
|
||||
strategy = AsyncHTTPCrawlerStrategy(
|
||||
max_connections=50, # Concurrent connections
|
||||
dns_cache_ttl=300, # DNS cache timeout
|
||||
chunk_size=128 * 1024 # 128KB chunks for large files
|
||||
)
|
||||
|
||||
# Memory-efficient setup for large files
|
||||
strategy = AsyncHTTPCrawlerStrategy(
|
||||
max_connections=10,
|
||||
chunk_size=32 * 1024, # Smaller chunks
|
||||
dns_cache_ttl=600
|
||||
)
|
||||
|
||||
# Custom timeout configuration
|
||||
config = CrawlerRunConfig(
|
||||
page_timeout=30000, # 30 second timeout
|
||||
cache_mode=CacheMode.BYPASS
|
||||
)
|
||||
|
||||
result = await strategy.crawl("https://slow-server.com", config=config)
|
||||
```
|
||||
|
||||
### Error Handling and Retries
|
||||
|
||||
```python
|
||||
from crawl4ai.async_crawler_strategy import (
|
||||
ConnectionTimeoutError,
|
||||
HTTPStatusError,
|
||||
HTTPCrawlerError
|
||||
)
|
||||
|
||||
async def robust_crawling():
|
||||
strategy = AsyncHTTPCrawlerStrategy()
|
||||
|
||||
urls = [
|
||||
"https://example.com",
|
||||
"https://httpbin.org/status/404",
|
||||
"https://nonexistent.domain.test"
|
||||
]
|
||||
|
||||
for url in urls:
|
||||
try:
|
||||
result = await strategy.crawl(url)
|
||||
print(f"✓ {url}: {result.status_code}")
|
||||
|
||||
except HTTPStatusError as e:
|
||||
print(f"✗ {url}: HTTP {e.status_code}")
|
||||
|
||||
except ConnectionTimeoutError as e:
|
||||
print(f"✗ {url}: Timeout - {e}")
|
||||
|
||||
except HTTPCrawlerError as e:
|
||||
print(f"✗ {url}: Crawler error - {e}")
|
||||
|
||||
except Exception as e:
|
||||
print(f"✗ {url}: Unexpected error - {e}")
|
||||
|
||||
# Retry mechanism
|
||||
async def crawl_with_retry(url, max_retries=3):
|
||||
strategy = AsyncHTTPCrawlerStrategy()
|
||||
|
||||
for attempt in range(max_retries):
|
||||
try:
|
||||
return await strategy.crawl(url)
|
||||
except (ConnectionTimeoutError, HTTPCrawlerError) as e:
|
||||
if attempt == max_retries - 1:
|
||||
raise
|
||||
print(f"Retry {attempt + 1}/{max_retries}: {e}")
|
||||
await asyncio.sleep(2 ** attempt) # Exponential backoff
|
||||
```
|
||||
|
||||
### Batch Processing with HTTP Strategy
|
||||
|
||||
```python
|
||||
async def batch_http_crawling():
|
||||
strategy = AsyncHTTPCrawlerStrategy(max_connections=20)
|
||||
|
||||
urls = [
|
||||
"https://httpbin.org/get",
|
||||
"https://httpbin.org/user-agent",
|
||||
"https://httpbin.org/headers",
|
||||
"https://example.com",
|
||||
"https://httpbin.org/json"
|
||||
]
|
||||
|
||||
# Sequential processing
|
||||
results = []
|
||||
async with strategy:
|
||||
for url in urls:
|
||||
try:
|
||||
result = await strategy.crawl(url)
|
||||
results.append((url, result.status_code, len(result.html)))
|
||||
except Exception as e:
|
||||
results.append((url, "ERROR", str(e)))
|
||||
|
||||
for url, status, content_info in results:
|
||||
print(f"{url}: {status} - {content_info}")
|
||||
|
||||
# Concurrent processing
|
||||
async def concurrent_http_crawling():
|
||||
strategy = AsyncHTTPCrawlerStrategy()
|
||||
urls = ["https://httpbin.org/delay/1"] * 5
|
||||
|
||||
async def crawl_single(url):
|
||||
try:
|
||||
result = await strategy.crawl(url)
|
||||
return f"✓ {result.status_code}"
|
||||
except Exception as e:
|
||||
return f"✗ {e}"
|
||||
|
||||
async with strategy:
|
||||
tasks = [crawl_single(url) for url in urls]
|
||||
results = await asyncio.gather(*tasks, return_exceptions=True)
|
||||
|
||||
for i, result in enumerate(results):
|
||||
print(f"URL {i+1}: {result}")
|
||||
```

### Integration with Content Processing

```python
from crawl4ai import DefaultMarkdownGenerator, PruningContentFilter

async def http_with_processing():
    # HTTP strategy with content processing
    http_strategy = AsyncHTTPCrawlerStrategy(
        browser_config=HTTPCrawlerConfig(verify_ssl=True)
    )

    # Configure markdown generation
    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=PruningContentFilter(
                threshold=0.48,
                threshold_type="fixed",
                min_word_threshold=10
            )
        ),
        word_count_threshold=5,
        excluded_tags=['script', 'style', 'nav'],
        exclude_external_links=True
    )

    async with AsyncWebCrawler(crawler_strategy=http_strategy) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=crawler_config
        )

        print(f"Status: {result.status_code}")
        print(f"Raw HTML: {len(result.html)} chars")
        if result.markdown:
            print(f"Markdown: {len(result.markdown.raw_markdown)} chars")
            if result.markdown.fit_markdown:
                print(f"Filtered: {len(result.markdown.fit_markdown)} chars")
```

### HTTP vs Browser Strategy Comparison

```python
import time

async def strategy_comparison():
    # Same URL with different strategies
    url = "https://example.com"

    # HTTP Strategy (fast, no JS)
    http_strategy = AsyncHTTPCrawlerStrategy()
    start_time = time.time()
    http_result = await http_strategy.crawl(url)
    http_time = time.time() - start_time

    # Browser Strategy (full features)
    from crawl4ai import BrowserConfig
    browser_config = BrowserConfig(headless=True)
    start_time = time.time()
    async with AsyncWebCrawler(config=browser_config) as crawler:
        browser_result = await crawler.arun(url)
    browser_time = time.time() - start_time

    print(f"HTTP Strategy:")
    print(f"  Time: {http_time:.2f}s")
    print(f"  Content: {len(http_result.html)} chars")
    print(f"  Features: Fast, lightweight, no JS")

    print(f"Browser Strategy:")
    print(f"  Time: {browser_time:.2f}s")
    print(f"  Content: {len(browser_result.html)} chars")
    print(f"  Features: Full browser, JS, screenshots, etc.")

# When to use HTTP strategy:
# - Static content sites
# - APIs returning HTML
# - Fast bulk processing
# - No JavaScript required
# - Memory/resource constraints

# When to use Browser strategy:
# - Dynamic content (SPA, AJAX)
# - JavaScript-heavy sites
# - Screenshots/PDFs needed
# - Complex interactions required
```
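
The criteria above can be folded into a small dispatch helper. A minimal sketch, assuming you only need a coarse decision; the `pick_strategy` name and its flags are illustrative, not a Crawl4AI API:

```python
# Hypothetical helper applying the criteria above.
# The function name and flags are illustrative, not part of the library.
def pick_strategy(needs_js=False, needs_screenshot=False, bulk=False):
    """Return 'browser' when page rendering is required, else 'http'."""
    if needs_js or needs_screenshot:
        return "browser"   # full rendering, screenshots, interactions
    return "http"          # lightweight default, best for bulk static pages

print(pick_strategy(needs_js=True))  # -> "browser"
```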

### Advanced Configuration

```python
# Custom session configuration
import aiohttp

async def advanced_http_setup():
    # Custom connector with specific settings
    # (shown for reference; the strategy below configures its pool
    # via its own parameters and does not take this connector)
    connector = aiohttp.TCPConnector(
        limit=100,              # Connection pool size
        ttl_dns_cache=600,      # DNS cache TTL
        use_dns_cache=True,     # Enable DNS caching
        keepalive_timeout=30,   # Keep-alive timeout
        force_close=False       # Reuse connections
    )

    strategy = AsyncHTTPCrawlerStrategy(
        max_connections=50,
        dns_cache_ttl=600,
        chunk_size=64 * 1024
    )

    # Custom headers for all requests
    http_config = HTTPCrawlerConfig(
        headers={
            "User-Agent": "Crawl4AI-HTTP/1.0",
            "Accept": "text/html,application/xhtml+xml",
            "Accept-Language": "en-US,en;q=0.9",
            "Accept-Encoding": "gzip, deflate, br",
            "DNT": "1"
        },
        verify_ssl=True,
        follow_redirects=True
    )

    strategy.browser_config = http_config

    # Use with custom timeout
    config = CrawlerRunConfig(
        page_timeout=45000,  # 45 seconds
        cache_mode=CacheMode.ENABLED
    )

    result = await strategy.crawl("https://example.com", config=config)
    await strategy.close()
```

**📖 Learn more:** [AsyncWebCrawler API](https://docs.crawl4ai.com/api/async-webcrawler/), [Browser vs HTTP Strategy](https://docs.crawl4ai.com/core/browser-crawler-config/), [Performance Optimization](https://docs.crawl4ai.com/advanced/multi-url-crawling/)

231  docs/md_v2/assets/llm.txt/txt/installation.txt  Normal file
@@ -0,0 +1,231 @@
## Installation

Multiple installation options for different environments and use cases.

### Basic Installation

```bash
# Install core library
pip install crawl4ai

# Initial setup (installs Playwright browsers)
crawl4ai-setup

# Verify installation
crawl4ai-doctor
```

### Quick Verification

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:300])

if __name__ == "__main__":
    asyncio.run(main())
```

**📖 Learn more:** [Basic Usage Guide](https://docs.crawl4ai.com/core/quickstart.md)

### Advanced Features (Optional)

```bash
# PyTorch-based features (text clustering, semantic chunking)
pip install crawl4ai[torch]
crawl4ai-setup

# Transformers (Hugging Face models)
pip install crawl4ai[transformer]
crawl4ai-setup

# All features (large download)
pip install crawl4ai[all]
crawl4ai-setup

# Pre-download models (optional)
crawl4ai-download-models
```

**📖 Learn more:** [Advanced Features Documentation](https://docs.crawl4ai.com/extraction/llm-strategies.md)

### Docker Deployment

```bash
# Pull pre-built image (specify platform for consistency)
docker pull --platform linux/amd64 unclecode/crawl4ai:latest
# For ARM (M1/M2 Macs): docker pull --platform linux/arm64 unclecode/crawl4ai:latest

# Set up environment for LLM support
cat > .llm.env << EOL
OPENAI_API_KEY=sk-your-key
ANTHROPIC_API_KEY=your-anthropic-key
EOL

# Run with LLM support (specify platform)
docker run -d \
  --platform linux/amd64 \
  -p 11235:11235 \
  --name crawl4ai \
  --env-file .llm.env \
  --shm-size=1g \
  unclecode/crawl4ai:latest

# For ARM Macs, use: --platform linux/arm64

# Basic run (no LLM)
docker run -d \
  --platform linux/amd64 \
  -p 11235:11235 \
  --name crawl4ai \
  --shm-size=1g \
  unclecode/crawl4ai:latest
```

**📖 Learn more:** [Complete Docker Guide](https://docs.crawl4ai.com/core/docker-deployment.md)

### Docker Compose

```bash
# Clone repository
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai

# Copy environment template
cp deploy/docker/.llm.env.example .llm.env
# Edit .llm.env with your API keys

# Run pre-built image
IMAGE=unclecode/crawl4ai:latest docker compose up -d

# Build and run locally
docker compose up --build -d

# Build with all features
INSTALL_TYPE=all docker compose up --build -d

# Stop service
docker compose down
```

**📖 Learn more:** [Docker Compose Configuration](https://docs.crawl4ai.com/core/docker-deployment.md#option-2-using-docker-compose)

### Manual Docker Build

```bash
# Build multi-architecture image (specify platform)
docker buildx build --platform linux/amd64 -t crawl4ai-local:latest --load .
# For ARM: docker buildx build --platform linux/arm64 -t crawl4ai-local:latest --load .

# Build with specific features
docker buildx build \
  --platform linux/amd64 \
  --build-arg INSTALL_TYPE=all \
  --build-arg ENABLE_GPU=false \
  -t crawl4ai-local:latest --load .

# Run custom build (specify platform)
docker run -d \
  --platform linux/amd64 \
  -p 11235:11235 \
  --name crawl4ai-custom \
  --env-file .llm.env \
  --shm-size=1g \
  crawl4ai-local:latest
```

**📖 Learn more:** [Manual Build Guide](https://docs.crawl4ai.com/core/docker-deployment.md#option-3-manual-local-build--run)

### Google Colab

```python
# Install in Colab
!pip install crawl4ai
!crawl4ai-setup

# If setup fails, manually install Playwright browsers
!playwright install chromium

# Install with all features (may take 5-10 minutes)
!pip install crawl4ai[all]
!crawl4ai-setup
!crawl4ai-download-models

# If still having issues, force Playwright install
!playwright install chromium --force

# Quick test
import asyncio
from crawl4ai import AsyncWebCrawler

async def test_crawl():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print("✅ Installation successful!")
        print(f"Content length: {len(result.markdown)}")

# Run the test in Colab (notebooks support top-level await)
await test_crawl()
```

**📖 Learn more:** [Colab Examples Notebook](https://colab.research.google.com/github/unclecode/crawl4ai/blob/main/docs/examples/quickstart.ipynb)

### Docker API Usage

```python
# Using the Docker SDK
import asyncio
from crawl4ai.docker_client import Crawl4aiDockerClient
from crawl4ai import BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    async with Crawl4aiDockerClient(base_url="http://localhost:11235") as client:
        results = await client.crawl(
            ["https://example.com"],
            browser_config=BrowserConfig(headless=True),
            crawler_config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        )
        for result in results:
            print(f"Success: {result.success}, Length: {len(result.markdown)}")

asyncio.run(main())
```

**📖 Learn more:** [Docker Client API](https://docs.crawl4ai.com/core/docker-deployment.md#python-sdk)

### Direct API Calls

```python
# REST API example
import requests

payload = {
    "urls": ["https://example.com"],
    "browser_config": {"type": "BrowserConfig", "params": {"headless": True}},
    "crawler_config": {"type": "CrawlerRunConfig", "params": {"cache_mode": "bypass"}}
}

response = requests.post("http://localhost:11235/crawl", json=payload)
print(response.json())
```

**📖 Learn more:** [REST API Reference](https://docs.crawl4ai.com/core/docker-deployment.md#rest-api-examples)

### Health Check

```bash
# Check Docker service
curl http://localhost:11235/health

# Access playground
open http://localhost:11235/playground

# View metrics
curl http://localhost:11235/metrics
```
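
For scripted monitoring, the same endpoint can be polled from Python. A minimal sketch, assuming the service runs on the default port and `/health` answers with JSON:

```python
# Minimal health poller for the Crawl4AI Docker service.
# Assumes the default port (11235) and a JSON /health response.
import time
import requests

def wait_until_healthy(base_url="http://localhost:11235", timeout=60):
    """Poll /health until the service responds or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            resp = requests.get(f"{base_url}/health", timeout=5)
            if resp.ok:
                return resp.json()
        except requests.RequestException:
            pass  # Service not up yet; keep polling
        time.sleep(2)
    raise TimeoutError(f"{base_url} not healthy after {timeout}s")

if __name__ == "__main__":
    print(wait_until_healthy())
```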

**📖 Learn more:** [Monitoring & Metrics](https://docs.crawl4ai.com/core/docker-deployment.md#metrics--monitoring)

5929  docs/md_v2/assets/llm.txt/txt/llms-full.txt  Normal file
File diff suppressed because it is too large. Load Diff

339  docs/md_v2/assets/llm.txt/txt/multi_urls_crawling.txt  Normal file
@@ -0,0 +1,339 @@
## Multi-URL Crawling

Concurrent crawling of multiple URLs with intelligent resource management, rate limiting, and real-time monitoring.

### Basic Multi-URL Crawling

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

# Batch processing (default) - get all results at once
async def batch_crawl():
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3"
    ]

    config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        stream=False  # Default: batch mode
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls, config=config)

        for result in results:
            if result.success:
                print(f"✅ {result.url}: {len(result.markdown)} chars")
            else:
                print(f"❌ {result.url}: {result.error_message}")

# Streaming processing - handle results as they complete
async def streaming_crawl():
    config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        stream=True  # Enable streaming
    )

    async with AsyncWebCrawler() as crawler:
        # Process results as they become available
        # (urls: the same list as in batch_crawl above)
        async for result in await crawler.arun_many(urls, config=config):
            if result.success:
                print(f"🔥 Just completed: {result.url}")
                await process_result_immediately(result)
            else:
                print(f"❌ Failed: {result.url}")
```
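
The streaming example assumes a `process_result_immediately` callback. A minimal sketch of what such a handler might look like (hypothetical; adapt the storage to your pipeline):

```python
# Hypothetical handler for streamed results; writes markdown to disk.
from pathlib import Path
from urllib.parse import urlparse

async def process_result_immediately(result):
    """Persist one crawl result as soon as it arrives."""
    name = urlparse(result.url).path.strip("/").replace("/", "_") or "index"
    Path("output").mkdir(exist_ok=True)
    # str(result.markdown) assumed to yield the raw markdown text
    Path(f"output/{name}.md").write_text(str(result.markdown), encoding="utf-8")
```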

### Memory-Adaptive Dispatching

```python
from crawl4ai import AsyncWebCrawler, MemoryAdaptiveDispatcher, CrawlerMonitor, DisplayMode

# Automatically manages concurrency based on system memory
async def memory_adaptive_crawl():
    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=80.0,  # Pause if memory exceeds 80%
        check_interval=1.0,             # Check memory every second
        max_session_permit=15,          # Max concurrent tasks
        memory_wait_timeout=300.0       # Wait up to 5 minutes for memory
    )

    config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=50
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=large_url_list,
            config=config,
            dispatcher=dispatcher
        )

        # Each result includes dispatch information
        for result in results:
            if result.dispatch_result:
                dr = result.dispatch_result
                print(f"Memory used: {dr.memory_usage:.1f}MB")
                print(f"Duration: {dr.end_time - dr.start_time}")
```

### Rate-Limited Crawling

```python
from crawl4ai import RateLimiter, SemaphoreDispatcher

# Control request pacing and handle server rate limits
async def rate_limited_crawl():
    rate_limiter = RateLimiter(
        base_delay=(1.0, 3.0),       # Random delay 1-3 seconds
        max_delay=60.0,              # Cap backoff at 60 seconds
        max_retries=3,               # Retry failed requests 3 times
        rate_limit_codes=[429, 503]  # Handle these status codes
    )

    dispatcher = SemaphoreDispatcher(
        max_session_permit=5,  # Fixed concurrency limit
        rate_limiter=rate_limiter
    )

    config = CrawlerRunConfig(
        user_agent_mode="random",  # Randomize user agents
        simulate_user=True         # Simulate human behavior
    )

    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun_many(
            urls=urls,
            config=config,
            dispatcher=dispatcher
        ):
            print(f"Processed: {result.url}")
```

### Real-Time Monitoring

```python
from crawl4ai import CrawlerMonitor, DisplayMode

# Monitor crawling progress in real-time
async def monitored_crawl():
    monitor = CrawlerMonitor(
        max_visible_rows=20,               # Show 20 tasks in display
        display_mode=DisplayMode.DETAILED  # Show individual task details
    )

    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=75.0,
        max_session_permit=10,
        monitor=monitor  # Attach monitor to dispatcher
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=urls,
            dispatcher=dispatcher
        )
```

### Advanced Dispatcher Configurations

```python
# Memory-adaptive with comprehensive monitoring
memory_dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=85.0,  # Higher memory tolerance
    check_interval=0.5,             # Check memory more frequently
    max_session_permit=20,          # More concurrent tasks
    memory_wait_timeout=600.0,      # Wait longer for memory
    rate_limiter=RateLimiter(
        base_delay=(0.5, 1.5),
        max_delay=30.0,
        max_retries=5
    ),
    monitor=CrawlerMonitor(
        max_visible_rows=15,
        display_mode=DisplayMode.AGGREGATED  # Summary view
    )
)

# Simple semaphore-based dispatcher
semaphore_dispatcher = SemaphoreDispatcher(
    max_session_permit=8,  # Fixed concurrency
    rate_limiter=RateLimiter(
        base_delay=(1.0, 2.0),
        max_delay=20.0
    )
)

# Usage with custom dispatcher
async with AsyncWebCrawler() as crawler:
    results = await crawler.arun_many(
        urls=urls,
        config=config,
        dispatcher=memory_dispatcher  # or semaphore_dispatcher
    )
```

### Handling Large-Scale Crawling

```python
async def large_scale_crawl():
    # For thousands of URLs
    # (load_urls_from_file, save_result_to_database, and log_failure
    # are app-specific helpers - sketched after this example)
    urls = load_urls_from_file("large_url_list.txt")  # 10,000+ URLs

    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=70.0,  # Conservative memory usage
        max_session_permit=25,          # Higher concurrency
        rate_limiter=RateLimiter(
            base_delay=(0.1, 0.5),      # Faster for large batches
            max_retries=2               # Fewer retries for speed
        ),
        monitor=CrawlerMonitor(display_mode=DisplayMode.AGGREGATED)
    )

    config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,  # Use caching for efficiency
        stream=True,                   # Stream for memory efficiency
        word_count_threshold=100,      # Skip short content
        exclude_external_links=True    # Reduce processing overhead
    )

    successful_crawls = 0
    failed_crawls = 0

    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun_many(
            urls=urls,
            config=config,
            dispatcher=dispatcher
        ):
            if result.success:
                successful_crawls += 1
                await save_result_to_database(result)
            else:
                failed_crawls += 1
                await log_failure(result.url, result.error_message)

            # Progress reporting
            if (successful_crawls + failed_crawls) % 100 == 0:
                print(f"Progress: {successful_crawls + failed_crawls}/{len(urls)}")

    print(f"Completed: {successful_crawls} successful, {failed_crawls} failed")
```
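
The example above leaves three helpers to the application. A minimal sketch of what they might look like (the names come from the example; the file-based storage is a stand-in, swap in your own database and logging):

```python
# Hypothetical helpers assumed by large_scale_crawl above.
import json
from pathlib import Path

def load_urls_from_file(path):
    """Read one URL per line, skipping blanks and comments."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [ln.strip() for ln in lines if ln.strip() and not ln.startswith("#")]

async def save_result_to_database(result):
    """Stand-in for a real DB write: append JSON lines to a file."""
    record = {"url": result.url, "markdown": str(result.markdown)}
    with open("results.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

async def log_failure(url, error_message):
    """Append a failed URL and its error to a plain-text log."""
    with open("failures.log", "a", encoding="utf-8") as f:
        f.write(f"{url}\t{error_message}\n")
```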

### Robots.txt Compliance

```python
async def compliant_crawl():
    config = CrawlerRunConfig(
        check_robots_txt=True,   # Respect robots.txt
        user_agent="MyBot/1.0",  # Identify your bot
        mean_delay=2.0,          # Be polite with delays
        max_range=1.0
    )

    dispatcher = SemaphoreDispatcher(
        max_session_permit=3,  # Conservative concurrency
        rate_limiter=RateLimiter(
            base_delay=(2.0, 5.0),  # Slower, more respectful
            max_retries=1
        )
    )

    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun_many(
            urls=urls,
            config=config,
            dispatcher=dispatcher
        ):
            if result.success:
                print(f"✅ Crawled: {result.url}")
            elif "robots.txt" in result.error_message:
                print(f"🚫 Blocked by robots.txt: {result.url}")
            else:
                print(f"❌ Error: {result.url}")
```

### Performance Analysis

```python
import time

async def analyze_crawl_performance():
    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=80.0,
        max_session_permit=12,
        monitor=CrawlerMonitor(display_mode=DisplayMode.DETAILED)
    )

    start_time = time.time()

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=urls,
            dispatcher=dispatcher
        )

    end_time = time.time()

    # Analyze results
    successful = [r for r in results if r.success]
    failed = [r for r in results if not r.success]

    print(f"Total time: {end_time - start_time:.2f}s")
    print(f"Success rate: {len(successful)}/{len(results)} ({len(successful)/len(results)*100:.1f}%)")
    print(f"Avg time per URL: {(end_time - start_time)/len(results):.2f}s")

    # Memory usage analysis
    if successful and successful[0].dispatch_result:
        memory_usage = [r.dispatch_result.memory_usage for r in successful if r.dispatch_result]
        peak_memory = [r.dispatch_result.peak_memory for r in successful if r.dispatch_result]

        print(f"Avg memory usage: {sum(memory_usage)/len(memory_usage):.1f}MB")
        print(f"Peak memory usage: {max(peak_memory):.1f}MB")
```

### Error Handling and Recovery

```python
async def robust_multi_crawl():
    failed_urls = []

    config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        stream=True,
        page_timeout=30000  # 30 second timeout
    )

    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=85.0,
        max_session_permit=10
    )

    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun_many(
            urls=urls,
            config=config,
            dispatcher=dispatcher
        ):
            if result.success:
                await process_successful_result(result)
            else:
                failed_urls.append({
                    'url': result.url,
                    'error': result.error_message,
                    'status_code': result.status_code
                })

                # Retry logic for specific errors (schedule_retry sketched below)
                if result.status_code in [503, 429]:  # Server errors
                    await schedule_retry(result.url)

    # Report failures
    if failed_urls:
        print(f"Failed to crawl {len(failed_urls)} URLs:")
        for failure in failed_urls[:10]:  # Show first 10
            print(f"  {failure['url']}: {failure['error']}")
```
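
`process_successful_result` and `schedule_retry` are left to the application in the example above. One minimal way to implement the retry side (hypothetical, using a simple in-memory queue):

```python
# Hypothetical retry queue for URLs that hit 429/503.
import asyncio

retry_queue: asyncio.Queue = asyncio.Queue()

async def schedule_retry(url):
    """Queue a URL for a later attempt."""
    await retry_queue.put(url)

async def drain_retries(crawler, config, delay=30.0):
    """Re-crawl queued URLs after a cool-down period."""
    while not retry_queue.empty():
        url = await retry_queue.get()
        await asyncio.sleep(delay)  # back off before retrying
        result = await crawler.arun(url, config=config)
        print(f"Retry {'✅' if result.success else '❌'}: {url}")
```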

**📖 Learn more:** [Advanced Multi-URL Crawling](https://docs.crawl4ai.com/advanced/multi-url-crawling/), [Crawl Dispatcher](https://docs.crawl4ai.com/advanced/crawl-dispatcher/), [arun_many() API Reference](https://docs.crawl4ai.com/api/arun_many/)

365  docs/md_v2/assets/llm.txt/txt/simple_crawling.txt  Normal file
@@ -0,0 +1,365 @@
## Simple Crawling

Basic web crawling operations with AsyncWebCrawler, configurations, and response handling.

### Basic Setup

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_config = BrowserConfig()  # Default browser settings
    run_config = CrawlerRunConfig()   # Default crawl settings

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```

### Understanding CrawlResult

```python
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter

config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.6),
        options={"ignore_links": True}
    )
)

result = await crawler.arun("https://example.com", config=config)

# Different content formats
print(result.html)                   # Raw HTML
print(result.cleaned_html)           # Cleaned HTML
print(result.markdown.raw_markdown)  # Raw markdown
print(result.markdown.fit_markdown)  # Filtered markdown

# Status information
print(result.success)      # True/False
print(result.status_code)  # HTTP status (200, 404, etc.)

# Extracted content
print(result.media)  # Images, videos, audio
print(result.links)  # Internal/external links
```

### Basic Configuration Options

```python
run_config = CrawlerRunConfig(
    word_count_threshold=10,          # Min words per block
    exclude_external_links=True,      # Remove external links
    remove_overlay_elements=True,     # Remove popups/modals
    process_iframes=True,             # Process iframe content
    excluded_tags=['form', 'header']  # Skip these tags
)

result = await crawler.arun("https://example.com", config=run_config)
```

### Error Handling

```python
result = await crawler.arun("https://example.com", config=run_config)

if not result.success:
    print(f"Crawl failed: {result.error_message}")
    print(f"Status code: {result.status_code}")
else:
    print(f"Success! Content length: {len(result.markdown)}")
```

### Debugging with Verbose Logging

```python
browser_config = BrowserConfig(verbose=True)

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://example.com")
    # Detailed logging output will be displayed
```

### Complete Example

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def comprehensive_crawl():
    browser_config = BrowserConfig(verbose=True)

    run_config = CrawlerRunConfig(
        # Content filtering
        word_count_threshold=10,
        excluded_tags=['form', 'header', 'nav'],
        exclude_external_links=True,

        # Content processing
        process_iframes=True,
        remove_overlay_elements=True,

        # Cache control
        cache_mode=CacheMode.ENABLED
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_config
        )

        if result.success:
            # Display content summary
            print(f"Title: {result.metadata.get('title', 'No title')}")
            print(f"Content: {result.markdown[:500]}...")

            # Process media
            images = result.media.get("images", [])
            print(f"Found {len(images)} images")
            for img in images[:3]:  # First 3 images
                print(f"  - {img.get('src', 'No src')}")

            # Process links
            internal_links = result.links.get("internal", [])
            print(f"Found {len(internal_links)} internal links")
            for link in internal_links[:3]:  # First 3 links
                print(f"  - {link.get('href', 'No href')}")

        else:
            print(f"❌ Crawl failed: {result.error_message}")
            print(f"Status: {result.status_code}")

if __name__ == "__main__":
    asyncio.run(comprehensive_crawl())
```

### Working with Raw HTML and Local Files

```python
# Crawl raw HTML
raw_html = "<html><body><h1>Test</h1><p>Content</p></body></html>"
result = await crawler.arun(f"raw://{raw_html}")

# Crawl local file
result = await crawler.arun("file:///path/to/local/file.html")

# Both return standard CrawlResult objects
print(result.markdown)
```

## Table Extraction

Extract structured data from HTML tables with automatic detection and scoring.

### Basic Table Extraction

```python
import asyncio
import pandas as pd
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def extract_tables():
    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            table_score_threshold=7,  # Higher = stricter detection
            cache_mode=CacheMode.BYPASS
        )

        result = await crawler.arun("https://example.com/tables", config=config)

        if result.success and result.tables:
            # New tables field (v0.6+)
            for i, table in enumerate(result.tables):
                print(f"Table {i+1}:")
                print(f"Headers: {table['headers']}")
                print(f"Rows: {len(table['rows'])}")
                print(f"Caption: {table.get('caption', 'No caption')}")

                # Convert to DataFrame
                df = pd.DataFrame(table['rows'], columns=table['headers'])
                print(df.head())

asyncio.run(extract_tables())
```

### Advanced Table Processing

```python
import re

from crawl4ai import LXMLWebScrapingStrategy

async def process_financial_tables():
    config = CrawlerRunConfig(
        table_score_threshold=8,  # Strict detection for data tables
        scraping_strategy=LXMLWebScrapingStrategy(),
        keep_data_attributes=True,
        scan_full_page=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://coinmarketcap.com", config=config)

        if result.tables:
            # Get the main data table (usually first/largest)
            main_table = result.tables[0]

            # Create DataFrame
            df = pd.DataFrame(
                main_table['rows'],
                columns=main_table['headers']
            )

            # Clean and process data
            df = clean_financial_data(df)

            # Save for analysis
            df.to_csv("market_data.csv", index=False)
            return df

def clean_financial_data(df):
    """Clean currency symbols, percentages, and large numbers"""
    for col in df.columns:
        if 'price' in col.lower():
            # Remove currency symbols
            df[col] = df[col].str.replace(r'[^\d.]', '', regex=True)
            df[col] = pd.to_numeric(df[col], errors='coerce')

        elif '%' in str(df[col].iloc[0]):
            # Convert percentages
            df[col] = df[col].str.replace('%', '').astype(float) / 100

        elif any(suffix in str(df[col].iloc[0]) for suffix in ['B', 'M', 'K']):
            # Handle large numbers (Billions, Millions, etc.)
            df[col] = df[col].apply(convert_large_numbers)

    return df

def convert_large_numbers(value):
    """Convert 1.5B -> 1500000000"""
    if pd.isna(value):
        return float('nan')

    value = str(value)
    multiplier = 1
    if 'B' in value:
        multiplier = 1e9
    elif 'M' in value:
        multiplier = 1e6
    elif 'K' in value:
        multiplier = 1e3

    number = float(re.sub(r'[^\d.]', '', value))
    return number * multiplier
```

### Table Detection Configuration

```python
# Strict table detection (data-heavy pages)
strict_config = CrawlerRunConfig(
    table_score_threshold=9,         # Only high-quality tables
    word_count_threshold=5,          # Ignore sparse content
    excluded_tags=['nav', 'footer']  # Skip navigation tables
)

# Lenient detection (mixed content pages)
lenient_config = CrawlerRunConfig(
    table_score_threshold=5,  # Include layout tables
    process_iframes=True,     # Check embedded tables
    scan_full_page=True       # Scroll to load dynamic tables
)

# Financial/data site optimization
financial_config = CrawlerRunConfig(
    table_score_threshold=8,
    scraping_strategy=LXMLWebScrapingStrategy(),
    wait_for="css:table",  # Wait for tables to load
    scan_full_page=True,
    scroll_delay=0.2
)
```

### Multi-Table Processing

```python
async def extract_all_tables():
    async with AsyncWebCrawler() as crawler:
        # reuses the table-detection config from the previous examples
        result = await crawler.arun("https://example.com/data", config=config)

        tables_data = {}

        for i, table in enumerate(result.tables):
            # Create meaningful names based on content
            table_name = (
                table.get('caption') or
                f"table_{i+1}_{table['headers'][0]}"
            ).replace(' ', '_').lower()

            df = pd.DataFrame(table['rows'], columns=table['headers'])

            # Store with metadata
            tables_data[table_name] = {
                'dataframe': df,
                'headers': table['headers'],
                'row_count': len(table['rows']),
                'caption': table.get('caption'),
                'summary': table.get('summary')
            }

        return tables_data

# Usage (inside an async context)
tables = await extract_all_tables()
for name, data in tables.items():
    print(f"{name}: {data['row_count']} rows")
    data['dataframe'].to_csv(f"{name}.csv")
```

### Backward Compatibility

```python
# Support both new and old table formats
def get_tables(result):
    # New format (v0.6+)
    if hasattr(result, 'tables') and result.tables:
        return result.tables

    # Fallback to media.tables (older versions)
    return result.media.get('tables', [])

# Usage in existing code
result = await crawler.arun(url, config=config)
tables = get_tables(result)

for table in tables:
    df = pd.DataFrame(table['rows'], columns=table['headers'])
    # Process table data...
```

### Table Quality Scoring

```python
# Understanding table_score_threshold values:
# 10:  Only perfect data tables (headers + data rows)
# 8-9: High-quality tables (recommended for financial/data sites)
# 6-7: Mixed content tables (news sites, wikis)
# 4-5: Layout tables included (broader detection)
# 1-3: All table-like structures (very permissive)

config = CrawlerRunConfig(
    table_score_threshold=8,  # Balanced detection
    verbose=True              # See scoring details in logs
)
```

**📖 Learn more:** [CrawlResult API Reference](https://docs.crawl4ai.com/api/crawl-result/), [Browser & Crawler Configuration](https://docs.crawl4ai.com/core/browser-crawler-config/), [Cache Modes](https://docs.crawl4ai.com/core/cache-modes/)

655  docs/md_v2/assets/llm.txt/txt/url_seeder.txt  Normal file
@@ -0,0 +1,655 @@
## URL Seeding

Smart URL discovery for efficient large-scale crawling. Discover thousands of URLs instantly, filter by relevance, then crawl only what matters.

### Why URL Seeding vs Deep Crawling

```python
# Deep Crawling: Real-time discovery (page by page)
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def deep_crawl_example():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,
            include_external=False,
            max_pages=50
        )
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://example.com", config=config)
        print(f"Discovered {len(results)} pages dynamically")

# URL Seeding: Bulk discovery (thousands instantly)
from crawl4ai import AsyncUrlSeeder, SeedingConfig

async def url_seeding_example():
    config = SeedingConfig(
        source="sitemap+cc",
        pattern="*/docs/*",
        extract_head=True,
        query="API documentation",
        scoring_method="bm25",
        max_urls=1000
    )

    async with AsyncUrlSeeder() as seeder:
        urls = await seeder.urls("example.com", config)
        print(f"Discovered {len(urls)} URLs instantly")
        # Now crawl only the most relevant ones
```

### Basic URL Discovery

```python
import asyncio
from crawl4ai import AsyncUrlSeeder, SeedingConfig

async def basic_discovery():
    # Context manager handles cleanup automatically
    async with AsyncUrlSeeder() as seeder:

        # Simple discovery from sitemaps
        config = SeedingConfig(source="sitemap")
        urls = await seeder.urls("example.com", config)

        print(f"Found {len(urls)} URLs from sitemap")
        for url in urls[:5]:
            print(f"  - {url['url']} (status: {url['status']})")

# Manual cleanup (if needed)
async def manual_cleanup():
    seeder = AsyncUrlSeeder()
    try:
        config = SeedingConfig(source="cc")  # Common Crawl
        urls = await seeder.urls("example.com", config)
        print(f"Found {len(urls)} URLs from Common Crawl")
    finally:
        await seeder.close()

asyncio.run(basic_discovery())
```

### Data Sources and Patterns

```python
# Different data sources
configs = [
    SeedingConfig(source="sitemap"),     # Fastest, official URLs
    SeedingConfig(source="cc"),          # Most comprehensive
    SeedingConfig(source="sitemap+cc"),  # Maximum coverage
]

# URL pattern filtering
patterns = [
    SeedingConfig(pattern="*/blog/*"),      # Blog posts only
    SeedingConfig(pattern="*.html"),        # HTML files only
    SeedingConfig(pattern="*/product/*"),   # Product pages
    SeedingConfig(pattern="*/docs/api/*"),  # API documentation
    SeedingConfig(pattern="*"),             # Everything
]

# Advanced pattern usage
async def pattern_filtering():
    async with AsyncUrlSeeder() as seeder:
        # Find all blog posts from 2024
        config = SeedingConfig(
            source="sitemap",
            pattern="*/blog/2024/*.html",
            max_urls=100
        )

        blog_urls = await seeder.urls("example.com", config)

        # Further filter by keywords in URL
        python_posts = [
            url for url in blog_urls
            if "python" in url['url'].lower()
        ]

        print(f"Found {len(python_posts)} Python blog posts")
```

### SeedingConfig Parameters

```python
from crawl4ai import SeedingConfig

# Comprehensive configuration
config = SeedingConfig(
    # Data sources
    source="sitemap+cc",              # "sitemap", "cc", "sitemap+cc"
    pattern="*/docs/*",               # URL pattern filter

    # Metadata extraction
    extract_head=True,                # Get <head> metadata
    live_check=True,                  # Verify URLs are accessible

    # Performance controls
    max_urls=1000,                    # Limit results (-1 = unlimited)
    concurrency=20,                   # Parallel workers
    hits_per_sec=10,                  # Rate limiting

    # Relevance scoring
    query="API documentation guide",  # Search query
    scoring_method="bm25",            # Scoring algorithm
    score_threshold=0.3,              # Minimum relevance (0.0-1.0)

    # Cache and filtering
    force=False,                      # Bypass cache
    filter_nonsense_urls=True,        # Remove utility URLs
    verbose=True                      # Debug output
)

# Quick configurations for common use cases
blog_config = SeedingConfig(
    source="sitemap",
    pattern="*/blog/*",
    extract_head=True
)

api_docs_config = SeedingConfig(
    source="sitemap+cc",
    pattern="*/docs/*",
    query="API reference documentation",
    scoring_method="bm25",
    score_threshold=0.5
)

product_pages_config = SeedingConfig(
    source="cc",
    pattern="*/product/*",
    live_check=True,
    max_urls=500
)
```

### Metadata Extraction and Analysis

```python
async def metadata_extraction():
    async with AsyncUrlSeeder() as seeder:
        config = SeedingConfig(
            source="sitemap",
            extract_head=True,  # Extract <head> metadata
            pattern="*/blog/*",
            max_urls=50
        )

        urls = await seeder.urls("example.com", config)

        # Analyze extracted metadata
        for url in urls[:5]:
            head_data = url['head_data']
            print(f"\nURL: {url['url']}")
            print(f"Title: {head_data.get('title', 'No title')}")

            # Standard meta tags
            meta = head_data.get('meta', {})
            print(f"Description: {meta.get('description', 'N/A')}")
            print(f"Keywords: {meta.get('keywords', 'N/A')}")
            print(f"Author: {meta.get('author', 'N/A')}")

            # Open Graph data
            print(f"OG Image: {meta.get('og:image', 'N/A')}")
            print(f"OG Type: {meta.get('og:type', 'N/A')}")

            # JSON-LD structured data
            jsonld = head_data.get('jsonld', [])
            if jsonld:
                print(f"Structured data: {len(jsonld)} items")
                for item in jsonld[:2]:
                    if isinstance(item, dict):
                        print(f"  Type: {item.get('@type', 'Unknown')}")
                        print(f"  Name: {item.get('name', 'N/A')}")

# Filter by metadata
async def metadata_filtering():
    async with AsyncUrlSeeder() as seeder:
        config = SeedingConfig(
            source="sitemap",
            extract_head=True,
            max_urls=100
        )

        urls = await seeder.urls("news.example.com", config)

        # Filter by publication date (from JSON-LD)
        # Use a timezone-aware cutoff so it compares cleanly with
        # ISO timestamps like "...Z"
        from datetime import datetime, timedelta, timezone
        recent_cutoff = datetime.now(timezone.utc) - timedelta(days=7)

        recent_articles = []
        for url in urls:
            for jsonld in url['head_data'].get('jsonld', []):
                if isinstance(jsonld, dict) and 'datePublished' in jsonld:
                    try:
                        pub_date = datetime.fromisoformat(
                            jsonld['datePublished'].replace('Z', '+00:00')
                        )
                        if pub_date > recent_cutoff:
                            recent_articles.append(url)
                        break
                    except ValueError:
                        continue

        print(f"Found {len(recent_articles)} recent articles")
```

### BM25 Relevance Scoring

```python
async def relevance_scoring():
    async with AsyncUrlSeeder() as seeder:
        # Find pages about Python async programming
        config = SeedingConfig(
            source="sitemap",
            extract_head=True,    # Required for content-based scoring
            query="python async await concurrency",
            scoring_method="bm25",
            score_threshold=0.3,  # Only 30%+ relevant pages
            max_urls=20
        )

        urls = await seeder.urls("docs.python.org", config)

        # Results are automatically sorted by relevance
        print("Most relevant Python async content:")
        for url in urls[:5]:
            score = url['relevance_score']
            title = url['head_data'].get('title', 'No title')
            print(f"[{score:.2f}] {title}")
            print(f"   {url['url']}")

# URL-based scoring (when extract_head=False)
async def url_based_scoring():
    async with AsyncUrlSeeder() as seeder:
        config = SeedingConfig(
            source="sitemap",
            extract_head=False,  # Fast URL-only scoring
            query="machine learning tutorial",
            scoring_method="bm25",
            score_threshold=0.2
        )

        urls = await seeder.urls("example.com", config)

        # Scoring based on URL structure, domain, path segments
        for url in urls[:5]:
            print(f"[{url['relevance_score']:.2f}] {url['url']}")

# Multi-concept queries
async def complex_queries():
    queries = [
        "data science pandas numpy visualization",
        "web scraping automation selenium",
        "machine learning tensorflow pytorch",
        "api documentation rest graphql"
    ]

    async with AsyncUrlSeeder() as seeder:
        all_results = []

        for query in queries:
            config = SeedingConfig(
                source="sitemap",
                extract_head=True,
                query=query,
                scoring_method="bm25",
                score_threshold=0.4,
                max_urls=10
            )

            urls = await seeder.urls("learning-site.com", config)
            all_results.extend(urls)

        # Remove duplicates while preserving order
        seen = set()
        unique_results = []
        for url in all_results:
            if url['url'] not in seen:
                seen.add(url['url'])
                unique_results.append(url)

        print(f"Found {len(unique_results)} unique pages across all topics")
```

### Live URL Validation

```python
async def url_validation():
    async with AsyncUrlSeeder() as seeder:
        config = SeedingConfig(
            source="sitemap",
            live_check=True,  # Verify URLs are accessible
            concurrency=15,   # Parallel HEAD requests
            hits_per_sec=8,   # Rate limiting
            max_urls=100
        )

        urls = await seeder.urls("example.com", config)

        # Analyze results
        valid_urls = [u for u in urls if u['status'] == 'valid']
        invalid_urls = [u for u in urls if u['status'] == 'not_valid']

        print(f"✅ Valid URLs: {len(valid_urls)}")
        print(f"❌ Invalid URLs: {len(invalid_urls)}")
        print(f"📊 Success rate: {len(valid_urls)/len(urls)*100:.1f}%")

        # Show some invalid URLs for debugging
        if invalid_urls:
            print("\nSample invalid URLs:")
            for url in invalid_urls[:3]:
                print(f"  - {url['url']}")

# Combined validation and metadata
async def comprehensive_validation():
    async with AsyncUrlSeeder() as seeder:
        config = SeedingConfig(
            source="sitemap",
            live_check=True,         # Verify accessibility
            extract_head=True,       # Get metadata
            query="tutorial guide",  # Relevance scoring
            scoring_method="bm25",
            score_threshold=0.2,
            concurrency=10,
            max_urls=50
        )

        urls = await seeder.urls("docs.example.com", config)

        # Filter for valid, relevant tutorials
        good_tutorials = [
            url for url in urls
            if url['status'] == 'valid' and
               url['relevance_score'] > 0.3 and
               'tutorial' in url['head_data'].get('title', '').lower()
        ]

        print(f"Found {len(good_tutorials)} high-quality tutorials")
```

### Multi-Domain Discovery

```python
async def multi_domain_research():
    async with AsyncUrlSeeder() as seeder:
        # Research Python tutorials across multiple sites
        domains = [
            "docs.python.org",
            "realpython.com",
            "python-course.eu",
            "tutorialspoint.com"
        ]

        config = SeedingConfig(
            source="sitemap",
            extract_head=True,
            query="python beginner tutorial basics",
            scoring_method="bm25",
            score_threshold=0.3,
            max_urls=15  # Per domain
        )

        # Discover across all domains in parallel
        results = await seeder.many_urls(domains, config)

        # Collect and rank all tutorials
        all_tutorials = []
        for domain, urls in results.items():
            for url in urls:
                url['domain'] = domain
                all_tutorials.append(url)

        # Sort by relevance across all domains
        all_tutorials.sort(key=lambda x: x['relevance_score'], reverse=True)

        print(f"Top 10 Python tutorials across {len(domains)} sites:")
        for i, tutorial in enumerate(all_tutorials[:10], 1):
            score = tutorial['relevance_score']
            title = tutorial['head_data'].get('title', 'No title')[:60]
            domain = tutorial['domain']
            print(f"{i:2d}. [{score:.2f}] {title}")
            print(f"    {domain}")

# Competitor analysis
async def competitor_analysis():
    competitors = ["competitor1.com", "competitor2.com", "competitor3.com"]

    async with AsyncUrlSeeder() as seeder:
        config = SeedingConfig(
            source="sitemap",
            extract_head=True,
            pattern="*/blog/*",
            max_urls=50
        )

        results = await seeder.many_urls(competitors, config)

        # Analyze content strategies
        for domain, urls in results.items():
            content_types = {}

            for url in urls:
                # Extract content type from metadata
                meta = url['head_data'].get('meta', {})
                og_type = meta.get('og:type', 'unknown')
                content_types[og_type] = content_types.get(og_type, 0) + 1

            print(f"\n{domain} content distribution:")
            for ctype, count in sorted(content_types.items(),
                                       key=lambda x: x[1], reverse=True):
                print(f"  {ctype}: {count}")
```

### Complete Pipeline: Discovery → Filter → Crawl

```python
async def smart_research_pipeline():
    """Complete pipeline: discover URLs, filter by relevance, crawl top results"""

    async with AsyncUrlSeeder() as seeder:
        # Step 1: Discover relevant URLs
        print("🔍 Discovering URLs...")
        config = SeedingConfig(
            source="sitemap+cc",
            extract_head=True,
            query="machine learning deep learning tutorial",
            scoring_method="bm25",
            score_threshold=0.4,
            max_urls=100
        )

        urls = await seeder.urls("example.com", config)
        print(f"   Found {len(urls)} relevant URLs")

        # Step 2: Select top articles
        top_articles = sorted(urls,
                              key=lambda x: x['relevance_score'],
                              reverse=True)[:10]

        print(f"   Selected top {len(top_articles)} for crawling")

        # Step 3: Show what we're about to crawl
        print("\n📋 Articles to crawl:")
        for i, article in enumerate(top_articles, 1):
            score = article['relevance_score']
            title = article['head_data'].get('title', 'No title')[:60]
            print(f"   {i}. [{score:.2f}] {title}")

    # Step 4: Crawl selected articles
    from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

    print(f"\n🕷️ Crawling {len(top_articles)} articles...")

    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(
            only_text=True,
            word_count_threshold=200,
            stream=True  # Process results as they come
        )

        # Extract URLs and crawl
        article_urls = [article['url'] for article in top_articles]

        crawled_count = 0
        async for result in await crawler.arun_many(article_urls, config=config):
            if result.success:
                crawled_count += 1
                word_count = len(result.markdown.raw_markdown.split())
                print(f"   ✅ [{crawled_count}/{len(article_urls)}] "
                      f"{word_count} words from {result.url[:50]}...")
            else:
                print(f"   ❌ Failed: {result.url[:50]}...")

    print(f"\n✨ Successfully crawled {crawled_count} articles!")

asyncio.run(smart_research_pipeline())
```

### Advanced Features and Performance

```python
# Cache management
async def cache_management():
    async with AsyncUrlSeeder() as seeder:
        # First run - populate cache
        config = SeedingConfig(
            source="sitemap",
            extract_head=True,
            force=True  # Bypass cache, fetch fresh
        )
        urls = await seeder.urls("example.com", config)

        # Subsequent runs - use cache (much faster)
        config = SeedingConfig(
            source="sitemap",
            extract_head=True,
            force=False  # Use cache
        )
        urls = await seeder.urls("example.com", config)

# Performance optimization
async def performance_tuning():
    async with AsyncUrlSeeder() as seeder:
        # High-performance configuration
        config = SeedingConfig(
            source="cc",
            concurrency=50,            # Many parallel workers
            hits_per_sec=20,           # High rate limit
            max_urls=10000,            # Large dataset
            extract_head=False,        # Skip metadata for speed
            filter_nonsense_urls=True  # Auto-filter utility URLs
        )

        import time
        start = time.time()
        urls = await seeder.urls("large-site.com", config)
        elapsed = time.time() - start

        print(f"Processed {len(urls)} URLs in {elapsed:.2f}s")
        print(f"Speed: {len(urls)/elapsed:.0f} URLs/second")

# Memory-safe processing for large domains
async def large_domain_processing():
    async with AsyncUrlSeeder() as seeder:
        # Safe for domains with 1M+ URLs
        config = SeedingConfig(
            source="cc+sitemap",
            concurrency=50,   # Bounded queue adapts to this
            max_urls=100000,  # Process in batches
            filter_nonsense_urls=True
        )

        # The seeder automatically manages memory by:
        # - Using bounded queues (prevents RAM spikes)
        # - Applying backpressure when the queue is full
        # - Processing URLs as they're discovered
        urls = await seeder.urls("huge-site.com", config)

# Configuration cloning and reuse
config_base = SeedingConfig(
    source="sitemap",
    extract_head=True,
    concurrency=20
)

# Create variations
blog_config = config_base.clone(pattern="*/blog/*")
docs_config = config_base.clone(
    pattern="*/docs/*",
    query="API documentation",
    scoring_method="bm25"
)
fast_config = config_base.clone(
    extract_head=False,
    concurrency=100,
    hits_per_sec=50
)
```

### Troubleshooting and Best Practices

```python
# Common issues and solutions
async def troubleshooting_guide():
    async with AsyncUrlSeeder() as seeder:
        # Issue: No URLs found
        try:
            config = SeedingConfig(source="sitemap", pattern="*/nonexistent/*")
            urls = await seeder.urls("example.com", config)
            if not urls:
                # Solution: Try a broader pattern or different source
                config = SeedingConfig(source="cc+sitemap", pattern="*")
                urls = await seeder.urls("example.com", config)
        except Exception as e:
            print(f"Discovery failed: {e}")

        # Issue: Slow performance
        config = SeedingConfig(
            source="sitemap",    # Faster than CC
            concurrency=10,      # Reduce if hitting rate limits
            hits_per_sec=5,      # Add rate limiting
            extract_head=False   # Skip if metadata not needed
        )

        # Issue: Low relevance scores
        config = SeedingConfig(
            query="specific detailed query terms",
            score_threshold=0.1,  # Lower threshold
            scoring_method="bm25"
        )

        # Issue: Memory issues with large sites
        config = SeedingConfig(
            max_urls=10000,   # Limit results
            concurrency=20,   # Reduce concurrency
            source="sitemap"  # Use sitemap only
        )

# Performance benchmarks
print("""
Typical performance on a standard connection:
- Sitemap discovery: 100-1,000 URLs/second
- Common Crawl discovery: 50-500 URLs/second
- HEAD checking: 10-50 URLs/second
- Head extraction: 5-20 URLs/second
- BM25 scoring: 10,000+ URLs/second
""")

# Best practices
best_practices = """
✅ Use context manager: async with AsyncUrlSeeder() as seeder
✅ Start with sitemaps (faster), add CC if needed
✅ Use extract_head=True only when you need metadata
✅ Set reasonable max_urls to limit processing
✅ Add rate limiting for respectful crawling
✅ Cache results with force=False for repeated operations
✅ Filter nonsense URLs (enabled by default)
✅ Use specific patterns to reduce irrelevant results
"""
```

**📖 Learn more:** [Complete URL Seeding Guide](https://docs.crawl4ai.com/core/url-seeding/), [SeedingConfig Reference](https://docs.crawl4ai.com/api/parameters/), [Multi-URL Crawling](https://docs.crawl4ai.com/advanced/multi-url-crawling/)