## Multi-URL Crawling Workflows and Architecture

Visual representations of concurrent crawling patterns, resource management, and monitoring systems for handling multiple URLs efficiently.

### Multi-URL Processing Modes
```mermaid
flowchart TD
    A[Multi-URL Crawling Request] --> B{Processing Mode?}

    B -->|Batch Mode| C[Collect All URLs]
    B -->|Streaming Mode| D[Process URLs Individually]

    C --> C1[Queue All URLs]
    C1 --> C2[Execute Concurrently]
    C2 --> C3[Wait for All Completion]
    C3 --> C4[Return Complete Results Array]

    D --> D1[Queue URLs]
    D1 --> D2[Start First Batch]
    D2 --> D3[Yield Results as Available]
    D3 --> D4{More URLs?}
    D4 -->|Yes| D5[Start Next URLs]
    D4 -->|No| D6[Stream Complete]
    D5 --> D3

    C4 --> E[Process Results]
    D6 --> E

    E --> F[Success/Failure Analysis]
    F --> G[End]

    style C fill:#e3f2fd
    style D fill:#f3e5f5
    style C4 fill:#c8e6c9
    style D6 fill:#c8e6c9
```

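In code, both modes go through the same `arun_many()` call; setting `stream=True` on `CrawlerRunConfig` switches the return value from a complete results list to an async generator. A minimal sketch (the URLs are placeholders):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

async def main():
    urls = [f"https://example.com/page/{i}" for i in range(1, 6)]  # placeholder URLs

    async with AsyncWebCrawler() as crawler:
        # Batch mode: arun_many() waits for everything, then returns one list
        batch_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        results = await crawler.arun_many(urls, config=batch_config)
        print(f"Batch: {sum(r.success for r in results)}/{len(results)} succeeded")

        # Streaming mode: stream=True yields each result as soon as it finishes
        stream_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, stream=True)
        async for result in await crawler.arun_many(urls, config=stream_config):
            status = "OK" if result.success else f"FAIL: {result.error_message}"
            print(f"{result.url} -> {status}")

asyncio.run(main())
```
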
### Memory-Adaptive Dispatcher Flow
```mermaid
stateDiagram-v2
    [*] --> Initializing

    Initializing --> MonitoringMemory: Start dispatcher

    MonitoringMemory --> CheckingMemory: Every check_interval
    CheckingMemory --> MemoryOK: Memory < threshold
    CheckingMemory --> MemoryHigh: Memory >= threshold

    MemoryOK --> DispatchingTasks: Start new crawls
    MemoryHigh --> WaitingForMemory: Pause dispatching

    DispatchingTasks --> TaskRunning: Launch crawler
    TaskRunning --> TaskCompleted: Crawl finished
    TaskRunning --> TaskFailed: Crawl error

    TaskCompleted --> MonitoringMemory: Update stats
    TaskFailed --> MonitoringMemory: Update stats

    WaitingForMemory --> CheckingMemory: Wait timeout
    WaitingForMemory --> MonitoringMemory: Memory freed

    note right of MemoryHigh: Prevents OOM crashes
    note right of DispatchingTasks: Respects max_session_permit
    note right of WaitingForMemory: Configurable timeout
```

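Each state above maps to a `MemoryAdaptiveDispatcher` knob: `memory_threshold_percent` sets the pause point, `check_interval` the monitoring cadence, and `max_session_permit` the dispatch ceiling. A minimal sketch (values are illustrative; import path per the Crawl4AI docs):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher

async def main():
    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=70.0,  # pause new tasks above this system memory usage
        check_interval=1.0,             # seconds between memory checks
        max_session_permit=10,          # hard ceiling on concurrent crawl sessions
    )
    urls = [f"https://example.com/page/{i}" for i in range(100)]  # placeholders
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls, config=CrawlerRunConfig(), dispatcher=dispatcher
        )
        print(f"{sum(r.success for r in results)} of {len(results)} succeeded")

asyncio.run(main())
```
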
### Concurrent Crawling Architecture
```mermaid
graph TB
    subgraph "URL Queue Management"
        A[URL Input List] --> B[URL Queue]
        B --> C[Priority Scheduler]
        C --> D[Batch Assignment]
    end

    subgraph "Dispatcher Layer"
        E[Memory Adaptive Dispatcher]
        F[Semaphore Dispatcher]
        G[Rate Limiter]
        H[Resource Monitor]

        E --> I[Memory Checker]
        F --> J[Concurrency Controller]
        G --> K[Delay Calculator]
        H --> L[System Stats]
    end

    subgraph "Crawler Pool"
        M[Crawler Instance 1]
        N[Crawler Instance 2]
        O[Crawler Instance 3]
        P[Crawler Instance N]

        M --> Q[Browser Session 1]
        N --> R[Browser Session 2]
        O --> S[Browser Session 3]
        P --> T[Browser Session N]
    end

    subgraph "Result Processing"
        U[Result Collector]
        V[Success Handler]
        W[Error Handler]
        X[Retry Queue]
        Y[Final Results]
    end

    D --> E
    D --> F
    E --> M
    F --> N
    G --> O
    H --> P

    Q --> U
    R --> U
    S --> U
    T --> U

    U --> V
    U --> W
    W --> X
    X --> B
    V --> Y

    style E fill:#e3f2fd
    style F fill:#f3e5f5
    style G fill:#e8f5e8
    style H fill:#fff3e0
```

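The dispatcher layer is pluggable: when resources are fixed and predictable, the `SemaphoreDispatcher` in the diagram replaces memory tracking with a constant concurrency cap. A sketch (this constructor parameter has also appeared as `semaphore_count` in some releases, so verify against your installed version):

```python
from crawl4ai.async_dispatcher import RateLimiter, SemaphoreDispatcher

# Fixed-concurrency alternative: at most 5 crawl sessions in flight at once,
# with a light per-request delay to stay polite
dispatcher = SemaphoreDispatcher(
    max_session_permit=5,
    rate_limiter=RateLimiter(base_delay=(0.5, 1.0), max_delay=10.0),
)
# Passed to arun_many() exactly like MemoryAdaptiveDispatcher:
# results = await crawler.arun_many(urls, config=config, dispatcher=dispatcher)
```
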
### Rate Limiting and Backoff Strategy
```mermaid
sequenceDiagram
    participant C as Crawler
    participant RL as Rate Limiter
    participant S as Server
    participant D as Dispatcher

    C->>RL: Request to crawl URL
    RL->>RL: Calculate delay
    RL->>RL: Apply base delay (1-3s)
    RL->>C: Delay applied

    C->>S: HTTP Request

    alt Success Response
        S-->>C: 200 OK + Content
        C->>RL: Report success
        RL->>RL: Reset failure count
        C->>D: Return successful result
    else Rate Limited
        S-->>C: 429 Too Many Requests
        C->>RL: Report rate limit
        RL->>RL: Exponential backoff
        RL->>RL: Increase delay (up to max_delay)
        RL->>C: Apply longer delay
        C->>S: Retry request after delay
    else Server Error
        S-->>C: 503 Service Unavailable
        C->>RL: Report server error
        RL->>RL: Moderate backoff
        RL->>C: Retry with backoff
    else Max Retries Exceeded
        RL->>C: Stop retrying
        C->>D: Return failed result
    end
```

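This handshake corresponds to the `RateLimiter` settings: a random base delay before each request, exponential backoff capped at `max_delay` for the listed status codes, and a hard stop after `max_retries`. A sketch of wiring it into a dispatcher:

```python
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher, RateLimiter

# Backoff policy matching the sequence above
rate_limiter = RateLimiter(
    base_delay=(1.0, 3.0),       # random pre-request delay, in seconds
    max_delay=60.0,              # exponential backoff never exceeds this ceiling
    max_retries=3,               # then the URL is returned as a failed result
    rate_limit_codes=[429, 503], # status codes that trigger backoff
)
dispatcher = MemoryAdaptiveDispatcher(rate_limiter=rate_limiter)
```
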
### Large-Scale Crawling Workflow
```mermaid
flowchart TD
    A[Load URL List 10k+ URLs] --> B[Initialize Dispatcher]

    B --> C{Select Dispatcher Type}
    C -->|Memory Constrained| D[Memory Adaptive]
    C -->|Fixed Resources| E[Semaphore Based]

    D --> F[Set Memory Threshold 70%]
    E --> G[Set Concurrency Limit]

    F --> H[Configure Monitoring]
    G --> H

    H --> I[Start Crawling Process]
    I --> J[Monitor System Resources]

    J --> K{Memory Usage?}
    K -->|< Threshold| L[Continue Dispatching]
    K -->|>= Threshold| M[Pause New Tasks]

    L --> N[Process Results Stream]
    M --> O[Wait for Memory]
    O --> K

    N --> P{Result Type?}
    P -->|Success| Q[Save to Database]
    P -->|Failure| R[Log Error]

    Q --> S[Update Progress Counter]
    R --> S

    S --> T{More URLs?}
    T -->|Yes| U[Get Next Batch]
    T -->|No| V[Generate Final Report]

    U --> L
    V --> W[Analysis Complete]

    style A fill:#e1f5fe
    style D fill:#e8f5e8
    style E fill:#f3e5f5
    style V fill:#c8e6c9
    style W fill:#a5d6a7
```

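Putting the pieces together for a large list: stream results so they never accumulate in memory, and let the memory-adaptive dispatcher throttle dispatching. A sketch, where `save_to_database` is a hypothetical persistence hook, not a Crawl4AI API:

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher

def save_to_database(result):
    """Hypothetical persistence hook; swap in your real storage layer."""
    pass

async def crawl_large_list(urls):
    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=70.0,
        max_session_permit=15,
    )
    config = CrawlerRunConfig(stream=True)  # stream so results never pile up in RAM
    succeeded = failed = 0

    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun_many(
            urls, config=config, dispatcher=dispatcher
        ):
            if result.success:
                succeeded += 1
                save_to_database(result)
            else:
                failed += 1
                print(f"FAILED {result.url}: {result.error_message}")
    print(f"Done: {succeeded} ok, {failed} failed of {len(urls)} URLs")
```
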
### Real-Time Monitoring Dashboard Flow
```mermaid
graph LR
    subgraph "Data Collection"
        A[Crawler Tasks] --> B[Performance Metrics]
        A --> C[Memory Usage]
        A --> D[Success/Failure Rates]
        A --> E[Response Times]
    end

    subgraph "Monitor Processing"
        F[CrawlerMonitor] --> G[Aggregate Statistics]
        F --> H[Display Formatter]
        F --> I[Update Scheduler]
    end

    subgraph "Display Modes"
        J[DETAILED Mode]
        K[AGGREGATED Mode]

        J --> L[Individual Task Status]
        J --> M[Task-Level Metrics]
        K --> N[Summary Statistics]
        K --> O[Overall Progress]
    end

    subgraph "Output Interface"
        P[Console Display]
        Q[Progress Bars]
        R[Status Tables]
        S[Real-time Updates]
    end

    B --> F
    C --> F
    D --> F
    E --> F

    G --> J
    G --> K
    H --> J
    H --> K
    I --> J
    I --> K

    L --> P
    M --> Q
    N --> R
    O --> S

    style F fill:#e3f2fd
    style J fill:#f3e5f5
    style K fill:#e8f5e8
```

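`CrawlerMonitor` attaches to a dispatcher and renders the console dashboard; `DisplayMode.DETAILED` shows per-task rows, while `DisplayMode.AGGREGATED` shows only summary statistics. A sketch (the monitor's constructor has varied across releases, so treat the parameters as illustrative):

```python
from crawl4ai import CrawlerMonitor, DisplayMode
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher

# Live console dashboard: DETAILED = one row per task, AGGREGATED = totals only
monitor = CrawlerMonitor(
    max_visible_rows=15,
    display_mode=DisplayMode.DETAILED,
)
dispatcher = MemoryAdaptiveDispatcher(monitor=monitor)
# The dispatcher is then passed to arun_many() as in the earlier sketches.
```
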
### Error Handling and Recovery Pattern
```mermaid
stateDiagram-v2
    [*] --> ProcessingURL

    ProcessingURL --> CrawlAttempt: Start crawl

    CrawlAttempt --> Success: HTTP 200
    CrawlAttempt --> NetworkError: Connection failed
    CrawlAttempt --> RateLimit: HTTP 429
    CrawlAttempt --> ServerError: HTTP 5xx
    CrawlAttempt --> Timeout: Request timeout

    Success --> [*]: Return result

    NetworkError --> RetryCheck: Check retry count
    RateLimit --> BackoffWait: Apply exponential backoff
    ServerError --> RetryCheck: Check retry count
    Timeout --> RetryCheck: Check retry count

    BackoffWait --> RetryCheck: After delay

    RetryCheck --> CrawlAttempt: retries < max_retries
    RetryCheck --> Failed: retries >= max_retries

    Failed --> ErrorLog: Log failure details
    ErrorLog --> [*]: Return failed result

    note right of BackoffWait: Exponential backoff for rate limits
    note right of RetryCheck: Configurable max_retries
    note right of ErrorLog: Detailed error tracking
```

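Failures surface on the returned `CrawlResult` objects rather than as exceptions, so post-crawl triage is a plain loop over `success`, `status_code`, and `error_message`. A sketch of sorting outcomes for retry:

```python
from collections import Counter

def triage(results):
    """Sort finished crawl results by outcome for follow-up handling."""
    outcomes = Counter()
    retry_candidates = []
    for r in results:
        if r.success:
            outcomes["ok"] += 1
        elif r.status_code in (429, 503):
            outcomes["throttled"] += 1
            retry_candidates.append(r.url)  # worth re-queueing after a cool-down
        else:
            outcomes["failed"] += 1
            print(f"{r.url}: {r.error_message}")
    return outcomes, retry_candidates
```
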
### Resource Management Timeline
```mermaid
gantt
    title Multi-URL Crawling Resource Management
    dateFormat X
    axisFormat %s

    section Memory Usage
    Initialize Dispatcher :0, 1
    Memory Monitoring :1, 10
    Peak Usage Period :3, 7
    Memory Cleanup :7, 9

    section Task Execution
    URL Queue Setup :0, 2
    Batch 1 Processing :2, 5
    Batch 2 Processing :4, 7
    Batch 3 Processing :6, 9
    Final Results :9, 10

    section Rate Limiting
    Normal Delays :2, 4
    Backoff Period :4, 6
    Recovery Period :6, 8

    section Monitoring
    System Health Check :0, 10
    Progress Updates :1, 9
    Performance Metrics :2, 8
```

### Concurrent Processing Performance Matrix
```mermaid
graph TD
    subgraph "Input Factors"
        A[Number of URLs]
        B[Concurrency Level]
        C[Memory Threshold]
        D[Rate Limiting]
    end

    subgraph "Processing Characteristics"
        A --> E[Low 1-100 URLs]
        A --> F[Medium 100-1k URLs]
        A --> G[High 1k-10k URLs]
        A --> H[Very High 10k+ URLs]

        B --> I[Conservative 1-5]
        B --> J[Moderate 5-15]
        B --> K[Aggressive 15-30]

        C --> L[Strict 60-70%]
        C --> M[Balanced 70-80%]
        C --> N[Relaxed 80-90%]
    end

    subgraph "Recommended Configurations"
        E --> O[Simple Semaphore]
        F --> P[Memory Adaptive Basic]
        G --> Q[Memory Adaptive Advanced]
        H --> R[Memory Adaptive + Monitoring]

        I --> O
        J --> P
        K --> Q
        K --> R

        L --> Q
        M --> P
        N --> O
    end

    style O fill:#c8e6c9
    style P fill:#fff3e0
    style Q fill:#ffecb3
    style R fill:#ffcdd2
```

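One way to apply the matrix is a small selection helper. The sketch below is a hypothetical `choose_dispatcher` illustrating the recommended tiers, not a Crawl4AI API; thresholds and limits are the illustrative values from the diagram:

```python
from crawl4ai import CrawlerMonitor, DisplayMode
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher, SemaphoreDispatcher

def choose_dispatcher(url_count):
    """Hypothetical helper mapping the matrix above onto a dispatcher setup."""
    if url_count <= 100:
        # Low volume: a fixed semaphore with conservative concurrency is enough
        return SemaphoreDispatcher(max_session_permit=5)
    if url_count <= 10_000:
        # Medium/high volume: let memory pressure throttle dispatching
        return MemoryAdaptiveDispatcher(
            memory_threshold_percent=70.0,
            max_session_permit=15,
        )
    # Very high volume: add live monitoring on top of memory adaptation
    return MemoryAdaptiveDispatcher(
        memory_threshold_percent=70.0,
        max_session_permit=20,
        monitor=CrawlerMonitor(display_mode=DisplayMode.AGGREGATED),
    )
```
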
**📖 Learn more:** [Multi-URL Crawling Guide](https://docs.crawl4ai.com/advanced/multi-url-crawling/), [Dispatcher Configuration](https://docs.crawl4ai.com/advanced/crawl-dispatcher/), [Performance Optimization](https://docs.crawl4ai.com/advanced/multi-url-crawling/#performance-optimization)