## Multi-URL Crawling Workflows and Architecture
Visual representations of concurrent crawling patterns, resource management, and monitoring systems for handling multiple URLs efficiently.
### Multi-URL Processing Modes
```mermaid
flowchart TD
A[Multi-URL Crawling Request] --> B{Processing Mode?}
B -->|Batch Mode| C[Collect All URLs]
B -->|Streaming Mode| D[Process URLs Individually]
C --> C1[Queue All URLs]
C1 --> C2[Execute Concurrently]
C2 --> C3[Wait for All Completion]
C3 --> C4[Return Complete Results Array]
D --> D1[Queue URLs]
D1 --> D2[Start First Batch]
D2 --> D3[Yield Results as Available]
D3 --> D4{More URLs?}
D4 -->|Yes| D5[Start Next URLs]
D4 -->|No| D6[Stream Complete]
D5 --> D3
C4 --> E[Process Results]
D6 --> E
E --> F[Success/Failure Analysis]
F --> G[End]
style C fill:#e3f2fd
style D fill:#f3e5f5
style C4 fill:#c8e6c9
style D6 fill:#c8e6c9
```
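The two modes above map naturally onto `asyncio.gather` (batch: wait for everything, return one array) versus `asyncio.as_completed` (streaming: yield each result as it finishes). A minimal stdlib-only sketch — `fetch` is a stand-in for the real crawl call, not the Crawl4AI API; Crawl4AI itself exposes the same choice through `arun_many()`:

```python
import asyncio

async def fetch(url: str) -> str:
    """Stand-in for a real crawl; returns a fake result for the URL."""
    await asyncio.sleep(0.01)
    return f"content:{url}"

async def crawl_batch(urls):
    """Batch mode: queue all URLs, wait for all, return complete results array."""
    return await asyncio.gather(*(fetch(u) for u in urls))

async def crawl_stream(urls):
    """Streaming mode: yield each result as soon as it becomes available."""
    for task in asyncio.as_completed([fetch(u) for u in urls]):
        yield await task

async def main():
    urls = ["https://example.com/a", "https://example.com/b"]
    batch = await crawl_batch(urls)                    # one complete array
    streamed = [r async for r in crawl_stream(urls)]   # results as available
    return batch, streamed
```

Batch mode preserves input order; streaming mode returns results in completion order, which is why the diagram's streaming branch loops until the queue is drained.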
### Memory-Adaptive Dispatcher Flow
```mermaid
stateDiagram-v2
[*] --> Initializing
Initializing --> MonitoringMemory: Start dispatcher
MonitoringMemory --> CheckingMemory: Every check_interval
CheckingMemory --> MemoryOK: Memory < threshold
CheckingMemory --> MemoryHigh: Memory >= threshold
MemoryOK --> DispatchingTasks: Start new crawls
MemoryHigh --> WaitingForMemory: Pause dispatching
DispatchingTasks --> TaskRunning: Launch crawler
TaskRunning --> TaskCompleted: Crawl finished
TaskRunning --> TaskFailed: Crawl error
TaskCompleted --> MonitoringMemory: Update stats
TaskFailed --> MonitoringMemory: Update stats
WaitingForMemory --> CheckingMemory: Wait timeout
WaitingForMemory --> MonitoringMemory: Memory freed
note right of MemoryHigh: Prevents OOM crashes
note right of DispatchingTasks: Respects max_session_permit
note right of WaitingForMemory: Configurable timeout
```
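The state machine above can be sketched as a dispatch loop that polls a memory probe before launching each task. This is an illustrative reduction, not the real `MemoryAdaptiveDispatcher`; the probe is injected so it can be stubbed in tests (in production it would be something like `psutil.virtual_memory().percent`), and the parameter names deliberately mirror the diagram:

```python
import asyncio
from typing import Awaitable, Callable, Iterable

async def memory_adaptive_dispatch(
    urls: Iterable[str],
    crawl: Callable[[str], Awaitable[object]],
    memory_percent: Callable[[], float],   # injected probe, e.g. psutil-based
    threshold: float = 70.0,               # MemoryHigh at/above this value
    max_session_permit: int = 10,          # cap on concurrent crawl sessions
    check_interval: float = 0.01,          # how often to re-check memory
):
    """Dispatch crawls only while memory is below the threshold (OOM guard)."""
    sem = asyncio.Semaphore(max_session_permit)
    results = []

    async def run(url):
        async with sem:                          # respects max_session_permit
            results.append(await crawl(url))

    tasks = []
    for url in urls:
        while memory_percent() >= threshold:     # MemoryHigh -> WaitingForMemory
            await asyncio.sleep(check_interval)  # wait until memory is freed
        tasks.append(asyncio.create_task(run(url)))
    await asyncio.gather(*tasks)
    return results
```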
### Concurrent Crawling Architecture
```mermaid
graph TB
subgraph "URL Queue Management"
A[URL Input List] --> B[URL Queue]
B --> C[Priority Scheduler]
C --> D[Batch Assignment]
end
subgraph "Dispatcher Layer"
E[Memory Adaptive Dispatcher]
F[Semaphore Dispatcher]
G[Rate Limiter]
H[Resource Monitor]
E --> I[Memory Checker]
F --> J[Concurrency Controller]
G --> K[Delay Calculator]
H --> L[System Stats]
end
subgraph "Crawler Pool"
M[Crawler Instance 1]
N[Crawler Instance 2]
O[Crawler Instance 3]
P[Crawler Instance N]
M --> Q[Browser Session 1]
N --> R[Browser Session 2]
O --> S[Browser Session 3]
P --> T[Browser Session N]
end
subgraph "Result Processing"
U[Result Collector]
V[Success Handler]
W[Error Handler]
X[Retry Queue]
Y[Final Results]
end
D --> E
D --> F
E --> M
F --> N
G --> O
H --> P
Q --> U
R --> U
S --> U
T --> U
U --> V
U --> W
W --> X
X --> B
V --> Y
style E fill:#e3f2fd
style F fill:#f3e5f5
style G fill:#e8f5e8
style H fill:#fff3e0
```
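The feedback loop in this architecture (Error Handler → Retry Queue → URL Queue) can be sketched as a worker pool draining a shared queue, with failed URLs re-queued until an attempt cap is reached. All names here are illustrative, not Crawl4AI internals:

```python
import asyncio

async def crawl_with_retry_queue(urls, crawl, workers=3, max_attempts=2):
    """Worker pool draining a shared URL queue; failures are re-queued until
    max_attempts, mirroring the Error Handler -> Retry Queue -> URL Queue loop."""
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait((url, 1))               # (url, attempt number)
    successes, failures = {}, {}

    async def worker():
        while True:
            try:
                url, attempt = queue.get_nowait()
            except asyncio.QueueEmpty:
                return                            # queue drained, worker exits
            try:
                successes[url] = await crawl(url)         # Success Handler
            except Exception as exc:
                if attempt < max_attempts:
                    queue.put_nowait((url, attempt + 1))  # back to URL Queue
                else:
                    failures[url] = exc                   # final failure
            finally:
                queue.task_done()

    await asyncio.gather(*(worker() for _ in range(workers)))
    return successes, failures
```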
### Rate Limiting and Backoff Strategy
```mermaid
sequenceDiagram
participant C as Crawler
participant RL as Rate Limiter
participant S as Server
participant D as Dispatcher
C->>RL: Request to crawl URL
RL->>RL: Calculate delay
RL->>RL: Apply base delay (1-3s)
RL->>C: Delay applied
C->>S: HTTP Request
alt Success Response
S-->>C: 200 OK + Content
C->>RL: Report success
RL->>RL: Reset failure count
C->>D: Return successful result
else Rate Limited
S-->>C: 429 Too Many Requests
C->>RL: Report rate limit
RL->>RL: Exponential backoff
RL->>RL: Increase delay (up to max_delay)
RL->>C: Apply longer delay
C->>S: Retry request after delay
else Server Error
S-->>C: 503 Service Unavailable
C->>RL: Report server error
RL->>RL: Moderate backoff
RL->>C: Retry with backoff
else Max Retries Exceeded
RL->>C: Stop retrying
C->>D: Return failed result
end
```
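The delay calculation in this sequence — a random politeness delay on success, exponential backoff capped at a maximum on 429/503 — reduces to a few lines. The `base_delay` range and `max_delay` cap mirror typical rate-limiter parameters; the exact growth factor is an assumption for illustration:

```python
import random

def next_delay(status: int, failure_count: int,
               base_delay=(1.0, 3.0), max_delay=60.0) -> float:
    """Delay before the next request: random base delay on success,
    exponential backoff (doubling per failure, capped) on 429/503."""
    if status in (429, 503):
        # Backoff doubles with each consecutive failure, up to max_delay
        return min(base_delay[1] * (2 ** failure_count), max_delay)
    return random.uniform(*base_delay)   # normal politeness delay (1-3s)
```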
### Large-Scale Crawling Workflow
```mermaid
flowchart TD
A[Load URL List 10k+ URLs] --> B[Initialize Dispatcher]
B --> C{Select Dispatcher Type}
C -->|Memory Constrained| D[Memory Adaptive]
C -->|Fixed Resources| E[Semaphore Based]
D --> F[Set Memory Threshold 70%]
E --> G[Set Concurrency Limit]
F --> H[Configure Monitoring]
G --> H
H --> I[Start Crawling Process]
I --> J[Monitor System Resources]
J --> K{Memory Usage?}
K -->|< Threshold| L[Continue Dispatching]
K -->|>= Threshold| M[Pause New Tasks]
L --> N[Process Results Stream]
M --> O[Wait for Memory]
O --> K
N --> P{Result Type?}
P -->|Success| Q[Save to Database]
P -->|Failure| R[Log Error]
Q --> S[Update Progress Counter]
R --> S
S --> T{More URLs?}
T -->|Yes| U[Get Next Batch]
T -->|No| V[Generate Final Report]
U --> L
V --> W[Analysis Complete]
style A fill:#e1f5fe
style D fill:#e8f5e8
style E fill:#f3e5f5
style V fill:#c8e6c9
style W fill:#a5d6a7
```
### Real-Time Monitoring Dashboard Flow
```mermaid
graph LR
subgraph "Data Collection"
A[Crawler Tasks] --> B[Performance Metrics]
A --> C[Memory Usage]
A --> D[Success/Failure Rates]
A --> E[Response Times]
end
subgraph "Monitor Processing"
F[CrawlerMonitor] --> G[Aggregate Statistics]
F --> H[Display Formatter]
F --> I[Update Scheduler]
end
subgraph "Display Modes"
J[DETAILED Mode]
K[AGGREGATED Mode]
J --> L[Individual Task Status]
J --> M[Task-Level Metrics]
K --> N[Summary Statistics]
K --> O[Overall Progress]
end
subgraph "Output Interface"
P[Console Display]
Q[Progress Bars]
R[Status Tables]
S[Real-time Updates]
end
B --> F
C --> F
D --> F
E --> F
G --> J
G --> K
H --> J
H --> K
I --> J
I --> K
L --> P
M --> Q
N --> R
O --> S
style F fill:#e3f2fd
style J fill:#f3e5f5
style K fill:#e8f5e8
```
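The split between DETAILED and AGGREGATED display modes amounts to two views over the same per-task records: one row per task versus summary statistics. A minimal sketch with a hypothetical `TaskStat` record (the real `CrawlerMonitor` tracks more fields):

```python
from dataclasses import dataclass

@dataclass
class TaskStat:
    url: str
    status: str        # "success" | "failed" | "running"
    memory_mb: float
    elapsed_s: float

def aggregate(stats):
    """AGGREGATED mode: summary statistics across all tasks."""
    done = [s for s in stats if s.status != "running"]
    ok = sum(1 for s in done if s.status == "success")
    return {
        "total": len(stats),
        "completed": len(done),
        "success_rate": ok / len(done) if done else 0.0,
        "peak_memory_mb": max((s.memory_mb for s in stats), default=0.0),
    }

def detailed(stats):
    """DETAILED mode: one formatted status row per task."""
    return [f"{s.url:<30} {s.status:<8} {s.memory_mb:6.1f}MB {s.elapsed_s:5.2f}s"
            for s in stats]
```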
### Error Handling and Recovery Pattern
```mermaid
stateDiagram-v2
[*] --> ProcessingURL
ProcessingURL --> CrawlAttempt: Start crawl
CrawlAttempt --> Success: HTTP 200
CrawlAttempt --> NetworkError: Connection failed
CrawlAttempt --> RateLimit: HTTP 429
CrawlAttempt --> ServerError: HTTP 5xx
CrawlAttempt --> Timeout: Request timeout
Success --> [*]: Return result
NetworkError --> RetryCheck: Check retry count
RateLimit --> BackoffWait: Apply exponential backoff
ServerError --> RetryCheck: Check retry count
Timeout --> RetryCheck: Check retry count
BackoffWait --> RetryCheck: After delay
RetryCheck --> CrawlAttempt: retries < max_retries
RetryCheck --> Failed: retries >= max_retries
Failed --> ErrorLog: Log failure details
ErrorLog --> [*]: Return failed result
note right of BackoffWait: Exponential backoff for rate limits
note right of RetryCheck: Configurable max_retries
note right of ErrorLog: Detailed error tracking
```
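The recovery state machine above can be driven by a single retry loop: success returns immediately, transient failures retry (rate limits with exponential backoff first), and anything past `max_retries` falls through to the failed/error-log path. `attempt` is a hypothetical callable returning an outcome label, standing in for a real crawl:

```python
import asyncio

TRANSIENT = {"network_error", "server_error", "timeout", "rate_limit"}

async def crawl_with_recovery(attempt, max_retries=3, base_backoff=1.0):
    """Drive one URL through the recovery states in the diagram above."""
    retries = 0
    while True:
        outcome = await attempt()                        # CrawlAttempt
        if outcome == "success":                         # HTTP 200
            return "success"
        if outcome not in TRANSIENT:                     # permanent failure
            return "failed"
        if outcome == "rate_limit":                      # BackoffWait
            await asyncio.sleep(base_backoff * 2 ** retries)
        retries += 1                                     # RetryCheck
        if retries >= max_retries:
            return "failed"                              # -> ErrorLog
```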
### Resource Management Timeline
```mermaid
gantt
title Multi-URL Crawling Resource Management
dateFormat X
axisFormat %s
section Memory Usage
Initialize Dispatcher :0, 1
Memory Monitoring :1, 10
Peak Usage Period :3, 7
Memory Cleanup :7, 9
section Task Execution
URL Queue Setup :0, 2
Batch 1 Processing :2, 5
Batch 2 Processing :4, 7
Batch 3 Processing :6, 9
Final Results :9, 10
section Rate Limiting
Normal Delays :2, 4
Backoff Period :4, 6
Recovery Period :6, 8
section Monitoring
System Health Check :0, 10
Progress Updates :1, 9
Performance Metrics :2, 8
```
### Concurrent Processing Performance Matrix
```mermaid
graph TD
subgraph "Input Factors"
A[Number of URLs]
B[Concurrency Level]
C[Memory Threshold]
D[Rate Limiting]
end
subgraph "Processing Characteristics"
A --> E[Low 1-100 URLs]
A --> F[Medium 100-1k URLs]
A --> G[High 1k-10k URLs]
A --> H[Very High 10k+ URLs]
B --> I[Conservative 1-5]
B --> J[Moderate 5-15]
B --> K[Aggressive 15-30]
C --> L[Strict 60-70%]
C --> M[Balanced 70-80%]
C --> N[Relaxed 80-90%]
end
subgraph "Recommended Configurations"
E --> O[Simple Semaphore]
F --> P[Memory Adaptive Basic]
G --> Q[Memory Adaptive Advanced]
H --> R[Memory Adaptive + Monitoring]
I --> O
J --> P
K --> Q
K --> R
L --> Q
M --> P
N --> O
end
style O fill:#c8e6c9
style P fill:#fff3e0
style Q fill:#ffecb3
style R fill:#ffcdd2
```
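The URL-count branch of this matrix can be read as a simple decision table. A sketch with illustrative cutoffs taken from the diagram (the other input factors — concurrency level and memory threshold — would refine the choice further):

```python
def recommend_dispatcher(url_count: int) -> str:
    """Map URL-list size to the recommended configuration from the matrix."""
    if url_count <= 100:
        return "Simple Semaphore"                # Low: 1-100 URLs
    if url_count <= 1_000:
        return "Memory Adaptive Basic"           # Medium: 100-1k URLs
    if url_count <= 10_000:
        return "Memory Adaptive Advanced"        # High: 1k-10k URLs
    return "Memory Adaptive + Monitoring"        # Very High: 10k+ URLs
```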
**📖 Learn more:** [Multi-URL Crawling Guide](https://docs.crawl4ai.com/advanced/multi-url-crawling/), [Dispatcher Configuration](https://docs.crawl4ai.com/advanced/crawl-dispatcher/), [Performance Optimization](https://docs.crawl4ai.com/advanced/multi-url-crawling/#performance-optimization)