feat: add Script Builder to Chrome Extension and reorganize LLM context files

This commit adds the Script Builder to the Chrome extension, reorganizes the LLM context files, and updates the documentation to match:

  Chrome Extension - Script Builder (Alpha):
  - Add recording functionality to capture user interactions (clicks, typing, scrolling)
  - Implement smart event grouping for cleaner script generation
  - Support export to both JavaScript and C4A script formats
  - Add timeline view for visualizing and editing recorded actions
  - Include wait commands (time-based and element-based)
  - Add saved flows functionality for reusing automation scripts
  - Update UI with consistent dark terminal theme (Dank Mono font, green/pink accents)
  - Release new extension versions: v1.1.0, v1.2.0, v1.2.1

  LLM Context Builder Improvements:
  - Reorganize context files from llmtxt/ to llm.txt/ with better structure
  - Separate diagram templates from text content (diagrams/ and txt/ subdirectories)
  - Add comprehensive context files for all major Crawl4AI components
  - Improve file naming convention for better discoverability

  Documentation Updates:
  - Update apps index page to match main documentation theme
  - Standardize color scheme: "Available" tags use primary color (#50ffff)
  - Change "Coming Soon" tags to dark gray for better visual hierarchy
  - Add interactive two-column layout for extension landing page
  - Include code examples for both Schema Builder and Script Builder features

  Technical Improvements:
  - Enhance event capture mechanism with better element selection
  - Add support for contenteditable elements and complex form interactions
  - Implement proper scroll event handling for both window and element scrolling
  - Add meta key support for keyboard shortcuts
  - Improve selector generation for more reliable element targeting

  The Script Builder ships as Alpha: it may contain bugs, but it gives early access to
  automation recording.
UncleCode
2025-06-08 22:02:12 +08:00
parent 926592649e
commit 40640badad
72 changed files with 28600 additions and 100986 deletions

@@ -0,0 +1,392 @@
## Multi-URL Crawling Workflows and Architecture
Visual representations of concurrent crawling patterns, resource management, and monitoring systems for handling multiple URLs efficiently.
### Multi-URL Processing Modes
```mermaid
flowchart TD
A[Multi-URL Crawling Request] --> B{Processing Mode?}
B -->|Batch Mode| C[Collect All URLs]
B -->|Streaming Mode| D[Process URLs Individually]
C --> C1[Queue All URLs]
C1 --> C2[Execute Concurrently]
C2 --> C3[Wait for All Completion]
C3 --> C4[Return Complete Results Array]
D --> D1[Queue URLs]
D1 --> D2[Start First Batch]
D2 --> D3[Yield Results as Available]
D3 --> D4{More URLs?}
D4 -->|Yes| D5[Start Next URLs]
D4 -->|No| D6[Stream Complete]
D5 --> D3
C4 --> E[Process Results]
D6 --> E
E --> F[Success/Failure Analysis]
F --> G[End]
style C fill:#e3f2fd
style D fill:#f3e5f5
style C4 fill:#c8e6c9
style D6 fill:#c8e6c9
```
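The two modes in the diagram can be approximated with plain `asyncio`. This is a minimal sketch, not Crawl4AI's implementation: `fake_crawl`, `crawl_batch`, and `crawl_stream` are hypothetical names, and the crawl itself is simulated with a short sleep.

```python
import asyncio


async def fake_crawl(url: str) -> dict:
    """Hypothetical stand-in for a real crawl: sleeps instead of fetching."""
    await asyncio.sleep(0.01)
    return {"url": url, "success": not url.endswith("/bad")}


async def crawl_batch(urls, max_concurrent=3):
    """Batch mode: run everything, return one list once all crawls finish."""
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with sem:  # cap the number of crawls in flight
            return await fake_crawl(url)

    # gather() preserves input order in the returned list
    return await asyncio.gather(*(bounded(u) for u in urls))


async def crawl_stream(urls, max_concurrent=3):
    """Streaming mode: yield each result as soon as it is available."""
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with sem:
            return await fake_crawl(url)

    tasks = [asyncio.create_task(bounded(u)) for u in urls]
    # as_completed() hands results back in completion order, not input order
    for finished in asyncio.as_completed(tasks):
        yield await finished
```

Batch mode is simpler to consume because the caller gets one ordered list. Streaming mode lets the caller start processing early and keeps fewer results in memory at once.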
### Memory-Adaptive Dispatcher Flow
```mermaid
stateDiagram-v2
[*] --> Initializing
Initializing --> MonitoringMemory: Start dispatcher
MonitoringMemory --> CheckingMemory: Every check_interval
CheckingMemory --> MemoryOK: Memory < threshold
CheckingMemory --> MemoryHigh: Memory >= threshold
MemoryOK --> DispatchingTasks: Start new crawls
MemoryHigh --> WaitingForMemory: Pause dispatching
DispatchingTasks --> TaskRunning: Launch crawler
TaskRunning --> TaskCompleted: Crawl finished
TaskRunning --> TaskFailed: Crawl error
TaskCompleted --> MonitoringMemory: Update stats
TaskFailed --> MonitoringMemory: Update stats
WaitingForMemory --> CheckingMemory: Wait timeout
WaitingForMemory --> MonitoringMemory: Memory freed
note right of MemoryHigh: Prevents OOM crashes
note right of DispatchingTasks: Respects max_session_permit
note right of WaitingForMemory: Configurable timeout
```
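The state machine above can be read as a single dispatch loop. The sketch below is an illustration under stated assumptions, not the library's dispatcher: `dispatch_memory_adaptive`, `read_memory_percent`, and `memory_wait_timeout` are hypothetical names, the memory reading is injected by the caller (for example from `psutil.virtual_memory().percent`), and `crawl` is assumed to return a result rather than raise.

```python
import asyncio


async def dispatch_memory_adaptive(
    urls,
    crawl,                          # async callable: url -> result (assumed not to raise)
    read_memory_percent,            # callable returning current memory usage in percent
    memory_threshold_percent=70.0,
    check_interval=1.0,
    max_session_permit=10,
    memory_wait_timeout=300.0,      # hypothetical: give up if memory never frees
):
    pending = list(urls)
    running = set()
    results = []
    waited = 0.0

    while pending or running:
        if read_memory_percent() < memory_threshold_percent:
            # MemoryOK: start new crawls, but never exceed max_session_permit
            waited = 0.0
            while pending and len(running) < max_session_permit:
                running.add(asyncio.create_task(crawl(pending.pop(0))))
        elif not running:
            # MemoryHigh with nothing in flight: count idle waiting toward the timeout
            waited += check_interval
            if waited >= memory_wait_timeout:
                raise TimeoutError("memory stayed above the threshold")

        if running:
            # Collect finished crawls, then re-check memory after check_interval at most
            done, running = await asyncio.wait(
                running, timeout=check_interval, return_when=asyncio.FIRST_COMPLETED
            )
            results.extend(task.result() for task in done)
        else:
            await asyncio.sleep(check_interval)

    return results
```

While memory is high, crawls already in flight are allowed to finish, since they are what eventually frees memory; only new dispatching is paused.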
### Concurrent Crawling Architecture
```mermaid
graph TB
subgraph "URL Queue Management"
A[URL Input List] --> B[URL Queue]
B --> C[Priority Scheduler]
C --> D[Batch Assignment]
end
subgraph "Dispatcher Layer"
E[Memory Adaptive Dispatcher]
F[Semaphore Dispatcher]
G[Rate Limiter]
H[Resource Monitor]
E --> I[Memory Checker]
F --> J[Concurrency Controller]
G --> K[Delay Calculator]
H --> L[System Stats]
end
subgraph "Crawler Pool"
M[Crawler Instance 1]
N[Crawler Instance 2]
O[Crawler Instance 3]
P[Crawler Instance N]
M --> Q[Browser Session 1]
N --> R[Browser Session 2]
O --> S[Browser Session 3]
P --> T[Browser Session N]
end
subgraph "Result Processing"
U[Result Collector]
V[Success Handler]
W[Error Handler]
X[Retry Queue]
Y[Final Results]
end
D --> E
D --> F
E --> M
F --> N
G --> O
H --> P
Q --> U
R --> U
S --> U
T --> U
U --> V
U --> W
W --> X
X --> B
V --> Y
style E fill:#e3f2fd
style F fill:#f3e5f5
style G fill:#e8f5e8
style H fill:#fff3e0
```
### Rate Limiting and Backoff Strategy
```mermaid
sequenceDiagram
participant C as Crawler
participant RL as Rate Limiter
participant S as Server
participant D as Dispatcher
C->>RL: Request to crawl URL
RL->>RL: Calculate delay
RL->>RL: Apply base delay (1-3s)
RL->>C: Delay applied
C->>S: HTTP Request
alt Success Response
S-->>C: 200 OK + Content
C->>RL: Report success
RL->>RL: Reset failure count
C->>D: Return successful result
else Rate Limited
S-->>C: 429 Too Many Requests
C->>RL: Report rate limit
RL->>RL: Exponential backoff
RL->>RL: Increase delay (up to max_delay)
RL->>C: Apply longer delay
C->>S: Retry request after delay
else Server Error
S-->>C: 503 Service Unavailable
C->>RL: Report server error
RL->>RL: Moderate backoff
RL->>C: Retry with backoff
else Max Retries Exceeded
RL->>C: Stop retrying
C->>D: Return failed result
end
```
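The delay bookkeeping in this sequence can be sketched as a small state holder. `BackoffLimiter` is a hypothetical class written for illustration. The backoff factors (2.0 for 429, 1.5 for 503) and the choice to treat every other status as a success are assumptions, not values taken from the library.

```python
import random
from dataclasses import dataclass


@dataclass
class BackoffLimiter:
    base_delay: tuple = (1.0, 3.0)  # random delay range in seconds
    max_delay: float = 60.0
    max_retries: int = 3
    failures: int = 0
    current_delay: float = 0.0

    def next_delay(self) -> float:
        """Delay to apply before the next request."""
        if self.failures == 0:
            return random.uniform(*self.base_delay)
        return self.current_delay

    def report(self, status_code: int) -> bool:
        """Record a response; return False once retries are exhausted."""
        # Assumed factors: exponential for 429, moderate for 503
        factor = {429: 2.0, 503: 1.5}.get(status_code)
        if factor is None:
            # Treated as success: reset the failure count and the backoff
            self.failures = 0
            self.current_delay = 0.0
            return True
        self.failures += 1
        if self.failures > self.max_retries:
            return False
        # Grow from the previous backoff, or from the top of the base range
        start = self.current_delay or self.base_delay[1]
        self.current_delay = min(start * factor, self.max_delay)
        return True
```

A caller would sleep for `next_delay()` before each request and stop retrying a URL as soon as `report()` returns `False`.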
### Large-Scale Crawling Workflow
```mermaid
flowchart TD
A[Load URL List 10k+ URLs] --> B[Initialize Dispatcher]
B --> C{Select Dispatcher Type}
C -->|Memory Constrained| D[Memory Adaptive]
C -->|Fixed Resources| E[Semaphore Based]
D --> F[Set Memory Threshold 70%]
E --> G[Set Concurrency Limit]
F --> H[Configure Monitoring]
G --> H
H --> I[Start Crawling Process]
I --> J[Monitor System Resources]
J --> K{Memory Usage?}
K -->|< Threshold| L[Continue Dispatching]
K -->|>= Threshold| M[Pause New Tasks]
L --> N[Process Results Stream]
M --> O[Wait for Memory]
O --> K
N --> P{Result Type?}
P -->|Success| Q[Save to Database]
P -->|Failure| R[Log Error]
Q --> S[Update Progress Counter]
R --> S
S --> T{More URLs?}
T -->|Yes| U[Get Next Batch]
T -->|No| V[Generate Final Report]
U --> L
V --> W[Analysis Complete]
style A fill:#e1f5fe
style D fill:#e8f5e8
style E fill:#f3e5f5
style V fill:#c8e6c9
style W fill:#a5d6a7
```
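The result-handling branch of this workflow (save successes, log failures, update a counter, produce a final report) might look like the sketch below. It uses the standard-library `sqlite3` module as the database. The `pages` table, the result dictionary keys (`url`, `success`, `html`, `error`), and the `process_results` name are all assumptions made for the example.

```python
import sqlite3


def process_results(results, conn: sqlite3.Connection) -> dict:
    """Consume an iterable of result dicts and return a summary report."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT)"
    )
    report = {"processed": 0, "succeeded": 0, "failed": 0, "errors": []}

    for result in results:
        if result["success"]:
            # Success path: persist the page content
            conn.execute(
                "INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)",
                (result["url"], result["html"]),
            )
            report["succeeded"] += 1
        else:
            # Failure path: keep the error for the final report
            report["errors"].append((result["url"], result.get("error", "unknown")))
            report["failed"] += 1
        report["processed"] += 1  # progress counter

    conn.commit()
    return report
```

Because `results` is only iterated once, the same function works for a complete list or for a generator that yields results as they arrive.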
### Real-Time Monitoring Dashboard Flow
```mermaid
graph LR
subgraph "Data Collection"
A[Crawler Tasks] --> B[Performance Metrics]
A --> C[Memory Usage]
A --> D[Success/Failure Rates]
A --> E[Response Times]
end
subgraph "Monitor Processing"
F[CrawlerMonitor] --> G[Aggregate Statistics]
F --> H[Display Formatter]
F --> I[Update Scheduler]
end
subgraph "Display Modes"
J[DETAILED Mode]
K[AGGREGATED Mode]
J --> L[Individual Task Status]
J --> M[Task-Level Metrics]
K --> N[Summary Statistics]
K --> O[Overall Progress]
end
subgraph "Output Interface"
P[Console Display]
Q[Progress Bars]
R[Status Tables]
S[Real-time Updates]
end
B --> F
C --> F
D --> F
E --> F
G --> J
G --> K
H --> J
H --> K
I --> J
I --> K
L --> P
M --> Q
N --> R
O --> S
style F fill:#e3f2fd
style J fill:#f3e5f5
style K fill:#e8f5e8
```
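The difference between the two display modes can be shown with a small formatter. This is a standalone sketch: `Mode`, `TaskStat`, and `render` are hypothetical names, and the output layout is invented for the example rather than copied from the real monitor.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    DETAILED = "detailed"
    AGGREGATED = "aggregated"


@dataclass
class TaskStat:
    url: str
    status: str        # "completed", "failed", or "running"
    memory_mb: float
    seconds: float


def render(stats, mode: Mode) -> list:
    """Return the lines a console display would print for the given mode."""
    if mode is Mode.DETAILED:
        # One line per task with task-level metrics
        return [
            f"{s.url} | {s.status} | {s.memory_mb:.1f} MB | {s.seconds:.2f}s"
            for s in stats
        ]
    # Aggregated: a single summary line over all tasks
    completed = sum(s.status == "completed" for s in stats)
    failed = sum(s.status == "failed" for s in stats)
    peak = max((s.memory_mb for s in stats), default=0.0)
    return [
        f"tasks: {len(stats)} | completed: {completed} | "
        f"failed: {failed} | peak memory: {peak:.1f} MB"
    ]
```

Detailed output grows with the number of tasks, so the aggregated form is usually the more practical choice for large URL lists.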
### Error Handling and Recovery Pattern
```mermaid
stateDiagram-v2
[*] --> ProcessingURL
ProcessingURL --> CrawlAttempt: Start crawl
CrawlAttempt --> Success: HTTP 200
CrawlAttempt --> NetworkError: Connection failed
CrawlAttempt --> RateLimit: HTTP 429
CrawlAttempt --> ServerError: HTTP 5xx
CrawlAttempt --> Timeout: Request timeout
Success --> [*]: Return result
NetworkError --> RetryCheck: Check retry count
RateLimit --> BackoffWait: Apply exponential backoff
ServerError --> RetryCheck: Check retry count
Timeout --> RetryCheck: Check retry count
BackoffWait --> RetryCheck: After delay
RetryCheck --> CrawlAttempt: retries < max_retries
RetryCheck --> Failed: retries >= max_retries
Failed --> ErrorLog: Log failure details
ErrorLog --> [*]: Return failed result
note right of BackoffWait: Exponential backoff for rate limits
note right of RetryCheck: Configurable max_retries
note right of ErrorLog: Detailed error tracking
```
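The recovery pattern above maps onto a retry loop. In this sketch, `crawl_with_retries`, `fetch`, and `backoff_base` are hypothetical: `fetch` is any async callable that returns an HTTP status code or raises `ConnectionError` or a timeout error. Statuses the diagram does not cover (for example 404) are assumed to follow the same retry check as a server error.

```python
import asyncio


async def crawl_with_retries(url, fetch, max_retries=3, backoff_base=1.0):
    """Retry a crawl until it succeeds or max_retries is reached."""
    retries = 0
    while True:
        try:
            status = await fetch(url)
        except (ConnectionError, TimeoutError, asyncio.TimeoutError) as exc:
            # NetworkError / Timeout states
            last_error = f"{type(exc).__name__}: {exc}"
        else:
            if status == 200:
                return {"url": url, "success": True, "retries": retries}
            last_error = f"HTTP {status}"
            if status == 429:
                # RateLimit -> BackoffWait: exponential delay before the retry check
                await asyncio.sleep(backoff_base * 2 ** retries)

        # RetryCheck: give up once retries >= max_retries
        if retries >= max_retries:
            return {"url": url, "success": False, "retries": retries, "error": last_error}
        retries += 1
```

Returning a failed result instead of raising keeps one bad URL from aborting a run over many URLs, which matches the "Return failed result" exit in the diagram.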
### Resource Management Timeline
```mermaid
gantt
title Multi-URL Crawling Resource Management
dateFormat X
axisFormat %s
section Memory Usage
Initialize Dispatcher :0, 1
Memory Monitoring :1, 10
Peak Usage Period :3, 7
Memory Cleanup :7, 9
section Task Execution
URL Queue Setup :0, 2
Batch 1 Processing :2, 5
Batch 2 Processing :4, 7
Batch 3 Processing :6, 9
Final Results :9, 10
section Rate Limiting
Normal Delays :2, 4
Backoff Period :4, 6
Recovery Period :6, 8
section Monitoring
System Health Check :0, 10
Progress Updates :1, 9
Performance Metrics :2, 8
```
### Concurrent Processing Performance Matrix
```mermaid
graph TD
subgraph "Input Factors"
A[Number of URLs]
B[Concurrency Level]
C[Memory Threshold]
D[Rate Limiting]
end
subgraph "Processing Characteristics"
A --> E[Low 1-100 URLs]
A --> F[Medium 100-1k URLs]
A --> G[High 1k-10k URLs]
A --> H[Very High 10k+ URLs]
B --> I[Conservative 1-5]
B --> J[Moderate 5-15]
B --> K[Aggressive 15-30]
C --> L[Strict 60-70%]
C --> M[Balanced 70-80%]
C --> N[Relaxed 80-90%]
end
subgraph "Recommended Configurations"
E --> O[Simple Semaphore]
F --> P[Memory Adaptive Basic]
G --> Q[Memory Adaptive Advanced]
H --> R[Memory Adaptive + Monitoring]
I --> O
J --> P
K --> Q
K --> R
L --> Q
M --> P
N --> O
end
style O fill:#c8e6c9
style P fill:#fff3e0
style Q fill:#ffecb3
style R fill:#ffcdd2
```
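The URL-count column of this matrix reduces to a lookup. The sketch below covers only that one factor and ignores concurrency level and memory threshold. `recommend_dispatcher` is a hypothetical helper, and because the diagram's ranges share their endpoints (100, 1k, 10k), the sketch assumes each boundary belongs to the lower tier.

```python
def recommend_dispatcher(num_urls: int) -> str:
    """Map a URL count to the configuration label used in the diagram."""
    if num_urls < 1:
        raise ValueError("num_urls must be at least 1")
    if num_urls <= 100:          # Low
        return "Simple Semaphore"
    if num_urls <= 1_000:        # Medium
        return "Memory Adaptive Basic"
    if num_urls <= 10_000:       # High
        return "Memory Adaptive Advanced"
    return "Memory Adaptive + Monitoring"  # Very High
```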
**📖 Learn more:** [Multi-URL Crawling Guide](https://docs.crawl4ai.com/advanced/multi-url-crawling/), [Dispatcher Configuration](https://docs.crawl4ai.com/advanced/crawl-dispatcher/), [Performance Optimization](https://docs.crawl4ai.com/advanced/multi-url-crawling/#performance-optimization)