## Deep Crawling Workflows and Architecture

Visual representations of multi-level website exploration, filtering strategies, and intelligent crawling patterns.

### Deep Crawl Strategy Overview

```mermaid
flowchart TD
    A[Start Deep Crawl] --> B{Strategy Selection}

    B -->|Explore All Levels| C[BFS Strategy]
    B -->|Dive Deep Fast| D[DFS Strategy]
    B -->|Smart Prioritization| E[Best-First Strategy]

    C --> C1[Breadth-First Search]
    C1 --> C2[Process all depth 0 links]
    C2 --> C3[Process all depth 1 links]
    C3 --> C4[Continue by depth level]

    D --> D1[Depth-First Search]
    D1 --> D2[Follow first link deeply]
    D2 --> D3[Backtrack when max depth reached]
    D3 --> D4[Continue with next branch]

    E --> E1[Best-First Search]
    E1 --> E2[Score all discovered URLs]
    E2 --> E3[Process highest scoring URLs first]
    E3 --> E4[Continuously re-prioritize queue]

    C4 --> F[Apply Filters]
    D4 --> F
    E4 --> F

    F --> G{Filter Chain Processing}
    G -->|Domain Filter| G1[Check allowed/blocked domains]
    G -->|URL Pattern Filter| G2[Match URL patterns]
    G -->|Content Type Filter| G3[Verify content types]
    G -->|SEO Filter| G4[Evaluate SEO quality]
    G -->|Content Relevance| G5[Score content relevance]

    G1 --> H{Passed All Filters?}
    G2 --> H
    G3 --> H
    G4 --> H
    G5 --> H

    H -->|Yes| I[Add to Crawl Queue]
    H -->|No| J[Discard URL]

    I --> K{Processing Mode}
    K -->|Streaming| L[Process Immediately]
    K -->|Batch| M[Collect All Results]

    L --> N[Stream Result to User]
    M --> O[Return Complete Result Set]

    J --> P{More URLs in Queue?}
    N --> P
    O --> P

    P -->|Yes| Q{Within Limits?}
    P -->|No| R[Deep Crawl Complete]

    Q -->|Max Depth OK| S{Max Pages OK}
    Q -->|Max Depth Exceeded| T[Skip Deeper URLs]

    S -->|Under Limit| U[Continue Crawling]
    S -->|Limit Reached| R

    T --> P
    U --> F

    style A fill:#e1f5fe
    style R fill:#c8e6c9
    style C fill:#fff3e0
    style D fill:#f3e5f5
    style E fill:#e8f5e8
```

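All three strategies plug into the same `CrawlerRunConfig`. Below is a minimal sketch based on the documented deep-crawl API, using `BFSDeepCrawlStrategy` (swap in `DFSDeepCrawlStrategy` or `BestFirstCrawlingStrategy` for the other branches); exact parameter names may vary between releases.

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def main():
    # BFS: finish each depth level before going deeper (capped at depth 2)
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=2, include_external=False),
    )
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://example.com", config=config)
        for result in results:
            print(result.url, "depth:", result.metadata.get("depth", 0))

asyncio.run(main())
```
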
### Deep Crawl Strategy Comparison

```mermaid
graph TB
    subgraph "BFS - Breadth-First Search"
        BFS1[Level 0: Start URL]
        BFS2[Level 1: All direct links]
        BFS3[Level 2: All second-level links]
        BFS4[Level 3: All third-level links]

        BFS1 --> BFS2
        BFS2 --> BFS3
        BFS3 --> BFS4

        BFS_NOTE[Complete each depth before going deeper<br/>Good for site mapping<br/>Memory intensive for wide sites]
    end

    subgraph "DFS - Depth-First Search"
        DFS1[Start URL]
        DFS2[First Link → Deep]
        DFS3[Follow until max depth]
        DFS4[Backtrack and try next]

        DFS1 --> DFS2
        DFS2 --> DFS3
        DFS3 --> DFS4
        DFS4 --> DFS2

        DFS_NOTE[Go deep on first path<br/>Memory efficient<br/>May miss important pages]
    end

    subgraph "Best-First - Priority Queue"
        BF1[Start URL]
        BF2[Score all discovered links]
        BF3[Process highest scoring first]
        BF4[Continuously re-prioritize]

        BF1 --> BF2
        BF2 --> BF3
        BF3 --> BF4
        BF4 --> BF2

        BF_NOTE[Intelligent prioritization<br/>Finds relevant content fast<br/>Recommended for most use cases]
    end

    style BFS1 fill:#e3f2fd
    style DFS1 fill:#f3e5f5
    style BF1 fill:#e8f5e8
    style BFS_NOTE fill:#fff3e0
    style DFS_NOTE fill:#fff3e0
    style BF_NOTE fill:#fff3e0
```

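The trade-offs above map directly onto the three strategy classes. A sketch of how each is constructed, assuming the documented constructors (keyword names may differ across versions):

```python
from crawl4ai.deep_crawling import (
    BestFirstCrawlingStrategy,
    BFSDeepCrawlStrategy,
    DFSDeepCrawlStrategy,
)
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

# BFS: complete site maps, but holds every frontier URL in memory
bfs = BFSDeepCrawlStrategy(max_depth=2, include_external=False)

# DFS: memory-light, follows each branch to max_depth before backtracking
dfs = DFSDeepCrawlStrategy(max_depth=3, include_external=False)

# Best-First: a scorer decides which queued URL is crawled next
best_first = BestFirstCrawlingStrategy(
    max_depth=2,
    url_scorer=KeywordRelevanceScorer(keywords=["crawl", "async"], weight=0.7),
)
```
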
### Filter Chain Processing Sequence

```mermaid
sequenceDiagram
    participant URL as Discovered URL
    participant Chain as Filter Chain
    participant Domain as Domain Filter
    participant Pattern as URL Pattern Filter
    participant Content as Content Type Filter
    participant SEO as SEO Filter
    participant Relevance as Content Relevance Filter
    participant Queue as Crawl Queue

    URL->>Chain: Process URL
    Chain->>Domain: Check domain rules

    alt Domain Allowed
        Domain-->>Chain: ✓ Pass
        Chain->>Pattern: Check URL patterns

        alt Pattern Matches
            Pattern-->>Chain: ✓ Pass
            Chain->>Content: Check content type

            alt Content Type Valid
                Content-->>Chain: ✓ Pass
                Chain->>SEO: Evaluate SEO quality

                alt SEO Score Above Threshold
                    SEO-->>Chain: ✓ Pass
                    Chain->>Relevance: Score content relevance

                    alt Relevance Score High
                        Relevance-->>Chain: ✓ Pass
                        Chain->>Queue: Add to crawl queue
                        Queue-->>URL: Queued for crawling
                    else Relevance Score Low
                        Relevance-->>Chain: ✗ Reject
                        Chain-->>URL: Filtered out - Low relevance
                    end
                else SEO Score Low
                    SEO-->>Chain: ✗ Reject
                    Chain-->>URL: Filtered out - Poor SEO
                end
            else Invalid Content Type
                Content-->>Chain: ✗ Reject
                Chain-->>URL: Filtered out - Wrong content type
            end
        else Pattern Mismatch
            Pattern-->>Chain: ✗ Reject
            Chain-->>URL: Filtered out - Pattern mismatch
        end
    else Domain Blocked
        Domain-->>Chain: ✗ Reject
        Chain-->>URL: Filtered out - Blocked domain
    end
```

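In code, this cascade is a `FilterChain` attached to the strategy: a URL is queued only if every filter passes, and the first rejection discards it. A sketch using the documented filter classes (the SEO and relevance filters follow the same pattern):

```python
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.deep_crawling.filters import (
    ContentTypeFilter,
    DomainFilter,
    FilterChain,
    URLPatternFilter,
)

# Filters run in order; a URL must pass every stage to reach the queue
filter_chain = FilterChain([
    DomainFilter(
        allowed_domains=["docs.example.com"],
        blocked_domains=["ads.example.com"],
    ),
    URLPatternFilter(patterns=["*guide*", "*tutorial*"]),
    ContentTypeFilter(allowed_types=["text/html"]),
])

strategy = BFSDeepCrawlStrategy(max_depth=2, filter_chain=filter_chain)
```
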
### URL Lifecycle State Machine

```mermaid
stateDiagram-v2
    [*] --> Discovered: Found on page

    Discovered --> FilterPending: Enter filter chain

    FilterPending --> DomainCheck: Apply domain filter
    DomainCheck --> PatternCheck: Domain allowed
    DomainCheck --> Rejected: Domain blocked

    PatternCheck --> ContentCheck: Pattern matches
    PatternCheck --> Rejected: Pattern mismatch

    ContentCheck --> SEOCheck: Content type valid
    ContentCheck --> Rejected: Invalid content

    SEOCheck --> RelevanceCheck: SEO score sufficient
    SEOCheck --> Rejected: Poor SEO score

    RelevanceCheck --> Scored: Relevance score calculated
    RelevanceCheck --> Rejected: Low relevance

    Scored --> Queued: Added to priority queue

    Queued --> Crawling: Selected for processing
    Crawling --> Success: Page crawled successfully
    Crawling --> Failed: Crawl failed

    Success --> LinkExtraction: Extract new links
    LinkExtraction --> [*]: Process complete

    Failed --> [*]: Record failure
    Rejected --> [*]: Log rejection reason

    note right of Scored : Score determines priority<br/>in Best-First strategy
    note right of Failed : Errors logged with<br/>depth and reason
```

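Each `CrawlResult` coming out of a deep crawl records where its URL ended up in this lifecycle. A sketch of a post-crawl summary, assuming the `depth` and `score` metadata keys shown in the deep-crawl docs (`score` is only populated when a scorer is in play):

```python
def summarize(results) -> None:
    """Print the lifecycle outcome for each deep-crawl result."""
    for result in results:
        depth = result.metadata.get("depth", 0)
        score = result.metadata.get("score", 0.0)
        if result.success:
            status = "crawled"
        else:
            status = f"failed: {result.error_message}"
        print(f"[depth {depth} | score {score:.2f}] {result.url} -> {status}")
```
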
### Streaming vs Batch Processing Architecture

```mermaid
graph TB
    subgraph "Input"
        A[Start URL] --> B[Deep Crawl Strategy]
    end

    subgraph "Crawl Engine"
        B --> C[URL Discovery]
        C --> D[Filter Chain]
        D --> E[Priority Queue]
        E --> F[Page Processor]
    end

    subgraph "Streaming Mode stream=True"
        F --> G1[Process Page]
        G1 --> H1[Extract Content]
        H1 --> I1[Yield Result Immediately]
        I1 --> J1[async for result]
        J1 --> K1[Real-time Processing]

        G1 --> L1[Extract Links]
        L1 --> M1[Add to Queue]
        M1 --> F
    end

    subgraph "Batch Mode stream=False"
        F --> G2[Process Page]
        G2 --> H2[Extract Content]
        H2 --> I2[Store Result]
        I2 --> N2[Result Collection]

        G2 --> L2[Extract Links]
        L2 --> M2[Add to Queue]
        M2 --> O2{More URLs?}
        O2 -->|Yes| F
        O2 -->|No| P2[Return All Results]
        P2 --> Q2[Batch Processing]
    end

    style I1 fill:#e8f5e8
    style K1 fill:#e8f5e8
    style P2 fill:#e3f2fd
    style Q2 fill:#e3f2fd
```

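The two modes differ only in the `stream` flag on `CrawlerRunConfig`: with `stream=True`, `arun()` yields results as pages finish; with `stream=False` it returns the full list at the end. A minimal sketch of both, per the documented API:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def main():
    strategy = BFSDeepCrawlStrategy(max_depth=1)

    async with AsyncWebCrawler() as crawler:
        # Streaming: handle each page the moment it is processed
        stream_cfg = CrawlerRunConfig(deep_crawl_strategy=strategy, stream=True)
        async for result in await crawler.arun("https://example.com", config=stream_cfg):
            print("streamed:", result.url)

        # Batch: block until the crawl finishes, then get every result at once
        batch_cfg = CrawlerRunConfig(deep_crawl_strategy=strategy, stream=False)
        results = await crawler.arun("https://example.com", config=batch_cfg)
        print("batch total:", len(results))

asyncio.run(main())
```
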
### Advanced Scoring and Prioritization System

```mermaid
flowchart LR
    subgraph "URL Discovery"
        A[Page Links] --> B[Extract URLs]
        B --> C[Normalize URLs]
    end

    subgraph "Scoring System"
        C --> D[Keyword Relevance Scorer]
        D --> D1[URL Text Analysis]
        D --> D2[Keyword Matching]
        D --> D3[Calculate Base Score]

        D3 --> E[Additional Scoring Factors]
        E --> E1[URL Structure weight: 0.2]
        E --> E2[Link Context weight: 0.3]
        E --> E3[Page Depth Penalty weight: 0.1]
        E --> E4[Domain Authority weight: 0.4]

        D1 --> F[Combined Score]
        D2 --> F
        D3 --> F
        E1 --> F
        E2 --> F
        E3 --> F
        E4 --> F
    end

    subgraph "Prioritization"
        F --> G{Score Threshold}
        G -->|Above Threshold| H[Priority Queue]
        G -->|Below Threshold| I[Discard URL]

        H --> J[Best-First Selection]
        J --> K[Highest Score First]
        K --> L[Process Page]

        L --> M[Update Scores]
        M --> N[Re-prioritize Queue]
        N --> J
    end

    style F fill:#fff3e0
    style H fill:#e8f5e8
    style L fill:#e3f2fd
```

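Of the factors above, the keyword scorer is the documented concrete class; the extra weights in the diagram are illustrative. A sketch wiring `KeywordRelevanceScorer` into the Best-First strategy, per the documented API:

```python
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.scorers import KeywordRelevanceScorer

# URLs whose text matches more keywords score higher and are crawled sooner
scorer = KeywordRelevanceScorer(
    keywords=["browser", "automation", "scraping"],
    weight=0.7,  # relative influence if several scorers are combined
)

strategy = BestFirstCrawlingStrategy(
    max_depth=2,
    url_scorer=scorer,
    max_pages=25,  # stop after 25 pages regardless of queue size
)
```
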
### Deep Crawl Performance and Limits

```mermaid
graph TD
    subgraph "Crawl Constraints"
        A[Max Depth: 2] --> A1[Prevents infinite crawling]
        B[Max Pages: 50] --> B1[Controls resource usage]
        C[Score Threshold: 0.3] --> C1[Quality filtering]
        D[Domain Limits] --> D1[Scope control]
    end

    subgraph "Performance Monitoring"
        E[Pages Crawled] --> F[Depth Distribution]
        E --> G[Success Rate]
        E --> H[Average Score]
        E --> I[Processing Time]

        F --> J[Performance Report]
        G --> J
        H --> J
        I --> J
    end

    subgraph "Resource Management"
        K[Memory Usage] --> L{Memory Threshold}
        L -->|Under Limit| M[Continue Crawling]
        L -->|Over Limit| N[Reduce Concurrency]

        O[CPU Usage] --> P{CPU Threshold}
        P -->|Normal| M
        P -->|High| Q[Add Delays]

        R[Network Load] --> S{Rate Limits}
        S -->|OK| M
        S -->|Exceeded| T[Throttle Requests]
    end

    M --> U[Optimal Performance]
    N --> V[Reduced Performance]
    Q --> V
    T --> V

    style U fill:#c8e6c9
    style V fill:#fff3e0
    style J fill:#e3f2fd
```

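The constraint knobs in the diagram correspond to strategy parameters. A sketch, assuming the `max_depth`, `max_pages`, and `score_threshold` parameters documented for recent releases (older versions may lack the last two):

```python
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

strategy = BFSDeepCrawlStrategy(
    max_depth=2,          # never follow links more than 2 hops from the start URL
    max_pages=50,         # hard cap on total pages fetched
    score_threshold=0.3,  # skip URLs scoring below 0.3 when a scorer is attached
)
```
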
### Error Handling and Recovery Flow

```mermaid
sequenceDiagram
    participant Strategy as Deep Crawl Strategy
    participant Queue as Priority Queue
    participant Crawler as Page Crawler
    participant Error as Error Handler
    participant Result as Result Collector

    Strategy->>Queue: Get next URL
    Queue-->>Strategy: Return highest priority URL

    Strategy->>Crawler: Crawl page

    alt Successful Crawl
        Crawler-->>Strategy: Return page content
        Strategy->>Result: Store successful result
        Strategy->>Strategy: Extract new links
        Strategy->>Queue: Add new URLs to queue
    else Network Error
        Crawler-->>Error: Network timeout/failure
        Error->>Error: Log error with details
        Error->>Queue: Mark URL as failed
        Error-->>Strategy: Skip to next URL
    else Parse Error
        Crawler-->>Error: HTML parsing failed
        Error->>Error: Log parse error
        Error->>Result: Store failed result
        Error-->>Strategy: Continue with next URL
    else Rate Limit Hit
        Crawler-->>Error: Rate limit exceeded
        Error->>Error: Apply backoff strategy
        Error->>Queue: Re-queue URL with delay
        Error-->>Strategy: Wait before retry
    else Depth Limit
        Strategy->>Strategy: Check depth constraint
        Strategy-->>Queue: Skip URL - too deep
    else Page Limit
        Strategy->>Strategy: Check page count
        Strategy-->>Result: Stop crawling - limit reached
    end

    Strategy->>Queue: Request next URL
    Queue-->>Strategy: More URLs available?

    alt Queue Empty
        Queue-->>Result: Crawl complete
    else Queue Has URLs
        Queue-->>Strategy: Continue crawling
    end
```

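Failures do not abort the crawl; they surface on the individual result objects. A sketch of handling them in streaming mode, based on the documented `success` and `error_message` fields:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def main():
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=1),
        stream=True,
    )
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://example.com", config=config):
            if result.success:
                print("ok:", result.url)
            else:
                # network, parse, and rate-limit errors land here instead of raising
                print("failed:", result.url, "-", result.error_message)

asyncio.run(main())
```
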
**📖 Learn more:** [Deep Crawling Strategies](https://docs.crawl4ai.com/core/deep-crawling/), [Content Filtering](https://docs.crawl4ai.com/core/content-selection/), [Advanced Crawling Patterns](https://docs.crawl4ai.com/advanced/advanced-features/)