crawl4ai/docs/md_v2/assets/llm.txt/diagrams/deep_crawling.txt
UncleCode 40640badad feat: add Script Builder to Chrome Extension and reorganize LLM context files
This commit introduces significant enhancements to the Crawl4AI ecosystem:

  Chrome Extension - Script Builder (Alpha):
  - Add recording functionality to capture user interactions (clicks, typing, scrolling)
  - Implement smart event grouping for cleaner script generation
  - Support export to both JavaScript and C4A script formats
  - Add timeline view for visualizing and editing recorded actions
  - Include wait commands (time-based and element-based)
  - Add saved flows functionality for reusing automation scripts
  - Update UI with consistent dark terminal theme (Dank Mono font, green/pink accents)
  - Release new extension versions: v1.1.0, v1.2.0, v1.2.1

  LLM Context Builder Improvements:
  - Reorganize context files from llmtxt/ to llm.txt/ with better structure
  - Separate diagram templates from text content (diagrams/ and txt/ subdirectories)
  - Add comprehensive context files for all major Crawl4AI components
  - Improve file naming convention for better discoverability

  Documentation Updates:
  - Update apps index page to match main documentation theme
  - Standardize color scheme: "Available" tags use primary color (#50ffff)
  - Change "Coming Soon" tags to dark gray for better visual hierarchy
  - Add interactive two-column layout for extension landing page
  - Include code examples for both Schema Builder and Script Builder features

  Technical Improvements:
  - Enhance event capture mechanism with better element selection
  - Add support for contenteditable elements and complex form interactions
  - Implement proper scroll event handling for both window and element scrolling
  - Add meta key support for keyboard shortcuts
  - Improve selector generation for more reliable element targeting

  The Script Builder is released as Alpha, acknowledging potential bugs while providing
  early access to this powerful automation recording feature.
2025-06-08 22:02:12 +08:00


## Deep Crawling Workflows and Architecture
Visual representations of multi-level website exploration, filtering strategies, and intelligent crawling patterns.
### Deep Crawl Strategy Overview
```mermaid
flowchart TD
A[Start Deep Crawl] --> B{Strategy Selection}
B -->|Explore All Levels| C[BFS Strategy]
B -->|Dive Deep Fast| D[DFS Strategy]
B -->|Smart Prioritization| E[Best-First Strategy]
C --> C1[Breadth-First Search]
C1 --> C2[Process all depth 0 links]
C2 --> C3[Process all depth 1 links]
C3 --> C4[Continue by depth level]
D --> D1[Depth-First Search]
D1 --> D2[Follow first link deeply]
D2 --> D3[Backtrack when max depth reached]
D3 --> D4[Continue with next branch]
E --> E1[Best-First Search]
E1 --> E2[Score all discovered URLs]
E2 --> E3[Process highest scoring URLs first]
E3 --> E4[Continuously re-prioritize queue]
C4 --> F[Apply Filters]
D4 --> F
E4 --> F
F --> G{Filter Chain Processing}
G -->|Domain Filter| G1[Check allowed/blocked domains]
G -->|URL Pattern Filter| G2[Match URL patterns]
G -->|Content Type Filter| G3[Verify content types]
G -->|SEO Filter| G4[Evaluate SEO quality]
G -->|Content Relevance| G5[Score content relevance]
G1 --> H{Passed All Filters?}
G2 --> H
G3 --> H
G4 --> H
G5 --> H
H -->|Yes| I[Add to Crawl Queue]
H -->|No| J[Discard URL]
I --> K{Processing Mode}
K -->|Streaming| L[Process Immediately]
K -->|Batch| M[Collect All Results]
L --> N[Stream Result to User]
M --> O[Return Complete Result Set]
J --> P{More URLs in Queue?}
N --> P
O --> P
P -->|Yes| Q{Within Limits?}
P -->|No| R[Deep Crawl Complete]
Q -->|Max Depth OK| S{Max Pages OK?}
Q -->|Max Depth Exceeded| T[Skip Deeper URLs]
S -->|Under Limit| U[Continue Crawling]
S -->|Limit Reached| R
T --> P
U --> F
style A fill:#e1f5fe
style R fill:#c8e6c9
style C fill:#fff3e0
style D fill:#f3e5f5
style E fill:#e8f5e8
```
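The three strategies differ mainly in how the frontier orders URLs. A minimal stand-alone sketch (a toy link graph with hypothetical paths, not crawl4ai's implementation) shows how the same loop becomes BFS or DFS depending on whether the frontier is popped FIFO or LIFO:

```python
from collections import deque

# Toy link graph: page -> outbound links (hypothetical paths for illustration).
GRAPH = {
    "/": ["/a", "/b"],
    "/a": ["/a1", "/a2"],
    "/b": ["/b1"],
    "/a1": [], "/a2": [], "/b1": [],
}

def crawl_order(start: str, strategy: str) -> list[str]:
    """Return the visit order for a BFS (FIFO) or DFS (LIFO) frontier."""
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        # BFS pops the oldest discovered URL; DFS pops the newest.
        url = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(url)
        for link in GRAPH.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

print(crawl_order("/", "bfs"))  # ['/', '/a', '/b', '/a1', '/a2', '/b1']
```

BFS finishes each depth level before descending; DFS follows one branch to exhaustion before backtracking, exactly as in the flowchart above.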
### Deep Crawl Strategy Comparison
```mermaid
graph TB
subgraph "BFS - Breadth-First Search"
BFS1[Level 0: Start URL]
BFS2[Level 1: All direct links]
BFS3[Level 2: All second-level links]
BFS4[Level 3: All third-level links]
BFS1 --> BFS2
BFS2 --> BFS3
BFS3 --> BFS4
BFS_NOTE[Complete each depth before going deeper<br/>Good for site mapping<br/>Memory intensive for wide sites]
end
subgraph "DFS - Depth-First Search"
DFS1[Start URL]
DFS2[First Link → Deep]
DFS3[Follow until max depth]
DFS4[Backtrack and try next]
DFS1 --> DFS2
DFS2 --> DFS3
DFS3 --> DFS4
DFS4 --> DFS2
DFS_NOTE[Go deep on first path<br/>Memory efficient<br/>May miss important pages]
end
subgraph "Best-First - Priority Queue"
BF1[Start URL]
BF2[Score all discovered links]
BF3[Process highest scoring first]
BF4[Continuously re-prioritize]
BF1 --> BF2
BF2 --> BF3
BF3 --> BF4
BF4 --> BF2
BF_NOTE[Intelligent prioritization<br/>Finds relevant content fast<br/>Recommended for most use cases]
end
style BFS1 fill:#e3f2fd
style DFS1 fill:#f3e5f5
style BF1 fill:#e8f5e8
style BFS_NOTE fill:#fff3e0
style DFS_NOTE fill:#fff3e0
style BF_NOTE fill:#fff3e0
```
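Best-First replaces the FIFO/LIFO frontier with a priority queue. A sketch using Python's `heapq` and a toy keyword scorer (the graph, paths, and scoring function are illustrative, not crawl4ai's scorer):

```python
import heapq

def score(url: str, keywords=("crawl", "docs")) -> float:
    """Toy relevance score: fraction of keywords appearing in the URL text."""
    return sum(k in url for k in keywords) / len(keywords)

LINKS = {
    "/": ["/about", "/docs", "/docs/crawl"],
    "/docs": ["/docs/api"],
    "/docs/crawl": [], "/about": [], "/docs/api": [],
}

def best_first(start: str) -> list[str]:
    # heapq is a min-heap, so push negated scores to pop the highest first.
    heap = [(-score(start), start)]
    seen = {start}
    order = []
    while heap:
        _, url = heapq.heappop(heap)
        order.append(url)
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                heapq.heappush(heap, (-score(link), link))
    return order

print(best_first("/"))  # ['/', '/docs/crawl', '/docs', '/docs/api', '/about']
```

Note how `/docs/crawl` (both keywords) jumps ahead of its siblings even though all three were discovered at the same depth — this is the "continuously re-prioritize" behavior in the diagram.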
### Filter Chain Processing Sequence
```mermaid
sequenceDiagram
participant URL as Discovered URL
participant Chain as Filter Chain
participant Domain as Domain Filter
participant Pattern as URL Pattern Filter
participant Content as Content Type Filter
participant SEO as SEO Filter
participant Relevance as Content Relevance Filter
participant Queue as Crawl Queue
URL->>Chain: Process URL
Chain->>Domain: Check domain rules
alt Domain Allowed
Domain-->>Chain: ✓ Pass
Chain->>Pattern: Check URL patterns
alt Pattern Matches
Pattern-->>Chain: ✓ Pass
Chain->>Content: Check content type
alt Content Type Valid
Content-->>Chain: ✓ Pass
Chain->>SEO: Evaluate SEO quality
alt SEO Score Above Threshold
SEO-->>Chain: ✓ Pass
Chain->>Relevance: Score content relevance
alt Relevance Score High
Relevance-->>Chain: ✓ Pass
Chain->>Queue: Add to crawl queue
Queue-->>URL: Queued for crawling
else Relevance Score Low
Relevance-->>Chain: ✗ Reject
Chain-->>URL: Filtered out - Low relevance
end
else SEO Score Low
SEO-->>Chain: ✗ Reject
Chain-->>URL: Filtered out - Poor SEO
end
else Invalid Content Type
Content-->>Chain: ✗ Reject
Chain-->>URL: Filtered out - Wrong content type
end
else Pattern Mismatch
Pattern-->>Chain: ✗ Reject
Chain-->>URL: Filtered out - Pattern mismatch
end
else Domain Blocked
Domain-->>Chain: ✗ Reject
Chain-->>URL: Filtered out - Blocked domain
end
```
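The sequence above is a short-circuit chain: the first filter to reject wins, and later filters never run. A sketch using plain predicate functions (crawl4ai ships these as filter classes with richer options; the names and rules below are simplified stand-ins):

```python
from urllib.parse import urlparse

# Hypothetical predicate-style filters. Each returns (passed, rejection_reason).
def domain_filter(url, allowed=("docs.example.com",)):
    return urlparse(url).netloc in allowed, "blocked domain"

def pattern_filter(url, blocked=("/login", "/admin")):
    return not any(p in url for p in blocked), "pattern mismatch"

def content_type_filter(url):
    # Crude stand-in: reject obvious binary assets by extension.
    return not urlparse(url).path.endswith((".pdf", ".zip", ".jpg")), "wrong content type"

FILTER_CHAIN = [domain_filter, pattern_filter, content_type_filter]

def apply_chain(url):
    """Run filters in order; the first rejection short-circuits the chain."""
    for check in FILTER_CHAIN:
        ok, reason = check(url)
        if not ok:
            return False, reason
    return True, "passed"

print(apply_chain("https://docs.example.com/file.pdf"))  # (False, 'wrong content type')
```

Ordering cheap filters (domain, pattern) before expensive ones (SEO, relevance scoring) keeps most rejections fast, which is why the chain in the diagram runs in that order.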
### URL Lifecycle State Machine
```mermaid
stateDiagram-v2
[*] --> Discovered: Found on page
Discovered --> FilterPending: Enter filter chain
FilterPending --> DomainCheck: Apply domain filter
DomainCheck --> PatternCheck: Domain allowed
DomainCheck --> Rejected: Domain blocked
PatternCheck --> ContentCheck: Pattern matches
PatternCheck --> Rejected: Pattern mismatch
ContentCheck --> SEOCheck: Content type valid
ContentCheck --> Rejected: Invalid content
SEOCheck --> RelevanceCheck: SEO score sufficient
SEOCheck --> Rejected: Poor SEO score
RelevanceCheck --> Scored: Relevance score calculated
RelevanceCheck --> Rejected: Low relevance
Scored --> Queued: Added to priority queue
Queued --> Crawling: Selected for processing
Crawling --> Success: Page crawled successfully
Crawling --> Failed: Crawl failed
Success --> LinkExtraction: Extract new links
LinkExtraction --> [*]: Process complete
Failed --> [*]: Record failure
Rejected --> [*]: Log rejection reason
note right of Scored : Score determines priority<br/>in Best-First strategy
note right of Failed : Errors logged with<br/>depth and reason
```
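The lifecycle can be enforced in code with an explicit transition table. A sketch (states named after the diagram; the individual filter checks are collapsed into a single DISCOVERED → SCORED/REJECTED step for brevity):

```python
from enum import Enum, auto

class URLState(Enum):
    DISCOVERED = auto()
    SCORED = auto()
    QUEUED = auto()
    CRAWLING = auto()
    SUCCESS = auto()
    FAILED = auto()
    REJECTED = auto()

# Legal transitions mirroring the state diagram above.
TRANSITIONS = {
    URLState.DISCOVERED: {URLState.SCORED, URLState.REJECTED},
    URLState.SCORED: {URLState.QUEUED},
    URLState.QUEUED: {URLState.CRAWLING},
    URLState.CRAWLING: {URLState.SUCCESS, URLState.FAILED},
}

def advance(state: URLState, new: URLState) -> URLState:
    """Move a URL to a new state, rejecting transitions the diagram forbids."""
    if new not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition {state.name} -> {new.name}")
    return new
```

Making illegal transitions raise (e.g. QUEUED → SUCCESS without ever CRAWLING) turns bookkeeping bugs into loud errors instead of silently corrupted crawl stats.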
### Streaming vs Batch Processing Architecture
```mermaid
graph TB
subgraph "Input"
A[Start URL] --> B[Deep Crawl Strategy]
end
subgraph "Crawl Engine"
B --> C[URL Discovery]
C --> D[Filter Chain]
D --> E[Priority Queue]
E --> F[Page Processor]
end
subgraph "Streaming Mode (stream=True)"
F --> G1[Process Page]
G1 --> H1[Extract Content]
H1 --> I1[Yield Result Immediately]
I1 --> J1[async for result]
J1 --> K1[Real-time Processing]
G1 --> L1[Extract Links]
L1 --> M1[Add to Queue]
M1 --> F
end
subgraph "Batch Mode (stream=False)"
F --> G2[Process Page]
G2 --> H2[Extract Content]
H2 --> I2[Store Result]
I2 --> N2[Result Collection]
G2 --> L2[Extract Links]
L2 --> M2[Add to Queue]
M2 --> O2{More URLs?}
O2 -->|Yes| F
O2 -->|No| P2[Return All Results]
P2 --> Q2[Batch Processing]
end
style I1 fill:#e8f5e8
style K1 fill:#e8f5e8
style P2 fill:#e3f2fd
style Q2 fill:#e3f2fd
```
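The two modes consume the same producer differently: streaming handles each result inside an `async for` as it arrives, batch drains the generator and returns one list. A self-contained sketch (the fetch is faked with `asyncio.sleep(0)`; crawl4ai's own `arun` yields richer result objects):

```python
import asyncio

async def crawl_stream(urls):
    """Streaming mode: yield each result as soon as it is ready."""
    for url in urls:
        await asyncio.sleep(0)              # stand-in for the real page fetch
        yield {"url": url, "status": "ok"}

async def run_streaming(urls):
    handled = []
    async for result in crawl_stream(urls):  # process results in real time
        handled.append(result["url"])
    return handled

async def run_batch(urls):
    # Batch mode: drain the same generator fully, return everything at once.
    return [r["url"] async for r in crawl_stream(urls)]

print(asyncio.run(run_streaming(["/a", "/b"])))  # ['/a', '/b']
```

Streaming keeps memory flat and lets you react to early results (or abort) before the crawl finishes; batch is simpler when you need the complete result set anyway.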
### Advanced Scoring and Prioritization System
```mermaid
flowchart LR
subgraph "URL Discovery"
A[Page Links] --> B[Extract URLs]
B --> C[Normalize URLs]
end
subgraph "Scoring System"
C --> D[Keyword Relevance Scorer]
D --> D1[URL Text Analysis]
D --> D2[Keyword Matching]
D --> D3[Calculate Base Score]
D3 --> E[Additional Scoring Factors]
E --> E1[URL Structure weight: 0.2]
E --> E2[Link Context weight: 0.3]
E --> E3[Page Depth Penalty weight: 0.1]
E --> E4[Domain Authority weight: 0.4]
D1 --> F[Combined Score]
D2 --> F
D3 --> F
E1 --> F
E2 --> F
E3 --> F
E4 --> F
end
subgraph "Prioritization"
F --> G{Score Threshold}
G -->|Above Threshold| H[Priority Queue]
G -->|Below Threshold| I[Discard URL]
H --> J[Best-First Selection]
J --> K[Highest Score First]
K --> L[Process Page]
L --> M[Update Scores]
M --> N[Re-prioritize Queue]
N --> J
end
style F fill:#fff3e0
style H fill:#e8f5e8
style L fill:#e3f2fd
```
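The combined score is a weighted blend of the per-factor scores. A sketch using the illustrative weights from the diagram (each factor is assumed normalized to [0, 1]; the depth factor is inverted here so shallower pages score closer to 1 and all weights add positively):

```python
# Illustrative weights taken from the diagram above.
WEIGHTS = {"structure": 0.2, "context": 0.3, "depth": 0.1, "authority": 0.4}

def combined_score(factors: dict[str, float]) -> float:
    """Blend per-factor scores (each in [0, 1]) into one priority value."""
    return sum(WEIGHTS[name] * factors[name] for name in WEIGHTS)

def passes_threshold(factors: dict[str, float], threshold: float = 0.3) -> bool:
    """URLs below the threshold are discarded rather than queued."""
    return combined_score(factors) >= threshold
```

Because the weights sum to 1.0, the combined score stays in [0, 1] and the threshold has a stable meaning no matter which factors dominate.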
### Deep Crawl Performance and Limits
```mermaid
graph TD
subgraph "Crawl Constraints"
A[Max Depth: 2] --> A1[Prevents infinite crawling]
B[Max Pages: 50] --> B1[Controls resource usage]
C[Score Threshold: 0.3] --> C1[Quality filtering]
D[Domain Limits] --> D1[Scope control]
end
subgraph "Performance Monitoring"
E[Pages Crawled] --> F[Depth Distribution]
E --> G[Success Rate]
E --> H[Average Score]
E --> I[Processing Time]
F --> J[Performance Report]
G --> J
H --> J
I --> J
end
subgraph "Resource Management"
K[Memory Usage] --> L{Memory Threshold}
L -->|Under Limit| M[Continue Crawling]
L -->|Over Limit| N[Reduce Concurrency]
O[CPU Usage] --> P{CPU Threshold}
P -->|Normal| M
P -->|High| Q[Add Delays]
R[Network Load] --> S{Rate Limits}
S -->|OK| M
S -->|Exceeded| T[Throttle Requests]
end
M --> U[Optimal Performance]
N --> V[Reduced Performance]
Q --> V
T --> V
style U fill:#c8e6c9
style V fill:#fff3e0
style J fill:#e3f2fd
```
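The constraint checks are cheap guards evaluated before every crawl. A sketch mirroring the limits in the diagram (the defaults 2 / 50 / 0.3 are the diagram's example values, not library defaults):

```python
def within_limits(depth, pages_crawled, score, *,
                  max_depth=2, max_pages=50, score_threshold=0.3):
    """Check one candidate URL against depth, page-budget, and quality limits."""
    if depth > max_depth:
        return False, "max depth exceeded"
    if pages_crawled >= max_pages:
        return False, "page limit reached"
    if score < score_threshold:
        return False, "below score threshold"
    return True, "ok"
```

Returning the failing reason alongside the boolean makes the skip/stop branches in the flowchart trivially loggable.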
### Error Handling and Recovery Flow
```mermaid
sequenceDiagram
participant Strategy as Deep Crawl Strategy
participant Queue as Priority Queue
participant Crawler as Page Crawler
participant Error as Error Handler
participant Result as Result Collector
Strategy->>Queue: Get next URL
Queue-->>Strategy: Return highest priority URL
Strategy->>Crawler: Crawl page
alt Successful Crawl
Crawler-->>Strategy: Return page content
Strategy->>Result: Store successful result
Strategy->>Strategy: Extract new links
Strategy->>Queue: Add new URLs to queue
else Network Error
Crawler-->>Error: Network timeout/failure
Error->>Error: Log error with details
Error->>Queue: Mark URL as failed
Error-->>Strategy: Skip to next URL
else Parse Error
Crawler-->>Error: HTML parsing failed
Error->>Error: Log parse error
Error->>Result: Store failed result
Error-->>Strategy: Continue with next URL
else Rate Limit Hit
Crawler-->>Error: Rate limit exceeded
Error->>Error: Apply backoff strategy
Error->>Queue: Re-queue URL with delay
Error-->>Strategy: Wait before retry
else Depth Limit
Strategy->>Strategy: Check depth constraint
Strategy-->>Queue: Skip URL - too deep
else Page Limit
Strategy->>Strategy: Check page count
Strategy-->>Result: Stop crawling - limit reached
end
Strategy->>Queue: Request next URL
Queue-->>Strategy: More URLs available?
alt Queue Empty
Queue-->>Result: Crawl complete
else Queue Has URLs
Queue-->>Strategy: Continue crawling
end
```
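The rate-limit branch above re-queues with a backoff delay instead of giving up. A sketch of that dispatch with full-jitter exponential backoff (the error-kind strings and queue shape are illustrative, not crawl4ai's error handler):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter exponential backoff: random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def handle_error(url: str, kind: str, attempt: int, queue: list) -> str:
    """Route a failure the way the sequence diagram does."""
    if kind == "rate_limit":
        # Re-queue with a delay instead of discarding the URL.
        queue.append((url, attempt + 1, backoff_delay(attempt)))
        return "requeued"
    if kind == "network":
        return "skipped"         # logged, move on to the next URL
    if kind == "parse":
        return "stored_failed"   # kept as a failed result for reporting
    return "unknown"
```

Jitter spreads retries out in time, so a burst of rate-limited URLs does not hammer the server again in lockstep when their delays expire.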
**📖 Learn more:** [Deep Crawling Strategies](https://docs.crawl4ai.com/core/deep-crawling/), [Content Filtering](https://docs.crawl4ai.com/core/content-selection/), [Advanced Crawling Patterns](https://docs.crawl4ai.com/advanced/advanced-features/)