## Extraction Strategy Workflows and Architecture

Visual representations of Crawl4AI's data extraction approaches, strategy selection, and processing workflows.

### Extraction Strategy Decision Tree

```mermaid
flowchart TD
    A[Content to Extract] --> B{Content Type?}

    B -->|Simple Patterns| C[Common Data Types]
    B -->|Structured HTML| D[Predictable Structure]
    B -->|Complex Content| E[Requires Reasoning]
    B -->|Mixed Content| F[Multiple Data Types]

    C --> C1{Pattern Type?}
    C1 -->|Email, Phone, URLs| C2[Built-in Regex Patterns]
    C1 -->|Custom Patterns| C3[Custom Regex Strategy]
    C1 -->|LLM-Generated| C4[One-time Pattern Generation]

    D --> D1{Selector Type?}
    D1 -->|CSS Selectors| D2[JsonCssExtractionStrategy]
    D1 -->|XPath Expressions| D3[JsonXPathExtractionStrategy]
    D1 -->|Need Schema?| D4[Auto-generate Schema with LLM]

    E --> E1{LLM Provider?}
    E1 -->|OpenAI/Anthropic| E2[Cloud LLM Strategy]
    E1 -->|Local Ollama| E3[Local LLM Strategy]
    E1 -->|Cost-sensitive| E4[Hybrid: Generate Schema Once]

    F --> F1[Multi-Strategy Approach]
    F1 --> F2[1. Regex for Patterns]
    F1 --> F3[2. CSS for Structure]
    F1 --> F4[3. LLM for Complex Analysis]

    C2 --> G[Fast Extraction ⚡]
    C3 --> G
    C4 --> H[Cached Pattern Reuse]

    D2 --> I[Schema-based Extraction 🏗️]
    D3 --> I
    D4 --> J[Generated Schema Cache]

    E2 --> K[Intelligent Parsing 🧠]
    E3 --> K
    E4 --> L[Hybrid Cost-Effective]

    F2 --> M[Comprehensive Results 📊]
    F3 --> M
    F4 --> M

    style G fill:#c8e6c9
    style I fill:#e3f2fd
    style K fill:#fff3e0
    style M fill:#f3e5f5
    style H fill:#e8f5e8
    style J fill:#e8f5e8
    style L fill:#ffecb3
```
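
The decision tree above can be sketched as a simple lookup. This is an illustrative helper, not part of the Crawl4AI API; the strategy names mirror Crawl4AI's class names, but the classification keys are hypothetical.

```python
# Minimal sketch of the decision tree as a lookup table, assuming the
# caller has already classified the content (hypothetical categories).
STRATEGY_FOR = {
    "simple_patterns": "RegexExtractionStrategy",
    "structured_html": "JsonCssExtractionStrategy",
    "complex_content": "LLMExtractionStrategy",
    "mixed_content": "multi-strategy pipeline",
}

def pick_strategy(content_kind: str) -> str:
    """Map a rough content classification to the branch the tree suggests."""
    # Unknown content falls back to the most general (LLM) branch.
    return STRATEGY_FOR.get(content_kind, "LLMExtractionStrategy")
```

In practice you would follow the tree's finer branches (e.g. CSS vs. XPath, cloud vs. local LLM) after this first split.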

### LLM Extraction Strategy Workflow

```mermaid
sequenceDiagram
    participant User
    participant Crawler
    participant LLMStrategy
    participant Chunker
    participant LLMProvider
    participant Parser

    User->>Crawler: Configure LLMExtractionStrategy
    User->>Crawler: arun(url, config)

    Crawler->>Crawler: Navigate to URL
    Crawler->>Crawler: Extract content (HTML/Markdown)
    Crawler->>LLMStrategy: Process content

    LLMStrategy->>LLMStrategy: Check content size

    alt Content > chunk_threshold
        LLMStrategy->>Chunker: Split into chunks with overlap
        Chunker-->>LLMStrategy: Return chunks[]

        loop For each chunk
            LLMStrategy->>LLMProvider: Send chunk + schema + instruction
            LLMProvider-->>LLMStrategy: Return structured JSON
        end

        LLMStrategy->>LLMStrategy: Merge chunk results
    else Content <= threshold
        LLMStrategy->>LLMProvider: Send full content + schema
        LLMProvider-->>LLMStrategy: Return structured JSON
    end

    LLMStrategy->>Parser: Validate JSON schema
    Parser-->>LLMStrategy: Validated data

    LLMStrategy->>LLMStrategy: Track token usage
    LLMStrategy-->>Crawler: Return extracted_content

    Crawler-->>User: CrawlResult with JSON data

    User->>LLMStrategy: show_usage()
    LLMStrategy-->>User: Token count & estimated cost
```
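
The chunking step in this workflow can be sketched in plain Python. This is a simplified stand-in: it splits on words, whereas Crawl4AI's chunker works on token counts from the configured tokenizer.

```python
# Sketch of "Split into chunks with overlap": consecutive chunks share
# `overlap` words at each seam so entities spanning a boundary are not lost.
def split_with_overlap(words, chunk_size, overlap):
    """Split a word list into chunks of `chunk_size` overlapping by `overlap`."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # last chunk reached the end of the content
    return chunks

chunks = split_with_overlap("one two three four five six seven eight".split(),
                            chunk_size=4, overlap=1)
```

Each chunk is then sent to the LLM independently, and the per-chunk JSON results are merged, as the sequence diagram shows.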

### Schema-Based Extraction Architecture

```mermaid
graph TB
    subgraph "Schema Definition"
        A[JSON Schema] --> A1[baseSelector]
        A --> A2["fields[]"]
        A --> A3[nested structures]

        A2 --> A4[CSS/XPath selectors]
        A2 --> A5[Data types: text, html, attribute]
        A2 --> A6[Default values]

        A3 --> A7[nested objects]
        A3 --> A8[nested_list arrays]
        A3 --> A9[simple lists]
    end

    subgraph "Extraction Engine"
        B[HTML Content] --> C[Selector Engine]
        C --> C1[CSS Selector Parser]
        C --> C2[XPath Evaluator]

        C1 --> D[Element Matcher]
        C2 --> D

        D --> E[Type Converter]
        E --> E1[Text Extraction]
        E --> E2[HTML Preservation]
        E --> E3[Attribute Extraction]
        E --> E4[Nested Processing]
    end

    subgraph "Result Processing"
        F[Raw Extracted Data] --> G[Structure Builder]
        G --> G1[Object Construction]
        G --> G2[Array Assembly]
        G --> G3[Type Validation]

        G1 --> H[JSON Output]
        G2 --> H
        G3 --> H
    end

    A --> C
    E --> F
    H --> I[extracted_content]

    style A fill:#e3f2fd
    style C fill:#f3e5f5
    style G fill:#e8f5e8
    style H fill:#c8e6c9
```
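
The "Schema Definition" box above corresponds to a plain dictionary. The key names (`baseSelector`, `fields`, `type`, `attribute`) follow Crawl4AI's schema convention; the CSS selectors themselves are illustrative, not from a real site.

```python
# Hedged sketch of a schema for JsonCssExtractionStrategy; selectors
# (div.product, h2.title, ...) are hypothetical placeholders.
product_schema = {
    "name": "Products",
    "baseSelector": "div.product",  # one extracted object per match
    "fields": [
        {"name": "title", "selector": "h2.title", "type": "text"},
        {"name": "url", "selector": "a", "type": "attribute", "attribute": "href"},
        {"name": "blurb", "selector": "div.desc", "type": "html"},
    ],
}

# The dict is then handed to the strategy, roughly:
#   strategy = JsonCssExtractionStrategy(product_schema)
#   config = CrawlerRunConfig(extraction_strategy=strategy)
```

Each `type` maps to a branch of the Type Converter in the diagram: `text`, `html`, or `attribute` extraction.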

### Automatic Schema Generation Process

```mermaid
stateDiagram-v2
    [*] --> CheckCache

    CheckCache --> CacheHit: Schema exists
    CheckCache --> SamplePage: Schema missing

    CacheHit --> LoadSchema
    LoadSchema --> FastExtraction

    SamplePage --> ExtractHTML: Crawl sample URL
    ExtractHTML --> LLMAnalysis: Send HTML to LLM
    LLMAnalysis --> GenerateSchema: Create CSS/XPath selectors
    GenerateSchema --> ValidateSchema: Test generated schema

    ValidateSchema --> SchemaWorks: Valid selectors
    ValidateSchema --> RefineSchema: Invalid selectors

    RefineSchema --> LLMAnalysis: Iterate with feedback

    SchemaWorks --> CacheSchema: Save for reuse
    CacheSchema --> FastExtraction: Use cached schema

    FastExtraction --> [*]: No more LLM calls needed

    note right of CheckCache : One-time LLM cost
    note right of FastExtraction : Unlimited fast reuse
    note right of CacheSchema : JSON file storage
```
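
The cache-check loop above can be sketched with stdlib JSON file storage. The `generate` callable stands in for the one-time LLM schema-generation call; everything here is an illustrative pattern, not Crawl4AI's internal implementation.

```python
import json
import tempfile
from pathlib import Path

def load_or_generate_schema(cache_path: Path, generate):
    """Return a cached schema, generating and saving it only on first use."""
    if cache_path.exists():                       # CacheHit -> LoadSchema
        return json.loads(cache_path.read_text())
    schema = generate()                           # one-time LLM cost
    cache_path.write_text(json.dumps(schema, indent=2))  # CacheSchema
    return schema

# Demo with a stand-in generator; in Crawl4AI this would be the LLM call.
calls = []
def fake_generate():
    calls.append(1)
    return {"baseSelector": "div.item"}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "schema.json"
    first = load_or_generate_schema(path, fake_generate)
    second = load_or_generate_schema(path, fake_generate)  # served from cache
```

After the first run, every later extraction uses the cached schema with no LLM calls, which is the whole point of the hybrid approach.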

### Multi-Strategy Extraction Pipeline

```mermaid
flowchart LR
    A[Web Page Content] --> B

    subgraph B["Extraction Pipeline"]
        B1[Stage 1: Regex Patterns]
        B2[Stage 2: Schema-based CSS]
        B3[Stage 3: LLM Analysis]

        B1 --> B1a[Email addresses]
        B1 --> B1b[Phone numbers]
        B1 --> B1c[URLs and links]
        B1 --> B1d[Currency amounts]

        B2 --> B2a[Structured products]
        B2 --> B2b[Article metadata]
        B2 --> B2c[User reviews]
        B2 --> B2d[Navigation links]

        B3 --> B3a[Sentiment analysis]
        B3 --> B3b[Key topics]
        B3 --> B3c[Entity recognition]
        B3 --> B3d[Content summary]
    end

    B1a --> C[Result Merger]
    B1b --> C
    B1c --> C
    B1d --> C

    B2a --> C
    B2b --> C
    B2c --> C
    B2d --> C

    B3a --> C
    B3b --> C
    B3c --> C
    B3d --> C

    C --> D[Combined JSON Output]
    D --> E[Final CrawlResult]

    style B1 fill:#c8e6c9
    style B2 fill:#e3f2fd
    style B3 fill:#fff3e0
    style C fill:#f3e5f5
```
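
The Result Merger stage can be sketched as a namespacing step: each strategy produces its own dict of findings, and the merger keys them by stage so results from different strategies never collide. This is an illustrative pattern; stage names are hypothetical.

```python
# Sketch of the Result Merger: combine per-strategy outputs into one
# JSON-ready dict, labeled by the stage that produced each part.
def merge_results(**stage_results):
    """Combine keyword-named stage outputs into a single labeled dict."""
    return dict(stage_results)

combined = merge_results(
    regex={"emails": ["a@b.com"]},
    css={"products": [{"name": "Widget"}]},
    llm={"summary": "A product page."},
)
```

The combined dict is what the diagram calls "Combined JSON Output", ready to serialize into the final `CrawlResult`.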

### Performance Comparison Matrix

```mermaid
graph TD
    subgraph "Strategy Performance"
        A[Extraction Strategy Comparison]

        subgraph "Speed ⚡"
            S1[Regex: ~10ms]
            S2[CSS Schema: ~50ms]
            S3[XPath: ~100ms]
            S4[LLM: ~2-10s]
        end

        subgraph "Accuracy 🎯"
            A1[Regex: Pattern-dependent]
            A2[CSS: High for structured]
            A3[XPath: Very high]
            A4[LLM: Excellent for complex]
        end

        subgraph "Cost 💰"
            C1[Regex: Free]
            C2[CSS: Free]
            C3[XPath: Free]
            C4[LLM: $0.001-0.01 per page]
        end

        subgraph "Complexity 🔧"
            X1[Regex: Simple patterns only]
            X2[CSS: Structured HTML]
            X3[XPath: Complex selectors]
            X4[LLM: Any content type]
        end
    end

    style S1 fill:#c8e6c9
    style S2 fill:#e8f5e8
    style S3 fill:#fff3e0
    style S4 fill:#ffcdd2

    style A2 fill:#e8f5e8
    style A3 fill:#c8e6c9
    style A4 fill:#c8e6c9

    style C1 fill:#c8e6c9
    style C2 fill:#c8e6c9
    style C3 fill:#c8e6c9
    style C4 fill:#fff3e0

    style X1 fill:#ffcdd2
    style X2 fill:#e8f5e8
    style X3 fill:#c8e6c9
    style X4 fill:#c8e6c9
```

### Regex Pattern Strategy Flow

```mermaid
flowchart TD
    A[Regex Extraction] --> B{Pattern Source?}

    B -->|Built-in| C[Use Predefined Patterns]
    B -->|Custom| D[Define Custom Regex]
    B -->|LLM-Generated| E[Generate with AI]

    C --> C1[Email Pattern]
    C --> C2[Phone Pattern]
    C --> C3[URL Pattern]
    C --> C4[Currency Pattern]
    C --> C5[Date Pattern]

    D --> D1[Write Custom Regex]
    D --> D2[Test Pattern]
    D --> D3{Pattern Works?}
    D3 -->|No| D1
    D3 -->|Yes| D4[Use Pattern]

    E --> E1[Provide Sample Content]
    E --> E2[LLM Analyzes Content]
    E --> E3[Generate Optimized Regex]
    E --> E4[Cache Pattern for Reuse]

    C1 --> F[Pattern Matching]
    C2 --> F
    C3 --> F
    C4 --> F
    C5 --> F
    D4 --> F
    E4 --> F

    F --> G[Extract Matches]
    G --> H[Group by Pattern Type]
    H --> I[JSON Output with Labels]

    style C fill:#e8f5e8
    style D fill:#e3f2fd
    style E fill:#fff3e0
    style F fill:#f3e5f5
```
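
The "Pattern Matching → Group by Pattern Type → JSON Output with Labels" path can be sketched with the stdlib `re` module. The email and URL regexes here are simplified stand-ins, not Crawl4AI's built-in patterns.

```python
import re

# Simplified stand-ins for built-in patterns; real-world email/URL regexes
# are considerably more involved.
PATTERNS = {
    "email": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "url": r"https?://[^\s\"<>]+",
}

def extract_patterns(text):
    """Return matches grouped by pattern label, ready for labeled JSON output."""
    return {label: re.findall(rx, text) for label, rx in PATTERNS.items()}

found = extract_patterns(
    "Contact sales@example.com or see https://example.com/pricing"
)
```

Grouping by label is what makes the output self-describing: consumers can pick out `found["email"]` without knowing which regex produced it.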

### Complex Schema Structure Visualization

```mermaid
graph TB
    subgraph "E-commerce Schema Example"
        A[Category baseSelector] --> B[Category Fields]
        A --> C[Products nested_list]

        B --> B1[category_name]
        B --> B2[category_id attribute]
        B --> B3[category_url attribute]

        C --> C1[Product baseSelector]
        C1 --> C2[name text]
        C1 --> C3[price text]
        C1 --> C4[Details nested object]
        C1 --> C5[Features list]
        C1 --> C6[Reviews nested_list]

        C4 --> C4a[brand text]
        C4 --> C4b[model text]
        C4 --> C4c[specs html]

        C5 --> C5a[feature text array]

        C6 --> C6a[reviewer text]
        C6 --> C6b[rating attribute]
        C6 --> C6c[comment text]
        C6 --> C6d[date attribute]
    end

    subgraph "JSON Output Structure"
        D[categories array] --> D1[category object]
        D1 --> D2[category_name]
        D1 --> D3[category_id]
        D1 --> D4[products array]

        D4 --> D5[product object]
        D5 --> D6[name, price]
        D5 --> D7[details object]
        D5 --> D8[features array]
        D5 --> D9[reviews array]

        D7 --> D7a[brand, model, specs]
        D8 --> D8a[feature strings]
        D9 --> D9a[review objects]
    end

    A -.-> D
    B1 -.-> D2
    C2 -.-> D6
    C4 -.-> D7
    C5 -.-> D8
    C6 -.-> D9

    style A fill:#e3f2fd
    style C fill:#f3e5f5
    style C4 fill:#e8f5e8
    style D fill:#fff3e0
```
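
The e-commerce schema above can be written out as a nested dict. Key names follow Crawl4AI's schema convention (`nested` yields one object, `nested_list` a list of objects, `list` a list of simple values); the selectors themselves are illustrative.

```python
# Hedged sketch of the nested e-commerce schema; all CSS selectors are
# hypothetical placeholders for a real site's markup.
ecommerce_schema = {
    "name": "Categories",
    "baseSelector": "div.category",
    "fields": [
        {"name": "category_name", "selector": "h2", "type": "text"},
        {"name": "category_id", "selector": "h2", "type": "attribute",
         "attribute": "data-id"},
        {
            "name": "products",
            "selector": "div.product",
            "type": "nested_list",          # -> "products" array in the output
            "fields": [
                {"name": "name", "selector": "h3", "type": "text"},
                {"name": "price", "selector": "span.price", "type": "text"},
                {
                    "name": "details",
                    "selector": "div.details",
                    "type": "nested",       # -> single "details" object
                    "fields": [
                        {"name": "brand", "selector": "span.brand", "type": "text"},
                        {"name": "model", "selector": "span.model", "type": "text"},
                    ],
                },
                {
                    "name": "features",
                    "selector": "ul.features li",
                    "type": "list",         # -> array of simple strings
                    "fields": [{"name": "feature", "type": "text"}],
                },
            ],
        },
    ],
}
```

The dotted arrows in the diagram show how each schema node maps to its slot in the JSON output: `nested_list` fields become arrays of objects, `nested` fields become single objects.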

### Error Handling and Fallback Strategy

```mermaid
stateDiagram-v2
    [*] --> PrimaryStrategy

    PrimaryStrategy --> Success: Extraction successful
    PrimaryStrategy --> ValidationFailed: Invalid data
    PrimaryStrategy --> ExtractionFailed: No matches found
    PrimaryStrategy --> TimeoutError: LLM timeout

    ValidationFailed --> FallbackStrategy: Try alternative
    ExtractionFailed --> FallbackStrategy: Try alternative
    TimeoutError --> FallbackStrategy: Try alternative

    FallbackStrategy --> FallbackSuccess: Fallback works
    FallbackStrategy --> FallbackFailed: All strategies failed

    FallbackSuccess --> Success: Return results
    FallbackFailed --> ErrorReport: Log failure details

    Success --> [*]: Complete
    ErrorReport --> [*]: Return empty results

    note right of PrimaryStrategy : Try fastest/most accurate first
    note right of FallbackStrategy : Use simpler but reliable method
    note left of ErrorReport : Provide debugging information
```
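
The state machine above reduces to a loop over strategies ordered from most capable to most reliable. This is an illustrative pattern built around plain callables, not a Crawl4AI API.

```python
# Sketch of the fallback flow: try each (name, callable) pair in order,
# return the first non-empty result, and collect failure details for the
# ErrorReport state.
def extract_with_fallback(strategies, content):
    """Return (result, errors); result is [] if every strategy failed."""
    errors = []
    for name, strategy in strategies:
        try:
            result = strategy(content)
            if result:
                return result, errors          # FallbackSuccess -> Success
            errors.append(f"{name}: no matches")   # ExtractionFailed
        except Exception as exc:
            errors.append(f"{name}: {exc}")        # TimeoutError, etc.
    return [], errors                              # FallbackFailed -> ErrorReport

# Demo: a flaky primary (stand-in for an LLM call) and a reliable fallback.
def flaky_llm(content):
    raise TimeoutError("LLM timeout")

def css_fallback(content):
    return [{"text": content}]

result, errors = extract_with_fallback(
    [("llm", flaky_llm), ("css", css_fallback)], "page body"
)
```

Returning the error list alongside the (possibly empty) result gives callers the debugging information the diagram's ErrorReport state calls for.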

### Token Usage and Cost Optimization

```mermaid
flowchart TD
    A[LLM Extraction Request] --> B{Content Size Check}

    B -->|Small < 1200 tokens| C[Single LLM Call]
    B -->|Large > 1200 tokens| D[Chunking Strategy]

    C --> C1[Send full content]
    C1 --> C2[Parse JSON response]
    C2 --> C3[Track token usage]

    D --> D1[Split into chunks]
    D1 --> D2[Add overlap between chunks]
    D2 --> D3[Process chunks in parallel]

    D3 --> D4[Chunk 1 → LLM]
    D3 --> D5[Chunk 2 → LLM]
    D3 --> D6[Chunk N → LLM]

    D4 --> D7[Merge results]
    D5 --> D7
    D6 --> D7

    D7 --> D8[Deduplicate data]
    D8 --> D9[Aggregate token usage]

    C3 --> E[Cost Calculation]
    D9 --> E

    E --> F[Usage Report]
    F --> F1[Prompt tokens: X]
    F --> F2[Completion tokens: Y]
    F --> F3[Total cost: $Z]

    style C fill:#c8e6c9
    style D fill:#fff3e0
    style E fill:#e3f2fd
    style F fill:#f3e5f5
```
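
The Cost Calculation and Usage Report steps are simple arithmetic over aggregated token counts. The per-token prices below are placeholders, not real provider rates; Crawl4AI's `show_usage()` produces a report of this shape.

```python
# Sketch of the usage report; prompt/completion prices are hypothetical
# per-token USD rates, not quotes from any provider.
def usage_report(prompt_tokens, completion_tokens,
                 prompt_price=0.15 / 1_000_000,
                 completion_price=0.60 / 1_000_000):
    """Aggregate token counts into the report fields the diagram shows."""
    cost = prompt_tokens * prompt_price + completion_tokens * completion_price
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_cost_usd": round(cost, 6),
    }

report = usage_report(prompt_tokens=1000, completion_tokens=500)
```

For chunked extraction, the per-chunk token counts are summed first (the "Aggregate token usage" node) and then fed through the same calculation.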

**📖 Learn more:** [LLM Strategies](https://docs.crawl4ai.com/extraction/llm-strategies/), [Schema-Based Extraction](https://docs.crawl4ai.com/extraction/no-llm-strategies/), [Pattern Matching](https://docs.crawl4ai.com/extraction/no-llm-strategies/#regexextractionstrategy), [Performance Optimization](https://docs.crawl4ai.com/advanced/multi-url-crawling/)