## Simple Crawling Workflows and Data Flow

Visual representations of basic web crawling operations, configuration patterns, and result processing workflows.

### Basic Crawling Sequence

```mermaid
sequenceDiagram
    participant User
    participant Crawler as AsyncWebCrawler
    participant Browser as Browser Instance
    participant Page as Web Page
    participant Processor as Content Processor

    User->>Crawler: Create with BrowserConfig
    Crawler->>Browser: Launch browser instance
    Browser-->>Crawler: Browser ready

    User->>Crawler: arun(url, CrawlerRunConfig)
    Crawler->>Browser: Create new page/context
    Browser->>Page: Navigate to URL
    Page-->>Browser: Page loaded

    Browser->>Processor: Extract raw HTML
    Processor->>Processor: Clean HTML
    Processor->>Processor: Generate markdown
    Processor->>Processor: Extract media/links
    Processor-->>Crawler: CrawlResult created

    Crawler-->>User: Return CrawlResult

    Note over User,Processor: All processing happens asynchronously
```
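
The sequence above corresponds to just a few lines of user code. A minimal sketch of the basic flow (the URL is a placeholder; assumes the standard Crawl4AI imports):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_config = BrowserConfig(headless=True)  # "Launch browser instance"
    run_config = CrawlerRunConfig()                # per-crawl settings

    # The async context manager launches and tears down the browser
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com", config=run_config)
        # The processor has already cleaned the HTML and generated markdown
        print(result.markdown.raw_markdown[:300])

asyncio.run(main())
```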

### Crawling Configuration Flow

```mermaid
flowchart TD
    A[Start Crawling] --> B{Browser Config Set?}

    B -->|No| B1[Use Default BrowserConfig]
    B -->|Yes| B2[Custom BrowserConfig]

    B1 --> C[Launch Browser]
    B2 --> C

    C --> D{Crawler Run Config Set?}

    D -->|No| D1[Use Default CrawlerRunConfig]
    D -->|Yes| D2[Custom CrawlerRunConfig]

    D1 --> E[Navigate to URL]
    D2 --> E

    E --> F{Page Load Success?}
    F -->|No| F1[Return Error Result]
    F -->|Yes| G[Apply Content Filters]

    G --> G1{excluded_tags set?}
    G1 -->|Yes| G2[Remove specified tags]
    G1 -->|No| G3[Keep all tags]
    G2 --> G4{css_selector set?}
    G3 --> G4

    G4 -->|Yes| G5[Extract selected elements]
    G4 -->|No| G6[Process full page]
    G5 --> H[Generate Markdown]
    G6 --> H

    H --> H1{markdown_generator set?}
    H1 -->|Yes| H2[Use custom generator]
    H1 -->|No| H3[Use default generator]
    H2 --> I[Extract Media and Links]
    H3 --> I

    I --> I1{process_iframes?}
    I1 -->|Yes| I2[Include iframe content]
    I1 -->|No| I3[Skip iframes]
    I2 --> J[Create CrawlResult]
    I3 --> J

    J --> K[Return Result]

    style A fill:#e1f5fe
    style K fill:#c8e6c9
    style F1 fill:#ffcdd2
```
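
Both configuration objects are optional; defaults are used when they are not supplied. A sketch that exercises several branches of the flowchart with custom configs (the selector and tag list are illustrative):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_config = BrowserConfig(headless=True, verbose=True)
    run_config = CrawlerRunConfig(
        excluded_tags=["nav", "footer", "aside"],  # "Remove specified tags" branch
        css_selector="main",                       # "Extract selected elements" branch
        process_iframes=True,                      # "Include iframe content" branch
        cache_mode=CacheMode.BYPASS,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com", config=run_config)
        if not result.success:                     # "Return Error Result" path
            print(f"Crawl failed: {result.error_message}")
        else:
            print(result.markdown.raw_markdown[:200])

asyncio.run(main())
```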

### CrawlResult Data Structure

```mermaid
graph TB
    subgraph "CrawlResult Object"
        A[CrawlResult] --> B[Basic Info]
        A --> C[Content Variants]
        A --> D[Extracted Data]
        A --> E[Media Assets]
        A --> F[Optional Outputs]

        B --> B1[url: Final URL]
        B --> B2[success: Boolean]
        B --> B3[status_code: HTTP Status]
        B --> B4[error_message: Error Details]

        C --> C1[html: Raw HTML]
        C --> C2[cleaned_html: Sanitized HTML]
        C --> C3[markdown: MarkdownGenerationResult]

        C3 --> C3A[raw_markdown: Basic conversion]
        C3 --> C3B[markdown_with_citations: With references]
        C3 --> C3C[fit_markdown: Filtered content]
        C3 --> C3D[references_markdown: Citation list]

        D --> D1[links: Internal/External]
        D --> D2[media: Images/Videos/Audio]
        D --> D3[metadata: Page info]
        D --> D4[extracted_content: JSON data]
        D --> D5[tables: Structured table data]

        E --> E1[screenshot: Base64 image]
        E --> E2[pdf: PDF bytes]
        E --> E3[mhtml: Archive file]
        E --> E4[downloaded_files: File paths]

        F --> F1[session_id: Browser session]
        F --> F2[ssl_certificate: Security info]
        F --> F3[response_headers: HTTP headers]
        F --> F4[network_requests: Traffic log]
        F --> F5[console_messages: Browser logs]
    end

    style A fill:#e3f2fd
    style C3 fill:#f3e5f5
    style D5 fill:#e8f5e8
```
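
A sketch of walking these fields on a returned result (assumes `result` came from `arun()`; `fit_markdown` is only populated when a content filter is configured, and `screenshot` only when it was requested):

```python
import base64
from crawl4ai import CrawlResult

def summarize(result: CrawlResult) -> None:
    """Walk the main CrawlResult fields shown in the diagram."""
    if not result.success:                      # Basic Info
        print(f"{result.url} failed: {result.error_message}")
        return
    print(result.url, result.status_code)
    print(len(result.html), len(result.cleaned_html))  # Content Variants
    print(result.markdown.raw_markdown[:200])          # MarkdownGenerationResult
    print((result.metadata or {}).get("title"))        # Extracted Data
    print(len(result.media.get("images", [])), "images")
    print(len(result.links.get("internal", [])), "internal links")
    if result.screenshot:                       # Optional output, base64-encoded
        with open("page.png", "wb") as f:
            f.write(base64.b64decode(result.screenshot))
```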

### Content Processing Pipeline

```mermaid
flowchart LR
    subgraph "Input Sources"
        A1[Web URL]
        A2[Raw HTML]
        A3[Local File]
    end

    A1 --> B[Browser Navigation]
    A2 --> C[Direct Processing]
    A3 --> C

    B --> D[Raw HTML Capture]
    C --> D

    D --> E{Content Filtering}

    E --> E1[Remove Scripts/Styles]
    E --> E2[Apply excluded_tags]
    E --> E3[Apply css_selector]
    E --> E4[Remove overlay elements]

    E1 --> F[Cleaned HTML]
    E2 --> F
    E3 --> F
    E4 --> F

    F --> G{Markdown Generation}

    G --> G1[HTML to Markdown]
    G --> G2[Apply Content Filter]
    G --> G3[Generate Citations]

    G1 --> H[MarkdownGenerationResult]
    G2 --> H
    G3 --> H

    F --> I{Media Extraction}
    I --> I1[Find Images]
    I --> I2[Find Videos/Audio]
    I --> I3[Score Relevance]
    I1 --> J[Media Dictionary]
    I2 --> J
    I3 --> J

    F --> K{Link Extraction}
    K --> K1[Internal Links]
    K --> K2[External Links]
    K --> K3[Apply Link Filters]
    K1 --> L[Links Dictionary]
    K2 --> L
    K3 --> L

    H --> M[Final CrawlResult]
    J --> M
    L --> M

    style D fill:#e3f2fd
    style F fill:#f3e5f5
    style H fill:#e8f5e8
    style M fill:#c8e6c9
```
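
The markdown stage is pluggable. A sketch of wiring a custom generator with a content filter into the pipeline, which is what populates `fit_markdown` alongside `raw_markdown` (import paths follow the Crawl4AI docs; the pruning threshold is illustrative):

```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter

run_config = CrawlerRunConfig(
    excluded_tags=["script", "style", "nav"],  # Content Filtering stage
    remove_overlay_elements=True,              # "Remove overlay elements"
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.5)  # "Apply Content Filter"
    ),
)
# After the crawl: result.markdown.raw_markdown is the full conversion,
# result.markdown.fit_markdown is the filtered variant.
```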

### Table Extraction Workflow

```mermaid
stateDiagram-v2
    [*] --> DetectTables

    DetectTables --> ScoreTables: Find table elements

    ScoreTables --> EvaluateThreshold: Calculate quality scores
    EvaluateThreshold --> PassThreshold: score >= table_score_threshold
    EvaluateThreshold --> RejectTable: score < threshold

    PassThreshold --> ExtractHeaders: Parse table structure
    ExtractHeaders --> ExtractRows: Get header cells
    ExtractRows --> ExtractMetadata: Get data rows
    ExtractMetadata --> CreateTableObject: Get caption/summary

    CreateTableObject --> AddToResult: {headers, rows, caption, summary}
    AddToResult --> [*]: Table extraction complete

    RejectTable --> [*]: Table skipped

    note right of ScoreTables : Factors: header presence, data density, structure quality
    note right of EvaluateThreshold : Threshold 1-10, higher = stricter
```
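
A sketch of tuning the threshold and reading the extracted table objects. The `{headers, rows, caption, summary}` shape follows the CreateTableObject step above; note as an assumption that some releases expose tables under `result.media["tables"]` rather than `result.tables`:

```python
from crawl4ai import CrawlerRunConfig

# Range is 1-10; higher values reject more low-quality tables
run_config = CrawlerRunConfig(table_score_threshold=7)

# After the crawl, each table that passed the threshold carries
# the fields assembled in CreateTableObject
for table in result.tables or []:
    print(table.get("caption"), table.get("headers"))
    for row in table.get("rows", [])[:3]:
        print(row)
```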

### Error Handling Decision Tree

```mermaid
flowchart TD
    A[Start Crawl] --> B[Navigate to URL]

    B --> C{Navigation Success?}
    C -->|Network Error| C1[Set error_message: Network failure]
    C -->|Timeout| C2[Set error_message: Page timeout]
    C -->|Invalid URL| C3[Set error_message: Invalid URL format]
    C -->|Success| D[Process Page Content]

    C1 --> E[success = False]
    C2 --> E
    C3 --> E

    D --> F{Content Processing OK?}
    F -->|Parser Error| F1[Set error_message: HTML parsing failed]
    F -->|Memory Error| F2[Set error_message: Insufficient memory]
    F -->|Success| G[Generate Outputs]

    F1 --> E
    F2 --> E

    G --> H{Output Generation OK?}
    H -->|Markdown Error| H1[Partial success with warnings]
    H -->|Extraction Error| H2[Partial success with warnings]
    H -->|Success| I[success = True]

    H1 --> I
    H2 --> I

    E --> J[Return Failed CrawlResult]
    I --> K[Return Successful CrawlResult]

    J --> L[User Error Handling]
    K --> M[User Result Processing]

    L --> L1{Check error_message}
    L1 -->|Network| L2[Retry with different config]
    L1 -->|Timeout| L3[Increase page_timeout]
    L1 -->|Parser| L4[Try different scraping_strategy]

    style E fill:#ffcdd2
    style I fill:#c8e6c9
    style J fill:#ffcdd2
    style K fill:#c8e6c9
```
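
A sketch of the user-side branch of this tree: check `success`, inspect `error_message`, and retry timeouts with a larger `page_timeout`. The string matching is illustrative, not an exhaustive map of Crawl4AI's actual error messages:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def crawl_with_retry(url: str, retries: int = 2):
    timeout_ms = 30_000
    result = None
    async with AsyncWebCrawler() as crawler:
        for attempt in range(retries + 1):
            config = CrawlerRunConfig(page_timeout=timeout_ms)
            result = await crawler.arun(url=url, config=config)
            if result.success:
                return result
            print(f"Attempt {attempt + 1} failed: {result.error_message}")
            if "timeout" in (result.error_message or "").lower():
                timeout_ms *= 2  # "Increase page_timeout" branch
    return result

asyncio.run(crawl_with_retry("https://example.com"))
```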

### Configuration Impact Matrix

```mermaid
graph TB
    subgraph "Configuration Categories"
        A[Content Processing]
        B[Page Interaction]
        C[Output Generation]
        D[Performance]
    end

    subgraph "Configuration Options"
        A --> A1[word_count_threshold]
        A --> A2[excluded_tags]
        A --> A3[css_selector]
        A --> A4[exclude_external_links]

        B --> B1[process_iframes]
        B --> B2[remove_overlay_elements]
        B --> B3[scan_full_page]
        B --> B4[wait_for]

        C --> C1[screenshot]
        C --> C2[pdf]
        C --> C3[markdown_generator]
        C --> C4[table_score_threshold]

        D --> D1[cache_mode]
        D --> D2[verbose]
        D --> D3[page_timeout]
        D --> D4[semaphore_count]
    end

    subgraph "Result Impact"
        A1 --> R1[Filters short text blocks]
        A2 --> R2[Removes specified HTML tags]
        A3 --> R3[Focuses on selected content]
        A4 --> R4[Cleans links dictionary]

        B1 --> R5[Includes iframe content]
        B2 --> R6[Removes popups/modals]
        B3 --> R7[Loads dynamic content]
        B4 --> R8[Waits for specific elements]

        C1 --> R9[Adds screenshot field]
        C2 --> R10[Adds pdf field]
        C3 --> R11[Custom markdown processing]
        C4 --> R12[Filters table quality]

        D1 --> R13[Controls caching behavior]
        D2 --> R14[Detailed logging output]
        D3 --> R15[Prevents timeout errors]
        D4 --> R16[Limits concurrent operations]
    end

    style A fill:#e3f2fd
    style B fill:#f3e5f5
    style C fill:#e8f5e8
    style D fill:#fff3e0
```
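
A sketch collecting several of these options in one config, with comments mapping each to its result impact (the `wait_for` selector and tag list are placeholders):

```python
from crawl4ai import CrawlerRunConfig, CacheMode

run_config = CrawlerRunConfig(
    # Content Processing
    word_count_threshold=10,          # filters short text blocks
    excluded_tags=["nav", "footer"],  # removes specified HTML tags
    exclude_external_links=True,      # cleans links dictionary
    # Page Interaction
    remove_overlay_elements=True,     # removes popups/modals
    scan_full_page=True,              # loads dynamic content
    wait_for="css:.main-content",     # waits for a specific element
    # Output Generation
    screenshot=True,                  # adds screenshot field to result
    # Performance
    cache_mode=CacheMode.ENABLED,     # controls caching behavior
    page_timeout=60_000,              # prevents timeout errors (ms)
    verbose=True,                     # detailed logging output
)
```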

### Raw HTML and Local File Processing

```mermaid
sequenceDiagram
    participant User
    participant Crawler
    participant Processor
    participant FileSystem

    Note over User,FileSystem: Raw HTML Processing
    User->>Crawler: arun("raw://html_content")
    Crawler->>Processor: Parse raw HTML directly
    Processor->>Processor: Apply same content filters
    Processor-->>Crawler: Standard CrawlResult
    Crawler-->>User: Result with markdown

    Note over User,FileSystem: Local File Processing
    User->>Crawler: arun("file:///path/to/file.html")
    Crawler->>FileSystem: Read local file
    FileSystem-->>Crawler: File content
    Crawler->>Processor: Process file HTML
    Processor->>Processor: Apply content processing
    Processor-->>Crawler: Standard CrawlResult
    Crawler-->>User: Result with markdown

    Note over User,FileSystem: Both return identical CrawlResult structure
```
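
A minimal sketch of both input modes, using the `raw://` and `file://` prefixes from the diagram (the HTML snippet and file path are placeholders):

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        # Raw HTML: prefix the content itself with raw://
        raw = await crawler.arun("raw://<html><body><h1>Test</h1></body></html>")
        print(raw.markdown.raw_markdown)

        # Local file: use a file:// URL
        local = await crawler.arun("file:///path/to/file.html")
        print(local.markdown.raw_markdown)

asyncio.run(main())
```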

### Comprehensive Processing Example Flow

```mermaid
flowchart TD
    A[Input: example.com] --> B[Create Configurations]

    B --> B1[BrowserConfig verbose=True]
    B --> B2[CrawlerRunConfig with filters]

    B1 --> C[Launch AsyncWebCrawler]
    B2 --> C

    C --> D[Navigate and Process]

    D --> E{Check Success}
    E -->|Failed| E1[Print Error Message]
    E -->|Success| F[Extract Content Summary]

    F --> F1[Get Page Title]
    F --> F2[Get Content Preview]
    F --> F3[Process Media Items]
    F --> F4[Process Links]

    F3 --> F3A[Count Images]
    F3 --> F3B[Show First 3 Images]

    F4 --> F4A[Count Internal Links]
    F4 --> F4B[Show First 3 Links]

    F1 --> G[Display Results]
    F2 --> G
    F3A --> G
    F3B --> G
    F4A --> G
    F4B --> G

    E1 --> H[End with Error]
    G --> I[End with Success]

    style E1 fill:#ffcdd2
    style G fill:#c8e6c9
    style H fill:#ffcdd2
    style I fill:#c8e6c9
```
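
A sketch implementing this flow end to end (the URL and tag list are placeholders):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_config = BrowserConfig(verbose=True)
    run_config = CrawlerRunConfig(excluded_tags=["nav", "footer"])

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com", config=run_config)

        if not result.success:
            print(f"Crawl failed: {result.error_message}")  # End with Error
            return

        # Content summary
        print("Title:", (result.metadata or {}).get("title"))
        print("Preview:", result.markdown.raw_markdown[:150])

        # Media: count images and show the first 3
        images = result.media.get("images", [])
        print(f"{len(images)} images")
        for img in images[:3]:
            print(" -", img.get("src"))

        # Links: count internal links and show the first 3
        internal = result.links.get("internal", [])
        print(f"{len(internal)} internal links")
        for link in internal[:3]:
            print(" -", link.get("href"))

asyncio.run(main())
```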

**📖 Learn more:** [Simple Crawling Guide](https://docs.crawl4ai.com/core/simple-crawling/), [Configuration Options](https://docs.crawl4ai.com/core/browser-crawler-config/), [Result Processing](https://docs.crawl4ai.com/core/crawler-result/), [Table Extraction](https://docs.crawl4ai.com/extraction/no-llm-strategies/)