## Simple Crawling Workflows and Data Flow
Visual representations of basic web crawling operations, configuration patterns, and result processing workflows.
### Basic Crawling Sequence
```mermaid
sequenceDiagram
participant User
participant Crawler as AsyncWebCrawler
participant Browser as Browser Instance
participant Page as Web Page
participant Processor as Content Processor
User->>Crawler: Create with BrowserConfig
Crawler->>Browser: Launch browser instance
Browser-->>Crawler: Browser ready
User->>Crawler: arun(url, CrawlerRunConfig)
Crawler->>Browser: Create new page/context
Browser->>Page: Navigate to URL
Page-->>Browser: Page loaded
Browser->>Processor: Extract raw HTML
Processor->>Processor: Clean HTML
Processor->>Processor: Generate markdown
Processor->>Processor: Extract media/links
Processor-->>Crawler: CrawlResult created
Crawler-->>User: Return CrawlResult
Note over User,Processor: All processing happens asynchronously
```
### Crawling Configuration Flow
```mermaid
flowchart TD
A[Start Crawling] --> B{Browser Config Set?}
B -->|No| B1[Use Default BrowserConfig]
B -->|Yes| B2[Custom BrowserConfig]
B1 --> C[Launch Browser]
B2 --> C
C --> D{Crawler Run Config Set?}
D -->|No| D1[Use Default CrawlerRunConfig]
D -->|Yes| D2[Custom CrawlerRunConfig]
D1 --> E[Navigate to URL]
D2 --> E
E --> F{Page Load Success?}
F -->|No| F1[Return Error Result]
F -->|Yes| G[Apply Content Filters]
G --> G1{excluded_tags set?}
G1 -->|Yes| G2[Remove specified tags]
G1 -->|No| G3[Keep all tags]
G2 --> G4{css_selector set?}
G3 --> G4
G4 -->|Yes| G5[Extract selected elements]
G4 -->|No| G6[Process full page]
G5 --> H[Generate Markdown]
G6 --> H
H --> H1{markdown_generator set?}
H1 -->|Yes| H2[Use custom generator]
H1 -->|No| H3[Use default generator]
H2 --> I[Extract Media and Links]
H3 --> I
I --> I1{process_iframes?}
I1 -->|Yes| I2[Include iframe content]
I1 -->|No| I3[Skip iframes]
I2 --> J[Create CrawlResult]
I3 --> J
J --> K[Return Result]
style A fill:#e1f5fe
style K fill:#c8e6c9
style F1 fill:#ffcdd2
```
### CrawlResult Data Structure
```mermaid
graph TB
subgraph "CrawlResult Object"
A[CrawlResult] --> B[Basic Info]
A --> C[Content Variants]
A --> D[Extracted Data]
A --> E[Media Assets]
A --> F[Optional Outputs]
B --> B1[url: Final URL]
B --> B2[success: Boolean]
B --> B3[status_code: HTTP Status]
B --> B4[error_message: Error Details]
C --> C1[html: Raw HTML]
C --> C2[cleaned_html: Sanitized HTML]
C --> C3[markdown: MarkdownGenerationResult]
C3 --> C3A[raw_markdown: Basic conversion]
C3 --> C3B[markdown_with_citations: With references]
C3 --> C3C[fit_markdown: Filtered content]
C3 --> C3D[references_markdown: Citation list]
D --> D1[links: Internal/External]
D --> D2[media: Images/Videos/Audio]
D --> D3[metadata: Page info]
D --> D4[extracted_content: JSON data]
D --> D5[tables: Structured table data]
E --> E1[screenshot: Base64 image]
E --> E2[pdf: PDF bytes]
E --> E3[mhtml: Archive file]
E --> E4[downloaded_files: File paths]
F --> F1[session_id: Browser session]
F --> F2[ssl_certificate: Security info]
F --> F3[response_headers: HTTP headers]
F --> F4[network_requests: Traffic log]
F --> F5[console_messages: Browser logs]
end
style A fill:#e3f2fd
style C3 fill:#f3e5f5
style D5 fill:#e8f5e8
```
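The fields in the diagram are read as plain attributes on the returned object. The helper below is only an illustration of that access pattern; the stand-in object mimics a successful `CrawlResult` so the snippet is self-contained, and is not part of the crawl4ai API:

```python
from types import SimpleNamespace

def summarize_result(result) -> dict:
    """Collect commonly used CrawlResult fields from the diagram above.
    Works with any object exposing the same attribute names."""
    return {
        "url": result.url,
        "success": result.success,
        "status_code": result.status_code,
        "markdown_chars": len(result.markdown.raw_markdown),
        "image_count": len(result.media.get("images", [])),
        "internal_links": len(result.links.get("internal", [])),
    }

# Stand-in object mimicking a successful CrawlResult (illustration only).
fake = SimpleNamespace(
    url="https://example.com",
    success=True,
    status_code=200,
    markdown=SimpleNamespace(raw_markdown="# Example\nSome content."),
    media={"images": [{"src": "/logo.png"}]},
    links={"internal": [], "external": [{"href": "https://other.site"}]},
)
print(summarize_result(fake))
```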
### Content Processing Pipeline
```mermaid
flowchart LR
subgraph "Input Sources"
A1[Web URL]
A2[Raw HTML]
A3[Local File]
end
A1 --> B[Browser Navigation]
A2 --> C[Direct Processing]
A3 --> C
B --> D[Raw HTML Capture]
C --> D
D --> E{Content Filtering}
E --> E1[Remove Scripts/Styles]
E --> E2[Apply excluded_tags]
E --> E3[Apply css_selector]
E --> E4[Remove overlay elements]
E1 --> F[Cleaned HTML]
E2 --> F
E3 --> F
E4 --> F
F --> G{Markdown Generation}
G --> G1[HTML to Markdown]
G --> G2[Apply Content Filter]
G --> G3[Generate Citations]
G1 --> H[MarkdownGenerationResult]
G2 --> H
G3 --> H
F --> I{Media Extraction}
I --> I1[Find Images]
I --> I2[Find Videos/Audio]
I --> I3[Score Relevance]
I1 --> J[Media Dictionary]
I2 --> J
I3 --> J
F --> K{Link Extraction}
K --> K1[Internal Links]
K --> K2[External Links]
K --> K3[Apply Link Filters]
K1 --> L[Links Dictionary]
K2 --> L
K3 --> L
H --> M[Final CrawlResult]
J --> M
L --> M
style D fill:#e3f2fd
style F fill:#f3e5f5
style H fill:#e8f5e8
style M fill:#c8e6c9
```
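The "Apply Content Filter" branch of the pipeline is what populates `fit_markdown`. A config fragment showing one way to wire that up, assuming a recent crawl4ai version (import paths and the `threshold` value may vary by release):

```python
# Config fragment: attach a content filter to the markdown generator so
# fit_markdown is populated alongside raw_markdown.
from crawl4ai import CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter

md_generator = DefaultMarkdownGenerator(
    content_filter=PruningContentFilter(threshold=0.5)  # prune low-value blocks
)
run_config = CrawlerRunConfig(
    excluded_tags=["nav", "footer", "script"],  # dropped before markdown
    markdown_generator=md_generator,
)
```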
### Table Extraction Workflow
```mermaid
stateDiagram-v2
[*] --> DetectTables
DetectTables --> ScoreTables: Find table elements
ScoreTables --> EvaluateThreshold: Calculate quality scores
EvaluateThreshold --> PassThreshold: score >= table_score_threshold
EvaluateThreshold --> RejectTable: score < threshold
PassThreshold --> ExtractHeaders: Parse table structure
ExtractHeaders --> ExtractRows: Get header cells
ExtractRows --> ExtractMetadata: Get data rows
ExtractMetadata --> CreateTableObject: Get caption/summary
CreateTableObject --> AddToResult: {headers, rows, caption, summary}
AddToResult --> [*]: Table extraction complete
RejectTable --> [*]: Table skipped
note right of ScoreTables : Factors: header presence, data density, structure quality
note right of EvaluateThreshold : Threshold 1-10, higher = stricter
```
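The real scoring heuristic lives inside crawl4ai; the sketch below is only an illustrative approximation of the factors named in the notes (header presence, data density, structure quality), with hypothetical weights, to make the threshold comparison concrete:

```python
# Illustrative approximation of table scoring (weights are hypothetical,
# not crawl4ai's actual heuristic).
def score_table(headers, rows):
    """Score a parsed table on a rough 0-10 scale."""
    score = 0
    if headers:                                   # factor: header presence
        score += 4
    filled = sum(1 for row in rows for cell in row if str(cell).strip())
    total = sum(len(row) for row in rows) or 1
    score += round(4 * filled / total)            # factor: data density
    widths = {len(row) for row in rows}
    if len(widths) <= 1:                          # factor: uniform row structure
        score += 2
    return score

table_score_threshold = 7  # higher = stricter, per the diagram note
good = score_table(["Name", "Price"], [["Widget", "9.99"], ["Gadget", "19.99"]])
print(good, good >= table_score_threshold)
```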
### Error Handling Decision Tree
```mermaid
flowchart TD
A[Start Crawl] --> B[Navigate to URL]
B --> C{Navigation Success?}
C -->|Network Error| C1[Set error_message: Network failure]
C -->|Timeout| C2[Set error_message: Page timeout]
C -->|Invalid URL| C3[Set error_message: Invalid URL format]
C -->|Success| D[Process Page Content]
C1 --> E[success = False]
C2 --> E
C3 --> E
D --> F{Content Processing OK?}
F -->|Parser Error| F1[Set error_message: HTML parsing failed]
F -->|Memory Error| F2[Set error_message: Insufficient memory]
F -->|Success| G[Generate Outputs]
F1 --> E
F2 --> E
G --> H{Output Generation OK?}
H -->|Markdown Error| H1[Partial success with warnings]
H -->|Extraction Error| H2[Partial success with warnings]
H -->|Success| I[success = True]
H1 --> I
H2 --> I
E --> J[Return Failed CrawlResult]
I --> K[Return Successful CrawlResult]
J --> L[User Error Handling]
K --> M[User Result Processing]
L --> L1{Check error_message}
L1 -->|Network| L2[Retry with different config]
L1 -->|Timeout| L3[Increase page_timeout]
L1 -->|Parser| L4[Try different scraping_strategy]
style E fill:#ffcdd2
style I fill:#c8e6c9
style J fill:#ffcdd2
style K fill:#c8e6c9
```
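The user-side branch of the tree (check `error_message`, pick a config adjustment) can be sketched as a small dispatch function. The message substrings and suggested adjustments below are illustrative, not an exhaustive mapping of crawl4ai's error strings:

```python
# Sketch of the user error-handling branch above: map an error_message
# to a retry strategy (categories are illustrative).
def suggest_retry(error_message: str) -> str:
    msg = (error_message or "").lower()
    if "timeout" in msg:
        return "increase page_timeout"
    if "network" in msg or "net::" in msg:
        return "retry with a different BrowserConfig (proxy/headers)"
    if "pars" in msg:
        return "try a different scraping_strategy"
    return "inspect error_message manually"

print(suggest_retry("Page timeout after 60000ms"))
```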
### Configuration Impact Matrix
```mermaid
graph TB
subgraph "Configuration Categories"
A[Content Processing]
B[Page Interaction]
C[Output Generation]
D[Performance]
end
subgraph "Configuration Options"
A --> A1[word_count_threshold]
A --> A2[excluded_tags]
A --> A3[css_selector]
A --> A4[exclude_external_links]
B --> B1[process_iframes]
B --> B2[remove_overlay_elements]
B --> B3[scan_full_page]
B --> B4[wait_for]
C --> C1[screenshot]
C --> C2[pdf]
C --> C3[markdown_generator]
C --> C4[table_score_threshold]
D --> D1[cache_mode]
D --> D2[verbose]
D --> D3[page_timeout]
D --> D4[semaphore_count]
end
subgraph "Result Impact"
A1 --> R1[Filters short text blocks]
A2 --> R2[Removes specified HTML tags]
A3 --> R3[Focuses on selected content]
A4 --> R4[Cleans links dictionary]
B1 --> R5[Includes iframe content]
B2 --> R6[Removes popups/modals]
B3 --> R7[Loads dynamic content]
B4 --> R8[Waits for specific elements]
C1 --> R9[Adds screenshot field]
C2 --> R10[Adds pdf field]
C3 --> R11[Custom markdown processing]
C4 --> R12[Filters table quality]
D1 --> R13[Controls caching behavior]
D2 --> R14[Detailed logging output]
D3 --> R15[Prevents timeout errors]
D4 --> R16[Limits concurrent operations]
end
style A fill:#e3f2fd
style B fill:#f3e5f5
style C fill:#e8f5e8
style D fill:#fff3e0
```
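A config fragment touching each category in the matrix, assuming crawl4ai is installed (the values are placeholders, not recommendations):

```python
# Config fragment: one option from each category in the matrix above.
from crawl4ai import CrawlerRunConfig, CacheMode

run_config = CrawlerRunConfig(
    # Content processing
    word_count_threshold=10,          # filters short text blocks
    excluded_tags=["nav", "footer"],  # removes specified HTML tags
    exclude_external_links=True,      # cleans links dictionary
    # Page interaction
    process_iframes=True,             # includes iframe content
    remove_overlay_elements=True,     # removes popups/modals
    # Output generation
    screenshot=True,                  # adds screenshot field
    table_score_threshold=7,          # filters table quality
    # Performance
    cache_mode=CacheMode.BYPASS,      # controls caching behavior
    page_timeout=60000,               # milliseconds; prevents timeout errors
)
```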
### Raw HTML and Local File Processing
```mermaid
sequenceDiagram
participant User
participant Crawler
participant Processor
participant FileSystem
Note over User,FileSystem: Raw HTML Processing
User->>Crawler: arun("raw://html_content")
Crawler->>Processor: Parse raw HTML directly
Processor->>Processor: Apply same content filters
Processor-->>Crawler: Standard CrawlResult
Crawler-->>User: Result with markdown
Note over User,FileSystem: Local File Processing
User->>Crawler: arun("file:///path/to/file.html")
Crawler->>FileSystem: Read local file
FileSystem-->>Crawler: File content
Crawler->>Processor: Process file HTML
Processor->>Processor: Apply content processing
Processor-->>Crawler: Standard CrawlResult
Crawler-->>User: Result with markdown
Note over User,FileSystem: Both return identical CrawlResult structure
```
### Comprehensive Processing Example Flow
```mermaid
flowchart TD
A[Input: example.com] --> B[Create Configurations]
B --> B1[BrowserConfig verbose=True]
B --> B2[CrawlerRunConfig with filters]
B1 --> C[Launch AsyncWebCrawler]
B2 --> C
C --> D[Navigate and Process]
D --> E{Check Success}
E -->|Failed| E1[Print Error Message]
E -->|Success| F[Extract Content Summary]
F --> F1[Get Page Title]
F --> F2[Get Content Preview]
F --> F3[Process Media Items]
F --> F4[Process Links]
F3 --> F3A[Count Images]
F3 --> F3B[Show First 3 Images]
F4 --> F4A[Count Internal Links]
F4 --> F4B[Show First 3 Links]
F1 --> G[Display Results]
F2 --> G
F3A --> G
F3B --> G
F4A --> G
F4B --> G
E1 --> H[End with Error]
G --> I[End with Success]
style E1 fill:#ffcdd2
style G fill:#c8e6c9
style H fill:#ffcdd2
style I fill:#c8e6c9
```
**📖 Learn more:** [Simple Crawling Guide](https://docs.crawl4ai.com/core/simple-crawling/), [Configuration Options](https://docs.crawl4ai.com/core/browser-crawler-config/), [Result Processing](https://docs.crawl4ai.com/core/crawler-result/), [Table Extraction](https://docs.crawl4ai.com/extraction/no-llm-strategies/)