## Simple Crawling Workflows and Data Flow

Visual representations of basic web crawling operations, configuration patterns, and result processing workflows.

### Basic Crawling Sequence

```mermaid
sequenceDiagram
    participant User
    participant Crawler as AsyncWebCrawler
    participant Browser as Browser Instance
    participant Page as Web Page
    participant Processor as Content Processor

    User->>Crawler: Create with BrowserConfig
    Crawler->>Browser: Launch browser instance
    Browser-->>Crawler: Browser ready

    User->>Crawler: arun(url, CrawlerRunConfig)
    Crawler->>Browser: Create new page/context
    Browser->>Page: Navigate to URL
    Page-->>Browser: Page loaded

    Browser->>Processor: Extract raw HTML
    Processor->>Processor: Clean HTML
    Processor->>Processor: Generate markdown
    Processor->>Processor: Extract media/links
    Processor-->>Crawler: CrawlResult created

    Crawler-->>User: Return CrawlResult

    Note over User,Processor: All processing happens asynchronously
```

### Crawling Configuration Flow

```mermaid
flowchart TD
    A[Start Crawling] --> B{Browser Config Set?}
    B -->|No| B1[Use Default BrowserConfig]
    B -->|Yes| B2[Custom BrowserConfig]
    B1 --> C[Launch Browser]
    B2 --> C

    C --> D{Crawler Run Config Set?}
    D -->|No| D1[Use Default CrawlerRunConfig]
    D -->|Yes| D2[Custom CrawlerRunConfig]
    D1 --> E[Navigate to URL]
    D2 --> E

    E --> F{Page Load Success?}
    F -->|No| F1[Return Error Result]
    F -->|Yes| G[Apply Content Filters]

    G --> G1{excluded_tags set?}
    G1 -->|Yes| G2[Remove specified tags]
    G1 -->|No| G3[Keep all tags]
    G2 --> G4{css_selector set?}
    G3 --> G4
    G4 -->|Yes| G5[Extract selected elements]
    G4 -->|No| G6[Process full page]

    G5 --> H[Generate Markdown]
    G6 --> H
    H --> H1{markdown_generator set?}
    H1 -->|Yes| H2[Use custom generator]
    H1 -->|No| H3[Use default generator]

    H2 --> I[Extract Media and Links]
    H3 --> I
    I --> I1{process_iframes?}
    I1 -->|Yes| I2[Include iframe content]
    I1 -->|No| I3[Skip iframes]

    I2 --> J[Create CrawlResult]
    I3 --> J
    J --> K[Return Result]

    style A fill:#e1f5fe
    style K fill:#c8e6c9
    style F1 fill:#ffcdd2
```
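The two-config pattern in the sequence and flow above maps directly to code: one `BrowserConfig` when the crawler is created, one `CrawlerRunConfig` per `arun()` call, and defaults used wherever a config is omitted. A minimal sketch (the URL and option values are illustrative):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    # Custom BrowserConfig; omit it to take the "Use Default" branch above
    browser_config = BrowserConfig(headless=True, verbose=True)

    # Custom CrawlerRunConfig; every option here is optional
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,      # skip the cache for this run
        excluded_tags=["nav", "footer"],  # removed before markdown generation
        css_selector="main",              # focus processing on one region (illustrative)
        process_iframes=False,            # skip iframe content
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com", config=run_config)
        print(result.success, result.status_code)

asyncio.run(main())
```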
### CrawlResult Data Structure

```mermaid
graph TB
    subgraph "CrawlResult Object"
        A[CrawlResult] --> B[Basic Info]
        A --> C[Content Variants]
        A --> D[Extracted Data]
        A --> E[Media Assets]
        A --> F[Optional Outputs]

        B --> B1[url: Final URL]
        B --> B2[success: Boolean]
        B --> B3[status_code: HTTP Status]
        B --> B4[error_message: Error Details]

        C --> C1[html: Raw HTML]
        C --> C2[cleaned_html: Sanitized HTML]
        C --> C3[markdown: MarkdownGenerationResult]
        C3 --> C3A[raw_markdown: Basic conversion]
        C3 --> C3B[markdown_with_citations: With references]
        C3 --> C3C[fit_markdown: Filtered content]
        C3 --> C3D[references_markdown: Citation list]

        D --> D1[links: Internal/External]
        D --> D2[media: Images/Videos/Audio]
        D --> D3[metadata: Page info]
        D --> D4[extracted_content: JSON data]
        D --> D5[tables: Structured table data]

        E --> E1[screenshot: Base64 image]
        E --> E2[pdf: PDF bytes]
        E --> E3[mhtml: Archive file]
        E --> E4[downloaded_files: File paths]

        F --> F1[session_id: Browser session]
        F --> F2[ssl_certificate: Security info]
        F --> F3[response_headers: HTTP headers]
        F --> F4[network_requests: Traffic log]
        F --> F5[console_messages: Browser logs]
    end

    style A fill:#e3f2fd
    style C3 fill:#f3e5f5
    style D5 fill:#e8f5e8
```

### Content Processing Pipeline

```mermaid
flowchart LR
    subgraph "Input Sources"
        A1[Web URL]
        A2[Raw HTML]
        A3[Local File]
    end

    A1 --> B[Browser Navigation]
    A2 --> C[Direct Processing]
    A3 --> C

    B --> D[Raw HTML Capture]
    C --> D

    D --> E{Content Filtering}
    E --> E1[Remove Scripts/Styles]
    E --> E2[Apply excluded_tags]
    E --> E3[Apply css_selector]
    E --> E4[Remove overlay elements]

    E1 --> F[Cleaned HTML]
    E2 --> F
    E3 --> F
    E4 --> F

    F --> G{Markdown Generation}
    G --> G1[HTML to Markdown]
    G --> G2[Apply Content Filter]
    G --> G3[Generate Citations]

    G1 --> H[MarkdownGenerationResult]
    G2 --> H
    G3 --> H

    F --> I{Media Extraction}
    I --> I1[Find Images]
    I --> I2[Find Videos/Audio]
    I --> I3[Score Relevance]

    I1 --> J[Media Dictionary]
    I2 --> J
    I3 --> J

    F --> K{Link Extraction}
    K --> K1[Internal Links]
    K --> K2[External Links]
    K --> K3[Apply Link Filters]

    K1 --> L[Links Dictionary]
    K2 --> L
    K3 --> L

    H --> M[Final CrawlResult]
    J --> M
    L --> M

    style D fill:#e3f2fd
    style F fill:#f3e5f5
    style H fill:#e8f5e8
    style M fill:#c8e6c9
```

### Table Extraction Workflow

```mermaid
stateDiagram-v2
    [*] --> DetectTables
    DetectTables --> ScoreTables: Find table elements
    ScoreTables --> EvaluateThreshold: Calculate quality scores
    EvaluateThreshold --> PassThreshold: score >= table_score_threshold
    EvaluateThreshold --> RejectTable: score < threshold
    PassThreshold --> ExtractHeaders: Parse table structure
    ExtractHeaders --> ExtractRows: Get header cells
    ExtractRows --> ExtractMetadata: Get data rows
    ExtractMetadata --> CreateTableObject: Get caption/summary
    CreateTableObject --> AddToResult: {headers, rows, caption, summary}
    AddToResult --> [*]: Table extraction complete
    RejectTable --> [*]: Table skipped

    note right of ScoreTables : Factors: header presence, data density, structure quality
    note right of EvaluateThreshold : Threshold 1-10, higher = stricter
```

### Error Handling Decision Tree

```mermaid
flowchart TD
    A[Start Crawl] --> B[Navigate to URL]
    B --> C{Navigation Success?}
    C -->|Network Error| C1[Set error_message: Network failure]
    C -->|Timeout| C2[Set error_message: Page timeout]
    C -->|Invalid URL| C3[Set error_message: Invalid URL format]
    C -->|Success| D[Process Page Content]

    C1 --> E[success = False]
    C2 --> E
    C3 --> E

    D --> F{Content Processing OK?}
    F -->|Parser Error| F1[Set error_message: HTML parsing failed]
    F -->|Memory Error| F2[Set error_message: Insufficient memory]
    F -->|Success| G[Generate Outputs]

    F1 --> E
    F2 --> E

    G --> H{Output Generation OK?}
    H -->|Markdown Error| H1[Partial success with warnings]
    H -->|Extraction Error| H2[Partial success with warnings]
    H -->|Success| I[success = True]

    H1 --> I
    H2 --> I

    E --> J[Return Failed CrawlResult]
    I --> K[Return Successful CrawlResult]

    J --> L[User Error Handling]
    K --> M[User Result Processing]

    L --> L1{Check error_message}
    L1 -->|Network| L2[Retry with different config]
    L1 -->|Timeout| L3[Increase page_timeout]
    L1 -->|Parser| L4[Try different scraping_strategy]

    style E fill:#ffcdd2
    style I fill:#c8e6c9
    style J fill:#ffcdd2
    style K fill:#c8e6c9
```
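In code, the failed and successful branches of the decision tree reduce to a single `result.success` check, and the fields you read afterward match the CrawlResult diagram above. A sketch with illustrative values (the timeout is in milliseconds; `fit_markdown` is only populated when a content filter ran):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=CrawlerRunConfig(page_timeout=60000),  # raise this on timeout errors
        )

        if not result.success:
            # Failure branch: error_message says which check failed
            print(f"Crawl failed (status={result.status_code}): {result.error_message}")
            return

        md = result.markdown                 # MarkdownGenerationResult
        print(md.raw_markdown[:200])         # basic HTML-to-markdown conversion
        if md.fit_markdown:                  # filtered variant, if a filter was configured
            print("filtered:", md.fit_markdown[:200])

        print(len(result.media.get("images", [])), "images")
        print(len(result.links.get("internal", [])), "internal links")

asyncio.run(main())
```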
### Configuration Impact Matrix

```mermaid
graph TB
    subgraph "Configuration Categories"
        A[Content Processing]
        B[Page Interaction]
        C[Output Generation]
        D[Performance]
    end

    subgraph "Configuration Options"
        A --> A1[word_count_threshold]
        A --> A2[excluded_tags]
        A --> A3[css_selector]
        A --> A4[exclude_external_links]

        B --> B1[process_iframes]
        B --> B2[remove_overlay_elements]
        B --> B3[scan_full_page]
        B --> B4[wait_for]

        C --> C1[screenshot]
        C --> C2[pdf]
        C --> C3[markdown_generator]
        C --> C4[table_score_threshold]

        D --> D1[cache_mode]
        D --> D2[verbose]
        D --> D3[page_timeout]
        D --> D4[semaphore_count]
    end

    subgraph "Result Impact"
        A1 --> R1[Filters short text blocks]
        A2 --> R2[Removes specified HTML tags]
        A3 --> R3[Focuses on selected content]
        A4 --> R4[Cleans links dictionary]

        B1 --> R5[Includes iframe content]
        B2 --> R6[Removes popups/modals]
        B3 --> R7[Loads dynamic content]
        B4 --> R8[Waits for specific elements]

        C1 --> R9[Adds screenshot field]
        C2 --> R10[Adds pdf field]
        C3 --> R11[Custom markdown processing]
        C4 --> R12[Filters table quality]

        D1 --> R13[Controls caching behavior]
        D2 --> R14[Detailed logging output]
        D3 --> R15[Prevents timeout errors]
        D4 --> R16[Limits concurrent operations]
    end

    style A fill:#e3f2fd
    style B fill:#f3e5f5
    style C fill:#e8f5e8
    style D fill:#fff3e0
```

### Raw HTML and Local File Processing

```mermaid
sequenceDiagram
    participant User
    participant Crawler
    participant Processor
    participant FileSystem

    Note over User,FileSystem: Raw HTML Processing
    User->>Crawler: arun("raw://html_content")
    Crawler->>Processor: Parse raw HTML directly
    Processor->>Processor: Apply same content filters
    Processor-->>Crawler: Standard CrawlResult
    Crawler-->>User: Result with markdown

    Note over User,FileSystem: Local File Processing
    User->>Crawler: arun("file:///path/to/file.html")
    Crawler->>FileSystem: Read local file
    FileSystem-->>Crawler: File content
    Crawler->>Processor: Process file HTML
    Processor->>Processor: Apply content processing
    Processor-->>Crawler: Standard CrawlResult
    Crawler-->>User: Result with markdown

    Note over User,FileSystem: Both return identical CrawlResult structure
```

### Comprehensive Processing Example Flow

```mermaid
flowchart TD
    A[Input: example.com] --> B[Create Configurations]
    B --> B1[BrowserConfig verbose=True]
    B --> B2[CrawlerRunConfig with filters]

    B1 --> C[Launch AsyncWebCrawler]
    B2 --> C

    C --> D[Navigate and Process]
    D --> E{Check Success}

    E -->|Failed| E1[Print Error Message]
    E -->|Success| F[Extract Content Summary]

    F --> F1[Get Page Title]
    F --> F2[Get Content Preview]
    F --> F3[Process Media Items]
    F --> F4[Process Links]

    F3 --> F3A[Count Images]
    F3 --> F3B[Show First 3 Images]

    F4 --> F4A[Count Internal Links]
    F4 --> F4B[Show First 3 Links]

    F1 --> G[Display Results]
    F2 --> G
    F3A --> G
    F3B --> G
    F4A --> G
    F4B --> G

    E1 --> H[End with Error]
    G --> I[End with Success]

    style E1 fill:#ffcdd2
    style G fill:#c8e6c9
    style H fill:#ffcdd2
    style I fill:#c8e6c9
```

**📖 Learn more:** [Simple Crawling Guide](https://docs.crawl4ai.com/core/simple-crawling/), [Configuration Options](https://docs.crawl4ai.com/core/browser-crawler-config/), [Result Processing](https://docs.crawl4ai.com/core/crawler-result/), [Table Extraction](https://docs.crawl4ai.com/extraction/no-llm-strategies/)
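Putting it together, the comprehensive flow diagrammed above corresponds to a short script like the following sketch. The URL, filter values, and printed summary fields are illustrative; the `src` and `href` keys follow the media and links dictionaries described earlier. The same pattern works unchanged for `raw://` and `file://` inputs.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_config = BrowserConfig(verbose=True)
    run_config = CrawlerRunConfig(
        word_count_threshold=10,          # drop very short text blocks
        excluded_tags=["nav", "footer"],  # strip navigation chrome
        exclude_external_links=True,      # keep only internal links
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com", config=run_config)

        if not result.success:
            print("Error:", result.error_message)  # "End with Error" branch
            return

        # Content summary
        print("Title:", result.metadata.get("title"))
        print("Preview:", result.markdown.raw_markdown[:150])

        # Media: count images and show the first three
        images = result.media.get("images", [])
        print(f"{len(images)} images")
        for img in images[:3]:
            print("  -", img.get("src"))

        # Links: count internal links and show the first three
        internal = result.links.get("internal", [])
        print(f"{len(internal)} internal links")
        for link in internal[:3]:
            print("  -", link.get("href"))

asyncio.run(main())
```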