# Crawl4AI > Open-source LLM-friendly web crawler and scraper for AI applications Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. Built with Python and Playwright for high-performance crawling with structured data extraction. **Key Features:** - Asynchronous crawling with high concurrency - Multiple extraction strategies (CSS, XPath, LLM-based) - Built-in markdown generation with content filtering - Docker deployment with REST API - Session management and browser automation - Advanced anti-detection capabilities **Quick Links:** - [GitHub Repository](https://github.com/unclecode/crawl4ai) - [Documentation](https://docs.crawl4ai.com) - [Examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples) --- ## Installation Workflows and Architecture Visual representations of Crawl4AI installation processes, deployment options, and system interactions. ### Installation Decision Flow ```mermaid flowchart TD A[Start Installation] --> B{Environment Type?} B -->|Local Development| C[Basic Python Install] B -->|Production| D[Docker Deployment] B -->|Research/Testing| E[Google Colab] B -->|CI/CD Pipeline| F[Automated Setup] C --> C1[pip install crawl4ai] C1 --> C2[crawl4ai-setup] C2 --> C3{Need Advanced Features?} C3 -->|No| C4[Basic Installation Complete] C3 -->|Text Clustering| C5[pip install crawl4ai with torch] C3 -->|Transformers| C6[pip install crawl4ai with transformer] C3 -->|All Features| C7[pip install crawl4ai with all] C5 --> C8[crawl4ai-download-models] C6 --> C8 C7 --> C8 C8 --> C9[Advanced Installation Complete] D --> D1{Deployment Method?} D1 -->|Pre-built Image| D2[docker pull unclecode/crawl4ai] D1 -->|Docker Compose| D3[Clone repo + docker compose] D1 -->|Custom Build| D4[docker buildx build] D2 --> D5[Configure .llm.env] D3 --> D5 D4 --> D5 D5 --> D6[docker run with ports] D6 --> D7[Docker Deployment Complete] E --> E1[Colab pip install] E1 --> E2[playwright install chromium] E2 --> E3[Test basic crawl] E3 --> E4[Colab Setup Complete] F --> F1[Automated pip install] F1 --> F2[Automated setup scripts] F2 --> F3[CI/CD Integration Complete] C4 --> G[Verify with crawl4ai-doctor] C9 --> G D7 --> H[Health check via API] E4 --> I[Run test crawl] F3 --> G G --> J[Installation Verified] H --> J I --> J style A fill:#e1f5fe style J fill:#c8e6c9 style C4 fill:#fff3e0 style C9 fill:#fff3e0 style D7 fill:#f3e5f5 style E4 fill:#fce4ec style F3 fill:#e8f5e8 ``` ### Basic Installation Sequence ```mermaid sequenceDiagram participant User participant PyPI participant System participant Playwright participant Crawler User->>PyPI: pip install crawl4ai PyPI-->>User: Package downloaded User->>System: crawl4ai-setup System->>Playwright: Install browser binaries Playwright-->>System: Chromium, Firefox installed System-->>User: Setup complete User->>System: crawl4ai-doctor System->>System: Check Python version System->>System: Verify Playwright installation System->>System: Test browser launch System-->>User: Diagnostics report User->>Crawler: Basic crawl test Crawler->>Playwright: Launch browser Playwright-->>Crawler: Browser ready Crawler->>Crawler: Navigate to test URL Crawler-->>User: Success confirmation ``` ### Docker Deployment Architecture ```mermaid graph TB subgraph "Host System" A[Docker Engine] --> B[Crawl4AI Container] C[.llm.env File] --> B D[Port 11235] --> B end subgraph "Container Environment" B --> E[FastAPI Server] B --> F[Playwright Browsers] B --> G[Python Runtime] E --> H[/crawl Endpoint] E --> I[/playground Interface] E --> J[/health Monitoring] E --> K[/metrics Prometheus] F --> L[Chromium Browser] F --> M[Firefox Browser] F --> N[WebKit Browser] end subgraph "External Services" O[OpenAI API] --> B P[Anthropic API] --> B Q[Local LLM Ollama] --> B end subgraph "Client Applications" R[Python SDK] --> H S[REST API Calls] --> H T[Web Browser] --> I U[Monitoring Tools] --> J V[Prometheus] --> K end style B fill:#e3f2fd style E fill:#f3e5f5 style F fill:#e8f5e8 style G fill:#fff3e0 ``` ### Advanced Features Installation Flow ```mermaid stateDiagram-v2 [*] --> BasicInstall BasicInstall --> FeatureChoice: crawl4ai installed FeatureChoice --> TorchInstall: Need text clustering FeatureChoice --> TransformerInstall: Need HuggingFace models FeatureChoice --> AllInstall: Need everything FeatureChoice --> Complete: Basic features sufficient TorchInstall --> TorchSetup: pip install crawl4ai with torch TransformerInstall --> TransformerSetup: pip install crawl4ai with transformer AllInstall --> AllSetup: pip install crawl4ai with all TorchSetup --> ModelDownload: crawl4ai-setup TransformerSetup --> ModelDownload: crawl4ai-setup AllSetup --> ModelDownload: crawl4ai-setup ModelDownload --> PreDownload: crawl4ai-download-models PreDownload --> Complete: All models cached Complete --> Verification: crawl4ai-doctor Verification --> [*]: Installation verified note right of TorchInstall : PyTorch for semantic operations note right of TransformerInstall : HuggingFace for LLM features note right of AllInstall : Complete feature set ``` ### Platform-Specific Installation Matrix ```mermaid graph LR subgraph "Installation Methods" A[Python Package] --> A1[pip install] B[Docker Image] --> B1[docker pull] C[Source Build] --> C1[git clone + build] D[Cloud Platform] --> D1[Colab/Kaggle] end subgraph "Operating Systems" E[Linux x86_64] F[Linux ARM64] G[macOS Intel] H[macOS Apple Silicon] I[Windows x86_64] end subgraph "Feature Sets" J[Basic crawling] K[Text clustering torch] L[LLM transformers] M[All features] end A1 --> E A1 --> F A1 --> G A1 --> H A1 --> I B1 --> E B1 --> F B1 --> G B1 --> H C1 --> E C1 --> F C1 --> G C1 --> H C1 --> I D1 --> E D1 --> I E --> J E --> K E --> L E --> M F --> J F --> K F --> L F --> M G --> J G --> K G --> L G --> M H --> J H --> K H --> L H --> M I --> J I --> K I --> L I --> M style A1 fill:#e3f2fd style B1 fill:#f3e5f5 style C1 fill:#e8f5e8 style D1 fill:#fff3e0 ``` ### Docker Multi-Stage Build Process ```mermaid sequenceDiagram participant Dev as Developer participant Git as GitHub Repo participant Docker as Docker Engine participant Registry as Docker Hub participant User as End User Dev->>Git: Push code changes Docker->>Git: Clone repository Docker->>Docker: Stage 1 - Base Python image Docker->>Docker: Stage 2 - Install dependencies Docker->>Docker: Stage 3 - Install Playwright Docker->>Docker: Stage 4 - Copy application code Docker->>Docker: Stage 5 - Setup FastAPI server Note over Docker: Multi-architecture build Docker->>Docker: Build for linux/amd64 Docker->>Docker: Build for linux/arm64 Docker->>Registry: Push multi-arch manifest Registry-->>Docker: Build complete User->>Registry: docker pull unclecode/crawl4ai Registry-->>User: Download appropriate architecture User->>Docker: docker run with configuration Docker->>Docker: Start container Docker->>Docker: Initialize FastAPI server Docker->>Docker: Setup Playwright browsers Docker-->>User: Service ready on port 11235 ``` ### Installation Verification Workflow ```mermaid flowchart TD A[Installation Complete] --> B[Run crawl4ai-doctor] B --> C{Python Version Check} C -->|✓ 3.10+| D{Playwright Check} C -->|✗ < 3.10| C1[Upgrade Python] C1 --> D D -->|✓ Installed| E{Browser Binaries} D -->|✗ Missing| D1[Run crawl4ai-setup] D1 --> E E -->|✓ Available| F{Test Browser Launch} E -->|✗ Missing| E1[playwright install] E1 --> F F -->|✓ Success| G[Test Basic Crawl] F -->|✗ Failed| F1[Check system dependencies] F1 --> F G --> H{Crawl Test Result} H -->|✓ Success| I[Installation Verified ✓] H -->|✗ Failed| H1[Check network/permissions] H1 --> G I --> J[Ready for Production Use] style I fill:#c8e6c9 style J fill:#e8f5e8 style C1 fill:#ffcdd2 style D1 fill:#fff3e0 style E1 fill:#fff3e0 style F1 fill:#ffcdd2 style H1 fill:#ffcdd2 ``` ### Resource Requirements by Installation Type ```mermaid graph TD subgraph "Basic Installation" A1[Memory: 512MB] A2[Disk: 2GB] A3[CPU: 1 core] A4[Network: Required for setup] end subgraph "Advanced Features torch" B1[Memory: 2GB+] B2[Disk: 5GB+] B3[CPU: 2+ cores] B4[GPU: Optional CUDA] end subgraph "All Features" C1[Memory: 4GB+] C2[Disk: 10GB+] C3[CPU: 4+ cores] C4[GPU: Recommended] end subgraph "Docker Deployment" D1[Memory: 1GB+] D2[Disk: 3GB+] D3[CPU: 2+ cores] D4[Ports: 11235] D5[Shared Memory: 1GB] end style A1 fill:#e8f5e8 style B1 fill:#fff3e0 style C1 fill:#ffecb3 style D1 fill:#e3f2fd ``` **📖 Learn more:** [Installation Guide](https://docs.crawl4ai.com/core/installation/), [Docker Deployment](https://docs.crawl4ai.com/core/docker-deployment/), [System Requirements](https://docs.crawl4ai.com/core/installation/#prerequisites) --- ## Simple Crawling Workflows and Data Flow Visual representations of basic web crawling operations, configuration patterns, and result processing workflows. ### Basic Crawling Sequence ```mermaid sequenceDiagram participant User participant Crawler as AsyncWebCrawler participant Browser as Browser Instance participant Page as Web Page participant Processor as Content Processor User->>Crawler: Create with BrowserConfig Crawler->>Browser: Launch browser instance Browser-->>Crawler: Browser ready User->>Crawler: arun(url, CrawlerRunConfig) Crawler->>Browser: Create new page/context Browser->>Page: Navigate to URL Page-->>Browser: Page loaded Browser->>Processor: Extract raw HTML Processor->>Processor: Clean HTML Processor->>Processor: Generate markdown Processor->>Processor: Extract media/links Processor-->>Crawler: CrawlResult created Crawler-->>User: Return CrawlResult Note over User,Processor: All processing happens asynchronously ``` ### Crawling Configuration Flow ```mermaid flowchart TD A[Start Crawling] --> B{Browser Config Set?} B -->|No| B1[Use Default BrowserConfig] B -->|Yes| B2[Custom BrowserConfig] B1 --> C[Launch Browser] B2 --> C C --> D{Crawler Run Config Set?} D -->|No| D1[Use Default CrawlerRunConfig] D -->|Yes| D2[Custom CrawlerRunConfig] D1 --> E[Navigate to URL] D2 --> E E --> F{Page Load Success?} F -->|No| F1[Return Error Result] F -->|Yes| G[Apply Content Filters] G --> G1{excluded_tags set?} G1 -->|Yes| G2[Remove specified tags] G1 -->|No| G3[Keep all tags] G2 --> G4{css_selector set?} G3 --> G4 G4 -->|Yes| G5[Extract selected elements] G4 -->|No| G6[Process full page] G5 --> H[Generate Markdown] G6 --> H H --> H1{markdown_generator set?} H1 -->|Yes| H2[Use custom generator] H1 -->|No| H3[Use default generator] H2 --> I[Extract Media and Links] H3 --> I I --> I1{process_iframes?} I1 -->|Yes| I2[Include iframe content] I1 -->|No| I3[Skip iframes] I2 --> J[Create CrawlResult] I3 --> J J --> K[Return Result] style A fill:#e1f5fe style K fill:#c8e6c9 style F1 fill:#ffcdd2 ``` ### CrawlResult Data Structure ```mermaid graph TB subgraph "CrawlResult Object" A[CrawlResult] --> B[Basic Info] A --> C[Content Variants] A --> D[Extracted Data] A --> E[Media Assets] A --> F[Optional Outputs] B --> B1[url: Final URL] B --> B2[success: Boolean] B --> B3[status_code: HTTP Status] B --> B4[error_message: Error Details] C --> C1[html: Raw HTML] C --> C2[cleaned_html: Sanitized HTML] C --> C3[markdown: MarkdownGenerationResult] C3 --> C3A[raw_markdown: Basic conversion] C3 --> C3B[markdown_with_citations: With references] C3 --> C3C[fit_markdown: Filtered content] C3 --> C3D[references_markdown: Citation list] D --> D1[links: Internal/External] D --> D2[media: Images/Videos/Audio] D --> D3[metadata: Page info] D --> D4[extracted_content: JSON data] D --> D5[tables: Structured table data] E --> E1[screenshot: Base64 image] E --> E2[pdf: PDF bytes] E --> E3[mhtml: Archive file] E --> E4[downloaded_files: File paths] F --> F1[session_id: Browser session] F --> F2[ssl_certificate: Security info] F --> F3[response_headers: HTTP headers] F --> F4[network_requests: Traffic log] F --> F5[console_messages: Browser logs] end style A fill:#e3f2fd style C3 fill:#f3e5f5 style D5 fill:#e8f5e8 ``` ### Content Processing Pipeline ```mermaid flowchart LR subgraph "Input Sources" A1[Web URL] A2[Raw HTML] A3[Local File] end A1 --> B[Browser Navigation] A2 --> C[Direct Processing] A3 --> C B --> D[Raw HTML Capture] C --> D D --> E{Content Filtering} E --> E1[Remove Scripts/Styles] E --> E2[Apply excluded_tags] E --> E3[Apply css_selector] E --> E4[Remove overlay elements] E1 --> F[Cleaned HTML] E2 --> F E3 --> F E4 --> F F --> G{Markdown Generation} G --> G1[HTML to Markdown] G --> G2[Apply Content Filter] G --> G3[Generate Citations] G1 --> H[MarkdownGenerationResult] G2 --> H G3 --> H F --> I{Media Extraction} I --> I1[Find Images] I --> I2[Find Videos/Audio] I --> I3[Score Relevance] I1 --> J[Media Dictionary] I2 --> J I3 --> J F --> K{Link Extraction} K --> K1[Internal Links] K --> K2[External Links] K --> K3[Apply Link Filters] K1 --> L[Links Dictionary] K2 --> L K3 --> L H --> M[Final CrawlResult] J --> M L --> M style D fill:#e3f2fd style F fill:#f3e5f5 style H fill:#e8f5e8 style M fill:#c8e6c9 ``` ### Table Extraction Workflow ```mermaid stateDiagram-v2 [*] --> DetectTables DetectTables --> ScoreTables: Find table elements ScoreTables --> EvaluateThreshold: Calculate quality scores EvaluateThreshold --> PassThreshold: score >= table_score_threshold EvaluateThreshold --> RejectTable: score < threshold PassThreshold --> ExtractHeaders: Parse table structure ExtractHeaders --> ExtractRows: Get header cells ExtractRows --> ExtractMetadata: Get data rows ExtractMetadata --> CreateTableObject: Get caption/summary CreateTableObject --> AddToResult: {headers, rows, caption, summary} AddToResult --> [*]: Table extraction complete RejectTable --> [*]: Table skipped note right of ScoreTables : Factors: header presence, data density, structure quality note right of EvaluateThreshold : Threshold 1-10, higher = stricter ``` ### Error Handling Decision Tree ```mermaid flowchart TD A[Start Crawl] --> B[Navigate to URL] B --> C{Navigation Success?} C -->|Network Error| C1[Set error_message: Network failure] C -->|Timeout| C2[Set error_message: Page timeout] C -->|Invalid URL| C3[Set error_message: Invalid URL format] C -->|Success| D[Process Page Content] C1 --> E[success = False] C2 --> E C3 --> E D --> F{Content Processing OK?} F -->|Parser Error| F1[Set error_message: HTML parsing failed] F -->|Memory Error| F2[Set error_message: Insufficient memory] F -->|Success| G[Generate Outputs] F1 --> E F2 --> E G --> H{Output Generation OK?} H -->|Markdown Error| H1[Partial success with warnings] H -->|Extraction Error| H2[Partial success with warnings] H -->|Success| I[success = True] H1 --> I H2 --> I E --> J[Return Failed CrawlResult] I --> K[Return Successful CrawlResult] J --> L[User Error Handling] K --> M[User Result Processing] L --> L1{Check error_message} L1 -->|Network| L2[Retry with different config] L1 -->|Timeout| L3[Increase page_timeout] L1 -->|Parser| L4[Try different scraping_strategy] style E fill:#ffcdd2 style I fill:#c8e6c9 style J fill:#ffcdd2 style K fill:#c8e6c9 ``` ### Configuration Impact Matrix ```mermaid graph TB subgraph "Configuration Categories" A[Content Processing] B[Page Interaction] C[Output Generation] D[Performance] end subgraph "Configuration Options" A --> A1[word_count_threshold] A --> A2[excluded_tags] A --> A3[css_selector] A --> A4[exclude_external_links] B --> B1[process_iframes] B --> B2[remove_overlay_elements] B --> B3[scan_full_page] B --> B4[wait_for] C --> C1[screenshot] C --> C2[pdf] C --> C3[markdown_generator] C --> C4[table_score_threshold] D --> D1[cache_mode] D --> D2[verbose] D --> D3[page_timeout] D --> D4[semaphore_count] end subgraph "Result Impact" A1 --> R1[Filters short text blocks] A2 --> R2[Removes specified HTML tags] A3 --> R3[Focuses on selected content] A4 --> R4[Cleans links dictionary] B1 --> R5[Includes iframe content] B2 --> R6[Removes popups/modals] B3 --> R7[Loads dynamic content] B4 --> R8[Waits for specific elements] C1 --> R9[Adds screenshot field] C2 --> R10[Adds pdf field] C3 --> R11[Custom markdown processing] C4 --> R12[Filters table quality] D1 --> R13[Controls caching behavior] D2 --> R14[Detailed logging output] D3 --> R15[Prevents timeout errors] D4 --> R16[Limits concurrent operations] end style A fill:#e3f2fd style B fill:#f3e5f5 style C fill:#e8f5e8 style D fill:#fff3e0 ``` ### Raw HTML and Local File Processing ```mermaid sequenceDiagram participant User participant Crawler participant Processor participant FileSystem Note over User,FileSystem: Raw HTML Processing User->>Crawler: arun("raw://html_content") Crawler->>Processor: Parse raw HTML directly Processor->>Processor: Apply same content filters Processor-->>Crawler: Standard CrawlResult Crawler-->>User: Result with markdown Note over User,FileSystem: Local File Processing User->>Crawler: arun("file:///path/to/file.html") Crawler->>FileSystem: Read local file FileSystem-->>Crawler: File content Crawler->>Processor: Process file HTML Processor->>Processor: Apply content processing Processor-->>Crawler: Standard CrawlResult Crawler-->>User: Result with markdown Note over User,FileSystem: Both return identical CrawlResult structure ``` ### Comprehensive Processing Example Flow ```mermaid flowchart TD A[Input: example.com] --> B[Create Configurations] B --> B1[BrowserConfig verbose=True] B --> B2[CrawlerRunConfig with filters] B1 --> C[Launch AsyncWebCrawler] B2 --> C C --> D[Navigate and Process] D --> E{Check Success} E -->|Failed| E1[Print Error Message] E -->|Success| F[Extract Content Summary] F --> F1[Get Page Title] F --> F2[Get Content Preview] F --> F3[Process Media Items] F --> F4[Process Links] F3 --> F3A[Count Images] F3 --> F3B[Show First 3 Images] F4 --> F4A[Count Internal Links] F4 --> F4B[Show First 3 Links] F1 --> G[Display Results] F2 --> G F3A --> G F3B --> G F4A --> G F4B --> G E1 --> H[End with Error] G --> I[End with Success] style E1 fill:#ffcdd2 style G fill:#c8e6c9 style H fill:#ffcdd2 style I fill:#c8e6c9 ``` **📖 Learn more:** [Simple Crawling Guide](https://docs.crawl4ai.com/core/simple-crawling/), [Configuration Options](https://docs.crawl4ai.com/core/browser-crawler-config/), [Result Processing](https://docs.crawl4ai.com/core/crawler-result/), [Table Extraction](https://docs.crawl4ai.com/extraction/no-llm-strategies/) --- ## Configuration Objects and System Architecture Visual representations of Crawl4AI's configuration system, object relationships, and data flow patterns. ### Configuration Object Relationships ```mermaid classDiagram class BrowserConfig { +browser_type: str +headless: bool +viewport_width: int +viewport_height: int +proxy: str +user_agent: str +cookies: list +headers: dict +clone() BrowserConfig +to_dict() dict } class CrawlerRunConfig { +cache_mode: CacheMode +extraction_strategy: ExtractionStrategy +markdown_generator: MarkdownGenerator +js_code: list +wait_for: str +screenshot: bool +session_id: str +clone() CrawlerRunConfig +dump() dict } class LLMConfig { +provider: str +api_token: str +base_url: str +temperature: float +max_tokens: int +clone() LLMConfig +to_dict() dict } class CrawlResult { +url: str +success: bool +html: str +cleaned_html: str +markdown: MarkdownGenerationResult +extracted_content: str +media: dict +links: dict +screenshot: str +pdf: bytes } class AsyncWebCrawler { +config: BrowserConfig +arun() CrawlResult } AsyncWebCrawler --> BrowserConfig : uses AsyncWebCrawler --> CrawlerRunConfig : accepts CrawlerRunConfig --> LLMConfig : contains AsyncWebCrawler --> CrawlResult : returns note for BrowserConfig "Controls browser\nenvironment and behavior" note for CrawlerRunConfig "Controls individual\ncrawl operations" note for LLMConfig "Configures LLM\nproviders and parameters" note for CrawlResult "Contains all crawl\noutputs and metadata" ``` ### Configuration Decision Flow ```mermaid flowchart TD A[Start Configuration] --> B{Use Case Type?} B -->|Simple Web Scraping| C[Basic Config Pattern] B -->|Data Extraction| D[Extraction Config Pattern] B -->|Stealth Crawling| E[Stealth Config Pattern] B -->|High Performance| F[Performance Config Pattern] C --> C1[BrowserConfig: headless=True] C --> C2[CrawlerRunConfig: basic options] C1 --> C3[No LLMConfig needed] C2 --> C3 C3 --> G[Simple Crawling Ready] D --> D1[BrowserConfig: standard setup] D --> D2[CrawlerRunConfig: with extraction_strategy] D --> D3[LLMConfig: for LLM extraction] D1 --> D4[Advanced Extraction Ready] D2 --> D4 D3 --> D4 E --> E1[BrowserConfig: proxy + user_agent] E --> E2[CrawlerRunConfig: simulate_user=True] E1 --> E3[Stealth Crawling Ready] E2 --> E3 F --> F1[BrowserConfig: lightweight] F --> F2[CrawlerRunConfig: caching + concurrent] F1 --> F3[High Performance Ready] F2 --> F3 G --> H[Execute Crawl] D4 --> H E3 --> H F3 --> H H --> I[Get CrawlResult] style A fill:#e1f5fe style I fill:#c8e6c9 style G fill:#fff3e0 style D4 fill:#f3e5f5 style E3 fill:#ffebee style F3 fill:#e8f5e8 ``` ### Configuration Lifecycle Sequence ```mermaid sequenceDiagram participant User participant BrowserConfig as Browser Config participant CrawlerConfig as Crawler Config participant LLMConfig as LLM Config participant Crawler as AsyncWebCrawler participant Browser as Browser Instance participant Result as CrawlResult User->>BrowserConfig: Create with browser settings User->>CrawlerConfig: Create with crawl options User->>LLMConfig: Create with LLM provider User->>Crawler: Initialize with BrowserConfig Crawler->>Browser: Launch browser with config Browser-->>Crawler: Browser ready User->>Crawler: arun(url, CrawlerConfig) Crawler->>Crawler: Apply CrawlerConfig settings alt LLM Extraction Needed Crawler->>LLMConfig: Get LLM settings LLMConfig-->>Crawler: Provider configuration end Crawler->>Browser: Navigate with settings Browser->>Browser: Apply page interactions Browser->>Browser: Execute JavaScript if specified Browser->>Browser: Wait for conditions Browser-->>Crawler: Page content ready Crawler->>Crawler: Process content per config Crawler->>Result: Create CrawlResult Result-->>User: Return complete result Note over User,Result: Configuration objects control every aspect ``` ### BrowserConfig Parameter Flow ```mermaid graph TB subgraph "BrowserConfig Parameters" A[browser_type] --> A1[chromium/firefox/webkit] B[headless] --> B1[true: invisible / false: visible] C[viewport] --> C1[width x height dimensions] D[proxy] --> D1[proxy server configuration] E[user_agent] --> E1[browser identification string] F[cookies] --> F1[session authentication] G[headers] --> G1[HTTP request headers] H[extra_args] --> H1[browser command line flags] end subgraph "Browser Instance" I[Playwright Browser] J[Browser Context] K[Page Instance] end A1 --> I B1 --> I C1 --> J D1 --> J E1 --> J F1 --> J G1 --> J H1 --> I I --> J J --> K style I fill:#e3f2fd style J fill:#f3e5f5 style K fill:#e8f5e8 ``` ### CrawlerRunConfig Category Breakdown ```mermaid mindmap root((CrawlerRunConfig)) Content Processing word_count_threshold css_selector target_elements excluded_tags markdown_generator extraction_strategy Page Navigation wait_until page_timeout wait_for wait_for_images delay_before_return_html Page Interaction js_code scan_full_page simulate_user magic remove_overlay_elements Caching Session cache_mode session_id shared_data Media Output screenshot pdf capture_mhtml image_score_threshold Link Filtering exclude_external_links exclude_domains exclude_social_media_links ``` ### LLM Provider Selection Flow ```mermaid flowchart TD A[Need LLM Processing?] --> B{Provider Type?} B -->|Cloud API| C{Which Service?} B -->|Local Model| D[Local Setup] B -->|Custom Endpoint| E[Custom Config] C -->|OpenAI| C1[OpenAI GPT Models] C -->|Anthropic| C2[Claude Models] C -->|Google| C3[Gemini Models] C -->|Groq| C4[Fast Inference] D --> D1[Ollama Setup] E --> E1[Custom base_url] C1 --> F1[LLMConfig with OpenAI settings] C2 --> F2[LLMConfig with Anthropic settings] C3 --> F3[LLMConfig with Google settings] C4 --> F4[LLMConfig with Groq settings] D1 --> F5[LLMConfig with Ollama settings] E1 --> F6[LLMConfig with custom settings] F1 --> G[Use in Extraction Strategy] F2 --> G F3 --> G F4 --> G F5 --> G F6 --> G style A fill:#e1f5fe style G fill:#c8e6c9 ``` ### CrawlResult Structure and Data Flow ```mermaid graph TB subgraph "CrawlResult Output" A[Basic Info] B[HTML Content] C[Markdown Output] D[Extracted Data] E[Media Files] F[Metadata] end subgraph "Basic Info Details" A --> A1[url: final URL] A --> A2[success: boolean] A --> A3[status_code: HTTP status] A --> A4[error_message: if failed] end subgraph "HTML Content Types" B --> B1[html: raw HTML] B --> B2[cleaned_html: processed] B --> B3[fit_html: filtered content] end subgraph "Markdown Variants" C --> C1[raw_markdown: basic conversion] C --> C2[markdown_with_citations: with refs] C --> C3[fit_markdown: filtered content] C --> C4[references_markdown: citation list] end subgraph "Extracted Content" D --> D1[extracted_content: JSON string] D --> D2[From CSS extraction] D --> D3[From LLM extraction] D --> D4[From XPath extraction] end subgraph "Media and Links" E --> E1[images: list with scores] E --> E2[videos: media content] E --> E3[internal_links: same domain] E --> E4[external_links: other domains] end subgraph "Generated Files" F --> F1[screenshot: base64 PNG] F --> F2[pdf: binary PDF data] F --> F3[mhtml: archive format] F --> F4[ssl_certificate: cert info] end style A fill:#e3f2fd style B fill:#f3e5f5 style C fill:#e8f5e8 style D fill:#fff3e0 style E fill:#ffebee style F fill:#f1f8e9 ``` ### Configuration Pattern State Machine ```mermaid stateDiagram-v2 [*] --> ConfigCreation ConfigCreation --> BasicConfig: Simple use case ConfigCreation --> AdvancedConfig: Complex requirements ConfigCreation --> TemplateConfig: Use predefined pattern BasicConfig --> Validation: Check parameters AdvancedConfig --> Validation: Check parameters TemplateConfig --> Validation: Check parameters Validation --> Invalid: Missing required fields Validation --> Valid: All parameters correct Invalid --> ConfigCreation: Fix and retry Valid --> InUse: Passed to crawler InUse --> Cloning: Need variation InUse --> Serialization: Save configuration InUse --> Complete: Crawl finished Cloning --> Modified: clone() with updates Modified --> Valid: Validate changes Serialization --> Stored: dump() to dict Stored --> Restoration: load() from dict Restoration --> Valid: Recreate config object Complete --> [*] note right of BasicConfig : Minimal required settings note right of AdvancedConfig : Full feature configuration note right of TemplateConfig : Pre-built patterns ``` ### Configuration Integration Architecture ```mermaid graph TB subgraph "User Layer" U1[Configuration Creation] U2[Parameter Selection] U3[Pattern Application] end subgraph "Configuration Layer" C1[BrowserConfig] C2[CrawlerRunConfig] C3[LLMConfig] C4[Config Validation] C5[Config Cloning] end subgraph "Crawler Engine" E1[Browser Management] E2[Page Navigation] E3[Content Processing] E4[Extraction Pipeline] E5[Result Generation] end subgraph "Output Layer" O1[CrawlResult Assembly] O2[Data Formatting] O3[File Generation] O4[Metadata Collection] end U1 --> C1 U2 --> C2 U3 --> C3 C1 --> C4 C2 --> C4 C3 --> C4 C4 --> E1 C2 --> E2 C2 --> E3 C3 --> E4 E1 --> E2 E2 --> E3 E3 --> E4 E4 --> E5 E5 --> O1 O1 --> O2 O2 --> O3 O3 --> O4 C5 -.-> C1 C5 -.-> C2 C5 -.-> C3 style U1 fill:#e1f5fe style C4 fill:#fff3e0 style E4 fill:#f3e5f5 style O4 fill:#c8e6c9 ``` ### Configuration Best Practices Flow ```mermaid flowchart TD A[Configuration Planning] --> B{Performance Priority?} B -->|Speed| C[Fast Config Pattern] B -->|Quality| D[Comprehensive Config Pattern] B -->|Stealth| E[Stealth Config Pattern] B -->|Balanced| F[Standard Config Pattern] C --> C1[Enable caching] C --> C2[Disable heavy features] C --> C3[Use text_mode] C1 --> G[Apply Configuration] C2 --> G C3 --> G D --> D1[Enable all processing] D --> D2[Use content filters] D --> D3[Capture everything] D1 --> G D2 --> G D3 --> G E --> E1[Rotate user agents] E --> E2[Use proxies] E --> E3[Simulate human behavior] E1 --> G E2 --> G E3 --> G F --> F1[Balanced timeouts] F --> F2[Selective processing] F --> F3[Smart caching] F1 --> G F2 --> G F3 --> G G --> H[Test Configuration] H --> I{Results Satisfactory?} I -->|Yes| J[Production Ready] I -->|No| K[Adjust Parameters] K --> L[Clone and Modify] L --> H J --> M[Deploy with Confidence] style A fill:#e1f5fe style J fill:#c8e6c9 style M fill:#e8f5e8 ``` ## Advanced Configuration Workflows and Patterns Visual representations of advanced Crawl4AI configuration strategies, proxy management, session handling, and identity-based crawling patterns. ### User Agent and Anti-Detection Strategy Flow ```mermaid flowchart TD A[Start Configuration] --> B{Detection Avoidance Needed?} B -->|No| C[Standard User Agent] B -->|Yes| D[Anti-Detection Strategy] C --> C1[Static user_agent string] C1 --> Z[Basic Configuration] D --> E{User Agent Strategy} E -->|Random| F[user_agent_mode: random] E -->|Static Custom| G[Custom user_agent string] E -->|Platform Specific| H[Generator Config] F --> I[Configure Generator] H --> I I --> I1[Platform: windows/macos/linux] I1 --> I2[Browser: chrome/firefox/safari] I2 --> I3[Device: desktop/mobile/tablet] G --> J[Behavioral Simulation] I3 --> J J --> K{Enable Simulation?} K -->|Yes| L[simulate_user: True] K -->|No| M[Standard Behavior] L --> N[override_navigator: True] N --> O[Configure Delays] O --> O1[mean_delay: 1.5] O1 --> O2[max_range: 2.0] O2 --> P[Magic Mode] M --> P P --> Q{Auto-Handle Patterns?} Q -->|Yes| R[magic: True] Q -->|No| S[Manual Handling] R --> T[Complete Anti-Detection Setup] S --> T Z --> T style D fill:#ffeb3b style T fill:#c8e6c9 style L fill:#ff9800 style R fill:#9c27b0 ``` ### Proxy Configuration and Rotation Architecture ```mermaid graph TB subgraph "Proxy Configuration Types" A[Single Proxy] --> A1[ProxyConfig object] B[Proxy String] --> B1[from_string method] C[Environment Proxies] --> C1[from_env method] D[Multiple Proxies] --> D1[ProxyRotationStrategy] end subgraph "ProxyConfig Structure" A1 --> E[server: URL] A1 --> F[username: auth] A1 --> G[password: auth] A1 --> H[ip: extracted] end subgraph "Rotation Strategies" D1 --> I[round_robin] D1 --> J[random] D1 --> K[least_used] D1 --> L[failure_aware] end subgraph "Configuration Flow" M[CrawlerRunConfig] --> N[proxy_config] M --> O[proxy_rotation_strategy] N --> P[Single Proxy Usage] O --> Q[Multi-Proxy Rotation] end subgraph "Runtime Behavior" P --> R[All requests use same proxy] Q --> S[Requests rotate through proxies] S --> T[Health monitoring] T --> U[Automatic failover] end style A1 fill:#e3f2fd style D1 fill:#f3e5f5 style M fill:#e8f5e8 style T fill:#fff3e0 ``` ### Content Selection Strategy Comparison ```mermaid sequenceDiagram participant Browser participant HTML as Raw HTML participant CSS as css_selector participant Target as target_elements participant Processor as Content Processor participant Output Note over Browser,Output: css_selector Strategy Browser->>HTML: Load complete page HTML->>CSS: Apply css_selector CSS->>CSS: Extract matching elements only CSS->>Processor: Process subset HTML Processor->>Output: Markdown + Extraction from subset Note over Browser,Output: target_elements Strategy Browser->>HTML: Load complete page HTML->>Processor: Process entire page Processor->>Target: Focus on target_elements Target->>Target: Extract from specified elements Processor->>Output: Full page links/media + targeted content Note over CSS,Target: Key Difference Note over CSS: Affects entire processing pipeline Note over Target: Affects only content extraction ``` ### Advanced wait_for Conditions Decision Tree ```mermaid flowchart TD A[Configure wait_for] --> B{Condition Type?} B -->|CSS Element| C[CSS Selector Wait] B -->|JavaScript Condition| D[JS Expression Wait] B -->|Complex Logic| E[Custom JS Function] B -->|No Wait| F[Default domcontentloaded] C --> C1["wait_for: 'css:.element'"] C1 --> C2[Element appears in DOM] C2 --> G[Continue Processing] D --> D1["wait_for: 'js:() => condition'"] D1 --> D2[JavaScript returns true] D2 --> G E --> E1[Complex JS Function] E1 --> E2{Multiple Conditions} E2 -->|AND Logic| E3[All conditions true] E2 -->|OR Logic| E4[Any condition true] E2 -->|Custom Logic| E5[User-defined logic] E3 --> G E4 --> G E5 --> G F --> G G --> H{Timeout Reached?} H -->|No| I[Page Ready] H -->|Yes| J[Timeout Error] I --> K[Begin Content Extraction] J --> L[Handle Error/Retry] style C1 fill:#e8f5e8 style D1 fill:#fff3e0 style E1 fill:#ffeb3b style I fill:#c8e6c9 style J fill:#ffcdd2 ``` ### Session Management Lifecycle ```mermaid stateDiagram-v2 [*] --> SessionCreate SessionCreate --> SessionActive: session_id provided SessionCreate --> OneTime: no session_id SessionActive --> BrowserLaunch: First arun() call BrowserLaunch --> PageLoad: Navigate to URL PageLoad --> JSExecution: Execute js_code JSExecution --> ContentExtract: Extract content ContentExtract --> SessionHold: Keep session alive SessionHold --> ReuseSession: Subsequent arun() calls ReuseSession --> JSOnlyMode: js_only=True ReuseSession --> NewNavigation: js_only=False JSOnlyMode --> JSExecution: Execute JS in existing page NewNavigation --> PageLoad: Navigate to new URL SessionHold --> SessionKill: kill_session() called SessionHold --> SessionTimeout: Timeout reached SessionHold --> SessionError: Error occurred SessionKill --> SessionCleanup SessionTimeout --> SessionCleanup SessionError --> SessionCleanup SessionCleanup --> [*] OneTime --> BrowserLaunch ContentExtract --> OneTimeCleanup: No session_id OneTimeCleanup --> [*] note right of SessionActive : Persistent browser context note right of JSOnlyMode : Reuse existing page note right of OneTime : Temporary browser instance ``` ### Identity-Based Crawling Configuration Matrix ```mermaid graph TD subgraph "Geographic Identity" A[Geolocation] --> A1[latitude/longitude] A2[Timezone] --> A3[timezone_id] A4[Locale] --> A5[language/region] end subgraph "Browser Identity" B[User Agent] --> B1[Platform fingerprint] B2[Navigator Properties] --> B3[override_navigator] B4[Headers] --> B5[Accept-Language] end subgraph "Behavioral Identity" C[Mouse Simulation] --> C1[simulate_user] C2[Timing Patterns] --> C3[mean_delay/max_range] C4[Interaction Patterns] --> C5[Human-like behavior] end subgraph "Configuration Integration" D[CrawlerRunConfig] --> A D --> B D --> C D --> E[Complete Identity Profile] E --> F[Geographic Consistency] E --> G[Browser Consistency] E --> H[Behavioral Consistency] end F --> I[Paris, France Example] I --> I1[locale: fr-FR] I --> I2[timezone: Europe/Paris] I --> I3[geolocation: 48.8566, 2.3522] G --> J[Windows Chrome Example] J --> J1[platform: windows] J --> J2[browser: chrome] J --> J3[user_agent: matching pattern] H --> K[Human Simulation] K --> K1[Random delays] K --> K2[Mouse movements] K --> K3[Navigation patterns] style E fill:#ff9800 style I fill:#e3f2fd style J fill:#f3e5f5 style K fill:#e8f5e8 ``` ### Multi-Step Crawling Sequence Flow ```mermaid sequenceDiagram participant User participant Crawler participant Session as Browser Session participant Page1 as Login Page participant Page2 as Dashboard participant Page3 as Data Pages User->>Crawler: Step 1 - Login Crawler->>Session: Create session_id="user_session" Session->>Page1: Navigate to login Page1->>Page1: Execute login JS Page1->>Page1: Wait for dashboard redirect Page1-->>Crawler: Login complete User->>Crawler: Step 2 - Navigate dashboard Note over Crawler,Session: Reuse existing session Crawler->>Session: js_only=True (no page reload) Session->>Page2: Execute navigation JS Page2->>Page2: Wait for data table Page2-->>Crawler: Dashboard ready User->>Crawler: Step 3 - Extract data pages loop For each page 1-5 Crawler->>Session: js_only=True Session->>Page3: Click page button Page3->>Page3: Wait for page active Page3->>Page3: Extract content Page3-->>Crawler: Page data end User->>Crawler: Cleanup Crawler->>Session: kill_session() Session-->>Crawler: Session destroyed ``` ### Configuration Import and Usage Patterns ```mermaid graph LR subgraph "Main Package Imports" A[crawl4ai] --> A1[AsyncWebCrawler] A --> A2[BrowserConfig] A --> A3[CrawlerRunConfig] A --> A4[LLMConfig] A --> A5[CacheMode] A --> A6[ProxyConfig] A --> A7[GeolocationConfig] end subgraph "Strategy Imports" A --> B1[JsonCssExtractionStrategy] A --> B2[LLMExtractionStrategy] A --> B3[DefaultMarkdownGenerator] A --> B4[PruningContentFilter] A --> B5[RegexChunking] end subgraph "Configuration Assembly" C[Configuration Builder] --> A2 C --> A3 C --> A4 A2 --> D[Browser Environment] A3 --> E[Crawl Behavior] A4 --> F[LLM Integration] E --> B1 E --> B2 E --> B3 E --> B4 E --> B5 end subgraph "Runtime Flow" G[Crawler Instance] --> D G --> H[Execute Crawl] H --> E H --> F H --> I[CrawlResult] end style A fill:#e3f2fd style C fill:#fff3e0 style G fill:#e8f5e8 style I fill:#c8e6c9 ``` ### Advanced Configuration Decision Matrix ```mermaid flowchart TD A[Advanced Configuration Needed] --> B{Primary Use Case?} B -->|Bot Detection Avoidance| C[Anti-Detection Setup] B -->|Geographic Simulation| D[Identity-Based Config] B -->|Multi-Step Workflows| E[Session Management] B -->|Network Reliability| F[Proxy Configuration] B -->|Content Precision| G[Selector Strategy] C --> C1[Random User Agents] C --> C2[Behavioral Simulation] C --> C3[Navigator Override] C --> C4[Magic Mode] D --> D1[Geolocation Setup] D --> D2[Locale Configuration] D --> D3[Timezone Setting] D --> D4[Browser Fingerprinting] E --> E1[Session ID Management] E --> E2[JS-Only Navigation] E --> E3[Shared Data Context] E --> E4[Session Cleanup] F --> F1[Single Proxy] F --> F2[Proxy Rotation] F --> F3[Failover Strategy] F --> F4[Health Monitoring] G --> G1[css_selector for Subset] G --> G2[target_elements for Focus] G --> G3[excluded_selector for Removal] G --> G4[Hierarchical Selection] C1 --> H[Production Configuration] C2 --> H C3 --> H C4 --> H D1 --> H D2 --> H D3 --> H D4 --> H E1 --> H E2 --> H E3 --> H E4 --> H F1 --> H F2 --> H F3 --> H F4 --> H G1 --> H G2 --> H G3 --> H G4 --> H style H fill:#c8e6c9 style C fill:#ff9800 style D fill:#9c27b0 style E fill:#2196f3 style F fill:#4caf50 style G fill:#ff5722 ``` ## Advanced Features Workflows and Architecture Visual representations of advanced crawling capabilities, session management, hooks system, and performance optimization strategies. ### File Download Workflow ```mermaid sequenceDiagram participant User participant Crawler participant Browser participant FileSystem participant Page User->>Crawler: Configure downloads_path Crawler->>Browser: Create context with download handling Browser-->>Crawler: Context ready Crawler->>Page: Navigate to target URL Page-->>Crawler: Page loaded Crawler->>Page: Execute download JavaScript Page->>Page: Find download links (.pdf, .zip, etc.) loop For each download link Page->>Browser: Click download link Browser->>FileSystem: Save file to downloads_path FileSystem-->>Browser: File saved Browser-->>Page: Download complete end Page-->>Crawler: All downloads triggered Crawler->>FileSystem: Check downloaded files FileSystem-->>Crawler: List of file paths Crawler-->>User: CrawlResult with downloaded_files[] Note over User,FileSystem: Files available in downloads_path ``` ### Hooks Execution Flow ```mermaid flowchart TD A[Start Crawl] --> B[on_browser_created Hook] B --> C[Browser Instance Created] C --> D[on_page_context_created Hook] D --> E[Page & Context Setup] E --> F[before_goto Hook] F --> G[Navigate to URL] G --> H[after_goto Hook] H --> I[Page Loaded] I --> J[before_retrieve_html Hook] J --> K[Extract HTML Content] K --> L[Return CrawlResult] subgraph "Hook Capabilities" B1[Route Filtering] B2[Authentication] B3[Custom Headers] B4[Viewport Setup] B5[Content Manipulation] end D --> B1 F --> B2 F --> B3 D --> B4 J --> B5 style A fill:#e1f5fe style L fill:#c8e6c9 style B fill:#fff3e0 style D fill:#f3e5f5 style F fill:#e8f5e8 style H fill:#fce4ec style J fill:#fff9c4 ``` ### Session Management State Machine ```mermaid stateDiagram-v2 [*] --> SessionCreated: session_id provided SessionCreated --> PageLoaded: Initial arun() PageLoaded --> JavaScriptExecution: js_code executed JavaScriptExecution --> ContentUpdated: DOM modified ContentUpdated --> NextOperation: js_only=True NextOperation --> JavaScriptExecution: More interactions NextOperation --> SessionMaintained: Keep session alive NextOperation --> SessionClosed: kill_session() SessionMaintained --> PageLoaded: Navigate to new URL SessionMaintained --> JavaScriptExecution: Continue interactions SessionClosed --> [*]: Session terminated note right of SessionCreated Browser tab created Context preserved end note note right of ContentUpdated State maintained Cookies preserved Local storage intact end note note right of SessionClosed Clean up resources Release browser tab end note ``` ### Lazy Loading & Dynamic Content Strategy ```mermaid flowchart TD A[Page Load] --> B{Content Type?} B -->|Static Content| C[Standard Extraction] B -->|Lazy Loaded| D[Enable scan_full_page] B -->|Infinite Scroll| E[Custom Scroll Strategy] B -->|Load More Button| F[JavaScript Interaction] D --> D1[Automatic Scrolling] D1 --> D2[Wait for Images] D2 --> D3[Content Stabilization] E --> E1[Detect Scroll Triggers] E1 --> E2[Progressive Loading] E2 --> E3[Monitor Content Changes] F --> F1[Find Load More Button] F1 --> F2[Click and Wait] F2 --> F3{More Content?} F3 -->|Yes| F1 F3 -->|No| G[Complete Extraction] D3 --> G E3 --> G C --> G G --> H[Return Enhanced Content] subgraph "Optimization Techniques" I[exclude_external_images] J[image_score_threshold] K[wait_for selectors] L[scroll_delay tuning] end D --> I E --> J F --> K D1 --> L style A fill:#e1f5fe style H fill:#c8e6c9 style D fill:#fff3e0 style E fill:#f3e5f5 style F fill:#e8f5e8 ``` ### Network & Console Monitoring Architecture ```mermaid graph TB subgraph "Browser Context" A[Web Page] --> B[Network Requests] A --> C[Console Messages] A --> D[Resource Loading] end subgraph "Monitoring Layer" B --> E[Request Interceptor] C --> F[Console Listener] D --> G[Resource Monitor] E --> H[Request Events] E --> I[Response Events] E --> J[Failure Events] F --> K[Log Messages] F --> L[Error Messages] F --> M[Warning Messages] end subgraph "Data Collection" H --> N[Request Details] I --> O[Response Analysis] J --> P[Failure Tracking] K --> Q[Debug Information] L --> R[Error Analysis] M --> S[Performance Insights] end subgraph "Output Aggregation" N --> T[network_requests Array] O --> T P --> T Q --> U[console_messages Array] R --> U S --> U end T --> V[CrawlResult] U --> V style V fill:#c8e6c9 style E fill:#fff3e0 style F fill:#f3e5f5 style T fill:#e8f5e8 style U fill:#fce4ec ``` ### Multi-Step Workflow Sequence ```mermaid sequenceDiagram participant User participant Crawler participant Session participant Page participant Server User->>Crawler: Step 1 - Initial load Crawler->>Session: Create session_id Session->>Page: New browser tab Page->>Server: GET /step1 Server-->>Page: Page content Page-->>Crawler: Content ready Crawler-->>User: Result 1 User->>Crawler: Step 2 - Navigate (js_only=true) Crawler->>Session: Reuse existing session Session->>Page: Execute JavaScript Page->>Page: Click next button Page->>Server: Navigate to /step2 Server-->>Page: New content Page-->>Crawler: Updated content Crawler-->>User: Result 2 User->>Crawler: Step 3 - Form submission Crawler->>Session: Continue session Session->>Page: Execute form JS Page->>Page: Fill form fields Page->>Server: POST form data Server-->>Page: Results page Page-->>Crawler: Final content Crawler-->>User: Result 3 User->>Crawler: Cleanup Crawler->>Session: kill_session() Session->>Page: Close tab Session-->>Crawler: Session terminated Note over User,Server: State preserved across steps Note over Session: Cookies, localStorage maintained ``` ### SSL Certificate Analysis Flow ```mermaid flowchart LR A[Enable SSL Fetch] --> B[HTTPS Connection] B --> C[Certificate Retrieval] C --> D[Certificate Analysis] D --> E[Basic Info] D --> F[Validity Check] D --> G[Chain Verification] D --> H[Security Assessment] E --> E1[Issuer Details] E --> E2[Subject Information] E --> E3[Serial Number] F --> F1[Not Before Date] F --> F2[Not After Date] F --> F3[Expiration Warning] G --> G1[Root CA] G --> G2[Intermediate Certs] G --> G3[Trust Path] H --> H1[Key Length] H --> H2[Signature Algorithm] H --> H3[Vulnerabilities] subgraph "Export Formats" I[JSON Format] J[PEM Format] K[DER Format] end E1 --> I F1 --> I G1 --> I H1 --> I I --> J J --> K style A fill:#e1f5fe style D fill:#fff3e0 style I fill:#e8f5e8 style J fill:#f3e5f5 style K fill:#fce4ec ``` ### Performance Optimization Decision Tree ```mermaid flowchart TD A[Performance Optimization] --> B{Primary Goal?} B -->|Speed| C[Fast Crawling Mode] B -->|Resource Usage| D[Memory Optimization] B -->|Scale| E[Batch Processing] B -->|Quality| F[Comprehensive Extraction] C --> C1[text_mode=True] C --> C2[exclude_all_images=True] C --> C3[excluded_tags=['script','style']] C --> C4[page_timeout=30000] D --> D1[light_mode=True] D --> D2[headless=True] D --> D3[semaphore_count=3] D --> D4[disable monitoring] E --> E1[stream=True] E --> E2[cache_mode=ENABLED] E --> E3[arun_many()] E --> E4[concurrent batches] F --> F1[wait_for_images=True] F --> F2[process_iframes=True] F --> F3[capture_network=True] F --> F4[screenshot=True] subgraph "Trade-offs" G[Speed vs Quality] H[Memory vs Features] I[Scale vs Detail] end C --> G D --> H E --> I subgraph "Monitoring Metrics" J[Response Time] K[Memory Usage] L[Success Rate] M[Content Quality] end C1 --> J D1 --> K E1 --> L F1 --> M style A fill:#e1f5fe style C fill:#e8f5e8 style D fill:#fff3e0 style E fill:#f3e5f5 style F fill:#fce4ec ``` ### Advanced Page Interaction Matrix ```mermaid graph LR subgraph "Interaction Types" A[Form Filling] B[Dynamic Loading] C[Modal Handling] D[Scroll Interactions] E[Button Clicks] end subgraph "Detection Methods" F[CSS Selectors] G[JavaScript Conditions] H[Element Visibility] I[Content Changes] J[Network Activity] end subgraph "Automation Features" K[simulate_user=True] L[magic=True] M[remove_overlay_elements=True] N[override_navigator=True] O[scan_full_page=True] end subgraph "Wait Strategies" P[wait_for CSS] Q[wait_for JS] R[wait_for_images] S[delay_before_return] T[custom timeouts] end A --> F A --> K A --> P B --> G B --> O B --> Q C --> H C --> L C --> M D --> I D --> O D --> S E --> F E --> K E --> T style A fill:#e8f5e8 style B fill:#fff3e0 style C fill:#f3e5f5 style D fill:#fce4ec style E fill:#e1f5fe ``` ### Input Source Processing Flow ```mermaid flowchart TD A[Input Source] --> B{Input Type?} B -->|URL| C[Web Request] B -->|file://| D[Local File] B -->|raw:| E[Raw HTML] C --> C1[HTTP/HTTPS Request] C1 --> C2[Browser Navigation] C2 --> C3[Page Rendering] C3 --> F[Content Processing] D --> D1[File System Access] D1 --> D2[Read HTML File] D2 --> D3[Parse Content] D3 --> F E --> E1[Parse Raw HTML] E1 --> E2[Create Virtual Page] E2 --> E3[Direct Processing] E3 --> F F --> G[Common Processing Pipeline] G --> H[Markdown Generation] G --> I[Link Extraction] G --> J[Media Processing] G --> K[Data Extraction] H --> L[CrawlResult] I --> L J --> L K --> L subgraph "Processing Features" M[Same extraction strategies] N[Same filtering options] O[Same output formats] P[Consistent results] end F --> M F --> N F --> O F --> P style A fill:#e1f5fe style L fill:#c8e6c9 style C fill:#e8f5e8 style D fill:#fff3e0 style E fill:#f3e5f5 ``` **📖 Learn more:** [Advanced Features Guide](https://docs.crawl4ai.com/advanced/advanced-features/), [Session Management](https://docs.crawl4ai.com/advanced/session-management/), [Hooks System](https://docs.crawl4ai.com/advanced/hooks-auth/), [Performance Optimization](https://docs.crawl4ai.com/advanced/performance/) **📖 Learn more:** [Identity-Based Crawling](https://docs.crawl4ai.com/advanced/identity-based-crawling/), [Session Management](https://docs.crawl4ai.com/advanced/session-management/), [Proxy & Security](https://docs.crawl4ai.com/advanced/proxy-security/), [Content Selection](https://docs.crawl4ai.com/core/content-selection/) **📖 Learn more:** [Configuration Reference](https://docs.crawl4ai.com/api/parameters/), [Best Practices](https://docs.crawl4ai.com/core/browser-crawler-config/), [Advanced Configuration](https://docs.crawl4ai.com/advanced/advanced-features/) --- ## Extraction Strategy Workflows and Architecture Visual representations of Crawl4AI's data extraction approaches, strategy selection, and processing workflows. ### Extraction Strategy Decision Tree ```mermaid flowchart TD A[Content to Extract] --> B{Content Type?} B -->|Simple Patterns| C[Common Data Types] B -->|Structured HTML| D[Predictable Structure] B -->|Complex Content| E[Requires Reasoning] B -->|Mixed Content| F[Multiple Data Types] C --> C1{Pattern Type?} C1 -->|Email, Phone, URLs| C2[Built-in Regex Patterns] C1 -->|Custom Patterns| C3[Custom Regex Strategy] C1 -->|LLM-Generated| C4[One-time Pattern Generation] D --> D1{Selector Type?} D1 -->|CSS Selectors| D2[JsonCssExtractionStrategy] D1 -->|XPath Expressions| D3[JsonXPathExtractionStrategy] D1 -->|Need Schema?| D4[Auto-generate Schema with LLM] E --> E1{LLM Provider?} E1 -->|OpenAI/Anthropic| E2[Cloud LLM Strategy] E1 -->|Local Ollama| E3[Local LLM Strategy] E1 -->|Cost-sensitive| E4[Hybrid: Generate Schema Once] F --> F1[Multi-Strategy Approach] F1 --> F2[1. Regex for Patterns] F1 --> F3[2. CSS for Structure] F1 --> F4[3. LLM for Complex Analysis] C2 --> G[Fast Extraction ⚡] C3 --> G C4 --> H[Cached Pattern Reuse] D2 --> I[Schema-based Extraction 🏗️] D3 --> I D4 --> J[Generated Schema Cache] E2 --> K[Intelligent Parsing 🧠] E3 --> K E4 --> L[Hybrid Cost-Effective] F2 --> M[Comprehensive Results 📊] F3 --> M F4 --> M style G fill:#c8e6c9 style I fill:#e3f2fd style K fill:#fff3e0 style M fill:#f3e5f5 style H fill:#e8f5e8 style J fill:#e8f5e8 style L fill:#ffecb3 ``` ### LLM Extraction Strategy Workflow ```mermaid sequenceDiagram participant User participant Crawler participant LLMStrategy participant Chunker participant LLMProvider participant Parser User->>Crawler: Configure LLMExtractionStrategy User->>Crawler: arun(url, config) Crawler->>Crawler: Navigate to URL Crawler->>Crawler: Extract content (HTML/Markdown) Crawler->>LLMStrategy: Process content LLMStrategy->>LLMStrategy: Check content size alt Content > chunk_threshold LLMStrategy->>Chunker: Split into chunks with overlap Chunker-->>LLMStrategy: Return chunks[] loop For each chunk LLMStrategy->>LLMProvider: Send chunk + schema + instruction LLMProvider-->>LLMStrategy: Return structured JSON end LLMStrategy->>LLMStrategy: Merge chunk results else Content <= threshold LLMStrategy->>LLMProvider: Send full content + schema LLMProvider-->>LLMStrategy: Return structured JSON end LLMStrategy->>Parser: Validate JSON schema Parser-->>LLMStrategy: Validated data LLMStrategy->>LLMStrategy: Track token usage LLMStrategy-->>Crawler: Return extracted_content Crawler-->>User: CrawlResult with JSON data User->>LLMStrategy: show_usage() LLMStrategy-->>User: Token count & estimated cost ``` ### Schema-Based Extraction Architecture ```mermaid graph TB subgraph "Schema Definition" A[JSON Schema] --> A1[baseSelector] A --> A2[fields[]] A --> A3[nested structures] A2 --> A4[CSS/XPath selectors] A2 --> A5[Data types: text, html, attribute] A2 --> A6[Default values] A3 --> A7[nested objects] A3 --> A8[nested_list arrays] A3 --> A9[simple lists] end subgraph "Extraction Engine" B[HTML Content] --> C[Selector Engine] C --> C1[CSS Selector Parser] C --> C2[XPath Evaluator] C1 --> D[Element Matcher] C2 --> D D --> E[Type Converter] E --> E1[Text Extraction] E --> E2[HTML Preservation] E --> E3[Attribute Extraction] E --> E4[Nested Processing] end subgraph "Result Processing" F[Raw Extracted Data] --> G[Structure Builder] G --> G1[Object Construction] G --> G2[Array Assembly] G --> G3[Type Validation] G1 --> H[JSON Output] G2 --> H G3 --> H end A --> C E --> F H --> I[extracted_content] style A fill:#e3f2fd style C fill:#f3e5f5 style G fill:#e8f5e8 style H fill:#c8e6c9 ``` ### Automatic Schema Generation Process ```mermaid stateDiagram-v2 [*] --> CheckCache CheckCache --> CacheHit: Schema exists CheckCache --> SamplePage: Schema missing CacheHit --> LoadSchema LoadSchema --> FastExtraction SamplePage --> ExtractHTML: Crawl sample URL ExtractHTML --> LLMAnalysis: Send HTML to LLM LLMAnalysis --> GenerateSchema: Create CSS/XPath selectors GenerateSchema --> ValidateSchema: Test generated schema ValidateSchema --> SchemaWorks: Valid selectors ValidateSchema --> RefineSchema: Invalid selectors RefineSchema --> LLMAnalysis: Iterate with feedback SchemaWorks --> CacheSchema: Save for reuse CacheSchema --> FastExtraction: Use cached schema FastExtraction --> [*]: No more LLM calls needed note right of CheckCache : One-time LLM cost note right of FastExtraction : Unlimited fast reuse note right of CacheSchema : JSON file storage ``` ### Multi-Strategy Extraction Pipeline ```mermaid flowchart LR A[Web Page Content] --> B[Strategy Pipeline] subgraph B["Extraction Pipeline"] B1[Stage 1: Regex Patterns] B2[Stage 2: Schema-based CSS] B3[Stage 3: LLM Analysis] B1 --> B1a[Email addresses] B1 --> B1b[Phone numbers] B1 --> B1c[URLs and links] B1 --> B1d[Currency amounts] B2 --> B2a[Structured products] B2 --> B2b[Article metadata] B2 --> B2c[User reviews] B2 --> B2d[Navigation links] B3 --> B3a[Sentiment analysis] B3 --> B3b[Key topics] B3 --> B3c[Entity recognition] B3 --> B3d[Content summary] end B1a --> C[Result Merger] B1b --> C B1c --> C B1d --> C B2a --> C B2b --> C B2c --> C B2d --> C B3a --> C B3b --> C B3c --> C B3d --> C C --> D[Combined JSON Output] D --> E[Final CrawlResult] style B1 fill:#c8e6c9 style B2 fill:#e3f2fd style B3 fill:#fff3e0 style C fill:#f3e5f5 ``` ### Performance Comparison Matrix ```mermaid graph TD subgraph "Strategy Performance" A[Extraction Strategy Comparison] subgraph "Speed ⚡" S1[Regex: ~10ms] S2[CSS Schema: ~50ms] S3[XPath: ~100ms] S4[LLM: ~2-10s] end subgraph "Accuracy 🎯" A1[Regex: Pattern-dependent] A2[CSS: High for structured] A3[XPath: Very high] A4[LLM: Excellent for complex] end subgraph "Cost 💰" C1[Regex: Free] C2[CSS: Free] C3[XPath: Free] C4[LLM: $0.001-0.01 per page] end subgraph "Complexity 🔧" X1[Regex: Simple patterns only] X2[CSS: Structured HTML] X3[XPath: Complex selectors] X4[LLM: Any content type] end end style S1 fill:#c8e6c9 style S2 fill:#e8f5e8 style S3 fill:#fff3e0 style S4 fill:#ffcdd2 style A2 fill:#e8f5e8 style A3 fill:#c8e6c9 style A4 fill:#c8e6c9 style C1 fill:#c8e6c9 style C2 fill:#c8e6c9 style C3 fill:#c8e6c9 style C4 fill:#fff3e0 style X1 fill:#ffcdd2 style X2 fill:#e8f5e8 style X3 fill:#c8e6c9 style X4 fill:#c8e6c9 ``` ### Regex Pattern Strategy Flow ```mermaid flowchart TD A[Regex Extraction] --> B{Pattern Source?} B -->|Built-in| C[Use Predefined Patterns] B -->|Custom| D[Define Custom Regex] B -->|LLM-Generated| E[Generate with AI] C --> C1[Email Pattern] C --> C2[Phone Pattern] C --> C3[URL Pattern] C --> C4[Currency Pattern] C --> C5[Date Pattern] D --> D1[Write Custom Regex] D --> D2[Test Pattern] D --> D3{Pattern Works?} D3 -->|No| D1 D3 -->|Yes| D4[Use Pattern] E --> E1[Provide Sample Content] E --> E2[LLM Analyzes Content] E --> E3[Generate Optimized Regex] E --> E4[Cache Pattern for Reuse] C1 --> F[Pattern Matching] C2 --> F C3 --> F C4 --> F C5 --> F D4 --> F E4 --> F F --> G[Extract Matches] G --> H[Group by Pattern Type] H --> I[JSON Output with Labels] style C fill:#e8f5e8 style D fill:#e3f2fd style E fill:#fff3e0 style F fill:#f3e5f5 ``` ### Complex Schema Structure Visualization ```mermaid graph TB subgraph "E-commerce Schema Example" A[Category baseSelector] --> B[Category Fields] A --> C[Products nested_list] B --> B1[category_name] B --> B2[category_id attribute] B --> B3[category_url attribute] C --> C1[Product baseSelector] C1 --> C2[name text] C1 --> C3[price text] C1 --> C4[Details nested object] C1 --> C5[Features list] C1 --> C6[Reviews nested_list] C4 --> C4a[brand text] C4 --> C4b[model text] C4 --> C4c[specs html] C5 --> C5a[feature text array] C6 --> C6a[reviewer text] C6 --> C6b[rating attribute] C6 --> C6c[comment text] C6 --> C6d[date attribute] end subgraph "JSON Output Structure" D[categories array] --> D1[category object] D1 --> D2[category_name] D1 --> D3[category_id] D1 --> D4[products array] D4 --> D5[product object] D5 --> D6[name, price] D5 --> D7[details object] D5 --> D8[features array] D5 --> D9[reviews array] D7 --> D7a[brand, model, specs] D8 --> D8a[feature strings] D9 --> D9a[review objects] end A -.-> D B1 -.-> D2 C2 -.-> D6 C4 -.-> D7 C5 -.-> D8 C6 -.-> D9 style A fill:#e3f2fd style C fill:#f3e5f5 style C4 fill:#e8f5e8 style D fill:#fff3e0 ``` ### Error Handling and Fallback Strategy ```mermaid stateDiagram-v2 [*] --> PrimaryStrategy PrimaryStrategy --> Success: Extraction successful PrimaryStrategy --> ValidationFailed: Invalid data PrimaryStrategy --> ExtractionFailed: No matches found PrimaryStrategy --> TimeoutError: LLM timeout ValidationFailed --> FallbackStrategy: Try alternative ExtractionFailed --> FallbackStrategy: Try alternative TimeoutError --> FallbackStrategy: Try alternative FallbackStrategy --> FallbackSuccess: Fallback works FallbackStrategy --> FallbackFailed: All strategies failed FallbackSuccess --> Success: Return results FallbackFailed --> ErrorReport: Log failure details Success --> [*]: Complete ErrorReport --> [*]: Return empty results note right of PrimaryStrategy : Try fastest/most accurate first note right of FallbackStrategy : Use simpler but reliable method note left of ErrorReport : Provide debugging information ``` ### Token Usage and Cost Optimization ```mermaid flowchart TD A[LLM Extraction Request] --> B{Content Size Check} B -->|Small < 1200 tokens| C[Single LLM Call] B -->|Large > 1200 tokens| D[Chunking Strategy] C --> C1[Send full content] C1 --> C2[Parse JSON response] C2 --> C3[Track token usage] D --> D1[Split into chunks] D1 --> D2[Add overlap between chunks] D2 --> D3[Process chunks in parallel] D3 --> D4[Chunk 1 → LLM] D3 --> D5[Chunk 2 → LLM] D3 --> D6[Chunk N → LLM] D4 --> D7[Merge results] D5 --> D7 D6 --> D7 D7 --> D8[Deduplicate data] D8 --> D9[Aggregate token usage] C3 --> E[Cost Calculation] D9 --> E E --> F[Usage Report] F --> F1[Prompt tokens: X] F --> F2[Completion tokens: Y] F --> F3[Total cost: $Z] style C fill:#c8e6c9 style D fill:#fff3e0 style E fill:#e3f2fd style F fill:#f3e5f5 ``` **📖 Learn more:** [LLM Strategies](https://docs.crawl4ai.com/extraction/llm-strategies/), [Schema-Based Extraction](https://docs.crawl4ai.com/extraction/no-llm-strategies/), [Pattern Matching](https://docs.crawl4ai.com/extraction/no-llm-strategies/#regexextractionstrategy), [Performance Optimization](https://docs.crawl4ai.com/advanced/multi-url-crawling/) --- ## Multi-URL Crawling Workflows and Architecture Visual representations of concurrent crawling patterns, resource management, and monitoring systems for handling multiple URLs efficiently. ### Multi-URL Processing Modes ```mermaid flowchart TD A[Multi-URL Crawling Request] --> B{Processing Mode?} B -->|Batch Mode| C[Collect All URLs] B -->|Streaming Mode| D[Process URLs Individually] C --> C1[Queue All URLs] C1 --> C2[Execute Concurrently] C2 --> C3[Wait for All Completion] C3 --> C4[Return Complete Results Array] D --> D1[Queue URLs] D1 --> D2[Start First Batch] D2 --> D3[Yield Results as Available] D3 --> D4{More URLs?} D4 -->|Yes| D5[Start Next URLs] D4 -->|No| D6[Stream Complete] D5 --> D3 C4 --> E[Process Results] D6 --> E E --> F[Success/Failure Analysis] F --> G[End] style C fill:#e3f2fd style D fill:#f3e5f5 style C4 fill:#c8e6c9 style D6 fill:#c8e6c9 ``` ### Memory-Adaptive Dispatcher Flow ```mermaid stateDiagram-v2 [*] --> Initializing Initializing --> MonitoringMemory: Start dispatcher MonitoringMemory --> CheckingMemory: Every check_interval CheckingMemory --> MemoryOK: Memory < threshold CheckingMemory --> MemoryHigh: Memory >= threshold MemoryOK --> DispatchingTasks: Start new crawls MemoryHigh --> WaitingForMemory: Pause dispatching DispatchingTasks --> TaskRunning: Launch crawler TaskRunning --> TaskCompleted: Crawl finished TaskRunning --> TaskFailed: Crawl error TaskCompleted --> MonitoringMemory: Update stats TaskFailed --> MonitoringMemory: Update stats WaitingForMemory --> CheckingMemory: Wait timeout WaitingForMemory --> MonitoringMemory: Memory freed note right of MemoryHigh: Prevents OOM crashes note right of DispatchingTasks: Respects max_session_permit note right of WaitingForMemory: Configurable timeout ``` ### Concurrent Crawling Architecture ```mermaid graph TB subgraph "URL Queue Management" A[URL Input List] --> B[URL Queue] B --> C[Priority Scheduler] C --> D[Batch Assignment] end subgraph "Dispatcher Layer" E[Memory Adaptive Dispatcher] F[Semaphore Dispatcher] G[Rate Limiter] H[Resource Monitor] E --> I[Memory Checker] F --> J[Concurrency Controller] G --> K[Delay Calculator] H --> L[System Stats] end subgraph "Crawler Pool" M[Crawler Instance 1] N[Crawler Instance 2] O[Crawler Instance 3] P[Crawler Instance N] M --> Q[Browser Session 1] N --> R[Browser Session 2] O --> S[Browser Session 3] P --> T[Browser Session N] end subgraph "Result Processing" U[Result Collector] V[Success Handler] W[Error Handler] X[Retry Queue] Y[Final Results] end D --> E D --> F E --> M F --> N G --> O H --> P Q --> U R --> U S --> U T --> U U --> V U --> W W --> X X --> B V --> Y style E fill:#e3f2fd style F fill:#f3e5f5 style G fill:#e8f5e8 style H fill:#fff3e0 ``` ### Rate Limiting and Backoff Strategy ```mermaid sequenceDiagram participant C as Crawler participant RL as Rate Limiter participant S as Server participant D as Dispatcher C->>RL: Request to crawl URL RL->>RL: Calculate delay RL->>RL: Apply base delay (1-3s) RL->>C: Delay applied C->>S: HTTP Request alt Success Response S-->>C: 200 OK + Content C->>RL: Report success RL->>RL: Reset failure count C->>D: Return successful result else Rate Limited S-->>C: 429 Too Many Requests C->>RL: Report rate limit RL->>RL: Exponential backoff RL->>RL: Increase delay (up to max_delay) RL->>C: Apply longer delay C->>S: Retry request after delay else Server Error S-->>C: 503 Service Unavailable C->>RL: Report server error RL->>RL: Moderate backoff RL->>C: Retry with backoff else Max Retries Exceeded RL->>C: Stop retrying C->>D: Return failed result end ``` ### Large-Scale Crawling Workflow ```mermaid flowchart TD A[Load URL List 10k+ URLs] --> B[Initialize Dispatcher] B --> C{Select Dispatcher Type} C -->|Memory Constrained| D[Memory Adaptive] C -->|Fixed Resources| E[Semaphore Based] D --> F[Set Memory Threshold 70%] E --> G[Set Concurrency Limit] F --> H[Configure Monitoring] G --> H H --> I[Start Crawling Process] I --> J[Monitor System Resources] J --> K{Memory Usage?} K -->|< Threshold| L[Continue Dispatching] K -->|>= Threshold| M[Pause New Tasks] L --> N[Process Results Stream] M --> O[Wait for Memory] O --> K N --> P{Result Type?} P -->|Success| Q[Save to Database] P -->|Failure| R[Log Error] Q --> S[Update Progress Counter] R --> S S --> T{More URLs?} T -->|Yes| U[Get Next Batch] T -->|No| V[Generate Final Report] U --> L V --> W[Analysis Complete] style A fill:#e1f5fe style D fill:#e8f5e8 style E fill:#f3e5f5 style V fill:#c8e6c9 style W fill:#a5d6a7 ``` ### Real-Time Monitoring Dashboard Flow ```mermaid graph LR subgraph "Data Collection" A[Crawler Tasks] --> B[Performance Metrics] A --> C[Memory Usage] A --> D[Success/Failure Rates] A --> E[Response Times] end subgraph "Monitor Processing" F[CrawlerMonitor] --> G[Aggregate Statistics] F --> H[Display Formatter] F --> I[Update Scheduler] end subgraph "Display Modes" J[DETAILED Mode] K[AGGREGATED Mode] J --> L[Individual Task Status] J --> M[Task-Level Metrics] K --> N[Summary Statistics] K --> O[Overall Progress] end subgraph "Output Interface" P[Console Display] Q[Progress Bars] R[Status Tables] S[Real-time Updates] end B --> F C --> F D --> F E --> F G --> J G --> K H --> J H --> K I --> J I --> K L --> P M --> Q N --> R O --> S style F fill:#e3f2fd style J fill:#f3e5f5 style K fill:#e8f5e8 ``` ### Error Handling and Recovery Pattern ```mermaid stateDiagram-v2 [*] --> ProcessingURL ProcessingURL --> CrawlAttempt: Start crawl CrawlAttempt --> Success: HTTP 200 CrawlAttempt --> NetworkError: Connection failed CrawlAttempt --> RateLimit: HTTP 429 CrawlAttempt --> ServerError: HTTP 5xx CrawlAttempt --> Timeout: Request timeout Success --> [*]: Return result NetworkError --> RetryCheck: Check retry count RateLimit --> BackoffWait: Apply exponential backoff ServerError --> RetryCheck: Check retry count Timeout --> RetryCheck: Check retry count BackoffWait --> RetryCheck: After delay RetryCheck --> CrawlAttempt: retries < max_retries RetryCheck --> Failed: retries >= max_retries Failed --> ErrorLog: Log failure details ErrorLog --> [*]: Return failed result note right of BackoffWait: Exponential backoff for rate limits note right of RetryCheck: Configurable max_retries note right of ErrorLog: Detailed error tracking ``` ### Resource Management Timeline ```mermaid gantt title Multi-URL Crawling Resource Management dateFormat X axisFormat %s section Memory Usage Initialize Dispatcher :0, 1 Memory Monitoring :1, 10 Peak Usage Period :3, 7 Memory Cleanup :7, 9 section Task Execution URL Queue Setup :0, 2 Batch 1 Processing :2, 5 Batch 2 Processing :4, 7 Batch 3 Processing :6, 9 Final Results :9, 10 section Rate Limiting Normal Delays :2, 4 Backoff Period :4, 6 Recovery Period :6, 8 section Monitoring System Health Check :0, 10 Progress Updates :1, 9 Performance Metrics :2, 8 ``` ### Concurrent Processing Performance Matrix ```mermaid graph TD subgraph "Input Factors" A[Number of URLs] B[Concurrency Level] C[Memory Threshold] D[Rate Limiting] end subgraph "Processing Characteristics" A --> E[Low 1-100 URLs] A --> F[Medium 100-1k URLs] A --> G[High 1k-10k URLs] A --> H[Very High 10k+ URLs] B --> I[Conservative 1-5] B --> J[Moderate 5-15] B --> K[Aggressive 15-30] C --> L[Strict 60-70%] C --> M[Balanced 70-80%] C --> N[Relaxed 80-90%] end subgraph "Recommended Configurations" E --> O[Simple Semaphore] F --> P[Memory Adaptive Basic] G --> Q[Memory Adaptive Advanced] H --> R[Memory Adaptive + Monitoring] I --> O J --> P K --> Q K --> R L --> Q M --> P N --> O end style O fill:#c8e6c9 style P fill:#fff3e0 style Q fill:#ffecb3 style R fill:#ffcdd2 ``` **📖 Learn more:** [Multi-URL Crawling Guide](https://docs.crawl4ai.com/advanced/multi-url-crawling/), [Dispatcher Configuration](https://docs.crawl4ai.com/advanced/crawl-dispatcher/), [Performance Optimization](https://docs.crawl4ai.com/advanced/multi-url-crawling/#performance-optimization) --- ## Deep Crawling Workflows and Architecture Visual representations of multi-level website exploration, filtering strategies, and intelligent crawling patterns. ### Deep Crawl Strategy Overview ```mermaid flowchart TD A[Start Deep Crawl] --> B{Strategy Selection} B -->|Explore All Levels| C[BFS Strategy] B -->|Dive Deep Fast| D[DFS Strategy] B -->|Smart Prioritization| E[Best-First Strategy] C --> C1[Breadth-First Search] C1 --> C2[Process all depth 0 links] C2 --> C3[Process all depth 1 links] C3 --> C4[Continue by depth level] D --> D1[Depth-First Search] D1 --> D2[Follow first link deeply] D2 --> D3[Backtrack when max depth reached] D3 --> D4[Continue with next branch] E --> E1[Best-First Search] E1 --> E2[Score all discovered URLs] E2 --> E3[Process highest scoring URLs first] E3 --> E4[Continuously re-prioritize queue] C4 --> F[Apply Filters] D4 --> F E4 --> F F --> G{Filter Chain Processing} G -->|Domain Filter| G1[Check allowed/blocked domains] G -->|URL Pattern Filter| G2[Match URL patterns] G -->|Content Type Filter| G3[Verify content types] G -->|SEO Filter| G4[Evaluate SEO quality] G -->|Content Relevance| G5[Score content relevance] G1 --> H{Passed All Filters?} G2 --> H G3 --> H G4 --> H G5 --> H H -->|Yes| I[Add to Crawl Queue] H -->|No| J[Discard URL] I --> K{Processing Mode} K -->|Streaming| L[Process Immediately] K -->|Batch| M[Collect All Results] L --> N[Stream Result to User] M --> O[Return Complete Result Set] J --> P{More URLs in Queue?} N --> P O --> P P -->|Yes| Q{Within Limits?} P -->|No| R[Deep Crawl Complete] Q -->|Max Depth OK| S{Max Pages OK} Q -->|Max Depth Exceeded| T[Skip Deeper URLs] S -->|Under Limit| U[Continue Crawling] S -->|Limit Reached| R T --> P U --> F style A fill:#e1f5fe style R fill:#c8e6c9 style C fill:#fff3e0 style D fill:#f3e5f5 style E fill:#e8f5e8 ``` ### Deep Crawl Strategy Comparison ```mermaid graph TB subgraph "BFS - Breadth-First Search" BFS1[Level 0: Start URL] BFS2[Level 1: All direct links] BFS3[Level 2: All second-level links] BFS4[Level 3: All third-level links] BFS1 --> BFS2 BFS2 --> BFS3 BFS3 --> BFS4 BFS_NOTE[Complete each depth before going deeper
Good for site mapping
Memory intensive for wide sites] end subgraph "DFS - Depth-First Search" DFS1[Start URL] DFS2[First Link → Deep] DFS3[Follow until max depth] DFS4[Backtrack and try next] DFS1 --> DFS2 DFS2 --> DFS3 DFS3 --> DFS4 DFS4 --> DFS2 DFS_NOTE[Go deep on first path
Memory efficient
May miss important pages] end subgraph "Best-First - Priority Queue" BF1[Start URL] BF2[Score all discovered links] BF3[Process highest scoring first] BF4[Continuously re-prioritize] BF1 --> BF2 BF2 --> BF3 BF3 --> BF4 BF4 --> BF2 BF_NOTE[Intelligent prioritization
Finds relevant content fast
Recommended for most use cases] end style BFS1 fill:#e3f2fd style DFS1 fill:#f3e5f5 style BF1 fill:#e8f5e8 style BFS_NOTE fill:#fff3e0 style DFS_NOTE fill:#fff3e0 style BF_NOTE fill:#fff3e0 ``` ### Filter Chain Processing Sequence ```mermaid sequenceDiagram participant URL as Discovered URL participant Chain as Filter Chain participant Domain as Domain Filter participant Pattern as URL Pattern Filter participant Content as Content Type Filter participant SEO as SEO Filter participant Relevance as Content Relevance Filter participant Queue as Crawl Queue URL->>Chain: Process URL Chain->>Domain: Check domain rules alt Domain Allowed Domain-->>Chain: ✓ Pass Chain->>Pattern: Check URL patterns alt Pattern Matches Pattern-->>Chain: ✓ Pass Chain->>Content: Check content type alt Content Type Valid Content-->>Chain: ✓ Pass Chain->>SEO: Evaluate SEO quality alt SEO Score Above Threshold SEO-->>Chain: ✓ Pass Chain->>Relevance: Score content relevance alt Relevance Score High Relevance-->>Chain: ✓ Pass Chain->>Queue: Add to crawl queue Queue-->>URL: Queued for crawling else Relevance Score Low Relevance-->>Chain: ✗ Reject Chain-->>URL: Filtered out - Low relevance end else SEO Score Low SEO-->>Chain: ✗ Reject Chain-->>URL: Filtered out - Poor SEO end else Invalid Content Type Content-->>Chain: ✗ Reject Chain-->>URL: Filtered out - Wrong content type end else Pattern Mismatch Pattern-->>Chain: ✗ Reject Chain-->>URL: Filtered out - Pattern mismatch end else Domain Blocked Domain-->>Chain: ✗ Reject Chain-->>URL: Filtered out - Blocked domain end ``` ### URL Lifecycle State Machine ```mermaid stateDiagram-v2 [*] --> Discovered: Found on page Discovered --> FilterPending: Enter filter chain FilterPending --> DomainCheck: Apply domain filter DomainCheck --> PatternCheck: Domain allowed DomainCheck --> Rejected: Domain blocked PatternCheck --> ContentCheck: Pattern matches PatternCheck --> Rejected: Pattern mismatch ContentCheck --> SEOCheck: Content type valid ContentCheck --> Rejected: Invalid content SEOCheck --> RelevanceCheck: SEO score sufficient SEOCheck --> Rejected: Poor SEO score RelevanceCheck --> Scored: Relevance score calculated RelevanceCheck --> Rejected: Low relevance Scored --> Queued: Added to priority queue Queued --> Crawling: Selected for processing Crawling --> Success: Page crawled successfully Crawling --> Failed: Crawl failed Success --> LinkExtraction: Extract new links LinkExtraction --> [*]: Process complete Failed --> [*]: Record failure Rejected --> [*]: Log rejection reason note right of Scored : Score determines priority
in Best-First strategy note right of Failed : Errors logged with
depth and reason ``` ### Streaming vs Batch Processing Architecture ```mermaid graph TB subgraph "Input" A[Start URL] --> B[Deep Crawl Strategy] end subgraph "Crawl Engine" B --> C[URL Discovery] C --> D[Filter Chain] D --> E[Priority Queue] E --> F[Page Processor] end subgraph "Streaming Mode stream=True" F --> G1[Process Page] G1 --> H1[Extract Content] H1 --> I1[Yield Result Immediately] I1 --> J1[async for result] J1 --> K1[Real-time Processing] G1 --> L1[Extract Links] L1 --> M1[Add to Queue] M1 --> F end subgraph "Batch Mode stream=False" F --> G2[Process Page] G2 --> H2[Extract Content] H2 --> I2[Store Result] I2 --> N2[Result Collection] G2 --> L2[Extract Links] L2 --> M2[Add to Queue] M2 --> O2{More URLs?} O2 -->|Yes| F O2 -->|No| P2[Return All Results] P2 --> Q2[Batch Processing] end style I1 fill:#e8f5e8 style K1 fill:#e8f5e8 style P2 fill:#e3f2fd style Q2 fill:#e3f2fd ``` ### Advanced Scoring and Prioritization System ```mermaid flowchart LR subgraph "URL Discovery" A[Page Links] --> B[Extract URLs] B --> C[Normalize URLs] end subgraph "Scoring System" C --> D[Keyword Relevance Scorer] D --> D1[URL Text Analysis] D --> D2[Keyword Matching] D --> D3[Calculate Base Score] D3 --> E[Additional Scoring Factors] E --> E1[URL Structure weight: 0.2] E --> E2[Link Context weight: 0.3] E --> E3[Page Depth Penalty weight: 0.1] E --> E4[Domain Authority weight: 0.4] D1 --> F[Combined Score] D2 --> F D3 --> F E1 --> F E2 --> F E3 --> F E4 --> F end subgraph "Prioritization" F --> G{Score Threshold} G -->|Above Threshold| H[Priority Queue] G -->|Below Threshold| I[Discard URL] H --> J[Best-First Selection] J --> K[Highest Score First] K --> L[Process Page] L --> M[Update Scores] M --> N[Re-prioritize Queue] N --> J end style F fill:#fff3e0 style H fill:#e8f5e8 style L fill:#e3f2fd ``` ### Deep Crawl Performance and Limits ```mermaid graph TD subgraph "Crawl Constraints" A[Max Depth: 2] --> A1[Prevents infinite crawling] B[Max Pages: 50] --> B1[Controls resource usage] C[Score Threshold: 0.3] --> C1[Quality filtering] D[Domain Limits] --> D1[Scope control] end subgraph "Performance Monitoring" E[Pages Crawled] --> F[Depth Distribution] E --> G[Success Rate] E --> H[Average Score] E --> I[Processing Time] F --> J[Performance Report] G --> J H --> J I --> J end subgraph "Resource Management" K[Memory Usage] --> L{Memory Threshold} L -->|Under Limit| M[Continue Crawling] L -->|Over Limit| N[Reduce Concurrency] O[CPU Usage] --> P{CPU Threshold} P -->|Normal| M P -->|High| Q[Add Delays] R[Network Load] --> S{Rate Limits} S -->|OK| M S -->|Exceeded| T[Throttle Requests] end M --> U[Optimal Performance] N --> V[Reduced Performance] Q --> V T --> V style U fill:#c8e6c9 style V fill:#fff3e0 style J fill:#e3f2fd ``` ### Error Handling and Recovery Flow ```mermaid sequenceDiagram participant Strategy as Deep Crawl Strategy participant Queue as Priority Queue participant Crawler as Page Crawler participant Error as Error Handler participant Result as Result Collector Strategy->>Queue: Get next URL Queue-->>Strategy: Return highest priority URL Strategy->>Crawler: Crawl page alt Successful Crawl Crawler-->>Strategy: Return page content Strategy->>Result: Store successful result Strategy->>Strategy: Extract new links Strategy->>Queue: Add new URLs to queue else Network Error Crawler-->>Error: Network timeout/failure Error->>Error: Log error with details Error->>Queue: Mark URL as failed Error-->>Strategy: Skip to next URL else Parse Error Crawler-->>Error: HTML parsing failed Error->>Error: Log parse error Error->>Result: Store failed result Error-->>Strategy: Continue with next URL else Rate Limit Hit Crawler-->>Error: Rate limit exceeded Error->>Error: Apply backoff strategy Error->>Queue: Re-queue URL with delay Error-->>Strategy: Wait before retry else Depth Limit Strategy->>Strategy: Check depth constraint Strategy-->>Queue: Skip URL - too deep else Page Limit Strategy->>Strategy: Check page count Strategy-->>Result: Stop crawling - limit reached end Strategy->>Queue: Request next URL Queue-->>Strategy: More URLs available? alt Queue Empty Queue-->>Result: Crawl complete else Queue Has URLs Queue-->>Strategy: Continue crawling end ``` **📖 Learn more:** [Deep Crawling Strategies](https://docs.crawl4ai.com/core/deep-crawling/), [Content Filtering](https://docs.crawl4ai.com/core/content-selection/), [Advanced Crawling Patterns](https://docs.crawl4ai.com/advanced/advanced-features/) --- ## Docker Deployment Architecture and Workflows Visual representations of Crawl4AI Docker deployment, API architecture, configuration management, and service interactions. ### Docker Deployment Decision Flow ```mermaid flowchart TD A[Start Docker Deployment] --> B{Deployment Type?} B -->|Quick Start| C[Pre-built Image] B -->|Development| D[Docker Compose] B -->|Custom Build| E[Manual Build] B -->|Production| F[Production Setup] C --> C1[docker pull unclecode/crawl4ai] C1 --> C2{Need LLM Support?} C2 -->|Yes| C3[Setup .llm.env] C2 -->|No| C4[Basic run] C3 --> C5[docker run with --env-file] C4 --> C6[docker run basic] D --> D1[git clone repository] D1 --> D2[cp .llm.env.example .llm.env] D2 --> D3{Build Type?} D3 -->|Pre-built| D4[IMAGE=latest docker compose up] D3 -->|Local Build| D5[docker compose up --build] D3 -->|All Features| D6[INSTALL_TYPE=all docker compose up] E --> E1[docker buildx build] E1 --> E2{Architecture?} E2 -->|Single| E3[--platform linux/amd64] E2 -->|Multi| E4[--platform linux/amd64,linux/arm64] E3 --> E5[Build complete] E4 --> E5 F --> F1[Production configuration] F1 --> F2[Custom config.yml] F2 --> F3[Resource limits] F3 --> F4[Health monitoring] F4 --> F5[Production ready] C5 --> G[Service running on :11235] C6 --> G D4 --> G D5 --> G D6 --> G E5 --> H[docker run custom image] H --> G F5 --> I[Production deployment] G --> J[Access playground at /playground] G --> K[Health check at /health] I --> L[Production monitoring] style A fill:#e1f5fe style G fill:#c8e6c9 style I fill:#c8e6c9 style J fill:#fff3e0 style K fill:#fff3e0 style L fill:#e8f5e8 ``` ### Docker Container Architecture ```mermaid graph TB subgraph "Host Environment" A[Docker Engine] --> B[Crawl4AI Container] C[.llm.env] --> B D[Custom config.yml] --> B E[Port 11235] --> B F[Shared Memory 1GB+] --> B end subgraph "Container Services" B --> G[FastAPI Server :8020] B --> H[Gunicorn WSGI] B --> I[Supervisord Process Manager] B --> J[Redis Cache :6379] G --> K[REST API Endpoints] G --> L[WebSocket Connections] G --> M[MCP Protocol] H --> N[Worker Processes] I --> O[Service Monitoring] J --> P[Request Caching] end subgraph "Browser Management" B --> Q[Playwright Framework] Q --> R[Chromium Browser] Q --> S[Firefox Browser] Q --> T[WebKit Browser] R --> U[Browser Pool] S --> U T --> U U --> V[Page Sessions] U --> W[Context Management] end subgraph "External Services" X[OpenAI API] -.-> K Y[Anthropic Claude] -.-> K Z[Local Ollama] -.-> K AA[Groq API] -.-> K BB[Google Gemini] -.-> K end subgraph "Client Interactions" CC[Python SDK] --> K DD[REST API Calls] --> K EE[MCP Clients] --> M FF[Web Browser] --> G GG[Monitoring Tools] --> K end style B fill:#e3f2fd style G fill:#f3e5f5 style Q fill:#e8f5e8 style K fill:#fff3e0 ``` ### API Endpoints Architecture ```mermaid graph LR subgraph "Core Endpoints" A[/crawl] --> A1[Single URL crawl] A2[/crawl/stream] --> A3[Streaming multi-URL] A4[/crawl/job] --> A5[Async job submission] A6[/crawl/job/{id}] --> A7[Job status check] end subgraph "Specialized Endpoints" B[/html] --> B1[Preprocessed HTML] B2[/screenshot] --> B3[PNG capture] B4[/pdf] --> B5[PDF generation] B6[/execute_js] --> B7[JavaScript execution] B8[/md] --> B9[Markdown extraction] end subgraph "Utility Endpoints" C[/health] --> C1[Service status] C2[/metrics] --> C3[Prometheus metrics] C4[/schema] --> C5[API documentation] C6[/playground] --> C7[Interactive testing] end subgraph "LLM Integration" D[/llm/{url}] --> D1[Q&A over URL] D2[/ask] --> D3[Library context search] D4[/config/dump] --> D5[Config validation] end subgraph "MCP Protocol" E[/mcp/sse] --> E1[Server-Sent Events] E2[/mcp/ws] --> E3[WebSocket connection] E4[/mcp/schema] --> E5[MCP tool definitions] end style A fill:#e3f2fd style B fill:#f3e5f5 style C fill:#e8f5e8 style D fill:#fff3e0 style E fill:#fce4ec ``` ### Request Processing Flow ```mermaid sequenceDiagram participant Client participant FastAPI participant RequestValidator participant BrowserPool participant Playwright participant ExtractionEngine participant LLMProvider Client->>FastAPI: POST /crawl with config FastAPI->>RequestValidator: Validate JSON structure alt Valid Request RequestValidator-->>FastAPI: ✓ Validated FastAPI->>BrowserPool: Request browser instance BrowserPool->>Playwright: Launch browser/reuse session Playwright-->>BrowserPool: Browser ready BrowserPool-->>FastAPI: Browser allocated FastAPI->>Playwright: Navigate to URL Playwright->>Playwright: Execute JS, wait conditions Playwright-->>FastAPI: Page content ready FastAPI->>ExtractionEngine: Process content alt LLM Extraction ExtractionEngine->>LLMProvider: Send content + schema LLMProvider-->>ExtractionEngine: Structured data else CSS Extraction ExtractionEngine->>ExtractionEngine: Apply CSS selectors end ExtractionEngine-->>FastAPI: Extraction complete FastAPI->>BrowserPool: Release browser FastAPI-->>Client: CrawlResult response else Invalid Request RequestValidator-->>FastAPI: ✗ Validation error FastAPI-->>Client: 400 Bad Request end ``` ### Configuration Management Flow ```mermaid stateDiagram-v2 [*] --> ConfigLoading ConfigLoading --> DefaultConfig: Load default config.yml ConfigLoading --> CustomConfig: Custom config mounted ConfigLoading --> EnvOverrides: Environment variables DefaultConfig --> ConfigMerging CustomConfig --> ConfigMerging EnvOverrides --> ConfigMerging ConfigMerging --> ConfigValidation ConfigValidation --> Valid: Schema validation passes ConfigValidation --> Invalid: Validation errors Invalid --> ConfigError: Log errors and exit ConfigError --> [*] Valid --> ServiceInitialization ServiceInitialization --> FastAPISetup ServiceInitialization --> BrowserPoolInit ServiceInitialization --> CacheSetup FastAPISetup --> Running BrowserPoolInit --> Running CacheSetup --> Running Running --> ConfigReload: Config change detected ConfigReload --> ConfigValidation Running --> [*]: Service shutdown note right of ConfigMerging : Priority: ENV > Custom > Default note right of ServiceInitialization : All services must initialize successfully ``` ### Multi-Architecture Build Process ```mermaid flowchart TD A[Developer Push] --> B[GitHub Repository] B --> C[Docker Buildx] C --> D{Build Strategy} D -->|Multi-arch| E[Parallel Builds] D -->|Single-arch| F[Platform-specific Build] E --> G[AMD64 Build] E --> H[ARM64 Build] F --> I[Target Platform Build] subgraph "AMD64 Build Process" G --> G1[Ubuntu base image] G1 --> G2[Python 3.11 install] G2 --> G3[System dependencies] G3 --> G4[Crawl4AI installation] G4 --> G5[Playwright setup] G5 --> G6[FastAPI configuration] G6 --> G7[AMD64 image ready] end subgraph "ARM64 Build Process" H --> H1[Ubuntu ARM64 base] H1 --> H2[Python 3.11 install] H2 --> H3[ARM-specific deps] H3 --> H4[Crawl4AI installation] H4 --> H5[Playwright setup] H5 --> H6[FastAPI configuration] H6 --> H7[ARM64 image ready] end subgraph "Single Architecture" I --> I1[Base image selection] I1 --> I2[Platform dependencies] I2 --> I3[Application setup] I3 --> I4[Platform image ready] end G7 --> J[Multi-arch Manifest] H7 --> J I4 --> K[Platform Image] J --> L[Docker Hub Registry] K --> L L --> M[Pull Request Auto-selects Architecture] style A fill:#e1f5fe style J fill:#c8e6c9 style K fill:#c8e6c9 style L fill:#f3e5f5 style M fill:#e8f5e8 ``` ### MCP Integration Architecture ```mermaid graph TB subgraph "MCP Client Applications" A[Claude Code] --> B[MCP Protocol] C[Cursor IDE] --> B D[Windsurf] --> B E[Custom MCP Client] --> B end subgraph "Crawl4AI MCP Server" B --> F[MCP Endpoint Router] F --> G[SSE Transport /mcp/sse] F --> H[WebSocket Transport /mcp/ws] F --> I[Schema Endpoint /mcp/schema] G --> J[MCP Tool Handler] H --> J J --> K[Tool: md] J --> L[Tool: html] J --> M[Tool: screenshot] J --> N[Tool: pdf] J --> O[Tool: execute_js] J --> P[Tool: crawl] J --> Q[Tool: ask] end subgraph "Crawl4AI Core Services" K --> R[Markdown Generator] L --> S[HTML Preprocessor] M --> T[Screenshot Service] N --> U[PDF Generator] O --> V[JavaScript Executor] P --> W[Batch Crawler] Q --> X[Context Search] R --> Y[Browser Pool] S --> Y T --> Y U --> Y V --> Y W --> Y X --> Z[Knowledge Base] end subgraph "External Resources" Y --> AA[Playwright Browsers] Z --> BB[Library Documentation] Z --> CC[Code Examples] AA --> DD[Web Pages] end style B fill:#e3f2fd style J fill:#f3e5f5 style Y fill:#e8f5e8 style Z fill:#fff3e0 ``` ### API Request/Response Flow Patterns ```mermaid sequenceDiagram participant Client participant LoadBalancer participant FastAPI participant ConfigValidator participant BrowserManager participant CrawlEngine participant ResponseBuilder Note over Client,ResponseBuilder: Basic Crawl Request Client->>LoadBalancer: POST /crawl LoadBalancer->>FastAPI: Route request FastAPI->>ConfigValidator: Validate browser_config ConfigValidator-->>FastAPI: ✓ Valid BrowserConfig FastAPI->>ConfigValidator: Validate crawler_config ConfigValidator-->>FastAPI: ✓ Valid CrawlerRunConfig FastAPI->>BrowserManager: Allocate browser BrowserManager-->>FastAPI: Browser instance FastAPI->>CrawlEngine: Execute crawl Note over CrawlEngine: Page processing CrawlEngine->>CrawlEngine: Navigate & wait CrawlEngine->>CrawlEngine: Extract content CrawlEngine->>CrawlEngine: Apply strategies CrawlEngine-->>FastAPI: CrawlResult FastAPI->>ResponseBuilder: Format response ResponseBuilder-->>FastAPI: JSON response FastAPI->>BrowserManager: Release browser FastAPI-->>LoadBalancer: Response ready LoadBalancer-->>Client: 200 OK + CrawlResult Note over Client,ResponseBuilder: Streaming Request Client->>FastAPI: POST /crawl/stream FastAPI-->>Client: 200 OK (stream start) loop For each URL FastAPI->>CrawlEngine: Process URL CrawlEngine-->>FastAPI: Result ready FastAPI-->>Client: NDJSON line end FastAPI-->>Client: Stream completed ``` ### Configuration Validation Workflow ```mermaid flowchart TD A[Client Request] --> B[JSON Payload] B --> C{Pre-validation} C -->|✓ Valid JSON| D[Extract Configurations] C -->|✗ Invalid JSON| E[Return 400 Bad Request] D --> F[BrowserConfig Validation] D --> G[CrawlerRunConfig Validation] F --> H{BrowserConfig Valid?} G --> I{CrawlerRunConfig Valid?} H -->|✓ Valid| J[Browser Setup] H -->|✗ Invalid| K[Log Browser Config Errors] I -->|✓ Valid| L[Crawler Setup] I -->|✗ Invalid| M[Log Crawler Config Errors] K --> N[Collect All Errors] M --> N N --> O[Return 422 Validation Error] J --> P{Both Configs Valid?} L --> P P -->|✓ Yes| Q[Proceed to Crawling] P -->|✗ No| O Q --> R[Execute Crawl Pipeline] R --> S[Return CrawlResult] E --> T[Client Error Response] O --> T S --> U[Client Success Response] style A fill:#e1f5fe style Q fill:#c8e6c9 style S fill:#c8e6c9 style U fill:#c8e6c9 style E fill:#ffcdd2 style O fill:#ffcdd2 style T fill:#ffcdd2 ``` ### Production Deployment Architecture ```mermaid graph TB subgraph "Load Balancer Layer" A[NGINX/HAProxy] --> B[Health Check] A --> C[Request Routing] A --> D[SSL Termination] end subgraph "Application Layer" C --> E[Crawl4AI Instance 1] C --> F[Crawl4AI Instance 2] C --> G[Crawl4AI Instance N] E --> H[FastAPI Server] F --> I[FastAPI Server] G --> J[FastAPI Server] H --> K[Browser Pool 1] I --> L[Browser Pool 2] J --> M[Browser Pool N] end subgraph "Shared Services" N[Redis Cluster] --> E N --> F N --> G O[Monitoring Stack] --> P[Prometheus] O --> Q[Grafana] O --> R[AlertManager] P --> E P --> F P --> G end subgraph "External Dependencies" S[OpenAI API] -.-> H T[Anthropic API] -.-> I U[Local LLM Cluster] -.-> J end subgraph "Persistent Storage" V[Configuration Volume] --> E V --> F V --> G W[Cache Volume] --> N X[Logs Volume] --> O end style A fill:#e3f2fd style E fill:#f3e5f5 style F fill:#f3e5f5 style G fill:#f3e5f5 style N fill:#e8f5e8 style O fill:#fff3e0 ``` ### Docker Resource Management ```mermaid graph TD subgraph "Resource Allocation" A[Host Resources] --> B[CPU Cores] A --> C[Memory GB] A --> D[Disk Space] A --> E[Network Bandwidth] B --> F[Container Limits] C --> F D --> F E --> F end subgraph "Container Configuration" F --> G[--cpus=4] F --> H[--memory=8g] F --> I[--shm-size=2g] F --> J[Volume Mounts] G --> K[Browser Processes] H --> L[Browser Memory] I --> M[Shared Memory for Browsers] J --> N[Config & Cache Storage] end subgraph "Monitoring & Scaling" O[Resource Monitor] --> P[CPU Usage %] O --> Q[Memory Usage %] O --> R[Request Queue Length] P --> S{CPU > 80%?} Q --> T{Memory > 90%?} R --> U{Queue > 100?} S -->|Yes| V[Scale Up] T -->|Yes| V U -->|Yes| V V --> W[Add Container Instance] W --> X[Update Load Balancer] end subgraph "Performance Optimization" Y[Browser Pool Tuning] --> Z[Max Pages: 40] Y --> AA[Idle TTL: 30min] Y --> BB[Concurrency Limits] Z --> CC[Memory Efficiency] AA --> DD[Resource Cleanup] BB --> EE[Throughput Control] end style A fill:#e1f5fe style F fill:#f3e5f5 style O fill:#e8f5e8 style Y fill:#fff3e0 ``` **📖 Learn more:** [Docker Deployment Guide](https://docs.crawl4ai.com/core/docker-deployment/), [API Reference](https://docs.crawl4ai.com/api/), [MCP Integration](https://docs.crawl4ai.com/core/docker-deployment/#mcp-model-context-protocol-support), [Production Configuration](https://docs.crawl4ai.com/core/docker-deployment/#production-deployment) --- ## CLI Workflows and Profile Management Visual representations of command-line interface operations, browser profile management, and identity-based crawling workflows. ### CLI Command Flow Architecture ```mermaid flowchart TD A[crwl command] --> B{Command Type?} B -->|URL Crawling| C[Parse URL & Options] B -->|Profile Management| D[profiles subcommand] B -->|CDP Browser| E[cdp subcommand] B -->|Browser Control| F[browser subcommand] B -->|Configuration| G[config subcommand] C --> C1{Output Format?} C1 -->|Default| C2[HTML/Markdown] C1 -->|JSON| C3[Structured Data] C1 -->|markdown| C4[Clean Markdown] C1 -->|markdown-fit| C5[Filtered Content] C --> C6{Authentication?} C6 -->|Profile Specified| C7[Load Browser Profile] C6 -->|No Profile| C8[Anonymous Session] C7 --> C9[Launch with User Data] C8 --> C10[Launch Clean Browser] C9 --> C11[Execute Crawl] C10 --> C11 C11 --> C12{Success?} C12 -->|Yes| C13[Return Results] C12 -->|No| C14[Error Handling] D --> D1[Interactive Profile Menu] D1 --> D2{Menu Choice?} D2 -->|Create| D3[Open Browser for Setup] D2 -->|List| D4[Show Existing Profiles] D2 -->|Delete| D5[Remove Profile] D2 -->|Use| D6[Crawl with Profile] E --> E1[Launch CDP Browser] E1 --> E2[Remote Debugging Active] F --> F1{Browser Action?} F1 -->|start| F2[Start Builtin Browser] F1 -->|stop| F3[Stop Builtin Browser] F1 -->|status| F4[Check Browser Status] F1 -->|view| F5[Open Browser Window] G --> G1{Config Action?} G1 -->|list| G2[Show All Settings] G1 -->|set| G3[Update Setting] G1 -->|get| G4[Read Setting] style A fill:#e1f5fe style C13 fill:#c8e6c9 style C14 fill:#ffcdd2 style D3 fill:#fff3e0 style E2 fill:#f3e5f5 ``` ### Profile Management Workflow ```mermaid sequenceDiagram participant User participant CLI participant ProfileManager participant Browser participant FileSystem User->>CLI: crwl profiles CLI->>ProfileManager: Initialize profile manager ProfileManager->>FileSystem: Scan for existing profiles FileSystem-->>ProfileManager: Profile list ProfileManager-->>CLI: Show interactive menu CLI-->>User: Display options Note over User: User selects "Create new profile" User->>CLI: Create profile "linkedin-auth" CLI->>ProfileManager: create_profile("linkedin-auth") ProfileManager->>FileSystem: Create profile directory ProfileManager->>Browser: Launch with new user data dir Browser-->>User: Opens browser window Note over User: User manually logs in to LinkedIn User->>Browser: Navigate and authenticate Browser->>FileSystem: Save cookies, session data User->>CLI: Press 'q' to save profile CLI->>ProfileManager: finalize_profile() ProfileManager->>FileSystem: Lock profile settings ProfileManager-->>CLI: Profile saved CLI-->>User: Profile "linkedin-auth" created Note over User: Later usage User->>CLI: crwl https://linkedin.com/feed -p linkedin-auth CLI->>ProfileManager: load_profile("linkedin-auth") ProfileManager->>FileSystem: Read profile data FileSystem-->>ProfileManager: User data directory ProfileManager-->>CLI: Profile configuration CLI->>Browser: Launch with existing profile Browser-->>CLI: Authenticated session ready CLI->>Browser: Navigate to target URL Browser-->>CLI: Crawl results with auth context CLI-->>User: Authenticated content ``` ### Browser Management State Machine ```mermaid stateDiagram-v2 [*] --> Stopped: Initial state Stopped --> Starting: crwl browser start Starting --> Running: Browser launched Running --> Viewing: crwl browser view Viewing --> Running: Close window Running --> Stopping: crwl browser stop Stopping --> Stopped: Cleanup complete Running --> Restarting: crwl browser restart Restarting --> Running: New browser instance Stopped --> CDP_Mode: crwl cdp CDP_Mode --> CDP_Running: Remote debugging active CDP_Running --> CDP_Mode: Manual close CDP_Mode --> Stopped: Exit CDP Running --> StatusCheck: crwl browser status StatusCheck --> Running: Return status note right of Running : Port 9222 active\nBuiltin browser available note right of CDP_Running : Remote debugging\nManual control enabled note right of Viewing : Visual browser window\nDirect interaction ``` ### Authentication Workflow for Protected Sites ```mermaid flowchart TD A[Protected Site Access Needed] --> B[Create Profile Strategy] B --> C{Existing Profile?} C -->|Yes| D[Test Profile Validity] C -->|No| E[Create New Profile] D --> D1{Profile Valid?} D1 -->|Yes| F[Use Existing Profile] D1 -->|No| E E --> E1[crwl profiles] E1 --> E2[Select Create New Profile] E2 --> E3[Enter Profile Name] E3 --> E4[Browser Opens for Auth] E4 --> E5{Authentication Method?} E5 -->|Login Form| E6[Fill Username/Password] E5 -->|OAuth| E7[OAuth Flow] E5 -->|2FA| E8[Handle 2FA] E5 -->|Session Cookie| E9[Import Cookies] E6 --> E10[Manual Login Process] E7 --> E10 E8 --> E10 E9 --> E10 E10 --> E11[Verify Authentication] E11 --> E12{Auth Successful?} E12 -->|Yes| E13[Save Profile - Press q] E12 -->|No| E10 E13 --> F F --> G[Execute Authenticated Crawl] G --> H[crwl URL -p profile-name] H --> I[Load Profile Data] I --> J[Launch Browser with Auth] J --> K[Navigate to Protected Content] K --> L[Extract Authenticated Data] L --> M[Return Results] style E4 fill:#fff3e0 style E10 fill:#e3f2fd style F fill:#e8f5e8 style M fill:#c8e6c9 ``` ### CDP Browser Architecture ```mermaid graph TB subgraph "CLI Layer" A[crwl cdp command] --> B[CDP Manager] B --> C[Port Configuration] B --> D[Profile Selection] end subgraph "Browser Process" E[Chromium/Firefox] --> F[Remote Debugging] F --> G[WebSocket Endpoint] G --> H[ws://localhost:9222] end subgraph "Client Connections" I[Manual Browser Control] --> H J[DevTools Interface] --> H K[External Automation] --> H L[Crawl4AI Crawler] --> H end subgraph "Profile Data" M[User Data Directory] --> E N[Cookies & Sessions] --> M O[Extensions] --> M P[Browser State] --> M end A --> E C --> H D --> M style H fill:#e3f2fd style E fill:#f3e5f5 style M fill:#e8f5e8 ``` ### Configuration Management Hierarchy ```mermaid graph TD subgraph "Global Configuration" A[~/.crawl4ai/config.yml] --> B[Default Settings] B --> C[LLM Providers] B --> D[Browser Defaults] B --> E[Output Preferences] end subgraph "Profile Configuration" F[Profile Directory] --> G[Browser State] F --> H[Authentication Data] F --> I[Site-Specific Settings] end subgraph "Command-Line Overrides" J[-b browser_config] --> K[Runtime Browser Settings] L[-c crawler_config] --> M[Runtime Crawler Settings] N[-o output_format] --> O[Runtime Output Format] end subgraph "Configuration Files" P[browser.yml] --> Q[Browser Config Template] R[crawler.yml] --> S[Crawler Config Template] T[extract.yml] --> U[Extraction Config] end subgraph "Resolution Order" V[Command Line Args] --> W[Config Files] W --> X[Profile Settings] X --> Y[Global Defaults] end J --> V L --> V N --> V P --> W R --> W T --> W F --> X A --> Y style V fill:#ffcdd2 style W fill:#fff3e0 style X fill:#e3f2fd style Y fill:#e8f5e8 ``` ### Identity-Based Crawling Decision Tree ```mermaid flowchart TD A[Target Website Assessment] --> B{Authentication Required?} B -->|No| C[Standard Anonymous Crawl] B -->|Yes| D{Authentication Type?} D -->|Login Form| E[Create Login Profile] D -->|OAuth/SSO| F[Create OAuth Profile] D -->|API Key/Token| G[Use Headers/Config] D -->|Session Cookies| H[Import Cookie Profile] E --> E1[crwl profiles → Manual login] F --> F1[crwl profiles → OAuth flow] G --> G1[Configure headers in crawler config] H --> H1[Import cookies to profile] E1 --> I[Test Authentication] F1 --> I G1 --> I H1 --> I I --> J{Auth Test Success?} J -->|Yes| K[Production Crawl Setup] J -->|No| L[Debug Authentication] L --> L1{Common Issues?} L1 -->|Rate Limiting| L2[Add delays/user simulation] L1 -->|Bot Detection| L3[Enable stealth mode] L1 -->|Session Expired| L4[Refresh authentication] L1 -->|CAPTCHA| L5[Manual intervention needed] L2 --> M[Retry with Adjustments] L3 --> M L4 --> E1 L5 --> N[Semi-automated approach] M --> I N --> O[Manual auth + automated crawl] K --> P[Automated Authenticated Crawling] O --> P C --> P P --> Q[Monitor & Maintain Profiles] style I fill:#fff3e0 style K fill:#e8f5e8 style P fill:#c8e6c9 style L fill:#ffcdd2 style N fill:#f3e5f5 ``` ### CLI Usage Patterns and Best Practices ```mermaid timeline title CLI Workflow Evolution section Setup Phase Installation : pip install crawl4ai : crawl4ai-setup Basic Test : crwl https://example.com Config Setup : crwl config set defaults section Profile Creation Site Analysis : Identify auth requirements Profile Creation : crwl profiles Manual Login : Authenticate in browser Profile Save : Press 'q' to save section Development Phase Test Crawls : crwl URL -p profile -v Config Tuning : Adjust browser/crawler settings Output Testing : Try different output formats Error Handling : Debug authentication issues section Production Phase Automated Crawls : crwl URL -p profile -o json Batch Processing : Multiple URLs with same profile Monitoring : Check profile validity Maintenance : Update profiles as needed ``` ### Multi-Profile Management Strategy ```mermaid graph LR subgraph "Profile Categories" A[Social Media Profiles] B[Work/Enterprise Profiles] C[E-commerce Profiles] D[Research Profiles] end subgraph "Social Media" A --> A1[linkedin-personal] A --> A2[twitter-monitor] A --> A3[facebook-research] A --> A4[instagram-brand] end subgraph "Enterprise" B --> B1[company-intranet] B --> B2[github-enterprise] B --> B3[confluence-docs] B --> B4[jira-tickets] end subgraph "E-commerce" C --> C1[amazon-seller] C --> C2[shopify-admin] C --> C3[ebay-monitor] C --> C4[marketplace-competitor] end subgraph "Research" D --> D1[academic-journals] D --> D2[data-platforms] D --> D3[survey-tools] D --> D4[government-portals] end subgraph "Usage Patterns" E[Daily Monitoring] --> A2 E --> B1 F[Weekly Reports] --> C3 F --> D2 G[On-Demand Research] --> D1 G --> D4 H[Competitive Analysis] --> C4 H --> A4 end style A1 fill:#e3f2fd style B1 fill:#f3e5f5 style C1 fill:#e8f5e8 style D1 fill:#fff3e0 ``` **📖 Learn more:** [CLI Reference](https://docs.crawl4ai.com/core/cli/), [Identity-Based Crawling](https://docs.crawl4ai.com/advanced/identity-based-crawling/), [Profile Management](https://docs.crawl4ai.com/advanced/session-management/), [Authentication Strategies](https://docs.crawl4ai.com/advanced/hooks-auth/) --- ## HTTP Crawler Strategy Workflows Visual representations of HTTP-based crawling architecture, request flows, and performance characteristics compared to browser-based strategies. ### HTTP vs Browser Strategy Decision Tree ```mermaid flowchart TD A[Content Crawling Need] --> B{Content Type Analysis} B -->|Static HTML| C{JavaScript Required?} B -->|Dynamic SPA| D[Browser Strategy Required] B -->|API Endpoints| E[HTTP Strategy Optimal] B -->|Mixed Content| F{Primary Content Source?} C -->|No JS Needed| G[HTTP Strategy Recommended] C -->|JS Required| H[Browser Strategy Required] C -->|Unknown| I{Performance Priority?} I -->|Speed Critical| J[Try HTTP First] I -->|Accuracy Critical| K[Use Browser Strategy] F -->|Mostly Static| G F -->|Mostly Dynamic| D G --> L{Resource Constraints?} L -->|Memory Limited| M[HTTP Strategy - Lightweight] L -->|CPU Limited| N[HTTP Strategy - No Browser] L -->|Network Limited| O[HTTP Strategy - Efficient] L -->|No Constraints| P[Either Strategy Works] J --> Q[Test HTTP Results] Q --> R{Content Complete?} R -->|Yes| S[Continue with HTTP] R -->|No| T[Switch to Browser Strategy] D --> U[Browser Strategy Features] H --> U K --> U T --> U U --> V[JavaScript Execution] U --> W[Screenshots/PDFs] U --> X[Complex Interactions] U --> Y[Session Management] M --> Z[HTTP Strategy Benefits] N --> Z O --> Z S --> Z Z --> AA[10x Faster Processing] Z --> BB[Lower Memory Usage] Z --> CC[Higher Concurrency] Z --> DD[Simpler Deployment] style G fill:#c8e6c9 style M fill:#c8e6c9 style N fill:#c8e6c9 style O fill:#c8e6c9 style S fill:#c8e6c9 style D fill:#e3f2fd style H fill:#e3f2fd style K fill:#e3f2fd style T fill:#e3f2fd style U fill:#e3f2fd ``` ### HTTP Request Lifecycle Sequence ```mermaid sequenceDiagram participant Client participant HTTPStrategy as HTTP Strategy participant Session as HTTP Session participant Server as Target Server participant Processor as Content Processor Client->>HTTPStrategy: crawl(url, config) HTTPStrategy->>HTTPStrategy: validate_url() alt URL Type Check HTTPStrategy->>HTTPStrategy: handle_file_url() Note over HTTPStrategy: file:// URLs else HTTPStrategy->>HTTPStrategy: handle_raw_content() Note over HTTPStrategy: raw:// content else HTTPStrategy->>Session: prepare_request() Session->>Session: apply_config() Session->>Session: set_headers() Session->>Session: setup_auth() Session->>Server: HTTP Request Note over Session,Server: GET/POST/PUT with headers alt Success Response Server-->>Session: HTTP 200 + Content Session-->>HTTPStrategy: response_data else Redirect Response Server-->>Session: HTTP 3xx + Location Session->>Server: Follow redirect Server-->>Session: HTTP 200 + Content Session-->>HTTPStrategy: final_response else Error Response Server-->>Session: HTTP 4xx/5xx Session-->>HTTPStrategy: error_response end end HTTPStrategy->>Processor: process_content() Processor->>Processor: clean_html() Processor->>Processor: extract_metadata() Processor->>Processor: generate_markdown() Processor-->>HTTPStrategy: processed_result HTTPStrategy-->>Client: CrawlResult Note over Client,Processor: Fast, lightweight processing Note over HTTPStrategy: No browser overhead ``` ### HTTP Strategy Architecture ```mermaid graph TB subgraph "HTTP Crawler Strategy" A[AsyncHTTPCrawlerStrategy] --> B[Session Manager] A --> C[Request Builder] A --> D[Response Handler] A --> E[Error Manager] B --> B1[Connection Pool] B --> B2[DNS Cache] B --> B3[SSL Context] C --> C1[Headers Builder] C --> C2[Auth Handler] C --> C3[Payload Encoder] D --> D1[Content Decoder] D --> D2[Redirect Handler] D --> D3[Status Validator] E --> E1[Retry Logic] E --> E2[Timeout Handler] E --> E3[Exception Mapper] end subgraph "Content Processing" F[Raw HTML] --> G[HTML Cleaner] G --> H[Markdown Generator] H --> I[Link Extractor] I --> J[Media Extractor] J --> K[Metadata Parser] end subgraph "External Resources" L[Target Websites] M[Local Files] N[Raw Content] end subgraph "Output" O[CrawlResult] O --> O1[HTML Content] O --> O2[Markdown Text] O --> O3[Extracted Links] O --> O4[Media References] O --> O5[Status Information] end A --> F L --> A M --> A N --> A K --> O style A fill:#e3f2fd style B fill:#f3e5f5 style F fill:#e8f5e8 style O fill:#fff3e0 ``` ### Performance Comparison Flow ```mermaid graph LR subgraph "HTTP Strategy Performance" A1[Request Start] --> A2[DNS Lookup: 50ms] A2 --> A3[TCP Connect: 100ms] A3 --> A4[HTTP Request: 200ms] A4 --> A5[Content Download: 300ms] A5 --> A6[Processing: 50ms] A6 --> A7[Total: ~700ms] end subgraph "Browser Strategy Performance" B1[Request Start] --> B2[Browser Launch: 2000ms] B2 --> B3[Page Navigation: 1000ms] B3 --> B4[JS Execution: 500ms] B4 --> B5[Content Rendering: 300ms] B5 --> B6[Processing: 100ms] B6 --> B7[Total: ~3900ms] end subgraph "Resource Usage" C1[HTTP Memory: ~50MB] C2[Browser Memory: ~500MB] C3[HTTP CPU: Low] C4[Browser CPU: High] C5[HTTP Concurrency: 100+] C6[Browser Concurrency: 10-20] end A7 --> D[5.5x Faster] B7 --> D C1 --> E[10x Less Memory] C2 --> E C5 --> F[5x More Concurrent] C6 --> F style A7 fill:#c8e6c9 style B7 fill:#ffcdd2 style C1 fill:#c8e6c9 style C2 fill:#ffcdd2 style C5 fill:#c8e6c9 style C6 fill:#ffcdd2 ``` ### HTTP Request Types and Configuration ```mermaid stateDiagram-v2 [*] --> HTTPConfigSetup HTTPConfigSetup --> MethodSelection MethodSelection --> GET: Simple data retrieval MethodSelection --> POST: Form submission MethodSelection --> PUT: Data upload MethodSelection --> DELETE: Resource removal GET --> HeaderSetup: Set Accept headers POST --> PayloadSetup: JSON or form data PUT --> PayloadSetup: File or data upload DELETE --> AuthSetup: Authentication required PayloadSetup --> JSONPayload: application/json PayloadSetup --> FormPayload: form-data PayloadSetup --> RawPayload: custom content JSONPayload --> HeaderSetup FormPayload --> HeaderSetup RawPayload --> HeaderSetup HeaderSetup --> AuthSetup AuthSetup --> SSLSetup SSLSetup --> RedirectSetup RedirectSetup --> RequestExecution RequestExecution --> [*]: Request complete note right of GET : Default method for most crawling note right of POST : API interactions, form submissions note right of JSONPayload : Structured data transmission note right of HeaderSetup : User-Agent, Accept, Custom headers ``` ### Error Handling and Retry Workflow ```mermaid flowchart TD A[HTTP Request] --> B{Response Received?} B -->|No| C[Connection Error] B -->|Yes| D{Status Code Check} C --> C1{Timeout Error?} C1 -->|Yes| C2[ConnectionTimeoutError] C1 -->|No| C3[Network Error] D -->|2xx| E[Success Response] D -->|3xx| F[Redirect Response] D -->|4xx| G[Client Error] D -->|5xx| H[Server Error] F --> F1{Follow Redirects?} F1 -->|Yes| F2[Follow Redirect] F1 -->|No| F3[Return Redirect Response] F2 --> A G --> G1{Retry on 4xx?} G1 -->|No| G2[HTTPStatusError] G1 -->|Yes| I[Check Retry Count] H --> H1{Retry on 5xx?} H1 -->|Yes| I H1 -->|No| H2[HTTPStatusError] C2 --> I C3 --> I I --> J{Retries < Max?} J -->|No| K[Final Error] J -->|Yes| L[Calculate Backoff] L --> M[Wait Backoff Time] M --> N[Increment Retry Count] N --> A E --> O[Process Content] F3 --> O O --> P[Return CrawlResult] G2 --> Q[Error CrawlResult] H2 --> Q K --> Q style E fill:#c8e6c9 style P fill:#c8e6c9 style G2 fill:#ffcdd2 style H2 fill:#ffcdd2 style K fill:#ffcdd2 style Q fill:#ffcdd2 ``` ### Batch Processing Architecture ```mermaid sequenceDiagram participant Client participant BatchManager as Batch Manager participant HTTPPool as Connection Pool participant Workers as HTTP Workers participant Targets as Target Servers Client->>BatchManager: batch_crawl(urls) BatchManager->>BatchManager: create_semaphore(max_concurrent) loop For each URL batch BatchManager->>HTTPPool: acquire_connection() HTTPPool->>Workers: assign_worker() par Concurrent Processing Workers->>Targets: HTTP Request 1 Workers->>Targets: HTTP Request 2 Workers->>Targets: HTTP Request N end par Response Handling Targets-->>Workers: Response 1 Targets-->>Workers: Response 2 Targets-->>Workers: Response N end Workers->>HTTPPool: return_connection() HTTPPool->>BatchManager: batch_results() end BatchManager->>BatchManager: aggregate_results() BatchManager-->>Client: final_results() Note over Workers,Targets: 20-100 concurrent connections Note over BatchManager: Memory-efficient processing Note over HTTPPool: Connection reuse optimization ``` ### Content Type Processing Pipeline ```mermaid graph TD A[HTTP Response] --> B{Content-Type Detection} B -->|text/html| C[HTML Processing] B -->|application/json| D[JSON Processing] B -->|text/plain| E[Text Processing] B -->|application/xml| F[XML Processing] B -->|Other| G[Binary Processing] C --> C1[Parse HTML Structure] C1 --> C2[Extract Text Content] C2 --> C3[Generate Markdown] C3 --> C4[Extract Links/Media] D --> D1[Parse JSON Structure] D1 --> D2[Extract Data Fields] D2 --> D3[Format as Readable Text] E --> E1[Clean Text Content] E1 --> E2[Basic Formatting] F --> F1[Parse XML Structure] F1 --> F2[Extract Text Nodes] F2 --> F3[Convert to Markdown] G --> G1[Save Binary Content] G1 --> G2[Generate Metadata] C4 --> H[Content Analysis] D3 --> H E2 --> H F3 --> H G2 --> H H --> I[Link Extraction] H --> J[Media Detection] H --> K[Metadata Parsing] I --> L[CrawlResult Assembly] J --> L K --> L L --> M[Final Output] style C fill:#e8f5e8 style H fill:#fff3e0 style L fill:#e3f2fd style M fill:#c8e6c9 ``` ### Integration with Processing Strategies ```mermaid graph LR subgraph "HTTP Strategy Core" A[HTTP Request] --> B[Raw Content] B --> C[Content Decoder] end subgraph "Processing Pipeline" C --> D[HTML Cleaner] D --> E[Markdown Generator] E --> F{Content Filter?} F -->|Yes| G[Pruning Filter] F -->|Yes| H[BM25 Filter] F -->|No| I[Raw Markdown] G --> J[Fit Markdown] H --> J end subgraph "Extraction Strategies" I --> K[CSS Extraction] J --> K I --> L[XPath Extraction] J --> L I --> M[LLM Extraction] J --> M end subgraph "Output Generation" K --> N[Structured JSON] L --> N M --> N I --> O[Clean Markdown] J --> P[Filtered Content] N --> Q[Final CrawlResult] O --> Q P --> Q end style A fill:#e3f2fd style C fill:#f3e5f5 style E fill:#e8f5e8 style Q fill:#c8e6c9 ``` **📖 Learn more:** [HTTP vs Browser Strategies](https://docs.crawl4ai.com/core/browser-crawler-config/), [Performance Optimization](https://docs.crawl4ai.com/advanced/multi-url-crawling/), [Error Handling](https://docs.crawl4ai.com/api/async-webcrawler/) --- ## URL Seeding Workflows and Architecture Visual representations of URL discovery strategies, filtering pipelines, and smart crawling workflows. ### URL Seeding vs Deep Crawling Strategy Comparison ```mermaid graph TB subgraph "Deep Crawling Approach" A1[Start URL] --> A2[Load Page] A2 --> A3[Extract Links] A3 --> A4{More Links?} A4 -->|Yes| A5[Queue Next Page] A5 --> A2 A4 -->|No| A6[Complete] A7[⏱️ Real-time Discovery] A8[🐌 Sequential Processing] A9[🔍 Limited by Page Structure] A10[💾 High Memory Usage] end subgraph "URL Seeding Approach" B1[Domain Input] --> B2[Query Sitemap] B1 --> B3[Query Common Crawl] B2 --> B4[Merge Results] B3 --> B4 B4 --> B5[Apply Filters] B5 --> B6[Score Relevance] B6 --> B7[Rank Results] B7 --> B8[Select Top URLs] B9[⚡ Instant Discovery] B10[🚀 Parallel Processing] B11[🎯 Pattern-based Filtering] B12[💡 Smart Relevance Scoring] end style A1 fill:#ffecb3 style B1 fill:#e8f5e8 style A6 fill:#ffcdd2 style B8 fill:#c8e6c9 ``` ### URL Discovery Data Flow ```mermaid sequenceDiagram participant User participant Seeder as AsyncUrlSeeder participant SM as Sitemap participant CC as Common Crawl participant Filter as URL Filter participant Scorer as BM25 Scorer User->>Seeder: urls("example.com", config) par Parallel Data Sources Seeder->>SM: Fetch sitemap.xml SM-->>Seeder: 500 URLs and Seeder->>CC: Query Common Crawl CC-->>Seeder: 2000 URLs end Seeder->>Seeder: Merge and deduplicate Note over Seeder: 2200 unique URLs Seeder->>Filter: Apply pattern filter Filter-->>Seeder: 800 matching URLs alt extract_head=True loop For each URL Seeder->>Seeder: Extract metadata end Note over Seeder: Title, description, keywords end alt query provided Seeder->>Scorer: Calculate relevance scores Scorer-->>Seeder: Scored URLs Seeder->>Seeder: Filter by score_threshold Note over Seeder: 200 relevant URLs end Seeder->>Seeder: Sort by relevance Seeder->>Seeder: Apply max_urls limit Seeder-->>User: Top 100 URLs ready for crawling ``` ### SeedingConfig Decision Tree ```mermaid flowchart TD A[SeedingConfig Setup] --> B{Data Source Strategy?} B -->|Fast & Official| C[source="sitemap"] B -->|Comprehensive| D[source="cc"] B -->|Maximum Coverage| E[source="sitemap+cc"] C --> F{Need Filtering?} D --> F E --> F F -->|Yes| G[Set URL Pattern] F -->|No| H[pattern="*"] G --> I{Pattern Examples} I --> I1[pattern="*/blog/*"] I --> I2[pattern="*/docs/api/*"] I --> I3[pattern="*.pdf"] I --> I4[pattern="*/product/*"] H --> J{Need Metadata?} I1 --> J I2 --> J I3 --> J I4 --> J J -->|Yes| K[extract_head=True] J -->|No| L[extract_head=False] K --> M{Need Validation?} L --> M M -->|Yes| N[live_check=True] M -->|No| O[live_check=False] N --> P{Need Relevance Scoring?} O --> P P -->|Yes| Q[Set Query + BM25] P -->|No| R[Skip Scoring] Q --> S[query="search terms"] S --> T[scoring_method="bm25"] T --> U[score_threshold=0.3] R --> V[Performance Tuning] U --> V V --> W[Set max_urls] W --> X[Set concurrency] X --> Y[Set hits_per_sec] Y --> Z[Configuration Complete] style A fill:#e3f2fd style Z fill:#c8e6c9 style K fill:#fff3e0 style N fill:#fff3e0 style Q fill:#f3e5f5 ``` ### BM25 Relevance Scoring Pipeline ```mermaid graph TB subgraph "Text Corpus Preparation" A1[URL Collection] --> A2[Extract Metadata] A2 --> A3[Title + Description + Keywords] A3 --> A4[Tokenize Text] A4 --> A5[Remove Stop Words] A5 --> A6[Create Document Corpus] end subgraph "BM25 Algorithm" B1[Query Terms] --> B2[Term Frequency Calculation] A6 --> B2 B2 --> B3[Inverse Document Frequency] B3 --> B4[BM25 Score Calculation] B4 --> B5[Score = Σ(IDF × TF × K1+1)/(TF + K1×(1-b+b×|d|/avgdl))] end subgraph "Scoring Results" B5 --> C1[URL Relevance Scores] C1 --> C2{Score ≥ Threshold?} C2 -->|Yes| C3[Include in Results] C2 -->|No| C4[Filter Out] C3 --> C5[Sort by Score DESC] C5 --> C6[Return Top URLs] end subgraph "Example Scores" D1["python async tutorial" → 0.85] D2["python documentation" → 0.72] D3["javascript guide" → 0.23] D4["contact us page" → 0.05] end style B5 fill:#e3f2fd style C6 fill:#c8e6c9 style D1 fill:#c8e6c9 style D2 fill:#c8e6c9 style D3 fill:#ffecb3 style D4 fill:#ffcdd2 ``` ### Multi-Domain Discovery Architecture ```mermaid graph TB subgraph "Input Layer" A1[Domain List] A2[SeedingConfig] A3[Query Terms] end subgraph "Discovery Engine" B1[AsyncUrlSeeder] B2[Parallel Workers] B3[Rate Limiter] B4[Memory Manager] end subgraph "Data Sources" C1[Sitemap Fetcher] C2[Common Crawl API] C3[Live URL Checker] C4[Metadata Extractor] end subgraph "Processing Pipeline" D1[URL Deduplication] D2[Pattern Filtering] D3[Relevance Scoring] D4[Quality Assessment] end subgraph "Output Layer" E1[Scored URL Lists] E2[Domain Statistics] E3[Performance Metrics] E4[Cache Storage] end A1 --> B1 A2 --> B1 A3 --> B1 B1 --> B2 B2 --> B3 B3 --> B4 B2 --> C1 B2 --> C2 B2 --> C3 B2 --> C4 C1 --> D1 C2 --> D1 C3 --> D2 C4 --> D3 D1 --> D2 D2 --> D3 D3 --> D4 D4 --> E1 B4 --> E2 B3 --> E3 D1 --> E4 style B1 fill:#e3f2fd style D3 fill:#f3e5f5 style E1 fill:#c8e6c9 ``` ### Complete Discovery-to-Crawl Pipeline ```mermaid stateDiagram-v2 [*] --> Discovery Discovery --> SourceSelection: Configure data sources SourceSelection --> Sitemap: source="sitemap" SourceSelection --> CommonCrawl: source="cc" SourceSelection --> Both: source="sitemap+cc" Sitemap --> URLCollection CommonCrawl --> URLCollection Both --> URLCollection URLCollection --> Filtering: Apply patterns Filtering --> MetadataExtraction: extract_head=True Filtering --> LiveValidation: extract_head=False MetadataExtraction --> LiveValidation: live_check=True MetadataExtraction --> RelevanceScoring: live_check=False LiveValidation --> RelevanceScoring RelevanceScoring --> ResultRanking: query provided RelevanceScoring --> ResultLimiting: no query ResultRanking --> ResultLimiting: apply score_threshold ResultLimiting --> URLSelection: apply max_urls URLSelection --> CrawlPreparation: URLs ready CrawlPreparation --> CrawlExecution: AsyncWebCrawler CrawlExecution --> StreamProcessing: stream=True CrawlExecution --> BatchProcessing: stream=False StreamProcessing --> [*] BatchProcessing --> [*] note right of Discovery : 🔍 Smart URL Discovery note right of URLCollection : 📚 Merge & Deduplicate note right of RelevanceScoring : 🎯 BM25 Algorithm note right of CrawlExecution : 🕷️ High-Performance Crawling ``` ### Performance Optimization Strategies ```mermaid graph LR subgraph "Input Optimization" A1[Smart Source Selection] --> A2[Sitemap First] A2 --> A3[Add CC if Needed] A3 --> A4[Pattern Filtering Early] end subgraph "Processing Optimization" B1[Parallel Workers] --> B2[Bounded Queues] B2 --> B3[Rate Limiting] B3 --> B4[Memory Management] B4 --> B5[Lazy Evaluation] end subgraph "Output Optimization" C1[Relevance Threshold] --> C2[Max URL Limits] C2 --> C3[Caching Strategy] C3 --> C4[Streaming Results] end subgraph "Performance Metrics" D1[URLs/Second: 100-1000] D2[Memory Usage: Bounded] D3[Network Efficiency: 95%+] D4[Cache Hit Rate: 80%+] end A4 --> B1 B5 --> C1 C4 --> D1 style A2 fill:#e8f5e8 style B2 fill:#e3f2fd style C3 fill:#f3e5f5 style D3 fill:#c8e6c9 ``` ### URL Discovery vs Traditional Crawling Comparison ```mermaid graph TB subgraph "Traditional Approach" T1[Start URL] --> T2[Crawl Page] T2 --> T3[Extract Links] T3 --> T4[Queue New URLs] T4 --> T2 T5[❌ Time: Hours/Days] T6[❌ Resource Heavy] T7[❌ Depth Limited] T8[❌ Discovery Bias] end subgraph "URL Seeding Approach" S1[Domain Input] --> S2[Query All Sources] S2 --> S3[Pattern Filter] S3 --> S4[Relevance Score] S4 --> S5[Select Best URLs] S5 --> S6[Ready to Crawl] S7[✅ Time: Seconds/Minutes] S8[✅ Resource Efficient] S9[✅ Complete Coverage] S10[✅ Quality Focused] end subgraph "Use Case Decision Matrix" U1[Small Sites < 1000 pages] --> U2[Use Deep Crawling] U3[Large Sites > 10000 pages] --> U4[Use URL Seeding] U5[Unknown Structure] --> U6[Start with Seeding] U7[Real-time Discovery] --> U8[Use Deep Crawling] U9[Quality over Quantity] --> U10[Use URL Seeding] end style S6 fill:#c8e6c9 style S7 fill:#c8e6c9 style S8 fill:#c8e6c9 style S9 fill:#c8e6c9 style S10 fill:#c8e6c9 style T5 fill:#ffcdd2 style T6 fill:#ffcdd2 style T7 fill:#ffcdd2 style T8 fill:#ffcdd2 ``` ### Data Source Characteristics and Selection ```mermaid graph TB subgraph "Sitemap Source" SM1[📋 Official URL List] SM2[⚡ Fast Response] SM3[📅 Recently Updated] SM4[🎯 High Quality URLs] SM5[❌ May Miss Some Pages] end subgraph "Common Crawl Source" CC1[🌐 Comprehensive Coverage] CC2[📚 Historical Data] CC3[🔍 Deep Discovery] CC4[⏳ Slower Response] CC5[🧹 May Include Noise] end subgraph "Combined Strategy" CB1[🚀 Best of Both] CB2[📊 Maximum Coverage] CB3[✨ Automatic Deduplication] CB4[⚖️ Balanced Performance] end subgraph "Selection Guidelines" G1[Speed Critical → Sitemap Only] G2[Coverage Critical → Common Crawl] G3[Best Quality → Combined] G4[Unknown Domain → Combined] end style SM2 fill:#c8e6c9 style SM4 fill:#c8e6c9 style CC1 fill:#e3f2fd style CC3 fill:#e3f2fd style CB1 fill:#f3e5f5 style CB3 fill:#f3e5f5 ``` **📖 Learn more:** [URL Seeding Guide](https://docs.crawl4ai.com/core/url-seeding/), [Performance Optimization](https://docs.crawl4ai.com/advanced/optimization/), [Multi-URL Crawling](https://docs.crawl4ai.com/advanced/multi-url-crawling/) --- ## Deep Crawling Filters & Scorers Architecture Visual representations of advanced URL filtering, scoring strategies, and performance optimization workflows for intelligent deep crawling. ### Filter Chain Processing Pipeline ```mermaid flowchart TD A[URL Input] --> B{Domain Filter} B -->|✓ Pass| C{Pattern Filter} B -->|✗ Fail| X1[Reject: Invalid Domain] C -->|✓ Pass| D{Content Type Filter} C -->|✗ Fail| X2[Reject: Pattern Mismatch] D -->|✓ Pass| E{SEO Filter} D -->|✗ Fail| X3[Reject: Wrong Content Type] E -->|✓ Pass| F{Content Relevance Filter} E -->|✗ Fail| X4[Reject: Low SEO Score] F -->|✓ Pass| G[URL Accepted] F -->|✗ Fail| X5[Reject: Low Relevance] G --> H[Add to Crawl Queue] subgraph "Fast Filters" B C D end subgraph "Slow Filters" E F end style A fill:#e3f2fd style G fill:#c8e6c9 style H fill:#e8f5e8 style X1 fill:#ffcdd2 style X2 fill:#ffcdd2 style X3 fill:#ffcdd2 style X4 fill:#ffcdd2 style X5 fill:#ffcdd2 ``` ### URL Scoring System Architecture ```mermaid graph TB subgraph "Input URL" A[https://python.org/tutorial/2024/ml-guide.html] end subgraph "Individual Scorers" B[Keyword Relevance Scorer] C[Path Depth Scorer] D[Content Type Scorer] E[Freshness Scorer] F[Domain Authority Scorer] end subgraph "Scoring Process" B --> B1[Keywords: python, tutorial, ml
Score: 0.85] C --> C1[Depth: 4 levels
Optimal: 3
Score: 0.75] D --> D1[Content: HTML
Score: 1.0] E --> E1[Year: 2024
Score: 1.0] F --> F1[Domain: python.org
Score: 1.0] end subgraph "Composite Scoring" G[Weighted Combination] B1 --> G C1 --> G D1 --> G E1 --> G F1 --> G end subgraph "Final Result" H[Composite Score: 0.92] I{Score > Threshold?} J[Accept URL] K[Reject URL] end A --> B A --> C A --> D A --> E A --> F G --> H H --> I I -->|✓ 0.92 > 0.6| J I -->|✗ Score too low| K style A fill:#e3f2fd style G fill:#fff3e0 style H fill:#e8f5e8 style J fill:#c8e6c9 style K fill:#ffcdd2 ``` ### Filter vs Scorer Decision Matrix ```mermaid flowchart TD A[URL Processing Decision] --> B{Binary Decision Needed?} B -->|Yes - Include/Exclude| C[Use Filters] B -->|No - Quality Rating| D[Use Scorers] C --> C1{Filter Type Needed?} C1 -->|Domain Control| C2[DomainFilter] C1 -->|Pattern Matching| C3[URLPatternFilter] C1 -->|Content Type| C4[ContentTypeFilter] C1 -->|SEO Quality| C5[SEOFilter] C1 -->|Content Relevance| C6[ContentRelevanceFilter] D --> D1{Scoring Criteria?} D1 -->|Keyword Relevance| D2[KeywordRelevanceScorer] D1 -->|URL Structure| D3[PathDepthScorer] D1 -->|Content Quality| D4[ContentTypeScorer] D1 -->|Time Sensitivity| D5[FreshnessScorer] D1 -->|Source Authority| D6[DomainAuthorityScorer] C2 --> E[Chain Filters] C3 --> E C4 --> E C5 --> E C6 --> E D2 --> F[Composite Scorer] D3 --> F D4 --> F D5 --> F D6 --> F E --> G[Binary Output: Pass/Fail] F --> H[Numeric Score: 0.0-1.0] G --> I[Apply to URL Queue] H --> J[Priority Ranking] style C fill:#e8f5e8 style D fill:#fff3e0 style E fill:#f3e5f5 style F fill:#e3f2fd style G fill:#c8e6c9 style H fill:#ffecb3 ``` ### Performance Optimization Strategy ```mermaid sequenceDiagram participant Queue as URL Queue participant Fast as Fast Filters participant Slow as Slow Filters participant Score as Scorers participant Output as Filtered URLs Note over Queue, Output: Batch Processing (1000 URLs) Queue->>Fast: Apply Domain Filter Fast-->>Queue: 60% passed (600 URLs) Queue->>Fast: Apply Pattern Filter Fast-->>Queue: 70% passed (420 URLs) Queue->>Fast: Apply Content Type Filter Fast-->>Queue: 90% passed (378 URLs) Note over Fast: Fast filters eliminate 62% of URLs Queue->>Slow: Apply SEO Filter (378 URLs) Slow-->>Queue: 80% passed (302 URLs) Queue->>Slow: Apply Relevance Filter Slow-->>Queue: 75% passed (227 URLs) Note over Slow: Content analysis on remaining URLs Queue->>Score: Calculate Composite Scores Score-->>Queue: Scored and ranked Queue->>Output: Top 100 URLs by score Output-->>Queue: Processing complete Note over Queue, Output: Total: 90% filtered out, 10% high-quality URLs retained ``` ### Custom Filter Implementation Flow ```mermaid stateDiagram-v2 [*] --> Planning Planning --> IdentifyNeeds: Define filtering criteria IdentifyNeeds --> ChooseType: Binary vs Scoring decision ChooseType --> FilterImpl: Binary decision needed ChooseType --> ScorerImpl: Quality rating needed FilterImpl --> InheritURLFilter: Extend URLFilter base class ScorerImpl --> InheritURLScorer: Extend URLScorer base class InheritURLFilter --> ImplementApply: def apply(url) -> bool InheritURLScorer --> ImplementScore: def _calculate_score(url) -> float ImplementApply --> AddLogic: Add custom filtering logic ImplementScore --> AddLogic AddLogic --> TestFilter: Unit testing TestFilter --> OptimizePerf: Performance optimization OptimizePerf --> Integration: Integrate with FilterChain Integration --> Production: Deploy to production Production --> Monitor: Monitor performance Monitor --> Tune: Tune parameters Tune --> Production note right of Planning : Consider performance impact note right of AddLogic : Handle edge cases note right of OptimizePerf : Cache frequently accessed data ``` ### Filter Chain Optimization Patterns ```mermaid graph TB subgraph "Naive Approach - Poor Performance" A1[All URLs] --> B1[Slow Filter 1] B1 --> C1[Slow Filter 2] C1 --> D1[Fast Filter 1] D1 --> E1[Fast Filter 2] E1 --> F1[Final Results] B1 -.->|High CPU| G1[Performance Issues] C1 -.->|Network Calls| G1 end subgraph "Optimized Approach - High Performance" A2[All URLs] --> B2[Fast Filter 1] B2 --> C2[Fast Filter 2] C2 --> D2[Batch Process] D2 --> E2[Slow Filter 1] E2 --> F2[Slow Filter 2] F2 --> G2[Final Results] D2 --> H2[Concurrent Processing] H2 --> I2[Semaphore Control] end subgraph "Performance Metrics" J[Processing Time] K[Memory Usage] L[CPU Utilization] M[Network Requests] end G1 -.-> J G1 -.-> K G1 -.-> L G1 -.-> M G2 -.-> J G2 -.-> K G2 -.-> L G2 -.-> M style A1 fill:#ffcdd2 style G1 fill:#ffcdd2 style A2 fill:#c8e6c9 style G2 fill:#c8e6c9 style H2 fill:#e8f5e8 style I2 fill:#e8f5e8 ``` ### Composite Scoring Weight Distribution ```mermaid pie title Composite Scorer Weight Distribution "Keyword Relevance (30%)" : 30 "Domain Authority (25%)" : 25 "Content Type (20%)" : 20 "Freshness (15%)" : 15 "Path Depth (10%)" : 10 ``` ### Deep Crawl Integration Architecture ```mermaid graph TD subgraph "Deep Crawl Strategy" A[Start URL] --> B[Extract Links] B --> C[Apply Filter Chain] C --> D[Calculate Scores] D --> E[Priority Queue] E --> F[Crawl Next URL] F --> B end subgraph "Filter Chain Components" C --> C1[Domain Filter] C --> C2[Pattern Filter] C --> C3[Content Filter] C --> C4[SEO Filter] C --> C5[Relevance Filter] end subgraph "Scoring Components" D --> D1[Keyword Scorer] D --> D2[Depth Scorer] D --> D3[Freshness Scorer] D --> D4[Authority Scorer] D --> D5[Composite Score] end subgraph "Queue Management" E --> E1{Score > Threshold?} E1 -->|Yes| E2[High Priority Queue] E1 -->|No| E3[Low Priority Queue] E2 --> F E3 --> G[Delayed Processing] end subgraph "Control Flow" H{Max Depth Reached?} I{Max Pages Reached?} J[Stop Crawling] end F --> H H -->|No| I H -->|Yes| J I -->|No| B I -->|Yes| J style A fill:#e3f2fd style E2 fill:#c8e6c9 style E3 fill:#fff3e0 style J fill:#ffcdd2 ``` ### Filter Performance Comparison ```mermaid xychart-beta title "Filter Performance Comparison (1000 URLs)" x-axis [Domain, Pattern, ContentType, SEO, Relevance] y-axis "Processing Time (ms)" 0 --> 1000 bar [50, 80, 45, 300, 800] ``` ### Scoring Algorithm Workflow ```mermaid flowchart TD A[Input URL] --> B[Parse URL Components] B --> C[Extract Features] C --> D[Domain Analysis] C --> E[Path Analysis] C --> F[Content Type Detection] C --> G[Keyword Extraction] C --> H[Freshness Detection] D --> I[Domain Authority Score] E --> J[Path Depth Score] F --> K[Content Type Score] G --> L[Keyword Relevance Score] H --> M[Freshness Score] I --> N[Apply Weights] J --> N K --> N L --> N M --> N N --> O[Normalize Scores] O --> P[Calculate Final Score] P --> Q{Score >= Threshold?} Q -->|Yes| R[Accept for Crawling] Q -->|No| S[Reject URL] R --> T[Add to Priority Queue] S --> U[Log Rejection Reason] style A fill:#e3f2fd style P fill:#fff3e0 style R fill:#c8e6c9 style S fill:#ffcdd2 style T fill:#e8f5e8 ``` **📖 Learn more:** [Deep Crawling Strategy](https://docs.crawl4ai.com/core/deep-crawling/), [Performance Optimization](https://docs.crawl4ai.com/advanced/performance-tuning/), [Custom Implementations](https://docs.crawl4ai.com/advanced/custom-filters/) --- ## Summary Crawl4AI provides a comprehensive solution for web crawling and data extraction optimized for AI applications. From simple page crawling to complex multi-URL operations with advanced filtering, the library offers the flexibility and performance needed for modern data extraction workflows. **Key Takeaways:** - Start with basic installation and simple crawling patterns - Use configuration objects for consistent, maintainable code - Choose appropriate extraction strategies based on your data structure - Leverage Docker for production deployments - Implement advanced features like deep crawling and custom filters as needed **Next Steps:** - Explore the [GitHub repository](https://github.com/unclecode/crawl4ai) for latest updates - Join the [Discord community](https://discord.gg/jP8KfhDhyN) for support - Check out [example projects](https://github.com/unclecode/crawl4ai/tree/main/docs/examples) for inspiration Happy crawling! 🕷️