crawl4ai/docs/md_v2/assets/llm.txt/diagrams/config_objects.txt
UncleCode 40640badad feat: add Script Builder to Chrome Extension and reorganize LLM context files
This commit introduces significant enhancements to the Crawl4AI ecosystem:

  Chrome Extension - Script Builder (Alpha):
  - Add recording functionality to capture user interactions (clicks, typing, scrolling)
  - Implement smart event grouping for cleaner script generation
  - Support export to both JavaScript and C4A script formats
  - Add timeline view for visualizing and editing recorded actions
  - Include wait commands (time-based and element-based)
  - Add saved flows functionality for reusing automation scripts
  - Update UI with consistent dark terminal theme (Dank Mono font, green/pink accents)
  - Release new extension versions: v1.1.0, v1.2.0, v1.2.1

  LLM Context Builder Improvements:
  - Reorganize context files from llmtxt/ to llm.txt/ with better structure
  - Separate diagram templates from text content (diagrams/ and txt/ subdirectories)
  - Add comprehensive context files for all major Crawl4AI components
  - Improve file naming convention for better discoverability

  Documentation Updates:
  - Update apps index page to match main documentation theme
  - Standardize color scheme: "Available" tags use primary color (#50ffff)
  - Change "Coming Soon" tags to dark gray for better visual hierarchy
  - Add interactive two-column layout for extension landing page
  - Include code examples for both Schema Builder and Script Builder features

  Technical Improvements:
  - Enhance event capture mechanism with better element selection
  - Add support for contenteditable elements and complex form interactions
  - Implement proper scroll event handling for both window and element scrolling
  - Add meta key support for keyboard shortcuts
  - Improve selector generation for more reliable element targeting

  The Script Builder is released as Alpha, acknowledging potential bugs while providing
  early access to this powerful automation recording feature.
2025-06-08 22:02:12 +08:00


## Configuration Objects and System Architecture
Visual representations of Crawl4AI's configuration system, object relationships, and data flow patterns.
### Configuration Object Relationships
```mermaid
classDiagram
class BrowserConfig {
+browser_type: str
+headless: bool
+viewport_width: int
+viewport_height: int
+proxy: str
+user_agent: str
+cookies: list
+headers: dict
+clone() BrowserConfig
+to_dict() dict
}
class CrawlerRunConfig {
+cache_mode: CacheMode
+extraction_strategy: ExtractionStrategy
+markdown_generator: MarkdownGenerator
+js_code: list
+wait_for: str
+screenshot: bool
+session_id: str
+clone() CrawlerRunConfig
+dump() dict
}
class LLMConfig {
+provider: str
+api_token: str
+base_url: str
+temperature: float
+max_tokens: int
+clone() LLMConfig
+to_dict() dict
}
class CrawlResult {
+url: str
+success: bool
+html: str
+cleaned_html: str
+markdown: MarkdownGenerationResult
+extracted_content: str
+media: dict
+links: dict
+screenshot: str
+pdf: bytes
}
class AsyncWebCrawler {
+config: BrowserConfig
+arun() CrawlResult
}
AsyncWebCrawler --> BrowserConfig : uses
AsyncWebCrawler --> CrawlerRunConfig : accepts
CrawlerRunConfig --> LLMConfig : contains
AsyncWebCrawler --> CrawlResult : returns
note for BrowserConfig "Controls browser\nenvironment and behavior"
note for CrawlerRunConfig "Controls individual\ncrawl operations"
note for LLMConfig "Configures LLM\nproviders and parameters"
note for CrawlResult "Contains all crawl\noutputs and metadata"
```
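A minimal sketch of how these objects fit together in code (recent crawl4ai versions; note that `result.markdown` is a `MarkdownGenerationResult`, as the class diagram shows):
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    # BrowserConfig controls the browser environment (one per crawler)
    browser_cfg = BrowserConfig(browser_type="chromium", headless=True,
                                viewport_width=1280, viewport_height=720)
    # CrawlerRunConfig controls an individual crawl operation (one per arun call)
    run_cfg = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, screenshot=True)
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(url="https://example.com", config=run_cfg)
        print(result.success, len(result.markdown.raw_markdown))

asyncio.run(main())
```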
### Configuration Decision Flow
```mermaid
flowchart TD
A[Start Configuration] --> B{Use Case Type?}
B -->|Simple Web Scraping| C[Basic Config Pattern]
B -->|Data Extraction| D[Extraction Config Pattern]
B -->|Stealth Crawling| E[Stealth Config Pattern]
B -->|High Performance| F[Performance Config Pattern]
C --> C1[BrowserConfig: headless=True]
C --> C2[CrawlerRunConfig: basic options]
C1 --> C3[No LLMConfig needed]
C2 --> C3
C3 --> G[Simple Crawling Ready]
D --> D1[BrowserConfig: standard setup]
D --> D2[CrawlerRunConfig: with extraction_strategy]
D --> D3[LLMConfig: for LLM extraction]
D1 --> D4[Advanced Extraction Ready]
D2 --> D4
D3 --> D4
E --> E1[BrowserConfig: proxy + user_agent]
E --> E2[CrawlerRunConfig: simulate_user=True]
E1 --> E3[Stealth Crawling Ready]
E2 --> E3
F --> F1[BrowserConfig: lightweight]
F --> F2[CrawlerRunConfig: caching + concurrent]
F1 --> F3[High Performance Ready]
F2 --> F3
G --> H[Execute Crawl]
D4 --> H
E3 --> H
F3 --> H
H --> I[Get CrawlResult]
style A fill:#e1f5fe
style I fill:#c8e6c9
style G fill:#fff3e0
style D4 fill:#f3e5f5
style E3 fill:#ffebee
style F3 fill:#e8f5e8
```
### Configuration Lifecycle Sequence
```mermaid
sequenceDiagram
participant User
participant BrowserConfig as Browser Config
participant CrawlerConfig as Crawler Config
participant LLMConfig as LLM Config
participant Crawler as AsyncWebCrawler
participant Browser as Browser Instance
participant Result as CrawlResult
User->>BrowserConfig: Create with browser settings
User->>CrawlerConfig: Create with crawl options
User->>LLMConfig: Create with LLM provider
User->>Crawler: Initialize with BrowserConfig
Crawler->>Browser: Launch browser with config
Browser-->>Crawler: Browser ready
User->>Crawler: arun(url, CrawlerConfig)
Crawler->>Crawler: Apply CrawlerConfig settings
alt LLM Extraction Needed
Crawler->>LLMConfig: Get LLM settings
LLMConfig-->>Crawler: Provider configuration
end
Crawler->>Browser: Navigate with settings
Browser->>Browser: Apply page interactions
Browser->>Browser: Execute JavaScript if specified
Browser->>Browser: Wait for conditions
Browser-->>Crawler: Page content ready
Crawler->>Crawler: Process content per config
Crawler->>Result: Create CrawlResult
Result-->>User: Return complete result
Note over User,Result: Configuration objects control every aspect
```
### BrowserConfig Parameter Flow
```mermaid
graph TB
subgraph "BrowserConfig Parameters"
A[browser_type] --> A1[chromium/firefox/webkit]
B[headless] --> B1[true: invisible / false: visible]
C[viewport] --> C1[width x height dimensions]
D[proxy] --> D1[proxy server configuration]
E[user_agent] --> E1[browser identification string]
F[cookies] --> F1[session authentication]
G[headers] --> G1[HTTP request headers]
H[extra_args] --> H1[browser command line flags]
end
subgraph "Browser Instance"
I[Playwright Browser]
J[Browser Context]
K[Page Instance]
end
A1 --> I
B1 --> I
C1 --> J
D1 --> J
E1 --> J
F1 --> J
G1 --> J
H1 --> I
I --> J
J --> K
style I fill:#e3f2fd
style J fill:#f3e5f5
style K fill:#e8f5e8
```
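How those parameters look when assembled. The cookie dict shape follows Playwright's context cookies and is an assumption to verify against your installed version:
```python
from crawl4ai import BrowserConfig

browser_cfg = BrowserConfig(
    browser_type="chromium",                 # -> browser instance
    headless=True,                           # -> browser instance
    viewport_width=1440,                     # -> browser context
    viewport_height=900,
    proxy="http://proxy.example.com:8080",   # -> context routing
    user_agent="Mozilla/5.0 (compatible; MyCrawler/1.0)",
    cookies=[{"name": "session", "value": "abc123",
              "domain": "example.com", "path": "/"}],   # assumed dict shape
    headers={"Accept-Language": "en-US,en;q=0.9"},
    extra_args=["--disable-gpu"],            # -> browser command line flags
)
```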
### CrawlerRunConfig Category Breakdown
```mermaid
mindmap
  root((CrawlerRunConfig))
    Content Processing
      word_count_threshold
      css_selector
      target_elements
      excluded_tags
      markdown_generator
      extraction_strategy
    Page Navigation
      wait_until
      page_timeout
      wait_for
      wait_for_images
      delay_before_return_html
    Page Interaction
      js_code
      scan_full_page
      simulate_user
      magic
      remove_overlay_elements
    Caching Session
      cache_mode
      session_id
      shared_data
    Media Output
      screenshot
      pdf
      capture_mhtml
      image_score_threshold
    Link Filtering
      exclude_external_links
      exclude_domains
      exclude_social_media_links
```
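One config touching each category from the mind map; values are illustrative:
```python
from crawl4ai import CrawlerRunConfig, CacheMode

run_cfg = CrawlerRunConfig(
    # Content processing
    word_count_threshold=10,
    excluded_tags=["nav", "footer"],
    # Page navigation
    wait_until="domcontentloaded",
    page_timeout=60000,
    wait_for_images=True,
    # Page interaction
    js_code=["window.scrollTo(0, document.body.scrollHeight);"],
    remove_overlay_elements=True,
    # Caching / session
    cache_mode=CacheMode.ENABLED,
    session_id="my_session",
    # Media output
    screenshot=True,
    pdf=True,
    # Link filtering
    exclude_external_links=True,
)
```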
### LLM Provider Selection Flow
```mermaid
flowchart TD
A[Need LLM Processing?] --> B{Provider Type?}
B -->|Cloud API| C{Which Service?}
B -->|Local Model| D[Local Setup]
B -->|Custom Endpoint| E[Custom Config]
C -->|OpenAI| C1[OpenAI GPT Models]
C -->|Anthropic| C2[Claude Models]
C -->|Google| C3[Gemini Models]
C -->|Groq| C4[Fast Inference]
D --> D1[Ollama Setup]
E --> E1[Custom base_url]
C1 --> F1[LLMConfig with OpenAI settings]
C2 --> F2[LLMConfig with Anthropic settings]
C3 --> F3[LLMConfig with Google settings]
C4 --> F4[LLMConfig with Groq settings]
D1 --> F5[LLMConfig with Ollama settings]
E1 --> F6[LLMConfig with custom settings]
F1 --> G[Use in Extraction Strategy]
F2 --> G
F3 --> G
F4 --> G
F5 --> G
F6 --> G
style A fill:#e1f5fe
style G fill:#c8e6c9
```
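The three provider branches expressed with `LLMConfig`. The `"env:VAR"` token convention and `vendor/model` provider format are assumptions to check against the LLM docs for your version:
```python
from crawl4ai import LLMConfig

# Cloud API
openai_cfg = LLMConfig(provider="openai/gpt-4o-mini",
                       api_token="env:OPENAI_API_KEY")
# Local model via Ollama (no token needed)
ollama_cfg = LLMConfig(provider="ollama/llama3.3", api_token=None)
# Custom endpoint
custom_cfg = LLMConfig(provider="openai/gpt-4o-mini",
                       api_token="your-token",
                       base_url="https://your-endpoint.example.com/v1")
```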
### CrawlResult Structure and Data Flow
```mermaid
graph TB
subgraph "CrawlResult Output"
A[Basic Info]
B[HTML Content]
C[Markdown Output]
D[Extracted Data]
E[Media Files]
F[Metadata]
end
subgraph "Basic Info Details"
A --> A1[url: final URL]
A --> A2[success: boolean]
A --> A3[status_code: HTTP status]
A --> A4[error_message: if failed]
end
subgraph "HTML Content Types"
B --> B1[html: raw HTML]
B --> B2[cleaned_html: processed]
B --> B3[fit_html: filtered content]
end
subgraph "Markdown Variants"
C --> C1[raw_markdown: basic conversion]
C --> C2[markdown_with_citations: with refs]
C --> C3[fit_markdown: filtered content]
C --> C4[references_markdown: citation list]
end
subgraph "Extracted Content"
D --> D1[extracted_content: JSON string]
D --> D2[From CSS extraction]
D --> D3[From LLM extraction]
D --> D4[From XPath extraction]
end
subgraph "Media and Links"
E --> E1[images: list with scores]
E --> E2[videos: media content]
E --> E3[internal_links: same domain]
E --> E4[external_links: other domains]
end
subgraph "Generated Files"
F --> F1[screenshot: base64 PNG]
F --> F2[pdf: binary PDF data]
F --> F3[mhtml: archive format]
F --> F4[ssl_certificate: cert info]
end
style A fill:#e3f2fd
style B fill:#f3e5f5
style C fill:#e8f5e8
style D fill:#fff3e0
style E fill:#ffebee
style F fill:#f1f8e9
```
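Reading those fields off a result, a sketch (the `media`/`links` dict keys such as `"images"` and `"internal"` are assumptions based on the diagram labels):
```python
import asyncio, base64, json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com",
                                    config=CrawlerRunConfig(screenshot=True))
        if not result.success:
            print(f"Failed ({result.status_code}): {result.error_message}")
            return
        # HTML and markdown variants
        print(len(result.html), "raw /", len(result.cleaned_html), "cleaned")
        print(result.markdown.raw_markdown[:150])
        # Extracted data arrives as a JSON string (when a strategy was set)
        if result.extracted_content:
            print(json.loads(result.extracted_content))
        # Media and links
        print(len(result.media.get("images", [])), "images,",
              len(result.links.get("internal", [])), "internal links")
        # Screenshot is base64-encoded PNG
        if result.screenshot:
            with open("page.png", "wb") as f:
                f.write(base64.b64decode(result.screenshot))

asyncio.run(main())
```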
### Configuration Pattern State Machine
```mermaid
stateDiagram-v2
[*] --> ConfigCreation
ConfigCreation --> BasicConfig: Simple use case
ConfigCreation --> AdvancedConfig: Complex requirements
ConfigCreation --> TemplateConfig: Use predefined pattern
BasicConfig --> Validation: Check parameters
AdvancedConfig --> Validation: Check parameters
TemplateConfig --> Validation: Check parameters
Validation --> Invalid: Missing required fields
Validation --> Valid: All parameters correct
Invalid --> ConfigCreation: Fix and retry
Valid --> InUse: Passed to crawler
InUse --> Cloning: Need variation
InUse --> Serialization: Save configuration
InUse --> Complete: Crawl finished
Cloning --> Modified: clone() with updates
Modified --> Valid: Validate changes
Serialization --> Stored: dump() to dict
Stored --> Restoration: load() from dict
Restoration --> Valid: Recreate config object
Complete --> [*]
note right of BasicConfig : Minimal required settings
note right of AdvancedConfig : Full feature configuration
note right of TemplateConfig : Pre-built patterns
```
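The clone/serialize/restore transitions in code, using the method names from the diagram (`clone()`, `dump()`, `load()`):
```python
from crawl4ai import CrawlerRunConfig, CacheMode

base = CrawlerRunConfig(cache_mode=CacheMode.ENABLED, page_timeout=60000)

# Cloning: derive a variation without mutating the original
debug_cfg = base.clone(screenshot=True, cache_mode=CacheMode.BYPASS)

# Serialization round trip
as_dict = base.dump()
restored = CrawlerRunConfig.load(as_dict)
assert restored.page_timeout == base.page_timeout
```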
### Configuration Integration Architecture
```mermaid
graph TB
subgraph "User Layer"
U1[Configuration Creation]
U2[Parameter Selection]
U3[Pattern Application]
end
subgraph "Configuration Layer"
C1[BrowserConfig]
C2[CrawlerRunConfig]
C3[LLMConfig]
C4[Config Validation]
C5[Config Cloning]
end
subgraph "Crawler Engine"
E1[Browser Management]
E2[Page Navigation]
E3[Content Processing]
E4[Extraction Pipeline]
E5[Result Generation]
end
subgraph "Output Layer"
O1[CrawlResult Assembly]
O2[Data Formatting]
O3[File Generation]
O4[Metadata Collection]
end
U1 --> C1
U2 --> C2
U3 --> C3
C1 --> C4
C2 --> C4
C3 --> C4
C4 --> E1
C2 --> E2
C2 --> E3
C3 --> E4
E1 --> E2
E2 --> E3
E3 --> E4
E4 --> E5
E5 --> O1
O1 --> O2
O2 --> O3
O3 --> O4
C5 -.-> C1
C5 -.-> C2
C5 -.-> C3
style U1 fill:#e1f5fe
style C4 fill:#fff3e0
style E4 fill:#f3e5f5
style O4 fill:#c8e6c9
```
### Configuration Best Practices Flow
```mermaid
flowchart TD
A[Configuration Planning] --> B{Performance Priority?}
B -->|Speed| C[Fast Config Pattern]
B -->|Quality| D[Comprehensive Config Pattern]
B -->|Stealth| E[Stealth Config Pattern]
B -->|Balanced| F[Standard Config Pattern]
C --> C1[Enable caching]
C --> C2[Disable heavy features]
C --> C3[Use text_mode]
C1 --> G[Apply Configuration]
C2 --> G
C3 --> G
D --> D1[Enable all processing]
D --> D2[Use content filters]
D --> D3[Capture everything]
D1 --> G
D2 --> G
D3 --> G
E --> E1[Rotate user agents]
E --> E2[Use proxies]
E --> E3[Simulate human behavior]
E1 --> G
E2 --> G
E3 --> G
F --> F1[Balanced timeouts]
F --> F2[Selective processing]
F --> F3[Smart caching]
F1 --> G
F2 --> G
F3 --> G
G --> H[Test Configuration]
H --> I{Results Satisfactory?}
I -->|Yes| J[Production Ready]
I -->|No| K[Adjust Parameters]
K --> L[Clone and Modify]
L --> H
J --> M[Deploy with Confidence]
style A fill:#e1f5fe
style J fill:#c8e6c9
style M fill:#e8f5e8
```
## Advanced Configuration Workflows and Patterns
Visual representations of advanced Crawl4AI configuration strategies, proxy management, session handling, and identity-based crawling patterns.
### User Agent and Anti-Detection Strategy Flow
```mermaid
flowchart TD
A[Start Configuration] --> B{Detection Avoidance Needed?}
B -->|No| C[Standard User Agent]
B -->|Yes| D[Anti-Detection Strategy]
C --> C1[Static user_agent string]
C1 --> Z[Basic Configuration]
D --> E{User Agent Strategy}
E -->|Random| F[user_agent_mode: random]
E -->|Static Custom| G[Custom user_agent string]
E -->|Platform Specific| H[Generator Config]
F --> I[Configure Generator]
H --> I
I --> I1[Platform: windows/macos/linux]
I1 --> I2[Browser: chrome/firefox/safari]
I2 --> I3[Device: desktop/mobile/tablet]
G --> J[Behavioral Simulation]
I3 --> J
J --> K{Enable Simulation?}
K -->|Yes| L[simulate_user: True]
K -->|No| M[Standard Behavior]
L --> N[override_navigator: True]
N --> O[Configure Delays]
O --> O1[mean_delay: 1.5]
O1 --> O2[max_range: 2.0]
O2 --> P[Magic Mode]
M --> P
P --> Q{Auto-Handle Patterns?}
Q -->|Yes| R[magic: True]
Q -->|No| S[Manual Handling]
R --> T[Complete Anti-Detection Setup]
S --> T
Z --> T
style D fill:#ffeb3b
style T fill:#c8e6c9
style L fill:#ff9800
style R fill:#9c27b0
```
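The full anti-detection branch collapsed into one config. The `user_agent_generator_config` keys mirror the diagram (platform/browser/device) but are an assumption; verify the exact key names in the identity-based crawling docs:
```python
from crawl4ai import CrawlerRunConfig

stealth_cfg = CrawlerRunConfig(
    user_agent_mode="random",
    user_agent_generator_config={"platform": "windows",   # assumed keys
                                 "browser": "chrome",
                                 "device": "desktop"},
    simulate_user=True,        # human-like mouse/timing behavior
    override_navigator=True,   # patch navigator properties
    mean_delay=1.5,            # randomized delays, as in the flow above
    max_range=2.0,
    magic=True,                # auto-handle common detection patterns
)
```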
### Proxy Configuration and Rotation Architecture
```mermaid
graph TB
subgraph "Proxy Configuration Types"
A[Single Proxy] --> A1[ProxyConfig object]
B[Proxy String] --> B1[from_string method]
C[Environment Proxies] --> C1[from_env method]
D[Multiple Proxies] --> D1[ProxyRotationStrategy]
end
subgraph "ProxyConfig Structure"
A1 --> E[server: URL]
A1 --> F[username: auth]
A1 --> G[password: auth]
A1 --> H[ip: extracted]
end
subgraph "Rotation Strategies"
D1 --> I[round_robin]
D1 --> J[random]
D1 --> K[least_used]
D1 --> L[failure_aware]
end
subgraph "Configuration Flow"
M[CrawlerRunConfig] --> N[proxy_config]
M --> O[proxy_rotation_strategy]
N --> P[Single Proxy Usage]
O --> Q[Multi-Proxy Rotation]
end
subgraph "Runtime Behavior"
P --> R[All requests use same proxy]
Q --> S[Requests rotate through proxies]
S --> T[Health monitoring]
T --> U[Automatic failover]
end
style A1 fill:#e3f2fd
style D1 fill:#f3e5f5
style M fill:#e8f5e8
style T fill:#fff3e0
```
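The construction paths for proxies, sketched (the `from_string` shorthand format and the `PROXIES` environment variable are assumptions from the proxy docs):
```python
from crawl4ai import CrawlerRunConfig, ProxyConfig

# Explicit single proxy
proxy = ProxyConfig(server="http://proxy.example.com:8080",
                    username="user", password="pass")

# Parsed from "ip:port:username:password" shorthand
proxy2 = ProxyConfig.from_string("192.168.1.100:8080:user:pass")

# From the environment (comma-separated PROXIES variable)
env_proxies = ProxyConfig.from_env()

run_cfg = CrawlerRunConfig(proxy_config=proxy)
# Multi-proxy rotation plugs in via proxy_rotation_strategy (see diagram above)
```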
### Content Selection Strategy Comparison
```mermaid
sequenceDiagram
participant Browser
participant HTML as Raw HTML
participant CSS as css_selector
participant Target as target_elements
participant Processor as Content Processor
participant Output
Note over Browser,Output: css_selector Strategy
Browser->>HTML: Load complete page
HTML->>CSS: Apply css_selector
CSS->>CSS: Extract matching elements only
CSS->>Processor: Process subset HTML
Processor->>Output: Markdown + Extraction from subset
Note over Browser,Output: target_elements Strategy
Browser->>HTML: Load complete page
HTML->>Processor: Process entire page
Processor->>Target: Focus on target_elements
Target->>Target: Extract from specified elements
Processor->>Output: Full page links/media + targeted content
Note over CSS,Target: Key Difference
Note over CSS: Affects entire processing pipeline
Note over Target: Affects only content extraction
```
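The practical difference between the two strategies:
```python
from crawl4ai import CrawlerRunConfig

# css_selector: the whole pipeline sees only the matching subset;
# links and media outside .article are dropped from the result
subset_cfg = CrawlerRunConfig(css_selector=".article")

# target_elements: the full page is processed (links, media, metadata),
# but markdown generation and extraction focus on the listed elements
focused_cfg = CrawlerRunConfig(target_elements=[".article", ".comments"])
```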
### Advanced wait_for Conditions Decision Tree
```mermaid
flowchart TD
A[Configure wait_for] --> B{Condition Type?}
B -->|CSS Element| C[CSS Selector Wait]
B -->|JavaScript Condition| D[JS Expression Wait]
B -->|Complex Logic| E[Custom JS Function]
B -->|No Wait| F[Default domcontentloaded]
C --> C1["wait_for: 'css:.element'"]
C1 --> C2[Element appears in DOM]
C2 --> G[Continue Processing]
D --> D1["wait_for: 'js:() => condition'"]
D1 --> D2[JavaScript returns true]
D2 --> G
E --> E1[Complex JS Function]
E1 --> E2{Multiple Conditions}
E2 -->|AND Logic| E3[All conditions true]
E2 -->|OR Logic| E4[Any condition true]
E2 -->|Custom Logic| E5[User-defined logic]
E3 --> G
E4 --> G
E5 --> G
F --> G
G --> H{Timeout Reached?}
H -->|No| I[Page Ready]
H -->|Yes| J[Timeout Error]
I --> K[Begin Content Extraction]
J --> L[Handle Error/Retry]
style C1 fill:#e8f5e8
style D1 fill:#fff3e0
style E1 fill:#ffeb3b
style I fill:#c8e6c9
style J fill:#ffcdd2
```
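The wait_for branches as configs:
```python
from crawl4ai import CrawlerRunConfig

# CSS branch: wait until the element exists in the DOM
css_wait = CrawlerRunConfig(wait_for="css:.results-table")

# JS branch: wait until the expression returns true
js_wait = CrawlerRunConfig(
    wait_for="js:() => document.querySelectorAll('.item').length >= 20")

# Complex logic: AND of two conditions inside one expression
complex_wait = CrawlerRunConfig(
    wait_for=("js:() => document.readyState === 'complete' && "
              "!document.querySelector('.spinner')"),
    page_timeout=60000,  # hitting this triggers the error/retry branch
)
```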
### Session Management Lifecycle
```mermaid
stateDiagram-v2
[*] --> SessionCreate
SessionCreate --> SessionActive: session_id provided
SessionCreate --> OneTime: no session_id
SessionActive --> BrowserLaunch: First arun() call
BrowserLaunch --> PageLoad: Navigate to URL
PageLoad --> JSExecution: Execute js_code
JSExecution --> ContentExtract: Extract content
ContentExtract --> SessionHold: Keep session alive
SessionHold --> ReuseSession: Subsequent arun() calls
ReuseSession --> JSOnlyMode: js_only=True
ReuseSession --> NewNavigation: js_only=False
JSOnlyMode --> JSExecution: Execute JS in existing page
NewNavigation --> PageLoad: Navigate to new URL
SessionHold --> SessionKill: kill_session() called
SessionHold --> SessionTimeout: Timeout reached
SessionHold --> SessionError: Error occurred
SessionKill --> SessionCleanup
SessionTimeout --> SessionCleanup
SessionError --> SessionCleanup
SessionCleanup --> [*]
OneTime --> BrowserLaunch
ContentExtract --> OneTimeCleanup: No session_id
OneTimeCleanup --> [*]
note right of SessionActive : Persistent browser context
note right of JSOnlyMode : Reuse existing page
note right of OneTime : Temporary browser instance
```
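A minimal session lifecycle in code. The `kill_session` location on `crawler_strategy` is an assumption to verify for your version:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        # First call creates the session and keeps the tab alive
        await crawler.arun("https://example.com/page1",
                           config=CrawlerRunConfig(session_id="demo"))
        # Reuse the same page; js_only=True skips re-navigation
        await crawler.arun("https://example.com/page1",
                           config=CrawlerRunConfig(
                               session_id="demo", js_only=True,
                               js_code="document.querySelector('#next')?.click();"))
        # Explicit cleanup
        await crawler.crawler_strategy.kill_session("demo")

asyncio.run(main())
```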
### Identity-Based Crawling Configuration Matrix
```mermaid
graph TD
subgraph "Geographic Identity"
A[Geolocation] --> A1[latitude/longitude]
A2[Timezone] --> A3[timezone_id]
A4[Locale] --> A5[language/region]
end
subgraph "Browser Identity"
B[User Agent] --> B1[Platform fingerprint]
B2[Navigator Properties] --> B3[override_navigator]
B4[Headers] --> B5[Accept-Language]
end
subgraph "Behavioral Identity"
C[Mouse Simulation] --> C1[simulate_user]
C2[Timing Patterns] --> C3[mean_delay/max_range]
C4[Interaction Patterns] --> C5[Human-like behavior]
end
subgraph "Configuration Integration"
D[CrawlerRunConfig] --> A
D --> B
D --> C
D --> E[Complete Identity Profile]
E --> F[Geographic Consistency]
E --> G[Browser Consistency]
E --> H[Behavioral Consistency]
end
F --> I[Paris, France Example]
I --> I1[locale: fr-FR]
I --> I2[timezone: Europe/Paris]
I --> I3[geolocation: 48.8566, 2.3522]
G --> J[Windows Chrome Example]
J --> J1[platform: windows]
J --> J2[browser: chrome]
J --> J3[user_agent: matching pattern]
H --> K[Human Simulation]
K --> K1[Random delays]
K --> K2[Mouse movements]
K --> K3[Navigation patterns]
style E fill:#ff9800
style I fill:#e3f2fd
style J fill:#f3e5f5
style K fill:#e8f5e8
```
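The "Paris, France" profile from the matrix assembled as one config (`GeolocationConfig` field names per the docs; `accuracy` is assumed):
```python
from crawl4ai import CrawlerRunConfig, GeolocationConfig

paris_cfg = CrawlerRunConfig(
    locale="fr-FR",
    timezone_id="Europe/Paris",
    geolocation=GeolocationConfig(latitude=48.8566, longitude=2.3522,
                                  accuracy=100.0),
    simulate_user=True,  # behavioral consistency
)
```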
### Multi-Step Crawling Sequence Flow
```mermaid
sequenceDiagram
participant User
participant Crawler
participant Session as Browser Session
participant Page1 as Login Page
participant Page2 as Dashboard
participant Page3 as Data Pages
User->>Crawler: Step 1 - Login
Crawler->>Session: Create session_id="user_session"
Session->>Page1: Navigate to login
Page1->>Page1: Execute login JS
Page1->>Page1: Wait for dashboard redirect
Page1-->>Crawler: Login complete
User->>Crawler: Step 2 - Navigate dashboard
Note over Crawler,Session: Reuse existing session
Crawler->>Session: js_only=True (no page reload)
Session->>Page2: Execute navigation JS
Page2->>Page2: Wait for data table
Page2-->>Crawler: Dashboard ready
User->>Crawler: Step 3 - Extract data pages
loop For each page 1-5
Crawler->>Session: js_only=True
Session->>Page3: Click page button
Page3->>Page3: Wait for page active
Page3->>Page3: Extract content
Page3-->>Crawler: Page data
end
User->>Crawler: Cleanup
Crawler->>Session: kill_session()
Session-->>Crawler: Session destroyed
```
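The same three-step flow sketched in code. All selectors (`#user`, `.dashboard`, `[data-page]`) are hypothetical:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    session = "user_session"
    async with AsyncWebCrawler() as crawler:
        # Step 1: login and wait for the dashboard redirect
        await crawler.arun("https://example.com/login", config=CrawlerRunConfig(
            session_id=session,
            js_code="""document.querySelector('#user').value = 'demo';
                       document.querySelector('#pass').value = 'secret';
                       document.querySelector('form').submit();""",
            wait_for="css:.dashboard"))
        # Steps 2-3: paginate in the existing page, no reloads
        for page in range(1, 6):
            result = await crawler.arun("https://example.com/dashboard",
                config=CrawlerRunConfig(
                    session_id=session, js_only=True,
                    js_code=f"document.querySelector('[data-page=\"{page}\"]').click();",
                    wait_for=f"js:() => document.querySelector('.page-{page}.active') !== null"))
            print(page, result.success)
        await crawler.crawler_strategy.kill_session(session)

asyncio.run(main())
```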
### Configuration Import and Usage Patterns
```mermaid
graph LR
subgraph "Main Package Imports"
A[crawl4ai] --> A1[AsyncWebCrawler]
A --> A2[BrowserConfig]
A --> A3[CrawlerRunConfig]
A --> A4[LLMConfig]
A --> A5[CacheMode]
A --> A6[ProxyConfig]
A --> A7[GeolocationConfig]
end
subgraph "Strategy Imports"
A --> B1[JsonCssExtractionStrategy]
A --> B2[LLMExtractionStrategy]
A --> B3[DefaultMarkdownGenerator]
A --> B4[PruningContentFilter]
A --> B5[RegexChunking]
end
subgraph "Configuration Assembly"
C[Configuration Builder] --> A2
C --> A3
C --> A4
A2 --> D[Browser Environment]
A3 --> E[Crawl Behavior]
A4 --> F[LLM Integration]
E --> B1
E --> B2
E --> B3
E --> B4
E --> B5
end
subgraph "Runtime Flow"
G[Crawler Instance] --> D
G --> H[Execute Crawl]
H --> E
H --> F
H --> I[CrawlResult]
end
style A fill:#e3f2fd
style C fill:#fff3e0
style G fill:#e8f5e8
style I fill:#c8e6c9
```
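The import surface from the diagram, all from the top-level package (older releases exposed some strategies from submodules):
```python
from crawl4ai import (
    AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, LLMConfig,
    CacheMode, ProxyConfig, GeolocationConfig,
    JsonCssExtractionStrategy, LLMExtractionStrategy,
    DefaultMarkdownGenerator, PruningContentFilter, RegexChunking,
)
```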
### Advanced Configuration Decision Matrix
```mermaid
flowchart TD
A[Advanced Configuration Needed] --> B{Primary Use Case?}
B -->|Bot Detection Avoidance| C[Anti-Detection Setup]
B -->|Geographic Simulation| D[Identity-Based Config]
B -->|Multi-Step Workflows| E[Session Management]
B -->|Network Reliability| F[Proxy Configuration]
B -->|Content Precision| G[Selector Strategy]
C --> C1[Random User Agents]
C --> C2[Behavioral Simulation]
C --> C3[Navigator Override]
C --> C4[Magic Mode]
D --> D1[Geolocation Setup]
D --> D2[Locale Configuration]
D --> D3[Timezone Setting]
D --> D4[Browser Fingerprinting]
E --> E1[Session ID Management]
E --> E2[JS-Only Navigation]
E --> E3[Shared Data Context]
E --> E4[Session Cleanup]
F --> F1[Single Proxy]
F --> F2[Proxy Rotation]
F --> F3[Failover Strategy]
F --> F4[Health Monitoring]
G --> G1[css_selector for Subset]
G --> G2[target_elements for Focus]
G --> G3[excluded_selector for Removal]
G --> G4[Hierarchical Selection]
C1 --> H[Production Configuration]
C2 --> H
C3 --> H
C4 --> H
D1 --> H
D2 --> H
D3 --> H
D4 --> H
E1 --> H
E2 --> H
E3 --> H
E4 --> H
F1 --> H
F2 --> H
F3 --> H
F4 --> H
G1 --> H
G2 --> H
G3 --> H
G4 --> H
style H fill:#c8e6c9
style C fill:#ff9800
style D fill:#9c27b0
style E fill:#2196f3
style F fill:#4caf50
style G fill:#ff5722
```
## Advanced Features Workflows and Architecture
Visual representations of advanced crawling capabilities, session management, hooks system, and performance optimization strategies.
### File Download Workflow
```mermaid
sequenceDiagram
participant User
participant Crawler
participant Browser
participant FileSystem
participant Page
User->>Crawler: Configure downloads_path
Crawler->>Browser: Create context with download handling
Browser-->>Crawler: Context ready
Crawler->>Page: Navigate to target URL
Page-->>Crawler: Page loaded
Crawler->>Page: Execute download JavaScript
Page->>Page: Find download links (.pdf, .zip, etc.)
loop For each download link
Page->>Browser: Click download link
Browser->>FileSystem: Save file to downloads_path
FileSystem-->>Browser: File saved
Browser-->>Page: Download complete
end
Page-->>Crawler: All downloads triggered
Crawler->>FileSystem: Check downloaded files
FileSystem-->>Crawler: List of file paths
Crawler-->>User: CrawlResult with downloaded_files[]
Note over User,FileSystem: Files available in downloads_path
```
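The download flow as a script. The `accept_downloads` / `downloads_path` parameter names follow the advanced-features docs and should be verified against your version:
```python
import asyncio, os
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_cfg = BrowserConfig(
        accept_downloads=True,
        downloads_path=os.path.join(os.getcwd(), "downloads"))
    # Click every PDF link on the page (hypothetical target site)
    js_click_pdfs = """document.querySelectorAll("a[href$='.pdf']")
                           .forEach(a => a.click());"""
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun("https://example.com/reports",
            config=CrawlerRunConfig(js_code=js_click_pdfs,
                                    delay_before_return_html=5.0))
        print(result.downloaded_files)  # paths under downloads_path

asyncio.run(main())
```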
### Hooks Execution Flow
```mermaid
flowchart TD
A[Start Crawl] --> B[on_browser_created Hook]
B --> C[Browser Instance Created]
C --> D[on_page_context_created Hook]
D --> E[Page & Context Setup]
E --> F[before_goto Hook]
F --> G[Navigate to URL]
G --> H[after_goto Hook]
H --> I[Page Loaded]
I --> J[before_retrieve_html Hook]
J --> K[Extract HTML Content]
K --> L[Return CrawlResult]
subgraph "Hook Capabilities"
B1[Route Filtering]
B2[Authentication]
B3[Custom Headers]
B4[Viewport Setup]
B5[Content Manipulation]
end
D --> B1
F --> B2
F --> B3
D --> B4
J --> B5
style A fill:#e1f5fe
style L fill:#c8e6c9
style B fill:#fff3e0
style D fill:#f3e5f5
style F fill:#e8f5e8
style H fill:#fce4ec
style J fill:#fff9c4
```
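Registering hooks, a sketch. Hook names match the flow above; the callback signatures are assumptions from the hooks docs:
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def before_goto(page, context, url, **kwargs):
    # e.g. inject auth headers before navigation
    await page.set_extra_http_headers({"X-Auth": "token"})
    return page

async def after_goto(page, context, url, response, **kwargs):
    print(f"Landed on {url}")
    return page

async def main():
    async with AsyncWebCrawler() as crawler:
        crawler.crawler_strategy.set_hook("before_goto", before_goto)
        crawler.crawler_strategy.set_hook("after_goto", after_goto)
        await crawler.arun("https://example.com")

asyncio.run(main())
```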
### Session Management State Machine
```mermaid
stateDiagram-v2
[*] --> SessionCreated: session_id provided
SessionCreated --> PageLoaded: Initial arun()
PageLoaded --> JavaScriptExecution: js_code executed
JavaScriptExecution --> ContentUpdated: DOM modified
ContentUpdated --> NextOperation: js_only=True
NextOperation --> JavaScriptExecution: More interactions
NextOperation --> SessionMaintained: Keep session alive
NextOperation --> SessionClosed: kill_session()
SessionMaintained --> PageLoaded: Navigate to new URL
SessionMaintained --> JavaScriptExecution: Continue interactions
SessionClosed --> [*]: Session terminated
note right of SessionCreated
Browser tab created
Context preserved
end note
note right of ContentUpdated
State maintained
Cookies preserved
Local storage intact
end note
note right of SessionClosed
Clean up resources
Release browser tab
end note
```
### Lazy Loading & Dynamic Content Strategy
```mermaid
flowchart TD
A[Page Load] --> B{Content Type?}
B -->|Static Content| C[Standard Extraction]
B -->|Lazy Loaded| D[Enable scan_full_page]
B -->|Infinite Scroll| E[Custom Scroll Strategy]
B -->|Load More Button| F[JavaScript Interaction]
D --> D1[Automatic Scrolling]
D1 --> D2[Wait for Images]
D2 --> D3[Content Stabilization]
E --> E1[Detect Scroll Triggers]
E1 --> E2[Progressive Loading]
E2 --> E3[Monitor Content Changes]
F --> F1[Find Load More Button]
F1 --> F2[Click and Wait]
F2 --> F3{More Content?}
F3 -->|Yes| F1
F3 -->|No| G[Complete Extraction]
D3 --> G
E3 --> G
C --> G
G --> H[Return Enhanced Content]
subgraph "Optimization Techniques"
I[exclude_external_images]
J[image_score_threshold]
K[wait_for selectors]
L[scroll_delay tuning]
end
D --> I
E --> J
F --> K
D1 --> L
style A fill:#e1f5fe
style H fill:#c8e6c9
style D fill:#fff3e0
style E fill:#f3e5f5
style F fill:#e8f5e8
```
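The two most common branches as configs (`scroll_delay` in seconds is an assumption):
```python
from crawl4ai import CrawlerRunConfig

# Lazy-loaded content: auto-scroll and let images settle
lazy_cfg = CrawlerRunConfig(
    scan_full_page=True,
    scroll_delay=0.5,
    wait_for_images=True,
    exclude_external_images=True,
    image_score_threshold=3,
)

# "Load more" button: click, then wait for new items (hypothetical selectors)
load_more_cfg = CrawlerRunConfig(
    js_code="document.querySelector('.load-more')?.click();",
    wait_for="js:() => document.querySelectorAll('.item').length > 30",
)
```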
### Network & Console Monitoring Architecture
```mermaid
graph TB
subgraph "Browser Context"
A[Web Page] --> B[Network Requests]
A --> C[Console Messages]
A --> D[Resource Loading]
end
subgraph "Monitoring Layer"
B --> E[Request Interceptor]
C --> F[Console Listener]
D --> G[Resource Monitor]
E --> H[Request Events]
E --> I[Response Events]
E --> J[Failure Events]
F --> K[Log Messages]
F --> L[Error Messages]
F --> M[Warning Messages]
end
subgraph "Data Collection"
H --> N[Request Details]
I --> O[Response Analysis]
J --> P[Failure Tracking]
K --> Q[Debug Information]
L --> R[Error Analysis]
M --> S[Performance Insights]
end
subgraph "Output Aggregation"
N --> T[network_requests Array]
O --> T
P --> T
Q --> U[console_messages Array]
R --> U
S --> U
end
T --> V[CrawlResult]
U --> V
style V fill:#c8e6c9
style E fill:#fff3e0
style F fill:#f3e5f5
style T fill:#e8f5e8
style U fill:#fce4ec
```
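Turning the monitors on and inspecting the aggregated arrays. The per-entry dict keys (`event_type`, `type`) are assumptions based on the network/console docs:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    cfg = CrawlerRunConfig(capture_network_requests=True,
                           capture_console_messages=True)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=cfg)
        failures = [e for e in (result.network_requests or [])
                    if e.get("event_type") == "request_failed"]
        errors = [m for m in (result.console_messages or [])
                  if m.get("type") == "error"]
        print(len(result.network_requests or []), "network events,",
              len(failures), "failures,", len(errors), "console errors")

asyncio.run(main())
```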
### Multi-Step Workflow Sequence
```mermaid
sequenceDiagram
participant User
participant Crawler
participant Session
participant Page
participant Server
User->>Crawler: Step 1 - Initial load
Crawler->>Session: Create session_id
Session->>Page: New browser tab
Page->>Server: GET /step1
Server-->>Page: Page content
Page-->>Crawler: Content ready
Crawler-->>User: Result 1
User->>Crawler: Step 2 - Navigate (js_only=true)
Crawler->>Session: Reuse existing session
Session->>Page: Execute JavaScript
Page->>Page: Click next button
Page->>Server: Navigate to /step2
Server-->>Page: New content
Page-->>Crawler: Updated content
Crawler-->>User: Result 2
User->>Crawler: Step 3 - Form submission
Crawler->>Session: Continue session
Session->>Page: Execute form JS
Page->>Page: Fill form fields
Page->>Server: POST form data
Server-->>Page: Results page
Page-->>Crawler: Final content
Crawler-->>User: Result 3
User->>Crawler: Cleanup
Crawler->>Session: kill_session()
Session->>Page: Close tab
Session-->>Crawler: Session terminated
Note over User,Server: State preserved across steps
Note over Session: Cookies, localStorage maintained
```
### SSL Certificate Analysis Flow
```mermaid
flowchart LR
A[Enable SSL Fetch] --> B[HTTPS Connection]
B --> C[Certificate Retrieval]
C --> D[Certificate Analysis]
D --> E[Basic Info]
D --> F[Validity Check]
D --> G[Chain Verification]
D --> H[Security Assessment]
E --> E1[Issuer Details]
E --> E2[Subject Information]
E --> E3[Serial Number]
F --> F1[Not Before Date]
F --> F2[Not After Date]
F --> F3[Expiration Warning]
G --> G1[Root CA]
G --> G2[Intermediate Certs]
G --> G3[Trust Path]
H --> H1[Key Length]
H --> H2[Signature Algorithm]
H --> H3[Vulnerabilities]
subgraph "Export Formats"
I[JSON Format]
J[PEM Format]
K[DER Format]
end
E1 --> I
F1 --> I
G1 --> I
H1 --> I
I --> J
J --> K
style A fill:#e1f5fe
style D fill:#fff3e0
style I fill:#e8f5e8
style J fill:#f3e5f5
style K fill:#fce4ec
```
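Fetching and exporting a certificate. Attribute and export-method names follow the diagram; treat them as assumptions:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    cfg = CrawlerRunConfig(fetch_ssl_certificate=True)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=cfg)
        cert = result.ssl_certificate
        if cert:
            print(cert.issuer, cert.valid_until)  # assumed attribute names
            cert.to_json("cert.json")   # the three export formats above
            cert.to_pem("cert.pem")
            cert.to_der("cert.der")

asyncio.run(main())
```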
### Performance Optimization Decision Tree
```mermaid
flowchart TD
A[Performance Optimization] --> B{Primary Goal?}
B -->|Speed| C[Fast Crawling Mode]
B -->|Resource Usage| D[Memory Optimization]
B -->|Scale| E[Batch Processing]
B -->|Quality| F[Comprehensive Extraction]
C --> C1[text_mode=True]
C --> C2[exclude_all_images=True]
C --> C3["excluded_tags=['script','style']"]
C --> C4[page_timeout=30000]
D --> D1[light_mode=True]
D --> D2[headless=True]
D --> D3[semaphore_count=3]
D --> D4[disable monitoring]
E --> E1[stream=True]
E --> E2[cache_mode=ENABLED]
E --> E3["arun_many()"]
E --> E4[concurrent batches]
F --> F1[wait_for_images=True]
F --> F2[process_iframes=True]
F --> F3[capture_network=True]
F --> F4[screenshot=True]
subgraph "Trade-offs"
G[Speed vs Quality]
H[Memory vs Features]
I[Scale vs Detail]
end
C --> G
D --> H
E --> I
subgraph "Monitoring Metrics"
J[Response Time]
K[Memory Usage]
L[Success Rate]
M[Content Quality]
end
C1 --> J
D1 --> K
E1 --> L
F1 --> M
style A fill:#e1f5fe
style C fill:#e8f5e8
style D fill:#fff3e0
style E fill:#f3e5f5
style F fill:#fce4ec
```
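A speed-leaning setup combining the Fast and Scale branches, streaming results from `arun_many()`:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_cfg = BrowserConfig(headless=True, text_mode=True, light_mode=True)
    run_cfg = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,
        excluded_tags=["script", "style"],
        page_timeout=30000,
        stream=True,  # yield results as they complete
    )
    urls = [f"https://example.com/page/{i}" for i in range(10)]
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        async for result in await crawler.arun_many(urls, config=run_cfg):
            print(result.url, result.success)

asyncio.run(main())
```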
### Advanced Page Interaction Matrix
```mermaid
graph LR
subgraph "Interaction Types"
A[Form Filling]
B[Dynamic Loading]
C[Modal Handling]
D[Scroll Interactions]
E[Button Clicks]
end
subgraph "Detection Methods"
F[CSS Selectors]
G[JavaScript Conditions]
H[Element Visibility]
I[Content Changes]
J[Network Activity]
end
subgraph "Automation Features"
K[simulate_user=True]
L[magic=True]
M[remove_overlay_elements=True]
N[override_navigator=True]
O[scan_full_page=True]
end
subgraph "Wait Strategies"
P[wait_for CSS]
Q[wait_for JS]
R[wait_for_images]
S[delay_before_return]
T[custom timeouts]
end
A --> F
A --> K
A --> P
B --> G
B --> O
B --> Q
C --> H
C --> L
C --> M
D --> I
D --> O
D --> S
E --> F
E --> K
E --> T
style A fill:#e8f5e8
style B fill:#fff3e0
style C fill:#f3e5f5
style D fill:#fce4ec
style E fill:#e1f5fe
```
### Input Source Processing Flow
```mermaid
flowchart TD
A[Input Source] --> B{Input Type?}
B -->|URL| C[Web Request]
B -->|file://| D[Local File]
B -->|raw:| E[Raw HTML]
C --> C1[HTTP/HTTPS Request]
C1 --> C2[Browser Navigation]
C2 --> C3[Page Rendering]
C3 --> F[Content Processing]
D --> D1[File System Access]
D1 --> D2[Read HTML File]
D2 --> D3[Parse Content]
D3 --> F
E --> E1[Parse Raw HTML]
E1 --> E2[Create Virtual Page]
E2 --> E3[Direct Processing]
E3 --> F
F --> G[Common Processing Pipeline]
G --> H[Markdown Generation]
G --> I[Link Extraction]
G --> J[Media Processing]
G --> K[Data Extraction]
H --> L[CrawlResult]
I --> L
J --> L
K --> L
subgraph "Processing Features"
M[Same extraction strategies]
N[Same filtering options]
O[Same output formats]
P[Consistent results]
end
F --> M
F --> N
F --> O
F --> P
style A fill:#e1f5fe
style L fill:#c8e6c9
style C fill:#e8f5e8
style D fill:#fff3e0
style E fill:#f3e5f5
```
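All three input types go through the same pipeline and return a `CrawlResult`:
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        web = await crawler.arun("https://example.com")
        local = await crawler.arun("file:///tmp/saved_page.html")
        raw = await crawler.arun("raw:<html><body><h1>Hello</h1></body></html>")
        for r in (web, local, raw):
            print(r.markdown.raw_markdown[:80])

asyncio.run(main())
```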
**📖 Learn more:** [Advanced Features Guide](https://docs.crawl4ai.com/advanced/advanced-features/), [Session Management](https://docs.crawl4ai.com/advanced/session-management/), [Hooks System](https://docs.crawl4ai.com/advanced/hooks-auth/), [Performance Optimization](https://docs.crawl4ai.com/advanced/performance/)
**📖 Learn more:** [Identity-Based Crawling](https://docs.crawl4ai.com/advanced/identity-based-crawling/), [Session Management](https://docs.crawl4ai.com/advanced/session-management/), [Proxy & Security](https://docs.crawl4ai.com/advanced/proxy-security/), [Content Selection](https://docs.crawl4ai.com/core/content-selection/)
**📖 Learn more:** [Configuration Reference](https://docs.crawl4ai.com/api/parameters/), [Best Practices](https://docs.crawl4ai.com/core/browser-crawler-config/), [Advanced Configuration](https://docs.crawl4ai.com/advanced/advanced-features/)