## CLI Workflows and Profile Management

Visual representations of command-line interface operations, browser profile management, and identity-based crawling workflows.

### CLI Command Flow Architecture

```mermaid
flowchart TD
    A[crwl command] --> B{Command Type?}

    B -->|URL Crawling| C[Parse URL & Options]
    B -->|Profile Management| D[profiles subcommand]
    B -->|CDP Browser| E[cdp subcommand]
    B -->|Browser Control| F[browser subcommand]
    B -->|Configuration| G[config subcommand]

    C --> C1{Output Format?}
    C1 -->|Default| C2[HTML/Markdown]
    C1 -->|JSON| C3[Structured Data]
    C1 -->|markdown| C4[Clean Markdown]
    C1 -->|markdown-fit| C5[Filtered Content]

    C --> C6{Authentication?}
    C6 -->|Profile Specified| C7[Load Browser Profile]
    C6 -->|No Profile| C8[Anonymous Session]

    C7 --> C9[Launch with User Data]
    C8 --> C10[Launch Clean Browser]

    C9 --> C11[Execute Crawl]
    C10 --> C11

    C11 --> C12{Success?}
    C12 -->|Yes| C13[Return Results]
    C12 -->|No| C14[Error Handling]

    D --> D1[Interactive Profile Menu]
    D1 --> D2{Menu Choice?}
    D2 -->|Create| D3[Open Browser for Setup]
    D2 -->|List| D4[Show Existing Profiles]
    D2 -->|Delete| D5[Remove Profile]
    D2 -->|Use| D6[Crawl with Profile]

    E --> E1[Launch CDP Browser]
    E1 --> E2[Remote Debugging Active]

    F --> F1{Browser Action?}
    F1 -->|start| F2[Start Builtin Browser]
    F1 -->|stop| F3[Stop Builtin Browser]
    F1 -->|status| F4[Check Browser Status]
    F1 -->|view| F5[Open Browser Window]

    G --> G1{Config Action?}
    G1 -->|list| G2[Show All Settings]
    G1 -->|set| G3[Update Setting]
    G1 -->|get| G4[Read Setting]

    style A fill:#e1f5fe
    style C13 fill:#c8e6c9
    style C14 fill:#ffcdd2
    style D3 fill:#fff3e0
    style E2 fill:#f3e5f5
```

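The URL-crawling branch of the flowchart can be sketched as a small helper that assembles an argument list for `crwl`. This is a minimal sketch and not part of Crawl4AI: `build_crwl_command` and `KNOWN_OUTPUTS` are hypothetical names, and the helper uses only the `-p`, `-o`, and `-v` flags that appear elsewhere in this document.

```python
from typing import List, Optional

# Output formats named in the flowchart; omitting -o leaves the CLI default.
KNOWN_OUTPUTS = {"json", "markdown", "markdown-fit"}


def build_crwl_command(
    url: str,
    profile: Optional[str] = None,
    output: Optional[str] = None,
    verbose: bool = False,
) -> List[str]:
    """Assemble an argv list for a `crwl` URL crawl (hypothetical helper)."""
    if output is not None and output not in KNOWN_OUTPUTS:
        raise ValueError(f"unknown output format: {output!r}")
    argv = ["crwl", url]
    if profile:  # "Profile Specified" branch: launch with saved user data
        argv += ["-p", profile]
    if output:  # explicit output format
        argv += ["-o", output]
    if verbose:
        argv.append("-v")
    return argv


if __name__ == "__main__":
    print(build_crwl_command("https://example.com", profile="linkedin-auth", output="json"))
```

The returned list can be handed to `subprocess.run(argv, check=True)`, which avoids shell quoting problems with URLs that contain `&` or `?`.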
### Profile Management Workflow

```mermaid
sequenceDiagram
    participant User
    participant CLI
    participant ProfileManager
    participant Browser
    participant FileSystem

    User->>CLI: crwl profiles
    CLI->>ProfileManager: Initialize profile manager
    ProfileManager->>FileSystem: Scan for existing profiles
    FileSystem-->>ProfileManager: Profile list
    ProfileManager-->>CLI: Show interactive menu
    CLI-->>User: Display options

    Note over User: User selects "Create new profile"

    User->>CLI: Create profile "linkedin-auth"
    CLI->>ProfileManager: create_profile("linkedin-auth")
    ProfileManager->>FileSystem: Create profile directory
    ProfileManager->>Browser: Launch with new user data dir
    Browser-->>User: Opens browser window

    Note over User: User manually logs in to LinkedIn

    User->>Browser: Navigate and authenticate
    Browser->>FileSystem: Save cookies, session data
    User->>CLI: Press 'q' to save profile
    CLI->>ProfileManager: finalize_profile()
    ProfileManager->>FileSystem: Lock profile settings
    ProfileManager-->>CLI: Profile saved
    CLI-->>User: Profile "linkedin-auth" created

    Note over User: Later usage

    User->>CLI: crwl https://linkedin.com/feed -p linkedin-auth
    CLI->>ProfileManager: load_profile("linkedin-auth")
    ProfileManager->>FileSystem: Read profile data
    FileSystem-->>ProfileManager: User data directory
    ProfileManager-->>CLI: Profile configuration
    CLI->>Browser: Launch with existing profile
    Browser-->>CLI: Authenticated session ready
    CLI->>Browser: Navigate to target URL
    Browser-->>CLI: Crawl results with auth context
    CLI-->>User: Authenticated content
```

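The "Scan for existing profiles" and "Read profile data" steps in the sequence above can be illustrated with plain directory operations. This sketch assumes that each profile is one subdirectory under some root directory; the actual storage layout used by Crawl4AI is not described here, and `list_profiles` and `profile_user_data_dir` are hypothetical helpers.

```python
import tempfile
from pathlib import Path
from typing import List


def list_profiles(profiles_root: Path) -> List[str]:
    """Return profile names, assuming one subdirectory per profile."""
    if not profiles_root.is_dir():
        return []
    return sorted(p.name for p in profiles_root.iterdir() if p.is_dir())


def profile_user_data_dir(profiles_root: Path, name: str) -> Path:
    """Resolve a profile name to its user data directory, or raise."""
    path = profiles_root / name
    if not path.is_dir():
        raise FileNotFoundError(f"profile not found: {name}")
    return path


if __name__ == "__main__":
    # Use a throwaway directory so the demo never touches real profiles.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "linkedin-auth").mkdir()
        print(list_profiles(root))
        print(profile_user_data_dir(root, "linkedin-auth").name)
```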
### Browser Management State Machine

```mermaid
stateDiagram-v2
    [*] --> Stopped: Initial state

    Stopped --> Starting: crwl browser start
    Starting --> Running: Browser launched
    Running --> Viewing: crwl browser view
    Viewing --> Running: Close window
    Running --> Stopping: crwl browser stop
    Stopping --> Stopped: Cleanup complete

    Running --> Restarting: crwl browser restart
    Restarting --> Running: New browser instance

    Stopped --> CDP_Mode: crwl cdp
    CDP_Mode --> CDP_Running: Remote debugging active
    CDP_Running --> CDP_Mode: Manual close
    CDP_Mode --> Stopped: Exit CDP

    Running --> StatusCheck: crwl browser status
    StatusCheck --> Running: Return status

    note right of Running : Port 9222 active\nBuiltin browser available
    note right of CDP_Running : Remote debugging\nManual control enabled
    note right of Viewing : Visual browser window\nDirect interaction
```

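The state machine above can be written down as a transition table, which makes it easy to check which commands are meaningful in which state. This is an illustrative model of the diagram only, not Crawl4AI code; `TRANSITIONS`, `step`, and `run` are hypothetical names.

```python
from typing import Dict, Iterable, Tuple

# (current state, event) -> next state, copied from the diagram.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("Stopped", "crwl browser start"): "Starting",
    ("Starting", "Browser launched"): "Running",
    ("Running", "crwl browser view"): "Viewing",
    ("Viewing", "Close window"): "Running",
    ("Running", "crwl browser stop"): "Stopping",
    ("Stopping", "Cleanup complete"): "Stopped",
    ("Running", "crwl browser restart"): "Restarting",
    ("Restarting", "New browser instance"): "Running",
    ("Stopped", "crwl cdp"): "CDP_Mode",
    ("CDP_Mode", "Remote debugging active"): "CDP_Running",
    ("CDP_Running", "Manual close"): "CDP_Mode",
    ("CDP_Mode", "Exit CDP"): "Stopped",
    ("Running", "crwl browser status"): "StatusCheck",
    ("StatusCheck", "Return status"): "Running",
}


def step(state: str, event: str) -> str:
    """Apply one event; reject events the diagram does not allow."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not valid in state {state!r}") from None


def run(events: Iterable[str], state: str = "Stopped") -> str:
    """Fold a sequence of events over the table, starting from Stopped."""
    for event in events:
        state = step(state, event)
    return state


if __name__ == "__main__":
    print(run(["crwl browser start", "Browser launched", "crwl browser view"]))
```

In this model, `crwl browser view` from `Stopped` raises `ValueError`, mirroring the diagram, where viewing is only reachable from `Running`.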
### Authentication Workflow for Protected Sites

```mermaid
flowchart TD
    A[Protected Site Access Needed] --> B[Create Profile Strategy]

    B --> C{Existing Profile?}
    C -->|Yes| D[Test Profile Validity]
    C -->|No| E[Create New Profile]

    D --> D1{Profile Valid?}
    D1 -->|Yes| F[Use Existing Profile]
    D1 -->|No| E

    E --> E1[crwl profiles]
    E1 --> E2[Select Create New Profile]
    E2 --> E3[Enter Profile Name]
    E3 --> E4[Browser Opens for Auth]

    E4 --> E5{Authentication Method?}
    E5 -->|Login Form| E6[Fill Username/Password]
    E5 -->|OAuth| E7[OAuth Flow]
    E5 -->|2FA| E8[Handle 2FA]
    E5 -->|Session Cookie| E9[Import Cookies]

    E6 --> E10[Manual Login Process]
    E7 --> E10
    E8 --> E10
    E9 --> E10

    E10 --> E11[Verify Authentication]
    E11 --> E12{Auth Successful?}
    E12 -->|Yes| E13[Save Profile - Press q]
    E12 -->|No| E10

    E13 --> F
    F --> G[Execute Authenticated Crawl]

    G --> H[crwl URL -p profile-name]
    H --> I[Load Profile Data]
    I --> J[Launch Browser with Auth]
    J --> K[Navigate to Protected Content]
    K --> L[Extract Authenticated Data]
    L --> M[Return Results]

    style E4 fill:#fff3e0
    style E10 fill:#e3f2fd
    style F fill:#e8f5e8
    style M fill:#c8e6c9
```

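The first decision in this workflow (reuse a valid profile, otherwise create one) can be sketched as a single function. The validity check and the creation step are passed in as callables because the document does not specify how either is implemented; `ensure_profile` is a hypothetical helper, and in practice the creation step is the interactive `crwl profiles` session followed by a manual login.

```python
from typing import Callable, Iterable


def ensure_profile(
    name: str,
    existing: Iterable[str],
    is_valid: Callable[[str], bool],
    create: Callable[[str], None],
) -> str:
    """Return "reused" for a valid existing profile, else create it."""
    if name in set(existing) and is_valid(name):
        return "reused"
    # Missing or invalid: both branches of the diagram lead to creation.
    create(name)
    return "created"


if __name__ == "__main__":
    created = []
    print(ensure_profile("linkedin-auth", [], lambda n: True, created.append))
    print(ensure_profile("linkedin-auth", ["linkedin-auth"], lambda n: True, created.append))
```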
### CDP Browser Architecture

```mermaid
graph TB
    subgraph "CLI Layer"
        A[crwl cdp command] --> B[CDP Manager]
        B --> C[Port Configuration]
        B --> D[Profile Selection]
    end

    subgraph "Browser Process"
        E[Chromium/Firefox] --> F[Remote Debugging]
        F --> G[WebSocket Endpoint]
        G --> H[ws://localhost:9222]
    end

    subgraph "Client Connections"
        I[Manual Browser Control] --> H
        J[DevTools Interface] --> H
        K[External Automation] --> H
        L[Crawl4AI Crawler] --> H
    end

    subgraph "Profile Data"
        M[User Data Directory] --> E
        N[Cookies & Sessions] --> M
        O[Extensions] --> M
        P[Browser State] --> M
    end

    A --> E
    C --> H
    D --> M

    style H fill:#e3f2fd
    style E fill:#f3e5f5
    style M fill:#e8f5e8
```

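External clients usually discover the WebSocket endpoint shown above by asking the browser over HTTP. Chromium-based browsers started with remote debugging serve `/json/version` on the debugging port, and the JSON response includes a `webSocketDebuggerUrl` field. The sketch below assumes such a browser on port 9222; whether a browser launched by `crwl cdp` behaves identically, and how Firefox responds, are assumptions not confirmed by this document. The sample payload is made up for illustration.

```python
import json
from urllib.request import urlopen


def websocket_url(version_payload: str) -> str:
    """Extract the browser-level WebSocket URL from a /json/version body."""
    data = json.loads(version_payload)
    return data["webSocketDebuggerUrl"]


def discover(host: str = "localhost", port: int = 9222, timeout: float = 2.0) -> str:
    """Query a running browser for its endpoint (requires a live browser)."""
    with urlopen(f"http://{host}:{port}/json/version", timeout=timeout) as resp:
        return websocket_url(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Illustrative payload only; a real response carries more fields.
    sample = json.dumps(
        {"Browser": "Chrome/0.0", "webSocketDebuggerUrl": "ws://localhost:9222/devtools/browser/abc"}
    )
    print(websocket_url(sample))
```

`discover()` raises `urllib.error.URLError` when nothing is listening on the port, which doubles as a cheap "is the browser running" check.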
### Configuration Management Hierarchy

```mermaid
graph TD
    subgraph "Global Configuration"
        A[~/.crawl4ai/config.yml] --> B[Default Settings]
        B --> C[LLM Providers]
        B --> D[Browser Defaults]
        B --> E[Output Preferences]
    end

    subgraph "Profile Configuration"
        F[Profile Directory] --> G[Browser State]
        F --> H[Authentication Data]
        F --> I[Site-Specific Settings]
    end

    subgraph "Command-Line Overrides"
        J[-b browser_config] --> K[Runtime Browser Settings]
        L[-c crawler_config] --> M[Runtime Crawler Settings]
        N[-o output_format] --> O[Runtime Output Format]
    end

    subgraph "Configuration Files"
        P[browser.yml] --> Q[Browser Config Template]
        R[crawler.yml] --> S[Crawler Config Template]
        T[extract.yml] --> U[Extraction Config]
    end

    subgraph "Resolution Order"
        V[Command Line Args] --> W[Config Files]
        W --> X[Profile Settings]
        X --> Y[Global Defaults]
    end

    J --> V
    L --> V
    N --> V
    P --> W
    R --> W
    T --> W
    F --> X
    A --> Y

    style V fill:#ffcdd2
    style W fill:#fff3e0
    style X fill:#e3f2fd
    style Y fill:#e8f5e8
```

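The resolution order in the diagram (command-line arguments, then config files, then profile settings, then global defaults) maps naturally onto Python's `collections.ChainMap`, which looks keys up in the first mapping that contains them. The sketch below is a model of that ordering under stated assumptions, not the Crawl4AI loader: `resolve_settings` is hypothetical, and the setting keys in the demo (`output`, `headless`, `user_agent`) are invented for illustration.

```python
from collections import ChainMap
from typing import Any, Dict, Mapping


def resolve_settings(
    cli_args: Mapping[str, Any],
    config_files: Mapping[str, Any],
    profile_settings: Mapping[str, Any],
    global_defaults: Mapping[str, Any],
) -> Dict[str, Any]:
    """Merge four layers; earlier arguments win over later ones."""
    layers = (cli_args, config_files, profile_settings, global_defaults)
    # Drop None values so an unset CLI option does not mask a lower layer.
    cleaned = [{k: v for k, v in layer.items() if v is not None} for layer in layers]
    return dict(ChainMap(*cleaned))


if __name__ == "__main__":
    merged = resolve_settings(
        cli_args={"output": "json", "headless": None},
        config_files={"output": "markdown", "headless": True},
        profile_settings={"headless": False, "user_agent": "profile-ua"},
        global_defaults={"output": "markdown", "user_agent": "default-ua"},
    )
    print(merged)
```

Filtering out `None` is a design choice of this sketch: argument parsers commonly report omitted options as `None`, and without the filter those would shadow values from config files.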
### Identity-Based Crawling Decision Tree

```mermaid
flowchart TD
    A[Target Website Assessment] --> B{Authentication Required?}

    B -->|No| C[Standard Anonymous Crawl]
    B -->|Yes| D{Authentication Type?}

    D -->|Login Form| E[Create Login Profile]
    D -->|OAuth/SSO| F[Create OAuth Profile]
    D -->|API Key/Token| G[Use Headers/Config]
    D -->|Session Cookies| H[Import Cookie Profile]

    E --> E1[crwl profiles → Manual login]
    F --> F1[crwl profiles → OAuth flow]
    G --> G1[Configure headers in crawler config]
    H --> H1[Import cookies to profile]

    E1 --> I[Test Authentication]
    F1 --> I
    G1 --> I
    H1 --> I

    I --> J{Auth Test Success?}
    J -->|Yes| K[Production Crawl Setup]
    J -->|No| L[Debug Authentication]

    L --> L1{Common Issues?}
    L1 -->|Rate Limiting| L2[Add delays/user simulation]
    L1 -->|Bot Detection| L3[Enable stealth mode]
    L1 -->|Session Expired| L4[Refresh authentication]
    L1 -->|CAPTCHA| L5[Manual intervention needed]

    L2 --> M[Retry with Adjustments]
    L3 --> M
    L4 --> E1
    L5 --> N[Semi-automated approach]

    M --> I
    N --> O[Manual auth + automated crawl]

    K --> P[Automated Authenticated Crawling]
    O --> P
    C --> P

    P --> Q[Monitor & Maintain Profiles]

    style I fill:#fff3e0
    style K fill:#e8f5e8
    style P fill:#c8e6c9
    style L fill:#ffcdd2
    style N fill:#f3e5f5
```

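The two lookups in the decision tree (authentication type to setup step, and failure cause to remedy) can be captured as plain tables. This is a sketch of the diagram's logic only; the dictionary keys such as `"login-form"` and the function names are hypothetical labels chosen for illustration, and the values restate the diagram's node text.

```python
from typing import Dict, Optional

# Authentication type -> setup step, from the upper half of the diagram.
AUTH_SETUP: Dict[str, str] = {
    "login-form": "crwl profiles -> manual login",
    "oauth": "crwl profiles -> OAuth flow",
    "api-key": "configure headers in crawler config",
    "session-cookies": "import cookies to profile",
}

# Failure cause -> remedy, from the "Debug Authentication" branch.
DEBUG_ACTIONS: Dict[str, str] = {
    "rate-limiting": "add delays/user simulation",
    "bot-detection": "enable stealth mode",
    "session-expired": "refresh authentication",
    "captcha": "manual intervention needed",
}


def plan_crawl(auth_required: bool, auth_type: Optional[str] = None) -> str:
    """Pick the setup step for a target site."""
    if not auth_required:
        return "standard anonymous crawl"
    if auth_type not in AUTH_SETUP:
        raise ValueError(f"unknown authentication type: {auth_type!r}")
    return AUTH_SETUP[auth_type]


def debug_action(issue: str) -> str:
    """Pick the remedy for a failed authentication test."""
    if issue not in DEBUG_ACTIONS:
        raise ValueError(f"unknown issue: {issue!r}")
    return DEBUG_ACTIONS[issue]


if __name__ == "__main__":
    print(plan_crawl(True, "oauth"))
    print(debug_action("bot-detection"))
```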
### CLI Usage Patterns and Best Practices

```mermaid
timeline
    title CLI Workflow Evolution

    section Setup Phase
        Installation : pip install crawl4ai
                     : crawl4ai-setup
        Basic Test : crwl https://example.com
        Config Setup : crwl config set defaults

    section Profile Creation
        Site Analysis : Identify auth requirements
        Profile Creation : crwl profiles
        Manual Login : Authenticate in browser
        Profile Save : Press 'q' to save

    section Development Phase
        Test Crawls : crwl URL -p profile -v
        Config Tuning : Adjust browser/crawler settings
        Output Testing : Try different output formats
        Error Handling : Debug authentication issues

    section Production Phase
        Automated Crawls : crwl URL -p profile -o json
        Batch Processing : Multiple URLs with same profile
        Monitoring : Check profile validity
        Maintenance : Update profiles as needed
```

### Multi-Profile Management Strategy

```mermaid
graph LR
    subgraph "Profile Categories"
        A[Social Media Profiles]
        B[Work/Enterprise Profiles]
        C[E-commerce Profiles]
        D[Research Profiles]
    end

    subgraph "Social Media"
        A --> A1[linkedin-personal]
        A --> A2[twitter-monitor]
        A --> A3[facebook-research]
        A --> A4[instagram-brand]
    end

    subgraph "Enterprise"
        B --> B1[company-intranet]
        B --> B2[github-enterprise]
        B --> B3[confluence-docs]
        B --> B4[jira-tickets]
    end

    subgraph "E-commerce"
        C --> C1[amazon-seller]
        C --> C2[shopify-admin]
        C --> C3[ebay-monitor]
        C --> C4[marketplace-competitor]
    end

    subgraph "Research"
        D --> D1[academic-journals]
        D --> D2[data-platforms]
        D --> D3[survey-tools]
        D --> D4[government-portals]
    end

    subgraph "Usage Patterns"
        E[Daily Monitoring] --> A2
        E --> B1
        F[Weekly Reports] --> C3
        F --> D2
        G[On-Demand Research] --> D1
        G --> D4
        H[Competitive Analysis] --> C4
        H --> A4
    end

    style A1 fill:#e3f2fd
    style B1 fill:#f3e5f5
    style C1 fill:#e8f5e8
    style D1 fill:#fff3e0
```

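The "Usage Patterns" edges above amount to a mapping from a schedule to the profiles it relies on. Keeping that mapping as data makes it simple to see which schedules break when a profile expires and which profiles are not scheduled at all. The sketch below copies the profile names from the diagram; `USAGE_PATTERNS`, `patterns_using`, and `unscheduled` are hypothetical names.

```python
from typing import Dict, List

# Usage pattern -> profiles it depends on, copied from the diagram.
USAGE_PATTERNS: Dict[str, List[str]] = {
    "Daily Monitoring": ["twitter-monitor", "company-intranet"],
    "Weekly Reports": ["ebay-monitor", "data-platforms"],
    "On-Demand Research": ["academic-journals", "government-portals"],
    "Competitive Analysis": ["marketplace-competitor", "instagram-brand"],
}


def patterns_using(profile: str) -> List[str]:
    """List the usage patterns that depend on a given profile."""
    return [name for name, profiles in USAGE_PATTERNS.items() if profile in profiles]


def unscheduled(profiles: List[str]) -> List[str]:
    """List profiles that no usage pattern refers to."""
    scheduled = {p for names in USAGE_PATTERNS.values() for p in names}
    return [p for p in profiles if p not in scheduled]


if __name__ == "__main__":
    print(patterns_using("twitter-monitor"))
    print(unscheduled(["linkedin-personal", "ebay-monitor", "jira-tickets"]))
```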
**📖 Learn more:** [CLI Reference](https://docs.crawl4ai.com/core/cli/), [Identity-Based Crawling](https://docs.crawl4ai.com/advanced/identity-based-crawling/), [Profile Management](https://docs.crawl4ai.com/advanced/session-management/), [Authentication Strategies](https://docs.crawl4ai.com/advanced/hooks-auth/)