## CLI & Identity-Based Browsing
Command-line interface for web crawling with persistent browser profiles, authentication, and identity management.

### Basic CLI Usage

```bash
# Simple crawling
crwl https://example.com

# Get markdown output
crwl https://example.com -o markdown

# JSON output with cache bypass
crwl https://example.com -o json --bypass-cache

# Verbose mode with specific browser settings
crwl https://example.com -b "headless=false,viewport_width=1280" -v
```
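For reference, the `--bypass-cache` call above has a direct Python equivalent. This is a minimal sketch using Crawl4AI's `AsyncWebCrawler`, where `cache_mode=CacheMode.BYPASS` plays the role of the CLI flag:

```python
# Minimal Python sketch of: crwl https://example.com -o markdown --bypass-cache
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    async with AsyncWebCrawler() as crawler:
        # CacheMode.BYPASS mirrors the CLI's --bypass-cache flag
        result = await crawler.arun(
            "https://example.com",
            config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS),
        )
        print(result.markdown)

asyncio.run(main())
```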
### Profile Management Commands

```bash
# Launch interactive profile manager
crwl profiles

# Create, list, and manage browser profiles
# This opens a menu where you can:
# 1. List existing profiles
# 2. Create new profile (opens browser for setup)
# 3. Delete profiles
# 4. Use profile to crawl a website

# Use a specific profile for crawling
crwl https://example.com -p my-profile-name

# Example workflow for authenticated sites:
# 1. Create profile and log in
crwl profiles # Select "Create new profile"
# 2. Use profile for crawling authenticated content
crwl https://site-requiring-login.com/dashboard -p my-profile-name
```
### CDP Browser Management

```bash
# Launch browser with CDP debugging (default port 9222)
crwl cdp

# Use specific profile and custom port
crwl cdp -p my-profile -P 9223

# Launch headless browser with CDP
crwl cdp --headless

# Launch in incognito mode (ignores profile)
crwl cdp --incognito

# Use custom user data directory
crwl cdp --user-data-dir ~/my-browser-data --port 9224
```
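Because `crwl cdp` exposes a DevTools endpoint, other tools — including Crawl4AI itself — can attach to the same browser. Below is a sketch of attaching from Python via `BrowserConfig(cdp_url=...)`, assuming the default port 9222 used by the first command above:

```python
# Sketch: attach to a browser started with `crwl cdp` (default port 9222)
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def main():
    # cdp_url points Crawl4AI at the already-running browser's DevTools endpoint
    browser_config = BrowserConfig(cdp_url="http://localhost:9222")
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:300])

asyncio.run(main())
```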
### Builtin Browser Management

```bash
# Start persistent browser instance
crwl browser start

# Check browser status
crwl browser status

# Open visible window to see the browser
crwl browser view --url https://example.com

# Stop the browser
crwl browser stop

# Restart with different options
crwl browser restart --browser-type chromium --port 9223 --no-headless

# Use builtin browser in crawling
crwl https://example.com -b "browser_mode=builtin"
```
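The last command selects the persistent instance through the `browser_mode` setting passed via `-b`. A hedged Python sketch of the same idea, assuming `BrowserConfig` accepts `browser_mode="builtin"` just as the CLI flag suggests:

```python
# Sketch: reuse the persistent builtin browser from Python
# (assumes `crwl browser start` is already running, and that BrowserConfig
# accepts browser_mode="builtin" as the -b flag above implies)
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def main():
    browser_config = BrowserConfig(browser_mode="builtin")
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:300])

asyncio.run(main())
```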
### Authentication Workflow Examples

```bash
# Complete workflow for LinkedIn scraping
# 1. Create authenticated profile
crwl profiles
# Select "Create new profile" → login to LinkedIn in browser → press 'q' to save

# 2. Use profile for crawling
crwl https://linkedin.com/in/someone -p linkedin-profile -o markdown

# 3. Extract structured data with authentication
crwl https://linkedin.com/search/results/people/ \
  -p linkedin-profile \
  -j "Extract people profiles with names, titles, and companies" \
  -b "headless=false"

# GitHub authenticated crawling
crwl profiles # Create github-profile
crwl https://github.com/settings/profile -p github-profile

# Twitter/X authenticated access
crwl profiles # Create twitter-profile
crwl https://twitter.com/home -p twitter-profile -o markdown
```
### Advanced CLI Configuration

```bash
# Complex crawling with multiple configs
crwl https://example.com \
  -B browser.yml \
  -C crawler.yml \
  -e extract_llm.yml \
  -s llm_schema.json \
  -p my-auth-profile \
  -o json \
  -v

# Quick LLM extraction with authentication
crwl https://private-site.com/dashboard \
  -p auth-profile \
  -j "Extract user dashboard data including metrics and notifications" \
  -b "headless=true,viewport_width=1920"

# Content filtering with authentication
crwl https://members-only-site.com \
  -p member-profile \
  -f filter_bm25.yml \
  -c "css_selector=.member-content,scan_full_page=true" \
  -o markdown-fit
```
### Configuration Files for Identity Browsing

```yaml
# browser_auth.yml
headless: false
use_managed_browser: true
user_data_dir: "/path/to/profile"
viewport_width: 1280
viewport_height: 720

# crawler_auth.yml
magic: true
remove_overlay_elements: true
simulate_user: true
override_navigator: true
wait_for: "css:.authenticated-content"
page_timeout: 60000
delay_before_return_html: 2
scan_full_page: true
```
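These files plug into the `-B` and `-C` flags shown in the previous section, for example:

```bash
# Apply the identity-browsing configs above to an authenticated crawl
crwl https://members-only-site.com \
  -B browser_auth.yml \
  -C crawler_auth.yml \
  -p my-auth-profile \
  -o markdown
```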
### Global Configuration Management

```bash
# List all configuration settings
crwl config list

# Set default LLM provider
crwl config set DEFAULT_LLM_PROVIDER "anthropic/claude-3-sonnet"
crwl config set DEFAULT_LLM_PROVIDER_TOKEN "your-api-token"

# Set browser defaults
crwl config set BROWSER_HEADLESS false # Always show browser
crwl config set USER_AGENT_MODE random # Random user agents

# Enable verbose mode globally
crwl config set VERBOSE true
```
### Q&A with Authenticated Content

```bash
# Ask questions about authenticated content
crwl https://private-dashboard.com -p dashboard-profile \
  -q "What are the key metrics shown in my dashboard?"

# Multiple questions workflow
crwl https://company-intranet.com -p work-profile -o markdown # View content
crwl https://company-intranet.com -p work-profile \
  -q "Summarize this week's announcements"
crwl https://company-intranet.com -p work-profile \
  -q "What are the upcoming deadlines?"
```
### Profile Creation Programmatically

```python
# Create profiles via Python API
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, BrowserProfiler

async def create_auth_profile():
    profiler = BrowserProfiler()

    # Create profile interactively (opens browser)
    profile_path = await profiler.create_profile("linkedin-auth")
    print(f"Profile created at: {profile_path}")

    # List all profiles
    profiles = profiler.list_profiles()
    for profile in profiles:
        print(f"Profile: {profile['name']} at {profile['path']}")

    # Use profile for crawling
    browser_config = BrowserConfig(
        headless=True,
        use_managed_browser=True,
        user_data_dir=profile_path
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://linkedin.com/feed")
        return result

# asyncio.run(create_auth_profile())
```
### Identity Browsing Best Practices

```bash
# 1. Create specific profiles for different sites
crwl profiles # Create "linkedin-work"
crwl profiles # Create "github-personal"
crwl profiles # Create "company-intranet"

# 2. Use descriptive profile names
crwl https://site1.com -p site1-admin-account
crwl https://site2.com -p site2-user-account

# 3. Combine with appropriate crawler settings
crwl https://secure-site.com \
  -p secure-profile \
  -b "headless=false" \
  -c "simulate_user=true,magic=true,wait_for=.logged-in-indicator,page_timeout=30000"

# 4. Test profile before automated crawling
crwl cdp -p test-profile # Manually verify login status
crwl https://test-url.com -p test-profile -v # Verbose test crawl
```
### Troubleshooting Authentication Issues

```bash
# Debug authentication problems
crwl https://auth-site.com -p auth-profile \
  -b "headless=false,verbose=true" \
  -c "verbose=true,page_timeout=60000" \
  -v

# Check profile status
crwl profiles # List profiles and check creation dates

# Recreate problematic profiles
crwl profiles # Delete old profile, create new one

# Test with visible browser
crwl https://problem-site.com -p profile-name \
  -b "headless=false" \
  -c "delay_before_return_html=5"
```
### Common Use Cases

```bash
# Social media monitoring (after authentication)
crwl https://twitter.com/home -p twitter-monitor \
  -j "Extract latest tweets with sentiment and engagement metrics"

# E-commerce competitor analysis (with account access)
crwl https://competitor-site.com/products -p competitor-account \
  -j "Extract product prices, availability, and descriptions"

# Company dashboard monitoring
crwl https://company-dashboard.com -p work-profile \
  -c "css_selector=.dashboard-content" \
  -q "What alerts or notifications need attention?"

# Research data collection (authenticated access)
crwl https://research-platform.com/data -p research-profile \
  -e extract_research.yml \
  -s research_schema.json \
  -o json
```
**📖 Learn more:** [Identity-Based Crawling Documentation](https://docs.crawl4ai.com/advanced/identity-based-crawling/), [Browser Profile Management](https://docs.crawl4ai.com/advanced/session-management/), [CLI Examples](https://docs.crawl4ai.com/core/cli/)