## CLI & Identity-Based Browsing
Command-line interface for web crawling with persistent browser profiles, authentication, and identity management.
### Basic CLI Usage
```bash
# Simple crawling
crwl https://example.com
# Get markdown output
crwl https://example.com -o markdown
# JSON output with cache bypass
crwl https://example.com -o json --bypass-cache
# Verbose mode with specific browser settings
crwl https://example.com -b "headless=false,viewport_width=1280" -v
```
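The `-b` and `-c` flags shown above take comma-separated `key=value` strings. The sketch below is an illustrative parser (not the one shipped with the `crwl` CLI) showing how such a string maps onto typed options:

```python
def parse_kv_options(spec: str) -> dict:
    """Parse a comma-separated key=value string like the CLI's -b/-c flags.

    Coerces "true"/"false" to bool and digit strings to int; everything
    else stays a string. Illustrative only, not Crawl4AI's own parser.
    """
    options = {}
    for pair in spec.split(","):
        key, _, value = pair.partition("=")
        key, value = key.strip(), value.strip()
        if value.lower() in ("true", "false"):
            options[key] = value.lower() == "true"
        elif value.isdigit():
            options[key] = int(value)
        else:
            options[key] = value
    return options

print(parse_kv_options("headless=false,viewport_width=1280"))
# → {'headless': False, 'viewport_width': 1280}
```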
### Profile Management Commands
```bash
# Launch interactive profile manager
crwl profiles
# Create, list, and manage browser profiles
# This opens a menu where you can:
# 1. List existing profiles
# 2. Create new profile (opens browser for setup)
# 3. Delete profiles
# 4. Use profile to crawl a website
# Use a specific profile for crawling
crwl https://example.com -p my-profile-name
# Example workflow for authenticated sites:
# 1. Create profile and log in
crwl profiles # Select "Create new profile"
# 2. Use profile for crawling authenticated content
crwl https://site-requiring-login.com/dashboard -p my-profile-name
```
### CDP Browser Management
```bash
# Launch browser with CDP debugging (default port 9222)
crwl cdp
# Use specific profile and custom port
crwl cdp -p my-profile -P 9223
# Launch headless browser with CDP
crwl cdp --headless
# Launch in incognito mode (ignores profile)
crwl cdp --incognito
# Use custom user data directory
crwl cdp --user-data-dir ~/my-browser-data --port 9224
```
### Builtin Browser Management
```bash
# Start persistent browser instance
crwl browser start
# Check browser status
crwl browser status
# Open visible window to see the browser
crwl browser view --url https://example.com
# Stop the browser
crwl browser stop
# Restart with different options
crwl browser restart --browser-type chromium --port 9223 --no-headless
# Use builtin browser in crawling
crwl https://example.com -b "browser_mode=builtin"
```
### Authentication Workflow Examples
```bash
# Complete workflow for LinkedIn scraping
# 1. Create authenticated profile
crwl profiles
# Select "Create new profile" → login to LinkedIn in browser → press 'q' to save
# 2. Use profile for crawling
crwl https://linkedin.com/in/someone -p linkedin-profile -o markdown
# 3. Extract structured data with authentication
crwl https://linkedin.com/search/results/people/ \
  -p linkedin-profile \
  -j "Extract people profiles with names, titles, and companies" \
  -b "headless=false"
# GitHub authenticated crawling
crwl profiles # Create github-profile
crwl https://github.com/settings/profile -p github-profile
# Twitter/X authenticated access
crwl profiles # Create twitter-profile
crwl https://twitter.com/home -p twitter-profile -o markdown
```
### Advanced CLI Configuration
```bash
# Complex crawling with multiple configs
crwl https://example.com \
  -B browser.yml \
  -C crawler.yml \
  -e extract_llm.yml \
  -s llm_schema.json \
  -p my-auth-profile \
  -o json \
  -v
# Quick LLM extraction with authentication
crwl https://private-site.com/dashboard \
  -p auth-profile \
  -j "Extract user dashboard data including metrics and notifications" \
  -b "headless=true,viewport_width=1920"
# Content filtering with authentication
crwl https://members-only-site.com \
  -p member-profile \
  -f filter_bm25.yml \
  -c "css_selector=.member-content,scan_full_page=true" \
  -o markdown-fit
```
### Configuration Files for Identity Browsing
```yaml
# browser_auth.yml
headless: false
use_managed_browser: true
user_data_dir: "/path/to/profile"
viewport_width: 1280
viewport_height: 720
simulate_user: true
override_navigator: true
# crawler_auth.yml
magic: true
remove_overlay_elements: true
simulate_user: true
wait_for: "css:.authenticated-content"
page_timeout: 60000
delay_before_return_html: 2
scan_full_page: true
```
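The YAML files above carry the same options as the inline `-b`/`-c` strings used elsewhere in this section. As a rough sketch (a hypothetical helper, not part of Crawl4AI), a config dict can be flattened into the inline form:

```python
def to_inline_options(config: dict) -> str:
    """Flatten a config dict into the CLI's comma-separated key=value form.

    Booleans are lowercased to match the inline style (e.g. headless=false).
    Illustrative helper, not shipped with Crawl4AI.
    """
    parts = []
    for key, value in config.items():
        if isinstance(value, bool):
            value = str(value).lower()
        parts.append(f"{key}={value}")
    return ",".join(parts)

# Subset of browser_auth.yml expressed inline:
browser_auth = {"headless": False, "use_managed_browser": True, "viewport_width": 1280}
print(to_inline_options(browser_auth))
# → headless=false,use_managed_browser=true,viewport_width=1280
```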
### Global Configuration Management
```bash
# List all configuration settings
crwl config list
# Set default LLM provider
crwl config set DEFAULT_LLM_PROVIDER "anthropic/claude-3-sonnet"
crwl config set DEFAULT_LLM_PROVIDER_TOKEN "your-api-token"
# Set browser defaults
crwl config set BROWSER_HEADLESS false # Always show browser
crwl config set USER_AGENT_MODE random # Random user agents
# Enable verbose mode globally
crwl config set VERBOSE true
```
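One plausible way such key-value settings persist is a small JSON file in the user's config directory. The sketch below is hypothetical (Crawl4AI's actual storage location and format may differ):

```python
import json
import tempfile
from pathlib import Path

def config_set(path: Path, key: str, value) -> None:
    """Write one setting into a JSON config file, creating it if absent."""
    settings = json.loads(path.read_text()) if path.exists() else {}
    settings[key] = value
    path.write_text(json.dumps(settings, indent=2))

def config_get(path: Path, key: str, default=None):
    """Read one setting back, returning `default` when unset."""
    if not path.exists():
        return default
    return json.loads(path.read_text()).get(key, default)

# Demo against a throwaway file; the real CLI's storage path and format
# are not documented here, so this is illustrative only.
cfg = Path(tempfile.mkdtemp()) / "config.json"
config_set(cfg, "DEFAULT_LLM_PROVIDER", "anthropic/claude-3-sonnet")
config_set(cfg, "VERBOSE", True)
print(config_get(cfg, "VERBOSE"))  # → True
```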
### Q&A with Authenticated Content
```bash
# Ask questions about authenticated content
crwl https://private-dashboard.com -p dashboard-profile \
  -q "What are the key metrics shown in my dashboard?"
# Multiple questions workflow
crwl https://company-intranet.com -p work-profile -o markdown # View content
crwl https://company-intranet.com -p work-profile \
  -q "Summarize this week's announcements"
crwl https://company-intranet.com -p work-profile \
  -q "What are the upcoming deadlines?"
```
### Creating Profiles Programmatically
```python
# Create profiles via the Python API
import asyncio
from crawl4ai import BrowserProfiler

async def create_auth_profile():
    profiler = BrowserProfiler()

    # Create profile interactively (opens browser)
    profile_path = await profiler.create_profile("linkedin-auth")
    print(f"Profile created at: {profile_path}")

    # List all profiles
    profiles = profiler.list_profiles()
    for profile in profiles:
        print(f"Profile: {profile['name']} at {profile['path']}")

    # Use profile for crawling
    from crawl4ai import AsyncWebCrawler, BrowserConfig

    browser_config = BrowserConfig(
        headless=True,
        use_managed_browser=True,
        user_data_dir=profile_path,
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://linkedin.com/feed")
        return result

# asyncio.run(create_auth_profile())
```
### Identity Browsing Best Practices
```bash
# 1. Create specific profiles for different sites
crwl profiles # Create "linkedin-work"
crwl profiles # Create "github-personal"
crwl profiles # Create "company-intranet"
# 2. Use descriptive profile names
crwl https://site1.com -p site1-admin-account
crwl https://site2.com -p site2-user-account
# 3. Combine with appropriate browser settings
crwl https://secure-site.com \
  -p secure-profile \
  -b "headless=false,simulate_user=true,magic=true" \
  -c "wait_for=.logged-in-indicator,page_timeout=30000"
# 4. Test profile before automated crawling
crwl cdp -p test-profile # Manually verify login status
crwl https://test-url.com -p test-profile -v # Verbose test crawl
```
### Troubleshooting Authentication Issues
```bash
# Debug authentication problems
crwl https://auth-site.com -p auth-profile \
  -b "headless=false,verbose=true" \
  -c "verbose=true,page_timeout=60000" \
  -v
# Check profile status
crwl profiles # List profiles and check creation dates
# Recreate problematic profiles
crwl profiles # Delete old profile, create new one
# Test with visible browser
crwl https://problem-site.com -p profile-name \
  -b "headless=false" \
  -c "delay_before_return_html=5"
```
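Flaky authenticated crawls can also be worked around by simply retrying the command. The wrapper below is an illustrative helper (not part of the `crwl` CLI); the URL and profile name in the commented usage are placeholders:

```python
import subprocess
import time

def retry(attempts: int, *cmd: str, delay: float = 2.0) -> bool:
    """Run a command until it exits 0, retrying up to `attempts` times.

    Returns True on success, False once all attempts are exhausted.
    """
    for n in range(1, attempts + 1):
        if subprocess.run(cmd).returncode == 0:
            return True
        if n < attempts:
            time.sleep(delay)
    return False

# Hypothetical usage; URL and profile name are placeholders:
# retry(3, "crwl", "https://auth-site.com", "-p", "auth-profile", "-o", "markdown")
```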
### Common Use Cases
```bash
# Social media monitoring (after authentication)
crwl https://twitter.com/home -p twitter-monitor \
  -j "Extract latest tweets with sentiment and engagement metrics"
# E-commerce competitor analysis (with account access)
crwl https://competitor-site.com/products -p competitor-account \
  -j "Extract product prices, availability, and descriptions"
# Company dashboard monitoring
crwl https://company-dashboard.com -p work-profile \
  -c "css_selector=.dashboard-content" \
  -q "What alerts or notifications need attention?"
# Research data collection (authenticated access)
crwl https://research-platform.com/data -p research-profile \
  -e extract_research.yml \
  -s research_schema.json \
  -o json
```
**📖 Learn more:** [Identity-Based Crawling Documentation](https://docs.crawl4ai.com/advanced/identity-based-crawling/), [Browser Profile Management](https://docs.crawl4ai.com/advanced/session-management/), [CLI Examples](https://docs.crawl4ai.com/core/cli/)