feat(cli): add command line interface with comprehensive features
Implements a full-featured CLI for Crawl4AI with the following capabilities:

- Basic and advanced web crawling
- Configuration management via YAML/JSON files
- Multiple extraction strategies (CSS, XPath, LLM)
- Content filtering and optimization
- Interactive Q&A capabilities
- Various output formats
- Comprehensive documentation and examples

Also includes:

- Home directory setup for configuration and cache
- Environment variable support for API tokens
- Test suite for CLI functionality

`docs/examples/cli/browser.yml` (new file, 13 lines):

```yaml
browser_type: "chromium"
headless: true
viewport_width: 1280
viewport_height: 800
user_agent_mode: "random"
verbose: true
text_mode: false
light_mode: false
ignore_https_errors: true
java_script_enabled: true
extra_args:
  - "--disable-gpu"
  - "--no-sandbox"
```
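
Per the CLI guide added in this commit, this file is passed with the `-B` flag (the URL here is illustrative):

```bash
# Crawl with the example browser settings
crwl https://example.com -B docs/examples/cli/browser.yml
```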

`docs/examples/cli/crawler.yml` (new file, 13 lines):

```yaml
cache_mode: "bypass"
wait_until: "networkidle"
page_timeout: 30000
delay_before_return_html: 0.5
word_count_threshold: 100
scan_full_page: true
scroll_delay: 0.3
process_iframes: false
remove_overlay_elements: true
magic: true
verbose: true
exclude_external_links: true
exclude_social_media_links: true
```
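
Likewise, crawler settings are applied with the `-C` flag documented in the CLI guide below:

```bash
# Crawl with the example crawler settings
crwl https://example.com -C docs/examples/cli/crawler.yml
```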

`docs/examples/cli/css_schema.json` (new file, 27 lines):

```json
{
  "name": "ArticleExtractor",
  "baseSelector": ".cards[data-tax=news] .card__data",
  "fields": [
    {
      "name": "title",
      "selector": "h4.card__title",
      "type": "text"
    },
    {
      "name": "link",
      "selector": "h4.card__title a",
      "type": "attribute",
      "attribute": "href"
    },
    {
      "name": "details",
      "selector": ".card__details",
      "type": "text"
    },
    {
      "name": "topics",
      "selector": ".card__topics.topics",
      "type": "text"
    }
  ]
}
```
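
The selectors here target the news-card markup on infoq.com; paired with `extract_css.yml` below, this schema drives the JSON-CSS extraction shown in the CLI guide:

```bash
# Extract InfoQ article cards as structured JSON
crwl "https://www.infoq.com/ai-ml-data-eng/" \
    -e docs/examples/cli/extract_css.yml \
    -s docs/examples/cli/css_schema.json \
    -o json
```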

`docs/examples/cli/extract.yml` (new file, 11 lines):

```yaml
type: "llm"
provider: "openai/gpt-4o-mini"
api_token: "env:OPENAI_API_KEY"
instruction: "Extract all articles with their titles, authors, publication dates and main topics in a structured format"
params:
  chunk_token_threshold: 4096
  overlap_rate: 0.1
  word_token_rate: 0.75
  temperature: 0.3
  max_tokens: 1000
  verbose: true
```
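
The commit notes environment variable support for API tokens; assuming the `env:` prefix tells the CLI to read the token from the named shell variable, set it before crawling:

```bash
# Assumption: "env:OPENAI_API_KEY" resolves the token from this variable
export OPENAI_API_KEY="sk-..."  # placeholder value
```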

`docs/examples/cli/extract_css.yml` (new file, 3 lines):

```yaml
type: "json-css"
params:
  verbose: true
```

`docs/examples/cli/llm_schema.json` (new file, 26 lines):

```json
{
  "title": "NewsArticle",
  "type": "object",
  "properties": {
    "title": {
      "type": "string",
      "description": "The title/headline of the news article"
    },
    "link": {
      "type": "string",
      "description": "The URL or link to the full article"
    },
    "details": {
      "type": "string",
      "description": "Brief summary or details about the article content"
    },
    "topics": {
      "type": "array",
      "items": {
        "type": "string"
      },
      "description": "List of topics or categories associated with the article"
    }
  },
  "required": ["title", "details"]
}
```
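
Assuming the same `-e`/`-s` pairing as the CSS example, this schema constrains the LLM-based extraction configured in `extract.yml` above:

```bash
# LLM extraction constrained by the NewsArticle schema
crwl "https://www.infoq.com/ai-ml-data-eng/" \
    -e docs/examples/cli/extract.yml \
    -s docs/examples/cli/llm_schema.json \
    -o json
```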

`docs/md_v2/core/cli.md` (new file, 304 lines):

# Crawl4AI CLI Guide

## Table of Contents
- [Basic Usage](#basic-usage)
- [Quick Example of Advanced Usage](#quick-example-of-advanced-usage)
- [Configuration](#configuration)
  - [Browser Configuration](#browser-configuration)
  - [Crawler Configuration](#crawler-configuration)
  - [Extraction Configuration](#extraction-configuration)
- [Advanced Features](#advanced-features)
  - [LLM Q&A](#llm-qa)
  - [Structured Data Extraction](#structured-data-extraction)
  - [Content Filtering](#content-filtering)
- [Output Formats](#output-formats)
- [Complete Examples](#complete-examples)
- [Best Practices & Tips](#best-practices--tips)
- [Recap](#recap)

## Basic Usage

The Crawl4AI CLI (`crwl`) provides a simple interface to the Crawl4AI library:

```bash
# Basic crawling
crwl https://example.com

# Get markdown output
crwl https://example.com -o markdown

# Verbose JSON output with cache bypass
crwl https://example.com -o json -v --bypass-cache

# See usage examples
crwl --example
```

## Quick Example of Advanced Usage

If you clone the repository, the following command extracts the page content as JSON according to a JSON-CSS schema:

```bash
crwl "https://www.infoq.com/ai-ml-data-eng/" -e docs/examples/cli/extract_css.yml -s docs/examples/cli/css_schema.json -o json
```

## Configuration

### Browser Configuration

Browser settings can be configured via a YAML file or command-line parameters:

```yaml
# browser.yml
headless: true
viewport_width: 1280
user_agent_mode: "random"
verbose: true
ignore_https_errors: true
```

```bash
# Using config file
crwl https://example.com -B browser.yml

# Using direct parameters
crwl https://example.com -b "headless=true,viewport_width=1280,user_agent_mode=random"
```

### Crawler Configuration

Control crawling behavior:

```yaml
# crawler.yml
cache_mode: "bypass"
wait_until: "networkidle"
page_timeout: 30000
delay_before_return_html: 0.5
word_count_threshold: 100
scan_full_page: true
scroll_delay: 0.3
process_iframes: false
remove_overlay_elements: true
magic: true
verbose: true
```

```bash
# Using config file
crwl https://example.com -C crawler.yml

# Using direct parameters
crwl https://example.com -c "css_selector=#main,delay_before_return_html=2,scan_full_page=true"
```

### Extraction Configuration

Two types of extraction are supported:

1. CSS/XPath-based extraction:
```yaml
# extract_css.yml
type: "json-css"
params:
  verbose: true
```

```json
// css_schema.json
{
  "name": "ArticleExtractor",
  "baseSelector": ".article",
  "fields": [
    {
      "name": "title",
      "selector": "h1.title",
      "type": "text"
    },
    {
      "name": "link",
      "selector": "a.read-more",
      "type": "attribute",
      "attribute": "href"
    }
  ]
}
```

2. LLM-based extraction:
```yaml
# extract_llm.yml
type: "llm"
provider: "openai/gpt-4"
instruction: "Extract all articles with their titles and links"
api_token: "your-token"
params:
  temperature: 0.3
  max_tokens: 1000
```

```json
// llm_schema.json
{
  "title": "Article",
  "type": "object",
  "properties": {
    "title": {
      "type": "string",
      "description": "The title of the article"
    },
    "link": {
      "type": "string",
      "description": "URL to the full article"
    }
  }
}
```

## Advanced Features

### LLM Q&A

Ask questions about crawled content:

```bash
# Simple question
crwl https://example.com -q "What is the main topic discussed?"

# View content then ask questions
crwl https://example.com -o markdown  # See content first
crwl https://example.com -q "Summarize the key points"
crwl https://example.com -q "What are the conclusions?"

# Combined with advanced crawling
crwl https://example.com \
    -B browser.yml \
    -c "css_selector=article,scan_full_page=true" \
    -q "What are the pros and cons mentioned?"
```

First-time setup:
- Prompts for LLM provider and API token
- Saves the configuration in `~/.crawl4ai/global.yml`
- Supports various providers (openai/gpt-4, anthropic/claude-3-sonnet, etc.)
- For `ollama`, no API token is required
- See [LiteLLM Providers](https://docs.litellm.ai/docs/providers) for the full list
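
For illustration, the saved file might look like the sketch below — the key names are hypothetical, and the actual layout written by the setup prompt may differ:

```yaml
# ~/.crawl4ai/global.yml — hypothetical sketch; real keys may differ
DEFAULT_LLM_PROVIDER: "openai/gpt-4"
DEFAULT_LLM_PROVIDER_TOKEN: "sk-..."  # placeholder token
```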

### Structured Data Extraction

Extract structured data using CSS selectors:

```bash
crwl https://example.com \
    -e extract_css.yml \
    -s css_schema.json \
    -o json
```

Or using LLM-based extraction:

```bash
crwl https://example.com \
    -e extract_llm.yml \
    -s llm_schema.json \
    -o json
```

### Content Filtering

Filter content for relevance:

```yaml
# filter_bm25.yml
type: "bm25"
query: "target content"
threshold: 1.0
```

```yaml
# filter_pruning.yml
type: "pruning"
query: "focus topic"
threshold: 0.48
```

```bash
crwl https://example.com -f filter_bm25.yml -o markdown-fit
```
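
By analogy with the BM25 example above, the pruning filter would be applied the same way:

```bash
crwl https://example.com -f filter_pruning.yml -o markdown-fit
```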

## Output Formats

- `all` - Full crawl result including metadata
- `json` - Extracted structured data (when using extraction)
- `markdown` / `md` - Raw markdown output
- `markdown-fit` / `md-fit` - Filtered markdown for better readability
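
Each format is selected with the `-o` flag shown throughout this guide; a quick comparison against the same page:

```bash
# Full crawl result with metadata
crwl https://example.com -o all

# Raw markdown vs. filtered markdown (a filter config is needed for md-fit)
crwl https://example.com -o markdown
crwl https://example.com -f filter_bm25.yml -o md-fit
```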

## Complete Examples

1. Basic Crawl with Configuration Files:
```bash
crwl https://example.com \
    -B browser.yml \
    -C crawler.yml \
    -o json
```

2. Structured Data Extraction:
```bash
crwl https://example.com \
    -e extract_css.yml \
    -s css_schema.json \
    -o json \
    -v
```

3. LLM Extraction with Filtering:
```bash
crwl https://example.com \
    -B browser.yml \
    -e extract_llm.yml \
    -s llm_schema.json \
    -f filter_bm25.yml \
    -o json
```

4. Interactive Q&A:
```bash
# First crawl and view
crwl https://example.com -o markdown

# Then ask questions
crwl https://example.com -q "What are the main points?"
crwl https://example.com -q "Summarize the conclusions"
```

## Best Practices & Tips

1. **Configuration Management**:
   - Keep common configurations in YAML files
   - Use CLI parameters for quick overrides
   - Store sensitive data (API tokens) in `~/.crawl4ai/global.yml`

2. **Performance Optimization**:
   - Use `--bypass-cache` for fresh content
   - Enable `scan_full_page` for infinite-scroll pages
   - Adjust `delay_before_return_html` for dynamic content

3. **Content Extraction**:
   - Use CSS extraction for structured content
   - Use LLM extraction for unstructured content
   - Combine with filters for focused results

4. **Q&A Workflow**:
   - View content first with `-o markdown`
   - Ask specific questions
   - Use broader context with appropriate selectors

## Recap

The Crawl4AI CLI provides:
- Flexible configuration via files and parameters
- Multiple extraction strategies (CSS, XPath, LLM)
- Content filtering and optimization
- Interactive Q&A capabilities
- Various output formats