Merge branch 'release/v0.7.0' - The Adaptive Intelligence Update
347
docs/md_v2/core/adaptive-crawling.md
Normal file
@@ -0,0 +1,347 @@
# Adaptive Web Crawling

## Introduction

Traditional web crawlers follow predetermined patterns, crawling pages blindly without knowing when they've gathered enough information. **Adaptive Crawling** changes this paradigm by introducing intelligence into the crawling process.

Think of it like research: when you're looking for information, you don't read every book in the library. You stop when you've found sufficient information to answer your question. That's exactly what Adaptive Crawling does for web scraping.

## Key Concepts

### The Problem It Solves

When crawling websites for specific information, you face two challenges:

1. **Under-crawling**: Stopping too early and missing crucial information
2. **Over-crawling**: Wasting resources by crawling irrelevant pages

Adaptive Crawling solves both by using a three-layer scoring system that determines when you have "enough" information.

### How It Works

The AdaptiveCrawler uses three metrics to measure information sufficiency:

- **Coverage**: How well your collected pages cover the query terms
- **Consistency**: Whether the information is coherent across pages
- **Saturation**: Detecting when new pages aren't adding new information

When these metrics indicate sufficient information has been gathered, crawling stops automatically.
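To build intuition for these metrics, here is a deliberately simplified sketch of coverage and saturation scoring. This is illustrative only, not Crawl4AI's internal implementation; the function names and formulas are ours:

```python
def coverage(pages: list[str], query_terms: list[str]) -> float:
    """Fraction of query terms that appear in at least one collected page."""
    text = " ".join(pages).lower()
    hits = sum(1 for term in query_terms if term.lower() in text)
    return hits / len(query_terms) if query_terms else 0.0

def saturation_gain(corpus_vocab: set[str], new_page: str) -> float:
    """Fraction of the new page's words not already seen; near 0 means saturated."""
    new_words = set(new_page.lower().split())
    if not new_words:
        return 0.0
    return len(new_words - corpus_vocab) / len(new_words)

pages = ["Async context managers use __aenter__ and __aexit__."]
print(coverage(pages, ["async", "context", "managers"]))  # 1.0

corpus_vocab = set(" ".join(pages).lower().split())
print(saturation_gain(corpus_vocab, "async context managers again"))  # 0.25
```

When the gain of each newly fetched page trends toward zero while coverage stays high, a crawler in this style has little reason to continue.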
## Quick Start

### Basic Usage

```python
import asyncio

from crawl4ai import AsyncWebCrawler, AdaptiveCrawler


async def main():
    async with AsyncWebCrawler() as crawler:
        # Create an adaptive crawler
        adaptive = AdaptiveCrawler(crawler)

        # Start crawling with a query
        result = await adaptive.digest(
            start_url="https://docs.python.org/3/",
            query="async context managers"
        )

        # View statistics
        adaptive.print_stats()

        # Get the most relevant content
        relevant_pages = adaptive.get_relevant_content(top_k=5)
        for page in relevant_pages:
            print(f"- {page['url']} (score: {page['score']:.2f})")


asyncio.run(main())
```

### Configuration Options

```python
from crawl4ai import AdaptiveConfig

config = AdaptiveConfig(
    confidence_threshold=0.7,   # Stop when 70% confident (default: 0.8)
    max_pages=20,               # Maximum pages to crawl (default: 50)
    top_k_links=3,              # Links to follow per page (default: 5)
    min_gain_threshold=0.05     # Minimum expected gain to continue (default: 0.1)
)

adaptive = AdaptiveCrawler(crawler, config=config)
```
## Crawling Strategies

Adaptive Crawling supports two distinct strategies for determining information sufficiency:

### Statistical Strategy (Default)

The statistical strategy uses pure information theory and term-based analysis:

- **Fast and efficient** - No API calls or model loading
- **Term-based coverage** - Analyzes query term presence and distribution
- **No external dependencies** - Works offline
- **Best for**: Well-defined queries with specific terminology

```python
# Default configuration uses statistical strategy
config = AdaptiveConfig(
    strategy="statistical",  # This is the default
    confidence_threshold=0.8
)
```

### Embedding Strategy

The embedding strategy uses semantic embeddings for deeper understanding:

- **Semantic understanding** - Captures meaning beyond exact term matches
- **Query expansion** - Automatically generates query variations
- **Gap-driven selection** - Identifies semantic gaps in knowledge
- **Validation-based stopping** - Uses held-out queries to validate coverage
- **Best for**: Complex queries, ambiguous topics, conceptual understanding

```python
# Configure embedding strategy
config = AdaptiveConfig(
    strategy="embedding",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",  # Default
    n_query_variations=10,  # Generate 10 query variations
    embedding_min_confidence_threshold=0.1  # Stop if completely irrelevant
)

# With custom embedding provider (e.g., OpenAI)
config = AdaptiveConfig(
    strategy="embedding",
    embedding_llm_config={
        'provider': 'openai/text-embedding-3-small',
        'api_token': 'your-api-key'
    }
)
```

### Strategy Comparison

| Feature | Statistical | Embedding |
|---------|------------|-----------|
| **Speed** | Very fast | Moderate (API calls) |
| **Cost** | Free | Depends on provider |
| **Accuracy** | Good for exact terms | Excellent for concepts |
| **Dependencies** | None | Embedding model/API |
| **Query Understanding** | Literal | Semantic |
| **Best Use Case** | Technical docs, specific terms | Research, broad topics |
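The comparison above can be condensed into a simple selection heuristic. The helper below is ours, not part of the Crawl4AI API; treat the rule of thumb as a starting point, not a definitive policy:

```python
def pick_strategy(query: str, needs_semantics: bool = False) -> str:
    """Favor 'embedding' for short/conceptual queries, 'statistical' for
    queries with several specific terms the pages should literally contain."""
    broad = len(query.split()) <= 2 or needs_semantics
    return "embedding" if broad else "statistical"

print(pick_strategy("python decorators implementation patterns"))  # statistical
print(pick_strategy("ethics", needs_semantics=True))               # embedding
```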
### Embedding Strategy Configuration

The embedding strategy offers fine-tuned control through several parameters:

```python
config = AdaptiveConfig(
    strategy="embedding",

    # Model configuration
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    embedding_llm_config=None,  # Use for API-based embeddings

    # Query expansion
    n_query_variations=10,  # Number of query variations to generate

    # Coverage parameters
    embedding_coverage_radius=0.2,  # Distance threshold for coverage
    embedding_k_exp=3.0,  # Exponential decay factor (higher = stricter)

    # Stopping criteria
    embedding_min_relative_improvement=0.1,  # Min improvement to continue
    embedding_validation_min_score=0.3,  # Min validation score
    embedding_min_confidence_threshold=0.1,  # Below this = irrelevant

    # Link selection
    embedding_overlap_threshold=0.85,  # Similarity for deduplication

    # Display confidence mapping
    embedding_quality_min_confidence=0.7,  # Min displayed confidence
    embedding_quality_max_confidence=0.95  # Max displayed confidence
)
```

### Handling Irrelevant Queries

The embedding strategy can detect when a query is completely unrelated to the content:

```python
# This will stop quickly with low confidence
result = await adaptive.digest(
    start_url="https://docs.python.org/3/",
    query="how to cook pasta"  # Irrelevant to Python docs
)

# Check if query was irrelevant
if result.metrics.get('is_irrelevant', False):
    print("Query is unrelated to the content!")
```
## When to Use Adaptive Crawling

### Perfect For:
- **Research Tasks**: Finding comprehensive information about a topic
- **Question Answering**: Gathering sufficient context to answer specific queries
- **Knowledge Base Building**: Creating focused datasets for AI/ML applications
- **Competitive Intelligence**: Collecting complete information about specific products/features

### Not Recommended For:
- **Full Site Archiving**: When you need every page regardless of content
- **Structured Data Extraction**: When targeting specific, known page patterns
- **Real-time Monitoring**: When you need continuous updates

## Understanding the Output

### Confidence Score

The confidence score (0-1) indicates how sufficient the gathered information is:
- **0.0-0.3**: Insufficient information, needs more crawling
- **0.3-0.6**: Partial information, may answer basic queries
- **0.6-0.8**: Good coverage, can answer most queries
- **0.8-1.0**: Excellent coverage, comprehensive information
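These bands are easy to encode in a small helper when deciding whether to act on a crawl result. The thresholds mirror the table above; the function name is ours, not part of the Crawl4AI API:

```python
def confidence_band(score: float) -> str:
    """Map a 0-1 confidence score to the qualitative bands documented above."""
    if score < 0.3:
        return "insufficient"
    if score < 0.6:
        return "partial"
    if score < 0.8:
        return "good"
    return "excellent"

print(confidence_band(0.45))  # partial
print(confidence_band(0.85))  # excellent
```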
### Statistics Display

```python
adaptive.print_stats(detailed=False)  # Summary table
adaptive.print_stats(detailed=True)   # Detailed metrics
```

The summary shows:
- Pages crawled vs. confidence achieved
- Coverage, consistency, and saturation scores
- Crawling efficiency metrics

## Persistence and Resumption

### Saving Progress

```python
config = AdaptiveConfig(
    save_state=True,
    state_path="my_crawl_state.json"
)

# Crawl will auto-save progress
result = await adaptive.digest(start_url, query)
```

### Resuming a Crawl

```python
# Resume from saved state
result = await adaptive.digest(
    start_url,
    query,
    resume_from="my_crawl_state.json"
)
```

### Exporting Knowledge Base

```python
# Export collected pages to JSONL
adaptive.export_knowledge_base("knowledge_base.jsonl")

# Import into another session
new_adaptive = AdaptiveCrawler(crawler)
new_adaptive.import_knowledge_base("knowledge_base.jsonl")
```
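Because the export is JSONL (one JSON object per line), it can also be consumed with nothing but the standard library. The record fields are whatever your export contains; the `url` field in the commented example is an assumption for illustration:

```python
import json

def load_knowledge_base(path: str) -> list[dict]:
    """Read a JSONL export back into a list of page records."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines defensively
                records.append(json.loads(line))
    return records

# Example usage (field name assumed):
# for record in load_knowledge_base("knowledge_base.jsonl"):
#     print(record["url"])
```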
## Best Practices

### 1. Query Formulation
- Use specific, descriptive queries
- Include key terms you expect to find
- Avoid overly broad queries

### 2. Threshold Tuning
- Start with the default (0.8) for general use
- Lower to 0.6-0.7 for exploratory crawling
- Raise to 0.9+ for exhaustive coverage

### 3. Performance Optimization
- Use appropriate `max_pages` limits
- Adjust `top_k_links` based on site structure
- Enable caching for repeat crawls

### 4. Link Selection
The crawler prioritizes links based on:
- Relevance to query
- Expected information gain
- URL structure and depth
## Examples

### Research Assistant

```python
# Gather information about a programming concept
result = await adaptive.digest(
    start_url="https://realpython.com",
    query="python decorators implementation patterns"
)

# Get the most relevant excerpts
for doc in adaptive.get_relevant_content(top_k=3):
    print(f"\nFrom: {doc['url']}")
    print(f"Relevance: {doc['score']:.2%}")
    print(doc['content'][:500] + "...")
```

### Knowledge Base Builder

```python
# Build a focused knowledge base about machine learning
queries = [
    "supervised learning algorithms",
    "neural network architectures",
    "model evaluation metrics"
]

for query in queries:
    await adaptive.digest(
        start_url="https://scikit-learn.org/stable/",
        query=query
    )

# Export combined knowledge base
adaptive.export_knowledge_base("ml_knowledge.jsonl")
```

### API Documentation Crawler

```python
# Intelligently crawl API documentation
config = AdaptiveConfig(
    confidence_threshold=0.85,  # Higher threshold for completeness
    max_pages=30
)

adaptive = AdaptiveCrawler(crawler, config)
result = await adaptive.digest(
    start_url="https://api.example.com/docs",
    query="authentication endpoints rate limits"
)
```

## Next Steps

- Learn about [Advanced Adaptive Strategies](../advanced/adaptive-strategies.md)
- Explore the [AdaptiveCrawler API Reference](../api/adaptive-crawler.md)
- See more [Examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples/adaptive_crawling)

## FAQ

**Q: How is this different from traditional crawling?**
A: Traditional crawling follows fixed patterns (BFS/DFS). Adaptive crawling makes intelligent decisions about which links to follow and when to stop based on information gain.

**Q: Can I use this with JavaScript-heavy sites?**
A: Yes! AdaptiveCrawler inherits all capabilities from AsyncWebCrawler, including JavaScript execution.

**Q: How does it handle large websites?**
A: The algorithm naturally limits crawling to relevant sections. Use `max_pages` as a safety limit.

**Q: Can I customize the scoring algorithms?**
A: Advanced users can implement custom strategies. See [Adaptive Strategies](../advanced/adaptive-strategies.md).
@@ -252,7 +252,7 @@ The `clone()` method:
### Key fields to note

1. **`provider`**:
   - Which LLM provider to use.
   - Possible values are `"ollama/llama3","groq/llama3-70b-8192","groq/llama3-8b-8192", "openai/gpt-4o-mini" ,"openai/gpt-4o","openai/o1-mini","openai/o1-preview","openai/o3-mini","openai/o3-mini-high","anthropic/claude-3-haiku-20240307","anthropic/claude-3-opus-20240229","anthropic/claude-3-sonnet-20240229","anthropic/claude-3-5-sonnet-20240620","gemini/gemini-pro","gemini/gemini-1.5-pro","gemini/gemini-2.0-flash","gemini/gemini-2.0-flash-exp","gemini/gemini-2.0-flash-lite-preview-02-05","deepseek/deepseek-chat"`<br/>*(default: `"openai/gpt-4o-mini"`)*

2. **`api_token`**:
@@ -273,8 +273,8 @@ In a typical scenario, you define **one** `BrowserConfig` for your crawler sessi

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig, LLMContentFilter, DefaultMarkdownGenerator
from crawl4ai import JsonCssExtractionStrategy

async def main():
    # 1) Browser config: headless, bigger viewport, no proxy
@@ -298,7 +298,7 @@ async def main():
    # 3) Example LLM content filtering

    gemini_config = LLMConfig(
        provider="gemini/gemini-1.5-pro",
        api_token="env:GEMINI_API_TOKEN"
    )

@@ -322,8 +322,9 @@ async def main():
    )

    md_generator = DefaultMarkdownGenerator(
        content_filter=filter,
        options={"ignore_links": True}
    )

    # 4) Crawler run config: skip cache, use extraction
    run_conf = CrawlerRunConfig(
395
docs/md_v2/core/c4a-script.md
Normal file
@@ -0,0 +1,395 @@
# C4A-Script: Visual Web Automation Made Simple

## What is C4A-Script?

C4A-Script is a powerful, human-readable domain-specific language (DSL) designed for web automation and interaction. Think of it as a simplified programming language that anyone can read and write, perfect for automating repetitive web tasks, testing user interfaces, or creating interactive demos.

### Why C4A-Script?

**Simple Syntax, Powerful Results**
```c4a
# Navigate and interact in plain English
GO https://example.com
WAIT `#search-box` 5
TYPE "Hello World"
CLICK `button[type="submit"]`
```

**Visual Programming Support**
C4A-Script comes with a built-in Blockly visual editor, allowing you to create scripts by dragging and dropping blocks - no coding experience required!

**Perfect for:**
- **UI Testing**: Automate user interaction flows
- **Demo Creation**: Build interactive product demonstrations
- **Data Entry**: Automate form filling and submissions
- **Testing Workflows**: Validate complex user journeys
- **Training**: Teach web automation without code complexity

## Getting Started: Your First Script

Let's create a simple script that searches for something on a website:

```c4a
# My first C4A-Script
GO https://duckduckgo.com

# Wait for the search box to appear
WAIT `input[name="q"]` 10

# Type our search query
TYPE "Crawl4AI"

# Press Enter to search
PRESS Enter

# Wait for results
WAIT `.results` 5
```

That's it! In just a few lines, you've automated a complete search workflow.

## Interactive Tutorial & Live Demo

Want to learn by doing? We've got you covered:

**🚀 [Live Demo](https://docs.crawl4ai.com/c4a-script/demo)** - Try C4A-Script in your browser right now!

**📁 [Tutorial Examples](/examples/c4a_script/)** - Complete examples with source code

**🛠️ [Local Tutorial](/examples/c4a_script/tutorial/)** - Run the interactive tutorial on your machine

### Running the Tutorial Locally

The tutorial includes a Flask-based web interface with:
- **Live Code Editor** with syntax highlighting
- **Visual Blockly Editor** for drag-and-drop programming
- **Recording Mode** to capture your actions and generate scripts
- **Timeline View** to see and edit your automation steps

```bash
# Clone and navigate to the tutorial
cd docs/examples/c4a_script/tutorial/

# Install dependencies
pip install flask

# Launch the tutorial server
python app.py

# Open http://localhost:5000 in your browser
```

## Core Concepts

### Commands and Syntax

C4A-Script uses simple, English-like commands. Each command does one specific thing:

```c4a
# Comments start with #
COMMAND parameter1 parameter2

# Most commands use CSS selectors in backticks
CLICK `#submit-button`

# Text content goes in quotes
TYPE "Hello, World!"

# Numbers are used directly
WAIT 3
```

### Selectors: Finding Elements

C4A-Script uses CSS selectors to identify elements on the page:

```c4a
# By ID
CLICK `#login-button`

# By class
CLICK `.submit-btn`

# By attribute
CLICK `button[type="submit"]`

# By text content
CLICK `button:contains("Sign In")`

# Complex selectors
CLICK `.form-container input[name="email"]`
```

### Variables and Dynamic Content

Store and reuse values with variables:

```c4a
# Set a variable
SETVAR username = "john@example.com"
SETVAR password = "secret123"

# Use variables (prefix with $)
TYPE $username
PRESS Tab
TYPE $password
```
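To make the variable syntax concrete, here is a toy Python parser for the `SETVAR` and `TYPE` forms shown above. It is purely illustrative; the real C4A-Script interpreter ships with Crawl4AI and handles far more than these two commands:

```python
import re

def parse_line(line: str, variables: dict) -> "tuple[str, str] | None":
    """Parse a single SETVAR or TYPE line, resolving $variables."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # blank line or comment
    m = re.match(r'SETVAR (\w+) = "(.*)"', line)
    if m:
        variables[m.group(1)] = m.group(2)
        return ("SETVAR", m.group(1))
    m = re.match(r'TYPE \$(\w+)', line)
    if m:
        return ("TYPE", variables[m.group(1)])  # substitute the variable
    m = re.match(r'TYPE "(.*)"', line)
    if m:
        return ("TYPE", m.group(1))
    raise ValueError(f"Unrecognized line: {line}")

variables: dict = {}
parse_line('SETVAR username = "john@example.com"', variables)
print(parse_line('TYPE $username', variables))  # ('TYPE', 'john@example.com')
```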
## Command Categories

### 🧭 Navigation Commands
Move around the web like a user would:

| Command | Purpose | Example |
|---------|---------|---------|
| `GO` | Navigate to URL | `GO https://example.com` |
| `RELOAD` | Refresh current page | `RELOAD` |
| `BACK` | Go back in history | `BACK` |
| `FORWARD` | Go forward in history | `FORWARD` |

### ⏱️ Wait Commands
Ensure elements are ready before interacting:

| Command | Purpose | Example |
|---------|---------|---------|
| `WAIT` | Wait for time/element/text | `WAIT 3` or `WAIT \`#element\` 10` |

### 🖱️ Mouse Commands
Click, drag, and move like a human:

| Command | Purpose | Example |
|---------|---------|---------|
| `CLICK` | Click element or coordinates | `CLICK \`button\`` or `CLICK 100 200` |
| `DOUBLE_CLICK` | Double-click element | `DOUBLE_CLICK \`.item\`` |
| `RIGHT_CLICK` | Right-click element | `RIGHT_CLICK \`#menu\`` |
| `SCROLL` | Scroll in direction | `SCROLL DOWN 500` |
| `DRAG` | Drag from point to point | `DRAG 100 100 500 300` |

### ⌨️ Keyboard Commands
Type text and press keys naturally:

| Command | Purpose | Example |
|---------|---------|---------|
| `TYPE` | Type text or variable | `TYPE "Hello"` or `TYPE $username` |
| `PRESS` | Press special keys | `PRESS Tab` or `PRESS Enter` |
| `CLEAR` | Clear input field | `CLEAR \`#search\`` |
| `SET` | Set input value directly | `SET \`#email\` "user@example.com"` |

### 🔀 Control Flow
Add logic and repetition to your scripts:

| Command | Purpose | Example |
|---------|---------|---------|
| `IF` | Conditional execution | `IF (EXISTS \`#popup\`) THEN CLICK \`#close\`` |
| `REPEAT` | Loop commands | `REPEAT (SCROLL DOWN 300, 5)` |

### 💾 Variables & Advanced
Store data and execute custom code:

| Command | Purpose | Example |
|---------|---------|---------|
| `SETVAR` | Create variable | `SETVAR email = "test@example.com"` |
| `EVAL` | Execute JavaScript | `EVAL \`console.log('Hello')\`` |

## Real-World Examples

### Example 1: Login Flow
```c4a
# Complete login automation
GO https://myapp.com/login

# Wait for page to load
WAIT `#login-form` 5

# Fill credentials
CLICK `#email`
TYPE "user@example.com"
PRESS Tab
TYPE "mypassword"

# Submit form
CLICK `button[type="submit"]`

# Wait for dashboard
WAIT `.dashboard` 10
```

### Example 2: E-commerce Shopping
```c4a
# Shopping automation with variables
SETVAR product = "laptop"
SETVAR budget = "1000"

GO https://shop.example.com
WAIT `#search-box` 3

# Search for product
TYPE $product
PRESS Enter
WAIT `.product-list` 5

# Filter by price
CLICK `.price-filter`
SET `#max-price` $budget
CLICK `.apply-filters`

# Select first result
WAIT `.product-item` 3
CLICK `.product-item:first-child`
```

### Example 3: Form Automation with Conditions
```c4a
# Smart form filling with error handling
GO https://forms.example.com

# Check if user is already logged in
IF (EXISTS `.user-menu`) THEN GO https://forms.example.com/new
IF (NOT EXISTS `.user-menu`) THEN CLICK `#login-link`

# Fill form
WAIT `#contact-form` 5
SET `#name` "John Doe"
SET `#email` "john@example.com"
SET `#message` "Hello from C4A-Script!"

# Handle popup if it appears
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept-cookies`

# Submit
CLICK `#submit-button`
WAIT `.success-message` 10
```
## Visual Programming with Blockly

C4A-Script includes a powerful visual programming interface built on Google Blockly. Perfect for:

- **Non-programmers** who want to create automation
- **Rapid prototyping** of automation workflows
- **Educational environments** for teaching automation concepts
- **Collaborative development** where visual representation helps communication

### Features:
- **Drag & Drop Interface**: Build scripts by connecting blocks
- **Real-time Sync**: Changes in visual mode instantly update the text script
- **Smart Block Types**: Blocks are categorized by function (Navigation, Actions, etc.)
- **Error Prevention**: Visual connections prevent syntax errors
- **Comment Support**: Add visual comment blocks for documentation

Try the visual editor in our [live demo](https://docs.crawl4ai.com/c4a-script/demo) or [local tutorial](/examples/c4a_script/tutorial/).

## Advanced Features

### Recording Mode
The tutorial interface includes a recording feature that watches your browser interactions and automatically generates C4A-Script commands:

1. Click "Record" in the tutorial interface
2. Perform actions in the browser preview
3. Watch as C4A-Script commands are generated in real-time
4. Edit and refine the generated script

### Error Handling and Debugging
C4A-Script provides clear error messages and debugging information:

```c4a
# Use comments for debugging
# This will wait up to 10 seconds for the element
WAIT `#slow-loading-element` 10

# Check if element exists before clicking
IF (EXISTS `#optional-button`) THEN CLICK `#optional-button`

# Use EVAL for custom debugging
EVAL `console.log("Current page title:", document.title)`
```

### Integration with Crawl4AI
C4A-Script integrates seamlessly with Crawl4AI's web crawling capabilities:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

# Use C4A-Script for interaction before crawling
script = """
GO https://example.com
CLICK `#load-more-content`
WAIT `.dynamic-content` 5
"""

config = CrawlerRunConfig(
    js_code=script,
    wait_for=".dynamic-content"
)


async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        print(result.markdown)


asyncio.run(main())
```

## Best Practices

### 1. Always Wait for Elements
```c4a
# Bad: Clicking immediately
CLICK `#button`

# Good: Wait for element to appear
WAIT `#button` 5
CLICK `#button`
```

### 2. Use Descriptive Comments
```c4a
# Login to user account
GO https://myapp.com/login
WAIT `#login-form` 5

# Enter credentials
TYPE "user@example.com"
PRESS Tab
TYPE "password123"

# Submit and wait for redirect
CLICK `#submit-button`
WAIT `.dashboard` 10
```

### 3. Handle Variable Conditions
```c4a
# Handle different page states
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept-cookies`
IF (EXISTS `.popup-modal`) THEN CLICK `.close-modal`

# Proceed with main workflow
CLICK `#main-action`
```

### 4. Use Variables for Reusability
```c4a
# Define once, use everywhere
SETVAR base_url = "https://myapp.com"
SETVAR test_email = "test@example.com"

GO $base_url/login
SET `#email` $test_email
```

## Getting Help

- **📖 [Complete Examples](/examples/c4a_script/)** - Real-world automation scripts
- **🎮 [Interactive Tutorial](/examples/c4a_script/tutorial/)** - Hands-on learning environment
- **📋 [API Reference](/api/c4a-script-reference/)** - Detailed command documentation
- **🌐 [Live Demo](https://docs.crawl4ai.com/c4a-script/demo)** - Try it in your browser

## What's Next?

Ready to dive deeper? Check out:

1. **[API Reference](/api/c4a-script-reference/)** - Complete command documentation
2. **[Tutorial Examples](/examples/c4a_script/)** - Copy-paste ready scripts
3. **[Local Tutorial Setup](/examples/c4a_script/tutorial/)** - Run the full development environment

C4A-Script makes web automation accessible to everyone. Whether you're a developer automating tests, a designer creating interactive demos, or a business user streamlining repetitive tasks, C4A-Script has the tools you need.

*Start automating today - your future self will thank you!* 🚀
@@ -17,6 +17,9 @@
- [Configuration Reference](#configuration-reference)
- [Best Practices & Tips](#best-practices--tips)

## Installation
The Crawl4AI CLI will be installed automatically when you install the library.

## Basic Usage

The Crawl4AI CLI (`crwl`) provides a simple interface to the Crawl4AI library:

@@ -191,7 +191,7 @@ You can combine content selection with a more advanced extraction strategy. For
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def main():
    # Minimal schema for repeated items
@@ -243,7 +243,7 @@ import asyncio
import json
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai import LLMExtractionStrategy

class ArticleData(BaseModel):
    headline: str
@@ -288,7 +288,7 @@ Below is a short function that unifies **CSS selection**, **exclusion** logic, a
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def extract_main_articles(url: str):
    schema = {

@@ -138,7 +138,7 @@ If you run a JSON-based extraction strategy (CSS, XPath, LLM, etc.), the structu
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def main():
    schema = {
@@ -58,13 +58,15 @@ Pull and run images directly from Docker Hub without building locally.

#### 1. Pull the Image

Our latest release candidate is `0.7.0-r1`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.

> ⚠️ **Important Note**: The `latest` tag currently points to the stable `0.6.0` version. After testing and validation, `0.7.0` (without -r1) will be released and `latest` will be updated. For now, please use `0.7.0-r1` to test the new features.

```bash
# Pull the release candidate (for testing new features)
docker pull unclecode/crawl4ai:0.7.0-r1

# Or pull the current stable version (0.6.0)
docker pull unclecode/crawl4ai:latest
```
@@ -124,7 +126,7 @@ docker stop crawl4ai && docker rm crawl4ai
#### Docker Hub Versioning Explained

* **Image Name:** `unclecode/crawl4ai`
* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.7.0-r1`)
  * `LIBRARY_VERSION`: The semantic version of the core `crawl4ai` Python library
  * `SUFFIX`: Optional tag for release candidates and revisions (e.g., `r1`)
* **`latest` Tag:** Points to the most recent stable version
@@ -28,6 +28,11 @@ This page provides a comprehensive list of example scripts that demonstrate vari
| Example | Description | Link |
|---------|-------------|------|
| Deep Crawling | An extensive tutorial on deep crawling capabilities, demonstrating BFS and BestFirst strategies, stream vs. non-stream execution, filters, scorers, and advanced configurations. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/deepcrawl_example.py) |
| Virtual Scroll | Comprehensive examples for handling virtualized scrolling on sites like Twitter, Instagram. Demonstrates different scrolling scenarios with local test server. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/virtual_scroll_example.py) |
| Adaptive Crawling | Demonstrates intelligent crawling that automatically determines when sufficient information has been gathered. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/adaptive_crawling/) |
| Dispatcher | Shows how to use the crawl dispatcher for advanced workload management. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/dispatcher_example.py) |
| Storage State | Tutorial on managing browser storage state for persistence. | [View Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/storage_state_tutorial.md) |
| Network Console Capture | Demonstrates how to capture and analyze network requests and console logs. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/network_console_capture_example.py) |

@@ -137,7 +137,7 @@ if __name__ == "__main__":
- Higher → fewer chunks but more relevant.
- Lower → more inclusive.

> In more advanced scenarios, you might see parameters like `use_stemming`, `case_sensitive`, or `priority_tags` to refine how text is tokenized or weighted.
> In more advanced scenarios, you might see parameters like `language`, `case_sensitive`, or `priority_tags` to refine how text is tokenized or weighted.

---

@@ -242,4 +242,4 @@ class MyCustomFilter(RelevantContentFilter):

With these tools, you can **zero in** on the text that truly matters, ignoring spammy or boilerplate content, and produce a concise, relevant "fit markdown" for your AI or data pipelines. Happy pruning and searching!

- Last Updated: 2025-01-01

@@ -105,7 +105,366 @@ result.links = {

---

## 2. Domain Filtering
## 2. Advanced Link Head Extraction & Scoring

Ever wanted to not just extract links, but also get the actual content (title, description, metadata) from those linked pages? And score them for relevance? This is exactly what Link Head Extraction does - it fetches the `<head>` section from each discovered link and scores them using multiple algorithms.

### 2.1 Why Link Head Extraction?

When you crawl a page, you get hundreds of links. But which ones are actually valuable? Link Head Extraction solves this by:

1. **Fetching head content** from each link (title, description, meta tags)
2. **Scoring links intrinsically** based on URL quality, text relevance, and context
3. **Scoring links contextually** using the BM25 algorithm when you provide a search query
4. **Combining scores intelligently** to give you a final relevance ranking

### 2.2 Complete Working Example

Here's a full example you can copy, paste, and run immediately:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.async_configs import LinkPreviewConfig

async def extract_link_heads_example():
    """
    Complete example showing link head extraction with scoring.
    This will crawl a documentation site and extract head content from internal links.
    """

    # Configure link head extraction
    config = CrawlerRunConfig(
        # Enable link head extraction with detailed configuration
        link_preview_config=LinkPreviewConfig(
            include_internal=True,            # Extract from internal links
            include_external=False,           # Skip external links for this example
            max_links=10,                     # Limit to 10 links for demo
            concurrency=5,                    # Process 5 links simultaneously
            timeout=10,                       # 10 second timeout per link
            query="API documentation guide",  # Query for contextual scoring
            score_threshold=0.3,              # Only include links scoring above 0.3
            verbose=True                      # Show detailed progress
        ),
        # Enable intrinsic scoring (URL quality, text relevance)
        score_links=True,
        # Keep output clean
        only_text=True,
        verbose=True
    )

    async with AsyncWebCrawler() as crawler:
        # Crawl a documentation site (great for testing)
        result = await crawler.arun("https://docs.python.org/3/", config=config)

        if result.success:
            print(f"✅ Successfully crawled: {result.url}")
            print(f"📄 Page title: {result.metadata.get('title', 'No title')}")

            # Access links (now enhanced with head data and scores)
            internal_links = result.links.get("internal", [])
            external_links = result.links.get("external", [])

            print(f"\n🔗 Found {len(internal_links)} internal links")
            print(f"🌍 Found {len(external_links)} external links")

            # Count links with head data
            links_with_head = [link for link in internal_links
                               if link.get("head_data") is not None]
            print(f"🧠 Links with head data extracted: {len(links_with_head)}")

            # Show the top 3 scoring links
            print(f"\n🏆 Top 3 Links with Full Scoring:")
            for i, link in enumerate(links_with_head[:3]):
                print(f"\n{i+1}. {link['href']}")
                print(f"   Link Text: '{link.get('text', 'No text')[:50]}...'")

                # Show all three score types
                intrinsic = link.get('intrinsic_score')
                contextual = link.get('contextual_score')
                total = link.get('total_score')

                if intrinsic is not None:
                    print(f"   📊 Intrinsic Score: {intrinsic:.2f}/10.0 (URL quality & context)")
                if contextual is not None:
                    print(f"   🎯 Contextual Score: {contextual:.3f} (BM25 relevance to query)")
                if total is not None:
                    print(f"   ⭐ Total Score: {total:.3f} (combined final score)")

                # Show extracted head data
                head_data = link.get("head_data", {})
                if head_data:
                    title = head_data.get("title", "No title")
                    description = head_data.get("meta", {}).get("description", "No description")

                    print(f"   📰 Title: {title[:60]}...")
                    if description:
                        print(f"   📝 Description: {description[:80]}...")

                # Show extraction status
                status = link.get("head_extraction_status", "unknown")
                print(f"   ✅ Extraction Status: {status}")
        else:
            print(f"❌ Crawl failed: {result.error_message}")

# Run the example
if __name__ == "__main__":
    asyncio.run(extract_link_heads_example())
```

**Expected Output:**
```
✅ Successfully crawled: https://docs.python.org/3/
📄 Page title: 3.13.5 Documentation
🔗 Found 53 internal links
🌍 Found 1 external links
🧠 Links with head data extracted: 10

🏆 Top 3 Links with Full Scoring:

1. https://docs.python.org/3.15/
   Link Text: 'Python 3.15 (in development)...'
   📊 Intrinsic Score: 4.17/10.0 (URL quality & context)
   🎯 Contextual Score: 1.000 (BM25 relevance to query)
   ⭐ Total Score: 5.917 (combined final score)
   📰 Title: 3.15.0a0 Documentation...
   📝 Description: The official Python documentation...
   ✅ Extraction Status: valid
```

### 2.3 Configuration Deep Dive

The `LinkPreviewConfig` class supports these options:

```python
from crawl4ai.async_configs import LinkPreviewConfig

link_preview_config = LinkPreviewConfig(
    # BASIC SETTINGS
    verbose=True,            # Show detailed logs (recommended for learning)

    # LINK FILTERING
    include_internal=True,   # Include same-domain links
    include_external=True,   # Include different-domain links
    max_links=50,            # Maximum links to process (prevents overload)

    # PATTERN FILTERING
    include_patterns=[       # Only process links matching these patterns
        "*/docs/*",
        "*/api/*",
        "*/reference/*"
    ],
    exclude_patterns=[       # Skip links matching these patterns
        "*/login*",
        "*/admin*"
    ],

    # PERFORMANCE SETTINGS
    concurrency=10,          # How many links to process simultaneously
    timeout=5,               # Seconds to wait per link

    # RELEVANCE SCORING
    query="machine learning API",  # Query for BM25 contextual scoring
    score_threshold=0.3,     # Only include links above this score
)
```
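
The `include_patterns` and `exclude_patterns` fields take glob-style wildcards. If you want to sanity-check a pattern list before handing it to the crawler, the semantics can be approximated with Python's standard `fnmatch` module — this is an illustration of how such patterns behave, not Crawl4AI's internal code, and the `matches_patterns` helper is hypothetical:

```python
from fnmatch import fnmatch

def matches_patterns(url: str, include=None, exclude=None) -> bool:
    """Return True if a URL passes glob-style include/exclude filters."""
    # Exclusion wins first, mirroring the "skip links matching these" wording
    if exclude and any(fnmatch(url, p) for p in exclude):
        return False
    # If an include list is given, at least one pattern must match
    if include:
        return any(fnmatch(url, p) for p in include)
    return True  # no include list means everything passes

urls = [
    "https://example.com/docs/intro",
    "https://example.com/api/v1/users",
    "https://example.com/login?next=/docs/",
]
kept = [u for u in urls
        if matches_patterns(u, include=["*/docs/*", "*/api/*"], exclude=["*/login*"])]
print(kept)
```

Note that `fnmatch`'s `*` matches across `/`, so `*/docs/*` matches any URL with a `/docs/` segment anywhere in its path.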

### 2.4 Understanding the Three Score Types

Each extracted link gets three different scores:

#### 1. **Intrinsic Score (0-10)** - URL and Content Quality
Based on URL structure, link text quality, and page context:

```python
# High intrinsic score indicators:
# ✅ Clean URL structure (docs.python.org/api/reference)
# ✅ Meaningful link text ("API Reference Guide")
# ✅ Relevant to page context
# ✅ Not buried deep in navigation

# Low intrinsic score indicators:
# ❌ Random URLs (site.com/x7f9g2h)
# ❌ No link text or generic text ("Click here")
# ❌ Unrelated to page content
```
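
Crawl4AI computes the intrinsic score internally; to make the indicators above concrete, here is a toy scorer over the same signals. The heuristics and weights are assumptions chosen purely for illustration — `toy_intrinsic_score` is not part of the library:

```python
import re

GENERIC_TEXT = {"click here", "read more", "here", "link", "more"}

def toy_intrinsic_score(url: str, text: str) -> float:
    """Illustrative 0-10 heuristic: clean paths and meaningful anchor text score higher."""
    score = 5.0
    host_and_path = url.split("://", 1)[-1]
    path = host_and_path.split("/", 1)[1] if "/" in host_and_path else ""
    # Penalize random-looking path segments (letters mixed with digits)
    if re.search(r"[a-z]\d|\d[a-z]", path):
        score -= 2.0
    # Penalize paths buried deep in the hierarchy
    score -= min(path.count("/"), 4) * 0.5
    # Anchor-text quality: empty/generic text hurts, multi-word text helps
    if not text.strip() or text.strip().lower() in GENERIC_TEXT:
        score -= 2.5
    elif len(text.split()) >= 2:
        score += 1.5
    return max(0.0, min(10.0, score))

print(toy_intrinsic_score("https://docs.python.org/api/reference", "API Reference Guide"))
print(toy_intrinsic_score("https://site.com/x7f9g2h", "Click here"))
```

The clean, well-labeled documentation link scores several points above the random URL with generic anchor text, matching the indicator lists above.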

#### 2. **Contextual Score (0-1)** - BM25 Relevance to Query
Only available when you provide a `query`. Uses the BM25 algorithm against head content:

```python
# Example: query = "machine learning tutorial"
# High contextual score: Link to "Complete Machine Learning Guide"
# Low contextual score: Link to "Privacy Policy"
```

#### 3. **Total Score** - Smart Combination
Intelligently combines intrinsic and contextual scores with fallbacks:

```python
# When both scores available: (intrinsic * 0.3) + (contextual * 0.7)
# When only intrinsic: uses intrinsic score
# When only contextual: uses contextual score
# When neither: not calculated
```
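
The fallback logic above is small enough to sketch directly. The 0.3/0.7 weights are taken from the comment above; how the library normalizes the final number may differ (the sample output earlier shows a total above 1.0), so treat this helper as illustrative:

```python
def combine_scores(intrinsic=None, contextual=None):
    """Combine intrinsic (0-10) and contextual (0-1) scores with fallbacks."""
    if intrinsic is not None and contextual is not None:
        # Both present: weighted mix favoring query relevance
        return intrinsic * 0.3 + contextual * 0.7
    if intrinsic is not None:
        return intrinsic    # only intrinsic available
    if contextual is not None:
        return contextual   # only contextual available
    return None             # neither score available

print(combine_scores(4.17, 1.0))
print(combine_scores(4.17, None))
print(combine_scores(None, None))
```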

### 2.5 Practical Use Cases

#### Use Case 1: Research Assistant
Find the most relevant documentation pages:

```python
async def research_assistant():
    config = CrawlerRunConfig(
        link_preview_config=LinkPreviewConfig(
            include_internal=True,
            include_external=True,
            include_patterns=["*/docs/*", "*/tutorial/*", "*/guide/*"],
            query="machine learning neural networks",
            max_links=20,
            score_threshold=0.5,  # Only high-relevance links
            verbose=True
        ),
        score_links=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://scikit-learn.org/", config=config)

        if result.success:
            # Get high-scoring links
            good_links = [link for link in result.links.get("internal", [])
                          if link.get("total_score", 0) > 0.7]

            print(f"🎯 Found {len(good_links)} highly relevant links:")
            for link in good_links[:5]:
                print(f"⭐ {link['total_score']:.3f} - {link['href']}")
                print(f"   {link.get('head_data', {}).get('title', 'No title')}")
```

#### Use Case 2: Content Discovery
Find all API endpoints and references:

```python
async def api_discovery():
    config = CrawlerRunConfig(
        link_preview_config=LinkPreviewConfig(
            include_internal=True,
            include_patterns=["*/api/*", "*/reference/*"],
            exclude_patterns=["*/deprecated/*"],
            max_links=100,
            concurrency=15,
            verbose=False  # Clean output
        ),
        score_links=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://docs.example-api.com/", config=config)

        if result.success:
            api_links = result.links.get("internal", [])

            # Group by endpoint type
            endpoints = {}
            for link in api_links:
                if link.get("head_data"):
                    title = link["head_data"].get("title", "")
                    if "GET" in title:
                        endpoints.setdefault("GET", []).append(link)
                    elif "POST" in title:
                        endpoints.setdefault("POST", []).append(link)

            for method, links in endpoints.items():
                print(f"\n{method} Endpoints ({len(links)}):")
                for link in links[:3]:
                    print(f"  • {link['href']}")
```

#### Use Case 3: Link Quality Analysis
Analyze website structure and content quality:

```python
async def quality_analysis():
    config = CrawlerRunConfig(
        link_preview_config=LinkPreviewConfig(
            include_internal=True,
            max_links=200,
            concurrency=20,
        ),
        score_links=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://your-website.com/", config=config)

        if result.success:
            links = result.links.get("internal", [])

            # Analyze intrinsic scores
            scores = [link.get('intrinsic_score', 0) for link in links]
            avg_score = sum(scores) / len(scores) if scores else 0

            print(f"📊 Link Quality Analysis:")
            print(f"   Average intrinsic score: {avg_score:.2f}/10.0")
            print(f"   High quality links (>7.0): {len([s for s in scores if s > 7.0])}")
            print(f"   Low quality links (<3.0): {len([s for s in scores if s < 3.0])}")

            # Find problematic links
            bad_links = [link for link in links
                         if link.get('intrinsic_score', 0) < 2.0]

            if bad_links:
                print(f"\n⚠️ Links needing attention:")
                for link in bad_links[:5]:
                    print(f"   {link['href']} (score: {link.get('intrinsic_score', 0):.1f})")
```

### 2.6 Performance Tips

1. **Start Small**: Begin with `max_links=10` to understand the feature
2. **Use Patterns**: Filter with `include_patterns` to focus on relevant sections
3. **Adjust Concurrency**: Higher concurrency is faster but uses more resources
4. **Set Timeouts**: Use `timeout=5` to prevent hanging on slow sites
5. **Use Score Thresholds**: Filter out low-quality links with `score_threshold`

### 2.7 Troubleshooting

**No head data extracted?**
```python
# Check your configuration:
config = CrawlerRunConfig(
    link_preview_config=LinkPreviewConfig(
        verbose=True  # ← Enable to see what's happening
    )
)
```

**Scores showing as None?**
```python
# Make sure scoring is enabled:
config = CrawlerRunConfig(
    score_links=True,  # ← Enable intrinsic scoring
    link_preview_config=LinkPreviewConfig(
        query="your search terms"  # ← For contextual scoring
    )
)
```

**Process taking too long?**
```python
# Optimize performance:
link_preview_config = LinkPreviewConfig(
    max_links=20,      # ← Reduce number
    concurrency=10,    # ← Increase parallelism
    timeout=3,         # ← Shorter timeout
    include_patterns=["*/important/*"]  # ← Focus on key areas
)
```

---

## 3. Domain Filtering

Some websites contain hundreds of third-party or affiliate links. You can filter out certain domains at **crawl time** by configuring the crawler. The most relevant parameters in `CrawlerRunConfig` are:

@@ -114,7 +473,7 @@ Some websites contain hundreds of third-party or affiliate links. You can filter
- **`exclude_social_media_links`**: If `True`, automatically skip known social platforms.
- **`exclude_domains`**: Provide a list of custom domains you want to exclude (e.g., `["spammyads.com", "tracker.net"]`).

### 2.1 Example: Excluding External & Social Media Links
### 3.1 Example: Excluding External & Social Media Links

```python
import asyncio
@@ -143,7 +502,7 @@ if __name__ == "__main__":
    asyncio.run(main())
```

### 2.2 Example: Excluding Specific Domains
### 3.2 Example: Excluding Specific Domains

If you want to let external links in, but specifically exclude a domain (e.g., `suspiciousads.com`), do this:

@@ -157,9 +516,9 @@ This approach is handy when you still want external links but need to block cert

---

## 3. Media Extraction
## 4. Media Extraction

### 3.1 Accessing `result.media`
### 4.1 Accessing `result.media`

By default, Crawl4AI collects images, audio, video URLs, and data tables it finds on the page. These are stored in `result.media`, a dictionary keyed by media type (e.g., `images`, `videos`, `audio`, `tables`).

@@ -237,7 +596,7 @@ Depending on your Crawl4AI version or scraping strategy, these dictionaries can

With these details, you can easily filter out or focus on certain images (for instance, ignoring images with very low scores or a different domain), or gather metadata for analytics.

### 3.2 Excluding External Images
### 4.2 Excluding External Images

If you’re dealing with heavy pages or want to skip third-party images (advertisements, for example), you can turn on:

61
docs/md_v2/core/llmtxt.md
Normal file
@@ -0,0 +1,61 @@
<div class="llmtxt-container">
<iframe id="llmtxt-frame" src="../../llmtxt/index.html" width="100%" style="border:none; display: block;" title="Crawl4AI LLM Context Builder"></iframe>
</div>

<script>
// Iframe height adjustment
function resizeLLMtxtIframe() {
    const iframe = document.getElementById('llmtxt-frame');
    if (iframe) {
        const headerHeight = parseFloat(getComputedStyle(document.documentElement).getPropertyValue('--header-height') || '55');
        const topOffset = headerHeight + 20;
        const availableHeight = window.innerHeight - topOffset;
        iframe.style.height = Math.max(800, availableHeight) + 'px';
    }
}

// Run immediately and on resize/load
resizeLLMtxtIframe();
let resizeTimer;
window.addEventListener('load', resizeLLMtxtIframe);
window.addEventListener('resize', () => {
    clearTimeout(resizeTimer);
    resizeTimer = setTimeout(resizeLLMtxtIframe, 150);
});

// Remove Footer & HR from parent page
document.addEventListener('DOMContentLoaded', () => {
    setTimeout(() => {
        const footer = window.parent.document.querySelector('footer');
        if (footer) {
            const hrBeforeFooter = footer.previousElementSibling;
            if (hrBeforeFooter && hrBeforeFooter.tagName === 'HR') {
                hrBeforeFooter.remove();
            }
            footer.remove();
            resizeLLMtxtIframe();
        }
    }, 100);
});
</script>

<style>
#terminal-mkdocs-main-content {
    padding: 0 !important;
    margin: 0;
    width: 100%;
    height: 100%;
    overflow: hidden;
}

#terminal-mkdocs-main-content .llmtxt-container {
    margin: 0;
    padding: 0;
    max-width: none;
    overflow: hidden;
}

#terminal-mkdocs-toc-panel {
    display: none !important;
}
</style>
@@ -8,11 +8,10 @@ To crawl a live web page, provide the URL starting with `http://` or `https://`,

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

async def crawl_web():
    config = CrawlerRunConfig(bypass_cache=True)
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/apple",
@@ -33,13 +32,12 @@ To crawl a local HTML file, prefix the file path with `file://`.

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

async def crawl_local_file():
    local_file_path = "/path/to/apple.html"  # Replace with your file path
    file_url = f"file://{local_file_path}"
    config = CrawlerRunConfig(bypass_cache=True)
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=file_url, config=config)
@@ -93,8 +91,7 @@ import os
import sys
import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

async def main():
    wikipedia_url = "https://en.wikipedia.org/wiki/apple"
@@ -104,7 +101,7 @@ async def main():
    async with AsyncWebCrawler() as crawler:
        # Step 1: Crawl the Web URL
        print("\n=== Step 1: Crawling the Wikipedia URL ===")
        web_config = CrawlerRunConfig(bypass_cache=True)
        web_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        result = await crawler.arun(url=wikipedia_url, config=web_config)

        if not result.success:
@@ -119,7 +116,7 @@ async def main():
        # Step 2: Crawl from the Local HTML File
        print("=== Step 2: Crawling from the Local HTML File ===")
        file_url = f"file://{html_file_path.resolve()}"
        file_config = CrawlerRunConfig(bypass_cache=True)
        file_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        local_result = await crawler.arun(url=file_url, config=file_config)

        if not local_result.success:
@@ -135,7 +132,7 @@ async def main():
        with open(html_file_path, 'r', encoding='utf-8') as f:
            raw_html_content = f.read()
        raw_html_url = f"raw:{raw_html_content}"
        raw_config = CrawlerRunConfig(bypass_cache=True)
        raw_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        raw_result = await crawler.arun(url=raw_html_url, config=raw_config)

        if not raw_result.success:

@@ -187,7 +187,7 @@ from crawl4ai import CrawlerRunConfig
bm25_filter = BM25ContentFilter(
    user_query="machine learning",
    bm25_threshold=1.2,
    use_stemming=True
    language="english"
)

md_generator = DefaultMarkdownGenerator(
@@ -200,7 +200,8 @@ config = CrawlerRunConfig(markdown_generator=md_generator)

- **`user_query`**: The term you want to focus on. BM25 tries to keep only content blocks relevant to that query.
- **`bm25_threshold`**: Raise it to keep fewer blocks; lower it to keep more.
- **`use_stemming`**: If `True`, variations of words match (e.g., “learn,” “learning,” “learnt”).
- **`use_stemming`** *(default `True`)*: Whether to apply stemming to the query and content.
- **`language (str)`**: Language for stemming (default: 'english').

**No query provided?** BM25 tries to glean a context from page metadata, or you can simply treat it as a scorched-earth approach that discards text with low generic score. Realistically, you want to supply a query for best results.

@@ -233,7 +234,7 @@ prune_filter = PruningContentFilter(
For intelligent content filtering and high-quality markdown generation, you can use the **LLMContentFilter**. This filter leverages LLMs to generate relevant markdown while preserving the original content's meaning and structure:

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, LLMConfig
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, LLMConfig, DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import LLMContentFilter

async def main():
@@ -255,9 +256,12 @@ async def main():
        chunk_token_threshold=4096,  # Adjust based on your needs
        verbose=True
    )

    md_generator = DefaultMarkdownGenerator(
        content_filter=filter,
        options={"ignore_links": True}
    )
    config = CrawlerRunConfig(
        content_filter=filter
        markdown_generator=md_generator,
    )

    async with AsyncWebCrawler() as crawler:

@@ -296,7 +296,7 @@ if __name__ == "__main__":
Once dynamic content is loaded, you can attach an **`extraction_strategy`** (like `JsonCssExtractionStrategy` or `LLMExtractionStrategy`). For example:

```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai import JsonCssExtractionStrategy

schema = {
    "name": "Commits",
@@ -340,4 +340,45 @@ Crawl4AI’s **page interaction** features let you:
3. **Handle** multi-step flows (like “Load More”) with partial reloads or persistent sessions.
4. Combine with **structured extraction** for dynamic sites.

With these tools, you can scrape modern, interactive webpages confidently. For advanced hooking, user simulation, or in-depth config, check the [API reference](../api/parameters.md) or related advanced docs. Happy scripting!

---

## 9. Virtual Scrolling

For sites that use **virtual scrolling** (where content is replaced rather than appended as you scroll, like Twitter or Instagram), Crawl4AI provides a dedicated `VirtualScrollConfig`:

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, VirtualScrollConfig

async def crawl_twitter_timeline():
    # Configure virtual scroll for Twitter-like feeds
    virtual_config = VirtualScrollConfig(
        container_selector="[data-testid='primaryColumn']",  # Twitter's main column
        scroll_count=30,               # Scroll 30 times
        scroll_by="container_height",  # Scroll by container height each time
        wait_after_scroll=1.0          # Wait 1 second after each scroll
    )

    config = CrawlerRunConfig(
        virtual_scroll_config=virtual_config
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://twitter.com/search?q=AI",
            config=config
        )
        # result.html now contains ALL tweets from the virtual scroll
```

### Virtual Scroll vs JavaScript Scrolling

| Feature | Virtual Scroll | JS Code Scrolling |
|---------|---------------|-------------------|
| **Use Case** | Content replaced during scroll | Content appended or simple scroll |
| **Configuration** | `VirtualScrollConfig` object | `js_code` with scroll commands |
| **Automatic Merging** | Yes - merges all unique content | No - captures final state only |
| **Best For** | Twitter, Instagram, virtual tables | Traditional pages, load more buttons |

For detailed examples and configuration options, see the [Virtual Scroll documentation](../advanced/virtual-scroll.md).

@@ -127,7 +127,7 @@ Crawl4AI can also extract structured data (JSON) using CSS or XPath selectors. B
> **New!** Crawl4AI now provides a powerful utility to automatically generate extraction schemas using LLM. This is a one-time cost that gives you a reusable schema for fast, LLM-free extractions:

```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai import JsonCssExtractionStrategy
from crawl4ai import LLMConfig

# Generate a schema (one-time cost)
@@ -157,7 +157,7 @@ Here's a basic extraction example:
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai import JsonCssExtractionStrategy

async def main():
    schema = {
@@ -212,7 +212,7 @@ import json
import asyncio
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai import LLMExtractionStrategy

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
@@ -272,7 +272,43 @@ if __name__ == "__main__":

---

## 7. Multi-URL Concurrency (Preview)
## 7. Adaptive Crawling (New!)

Crawl4AI now includes intelligent adaptive crawling that automatically determines when sufficient information has been gathered. Here's a quick example:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler

async def adaptive_example():
    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler)

        # Start adaptive crawling
        result = await adaptive.digest(
            start_url="https://docs.python.org/3/",
            query="async context managers"
        )

        # View results
        adaptive.print_stats()
        print(f"Crawled {len(result.crawled_urls)} pages")
        print(f"Achieved {adaptive.confidence:.0%} confidence")

if __name__ == "__main__":
    asyncio.run(adaptive_example())
```

**What's special about adaptive crawling?**
- **Automatic stopping**: Stops when sufficient information is gathered
- **Intelligent link selection**: Follows only relevant links
- **Confidence scoring**: Know how complete your information is

[Learn more about Adaptive Crawling →](adaptive-crawling.md)

---

## 8. Multi-URL Concurrency (Preview)

If you need to crawl multiple URLs in **parallel**, you can use `arun_many()`. By default, Crawl4AI employs a **MemoryAdaptiveDispatcher**, automatically adjusting concurrency based on system resources. Here’s a quick glimpse:

@@ -328,7 +364,7 @@ Some sites require multiple “page clicks” or dynamic JavaScript updates. Bel
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai import JsonCssExtractionStrategy

async def extract_structured_data_using_css_extractor():
    print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")

1121
docs/md_v2/core/url-seeding.md
Normal file
File diff suppressed because it is too large