Merge branch 'release/v0.7.0' - The Adaptive Intelligence Update

This commit is contained in:
UncleCode
2025-07-12 18:54:20 +08:00
320 changed files with 115071 additions and 514 deletions

View File

@@ -0,0 +1,347 @@
# Adaptive Web Crawling
## Introduction
Traditional web crawlers follow predetermined patterns, crawling pages blindly without knowing when they've gathered enough information. **Adaptive Crawling** changes this paradigm by introducing intelligence into the crawling process.
Think of it like research: when you're looking for information, you don't read every book in the library. You stop when you've found sufficient information to answer your question. That's exactly what Adaptive Crawling does for web scraping.
## Key Concepts
### The Problem It Solves
When crawling websites for specific information, you face two challenges:
1. **Under-crawling**: Stopping too early and missing crucial information
2. **Over-crawling**: Wasting resources by crawling irrelevant pages
Adaptive Crawling solves both by using a three-layer scoring system that determines when you have "enough" information.
### How It Works
The AdaptiveCrawler uses three metrics to measure information sufficiency:
- **Coverage**: How well your collected pages cover the query terms
- **Consistency**: Whether the information is coherent across pages
- **Saturation**: Detecting when new pages aren't adding new information
When these metrics indicate sufficient information has been gathered, crawling stops automatically.
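To make the three metrics concrete, here is a minimal, hypothetical sketch of a sufficiency check in the same spirit. The function names, formulas, and thresholds below are invented for illustration and are not the AdaptiveCrawler internals:

```python
# Illustrative sketch only -- not the library's actual scoring.
def coverage(query_terms, pages):
    """Fraction of query terms found in at least one collected page."""
    if not query_terms:
        return 0.0
    hits = {t for t in query_terms if any(t in p.lower() for p in pages)}
    return len(hits) / len(query_terms)

def novelty(new_page, seen_terms):
    """Share of a new page's terms not seen before (low = saturated)."""
    terms = set(new_page.lower().split())
    if not terms:
        return 0.0
    return len(terms - seen_terms) / len(terms)

def should_stop(coverage_score, novelty_score,
                coverage_threshold=0.8, novelty_floor=0.1):
    """Stop when coverage is high enough or new pages add almost nothing."""
    return coverage_score >= coverage_threshold or novelty_score < novelty_floor
```

The real implementation weighs consistency across pages as well, but the control loop follows the same shape: measure after each page, stop when the metrics say "enough".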
## Quick Start
### Basic Usage
```python
import asyncio

from crawl4ai import AsyncWebCrawler, AdaptiveCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        # Create an adaptive crawler
        adaptive = AdaptiveCrawler(crawler)

        # Start crawling with a query
        result = await adaptive.digest(
            start_url="https://docs.python.org/3/",
            query="async context managers"
        )

        # View statistics
        adaptive.print_stats()

        # Get the most relevant content
        relevant_pages = adaptive.get_relevant_content(top_k=5)
        for page in relevant_pages:
            print(f"- {page['url']} (score: {page['score']:.2f})")

if __name__ == "__main__":
    asyncio.run(main())
```
### Configuration Options
```python
from crawl4ai import AdaptiveConfig
config = AdaptiveConfig(
    confidence_threshold=0.7,   # Stop when 70% confident (default: 0.8)
    max_pages=20,               # Maximum pages to crawl (default: 50)
    top_k_links=3,              # Links to follow per page (default: 5)
    min_gain_threshold=0.05     # Minimum expected gain to continue (default: 0.1)
)

adaptive = AdaptiveCrawler(crawler, config=config)
```
## Crawling Strategies
Adaptive Crawling supports two distinct strategies for determining information sufficiency:
### Statistical Strategy (Default)
The statistical strategy uses pure information theory and term-based analysis:
- **Fast and efficient** - No API calls or model loading
- **Term-based coverage** - Analyzes query term presence and distribution
- **No external dependencies** - Works offline
- **Best for**: Well-defined queries with specific terminology
```python
# Default configuration uses statistical strategy
config = AdaptiveConfig(
    strategy="statistical",  # This is the default
    confidence_threshold=0.8
)
```
### Embedding Strategy
The embedding strategy uses semantic embeddings for deeper understanding:
- **Semantic understanding** - Captures meaning beyond exact term matches
- **Query expansion** - Automatically generates query variations
- **Gap-driven selection** - Identifies semantic gaps in knowledge
- **Validation-based stopping** - Uses held-out queries to validate coverage
- **Best for**: Complex queries, ambiguous topics, conceptual understanding
```python
# Configure embedding strategy
config = AdaptiveConfig(
    strategy="embedding",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",  # Default
    n_query_variations=10,                  # Generate 10 query variations
    embedding_min_confidence_threshold=0.1  # Stop if completely irrelevant
)

# With custom embedding provider (e.g., OpenAI)
config = AdaptiveConfig(
    strategy="embedding",
    embedding_llm_config={
        'provider': 'openai/text-embedding-3-small',
        'api_token': 'your-api-key'
    }
)
```
### Strategy Comparison
| Feature | Statistical | Embedding |
|---------|------------|-----------|
| **Speed** | Very fast | Moderate (API calls) |
| **Cost** | Free | Depends on provider |
| **Accuracy** | Good for exact terms | Excellent for concepts |
| **Dependencies** | None | Embedding model/API |
| **Query Understanding** | Literal | Semantic |
| **Best Use Case** | Technical docs, specific terms | Research, broad topics |
### Embedding Strategy Configuration
The embedding strategy offers fine-tuned control through several parameters:
```python
config = AdaptiveConfig(
    strategy="embedding",

    # Model configuration
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    embedding_llm_config=None,               # Use for API-based embeddings

    # Query expansion
    n_query_variations=10,                   # Number of query variations to generate

    # Coverage parameters
    embedding_coverage_radius=0.2,           # Distance threshold for coverage
    embedding_k_exp=3.0,                     # Exponential decay factor (higher = stricter)

    # Stopping criteria
    embedding_min_relative_improvement=0.1,  # Min improvement to continue
    embedding_validation_min_score=0.3,      # Min validation score
    embedding_min_confidence_threshold=0.1,  # Below this = irrelevant

    # Link selection
    embedding_overlap_threshold=0.85,        # Similarity for deduplication

    # Display confidence mapping
    embedding_quality_min_confidence=0.7,    # Min displayed confidence
    embedding_quality_max_confidence=0.95    # Max displayed confidence
)
```
### Handling Irrelevant Queries
The embedding strategy can detect when a query is completely unrelated to the content:
```python
# This will stop quickly with low confidence
result = await adaptive.digest(
    start_url="https://docs.python.org/3/",
    query="how to cook pasta"  # Irrelevant to Python docs
)

# Check if the query was irrelevant
if result.metrics.get('is_irrelevant', False):
    print("Query is unrelated to the content!")
```
## When to Use Adaptive Crawling
### Perfect For:
- **Research Tasks**: Finding comprehensive information about a topic
- **Question Answering**: Gathering sufficient context to answer specific queries
- **Knowledge Base Building**: Creating focused datasets for AI/ML applications
- **Competitive Intelligence**: Collecting complete information about specific products/features
### Not Recommended For:
- **Full Site Archiving**: When you need every page regardless of content
- **Structured Data Extraction**: When targeting specific, known page patterns
- **Real-time Monitoring**: When you need continuous updates
## Understanding the Output
### Confidence Score
The confidence score (0-1) indicates how sufficient the gathered information is:
- **0.0-0.3**: Insufficient information, needs more crawling
- **0.3-0.6**: Partial information, may answer basic queries
- **0.6-0.8**: Good coverage, can answer most queries
- **0.8-1.0**: Excellent coverage, comprehensive information
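As a convenience for your own pipelines, the bands above can be expressed as a tiny helper (a sketch; no such function ships with the library):

```python
def confidence_band(score: float) -> str:
    """Map a 0-1 confidence score to the bands described above."""
    if score < 0.3:
        return "insufficient"
    if score < 0.6:
        return "partial"
    if score < 0.8:
        return "good"
    return "excellent"

# Example use: gate downstream processing on the band, e.g.
# if confidence_band(confidence) in ("good", "excellent"): proceed()
```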
### Statistics Display
```python
adaptive.print_stats(detailed=False) # Summary table
adaptive.print_stats(detailed=True) # Detailed metrics
```
The summary shows:
- Pages crawled vs. confidence achieved
- Coverage, consistency, and saturation scores
- Crawling efficiency metrics
## Persistence and Resumption
### Saving Progress
```python
config = AdaptiveConfig(
    save_state=True,
    state_path="my_crawl_state.json"
)

# Crawl will auto-save progress
result = await adaptive.digest(start_url, query)
```
### Resuming a Crawl
```python
# Resume from saved state
result = await adaptive.digest(
    start_url,
    query,
    resume_from="my_crawl_state.json"
)
```
### Exporting Knowledge Base
```python
# Export collected pages to JSONL
adaptive.export_knowledge_base("knowledge_base.jsonl")
# Import into another session
new_adaptive = AdaptiveCrawler(crawler)
new_adaptive.import_knowledge_base("knowledge_base.jsonl")
```
## Best Practices
### 1. Query Formulation
- Use specific, descriptive queries
- Include key terms you expect to find
- Avoid overly broad queries
### 2. Threshold Tuning
- Start with default (0.8) for general use
- Lower to 0.6-0.7 for exploratory crawling
- Raise to 0.9+ for exhaustive coverage
### 3. Performance Optimization
- Use appropriate `max_pages` limits
- Adjust `top_k_links` based on site structure
- Enable caching for repeat crawls
### 4. Link Selection
- The crawler prioritizes links based on:
- Relevance to query
- Expected information gain
- URL structure and depth
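A toy version of such a priority score might combine these signals as follows. The weights, names, and formula here are invented for illustration; the library's internal heuristics differ:

```python
from urllib.parse import urlparse

def link_priority(url, link_text, query_terms, expected_gain):
    """Toy priority: query relevance + expected gain + shallow-URL bonus."""
    # Relevance: fraction of query terms appearing in the link text
    relevance = (sum(t in link_text.lower() for t in query_terms)
                 / max(len(query_terms), 1))
    # Depth: number of non-empty path segments (shallower ranks higher)
    depth = max(len([seg for seg in urlparse(url).path.split("/") if seg]), 1)
    return 0.5 * relevance + 0.3 * expected_gain + 0.2 / depth
```

Under this sketch, a shallow, on-topic link outranks a deep one with identical text, which mirrors the intuition behind the bullet points above.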
## Examples
### Research Assistant
```python
# Gather information about a programming concept
result = await adaptive.digest(
    start_url="https://realpython.com",
    query="python decorators implementation patterns"
)

# Get the most relevant excerpts
for doc in adaptive.get_relevant_content(top_k=3):
    print(f"\nFrom: {doc['url']}")
    print(f"Relevance: {doc['score']:.2%}")
    print(doc['content'][:500] + "...")
```
### Knowledge Base Builder
```python
# Build a focused knowledge base about machine learning
queries = [
    "supervised learning algorithms",
    "neural network architectures",
    "model evaluation metrics"
]

for query in queries:
    await adaptive.digest(
        start_url="https://scikit-learn.org/stable/",
        query=query
    )

# Export the combined knowledge base
adaptive.export_knowledge_base("ml_knowledge.jsonl")
```
### API Documentation Crawler
```python
# Intelligently crawl API documentation
config = AdaptiveConfig(
    confidence_threshold=0.85,  # Higher threshold for completeness
    max_pages=30
)

adaptive = AdaptiveCrawler(crawler, config)
result = await adaptive.digest(
    start_url="https://api.example.com/docs",
    query="authentication endpoints rate limits"
)
```
## Next Steps
- Learn about [Advanced Adaptive Strategies](../advanced/adaptive-strategies.md)
- Explore the [AdaptiveCrawler API Reference](../api/adaptive-crawler.md)
- See more [Examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples/adaptive_crawling)
## FAQ
**Q: How is this different from traditional crawling?**
A: Traditional crawling follows fixed patterns (BFS/DFS). Adaptive crawling makes intelligent decisions about which links to follow and when to stop based on information gain.
**Q: Can I use this with JavaScript-heavy sites?**
A: Yes! AdaptiveCrawler inherits all capabilities from AsyncWebCrawler, including JavaScript execution.
**Q: How does it handle large websites?**
A: The algorithm naturally limits crawling to relevant sections. Use `max_pages` as a safety limit.
**Q: Can I customize the scoring algorithms?**
A: Advanced users can implement custom strategies. See [Adaptive Strategies](../advanced/adaptive-strategies.md).

View File

@@ -252,7 +252,7 @@ The `clone()` method:
### Key fields to note
1. **`provider`**:
- Which LLM provoder to use.
- Which LLM provider to use.
- Possible values are `"ollama/llama3","groq/llama3-70b-8192","groq/llama3-8b-8192", "openai/gpt-4o-mini" ,"openai/gpt-4o","openai/o1-mini","openai/o1-preview","openai/o3-mini","openai/o3-mini-high","anthropic/claude-3-haiku-20240307","anthropic/claude-3-opus-20240229","anthropic/claude-3-sonnet-20240229","anthropic/claude-3-5-sonnet-20240620","gemini/gemini-pro","gemini/gemini-1.5-pro","gemini/gemini-2.0-flash","gemini/gemini-2.0-flash-exp","gemini/gemini-2.0-flash-lite-preview-02-05","deepseek/deepseek-chat"`<br/>*(default: `"openai/gpt-4o-mini"`)*
2. **`api_token`**:
@@ -273,8 +273,8 @@ In a typical scenario, you define **one** `BrowserConfig` for your crawler sessi
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig, LLMContentFilter, DefaultMarkdownGenerator
from crawl4ai import JsonCssExtractionStrategy
async def main():
# 1) Browser config: headless, bigger viewport, no proxy
@@ -298,7 +298,7 @@ async def main():
# 3) Example LLM content filtering
gemini_config = LLMConfig(
provider="gemini/gemini-1.5-pro"
provider="gemini/gemini-1.5-pro",
api_token = "env:GEMINI_API_TOKEN"
)
@@ -322,8 +322,9 @@ async def main():
)
md_generator = DefaultMarkdownGenerator(
content_filter=filter,
options={"ignore_links": True}
content_filter=filter,
options={"ignore_links": True}
)
# 4) Crawler run config: skip cache, use extraction
run_conf = CrawlerRunConfig(

View File

@@ -0,0 +1,395 @@
# C4A-Script: Visual Web Automation Made Simple
## What is C4A-Script?
C4A-Script is a powerful, human-readable domain-specific language (DSL) designed for web automation and interaction. Think of it as a simplified programming language that anyone can read and write, perfect for automating repetitive web tasks, testing user interfaces, or creating interactive demos.
### Why C4A-Script?
**Simple Syntax, Powerful Results**
```c4a
# Navigate and interact in plain English
GO https://example.com
WAIT `#search-box` 5
TYPE "Hello World"
CLICK `button[type="submit"]`
```
**Visual Programming Support**
C4A-Script comes with a built-in Blockly visual editor, allowing you to create scripts by dragging and dropping blocks - no coding experience required!
**Perfect for:**
- **UI Testing**: Automate user interaction flows
- **Demo Creation**: Build interactive product demonstrations
- **Data Entry**: Automate form filling and submissions
- **Testing Workflows**: Validate complex user journeys
- **Training**: Teach web automation without code complexity
## Getting Started: Your First Script
Let's create a simple script that searches for something on a website:
```c4a
# My first C4A-Script
GO https://duckduckgo.com
# Wait for the search box to appear
WAIT `input[name="q"]` 10
# Type our search query
TYPE "Crawl4AI"
# Press Enter to search
PRESS Enter
# Wait for results
WAIT `.results` 5
```
That's it! In just a few lines, you've automated a complete search workflow.
## Interactive Tutorial & Live Demo
Want to learn by doing? We've got you covered:
**🚀 [Live Demo](https://docs.crawl4ai.com/c4a-script/demo)** - Try C4A-Script in your browser right now!
**📁 [Tutorial Examples](/examples/c4a_script/)** - Complete examples with source code
**🛠️ [Local Tutorial](/examples/c4a_script/tutorial/)** - Run the interactive tutorial on your machine
### Running the Tutorial Locally
The tutorial includes a Flask-based web interface with:
- **Live Code Editor** with syntax highlighting
- **Visual Blockly Editor** for drag-and-drop programming
- **Recording Mode** to capture your actions and generate scripts
- **Timeline View** to see and edit your automation steps
```bash
# Clone and navigate to the tutorial
cd docs/examples/c4a_script/tutorial/
# Install dependencies
pip install flask
# Launch the tutorial server
python app.py
# Open http://localhost:5000 in your browser
```
## Core Concepts
### Commands and Syntax
C4A-Script uses simple, English-like commands. Each command does one specific thing:
```c4a
# Comments start with #
COMMAND parameter1 parameter2
# Most commands use CSS selectors in backticks
CLICK `#submit-button`
# Text content goes in quotes
TYPE "Hello, World!"
# Numbers are used directly
WAIT 3
```
### Selectors: Finding Elements
C4A-Script uses CSS selectors to identify elements on the page:
```c4a
# By ID
CLICK `#login-button`
# By class
CLICK `.submit-btn`
# By attribute
CLICK `button[type="submit"]`
# By text content
CLICK `button:contains("Sign In")`
# Complex selectors
CLICK `.form-container input[name="email"]`
```
### Variables and Dynamic Content
Store and reuse values with variables:
```c4a
# Set a variable
SETVAR username = "john@example.com"
SETVAR password = "secret123"
# Use variables (prefix with $)
TYPE $username
PRESS Tab
TYPE $password
```
## Command Categories
### 🧭 Navigation Commands
Move around the web like a user would:
| Command | Purpose | Example |
|---------|---------|---------|
| `GO` | Navigate to URL | `GO https://example.com` |
| `RELOAD` | Refresh current page | `RELOAD` |
| `BACK` | Go back in history | `BACK` |
| `FORWARD` | Go forward in history | `FORWARD` |
### ⏱️ Wait Commands
Ensure elements are ready before interacting:
| Command | Purpose | Example |
|---------|---------|---------|
| `WAIT` | Wait for time/element/text | `WAIT 3` or `WAIT \`#element\` 10` |
### 🖱️ Mouse Commands
Click, drag, and move like a human:
| Command | Purpose | Example |
|---------|---------|---------|
| `CLICK` | Click element or coordinates | `CLICK \`button\`` or `CLICK 100 200` |
| `DOUBLE_CLICK` | Double-click element | `DOUBLE_CLICK \`.item\`` |
| `RIGHT_CLICK` | Right-click element | `RIGHT_CLICK \`#menu\`` |
| `SCROLL` | Scroll in direction | `SCROLL DOWN 500` |
| `DRAG` | Drag from point to point | `DRAG 100 100 500 300` |
### ⌨️ Keyboard Commands
Type text and press keys naturally:
| Command | Purpose | Example |
|---------|---------|---------|
| `TYPE` | Type text or variable | `TYPE "Hello"` or `TYPE $username` |
| `PRESS` | Press special keys | `PRESS Tab` or `PRESS Enter` |
| `CLEAR` | Clear input field | `CLEAR \`#search\`` |
| `SET` | Set input value directly | `SET \`#email\` "user@example.com"` |
### 🔀 Control Flow
Add logic and repetition to your scripts:
| Command | Purpose | Example |
|---------|---------|---------|
| `IF` | Conditional execution | `IF (EXISTS \`#popup\`) THEN CLICK \`#close\`` |
| `REPEAT` | Loop commands | `REPEAT (SCROLL DOWN 300, 5)` |
### 💾 Variables & Advanced
Store data and execute custom code:
| Command | Purpose | Example |
|---------|---------|---------|
| `SETVAR` | Create variable | `SETVAR email = "test@example.com"` |
| `EVAL` | Execute JavaScript | `EVAL \`console.log('Hello')\`` |
## Real-World Examples
### Example 1: Login Flow
```c4a
# Complete login automation
GO https://myapp.com/login
# Wait for page to load
WAIT `#login-form` 5
# Fill credentials
CLICK `#email`
TYPE "user@example.com"
PRESS Tab
TYPE "mypassword"
# Submit form
CLICK `button[type="submit"]`
# Wait for dashboard
WAIT `.dashboard` 10
```
### Example 2: E-commerce Shopping
```c4a
# Shopping automation with variables
SETVAR product = "laptop"
SETVAR budget = "1000"
GO https://shop.example.com
WAIT `#search-box` 3
# Search for product
TYPE $product
PRESS Enter
WAIT `.product-list` 5
# Filter by price
CLICK `.price-filter`
SET `#max-price` $budget
CLICK `.apply-filters`
# Select first result
WAIT `.product-item` 3
CLICK `.product-item:first-child`
```
### Example 3: Form Automation with Conditions
```c4a
# Smart form filling with error handling
GO https://forms.example.com
# Check if user is already logged in
IF (EXISTS `.user-menu`) THEN GO https://forms.example.com/new
IF (NOT EXISTS `.user-menu`) THEN CLICK `#login-link`
# Fill form
WAIT `#contact-form` 5
SET `#name` "John Doe"
SET `#email` "john@example.com"
SET `#message` "Hello from C4A-Script!"
# Handle popup if it appears
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept-cookies`
# Submit
CLICK `#submit-button`
WAIT `.success-message` 10
```
## Visual Programming with Blockly
C4A-Script includes a powerful visual programming interface built on Google Blockly. Perfect for:
- **Non-programmers** who want to create automation
- **Rapid prototyping** of automation workflows
- **Educational environments** for teaching automation concepts
- **Collaborative development** where visual representation helps communication
### Features:
- **Drag & Drop Interface**: Build scripts by connecting blocks
- **Real-time Sync**: Changes in visual mode instantly update the text script
- **Smart Block Types**: Blocks are categorized by function (Navigation, Actions, etc.)
- **Error Prevention**: Visual connections prevent syntax errors
- **Comment Support**: Add visual comment blocks for documentation
Try the visual editor in our [live demo](https://docs.crawl4ai.com/c4a-script/demo) or [local tutorial](/examples/c4a_script/tutorial/).
## Advanced Features
### Recording Mode
The tutorial interface includes a recording feature that watches your browser interactions and automatically generates C4A-Script commands:
1. Click "Record" in the tutorial interface
2. Perform actions in the browser preview
3. Watch as C4A-Script commands are generated in real-time
4. Edit and refine the generated script
### Error Handling and Debugging
C4A-Script provides clear error messages and debugging information:
```c4a
# Use comments for debugging
# This will wait up to 10 seconds for the element
WAIT `#slow-loading-element` 10
# Check if element exists before clicking
IF (EXISTS `#optional-button`) THEN CLICK `#optional-button`
# Use EVAL for custom debugging
EVAL `console.log("Current page title:", document.title)`
```
### Integration with Crawl4AI
C4A-Script integrates seamlessly with Crawl4AI's web crawling capabilities:
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

# Use C4A-Script for interaction before crawling
script = """
GO https://example.com
CLICK `#load-more-content`
WAIT `.dynamic-content` 5
"""

config = CrawlerRunConfig(
    js_code=script,
    wait_for=".dynamic-content"
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://example.com", config=config)
    print(result.markdown)
```
## Best Practices
### 1. Always Wait for Elements
```c4a
# Bad: Clicking immediately
CLICK `#button`
# Good: Wait for element to appear
WAIT `#button` 5
CLICK `#button`
```
### 2. Use Descriptive Comments
```c4a
# Login to user account
GO https://myapp.com/login
WAIT `#login-form` 5
# Enter credentials
TYPE "user@example.com"
PRESS Tab
TYPE "password123"
# Submit and wait for redirect
CLICK `#submit-button`
WAIT `.dashboard` 10
```
### 3. Handle Variable Conditions
```c4a
# Handle different page states
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept-cookies`
IF (EXISTS `.popup-modal`) THEN CLICK `.close-modal`
# Proceed with main workflow
CLICK `#main-action`
```
### 4. Use Variables for Reusability
```c4a
# Define once, use everywhere
SETVAR base_url = "https://myapp.com"
SETVAR test_email = "test@example.com"
GO $base_url/login
SET `#email` $test_email
```
## Getting Help
- **📖 [Complete Examples](/examples/c4a_script/)** - Real-world automation scripts
- **🎮 [Interactive Tutorial](/examples/c4a_script/tutorial/)** - Hands-on learning environment
- **📋 [API Reference](/api/c4a-script-reference/)** - Detailed command documentation
- **🌐 [Live Demo](https://docs.crawl4ai.com/c4a-script/demo)** - Try it in your browser
## What's Next?
Ready to dive deeper? Check out:
1. **[API Reference](/api/c4a-script-reference/)** - Complete command documentation
2. **[Tutorial Examples](/examples/c4a_script/)** - Copy-paste ready scripts
3. **[Local Tutorial Setup](/examples/c4a_script/tutorial/)** - Run the full development environment
C4A-Script makes web automation accessible to everyone. Whether you're a developer automating tests, a designer creating interactive demos, or a business user streamlining repetitive tasks, C4A-Script has the tools you need.
*Start automating today - your future self will thank you!* 🚀

View File

@@ -17,6 +17,9 @@
- [Configuration Reference](#configuration-reference)
- [Best Practices & Tips](#best-practices--tips)
## Installation
The Crawl4AI CLI will be installed automatically when you install the library.
## Basic Usage
The Crawl4AI CLI (`crwl`) provides a simple interface to the Crawl4AI library:

View File

@@ -191,7 +191,7 @@ You can combine content selection with a more advanced extraction strategy. For
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai import JsonCssExtractionStrategy
async def main():
# Minimal schema for repeated items
@@ -243,7 +243,7 @@ import asyncio
import json
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai import LLMExtractionStrategy
class ArticleData(BaseModel):
headline: str
@@ -288,7 +288,7 @@ Below is a short function that unifies **CSS selection**, **exclusion** logic, a
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai import JsonCssExtractionStrategy
async def extract_main_articles(url: str):
schema = {

View File

@@ -138,7 +138,7 @@ If you run a JSON-based extraction strategy (CSS, XPath, LLM, etc.), the structu
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai import JsonCssExtractionStrategy
async def main():
schema = {

View File

@@ -58,13 +58,15 @@ Pull and run images directly from Docker Hub without building locally.
#### 1. Pull the Image
Our latest release candidate is `0.6.0-r2`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
Our latest release candidate is `0.7.0-r1`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
> ⚠️ **Important Note**: The `latest` tag currently points to the stable `0.6.0` version. After testing and validation, `0.7.0` (without -r1) will be released and `latest` will be updated. For now, please use `0.7.0-r1` to test the new features.
```bash
# Pull the release candidate (recommended for latest features)
docker pull unclecode/crawl4ai:0.6.0-r1
# Pull the release candidate (for testing new features)
docker pull unclecode/crawl4ai:0.7.0-r1
# Or pull the latest stable version
# Or pull the current stable version (0.6.0)
docker pull unclecode/crawl4ai:latest
```
@@ -124,7 +126,7 @@ docker stop crawl4ai && docker rm crawl4ai
#### Docker Hub Versioning Explained
* **Image Name:** `unclecode/crawl4ai`
* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.6.0-r2`)
* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.7.0-r1`)
* `LIBRARY_VERSION`: The semantic version of the core `crawl4ai` Python library
* `SUFFIX`: Optional tag for release candidates (``) and revisions (`r1`)
* **`latest` Tag:** Points to the most recent stable version

View File

@@ -28,6 +28,11 @@ This page provides a comprehensive list of example scripts that demonstrate vari
| Example | Description | Link |
|---------|-------------|------|
| Deep Crawling | An extensive tutorial on deep crawling capabilities, demonstrating BFS and BestFirst strategies, stream vs. non-stream execution, filters, scorers, and advanced configurations. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/deepcrawl_example.py) |
| Virtual Scroll | Comprehensive examples for handling virtualized scrolling on sites like Twitter, Instagram. Demonstrates different scrolling scenarios with local test server. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/virtual_scroll_example.py) |
| Adaptive Crawling | Demonstrates intelligent crawling that automatically determines when sufficient information has been gathered. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/adaptive_crawling/) |
| Dispatcher | Shows how to use the crawl dispatcher for advanced workload management. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/dispatcher_example.py) |
| Storage State | Tutorial on managing browser storage state for persistence. | [View Guide](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/storage_state_tutorial.md) |
| Network Console Capture | Demonstrates how to capture and analyze network requests and console logs. | [View Code](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/network_console_capture_example.py) |

View File

@@ -137,7 +137,7 @@ if __name__ == "__main__":
- Higher → fewer chunks but more relevant.
- Lower → more inclusive.
> In more advanced scenarios, you might see parameters like `use_stemming`, `case_sensitive`, or `priority_tags` to refine how text is tokenized or weighted.
> In more advanced scenarios, you might see parameters like `language`, `case_sensitive`, or `priority_tags` to refine how text is tokenized or weighted.
---
@@ -242,4 +242,4 @@ class MyCustomFilter(RelevantContentFilter):
With these tools, you can **zero in** on the text that truly matters, ignoring spammy or boilerplate content, and produce a concise, relevant “fit markdown” for your AI or data pipelines. Happy pruning and searching!
- Last Updated: 2025-01-01
- Last Updated: 2025-01-01

View File

@@ -105,7 +105,366 @@ result.links = {
---
## 2. Domain Filtering
## 2. Advanced Link Head Extraction & Scoring
Ever wanted to not just extract links, but also get the actual content (title, description, metadata) from those linked pages? And score them for relevance? This is exactly what Link Head Extraction does - it fetches the `<head>` section from each discovered link and scores them using multiple algorithms.
### 2.1 Why Link Head Extraction?
When you crawl a page, you get hundreds of links. But which ones are actually valuable? Link Head Extraction solves this by:
1. **Fetching head content** from each link (title, description, meta tags)
2. **Scoring links intrinsically** based on URL quality, text relevance, and context
3. **Scoring links contextually** using BM25 algorithm when you provide a search query
4. **Combining scores intelligently** to give you a final relevance ranking
### 2.2 Complete Working Example
Here's a full example you can copy, paste, and run immediately:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.async_configs import LinkPreviewConfig

async def extract_link_heads_example():
    """
    Complete example showing link head extraction with scoring.
    This will crawl a documentation site and extract head content from internal links.
    """
    # Configure link head extraction
    config = CrawlerRunConfig(
        # Enable link head extraction with detailed configuration
        link_preview_config=LinkPreviewConfig(
            include_internal=True,      # Extract from internal links
            include_external=False,     # Skip external links for this example
            max_links=10,               # Limit to 10 links for demo
            concurrency=5,              # Process 5 links simultaneously
            timeout=10,                 # 10 second timeout per link
            query="API documentation guide",  # Query for contextual scoring
            score_threshold=0.3,        # Only include links scoring above 0.3
            verbose=True                # Show detailed progress
        ),
        # Enable intrinsic scoring (URL quality, text relevance)
        score_links=True,
        # Keep output clean
        only_text=True,
        verbose=True
    )

    async with AsyncWebCrawler() as crawler:
        # Crawl a documentation site (great for testing)
        result = await crawler.arun("https://docs.python.org/3/", config=config)

        if result.success:
            print(f"✅ Successfully crawled: {result.url}")
            print(f"📄 Page title: {result.metadata.get('title', 'No title')}")

            # Access links (now enhanced with head data and scores)
            internal_links = result.links.get("internal", [])
            external_links = result.links.get("external", [])
            print(f"\n🔗 Found {len(internal_links)} internal links")
            print(f"🌍 Found {len(external_links)} external links")

            # Count links with head data
            links_with_head = [link for link in internal_links
                               if link.get("head_data") is not None]
            print(f"🧠 Links with head data extracted: {len(links_with_head)}")

            # Show the top 3 scoring links
            print("\n🏆 Top 3 Links with Full Scoring:")
            for i, link in enumerate(links_with_head[:3]):
                print(f"\n{i+1}. {link['href']}")
                print(f"   Link Text: '{link.get('text', 'No text')[:50]}...'")

                # Show all three score types
                intrinsic = link.get('intrinsic_score')
                contextual = link.get('contextual_score')
                total = link.get('total_score')

                if intrinsic is not None:
                    print(f"   📊 Intrinsic Score: {intrinsic:.2f}/10.0 (URL quality & context)")
                if contextual is not None:
                    print(f"   🎯 Contextual Score: {contextual:.3f} (BM25 relevance to query)")
                if total is not None:
                    print(f"   ⭐ Total Score: {total:.3f} (combined final score)")

                # Show extracted head data
                head_data = link.get("head_data", {})
                if head_data:
                    title = head_data.get("title", "No title")
                    description = head_data.get("meta", {}).get("description", "No description")
                    print(f"   📰 Title: {title[:60]}...")
                    if description:
                        print(f"   📝 Description: {description[:80]}...")

                # Show extraction status
                status = link.get("head_extraction_status", "unknown")
                print(f"   ✅ Extraction Status: {status}")
        else:
            print(f"❌ Crawl failed: {result.error_message}")

# Run the example
if __name__ == "__main__":
    asyncio.run(extract_link_heads_example())
```
**Expected Output:**
```
✅ Successfully crawled: https://docs.python.org/3/
📄 Page title: 3.13.5 Documentation
🔗 Found 53 internal links
🌍 Found 1 external links
🧠 Links with head data extracted: 10
🏆 Top 3 Links with Full Scoring:
1. https://docs.python.org/3.15/
Link Text: 'Python 3.15 (in development)...'
📊 Intrinsic Score: 4.17/10.0 (URL quality & context)
🎯 Contextual Score: 1.000 (BM25 relevance to query)
⭐ Total Score: 5.917 (combined final score)
📰 Title: 3.15.0a0 Documentation...
📝 Description: The official Python documentation...
✅ Extraction Status: valid
```
### 2.3 Configuration Deep Dive
The `LinkPreviewConfig` class supports these options:
```python
from crawl4ai.async_configs import LinkPreviewConfig

link_preview_config = LinkPreviewConfig(
    # BASIC SETTINGS
    verbose=True,            # Show detailed logs (recommended while learning)

    # LINK FILTERING
    include_internal=True,   # Include same-domain links
    include_external=True,   # Include different-domain links
    max_links=50,            # Maximum links to process (prevents overload)

    # PATTERN FILTERING
    include_patterns=[       # Only process links matching these patterns
        "*/docs/*",
        "*/api/*",
        "*/reference/*"
    ],
    exclude_patterns=[       # Skip links matching these patterns
        "*/login*",
        "*/admin*"
    ],

    # PERFORMANCE SETTINGS
    concurrency=10,          # How many links to process simultaneously
    timeout=5,               # Seconds to wait per link

    # RELEVANCE SCORING
    query="machine learning API",  # Query for BM25 contextual scoring
    score_threshold=0.3,     # Only include links above this score
)
```
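The `include_patterns` / `exclude_patterns` options use glob-style wildcards. Conceptually, the filtering behaves like Python's `fnmatch` matching — here is a simplified sketch of the idea (`should_process` is a hypothetical helper, not the library's actual code):

```python
from fnmatch import fnmatch

def should_process(url, include_patterns=None, exclude_patterns=None):
    # Hypothetical helper: exclusions win, then inclusions must match
    if exclude_patterns and any(fnmatch(url, p) for p in exclude_patterns):
        return False
    if include_patterns:
        return any(fnmatch(url, p) for p in include_patterns)
    return True

should_process("https://x.com/docs/intro", ["*/docs/*"], ["*/login*"])  # True
should_process("https://x.com/login", ["*/docs/*"], ["*/login*"])       # False
```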
### 2.4 Understanding the Three Score Types
Each extracted link gets three different scores:
#### 1. **Intrinsic Score (0-10)** - URL and Content Quality
Based on URL structure, link text quality, and page context:
```python
# High intrinsic score indicators:
# ✅ Clean URL structure (docs.python.org/api/reference)
# ✅ Meaningful link text ("API Reference Guide")
# ✅ Relevant to page context
# ✅ Not buried deep in navigation
# Low intrinsic score indicators:
# ❌ Random URLs (site.com/x7f9g2h)
# ❌ No link text or generic text ("Click here")
# ❌ Unrelated to page content
```
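As a thought experiment, the indicators above can be approximated with a toy heuristic. This is entirely illustrative — `toy_intrinsic_score` is not Crawl4AI's actual scorer, just a sketch of the kinds of signals described:

```python
import re

def toy_intrinsic_score(url, text):
    # Hypothetical heuristic inspired by the indicators above
    score = 5.0
    path = url.split("://", 1)[-1]
    if re.search(r"/(docs|api|reference|guide)s?/", "/" + path):
        score += 2.0  # clean, meaningful URL section
    if re.search(r"[0-9a-f]{6,}", path):
        score -= 2.0  # random-looking URL segment
    if not text or text.strip().lower() in {"click here", "link", "here"}:
        score -= 2.0  # missing or generic link text
    elif len(text.split()) >= 2:
        score += 1.0  # descriptive link text
    return max(0.0, min(10.0, score))

toy_intrinsic_score("https://docs.python.org/api/reference", "API Reference Guide")  # high
toy_intrinsic_score("https://site.com/x7f9g2h", "Click here")                        # low
```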
#### 2. **Contextual Score (0-1)** - BM25 Relevance to Query
Only available when you provide a `query`. It uses the BM25 algorithm against the extracted head content:
```python
# Example: query = "machine learning tutorial"
# High contextual score: Link to "Complete Machine Learning Guide"
# Low contextual score: Link to "Privacy Policy"
```
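BM25 rewards links whose head content shares informative terms with your query. A minimal, self-contained sketch of the scoring idea (not the library's implementation):

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Minimal BM25 sketch: score one tokenized document against a query."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)   # rarity weight
        tf = doc_terms.count(term)                        # term frequency
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

corpus = [
    "complete machine learning guide for beginners".split(),
    "privacy policy and terms of service".split(),
]
query = "machine learning tutorial".split()
scores = [bm25_score(query, doc, corpus) for doc in corpus]
# The ML guide outscores the privacy policy for this query
```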
#### 3. **Total Score** - Smart Combination
Intelligently combines intrinsic and contextual scores with fallbacks:
```python
# When both scores available: (intrinsic * 0.3) + (contextual * 0.7)
# When only intrinsic: uses intrinsic score
# When only contextual: uses contextual score
# When neither: not calculated
```
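The fallback logic above can be written out as a small helper. This is a sketch mirroring the documented weights; `combine_scores` is a hypothetical name, not a Crawl4AI API:

```python
def combine_scores(intrinsic=None, contextual=None):
    # Sketch of the documented combination: weighted blend with fallbacks
    if intrinsic is not None and contextual is not None:
        return intrinsic * 0.3 + contextual * 0.7
    if intrinsic is not None:
        return intrinsic
    if contextual is not None:
        return contextual
    return None  # neither score available
```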
### 2.5 Practical Use Cases
#### Use Case 1: Research Assistant
Find the most relevant documentation pages:
```python
async def research_assistant():
    config = CrawlerRunConfig(
        link_preview_config=LinkPreviewConfig(
            include_internal=True,
            include_external=True,
            include_patterns=["*/docs/*", "*/tutorial/*", "*/guide/*"],
            query="machine learning neural networks",
            max_links=20,
            score_threshold=0.5,  # Only high-relevance links
            verbose=True
        ),
        score_links=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://scikit-learn.org/", config=config)

        if result.success:
            # Get high-scoring links
            good_links = [link for link in result.links.get("internal", [])
                          if link.get("total_score", 0) > 0.7]

            print(f"🎯 Found {len(good_links)} highly relevant links:")
            for link in good_links[:5]:
                print(f"  • {link['total_score']:.3f} - {link['href']}")
                print(f"    {link.get('head_data', {}).get('title', 'No title')}")
```
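Since each returned link is a plain dict, further ranking is ordinary Python (toy data below for illustration):

```python
# Toy link dicts shaped like the ones in result.links["internal"]
links = [
    {"href": "https://x.com/docs/a", "total_score": 0.91},
    {"href": "https://x.com/blog/b", "total_score": 0.42},
    {"href": "https://x.com/docs/c", "total_score": 0.77},
]

# Keep links above a threshold, best first
top = sorted(
    (l for l in links if l.get("total_score", 0) > 0.7),
    key=lambda l: l["total_score"],
    reverse=True,
)
[l["href"] for l in top]  # ['https://x.com/docs/a', 'https://x.com/docs/c']
```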
#### Use Case 2: Content Discovery
Find all API endpoints and references:
```python
async def api_discovery():
    config = CrawlerRunConfig(
        link_preview_config=LinkPreviewConfig(
            include_internal=True,
            include_patterns=["*/api/*", "*/reference/*"],
            exclude_patterns=["*/deprecated/*"],
            max_links=100,
            concurrency=15,
            verbose=False  # Clean output
        ),
        score_links=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://docs.example-api.com/", config=config)

        if result.success:
            api_links = result.links.get("internal", [])

            # Group by endpoint type
            endpoints = {}
            for link in api_links:
                if link.get("head_data"):
                    title = link["head_data"].get("title", "")
                    if "GET" in title:
                        endpoints.setdefault("GET", []).append(link)
                    elif "POST" in title:
                        endpoints.setdefault("POST", []).append(link)

            for method, links in endpoints.items():
                print(f"\n{method} Endpoints ({len(links)}):")
                for link in links[:3]:
                    print(f"  • {link['href']}")
```
#### Use Case 3: Link Quality Analysis
Analyze website structure and content quality:
```python
async def quality_analysis():
    config = CrawlerRunConfig(
        link_preview_config=LinkPreviewConfig(
            include_internal=True,
            max_links=200,
            concurrency=20,
        ),
        score_links=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://your-website.com/", config=config)

        if result.success:
            links = result.links.get("internal", [])

            # Analyze intrinsic scores
            scores = [link.get('intrinsic_score', 0) for link in links]
            avg_score = sum(scores) / len(scores) if scores else 0

            print("📊 Link Quality Analysis:")
            print(f"   Average intrinsic score: {avg_score:.2f}/10.0")
            print(f"   High quality links (>7.0): {len([s for s in scores if s > 7.0])}")
            print(f"   Low quality links (<3.0): {len([s for s in scores if s < 3.0])}")

            # Find problematic links
            bad_links = [link for link in links
                         if link.get('intrinsic_score', 0) < 2.0]
            if bad_links:
                print("\n⚠️ Links needing attention:")
                for link in bad_links[:5]:
                    print(f"   {link['href']} (score: {link.get('intrinsic_score', 0):.1f})")
```
### 2.6 Performance Tips
1. **Start Small**: Begin with `max_links=10` to understand the feature
2. **Use Patterns**: Filter with `include_patterns` to focus on relevant sections
3. **Adjust Concurrency**: Higher concurrency is faster but uses more resources
4. **Set Timeouts**: Use `timeout=5` to prevent hanging on slow sites
5. **Use Score Thresholds**: Filter out low-quality links with `score_threshold`
### 2.7 Troubleshooting
**No head data extracted?**
```python
# Check your configuration:
config = CrawlerRunConfig(
    link_preview_config=LinkPreviewConfig(
        verbose=True  # ← Enable to see what's happening
    )
)
```
**Scores showing as None?**
```python
# Make sure scoring is enabled:
config = CrawlerRunConfig(
    score_links=True,  # ← Enable intrinsic scoring
    link_preview_config=LinkPreviewConfig(
        query="your search terms"  # ← For contextual scoring
    )
)
```
**Process taking too long?**
```python
# Optimize performance:
link_preview_config = LinkPreviewConfig(
    max_links=20,                       # ← Reduce the number of links
    concurrency=10,                     # ← Increase parallelism
    timeout=3,                          # ← Shorter timeout
    include_patterns=["*/important/*"]  # ← Focus on key areas
)
```
---
## 3. Domain Filtering
Some websites contain hundreds of third-party or affiliate links. You can filter out certain domains at **crawl time** by configuring the crawler. The most relevant parameters in `CrawlerRunConfig` are:
- **`exclude_social_media_links`**: If `True`, automatically skip known social platforms.
- **`exclude_domains`**: Provide a list of custom domains you want to exclude (e.g., `["spammyads.com", "tracker.net"]`).
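Conceptually, `exclude_domains` boils down to a hostname check that also catches subdomains. An illustrative sketch (not the library's code):

```python
from urllib.parse import urlparse

EXCLUDE_DOMAINS = {"spammyads.com", "tracker.net"}

def is_excluded(url):
    host = urlparse(url).netloc.lower()
    # match the domain itself and any of its subdomains
    return any(host == d or host.endswith("." + d) for d in EXCLUDE_DOMAINS)

is_excluded("https://ads.spammyads.com/banner")  # True
is_excluded("https://example.com/page")          # False
```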
### 3.1 Example: Excluding External & Social Media Links
```python
import asyncio
    # ...
asyncio.run(main())
```
### 3.2 Example: Excluding Specific Domains
If you want to let external links in, but specifically exclude a domain (e.g., `suspiciousads.com`), do this:
This approach is handy when you still want external links but need to block certain domains.
---
## 4. Media Extraction
### 4.1 Accessing `result.media`
By default, Crawl4AI collects images, audio, video URLs, and data tables it finds on the page. These are stored in `result.media`, a dictionary keyed by media type (e.g., `images`, `videos`, `audio`, `tables`).
With these details, you can easily filter out or focus on certain images (for instance, ignoring images with very low scores or a different domain), or gather metadata for analytics.
### 4.2 Excluding External Images
If you're dealing with heavy pages or want to skip third-party images (advertisements, for example), you can turn on:
---

**New file: `docs/md_v2/core/llmtxt.md`**
<div class="llmtxt-container">
<iframe id="llmtxt-frame" src="../../llmtxt/index.html" width="100%" style="border:none; display: block;" title="Crawl4AI LLM Context Builder"></iframe>
</div>
<script>
// Iframe height adjustment
function resizeLLMtxtIframe() {
    const iframe = document.getElementById('llmtxt-frame');
    if (iframe) {
        const headerHeight = parseFloat(getComputedStyle(document.documentElement).getPropertyValue('--header-height') || '55');
        const topOffset = headerHeight + 20;
        const availableHeight = window.innerHeight - topOffset;
        iframe.style.height = Math.max(800, availableHeight) + 'px';
    }
}

// Run immediately and on resize/load
resizeLLMtxtIframe();
let resizeTimer;
window.addEventListener('load', resizeLLMtxtIframe);
window.addEventListener('resize', () => {
    clearTimeout(resizeTimer);
    resizeTimer = setTimeout(resizeLLMtxtIframe, 150);
});

// Remove Footer & HR from parent page
document.addEventListener('DOMContentLoaded', () => {
    setTimeout(() => {
        const footer = window.parent.document.querySelector('footer');
        if (footer) {
            const hrBeforeFooter = footer.previousElementSibling;
            if (hrBeforeFooter && hrBeforeFooter.tagName === 'HR') {
                hrBeforeFooter.remove();
            }
            footer.remove();
            resizeLLMtxtIframe();
        }
    }, 100);
});
</script>
<style>
#terminal-mkdocs-main-content {
    padding: 0 !important;
    margin: 0;
    width: 100%;
    height: 100%;
    overflow: hidden;
}
#terminal-mkdocs-main-content .llmtxt-container {
    margin: 0;
    padding: 0;
    max-width: none;
    overflow: hidden;
}
#terminal-mkdocs-toc-panel {
    display: none !important;
}
</style>
---
To crawl a live web page, provide the URL starting with `http://` or `https://`.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

async def crawl_web():
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/apple",
            config=config
        )
        # ...
```
To crawl a local HTML file, prefix the file path with `file://`.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

async def crawl_local_file():
    local_file_path = "/path/to/apple.html"  # Replace with your file path
    file_url = f"file://{local_file_path}"
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=file_url, config=config)
        # ...
```
```python
import os
import sys
import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

async def main():
    wikipedia_url = "https://en.wikipedia.org/wiki/apple"
    # ...
```
```python
    async with AsyncWebCrawler() as crawler:
        # Step 1: Crawl the Web URL
        print("\n=== Step 1: Crawling the Wikipedia URL ===")
        web_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        result = await crawler.arun(url=wikipedia_url, config=web_config)

        if not result.success:
            # ...
```
```python
        # Step 2: Crawl from the Local HTML File
        print("=== Step 2: Crawling from the Local HTML File ===")
        file_url = f"file://{html_file_path.resolve()}"
        file_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        local_result = await crawler.arun(url=file_url, config=file_config)

        if not local_result.success:
            # ...
```
```python
        with open(html_file_path, 'r', encoding='utf-8') as f:
            raw_html_content = f.read()
        raw_html_url = f"raw:{raw_html_content}"
        raw_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
        raw_result = await crawler.arun(url=raw_html_url, config=raw_config)

        if not raw_result.success:
            # ...
```
---
```python
from crawl4ai import CrawlerRunConfig

bm25_filter = BM25ContentFilter(
    user_query="machine learning",
    bm25_threshold=1.2,
    language="english"
)

md_generator = DefaultMarkdownGenerator(
    content_filter=bm25_filter
)
config = CrawlerRunConfig(markdown_generator=md_generator)
```
- **`user_query`**: The term you want to focus on. BM25 tries to keep only content blocks relevant to that query.
- **`bm25_threshold`**: Raise it to keep fewer blocks; lower it to keep more.
- **`use_stemming`** *(default `True`)*: Whether to apply stemming to the query and content, so word variations match (e.g., “learn,” “learning,” “learnt”).
- **`language`** *(str, default `'english'`)*: Language used for stemming.
**No query provided?** BM25 tries to glean a context from page metadata, or you can simply treat it as a scorched-earth approach that discards text with low generic score. Realistically, you want to supply a query for best results.
For intelligent content filtering and high-quality markdown generation, you can use the **LLMContentFilter**. This filter leverages LLMs to generate relevant markdown while preserving the original content's meaning and structure:
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, LLMConfig, DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import LLMContentFilter

async def main():
    filter = LLMContentFilter(
        # ...
        chunk_token_threshold=4096,  # Adjust based on your needs
        verbose=True
    )

    md_generator = DefaultMarkdownGenerator(
        content_filter=filter,
        options={"ignore_links": True}
    )

    config = CrawlerRunConfig(
        markdown_generator=md_generator,
    )

    async with AsyncWebCrawler() as crawler:
        # ...
```

---
Once dynamic content is loaded, you can attach an **`extraction_strategy`** (like `JsonCssExtractionStrategy` or `LLMExtractionStrategy`). For example:
```python
from crawl4ai import JsonCssExtractionStrategy
schema = {
"name": "Commits",
    # ...
}
```

Crawl4AI's **page interaction** features let you:
3. **Handle** multi-step flows (like “Load More”) with partial reloads or persistent sessions.
4. Combine with **structured extraction** for dynamic sites.
With these tools, you can scrape modern, interactive webpages confidently. For advanced hooking, user simulation, or in-depth config, check the [API reference](../api/parameters.md) or related advanced docs. Happy scripting!
---
## 9. Virtual Scrolling
For sites that use **virtual scrolling** (where content is replaced rather than appended as you scroll, like Twitter or Instagram), Crawl4AI provides a dedicated `VirtualScrollConfig`:
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, VirtualScrollConfig

async def crawl_twitter_timeline():
    # Configure virtual scroll for Twitter-like feeds
    virtual_config = VirtualScrollConfig(
        container_selector="[data-testid='primaryColumn']",  # Twitter's main column
        scroll_count=30,               # Scroll 30 times
        scroll_by="container_height",  # Scroll by container height each time
        wait_after_scroll=1.0          # Wait 1 second after each scroll
    )

    config = CrawlerRunConfig(
        virtual_scroll_config=virtual_config
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://twitter.com/search?q=AI",
            config=config
        )
        # result.html now contains ALL tweets from the virtual scroll
```
### Virtual Scroll vs JavaScript Scrolling
| Feature | Virtual Scroll | JS Code Scrolling |
|---------|---------------|-------------------|
| **Use Case** | Content replaced during scroll | Content appended or simple scroll |
| **Configuration** | `VirtualScrollConfig` object | `js_code` with scroll commands |
| **Automatic Merging** | Yes - merges all unique content | No - captures final state only |
| **Best For** | Twitter, Instagram, virtual tables | Traditional pages, load more buttons |
For detailed examples and configuration options, see the [Virtual Scroll documentation](../advanced/virtual-scroll.md).
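The "automatic merging" row refers to deduplicating the content captured at each scroll step, since virtual scrollers replace DOM nodes as you scroll. The idea can be sketched as (a hypothetical simplification, not the library's code):

```python
def merge_virtual_scroll(snapshots):
    # Keep the first occurrence of each content chunk across scroll snapshots
    seen, merged = set(), []
    for snapshot in snapshots:
        for chunk in snapshot:
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged

merge_virtual_scroll([["a", "b"], ["b", "c"], ["c", "d"]])  # ['a', 'b', 'c', 'd']
```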
---
Crawl4AI can also extract structured data (JSON) using CSS or XPath selectors.
> **New!** Crawl4AI now provides a powerful utility to automatically generate extraction schemas using LLM. This is a one-time cost that gives you a reusable schema for fast, LLM-free extractions:
```python
from crawl4ai import JsonCssExtractionStrategy
from crawl4ai import LLMConfig
# Generate a schema (one-time cost)
    # ...
```

Here's a basic extraction example:

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode, JsonCssExtractionStrategy

async def main():
    schema = {
        # ...
    }
```
```python
import json
import asyncio
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig, LLMExtractionStrategy

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    # ...
```
---
## 7. Adaptive Crawling (New!)
Crawl4AI now includes intelligent adaptive crawling that automatically determines when sufficient information has been gathered. Here's a quick example:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler

async def adaptive_example():
    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler)

        # Start adaptive crawling
        result = await adaptive.digest(
            start_url="https://docs.python.org/3/",
            query="async context managers"
        )

        # View results
        adaptive.print_stats()
        print(f"Crawled {len(result.crawled_urls)} pages")
        print(f"Achieved {adaptive.confidence:.0%} confidence")

if __name__ == "__main__":
    asyncio.run(adaptive_example())
```
**What's special about adaptive crawling?**
- **Automatic stopping**: Stops when sufficient information is gathered
- **Intelligent link selection**: Follows only relevant links
- **Confidence scoring**: Know how complete your information is
[Learn more about Adaptive Crawling →](adaptive-crawling.md)
---
## 8. Multi-URL Concurrency (Preview)
If you need to crawl multiple URLs in **parallel**, you can use `arun_many()`. By default, Crawl4AI employs a **MemoryAdaptiveDispatcher**, automatically adjusting concurrency based on system resources. Here's a quick glimpse:
Some sites require multiple “page clicks” or dynamic JavaScript updates.
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy
async def extract_structured_data_using_css_extractor():
    print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
    # ...
```