Compare commits

...

37 Commits

Author SHA1 Message Date
UncleCode
9f9ea3bb3b chore: Clean up test artifacts and disable test workflow 2025-07-25 17:31:52 +08:00
UncleCode
d58b93c207 fix: Re-enable multi-platform Docker builds for ARM64 support 2025-07-25 16:38:11 +08:00
UncleCode
e2b4705010 fix: Use hardcoded Docker repository name to avoid masking issues 2025-07-25 15:52:26 +08:00
UncleCode
4a1abd5086 fix: Handle existing version on Test PyPI gracefully 2025-07-25 15:41:16 +08:00
UncleCode
04258cd4f2 fix: Speed up Docker test builds by using single platform and caching 2025-07-25 15:37:44 +08:00
UncleCode
84e462d9f8 Merge remote-tracking branch 'origin/develop' 2025-07-25 15:35:53 +08:00
UncleCode
9546773a07 fix: Move sentence-transformers to optional dependencies
- Moved sentence-transformers from core to optional dependencies in pyproject.toml
- Removed sentence-transformers from requirements.txt
- Added proper ImportError handling with helpful installation message
- This prevents ~2.5GB of NVIDIA CUDA libraries from being installed by default
- Users who need embedding features can install with: pip install 'crawl4ai[transformer]'
2025-07-24 21:24:40 +08:00
UncleCode
66a979ad11 fix: Install dependencies before version check in workflows 2025-07-24 21:01:36 +08:00
UncleCode
0c31e91b53 feat: Add CI/CD workflows for automated PyPI and Docker releases 2025-07-24 20:58:43 +08:00
ntohidi
1b6a31f88f fix: encode PDF results to base64 in /crawl endpoint. ref #1301 2025-07-23 13:52:18 +02:00
Nasrin
b8c261780f Merge pull request #1319 from volumetric/fix_for_bug_#1310
Removed the incorrect reference in browser_config variable
2025-07-23 12:45:12 +02:00
ntohidi
db6ad7a79d fix: update links in README and C4A-Script documentation for accuracy 2025-07-23 09:47:18 +02:00
Nasrin
004d514f33 Merge pull request #1265 from unclecode/feature/nasrin-cli-deep-crawl
Feature/CLI - deep-crawl: Add --deep-crawl CLI option with BFS/DFS/Best-First strategies and fix serialization error. ref #874
2025-07-23 09:40:33 +02:00
Vinit Agrawal
3a9e2c716e Remvoed the incorrect reference in browser_config variable 2025-07-18 10:01:00 +05:30
unclecode
0163bd797c Merge branch 'release/v0.7.1' 2025-07-17 17:42:04 +08:00
ntohidi
26bad799e4 chore: update version to 0.7.1 2025-07-17 11:37:41 +02:00
ntohidi
cf8badfe27 feat: cleanup unused code and enhance documentation for v0.7.1
- Remove unused StealthConfig from browser_manager.py
- Update LinkPreviewConfig import path in __init__.py and examples
- Fix infinity handling in content_scraping_strategy.py (use 0 instead of float('inf'))
- Remove sanitize_json_data functions from API endpoints
- Add comprehensive C4A Script documentation to release notes
- Update v0.7.0 release notes with improved code examples
- Create v0.7.1 release notes focusing on cleanup and documentation improvements
- Update demo files with corrected import paths and examples
- Fix virtual scroll and adaptive crawling examples across documentation

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-07-17 11:35:16 +02:00
ntohidi
ccbe3c105c refactor: improve link scoring output format in release notes 2025-07-17 09:13:20 +02:00
Nasrin
761c19d54b Merge pull request #1307 from unclecode/fix/json-infinity-serialization
fix: Handle infinity values in JSON serialization for API  responses
2025-07-16 13:34:25 +02:00
Nasrin
14b0ecb137 Merge pull request #1305 from unclecode/fix/release-notes-demo-code
Fix: Update release notes and demo code
2025-07-16 13:33:53 +02:00
ntohidi
0eaa9f9895 fix: handle infinity values in JSON serialization for API responses
- Add sanitize_json_data() function to convert infinity/NaN to JSON-compliant strings
- Fix /execute_js endpoint returning ValueError: Out of range float values are not JSON compliant: inf
- Fix /crawl endpoint batch responses with infinity values
- Fix /crawl/stream endpoint streaming responses with infinity values
- Fix /crawl/job endpoint background job responses with infinity values

The sanitize_json_data() function recursively processes response data:
- float('inf') → \"Infinity\"
- float('-inf') → \"-Infinity\"
- float('nan') → \"NaN\"

This prevents JSON serialization errors when JavaScript execution or crawling operations produce infinity values, ensuring all API endpoints return valid JSON.

Fixes: API endpoints crashing with infinity JSON serialization errors
Affects: /execute_js, /crawl, /crawl/stream, /crawl/job endpoints
2025-07-15 13:49:07 +02:00
ntohidi
1d1970ae69 docs: Update release notes and docs for v0.7.0 with teh correct parameters and explanations 2025-07-15 11:32:04 +02:00
ntohidi
205df1e330 docs: Fix virtual scroll configuration 2025-07-15 10:29:47 +02:00
ntohidi
2640dc73a5 docs: Enhance session management example for dynamic content crawling with improved JavaScript handling and extraction schema. ref #226 2025-07-15 10:19:29 +02:00
ntohidi
58024755c5 docs: Update adaptive crawling parameters and examples in README and release notes 2025-07-15 10:15:05 +02:00
UncleCode
dd5ee752cf docs: Add missing documentation pages to mkdocs.yml
- Added Adaptive Crawling to Core section
- Added URL Seeding to Core section
- Added Adaptive Strategies to Advanced section
2025-07-12 19:58:26 +08:00
UncleCode
bde1bba6a2 docs: Add missing documentation pages to mkdocs.yml
- Added Adaptive Crawling to Core section
- Added URL Seeding to Core section
- Added Adaptive Strategies to Advanced section
2025-07-12 19:56:33 +08:00
UncleCode
14f690d751 docs: Update documentation for v0.7.0 release
- Update mkdocs.yml site name to v0.7.x
- Add v0.7.0 to blog index as latest release
- Move v0.6.0 to Previous Releases section
- Copy release notes to proper location in docs/md_v2/blog/releases/
2025-07-12 19:08:17 +08:00
UncleCode
7b9ba3015f Merge branch 'release/v0.7.0' - The Adaptive Intelligence Update 2025-07-12 18:54:20 +08:00
UncleCode
0c8bb742b7 Release v0.7.0-r1: The Adaptive Intelligence Update
- Bump version to 0.7.0
- Add release notes and demo files
- Update README with v0.7.0 features
- Update Docker configurations for v0.7.0-r1
- Move v0.7.0 demo files to releases_review
- Fix BM25 scoring bug in URLSeeder

Major features:
- Adaptive Crawling with pattern learning
- Virtual Scroll support for infinite pages
- Link Preview with 3-layer scoring
- Async URL Seeder for massive discovery
- Performance optimizations
2025-07-12 18:51:13 +08:00
UncleCode
ba2ed53ff1 test(releases): Add test cases for release 0.7.0 2025-07-11 22:27:18 +08:00
UncleCode
a93efcb650 Merge PR #1285: 2025 APR, MAY, and JUN bug fixes 2025-07-11 21:22:34 +08:00
UncleCode
8794852a26 Merge PR #1285: 2025 APR, MAY, and JUN bug fixes 2025-07-11 21:22:03 +08:00
UncleCode
fb25a4a769 docs(examples): update crawl4ai showcase script
The crawl4ai showcase script has been significantly expanded to include more detailed examples and demonstrations. This includes live code examples, more detailed explanations, and a new real-world example. A new file, uv.lock, has also been added.
2025-07-11 20:55:37 +08:00
ntohidi
ee25c771d8 feat(cli): add deep crawling options with configurable strategies and max pages. ref #874 2025-07-02 14:07:23 +02:00
Aravind
02f3127ded Track Stargazers (#1249)
* Webhook for when repo is starred

* Send star data to google sheets to be saved

* change event name to watch

* Change message displayed on Discord

* Update .github/workflows/main.yml

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

---------

Co-authored-by: UncleCode <unclecode@kidocode.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2025-06-25 22:26:19 +08:00
UncleCode
b4bb0ccea0 Update simple-crawling.md
Fixing wrong documentation about th fit_markdown to assume its a direct parameter of CrawlerRunConfig, while it is NOT.
2025-06-08 11:33:28 +08:00
41 changed files with 7810 additions and 205 deletions

View File

@@ -9,16 +9,26 @@ on:
types: [opened]
discussion:
types: [created]
watch:
types: [started]
jobs:
notify-discord:
runs-on: ubuntu-latest
steps:
- name: Send to Google Apps Script (Stars only)
if: github.event_name == 'watch'
run: |
curl -fSs -X POST "${{ secrets.GOOGLE_SCRIPT_ENDPOINT }}" \
-H 'Content-Type: application/json' \
-d '{"url":"${{ github.event.sender.html_url }}"}'
- name: Set webhook based on event type
id: set-webhook
run: |
if [ "${{ github.event_name }}" == "discussion" ]; then
echo "webhook=${{ secrets.DISCORD_DISCUSSIONS_WEBHOOK }}" >> $GITHUB_OUTPUT
elif [ "${{ github.event_name }}" == "watch" ]; then
echo "webhook=${{ secrets.DISCORD_STAR_GAZERS }}" >> $GITHUB_OUTPUT
else
echo "webhook=${{ secrets.DISCORD_WEBHOOK }}" >> $GITHUB_OUTPUT
fi
@@ -31,5 +41,6 @@ jobs:
args: |
${{ github.event_name == 'issues' && format('📣 New issue created: **{0}** by {1} - {2}', github.event.issue.title, github.event.issue.user.login, github.event.issue.html_url) ||
github.event_name == 'issue_comment' && format('💬 New comment on issue **{0}** by {1} - {2}', github.event.issue.title, github.event.comment.user.login, github.event.comment.html_url) ||
github.event_name == 'pull_request' && format('🔄 New PR opened: **{0}** by {1} - {2}', github.event.pull_request.title, github.event.pull_request.user.login, github.event.pull_request.html_url) ||
github.event_name == 'pull_request' && format('🔄 New PR opened: **{0}** by {1} - {2}', github.event.pull_request.title, github.event.pull_request.user.login, github.event.pull_request.html_url) ||
github.event_name == 'watch' && format('⭐ {0} starred Crawl4AI 🥳! Check out their profile: {1}', github.event.sender.login, github.event.sender.html_url) ||
format('💬 New discussion started: **{0}** by {1} - {2}', github.event.discussion.title, github.event.discussion.user.login, github.event.discussion.html_url) }}

141
.github/workflows/release.yml vendored Normal file
View File

@@ -0,0 +1,141 @@
name: Release Pipeline
on:
push:
tags:
- 'v*'
- '!test-v*' # Exclude test tags
jobs:
release:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.12'
- name: Extract version from tag
id: get_version
run: |
TAG_VERSION=${GITHUB_REF#refs/tags/v}
echo "VERSION=$TAG_VERSION" >> $GITHUB_OUTPUT
echo "Releasing version: $TAG_VERSION"
- name: Install package dependencies
run: |
pip install -e .
- name: Check version consistency
run: |
TAG_VERSION=${{ steps.get_version.outputs.VERSION }}
PACKAGE_VERSION=$(python -c "from crawl4ai.__version__ import __version__; print(__version__)")
echo "Tag version: $TAG_VERSION"
echo "Package version: $PACKAGE_VERSION"
if [ "$TAG_VERSION" != "$PACKAGE_VERSION" ]; then
echo "❌ Version mismatch! Tag: $TAG_VERSION, Package: $PACKAGE_VERSION"
echo "Please update crawl4ai/__version__.py to match the tag version"
exit 1
fi
echo "✅ Version check passed: $TAG_VERSION"
- name: Install build dependencies
run: |
python -m pip install --upgrade pip
pip install build twine
- name: Build package
run: python -m build
- name: Check package
run: twine check dist/*
- name: Upload to PyPI
env:
TWINE_USERNAME: __token__
TWINE_PASSWORD: ${{ secrets.PYPI_TOKEN }}
run: |
echo "📦 Uploading to PyPI..."
twine upload dist/*
echo "✅ Package uploaded to https://pypi.org/project/crawl4ai/"
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Log in to Docker Hub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKER_USERNAME }}
password: ${{ secrets.DOCKER_TOKEN }}
- name: Extract major and minor versions
id: versions
run: |
VERSION=${{ steps.get_version.outputs.VERSION }}
MAJOR=$(echo $VERSION | cut -d. -f1)
MINOR=$(echo $VERSION | cut -d. -f1-2)
echo "MAJOR=$MAJOR" >> $GITHUB_OUTPUT
echo "MINOR=$MINOR" >> $GITHUB_OUTPUT
- name: Build and push Docker images
uses: docker/build-push-action@v5
with:
context: .
push: true
tags: |
unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}
unclecode/crawl4ai:${{ steps.versions.outputs.MINOR }}
unclecode/crawl4ai:${{ steps.versions.outputs.MAJOR }}
unclecode/crawl4ai:latest
platforms: linux/amd64,linux/arm64
- name: Create GitHub Release
uses: actions/create-release@v1
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
with:
tag_name: v${{ steps.get_version.outputs.VERSION }}
release_name: Release v${{ steps.get_version.outputs.VERSION }}
body: |
## 🎉 Crawl4AI v${{ steps.get_version.outputs.VERSION }} Released!
### 📦 Installation
**PyPI:**
```bash
pip install crawl4ai==${{ steps.get_version.outputs.VERSION }}
```
**Docker:**
```bash
docker pull unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}
docker pull unclecode/crawl4ai:latest
```
### 📝 What's Changed
See [CHANGELOG.md](https://github.com/${{ github.repository }}/blob/main/CHANGELOG.md) for details.
draft: false
prerelease: false
- name: Summary
run: |
echo "## 🚀 Release Complete!" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 📦 PyPI Package" >> $GITHUB_STEP_SUMMARY
echo "- Version: ${{ steps.get_version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY
echo "- URL: https://pypi.org/project/crawl4ai/" >> $GITHUB_STEP_SUMMARY
echo "- Install: \`pip install crawl4ai==${{ steps.get_version.outputs.VERSION }}\`" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 🐳 Docker Images" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.versions.outputs.MINOR }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.versions.outputs.MAJOR }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:latest\`" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 📋 GitHub Release" >> $GITHUB_STEP_SUMMARY
echo "https://github.com/${{ github.repository }}/releases/tag/v${{ steps.get_version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY

View File

@@ -0,0 +1,116 @@
name: Test Release Pipeline
on:
push:
tags:
- 'test-v*'
jobs:
test-release:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.12'
- name: Extract version from tag
id: get_version
run: |
TAG_VERSION=${GITHUB_REF#refs/tags/test-v}
echo "VERSION=$TAG_VERSION" >> $GITHUB_OUTPUT
echo "Testing with version: $TAG_VERSION"
- name: Install package dependencies
run: |
pip install -e .
- name: Check version consistency
run: |
TAG_VERSION=${{ steps.get_version.outputs.VERSION }}
PACKAGE_VERSION=$(python -c "from crawl4ai.__version__ import __version__; print(__version__)")
echo "Tag version: $TAG_VERSION"
echo "Package version: $PACKAGE_VERSION"
if [ "$TAG_VERSION" != "$PACKAGE_VERSION" ]; then
echo "❌ Version mismatch! Tag: $TAG_VERSION, Package: $PACKAGE_VERSION"
echo "Please update crawl4ai/__version__.py to match the tag version"
exit 1
fi
echo "✅ Version check passed: $TAG_VERSION"
- name: Install build dependencies
run: |
python -m pip install --upgrade pip
pip install build twine
- name: Build package
run: python -m build
- name: Check package
run: twine check dist/*
- name: Upload to Test PyPI
env:
TWINE_USERNAME: __token__
TWINE_PASSWORD: ${{ secrets.TEST_PYPI_TOKEN }}
run: |
echo "📦 Uploading to Test PyPI..."
twine upload --repository testpypi dist/* || {
if [ $? -eq 1 ]; then
echo "⚠️ Upload failed - likely version already exists on Test PyPI"
echo "Continuing anyway for test purposes..."
else
exit 1
fi
}
echo "✅ Test PyPI step complete"
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Log in to Docker Hub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKER_USERNAME }}
password: ${{ secrets.DOCKER_TOKEN }}
- name: Build and push Docker test images
uses: docker/build-push-action@v5
with:
context: .
push: true
tags: |
unclecode/crawl4ai:test-${{ steps.get_version.outputs.VERSION }}
unclecode/crawl4ai:test-latest
platforms: linux/amd64,linux/arm64
cache-from: type=gha
cache-to: type=gha,mode=max
- name: Summary
run: |
echo "## 🎉 Test Release Complete!" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 📦 Test PyPI Package" >> $GITHUB_STEP_SUMMARY
echo "- Version: ${{ steps.get_version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY
echo "- URL: https://test.pypi.org/project/crawl4ai/" >> $GITHUB_STEP_SUMMARY
echo "- Install: \`pip install -i https://test.pypi.org/simple/ crawl4ai==${{ steps.get_version.outputs.VERSION }}\`" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 🐳 Docker Test Images" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:test-${{ steps.get_version.outputs.VERSION }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:test-latest\`" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 🧹 Cleanup Commands" >> $GITHUB_STEP_SUMMARY
echo "\`\`\`bash" >> $GITHUB_STEP_SUMMARY
echo "# Remove test tag" >> $GITHUB_STEP_SUMMARY
echo "git tag -d test-v${{ steps.get_version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY
echo "git push origin :test-v${{ steps.get_version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "# Remove Docker test images" >> $GITHUB_STEP_SUMMARY
echo "docker rmi unclecode/crawl4ai:test-${{ steps.get_version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY
echo "docker rmi unclecode/crawl4ai:test-latest" >> $GITHUB_STEP_SUMMARY
echo "\`\`\`" >> $GITHUB_STEP_SUMMARY

View File

@@ -1,7 +1,7 @@
FROM python:3.12-slim-bookworm AS build
# C4ai version
ARG C4AI_VER=0.6.0
ARG C4AI_VER=0.7.0-r1
ENV C4AI_VERSION=$C4AI_VER
LABEL c4ai.version=$C4AI_VER

View File

@@ -26,9 +26,9 @@
Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
[✨ Check out latest update v0.6.0](#-recent-updates)
[✨ Check out latest update v0.7.0](#-recent-updates)
🎉 **Version 0.6.0 is now available!** This release candidate introduces World-aware Crawling with geolocation and locale settings, Table-to-DataFrame extraction, Browser pooling with pre-warming, Network and console traffic capture, MCP integration for AI tools, and a completely revamped Docker deployment! [Read the release notes →](https://docs.crawl4ai.com/blog)
🎉 **Version 0.7.0 is now available!** The Adaptive Intelligence Update introduces groundbreaking features: Adaptive Crawling that learns website patterns, Virtual Scroll support for infinite pages, intelligent Link Preview with 3-layer scoring, Async URL Seeder for massive discovery, and significant performance improvements. [Read the release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.0.md)
<details>
<summary>🤓 <strong>My Personal Story</strong></summary>
@@ -274,8 +274,8 @@ The new Docker implementation includes:
```bash
# Pull and run the latest release candidate
docker pull unclecode/crawl4ai:0.6.0-rN # Use your favorite revision number
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.6.0-rN # Use your favorite revision number
docker pull unclecode/crawl4ai:0.7.0
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.7.0
# Visit the playground at http://localhost:11235/playground
```
@@ -518,7 +518,72 @@ async def test_news_crawl():
## ✨ Recent Updates
### Version 0.6.0 Release Highlights
### Version 0.7.0 Release Highlights - The Adaptive Intelligence Update
- **🧠 Adaptive Crawling**: Your crawler now learns and adapts to website patterns automatically:
```python
config = AdaptiveConfig(
confidence_threshold=0.7, # Min confidence to stop crawling
max_depth=5, # Maximum crawl depth
max_pages=20, # Maximum number of pages to crawl
strategy="statistical"
)
async with AsyncWebCrawler() as crawler:
adaptive_crawler = AdaptiveCrawler(crawler, config)
state = await adaptive_crawler.digest(
start_url="https://news.example.com",
query="latest news content"
)
# Crawler learns patterns and improves extraction over time
```
- **🌊 Virtual Scroll Support**: Complete content extraction from infinite scroll pages:
```python
scroll_config = VirtualScrollConfig(
container_selector="[data-testid='feed']",
scroll_count=20,
scroll_by="container_height",
wait_after_scroll=1.0
)
result = await crawler.arun(url, config=CrawlerRunConfig(
virtual_scroll_config=scroll_config
))
```
- **🔗 Intelligent Link Analysis**: 3-layer scoring system for smart link prioritization:
```python
link_config = LinkPreviewConfig(
query="machine learning tutorials",
score_threshold=0.3,
concurrent_requests=10
)
result = await crawler.arun(url, config=CrawlerRunConfig(
link_preview_config=link_config,
score_links=True
))
# Links ranked by relevance and quality
```
- **🎣 Async URL Seeder**: Discover thousands of URLs in seconds:
```python
seeder = AsyncUrlSeeder(SeedingConfig(
source="sitemap+cc",
pattern="*/blog/*",
query="python tutorials",
score_threshold=0.4
))
urls = await seeder.discover("https://example.com")
```
- **⚡ Performance Boost**: Up to 3x faster with optimized resource handling and memory efficiency
Read the full details in our [0.7.0 Release Notes](https://docs.crawl4ai.com/blog/release-v0.7.0) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
### Previous Version: 0.6.0 Release Highlights
- **🌎 World-aware Crawling**: Set geolocation, language, and timezone for authentic locale-specific content:
```python
@@ -588,7 +653,6 @@ async def test_news_crawl():
- **📱 Multi-stage Build System**: Optimized Dockerfile with platform-specific performance enhancements
Read the full details in our [0.6.0 Release Notes](https://docs.crawl4ai.com/blog/releases/0.6.0.html) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
### Previous Version: 0.5.0 Major Release Highlights

View File

@@ -3,7 +3,7 @@ import warnings
from .async_webcrawler import AsyncWebCrawler, CacheMode
# MODIFIED: Add SeedingConfig and VirtualScrollConfig here
from .async_configs import BrowserConfig, CrawlerRunConfig, HTTPCrawlerConfig, LLMConfig, ProxyConfig, GeolocationConfig, SeedingConfig, VirtualScrollConfig
from .async_configs import BrowserConfig, CrawlerRunConfig, HTTPCrawlerConfig, LLMConfig, ProxyConfig, GeolocationConfig, SeedingConfig, VirtualScrollConfig, LinkPreviewConfig
from .content_scraping_strategy import (
ContentScrapingStrategy,
@@ -173,6 +173,7 @@ __all__ = [
"CompilationResult",
"ValidationResult",
"ErrorDetail",
"LinkPreviewConfig"
]

View File

@@ -1,7 +1,7 @@
# crawl4ai/__version__.py
# This is the version that will be used for stable releases
__version__ = "0.6.3"
__version__ = "0.7.1"
# For nightly builds, this gets set during build process
__nightly_version__ = None

View File

@@ -1659,22 +1659,57 @@ class SeedingConfig:
"""
def __init__(
self,
source: str = "sitemap+cc", # Options: "sitemap", "cc", "sitemap+cc"
pattern: Optional[str] = "*", # URL pattern to filter discovered URLs (e.g., "*example.com/blog/*")
live_check: bool = False, # Whether to perform HEAD requests to verify URL liveness
extract_head: bool = False, # Whether to fetch and parse <head> section for metadata
max_urls: int = -1, # Maximum number of URLs to discover (default: -1 for no limit)
concurrency: int = 1000, # Maximum concurrent requests for live checks/head extraction
hits_per_sec: int = 5, # Rate limit in requests per second
force: bool = False, # If True, bypasses the AsyncUrlSeeder's internal .jsonl cache
base_directory: Optional[str] = None, # Base directory for UrlSeeder's cache files (.jsonl)
llm_config: Optional[LLMConfig] = None, # Forward LLM config for future use (e.g., relevance scoring)
verbose: Optional[bool] = None, # Override crawler's general verbose setting
query: Optional[str] = None, # Search query for relevance scoring
score_threshold: Optional[float] = None, # Minimum relevance score to include URL (0.0-1.0)
scoring_method: str = "bm25", # Scoring method: "bm25" (default), future: "semantic"
filter_nonsense_urls: bool = True, # Filter out utility URLs like robots.txt, sitemap.xml, etc.
source: str = "sitemap+cc",
pattern: Optional[str] = "*",
live_check: bool = False,
extract_head: bool = False,
max_urls: int = -1,
concurrency: int = 1000,
hits_per_sec: int = 5,
force: bool = False,
base_directory: Optional[str] = None,
llm_config: Optional[LLMConfig] = None,
verbose: Optional[bool] = None,
query: Optional[str] = None,
score_threshold: Optional[float] = None,
scoring_method: str = "bm25",
filter_nonsense_urls: bool = True,
):
"""
Initialize URL seeding configuration.
Args:
source: Discovery source(s) to use. Options: "sitemap", "cc" (Common Crawl),
or "sitemap+cc" (both). Default: "sitemap+cc"
pattern: URL pattern to filter discovered URLs (e.g., "*example.com/blog/*").
Supports glob-style wildcards. Default: "*" (all URLs)
live_check: Whether to perform HEAD requests to verify URL liveness.
Default: False
extract_head: Whether to fetch and parse <head> section for metadata extraction.
Required for BM25 relevance scoring. Default: False
max_urls: Maximum number of URLs to discover. Use -1 for no limit.
Default: -1
concurrency: Maximum concurrent requests for live checks/head extraction.
Default: 1000
hits_per_sec: Rate limit in requests per second to avoid overwhelming servers.
Default: 5
force: If True, bypasses the AsyncUrlSeeder's internal .jsonl cache and
re-fetches URLs. Default: False
base_directory: Base directory for UrlSeeder's cache files (.jsonl).
If None, uses default ~/.crawl4ai/. Default: None
llm_config: LLM configuration for future features (e.g., semantic scoring).
Currently unused. Default: None
verbose: Override crawler's general verbose setting for seeding operations.
Default: None (inherits from crawler)
query: Search query for BM25 relevance scoring (e.g., "python tutorials").
Requires extract_head=True. Default: None
score_threshold: Minimum relevance score (0.0-1.0) to include URL.
Only applies when query is provided. Default: None
scoring_method: Scoring algorithm to use. Currently only "bm25" is supported.
Future: "semantic". Default: "bm25"
filter_nonsense_urls: Filter out utility URLs like robots.txt, sitemap.xml,
ads.txt, favicon.ico, etc. Default: True
"""
self.source = source
self.pattern = pattern
self.live_check = live_check

View File

@@ -824,7 +824,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
except Error:
visibility_info = await self.check_visibility(page)
if self.browser_config.config.verbose:
if self.browser_config.verbose:
self.logger.debug(
message="Body visibility info: {info}",
tag="DEBUG",

View File

@@ -424,10 +424,21 @@ class AsyncUrlSeeder:
self._log("info", "Finished URL seeding for {domain}. Total URLs: {count}",
params={"domain": domain, "count": len(results)}, tag="URL_SEED")
# Sort by relevance score if query was provided
# Apply BM25 scoring if query was provided
if query and extract_head and scoring_method == "bm25":
results.sort(key=lambda x: x.get(
"relevance_score", 0.0), reverse=True)
# Apply collective BM25 scoring across all documents
results = await self._apply_bm25_scoring(results, config)
# Filter by score threshold if specified
if score_threshold is not None:
original_count = len(results)
results = [r for r in results if r.get("relevance_score", 0) >= score_threshold]
if original_count > len(results):
self._log("info", "Filtered {filtered} URLs below score threshold {threshold}",
params={"filtered": original_count - len(results), "threshold": score_threshold}, tag="URL_SEED")
# Sort by relevance score
results.sort(key=lambda x: x.get("relevance_score", 0.0), reverse=True)
self._log("info", "Sorted {count} URLs by relevance score for query: '{query}'",
params={"count": len(results), "query": query}, tag="URL_SEED")
elif query and not extract_head:
@@ -982,28 +993,6 @@ class AsyncUrlSeeder:
"head_data": head_data,
}
# Apply BM25 scoring if query is provided and head data exists
if query and ok and scoring_method == "bm25" and head_data:
text_context = self._extract_text_context(head_data)
if text_context:
# Calculate BM25 score for this single document
# scores = self._calculate_bm25_score(query, [text_context])
scores = await asyncio.to_thread(self._calculate_bm25_score, query, [text_context])
relevance_score = scores[0] if scores else 0.0
entry["relevance_score"] = float(relevance_score)
else:
# No text context, use URL-based scoring as fallback
relevance_score = self._calculate_url_relevance_score(
query, entry["url"])
entry["relevance_score"] = float(relevance_score)
elif query:
# Query provided but no head data - we reject this entry
self._log("debug", "No head data for {url}, using URL-based scoring",
params={"url": url}, tag="URL_SEED")
return
# relevance_score = self._calculate_url_relevance_score(query, entry["url"])
# entry["relevance_score"] = float(relevance_score)
elif live:
self._log("debug", "Performing live check for {url}", params={
"url": url}, tag="URL_SEED")
@@ -1013,35 +1002,13 @@ class AsyncUrlSeeder:
params={"status": status.upper(), "url": url}, tag="URL_SEED")
entry = {"url": url, "status": status, "head_data": {}}
# Apply URL-based scoring if query is provided
if query:
relevance_score = self._calculate_url_relevance_score(
query, url)
entry["relevance_score"] = float(relevance_score)
else:
entry = {"url": url, "status": "unknown", "head_data": {}}
# Apply URL-based scoring if query is provided
if query:
relevance_score = self._calculate_url_relevance_score(
query, url)
entry["relevance_score"] = float(relevance_score)
# Now decide whether to add the entry based on score threshold
if query and "relevance_score" in entry:
if score_threshold is None or entry["relevance_score"] >= score_threshold:
if live or extract:
await self._cache_set(cache_kind, url, entry)
res_list.append(entry)
else:
self._log("debug", "URL {url} filtered out with score {score} < {threshold}",
params={"url": url, "score": entry["relevance_score"], "threshold": score_threshold}, tag="URL_SEED")
else:
# No query or no scoring - add as usual
if live or extract:
await self._cache_set(cache_kind, url, entry)
res_list.append(entry)
# Add entry to results (scoring will be done later)
if live or extract:
await self._cache_set(cache_kind, url, entry)
res_list.append(entry)
async def _head_ok(self, url: str, timeout: int) -> bool:
try:
@@ -1436,8 +1403,19 @@ class AsyncUrlSeeder:
scores = bm25.get_scores(query_tokens)
# Normalize scores to 0-1 range
max_score = max(scores) if max(scores) > 0 else 1.0
normalized_scores = [score / max_score for score in scores]
# BM25 can return negative scores, so we need to handle the full range
if len(scores) == 0:
return []
min_score = min(scores)
max_score = max(scores)
# If all scores are the same, return 0.5 for all
if max_score == min_score:
return [0.5] * len(scores)
# Normalize to 0-1 range using min-max normalization
normalized_scores = [(score - min_score) / (max_score - min_score) for score in scores]
return normalized_scores
except Exception as e:

View File

@@ -502,9 +502,12 @@ class AsyncWebCrawler:
metadata = result.get("metadata", {})
else:
cleaned_html = sanitize_input_encode(result.cleaned_html)
media = result.media.model_dump()
tables = media.pop("tables", [])
links = result.links.model_dump()
# media = result.media.model_dump()
# tables = media.pop("tables", [])
# links = result.links.model_dump()
media = result.media.model_dump() if hasattr(result.media, 'model_dump') else result.media
tables = media.pop("tables", []) if isinstance(media, dict) else []
links = result.links.model_dump() if hasattr(result.links, 'model_dump') else result.links
metadata = result.metadata
fit_html = preprocess_html_for_schema(html_content=html, text_threshold= 500, max_size= 300_000)

View File

@@ -14,23 +14,8 @@ import hashlib
from .js_snippet import load_js_script
from .config import DOWNLOAD_PAGE_TIMEOUT
from .async_configs import BrowserConfig, CrawlerRunConfig
from playwright_stealth import StealthConfig
from .utils import get_chromium_path
stealth_config = StealthConfig(
webdriver=True,
chrome_app=True,
chrome_csi=True,
chrome_load_times=True,
chrome_runtime=True,
navigator_languages=True,
navigator_plugins=True,
navigator_permissions=True,
webgl_vendor=True,
outerdimensions=True,
navigator_hardware_concurrency=True,
media_codecs=True,
)
BROWSER_DISABLE_OPTIONS = [
"--disable-background-networking",

View File

@@ -27,7 +27,10 @@ from crawl4ai import (
PruningContentFilter,
BrowserProfiler,
DefaultMarkdownGenerator,
LLMConfig
LLMConfig,
BFSDeepCrawlStrategy,
DFSDeepCrawlStrategy,
BestFirstCrawlingStrategy,
)
from crawl4ai.config import USER_SETTINGS
from litellm import completion
@@ -1014,9 +1017,11 @@ def cdp_cmd(user_data_dir: Optional[str], port: int, browser_type: str, headless
@click.option("--question", "-q", help="Ask a question about the crawled content")
@click.option("--verbose", "-v", is_flag=True)
@click.option("--profile", "-p", help="Use a specific browser profile (by name)")
@click.option("--deep-crawl", type=click.Choice(["bfs", "dfs", "best-first"]), help="Enable deep crawling with specified strategy (bfs, dfs, or best-first)")
@click.option("--max-pages", type=int, default=10, help="Maximum number of pages to crawl in deep crawl mode")
def crawl_cmd(url: str, browser_config: str, crawler_config: str, filter_config: str,
extraction_config: str, json_extract: str, schema: str, browser: Dict, crawler: Dict,
output: str, output_file: str, bypass_cache: bool, question: str, verbose: bool, profile: str):
output: str, output_file: str, bypass_cache: bool, question: str, verbose: bool, profile: str, deep_crawl: str, max_pages: int):
"""Crawl a website and extract content
Simple Usage:
@@ -1156,6 +1161,27 @@ Always return valid, properly formatted JSON."""
crawler_cfg.scraping_strategy = LXMLWebScrapingStrategy()
# Handle deep crawling configuration
if deep_crawl:
if deep_crawl == "bfs":
crawler_cfg.deep_crawl_strategy = BFSDeepCrawlStrategy(
max_depth=3,
max_pages=max_pages
)
elif deep_crawl == "dfs":
crawler_cfg.deep_crawl_strategy = DFSDeepCrawlStrategy(
max_depth=3,
max_pages=max_pages
)
elif deep_crawl == "best-first":
crawler_cfg.deep_crawl_strategy = BestFirstCrawlingStrategy(
max_depth=3,
max_pages=max_pages
)
if verbose:
console.print(f"[green]Deep crawling enabled:[/green] {deep_crawl} strategy, max {max_pages} pages")
config = get_global_config()
browser_cfg.verbose = config.get("VERBOSE", False)
@@ -1170,39 +1196,60 @@ Always return valid, properly formatted JSON."""
verbose
)
# Handle deep crawl results (list) vs single result
if isinstance(result, list):
if len(result) == 0:
click.echo("No results found during deep crawling")
return
# Use the first result for question answering and output
main_result = result[0]
all_results = result
else:
# Single result from regular crawling
main_result = result
all_results = [result]
# Handle question
if question:
provider, token = setup_llm_config()
markdown = result.markdown.raw_markdown
markdown = main_result.markdown.raw_markdown
anyio.run(stream_llm_response, url, markdown, question, provider, token)
return
# Handle output
if not output_file:
if output == "all":
click.echo(json.dumps(result.model_dump(), indent=2))
if isinstance(result, list):
output_data = [r.model_dump() for r in all_results]
click.echo(json.dumps(output_data, indent=2))
else:
click.echo(json.dumps(main_result.model_dump(), indent=2))
elif output == "json":
print(result.extracted_content)
extracted_items = json.loads(result.extracted_content)
print(main_result.extracted_content)
extracted_items = json.loads(main_result.extracted_content)
click.echo(json.dumps(extracted_items, indent=2))
elif output in ["markdown", "md"]:
click.echo(result.markdown.raw_markdown)
click.echo(main_result.markdown.raw_markdown)
elif output in ["markdown-fit", "md-fit"]:
click.echo(result.markdown.fit_markdown)
click.echo(main_result.markdown.fit_markdown)
else:
if output == "all":
with open(output_file, "w") as f:
f.write(json.dumps(result.model_dump(), indent=2))
if isinstance(result, list):
output_data = [r.model_dump() for r in all_results]
f.write(json.dumps(output_data, indent=2))
else:
f.write(json.dumps(main_result.model_dump(), indent=2))
elif output == "json":
with open(output_file, "w") as f:
f.write(result.extracted_content)
f.write(main_result.extracted_content)
elif output in ["markdown", "md"]:
with open(output_file, "w") as f:
f.write(result.markdown.raw_markdown)
f.write(main_result.markdown.raw_markdown)
elif output in ["markdown-fit", "md-fit"]:
with open(output_file, "w") as f:
f.write(result.markdown.fit_markdown)
f.write(main_result.markdown.fit_markdown)
except Exception as e:
raise click.ClickException(str(e))
@@ -1354,9 +1401,11 @@ def profiles_cmd():
@click.option("--question", "-q", help="Ask a question about the crawled content")
@click.option("--verbose", "-v", is_flag=True)
@click.option("--profile", "-p", help="Use a specific browser profile (by name)")
@click.option("--deep-crawl", type=click.Choice(["bfs", "dfs", "best-first"]), help="Enable deep crawling with specified strategy")
@click.option("--max-pages", type=int, default=10, help="Maximum number of pages to crawl in deep crawl mode")
def default(url: str, example: bool, browser_config: str, crawler_config: str, filter_config: str,
extraction_config: str, json_extract: str, schema: str, browser: Dict, crawler: Dict,
output: str, bypass_cache: bool, question: str, verbose: bool, profile: str):
output: str, bypass_cache: bool, question: str, verbose: bool, profile: str, deep_crawl: str, max_pages: int):
"""Crawl4AI CLI - Web content extraction tool
Simple Usage:
@@ -1406,7 +1455,9 @@ def default(url: str, example: bool, browser_config: str, crawler_config: str, f
bypass_cache=bypass_cache,
question=question,
verbose=verbose,
profile=profile
profile=profile,
deep_crawl=deep_crawl,
max_pages=max_pages
)
def main():

View File

@@ -1145,10 +1145,10 @@ class LXMLWebScrapingStrategy(WebScrapingStrategy):
link_data["intrinsic_score"] = intrinsic_score
except Exception:
# Fail gracefully - assign default score
link_data["intrinsic_score"] = float('inf')
link_data["intrinsic_score"] = 0
else:
# No scoring enabled - assign infinity (all links equal priority)
link_data["intrinsic_score"] = float('inf')
link_data["intrinsic_score"] = 0
is_external = is_external_url(normalized_href, base_domain)
if is_external:

View File

@@ -3342,7 +3342,13 @@ async def get_text_embeddings(
# Default: use sentence-transformers
else:
# Lazy load to avoid importing heavy libraries unless needed
from sentence_transformers import SentenceTransformer
try:
from sentence_transformers import SentenceTransformer
except ImportError:
raise ImportError(
"sentence-transformers is required for local embeddings. "
"Install it with: pip install 'crawl4ai[transformer]' or pip install sentence-transformers"
)
# Cache the model in function attribute to avoid reloading
if not hasattr(get_text_embeddings, '_models'):

View File

@@ -58,13 +58,15 @@ Pull and run images directly from Docker Hub without building locally.
#### 1. Pull the Image
Our latest release candidate is `0.6.0-r1`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
Our latest release candidate is `0.7.0-r1`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
> ⚠️ **Important Note**: The `latest` tag currently points to the stable `0.6.0` version. After testing and validation, `0.7.0` (without -r1) will be released and `latest` will be updated. For now, please use `0.7.0-r1` to test the new features.
```bash
# Pull the release candidate (recommended for latest features)
docker pull unclecode/crawl4ai:0.6.0-rN # Use your favorite revision number
# Pull the release candidate (for testing new features)
docker pull unclecode/crawl4ai:0.7.0-r1
# Or pull the latest stable version
# Or pull the current stable version (0.6.0)
docker pull unclecode/crawl4ai:latest
```
@@ -99,7 +101,7 @@ EOL
-p 11235:11235 \
--name crawl4ai \
--shm-size=1g \
unclecode/crawl4ai:0.6.0-rN # Use your favorite revision number
unclecode/crawl4ai:0.7.0-r1
```
* **With LLM support:**
@@ -110,7 +112,7 @@ EOL
--name crawl4ai \
--env-file .llm.env \
--shm-size=1g \
unclecode/crawl4ai:0.6.0-rN # Use your favorite revision number
unclecode/crawl4ai:0.7.0-r1
```
> The server will be available at `http://localhost:11235`. Visit `/playground` to access the interactive testing interface.
@@ -124,7 +126,7 @@ docker stop crawl4ai && docker rm crawl4ai
#### Docker Hub Versioning Explained
* **Image Name:** `unclecode/crawl4ai`
* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.6.0-r1`)
* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.7.0-r1`)
* `LIBRARY_VERSION`: The semantic version of the core `crawl4ai` Python library
* `SUFFIX`: Optional tag for release candidates (``) and revisions (`r1`)
* **`latest` Tag:** Points to the most recent stable version
@@ -160,7 +162,7 @@ The `docker-compose.yml` file in the project root provides a simplified approach
```bash
# Pulls and runs the release candidate from Docker Hub
# Automatically selects the correct architecture
IMAGE=unclecode/crawl4ai:0.6.0-rN # Use your favorite revision number docker compose up -d
IMAGE=unclecode/crawl4ai:0.7.0-r1 docker compose up -d
```
* **Build and Run Locally:**

View File

@@ -5,6 +5,7 @@ from typing import List, Tuple, Dict
from functools import partial
from uuid import uuid4
from datetime import datetime
from base64 import b64encode
import logging
from typing import Optional, AsyncGenerator
@@ -371,6 +372,9 @@ async def stream_results(crawler: AsyncWebCrawler, results_gen: AsyncGenerator)
server_memory_mb = _get_memory_mb()
result_dict = result.model_dump()
result_dict['server_memory_mb'] = server_memory_mb
# If PDF exists, encode it to base64
if result_dict.get('pdf') is not None:
result_dict['pdf'] = b64encode(result_dict['pdf']).decode('utf-8')
logger.info(f"Streaming result for {result_dict.get('url', 'unknown')}")
data = json.dumps(result_dict, default=datetime_handler) + "\n"
yield data.encode('utf-8')
@@ -443,10 +447,19 @@ async def handle_crawl_request(
mem_delta_mb = end_mem_mb - start_mem_mb # <--- Calculate delta
peak_mem_mb = max(peak_mem_mb if peak_mem_mb else 0, end_mem_mb) # <--- Get peak memory
logger.info(f"Memory usage: Start: {start_mem_mb} MB, End: {end_mem_mb} MB, Delta: {mem_delta_mb} MB, Peak: {peak_mem_mb} MB")
# Process results to handle PDF bytes
processed_results = []
for result in results:
result_dict = result.model_dump()
# If PDF exists, encode it to base64
if result_dict.get('pdf') is not None:
result_dict['pdf'] = b64encode(result_dict['pdf']).decode('utf-8')
processed_results.append(result_dict)
return {
"success": True,
"results": [result.model_dump() for result in results],
"results": processed_results,
"server_processing_time_s": end_time - start_time,
"server_memory_delta_mb": mem_delta_mb,
"server_peak_memory_mb": peak_mem_mb

343
docs/blog/release-v0.7.0.md Normal file
View File

@@ -0,0 +1,343 @@
# 🚀 Crawl4AI v0.7.0: The Adaptive Intelligence Update
*January 28, 2025 • 10 min read*
---
Today I'm releasing Crawl4AI v0.7.0—the Adaptive Intelligence Update. This release introduces fundamental improvements in how Crawl4AI handles modern web complexity through adaptive learning, intelligent content discovery, and advanced extraction capabilities.
## 🎯 What's New at a Glance
- **Adaptive Crawling**: Your crawler now learns and adapts to website patterns
- **Virtual Scroll Support**: Complete content extraction from infinite scroll pages
- **Link Preview with Intelligent Scoring**: Intelligent link analysis and prioritization
- **Async URL Seeder**: Discover thousands of URLs in seconds with intelligent filtering
- **Performance Optimizations**: Significant speed and memory improvements
## 🧠 Adaptive Crawling: Intelligence Through Pattern Learning
**The Problem:** Websites change. Class names shift. IDs disappear. Your carefully crafted selectors break at 3 AM, and you wake up to empty datasets and angry stakeholders.
**My Solution:** I implemented an adaptive learning system that observes patterns, builds confidence scores, and adjusts extraction strategies on the fly. It's like having a junior developer who gets better at their job with every page they scrape.
### Technical Deep-Dive
The Adaptive Crawler maintains a persistent state for each domain, tracking:
- Pattern success rates
- Selector stability over time
- Content structure variations
- Extraction confidence scores
```python
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
import asyncio
async def main():
# Configure adaptive crawler
config = AdaptiveConfig(
strategy="statistical", # or "embedding" for semantic understanding
max_pages=10,
confidence_threshold=0.7, # Stop at 70% confidence
top_k_links=3, # Follow top 3 links per page
min_gain_threshold=0.05 # Need 5% information gain to continue
)
async with AsyncWebCrawler(verbose=False) as crawler:
adaptive = AdaptiveCrawler(crawler, config)
print("Starting adaptive crawl about Python decorators...")
result = await adaptive.digest(
start_url="https://docs.python.org/3/glossary.html",
query="python decorators functions wrapping"
)
print(f"\n✅ Crawling Complete!")
print(f"• Confidence Level: {adaptive.confidence:.0%}")
print(f"• Pages Crawled: {len(result.crawled_urls)}")
print(f"• Knowledge Base: {len(adaptive.state.knowledge_base)} documents")
# Get most relevant content
relevant = adaptive.get_relevant_content(top_k=3)
print(f"\nMost Relevant Pages:")
for i, page in enumerate(relevant, 1):
print(f"{i}. {page['url']} (relevance: {page['score']:.2%})")
asyncio.run(main())
```
**Expected Real-World Impact:**
- **News Aggregation**: Maintain 95%+ extraction accuracy even as news sites update their templates
- **E-commerce Monitoring**: Track product changes across hundreds of stores without constant maintenance
- **Research Data Collection**: Build robust academic datasets that survive website redesigns
- **Reduced Maintenance**: Cut selector update time by 80% for frequently-changing sites
## 🌊 Virtual Scroll: Complete Content Capture
**The Problem:** Modern web apps only render what's visible. Scroll down, new content appears, old content vanishes into the void. Traditional crawlers capture that first viewport and miss 90% of the content. It's like reading only the first page of every book.
**My Solution:** I built Virtual Scroll support that mimics human browsing behavior, capturing content as it loads and preserving it before the browser's garbage collector strikes.
### Implementation Details
```python
from crawl4ai import VirtualScrollConfig
# For social media feeds (Twitter/X style)
twitter_config = VirtualScrollConfig(
container_selector="[data-testid='primaryColumn']",
scroll_count=20, # Number of scrolls
scroll_by="container_height", # Smart scrolling by container size
wait_after_scroll=1.0 # Let content load
)
# For e-commerce product grids (Instagram style)
grid_config = VirtualScrollConfig(
container_selector="main .product-grid",
scroll_count=30,
scroll_by=800, # Fixed pixel scrolling
wait_after_scroll=1.5 # Images need time
)
# For news feeds with lazy loading
news_config = VirtualScrollConfig(
container_selector=".article-feed",
scroll_count=50,
scroll_by="page_height", # Viewport-based scrolling
wait_after_scroll=0.5 # Wait for content to load
)
# Use it in your crawl
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
"https://twitter.com/trending",
config=CrawlerRunConfig(
virtual_scroll_config=twitter_config,
# Combine with other features
extraction_strategy=JsonCssExtractionStrategy({
"tweets": {
"selector": "[data-testid='tweet']",
"fields": {
"text": {"selector": "[data-testid='tweetText']", "type": "text"},
"likes": {"selector": "[data-testid='like']", "type": "text"}
}
}
})
)
)
print(f"Captured {len(result.extracted_content['tweets'])} tweets")
```
**Key Capabilities:**
- **DOM Recycling Awareness**: Detects and handles virtual DOM element recycling
- **Smart Scroll Physics**: Three modes - container height, page height, or fixed pixels
- **Content Preservation**: Captures content before it's destroyed
- **Intelligent Stopping**: Stops when no new content appears
- **Memory Efficient**: Streams content instead of holding everything in memory
**Expected Real-World Impact:**
- **Social Media Analysis**: Capture entire Twitter threads with hundreds of replies, not just top 10
- **E-commerce Scraping**: Extract 500+ products from infinite scroll catalogs vs. 20-50 with traditional methods
- **News Aggregation**: Get all articles from modern news sites, not just above-the-fold content
- **Research Applications**: Complete data extraction from academic databases using virtual pagination
## 🔗 Link Preview: Intelligent Link Analysis and Scoring
**The Problem:** You crawl a page and get 200 links. Which ones matter? Which lead to the content you actually want? Traditional crawlers force you to follow everything or build complex filters.
**My Solution:** I implemented a three-layer scoring system that analyzes links like a human would—considering their position, context, and relevance to your goals.
### Intelligent Link Analysis and Scoring
```python
import asyncio
from crawl4ai import CrawlerRunConfig, CacheMode, AsyncWebCrawler
from crawl4ai.adaptive_crawler import LinkPreviewConfig
async def main():
# Configure intelligent link analysis
link_config = LinkPreviewConfig(
include_internal=True,
include_external=False,
max_links=10,
concurrency=5,
query="python tutorial", # For contextual scoring
score_threshold=0.3,
verbose=True
)
# Use in your crawl
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
"https://www.geeksforgeeks.org/",
config=CrawlerRunConfig(
link_preview_config=link_config,
score_links=True, # Enable intrinsic scoring
cache_mode=CacheMode.BYPASS
)
)
# Access scored and sorted links
if result.success and result.links:
for link in result.links.get("internal", []):
text = link.get('text', 'No text')[:40]
print(
text,
f"{link.get('intrinsic_score', 0):.1f}/10" if link.get('intrinsic_score') is not None else "0.0/10",
f"{link.get('contextual_score', 0):.2f}/1" if link.get('contextual_score') is not None else "0.00/1",
f"{link.get('total_score', 0):.3f}" if link.get('total_score') is not None else "0.000"
)
asyncio.run(main())
```
**Scoring Components:**
1. **Intrinsic Score**: Based on link quality indicators
- Position on page (navigation, content, footer)
- Link attributes (rel, title, class names)
- Anchor text quality and length
- URL structure and depth
2. **Contextual Score**: Relevance to your query using BM25 algorithm
- Keyword matching in link text and title
- Meta description analysis
- Content preview scoring
3. **Total Score**: Combined score for final ranking
**Expected Real-World Impact:**
- **Research Efficiency**: Find relevant papers 10x faster by following only high-score links
- **Competitive Analysis**: Automatically identify important pages on competitor sites
- **Content Discovery**: Build topic-focused crawlers that stay on track
- **SEO Audits**: Identify and prioritize high-value internal linking opportunities
## 🎣 Async URL Seeder: Automated URL Discovery at Scale
**The Problem:** You want to crawl an entire domain but only have the homepage. Or worse, you want specific content types across thousands of pages. Manual URL discovery? That's a job for machines, not humans.
**My Solution:** I built Async URL Seeder—a turbocharged URL discovery engine that combines multiple sources with intelligent filtering and relevance scoring.
### Technical Architecture
```python
import asyncio
from crawl4ai import AsyncUrlSeeder, SeedingConfig
async def main():
async with AsyncUrlSeeder() as seeder:
# Discover Python tutorial URLs
config = SeedingConfig(
source="sitemap", # Use sitemap
pattern="*python*", # URL pattern filter
extract_head=True, # Get metadata
query="python tutorial", # For relevance scoring
scoring_method="bm25",
score_threshold=0.2,
max_urls=10
)
print("Discovering Python async tutorial URLs...")
urls = await seeder.urls("https://www.geeksforgeeks.org/", config)
print(f"\n✅ Found {len(urls)} relevant URLs:")
for i, url_info in enumerate(urls[:5], 1):
print(f"\n{i}. {url_info['url']}")
if url_info.get('relevance_score'):
print(f" Relevance: {url_info['relevance_score']:.3f}")
if url_info.get('head_data', {}).get('title'):
print(f" Title: {url_info['head_data']['title'][:60]}...")
asyncio.run(main())
```
**Discovery Methods:**
- **Sitemap Mining**: Parses robots.txt and all linked sitemaps
- **Common Crawl**: Queries the Common Crawl index for historical URLs
- **Intelligent Crawling**: Follows links with smart depth control
- **Pattern Analysis**: Learns URL structures and generates variations
**Expected Real-World Impact:**
- **Migration Projects**: Discover 10,000+ URLs from legacy sites in under 60 seconds
- **Market Research**: Map entire competitor ecosystems automatically
- **Academic Research**: Build comprehensive datasets without manual URL collection
- **SEO Audits**: Find every indexable page with content scoring
- **Content Archival**: Ensure no content is left behind during site migrations
## ⚡ Performance Optimizations
This release includes significant performance improvements through optimized resource handling, better concurrency management, and reduced memory footprint.
### What We Optimized
```python
# Optimized crawling with v0.7.0 improvements
results = []
for url in urls:
result = await crawler.arun(
url,
config=CrawlerRunConfig(
# Performance optimizations
wait_until="domcontentloaded", # Faster than networkidle
cache_mode=CacheMode.ENABLED # Enable caching
)
)
results.append(result)
```
**Performance Gains:**
- **Startup Time**: 70% faster browser initialization
- **Page Loading**: 40% reduction with smart resource blocking
- **Extraction**: 3x faster with compiled CSS selectors
- **Memory Usage**: 60% reduction with streaming processing
- **Concurrent Crawls**: Handle 5x more parallel requests
## 🔧 Important Changes
### Breaking Changes
- `link_extractor` renamed to `link_preview` (better reflects functionality)
- Minimum Python version now 3.9
- `CrawlerConfig` split into `CrawlerRunConfig` and `BrowserConfig`
### Migration Guide
```python
# Old (v0.6.x)
from crawl4ai import CrawlerConfig
config = CrawlerConfig(timeout=30000)
# New (v0.7.0)
from crawl4ai import CrawlerRunConfig, BrowserConfig
browser_config = BrowserConfig(timeout=30000)
run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
```
## 🤖 Coming Soon: Intelligent Web Automation
I'm currently working on bringing advanced automation capabilities to Crawl4AI. This includes:
- **Crawl Agents**: Autonomous crawlers that understand your goals and adapt their strategies
- **Auto JS Generation**: Automatic JavaScript code generation for complex interactions
- **Smart Form Handling**: Intelligent form detection and filling
- **Context-Aware Actions**: Crawlers that understand page context and make decisions
These features are under active development and will revolutionize how we approach web automation. Stay tuned!
## 🚀 Get Started
```bash
pip install crawl4ai==0.7.0
```
Check out the [updated documentation](https://docs.crawl4ai.com).
Questions? Issues? I'm always listening:
- GitHub: [github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- Discord: [discord.gg/crawl4ai](https://discord.gg/jP8KfhDhyN)
- Twitter: [@unclecode](https://x.com/unclecode)
Happy crawling! 🕷️
---
*P.S. If you're using Crawl4AI in production, I'd love to hear about it. Your use cases inspire the next features.*

View File

@@ -0,0 +1,43 @@
# 🛠️ Crawl4AI v0.7.1: Minor Cleanup Update
*July 17, 2025 • 2 min read*
---
A small maintenance release that removes unused code and improves documentation.
## 🎯 What's Changed
- **Removed unused StealthConfig** from `crawl4ai/browser_manager.py`
- **Updated documentation** with better examples and parameter explanations
- **Fixed virtual scroll configuration** examples in docs
## 🧹 Code Cleanup
Removed unused `StealthConfig` import and configuration that wasn't being used anywhere in the codebase. The project uses its own custom stealth implementation through JavaScript injection instead.
```python
# Removed unused code:
from playwright_stealth import StealthConfig
stealth_config = StealthConfig(...) # This was never used
```
## 📖 Documentation Updates
- Fixed adaptive crawling parameter examples
- Updated session management documentation
- Corrected virtual scroll configuration examples
## 🚀 Installation
```bash
pip install crawl4ai==0.7.1
```
No breaking changes - upgrade directly from v0.7.0.
---
Questions? Issues?
- GitHub: [github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- Discord: [discord.gg/crawl4ai](https://discord.gg/jP8KfhDhyN)

View File

@@ -18,7 +18,7 @@ Usage:
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.async_configs import LinkPreviewConfig
from crawl4ai import LinkPreviewConfig
async def basic_link_head_extraction():

View File

@@ -49,46 +49,75 @@ from crawl4ai import JsonCssExtractionStrategy
from crawl4ai.cache_context import CacheMode
async def crawl_dynamic_content():
async with AsyncWebCrawler() as crawler:
session_id = "github_commits_session"
url = "https://github.com/microsoft/TypeScript/commits/main"
all_commits = []
url = "https://github.com/microsoft/TypeScript/commits/main"
session_id = "wait_for_session"
all_commits = []
# Define extraction schema
schema = {
"name": "Commit Extractor",
"baseSelector": "li.Box-sc-g0xbh4-0",
"fields": [{
"name": "title", "selector": "h4.markdown-title", "type": "text"
}],
}
extraction_strategy = JsonCssExtractionStrategy(schema)
js_next_page = """
const commits = document.querySelectorAll('li[data-testid="commit-row-item"] h4');
if (commits.length > 0) {
window.lastCommit = commits[0].textContent.trim();
}
const button = document.querySelector('a[data-testid="pagination-next-button"]');
if (button) {button.click(); console.log('button clicked') }
"""
# JavaScript and wait configurations
js_next_page = """document.querySelector('a[data-testid="pagination-next-button"]').click();"""
wait_for = """() => document.querySelectorAll('li.Box-sc-g0xbh4-0').length > 0"""
# Crawl multiple pages
wait_for = """() => {
const commits = document.querySelectorAll('li[data-testid="commit-row-item"] h4');
if (commits.length === 0) return false;
const firstCommit = commits[0].textContent.trim();
return firstCommit !== window.lastCommit;
}"""
schema = {
"name": "Commit Extractor",
"baseSelector": "li[data-testid='commit-row-item']",
"fields": [
{
"name": "title",
"selector": "h4 a",
"type": "text",
"transform": "strip",
},
],
}
extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
browser_config = BrowserConfig(
verbose=True,
headless=False,
)
async with AsyncWebCrawler(config=browser_config) as crawler:
for page in range(3):
config = CrawlerRunConfig(
url=url,
crawler_config = CrawlerRunConfig(
session_id=session_id,
css_selector="li[data-testid='commit-row-item']",
extraction_strategy=extraction_strategy,
js_code=js_next_page if page > 0 else None,
wait_for=wait_for if page > 0 else None,
js_only=page > 0,
cache_mode=CacheMode.BYPASS
cache_mode=CacheMode.BYPASS,
capture_console_messages=True,
)
result = await crawler.arun(config=config)
if result.success:
result = await crawler.arun(url=url, config=crawler_config)
if result.console_messages:
print(f"Page {page + 1} console messages:", result.console_messages)
if result.extracted_content:
# print(f"Page {page + 1} result:", result.extracted_content)
commits = json.loads(result.extracted_content)
all_commits.extend(commits)
print(f"Page {page + 1}: Found {len(commits)} commits")
else:
print(f"Page {page + 1}: No content extracted")
print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
# Clean up session
await crawler.crawler_strategy.kill_session(session_id)
return all_commits
```
---

View File

@@ -91,13 +91,12 @@ async def crawl_twitter_timeline():
wait_after_scroll=1.0 # Twitter needs time to load
)
browser_config = BrowserConfig(headless=True) # Set to False to watch it work
config = CrawlerRunConfig(
virtual_scroll_config=virtual_config,
# Optional: Set headless=False to watch it work
# browser_config=BrowserConfig(headless=False)
virtual_scroll_config=virtual_config
)
async with AsyncWebCrawler() as crawler:
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="https://twitter.com/search?q=AI",
config=config
@@ -200,7 +199,7 @@ Use **scan_full_page** when:
Virtual Scroll works seamlessly with extraction strategies:
```python
from crawl4ai import LLMExtractionStrategy
from crawl4ai import LLMExtractionStrategy, LLMConfig
# Define extraction schema
schema = {
@@ -222,7 +221,7 @@ config = CrawlerRunConfig(
scroll_count=20
),
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o-mini",
llm_config=LLMConfig(provider="openai/gpt-4o-mini"),
schema=schema
)
)

View File

@@ -20,14 +20,28 @@ Ever wondered why your AI coding assistant struggles with your library despite c
## Latest Release
Heres the blog index entry for **v0.6.0**, written to match the exact tone and structure of your previous entries:
### [Crawl4AI v0.7.0 The Adaptive Intelligence Update](releases/0.7.0.md)
*January 28, 2025*
Crawl4AI v0.7.0 introduces groundbreaking intelligence features that transform how crawlers understand and adapt to websites. This release brings Adaptive Crawling that learns website patterns, Virtual Scroll support for infinite pages, intelligent Link Preview with 3-layer scoring, and the powerful Async URL Seeder for massive URL discovery.
Key highlights:
- **Adaptive Crawling**: Crawlers that learn and adapt to website structures automatically
- **Virtual Scroll Support**: Complete content extraction from modern infinite scroll pages
- **Link Preview**: 3-layer scoring system for intelligent link prioritization
- **Async URL Seeder**: Discover thousands of URLs in seconds with smart filtering
- **Performance Boost**: Up to 3x faster with optimized resource handling
[Read full release notes →](releases/0.7.0.md)
---
### [Crawl4AI v0.6.0 World-Aware Crawling, Pre-Warmed Browsers, and the MCP API](releases/0.6.0.md)
*April 23, 2025*
## Previous Releases
Crawl4AI v0.6.0 is our most powerful release yet. This update brings major architectural upgrades including world-aware crawling (set geolocation, locale, and timezone), real-time traffic capture, and a memory-efficient crawler pool with pre-warmed pages.
### [Crawl4AI v0.6.0 World-Aware Crawling, Pre-Warmed Browsers, and the MCP API](releases/0.6.0.md)
*December 23, 2024*
Crawl4AI v0.6.0 brought major architectural upgrades including world-aware crawling (set geolocation, locale, and timezone), real-time traffic capture, and a memory-efficient crawler pool with pre-warmed pages.
The Docker server now exposes a full-featured MCP socket + SSE interface, supports streaming, and comes with a new Playground UI. Plus, table extraction is now native, and the new stress-test framework supports crawling 1,000+ URLs.
@@ -45,8 +59,6 @@ Other key changes:
---
Let me know if you want me to auto-update the actual file or just paste this into the markdown.
### [Crawl4AI v0.5.0: Deep Crawling, Scalability, and a New CLI!](releases/0.5.0.md)
My dear friends and crawlers, there you go, this is the release of Crawl4AI v0.5.0! This release brings a wealth of new features, performance improvements, and a more streamlined developer experience. Here's a breakdown of what's new:
@@ -140,5 +152,4 @@ Curious about how Crawl4AI has evolved? Check out our [complete changelog](https
- Star us on [GitHub](https://github.com/unclecode/crawl4ai)
- Follow [@unclecode](https://twitter.com/unclecode) on Twitter
- Join our community discussions on GitHub
- Join our community discussions on GitHub

View File

@@ -0,0 +1,144 @@
# Crawl4AI Blog
Welcome to the Crawl4AI blog! Here you'll find detailed release notes, technical insights, and updates about the project. Whether you're looking for the latest improvements or want to dive deep into web crawling techniques, this is the place.
## Featured Articles
### [When to Stop Crawling: The Art of Knowing "Enough"](articles/adaptive-crawling-revolution.md)
*January 29, 2025*
Traditional crawlers are like tourists with unlimited time—they'll visit every street, every alley, every dead end. But what if your crawler could think like a researcher with a deadline? Discover how Adaptive Crawling revolutionizes web scraping by knowing when to stop. Learn about the three-layer intelligence system that evaluates coverage, consistency, and saturation to build focused knowledge bases instead of endless page collections.
[Read the full article →](articles/adaptive-crawling-revolution.md)
### [The LLM Context Protocol: Why Your AI Assistant Needs Memory, Reasoning, and Examples](articles/llm-context-revolution.md)
*January 24, 2025*
Ever wondered why your AI coding assistant struggles with your library despite comprehensive documentation? This article introduces the three-dimensional context protocol that transforms how AI understands code. Learn why memory, reasoning, and examples together create wisdom—not just information.
[Read the full article →](articles/llm-context-revolution.md)
## Latest Release
Heres the blog index entry for **v0.6.0**, written to match the exact tone and structure of your previous entries:
---
### [Crawl4AI v0.6.0 World-Aware Crawling, Pre-Warmed Browsers, and the MCP API](releases/0.6.0.md)
*April 23, 2025*
Crawl4AI v0.6.0 is our most powerful release yet. This update brings major architectural upgrades including world-aware crawling (set geolocation, locale, and timezone), real-time traffic capture, and a memory-efficient crawler pool with pre-warmed pages.
The Docker server now exposes a full-featured MCP socket + SSE interface, supports streaming, and comes with a new Playground UI. Plus, table extraction is now native, and the new stress-test framework supports crawling 1,000+ URLs.
Other key changes:
* Native support for `result.media["tables"]` to export DataFrames
* Full network + console logs and MHTML snapshot per crawl
* Browser pooling and pre-warming for faster cold starts
* New streaming endpoints via MCP API and Playground
* Robots.txt support, proxy rotation, and improved session handling
* Deprecated old markdown names, legacy modules cleaned up
* Massive repo cleanup: ~36K insertions, ~5K deletions across 121 files
[Read full release notes →](releases/0.6.0.md)
---
Let me know if you want me to auto-update the actual file or just paste this into the markdown.
### [Crawl4AI v0.5.0: Deep Crawling, Scalability, and a New CLI!](releases/0.5.0.md)
My dear friends and crawlers, there you go, this is the release of Crawl4AI v0.5.0! This release brings a wealth of new features, performance improvements, and a more streamlined developer experience. Here's a breakdown of what's new:
**Major New Features:**
* **Deep Crawling:** Explore entire websites with configurable strategies (BFS, DFS, Best-First). Define custom filters and URL scoring for targeted crawls.
* **Memory-Adaptive Dispatcher:** Handle large-scale crawls with ease! Our new dispatcher dynamically adjusts concurrency based on available memory and includes built-in rate limiting.
* **Multiple Crawler Strategies:** Choose between the full-featured Playwright browser-based crawler or a new, *much* faster HTTP-only crawler for simpler tasks.
* **Docker Deployment:** Deploy Crawl4AI as a scalable, self-contained service with built-in API endpoints and optional JWT authentication.
* **Command-Line Interface (CLI):** Interact with Crawl4AI directly from your terminal. Crawl, configure, and extract data with simple commands.
* **LLM Configuration (`LLMConfig`):** A new, unified way to configure LLM providers (OpenAI, Anthropic, Ollama, etc.) for extraction, filtering, and schema generation. Simplifies API key management and switching between models.
**Minor Updates & Improvements:**
* **LXML Scraping Mode:** Faster HTML parsing with `LXMLWebScrapingStrategy`.
* **Proxy Rotation:** Added `ProxyRotationStrategy` with a `RoundRobinProxyStrategy` implementation.
* **PDF Processing:** Extract text, images, and metadata from PDF files.
* **URL Redirection Tracking:** Automatically follows and records redirects.
* **Robots.txt Compliance:** Optionally respect website crawling rules.
* **LLM-Powered Schema Generation:** Automatically create extraction schemas using an LLM.
* **`LLMContentFilter`:** Generate high-quality, focused markdown using an LLM.
* **Improved Error Handling & Stability:** Numerous bug fixes and performance enhancements.
* **Enhanced Documentation:** Updated guides and examples.
**Breaking Changes & Migration:**
This release includes several breaking changes to improve the library's structure and consistency. Here's what you need to know:
* **`arun_many()` Behavior:** Now uses the `MemoryAdaptiveDispatcher` by default. The return type depends on the `stream` parameter in `CrawlerRunConfig`. Adjust code that relied on unbounded concurrency.
* **`max_depth` Location:** Moved to `CrawlerRunConfig` and now controls *crawl depth*.
* **Deep Crawling Imports:** Import `DeepCrawlStrategy` and related classes from `crawl4ai.deep_crawling`.
* **`BrowserContext` API:** Updated; the old `get_context` method is deprecated.
* **Optional Model Fields:** Many data model fields are now optional. Handle potential `None` values.
* **`ScrapingMode` Enum:** Replaced with strategy pattern (`WebScrapingStrategy`, `LXMLWebScrapingStrategy`).
* **`content_filter` Parameter:** Removed from `CrawlerRunConfig`. Use extraction strategies or markdown generators with filters.
* **Removed Functionality:** The synchronous `WebCrawler`, the old CLI, and docs management tools have been removed.
* **Docker:** Significant changes to deployment. See the [Docker documentation](../deploy/docker/README.md).
* **`ssl_certificate.json`:** This file has been removed.
* **Config**: FastFilterChain has been replaced with FilterChain
* **Deep-Crawl**: DeepCrawlStrategy.arun now returns Union[CrawlResultT, List[CrawlResultT], AsyncGenerator[CrawlResultT, None]]
* **Proxy**: Removed synchronous WebCrawler support and related rate limiting configurations
* **LLM Parameters:** Use the new `LLMConfig` object instead of passing `provider`, `api_token`, `base_url`, and `api_base` directly to `LLMExtractionStrategy` and `LLMContentFilter`.
**In short:** Update imports, adjust `arun_many()` usage, check for optional fields, and review the Docker deployment guide.
## License Change
Crawl4AI v0.5.0 updates the license to Apache 2.0 *with a required attribution clause*. This means you are free to use, modify, and distribute Crawl4AI (even commercially), but you *must* clearly attribute the project in any public use or distribution. See the updated `LICENSE` file for the full legal text and specific requirements.
**Get Started:**
* **Installation:** `pip install "crawl4ai[all]"` (or use the Docker image)
* **Documentation:** [https://docs.crawl4ai.com](https://docs.crawl4ai.com)
* **GitHub:** [https://github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
I'm very excited to see what you build with Crawl4AI v0.5.0!
---
### [0.4.2 - Configurable Crawlers, Session Management, and Smarter Screenshots](releases/0.4.2.md)
*December 12, 2024*
The 0.4.2 update brings massive improvements to configuration, making crawlers and browsers easier to manage with dedicated objects. You can now import/export local storage for seamless session management. Plus, long-page screenshots are faster and cleaner, and full-page PDF exports are now possible. Check out all the new features to make your crawling experience even smoother.
[Read full release notes →](releases/0.4.2.md)
---
### [0.4.1 - Smarter Crawling with Lazy-Load Handling, Text-Only Mode, and More](releases/0.4.1.md)
*December 8, 2024*
This release brings major improvements to handling lazy-loaded images, a blazing-fast Text-Only Mode, full-page scanning for infinite scrolls, dynamic viewport adjustments, and session reuse for efficient crawling. If you're looking to improve speed, reliability, or handle dynamic content with ease, this update has you covered.
[Read full release notes →](releases/0.4.1.md)
---
### [0.4.0 - Major Content Filtering Update](releases/0.4.0.md)
*December 1, 2024*
Introduced significant improvements to content filtering, multi-threaded environment handling, and user-agent generation. This release features the new PruningContentFilter, enhanced thread safety, and improved test coverage.
[Read full release notes →](releases/0.4.0.md)
## Project History
Curious about how Crawl4AI has evolved? Check out our [complete changelog](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md) for a detailed history of all versions and updates.
## Stay Updated
- Star us on [GitHub](https://github.com/unclecode/crawl4ai)
- Follow [@unclecode](https://twitter.com/unclecode) on Twitter
- Join our community discussions on GitHub

View File

@@ -0,0 +1,343 @@
# 🚀 Crawl4AI v0.7.0: The Adaptive Intelligence Update
*January 28, 2025 • 10 min read*
---
Today I'm releasing Crawl4AI v0.7.0—the Adaptive Intelligence Update. This release introduces fundamental improvements in how Crawl4AI handles modern web complexity through adaptive learning, intelligent content discovery, and advanced extraction capabilities.
## 🎯 What's New at a Glance
- **Adaptive Crawling**: Your crawler now learns and adapts to website patterns
- **Virtual Scroll Support**: Complete content extraction from infinite scroll pages
- **Link Preview with Intelligent Scoring**: Intelligent link analysis and prioritization
- **Async URL Seeder**: Discover thousands of URLs in seconds with intelligent filtering
- **Performance Optimizations**: Significant speed and memory improvements
## 🧠 Adaptive Crawling: Intelligence Through Pattern Learning
**The Problem:** Websites change. Class names shift. IDs disappear. Your carefully crafted selectors break at 3 AM, and you wake up to empty datasets and angry stakeholders.
**My Solution:** I implemented an adaptive learning system that observes patterns, builds confidence scores, and adjusts extraction strategies on the fly. It's like having a junior developer who gets better at their job with every page they scrape.
### Technical Deep-Dive
The Adaptive Crawler maintains a persistent state for each domain, tracking:
- Pattern success rates
- Selector stability over time
- Content structure variations
- Extraction confidence scores
```python
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
import asyncio
async def main():
# Configure adaptive crawler
config = AdaptiveConfig(
strategy="statistical", # or "embedding" for semantic understanding
max_pages=10,
confidence_threshold=0.7, # Stop at 70% confidence
top_k_links=3, # Follow top 3 links per page
min_gain_threshold=0.05 # Need 5% information gain to continue
)
async with AsyncWebCrawler(verbose=False) as crawler:
adaptive = AdaptiveCrawler(crawler, config)
print("Starting adaptive crawl about Python decorators...")
result = await adaptive.digest(
start_url="https://docs.python.org/3/glossary.html",
query="python decorators functions wrapping"
)
print(f"\n✅ Crawling Complete!")
print(f"• Confidence Level: {adaptive.confidence:.0%}")
print(f"• Pages Crawled: {len(result.crawled_urls)}")
print(f"• Knowledge Base: {len(adaptive.state.knowledge_base)} documents")
# Get most relevant content
relevant = adaptive.get_relevant_content(top_k=3)
print(f"\nMost Relevant Pages:")
for i, page in enumerate(relevant, 1):
print(f"{i}. {page['url']} (relevance: {page['score']:.2%})")
asyncio.run(main())
```
**Expected Real-World Impact:**
- **News Aggregation**: Maintain 95%+ extraction accuracy even as news sites update their templates
- **E-commerce Monitoring**: Track product changes across hundreds of stores without constant maintenance
- **Research Data Collection**: Build robust academic datasets that survive website redesigns
- **Reduced Maintenance**: Cut selector update time by 80% for frequently-changing sites
## 🌊 Virtual Scroll: Complete Content Capture
**The Problem:** Modern web apps only render what's visible. Scroll down, new content appears, old content vanishes into the void. Traditional crawlers capture that first viewport and miss 90% of the content. It's like reading only the first page of every book.
**My Solution:** I built Virtual Scroll support that mimics human browsing behavior, capturing content as it loads and preserving it before the browser's garbage collector strikes.
### Implementation Details
```python
from crawl4ai import VirtualScrollConfig
# For social media feeds (Twitter/X style)
twitter_config = VirtualScrollConfig(
container_selector="[data-testid='primaryColumn']",
scroll_count=20, # Number of scrolls
scroll_by="container_height", # Smart scrolling by container size
wait_after_scroll=1.0 # Let content load
)
# For e-commerce product grids (Instagram style)
grid_config = VirtualScrollConfig(
container_selector="main .product-grid",
scroll_count=30,
scroll_by=800, # Fixed pixel scrolling
wait_after_scroll=1.5 # Images need time
)
# For news feeds with lazy loading
news_config = VirtualScrollConfig(
container_selector=".article-feed",
scroll_count=50,
scroll_by="page_height", # Viewport-based scrolling
wait_after_scroll=0.5 # Wait for content to load
)
# Use it in your crawl
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
"https://twitter.com/trending",
config=CrawlerRunConfig(
virtual_scroll_config=twitter_config,
# Combine with other features
extraction_strategy=JsonCssExtractionStrategy({
"tweets": {
"selector": "[data-testid='tweet']",
"fields": {
"text": {"selector": "[data-testid='tweetText']", "type": "text"},
"likes": {"selector": "[data-testid='like']", "type": "text"}
}
}
})
)
)
print(f"Captured {len(result.extracted_content['tweets'])} tweets")
```
**Key Capabilities:**
- **DOM Recycling Awareness**: Detects and handles virtual DOM element recycling
- **Smart Scroll Physics**: Three modes - container height, page height, or fixed pixels
- **Content Preservation**: Captures content before it's destroyed
- **Intelligent Stopping**: Stops when no new content appears
- **Memory Efficient**: Streams content instead of holding everything in memory
**Expected Real-World Impact:**
- **Social Media Analysis**: Capture entire Twitter threads with hundreds of replies, not just top 10
- **E-commerce Scraping**: Extract 500+ products from infinite scroll catalogs vs. 20-50 with traditional methods
- **News Aggregation**: Get all articles from modern news sites, not just above-the-fold content
- **Research Applications**: Complete data extraction from academic databases using virtual pagination
## 🔗 Link Preview: Intelligent Link Analysis and Scoring
**The Problem:** You crawl a page and get 200 links. Which ones matter? Which lead to the content you actually want? Traditional crawlers force you to follow everything or build complex filters.
**My Solution:** I implemented a three-layer scoring system that analyzes links like a human would—considering their position, context, and relevance to your goals.
### Intelligent Link Analysis and Scoring
```python
import asyncio
from crawl4ai import CrawlerRunConfig, CacheMode, AsyncWebCrawler
from crawl4ai.adaptive_crawler import LinkPreviewConfig
async def main():
# Configure intelligent link analysis
link_config = LinkPreviewConfig(
include_internal=True,
include_external=False,
max_links=10,
concurrency=5,
query="python tutorial", # For contextual scoring
score_threshold=0.3,
verbose=True
)
# Use in your crawl
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
"https://www.geeksforgeeks.org/",
config=CrawlerRunConfig(
link_preview_config=link_config,
score_links=True, # Enable intrinsic scoring
cache_mode=CacheMode.BYPASS
)
)
# Access scored and sorted links
if result.success and result.links:
for link in result.links.get("internal", []):
text = link.get('text', 'No text')[:40]
print(
text,
f"{link.get('intrinsic_score', 0):.1f}/10" if link.get('intrinsic_score') is not None else "0.0/10",
f"{link.get('contextual_score', 0):.2f}/1" if link.get('contextual_score') is not None else "0.00/1",
f"{link.get('total_score', 0):.3f}" if link.get('total_score') is not None else "0.000"
)
asyncio.run(main())
```
**Scoring Components:**
1. **Intrinsic Score**: Based on link quality indicators
- Position on page (navigation, content, footer)
- Link attributes (rel, title, class names)
- Anchor text quality and length
- URL structure and depth
2. **Contextual Score**: Relevance to your query using BM25 algorithm
- Keyword matching in link text and title
- Meta description analysis
- Content preview scoring
3. **Total Score**: Combined score for final ranking
**Expected Real-World Impact:**
- **Research Efficiency**: Find relevant papers 10x faster by following only high-score links
- **Competitive Analysis**: Automatically identify important pages on competitor sites
- **Content Discovery**: Build topic-focused crawlers that stay on track
- **SEO Audits**: Identify and prioritize high-value internal linking opportunities
## 🎣 Async URL Seeder: Automated URL Discovery at Scale
**The Problem:** You want to crawl an entire domain but only have the homepage. Or worse, you want specific content types across thousands of pages. Manual URL discovery? That's a job for machines, not humans.
**My Solution:** I built Async URL Seeder—a turbocharged URL discovery engine that combines multiple sources with intelligent filtering and relevance scoring.
### Technical Architecture
```python
import asyncio
from crawl4ai import AsyncUrlSeeder, SeedingConfig
async def main():
async with AsyncUrlSeeder() as seeder:
# Discover Python tutorial URLs
config = SeedingConfig(
source="sitemap", # Use sitemap
pattern="*python*", # URL pattern filter
extract_head=True, # Get metadata
query="python tutorial", # For relevance scoring
scoring_method="bm25",
score_threshold=0.2,
max_urls=10
)
print("Discovering Python async tutorial URLs...")
urls = await seeder.urls("https://www.geeksforgeeks.org/", config)
print(f"\n✅ Found {len(urls)} relevant URLs:")
for i, url_info in enumerate(urls[:5], 1):
print(f"\n{i}. {url_info['url']}")
if url_info.get('relevance_score'):
print(f" Relevance: {url_info['relevance_score']:.3f}")
if url_info.get('head_data', {}).get('title'):
print(f" Title: {url_info['head_data']['title'][:60]}...")
asyncio.run(main())
```
**Discovery Methods:**
- **Sitemap Mining**: Parses robots.txt and all linked sitemaps
- **Common Crawl**: Queries the Common Crawl index for historical URLs
- **Intelligent Crawling**: Follows links with smart depth control
- **Pattern Analysis**: Learns URL structures and generates variations
**Expected Real-World Impact:**
- **Migration Projects**: Discover 10,000+ URLs from legacy sites in under 60 seconds
- **Market Research**: Map entire competitor ecosystems automatically
- **Academic Research**: Build comprehensive datasets without manual URL collection
- **SEO Audits**: Find every indexable page with content scoring
- **Content Archival**: Ensure no content is left behind during site migrations
## ⚡ Performance Optimizations
This release includes significant performance improvements through optimized resource handling, better concurrency management, and reduced memory footprint.
### What We Optimized
```python
# Optimized crawling with v0.7.0 improvements
results = []
for url in urls:
result = await crawler.arun(
url,
config=CrawlerRunConfig(
# Performance optimizations
wait_until="domcontentloaded", # Faster than networkidle
cache_mode=CacheMode.ENABLED # Enable caching
)
)
results.append(result)
```
**Performance Gains:**
- **Startup Time**: 70% faster browser initialization
- **Page Loading**: 40% reduction with smart resource blocking
- **Extraction**: 3x faster with compiled CSS selectors
- **Memory Usage**: 60% reduction with streaming processing
- **Concurrent Crawls**: Handle 5x more parallel requests
## 🔧 Important Changes
### Breaking Changes
- `link_extractor` renamed to `link_preview` (better reflects functionality)
- Minimum Python version now 3.9
- `CrawlerConfig` split into `CrawlerRunConfig` and `BrowserConfig`
### Migration Guide
```python
# Old (v0.6.x)
from crawl4ai import CrawlerConfig
config = CrawlerConfig(timeout=30000)
# New (v0.7.0)
from crawl4ai import CrawlerRunConfig, BrowserConfig
browser_config = BrowserConfig(timeout=30000)
run_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
```
## 🤖 Coming Soon: Intelligent Web Automation
I'm currently working on bringing advanced automation capabilities to Crawl4AI. This includes:
- **Crawl Agents**: Autonomous crawlers that understand your goals and adapt their strategies
- **Auto JS Generation**: Automatic JavaScript code generation for complex interactions
- **Smart Form Handling**: Intelligent form detection and filling
- **Context-Aware Actions**: Crawlers that understand page context and make decisions
These features are under active development and will revolutionize how we approach web automation. Stay tuned!
## 🚀 Get Started
```bash
pip install crawl4ai==0.7.0
```
Check out the [updated documentation](https://docs.crawl4ai.com).
Questions? Issues? I'm always listening:
- GitHub: [github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- Discord: [discord.gg/crawl4ai](https://discord.gg/jP8KfhDhyN)
- Twitter: [@unclecode](https://x.com/unclecode)
Happy crawling! 🕷️
---
*P.S. If you're using Crawl4AI in production, I'd love to hear about it. Your use cases inspire the next features.*

View File

@@ -35,7 +35,7 @@ from crawl4ai import AsyncWebCrawler, AdaptiveCrawler
async def main():
async with AsyncWebCrawler() as crawler:
# Create an adaptive crawler
# Create an adaptive crawler (config is optional)
adaptive = AdaptiveCrawler(crawler)
# Start crawling with a query
@@ -59,13 +59,13 @@ async def main():
from crawl4ai import AdaptiveConfig
config = AdaptiveConfig(
confidence_threshold=0.7, # Stop when 70% confident (default: 0.8)
max_pages=20, # Maximum pages to crawl (default: 50)
top_k_links=3, # Links to follow per page (default: 5)
confidence_threshold=0.8, # Stop when 80% confident (default: 0.7)
max_pages=30, # Maximum pages to crawl (default: 20)
top_k_links=5, # Links to follow per page (default: 3)
min_gain_threshold=0.05 # Minimum expected gain to continue (default: 0.1)
)
adaptive = AdaptiveCrawler(crawler, config=config)
adaptive = AdaptiveCrawler(crawler, config)
```
## Crawling Strategies
@@ -198,8 +198,8 @@ if result.metrics.get('is_irrelevant', False):
The confidence score (0-1) indicates how sufficient the gathered information is:
- **0.0-0.3**: Insufficient information, needs more crawling
- **0.3-0.6**: Partial information, may answer basic queries
- **0.6-0.8**: Good coverage, can answer most queries
- **0.8-1.0**: Excellent coverage, comprehensive information
- **0.6-0.7**: Good coverage, can answer most queries
- **0.7-1.0**: Excellent coverage, comprehensive information
### Statistics Display
@@ -257,9 +257,9 @@ new_adaptive.import_knowledge_base("knowledge_base.jsonl")
- Avoid overly broad queries
### 2. Threshold Tuning
- Start with default (0.8) for general use
- Lower to 0.6-0.7 for exploratory crawling
- Raise to 0.9+ for exhaustive coverage
- Start with default (0.7) for general use
- Lower to 0.5-0.6 for exploratory crawling
- Raise to 0.8+ for exhaustive coverage
### 3. Performance Optimization
- Use appropriate `max_pages` limits

View File

@@ -52,11 +52,9 @@ That's it! In just a few lines, you've automated a complete search workflow.
Want to learn by doing? We've got you covered:
**🚀 [Live Demo](https://docs.crawl4ai.com/c4a-script/demo)** - Try C4A-Script in your browser right now!
**🚀 [Live Demo](https://docs.crawl4ai.com/apps/c4a-script/)** - Try C4A-Script in your browser right now!
**📁 [Tutorial Examples](/examples/c4a_script/)** - Complete examples with source code
**🛠️ [Local Tutorial](/examples/c4a_script/tutorial/)** - Run the interactive tutorial on your machine
**📁 [Tutorial Examples](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/c4a_script/)** - Complete examples with source code
### Running the Tutorial Locally

View File

@@ -58,13 +58,15 @@ Pull and run images directly from Docker Hub without building locally.
#### 1. Pull the Image
Our latest release candidate is `0.6.0-r2`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
Our latest release candidate is `0.7.0-r1`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system.
> ⚠️ **Important Note**: The `latest` tag currently points to the stable `0.6.0` version. After testing and validation, `0.7.0` (without -r1) will be released and `latest` will be updated. For now, please use `0.7.0-r1` to test the new features.
```bash
# Pull the release candidate (recommended for latest features)
docker pull unclecode/crawl4ai:0.6.0-r1
# Pull the release candidate (for testing new features)
docker pull unclecode/crawl4ai:0.7.0-r1
# Or pull the latest stable version
# Or pull the current stable version (0.6.0)
docker pull unclecode/crawl4ai:latest
```
@@ -124,7 +126,7 @@ docker stop crawl4ai && docker rm crawl4ai
#### Docker Hub Versioning Explained
* **Image Name:** `unclecode/crawl4ai`
* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.6.0-r2`)
* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.7.0-r1`)
* `LIBRARY_VERSION`: The semantic version of the core `crawl4ai` Python library
* `SUFFIX`: Optional tag for release candidates (``) and revisions (`r1`)
* **`latest` Tag:** Points to the most recent stable version

View File

@@ -125,7 +125,7 @@ Here's a full example you can copy, paste, and run immediately:
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.async_configs import LinkPreviewConfig
from crawl4ai import LinkPreviewConfig
async def extract_link_heads_example():
"""
@@ -237,7 +237,7 @@ if __name__ == "__main__":
The `LinkPreviewConfig` class supports these options:
```python
from crawl4ai.async_configs import LinkPreviewConfig
from crawl4ai import LinkPreviewConfig
link_preview_config = LinkPreviewConfig(
# BASIC SETTINGS

View File

@@ -31,9 +31,16 @@ if __name__ == "__main__":
The `arun()` method returns a `CrawlResult` object with several useful properties. Here's a quick overview (see [CrawlResult](../api/crawl-result.md) for complete details):
```python
config = CrawlerRunConfig(
markdown_generator=DefaultMarkdownGenerator(
content_filter=PruningContentFilter(threshold=0.6),
options={"ignore_links": True}
)
)
result = await crawler.arun(
url="https://example.com",
config=CrawlerRunConfig(fit_markdown=True)
config=config
)
# Different content formats

View File

@@ -137,7 +137,7 @@ async def smart_blog_crawler():
word_count_threshold=300 # Only substantial articles
)
# Extract URLs and stream results as they come
# Extract URLs and crawl them
tutorial_urls = [t["url"] for t in tutorials[:10]]
results = await crawler.arun_many(tutorial_urls, config=config)
@@ -231,7 +231,7 @@ Common Crawl is a massive public dataset that regularly crawls the entire web. I
```python
# Use both sources
config = SeedingConfig(source="cc+sitemap")
config = SeedingConfig(source="sitemap+cc")
urls = await seeder.urls("example.com", config)
```
@@ -241,13 +241,13 @@ The `SeedingConfig` object is your control panel. Here's everything you can conf
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `source` | str | "cc" | URL source: "cc" (Common Crawl), "sitemap", or "cc+sitemap" |
| `source` | str | "sitemap+cc" | URL source: "cc" (Common Crawl), "sitemap", or "sitemap+cc" |
| `pattern` | str | "*" | URL pattern filter (e.g., "*/blog/*", "*.html") |
| `extract_head` | bool | False | Extract metadata from page `<head>` |
| `live_check` | bool | False | Verify URLs are accessible |
| `max_urls` | int | -1 | Maximum URLs to return (-1 = unlimited) |
| `concurrency` | int | 10 | Parallel workers for fetching |
| `hits_per_sec` | int | None | Rate limit for requests |
| `hits_per_sec` | int | 5 | Rate limit for requests |
| `force` | bool | False | Bypass cache, fetch fresh data |
| `verbose` | bool | False | Show detailed progress |
| `query` | str | None | Search query for BM25 scoring |
@@ -522,7 +522,7 @@ urls = await seeder.urls("docs.example.com", config)
```python
# Find specific products
config = SeedingConfig(
source="cc+sitemap", # Use both sources
source="sitemap+cc", # Use both sources
extract_head=True,
query="wireless headphones noise canceling",
scoring_method="bm25",
@@ -782,7 +782,7 @@ class ResearchAssistant:
# Step 1: Discover relevant URLs
config = SeedingConfig(
source="cc+sitemap", # Maximum coverage
source="sitemap+cc", # Maximum coverage
extract_head=True, # Get metadata
query=topic, # Research topic
scoring_method="bm25", # Smart scoring
@@ -832,7 +832,8 @@ class ResearchAssistant:
# Extract URLs and crawl all articles
article_urls = [article['url'] for article in top_articles]
results = []
async for result in await crawler.arun_many(article_urls, config=config):
crawl_results = await crawler.arun_many(article_urls, config=config)
async for result in crawl_results:
if result.success:
results.append({
'url': result.url,
@@ -933,10 +934,10 @@ config = SeedingConfig(concurrency=10, hits_per_sec=5)
# When crawling many URLs
async with AsyncWebCrawler() as crawler:
# Assuming urls is a list of URL strings
results = await crawler.arun_many(urls, config=config)
crawl_results = await crawler.arun_many(urls, config=config)
# Process as they arrive
async for result in results:
async for result in crawl_results:
process_immediately(result) # Don't wait for all
```
@@ -1020,7 +1021,7 @@ config = SeedingConfig(
# E-commerce product discovery
config = SeedingConfig(
source="cc+sitemap",
source="sitemap+cc",
pattern="*/product/*",
extract_head=True,
live_check=True

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,408 @@
"""
🚀 Crawl4AI v0.7.0 Release Demo
================================
This demo showcases all major features introduced in v0.7.0 release.
Major Features:
1. ✅ Adaptive Crawling - Intelligent crawling with confidence tracking
2. ✅ Virtual Scroll Support - Handle infinite scroll pages
3. ✅ Link Preview - Advanced link analysis with 3-layer scoring
4. ✅ URL Seeder - Smart URL discovery and filtering
5. ✅ C4A Script - Domain-specific language for web automation
6. ✅ Chrome Extension Updates - Click2Crawl and instant schema extraction
7. ✅ PDF Parsing Support - Extract content from PDF documents
8. ✅ Nightly Builds - Automated nightly releases
Run this demo to see all features in action!
"""
import asyncio
import json
from typing import List, Dict
from rich.console import Console
from rich.table import Table
from rich.panel import Panel
from rich import box
from crawl4ai import (
AsyncWebCrawler,
CrawlerRunConfig,
BrowserConfig,
CacheMode,
AdaptiveCrawler,
AdaptiveConfig,
AsyncUrlSeeder,
SeedingConfig,
c4a_compile,
CompilationResult
)
from crawl4ai.async_configs import VirtualScrollConfig, LinkPreviewConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
console = Console()
def print_section(title: str, description: str = ""):
"""Print a section header"""
console.print(f"\n[bold cyan]{'=' * 60}[/bold cyan]")
console.print(f"[bold yellow]{title}[/bold yellow]")
if description:
console.print(f"[dim]{description}[/dim]")
console.print(f"[bold cyan]{'=' * 60}[/bold cyan]\n")
async def demo_1_adaptive_crawling():
"""Demo 1: Adaptive Crawling - Intelligent content extraction"""
print_section(
"Demo 1: Adaptive Crawling",
"Intelligently learns and adapts to website patterns"
)
# Create adaptive crawler with custom configuration
config = AdaptiveConfig(
strategy="statistical", # or "embedding"
confidence_threshold=0.7,
max_pages=10,
top_k_links=3,
min_gain_threshold=0.1
)
# Example: Learn from a product page
console.print("[cyan]Learning from product page patterns...[/cyan]")
async with AsyncWebCrawler() as crawler:
adaptive = AdaptiveCrawler(crawler, config)
# Start adaptive crawl
console.print("[cyan]Starting adaptive crawl...[/cyan]")
result = await adaptive.digest(
start_url="https://docs.python.org/3/",
query="python decorators tutorial"
)
console.print("[green]✓ Adaptive crawl completed[/green]")
console.print(f" - Confidence Level: {adaptive.confidence:.0%}")
console.print(f" - Pages Crawled: {len(result.crawled_urls)}")
console.print(f" - Knowledge Base: {len(adaptive.state.knowledge_base)} documents")
# Get most relevant content
relevant = adaptive.get_relevant_content(top_k=3)
if relevant:
console.print("\nMost relevant pages:")
for i, page in enumerate(relevant, 1):
console.print(f" {i}. {page['url']} (relevance: {page['score']:.2%})")
async def demo_2_virtual_scroll():
"""Demo 2: Virtual Scroll - Handle infinite scroll pages"""
print_section(
"Demo 2: Virtual Scroll Support",
"Capture content from modern infinite scroll pages"
)
# Configure virtual scroll - using body as container for example.com
scroll_config = VirtualScrollConfig(
container_selector="body", # Using body since example.com has simple structure
scroll_count=3, # Just 3 scrolls for demo
scroll_by="container_height", # or "page_height" or pixel value
wait_after_scroll=0.5 # Wait 500ms after each scroll
)
config = CrawlerRunConfig(
virtual_scroll_config=scroll_config,
cache_mode=CacheMode.BYPASS,
wait_until="networkidle"
)
console.print("[cyan]Virtual Scroll Configuration:[/cyan]")
console.print(f" - Container: {scroll_config.container_selector}")
console.print(f" - Scroll count: {scroll_config.scroll_count}")
console.print(f" - Scroll by: {scroll_config.scroll_by}")
console.print(f" - Wait after scroll: {scroll_config.wait_after_scroll}s")
console.print("\n[dim]Note: Using example.com for demo - in production, use this[/dim]")
console.print("[dim]with actual infinite scroll pages like social media feeds.[/dim]\n")
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
"https://example.com",
config=config
)
if result.success:
console.print("[green]✓ Virtual scroll executed successfully![/green]")
console.print(f" - Content length: {len(result.markdown)} chars")
# Show example of how to use with real infinite scroll sites
console.print("\n[yellow]Example for real infinite scroll sites:[/yellow]")
console.print("""
# For Twitter-like feeds:
scroll_config = VirtualScrollConfig(
container_selector="[data-testid='primaryColumn']",
scroll_count=20,
scroll_by="container_height",
wait_after_scroll=1.0
)
# For Instagram-like grids:
scroll_config = VirtualScrollConfig(
container_selector="main article",
scroll_count=15,
scroll_by=1000, # Fixed pixel amount
wait_after_scroll=1.5
)""")
async def demo_3_link_preview():
"""Demo 3: Link Preview with 3-layer scoring"""
print_section(
"Demo 3: Link Preview & Scoring",
"Advanced link analysis with intrinsic, contextual, and total scoring"
)
# Configure link preview
link_config = LinkPreviewConfig(
include_internal=True,
include_external=False,
max_links=10,
concurrency=5,
query="python tutorial", # For contextual scoring
score_threshold=0.3,
verbose=True
)
config = CrawlerRunConfig(
link_preview_config=link_config,
score_links=True, # Enable intrinsic scoring
cache_mode=CacheMode.BYPASS
)
console.print("[cyan]Analyzing links with 3-layer scoring system...[/cyan]")
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://docs.python.org/3/", config=config)
if result.success and result.links:
# Get scored links
internal_links = result.links.get("internal", [])
scored_links = [l for l in internal_links if l.get("total_score")]
scored_links.sort(key=lambda x: x.get("total_score", 0), reverse=True)
# Create a scoring table
table = Table(title="Link Scoring Results", box=box.ROUNDED)
table.add_column("Link Text", style="cyan", width=40)
table.add_column("Intrinsic Score", justify="center")
table.add_column("Contextual Score", justify="center")
table.add_column("Total Score", justify="center", style="bold green")
for link in scored_links[:5]:
text = link.get('text', 'No text')[:40]
table.add_row(
text,
f"{link.get('intrinsic_score', 0):.1f}/10",
f"{link.get('contextual_score', 0):.2f}/1",
f"{link.get('total_score', 0):.3f}"
)
console.print(table)
async def demo_4_url_seeder():
"""Demo 4: URL Seeder - Smart URL discovery"""
print_section(
"Demo 4: URL Seeder",
"Intelligent URL discovery and filtering"
)
# Configure seeding
seeding_config = SeedingConfig(
source="cc+sitemap", # or "crawl"
pattern="*tutorial*", # URL pattern filter
max_urls=50,
extract_head=True, # Get metadata
query="python programming", # For relevance scoring
scoring_method="bm25",
score_threshold=0.2,
force = True
)
console.print("[cyan]URL Seeder Configuration:[/cyan]")
console.print(f" - Source: {seeding_config.source}")
console.print(f" - Pattern: {seeding_config.pattern}")
console.print(f" - Max URLs: {seeding_config.max_urls}")
console.print(f" - Query: {seeding_config.query}")
console.print(f" - Scoring: {seeding_config.scoring_method}")
# Use URL seeder to discover URLs
async with AsyncUrlSeeder() as seeder:
console.print("\n[cyan]Discovering URLs from Python docs...[/cyan]")
urls = await seeder.urls("docs.python.org", seeding_config)
console.print(f"\n[green]✓ Discovered {len(urls)} URLs[/green]")
for i, url_info in enumerate(urls[:5], 1):
console.print(f" {i}. {url_info['url']}")
if url_info.get('relevance_score'):
console.print(f" Relevance: {url_info['relevance_score']:.3f}")
async def demo_5_c4a_script():
"""Demo 5: C4A Script - Domain-specific language"""
print_section(
"Demo 5: C4A Script Language",
"Domain-specific language for web automation"
)
# Example C4A script
c4a_script = """
# Simple C4A script example
WAIT `body` 3
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept`
CLICK `.search-button`
TYPE "python tutorial"
PRESS Enter
WAIT `.results` 5
"""
console.print("[cyan]C4A Script Example:[/cyan]")
console.print(Panel(c4a_script, title="script.c4a", border_style="blue"))
# Compile the script
compilation_result = c4a_compile(c4a_script)
if compilation_result.success:
console.print("[green]✓ Script compiled successfully![/green]")
console.print(f" - Generated {len(compilation_result.js_code)} JavaScript statements")
console.print("\nFirst 3 JS statements:")
for stmt in compilation_result.js_code[:3]:
console.print(f"{stmt}")
else:
console.print("[red]✗ Script compilation failed[/red]")
if compilation_result.first_error:
error = compilation_result.first_error
console.print(f" Error at line {error.line}: {error.message}")
async def demo_6_css_extraction():
"""Demo 6: Enhanced CSS/JSON extraction"""
print_section(
"Demo 6: Enhanced Extraction",
"Improved CSS selector and JSON extraction"
)
# Define extraction schema
schema = {
"name": "Example Page Data",
"baseSelector": "body",
"fields": [
{
"name": "title",
"selector": "h1",
"type": "text"
},
{
"name": "paragraphs",
"selector": "p",
"type": "list",
"fields": [
{"name": "text", "type": "text"}
]
}
]
}
extraction_strategy = JsonCssExtractionStrategy(schema)
console.print("[cyan]Extraction Schema:[/cyan]")
console.print(json.dumps(schema, indent=2))
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
"https://example.com",
config=CrawlerRunConfig(
extraction_strategy=extraction_strategy,
cache_mode=CacheMode.BYPASS
)
)
if result.success and result.extracted_content:
console.print("\n[green]✓ Content extracted successfully![/green]")
console.print(f"Extracted: {json.dumps(json.loads(result.extracted_content), indent=2)[:200]}...")
async def demo_7_performance_improvements():
"""Demo 7: Performance improvements"""
print_section(
"Demo 7: Performance Improvements",
"Faster crawling with better resource management"
)
# Performance-optimized configuration
config = CrawlerRunConfig(
cache_mode=CacheMode.ENABLED, # Use caching
wait_until="domcontentloaded", # Faster than networkidle
page_timeout=10000, # 10 second timeout
exclude_external_links=True,
exclude_social_media_links=True,
exclude_external_images=True
)
console.print("[cyan]Performance Configuration:[/cyan]")
console.print(" - Cache: ENABLED")
console.print(" - Wait: domcontentloaded (faster)")
console.print(" - Timeout: 10s")
console.print(" - Excluding: external links, images, social media")
# Measure performance
import time
start_time = time.time()
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://example.com", config=config)
elapsed = time.time() - start_time
if result.success:
console.print(f"\n[green]✓ Page crawled in {elapsed:.2f} seconds[/green]")
async def main():
"""Run all demos"""
console.print(Panel(
"[bold cyan]Crawl4AI v0.7.0 Release Demo[/bold cyan]\n\n"
"This demo showcases all major features introduced in v0.7.0.\n"
"Each demo is self-contained and demonstrates a specific feature.",
title="Welcome",
border_style="blue"
))
demos = [
demo_1_adaptive_crawling,
demo_2_virtual_scroll,
demo_3_link_preview,
demo_4_url_seeder,
demo_5_c4a_script,
demo_6_css_extraction,
demo_7_performance_improvements
]
for i, demo in enumerate(demos, 1):
try:
await demo()
if i < len(demos):
console.print("\n[dim]Press Enter to continue to next demo...[/dim]")
input()
except Exception as e:
console.print(f"[red]Error in demo: {e}[/red]")
continue
console.print(Panel(
"[bold green]Demo Complete![/bold green]\n\n"
"Thank you for trying Crawl4AI v0.7.0!\n"
"For more examples and documentation, visit:\n"
"https://github.com/unclecode/crawl4ai",
title="Complete",
border_style="green"
))
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -0,0 +1,280 @@
"""
🚀 Crawl4AI v0.7.0 Feature Demo
================================
This file demonstrates the major features introduced in v0.7.0 with practical examples.
"""
import asyncio
import json
from pathlib import Path
from crawl4ai import (
AsyncWebCrawler,
CrawlerRunConfig,
BrowserConfig,
CacheMode,
# New imports for v0.7.0
VirtualScrollConfig,
LinkPreviewConfig,
AdaptiveCrawler,
AdaptiveConfig,
AsyncUrlSeeder,
SeedingConfig,
c4a_compile,
)
async def demo_link_preview():
"""
Demo 1: Link Preview with 3-Layer Scoring
Shows how to analyze links with intrinsic quality scores,
contextual relevance, and combined total scores.
"""
print("\n" + "="*60)
print("🔗 DEMO 1: Link Preview & Intelligent Scoring")
print("="*60)
# Configure link preview with contextual scoring
config = CrawlerRunConfig(
link_preview_config=LinkPreviewConfig(
include_internal=True,
include_external=False,
max_links=10,
concurrency=5,
query="machine learning tutorials", # For contextual scoring
score_threshold=0.3, # Minimum relevance
verbose=True
),
score_links=True, # Enable intrinsic scoring
cache_mode=CacheMode.BYPASS
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://scikit-learn.org/stable/", config=config)
if result.success:
# Get scored links
internal_links = result.links.get("internal", [])
scored_links = [l for l in internal_links if l.get("total_score")]
scored_links.sort(key=lambda x: x.get("total_score", 0), reverse=True)
print(f"\nTop 5 Most Relevant Links:")
for i, link in enumerate(scored_links[:5], 1):
print(f"\n{i}. {link.get('text', 'No text')[:50]}...")
print(f" URL: {link['href']}")
print(f" Intrinsic Score: {link.get('intrinsic_score', 0):.2f}/10")
print(f" Contextual Score: {link.get('contextual_score', 0):.3f}")
print(f" Total Score: {link.get('total_score', 0):.3f}")
# Show metadata if available
if link.get('head_data'):
title = link['head_data'].get('title', 'No title')
print(f" Title: {title[:60]}...")
async def demo_adaptive_crawling():
"""
Demo 2: Adaptive Crawling
Shows intelligent crawling that stops when enough information
is gathered, with confidence tracking.
"""
print("\n" + "="*60)
print("🎯 DEMO 2: Adaptive Crawling with Confidence Tracking")
print("="*60)
# Configure adaptive crawler
config = AdaptiveConfig(
strategy="statistical", # or "embedding" for semantic understanding
max_pages=10,
confidence_threshold=0.7, # Stop at 70% confidence
top_k_links=3, # Follow top 3 links per page
min_gain_threshold=0.05 # Need 5% information gain to continue
)
async with AsyncWebCrawler(verbose=False) as crawler:
adaptive = AdaptiveCrawler(crawler, config)
print("Starting adaptive crawl about Python decorators...")
result = await adaptive.digest(
start_url="https://docs.python.org/3/glossary.html",
query="python decorators functions wrapping"
)
print(f"\n✅ Crawling Complete!")
print(f"• Confidence Level: {adaptive.confidence:.0%}")
print(f"• Pages Crawled: {len(result.crawled_urls)}")
print(f"• Knowledge Base: {len(adaptive.state.knowledge_base)} documents")
# Get most relevant content
relevant = adaptive.get_relevant_content(top_k=3)
print(f"\nMost Relevant Pages:")
for i, page in enumerate(relevant, 1):
print(f"{i}. {page['url']} (relevance: {page['score']:.2%})")
async def demo_virtual_scroll():
"""
Demo 3: Virtual Scroll for Modern Web Pages
Shows how to capture content from pages with DOM recycling
(Twitter, Instagram, infinite scroll).
"""
print("\n" + "="*60)
print("📜 DEMO 3: Virtual Scroll Support")
print("="*60)
# Configure virtual scroll for a news site
virtual_config = VirtualScrollConfig(
container_selector="main, article, .content", # Common containers
scroll_count=20, # Scroll up to 20 times
scroll_by="container_height", # Scroll by container height
wait_after_scroll=0.5 # Wait 500ms after each scroll
)
config = CrawlerRunConfig(
virtual_scroll_config=virtual_config,
cache_mode=CacheMode.BYPASS,
wait_for="css:article" # Wait for articles to load
)
# Example with a real news site
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
"https://news.ycombinator.com/",
config=config
)
if result.success:
# Count items captured
import re
items = len(re.findall(r'class="athing"', result.html))
print(f"\n✅ Captured {items} news items")
print(f"• HTML size: {len(result.html):,} bytes")
print(f"• Without virtual scroll, would only capture ~30 items")
async def demo_url_seeder():
"""
Demo 4: URL Seeder for Intelligent Discovery
Shows how to discover and filter URLs before crawling,
with relevance scoring.
"""
print("\n" + "="*60)
print("🌱 DEMO 4: URL Seeder - Smart URL Discovery")
print("="*60)
async with AsyncUrlSeeder() as seeder:
# Discover Python tutorial URLs
config = SeedingConfig(
source="sitemap", # Use sitemap
pattern="*python*", # URL pattern filter
extract_head=True, # Get metadata
query="python tutorial", # For relevance scoring
scoring_method="bm25",
score_threshold=0.2,
max_urls=10
)
print("Discovering Python async tutorial URLs...")
urls = await seeder.urls("https://www.geeksforgeeks.org/", config)
print(f"\n✅ Found {len(urls)} relevant URLs:")
for i, url_info in enumerate(urls[:5], 1):
print(f"\n{i}. {url_info['url']}")
if url_info.get('relevance_score'):
print(f" Relevance: {url_info['relevance_score']:.3f}")
if url_info.get('head_data', {}).get('title'):
print(f" Title: {url_info['head_data']['title'][:60]}...")
async def demo_c4a_script():
"""
Demo 5: C4A Script Language
Shows the domain-specific language for web automation
with JavaScript transpilation.
"""
print("\n" + "="*60)
print("🎭 DEMO 5: C4A Script - Web Automation Language")
print("="*60)
# Example C4A script
c4a_script = """
# E-commerce automation script
WAIT `body` 3
# Handle cookie banner
IF (EXISTS `.cookie-banner`) THEN CLICK `.accept-cookies`
# Search for product
CLICK `.search-box`
TYPE "wireless headphones"
PRESS Enter
# Wait for results
WAIT `.product-grid` 10
# Load more products
REPEAT (SCROLL DOWN 500, `document.querySelectorAll('.product').length < 50`)
# Apply filter
IF (EXISTS `.price-filter`) THEN CLICK `input[data-max-price="100"]`
"""
# Compile the script
print("Compiling C4A script...")
result = c4a_compile(c4a_script)
if result.success:
print(f"✅ Successfully compiled to {len(result.js_code)} JavaScript statements!")
print("\nFirst 3 JS statements:")
for stmt in result.js_code[:3]:
print(f"{stmt}")
# Use with crawler
config = CrawlerRunConfig(
c4a_script=c4a_script, # Pass C4A script directly
cache_mode=CacheMode.BYPASS
)
print("\n✅ Script ready for use with AsyncWebCrawler!")
else:
print(f"❌ Compilation error: {result.first_error.message}")
async def main():
"""Run all demos"""
print("\n🚀 Crawl4AI v0.7.0 Feature Demonstrations")
print("=" * 60)
demos = [
("Link Preview & Scoring", demo_link_preview),
("Adaptive Crawling", demo_adaptive_crawling),
("Virtual Scroll", demo_virtual_scroll),
("URL Seeder", demo_url_seeder),
("C4A Script", demo_c4a_script),
]
for name, demo_func in demos:
try:
await demo_func()
except Exception as e:
print(f"\n❌ Error in {name} demo: {str(e)}")
# Pause between demos
await asyncio.sleep(1)
print("\n" + "="*60)
print("✅ All demos completed!")
print("\nKey Takeaways:")
print("• Link Preview: 3-layer scoring for intelligent link analysis")
print("• Adaptive Crawling: Stop when you have enough information")
print("• Virtual Scroll: Capture all content from modern web pages")
print("• URL Seeder: Pre-discover and filter URLs efficiently")
print("• C4A Script: Simple language for complex automations")
if __name__ == "__main__":
asyncio.run(main())

View File

@@ -1,4 +1,4 @@
site_name: Crawl4AI Documentation (v0.6.x)
site_name: Crawl4AI Documentation (v0.7.x)
site_favicon: docs/md_v2/favicon.ico
site_description: 🚀🤖 Crawl4AI, Open-source LLM-Friendly Web Crawler & Scraper
site_url: https://docs.crawl4ai.com
@@ -25,6 +25,8 @@ nav:
- "Command Line Interface": "core/cli.md"
- "Simple Crawling": "core/simple-crawling.md"
- "Deep Crawling": "core/deep-crawling.md"
- "Adaptive Crawling": "core/adaptive-crawling.md"
- "URL Seeding": "core/url-seeding.md"
- "C4A-Script": "core/c4a-script.md"
- "Crawler Result": "core/crawler-result.md"
- "Browser, Crawler & LLM Config": "core/browser-crawler-config.md"
@@ -37,6 +39,7 @@ nav:
- "Link & Media": "core/link-media.md"
- Advanced:
- "Overview": "advanced/advanced-features.md"
- "Adaptive Strategies": "advanced/adaptive-strategies.md"
- "Virtual Scroll": "advanced/virtual-scroll.md"
- "File Downloading": "advanced/file-downloading.md"
- "Lazy Loading": "advanced/lazy-loading.md"

View File

@@ -44,7 +44,6 @@ dependencies = [
"brotli>=1.1.0",
"humanize>=4.10.0",
"lark>=1.2.2",
"sentence-transformers>=2.2.0",
"alphashape>=1.3.1",
"shapely>=2.0.0"
]
@@ -62,8 +61,8 @@ classifiers = [
[project.optional-dependencies]
pdf = ["PyPDF2"]
torch = ["torch", "nltk", "scikit-learn"]
transformer = ["transformers", "tokenizers"]
cosine = ["torch", "transformers", "nltk"]
transformer = ["transformers", "tokenizers", "sentence-transformers"]
cosine = ["torch", "transformers", "nltk", "sentence-transformers"]
sync = ["selenium"]
all = [
"PyPDF2",
@@ -72,8 +71,8 @@ all = [
"scikit-learn",
"transformers",
"tokenizers",
"selenium",
"PyPDF2"
"sentence-transformers",
"selenium"
]
[project.scripts]

View File

@@ -24,7 +24,6 @@ cssselect>=1.2.0
chardet>=5.2.0
brotli>=1.1.0
httpx[http2]>=0.27.2
sentence-transformers>=2.2.0
alphashape>=1.3.1
shapely>=2.0.0

View File

@@ -0,0 +1,345 @@
#!/usr/bin/env python3
"""
Simple API Test for Crawl4AI Docker Server v0.7.0
Uses only built-in Python modules to test all endpoints.
"""
import urllib.request
import urllib.parse
import json
import time
import sys
from typing import Dict, List, Optional
# Configuration
BASE_URL = "http://localhost:11234" # Change to your server URL
TEST_TIMEOUT = 30
class SimpleApiTester:
def __init__(self, base_url: str = BASE_URL):
self.base_url = base_url
self.token = None
self.results = []
def log(self, message: str):
print(f"[INFO] {message}")
def test_get_endpoint(self, endpoint: str) -> Dict:
"""Test a GET endpoint"""
url = f"{self.base_url}{endpoint}"
start_time = time.time()
try:
req = urllib.request.Request(url)
if self.token:
req.add_header('Authorization', f'Bearer {self.token}')
with urllib.request.urlopen(req, timeout=TEST_TIMEOUT) as response:
response_time = time.time() - start_time
status_code = response.getcode()
content = response.read().decode('utf-8')
# Try to parse JSON
try:
data = json.loads(content)
except:
data = {"raw_response": content[:200]}
return {
"endpoint": endpoint,
"method": "GET",
"status": "PASS" if status_code < 400 else "FAIL",
"status_code": status_code,
"response_time": response_time,
"data": data
}
except Exception as e:
response_time = time.time() - start_time
return {
"endpoint": endpoint,
"method": "GET",
"status": "FAIL",
"status_code": None,
"response_time": response_time,
"error": str(e)
}
def test_post_endpoint(self, endpoint: str, payload: Dict) -> Dict:
"""Test a POST endpoint"""
url = f"{self.base_url}{endpoint}"
start_time = time.time()
try:
data = json.dumps(payload).encode('utf-8')
req = urllib.request.Request(url, data=data, method='POST')
req.add_header('Content-Type', 'application/json')
if self.token:
req.add_header('Authorization', f'Bearer {self.token}')
with urllib.request.urlopen(req, timeout=TEST_TIMEOUT) as response:
response_time = time.time() - start_time
status_code = response.getcode()
content = response.read().decode('utf-8')
# Try to parse JSON
try:
data = json.loads(content)
except:
data = {"raw_response": content[:200]}
return {
"endpoint": endpoint,
"method": "POST",
"status": "PASS" if status_code < 400 else "FAIL",
"status_code": status_code,
"response_time": response_time,
"data": data
}
except Exception as e:
response_time = time.time() - start_time
return {
"endpoint": endpoint,
"method": "POST",
"status": "FAIL",
"status_code": None,
"response_time": response_time,
"error": str(e)
}
def print_result(self, result: Dict):
"""Print a formatted test result"""
status_color = {
"PASS": "",
"FAIL": "",
"SKIP": "⏭️"
}
print(f"{status_color[result['status']]} {result['method']} {result['endpoint']} "
f"| {result['response_time']:.3f}s | Status: {result['status_code'] or 'N/A'}")
if result['status'] == 'FAIL' and 'error' in result:
print(f" Error: {result['error']}")
self.results.append(result)
def run_all_tests(self):
"""Run all API tests"""
print("🚀 Starting Crawl4AI v0.7.0 API Test Suite")
print(f"📡 Testing server at: {self.base_url}")
print("=" * 60)
# # Test basic endpoints
# print("\n=== BASIC ENDPOINTS ===")
# # Health check
# result = self.test_get_endpoint("/health")
# self.print_result(result)
# # Schema endpoint
# result = self.test_get_endpoint("/schema")
# self.print_result(result)
# # Metrics endpoint
# result = self.test_get_endpoint("/metrics")
# self.print_result(result)
# # Root redirect
# result = self.test_get_endpoint("/")
# self.print_result(result)
# # Test authentication
# print("\n=== AUTHENTICATION ===")
# # Get token
# token_payload = {"email": "test@example.com"}
# result = self.test_post_endpoint("/token", token_payload)
# self.print_result(result)
# # Extract token if successful
# if result['status'] == 'PASS' and 'data' in result:
# token = result['data'].get('access_token')
# if token:
# self.token = token
# self.log(f"Successfully obtained auth token: {token[:20]}...")
# Test core APIs
print("\n=== CORE APIs ===")
test_url = "https://example.com"
# Test markdown endpoint
md_payload = {
"url": test_url,
"f": "fit",
"q": "test query",
"c": "0"
}
result = self.test_post_endpoint("/md", md_payload)
# print(result['data'].get('markdown', ''))
self.print_result(result)
# Test HTML endpoint
html_payload = {"url": test_url}
result = self.test_post_endpoint("/html", html_payload)
self.print_result(result)
# Test screenshot endpoint
screenshot_payload = {
"url": test_url,
"screenshot_wait_for": 2
}
result = self.test_post_endpoint("/screenshot", screenshot_payload)
self.print_result(result)
# Test PDF endpoint
pdf_payload = {"url": test_url}
result = self.test_post_endpoint("/pdf", pdf_payload)
self.print_result(result)
# Test JavaScript execution
js_payload = {
"url": test_url,
"scripts": ["(() => document.title)()"]
}
result = self.test_post_endpoint("/execute_js", js_payload)
self.print_result(result)
# Test crawl endpoint
crawl_payload = {
"urls": [test_url],
"browser_config": {},
"crawler_config": {}
}
result = self.test_post_endpoint("/crawl", crawl_payload)
self.print_result(result)
# Test config dump
config_payload = {"code": "CrawlerRunConfig()"}
result = self.test_post_endpoint("/config/dump", config_payload)
self.print_result(result)
# Test LLM endpoint
llm_endpoint = f"/llm/{test_url}?q=Extract%20main%20content"
result = self.test_get_endpoint(llm_endpoint)
self.print_result(result)
# Test ask endpoint
ask_endpoint = "/ask?context_type=all&query=crawl4ai&max_results=5"
result = self.test_get_endpoint(ask_endpoint)
print(result)
self.print_result(result)
# Test job APIs
print("\n=== JOB APIs ===")
# Test LLM job
llm_job_payload = {
"url": test_url,
"q": "Extract main content",
"cache": False
}
result = self.test_post_endpoint("/llm/job", llm_job_payload)
self.print_result(result)
# Test crawl job
crawl_job_payload = {
"urls": [test_url],
"browser_config": {},
"crawler_config": {}
}
result = self.test_post_endpoint("/crawl/job", crawl_job_payload)
self.print_result(result)
# Test MCP
print("\n=== MCP APIs ===")
# Test MCP schema
result = self.test_get_endpoint("/mcp/schema")
self.print_result(result)
# Test error handling
print("\n=== ERROR HANDLING ===")
# Test invalid URL
invalid_payload = {"url": "invalid-url", "f": "fit"}
result = self.test_post_endpoint("/md", invalid_payload)
self.print_result(result)
# Test invalid endpoint
result = self.test_get_endpoint("/nonexistent")
self.print_result(result)
# Print summary
self.print_summary()
def print_summary(self):
"""Print test results summary"""
print("\n" + "=" * 60)
print("📊 TEST RESULTS SUMMARY")
print("=" * 60)
total = len(self.results)
passed = sum(1 for r in self.results if r['status'] == 'PASS')
failed = sum(1 for r in self.results if r['status'] == 'FAIL')
print(f"Total Tests: {total}")
print(f"✅ Passed: {passed}")
print(f"❌ Failed: {failed}")
print(f"📈 Success Rate: {(passed/total)*100:.1f}%")
if failed > 0:
print("\n❌ FAILED TESTS:")
for result in self.results:
if result['status'] == 'FAIL':
print(f"{result['method']} {result['endpoint']}")
if 'error' in result:
print(f" Error: {result['error']}")
# Performance statistics
response_times = [r['response_time'] for r in self.results if r['response_time'] > 0]
if response_times:
avg_time = sum(response_times) / len(response_times)
max_time = max(response_times)
print(f"\n⏱️ Average Response Time: {avg_time:.3f}s")
print(f"⏱️ Max Response Time: {max_time:.3f}s")
# Save detailed report
report_file = f"crawl4ai_test_report_{int(time.time())}.json"
with open(report_file, 'w') as f:
json.dump({
"timestamp": time.time(),
"server_url": self.base_url,
"version": "0.7.0",
"summary": {
"total": total,
"passed": passed,
"failed": failed
},
"results": self.results
}, f, indent=2)
print(f"\n📄 Detailed report saved to: {report_file}")
def main():
"""Main test runner"""
import argparse
parser = argparse.ArgumentParser(description='Crawl4AI v0.7.0 API Test Suite')
parser.add_argument('--url', default=BASE_URL, help='Base URL of the server')
args = parser.parse_args()
tester = SimpleApiTester(args.url)
try:
tester.run_all_tests()
except KeyboardInterrupt:
print("\n🛑 Test suite interrupted by user")
except Exception as e:
print(f"\n💥 Test suite failed with error: {e}")
sys.exit(1)
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,317 @@
#!/usr/bin/env python3
import asyncio
import pytest
import os
import json
import tempfile
from pathlib import Path
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy, LLMExtractionStrategy, LLMConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.async_url_seeder import AsyncUrlSeeder
from crawl4ai.utils import RobotsParser
class TestCrawl4AIv070:
"""Test suite for Crawl4AI v0.7.0 changes"""
@pytest.mark.asyncio
async def test_raw_url_parsing(self):
"""Test raw:// URL parsing logic fix"""
html_content = "<html><body><h1>Test Content</h1><p>This is a test paragraph.</p></body></html>"
async with AsyncWebCrawler() as crawler:
# Test raw:// prefix
result1 = await crawler.arun(f"raw://{html_content}")
assert result1.success
assert "Test Content" in result1.markdown
# Test raw: prefix
result2 = await crawler.arun(f"raw:{html_content}")
assert result2.success
assert "Test Content" in result2.markdown
@pytest.mark.asyncio
async def test_max_pages_limit_batch_processing(self):
"""Test max_pages limit is respected during batch processing"""
urls = [
"https://httpbin.org/html",
"https://httpbin.org/json",
"https://httpbin.org/xml"
]
config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
max_pages=2
)
async with AsyncWebCrawler() as crawler:
results = await crawler.arun_many(urls, config=config)
# Should only process 2 pages due to max_pages limit
successful_results = [r for r in results if r.success]
assert len(successful_results) <= 2
@pytest.mark.asyncio
async def test_navigation_abort_handling(self):
"""Test handling of navigation aborts during file downloads"""
async with AsyncWebCrawler() as crawler:
# Test with a URL that might cause navigation issues
result = await crawler.arun(
"https://httpbin.org/status/404",
config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
)
# Should not crash even with navigation issues
assert result is not None
@pytest.mark.asyncio
async def test_screenshot_capture_fix(self):
"""Test screenshot capture improvements"""
config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
screenshot=True
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://httpbin.org/html", config=config)
assert result.success
assert result.screenshot is not None
assert len(result.screenshot) > 0
@pytest.mark.asyncio
async def test_redirect_status_codes(self):
"""Test that real redirect status codes are surfaced"""
async with AsyncWebCrawler() as crawler:
# Test with a redirect URL
result = await crawler.arun(
"https://httpbin.org/redirect/1",
config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
)
assert result.success
# Should have redirect information
assert result.status_code in [200, 301, 302, 303, 307, 308]
@pytest.mark.asyncio
async def test_local_file_processing(self):
"""Test local file processing with captured_console initialization"""
with tempfile.NamedTemporaryFile(mode='w', suffix='.html', delete=False) as f:
f.write("<html><body><h1>Local File Test</h1></body></html>")
temp_file = f.name
try:
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(f"file://{temp_file}")
assert result.success
assert "Local File Test" in result.markdown
finally:
os.unlink(temp_file)
@pytest.mark.asyncio
async def test_robots_txt_wildcard_support(self):
"""Test robots.txt wildcard rules support"""
parser = RobotsParser()
# Test wildcard patterns
robots_content = "User-agent: *\nDisallow: /admin/*\nDisallow: *.pdf"
# This should work without throwing exceptions
assert parser is not None
@pytest.mark.asyncio
async def test_exclude_external_images(self):
"""Test exclude_external_images flag"""
html_with_images = '''
<html><body>
<img src="/local-image.jpg" alt="Local">
<img src="https://external.com/image.jpg" alt="External">
</body></html>
'''
config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
exclude_external_images=True
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(f"raw://{html_with_images}", config=config)
assert result.success
# External images should be excluded
assert "external.com" not in result.cleaned_html
@pytest.mark.asyncio
async def test_llm_extraction_strategy_fix(self):
"""Test LLM extraction strategy choices error fix"""
if not os.getenv("OPENAI_API_KEY"):
pytest.skip("OpenAI API key not available")
llm_config = LLMConfig(
provider="openai/gpt-4o-mini",
api_token=os.getenv("OPENAI_API_KEY")
)
strategy = LLMExtractionStrategy(
llm_config=llm_config,
instruction="Extract the main heading",
extraction_type="block"
)
config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
extraction_strategy=strategy
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://httpbin.org/html", config=config)
assert result.success
# Should not throw 'str' object has no attribute 'choices' error
assert result.extracted_content is not None
@pytest.mark.asyncio
async def test_wait_for_timeout(self):
"""Test separate timeout for wait_for condition"""
config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
wait_for="css:non-existent-element",
wait_for_timeout=1000 # 1 second timeout
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://httpbin.org/html", config=config)
# Should timeout gracefully and still return result
assert result is not None
@pytest.mark.asyncio
async def test_bm25_content_filter_language_parameter(self):
"""Test BM25 filter with language parameter for stemming"""
content_filter = BM25ContentFilter(
user_query="test content",
language="english",
use_stemming=True
)
markdown_generator = DefaultMarkdownGenerator(
content_filter=content_filter
)
config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
markdown_generator=markdown_generator
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://httpbin.org/html", config=config)
assert result.success
assert result.markdown is not None
@pytest.mark.asyncio
async def test_url_normalization(self):
"""Test URL normalization for invalid schemes and trailing slashes"""
async with AsyncWebCrawler() as crawler:
# Test with trailing slash
result = await crawler.arun(
"https://httpbin.org/html/",
config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
)
assert result.success
@pytest.mark.asyncio
async def test_max_scroll_steps(self):
"""Test max_scroll_steps parameter for full page scanning"""
config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
scan_full_page=True,
max_scroll_steps=3
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://httpbin.org/html", config=config)
assert result.success
@pytest.mark.asyncio
async def test_async_url_seeder(self):
"""Test AsyncUrlSeeder functionality"""
seeder = AsyncUrlSeeder(
base_url="https://httpbin.org",
max_depth=1,
max_urls=5
)
async with AsyncWebCrawler() as crawler:
urls = await seeder.seed(crawler)
assert isinstance(urls, list)
assert len(urls) <= 5
@pytest.mark.asyncio
async def test_pdf_processing_timeout(self):
"""Test PDF processing with timeout"""
config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
pdf=True,
pdf_timeout=10000 # 10 seconds
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://httpbin.org/html", config=config)
assert result.success
# PDF might be None for HTML pages, but should not hang
assert result.pdf is not None or result.pdf is None
@pytest.mark.asyncio
async def test_browser_session_management(self):
"""Test improved browser session management"""
browser_config = BrowserConfig(
headless=True,
use_persistent_context=True
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
"https://httpbin.org/html",
config=CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
)
assert result.success
@pytest.mark.asyncio
async def test_memory_management(self):
"""Test memory management features"""
config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
memory_threshold_percent=80.0,
check_interval=1.0,
memory_wait_timeout=600 # 10 minutes default
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://httpbin.org/html", config=config)
assert result.success
@pytest.mark.asyncio
async def test_virtual_scroll_support(self):
"""Test virtual scroll support for modern web scraping"""
config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
scan_full_page=True,
virtual_scroll=True
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://httpbin.org/html", config=config)
assert result.success
@pytest.mark.asyncio
async def test_adaptive_crawling(self):
"""Test adaptive crawling feature"""
config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
adaptive_crawling=True
)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://httpbin.org/html", config=config)
assert result.success
if __name__ == "__main__":
# Run the tests
pytest.main([__file__, "-v"])

View File

@@ -5,7 +5,7 @@ Test script for Link Extractor functionality
from crawl4ai.models import Link
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.async_configs import LinkPreviewConfig
from crawl4ai import LinkPreviewConfig
import asyncio
import sys
import os
@@ -237,7 +237,7 @@ def test_config_examples():
print(f" {key}: {value}")
print(" Usage:")
print(" from crawl4ai.async_configs import LinkPreviewConfig")
print(" from crawl4ai import LinkPreviewConfig")
print(" config = CrawlerRunConfig(")
print(" link_preview_config=LinkPreviewConfig(")
for key, value in config_dict.items():

3344
uv.lock generated Normal file

File diff suppressed because it is too large Load Diff