Compare commits

...

23 Commits

Author SHA1 Message Date
UncleCode
9f9ea3bb3b chore: Clean up test artifacts and disable test workflow 2025-07-25 17:31:52 +08:00
UncleCode
d58b93c207 fix: Re-enable multi-platform Docker builds for ARM64 support 2025-07-25 16:38:11 +08:00
UncleCode
e2b4705010 fix: Use hardcoded Docker repository name to avoid masking issues 2025-07-25 15:52:26 +08:00
UncleCode
4a1abd5086 fix: Handle existing version on Test PyPI gracefully 2025-07-25 15:41:16 +08:00
UncleCode
04258cd4f2 fix: Speed up Docker test builds by using single platform and caching 2025-07-25 15:37:44 +08:00
UncleCode
84e462d9f8 Merge remote-tracking branch 'origin/develop' 2025-07-25 15:35:53 +08:00
UncleCode
9546773a07 fix: Move sentence-transformers to optional dependencies
- Moved sentence-transformers from core to optional dependencies in pyproject.toml
- Removed sentence-transformers from requirements.txt
- Added proper ImportError handling with helpful installation message
- This prevents ~2.5GB of NVIDIA CUDA libraries from being installed by default
- Users who need embedding features can install with: pip install 'crawl4ai[transformer]'
2025-07-24 21:24:40 +08:00
UncleCode
66a979ad11 fix: Install dependencies before version check in workflows 2025-07-24 21:01:36 +08:00
UncleCode
0c31e91b53 feat: Add CI/CD workflows for automated PyPI and Docker releases 2025-07-24 20:58:43 +08:00
ntohidi
1b6a31f88f fix: encode PDF results to base64 in /crawl endpoint. ref #1301 2025-07-23 13:52:18 +02:00
Nasrin
b8c261780f Merge pull request #1319 from volumetric/fix_for_bug_#1310
Removed the incorrect reference in browser_config variable
2025-07-23 12:45:12 +02:00
ntohidi
db6ad7a79d fix: update links in README and C4A-Script documentation for accuracy 2025-07-23 09:47:18 +02:00
Nasrin
004d514f33 Merge pull request #1265 from unclecode/feature/nasrin-cli-deep-crawl
Feature/CLI - deep-crawl: Add --deep-crawl CLI option with BFS/DFS/Best-First strategies and fix serialization error. ref #874
2025-07-23 09:40:33 +02:00
Vinit Agrawal
3a9e2c716e Remvoed the incorrect reference in browser_config variable 2025-07-18 10:01:00 +05:30
unclecode
0163bd797c Merge branch 'release/v0.7.1' 2025-07-17 17:42:04 +08:00
ntohidi
26bad799e4 chore: update version to 0.7.1 2025-07-17 11:37:41 +02:00
ntohidi
cf8badfe27 feat: cleanup unused code and enhance documentation for v0.7.1
- Remove unused StealthConfig from browser_manager.py
- Update LinkPreviewConfig import path in __init__.py and examples
- Fix infinity handling in content_scraping_strategy.py (use 0 instead of float('inf'))
- Remove sanitize_json_data functions from API endpoints
- Add comprehensive C4A Script documentation to release notes
- Update v0.7.0 release notes with improved code examples
- Create v0.7.1 release notes focusing on cleanup and documentation improvements
- Update demo files with corrected import paths and examples
- Fix virtual scroll and adaptive crawling examples across documentation

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-07-17 11:35:16 +02:00
ntohidi
ccbe3c105c refactor: improve link scoring output format in release notes 2025-07-17 09:13:20 +02:00
Nasrin
761c19d54b Merge pull request #1307 from unclecode/fix/json-infinity-serialization
fix: Handle infinity values in JSON serialization for API  responses
2025-07-16 13:34:25 +02:00
Nasrin
14b0ecb137 Merge pull request #1305 from unclecode/fix/release-notes-demo-code
Fix: Update release notes and demo code
2025-07-16 13:33:53 +02:00
ntohidi
0eaa9f9895 fix: handle infinity values in JSON serialization for API responses
- Add sanitize_json_data() function to convert infinity/NaN to JSON-compliant strings
- Fix /execute_js endpoint returning ValueError: Out of range float values are not JSON compliant: inf
- Fix /crawl endpoint batch responses with infinity values
- Fix /crawl/stream endpoint streaming responses with infinity values
- Fix /crawl/job endpoint background job responses with infinity values

The sanitize_json_data() function recursively processes response data:
- float('inf') → \"Infinity\"
- float('-inf') → \"-Infinity\"
- float('nan') → \"NaN\"

This prevents JSON serialization errors when JavaScript execution or crawling operations produce infinity values, ensuring all API endpoints return valid JSON.

Fixes: API endpoints crashing with infinity JSON serialization errors
Affects: /execute_js, /crawl, /crawl/stream, /crawl/job endpoints
2025-07-15 13:49:07 +02:00
UncleCode
bde1bba6a2 docs: Add missing documentation pages to mkdocs.yml
- Added Adaptive Crawling to Core section
- Added URL Seeding to Core section
- Added Adaptive Strategies to Advanced section
2025-07-12 19:56:33 +08:00
ntohidi
ee25c771d8 feat(cli): add deep crawling options with configurable strategies and max pages. ref #874 2025-07-02 14:07:23 +02:00
24 changed files with 941 additions and 329 deletions

141
.github/workflows/release.yml vendored Normal file
View File

@@ -0,0 +1,141 @@
name: Release Pipeline
on:
push:
tags:
- 'v*'
- '!test-v*' # Exclude test tags
jobs:
release:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.12'
- name: Extract version from tag
id: get_version
run: |
TAG_VERSION=${GITHUB_REF#refs/tags/v}
echo "VERSION=$TAG_VERSION" >> $GITHUB_OUTPUT
echo "Releasing version: $TAG_VERSION"
- name: Install package dependencies
run: |
pip install -e .
- name: Check version consistency
run: |
TAG_VERSION=${{ steps.get_version.outputs.VERSION }}
PACKAGE_VERSION=$(python -c "from crawl4ai.__version__ import __version__; print(__version__)")
echo "Tag version: $TAG_VERSION"
echo "Package version: $PACKAGE_VERSION"
if [ "$TAG_VERSION" != "$PACKAGE_VERSION" ]; then
echo "❌ Version mismatch! Tag: $TAG_VERSION, Package: $PACKAGE_VERSION"
echo "Please update crawl4ai/__version__.py to match the tag version"
exit 1
fi
echo "✅ Version check passed: $TAG_VERSION"
- name: Install build dependencies
run: |
python -m pip install --upgrade pip
pip install build twine
- name: Build package
run: python -m build
- name: Check package
run: twine check dist/*
- name: Upload to PyPI
env:
TWINE_USERNAME: __token__
TWINE_PASSWORD: ${{ secrets.PYPI_TOKEN }}
run: |
echo "📦 Uploading to PyPI..."
twine upload dist/*
echo "✅ Package uploaded to https://pypi.org/project/crawl4ai/"
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Log in to Docker Hub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKER_USERNAME }}
password: ${{ secrets.DOCKER_TOKEN }}
- name: Extract major and minor versions
id: versions
run: |
VERSION=${{ steps.get_version.outputs.VERSION }}
MAJOR=$(echo $VERSION | cut -d. -f1)
MINOR=$(echo $VERSION | cut -d. -f1-2)
echo "MAJOR=$MAJOR" >> $GITHUB_OUTPUT
echo "MINOR=$MINOR" >> $GITHUB_OUTPUT
- name: Build and push Docker images
uses: docker/build-push-action@v5
with:
context: .
push: true
tags: |
unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}
unclecode/crawl4ai:${{ steps.versions.outputs.MINOR }}
unclecode/crawl4ai:${{ steps.versions.outputs.MAJOR }}
unclecode/crawl4ai:latest
platforms: linux/amd64,linux/arm64
- name: Create GitHub Release
uses: actions/create-release@v1
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
with:
tag_name: v${{ steps.get_version.outputs.VERSION }}
release_name: Release v${{ steps.get_version.outputs.VERSION }}
body: |
## 🎉 Crawl4AI v${{ steps.get_version.outputs.VERSION }} Released!
### 📦 Installation
**PyPI:**
```bash
pip install crawl4ai==${{ steps.get_version.outputs.VERSION }}
```
**Docker:**
```bash
docker pull unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}
docker pull unclecode/crawl4ai:latest
```
### 📝 What's Changed
See [CHANGELOG.md](https://github.com/${{ github.repository }}/blob/main/CHANGELOG.md) for details.
draft: false
prerelease: false
- name: Summary
run: |
echo "## 🚀 Release Complete!" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 📦 PyPI Package" >> $GITHUB_STEP_SUMMARY
echo "- Version: ${{ steps.get_version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY
echo "- URL: https://pypi.org/project/crawl4ai/" >> $GITHUB_STEP_SUMMARY
echo "- Install: \`pip install crawl4ai==${{ steps.get_version.outputs.VERSION }}\`" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 🐳 Docker Images" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.get_version.outputs.VERSION }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.versions.outputs.MINOR }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:${{ steps.versions.outputs.MAJOR }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:latest\`" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 📋 GitHub Release" >> $GITHUB_STEP_SUMMARY
echo "https://github.com/${{ github.repository }}/releases/tag/v${{ steps.get_version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY

View File

@@ -0,0 +1,116 @@
name: Test Release Pipeline
on:
push:
tags:
- 'test-v*'
jobs:
test-release:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: '3.12'
- name: Extract version from tag
id: get_version
run: |
TAG_VERSION=${GITHUB_REF#refs/tags/test-v}
echo "VERSION=$TAG_VERSION" >> $GITHUB_OUTPUT
echo "Testing with version: $TAG_VERSION"
- name: Install package dependencies
run: |
pip install -e .
- name: Check version consistency
run: |
TAG_VERSION=${{ steps.get_version.outputs.VERSION }}
PACKAGE_VERSION=$(python -c "from crawl4ai.__version__ import __version__; print(__version__)")
echo "Tag version: $TAG_VERSION"
echo "Package version: $PACKAGE_VERSION"
if [ "$TAG_VERSION" != "$PACKAGE_VERSION" ]; then
echo "❌ Version mismatch! Tag: $TAG_VERSION, Package: $PACKAGE_VERSION"
echo "Please update crawl4ai/__version__.py to match the tag version"
exit 1
fi
echo "✅ Version check passed: $TAG_VERSION"
- name: Install build dependencies
run: |
python -m pip install --upgrade pip
pip install build twine
- name: Build package
run: python -m build
- name: Check package
run: twine check dist/*
- name: Upload to Test PyPI
env:
TWINE_USERNAME: __token__
TWINE_PASSWORD: ${{ secrets.TEST_PYPI_TOKEN }}
run: |
echo "📦 Uploading to Test PyPI..."
twine upload --repository testpypi dist/* || {
if [ $? -eq 1 ]; then
echo "⚠️ Upload failed - likely version already exists on Test PyPI"
echo "Continuing anyway for test purposes..."
else
exit 1
fi
}
echo "✅ Test PyPI step complete"
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Log in to Docker Hub
uses: docker/login-action@v3
with:
username: ${{ secrets.DOCKER_USERNAME }}
password: ${{ secrets.DOCKER_TOKEN }}
- name: Build and push Docker test images
uses: docker/build-push-action@v5
with:
context: .
push: true
tags: |
unclecode/crawl4ai:test-${{ steps.get_version.outputs.VERSION }}
unclecode/crawl4ai:test-latest
platforms: linux/amd64,linux/arm64
cache-from: type=gha
cache-to: type=gha,mode=max
- name: Summary
run: |
echo "## 🎉 Test Release Complete!" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 📦 Test PyPI Package" >> $GITHUB_STEP_SUMMARY
echo "- Version: ${{ steps.get_version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY
echo "- URL: https://test.pypi.org/project/crawl4ai/" >> $GITHUB_STEP_SUMMARY
echo "- Install: \`pip install -i https://test.pypi.org/simple/ crawl4ai==${{ steps.get_version.outputs.VERSION }}\`" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 🐳 Docker Test Images" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:test-${{ steps.get_version.outputs.VERSION }}\`" >> $GITHUB_STEP_SUMMARY
echo "- \`unclecode/crawl4ai:test-latest\`" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "### 🧹 Cleanup Commands" >> $GITHUB_STEP_SUMMARY
echo "\`\`\`bash" >> $GITHUB_STEP_SUMMARY
echo "# Remove test tag" >> $GITHUB_STEP_SUMMARY
echo "git tag -d test-v${{ steps.get_version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY
echo "git push origin :test-v${{ steps.get_version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
echo "# Remove Docker test images" >> $GITHUB_STEP_SUMMARY
echo "docker rmi unclecode/crawl4ai:test-${{ steps.get_version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY
echo "docker rmi unclecode/crawl4ai:test-latest" >> $GITHUB_STEP_SUMMARY
echo "\`\`\`" >> $GITHUB_STEP_SUMMARY

View File

@@ -28,7 +28,7 @@ Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant
[✨ Check out latest update v0.7.0](#-recent-updates) [✨ Check out latest update v0.7.0](#-recent-updates)
🎉 **Version 0.7.0 is now available!** The Adaptive Intelligence Update introduces groundbreaking features: Adaptive Crawling that learns website patterns, Virtual Scroll support for infinite pages, intelligent Link Preview with 3-layer scoring, Async URL Seeder for massive discovery, and significant performance improvements. [Read the release notes →](https://docs.crawl4ai.com/blog/release-v0.7.0) 🎉 **Version 0.7.0 is now available!** The Adaptive Intelligence Update introduces groundbreaking features: Adaptive Crawling that learns website patterns, Virtual Scroll support for infinite pages, intelligent Link Preview with 3-layer scoring, Async URL Seeder for massive discovery, and significant performance improvements. [Read the release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.0.md)
<details> <details>
<summary>🤓 <strong>My Personal Story</strong></summary> <summary>🤓 <strong>My Personal Story</strong></summary>

View File

@@ -3,7 +3,7 @@ import warnings
from .async_webcrawler import AsyncWebCrawler, CacheMode from .async_webcrawler import AsyncWebCrawler, CacheMode
# MODIFIED: Add SeedingConfig and VirtualScrollConfig here # MODIFIED: Add SeedingConfig and VirtualScrollConfig here
from .async_configs import BrowserConfig, CrawlerRunConfig, HTTPCrawlerConfig, LLMConfig, ProxyConfig, GeolocationConfig, SeedingConfig, VirtualScrollConfig from .async_configs import BrowserConfig, CrawlerRunConfig, HTTPCrawlerConfig, LLMConfig, ProxyConfig, GeolocationConfig, SeedingConfig, VirtualScrollConfig, LinkPreviewConfig
from .content_scraping_strategy import ( from .content_scraping_strategy import (
ContentScrapingStrategy, ContentScrapingStrategy,
@@ -173,6 +173,7 @@ __all__ = [
"CompilationResult", "CompilationResult",
"ValidationResult", "ValidationResult",
"ErrorDetail", "ErrorDetail",
"LinkPreviewConfig"
] ]

View File

@@ -1,7 +1,7 @@
# crawl4ai/__version__.py # crawl4ai/__version__.py
# This is the version that will be used for stable releases # This is the version that will be used for stable releases
__version__ = "0.7.0" __version__ = "0.7.1"
# For nightly builds, this gets set during build process # For nightly builds, this gets set during build process
__nightly_version__ = None __nightly_version__ = None

View File

@@ -824,7 +824,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
except Error: except Error:
visibility_info = await self.check_visibility(page) visibility_info = await self.check_visibility(page)
if self.browser_config.config.verbose: if self.browser_config.verbose:
self.logger.debug( self.logger.debug(
message="Body visibility info: {info}", message="Body visibility info: {info}",
tag="DEBUG", tag="DEBUG",

View File

@@ -502,9 +502,12 @@ class AsyncWebCrawler:
metadata = result.get("metadata", {}) metadata = result.get("metadata", {})
else: else:
cleaned_html = sanitize_input_encode(result.cleaned_html) cleaned_html = sanitize_input_encode(result.cleaned_html)
media = result.media.model_dump() # media = result.media.model_dump()
tables = media.pop("tables", []) # tables = media.pop("tables", [])
links = result.links.model_dump() # links = result.links.model_dump()
media = result.media.model_dump() if hasattr(result.media, 'model_dump') else result.media
tables = media.pop("tables", []) if isinstance(media, dict) else []
links = result.links.model_dump() if hasattr(result.links, 'model_dump') else result.links
metadata = result.metadata metadata = result.metadata
fit_html = preprocess_html_for_schema(html_content=html, text_threshold= 500, max_size= 300_000) fit_html = preprocess_html_for_schema(html_content=html, text_threshold= 500, max_size= 300_000)

View File

@@ -14,23 +14,8 @@ import hashlib
from .js_snippet import load_js_script from .js_snippet import load_js_script
from .config import DOWNLOAD_PAGE_TIMEOUT from .config import DOWNLOAD_PAGE_TIMEOUT
from .async_configs import BrowserConfig, CrawlerRunConfig from .async_configs import BrowserConfig, CrawlerRunConfig
from playwright_stealth import StealthConfig
from .utils import get_chromium_path from .utils import get_chromium_path
stealth_config = StealthConfig(
webdriver=True,
chrome_app=True,
chrome_csi=True,
chrome_load_times=True,
chrome_runtime=True,
navigator_languages=True,
navigator_plugins=True,
navigator_permissions=True,
webgl_vendor=True,
outerdimensions=True,
navigator_hardware_concurrency=True,
media_codecs=True,
)
BROWSER_DISABLE_OPTIONS = [ BROWSER_DISABLE_OPTIONS = [
"--disable-background-networking", "--disable-background-networking",

View File

@@ -27,7 +27,10 @@ from crawl4ai import (
PruningContentFilter, PruningContentFilter,
BrowserProfiler, BrowserProfiler,
DefaultMarkdownGenerator, DefaultMarkdownGenerator,
LLMConfig LLMConfig,
BFSDeepCrawlStrategy,
DFSDeepCrawlStrategy,
BestFirstCrawlingStrategy,
) )
from crawl4ai.config import USER_SETTINGS from crawl4ai.config import USER_SETTINGS
from litellm import completion from litellm import completion
@@ -1014,9 +1017,11 @@ def cdp_cmd(user_data_dir: Optional[str], port: int, browser_type: str, headless
@click.option("--question", "-q", help="Ask a question about the crawled content") @click.option("--question", "-q", help="Ask a question about the crawled content")
@click.option("--verbose", "-v", is_flag=True) @click.option("--verbose", "-v", is_flag=True)
@click.option("--profile", "-p", help="Use a specific browser profile (by name)") @click.option("--profile", "-p", help="Use a specific browser profile (by name)")
@click.option("--deep-crawl", type=click.Choice(["bfs", "dfs", "best-first"]), help="Enable deep crawling with specified strategy (bfs, dfs, or best-first)")
@click.option("--max-pages", type=int, default=10, help="Maximum number of pages to crawl in deep crawl mode")
def crawl_cmd(url: str, browser_config: str, crawler_config: str, filter_config: str, def crawl_cmd(url: str, browser_config: str, crawler_config: str, filter_config: str,
extraction_config: str, json_extract: str, schema: str, browser: Dict, crawler: Dict, extraction_config: str, json_extract: str, schema: str, browser: Dict, crawler: Dict,
output: str, output_file: str, bypass_cache: bool, question: str, verbose: bool, profile: str): output: str, output_file: str, bypass_cache: bool, question: str, verbose: bool, profile: str, deep_crawl: str, max_pages: int):
"""Crawl a website and extract content """Crawl a website and extract content
Simple Usage: Simple Usage:
@@ -1156,6 +1161,27 @@ Always return valid, properly formatted JSON."""
crawler_cfg.scraping_strategy = LXMLWebScrapingStrategy() crawler_cfg.scraping_strategy = LXMLWebScrapingStrategy()
# Handle deep crawling configuration
if deep_crawl:
if deep_crawl == "bfs":
crawler_cfg.deep_crawl_strategy = BFSDeepCrawlStrategy(
max_depth=3,
max_pages=max_pages
)
elif deep_crawl == "dfs":
crawler_cfg.deep_crawl_strategy = DFSDeepCrawlStrategy(
max_depth=3,
max_pages=max_pages
)
elif deep_crawl == "best-first":
crawler_cfg.deep_crawl_strategy = BestFirstCrawlingStrategy(
max_depth=3,
max_pages=max_pages
)
if verbose:
console.print(f"[green]Deep crawling enabled:[/green] {deep_crawl} strategy, max {max_pages} pages")
config = get_global_config() config = get_global_config()
browser_cfg.verbose = config.get("VERBOSE", False) browser_cfg.verbose = config.get("VERBOSE", False)
@@ -1170,39 +1196,60 @@ Always return valid, properly formatted JSON."""
verbose verbose
) )
# Handle deep crawl results (list) vs single result
if isinstance(result, list):
if len(result) == 0:
click.echo("No results found during deep crawling")
return
# Use the first result for question answering and output
main_result = result[0]
all_results = result
else:
# Single result from regular crawling
main_result = result
all_results = [result]
# Handle question # Handle question
if question: if question:
provider, token = setup_llm_config() provider, token = setup_llm_config()
markdown = result.markdown.raw_markdown markdown = main_result.markdown.raw_markdown
anyio.run(stream_llm_response, url, markdown, question, provider, token) anyio.run(stream_llm_response, url, markdown, question, provider, token)
return return
# Handle output # Handle output
if not output_file: if not output_file:
if output == "all": if output == "all":
click.echo(json.dumps(result.model_dump(), indent=2)) if isinstance(result, list):
output_data = [r.model_dump() for r in all_results]
click.echo(json.dumps(output_data, indent=2))
else:
click.echo(json.dumps(main_result.model_dump(), indent=2))
elif output == "json": elif output == "json":
print(result.extracted_content) print(main_result.extracted_content)
extracted_items = json.loads(result.extracted_content) extracted_items = json.loads(main_result.extracted_content)
click.echo(json.dumps(extracted_items, indent=2)) click.echo(json.dumps(extracted_items, indent=2))
elif output in ["markdown", "md"]: elif output in ["markdown", "md"]:
click.echo(result.markdown.raw_markdown) click.echo(main_result.markdown.raw_markdown)
elif output in ["markdown-fit", "md-fit"]: elif output in ["markdown-fit", "md-fit"]:
click.echo(result.markdown.fit_markdown) click.echo(main_result.markdown.fit_markdown)
else: else:
if output == "all": if output == "all":
with open(output_file, "w") as f: with open(output_file, "w") as f:
f.write(json.dumps(result.model_dump(), indent=2)) if isinstance(result, list):
output_data = [r.model_dump() for r in all_results]
f.write(json.dumps(output_data, indent=2))
else:
f.write(json.dumps(main_result.model_dump(), indent=2))
elif output == "json": elif output == "json":
with open(output_file, "w") as f: with open(output_file, "w") as f:
f.write(result.extracted_content) f.write(main_result.extracted_content)
elif output in ["markdown", "md"]: elif output in ["markdown", "md"]:
with open(output_file, "w") as f: with open(output_file, "w") as f:
f.write(result.markdown.raw_markdown) f.write(main_result.markdown.raw_markdown)
elif output in ["markdown-fit", "md-fit"]: elif output in ["markdown-fit", "md-fit"]:
with open(output_file, "w") as f: with open(output_file, "w") as f:
f.write(result.markdown.fit_markdown) f.write(main_result.markdown.fit_markdown)
except Exception as e: except Exception as e:
raise click.ClickException(str(e)) raise click.ClickException(str(e))
@@ -1354,9 +1401,11 @@ def profiles_cmd():
@click.option("--question", "-q", help="Ask a question about the crawled content") @click.option("--question", "-q", help="Ask a question about the crawled content")
@click.option("--verbose", "-v", is_flag=True) @click.option("--verbose", "-v", is_flag=True)
@click.option("--profile", "-p", help="Use a specific browser profile (by name)") @click.option("--profile", "-p", help="Use a specific browser profile (by name)")
@click.option("--deep-crawl", type=click.Choice(["bfs", "dfs", "best-first"]), help="Enable deep crawling with specified strategy")
@click.option("--max-pages", type=int, default=10, help="Maximum number of pages to crawl in deep crawl mode")
def default(url: str, example: bool, browser_config: str, crawler_config: str, filter_config: str, def default(url: str, example: bool, browser_config: str, crawler_config: str, filter_config: str,
extraction_config: str, json_extract: str, schema: str, browser: Dict, crawler: Dict, extraction_config: str, json_extract: str, schema: str, browser: Dict, crawler: Dict,
output: str, bypass_cache: bool, question: str, verbose: bool, profile: str): output: str, bypass_cache: bool, question: str, verbose: bool, profile: str, deep_crawl: str, max_pages: int):
"""Crawl4AI CLI - Web content extraction tool """Crawl4AI CLI - Web content extraction tool
Simple Usage: Simple Usage:
@@ -1406,7 +1455,9 @@ def default(url: str, example: bool, browser_config: str, crawler_config: str, f
bypass_cache=bypass_cache, bypass_cache=bypass_cache,
question=question, question=question,
verbose=verbose, verbose=verbose,
profile=profile profile=profile,
deep_crawl=deep_crawl,
max_pages=max_pages
) )
def main(): def main():

View File

@@ -1145,10 +1145,10 @@ class LXMLWebScrapingStrategy(WebScrapingStrategy):
link_data["intrinsic_score"] = intrinsic_score link_data["intrinsic_score"] = intrinsic_score
except Exception: except Exception:
# Fail gracefully - assign default score # Fail gracefully - assign default score
link_data["intrinsic_score"] = float('inf') link_data["intrinsic_score"] = 0
else: else:
# No scoring enabled - assign infinity (all links equal priority) # No scoring enabled - assign infinity (all links equal priority)
link_data["intrinsic_score"] = float('inf') link_data["intrinsic_score"] = 0
is_external = is_external_url(normalized_href, base_domain) is_external = is_external_url(normalized_href, base_domain)
if is_external: if is_external:

View File

@@ -3342,7 +3342,13 @@ async def get_text_embeddings(
# Default: use sentence-transformers # Default: use sentence-transformers
else: else:
# Lazy load to avoid importing heavy libraries unless needed # Lazy load to avoid importing heavy libraries unless needed
from sentence_transformers import SentenceTransformer try:
from sentence_transformers import SentenceTransformer
except ImportError:
raise ImportError(
"sentence-transformers is required for local embeddings. "
"Install it with: pip install 'crawl4ai[transformer]' or pip install sentence-transformers"
)
# Cache the model in function attribute to avoid reloading # Cache the model in function attribute to avoid reloading
if not hasattr(get_text_embeddings, '_models'): if not hasattr(get_text_embeddings, '_models'):

View File

@@ -5,6 +5,7 @@ from typing import List, Tuple, Dict
from functools import partial from functools import partial
from uuid import uuid4 from uuid import uuid4
from datetime import datetime from datetime import datetime
from base64 import b64encode
import logging import logging
from typing import Optional, AsyncGenerator from typing import Optional, AsyncGenerator
@@ -371,6 +372,9 @@ async def stream_results(crawler: AsyncWebCrawler, results_gen: AsyncGenerator)
server_memory_mb = _get_memory_mb() server_memory_mb = _get_memory_mb()
result_dict = result.model_dump() result_dict = result.model_dump()
result_dict['server_memory_mb'] = server_memory_mb result_dict['server_memory_mb'] = server_memory_mb
# If PDF exists, encode it to base64
if result_dict.get('pdf') is not None:
result_dict['pdf'] = b64encode(result_dict['pdf']).decode('utf-8')
logger.info(f"Streaming result for {result_dict.get('url', 'unknown')}") logger.info(f"Streaming result for {result_dict.get('url', 'unknown')}")
data = json.dumps(result_dict, default=datetime_handler) + "\n" data = json.dumps(result_dict, default=datetime_handler) + "\n"
yield data.encode('utf-8') yield data.encode('utf-8')
@@ -443,10 +447,19 @@ async def handle_crawl_request(
mem_delta_mb = end_mem_mb - start_mem_mb # <--- Calculate delta mem_delta_mb = end_mem_mb - start_mem_mb # <--- Calculate delta
peak_mem_mb = max(peak_mem_mb if peak_mem_mb else 0, end_mem_mb) # <--- Get peak memory peak_mem_mb = max(peak_mem_mb if peak_mem_mb else 0, end_mem_mb) # <--- Get peak memory
logger.info(f"Memory usage: Start: {start_mem_mb} MB, End: {end_mem_mb} MB, Delta: {mem_delta_mb} MB, Peak: {peak_mem_mb} MB") logger.info(f"Memory usage: Start: {start_mem_mb} MB, End: {end_mem_mb} MB, Delta: {mem_delta_mb} MB, Peak: {peak_mem_mb} MB")
# Process results to handle PDF bytes
processed_results = []
for result in results:
result_dict = result.model_dump()
# If PDF exists, encode it to base64
if result_dict.get('pdf') is not None:
result_dict['pdf'] = b64encode(result_dict['pdf']).decode('utf-8')
processed_results.append(result_dict)
return { return {
"success": True, "success": True,
"results": [result.model_dump() for result in results], "results": processed_results,
"server_processing_time_s": end_time - start_time, "server_processing_time_s": end_time - start_time,
"server_memory_delta_mb": mem_delta_mb, "server_memory_delta_mb": mem_delta_mb,
"server_peak_memory_mb": peak_mem_mb "server_peak_memory_mb": peak_mem_mb

View File

@@ -30,33 +30,40 @@ The Adaptive Crawler maintains a persistent state for each domain, tracking:
```python ```python
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
import asyncio
# Initialize with custom adaptive parameters async def main():
config = AdaptiveConfig(
confidence_threshold=0.7, # Min confidence to stop crawling
max_depth=5, # Maximum crawl depth
max_pages=20, # Maximum number of pages to crawl
top_k_links=3, # Number of top links to follow per page
strategy="statistical", # 'statistical' or 'embedding'
coverage_weight=0.4, # Weight for coverage in confidence calculation
consistency_weight=0.3, # Weight for consistency in confidence calculation
saturation_weight=0.3 # Weight for saturation in confidence calculation
)
# Initialize adaptive crawler with web crawler
async with AsyncWebCrawler() as crawler:
adaptive_crawler = AdaptiveCrawler(crawler, config)
# Crawl and learn patterns # Configure adaptive crawler
state = await adaptive_crawler.digest( config = AdaptiveConfig(
start_url="https://news.example.com/article/12345", strategy="statistical", # or "embedding" for semantic understanding
query="latest news articles and content" max_pages=10,
confidence_threshold=0.7, # Stop at 70% confidence
top_k_links=3, # Follow top 3 links per page
min_gain_threshold=0.05 # Need 5% information gain to continue
) )
# Access results and confidence async with AsyncWebCrawler(verbose=False) as crawler:
print(f"Confidence Level: {adaptive_crawler.confidence:.0%}") adaptive = AdaptiveCrawler(crawler, config)
print(f"Pages Crawled: {len(state.crawled_urls)}")
print(f"Knowledge Base: {len(adaptive_crawler.state.knowledge_base)} documents") print("Starting adaptive crawl about Python decorators...")
result = await adaptive.digest(
start_url="https://docs.python.org/3/glossary.html",
query="python decorators functions wrapping"
)
print(f"\n✅ Crawling Complete!")
print(f"• Confidence Level: {adaptive.confidence:.0%}")
print(f"• Pages Crawled: {len(result.crawled_urls)}")
print(f"• Knowledge Base: {len(adaptive.state.knowledge_base)} documents")
# Get most relevant content
relevant = adaptive.get_relevant_content(top_k=3)
print(f"\nMost Relevant Pages:")
for i, page in enumerate(relevant, 1):
print(f"{i}. {page['url']} (relevance: {page['score']:.2%})")
asyncio.run(main())
``` ```
**Expected Real-World Impact:** **Expected Real-World Impact:**
@@ -141,56 +148,47 @@ async with AsyncWebCrawler() as crawler:
**My Solution:** I implemented a three-layer scoring system that analyzes links like a human would—considering their position, context, and relevance to your goals. **My Solution:** I implemented a three-layer scoring system that analyzes links like a human would—considering their position, context, and relevance to your goals.
### The Three-Layer Scoring System ### Intelligent Link Analysis and Scoring
```python ```python
from crawl4ai import LinkPreviewConfig, CrawlerRunConfig, CacheMode import asyncio
from crawl4ai import CrawlerRunConfig, CacheMode, AsyncWebCrawler
from crawl4ai.adaptive_crawler import LinkPreviewConfig
# Configure intelligent link analysis async def main():
link_config = LinkPreviewConfig( # Configure intelligent link analysis
include_internal=True, link_config = LinkPreviewConfig(
include_external=False, include_internal=True,
max_links=10, include_external=False,
concurrency=5, max_links=10,
query="python tutorial", # For contextual scoring concurrency=5,
score_threshold=0.3, query="python tutorial", # For contextual scoring
verbose=True score_threshold=0.3,
) verbose=True
# Use in your crawl
result = await crawler.arun(
"https://tech-blog.example.com",
config=CrawlerRunConfig(
link_preview_config=link_config,
score_links=True, # Enable intrinsic scoring
cache_mode=CacheMode.BYPASS
) )
) # Use in your crawl
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
"https://www.geeksforgeeks.org/",
config=CrawlerRunConfig(
link_preview_config=link_config,
score_links=True, # Enable intrinsic scoring
cache_mode=CacheMode.BYPASS
)
)
# Access scored and sorted links # Access scored and sorted links
if result.success and result.links: if result.success and result.links:
# Get scored links for link in result.links.get("internal", []):
internal_links = result.links.get("internal", []) text = link.get('text', 'No text')[:40]
scored_links = [l for l in internal_links if l.get("total_score")] print(
scored_links.sort(key=lambda x: x.get("total_score", 0), reverse=True) text,
f"{link.get('intrinsic_score', 0):.1f}/10" if link.get('intrinsic_score') is not None else "0.0/10",
f"{link.get('contextual_score', 0):.2f}/1" if link.get('contextual_score') is not None else "0.00/1",
f"{link.get('total_score', 0):.3f}" if link.get('total_score') is not None else "0.000"
)
# Create a scoring table asyncio.run(main())
table = Table(title="Link Scoring Results", box=box.ROUNDED)
table.add_column("Link Text", style="cyan", width=40)
table.add_column("Intrinsic Score", justify="center")
table.add_column("Contextual Score", justify="center")
table.add_column("Total Score", justify="center", style="bold green")
for link in scored_links[:5]:
text = link.get('text', 'No text')[:40]
table.add_row(
text,
f"{link.get('intrinsic_score', 0):.1f}/10",
f"{link.get('contextual_score', 0):.2f}/1",
f"{link.get('total_score', 0):.3f}"
)
console.print(table)
``` ```
**Scoring Components:** **Scoring Components:**
@@ -223,58 +221,34 @@ console.print(table)
### Technical Architecture ### Technical Architecture
```python ```python
import asyncio
from crawl4ai import AsyncUrlSeeder, SeedingConfig from crawl4ai import AsyncUrlSeeder, SeedingConfig
# Basic discovery - find all product pages async def main():
seeder_config = SeedingConfig( async with AsyncUrlSeeder() as seeder:
# Discovery sources # Discover Python tutorial URLs
source="cc+sitemap", # Sitemap + Common Crawl config = SeedingConfig(
source="sitemap", # Use sitemap
# Filtering pattern="*python*", # URL pattern filter
pattern="*/product/*", # URL pattern matching extract_head=True, # Get metadata
query="python tutorial", # For relevance scoring
# Validation scoring_method="bm25",
live_check=True, # Verify URLs are alive score_threshold=0.2,
max_urls=50, # Stop at 50 URLs max_urls=10
)
# Performance
concurrency=100, # Maximum concurrent requests for live checks/head extraction print("Discovering Python async tutorial URLs...")
hits_per_sec=10 # Rate limit in requests per second to avoid overwhelming servers urls = await seeder.urls("https://www.geeksforgeeks.org/", config)
)
print(f"\n✅ Found {len(urls)} relevant URLs:")
for i, url_info in enumerate(urls[:5], 1):
print(f"\n{i}. {url_info['url']}")
if url_info.get('relevance_score'):
print(f" Relevance: {url_info['relevance_score']:.3f}")
if url_info.get('head_data', {}).get('title'):
print(f" Title: {url_info['head_data']['title'][:60]}...")
async with AsyncUrlSeeder() as seeder: asyncio.run(main())
console.print("Discovering URLs from Python docs...")
urls = await seeder.urls("docs.python.org", seeding_config)
console.print(f"\n✓ Discovered {len(urls)} URLs")
# Advanced: Relevance-based discovery
research_config = SeedingConfig(
source="sitemap+cc", # Sitemap + Common Crawl
pattern="*/blog/*", # Blog posts only
# Content relevance
extract_head=True, # Get meta tags
query="quantum computing tutorials",
scoring_method="bm25", # BM25 scoring method
score_threshold=0.4, # High relevance only
# Smart filtering
filter_nonsense_urls=True, # Remove .xml, .txt, etc.
force=True # Bypass cache
)
# Discover with progress tracking
discovered = []
async with AsyncUrlSeeder() as seeder:
discovered = await seeder.urls("https://physics-blog.com", research_config)
console.print(f"\n✓ Discovered {len(discovered)} URLs")
# Results include scores and metadata
for url_data in discovered[:5]:
print(f"URL: {url_data['url']}")
print(f"Score: {url_data['relevance_score']:.3f}")
print(f"Title: {url_data['head_data']['title']}")
``` ```
**Discovery Methods:** **Discovery Methods:**

View File

@@ -0,0 +1,43 @@
# 🛠️ Crawl4AI v0.7.1: Minor Cleanup Update
*July 17, 2025 • 2 min read*
---
A small maintenance release that removes unused code and improves documentation.
## 🎯 What's Changed
- **Removed unused StealthConfig** from `crawl4ai/browser_manager.py`
- **Updated documentation** with better examples and parameter explanations
- **Fixed virtual scroll configuration** examples in docs
## 🧹 Code Cleanup
Removed unused `StealthConfig` import and configuration that wasn't being used anywhere in the codebase. The project uses its own custom stealth implementation through JavaScript injection instead.
```python
# Removed unused code:
from playwright_stealth import StealthConfig
stealth_config = StealthConfig(...) # This was never used
```
## 📖 Documentation Updates
- Fixed adaptive crawling parameter examples
- Updated session management documentation
- Corrected virtual scroll configuration examples
## 🚀 Installation
```bash
pip install crawl4ai==0.7.1
```
No breaking changes - upgrade directly from v0.7.0.
---
Questions? Issues?
- GitHub: [github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
- Discord: [discord.gg/crawl4ai](https://discord.gg/jP8KfhDhyN)

View File

@@ -18,7 +18,7 @@ Usage:
import asyncio import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.async_configs import LinkPreviewConfig from crawl4ai import LinkPreviewConfig
async def basic_link_head_extraction(): async def basic_link_head_extraction():

View File

@@ -30,33 +30,40 @@ The Adaptive Crawler maintains a persistent state for each domain, tracking:
```python ```python
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig from crawl4ai import AsyncWebCrawler, AdaptiveCrawler, AdaptiveConfig
import asyncio
# Initialize with custom adaptive parameters async def main():
config = AdaptiveConfig(
confidence_threshold=0.7, # Min confidence to stop crawling
max_depth=5, # Maximum crawl depth
max_pages=20, # Maximum number of pages to crawl
top_k_links=3, # Number of top links to follow per page
strategy="statistical", # 'statistical' or 'embedding'
coverage_weight=0.4, # Weight for coverage in confidence calculation
consistency_weight=0.3, # Weight for consistency in confidence calculation
saturation_weight=0.3 # Weight for saturation in confidence calculation
)
# Initialize adaptive crawler with web crawler
async with AsyncWebCrawler() as crawler:
adaptive_crawler = AdaptiveCrawler(crawler, config)
# Crawl and learn patterns # Configure adaptive crawler
state = await adaptive_crawler.digest( config = AdaptiveConfig(
start_url="https://news.example.com/article/12345", strategy="statistical", # or "embedding" for semantic understanding
query="latest news articles and content" max_pages=10,
confidence_threshold=0.7, # Stop at 70% confidence
top_k_links=3, # Follow top 3 links per page
min_gain_threshold=0.05 # Need 5% information gain to continue
) )
# Access results and confidence async with AsyncWebCrawler(verbose=False) as crawler:
print(f"Confidence Level: {adaptive_crawler.confidence:.0%}") adaptive = AdaptiveCrawler(crawler, config)
print(f"Pages Crawled: {len(state.crawled_urls)}")
print(f"Knowledge Base: {len(adaptive_crawler.state.knowledge_base)} documents") print("Starting adaptive crawl about Python decorators...")
result = await adaptive.digest(
start_url="https://docs.python.org/3/glossary.html",
query="python decorators functions wrapping"
)
print(f"\n✅ Crawling Complete!")
print(f"• Confidence Level: {adaptive.confidence:.0%}")
print(f"• Pages Crawled: {len(result.crawled_urls)}")
print(f"• Knowledge Base: {len(adaptive.state.knowledge_base)} documents")
# Get most relevant content
relevant = adaptive.get_relevant_content(top_k=3)
print(f"\nMost Relevant Pages:")
for i, page in enumerate(relevant, 1):
print(f"{i}. {page['url']} (relevance: {page['score']:.2%})")
asyncio.run(main())
``` ```
**Expected Real-World Impact:** **Expected Real-World Impact:**
@@ -141,56 +148,47 @@ async with AsyncWebCrawler() as crawler:
**My Solution:** I implemented a three-layer scoring system that analyzes links like a human would—considering their position, context, and relevance to your goals. **My Solution:** I implemented a three-layer scoring system that analyzes links like a human would—considering their position, context, and relevance to your goals.
### The Three-Layer Scoring System ### Intelligent Link Analysis and Scoring
```python ```python
from crawl4ai import LinkPreviewConfig, CrawlerRunConfig, CacheMode import asyncio
from crawl4ai import CrawlerRunConfig, CacheMode, AsyncWebCrawler
from crawl4ai.adaptive_crawler import LinkPreviewConfig
# Configure intelligent link analysis async def main():
link_config = LinkPreviewConfig( # Configure intelligent link analysis
include_internal=True, link_config = LinkPreviewConfig(
include_external=False, include_internal=True,
max_links=10, include_external=False,
concurrency=5, max_links=10,
query="python tutorial", # For contextual scoring concurrency=5,
score_threshold=0.3, query="python tutorial", # For contextual scoring
verbose=True score_threshold=0.3,
) verbose=True
# Use in your crawl
result = await crawler.arun(
"https://tech-blog.example.com",
config=CrawlerRunConfig(
link_preview_config=link_config,
score_links=True, # Enable intrinsic scoring
cache_mode=CacheMode.BYPASS
) )
) # Use in your crawl
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
"https://www.geeksforgeeks.org/",
config=CrawlerRunConfig(
link_preview_config=link_config,
score_links=True, # Enable intrinsic scoring
cache_mode=CacheMode.BYPASS
)
)
# Access scored and sorted links # Access scored and sorted links
if result.success and result.links: if result.success and result.links:
# Get scored links for link in result.links.get("internal", []):
internal_links = result.links.get("internal", []) text = link.get('text', 'No text')[:40]
scored_links = [l for l in internal_links if l.get("total_score")] print(
scored_links.sort(key=lambda x: x.get("total_score", 0), reverse=True) text,
f"{link.get('intrinsic_score', 0):.1f}/10" if link.get('intrinsic_score') is not None else "0.0/10",
f"{link.get('contextual_score', 0):.2f}/1" if link.get('contextual_score') is not None else "0.00/1",
f"{link.get('total_score', 0):.3f}" if link.get('total_score') is not None else "0.000"
)
# Create a scoring table asyncio.run(main())
table = Table(title="Link Scoring Results", box=box.ROUNDED)
table.add_column("Link Text", style="cyan", width=40)
table.add_column("Intrinsic Score", justify="center")
table.add_column("Contextual Score", justify="center")
table.add_column("Total Score", justify="center", style="bold green")
for link in scored_links[:5]:
text = link.get('text', 'No text')[:40]
table.add_row(
text,
f"{link.get('intrinsic_score', 0):.1f}/10",
f"{link.get('contextual_score', 0):.2f}/1",
f"{link.get('total_score', 0):.3f}"
)
console.print(table)
``` ```
**Scoring Components:** **Scoring Components:**
@@ -223,58 +221,34 @@ console.print(table)
### Technical Architecture ### Technical Architecture
```python ```python
import asyncio
from crawl4ai import AsyncUrlSeeder, SeedingConfig from crawl4ai import AsyncUrlSeeder, SeedingConfig
# Basic discovery - find all product pages async def main():
seeder_config = SeedingConfig( async with AsyncUrlSeeder() as seeder:
# Discovery sources # Discover Python tutorial URLs
source="cc+sitemap", # Sitemap + Common Crawl config = SeedingConfig(
source="sitemap", # Use sitemap
# Filtering pattern="*python*", # URL pattern filter
pattern="*/product/*", # URL pattern matching extract_head=True, # Get metadata
query="python tutorial", # For relevance scoring
# Validation scoring_method="bm25",
live_check=True, # Verify URLs are alive score_threshold=0.2,
max_urls=50, # Stop at 50 URLs max_urls=10
)
# Performance
concurrency=100, # Maximum concurrent requests for live checks/head extraction print("Discovering Python async tutorial URLs...")
hits_per_sec=10 # Rate limit in requests per second to avoid overwhelming servers urls = await seeder.urls("https://www.geeksforgeeks.org/", config)
)
print(f"\n✅ Found {len(urls)} relevant URLs:")
for i, url_info in enumerate(urls[:5], 1):
print(f"\n{i}. {url_info['url']}")
if url_info.get('relevance_score'):
print(f" Relevance: {url_info['relevance_score']:.3f}")
if url_info.get('head_data', {}).get('title'):
print(f" Title: {url_info['head_data']['title'][:60]}...")
async with AsyncUrlSeeder() as seeder: asyncio.run(main())
console.print("Discovering URLs from Python docs...")
urls = await seeder.urls("docs.python.org", seeding_config)
console.print(f"\n✓ Discovered {len(urls)} URLs")
# Advanced: Relevance-based discovery
research_config = SeedingConfig(
source="sitemap+cc", # Sitemap + Common Crawl
pattern="*/blog/*", # Blog posts only
# Content relevance
extract_head=True, # Get meta tags
query="quantum computing tutorials",
scoring_method="bm25", # BM25 scoring method
score_threshold=0.4, # High relevance only
# Smart filtering
filter_nonsense_urls=True, # Remove .xml, .txt, etc.
force=True # Bypass cache
)
# Discover with progress tracking
discovered = []
async with AsyncUrlSeeder() as seeder:
discovered = await seeder.urls("https://physics-blog.com", research_config)
console.print(f"\n✓ Discovered {len(discovered)} URLs")
# Results include scores and metadata
for url_data in discovered[:5]:
print(f"URL: {url_data['url']}")
print(f"Score: {url_data['relevance_score']:.3f}")
print(f"Title: {url_data['head_data']['title']}")
``` ```
**Discovery Methods:** **Discovery Methods:**

View File

@@ -52,11 +52,9 @@ That's it! In just a few lines, you've automated a complete search workflow.
Want to learn by doing? We've got you covered: Want to learn by doing? We've got you covered:
**🚀 [Live Demo](https://docs.crawl4ai.com/c4a-script/demo)** - Try C4A-Script in your browser right now! **🚀 [Live Demo](https://docs.crawl4ai.com/apps/c4a-script/)** - Try C4A-Script in your browser right now!
**📁 [Tutorial Examples](/examples/c4a_script/)** - Complete examples with source code **📁 [Tutorial Examples](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/c4a_script/)** - Complete examples with source code
**🛠️ [Local Tutorial](/examples/c4a_script/tutorial/)** - Run the interactive tutorial on your machine
### Running the Tutorial Locally ### Running the Tutorial Locally

View File

@@ -125,7 +125,7 @@ Here's a full example you can copy, paste, and run immediately:
```python ```python
import asyncio import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.async_configs import LinkPreviewConfig from crawl4ai import LinkPreviewConfig
async def extract_link_heads_example(): async def extract_link_heads_example():
""" """
@@ -237,7 +237,7 @@ if __name__ == "__main__":
The `LinkPreviewConfig` class supports these options: The `LinkPreviewConfig` class supports these options:
```python ```python
from crawl4ai.async_configs import LinkPreviewConfig from crawl4ai import LinkPreviewConfig
link_preview_config = LinkPreviewConfig( link_preview_config = LinkPreviewConfig(
# BASIC SETTINGS # BASIC SETTINGS

View File

@@ -28,7 +28,7 @@ from rich import box
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, AdaptiveCrawler, AdaptiveConfig, BrowserConfig, CacheMode from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, AdaptiveCrawler, AdaptiveConfig, BrowserConfig, CacheMode
from crawl4ai import AsyncUrlSeeder, SeedingConfig from crawl4ai import AsyncUrlSeeder, SeedingConfig
from crawl4ai.async_configs import LinkPreviewConfig, VirtualScrollConfig from crawl4ai import LinkPreviewConfig, VirtualScrollConfig
from crawl4ai import c4a_compile, CompilationResult from crawl4ai import c4a_compile, CompilationResult
# Initialize Rich console for beautiful output # Initialize Rich console for beautiful output

View File

@@ -13,14 +13,13 @@ from crawl4ai import (
BrowserConfig, BrowserConfig,
CacheMode, CacheMode,
# New imports for v0.7.0 # New imports for v0.7.0
LinkPreviewConfig,
VirtualScrollConfig, VirtualScrollConfig,
LinkPreviewConfig,
AdaptiveCrawler, AdaptiveCrawler,
AdaptiveConfig, AdaptiveConfig,
AsyncUrlSeeder, AsyncUrlSeeder,
SeedingConfig, SeedingConfig,
c4a_compile, c4a_compile,
CompilationResult
) )
@@ -170,16 +169,16 @@ async def demo_url_seeder():
# Discover Python tutorial URLs # Discover Python tutorial URLs
config = SeedingConfig( config = SeedingConfig(
source="sitemap", # Use sitemap source="sitemap", # Use sitemap
pattern="*tutorial*", # URL pattern filter pattern="*python*", # URL pattern filter
extract_head=True, # Get metadata extract_head=True, # Get metadata
query="python async programming", # For relevance scoring query="python tutorial", # For relevance scoring
scoring_method="bm25", scoring_method="bm25",
score_threshold=0.2, score_threshold=0.2,
max_urls=10 max_urls=10
) )
print("Discovering Python async tutorial URLs...") print("Discovering Python async tutorial URLs...")
urls = await seeder.urls("docs.python.org", config) urls = await seeder.urls("https://www.geeksforgeeks.org/", config)
print(f"\n✅ Found {len(urls)} relevant URLs:") print(f"\n✅ Found {len(urls)} relevant URLs:")
for i, url_info in enumerate(urls[:5], 1): for i, url_info in enumerate(urls[:5], 1):
@@ -245,39 +244,6 @@ IF (EXISTS `.price-filter`) THEN CLICK `input[data-max-price="100"]`
print(f"❌ Compilation error: {result.first_error.message}") print(f"❌ Compilation error: {result.first_error.message}")
async def demo_pdf_support():
"""
Demo 6: PDF Parsing Support
Shows how to extract content from PDF files.
Note: Requires 'pip install crawl4ai[pdf]'
"""
print("\n" + "="*60)
print("📄 DEMO 6: PDF Parsing Support")
print("="*60)
try:
# Check if PDF support is installed
import PyPDF2
# Example: Process a PDF URL
config = CrawlerRunConfig(
cache_mode=CacheMode.BYPASS,
pdf=True, # Enable PDF generation
extract_text_from_pdf=True # Extract text content
)
print("PDF parsing is available!")
print("You can now crawl PDF URLs and extract their content.")
print("\nExample usage:")
print(' result = await crawler.arun("https://example.com/document.pdf")')
print(' pdf_text = result.extracted_content # Contains extracted text')
except ImportError:
print("⚠️ PDF support not installed.")
print("Install with: pip install crawl4ai[pdf]")
async def main(): async def main():
"""Run all demos""" """Run all demos"""
print("\n🚀 Crawl4AI v0.7.0 Feature Demonstrations") print("\n🚀 Crawl4AI v0.7.0 Feature Demonstrations")
@@ -289,7 +255,6 @@ async def main():
("Virtual Scroll", demo_virtual_scroll), ("Virtual Scroll", demo_virtual_scroll),
("URL Seeder", demo_url_seeder), ("URL Seeder", demo_url_seeder),
("C4A Script", demo_c4a_script), ("C4A Script", demo_c4a_script),
("PDF Support", demo_pdf_support)
] ]
for name, demo_func in demos: for name, demo_func in demos:
@@ -309,7 +274,6 @@ async def main():
print("• Virtual Scroll: Capture all content from modern web pages") print("• Virtual Scroll: Capture all content from modern web pages")
print("• URL Seeder: Pre-discover and filter URLs efficiently") print("• URL Seeder: Pre-discover and filter URLs efficiently")
print("• C4A Script: Simple language for complex automations") print("• C4A Script: Simple language for complex automations")
print("• PDF Support: Extract content from PDF documents")
if __name__ == "__main__": if __name__ == "__main__":

View File

@@ -44,7 +44,6 @@ dependencies = [
"brotli>=1.1.0", "brotli>=1.1.0",
"humanize>=4.10.0", "humanize>=4.10.0",
"lark>=1.2.2", "lark>=1.2.2",
"sentence-transformers>=2.2.0",
"alphashape>=1.3.1", "alphashape>=1.3.1",
"shapely>=2.0.0" "shapely>=2.0.0"
] ]
@@ -62,8 +61,8 @@ classifiers = [
[project.optional-dependencies] [project.optional-dependencies]
pdf = ["PyPDF2"] pdf = ["PyPDF2"]
torch = ["torch", "nltk", "scikit-learn"] torch = ["torch", "nltk", "scikit-learn"]
transformer = ["transformers", "tokenizers"] transformer = ["transformers", "tokenizers", "sentence-transformers"]
cosine = ["torch", "transformers", "nltk"] cosine = ["torch", "transformers", "nltk", "sentence-transformers"]
sync = ["selenium"] sync = ["selenium"]
all = [ all = [
"PyPDF2", "PyPDF2",
@@ -72,8 +71,8 @@ all = [
"scikit-learn", "scikit-learn",
"transformers", "transformers",
"tokenizers", "tokenizers",
"selenium", "sentence-transformers",
"PyPDF2" "selenium"
] ]
[project.scripts] [project.scripts]

View File

@@ -24,7 +24,6 @@ cssselect>=1.2.0
chardet>=5.2.0 chardet>=5.2.0
brotli>=1.1.0 brotli>=1.1.0
httpx[http2]>=0.27.2 httpx[http2]>=0.27.2
sentence-transformers>=2.2.0
alphashape>=1.3.1 alphashape>=1.3.1
shapely>=2.0.0 shapely>=2.0.0

View File

@@ -0,0 +1,345 @@
#!/usr/bin/env python3
"""
Simple API Test for Crawl4AI Docker Server v0.7.0
Uses only built-in Python modules to test all endpoints.
"""
import urllib.request
import urllib.parse
import json
import time
import sys
from typing import Dict, List, Optional
# Configuration
BASE_URL = "http://localhost:11234" # Change to your server URL
TEST_TIMEOUT = 30
class SimpleApiTester:
def __init__(self, base_url: str = BASE_URL):
self.base_url = base_url
self.token = None
self.results = []
def log(self, message: str):
print(f"[INFO] {message}")
def test_get_endpoint(self, endpoint: str) -> Dict:
"""Test a GET endpoint"""
url = f"{self.base_url}{endpoint}"
start_time = time.time()
try:
req = urllib.request.Request(url)
if self.token:
req.add_header('Authorization', f'Bearer {self.token}')
with urllib.request.urlopen(req, timeout=TEST_TIMEOUT) as response:
response_time = time.time() - start_time
status_code = response.getcode()
content = response.read().decode('utf-8')
# Try to parse JSON
try:
data = json.loads(content)
except:
data = {"raw_response": content[:200]}
return {
"endpoint": endpoint,
"method": "GET",
"status": "PASS" if status_code < 400 else "FAIL",
"status_code": status_code,
"response_time": response_time,
"data": data
}
except Exception as e:
response_time = time.time() - start_time
return {
"endpoint": endpoint,
"method": "GET",
"status": "FAIL",
"status_code": None,
"response_time": response_time,
"error": str(e)
}
def test_post_endpoint(self, endpoint: str, payload: Dict) -> Dict:
"""Test a POST endpoint"""
url = f"{self.base_url}{endpoint}"
start_time = time.time()
try:
data = json.dumps(payload).encode('utf-8')
req = urllib.request.Request(url, data=data, method='POST')
req.add_header('Content-Type', 'application/json')
if self.token:
req.add_header('Authorization', f'Bearer {self.token}')
with urllib.request.urlopen(req, timeout=TEST_TIMEOUT) as response:
response_time = time.time() - start_time
status_code = response.getcode()
content = response.read().decode('utf-8')
# Try to parse JSON
try:
data = json.loads(content)
except:
data = {"raw_response": content[:200]}
return {
"endpoint": endpoint,
"method": "POST",
"status": "PASS" if status_code < 400 else "FAIL",
"status_code": status_code,
"response_time": response_time,
"data": data
}
except Exception as e:
response_time = time.time() - start_time
return {
"endpoint": endpoint,
"method": "POST",
"status": "FAIL",
"status_code": None,
"response_time": response_time,
"error": str(e)
}
def print_result(self, result: Dict):
"""Print a formatted test result"""
status_color = {
"PASS": "",
"FAIL": "",
"SKIP": "⏭️"
}
print(f"{status_color[result['status']]} {result['method']} {result['endpoint']} "
f"| {result['response_time']:.3f}s | Status: {result['status_code'] or 'N/A'}")
if result['status'] == 'FAIL' and 'error' in result:
print(f" Error: {result['error']}")
self.results.append(result)
def run_all_tests(self):
"""Run all API tests"""
print("🚀 Starting Crawl4AI v0.7.0 API Test Suite")
print(f"📡 Testing server at: {self.base_url}")
print("=" * 60)
# # Test basic endpoints
# print("\n=== BASIC ENDPOINTS ===")
# # Health check
# result = self.test_get_endpoint("/health")
# self.print_result(result)
# # Schema endpoint
# result = self.test_get_endpoint("/schema")
# self.print_result(result)
# # Metrics endpoint
# result = self.test_get_endpoint("/metrics")
# self.print_result(result)
# # Root redirect
# result = self.test_get_endpoint("/")
# self.print_result(result)
# # Test authentication
# print("\n=== AUTHENTICATION ===")
# # Get token
# token_payload = {"email": "test@example.com"}
# result = self.test_post_endpoint("/token", token_payload)
# self.print_result(result)
# # Extract token if successful
# if result['status'] == 'PASS' and 'data' in result:
# token = result['data'].get('access_token')
# if token:
# self.token = token
# self.log(f"Successfully obtained auth token: {token[:20]}...")
# Test core APIs
print("\n=== CORE APIs ===")
test_url = "https://example.com"
# Test markdown endpoint
md_payload = {
"url": test_url,
"f": "fit",
"q": "test query",
"c": "0"
}
result = self.test_post_endpoint("/md", md_payload)
# print(result['data'].get('markdown', ''))
self.print_result(result)
# Test HTML endpoint
html_payload = {"url": test_url}
result = self.test_post_endpoint("/html", html_payload)
self.print_result(result)
# Test screenshot endpoint
screenshot_payload = {
"url": test_url,
"screenshot_wait_for": 2
}
result = self.test_post_endpoint("/screenshot", screenshot_payload)
self.print_result(result)
# Test PDF endpoint
pdf_payload = {"url": test_url}
result = self.test_post_endpoint("/pdf", pdf_payload)
self.print_result(result)
# Test JavaScript execution
js_payload = {
"url": test_url,
"scripts": ["(() => document.title)()"]
}
result = self.test_post_endpoint("/execute_js", js_payload)
self.print_result(result)
# Test crawl endpoint
crawl_payload = {
"urls": [test_url],
"browser_config": {},
"crawler_config": {}
}
result = self.test_post_endpoint("/crawl", crawl_payload)
self.print_result(result)
# Test config dump
config_payload = {"code": "CrawlerRunConfig()"}
result = self.test_post_endpoint("/config/dump", config_payload)
self.print_result(result)
# Test LLM endpoint
llm_endpoint = f"/llm/{test_url}?q=Extract%20main%20content"
result = self.test_get_endpoint(llm_endpoint)
self.print_result(result)
# Test ask endpoint
ask_endpoint = "/ask?context_type=all&query=crawl4ai&max_results=5"
result = self.test_get_endpoint(ask_endpoint)
print(result)
self.print_result(result)
# Test job APIs
print("\n=== JOB APIs ===")
# Test LLM job
llm_job_payload = {
"url": test_url,
"q": "Extract main content",
"cache": False
}
result = self.test_post_endpoint("/llm/job", llm_job_payload)
self.print_result(result)
# Test crawl job
crawl_job_payload = {
"urls": [test_url],
"browser_config": {},
"crawler_config": {}
}
result = self.test_post_endpoint("/crawl/job", crawl_job_payload)
self.print_result(result)
# Test MCP
print("\n=== MCP APIs ===")
# Test MCP schema
result = self.test_get_endpoint("/mcp/schema")
self.print_result(result)
# Test error handling
print("\n=== ERROR HANDLING ===")
# Test invalid URL
invalid_payload = {"url": "invalid-url", "f": "fit"}
result = self.test_post_endpoint("/md", invalid_payload)
self.print_result(result)
# Test invalid endpoint
result = self.test_get_endpoint("/nonexistent")
self.print_result(result)
# Print summary
self.print_summary()
def print_summary(self):
"""Print test results summary"""
print("\n" + "=" * 60)
print("📊 TEST RESULTS SUMMARY")
print("=" * 60)
total = len(self.results)
passed = sum(1 for r in self.results if r['status'] == 'PASS')
failed = sum(1 for r in self.results if r['status'] == 'FAIL')
print(f"Total Tests: {total}")
print(f"✅ Passed: {passed}")
print(f"❌ Failed: {failed}")
print(f"📈 Success Rate: {(passed/total)*100:.1f}%")
if failed > 0:
print("\n❌ FAILED TESTS:")
for result in self.results:
if result['status'] == 'FAIL':
print(f"{result['method']} {result['endpoint']}")
if 'error' in result:
print(f" Error: {result['error']}")
# Performance statistics
response_times = [r['response_time'] for r in self.results if r['response_time'] > 0]
if response_times:
avg_time = sum(response_times) / len(response_times)
max_time = max(response_times)
print(f"\n⏱️ Average Response Time: {avg_time:.3f}s")
print(f"⏱️ Max Response Time: {max_time:.3f}s")
# Save detailed report
report_file = f"crawl4ai_test_report_{int(time.time())}.json"
with open(report_file, 'w') as f:
json.dump({
"timestamp": time.time(),
"server_url": self.base_url,
"version": "0.7.0",
"summary": {
"total": total,
"passed": passed,
"failed": failed
},
"results": self.results
}, f, indent=2)
print(f"\n📄 Detailed report saved to: {report_file}")
def main():
"""Main test runner"""
import argparse
parser = argparse.ArgumentParser(description='Crawl4AI v0.7.0 API Test Suite')
parser.add_argument('--url', default=BASE_URL, help='Base URL of the server')
args = parser.parse_args()
tester = SimpleApiTester(args.url)
try:
tester.run_all_tests()
except KeyboardInterrupt:
print("\n🛑 Test suite interrupted by user")
except Exception as e:
print(f"\n💥 Test suite failed with error: {e}")
sys.exit(1)
if __name__ == "__main__":
main()

View File

@@ -5,7 +5,7 @@ Test script for Link Extractor functionality
from crawl4ai.models import Link from crawl4ai.models import Link
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.async_configs import LinkPreviewConfig from crawl4ai import LinkPreviewConfig
import asyncio import asyncio
import sys import sys
import os import os
@@ -237,7 +237,7 @@ def test_config_examples():
print(f" {key}: {value}") print(f" {key}: {value}")
print(" Usage:") print(" Usage:")
print(" from crawl4ai.async_configs import LinkPreviewConfig") print(" from crawl4ai import LinkPreviewConfig")
print(" config = CrawlerRunConfig(") print(" config = CrawlerRunConfig(")
print(" link_preview_config=LinkPreviewConfig(") print(" link_preview_config=LinkPreviewConfig(")
for key, value in config_dict.items(): for key, value in config_dict.items():