diff --git a/README.md b/README.md index a538d298..8d157861 100644 --- a/README.md +++ b/README.md @@ -26,9 +26,9 @@ Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease. -[✨ Check out latest update v0.7.0](#-recent-updates) +[✨ Check out latest update v0.7.3](#-recent-updates) -🎉 **Version 0.7.0 is now available!** The Adaptive Intelligence Update introduces groundbreaking features: Adaptive Crawling that learns website patterns, Virtual Scroll support for infinite pages, intelligent Link Preview with 3-layer scoring, Async URL Seeder for massive discovery, and significant performance improvements. [Read the release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.0.md) +🎉 **Version 0.7.3 is now available!** The Multi-Config Intelligence Update brings URL-specific configurations for mixed content crawling, flexible Docker LLM providers, critical bug fixes, and improved documentation. Configure different strategies for docs, blogs, and APIs in a single crawl! [Read the release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/blog/releases/0.7.3.md)
🤓 My Personal Story @@ -273,9 +273,9 @@ The new Docker implementation includes: ### Getting Started ```bash -# Pull and run the latest release candidate -docker pull unclecode/crawl4ai:0.7.0 -docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.7.0 +# Pull and run the latest release +docker pull unclecode/crawl4ai:0.7.3 +docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.7.3 # Visit the playground at http://localhost:11235/playground ``` @@ -518,7 +518,40 @@ async def test_news_crawl(): ## ✨ Recent Updates -### Version 0.7.0 Release Highlights - The Adaptive Intelligence Update +### Version 0.7.3 Release Highlights - The Multi-Config Intelligence Update + +- **🎨 Multi-URL Configurations**: Different crawling strategies for different URL patterns in a single batch: + ```python + configs = [ + # Documentation sites - aggressive caching + CrawlerRunConfig( + url_matcher=["*docs*", "*documentation*"], + cache_mode="write" + ), + # News sites - fresh content, scroll for lazy loading + CrawlerRunConfig( + url_matcher=lambda url: 'blog' in url or 'news' in url, + cache_mode="bypass", + js_code="window.scrollTo(0, document.body.scrollHeight/2);" + ), + # Default fallback + CrawlerRunConfig() + ] + + results = await crawler.arun_many(urls, config=configs) + ``` + +- **🐳 Flexible Docker LLM Providers**: Configure LLM providers via environment variables: + ```bash + # Using .llm.env file (recommended) + docker run -d --env-file .llm.env -p 11235:11235 unclecode/crawl4ai:latest + ``` + +- **🔧 Bug Fixes & Improvements**: Critical stability fixes for production deployments + +Read the full details in our [0.7.3 Release Notes](https://github.com/unclecode/crawl4ai/blob/main/docs/md_v2/blog/releases/0.7.3.md). 
+ +### Previous Version: 0.7.0 Release Highlights - The Adaptive Intelligence Update - **🧠 Adaptive Crawling**: Your crawler now learns and adapts to website patterns automatically: ```python diff --git a/crawl4ai/__version__.py b/crawl4ai/__version__.py index 043cf654..16868c72 100644 --- a/crawl4ai/__version__.py +++ b/crawl4ai/__version__.py @@ -1,7 +1,7 @@ # crawl4ai/__version__.py # This is the version that will be used for stable releases -__version__ = "0.7.2" +__version__ = "0.7.3" # For nightly builds, this gets set during build process __nightly_version__ = None diff --git a/docs/blog/release-v0.7.3.md b/docs/blog/release-v0.7.3.md new file mode 100644 index 00000000..d08d4774 --- /dev/null +++ b/docs/blog/release-v0.7.3.md @@ -0,0 +1,170 @@ +# 🚀 Crawl4AI v0.7.3: The Multi-Config Intelligence Update + +*August 6, 2025 • 5 min read* + +--- + +Today I'm releasing Crawl4AI v0.7.3—the Multi-Config Intelligence Update. This release brings smarter URL-specific configurations, flexible Docker deployments, important bug fixes, and documentation improvements that make Crawl4AI more robust and production-ready. + +## 🎯 What's New at a Glance + +- **Multi-URL Configurations**: Different crawling strategies for different URL patterns in a single batch +- **Flexible Docker LLM Providers**: Configure LLM providers via environment variables +- **Bug Fixes**: Resolved several critical issues for better stability +- **Documentation Updates**: Clearer examples and improved API documentation + +## 🎨 Multi-URL Configurations: One Size Doesn't Fit All + +**The Problem:** You're crawling a mix of documentation sites, blogs, and API endpoints. Each needs different handling—caching for docs, fresh content for news, structured extraction for APIs. Previously, you'd run separate crawls or write complex conditional logic. + +**My Solution:** I implemented URL-specific configurations that let you define different strategies for different URL patterns in a single crawl batch. 
First match wins, with optional fallback support. + +### Technical Implementation + +```python +from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMExtractionStrategy + +# Define specialized configs for different content types +configs = [ + # Documentation sites - aggressive caching, include links + CrawlerRunConfig( + url_matcher=["*docs*", "*documentation*"], + cache_mode="write", + markdown_generator_options={"include_links": True} + ), + + # News/blog sites - fresh content, scroll for lazy loading + CrawlerRunConfig( + url_matcher=lambda url: 'blog' in url or 'news' in url, + cache_mode="bypass", + js_code="window.scrollTo(0, document.body.scrollHeight/2);" + ), + + # API endpoints - structured extraction + CrawlerRunConfig( + url_matcher=["*.json", "*api*"], + extraction_strategy=LLMExtractionStrategy( + provider="openai/gpt-4o-mini", + extraction_type="structured" + ) + ), + + # Default fallback for everything else + CrawlerRunConfig() # No url_matcher = matches everything +] + +# Crawl multiple URLs with appropriate configs +async with AsyncWebCrawler() as crawler: + results = await crawler.arun_many( + urls=[ + "https://docs.python.org/3/", # → Uses documentation config + "https://blog.python.org/", # → Uses blog config + "https://api.github.com/users", # → Uses API config + "https://example.com/" # → Uses default config + ], + config=configs + ) +``` + +**Matching Capabilities:** +- **String Patterns**: Wildcards like `"*.pdf"`, `"*/blog/*"` +- **Function Matchers**: Lambda functions for complex logic +- **Mixed Matchers**: Combine strings and functions with AND/OR logic +- **Fallback Support**: Default config when nothing matches + +**Expected Real-World Impact:** +- **Mixed Content Sites**: Handle blogs, docs, and downloads in one crawl +- **Multi-Domain Crawling**: Different strategies per domain without separate runs +- **Reduced Complexity**: No more if/else forests in your extraction code +- **Better Performance**: Each URL gets exactly the processing it
needs + +## 🐳 Docker: Flexible LLM Provider Configuration + +**The Problem:** Hardcoded LLM providers in Docker deployments. Want to switch from OpenAI to Groq? Rebuild and redeploy. Testing different models? Multiple Docker images. + +**My Solution:** Configure LLM providers via environment variables. Switch providers without touching code or rebuilding images. + +### Deployment Flexibility + +```bash +# Option 1: Direct environment variables +docker run -d \ + -e LLM_PROVIDER="groq/llama-3.2-3b-preview" \ + -e GROQ_API_KEY="your-key" \ + -p 11235:11235 \ + unclecode/crawl4ai:latest + +# Option 2: Using .llm.env file (recommended for production) +# Create .llm.env file: +# LLM_PROVIDER=openai/gpt-4o-mini +# OPENAI_API_KEY=your-openai-key +# GROQ_API_KEY=your-groq-key + +docker run -d \ + --env-file .llm.env \ + -p 11235:11235 \ + unclecode/crawl4ai:latest +``` + +Override per request when needed: +```python +# Use default provider from .llm.env +response = requests.post("http://localhost:11235/crawl", json={ + "url": "https://example.com", + "extraction_strategy": {"type": "llm"} +}) + +# Override to use different provider for this specific request +response = requests.post("http://localhost:11235/crawl", json={ + "url": "https://complex-page.com", + "extraction_strategy": { + "type": "llm", + "provider": "openai/gpt-4" # Override default + } +}) +``` + +**Expected Real-World Impact:** +- **Cost Optimization**: Use cheaper models for simple tasks, premium for complex +- **A/B Testing**: Compare provider performance without deployment changes +- **Fallback Strategies**: Switch providers on-the-fly during outages +- **Development Flexibility**: Test locally with one provider, deploy with another +- **Secure Configuration**: Keep API keys in `.llm.env` file, not in commands + +## 🔧 Bug Fixes & Improvements + +This release includes several important bug fixes that improve stability and reliability: + +- **URL Matcher Fallback**: Fixed edge cases in URL pattern 
matching logic +- **Memory Management**: Resolved memory leaks in long-running crawl sessions +- **Sitemap Processing**: Fixed redirect handling in sitemap fetching +- **Table Extraction**: Improved table detection and extraction accuracy +- **Error Handling**: Better error messages and recovery from network failures + +## 📚 Documentation Enhancements + +Based on community feedback, we've updated: +- Clearer examples for multi-URL configuration +- Improved CrawlResult documentation with all available fields +- Fixed typos and inconsistencies across documentation +- Added real-world URLs in examples for better understanding +- New comprehensive demo showcasing all v0.7.3 features + +## 🙏 Acknowledgments + +Thanks to our contributors and the entire community for feedback and bug reports. + +## 📚 Resources + +- [Full Documentation](https://docs.crawl4ai.com) +- [GitHub Repository](https://github.com/unclecode/crawl4ai) +- [Discord Community](https://discord.gg/crawl4ai) +- [Feature Demo](https://github.com/unclecode/crawl4ai/blob/main/docs/releases_review/demo_v0.7.3.py) + +--- + +*Crawl4AI continues to evolve with your needs. This release makes it smarter, more flexible, and more stable. Try the new multi-config feature and flexible Docker deployment—they're game changers!* + +**Happy Crawling! 🕷️** + +*- The Crawl4AI Team* \ No newline at end of file diff --git a/docs/md_v2/blog/index.md b/docs/md_v2/blog/index.md index 2ac8338d..123ca8b0 100644 --- a/docs/md_v2/blog/index.md +++ b/docs/md_v2/blog/index.md @@ -20,24 +20,30 @@ Ever wondered why your AI coding assistant struggles with your library despite c ## Latest Release -### [Crawl4AI v0.7.0 – The Adaptive Intelligence Update](releases/0.7.0.md) -*January 28, 2025* +### [Crawl4AI v0.7.3 – The Multi-Config Intelligence Update](releases/0.7.3.md) +*August 6, 2025* -Crawl4AI v0.7.0 introduces groundbreaking intelligence features that transform how crawlers understand and adapt to websites. 
This release brings Adaptive Crawling that learns website patterns, Virtual Scroll support for infinite pages, intelligent Link Preview with 3-layer scoring, and the powerful Async URL Seeder for massive URL discovery. +Crawl4AI v0.7.3 brings smarter URL-specific configurations, flexible Docker deployments, and critical stability improvements. Configure different crawling strategies for different URL patterns in a single batch—perfect for mixed content sites with docs, blogs, and APIs. Key highlights: -- **Adaptive Crawling**: Crawlers that learn and adapt to website structures automatically -- **Virtual Scroll Support**: Complete content extraction from modern infinite scroll pages -- **Link Preview**: 3-layer scoring system for intelligent link prioritization -- **Async URL Seeder**: Discover thousands of URLs in seconds with smart filtering -- **Performance Boost**: Up to 3x faster with optimized resource handling +- **Multi-URL Configurations**: Different strategies for different URL patterns in one crawl +- **Flexible Docker LLM Providers**: Configure providers via environment variables +- **Bug Fixes**: Critical stability improvements for production deployments +- **Documentation Updates**: Clearer examples and improved API documentation -[Read full release notes →](releases/0.7.0.md) +[Read full release notes →](releases/0.7.3.md) --- ## Previous Releases +### [Crawl4AI v0.7.0 – The Adaptive Intelligence Update](releases/0.7.0.md) +*January 28, 2025* + +Introduced groundbreaking intelligence features including Adaptive Crawling, Virtual Scroll support, intelligent Link Preview, and the Async URL Seeder for massive URL discovery. 
+ +[Read release notes →](releases/0.7.0.md) + ### [Crawl4AI v0.6.0 – World-Aware Crawling, Pre-Warmed Browsers, and the MCP API](releases/0.6.0.md) *December 23, 2024* diff --git a/docs/md_v2/blog/releases/0.7.3.md b/docs/md_v2/blog/releases/0.7.3.md new file mode 100644 index 00000000..d08d4774 --- /dev/null +++ b/docs/md_v2/blog/releases/0.7.3.md @@ -0,0 +1,170 @@ +# 🚀 Crawl4AI v0.7.3: The Multi-Config Intelligence Update + +*August 6, 2025 • 5 min read* + +--- + +Today I'm releasing Crawl4AI v0.7.3—the Multi-Config Intelligence Update. This release brings smarter URL-specific configurations, flexible Docker deployments, important bug fixes, and documentation improvements that make Crawl4AI more robust and production-ready. + +## 🎯 What's New at a Glance + +- **Multi-URL Configurations**: Different crawling strategies for different URL patterns in a single batch +- **Flexible Docker LLM Providers**: Configure LLM providers via environment variables +- **Bug Fixes**: Resolved several critical issues for better stability +- **Documentation Updates**: Clearer examples and improved API documentation + +## 🎨 Multi-URL Configurations: One Size Doesn't Fit All + +**The Problem:** You're crawling a mix of documentation sites, blogs, and API endpoints. Each needs different handling—caching for docs, fresh content for news, structured extraction for APIs. Previously, you'd run separate crawls or write complex conditional logic. + +**My Solution:** I implemented URL-specific configurations that let you define different strategies for different URL patterns in a single crawl batch. First match wins, with optional fallback support. 
+ +### Technical Implementation + +```python +from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMExtractionStrategy + +# Define specialized configs for different content types +configs = [ + # Documentation sites - aggressive caching, include links + CrawlerRunConfig( + url_matcher=["*docs*", "*documentation*"], + cache_mode="write", + markdown_generator_options={"include_links": True} + ), + + # News/blog sites - fresh content, scroll for lazy loading + CrawlerRunConfig( + url_matcher=lambda url: 'blog' in url or 'news' in url, + cache_mode="bypass", + js_code="window.scrollTo(0, document.body.scrollHeight/2);" + ), + + # API endpoints - structured extraction + CrawlerRunConfig( + url_matcher=["*.json", "*api*"], + extraction_strategy=LLMExtractionStrategy( + provider="openai/gpt-4o-mini", + extraction_type="structured" + ) + ), + + # Default fallback for everything else + CrawlerRunConfig() # No url_matcher = matches everything +] + +# Crawl multiple URLs with appropriate configs +async with AsyncWebCrawler() as crawler: + results = await crawler.arun_many( + urls=[ + "https://docs.python.org/3/", # → Uses documentation config + "https://blog.python.org/", # → Uses blog config + "https://api.github.com/users", # → Uses API config + "https://example.com/" # → Uses default config + ], + config=configs + ) +``` + +**Matching Capabilities:** +- **String Patterns**: Wildcards like `"*.pdf"`, `"*/blog/*"` +- **Function Matchers**: Lambda functions for complex logic +- **Mixed Matchers**: Combine strings and functions with AND/OR logic +- **Fallback Support**: Default config when nothing matches + +**Expected Real-World Impact:** +- **Mixed Content Sites**: Handle blogs, docs, and downloads in one crawl +- **Multi-Domain Crawling**: Different strategies per domain without separate runs +- **Reduced Complexity**: No more if/else forests in your extraction code +- **Better Performance**: Each URL gets exactly the processing it needs + +## 🐳 Docker: Flexible LLM Provider
Configuration + +**The Problem:** Hardcoded LLM providers in Docker deployments. Want to switch from OpenAI to Groq? Rebuild and redeploy. Testing different models? Multiple Docker images. + +**My Solution:** Configure LLM providers via environment variables. Switch providers without touching code or rebuilding images. + +### Deployment Flexibility + +```bash +# Option 1: Direct environment variables +docker run -d \ + -e LLM_PROVIDER="groq/llama-3.2-3b-preview" \ + -e GROQ_API_KEY="your-key" \ + -p 11235:11235 \ + unclecode/crawl4ai:latest + +# Option 2: Using .llm.env file (recommended for production) +# Create .llm.env file: +# LLM_PROVIDER=openai/gpt-4o-mini +# OPENAI_API_KEY=your-openai-key +# GROQ_API_KEY=your-groq-key + +docker run -d \ + --env-file .llm.env \ + -p 11235:11235 \ + unclecode/crawl4ai:latest +``` + +Override per request when needed: +```python +# Use default provider from .llm.env +response = requests.post("http://localhost:11235/crawl", json={ + "url": "https://example.com", + "extraction_strategy": {"type": "llm"} +}) + +# Override to use different provider for this specific request +response = requests.post("http://localhost:11235/crawl", json={ + "url": "https://complex-page.com", + "extraction_strategy": { + "type": "llm", + "provider": "openai/gpt-4" # Override default + } +}) +``` + +**Expected Real-World Impact:** +- **Cost Optimization**: Use cheaper models for simple tasks, premium for complex +- **A/B Testing**: Compare provider performance without deployment changes +- **Fallback Strategies**: Switch providers on-the-fly during outages +- **Development Flexibility**: Test locally with one provider, deploy with another +- **Secure Configuration**: Keep API keys in `.llm.env` file, not in commands + +## 🔧 Bug Fixes & Improvements + +This release includes several important bug fixes that improve stability and reliability: + +- **URL Matcher Fallback**: Fixed edge cases in URL pattern matching logic +- **Memory Management**: Resolved 
memory leaks in long-running crawl sessions +- **Sitemap Processing**: Fixed redirect handling in sitemap fetching +- **Table Extraction**: Improved table detection and extraction accuracy +- **Error Handling**: Better error messages and recovery from network failures + +## 📚 Documentation Enhancements + +Based on community feedback, we've updated: +- Clearer examples for multi-URL configuration +- Improved CrawlResult documentation with all available fields +- Fixed typos and inconsistencies across documentation +- Added real-world URLs in examples for better understanding +- New comprehensive demo showcasing all v0.7.3 features + +## 🙏 Acknowledgments + +Thanks to our contributors and the entire community for feedback and bug reports. + +## 📚 Resources + +- [Full Documentation](https://docs.crawl4ai.com) +- [GitHub Repository](https://github.com/unclecode/crawl4ai) +- [Discord Community](https://discord.gg/crawl4ai) +- [Feature Demo](https://github.com/unclecode/crawl4ai/blob/main/docs/releases_review/demo_v0.7.3.py) + +--- + +*Crawl4AI continues to evolve with your needs. This release makes it smarter, more flexible, and more stable. Try the new multi-config feature and flexible Docker deployment—they're game changers!* + +**Happy Crawling! 🕷️** + +*- The Crawl4AI Team* \ No newline at end of file diff --git a/docs/md_v2/core/docker-deployment.md b/docs/md_v2/core/docker-deployment.md index 544db1e2..6e9a9704 100644 --- a/docs/md_v2/core/docker-deployment.md +++ b/docs/md_v2/core/docker-deployment.md @@ -58,15 +58,15 @@ Pull and run images directly from Docker Hub without building locally. #### 1. Pull the Image -Our latest release candidate is `0.7.0-r1`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system. +Our latest release is `0.7.3`. Images are built with multi-arch manifests, so Docker automatically pulls the correct version for your system. 
-> ⚠️ **Important Note**: The `latest` tag currently points to the stable `0.6.0` version. After testing and validation, `0.7.0` (without -r1) will be released and `latest` will be updated. For now, please use `0.7.0-r1` to test the new features. +> 💡 **Note**: The `latest` tag points to the stable `0.7.3` version. ```bash -# Pull the release candidate (for testing new features) -docker pull unclecode/crawl4ai:0.7.0-r1 +# Pull the latest version +docker pull unclecode/crawl4ai:0.7.3 -# Or pull the current stable version (0.6.0) +# Or pull using the latest tag docker pull unclecode/crawl4ai:latest ``` @@ -126,7 +126,7 @@ docker stop crawl4ai && docker rm crawl4ai #### Docker Hub Versioning Explained * **Image Name:** `unclecode/crawl4ai` -* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.7.0-r1`) +* **Tag Format:** `LIBRARY_VERSION[-SUFFIX]` (e.g., `0.7.3`) * `LIBRARY_VERSION`: The semantic version of the core `crawl4ai` Python library * `SUFFIX`: Optional tag for release candidates and revisions (e.g., `r1`) * **`latest` Tag:** Points to the most recent stable version
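The "first match wins" dispatch that the release notes describe can be modeled with the standard library alone. The sketch below is an illustrative model of the selection semantics only (wildcard lists are OR-ed, callables are invoked, a matcher-less config is the fallback) — it is not Crawl4AI's actual implementation, and `make_config`, `matches`, and `pick_config` are hypothetical helpers standing in for `CrawlerRunConfig` and `arun_many`'s internal routing:

```python
from fnmatch import fnmatch

# Hypothetical stand-in for CrawlerRunConfig: a labeled config record.
def make_config(name, url_matcher=None):
    return {"name": name, "url_matcher": url_matcher}

def matches(matcher, url):
    """A matcher may be a wildcard string, a list of wildcards (OR),
    or a callable; a config with no matcher matches every URL."""
    if matcher is None:
        return True
    if callable(matcher):
        return matcher(url)
    if isinstance(matcher, str):
        matcher = [matcher]
    return any(fnmatch(url, pattern) for pattern in matcher)

def pick_config(url, configs):
    """First match wins; a trailing matcher-less config acts as the fallback."""
    for config in configs:
        if matches(config["url_matcher"], url):
            return config
    return None

configs = [
    make_config("docs", ["*docs*", "*documentation*"]),
    make_config("news", lambda url: "blog" in url or "news" in url),
    make_config("default"),  # no matcher -> catches everything else
]

print(pick_config("https://docs.python.org/3/", configs)["name"])  # docs
print(pick_config("https://blog.python.org/", configs)["name"])    # news
print(pick_config("https://example.com/", configs)["name"])        # default
```

Because evaluation stops at the first matching config, ordering matters: put the most specific matchers first and the matcher-less fallback last, exactly as the release-note examples do.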