From 88697c46305dfe787883cf59452592a30265b4d4 Mon Sep 17 00:00:00 2001
From: UncleCode
Date: Tue, 21 Jan 2025 21:20:04 +0800
Subject: [PATCH] docs(readme): update version and feature announcements for
 v0.4.3b1

Update README.md to announce the version 0.4.3b1 release with new features
including:

- Memory Dispatcher System
- Streaming Support
- LLM-Powered Markdown Generation
- Schema Generation
- Robots.txt Compliance

Add a detailed version-numbering explanation section to help users
understand pre-release versions.
---
 README.md | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 59 insertions(+), 11 deletions(-)

diff --git a/README.md b/README.md
index 68cc10a2..66a652ff 100644
--- a/README.md
+++ b/README.md
@@ -20,9 +20,9 @@
 Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.
 
-[✨ Check out latest update v0.4.24x](#-recent-updates)
+[✨ Check out latest update v0.4.3b1](#-recent-updates)
 
-🎉 **Version 0.4.24x is out!** Major improvements in extraction strategies with enhanced JSON handling, SSL security, and Amazon product extraction. Plus, a completely revamped content filtering system! [Read the release notes →](https://docs.crawl4ai.com/blog)
+🎉 **Version 0.4.3b1 is out!** This release brings exciting new features like a Memory Dispatcher System, Streaming Support, LLM-Powered Markdown Generation, Schema Generation, and Robots.txt Compliance! [Read the release notes →](https://docs.crawl4ai.com/blog)
 🤓 My Personal Story
 
@@ -481,18 +481,66 @@ async def test_news_crawl():
+## ✨ Recent Updates
-## ✨ Recent Updates
+
+- **🚀 New Dispatcher System**: Scale to thousands of URLs with intelligent **memory monitoring**, **concurrency control**, and optional **rate limiting**. (See `MemoryAdaptiveDispatcher`, `SemaphoreDispatcher`, `RateLimiter`, `CrawlerMonitor`)
+- **⚡ Streaming Mode**: Process results **as they arrive** instead of waiting for an entire batch to complete. (Set `stream=True` in `CrawlerRunConfig`)
+- **🤖 Enhanced LLM Integration**:
+  - **Automatic schema generation**: Create extraction rules from HTML using OpenAI or Ollama, no manual CSS/XPath needed.
+  - **LLM-powered Markdown filtering**: Refine your markdown output with a new `LLMContentFilter` that understands content relevance.
+  - **Ollama Support**: Use open-source or self-hosted models for private or cost-effective extraction.
+- **🏎️ Faster Scraping Option**: The new `LXMLWebScrapingStrategy` offers a **10-20x speedup** for large, complex pages (experimental).
+- **🤖 robots.txt Compliance**: Respect website rules with `check_robots_txt=True` and efficient local caching.
+- **➡️ URL Redirection Tracking**: The `final_url` field now captures the final destination after any redirects.
+- **🪞 Improved Mirroring**: The `LXMLWebScrapingStrategy` now has much greater fidelity, allowing for almost pixel-perfect mirroring of websites.
+- **📈 Enhanced Monitoring**: Track memory, CPU, and individual crawler status with `CrawlerMonitor`.
+- **📝 Improved Documentation**: More examples, clearer explanations, and updated tutorials.
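The difference streaming mode makes can be pictured with plain asyncio: results are yielded in completion order rather than after the whole batch finishes. This is a minimal sketch of the pattern only — `crawl` and `crawl_stream` are hypothetical stand-ins for illustration, not the Crawl4AI API:

```python
import asyncio

# Simulated per-URL crawl; the sleep stands in for network latency.
async def crawl(url: str, delay: float) -> str:
    await asyncio.sleep(delay)
    return f"markdown for {url}"

# Streaming dispatch: yield each result the moment its task finishes,
# instead of collecting the whole batch before returning anything.
async def crawl_stream(jobs):
    tasks = [asyncio.create_task(crawl(url, delay)) for url, delay in jobs]
    for finished in asyncio.as_completed(tasks):
        yield await finished

async def main():
    jobs = [("https://slow.example", 0.05), ("https://fast.example", 0.01)]
    results = []
    async for markdown in crawl_stream(jobs):
        results.append(markdown)  # the fast URL arrives first
    return results

print(asyncio.run(main()))
# ['markdown for https://fast.example', 'markdown for https://slow.example']
```

In the batch (non-streaming) equivalent, the fast URL's result would sit in memory until the slow one finished; with streaming, downstream processing starts as soon as the first page is ready.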
-- 🔒 **Enhanced SSL & Security**: New SSL certificate handling with custom paths and validation options for secure crawling
-- 🔍 **Smart Content Filtering**: Advanced filtering system with regex support and efficient chunking strategies
-- 📦 **Improved JSON Extraction**: Support for complex JSONPath, JSON-CSS, and Microdata extraction
-- 🏗️ **New Field Types**: Added `computed`, `conditional`, `aggregate`, and `template` field types
-- ⚡ **Performance Boost**: Optimized caching, parallel processing, and memory management
-- 🐛 **Better Error Handling**: Enhanced debugging capabilities with detailed error tracking
-- 🔐 **Security Features**: Improved input validation and safe expression evaluation
+Read the full details in our [0.4.3b1 Release Notes](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
-Read the full details of this release in our [0.4.24 Release Notes](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
+
+## Version Numbering in Crawl4AI
+
+Crawl4AI follows standard Python version numbering conventions (PEP 440) to help users understand the stability and features of each release.
+
+### Version Numbers Explained
+
+Our version numbers follow the pattern `MAJOR.MINOR.PATCH` (e.g., 0.4.3).
+
+#### Pre-release Versions
+We use different suffixes to indicate development stages:
+
+- `dev` (0.4.3dev1): Development versions, unstable
+- `a` (0.4.3a1): Alpha releases, experimental features
+- `b` (0.4.3b1): Beta releases, feature complete but needing testing
+- `rc` (0.4.3rc1): Release candidates, the potential final version
+
+#### Installation
+- Regular installation (stable version):
+  ```bash
+  pip install -U crawl4ai
+  ```
+
+- Install pre-release versions:
+  ```bash
+  pip install crawl4ai --pre
+  ```
+
+- Install a specific version:
+  ```bash
+  pip install crawl4ai==0.4.3b1
+  ```
+
+#### Why Pre-releases?
+We use pre-releases to:
+- Test new features in real-world scenarios
+- Gather feedback before final releases
+- Ensure stability for production users
+- Allow early adopters to try new features
+
+For production environments, we recommend using the stable version. For testing new features, you can opt in to pre-releases using the `--pre` flag.
 
 ## 📖 Documentation & Roadmap
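A note on the pre-release scheme described earlier: PEP 440 orders every pre-release suffix before the final release, which is exactly why a plain `pip install crawl4ai` skips `0.4.3b1` unless `--pre` is given. A small sketch of that ordering, assuming the third-party `packaging` library (the same version parser pip uses) is available:

```python
from packaging.version import Version  # third-party 'packaging', installed alongside pip

# PEP 440 ordering: dev < alpha < beta < release candidate < final release.
releases = ["0.4.3", "0.4.3b1", "0.4.3dev1", "0.4.3a1", "0.4.3rc1"]
ordered = sorted(releases, key=Version)
print(ordered)
# ['0.4.3dev1', '0.4.3a1', '0.4.3b1', '0.4.3rc1', '0.4.3']
```

Because `Version("0.4.3b1") < Version("0.4.3")`, pinning `crawl4ai==0.4.3b1` is the only way to get that exact beta once the final 0.4.3 ships.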