docs: update README for version 0.5.0 release with new features and CLI commands

2025-03-04 19:24:46 +08:00
parent 8a76563018
commit cbef406f9b
1 changed file with 43 additions and 19 deletions
--- a/README.md
+++ b/README.md
@@ -21,9 +21,9 @@
 Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.  
-[✨ Check out latest update v0.4.3bx](#-recent-updates)
+[✨ Check out latest update v0.5.0](#-recent-updates)
-🎉 **Version 0.4.3bx is out!** This release brings exciting new features like a Memory Dispatcher System, Streaming Support, LLM-Powered Markdown Generation, Schema Generation, and Robots.txt Compliance! [Read the release notes →](https://docs.crawl4ai.com/blog)
+🎉 **Version 0.5.0 is out!** This major release introduces Deep Crawling with BFS/DFS/BestFirst strategies, Memory-Adaptive Dispatcher, Multiple Crawling Strategies (Playwright and HTTP), Docker Deployment with FastAPI, Command-Line Interface (CLI), and more! [Read the release notes →](https://docs.crawl4ai.com/blog)
 <details>
 <summary>🤓 <strong>My Personal Story</strong></summary>
@@ -68,7 +68,7 @@ If you encounter any browser-related issues, you can install them manually:
 python -m playwright install --with-deps chromium
 ```
-2. Run a simple web crawl:
+2. Run a simple web crawl with Python:
 ```python
 import asyncio
 from crawl4ai import *
@@ -84,6 +84,18 @@ if __name__ == "__main__":
    asyncio.run(main())
 ```
 3. Or use the new command-line interface:
 ```bash
 # Basic crawl with markdown output
 crwl https://www.nbcnews.com/business -o markdown
 # Deep crawl with BFS strategy, max 10 pages
 crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10
 # Use LLM extraction with a specific question
 crwl https://www.example.com/products -q "Extract all product prices"
 ```
 ## ✨ Features 
 <details>
@@ -112,6 +124,7 @@ if __name__ == "__main__":
 - 🖥️ **Managed Browser**: Use user-owned browsers with full control, avoiding bot detection.
 - 🔄 **Remote Browser Control**: Connect to Chrome Developer Tools Protocol for remote, large-scale data extraction.
 - 👤 **Browser Profiler**: Create and manage persistent profiles with saved authentication states, cookies, and settings.
 - 🔒 **Session Management**: Preserve browser states and reuse them for multi-step crawling.
 - 🧩 **Proxy Support**: Seamlessly connect to proxies with authentication for secure access.
 - ⚙️ **Full Browser Control**: Modify headers, cookies, user agents, and more for tailored crawling setups.
@@ -140,10 +153,11 @@ if __name__ == "__main__":
 <details>
 <summary>🚀 <strong>Deployment</strong></summary>
- 🐳 **Dockerized Setup**: Optimized Docker image with API server for easy deployment.
+- 🐳 **Dockerized Setup**: Optimized Docker image with FastAPI server for easy deployment.
 - 🔑 **Secure Authentication**: Built-in JWT token authentication for API security.
 - 🔄 **API Gateway**: One-click deployment with secure token authentication for API-based workflows.
 - 🌐 **Scalable Architecture**: Designed for mass-scale production and optimized server performance.
- ⚙️ **DigitalOcean Deployment**: Ready-to-deploy configurations for DigitalOcean and similar platforms.
+- ☁️ **Cloud Deployment**: Ready-to-deploy configurations for major cloud platforms.
 </details>
@@ -486,21 +500,31 @@ async def test_news_crawl():
 ## ✨ Recent Updates
-   **🚀 New Dispatcher System**: Scale to thousands of URLs with intelligent **memory monitoring**, **concurrency control**, and optional **rate limiting**. (See `MemoryAdaptiveDispatcher`, `SemaphoreDispatcher`, `RateLimiter`, `CrawlerMonitor`)
+### Version 0.5.0 Major Release Highlights
 -   **⚡ Streaming Mode**: Process results **as they arrive** instead of waiting for an entire batch to complete. (Set `stream=True` in `CrawlerRunConfig`)
 -   **🤖 Enhanced LLM Integration**:
    -   **Automatic schema generation**: Create extraction rules from HTML using OpenAI or Ollama, no manual CSS/XPath needed.
    -   **LLM-powered Markdown filtering**: Refine your markdown output with a new `LLMContentFilter` that understands content relevance.
    -   **Ollama Support**: Use open-source or self-hosted models for private or cost-effective extraction.
 -   **🏎️ Faster Scraping Option**: New `LXMLWebScrapingStrategy` offers **10-20x speedup** for large, complex pages (experimental).
 -   **🤖 robots.txt Compliance**: Respect website rules with `check_robots_txt=True` and efficient local caching.
 -   **🔄 Proxy Rotation**: Built-in support for dynamic proxy switching and IP verification, with support for authenticated proxies and session persistence.
 -   **➡️ URL Redirection Tracking**: The `redirected_url` field now captures the final destination after any redirects.
 -   **🪞 Improved Mirroring**: The `LXMLWebScrapingStrategy` now has much greater fidelity, allowing for almost pixel-perfect mirroring of websites.
 -   **📈 Enhanced Monitoring**: Track memory, CPU, and individual crawler status with `CrawlerMonitor`.
 -   **📝 Improved Documentation**: More examples, clearer explanations, and updated tutorials.
-Read the full details in our [0.4.3bx Release Notes](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
+-   **🚀 Deep Crawling System**: Explore websites beyond initial URLs with three strategies:
    -   **BFS Strategy**: Breadth-first search explores websites level by level
    -   **DFS Strategy**: Depth-first search explores each branch deeply before backtracking
    -   **BestFirst Strategy**: Uses scoring functions to prioritize which URLs to crawl next
    -   **Page Limiting**: Control the maximum number of pages to crawl with the `max_pages` parameter
    -   **Score Thresholds**: Filter URLs based on relevance scores
 -   **⚡ Memory-Adaptive Dispatcher**: Dynamically adjusts concurrency based on system memory with built-in rate limiting
 -   **🔄 Multiple Crawling Strategies**:
    -   **AsyncPlaywrightCrawlerStrategy**: Browser-based crawling with JavaScript support (Default)
    -   **AsyncHTTPCrawlerStrategy**: Fast, lightweight HTTP-only crawler for simple tasks
 -   **🐳 Docker Deployment**: Easy deployment with FastAPI server and streaming/non-streaming endpoints
 -   **💻 Command-Line Interface**: New `crwl` CLI provides convenient terminal access to all features with intuitive commands and configuration options
 -   **👤 Browser Profiler**: Create and manage persistent browser profiles to save authentication states, cookies, and settings for seamless crawling of protected content
 -   **🧠 Crawl4AI Coding Assistant**: AI-powered coding assistant that answers your questions about Crawl4AI and generates working crawling code.
 -   **🏎️ LXML Scraping Mode**: Fast HTML parsing using the `lxml` library for improved performance
 -   **🌐 Proxy Rotation**: Built-in support for proxy switching with `RoundRobinProxyStrategy`
 -   **🤖 LLM Content Filter**: Intelligent markdown generation using LLMs
 -   **📄 PDF Processing**: Extract text, images, and metadata from PDF files
 -   **🔗 URL Redirection Tracking**: Automatically follow and record HTTP redirects
 -   **🤖 LLM Schema Generation**: Easily create extraction schemas with LLM assistance
 -   **🔍 robots.txt Compliance**: Respect website crawling rules
 Read the full details in our [0.5.0 Release Notes](https://docs.crawl4ai.com/blog/releases/0.5.0.html) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
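To make the difference between the BFS and BestFirst strategies concrete, here is a minimal, library-independent sketch of the traversal order and of page limiting. It does not use Crawl4AI's API: the `LINKS` graph, the `get_links` callback, the `docs_score` function, and both crawl functions are hypothetical names chosen for illustration, and only the `max_pages` idea comes from the notes above.

```python
import heapq
from collections import deque

# Hypothetical in-memory link graph standing in for real pages (assumption).
LINKS = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/install", "/docs/api"],
    "/blog": ["/blog/0.5.0", "/docs"],
    "/docs/install": [],
    "/docs/api": ["/"],
    "/blog/0.5.0": [],
}


def get_links(url):
    """Return the outgoing links of a page (here: a dictionary lookup)."""
    return LINKS.get(url, [])


def bfs_crawl(start, get_links, max_pages=10):
    """Visit pages level by level, stopping after max_pages pages."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


def best_first_crawl(start, get_links, score, max_pages=10):
    """Always visit the highest-scoring discovered URL next."""
    seen = {start}
    counter = 0  # tie-breaker so equal scores keep discovery order
    heap = [(-score(start), counter, start)]
    order = []
    while heap and len(order) < max_pages:
        _, _, url = heapq.heappop(heap)
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                counter += 1
                heapq.heappush(heap, (-score(link), counter, link))
    return order


def docs_score(url):
    """Hypothetical relevance score: prefer documentation pages."""
    return 1 if "docs" in url else 0


if __name__ == "__main__":
    print(bfs_crawl("/", get_links, max_pages=4))
    print(best_first_crawl("/", get_links, docs_score, max_pages=4))
```

With the same four-page budget, the breadth-first version spends one page on `/blog`, while the scored version stays inside `/docs`. A depth-first variant would replace the queue with a stack.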
 ## Version Numbering in Crawl4AI