docs: update README for version 0.5.0 release with new features and CLI commands

2025-03-04 19:24:46 +08:00
parent 8a76563018
commit cbef406f9b
1 changed files with 43 additions and 19 deletions
--- a/README.md
+++ b/README.md
@@ -21,9 +21,9 @@

 Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.  

-[✨ Check out latest update v0.4.3bx](#-recent-updates)
+[✨ Check out latest update v0.5.0](#-recent-updates)

-🎉 **Version 0.4.3bx is out!** This release brings exciting new features like a Memory Dispatcher System, Streaming Support, LLM-Powered Markdown Generation, Schema Generation, and Robots.txt Compliance! [Read the release notes →](https://docs.crawl4ai.com/blog)
+🎉 **Version 0.5.0 is out!** This major release introduces Deep Crawling with BFS/DFS/BestFirst strategies, Memory-Adaptive Dispatcher, Multiple Crawling Strategies (Playwright and HTTP), Docker Deployment with FastAPI, Command-Line Interface (CLI), and more! [Read the release notes →](https://docs.crawl4ai.com/blog)

 <details>
 <summary>🤓 <strong>My Personal Story</strong></summary>
@@ -68,7 +68,7 @@ If you encounter any browser-related issues, you can install them manually:
 python -m playwright install --with-deps chromium
 ```

-2. Run a simple web crawl:
+2. Run a simple web crawl with Python:
 ```python
 import asyncio
 from crawl4ai import *
@@ -84,6 +84,18 @@ if __name__ == "__main__":
    asyncio.run(main())
 ```

+3. Or use the new command-line interface:
+```bash
+# Basic crawl with markdown output
+crwl https://www.nbcnews.com/business -o markdown
+
+# Deep crawl with BFS strategy, max 10 pages
+crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10
+
+# Use LLM extraction with a specific question
+crwl https://www.example.com/products -q "Extract all product prices"
+```
+
 ## ✨ Features 

 <details>
@@ -112,6 +124,7 @@ if __name__ == "__main__":

 - 🖥️ **Managed Browser**: Use user-owned browsers with full control, avoiding bot detection.
 - 🔄 **Remote Browser Control**: Connect to Chrome Developer Tools Protocol for remote, large-scale data extraction.
+- 👤 **Browser Profiler**: Create and manage persistent profiles with saved authentication states, cookies, and settings.
 - 🔒 **Session Management**: Preserve browser states and reuse them for multi-step crawling.
 - 🧩 **Proxy Support**: Seamlessly connect to proxies with authentication for secure access.
 - ⚙️ **Full Browser Control**: Modify headers, cookies, user agents, and more for tailored crawling setups.
@@ -140,10 +153,11 @@ if __name__ == "__main__":
 <details>
 <summary>🚀 <strong>Deployment</strong></summary>

- 🐳 **Dockerized Setup**: Optimized Docker image with API server for easy deployment.
+- 🐳 **Dockerized Setup**: Optimized Docker image with FastAPI server for easy deployment.
+- 🔑 **Secure Authentication**: Built-in JWT token authentication for API security.
 - 🔄 **API Gateway**: One-click deployment with secure token authentication for API-based workflows.
 - 🌐 **Scalable Architecture**: Designed for mass-scale production and optimized server performance.
- ⚙️ **DigitalOcean Deployment**: Ready-to-deploy configurations for DigitalOcean and similar platforms.
+- ☁️ **Cloud Deployment**: Ready-to-deploy configurations for major cloud platforms.

 </details>

@@ -486,21 +500,31 @@ async def test_news_crawl():

 ## ✨ Recent Updates

-   **🚀 New Dispatcher System**: Scale to thousands of URLs with intelligent **memory monitoring**, **concurrency control**, and optional **rate limiting**. (See `MemoryAdaptiveDispatcher`, `SemaphoreDispatcher`, `RateLimiter`, `CrawlerMonitor`)
-   **⚡ Streaming Mode**: Process results **as they arrive** instead of waiting for an entire batch to complete. (Set `stream=True` in `CrawlerRunConfig`)
-   **🤖 Enhanced LLM Integration**:
-    -   **Automatic schema generation**: Create extraction rules from HTML using OpenAI or Ollama, no manual CSS/XPath needed.
-    -   **LLM-powered Markdown filtering**: Refine your markdown output with a new `LLMContentFilter` that understands content relevance.
-    -   **Ollama Support**: Use open-source or self-hosted models for private or cost-effective extraction.
-   **🏎️ Faster Scraping Option**: New `LXMLWebScrapingStrategy` offers **10-20x speedup** for large, complex pages (experimental).
-   **🤖 robots.txt Compliance**: Respect website rules with `check_robots_txt=True` and efficient local caching.
-   **🔄 Proxy Rotation**: Built-in support for dynamic proxy switching and IP verification, with support for authenticated proxies and session persistence.
-   **➡️ URL Redirection Tracking**: The `redirected_url` field now captures the final destination after any redirects.
-   **🪞 Improved Mirroring**: The `LXMLWebScrapingStrategy` now has much greater fidelity, allowing for almost pixel-perfect mirroring of websites.
-   **📈 Enhanced Monitoring**: Track memory, CPU, and individual crawler status with `CrawlerMonitor`.
-   **📝 Improved Documentation**: More examples, clearer explanations, and updated tutorials.
+### Version 0.5.0 Major Release Highlights

-Read the full details in our [0.4.3bx Release Notes](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
+-   **🚀 Deep Crawling System**: Explore websites beyond initial URLs with three strategies:
+    -   **BFS Strategy**: Breadth-first search explores websites level by level
+    -   **DFS Strategy**: Depth-first search explores each branch deeply before backtracking
+    -   **BestFirst Strategy**: Uses scoring functions to prioritize which URLs to crawl next
+    -   **Page Limiting**: Control the maximum number of pages to crawl with `max_pages` parameter
+    -   **Score Thresholds**: Filter URLs based on relevance scores
+-   **⚡ Memory-Adaptive Dispatcher**: Dynamically adjusts concurrency based on system memory with built-in rate limiting
+-   **🔄 Multiple Crawling Strategies**:
+    -   **AsyncPlaywrightCrawlerStrategy**: Browser-based crawling with JavaScript support (Default)
+    -   **AsyncHTTPCrawlerStrategy**: Fast, lightweight HTTP-only crawler for simple tasks
+-   **🐳 Docker Deployment**: Easy deployment with FastAPI server and streaming/non-streaming endpoints
+-   **💻 Command-Line Interface**: New `crwl` CLI provides convenient terminal access to all features with intuitive commands and configuration options
+-   **👤 Browser Profiler**: Create and manage persistent browser profiles to save authentication states, cookies, and settings for seamless crawling of protected content
+-   **🧠 Crawl4AI Coding Assistant**: AI-powered coding assistant to answer your question for Crawl4ai, and generate proper code for crawling.
+-   **🏎️ LXML Scraping Mode**: Fast HTML parsing using the `lxml` library for improved performance
+-   **🌐 Proxy Rotation**: Built-in support for proxy switching with `RoundRobinProxyStrategy`
+-   **🤖 LLM Content Filter**: Intelligent markdown generation using LLMs
+-   **📄 PDF Processing**: Extract text, images, and metadata from PDF files
+-   **🔗 URL Redirection Tracking**: Automatically follow and record HTTP redirects
+-   **🤖 LLM Schema Generation**: Easily create extraction schemas with LLM assistance
+-   **🔍 robots.txt Compliance**: Respect website crawling rules
+
+Read the full details in our [0.5.0 Release Notes](https://docs.crawl4ai.com/blog/releases/0.5.0.html) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).

 ## Version Numbering in Crawl4AI