diff --git a/README.md b/README.md
index 09874b88..08dd97f1 100644
--- a/README.md
+++ b/README.md
@@ -21,9 +21,9 @@
Crawl4AI is the #1 trending GitHub repository, actively maintained by a vibrant community. It delivers blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Open source, flexible, and built for real-time performance, Crawl4AI empowers developers with unmatched speed, precision, and deployment ease.

-[✨ Check out latest update v0.4.3bx](#-recent-updates)
+[✨ Check out the latest update v0.5.0](#-recent-updates)

-🎉 **Version 0.4.3bx is out!** This release brings exciting new features like a Memory Dispatcher System, Streaming Support, LLM-Powered Markdown Generation, Schema Generation, and Robots.txt Compliance! [Read the release notes →](https://docs.crawl4ai.com/blog)
+🎉 **Version 0.5.0 is out!** This major release introduces Deep Crawling with BFS/DFS/BestFirst strategies, a Memory-Adaptive Dispatcher, multiple crawling strategies (Playwright and HTTP), Docker deployment with a FastAPI server, a command-line interface (CLI), and more! [Read the release notes →](https://docs.crawl4ai.com/blog)
🤓 My Personal Story

@@ -68,7 +68,7 @@ If you encounter any browser-related issues, you can install them manually:
python -m playwright install --with-deps chromium
```
-2. Run a simple web crawl:
+2. Run a simple web crawl with Python:
```python
import asyncio
from crawl4ai import *
@@ -84,6 +84,18 @@ if __name__ == "__main__":
    asyncio.run(main())
```
+3. Or use the new command-line interface:
+```bash
+# Basic crawl with markdown output
+crwl https://www.nbcnews.com/business -o markdown
+
+# Deep crawl with BFS strategy, max 10 pages
+crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10
+
+# Use LLM extraction with a specific question
+crwl https://www.example.com/products -q "Extract all product prices"
+```
+
## ✨ Features
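The same deep crawl can also be driven from Python. The sketch below mirrors the BFS CLI example above; the `BFSDeepCrawlStrategy` import path and the `deep_crawl_strategy` / `max_pages` parameters are assumptions based on the 0.5.0 deep-crawling docs, so verify them against the current API.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
# Import path is an assumption taken from the 0.5.0 deep-crawling documentation.
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def main():
    # Breadth-first deep crawl, capped at 10 pages (mirrors the CLI example above).
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,            # how many link levels to follow from the start URL
            include_external=False, # stay on the same domain
            max_pages=10,           # page limit, per the 0.5.0 release notes
        ),
    )
    async with AsyncWebCrawler() as crawler:
        # In non-streaming mode, a deep crawl returns a list of results.
        results = await crawler.arun("https://docs.crawl4ai.com", config=config)
        for result in results:
            print(result.url)

if __name__ == "__main__":
    asyncio.run(main())
```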
@@ -112,6 +124,7 @@ if __name__ == "__main__":
- 🖥️ **Managed Browser**: Use user-owned browsers with full control, avoiding bot detection.
- 🔄 **Remote Browser Control**: Connect to Chrome Developer Tools Protocol for remote, large-scale data extraction.
+- 👤 **Browser Profiler**: Create and manage persistent profiles with saved authentication states, cookies, and settings.
- 🔒 **Session Management**: Preserve browser states and reuse them for multi-step crawling.
- 🧩 **Proxy Support**: Seamlessly connect to proxies with authentication for secure access.
- ⚙️ **Full Browser Control**: Modify headers, cookies, user agents, and more for tailored crawling setups.
@@ -140,10 +153,11 @@ if __name__ == "__main__":
🚀 Deployment

-- 🐳 **Dockerized Setup**: Optimized Docker image with API server for easy deployment.
+- 🐳 **Dockerized Setup**: Optimized Docker image with FastAPI server for easy deployment.
+- 🔑 **Secure Authentication**: Built-in JWT token authentication for API security.
- 🔄 **API Gateway**: One-click deployment with secure token authentication for API-based workflows.
- 🌐 **Scalable Architecture**: Designed for mass-scale production and optimized server performance.
-- ⚙️ **DigitalOcean Deployment**: Ready-to-deploy configurations for DigitalOcean and similar platforms.
+- ☁️ **Cloud Deployment**: Ready-to-deploy configurations for major cloud platforms.
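For the Dockerized FastAPI server, a rough sketch of calling it from Python follows. The port, the `/crawl` endpoint, the payload shape, and the Bearer-token header are assumptions for illustration; check the Docker deployment docs for the exact contract.

```python
import requests

# All of these values are assumptions for illustration; adjust to your deployment.
BASE_URL = "http://localhost:11235"          # assumed default port of the Crawl4AI Docker image
payload = {"urls": ["https://example.com"]}  # assumed minimal request body

# If JWT authentication is enabled on the server, pass the token as a Bearer header.
headers = {"Authorization": "Bearer <your-jwt-token>"}

resp = requests.post(f"{BASE_URL}/crawl", json=payload, headers=headers, timeout=120)
resp.raise_for_status()
print(resp.json())
```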
@@ -486,21 +500,31 @@ async def test_news_crawl():
## ✨ Recent Updates

-- **🚀 New Dispatcher System**: Scale to thousands of URLs with intelligent **memory monitoring**, **concurrency control**, and optional **rate limiting**. (See `MemoryAdaptiveDispatcher`, `SemaphoreDispatcher`, `RateLimiter`, `CrawlerMonitor`)
-- **⚡ Streaming Mode**: Process results **as they arrive** instead of waiting for an entire batch to complete. (Set `stream=True` in `CrawlerRunConfig`)
-- **🤖 Enhanced LLM Integration**:
-  - **Automatic schema generation**: Create extraction rules from HTML using OpenAI or Ollama, no manual CSS/XPath needed.
-  - **LLM-powered Markdown filtering**: Refine your markdown output with a new `LLMContentFilter` that understands content relevance.
-  - **Ollama Support**: Use open-source or self-hosted models for private or cost-effective extraction.
-- **🏎️ Faster Scraping Option**: New `LXMLWebScrapingStrategy` offers **10-20x speedup** for large, complex pages (experimental).
-- **🤖 robots.txt Compliance**: Respect website rules with `check_robots_txt=True` and efficient local caching.
-- **🔄 Proxy Rotation**: Built-in support for dynamic proxy switching and IP verification, with support for authenticated proxies and session persistence.
-- **➡️ URL Redirection Tracking**: The `redirected_url` field now captures the final destination after any redirects.
-- **🪞 Improved Mirroring**: The `LXMLWebScrapingStrategy` now has much greater fidelity, allowing for almost pixel-perfect mirroring of websites.
-- **📈 Enhanced Monitoring**: Track memory, CPU, and individual crawler status with `CrawlerMonitor`.
-- **📝 Improved Documentation**: More examples, clearer explanations, and updated tutorials.
+### Version 0.5.0 Major Release Highlights
-Read the full details in our [0.4.3bx Release Notes](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).
+- **🚀 Deep Crawling System**: Explore websites beyond initial URLs with three strategies:
+  - **BFS Strategy**: Breadth-first search explores websites level by level
+  - **DFS Strategy**: Depth-first search explores each branch deeply before backtracking
+  - **BestFirst Strategy**: Uses scoring functions to prioritize which URLs to crawl next
+  - **Page Limiting**: Control the maximum number of pages to crawl with the `max_pages` parameter
+  - **Score Thresholds**: Filter URLs based on relevance scores
+- **⚡ Memory-Adaptive Dispatcher**: Dynamically adjusts concurrency based on system memory, with built-in rate limiting (see the sketch after this list)
+- **🔄 Multiple Crawling Strategies**:
+  - **AsyncPlaywrightCrawlerStrategy**: Browser-based crawling with JavaScript support (default)
+  - **AsyncHTTPCrawlerStrategy**: Fast, lightweight HTTP-only crawler for simple tasks
+- **🐳 Docker Deployment**: Easy deployment with a FastAPI server and streaming/non-streaming endpoints
+- **💻 Command-Line Interface**: The new `crwl` CLI provides convenient terminal access to all features with intuitive commands and configuration options
+- **👤 Browser Profiler**: Create and manage persistent browser profiles to save authentication states, cookies, and settings for seamless crawling of protected content
+- **🧠 Crawl4AI Coding Assistant**: AI-powered assistant that answers your Crawl4AI questions and generates working crawling code
+- **🏎️ LXML Scraping Mode**: Fast HTML parsing using the `lxml` library for improved performance
+- **🌐 Proxy Rotation**: Built-in support for proxy switching with `RoundRobinProxyStrategy`
+- **🤖 LLM Content Filter**: Intelligent markdown generation using LLMs
+- **📄 PDF Processing**: Extract text, images, and metadata from PDF files
+- **🔗 URL Redirection Tracking**: Automatically follow and record HTTP redirects
+- **🤖 LLM Schema Generation**: Easily create extraction schemas with LLM assistance
+- **🔍 robots.txt Compliance**: Respect website crawling rules
+
+Read the full details in our [0.5.0 Release Notes](https://docs.crawl4ai.com/blog/releases/0.5.0.html) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).

## Version Numbering in Crawl4AI
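To ground the Memory-Adaptive Dispatcher item in the 0.5.0 highlights above, here is a minimal sketch. The `MemoryAdaptiveDispatcher` import path and the `memory_threshold_percent` / `max_session_permit` parameters are assumptions drawn from the multi-URL crawling docs; verify them against the current API before relying on them.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
# Import path and parameter names are assumptions based on the 0.5.0 docs.
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher

async def main():
    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=80.0,  # back off when system memory use crosses this level
        max_session_permit=10,          # upper bound on concurrent crawler sessions
    )
    urls = ["https://example.com/page1", "https://example.com/page2"]
    async with AsyncWebCrawler() as crawler:
        # The dispatcher is passed to arun_many, which fans the URLs out concurrently.
        results = await crawler.arun_many(urls, config=CrawlerRunConfig(), dispatcher=dispatcher)
        for r in results:
            print(r.url, r.success)

if __name__ == "__main__":
    asyncio.run(main())
```

The idea is that the dispatcher throttles new sessions as memory pressure rises, so large batches degrade gracefully instead of exhausting RAM.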