# Crawl4AI v0.5.0 Release Notes

**Release Theme: Power, Flexibility, and Scalability**

Crawl4AI v0.5.0 is a major release focused on significantly enhancing the library's power, flexibility, and scalability. Key improvements include a new **deep crawling** system, a **memory-adaptive dispatcher** for handling large-scale crawls, **multiple crawling strategies** (including a fast HTTP-only crawler), **Docker** deployment options, and a powerful **command-line interface (CLI)**. This release also includes numerous bug fixes, performance optimizations, and documentation updates.

**Important Note:** This release contains several **breaking changes**. Please review the "Breaking Changes" sections carefully and update your code accordingly.

## Key Features

### 1. Deep Crawling

Crawl4AI now supports deep crawling, allowing you to explore websites beyond the initial URLs. This is controlled by the `deep_crawl_strategy` parameter in `CrawlerRunConfig`. Several strategies are available:

* **`BFSDeepCrawlStrategy` (Breadth-First Search):** Explores the website level by level. (Default)
* **`DFSDeepCrawlStrategy` (Depth-First Search):** Explores each branch as deeply as possible before backtracking.
* **`BestFirstCrawlingStrategy`:** Uses a scoring function to prioritize which URLs to crawl next.

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, BFSDeepCrawlStrategy
from crawl4ai.deep_crawling import DomainFilter, ContentTypeFilter, FilterChain

# Configure a deep crawl with BFS, limiting to a specific domain and content type.
filter_chain = FilterChain(
    filters=[
        DomainFilter(allowed_domains=["example.com"]),
        ContentTypeFilter(allowed_types=["text/html"]),
    ]
)

deep_crawl_config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=5, filter_chain=filter_chain),
    stream=True,  # Process results as they arrive
)

async with AsyncWebCrawler() as crawler:
    async for result in await crawler.arun(url="https://example.com", config=deep_crawl_config):
        print(f"Crawled: {result.url} (Depth: {result.metadata['depth']})")
```

**Breaking Change:** The `max_depth` parameter is now part of `CrawlerRunConfig` and controls the *depth* of the crawl, not the number of concurrent crawls. The `arun()` and `arun_many()` methods are now decorated to handle deep crawling strategies, and imports for deep crawling strategies have changed. See the [Deep Crawling documentation](../deep_crawling/README.md) for more details.

### 2. Memory-Adaptive Dispatcher

The new `MemoryAdaptiveDispatcher` dynamically adjusts concurrency based on available system memory and includes built-in rate limiting. This prevents out-of-memory errors and avoids overwhelming target websites.

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, MemoryAdaptiveDispatcher

# Configure the dispatcher (optional; defaults are used if not provided)
dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=80.0,  # Pause if memory usage exceeds 80%
    check_interval=0.5,             # Check memory every 0.5 seconds
)

urls = ["https://example.com/1", "https://example.com/2"]

async with AsyncWebCrawler() as crawler:
    # Batch mode: collect all results at once
    results = await crawler.arun_many(
        urls=urls,
        config=CrawlerRunConfig(stream=False),
        dispatcher=dispatcher,
    )

    # OR, for streaming:
    async for result in await crawler.arun_many(
        urls=urls,
        config=CrawlerRunConfig(stream=True),
        dispatcher=dispatcher,
    ):
        print(f"Finished: {result.url}")
```

**Breaking Change:** `AsyncWebCrawler.arun_many()` now uses `MemoryAdaptiveDispatcher` by default. Existing code that relied on unbounded concurrency may require adjustments.
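The dispatcher's core admission rule can be sketched in plain Python. This is a simplified illustration of the idea, not the library's implementation; `should_admit_task` is a hypothetical helper standing in for the threshold check `MemoryAdaptiveDispatcher` performs before launching more work.

```python
# Simplified sketch of memory-adaptive admission control.
# NOT the library's implementation -- `should_admit_task` is a hypothetical
# helper illustrating the threshold check the dispatcher applies.
def should_admit_task(memory_percent: float, threshold_percent: float = 80.0) -> bool:
    """Admit a new crawl task only while memory usage is below the threshold."""
    return memory_percent < threshold_percent

# With the default 80% threshold, tasks run at 70% usage but pause at or above 80%.
decisions = [should_admit_task(p) for p in (70.0, 79.9, 80.0, 92.5)]
# -> [True, True, False, False]
```

In the real dispatcher this check is repeated every `check_interval` seconds, so concurrency rises and falls with actual memory pressure instead of being a fixed pool size.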
### 3. Multiple Crawling Strategies (Playwright and HTTP)

Crawl4AI now offers two crawling strategies:

* **`AsyncPlaywrightCrawlerStrategy` (Default):** Uses Playwright for browser-based crawling, supporting JavaScript rendering and complex interactions.
* **`AsyncHTTPCrawlerStrategy`:** A lightweight, fast, and memory-efficient HTTP-only crawler. Ideal for simple scraping tasks where browser rendering is unnecessary.

```python
from crawl4ai import AsyncWebCrawler, HTTPCrawlerConfig
from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy

# Use the HTTP crawler strategy
http_crawler_config = HTTPCrawlerConfig(
    method="GET",
    headers={"User-Agent": "MyCustomBot/1.0"},
    follow_redirects=True,
    verify_ssl=True,
)

async with AsyncWebCrawler(
    crawler_strategy=AsyncHTTPCrawlerStrategy(browser_config=http_crawler_config)
) as crawler:
    result = await crawler.arun("https://example.com")
    print(f"Status code: {result.status_code}")
    print(f"Content length: {len(result.html)}")
```

### 4. Docker Deployment

Crawl4AI can now be easily deployed as a Docker container, providing a consistent and isolated environment. The Docker image includes a FastAPI server with both streaming and non-streaming endpoints.

```bash
# Build the image (from the project root)
docker build -t crawl4ai .

# Run the container
docker run -d -p 8000:8000 --name crawl4ai crawl4ai
```

**API Endpoints:**

* `/crawl` (POST): Non-streaming crawl.
* `/crawl/stream` (POST): Streaming crawl (NDJSON).
* `/health` (GET): Health check.
* `/schema` (GET): Returns configuration schemas.
* `/md/{url}` (GET): Returns markdown content of the URL.
* `/llm/{url}` (GET): Returns LLM-extracted content.
* `/token` (POST): Get a JWT token.

**Breaking Changes:**

* Docker deployment now requires a `.llm.env` file for API keys.
* Docker deployment now requires Redis and a new `config.yml` structure.
* Server startup now uses `supervisord` instead of direct process management.
* The Docker server now requires authentication by default (JWT tokens).

See the [Docker deployment documentation](../deploy/docker/README.md) for detailed instructions.

### 5. Command-Line Interface (CLI)

A new CLI (`crwl`) provides convenient access to Crawl4AI's functionality from the terminal.

```bash
# Basic crawl
crwl https://example.com

# Get markdown output
crwl https://example.com -o markdown

# Use a configuration file
crwl https://example.com -B browser.yml -C crawler.yml

# Use LLM-based extraction
crwl https://example.com -e extract.yml -s schema.json

# Ask a question about the crawled content
crwl https://example.com -q "What is the main topic?"

# See usage examples
crwl --example
```

See the [CLI documentation](../docs/md_v2/core/cli.md) for more details.

### 6. LXML Scraping Mode

Added `LXMLWebScrapingStrategy` for faster HTML parsing using the `lxml` library. This can significantly improve scraping performance, especially for large or complex pages. Set `scraping_strategy=LXMLWebScrapingStrategy()` in your `CrawlerRunConfig`.

**Breaking Change:** The `ScrapingMode` enum has been replaced with a strategy pattern. Use `WebScrapingStrategy` (the default) or `LXMLWebScrapingStrategy`.

### 7. Proxy Rotation

Added a `ProxyRotationStrategy` abstract base class with a concrete `RoundRobinProxyStrategy` implementation.

```python
from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CrawlerRunConfig,
    CacheMode,
    RoundRobinProxyStrategy,
)

# Load proxies and create a rotation strategy.
# (load_proxies_from_env is assumed to read proxy URLs from the PROXIES env variable.)
proxies = load_proxies_from_env()
if not proxies:
    raise SystemExit("No proxies found in environment. Set PROXIES env variable!")
proxy_strategy = RoundRobinProxyStrategy(proxies)

# Create configs
browser_config = BrowserConfig(headless=True, verbose=False)
run_config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    proxy_rotation_strategy=proxy_strategy,
)
```

## Other Changes and Improvements

* **Added: `LLMContentFilter` for intelligent markdown generation.** This new filter uses an LLM to create more focused and relevant markdown output.

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import LLMContentFilter
from crawl4ai.async_configs import LlmConfig

llm_config = LlmConfig(provider="openai/gpt-4o", api_token="YOUR_API_KEY")

markdown_generator = DefaultMarkdownGenerator(
    content_filter=LLMContentFilter(llmConfig=llm_config, instruction="Extract key concepts and summaries")
)

config = CrawlerRunConfig(markdown_generator=markdown_generator)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://example.com/article", config=config)
    print(result.markdown)  # Output will be filtered and formatted by the LLM
```

* **Added: URL redirection tracking.** The crawler now automatically follows HTTP redirects (301, 302, 307, 308) and records the final URL in the `redirected_url` field of the `CrawlResult` object. No code changes are required to enable this; it's automatic.

* **Added: LLM-powered schema generation utility.** A new `generate_schema` method has been added to `JsonCssExtractionStrategy` and `JsonXPathExtractionStrategy`. This greatly simplifies creating extraction schemas.

```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.async_configs import LlmConfig

llm_config = LlmConfig(provider="openai/gpt-4o", api_token="YOUR_API_KEY")
schema = JsonCssExtractionStrategy.generate_schema(
    html="