# Crawl4AI v0.5.0 Release Notes

**Release Theme: Power, Flexibility, and Scalability**

Crawl4AI v0.5.0 is a major release focused on significantly enhancing the library's power, flexibility, and scalability. Key improvements include a new **deep crawling** system, a **memory-adaptive dispatcher** for handling large-scale crawls, **multiple crawling strategies** (including a fast HTTP-only crawler), **Docker** deployment options, and a powerful **command-line interface (CLI)**. This release also includes numerous bug fixes, performance optimizations, and documentation updates.

**Important Note:** This release contains several **breaking changes**. Please review the "Breaking Change" notes below carefully and update your code accordingly.

## Key Features

### 1. Deep Crawling

Crawl4AI now supports deep crawling, allowing you to explore websites beyond the initial URLs. Deep crawling is controlled by the `deep_crawl_strategy` parameter in `CrawlerRunConfig`. Several strategies are available:

- **`BFSDeepCrawlStrategy` (Breadth-First Search):** Explores the website level by level. (Default)
- **`DFSDeepCrawlStrategy` (Depth-First Search):** Explores each branch as deeply as possible before backtracking.
- **`BestFirstCrawlingStrategy`:** Uses a scoring function to prioritize which URLs to crawl next.

```python
import time
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, BFSDeepCrawlStrategy
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
from crawl4ai.deep_crawling import (
    DomainFilter,
    ContentTypeFilter,
    FilterChain,
    URLPatternFilter,
    KeywordRelevanceScorer,
    BestFirstCrawlingStrategy,
)

# Create a filter chain to filter URLs based on patterns, domains, and content type
filter_chain = FilterChain(
    [
        DomainFilter(
            allowed_domains=["docs.crawl4ai.com"],
            blocked_domains=["old.docs.crawl4ai.com"],
        ),
        URLPatternFilter(patterns=["*core*", "*advanced*"]),
        ContentTypeFilter(allowed_types=["text/html"]),
    ]
)

# Create a keyword scorer that prioritizes pages containing certain keywords
keyword_scorer = KeywordRelevanceScorer(
    keywords=["crawl", "example", "async", "configuration"],
    weight=0.7,
)

# Set up the configuration
deep_crawl_config = CrawlerRunConfig(
    deep_crawl_strategy=BestFirstCrawlingStrategy(
        max_depth=2,
        include_external=False,
        filter_chain=filter_chain,
        url_scorer=keyword_scorer,
    ),
    scraping_strategy=LXMLWebScrapingStrategy(),
    stream=True,
    verbose=True,
)

async def main():
    async with AsyncWebCrawler() as crawler:
        start_time = time.perf_counter()
        results = []
        async for result in await crawler.arun(url="https://docs.crawl4ai.com", config=deep_crawl_config):
            print(f"Crawled: {result.url} (Depth: {result.metadata['depth']}), score: {result.metadata['score']:.2f}")
            results.append(result)
        duration = time.perf_counter() - start_time
        print(f"\n✅ Crawled {len(results)} high-value pages in {duration:.2f} seconds")

asyncio.run(main())
```

**Breaking Change:** The `max_depth` parameter is now part of `CrawlerRunConfig` and controls the _depth_ of the crawl, not the number of concurrent crawls. The `arun()` and `arun_many()` methods are now decorated to handle deep crawling strategies, and the imports for deep crawling strategies have changed. See the [Deep Crawling documentation](../../core/deep-crawling.md) for more details.
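If you only need the default breadth-first behavior without scoring or filtering, a much smaller configuration is enough. The following is a minimal sketch that reuses the `max_depth` and `include_external` options and the streaming `arun()` pattern from the example above; it assumes `BFSDeepCrawlStrategy` accepts the same depth options as `BestFirstCrawlingStrategy`.

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, BFSDeepCrawlStrategy

# Minimal BFS deep crawl: follow internal links up to two levels deep.
bfs_config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=2, include_external=False),
    stream=True,
)

async def main():
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun(url="https://docs.crawl4ai.com", config=bfs_config):
            print(f"Crawled: {result.url} (depth: {result.metadata['depth']})")

asyncio.run(main())
```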
### 2. Memory-Adaptive Dispatcher

The new `MemoryAdaptiveDispatcher` dynamically adjusts concurrency based on available system memory and includes built-in rate limiting. This prevents out-of-memory errors and avoids overwhelming target websites.

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, MemoryAdaptiveDispatcher
import asyncio

# Configure the dispatcher (optional, defaults are used if not provided)
dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=80.0,  # Pause if memory usage exceeds 80%
    check_interval=0.5,  # Check memory every 0.5 seconds
)

async def batch_mode():
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=["https://docs.crawl4ai.com", "https://github.com/unclecode/crawl4ai"],
            config=CrawlerRunConfig(stream=False),  # Batch mode
            dispatcher=dispatcher,
        )
        for result in results:
            print(f"Crawled: {result.url} with status code: {result.status_code}")

async def stream_mode():
    async with AsyncWebCrawler() as crawler:
        # OR, for streaming:
        async for result in await crawler.arun_many(
            urls=["https://docs.crawl4ai.com", "https://github.com/unclecode/crawl4ai"],
            config=CrawlerRunConfig(stream=True),
            dispatcher=dispatcher,
        ):
            print(f"Crawled: {result.url} with status code: {result.status_code}")

print("Dispatcher in batch mode:")
asyncio.run(batch_mode())
print("-" * 50)
print("Dispatcher in stream mode:")
asyncio.run(stream_mode())
```

**Breaking Change:** `AsyncWebCrawler.arun_many()` now uses `MemoryAdaptiveDispatcher` by default. Existing code that relied on unbounded concurrency may require adjustments.

### 3. Multiple Crawling Strategies (Playwright and HTTP)

Crawl4AI now offers two crawling strategies:

- **`AsyncPlaywrightCrawlerStrategy` (Default):** Uses Playwright for browser-based crawling, supporting JavaScript rendering and complex interactions.
- **`AsyncHTTPCrawlerStrategy`:** A lightweight, fast, and memory-efficient HTTP-only crawler. Ideal for simple scraping tasks where browser rendering is unnecessary.

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, HTTPCrawlerConfig
from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy
import asyncio

# Use the HTTP crawler strategy
http_crawler_config = HTTPCrawlerConfig(
    method="GET",
    headers={"User-Agent": "MyCustomBot/1.0"},
    follow_redirects=True,
    verify_ssl=True,
)

async def main():
    async with AsyncWebCrawler(crawler_strategy=AsyncHTTPCrawlerStrategy(browser_config=http_crawler_config)) as crawler:
        result = await crawler.arun("https://example.com")
        print(f"Status code: {result.status_code}")
        print(f"Content length: {len(result.html)}")

asyncio.run(main())
```

### 4. Docker Deployment

Crawl4AI can now be easily deployed as a Docker container, providing a consistent and isolated environment. The Docker image includes a FastAPI server with both streaming and non-streaming endpoints.

```bash
# Build the image (from the project root)
docker build -t crawl4ai .

# Run the container
docker run -d -p 8000:8000 --name crawl4ai crawl4ai
```

**API Endpoints:**

- `/crawl` (POST): Non-streaming crawl.
- `/crawl/stream` (POST): Streaming crawl (NDJSON).
- `/health` (GET): Health check.
- `/schema` (GET): Returns configuration schemas.
- `/md/{url}` (GET): Returns the markdown content of the URL.
- `/llm/{url}` (GET): Returns LLM-extracted content.
- `/token` (POST): Get a JWT token.

**Breaking Changes:**

- Docker deployment now requires a `.llm.env` file for API keys.
- Docker deployment now requires Redis and a new `config.yml` structure.
- Server startup now uses `supervisord` instead of direct process management.
- The Docker server now requires authentication by default (JWT tokens).

See the [Docker deployment documentation](../../core/docker-deployment.md) for detailed instructions.
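For a quick smoke test against a running container, the sketch below uses the `requests` library. It only touches the `/health` and `/md/{url}` routes listed above; the Bearer-token header is an assumption about how the JWT obtained from `/token` is passed, so consult the Docker deployment documentation for the exact authentication flow and request schemas.

```python
import requests

BASE_URL = "http://localhost:8000"  # container started with the docker run command above

# Health check (assumed to be reachable without authentication).
print(requests.get(f"{BASE_URL}/health").json())

# With JWT auth enabled (the default), a token from POST /token must be sent
# as a Bearer header; the token request payload depends on your config.yml,
# so the placeholder below is hypothetical.
token = "<jwt-token-from-POST-/token>"
headers = {"Authorization": f"Bearer {token}"}

# Fetch the markdown rendering of a page via /md/{url}.
resp = requests.get(f"{BASE_URL}/md/https://example.com", headers=headers)
print(resp.text[:500])
```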
### 5. Command-Line Interface (CLI)

A new CLI (`crwl`) provides convenient access to Crawl4AI's functionality from the terminal.

```bash
# Basic crawl
crwl https://example.com

# Get markdown output
crwl https://example.com -o markdown

# Use a configuration file
crwl https://example.com -B browser.yml -C crawler.yml

# Use LLM-based extraction
crwl https://example.com -e extract.yml -s schema.json

# Ask a question about the crawled content
crwl https://example.com -q "What is the main topic?"

# See usage examples
crwl --example
```

See the [CLI documentation](../docs/md_v2/core/cli.md) for more details.

### 6. LXML Scraping Mode

Added `LXMLWebScrapingStrategy` for faster HTML parsing using the `lxml` library. This can significantly improve scraping performance, especially for large or complex pages. Set `scraping_strategy=LXMLWebScrapingStrategy()` in your `CrawlerRunConfig`, as in the sketch below.

**Breaking Change:** The `ScrapingMode` enum has been replaced with a strategy pattern. Use `WebScrapingStrategy` (default) or `LXMLWebScrapingStrategy`.
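A minimal sketch of opting into the lxml-based scraper, assuming the same `LXMLWebScrapingStrategy` import path used in the deep crawling example above (the target URL is just a placeholder):

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy

# Swap the default scraper for the faster lxml-backed implementation.
lxml_config = CrawlerRunConfig(scraping_strategy=LXMLWebScrapingStrategy())

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://docs.crawl4ai.com", config=lxml_config)
        print(f"Scraped {len(result.html)} bytes of HTML from {result.url}")

asyncio.run(main())
```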
### 7. Proxy Rotation

Added a `ProxyRotationStrategy` abstract base class with a `RoundRobinProxyStrategy` concrete implementation.

```python
import re
import asyncio

from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CrawlerRunConfig,
    CacheMode,
    RoundRobinProxyStrategy,
)
from crawl4ai.configs import ProxyConfig

async def main():
    # Load proxies from the environment and create the rotation strategy,
    # e.g. export PROXIES="ip1:port1:username1:password1,ip2:port2:username2:password2"
    proxies = ProxyConfig.from_env()
    if not proxies:
        print("No proxies found in environment. Set the PROXIES env variable!")
        return

    proxy_strategy = RoundRobinProxyStrategy(proxies)

    # Create configs
    browser_config = BrowserConfig(headless=True, verbose=False)
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        proxy_rotation_strategy=proxy_strategy,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        urls = ["https://httpbin.org/ip"] * (len(proxies) * 2)  # Test each proxy twice

        print("\nStarting batch crawl with proxy rotation...")
        results = await crawler.arun_many(urls=urls, config=run_config)
        for result in results:
            if result.success:
                ip_match = re.search(r"(?:[0-9]{1,3}\.){3}[0-9]{1,3}", result.html)
                current_proxy = run_config.proxy_config if run_config.proxy_config else None
                if current_proxy and ip_match:
                    print(f"URL {result.url}")
                    print(f"Proxy {current_proxy.server} -> Response IP: {ip_match.group(0)}")
                    verified = ip_match.group(0) == current_proxy.ip
                    if verified:
                        print(f"✅ Proxy working! IP matches: {current_proxy.ip}")
                    else:
                        print("❌ Proxy failed or IP mismatch!")
                print("---")

asyncio.run(main())
```

## Other Changes and Improvements

- **Added: `LLMContentFilter` for intelligent markdown generation.** This new filter uses an LLM to create more focused and relevant markdown output.

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import LLMContentFilter
from crawl4ai.async_configs import LlmConfig
import asyncio

llm_config = LlmConfig(provider="gemini/gemini-1.5-pro", api_token="env:GEMINI_API_KEY")

markdown_generator = DefaultMarkdownGenerator(
    content_filter=LLMContentFilter(llmConfig=llm_config, instruction="Extract key concepts and summaries")
)
config = CrawlerRunConfig(markdown_generator=markdown_generator)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://docs.crawl4ai.com", config=config)
        print(result.markdown.fit_markdown)

asyncio.run(main())
```

- **Added: URL redirection tracking.** The crawler now automatically follows HTTP redirects (301, 302, 307, 308) and records the final URL in the `redirected_url` field of the `CrawlResult` object. No code changes are required to enable this; it's automatic.

- **Added: LLM-powered schema generation utility.** A new `generate_schema` method has been added to `JsonCssExtractionStrategy` and `JsonXPathExtractionStrategy`. This greatly simplifies creating extraction schemas.

```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.async_configs import LlmConfig

llm_config = LlmConfig(provider="gemini/gemini-1.5-pro", api_token="env:GEMINI_API_KEY")

schema = JsonCssExtractionStrategy.generate_schema(
    html="