# Crawl4AI v0.7.8: Stability & Bug Fix Release

*December 2025*

---

I'm releasing Crawl4AI v0.7.8, a focused stability release that addresses 11 bugs reported by the community. While there are no new features in this release, these fixes resolve important issues affecting Docker deployments, LLM extraction, URL handling, and dependency compatibility.

## What's Fixed at a Glance

- **Docker API**: Fixed ContentRelevanceFilter deserialization, ProxyConfig serialization, and cache folder permissions
- **LLM Extraction**: Configurable rate limiter backoff, HTML input format support, and proper URL handling for raw HTML
- **URL Handling**: Correct relative URL resolution after JavaScript redirects
- **Dependencies**: Replaced deprecated PyPDF2 with pypdf, Pydantic v2 ConfigDict compatibility
- **AdaptiveCrawler**: Fixed query expansion to actually use the LLM instead of hardcoded mock data

## Bug Fixes

### Docker & API Fixes

#### ContentRelevanceFilter Deserialization (#1642)

**The Problem:** When sending deep crawl requests to the Docker API with `ContentRelevanceFilter`, the server failed to deserialize the filter, causing requests to fail.

**The Fix:** I added `ContentRelevanceFilter` to the public exports and enhanced the deserialization logic with dynamic imports.

```python
# This now works correctly in the Docker API
import asyncio
import httpx

request = {
    "urls": ["https://docs.example.com"],
    "crawler_config": {
        "deep_crawl_strategy": {
            "type": "BFSDeepCrawlStrategy",
            "max_depth": 2,
            "filter_chain": [
                {
                    "type": "ContentRelevanceFilter",
                    "query": "API documentation",
                    "threshold": 0.3
                }
            ]
        }
    }
}

async def main():
    async with httpx.AsyncClient() as client:
        response = await client.post("http://localhost:11235/crawl", json=request)
        # Previously failed, now works!

asyncio.run(main())
```

#### ProxyConfig JSON Serialization (#1629)

**The Problem:** `BrowserConfig.to_dict()` failed when `proxy_config` was set because `ProxyConfig` wasn't being serialized to a dictionary.
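The underlying failure is the standard `json` behavior when a non-dict object sits inside an otherwise serializable structure. A minimal sketch in plain Python (the `FakeProxyConfig` class is a hypothetical stand-in, not Crawl4AI code):

```python
import json

# Hypothetical stand-in for a config object that was never converted to a dict
class FakeProxyConfig:
    def __init__(self, server):
        self.server = server

config_dict = {"headless": True, "proxy_config": FakeProxyConfig("http://proxy:8080")}

try:
    json.dumps(config_dict)  # fails: the nested object is not JSON-serializable
except TypeError:
    print("TypeError: nested object is not JSON serializable")

# Converting the nested object to a plain dict first makes serialization succeed
config_dict["proxy_config"] = vars(config_dict["proxy_config"])
print(json.dumps(config_dict))
```

This is exactly the conversion the fix performs on the real `ProxyConfig` before the config is serialized.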
**The Fix:** `ProxyConfig.to_dict()` is now called during serialization.

```python
import json

from crawl4ai import BrowserConfig
from crawl4ai.async_configs import ProxyConfig

proxy = ProxyConfig(
    server="http://proxy.example.com:8080",
    username="user",
    password="pass"
)

config = BrowserConfig(headless=True, proxy_config=proxy)

# Previously raised TypeError, now works
config_dict = config.to_dict()
json.dumps(config_dict)  # Valid JSON
```

#### Docker Cache Folder Permissions (#1638)

**The Problem:** The `.cache` folder in the Docker image had incorrect permissions, causing crawling to fail when caching was enabled.

**The Fix:** Corrected ownership and permissions during the image build.

```bash
# Cache now works correctly in Docker
docker run -d -p 11235:11235 \
  --shm-size=1g \
  -v ./my-cache:/app/.cache \
  unclecode/crawl4ai:0.7.8
```

---

### LLM & Extraction Fixes

#### Configurable Rate Limiter Backoff (#1269)

**The Problem:** The LLM rate-limiting backoff parameters were hardcoded, making it impossible to adjust retry behavior for different API rate limits.

**The Fix:** `LLMConfig` now accepts three new parameters for complete control over retry behavior.

```python
from crawl4ai import LLMConfig

# Default behavior (unchanged)
default_config = LLMConfig(provider="openai/gpt-4o-mini")
# backoff_base_delay=2, backoff_max_attempts=3, backoff_exponential_factor=2

# Custom configuration for APIs with strict rate limits
custom_config = LLMConfig(
    provider="openai/gpt-4o-mini",
    backoff_base_delay=5,          # Wait 5 seconds on first retry
    backoff_max_attempts=5,        # Try up to 5 times
    backoff_exponential_factor=3   # Multiply delay by 3 each attempt
)
# Retry sequence: 5s -> 15s -> 45s -> 135s -> 405s
```

#### LLM Strategy HTML Input Support (#1178)

**The Problem:** `LLMExtractionStrategy` always sent markdown to the LLM, but some extraction tasks work better with the HTML structure preserved.
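The structural loss is easy to see with a table: attributes such as `rowspan` survive in HTML but have no equivalent in a pipe-delimited markdown table. A quick stdlib sketch (illustrative only, not Crawl4AI code):

```python
from html.parser import HTMLParser

# A table with a rowspan, information a markdown table cannot express
html_snippet = "<table><tr><td rowspan='2'>A</td><td>B</td></tr><tr><td>C</td></tr></table>"

class AttrCollector(HTMLParser):
    """Collects every tag attribute seen while parsing."""
    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)

parser = AttrCollector()
parser.feed(html_snippet)
print(parser.attrs)  # the rowspan attribute survives in the HTML representation
```

Once a converter flattens this table to markdown, the relationship encoded by `rowspan` is gone, which is why sending raw HTML to the LLM can extract such data more faithfully.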
**The Fix:** Added an `input_format` parameter supporting `"markdown"`, `"html"`, `"fit_markdown"`, `"cleaned_html"`, and `"fit_html"`.

```python
from crawl4ai import LLMExtractionStrategy, LLMConfig

# Default: markdown input (unchanged)
markdown_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(provider="openai/gpt-4o-mini"),
    instruction="Extract product information"
)

# NEW: HTML input - preserves table/list structure
html_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(provider="openai/gpt-4o-mini"),
    instruction="Extract the data table preserving structure",
    input_format="html"
)

# NEW: Filtered markdown - only relevant content
fit_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(provider="openai/gpt-4o-mini"),
    instruction="Summarize the main content",
    input_format="fit_markdown"
)
```

#### Raw HTML URL Variable (#1116)

**The Problem:** When using `url="raw:..."`, the entire HTML content was being passed to extraction strategies as the URL parameter, polluting LLM prompts.

**The Fix:** The URL is now correctly set to `"Raw HTML"` for raw HTML inputs.

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

html = "