Release prep (#749)

* fix: Update export of URLPatternFilter

* chore: Add dependency for cchardet in requirements

* docs: Update example for deep crawl in release note for v0.5

* Docs: update the example for memory dispatcher

* docs: updated example for crawl strategies

* Refactor: Removed wrapping in if __name__==main block since this is a markdown file.

* chore: removed cchardet from dependency list, since unclecode is planning to remove it

* docs: updated the example for proxy rotation to a working example

* feat: Introduced ProxyConfig param

* Add tutorial for deep crawl & update contributor list for bug fixes in feb alpha-1

* chore: update and test new dependencies

* feat: Make PyPDF2 a conditional dependency

* updated tutorial and release note for v0.5

* docs: update docs for deep crawl, and fix a typo in docker-deployment markdown filename

* refactor: 1. Deprecate markdown_v2 2. Make markdown backward compatible to behave as a string when needed. 3. Fix LlmConfig usage in cli 4. Deprecate markdown_v2 in cli 5. Update AsyncWebCrawler for changes in CrawlResult

* fix: Bug in serialisation of markdown in acache_url

* Refactor: Added deprecation errors for fit_html and fit_markdown directly on markdown. Now access them via markdown

* fix: remove deprecated markdown_v2 from docker

* Refactor: remove deprecated fit_markdown and fit_html from result

* refactor: fix cache retrieval for markdown as a string

* chore: update all docs, examples and tests with deprecation announcements for markdown_v2, fit_html, fit_markdown
Commit a9e24307cc (parent 3a87b4e43b) by Aravind, 2025-02-28 17:23:35 +05:30, committed via GitHub. 38 changed files with 2040 additions and 326 deletions.


**Release Theme: Power, Flexibility, and Scalability**
Crawl4AI v0.5.0 is a major release focused on significantly enhancing the
library's power, flexibility, and scalability. Key improvements include a new
**deep crawling** system, a **memory-adaptive dispatcher** for handling
large-scale crawls, **multiple crawling strategies** (including a fast HTTP-only
crawler), **Docker** deployment options, and a powerful **command-line interface
(CLI)**. This release also includes numerous bug fixes, performance
optimizations, and documentation updates.
**Important Note:** This release contains several **breaking changes**. Please
review the "Breaking Changes" section carefully and update your code
accordingly.
## Key Features
### 1. Deep Crawling
Crawl4AI now supports deep crawling, allowing you to explore websites beyond the
initial URLs. This is controlled by the `deep_crawl_strategy` parameter in
`CrawlerRunConfig`. Several strategies are available:
- **`BFSDeepCrawlStrategy` (Breadth-First Search):** Explores the website level
by level. (Default)
- **`DFSDeepCrawlStrategy` (Depth-First Search):** Explores each branch as
deeply as possible before backtracking.
- **`BestFirstCrawlingStrategy`:** Uses a scoring function to prioritize which
URLs to crawl next.
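The practical difference between these strategies is the order in which the URL frontier is expanded. A minimal, library-independent sketch of that ordering (illustrative only; `bfs_order` and `best_first_order` are not Crawl4AI APIs):

```python
import heapq
from collections import deque

def bfs_order(start, graph):
    """Breadth-first: visit URLs level by level (FIFO queue)."""
    queue, seen, order = deque([start]), {start}, []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

def best_first_order(start, graph, score):
    """Best-first: always expand the highest-scoring known URL next."""
    heap, seen, order = [(-score(start), start)], {start}, []
    while heap:
        _, url = heapq.heappop(heap)
        order.append(url)
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                heapq.heappush(heap, (-score(link), link))
    return order

graph = {"/": ["/docs", "/blog"], "/docs": ["/docs/crawl"], "/blog": []}
score = lambda url: url.count("crawl") + url.count("docs")  # crude keyword relevance
print(bfs_order("/", graph))                 # level by level
print(best_first_order("/", graph, score))   # high-scoring pages first
```

DFS is the same loop with a LIFO stack instead of the FIFO queue; the scorer plays the role that `KeywordRelevanceScorer` plays in the real strategy.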
```python
import asyncio
import time

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, BFSDeepCrawlStrategy
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
from crawl4ai.deep_crawling import (
    DomainFilter,
    ContentTypeFilter,
    FilterChain,
    URLPatternFilter,
    KeywordRelevanceScorer,
    BestFirstCrawlingStrategy,
)

# Create a filter chain to filter urls based on patterns, domains and content type
filter_chain = FilterChain(
    [
        DomainFilter(
            allowed_domains=["docs.crawl4ai.com"],
            blocked_domains=["old.docs.crawl4ai.com"],
        ),
        URLPatternFilter(patterns=["*core*", "*advanced*"]),
        ContentTypeFilter(allowed_types=["text/html"]),
    ]
)

# Create a keyword scorer that prioritises pages with certain keywords first
keyword_scorer = KeywordRelevanceScorer(
    keywords=["crawl", "example", "async", "configuration"], weight=0.7
)

# Set up the configuration
deep_crawl_config = CrawlerRunConfig(
    deep_crawl_strategy=BestFirstCrawlingStrategy(
        max_depth=2,
        include_external=False,
        filter_chain=filter_chain,
        url_scorer=keyword_scorer,
    ),
    scraping_strategy=LXMLWebScrapingStrategy(),
    stream=True,
    verbose=True,
)

async def main():
    async with AsyncWebCrawler() as crawler:
        start_time = time.perf_counter()
        results = []
        async for result in await crawler.arun(url="https://docs.crawl4ai.com", config=deep_crawl_config):
            print(f"Crawled: {result.url} (Depth: {result.metadata['depth']}), score: {result.metadata['score']:.2f}")
            results.append(result)
        duration = time.perf_counter() - start_time
        print(f"\n✅ Crawled {len(results)} high-value pages in {duration:.2f} seconds")

asyncio.run(main())
```
**Breaking Change:** The `max_depth` parameter is now part of `CrawlerRunConfig`
and controls the _depth_ of the crawl, not the number of concurrent crawls. The
`arun()` and `arun_many()` methods are now decorated to handle deep crawling
strategies. Imports for deep crawling strategies have changed. See the
[Deep Crawling documentation](../../core/deep-crawling.md) for more details.
### 2. Memory-Adaptive Dispatcher
The new `MemoryAdaptiveDispatcher` dynamically adjusts concurrency based on
available system memory and includes built-in rate limiting. This prevents
out-of-memory errors and avoids overwhelming target websites.
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, MemoryAdaptiveDispatcher

# Configure the dispatcher (optional, defaults are used if not provided)
dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=80.0,  # Pause if memory usage exceeds 80%
    check_interval=0.5,  # Check memory every 0.5 seconds
)

async def batch_mode():
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=["https://docs.crawl4ai.com", "https://github.com/unclecode/crawl4ai"],
            config=CrawlerRunConfig(stream=False),  # Batch mode
            dispatcher=dispatcher,
        )
        for result in results:
            print(f"Crawled: {result.url} with status code: {result.status_code}")

async def stream_mode():
    async with AsyncWebCrawler() as crawler:
        # Streaming mode: process results as they arrive
        async for result in await crawler.arun_many(
            urls=["https://docs.crawl4ai.com", "https://github.com/unclecode/crawl4ai"],
            config=CrawlerRunConfig(stream=True),
            dispatcher=dispatcher,
        ):
            print(f"Crawled: {result.url} with status code: {result.status_code}")

print("Dispatcher in batch mode:")
asyncio.run(batch_mode())
print("-" * 50)
print("Dispatcher in stream mode:")
asyncio.run(stream_mode())
```
**Breaking Change:** `AsyncWebCrawler.arun_many()` now uses
`MemoryAdaptiveDispatcher` by default. Existing code that relied on unbounded
concurrency may require adjustments.
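The dispatcher's underlying idea is admission control: hold back new crawl tasks while memory pressure is above the threshold. A toy sketch of that check, with a stubbed memory reading and hypothetical parameter names (not the dispatcher's actual implementation):

```python
def can_launch(active, memory_percent, threshold=80.0, max_active=16):
    """Admit a new crawl task only if memory is below the threshold
    and the concurrency cap has not been reached."""
    return memory_percent < threshold and active < max_active

# With plenty of free memory, tasks are admitted up to the cap.
print(can_launch(active=3, memory_percent=42.0))   # True
# Above the 80% threshold, new tasks wait until memory is freed.
print(can_launch(active=3, memory_percent=91.5))   # False
```

The real dispatcher re-evaluates this condition every `check_interval` seconds, so a crawl that would have exhausted memory instead degrades to lower concurrency.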
### 3. Multiple Crawling Strategies (Playwright and HTTP)
Crawl4AI now offers two crawling strategies:
- **`AsyncPlaywrightCrawlerStrategy` (Default):** Uses Playwright for
browser-based crawling, supporting JavaScript rendering and complex
interactions.
- **`AsyncHTTPCrawlerStrategy`:** A lightweight, fast, and memory-efficient
HTTP-only crawler. Ideal for simple scraping tasks where browser rendering is
unnecessary.
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, HTTPCrawlerConfig
from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy

# Use the HTTP crawler strategy
http_crawler_config = HTTPCrawlerConfig(
    verify_ssl=True
)

async def main():
    async with AsyncWebCrawler(crawler_strategy=AsyncHTTPCrawlerStrategy(browser_config=http_crawler_config)) as crawler:
        result = await crawler.arun("https://example.com")
        print(f"Status code: {result.status_code}")
        print(f"Content length: {len(result.html)}")

asyncio.run(main())
```
### 4. Docker Deployment
Crawl4AI can now be easily deployed as a Docker container, providing a
consistent and isolated environment. The Docker image includes a FastAPI server
with both streaming and non-streaming endpoints.
```bash
# Build the image (from the project root)
docker build -t crawl4ai .

# Run the container
docker run -d -p 8000:8000 --name crawl4ai crawl4ai
```
**API Endpoints:**
- `/crawl` (POST): Non-streaming crawl.
- `/crawl/stream` (POST): Streaming crawl (NDJSON).
- `/health` (GET): Health check.
- `/schema` (GET): Returns configuration schemas.
- `/md/{url}` (GET): Returns markdown content of the URL.
- `/llm/{url}` (GET): Returns LLM extracted content.
- `/token` (POST): Get a JWT token.
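The `/crawl/stream` endpoint emits NDJSON, i.e. one JSON object per line. A client can decode such a stream incrementally with only the standard library (the field names in the sample payload are illustrative, not the server's exact schema):

```python
import io
import json

def iter_ndjson(stream):
    """Yield one decoded object per non-empty line of an NDJSON stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

# Simulated response body from /crawl/stream
body = io.StringIO(
    '{"url": "https://example.com/1", "success": true}\n'
    '{"url": "https://example.com/2", "success": false}\n'
)
for event in iter_ndjson(body):
    print(event["url"], event["success"])
```

With a real HTTP client you would feed the response's line iterator into the same function, processing each result as soon as its line arrives.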
**Breaking Changes:**
- Docker deployment now requires a `.llm.env` file for API keys.
- Docker deployment now requires Redis and a new `config.yml` structure.
- Server startup now uses `supervisord` instead of direct process management.
- Docker server now requires authentication by default (JWT tokens).
See the [Docker deployment documentation](../../core/docker-deployment.md) for
detailed instructions.
### 5. Command-Line Interface (CLI)
A new CLI (`crwl`) provides convenient access to Crawl4AI's functionality from
the terminal.
```bash
# Basic crawl
crwl https://example.com
```

See the [CLI documentation](../docs/md_v2/core/cli.md) for more details.
### 6. LXML Scraping Mode
Added `LXMLWebScrapingStrategy` for faster HTML parsing using the `lxml`
library. This can significantly improve scraping performance, especially for
large or complex pages. Set `scraping_strategy=LXMLWebScrapingStrategy()` in
your `CrawlerRunConfig`.
**Breaking Change:** The `ScrapingMode` enum has been replaced with a strategy
pattern. Use `WebScrapingStrategy` (default) or `LXMLWebScrapingStrategy`.
### 7. Proxy Rotation
Added `ProxyRotationStrategy` abstract base class with `RoundRobinProxyStrategy`
concrete implementation.
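Round-robin rotation simply hands each request the next proxy in a repeating, fixed order. The idea in plain stdlib form (a sketch; the `RoundRobin` class here is illustrative, not the library's `RoundRobinProxyStrategy`):

```python
from itertools import cycle

class RoundRobin:
    """Hand out proxies in a repeating fixed order."""
    def __init__(self, proxies):
        self._it = cycle(proxies)

    def next_proxy(self):
        return self._it.__next__()

rotation = RoundRobin(["http://p1:8080", "http://p2:8080"])
picks = [rotation.next_proxy() for _ in range(4)]
print(picks)  # alternates between the two proxies
```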
```python
import re
import asyncio

from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CrawlerRunConfig,
    CacheMode,
    RoundRobinProxyStrategy,
)
from crawl4ai.configs import ProxyConfig

async def main():
    # Load proxies and create rotation strategy
    # e.g. export PROXIES="ip1:port1:username1:password1,ip2:port2:username2:password2"
    proxies = ProxyConfig.from_env()
    if not proxies:
        print("No proxies found in environment. Set PROXIES env variable!")
        return

    proxy_strategy = RoundRobinProxyStrategy(proxies)

    browser_config = BrowserConfig()
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        proxy_rotation_strategy=proxy_strategy,
    )

    print("\n📈 Initializing crawler with proxy rotation...")
    async with AsyncWebCrawler(config=browser_config) as crawler:
        urls = ["https://httpbin.org/ip"] * (len(proxies) * 2)  # Test each proxy twice
        print("\n🚀 Starting batch crawl with proxy rotation...")
        results = await crawler.arun_many(urls=urls, config=run_config)
        for result in results:
            if result.success:
                ip_match = re.search(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}', result.html)
                current_proxy = run_config.proxy_config if run_config.proxy_config else None
                if current_proxy and ip_match:
                    print(f"URL {result.url}")
                    print(f"Proxy {current_proxy.server} -> Response IP: {ip_match.group(0)}")
                    verified = ip_match.group(0) == current_proxy.ip
                    if verified:
                        print(f"✅ Proxy working! IP matches: {current_proxy.ip}")
                    else:
                        print("❌ Proxy failed or IP mismatch!")
                    print("---")

asyncio.run(main())
```
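The `PROXIES` variable packs each proxy as `ip:port:username:password`, comma-separated. A sketch of the parsing this implies (`parse_proxies` and the dict keys are illustrative, not the actual `ProxyConfig` layout):

```python
def parse_proxies(raw):
    """Split 'ip:port:user:pass,ip:port:user:pass' into per-proxy dicts."""
    proxies = []
    for entry in filter(None, raw.split(",")):
        ip, port, username, password = entry.split(":")
        proxies.append({
            "server": f"http://{ip}:{port}",
            "username": username,
            "password": password,
        })
    return proxies

print(parse_proxies("10.0.0.1:8080:alice:s3cret"))
```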
## Other Changes and Improvements
- **Added: `LLMContentFilter` for intelligent markdown generation.** This new
filter uses an LLM to create more focused and relevant markdown output.
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import LLMContentFilter
from crawl4ai.async_configs import LlmConfig

llm_config = LlmConfig(provider="gemini/gemini-1.5-pro", api_token="env:GEMINI_API_KEY")

markdown_generator = DefaultMarkdownGenerator(
    content_filter=LLMContentFilter(llmConfig=llm_config, instruction="Extract key concepts and summaries")
)

config = CrawlerRunConfig(markdown_generator=markdown_generator)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://docs.crawl4ai.com", config=config)
        print(result.markdown.fit_markdown)  # Output will be filtered and formatted by the LLM

asyncio.run(main())
```
- **Added: URL redirection tracking.** The crawler now automatically follows
HTTP redirects (301, 302, 307, 308) and records the final URL in the
`redirected_url` field of the `CrawlResult` object. No code changes are
required to enable this; it's automatic.
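Tracking a redirect chain amounts to following `Location`-style hops until a URL no longer redirects, then recording that final URL. A toy resolver over a pre-recorded redirect map (illustrative; the crawler does this at the HTTP layer and stores the result in `redirected_url`):

```python
def resolve_redirects(url, redirects, max_hops=10):
    """Follow 301/302/307/308-style redirects and return the final URL."""
    hops = 0
    while url in redirects and hops < max_hops:
        url = redirects[url]
        hops += 1
    return url

redirects = {
    "http://example.com": "https://example.com",
    "https://example.com": "https://example.com/home",
}
print(resolve_redirects("http://example.com", redirects))
```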
- **Added: LLM-powered schema generation utility.** A new `generate_schema`
method has been added to `JsonCssExtractionStrategy` and
`JsonXPathExtractionStrategy`. This greatly simplifies creating extraction
schemas.
```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.async_configs import LlmConfig

llm_config = LlmConfig(provider="gemini/gemini-1.5-pro", api_token="env:GEMINI_API_KEY")

schema = JsonCssExtractionStrategy.generate_schema(
    html="<div class='product'><h2>Product Name</h2><span class='price'>$99</span></div>",
    llmConfig=llm_config,
    query="Extract product name and price"
)
print(schema)
```
Expected output (may vary slightly due to the LLM):
```json
{
"name": "ProductExtractor",
"baseSelector": "div.product",
"fields": [
{"name": "name", "selector": "h2", "type": "text"},
{"name": "price", "selector": ".price", "type": "text"}
]
}
```
- **Added: robots.txt compliance support.** The crawler can now respect
`robots.txt` rules. Enable this by setting `check_robots_txt=True` in
`CrawlerRunConfig`.
```python
config = CrawlerRunConfig(check_robots_txt=True)
```
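The check itself is the standard robots.txt question: may this user agent fetch this URL? Python's standard-library parser demonstrates the decision without any network access (the robots rules below are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

robots_txt = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("*", "https://example.com/public/page"))   # True
print(parser.can_fetch("*", "https://example.com/private/page"))  # False
```

With `check_robots_txt=True`, disallowed URLs are skipped rather than crawled.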
- **Added: PDF processing capabilities.** Crawl4AI can now extract text, images,
and metadata from PDF files (both local and remote). This uses a new
`PDFCrawlerStrategy` and `PDFContentScrapingStrategy`.
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.processors.pdf import PDFCrawlerStrategy, PDFContentScrapingStrategy

async def main():
    async with AsyncWebCrawler(crawler_strategy=PDFCrawlerStrategy()) as crawler:
        result = await crawler.arun(
            "https://arxiv.org/pdf/2310.06825.pdf",
            config=CrawlerRunConfig(
                scraping_strategy=PDFContentScrapingStrategy()
            )
        )
        print(result.markdown)  # Access extracted text
        print(result.metadata)  # Access PDF metadata (title, author, etc.)

asyncio.run(main())
```
- **Added: Support for frozenset serialization.** Improves configuration
serialization, especially for sets of allowed/blocked domains. No code changes
required.
- **Added: New `LlmConfig` parameter.** This new parameter can be passed for
extraction, filtering, and schema generation tasks. It simplifies passing
provider strings, API tokens, and base URLs across all sections where LLM
configuration is necessary. It also enables reuse and allows for quick
experimentation between different LLM configurations.
```python
from crawl4ai.async_configs import LlmConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

# Example of using LlmConfig with LLMExtractionStrategy
llm_config = LlmConfig(provider="openai/gpt-4o", api_token="YOUR_API_KEY")
strategy = LLMExtractionStrategy(llmConfig=llm_config, schema=...)

# Example usage within a crawler
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com",
        config=CrawlerRunConfig(extraction_strategy=strategy)
    )
```
**Breaking Change:** Removed old parameters like `provider`, `api_token`,
`base_url`, and `api_base` from `LLMExtractionStrategy` and
`LLMContentFilter`. Users should migrate to using the `LlmConfig` object.
- **Changed:** Improved browser context management and added shared data support
  (**Breaking Change:** `BrowserContext` API updated). Browser contexts are now
  managed more efficiently, reducing resource usage. A new `shared_data`
  dictionary is available in the `BrowserContext` to allow passing data between
  different stages of the crawling process. The old `get_context` method is
  deprecated.
- **Changed:** Renamed `final_url` to `redirected_url` in `CrawledURL`. This
improves consistency and clarity. Update any code referencing the old field
name.
- **Changed:** Improved type hints and removed unused files. This is an internal
improvement and should not require code changes.
- **Changed:** Reorganized deep crawling functionality into dedicated module.
(**Breaking Change:** Import paths for `DeepCrawlStrategy` and related classes
have changed). This improves code organization. Update imports to use the new
`crawl4ai.deep_crawling` module.
- **Changed:** Improved HTML handling and cleanup codebase. (**Breaking
Change:** Removed `ssl_certificate.json` file). This removes an unused file.
If you were relying on this file for custom certificate validation, you'll
need to implement an alternative approach.
- **Changed:** Enhanced serialization and config handling. (**Breaking Change:**
`FastFilterChain` has been replaced with `FilterChain`). This change
simplifies config and improves the serialization.
- **Changed:** The license is now Apache 2.0 _with a required attribution
  clause_. See the `LICENSE` file for details. All users must now clearly
  attribute the Crawl4AI project when using, distributing, or creating
  derivative works.
- **Fixed:** Prevent memory leaks by ensuring proper closure of Playwright
pages. No code changes required.
- **Fixed:** Make model fields optional with default values (**Breaking
Change:** Code relying on all fields being present may need adjustment).
Fields in data models (like `CrawledURL`) are now optional, with default
values (usually `None`). Update code to handle potential `None` values.
- **Fixed:** Adjust memory threshold and fix dispatcher initialization. This is
an internal bug fix; no code changes are required.
- **Fixed:** Ensure proper exit after running doctor command. No code changes
are required.
- **Fixed:** JsonCss selector and crawler improvements.
- **Fixed:** Long page screenshots not working (#403).
- **Documentation:** Updated documentation URLs to the new domain.
- **Documentation:** Added SERP API project example.
- **Documentation:** Added clarifying comments for CSS selector behavior.
- **Documentation:** Add Code of Conduct for the project (#410)
## Breaking Changes Summary
- **Dispatcher:** The `MemoryAdaptiveDispatcher` is now the default for
`arun_many()`, changing concurrency behavior. The return type of `arun_many`
depends on the `stream` parameter.
- **Deep Crawling:** `max_depth` is now part of `CrawlerRunConfig` and controls
crawl depth. Import paths for deep crawling strategies have changed.
- **Browser Context:** The `BrowserContext` API has been updated.
- **Models:** Many fields in data models are now optional, with default values.
- **Scraping Mode:** `ScrapingMode` enum replaced by strategy pattern
(`WebScrapingStrategy`, `LXMLWebScrapingStrategy`).
- **Content Filter:** Removed `content_filter` parameter from
`CrawlerRunConfig`. Use extraction strategies or markdown generators with
filters instead.
- **Removed:** Synchronous `WebCrawler`, CLI, and docs management functionality.
- **Docker:** Significant changes to Docker deployment, including new
requirements and configuration.
- **File Removed:** Removed the `ssl_certificate.json` file, which might affect
  existing certificate validations.
- **Renamed:** `final_url` is now `redirected_url` for consistency.
- **Config:** `FastFilterChain` has been replaced with `FilterChain`.
- **Deep Crawl:** `DeepCrawlStrategy.arun` now returns `Union[CrawlResultT,
  List[CrawlResultT], AsyncGenerator[CrawlResultT, None]]`.
- **Proxy:** Removed synchronous `WebCrawler` support and related rate limiting
  configurations.
## Migration Guide
1. **Update Imports:** Adjust imports for `DeepCrawlStrategy`,
   `BFSDeepCrawlStrategy`, and related classes due to the new
   `deep_crawling` module structure.
2. **`CrawlerRunConfig`:** Move `max_depth` to `CrawlerRunConfig`. If using
`content_filter`, migrate to an extraction strategy or a markdown generator
with a filter.
3. **`arun_many()`:** Adapt code to the new `MemoryAdaptiveDispatcher` behavior
and the return type.
4. **`BrowserContext`:** Update code using the `BrowserContext` API.
5. **Models:** Handle potential `None` values for optional fields in data
models.
6. **Scraping:** Replace `ScrapingMode` enum with `WebScrapingStrategy` or
`LXMLWebScrapingStrategy`.
7. **Docker:** Review the updated Docker documentation and adjust your
deployment accordingly.
8. **CLI:** Migrate to the new `crwl` command and update any scripts using the
old CLI.
9. **Proxy:** Removed synchronous `WebCrawler` support and related rate limiting
   configurations.
10. **Config:** Replace `FastFilterChain` with `FilterChain`.