# Crawl4AI v0.5.0 Release Notes

**Release Theme: Power, Flexibility, and Scalability**

Crawl4AI v0.5.0 is a major release focused on significantly enhancing the
library's power, flexibility, and scalability. Key improvements include a new
**deep crawling** system, a **memory-adaptive dispatcher** for handling
large-scale crawls, **multiple crawling strategies** (including a fast HTTP-only
crawler), **Docker** deployment options, and a powerful **command-line interface
(CLI)**. This release also includes numerous bug fixes, performance
optimizations, and documentation updates.

**Important Note:** This release contains several **breaking changes**. Please
review the "Breaking Changes" section carefully and update your code
accordingly.

## Key Features

### 1. Deep Crawling

Crawl4AI now supports deep crawling, allowing you to explore websites beyond the
initial URLs. This is controlled by the `deep_crawl_strategy` parameter in
`CrawlerRunConfig`. Several strategies are available:

- **`BFSDeepCrawlStrategy` (Breadth-First Search):** Explores the website level
  by level. (Default)
- **`DFSDeepCrawlStrategy` (Depth-First Search):** Explores each branch as
  deeply as possible before backtracking.
- **`BestFirstCrawlingStrategy`:** Uses a scoring function to prioritize which
  URLs to crawl next.

```python
import asyncio
import time

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
from crawl4ai.deep_crawling import (
    BestFirstCrawlingStrategy,
    ContentTypeFilter,
    DomainFilter,
    FilterChain,
    KeywordRelevanceScorer,
    URLPatternFilter,
)

# Create a filter chain to filter URLs based on patterns, domains, and content type
filter_chain = FilterChain(
    [
        DomainFilter(
            allowed_domains=["docs.crawl4ai.com"],
            blocked_domains=["old.docs.crawl4ai.com"],
        ),
        URLPatternFilter(patterns=["*core*", "*advanced*"]),
        ContentTypeFilter(allowed_types=["text/html"]),
    ]
)

# Create a keyword scorer that prioritises pages containing certain keywords
keyword_scorer = KeywordRelevanceScorer(
    keywords=["crawl", "example", "async", "configuration"], weight=0.7
)

# Set up the configuration
deep_crawl_config = CrawlerRunConfig(
    deep_crawl_strategy=BestFirstCrawlingStrategy(
        max_depth=2,
        include_external=False,
        filter_chain=filter_chain,
        url_scorer=keyword_scorer,
    ),
    scraping_strategy=LXMLWebScrapingStrategy(),
    stream=True,
    verbose=True,
)

async def main():
    async with AsyncWebCrawler() as crawler:
        start_time = time.perf_counter()
        results = []
        async for result in await crawler.arun(url="https://docs.crawl4ai.com", config=deep_crawl_config):
            print(f"Crawled: {result.url} (Depth: {result.metadata['depth']}), score: {result.metadata['score']:.2f}")
            results.append(result)
        duration = time.perf_counter() - start_time
        print(f"\n✅ Crawled {len(results)} high-value pages in {duration:.2f} seconds")

asyncio.run(main())
```

**Breaking Change:** The `max_depth` parameter is now part of `CrawlerRunConfig`
and controls the _depth_ of the crawl, not the number of concurrent crawls. The
`arun()` and `arun_many()` methods are now decorated to handle deep crawling
strategies. Imports for deep crawling strategies have changed. See the
[Deep Crawling documentation](../../core/deep-crawling.md) for more details.

### 2. Memory-Adaptive Dispatcher

The new `MemoryAdaptiveDispatcher` dynamically adjusts concurrency based on
available system memory and includes built-in rate limiting. This prevents
out-of-memory errors and avoids overwhelming target websites.

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, MemoryAdaptiveDispatcher

# Configure the dispatcher (optional, defaults are used if not provided)
dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=80.0,  # Pause if memory usage exceeds 80%
    check_interval=0.5,  # Check memory every 0.5 seconds
)

async def batch_mode():
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(
            urls=["https://docs.crawl4ai.com", "https://github.com/unclecode/crawl4ai"],
            config=CrawlerRunConfig(stream=False),  # Batch mode
            dispatcher=dispatcher,
        )
        for result in results:
            print(f"Crawled: {result.url} with status code: {result.status_code}")

async def stream_mode():
    async with AsyncWebCrawler() as crawler:
        # Process results as they arrive instead of waiting for the full batch
        async for result in await crawler.arun_many(
            urls=["https://docs.crawl4ai.com", "https://github.com/unclecode/crawl4ai"],
            config=CrawlerRunConfig(stream=True),
            dispatcher=dispatcher,
        ):
            print(f"Crawled: {result.url} with status code: {result.status_code}")

print("Dispatcher in batch mode:")
asyncio.run(batch_mode())
print("-" * 50)
print("Dispatcher in stream mode:")
asyncio.run(stream_mode())
```

**Breaking Change:** `AsyncWebCrawler.arun_many()` now uses
`MemoryAdaptiveDispatcher` by default. Existing code that relied on unbounded
concurrency may require adjustments.
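
The dispatcher's built-in rate limiting can also be tuned. A minimal sketch,
assuming `RateLimiter` is exported from the top-level package with the
parameters shown here (`base_delay`, `max_delay`, `max_retries`,
`rate_limit_codes`); verify against your installed version:

```python
from crawl4ai import MemoryAdaptiveDispatcher, RateLimiter

# Hypothetical tuning example: back off when the server throttles us.
dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=80.0,
    rate_limiter=RateLimiter(
        base_delay=(1.0, 3.0),        # random delay between requests, in seconds
        max_delay=30.0,               # cap for exponential backoff
        max_retries=2,                # retries before giving up on a URL
        rate_limit_codes=[429, 503],  # status codes that trigger backoff
    ),
)
```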

### 3. Multiple Crawling Strategies (Playwright and HTTP)

Crawl4AI now offers two crawling strategies:

- **`AsyncPlaywrightCrawlerStrategy` (Default):** Uses Playwright for
  browser-based crawling, supporting JavaScript rendering and complex
  interactions.
- **`AsyncHTTPCrawlerStrategy`:** A lightweight, fast, and memory-efficient
  HTTP-only crawler. Ideal for simple scraping tasks where browser rendering is
  unnecessary.

```python
import asyncio

from crawl4ai import AsyncWebCrawler, HTTPCrawlerConfig
from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy

# Use the HTTP crawler strategy
http_crawler_config = HTTPCrawlerConfig(
    method="GET",
    headers={"User-Agent": "MyCustomBot/1.0"},
    follow_redirects=True,
    verify_ssl=True,
)

async def main():
    async with AsyncWebCrawler(crawler_strategy=AsyncHTTPCrawlerStrategy(browser_config=http_crawler_config)) as crawler:
        result = await crawler.arun("https://example.com")
        print(f"Status code: {result.status_code}")
        print(f"Content length: {len(result.html)}")

asyncio.run(main())
```

### 4. Docker Deployment

Crawl4AI can now be easily deployed as a Docker container, providing a
consistent and isolated environment. The Docker image includes a FastAPI server
with both streaming and non-streaming endpoints.

```bash
# Build the image (from the project root)
docker build -t crawl4ai .

# Run the container
docker run -d -p 8000:8000 --name crawl4ai crawl4ai
```

**API Endpoints:**

- `/crawl` (POST): Non-streaming crawl.
- `/crawl/stream` (POST): Streaming crawl (NDJSON).
- `/health` (GET): Health check.
- `/schema` (GET): Returns configuration schemas.
- `/md/{url}` (GET): Returns markdown content of the URL.
- `/llm/{url}` (GET): Returns LLM-extracted content.
- `/token` (POST): Get a JWT token.
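
To make the endpoints concrete, here is a minimal client sketch. The request and
response shapes (an email-only `/token` body, a `urls` list in the `/crawl`
body, and an `access_token` field in the token response) are assumptions;
consult the Docker deployment documentation for the exact schemas:

```python
import requests

BASE = "http://localhost:8000"

# Health check
print(requests.get(f"{BASE}/health").json())

# Hypothetical auth flow: the token request body and response field
# names are assumptions; check the Docker docs for the real schema.
token = requests.post(f"{BASE}/token", json={"email": "user@example.com"}).json()["access_token"]

# Non-streaming crawl of a single URL (assumed request shape)
resp = requests.post(
    f"{BASE}/crawl",
    headers={"Authorization": f"Bearer {token}"},
    json={"urls": ["https://example.com"]},
)
print(resp.json())
```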

**Breaking Changes:**

- Docker deployment now requires a `.llm.env` file for API keys.
- Docker deployment now requires Redis and a new `config.yml` structure.
- Server startup now uses `supervisord` instead of direct process management.
- The Docker server now requires authentication by default (JWT tokens).

See the [Docker deployment documentation](../../core/docker-deployment.md) for
detailed instructions.

### 5. Command-Line Interface (CLI)

A new CLI (`crwl`) provides convenient access to Crawl4AI's functionality from
the terminal.

```bash
# Basic crawl
crwl https://example.com

# Get markdown output
crwl https://example.com -o markdown

# Use a configuration file
crwl https://example.com -B browser.yml -C crawler.yml

# Use LLM-based extraction
crwl https://example.com -e extract.yml -s schema.json

# Ask a question about the crawled content
crwl https://example.com -q "What is the main topic?"

# See usage examples
crwl --example
```

See the [CLI documentation](../../core/cli.md) for more details.

### 6. LXML Scraping Mode

Added `LXMLWebScrapingStrategy` for faster HTML parsing using the `lxml`
library. This can significantly improve scraping performance, especially for
large or complex pages. Set `scraping_strategy=LXMLWebScrapingStrategy()` in
your `CrawlerRunConfig`.
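
A minimal sketch of switching strategies, using the same import path as the
deep-crawl example above (the URL is just an example):

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy

# Same crawl as usual, but parsed with lxml for speed
config = CrawlerRunConfig(scraping_strategy=LXMLWebScrapingStrategy())

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://docs.crawl4ai.com", config=config)
        print(len(result.markdown))

asyncio.run(main())
```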

**Breaking Change:** The `ScrapingMode` enum has been replaced with a strategy
pattern. Use `WebScrapingStrategy` (default) or `LXMLWebScrapingStrategy`.

### 7. Proxy Rotation

Added a `ProxyRotationStrategy` abstract base class with a
`RoundRobinProxyStrategy` concrete implementation.

```python
import asyncio
import re

from crawl4ai import (
    AsyncWebCrawler,
    BrowserConfig,
    CacheMode,
    CrawlerRunConfig,
    RoundRobinProxyStrategy,
)
from crawl4ai.configs import ProxyConfig

async def main():
    # Load proxies and create a rotation strategy, e.g.:
    # export PROXIES="ip1:port1:username1:password1,ip2:port2:username2:password2"
    proxies = ProxyConfig.from_env()
    if not proxies:
        print("No proxies found in environment. Set the PROXIES env variable!")
        return

    proxy_strategy = RoundRobinProxyStrategy(proxies)

    # Create configs
    browser_config = BrowserConfig(headless=True, verbose=False)
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        proxy_rotation_strategy=proxy_strategy,
    )

    urls = ["https://httpbin.org/ip"] * (len(proxies) * 2)  # Test each proxy twice

    print("\n📈 Initializing crawler with proxy rotation...")
    async with AsyncWebCrawler(config=browser_config) as crawler:
        print("\n🚀 Starting batch crawl with proxy rotation...")
        results = await crawler.arun_many(urls=urls, config=run_config)
        for result in results:
            if result.success:
                ip_match = re.search(r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}', result.html)
                current_proxy = run_config.proxy_config if run_config.proxy_config else None

                if current_proxy and ip_match:
                    print(f"URL {result.url}")
                    print(f"Proxy {current_proxy.server} -> Response IP: {ip_match.group(0)}")
                    verified = ip_match.group(0) == current_proxy.ip
                    if verified:
                        print(f"✅ Proxy working! IP matches: {current_proxy.ip}")
                    else:
                        print("❌ Proxy failed or IP mismatch!")
                    print("---")

asyncio.run(main())
```

## Other Changes and Improvements

- **Added: `LLMContentFilter` for intelligent markdown generation.** This new
  filter uses an LLM to create more focused and relevant markdown output.

  ```python
  import asyncio

  from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, DefaultMarkdownGenerator
  from crawl4ai.content_filter_strategy import LLMContentFilter
  from crawl4ai.async_configs import LlmConfig

  llm_config = LlmConfig(provider="gemini/gemini-1.5-pro", api_token="env:GEMINI_API_KEY")

  markdown_generator = DefaultMarkdownGenerator(
      content_filter=LLMContentFilter(llmConfig=llm_config, instruction="Extract key concepts and summaries")
  )

  config = CrawlerRunConfig(markdown_generator=markdown_generator)

  async def main():
      async with AsyncWebCrawler() as crawler:
          result = await crawler.arun("https://docs.crawl4ai.com", config=config)
          print(result.markdown.fit_markdown)

  asyncio.run(main())
  ```

- **Added: URL redirection tracking.** The crawler now automatically follows
  HTTP redirects (301, 302, 307, 308) and records the final URL in the
  `redirected_url` field of the `CrawlResult` object. No code changes are
  required to enable this; it's automatic.
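
  A minimal sketch of reading the new field (the redirecting URL is just an
  illustrative example):

  ```python
  import asyncio

  from crawl4ai import AsyncWebCrawler

  async def main():
      async with AsyncWebCrawler() as crawler:
          result = await crawler.arun("https://httpbin.org/redirect/1")
          # redirected_url holds the final URL after any redirects
          print(result.url, "->", result.redirected_url)

  asyncio.run(main())
  ```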

- **Added: LLM-powered schema generation utility.** A new `generate_schema`
  method has been added to `JsonCssExtractionStrategy` and
  `JsonXPathExtractionStrategy`. This greatly simplifies creating extraction
  schemas.

  ```python
  from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
  from crawl4ai.async_configs import LlmConfig

  llm_config = LlmConfig(provider="gemini/gemini-1.5-pro", api_token="env:GEMINI_API_KEY")

  schema = JsonCssExtractionStrategy.generate_schema(
      html="<div class='product'><h2>Product Name</h2><span class='price'>$99</span></div>",
      llmConfig=llm_config,
      query="Extract product name and price",
  )
  print(schema)
  ```

  Expected output (may vary slightly due to the LLM):

  ```json
  {
    "name": "ProductExtractor",
    "baseSelector": "div.product",
    "fields": [
      {"name": "name", "selector": "h2", "type": "text"},
      {"name": "price", "selector": ".price", "type": "text"}
    ]
  }
  ```

- **Added: robots.txt compliance support.** The crawler can now respect
  `robots.txt` rules. Enable this by setting `check_robots_txt=True` in
  `CrawlerRunConfig`:

  ```python
  config = CrawlerRunConfig(check_robots_txt=True)
  ```
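
  A short usage sketch. How a blocked URL is reported (e.g. the exact
  `status_code`) is an assumption here, so the sketch checks `result.success`
  and `result.error_message` rather than a specific code:

  ```python
  import asyncio

  from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

  async def main():
      config = CrawlerRunConfig(check_robots_txt=True)
      async with AsyncWebCrawler() as crawler:
          result = await crawler.arun("https://example.com", config=config)
          if not result.success:
              # Disallowed by robots.txt (or another failure); see error_message
              print(f"Skipped: {result.error_message}")

  asyncio.run(main())
  ```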

- **Added: PDF processing capabilities.** Crawl4AI can now extract text, images,
  and metadata from PDF files (both local and remote). This uses a new
  `PDFCrawlerStrategy` and `PDFContentScrapingStrategy`.

  ```python
  import asyncio

  from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
  from crawl4ai.processors.pdf import PDFCrawlerStrategy, PDFContentScrapingStrategy

  async def main():
      async with AsyncWebCrawler(crawler_strategy=PDFCrawlerStrategy()) as crawler:
          result = await crawler.arun(
              "https://arxiv.org/pdf/2310.06825.pdf",
              config=CrawlerRunConfig(
                  scraping_strategy=PDFContentScrapingStrategy()
              ),
          )
          print(result.markdown)  # Access extracted text
          print(result.metadata)  # Access PDF metadata (title, author, etc.)

  asyncio.run(main())
  ```

- **Added: Support for `frozenset` serialization.** Improves configuration
  serialization, especially for sets of allowed/blocked domains. No code changes
  required.

- **Added: New `LlmConfig` parameter.** This new parameter can be passed for
  extraction, filtering, and schema generation tasks. It simplifies passing
  provider strings, API tokens, and base URLs across all sections where LLM
  configuration is necessary. It also enables reuse and allows for quick
  experimentation between different LLM configurations.

  ```python
  import asyncio

  from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
  from crawl4ai.async_configs import LlmConfig
  from crawl4ai.extraction_strategy import LLMExtractionStrategy

  # Example of using LlmConfig with LLMExtractionStrategy
  llm_config = LlmConfig(provider="openai/gpt-4o", api_token="YOUR_API_KEY")
  strategy = LLMExtractionStrategy(llmConfig=llm_config, schema=...)

  # Example usage within a crawler
  async def main():
      async with AsyncWebCrawler() as crawler:
          result = await crawler.arun(
              url="https://example.com",
              config=CrawlerRunConfig(extraction_strategy=strategy),
          )

  asyncio.run(main())
  ```

  **Breaking Change:** Removed old parameters like `provider`, `api_token`,
  `base_url`, and `api_base` from `LLMExtractionStrategy` and
  `LLMContentFilter`. Users should migrate to using the `LlmConfig` object.

- **Changed: Improved browser context management and added shared data
  support.** Browser contexts are now managed more efficiently, reducing
  resource usage. A new `shared_data` dictionary is available in the
  `BrowserContext` to allow passing data between different stages of the
  crawling process. **Breaking Change:** The `BrowserContext` API has changed,
  and the old `get_context` method is deprecated.

- **Changed:** Renamed `final_url` to `redirected_url` in `CrawledURL`. This
  improves consistency and clarity. Update any code referencing the old field
  name.

- **Changed:** Improved type hints and removed unused files. This is an internal
  improvement and should not require code changes.

- **Changed:** Reorganized deep crawling functionality into a dedicated module.
  (**Breaking Change:** Import paths for `DeepCrawlStrategy` and related classes
  have changed.) This improves code organization. Update imports to use the new
  `crawl4ai.deep_crawling` module.

- **Changed:** Improved HTML handling and cleaned up the codebase. (**Breaking
  Change:** Removed the unused `ssl_certificate.json` file.) If you were relying
  on this file for custom certificate validation, you'll need to implement an
  alternative approach.

- **Changed:** Enhanced serialization and config handling. (**Breaking Change:**
  `FastFilterChain` has been replaced with `FilterChain`.) This simplifies
  configuration and improves serialization.

- **Changed:** Modified the license to Apache 2.0 _with a required attribution
  clause_. See the `LICENSE` file for details. All users must now clearly
  attribute the Crawl4AI project when using, distributing, or creating
  derivative works.

- **Fixed:** Prevent memory leaks by ensuring proper closure of Playwright
  pages. No code changes required.

- **Fixed:** Make model fields optional with default values. (**Breaking
  Change:** Code relying on all fields being present may need adjustment.)
  Fields in data models (like `CrawledURL`) are now optional, with default
  values (usually `None`). Update code to handle potential `None` values, as in
  the sketch below.
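
  A minimal defensive-access sketch (field names follow the examples above):

  ```python
  # status_code may now be None, so guard before comparing
  if result.status_code is not None and result.status_code == 200:
      print(f"OK: {result.url}")
  else:
      print(f"No status code recorded for {result.url}")
  ```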

- **Fixed:** Adjust memory threshold and fix dispatcher initialization. This is
  an internal bug fix; no code changes are required.

- **Fixed:** Ensure proper exit after running the doctor command. No code
  changes are required.
- **Fixed:** JsonCss selector and crawler improvements.
- **Fixed:** Long-page screenshots not working (#403).
- **Documentation:** Updated documentation URLs to the new domain.
- **Documentation:** Added a SERP API project example.
- **Documentation:** Added clarifying comments for CSS selector behavior.
- **Documentation:** Added a Code of Conduct for the project (#410).

## Breaking Changes Summary

- **Dispatcher:** The `MemoryAdaptiveDispatcher` is now the default for
  `arun_many()`, changing concurrency behavior. The return type of `arun_many()`
  depends on the `stream` parameter.
- **Deep Crawling:** `max_depth` is now part of `CrawlerRunConfig` and controls
  crawl depth. Import paths for deep crawling strategies have changed.
- **Browser Context:** The `BrowserContext` API has been updated.
- **Models:** Many fields in data models are now optional, with default values.
- **Scraping Mode:** The `ScrapingMode` enum has been replaced by a strategy
  pattern (`WebScrapingStrategy`, `LXMLWebScrapingStrategy`).
- **Content Filter:** Removed the `content_filter` parameter from
  `CrawlerRunConfig`. Use extraction strategies or markdown generators with
  filters instead.
- **Removed:** Synchronous `WebCrawler`, the old CLI, and docs management
  functionality.
- **Docker:** Significant changes to Docker deployment, including new
  requirements and configuration.
- **File Removed:** Removed the `ssl_certificate.json` file, which might affect
  existing certificate validations.
- **Renamed:** `final_url` to `redirected_url` for consistency.
- **Config:** `FastFilterChain` has been replaced with `FilterChain`.
- **Deep Crawl:** `DeepCrawlStrategy.arun` now returns
  `Union[CrawlResultT, List[CrawlResultT], AsyncGenerator[CrawlResultT, None]]`.
- **Proxy:** Removed synchronous `WebCrawler` support and related rate-limiting
  configurations.

## Migration Guide

1. **Update Imports:** Adjust imports for `DeepCrawlStrategy`,
   `BreadthFirstSearchStrategy`, and related classes due to the new
   `deep_crawling` module structure.
2. **`CrawlerRunConfig`:** Move `max_depth` to `CrawlerRunConfig`. If using
   `content_filter`, migrate to an extraction strategy or a markdown generator
   with a filter.
3. **`arun_many()`:** Adapt code to the new `MemoryAdaptiveDispatcher` behavior
   and the return type.
4. **`BrowserContext`:** Update code using the `BrowserContext` API.
5. **Models:** Handle potential `None` values for optional fields in data
   models.
6. **Scraping:** Replace the `ScrapingMode` enum with `WebScrapingStrategy` or
   `LXMLWebScrapingStrategy`.
7. **Docker:** Review the updated Docker documentation and adjust your
   deployment accordingly.
8. **CLI:** Migrate to the new `crwl` command and update any scripts using the
   old CLI.
9. **Proxy:** Remove any use of the synchronous `WebCrawler` and its related
   rate-limiting configurations.
10. **Config:** Replace `FastFilterChain` with `FilterChain`.
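
To make steps 1 and 2 concrete, here is a minimal before/after sketch. The
pre-0.5 form in the comment is illustrative; the new form follows the examples
earlier in these notes:

```python
# Before (pre-0.5, illustrative): max_depth was passed outside
# CrawlerRunConfig and deep-crawl classes lived at different paths.

# After (v0.5.0): the strategy carries max_depth and is set on
# CrawlerRunConfig; strategies live in crawl4ai / crawl4ai.deep_crawling.
from crawl4ai import BFSDeepCrawlStrategy, CrawlerRunConfig

config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(max_depth=2, include_external=False),
)
```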