feat(release): prepare v0.4.3 beta release
Prepare the v0.4.3 beta release with major feature additions and improvements:

- Add JsonXPathExtractionStrategy and LLMContentFilter to exports
- Update version to 0.4.3b1
- Improve documentation for dispatchers and markdown generation
- Update development status to Beta
- Reorganize changelog format

BREAKING CHANGE: Memory threshold in MemoryAdaptiveDispatcher increased to 90% and SemaphoreDispatcher parameter renamed to max_session_permit
@@ -1,264 +0,0 @@
# Optimized Multi-URL Crawling

> **Note**: We’re developing a new **executor module** that uses a sophisticated algorithm to dynamically manage multi-URL crawling, optimizing for speed and memory usage. The approaches in this document remain fully valid, but keep an eye on **Crawl4AI**’s upcoming releases for this powerful feature! Follow [@unclecode](https://twitter.com/unclecode) on X and check the changelogs to stay updated.

Crawl4AI’s **AsyncWebCrawler** can handle multiple URLs in a single run, which can greatly reduce overhead and speed up crawling. This guide shows how to:

1. **Sequentially** crawl a list of URLs using the **same** session, avoiding repeated browser creation.
2. **Parallel**-crawl subsets of URLs in batches, again reusing the same browser.

When the entire process finishes, you close the browser once—**minimizing** memory and resource usage.

---

## 1. Why Avoid Simple Loops per URL?

If you naively do:

```python
for url in urls:
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url)
```

You end up:

1. Spinning up a **new** browser for each URL
2. Closing it immediately after the single crawl
3. Potentially using a lot of CPU/memory for short-lived browsers
4. Missing out on session reusability if you have login or ongoing states

**Better** approaches ensure you **create** the browser once, then crawl multiple URLs with minimal overhead.

---

## 2. Sequential Crawling with Session Reuse

### 2.1 Overview

1. **One** `AsyncWebCrawler` instance for **all** URLs.
2. **One** session (via `session_id`) so we can preserve local storage or cookies across URLs if needed.
3. The crawler is only closed at the **end**.

**This** is the simplest pattern if your workload is moderate (dozens to a few hundred URLs).

### 2.2 Example Code

```python
import asyncio
from typing import List
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def crawl_sequential(urls: List[str]):
    print("\n=== Sequential Crawling with Session Reuse ===")

    browser_config = BrowserConfig(
        headless=True,
        # For better performance in Docker or low-memory environments:
        extra_args=["--disable-gpu", "--disable-dev-shm-usage", "--no-sandbox"],
    )

    crawl_config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator()
    )

    # Create the crawler (opens the browser)
    crawler = AsyncWebCrawler(config=browser_config)
    await crawler.start()

    try:
        session_id = "session1"  # Reuse the same session across all URLs
        for url in urls:
            result = await crawler.arun(
                url=url,
                config=crawl_config,
                session_id=session_id
            )
            if result.success:
                print(f"Successfully crawled: {url}")
                # E.g. check markdown length
                print(f"Markdown length: {len(result.markdown_v2.raw_markdown)}")
            else:
                print(f"Failed: {url} - Error: {result.error_message}")
    finally:
        # After all URLs are done, close the crawler (and the browser)
        await crawler.close()

async def main():
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3"
    ]
    await crawl_sequential(urls)

if __name__ == "__main__":
    asyncio.run(main())
```

**Why It’s Good**:

- **One** browser launch.
- Minimal memory usage.
- If the site requires a login, you can log in once within the `session_id` context and preserve auth across all URLs (see the sketch below).
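
Below is a minimal sketch of that login flow, assuming the target site has a simple form and that `CrawlerRunConfig` accepts a `js_code` snippet to run on the page (the selectors and login URL are hypothetical; adapt them to your site):

```python
# Hypothetical login step: run once, then reuse the authenticated session.
login_js = """
document.querySelector('#username').value = 'demo';
document.querySelector('#password').value = 'secret';
document.querySelector('form').submit();
"""

async def crawl_after_login(crawler, urls, crawl_config):
    session_id = "session1"
    # First request performs the login inside the shared session.
    await crawler.arun(
        url="https://example.com/login",
        config=CrawlerRunConfig(js_code=login_js),
        session_id=session_id,
    )
    # Subsequent requests reuse the same cookies/local storage.
    for url in urls:
        await crawler.arun(url=url, config=crawl_config, session_id=session_id)
```
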
---

## 3. Parallel Crawling with Browser Reuse

### 3.1 Overview

To speed up crawling further, you can crawl multiple URLs in **parallel** (batches or a concurrency limit). The crawler still uses **one** browser, but spawns different sessions (or the same, depending on your logic) for each task.

### 3.2 Example Code

For this example, make sure to install the [psutil](https://pypi.org/project/psutil/) package.

```bash
pip install psutil
```

Then you can run the following code:

```python
import os
import sys
import psutil
import asyncio

__location__ = os.path.dirname(os.path.abspath(__file__))
__output__ = os.path.join(__location__, "output")

# Append parent directory to system path
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(parent_dir)

from typing import List
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def crawl_parallel(urls: List[str], max_concurrent: int = 3):
    print("\n=== Parallel Crawling with Browser Reuse + Memory Check ===")

    # We'll keep track of peak memory usage across all tasks
    peak_memory = 0
    process = psutil.Process(os.getpid())

    def log_memory(prefix: str = ""):
        nonlocal peak_memory
        current_mem = process.memory_info().rss  # in bytes
        if current_mem > peak_memory:
            peak_memory = current_mem
        print(f"{prefix} Current Memory: {current_mem // (1024 * 1024)} MB, Peak: {peak_memory // (1024 * 1024)} MB")

    # Minimal browser config
    browser_config = BrowserConfig(
        headless=True,
        verbose=False,
        extra_args=["--disable-gpu", "--disable-dev-shm-usage", "--no-sandbox"],
    )
    crawl_config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    # Create the crawler instance
    crawler = AsyncWebCrawler(config=browser_config)
    await crawler.start()

    try:
        # We'll chunk the URLs in batches of 'max_concurrent'
        success_count = 0
        fail_count = 0
        for i in range(0, len(urls), max_concurrent):
            batch = urls[i : i + max_concurrent]
            tasks = []

            for j, url in enumerate(batch):
                # Unique session_id per concurrent sub-task
                session_id = f"parallel_session_{i + j}"
                task = crawler.arun(url=url, config=crawl_config, session_id=session_id)
                tasks.append(task)

            # Check memory usage prior to launching tasks
            log_memory(prefix=f"Before batch {i//max_concurrent + 1}: ")

            # Gather results
            results = await asyncio.gather(*tasks, return_exceptions=True)

            # Check memory usage after tasks complete
            log_memory(prefix=f"After batch {i//max_concurrent + 1}: ")

            # Evaluate results
            for url, result in zip(batch, results):
                if isinstance(result, Exception):
                    print(f"Error crawling {url}: {result}")
                    fail_count += 1
                elif result.success:
                    success_count += 1
                else:
                    fail_count += 1

        print("\nSummary:")
        print(f"  - Successfully crawled: {success_count}")
        print(f"  - Failed: {fail_count}")

    finally:
        print("\nClosing crawler...")
        await crawler.close()
        # Final memory log
        log_memory(prefix="Final: ")
        print(f"\nPeak memory usage (MB): {peak_memory // (1024 * 1024)}")

async def main():
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
        "https://example.com/page4"
    ]
    await crawl_parallel(urls, max_concurrent=2)

if __name__ == "__main__":
    asyncio.run(main())
```

**Notes**:

- We **reuse** the same `AsyncWebCrawler` instance for all parallel tasks, launching **one** browser.
- Each parallel sub-task might get its own `session_id` so they don’t share cookies/localStorage (unless that’s desired).
- We limit concurrency to `max_concurrent=2` or 3 to avoid saturating CPU/memory (an alternative pattern is sketched below).
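
If fixed batches feel too coarse (one slow URL holds up its whole batch), a sliding concurrency limit is a common alternative. The sketch below uses only `asyncio.Semaphore` plus the `crawler` and `crawl_config` objects already defined above, and would replace the batching loop inside `crawl_parallel`:

```python
async def crawl_bounded(crawler, crawl_config, urls: List[str], max_concurrent: int = 3):
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded_crawl(idx: int, url: str):
        # At most `max_concurrent` crawls run at any moment; a new one starts
        # as soon as a slot frees up, instead of waiting for a whole batch.
        async with semaphore:
            return await crawler.arun(
                url=url,
                config=crawl_config,
                session_id=f"parallel_session_{idx}",
            )

    return await asyncio.gather(
        *(bounded_crawl(i, u) for i, u in enumerate(urls)),
        return_exceptions=True,
    )
```
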
---

## 4. Performance Tips

1. **Extra Browser Args**
   - `--disable-gpu`, `--no-sandbox` can help in Docker or restricted environments.
   - `--disable-dev-shm-usage` avoids using `/dev/shm`, which can be small on some systems.

2. **Session Reuse**
   - If your site requires a login or you want to maintain local data across URLs, share the **same** `session_id`.
   - If you want isolation (each URL fresh), create unique sessions.

3. **Batching**
   - If you have **many** URLs (like thousands), you can do parallel crawling in chunks (like `max_concurrent=5`).
   - Use `arun_many()` for a built-in approach if you prefer, but the example above is often more flexible (see the sketch after this list).

4. **Cache**
   - If your pages share many resources or you’re re-crawling the same domain repeatedly, consider setting `cache_mode=CacheMode.ENABLED` in `CrawlerRunConfig`.
   - If you need fresh data each time, keep `cache_mode=CacheMode.BYPASS`.

5. **Hooks**
   - You can set up global hooks for each crawler (like to block images) or per-run if you want.
   - Keep them consistent if you’re reusing sessions.
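
Combining tips 3 and 4, here is a minimal sketch of the built-in batching path; the `arun_many()` call mirrors its documented batch usage, and the URLs are placeholders:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def crawl_many(urls):
    # Reuse cached responses where possible instead of re-fetching every page.
    config = CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls=urls, config=config)
        for result in results:
            print(result.url, "->", "ok" if result.success else result.error_message)

if __name__ == "__main__":
    asyncio.run(crawl_many(["https://example.com/page1", "https://example.com/page2"]))
```
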
---

## 5. Summary

- **One** `AsyncWebCrawler` + multiple calls to `.arun()` is far more efficient than launching a new crawler per URL.
- **Sequential** approach with a shared session is simple and memory-friendly for moderate sets of URLs.
- **Parallel** approach can speed up large crawls by concurrency, but keep concurrency balanced to avoid overhead.
- Close the crawler once at the end, ensuring the browser is only opened/closed once.

For even more advanced memory optimizations or dynamic concurrency patterns, see future sections on hooking or distributed crawling. The patterns above suffice for the majority of multi-URL scenarios—**giving you speed, simplicity, and minimal resource usage**. Enjoy your optimized crawling!

@@ -58,7 +58,7 @@ Automatically manages concurrency based on system memory usage:

```python
dispatcher = MemoryAdaptiveDispatcher(
-   memory_threshold_percent=70.0,  # Pause if memory exceeds this
+   memory_threshold_percent=90.0,  # Pause if memory exceeds this
    check_interval=1.0,             # How often to check memory
    max_session_permit=10,          # Maximum concurrent tasks
    rate_limiter=RateLimiter(       # Optional rate limiting

@@ -79,7 +79,7 @@ Provides simple concurrency control with a fixed limit:

```python
dispatcher = SemaphoreDispatcher(
-   semaphore_count=5,          # Fixed concurrent tasks
+   max_session_permit=5,       # Fixed concurrent tasks
    rate_limiter=RateLimiter(   # Optional rate limiting
        base_delay=(0.5, 1.0),
        max_delay=10.0

docs/md_v2/blog/releases/v0.4.3b1.md (266 lines, new file)
@@ -0,0 +1,266 @@

# Crawl4AI 0.4.3b1 is Here: Faster, Smarter, and Ready for Real-World Crawling!

Hey, Crawl4AI enthusiasts! We're thrilled to announce the release of **Crawl4AI 0.4.3b1**, packed with powerful new features and enhancements that take web crawling to a whole new level of efficiency and intelligence. This release is all about giving you more control, better performance, and deeper insights into your crawled data.

Let's dive into what's new!

## 🚀 Major Feature Highlights

### 1. LLM-Powered Schema Generation: Zero to Structured Data in Seconds!

Tired of manually crafting CSS or XPath selectors? We've got you covered! Crawl4AI now features a revolutionary **schema generator** that uses the power of Large Language Models (LLMs) to automatically create extraction schemas for you.

**How it Works:**

1. **Provide HTML**: Feed in a sample HTML snippet that contains the type of data you want to extract (e.g., product listings, article sections).
2. **Describe Your Needs (Optional)**: You can provide a natural language query like "extract all product names and prices" to guide the schema creation.
3. **Choose Your LLM**: Use either **OpenAI** (GPT-4o recommended) for top-tier accuracy or **Ollama** for a local, open-source option.
4. **Get Your Schema**: The tool outputs a ready-to-use JSON schema that works seamlessly with `JsonCssExtractionStrategy` or `JsonXPathExtractionStrategy`.

**Why You'll Love It:**

- **No More Tedious Selector Writing**: Let the LLM analyze the HTML and create the selectors for you!
- **One-Time Cost**: Schema generation uses an LLM, but once you have your schema, subsequent extractions are fast and LLM-free.
- **Handles Complex Structures**: The LLM can understand nested elements, lists, and variations in layout—far beyond what simple CSS selectors can achieve.
- **Learn by Example**: The generated schemas are a fantastic way to learn best practices for writing your own schemas.

**Example:**

```python
import json

from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Sample HTML snippet (imagine this is part of a product listing page)
html = """
<div class="product">
  <h2 class="name">Awesome Gadget</h2>
  <span class="price">$99.99</span>
</div>
"""

# Generate schema using OpenAI
schema = JsonCssExtractionStrategy.generate_schema(
    html,
    llm_provider="openai/gpt-4o",
    api_token="YOUR_API_TOKEN"
)

# Or use Ollama for a local, open-source option
# schema = JsonCssExtractionStrategy.generate_schema(
#     html,
#     llm_provider="ollama/llama3"
# )

print(json.dumps(schema, indent=2))
```

**Output (Schema):**

```json
{
  "name": null,
  "baseSelector": "div.product",
  "fields": [
    {
      "name": "name",
      "selector": "h2.name",
      "type": "text"
    },
    {
      "name": "price",
      "selector": "span.price",
      "type": "text"
    }
  ]
}
```

You can now **save** this schema and use it for all your extractions on pages with the same structure. No more LLM costs, just **fast, reliable** data extraction!
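
For example, here is a minimal sketch of reusing a saved schema for LLM-free extraction. The `product_schema.json` file and listing URL are placeholders, and the constructor/`extraction_strategy` wiring mirrors the library's usual extraction pattern; adjust to your version:

```python
import asyncio
import json

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Load the schema produced by the one-time LLM call above
with open("product_schema.json") as f:
    schema = json.load(f)

async def main():
    config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(schema)
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/products", config=config)
        # extracted_content is a JSON string of items like {"name": ..., "price": ...}
        print(json.loads(result.extracted_content))

if __name__ == "__main__":
    asyncio.run(main())
```
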

### 2. Robots.txt Compliance: Crawl Responsibly

Crawl4AI now respects website rules! With the new `check_robots_txt=True` option in `CrawlerRunConfig`, the crawler automatically fetches, parses, and obeys each site's `robots.txt` file.

**Key Features**:

- **Efficient Caching**: Stores parsed `robots.txt` files locally for 7 days to avoid re-fetching.
- **Automatic Integration**: Works seamlessly with both `arun()` and `arun_many()`.
- **Clear Status Codes**: Returns a 403 status code if a URL is disallowed.
- **Customizable**: Adjust the cache directory and TTL if needed.

**Example**:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,
        check_robots_txt=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/private-page", config=config)
        if result.status_code == 403:
            print("Access denied by robots.txt")

if __name__ == "__main__":
    asyncio.run(main())
```

### 3. Proxy Support in `CrawlerRunConfig`

Need more control over your proxy settings? Now you can configure proxies directly within `CrawlerRunConfig` for each crawl:

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        proxy_config={
            "server": "http://your-proxy.com:8080",
            "username": "your_username",  # Optional
            "password": "your_password"   # Optional
        }
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
```

This allows for dynamic proxy assignment per URL or even per request.
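
For instance, here is a minimal sketch of round-robin proxy rotation that builds a fresh `CrawlerRunConfig` per URL (the proxy addresses and URLs are placeholders):

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

PROXIES = [
    {"server": "http://proxy-a.example.com:8080"},
    {"server": "http://proxy-b.example.com:8080"},
]

async def crawl_with_rotating_proxies(urls):
    async with AsyncWebCrawler() as crawler:
        for i, url in enumerate(urls):
            # Each request gets its own config, so each can point at a different proxy.
            config = CrawlerRunConfig(proxy_config=PROXIES[i % len(PROXIES)])
            result = await crawler.arun(url, config=config)
            print(url, "->", "ok" if result.success else result.error_message)

if __name__ == "__main__":
    asyncio.run(crawl_with_rotating_proxies(["https://example.com/a", "https://example.com/b"]))
```
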
### 4. LLM-Powered Markdown Filtering (Beta)

We're introducing an experimental **`LLMContentFilter`**! This filter, when used with the `DefaultMarkdownGenerator`, can produce highly focused markdown output by using an LLM to analyze content relevance.

**How it Works:**

1. You provide an **instruction** (e.g., "extract only the key technical details").
2. The LLM analyzes each section of the page based on your instruction.
3. Only the most relevant content is included in the final `fit_markdown`.

**Example**:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import LLMContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    llm_filter = LLMContentFilter(
        provider="openai/gpt-4o",
        api_token="YOUR_API_TOKEN",  # Or use "ollama/llama3" with no token
        instruction="Extract the core educational content about Python classes."
    )

    config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator(content_filter=llm_filter)
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            "https://docs.python.org/3/tutorial/classes.html",
            config=config
        )
        print(result.markdown_v2.fit_markdown)

if __name__ == "__main__":
    asyncio.run(main())
```

**Note**: This is a beta feature. We're actively working on improving its accuracy and performance.

### 5. Streamlined `arun_many()` with Dispatchers

We've simplified concurrent crawling! `arun_many()` now intelligently handles multiple URLs, either returning a **list** of results or an **async generator** for streaming.

**Basic Usage (Batch)**:

```python
results = await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com"],
    config=CrawlerRunConfig()
)

for res in results:
    print(res.url, "crawled successfully:", res.success)
```

**Streaming Mode**:

```python
async for result in await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com"],
    config=CrawlerRunConfig(stream=True)
):
    print("Just finished:", result.url)
    # Process each result immediately
```

**Advanced:** You can now customize how `arun_many` handles concurrency by passing a **dispatcher**. See [Advanced Multi-URL Crawling](../advanced/multi-url-crawling.md) for details.
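
As a quick sketch of that idea, the dispatcher values below mirror the dispatcher docs changed in this commit, while the import path and the `dispatcher` keyword are assumptions; check your installed version:

```python
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher  # import path may differ by version

dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=90.0,  # pause new tasks above this memory usage
    max_session_permit=10,          # maximum concurrent crawl sessions
)

results = await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com"],
    config=CrawlerRunConfig(),
    dispatcher=dispatcher,          # assumed keyword for supplying a custom dispatcher
)
```
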
### 6. Enhanced Browser Context Management

We've improved how Crawl4AI manages browser contexts for better resource utilization and session handling.

- **`shared_data` in `CrawlerRunConfig`**: Pass data between hooks using the `shared_data` dictionary.
- **Context Reuse**: The crawler now intelligently reuses browser contexts based on configuration, reducing overhead.
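
A minimal sketch of the `shared_data` idea; exactly how hooks read the dictionary back is version-specific and assumed here:

```python
config = CrawlerRunConfig(
    # This dictionary is intended to be visible to your hooks for the duration
    # of the run, so one hook can stash state for another to pick up.
    shared_data={"run_label": "batch-1", "collected_links": []}
)
```
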

### 7. Faster Scraping with `LXMLWebScrapingStrategy`

Introducing a new, optional **`LXMLWebScrapingStrategy`** that can be **10-20x faster** than the default BeautifulSoup approach for large, complex pages.

**How to Use**:

```python
from crawl4ai import CrawlerRunConfig, LXMLWebScrapingStrategy

config = CrawlerRunConfig(
    scraping_strategy=LXMLWebScrapingStrategy()  # Add this line
)
```

**When to Use**:

- If profiling shows a bottleneck in `WebScrapingStrategy`.
- For very large HTML documents where parsing speed matters.

**Caveats**:

- It might not handle malformed HTML as gracefully as BeautifulSoup.
- We're still gathering data, so report any issues!

---

## Try the Feature Demo Script!

We've prepared a Python script demonstrating these new features. You can find it at:

[**`features_demo.py`**](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/0_4_3b1_feature_demo.py)

**To run the demo:**

1. Make sure you have Crawl4AI installed (`pip install crawl4ai`).
2. Copy the `features_demo.py` script to your local environment.
3. Set your OpenAI API key as an environment variable (if using OpenAI models):
   ```bash
   export OPENAI_API_KEY="your_api_key"
   ```
4. Run the script:
   ```bash
   python features_demo.py
   ```

The script will execute various crawl scenarios, showcasing the new features and printing results to your console.

## Conclusion

Crawl4AI version 0.4.3b1 is a major step forward in flexibility, performance, and ease of use. With automatic schema generation, robots.txt handling, advanced content filtering, and streamlined multi-URL crawling, you can build powerful, efficient, and responsible web scrapers.

We encourage you to try out these new capabilities, explore the updated documentation, and share your feedback! Your input is invaluable as we continue to improve Crawl4AI.

**Stay Connected:**

- **Star** us on [GitHub](https://github.com/unclecode/crawl4ai) to show your support!
- **Follow** [@unclecode](https://twitter.com/unclecode) on Twitter for updates and tips.
- **Join** our community on Discord (link coming soon) to discuss your projects and get help.

Happy crawling!

@@ -181,7 +181,7 @@ from crawl4ai.content_filter_strategy import LLMContentFilter

async def main():
    # Initialize LLM filter with specific instruction
    filter = LLMContentFilter(
-       provider="openai/gpt-4",   # or your preferred provider
+       provider="openai/gpt-4o",  # or your preferred provider
        api_token="your-api-token",  # or use environment variable
        instruction="""
        Focus on extracting the core educational content.