feat(release): prepare v0.4.3 beta release

Prepare the v0.4.3 beta release with major feature additions and improvements:
- Add JsonXPathExtractionStrategy and LLMContentFilter to exports
- Update version to 0.4.3b1
- Improve documentation for dispatchers and markdown generation
- Update development status to Beta
- Reorganize changelog format

BREAKING CHANGE: The memory threshold in MemoryAdaptiveDispatcher is increased to 90%, and the SemaphoreDispatcher concurrency parameter is renamed to max_session_permit
Author: UncleCode
Date: 2025-01-21 21:03:11 +08:00
Parent: d09c611d15
Commit: 16b8d4945b
12 changed files with 885 additions and 287 deletions

# Crawl4AI 0.4.3b1 is Here: Faster, Smarter, and Ready for Real-World Crawling!
Hey, Crawl4AI enthusiasts! We're thrilled to announce the release of **Crawl4AI 0.4.3b1**, packed with powerful new features and enhancements that take web crawling to a whole new level of efficiency and intelligence. This release is all about giving you more control, better performance, and deeper insights into your crawled data.
Let's dive into what's new!
## 🚀 Major Feature Highlights
### 1. LLM-Powered Schema Generation: Zero to Structured Data in Seconds!
Tired of manually crafting CSS or XPath selectors? We've got you covered! Crawl4AI now features a revolutionary **schema generator** that uses the power of Large Language Models (LLMs) to automatically create extraction schemas for you.
**How it Works:**
1. **Provide HTML**: Feed in a sample HTML snippet that contains the type of data you want to extract (e.g., product listings, article sections).
2. **Describe Your Needs (Optional)**: You can provide a natural language query like "extract all product names and prices" to guide the schema creation.
3. **Choose Your LLM**: Use either **OpenAI** (GPT-4o recommended) for top-tier accuracy or **Ollama** for a local, open-source option.
4. **Get Your Schema**: The tool outputs a ready-to-use JSON schema that works seamlessly with `JsonCssExtractionStrategy` or `JsonXPathExtractionStrategy`.
**Why You'll Love It:**
- **No More Tedious Selector Writing**: Let the LLM analyze the HTML and create the selectors for you!
- **One-Time Cost**: Schema generation uses an LLM, but once you have your schema, subsequent extractions are fast and LLM-free.
- **Handles Complex Structures**: The LLM can understand nested elements, lists, and variations in layout—far beyond what simple CSS selectors can achieve.
- **Learn by Example**: The generated schemas are a fantastic way to learn best practices for writing your own schemas.
**Example:**
```python
import json

from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Sample HTML snippet (imagine this is part of a product listing page)
html = """
<div class="product">
    <h2 class="name">Awesome Gadget</h2>
    <span class="price">$99.99</span>
</div>
"""

# Generate schema using OpenAI
schema = JsonCssExtractionStrategy.generate_schema(
    html,
    llm_provider="openai/gpt-4o",
    api_token="YOUR_API_TOKEN"
)

# Or use Ollama for a local, open-source option
# schema = JsonCssExtractionStrategy.generate_schema(
#     html,
#     llm_provider="ollama/llama3"
# )

print(json.dumps(schema, indent=2))
```
**Output (Schema):**
```json
{
  "name": null,
  "baseSelector": "div.product",
  "fields": [
    {
      "name": "name",
      "selector": "h2.name",
      "type": "text"
    },
    {
      "name": "price",
      "selector": "span.price",
      "type": "text"
    }
  ]
}
```
You can now **save** this schema and use it for all your extractions on pages with the same structure. No more LLM costs, just **fast, reliable** data extraction!
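Since the generated schema is plain JSON, persisting and reloading it is trivial — a minimal sketch (the file name is hypothetical):

```python
import json

# The schema produced by generate_schema (as shown above)
schema = {
    "name": None,
    "baseSelector": "div.product",
    "fields": [
        {"name": "name", "selector": "h2.name", "type": "text"},
        {"name": "price", "selector": "span.price", "type": "text"},
    ],
}

# Save once, right after generation...
with open("product_schema.json", "w") as f:
    json.dump(schema, f, indent=2)

# ...then reload on every later run -- no LLM call required
with open("product_schema.json") as f:
    reloaded = json.load(f)

print(reloaded == schema)  # True
```

The reloaded dict can then be passed straight to `JsonCssExtractionStrategy` for LLM-free extraction.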
### 2. Robots.txt Compliance: Crawl Responsibly
Crawl4AI now respects website rules! With the new `check_robots_txt=True` option in `CrawlerRunConfig`, the crawler automatically fetches, parses, and obeys each site's `robots.txt` file.
**Key Features**:
- **Efficient Caching**: Stores parsed `robots.txt` files locally for 7 days to avoid re-fetching.
- **Automatic Integration**: Works seamlessly with both `arun()` and `arun_many()`.
- **Clear Status Codes**: Returns a 403 status code if a URL is disallowed.
- **Customizable**: Adjust the cache directory and TTL if needed.
**Example**:
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,
        check_robots_txt=True
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/private-page", config=config)
        if result.status_code == 403:
            print("Access denied by robots.txt")

if __name__ == "__main__":
    asyncio.run(main())
```
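For intuition, the allow/deny logic matches Python's standard-library `urllib.robotparser` — an illustrative sketch, not Crawl4AI's actual internals:

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that disallows one path for all user agents
robots_txt = """\
User-agent: *
Disallow: /private-page
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("*", "https://example.com/private-page"))  # False
print(rp.can_fetch("*", "https://example.com/other-page"))    # True
```

Crawl4AI performs the equivalent check for you (and caches the parsed file for 7 days), surfacing a disallowed URL as a 403 result.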
### 3. Proxy Support in `CrawlerRunConfig`
Need more control over your proxy settings? Now you can configure proxies directly within `CrawlerRunConfig` for each crawl:
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        proxy_config={
            "server": "http://your-proxy.com:8080",
            "username": "your_username",  # Optional
            "password": "your_password"   # Optional
        }
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)

if __name__ == "__main__":
    asyncio.run(main())
```
This allows for dynamic proxy assignment per URL or even per request.
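One way to do per-URL assignment is to rotate a pool and build a `proxy_config` per crawl — a sketch with hypothetical proxy addresses:

```python
from itertools import cycle

# Hypothetical proxy pool -- substitute your own servers/credentials
proxy_pool = cycle([
    {"server": "http://proxy-a.example.com:8080"},
    {"server": "http://proxy-b.example.com:8080"},
])

urls = ["https://site1.com", "https://site2.com", "https://site3.com"]

# Pair each URL with the next proxy in rotation; each dict can then be
# passed as proxy_config in that URL's CrawlerRunConfig
assignments = {url: next(proxy_pool) for url in urls}

print(assignments["https://site3.com"]["server"])  # proxy-a again (pool wraps)
```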
### 4. LLM-Powered Markdown Filtering (Beta)
We're introducing an experimental **`LLMContentFilter`**! This filter, when used with the `DefaultMarkdownGenerator`, can produce highly focused markdown output by using an LLM to analyze content relevance.
**How it Works:**
1. You provide an **instruction** (e.g., "extract only the key technical details").
2. The LLM analyzes each section of the page based on your instruction.
3. Only the most relevant content is included in the final `fit_markdown`.
**Example**:
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import LLMContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    llm_filter = LLMContentFilter(
        provider="openai/gpt-4o",
        api_token="YOUR_API_TOKEN",  # Or use "ollama/llama3" with no token
        instruction="Extract the core educational content about Python classes."
    )
    config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator(content_filter=llm_filter)
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            "https://docs.python.org/3/tutorial/classes.html",
            config=config
        )
        print(result.markdown_v2.fit_markdown)

if __name__ == "__main__":
    asyncio.run(main())
```
**Note**: This is a beta feature. We're actively working on improving its accuracy and performance.
### 5. Streamlined `arun_many()` with Dispatchers
We've simplified concurrent crawling! `arun_many()` now intelligently handles multiple URLs, either returning a **list** of results or an **async generator** for streaming.
**Basic Usage (Batch)**:
```python
results = await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com"],
    config=CrawlerRunConfig()
)
for res in results:
    print(res.url, "crawled successfully:", res.success)
```
**Streaming Mode**:
```python
async for result in await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com"],
    config=CrawlerRunConfig(stream=True)
):
    print("Just finished:", result.url)
    # Process each result immediately
```
**Advanced:** You can now customize how `arun_many` handles concurrency by passing a **dispatcher**. See [Advanced Multi-URL Crawling](../advanced/multi-url-crawling.md) for details.
### 6. Enhanced Browser Context Management
We've improved how Crawl4AI manages browser contexts for better resource utilization and session handling.
- **`shared_data` in `CrawlerRunConfig`**: Pass data between hooks using the `shared_data` dictionary.
- **Context Reuse**: The crawler now intelligently reuses browser contexts based on configuration, reducing overhead.
### 7. Faster Scraping with `LXMLWebScrapingStrategy`
Introducing a new, optional **`LXMLWebScrapingStrategy`** that can be **10-20x faster** than the default BeautifulSoup approach for large, complex pages.
**How to Use**:
```python
from crawl4ai import CrawlerRunConfig, LXMLWebScrapingStrategy

config = CrawlerRunConfig(
    scraping_strategy=LXMLWebScrapingStrategy()  # Add this line
)
```
**When to Use**:
- If profiling shows a bottleneck in `WebScrapingStrategy`.
- For very large HTML documents where parsing speed matters.
**Caveats**:
- It might not handle malformed HTML as gracefully as BeautifulSoup.
- We're still gathering data, so report any issues!
---
## Try the Feature Demo Script!
We've prepared a Python script demonstrating these new features. You can find it at:
[**`features_demo.py`**](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/0_4_3b1_feature_demo.py)
**To run the demo:**
1. Make sure you have Crawl4AI installed (`pip install crawl4ai`).
2. Copy the `features_demo.py` script to your local environment.
3. Set your OpenAI API key as an environment variable (if using OpenAI models):
```bash
export OPENAI_API_KEY="your_api_key"
```
4. Run the script:
```bash
python features_demo.py
```
The script will execute various crawl scenarios, showcasing the new features and printing results to your console.
## Conclusion
Crawl4AI version 0.4.3b1 is a major step forward in flexibility, performance, and ease of use. With automatic schema generation, robots.txt handling, advanced content filtering, and streamlined multi-URL crawling, you can build powerful, efficient, and responsible web scrapers.
We encourage you to try out these new capabilities, explore the updated documentation, and share your feedback! Your input is invaluable as we continue to improve Crawl4AI.
**Stay Connected:**
- **Star** us on [GitHub](https://github.com/unclecode/crawl4ai) to show your support!
- **Follow** [@unclecode](https://twitter.com/unclecode) on Twitter for updates and tips.
- **Join** our community on Discord (link coming soon) to discuss your projects and get help.
Happy crawling!