# Crawl4AI 0.4.3b1 is Here: Faster, Smarter, and Ready for Real-World Crawling!

Hey, Crawl4AI enthusiasts! We're thrilled to announce the release of **Crawl4AI 0.4.3b1**, packed with powerful new features and enhancements that take web crawling to a whole new level of efficiency and intelligence. This release is all about giving you more control, better performance, and deeper insights into your crawled data.

Let's dive into what's new!

## 🚀 Major Feature Highlights

### 1. LLM-Powered Schema Generation: Zero to Structured Data in Seconds!

Tired of manually crafting CSS or XPath selectors? We've got you covered! Crawl4AI now features a revolutionary **schema generator** that uses the power of Large Language Models (LLMs) to automatically create extraction schemas for you.

**How it Works:**

1. **Provide HTML**: Feed in a sample HTML snippet that contains the type of data you want to extract (e.g., product listings, article sections).
2. **Describe Your Needs (Optional)**: You can provide a natural language query like "extract all product names and prices" to guide the schema creation.
3. **Choose Your LLM**: Use either **OpenAI** (GPT-4o recommended) for top-tier accuracy or **Ollama** for a local, open-source option.
4. **Get Your Schema**: The tool outputs a ready-to-use JSON schema that works seamlessly with `JsonCssExtractionStrategy` or `JsonXPathExtractionStrategy`.

**Why You'll Love It:**

- **No More Tedious Selector Writing**: Let the LLM analyze the HTML and create the selectors for you!
- **One-Time Cost**: Schema generation uses an LLM, but once you have your schema, subsequent extractions are fast and LLM-free.
- **Handles Complex Structures**: The LLM can understand nested elements, lists, and variations in layout that are tedious to capture with hand-written selectors.
- **Learn by Example**: The generated schemas are a fantastic way to learn best practices for writing your own schemas.

**Example:**

```python
import json

from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Sample HTML snippet (imagine this is part of a product listing page)
html = """
<div class="product">
    <h2 class="name">Awesome Gadget</h2>
    <span class="price">$99.99</span>
</div>
"""

# Generate schema using OpenAI
schema = JsonCssExtractionStrategy.generate_schema(
    html,
    llm_provider="openai/gpt-4o",
    api_token="YOUR_API_TOKEN"
)

# Or use Ollama for a local, open-source option
# schema = JsonCssExtractionStrategy.generate_schema(
#     html,
#     llm_provider="ollama/llama3"
# )

print(json.dumps(schema, indent=2))
```

**Output (Schema):**

```json
{
  "name": null,
  "baseSelector": "div.product",
  "fields": [
    {
      "name": "name",
      "selector": "h2.name",
      "type": "text"
    },
    {
      "name": "price",
      "selector": "span.price",
      "type": "text"
    }
  ]
}
```

You can now **save** this schema and use it for all your extractions on pages with the same structure. No more LLM costs, just **fast, reliable** data extraction!

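Saving the schema can be as simple as writing it to a JSON file. Here is a minimal sketch using only the standard library; the file name `product_schema.json` and the helper names `save_schema` and `load_schema` are made up for illustration and are not part of Crawl4AI.

```python
import json
import tempfile
from pathlib import Path

# The schema from the output above, written as a Python dict.
schema = {
    "name": None,
    "baseSelector": "div.product",
    "fields": [
        {"name": "name", "selector": "h2.name", "type": "text"},
        {"name": "price", "selector": "span.price", "type": "text"},
    ],
}


def save_schema(schema: dict, path: Path) -> None:
    """Write the schema to disk as pretty-printed JSON (hypothetical helper)."""
    path.write_text(json.dumps(schema, indent=2), encoding="utf-8")


def load_schema(path: Path) -> dict:
    """Read a previously saved schema back into a dict (hypothetical helper)."""
    return json.loads(path.read_text(encoding="utf-8"))


# Hypothetical location; a temp directory keeps the sketch free of side effects.
schema_path = Path(tempfile.mkdtemp()) / "product_schema.json"
save_schema(schema, schema_path)
reloaded = load_schema(schema_path)
print(reloaded == schema)
```

The reloaded dict is the same object shape the generator produced, so it should be usable wherever the freshly generated schema was, without another LLM call.
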
### 2. Robots.txt Compliance: Crawl Responsibly

Crawl4AI now respects website rules! With the new `check_robots_txt=True` option in `CrawlerRunConfig`, the crawler automatically fetches, parses, and obeys each site's `robots.txt` file.

**Key Features**:

- **Efficient Caching**: Stores parsed `robots.txt` files locally for 7 days to avoid re-fetching.
- **Automatic Integration**: Works seamlessly with both `arun()` and `arun_many()`.
- **Clear Status Codes**: Returns a 403 status code if a URL is disallowed.
- **Customizable**: Adjust the cache directory and TTL if needed.

**Example**:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,
        check_robots_txt=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/private-page", config=config)
        if result.status_code == 403:
            print("Access denied by robots.txt")

if __name__ == "__main__":
    asyncio.run(main())
```

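If you're curious what a `robots.txt` check boils down to, here is a small standalone sketch using Python's built-in `urllib.robotparser`. It is only an illustration of the idea, not Crawl4AI's internal implementation; the `is_allowed` helper and the sample rules are assumptions made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
ROBOTS = """\
User-agent: *
Disallow: /private-page
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given rules let user_agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules from text, no network call
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS, "https://example.com/private-page"))  # False
print(is_allowed(ROBOTS, "https://example.com/public"))        # True
```

A disallowed URL here corresponds to the case where the crawler reports a 403 instead of fetching the page.
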
### 3. Proxy Support in `CrawlerRunConfig`

Need more control over your proxy settings? Now you can configure proxies directly within `CrawlerRunConfig` for each crawl:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        proxy_config={
            "server": "http://your-proxy.com:8080",
            "username": "your_username",  # Optional
            "password": "your_password"   # Optional
        }
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)

if __name__ == "__main__":
    asyncio.run(main())
```

Because the proxy now lives in the run config rather than the browser setup, you can assign a different proxy per URL or even per request.

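One way to picture per-URL assignment is a simple round-robin over a proxy pool. The sketch below is plain Python with no Crawl4AI calls; the proxy addresses and the `assign_proxies` helper are hypothetical, and each resulting dict has the same shape as the `proxy_config` shown above.

```python
from itertools import cycle

# Hypothetical proxy pool; replace with your own servers.
PROXIES = [
    {"server": "http://proxy-a.example:8080"},
    {"server": "http://proxy-b.example:8080"},
]


def assign_proxies(urls, proxies):
    """Pair each URL with the next proxy in round-robin order."""
    pool = cycle(proxies)
    return [(url, next(pool)) for url in urls]


urls = [
    "https://example.com/1",
    "https://example.com/2",
    "https://example.com/3",
]

for url, proxy_config in assign_proxies(urls, PROXIES):
    # Each dict could be passed as proxy_config when building that URL's run config.
    print(url, "->", proxy_config["server"])
```

Round-robin is just one policy; random choice or per-domain pinning would slot into the same place.
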
### 4. LLM-Powered Markdown Filtering (Beta)

We're introducing an experimental **`LLMContentFilter`**! This filter, when used with the `DefaultMarkdownGenerator`, can produce highly focused markdown output by using an LLM to analyze content relevance.

**How it Works:**

1. You provide an **instruction** (e.g., "extract only the key technical details").
2. The LLM analyzes each section of the page based on your instruction.
3. Only the most relevant content is included in the final `fit_markdown`.

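To make the three steps concrete, here is a toy sketch of the general pattern: split the page into sections, ask a judge whether each one is relevant, and keep only the ones that pass. This is not how `LLMContentFilter` is implemented; the judge below is a plain keyword check standing in for the LLM call, and every function name is made up for illustration.

```python
import re
from typing import Callable, List


def split_sections(markdown: str) -> List[str]:
    """Split markdown into sections, each starting at a heading line."""
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [p.strip() for p in parts if p.strip()]


def filter_sections(markdown: str, is_relevant: Callable[[str], bool]) -> str:
    """Keep only the sections the judge marks as relevant."""
    kept = [s for s in split_sections(markdown) if is_relevant(s)]
    return "\n\n".join(kept)


def keyword_judge(section: str) -> bool:
    """Stand-in for the LLM: 'relevant' means the section mentions classes."""
    return "class" in section.lower()


page = """# Python Classes
Classes bundle data and behavior.

# Site Navigation
Home, About, Contact
"""

print(filter_sections(page, keyword_judge))
```

In the real filter the judge is an LLM guided by your instruction, which is what lets it handle relevance that no keyword list would catch.
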
**Example**:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import LLMContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    llm_filter = LLMContentFilter(
        provider="openai/gpt-4o",
        api_token="YOUR_API_TOKEN",  # Or use "ollama/llama3" with no token
        instruction="Extract the core educational content about Python classes."
    )

    config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator(content_filter=llm_filter)
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            "https://docs.python.org/3/tutorial/classes.html",
            config=config
        )
        print(result.markdown_v2.fit_markdown)

if __name__ == "__main__":
    asyncio.run(main())
```

**Note**: This is a beta feature. We're actively working on improving its accuracy and performance.

### 5. Streamlined `arun_many()` with Dispatchers

We've simplified concurrent crawling! `arun_many()` now intelligently handles multiple URLs, returning either a **list** of results or, with `stream=True` in the config, an **async generator** for streaming.

**Basic Usage (Batch)**:

```python
# Run inside an async function with an open AsyncWebCrawler
results = await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com"],
    config=CrawlerRunConfig()
)

for res in results:
    print(res.url, "crawled successfully:", res.success)
```

**Streaming Mode**:

```python
async for result in await crawler.arun_many(
    urls=["https://site1.com", "https://site2.com"],
    config=CrawlerRunConfig(stream=True)
):
    print("Just finished:", result.url)
    # Process each result immediately
```

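If the difference between the two modes feels abstract, this plain-`asyncio` sketch shows the same contrast without any crawler at all. `fake_crawl` is a hypothetical stand-in that just sleeps, and nothing here reflects how Crawl4AI implements `arun_many()` internally.

```python
import asyncio
from typing import AsyncIterator, List

# (url, simulated crawl time in seconds); values are made up for illustration.
JOBS = [("https://site1.com", 0.05), ("https://site2.com", 0.01)]


async def fake_crawl(url: str, delay: float) -> str:
    """Hypothetical stand-in for a real crawl: sleep, then return the URL."""
    await asyncio.sleep(delay)
    return url


async def crawl_batch() -> List[str]:
    """Batch mode: wait for everything, then return results in input order."""
    return list(await asyncio.gather(*(fake_crawl(u, d) for u, d in JOBS)))


async def crawl_stream() -> AsyncIterator[str]:
    """Streaming mode: yield each result as soon as it is ready."""
    tasks = [asyncio.create_task(fake_crawl(u, d)) for u, d in JOBS]
    for finished in asyncio.as_completed(tasks):
        yield await finished


async def main() -> None:
    print("batch:", await crawl_batch())
    async for url in crawl_stream():
        print("streamed:", url)


if __name__ == "__main__":
    asyncio.run(main())
```

Batch mode is simpler when you need all results together; streaming lets you start processing fast pages while slow ones are still in flight.
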
**Advanced:** You can now customize how `arun_many` handles concurrency by passing a **dispatcher**. See [Advanced Multi-URL Crawling](../advanced/multi-url-crawling.md) for details.

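For a feel of what "controlling concurrency" means in practice, here is a generic `asyncio.Semaphore` sketch that caps how many jobs run at once. It is an illustration of the general technique only; the `run_limited` helper and its `max_concurrent` parameter are assumptions for this example and do not describe the dispatcher classes or their parameters.

```python
import asyncio
from typing import List


async def run_limited(urls: List[str], max_concurrent: int) -> int:
    """Run one fake job per URL and return the peak number running at once."""
    semaphore = asyncio.Semaphore(max_concurrent)
    running = 0
    peak = 0

    async def job(url: str) -> None:
        nonlocal running, peak
        async with semaphore:           # wait here until a slot is free
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)   # hypothetical stand-in for the crawl
            running -= 1

    await asyncio.gather(*(job(u) for u in urls))
    return peak


if __name__ == "__main__":
    urls = [f"https://example.com/{i}" for i in range(10)]
    print("peak concurrency:", asyncio.run(run_limited(urls, max_concurrent=3)))
```

Ten jobs are queued, but the semaphore never lets more than three run at the same time.
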
### 6. Enhanced Browser Context Management

We've improved how Crawl4AI manages browser contexts for better resource utilization and session handling.

- **`shared_data` in `CrawlerRunConfig`**: Pass data between hooks using the `shared_data` dictionary.
- **Context Reuse**: The crawler now intelligently reuses browser contexts based on configuration, reducing overhead.

### 7. Faster Scraping with `LXMLWebScrapingStrategy`

Introducing a new, optional **`LXMLWebScrapingStrategy`** that can be **10-20x faster** than the default BeautifulSoup approach for large, complex pages.

**How to Use**:

```python
from crawl4ai import CrawlerRunConfig, LXMLWebScrapingStrategy

config = CrawlerRunConfig(
    scraping_strategy=LXMLWebScrapingStrategy()  # Add this line
)
```

**When to Use**:

- If profiling shows a bottleneck in `WebScrapingStrategy`.
- For very large HTML documents where parsing speed matters.

**Caveats**:

- It might not handle malformed HTML as gracefully as BeautifulSoup.
- We're still gathering data, so report any issues!

---

## Try the Feature Demo Script!

We've prepared a Python script demonstrating these new features. You can find it at:

[**`features_demo.py`**](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/0_4_3b1_feature_demo.py)

**To run the demo:**

1. Make sure you have Crawl4AI installed (`pip install crawl4ai`).
2. Copy the script to your local environment and save it as `features_demo.py`.
3. Set your OpenAI API key as an environment variable (if using OpenAI models):

    ```bash
    export OPENAI_API_KEY="your_api_key"
    ```

4. Run the script:

    ```bash
    python features_demo.py
    ```

The script will execute various crawl scenarios, showcasing the new features and printing results to your console.

## Conclusion

Crawl4AI version 0.4.3b1 is a major step forward in flexibility, performance, and ease of use. With automatic schema generation, robots.txt handling, advanced content filtering, and streamlined multi-URL crawling, you can build powerful, efficient, and responsible web scrapers.

We encourage you to try out these new capabilities, explore the updated documentation, and share your feedback! Your input is invaluable as we continue to improve Crawl4AI.

**Stay Connected:**

- **Star** us on [GitHub](https://github.com/unclecode/crawl4ai) to show your support!
- **Follow** [@unclecode](https://twitter.com/unclecode) on Twitter for updates and tips.
- **Join** our community on Discord (link coming soon) to discuss your projects and get help.

Happy crawling!