Push async version last changes for merge to main branch
docs/md/full_details/advanced_jsoncss_extraction.md (new file, 282 lines added)
@@ -0,0 +1,282 @@

# Advanced Usage of JsonCssExtractionStrategy

While the basic usage of JsonCssExtractionStrategy is powerful for simple structures, its true potential shines when dealing with complex, nested HTML structures. This section will explore advanced usage scenarios, demonstrating how to extract nested objects, lists, and nested lists.

## Hypothetical Website Example

Let's consider a hypothetical e-commerce website that displays product categories, each containing multiple products. Each product has details, reviews, and related items. This complex structure will allow us to demonstrate various advanced features of JsonCssExtractionStrategy.

Assume the HTML structure looks something like this:

```html
<div class="category">
  <h2 class="category-name">Electronics</h2>
  <div class="product">
    <h3 class="product-name">Smartphone X</h3>
    <p class="product-price">$999</p>
    <div class="product-details">
      <span class="brand">TechCorp</span>
      <span class="model">X-2000</span>
    </div>
    <ul class="product-features">
      <li>5G capable</li>
      <li>6.5" OLED screen</li>
      <li>128GB storage</li>
    </ul>
    <div class="product-reviews">
      <div class="review">
        <span class="reviewer">John D.</span>
        <span class="rating">4.5</span>
        <p class="review-text">Great phone, love the camera!</p>
      </div>
      <div class="review">
        <span class="reviewer">Jane S.</span>
        <span class="rating">5</span>
        <p class="review-text">Best smartphone I've ever owned.</p>
      </div>
    </div>
    <ul class="related-products">
      <li>
        <span class="related-name">Phone Case</span>
        <span class="related-price">$29.99</span>
      </li>
      <li>
        <span class="related-name">Screen Protector</span>
        <span class="related-price">$9.99</span>
      </li>
    </ul>
  </div>
  <!-- More products... -->
</div>
```

Now, let's create a schema to extract this complex structure:

```python
schema = {
    "name": "E-commerce Product Catalog",
    "baseSelector": "div.category",
    "fields": [
        {
            "name": "category_name",
            "selector": "h2.category-name",
            "type": "text"
        },
        {
            "name": "products",
            "selector": "div.product",
            "type": "nested_list",
            "fields": [
                {
                    "name": "name",
                    "selector": "h3.product-name",
                    "type": "text"
                },
                {
                    "name": "price",
                    "selector": "p.product-price",
                    "type": "text"
                },
                {
                    "name": "details",
                    "selector": "div.product-details",
                    "type": "nested",
                    "fields": [
                        {
                            "name": "brand",
                            "selector": "span.brand",
                            "type": "text"
                        },
                        {
                            "name": "model",
                            "selector": "span.model",
                            "type": "text"
                        }
                    ]
                },
                {
                    "name": "features",
                    "selector": "ul.product-features li",
                    "type": "list",
                    "fields": [
                        {
                            "name": "feature",
                            "type": "text"
                        }
                    ]
                },
                {
                    "name": "reviews",
                    "selector": "div.review",
                    "type": "nested_list",
                    "fields": [
                        {
                            "name": "reviewer",
                            "selector": "span.reviewer",
                            "type": "text"
                        },
                        {
                            "name": "rating",
                            "selector": "span.rating",
                            "type": "text"
                        },
                        {
                            "name": "comment",
                            "selector": "p.review-text",
                            "type": "text"
                        }
                    ]
                },
                {
                    "name": "related_products",
                    "selector": "ul.related-products li",
                    "type": "list",
                    "fields": [
                        {
                            "name": "name",
                            "selector": "span.related-name",
                            "type": "text"
                        },
                        {
                            "name": "price",
                            "selector": "span.related-price",
                            "type": "text"
                        }
                    ]
                }
            ]
        }
    ]
}
```

This schema demonstrates several advanced features:

1. **Nested Objects**: The `details` field is a nested object within each product.
2. **Simple Lists**: The `features` field is a simple list of text items.
3. **Nested Lists**: The `products` field is a nested list, where each item is a complex object.
4. **Lists of Objects**: The `reviews` and `related_products` fields are lists of objects.

Let's break down the key concepts:

### Nested Objects

To create a nested object, use `"type": "nested"` and provide a `fields` array for the nested structure:

```python
{
    "name": "details",
    "selector": "div.product-details",
    "type": "nested",
    "fields": [
        {
            "name": "brand",
            "selector": "span.brand",
            "type": "text"
        },
        {
            "name": "model",
            "selector": "span.model",
            "type": "text"
        }
    ]
}
```

### Simple Lists

For a simple list of identical items, use `"type": "list"`:

```python
{
    "name": "features",
    "selector": "ul.product-features li",
    "type": "list",
    "fields": [
        {
            "name": "feature",
            "type": "text"
        }
    ]
}
```

### Nested Lists

For a list of complex objects, use `"type": "nested_list"`:

```python
{
    "name": "products",
    "selector": "div.product",
    "type": "nested_list",
    "fields": [
        # ... fields for each product
    ]
}
```

### Lists of Objects

Similar to nested lists, but typically used for simpler objects within the list:

```python
{
    "name": "related_products",
    "selector": "ul.related-products li",
    "type": "list",
    "fields": [
        {
            "name": "name",
            "selector": "span.related-name",
            "type": "text"
        },
        {
            "name": "price",
            "selector": "span.related-price",
            "type": "text"
        }
    ]
}
```

## Using the Advanced Schema

To use this advanced schema with AsyncWebCrawler:

```python
import json
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_complex_product_data():
    # Assumes the `schema` dict defined above is in scope
    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://gist.githubusercontent.com/githubusercontent/2d7b8ba3cd8ab6cf3c8da771ddb36878/raw/1ae2f90c6861ce7dd84cc50d3df9920dee5e1fd2/sample_ecommerce.html",
            extraction_strategy=extraction_strategy,
            bypass_cache=True,
        )

        assert result.success, "Failed to crawl the page"

        product_data = json.loads(result.extracted_content)
        print(json.dumps(product_data, indent=2))

asyncio.run(extract_complex_product_data())
```

This will produce a structured JSON output that captures the complex hierarchy of the product catalog, including nested objects, lists, and nested lists.
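
For the sample HTML above, the extracted content would look roughly like this (abridged; treat it as an illustration of the shape rather than exact output, which may vary by crawl4ai version):

```json
[
  {
    "category_name": "Electronics",
    "products": [
      {
        "name": "Smartphone X",
        "price": "$999",
        "details": {
          "brand": "TechCorp",
          "model": "X-2000"
        },
        "features": [
          {"feature": "5G capable"},
          {"feature": "6.5\" OLED screen"},
          {"feature": "128GB storage"}
        ],
        "reviews": [
          {"reviewer": "John D.", "rating": "4.5", "comment": "Great phone, love the camera!"},
          {"reviewer": "Jane S.", "rating": "5", "comment": "Best smartphone I've ever owned."}
        ],
        "related_products": [
          {"name": "Phone Case", "price": "$29.99"},
          {"name": "Screen Protector", "price": "$9.99"}
        ]
      }
    ]
  }
]
```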

## Tips for Advanced Usage

1. **Start Simple**: Begin with a basic schema and gradually add complexity.
2. **Test Incrementally**: Test each part of your schema separately before combining them.
3. **Use Chrome DevTools**: The Element Inspector is invaluable for identifying the correct selectors.
4. **Handle Missing Data**: Use the `default` key in your field definitions to handle cases where data might be missing (see the sketch after this list).
5. **Leverage Transforms**: Use the `transform` key to clean or format extracted data, e.g., converting prices to numbers (also shown in the sketch below).
6. **Consider Performance**: Very complex schemas might slow down extraction. Balance complexity with performance needs.
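
As a minimal sketch of tips 4 and 5, a single field definition can combine both keys. Note that `"lowercase"` is an assumed transform name used here for illustration; check which `transform` values your crawl4ai version actually supports:

```python
{
    "name": "brand",
    "selector": "span.brand",
    "type": "text",
    "default": "Unknown",       # fallback used when the selector matches nothing
    "transform": "lowercase"    # assumed transform name; normalizes the extracted text
}
```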

By mastering these advanced techniques, you can use JsonCssExtractionStrategy to extract highly structured data from even the most complex web pages, making it a powerful tool for web scraping and data analysis tasks.

@@ -1,6 +1,6 @@

# Crawl Request Parameters for AsyncWebCrawler

The `arun` method in Crawl4AI's `AsyncWebCrawler` is designed to be highly configurable, allowing you to customize the crawling and extraction process to suit your needs. Below are the parameters you can use with the `arun` method, along with their descriptions, possible values, and examples.
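
At its simplest, a crawl is a single awaited call; a minimal sketch before the parameter-by-parameter details:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        # All parameters below are optional except the URL
        result = await crawler.arun(url="https://www.nbcnews.com/business")
        print(result.markdown[:200])  # first 200 characters of the markdown

asyncio.run(main())
```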

## Parameters

@@ -13,9 +13,9 @@ url = "https://www.nbcnews.com/business"

### word_count_threshold (int)
**Description:** The minimum number of words a block must contain to be considered meaningful. The default value is defined by `MIN_WORD_THRESHOLD`.
**Required:** No
**Default Value:** `MIN_WORD_THRESHOLD`
**Example:**
```python
word_count_threshold = 10
```
@@ -88,43 +88,92 @@ verbose = True

Additional keyword arguments that can be passed to customize the crawling process further. Some notable options include:

- **only_text (bool):** Whether to extract only text content, excluding HTML tags. Default is `False`.
- **session_id (str):** A unique identifier for the crawling session. This is useful for maintaining state across multiple requests.
- **js_code (str or list):** JavaScript code to be executed on the page before extraction.
- **wait_for (str):** A CSS selector or JavaScript function to wait for before considering the page load complete.

**Example:**
```python
result = await crawler.arun(
    url="https://www.nbcnews.com/business",
    css_selector="p",
    only_text=True,
    session_id="unique_session_123",
    js_code="window.scrollTo(0, document.body.scrollHeight);",
    wait_for="article.main-article"
)
```

## Example Usage

Here's an example of how to use the `arun` method with various parameters:

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import CosineStrategy
from crawl4ai.chunking_strategy import NlpSentenceChunking

async def main():
    # Create the AsyncWebCrawler instance
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Run the crawler with custom parameters
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            word_count_threshold=10,
            extraction_strategy=CosineStrategy(semantic_filter="finance"),
            chunking_strategy=NlpSentenceChunking(),
            bypass_cache=True,
            css_selector="div.article-content",
            screenshot=True,
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3",
            verbose=True,
            only_text=True,
            session_id="business_news_session",
            js_code="window.scrollTo(0, document.body.scrollHeight);",
            wait_for="footer"
        )

        print(result)

# Run the async function
asyncio.run(main())
```

This example demonstrates how to configure various parameters to customize the crawling and extraction process using the asynchronous version of Crawl4AI.

## Additional Asynchronous Methods

The `AsyncWebCrawler` class also provides other useful asynchronous methods:

### arun_many
**Description:** Crawl multiple URLs concurrently.
**Example:**
```python
urls = ["https://example1.com", "https://example2.com", "https://example3.com"]
results = await crawler.arun_many(urls, word_count_threshold=10, bypass_cache=True)
```
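
Each entry in `results` is a `CrawlResult` (documented in the next section), so you can check each crawl before using it. A minimal sketch, assuming results come back in the same order as the input URLs:

```python
# Pair each URL with its result and separate successes from failures
for url, res in zip(urls, results):
    if res.success:
        print(f"{url}: extracted {len(res.markdown or '')} characters of markdown")
    else:
        print(f"{url} failed: {res.error_message}")
```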

### aclear_cache
**Description:** Clear the crawler's cache.
**Example:**
```python
await crawler.aclear_cache()
```

### aflush_cache
**Description:** Completely flush the crawler's cache.
**Example:**
```python
await crawler.aflush_cache()
```

### aget_cache_size
**Description:** Get the current size of the cache.
**Example:**
```python
cache_size = await crawler.aget_cache_size()
print(f"Current cache size: {cache_size}")
```

These asynchronous methods allow for efficient and flexible use of the AsyncWebCrawler in various scenarios.

@@ -5,6 +5,9 @@ The `CrawlResult` class is the heart of Crawl4AI's output, encapsulating all the

## Class Definition

```python
from pydantic import BaseModel
from typing import Dict, List, Optional

class CrawlResult(BaseModel):
    url: str
    html: str
```
@@ -17,6 +20,9 @@ class CrawlResult(BaseModel):
```python
    extracted_content: Optional[str] = None
    metadata: Optional[dict] = None
    error_message: Optional[str] = None
    session_id: Optional[str] = None
    responser_headers: Optional[dict] = None
    status_code: Optional[int] = None
```

## Fields Explanation

@@ -34,7 +40,7 @@ A flag indicating whether the crawling and extraction were successful. If any er

The cleaned HTML content of the web page. This field holds the HTML after removing unwanted tags like `<script>`, `<style>`, and others that do not contribute to the useful content.

### `media: Dict[str, List[Dict]]`
A dictionary containing lists of extracted media elements from the web page. The media elements are categorized into images, videos, and audios. Here's how they are structured:

- **Images**: Each image is represented as a dictionary with `src` (source URL) and `alt` (alternate text).
- **Videos**: Each video is represented similarly with `src` and `alt`.
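
Given a `result` returned by `arun`, a quick pass over the extracted images might look like this (a sketch; key names beyond `src` and `alt` may vary by version):

```python
# Iterate over extracted images; each entry is a dict with at least
# 'src' (source URL) and 'alt' (alternate text)
for img in result.media.get("images", []):
    print(img.get("src"), "-", img.get("alt"))
```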
@@ -88,33 +94,11 @@ A dictionary containing metadata extracted from the web page, such as title, des

### `error_message: Optional[str]`
If an error occurs during crawling, this field will contain the error message, helping you debug and understand what went wrong. 🚨

### `session_id: Optional[str]`
A unique identifier for the crawling session. This can be useful for tracking and managing multiple crawling sessions.

### `responser_headers: Optional[dict]`
A dictionary containing the response headers from the web server. This can provide additional information about the server and the response.

### `status_code: Optional[int]`
The HTTP status code of the response. This indicates the success or failure of the HTTP request (e.g., 200 for success, 404 for not found, etc.).
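
Here's a quick sketch of how these fields fit together in practice (the URL is a placeholder):

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://www.example.com")
        if result.success:
            print("Status code:", result.status_code)
            print("Session ID:", result.session_id)
            print("Response headers:", result.responser_headers)  # note the field's actual spelling
        else:
            print("Crawl failed:", result.error_message)

asyncio.run(main())
```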
@@ -1,6 +1,143 @@

## Extraction Strategies 🧠

Crawl4AI offers powerful extraction strategies to derive meaningful information from web content. Let's dive into three of the most important strategies: `CosineStrategy`, `LLMExtractionStrategy`, and the new `JsonCssExtractionStrategy`.

### LLMExtractionStrategy

`LLMExtractionStrategy` leverages a Language Model (LLM) to extract meaningful content from HTML. This strategy uses an external provider for LLM completions to perform extraction based on instructions.

#### When to Use
- Suitable for complex extraction tasks requiring nuanced understanding.
- Ideal for scenarios where detailed instructions can guide the extraction process.
- Perfect for extracting specific types of information or content with precise guidelines.

#### Parameters
- `provider` (str, optional): Provider for language model completions (e.g., openai/gpt-4). Default is `DEFAULT_PROVIDER`.
- `api_token` (str, optional): API token for the provider. If not provided, it will try to load from the environment variable `OPENAI_API_KEY`.
- `instruction` (str, optional): Instructions to guide the LLM on how to perform the extraction. Default is `None`.

#### Example Without Instructions
```python
import asyncio
import os
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Define extraction strategy without instructions
        strategy = LLMExtractionStrategy(
            provider='openai',
            api_token=os.getenv('OPENAI_API_KEY')
        )

        # Sample URL
        url = "https://www.nbcnews.com/business"

        # Run the crawler with the extraction strategy
        result = await crawler.arun(url=url, extraction_strategy=strategy)
        print(result.extracted_content)

asyncio.run(main())
```

#### Example With Instructions
```python
import asyncio
import os
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Define extraction strategy with instructions
        strategy = LLMExtractionStrategy(
            provider='openai',
            api_token=os.getenv('OPENAI_API_KEY'),
            instruction="Extract only financial news and summarize key points."
        )

        # Sample URL
        url = "https://www.nbcnews.com/business"

        # Run the crawler with the extraction strategy
        result = await crawler.arun(url=url, extraction_strategy=strategy)
        print(result.extracted_content)

asyncio.run(main())
```

#### Use Cases for LLMExtractionStrategy
- Extracting specific data types from structured or semi-structured content.
- Generating summaries, extracting key information, or transforming content into different formats.
- Performing detailed extractions based on custom instructions.

### JsonCssExtractionStrategy

`JsonCssExtractionStrategy` is a powerful tool for extracting structured data from HTML using CSS selectors. It allows you to define a schema that maps CSS selectors to specific fields, enabling precise and efficient data extraction.

#### When to Use
- Ideal for extracting structured data from websites with consistent HTML structures.
- Perfect for scenarios where you need to extract specific elements or attributes from a webpage.
- Suitable for creating datasets from web pages with tabular or list-based information.

#### Parameters
- `schema` (Dict[str, Any]): A dictionary defining the extraction schema, including the base selector and field definitions.

#### Example
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Define the extraction schema
        schema = {
            "name": "News Articles",
            "baseSelector": "article.tease-card",
            "fields": [
                {
                    "name": "title",
                    "selector": "h2",
                    "type": "text",
                },
                {
                    "name": "summary",
                    "selector": "div.tease-card__info",
                    "type": "text",
                },
                {
                    "name": "link",
                    "selector": "a",
                    "type": "attribute",
                    "attribute": "href"
                }
            ],
        }

        # Create the extraction strategy
        strategy = JsonCssExtractionStrategy(schema, verbose=True)

        # Sample URL
        url = "https://www.nbcnews.com/business"

        # Run the crawler with the extraction strategy
        result = await crawler.arun(url=url, extraction_strategy=strategy)

        # Parse and print the extracted content
        extracted_data = json.loads(result.extracted_content)
        print(json.dumps(extracted_data, indent=2))

asyncio.run(main())
```

#### Use Cases for JsonCssExtractionStrategy
- Extracting product information from e-commerce websites.
- Gathering news articles and their metadata from news portals.
- Collecting user reviews and ratings from review websites.
- Extracting job listings from job boards.

By choosing the right extraction strategy, you can effectively extract the most relevant and useful information from web content. Whether you need fast, accurate semantic segmentation with `CosineStrategy`, nuanced, instruction-based extraction with `LLMExtractionStrategy`, or precise structured data extraction with `JsonCssExtractionStrategy`, Crawl4AI has you covered. Happy extracting! 🕵️‍♂️✨

For more details on schema definitions and advanced extraction strategies, check out the [Advanced JsonCssExtraction](../full_details/advanced_jsoncss_extraction.md) guide.

### CosineStrategy

@@ -21,96 +158,28 @@ Crawl4AI offers powerful extraction strategies to derive meaningful information

#### Example
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import CosineStrategy

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Define extraction strategy
        strategy = CosineStrategy(
            semantic_filter="finance economy stock market",
            word_count_threshold=10,
            max_dist=0.2,
            linkage_method='ward',
            top_k=3,
            model_name='BAAI/bge-small-en-v1.5'
        )

        # Sample URL
        url = "https://www.nbcnews.com/business"

        # Run the crawler with the extraction strategy
        result = await crawler.arun(url=url, extraction_strategy=strategy)
        print(result.extracted_content)

asyncio.run(main())
```

For more detailed examples, please refer to the [Examples section](../examples/index.md) of the documentation.