Push async version last changes for merge to main branch

unclecode
2024-09-24 20:52:08 +08:00
parent d628bc4034
commit 4d48bd31ca
61 changed files with 6219 additions and 891 deletions

@@ -1,44 +1,34 @@
## Summarization Example
# Summarization Example with AsyncWebCrawler
This example demonstrates how to use `Crawl4AI` to extract a summary from a web page. The goal is to obtain the title, a detailed summary, a brief summary, and a list of keywords from the given page.
This example demonstrates how to use Crawl4AI's `AsyncWebCrawler` to extract a summary from a web page asynchronously. The goal is to obtain the title, a detailed summary, a brief summary, and a list of keywords from the given page.
### Step-by-Step Guide
## Step-by-Step Guide
1. **Import Necessary Modules**
First, import the necessary modules and classes.
First, import the necessary modules and classes:
```python
import os
import time
import json
from crawl4ai.web_crawler import WebCrawler
from crawl4ai.chunking_strategy import *
from crawl4ai.extraction_strategy import *
from crawl4ai.crawler_strategy import *
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.chunking_strategy import RegexChunking
from pydantic import BaseModel, Field
```
2. **Define the URL to be Crawled**
Set the URL of the web page you want to summarize.
Set the URL of the web page you want to summarize:
```python
url = r'https://marketplace.visualstudio.com/items?itemName=Unclecode.groqopilot'
url = 'https://marketplace.visualstudio.com/items?itemName=Unclecode.groqopilot'
```
3. **Initialize the WebCrawler**
3. **Define the Data Model**
Create an instance of the `WebCrawler` and call the `warmup` method.
```python
crawler = WebCrawler()
crawler.warmup()
```
4. **Define the Data Model**
Use Pydantic to define the structure of the extracted data.
Use Pydantic to define the structure of the extracted data:
```python
class PageSummary(BaseModel):
@@ -48,61 +38,116 @@ This example demonstrates how to use `Crawl4AI` to extract a summary from a web
    keywords: list = Field(..., description="Keywords assigned to the page.")
```
5. **Run the Crawler**
4. **Create the Extraction Strategy**
Set up and run the crawler with the `LLMExtractionStrategy`. Provide the necessary parameters, including the schema for the extracted data and the instruction for the LLM.
Set up the `LLMExtractionStrategy` with the necessary parameters:
```python
result = crawler.run(
    url=url,
    word_count_threshold=1,
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv('OPENAI_API_KEY'),
        schema=PageSummary.model_json_schema(),
        extraction_type="schema",
        apply_chunking=False,
        instruction=(
            "From the crawled content, extract the following details: "
            "1. Title of the page "
            "2. Summary of the page, which is a detailed summary "
            "3. Brief summary of the page, which is a paragraph text "
            "4. Keywords assigned to the page, which is a list of keywords. "
            'The extracted JSON format should look like this: '
            '{ "title": "Page Title", "summary": "Detailed summary of the page.", '
            '"brief_summary": "Brief summary in a paragraph.", "keywords": ["keyword1", "keyword2", "keyword3"] }'
        )
    ),
    bypass_cache=True,
extraction_strategy = LLMExtractionStrategy(
    provider="openai/gpt-4o",
    api_token=os.getenv('OPENAI_API_KEY'),
    schema=PageSummary.model_json_schema(),
    extraction_type="schema",
    apply_chunking=False,
    instruction=(
        "From the crawled content, extract the following details: "
        "1. Title of the page "
        "2. Summary of the page, which is a detailed summary "
        "3. Brief summary of the page, which is a paragraph text "
        "4. Keywords assigned to the page, which is a list of keywords. "
        'The extracted JSON format should look like this: '
        '{ "title": "Page Title", "summary": "Detailed summary of the page.", '
        '"brief_summary": "Brief summary in a paragraph.", "keywords": ["keyword1", "keyword2", "keyword3"] }'
    )
)
```
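The strategy above reads the API key with `os.getenv('OPENAI_API_KEY')`, which returns `None` when the variable is unset. A minimal sketch of failing early with a clear message follows; the helper name `require_env` is hypothetical and is not part of Crawl4AI:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is missing or empty."""
    value = os.getenv(name)
    if not value:
        # Hypothetical guard: stop before any crawl or LLM call is attempted
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

Under that assumption, `api_token=require_env("OPENAI_API_KEY")` could stand in for the `os.getenv` call, so a missing key surfaces before the crawl starts.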
6. **Process the Extracted Data**
5. **Define the Async Crawl Function**
Load the extracted content into a JSON object and print it.
Create an asynchronous function to run the crawler:
```python
page_summary = json.loads(result.extracted_content)
print(page_summary)
async def crawl_and_summarize(url):
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url=url,
            word_count_threshold=1,
            extraction_strategy=extraction_strategy,
            chunking_strategy=RegexChunking(),
            bypass_cache=True,
        )
        return result
```
7. **Save the Extracted Data**
6. **Run the Crawler and Process Results**
Save the extracted data to a file for further use.
Use asyncio to run the crawler and process the results:
```python
with open(".data/page_summary.json", "w", encoding="utf-8") as f:
    f.write(result.extracted_content)
async def main():
    result = await crawl_and_summarize(url)
    if result.success:
        page_summary = json.loads(result.extracted_content)
        print("Extracted Page Summary:")
        print(json.dumps(page_summary, indent=2))
        # Save the extracted data
        with open(".data/page_summary.json", "w", encoding="utf-8") as f:
            json.dump(page_summary, f, indent=2)
        print("Page summary saved to .data/page_summary.json")
    else:
        print(f"Failed to crawl and summarize the page. Error: {result.error_message}")
# Run the async main function
asyncio.run(main())
```
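The save step above writes to `.data/page_summary.json`, and `open(..., "w")` raises `FileNotFoundError` if the `.data` directory does not exist yet. A small sketch that creates the parent folder first, using only the standard library; `save_summary` is a hypothetical helper name, not something the example or Crawl4AI defines:

```python
import json
from pathlib import Path


def save_summary(summary: dict, path: str = ".data/page_summary.json") -> Path:
    """Write the summary as pretty-printed JSON, creating parent folders first."""
    target = Path(path)
    # exist_ok=True makes repeated runs safe when the folder is already there
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", encoding="utf-8") as f:
        json.dump(summary, f, indent=2, ensure_ascii=False)
    return target
```

Inside `main`, a call such as `save_summary(page_summary)` could take the place of the `with open(...)` block.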
### Explanation
## Explanation
- **Importing Modules**: Import the necessary modules, including `WebCrawler` and `LLMExtractionStrategy` from `Crawl4AI`.
- **URL Definition**: Set the URL of the web page you want to crawl and summarize.
- **WebCrawler Initialization**: Create an instance of `WebCrawler` and call the `warmup` method to prepare the crawler.
- **Data Model Definition**: Define the structure of the data you want to extract using Pydantic's `BaseModel`.
- **Crawler Execution**: Run the crawler with the `LLMExtractionStrategy`, providing the schema and detailed instructions for the extraction process.
- **Data Processing**: Load the extracted content into a JSON object and print it to verify the results.
- **Data Saving**: Save the extracted data to a file for further use.
- **Importing Modules**: We import the necessary modules, including `AsyncWebCrawler` and `LLMExtractionStrategy` from Crawl4AI.
- **URL Definition**: We set the URL of the web page to crawl and summarize.
- **Data Model Definition**: We define the structure of the data to extract using Pydantic's `BaseModel`.
- **Extraction Strategy Setup**: We create an instance of `LLMExtractionStrategy` with the schema and detailed instructions for the extraction process.
- **Async Crawl Function**: We define an asynchronous function `crawl_and_summarize` that uses `AsyncWebCrawler` to perform the crawling and extraction.
- **Main Execution**: In the `main` function, we run the crawler, process the results, and save the extracted data.
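The main execution step passes `result.extracted_content` straight to `json.loads`. Assuming that field can be empty or hold malformed JSON when the LLM response is unusable (an assumption, not something the example states), a defensive parse might look like this; `parse_extracted` is a hypothetical helper:

```python
import json
from typing import Any, Optional


def parse_extracted(content: Optional[str]) -> Optional[Any]:
    """Parse extracted JSON text, returning None instead of raising on bad input."""
    if not content:
        # Covers both None and the empty string
        return None
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        return None
```

A caller could then treat a `None` return as a failed extraction and report it alongside the crawl error path.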
This example demonstrates how to harness the power of `Crawl4AI` to perform advanced web crawling and data extraction tasks with minimal code.
## Advanced Usage: Crawling Multiple URLs
To demonstrate the power of `AsyncWebCrawler`, here's how you can summarize multiple pages concurrently:
```python
async def crawl_multiple_urls(urls):
    async with AsyncWebCrawler(verbose=True) as crawler:
        tasks = [crawler.arun(
            url=url,
            word_count_threshold=1,
            extraction_strategy=extraction_strategy,
            chunking_strategy=RegexChunking(),
            bypass_cache=True
        ) for url in urls]
        results = await asyncio.gather(*tasks)
        return results
async def main():
    urls = [
        'https://marketplace.visualstudio.com/items?itemName=Unclecode.groqopilot',
        'https://marketplace.visualstudio.com/items?itemName=GitHub.copilot',
        'https://marketplace.visualstudio.com/items?itemName=ms-python.python'
    ]
    results = await crawl_multiple_urls(urls)
    for i, result in enumerate(results):
        if result.success:
            page_summary = json.loads(result.extracted_content)
            print(f"\nSummary for URL {i+1}:")
            print(json.dumps(page_summary, indent=2))
        else:
            print(f"\nFailed to summarize URL {i+1}. Error: {result.error_message}")
asyncio.run(main())
```
This advanced example uses `AsyncWebCrawler` with `asyncio.gather` to summarize several web pages concurrently. Because the page fetches and LLM calls overlap instead of running one after another, total wall-clock time is typically lower than with sequential crawling.
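With many URLs, launching every request at once can run into rate limits or exhaust local resources. One common pattern is to cap concurrency with `asyncio.Semaphore`. The sketch below is illustrative only: `gather_limited` and `fake_crawl` are hypothetical names, and `fake_crawl` sleeps in place of a real `crawler.arun(...)` call.

```python
import asyncio


async def gather_limited(coros, limit: int = 2):
    """Run coroutines concurrently, at most `limit` at a time, preserving input order."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(coro):
        async with semaphore:
            return await coro

    # return_exceptions=True returns exceptions as results instead of raising the first one
    return await asyncio.gather(*(run_one(c) for c in coros), return_exceptions=True)


async def fake_crawl(url: str) -> str:
    # Hypothetical stand-in for crawler.arun(...): sleeps instead of fetching a page
    await asyncio.sleep(0.01)
    return f"summary of {url}"
```

For instance, `asyncio.run(gather_limited([fake_crawl(u) for u in urls], limit=2))` returns one entry per URL in the original order; each entry is either a result or the exception that call raised.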
By leveraging the asynchronous capabilities of Crawl4AI, you can perform advanced web crawling and data extraction tasks with improved efficiency and scalability.