Update Documentation

This commit is contained in:
UncleCode
2024-10-27 19:24:46 +08:00
parent 38474bd66a
commit 4239654722
111 changed files with 7680 additions and 53 deletions

View File

@@ -0,0 +1,110 @@
# Hooks & Auth for AsyncWebCrawler
Crawl4AI's AsyncWebCrawler allows you to customize the behavior of the web crawler using hooks. Hooks are asynchronous functions that are called at specific points in the crawling process, allowing you to modify the crawler's behavior or perform additional actions. This example demonstrates how to use various hooks to customize the asynchronous crawling process.
## Example: Using Crawler Hooks with AsyncWebCrawler
Let's see how we can customize the AsyncWebCrawler using hooks! In this example, we'll:
1. Configure the browser when it's created.
2. Add custom headers before navigating to the URL.
3. Log the current URL after navigation.
4. Perform actions after JavaScript execution.
5. Log the length of the HTML before returning it.
### Hook Definitions
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
from playwright.async_api import Page, Browser
async def on_browser_created(browser: Browser):
print("[HOOK] on_browser_created")
# Example customization: set browser viewport size
context = await browser.new_context(viewport={'width': 1920, 'height': 1080})
page = await context.new_page()
# Example customization: logging in to a hypothetical website
await page.goto('https://example.com/login')
await page.fill('input[name="username"]', 'testuser')
await page.fill('input[name="password"]', 'password123')
await page.click('button[type="submit"]')
await page.wait_for_selector('#welcome')
# Add a custom cookie
await context.add_cookies([{'name': 'test_cookie', 'value': 'cookie_value', 'url': 'https://example.com'}])
await page.close()
await context.close()
async def before_goto(page: Page):
print("[HOOK] before_goto")
# Example customization: add custom headers
await page.set_extra_http_headers({'X-Test-Header': 'test'})
async def after_goto(page: Page):
print("[HOOK] after_goto")
# Example customization: log the URL
print(f"Current URL: {page.url}")
async def on_execution_started(page: Page):
print("[HOOK] on_execution_started")
# Example customization: perform actions after JS execution
await page.evaluate("console.log('Custom JS executed')")
async def before_return_html(page: Page, html: str):
print("[HOOK] before_return_html")
# Example customization: log the HTML length
print(f"HTML length: {len(html)}")
return page
```
### Using the Hooks with the AsyncWebCrawler
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
async def main():
print("\n🔗 Using Crawler Hooks: Let's see how we can customize the AsyncWebCrawler using hooks!")
crawler_strategy = AsyncPlaywrightCrawlerStrategy(verbose=True)
crawler_strategy.set_hook('on_browser_created', on_browser_created)
crawler_strategy.set_hook('before_goto', before_goto)
crawler_strategy.set_hook('after_goto', after_goto)
crawler_strategy.set_hook('on_execution_started', on_execution_started)
crawler_strategy.set_hook('before_return_html', before_return_html)
async with AsyncWebCrawler(verbose=True, crawler_strategy=crawler_strategy) as crawler:
result = await crawler.arun(
url="https://example.com",
js_code="window.scrollTo(0, document.body.scrollHeight);",
wait_for="footer"
)
print("📦 Crawler Hooks result:")
print(result)
asyncio.run(main())
```
### Explanation
- `on_browser_created`: This hook is called when the Playwright browser is created. It sets up the browser context, logs in to a website, and adds a custom cookie.
- `before_goto`: This hook is called right before Playwright navigates to the URL. It adds custom HTTP headers.
- `after_goto`: This hook is called after Playwright navigates to the URL. It logs the current URL.
- `on_execution_started`: This hook is called after any custom JavaScript is executed. It performs additional JavaScript actions.
- `before_return_html`: This hook is called before returning the HTML content. It logs the length of the HTML content.
### Additional Ideas
- **Handling authentication**: Use the `on_browser_created` hook to handle login processes or set authentication tokens.
- **Dynamic header modification**: Modify headers based on the target URL or other conditions in the `before_goto` hook.
- **Content verification**: Use the `after_goto` hook to verify that the expected content is present on the page.
- **Custom JavaScript injection**: Inject and execute custom JavaScript using the `on_execution_started` hook.
- **Content preprocessing**: Modify or analyze the HTML content in the `before_return_html` hook before it's returned.
By using these hooks, you can customize the behavior of the AsyncWebCrawler to suit your specific needs, including handling authentication, modifying requests, and preprocessing content.

View File

@@ -0,0 +1,33 @@
# Examples
Welcome to the examples section of Crawl4AI documentation! In this section, you will find practical examples demonstrating how to use Crawl4AI for various web crawling and data extraction tasks. Each example is designed to showcase different features and capabilities of the library.
## Examples Index
### [LLM Extraction](llm_extraction.md)
This example demonstrates how to use Crawl4AI to extract information using Large Language Models (LLMs). You will learn how to configure the `LLMExtractionStrategy` to get structured data from web pages.
### [JSON CSS Extraction](json_css_extraction.md)
This example demonstrates how to use Crawl4AI to extract structured data without using LLM, and just focusing on page structure. You will learn how to use the `JsonCssExtractionStrategy` to extract data using CSS selectors.
### [JS Execution & CSS Filtering](js_execution_css_filtering.md)
Learn how to execute custom JavaScript code and filter data using CSS selectors. This example shows how to perform complex web interactions and extract specific content from web pages.
### [Hooks & Auth](hooks_auth.md)
This example covers the use of custom hooks for authentication and other pre-crawling tasks. You will see how to set up hooks to modify headers, authenticate sessions, and perform other preparatory actions before crawling.
### [Summarization](summarization.md)
Discover how to use Crawl4AI to summarize web page content. This example demonstrates the summarization capabilities of the library, helping you extract concise information from lengthy web pages.
### [Research Assistant](research_assistant.md)
In this example, Crawl4AI is used as a research assistant to gather and organize information from multiple sources. You will learn how to use various extraction and chunking strategies to compile a comprehensive report.
---
Each example includes detailed explanations and code snippets to help you understand and implement the features in your projects. Click on the links to explore each example and start making the most of Crawl4AI!

View File

@@ -0,0 +1,104 @@
# JS Execution & CSS Filtering with AsyncWebCrawler
In this example, we'll demonstrate how to use Crawl4AI's AsyncWebCrawler to execute JavaScript, filter data with CSS selectors, and use a cosine similarity strategy to extract relevant content. This approach is particularly useful when you need to interact with dynamic content on web pages, such as clicking "Load More" buttons.
## Example: Extracting Structured Data Asynchronously
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.chunking_strategy import RegexChunking
from crawl4ai.extraction_strategy import CosineStrategy
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
async def main():
# Define the JavaScript code to click the "Load More" button
js_code = """
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
if (loadMoreButton) {
loadMoreButton.click();
// Wait for new content to load
await new Promise(resolve => setTimeout(resolve, 2000));
}
"""
# Define a wait_for function to ensure content is loaded
wait_for = """
() => {
const articles = document.querySelectorAll('article.tease-card');
return articles.length > 10;
}
"""
async with AsyncWebCrawler(verbose=True) as crawler:
# Run the crawler with keyword filtering and CSS selector
result = await crawler.arun(
url="https://www.nbcnews.com/business",
js_code=js_code,
wait_for=wait_for,
css_selector="article.tease-card",
extraction_strategy=CosineStrategy(
semantic_filter="technology",
),
chunking_strategy=RegexChunking(),
)
# Display the extracted result
print(result.extracted_content)
# Run the async function
asyncio.run(main())
```
### Explanation
1. **Asynchronous Execution**: We use `AsyncWebCrawler` with async/await syntax for non-blocking execution.
2. **JavaScript Execution**: The `js_code` variable contains JavaScript code that simulates clicking a "Load More" button and waits for new content to load.
3. **Wait Condition**: The `wait_for` function ensures that the page has loaded more than 10 articles before proceeding with the extraction.
4. **CSS Selector**: The `css_selector="article.tease-card"` parameter ensures that only article cards are extracted from the web page.
5. **Extraction Strategy**: The `CosineStrategy` is used with a semantic filter for "technology" to extract relevant content based on cosine similarity.
6. **Chunking Strategy**: We use `RegexChunking()` to split the content into manageable chunks for processing.
## Advanced Usage: Custom Session and Multiple Requests
For more complex scenarios where you need to maintain state across multiple requests or execute additional JavaScript after the initial page load, you can use a custom session:
```python
async def advanced_crawl():
async with AsyncWebCrawler(verbose=True) as crawler:
# Initial crawl with custom session
result1 = await crawler.arun(
url="https://www.nbcnews.com/business",
js_code=js_code,
wait_for=wait_for,
css_selector="article.tease-card",
session_id="business_session"
)
# Execute additional JavaScript in the same session
result2 = await crawler.crawler_strategy.execute_js(
session_id="business_session",
js_code="window.scrollTo(0, document.body.scrollHeight);",
wait_for_js="() => window.innerHeight + window.scrollY >= document.body.offsetHeight"
)
# Process results
print("Initial crawl result:", result1.extracted_content)
print("Additional JS execution result:", result2.html)
asyncio.run(advanced_crawl())
```
This advanced example demonstrates how to:
1. Use a custom session to maintain state across requests.
2. Execute additional JavaScript after the initial page load.
3. Wait for specific conditions using JavaScript functions.
## Try It Yourself
These examples demonstrate the power and flexibility of Crawl4AI's AsyncWebCrawler in handling complex web interactions and extracting meaningful data asynchronously. You can customize the JavaScript code, CSS selectors, extraction strategies, and waiting conditions to suit your specific requirements.

View File

@@ -0,0 +1,142 @@
# JSON CSS Extraction Strategy with AsyncWebCrawler
The `JsonCssExtractionStrategy` is a powerful feature of Crawl4AI that allows you to extract structured data from web pages using CSS selectors. This method is particularly useful when you need to extract specific data points from a consistent HTML structure, such as tables or repeated elements. Here's how to use it with the AsyncWebCrawler.
## Overview
The `JsonCssExtractionStrategy` works by defining a schema that specifies:
1. A base CSS selector for the repeating elements
2. Fields to extract from each element, each with its own CSS selector
This strategy is fast and efficient, as it doesn't rely on external services like LLMs for extraction.
## Example: Extracting Cryptocurrency Prices from Coinbase
Let's look at an example that extracts cryptocurrency prices from the Coinbase explore page.
```python
import json
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
async def extract_structured_data_using_css_extractor():
print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
# Define the extraction schema
schema = {
"name": "Coinbase Crypto Prices",
"baseSelector": ".cds-tableRow-t45thuk",
"fields": [
{
"name": "crypto",
"selector": "td:nth-child(1) h2",
"type": "text",
},
{
"name": "symbol",
"selector": "td:nth-child(1) p",
"type": "text",
},
{
"name": "price",
"selector": "td:nth-child(2)",
"type": "text",
}
],
}
# Create the extraction strategy
extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
# Use the AsyncWebCrawler with the extraction strategy
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(
url="https://www.coinbase.com/explore",
extraction_strategy=extraction_strategy,
bypass_cache=True,
)
assert result.success, "Failed to crawl the page"
# Parse the extracted content
crypto_prices = json.loads(result.extracted_content)
print(f"Successfully extracted {len(crypto_prices)} cryptocurrency prices")
print(json.dumps(crypto_prices[0], indent=2))
return crypto_prices
# Run the async function
asyncio.run(extract_structured_data_using_css_extractor())
```
## Explanation of the Schema
The schema defines how to extract the data:
- `name`: A descriptive name for the extraction task.
- `baseSelector`: The CSS selector for the repeating elements (in this case, table rows).
- `fields`: An array of fields to extract from each element:
- `name`: The name to give the extracted data.
- `selector`: The CSS selector to find the specific data within the base element.
- `type`: The type of data to extract (usually "text" for textual content).
## Advantages of JsonCssExtractionStrategy
1. **Speed**: CSS selectors are fast to execute, making this method efficient for large datasets.
2. **Precision**: You can target exactly the elements you need.
3. **Structured Output**: The result is already structured as JSON, ready for further processing.
4. **No External Dependencies**: Unlike LLM-based strategies, this doesn't require any API calls to external services.
## Tips for Using JsonCssExtractionStrategy
1. **Inspect the Page**: Use browser developer tools to identify the correct CSS selectors.
2. **Test Selectors**: Verify your selectors in the browser console before using them in the script.
3. **Handle Dynamic Content**: If the page uses JavaScript to load content, you may need to combine this with JS execution (see the Advanced Usage section).
4. **Error Handling**: Always check the `result.success` flag and handle potential failures.
## Advanced Usage: Combining with JavaScript Execution
For pages that load data dynamically, you can combine the `JsonCssExtractionStrategy` with JavaScript execution:
```python
async def extract_dynamic_structured_data():
schema = {
"name": "Dynamic Crypto Prices",
"baseSelector": ".crypto-row",
"fields": [
{"name": "name", "selector": ".crypto-name", "type": "text"},
{"name": "price", "selector": ".crypto-price", "type": "text"},
]
}
js_code = """
window.scrollTo(0, document.body.scrollHeight);
await new Promise(resolve => setTimeout(resolve, 2000)); // Wait for 2 seconds
"""
extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(
url="https://example.com/crypto-prices",
extraction_strategy=extraction_strategy,
js_code=js_code,
wait_for=".crypto-row:nth-child(20)", # Wait for 20 rows to load
bypass_cache=True,
)
crypto_data = json.loads(result.extracted_content)
print(f"Extracted {len(crypto_data)} cryptocurrency entries")
asyncio.run(extract_dynamic_structured_data())
```
This advanced example demonstrates how to:
1. Execute JavaScript to trigger dynamic content loading.
2. Wait for a specific condition (20 rows loaded) before extraction.
3. Extract data from the dynamically loaded content.
By mastering the `JsonCssExtractionStrategy`, you can efficiently extract structured data from a wide variety of web pages, making it a valuable tool in your web scraping toolkit.
For more details on schema definitions and advanced extraction strategies, check out the[Advanced JsonCssExtraction](../full_details/advanced_jsoncss_extraction.md).

View File

@@ -0,0 +1,179 @@
# LLM Extraction with AsyncWebCrawler
Crawl4AI's AsyncWebCrawler allows you to use Language Models (LLMs) to extract structured data or relevant content from web pages asynchronously. Below are two examples demonstrating how to use `LLMExtractionStrategy` for different purposes with the AsyncWebCrawler.
## Example 1: Extract Structured Data
In this example, we use the `LLMExtractionStrategy` to extract structured data (model names and their fees) from the OpenAI pricing page.
```python
import os
import json
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
class OpenAIModelFee(BaseModel):
model_name: str = Field(..., description="Name of the OpenAI model.")
input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
async def extract_openai_fees():
url = 'https://openai.com/api/pricing/'
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(
url=url,
word_count_threshold=1,
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY'),
schema=OpenAIModelFee.model_json_schema(),
extraction_type="schema",
instruction="From the crawled content, extract all mentioned model names along with their "
"fees for input and output tokens. Make sure not to miss anything in the entire content. "
'One extracted model JSON format should look like this: '
'{ "model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens" }'
),
bypass_cache=True,
)
model_fees = json.loads(result.extracted_content)
print(f"Number of models extracted: {len(model_fees)}")
with open(".data/openai_fees.json", "w", encoding="utf-8") as f:
json.dump(model_fees, f, indent=2)
asyncio.run(extract_openai_fees())
```
## Example 2: Extract Relevant Content
In this example, we instruct the LLM to extract only content related to technology from the NBC News business page.
```python
import os
import json
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
async def extract_tech_content():
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business",
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY'),
instruction="Extract only content related to technology"
),
bypass_cache=True,
)
tech_content = json.loads(result.extracted_content)
print(f"Number of tech-related items extracted: {len(tech_content)}")
with open(".data/tech_content.json", "w", encoding="utf-8") as f:
json.dump(tech_content, f, indent=2)
asyncio.run(extract_tech_content())
```
## Advanced Usage: Combining JS Execution with LLM Extraction
This example demonstrates how to combine JavaScript execution with LLM extraction to handle dynamic content:
```python
async def extract_dynamic_content():
js_code = """
const loadMoreButton = Array.from(document.querySelectorAll('button')).find(button => button.textContent.includes('Load More'));
if (loadMoreButton) {
loadMoreButton.click();
await new Promise(resolve => setTimeout(resolve, 2000));
}
"""
wait_for = """
() => {
const articles = document.querySelectorAll('article.tease-card');
return articles.length > 10;
}
"""
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(
url="https://www.nbcnews.com/business",
js_code=js_code,
wait_for=wait_for,
css_selector="article.tease-card",
extraction_strategy=LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY'),
instruction="Summarize each article, focusing on technology-related content"
),
bypass_cache=True,
)
summaries = json.loads(result.extracted_content)
print(f"Number of summarized articles: {len(summaries)}")
with open(".data/tech_summaries.json", "w", encoding="utf-8") as f:
json.dump(summaries, f, indent=2)
asyncio.run(extract_dynamic_content())
```
## Customizing LLM Provider
Crawl4AI uses the `litellm` library under the hood, which allows you to use any LLM provider you want. Just pass the correct model name and API token:
```python
extraction_strategy=LLMExtractionStrategy(
provider="your_llm_provider/model_name",
api_token="your_api_token",
instruction="Your extraction instruction"
)
```
This flexibility allows you to integrate with various LLM providers and tailor the extraction process to your specific needs.
## Error Handling and Retries
When working with external LLM APIs, it's important to handle potential errors and implement retry logic. Here's an example of how you might do this:
```python
import asyncio
from tenacity import retry, stop_after_attempt, wait_exponential
class LLMExtractionError(Exception):
pass
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=10))
async def extract_with_retry(crawler, url, extraction_strategy):
try:
result = await crawler.arun(url=url, extraction_strategy=extraction_strategy, bypass_cache=True)
return json.loads(result.extracted_content)
except Exception as e:
raise LLMExtractionError(f"Failed to extract content: {str(e)}")
async def main():
async with AsyncWebCrawler(verbose=True) as crawler:
try:
content = await extract_with_retry(
crawler,
"https://www.example.com",
LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY'),
instruction="Extract and summarize main points"
)
)
print("Extracted content:", content)
except LLMExtractionError as e:
print(f"Extraction failed after retries: {e}")
asyncio.run(main())
```
This example uses the `tenacity` library to implement a retry mechanism with exponential backoff, which can help handle temporary failures or rate limiting from the LLM API.

View File

@@ -0,0 +1,220 @@
# Research Assistant Example with AsyncWebCrawler
This example demonstrates how to build an advanced research assistant using `Chainlit`, `Crawl4AI`'s `AsyncWebCrawler`, and various AI services. The assistant can crawl web pages asynchronously, answer questions based on the crawled content, and handle audio inputs.
## Step-by-Step Guide
1. **Install Required Packages**
Ensure you have the necessary packages installed:
```bash
pip install chainlit groq openai crawl4ai
```
2. **Import Libraries**
```python
import os
import time
import asyncio
from openai import AsyncOpenAI
import chainlit as cl
import re
from io import BytesIO
from chainlit.element import ElementBased
from groq import Groq
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import NoExtractionStrategy
from crawl4ai.chunking_strategy import RegexChunking
client = AsyncOpenAI(base_url="https://api.groq.com/openai/v1", api_key=os.getenv("GROQ_API_KEY"))
# Instrument the OpenAI client
cl.instrument_openai()
```
3. **Set Configuration**
```python
settings = {
"model": "llama3-8b-8192",
"temperature": 0.5,
"max_tokens": 500,
"top_p": 1,
"frequency_penalty": 0,
"presence_penalty": 0,
}
```
4. **Define Utility Functions**
```python
def extract_urls(text):
url_pattern = re.compile(r'(https?://\S+)')
return url_pattern.findall(text)
async def crawl_urls(urls):
async with AsyncWebCrawler(verbose=True) as crawler:
results = await crawler.arun_many(
urls=urls,
word_count_threshold=10,
extraction_strategy=NoExtractionStrategy(),
chunking_strategy=RegexChunking(),
bypass_cache=True
)
return [result.markdown for result in results if result.success]
```
5. **Initialize Chat Start Event**
```python
@cl.on_chat_start
async def on_chat_start():
cl.user_session.set("session", {
"history": [],
"context": {}
})
await cl.Message(content="Welcome to the chat! How can I assist you today?").send()
```
6. **Handle Incoming Messages**
```python
@cl.on_message
async def on_message(message: cl.Message):
user_session = cl.user_session.get("session")
# Extract URLs from the user's message
urls = extract_urls(message.content)
if urls:
crawled_contents = await crawl_urls(urls)
for url, content in zip(urls, crawled_contents):
ref_number = f"REF_{len(user_session['context']) + 1}"
user_session["context"][ref_number] = {
"url": url,
"content": content
}
user_session["history"].append({
"role": "user",
"content": message.content
})
# Create a system message that includes the context
context_messages = [
f'<appendix ref="{ref}">\n{data["content"]}\n</appendix>'
for ref, data in user_session["context"].items()
]
system_message = {
"role": "system",
"content": (
"You are a helpful bot. Use the following context for answering questions. "
"Refer to the sources using the REF number in square brackets, e.g., [1], only if the source is given in the appendices below.\n\n"
"If the question requires any information from the provided appendices or context, refer to the sources. "
"If not, there is no need to add a references section. "
"At the end of your response, provide a reference section listing the URLs and their REF numbers only if sources from the appendices were used.\n\n"
"\n\n".join(context_messages)
) if context_messages else "You are a helpful assistant."
}
msg = cl.Message(content="")
await msg.send()
# Get response from the LLM
stream = await client.chat.completions.create(
messages=[system_message, *user_session["history"]],
stream=True,
**settings
)
assistant_response = ""
async for part in stream:
if token := part.choices[0].delta.content:
assistant_response += token
await msg.stream_token(token)
# Add assistant message to the history
user_session["history"].append({
"role": "assistant",
"content": assistant_response
})
await msg.update()
# Append the reference section to the assistant's response
if user_session["context"]:
reference_section = "\n\nReferences:\n"
for ref, data in user_session["context"].items():
reference_section += f"[{ref.split('_')[1]}]: {data['url']}\n"
msg.content += reference_section
await msg.update()
```
7. **Handle Audio Input**
```python
@cl.on_audio_chunk
async def on_audio_chunk(chunk: cl.AudioChunk):
if chunk.isStart:
buffer = BytesIO()
buffer.name = f"input_audio.{chunk.mimeType.split('/')[1]}"
cl.user_session.set("audio_buffer", buffer)
cl.user_session.set("audio_mime_type", chunk.mimeType)
cl.user_session.get("audio_buffer").write(chunk.data)
@cl.step(type="tool")
async def speech_to_text(audio_file):
response = await client.audio.transcriptions.create(
model="whisper-large-v3", file=audio_file
)
return response.text
@cl.on_audio_end
async def on_audio_end(elements: list[ElementBased]):
audio_buffer: BytesIO = cl.user_session.get("audio_buffer")
audio_buffer.seek(0)
audio_file = audio_buffer.read()
audio_mime_type: str = cl.user_session.get("audio_mime_type")
start_time = time.time()
transcription = await speech_to_text((audio_buffer.name, audio_file, audio_mime_type))
end_time = time.time()
print(f"Transcription took {end_time - start_time} seconds")
user_msg = cl.Message(author="You", type="user_message", content=transcription)
await user_msg.send()
await on_message(user_msg)
```
8. **Run the Chat Application**
```python
if __name__ == "__main__":
from chainlit.cli import run_chainlit
run_chainlit(__file__)
```
## Explanation
- **Libraries and Configuration**: We import necessary libraries, including `AsyncWebCrawler` from `crawl4ai`.
- **Utility Functions**:
- `extract_urls`: Uses regex to find URLs in messages.
- `crawl_urls`: An asynchronous function that uses `AsyncWebCrawler` to fetch content from multiple URLs concurrently.
- **Chat Start Event**: Initializes the chat session and sends a welcome message.
- **Message Handling**:
- Extracts URLs from user messages.
- Asynchronously crawls the URLs using `AsyncWebCrawler`.
- Updates chat history and context with crawled content.
- Generates a response using the LLM, incorporating the crawled context.
- **Audio Handling**: Captures, buffers, and transcribes audio input, then processes the transcription as text.
- **Running the Application**: Starts the Chainlit server for interaction with the assistant.
## Key Improvements
1. **Asynchronous Web Crawling**: Using `AsyncWebCrawler` allows for efficient, concurrent crawling of multiple URLs.
2. **Improved Context Management**: The assistant now maintains a context of crawled content, allowing for more informed responses.
3. **Dynamic Reference System**: The assistant can refer to specific sources in its responses and provide a reference section.
4. **Seamless Audio Integration**: The ability to handle audio inputs makes the assistant more versatile and user-friendly.
This updated Research Assistant showcases how to create a powerful, interactive tool that can efficiently fetch and process web content, handle various input types, and provide informed responses based on the gathered information.

View File

@@ -0,0 +1,153 @@
# Summarization Example with AsyncWebCrawler
This example demonstrates how to use Crawl4AI's `AsyncWebCrawler` to extract a summary from a web page asynchronously. The goal is to obtain the title, a detailed summary, a brief summary, and a list of keywords from the given page.
## Step-by-Step Guide
1. **Import Necessary Modules**
First, import the necessary modules and classes:
```python
import os
import json
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.chunking_strategy import RegexChunking
from pydantic import BaseModel, Field
```
2. **Define the URL to be Crawled**
Set the URL of the web page you want to summarize:
```python
url = 'https://marketplace.visualstudio.com/items?itemName=Unclecode.groqopilot'
```
3. **Define the Data Model**
Use Pydantic to define the structure of the extracted data:
```python
class PageSummary(BaseModel):
title: str = Field(..., description="Title of the page.")
summary: str = Field(..., description="Summary of the page.")
brief_summary: str = Field(..., description="Brief summary of the page.")
keywords: list = Field(..., description="Keywords assigned to the page.")
```
4. **Create the Extraction Strategy**
Set up the `LLMExtractionStrategy` with the necessary parameters:
```python
extraction_strategy = LLMExtractionStrategy(
provider="openai/gpt-4o",
api_token=os.getenv('OPENAI_API_KEY'),
schema=PageSummary.model_json_schema(),
extraction_type="schema",
apply_chunking=False,
instruction=(
"From the crawled content, extract the following details: "
"1. Title of the page "
"2. Summary of the page, which is a detailed summary "
"3. Brief summary of the page, which is a paragraph text "
"4. Keywords assigned to the page, which is a list of keywords. "
'The extracted JSON format should look like this: '
'{ "title": "Page Title", "summary": "Detailed summary of the page.", '
'"brief_summary": "Brief summary in a paragraph.", "keywords": ["keyword1", "keyword2", "keyword3"] }'
)
)
```
5. **Define the Async Crawl Function**
Create an asynchronous function to run the crawler:
```python
async def crawl_and_summarize(url):
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(
url=url,
word_count_threshold=1,
extraction_strategy=extraction_strategy,
chunking_strategy=RegexChunking(),
bypass_cache=True,
)
return result
```
6. **Run the Crawler and Process Results**
Use asyncio to run the crawler and process the results:
```python
async def main():
result = await crawl_and_summarize(url)
if result.success:
page_summary = json.loads(result.extracted_content)
print("Extracted Page Summary:")
print(json.dumps(page_summary, indent=2))
# Save the extracted data
with open(".data/page_summary.json", "w", encoding="utf-8") as f:
json.dump(page_summary, f, indent=2)
print("Page summary saved to .data/page_summary.json")
else:
print(f"Failed to crawl and summarize the page. Error: {result.error_message}")
# Run the async main function
asyncio.run(main())
```
## Explanation
- **Importing Modules**: We import the necessary modules, including `AsyncWebCrawler` and `LLMExtractionStrategy` from Crawl4AI.
- **URL Definition**: We set the URL of the web page to crawl and summarize.
- **Data Model Definition**: We define the structure of the data to extract using Pydantic's `BaseModel`.
- **Extraction Strategy Setup**: We create an instance of `LLMExtractionStrategy` with the schema and detailed instructions for the extraction process.
- **Async Crawl Function**: We define an asynchronous function `crawl_and_summarize` that uses `AsyncWebCrawler` to perform the crawling and extraction.
- **Main Execution**: In the `main` function, we run the crawler, process the results, and save the extracted data.
## Advanced Usage: Crawling Multiple URLs
To demonstrate the power of `AsyncWebCrawler`, here's how you can summarize multiple pages concurrently:
```python
async def crawl_multiple_urls(urls):
async with AsyncWebCrawler(verbose=True) as crawler:
tasks = [crawler.arun(
url=url,
word_count_threshold=1,
extraction_strategy=extraction_strategy,
chunking_strategy=RegexChunking(),
bypass_cache=True
) for url in urls]
results = await asyncio.gather(*tasks)
return results
async def main():
urls = [
'https://marketplace.visualstudio.com/items?itemName=Unclecode.groqopilot',
'https://marketplace.visualstudio.com/items?itemName=GitHub.copilot',
'https://marketplace.visualstudio.com/items?itemName=ms-python.python'
]
results = await crawl_multiple_urls(urls)
for i, result in enumerate(results):
if result.success:
page_summary = json.loads(result.extracted_content)
print(f"\nSummary for URL {i+1}:")
print(json.dumps(page_summary, indent=2))
else:
print(f"\nFailed to summarize URL {i+1}. Error: {result.error_message}")
asyncio.run(main())
```
This advanced example shows how to use `AsyncWebCrawler` to efficiently summarize multiple web pages concurrently, significantly reducing the total processing time compared to sequential crawling.
By leveraging the asynchronous capabilities of Crawl4AI, you can perform advanced web crawling and data extraction tasks with improved efficiency and scalability.