Push async version last changes for merge to main branch

unclecode
2024-09-24 20:52:08 +08:00
parent d628bc4034
commit 4d48bd31ca
61 changed files with 6219 additions and 891 deletions


@@ -1,33 +1,32 @@
## Research Assistant Example
# Research Assistant Example with AsyncWebCrawler
This example demonstrates how to build a research assistant using `Chainlit` and `Crawl4AI`. The assistant will be capable of crawling web pages for information and answering questions based on the crawled content. Additionally, it integrates speech-to-text functionality for audio inputs.
This example demonstrates how to build an advanced research assistant using `Chainlit`, `Crawl4AI`'s `AsyncWebCrawler`, and various AI services. The assistant can crawl web pages asynchronously, answer questions based on the crawled content, and handle audio inputs.
### Step-by-Step Guide
## Step-by-Step Guide
1. **Install Required Packages**
Ensure you have the necessary packages installed. You need `chainlit`, `groq`, `requests`, and `openai`.
Ensure you have the necessary packages installed:
```bash
pip install chainlit groq requests openai
pip install chainlit groq openai crawl4ai
```
2. **Import Libraries**
Import the necessary modules and initialize an `AsyncOpenAI` client that points at Groq's OpenAI-compatible endpoint.
```python
import os
import time
import asyncio
from openai import AsyncOpenAI
import chainlit as cl
import re
import requests
from io import BytesIO
from chainlit.element import ElementBased
from groq import Groq
from concurrent.futures import ThreadPoolExecutor
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import NoExtractionStrategy
from crawl4ai.chunking_strategy import RegexChunking
client = AsyncOpenAI(base_url="https://api.groq.com/openai/v1", api_key=os.getenv("GROQ_API_KEY"))
@@ -37,8 +36,6 @@ This example demonstrates how to build a research assistant using `Chainlit` and
3. **Set Configuration**
Define the model settings for the assistant.
```python
settings = {
    "model": "llama3-8b-8192",
@@ -52,35 +49,25 @@ This example demonstrates how to build a research assistant using `Chainlit` and
4. **Define Utility Functions**
- **Extract URLs from Text**: Use regex to find URLs in messages.
```python
def extract_urls(text):
    url_pattern = re.compile(r'(https?://\S+)')
    return url_pattern.findall(text)
```
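The pattern `https?://\S+` matches everything up to the next whitespace, so a URL written mid-sentence keeps its trailing comma, period, or closing bracket. The sketch below is one possible refinement, not part of the original example: `extract_clean_urls` and `TRAILING_PUNCTUATION` are hypothetical names, and the choice of characters to strip is an assumption.

```python
import re

# Same idea as extract_urls: "http(s)://" followed by non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")

# Characters that often trail a URL in prose but are rarely part of it (assumption).
TRAILING_PUNCTUATION = ".,;:!?)]}>\"'"


def extract_clean_urls(text):
    """Hypothetical helper: extract URLs, trim trailing punctuation, drop duplicates."""
    seen = set()
    urls = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(TRAILING_PUNCTUATION)
        if url not in seen:  # keep first occurrence, preserve order
            seen.add(url)
            urls.append(url)
    return urls
```

Stripping `)` will also truncate the rare URL that legitimately ends in a parenthesis, so this trade-off may not suit every input.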
- **Crawl URL**: Send a request to `Crawl4AI` to fetch the content of a URL.
```python
def crawl_url(url):
    data = {
        "urls": [url],
        "include_raw_html": True,
        "word_count_threshold": 10,
        "extraction_strategy": "NoExtractionStrategy",
        "chunking_strategy": "RegexChunking"
    }
    response = requests.post("https://crawl4ai.com/crawl", json=data)
    response_data = response.json()
    response_data = response_data['results'][0]
    return response_data['markdown']
```
async def crawl_urls(urls):
    async with AsyncWebCrawler(verbose=True) as crawler:
        results = await crawler.arun_many(
            urls=urls,
            word_count_threshold=10,
            extraction_strategy=NoExtractionStrategy(),
            chunking_strategy=RegexChunking(),
            bypass_cache=True
        )
        # Return one entry per input URL (assuming results arrive in input order)
        # so callers can safely zip(urls, contents); failed crawls yield "".
        return [result.markdown if result.success else "" for result in results]
```
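When some crawls fail, the pairing between each URL and its content has to stay explicit; otherwise zipping a filtered content list against the full URL list can attach content to the wrong URL. The sketch below shows one way to pair first and filter second. It uses a stand-in `FakeResult` class and a hypothetical `fake_crawl` coroutine instead of the real crawler, so it runs without network access; only the `success` and `markdown` field names are taken from the example above.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeResult:
    """Stand-in for a crawl result; only the fields used here are modelled."""
    url: str
    success: bool
    markdown: str = ""


async def fake_crawl(url):
    """Hypothetical crawler: URLs containing 'bad' fail, all others succeed."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    if "bad" in url:
        return FakeResult(url=url, success=False)
    return FakeResult(url=url, success=True, markdown=f"# Content of {url}")


async def crawl_pairs(urls):
    """Return (url, markdown) pairs for successful crawls only."""
    # asyncio.gather returns results in the same order as its arguments,
    # so zipping BEFORE filtering keeps each URL with its own result.
    results = await asyncio.gather(*(fake_crawl(u) for u in urls))
    return [(url, r.markdown) for url, r in zip(urls, results) if r.success]
```

Calling `asyncio.run(crawl_pairs([...]))` with a mix of good and failing URLs returns pairs only for the successful ones, each still matched to its own URL.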
5. **Initialize Chat Start Event**
Set up the initial chat message and user session.
```python
@cl.on_chat_start
async def on_chat_start():
@@ -88,15 +75,11 @@ This example demonstrates how to build a research assistant using `Chainlit` and
        "history": [],
        "context": {}
    })
    await cl.Message(
        content="Welcome to the chat! How can I assist you today?"
    ).send()
    await cl.Message(content="Welcome to the chat! How can I assist you today?").send()
```
6. **Handle Incoming Messages**
Process user messages, extract URLs, and crawl them concurrently. Update the chat history and system message.
```python
@cl.on_message
async def on_message(message: cl.Message):
@@ -105,19 +88,14 @@ This example demonstrates how to build a research assistant using `Chainlit` and
    # Extract URLs from the user's message
    urls = extract_urls(message.content)
    futures = []
    with ThreadPoolExecutor() as executor:
        for url in urls:
            futures.append(executor.submit(crawl_url, url))
    results = [future.result() for future in futures]
    for url, result in zip(urls, results):
        ref_number = f"REF_{len(user_session['context']) + 1}"
        user_session["context"][ref_number] = {
            "url": url,
            "content": result
        }
    if urls:
        crawled_contents = await crawl_urls(urls)
        for url, content in zip(urls, crawled_contents):
            ref_number = f"REF_{len(user_session['context']) + 1}"
            user_session["context"][ref_number] = {
                "url": url,
                "content": content
            }
    user_session["history"].append({
        "role": "user",
@@ -129,33 +107,24 @@ This example demonstrates how to build a research assistant using `Chainlit` and
        f'<appendix ref="{ref}">\n{data["content"]}\n</appendix>'
        for ref, data in user_session["context"].items()
    ]
    if context_messages:
        system_message = {
            "role": "system",
            "content": (
                "You are a helpful bot. Use the following context for answering questions. "
                "Refer to the sources using the REF number in square brackets, e.g., [1], only if the source is given in the appendices below.\n\n"
                "If the question requires any information from the provided appendices or context, refer to the sources. "
                "If not, there is no need to add a references section. "
                "At the end of your response, provide a reference section listing the URLs and their REF numbers only if sources from the appendices were used.\n\n"
                # The "+" is required: without it the adjacent literals merge and
                # join() would use the whole instruction text as the separator.
                + "\n\n".join(context_messages)
            )
        }
    else:
        system_message = {
            "role": "system",
            "content": "You are a helpful assistant."
        }
    system_message = {
        "role": "system",
        "content": (
            "You are a helpful bot. Use the following context for answering questions. "
            "Refer to the sources using the REF number in square brackets, e.g., [1], only if the source is given in the appendices below.\n\n"
            "If the question requires any information from the provided appendices or context, refer to the sources. "
            "If not, there is no need to add a references section. "
            "At the end of your response, provide a reference section listing the URLs and their REF numbers only if sources from the appendices were used.\n\n"
            # The "+" is required: without it the adjacent literals merge and
            # join() would use the whole instruction text as the separator.
            + "\n\n".join(context_messages)
        ) if context_messages else "You are a helpful assistant."
    }
    msg = cl.Message(content="")
    await msg.send()
    # Get response from the LLM
    stream = await client.chat.completions.create(
        messages=[
            system_message,
            *user_session["history"]
        ],
        messages=[system_message, *user_session["history"]],
        stream=True,
        **settings
    )
@@ -174,18 +143,16 @@ This example demonstrates how to build a research assistant using `Chainlit` and
    await msg.update()
    # Append the reference section to the assistant's response
    reference_section = "\n\nReferences:\n"
    for ref, data in user_session["context"].items():
        reference_section += f"[{ref.split('_')[1]}]: {data['url']}\n"
    msg.content += reference_section
    await msg.update()
    if user_session["context"]:
        reference_section = "\n\nReferences:\n"
        for ref, data in user_session["context"].items():
            reference_section += f"[{ref.split('_')[1]}]: {data['url']}\n"
        msg.content += reference_section
        await msg.update()
```
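The prompt assembly and the reference section can be factored into two pure functions, which makes them easy to test without Chainlit or an LLM. The sketch below is an illustration under assumptions: `build_system_prompt` and `build_reference_section` are hypothetical names, the instruction text is shortened, and the context layout (`{"REF_n": {"url": ..., "content": ...}}`) follows the handler above. One detail worth care in Python: adjacent string literals are merged before any method call, so the instructions and the joined appendices must be combined with an explicit `+`.

```python
def build_system_prompt(context):
    """Build the system prompt from a {"REF_n": {"url": ..., "content": ...}} mapping."""
    if not context:
        return "You are a helpful assistant."
    appendices = [
        f'<appendix ref="{ref}">\n{data["content"]}\n</appendix>'
        for ref, data in context.items()
    ]
    instructions = (
        "You are a helpful bot. Use the following context for answering questions. "
        "Refer to the sources using the REF number in square brackets, e.g., [1].\n\n"
    )
    # Explicit "+": join() applies only to the "\n\n" separator, not to the
    # instruction text that precedes it.
    return instructions + "\n\n".join(appendices)


def build_reference_section(context):
    """List each REF number with its URL, or return "" when nothing was crawled."""
    if not context:
        return ""
    lines = [f"[{ref.split('_')[1]}]: {data['url']}" for ref, data in context.items()]
    return "\n\nReferences:\n" + "\n".join(lines) + "\n"
```

With these helpers, the message handler would only need to call the two functions and pass their output to the LLM request and to `msg.content`.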
7. **Handle Audio Input**
Capture and transcribe audio input. Store the audio buffer and transcribe it when the audio ends.
```python
@cl.on_audio_chunk
async def on_audio_chunk(chunk: cl.AudioChunk):
@@ -194,12 +161,10 @@ This example demonstrates how to build a research assistant using `Chainlit` and
        buffer.name = f"input_audio.{chunk.mimeType.split('/')[1]}"
        cl.user_session.set("audio_buffer", buffer)
        cl.user_session.set("audio_mime_type", chunk.mimeType)
    cl.user_session.get("audio_buffer").write(chunk.data)
@cl.step(type="tool")
async def speech_to_text(audio_file):
    cli = Groq()
    response = await client.audio.transcriptions.create(
        model="whisper-large-v3", file=audio_file
    )
@@ -217,32 +182,39 @@ This example demonstrates how to build a research assistant using `Chainlit` and
    end_time = time.time()
    print(f"Transcription took {end_time - start_time} seconds")
    user_msg = cl.Message(
        author="You",
        type="user_message",
        content=transcription
    )
    user_msg = cl.Message(author="You", type="user_message", content=transcription)
    await user_msg.send()
    await on_message(user_msg)
```
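The buffering logic in the audio handler can be tried in isolation. The sketch below mirrors it with a plain dictionary in place of `cl.user_session`; `append_audio_chunk` and `finish_audio` are hypothetical helper names, and the chunk fields are passed as ordinary arguments rather than as a `cl.AudioChunk`.

```python
from io import BytesIO


def append_audio_chunk(session, data, mime_type, is_start):
    """Hypothetical helper: buffer audio bytes in a dict-like session."""
    if is_start:
        buffer = BytesIO()
        # Name the buffer after the MIME subtype, e.g. "audio/webm" -> "input_audio.webm".
        buffer.name = f"input_audio.{mime_type.split('/')[1]}"
        session["audio_buffer"] = buffer
        session["audio_mime_type"] = mime_type
    # Every chunk, including the first, is appended to the current buffer.
    session["audio_buffer"].write(data)


def finish_audio(session):
    """Rewind the buffer and return (file_name, audio_bytes, mime_type)."""
    buffer = session["audio_buffer"]
    buffer.seek(0)  # reads start at the current position, so rewind first
    return buffer.name, buffer.read(), session["audio_mime_type"]
```

A MIME type that carries parameters, such as `audio/webm;codecs=opus`, would produce an odd file extension here, so a real handler may want to strip anything after the semicolon.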
8. **Run the Chat Application**
Start the Chainlit application.
```python
if __name__ == "__main__":
    from chainlit.cli import run_chainlit
    run_chainlit(__file__)
```
### Explanation
## Explanation
- **Libraries and Configuration**: Import necessary libraries and configure the OpenAI client.
- **Utility Functions**: Define functions to extract URLs and crawl them.
- **Chat Start Event**: Initialize chat session and welcome message.
- **Message Handling**: Extract URLs, crawl them concurrently, and update chat history and context.
- **Audio Handling**: Capture, buffer, and transcribe audio input, then process the transcription as text.
- **Running the Application**: Start the Chainlit server to interact with the assistant.
- **Libraries and Configuration**: We import necessary libraries, including `AsyncWebCrawler` from `crawl4ai`.
- **Utility Functions**:
  - `extract_urls`: Uses regex to find URLs in messages.
  - `crawl_urls`: An asynchronous function that uses `AsyncWebCrawler` to fetch content from multiple URLs concurrently.
- **Chat Start Event**: Initializes the chat session and sends a welcome message.
- **Message Handling**:
  - Extracts URLs from user messages.
  - Asynchronously crawls the URLs using `AsyncWebCrawler`.
  - Updates chat history and context with crawled content.
  - Generates a response using the LLM, incorporating the crawled context.
- **Audio Handling**: Captures, buffers, and transcribes audio input, then processes the transcription as text.
- **Running the Application**: Starts the Chainlit server for interaction with the assistant.
This example showcases how to create an interactive research assistant that can fetch, process, and summarize web content, along with handling audio inputs for a seamless user experience.
## Key Improvements
1. **Asynchronous Web Crawling**: Using `AsyncWebCrawler` allows for efficient, concurrent crawling of multiple URLs.
2. **Improved Context Management**: The assistant now maintains a context of crawled content, allowing for more informed responses.
3. **Dynamic Reference System**: The assistant can refer to specific sources in its responses and provide a reference section.
4. **Seamless Audio Integration**: The ability to handle audio inputs makes the assistant more versatile and user-friendly.
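
The benefit of concurrent crawling in the first point can be illustrated with plain `asyncio`, independent of any crawler. In the sketch below, `fake_fetch` is a hypothetical stand-in that only sleeps, so the timings show scheduling behaviour rather than real network performance.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.1):
    """Hypothetical stand-in for a network-bound crawl of one URL."""
    await asyncio.sleep(delay)
    return f"content of {url}"


async def fetch_sequential(urls):
    # Each fetch waits for the previous one: total time is roughly the sum of delays.
    return [await fake_fetch(u) for u in urls]


async def fetch_concurrent(urls):
    # All fetches wait at the same time: total time is roughly the longest delay.
    return list(await asyncio.gather(*(fake_fetch(u) for u in urls)))


def timed(coro):
    """Run a coroutine to completion and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start
```

Running both variants over the same URL list returns identical results, with the concurrent version finishing in about the time of a single fetch.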
This updated Research Assistant showcases how to create a powerful, interactive tool that can efficiently fetch and process web content, handle various input types, and provide informed responses based on the gathered information.