Push async version last changes for merge to main branch

2024-09-24 20:52:08 +08:00
parent d628bc4034
commit 4d48bd31ca
61 changed files with 6219 additions and 891 deletions
--- a/docs/md/examples/research_assistant.md
+++ b/docs/md/examples/research_assistant.md
@@ -1,33 +1,32 @@
-## Research Assistant Example
+# Research Assistant Example with AsyncWebCrawler

-This example demonstrates how to build a research assistant using `Chainlit` and `Crawl4AI`. The assistant will be capable of crawling web pages for information and answering questions based on the crawled content. Additionally, it integrates speech-to-text functionality for audio inputs.
+This example demonstrates how to build a research assistant using `Chainlit`, `Crawl4AI`'s `AsyncWebCrawler`, and Groq-hosted models accessed through the OpenAI-compatible client. The assistant crawls web pages asynchronously, answers questions based on the crawled content, and transcribes audio inputs.

-### Step-by-Step Guide
+## Step-by-Step Guide

 1. **Install Required Packages**

-    Ensure you have the necessary packages installed. You need `chainlit`, `groq`, `requests`, and `openai`.
+    Ensure you have the necessary packages installed:

    ```bash
-    pip install chainlit groq requests openai
+    pip install chainlit groq openai crawl4ai
    ```

 2. **Import Libraries**

-    Import all the necessary modules and initialize the OpenAI client.
-
    ```python
    import os
    import time
+    import asyncio
    from openai import AsyncOpenAI
    import chainlit as cl
    import re
-    import requests
    from io import BytesIO
    from chainlit.element import ElementBased
    from groq import Groq
-
-    from concurrent.futures import ThreadPoolExecutor
+    from crawl4ai import AsyncWebCrawler
+    from crawl4ai.extraction_strategy import NoExtractionStrategy
+    from crawl4ai.chunking_strategy import RegexChunking

    client = AsyncOpenAI(base_url="https://api.groq.com/openai/v1", api_key=os.getenv("GROQ_API_KEY"))

@@ -37,8 +36,6 @@ This example demonstrates how to build a research assistant using `Chainlit` and

 3. **Set Configuration**

-    Define the model settings for the assistant.
-
    ```python
    settings = {
        "model": "llama3-8b-8192",
@@ -52,35 +49,25 @@ This example demonstrates how to build a research assistant using `Chainlit` and

 4. **Define Utility Functions**

-    - **Extract URLs from Text**: Use regex to find URLs in messages.
+    ```python
+    def extract_urls(text):
+        url_pattern = re.compile(r'(https?://\S+)')
+        return url_pattern.findall(text)

-        ```python
-        def extract_urls(text):
-            url_pattern = re.compile(r'(https?://\S+)')
-            return url_pattern.findall(text)
-        ```
-
-    - **Crawl URL**: Send a request to `Crawl4AI` to fetch the content of a URL.
-
-        ```python
-        def crawl_url(url):
-            data = {
-                "urls": [url],
-                "include_raw_html": True,
-                "word_count_threshold": 10,
-                "extraction_strategy": "NoExtractionStrategy",
-                "chunking_strategy": "RegexChunking"
-            }
-            response = requests.post("https://crawl4ai.com/crawl", json=data)
-            response_data = response.json()
-            response_data = response_data['results'][0]
-            return response_data['markdown']
-        ```
+    async def crawl_urls(urls):
+        async with AsyncWebCrawler(verbose=True) as crawler:
+            results = await crawler.arun_many(
+                urls=urls,
+                word_count_threshold=10,
+                extraction_strategy=NoExtractionStrategy(),
+                chunking_strategy=RegexChunking(),
+                bypass_cache=True
+            )
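+        # Keep one entry per input URL (empty string on failure) so callers can zip the list with `urls`
+        return [result.markdown if result.success else "" for result in results]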
+    ```

 5. **Initialize Chat Start Event**

-    Set up the initial chat message and user session.
-
    ```python
    @cl.on_chat_start
    async def on_chat_start():
@@ -88,15 +75,11 @@ This example demonstrates how to build a research assistant using `Chainlit` and
            "history": [],
            "context": {}
        })  
-        await cl.Message(
-            content="Welcome to the chat! How can I assist you today?"
-        ).send()
+        await cl.Message(content="Welcome to the chat! How can I assist you today?").send()
    ```

 6. **Handle Incoming Messages**

-    Process user messages, extract URLs, and crawl them concurrently. Update the chat history and system message.
-
    ```python
    @cl.on_message
    async def on_message(message: cl.Message):
@@ -105,19 +88,14 @@ This example demonstrates how to build a research assistant using `Chainlit` and
        # Extract URLs from the user's message
        urls = extract_urls(message.content)

-        futures = []
-        with ThreadPoolExecutor() as executor:
-            for url in urls:
-                futures.append(executor.submit(crawl_url, url))
-
-        results = [future.result() for future in futures]
-
-        for url, result in zip(urls, results):
-            ref_number = f"REF_{len(user_session['context']) + 1}"
-            user_session["context"][ref_number] = {
-                "url": url,
-                "content": result
-            }    
+        if urls:
+            crawled_contents = await crawl_urls(urls)
+            for url, content in zip(urls, crawled_contents):
+                ref_number = f"REF_{len(user_session['context']) + 1}"
+                user_session["context"][ref_number] = {
+                    "url": url,
+                    "content": content
+                }

        user_session["history"].append({
            "role": "user",
@@ -129,33 +107,24 @@ This example demonstrates how to build a research assistant using `Chainlit` and
            f'<appendix ref="{ref}">\n{data["content"]}\n</appendix>'
            for ref, data in user_session["context"].items()
        ]
-        if context_messages:
-            system_message = {
-                "role": "system",
-                "content": (
-                    "You are a helpful bot. Use the following context for answering questions. "
-                    "Refer to the sources using the REF number in square brackets, e.g., [1], only if the source is given in the appendices below.\n\n"
-                    "If the question requires any information from the provided appendices or context, refer to the sources. "
-                    "If not, there is no need to add a references section. "
-                    "At the end of your response, provide a reference section listing the URLs and their REF numbers only if sources from the appendices were used.\n\n"
-                    "\n\n".join(context_messages)
-                )
-            }
-        else:
-            system_message = {
-                "role": "system",
-                "content": "You are a helpful assistant."
-            }
+        system_message = {
+            "role": "system",
+            "content": (
+                "You are a helpful bot. Use the following context for answering questions. "
+                "Refer to the sources using the REF number in square brackets, e.g., [1], only if the source is given in the appendices below.\n\n"
+                "If the question requires any information from the provided appendices or context, refer to the sources. "
+                "If not, there is no need to add a references section. "
+                "At the end of your response, provide a reference section listing the URLs and their REF numbers only if sources from the appendices were used.\n\n"
+                # Explicit "+": without it, the adjacent literals above would merge into the join separator
+                + "\n\n".join(context_messages)
+            ) if context_messages else "You are a helpful assistant."
+        }

        msg = cl.Message(content="")
        await msg.send()

        # Get response from the LLM
        stream = await client.chat.completions.create(
-            messages=[
-                system_message,
-                *user_session["history"]
-            ],
+            messages=[system_message, *user_session["history"]],
            stream=True,
            **settings
        )
@@ -174,18 +143,16 @@ This example demonstrates how to build a research assistant using `Chainlit` and
        await msg.update()

        # Append the reference section to the assistant's response
-        reference_section = "\n\nReferences:\n"
-        for ref, data in user_session["context"].items():
-            reference_section += f"[{ref.split('_')[1]}]: {data['url']}\n"
-
-        msg.content += reference_section
-        await msg.update()
+        if user_session["context"]:
+            reference_section = "\n\nReferences:\n"
+            for ref, data in user_session["context"].items():
+                reference_section += f"[{ref.split('_')[1]}]: {data['url']}\n"
+            msg.content += reference_section
+            await msg.update()
    ```

 7. **Handle Audio Input**

-    Capture and transcribe audio input. Store the audio buffer and transcribe it when the audio ends.
-
    ```python
    @cl.on_audio_chunk
    async def on_audio_chunk(chunk: cl.AudioChunk):
@@ -194,12 +161,10 @@ This example demonstrates how to build a research assistant using `Chainlit` and
            buffer.name = f"input_audio.{chunk.mimeType.split('/')[1]}"
            cl.user_session.set("audio_buffer", buffer)
            cl.user_session.set("audio_mime_type", chunk.mimeType)
-
        cl.user_session.get("audio_buffer").write(chunk.data)

    @cl.step(type="tool")
    async def speech_to_text(audio_file):
-        cli = Groq()
        response = await client.audio.transcriptions.create(
            model="whisper-large-v3", file=audio_file
        )
@@ -217,32 +182,39 @@ This example demonstrates how to build a research assistant using `Chainlit` and
        end_time = time.time()
        print(f"Transcription took {end_time - start_time} seconds")
        
-        user_msg = cl.Message(
-            author="You", 
-            type="user_message",
-            content=transcription
-        )
+        user_msg = cl.Message(author="You", type="user_message", content=transcription)
        await user_msg.send()
        await on_message(user_msg)
    ```

 8. **Run the Chat Application**

-    Start the Chainlit application.
-
    ```python
    if __name__ == "__main__":
        from chainlit.cli import run_chainlit
        run_chainlit(__file__)
    ```

-### Explanation
+## Explanation

- **Libraries and Configuration**: Import necessary libraries and configure the OpenAI client.
- **Utility Functions**: Define functions to extract URLs and crawl them.
- **Chat Start Event**: Initialize chat session and welcome message.
- **Message Handling**: Extract URLs, crawl them concurrently, and update chat history and context.
- **Audio Handling**: Capture, buffer, and transcribe audio input, then process the transcription as text.
- **Running the Application**: Start the Chainlit server to interact with the assistant.
+- **Libraries and Configuration**: Imports the necessary libraries, including `AsyncWebCrawler` from `crawl4ai`, and configures the OpenAI-compatible client.
+- **Utility Functions**: 
+  - `extract_urls`: Uses regex to find URLs in messages.
+  - `crawl_urls`: An asynchronous function that uses `AsyncWebCrawler` to fetch content from multiple URLs concurrently.
+- **Chat Start Event**: Initializes the chat session and sends a welcome message.
+- **Message Handling**: 
+  - Extracts URLs from user messages.
+  - Asynchronously crawls the URLs using `AsyncWebCrawler`.
+  - Updates chat history and context with crawled content.
+  - Generates a response using the LLM, incorporating the crawled context.
+- **Audio Handling**: Captures, buffers, and transcribes audio input, then processes the transcription as text.
+- **Running the Application**: Starts the Chainlit server for interaction with the assistant.
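
The URL extraction and concurrent crawling steps described above can be tried without a browser or network access. The following is a minimal sketch, not the library's implementation: `fake_crawl` and `crawl_all` are hypothetical stand-ins for the real crawler call, and the punctuation trimming is an assumption added here because the pattern `https?://\S+` also captures a trailing comma, period, or closing parenthesis.

```python
import asyncio
import re

URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(text):
    """Find URLs, trim trailing punctuation, and drop duplicates in order."""
    cleaned = (match.rstrip(".,;:!?)]}>\"'") for match in URL_PATTERN.findall(text))
    return list(dict.fromkeys(cleaned))


async def fake_crawl(url):
    """Hypothetical stand-in for a crawler call; fails for URLs containing 'bad'."""
    await asyncio.sleep(0.01)  # simulate network latency
    if "bad" in url:
        raise RuntimeError(f"could not crawl {url}")
    return f"# Markdown for {url}"


async def crawl_all(urls, crawl=fake_crawl):
    """Crawl all URLs concurrently and map each URL to its content (None on failure)."""
    outcomes = await asyncio.gather(
        *(crawl(url) for url in urls), return_exceptions=True
    )
    return {
        url: None if isinstance(outcome, Exception) else outcome
        for url, outcome in zip(urls, outcomes)
    }


message = "Compare https://example.com/a, https://bad.example/b and (https://example.com/a)."
urls = extract_urls(message)
pages = asyncio.run(crawl_all(urls))
for url, content in pages.items():
    print(url, "->", content)
```

Returning a mapping keyed by URL keeps each page attached to its source even when some crawls fail. Note that stripping a trailing `)` would also truncate URLs that legitimately end with one, so the trimming set is a trade-off rather than a rule.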

-This example showcases how to create an interactive research assistant that can fetch, process, and summarize web content, along with handling audio inputs for a seamless user experience.
+## Key Improvements
+
+1. **Asynchronous Web Crawling**: Using `AsyncWebCrawler` allows for efficient, concurrent crawling of multiple URLs.
+2. **Context Management**: The assistant stores crawled content in the session context under `REF_n` keys, so later responses can draw on it.
+3. **Dynamic Reference System**: The assistant can refer to specific sources in its responses and provide a reference section.
+4. **Seamless Audio Integration**: The ability to handle audio inputs makes the assistant more versatile and user-friendly.
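
The context and reference handling can be sketched in isolation. This is an illustrative outline under stated assumptions, not the example's exact code: the helper names `register_sources`, `build_system_prompt`, and `build_reference_section` are hypothetical, and the sketch assumes the crawl step returns one content entry per URL, with an empty string for a failed crawl.

```python
def register_sources(context, urls, contents):
    """Store each successfully crawled page under the next free REF_n key."""
    for url, content in zip(urls, contents):
        if not content:  # assumption: empty content marks a failed crawl
            continue
        ref = f"REF_{len(context) + 1}"
        context[ref] = {"url": url, "content": content}
    return context


def build_system_prompt(context, instructions):
    """Append one <appendix> block per source to the instructions."""
    if not context:
        return "You are a helpful assistant."
    appendices = [
        f'<appendix ref="{ref}">\n{data["content"]}\n</appendix>'
        for ref, data in context.items()
    ]
    return instructions + "\n\n" + "\n\n".join(appendices)


def build_reference_section(context):
    """List each source as "[n]: url"; return an empty string without sources."""
    if not context:
        return ""
    lines = [f"[{ref.split('_')[1]}]: {data['url']}" for ref, data in context.items()]
    return "\n\nReferences:\n" + "\n".join(lines) + "\n"


context = register_sources(
    {},
    ["https://a.example", "https://b.example"],
    ["Page A text", ""],  # the second crawl is assumed to have failed
)
prompt = build_system_prompt(context, "Cite sources by REF number.")
print(prompt)
print(build_reference_section(context))
```

Skipping empty entries keeps failed URLs out of both the appendices and the reference list, so the model is never asked to cite a source it did not receive.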
+
+This updated Research Assistant shows how to build an interactive tool that fetches and processes web content asynchronously, accepts both text and audio input, and answers with references to the crawled sources.