# Summarization Example with AsyncWebCrawler
This example demonstrates how to use Crawl4AI's `AsyncWebCrawler` to extract a summary from a web page asynchronously. The goal is to obtain the title, a detailed summary, a brief summary, and a list of keywords from the given page.
## Step-by-Step Guide
1. **Import Necessary Modules**
First, import the necessary modules and classes:
```python
import os
import json
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai.chunking_strategy import RegexChunking
from pydantic import BaseModel, Field
```
2. **Define the URL to be Crawled**
Set the URL of the web page you want to summarize:
```python
url = 'https://marketplace.visualstudio.com/items?itemName=Unclecode.groqopilot'
```
3. **Define the Data Model**
Use Pydantic to define the structure of the extracted data:
```python
class PageSummary(BaseModel):
    title: str = Field(..., description="Title of the page.")
    summary: str = Field(..., description="Summary of the page.")
    brief_summary: str = Field(..., description="Brief summary of the page.")
    keywords: list = Field(..., description="Keywords assigned to the page.")
```
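
Since the extraction schema is generated from this model, the same model can double as a validator for whatever the LLM returns. A minimal sketch (the sample payload below is invented purely for illustration):

```python
# Hypothetical sample of what the LLM might return, used only to
# demonstrate validation against the PageSummary model.
sample = {
    "title": "Page Title",
    "summary": "Detailed summary of the page.",
    "brief_summary": "Brief summary in a paragraph.",
    "keywords": ["keyword1", "keyword2"],
}
validated = PageSummary.model_validate(sample)  # raises if fields are missing or mistyped
print(validated.title)
```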
4. **Create the Extraction Strategy**
Set up the `LLMExtractionStrategy` with the necessary parameters:
```python
extraction_strategy = LLMExtractionStrategy(
    provider="openai/gpt-4o",
    api_token=os.getenv('OPENAI_API_KEY'),
    schema=PageSummary.model_json_schema(),
    extraction_type="schema",
    apply_chunking=False,
    instruction=(
        "From the crawled content, extract the following details: "
        "1. Title of the page "
        "2. Summary of the page, which is a detailed summary "
        "3. Brief summary of the page, which is a paragraph text "
        "4. Keywords assigned to the page, which is a list of keywords. "
        'The extracted JSON format should look like this: '
        '{ "title": "Page Title", "summary": "Detailed summary of the page.", '
        '"brief_summary": "Brief summary in a paragraph.", "keywords": ["keyword1", "keyword2", "keyword3"] }'
    )
)
```
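
Because the strategy reads the API key from the environment, it can be worth failing fast when the key is missing. A small optional guard, not part of the original example:

```python
# Optional guard: fail early with a clear message if the key is not set.
if not os.getenv('OPENAI_API_KEY'):
    raise RuntimeError("OPENAI_API_KEY is not set; the LLM extraction step cannot run.")
```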
5. **Define the Async Crawl Function**
Create an asynchronous function to run the crawler:
```python
async def crawl_and_summarize(url):
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url=url,
            word_count_threshold=1,
            extraction_strategy=extraction_strategy,
            chunking_strategy=RegexChunking(),
            bypass_cache=True,
        )
        return result
```
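
If you want to try this function on its own before wiring up the full flow in the next step, you can drive it directly with `asyncio.run`:

```python
# One-off invocation, using the url defined in step 2
result = asyncio.run(crawl_and_summarize(url))
print(result.success)
```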
6. **Run the Crawler and Process Results**
Use asyncio to run the crawler and process the results:
```python
async def main():
    result = await crawl_and_summarize(url)

    if result.success:
        page_summary = json.loads(result.extracted_content)
        print("Extracted Page Summary:")
        print(json.dumps(page_summary, indent=2))

        # Make sure the output directory exists before writing the file
        os.makedirs(".data", exist_ok=True)

        # Save the extracted data
        with open(".data/page_summary.json", "w", encoding="utf-8") as f:
            json.dump(page_summary, f, indent=2)
        print("Page summary saved to .data/page_summary.json")
    else:
        print(f"Failed to crawl and summarize the page. Error: {result.error_message}")

# Run the async main function
asyncio.run(main())
```
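
Note that, depending on your Crawl4AI version, `extracted_content` may come back as a JSON array of extracted blocks rather than a single object. If you want the summary as one dictionary, a small defensive helper can normalize it (assumed behavior; adjust to what your version actually returns):

```python
import json

def normalize_summary(extracted_content: str):
    # Some versions return a JSON array of extracted blocks; take the
    # first block if so, otherwise return the object unchanged.
    data = json.loads(extracted_content)
    return data[0] if isinstance(data, list) and data else data
```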
## Explanation
- **Importing Modules**: We import the necessary modules, including `AsyncWebCrawler` and `LLMExtractionStrategy` from Crawl4AI.
- **URL Definition**: We set the URL of the web page to crawl and summarize.
- **Data Model Definition**: We define the structure of the data to extract using Pydantic's `BaseModel`.
- **Extraction Strategy Setup**: We create an instance of `LLMExtractionStrategy` with the schema and detailed instructions for the extraction process.
- **Async Crawl Function**: We define an asynchronous function `crawl_and_summarize` that uses `AsyncWebCrawler` to perform the crawling and extraction.
- **Main Execution**: In the `main` function, we run the crawler, process the results, and save the extracted data.
## Advanced Usage: Crawling Multiple URLs
To demonstrate the power of `AsyncWebCrawler`, here's how you can summarize multiple pages concurrently:
```python
async def crawl_multiple_urls(urls):
    async with AsyncWebCrawler(verbose=True) as crawler:
        tasks = [crawler.arun(
            url=url,
            word_count_threshold=1,
            extraction_strategy=extraction_strategy,
            chunking_strategy=RegexChunking(),
            bypass_cache=True
        ) for url in urls]
        results = await asyncio.gather(*tasks)
        return results

async def main():
    urls = [
        'https://marketplace.visualstudio.com/items?itemName=Unclecode.groqopilot',
        'https://marketplace.visualstudio.com/items?itemName=GitHub.copilot',
        'https://marketplace.visualstudio.com/items?itemName=ms-python.python'
    ]
    results = await crawl_multiple_urls(urls)

    for i, result in enumerate(results):
        if result.success:
            page_summary = json.loads(result.extracted_content)
            print(f"\nSummary for URL {i+1}:")
            print(json.dumps(page_summary, indent=2))
        else:
            print(f"\nFailed to summarize URL {i+1}. Error: {result.error_message}")

asyncio.run(main())
```
This advanced example shows how to use `AsyncWebCrawler` to efficiently summarize multiple web pages concurrently, significantly reducing the total processing time compared to sequential crawling.
By leveraging the asynchronous capabilities of Crawl4AI, you can perform advanced web crawling and data extraction tasks with improved efficiency and scalability.
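
One caveat when scaling this pattern up: an unbounded `asyncio.gather` over many URLs opens every page at once. A common remedy is to cap concurrency with an `asyncio.Semaphore`; here is a hedged sketch (the limit of 3 is an arbitrary choice):

```python
async def crawl_with_limit(urls, max_concurrent=3):
    # Cap the number of simultaneous crawls with a semaphore.
    semaphore = asyncio.Semaphore(max_concurrent)

    async with AsyncWebCrawler(verbose=True) as crawler:
        async def bounded_crawl(url):
            async with semaphore:
                return await crawler.arun(
                    url=url,
                    word_count_threshold=1,
                    extraction_strategy=extraction_strategy,
                    chunking_strategy=RegexChunking(),
                    bypass_cache=True,
                )

        return await asyncio.gather(*(bounded_crawl(url) for url in urls))
```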