Push async version last changes for merge to main branch

2024-09-24 20:52:08 +08:00
parent d628bc4034
commit 4d48bd31ca
61 changed files with 6219 additions and 891 deletions
--- a/_sync/examples/summarization.md
+++ b/_sync/examples/summarization.md
@@ -0,0 +1,108 @@
+## Summarization Example
+
+This example demonstrates how to use `Crawl4AI` to extract a summary from a web page. The goal is to obtain the title, a detailed summary, a brief summary, and a list of keywords from the given page.
+
+### Step-by-Step Guide
+
+1. **Import Necessary Modules**
+
+    First, import the necessary modules and classes.
+
+    ```python
+    import os
+    import json
+    from crawl4ai.web_crawler import WebCrawler
+    from crawl4ai.extraction_strategy import LLMExtractionStrategy
+    from pydantic import BaseModel, Field
+    ```
+
+2. **Define the URL to be Crawled**
+
+    Set the URL of the web page you want to summarize.
+
+    ```python
+    url = r'https://marketplace.visualstudio.com/items?itemName=Unclecode.groqopilot'
+    ```
+
+3. **Initialize the WebCrawler**
+
+    Create an instance of the `WebCrawler` and call the `warmup` method.
+
+    ```python
+    crawler = WebCrawler()
+    crawler.warmup()
+    ```
+
+4. **Define the Data Model**
+
+    Use Pydantic to define the structure of the extracted data.
+
+    ```python
+    class PageSummary(BaseModel):
+        title: str = Field(..., description="Title of the page.")
+        summary: str = Field(..., description="Summary of the page.")
+        brief_summary: str = Field(..., description="Brief summary of the page.")
+        keywords: list[str] = Field(..., description="Keywords assigned to the page.")
+    ```
+
+5. **Run the Crawler**
+
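+    Set up and run the crawler with the `LLMExtractionStrategy`. Pass the JSON schema generated from `PageSummary`, set `extraction_type="schema"`, and give the LLM an instruction that describes each field. `bypass_cache=True` forces a fresh crawl instead of reusing a cached result.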
+
+    ```python
+    result = crawler.run(
+        url=url,
+        word_count_threshold=1,
+        extraction_strategy=LLMExtractionStrategy(
+            provider="openai/gpt-4o",
+            api_token=os.getenv('OPENAI_API_KEY'),
+            schema=PageSummary.model_json_schema(),
+            extraction_type="schema",
+            apply_chunking=False,
+            instruction=(
+                "From the crawled content, extract the following details: "
+                "1. Title of the page "
+                "2. Summary of the page, which is a detailed summary "
+                "3. Brief summary of the page, which is a paragraph text "
+                "4. Keywords assigned to the page, which is a list of keywords. "
+                'The extracted JSON format should look like this: '
+                '{ "title": "Page Title", "summary": "Detailed summary of the page.", '
+                '"brief_summary": "Brief summary in a paragraph.", "keywords": ["keyword1", "keyword2", "keyword3"] }'
+            )
+        ),
+        bypass_cache=True,
+    )
+    ```
+
+6. **Process the Extracted Data**
+
+    Parse the extracted content, which is a JSON string, into a Python object and print it.
+
+    ```python
+    page_summary = json.loads(result.extracted_content)
+    print(page_summary)
+    ```
+
+7. **Save the Extracted Data**
+
+    Save the extracted data to a file for further use.
+
+    ```python
+    os.makedirs(".data", exist_ok=True)  # create the output directory if it is missing
+    with open(".data/page_summary.json", "w", encoding="utf-8") as f:
+        f.write(result.extracted_content)
+    ```
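+
+Step 6 assumes the LLM returned valid JSON containing all four fields. The following is a minimal sketch of a stricter parse that uses only the standard library. `parse_summary` and `REQUIRED_KEYS` are hypothetical names, not part of `Crawl4AI`, and the list handling is an assumption about how the payload might be wrapped.
+
+```python
+import json
+
+# Field names taken from the PageSummary model defined in step 4.
+REQUIRED_KEYS = ("title", "summary", "brief_summary", "keywords")
+
+
+def parse_summary(raw: str) -> dict:
+    """Parse the extracted JSON string and check that every expected field is present."""
+    data = json.loads(raw)
+    # Assumption: if the payload is a list, its first element is the summary object.
+    if isinstance(data, list):
+        if not data:
+            raise ValueError("extraction returned an empty list")
+        data = data[0]
+    if not isinstance(data, dict):
+        raise ValueError(f"expected a JSON object, got {type(data).__name__}")
+    missing = [key for key in REQUIRED_KEYS if key not in data]
+    if missing:
+        raise ValueError(f"missing fields: {missing}")
+    return data
+```
+
+With the crawl result from step 5, this could be called as `page_summary = parse_summary(result.extracted_content)`.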
+
+### Explanation
+
+- **Importing Modules**: Import the necessary modules, including `WebCrawler` and `LLMExtractionStrategy` from `Crawl4AI`.
+- **URL Definition**: Set the URL of the web page you want to crawl and summarize.
+- **WebCrawler Initialization**: Create an instance of `WebCrawler` and call the `warmup` method to prepare the crawler.
+- **Data Model Definition**: Define the structure of the data you want to extract using Pydantic's `BaseModel`.
+- **Crawler Execution**: Run the crawler with the `LLMExtractionStrategy`, providing the schema and detailed instructions for the extraction process.
+- **Data Processing**: Parse the extracted JSON string into a Python object and print it to verify the results.
+- **Data Saving**: Save the extracted data to a file for further use.
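+
+`os.getenv('OPENAI_API_KEY')` returns `None` when the variable is unset, so a missing key may only surface later, when the LLM request is made. The following is a small sketch of a check that fails early instead. `require_env` is a hypothetical helper, not part of `Crawl4AI`.
+
+```python
+import os
+
+
+def require_env(name: str) -> str:
+    """Return the value of an environment variable, or raise if it is unset or empty."""
+    value = os.environ.get(name, "").strip()
+    if not value:
+        raise RuntimeError(f"environment variable {name} is not set")
+    return value
+```
+
+`api_token=require_env('OPENAI_API_KEY')` could then replace the `os.getenv` call in step 5.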
+
+This example shows how `Crawl4AI` can crawl a page and return LLM-extracted data shaped by a schema in a few dozen lines of code.