Push async version last changes for merge to main branch

2024-09-24 20:52:08 +08:00
parent d628bc4034
commit 4d48bd31ca
61 changed files with 6219 additions and 891 deletions
--- a/_sync/examples/llm_extraction.md
+++ b/_sync/examples/llm_extraction.md
@@ -0,0 +1,90 @@
+# LLM Extraction
+
+Crawl4AI allows you to use Language Models (LLMs) to extract structured data or relevant content from web pages. Below are two examples demonstrating how to use LLMExtractionStrategy for different purposes.
+
+## Example 1: Extract Structured Data
+
+In this example, we use the `LLMExtractionStrategy` to extract structured data (model names and their fees) from the OpenAI pricing page.
+
+```python
+import os
+import time
+from crawl4ai.web_crawler import WebCrawler
+from crawl4ai.chunking_strategy import *
+from crawl4ai.extraction_strategy import *
+from crawl4ai.crawler_strategy import *
+
+url = r'https://openai.com/api/pricing/'
+
+crawler = WebCrawler()
+crawler.warmup()
+
+from pydantic import BaseModel, Field
+
+class OpenAIModelFee(BaseModel):
+    model_name: str = Field(..., description="Name of the OpenAI model.")
+    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
+    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")
+
+result = crawler.run(
+    url=url,
+    word_count_threshold=1,
+    extraction_strategy= LLMExtractionStrategy(
+        provider= "openai/gpt-4o", api_token = os.getenv('OPENAI_API_KEY'), 
+        schema=OpenAIModelFee.model_json_schema(),
+        extraction_type="schema",
+        instruction="From the crawled content, extract all mentioned model names along with their "\
+            "fees for input and output tokens. Make sure not to miss anything in the entire content. "\
+            'One extracted model JSON format should look like this: '\
+            '{ "model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens" }'
+    ),
+    bypass_cache=True,
+)
+
+model_fees = json.loads(result.extracted_content)
+
+print(len(model_fees))
+
+with open(".data/data.json", "w", encoding="utf-8") as f:
+    f.write(result.extracted_content)
+```
+
+## Example 2: Extract Relevant Content
+
+In this example, we instruct the LLM to extract only content related to technology from the NBC News business page.
+
+```python
+crawler = WebCrawler()
+crawler.warmup()
+
+result = crawler.run(
+        url="https://www.nbcnews.com/business",
+        extraction_strategy=LLMExtractionStrategy(
+            provider="openai/gpt-4o",
+            api_token=os.getenv('OPENAI_API_KEY'),
+            instruction="Extract only content related to technology"
+        ),
+    bypass_cache=True,
+    )
+
+model_fees = json.loads(result.extracted_content)
+
+print(len(model_fees))
+
+with open(".data/data.json", "w", encoding="utf-8") as f:
+    f.write(result.extracted_content)
+```
+
+## Customizing LLM Provider
+
+Under the hood, Crawl4AI uses the `litellm` library, which allows you to use any LLM provider you want. Just pass the correct model name and API token.
+
+```python
+extraction_strategy=LLMExtractionStrategy(
+    provider="your_llm_provider/model_name",
+    api_token="your_api_token",
+    instruction="Your extraction instruction"
+)
+```
+
+This flexibility allows you to integrate with various LLM providers and tailor the extraction process to your specific needs.