Push async version last changes for merge to main branch

2024-09-24 20:52:08 +08:00
parent d628bc4034
commit 4d48bd31ca
61 changed files with 6219 additions and 891 deletions
--- a/docs/md/full_details/extraction_strategies.md
+++ b/docs/md/full_details/extraction_strategies.md
@@ -1,6 +1,143 @@
 ## Extraction Strategies 🧠

-Crawl4AI offers powerful extraction strategies to derive meaningful information from web content. Let's dive into two of the most important strategies: `CosineStrategy` and `LLMExtractionStrategy`.
+Crawl4AI offers powerful extraction strategies to derive meaningful information from web content. Let's dive into three of the most important strategies: `CosineStrategy`, `LLMExtractionStrategy`, and the new `JsonCssExtractionStrategy`.
+
+### LLMExtractionStrategy
+
+`LLMExtractionStrategy` uses a large language model (LLM) to extract meaningful content from HTML. It calls an external provider for LLM completions and performs the extraction according to the instructions you supply.
+
+#### When to Use
+- Suitable for complex extraction tasks requiring nuanced understanding.
+- Ideal for scenarios where detailed instructions can guide the extraction process.
+- Perfect for extracting specific types of information or content with precise guidelines.
+
+#### Parameters
+- `provider` (str, optional): Provider for language model completions (e.g., openai/gpt-4). Default is `DEFAULT_PROVIDER`.
+- `api_token` (str, optional): API token for the provider. If omitted, the strategy tries to read it from the `OPENAI_API_KEY` environment variable.
+- `instruction` (str, optional): Instructions to guide the LLM on how to perform the extraction. Default is `None`.
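The token fallback described in the parameter list can be pictured with a minimal sketch. `resolve_api_token` is a hypothetical helper written for illustration only; it is not part of Crawl4AI, and the library's actual lookup may differ in detail.

```python
import os
from typing import Optional


def resolve_api_token(api_token: Optional[str] = None,
                      env_var: str = "OPENAI_API_KEY") -> str:
    """Hypothetical helper: prefer an explicit token, else read the environment."""
    # An explicitly passed token always wins over the environment variable.
    token = api_token or os.getenv(env_var)
    if not token:
        # Fail early with a clear message instead of sending an empty token.
        raise ValueError(f"No API token passed and {env_var} is not set")
    return token
```

Calling `resolve_api_token()` with no argument returns the value of `OPENAI_API_KEY` when it is set, and raises `ValueError` otherwise, which surfaces a missing key before any crawl starts.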
+
+#### Example Without Instructions
+```python
+import asyncio
+import os
+from crawl4ai import AsyncWebCrawler
+from crawl4ai.extraction_strategy import LLMExtractionStrategy
+
+async def main():
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        # Define extraction strategy without instructions
+        strategy = LLMExtractionStrategy(
+            provider='openai/gpt-4',
+            api_token=os.getenv('OPENAI_API_KEY')
+        )
+
+        # Sample URL
+        url = "https://www.nbcnews.com/business"
+
+        # Run the crawler with the extraction strategy
+        result = await crawler.arun(url=url, extraction_strategy=strategy)
+        print(result.extracted_content)
+
+asyncio.run(main())
+```
+
+#### Example With Instructions
+```python
+import asyncio
+import os
+from crawl4ai import AsyncWebCrawler
+from crawl4ai.extraction_strategy import LLMExtractionStrategy
+
+async def main():
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        # Define extraction strategy with instructions
+        strategy = LLMExtractionStrategy(
+            provider='openai/gpt-4',
+            api_token=os.getenv('OPENAI_API_KEY'),
+            instruction="Extract only financial news and summarize key points."
+        )
+
+        # Sample URL
+        url = "https://www.nbcnews.com/business"
+
+        # Run the crawler with the extraction strategy
+        result = await crawler.arun(url=url, extraction_strategy=strategy)
+        print(result.extracted_content)
+
+asyncio.run(main())
+```
+
+### JsonCssExtractionStrategy
+
+`JsonCssExtractionStrategy` is a powerful tool for extracting structured data from HTML using CSS selectors. It allows you to define a schema that maps CSS selectors to specific fields, enabling precise and efficient data extraction.
+
+#### When to Use
+- Ideal for extracting structured data from websites with consistent HTML structures.
+- Perfect for scenarios where you need to extract specific elements or attributes from a webpage.
+- Suitable for creating datasets from web pages with tabular or list-based information.
+
+#### Parameters
+- `schema` (Dict[str, Any]): A dictionary defining the extraction schema, including base selector and field definitions.
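Before handing a schema to the strategy, it can help to sanity-check its shape. The sketch below assumes the keys used in this document's example (`baseSelector`, `fields`, and per-field `name`, `selector`, `type`, plus `attribute` for attribute fields) are the ones that matter; `check_schema` is a hypothetical helper, not a Crawl4AI function, and the library may accept or require other keys.

```python
from typing import Any, Dict, List


def check_schema(schema: Dict[str, Any]) -> List[str]:
    """Hypothetical check: return a list of problems (empty if none found)."""
    problems = []
    if not schema.get("baseSelector"):
        problems.append("missing 'baseSelector'")
    fields = schema.get("fields") or []
    if not fields:
        problems.append("no 'fields' defined")
    for i, field in enumerate(fields):
        # Every field in the example carries these three keys.
        for key in ("name", "selector", "type"):
            if key not in field:
                problems.append(f"field {i}: missing '{key}'")
        # Attribute fields also name which HTML attribute to read.
        if field.get("type") == "attribute" and "attribute" not in field:
            problems.append(f"field {i}: type 'attribute' needs an 'attribute' key")
    return problems


example_schema = {
    "name": "News Articles",
    "baseSelector": "article.tease-card",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}
print(check_schema(example_schema))  # prints []
```

Returning a list of messages rather than raising on the first problem makes it easier to fix a hand-written schema in one pass.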
+
+#### Example
+```python
+import asyncio
+import json
+from crawl4ai import AsyncWebCrawler
+from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+
+async def main():
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        # Define the extraction schema
+        schema = {
+            "name": "News Articles",
+            "baseSelector": "article.tease-card",
+            "fields": [
+                {
+                    "name": "title",
+                    "selector": "h2",
+                    "type": "text",
+                },
+                {
+                    "name": "summary",
+                    "selector": "div.tease-card__info",
+                    "type": "text",
+                },
+                {
+                    "name": "link",
+                    "selector": "a",
+                    "type": "attribute",
+                    "attribute": "href"
+                }
+            ],
+        }
+
+        # Create the extraction strategy
+        strategy = JsonCssExtractionStrategy(schema, verbose=True)
+
+        # Sample URL
+        url = "https://www.nbcnews.com/business"
+
+        # Run the crawler with the extraction strategy
+        result = await crawler.arun(url=url, extraction_strategy=strategy)
+
+        # Parse and print the extracted content
+        extracted_data = json.loads(result.extracted_content)
+        print(json.dumps(extracted_data, indent=2))
+
+asyncio.run(main())
+```
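Once the JSON string is parsed, the records usually need light cleanup before use. The sketch below assumes `result.extracted_content` decodes to a list of objects keyed by the schema's field names (`title`, `summary`, `link`), which the example above suggests but does not guarantee; `clean_articles` is a hypothetical helper, not part of Crawl4AI.

```python
import json
from typing import Dict, List
from urllib.parse import urljoin


def clean_articles(extracted_content: str, base_url: str) -> List[Dict[str, str]]:
    """Hypothetical post-processing: drop incomplete records, absolutize links."""
    records = json.loads(extracted_content)
    cleaned = []
    for record in records:
        title = (record.get("title") or "").strip()
        link = record.get("link") or ""
        # Skip cards where the selectors matched nothing useful.
        if not title or not link:
            continue
        cleaned.append({
            "title": title,
            "summary": (record.get("summary") or "").strip(),
            # Relative hrefs are resolved against the crawled page URL.
            "link": urljoin(base_url, link),
        })
    return cleaned
```

With the example above, this would be called as `clean_articles(result.extracted_content, url)`.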
+
+#### Use Cases for JsonCssExtractionStrategy
+- Extracting product information from e-commerce websites.
+- Gathering news articles and their metadata from news portals.
+- Collecting user reviews and ratings from review websites.
+- Extracting job listings from job boards.
+
+By choosing the right extraction strategy, you can effectively extract the most relevant and useful information from web content. Whether you need fast, accurate semantic segmentation with `CosineStrategy`, nuanced, instruction-based extraction with `LLMExtractionStrategy`, or precise structured data extraction with `JsonCssExtractionStrategy`, Crawl4AI has you covered. Happy extracting! 🕵️‍♂️✨
+
+For more details on schema definitions and advanced extraction strategies, check out the [Advanced JsonCssExtraction](../full_details/advanced_jsoncss_extraction.md) guide.
+

 ### CosineStrategy

@@ -21,96 +158,28 @@ Crawl4AI offers powerful extraction strategies to derive meaningful information

 #### Example
 ```python
+import asyncio
+from crawl4ai import AsyncWebCrawler
 from crawl4ai.extraction_strategy import CosineStrategy
-from crawl4ai import WebCrawler

-crawler = WebCrawler()
-crawler.warmup()
+async def main():
+    async with AsyncWebCrawler(verbose=True) as crawler:
+        # Define extraction strategy
+        strategy = CosineStrategy(
+            semantic_filter="finance economy stock market",
+            word_count_threshold=10,
+            max_dist=0.2,
+            linkage_method='ward',
+            top_k=3,
+            model_name='BAAI/bge-small-en-v1.5'
+        )

-# Define extraction strategy
-strategy = CosineStrategy(
-    semantic_filter="finance economy stock market",
-    word_count_threshold=10,
-    max_dist=0.2,
-    linkage_method='ward',
-    top_k=3,
-    model_name='BAAI/bge-small-en-v1.5'
-)
+        # Sample URL
+        url = "https://www.nbcnews.com/business"

-# Sample URL
-url = "https://www.nbcnews.com/business"
+        # Run the crawler with the extraction strategy
+        result = await crawler.arun(url=url, extraction_strategy=strategy)
+        print(result.extracted_content)

-# Run the crawler with the extraction strategy
-result = crawler.run(url=url, extraction_strategy=strategy)
-print(result.extracted_content)
+asyncio.run(main())
 ```
-
-### LLMExtractionStrategy
-
-`LLMExtractionStrategy` leverages a Language Model (LLM) to extract meaningful content from HTML. This strategy uses an external provider for LLM completions to perform extraction based on instructions.
-
-#### When to Use
-- Suitable for complex extraction tasks requiring nuanced understanding.
-- Ideal for scenarios where detailed instructions can guide the extraction process.
-- Perfect for extracting specific types of information or content with precise guidelines.
-
-#### Parameters
-- `provider` (str, optional): Provider for language model completions (e.g., openai/gpt-4). Default is `DEFAULT_PROVIDER`.
-- `api_token` (str, optional): API token for the provider. If not provided, it will try to load from the environment variable `OPENAI_API_KEY`.
-- `instruction` (str, optional): Instructions to guide the LLM on how to perform the extraction. Default is `None`.
-
-#### Example Without Instructions
-```python
-from crawl4ai.extraction_strategy import LLMExtractionStrategy
-from crawl4ai import WebCrawler
-
-crawler = WebCrawler()
-crawler.warmup()
-
-# Define extraction strategy without instructions
-strategy = LLMExtractionStrategy(
-    provider='openai',
-    api_token='your_api_token'
-)
-
-# Sample URL
-url = "https://www.nbcnews.com/business"
-
-# Run the crawler with the extraction strategy
-result = crawler.run(url=url, extraction_strategy=strategy)
-print(result.extracted_content)
-```
-
-#### Example With Instructions
-```python
-from crawl4ai.extraction_strategy import LLMExtractionStrategy
-from crawl4ai import WebCrawler
-
-crawler = WebCrawler()
-crawler.warmup()
-
-# Define extraction strategy with instructions
-strategy = LLMExtractionStrategy(
-    provider='openai',
-    api_token='your_api_token',
-    instruction="Extract only financial news and summarize key points."
-)
-
-# Sample URL
-url = "https://www.nbcnews.com/business"
-
-# Run the crawler with the extraction strategy
-result = crawler.run(url=url, extraction_strategy=strategy)
-print(result.extracted_content)
-```
-
-#### Use Cases for LLMExtractionStrategy
-- Extracting specific data types from structured or semi-structured content.
-- Generating summaries, extracting key information, or transforming content into different formats.
-- Performing detailed extractions based on custom instructions.
-
-For more detailed examples, please refer to the [Examples section](../examples/index.md) of the documentation.
-
----
-
-By choosing the right extraction strategy, you can effectively extract the most relevant and useful information from web content. Whether you need fast, accurate semantic segmentation with `CosineStrategy` or nuanced, instruction-based extraction with `LLMExtractionStrategy`, Crawl4AI has you covered. Happy extracting! 🕵️‍♂️✨