# Extraction & Chunking Strategies API

This documentation covers the API reference for extraction and chunking strategies in Crawl4AI.

## Extraction Strategies

All extraction strategies inherit from the base `ExtractionStrategy` class and implement two key methods (a minimal custom-strategy sketch follows the list):

- `extract(url: str, html: str) -> List[Dict[str, Any]]`
- `run(url: str, sections: List[str]) -> List[Dict[str, Any]]`
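To make the interface concrete, here is a minimal sketch of a custom strategy. It assumes `ExtractionStrategy` is importable from `crawl4ai.extraction_strategy`; the class name, the regex logic, and the exact shape of the returned dictionaries are illustrative, not part of the documented API.

```python
import re
from typing import Any, Dict, List

from crawl4ai.extraction_strategy import ExtractionStrategy

class HeadingExtractionStrategy(ExtractionStrategy):
    """Hypothetical strategy: pull <h1>/<h2> headings out of raw HTML."""

    def extract(self, url: str, html: str) -> List[Dict[str, Any]]:
        headings = re.findall(r"<h[12][^>]*>(.*?)</h[12]>", html, re.S)
        # Return one block dict per heading (key names are illustrative).
        return [{"index": i, "content": h.strip()} for i, h in enumerate(headings)]

    def run(self, url: str, sections: List[str]) -> List[Dict[str, Any]]:
        # Apply the same extraction to each pre-split section.
        results: List[Dict[str, Any]] = []
        for section in sections:
            results.extend(self.extract(url, section))
        return results
```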
### LLMExtractionStrategy

Used for extracting structured data using Language Models.

```python
LLMExtractionStrategy(
    # Required Parameters
    provider: str = DEFAULT_PROVIDER,   # LLM provider (e.g., "ollama/llama2")
    api_token: Optional[str] = None,    # API token

    # Extraction Configuration
    instruction: str = None,            # Custom extraction instruction
    schema: Dict = None,                # Pydantic model schema for structured data
    extraction_type: str = "block",     # "block" or "schema"

    # Chunking Parameters
    chunk_token_threshold: int = 4000,  # Maximum tokens per chunk
    overlap_rate: float = 0.1,          # Overlap between chunks
    word_token_rate: float = 0.75,      # Word-to-token conversion rate
    apply_chunking: bool = True,        # Enable/disable chunking

    # API Configuration
    base_url: str = None,               # Base URL for API
    extra_args: Dict = {},              # Additional provider arguments
    verbose: bool = False               # Enable verbose logging
)
```
### CosineStrategy

Used for content similarity-based extraction and clustering.

```python
CosineStrategy(
    # Content Filtering
    semantic_filter: str = None,      # Topic/keyword filter
    word_count_threshold: int = 10,   # Minimum words per cluster
    sim_threshold: float = 0.3,       # Similarity threshold

    # Clustering Parameters
    max_dist: float = 0.2,            # Maximum cluster distance
    linkage_method: str = 'ward',     # Clustering method
    top_k: int = 3,                   # Top clusters to return

    # Model Configuration
    model_name: str = 'sentence-transformers/all-MiniLM-L6-v2',  # Embedding model

    verbose: bool = False             # Enable verbose logging
)
```
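`CosineStrategy` does not appear in the usage examples below, so here is a minimal sketch. The parameter values and URL are illustrative, and it assumes the same `AsyncWebCrawler` flow used by the other strategies in this document.

```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import CosineStrategy

# Cluster page content and keep only clusters related to the filter topic.
strategy = CosineStrategy(
    semantic_filter="product reviews",  # illustrative topic filter
    word_count_threshold=10,
    top_k=3,
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com/reviews",
        extraction_strategy=strategy,
    )
    print(result.extracted_content)
```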
### JsonCssExtractionStrategy

Used for CSS selector-based structured data extraction.

```python
JsonCssExtractionStrategy(
    schema: Dict[str, Any],  # Extraction schema
    verbose: bool = False    # Enable verbose logging
)

# Schema Structure
schema = {
    "name": str,              # Schema name
    "baseSelector": str,      # Base CSS selector
    "fields": [               # List of fields to extract
        {
            "name": str,      # Field name
            "selector": str,  # CSS selector
            "type": str,      # Field type: "text", "attribute", "html", "regex"
            "attribute": str, # For type="attribute"
            "pattern": str,   # For type="regex"
            "transform": str, # Optional: "lowercase", "uppercase", "strip"
            "default": Any    # Default value if extraction fails
        }
    ]
}
```
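The `regex` field type is not covered by the usage example later in this document; here is a hedged sketch of such a field, following the schema structure above. The selector, pattern, and field name are illustrative.

```python
# Illustrative field definition using type="regex": the pattern is applied
# to the selected element's text, with "default" as the fallback value.
price_field = {
    "name": "price_value",
    "selector": ".price",
    "type": "regex",
    "pattern": r"\$(\d+\.\d{2})",
    "default": None
}
```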
## Chunking Strategies

All chunking strategies inherit from `ChunkingStrategy` and implement the `chunk(text: str) -> list` method; a minimal custom implementation is sketched below.
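The same pattern as custom extraction strategies applies here. A minimal sketch, assuming `ChunkingStrategy` is importable from `crawl4ai.chunking_strategy`; the sentence-splitting rule is illustrative.

```python
import re

from crawl4ai.chunking_strategy import ChunkingStrategy

class SentenceChunking(ChunkingStrategy):
    """Hypothetical strategy: one chunk per sentence."""

    def chunk(self, text: str) -> list:
        # Naive sentence split on ., !, or ? followed by whitespace.
        return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
```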
### RegexChunking

Splits text based on regex patterns.

```python
RegexChunking(
    patterns: List[str] = None  # Regex patterns for splitting
                                # Default: [r'\n\n']
)
```
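A quick sketch of paragraph-level chunking with the default pattern; the exact whitespace handling of the returned chunks may differ from what the comment suggests.

```python
from crawl4ai.chunking_strategy import RegexChunking

chunker = RegexChunking()  # defaults to splitting on blank lines
chunks = chunker.chunk("First paragraph.\n\nSecond paragraph.")
# Expected: two chunks, one per paragraph.
```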
### SlidingWindowChunking

Creates overlapping chunks with a sliding window approach.

```python
SlidingWindowChunking(
    window_size: int = 100,  # Window size in words
    step: int = 50           # Step size between windows
)
```
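With the defaults, each chunk is 100 words long and starts 50 words after the previous one, so consecutive chunks share roughly half their content. A short sketch with illustrative input:

```python
from crawl4ai.chunking_strategy import SlidingWindowChunking

chunker = SlidingWindowChunking(window_size=100, step=50)
long_text = " ".join(f"word{i}" for i in range(300))  # 300-word sample
chunks = chunker.chunk(long_text)
# Consecutive chunks overlap by ~50 words, preserving context at boundaries.
```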
### OverlappingWindowChunking

Creates chunks with specified overlap.

```python
OverlappingWindowChunking(
    window_size: int = 1000,  # Chunk size in words
    overlap: int = 100        # Overlap size in words
)
```
## Usage Examples

### LLM Extraction

```python
import json

from pydantic import BaseModel
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Define schema
class Article(BaseModel):
    title: str
    content: str
    author: str

# Create strategy
strategy = LLMExtractionStrategy(
    provider="ollama/llama2",
    schema=Article.schema(),
    extraction_type="schema",
    instruction="Extract article details"
)

# Use with crawler
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com/article",
        extraction_strategy=strategy
    )

    # Access extracted data
    data = json.loads(result.extracted_content)
```
### CSS Extraction

```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Define schema
schema = {
    "name": "Product List",
    "baseSelector": ".product-card",
    "fields": [
        {
            "name": "title",
            "selector": "h2.title",
            "type": "text"
        },
        {
            "name": "price",
            "selector": ".price",
            "type": "text",
            "transform": "strip"
        },
        {
            "name": "image",
            "selector": "img",
            "type": "attribute",
            "attribute": "src"
        }
    ]
}

# Create and use strategy
strategy = JsonCssExtractionStrategy(schema)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com/products",
        extraction_strategy=strategy
    )
```
### Content Chunking

```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.chunking_strategy import OverlappingWindowChunking
from crawl4ai.extraction_strategy import LLMExtractionStrategy

# Create chunking strategy
chunker = OverlappingWindowChunking(
    window_size=500,  # 500 words per chunk
    overlap=50        # 50 words overlap
)

# Use with extraction strategy
strategy = LLMExtractionStrategy(
    provider="ollama/llama2",
    chunking_strategy=chunker
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com/long-article",
        extraction_strategy=strategy
    )
```
## Best Practices

1. **Choose the Right Strategy**
   - Use `LLMExtractionStrategy` for complex, unstructured content
   - Use `JsonCssExtractionStrategy` for well-structured HTML
   - Use `CosineStrategy` for content similarity and clustering
2. **Optimize Chunking**
   ```python
   # For long documents
   strategy = LLMExtractionStrategy(
       chunk_token_threshold=2000,  # Smaller chunks
       overlap_rate=0.1             # 10% overlap
   )
   ```
3. **Handle Errors**
   ```python
   try:
       result = await crawler.arun(
           url="https://example.com",
           extraction_strategy=strategy
       )
       if result.success:
           content = json.loads(result.extracted_content)
   except Exception as e:
       print(f"Extraction failed: {e}")
   ```
4. **Monitor Performance**
   ```python
   strategy = CosineStrategy(
       verbose=True,             # Enable logging
       word_count_threshold=20,  # Filter short content
       top_k=5                   # Limit results
   )
   ```