Commit Message:

Enhance Crawl4AI with CLI and documentation updates

- Implemented Command-Line Interface (CLI) in `crawl4ai/cli.py`
- Added chunking strategies and their documentation in `llm.txt`
2024-12-21 14:26:56 +08:00
parent 8fbc2e0463
commit 84b311760f
47 changed files with 6510 additions and 2 deletions
--- a/docs/llm.txt/7_extraction_strategies.sm.md
+++ b/docs/llm.txt/7_extraction_strategies.sm.md
@@ -0,0 +1,81 @@
+# Extraction Strategies (Condensed LLM-Friendly Reference)
+
+> Extract structured data (JSON) and text blocks from HTML with LLM-based or clustering methods.
+
+Streamlined parameters, usage, and code snippets for quick LLM reference.
+
+## LLMExtractionStrategy
+
+- Uses an LLM to extract structured data from HTML.
+- Supports `instruction`, `schema`, `extraction_type`, `chunk_token_threshold`, `overlap_rate`.
+```python
+from crawl4ai.extraction_strategy import LLMExtractionStrategy
+strategy = LLMExtractionStrategy(
+    provider="openai/gpt-4o-mini",  # "<provider>/<model>" string
+    api_token="your_api_token",
+    instruction="Extract prices",
+    schema={"fields": [...]},
+    extraction_type="schema"
+)
+```
+
+## CosineStrategy
+
+- Clusters content via semantic embeddings.
+- Key params: `semantic_filter`, `word_count_threshold`, `sim_threshold`, `top_k`.
+```python
+from crawl4ai.extraction_strategy import CosineStrategy
+strategy = CosineStrategy(
+    semantic_filter="product reviews",
+    word_count_threshold=20,
+    sim_threshold=0.3,
+    top_k=5
+)
+```
+
+## JsonCssExtractionStrategy
+
+- Extracts data using CSS selectors.
+- `schema` defines `baseSelector`, `fields`.
+```python
+from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
+schema = {
+  "baseSelector": ".product",
+  "fields": [
+    {"name":"title","selector":"h2","type":"text"},
+    {"name":"price","selector":".price","type":"text"}
+  ]
+}
+strategy = JsonCssExtractionStrategy(schema=schema)
+```
+
+## JsonXPathExtractionStrategy
+
+- Similar to CSS but uses XPath.
+- Can be more stable than CSS when class names change, provided the XPath targets structure or text rather than `@class`.
+```python
+from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy
+schema = {
+  "baseSelector": "//div[@class='product']",
+  "fields": [
+    {"name":"title","selector":".//h2","type":"text"},
+    {"name":"price","selector":".//span[@class='price']","type":"text"}
+  ]
+}
+strategy = JsonXPathExtractionStrategy(schema=schema)
+```
+
+## Example Usage
+
+```python
+from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
+# `strategy` is any strategy built above; run this inside an async function (e.g. via asyncio.run).
+config = CrawlerRunConfig(extraction_strategy=strategy)
+async with AsyncWebCrawler() as crawler:
+    result = await crawler.arun("https://example.com", config=config)
+    print(result.extracted_content)
+```
+
+## Optional
+
+- [extraction_strategy.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/extraction_strategy.py)
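
The `Example Usage` snippet in `7_extraction_strategies.sm.md` uses `async with` and `await` at module level, which Python only accepts inside a coroutine (or an async-aware REPL). A minimal sketch of a standalone version might look like the following. It assumes `result.extracted_content` is a JSON string (or empty when nothing was extracted); `crawl_and_extract`, `parse_extracted`, and `run` are hypothetical helper names, not part of crawl4ai.

```python
import asyncio
import json


def parse_extracted(extracted_content):
    """Decode extracted content into a list of records.

    Assumption: the content is JSON text, or None/empty when nothing was extracted.
    """
    if not extracted_content:
        return []
    data = json.loads(extracted_content)
    # Normalize a single object to a one-element list.
    return data if isinstance(data, list) else [data]


async def crawl_and_extract(url, strategy):
    """Crawl one URL with the given extraction strategy and return parsed records."""
    # Imported lazily so parse_extracted stays usable without crawl4ai installed.
    from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

    config = CrawlerRunConfig(extraction_strategy=strategy)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url, config=config)
    return parse_extracted(result.extracted_content)


def run(url, strategy):
    """Synchronous entry point: drive the coroutine with asyncio.run."""
    return asyncio.run(crawl_and_extract(url, strategy))
```

With any of the strategies from the file, `run("https://example.com", strategy)` would then return a list of records. Keeping the JSON decoding in its own function makes it possible to check the parsing logic without network access.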