Commit Message:

Enhance Crawl4AI with CLI and documentation updates - Implemented Command-Line Interface (CLI) in `crawl4ai/cli.py` - Added chunking strategies and their documentation in `llm.txt`
2024-12-21 14:26:56 +08:00
parent 8fbc2e0463
commit 84b311760f
47 changed files with 6510 additions and 2 deletions
--- a/docs/llm.txt/7_extraction_strategies.q.md
+++ b/docs/llm.txt/7_extraction_strategies.q.md
@@ -0,0 +1,74 @@
+### Hypothetical Questions
+
+1. **LLM Extraction Strategy**
+   - *"How can I use an LLM to dynamically extract structured data from a webpage?"*
+   - *"What is the difference between block extraction and schema-based extraction in the LLM strategy?"*
+   - *"How can I define a JSON schema and incorporate it into the LLM extraction process?"*
+   - *"What parameters control chunk size and overlap for LLM-based extraction?"*
+   - *"How do I handle errors, retries, and backoff when calling an LLM API for extraction?"*
+
+2. **Cosine Strategy**
+   - *"How does the Cosine Strategy identify and cluster semantically similar content?"*
+   - *"What parameters (like `sim_threshold` or `word_count_threshold`) affect the relevance of extracted content?"*
+   - *"When should I use semantic filtering with Cosine Strategy vs. simple keyword filtering?"*
+   - *"How can I adjust `top_k` to retrieve more or fewer relevant content clusters?"*
+   - *"In what scenarios is the Cosine Strategy more effective than LLM-based or CSS/XPath extraction?"*
+
+3. **JSON-Based Extraction Strategies (Without LLMs)**
+   - *"What are the advantages of using JSON-based extraction strategies like `JsonCssExtractionStrategy` and `JsonXPathExtractionStrategy` over LLM-based methods?"*
+   - *"How do CSS and XPath selectors differ, and when is XPath more reliable?"*
+   - *"How can I handle frequently changing class names or dynamic elements using XPath-based extraction?"*
+   - *"Can I run these extraction strategies offline without any external API calls?"*
+   - *"How do I combine JS execution with XPath extraction to handle dynamically loaded content?"*
+
+4. **Environmental and Efficiency Considerations**
+   - *"Why should I avoid continuous LLM calls for repetitive extraction tasks?"*
+   - *"How does using XPath extraction reduce energy consumption and costs?"*
+   - *"Can I initially use an LLM to generate a schema and then rely solely on efficient, local strategies?"*
+
+5. **Schema Generation with a One-Time LLM Utility**
+   - *"How can I use a one-time LLM call to generate a schema, then run extraction repeatedly without further LLM costs?"*
+   - *"What steps are involved in using a language model just once to bootstrap my extraction schema?"*
+   - *"How do I incorporate the generated schema into `JsonXPathExtractionStrategy` for fast, robust extraction?"*
+
+6. **Advanced Use Cases and Best Practices**
+   - *"When should I combine LLM-based extraction with cosine similarity filtering for maximum relevance?"*
+   - *"What best practices should I follow when choosing thresholds and selectors to ensure stable, scalable extractions?"*
+   - *"How can I adapt these strategies to different page layouts, content types, or query requirements?"*
+   - *"Are there recommended troubleshooting steps if extraction fails or yields empty results?"*
+
+### Topics Discussed in the File
+
+- **LLM Extraction Strategy**:  
+  - **Modes**: Block-based or schema-based extraction using LLMs  
+  - **Parameters**: API tokens, instructions, schemas, chunk sizes, overlap rates  
+  - **Workflows**: Chunk merging, error handling, parallel execution  
+  - **Advantages**: Dynamic adaptability, schema-based extraction, scaling large content
+
+- **Cosine Strategy**:  
+  - **Approach**: Semantic filtering and clustering of content  
+  - **Parameters**: `semantic_filter`, `word_count_threshold`, `sim_threshold`, `top_k`  
+  - **Use Cases**: Extracting relevant content from unstructured pages based on semantic similarity  
+  - **Advanced Config**: Custom clustering methods, model choices, performance optimization
+
+- **JSON-Based Extraction Strategies (Non-LLM)**:  
+  - **Strategies**: `JsonCssExtractionStrategy` and `JsonXPathExtractionStrategy`  
+  - **Advantages**: Speed, efficiency, no external dependencies, environmentally friendly  
+  - **XPath vs. CSS**: XPath recommended for unstable, dynamic front-ends; more robust and structural  
+  - **Dynamic Content**: Combine JS execution and waiting conditions with XPath extraction
+
+- **Sustainability and Efficiency Considerations**:  
+  - **Rationale**: Avoiding continuous LLM use to save cost, reduce latency, and decrease carbon footprint  
+  - **Scalability**: Run on any device without expensive hardware or API calls
+
+- **One-Time LLM-Assisted Schema Generation**:  
+  - **Workflow**: Use LLM once to generate a schema from HTML and queries  
+  - **Afterwards**: Rely solely on JSON-based extraction (CSS/XPath) for fast and stable extractions  
+  - **Benefits**: Time-saving, cost-reducing, sustainable approach without sacrificing complexity
+
+- **Integration and Best Practices**:  
+  - **Threshold Tuning**: Iterative adjustments for `sim_threshold`, `word_count_threshold`  
+  - **Performance**: Chunking large content for LLM extraction, vectorizing content for cosine similarity  
+  - **Testing and Validation**: Use developer tools or dummy HTML to refine selectors, test JS code for dynamic content loading
+
+Overall, the file emphasizes choosing the right extraction strategy for the task—ranging from highly dynamic and schema-driven LLM approaches to more stable, efficient, and environmentally friendly direct HTML parsing methods (CSS/XPath). It also suggests a hybrid approach where an LLM can be used initially to generate a schema, then rely on local extraction strategies for ongoing tasks.