Commit Message:
Enhance Crawl4AI with CLI and documentation updates - Implemented Command-Line Interface (CLI) in `crawl4ai/cli.py` - Added chunking strategies and their documentation in `llm.txt`
This commit is contained in:
74
docs/llm.txt/7_extraction_strategies.q.md
Normal file
74
docs/llm.txt/7_extraction_strategies.q.md
Normal file
@@ -0,0 +1,74 @@
|
||||
### Hypothetical Questions
|
||||
|
||||
1. **LLM Extraction Strategy**
|
||||
- *"How can I use an LLM to dynamically extract structured data from a webpage?"*
|
||||
- *"What is the difference between block extraction and schema-based extraction in the LLM strategy?"*
|
||||
- *"How can I define a JSON schema and incorporate it into the LLM extraction process?"*
|
||||
- *"What parameters control chunk size and overlap for LLM-based extraction?"*
|
||||
- *"How do I handle errors, retries, and backoff when calling an LLM API for extraction?"*
|
||||
|
||||
2. **Cosine Strategy**
|
||||
- *"How does the Cosine Strategy identify and cluster semantically similar content?"*
|
||||
- *"What parameters (like `sim_threshold` or `word_count_threshold`) affect the relevance of extracted content?"*
|
||||
- *"When should I use semantic filtering with Cosine Strategy vs. simple keyword filtering?"*
|
||||
- *"How can I adjust `top_k` to retrieve more or fewer relevant content clusters?"*
|
||||
- *"In what scenarios is the Cosine Strategy more effective than LLM-based or CSS/XPath extraction?"*
|
||||
|
||||
3. **JSON-Based Extraction Strategies (Without LLMs)**
|
||||
- *"What are the advantages of using JSON-based extraction strategies like `JsonCssExtractionStrategy` and `JsonXPathExtractionStrategy` over LLM-based methods?"*
|
||||
- *"How do CSS and XPath selectors differ, and when is XPath more reliable?"*
|
||||
- *"How can I handle frequently changing class names or dynamic elements using XPath-based extraction?"*
|
||||
- *"Can I run these extraction strategies offline without any external API calls?"*
|
||||
- *"How do I combine JS execution with XPath extraction to handle dynamically loaded content?"*
|
||||
|
||||
4. **Environmental and Efficiency Considerations**
|
||||
- *"Why should I avoid continuous LLM calls for repetitive extraction tasks?"*
|
||||
- *"How does using XPath extraction reduce energy consumption and costs?"*
|
||||
- *"Can I initially use an LLM to generate a schema and then rely solely on efficient, local strategies?"*
|
||||
|
||||
5. **Schema Generation with a One-Time LLM Utility**
|
||||
- *"How can I use a one-time LLM call to generate a schema, then run extraction repeatedly without further LLM costs?"*
|
||||
- *"What steps are involved in using a language model just once to bootstrap my extraction schema?"*
|
||||
- *"How do I incorporate the generated schema into `JsonXPathExtractionStrategy` for fast, robust extraction?"*
|
||||
|
||||
6. **Advanced Use Cases and Best Practices**
|
||||
- *"When should I combine LLM-based extraction with cosine similarity filtering for maximum relevance?"*
|
||||
- *"What best practices should I follow when choosing thresholds and selectors to ensure stable, scalable extractions?"*
|
||||
- *"How can I adapt these strategies to different page layouts, content types, or query requirements?"*
|
||||
- *"Are there recommended troubleshooting steps if extraction fails or yields empty results?"*
|
||||
|
||||
### Topics Discussed in the File
|
||||
|
||||
- **LLM Extraction Strategy**:
|
||||
- **Modes**: Block-based or schema-based extraction using LLMs
|
||||
- **Parameters**: API tokens, instructions, schemas, chunk sizes, overlap rates
|
||||
- **Workflows**: Chunk merging, error handling, parallel execution
|
||||
- **Advantages**: Dynamic adaptability, schema-based extraction, scaling large content
|
||||
|
||||
- **Cosine Strategy**:
|
||||
- **Approach**: Semantic filtering and clustering of content
|
||||
- **Parameters**: `semantic_filter`, `word_count_threshold`, `sim_threshold`, `top_k`
|
||||
- **Use Cases**: Extracting relevant content from unstructured pages based on semantic similarity
|
||||
- **Advanced Config**: Custom clustering methods, model choices, performance optimization
|
||||
|
||||
- **JSON-Based Extraction Strategies (Non-LLM)**:
|
||||
- **Strategies**: `JsonCssExtractionStrategy` and `JsonXPathExtractionStrategy`
|
||||
- **Advantages**: Speed, efficiency, no external dependencies, environmentally friendly
|
||||
- **XPath vs. CSS**: XPath recommended for unstable, dynamic front-ends; more robust and structural
|
||||
- **Dynamic Content**: Combine JS execution and waiting conditions with XPath extraction
|
||||
|
||||
- **Sustainability and Efficiency Considerations**:
|
||||
- **Rationale**: Avoiding continuous LLM use to save cost, reduce latency, and decrease carbon footprint
|
||||
- **Scalability**: Run on any device without expensive hardware or API calls
|
||||
|
||||
- **One-Time LLM-Assisted Schema Generation**:
|
||||
- **Workflow**: Use LLM once to generate a schema from HTML and queries
|
||||
- **Afterwards**: Rely solely on JSON-based extraction (CSS/XPath) for fast and stable extractions
|
||||
- **Benefits**: Time-saving, cost-reducing, sustainable approach without sacrificing complexity
|
||||
|
||||
- **Integration and Best Practices**:
|
||||
- **Threshold Tuning**: Iterative adjustments for `sim_threshold`, `word_count_threshold`
|
||||
- **Performance**: Chunking large content for LLM extraction, vectorizing content for cosine similarity
|
||||
- **Testing and Validation**: Use developer tools or dummy HTML to refine selectors, test JS code for dynamic content loading
|
||||
|
||||
Overall, the file emphasizes choosing the right extraction strategy for the task—ranging from highly dynamic and schema-driven LLM approaches to more stable, efficient, and environmentally friendly direct HTML parsing methods (CSS/XPath). It also suggests a hybrid approach where an LLM can be used initially to generate a schema, then rely on local extraction strategies for ongoing tasks.
|
||||
Reference in New Issue
Block a user