Commit Message:

Enhance Crawl4AI with CLI and documentation updates
  - Implemented Command-Line Interface (CLI) in `crawl4ai/cli.py`
  - Added chunking strategies and their documentation in `llm.txt`
This commit is contained in:
UncleCode
2024-12-21 14:26:56 +08:00
parent 8fbc2e0463
commit 84b311760f
47 changed files with 6510 additions and 2 deletions

View File

@@ -0,0 +1,74 @@
### Hypothetical Questions
1. **LLM Extraction Strategy**
- *"How can I use an LLM to dynamically extract structured data from a webpage?"*
- *"What is the difference between block extraction and schema-based extraction in the LLM strategy?"*
- *"How can I define a JSON schema and incorporate it into the LLM extraction process?"*
- *"What parameters control chunk size and overlap for LLM-based extraction?"*
- *"How do I handle errors, retries, and backoff when calling an LLM API for extraction?"*
2. **Cosine Strategy**
- *"How does the Cosine Strategy identify and cluster semantically similar content?"*
- *"What parameters (like `sim_threshold` or `word_count_threshold`) affect the relevance of extracted content?"*
- *"When should I use semantic filtering with Cosine Strategy vs. simple keyword filtering?"*
- *"How can I adjust `top_k` to retrieve more or fewer relevant content clusters?"*
- *"In what scenarios is the Cosine Strategy more effective than LLM-based or CSS/XPath extraction?"*
3. **JSON-Based Extraction Strategies (Without LLMs)**
- *"What are the advantages of using JSON-based extraction strategies like `JsonCssExtractionStrategy` and `JsonXPathExtractionStrategy` over LLM-based methods?"*
- *"How do CSS and XPath selectors differ, and when is XPath more reliable?"*
- *"How can I handle frequently changing class names or dynamic elements using XPath-based extraction?"*
- *"Can I run these extraction strategies offline without any external API calls?"*
- *"How do I combine JS execution with XPath extraction to handle dynamically loaded content?"*
4. **Environmental and Efficiency Considerations**
- *"Why should I avoid continuous LLM calls for repetitive extraction tasks?"*
- *"How does using XPath extraction reduce energy consumption and costs?"*
- *"Can I initially use an LLM to generate a schema and then rely solely on efficient, local strategies?"*
5. **Schema Generation with a One-Time LLM Utility**
- *"How can I use a one-time LLM call to generate a schema, then run extraction repeatedly without further LLM costs?"*
- *"What steps are involved in using a language model just once to bootstrap my extraction schema?"*
- *"How do I incorporate the generated schema into `JsonXPathExtractionStrategy` for fast, robust extraction?"*
6. **Advanced Use Cases and Best Practices**
- *"When should I combine LLM-based extraction with cosine similarity filtering for maximum relevance?"*
- *"What best practices should I follow when choosing thresholds and selectors to ensure stable, scalable extractions?"*
- *"How can I adapt these strategies to different page layouts, content types, or query requirements?"*
- *"Are there recommended troubleshooting steps if extraction fails or yields empty results?"*
### Topics Discussed in the File
- **LLM Extraction Strategy**:
- **Modes**: Block-based or schema-based extraction using LLMs
- **Parameters**: API tokens, instructions, schemas, chunk sizes, overlap rates
- **Workflows**: Chunk merging, error handling, parallel execution
- **Advantages**: Dynamic adaptability, schema-based extraction, scaling large content
- **Cosine Strategy**:
- **Approach**: Semantic filtering and clustering of content
- **Parameters**: `semantic_filter`, `word_count_threshold`, `sim_threshold`, `top_k`
- **Use Cases**: Extracting relevant content from unstructured pages based on semantic similarity
- **Advanced Config**: Custom clustering methods, model choices, performance optimization
- **JSON-Based Extraction Strategies (Non-LLM)**:
- **Strategies**: `JsonCssExtractionStrategy` and `JsonXPathExtractionStrategy`
- **Advantages**: Speed, efficiency, no external dependencies, environmentally friendly
- **XPath vs. CSS**: XPath recommended for unstable, dynamic front-ends; more robust and structural
- **Dynamic Content**: Combine JS execution and waiting conditions with XPath extraction
- **Sustainability and Efficiency Considerations**:
- **Rationale**: Avoiding continuous LLM use to save cost, reduce latency, and decrease carbon footprint
- **Scalability**: Run on any device without expensive hardware or API calls
- **One-Time LLM-Assisted Schema Generation**:
- **Workflow**: Use LLM once to generate a schema from HTML and queries
- **Afterwards**: Rely solely on JSON-based extraction (CSS/XPath) for fast and stable extractions
- **Benefits**: Time-saving, cost-reducing, sustainable approach without sacrificing complexity
- **Integration and Best Practices**:
- **Threshold Tuning**: Iterative adjustments for `sim_threshold`, `word_count_threshold`
- **Performance**: Chunking large content for LLM extraction, vectorizing content for cosine similarity
- **Testing and Validation**: Use developer tools or dummy HTML to refine selectors, test JS code for dynamic content loading
Overall, the file emphasizes choosing the right extraction strategy for the task—ranging from highly dynamic and schema-driven LLM approaches to more stable, efficient, and environmentally friendly direct HTML parsing methods (CSS/XPath). It also suggests a hybrid approach where an LLM can be used initially to generate a schema, then rely on local extraction strategies for ongoing tasks.