Enhance crawler capabilities and documentation

- Add llm.txt generator
- Added SSL certificate extraction in AsyncWebCrawler.
- Introduced new content filters and chunking strategies for more robust data extraction.
- Updated documentation.
2024-12-25 21:34:31 +08:00
parent 84b311760f
commit d5ed451299
59 changed files with 2208 additions and 1763 deletions
--- a/docs/llm.txt/7_extraction_strategies.q.md
+++ b/docs/llm.txt/7_extraction_strategies.q.md
@@ -1,74 +1,12 @@
-### Hypothetical Questions
-
-1. **LLM Extraction Strategy**
-   - *"How can I use an LLM to dynamically extract structured data from a webpage?"*
-   - *"What is the difference between block extraction and schema-based extraction in the LLM strategy?"*
-   - *"How can I define a JSON schema and incorporate it into the LLM extraction process?"*
-   - *"What parameters control chunk size and overlap for LLM-based extraction?"*
-   - *"How do I handle errors, retries, and backoff when calling an LLM API for extraction?"*
-
-2. **Cosine Strategy**
-   - *"How does the Cosine Strategy identify and cluster semantically similar content?"*
-   - *"What parameters (like `sim_threshold` or `word_count_threshold`) affect the relevance of extracted content?"*
-   - *"When should I use semantic filtering with Cosine Strategy vs. simple keyword filtering?"*
-   - *"How can I adjust `top_k` to retrieve more or fewer relevant content clusters?"*
-   - *"In what scenarios is the Cosine Strategy more effective than LLM-based or CSS/XPath extraction?"*
-
-3. **JSON-Based Extraction Strategies (Without LLMs)**
-   - *"What are the advantages of using JSON-based extraction strategies like `JsonCssExtractionStrategy` and `JsonXPathExtractionStrategy` over LLM-based methods?"*
-   - *"How do CSS and XPath selectors differ, and when is XPath more reliable?"*
-   - *"How can I handle frequently changing class names or dynamic elements using XPath-based extraction?"*
-   - *"Can I run these extraction strategies offline without any external API calls?"*
-   - *"How do I combine JS execution with XPath extraction to handle dynamically loaded content?"*
-
-4. **Environmental and Efficiency Considerations**
-   - *"Why should I avoid continuous LLM calls for repetitive extraction tasks?"*
-   - *"How does using XPath extraction reduce energy consumption and costs?"*
-   - *"Can I initially use an LLM to generate a schema and then rely solely on efficient, local strategies?"*
-
-5. **Schema Generation with a One-Time LLM Utility**
-   - *"How can I use a one-time LLM call to generate a schema, then run extraction repeatedly without further LLM costs?"*
-   - *"What steps are involved in using a language model just once to bootstrap my extraction schema?"*
-   - *"How do I incorporate the generated schema into `JsonXPathExtractionStrategy` for fast, robust extraction?"*
-
-6. **Advanced Use Cases and Best Practices**
-   - *"When should I combine LLM-based extraction with cosine similarity filtering for maximum relevance?"*
-   - *"What best practices should I follow when choosing thresholds and selectors to ensure stable, scalable extractions?"*
-   - *"How can I adapt these strategies to different page layouts, content types, or query requirements?"*
-   - *"Are there recommended troubleshooting steps if extraction fails or yields empty results?"*
-
-### Topics Discussed in the File
-
- **LLM Extraction Strategy**:  
-  - **Modes**: Block-based or schema-based extraction using LLMs  
-  - **Parameters**: API tokens, instructions, schemas, chunk sizes, overlap rates  
-  - **Workflows**: Chunk merging, error handling, parallel execution  
-  - **Advantages**: Dynamic adaptability, schema-based extraction, scaling large content
-
- **Cosine Strategy**:  
-  - **Approach**: Semantic filtering and clustering of content  
-  - **Parameters**: `semantic_filter`, `word_count_threshold`, `sim_threshold`, `top_k`  
-  - **Use Cases**: Extracting relevant content from unstructured pages based on semantic similarity  
-  - **Advanced Config**: Custom clustering methods, model choices, performance optimization
-
- **JSON-Based Extraction Strategies (Non-LLM)**:  
-  - **Strategies**: `JsonCssExtractionStrategy` and `JsonXPathExtractionStrategy`  
-  - **Advantages**: Speed, efficiency, no external dependencies, environmentally friendly  
-  - **XPath vs. CSS**: XPath recommended for unstable, dynamic front-ends; more robust and structural  
-  - **Dynamic Content**: Combine JS execution and waiting conditions with XPath extraction
-
- **Sustainability and Efficiency Considerations**:  
-  - **Rationale**: Avoiding continuous LLM use to save cost, reduce latency, and decrease carbon footprint  
-  - **Scalability**: Run on any device without expensive hardware or API calls
-
- **One-Time LLM-Assisted Schema Generation**:  
-  - **Workflow**: Use LLM once to generate a schema from HTML and queries  
-  - **Afterwards**: Rely solely on JSON-based extraction (CSS/XPath) for fast and stable extractions  
-  - **Benefits**: Time-saving, cost-reducing, sustainable approach without sacrificing complexity
-
- **Integration and Best Practices**:  
-  - **Threshold Tuning**: Iterative adjustments for `sim_threshold`, `word_count_threshold`  
-  - **Performance**: Chunking large content for LLM extraction, vectorizing content for cosine similarity  
-  - **Testing and Validation**: Use developer tools or dummy HTML to refine selectors, test JS code for dynamic content loading
-
-Overall, the file emphasizes choosing the right extraction strategy for the task—ranging from highly dynamic and schema-driven LLM approaches to more stable, efficient, and environmentally friendly direct HTML parsing methods (CSS/XPath). It also suggests a hybrid approach where an LLM can be used initially to generate a schema, then rely on local extraction strategies for ongoing tasks.
+llm_extraction: LLM Extraction Strategy uses language models to process web content into structured JSON | language model extraction, schema extraction, LLM parsing | LLMExtractionStrategy(provider="openai", api_token="token")
+schema_based_extraction: Extract data using predefined JSON schemas to structure LLM output | schema extraction, structured output | schema=OpenAIModelFee.model_json_schema()
+chunking_config: Configure content chunking with token threshold and overlap rate | content chunks, token limits | chunk_token_threshold=1000, overlap_rate=0.1
+provider_config: Specify LLM provider and API credentials for extraction | model provider, API setup | provider="openai", api_token="your_token"
+cosine_strategy: Use similarity-based clustering to extract relevant content sections | content clustering, semantic similarity | CosineStrategy(semantic_filter="product reviews")
+clustering_params: Configure clustering behavior with similarity thresholds and methods | similarity settings, cluster config | sim_threshold=0.3, linkage_method='ward'
+content_filtering: Filter extracted content based on word count and relevance | content filters, extraction rules | word_count_threshold=10, top_k=3
+xpath_extraction: Extract data using XPath selectors for stable structural parsing | xpath selectors, HTML parsing | JsonXPathExtractionStrategy(schema)
+css_extraction: Extract data using CSS selectors for simple HTML parsing | css selectors, HTML parsing | JsonCssExtractionStrategy(schema)
+schema_generation: Generate extraction schemas automatically using one-time LLM assistance | schema creation, automation | generate_schema(html, query)
+dynamic_content: Handle dynamic webpage content with JavaScript execution and waiting | async content, js execution | js_code="window.scrollTo(0, document.body.scrollHeight)"
+extraction_best_practices: Use XPath for stability, avoid unnecessary LLM calls, test selectors | optimization, reliability | baseSelector="//table/tbody/tr"
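
The added lines appear to follow a `key: description | keywords | example` layout, one entry per line. As a rough illustration of how such an index could be consumed, here is a minimal Python sketch. The layout is inferred from the added lines only, and the names `IndexEntry` and `parse_index_line` are hypothetical; they are not part of the library or of this commit.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    """One parsed index line (hypothetical structure, inferred from the diff)."""
    key: str
    description: str
    keywords: list[str]
    example: str


def parse_index_line(line: str) -> IndexEntry:
    """Parse a single 'key: description | keywords | example' line."""
    # Drop a leading diff '+' marker in case the line was copied from a patch.
    text = line.lstrip("+").strip()

    # The key ends at the first ': '; everything after it is the payload.
    key, sep, rest = text.partition(": ")
    if not sep:
        raise ValueError(f"missing 'key: ' prefix: {line!r}")

    # Split on the first two ' | ' separators only, so any later pipe
    # inside the trailing example snippet is preserved.
    parts = rest.split(" | ", 2)
    if len(parts) != 3:
        raise ValueError(f"expected three '|'-separated fields: {line!r}")

    description, keywords, example = parts
    return IndexEntry(
        key=key.strip(),
        description=description.strip(),
        keywords=[k.strip() for k in keywords.split(",") if k.strip()],
        example=example.strip(),
    )


if __name__ == "__main__":
    sample = (
        "+chunking_config: Configure content chunking with token threshold "
        "and overlap rate | content chunks, token limits | "
        "chunk_token_threshold=1000, overlap_rate=0.1"
    )
    print(parse_index_line(sample))
```

Splitting with a maximum of two separators keeps the example field intact even when it contains commas or further pipes, which matters because that field holds code fragments rather than prose.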