Enhance crawler capabilities and documentation
- Add llm.txt generator - Added SSL certificate extraction in AsyncWebCrawler. - Introduced new content filters and chunking strategies for more robust data extraction. - Updated documentation.
This commit is contained in:
@@ -1,74 +1,12 @@
|
||||
### Hypothetical Questions
|
||||
|
||||
1. **LLM Extraction Strategy**
|
||||
- *"How can I use an LLM to dynamically extract structured data from a webpage?"*
|
||||
- *"What is the difference between block extraction and schema-based extraction in the LLM strategy?"*
|
||||
- *"How can I define a JSON schema and incorporate it into the LLM extraction process?"*
|
||||
- *"What parameters control chunk size and overlap for LLM-based extraction?"*
|
||||
- *"How do I handle errors, retries, and backoff when calling an LLM API for extraction?"*
|
||||
|
||||
2. **Cosine Strategy**
|
||||
- *"How does the Cosine Strategy identify and cluster semantically similar content?"*
|
||||
- *"What parameters (like `sim_threshold` or `word_count_threshold`) affect the relevance of extracted content?"*
|
||||
- *"When should I use semantic filtering with Cosine Strategy vs. simple keyword filtering?"*
|
||||
- *"How can I adjust `top_k` to retrieve more or fewer relevant content clusters?"*
|
||||
- *"In what scenarios is the Cosine Strategy more effective than LLM-based or CSS/XPath extraction?"*
|
||||
|
||||
3. **JSON-Based Extraction Strategies (Without LLMs)**
|
||||
- *"What are the advantages of using JSON-based extraction strategies like `JsonCssExtractionStrategy` and `JsonXPathExtractionStrategy` over LLM-based methods?"*
|
||||
- *"How do CSS and XPath selectors differ, and when is XPath more reliable?"*
|
||||
- *"How can I handle frequently changing class names or dynamic elements using XPath-based extraction?"*
|
||||
- *"Can I run these extraction strategies offline without any external API calls?"*
|
||||
- *"How do I combine JS execution with XPath extraction to handle dynamically loaded content?"*
|
||||
|
||||
4. **Environmental and Efficiency Considerations**
|
||||
- *"Why should I avoid continuous LLM calls for repetitive extraction tasks?"*
|
||||
- *"How does using XPath extraction reduce energy consumption and costs?"*
|
||||
- *"Can I initially use an LLM to generate a schema and then rely solely on efficient, local strategies?"*
|
||||
|
||||
5. **Schema Generation with a One-Time LLM Utility**
|
||||
- *"How can I use a one-time LLM call to generate a schema, then run extraction repeatedly without further LLM costs?"*
|
||||
- *"What steps are involved in using a language model just once to bootstrap my extraction schema?"*
|
||||
- *"How do I incorporate the generated schema into `JsonXPathExtractionStrategy` for fast, robust extraction?"*
|
||||
|
||||
6. **Advanced Use Cases and Best Practices**
|
||||
- *"When should I combine LLM-based extraction with cosine similarity filtering for maximum relevance?"*
|
||||
- *"What best practices should I follow when choosing thresholds and selectors to ensure stable, scalable extractions?"*
|
||||
- *"How can I adapt these strategies to different page layouts, content types, or query requirements?"*
|
||||
- *"Are there recommended troubleshooting steps if extraction fails or yields empty results?"*
|
||||
|
||||
### Topics Discussed in the File
|
||||
|
||||
- **LLM Extraction Strategy**:
|
||||
- **Modes**: Block-based or schema-based extraction using LLMs
|
||||
- **Parameters**: API tokens, instructions, schemas, chunk sizes, overlap rates
|
||||
- **Workflows**: Chunk merging, error handling, parallel execution
|
||||
- **Advantages**: Dynamic adaptability, schema-based extraction, scaling large content
|
||||
|
||||
- **Cosine Strategy**:
|
||||
- **Approach**: Semantic filtering and clustering of content
|
||||
- **Parameters**: `semantic_filter`, `word_count_threshold`, `sim_threshold`, `top_k`
|
||||
- **Use Cases**: Extracting relevant content from unstructured pages based on semantic similarity
|
||||
- **Advanced Config**: Custom clustering methods, model choices, performance optimization
|
||||
|
||||
- **JSON-Based Extraction Strategies (Non-LLM)**:
|
||||
- **Strategies**: `JsonCssExtractionStrategy` and `JsonXPathExtractionStrategy`
|
||||
- **Advantages**: Speed, efficiency, no external dependencies, environmentally friendly
|
||||
- **XPath vs. CSS**: XPath recommended for unstable, dynamic front-ends; more robust and structural
|
||||
- **Dynamic Content**: Combine JS execution and waiting conditions with XPath extraction
|
||||
|
||||
- **Sustainability and Efficiency Considerations**:
|
||||
- **Rationale**: Avoiding continuous LLM use to save cost, reduce latency, and decrease carbon footprint
|
||||
- **Scalability**: Run on any device without expensive hardware or API calls
|
||||
|
||||
- **One-Time LLM-Assisted Schema Generation**:
|
||||
- **Workflow**: Use LLM once to generate a schema from HTML and queries
|
||||
- **Afterwards**: Rely solely on JSON-based extraction (CSS/XPath) for fast and stable extractions
|
||||
- **Benefits**: Time-saving, cost-reducing, sustainable approach without sacrificing complexity
|
||||
|
||||
- **Integration and Best Practices**:
|
||||
- **Threshold Tuning**: Iterative adjustments for `sim_threshold`, `word_count_threshold`
|
||||
- **Performance**: Chunking large content for LLM extraction, vectorizing content for cosine similarity
|
||||
- **Testing and Validation**: Use developer tools or dummy HTML to refine selectors, test JS code for dynamic content loading
|
||||
|
||||
Overall, the file emphasizes choosing the right extraction strategy for the task—ranging from highly dynamic and schema-driven LLM approaches to more stable, efficient, and environmentally friendly direct HTML parsing methods (CSS/XPath). It also suggests a hybrid approach where an LLM can be used initially to generate a schema, then rely on local extraction strategies for ongoing tasks.
|
||||
llm_extraction: LLM Extraction Strategy uses language models to process web content into structured JSON | language model extraction, schema extraction, LLM parsing | LLMExtractionStrategy(provider="openai", api_token="token")
|
||||
schema_based_extraction: Extract data using predefined JSON schemas to structure LLM output | schema extraction, structured output | schema=OpenAIModelFee.model_json_schema()
|
||||
chunking_config: Configure content chunking with token threshold and overlap rate | content chunks, token limits | chunk_token_threshold=1000, overlap_rate=0.1
|
||||
provider_config: Specify LLM provider and API credentials for extraction | model provider, API setup | provider="openai", api_token="your_token"
|
||||
cosine_strategy: Use similarity-based clustering to extract relevant content sections | content clustering, semantic similarity | CosineStrategy(semantic_filter="product reviews")
|
||||
clustering_params: Configure clustering behavior with similarity thresholds and methods | similarity settings, cluster config | sim_threshold=0.3, linkage_method='ward'
|
||||
content_filtering: Filter extracted content based on word count and relevance | content filters, extraction rules | word_count_threshold=10, top_k=3
|
||||
xpath_extraction: Extract data using XPath selectors for stable structural parsing | xpath selectors, HTML parsing | JsonXPathExtractionStrategy(schema)
|
||||
css_extraction: Extract data using CSS selectors for simple HTML parsing | css selectors, HTML parsing | JsonCssExtractionStrategy(schema)
|
||||
schema_generation: Generate extraction schemas automatically using one-time LLM assistance | schema creation, automation | generate_schema(html, query)
|
||||
dynamic_content: Handle dynamic webpage content with JavaScript execution and waiting | async content, js execution | js_code="window.scrollTo(0, document.body.scrollHeight)"
|
||||
extraction_best_practices: Use XPath for stability, avoid unnecessary LLM calls, test selectors | optimization, reliability | baseSelector="//table/tbody/tr"
|
||||
Reference in New Issue
Block a user