Files
crawl4ai/docs/llm.txt/8_content_selection.q.md
UncleCode 84b311760f Commit Message:
Enhance Crawl4AI with CLI and documentation updates
  - Implemented Command-Line Interface (CLI) in `crawl4ai/cli.py`
  - Added chunking strategies and their documentation in `llm.txt`
2024-12-21 14:26:56 +08:00

75 lines
4.5 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
### Hypothetical Questions
1. **Basic Content Selection**
- *"How can I use a CSS selector to extract only the main article text from a webpage?"*
- *"Whats a quick way to isolate a specific element or section of a webpage using Crawl4AI?"*
2. **Advanced CSS Selectors**
- *"How do I find the right CSS selector for a given element in a complex webpage?"*
- *"Can I combine multiple CSS selectors to target different parts of the page simultaneously?"*
3. **Content Filtering**
- *"What parameters can I use to remove non-essential elements like headers, footers, or ads?"*
- *"How do I filter out short or irrelevant text blocks using `word_count_threshold`?"*
- *"Is it possible to exclude external links, images, or social media widgets to get cleaner data?"*
4. **Iframe Content Handling**
- *"How do I enable iframe processing to extract content embedded in iframes?"*
- *"What should I do if the iframe content doesnt load or is blocked?"*
5. **LLM-Based Structured Extraction**
- *"When should I consider using LLM strategies for content extraction?"*
- *"How can I define a JSON schema for the LLM to produce structured, JSON-formatted outputs?"*
- *"What if the LLM returns incomplete or incorrect data—how can I refine the instructions or schema?"*
6. **Pattern-Based Selection with JSON Strategies**
- *"How can I extract multiple items (e.g., a list of articles or products) from a page using `JsonCssExtractionStrategy`?"*
- *"Whats the best way to handle nested fields or multiple levels of data using a JSON schema?"*
7. **Combining Multiple Techniques**
- *"How do I use CSS selectors, content filtering, and JSON-based extraction strategies together to get clean, structured data?"*
- *"Can I integrate LLM extraction for summarization alongside CSS-based extraction for raw content?"*
8. **Troubleshooting and Best Practices**
- *"Why am I getting empty or no results from my selectors, and how can I debug it?"*
- *"What should I do if content loading is dynamic and requires waiting or JS execution?"*
- *"How can I optimize performance and reliability for large-scale or repeated crawls?"*
9. **Performance and Reliability**
- *"How can I improve crawl speed while maintaining precision in content selection?"*
- *"Whats the benefit of using Dockerized environments for consistent and reproducible results?"*
10. **Additional Resources and Extensions**
- *"Where can I find the source code for the Async Web Crawler and strategies?"*
- *"What advanced topics, such as caching, proxy integration, or Docker deployments, can I explore next?"*
### Topics Discussed in the File
- **CSS Selectors for Content Isolation**:
Identifying elements with CSS selectors, using browser dev tools, and extracting targeted sections of a webpage.
- **Content Filtering Parameters**:
Removing unwanted tags, external links, social media elements, and enforcing minimum word counts to ensure meaningful content.
- **Handling Iframes**:
Enabling `process_iframes` and dealing with multi-domain or overlay elements to extract embedded content.
- **Structured Extraction with LLMs**:
Using `LLMExtractionStrategy` with schemas and instructions for complex or irregular data extraction, including JSON-based outputs.
- **Pattern-Based Extraction Using Schemas (JsonCssExtractionStrategy)**:
Defining a JSON schema to extract lists of items (e.g., articles, products) that follow a consistent pattern, capturing nested fields and attributes.
- **Combining Techniques**:
Integrating CSS selection, filtering, JSON schema extraction, and LLM-based transformation to get clean, structured, and context-rich results.
- **Troubleshooting and Best Practices**:
Adjusting selectors, filters, and instructions, lowering thresholds if empty results occur, and refining LLM prompts for better data.
- **Performance and Reliability**:
Starting with simple strategies, adding complexity as needed, and considering asynchronous crawling, caching, or Docker for large-scale operations.
- **Additional Resources**:
Links to code repositories, instructions for Docker deployments, caching strategies, and further refinement for advanced use cases.
In summary, the file provides comprehensive guidance on selecting and filtering content within Crawl4AI, covering everything from simple CSS-based extractions to advanced LLM-driven structured outputs, while also addressing common issues, best practices, and performance optimizations.