Enhance crawler capabilities and documentation
- Add llm.txt generator.
- Add SSL certificate extraction in AsyncWebCrawler.
- Introduce new content filters and chunking strategies for more robust data extraction.
- Update documentation.
@@ -1,75 +1,12 @@
### Hypothetical Questions

1. **Basic Content Selection**
   - *"How can I use a CSS selector to extract only the main article text from a webpage?"*
   - *"What’s a quick way to isolate a specific element or section of a webpage using Crawl4AI?"*

2. **Advanced CSS Selectors**
   - *"How do I find the right CSS selector for a given element in a complex webpage?"*
   - *"Can I combine multiple CSS selectors to target different parts of the page simultaneously?"*

3. **Content Filtering**
   - *"What parameters can I use to remove non-essential elements like headers, footers, or ads?"*
   - *"How do I filter out short or irrelevant text blocks using `word_count_threshold`?"*
   - *"Is it possible to exclude external links, images, or social media widgets to get cleaner data?"*

4. **Iframe Content Handling**
   - *"How do I enable iframe processing to extract content embedded in iframes?"*
   - *"What should I do if the iframe content doesn’t load or is blocked?"*

5. **LLM-Based Structured Extraction**
   - *"When should I consider using LLM strategies for content extraction?"*
   - *"How can I define a JSON schema for the LLM to produce structured, JSON-formatted outputs?"*
   - *"What if the LLM returns incomplete or incorrect data? How can I refine the instructions or schema?"*

6. **Pattern-Based Selection with JSON Strategies**
   - *"How can I extract multiple items (e.g., a list of articles or products) from a page using `JsonCssExtractionStrategy`?"*
   - *"What’s the best way to handle nested fields or multiple levels of data using a JSON schema?"*

7. **Combining Multiple Techniques**
   - *"How do I use CSS selectors, content filtering, and JSON-based extraction strategies together to get clean, structured data?"*
   - *"Can I integrate LLM extraction for summarization alongside CSS-based extraction for raw content?"*

8. **Troubleshooting and Best Practices**
   - *"Why am I getting empty or no results from my selectors, and how can I debug it?"*
   - *"What should I do if content loading is dynamic and requires waiting or JS execution?"*
   - *"How can I optimize performance and reliability for large-scale or repeated crawls?"*

9. **Performance and Reliability**
   - *"How can I improve crawl speed while maintaining precision in content selection?"*
   - *"What’s the benefit of using Dockerized environments for consistent and reproducible results?"*

10. **Additional Resources and Extensions**
    - *"Where can I find the source code for the Async Web Crawler and strategies?"*
    - *"What advanced topics, such as caching, proxy integration, or Docker deployments, can I explore next?"*

### Topics Discussed in the File

- **CSS Selectors for Content Isolation**:
  Identifying elements with CSS selectors, using browser dev tools, and extracting targeted sections of a webpage.

- **Content Filtering Parameters**:
  Removing unwanted tags, external links, social media elements, and enforcing minimum word counts to ensure meaningful content.

- **Handling Iframes**:
  Enabling `process_iframes` and dealing with multi-domain or overlay elements to extract embedded content.

- **Structured Extraction with LLMs**:
  Using `LLMExtractionStrategy` with schemas and instructions for complex or irregular data extraction, including JSON-based outputs.

- **Pattern-Based Extraction Using Schemas (JsonCssExtractionStrategy)**:
  Defining a JSON schema to extract lists of items (e.g., articles, products) that follow a consistent pattern, capturing nested fields and attributes.

- **Combining Techniques**:
  Integrating CSS selection, filtering, JSON schema extraction, and LLM-based transformation to get clean, structured, and context-rich results.

- **Troubleshooting and Best Practices**:
  Adjusting selectors, filters, and instructions, lowering thresholds if empty results occur, and refining LLM prompts for better data.

- **Performance and Reliability**:
  Starting with simple strategies, adding complexity as needed, and considering asynchronous crawling, caching, or Docker for large-scale operations.

- **Additional Resources**:
  Links to code repositories, instructions for Docker deployments, caching strategies, and further refinement for advanced use cases.

In summary, the file provides comprehensive guidance on selecting and filtering content within Crawl4AI, covering everything from simple CSS-based extractions to advanced LLM-driven structured outputs, while also addressing common issues, best practices, and performance optimizations.

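To make the pattern-based extraction topic concrete, here is a minimal sketch of what a schema with nested fields and attributes might look like. The key names (`baseSelector`, `fields`, `type`, `selector`, `attribute`) are assumed to match what `JsonCssExtractionStrategy` expects and should be checked against the Crawl4AI documentation; the CSS selectors and the `field_paths` helper are hypothetical and exist only for illustration.

```python
# Hypothetical schema for a product listing page. The key names are assumed
# to follow the JsonCssExtractionStrategy format; the selectors are made up.
schema = {
    "name": "Products",
    "baseSelector": "div.product",  # one match per repeated item
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "url", "selector": "a", "type": "attribute", "attribute": "href"},
        {
            "name": "price",
            "selector": "div.price",
            "type": "nested",  # groups sub-fields under one key
            "fields": [
                {"name": "amount", "selector": "span.amount", "type": "text"},
                {"name": "currency", "selector": "span.currency", "type": "text"},
            ],
        },
    ],
}


def field_paths(fields, prefix=""):
    """Return dotted paths for every leaf field, descending into nested ones."""
    paths = []
    for field in fields:
        path = prefix + field["name"]
        if "fields" in field:
            paths.extend(field_paths(field["fields"], path + "."))
        else:
            paths.append(path)
    return paths


print(field_paths(schema["fields"]))
# ['title', 'url', 'price.amount', 'price.currency']
```

Listing the leaf paths this way is a quick sanity check that the nesting is what you intended before handing the dictionary to an extraction strategy.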
content_selection: Crawl4AI allows precise selection and filtering of webpage content | web scraping, content extraction, web crawler | CrawlerRunConfig(css_selector=".main-article")
css_selectors: Target specific webpage elements using CSS selectors like .main-article or article h1 | DOM selection, HTML elements, element targeting | CrawlerRunConfig(css_selector="article h1, article .content")
media_extraction: Extract video and audio elements with metadata including source, type, and duration | multimedia content, media files | result.media["videos"], result.media["audios"]
link_analysis: Automatically categorize links into internal, external, social media, navigation, and content links | link classification, URL analysis | result.links["internal"], result.links["external"]
link_filtering: Control which links are included using exclude parameters | link exclusion, domain filtering | CrawlerRunConfig(exclude_external_links=True, exclude_social_media_links=True)
metadata_extraction: Automatically extract page metadata including title, description, keywords, and dates | page information, meta tags | result.metadata['title'], result.metadata['description']
content_filtering: Remove unwanted elements using word count threshold and excluded tags | content cleanup, element removal | CrawlerRunConfig(word_count_threshold=10, excluded_tags=['form', 'header'])
iframe_handling: Process content within iframes by enabling iframe processing and overlay removal | embedded content, frames | CrawlerRunConfig(process_iframes=True, remove_overlay_elements=True)
llm_extraction: Use LLMs for complex content extraction with structured output | AI extraction, structured data | LLMExtractionStrategy(provider="ollama/nemotron", schema=ArticleContent.schema())
pattern_extraction: Extract repetitive content patterns using JSON schema mapping | structured extraction, repeated elements | JsonCssExtractionStrategy(schema)
troubleshooting: Common issues include empty results, unintended content, and LLM errors | debugging, error handling | config.word_count_threshold, excluded_tags
best_practices: Start with simple selectors before advanced strategies and use caching for efficiency | optimization, performance | AsyncWebCrawler().arun(url=url, config=config)
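Each entry above appears to follow the shape `name: description | keywords | example`. As a rough sketch of how a consumer might read that format, the snippet below splits one entry into its parts using only the standard library. The `IndexEntry` class and `parse_entry` function are hypothetical names introduced here, not part of Crawl4AI, and the format itself is inferred from the entries rather than from a specification.

```python
from dataclasses import dataclass


@dataclass
class IndexEntry:
    name: str
    description: str
    keywords: list
    example: str


def parse_entry(line):
    """Split one 'name: description | keywords | example' line into its parts."""
    name, _, rest = line.partition(":")
    # maxsplit=2 keeps any later " | " inside the example intact
    parts = [p.strip() for p in rest.split(" | ", 2)]
    if not name.strip() or len(parts) != 3:
        raise ValueError(f"unexpected entry format: {line!r}")
    description, keywords, example = parts
    return IndexEntry(
        name=name.strip(),
        description=description,
        keywords=[k.strip() for k in keywords.split(",")],
        example=example,
    )


line = (
    "link_filtering: Control which links are included using exclude parameters"
    " | link exclusion, domain filtering"
    " | CrawlerRunConfig(exclude_external_links=True, exclude_social_media_links=True)"
)
entry = parse_entry(line)
print(entry.name, entry.keywords)
# link_filtering ['link exclusion', 'domain filtering']
```

Splitting on the first colon and at most two ` | ` separators keeps the commas and parentheses in the example column untouched, which matters because the examples are themselves code.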