Enhance crawler capabilities and documentation

- Add llm.txt generator
- Add SSL certificate extraction in AsyncWebCrawler
- Introduce new content filters and chunking strategies for more robust data extraction
- Update documentation
UncleCode
2024-12-25 21:34:31 +08:00
parent 84b311760f
commit d5ed451299
59 changed files with 2208 additions and 1763 deletions


### Hypothetical Questions
1. **Basic Content Selection**
   - *"How can I use a CSS selector to extract only the main article text from a webpage?"*
   - *"What's a quick way to isolate a specific element or section of a webpage using Crawl4AI?"*
2. **Advanced CSS Selectors**
   - *"How do I find the right CSS selector for a given element in a complex webpage?"*
   - *"Can I combine multiple CSS selectors to target different parts of the page simultaneously?"*
3. **Content Filtering**
   - *"What parameters can I use to remove non-essential elements like headers, footers, or ads?"*
   - *"How do I filter out short or irrelevant text blocks using `word_count_threshold`?"*
   - *"Is it possible to exclude external links, images, or social media widgets to get cleaner data?"*
4. **Iframe Content Handling**
   - *"How do I enable iframe processing to extract content embedded in iframes?"*
   - *"What should I do if the iframe content doesn't load or is blocked?"*
5. **LLM-Based Structured Extraction**
   - *"When should I consider using LLM strategies for content extraction?"*
   - *"How can I define a JSON schema for the LLM to produce structured, JSON-formatted outputs?"*
   - *"What if the LLM returns incomplete or incorrect data—how can I refine the instructions or schema?"*
6. **Pattern-Based Selection with JSON Strategies**
   - *"How can I extract multiple items (e.g., a list of articles or products) from a page using `JsonCssExtractionStrategy`?"*
   - *"What's the best way to handle nested fields or multiple levels of data using a JSON schema?"*
7. **Combining Multiple Techniques**
   - *"How do I use CSS selectors, content filtering, and JSON-based extraction strategies together to get clean, structured data?"*
   - *"Can I integrate LLM extraction for summarization alongside CSS-based extraction for raw content?"*
8. **Troubleshooting and Best Practices**
   - *"Why am I getting empty or no results from my selectors, and how can I debug it?"*
   - *"What should I do if content loading is dynamic and requires waiting or JS execution?"*
   - *"How can I optimize performance and reliability for large-scale or repeated crawls?"*
9. **Performance and Reliability**
   - *"How can I improve crawl speed while maintaining precision in content selection?"*
   - *"What's the benefit of using Dockerized environments for consistent and reproducible results?"*
10. **Additional Resources and Extensions**
    - *"Where can I find the source code for the Async Web Crawler and strategies?"*
    - *"What advanced topics, such as caching, proxy integration, or Docker deployments, can I explore next?"*
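
Several of the questions above come down to filtering out short or boilerplate blocks. Below is a rough, self-contained sketch of the idea behind `word_count_threshold` and `excluded_tags` (the parameter names come from Crawl4AI's `CrawlerRunConfig`; the helper itself is illustrative, not library code):

```python
# Illustrative stand-in for word_count_threshold / excluded_tags filtering;
# this helper is NOT Crawl4AI's implementation, just the concept.

def filter_blocks(blocks, word_count_threshold=10,
                  excluded_tags=("header", "footer", "nav", "form")):
    """Keep text blocks that are long enough and not inside excluded tags.

    `blocks` is a list of (tag, text) pairs, e.g. [("p", "..."), ("footer", "...")].
    """
    kept = []
    for tag, text in blocks:
        if tag in excluded_tags:
            continue  # drop boilerplate regions wholesale
        if len(text.split()) < word_count_threshold:
            continue  # drop short, likely non-essential snippets
        kept.append(text)
    return kept

blocks = [
    ("header", "Site name"),
    ("p", "A substantive paragraph with more than ten words of actual article content here."),
    ("footer", "Copyright"),
    ("p", "Too short"),
]
print(filter_blocks(blocks))  # only the substantive paragraph survives
```

Lowering `word_count_threshold` is the usual first step when selectors return empty results, since aggressive filtering can silently discard the very content you wanted.
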
### Topics Discussed in the File
- **CSS Selectors for Content Isolation**:
Identifying elements with CSS selectors, using browser dev tools, and extracting targeted sections of a webpage.
- **Content Filtering Parameters**:
Removing unwanted tags, external links, social media elements, and enforcing minimum word counts to ensure meaningful content.
- **Handling Iframes**:
Enabling `process_iframes` and dealing with multi-domain or overlay elements to extract embedded content.
- **Structured Extraction with LLMs**:
Using `LLMExtractionStrategy` with schemas and instructions for complex or irregular data extraction, including JSON-based outputs.
- **Pattern-Based Extraction Using Schemas (JsonCssExtractionStrategy)**:
Defining a JSON schema to extract lists of items (e.g., articles, products) that follow a consistent pattern, capturing nested fields and attributes.
- **Combining Techniques**:
Integrating CSS selection, filtering, JSON schema extraction, and LLM-based transformation to get clean, structured, and context-rich results.
- **Troubleshooting and Best Practices**:
Adjusting selectors, filters, and instructions, lowering thresholds if empty results occur, and refining LLM prompts for better data.
- **Performance and Reliability**:
Starting with simple strategies, adding complexity as needed, and considering asynchronous crawling, caching, or Docker for large-scale operations.
- **Additional Resources**:
Links to code repositories, instructions for Docker deployments, caching strategies, and further refinement for advanced use cases.

In summary, the file provides comprehensive guidance on selecting and filtering content within Crawl4AI, covering everything from simple CSS-based extractions to advanced LLM-driven structured outputs, while also addressing common issues, best practices, and performance optimizations.

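
The pattern-based extraction topic above can be illustrated without the library. The sketch below applies a schema of named fields to every repeated container, which is the core idea behind `JsonCssExtractionStrategy`; it matches bare tag names with regexes instead of real CSS selectors, so treat it as a conceptual stand-in, not the strategy's implementation:

```python
import re

# Schema-driven extraction of repeated items, in the spirit of
# JsonCssExtractionStrategy. "baseTag" / "tag" are simplified, hypothetical
# schema keys; the real strategy takes CSS selectors.
SCHEMA = {
    "baseTag": "article",
    "fields": [
        {"name": "title", "tag": "h2"},
        {"name": "summary", "tag": "p"},
    ],
}

def extract(html, schema):
    """Return one dict per repeated container, with one key per schema field."""
    items = []
    base = schema["baseTag"]
    for block in re.findall(rf"<{base}>(.*?)</{base}>", html, re.S):
        item = {}
        for field in schema["fields"]:
            m = re.search(rf"<{field['tag']}>(.*?)</{field['tag']}>", block, re.S)
            item[field["name"]] = m.group(1).strip() if m else None
        items.append(item)
    return items

html = """
<article><h2>First post</h2><p>Intro one.</p></article>
<article><h2>Second post</h2><p>Intro two.</p></article>
"""
print(extract(html, SCHEMA))
```

A missing field yields `None` rather than an error, which mirrors the troubleshooting advice above: an all-`None` column usually means the field selector, not the base selector, is wrong.
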
content_selection: Crawl4AI allows precise selection and filtering of webpage content | web scraping, content extraction, web crawler | CrawlerRunConfig(css_selector=".main-article")
css_selectors: Target specific webpage elements using CSS selectors like .main-article or article h1 | DOM selection, HTML elements, element targeting | CrawlerRunConfig(css_selector="article h1, article .content")
media_extraction: Extract video and audio elements with metadata including source, type, and duration | multimedia content, media files | result.media["videos"], result.media["audios"]
link_analysis: Automatically categorize links into internal, external, social media, navigation, and content links | link classification, URL analysis | result.links["internal"], result.links["external"]
link_filtering: Control which links are included using exclude parameters | link exclusion, domain filtering | CrawlerRunConfig(exclude_external_links=True, exclude_social_media_links=True)
metadata_extraction: Automatically extract page metadata including title, description, keywords, and dates | page information, meta tags | result.metadata["title"], result.metadata["description"]
content_filtering: Remove unwanted elements using word count threshold and excluded tags | content cleanup, element removal | CrawlerRunConfig(word_count_threshold=10, excluded_tags=['form', 'header'])
iframe_handling: Process content within iframes by enabling iframe processing and overlay removal | embedded content, frames | CrawlerRunConfig(process_iframes=True, remove_overlay_elements=True)
llm_extraction: Use LLMs for complex content extraction with structured output | AI extraction, structured data | LLMExtractionStrategy(provider="ollama/nemotron", schema=ArticleContent.schema())
pattern_extraction: Extract repetitive content patterns using JSON schema mapping | structured extraction, repeated elements | JsonCssExtractionStrategy(schema)
troubleshooting: Common issues include empty results, unintended content, and LLM errors | debugging, error handling | config.word_count_threshold, excluded_tags
best_practices: Start with simple selectors before advanced strategies and use caching for efficiency | optimization, performance | AsyncWebCrawler().arun(url=url, config=config)
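
As a companion to the `link_analysis` and `link_filtering` entries above, here is a self-contained sketch of splitting links into internal, external, and social buckets. The function name and the social-domain list are illustrative assumptions; Crawl4AI exposes its own categorized results via `result.links`:

```python
from urllib.parse import urlparse

# Illustrative link categorization in the spirit of result.links["internal"] /
# result.links["external"]; SOCIAL_DOMAINS is an assumed, incomplete list.
SOCIAL_DOMAINS = {"twitter.com", "facebook.com", "linkedin.com"}

def categorize_links(page_url, hrefs):
    """Bucket hrefs relative to the page they were found on."""
    page_host = urlparse(page_url).netloc
    buckets = {"internal": [], "external": [], "social": []}
    for href in hrefs:
        host = urlparse(href).netloc
        if not host or host == page_host:
            buckets["internal"].append(href)  # relative or same-host links
        elif host in SOCIAL_DOMAINS:
            buckets["social"].append(href)
        else:
            buckets["external"].append(href)
    return buckets

links = categorize_links(
    "https://example.com/blog",
    ["/about", "https://example.com/contact",
     "https://twitter.com/someone", "https://other.org/page"],
)
print(links)
```

Excluding a bucket (as `exclude_external_links=True` or `exclude_social_media_links=True` do) then amounts to dropping it before further processing.
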