Enhance crawler capabilities and documentation

- Add llm.txt generator
- Add SSL certificate extraction in AsyncWebCrawler
- Introduce new content filters and chunking strategies for more robust data extraction
- Update documentation
UncleCode
2024-12-25 21:34:31 +08:00
parent 84b311760f
commit d5ed451299
59 changed files with 2208 additions and 1763 deletions


### Hypothetical Questions
1. **Basic Content Selection**
   - *"How can I use a CSS selector to extract only the main article text from a webpage?"*
   - *"What's a quick way to isolate a specific element or section of a webpage using Crawl4AI?"*
2. **Advanced CSS Selectors**
   - *"How do I find the right CSS selector for a given element in a complex webpage?"*
   - *"Can I combine multiple CSS selectors to target different parts of the page simultaneously?"*
3. **Content Filtering**
   - *"What parameters can I use to remove non-essential elements like headers, footers, or ads?"*
   - *"How do I filter out short or irrelevant text blocks using `word_count_threshold`?"*
   - *"Is it possible to exclude external links, images, or social media widgets to get cleaner data?"*
4. **Iframe Content Handling**
   - *"How do I enable iframe processing to extract content embedded in iframes?"*
   - *"What should I do if the iframe content doesn't load or is blocked?"*
5. **LLM-Based Structured Extraction**
   - *"When should I consider using LLM strategies for content extraction?"*
   - *"How can I define a JSON schema for the LLM to produce structured, JSON-formatted outputs?"*
   - *"What if the LLM returns incomplete or incorrect data—how can I refine the instructions or schema?"*
6. **Pattern-Based Selection with JSON Strategies**
   - *"How can I extract multiple items (e.g., a list of articles or products) from a page using `JsonCssExtractionStrategy`?"*
   - *"What's the best way to handle nested fields or multiple levels of data using a JSON schema?"*
7. **Combining Multiple Techniques**
   - *"How do I use CSS selectors, content filtering, and JSON-based extraction strategies together to get clean, structured data?"*
   - *"Can I integrate LLM extraction for summarization alongside CSS-based extraction for raw content?"*
8. **Troubleshooting and Best Practices**
   - *"Why am I getting empty or no results from my selectors, and how can I debug it?"*
   - *"What should I do if content loading is dynamic and requires waiting or JS execution?"*
   - *"How can I optimize performance and reliability for large-scale or repeated crawls?"*
9. **Performance and Reliability**
   - *"How can I improve crawl speed while maintaining precision in content selection?"*
   - *"What's the benefit of using Dockerized environments for consistent and reproducible results?"*
10. **Additional Resources and Extensions**
    - *"Where can I find the source code for the Async Web Crawler and strategies?"*
    - *"What advanced topics, such as caching, proxy integration, or Docker deployments, can I explore next?"*
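
Several of the questions above come down to filtering out short or boilerplate blocks. Below is a rough, self-contained sketch of the idea behind `word_count_threshold` and `excluded_tags` (the parameter names come from Crawl4AI's `CrawlerRunConfig`; the helper itself is illustrative, not library code):

```python
# Illustrative stand-in for word_count_threshold / excluded_tags filtering;
# this helper is NOT Crawl4AI's implementation, just the concept.

def filter_blocks(blocks, word_count_threshold=10,
                  excluded_tags=("header", "footer", "nav", "form")):
    """Keep text blocks that are long enough and not inside excluded tags.

    `blocks` is a list of (tag, text) pairs, e.g. [("p", "..."), ("footer", "...")].
    """
    kept = []
    for tag, text in blocks:
        if tag in excluded_tags:
            continue  # drop boilerplate regions wholesale
        if len(text.split()) < word_count_threshold:
            continue  # drop short, likely non-essential snippets
        kept.append(text)
    return kept

blocks = [
    ("header", "Site name"),
    ("p", "A substantive paragraph with more than ten words of actual article content here."),
    ("footer", "Copyright"),
    ("p", "Too short"),
]
print(filter_blocks(blocks))  # only the substantive paragraph survives
```

Lowering `word_count_threshold` is the usual first step when selectors return empty results, since aggressive filtering can silently discard the very content you wanted.
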
### Topics Discussed in the File
- **CSS Selectors for Content Isolation**:
Identifying elements with CSS selectors, using browser dev tools, and extracting targeted sections of a webpage.
- **Content Filtering Parameters**:
Removing unwanted tags, external links, social media elements, and enforcing minimum word counts to ensure meaningful content.
- **Handling Iframes**:
Enabling `process_iframes` and dealing with multi-domain or overlay elements to extract embedded content.
- **Structured Extraction with LLMs**:
Using `LLMExtractionStrategy` with schemas and instructions for complex or irregular data extraction, including JSON-based outputs.
- **Pattern-Based Extraction Using Schemas (JsonCssExtractionStrategy)**:
Defining a JSON schema to extract lists of items (e.g., articles, products) that follow a consistent pattern, capturing nested fields and attributes.
- **Combining Techniques**:
Integrating CSS selection, filtering, JSON schema extraction, and LLM-based transformation to get clean, structured, and context-rich results.
- **Troubleshooting and Best Practices**:
Adjusting selectors, filters, and instructions, lowering thresholds if empty results occur, and refining LLM prompts for better data.
- **Performance and Reliability**:
Starting with simple strategies, adding complexity as needed, and considering asynchronous crawling, caching, or Docker for large-scale operations.
- **Additional Resources**:
Links to code repositories, instructions for Docker deployments, caching strategies, and further refinement for advanced use cases.

In summary, the file provides comprehensive guidance on selecting and filtering content within Crawl4AI, covering everything from simple CSS-based extractions to advanced LLM-driven structured outputs, while also addressing common issues, best practices, and performance optimizations.

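
The pattern-based extraction topic above can be illustrated without the library. The sketch below applies a schema of named fields to every repeated container, which is the core idea behind `JsonCssExtractionStrategy`; it matches bare tag names with regexes instead of real CSS selectors, so treat it as a conceptual stand-in, not the strategy's implementation:

```python
import re

# Schema-driven extraction of repeated items, in the spirit of
# JsonCssExtractionStrategy. "baseTag" / "tag" are simplified, hypothetical
# schema keys; the real strategy takes CSS selectors.
SCHEMA = {
    "baseTag": "article",
    "fields": [
        {"name": "title", "tag": "h2"},
        {"name": "summary", "tag": "p"},
    ],
}

def extract(html, schema):
    """Return one dict per repeated container, with one key per schema field."""
    items = []
    base = schema["baseTag"]
    for block in re.findall(rf"<{base}>(.*?)</{base}>", html, re.S):
        item = {}
        for field in schema["fields"]:
            m = re.search(rf"<{field['tag']}>(.*?)</{field['tag']}>", block, re.S)
            item[field["name"]] = m.group(1).strip() if m else None
        items.append(item)
    return items

html = """
<article><h2>First post</h2><p>Intro one.</p></article>
<article><h2>Second post</h2><p>Intro two.</p></article>
"""
print(extract(html, SCHEMA))
```

A missing field yields `None` rather than an error, which mirrors the troubleshooting advice above: an all-`None` column usually means the field selector, not the base selector, is wrong.
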
content_selection: Crawl4AI allows precise selection and filtering of webpage content | web scraping, content extraction, web crawler | CrawlerRunConfig(css_selector=".main-article")
css_selectors: Target specific webpage elements using CSS selectors like .main-article or article h1 | DOM selection, HTML elements, element targeting | CrawlerRunConfig(css_selector="article h1, article .content")
media_extraction: Extract video and audio elements with metadata including source, type, and duration | multimedia content, media files | result.media["videos"], result.media["audios"]
link_analysis: Automatically categorize links into internal, external, social media, navigation, and content links | link classification, URL analysis | result.links["internal"], result.links["external"]
link_filtering: Control which links are included using exclude parameters | link exclusion, domain filtering | CrawlerRunConfig(exclude_external_links=True, exclude_social_media_links=True)
metadata_extraction: Automatically extract page metadata including title, description, keywords, and dates | page information, meta tags | result.metadata["title"], result.metadata["description"]
content_filtering: Remove unwanted elements using word count threshold and excluded tags | content cleanup, element removal | CrawlerRunConfig(word_count_threshold=10, excluded_tags=['form', 'header'])
iframe_handling: Process content within iframes by enabling iframe processing and overlay removal | embedded content, frames | CrawlerRunConfig(process_iframes=True, remove_overlay_elements=True)
llm_extraction: Use LLMs for complex content extraction with structured output | AI extraction, structured data | LLMExtractionStrategy(provider="ollama/nemotron", schema=ArticleContent.schema())
pattern_extraction: Extract repetitive content patterns using JSON schema mapping | structured extraction, repeated elements | JsonCssExtractionStrategy(schema)
troubleshooting: Common issues include empty results, unintended content, and LLM errors | debugging, error handling | config.word_count_threshold, excluded_tags
best_practices: Start with simple selectors before advanced strategies and use caching for efficiency | optimization, performance | AsyncWebCrawler().arun(url=url, config=config)
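
As a companion to the `link_analysis` and `link_filtering` entries above, here is a self-contained sketch of splitting links into internal, external, and social buckets. The function name and the social-domain list are illustrative assumptions; Crawl4AI exposes its own categorized results via `result.links`:

```python
from urllib.parse import urlparse

# Illustrative link categorization in the spirit of result.links["internal"] /
# result.links["external"]; SOCIAL_DOMAINS is an assumed, incomplete list.
SOCIAL_DOMAINS = {"twitter.com", "facebook.com", "linkedin.com"}

def categorize_links(page_url, hrefs):
    """Bucket hrefs relative to the page they were found on."""
    page_host = urlparse(page_url).netloc
    buckets = {"internal": [], "external": [], "social": []}
    for href in hrefs:
        host = urlparse(href).netloc
        if not host or host == page_host:
            buckets["internal"].append(href)  # relative or same-host links
        elif host in SOCIAL_DOMAINS:
            buckets["social"].append(href)
        else:
            buckets["external"].append(href)
    return buckets

links = categorize_links(
    "https://example.com/blog",
    ["/about", "https://example.com/contact",
     "https://twitter.com/someone", "https://other.org/page"],
)
print(links)
```

Excluding a bucket (as `exclude_external_links=True` or `exclude_social_media_links=True` do) then amounts to dropping it before further processing.
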