Enhance crawler capabilities and documentation

- Add llm.txt generator.
- Add SSL certificate extraction in AsyncWebCrawler.
- Introduce new content filters and chunking strategies for more robust data extraction.
- Update documentation.
2024-12-25 21:34:31 +08:00
parent 84b311760f
commit d5ed451299
59 changed files with 2208 additions and 1763 deletions
--- a/docs/llm.txt/5_markdown_generation.q.md
+++ b/docs/llm.txt/5_markdown_generation.q.md
@@ -1,53 +1,15 @@
-### Hypothetical Questions
-
-1. **Markdown Generation Basics**
-   - *"How can I convert raw HTML into clean, structured Markdown using Crawl4AI?"*
-   - *"What are the main benefits of generating Markdown from web content for LLM workflows?"*
-   - *"How do I quickly start generating Markdown output from a given URL?"*
-
-2. **Default Markdown Generator Configuration**
-   - *"What parameters can I customize in `DefaultMarkdownGenerator` to control the HTML-to-Markdown conversion?"*
-   - *"How do I ignore links, images, or HTML entities when converting to Markdown?"*
-   - *"Can I set a custom line-wrapping width and handle code blocks in Markdown output?"*
-
-3. **Content Filtering Strategies**
-   - *"How can I apply filters like BM25 or pruning before Markdown generation?"*
-   - *"What is `fit_markdown` and how does it differ from the raw Markdown output?"*
-   - *"How do I use `BM25ContentFilter` to get content relevant to a specific user query?"*
-   - *"What does `PruningContentFilter` do, and when should I use it to clean up noisy HTML?"*
-
-4. **BM25 and Pruning Filters**
-   - *"How does BM25 ranking improve the relevance of extracted Markdown content?"*
-   - *"Which parameters should I tweak if BM25 returns too much or too little content?"*
-   - *"How can I combine `PruningContentFilter` with BM25 to first remove boilerplate and then focus on relevance?"*
-
-5. **Advanced html2text Configuration**
-   - *"What advanced `html2text` options are available and how do I set them?"*
-   - *"How can I preserve specific tags, handle code blocks, or skip internal links?"*
-   - *"Can I handle superscript and subscript formatting in the Markdown output?"*
-
-6. **Troubleshooting and Best Practices**
-   - *"Why am I getting empty Markdown output and how can I fix it?"*
-   - *"How do I handle malformed HTML or JavaScript-heavy sites?"*
-   - *"What are the recommended workflows for large-scale or performance-critical Markdown generation?"*
-   - *"How do I preserve references or add citation-style links in the final Markdown?"*
-
-7. **Use Cases and Integration**
-   - *"How can I incorporate `fit_markdown` into an LLM fine-tuning or RAG pipeline?"*
-   - *"Can I run Crawl4AI’s Markdown generation inside a Docker container for consistent environments?"*
-   - *"How do I cache results or reuse sessions to speed up repeated markdown generation tasks?"*
-
-### Topics Discussed in the File
-
- **Markdown Generation Workflow** using `DefaultMarkdownGenerator`  
- **HTML-to-Markdown Conversion Options** (ignore links, images, escape HTML, line-wrapping, code handling)  
- **Applying Content Filters** (BM25 and Pruning) before Markdown generation  
- **fit_markdown vs. raw_markdown** for filtered, cleaner output  
- **BM25ContentFilter** for query-based content relevance  
- **PruningContentFilter** for unsupervised noise removal and cleaner pages  
- **Combining Filters** (prune first, then BM25) to refine content  
- **Advanced `html2text` Configurations** (handle code blocks, superscripts, skip internal links)  
- **Troubleshooting Tips** (empty output, malformed HTML, performance considerations)  
- **Downstream Uses**: Training LLMs, building RAG pipelines, semantic search indexing  
- **Best Practices** (iterative parameter tuning, caching, Docker deployment)  
- **Real-World Scenarios** (news summarization, large corpus pre-processing, improved RAG retrieval quality)
+markdown_generation: Converts web content into clean, structured Markdown format for AI processing | html to markdown, text conversion, content extraction | DefaultMarkdownGenerator()
+markdown_config_options: Configure HTML to Markdown conversion with html2text options like ignore_links, escape_html, body_width | markdown settings, conversion options | html2text_config={"ignore_links": True, "body_width": 80}
+content_filtering: Filter and clean web content using BM25 or Pruning strategies | content cleanup, noise removal | content_filter=BM25ContentFilter()
+bm25_filtering: Score and filter content based on relevance to a user query | relevance filtering, query matching | BM25ContentFilter(user_query="ai", bm25_threshold=1.5)
+pruning_filter: Remove boilerplate and noise using unsupervised clustering approach | content pruning, noise removal | PruningContentFilter(threshold=0.7, threshold_type="dynamic")
+markdown_result_types: Access different markdown outputs including raw, cited, and filtered versions | markdown formats, output types | result.markdown_v2.{raw_markdown, markdown_with_citations, fit_markdown}
+link_citations: Convert webpage links into citation-style references at document end | reference handling, link management | markdown_with_citations output format
+content_scoring: Evaluate content blocks based on text density, link density, and tag importance | content metrics, scoring system | PruningContentFilter metrics
+combined_filtering: Apply both pruning and BM25 filters for optimal content extraction | filter pipeline, multi-stage filtering | PruningContentFilter() followed by BM25ContentFilter()
+markdown_generation_troubleshooting: Debug empty outputs and malformed content issues | error handling, debugging | Check HTML content and filter thresholds
+performance_optimization: Cache results and adjust parameters for better processing speed | optimization, caching | Store intermediate results for reuse
+rag_pipeline_integration: Use filtered markdown for retrieval-augmented generation systems | RAG, vector storage | Store fit_markdown in vector database
+code_block_handling: Preserve and format code snippets in markdown output | code formatting, syntax | handle_code_in_pre=True option
+authentication_handling: Process content from authenticated pages using session tokens | auth support, protected content | Provide session tokens before markdown generation
+docker_deployment: Run markdown generation in containerized environment | deployment, containers | Include in Dockerfile configuration
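
The added lines above appear to follow a single-line layout of `name: description | keywords | example`. As a rough illustration of how such an entry could be read back, here is a minimal Python sketch. The `IndexEntry` class and `parse_index_line` function are hypothetical names introduced only for this example and are not part of Crawl4AI. The field layout is inferred from the lines in this diff, not from a documented specification.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IndexEntry:
    """One line of the index file (hypothetical helper type, not a Crawl4AI API)."""
    name: str
    description: str
    keywords: List[str]
    example: str


def parse_index_line(line: str) -> IndexEntry:
    """Parse 'name: description | keywords | example' into an IndexEntry.

    Assumes the layout seen in the diff: the name ends at the first ': ',
    and the remaining text holds three ' | '-separated fields.
    """
    name, sep, rest = line.partition(": ")
    if not sep:
        raise ValueError("missing ': ' after entry name")

    # Split at most twice so a '|' inside the example field is preserved.
    parts = rest.split(" | ", 2)
    if len(parts) != 3:
        raise ValueError("expected three ' | '-separated fields")

    description, keywords, example = (p.strip() for p in parts)
    return IndexEntry(
        name=name.strip(),
        description=description,
        keywords=[k.strip() for k in keywords.split(",") if k.strip()],
        example=example,
    )


# Usage: parse one of the entries added in this commit.
entry = parse_index_line(
    'bm25_filtering: Score and filter content based on relevance to a user query'
    ' | relevance filtering, query matching'
    ' | BM25ContentFilter(user_query="ai", bm25_threshold=1.5)'
)
print(entry.name, entry.keywords)
```

Splitting the name on the first `": "` only matters for entries such as `markdown_config_options`, whose example field itself contains `": "` inside a dict literal.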