Commit Message:

Enhance Crawl4AI with CLI and documentation updates - Implemented Command-Line Interface (CLI) in `crawl4ai/cli.py` - Added chunking strategies and their documentation in `llm.txt`
2024-12-21 14:26:56 +08:00
parent 8fbc2e0463
commit 84b311760f
47 changed files with 6510 additions and 2 deletions
--- a/docs/llm.txt/5_markdown_generation.q.md
+++ b/docs/llm.txt/5_markdown_generation.q.md
@@ -0,0 +1,53 @@
+### Hypothetical Questions
+
+1. **Markdown Generation Basics**
+   - *"How can I convert raw HTML into clean, structured Markdown using Crawl4AI?"*
+   - *"What are the main benefits of generating Markdown from web content for LLM workflows?"*
+   - *"How do I quickly start generating Markdown output from a given URL?"*
+
+2. **Default Markdown Generator Configuration**
+   - *"What parameters can I customize in `DefaultMarkdownGenerator` to control the HTML-to-Markdown conversion?"*
+   - *"How do I ignore links, images, or HTML entities when converting to Markdown?"*
+   - *"Can I set a custom line-wrapping width and handle code blocks in Markdown output?"*
+
+3. **Content Filtering Strategies**
+   - *"How can I apply filters like BM25 or pruning before Markdown generation?"*
+   - *"What is `fit_markdown` and how does it differ from the raw Markdown output?"*
+   - *"How do I use `BM25ContentFilter` to get content relevant to a specific user query?"*
+   - *"What does `PruningContentFilter` do, and when should I use it to clean up noisy HTML?"*
+
+4. **BM25 and Pruning Filters**
+   - *"How does BM25 ranking improve the relevance of extracted Markdown content?"*
+   - *"Which parameters should I tweak if BM25 returns too much or too little content?"*
+   - *"How can I combine `PruningContentFilter` with BM25 to first remove boilerplate and then focus on relevance?"*
+
+5. **Advanced html2text Configuration**
+   - *"What advanced `html2text` options are available and how do I set them?"*
+   - *"How can I preserve specific tags, handle code blocks, or skip internal links?"*
+   - *"Can I handle superscript and subscript formatting in the Markdown output?"*
+
+6. **Troubleshooting and Best Practices**
+   - *"Why am I getting empty Markdown output and how can I fix it?"*
+   - *"How do I handle malformed HTML or JavaScript-heavy sites?"*
+   - *"What are the recommended workflows for large-scale or performance-critical Markdown generation?"*
+   - *"How do I preserve references or add citation-style links in the final Markdown?"*
+
+7. **Use Cases and Integration**
+   - *"How can I incorporate `fit_markdown` into an LLM fine-tuning or RAG pipeline?"*
+   - *"Can I run Crawl4AI’s Markdown generation inside a Docker container for consistent environments?"*
+   - *"How do I cache results or reuse sessions to speed up repeated markdown generation tasks?"*
+
+### Topics Discussed in the File
+
+- **Markdown Generation Workflow** using `DefaultMarkdownGenerator`  
+- **HTML-to-Markdown Conversion Options** (ignore links, images, escape HTML, line-wrapping, code handling)  
+- **Applying Content Filters** (BM25 and Pruning) before Markdown generation  
+- **fit_markdown vs. raw_markdown** for filtered, cleaner output  
+- **BM25ContentFilter** for query-based content relevance  
+- **PruningContentFilter** for unsupervised noise removal and cleaner pages  
+- **Combining Filters** (prune first, then BM25) to refine content  
+- **Advanced `html2text` Configurations** (handle code blocks, superscripts, skip internal links)  
+- **Troubleshooting Tips** (empty output, malformed HTML, performance considerations)  
+- **Downstream Uses**: Training LLMs, building RAG pipelines, semantic search indexing  
+- **Best Practices** (iterative parameter tuning, caching, Docker deployment)  
+- **Real-World Scenarios** (news summarization, large corpus pre-processing, improved RAG retrieval quality)