crawl4ai/docs/llm.txt/5_markdown_generation.q.md
UncleCode 84b311760f Commit Message:
Enhance Crawl4AI with CLI and documentation updates
  - Implemented Command-Line Interface (CLI) in `crawl4ai/cli.py`
  - Added chunking strategies and their documentation in `llm.txt`
2024-12-21 14:26:56 +08:00

### Hypothetical Questions
1. **Markdown Generation Basics**
- *"How can I convert raw HTML into clean, structured Markdown using Crawl4AI?"*
- *"What are the main benefits of generating Markdown from web content for LLM workflows?"*
- *"How do I quickly start generating Markdown output from a given URL?"*
2. **Default Markdown Generator Configuration**
- *"What parameters can I customize in `DefaultMarkdownGenerator` to control the HTML-to-Markdown conversion?"*
- *"How do I ignore links, images, or HTML entities when converting to Markdown?"*
- *"Can I set a custom line-wrapping width and handle code blocks in Markdown output?"*
3. **Content Filtering Strategies**
- *"How can I apply filters like BM25 or pruning before Markdown generation?"*
- *"What is `fit_markdown` and how does it differ from the raw Markdown output?"*
- *"How do I use `BM25ContentFilter` to get content relevant to a specific user query?"*
- *"What does `PruningContentFilter` do, and when should I use it to clean up noisy HTML?"*
4. **BM25 and Pruning Filters**
- *"How does BM25 ranking improve the relevance of extracted Markdown content?"*
- *"Which parameters should I tweak if BM25 returns too much or too little content?"*
- *"How can I combine `PruningContentFilter` with BM25 to first remove boilerplate and then focus on relevance?"*
5. **Advanced html2text Configuration**
- *"What advanced `html2text` options are available and how do I set them?"*
- *"How can I preserve specific tags, handle code blocks, or skip internal links?"*
- *"Can I handle superscript and subscript formatting in the Markdown output?"*
6. **Troubleshooting and Best Practices**
- *"Why am I getting empty Markdown output and how can I fix it?"*
- *"How do I handle malformed HTML or JavaScript-heavy sites?"*
- *"What are the recommended workflows for large-scale or performance-critical Markdown generation?"*
- *"How do I preserve references or add citation-style links in the final Markdown?"*
7. **Use Cases and Integration**
- *"How can I incorporate `fit_markdown` into an LLM fine-tuning or RAG pipeline?"*
- *"Can I run Crawl4AI's Markdown generation inside a Docker container for consistent environments?"*
- *"How do I cache results or reuse sessions to speed up repeated Markdown generation tasks?"*
### Topics Discussed in the File
- **Markdown Generation Workflow** using `DefaultMarkdownGenerator`
- **HTML-to-Markdown Conversion Options** (ignore links, images, escape HTML, line-wrapping, code handling)
- **Applying Content Filters** (BM25 and Pruning) before Markdown generation
- **`fit_markdown` vs. `raw_markdown`**: filtered, cleaner output versus the unfiltered conversion
- **BM25ContentFilter** for query-based content relevance
- **PruningContentFilter** for unsupervised noise removal and cleaner pages
- **Combining Filters** (prune first, then BM25) to refine content
- **Advanced `html2text` Configurations** (handle code blocks, superscripts, skip internal links)
- **Troubleshooting Tips** (empty output, malformed HTML, performance considerations)
- **Downstream Uses**: Training LLMs, building RAG pipelines, semantic search indexing
- **Best Practices** (iterative parameter tuning, caching, Docker deployment)
- **Real-World Scenarios** (news summarization, large corpus pre-processing, improved RAG retrieval quality)
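
The "prune first, then BM25" idea listed above can be illustrated with a small standalone sketch. This is not Crawl4AI's implementation and does not call `PruningContentFilter` or `BM25ContentFilter`; it only shows the general technique on plain text blocks using the standard Okapi BM25 formula. The function names (`prune`, `bm25_scores`, `filter_blocks`), the `min_words` heuristic, and the `threshold` value are all hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def prune(blocks, min_words=5):
    """Hypothetical pruning heuristic: drop very short blocks such as menus."""
    return [blk for blk in blocks if len(tokenize(blk)) >= min_words]


def bm25_scores(query, blocks, k1=1.5, b=0.75):
    """Score each block against the query with the Okapi BM25 formula."""
    docs = [tokenize(blk) for blk in blocks]
    n = len(docs)
    if n == 0:
        return []
    avgdl = sum(len(d) for d in docs) / n

    # Document frequency: in how many blocks each term appears.
    df = Counter()
    for d in docs:
        df.update(set(d))

    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def filter_blocks(query, blocks, threshold=0.5):
    """Prune boilerplate first, then keep blocks whose BM25 score passes the threshold."""
    kept = prune(blocks)
    scores = bm25_scores(query, kept)
    return [blk for blk, s in zip(kept, scores) if s >= threshold]


blocks = [
    "Home | About | Contact",
    "Crawl4AI converts raw HTML pages into clean Markdown for LLM pipelines.",
    "Subscribe to our newsletter to receive weekly cooking recipes and tips.",
    "BM25 ranks text blocks by how well they match the terms of a query.",
]
relevant = filter_blocks("markdown for LLM", blocks)
for blk in relevant:
    print(blk)
```

In this sketch, raising `threshold` keeps fewer blocks and lowering it keeps more, which mirrors the kind of tuning the questions above ask about when a relevance filter returns too much or too little content.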