crawl4ai/docs/llm.txt/5_markdown_generation.q.md at 84b311760f5a0f96a1614604fe7e9fc5a7c7197f

Files

UncleCode 84b311760f Commit Message:

Enhance Crawl4AI with CLI and documentation updates
  - Implemented Command-Line Interface (CLI) in `crawl4ai/cli.py`
  - Added chunking strategies and their documentation in `llm.txt`

2024-12-21 14:26:56 +08:00

3.3 KiB

Raw Blame History

Hypothetical Questions

Markdown Generation Basics
- "How can I convert raw HTML into clean, structured Markdown using Crawl4AI?"
- "What are the main benefits of generating Markdown from web content for LLM workflows?"
- "How do I quickly start generating Markdown output from a given URL?"
Default Markdown Generator Configuration
- "What parameters can I customize in DefaultMarkdownGenerator to control the HTML-to-Markdown conversion?"
- "How do I ignore links, images, or HTML entities when converting to Markdown?"
- "Can I set a custom line-wrapping width and handle code blocks in Markdown output?"
Content Filtering Strategies
- "How can I apply filters like BM25 or pruning before Markdown generation?"
- "What is fit_markdown and how does it differ from the raw Markdown output?"
- "How do I use BM25ContentFilter to get content relevant to a specific user query?"
- "What does PruningContentFilter do, and when should I use it to clean up noisy HTML?"
BM25 and Pruning Filters
- "How does BM25 ranking improve the relevance of extracted Markdown content?"
- "Which parameters should I tweak if BM25 returns too much or too little content?"
- "How can I combine PruningContentFilter with BM25 to first remove boilerplate and then focus on relevance?"
Advanced html2text Configuration
- "What advanced html2text options are available and how do I set them?"
- "How can I preserve specific tags, handle code blocks, or skip internal links?"
- "Can I handle superscript and subscript formatting in the Markdown output?"
Troubleshooting and Best Practices
- "Why am I getting empty Markdown output and how can I fix it?"
- "How do I handle malformed HTML or JavaScript-heavy sites?"
- "What are the recommended workflows for large-scale or performance-critical Markdown generation?"
- "How do I preserve references or add citation-style links in the final Markdown?"
Use Cases and Integration
- "How can I incorporate fit_markdown into an LLM fine-tuning or RAG pipeline?"
- "Can I run Crawl4AI’s Markdown generation inside a Docker container for consistent environments?"
- "How do I cache results or reuse sessions to speed up repeated markdown generation tasks?"

Topics Discussed in the File

Markdown Generation Workflow using DefaultMarkdownGenerator
HTML-to-Markdown Conversion Options (ignore links, images, escape HTML, line-wrapping, code handling)
Applying Content Filters (BM25 and Pruning) before Markdown generation
fit_markdown vs. raw_markdown for filtered, cleaner output
BM25ContentFilter for query-based content relevance
PruningContentFilter for unsupervised noise removal and cleaner pages
Combining Filters (prune first, then BM25) to refine content
Advanced html2text Configurations (handle code blocks, superscripts, skip internal links)
Troubleshooting Tips (empty output, malformed HTML, performance considerations)
Downstream Uses: Training LLMs, building RAG pipelines, semantic search indexing
Best Practices (iterative parameter tuning, caching, Docker deployment)
Real-World Scenarios (news summarization, large corpus pre-processing, improved RAG retrieval quality)

3.3 KiB Raw Blame History Unescape Escape

Hypothetical Questions

Topics Discussed in the File

3.3 KiB

Raw Blame History