Files
crawl4ai/docs/llm.txt/5_markdown_generation.q.md
UncleCode 84b311760f Commit Message:
Enhance Crawl4AI with CLI and documentation updates
  - Implemented Command-Line Interface (CLI) in `crawl4ai/cli.py`
  - Added chunking strategies and their documentation in `llm.txt`
2024-12-21 14:26:56 +08:00

3.3 KiB
Raw Blame History

Hypothetical Questions

  1. Markdown Generation Basics

    • "How can I convert raw HTML into clean, structured Markdown using Crawl4AI?"
    • "What are the main benefits of generating Markdown from web content for LLM workflows?"
    • "How do I quickly start generating Markdown output from a given URL?"
  2. Default Markdown Generator Configuration

    • "What parameters can I customize in DefaultMarkdownGenerator to control the HTML-to-Markdown conversion?"
    • "How do I ignore links, images, or HTML entities when converting to Markdown?"
    • "Can I set a custom line-wrapping width and handle code blocks in Markdown output?"
  3. Content Filtering Strategies

    • "How can I apply filters like BM25 or pruning before Markdown generation?"
    • "What is fit_markdown and how does it differ from the raw Markdown output?"
    • "How do I use BM25ContentFilter to get content relevant to a specific user query?"
    • "What does PruningContentFilter do, and when should I use it to clean up noisy HTML?"
  4. BM25 and Pruning Filters

    • "How does BM25 ranking improve the relevance of extracted Markdown content?"
    • "Which parameters should I tweak if BM25 returns too much or too little content?"
    • "How can I combine PruningContentFilter with BM25 to first remove boilerplate and then focus on relevance?"
  5. Advanced html2text Configuration

    • "What advanced html2text options are available and how do I set them?"
    • "How can I preserve specific tags, handle code blocks, or skip internal links?"
    • "Can I handle superscript and subscript formatting in the Markdown output?"
  6. Troubleshooting and Best Practices

    • "Why am I getting empty Markdown output and how can I fix it?"
    • "How do I handle malformed HTML or JavaScript-heavy sites?"
    • "What are the recommended workflows for large-scale or performance-critical Markdown generation?"
    • "How do I preserve references or add citation-style links in the final Markdown?"
  7. Use Cases and Integration

    • "How can I incorporate fit_markdown into an LLM fine-tuning or RAG pipeline?"
    • "Can I run Crawl4AIs Markdown generation inside a Docker container for consistent environments?"
    • "How do I cache results or reuse sessions to speed up repeated markdown generation tasks?"

Topics Discussed in the File

  • Markdown Generation Workflow using DefaultMarkdownGenerator
  • HTML-to-Markdown Conversion Options (ignore links, images, escape HTML, line-wrapping, code handling)
  • Applying Content Filters (BM25 and Pruning) before Markdown generation
  • fit_markdown vs. raw_markdown for filtered, cleaner output
  • BM25ContentFilter for query-based content relevance
  • PruningContentFilter for unsupervised noise removal and cleaner pages
  • Combining Filters (prune first, then BM25) to refine content
  • Advanced html2text Configurations (handle code blocks, superscripts, skip internal links)
  • Troubleshooting Tips (empty output, malformed HTML, performance considerations)
  • Downstream Uses: Training LLMs, building RAG pipelines, semantic search indexing
  • Best Practices (iterative parameter tuning, caching, Docker deployment)
  • Real-World Scenarios (news summarization, large corpus pre-processing, improved RAG retrieval quality)