Enhance Crawl4AI with CLI and documentation updates - Implemented Command-Line Interface (CLI) in `crawl4ai/cli.py` - Added chunking strategies and their documentation in `llm.txt`
3.3 KiB
3.3 KiB
Hypothetical Questions
-
Markdown Generation Basics
- "How can I convert raw HTML into clean, structured Markdown using Crawl4AI?"
- "What are the main benefits of generating Markdown from web content for LLM workflows?"
- "How do I quickly start generating Markdown output from a given URL?"
-
Default Markdown Generator Configuration
- "What parameters can I customize in
DefaultMarkdownGeneratorto control the HTML-to-Markdown conversion?" - "How do I ignore links, images, or HTML entities when converting to Markdown?"
- "Can I set a custom line-wrapping width and handle code blocks in Markdown output?"
- "What parameters can I customize in
-
Content Filtering Strategies
- "How can I apply filters like BM25 or pruning before Markdown generation?"
- "What is
fit_markdownand how does it differ from the raw Markdown output?" - "How do I use
BM25ContentFilterto get content relevant to a specific user query?" - "What does
PruningContentFilterdo, and when should I use it to clean up noisy HTML?"
-
BM25 and Pruning Filters
- "How does BM25 ranking improve the relevance of extracted Markdown content?"
- "Which parameters should I tweak if BM25 returns too much or too little content?"
- "How can I combine
PruningContentFilterwith BM25 to first remove boilerplate and then focus on relevance?"
-
Advanced html2text Configuration
- "What advanced
html2textoptions are available and how do I set them?" - "How can I preserve specific tags, handle code blocks, or skip internal links?"
- "Can I handle superscript and subscript formatting in the Markdown output?"
- "What advanced
-
Troubleshooting and Best Practices
- "Why am I getting empty Markdown output and how can I fix it?"
- "How do I handle malformed HTML or JavaScript-heavy sites?"
- "What are the recommended workflows for large-scale or performance-critical Markdown generation?"
- "How do I preserve references or add citation-style links in the final Markdown?"
-
Use Cases and Integration
- "How can I incorporate
fit_markdowninto an LLM fine-tuning or RAG pipeline?" - "Can I run Crawl4AI’s Markdown generation inside a Docker container for consistent environments?"
- "How do I cache results or reuse sessions to speed up repeated markdown generation tasks?"
- "How can I incorporate
Topics Discussed in the File
- Markdown Generation Workflow using
DefaultMarkdownGenerator - HTML-to-Markdown Conversion Options (ignore links, images, escape HTML, line-wrapping, code handling)
- Applying Content Filters (BM25 and Pruning) before Markdown generation
- fit_markdown vs. raw_markdown for filtered, cleaner output
- BM25ContentFilter for query-based content relevance
- PruningContentFilter for unsupervised noise removal and cleaner pages
- Combining Filters (prune first, then BM25) to refine content
- Advanced
html2textConfigurations (handle code blocks, superscripts, skip internal links) - Troubleshooting Tips (empty output, malformed HTML, performance considerations)
- Downstream Uses: Training LLMs, building RAG pipelines, semantic search indexing
- Best Practices (iterative parameter tuning, caching, Docker deployment)
- Real-World Scenarios (news summarization, large corpus pre-processing, improved RAG retrieval quality)