Commit Message:

Enhance Crawl4AI with CLI and documentation updates - Implemented Command-Line Interface (CLI) in `crawl4ai/cli.py` - Added chunking strategies and their documentation in `llm.txt`
2024-12-21 14:26:56 +08:00
parent 8fbc2e0463
commit 84b311760f
47 changed files with 6510 additions and 2 deletions
--- a/docs/llm.txt/12_prefix_based_input.q.md
+++ b/docs/llm.txt/12_prefix_based_input.q.md
@@ -0,0 +1,56 @@
+### Hypothetical Questions
+
+1. **Basic Usage**
+   - *"How can I crawl a regular website URL using Crawl4AI?"*
+   - *"What configuration object do I need to pass to `arun` for basic crawling scenarios?"*
+
+2. **Local HTML Files**
+   - *"How do I crawl an HTML file stored locally on my machine?"*
+   - *"What prefix should I use when specifying a local file path to `arun`?"*
+
+3. **Raw HTML Strings**
+   - *"Is it possible to crawl a raw HTML string without saving it to a file first?"*
+   - *"How do I prefix a raw HTML string so that Crawl4AI treats it like HTML content?"*
+
+4. **Verifying Results**
+   - *"Can I compare the extracted Markdown content from a live page with that of a locally saved or raw version to ensure they match?"*
+   - *"How do I handle errors or check if the crawl was successful?"*
+
+5. **Use Cases**
+   - *"When would I want to use `file://` vs. `raw:` URLs?"*
+   - *"Can I reuse the same code structure for various input types (web URL, file, raw HTML)?"*
+
+6. **Caching and Configuration**
+   - *"What does `bypass_cache=True` do and when should I use it?"*
+   - *"Is there a simpler way to configure crawling options uniformly across web URLs, local files, and raw HTML?"*
+
+7. **Practical Scenarios**
+   - *"How can I integrate file-based crawling into a pipeline that starts from a live page, saves the HTML, and then crawls that local file for consistency checks?"*
+   - *"Does Crawl4AI’s prefix-based handling allow me to pre-process raw HTML (e.g., downloaded from another source) without hosting it on a local server?"*
+
+### Topics Discussed in the File
+
+- **Prefix-Based Input Handling**:  
+  Introducing the concept of using `http://` or `https://` for web URLs, `file://` for local files, and `raw:` for direct HTML strings. This unified approach allows seamless handling of different content sources within Crawl4AI.
+
+- **Crawling a Web URL**:  
+  Demonstrating how to crawl a live web page (like a Wikipedia article) using `AsyncWebCrawler` and `CrawlerRunConfig`.
+
+- **Crawling a Local HTML File**:  
+  Showing how to convert a local file path to a `file://` URL and use `arun` to process it, ensuring that previously saved HTML can be re-crawled for verification or offline analysis.
+
+- **Crawling Raw HTML Content**:  
+  Explaining how to directly pass an HTML string prefixed with `raw:` to `arun`, enabling quick tests or processing of HTML code obtained from other sources without saving it to disk.
+
+- **Consistency and Verification**:  
+  Providing a comprehensive example that:
+  1. Crawls a live Wikipedia page.
+  2. Saves the HTML to a file.
+  3. Re-crawls the local file.
+  4. Re-crawls the content as a raw HTML string.
+  5. Verifies that the Markdown extracted remains consistent across all three methods.
+
+- **Integration with `CrawlerRunConfig`**:  
+  Showing how to use `CrawlerRunConfig` to disable caching (`bypass_cache=True`) and ensure fresh results for each test run.
+
+In summary, the file highlights how to use Crawl4AI’s prefix-based handling to effortlessly switch between crawling live web pages, local HTML files, and raw HTML strings. It also demonstrates a detailed workflow for verifying consistency and correctness across various input methods.