### Hypothetical Questions
1. **Basic Usage**
   - *"How can I crawl a regular website URL using Crawl4AI?"*
   - *"What configuration object do I need to pass to `arun` for basic crawling scenarios?"*
2. **Local HTML Files**
   - *"How do I crawl an HTML file stored locally on my machine?"*
   - *"What prefix should I use when specifying a local file path to `arun`?"*
3. **Raw HTML Strings**
   - *"Is it possible to crawl a raw HTML string without saving it to a file first?"*
   - *"How do I prefix a raw HTML string so that Crawl4AI treats it as HTML content?"*
4. **Verifying Results**
   - *"Can I compare the extracted Markdown content from a live page with that of a locally saved or raw version to ensure they match?"*
   - *"How do I handle errors or check whether the crawl was successful?"*
5. **Use Cases**
   - *"When would I want to use `file://` vs. `raw:` URLs?"*
   - *"Can I reuse the same code structure for various input types (web URL, file, raw HTML)?"*
6. **Caching and Configuration**
   - *"What does `bypass_cache=True` do, and when should I use it?"*
   - *"Is there a simpler way to configure crawling options uniformly across web URLs, local files, and raw HTML?"*
7. **Practical Scenarios**
   - *"How can I integrate file-based crawling into a pipeline that starts from a live page, saves the HTML, and then crawls that local file for consistency checks?"*
   - *"Does Crawl4AI's prefix-based handling allow me to pre-process raw HTML (e.g., downloaded from another source) without hosting it on a local server?"*
### Topics Discussed in the File
- **Prefix-Based Input Handling**:
  Introducing the concept of using `http://` or `https://` for web URLs, `file://` for local files, and `raw:` for direct HTML strings. This unified approach allows seamless handling of different content sources within Crawl4AI.
- **Crawling a Web URL**:
  Demonstrating how to crawl a live web page (such as a Wikipedia article) using `AsyncWebCrawler` and `CrawlerRunConfig`.
- **Crawling a Local HTML File**:
  Showing how to convert a local file path to a `file://` URL and use `arun` to process it, ensuring that previously saved HTML can be re-crawled for verification or offline analysis.
- **Crawling Raw HTML Content**:
  Explaining how to pass an HTML string prefixed with `raw:` directly to `arun`, enabling quick tests or processing of HTML obtained from other sources without saving it to disk.
- **Consistency and Verification**:
  Providing a comprehensive example (sketched after this list) that:
  1. Crawls a live Wikipedia page.
  2. Saves the HTML to a file.
  3. Re-crawls the local file.
  4. Re-crawls the content as a raw HTML string.
  5. Verifies that the Markdown extracted remains consistent across all three methods.
- **Integration with `CrawlerRunConfig`**:
  Showing how to use `CrawlerRunConfig` to disable caching (`bypass_cache=True`) and ensure fresh results for each test run.
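
A sketch of that verification workflow, under the same API assumptions; the page URL and output filename are placeholders, and the exact-equality check at the end mirrors the consistency claim the file makes (in practice, small differences such as link resolution could require a looser comparison).

```python
import asyncio
from pathlib import Path

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def verify_consistency(url: str) -> None:
    config = CrawlerRunConfig(bypass_cache=True)  # fresh results each run

    async with AsyncWebCrawler() as crawler:
        # 1. Crawl the live page
        live = await crawler.arun(url=url, config=config)
        assert live.success, live.error_message

        # 2. Save the fetched HTML to a local file
        path = Path("saved_page.html")  # illustrative filename
        path.write_text(live.html, encoding="utf-8")

        # 3. Re-crawl the saved copy via a file:// URL
        from_file = await crawler.arun(url=f"file://{path.resolve()}", config=config)

        # 4. Re-crawl the same content as a raw: HTML string
        from_raw = await crawler.arun(url=f"raw:{live.html}", config=config)

        # 5. The extracted Markdown should match across all three inputs
        assert live.markdown == from_file.markdown == from_raw.markdown
        print("Markdown is consistent across web, file://, and raw: inputs")

asyncio.run(verify_consistency("https://en.wikipedia.org/wiki/Web_crawler"))
```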
In summary, the file highlights how to use Crawl4AI's prefix-based handling to switch effortlessly between crawling live web pages, local HTML files, and raw HTML strings. It also demonstrates a detailed workflow for verifying consistency and correctness across these input methods.