crawl4ai/docs/llm.txt/12_prefix_based_input.q.md

Hypothetical Questions

  1. Basic Usage

    • "How can I crawl a regular website URL using Crawl4AI?"
    • "What configuration object do I need to pass to arun for basic crawling scenarios?"
  2. Local HTML Files

    • "How do I crawl an HTML file stored locally on my machine?"
    • "What prefix should I use when specifying a local file path to arun?"
  3. Raw HTML Strings

    • "Is it possible to crawl a raw HTML string without saving it to a file first?"
    • "How do I prefix a raw HTML string so that Crawl4AI treats it like HTML content?"
  4. Verifying Results

    • "Can I compare the extracted Markdown content from a live page with that of a locally saved or raw version to ensure they match?"
    • "How do I handle errors or check if the crawl was successful?"
  5. Use Cases

    • "When would I want to use file:// vs. raw: URLs?"
    • "Can I reuse the same code structure for various input types (web URL, file, raw HTML)?"
  6. Caching and Configuration

    • "What does bypass_cache=True do and when should I use it?"
    • "Is there a simpler way to configure crawling options uniformly across web URLs, local files, and raw HTML?"
  7. Practical Scenarios

    • "How can I integrate file-based crawling into a pipeline that starts from a live page, saves the HTML, and then crawls that local file for consistency checks?"
    • "Does Crawl4AIs prefix-based handling allow me to pre-process raw HTML (e.g., downloaded from another source) without hosting it on a local server?"

Topics Discussed in the File

  • Prefix-Based Input Handling:
    Introducing the concept of using http:// or https:// for web URLs, file:// for local files, and raw: for direct HTML strings. This unified approach allows seamless handling of different content sources within Crawl4AI; each prefix form is illustrated in the sketches after this list.

  • Crawling a Web URL:
    Demonstrating how to crawl a live web page (like a Wikipedia article) using AsyncWebCrawler and CrawlerRunConfig (first sketch below).

  • Crawling a Local HTML File:
    Showing how to convert a local file path to a file:// URL and use arun to process it, ensuring that previously saved HTML can be re-crawled for verification or offline analysis (second sketch below).

  • Crawling Raw HTML Content:
    Explaining how to directly pass an HTML string prefixed with raw: to arun, enabling quick tests or processing of HTML code obtained from other sources without saving it to disk (third sketch below).

  • Consistency and Verification:
    Providing a comprehensive example (fourth sketch below) that:

    1. Crawls a live Wikipedia page.
    2. Saves the HTML to a file.
    3. Re-crawls the local file.
    4. Re-crawls the content as a raw HTML string.
    5. Verifies that the Markdown extracted remains consistent across all three methods.
  • Integration with CrawlerRunConfig:
    Showing how to use CrawlerRunConfig to disable caching (bypass_cache=True) and ensure fresh results for each test run (also part of the fourth sketch).
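
A minimal sketch of the first topic: crawling a live page by URL. The Wikipedia URL below is a placeholder, and the sketch assumes the API surface named in this file (AsyncWebCrawler, arun, and CrawlerRunConfig with bypass_cache=True; newer releases spell the cache option as cache_mode=CacheMode.BYPASS).

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def crawl_web_url():
    # bypass_cache=True forces a fresh fetch instead of a cached result
    config = CrawlerRunConfig(bypass_cache=True)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/Web_crawler",  # placeholder article
            config=config,
        )
        if result.success:
            print(result.markdown[:300])  # first characters of the extracted Markdown
        else:
            print("Crawl failed:", result.error_message)

asyncio.run(crawl_web_url())
```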
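Crawling a local HTML file follows the same shape; only the prefix changes. The saved_page.html name is hypothetical, and the path is resolved to an absolute one so the file:// URL is well-formed.

```python
import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def crawl_local_file(html_path: str):
    # Convert the local path to an absolute file:// URL
    file_url = f"file://{Path(html_path).resolve()}"
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=file_url, config=CrawlerRunConfig(bypass_cache=True))
        if result.success:
            print(result.markdown[:300])
        else:
            print("Crawl failed:", result.error_message)

asyncio.run(crawl_local_file("saved_page.html"))  # hypothetical file name
```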
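For raw HTML, the string itself is the input and nothing touches disk. The snippet below is a toy document, and it assumes the raw: prefix behaves as described in this file.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def crawl_raw_html():
    raw_html = "<html><body><h1>Hello</h1><p>Inline HTML, never saved to disk.</p></body></html>"
    async with AsyncWebCrawler() as crawler:
        # The raw: prefix tells arun to treat the string itself as the page content
        result = await crawler.arun(url=f"raw:{raw_html}", config=CrawlerRunConfig(bypass_cache=True))
        print(result.markdown if result.success else result.error_message)

asyncio.run(crawl_raw_html())
```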
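Finally, a sketch of the full consistency workflow described above, under the same assumptions (placeholder URL and file name; a page with dynamic content may not compare exactly equal between fetches):

```python
import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def verify_consistency():
    config = CrawlerRunConfig(bypass_cache=True)  # fresh results on every run
    async with AsyncWebCrawler() as crawler:
        # 1. Crawl the live page
        live = await crawler.arun(url="https://en.wikipedia.org/wiki/Web_crawler", config=config)
        assert live.success, live.error_message

        # 2. Save the fetched HTML to a local file
        path = Path("saved_page.html")  # hypothetical file name
        path.write_text(live.html, encoding="utf-8")

        # 3. Re-crawl the saved copy via file://
        local = await crawler.arun(url=f"file://{path.resolve()}", config=config)

        # 4. Re-crawl the same content as a raw HTML string
        raw = await crawler.arun(url=f"raw:{live.html}", config=config)

        # 5. The extracted Markdown should match across all three inputs
        print("file:// matches live:", local.markdown == live.markdown)
        print("raw:    matches live:", raw.markdown == live.markdown)

asyncio.run(verify_consistency())
```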

In summary, the file highlights how to use Crawl4AI's prefix-based handling to effortlessly switch between crawling live web pages, local HTML files, and raw HTML strings. It also demonstrates a detailed workflow for verifying consistency and correctness across various input methods.