crawl4ai/docs/llm.txt/12_prefix_based_input.q.md

Hypothetical Questions

  1. Basic Usage

    • "How can I crawl a regular website URL using Crawl4AI?"
    • "What configuration object do I need to pass to arun for basic crawling scenarios?"
  2. Local HTML Files

    • "How do I crawl an HTML file stored locally on my machine?"
    • "What prefix should I use when specifying a local file path to arun?"
  3. Raw HTML Strings

    • "Is it possible to crawl a raw HTML string without saving it to a file first?"
    • "How do I prefix a raw HTML string so that Crawl4AI treats it like HTML content?"
  4. Verifying Results

    • "Can I compare the extracted Markdown content from a live page with that of a locally saved or raw version to ensure they match?"
    • "How do I handle errors or check if the crawl was successful?"
  5. Use Cases

    • "When would I want to use file:// vs. raw: URLs?"
    • "Can I reuse the same code structure for various input types (web URL, file, raw HTML)?"
  6. Caching and Configuration

    • "What does bypass_cache=True do and when should I use it?"
    • "Is there a simpler way to configure crawling options uniformly across web URLs, local files, and raw HTML?"
  7. Practical Scenarios

    • "How can I integrate file-based crawling into a pipeline that starts from a live page, saves the HTML, and then crawls that local file for consistency checks?"
    • "Does Crawl4AIs prefix-based handling allow me to pre-process raw HTML (e.g., downloaded from another source) without hosting it on a local server?"

Topics Discussed in the File

  • Prefix-Based Input Handling:
    Introducing the concept of using http:// or https:// for web URLs, file:// for local files, and raw: for direct HTML strings. This unified approach allows seamless handling of different content sources within Crawl4AI; each prefix form is illustrated in the sketches after this list.

  • Crawling a Web URL:
    Demonstrating how to crawl a live web page (like a Wikipedia article) using AsyncWebCrawler and CrawlerRunConfig (first sketch below).

  • Crawling a Local HTML File:
    Showing how to convert a local file path to a file:// URL and use arun to process it, ensuring that previously saved HTML can be re-crawled for verification or offline analysis (second sketch below).

  • Crawling Raw HTML Content:
    Explaining how to directly pass an HTML string prefixed with raw: to arun, enabling quick tests or processing of HTML code obtained from other sources without saving it to disk (third sketch below).

  • Consistency and Verification:
    Providing a comprehensive example (fourth sketch below) that:

    1. Crawls a live Wikipedia page.
    2. Saves the HTML to a file.
    3. Re-crawls the local file.
    4. Re-crawls the content as a raw HTML string.
    5. Verifies that the Markdown extracted remains consistent across all three methods.
  • Integration with CrawlerRunConfig:
    Showing how to use CrawlerRunConfig to disable caching (bypass_cache=True) and ensure fresh results for each test run (also part of the fourth sketch).
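
A minimal sketch of the first topic: crawling a live page by URL. The Wikipedia URL below is a placeholder, and the sketch assumes the API surface named in this file (AsyncWebCrawler, arun, and CrawlerRunConfig with bypass_cache=True; newer releases spell the cache option as cache_mode=CacheMode.BYPASS).

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def crawl_web_url():
    # bypass_cache=True forces a fresh fetch instead of a cached result
    config = CrawlerRunConfig(bypass_cache=True)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/Web_crawler",  # placeholder article
            config=config,
        )
        if result.success:
            print(result.markdown[:300])  # first characters of the extracted Markdown
        else:
            print("Crawl failed:", result.error_message)

asyncio.run(crawl_web_url())
```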
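Crawling a local HTML file follows the same shape; only the prefix changes. The saved_page.html name is hypothetical, and the path is resolved to an absolute one so the file:// URL is well-formed.

```python
import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def crawl_local_file(html_path: str):
    # Convert the local path to an absolute file:// URL
    file_url = f"file://{Path(html_path).resolve()}"
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=file_url, config=CrawlerRunConfig(bypass_cache=True))
        if result.success:
            print(result.markdown[:300])
        else:
            print("Crawl failed:", result.error_message)

asyncio.run(crawl_local_file("saved_page.html"))  # hypothetical file name
```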
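For raw HTML, the string itself is the input and nothing touches disk. The snippet below is a toy document, and it assumes the raw: prefix behaves as described in this file.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def crawl_raw_html():
    raw_html = "<html><body><h1>Hello</h1><p>Inline HTML, never saved to disk.</p></body></html>"
    async with AsyncWebCrawler() as crawler:
        # The raw: prefix tells arun to treat the string itself as the page content
        result = await crawler.arun(url=f"raw:{raw_html}", config=CrawlerRunConfig(bypass_cache=True))
        print(result.markdown if result.success else result.error_message)

asyncio.run(crawl_raw_html())
```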
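Finally, a sketch of the full consistency workflow described above, under the same assumptions (placeholder URL and file name; a page with dynamic content may not compare exactly equal between fetches):

```python
import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def verify_consistency():
    config = CrawlerRunConfig(bypass_cache=True)  # fresh results on every run
    async with AsyncWebCrawler() as crawler:
        # 1. Crawl the live page
        live = await crawler.arun(url="https://en.wikipedia.org/wiki/Web_crawler", config=config)
        assert live.success, live.error_message

        # 2. Save the fetched HTML to a local file
        path = Path("saved_page.html")  # hypothetical file name
        path.write_text(live.html, encoding="utf-8")

        # 3. Re-crawl the saved copy via file://
        local = await crawler.arun(url=f"file://{path.resolve()}", config=config)

        # 4. Re-crawl the same content as a raw HTML string
        raw = await crawler.arun(url=f"raw:{live.html}", config=config)

        # 5. The extracted Markdown should match across all three inputs
        print("file:// matches live:", local.markdown == live.markdown)
        print("raw:    matches live:", raw.markdown == live.markdown)

asyncio.run(verify_consistency())
```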

In summary, the file highlights how to use Crawl4AI's prefix-based handling to effortlessly switch between crawling live web pages, local HTML files, and raw HTML strings. It also demonstrates a detailed workflow for verifying consistency and correctness across various input methods.