Hypothetical Questions
- **Basic Usage**
  - "How can I crawl a regular website URL using Crawl4AI?"
  - "What configuration object do I need to pass to `arun` for basic crawling scenarios?"
- **Local HTML Files**
  - "How do I crawl an HTML file stored locally on my machine?"
  - "What prefix should I use when specifying a local file path to `arun`?"
- **Raw HTML Strings**
  - "Is it possible to crawl a raw HTML string without saving it to a file first?"
  - "How do I prefix a raw HTML string so that Crawl4AI treats it like HTML content?"
- **Verifying Results**
  - "Can I compare the extracted Markdown content from a live page with that of a locally saved or raw version to ensure they match?"
  - "How do I handle errors or check if the crawl was successful?" (see the sketch after this list)
- **Use Cases**
  - "When would I want to use `file://` vs. `raw:` URLs?"
  - "Can I reuse the same code structure for various input types (web URL, file, raw HTML)?"
- **Caching and Configuration**
  - "What does `bypass_cache=True` do and when should I use it?"
  - "Is there a simpler way to configure crawling options uniformly across web URLs, local files, and raw HTML?"
- **Practical Scenarios**
  - "How can I integrate file-based crawling into a pipeline that starts from a live page, saves the HTML, and then crawls that local file for consistency checks?"
  - "Does Crawl4AI’s prefix-based handling allow me to pre-process raw HTML (e.g., downloaded from another source) without hosting it on a local server?"
Topics Discussed in the File
- **Prefix-Based Input Handling:**
  Introducing the concept of using `http://` or `https://` for web URLs, `file://` for local files, and `raw:` for direct HTML strings. This unified approach allows seamless handling of different content sources within Crawl4AI (see the sketch after this list).
- **Crawling a Web URL:**
  Demonstrating how to crawl a live web page (like a Wikipedia article) using `AsyncWebCrawler` and `CrawlerRunConfig`.
- **Crawling a Local HTML File:**
  Showing how to convert a local file path to a `file://` URL and use `arun` to process it, ensuring that previously saved HTML can be re-crawled for verification or offline analysis.
- **Crawling Raw HTML Content:**
  Explaining how to directly pass an HTML string prefixed with `raw:` to `arun`, enabling quick tests or processing of HTML code obtained from other sources without saving it to disk.
- **Consistency and Verification:**
  Providing a comprehensive example (sketched at the end of this file) that:
  - Crawls a live Wikipedia page.
  - Saves the HTML to a file.
  - Re-crawls the local file.
  - Re-crawls the content as a raw HTML string.
  - Verifies that the Markdown extracted remains consistent across all three methods.
- **Integration with `CrawlerRunConfig`:**
  Showing how to use `CrawlerRunConfig` to disable caching (`bypass_cache=True`) and ensure fresh results for each test run.
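The sketch below ties the topics above together: one `AsyncWebCrawler` instance handling all three prefix forms. The Wikipedia URL and `saved_page.html` file name are placeholder assumptions, and `bypass_cache=True` follows this file's description of `CrawlerRunConfig`; newer Crawl4AI releases express the same setting as `cache_mode=CacheMode.BYPASS`.

```python
import asyncio
from pathlib import Path

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Disable caching per the description above; newer releases use
    # CrawlerRunConfig(cache_mode=CacheMode.BYPASS) instead.
    config = CrawlerRunConfig(bypass_cache=True)

    async with AsyncWebCrawler() as crawler:
        # 1. Live web page: http:// or https:// prefix.
        web = await crawler.arun(
            url="https://en.wikipedia.org/wiki/Apple", config=config
        )

        # 2. Local HTML file: file:// prefix on an absolute path.
        # Assumes saved_page.html exists in the working directory.
        path = Path("saved_page.html").resolve()
        local = await crawler.arun(url=f"file://{path}", config=config)

        # 3. Raw HTML string: raw: prefix, no file or server needed.
        raw = await crawler.arun(
            url="raw:<html><body><h1>Hello</h1></body></html>", config=config
        )

        for result in (web, local, raw):
            print(result.success, str(result.markdown)[:80])

asyncio.run(main())
```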
In summary, the file highlights how to use Crawl4AI’s prefix-based handling to effortlessly switch between crawling live web pages, local HTML files, and raw HTML strings. It also demonstrates a detailed workflow for verifying consistency and correctness across various input methods.
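As a closing illustration, here is a sketch of that verification workflow, under the same assumptions as above (placeholder URL and temp-file name; `bypass_cache=True` as this file describes it):

```python
import asyncio
from pathlib import Path

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def verify_consistency(url: str, tmp_name: str = "page.html") -> None:
    config = CrawlerRunConfig(bypass_cache=True)  # fresh results on every run

    async with AsyncWebCrawler() as crawler:
        # 1. Crawl the live page.
        live = await crawler.arun(url=url, config=config)
        if not live.success:
            raise RuntimeError(f"Live crawl failed: {live.error_message}")

        # 2. Save the fetched HTML to a file.
        path = Path(tmp_name).resolve()
        path.write_text(live.html, encoding="utf-8")

        # 3. Re-crawl the saved file via a file:// URL.
        from_file = await crawler.arun(url=f"file://{path}", config=config)

        # 4. Re-crawl the same content as a raw: HTML string.
        from_raw = await crawler.arun(url=f"raw:{live.html}", config=config)

        # 5. The extracted Markdown should match across all three inputs.
        md = [str(r.markdown) for r in (live, from_file, from_raw)]
        assert md[0] == md[1] == md[2], "Markdown differs between input methods"
        print("Markdown consistent across web, file://, and raw: inputs")

# Placeholder target; any stable page works.
asyncio.run(verify_consistency("https://en.wikipedia.org/wiki/Apple"))
```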