Enhance crawler capabilities and documentation

- Add llm.txt generator
- Added SSL certificate extraction in AsyncWebCrawler.
- Introduced new content filters and chunking strategies for more robust data extraction.
- Updated documentation.
2024-12-25 21:34:31 +08:00
parent 84b311760f
commit d5ed451299
59 changed files with 2208 additions and 1763 deletions
--- a/docs/llm.txt/12_prefix_based_input.q.md
+++ b/docs/llm.txt/12_prefix_based_input.q.md
@@ -1,56 +1,10 @@
-### Hypothetical Questions
-
-1. **Basic Usage**
-   - *"How can I crawl a regular website URL using Crawl4AI?"*
-   - *"What configuration object do I need to pass to `arun` for basic crawling scenarios?"*
-
-2. **Local HTML Files**
-   - *"How do I crawl an HTML file stored locally on my machine?"*
-   - *"What prefix should I use when specifying a local file path to `arun`?"*
-
-3. **Raw HTML Strings**
-   - *"Is it possible to crawl a raw HTML string without saving it to a file first?"*
-   - *"How do I prefix a raw HTML string so that Crawl4AI treats it like HTML content?"*
-
-4. **Verifying Results**
-   - *"Can I compare the extracted Markdown content from a live page with that of a locally saved or raw version to ensure they match?"*
-   - *"How do I handle errors or check if the crawl was successful?"*
-
-5. **Use Cases**
-   - *"When would I want to use `file://` vs. `raw:` URLs?"*
-   - *"Can I reuse the same code structure for various input types (web URL, file, raw HTML)?"*
-
-6. **Caching and Configuration**
-   - *"What does `bypass_cache=True` do and when should I use it?"*
-   - *"Is there a simpler way to configure crawling options uniformly across web URLs, local files, and raw HTML?"*
-
-7. **Practical Scenarios**
-   - *"How can I integrate file-based crawling into a pipeline that starts from a live page, saves the HTML, and then crawls that local file for consistency checks?"*
-   - *"Does Crawl4AI’s prefix-based handling allow me to pre-process raw HTML (e.g., downloaded from another source) without hosting it on a local server?"*
-
-### Topics Discussed in the File
-
- **Prefix-Based Input Handling**:  
-  Introducing the concept of using `http://` or `https://` for web URLs, `file://` for local files, and `raw:` for direct HTML strings. This unified approach allows seamless handling of different content sources within Crawl4AI.
-
- **Crawling a Web URL**:  
-  Demonstrating how to crawl a live web page (like a Wikipedia article) using `AsyncWebCrawler` and `CrawlerRunConfig`.
-
- **Crawling a Local HTML File**:  
-  Showing how to convert a local file path to a `file://` URL and use `arun` to process it, ensuring that previously saved HTML can be re-crawled for verification or offline analysis.
-
- **Crawling Raw HTML Content**:  
-  Explaining how to directly pass an HTML string prefixed with `raw:` to `arun`, enabling quick tests or processing of HTML code obtained from other sources without saving it to disk.
-
- **Consistency and Verification**:  
-  Providing a comprehensive example that:
-  1. Crawls a live Wikipedia page.
-  2. Saves the HTML to a file.
-  3. Re-crawls the local file.
-  4. Re-crawls the content as a raw HTML string.
-  5. Verifies that the Markdown extracted remains consistent across all three methods.
-
- **Integration with `CrawlerRunConfig`**:  
-  Showing how to use `CrawlerRunConfig` to disable caching (`bypass_cache=True`) and ensure fresh results for each test run.
-
-In summary, the file highlights how to use Crawl4AI’s prefix-based handling to effortlessly switch between crawling live web pages, local HTML files, and raw HTML strings. It also demonstrates a detailed workflow for verifying consistency and correctness across various input methods.
+url_prefix_handling: Crawl4AI supports different URL prefixes for various input types | input handling, url format, crawling types | url="https://example.com" or "file://path" or "raw:html"
+web_crawling: Crawl live web pages using http:// or https:// prefixes with AsyncWebCrawler | web scraping, url crawling, web content | AsyncWebCrawler().arun(url="https://example.com")
+local_file_crawling: Access local HTML files using file:// prefix for crawling | local html, file crawling, file access | AsyncWebCrawler().arun(url="file:///path/to/file.html")
+raw_html_crawling: Process raw HTML content directly using raw: prefix | html string, raw content, direct html | AsyncWebCrawler().arun(url="raw:<html>content</html>")
+crawler_config: Configure crawling behavior using CrawlerRunConfig object | crawler settings, configuration, bypass cache | CrawlerRunConfig(bypass_cache=True)
+async_context: AsyncWebCrawler should be used within async context manager | async with, context management, async programming | async with AsyncWebCrawler() as crawler
+crawl_result: Crawler returns result object containing success status, markdown and error messages | response handling, crawl output, result parsing | result.success, result.markdown, result.error_message
+html_to_markdown: Crawler automatically converts HTML content to markdown format | format conversion, markdown generation, content processing | result.markdown
+error_handling: Check crawl success status and handle error messages appropriately | error checking, failure handling, status verification | if result.success: ... else: print(result.error_message)
+content_verification: Compare markdown length between different crawling methods for consistency | content validation, length comparison, consistency check | assert web_crawl_length == local_crawl_length
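
Review note: the added entries describe three input forms for `arun` (`http://`/`https://`, `file://`, and `raw:`). As a minimal sketch of that convention, the stdlib-only helpers below build and classify such strings. The names `classify_input`, `file_url`, and `raw_url` are hypothetical and are not part of Crawl4AI; the resulting string is only assumed to be what one would pass as `url=` to `AsyncWebCrawler().arun(...)`, as the entries above show.

```python
from pathlib import Path


def classify_input(url: str) -> str:
    """Hypothetical helper: report which input type a prefixed string denotes."""
    if url.startswith(("http://", "https://")):
        return "web"
    if url.startswith("file://"):
        return "file"
    if url.startswith("raw:"):
        return "raw"
    raise ValueError(f"unsupported input prefix: {url[:20]!r}")


def file_url(path: str) -> str:
    """Hypothetical helper: turn a local path into a file:// URL.

    Path.as_uri() requires an absolute path, so the path is resolved first.
    """
    return Path(path).resolve().as_uri()


def raw_url(html: str) -> str:
    """Hypothetical helper: prefix an HTML string with raw:."""
    return "raw:" + html


if __name__ == "__main__":
    for source in (
        "https://example.com",
        file_url("page.html"),
        raw_url("<html>content</html>"),
    ):
        print(classify_input(source), "<-", source[:40])
```

Keeping the prefix logic in one place means the same crawl call can be reused for all three input types, which is the point of the `url_prefix_handling` entry.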