Enhance Crawl4AI with new features and documentation

- Fix crawler text mode for improved performance; cover missing `srcset` and `data_srcset` attributes in image tags.
- Introduced Managed Browsers for enhanced crawling experience.
- Updated documentation for clearer navigation on configuration.
- Changed 'text_only' to 'text_mode' in configuration and methods.
- Improved performance and relevance in content filtering strategies.
2024-12-19 21:02:29 +08:00
parent 393bb911c0
commit 849765712f
23 changed files with 1825 additions and 1721 deletions
--- a/docs/md_v2/basic/prefix-based-input.md
+++ b/docs/md_v2/basic/prefix-based-input.md
@@ -2,31 +2,19 @@

 This guide will walk you through using the Crawl4AI library to crawl web pages, local HTML files, and raw HTML strings. We'll demonstrate these capabilities using a Wikipedia page as an example.

-## Table of Contents
- [Prefix-Based Input Handling in Crawl4AI](#prefix-based-input-handling-in-crawl4ai)
-  - [Table of Contents](#table-of-contents)
-    - [Crawling a Web URL](#crawling-a-web-url)
-    - [Crawling a Local HTML File](#crawling-a-local-html-file)
-    - [Crawling Raw HTML Content](#crawling-raw-html-content)
-  - [Complete Example](#complete-example)
-    - [**How It Works**](#how-it-works)
-    - [**Running the Example**](#running-the-example)
-  - [Conclusion](#conclusion)
+## Crawling a Web URL

---
-
-
-### Crawling a Web URL
-
-To crawl a live web page, provide the URL starting with `http://` or `https://`.
+To crawl a live web page, provide the URL starting with `http://` or `https://`, using a `CrawlerRunConfig` object:

 ```python
 import asyncio
 from crawl4ai import AsyncWebCrawler
+from crawl4ai.async_configs import CrawlerRunConfig

 async def crawl_web():
-    async with AsyncWebCrawler(verbose=True) as crawler:
-        result = await crawler.arun(url="https://en.wikipedia.org/wiki/apple", bypass_cache=True)
+    config = CrawlerRunConfig(bypass_cache=True)
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun(url="https://en.wikipedia.org/wiki/apple", config=config)
        if result.success:
            print("Markdown Content:")
            print(result.markdown)
@@ -36,20 +24,22 @@ async def crawl_web():
 asyncio.run(crawl_web())
 ```

-### Crawling a Local HTML File
+## Crawling a Local HTML File

 To crawl a local HTML file, prefix the file path with `file://`.

 ```python
 import asyncio
 from crawl4ai import AsyncWebCrawler
+from crawl4ai.async_configs import CrawlerRunConfig

 async def crawl_local_file():
    local_file_path = "/path/to/apple.html"  # Replace with your file path
    file_url = f"file://{local_file_path}"
+    config = CrawlerRunConfig(bypass_cache=True)
    
-    async with AsyncWebCrawler(verbose=True) as crawler:
-        result = await crawler.arun(url=file_url, bypass_cache=True)
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun(url=file_url, config=config)
        if result.success:
            print("Markdown Content from Local File:")
            print(result.markdown)
@@ -59,20 +49,22 @@ async def crawl_local_file():
 asyncio.run(crawl_local_file())
 ```

-### Crawling Raw HTML Content
+## Crawling Raw HTML Content

 To crawl raw HTML content, prefix the HTML string with `raw:`.

 ```python
 import asyncio
 from crawl4ai import AsyncWebCrawler
+from crawl4ai.async_configs import CrawlerRunConfig

 async def crawl_raw_html():
    raw_html = "<html><body><h1>Hello, World!</h1></body></html>"
    raw_html_url = f"raw:{raw_html}"
+    config = CrawlerRunConfig(bypass_cache=True)
    
-    async with AsyncWebCrawler(verbose=True) as crawler:
-        result = await crawler.arun(url=raw_html_url, bypass_cache=True)
+    async with AsyncWebCrawler() as crawler:
+        result = await crawler.arun(url=raw_html_url, config=config)
        if result.success:
            print("Markdown Content from Raw HTML:")
            print(result.markdown)
@@ -84,152 +76,83 @@ asyncio.run(crawl_raw_html())

 ---

-## Complete Example
+# Complete Example

 Below is a comprehensive script that:
-1. **Crawls the Wikipedia page for "Apple".**
-2. **Saves the HTML content to a local file (`apple.html`).**
-3. **Crawls the local HTML file and verifies the markdown length matches the original crawl.**
-4. **Crawls the raw HTML content from the saved file and verifies consistency.**
+
+1. Crawls the Wikipedia page for "Apple."
+2. Saves the HTML content to a local file (`apple.html`).
+3. Crawls the local HTML file and verifies the markdown length matches the original crawl.
+4. Crawls the raw HTML content from the saved file and verifies consistency.

 ```python
 import os
 import sys
 import asyncio
 from pathlib import Path
-
-# Adjust the parent directory to include the crawl4ai module
-parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
-sys.path.append(parent_dir)
-
 from crawl4ai import AsyncWebCrawler
+from crawl4ai.async_configs import CrawlerRunConfig

 async def main():
-    # Define the URL to crawl
    wikipedia_url = "https://en.wikipedia.org/wiki/apple"
-    
-    # Define the path to save the HTML file
-    # Save the file in the same directory as the script
    script_dir = Path(__file__).parent
    html_file_path = script_dir / "apple.html"
-    
-    async with AsyncWebCrawler(verbose=True) as crawler:
+
+    async with AsyncWebCrawler() as crawler:
+        # Step 1: Crawl the Web URL
        print("\n=== Step 1: Crawling the Wikipedia URL ===")
-        # Crawl the Wikipedia URL
-        result = await crawler.arun(url=wikipedia_url, bypass_cache=True)
-        
-        # Check if crawling was successful
+        web_config = CrawlerRunConfig(bypass_cache=True)
+        result = await crawler.arun(url=wikipedia_url, config=web_config)
+
        if not result.success:
            print(f"Failed to crawl {wikipedia_url}: {result.error_message}")
            return
-        
-        # Save the HTML content to a local file
+
        with open(html_file_path, 'w', encoding='utf-8') as f:
            f.write(result.html)
-        print(f"Saved HTML content to {html_file_path}")
-        
-        # Store the length of the generated markdown
        web_crawl_length = len(result.markdown)
        print(f"Length of markdown from web crawl: {web_crawl_length}\n")
-        
+
+        # Step 2: Crawl from the Local HTML File
        print("=== Step 2: Crawling from the Local HTML File ===")
-        # Construct the file URL with 'file://' prefix
        file_url = f"file://{html_file_path.resolve()}"
-        
-        # Crawl the local HTML file
-        local_result = await crawler.arun(url=file_url, bypass_cache=True)
-        
-        # Check if crawling was successful
+        file_config = CrawlerRunConfig(bypass_cache=True)
+        local_result = await crawler.arun(url=file_url, config=file_config)
+
        if not local_result.success:
            print(f"Failed to crawl local file {file_url}: {local_result.error_message}")
            return
-        
-        # Store the length of the generated markdown from local file
+
        local_crawl_length = len(local_result.markdown)
-        print(f"Length of markdown from local file crawl: {local_crawl_length}")
-        
-        # Compare the lengths
-        assert web_crawl_length == local_crawl_length, (
-            f"Markdown length mismatch: Web crawl ({web_crawl_length}) != Local file crawl ({local_crawl_length})"
-        )
-        print("✅ Markdown length matches between web crawl and local file crawl.\n")
-        
+        assert web_crawl_length == local_crawl_length, "Markdown length mismatch"
+        print("✅ Markdown length matches between web and local file crawl.\n")
+
+        # Step 3: Crawl Using Raw HTML Content
        print("=== Step 3: Crawling Using Raw HTML Content ===")
-        # Read the HTML content from the saved file
        with open(html_file_path, 'r', encoding='utf-8') as f:
            raw_html_content = f.read()
-        
-        # Prefix the raw HTML content with 'raw:'
        raw_html_url = f"raw:{raw_html_content}"
-        
-        # Crawl using the raw HTML content
-        raw_result = await crawler.arun(url=raw_html_url, bypass_cache=True)
-        
-        # Check if crawling was successful
+        raw_config = CrawlerRunConfig(bypass_cache=True)
+        raw_result = await crawler.arun(url=raw_html_url, config=raw_config)
+
        if not raw_result.success:
            print(f"Failed to crawl raw HTML content: {raw_result.error_message}")
            return
-        
-        # Store the length of the generated markdown from raw HTML
+
        raw_crawl_length = len(raw_result.markdown)
-        print(f"Length of markdown from raw HTML crawl: {raw_crawl_length}")
-        
-        # Compare the lengths
-        assert web_crawl_length == raw_crawl_length, (
-            f"Markdown length mismatch: Web crawl ({web_crawl_length}) != Raw HTML crawl ({raw_crawl_length})"
-        )
-        print("✅ Markdown length matches between web crawl and raw HTML crawl.\n")
-        
+        assert web_crawl_length == raw_crawl_length, "Markdown length mismatch"
+        print("✅ Markdown length matches between web and raw HTML crawl.\n")
+
        print("All tests passed successfully!")
-        
-    # Clean up by removing the saved HTML file
    if html_file_path.exists():
        os.remove(html_file_path)
-        print(f"Removed the saved HTML file: {html_file_path}")

-# Run the main function
 if __name__ == "__main__":
    asyncio.run(main())
 ```

-### **How It Works**
-
-1. **Step 1: Crawl the Web URL**
-   - Crawls `https://en.wikipedia.org/wiki/apple`.
-   - Saves the HTML content to `apple.html`.
-   - Records the length of the generated markdown.
-
-2. **Step 2: Crawl from the Local HTML File**
-   - Uses the `file://` prefix to crawl `apple.html`.
-   - Ensures the markdown length matches the original web crawl.
-
-3. **Step 3: Crawl Using Raw HTML Content**
-   - Reads the HTML from `apple.html`.
-   - Prefixes it with `raw:` and crawls.
-   - Verifies the markdown length matches the previous results.
-
-4. **Cleanup**
-   - Deletes the `apple.html` file after testing.
-
-### **Running the Example**
-
-1. **Save the Script:**
-   - Save the above code as `test_crawl4ai.py` in your project directory.
-
-2. **Execute the Script:**
-   - Run the script using:
-     ```bash
-     python test_crawl4ai.py
-     ```
-
-3. **Observe the Output:**
-   - The script will print logs detailing each step.
-   - Assertions ensure consistency across different crawling methods.
-   - Upon success, it confirms that all markdown lengths match.
-
 ---

-## Conclusion
-
-With the new prefix-based input handling in **Crawl4AI**, you can effortlessly crawl web URLs, local HTML files, and raw HTML strings using a unified `url` parameter. This enhancement simplifies the API usage and provides greater flexibility for diverse crawling scenarios.
+# Conclusion

+With the unified `url` parameter and prefix-based handling in **Crawl4AI**, you can seamlessly handle web URLs, local HTML files, and raw HTML content. Use `CrawlerRunConfig` for flexible and consistent configuration in all scenarios.
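
Reviewer note: the updated guide relies on three input prefixes (`http://` or `https://`, `file://`, and `raw:`) passed through a single `url` parameter. The sketch below illustrates one way such prefix dispatch could work. It is not Crawl4AI's actual implementation; `classify_input` is a hypothetical helper introduced here only for illustration, and it uses nothing beyond the Python standard library.

```python
from typing import Tuple


def classify_input(url: str) -> Tuple[str, str]:
    """Hypothetical helper (not part of Crawl4AI).

    Splits a unified `url` argument into (kind, payload) based on its prefix.
    """
    if url.startswith(("http://", "https://")):
        # Live web page: the payload is the URL itself.
        return "web", url
    if url.startswith("file://"):
        # Local file: strip the scheme to recover the filesystem path.
        return "file", url[len("file://"):]
    if url.startswith("raw:"):
        # Raw HTML: everything after the prefix is the markup.
        return "raw", url[len("raw:"):]
    raise ValueError(f"Unsupported input prefix: {url[:20]!r}")


if __name__ == "__main__":
    examples = (
        "https://en.wikipedia.org/wiki/apple",
        "file:///path/to/apple.html",
        "raw:<html><body><h1>Hello, World!</h1></body></html>",
    )
    for example in examples:
        kind, payload = classify_input(example)
        print(f"{kind}: {payload}")
```

Checking the web prefixes first and rejecting anything unrecognized keeps the three cases from the guide mutually exclusive, so a malformed input fails loudly instead of being treated as a URL.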