Enhance Crawl4AI with new features and documentation

- Fix crawler text mode for improved performance; cover missing `srcset` and `data_srcset` attributes in image tags.
  - Introduced Managed Browsers for enhanced crawling experience.
  - Updated documentation for clearer navigation on configuration.
  - Changed 'text_only' to 'text_mode' in configuration and methods.
  - Improved performance and relevance in content filtering strategies.
This commit is contained in:
UncleCode
2024-12-19 21:02:29 +08:00
parent 393bb911c0
commit 849765712f
23 changed files with 1825 additions and 1721 deletions

View File

@@ -2,31 +2,19 @@
This guide will walk you through using the Crawl4AI library to crawl web pages, local HTML files, and raw HTML strings. We'll demonstrate these capabilities using a Wikipedia page as an example.
## Table of Contents
- [Prefix-Based Input Handling in Crawl4AI](#prefix-based-input-handling-in-crawl4ai)
- [Table of Contents](#table-of-contents)
- [Crawling a Web URL](#crawling-a-web-url)
- [Crawling a Local HTML File](#crawling-a-local-html-file)
- [Crawling Raw HTML Content](#crawling-raw-html-content)
- [Complete Example](#complete-example)
- [**How It Works**](#how-it-works)
- [**Running the Example**](#running-the-example)
- [Conclusion](#conclusion)
## Crawling a Web URL
---
### Crawling a Web URL
To crawl a live web page, provide the URL starting with `http://` or `https://`.
To crawl a live web page, provide the URL starting with `http://` or `https://`, using a `CrawlerRunConfig` object:
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
async def crawl_web():
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(url="https://en.wikipedia.org/wiki/apple", bypass_cache=True)
config = CrawlerRunConfig(bypass_cache=True)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url="https://en.wikipedia.org/wiki/apple", config=config)
if result.success:
print("Markdown Content:")
print(result.markdown)
@@ -36,20 +24,22 @@ async def crawl_web():
asyncio.run(crawl_web())
```
### Crawling a Local HTML File
## Crawling a Local HTML File
To crawl a local HTML file, prefix the file path with `file://`.
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
async def crawl_local_file():
local_file_path = "/path/to/apple.html" # Replace with your file path
file_url = f"file://{local_file_path}"
config = CrawlerRunConfig(bypass_cache=True)
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(url=file_url, bypass_cache=True)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url=file_url, config=config)
if result.success:
print("Markdown Content from Local File:")
print(result.markdown)
@@ -59,20 +49,22 @@ async def crawl_local_file():
asyncio.run(crawl_local_file())
```
### Crawling Raw HTML Content
## Crawling Raw HTML Content
To crawl raw HTML content, prefix the HTML string with `raw:`.
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
async def crawl_raw_html():
raw_html = "<html><body><h1>Hello, World!</h1></body></html>"
raw_html_url = f"raw:{raw_html}"
config = CrawlerRunConfig(bypass_cache=True)
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(url=raw_html_url, bypass_cache=True)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(url=raw_html_url, config=config)
if result.success:
print("Markdown Content from Raw HTML:")
print(result.markdown)
@@ -84,152 +76,83 @@ asyncio.run(crawl_raw_html())
---
## Complete Example
# Complete Example
Below is a comprehensive script that:
1. **Crawls the Wikipedia page for "Apple".**
2. **Saves the HTML content to a local file (`apple.html`).**
3. **Crawls the local HTML file and verifies the markdown length matches the original crawl.**
4. **Crawls the raw HTML content from the saved file and verifies consistency.**
1. Crawls the Wikipedia page for "Apple."
2. Saves the HTML content to a local file (`apple.html`).
3. Crawls the local HTML file and verifies the markdown length matches the original crawl.
4. Crawls the raw HTML content from the saved file and verifies consistency.
```python
import os
import sys
import asyncio
from pathlib import Path
# Adjust the parent directory to include the crawl4ai module
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(parent_dir)
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
async def main():
# Define the URL to crawl
wikipedia_url = "https://en.wikipedia.org/wiki/apple"
# Define the path to save the HTML file
# Save the file in the same directory as the script
script_dir = Path(__file__).parent
html_file_path = script_dir / "apple.html"
async with AsyncWebCrawler(verbose=True) as crawler:
async with AsyncWebCrawler() as crawler:
# Step 1: Crawl the Web URL
print("\n=== Step 1: Crawling the Wikipedia URL ===")
# Crawl the Wikipedia URL
result = await crawler.arun(url=wikipedia_url, bypass_cache=True)
# Check if crawling was successful
web_config = CrawlerRunConfig(bypass_cache=True)
result = await crawler.arun(url=wikipedia_url, config=web_config)
if not result.success:
print(f"Failed to crawl {wikipedia_url}: {result.error_message}")
return
# Save the HTML content to a local file
with open(html_file_path, 'w', encoding='utf-8') as f:
f.write(result.html)
print(f"Saved HTML content to {html_file_path}")
# Store the length of the generated markdown
web_crawl_length = len(result.markdown)
print(f"Length of markdown from web crawl: {web_crawl_length}\n")
# Step 2: Crawl from the Local HTML File
print("=== Step 2: Crawling from the Local HTML File ===")
# Construct the file URL with 'file://' prefix
file_url = f"file://{html_file_path.resolve()}"
# Crawl the local HTML file
local_result = await crawler.arun(url=file_url, bypass_cache=True)
# Check if crawling was successful
file_config = CrawlerRunConfig(bypass_cache=True)
local_result = await crawler.arun(url=file_url, config=file_config)
if not local_result.success:
print(f"Failed to crawl local file {file_url}: {local_result.error_message}")
return
# Store the length of the generated markdown from local file
local_crawl_length = len(local_result.markdown)
print(f"Length of markdown from local file crawl: {local_crawl_length}")
# Compare the lengths
assert web_crawl_length == local_crawl_length, (
f"Markdown length mismatch: Web crawl ({web_crawl_length}) != Local file crawl ({local_crawl_length})"
)
print("✅ Markdown length matches between web crawl and local file crawl.\n")
assert web_crawl_length == local_crawl_length, "Markdown length mismatch"
print("✅ Markdown length matches between web and local file crawl.\n")
# Step 3: Crawl Using Raw HTML Content
print("=== Step 3: Crawling Using Raw HTML Content ===")
# Read the HTML content from the saved file
with open(html_file_path, 'r', encoding='utf-8') as f:
raw_html_content = f.read()
# Prefix the raw HTML content with 'raw:'
raw_html_url = f"raw:{raw_html_content}"
# Crawl using the raw HTML content
raw_result = await crawler.arun(url=raw_html_url, bypass_cache=True)
# Check if crawling was successful
raw_config = CrawlerRunConfig(bypass_cache=True)
raw_result = await crawler.arun(url=raw_html_url, config=raw_config)
if not raw_result.success:
print(f"Failed to crawl raw HTML content: {raw_result.error_message}")
return
# Store the length of the generated markdown from raw HTML
raw_crawl_length = len(raw_result.markdown)
print(f"Length of markdown from raw HTML crawl: {raw_crawl_length}")
# Compare the lengths
assert web_crawl_length == raw_crawl_length, (
f"Markdown length mismatch: Web crawl ({web_crawl_length}) != Raw HTML crawl ({raw_crawl_length})"
)
print("✅ Markdown length matches between web crawl and raw HTML crawl.\n")
assert web_crawl_length == raw_crawl_length, "Markdown length mismatch"
print("✅ Markdown length matches between web and raw HTML crawl.\n")
print("All tests passed successfully!")
# Clean up by removing the saved HTML file
if html_file_path.exists():
os.remove(html_file_path)
print(f"Removed the saved HTML file: {html_file_path}")
# Run the main function
if __name__ == "__main__":
asyncio.run(main())
```
### **How It Works**
1. **Step 1: Crawl the Web URL**
- Crawls `https://en.wikipedia.org/wiki/apple`.
- Saves the HTML content to `apple.html`.
- Records the length of the generated markdown.
2. **Step 2: Crawl from the Local HTML File**
- Uses the `file://` prefix to crawl `apple.html`.
- Ensures the markdown length matches the original web crawl.
3. **Step 3: Crawl Using Raw HTML Content**
- Reads the HTML from `apple.html`.
- Prefixes it with `raw:` and crawls.
- Verifies the markdown length matches the previous results.
4. **Cleanup**
- Deletes the `apple.html` file after testing.
### **Running the Example**
1. **Save the Script:**
- Save the above code as `test_crawl4ai.py` in your project directory.
2. **Execute the Script:**
- Run the script using:
```bash
python test_crawl4ai.py
```
3. **Observe the Output:**
- The script will print logs detailing each step.
- Assertions ensure consistency across different crawling methods.
- Upon success, it confirms that all markdown lengths match.
---
## Conclusion
With the new prefix-based input handling in **Crawl4AI**, you can effortlessly crawl web URLs, local HTML files, and raw HTML strings using a unified `url` parameter. This enhancement simplifies the API usage and provides greater flexibility for diverse crawling scenarios.
# Conclusion
With the unified `url` parameter and prefix-based handling in **Crawl4AI**, you can seamlessly handle web URLs, local HTML files, and raw HTML content. Use `CrawlerRunConfig` for flexible and consistent configuration in all scenarios.