Enhance Crawl4AI with new features and documentation

- Fix crawler text mode for improved performance; cover missing `srcset` and `data_srcset` attributes in image tags.
  - Introduced Managed Browsers for enhanced crawling experience.
  - Updated documentation for clearer navigation on configuration.
  - Changed 'text_only' to 'text_mode' in configuration and methods.
  - Improved performance and relevance in content filtering strategies.
UncleCode
2024-12-19 21:02:29 +08:00
parent 393bb911c0
commit 849765712f
23 changed files with 1825 additions and 1721 deletions


# Quick Start Guide 🚀
Welcome to the Crawl4AI Quickstart Guide! In this tutorial, we'll walk you through the basic usage of Crawl4AI, covering everything from initial setup to advanced features like chunking and extraction strategies, using asynchronous programming. Let's dive in! 🌟
---
## Getting Started 🛠️
First, import the necessary modules, set up a `BrowserConfig`, and create an `AsyncWebCrawler` instance. The async context manager handles the crawler's setup and teardown for you.
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig

async def main():
    browser_config = BrowserConfig(verbose=True)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Add your crawling logic here
        pass

if __name__ == "__main__":
    asyncio.run(main())
```
---
### Basic Usage
Provide a URL and let Crawl4AI do the work!
```python
from crawl4ai.async_configs import CrawlerRunConfig

async def main():
    browser_config = BrowserConfig(verbose=True)
    crawl_config = CrawlerRunConfig(url="https://www.nbcnews.com/business")
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(config=crawl_config)
        print(f"Basic crawl result: {result.markdown[:500]}")  # Print first 500 characters

if __name__ == "__main__":
    asyncio.run(main())
```
---
### Taking Screenshots 📸
Capture and save webpage screenshots with `CrawlerRunConfig`:
```python
from crawl4ai.async_configs import CacheMode

async def capture_and_save_screenshot(url: str, output_path: str):
    browser_config = BrowserConfig(verbose=True)
    crawl_config = CrawlerRunConfig(
        url=url,
        screenshot=True,
        cache_mode=CacheMode.BYPASS
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(config=crawl_config)

        if result.success and result.screenshot:
            import base64
            # Decode the base64-encoded screenshot and write it to disk
            with open(output_path, "wb") as f:
                f.write(base64.b64decode(result.screenshot))
            print(f"Screenshot saved to {output_path}")
        else:
            print("Failed to capture screenshot")
```
---
### Browser Selection 🌐
Crawl4AI supports multiple browser engines. Choose one with `BrowserConfig`:
```python
from crawl4ai.async_configs import BrowserConfig

# Use Firefox
firefox_config = BrowserConfig(browser_type="firefox", verbose=True, headless=True)
async with AsyncWebCrawler(config=firefox_config) as crawler:
    result = await crawler.arun(config=CrawlerRunConfig(url="https://www.example.com"))

# Use WebKit
webkit_config = BrowserConfig(browser_type="webkit", verbose=True, headless=True)
async with AsyncWebCrawler(config=webkit_config) as crawler:
    result = await crawler.arun(config=CrawlerRunConfig(url="https://www.example.com"))

# Use Chromium (default)
chromium_config = BrowserConfig(verbose=True, headless=True)
async with AsyncWebCrawler(config=chromium_config) as crawler:
    result = await crawler.arun(config=CrawlerRunConfig(url="https://www.example.com"))
```
---
### User Simulation 🎭
Simulate real user behavior to avoid bot detection:
```python
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

browser_config = BrowserConfig(verbose=True, headless=True)
crawl_config = CrawlerRunConfig(
    url="YOUR-URL-HERE",
    cache_mode=CacheMode.BYPASS,
    simulate_user=True,       # Random mouse movements and clicks
    override_navigator=True   # Makes the browser appear like a real user
)

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(config=crawl_config)
```
---
### Understanding Parameters 🧠
By default, Crawl4AI caches the results of your crawls, so repeat crawls of the same URL are much faster. Use `CacheMode.BYPASS` to force a fresh crawl:
```python
async def main():
    browser_config = BrowserConfig(verbose=True)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # First crawl (result is cached)
        result1 = await crawler.arun(config=CrawlerRunConfig(url="https://www.nbcnews.com/business"))
        print(f"First crawl result: {result1.markdown[:100]}...")

        # Force a fresh crawl, bypassing the cache
        result2 = await crawler.arun(
            config=CrawlerRunConfig(url="https://www.nbcnews.com/business", cache_mode=CacheMode.BYPASS)
        )
        print(f"Second crawl result: {result2.markdown[:100]}...")

if __name__ == "__main__":
    asyncio.run(main())
```
---
### Adding a Chunking Strategy 🧩
Split content into chunks with `RegexChunking`, which splits the text on a given regex pattern:
```python
from crawl4ai.chunking_strategy import RegexChunking

async def main():
    browser_config = BrowserConfig(verbose=True)
    crawl_config = CrawlerRunConfig(
        url="https://www.nbcnews.com/business",
        chunking_strategy=RegexChunking(patterns=["\n\n"])
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(config=crawl_config)
        print(f"RegexChunking result: {result.extracted_content[:200]}...")

if __name__ == "__main__":
    asyncio.run(main())
```
---
## Advanced Features and Configurations
### Using LLMExtractionStrategy with Different Providers 🤖
Crawl4AI supports multiple LLM providers for extraction:
```python
import os
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

# extract_structured_data_using_llm is a helper that runs the crawl with an
# LLMExtractionStrategy for the given provider (a sketch is shown below).

# OpenAI
await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"))

# Hugging Face
await extract_structured_data_using_llm(
    "huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct",
    os.getenv("HUGGINGFACE_API_KEY")
)

# Ollama
await extract_structured_data_using_llm("ollama/llama3.2")

# With custom headers
custom_headers = {
    "Authorization": "Bearer your-custom-token",
    "X-Custom-Header": "Some-Value"
}
await extract_structured_data_using_llm(extra_headers=custom_headers)
```
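The calls above rely on an `extract_structured_data_using_llm` helper that is not shown in this excerpt. Below is a minimal sketch of what such a helper could look like; the signature, the pricing-page URL, and the `extra_args` handling are assumptions for illustration, not the exact helper from the full quickstart example.

```python
# Hypothetical sketch of the helper used above (names and defaults are assumptions).
import os
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy

async def extract_structured_data_using_llm(
    provider: str = "openai/gpt-4o",
    api_token: str = None,
    extra_headers: dict = None,
):
    # OpenAIModelFee is the Pydantic model defined in the previous block.
    extraction_strategy = LLMExtractionStrategy(
        provider=provider,
        api_token=api_token,
        schema=OpenAIModelFee.model_json_schema(),
        extraction_type="schema",
        instruction="Extract all model names with their input and output token fees.",
        extra_args={"extra_headers": extra_headers} if extra_headers else {},
    )
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://openai.com/api/pricing/",  # assumed example target
            word_count_threshold=1,
            extraction_strategy=extraction_strategy,
            cache_mode=CacheMode.BYPASS,
        )
        print(result.extracted_content)
```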
### Knowledge Graph Generation 🕸️
Generate knowledge graphs from web content:
```python
import os
from typing import List
from pydantic import BaseModel

class Entity(BaseModel):
    name: str
    description: str

class Relationship(BaseModel):
    entity1: Entity
    entity2: Entity
    description: str
    relation_type: str

class KnowledgeGraph(BaseModel):
    entities: List[Entity]
    relationships: List[Relationship]

extraction_strategy = LLMExtractionStrategy(
    provider='openai/gpt-4o-mini',
    api_token=os.getenv('OPENAI_API_KEY'),
    schema=KnowledgeGraph.model_json_schema(),
    extraction_type="schema",
    instruction="Extract entities and relationships from the given text."
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://paulgraham.com/love.html",
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=extraction_strategy
    )
```
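The extracted content comes back as a JSON string. A short, assumed post-processing step can load it back into the `KnowledgeGraph` model defined above; whether the payload is a single object or a list of blocks depends on the provider:

```python
# Sketch: validate the LLM output against the KnowledgeGraph schema (assumed shape).
import json

data = json.loads(result.extracted_content)
if isinstance(data, list):  # some providers return a list of extracted blocks
    data = data[0]
graph = KnowledgeGraph.model_validate(data)
print(f"Extracted {len(graph.entities)} entities and {len(graph.relationships)} relationships")
```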
### Advanced Session-Based Crawling with Dynamic Content 🔄
For modern web applications with dynamic content loading, here's how to handle pagination and content updates:
```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def crawl_dynamic_content():
    async with AsyncWebCrawler(verbose=True) as crawler:
        url = "https://github.com/microsoft/TypeScript/commits/main"
        session_id = "typescript_commits_session"

        js_next_page = """
        const button = document.querySelector('a[data-testid="pagination-next-button"]');
        if (button) button.click();
        """

        wait_for = """() => {
            const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
            if (commits.length === 0) return false;
            const firstCommit = commits[0].textContent.trim();
            return firstCommit !== window.firstCommit;
        }"""

        schema = {
            "name": "Commit Extractor",
            "baseSelector": "li.Box-sc-g0xbh4-0",
            "fields": [
                {
                    "name": "title",
                    "selector": "h4.markdown-title",
                    "type": "text",
                    "transform": "strip",
                },
            ],
        }
        extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

        for page in range(3):  # Crawl 3 pages
            result = await crawler.arun(
                url=url,
                session_id=session_id,
                css_selector="li.Box-sc-g0xbh4-0",
                extraction_strategy=extraction_strategy,
                js_code=js_next_page if page > 0 else None,
                wait_for=wait_for if page > 0 else None,
                js_only=page > 0,
                cache_mode=CacheMode.BYPASS,
                headless=False,
            )

        await crawler.crawler_strategy.kill_session(session_id)
```
### Handling Overlays and Fitting Content 📏
Remove overlay elements and fit content appropriately:
```python
async with AsyncWebCrawler(headless=False) as crawler:
    result = await crawler.arun(
        url="your-url-here",
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=10,
        remove_overlay_elements=True,
        screenshot=True
    )
```
## Performance Comparison 🏎️
Crawl4AI offers impressive performance compared to other solutions:
```python
import os
import time
from firecrawl import FirecrawlApp

# Firecrawl comparison
app = FirecrawlApp(api_key=os.environ['FIRECRAWL_API_KEY'])
start = time.time()
scrape_status = app.scrape_url(
    'https://www.nbcnews.com/business',
    params={'formats': ['markdown', 'html']}
)
end = time.time()

# Crawl4AI comparison
async with AsyncWebCrawler() as crawler:
    start = time.time()
    result = await crawler.arun(
        url="https://www.nbcnews.com/business",
        word_count_threshold=0,
        cache_mode=CacheMode.BYPASS,
        verbose=False,
    )
    end = time.time()
```
Note: Performance comparisons should be conducted in environments with stable and fast internet connections for accurate results.
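A small reporting helper (a convenience sketch, not part of the guide above) can be called right after each timed block so the two runs print comparable one-line summaries:

```python
import time

def report_timing(label: str, start: float, end: float) -> None:
    """Print a one-line elapsed-time report for a crawl run."""
    print(f"{label}: {end - start:.2f} seconds")

# After each block above:
# report_timing("Firecrawl", start, end)
# report_timing("Crawl4AI", start, end)
```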
## Congratulations! 🎉
You've made it through the updated Crawl4AI Quickstart Guide! Now you're equipped with even more powerful features to crawl the web asynchronously like a pro! 🕸️
Happy crawling! 🚀
When adapting the advanced examples above (LLM strategies, knowledge graphs, pagination handling), keep the code aligned with the `BrowserConfig` and `CrawlerRunConfig` pattern shown earlier in this guide.
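As a sketch of that alignment, here is how the knowledge-graph crawl above might look with the config objects; whether `CrawlerRunConfig` accepts `extraction_strategy` in exactly this way is an assumption to verify against your installed version:

```python
# Hypothetical config-based version of the knowledge-graph crawl above.
# Assumes CrawlerRunConfig forwards url, cache_mode, and extraction_strategy to the crawl.
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

browser_config = BrowserConfig(verbose=True)
crawl_config = CrawlerRunConfig(
    url="https://paulgraham.com/love.html",
    cache_mode=CacheMode.BYPASS,
    extraction_strategy=extraction_strategy,  # the LLMExtractionStrategy built earlier
)

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(config=crawl_config)
```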