Enhance Crawl4AI with new features and documentation
- Fix crawler text mode for improved performance; cover missing `srcset` and `data_srcset` attributes in image tags.
- Introduced Managed Browsers for an enhanced crawling experience.
- Updated documentation for clearer navigation on configuration.
- Changed `text_only` to `text_mode` in configuration and methods.
- Improved performance and relevance in content filtering strategies.
# Quick Start Guide 🚀

Welcome to the Crawl4AI Quickstart Guide! In this tutorial, we'll walk you through the basic usage of Crawl4AI, covering everything from initial setup to advanced features like chunking and extraction strategies, using asynchronous programming. Let's dive in! 🌟

---

## Getting Started 🛠️

Set up your environment with `BrowserConfig` and create an `AsyncWebCrawler` instance.

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig

async def main():
    browser_config = BrowserConfig(verbose=True)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Add your crawling logic here
        pass

if __name__ == "__main__":
    asyncio.run(main())
```

---

### Basic Usage

Provide a URL and let Crawl4AI do the work!

```python
from crawl4ai.async_configs import CrawlerRunConfig

async def main():
    browser_config = BrowserConfig(verbose=True)
    crawl_config = CrawlerRunConfig(url="https://www.nbcnews.com/business")
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(config=crawl_config)
        print(f"Basic crawl result: {result.markdown[:500]}")  # Print first 500 characters

if __name__ == "__main__":
    asyncio.run(main())
```
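
The returned result object exposes more than just markdown. A minimal sketch of other commonly available fields (names assumed from typical Crawl4AI result objects; verify against your installed version):

```python
# Continuing inside main() from the example above:
print(f"Success: {result.success}")        # Whether the crawl completed
print(f"HTML length: {len(result.html)}")  # Raw page HTML
print(f"Internal links: {len(result.links.get('internal', []))}")  # Links collected during the crawl
```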

---

### Taking Screenshots 📸

Capture and save webpage screenshots with `CrawlerRunConfig`:

```python
from crawl4ai.async_configs import CacheMode

async def capture_and_save_screenshot(url: str, output_path: str):
    browser_config = BrowserConfig(verbose=True)
    crawl_config = CrawlerRunConfig(
        url=url,
        screenshot=True,
        cache_mode=CacheMode.BYPASS
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(config=crawl_config)

        if result.success and result.screenshot:
            import base64
            # Screenshot data is base64-encoded; decode it and write it to disk
            with open(output_path, "wb") as f:
                f.write(base64.b64decode(result.screenshot))
            print(f"Screenshot saved to {output_path}")
        else:
            print("Failed to capture screenshot")
```
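
A quick way to exercise the helper above (the URL and output path are placeholders):

```python
if __name__ == "__main__":
    # Example invocation of the helper defined above
    asyncio.run(capture_and_save_screenshot("https://www.example.com", "example_screenshot.png"))
```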

---

### Browser Selection 🌐

Choose from multiple browser engines using `BrowserConfig`:

```python
from crawl4ai.async_configs import BrowserConfig

# Use Firefox
firefox_config = BrowserConfig(browser_type="firefox", verbose=True, headless=True)
async with AsyncWebCrawler(config=firefox_config) as crawler:
    result = await crawler.arun(config=CrawlerRunConfig(url="https://www.example.com"))

# Use WebKit
webkit_config = BrowserConfig(browser_type="webkit", verbose=True, headless=True)
async with AsyncWebCrawler(config=webkit_config) as crawler:
    result = await crawler.arun(config=CrawlerRunConfig(url="https://www.example.com"))

# Use Chromium (default)
chromium_config = BrowserConfig(verbose=True, headless=True)
async with AsyncWebCrawler(config=chromium_config) as crawler:
    result = await crawler.arun(config=CrawlerRunConfig(url="https://www.example.com"))
```

---

### User Simulation 🎭

Simulate real user behavior to bypass detection:

```python
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

browser_config = BrowserConfig(verbose=True, headless=True)
crawl_config = CrawlerRunConfig(
    url="YOUR-URL-HERE",
    cache_mode=CacheMode.BYPASS,
    simulate_user=True,       # Random mouse movements and clicks
    override_navigator=True   # Makes the browser appear like a real user
)
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(config=crawl_config)
```
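
Upstream Crawl4AI documentation also describes a "magic mode" that bundles several anti-detection features behind one flag. A hedged sketch, assuming a `magic` parameter is available in your version (verify before relying on it):

```python
crawl_config = CrawlerRunConfig(
    url="YOUR-URL-HERE",
    magic=True,  # Assumed shorthand enabling several anti-detection features at once
)
```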

---

### Understanding Parameters 🧠

By default, Crawl4AI caches the results of your crawls, so subsequent crawls of the same URL are much faster. Let's explore caching and forcing fresh crawls:

```python
async def main():
    browser_config = BrowserConfig(verbose=True)

    async with AsyncWebCrawler(config=browser_config) as crawler:
        # First crawl (caches the result)
        result1 = await crawler.arun(config=CrawlerRunConfig(url="https://www.nbcnews.com/business"))
        print(f"First crawl result: {result1.markdown[:100]}...")

        # Force fresh crawl
        result2 = await crawler.arun(
            config=CrawlerRunConfig(url="https://www.nbcnews.com/business", cache_mode=CacheMode.BYPASS)
        )
        print(f"Second crawl result: {result2.markdown[:100]}...")

if __name__ == "__main__":
    asyncio.run(main())
```
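
`BYPASS` is only one of the available modes. A sketch of how you might select others (mode names assumed from recent Crawl4AI releases; check your version's `CacheMode` enum):

```python
from crawl4ai import CacheMode

# Assumed common modes (verify in your installed version):
#   CacheMode.ENABLED  - normal caching: read if present, write after fetching
#   CacheMode.DISABLED - no cache reads or writes
#   CacheMode.BYPASS   - ignore any cached copy and fetch fresh content
config = CrawlerRunConfig(url="https://www.example.com", cache_mode=CacheMode.ENABLED)
```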

---

### Adding a Chunking Strategy 🧩

Split content into chunks using `RegexChunking`, which splits text based on a given regex pattern:

```python
from crawl4ai.chunking_strategy import RegexChunking

async def main():
    browser_config = BrowserConfig(verbose=True)
    crawl_config = CrawlerRunConfig(
        url="https://www.nbcnews.com/business",
        chunking_strategy=RegexChunking(patterns=["\n\n"])
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(config=crawl_config)
        print(f"RegexChunking result: {result.extracted_content[:200]}...")

if __name__ == "__main__":
    asyncio.run(main())
```
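
If regex-based splitting is too coarse, sentence-level chunking may suit better. A sketch assuming `NlpSentenceChunking` is available from the same module (verify in your installed version):

```python
from crawl4ai.chunking_strategy import NlpSentenceChunking

crawl_config = CrawlerRunConfig(
    url="https://www.nbcnews.com/business",
    chunking_strategy=NlpSentenceChunking()  # Splits text on sentence boundaries
)
```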

---

### Advanced Features and Configurations

Crawl4AI supports multiple LLM providers for extraction via `LLMExtractionStrategy`:

```python
import os

from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

# OpenAI
await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"))

# Hugging Face
await extract_structured_data_using_llm(
    "huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct",
    os.getenv("HUGGINGFACE_API_KEY")
)

# Ollama
await extract_structured_data_using_llm("ollama/llama3.2")

# With custom headers
custom_headers = {
    "Authorization": "Bearer your-custom-token",
    "X-Custom-Header": "Some-Value"
}
await extract_structured_data_using_llm(extra_headers=custom_headers)
```
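
The calls above reference an `extract_structured_data_using_llm` helper that isn't shown in this guide. A minimal sketch of what such a helper might look like, following the config pattern used throughout (the signature, URL, and the `extra_args` placement are assumptions for illustration):

```python
async def extract_structured_data_using_llm(provider: str = "openai/gpt-4o",
                                            api_token: str = None,
                                            extra_headers: dict = None):
    # Hypothetical helper: crawl a pricing page and extract fees with the given provider
    browser_config = BrowserConfig(verbose=True)
    crawl_config = CrawlerRunConfig(
        url="https://openai.com/api/pricing/",
        extraction_strategy=LLMExtractionStrategy(
            provider=provider,
            api_token=api_token,
            schema=OpenAIModelFee.model_json_schema(),
            extraction_type="schema",
            instruction="Extract all model names with their input and output token fees.",
            # Passing headers through extra_args is an assumption; check the current API
            extra_args={"extra_headers": extra_headers} if extra_headers else {},
        ),
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(config=crawl_config)
        print(result.extracted_content)
```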

---

### Knowledge Graph Generation 🕸️

Generate knowledge graphs from web content:

```python
import os

from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel
from typing import List

class Entity(BaseModel):
    name: str
    description: str

class Relationship(BaseModel):
    entity1: Entity
    entity2: Entity
    description: str
    relation_type: str

class KnowledgeGraph(BaseModel):
    entities: List[Entity]
    relationships: List[Relationship]

extraction_strategy = LLMExtractionStrategy(
    provider='openai/gpt-4o-mini',
    api_token=os.getenv('OPENAI_API_KEY'),
    schema=KnowledgeGraph.model_json_schema(),
    extraction_type="schema",
    instruction="Extract entities and relationships from the given text."
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://paulgraham.com/love.html",
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=extraction_strategy
    )
```
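
You can then inspect what was extracted; `extracted_content` is typically a JSON string (a small sketch):

```python
import json

# extracted_content holds the strategy's output as a JSON string
graph = json.loads(result.extracted_content)
print(json.dumps(graph, indent=2)[:500])
```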

---

### Advanced Session-Based Crawling with Dynamic Content 🔄

For modern web applications with dynamic content loading, here's how to handle pagination and content updates:

```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def crawl_dynamic_content():
    async with AsyncWebCrawler(verbose=True) as crawler:
        url = "https://github.com/microsoft/TypeScript/commits/main"
        session_id = "typescript_commits_session"

        js_next_page = """
        const button = document.querySelector('a[data-testid="pagination-next-button"]');
        if (button) button.click();
        """

        wait_for = """() => {
            const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
            if (commits.length === 0) return false;
            const firstCommit = commits[0].textContent.trim();
            return firstCommit !== window.firstCommit;
        }"""

        schema = {
            "name": "Commit Extractor",
            "baseSelector": "li.Box-sc-g0xbh4-0",
            "fields": [
                {
                    "name": "title",
                    "selector": "h4.markdown-title",
                    "type": "text",
                    "transform": "strip",
                },
            ],
        }
        extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

        for page in range(3):  # Crawl 3 pages
            result = await crawler.arun(
                url=url,
                session_id=session_id,
                css_selector="li.Box-sc-g0xbh4-0",
                extraction_strategy=extraction_strategy,
                js_code=js_next_page if page > 0 else None,
                wait_for=wait_for if page > 0 else None,
                js_only=page > 0,
                cache_mode=CacheMode.BYPASS,
                headless=False,
            )

        await crawler.crawler_strategy.kill_session(session_id)
```

---

### Handling Overlays and Fitting Content 📏

Remove overlay elements and fit content appropriately:

```python
async with AsyncWebCrawler(headless=False) as crawler:
    result = await crawler.arun(
        url="your-url-here",
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=10,
        remove_overlay_elements=True,
        screenshot=True
    )
```

---

## Performance Comparison 🏎️

Crawl4AI offers impressive performance compared to other solutions:

```python
import os
import time

# Firecrawl comparison
from firecrawl import FirecrawlApp
app = FirecrawlApp(api_key=os.environ['FIRECRAWL_API_KEY'])
start = time.time()
scrape_status = app.scrape_url(
    'https://www.nbcnews.com/business',
    params={'formats': ['markdown', 'html']}
)
end = time.time()
print(f"Firecrawl took {end - start:.2f} seconds")

# Crawl4AI comparison
async with AsyncWebCrawler() as crawler:
    start = time.time()
    result = await crawler.arun(
        url="https://www.nbcnews.com/business",
        word_count_threshold=0,
        cache_mode=CacheMode.BYPASS,
        verbose=False,
    )
    end = time.time()
    print(f"Crawl4AI took {end - start:.2f} seconds")
```

Note: Performance comparisons should be conducted in environments with stable and fast internet connections for accurate results.

---

## Congratulations! 🎉

You've made it through the updated Crawl4AI Quickstart Guide! Now you're equipped with even more powerful features to crawl the web asynchronously like a pro! 🕸️

Happy crawling! 🚀

For advanced examples (LLM strategies, knowledge graphs, pagination handling), ensure all code aligns with the `BrowserConfig` and `CrawlerRunConfig` pattern shown above.