File: 10_file_download.md
================================================================================
# Download Handling in Crawl4AI
This guide explains how to use Crawl4AI to handle file downloads during crawling. You'll learn how to trigger downloads, specify download locations, and access downloaded files.
## Enabling Downloads
To enable downloads, set the `accept_downloads` parameter in the `BrowserConfig` object and pass it to the crawler.
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig

async def main():
    config = BrowserConfig(accept_downloads=True)  # Enable downloads globally
    async with AsyncWebCrawler(config=config) as crawler:
        pass  # ... your crawling logic ...

asyncio.run(main())
```
Or, enable it for a specific crawl by using `CrawlerRunConfig`:
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        config = CrawlerRunConfig(accept_downloads=True)
        result = await crawler.arun(url="https://example.com", config=config)
        # ...

asyncio.run(main())
```
## Specifying Download Location
Specify the download directory using the `downloads_path` attribute in the `BrowserConfig` object. If not provided, Crawl4AI defaults to creating a "downloads" directory inside the `.crawl4ai` folder in your home directory.
```python
import os
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig

downloads_path = os.path.join(os.getcwd(), "my_downloads")  # Custom download path
os.makedirs(downloads_path, exist_ok=True)

config = BrowserConfig(accept_downloads=True, downloads_path=downloads_path)

async def main():
    async with AsyncWebCrawler(config=config) as crawler:
        result = await crawler.arun(url="https://example.com")
        # ...

asyncio.run(main())
```
## Triggering Downloads
Downloads are typically triggered by user interactions on a web page, such as clicking a download button. Use `js_code` in `CrawlerRunConfig` to simulate these actions and `wait_for` to allow sufficient time for downloads to start.
```python
from crawl4ai.async_configs import CrawlerRunConfig
config = CrawlerRunConfig(
    js_code="""
        const downloadLink = document.querySelector('a[href$=".exe"]');
        if (downloadLink) {
            downloadLink.click();
        }
    """,
    wait_for=5  # Wait 5 seconds for the download to start
)
result = await crawler.arun(url="https://www.python.org/downloads/", config=config)
```
## Accessing Downloaded Files
The `downloaded_files` attribute of the `CrawlResult` object contains paths to downloaded files.
```python
if result.downloaded_files:
    print("Downloaded files:")
    for file_path in result.downloaded_files:
        print(f"- {file_path}")
        file_size = os.path.getsize(file_path)
        print(f"- File size: {file_size} bytes")
else:
    print("No files downloaded.")
```
## Example: Downloading Multiple Files
```python
import os
import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

async def download_multiple_files(url: str, download_path: str):
    config = BrowserConfig(accept_downloads=True, downloads_path=download_path)
    async with AsyncWebCrawler(config=config) as crawler:
        run_config = CrawlerRunConfig(
            js_code="""
                const downloadLinks = document.querySelectorAll('a[download]');
                for (const link of downloadLinks) {
                    link.click();
                    await new Promise(r => setTimeout(r, 2000));  // Delay between clicks
                }
            """,
            wait_for=10  # Wait for all downloads to start
        )
        result = await crawler.arun(url=url, config=run_config)
        if result.downloaded_files:
            print("Downloaded files:")
            for file in result.downloaded_files:
                print(f"- {file}")
        else:
            print("No files downloaded.")

# Usage
download_path = os.path.join(Path.home(), ".crawl4ai", "downloads")
os.makedirs(download_path, exist_ok=True)
asyncio.run(download_multiple_files("https://www.python.org/downloads/windows/", download_path))
```
## Important Considerations
- **Browser Context:** Downloads are managed within the browser context. Ensure `js_code` correctly targets the download triggers on the webpage.
- **Timing:** Use `wait_for` in `CrawlerRunConfig` to manage download timing.
- **Error Handling:** Handle errors to manage failed downloads or incorrect paths gracefully.
- **Security:** Scan downloaded files for potential security threats before use.
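The error-handling and security points above can be sketched as a small post-crawl check. The extension allow-list and size cap here are illustrative assumptions, not Crawl4AI defaults, and a real pipeline might add a proper malware scan:

```python
import os

# Illustrative allow-list and size cap -- tune these for your pipeline
ALLOWED_EXTENSIONS = {".pdf", ".csv", ".zip"}
MAX_SIZE_BYTES = 50 * 1024 * 1024  # 50 MB

def validate_download(file_path: str) -> bool:
    """Reject missing files, unexpected types, and oversized downloads."""
    if not os.path.isfile(file_path):
        return False  # download never completed or path is wrong
    ext = os.path.splitext(file_path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        return False  # unexpected file type
    if os.path.getsize(file_path) > MAX_SIZE_BYTES:
        return False  # suspiciously large
    return True
```

A check like this can run over `result.downloaded_files` before the files are passed downstream.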
File: 11_page_interaction.md
================================================================================
# Page Interaction
Crawl4AI provides powerful features for interacting with dynamic webpages, handling JavaScript execution, and managing page events.
## JavaScript Execution
### Basic Execution
```python
from crawl4ai.async_configs import CrawlerRunConfig

# Single JavaScript command
config = CrawlerRunConfig(
    js_code="window.scrollTo(0, document.body.scrollHeight);"
)
result = await crawler.arun(url="https://example.com", config=config)

# Multiple commands
js_commands = [
    "window.scrollTo(0, document.body.scrollHeight);",
    "document.querySelector('.load-more').click();",
    "document.querySelector('#consent-button').click();"
]
config = CrawlerRunConfig(js_code=js_commands)
result = await crawler.arun(url="https://example.com", config=config)
```
## Wait Conditions
### CSS-Based Waiting
Wait for elements to appear:
```python
config = CrawlerRunConfig(wait_for="css:.dynamic-content") # Wait for element with class 'dynamic-content'
result = await crawler.arun(url="https://example.com", config=config)
```
### JavaScript-Based Waiting
Wait for custom conditions:
```python
# Wait for number of elements
wait_condition = """() => {
    return document.querySelectorAll('.item').length > 10;
}"""
config = CrawlerRunConfig(wait_for=f"js:{wait_condition}")
result = await crawler.arun(url="https://example.com", config=config)

# Wait for dynamic content to load
wait_for_content = """() => {
    const content = document.querySelector('.content');
    return content && content.innerText.length > 100;
}"""
config = CrawlerRunConfig(wait_for=f"js:{wait_for_content}")
result = await crawler.arun(url="https://example.com", config=config)
```
## Handling Dynamic Content
### Load More Content
Handle infinite scroll or load more buttons:
```python
config = CrawlerRunConfig(
    js_code=[
        "window.scrollTo(0, document.body.scrollHeight);",  # Scroll to bottom
        "window.previousCount = document.querySelectorAll('.item').length;",  # Remember current count
        "const loadMore = document.querySelector('.load-more'); if(loadMore) loadMore.click();"  # Click load more
    ],
    wait_for="js:() => document.querySelectorAll('.item').length > window.previousCount"  # Wait for new content
)
result = await crawler.arun(url="https://example.com", config=config)
```
### Form Interaction
Handle forms and inputs:
```python
js_form_interaction = """
    document.querySelector('#search').value = 'search term';  // Fill form fields
    document.querySelector('form').submit();  // Submit the form
"""
config = CrawlerRunConfig(
    js_code=js_form_interaction,
    wait_for="css:.results"  # Wait for results to load
)
result = await crawler.arun(url="https://example.com", config=config)
```
## Timing Control
### Delays and Timeouts
Control timing of interactions:
```python
config = CrawlerRunConfig(
    page_timeout=60000,  # Page load timeout (ms)
    delay_before_return_html=2.0  # Wait before capturing content
)
result = await crawler.arun(url="https://example.com", config=config)
```
## Complex Interactions Example
Here's an example of handling a dynamic page with multiple interactions:
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig

async def crawl_dynamic_content():
    async with AsyncWebCrawler() as crawler:
        # Initial page load
        config = CrawlerRunConfig(
            js_code="document.querySelector('.cookie-accept')?.click();",  # Handle cookie consent
            wait_for="css:.main-content"
        )
        result = await crawler.arun(url="https://example.com", config=config)

        # Load more content
        session_id = "dynamic_session"  # Keep session for multiple interactions
        for page in range(3):  # Load 3 pages of content
            config = CrawlerRunConfig(
                session_id=session_id,
                js_code=[
                    "window.scrollTo(0, document.body.scrollHeight);",  # Scroll to bottom
                    "window.previousCount = document.querySelectorAll('.item').length;",  # Store item count
                    "document.querySelector('.load-more')?.click();"  # Click load more
                ],
                wait_for="""js:() => {
                    const currentCount = document.querySelectorAll('.item').length;
                    return currentCount > window.previousCount;
                }""",
                js_only=(page > 0)  # Execute JS without reloading the page on subsequent iterations
            )
            result = await crawler.arun(url="https://example.com", config=config)
            print(f"Page {page + 1} HTML length:", len(result.cleaned_html))

        # Clean up session
        await crawler.crawler_strategy.kill_session(session_id)
```
## Using with Extraction Strategies
Combine page interaction with structured extraction:
```python
from typing import List
from pydantic import BaseModel
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, LLMExtractionStrategy
from crawl4ai.async_configs import CrawlerRunConfig

# Pattern-based extraction after interaction
schema = {
    "name": "Dynamic Items",
    "baseSelector": ".item",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "description", "selector": ".desc", "type": "text"}
    ]
}
config = CrawlerRunConfig(
    js_code="window.scrollTo(0, document.body.scrollHeight);",
    wait_for="css:.item:nth-child(10)",  # Wait for 10 items
    extraction_strategy=JsonCssExtractionStrategy(schema)
)
result = await crawler.arun(url="https://example.com", config=config)

# Or use an LLM to analyze dynamic content
class ContentAnalysis(BaseModel):
    topics: List[str]
    summary: str

config = CrawlerRunConfig(
    js_code="document.querySelector('.show-more').click();",
    wait_for="css:.full-content",
    extraction_strategy=LLMExtractionStrategy(
        provider="ollama/nemotron",
        schema=ContentAnalysis.schema(),
        instruction="Analyze the full content"
    )
)
result = await crawler.arun(url="https://example.com", config=config)
```
File: 12_prefix_based_input.md
================================================================================
# Prefix-Based Input Handling in Crawl4AI
This guide will walk you through using the Crawl4AI library to crawl web pages, local HTML files, and raw HTML strings. We'll demonstrate these capabilities using a Wikipedia page as an example.
## Crawling a Web URL
To crawl a live web page, provide the URL starting with `http://` or `https://`, using a `CrawlerRunConfig` object:
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig

async def crawl_web():
    config = CrawlerRunConfig(bypass_cache=True)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://en.wikipedia.org/wiki/apple", config=config)
        if result.success:
            print("Markdown Content:")
            print(result.markdown)
        else:
            print(f"Failed to crawl: {result.error_message}")

asyncio.run(crawl_web())
```
## Crawling a Local HTML File
To crawl a local HTML file, prefix the file path with `file://`.
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig

async def crawl_local_file():
    local_file_path = "/path/to/apple.html"  # Replace with your file path
    file_url = f"file://{local_file_path}"
    config = CrawlerRunConfig(bypass_cache=True)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=file_url, config=config)
        if result.success:
            print("Markdown Content from Local File:")
            print(result.markdown)
        else:
            print(f"Failed to crawl local file: {result.error_message}")

asyncio.run(crawl_local_file())
```
## Crawling Raw HTML Content
To crawl raw HTML content, prefix the HTML string with `raw:`.
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig

async def crawl_raw_html():
    raw_html = "<html><body><h1>Hello, World!</h1></body></html>"
    raw_html_url = f"raw:{raw_html}"
    config = CrawlerRunConfig(bypass_cache=True)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=raw_html_url, config=config)
        if result.success:
            print("Markdown Content from Raw HTML:")
            print(result.markdown)
        else:
            print(f"Failed to crawl raw HTML: {result.error_message}")

asyncio.run(crawl_raw_html())
```
---
## Complete Example
Below is a comprehensive script that:
1. Crawls the Wikipedia page for "Apple."
2. Saves the HTML content to a local file (`apple.html`).
3. Crawls the local HTML file and verifies the markdown length matches the original crawl.
4. Crawls the raw HTML content from the saved file and verifies consistency.
```python
import os
import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig

async def main():
    wikipedia_url = "https://en.wikipedia.org/wiki/apple"
    script_dir = Path(__file__).parent
    html_file_path = script_dir / "apple.html"

    async with AsyncWebCrawler() as crawler:
        # Step 1: Crawl the Web URL
        print("\n=== Step 1: Crawling the Wikipedia URL ===")
        web_config = CrawlerRunConfig(bypass_cache=True)
        result = await crawler.arun(url=wikipedia_url, config=web_config)
        if not result.success:
            print(f"Failed to crawl {wikipedia_url}: {result.error_message}")
            return
        with open(html_file_path, 'w', encoding='utf-8') as f:
            f.write(result.html)
        web_crawl_length = len(result.markdown)
        print(f"Length of markdown from web crawl: {web_crawl_length}\n")

        # Step 2: Crawl from the Local HTML File
        print("=== Step 2: Crawling from the Local HTML File ===")
        file_url = f"file://{html_file_path.resolve()}"
        file_config = CrawlerRunConfig(bypass_cache=True)
        local_result = await crawler.arun(url=file_url, config=file_config)
        if not local_result.success:
            print(f"Failed to crawl local file {file_url}: {local_result.error_message}")
            return
        local_crawl_length = len(local_result.markdown)
        assert web_crawl_length == local_crawl_length, "Markdown length mismatch"
        print("✅ Markdown length matches between web and local file crawl.\n")

        # Step 3: Crawl Using Raw HTML Content
        print("=== Step 3: Crawling Using Raw HTML Content ===")
        with open(html_file_path, 'r', encoding='utf-8') as f:
            raw_html_content = f.read()
        raw_html_url = f"raw:{raw_html_content}"
        raw_config = CrawlerRunConfig(bypass_cache=True)
        raw_result = await crawler.arun(url=raw_html_url, config=raw_config)
        if not raw_result.success:
            print(f"Failed to crawl raw HTML content: {raw_result.error_message}")
            return
        raw_crawl_length = len(raw_result.markdown)
        assert web_crawl_length == raw_crawl_length, "Markdown length mismatch"
        print("✅ Markdown length matches between web and raw HTML crawl.\n")

    print("All tests passed successfully!")
    if html_file_path.exists():
        os.remove(html_file_path)

if __name__ == "__main__":
    asyncio.run(main())
```
---
## Conclusion
With the unified `url` parameter and prefix-based handling in **Crawl4AI**, you can seamlessly handle web URLs, local HTML files, and raw HTML content. Use `CrawlerRunConfig` for flexible and consistent configuration in all scenarios.
File: 13_hooks_auth.md
================================================================================
# Hooks & Auth for AsyncWebCrawler
Crawl4AI's `AsyncWebCrawler` lets you customize the crawler's behavior with hooks: asynchronous functions called at specific points in the crawling process, where you can modify the crawler's behavior or perform additional actions. This guide demonstrates how to use hooks, including the `on_page_context_created` hook, together with `BrowserConfig` and `CrawlerRunConfig`.
In this example, we'll:
1. Configure the browser and set up authentication when it's created.
2. Apply custom routing and initial actions when the page context is created.
3. Add custom headers before navigating to the URL.
4. Log the current URL after navigation.
5. Perform actions after JavaScript execution.
6. Log the length of the HTML before returning it.
## Hook Definitions
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from playwright.async_api import Page, Browser, BrowserContext

async def log_routing(route):
    # Example: block loading images
    if route.request.resource_type == "image":
        print(f"[HOOK] Blocking image request: {route.request.url}")
        await route.abort()
    else:
        await route.continue_()

async def on_browser_created(browser: Browser, **kwargs):
    print("[HOOK] on_browser_created")
    # Example: set the viewport size and log in
    context = await browser.new_context(viewport={"width": 1920, "height": 1080})
    page = await context.new_page()
    await page.goto("https://example.com/login")
    await page.fill("input[name='username']", "testuser")
    await page.fill("input[name='password']", "password123")
    await page.click("button[type='submit']")
    await page.wait_for_selector("#welcome")
    await context.add_cookies([{"name": "auth_token", "value": "abc123", "url": "https://example.com"}])
    await page.close()
    await context.close()

async def on_page_context_created(context: BrowserContext, page: Page, **kwargs):
    print("[HOOK] on_page_context_created")
    await context.route("**", log_routing)

async def before_goto(page: Page, context: BrowserContext, **kwargs):
    print("[HOOK] before_goto")
    await page.set_extra_http_headers({"X-Test-Header": "test"})

async def after_goto(page: Page, context: BrowserContext, **kwargs):
    print("[HOOK] after_goto")
    print(f"Current URL: {page.url}")

async def on_execution_started(page: Page, context: BrowserContext, **kwargs):
    print("[HOOK] on_execution_started")
    await page.evaluate("console.log('Custom JS executed')")

async def before_return_html(page: Page, context: BrowserContext, html: str, **kwargs):
    print("[HOOK] before_return_html")
    print(f"HTML length: {len(html)}")
    return page
```
## Using the Hooks with AsyncWebCrawler
```python
async def main():
    print("\n🔗 Using Crawler Hooks: Customize AsyncWebCrawler with hooks!")

    # Configure browser and crawler settings
    browser_config = BrowserConfig(
        headless=True,
        viewport_width=1920,
        viewport_height=1080
    )
    crawler_run_config = CrawlerRunConfig(
        js_code="window.scrollTo(0, document.body.scrollHeight);",
        wait_for="footer"
    )

    # Initialize the crawler and register the hooks
    async with AsyncWebCrawler(config=browser_config) as crawler:
        crawler.crawler_strategy.set_hook("on_browser_created", on_browser_created)
        crawler.crawler_strategy.set_hook("on_page_context_created", on_page_context_created)
        crawler.crawler_strategy.set_hook("before_goto", before_goto)
        crawler.crawler_strategy.set_hook("after_goto", after_goto)
        crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
        crawler.crawler_strategy.set_hook("before_return_html", before_return_html)

        # Run the crawler
        result = await crawler.arun(url="https://example.com", config=crawler_run_config)
        print("\n📦 Crawler Hooks Result:")
        print(result)

asyncio.run(main())
```
## Explanation of Hooks
- **`on_browser_created`**: Called when the browser is created. Use this to configure the browser or handle authentication (e.g., logging in and setting cookies).
- **`on_page_context_created`**: Called when a new page context is created. Use this to apply routing, block resources, or inject custom logic before navigating to the URL.
- **`before_goto`**: Called before navigating to the URL. Use this to add custom headers or perform other pre-navigation actions.
- **`after_goto`**: Called after navigation. Use this to verify content or log the URL.
- **`on_execution_started`**: Called after executing custom JavaScript. Use this to perform additional actions.
- **`before_return_html`**: Called before returning the HTML content. Use this to log details or preprocess the content.
## Additional Customizations
- **Resource Management**: Use `on_page_context_created` to block or modify requests (e.g., block images, fonts, or third-party scripts).
- **Dynamic Headers**: Use `before_goto` to add or modify headers dynamically based on the URL.
- **Authentication**: Use `on_browser_created` to handle login processes and set authentication cookies or tokens.
- **Content Analysis**: Use `before_return_html` to analyze or modify the extracted HTML content.
These hooks provide powerful customization options for tailoring the crawling process to your needs.
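As a sketch of the resource-management point, an `on_page_context_created` hook can block whole categories of requests. The set of blocked resource types below is an illustrative choice, not a Crawl4AI default:

```python
import asyncio

# Resource types to drop -- illustrative, tune for your target sites
BLOCKED_TYPES = {"image", "font", "media"}

async def block_heavy_resources(context, page, **kwargs):
    """Candidate on_page_context_created hook that skips heavy resources."""
    async def handle(route):
        if route.request.resource_type in BLOCKED_TYPES:
            await route.abort()
        else:
            await route.continue_()
    await context.route("**", handle)
```

It would be registered like the other hooks: `crawler.crawler_strategy.set_hook("on_page_context_created", block_heavy_resources)`.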
File: 14_proxy_security.md
================================================================================
# Proxy & Security
Configure proxy settings and enhance security features in Crawl4AI for reliable data extraction.
## Basic Proxy Setup
Simple proxy configuration with `BrowserConfig`:
```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig

# Using a proxy URL
browser_config = BrowserConfig(proxy="http://proxy.example.com:8080")
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(url="https://example.com")

# Using a SOCKS proxy
browser_config = BrowserConfig(proxy="socks5://proxy.example.com:1080")
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(url="https://example.com")
```
## Authenticated Proxy
Use an authenticated proxy with `BrowserConfig`:
```python
from crawl4ai.async_configs import BrowserConfig

proxy_config = {
    "server": "http://proxy.example.com:8080",
    "username": "user",
    "password": "pass"
}
browser_config = BrowserConfig(proxy_config=proxy_config)
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(url="https://example.com")
```
## Rotating Proxies
Example using a proxy rotation service and updating `BrowserConfig` dynamically:
```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig

async def get_next_proxy():
    # Your proxy rotation logic here
    return {"server": "http://next.proxy.com:8080"}

async def crawl_with_rotation(urls):
    # Proxy settings live at the browser level, so start a fresh
    # browser (with a new BrowserConfig) for each proxy.
    for url in urls:
        proxy = await get_next_proxy()
        browser_config = BrowserConfig(proxy_config=proxy)
        async with AsyncWebCrawler(config=browser_config) as crawler:
            result = await crawler.arun(url=url)
```
## Custom Headers
Add security-related headers via `BrowserConfig`:
```python
from crawl4ai.async_configs import BrowserConfig

headers = {
    "X-Forwarded-For": "203.0.113.195",
    "Accept-Language": "en-US,en;q=0.9",
    "Cache-Control": "no-cache",
    "Pragma": "no-cache"
}
browser_config = BrowserConfig(headers=headers)
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(url="https://example.com")
```
## Combining with Magic Mode
For maximum protection, combine proxy with Magic Mode via `CrawlerRunConfig` and `BrowserConfig`:
```python
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

browser_config = BrowserConfig(
    proxy="http://proxy.example.com:8080",
    headers={"Accept-Language": "en-US"}
)
crawler_config = CrawlerRunConfig(magic=True)  # Enable all anti-detection features

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(url="https://example.com", config=crawler_config)
```
File: 15_screenshot_and_pdf_export.md
================================================================================
# Capturing Full-Page Screenshots and PDFs from Massive Webpages with Crawl4AI
When dealing with very long web pages, traditional full-page screenshots can be slow or fail entirely. For large pages (like extensive Wikipedia articles), generating a single massive screenshot often leads to delays, memory issues, or style differences.
## **The New Approach:**
We’ve introduced a new feature that effortlessly handles even the biggest pages by first exporting them as a PDF, then converting that PDF into a high-quality image. This approach leverages the browser’s built-in PDF rendering, making it both stable and efficient for very long content. You also have the option to directly save the PDF for your own usage—no need for multiple passes or complex stitching logic.
## **Key Benefits:**
- **Reliability:** The PDF export never times out and works regardless of page length.
- **Versatility:** Get both the PDF and a screenshot in one crawl, without reloading or reprocessing.
- **Performance:** Skips manual scrolling and stitching images, reducing complexity and runtime.
## **Simple Example:**
```python
import os, sys
import asyncio
from base64 import b64decode
from crawl4ai import AsyncWebCrawler, CacheMode

# Adjust paths as needed
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(parent_dir)
__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))

async def main():
    async with AsyncWebCrawler() as crawler:
        # Request both PDF and screenshot
        result = await crawler.arun(
            url='https://en.wikipedia.org/wiki/List_of_common_misconceptions',
            cache_mode=CacheMode.BYPASS,
            pdf=True,
            screenshot=True
        )
        if result.success:
            # Save screenshot
            if result.screenshot:
                with open(os.path.join(__location__, "screenshot.png"), "wb") as f:
                    f.write(b64decode(result.screenshot))
            # Save PDF
            if result.pdf:
                with open(os.path.join(__location__, "page.pdf"), "wb") as f:
                    f.write(b64decode(result.pdf))

if __name__ == "__main__":
    asyncio.run(main())
```
## **What Happens Under the Hood:**
- Crawl4AI navigates to the target page.
- If `pdf=True`, it exports the current page as a full PDF, capturing all of its content no matter the length.
- If `screenshot=True`, and a PDF is already available, it directly converts the first page of that PDF to an image for you—no repeated loading or scrolling.
- Finally, you get your PDF and/or screenshot ready to use.
## **Conclusion:**
With this feature, Crawl4AI becomes even more robust and versatile for large-scale content extraction. Whether you need a PDF snapshot or a quick screenshot, you now have a reliable solution for even the most extensive webpages.
File: 16_storage_state.md
================================================================================
# Using `storage_state` to Pre-Load Cookies and LocalStorage
Crawl4AI's `AsyncWebCrawler` lets you preserve and reuse session data, including cookies and localStorage, across multiple runs. By providing a `storage_state`, you can start your crawls already "logged in" or with any other necessary session data, with no need to repeat the login flow every time.
## What is `storage_state`?
`storage_state` can be:
- A dictionary containing cookies and localStorage data.
- A path to a JSON file that holds this information.
When you pass `storage_state` to the crawler, it applies these cookies and localStorage entries before loading any pages. This means your crawler effectively starts in a known authenticated or pre-configured state.
## Example Structure
Here’s an example storage state:
```json
{
"cookies": [
{
"name": "session",
"value": "abcd1234",
"domain": "example.com",
"path": "/",
"expires": 1675363572.037711,
"httpOnly": false,
"secure": false,
"sameSite": "None"
}
],
"origins": [
{
"origin": "https://example.com",
"localStorage": [
{ "name": "token", "value": "my_auth_token" },
{ "name": "refreshToken", "value": "my_refresh_token" }
]
}
]
}
```
This JSON sets a `session` cookie and two localStorage entries (`token` and `refreshToken`) for `https://example.com`.
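Because a malformed state file may only surface later as a failed crawl, it can help to sanity-check the JSON before handing it to the crawler. This small loader is a sketch, not part of the Crawl4AI API:

```python
import json

def load_storage_state(path: str) -> dict:
    """Load a storage_state JSON file and check its basic shape."""
    with open(path, "r", encoding="utf-8") as f:
        state = json.load(f)
    if not isinstance(state.get("cookies", []), list):
        raise ValueError("'cookies' must be a list")
    if not isinstance(state.get("origins", []), list):
        raise ValueError("'origins' must be a list")
    for cookie in state.get("cookies", []):
        if "name" not in cookie or "value" not in cookie:
            raise ValueError("each cookie needs 'name' and 'value'")
    return state
```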
---
## Passing `storage_state` as a Dictionary
You can directly provide the data as a dictionary:
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    storage_dict = {
        "cookies": [
            {
                "name": "session",
                "value": "abcd1234",
                "domain": "example.com",
                "path": "/",
                "expires": 1675363572.037711,
                "httpOnly": False,
                "secure": False,
                "sameSite": "None"
            }
        ],
        "origins": [
            {
                "origin": "https://example.com",
                "localStorage": [
                    {"name": "token", "value": "my_auth_token"},
                    {"name": "refreshToken", "value": "my_refresh_token"}
                ]
            }
        ]
    }

    async with AsyncWebCrawler(
        headless=True,
        storage_state=storage_dict
    ) as crawler:
        result = await crawler.arun(url='https://example.com/protected')
        if result.success:
            print("Crawl succeeded with pre-loaded session data!")
            print("Page HTML length:", len(result.html))

if __name__ == "__main__":
    asyncio.run(main())
```
---
## Passing `storage_state` as a File
If you prefer a file-based approach, save the JSON above to `mystate.json` and reference it:
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(
        headless=True,
        storage_state="mystate.json"  # Uses a JSON file instead of a dictionary
    ) as crawler:
        result = await crawler.arun(url='https://example.com/protected')
        if result.success:
            print("Crawl succeeded with pre-loaded session data!")
            print("Page HTML length:", len(result.html))

if __name__ == "__main__":
    asyncio.run(main())
```
---
## Using `storage_state` to Avoid Repeated Logins (Sign In Once, Use Later)
A common scenario is when you need to log in to a site (entering username/password, etc.) to access protected pages. Doing so every crawl is cumbersome. Instead, you can:
1. Perform the login once in a hook.
2. After login completes, export the resulting `storage_state` to a file.
3. On subsequent runs, provide that `storage_state` to skip the login step.
**Step-by-Step Example:**
**First Run (Perform Login and Save State):**
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def on_browser_created_hook(browser):
    # Access the default context and create a page
    context = browser.contexts[0]
    page = await context.new_page()

    # Navigate to the login page
    await page.goto("https://example.com/login", wait_until="domcontentloaded")

    # Fill in credentials and submit
    await page.fill("input[name='username']", "myuser")
    await page.fill("input[name='password']", "mypassword")
    await page.click("button[type='submit']")
    await page.wait_for_load_state("networkidle")

    # The site has now set tokens in localStorage and cookies.
    # Export this state to a file so we can reuse it.
    await context.storage_state(path="my_storage_state.json")
    await page.close()

async def main():
    # First run: perform login and export the storage_state
    async with AsyncWebCrawler(
        headless=True,
        verbose=True,
        hooks={"on_browser_created": on_browser_created_hook},
        use_persistent_context=True,
        user_data_dir="./my_user_data"
    ) as crawler:
        # After on_browser_created_hook runs, storage_state is saved to my_storage_state.json
        result = await crawler.arun(
            url='https://example.com/protected-page',
            cache_mode=CacheMode.BYPASS,
            markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True}),
        )
        print("First run result success:", result.success)
        if result.success:
            print("Protected page HTML length:", len(result.html))

if __name__ == "__main__":
    asyncio.run(main())
```
**Second Run (Reuse Saved State, No Login Needed):**
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # Second run: no need to hook on_browser_created this time.
    # Just provide the previously saved storage state.
    async with AsyncWebCrawler(
        headless=True,
        verbose=True,
        use_persistent_context=True,
        user_data_dir="./my_user_data",
        storage_state="my_storage_state.json"  # Reuse previously exported state
    ) as crawler:
        # Now the crawler starts already logged in
        result = await crawler.arun(
            url='https://example.com/protected-page',
            cache_mode=CacheMode.BYPASS,
            markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True}),
        )
        print("Second run result success:", result.success)
        if result.success:
            print("Protected page HTML length:", len(result.html))

if __name__ == "__main__":
    asyncio.run(main())
```
**What’s Happening Here?**
- During the first run, the `on_browser_created_hook` logs into the site.
- After logging in, the crawler exports the current session (cookies, localStorage, etc.) to `my_storage_state.json`.
- On subsequent runs, passing `storage_state="my_storage_state.json"` starts the browser context with these tokens already in place, skipping the login steps.
**Sign Out Scenario:**
If the website allows you to sign out by clearing tokens or by navigating to a sign-out URL, you can also run a script that uses `on_browser_created_hook` or `arun` to simulate signing out, then export the resulting `storage_state` again. That would give you a baseline “logged out” state to start fresh from next time.
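Under the same assumptions as the login hook above, and assuming the site exposes a hypothetical `/logout` URL, a sign-out pass might look like this sketch:

```python
async def on_browser_created_signout_hook(browser):
    """Hypothetical hook: visit the sign-out URL, then export a logged-out state."""
    context = browser.contexts[0]
    page = await context.new_page()
    await page.goto("https://example.com/logout", wait_until="domcontentloaded")
    # Export the now-logged-out session as a clean baseline for future runs
    await context.storage_state(path="logged_out_state.json")
    await page.close()
```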
---
## Conclusion
By using `storage_state`, you can skip repetitive actions, like logging in, and jump straight into crawling protected content. Whether you provide a file path or a dictionary, this powerful feature helps maintain state between crawls, simplifying your data extraction pipelines.
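If you prefer not to manage a JSON file, `storage_state` also accepts an in-memory dictionary. Below is a minimal sketch of the expected shape, following Playwright's storage-state format (which the browser context uses under the hood); the cookie and localStorage values are placeholders:

```python
# Sketch of a storage_state dictionary in Playwright's format.
# All token values below are illustrative placeholders.
storage_state = {
    "cookies": [
        {
            "name": "session",
            "value": "example-token",   # placeholder session token
            "domain": "example.com",
            "path": "/",
            "expires": -1,              # -1 marks a session cookie
            "httpOnly": True,
            "secure": True,
            "sameSite": "Lax",
        }
    ],
    "origins": [
        {
            "origin": "https://example.com",
            "localStorage": [
                {"name": "token", "value": "example-local-token"}
            ],
        }
    ],
}

# Pass the dictionary directly instead of a file path:
# async with AsyncWebCrawler(storage_state=storage_state) as crawler: ...
```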
File: 1_introduction.ex.md
================================================================================
# Introduction
## Quick Start (Minimal Example)
For a fast hands-on start, try crawling a single URL and printing its Markdown output:
```python
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(url="https://example.com")
print(result.markdown)
if __name__ == "__main__":
asyncio.run(main())
```
This simple snippet should immediately confirm your environment is set up correctly. If you see the page content in Markdown format, you’re good to go.
---
## Overview of Crawl4AI
Crawl4AI is a state-of-the-art, **asynchronous** web crawling library optimized for large-scale data collection. It’s built to integrate seamlessly into AI workflows such as fine-tuning, retrieval-augmented generation (RAG), and data pipelines. By focusing on generating structured, AI-ready data (like Markdown), it helps you build robust applications quickly.
**Why Asynchronous?**
Async architecture allows you to concurrently crawl multiple URLs without waiting on slow network operations. This results in drastically improved performance and efficiency, especially when dealing with large-scale data extraction.
### Purpose and Vision
- Offer an open-source alternative to expensive commercial APIs.
- Provide clean, structured, Markdown-based outputs for easy AI integration.
- Democratize large-scale, high-speed, and reliable web crawling solutions.
### Key Features
- **Markdown Generation**: Produces AI-friendly, concise Markdown.
- **High-Performance Crawling**: Asynchronous operations let you crawl numerous URLs concurrently.
- **Browser Control**: Fine-tune browser sessions, user agents, proxies, and viewport.
- **JavaScript Support**: Handle dynamic pages by injecting custom JavaScript snippets.
- **Content Filtering**: Use advanced strategies (e.g., BM25) to focus on what matters.
- **Extensibility**: Define custom extraction strategies for complex data schemas.
- **Deployment Ready**: Easy Docker deployment for production and scalability.
---
## Use Cases
- **LLM Training and Fine-Tuning**: Collect and preprocess large web datasets to train machine learning models.
- **RAG Pipelines**: Generate context documents for retrieval-augmented generation tasks.
- **Content Summarization**: Extract pages and produce summaries directly in Markdown.
- **Structured Data Extraction**: Pull structured JSON data suitable for building knowledge graphs or databases.
**Example: Creating a Fine-Tuning Dataset**
```python
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
urls = ["https://example.com/dataset_page_1", "https://example.com/dataset_page_2"]
async with AsyncWebCrawler(verbose=True) as crawler:
results = await asyncio.gather(*[crawler.arun(url=u) for u in urls])
# Combine Markdown outputs into a single file for model fine-tuning
with open("fine_tuning_data.md", "w") as f:
for res in results:
f.write(res.markdown + "\n")
if __name__ == "__main__":
asyncio.run(main())
```
---
## Installation and Setup
### Environment Setup (Recommended)
Use a virtual environment to keep dependencies isolated:
```bash
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
```
### Basic Installation
```bash
pip install crawl4ai
crawl4ai-setup
```
By default, this installs the asynchronous version and sets up Playwright.
### Verify Installation
Run a quick test:
```python
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(url="https://crawl4ai.com")
print(result.markdown)
if __name__ == "__main__":
asyncio.run(main())
```
If you see the page content printed as Markdown, you’re ready.
### Handling JavaScript-Heavy Pages
For pages that require JavaScript actions (like clicking a “Load More” button), use the `js_code` parameter:
```python
js_code = """
(async () => {
const loadMoreBtn = document.querySelector('button.load-more');
if (loadMoreBtn) loadMoreBtn.click();
await new Promise(r => setTimeout(r, 1000));
})();
"""
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(
url="https://example.com/js-page",
js_code=[js_code]
)
print(result.markdown)
```
### Using Cache Modes
`CacheMode` can speed up repeated crawls by reusing previously fetched data. For instance:
```python
from crawl4ai import AsyncWebCrawler, CacheMode
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(
url="https://example.com/large-page",
cache_mode=CacheMode.ENABLED
)
print(result.markdown)
```
---
## Quick Start Guide
### Minimal Working Example
```python
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(url="https://crawl4ai.com")
print(result.markdown)
if __name__ == "__main__":
asyncio.run(main())
```
### Multiple Concurrent Crawls
Harness async concurrency to run multiple crawls in parallel:
```python
import asyncio
from crawl4ai import AsyncWebCrawler
async def crawl_url(crawler, url):
return await crawler.arun(url=url)
async def main():
urls = ["https://example.com/page1", "https://example.com/page2", "https://example.com/page3"]
async with AsyncWebCrawler(verbose=True) as crawler:
results = await asyncio.gather(*[crawl_url(crawler, u) for u in urls])
for r in results:
print(r.markdown[:200])
if __name__ == "__main__":
asyncio.run(main())
```
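A plain `asyncio.gather` launches every request at once, which can overwhelm a site or your own machine on large URL lists. A bounded variant using `asyncio.Semaphore` is sketched below; `fake_crawl` is a stand-in for `crawler.arun` so the pattern is clear without a live browser:

```python
import asyncio

async def fake_crawl(url: str) -> str:
    # Stand-in for crawler.arun(url=url); simulates network latency.
    await asyncio.sleep(0.01)
    return f"markdown for {url}"

async def bounded_crawl(urls, limit=3):
    sem = asyncio.Semaphore(limit)  # at most `limit` crawls in flight

    async def one(url):
        async with sem:
            return await fake_crawl(url)

    return await asyncio.gather(*[one(u) for u in urls])

results = asyncio.run(
    bounded_crawl([f"https://example.com/page{i}" for i in range(10)])
)
print(len(results))  # → 10
```

Swap `fake_crawl` for a call to `crawler.arun(url=url)` inside an open `AsyncWebCrawler` context to apply the same cap to real crawls.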
### Dockerized Setup
Run Crawl4AI in Docker for production environments:
```bash
docker pull unclecode/crawl4ai:basic-amd64
docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64
curl http://localhost:11235/health
```
### Proxy and Security Configurations
```python
async with AsyncWebCrawler(
proxies={"http": "http://proxy.server:port", "https": "https://proxy.server:port"}
) as crawler:
result = await crawler.arun(url="https://crawl4ai.com")
print(result.markdown)
```
You can also embed basic-auth credentials directly in the proxy URL:
```python
async with AsyncWebCrawler(
proxies={"http": "http://user:password@proxy.server:port"}
) as crawler:
result = await crawler.arun(url="https://crawl4ai.com")
print(result.markdown)
```
### Customizing Browser Settings
Customize headers, user agents, and viewport:
```python
async with AsyncWebCrawler(
verbose=True,
headers={"User-Agent": "MyCustomBrowser/1.0"},
viewport={"width": 1280, "height": 800}
) as crawler:
result = await crawler.arun("https://example.com")
print(result.markdown)
```
---
## Troubleshooting Installation
### Playwright Errors
If `crawl4ai-setup` fails, install manually:
```bash
playwright install chromium
pip install "crawl4ai[all]"
```
### SSL or Proxy Issues
- Check certificates or disable SSL verification (for dev only).
- Verify proxy credentials and server details.
Use `verbose=True` for detailed logs:
```python
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(url="https://crawl4ai.com")
print(result.markdown)
```
---
## Common Pitfalls
1. **Missing Playwright Installation**: Run `playwright install chromium`.
2. **Timeouts on JavaScript-Heavy Pages**: Increase wait time or use `js_code` for page interactions.
3. **Empty Markdown**: Check if the page is JavaScript-rendered and adjust `js_code` or `wait_for` conditions.
4. **Permission Errors**: Run commands with appropriate permissions or use a virtual environment.
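For pitfalls 2 and 3, the `wait_for` parameter lets the crawler block until a condition holds before capturing content. A hedged sketch, assuming the `css:`/`js:` prefix convention used elsewhere in the Crawl4AI docs (the selector and predicate are hypothetical):

```python
# wait_for accepts a CSS selector ("css:" prefix) or a JS predicate ("js:" prefix).
wait_for_css = "css:div.article-body"  # wait until this element exists
wait_for_js = "js:() => document.querySelectorAll('.item').length > 10"

# Not run here, since it needs a live browser session:
# result = await crawler.arun(
#     url="https://example.com/js-page",
#     wait_for=wait_for_css,
# )
```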
---
## Support and Community
- **GitHub Issues**: Have questions or found a bug? Open an issue on the [GitHub Repo](https://github.com/unclecode/crawl4ai/issues).
- **Contributions**: We welcome pull requests. Check out the [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTING.md).
- **Community Discussions**: Join discussions on GitHub to share tips, best practices, and feedback.
---
## Further Exploration
- **Advanced Extraction Strategies**: Dive into specialized extraction strategies like `JsonCssExtractionStrategy` or `LLMExtractionStrategy` for structured data output.
- **Content Filtering**: Explore BM25-based strategies to highlight the most relevant parts of a page.
- **Production Deployment**: Refer to the Docker and environment variable configurations for large-scale, distributed crawling setups.
For more detailed code examples and advanced topics, refer to the accompanying [README](https://github.com/unclecode/crawl4ai) and the `QUICKSTART` Python file included with this distribution.
File: 2_configuration.md
================================================================================
# Core Configurations
## BrowserConfig
`BrowserConfig` centralizes all parameters required to set up and manage a browser instance and its context. This configuration ensures consistent and documented browser behavior for the crawler. Below is a detailed explanation of each parameter and its optimal use cases.
### Parameters and Use Cases
#### `browser_type`
- **Description**: Specifies the type of browser to launch.
- Supported values: `"chromium"`, `"firefox"`, `"webkit"`
- Default: `"chromium"`
- **Use Case**:
- Use `"chromium"` for general-purpose crawling with modern web standards.
- Use `"firefox"` when testing against Firefox-specific behavior.
- Use `"webkit"` for testing Safari-like environments.
#### `headless`
- **Description**: Determines whether the browser runs in headless mode (no GUI).
- Default: `True`
- **Use Case**:
- Enable for faster, automated operations without UI overhead.
- Disable (`False`) when debugging or inspecting browser behavior visually.
#### `use_managed_browser`
- **Description**: Enables advanced manipulation via a managed browser approach.
- Default: `False`
- **Use Case**:
- Use when fine-grained control is needed over browser sessions, such as debugging network requests or reusing sessions.
#### `debugging_port`
- **Description**: Port for remote debugging.
- Default: 9222
- **Use Case**:
- Use for debugging browser sessions with DevTools or external tools.
#### `use_persistent_context`
- **Description**: Uses a persistent browser context (e.g., saved profiles).
- Automatically enables `use_managed_browser`.
- Default: `False`
- **Use Case**:
- Persistent login sessions for authenticated crawling.
- Retaining cookies or local storage across multiple runs.
#### `user_data_dir`
- **Description**: Path to a directory for storing persistent browser data.
- Default: `None`
- **Use Case**:
- Specify a directory to save browser profiles for multi-run crawls or debugging.
#### `chrome_channel`
- **Description**: Specifies the Chrome channel to launch (e.g., `"chrome"`, `"msedge"`).
- Applies only when `browser_type` is `"chromium"`.
- Default: `"chrome"`
- **Use Case**:
- Use `"msedge"` for compatibility testing with Edge browsers.
#### `proxy` and `proxy_config`
- **Description**:
- `proxy`: Proxy server URL for the browser.
- `proxy_config`: Detailed proxy configuration.
- Default: `None`
- **Use Case**:
- Set `proxy` for single-proxy setups.
- Use `proxy_config` for advanced configurations, such as authenticated proxies or regional routing.
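The `proxy_config` dictionary follows Playwright's proxy settings shape (the assumption here is that Crawl4AI forwards it to the browser launch; server and credentials below are placeholders):

```python
# Playwright-style proxy configuration; all values are placeholders.
proxy_config = {
    "server": "http://proxy.example.com:8080",  # proxy endpoint
    "username": "proxy_user",                   # for authenticated proxies
    "password": "proxy_pass",
}

# config = BrowserConfig(proxy_config=proxy_config)
```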
#### `viewport_width` and `viewport_height`
- **Description**: Sets the default browser viewport dimensions.
- Default: `1920` (width), `1080` (height)
- **Use Case**:
- Adjust for crawling responsive layouts or specific device emulations.
#### `accept_downloads` and `downloads_path`
- **Description**:
- `accept_downloads`: Allows file downloads.
- `downloads_path`: Directory for storing downloads.
- Default: `False`, `None`
- **Use Case**:
- Use when downloading and analyzing files like PDFs or spreadsheets.
#### `storage_state`
- **Description**: Specifies cookies and local storage state.
- Default: `None`
- **Use Case**:
- Provide state data for authenticated or preconfigured sessions.
#### `ignore_https_errors`
- **Description**: Ignores HTTPS certificate errors.
- Default: `True`
- **Use Case**:
- Enable for crawling sites with invalid certificates (testing environments).
#### `java_script_enabled`
- **Description**: Toggles JavaScript execution in pages.
- Default: `True`
- **Use Case**:
- Disable for simpler, faster crawls where JavaScript is unnecessary.
#### `cookies`
- **Description**: List of cookies to add to the browser context.
- Default: `[]`
- **Use Case**:
- Use for authenticated or preconfigured crawling scenarios.
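Each cookie entry follows Playwright's cookie shape: `name` and `value` plus either a `url` or a `domain`/`path` pair. A minimal sketch with placeholder values:

```python
# A single pre-set session cookie; the token value is a placeholder.
cookies = [
    {
        "name": "session_id",
        "value": "abc123",      # placeholder session token
        "domain": "example.com",
        "path": "/",
    }
]

# config = BrowserConfig(cookies=cookies)
```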
#### `headers`
- **Description**: Extra HTTP headers applied to all requests.
- Default: `{}`
- **Use Case**:
- Customize headers for API-like crawling or bypassing bot detections.
#### `user_agent` and `user_agent_mode`
- **Description**:
- `user_agent`: Custom User-Agent string.
- `user_agent_mode`: Mode for generating User-Agent (e.g., `"random"`).
- Default: Standard Chromium-based User-Agent.
- **Use Case**:
- Set static User-Agent for consistent identification.
- Use `"random"` mode to reduce bot detection likelihood.
#### `text_mode`
- **Description**: Disables images and other rich content for faster load times.
- Default: `False`
- **Use Case**:
- Enable for text-only extraction tasks where speed is prioritized.
#### `light_mode`
- **Description**: Disables background features for performance gains.
- Default: `False`
- **Use Case**:
- Enable for high-performance crawls on resource-constrained environments.
#### `extra_args`
- **Description**: Additional command-line arguments for browser execution.
- Default: `[]`
- **Use Case**:
- Use for advanced browser configurations like WebRTC or GPU tuning.
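Pulling several of the parameters above together, a performance-oriented text-extraction configuration might look like the sketch below (tune the flags to your workload; the extra Chromium flag is just an example):

```python
from crawl4ai.async_configs import BrowserConfig

config = BrowserConfig(
    browser_type="chromium",
    headless=True,
    text_mode=True,              # skip images and rich content for text-only crawls
    light_mode=True,             # disable background features for performance
    viewport_width=1280,
    viewport_height=800,
    user_agent_mode="random",    # rotate User-Agent to reduce bot detection
    extra_args=["--disable-gpu"],  # example extra command-line flag
)
```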
## CrawlerRunConfig
The `CrawlerRunConfig` class centralizes parameters for controlling crawl operations. This configuration covers content extraction, page interactions, caching, and runtime behaviors. Below is an exhaustive breakdown of parameters and their best-use scenarios.
### Parameters and Use Cases
#### Content Processing Parameters
##### `word_count_threshold`
- **Description**: Minimum word count threshold for processing content.
- Default: `200`
- **Use Case**:
- Set a higher threshold for content-heavy pages to skip lightweight or irrelevant content.
##### `extraction_strategy`
- **Description**: Strategy for extracting structured data from crawled pages.
- Default: `None` (falls back to `NoExtractionStrategy`).
- **Use Case**:
- Use for schema-driven extraction when working with well-defined data models like JSON.
##### `chunking_strategy`
- **Description**: Strategy to chunk content before extraction.
- Default: `RegexChunking()`.
- **Use Case**:
- Use NLP-based chunking for semantic extractions or regex for predictable text blocks.
##### `markdown_generator`
- **Description**: Strategy for generating Markdown output.
- Default: `None`.
- **Use Case**:
- Use custom Markdown strategies for AI-ready outputs like RAG pipelines.
##### `content_filter`
- **Description**: Optional filter to prune irrelevant content.
- Default: `None`.
- **Use Case**:
- Use relevance-based filters for focused crawls, e.g., keyword-specific searches.
##### `only_text`
- **Description**: Extracts text-only content where applicable.
- Default: `False`.
- **Use Case**:
- Enable for extracting clean text without HTML tags or rich content.
##### `css_selector`
- **Description**: CSS selector to extract a specific portion of the page.
- Default: `None`.
- **Use Case**:
- Use when targeting specific page elements, like articles or headlines.
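For instance, to restrict extraction to the main article of a page, `css_selector` can be combined with the other content-processing parameters. A sketch using the parameters described above (the selector itself is hypothetical):

```python
from crawl4ai.async_configs import CrawlerRunConfig

config = CrawlerRunConfig(
    css_selector="article.main-content",  # hypothetical selector for the article body
    only_text=True,                       # drop markup, keep clean text
    word_count_threshold=50,              # skip blocks shorter than 50 words
)
```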
##### `excluded_tags`
- **Description**: List of HTML tags to exclude from processing.
- Default: `None`.
- **Use Case**:
- Remove elements like `