Crawl4AI Quick Start Guide: Your All-in-One AI-Ready Web Crawling & AI Integration Solution
Crawl4AI, the #1 trending GitHub repository, streamlines web content extraction into AI-ready formats. Perfect for AI assistants, semantic search engines, or data pipelines, Crawl4AI transforms raw HTML into structured Markdown or JSON effortlessly. Integrate with LLMs, open-source models, or your own retrieval-augmented generation workflows.
Key Links:
- Website: https://crawl4ai.com
- GitHub: https://github.com/unclecode/crawl4ai
- Colab Notebook: Try on Google Colab
- Quickstart Code Example: quickstart_async.config.py
- Examples Folder: Crawl4AI Examples
Table of Contents
- Crawl4AI Quick Start Guide: Your All-in-One AI-Ready Web Crawling & AI Integration Solution
- Table of Contents
- 1. Introduction & Key Concepts
- 2. Installation & Environment Setup
- 3. Core Concepts & Configuration
- 4. Basic Crawling & Simple Extraction
- 5. Markdown Generation & AI-Optimized Output
- 6. Structured Data Extraction (CSS, XPath, LLM)
- 7. Advanced Extraction: LLM & Open-Source Models
- 8. Page Interactions, JS Execution, & Dynamic Content
- 9. Media, Links, & Metadata Handling
- 10. Authentication & Identity Preservation
- 11. Proxy & Security Enhancements
- 12. Screenshots, PDFs & File Downloads
- 13. Caching & Performance Optimization
- 14. Hooks for Custom Logic
- 15. Dockerization & Scaling
- 16. Troubleshooting & Common Pitfalls
- 17. Comprehensive End-to-End Example
- 18. Further Resources & Community
1. Introduction & Key Concepts
Crawl4AI transforms websites into structured, AI-friendly data. It efficiently handles large-scale crawling, integrates with both proprietary and open-source LLMs, and optimizes content for semantic search or RAG pipelines.
Quick Test:
import asyncio
from crawl4ai import AsyncWebCrawler
async def test_run():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown)

asyncio.run(test_run())
If you see Markdown output, everything is working!
More info: See /docs/introduction or 1_introduction.ex.md
2. Installation & Environment Setup
pip install crawl4ai
crawl4ai-setup
playwright install chromium
Try in Colab:
Open Colab Notebook
More info: See /docs/configuration or 2_configuration.md
3. Core Concepts & Configuration
Use AsyncWebCrawler, CrawlerRunConfig, and BrowserConfig to control crawling.
Example config:
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
browser_config = BrowserConfig(
    headless=True,
    viewport_width=1920,
    viewport_height=1080,
    text_mode=False,
    ignore_https_errors=True,
    java_script_enabled=True
)

run_config = CrawlerRunConfig(
    css_selector="article.main",
    word_count_threshold=50,
    excluded_tags=['nav', 'footer'],
    exclude_external_links=True,
    wait_for="css:.article-loaded",
    page_timeout=60000,
    delay_before_return_html=1.0,
    mean_delay=0.1,
    max_range=0.3,
    process_iframes=True,
    remove_overlay_elements=True,
    js_code="""
    (async () => {
        window.scrollTo(0, document.body.scrollHeight);
        await new Promise(r => setTimeout(r, 2000));
        document.querySelector('.load-more')?.click();
    })();
    """
)
# Use: ENABLED, DISABLED, BYPASS, READ_ONLY, WRITE_ONLY
# run_config.cache_mode = CacheMode.ENABLED
Prefixes:
- http:// or https:// for live pages
- file://local.html for local files
- raw:<html> for raw HTML strings
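A minimal sketch of the three prefix forms in use (the local path and raw HTML below are placeholders):
import asyncio
from crawl4ai import AsyncWebCrawler

async def prefix_demo():
    async with AsyncWebCrawler() as crawler:
        live = await crawler.arun("https://example.com")                       # live page
        local = await crawler.arun("file:///tmp/local.html")                   # local file
        raw = await crawler.arun("raw:<html><body><h1>Hi</h1></body></html>")  # raw HTML string
        print(len(live.markdown), len(local.markdown), len(raw.markdown))

asyncio.run(prefix_demo())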
More info: See /docs/async_webcrawler or 3_async_webcrawler.ex.md
4. Basic Crawling & Simple Extraction
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://news.example.com/article", config=run_config)
    print(result.markdown)  # Basic markdown content
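It is worth checking the result before using it; success, status_code, and error_message are fields on the returned result object. Continuing inside the same async with block:
    if result.success:
        print("Status:", result.status_code)
        print(result.markdown[:300])
    else:
        print("Crawl failed:", result.error_message)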
More info: See /docs/browser_context_page or 4_browser_context_page.ex.md
5. Markdown Generation & AI-Optimized Output
After crawling, result.markdown_v2 provides:
- raw_markdown: Unfiltered markdown
- markdown_with_citations: Links as references at the bottom
- references_markdown: A separate list of reference links
- fit_markdown: Filtered, relevant markdown (e.g., after BM25)
- fit_html: The HTML used to produce fit_markdown
Example:
print("RAW:", result.markdown_v2.raw_markdown[:200])
print("CITED:", result.markdown_v2.markdown_with_citations[:200])
print("REFERENCES:", result.markdown_v2.references_markdown)
print("FIT MARKDOWN:", result.markdown_v2.fit_markdown)
For AI training, fit_markdown focuses on the most relevant content.
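fit_markdown comes from a content filter attached to the markdown generator. A minimal sketch, assuming the import paths below (check them against your installed version):
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import BM25ContentFilter

run_config.markdown_generator = DefaultMarkdownGenerator(
    content_filter=BM25ContentFilter(user_query="machine learning tutorials")
)
# After the crawl, result.markdown_v2.fit_markdown holds only content scored relevant to the query.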
More info: See /docs/markdown_generation or 5_markdown_generation.ex.md
6. Structured Data Extraction (CSS, XPath, LLM)
Extract JSON data without LLMs:
CSS:
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
schema = {
    "name": "Products",
    "baseSelector": ".product",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"}
    ]
}
run_config.extraction_strategy = JsonCssExtractionStrategy(schema)
XPath:
from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy
xpath_schema = {
    "name": "Articles",
    "baseSelector": "//div[@class='article']",
    "fields": [
        {"name": "headline", "selector": ".//h1", "type": "text"},
        {"name": "summary", "selector": ".//p[@class='summary']", "type": "text"}
    ]
}
run_config.extraction_strategy = JsonXPathExtractionStrategy(xpath_schema)
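Both strategies place their output in result.extracted_content as a JSON string. A short usage sketch (the URL is a placeholder; field names match the CSS schema above):
import json

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://shop.example.com/products", config=run_config)
    products = json.loads(result.extracted_content)
    for product in products:
        print(product["title"], "-", product["price"])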
More info: See /docs/extraction_strategies or 7_extraction_strategies.ex.md
7. Advanced Extraction: LLM & Open-Source Models
Use LLMExtractionStrategy for complex tasks. Works with OpenAI or open-source models (e.g., Ollama).
from pydantic import BaseModel
from crawl4ai.extraction_strategy import LLMExtractionStrategy
class TravelData(BaseModel):
    destination: str
    attractions: list

run_config.extraction_strategy = LLMExtractionStrategy(
    provider="ollama/nemotron",
    schema=TravelData.schema(),
    instruction="Extract destination and top attractions."
)
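For a hosted provider instead of Ollama, swap the provider string and pass an API token. The provider string below is an assumption; api_token is the LLMExtractionStrategy parameter for credentials:
import os

run_config.extraction_strategy = LLMExtractionStrategy(
    provider="openai/gpt-4o-mini",  # assumed provider name, adjust to your model
    api_token=os.getenv("OPENAI_API_KEY"),
    schema=TravelData.schema(),
    instruction="Extract destination and top attractions."
)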
More info: See /docs/extraction_strategies or 7_extraction_strategies.ex.md
8. Page Interactions, JS Execution, & Dynamic Content
Insert js_code and use wait_for to ensure content loads. Example:
run_config.js_code = """
(async () => {
    document.querySelector('.load-more')?.click();
    await new Promise(r => setTimeout(r, 2000));
})();
"""
run_config.wait_for = "css:.item-loaded"
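For multi-step interaction (e.g., clicking "load more" repeatedly on the same page), you can reuse one browser page across calls. A sketch assuming session_id and js_only behave as described in the docs:
from crawl4ai.async_configs import CrawlerRunConfig

first_cfg = CrawlerRunConfig(session_id="feed", wait_for="css:.item-loaded")
more_cfg = CrawlerRunConfig(
    session_id="feed",
    js_only=True,  # run JS in the existing page instead of re-navigating
    js_code="document.querySelector('.load-more')?.click();",
    wait_for="css:.item-loaded"
)

async with AsyncWebCrawler(config=browser_config) as crawler:
    first = await crawler.arun("https://news.example.com", config=first_cfg)
    more = await crawler.arun("https://news.example.com", config=more_cfg)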
More info: See /docs/page_interaction or 11_page_interaction.md
9. Media, Links, & Metadata Handling
result.media["images"]: List of images with src, score, alt. Score indicates relevance.
result.media["videos"], result.media["audios"] similarly hold media info.
result.links["internal"], result.links["external"], result.links["social"]: Categorized links. Each link has href, text, context, type.
result.metadata: Title, description, keywords, author.
Example:
# Images
for img in result.media["images"]:
    print("Image:", img["src"], "Score:", img["score"], "Alt:", img.get("alt", "N/A"))

# Links
for link in result.links["external"]:
    print("External Link:", link["href"], "Text:", link["text"])
# Metadata
print("Page Title:", result.metadata["title"])
print("Description:", result.metadata["description"])
More info: See /docs/content_selection or 8_content_selection.ex.md
10. Authentication & Identity Preservation
Manual Setup via User Data Directory
- Open Chrome with a custom user data dir:
  "C:\Program Files\Google\Chrome\Application\chrome.exe" --user-data-dir="C:\MyChromeProfile"
  On macOS:
  "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" --user-data-dir="/Users/username/ChromeProfiles/MyProfile"
- Log in to sites, solve CAPTCHAs, adjust settings manually. The browser saves cookies/localStorage in that directory.
- Use user_data_dir in BrowserConfig:
  browser_config = BrowserConfig(
      headless=True,
      user_data_dir="/Users/username/ChromeProfiles/MyProfile"
  )
  Now the crawler starts with those cookies, sessions, etc.
Using storage_state
Alternatively, export and reuse storage states:
browser_config = BrowserConfig(
    headless=True,
    storage_state="mystate.json"  # Pre-saved state
)
No repeated logins needed.
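One way to produce mystate.json is a one-off login with plain Playwright, then saving the context's storage state (the login URL and steps below are placeholders):
import asyncio
from playwright.async_api import async_playwright

async def save_state():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto("https://example.com/login")
        # ... log in manually in the visible window, or script it with page.fill()/page.click() ...
        await context.storage_state(path="mystate.json")  # saves cookies + localStorage
        await browser.close()

asyncio.run(save_state())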
More info: See /docs/storage_state or 16_storage_state.md
11. Proxy & Security Enhancements
Use proxy_config for authenticated proxies:
browser_config.proxy_config = {
    "server": "http://proxy.example.com:8080",
    "username": "proxyuser",
    "password": "proxypass"
}
Combine with headers or ignore_https_errors as needed.
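For example, extra request headers can sit on the browser config; this sketch assumes BrowserConfig accepts a headers dict, and the header value is a placeholder:
browser_config.headers = {"Accept-Language": "en-US,en;q=0.9"}
browser_config.ignore_https_errors = True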
More info: See /docs/proxy_security or 14_proxy_security.md
12. Screenshots, PDFs & File Downloads
Enable screenshot=True or pdf=True in CrawlerRunConfig:
run_config.screenshot = True
run_config.pdf = True
After crawling:
if result.screenshot:
    with open("page.png", "wb") as f:
        f.write(result.screenshot)

if result.pdf:
    with open("page.pdf", "wb") as f:
        f.write(result.pdf)
File Downloads:
browser_config.accept_downloads = True
browser_config.downloads_path = "./downloads"
run_config.js_code = """document.querySelector('a.download')?.click();"""
# After crawl:
print("Downloaded files:", result.downloaded_files)
More info: See /docs/screenshot_and_pdf_export or 15_screenshot_and_pdf_export.md
Also 10_file_download.md
13. Caching & Performance Optimization
Set cache_mode to reuse fetch results:
from crawl4ai import CacheMode
run_config.cache_mode = CacheMode.ENABLED
Adjust delays, increase concurrency, or use text_mode=True for faster extraction.
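Concurrency here means handing a batch of URLs to arun_many instead of looping over arun; a minimal sketch (the URLs are placeholders):
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
]

async with AsyncWebCrawler(config=browser_config) as crawler:
    results = await crawler.arun_many(urls, config=run_config)
    for r in results:
        print(r.url, "ok" if r.success else r.error_message)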
More info: See /docs/cache_modes or 9_cache_modes.md
14. Hooks for Custom Logic
Hooks let you run custom code at specific points in the crawl lifecycle, without having to create pages manually inside on_browser_created.
Use on_page_context_created to apply routing or modify page contexts before crawling the URL:
Example Hook:
async def on_page_context_created_hook(context, page, **kwargs):
    # Block all images to speed up load
    await context.route("**/*.{png,jpg,jpeg}", lambda route: route.abort())
    print("[HOOK] Image requests blocked")

async with AsyncWebCrawler(config=browser_config) as crawler:
    crawler.crawler_strategy.set_hook("on_page_context_created", on_page_context_created_hook)
    result = await crawler.arun("https://imageheavy.example.com", config=run_config)
    print("Crawl finished with images blocked.")
This hook is clean and doesn’t create a separate page itself—it just modifies the current context/page setup.
More info: See /docs/hooks_auth or 13_hooks_auth.md
15. Dockerization & Scaling
Use Docker images:
- AMD64 basic:
docker pull unclecode/crawl4ai:basic-amd64
docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64
- ARM64 for M1/M2:
docker pull unclecode/crawl4ai:basic-arm64
docker run -p 11235:11235 unclecode/crawl4ai:basic-arm64
- GPU support:
docker pull unclecode/crawl4ai:gpu-amd64
docker run --gpus all -p 11235:11235 unclecode/crawl4ai:gpu-amd64
Scale with load balancers or Kubernetes.
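Once a container is running, you talk to it over HTTP. A rough sketch with requests; the /crawl and /task/{task_id} endpoints and the payload shape are assumptions based on the project README, so check the API docs for the image you pull:
import time
import requests

# Submit a crawl job (endpoint and payload are assumptions)
resp = requests.post("http://localhost:11235/crawl", json={"urls": "https://example.com"})
task_id = resp.json()["task_id"]

# Poll until the task completes
while True:
    status = requests.get(f"http://localhost:11235/task/{task_id}").json()
    if status.get("status") == "completed":
        print(status["result"]["markdown"][:200])
        break
    time.sleep(1)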
More info: See /docs/proxy_security (for proxy) or relevant Docker instructions in README
16. Troubleshooting & Common Pitfalls
- Empty results? Relax filters, check selectors.
- Timeouts? Increase page_timeout or refine wait_for.
- CAPTCHAs? Use user_data_dir or storage_state after manual solving.
- JS errors? Try headful mode for debugging (see the snippet below).
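Headful debugging just means launching a visible browser so you can watch the page load; a small sketch:
debug_config = BrowserConfig(headless=False, verbose=True)

async with AsyncWebCrawler(config=debug_config) as crawler:
    result = await crawler.arun("https://problem.example.com", config=run_config)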
Check examples & quickstart_async.config.py for more code.
17. Comprehensive End-to-End Example
Combine hooks, JS execution, PDF saving, LLM extraction—see quickstart_async.config.py for a full example.
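A compressed sketch of what such a pipeline can look like, reusing only features shown earlier in this guide (the URL and selectors are placeholders):
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def block_images(context, page, **kwargs):
    # Speed up loads by dropping image requests
    await context.route("**/*.{png,jpg,jpeg}", lambda route: route.abort())

async def main():
    browser_cfg = BrowserConfig(headless=True)
    run_cfg = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        screenshot=True,
        pdf=True,
        js_code="window.scrollTo(0, document.body.scrollHeight);",
        extraction_strategy=JsonCssExtractionStrategy({
            "name": "Articles",
            "baseSelector": "article",
            "fields": [{"name": "title", "selector": "h2", "type": "text"}],
        }),
    )
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        crawler.crawler_strategy.set_hook("on_page_context_created", block_images)
        result = await crawler.arun("https://news.example.com", config=run_cfg)
        if result.success:
            print(json.loads(result.extracted_content))
            if result.pdf:
                with open("page.pdf", "wb") as f:
                    f.write(result.pdf)

asyncio.run(main())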
18. Further Resources & Community
- Docs: https://crawl4ai.com
- Issues & PRs: https://github.com/unclecode/crawl4ai/issues
Follow @unclecode for news & community updates.
Happy Crawling!
Leverage Crawl4AI to feed your AI models with clean, structured web data today.