Crawl4AI Quick Start Guide: Your All-in-One AI-Ready Web Crawling & AI Integration Solution

Crawl4AI, the #1 trending GitHub repository, streamlines web content extraction into AI-ready formats. Perfect for AI assistants, semantic search engines, or data pipelines, Crawl4AI transforms raw HTML into structured Markdown or JSON effortlessly. Integrate with LLMs, open-source models, or your own retrieval-augmented generation workflows.

Key Links:


Table of Contents


1. Introduction & Key Concepts

Crawl4AI transforms websites into structured, AI-friendly data. It efficiently handles large-scale crawling, integrates with both proprietary and open-source LLMs, and optimizes content for semantic search or RAG pipelines.

Quick Test:

import asyncio
from crawl4ai import AsyncWebCrawler

async def test_run():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown)

asyncio.run(test_run())

If you see Markdown output, everything is working!

More info: See /docs/introduction or 1_introduction.ex.md


2. Installation & Environment Setup

pip install crawl4ai
crawl4ai-setup
playwright install chromium

Try in Colab:
Open Colab Notebook

More info: See /docs/configuration or 2_configuration.md


3. Core Concepts & Configuration

Use AsyncWebCrawler, CrawlerRunConfig, and BrowserConfig to control crawling.

Example config:

from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

browser_config = BrowserConfig(
    headless=True,
    viewport_width=1920,
    viewport_height=1080,
    text_mode=False,
    ignore_https_errors=True,
    java_script_enabled=True
)

run_config = CrawlerRunConfig(
    css_selector="article.main",
    word_count_threshold=50,
    excluded_tags=['nav','footer'],
    exclude_external_links=True,
    wait_for="css:.article-loaded",
    page_timeout=60000,
    delay_before_return_html=1.0,
    mean_delay=0.1, 
    max_range=0.3,
    process_iframes=True,
    remove_overlay_elements=True,
    js_code="""
        (async () => {
            window.scrollTo(0, document.body.scrollHeight);
            await new Promise(r => setTimeout(r, 2000));
            document.querySelector('.load-more')?.click();
        })();
    """
)

# Use: ENABLED, DISABLED, BYPASS, READ_ONLY, WRITE_ONLY
# run_config.cache_mode = CacheMode.ENABLED

Prefixes:

  • http:// or https:// for live pages
  • file:// for local HTML files (e.g., file:///path/to/page.html)
  • raw:<html> for raw HTML strings
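
For example, a minimal sketch of the file:// and raw: prefixes (the path and HTML string here are placeholders):

import asyncio
from crawl4ai import AsyncWebCrawler

async def demo_prefixes():
    async with AsyncWebCrawler() as crawler:
        # Local file (placeholder path)
        local = await crawler.arun("file:///path/to/page.html")
        # Raw HTML string, no network fetch
        raw = await crawler.arun("raw:<html><body><h1>Hello</h1></body></html>")
        print(local.markdown[:100], raw.markdown)

asyncio.run(demo_prefixes())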

More info: See /docs/async_webcrawler or 3_async_webcrawler.ex.md


4. Basic Crawling & Simple Extraction

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://news.example.com/article", config=run_config)
    print(result.markdown) # Basic markdown content
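
If a fetch fails, check the result before using it; a minimal sketch, assuming the standard success, status_code, and error_message fields on the crawl result:

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://news.example.com/article", config=run_config)
    if result.success:
        print(result.markdown[:300])
    else:
        # status_code and error_message describe what went wrong
        print("Crawl failed:", result.status_code, result.error_message)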

More info: See /docs/browser_context_page or 4_browser_context_page.ex.md


5. Markdown Generation & AI-Optimized Output

After crawling, result.markdown_v2 provides:

  • raw_markdown: Unfiltered markdown
  • markdown_with_citations: Links as references at the bottom
  • references_markdown: A separate list of reference links
  • fit_markdown: Filtered, relevant markdown (e.g., after BM25)
  • fit_html: The HTML used to produce fit_markdown

Example:

print("RAW:", result.markdown_v2.raw_markdown[:200])
print("CITED:", result.markdown_v2.markdown_with_citations[:200])
print("REFERENCES:", result.markdown_v2.references_markdown)
print("FIT MARKDOWN:", result.markdown_v2.fit_markdown)

For AI training, fit_markdown focuses on the most relevant content.
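
fit_markdown comes from a content filter attached to the markdown generator. A hedged sketch, assuming the DefaultMarkdownGenerator and BM25ContentFilter classes and module paths available in recent releases (names may differ across versions):

from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import BM25ContentFilter

# Keep only content relevant to a query; the query string is a placeholder
run_config.markdown_generator = DefaultMarkdownGenerator(
    content_filter=BM25ContentFilter(user_query="machine learning tutorials")
)
# After crawling, result.markdown_v2.fit_markdown holds the filtered content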

More info: See /docs/markdown_generation or 5_markdown_generation.ex.md


6. Structured Data Extraction (CSS, XPath, LLM)

Extract JSON data without LLMs:

CSS:

from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
  "name": "Products",
  "baseSelector": ".product",
  "fields": [
    {"name": "title", "selector": "h2", "type": "text"},
    {"name": "price", "selector": ".price", "type": "text"}
  ]
}
run_config.extraction_strategy = JsonCssExtractionStrategy(schema)

XPath:

from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy

xpath_schema = {
  "name": "Articles",
  "baseSelector": "//div[@class='article']",
  "fields": [
    {"name":"headline","selector":".//h1","type":"text"},
    {"name":"summary","selector":".//p[@class='summary']","type":"text"}
  ]
}
run_config.extraction_strategy = JsonXPathExtractionStrategy(xpath_schema)
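
With either strategy set, the extracted data comes back as a JSON string on result.extracted_content; a minimal usage sketch (the URL is a placeholder, and browser_config/run_config are the objects configured above):

import json

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://shop.example.com/products", config=run_config)
    products = json.loads(result.extracted_content)
    for p in products:
        print(p["title"], p["price"])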

More info: See /docs/extraction_strategies or 7_extraction_strategies.ex.md


7. Advanced Extraction: LLM & Open-Source Models

Use LLMExtractionStrategy for complex tasks. Works with OpenAI or open-source models (e.g., Ollama).

from pydantic import BaseModel
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class TravelData(BaseModel):
    destination: str
    attractions: list

run_config.extraction_strategy = LLMExtractionStrategy(
    provider="ollama/nemotron",
    schema=TravelData.schema(),
    instruction="Extract destination and top attractions."
)
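
Hosted providers need an API key as well; a hedged sketch assuming the provider/api_token parameters accepted in recent releases (parameter names may differ in newer versions):

import os

run_config.extraction_strategy = LLMExtractionStrategy(
    provider="openai/gpt-4o-mini",          # LiteLLM-style provider string
    api_token=os.getenv("OPENAI_API_KEY"),  # read the key from the environment
    schema=TravelData.schema(),
    instruction="Extract destination and top attractions."
)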

More info: See /docs/extraction_strategies or 7_extraction_strategies.ex.md


8. Page Interactions, JS Execution, & Dynamic Content

Insert js_code and use wait_for to ensure content loads. Example:

run_config.js_code = """
(async () => {
   document.querySelector('.load-more')?.click();
   await new Promise(r => setTimeout(r, 2000));
})();
"""
run_config.wait_for = "css:.item-loaded"
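
For multi-step flows (e.g., clicking "load more" repeatedly), you can reuse one page across calls; a sketch assuming the session_id and js_only options on CrawlerRunConfig:

from crawl4ai.async_configs import CrawlerRunConfig

load_more_js = "document.querySelector('.load-more')?.click();"

# First request navigates and creates the named session
first = CrawlerRunConfig(session_id="news_session", js_code=load_more_js, wait_for="css:.item-loaded")
# Follow-up requests run JS in the same page without re-navigating
next_page = CrawlerRunConfig(session_id="news_session", js_code=load_more_js, js_only=True)

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://news.example.com", config=first)
    result = await crawler.arun("https://news.example.com", config=next_page)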

More info: See /docs/page_interaction or 11_page_interaction.md


result.media["images"]: List of images with src, score, alt. Score indicates relevance.

result.media["videos"], result.media["audios"] similarly hold media info.

result.links["internal"], result.links["external"], result.links["social"]: Categorized links. Each link has href, text, context, type.

result.metadata: Title, description, keywords, author.

Example:

# Images
for img in result.media["images"]:
    print("Image:", img["src"], "Score:", img["score"], "Alt:", img.get("alt","N/A"))

# Links
for link in result.links["external"]:
    print("External Link:", link["href"], "Text:", link["text"])

# Metadata
print("Page Title:", result.metadata["title"])
print("Description:", result.metadata["description"])

More info: See /docs/content_selection or 8_content_selection.ex.md


10. Authentication & Identity Preservation

Manual Setup via User Data Directory

  1. Open Chrome with a custom user data dir:

    "C:\Program Files\Google\Chrome\Application\chrome.exe" --user-data-dir="C:\MyChromeProfile"
    

    On macOS:

    "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" --user-data-dir="/Users/username/ChromeProfiles/MyProfile"
    
  2. Log in to sites, solve CAPTCHAs, adjust settings manually.
    The browser saves cookies/localStorage in that directory.

  3. Use user_data_dir in BrowserConfig:

    browser_config = BrowserConfig(
        headless=True,
        user_data_dir="/Users/username/ChromeProfiles/MyProfile"
    )
    

    Now the crawler starts with those cookies, sessions, etc.

Using storage_state

Alternatively, export and reuse storage states:

browser_config = BrowserConfig(
    headless=True,
    storage_state="mystate.json"  # Pre-saved state
)

No repeated logins needed.
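
To create mystate.json in the first place, you can log in once with plain Playwright (installed alongside Crawl4AI) and export the context state; a minimal sketch:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://example.com/login")  # placeholder login URL
    input("Log in manually in the opened browser, then press Enter...")
    context.storage_state(path="mystate.json")  # saves cookies + localStorage
    browser.close()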

More info: See /docs/storage_state or 16_storage_state.md


11. Proxy & Security Enhancements

Use proxy_config for authenticated proxies:

browser_config.proxy_config = {
    "server": "http://proxy.example.com:8080",
    "username": "proxyuser",
    "password": "proxypass"
}

Combine with headers or ignore_https_errors as needed.

More info: See /docs/proxy_security or 14_proxy_security.md


12. Screenshots, PDFs & File Downloads

Enable screenshot=True or pdf=True in CrawlerRunConfig:

run_config.screenshot = True
run_config.pdf = True

After crawling:

if result.screenshot:
    with open("page.png", "wb") as f:
        f.write(result.screenshot)

if result.pdf:
    with open("page.pdf", "wb") as f:
        f.write(result.pdf)

File Downloads:

browser_config.accept_downloads = True
browser_config.downloads_path = "./downloads"
run_config.js_code = """document.querySelector('a.download')?.click();"""

# After crawl:
print("Downloaded files:", result.downloaded_files)

More info: See /docs/screenshot_and_pdf_export or 15_screenshot_and_pdf_export.md
Also 10_file_download.md


13. Caching & Performance Optimization

Set cache_mode to reuse fetch results:

from crawl4ai import CacheMode
run_config.cache_mode = CacheMode.ENABLED

Adjust delays, increase concurrency, or set text_mode=True in BrowserConfig for lighter, faster page loads.
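
For many URLs, arun_many() crawls a list concurrently and respects the mean_delay/max_range pacing configured earlier; a minimal sketch (the URL list is a placeholder):

urls = ["https://example.com/page1", "https://example.com/page2", "https://example.com/page3"]

async with AsyncWebCrawler(config=browser_config) as crawler:
    results = await crawler.arun_many(urls, config=run_config)
    for r in results:
        print(r.url, "OK" if r.success else r.error_message)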

More info: See /docs/cache_modes or 9_cache_modes.md


14. Hooks for Custom Logic

Hooks let you run code at specific lifecycle events without creating pages manually in on_browser_created.

Use on_page_context_created to apply routing or modify page contexts before crawling the URL:

Example Hook:

async def on_page_context_created_hook(context, page, **kwargs):
    # Block all images to speed up load
    await context.route("**/*.{png,jpg,jpeg}", lambda route: route.abort())
    print("[HOOK] Image requests blocked")

async with AsyncWebCrawler(config=browser_config) as crawler:
    crawler.crawler_strategy.set_hook("on_page_context_created", on_page_context_created_hook)
    result = await crawler.arun("https://imageheavy.example.com", config=run_config)
    print("Crawl finished with images blocked.")

This hook is clean and doesn't create a separate page itself; it just modifies the current context/page setup.

More info: See /docs/hooks_auth or 13_hooks_auth.md


15. Dockerization & Scaling

Use Docker images:

  • AMD64 basic:
docker pull unclecode/crawl4ai:basic-amd64
docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64
  • ARM64 for M1/M2:
docker pull unclecode/crawl4ai:basic-arm64
docker run -p 11235:11235 unclecode/crawl4ai:basic-arm64
  • GPU support:
docker pull unclecode/crawl4ai:gpu-amd64
docker run --gpus all -p 11235:11235 unclecode/crawl4ai:gpu-amd64

Scale with load balancers or Kubernetes.
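
Once a container is running, it exposes a small REST API on port 11235. A hedged client sketch, assuming the /health, /crawl, and /task/{task_id} endpoints described in the project README (paths and payloads may differ by image version):

import requests, time

base = "http://localhost:11235"
print(requests.get(f"{base}/health").json())  # quick liveness check

# Submit a crawl job and poll for the result (endpoint shape assumed from the README)
task = requests.post(f"{base}/crawl", json={"urls": "https://example.com"}).json()
while True:
    status = requests.get(f"{base}/task/{task['task_id']}").json()
    if status.get("status") == "completed":
        print(status["result"]["markdown"][:200])
        break
    time.sleep(1)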

More info: See /docs/proxy_security (for proxy) or relevant Docker instructions in README


16. Troubleshooting & Common Pitfalls

  • Empty results? Relax filters, check selectors.
  • Timeouts? Increase page_timeout or refine wait_for.
  • CAPTCHAs? Use user_data_dir or storage_state after manual solving.
  • JS errors? Try headful mode (headless=False) for debugging.

Check examples & quickstart_async.config.py for more code.


17. Comprehensive End-to-End Example

Combine hooks, JS execution, PDF saving, LLM extraction—see quickstart_async.config.py for a full example.


18. Further Resources & Community

Follow @unclecode for news & community updates.

Happy Crawling!
Leverage Crawl4AI to feed your AI models with clean, structured web data today.