# Crawl4AI Complete SDK Documentation

**Generated:** 2025-10-19 12:56
**Format:** Ultra-Dense Reference (Optimized for AI Assistants)
**Crawl4AI Version:** 0.7.4

---

## Navigation

- [Installation & Setup](#installation--setup)
- [Quick Start](#quick-start)
- [Core API](#core-api)
- [Configuration](#configuration)
- [Crawling Patterns](#crawling-patterns)
- [Content Processing](#content-processing)
- [Extraction Strategies](#extraction-strategies)
- [Advanced Features](#advanced-features)

---

# Installation & Setup

## 1. Basic Installation

```bash
pip install crawl4ai
```

## 2. Initial Setup & Diagnostics

### 2.1 Run the Setup Command

```bash
crawl4ai-setup
```

- Performs OS-level checks (e.g., missing libs on Linux)
- Confirms your environment is ready to crawl

### 2.2 Diagnostics

```bash
crawl4ai-doctor
```

- Checks Python version compatibility
- Verifies the Playwright installation
- Inspects environment variables and library conflicts

If any issues arise, follow its suggestions (e.g., installing additional system packages) and re-run `crawl4ai-setup`.

## 3. Verifying Installation: A Simple Crawl

(Skip this step if you already ran `crawl4ai-doctor`.)

Below is a minimal Python script demonstrating a **basic** crawl. It imports our new **`BrowserConfig`** and **`CrawlerRunConfig`** for clarity, though no custom settings are passed in this example:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.example.com",
        )
        print(result.markdown[:300])  # Show the first 300 characters of extracted text

if __name__ == "__main__":
    asyncio.run(main())
```

- A headless browser session loads `example.com`
- The script prints the first 300 characters of the extracted Markdown

If errors occur, rerun `crawl4ai-doctor` or manually ensure Playwright is installed correctly.

## 4. Advanced Installation (Optional)

### 4.1 Torch, Transformers, or All

- **Text Clustering (Torch)**

  ```bash
  pip install crawl4ai[torch]
  crawl4ai-setup
  ```

- **Transformers**

  ```bash
  pip install crawl4ai[transformer]
  crawl4ai-setup
  ```

- **All Features**

  ```bash
  pip install crawl4ai[all]
  crawl4ai-setup
  ```

To pre-fetch models (optional):

```bash
crawl4ai-download-models
```

## 5. Docker (Experimental)

```bash
docker pull unclecode/crawl4ai:basic
docker run -p 11235:11235 unclecode/crawl4ai:basic
```

You can then make POST requests to `http://localhost:11235/crawl` to perform crawls, as sketched below. **Production usage** is discouraged until our new Docker approach is ready (planned for Jan or Feb 2025).
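For illustration, here is a minimal sketch of such a request using Python's `requests` library. The payload fields shown (`urls`, `priority`) are assumptions based on the legacy server API and may differ between image versions; consult the image's own documentation for the exact schema:

```python
# Minimal sketch of a crawl request against the local Docker server.
# The payload fields ("urls", "priority") are assumptions and may
# differ between image versions.
import requests

resp = requests.post(
    "http://localhost:11235/crawl",
    json={"urls": "https://example.com", "priority": 10},
)
print(resp.status_code, resp.json())
```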
## 6. Local Server Mode (Legacy)

## Summary

1. **Install** with `pip install crawl4ai` and run `crawl4ai-setup`.
2. **Diagnose** with `crawl4ai-doctor` if you see errors.
3. **Verify** by crawling `example.com` with minimal `BrowserConfig` + `CrawlerRunConfig`.

# Quick Start

# Getting Started with Crawl4AI

This tutorial shows you how to:

1. Run your **first crawl** using minimal configuration.
2. Experiment with a simple **CSS-based extraction** strategy.
3. Crawl a **dynamic** page that loads content via JavaScript.

## 1. Introduction

Crawl4AI provides:

- An asynchronous crawler, **`AsyncWebCrawler`**.
- Configurable browser and run settings via **`BrowserConfig`** and **`CrawlerRunConfig`**.
- Automatic HTML-to-Markdown conversion via **`DefaultMarkdownGenerator`** (supports optional filters).
- Multiple extraction strategies (LLM-based or “traditional” CSS/XPath-based).

## 2. Your First Crawl

Here’s a minimal Python script that creates an **`AsyncWebCrawler`**, fetches a webpage, and prints the first 300 characters of its Markdown output:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:300])  # Print first 300 chars

if __name__ == "__main__":
    asyncio.run(main())
```

- **`AsyncWebCrawler`** launches a headless browser (Chromium by default).
- It fetches `https://example.com`.
- Crawl4AI automatically converts the HTML into Markdown.

## 3. Basic Configuration (Light Introduction)

1. **`BrowserConfig`**: Controls browser behavior (headless or full UI, user agent, JavaScript toggles, etc.).
2. **`CrawlerRunConfig`**: Controls how each crawl runs (caching, extraction, timeouts, hooking, etc.).

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_conf = BrowserConfig(headless=True)  # or False to see the browser
    run_conf = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_conf
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```

> IMPORTANT: By default, the cache mode is set to `CacheMode.BYPASS` so that each crawl fetches fresh content. Set `cache_mode=CacheMode.ENABLED` to enable caching.

## 4. Generating Markdown Output

- **`result.markdown`**: The direct HTML-to-Markdown conversion.
- **`result.markdown.fit_markdown`**: The same content after applying any configured **content filter** (e.g., `PruningContentFilter`).

### Example: Using a Filter with `DefaultMarkdownGenerator`

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

md_generator = DefaultMarkdownGenerator(
    content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
)

config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    markdown_generator=md_generator
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://news.ycombinator.com", config=config)
        print("Raw Markdown length:", len(result.markdown.raw_markdown))
        print("Fit Markdown length:", len(result.markdown.fit_markdown))

if __name__ == "__main__":
    asyncio.run(main())
```

**Note**: If you do **not** specify a content filter or markdown generator, you’ll typically see only the raw Markdown. `PruningContentFilter` may add around `50ms` of processing time. We’ll dive deeper into these strategies in a dedicated **Markdown Generation** tutorial.

## 5. Simple Data Extraction (CSS-based)

```python
from crawl4ai import JsonCssExtractionStrategy
from crawl4ai import LLMConfig

# Generate a schema (one-time cost)
html = "
```
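For a complete, runnable example of CSS-based extraction, here is a minimal sketch using a hand-written schema with `JsonCssExtractionStrategy` (rather than an LLM-generated one). The URL, selectors, and field names below are illustrative placeholders, not part of the original example:

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig
from crawl4ai import JsonCssExtractionStrategy

# Hand-written schema: each element matching `baseSelector` yields one item,
# and each field is extracted from a child selector. The selectors here are
# placeholders for whatever page you actually target.
schema = {
    "name": "Example Items",
    "baseSelector": "div.item",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}

async def main():
    config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema),
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        # Extracted items arrive as a JSON string
        items = json.loads(result.extracted_content)
        print(items)

if __name__ == "__main__":
    asyncio.run(main())
```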