# Crawl4AI Complete SDK Documentation **Generated:** 2025-10-19 12:56 **Format:** Ultra-Dense Reference (Optimized for AI Assistants) **Crawl4AI Version:** 0.7.4 --- ## Navigation - [Installation & Setup](#installation--setup) - [Quick Start](#quick-start) - [Core API](#core-api) - [Configuration](#configuration) - [Crawling Patterns](#crawling-patterns) - [Content Processing](#content-processing) - [Extraction Strategies](#extraction-strategies) - [Advanced Features](#advanced-features) --- # Installation & Setup # Installation & Setup (2023 Edition) ## 1. Basic Installation ```bash pip install crawl4ai ``` ## 2. Initial Setup & Diagnostics ### 2.1 Run the Setup Command ```bash crawl4ai-setup ``` - Performs OS-level checks (e.g., missing libs on Linux) - Confirms your environment is ready to crawl ### 2.2 Diagnostics ```bash crawl4ai-doctor ``` - Check Python version compatibility - Verify Playwright installation - Inspect environment variables or library conflicts If any issues arise, follow its suggestions (e.g., installing additional system packages) and re-run `crawl4ai-setup`. ## 3. Verifying Installation: A Simple Crawl (Skip this step if you already run `crawl4ai-doctor`) Below is a minimal Python script demonstrating a **basic** crawl. It uses our new **`BrowserConfig`** and **`CrawlerRunConfig`** for clarity, though no custom settings are passed in this example: ```python import asyncio from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig async def main(): async with AsyncWebCrawler() as crawler: result = await crawler.arun( url="https://www.example.com", ) print(result.markdown[:300]) # Show the first 300 characters of extracted text if __name__ == "__main__": asyncio.run(main()) ``` - A headless browser session loads `example.com` - Crawl4AI returns ~300 characters of markdown. If errors occur, rerun `crawl4ai-doctor` or manually ensure Playwright is installed correctly. ## 4. 
Advanced Installation (Optional) ### 4.1 Torch, Transformers, or All - **Text Clustering (Torch)** ```bash pip install crawl4ai[torch] crawl4ai-setup ``` - **Transformers** ```bash pip install crawl4ai[transformer] crawl4ai-setup ``` - **All Features** ```bash pip install crawl4ai[all] crawl4ai-setup ``` ```bash crawl4ai-download-models ``` ## 5. Docker (Experimental) ```bash docker pull unclecode/crawl4ai:basic docker run -p 11235:11235 unclecode/crawl4ai:basic ``` You can then make POST requests to `http://localhost:11235/crawl` to perform crawls. **Production usage** is discouraged until our new Docker approach is ready (planned in Jan or Feb 2025). ## 6. Local Server Mode (Legacy) ## Summary 1. **Install** with `pip install crawl4ai` and run `crawl4ai-setup`. 2. **Diagnose** with `crawl4ai-doctor` if you see errors. 3. **Verify** by crawling `example.com` with minimal `BrowserConfig` + `CrawlerRunConfig`. # Quick Start # Getting Started with Crawl4AI 1. Run your **first crawl** using minimal configuration. 3. Experiment with a simple **CSS-based extraction** strategy. 5. Crawl a **dynamic** page that loads content via JavaScript. ## 1. Introduction - An asynchronous crawler, **`AsyncWebCrawler`**. - Configurable browser and run settings via **`BrowserConfig`** and **`CrawlerRunConfig`**. - Automatic HTML-to-Markdown conversion via **`DefaultMarkdownGenerator`** (supports optional filters). - Multiple extraction strategies (LLM-based or “traditional” CSS/XPath-based). ## 2. 
Your First Crawl Here’s a minimal Python script that creates an **`AsyncWebCrawler`**, fetches a webpage, and prints the first 300 characters of its Markdown output: ```python import asyncio from crawl4ai import AsyncWebCrawler async def main(): async with AsyncWebCrawler() as crawler: result = await crawler.arun("https://example.com") print(result.markdown[:300]) # Print first 300 chars if __name__ == "__main__": asyncio.run(main()) ``` - **`AsyncWebCrawler`** launches a headless browser (Chromium by default). - It fetches `https://example.com`. - Crawl4AI automatically converts the HTML into Markdown. ## 3. Basic Configuration (Light Introduction) 1. **`BrowserConfig`**: Controls browser behavior (headless or full UI, user agent, JavaScript toggles, etc.). 2. **`CrawlerRunConfig`**: Controls how each crawl runs (caching, extraction, timeouts, hooking, etc.). ```python import asyncio from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode async def main(): browser_conf = BrowserConfig(headless=True) # or False to see the browser run_conf = CrawlerRunConfig( cache_mode=CacheMode.BYPASS ) async with AsyncWebCrawler(config=browser_conf) as crawler: result = await crawler.arun( url="https://example.com", config=run_conf ) print(result.markdown) if __name__ == "__main__": asyncio.run(main()) ``` > IMPORTANT: By default cache mode is set to `CacheMode.BYPASS` to have fresh content. Set `CacheMode.ENABLED` to enable caching. ## 4. Generating Markdown Output - **`result.markdown`**: - **`result.markdown.fit_markdown`**: The same content after applying any configured **content filter** (e.g., `PruningContentFilter`). 
### Example: Using a Filter with `DefaultMarkdownGenerator`

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

md_generator = DefaultMarkdownGenerator(
    content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
)

config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    markdown_generator=md_generator
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://news.ycombinator.com", config=config)
        print("Raw Markdown length:", len(result.markdown.raw_markdown))
        print("Fit Markdown length:", len(result.markdown.fit_markdown))

if __name__ == "__main__":
    asyncio.run(main())
```

**Note**: If you do **not** specify a content filter or markdown generator, you’ll typically see only the raw Markdown. `PruningContentFilter` may add around `50ms` of processing time. We’ll dive deeper into these strategies in a dedicated **Markdown Generation** tutorial.

## 5. Simple Data Extraction (CSS-based)

```python
from crawl4ai import JsonCssExtractionStrategy
from crawl4ai import LLMConfig

# Generate a schema (one-time cost)
# Sample product markup (illustrative)
html = "<div class='product'><h2>Gaming Laptop</h2><span class='price'>$999.99</span></div>"

# Using OpenAI (requires API token)
schema = JsonCssExtractionStrategy.generate_schema(
    html,
    llm_config = LLMConfig(provider="openai/gpt-4o",api_token="your-openai-token")  # Required for OpenAI
)

# Or using Ollama (open source, no token needed)
schema = JsonCssExtractionStrategy.generate_schema(
    html,
    llm_config = LLMConfig(provider="ollama/llama3.3", api_token=None)  # Not needed for Ollama
)

# Use the schema for fast, repeated extractions
strategy = JsonCssExtractionStrategy(schema)
```

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def main():
    schema = {
        "name": "Example Items",
        "baseSelector": "div.item",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }

    # Illustrative markup matching the schema above
    raw_html = "<div class='item'><h2>Item 1</h2><a href='https://example.com/item1'>Link 1</a></div>"

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="raw://" + raw_html,
            config=CrawlerRunConfig(
                cache_mode=CacheMode.BYPASS,
                extraction_strategy=JsonCssExtractionStrategy(schema)
            )
        )
        # The JSON output is stored in 'extracted_content'
        data = json.loads(result.extracted_content)
        print(data)

if __name__ == "__main__":
    asyncio.run(main())
```

- Great for repetitive page structures (e.g., item listings, articles).
- No AI usage or costs.
- The crawler returns a JSON string you can parse or store.
- For sites where data is split across sibling elements (e.g. Hacker News), use the `"source"` field key to navigate to a sibling before extracting: `{"name": "score", "selector": "span.score", "type": "text", "source": "+ tr"}`.

> Tips: You can pass raw HTML to the crawler instead of a URL. To do so, prefix the HTML with `raw://`.

## 6. Simple Data Extraction (LLM-based)

- **Open-Source Models** (e.g., `ollama/llama3.3`, `no_token`)
- **OpenAI Models** (e.g., `openai/gpt-4`, requires `api_token`)
- Or any provider supported by the underlying library

```python
import os
import json
import asyncio
from typing import Dict
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai import LLMExtractionStrategy

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(
        ..., description="Fee for output token for the OpenAI model."
    )

async def extract_structured_data_using_llm(
    provider: str, api_token: str = None, extra_headers: Dict[str, str] = None
):
    print(f"\n--- Extracting Structured Data with {provider} ---")

    # Ollama providers (e.g. "ollama/llama3.3") need no token; others do
    if api_token is None and not provider.startswith("ollama"):
        print(f"API token is required for {provider}. Skipping this example.")
        return

    browser_config = BrowserConfig(headless=True)

    extra_args = {"temperature": 0, "top_p": 0.9, "max_tokens": 2000}
    if extra_headers:
        extra_args["extra_headers"] = extra_headers

    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        word_count_threshold=1,
        page_timeout=80000,
        extraction_strategy=LLMExtractionStrategy(
            llm_config = LLMConfig(provider=provider,api_token=api_token),
            schema=OpenAIModelFee.model_json_schema(),
            extraction_type="schema",
            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
            Do not miss any models in the entire content.""",
            extra_args=extra_args,
        ),
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://openai.com/api/pricing/", config=crawler_config
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(
        extract_structured_data_using_llm(
            provider="openai/gpt-4o", api_token=os.getenv("OPENAI_API_KEY")
        )
    )
```

- We define a Pydantic schema (`OpenAIModelFee`) describing the fields we want.

## 7. Adaptive Crawling (New!)

```python
import asyncio
from crawl4ai import AsyncWebCrawler, AdaptiveCrawler

async def adaptive_example():
    async with AsyncWebCrawler() as crawler:
        adaptive = AdaptiveCrawler(crawler)

        # Start adaptive crawling
        result = await adaptive.digest(
            start_url="https://docs.python.org/3/",
            query="async context managers"
        )

        # View results
        adaptive.print_stats()
        print(f"Crawled {len(result.crawled_urls)} pages")
        print(f"Achieved {adaptive.confidence:.0%} confidence")

if __name__ == "__main__":
    asyncio.run(adaptive_example())
```

- **Automatic stopping**: Stops when sufficient information is gathered
- **Intelligent link selection**: Follows only relevant links
- **Confidence scoring**: Know how complete your information is

## 8. Multi-URL Concurrency (Preview)

If you need to crawl multiple URLs in **parallel**, you can use `arun_many()`.
By default, Crawl4AI employs a **MemoryAdaptiveDispatcher**, automatically adjusting concurrency based on system resources. Here’s a quick glimpse: ```python import asyncio from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode async def quick_parallel_example(): urls = [ "https://example.com/page1", "https://example.com/page2", "https://example.com/page3" ] run_conf = CrawlerRunConfig( cache_mode=CacheMode.BYPASS, stream=True # Enable streaming mode ) async with AsyncWebCrawler() as crawler: # Stream results as they complete async for result in await crawler.arun_many(urls, config=run_conf): if result.success: print(f"[OK] {result.url}, length: {len(result.markdown.raw_markdown)}") else: print(f"[ERROR] {result.url} => {result.error_message}") # Or get all results at once (default behavior) run_conf = run_conf.clone(stream=False) results = await crawler.arun_many(urls, config=run_conf) for res in results: if res.success: print(f"[OK] {res.url}, length: {len(res.markdown.raw_markdown)}") else: print(f"[ERROR] {res.url} => {res.error_message}") if __name__ == "__main__": asyncio.run(quick_parallel_example()) ``` 1. **Streaming mode** (`stream=True`): Process results as they become available using `async for` 2. **Batch mode** (`stream=False`): Wait for all results to complete ## 8. Dynamic Content Example Some sites require multiple “page clicks” or dynamic JavaScript updates. 
Below is an example that **clicks** through each tab on a page so that its content loads before extraction, using **`BrowserConfig`** and **`CrawlerRunConfig`**:

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def extract_structured_data_using_css_extractor():
    print("\n--- Using JsonCssExtractionStrategy for Fast Structured Output ---")
    schema = {
        "name": "KidoCode Courses",
        "baseSelector": "section.charge-methodology .w-tab-content > div",
        "fields": [
            {
                "name": "section_title",
                "selector": "h3.heading-50",
                "type": "text",
            },
            {
                "name": "section_description",
                "selector": ".charge-content",
                "type": "text",
            },
            {
                "name": "course_name",
                "selector": ".text-block-93",
                "type": "text",
            },
            {
                "name": "course_description",
                "selector": ".course-content-text",
                "type": "text",
            },
            {
                "name": "course_icon",
                "selector": ".image-92",
                "type": "attribute",
                "attribute": "src",
            },
        ],
    }

    browser_config = BrowserConfig(headless=True, java_script_enabled=True)

    # Scroll to each tab, click it, and give its content 500 ms to render
    js_click_tabs = """
    (async () => {
        const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");
        for(let tab of tabs) {
            tab.scrollIntoView();
            tab.click();
            await new Promise(r => setTimeout(r, 500));
        }
    })();
    """

    crawler_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=JsonCssExtractionStrategy(schema),
        js_code=[js_click_tabs],
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.kidocode.com/degrees/technology", config=crawler_config
        )
        courses = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(courses)} courses")
        print(json.dumps(courses[0], indent=2))

async def main():
    await extract_structured_data_using_css_extractor()

if __name__ == "__main__":
    asyncio.run(main())
```

- **`BrowserConfig(headless=True, java_script_enabled=True)`**: Runs headless with JavaScript enabled. Set `headless=False` if you want to watch the clicks.
- **`CrawlerRunConfig(...)`**: We specify the extraction strategy and pass **`js_code`**, which clicks each tab and waits briefly for its content.
- For multi-page flows (e.g., clicking a “Next” button repeatedly), pass a `session_id` to reuse the same page and use **`js_code`** with **`wait_for`** on subsequent pages (`page > 0`) to click the button and wait for new content to load.
- **`js_only=True`** indicates we’re not re-navigating but continuing the existing session.
- Finally, call `kill_session()` to clean up the page and browser session.

## 9. Next Steps

1. Performed a basic crawl and printed Markdown.
2. Used **content filters** with a markdown generator.
3. Extracted JSON via **CSS** or **LLM** strategies.
4. Handled **dynamic** pages with JavaScript triggers.

# Core API

# AsyncWebCrawler

The **`AsyncWebCrawler`** is the core class for asynchronous web crawling in Crawl4AI. You typically create it **once**, optionally customize it with a **`BrowserConfig`** (e.g., headless, user agent), then **run** multiple **`arun()`** calls with different **`CrawlerRunConfig`** objects.

1. **Create** a `BrowserConfig` for global browser settings.
2. **Instantiate** `AsyncWebCrawler(config=browser_config)`.
3. **Use** the crawler in an async context manager (`async with`) or manage start/close manually.
4. **Call** `arun(url, config=crawler_run_config)` for each page you want.

## 1. Constructor Overview

```python
class AsyncWebCrawler:
    def __init__(
        self,
        crawler_strategy: Optional[AsyncCrawlerStrategy] = None,
        config: Optional[BrowserConfig] = None,
        always_bypass_cache: bool = False,            # deprecated
        always_by_pass_cache: Optional[bool] = None,  # also deprecated
        base_directory: str = ...,
        thread_safe: bool = False,
        **kwargs,
    ):
        """
        Create an AsyncWebCrawler instance.

        Args:
            crawler_strategy: (Advanced) Provide a custom crawler strategy if needed.
            config: A BrowserConfig object specifying how the browser is set up.
            always_bypass_cache: (Deprecated) Use CrawlerRunConfig.cache_mode instead.
            base_directory: Folder for storing caches/logs (if relevant).
            thread_safe: If True, attempts some concurrency safeguards. Usually False.
            **kwargs: Additional legacy or debugging parameters.
        """
```

### Typical Initialization

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_cfg = BrowserConfig(
    browser_type="chromium",
    headless=True,
    verbose=True
)

crawler = AsyncWebCrawler(config=browser_cfg)
```

**Notes**:

- **Legacy** parameters like `always_bypass_cache` remain for backward compatibility, but prefer to set **caching** in `CrawlerRunConfig`.

---

## 2. Lifecycle: Start/Close or Context Manager

### 2.1 Context Manager (Recommended)

```python
async with AsyncWebCrawler(config=browser_cfg) as crawler:
    result = await crawler.arun("https://example.com")
    # The crawler automatically starts/closes resources
```

When the `async with` block ends, the crawler cleans up (closes the browser, etc.).

### 2.2 Manual Start & Close

```python
crawler = AsyncWebCrawler(config=browser_cfg)
await crawler.start()

result1 = await crawler.arun("https://example.com")
result2 = await crawler.arun("https://another.com")

await crawler.close()
```

Use this style if you have a **long-running** application or need full control of the crawler’s lifecycle.

---

## 3. Primary Method: `arun()`

```python
async def arun(
    url: str,
    config: Optional[CrawlerRunConfig] = None,
    # Legacy parameters for backward compatibility...
) -> CrawlResult:
    ...
```

### 3.1 New Approach

You pass a `CrawlerRunConfig` object that sets up everything about a crawl: content filtering, caching, session reuse, JS code, screenshots, etc.
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

run_cfg = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    css_selector="main.article",
    word_count_threshold=10,
    screenshot=True
)

# browser_cfg is the BrowserConfig created during initialization
async with AsyncWebCrawler(config=browser_cfg) as crawler:
    result = await crawler.arun("https://example.com/news", config=run_cfg)
```

### 3.2 Legacy Parameters Still Accepted

For **backward** compatibility, `arun()` can still accept direct arguments like `css_selector=...`, `word_count_threshold=...`, etc., but we strongly advise migrating them into a **`CrawlerRunConfig`**.

---

## 4. Batch Processing: `arun_many()`

```python
async def arun_many(
    urls: List[str],
    config: Optional[CrawlerRunConfig] = None,
    # Legacy parameters maintained for backwards compatibility...
) -> Union[List[CrawlResult], AsyncGenerator[CrawlResult, None]]:
    ...
```

### 4.1 Resource-Aware Crawling

The `arun_many()` method now uses an intelligent dispatcher that:

- Monitors system memory usage
- Implements adaptive rate limiting
- Provides detailed progress monitoring
- Manages concurrent crawls efficiently

### 4.2 Example Usage

Check page [Multi-url Crawling](../advanced/multi-url-crawling.md) for a detailed example of how to use `arun_many()`.

### 4.3 Key Features

1. **Rate Limiting**
   - Automatic delay between requests
   - Exponential backoff on rate limit detection
   - Domain-specific rate limiting
   - Configurable retry strategy
2. **Resource Monitoring**
   - Memory usage tracking
   - Adaptive concurrency based on system load
   - Automatic pausing when resources are constrained
3. **Progress Monitoring**
   - Detailed or aggregated progress display
   - Real-time status updates
   - Memory usage statistics
4. **Error Handling**
   - Graceful handling of rate limits
   - Automatic retries with backoff
   - Detailed error reporting

## 5. `CrawlResult` Output

Each `arun()` returns a **`CrawlResult`** containing:

- `url`: Final URL (if redirected).
- `html`: Original HTML.
- `cleaned_html`: Sanitized HTML.
- `markdown_v2`: Deprecated.
Instead just use regular `markdown` - `extracted_content`: If an extraction strategy was used (JSON for CSS/LLM strategies). - `screenshot`, `pdf`: If screenshots/PDF requested. - `media`, `links`: Information about discovered images/links. - `success`, `error_message`: Status info. ## 6. Quick Example ```python import asyncio from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode from crawl4ai import JsonCssExtractionStrategy import json async def main(): # 1. Browser config browser_cfg = BrowserConfig( browser_type="firefox", headless=False, verbose=True ) # 2. Run config schema = { "name": "Articles", "baseSelector": "article.post", "fields": [ { "name": "title", "selector": "h2", "type": "text" }, { "name": "url", "selector": "a", "type": "attribute", "attribute": "href" } ] } run_cfg = CrawlerRunConfig( cache_mode=CacheMode.BYPASS, extraction_strategy=JsonCssExtractionStrategy(schema), word_count_threshold=15, remove_overlay_elements=True, wait_for="css:.post" # Wait for posts to appear ) async with AsyncWebCrawler(config=browser_cfg) as crawler: result = await crawler.arun( url="https://example.com/blog", config=run_cfg ) if result.success: print("Cleaned HTML length:", len(result.cleaned_html)) if result.extracted_content: articles = json.loads(result.extracted_content) print("Extracted articles:", articles[:2]) else: print("Error:", result.error_message) asyncio.run(main()) ``` - We define a **`BrowserConfig`** with Firefox, no headless, and `verbose=True`.  - We define a **`CrawlerRunConfig`** that **bypasses cache**, uses a **CSS** extraction schema, has a `word_count_threshold=15`, etc.  - We pass them to `AsyncWebCrawler(config=...)` and `arun(url=..., config=...)`. ## 7. Best Practices & Migration Notes 1. **Use** `BrowserConfig` for **global** settings about the browser’s environment.  2. **Use** `CrawlerRunConfig` for **per-crawl** logic (caching, content filtering, extraction strategies, wait conditions).  3. 
**Avoid** legacy parameters like `css_selector` or `word_count_threshold` directly in `arun()`. Instead: ```python run_cfg = CrawlerRunConfig(css_selector=".main-content", word_count_threshold=20) result = await crawler.arun(url="...", config=run_cfg) ``` ## 8. Summary - **Constructor** accepts **`BrowserConfig`** (or defaults).  - **`arun(url, config=CrawlerRunConfig)`** is the main method for single-page crawls.  - **`arun_many(urls, config=CrawlerRunConfig)`** handles concurrency across multiple URLs.  - For advanced lifecycle control, use `start()` and `close()` explicitly.  - If you used `AsyncWebCrawler(browser_type="chromium", css_selector="...")`, move browser settings to `BrowserConfig(...)` and content/crawl logic to `CrawlerRunConfig(...)`. # `arun()` Parameter Guide (New Approach) In Crawl4AI’s **latest** configuration model, nearly all parameters that once went directly to `arun()` are now part of **`CrawlerRunConfig`**. When calling `arun()`, you provide: ```python await crawler.arun( url="https://example.com", config=my_run_config ) ``` Below is an organized look at the parameters that can go inside `CrawlerRunConfig`, divided by their functional areas. For **Browser** settings (e.g., `headless`, `browser_type`), see [BrowserConfig](./parameters.md). ## 1. Core Usage ```python from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode async def main(): run_config = CrawlerRunConfig( verbose=True, # Detailed logging cache_mode=CacheMode.ENABLED, # Use normal read/write cache check_robots_txt=True, # Respect robots.txt rules # ... other parameters ) async with AsyncWebCrawler() as crawler: result = await crawler.arun( url="https://example.com", config=run_config ) # Check if blocked by robots.txt if not result.success and result.status_code == 403: print(f"Error: {result.error_message}") ``` - `verbose=True` logs each crawl step.  - `cache_mode` decides how to read/write the local crawl cache. ## 2. 
Cache Control **`cache_mode`** (default: `CacheMode.ENABLED`) Use a built-in enum from `CacheMode`: - `ENABLED`: Normal caching—reads if available, writes if missing. - `DISABLED`: No caching—always refetch pages. - `READ_ONLY`: Reads from cache only; no new writes. - `WRITE_ONLY`: Writes to cache but doesn’t read existing data. - `BYPASS`: Skips reading cache for this crawl (though it might still write if set up that way). ```python run_config = CrawlerRunConfig( cache_mode=CacheMode.BYPASS ) ``` - `bypass_cache=True` acts like `CacheMode.BYPASS`. - `disable_cache=True` acts like `CacheMode.DISABLED`. - `no_cache_read=True` acts like `CacheMode.WRITE_ONLY`. - `no_cache_write=True` acts like `CacheMode.READ_ONLY`. ## 3. Content Processing & Selection ### 3.1 Text Processing ```python run_config = CrawlerRunConfig( word_count_threshold=10, # Ignore text blocks <10 words only_text=False, # If True, tries to remove non-text elements keep_data_attributes=False # Keep or discard data-* attributes ) ``` ### 3.2 Content Selection ```python run_config = CrawlerRunConfig( css_selector=".main-content", # Focus on .main-content region only excluded_tags=["form", "nav"], # Remove entire tag blocks remove_forms=True, # Specifically strip
elements remove_overlay_elements=True, # Attempt to remove modals/popups ) ``` ### 3.3 Link Handling ```python run_config = CrawlerRunConfig( exclude_external_links=True, # Remove external links from final content exclude_social_media_links=True, # Remove links to known social sites exclude_domains=["ads.example.com"], # Exclude links to these domains exclude_social_media_domains=["facebook.com","twitter.com"], # Extend the default list ) ``` ### 3.4 Media Filtering ```python run_config = CrawlerRunConfig( exclude_external_images=True # Strip images from other domains ) ``` ## 4. Page Navigation & Timing ### 4.1 Basic Browser Flow ```python run_config = CrawlerRunConfig( wait_for="css:.dynamic-content", # Wait for .dynamic-content delay_before_return_html=2.0, # Wait 2s before capturing final HTML page_timeout=60000, # Navigation & script timeout (ms) ) ``` - `wait_for`: - `"css:selector"` or - `"js:() => boolean"` e.g. `js:() => document.querySelectorAll('.item').length > 10`. - `mean_delay` & `max_range`: define random delays for `arun_many()` calls.  - `semaphore_count`: concurrency limit when crawling multiple URLs. ### 4.2 JavaScript Execution ```python run_config = CrawlerRunConfig( js_code=[ "window.scrollTo(0, document.body.scrollHeight);", "document.querySelector('.load-more')?.click();" ], js_only=False ) ``` - `js_code` can be a single string or a list of strings.  - `js_only=True` means “I’m continuing in the same session with new JS steps, no new full navigation.” ### 4.3 Anti-Bot ```python run_config = CrawlerRunConfig( magic=True, simulate_user=True, override_navigator=True ) ``` - `magic=True` tries multiple stealth features.  - `simulate_user=True` mimics mouse movements or random delays.  - `override_navigator=True` fakes some navigator properties (like user agent checks). ## 5. 
Session Management **`session_id`**: ```python run_config = CrawlerRunConfig( session_id="my_session123" ) ``` If re-used in subsequent `arun()` calls, the same tab/page context is continued (helpful for multi-step tasks or stateful browsing). ## 6. Screenshot, PDF & Media Options ```python run_config = CrawlerRunConfig( screenshot=True, # Grab a screenshot as base64 screenshot_wait_for=1.0, # Wait 1s before capturing pdf=True, # Also produce a PDF image_description_min_word_threshold=5, # If analyzing alt text image_score_threshold=3, # Filter out low-score images ) ``` - `result.screenshot` → Base64 screenshot string. - `result.pdf` → Byte array with PDF data. ## 7. Extraction Strategy **For advanced data extraction** (CSS/LLM-based), set `extraction_strategy`: ```python run_config = CrawlerRunConfig( extraction_strategy=my_css_or_llm_strategy ) ``` The extracted data will appear in `result.extracted_content`. ## 8. Comprehensive Example Below is a snippet combining many parameters: ```python import asyncio from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode from crawl4ai import JsonCssExtractionStrategy async def main(): # Example schema schema = { "name": "Articles", "baseSelector": "article.post", "fields": [ {"name": "title", "selector": "h2", "type": "text"}, {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"} ] } run_config = CrawlerRunConfig( # Core verbose=True, cache_mode=CacheMode.ENABLED, check_robots_txt=True, # Respect robots.txt rules # Content word_count_threshold=10, css_selector="main.content", excluded_tags=["nav", "footer"], exclude_external_links=True, # Page & JS js_code="document.querySelector('.show-more')?.click();", wait_for="css:.loaded-block", page_timeout=30000, # Extraction extraction_strategy=JsonCssExtractionStrategy(schema), # Session session_id="persistent_session", # Media screenshot=True, pdf=True, # Anti-bot simulate_user=True, magic=True, ) async with AsyncWebCrawler() as crawler: result = 
await crawler.arun("https://example.com/posts", config=run_config) if result.success: print("HTML length:", len(result.cleaned_html)) print("Extraction JSON:", result.extracted_content) if result.screenshot: print("Screenshot length:", len(result.screenshot)) if result.pdf: print("PDF bytes length:", len(result.pdf)) else: print("Error:", result.error_message) if __name__ == "__main__": asyncio.run(main()) ``` 1. **Crawling** the main content region, ignoring external links.  2. Running **JavaScript** to click “.show-more”.  3. **Waiting** for “.loaded-block” to appear.  4. Generating a **screenshot** & **PDF** of the final page.  ## 9. Best Practices 1. **Use `BrowserConfig` for global browser** settings (headless, user agent).  2. **Use `CrawlerRunConfig`** to handle the **specific** crawl needs: content filtering, caching, JS, screenshot, extraction, etc.  4. **Limit** large concurrency (`semaphore_count`) if the site or your system can’t handle it.  5. For dynamic pages, set `js_code` or `scan_full_page` so you load all content. ## 10. Conclusion All parameters that used to be direct arguments to `arun()` now belong in **`CrawlerRunConfig`**. This approach: - Makes code **clearer** and **more maintainable**.  # `arun_many(...)` Reference > **Note**: This function is very similar to [`arun()`](./arun.md) but focused on **concurrent** or **batch** crawling. If you’re unfamiliar with `arun()` usage, please read that doc first, then review this for differences. ## Function Signature ```python async def arun_many( urls: Union[List[str], List[Any]], config: Optional[Union[CrawlerRunConfig, List[CrawlerRunConfig]]] = None, dispatcher: Optional[BaseDispatcher] = None, ... ) -> Union[List[CrawlResult], AsyncGenerator[CrawlResult, None]]: """ Crawl multiple URLs concurrently or in batches. :param urls: A list of URLs (or tasks) to crawl. 
    :param config: (Optional) Either:
        - A single `CrawlerRunConfig` applying to all URLs
        - A list of `CrawlerRunConfig` objects with url_matcher patterns
    :param dispatcher: (Optional) A concurrency controller (e.g. MemoryAdaptiveDispatcher).
    ...
    :return: Either a list of `CrawlResult` objects, or an async generator if streaming is enabled.
    """
```

## Differences from `arun()`

1. **Multiple URLs**:
   - Instead of crawling a single URL, you pass a list of them (strings or tasks).
   - The function returns either a **list** of `CrawlResult` or an **async generator** if streaming is enabled.
2. **Concurrency & Dispatchers**:
   - **`dispatcher`** param allows advanced concurrency control.
   - If omitted, a default dispatcher (like `MemoryAdaptiveDispatcher`) is used internally.
3. **Streaming Support**:
   - Enable streaming by setting `stream=True` in your `CrawlerRunConfig`.
   - When streaming, use `async for` to process results as they become available.
4. **Parallel Execution**:
   - `arun_many()` can run multiple requests concurrently under the hood.
   - Each `CrawlResult` might also include a **`dispatch_result`** with concurrency details (like memory usage, start/end times).
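The difference between the list return and the async-generator return can be sketched with plain `asyncio`, without Crawl4AI installed. This is only an illustration of the two consumption patterns, not the library's dispatcher: `fake_crawl`, `batch`, `stream`, and the delays are hypothetical stand-ins, and the real dispatcher additionally handles memory monitoring and rate limiting.

```python
import asyncio
from typing import AsyncIterator, Dict, List


async def fake_crawl(url: str, delay: float) -> Dict[str, object]:
    """Hypothetical stand-in for one crawl: sleep, then return a result dict."""
    await asyncio.sleep(delay)
    return {"url": url, "success": True}


async def batch(jobs: Dict[str, float]) -> List[Dict[str, object]]:
    """Batch style: wait for every crawl, results come back in input order."""
    return list(await asyncio.gather(*(fake_crawl(u, d) for u, d in jobs.items())))


async def stream(jobs: Dict[str, float]) -> AsyncIterator[Dict[str, object]]:
    """Streaming style: yield each result as soon as its crawl finishes."""
    tasks = [asyncio.create_task(fake_crawl(u, d)) for u, d in jobs.items()]
    for finished in asyncio.as_completed(tasks):
        yield await finished


async def demo() -> Dict[str, List[str]]:
    # Made-up delays (seconds) so that completion order differs from input order
    jobs = {"https://site1.com": 0.3, "https://site2.com": 0.05, "https://site3.com": 0.15}
    batch_order = [r["url"] for r in await batch(jobs)]
    stream_order = [r["url"] async for r in stream(jobs)]
    return {"batch": batch_order, "stream": stream_order}


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the batch style the caller sees nothing until the slowest job finishes. In the streaming style the fastest job is available first, which is why streaming suits pipelines that process each page immediately.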
### Basic Example (Batch Mode) ```python # Minimal usage: The default dispatcher will be used results = await crawler.arun_many( urls=["https://site1.com", "https://site2.com"], config=CrawlerRunConfig(stream=False) # Default behavior ) for res in results: if res.success: print(res.url, "crawled OK!") else: print("Failed:", res.url, "-", res.error_message) ``` ### Streaming Example ```python config = CrawlerRunConfig( stream=True, # Enable streaming mode cache_mode=CacheMode.BYPASS ) # Process results as they complete async for result in await crawler.arun_many( urls=["https://site1.com", "https://site2.com", "https://site3.com"], config=config ): if result.success: print(f"Just completed: {result.url}") # Process each result immediately process_result(result) ``` ### With a Custom Dispatcher ```python dispatcher = MemoryAdaptiveDispatcher( memory_threshold_percent=70.0, max_session_permit=10 ) results = await crawler.arun_many( urls=["https://site1.com", "https://site2.com", "https://site3.com"], config=my_run_config, dispatcher=dispatcher ) ``` ### URL-Specific Configurations Instead of using one config for all URLs, provide a list of configs with `url_matcher` patterns: ```python from crawl4ai import CrawlerRunConfig, MatchMode from crawl4ai.processors.pdf import PDFContentScrapingStrategy from crawl4ai.extraction_strategy import JsonCssExtractionStrategy from crawl4ai.content_filter_strategy import PruningContentFilter from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator # PDF files - specialized extraction pdf_config = CrawlerRunConfig( url_matcher="*.pdf", scraping_strategy=PDFContentScrapingStrategy() ) # Blog/article pages - content filtering blog_config = CrawlerRunConfig( url_matcher=["*/blog/*", "*/article/*", "*python.org*"], markdown_generator=DefaultMarkdownGenerator( content_filter=PruningContentFilter(threshold=0.48) ) ) # Dynamic pages - JavaScript execution github_config = CrawlerRunConfig( url_matcher=lambda url: 
'github.com' in url, js_code="window.scrollTo(0, 500);" ) # API endpoints - JSON extraction api_config = CrawlerRunConfig( url_matcher=lambda url: 'api' in url or url.endswith(('.json', '/json')), # Custom settings for JSON extraction ) # Default fallback config default_config = CrawlerRunConfig() # No url_matcher means it never matches except as fallback # Pass the list of configs - first match wins! results = await crawler.arun_many( urls=[ "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf", # → pdf_config "https://blog.python.org/", # → blog_config "https://github.com/microsoft/playwright", # → github_config "https://httpbin.org/json", # → api_config "https://example.com/" # → default_config ], config=[pdf_config, blog_config, github_config, api_config, default_config] ) ``` - **String patterns**: `"*.pdf"`, `"*/blog/*"`, `"*python.org*"` - **Function matchers**: `lambda url: 'api' in url` - **Mixed patterns**: Combine strings and functions with `MatchMode.OR` or `MatchMode.AND` - **First match wins**: Configs are evaluated in order - `dispatch_result` in each `CrawlResult` (if using concurrency) can hold memory and timing info. - **Important**: Always include a default config (without `url_matcher`) as the last item if you want to handle all URLs. Otherwise, unmatched URLs will fail. ### Return Value Either a **list** of [`CrawlResult`](./crawl-result.md) objects, or an **async generator** if streaming is enabled. You can iterate to check `result.success` or read each item’s `extracted_content`, `markdown`, or `dispatch_result`. ## Dispatcher Reference - **`MemoryAdaptiveDispatcher`**: Dynamically manages concurrency based on system memory usage. - **`SemaphoreDispatcher`**: Fixed concurrency limit, simpler but less adaptive. ## Common Pitfalls 3. **Error Handling**: Each `CrawlResult` might fail for different reasons, so always check `result.success` or the `error_message` before proceeding.
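The "first match wins" rule for URL-specific configurations, together with the advice to put a matcher-less default last, can be sketched in plain Python. This is an illustration of the ordering logic only, not Crawl4AI's internal matcher: `matches`, `pick_config`, and `RULES` are hypothetical names, glob matching is approximated with the standard library's `fnmatchcase`, a list of patterns is combined with OR, and a `None` matcher is assumed to act as the catch-all fallback.

```python
from fnmatch import fnmatchcase
from typing import Callable, List, Optional, Tuple, Union

# A matcher is a glob string, a predicate, a list of either (OR-combined), or None.
Matcher = Union[str, Callable[[str], bool], list, None]


def matches(matcher: Matcher, url: str) -> bool:
    """Return True if the matcher accepts the URL."""
    if matcher is None:  # assumption: no matcher means "fallback, accepts anything"
        return True
    if isinstance(matcher, list):  # assumption: list entries are combined with OR
        return any(matches(m, url) for m in matcher)
    if callable(matcher):
        return bool(matcher(url))
    return fnmatchcase(url, matcher)


def pick_config(url: str, rules: List[Tuple[str, Matcher]]) -> Optional[str]:
    """Walk the rules in order and return the name of the first one that matches."""
    for name, matcher in rules:
        if matches(matcher, url):
            return name
    return None  # nothing matched and no fallback was supplied


RULES: List[Tuple[str, Matcher]] = [
    ("pdf", "*.pdf"),
    ("blog", ["*/blog/*", "*/article/*", "*python.org*"]),
    ("github", lambda url: "github.com" in url),
    ("default", None),  # must come last, otherwise it would shadow every other rule
]

for url in (
    "https://example.com/files/report.pdf",
    "https://blog.python.org/",
    "https://github.com/microsoft/playwright",
    "https://example.com/",
):
    print(url, "->", pick_config(url, RULES))
```

Dropping the final `("default", None)` entry makes `pick_config` return `None` for `https://example.com/`, which mirrors the warning above that unmatched URLs fail when no default config is supplied.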
## Conclusion Use `arun_many()` when you want to **crawl multiple URLs** simultaneously or in controlled parallel tasks. If you need advanced concurrency features (like memory-based adaptive throttling or complex rate-limiting), provide a **dispatcher**. Each result is a standard `CrawlResult`, possibly augmented with concurrency stats (`dispatch_result`) for deeper inspection. For more details on concurrency logic and dispatchers, see the [Advanced Multi-URL Crawling](../advanced/multi-url-crawling.md) docs. # `CrawlResult` Reference The **`CrawlResult`** class encapsulates everything returned after a single crawl operation. It provides the **raw or processed content**, details on links and media, plus optional metadata (like screenshots, PDFs, or extracted JSON). **Location**: `crawl4ai/crawler/models.py` (for reference) ```python class CrawlResult(BaseModel): url: str html: str success: bool cleaned_html: Optional[str] = None fit_html: Optional[str] = None # Preprocessed HTML optimized for extraction media: Dict[str, List[Dict]] = {} links: Dict[str, List[Dict]] = {} downloaded_files: Optional[List[str]] = None screenshot: Optional[str] = None pdf : Optional[bytes] = None mhtml: Optional[str] = None markdown: Optional[Union[str, MarkdownGenerationResult]] = None extracted_content: Optional[str] = None metadata: Optional[dict] = None error_message: Optional[str] = None session_id: Optional[str] = None response_headers: Optional[dict] = None status_code: Optional[int] = None ssl_certificate: Optional[SSLCertificate] = None dispatch_result: Optional[DispatchResult] = None ... ``` ## 1. Basic Crawl Info ### 1.1 **`url`** *(str)* ```python print(result.url) # e.g., "https://example.com/" ``` ### 1.2 **`success`** *(bool)* **What**: `True` if the crawl pipeline ended without major errors; `False` otherwise. 
```python if not result.success: print(f"Crawl failed: {result.error_message}") ``` ### 1.3 **`status_code`** *(Optional[int])* ```python if result.status_code == 404: print("Page not found!") ``` ### 1.4 **`error_message`** *(Optional[str])* **What**: If `success=False`, a textual description of the failure. ```python if not result.success: print("Error:", result.error_message) ``` ### 1.5 **`session_id`** *(Optional[str])* ```python # If you used session_id="login_session" in CrawlerRunConfig, see it here: print("Session:", result.session_id) ``` ### 1.6 **`response_headers`** *(Optional[dict])* ```python if result.response_headers: print("Server:", result.response_headers.get("Server", "Unknown")) ``` ### 1.7 **`ssl_certificate`** *(Optional[SSLCertificate])* **What**: If `fetch_ssl_certificate=True` in your CrawlerRunConfig, **`result.ssl_certificate`** contains a [**`SSLCertificate`**](../advanced/ssl-certificate.md) object describing the site's certificate. You can export the cert in multiple formats (PEM/DER/JSON) or access its properties like `issuer`, `subject`, `valid_from`, `valid_until`, etc. ```python if result.ssl_certificate: print("Issuer:", result.ssl_certificate.issuer) ``` ## 2. Raw / Cleaned Content ### 2.1 **`html`** *(str)* ```python # Possibly large print(len(result.html)) ``` ### 2.2 **`cleaned_html`** *(Optional[str])* **What**: A sanitized HTML version—scripts, styles, or excluded tags are removed based on your `CrawlerRunConfig`. ```python print(result.cleaned_html[:500]) # Show a snippet ``` ## 3. Markdown Fields ### 3.1 The Markdown Generation Approach - **Raw** markdown - **Links as citations** (with a references section) - **Fit** markdown if a **content filter** is used (like Pruning or BM25) **`MarkdownGenerationResult`** includes: - **`raw_markdown`** *(str)*: The full HTML→Markdown conversion. - **`markdown_with_citations`** *(str)*: Same markdown, but with link references as academic-style citations. 
- **`references_markdown`** *(str)*: The reference list or footnotes at the end. - **`fit_markdown`** *(Optional[str])*: If content filtering (Pruning/BM25) was applied, the filtered "fit" text. - **`fit_html`** *(Optional[str])*: The HTML that led to `fit_markdown`. ```python if result.markdown: md_res = result.markdown print("Raw MD:", md_res.raw_markdown[:300]) print("Citations MD:", md_res.markdown_with_citations[:300]) print("References:", md_res.references_markdown) if md_res.fit_markdown: print("Pruned text:", md_res.fit_markdown[:300]) ``` ### 3.2 **`markdown`** *(Optional[Union[str, MarkdownGenerationResult]])* **What**: Holds the `MarkdownGenerationResult`. ```python print(result.markdown.raw_markdown[:200]) print(result.markdown.fit_markdown) print(result.markdown.fit_html) ``` **Important**: "Fit" content (in `fit_markdown`/`fit_html`) exists in `result.markdown` only if you used a **filter** (like **PruningContentFilter** or **BM25ContentFilter**) within a `MarkdownGenerationStrategy`. ## 4. Media & Links ### 4.1 **`media`** *(Dict[str, List[Dict]])* **What**: Contains info about discovered images, videos, or audio. Typical keys: `"images"`, `"videos"`, `"audios"`. - `src` *(str)*: Media URL - `alt` or `title` *(str)*: Descriptive text - `score` *(float)*: Relevance score if the crawler's heuristic found it "important" - `desc` or `description` *(Optional[str])*: Additional context extracted from surrounding text ```python images = result.media.get("images", []) for img in images: if img.get("score", 0) > 5: print("High-value image:", img["src"]) ``` ### 4.2 **`links`** *(Dict[str, List[Dict]])* **What**: Holds internal and external link data. Usually two keys: `"internal"` and `"external"`.
- `href` *(str)*: The link target - `text` *(str)*: Link text - `title` *(str)*: Title attribute - `context` *(str)*: Surrounding text snippet - `domain` *(str)*: If external, the domain ```python for link in result.links["internal"]: print(f"Internal link to {link['href']} with text {link['text']}") ``` ## 5. Additional Fields ### 5.1 **`extracted_content`** *(Optional[str])* **What**: If you used **`extraction_strategy`** (CSS, LLM, etc.), the structured output (JSON). ```python import json if result.extracted_content: data = json.loads(result.extracted_content) print(data) ``` ### 5.2 **`downloaded_files`** *(Optional[List[str]])* **What**: If `accept_downloads=True` (together with a `downloads_path`) is set in your `BrowserConfig`, lists local file paths for downloaded items. ```python if result.downloaded_files: for file_path in result.downloaded_files: print("Downloaded:", file_path) ``` ### 5.3 **`screenshot`** *(Optional[str])* **What**: Base64-encoded screenshot if `screenshot=True` in `CrawlerRunConfig`. ```python import base64 if result.screenshot: with open("page.png", "wb") as f: f.write(base64.b64decode(result.screenshot)) ``` ### 5.4 **`pdf`** *(Optional[bytes])* **What**: Raw PDF bytes if `pdf=True` in `CrawlerRunConfig`. ```python if result.pdf: with open("page.pdf", "wb") as f: f.write(result.pdf) ``` ### 5.5 **`mhtml`** *(Optional[str])* **What**: MHTML snapshot of the page if `capture_mhtml=True` in `CrawlerRunConfig`. MHTML (MIME HTML) format preserves the entire web page with all its resources (CSS, images, scripts, etc.) in a single file. ```python if result.mhtml: with open("page.mhtml", "w", encoding="utf-8") as f: f.write(result.mhtml) ``` ### 5.6 **`metadata`** *(Optional[dict])* ```python if result.metadata: print("Title:", result.metadata.get("title")) print("Author:", result.metadata.get("author")) ``` ## 6. 
`dispatch_result` (optional) A `DispatchResult` object providing additional concurrency and resource usage information when crawling URLs in parallel (e.g., via `arun_many()` with custom dispatchers). It contains: - **`task_id`**: A unique identifier for the parallel task. - **`memory_usage`** (float): The memory (in MB) used at the time of completion. - **`peak_memory`** (float): The peak memory usage (in MB) recorded during the task's execution. - **`start_time`** / **`end_time`** (datetime): Time range for this crawling task. - **`error_message`** (str): Any dispatcher- or concurrency-related error encountered. ```python # Example usage: for result in results: if result.success and result.dispatch_result: dr = result.dispatch_result print(f"URL: {result.url}, Task ID: {dr.task_id}") print(f"Memory: {dr.memory_usage:.1f} MB (Peak: {dr.peak_memory:.1f} MB)") print(f"Duration: {dr.end_time - dr.start_time}") ``` > **Note**: This field is typically populated when using `arun_many(...)` alongside a **dispatcher** (e.g., `MemoryAdaptiveDispatcher` or `SemaphoreDispatcher`). If no concurrency or dispatcher is used, `dispatch_result` may remain `None`. ## 7. Network Requests & Console Messages When you enable network and console message capturing in `CrawlerRunConfig` using `capture_network_requests=True` and `capture_console_messages=True`, the `CrawlResult` will include these fields: ### 7.1 **`network_requests`** *(Optional[List[Dict[str, Any]]])* - Each item has an `event_type` field that can be `"request"`, `"response"`, or `"request_failed"`. - Request events include `url`, `method`, `headers`, `post_data`, `resource_type`, and `is_navigation_request`. - Response events include `url`, `status`, `status_text`, `headers`, and `request_timing`. - Failed request events include `url`, `method`, `resource_type`, and `failure_text`. - All events include a `timestamp` field. 
```python if result.network_requests: # Count different types of events requests = [r for r in result.network_requests if r.get("event_type") == "request"] responses = [r for r in result.network_requests if r.get("event_type") == "response"] failures = [r for r in result.network_requests if r.get("event_type") == "request_failed"] print(f"Captured {len(requests)} requests, {len(responses)} responses, and {len(failures)} failures") # Analyze API calls api_calls = [r for r in requests if "api" in r.get("url", "")] # Identify failed resources for failure in failures: print(f"Failed to load: {failure.get('url')} - {failure.get('failure_text')}") ``` ### 7.2 **`console_messages`** *(Optional[List[Dict[str, Any]]])* - Each item has a `type` field indicating the message type (e.g., `"log"`, `"error"`, `"warning"`, etc.). - The `text` field contains the actual message text. - Some messages include `location` information (URL, line, column). - All messages include a `timestamp` field. ```python if result.console_messages: # Count messages by type message_types = {} for msg in result.console_messages: msg_type = msg.get("type", "unknown") message_types[msg_type] = message_types.get(msg_type, 0) + 1 print(f"Message type counts: {message_types}") # Display errors (which are usually most important) for msg in result.console_messages: if msg.get("type") == "error": print(f"Error: {msg.get('text')}") ``` ## 8. 
Example: Accessing Everything ```python async def handle_result(result: CrawlResult): if not result.success: print("Crawl error:", result.error_message) return # Basic info print("Crawled URL:", result.url) print("Status code:", result.status_code) # HTML print("Original HTML size:", len(result.html)) print("Cleaned HTML size:", len(result.cleaned_html or "")) # Markdown output if result.markdown: print("Raw Markdown:", result.markdown.raw_markdown[:300]) print("Citations Markdown:", result.markdown.markdown_with_citations[:300]) if result.markdown.fit_markdown: print("Fit Markdown:", result.markdown.fit_markdown[:200]) # Media & Links if "images" in result.media: print("Image count:", len(result.media["images"])) if "internal" in result.links: print("Internal link count:", len(result.links["internal"])) # Extraction strategy result if result.extracted_content: print("Structured data:", result.extracted_content) # Screenshot/PDF/MHTML if result.screenshot: print("Screenshot length:", len(result.screenshot)) if result.pdf: print("PDF bytes length:", len(result.pdf)) if result.mhtml: print("MHTML length:", len(result.mhtml)) # Network and console capturing if result.network_requests: print(f"Network requests captured: {len(result.network_requests)}") # Analyze request types req_types = {} for req in result.network_requests: if "resource_type" in req: req_types[req["resource_type"]] = req_types.get(req["resource_type"], 0) + 1 print(f"Resource types: {req_types}") if result.console_messages: print(f"Console messages captured: {len(result.console_messages)}") # Count by message type msg_types = {} for msg in result.console_messages: msg_types[msg.get("type", "unknown")] = msg_types.get(msg.get("type", "unknown"), 0) + 1 print(f"Message types: {msg_types}") ``` ## 9. Key Points & Future 1. **Deprecated legacy properties of CrawlResult** - `markdown_v2` - Deprecated in v0.5. Just use `markdown`. It holds the `MarkdownGenerationResult` now! 
- `fit_markdown` and `fit_html` - Deprecated in v0.5. They can now be accessed via `MarkdownGenerationResult` in `result.markdown`, e.g. `result.markdown.fit_markdown` and `result.markdown.fit_html`. 2. **Fit Content** - **`fit_markdown`** and **`fit_html`** appear in `MarkdownGenerationResult` only if you used a content filter (like **PruningContentFilter** or **BM25ContentFilter**) inside your **MarkdownGenerationStrategy** or set them directly. - If no filter is used, they remain `None`. 3. **References & Citations** - If you enable link citations in your `DefaultMarkdownGenerator` (`options={"citations": True}`), you’ll see `markdown_with_citations` plus a **`references_markdown`** block. This helps large language models or academic-like referencing. 4. **Links & Media** - `links["internal"]` and `links["external"]` group discovered anchors by domain. - `media["images"]` / `["videos"]` / `["audios"]` store extracted media elements with optional scoring or context. 5. **Error Cases** - If `success=False`, check `error_message` (e.g., timeouts, invalid URLs). - `status_code` might be `None` if the crawl failed before an HTTP response was received. Use **`CrawlResult`** to collect all final outputs and feed them into your data pipelines, AI models, or archives. With a properly configured **BrowserConfig** and **CrawlerRunConfig**, the crawler can produce robust, structured results here in **`CrawlResult`**. # Configuration # Browser, Crawler & LLM Configuration (Quick Overview) Crawl4AI's flexibility stems from three key classes: 1. **`BrowserConfig`** – Dictates **how** the browser is launched and behaves (e.g., headless or visible, proxy, user agent). 2. **`CrawlerRunConfig`** – Dictates **how** each **crawl** operates (e.g., caching, extraction, timeouts, JavaScript code to run, etc.). 3. **`LLMConfig`** – Dictates **how** LLM providers are configured (model, API token, base URL, temperature, etc.). 
In most examples, you create **one** `BrowserConfig` for the entire crawler session, then pass a **fresh** or re-used `CrawlerRunConfig` whenever you call `arun()`. This tutorial shows the most commonly used parameters. If you need advanced or rarely used fields, see the [Configuration Parameters](../api/parameters.md). ## 1. BrowserConfig Essentials ```python class BrowserConfig: def __init__( browser_type="chromium", headless=True, proxy_config=None, viewport_width=1080, viewport_height=600, verbose=True, use_persistent_context=False, user_data_dir=None, cookies=None, headers=None, user_agent=None, text_mode=False, light_mode=False, extra_args=None, enable_stealth=False, # ... other advanced parameters omitted here ): ... ``` ### Key Fields to Note 1. **`browser_type`** - Options: `"chromium"`, `"firefox"`, or `"webkit"`. - Defaults to `"chromium"`. - If you need a different engine, specify it here. 2. **`headless`** - `True`: Runs the browser in headless mode (invisible browser). - `False`: Runs the browser in visible mode, which helps with debugging. 3. **`proxy_config`** - A dictionary with fields like: ```json { "server": "http://proxy.example.com:8080", "username": "...", "password": "..." } ``` - Leave as `None` if a proxy is not required. 4. **`viewport_width` & `viewport_height`**: - The initial window size. - Some sites behave differently with smaller or bigger viewports. 5. **`verbose`**: - If `True`, prints extra logs. - Handy for debugging. 6. **`use_persistent_context`**: - If `True`, uses a **persistent** browser profile, storing cookies/local storage across runs. - Typically also set `user_data_dir` to point to a folder. 7. **`cookies`** & **`headers`**: - E.g. `cookies=[{"name": "session", "value": "abc123", "domain": "example.com"}]`. 8. **`user_agent`**: - Custom User-Agent string. If `None`, a default is used. - You can also set `user_agent_mode="random"` for randomization (if you want to fight bot detection). 9. 
**`text_mode`** & **`light_mode`**: - `text_mode=True` disables images, possibly speeding up text-only crawls. - `light_mode=True` turns off certain background features for performance. 10. **`extra_args`**: - Additional flags for the underlying browser. - E.g. `["--disable-extensions"]`. 11. **`enable_stealth`**: - If `True`, enables stealth mode using playwright-stealth. - Modifies browser fingerprints to avoid basic bot detection. - Default is `False`. Recommended for sites with bot protection. ### Helper Methods Both configuration classes provide a `clone()` method to create modified copies: ```python # Create a base browser config base_browser = BrowserConfig( browser_type="chromium", headless=True, text_mode=True ) # Create a visible browser config for debugging debug_browser = base_browser.clone( headless=False, verbose=True ) ``` ```python from crawl4ai import AsyncWebCrawler, BrowserConfig browser_conf = BrowserConfig( browser_type="firefox", headless=False, text_mode=True ) async with AsyncWebCrawler(config=browser_conf) as crawler: result = await crawler.arun("https://example.com") print(result.markdown[:300]) ``` ## 2. CrawlerRunConfig Essentials ```python class CrawlerRunConfig: def __init__( word_count_threshold=200, extraction_strategy=None, markdown_generator=None, cache_mode=None, js_code=None, wait_for=None, screenshot=False, pdf=False, capture_mhtml=False, # Location and Identity Parameters locale=None, # e.g. "en-US", "fr-FR" timezone_id=None, # e.g. "America/New_York" geolocation=None, # GeolocationConfig object # Resource Management enable_rate_limiting=False, rate_limit_config=None, memory_threshold_percent=70.0, check_interval=1.0, max_session_permit=20, display_mode=None, verbose=True, stream=False, # Enable streaming for arun_many() # ... other advanced parameters omitted ): ... ``` ### Key Fields to Note 1. **`word_count_threshold`**: - The minimum word count before a block is considered. 
- If your site has lots of short paragraphs or items, you can lower it. 2. **`extraction_strategy`**: - Where you plug in JSON-based extraction (CSS, LLM, etc.). - If `None`, no structured extraction is done (only raw/cleaned HTML + markdown). 3. **`markdown_generator`**: - E.g., `DefaultMarkdownGenerator(...)`, controlling how HTML→Markdown conversion is done. - If `None`, a default approach is used. 4. **`cache_mode`**: - Controls caching behavior (`ENABLED`, `BYPASS`, `DISABLED`, etc.). - If `None`, defaults to some level of caching or you can specify `CacheMode.ENABLED`. 5. **`js_code`**: - A string or list of JS strings to execute. - Great for "Load More" buttons or user interactions. 6. **`wait_for`**: - A CSS or JS expression to wait for before extracting content. - Common usage: `wait_for="css:.main-loaded"` or `wait_for="js:() => window.loaded === true"`. 7. **`screenshot`**, **`pdf`**, & **`capture_mhtml`**: - If `True`, captures a screenshot, PDF, or MHTML snapshot after the page is fully loaded. - The results go to `result.screenshot` (base64), `result.pdf` (bytes), or `result.mhtml` (string). 8. **Location Parameters**: - **`locale`**: Browser's locale (e.g., `"en-US"`, `"fr-FR"`) for language preferences - **`timezone_id`**: Browser's timezone (e.g., `"America/New_York"`, `"Europe/Paris"`) - **`geolocation`**: GPS coordinates via `GeolocationConfig(latitude=48.8566, longitude=2.3522)` 9. **`verbose`**: - Logs additional runtime details. - Overlaps with the browser's verbosity if also set to `True` in `BrowserConfig`. 10. **`enable_rate_limiting`**: - If `True`, enables rate limiting for batch processing. - Requires `rate_limit_config` to be set. 11. **`memory_threshold_percent`**: - The memory threshold (as a percentage) to monitor. - If exceeded, the crawler will pause or slow down. 12. **`check_interval`**: - The interval (in seconds) to check system resources. - Affects how often memory and CPU usage are monitored. 13. 
**`max_session_permit`**: - The maximum number of concurrent crawl sessions. - Helps prevent overwhelming the system. 14. **`url_matcher`** & **`match_mode`**: - Enable URL-specific configurations when used with `arun_many()`. - Set `url_matcher` to patterns (glob, function, or list) to match specific URLs. - Use `match_mode` (OR/AND) to control how multiple patterns combine. 15. **`display_mode`**: - The display mode for progress information (`DETAILED`, `BRIEF`, etc.). - Affects how much information is printed during the crawl. ### Helper Methods The `clone()` method is particularly useful for creating variations of your crawler configuration: ```python # Create a base configuration base_config = CrawlerRunConfig( cache_mode=CacheMode.ENABLED, word_count_threshold=200, wait_until="networkidle" ) # Create variations for different use cases stream_config = base_config.clone( stream=True, # Enable streaming mode cache_mode=CacheMode.BYPASS ) debug_config = base_config.clone( page_timeout=120000, # Longer timeout for debugging verbose=True ) ``` The `clone()` method: - Creates a new instance with all the same settings - Updates only the specified parameters - Leaves the original configuration unchanged - Perfect for creating variations without repeating all parameters ## 3. LLMConfig Essentials ### Key fields to note 1. **`provider`**: - Which LLM provider to use. - Possible values are `"ollama/llama3","groq/llama3-70b-8192","groq/llama3-8b-8192", "openai/gpt-4o-mini" ,"openai/gpt-4o","openai/o1-mini","openai/o1-preview","openai/o3-mini","openai/o3-mini-high","anthropic/claude-3-haiku-20240307","anthropic/claude-3-opus-20240229","anthropic/claude-3-sonnet-20240229","anthropic/claude-3-5-sonnet-20240620","gemini/gemini-pro","gemini/gemini-1.5-pro","gemini/gemini-2.0-flash","gemini/gemini-2.0-flash-exp","gemini/gemini-2.0-flash-lite-preview-02-05","deepseek/deepseek-chat"`
*(default: `"openai/gpt-4o-mini"`)* 2. **`api_token`**: - Optional. When not provided explicitly, `api_token` is read from an environment variable chosen by provider. For example, if a Gemini model is passed as the provider, `"GEMINI_API_KEY"` is read from the environment. - API token of the LLM provider,
e.g. `api_token = "gsk_1ClHGGJ7Lpn4WGybR7vNWGdyb3FY7zXEw3SCiy0BAVM9lL8CQv"` - Environment variable, referenced with the prefix `"env:"`,
eg:`api_token = "env: GROQ_API_KEY"` 3. **`base_url`**: - If your provider has a custom endpoint 4. **Backoff controls** *(optional)*: - `backoff_base_delay` *(default `2` seconds)* – how long to pause before the first retry if the provider rate-limits you. - `backoff_max_attempts` *(default `3`)* – total tries for the same prompt (initial call + retries). - `backoff_exponential_factor` *(default `2`)* – how quickly the pause grows between retries. A factor of 2 yields waits like 2s → 4s → 8s. - Because these plug into Crawl4AI’s retry helper, every LLM strategy automatically follows the pacing you define here. ```python llm_config = LLMConfig( provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY"), backoff_base_delay=1, # optional backoff_max_attempts=5, # optional backoff_exponential_factor=3, # optional ) ``` ## 4. Putting It All Together In a typical scenario, you define **one** `BrowserConfig` for your crawler session, then create **one or more** `CrawlerRunConfig` & `LLMConfig` depending on each call's needs: ```python import asyncio from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig, LLMContentFilter, DefaultMarkdownGenerator from crawl4ai import JsonCssExtractionStrategy async def main(): # 1) Browser config: headless, bigger viewport, no proxy browser_conf = BrowserConfig( headless=True, viewport_width=1280, viewport_height=720 ) # 2) Example extraction strategy schema = { "name": "Articles", "baseSelector": "div.article", "fields": [ {"name": "title", "selector": "h2", "type": "text"}, {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"} ] } extraction = JsonCssExtractionStrategy(schema) # 3) Example LLM content filtering gemini_config = LLMConfig( provider="gemini/gemini-1.5-pro", api_token = "env:GEMINI_API_TOKEN" ) # Initialize LLM filter with specific instruction filter = LLMContentFilter( llm_config=gemini_config, # or your preferred provider instruction=""" Focus on 
extracting the core educational content. Include: - Key concepts and explanations - Important code examples - Essential technical details Exclude: - Navigation elements - Sidebars - Footer content Format the output as clean markdown with proper code blocks and headers. """, chunk_token_threshold=500, # Adjust based on your needs verbose=True ) md_generator = DefaultMarkdownGenerator( content_filter=filter, options={"ignore_links": True} ) # 4) Crawler run config: skip cache, use extraction run_conf = CrawlerRunConfig( markdown_generator=md_generator, extraction_strategy=extraction, cache_mode=CacheMode.BYPASS, ) async with AsyncWebCrawler(config=browser_conf) as crawler: # 5) Execute the crawl result = await crawler.arun(url="https://example.com/news", config=run_conf) if result.success: print("Extracted content:", result.extracted_content) else: print("Error:", result.error_message) if __name__ == "__main__": asyncio.run(main()) ``` ## 5. Next Steps - [BrowserConfig, CrawlerRunConfig & LLMConfig Reference](../api/parameters.md) - **Custom Hooks & Auth** (Inject JavaScript or handle login forms). - **Session Management** (Re-use pages, preserve state across multiple calls). - **Advanced Caching** (Fine-tune read/write cache modes). ## 6. Conclusion # 1. **BrowserConfig** – Controlling the Browser `BrowserConfig` focuses on **how** the browser is launched and behaves. This includes headless mode, proxies, user agents, and other environment tweaks. 
```python from crawl4ai import AsyncWebCrawler, BrowserConfig browser_cfg = BrowserConfig( browser_type="chromium", headless=True, viewport_width=1280, viewport_height=720, proxy="http://user:pass@proxy:8080", user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/116.0.0.0 Safari/537.36", ) ``` ## 1.1 Parameter Highlights | **Parameter** | **Type / Default** | **What It Does** | |-----------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------| | **`browser_type`** | `"chromium"`, `"firefox"`, `"webkit"`
*(default: `"chromium"`)* | Which browser engine to use. `"chromium"` is typical for many sites, `"firefox"` or `"webkit"` for specialized tests. | | **`headless`** | `bool` (default: `True`) | Headless means no visible UI. `False` is handy for debugging. | | **`viewport_width`** | `int` (default: `1080`) | Initial page width (in px). Useful for testing responsive layouts. | | **`viewport_height`** | `int` (default: `600`) | Initial page height (in px). | | **`proxy`** | `str` (deprecated) | Deprecated. Use `proxy_config` instead. If set, it will be auto-converted internally. | | **`proxy_config`** | `dict` (default: `None`) | For advanced or multi-proxy needs, specify details like `{"server": "...", "username": "...", ...}`. | | **`use_persistent_context`** | `bool` (default: `False`) | If `True`, uses a **persistent** browser context (keep cookies, sessions across runs). Also sets `use_managed_browser=True`. | | **`user_data_dir`** | `str or None` (default: `None`) | Directory to store user data (profiles, cookies). Must be set if you want permanent sessions. | | **`ignore_https_errors`** | `bool` (default: `True`) | If `True`, continues despite invalid certificates (common in dev/staging). | | **`java_script_enabled`** | `bool` (default: `True`) | Disable if you want no JS overhead, or if only static content is needed. | | **`cookies`** | `list` (default: `[]`) | Pre-set cookies, each a dict like `{"name": "session", "value": "...", "url": "..."}`. | | **`headers`** | `dict` (default: `{}`) | Extra HTTP headers for every request, e.g. `{"Accept-Language": "en-US"}`. | | **`user_agent`** | `str` (default: Chrome-based UA) | Your custom or random user agent. `user_agent_mode="random"` can shuffle it. | | **`light_mode`** | `bool` (default: `False`) | Disables some background features for performance gains. | | **`text_mode`** | `bool` (default: `False`) | If `True`, tries to disable images/other heavy content for speed. 
| | **`use_managed_browser`** | `bool` (default: `False`) | For advanced “managed” interactions (debugging, CDP usage). Typically set automatically if persistent context is on. | | **`extra_args`** | `list` (default: `[]`) | Additional flags for the underlying browser process, e.g. `["--disable-extensions"]`. | - Set `headless=False` to visually **debug** how pages load or how interactions proceed. - If you need **authentication** storage or repeated sessions, consider `use_persistent_context=True` and specify `user_data_dir`. - For large pages, you might need a bigger `viewport_width` and `viewport_height` to handle dynamic content. # 2. **CrawlerRunConfig** – Controlling Each Crawl While `BrowserConfig` sets up the **environment**, `CrawlerRunConfig` details **how** each **crawl operation** should behave: caching, content filtering, link or domain blocking, timeouts, JavaScript code, etc. ```python from crawl4ai import AsyncWebCrawler, CrawlerRunConfig run_cfg = CrawlerRunConfig( wait_for="css:.main-content", word_count_threshold=15, excluded_tags=["nav", "footer"], exclude_external_links=True, stream=True, # Enable streaming for arun_many() ) ``` ## 2.1 Parameter Highlights ### A) **Content Processing** | **Parameter** | **Type / Default** | **What It Does** | |------------------------------|--------------------------------------|-------------------------------------------------------------------------------------------------| | **`word_count_threshold`** | `int` (default: ~200) | Skips text blocks below X words. Helps ignore trivial sections. | | **`extraction_strategy`** | `ExtractionStrategy` (default: None) | If set, extracts structured data (CSS-based, LLM-based, etc.). | | **`markdown_generator`** | `MarkdownGenerationStrategy` (None) | If you want specialized markdown output (citations, filtering, chunking, etc.). Can be customized with options such as `content_source` parameter to select the HTML input source ('cleaned_html', 'raw_html', or 'fit_html'). 
| | **`css_selector`** | `str` (None) | Retains only the part of the page matching this selector. Affects the entire extraction process. | | **`target_elements`** | `List[str]` (None) | List of CSS selectors for elements to focus on for markdown generation and data extraction, while still processing the entire page for links, media, etc. Provides more flexibility than `css_selector`. | | **`excluded_tags`** | `list` (None) | Removes entire tags (e.g. `["script", "style"]`). | | **`excluded_selector`** | `str` (None) | Like `css_selector` but to exclude. E.g. `"#ads, .tracker"`. | | **`only_text`** | `bool` (False) | If `True`, tries to extract text-only content. | | **`prettiify`** | `bool` (False) | If `True`, beautifies final HTML (slower, purely cosmetic). | | **`keep_data_attributes`** | `bool` (False) | If `True`, preserve `data-*` attributes in cleaned HTML. | | **`remove_forms`** | `bool` (False) | If `True`, remove all `<form>` elements. | ### B) **Caching & Session** | **Parameter** | **Type / Default** | **What It Does** | |-------------------------|------------------------|------------------------------------------------------------------------------------------------------------------------------| | **`cache_mode`** | `CacheMode or None` | Controls how caching is handled (`ENABLED`, `BYPASS`, `DISABLED`, etc.). If `None`, typically defaults to `ENABLED`. | | **`session_id`** | `str or None` | Assign a unique ID to reuse a single browser session across multiple `arun()` calls. | | **`bypass_cache`** | `bool` (False) | If `True`, acts like `CacheMode.BYPASS`. | | **`disable_cache`** | `bool` (False) | If `True`, acts like `CacheMode.DISABLED`. | | **`no_cache_read`** | `bool` (False) | If `True`, acts like `CacheMode.WRITE_ONLY` (writes cache but never reads). | | **`no_cache_write`** | `bool` (False) | If `True`, acts like `CacheMode.READ_ONLY` (reads cache but never writes).
| ### C) **Page Navigation & Timing** | **Parameter** | **Type / Default** | **What It Does** | |----------------------------|-------------------------|----------------------------------------------------------------------------------------------------------------------| | **`wait_until`** | `str` (domcontentloaded)| Condition for navigation to “complete”. Often `"networkidle"` or `"domcontentloaded"`. | | **`page_timeout`** | `int` (60000 ms) | Timeout for page navigation or JS steps. Increase for slow sites. | | **`wait_for`** | `str or None` | Wait for a CSS (`"css:selector"`) or JS (`"js:() => bool"`) condition before content extraction. | | **`wait_for_images`** | `bool` (False) | Wait for images to load before finishing. Slows down if you only want text. | | **`delay_before_return_html`** | `float` (0.1) | Additional pause (seconds) before final HTML is captured. Good for last-second updates. | | **`check_robots_txt`** | `bool` (False) | Whether to check and respect robots.txt rules before crawling. If True, caches robots.txt for efficiency. | | **`mean_delay`** and **`max_range`** | `float` (0.1, 0.3) | If you call `arun_many()`, these define random delay intervals between crawls, helping avoid detection or rate limits. | | **`semaphore_count`** | `int` (5) | Max concurrency for `arun_many()`. Increase if you have resources for parallel crawls. | ### D) **Page Interaction** | **Parameter** | **Type / Default** | **What It Does** | |----------------------------|--------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------| | **`js_code`** | `str or list[str]` (None) | JavaScript to run after load. E.g. `"document.querySelector('button')?.click();"`. | | **`js_only`** | `bool` (False) | If `True`, indicates we’re reusing an existing session and only applying JS. No full reload. 
| | **`ignore_body_visibility`** | `bool` (True) | Skip checking if `<body>` is visible. Usually best to keep `True`. | | **`scan_full_page`** | `bool` (False) | If `True`, auto-scroll the page to load dynamic content (infinite scroll). | | **`scroll_delay`** | `float` (0.2) | Delay between scroll steps if `scan_full_page=True`. | | **`process_iframes`** | `bool` (False) | Inlines iframe content for single-page extraction. | | **`remove_overlay_elements`** | `bool` (False) | Removes potential modals/popups blocking the main content. | | **`remove_consent_popups`** | `bool` (False) | Removes GDPR/cookie consent popups from known CMP providers (OneTrust, Cookiebot, TrustArc, Quantcast, Didomi, Sourcepoint, FundingChoices, etc.). Tries clicking "Accept All" first, then falls back to DOM removal. | | **`simulate_user`** | `bool` (False) | Simulate user interactions (mouse movements) to avoid bot detection. | | **`override_navigator`** | `bool` (False) | Override `navigator` properties in JS for stealth. | | **`magic`** | `bool` (False) | Automatic handling of popups/consent banners. Experimental. | | **`adjust_viewport_to_content`** | `bool` (False) | Resizes viewport to match page content height. | If your page is a single-page app with repeated JS updates, set `js_only=True` in subsequent calls, plus a `session_id` for reusing the same tab. ### E) **Media Handling** | **Parameter** | **Type / Default** | **What It Does** | |--------------------------------------------|---------------------|-----------------------------------------------------------------------------------------------------------| | **`screenshot`** | `bool` (False) | Capture a screenshot (base64) in `result.screenshot`. | | **`screenshot_wait_for`** | `float or None` | Extra wait time before the screenshot. | | **`screenshot_height_threshold`** | `int` (~20000) | If the page is taller than this, alternate screenshot strategies are used.
| | **`pdf`** | `bool` (False) | If `True`, returns a PDF in `result.pdf`. | | **`capture_mhtml`** | `bool` (False) | If `True`, captures an MHTML snapshot of the page in `result.mhtml`. MHTML includes all page resources (CSS, images, etc.) in a single file. | | **`image_description_min_word_threshold`** | `int` (~50) | Minimum words for an image’s alt text or description to be considered valid. | | **`image_score_threshold`** | `int` (~3) | Filter out low-scoring images. The crawler scores images by relevance (size, context, etc.). | | **`exclude_external_images`** | `bool` (False) | Exclude images from other domains. | ### F) **Link/Domain Handling** | **Parameter** | **Type / Default** | **What It Does** | |------------------------------|-------------------------|-----------------------------------------------------------------------------------------------------------------------------| | **`exclude_social_media_domains`** | `list` (e.g. Facebook/Twitter) | A default list can be extended. Any link to these domains is removed from final output. | | **`exclude_external_links`** | `bool` (False) | Removes all links pointing outside the current domain. | | **`exclude_social_media_links`** | `bool` (False) | Strips links specifically to social sites (like Facebook or Twitter). | | **`exclude_domains`** | `list` ([]) | Provide a custom list of domains to exclude (like `["ads.com", "trackers.io"]`). | | **`preserve_https_for_internal_links`** | `bool` (False) | If `True`, preserves HTTPS scheme for internal links even when the server redirects to HTTP. Useful for security-conscious crawling. | ### G) **Debug & Logging** | **Parameter** | **Type / Default** | **What It Does** | |----------------|--------------------|---------------------------------------------------------------------------| | **`verbose`** | `bool` (True) | Prints logs detailing each step of crawling, interactions, or errors. 
| | **`log_console`** | `bool` (False) | Logs the page’s JavaScript console output if you want deeper JS debugging.| ### H) **Virtual Scroll Configuration** | **Parameter** | **Type / Default** | **What It Does** | |------------------------------|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------| | **`virtual_scroll_config`** | `VirtualScrollConfig or dict` (None) | Configuration for handling virtualized scrolling on sites like Twitter/Instagram where content is replaced rather than appended. | When sites use virtual scrolling (content replaced as you scroll), use `VirtualScrollConfig`: ```python from crawl4ai import VirtualScrollConfig virtual_config = VirtualScrollConfig( container_selector="#timeline", # CSS selector for scrollable container scroll_count=30, # Number of times to scroll scroll_by="container_height", # How much to scroll: "container_height", "page_height", or pixels (e.g. 
500) wait_after_scroll=0.5 # Seconds to wait after each scroll for content to load ) config = CrawlerRunConfig( virtual_scroll_config=virtual_config ) ``` **VirtualScrollConfig Parameters:** | **Parameter** | **Type / Default** | **What It Does** | |------------------------|---------------------------|-------------------------------------------------------------------------------------------| | **`container_selector`** | `str` (required) | CSS selector for the scrollable container (e.g., `"#feed"`, `".timeline"`) | | **`scroll_count`** | `int` (10) | Maximum number of scrolls to perform | | **`scroll_by`** | `str or int` ("container_height") | Scroll amount: `"container_height"`, `"page_height"`, or pixels (e.g., `500`) | | **`wait_after_scroll`** | `float` (0.5) | Time in seconds to wait after each scroll for new content to load | - Use `virtual_scroll_config` when content is **replaced** during scroll (Twitter, Instagram) - Use `scan_full_page` when content is **appended** during scroll (traditional infinite scroll) ### I) **URL Matching Configuration** | **Parameter** | **Type / Default** | **What It Does** | |------------------------|------------------------------|-------------------------------------------------------------------------------------------------------------------------------------| | **`url_matcher`** | `UrlMatcher` (None) | Pattern(s) to match URLs against. Can be: string (glob), function, or list of mixed types. 
**None means match ALL URLs** | | **`match_mode`** | `MatchMode` (MatchMode.OR) | How to combine multiple matchers in a list: `MatchMode.OR` (any match) or `MatchMode.AND` (all must match) | The `url_matcher` parameter enables URL-specific configurations when used with `arun_many()`: ```python from crawl4ai import CrawlerRunConfig, MatchMode from crawl4ai.processors.pdf import PDFContentScrapingStrategy from crawl4ai.extraction_strategy import JsonCssExtractionStrategy # Simple string pattern (glob-style) pdf_config = CrawlerRunConfig( url_matcher="*.pdf", scraping_strategy=PDFContentScrapingStrategy() ) # Multiple patterns with OR logic (default) blog_config = CrawlerRunConfig( url_matcher=["*/blog/*", "*/article/*", "*/news/*"], match_mode=MatchMode.OR # Any pattern matches ) # Function matcher api_config = CrawlerRunConfig( url_matcher=lambda url: 'api' in url or url.endswith('.json'), # Other settings like extraction_strategy ) # Mixed: String + Function with AND logic complex_config = CrawlerRunConfig( url_matcher=[ lambda url: url.startswith('https://'), # Must be HTTPS "*.org/*", # Must be .org domain lambda url: 'docs' in url # Must contain 'docs' ], match_mode=MatchMode.AND # ALL conditions must match ) # Combined patterns and functions with AND logic secure_docs = CrawlerRunConfig( url_matcher=["https://*", lambda url: '.doc' in url], match_mode=MatchMode.AND # Must be HTTPS AND contain .doc ) # Default config - matches ALL URLs default_config = CrawlerRunConfig() # No url_matcher = matches everything ``` **UrlMatcher Types:** - **None (default)**: When `url_matcher` is None or not set, the config matches ALL URLs - **String patterns**: Glob-style patterns like `"*.pdf"`, `"*/api/*"`, `"https://*.example.com/*"` - **Functions**: `lambda url: bool` - Custom logic for complex matching - **Lists**: Mix strings and functions, combined with `MatchMode.OR` or `MatchMode.AND` **Important Behavior:** - When passing a list of configs to `arun_many()`, URLs are 
matched against each config's `url_matcher` in order. First match wins! - If no config matches a URL and there's no default config (one without `url_matcher`), the URL will fail with "No matching configuration found" Both `BrowserConfig` and `CrawlerRunConfig` provide a `clone()` method to create modified copies: ```python # Create a base configuration base_config = CrawlerRunConfig( cache_mode=CacheMode.ENABLED, word_count_threshold=200 ) # Create variations using clone() stream_config = base_config.clone(stream=True) no_cache_config = base_config.clone( cache_mode=CacheMode.BYPASS, stream=True ) ``` The `clone()` method is particularly useful when you need slightly different configurations for different use cases, without modifying the original config. ## 2.3 Example Usage ```python import asyncio from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode async def main(): # Configure the browser browser_cfg = BrowserConfig( headless=False, viewport_width=1280, viewport_height=720, proxy="http://user:pass@myproxy:8080", text_mode=True ) # Configure the run run_cfg = CrawlerRunConfig( cache_mode=CacheMode.BYPASS, session_id="my_session", css_selector="main.article", excluded_tags=["script", "style"], exclude_external_links=True, wait_for="css:.article-loaded", screenshot=True, stream=True ) async with AsyncWebCrawler(config=browser_cfg) as crawler: result = await crawler.arun( url="https://example.com/news", config=run_cfg ) if result.success: print("Final cleaned_html length:", len(result.cleaned_html)) if result.screenshot: print("Screenshot captured (base64, length):", len(result.screenshot)) else: print("Crawl failed:", result.error_message) if __name__ == "__main__": asyncio.run(main()) ``` ## 2.4 Compliance & Ethics | **Parameter** | **Type / Default** | **What It Does** | 
|-----------------------|-------------------------|----------------------------------------------------------------------------------------------------------------------| | **`check_robots_txt`**| `bool` (False) | When True, checks and respects robots.txt rules before crawling. Uses efficient caching with SQLite backend. | | **`user_agent`** | `str` (None) | User agent string to identify your crawler. Used for robots.txt checking when enabled. | ```python run_config = CrawlerRunConfig( check_robots_txt=True, # Enable robots.txt compliance user_agent="MyBot/1.0" # Identify your crawler ) ``` # 3. **LLMConfig** - Setting up LLM providers 1. LLMExtractionStrategy 2. LLMContentFilter 3. JsonCssExtractionStrategy.generate_schema 4. JsonXPathExtractionStrategy.generate_schema ## 3.1 Parameters | **Parameter** | **Type / Default** | **What It Does** | |-----------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------| | **`provider`** | `"ollama/llama3","groq/llama3-70b-8192","groq/llama3-8b-8192", "openai/gpt-4o-mini" ,"openai/gpt-4o","openai/o1-mini","openai/o1-preview","openai/o3-mini","openai/o3-mini-high","anthropic/claude-3-haiku-20240307","anthropic/claude-3-opus-20240229","anthropic/claude-3-sonnet-20240229","anthropic/claude-3-5-sonnet-20240620","gemini/gemini-pro","gemini/gemini-1.5-pro","gemini/gemini-2.0-flash","gemini/gemini-2.0-flash-exp","gemini/gemini-2.0-flash-lite-preview-02-05","deepseek/deepseek-chat"`
*(default: `"openai/gpt-4o-mini"`)* | Which LLM provider to use. | **`api_token`** | 1. Optional. When not provided explicitly, `api_token` is read from an environment variable chosen according to the provider. For example, if a Gemini model is passed as the provider, `"GEMINI_API_KEY"` is read from the environment.
2. API token of LLM provider
e.g. `api_token = "gsk_1ClHGGJ7Lpn4WGybR7vNWGdyb3FY7zXEw3SCiy0BAVM9lL8CQv"`
3. Environment variable - use with prefix "env:"
e.g. `api_token = "env: GROQ_API_KEY"` | API token to use for the given provider. | **`base_url`** | Optional. Custom API endpoint. | Set this if your provider uses a custom endpoint. ## 3.2 Example Usage ```python import os from crawl4ai import LLMConfig # Read the key from the environment rather than hard-coding it llm_config = LLMConfig(provider="openai/gpt-4o-mini", api_token=os.getenv("OPENAI_API_KEY")) ``` ## 4. Putting It All Together - **Use** `BrowserConfig` for **global** browser settings: engine, headless, proxy, user agent. - **Use** `CrawlerRunConfig` for each crawl’s **context**: how to filter content, handle caching, wait for dynamic elements, or run JS. - **Pass** the `BrowserConfig` to `AsyncWebCrawler` and the `CrawlerRunConfig` to `arun()`. - **Use** `LLMConfig` for LLM provider settings shared across extraction, filtering, and schema generation tasks. It can be passed to `LLMExtractionStrategy`, `LLMContentFilter`, `JsonCssExtractionStrategy.generate_schema`, and `JsonXPathExtractionStrategy.generate_schema`. ```python # Create a modified copy with the clone() method stream_cfg = run_cfg.clone( stream=True, cache_mode=CacheMode.BYPASS ) ``` # Crawling Patterns # Simple Crawling ## Basic Usage Set up a simple crawl using `BrowserConfig` and `CrawlerRunConfig`: ```python import asyncio from crawl4ai import AsyncWebCrawler from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig async def main(): browser_config = BrowserConfig() # Default browser configuration run_config = CrawlerRunConfig() # Default crawl run configuration async with AsyncWebCrawler(config=browser_config) as crawler: result = await crawler.arun( url="https://example.com", config=run_config ) print(result.markdown) # Print clean markdown content if __name__ == "__main__": asyncio.run(main()) ``` ## Understanding the Response The `arun()` method returns a `CrawlResult` object with several useful properties.
Here's a quick overview (see [CrawlResult](../api/crawl-result.md) for complete details): ```python config = CrawlerRunConfig( markdown_generator=DefaultMarkdownGenerator( content_filter=PruningContentFilter(threshold=0.6), options={"ignore_links": True} ) ) result = await crawler.arun( url="https://example.com", config=config ) # Different content formats print(result.html) # Raw HTML print(result.cleaned_html) # Cleaned HTML print(result.markdown.raw_markdown) # Raw markdown from cleaned html print(result.markdown.fit_markdown) # Most relevant content in markdown # Check success status print(result.success) # True if crawl succeeded print(result.status_code) # HTTP status code (e.g., 200, 404) # Access extracted media and links print(result.media) # Dictionary of found media (images, videos, audio) print(result.links) # Dictionary of internal and external links ``` ## Adding Basic Options Customize your crawl using `CrawlerRunConfig`: ```python run_config = CrawlerRunConfig( word_count_threshold=10, # Minimum words per content block exclude_external_links=True, # Remove external links remove_overlay_elements=True, # Remove popups/modals process_iframes=True # Process iframe content ) result = await crawler.arun( url="https://example.com", config=run_config ) ``` ## Handling Errors ```python run_config = CrawlerRunConfig() result = await crawler.arun(url="https://example.com", config=run_config) if not result.success: print(f"Crawl failed: {result.error_message}") print(f"Status code: {result.status_code}") ``` ## Logging and Debugging Enable verbose logging in `BrowserConfig`: ```python browser_config = BrowserConfig(verbose=True) async with AsyncWebCrawler(config=browser_config) as crawler: run_config = CrawlerRunConfig() result = await crawler.arun(url="https://example.com", config=run_config) ``` ## Complete Example ```python import asyncio from crawl4ai import AsyncWebCrawler from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig, CacheMode async 
def main(): browser_config = BrowserConfig(verbose=True) run_config = CrawlerRunConfig( # Content filtering word_count_threshold=10, excluded_tags=['form', 'header'], exclude_external_links=True, # Content processing process_iframes=True, remove_overlay_elements=True, # Cache control cache_mode=CacheMode.ENABLED # Use cache if available ) async with AsyncWebCrawler(config=browser_config) as crawler: result = await crawler.arun( url="https://example.com", config=run_config ) if result.success: # Print clean content print("Content:", result.markdown[:500]) # First 500 chars # Process images for image in result.media["images"]: print(f"Found image: {image['src']}") # Process links for link in result.links["internal"]: print(f"Internal link: {link['href']}") else: print(f"Crawl failed: {result.error_message}") if __name__ == "__main__": asyncio.run(main()) ``` # Content Processing # Markdown Generation Basics 1. How to configure the **Default Markdown Generator** 3. The difference between raw markdown (`result.markdown`) and filtered markdown (`fit_markdown`) > - You know how to configure `CrawlerRunConfig`. ## 1. Quick Example ```python import asyncio from crawl4ai import AsyncWebCrawler, CrawlerRunConfig from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator async def main(): config = CrawlerRunConfig( markdown_generator=DefaultMarkdownGenerator() ) async with AsyncWebCrawler() as crawler: result = await crawler.arun("https://example.com", config=config) if result.success: print("Raw Markdown Output:\n") print(result.markdown) # The unfiltered markdown from the page else: print("Crawl failed:", result.error_message) if __name__ == "__main__": asyncio.run(main()) ``` - `CrawlerRunConfig( markdown_generator = DefaultMarkdownGenerator() )` instructs Crawl4AI to convert the final HTML into markdown at the end of each crawl. - The resulting markdown is accessible via `result.markdown`. ## 2. 
How Markdown Generation Works ### 2.1 HTML-to-Text Conversion (Forked & Modified) - Preserves headings, code blocks, bullet points, etc. - Removes extraneous tags (scripts, styles) that don’t add meaningful content. - Can optionally generate references for links or skip them altogether. ### 2.2 Link Citations & References By default, the generator can convert `<a>` elements into `[text][1]` citations, then place the actual links at the bottom of the document. This is handy for research workflows that demand references in a structured manner. ### 2.3 Optional Content Filters ## 3. Configuring the Default Markdown Generator You can tweak the output by passing an `options` dict to `DefaultMarkdownGenerator`. For example: ```python from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator from crawl4ai import AsyncWebCrawler, CrawlerRunConfig async def main(): # Example: ignore all links, don't escape HTML, and wrap text at 80 characters md_generator = DefaultMarkdownGenerator( options={ "ignore_links": True, "escape_html": False, "body_width": 80 } ) config = CrawlerRunConfig( markdown_generator=md_generator ) async with AsyncWebCrawler() as crawler: result = await crawler.arun("https://example.com/docs", config=config) if result.success: print("Markdown:\n", result.markdown[:500]) # Just a snippet else: print("Crawl failed:", result.error_message) if __name__ == "__main__": import asyncio asyncio.run(main()) ``` Some commonly used `options`: - **`ignore_links`** (bool): Whether to remove all hyperlinks in the final markdown. - **`ignore_images`** (bool): Remove all `![image]()` references. - **`escape_html`** (bool): Turn HTML entities into text (default is often `True`). - **`body_width`** (int): Wrap text at N characters. `0` or `None` means no wrapping. - **`skip_internal_links`** (bool): If `True`, omit `#localAnchors` or internal links referencing the same page. - **`include_sup_sub`** (bool): Attempt to handle `<sup>` / `<sub>` in a more readable way. ## 4.
Selecting the HTML Source for Markdown Generation The `content_source` parameter allows you to control which HTML content is used as input for markdown generation. This gives you flexibility in how the HTML is processed before conversion to markdown. ```python from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator from crawl4ai import AsyncWebCrawler, CrawlerRunConfig async def main(): # Option 1: Use the raw HTML directly from the webpage (before any processing) raw_md_generator = DefaultMarkdownGenerator( content_source="raw_html", options={"ignore_links": True} ) # Option 2: Use the cleaned HTML (after scraping strategy processing - default) cleaned_md_generator = DefaultMarkdownGenerator( content_source="cleaned_html", # This is the default options={"ignore_links": True} ) # Option 3: Use preprocessed HTML optimized for schema extraction fit_md_generator = DefaultMarkdownGenerator( content_source="fit_html", options={"ignore_links": True} ) # Use one of the generators in your crawler config config = CrawlerRunConfig( markdown_generator=raw_md_generator # Try each of the generators ) async with AsyncWebCrawler() as crawler: result = await crawler.arun("https://example.com", config=config) if result.success: print("Markdown:\n", result.markdown.raw_markdown[:500]) else: print("Crawl failed:", result.error_message) if __name__ == "__main__": import asyncio asyncio.run(main()) ``` ### HTML Source Options - **`"cleaned_html"`** (default): Uses the HTML after it has been processed by the scraping strategy. This HTML is typically cleaner and more focused on content, with some boilerplate removed. - **`"raw_html"`**: Uses the original HTML directly from the webpage, before any cleaning or processing. This preserves more of the original content, but may include navigation bars, ads, footers, and other elements that might not be relevant to the main content. - **`"fit_html"`**: Uses HTML preprocessed for schema extraction. 
This HTML is optimized for structured data extraction and may have certain elements simplified or removed. ### When to Use Each Option - Use **`"cleaned_html"`** (default) for most cases where you want a balance of content preservation and noise removal. - Use **`"raw_html"`** when you need to preserve all original content, or when the cleaning process is removing content you actually want to keep. - Use **`"fit_html"`** when working with structured data or when you need HTML that's optimized for schema extraction. ## 5. Content Filters ### 5.1 BM25ContentFilter ```python from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator from crawl4ai.content_filter_strategy import BM25ContentFilter from crawl4ai import CrawlerRunConfig bm25_filter = BM25ContentFilter( user_query="machine learning", bm25_threshold=1.2, language="english" ) md_generator = DefaultMarkdownGenerator( content_filter=bm25_filter, options={"ignore_links": True} ) config = CrawlerRunConfig(markdown_generator=md_generator) ``` - **`user_query`**: The term you want to focus on. BM25 tries to keep only content blocks relevant to that query. - **`bm25_threshold`**: Raise it to keep fewer blocks; lower it to keep more. - **`use_stemming`** *(default `True`)*: Whether to apply stemming to the query and content. - **`language (str)`**: Language for stemming (default: 'english'). ### 5.2 PruningContentFilter If you **don’t** have a specific query, or if you just want a robust “junk remover,” use `PruningContentFilter`. It analyzes text density, link density, HTML structure, and known patterns (like “nav,” “footer”) to systematically prune extraneous or repetitive sections. ```python from crawl4ai.content_filter_strategy import PruningContentFilter prune_filter = PruningContentFilter( threshold=0.5, threshold_type="fixed", # or "dynamic" min_word_threshold=50 ) ``` - **`threshold`**: Score boundary. Blocks below this score get removed. 
- **`threshold_type`**: - `"fixed"`: Straight comparison (`score >= threshold` keeps the block). - `"dynamic"`: The filter adjusts threshold in a data-driven manner. - **`min_word_threshold`**: Discard blocks under N words as likely too short or unhelpful. - You want a broad cleanup without a user query. ### 5.3 LLMContentFilter ```python from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, LLMConfig, DefaultMarkdownGenerator from crawl4ai.content_filter_strategy import LLMContentFilter async def main(): # Initialize LLM filter with specific instruction filter = LLMContentFilter( llm_config = LLMConfig(provider="openai/gpt-4o",api_token="your-api-token"), #or use environment variable instruction=""" Focus on extracting the core educational content. Include: - Key concepts and explanations - Important code examples - Essential technical details Exclude: - Navigation elements - Sidebars - Footer content Format the output as clean markdown with proper code blocks and headers. """, chunk_token_threshold=4096, # Adjust based on your needs verbose=True ) md_generator = DefaultMarkdownGenerator( content_filter=filter, options={"ignore_links": True} ) config = CrawlerRunConfig( markdown_generator=md_generator, ) async with AsyncWebCrawler() as crawler: result = await crawler.arun("https://example.com", config=config) print(result.markdown.fit_markdown) # Filtered markdown content ``` - **Chunk Processing**: Handles large documents by processing them in chunks (controlled by `chunk_token_threshold`) - **Parallel Processing**: For better performance, use smaller `chunk_token_threshold` (e.g., 2048 or 4096) to enable parallel processing of content chunks 1. **Exact Content Preservation**: ```python filter = LLMContentFilter( instruction=""" Extract the main educational content while preserving its original wording and substance completely. 1. Maintain the exact language and terminology 2. Keep all technical explanations and examples intact 3. 
Preserve the original flow and structure
    4. Remove only clearly irrelevant elements like navigation menus and ads
    """,
    chunk_token_threshold=4096
)
```

2. **Focused Content Extraction**:

```python
filter = LLMContentFilter(
    instruction="""
    Focus on extracting specific types of content:
    - Technical documentation
    - Code examples
    - API references
    Reformat the content into clear, well-structured markdown
    """,
    chunk_token_threshold=4096
)
```

> **Performance Tip**: Set a smaller `chunk_token_threshold` (e.g., 2048 or 4096) to enable parallel processing of content chunks. The default value is infinity, which processes the entire content as a single chunk.

## 6. Using Fit Markdown

When a content filter is active, the library produces two forms of markdown inside `result.markdown`:

1. **`raw_markdown`**: The full unfiltered markdown.
2. **`fit_markdown`**: A “fit” version where the filter has removed or trimmed noisy segments.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter

async def main():
    config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=PruningContentFilter(threshold=0.6),
            options={"ignore_links": True}
        )
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://news.example.com/tech", config=config)
        if result.success:
            # Unfiltered markdown
            print("Raw markdown:\n", result.markdown.raw_markdown)
            # If a filter is used, we also have .fit_markdown:
            md_object = result.markdown
            print("Filtered markdown:\n", md_object.fit_markdown)
        else:
            print("Crawl failed:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```

## 7. The `MarkdownGenerationResult` Object

Detailed markdown output is stored in a `MarkdownGenerationResult` object, which exposes fields such as:

- **`raw_markdown`**: The direct HTML-to-markdown transformation (no filtering).
- **`markdown_with_citations`**: A version that moves links to reference-style footnotes.
- **`references_markdown`**: A separate string or section containing the gathered references.
- **`fit_markdown`**: The filtered markdown if you used a content filter.
- **`fit_html`**: The corresponding HTML snippet used to generate `fit_markdown` (helpful for debugging or advanced usage).

```python
md_obj = result.markdown  # your library’s naming may vary
print("RAW:\n", md_obj.raw_markdown)
print("CITED:\n", md_obj.markdown_with_citations)
print("REFERENCES:\n", md_obj.references_markdown)
print("FIT:\n", md_obj.fit_markdown)
```

- You can supply `raw_markdown` to an LLM if you want the entire text.
- Or feed `fit_markdown` into a vector database to reduce token usage.
- `references_markdown` can help you keep track of link provenance.

## 8. Combining Filters (BM25 + Pruning) in Two Passes

You might want to **prune out** noisy boilerplate first (with `PruningContentFilter`), and then **rank what’s left** against a user query (with `BM25ContentFilter`). You don’t have to crawl the page twice. Instead:

1. **First pass**: Apply `PruningContentFilter` directly to the raw HTML from `result.html` (the crawler’s downloaded HTML).
2. **Second pass**: Take the pruned HTML (or text) from step 1, and feed it into `BM25ContentFilter`, focusing on a user query.

### Two-Pass Example

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter

async def main():
    # 1. Crawl with minimal or no markdown generator, just get raw HTML
    config = CrawlerRunConfig(
        # If you only want raw HTML, you can skip passing a markdown_generator
        # or provide one but focus on .html in this example
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/tech-article", config=config)

        if not result.success or not result.html:
            print("Crawl failed or no HTML content.")
            return

        raw_html = result.html

        # 2. First pass: PruningContentFilter on raw HTML
        pruning_filter = PruningContentFilter(threshold=0.5, min_word_threshold=50)

        # filter_content returns a list of "text chunks" or cleaned HTML sections
        pruned_chunks = pruning_filter.filter_content(raw_html)
        # This list is basically pruned content blocks, presumably in HTML or text form

        # For demonstration, let's combine these chunks back into a single HTML-like string
        # or you could do further processing. It's up to your pipeline design.
        pruned_html = "\n".join(pruned_chunks)

        # 3. Second pass: BM25ContentFilter with a user query
        bm25_filter = BM25ContentFilter(
            user_query="machine learning",
            bm25_threshold=1.2,
            language="english"
        )

        # returns a list of text chunks
        bm25_chunks = bm25_filter.filter_content(pruned_html)

        if not bm25_chunks:
            print("Nothing matched the BM25 query after pruning.")
            return

        # 4. Combine or display final results
        final_text = "\n---\n".join(bm25_chunks)

        print("==== PRUNED OUTPUT (first pass) ====")
        print(pruned_html[:500], "... (truncated)")  # preview
        print("\n==== BM25 OUTPUT (second pass) ====")
        print(final_text[:500], "... (truncated)")

if __name__ == "__main__":
    asyncio.run(main())
```

### What’s Happening?

1. **Raw HTML**: We crawl once and store the raw HTML in `result.html`.
2. **PruningContentFilter**: We run the filter on the raw HTML, which returns a list of pruned chunks.
3. **Combine**: We join those chunks into a single string, `pruned_html`.
4. **BM25ContentFilter**: We feed the pruned string into `BM25ContentFilter` with a user query.
This second pass further narrows the content to chunks relevant to “machine learning.”

**No Re-Crawling**: We reused `raw_html` from the single crawl, so there’s no need to run `arun()` again and there is **no second network request**.

### Tips & Variations

- **Plain Text vs. HTML**: If your pruned output is mostly text, BM25 can still handle it; just keep in mind it expects a valid string input. If you supply partial HTML (like `"<p>some text</p>"`), it will parse it as HTML.
- **Adjust Thresholds**: If you see too much or too little text in step one, tweak `threshold=0.5` or `min_word_threshold=50`. Similarly, `bm25_threshold=1.2` can be raised/lowered for more or fewer chunks in step two.

### One-Pass Combination?

## 9. Common Pitfalls & Tips

1. **No Markdown Output?**
2. **Performance Considerations**
   - Very large pages with multiple filters can be slower. Consider `cache_mode` to avoid re-downloading.
3. **Take Advantage of `fit_markdown`**
4. **Adjusting `html2text` Options**
   - If you see lots of raw HTML slipping into the text, turn on `escape_html`.
   - If code blocks look messy, experiment with `mark_code` or `handle_code_in_pre`.

## 10. Summary & Next Steps

- Configure the **DefaultMarkdownGenerator** with HTML-to-text options.
- Select different HTML sources using the `content_source` parameter.
- Distinguish between raw and filtered markdown (`fit_markdown`).
- Leverage the `MarkdownGenerationResult` object to handle different forms of output (citations, references, etc.).

# Fit Markdown with Pruning & BM25

## 1. How “Fit Markdown” Works

### 1.1 The `content_filter`

You specify a **`content_filter`** on the markdown generator that you pass to **`CrawlerRunConfig`**; it shapes how content is pruned or ranked before final markdown generation. A filter’s logic is applied **before** or **during** the HTML→Markdown process, producing:

- **`result.markdown.raw_markdown`** (unfiltered)
- **`result.markdown.fit_markdown`** (filtered or “fit” version)
- **`result.markdown.fit_html`** (the corresponding HTML snippet that produced `fit_markdown`)

### 1.2 Common Filters

## 2. PruningContentFilter

### 2.1 Usage Example

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # Step 1: Create a pruning filter
    prune_filter = PruningContentFilter(
        # Lower → more content retained, higher → more content pruned
        threshold=0.45,
        # "fixed" or "dynamic"
        threshold_type="dynamic",
        # Ignore nodes with <5 words
        min_word_threshold=5
    )

    # Step 2: Insert it into a Markdown Generator
    md_generator = DefaultMarkdownGenerator(content_filter=prune_filter)

    # Step 3: Pass it to CrawlerRunConfig
    config = CrawlerRunConfig(
        markdown_generator=md_generator
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com",
            config=config
        )

        if result.success:
            # 'fit_markdown' is your pruned content, focusing on "denser" text
            print("Raw Markdown length:", len(result.markdown.raw_markdown))
            print("Fit Markdown length:", len(result.markdown.fit_markdown))
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```

### 2.2 Key Parameters

- **`min_word_threshold`** (int): If a block has fewer words than this, it’s pruned.
- **`threshold_type`** (str):
  - `"fixed"` → each node must exceed `threshold` (0–1).
  - `"dynamic"` → node scoring adjusts according to tag type, text/link density, etc.
- **`threshold`** (float, default ~0.48): The base or “anchor” cutoff.
- **Link density**: Penalizes sections that are mostly links.
- **Tag importance**: e.g., an `<article>` or `<p>` might be more important than a `<div>`.

## 3. BM25ContentFilter

### 3.1 Usage Example

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # 1) A BM25 filter with a user query
    bm25_filter = BM25ContentFilter(
        user_query="startup fundraising tips",
        # Adjust for stricter or looser results
        bm25_threshold=1.2
    )

    # 2) Insert into a Markdown Generator
    md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)

    # 3) Pass to crawler config
    config = CrawlerRunConfig(
        markdown_generator=md_generator
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com",
            config=config
        )
        if result.success:
            print("Fit Markdown (BM25 query-based):")
            print(result.markdown.fit_markdown)
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```

### 3.2 Parameters

- **`user_query`** (str, optional): E.g. `"machine learning"`. If blank, the filter tries to glean a query from page metadata.
- **`bm25_threshold`** (float, default 1.0):
  - Higher → fewer chunks but more relevant.
  - Lower → more inclusive.

> In more advanced scenarios, you might see parameters like `language`, `case_sensitive`, or `priority_tags` to refine how text is tokenized or weighted.

## 4. Accessing the “Fit” Output

After the crawl, your “fit” content is found in **`result.markdown.fit_markdown`**.

```python
fit_md = result.markdown.fit_markdown
fit_html = result.markdown.fit_html
```

If the content filter is **BM25**, you might see additional logic or references in `fit_markdown` that highlight relevant segments. If it’s **Pruning**, the text is typically well-cleaned but not necessarily matched to a query.

## 5. Code Patterns Recap

### 5.1 Pruning

```python
prune_filter = PruningContentFilter(
    threshold=0.5,
    threshold_type="fixed",
    min_word_threshold=10
)
md_generator = DefaultMarkdownGenerator(content_filter=prune_filter)
config = CrawlerRunConfig(markdown_generator=md_generator)
```

### 5.2 BM25

```python
bm25_filter = BM25ContentFilter(
    user_query="health benefits fruit",
    bm25_threshold=1.2
)
md_generator = DefaultMarkdownGenerator(content_filter=bm25_filter)
config = CrawlerRunConfig(markdown_generator=md_generator)
```

## 6. Combining with “word_count_threshold” & Exclusions

```python
config = CrawlerRunConfig(
    word_count_threshold=10,
    excluded_tags=["nav", "footer", "header"],
    exclude_external_links=True,
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.5)
    )
)
```

1. The crawler’s `excluded_tags` are removed from the HTML first.
2. The content filter (here `PruningContentFilter`) then prunes the remaining blocks.
3. The final “fit” content is generated in `result.markdown.fit_markdown`.

## 7. Custom Filters

If you need a different approach (like a specialized ML model or site-specific heuristics), you can create a new class inheriting from `RelevantContentFilter` and implement `filter_content(html)`. Then inject it into your **markdown generator**:

```python
from bs4 import BeautifulSoup
from crawl4ai.content_filter_strategy import RelevantContentFilter

class MyCustomFilter(RelevantContentFilter):
    def filter_content(self, html, min_word_threshold=None):
        # Placeholder logic: keep <p> blocks with enough words; swap in your own rule
        min_words = min_word_threshold or 10
        paragraphs = BeautifulSoup(html, "html.parser").find_all("p")
        return [str(p) for p in paragraphs if len(p.get_text().split()) >= min_words]
```

1. Subclass `RelevantContentFilter`.
2. Implement `filter_content(...)`.
3. Use it in your `DefaultMarkdownGenerator(content_filter=MyCustomFilter(...))`.

## 8. Final Thoughts

- **Summaries**: Quickly get the important text from a cluttered page.
- **Search**: Combine with **BM25** to produce content relevant to a query.
- **BM25ContentFilter**: Perfect for query-based extraction or searching.
- Combine with **`excluded_tags`, `exclude_external_links`, `word_count_threshold`** to refine your final “fit” text.
- Fit markdown ends up in **`result.markdown.fit_markdown`**.
- Last Updated: 2025-01-01

# Content Selection

Crawl4AI provides multiple ways to **select**, **filter**, and **refine** the content from your crawls. Whether you need to target a specific CSS region, exclude entire tags, filter out external links, or remove certain domains and images, **`CrawlerRunConfig`** offers a wide range of parameters.

## 1. CSS-Based Selection

There are two ways to select content from a page: using `css_selector` or the more flexible `target_elements`.

### 1.1 Using `css_selector`

A straightforward way to **limit** your crawl results to a certain region of the page is **`css_selector`** in **`CrawlerRunConfig`**:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        # e.g., first 30 items from Hacker News
        css_selector=".athing:nth-child(-n+30)"
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://news.ycombinator.com/newest",
            config=config
        )
        print("Partial HTML length:", len(result.cleaned_html))

if __name__ == "__main__":
    asyncio.run(main())
```

**Result**: Only elements matching that selector remain in `result.cleaned_html`.
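
To make the `:nth-child(-n+30)` part of the selector above concrete, here is a minimal pure-Python sketch of how CSS evaluates the `an+b` formula. The helper `nth_child_matches` is a hypothetical name used only for illustration; it is not part of Crawl4AI.

```python
def nth_child_matches(a: int, b: int, position: int) -> bool:
    """Return True if a 1-based sibling position equals a*n + b for some integer n >= 0."""
    if a == 0:
        return position == b
    n, remainder = divmod(position - b, a)
    return remainder == 0 and n >= 0


# :nth-child(-n+30) corresponds to a = -1, b = 30
matched = [p for p in range(1, 41) if nth_child_matches(-1, 30, p)]
print(matched[0], matched[-1], len(matched))
```

Keep in mind that `:nth-child` counts every sibling element, not only those carrying the `athing` class, so how many rows the selector actually keeps depends on the page’s markup.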
### 1.2 Using `target_elements`

The `target_elements` parameter provides more flexibility by allowing you to target **multiple elements** for content extraction while preserving the entire page context for other features:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        # Target article body and sidebar, but not other content
        target_elements=["article.main-content", "aside.sidebar"]
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/blog-post",
            config=config
        )
        print("Markdown focused on target elements")
        print("Links from entire page still available:", len(result.links.get("internal", [])))

if __name__ == "__main__":
    asyncio.run(main())
```

**Key difference**: With `target_elements`, the markdown generation and structural data extraction focus on those elements, but other page elements (like links, images, and tables) are still extracted from the entire page. This gives you fine-grained control over what appears in your markdown content while preserving full page context for link analysis and media collection.

## 2. Content Filtering & Exclusions

### 2.1 Basic Overview

```python
config = CrawlerRunConfig(
    # Content thresholds
    word_count_threshold=10,        # Minimum words per block

    # Tag exclusions
    excluded_tags=['form', 'header', 'footer', 'nav'],

    # Link filtering
    exclude_external_links=True,
    exclude_social_media_links=True,
    # Block entire domains
    exclude_domains=["adtrackers.com", "spammynews.org"],
    exclude_social_media_domains=["facebook.com", "twitter.com"],

    # Media filtering
    exclude_external_images=True
)
```

- **`word_count_threshold`**: Ignores text blocks under X words. Helps skip trivial blocks like short nav or disclaimers.
- **`excluded_tags`**: Removes entire tags (`<form>`, `<header>`, `<footer>