# Introduction ## Quick Start (Minimal Example) For a fast hands-on start, try crawling a single URL and printing its Markdown output: ```python import asyncio from crawl4ai import AsyncWebCrawler async def main(): async with AsyncWebCrawler(verbose=True) as crawler: result = await crawler.arun(url="https://example.com") print(result.markdown) if __name__ == "__main__": asyncio.run(main()) ``` This simple snippet should immediately confirm your environment is set up correctly. If you see the page content in Markdown format, you’re good to go. --- ## Overview of Crawl4AI Crawl4AI is a state-of-the-art, **asynchronous** web crawling library optimized for large-scale data collection. It’s built to integrate seamlessly into AI workflows such as fine-tuning, retrieval-augmented generation (RAG), and data pipelines. By focusing on generating structured, AI-ready data (like Markdown), it helps you build robust applications quickly. **Why Asynchronous?** Async architecture allows you to concurrently crawl multiple URLs without waiting on slow network operations. This results in drastically improved performance and efficiency, especially when dealing with large-scale data extraction. ### Purpose and Vision - Offer an open-source alternative to expensive commercial APIs. - Provide clean, structured, Markdown-based outputs for easy AI integration. - Democratize large-scale, high-speed, and reliable web crawling solutions. ### Key Features - **Markdown Generation**: Produces AI-friendly, concise Markdown. - **High-Performance Crawling**: Asynchronous operations let you crawl numerous URLs concurrently. - **Browser Control**: Fine-tune browser sessions, user agents, proxies, and viewport. - **JavaScript Support**: Handle dynamic pages by injecting custom JavaScript snippets. - **Content Filtering**: Use advanced strategies (e.g., BM25) to focus on what matters. - **Extensibility**: Define custom extraction strategies for complex data schemas. - **Deployment Ready**: Easy Docker deployment for production and scalability. --- ## Use Cases - **LLM Training and Fine-Tuning**: Collect and preprocess large web datasets to train machine learning models. - **RAG Pipelines**: Generate context documents for retrieval-augmented generation tasks. - **Content Summarization**: Extract pages and produce summaries directly in Markdown. - **Structured Data Extraction**: Pull structured JSON data suitable for building knowledge graphs or databases. **Example: Creating a Fine-Tuning Dataset** ```python import asyncio from crawl4ai import AsyncWebCrawler async def main(): urls = ["https://example.com/dataset_page_1", "https://example.com/dataset_page_2"] async with AsyncWebCrawler(verbose=True) as crawler: results = await asyncio.gather(*[crawler.arun(url=u) for u in urls]) # Combine Markdown outputs into a single file for model fine-tuning with open("fine_tuning_data.md", "w") as f: for res in results: f.write(res.markdown + "\n") if __name__ == "__main__": asyncio.run(main()) ``` --- ## Installation and Setup ### Environment Setup (Recommended) Use a virtual environment to keep dependencies isolated: ```bash python3 -m venv venv source venv/bin/activate pip install --upgrade pip ``` ### Basic Installation ```bash pip install crawl4ai crawl4ai-setup ``` By default, this installs the asynchronous version and sets up Playwright. ### Verify Installation Run a quick test: ```python import asyncio from crawl4ai import AsyncWebCrawler async def main(): async with AsyncWebCrawler(verbose=True) as crawler: result = await crawler.arun(url="https://crawl4ai.com") print(result.markdown) if __name__ == "__main__": asyncio.run(main()) ``` If you see the page content printed as Markdown, you’re ready. ### Handling JavaScript-Heavy Pages For pages that require JavaScript actions (like clicking a “Load More” button), use the `js_code` parameter: ```python js_code = """ (async () => { const loadMoreBtn = document.querySelector('button.load-more'); if (loadMoreBtn) loadMoreBtn.click(); await new Promise(r => setTimeout(r, 1000)); })(); """ async with AsyncWebCrawler(verbose=True) as crawler: result = await crawler.arun( url="https://example.com/js-page", js_code=[js_code] ) print(result.markdown) ``` ### Using Cache Modes `CacheMode` can speed up repeated crawls by reusing previously fetched data. For instance: ```python from crawl4ai import AsyncWebCrawler, CacheMode async with AsyncWebCrawler(verbose=True) as crawler: result = await crawler.arun( url="https://example.com/large-page", cache_mode=CacheMode.ENABLED ) print(result.markdown) ``` --- ## Quick Start Guide ### Minimal Working Example ```python import asyncio from crawl4ai import AsyncWebCrawler async def main(): async with AsyncWebCrawler(verbose=True) as crawler: result = await crawler.arun(url="https://crawl4ai.com") print(result.markdown) if __name__ == "__main__": asyncio.run(main()) ``` ### Multiple Concurrent Crawls Harness async concurrency to run multiple crawls in parallel: ```python import asyncio from crawl4ai import AsyncWebCrawler async def crawl_url(crawler, url): return await crawler.arun(url=url) async def main(): urls = ["https://example.com/page1", "https://example.com/page2", "https://example.com/page3"] async with AsyncWebCrawler(verbose=True) as crawler: results = await asyncio.gather(*[crawl_url(crawler, u) for u in urls]) for r in results: print(r.markdown[:200]) if __name__ == "__main__": asyncio.run(main()) ``` ### Dockerized Setup Run Crawl4AI in Docker for production environments: ```bash docker pull unclecode/crawl4ai:basic-amd64 docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64 curl http://localhost:11235/health ``` ### Proxy and Security Configurations ```python async with AsyncWebCrawler( proxies={"http": "http://proxy.server:port", "https": "https://proxy.server:port"} ) as crawler: result = await crawler.arun(url="https://crawl4ai.com") print(result.markdown) ``` You can also add basic auth: ```python async with AsyncWebCrawler( proxies={"http": "http://user:password@proxy.server:port"} ) as crawler: result = await crawler.arun(url="https://crawl4ai.com") print(result.markdown) ``` ### Customizing Browser Settings Customize headers, user agents, and viewport: ```python async with AsyncWebCrawler( verbose=True, headers={"User-Agent": "MyCustomBrowser/1.0"}, viewport={"width": 1280, "height": 800} ) as crawler: result = await crawler.arun("https://example.com") print(result.markdown) ``` --- ## Troubleshooting Installation ### Playwright Errors If `crawl4ai-setup` fails, install manually: ```bash playwright install chromium pip install crawl4ai[all] ``` ### SSL or Proxy Issues - Check certificates or disable SSL verification (for dev only). - Verify proxy credentials and server details. Use `verbose=True` for detailed logs: ```python async with AsyncWebCrawler(verbose=True) as crawler: result = await crawler.arun(url="https://crawl4ai.com") print(result.markdown) ``` --- ## Common Pitfalls 1. **Missing Playwright Installation**: Run `playwright install chromium`. 2. **Time-Out on JavaScript-Heavy Pages**: Increase wait time or use `js_code` for page interactions. 3. **Empty Markdown**: Check if the page is JavaScript-rendered and adjust `js_code` or `wait_for` conditions. 4. **Permission Errors**: Run commands with appropriate permissions or use a virtual environment. --- ## Support and Community - **GitHub Issues**: Have questions or found a bug? Open an issue on the [GitHub Repo](https://github.com/unclecode/crawl4ai/issues). - **Contributions**: We welcome pull requests. Check out the [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTING.md). - **Community Discussions**: Join discussions on GitHub to share tips, best practices, and feedback. --- ## Further Exploration - **Advanced Extraction Strategies**: Dive into specialized extraction strategies like `JsonCssExtractionStrategy` or `LLMExtractionStrategy` for structured data output. - **Content Filtering**: Explore BM25-based strategies to highlight the most relevant parts of a page. - **Production Deployment**: Refer to the Docker and environment variable configurations for large-scale, distributed crawling setups. For more detailed code examples and advanced topics, refer to the accompanying [README](https://github.com/unclecode/crawl4ai) and the `QUICKSTART` Python file included with this distribution.