Enhance crawler capabilities and documentation - Added SSL certificate extraction in AsyncWebCrawler. - Introduced new content filters and chunking strategies for more robust data extraction. - Updated documentation management to streamline user experience.
8.6 KiB
Introduction
Quick Start (Minimal Example)
For a fast hands-on start, try crawling a single URL and printing its Markdown output:
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(url="https://example.com")
print(result.markdown)
if __name__ == "__main__":
asyncio.run(main())
This simple snippet should immediately confirm your environment is set up correctly. If you see the page content in Markdown format, you’re good to go.
Overview of Crawl4AI
Crawl4AI is a state-of-the-art, asynchronous web crawling library optimized for large-scale data collection. It’s built to integrate seamlessly into AI workflows such as fine-tuning, retrieval-augmented generation (RAG), and data pipelines. By focusing on generating structured, AI-ready data (like Markdown), it helps you build robust applications quickly.
Why Asynchronous?
Async architecture allows you to concurrently crawl multiple URLs without waiting on slow network operations. This results in drastically improved performance and efficiency, especially when dealing with large-scale data extraction.
Purpose and Vision
- Offer an open-source alternative to expensive commercial APIs.
- Provide clean, structured, Markdown-based outputs for easy AI integration.
- Democratize large-scale, high-speed, and reliable web crawling solutions.
Key Features
- Markdown Generation: Produces AI-friendly, concise Markdown.
- High-Performance Crawling: Asynchronous operations let you crawl numerous URLs concurrently.
- Browser Control: Fine-tune browser sessions, user agents, proxies, and viewport.
- JavaScript Support: Handle dynamic pages by injecting custom JavaScript snippets.
- Content Filtering: Use advanced strategies (e.g., BM25) to focus on what matters.
- Extensibility: Define custom extraction strategies for complex data schemas.
- Deployment Ready: Easy Docker deployment for production and scalability.
Use Cases
- LLM Training and Fine-Tuning: Collect and preprocess large web datasets to train machine learning models.
- RAG Pipelines: Generate context documents for retrieval-augmented generation tasks.
- Content Summarization: Extract pages and produce summaries directly in Markdown.
- Structured Data Extraction: Pull structured JSON data suitable for building knowledge graphs or databases.
Example: Creating a Fine-Tuning Dataset
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
urls = ["https://example.com/dataset_page_1", "https://example.com/dataset_page_2"]
async with AsyncWebCrawler(verbose=True) as crawler:
results = await asyncio.gather(*[crawler.arun(url=u) for u in urls])
# Combine Markdown outputs into a single file for model fine-tuning
with open("fine_tuning_data.md", "w") as f:
for res in results:
f.write(res.markdown + "\n")
if __name__ == "__main__":
asyncio.run(main())
Installation and Setup
Environment Setup (Recommended)
Use a virtual environment to keep dependencies isolated:
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
Basic Installation
pip install crawl4ai
crawl4ai-setup
By default, this installs the asynchronous version and sets up Playwright.
Verify Installation
Run a quick test:
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(url="https://crawl4ai.com")
print(result.markdown)
if __name__ == "__main__":
asyncio.run(main())
If you see the page content printed as Markdown, you’re ready.
Handling JavaScript-Heavy Pages
For pages that require JavaScript actions (like clicking a “Load More” button), use the js_code parameter:
js_code = """
(async () => {
const loadMoreBtn = document.querySelector('button.load-more');
if (loadMoreBtn) loadMoreBtn.click();
await new Promise(r => setTimeout(r, 1000));
})();
"""
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(
url="https://example.com/js-page",
js_code=[js_code]
)
print(result.markdown)
Using Cache Modes
CacheMode can speed up repeated crawls by reusing previously fetched data. For instance:
from crawl4ai import AsyncWebCrawler, CacheMode
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(
url="https://example.com/large-page",
cache_mode=CacheMode.ENABLED
)
print(result.markdown)
Quick Start Guide
Minimal Working Example
import asyncio
from crawl4ai import AsyncWebCrawler
async def main():
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(url="https://crawl4ai.com")
print(result.markdown)
if __name__ == "__main__":
asyncio.run(main())
Multiple Concurrent Crawls
Harness async concurrency to run multiple crawls in parallel:
import asyncio
from crawl4ai import AsyncWebCrawler
async def crawl_url(crawler, url):
return await crawler.arun(url=url)
async def main():
urls = ["https://example.com/page1", "https://example.com/page2", "https://example.com/page3"]
async with AsyncWebCrawler(verbose=True) as crawler:
results = await asyncio.gather(*[crawl_url(crawler, u) for u in urls])
for r in results:
print(r.markdown[:200])
if __name__ == "__main__":
asyncio.run(main())
Dockerized Setup
Run Crawl4AI in Docker for production environments:
docker pull unclecode/crawl4ai:basic-amd64
docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64
curl http://localhost:11235/health
Proxy and Security Configurations
async with AsyncWebCrawler(
proxies={"http": "http://proxy.server:port", "https": "https://proxy.server:port"}
) as crawler:
result = await crawler.arun(url="https://crawl4ai.com")
print(result.markdown)
You can also add basic auth:
async with AsyncWebCrawler(
proxies={"http": "http://user:password@proxy.server:port"}
) as crawler:
result = await crawler.arun(url="https://crawl4ai.com")
print(result.markdown)
Customizing Browser Settings
Customize headers, user agents, and viewport:
async with AsyncWebCrawler(
verbose=True,
headers={"User-Agent": "MyCustomBrowser/1.0"},
viewport={"width": 1280, "height": 800}
) as crawler:
result = await crawler.arun("https://example.com")
print(result.markdown)
Troubleshooting Installation
Playwright Errors
If crawl4ai-setup fails, install manually:
playwright install chromium
pip install crawl4ai[all]
SSL or Proxy Issues
- Check certificates or disable SSL verification (for dev only).
- Verify proxy credentials and server details.
Use verbose=True for detailed logs:
async with AsyncWebCrawler(verbose=True) as crawler:
result = await crawler.arun(url="https://crawl4ai.com")
print(result.markdown)
Common Pitfalls
- Missing Playwright Installation: Run
playwright install chromium. - Time-Out on JavaScript-Heavy Pages: Increase wait time or use
js_codefor page interactions. - Empty Markdown: Check if the page is JavaScript-rendered and adjust
js_codeorwait_forconditions. - Permission Errors: Run commands with appropriate permissions or use a virtual environment.
Support and Community
- GitHub Issues: Have questions or found a bug? Open an issue on the GitHub Repo.
- Contributions: We welcome pull requests. Check out the contribution guidelines.
- Community Discussions: Join discussions on GitHub to share tips, best practices, and feedback.
Further Exploration
- Advanced Extraction Strategies: Dive into specialized extraction strategies like
JsonCssExtractionStrategyorLLMExtractionStrategyfor structured data output. - Content Filtering: Explore BM25-based strategies to highlight the most relevant parts of a page.
- Production Deployment: Refer to the Docker and environment variable configurations for large-scale, distributed crawling setups.
For more detailed code examples and advanced topics, refer to the accompanying README and the QUICKSTART Python file included with this distribution.