Files

UncleCode 9a4ed6bbd7 Commit Message:

Enhance crawler capabilities and documentation

  - Added SSL certificate extraction in AsyncWebCrawler.
  - Introduced new content filters and chunking strategies for more robust data extraction.
  - Updated documentation management to streamline user experience.

2024-12-26 15:17:07 +08:00

8.6 KiB

Raw Blame History

Introduction

Quick Start (Minimal Example)

For a fast hands-on start, try crawling a single URL and printing its Markdown output:

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())

This simple snippet should immediately confirm your environment is set up correctly. If you see the page content in Markdown format, you’re good to go.

Overview of Crawl4AI

Crawl4AI is a state-of-the-art, asynchronous web crawling library optimized for large-scale data collection. It’s built to integrate seamlessly into AI workflows such as fine-tuning, retrieval-augmented generation (RAG), and data pipelines. By focusing on generating structured, AI-ready data (like Markdown), it helps you build robust applications quickly.

Why Asynchronous?
Async architecture allows you to concurrently crawl multiple URLs without waiting on slow network operations. This results in drastically improved performance and efficiency, especially when dealing with large-scale data extraction.

Purpose and Vision

Offer an open-source alternative to expensive commercial APIs.
Provide clean, structured, Markdown-based outputs for easy AI integration.
Democratize large-scale, high-speed, and reliable web crawling solutions.

Key Features

Markdown Generation: Produces AI-friendly, concise Markdown.
High-Performance Crawling: Asynchronous operations let you crawl numerous URLs concurrently.
Browser Control: Fine-tune browser sessions, user agents, proxies, and viewport.
JavaScript Support: Handle dynamic pages by injecting custom JavaScript snippets.
Content Filtering: Use advanced strategies (e.g., BM25) to focus on what matters.
Extensibility: Define custom extraction strategies for complex data schemas.
Deployment Ready: Easy Docker deployment for production and scalability.

Use Cases

LLM Training and Fine-Tuning: Collect and preprocess large web datasets to train machine learning models.
RAG Pipelines: Generate context documents for retrieval-augmented generation tasks.
Content Summarization: Extract pages and produce summaries directly in Markdown.
Structured Data Extraction: Pull structured JSON data suitable for building knowledge graphs or databases.

Example: Creating a Fine-Tuning Dataset

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    urls = ["https://example.com/dataset_page_1", "https://example.com/dataset_page_2"]
    async with AsyncWebCrawler(verbose=True) as crawler:
        results = await asyncio.gather(*[crawler.arun(url=u) for u in urls])
        # Combine Markdown outputs into a single file for model fine-tuning
        with open("fine_tuning_data.md", "w") as f:
            for res in results:
                f.write(res.markdown + "\n")

if __name__ == "__main__":
    asyncio.run(main())

Installation and Setup

Environment Setup (Recommended)

Use a virtual environment to keep dependencies isolated:

python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip

Basic Installation

pip install crawl4ai
crawl4ai-setup

By default, this installs the asynchronous version and sets up Playwright.

Verify Installation

Run a quick test:

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://crawl4ai.com")
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())

If you see the page content printed as Markdown, you’re ready.

Handling JavaScript-Heavy Pages

For pages that require JavaScript actions (like clicking a “Load More” button), use the js_code parameter:

js_code = """
(async () => {
    const loadMoreBtn = document.querySelector('button.load-more');
    if (loadMoreBtn) loadMoreBtn.click();
    await new Promise(r => setTimeout(r, 1000));
})();
"""

async with AsyncWebCrawler(verbose=True) as crawler:
    result = await crawler.arun(
        url="https://example.com/js-page",
        js_code=[js_code]
    )
    print(result.markdown)

Using Cache Modes

CacheMode can speed up repeated crawls by reusing previously fetched data. For instance:

from crawl4ai import AsyncWebCrawler, CacheMode

async with AsyncWebCrawler(verbose=True) as crawler:
    result = await crawler.arun(
        url="https://example.com/large-page",
        cache_mode=CacheMode.ENABLED
    )
    print(result.markdown)

Quick Start Guide

Minimal Working Example

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://crawl4ai.com")
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())

Multiple Concurrent Crawls

Harness async concurrency to run multiple crawls in parallel:

import asyncio
from crawl4ai import AsyncWebCrawler

async def crawl_url(crawler, url):
    return await crawler.arun(url=url)

async def main():
    urls = ["https://example.com/page1", "https://example.com/page2", "https://example.com/page3"]
    async with AsyncWebCrawler(verbose=True) as crawler:
        results = await asyncio.gather(*[crawl_url(crawler, u) for u in urls])
        for r in results:
            print(r.markdown[:200])

if __name__ == "__main__":
    asyncio.run(main())

Dockerized Setup

Run Crawl4AI in Docker for production environments:

docker pull unclecode/crawl4ai:basic-amd64
docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64
curl http://localhost:11235/health

Proxy and Security Configurations

async with AsyncWebCrawler(
    proxies={"http": "http://proxy.server:port", "https": "https://proxy.server:port"}
) as crawler:
    result = await crawler.arun(url="https://crawl4ai.com")
    print(result.markdown)

You can also add basic auth:

async with AsyncWebCrawler(
    proxies={"http": "http://user:password@proxy.server:port"}
) as crawler:
    result = await crawler.arun(url="https://crawl4ai.com")
    print(result.markdown)

Customizing Browser Settings

Customize headers, user agents, and viewport:

async with AsyncWebCrawler(
    verbose=True,
    headers={"User-Agent": "MyCustomBrowser/1.0"},
    viewport={"width": 1280, "height": 800}
) as crawler:
    result = await crawler.arun("https://example.com")
    print(result.markdown)

Troubleshooting Installation

Playwright Errors

If crawl4ai-setup fails, install manually:

playwright install chromium
pip install crawl4ai[all]

SSL or Proxy Issues

Check certificates or disable SSL verification (for dev only).
Verify proxy credentials and server details.

Use verbose=True for detailed logs:

async with AsyncWebCrawler(verbose=True) as crawler:
    result = await crawler.arun(url="https://crawl4ai.com")
    print(result.markdown)

Common Pitfalls

Missing Playwright Installation: Run playwright install chromium.
Time-Out on JavaScript-Heavy Pages: Increase wait time or use js_code for page interactions.
Empty Markdown: Check if the page is JavaScript-rendered and adjust js_code or wait_for conditions.
Permission Errors: Run commands with appropriate permissions or use a virtual environment.

Support and Community

GitHub Issues: Have questions or found a bug? Open an issue on the GitHub Repo.
Contributions: We welcome pull requests. Check out the contribution guidelines.
Community Discussions: Join discussions on GitHub to share tips, best practices, and feedback.

Further Exploration

Advanced Extraction Strategies: Dive into specialized extraction strategies like JsonCssExtractionStrategy or LLMExtractionStrategy for structured data output.
Content Filtering: Explore BM25-based strategies to highlight the most relevant parts of a page.
Production Deployment: Refer to the Docker and environment variable configurations for large-scale, distributed crawling setups.

For more detailed code examples and advanced topics, refer to the accompanying README and the QUICKSTART Python file included with this distribution.

8.6 KiB Raw Blame History Unescape Escape