Preserve Your Identity with Crawl4AI

Crawl4AI empowers you to navigate and interact with the web using your authentic digital identity, ensuring that you are recognized as a human and not mistaken for a bot. This document introduces Managed Browsers, the recommended approach for preserving your right to access the web, and Magic Mode, a simplified solution for specific scenarios.


Managed Browsers: Your Digital Identity Solution

Managed Browsers enable developers to create and use persistent browser profiles. These profiles store local storage, cookies, and other session-related data, allowing you to interact with websites as a recognized user. By leveraging your unique identity, Managed Browsers ensure that your experience reflects your rights as a human browsing the web.

Why Use Managed Browsers?

  1. Authentic Browsing Experience: Managed Browsers retain session data and browser fingerprints, mirroring genuine user behavior.
  2. Effortless Configuration: Once you interact with the site using the browser (e.g., solving a CAPTCHA), the session data is saved and reused, providing seamless access.
  3. Empowered Data Access: By using your identity, Managed Browsers empower users to access data they can view on their own screens without artificial restrictions.

Steps to Use Managed Browsers

  1. Setup the Browser Configuration:

    from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
    from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
    
    browser_config = BrowserConfig(
        headless=False,  # Set to False for initial setup to view browser actions
        verbose=True,
        user_agent_mode="random",
        use_remote_browser=True,  # Enables persistent browser sessions
        browser_type="chromium",
        user_data_dir="/path/to/user_profile_data"  # Path to save session data
    )
    
  2. Perform an Initial Run:

    • Run the crawler with headless=False.
    • Manually interact with the site (e.g., solve CAPTCHA or log in).
    • The browser session saves cookies, local storage, and other required data.
  3. Subsequent Runs:

    • Switch to headless=True for automation.
    • The session data is reused, allowing seamless crawling.
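The two-phase setup above can be sketched as a small script. This is a minimal sketch, not a full example: the login URL and profile path are placeholders, and it assumes the `use_remote_browser` flag and `BrowserConfig` parameters shown elsewhere in this document.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def initial_setup():
    # One-time interactive run: headless=False so you can log in or
    # solve a CAPTCHA by hand. Session data (cookies, local storage)
    # is persisted to user_data_dir for later headless runs.
    browser_config = BrowserConfig(
        headless=False,
        use_remote_browser=True,
        user_data_dir="/path/to/user_profile_data",  # same path as in later runs
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Placeholder URL: open the page you need to authenticate against.
        await crawler.arun(url="https://example.com/login")
        # Interact with the page manually, then close the browser.

if __name__ == "__main__":
    asyncio.run(initial_setup())
```

After this run completes, switching the same configuration to `headless=True` reuses the saved profile.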

Example: Extracting Data Using Managed Browsers

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # Define schema for structured data extraction
    schema = {
        "name": "Example Data",
        "baseSelector": "div.example",
        "fields": [
            {"name": "title", "selector": "h1", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }

    # Configure crawler
    browser_config = BrowserConfig(
        headless=True,  # Automate subsequent runs
        verbose=True,
        use_remote_browser=True,
        user_data_dir="/path/to/user_profile_data"
    )

    crawl_config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(schema),
        wait_for="css:div.example"  # Wait for the targeted element to load
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=crawl_config
        )

        if result.success:
            print("Extracted Data:", result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
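`JsonCssExtractionStrategy` returns `result.extracted_content` as a JSON string. The following self-contained sketch, using a made-up sample payload shaped like the schema above, shows how to turn it into Python objects:

```python
import json

# Hypothetical sample of what result.extracted_content could contain for
# the schema above: a JSON array with one object per matched element.
sample_extracted_content = json.dumps([
    {"title": "Example Domain", "link": "https://www.iana.org/domains/example"},
    {"title": "Another Page", "link": "https://example.com/other"},
])

items = json.loads(sample_extracted_content)
for item in items:
    print(f"{item['title']} -> {item['link']}")
```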

Benefits of Managed Browsers Over Other Methods

Managed Browsers eliminate the need for manual detection workarounds by enabling developers to work directly with their identity and user profile data. This approach ensures maximum compatibility with websites and simplifies the crawling process while preserving your right to access data freely.


Magic Mode: Simplified Automation

While Managed Browsers are the preferred approach, Magic Mode provides an alternative for scenarios where persistent user profiles are unnecessary or infeasible. Magic Mode automates user-like behavior and simplifies configuration.

What Magic Mode Does:

  • Simulates human browsing by randomizing interaction patterns and timing.
  • Masks browser automation signals.
  • Handles cookie popups and modals.
  • Modifies navigator properties for enhanced compatibility.
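As an illustration of the masking step, tools in this space typically inject an init script that overrides automation tell-tales before any page script runs. The snippet below is a hedged sketch of that idea, not Crawl4AI's actual implementation:

```python
# Illustrative only: the kind of init script automation-masking relies on.
# This is NOT Crawl4AI's exact script.
STEALTH_INIT_JS = """
Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
window.chrome = window.chrome || { runtime: {} };
"""

# A script like this would be registered to run before page load,
# e.g. via Playwright's page.add_init_script(STEALTH_INIT_JS).
```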

Using Magic Mode

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com",
        magic=True  # Enables all automation features
    )

Magic Mode is particularly useful for:

  • Quick prototyping when a Managed Browser setup is not available.
  • Basic sites requiring minimal interaction or configuration.

Example: Combining Magic Mode with Additional Options

async def crawl_with_magic_mode(url: str):
    async with AsyncWebCrawler(headless=True) as crawler:
        result = await crawler.arun(
            url=url,
            magic=True,
            remove_overlay_elements=True,  # Remove popups/modals
            page_timeout=60000            # Increased timeout for complex pages
        )

        return result.markdown if result.success else None

Magic Mode vs. Managed Browsers

While Magic Mode simplifies many tasks, it cannot match the reliability and authenticity of Managed Browsers. By using your identity and persistent profiles, Managed Browsers render Magic Mode largely unnecessary. However, Magic Mode remains a viable fallback for specific situations where user identity is not a factor.


Key Comparison: Managed Browsers vs. Magic Mode

| Feature | Managed Browsers | Magic Mode |
|---|---|---|
| Session Persistence | Retains cookies and local storage. | No session retention. |
| Human Interaction | Uses real user profiles and data. | Simulates human-like patterns. |
| Complex Sites | Best suited for heavily configured sites. | Works well with simpler challenges. |
| Setup Complexity | Requires initial manual interaction. | Fully automated, one-line setup. |

Recommendation:

  • Use Managed Browsers for reliable, session-based crawling and data extraction.
  • Use Magic Mode for quick prototyping or when persistent profiles are not required.

Conclusion

  • Use Managed Browsers to preserve your digital identity and ensure reliable, identity-based crawling with persistent sessions. This approach works seamlessly for even the most complex websites.
  • Leverage Magic Mode for quick automation or in scenarios where persistent user profiles are not needed.

By combining these approaches, Crawl4AI provides unparalleled flexibility and capability for your crawling needs.