Files

UncleCode f2d9912697 Renames browser_config param to config in AsyncWebCrawler

Standardizes parameter naming convention across the codebase by renaming browser_config to the more concise config in AsyncWebCrawler constructor.

Updates all documentation examples and internal usages to reflect the new parameter name for consistency.

Also improves hook execution by adding url/response parameters to goto hooks and fixes parameter ordering in before_return_html hook.

2024-12-26 16:34:36 +08:00

22 KiB

Raw Blame History

4. Creating Browser Instances, Contexts, and Pages

Introduction

Overview of Browser Management in Crawl4AI

Crawl4AI's browser management system is designed to provide developers with advanced tools for handling complex web crawling tasks. By managing browser instances, contexts, and pages, Crawl4AI ensures optimal performance, identity preservation, and session persistence for high-volume, dynamic web crawling.

Key Objectives

Identity Preservation:
- Implements stealth techniques to maintain authentic digital identity
- Simulates human-like behavior, such as mouse movements, scrolling, and key presses
- Supports integration with third-party services to bypass CAPTCHA challenges
Persistent Sessions:
- Retains session data (cookies, local storage) for workflows requiring user authentication
- Allows seamless continuation of tasks across multiple runs without re-authentication
Scalable Crawling:
- Optimized resource utilization for handling thousands of URLs concurrently
- Flexible configuration options to tailor crawling behavior to specific requirements

Browser Creation Methods

Standard Browser Creation

Standard browser creation initializes a browser instance with default or minimal configurations. It is suitable for tasks that do not require session persistence or heavy customization.

Features and Limitations

Features:
- Quick and straightforward setup for small-scale tasks
- Supports headless and headful modes
Limitations:
- Lacks advanced customization options like session reuse
- May struggle with sites employing strict identity verification

Example Usage

from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_config = BrowserConfig(browser_type="chromium", headless=True)
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://crawl4ai.com")
    print(result.markdown)

Persistent Contexts

Persistent contexts create browser sessions with stored data, enabling workflows that require maintaining login states or other session-specific information.

Benefits of Using `user_data_dir`

Session Persistence:
- Stores cookies, local storage, and cache between crawling sessions
- Reduces overhead for repetitive logins or multi-step workflows
Enhanced Performance:
- Leverages pre-loaded resources for faster page loading
Flexibility:
- Adapts to complex workflows requiring user-specific configurations

Example: Setting Up Persistent Contexts

config = BrowserConfig(user_data_dir="/path/to/user/data")
async with AsyncWebCrawler(config=config) as crawler:
    result = await crawler.arun("https://crawl4ai.com")
    print(result.markdown)

Managed Browser

The ManagedBrowser class offers a high-level abstraction for managing browser instances, emphasizing resource management, debugging capabilities, and identity preservation measures.

How It Works

Browser Process Management:
- Automates initialization and cleanup of browser processes
- Optimizes resource usage by pooling and reusing browser instances
Debugging Support:
- Integrates with debugging tools like Chrome Developer Tools for real-time inspection
Identity Preservation:
- Implements stealth plugins to maintain authentic user identity
- Preserves browser fingerprints and session data

Features

Customizable Configurations:
- Supports advanced options such as viewport resizing, proxy settings, and header manipulation
Debugging and Logging:
- Logs detailed browser interactions for debugging and performance analysis
Scalability:
- Handles multiple browser instances concurrently, scaling dynamically based on workload

Example: Using `ManagedBrowser`

from crawl4ai import AsyncWebCrawler, BrowserConfig

config = BrowserConfig(headless=False, debug_port=9222)
async with AsyncWebCrawler(config=config) as crawler:
    result = await crawler.arun("https://crawl4ai.com")
    print(result.markdown)

Context and Page Management

Creating and Configuring Browser Contexts

Browser contexts act as isolated environments within a single browser instance, enabling independent browsing sessions with their own cookies, cache, and storage.

Customizations

Headers and Cookies:
- Define custom headers to mimic specific devices or browsers
- Set cookies for authenticated sessions
Session Reuse:
- Retain and reuse session data across multiple requests
- Example: Preserve login states for authenticated crawls

Example: Context Initialization

from crawl4ai import CrawlerRunConfig

config = CrawlerRunConfig(headers={"User-Agent": "Crawl4AI/1.0"})
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://crawl4ai.com", config=config)
    print(result.markdown)

Creating Pages

Pages represent individual tabs or views within a browser context. They are responsible for rendering content, executing JavaScript, and handling user interactions.

Key Features

IFrame Handling:
- Extract content from embedded iframes
- Navigate and interact with nested content
Viewport Customization:
- Adjust viewport size to match target device dimensions
Lazy Loading:
- Ensure dynamic elements are fully loaded before extraction

Example: Page Initialization

config = CrawlerRunConfig(viewport_width=1920, viewport_height=1080)
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://crawl4ai.com", config=config)
    print(result.markdown)

Preserve Your Identity with Crawl4AI

Crawl4AI empowers you to navigate and interact with the web using your authentic digital identity, ensuring that you are recognized as a human and not mistaken for a bot. This section introduces Managed Browsers, the recommended approach for preserving your rights to access the web, and Magic Mode, a simplified solution for specific scenarios.

Managed Browsers: Your Digital Identity Solution

Managed Browsers enable developers to create and use persistent browser profiles. These profiles store local storage, cookies, and other session-related data, allowing you to interact with websites as a recognized user. By leveraging your unique identity, Managed Browsers ensure that your experience reflects your rights as a human browsing the web.

Why Use Managed Browsers?

Authentic Browsing Experience: Managed Browsers retain session data and browser fingerprints, mirroring genuine user behavior.
Effortless Configuration: Once you interact with the site using the browser (e.g., solving a CAPTCHA), the session data is saved and reused, providing seamless access.
Empowered Data Access: By using your identity, Managed Browsers empower users to access data they can view on their own screens without artificial restrictions.

I'll help create a section about using command-line Chrome with a user data directory, which is indeed a more straightforward approach for identity-based browsing.

### Steps to Use Identity-Based Browsing

1. **Launch Chrome with a Custom Profile Directory**

   - **Windows**:
   ```batch
   "C:\Program Files\Google\Chrome\Application\chrome.exe" --user-data-dir="C:\ChromeProfiles\CrawlProfile"

macOS:

"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" --user-data-dir="/Users/username/ChromeProfiles/CrawlProfile"

Linux:

google-chrome --user-data-dir="/home/username/ChromeProfiles/CrawlProfile"

Set Up Your Identity:
- In the new Chrome window, log into your accounts (Google, social media, etc.)
- Complete any necessary CAPTCHA challenges
- Accept cookies and configure site preferences
- The profile directory will save all settings, cookies, and login states

Use the Profile in Crawl4AI:

from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_config = BrowserConfig(
    headless=True,
    use_managed_browser=True,
    user_data_dir="/path/to/ChromeProfiles/CrawlProfile"  # Use the same directory from step 1
)

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://example.com")

This approach provides several advantages:

Complete manual control over profile setup
Persistent logins across multiple sites
Pre-solved CAPTCHAs and saved preferences
Real browser history and cookies for authentic browsing patterns

Example: Extracting Data Using Managed Browsers

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # Define schema for structured data extraction
    schema = {
        "name": "Example Data",
        "baseSelector": "div.example",
        "fields": [
            {"name": "title", "selector": "h1", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }

    # Configure crawler
    browser_config = BrowserConfig(
        headless=True,  # Automate subsequent runs
        verbose=True,
        use_managed_browser=True,
        user_data_dir="/path/to/user_profile_data"
    )

    crawl_config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(schema),
        wait_for="css:div.example"  # Wait for the targeted element to load
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=crawl_config
        )

        if result.success:
            print("Extracted Data:", result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())

Benefits of Managed Browsers Over Other Methods

Managed Browsers eliminate the need for manual detection workarounds by enabling developers to work directly with their identity and user profile data. This approach ensures maximum compatibility with websites and simplifies the crawling process while preserving your right to access data freely.

Magic Mode: Simplified Automation

While Managed Browsers are the preferred approach, Magic Mode provides an alternative for scenarios where persistent user profiles are unnecessary or infeasible. Magic Mode automates user-like behavior and simplifies configuration.

What Magic Mode Does:

Simulates human browsing by randomizing interaction patterns and timing
Masks browser automation signals
Handles cookie popups and modals
Modifies navigator properties for enhanced compatibility

Using Magic Mode

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com",
        magic=True  # Enables all automation features
    )

Magic Mode is particularly useful for:

Quick prototyping when a Managed Browser setup is not available
Basic sites requiring minimal interaction or configuration

Example: Combining Magic Mode with Additional Options

async def crawl_with_magic_mode(url: str):
    async with AsyncWebCrawler(headless=True) as crawler:
        result = await crawler.arun(
            url=url,
            magic=True,
            remove_overlay_elements=True,  # Remove popups/modals
            page_timeout=60000            # Increased timeout for complex pages
        )

        return result.markdown if result.success else None

Magic Mode vs. Managed Browsers

While Magic Mode simplifies many tasks, it cannot match the reliability and authenticity of Managed Browsers. By using your identity and persistent profiles, Managed Browsers render Magic Mode largely unnecessary. However, Magic Mode remains a viable fallback for specific situations where user identity is not a factor.

Session Management

Session management in Crawl4AI is a powerful feature that allows you to maintain state across multiple requests, making it particularly suitable for handling complex multi-step crawling tasks. It enables you to reuse the same browser tab (or page object) across sequential actions and crawls, which is beneficial for:

Performing JavaScript actions before and after crawling
Executing multiple sequential crawls faster without needing to reopen tabs or allocate memory repeatedly
Maintaining state for complex workflows

Note: This feature is designed for sequential workflows and is not suitable for parallel operations.

Basic Session Usage

Use BrowserConfig and CrawlerRunConfig to maintain state with a session_id:

from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

async with AsyncWebCrawler() as crawler:
    session_id = "my_session"

    # Define configurations
    config1 = CrawlerRunConfig(url="https://example.com/page1", session_id=session_id)
    config2 = CrawlerRunConfig(url="https://example.com/page2", session_id=session_id)

    # First request
    result1 = await crawler.arun(config=config1)

    # Subsequent request using the same session
    result2 = await crawler.arun(config=config2)

    # Clean up when done
    await crawler.crawler_strategy.kill_session(session_id)

Dynamic Content with Sessions

Here's an example of crawling GitHub commits across multiple pages while preserving session state:

from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.cache_context import CacheMode

async def crawl_dynamic_content():
    async with AsyncWebCrawler() as crawler:
        session_id = "github_commits_session"
        url = "https://github.com/microsoft/TypeScript/commits/main"
        all_commits = []

        # Define extraction schema
        schema = {
            "name": "Commit Extractor",
            "baseSelector": "li.Box-sc-g0xbh4-0",
            "fields": [{"name": "title", "selector": "h4.markdown-title", "type": "text"}],
        }
        extraction_strategy = JsonCssExtractionStrategy(schema)

        # JavaScript and wait configurations
        js_next_page = """document.querySelector('a[data-testid="pagination-next-button"]').click();"""
        wait_for = """() => document.querySelectorAll('li.Box-sc-g0xbh4-0').length > 0"""

        # Crawl multiple pages
        for page in range(3):
            config = CrawlerRunConfig(
                url=url,
                session_id=session_id,
                extraction_strategy=extraction_strategy,
                js_code=js_next_page if page > 0 else None,
                wait_for=wait_for if page > 0 else None,
                js_only=page > 0,
                cache_mode=CacheMode.BYPASS
            )

            result = await crawler.arun(config=config)
            if result.success:
                commits = json.loads(result.extracted_content)
                all_commits.extend(commits)
                print(f"Page {page + 1}: Found {len(commits)} commits")

        # Clean up session
        await crawler.crawler_strategy.kill_session(session_id)
        return all_commits

Session Best Practices

Descriptive Session IDs: Use meaningful names for session IDs to organize workflows:
```
session_id = "login_flow_session"
session_id = "product_catalog_session"
```

Resource Management: Always ensure sessions are cleaned up to free resources:

try:
    # Your crawling code here
    pass
finally:
    await crawler.crawler_strategy.kill_session(session_id)

State Maintenance: Reuse the session for subsequent actions within the same workflow:

# Step 1: Login
login_config = CrawlerRunConfig(
    url="https://example.com/login",
    session_id=session_id,
    js_code="document.querySelector('form').submit();"
)
await crawler.arun(config=login_config)

# Step 2: Verify login success
dashboard_config = CrawlerRunConfig(
    url="https://example.com/dashboard",
    session_id=session_id,
    wait_for="css:.user-profile"  # Wait for authenticated content
)
result = await crawler.arun(config=dashboard_config)

Common Use Cases for Sessions:
1. Authentication Flows: Login and interact with secured pages
2. Pagination Handling: Navigate through multiple pages
3. Form Submissions: Fill forms, submit, and process results
4. Multi-step Processes: Complete workflows that span multiple actions
5. Dynamic Content Navigation: Handle JavaScript-rendered or event-triggered content

Session-Based Crawling for Dynamic Content

In modern web applications, content is often loaded dynamically without changing the URL. Examples include "Load More" buttons, infinite scrolling, or paginated content that updates via JavaScript. Crawl4AI provides session-based crawling capabilities to handle such scenarios effectively.

Understanding Session-Based Crawling

Session-based crawling allows you to reuse a persistent browser session across multiple actions. This means the same browser tab (or page object) is used throughout, enabling:

Efficient handling of dynamic content without reloading the page
JavaScript actions before and after crawling (e.g., clicking buttons or scrolling)
State maintenance for authenticated sessions or multi-step workflows
Faster sequential crawling, as it avoids reopening tabs or reallocating resources

Note: Session-based crawling is ideal for sequential operations, not parallel tasks.

Basic Concepts

Before diving into examples, here are some key concepts:

Session ID: A unique identifier for a browsing session. Use the same session_id across multiple requests to maintain state.
BrowserConfig & CrawlerRunConfig: These configuration objects control browser settings and crawling behavior.
JavaScript Execution: Use js_code to perform actions like clicking buttons.
CSS Selectors: Target specific elements for interaction or data extraction.
Extraction Strategy: Define rules to extract structured data.
Wait Conditions: Specify conditions to wait for before proceeding.

Advanced Technique 1: Custom Execution Hooks

Use custom hooks to handle complex scenarios, such as waiting for content to load dynamically:

async def advanced_session_crawl_with_hooks():
    first_commit = ""

    async def on_execution_started(page):
        nonlocal first_commit
        try:
            while True:
                await page.wait_for_selector("li.commit-item h4")
                commit = await page.query_selector("li.commit-item h4")
                commit = await commit.evaluate("(element) => element.textContent").strip()
                if commit and commit != first_commit:
                    first_commit = commit
                    break
                await asyncio.sleep(0.5)
        except Exception as e:
            print(f"Warning: New content didn't appear: {e}")

    async with AsyncWebCrawler() as crawler:
        session_id = "commit_session"
        url = "https://github.com/example/repo/commits/main"
        crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)

        js_next_page = """document.querySelector('a.pagination-next').click();"""

        for page in range(3):
            config = CrawlerRunConfig(
                url=url,
                session_id=session_id,
                js_code=js_next_page if page > 0 else None,
                css_selector="li.commit-item",
                js_only=page > 0,
                cache_mode=CacheMode.BYPASS
            )

            result = await crawler.arun(config=config)
            print(f"Page {page + 1}: Found {len(result.extracted_content)} commits")

        await crawler.crawler_strategy.kill_session(session_id)

Advanced Technique 2: Integrated JavaScript Execution and Waiting

Combine JavaScript execution and waiting logic for concise handling of dynamic content:

async def integrated_js_and_wait_crawl():
    async with AsyncWebCrawler() as crawler:
        session_id = "integrated_session"
        url = "https://github.com/example/repo/commits/main"

        js_next_page_and_wait = """
        (async () => {
            const getCurrentCommit = () => document.querySelector('li.commit-item h4').textContent.trim();
            const initialCommit = getCurrentCommit();
            document.querySelector('a.pagination-next').click();
            while (getCurrentCommit() === initialCommit) {
                await new Promise(resolve => setTimeout(resolve, 100));
            }
        })();
        """

        for page in range(3):
            config = CrawlerRunConfig(
                url=url,
                session_id=session_id,
                js_code=js_next_page_and_wait if page > 0 else None,
                css_selector="li.commit-item",
                js_only=page > 0,
                cache_mode=CacheMode.BYPASS
            )

            result = await crawler.arun(config=config)
            print(f"Page {page + 1}: Found {len(result.extracted_content)} commits")

        await crawler.crawler_strategy.kill_session(session_id)

Best Practices for Session-Based Crawling

Unique Session IDs: Assign descriptive and unique session_id values
Close Sessions: Always clean up sessions with kill_session after use
Error Handling: Anticipate and handle errors gracefully
Respect Websites: Follow terms of service and robots.txt
Delays: Add delays to avoid overwhelming servers
Optimize JavaScript: Keep scripts concise for better performance
Monitor Resources: Track memory and CPU usage for long sessions

Conclusion

By combining browser management, identity-based crawling through Managed Browsers, and robust session management, Crawl4AI provides a comprehensive solution for modern web crawling needs. These features work together to enable:

Authentic identity preservation
Efficient session management
Reliable handling of dynamic content
Scalable and maintainable crawling workflows

Remember to always follow best practices and respect website policies when implementing these features.

22 KiB Raw Blame History