
Creating Browser Instances, Contexts, and Pages (Condensed LLM Reference)

A minimal, code-focused reference that retains all outline sections.

Introduction

  • Manage browsers for crawling with identity preservation, persistent sessions, and scaling.
  • Maintain cookies, local storage, human-like actions.

Key Objectives

  • Identity Preservation: Stealth plugins, human-like inputs.
  • Persistent Sessions: Store cookies, continue tasks across runs.
  • Scalable Crawling: Handle large volumes efficiently.

Browser Creation Methods

Standard Browser Creation

from crawl4ai import AsyncWebCrawler, BrowserConfig

cfg = BrowserConfig(browser_type="chromium", headless=True)
async with AsyncWebCrawler(browser_config=cfg) as c:
    r = await c.arun("https://example.com")

Persistent Contexts

cfg = BrowserConfig(user_data_dir="/path/to/data")
async with AsyncWebCrawler(browser_config=cfg) as c:
    r = await c.arun("https://example.com")

Managed Browser

cfg = BrowserConfig(headless=False, debugging_port=9222, use_managed_browser=True)
async with AsyncWebCrawler(browser_config=cfg) as c:
    r = await c.arun("https://example.com")

Context and Page Management

Creating and Configuring Browser Contexts

from crawl4ai import CrawlerRunConfig
conf = CrawlerRunConfig(headers={"User-Agent": "C4AI"})
async with AsyncWebCrawler() as c:
    r = await c.arun("https://example.com", config=conf)

Creating Pages

conf = CrawlerRunConfig(viewport_width=1920, viewport_height=1080)
async with AsyncWebCrawler() as c:
    r = await c.arun("https://example.com", config=conf)

Preserve Your Identity with Crawl4AI

Use Managed Browsers for authentic identity:

Managed Browsers: Your Digital Identity Solution

  • Store sessions, cookies, user profiles.
  • Reuse CAPTCHAs, logins.

Steps to Use Identity-Based Browsing

# First, launch Chrome yourself with a dedicated profile (shell):
#   google-chrome --user-data-dir="/path/to/Profile"
# Log in manually, solve CAPTCHAs, etc., then reuse that profile:
cfg = BrowserConfig(
    headless=True,
    use_managed_browser=True,
    user_data_dir="/path/to/Profile"
)
async with AsyncWebCrawler(browser_config=cfg) as c:
    r = await c.arun("https://example.com")

Example: Extracting Data Using Managed Browsers

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {...}
cfg = BrowserConfig(
    headless=True, use_managed_browser=True,
    user_data_dir="/path/to/data"
)
crawl_cfg = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))

async with AsyncWebCrawler(browser_config=cfg) as c:
    r = await c.arun("https://example.com", config=crawl_cfg)
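The `schema` above is elided. For illustration, a `JsonCssExtractionStrategy` schema maps a repeated base selector to named fields; the selectors and field names below are hypothetical:

```python
# Hypothetical schema: each element matching baseSelector yields one
# record containing the listed fields.
schema = {
    "name": "Articles",
    "baseSelector": "article.post",  # repeated container element
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}
```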

Magic Mode: Simplified Automation

async with AsyncWebCrawler() as c:
    r = await c.arun("https://example.com", magic=True)

Session Management

Use session_id to maintain state across requests:

from crawl4ai.async_configs import CrawlerRunConfig

async with AsyncWebCrawler() as c:
    sid = "my_session"
    conf1 = CrawlerRunConfig(url="https://example.com/page1", session_id=sid)
    conf2 = CrawlerRunConfig(url="https://example.com/page2", session_id=sid)
    r1 = await c.arun(config=conf1)
    r2 = await c.arun(config=conf2)
    await c.crawler_strategy.kill_session(sid)
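Conceptually, session reuse is an ID-to-page mapping: the same `session_id` hands back the same live browser tab, so cookies and state carry over. A stand-alone sketch of such a registry (not crawl4ai's actual internals):

```python
# Conceptual sketch: map a session_id to a live "page" object so
# later requests share the same state; killing a session drops it.
class SessionRegistry:
    def __init__(self):
        self._sessions = {}

    def get_or_create(self, session_id, factory):
        # Reuse the existing page for this ID, or create a fresh one.
        if session_id not in self._sessions:
            self._sessions[session_id] = factory()
        return self._sessions[session_id]

    def kill(self, session_id):
        # Discard the session and all state attached to it.
        self._sessions.pop(session_id, None)

registry = SessionRegistry()
page1 = registry.get_or_create("my_session", dict)
page1["cookie"] = "abc"
page2 = registry.get_or_create("my_session", dict)  # same object back
```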

Session-Based Crawling for Dynamic Content

  • Reuse the same session for multi-step actions, JS execution.
  • Ideal for pagination, JS-driven content.

Basic Concepts

  • session_id: Keep the same ID for related crawls.
  • js_code, wait_for: Run JS, wait for elements.
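The concepts above can be combined in one run config; a sketch assuming crawl4ai's documented `CrawlerRunConfig` fields, with a placeholder selector and script (config fragment only, no crawl executed):

```python
from crawl4ai import CrawlerRunConfig

conf = CrawlerRunConfig(
    session_id="feed_session",  # reuse one browser tab across calls
    js_code="window.scrollTo(0, document.body.scrollHeight);",  # trigger lazy load
    wait_for="css:.item",       # block until matching elements render
)
```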

Advanced Techniques

  • Execute JS for dynamic content loading.
  • Wait loops or hooks to handle new elements.
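The "wait loop" pattern above is independent of any crawler API: poll a readiness check until it passes or a timeout elapses. A generic sketch (helper name and timings are illustrative):

```python
import time

def wait_until(predicate, timeout=5.0, interval=0.1):
    """Poll predicate() until it returns truthy or timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

# Example: simulate new elements arriving until 3 are present.
items = []
def new_elements_loaded():
    items.append(len(items))       # stand-in for checking the page
    return len(items) >= 3

print(wait_until(new_elements_loaded, timeout=1.0, interval=0.01))  # → True
```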

Conclusion

  • Combine managed browsers, sessions, and configs for scalable, identity-preserved crawling.
  • Adjust headers, cookies, viewports.
  • Magic mode for quick attempts; Managed Browsers for robust identity.
  • Use sessions for multi-step, dynamic workflows.
