# Creating Browser Instances, Contexts, and Pages

## 1 Introduction

### Overview of Browser Management in Crawl4AI

Crawl4AI's browser management system gives developers advanced tools for handling complex web crawling tasks. By managing browser instances, contexts, and pages explicitly, Crawl4AI provides anti-bot evasion, persistent sessions, and the performance needed for high-volume, dynamic crawling.

### Key Objectives

- **Anti-Bot Handling**:
  - Implements stealth techniques to evade detection mechanisms used by modern websites.
  - Simulates human-like behavior, such as mouse movements, scrolling, and key presses.
  - Supports integration with third-party services to bypass CAPTCHA challenges.
- **Persistent Sessions**:
  - Retains session data (cookies, local storage) for workflows requiring user authentication.
  - Allows seamless continuation of tasks across multiple runs without re-authentication.
- **Scalable Crawling**:
  - Optimized resource utilization for handling thousands of URLs concurrently.
  - Flexible configuration options to tailor crawling behavior to specific requirements.

---

## 2 Browser Creation Methods

### Standard Browser Creation

Standard browser creation initializes a browser instance with default or minimal configurations. It is suitable for tasks that do not require session persistence or heavy customization.

#### Features and Limitations

- **Features**:
  - Quick and straightforward setup for small-scale tasks.
  - Supports headless and headful modes.
- **Limitations**:
  - Lacks advanced customization options like session reuse.
  - May struggle with sites employing strict anti-bot measures.

#### Example Usage

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def main():
    # Headless Chromium with default settings
    browser_config = BrowserConfig(browser_type="chromium", headless=True)
    async with AsyncWebCrawler(browser_config=browser_config) as crawler:
        result = await crawler.arun("https://crawl4ai.com")
        print(result.markdown)

asyncio.run(main())
```

Later examples omit the `asyncio` wrapper for brevity; run them inside the same kind of `main()` coroutine.

### Persistent Contexts

Persistent contexts create browser sessions with stored data, enabling workflows that require maintaining login state or other session-specific information.

#### Benefits of Using `user_data_dir`

- **Session Persistence**:
  - Stores cookies, local storage, and cache between crawling sessions.
  - Reduces overhead for repetitive logins or multi-step workflows.
- **Enhanced Performance**:
  - Leverages pre-loaded resources for faster page loading.
- **Flexibility**:
  - Adapts to complex workflows requiring user-specific configurations.

#### Example: Setting Up Persistent Contexts

```python
# Store the profile on disk; use_persistent_context (available in recent
# Crawl4AI releases) launches a persistent browser context so cookies and
# local storage survive across runs.
config = BrowserConfig(user_data_dir="/path/to/user/data", use_persistent_context=True)
async with AsyncWebCrawler(browser_config=config) as crawler:
    result = await crawler.arun("https://crawl4ai.com")
    print(result.markdown)
```
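
To illustrate the payoff, here is a hedged sketch of one profile reused across two separate runs. The login URL, selectors, and credentials are illustrative placeholders, not Crawl4AI APIs:

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

profile = BrowserConfig(user_data_dir="/path/to/user/data", use_persistent_context=True)

# Run 1: authenticate once; the session is written into the profile directory.
login_js = """
document.querySelector('#username').value = 'user';    // placeholder selector
document.querySelector('#password').value = 'secret';  // placeholder selector
document.querySelector('button[type="submit"]').click();
"""
async with AsyncWebCrawler(browser_config=profile) as crawler:
    await crawler.arun("https://example.com/login", config=CrawlerRunConfig(js_code=[login_js]))

# Run 2, possibly days later: cookies from run 1 are still on disk, so the
# protected page loads without re-authenticating.
async with AsyncWebCrawler(browser_config=profile) as crawler:
    result = await crawler.arun("https://example.com/account")
    print(result.markdown)
```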

### Managed Browser

The `ManagedBrowser` class offers a high-level abstraction for managing browser instances, emphasizing resource management, debugging capabilities, and anti-bot measures.

#### How It Works

- **Browser Process Management**:
  - Automates initialization and cleanup of browser processes.
  - Optimizes resource usage by pooling and reusing browser instances.
- **Debugging Support**:
  - Integrates with debugging tools like Chrome Developer Tools for real-time inspection.
- **Anti-Bot Measures**:
  - Implements stealth plugins to mimic real user behavior and bypass bot detection.

#### Features

- **Customizable Configurations**:
  - Supports advanced options such as viewport resizing, proxy settings, and header manipulation.
- **Debugging and Logging**:
  - Logs detailed browser interactions for debugging and performance analysis.
- **Scalability**:
  - Handles multiple browser instances concurrently, scaling dynamically based on workload.

#### Example: Using `ManagedBrowser`

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

# use_managed_browser (available in recent Crawl4AI releases) routes the
# session through the ManagedBrowser wrapper; headful mode plus an open
# debug port makes live inspection possible.
config = BrowserConfig(use_managed_browser=True, headless=False, debug_port=9222)
async with AsyncWebCrawler(browser_config=config) as crawler:
    result = await crawler.arun("https://crawl4ai.com")
    print(result.markdown)
```

---

## 3 Context and Page Management

### Creating and Configuring Browser Contexts

Browser contexts act as isolated environments within a single browser instance, enabling independent browsing sessions with their own cookies, cache, and storage.

#### Customizations

- **Headers and Cookies**:
  - Define custom headers to mimic specific devices or browsers.
  - Set cookies for authenticated sessions (see the cookie sketch after the example below).
- **Session Reuse**:
  - Retain and reuse session data across multiple requests, for example to preserve login state for authenticated crawls.

#### Example: Context Initialization

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

# Custom headers sent with requests made during this crawl
config = CrawlerRunConfig(headers={"User-Agent": "Crawl4AI/1.0"})
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://crawl4ai.com", config=config)
    print(result.markdown)
```
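
The customizations above also mention cookies. A minimal sketch, assuming `BrowserConfig` accepts a Playwright-style `cookies` list as in recent Crawl4AI releases; the cookie name and value are placeholders:

```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

# Seed the context with a session cookie (Playwright add_cookies format);
# "session_id" / "YOUR_TOKEN" are illustrative placeholders.
config = BrowserConfig(
    cookies=[{"name": "session_id", "value": "YOUR_TOKEN", "url": "https://crawl4ai.com"}]
)
async with AsyncWebCrawler(browser_config=config) as crawler:
    result = await crawler.arun("https://crawl4ai.com")
    print(result.markdown)
```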

### Creating Pages

Pages represent individual tabs or views within a browser context. They are responsible for rendering content, executing JavaScript, and handling user interactions.

#### Key Features

- **IFrame Handling**:
  - Extract content from embedded iframes.
  - Navigate and interact with nested content.
- **Viewport Customization**:
  - Adjust viewport size to match target device dimensions.
- **Lazy Loading**:
  - Ensure dynamic elements are fully loaded before extraction (see the sketch after the example below).

#### Example: Page Initialization

```python
# Render at desktop dimensions so responsive layouts match a 1080p display
config = CrawlerRunConfig(viewport_width=1920, viewport_height=1080)
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://crawl4ai.com", config=config)
    print(result.markdown)
```
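
For iframe handling and lazy loading, a hedged sketch: it assumes `CrawlerRunConfig` exposes `process_iframes` and `wait_for`, as recent Crawl4AI releases do, and the CSS selector is an illustrative placeholder:

```python
# Inline iframe content into the extracted page, and wait for a lazily
# loaded element to appear before the HTML is returned.
config = CrawlerRunConfig(
    process_iframes=True,
    wait_for="css:.lazy-loaded-section",
)
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://crawl4ai.com", config=config)
    print(result.markdown)
```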

---

## 4 Advanced Features and Best Practices

### Debugging and Logging

Remote debugging provides a powerful way to troubleshoot complex crawling workflows: with a debug port open, you can attach Chrome DevTools to the live browser while a crawl runs.

#### Example: Enabling Remote Debugging

```python
# Expose the Chrome DevTools Protocol on localhost:9222; while the crawl
# runs, open chrome://inspect in another Chrome window to attach DevTools.
config = BrowserConfig(debug_port=9222)
async with AsyncWebCrawler(browser_config=config) as crawler:
    result = await crawler.arun("https://crawl4ai.com")
```
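
To verify the endpoint is live, you can query the standard DevTools HTTP interface; `/json/version` is stock Chromium behavior, independent of Crawl4AI:

```python
import json
import urllib.request

# Ask the DevTools HTTP endpoint which browser is listening; run this
# while the crawl above is active, or the connection will be refused.
with urllib.request.urlopen("http://localhost:9222/json/version") as resp:
    info = json.load(resp)
print(info["Browser"], info["webSocketDebuggerUrl"])
```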

### Anti-Bot Techniques

- **Human Behavior Simulation**:
  - Mimic real user actions, such as scrolling, clicking, and typing.
  - Use injected JavaScript to simulate interactions (see the examples below).
- **Captcha Handling**:
  - Integrate with third-party services like 2Captcha or AntiCaptcha for automated solving.

#### Example: Simulating User Actions

```python
# Fill a search box and submit the form; the selectors are illustrative
# and the snippet throws inside the page if no matching elements exist.
js_code = """
(async () => {
    document.querySelector('input[name="search"]').value = 'test';
    document.querySelector('button[type="submit"]').click();
})();
"""
config = CrawlerRunConfig(js_code=[js_code])
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://crawl4ai.com", config=config)
```
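
Scrolling, also listed above, can be simulated the same way; the step size and delay below are arbitrary illustrative values:

```python
# Scroll to the bottom in small, paused steps: this triggers lazy-loaded
# content and looks more human than a single jump to the footer.
scroll_js = """
(async () => {
    for (let y = 0; y < document.body.scrollHeight; y += 400) {
        window.scrollTo(0, y);
        await new Promise(resolve => setTimeout(resolve, 250));
    }
})();
"""
config = CrawlerRunConfig(js_code=[scroll_js])
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://crawl4ai.com", config=config)
```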

### Optimizations for Performance and Scalability

- **Persistent Contexts**:
  - Reuse browser contexts to minimize resource consumption.
- **Concurrent Crawls**:
  - Use `arun_many` with a controlled semaphore count for efficient batch processing.

#### Example: Scaling Crawls

```python
urls = ["https://example1.com", "https://example2.com"]

# semaphore_count caps how many URLs are fetched concurrently
config = CrawlerRunConfig(semaphore_count=10)
async with AsyncWebCrawler() as crawler:
    results = await crawler.arun_many(urls, config=config)
    for result in results:
        print(result.url, result.markdown)
```