Files
crawl4ai/docs/llm.txt/4_browser_context_page.md
UncleCode d5ed451299 Enhance crawler capabilities and documentation
- Add llm.txt generator
  - Added SSL certificate extraction in AsyncWebCrawler.
  - Introduced new content filters and chunking strategies for more robust data extraction.
  - Updated documentation.
2024-12-25 21:34:31 +08:00

551 lines
22 KiB
Markdown

## 4. Creating Browser Instances, Contexts, and Pages
### Introduction
#### Overview of Browser Management in Crawl4AI
Crawl4AI's browser management system is designed to provide developers with advanced tools for handling complex web crawling tasks. By managing browser instances, contexts, and pages, Crawl4AI ensures optimal performance, identity preservation, and session persistence for high-volume, dynamic web crawling.
#### Key Objectives
- **Identity Preservation**:
- Implements stealth techniques to maintain authentic digital identity
- Simulates human-like behavior, such as mouse movements, scrolling, and key presses
- Supports integration with third-party services to bypass CAPTCHA challenges
- **Persistent Sessions**:
- Retains session data (cookies, local storage) for workflows requiring user authentication
- Allows seamless continuation of tasks across multiple runs without re-authentication
- **Scalable Crawling**:
- Optimized resource utilization for handling thousands of URLs concurrently
- Flexible configuration options to tailor crawling behavior to specific requirements
---
### Browser Creation Methods
#### Standard Browser Creation
Standard browser creation initializes a browser instance with default or minimal configurations. It is suitable for tasks that do not require session persistence or heavy customization.
##### Features and Limitations
- **Features**:
- Quick and straightforward setup for small-scale tasks
- Supports headless and headful modes
- **Limitations**:
- Lacks advanced customization options like session reuse
- May struggle with sites employing strict identity verification
##### Example Usage
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig
browser_config = BrowserConfig(browser_type="chromium", headless=True)
async with AsyncWebCrawler(browser_config=browser_config) as crawler:
result = await crawler.arun("https://crawl4ai.com")
print(result.markdown)
```
#### Persistent Contexts
Persistent contexts create browser sessions with stored data, enabling workflows that require maintaining login states or other session-specific information.
##### Benefits of Using `user_data_dir`
- **Session Persistence**:
- Stores cookies, local storage, and cache between crawling sessions
- Reduces overhead for repetitive logins or multi-step workflows
- **Enhanced Performance**:
- Leverages pre-loaded resources for faster page loading
- **Flexibility**:
- Adapts to complex workflows requiring user-specific configurations
##### Example: Setting Up Persistent Contexts
```python
config = BrowserConfig(user_data_dir="/path/to/user/data")
async with AsyncWebCrawler(browser_config=config) as crawler:
result = await crawler.arun("https://crawl4ai.com")
print(result.markdown)
```
#### Managed Browser
The `ManagedBrowser` class offers a high-level abstraction for managing browser instances, emphasizing resource management, debugging capabilities, and identity preservation measures.
##### How It Works
- **Browser Process Management**:
- Automates initialization and cleanup of browser processes
- Optimizes resource usage by pooling and reusing browser instances
- **Debugging Support**:
- Integrates with debugging tools like Chrome Developer Tools for real-time inspection
- **Identity Preservation**:
- Implements stealth plugins to maintain authentic user identity
- Preserves browser fingerprints and session data
##### Features
- **Customizable Configurations**:
- Supports advanced options such as viewport resizing, proxy settings, and header manipulation
- **Debugging and Logging**:
- Logs detailed browser interactions for debugging and performance analysis
- **Scalability**:
- Handles multiple browser instances concurrently, scaling dynamically based on workload
##### Example: Using `ManagedBrowser`
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig
config = BrowserConfig(headless=False, debug_port=9222)
async with AsyncWebCrawler(browser_config=config) as crawler:
result = await crawler.arun("https://crawl4ai.com")
print(result.markdown)
```
---
### Context and Page Management
#### Creating and Configuring Browser Contexts
Browser contexts act as isolated environments within a single browser instance, enabling independent browsing sessions with their own cookies, cache, and storage.
##### Customizations
- **Headers and Cookies**:
- Define custom headers to mimic specific devices or browsers
- Set cookies for authenticated sessions
- **Session Reuse**:
- Retain and reuse session data across multiple requests
- Example: Preserve login states for authenticated crawls
##### Example: Context Initialization
```python
from crawl4ai import CrawlerRunConfig
config = CrawlerRunConfig(headers={"User-Agent": "Crawl4AI/1.0"})
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://crawl4ai.com", config=config)
print(result.markdown)
```
#### Creating Pages
Pages represent individual tabs or views within a browser context. They are responsible for rendering content, executing JavaScript, and handling user interactions.
##### Key Features
- **IFrame Handling**:
- Extract content from embedded iframes
- Navigate and interact with nested content
- **Viewport Customization**:
- Adjust viewport size to match target device dimensions
- **Lazy Loading**:
- Ensure dynamic elements are fully loaded before extraction
##### Example: Page Initialization
```python
config = CrawlerRunConfig(viewport_width=1920, viewport_height=1080)
async with AsyncWebCrawler() as crawler:
result = await crawler.arun("https://crawl4ai.com", config=config)
print(result.markdown)
```
---
# Preserve Your Identity with Crawl4AI
Crawl4AI empowers you to navigate and interact with the web using your authentic digital identity, ensuring that you are recognized as a human and not mistaken for a bot. This section introduces Managed Browsers, the recommended approach for preserving your rights to access the web, and Magic Mode, a simplified solution for specific scenarios.
## Managed Browsers: Your Digital Identity Solution
**Managed Browsers** enable developers to create and use persistent browser profiles. These profiles store local storage, cookies, and other session-related data, allowing you to interact with websites as a recognized user. By leveraging your unique identity, Managed Browsers ensure that your experience reflects your rights as a human browsing the web.
### Why Use Managed Browsers?
1. **Authentic Browsing Experience**: Managed Browsers retain session data and browser fingerprints, mirroring genuine user behavior.
2. **Effortless Configuration**: Once you interact with the site using the browser (e.g., solving a CAPTCHA), the session data is saved and reused, providing seamless access.
3. **Empowered Data Access**: By using your identity, Managed Browsers empower users to access data they can view on their own screens without artificial restrictions.
I'll help create a section about using command-line Chrome with a user data directory, which is indeed a more straightforward approach for identity-based browsing.
```markdown
### Steps to Use Identity-Based Browsing
1. **Launch Chrome with a Custom Profile Directory**
- **Windows**:
```batch
"C:\Program Files\Google\Chrome\Application\chrome.exe" --user-data-dir="C:\ChromeProfiles\CrawlProfile"
```
- **macOS**:
```bash
"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" --user-data-dir="/Users/username/ChromeProfiles/CrawlProfile"
```
- **Linux**:
```bash
google-chrome --user-data-dir="/home/username/ChromeProfiles/CrawlProfile"
```
2. **Set Up Your Identity**:
- In the new Chrome window, log into your accounts (Google, social media, etc.)
- Complete any necessary CAPTCHA challenges
- Accept cookies and configure site preferences
- The profile directory will save all settings, cookies, and login states
3. **Use the Profile in Crawl4AI**:
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig
browser_config = BrowserConfig(
headless=True,
use_managed_browser=True,
user_data_dir="/path/to/ChromeProfiles/CrawlProfile" # Use the same directory from step 1
)
async with AsyncWebCrawler(browser_config=browser_config) as crawler:
result = await crawler.arun("https://example.com")
```
This approach provides several advantages:
- Complete manual control over profile setup
- Persistent logins across multiple sites
- Pre-solved CAPTCHAs and saved preferences
- Real browser history and cookies for authentic browsing patterns
### Example: Extracting Data Using Managed Browsers
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
async def main():
# Define schema for structured data extraction
schema = {
"name": "Example Data",
"baseSelector": "div.example",
"fields": [
{"name": "title", "selector": "h1", "type": "text"},
{"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
]
}
# Configure crawler
browser_config = BrowserConfig(
headless=True, # Automate subsequent runs
verbose=True,
use_managed_browser=True,
user_data_dir="/path/to/user_profile_data"
)
crawl_config = CrawlerRunConfig(
extraction_strategy=JsonCssExtractionStrategy(schema),
wait_for="css:div.example" # Wait for the targeted element to load
)
async with AsyncWebCrawler(config=browser_config) as crawler:
result = await crawler.arun(
url="https://example.com",
config=crawl_config
)
if result.success:
print("Extracted Data:", result.extracted_content)
if __name__ == "__main__":
asyncio.run(main())
```
## Benefits of Managed Browsers Over Other Methods
Managed Browsers eliminate the need for manual detection workarounds by enabling developers to work directly with their identity and user profile data. This approach ensures maximum compatibility with websites and simplifies the crawling process while preserving your right to access data freely.
## Magic Mode: Simplified Automation
While Managed Browsers are the preferred approach, **Magic Mode** provides an alternative for scenarios where persistent user profiles are unnecessary or infeasible. Magic Mode automates user-like behavior and simplifies configuration.
### What Magic Mode Does:
- Simulates human browsing by randomizing interaction patterns and timing
- Masks browser automation signals
- Handles cookie popups and modals
- Modifies navigator properties for enhanced compatibility
### Using Magic Mode
```python
async with AsyncWebCrawler() as crawler:
result = await crawler.arun(
url="https://example.com",
magic=True # Enables all automation features
)
```
Magic Mode is particularly useful for:
- Quick prototyping when a Managed Browser setup is not available
- Basic sites requiring minimal interaction or configuration
### Example: Combining Magic Mode with Additional Options
```python
async def crawl_with_magic_mode(url: str):
async with AsyncWebCrawler(headless=True) as crawler:
result = await crawler.arun(
url=url,
magic=True,
remove_overlay_elements=True, # Remove popups/modals
page_timeout=60000 # Increased timeout for complex pages
)
return result.markdown if result.success else None
```
## Magic Mode vs. Managed Browsers
While Magic Mode simplifies many tasks, it cannot match the reliability and authenticity of Managed Browsers. By using your identity and persistent profiles, Managed Browsers render Magic Mode largely unnecessary. However, Magic Mode remains a viable fallback for specific situations where user identity is not a factor.
# Session Management
Session management in Crawl4AI is a powerful feature that allows you to maintain state across multiple requests, making it particularly suitable for handling complex multi-step crawling tasks. It enables you to reuse the same browser tab (or page object) across sequential actions and crawls, which is beneficial for:
- **Performing JavaScript actions before and after crawling**
- **Executing multiple sequential crawls faster** without needing to reopen tabs or allocate memory repeatedly
- **Maintaining state for complex workflows**
**Note:** This feature is designed for sequential workflows and is not suitable for parallel operations.
## Basic Session Usage
Use `BrowserConfig` and `CrawlerRunConfig` to maintain state with a `session_id`:
```python
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
async with AsyncWebCrawler() as crawler:
session_id = "my_session"
# Define configurations
config1 = CrawlerRunConfig(url="https://example.com/page1", session_id=session_id)
config2 = CrawlerRunConfig(url="https://example.com/page2", session_id=session_id)
# First request
result1 = await crawler.arun(config=config1)
# Subsequent request using the same session
result2 = await crawler.arun(config=config2)
# Clean up when done
await crawler.crawler_strategy.kill_session(session_id)
```
## Dynamic Content with Sessions
Here's an example of crawling GitHub commits across multiple pages while preserving session state:
```python
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.cache_context import CacheMode
async def crawl_dynamic_content():
async with AsyncWebCrawler() as crawler:
session_id = "github_commits_session"
url = "https://github.com/microsoft/TypeScript/commits/main"
all_commits = []
# Define extraction schema
schema = {
"name": "Commit Extractor",
"baseSelector": "li.Box-sc-g0xbh4-0",
"fields": [{"name": "title", "selector": "h4.markdown-title", "type": "text"}],
}
extraction_strategy = JsonCssExtractionStrategy(schema)
# JavaScript and wait configurations
js_next_page = """document.querySelector('a[data-testid="pagination-next-button"]').click();"""
wait_for = """() => document.querySelectorAll('li.Box-sc-g0xbh4-0').length > 0"""
# Crawl multiple pages
for page in range(3):
config = CrawlerRunConfig(
url=url,
session_id=session_id,
extraction_strategy=extraction_strategy,
js_code=js_next_page if page > 0 else None,
wait_for=wait_for if page > 0 else None,
js_only=page > 0,
cache_mode=CacheMode.BYPASS
)
result = await crawler.arun(config=config)
if result.success:
commits = json.loads(result.extracted_content)
all_commits.extend(commits)
print(f"Page {page + 1}: Found {len(commits)} commits")
# Clean up session
await crawler.crawler_strategy.kill_session(session_id)
return all_commits
```
## Session Best Practices
1. **Descriptive Session IDs**:
Use meaningful names for session IDs to organize workflows:
```python
session_id = "login_flow_session"
session_id = "product_catalog_session"
```
2. **Resource Management**:
Always ensure sessions are cleaned up to free resources:
```python
try:
# Your crawling code here
pass
finally:
await crawler.crawler_strategy.kill_session(session_id)
```
3. **State Maintenance**:
Reuse the session for subsequent actions within the same workflow:
```python
# Step 1: Login
login_config = CrawlerRunConfig(
url="https://example.com/login",
session_id=session_id,
js_code="document.querySelector('form').submit();"
)
await crawler.arun(config=login_config)
# Step 2: Verify login success
dashboard_config = CrawlerRunConfig(
url="https://example.com/dashboard",
session_id=session_id,
wait_for="css:.user-profile" # Wait for authenticated content
)
result = await crawler.arun(config=dashboard_config)
```
4. **Common Use Cases for Sessions**:
1. **Authentication Flows**: Login and interact with secured pages
2. **Pagination Handling**: Navigate through multiple pages
3. **Form Submissions**: Fill forms, submit, and process results
4. **Multi-step Processes**: Complete workflows that span multiple actions
5. **Dynamic Content Navigation**: Handle JavaScript-rendered or event-triggered content
# Session-Based Crawling for Dynamic Content
In modern web applications, content is often loaded dynamically without changing the URL. Examples include "Load More" buttons, infinite scrolling, or paginated content that updates via JavaScript. Crawl4AI provides session-based crawling capabilities to handle such scenarios effectively.
## Understanding Session-Based Crawling
Session-based crawling allows you to reuse a persistent browser session across multiple actions. This means the same browser tab (or page object) is used throughout, enabling:
1. **Efficient handling of dynamic content** without reloading the page
2. **JavaScript actions before and after crawling** (e.g., clicking buttons or scrolling)
3. **State maintenance** for authenticated sessions or multi-step workflows
4. **Faster sequential crawling**, as it avoids reopening tabs or reallocating resources
**Note:** Session-based crawling is ideal for sequential operations, not parallel tasks.
## Basic Concepts
Before diving into examples, here are some key concepts:
- **Session ID**: A unique identifier for a browsing session. Use the same `session_id` across multiple requests to maintain state.
- **BrowserConfig & CrawlerRunConfig**: These configuration objects control browser settings and crawling behavior.
- **JavaScript Execution**: Use `js_code` to perform actions like clicking buttons.
- **CSS Selectors**: Target specific elements for interaction or data extraction.
- **Extraction Strategy**: Define rules to extract structured data.
- **Wait Conditions**: Specify conditions to wait for before proceeding.
## Advanced Technique 1: Custom Execution Hooks
Use custom hooks to handle complex scenarios, such as waiting for content to load dynamically:
```python
async def advanced_session_crawl_with_hooks():
first_commit = ""
async def on_execution_started(page):
nonlocal first_commit
try:
while True:
await page.wait_for_selector("li.commit-item h4")
commit = await page.query_selector("li.commit-item h4")
commit = await commit.evaluate("(element) => element.textContent").strip()
if commit and commit != first_commit:
first_commit = commit
break
await asyncio.sleep(0.5)
except Exception as e:
print(f"Warning: New content didn't appear: {e}")
async with AsyncWebCrawler() as crawler:
session_id = "commit_session"
url = "https://github.com/example/repo/commits/main"
crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
js_next_page = """document.querySelector('a.pagination-next').click();"""
for page in range(3):
config = CrawlerRunConfig(
url=url,
session_id=session_id,
js_code=js_next_page if page > 0 else None,
css_selector="li.commit-item",
js_only=page > 0,
cache_mode=CacheMode.BYPASS
)
result = await crawler.arun(config=config)
print(f"Page {page + 1}: Found {len(result.extracted_content)} commits")
await crawler.crawler_strategy.kill_session(session_id)
```
## Advanced Technique 2: Integrated JavaScript Execution and Waiting
Combine JavaScript execution and waiting logic for concise handling of dynamic content:
```python
async def integrated_js_and_wait_crawl():
async with AsyncWebCrawler() as crawler:
session_id = "integrated_session"
url = "https://github.com/example/repo/commits/main"
js_next_page_and_wait = """
(async () => {
const getCurrentCommit = () => document.querySelector('li.commit-item h4').textContent.trim();
const initialCommit = getCurrentCommit();
document.querySelector('a.pagination-next').click();
while (getCurrentCommit() === initialCommit) {
await new Promise(resolve => setTimeout(resolve, 100));
}
})();
"""
for page in range(3):
config = CrawlerRunConfig(
url=url,
session_id=session_id,
js_code=js_next_page_and_wait if page > 0 else None,
css_selector="li.commit-item",
js_only=page > 0,
cache_mode=CacheMode.BYPASS
)
result = await crawler.arun(config=config)
print(f"Page {page + 1}: Found {len(result.extracted_content)} commits")
await crawler.crawler_strategy.kill_session(session_id)
```
## Best Practices for Session-Based Crawling
1. **Unique Session IDs**: Assign descriptive and unique `session_id` values
2. **Close Sessions**: Always clean up sessions with `kill_session` after use
3. **Error Handling**: Anticipate and handle errors gracefully
4. **Respect Websites**: Follow terms of service and robots.txt
5. **Delays**: Add delays to avoid overwhelming servers
6. **Optimize JavaScript**: Keep scripts concise for better performance
7. **Monitor Resources**: Track memory and CPU usage for long sessions
## Conclusion
By combining browser management, identity-based crawling through Managed Browsers, and robust session management, Crawl4AI provides a comprehensive solution for modern web crawling needs. These features work together to enable:
1. Authentic identity preservation
2. Efficient session management
3. Reliable handling of dynamic content
4. Scalable and maintainable crawling workflows
Remember to always follow best practices and respect website policies when implementing these features.