Enhance Crawl4AI with new features and documentation
- Fix crawler text mode for improved performance; cover missing `srcset` and `data_srcset` attributes in image tags. - Introduced Managed Browsers for enhanced crawling experience. - Updated documentation for clearer navigation on configuration. - Changed 'text_only' to 'text_mode' in configuration and methods. - Improved performance and relevance in content filtering strategies.
This commit is contained in:
@@ -2,80 +2,12 @@
|
||||
|
||||
Crawl4AI provides powerful content processing capabilities that help you extract clean, relevant content from web pages. This guide covers content cleaning, media handling, link analysis, and metadata extraction.
|
||||
|
||||
## Content Cleaning
|
||||
|
||||
### Understanding Clean Content
|
||||
When crawling web pages, you often encounter a lot of noise - advertisements, navigation menus, footers, popups, and other irrelevant content. Crawl4AI automatically cleans this noise using several approaches:
|
||||
|
||||
1. **Basic Cleaning**: Removes unwanted HTML elements and attributes
|
||||
2. **Content Relevance**: Identifies and preserves meaningful content blocks
|
||||
3. **Layout Analysis**: Understands page structure to identify main content areas
|
||||
|
||||
```python
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
word_count_threshold=10, # Remove blocks with fewer words
|
||||
excluded_tags=['form', 'nav'], # Remove specific HTML tags
|
||||
remove_overlay_elements=True # Remove popups/modals
|
||||
)
|
||||
|
||||
# Get clean content
|
||||
print(result.cleaned_html) # Cleaned HTML
|
||||
print(result.markdown) # Clean markdown version
|
||||
```
|
||||
|
||||
### Fit Markdown: Smart Content Extraction
|
||||
One of Crawl4AI's most powerful features is `fit_markdown`. This feature uses advanced heuristics to identify and extract the main content from a webpage while excluding irrelevant elements.
|
||||
|
||||
#### How Fit Markdown Works
|
||||
- Analyzes content density and distribution
|
||||
- Identifies content patterns and structures
|
||||
- Removes boilerplate content (headers, footers, sidebars)
|
||||
- Preserves the most relevant content blocks
|
||||
- Maintains content hierarchy and formatting
|
||||
|
||||
#### Perfect For:
|
||||
- Blog posts and articles
|
||||
- News content
|
||||
- Documentation pages
|
||||
- Any page with a clear main content area
|
||||
|
||||
#### Not Recommended For:
|
||||
- E-commerce product listings
|
||||
- Search results pages
|
||||
- Social media feeds
|
||||
- Pages with multiple equal-weight content sections
|
||||
|
||||
```python
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
|
||||
# Get the most relevant content
|
||||
main_content = result.fit_markdown
|
||||
|
||||
# Compare with regular markdown
|
||||
all_content = result.markdown
|
||||
|
||||
print(f"Fit Markdown Length: {len(main_content)}")
|
||||
print(f"Regular Markdown Length: {len(all_content)}")
|
||||
```
|
||||
|
||||
#### Example Use Case
|
||||
```python
|
||||
async def extract_article_content(url: str) -> str:
|
||||
"""Extract main article content from a blog or news site."""
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(url=url)
|
||||
|
||||
# fit_markdown will focus on the article content,
|
||||
# excluding navigation, ads, and other distractions
|
||||
return result.fit_markdown
|
||||
```
|
||||
|
||||
## Media Processing
|
||||
|
||||
Crawl4AI provides comprehensive media extraction and analysis capabilities. It automatically detects and processes various types of media elements while maintaining their context and relevance.
|
||||
|
||||
### Image Processing
|
||||
|
||||
The library handles various image scenarios, including:
|
||||
- Regular images
|
||||
- Lazy-loaded images
|
||||
@@ -84,7 +16,10 @@ The library handles various image scenarios, including:
|
||||
- Image metadata and context
|
||||
|
||||
```python
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
|
||||
config = CrawlerRunConfig()
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
|
||||
for image in result.media["images"]:
|
||||
# Each image includes rich metadata
|
||||
@@ -96,20 +31,27 @@ for image in result.media["images"]:
|
||||
```
|
||||
|
||||
### Handling Lazy-Loaded Content
|
||||
Crawl4aai already handles lazy loading for media elements. You can also customize the wait time for lazy-loaded content:
|
||||
|
||||
Crawl4AI already handles lazy loading for media elements. You can customize the wait time for lazy-loaded content with `CrawlerRunConfig`:
|
||||
|
||||
```python
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
config = CrawlerRunConfig(
|
||||
wait_for="css:img[data-src]", # Wait for lazy images
|
||||
delay_before_return_html=2.0 # Additional wait time
|
||||
)
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
```
|
||||
|
||||
### Video and Audio Content
|
||||
|
||||
The library extracts video and audio elements with their metadata:
|
||||
|
||||
```python
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
|
||||
config = CrawlerRunConfig()
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
|
||||
# Process videos
|
||||
for video in result.media["videos"]:
|
||||
print(f"Video source: {video['src']}")
|
||||
@@ -129,6 +71,7 @@ for audio in result.media["audios"]:
|
||||
Crawl4AI provides sophisticated link analysis capabilities, helping you understand the relationship between pages and identify important navigation patterns.
|
||||
|
||||
### Link Classification
|
||||
|
||||
The library automatically categorizes links into:
|
||||
- Internal links (same domain)
|
||||
- External links (different domains)
|
||||
@@ -137,7 +80,10 @@ The library automatically categorizes links into:
|
||||
- Content links
|
||||
|
||||
```python
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
|
||||
config = CrawlerRunConfig()
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
|
||||
# Analyze internal links
|
||||
for link in result.links["internal"]:
|
||||
@@ -154,18 +100,19 @@ for link in result.links["external"]:
|
||||
```
|
||||
|
||||
### Smart Link Filtering
|
||||
Control which links are included in the results:
|
||||
|
||||
Control which links are included in the results with `CrawlerRunConfig`:
|
||||
|
||||
```python
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
config = CrawlerRunConfig(
|
||||
exclude_external_links=True, # Remove external links
|
||||
exclude_social_media_links=True, # Remove social media links
|
||||
exclude_social_media_domains=[ # Custom social media domains
|
||||
exclude_social_media_domains=[ # Custom social media domains
|
||||
"facebook.com", "twitter.com", "instagram.com"
|
||||
],
|
||||
exclude_domains=["ads.example.com"] # Exclude specific domains
|
||||
)
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
```
|
||||
|
||||
## Metadata Extraction
|
||||
@@ -173,7 +120,10 @@ result = await crawler.arun(
|
||||
Crawl4AI automatically extracts and processes page metadata, providing valuable information about the content:
|
||||
|
||||
```python
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
|
||||
config = CrawlerRunConfig()
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
|
||||
metadata = result.metadata
|
||||
print(f"Title: {metadata['title']}")
|
||||
@@ -184,40 +134,3 @@ print(f"Published Date: {metadata['published_date']}")
|
||||
print(f"Modified Date: {metadata['modified_date']}")
|
||||
print(f"Language: {metadata['language']}")
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Use Fit Markdown for Articles**
|
||||
```python
|
||||
# Perfect for blog posts, news articles, documentation
|
||||
content = result.fit_markdown
|
||||
```
|
||||
|
||||
2. **Handle Media Appropriately**
|
||||
```python
|
||||
# Filter by relevance score
|
||||
relevant_images = [
|
||||
img for img in result.media["images"]
|
||||
if img['score'] > 5
|
||||
]
|
||||
```
|
||||
|
||||
3. **Combine Link Analysis with Content**
|
||||
```python
|
||||
# Get content links with context
|
||||
content_links = [
|
||||
link for link in result.links["internal"]
|
||||
if link['type'] == 'content'
|
||||
]
|
||||
```
|
||||
|
||||
4. **Clean Content with Purpose**
|
||||
```python
|
||||
# Customize cleaning based on your needs
|
||||
result = await crawler.arun(
|
||||
url=url,
|
||||
word_count_threshold=20, # Adjust based on content type
|
||||
keep_data_attributes=False, # Remove data attributes
|
||||
process_iframes=True # Include iframe content
|
||||
)
|
||||
```
|
||||
@@ -1,114 +1,121 @@
|
||||
# Hooks & Auth for AsyncWebCrawler
|
||||
|
||||
Crawl4AI's AsyncWebCrawler allows you to customize the behavior of the web crawler using hooks. Hooks are asynchronous functions that are called at specific points in the crawling process, allowing you to modify the crawler's behavior or perform additional actions. This example demonstrates how to use various hooks to customize the asynchronous crawling process.
|
||||
Crawl4AI's `AsyncWebCrawler` allows you to customize the behavior of the web crawler using hooks. Hooks are asynchronous functions called at specific points in the crawling process, allowing you to modify the crawler's behavior or perform additional actions. This updated documentation demonstrates how to use hooks, including the new `on_page_context_created` hook, and ensures compatibility with `BrowserConfig` and `CrawlerRunConfig`.
|
||||
|
||||
## Example: Using Crawler Hooks with AsyncWebCrawler
|
||||
|
||||
Let's see how we can customize the AsyncWebCrawler using hooks! In this example, we'll:
|
||||
In this example, we'll:
|
||||
|
||||
1. Configure the browser when it's created.
|
||||
2. Add custom headers before navigating to the URL.
|
||||
3. Log the current URL after navigation.
|
||||
4. Perform actions after JavaScript execution.
|
||||
5. Log the length of the HTML before returning it.
|
||||
1. Configure the browser and set up authentication when it's created.
|
||||
2. Apply custom routing and initial actions when the page context is created.
|
||||
3. Add custom headers before navigating to the URL.
|
||||
4. Log the current URL after navigation.
|
||||
5. Perform actions after JavaScript execution.
|
||||
6. Log the length of the HTML before returning it.
|
||||
|
||||
### Hook Definitions
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
|
||||
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
|
||||
from playwright.async_api import Page, Browser, BrowserContext
|
||||
|
||||
async def on_browser_created(browser: Browser):
|
||||
def log_routing(route):
|
||||
# Example: block loading images
|
||||
if route.request.resource_type == "image":
|
||||
print(f"[HOOK] Blocking image request: {route.request.url}")
|
||||
asyncio.create_task(route.abort())
|
||||
else:
|
||||
asyncio.create_task(route.continue_())
|
||||
|
||||
async def on_browser_created(browser: Browser, **kwargs):
|
||||
print("[HOOK] on_browser_created")
|
||||
# Example customization: set browser viewport size
|
||||
context = await browser.new_context(viewport={'width': 1920, 'height': 1080})
|
||||
# Example: Set browser viewport size and log in
|
||||
context = await browser.new_context(viewport={"width": 1920, "height": 1080})
|
||||
page = await context.new_page()
|
||||
|
||||
# Example customization: logging in to a hypothetical website
|
||||
await page.goto('https://example.com/login')
|
||||
await page.fill('input[name="username"]', 'testuser')
|
||||
await page.fill('input[name="password"]', 'password123')
|
||||
await page.click('button[type="submit"]')
|
||||
await page.wait_for_selector('#welcome')
|
||||
|
||||
# Add a custom cookie
|
||||
await context.add_cookies([{'name': 'test_cookie', 'value': 'cookie_value', 'url': 'https://example.com'}])
|
||||
|
||||
await page.goto("https://example.com/login")
|
||||
await page.fill("input[name='username']", "testuser")
|
||||
await page.fill("input[name='password']", "password123")
|
||||
await page.click("button[type='submit']")
|
||||
await page.wait_for_selector("#welcome")
|
||||
await context.add_cookies([{"name": "auth_token", "value": "abc123", "url": "https://example.com"}])
|
||||
await page.close()
|
||||
await context.close()
|
||||
|
||||
async def before_goto(page: Page):
|
||||
print("[HOOK] before_goto")
|
||||
# Example customization: add custom headers
|
||||
await page.set_extra_http_headers({'X-Test-Header': 'test'})
|
||||
async def on_page_context_created(context: BrowserContext, page: Page, **kwargs):
|
||||
print("[HOOK] on_page_context_created")
|
||||
await context.route("**", log_routing)
|
||||
|
||||
async def after_goto(page: Page):
|
||||
async def before_goto(page: Page, context: BrowserContext, **kwargs):
|
||||
print("[HOOK] before_goto")
|
||||
await page.set_extra_http_headers({"X-Test-Header": "test"})
|
||||
|
||||
async def after_goto(page: Page, context: BrowserContext, **kwargs):
|
||||
print("[HOOK] after_goto")
|
||||
# Example customization: log the URL
|
||||
print(f"Current URL: {page.url}")
|
||||
|
||||
async def on_execution_started(page: Page):
|
||||
async def on_execution_started(page: Page, context: BrowserContext, **kwargs):
|
||||
print("[HOOK] on_execution_started")
|
||||
# Example customization: perform actions after JS execution
|
||||
await page.evaluate("console.log('Custom JS executed')")
|
||||
|
||||
async def before_return_html(page: Page, html: str):
|
||||
async def before_return_html(page: Page, context: BrowserContext, html: str, **kwargs):
|
||||
print("[HOOK] before_return_html")
|
||||
# Example customization: log the HTML length
|
||||
print(f"HTML length: {len(html)}")
|
||||
return page
|
||||
```
|
||||
|
||||
### Using the Hooks with the AsyncWebCrawler
|
||||
### Using the Hooks with AsyncWebCrawler
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
|
||||
|
||||
async def main():
|
||||
print("\n🔗 Using Crawler Hooks: Let's see how we can customize the AsyncWebCrawler using hooks!")
|
||||
|
||||
initial_cookies = [
|
||||
{"name": "sessionId", "value": "abc123", "domain": ".example.com"},
|
||||
{"name": "userId", "value": "12345", "domain": ".example.com"}
|
||||
]
|
||||
crawler_strategy = AsyncPlaywrightCrawlerStrategy(verbose=True, cookies=initial_cookies)
|
||||
crawler_strategy.set_hook('on_browser_created', on_browser_created)
|
||||
crawler_strategy.set_hook('before_goto', before_goto)
|
||||
crawler_strategy.set_hook('after_goto', after_goto)
|
||||
crawler_strategy.set_hook('on_execution_started', on_execution_started)
|
||||
crawler_strategy.set_hook('before_return_html', before_return_html)
|
||||
|
||||
async with AsyncWebCrawler(verbose=True, crawler_strategy=crawler_strategy) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
js_code="window.scrollTo(0, document.body.scrollHeight);",
|
||||
wait_for="footer"
|
||||
)
|
||||
print("\n🔗 Using Crawler Hooks: Customize AsyncWebCrawler with hooks!")
|
||||
|
||||
print("📦 Crawler Hooks result:")
|
||||
# Configure browser and crawler settings
|
||||
browser_config = BrowserConfig(
|
||||
headless=True,
|
||||
viewport_width=1920,
|
||||
viewport_height=1080
|
||||
)
|
||||
|
||||
crawler_run_config = CrawlerRunConfig(
|
||||
js_code="window.scrollTo(0, document.body.scrollHeight);",
|
||||
wait_for="footer"
|
||||
)
|
||||
|
||||
# Initialize crawler
|
||||
async with AsyncWebCrawler(browser_config=browser_config) as crawler:
|
||||
crawler.crawler_strategy.set_hook("on_browser_created", on_browser_created)
|
||||
crawler.crawler_strategy.set_hook("on_page_context_created", on_page_context_created)
|
||||
crawler.crawler_strategy.set_hook("before_goto", before_goto)
|
||||
crawler.crawler_strategy.set_hook("after_goto", after_goto)
|
||||
crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
|
||||
crawler.crawler_strategy.set_hook("before_return_html", before_return_html)
|
||||
|
||||
# Run the crawler
|
||||
result = await crawler.arun(url="https://example.com", config=crawler_run_config)
|
||||
|
||||
print("\n📦 Crawler Hooks Result:")
|
||||
print(result)
|
||||
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
### Explanation
|
||||
### Explanation of Hooks
|
||||
|
||||
- `on_browser_created`: This hook is called when the Playwright browser is created. It sets up the browser context, logs in to a website, and adds a custom cookie.
|
||||
- `before_goto`: This hook is called right before Playwright navigates to the URL. It adds custom HTTP headers.
|
||||
- `after_goto`: This hook is called after Playwright navigates to the URL. It logs the current URL.
|
||||
- `on_execution_started`: This hook is called after any custom JavaScript is executed. It performs additional JavaScript actions.
|
||||
- `before_return_html`: This hook is called before returning the HTML content. It logs the length of the HTML content.
|
||||
- **`on_browser_created`**: Called when the browser is created. Use this to configure the browser or handle authentication (e.g., logging in and setting cookies).
|
||||
- **`on_page_context_created`**: Called when a new page context is created. Use this to apply routing, block resources, or inject custom logic before navigating to the URL.
|
||||
- **`before_goto`**: Called before navigating to the URL. Use this to add custom headers or perform other pre-navigation actions.
|
||||
- **`after_goto`**: Called after navigation. Use this to verify content or log the URL.
|
||||
- **`on_execution_started`**: Called after executing custom JavaScript. Use this to perform additional actions.
|
||||
- **`before_return_html`**: Called before returning the HTML content. Use this to log details or preprocess the content.
|
||||
|
||||
### Additional Ideas
|
||||
### Additional Customizations
|
||||
|
||||
- **Handling authentication**: Use the `on_browser_created` hook to handle login processes or set authentication tokens.
|
||||
- **Dynamic header modification**: Modify headers based on the target URL or other conditions in the `before_goto` hook.
|
||||
- **Content verification**: Use the `after_goto` hook to verify that the expected content is present on the page.
|
||||
- **Custom JavaScript injection**: Inject and execute custom JavaScript using the `on_execution_started` hook.
|
||||
- **Content preprocessing**: Modify or analyze the HTML content in the `before_return_html` hook before it's returned.
|
||||
- **Resource Management**: Use `on_page_context_created` to block or modify requests (e.g., block images, fonts, or third-party scripts).
|
||||
- **Dynamic Headers**: Use `before_goto` to add or modify headers dynamically based on the URL.
|
||||
- **Authentication**: Use `on_browser_created` to handle login processes and set authentication cookies or tokens.
|
||||
- **Content Analysis**: Use `before_return_html` to analyze or modify the extracted HTML content.
|
||||
|
||||
These hooks provide powerful customization options for tailoring the crawling process to your needs.
|
||||
|
||||
By using these hooks, you can customize the behavior of the AsyncWebCrawler to suit your specific needs, including handling authentication, modifying requests, and preprocessing content.
|
||||
156
docs/md_v2/advanced/identity_based_crawling.md
Normal file
156
docs/md_v2/advanced/identity_based_crawling.md
Normal file
@@ -0,0 +1,156 @@
|
||||
### Preserve Your Identity with Crawl4AI
|
||||
|
||||
Crawl4AI empowers you to navigate and interact with the web using your authentic digital identity, ensuring that you are recognized as a human and not mistaken for a bot. This document introduces Managed Browsers, the recommended approach for preserving your rights to access the web, and Magic Mode, a simplified solution for specific scenarios.
|
||||
|
||||
---
|
||||
|
||||
### Managed Browsers: Your Digital Identity Solution
|
||||
|
||||
**Managed Browsers** enable developers to create and use persistent browser profiles. These profiles store local storage, cookies, and other session-related data, allowing you to interact with websites as a recognized user. By leveraging your unique identity, Managed Browsers ensure that your experience reflects your rights as a human browsing the web.
|
||||
|
||||
#### Why Use Managed Browsers?
|
||||
1. **Authentic Browsing Experience**: Managed Browsers retain session data and browser fingerprints, mirroring genuine user behavior.
|
||||
2. **Effortless Configuration**: Once you interact with the site using the browser (e.g., solving a CAPTCHA), the session data is saved and reused, providing seamless access.
|
||||
3. **Empowered Data Access**: By using your identity, Managed Browsers empower users to access data they can view on their own screens without artificial restrictions.
|
||||
|
||||
#### Steps to Use Managed Browsers
|
||||
|
||||
1. **Setup the Browser Configuration**:
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
|
||||
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
|
||||
|
||||
browser_config = BrowserConfig(
|
||||
headless=False, # Set to False for initial setup to view browser actions
|
||||
verbose=True,
|
||||
user_agent_mode="random",
|
||||
use_managed_browser=True, # Enables persistent browser sessions
|
||||
browser_type="chromium",
|
||||
user_data_dir="/path/to/user_profile_data" # Path to save session data
|
||||
)
|
||||
```
|
||||
|
||||
2. **Perform an Initial Run**:
|
||||
- Run the crawler with `headless=False`.
|
||||
- Manually interact with the site (e.g., solve CAPTCHA or log in).
|
||||
- The browser session saves cookies, local storage, and other required data.
|
||||
|
||||
3. **Subsequent Runs**:
|
||||
- Switch to `headless=True` for automation.
|
||||
- The session data is reused, allowing seamless crawling.
|
||||
|
||||
#### Example: Extracting Data Using Managed Browsers
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
|
||||
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
|
||||
|
||||
async def main():
|
||||
# Define schema for structured data extraction
|
||||
schema = {
|
||||
"name": "Example Data",
|
||||
"baseSelector": "div.example",
|
||||
"fields": [
|
||||
{"name": "title", "selector": "h1", "type": "text"},
|
||||
{"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
|
||||
]
|
||||
}
|
||||
|
||||
# Configure crawler
|
||||
browser_config = BrowserConfig(
|
||||
headless=True, # Automate subsequent runs
|
||||
verbose=True,
|
||||
use_managed_browser=True,
|
||||
user_data_dir="/path/to/user_profile_data"
|
||||
)
|
||||
|
||||
crawl_config = CrawlerRunConfig(
|
||||
extraction_strategy=JsonCssExtractionStrategy(schema),
|
||||
wait_for="css:div.example" # Wait for the targeted element to load
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
config=crawl_config
|
||||
)
|
||||
|
||||
if result.success:
|
||||
print("Extracted Data:", result.extracted_content)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
### Benefits of Managed Browsers Over Other Methods
|
||||
Managed Browsers eliminate the need for manual detection workarounds by enabling developers to work directly with their identity and user profile data. This approach ensures maximum compatibility with websites and simplifies the crawling process while preserving your right to access data freely.
|
||||
|
||||
---
|
||||
|
||||
### Magic Mode: Simplified Automation
|
||||
|
||||
While Managed Browsers are the preferred approach, **Magic Mode** provides an alternative for scenarios where persistent user profiles are unnecessary or infeasible. Magic Mode automates user-like behavior and simplifies configuration.
|
||||
|
||||
#### What Magic Mode Does:
|
||||
- Simulates human browsing by randomizing interaction patterns and timing.
|
||||
- Masks browser automation signals.
|
||||
- Handles cookie popups and modals.
|
||||
- Modifies navigator properties for enhanced compatibility.
|
||||
|
||||
#### Using Magic Mode
|
||||
|
||||
```python
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
magic=True # Enables all automation features
|
||||
)
|
||||
```
|
||||
|
||||
Magic Mode is particularly useful for:
|
||||
- Quick prototyping when a Managed Browser setup is not available.
|
||||
- Basic sites requiring minimal interaction or configuration.
|
||||
|
||||
#### Example: Combining Magic Mode with Additional Options
|
||||
|
||||
```python
|
||||
async def crawl_with_magic_mode(url: str):
|
||||
async with AsyncWebCrawler(headless=True) as crawler:
|
||||
result = await crawler.arun(
|
||||
url=url,
|
||||
magic=True,
|
||||
remove_overlay_elements=True, # Remove popups/modals
|
||||
page_timeout=60000 # Increased timeout for complex pages
|
||||
)
|
||||
|
||||
return result.markdown if result.success else None
|
||||
```
|
||||
|
||||
### Magic Mode vs. Managed Browsers
|
||||
While Magic Mode simplifies many tasks, it cannot match the reliability and authenticity of Managed Browsers. By using your identity and persistent profiles, Managed Browsers render Magic Mode largely unnecessary. However, Magic Mode remains a viable fallback for specific situations where user identity is not a factor.
|
||||
|
||||
---
|
||||
|
||||
### Key Comparison: Managed Browsers vs. Magic Mode
|
||||
|
||||
| Feature | **Managed Browsers** | **Magic Mode** |
|
||||
|-------------------------|------------------------------------------|-------------------------------------|
|
||||
| **Session Persistence** | Retains cookies and local storage. | No session retention. |
|
||||
| **Human Interaction** | Uses real user profiles and data. | Simulates human-like patterns. |
|
||||
| **Complex Sites** | Best suited for heavily configured sites.| Works well with simpler challenges.|
|
||||
| **Setup Complexity** | Requires initial manual interaction. | Fully automated, one-line setup. |
|
||||
|
||||
#### Recommendation:
|
||||
- Use **Managed Browsers** for reliable, session-based crawling and data extraction.
|
||||
- Use **Magic Mode** for quick prototyping or when persistent profiles are not required.
|
||||
|
||||
---
|
||||
|
||||
### Conclusion
|
||||
|
||||
- **Use Managed Browsers** to preserve your digital identity and ensure reliable, identity-based crawling with persistent sessions. This approach works seamlessly for even the most complex websites.
|
||||
- **Leverage Magic Mode** for quick automation or in scenarios where persistent user profiles are not needed.
|
||||
|
||||
By combining these approaches, Crawl4AI provides unparalleled flexibility and capability for your crawling needs.
|
||||
|
||||
@@ -1,136 +1,188 @@
|
||||
# Content Filtering in Crawl4AI
|
||||
# Creating Browser Instances, Contexts, and Pages
|
||||
|
||||
This guide explains how to use content filtering strategies in Crawl4AI to extract the most relevant information from crawled web pages. You'll learn how to use the built-in `BM25ContentFilter` and how to create your own custom content filtering strategies.
|
||||
## 1 Introduction
|
||||
|
||||
## Relevance Content Filter
|
||||
### Overview of Browser Management in Crawl4AI
|
||||
Crawl4AI's browser management system is designed to provide developers with advanced tools for handling complex web crawling tasks. By managing browser instances, contexts, and pages, Crawl4AI ensures optimal performance, anti-bot measures, and session persistence for high-volume, dynamic web crawling.
|
||||
|
||||
The `RelevanceContentFilter` is an abstract class that provides a common interface for content filtering strategies. Specific filtering algorithms, like `PruningContentFilter` or `BM25ContentFilter`, inherit from this class and implement the `filter_content` method. This method takes the HTML content as input and returns a list of filtered text blocks.
|
||||
### Key Objectives
|
||||
- **Anti-Bot Handling**:
|
||||
- Implements stealth techniques to evade detection mechanisms used by modern websites.
|
||||
- Simulates human-like behavior, such as mouse movements, scrolling, and key presses.
|
||||
- Supports integration with third-party services to bypass CAPTCHA challenges.
|
||||
- **Persistent Sessions**:
|
||||
- Retains session data (cookies, local storage) for workflows requiring user authentication.
|
||||
- Allows seamless continuation of tasks across multiple runs without re-authentication.
|
||||
- **Scalable Crawling**:
|
||||
- Optimized resource utilization for handling thousands of URLs concurrently.
|
||||
- Flexible configuration options to tailor crawling behavior to specific requirements.
|
||||
|
||||
---
|
||||
|
||||
## Pruning Content Filter
|
||||
## 2 Browser Creation Methods
|
||||
|
||||
The `PruningContentFilter` is a tree-shaking algorithm that analyzes the HTML DOM structure and removes less relevant nodes based on various metrics like text density, link density, and tag importance. It evaluates each node using a composite scoring system and "prunes" nodes that fall below a certain threshold.
|
||||
### Standard Browser Creation
|
||||
Standard browser creation initializes a browser instance with default or minimal configurations. It is suitable for tasks that do not require session persistence or heavy customization.
|
||||
|
||||
### Usage
|
||||
#### Features and Limitations
|
||||
- **Features**:
|
||||
- Quick and straightforward setup for small-scale tasks.
|
||||
- Supports headless and headful modes.
|
||||
- **Limitations**:
|
||||
- Lacks advanced customization options like session reuse.
|
||||
- May struggle with sites employing strict anti-bot measures.
|
||||
|
||||
#### Example Usage
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.content_filter_strategy import PruningContentFilter
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig
|
||||
|
||||
async def filter_content(url):
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
content_filter = PruningContentFilter(
|
||||
min_word_threshold=5,
|
||||
threshold_type='dynamic',
|
||||
threshold=0.45
|
||||
)
|
||||
result = await crawler.arun(url=url, extraction_strategy=content_filter, fit_markdown=True)
|
||||
if result.success:
|
||||
print(f"Cleaned Markdown:\n{result.fit_markdown}")
|
||||
browser_config = BrowserConfig(browser_type="chromium", headless=True)
|
||||
async with AsyncWebCrawler(browser_config=browser_config) as crawler:
|
||||
result = await crawler.arun("https://crawl4ai.com")
|
||||
print(result.markdown)
|
||||
```
|
||||
|
||||
### Parameters
|
||||
### Persistent Contexts
|
||||
Persistent contexts create browser sessions with stored data, enabling workflows that require maintaining login states or other session-specific information.
|
||||
|
||||
- **`min_word_threshold`**: (Optional) Minimum number of words a node must contain to be considered relevant. Nodes with fewer words are automatically pruned.
|
||||
|
||||
- **`threshold_type`**: (Optional, default 'fixed') Controls how pruning thresholds are calculated:
|
||||
- `'fixed'`: Uses a constant threshold value for all nodes
|
||||
- `'dynamic'`: Adjusts threshold based on node characteristics like tag importance and text/link ratios
|
||||
|
||||
- **`threshold`**: (Optional, default 0.48) Base threshold value for node pruning:
|
||||
- For fixed threshold: Nodes scoring below this value are removed
|
||||
- For dynamic threshold: This value is adjusted based on node properties
|
||||
|
||||
### How It Works
|
||||
|
||||
The pruning algorithm evaluates each node using multiple metrics:
|
||||
- Text density: Ratio of actual text to overall node content
|
||||
- Link density: Proportion of text within links
|
||||
- Tag importance: Weight based on HTML tag type (e.g., article, p, div)
|
||||
- Content quality: Metrics like text length and structural importance
|
||||
|
||||
Nodes scoring below the threshold are removed, effectively "shaking" less relevant content from the DOM tree. This results in a cleaner document containing only the most relevant content blocks.
|
||||
|
||||
The algorithm is particularly effective for:
|
||||
- Removing boilerplate content
|
||||
- Eliminating navigation menus and sidebars
|
||||
- Preserving main article content
|
||||
- Maintaining document structure while removing noise
|
||||
|
||||
|
||||
## BM25 Algorithm
|
||||
|
||||
The `BM25ContentFilter` uses the BM25 algorithm, a ranking function used in information retrieval to estimate the relevance of documents to a given search query. In Crawl4AI, this algorithm helps to identify and extract text chunks that are most relevant to the page's metadata or a user-specified query.
|
||||
|
||||
### Usage
|
||||
|
||||
To use the `BM25ContentFilter`, initialize it and then pass it as the `extraction_strategy` parameter to the `arun` method of the crawler.
|
||||
#### Benefits of Using `user_data_dir`
|
||||
- **Session Persistence**:
|
||||
- Stores cookies, local storage, and cache between crawling sessions.
|
||||
- Reduces overhead for repetitive logins or multi-step workflows.
|
||||
- **Enhanced Performance**:
|
||||
- Leverages pre-loaded resources for faster page loading.
|
||||
- **Flexibility**:
|
||||
- Adapts to complex workflows requiring user-specific configurations.
|
||||
|
||||
#### Example: Setting Up Persistent Contexts
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.content_filter_strategy import BM25ContentFilter
|
||||
|
||||
async def filter_content(url, query=None):
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
content_filter = BM25ContentFilter(user_query=query)
|
||||
result = await crawler.arun(url=url, extraction_strategy=content_filter, fit_markdown=True) # Set fit_markdown flag to True to trigger BM25 filtering
|
||||
if result.success:
|
||||
print(f"Filtered Content (JSON):\n{result.extracted_content}")
|
||||
print(f"\nFiltered Markdown:\n{result.fit_markdown}") # New field in CrawlResult object
|
||||
print(f"\nFiltered HTML:\n{result.fit_html}") # New field in CrawlResult object. Note that raw HTML may have tags re-organized due to internal parsing.
|
||||
else:
|
||||
print("Error:", result.error_message)
|
||||
|
||||
# Example usage:
|
||||
asyncio.run(filter_content("https://en.wikipedia.org/wiki/Apple", "fruit nutrition health")) # with query
|
||||
asyncio.run(filter_content("https://en.wikipedia.org/wiki/Apple")) # without query, metadata will be used as the query.
|
||||
|
||||
config = BrowserConfig(user_data_dir="/path/to/user/data")
|
||||
async with AsyncWebCrawler(browser_config=config) as crawler:
|
||||
result = await crawler.arun("https://crawl4ai.com")
|
||||
print(result.markdown)
|
||||
```
|
||||
|
||||
### Parameters
|
||||
### Managed Browser
|
||||
The `ManagedBrowser` class offers a high-level abstraction for managing browser instances, emphasizing resource management, debugging capabilities, and anti-bot measures.
|
||||
|
||||
- **`user_query`**: (Optional) A string representing the search query. If not provided, the filter extracts relevant metadata (title, description, keywords) from the page and uses that as the query.
|
||||
- **`bm25_threshold`**: (Optional, default 1.0) A float value that controls the threshold for relevance. Higher values result in stricter filtering, returning only the most relevant text chunks. Lower values result in more lenient filtering.
|
||||
#### How It Works
|
||||
- **Browser Process Management**:
|
||||
- Automates initialization and cleanup of browser processes.
|
||||
- Optimizes resource usage by pooling and reusing browser instances.
|
||||
- **Debugging Support**:
|
||||
- Integrates with debugging tools like Chrome Developer Tools for real-time inspection.
|
||||
- **Anti-Bot Measures**:
|
||||
- Implements stealth plugins to mimic real user behavior and bypass bot detection.
|
||||
|
||||
#### Features
|
||||
- **Customizable Configurations**:
|
||||
- Supports advanced options such as viewport resizing, proxy settings, and header manipulation.
|
||||
- **Debugging and Logging**:
|
||||
- Logs detailed browser interactions for debugging and performance analysis.
|
||||
- **Scalability**:
|
||||
- Handles multiple browser instances concurrently, scaling dynamically based on workload.
|
||||
|
||||
## Fit Markdown Flag
|
||||
|
||||
Setting the `fit_markdown` flag to `True` in the `arun` method activates the BM25 content filtering during the crawl. The `fit_markdown` parameter instructs the scraper to extract and clean the HTML, primarily to prepare for a Large Language Model that cannot process large amounts of data. Setting this flag not only improves the quality of the extracted content but also adds the filtered content to two new attributes in the returned `CrawlResult` object: `fit_markdown` and `fit_html`.
|
||||
|
||||
|
||||
## Custom Content Filtering Strategies
|
||||
|
||||
You can create your own custom filtering strategies by inheriting from the `RelevantContentFilter` class and implementing the `filter_content` method. This allows you to tailor the filtering logic to your specific needs.
|
||||
|
||||
#### Example: Using `ManagedBrowser`
|
||||
```python
|
||||
from crawl4ai.content_filter_strategy import RelevantContentFilter
|
||||
from bs4 import BeautifulSoup, Tag
|
||||
from typing import List
|
||||
|
||||
class MyCustomFilter(RelevantContentFilter):
|
||||
def filter_content(self, html: str) -> List[str]:
|
||||
soup = BeautifulSoup(html, 'lxml')
|
||||
# Implement custom filtering logic here
|
||||
# Example: extract all paragraphs within divs with class "article-body"
|
||||
filtered_paragraphs = []
|
||||
for tag in soup.select("div.article-body p"):
|
||||
if isinstance(tag, Tag):
|
||||
filtered_paragraphs.append(str(tag)) # Add the cleaned HTML element.
|
||||
return filtered_paragraphs
|
||||
|
||||
|
||||
|
||||
async def custom_filter_demo(url: str):
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
custom_filter = MyCustomFilter()
|
||||
result = await crawler.arun(url, extraction_strategy=custom_filter)
|
||||
if result.success:
|
||||
print(result.extracted_content)
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig
|
||||
|
||||
config = BrowserConfig(headless=False, debug_port=9222)
|
||||
async with AsyncWebCrawler(browser_config=config) as crawler:
|
||||
result = await crawler.arun("https://crawl4ai.com")
|
||||
print(result.markdown)
|
||||
```
|
||||
|
||||
This example demonstrates extracting paragraphs from a specific div class. You can customize this logic to implement different filtering strategies, use regular expressions, analyze text density, or apply other relevant techniques.
|
||||
---
|
||||
|
||||
## Conclusion
|
||||
## 3 Context and Page Management
|
||||
|
||||
Content filtering strategies provide a powerful way to refine the output of your crawls. By using `BM25ContentFilter` or creating custom strategies, you can focus on the most pertinent information and improve the efficiency of your data processing pipeline.
|
||||
### Creating and Configuring Browser Contexts
|
||||
Browser contexts act as isolated environments within a single browser instance, enabling independent browsing sessions with their own cookies, cache, and storage.
|
||||
|
||||
#### Customizations
|
||||
- **Headers and Cookies**:
|
||||
- Define custom headers to mimic specific devices or browsers.
|
||||
- Set cookies for authenticated sessions.
|
||||
- **Session Reuse**:
|
||||
- Retain and reuse session data across multiple requests.
|
||||
- Example: Preserve login states for authenticated crawls.
|
||||
|
||||
#### Example: Context Initialization
|
||||
```python
|
||||
from crawl4ai import CrawlerRunConfig
|
||||
|
||||
config = CrawlerRunConfig(headers={"User-Agent": "Crawl4AI/1.0"})
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun("https://crawl4ai.com", config=config)
|
||||
print(result.markdown)
|
||||
```
|
||||
|
||||
### Creating Pages
|
||||
Pages represent individual tabs or views within a browser context. They are responsible for rendering content, executing JavaScript, and handling user interactions.
|
||||
|
||||
#### Key Features
|
||||
- **IFrame Handling**:
|
||||
- Extract content from embedded iframes.
|
||||
- Navigate and interact with nested content.
|
||||
- **Viewport Customization**:
|
||||
- Adjust viewport size to match target device dimensions.
|
||||
- **Lazy Loading**:
|
||||
- Ensure dynamic elements are fully loaded before extraction.
|
||||
|
||||
#### Example: Page Initialization
|
||||
```python
|
||||
config = CrawlerRunConfig(viewport_width=1920, viewport_height=1080)
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun("https://crawl4ai.com", config=config)
|
||||
print(result.markdown)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 4 Advanced Features and Best Practices
|
||||
|
||||
### Debugging and Logging
|
||||
Remote debugging provides a powerful way to troubleshoot complex crawling workflows.
|
||||
|
||||
#### Example: Enabling Remote Debugging
|
||||
```python
|
||||
config = BrowserConfig(debug_port=9222)
|
||||
async with AsyncWebCrawler(browser_config=config) as crawler:
|
||||
result = await crawler.arun("https://crawl4ai.com")
|
||||
```
|
||||
|
||||
### Anti-Bot Techniques
|
||||
- **Human Behavior Simulation**:
|
||||
- Mimic real user actions, such as scrolling, clicking, and typing.
|
||||
- Example: Use JavaScript to simulate interactions.
|
||||
- **Captcha Handling**:
|
||||
- Integrate with third-party services like 2Captcha or AntiCaptcha for automated solving.
|
||||
|
||||
#### Example: Simulating User Actions
|
||||
```python
|
||||
js_code = """
|
||||
(async () => {
|
||||
document.querySelector('input[name="search"]').value = 'test';
|
||||
document.querySelector('button[type="submit"]').click();
|
||||
})();
|
||||
"""
|
||||
config = CrawlerRunConfig(js_code=[js_code])
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun("https://crawl4ai.com", config=config)
|
||||
```
|
||||
|
||||
### Optimizations for Performance and Scalability
|
||||
- **Persistent Contexts**:
|
||||
- Reuse browser contexts to minimize resource consumption.
|
||||
- **Concurrent Crawls**:
|
||||
- Use `arun_many` with a controlled semaphore count for efficient batch processing.
|
||||
|
||||
#### Example: Scaling Crawls
|
||||
```python
|
||||
urls = ["https://example1.com", "https://example2.com"]
|
||||
config = CrawlerRunConfig(semaphore_count=10)
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
results = await crawler.arun_many(urls, config=config)
|
||||
for result in results:
|
||||
print(result.url, result.markdown)
|
||||
```
|
||||
|
||||
@@ -4,59 +4,67 @@ Configure proxy settings and enhance security features in Crawl4AI for reliable
|
||||
|
||||
## Basic Proxy Setup
|
||||
|
||||
Simple proxy configuration:
|
||||
Simple proxy configuration with `BrowserConfig`:
|
||||
|
||||
```python
|
||||
from crawl4ai.async_configs import BrowserConfig
|
||||
|
||||
# Using proxy URL
|
||||
async with AsyncWebCrawler(
|
||||
proxy="http://proxy.example.com:8080"
|
||||
) as crawler:
|
||||
browser_config = BrowserConfig(proxy="http://proxy.example.com:8080")
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
|
||||
# Using SOCKS proxy
|
||||
async with AsyncWebCrawler(
|
||||
proxy="socks5://proxy.example.com:1080"
|
||||
) as crawler:
|
||||
browser_config = BrowserConfig(proxy="socks5://proxy.example.com:1080")
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
```
|
||||
|
||||
## Authenticated Proxy
|
||||
|
||||
Use proxy with authentication:
|
||||
Use an authenticated proxy with `BrowserConfig`:
|
||||
|
||||
```python
|
||||
from crawl4ai.async_configs import BrowserConfig
|
||||
|
||||
proxy_config = {
|
||||
"server": "http://proxy.example.com:8080",
|
||||
"username": "user",
|
||||
"password": "pass"
|
||||
}
|
||||
|
||||
async with AsyncWebCrawler(proxy_config=proxy_config) as crawler:
|
||||
browser_config = BrowserConfig(proxy_config=proxy_config)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
```
|
||||
|
||||
## Rotating Proxies
|
||||
|
||||
Example using a proxy rotation service:
|
||||
Example using a proxy rotation service and updating `BrowserConfig` dynamically:
|
||||
|
||||
```python
|
||||
from crawl4ai.async_configs import BrowserConfig
|
||||
|
||||
async def get_next_proxy():
|
||||
# Your proxy rotation logic here
|
||||
return {"server": "http://next.proxy.com:8080"}
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
browser_config = BrowserConfig()
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
# Update proxy for each request
|
||||
for url in urls:
|
||||
proxy = await get_next_proxy()
|
||||
crawler.update_proxy(proxy)
|
||||
result = await crawler.arun(url=url)
|
||||
browser_config.proxy_config = proxy
|
||||
result = await crawler.arun(url=url, config=browser_config)
|
||||
```
|
||||
|
||||
## Custom Headers
|
||||
|
||||
Add security-related headers:
|
||||
Add security-related headers via `BrowserConfig`:
|
||||
|
||||
```python
|
||||
from crawl4ai.async_configs import BrowserConfig
|
||||
|
||||
headers = {
|
||||
"X-Forwarded-For": "203.0.113.195",
|
||||
"Accept-Language": "en-US,en;q=0.9",
|
||||
@@ -64,21 +72,24 @@ headers = {
|
||||
"Pragma": "no-cache"
|
||||
}
|
||||
|
||||
async with AsyncWebCrawler(headers=headers) as crawler:
|
||||
browser_config = BrowserConfig(headers=headers)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
```
|
||||
|
||||
## Combining with Magic Mode
|
||||
|
||||
For maximum protection, combine proxy with Magic Mode:
|
||||
For maximum protection, combine proxy with Magic Mode via `CrawlerRunConfig` and `BrowserConfig`:
|
||||
|
||||
```python
|
||||
async with AsyncWebCrawler(
|
||||
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
|
||||
|
||||
browser_config = BrowserConfig(
|
||||
proxy="http://proxy.example.com:8080",
|
||||
headers={"Accept-Language": "en-US"}
|
||||
) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://example.com",
|
||||
magic=True # Enable all anti-detection features
|
||||
)
|
||||
```
|
||||
)
|
||||
crawler_config = CrawlerRunConfig(magic=True) # Enable all anti-detection features
|
||||
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
result = await crawler.arun(url="https://example.com", config=crawler_config)
|
||||
```
|
||||
|
||||
@@ -1,44 +1,53 @@
|
||||
# Session-Based Crawling for Dynamic Content
|
||||
### Session-Based Crawling for Dynamic Content
|
||||
|
||||
In modern web applications, content is often loaded dynamically without changing the URL. Examples include "Load More" buttons, infinite scrolling, or paginated content that updates via JavaScript. To effectively crawl such websites, Crawl4AI provides powerful session-based crawling capabilities.
|
||||
In modern web applications, content is often loaded dynamically without changing the URL. Examples include "Load More" buttons, infinite scrolling, or paginated content that updates via JavaScript. Crawl4AI provides session-based crawling capabilities to handle such scenarios effectively.
|
||||
|
||||
This guide will explore advanced techniques for crawling dynamic content using Crawl4AI's session management features.
|
||||
This guide explores advanced techniques for crawling dynamic content using Crawl4AI's session management features.
|
||||
|
||||
---
|
||||
|
||||
## Understanding Session-Based Crawling
|
||||
|
||||
Session-based crawling allows you to maintain a persistent browser session across multiple requests. This is crucial when:
|
||||
Session-based crawling allows you to reuse a persistent browser session across multiple actions. This means the same browser tab (or page object) is used throughout, enabling:
|
||||
|
||||
1. The content changes dynamically without URL changes
|
||||
2. You need to interact with the page (e.g., clicking buttons) between requests
|
||||
3. The site requires authentication or maintains state across pages
|
||||
1. **Efficient handling of dynamic content** without reloading the page.
|
||||
2. **JavaScript actions before and after crawling** (e.g., clicking buttons or scrolling).
|
||||
3. **State maintenance** for authenticated sessions or multi-step workflows.
|
||||
4. **Faster sequential crawling**, as it avoids reopening tabs or reallocating resources.
|
||||
|
||||
Crawl4AI's `AsyncWebCrawler` class supports session-based crawling through the `session_id` parameter and related methods.
|
||||
**Note:** Session-based crawling is ideal for sequential operations, not parallel tasks.
|
||||
|
||||
---
|
||||
|
||||
## Basic Concepts
|
||||
|
||||
Before diving into examples, let's review some key concepts:
|
||||
Before diving into examples, here are some key concepts:
|
||||
|
||||
- **Session ID**: A unique identifier for a browsing session. Use the same `session_id` across multiple `arun` calls to maintain state.
|
||||
- **JavaScript Execution**: Use the `js_code` parameter to execute JavaScript on the page, such as clicking a "Load More" button.
|
||||
- **CSS Selectors**: Use these to target specific elements for extraction or interaction.
|
||||
- **Extraction Strategy**: Define how to extract structured data from the page.
|
||||
- **Wait Conditions**: Specify conditions to wait for before considering the page loaded.
|
||||
- **Session ID**: A unique identifier for a browsing session. Use the same `session_id` across multiple requests to maintain state.
|
||||
- **BrowserConfig & CrawlerRunConfig**: These configuration objects control browser settings and crawling behavior.
|
||||
- **JavaScript Execution**: Use `js_code` to perform actions like clicking buttons.
|
||||
- **CSS Selectors**: Target specific elements for interaction or data extraction.
|
||||
- **Extraction Strategy**: Define rules to extract structured data.
|
||||
- **Wait Conditions**: Specify conditions to wait for before proceeding.
|
||||
|
||||
---
|
||||
|
||||
## Example 1: Basic Session-Based Crawling
|
||||
|
||||
Let's start with a basic example of session-based crawling:
|
||||
A simple example using session-based crawling:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, CacheMode
|
||||
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
|
||||
from crawl4ai.cache_context import CacheMode
|
||||
|
||||
async def basic_session_crawl():
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
session_id = "my_session"
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
session_id = "dynamic_content_session"
|
||||
url = "https://example.com/dynamic-content"
|
||||
|
||||
for page in range(3):
|
||||
result = await crawler.arun(
|
||||
config = CrawlerRunConfig(
|
||||
url=url,
|
||||
session_id=session_id,
|
||||
js_code="document.querySelector('.load-more-button').click();" if page > 0 else None,
|
||||
@@ -46,6 +55,7 @@ async def basic_session_crawl():
|
||||
cache_mode=CacheMode.BYPASS
|
||||
)
|
||||
|
||||
result = await crawler.arun(config=config)
|
||||
print(f"Page {page + 1}: Found {result.extracted_content.count('.content-item')} items")
|
||||
|
||||
await crawler.crawler_strategy.kill_session(session_id)
|
||||
@@ -53,17 +63,16 @@ async def basic_session_crawl():
|
||||
asyncio.run(basic_session_crawl())
|
||||
```
|
||||
|
||||
This example demonstrates:
|
||||
1. Using a consistent `session_id` across multiple `arun` calls
|
||||
2. Executing JavaScript to load more content after the first page
|
||||
3. Using a CSS selector to extract specific content
|
||||
4. Properly closing the session after crawling
|
||||
This example shows:
|
||||
1. Reusing the same `session_id` across multiple requests.
|
||||
2. Executing JavaScript to load more content dynamically.
|
||||
3. Properly closing the session to free resources.
|
||||
|
||||
---
|
||||
|
||||
## Advanced Technique 1: Custom Execution Hooks
|
||||
|
||||
Crawl4AI allows you to set custom hooks that execute at different stages of the crawling process. This is particularly useful for handling complex loading scenarios.
|
||||
|
||||
Here's an example that waits for new content to appear before proceeding:
|
||||
Use custom hooks to handle complex scenarios, such as waiting for content to load dynamically:
|
||||
|
||||
```python
|
||||
async def advanced_session_crawl_with_hooks():
|
||||
@@ -75,202 +84,96 @@ async def advanced_session_crawl_with_hooks():
|
||||
while True:
|
||||
await page.wait_for_selector("li.commit-item h4")
|
||||
commit = await page.query_selector("li.commit-item h4")
|
||||
commit = await commit.evaluate("(element) => element.textContent")
|
||||
commit = commit.strip()
|
||||
commit = await commit.evaluate("(element) => element.textContent").strip()
|
||||
if commit and commit != first_commit:
|
||||
first_commit = commit
|
||||
break
|
||||
await asyncio.sleep(0.5)
|
||||
except Exception as e:
|
||||
print(f"Warning: New content didn't appear after JavaScript execution: {e}")
|
||||
print(f"Warning: New content didn't appear: {e}")
|
||||
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
session_id = "commit_session"
|
||||
url = "https://github.com/example/repo/commits/main"
|
||||
crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
|
||||
|
||||
url = "https://github.com/example/repo/commits/main"
|
||||
session_id = "commit_session"
|
||||
all_commits = []
|
||||
|
||||
js_next_page = """
|
||||
const button = document.querySelector('a.pagination-next');
|
||||
if (button) button.click();
|
||||
"""
|
||||
js_next_page = """document.querySelector('a.pagination-next').click();"""
|
||||
|
||||
for page in range(3):
|
||||
result = await crawler.arun(
|
||||
config = CrawlerRunConfig(
|
||||
url=url,
|
||||
session_id=session_id,
|
||||
css_selector="li.commit-item",
|
||||
js_code=js_next_page if page > 0 else None,
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
js_only=page > 0
|
||||
css_selector="li.commit-item",
|
||||
js_only=page > 0,
|
||||
cache_mode=CacheMode.BYPASS
|
||||
)
|
||||
|
||||
commits = result.extracted_content.select("li.commit-item")
|
||||
all_commits.extend(commits)
|
||||
print(f"Page {page + 1}: Found {len(commits)} commits")
|
||||
result = await crawler.arun(config=config)
|
||||
print(f"Page {page + 1}: Found {len(result.extracted_content)} commits")
|
||||
|
||||
await crawler.crawler_strategy.kill_session(session_id)
|
||||
print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
|
||||
|
||||
asyncio.run(advanced_session_crawl_with_hooks())
|
||||
```
|
||||
|
||||
This technique uses a custom `on_execution_started` hook to ensure new content has loaded before proceeding to the next step.
|
||||
This technique ensures new content loads before the next action.
|
||||
|
||||
---
|
||||
|
||||
## Advanced Technique 2: Integrated JavaScript Execution and Waiting
|
||||
|
||||
Instead of using separate hooks, you can integrate the waiting logic directly into your JavaScript execution. This approach can be more concise and easier to manage for some scenarios.
|
||||
|
||||
Here's an example:
|
||||
Combine JavaScript execution and waiting logic for concise handling of dynamic content:
|
||||
|
||||
```python
|
||||
async def integrated_js_and_wait_crawl():
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
url = "https://github.com/example/repo/commits/main"
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
session_id = "integrated_session"
|
||||
all_commits = []
|
||||
url = "https://github.com/example/repo/commits/main"
|
||||
|
||||
js_next_page_and_wait = """
|
||||
(async () => {
|
||||
const getCurrentCommit = () => {
|
||||
const commits = document.querySelectorAll('li.commit-item h4');
|
||||
return commits.length > 0 ? commits[0].textContent.trim() : null;
|
||||
};
|
||||
|
||||
const getCurrentCommit = () => document.querySelector('li.commit-item h4').textContent.trim();
|
||||
const initialCommit = getCurrentCommit();
|
||||
const button = document.querySelector('a.pagination-next');
|
||||
if (button) button.click();
|
||||
|
||||
while (true) {
|
||||
document.querySelector('a.pagination-next').click();
|
||||
while (getCurrentCommit() === initialCommit) {
|
||||
await new Promise(resolve => setTimeout(resolve, 100));
|
||||
const newCommit = getCurrentCommit();
|
||||
if (newCommit && newCommit !== initialCommit) {
|
||||
break;
|
||||
}
|
||||
}
|
||||
})();
|
||||
"""
|
||||
|
||||
schema = {
|
||||
"name": "Commit Extractor",
|
||||
"baseSelector": "li.commit-item",
|
||||
"fields": [
|
||||
{
|
||||
"name": "title",
|
||||
"selector": "h4.commit-title",
|
||||
"type": "text",
|
||||
"transform": "strip",
|
||||
},
|
||||
],
|
||||
}
|
||||
extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
|
||||
|
||||
for page in range(3):
|
||||
result = await crawler.arun(
|
||||
config = CrawlerRunConfig(
|
||||
url=url,
|
||||
session_id=session_id,
|
||||
css_selector="li.commit-item",
|
||||
extraction_strategy=extraction_strategy,
|
||||
js_code=js_next_page_and_wait if page > 0 else None,
|
||||
css_selector="li.commit-item",
|
||||
js_only=page > 0,
|
||||
cache_mode=CacheMode.BYPASS
|
||||
)
|
||||
|
||||
commits = json.loads(result.extracted_content)
|
||||
all_commits.extend(commits)
|
||||
print(f"Page {page + 1}: Found {len(commits)} commits")
|
||||
result = await crawler.arun(config=config)
|
||||
print(f"Page {page + 1}: Found {len(result.extracted_content)} commits")
|
||||
|
||||
await crawler.crawler_strategy.kill_session(session_id)
|
||||
print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
|
||||
|
||||
asyncio.run(integrated_js_and_wait_crawl())
|
||||
```
|
||||
|
||||
This approach combines the JavaScript for clicking the "next" button and waiting for new content to load into a single script.
|
||||
|
||||
## Advanced Technique 3: Using the `wait_for` Parameter
|
||||
|
||||
Crawl4AI provides a `wait_for` parameter that allows you to specify a condition to wait for before considering the page fully loaded. This can be particularly useful for dynamic content.
|
||||
|
||||
Here's an example:
|
||||
|
||||
```python
|
||||
async def wait_for_parameter_crawl():
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
url = "https://github.com/example/repo/commits/main"
|
||||
session_id = "wait_for_session"
|
||||
all_commits = []
|
||||
|
||||
js_next_page = """
|
||||
const commits = document.querySelectorAll('li.commit-item h4');
|
||||
if (commits.length > 0) {
|
||||
window.lastCommit = commits[0].textContent.trim();
|
||||
}
|
||||
const button = document.querySelector('a.pagination-next');
|
||||
if (button) button.click();
|
||||
"""
|
||||
|
||||
wait_for = """() => {
|
||||
const commits = document.querySelectorAll('li.commit-item h4');
|
||||
if (commits.length === 0) return false;
|
||||
const firstCommit = commits[0].textContent.trim();
|
||||
return firstCommit !== window.lastCommit;
|
||||
}"""
|
||||
|
||||
schema = {
|
||||
"name": "Commit Extractor",
|
||||
"baseSelector": "li.commit-item",
|
||||
"fields": [
|
||||
{
|
||||
"name": "title",
|
||||
"selector": "h4.commit-title",
|
||||
"type": "text",
|
||||
"transform": "strip",
|
||||
},
|
||||
],
|
||||
}
|
||||
extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
|
||||
|
||||
for page in range(3):
|
||||
result = await crawler.arun(
|
||||
url=url,
|
||||
session_id=session_id,
|
||||
css_selector="li.commit-item",
|
||||
extraction_strategy=extraction_strategy,
|
||||
js_code=js_next_page if page > 0 else None,
|
||||
wait_for=wait_for if page > 0 else None,
|
||||
js_only=page > 0,
|
||||
cache_mode=CacheMode.BYPASS
|
||||
)
|
||||
|
||||
commits = json.loads(result.extracted_content)
|
||||
all_commits.extend(commits)
|
||||
print(f"Page {page + 1}: Found {len(commits)} commits")
|
||||
|
||||
await crawler.crawler_strategy.kill_session(session_id)
|
||||
print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
|
||||
|
||||
asyncio.run(wait_for_parameter_crawl())
|
||||
```
|
||||
|
||||
This technique separates the JavaScript execution (clicking the "next" button) from the waiting condition, providing more flexibility and clarity in some scenarios.
|
||||
---
|
||||
|
||||
## Best Practices for Session-Based Crawling
|
||||
|
||||
1. **Use Unique Session IDs**: Ensure each crawling session has a unique `session_id` to prevent conflicts.
|
||||
2. **Close Sessions**: Always close sessions using `kill_session` when you're done to free up resources.
|
||||
3. **Handle Errors**: Implement proper error handling to deal with unexpected situations during crawling.
|
||||
4. **Respect Website Terms**: Ensure your crawling adheres to the website's terms of service and robots.txt file.
|
||||
5. **Implement Delays**: Add appropriate delays between requests to avoid overwhelming the target server.
|
||||
6. **Use Extraction Strategies**: Leverage `JsonCssExtractionStrategy` or other extraction strategies for structured data extraction.
|
||||
7. **Optimize JavaScript**: Keep your JavaScript execution concise and efficient to improve crawling speed.
|
||||
8. **Monitor Performance**: Keep an eye on memory usage and crawling speed, especially for long-running sessions.
|
||||
1. **Unique Session IDs**: Assign descriptive and unique `session_id` values.
|
||||
2. **Close Sessions**: Always clean up sessions with `kill_session` after use.
|
||||
3. **Error Handling**: Anticipate and handle errors gracefully.
|
||||
4. **Respect Websites**: Follow terms of service and robots.txt.
|
||||
5. **Delays**: Add delays to avoid overwhelming servers.
|
||||
6. **Optimize JavaScript**: Keep scripts concise for better performance.
|
||||
7. **Monitor Resources**: Track memory and CPU usage for long sessions.
|
||||
|
||||
---
|
||||
|
||||
## Conclusion
|
||||
|
||||
Session-based crawling with Crawl4AI provides powerful capabilities for handling dynamic content and complex web applications. By leveraging session management, JavaScript execution, and waiting strategies, you can effectively crawl and extract data from a wide range of modern websites.
|
||||
|
||||
Remember to use these techniques responsibly and in compliance with website policies and ethical web scraping practices.
|
||||
|
||||
For more advanced usage and API details, refer to the Crawl4AI API documentation.
|
||||
Session-based crawling in Crawl4AI is a robust solution for handling dynamic content and multi-step workflows. By combining session management, JavaScript execution, and structured extraction strategies, you can effectively navigate and extract data from modern web applications. Always adhere to ethical web scraping practices and respect website policies.
|
||||
@@ -1,74 +1,70 @@
|
||||
# Session Management
|
||||
### Session Management
|
||||
|
||||
Session management in Crawl4AI allows you to maintain state across multiple requests and handle complex multi-page crawling tasks, particularly useful for dynamic websites.
|
||||
Session management in Crawl4AI is a powerful feature that allows you to maintain state across multiple requests, making it particularly suitable for handling complex multi-step crawling tasks. It enables you to reuse the same browser tab (or page object) across sequential actions and crawls, which is beneficial for:
|
||||
|
||||
## Basic Session Usage
|
||||
- **Performing JavaScript actions before and after crawling.**
|
||||
- **Executing multiple sequential crawls faster** without needing to reopen tabs or allocate memory repeatedly.
|
||||
|
||||
Use `session_id` to maintain state between requests:
|
||||
**Note:** This feature is designed for sequential workflows and is not suitable for parallel operations.
|
||||
|
||||
---
|
||||
|
||||
#### Basic Session Usage
|
||||
|
||||
Use `BrowserConfig` and `CrawlerRunConfig` to maintain state with a `session_id`:
|
||||
|
||||
```python
|
||||
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
session_id = "my_session"
|
||||
|
||||
|
||||
# Define configurations
|
||||
config1 = CrawlerRunConfig(url="https://example.com/page1", session_id=session_id)
|
||||
config2 = CrawlerRunConfig(url="https://example.com/page2", session_id=session_id)
|
||||
|
||||
# First request
|
||||
result1 = await crawler.arun(
|
||||
url="https://example.com/page1",
|
||||
session_id=session_id
|
||||
)
|
||||
|
||||
# Subsequent request using same session
|
||||
result2 = await crawler.arun(
|
||||
url="https://example.com/page2",
|
||||
session_id=session_id
|
||||
)
|
||||
|
||||
result1 = await crawler.arun(config=config1)
|
||||
|
||||
# Subsequent request using the same session
|
||||
result2 = await crawler.arun(config=config2)
|
||||
|
||||
# Clean up when done
|
||||
await crawler.crawler_strategy.kill_session(session_id)
|
||||
```
|
||||
|
||||
## Dynamic Content with Sessions
|
||||
---
|
||||
|
||||
Here's a real-world example of crawling GitHub commits across multiple pages:
|
||||
#### Dynamic Content with Sessions
|
||||
|
||||
Here's an example of crawling GitHub commits across multiple pages while preserving session state:
|
||||
|
||||
```python
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
|
||||
from crawl4ai.cache_context import CacheMode
|
||||
|
||||
async def crawl_dynamic_content():
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
session_id = "github_commits_session"
|
||||
url = "https://github.com/microsoft/TypeScript/commits/main"
|
||||
session_id = "typescript_commits_session"
|
||||
all_commits = []
|
||||
|
||||
# Define navigation JavaScript
|
||||
js_next_page = """
|
||||
const button = document.querySelector('a[data-testid="pagination-next-button"]');
|
||||
if (button) button.click();
|
||||
"""
|
||||
|
||||
# Define wait condition
|
||||
wait_for = """() => {
|
||||
const commits = document.querySelectorAll('li.Box-sc-g0xbh4-0 h4');
|
||||
if (commits.length === 0) return false;
|
||||
const firstCommit = commits[0].textContent.trim();
|
||||
return firstCommit !== window.firstCommit;
|
||||
}"""
|
||||
|
||||
# Define extraction schema
|
||||
schema = {
|
||||
"name": "Commit Extractor",
|
||||
"baseSelector": "li.Box-sc-g0xbh4-0",
|
||||
"fields": [
|
||||
{
|
||||
"name": "title",
|
||||
"selector": "h4.markdown-title",
|
||||
"type": "text",
|
||||
"transform": "strip",
|
||||
},
|
||||
],
|
||||
"fields": [{"name": "title", "selector": "h4.markdown-title", "type": "text"}],
|
||||
}
|
||||
extraction_strategy = JsonCssExtractionStrategy(schema)
|
||||
|
||||
# JavaScript and wait configurations
|
||||
js_next_page = """document.querySelector('a[data-testid="pagination-next-button"]').click();"""
|
||||
wait_for = """() => document.querySelectorAll('li.Box-sc-g0xbh4-0').length > 0"""
|
||||
|
||||
# Crawl multiple pages
|
||||
for page in range(3):
|
||||
result = await crawler.arun(
|
||||
config = CrawlerRunConfig(
|
||||
url=url,
|
||||
session_id=session_id,
|
||||
extraction_strategy=extraction_strategy,
|
||||
@@ -78,6 +74,7 @@ async def crawl_dynamic_content():
|
||||
cache_mode=CacheMode.BYPASS
|
||||
)
|
||||
|
||||
result = await crawler.arun(config=config)
|
||||
if result.success:
|
||||
commits = json.loads(result.extracted_content)
|
||||
all_commits.extend(commits)
|
||||
@@ -88,46 +85,53 @@ async def crawl_dynamic_content():
|
||||
return all_commits
|
||||
```
|
||||
|
||||
## Session Best Practices
|
||||
---
|
||||
|
||||
1. **Session Naming**:
|
||||
```python
|
||||
# Use descriptive session IDs
|
||||
session_id = "login_flow_session"
|
||||
session_id = "product_catalog_session"
|
||||
```
|
||||
#### Session Best Practices
|
||||
|
||||
1. **Descriptive Session IDs**:
|
||||
Use meaningful names for session IDs to organize workflows:
|
||||
```python
|
||||
session_id = "login_flow_session"
|
||||
session_id = "product_catalog_session"
|
||||
```
|
||||
|
||||
2. **Resource Management**:
|
||||
```python
|
||||
try:
|
||||
# Your crawling code
|
||||
pass
|
||||
finally:
|
||||
# Always clean up sessions
|
||||
await crawler.crawler_strategy.kill_session(session_id)
|
||||
```
|
||||
Always ensure sessions are cleaned up to free resources:
|
||||
```python
|
||||
try:
|
||||
# Your crawling code here
|
||||
pass
|
||||
finally:
|
||||
await crawler.crawler_strategy.kill_session(session_id)
|
||||
```
|
||||
|
||||
3. **State Management**:
|
||||
```python
|
||||
# First page: login
|
||||
result = await crawler.arun(
|
||||
url="https://example.com/login",
|
||||
session_id=session_id,
|
||||
js_code="document.querySelector('form').submit();"
|
||||
)
|
||||
3. **State Maintenance**:
|
||||
Reuse the session for subsequent actions within the same workflow:
|
||||
```python
|
||||
# Step 1: Login
|
||||
login_config = CrawlerRunConfig(
|
||||
url="https://example.com/login",
|
||||
session_id=session_id,
|
||||
js_code="document.querySelector('form').submit();"
|
||||
)
|
||||
await crawler.arun(config=login_config)
|
||||
|
||||
# Second page: verify login success
|
||||
result = await crawler.arun(
|
||||
url="https://example.com/dashboard",
|
||||
session_id=session_id,
|
||||
wait_for="css:.user-profile" # Wait for authenticated content
|
||||
)
|
||||
```
|
||||
# Step 2: Verify login success
|
||||
dashboard_config = CrawlerRunConfig(
|
||||
url="https://example.com/dashboard",
|
||||
session_id=session_id,
|
||||
wait_for="css:.user-profile" # Wait for authenticated content
|
||||
)
|
||||
result = await crawler.arun(config=dashboard_config)
|
||||
```
|
||||
|
||||
## Common Use Cases
|
||||
---
|
||||
|
||||
1. **Authentication Flows**
|
||||
2. **Pagination Handling**
|
||||
3. **Form Submissions**
|
||||
4. **Multi-step Processes**
|
||||
5. **Dynamic Content Navigation**
|
||||
#### Common Use Cases for Sessions
|
||||
|
||||
1. **Authentication Flows**: Login and interact with secured pages.
|
||||
2. **Pagination Handling**: Navigate through multiple pages.
|
||||
3. **Form Submissions**: Fill forms, submit, and process results.
|
||||
4. **Multi-step Processes**: Complete workflows that span multiple actions.
|
||||
5. **Dynamic Content Navigation**: Handle JavaScript-rendered or event-triggered content.
|
||||
|
||||
Reference in New Issue
Block a user