Update the Tutorial section for new document version
3
.gitignore
vendored
@@ -219,4 +219,5 @@ publish.sh
combine.sh
combined_output.txt
tree.md
.scripts
.scripts
.local
@@ -1,63 +1,10 @@
### Hypothetical Questions

1. **Enabling Downloads**
   - *"How do I configure Crawl4AI to allow file downloads during a crawl?"*
   - *"Where in my code should I set `accept_downloads=True` to enable downloads?"*

2. **Specifying the Download Location**
   - *"How can I choose a custom directory for storing downloaded files?"*
   - *"What is the default download directory if I don’t specify one?"*

3. **Triggering Downloads from Pages**
   - *"How do I simulate a click on a download link or button to initiate file downloads?"*
   - *"Can I use JavaScript injection (`js_code`) to trigger downloads from the webpage elements?"*
   - *"What does `wait_for` do, and how do I use it to ensure the download starts before proceeding?"*

4. **Accessing Downloaded Files**
   - *"Where can I find the paths to the files that I’ve downloaded?"*
   - *"How do I check if any files were downloaded after my crawl completes?"*

5. **Multiple Downloads**
   - *"How do I handle scenarios where multiple files need to be downloaded sequentially?"*
   - *"Can I introduce delays between file downloads to prevent server overload?"*

6. **Error Handling and Reliability**
   - *"What if the files I expect to download don’t appear or the links are broken?"*
   - *"How can I handle incorrect paths, nonexistent directories, or failed downloads gracefully?"*

7. **Timing and Performance**
   - *"When should I use `wait_for` and how do I choose an appropriate delay?"*
   - *"Can I start the download and continue processing other tasks concurrently?"*

8. **Security Considerations**
   - *"What precautions should I take with downloaded files?"*
   - *"How can I ensure that downloaded files are safe before processing them further?"*

9. **Integration with Other Crawl4AI Features**
   - *"Can I combine file downloading with other extraction strategies or LLM-based processes?"*
   - *"How do I manage downloads when running multiple parallel crawls?"*

### Topics Discussed in the File

- **Enabling Downloads in Crawl4AI**:
  Configure the crawler through `BrowserConfig` or `CrawlerRunConfig` to allow file downloads.

- **Download Locations**:
  Specify a custom `downloads_path` or rely on the default directory (`~/.crawl4ai/downloads`).

- **Triggering File Downloads**:
  Use JavaScript code injection (`js_code`) to simulate user interactions (e.g., clicking a download link). Employ `wait_for` to allow time for downloads to initiate.

- **Accessing Downloaded Files**:
  After the crawl, `result.downloaded_files` provides a list of paths to the downloaded files. Use these paths to verify file sizes or further process the files.

- **Handling Multiple Files**:
  Loop through downloadable elements on the page, introduce delays, and wait for downloads to complete before proceeding.

- **Error and Timing Considerations**:
  Manage potential errors when downloads fail or timing issues arise. Adjust `wait_for` and error handling logic to ensure stable and reliable file retrievals.

- **Security Precautions**:
  Always verify the integrity and safety of downloaded files before using them in your application.

In summary, the file explains how to set up, initiate, and manage file downloads within the Crawl4AI framework, including specifying directories, triggering downloads programmatically, handling multiple files, and accessing downloaded results. It also covers timing, error handling, and security best practices.
enable_downloads: Downloads must be enabled using the accept_downloads parameter in BrowserConfig or CrawlerRunConfig | download settings, enable downloads | BrowserConfig(accept_downloads=True)
download_location: Set a custom download directory using downloads_path in BrowserConfig; defaults to ~/.crawl4ai/downloads | download folder, save location | BrowserConfig(downloads_path="/path/to/downloads")
download_trigger: Trigger downloads using js_code in CrawlerRunConfig to simulate click actions | download button, click download | CrawlerRunConfig(js_code="document.querySelector('a[download]').click()")
download_timing: Control download timing using the wait_for parameter in CrawlerRunConfig | download wait, timeout | CrawlerRunConfig(wait_for=5)
access_downloads: Access downloaded files through the downloaded_files attribute of CrawlResult | download results, file paths | result.downloaded_files
multiple_downloads: Download multiple files by clicking multiple download links with delay | batch download, multiple files | js_code="const links = document.querySelectorAll('a[download]'); for(const link of links) { link.click(); }"
download_verification: Check download success by examining the downloaded_files list and file sizes | verify downloads, file check | if result.downloaded_files: print(os.path.getsize(file_path))
browser_context: Downloads are managed within the browser context and require proper js_code targeting | download management, browser scope | CrawlerRunConfig(js_code="...")
error_handling: Handle failed downloads and incorrect paths for robust download management | download errors, error handling | try-except around download operations
security_consideration: Scan downloaded files for security threats before use | security check, virus scan | No direct code reference
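The `download_verification` and `error_handling` entries above can be sketched with the standard library alone. This is a minimal illustration, not part of Crawl4AI: `summarize_downloads` is a hypothetical helper, and the list passed to it stands in for `result.downloaded_files`. Treating zero-byte files as failures is this sketch's own assumption.

```python
import os
import tempfile


def summarize_downloads(paths):
    """Return (sizes, problems): sizes maps usable files to byte counts,
    problems lists paths that are missing or empty."""
    sizes, problems = {}, []
    for path in paths or []:
        # Assumption of this sketch: a missing or zero-byte file is a failed download.
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            sizes[path] = os.path.getsize(path)
        else:
            problems.append(path)
    return sizes, problems


# Usage with stand-in files instead of a real crawl result.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "report.pdf")
    with open(good, "wb") as fh:
        fh.write(b"%PDF-1.4 placeholder")
    missing = os.path.join(tmp, "missing.zip")
    demo_sizes, demo_problems = summarize_downloads([good, missing])
```

Returning the problem paths instead of raising lets the caller decide whether a partial batch is acceptable.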
@@ -1,64 +1,10 @@
Below is a structured list of hypothetical questions derived from the file’s content, followed by a bullet-point summary of key topics discussed.

### Hypothetical Questions

1. **JavaScript Execution Basics**
   - *"How do I inject a single JavaScript command into the page using Crawl4AI?"*
   - *"Can I run multiple JavaScript commands sequentially before extracting content?"*

2. **Waiting for Conditions**
   - *"How can I wait for a particular CSS element to appear before extracting data?"*
   - *"Is there a way to wait for a custom JavaScript condition, like a minimum number of items to load?"*

3. **Handling Dynamic Content**
   - *"How do I deal with infinite scrolling or 'Load More' buttons to continuously fetch new data?"*
   - *"Can I simulate user interactions (clicking buttons, scrolling) to reveal more content?"*

4. **Form Interactions**
   - *"How can I fill out and submit a form on a webpage using JavaScript injection?"*
   - *"What if I need to handle multiple form fields or a multi-step submission process?"*

5. **Timing Control and Delays**
   - *"How can I set a page load timeout or introduce a delay before extracting the final HTML?"*
   - *"When should I adjust `delay_before_return_html` to ensure the page is fully rendered?"*

6. **Complex Interactions**
   - *"How do I chain multiple interactions, like accepting cookies, scrolling, and then clicking 'Load More' several times?"*
   - *"Can I maintain a session to continue interacting with the page across multiple steps?"*

7. **Integration with Extraction Strategies**
   - *"How do I combine JavaScript-based interactions with a structured extraction strategy like `JsonCssExtractionStrategy`?"*
   - *"Is it possible to use LLM-based extraction after dynamically revealing more content?"*

8. **Troubleshooting Interactions**
   - *"What if my JavaScript code fails or the element I want to interact with isn’t available?"*
   - *"How can I verify that the dynamic content I triggered actually loaded before extraction?"*

9. **Performance and Reliability**
   - *"Do I need to consider timeouts and backoffs when dealing with heavily dynamic pages?"*
   - *"How can I ensure that my JS-based interactions do not slow down the extraction process unnecessarily?"*

### Topics Discussed in the File

- **JavaScript Execution**:
  Injecting single or multiple JS commands into the page to manipulate scrolling, clicks, or form submissions.

- **Waiting Mechanisms**:
  Using `wait_for` with CSS selectors (`"css:.some-element"`) or custom JavaScript conditions (`"js:() => {...}"`) to ensure the page is in the desired state before extraction.

- **Dynamic Content Handling**:
  Techniques for infinite scrolling, load more buttons, and other elements that reveal additional data after user-like interactions.

- **Form Interaction**:
  Filling out form fields, submitting forms, and waiting for results to appear.

- **Timing Control**:
  Setting page timeouts, introducing delays before returning HTML, and ensuring stable and complete extractions.

- **Complex Interactions**:
  Combining multiple steps (cookie acceptance, infinite scroll, load more clicks) and maintaining sessions across multiple steps for fully dynamic pages.

- **Integration with Extraction Strategies**:
  Applying pattern-based (CSS/JSON) or LLM-based extraction after performing required interactions to reveal the content of interest.

In summary, the file provides detailed guidance on interacting with dynamic pages in Crawl4AI. It shows how to run JavaScript commands, wait for certain conditions, handle infinite scroll or complex user interactions, and integrate these techniques with content extraction strategies.
javascript_execution: Execute single or multiple JavaScript commands in webpage | js code, javascript commands, browser execution | CrawlerRunConfig(js_code="window.scrollTo(0, document.body.scrollHeight);")
css_wait: Wait for specific CSS elements to appear on page | css selector, element waiting, dynamic content | CrawlerRunConfig(wait_for="css:.dynamic-content")
js_wait_condition: Define custom JavaScript wait conditions for dynamic content | javascript waiting, conditional wait, custom conditions | CrawlerRunConfig(wait_for="js:() => document.querySelectorAll('.item').length > 10")
infinite_scroll: Handle infinite scroll and load more buttons | pagination, dynamic loading, scroll handling | CrawlerRunConfig(js_code="window.scrollTo(0, document.body.scrollHeight);")
form_interaction: Fill and submit forms using JavaScript | form handling, input filling, form submission | CrawlerRunConfig(js_code="document.querySelector('#search').value = 'search term';")
timing_control: Set page timeouts and delays before content capture | page timing, delays, timeouts | CrawlerRunConfig(page_timeout=60000, delay_before_return_html=2.0)
session_management: Maintain browser session for multiple interactions | session handling, browser state, session cleanup | crawler.crawler_strategy.kill_session(session_id)
cookie_consent: Handle cookie consent popups and notifications | cookie handling, popup management | CrawlerRunConfig(js_code="document.querySelector('.cookie-accept')?.click();")
extraction_combination: Combine page interactions with structured data extraction | data extraction, content parsing | JsonCssExtractionStrategy(schema), LLMExtractionStrategy(schema)
dynamic_content_loading: Wait for and verify dynamic content loading | content verification, dynamic loading | wait_for="js:() => document.querySelector('.content').innerText.length > 100"
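The `css:` and `js:` prefixes used with `wait_for` above amount to a small string convention. The sketch below only shows how such a string splits into a kind and an expression. `parse_wait_for` is a hypothetical helper, not a Crawl4AI function, and treating an unprefixed string as a CSS selector is an assumption of this sketch rather than documented library behaviour.

```python
def parse_wait_for(condition):
    """Split a wait_for string into (kind, expression)."""
    if condition.startswith("css:"):
        return "css", condition[len("css:"):].strip()
    if condition.startswith("js:"):
        return "js", condition[len("js:"):].strip()
    # Assumption of this sketch: no prefix means a plain CSS selector.
    return "css", condition.strip()


css_kind, css_expr = parse_wait_for("css:.dynamic-content")
js_kind, js_expr = parse_wait_for(
    "js:() => document.querySelectorAll('.item').length > 10"
)
```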
@@ -1,56 +1,10 @@
### Hypothetical Questions

1. **Basic Usage**
   - *"How can I crawl a regular website URL using Crawl4AI?"*
   - *"What configuration object do I need to pass to `arun` for basic crawling scenarios?"*

2. **Local HTML Files**
   - *"How do I crawl an HTML file stored locally on my machine?"*
   - *"What prefix should I use when specifying a local file path to `arun`?"*

3. **Raw HTML Strings**
   - *"Is it possible to crawl a raw HTML string without saving it to a file first?"*
   - *"How do I prefix a raw HTML string so that Crawl4AI treats it like HTML content?"*

4. **Verifying Results**
   - *"Can I compare the extracted Markdown content from a live page with that of a locally saved or raw version to ensure they match?"*
   - *"How do I handle errors or check if the crawl was successful?"*

5. **Use Cases**
   - *"When would I want to use `file://` vs. `raw:` URLs?"*
   - *"Can I reuse the same code structure for various input types (web URL, file, raw HTML)?"*

6. **Caching and Configuration**
   - *"What does `bypass_cache=True` do and when should I use it?"*
   - *"Is there a simpler way to configure crawling options uniformly across web URLs, local files, and raw HTML?"*

7. **Practical Scenarios**
   - *"How can I integrate file-based crawling into a pipeline that starts from a live page, saves the HTML, and then crawls that local file for consistency checks?"*
   - *"Does Crawl4AI’s prefix-based handling allow me to pre-process raw HTML (e.g., downloaded from another source) without hosting it on a local server?"*

### Topics Discussed in the File

- **Prefix-Based Input Handling**:
  Introducing the concept of using `http://` or `https://` for web URLs, `file://` for local files, and `raw:` for direct HTML strings. This unified approach allows seamless handling of different content sources within Crawl4AI.

- **Crawling a Web URL**:
  Demonstrating how to crawl a live web page (like a Wikipedia article) using `AsyncWebCrawler` and `CrawlerRunConfig`.

- **Crawling a Local HTML File**:
  Showing how to convert a local file path to a `file://` URL and use `arun` to process it, ensuring that previously saved HTML can be re-crawled for verification or offline analysis.

- **Crawling Raw HTML Content**:
  Explaining how to directly pass an HTML string prefixed with `raw:` to `arun`, enabling quick tests or processing of HTML code obtained from other sources without saving it to disk.

- **Consistency and Verification**:
  Providing a comprehensive example that:
  1. Crawls a live Wikipedia page.
  2. Saves the HTML to a file.
  3. Re-crawls the local file.
  4. Re-crawls the content as a raw HTML string.
  5. Verifies that the Markdown extracted remains consistent across all three methods.

- **Integration with `CrawlerRunConfig`**:
  Showing how to use `CrawlerRunConfig` to disable caching (`bypass_cache=True`) and ensure fresh results for each test run.

In summary, the file highlights how to use Crawl4AI’s prefix-based handling to effortlessly switch between crawling live web pages, local HTML files, and raw HTML strings. It also demonstrates a detailed workflow for verifying consistency and correctness across various input methods.
url_prefix_handling: Crawl4AI supports different URL prefixes for various input types | input handling, url format, crawling types | url="https://example.com" or "file://path" or "raw:html"
web_crawling: Crawl live web pages using http:// or https:// prefixes with AsyncWebCrawler | web scraping, url crawling, web content | AsyncWebCrawler().arun(url="https://example.com")
local_file_crawling: Access local HTML files using file:// prefix for crawling | local html, file crawling, file access | AsyncWebCrawler().arun(url="file:///path/to/file.html")
raw_html_crawling: Process raw HTML content directly using raw: prefix | html string, raw content, direct html | AsyncWebCrawler().arun(url="raw:<html>content</html>")
crawler_config: Configure crawling behavior using CrawlerRunConfig object | crawler settings, configuration, bypass cache | CrawlerRunConfig(bypass_cache=True)
async_context: AsyncWebCrawler should be used within async context manager | async with, context management, async programming | async with AsyncWebCrawler() as crawler
crawl_result: Crawler returns result object containing success status, markdown and error messages | response handling, crawl output, result parsing | result.success, result.markdown, result.error_message
html_to_markdown: Crawler automatically converts HTML content to markdown format | format conversion, markdown generation, content processing | result.markdown
error_handling: Check crawl success status and handle error messages appropriately | error checking, failure handling, status verification | if result.success: ... else: print(result.error_message)
content_verification: Compare markdown length between different crawling methods for consistency | content validation, length comparison, consistency check | assert web_crawl_length == local_crawl_length
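The prefix convention above (`http://`/`https://`, `file://`, `raw:`) can be sketched as plain string handling. The helpers `classify_source`, `to_file_url`, and `to_raw_url` are hypothetical names introduced here for illustration. They show how an input string could be prepared before being passed as `url=` to `arun`, not how Crawl4AI dispatches internally.

```python
from pathlib import Path


def classify_source(url):
    """Return 'web', 'file', or 'raw' based on the string's prefix."""
    if url.startswith(("http://", "https://")):
        return "web"
    if url.startswith("file://"):
        return "file"
    if url.startswith("raw:"):
        return "raw"
    raise ValueError(f"unrecognised input prefix: {url[:20]!r}")


def to_file_url(path):
    """Turn a local path into an absolute file:// URL."""
    return Path(path).resolve().as_uri()


def to_raw_url(html):
    """Prefix an in-memory HTML string with raw:."""
    return "raw:" + html


sources = [
    "https://example.com",
    to_file_url("saved_page.html"),
    to_raw_url("<html><body>content</body></html>"),
]
kinds = [classify_source(s) for s in sources]
```

`Path.resolve()` is used because `as_uri()` only accepts absolute paths.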
@@ -1,58 +1,12 @@
Below is a structured list of hypothetical questions derived from the file’s content, followed by a bullet-point summary of key topics discussed.

### Hypothetical Questions

1. **General Hook Usage**
   - *"What are hooks in Crawl4AI, and how do they help customize the crawling process?"*
   - *"Which stages of the crawling lifecycle can I attach hooks to?"*

2. **Specific Hooks**
   - *"What does the `on_browser_created` hook allow me to do?"*
   - *"How can I use the `on_page_context_created` hook to modify requests before navigation?"*
   - *"When should I use `before_goto` and `after_goto` hooks?"*
   - *"How does `on_execution_started` help with custom JavaScript execution?"*
   - *"What kind of preprocessing can I do in `before_return_html`?"*

3. **Authentication and Customization**
   - *"How can I perform authentication (like logging in) before actual crawling begins?"*
   - *"Can I set cookies, headers, or modify requests using hooks?"*

4. **Error Handling and Debugging**
   - *"If my hooks fail or raise errors, how is that handled during the crawling process?"*
   - *"How can I use hooks to troubleshoot issues, like blocking image requests or logging console messages?"*

5. **Complex Scenarios**
   - *"Can I combine multiple hooks to handle complex workflows like login, script execution, and dynamic content blocking?"*
   - *"Is it possible to add conditional logic in hooks to treat certain URLs differently?"*

6. **Performance and Reliability**
   - *"Do these hooks run asynchronously, and how does that affect the crawler’s performance?"*
   - *"Can I cancel requests or actions via hooks to improve efficiency?"*

7. **Integration with `BrowserConfig` and `CrawlerRunConfig`**
   - *"How do I use `BrowserConfig` and `CrawlerRunConfig` in tandem with hooks?"*
   - *"Does setting hooks require changes to the configuration objects or can I apply them at runtime?"*

### Topics Discussed in the File

- **Hooks in `AsyncWebCrawler`**:
  Hooks are asynchronous callback functions triggered at key points in the crawling lifecycle. They allow advanced customization, such as modifying browser/page contexts, injecting scripts, or altering network requests.

- **Hook Types and Purposes**:
  - **`on_browser_created`**: Initialize browser state, handle authentication (login), set cookies.
  - **`on_page_context_created`**: Set up request routing, block resources, or modify requests before navigation.
  - **`before_goto`**: Add or modify HTTP headers, prepare the page before actually navigating to the target URL.
  - **`after_goto`**: Verify the current URL, log details, or ensure that page navigation succeeded.
  - **`on_execution_started`**: Perform actions right after JS execution, like logging console output or checking state.
  - **`before_return_html`**: Analyze, log, or preprocess the extracted HTML before it’s returned.

- **Practical Examples**:
  Demonstrations of handling authentication via `on_browser_created`, blocking images using `on_page_context_created` with a custom routing function, adding HTTP headers in `before_goto`, and logging content details in `before_return_html`.

- **Integration with Configuration Objects**:
  Using `BrowserConfig` for initial browser settings and `CrawlerRunConfig` for specifying JavaScript code, wait conditions, and more, then combining them with hooks for a fully customizable crawling workflow.

- **Asynchronous and Flexible**:
  Hooks are async, fitting seamlessly into the event-driven model of crawling. They can abort requests, continue them, or conditionally modify behavior based on URL patterns.

In summary, this file explains how to use hooks in Crawl4AI’s `AsyncWebCrawler` to customize nearly every aspect of the crawling process. By attaching hooks at various lifecycle stages, developers can implement authentication routines, block certain types of requests, tweak headers, run custom JS, and analyze the final HTML—all while maintaining control and flexibility.
crawler_hooks: AsyncWebCrawler supports customizable hooks for modifying crawler behavior | hooks, async functions, crawler customization | crawler.crawler_strategy.set_hook()
browser_creation_hook: on_browser_created hook executes when browser is initialized for authentication and setup | browser setup, login, authentication | async def on_browser_created(browser: Browser, **kwargs)
page_context_hook: on_page_context_created hook handles routing and initial page setup | page context, routing, resource blocking | async def on_page_context_created(context: BrowserContext, page: Page, **kwargs)
navigation_pre_hook: before_goto hook allows adding custom headers before URL navigation | headers, pre-navigation, request modification | async def before_goto(page: Page, context: BrowserContext, **kwargs)
navigation_post_hook: after_goto hook executes after URL navigation for verification | post-navigation, URL logging | async def after_goto(page: Page, context: BrowserContext, **kwargs)
js_execution_hook: on_execution_started hook runs after custom JavaScript execution | JavaScript, script execution | async def on_execution_started(page: Page, context: BrowserContext, **kwargs)
html_processing_hook: before_return_html hook processes HTML content before returning | HTML content, preprocessing | async def before_return_html(page: Page, context: BrowserContext, html: str, **kwargs)
browser_configuration: BrowserConfig allows setting headless mode and viewport dimensions | browser settings, viewport | BrowserConfig(headless=True, viewport_width=1920, viewport_height=1080)
crawler_configuration: CrawlerRunConfig defines JavaScript execution and wait conditions | crawler settings, JS code, wait conditions | CrawlerRunConfig(js_code="window.scrollTo(0)", wait_for="footer")
resource_management: Route handlers can block or modify specific resource types | resource blocking, request handling | if route.request.resource_type == "image": await route.abort()
authentication_flow: Browser authentication handled through login form interaction and cookie setting | login process, cookies | await page.fill("input[name='username']", "testuser")
hook_registration: Hooks are registered using the crawler strategy's set_hook method | hook setup, strategy | crawler.crawler_strategy.set_hook("hook_name", hook_function)
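The register-then-dispatch pattern behind `set_hook("hook_name", hook_function)` can be illustrated with a toy registry. This is a simplified stand-in written for this note, not Crawl4AI's implementation: the `HookRegistry` class, its `execute_hook` method, and the simplified hook signatures (no `page` or `context` objects) are all assumptions.

```python
import asyncio


class HookRegistry:
    """Toy registry: one optional async callback per lifecycle stage."""

    def __init__(self, names):
        self._hooks = {name: None for name in names}

    def set_hook(self, name, func):
        if name not in self._hooks:
            raise ValueError(f"unknown hook: {name}")
        self._hooks[name] = func

    async def execute_hook(self, name, *args, **kwargs):
        func = self._hooks.get(name)
        if func is None:
            return None  # stage without a registered hook is skipped
        return await func(*args, **kwargs)


async def demo():
    log = []
    registry = HookRegistry(["before_goto", "after_goto", "before_return_html"])

    async def before_goto(url, **kwargs):
        log.append(("before_goto", url))

    async def before_return_html(html, **kwargs):
        log.append(("before_return_html", len(html)))
        return html

    registry.set_hook("before_goto", before_goto)
    registry.set_hook("before_return_html", before_return_html)

    await registry.execute_hook("before_goto", "https://example.com")
    await registry.execute_hook("after_goto", "https://example.com")
    await registry.execute_hook("before_return_html", "<html></html>")
    return log


events = asyncio.run(demo())
```

Accepting `**kwargs` in each hook mirrors the signatures listed above and lets a dispatcher pass extra context without breaking existing callbacks.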
@@ -93,3 +93,39 @@ crawler_config = CrawlerRunConfig(magic=True) # Enable all anti-detection featu
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun(url="https://example.com", config=crawler_config)
```

## SSL Certificate Verification

Crawl4AI can retrieve and analyze SSL certificates from HTTPS websites. This is useful for:
- Verifying website authenticity
- Detecting potential security issues
- Analyzing certificate chains
- Exporting certificates for further analysis

Enable SSL certificate retrieval with `CrawlerRunConfig`:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig


async def main():
    config = CrawlerRunConfig(fetch_ssl_certificate=True)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)

        if result.success and result.ssl_certificate:
            cert = result.ssl_certificate

            # Access certificate properties
            print(f"Issuer: {cert.issuer.get('CN', '')}")
            print(f"Valid until: {cert.valid_until}")
            print(f"Fingerprint: {cert.fingerprint}")

            # Export certificate in different formats
            cert.to_json("cert.json")  # For analysis
            cert.to_pem("cert.pem")    # For web servers
            cert.to_der("cert.der")    # For Java applications


asyncio.run(main())
```

The SSL certificate object provides:
- Direct access to certificate fields (issuer, subject, validity dates)
- Methods to export in common formats (JSON, PEM, DER)
- Certificate chain information and extensions

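To make the PEM and DER export formats concrete, the sketch below uses only Python's standard `ssl` and `hashlib` modules. It does not call Crawl4AI. The placeholder bytes stand in for a real DER-encoded certificate, and using SHA-256 for the fingerprint is an assumption, since the hash behind `cert.fingerprint` is not stated above.

```python
import hashlib
import ssl

# Placeholder bytes standing in for a real DER-encoded certificate.
der_bytes = bytes(range(48))

# PEM is the same binary data, base64-encoded between BEGIN/END markers.
pem_text = ssl.DER_cert_to_PEM_cert(der_bytes)
round_tripped = ssl.PEM_cert_to_DER_cert(pem_text)

# Assumption: a SHA-256 digest of the DER bytes as the fingerprint.
fingerprint = hashlib.sha256(der_bytes).hexdigest()
```

Because the two formats carry identical data, converting one to the other and back returns the original bytes.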
@@ -1,53 +1,8 @@
### Hypothetical Questions

1. **Basic Proxy Configuration**
   - *"How do I set a basic HTTP proxy for the crawler?"*
   - *"Can I use a SOCKS proxy instead of an HTTP proxy?"*

2. **Authenticated Proxies**
   - *"How do I provide a username and password for an authenticated proxy server?"*
   - *"What is the `proxy_config` dictionary, and how do I use it?"*

3. **Rotating Proxies**
   - *"How can I dynamically change the proxy server for each request?"*
   - *"What patterns or logic can I implement to rotate proxies from a pool?"*

4. **Custom Headers for Security and Anonymity**
   - *"How do I set custom HTTP headers in `BrowserConfig` to appear more human-like or meet security policies?"*
   - *"Can I add headers like `X-Forwarded-For`, `Accept-Language`, or `Cache-Control`?"*

5. **Combining Proxies with Magic Mode**
   - *"What is Magic Mode, and how does it help with anti-detection features?"*
   - *"Can I use Magic Mode in combination with proxies and custom headers for better anonymity?"*

6. **Troubleshooting and Edge Cases**
   - *"What if my authenticated proxy doesn’t accept credentials?"*
   - *"How do I handle errors when switching proxies mid-crawl?"*

7. **Performance and Reliability**
   - *"Does using a proxy slow down the crawling process?"*
   - *"How do I ensure stable and fast connections when rotating proxies frequently?"*

8. **Integration with Other Crawl4AI Features**
   - *"Can I use proxy configurations with hooks, caching, or LLM extraction strategies?"*
   - *"How do I integrate proxy-based crawling into a larger pipeline that includes data extraction and content filtering?"*

### Topics Discussed in the File

- **Proxy Configuration**:
  Shows how to set an HTTP or SOCKS proxy in `BrowserConfig` for the crawler, enabling you to route traffic through a specific server.

- **Authenticated Proxies**:
  Demonstrates how to provide username and password credentials to access proxy servers that require authentication.

- **Rotating Proxies**:
  Suggests a pattern for dynamically updating proxy settings before each request, allowing you to cycle through multiple proxies to avoid throttling or blocking.

- **Custom Headers**:
  Explains how to add custom HTTP headers in `BrowserConfig` for security, anonymity, or compliance with certain websites’ requirements.

- **Integration with Magic Mode**:
  Shows how to combine proxy usage, custom headers, and Magic Mode (`magic=True` in `CrawlerRunConfig`) to enhance anti-detection measures, making it harder for websites to detect automated crawlers.

In summary, the file explains how to configure proxies (including authenticated proxies), rotate them dynamically, set custom headers for extra security and privacy, and combine these techniques with Magic Mode for robust anti-detection strategies in Crawl4AI.
proxy_setup: Configure basic proxy in Crawl4AI using BrowserConfig with proxy URL | proxy configuration, proxy setup, basic proxy | BrowserConfig(proxy="http://proxy.example.com:8080")
socks_proxy: Use SOCKS proxy protocol for web crawling | SOCKS5, proxy protocol, SOCKS connection | BrowserConfig(proxy="socks5://proxy.example.com:1080")
authenticated_proxy: Set up proxy with username and password authentication | proxy auth, proxy credentials, authenticated connection | BrowserConfig(proxy_config={"server": "http://proxy.example.com:8080", "username": "user", "password": "pass"})
rotating_proxies: Implement dynamic proxy rotation during crawling | proxy rotation, proxy switching, dynamic proxies | browser_config.proxy_config = await get_next_proxy()
custom_headers: Add security headers to browser configuration for enhanced protection | HTTP headers, request headers, security headers | BrowserConfig(headers={"X-Forwarded-For": "203.0.113.195", "Accept-Language": "en-US,en;q=0.9"})
magic_mode: Combine proxy settings with Magic Mode for maximum anti-detection | anti-detection, stealth mode, protection features | CrawlerRunConfig(magic=True) with BrowserConfig(proxy="http://proxy.example.com:8080")
crawler_context: Use AsyncWebCrawler with async context manager for proper resource management | async crawler, context manager, crawler setup | async with AsyncWebCrawler(config=browser_config) as crawler
cache_control: Set cache control headers to prevent caching during crawling | caching headers, no-cache, cache prevention | BrowserConfig(headers={"Cache-Control": "no-cache", "Pragma": "no-cache"})
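The `rotating_proxies` entry calls `await get_next_proxy()` without defining it. One possible shape for that helper is a round-robin over a fixed pool, sketched below with the standard library. The pool contents, the placeholder credentials, and the round-robin policy are assumptions made for illustration; a real rotation might weight proxies or drop failing ones.

```python
import asyncio
import itertools

# Hypothetical pool; each entry has the shape of the proxy_config dict above.
PROXY_POOL = [
    {"server": "http://proxy1.example.com:8080", "username": "user", "password": "pass"},
    {"server": "http://proxy2.example.com:8080", "username": "user", "password": "pass"},
    {"server": "http://proxy3.example.com:8080", "username": "user", "password": "pass"},
]
_proxy_cycle = itertools.cycle(PROXY_POOL)


async def get_next_proxy():
    """Return the next proxy_config dict, wrapping around at the end of the pool."""
    return next(_proxy_cycle)


async def demo():
    # Four picks from a pool of three: the fourth wraps back to the first.
    return [await get_next_proxy() for _ in range(4)]


picked = asyncio.run(demo())
```

The function is `async` only to match the `await` in the entry above; a lookup against a remote proxy service would need that, while this in-memory cycle does not.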
@@ -1,50 +1,9 @@
|
||||
Below is a structured list of hypothetical questions derived from the file’s content, followed by a bullet-point summary of key topics discussed.
|
||||
|
||||
### Hypothetical Questions
|
||||
|
||||
1. **Motivation and Use Cases**
|
||||
- *"Why should I use the PDF-based screenshot approach for very long web pages?"*
|
||||
- *"What are the benefits of generating a PDF before converting it to an image?"*
|
||||
|
||||
2. **Workflow and Technical Process**
|
||||
- *"How does Crawl4AI generate a PDF and then convert it into a screenshot?"*
|
||||
- *"Do I need to manually scroll or stitch images to capture large pages?"*
|
||||
|
||||
3. **Practical Steps**
|
||||
- *"What code do I need to write to request both a PDF and a screenshot in one crawl?"*
|
||||
- *"How do I save the resulting PDF and screenshot to disk?"*
|
||||
|
||||
4. **Performance and Reliability**
|
||||
- *"Will this PDF-based method time out or fail for extremely long pages?"*
|
||||
- *"Is this approach faster or more memory-efficient than traditional full-page screenshots?"*
|
||||
|
||||
5. **Additional Features and Customization**
|
||||
- *"Can I save only the PDF without generating a screenshot?"*
|
||||
- *"If I have a PDF, can I easily convert it to multiple images or just the first page?"*
|
||||
|
||||
6. **Integration with Other Crawl4AI Features**
|
||||
- *"Can I combine PDF/screenshot generation with other Crawl4AI extraction strategies or hooks?"*
|
||||
- *"Is caching or proxying affected by PDF or screenshot generation?"*
|
||||
|
||||
7. **Troubleshooting**
|
||||
- *"What should I do if the screenshot or PDF does not appear in the result?"*
|
||||
- *"How do I handle large PDF sizes or slow saves when dealing with massive pages?"*
|
||||
|
||||
### Topics Discussed in the File
|
||||
|
||||
- **New Approach to Large Page Screenshots**:
|
||||
The document introduces a method to first export a page as a PDF using the browser’s built-in PDF rendering capabilities and then convert that PDF to an image if a screenshot is requested.
|
||||
|
||||
- **Advantages Over Traditional Methods**:
|
||||
This approach avoids timeouts, memory issues, and the complexity of stitching multiple images for extremely long pages. The PDF rendering is stable, reliable, and does not require the crawler to scroll through the entire page.
|
||||
|
||||
- **One-Stop Solution**:
|
||||
By enabling `pdf=True` and `screenshot=True`, you receive both the full-page PDF and a screenshot (converted from the PDF) in a single crawl. This reduces repetitive processes and complexity.
|
||||
|
||||
- **How to Implement**:
|
||||
Demonstrates code usage with `arun` to request both the PDF and screenshot, and how to save them to files. Explains that if a PDF is already generated, the screenshot is derived directly from it, simplifying the workflow.
|
||||
|
||||
- **Integration and Efficiency**:
|
||||
Compatible with other Crawl4AI features like caching and extraction strategies. Simplifies large-scale crawling pipelines needing both a textual representation (HTML extraction) and visual confirmations (PDF/screenshot).
|
||||
|
||||
In summary, the file outlines a new feature for capturing full-page screenshots of massive web pages by first generating a stable, reliable PDF, then converting it into an image. This technique eliminates previous issues related to large content pages, ensuring smoother performance and simpler code maintenance.
|
||||
page_capture: Full-page screenshots and PDFs can be generated for massive webpages using Crawl4AI | webpage capture, full page screenshot, pdf export | AsyncWebCrawler().arun(url=url, pdf=True, screenshot=True)
|
||||
pdf_approach: Pages are first exported as PDF then converted to high-quality images for better handling of large content | pdf conversion, image export, page rendering | result.pdf, result.screenshot
|
||||
export_benefits: PDF export avoids the timeouts and memory issues of stitched full-page screenshots, regardless of page length | timeout handling, page size limits, reliability | pdf=True
|
||||
dual_output: Get both PDF and screenshot in single crawl without reloading | multiple formats, single pass, efficient capture | pdf=True, screenshot=True
|
||||
result_handling: Screenshot and PDF data are returned as base64 encoded strings | base64 encoding, binary data, file saving | b64decode(result.screenshot), b64decode(result.pdf)
|
||||
cache_control: Cache mode can be bypassed for fresh page captures | caching, fresh content, bypass cache | cache_mode=CacheMode.BYPASS
|
||||
async_operation: Crawler operates asynchronously using Python's asyncio framework | async/await, concurrent execution | async with AsyncWebCrawler() as crawler
|
||||
file_saving: Screenshots and PDFs can be saved directly to local files | file output, save results, local storage | open("screenshot.png", "wb"), open("page.pdf", "wb")
|
||||
error_handling: Success status can be checked before processing results | error checking, result validation | if result.success:
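Because the entries above describe captures as base64-encoded strings that are written to local files, a small helper can make the decode step explicit. This is a minimal sketch using only the standard library; `save_capture` is a hypothetical name, not part of Crawl4AI, and it assumes the capture arrives either as a base64 string or as raw bytes.

```python
import base64
from pathlib import Path


def save_capture(data, path):
    """Write a screenshot or PDF capture to disk.

    Assumption: `data` is either a base64-encoded string (as described
    above) or raw bytes. `None` means nothing was captured.
    """
    if data is None:
        return None
    # Decode text payloads; pass raw bytes through unchanged.
    raw = base64.b64decode(data) if isinstance(data, str) else bytes(data)
    target = Path(path)
    target.write_bytes(raw)
    return target
```

After a crawl, this would be called along the lines of `save_capture(result.screenshot, "screenshot.png")` once `result.success` has been checked.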
|
||||
@@ -1,52 +0,0 @@
|
||||
### Hypothetical Questions
|
||||
|
||||
1. **Basic Concept of `storage_state`**
|
||||
- *"What is `storage_state` and how does it help me maintain session data across crawls?"*
|
||||
- *"Can I directly provide a dictionary of cookies and localStorage data, or do I need a file?"*
|
||||
|
||||
2. **Cookies and LocalStorage Handling**
|
||||
- *"How do I set cookies and localStorage items before starting my crawl?"*
|
||||
- *"Can I specify multiple origins and different sets of localStorage keys per origin?"*
|
||||
|
||||
3. **Using a `storage_state` File**
|
||||
- *"How do I load session data from a JSON file?"*
|
||||
- *"Can I export the current session state to a file and reuse it later?"*
|
||||
|
||||
4. **Login and Authentication Scenarios**
|
||||
- *"How can I use `storage_state` to skip the login process on subsequent runs?"*
|
||||
- *"What’s the workflow for logging in once, exporting the session data, and then starting future crawls already logged in?"*
|
||||
|
||||
5. **Updating or Changing the Session State**
|
||||
- *"What if my session expires? Can I refresh the session and update the `storage_state` file?"*
|
||||
- *"How can I revert to a 'logged out' state by clearing tokens or using a sign-out scenario?"*
|
||||
|
||||
6. **Practical Use Cases**
|
||||
- *"If I’m crawling a series of protected pages from the same site, how can `storage_state` speed up the process?"*
|
||||
- *"Can I switch between multiple `storage_state` files for different accounts or different states (e.g., logged in vs. logged out)?"*
|
||||
|
||||
7. **Performance and Reliability**
|
||||
- *"Will using `storage_state` improve my crawl performance by reducing repeated actions?"*
|
||||
- *"Are there any risks or complications when transferring `storage_state` between different environments?"*
|
||||
|
||||
8. **Integration with Hooks and Configurations**
|
||||
- *"How do I integrate `storage_state` with hooks for a one-time login flow?"*
|
||||
- *"Can I still customize browser or page behavior with hooks if I start with a `storage_state`?"*
|
||||
|
||||
### Topics Discussed in the File
|
||||
|
||||
- **`storage_state` Overview**:
|
||||
`storage_state` is a mechanism to start crawls with preloaded cookies and localStorage data, so you do not have to re-authenticate or reset session data on every run.
|
||||
|
||||
- **Data Formats**:
|
||||
You can provide `storage_state` as either a Python dictionary or a JSON file. The JSON structure includes cookies and localStorage entries associated with specific domains/origins.
|
||||
|
||||
- **Practical Authentication Workflows**:
|
||||
Log in once (using a hook or manual interaction), then save the resulting `storage_state` to a file. Subsequent crawls can load this file to start already authenticated, which speeds up the process and simplifies pipelines.
|
||||
|
||||
- **Updating or Changing State**:
|
||||
The crawler can export the current session state to a file at any time. This allows reusing the same authenticated session, switching states, or returning to a baseline state (e.g., logged out) by applying a different `storage_state` file.
|
||||
|
||||
- **Integration with Other Features**:
|
||||
`storage_state` works seamlessly with `AsyncWebCrawler` and `CrawlerRunConfig`. You can still use hooks, JS code execution, and other Crawl4AI features alongside a preloaded session state.
|
||||
|
||||
In summary, the file explains how to use `storage_state` to maintain and reuse session data (cookies, localStorage) across crawls in Crawl4AI, demonstrating how it streamlines workflows that require authentication or complex session setups.
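To make the data format concrete, the sketch below builds a session state as a Python dictionary and writes it to a JSON file. It assumes the Playwright-style layout (a `cookies` list plus per-origin `localStorage` entries); `build_storage_state` and `save_storage_state` are hypothetical helpers, and the cookie and token values are placeholders.

```python
import json
from pathlib import Path


def build_storage_state(cookies, local_storage):
    """Assemble a storage-state dictionary.

    cookies: list of cookie dicts (name, value, domain, path, ...).
    local_storage: mapping of origin -> {key: value}.
    Assumption: the layout follows Playwright's storage-state format.
    """
    return {
        "cookies": list(cookies),
        "origins": [
            {
                "origin": origin,
                "localStorage": [
                    {"name": key, "value": value} for key, value in items.items()
                ],
            }
            for origin, items in local_storage.items()
        ],
    }


def save_storage_state(state, path):
    """Persist the state as JSON so a later run can load it from a file."""
    Path(path).write_text(json.dumps(state, indent=2), encoding="utf-8")
    return path
```

Either form could then be supplied as `storage_state`: the dictionary directly, or the path to the saved JSON file.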
|
||||
@@ -1,439 +0,0 @@
|
||||
# Crawl4AI Quick Start Guide: Your All-in-One AI-Ready Web Crawling & AI Integration Solution
|
||||
|
||||
Crawl4AI, the **#1 trending GitHub repository**, streamlines web content extraction into AI-ready formats. Perfect for AI assistants, semantic search engines, or data pipelines, Crawl4AI transforms raw HTML into structured Markdown or JSON effortlessly. Integrate with LLMs, open-source models, or your own retrieval-augmented generation workflows.
|
||||
|
||||
**Key Links:**
|
||||
- **Website:** [https://crawl4ai.com](https://crawl4ai.com)
|
||||
- **GitHub:** [https://github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai)
|
||||
- **Colab Notebook:** [Try on Google Colab](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)
|
||||
- **Quickstart Code Example:** [quickstart_async.config.py](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_async.config.py)
|
||||
- **Examples Folder:** [Crawl4AI Examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples)
|
||||
|
||||
---
|
||||
|
||||
## Table of Contents
|
||||
- [Crawl4AI Quick Start Guide: Your All-in-One AI-Ready Web Crawling \& AI Integration Solution](#crawl4ai-quick-start-guide-your-all-in-one-ai-ready-web-crawling--ai-integration-solution)
|
||||
- [Table of Contents](#table-of-contents)
|
||||
- [1. Introduction \& Key Concepts](#1-introduction--key-concepts)
|
||||
- [2. Installation \& Environment Setup](#2-installation--environment-setup)
|
||||
- [3. Core Concepts \& Configuration](#3-core-concepts--configuration)
|
||||
- [4. Basic Crawling \& Simple Extraction](#4-basic-crawling--simple-extraction)
|
||||
- [5. Markdown Generation \& AI-Optimized Output](#5-markdown-generation--ai-optimized-output)
|
||||
- [6. Structured Data Extraction (CSS, XPath, LLM)](#6-structured-data-extraction-css-xpath-llm)
|
||||
- [7. Advanced Extraction: LLM \& Open-Source Models](#7-advanced-extraction-llm--open-source-models)
|
||||
- [8. Page Interactions, JS Execution, \& Dynamic Content](#8-page-interactions-js-execution--dynamic-content)
|
||||
- [9. Media, Links, \& Metadata Handling](#9-media-links--metadata-handling)
|
||||
- [10. Authentication \& Identity Preservation](#10-authentication--identity-preservation)
|
||||
- [Manual Setup via User Data Directory](#manual-setup-via-user-data-directory)
|
||||
- [Using `storage_state`](#using-storage_state)
|
||||
- [11. Proxy \& Security Enhancements](#11-proxy--security-enhancements)
|
||||
- [12. Screenshots, PDFs \& File Downloads](#12-screenshots-pdfs--file-downloads)
|
||||
- [13. Caching \& Performance Optimization](#13-caching--performance-optimization)
|
||||
- [14. Hooks for Custom Logic](#14-hooks-for-custom-logic)
|
||||
- [15. Dockerization \& Scaling](#15-dockerization--scaling)
|
||||
- [16. Troubleshooting \& Common Pitfalls](#16-troubleshooting--common-pitfalls)
|
||||
- [17. Comprehensive End-to-End Example](#17-comprehensive-end-to-end-example)
|
||||
- [18. Further Resources \& Community](#18-further-resources--community)
|
||||
|
||||
---
|
||||
|
||||
## 1. Introduction & Key Concepts
|
||||
Crawl4AI transforms websites into structured, AI-friendly data. It efficiently handles large-scale crawling, integrates with both proprietary and open-source LLMs, and optimizes content for semantic search or RAG pipelines.
|
||||
|
||||
**Quick Test:**
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def test_run():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown)

asyncio.run(test_run())
```
|
||||
|
||||
If you see Markdown output, everything is working!
|
||||
|
||||
**More info:** [See /docs/introduction](#) or [1_introduction.ex.md](https://github.com/unclecode/crawl4ai/blob/main/introduction.ex.md)
|
||||
|
||||
---
|
||||
|
||||
## 2. Installation & Environment Setup
|
||||
```bash
|
||||
pip install crawl4ai
|
||||
crawl4ai-setup
|
||||
playwright install chromium
|
||||
```
|
||||
|
||||
**Try in Colab:**
|
||||
[Open Colab Notebook](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)
|
||||
|
||||
**More info:** [See /docs/configuration](#) or [2_configuration.md](https://github.com/unclecode/crawl4ai/blob/main/configuration.md)
|
||||
|
||||
---
|
||||
|
||||
## 3. Core Concepts & Configuration
|
||||
Use `AsyncWebCrawler`, `CrawlerRunConfig`, and `BrowserConfig` to control crawling.
|
||||
|
||||
**Example config:**
|
||||
```python
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

browser_config = BrowserConfig(
    headless=True,
    viewport_width=1920,
    viewport_height=1080,
    text_mode=False,
    ignore_https_errors=True,
    java_script_enabled=True
)

run_config = CrawlerRunConfig(
    css_selector="article.main",
    word_count_threshold=50,
    excluded_tags=['nav','footer'],
    exclude_external_links=True,
    wait_for="css:.article-loaded",
    page_timeout=60000,
    delay_before_return_html=1.0,
    mean_delay=0.1,
    max_range=0.3,
    process_iframes=True,
    remove_overlay_elements=True,
    js_code="""
        (async () => {
            window.scrollTo(0, document.body.scrollHeight);
            await new Promise(r => setTimeout(r, 2000));
            document.querySelector('.load-more')?.click();
        })();
    """
)

# Use: ENABLED, DISABLED, BYPASS, READ_ONLY, WRITE_ONLY
# run_config.cache_mode = CacheMode.ENABLED
```
|
||||
|
||||
**Prefixes:**
|
||||
- `http://` or `https://` for live pages
|
||||
- `file://local.html` for local files
|
||||
- `raw:<html>` for raw HTML strings
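The three prefixes above can be told apart with a simple string check. The sketch below is only an illustration of that dispatch; `classify_source` is a hypothetical helper and does not reflect how Crawl4AI resolves inputs internally.

```python
def classify_source(url):
    """Return which kind of input a crawl target is, based on its prefix.

    Hypothetical helper for illustration: 'web' for live pages,
    'file' for local files, 'raw' for inline HTML strings.
    """
    if url.startswith(("http://", "https://")):
        return "web"
    if url.startswith("file://"):
        return "file"
    if url.startswith("raw:"):
        return "raw"
    raise ValueError(f"Unsupported source prefix: {url!r}")
```

For example, `classify_source("raw:<html><body>Hi</body></html>")` falls into the raw-HTML branch, while a string with no recognized prefix raises `ValueError`.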
|
||||
|
||||
**More info:** [See /docs/async_webcrawler](#) or [3_async_webcrawler.ex.md](https://github.com/unclecode/crawl4ai/blob/main/async_webcrawler.ex.md)
|
||||
|
||||
---
|
||||
|
||||
## 4. Basic Crawling & Simple Extraction
|
||||
```python
async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://news.example.com/article", config=run_config)
    print(result.markdown)  # Basic markdown content
```
|
||||
|
||||
**More info:** [See /docs/browser_context_page](#) or [4_browser_context_page.ex.md](https://github.com/unclecode/crawl4ai/blob/main/browser_context_page.ex.md)
|
||||
|
||||
---
|
||||
|
||||
## 5. Markdown Generation & AI-Optimized Output
|
||||
|
||||
After crawling, `result.markdown_v2` provides:
|
||||
- `raw_markdown`: Unfiltered markdown
|
||||
- `markdown_with_citations`: Links as references at the bottom
|
||||
- `references_markdown`: A separate list of reference links
|
||||
- `fit_markdown`: Filtered, relevant markdown (e.g., after BM25)
|
||||
- `fit_html`: The HTML used to produce `fit_markdown`
|
||||
|
||||
**Example:**
|
||||
```python
|
||||
print("RAW:", result.markdown_v2.raw_markdown[:200])
|
||||
print("CITED:", result.markdown_v2.markdown_with_citations[:200])
|
||||
print("REFERENCES:", result.markdown_v2.references_markdown)
|
||||
print("FIT MARKDOWN:", result.markdown_v2.fit_markdown)
|
||||
```
|
||||
|
||||
For AI training, `fit_markdown` focuses on the most relevant content.
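To picture the difference between inline links and the citation-style output listed above, here is a rough sketch that moves Markdown links into a numbered reference list. It is an illustration only: `to_citations` is a hypothetical function, and the exact citation format Crawl4AI emits may differ.

```python
import re

# Matches inline Markdown links of the form [text](http://...)
LINK_RE = re.compile(r"\[([^\]]+)\]\((https?://[^)\s]+)\)")


def to_citations(markdown):
    """Replace inline links with numbered markers and collect references.

    Returns (body, references). Repeated URLs reuse the same number.
    """
    urls = []

    def replace(match):
        text, url = match.group(1), match.group(2)
        if url not in urls:
            urls.append(url)
        return f"{text}[{urls.index(url) + 1}]"

    body = LINK_RE.sub(replace, markdown)
    references = "\n".join(f"[{i}] {u}" for i, u in enumerate(urls, start=1))
    return body, references
```

In this sketch the first return value plays the role of the cited body text and the second the role of a separate reference list.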
|
||||
|
||||
**More info:** [See /docs/markdown_generation](#) or [5_markdown_generation.ex.md](https://github.com/unclecode/crawl4ai/blob/main/markdown_generation.ex.md)
|
||||
|
||||
---
|
||||
|
||||
## 6. Structured Data Extraction (CSS, XPath, LLM)
|
||||
Extract JSON data without LLMs:
|
||||
|
||||
**CSS:**
|
||||
```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Products",
    "baseSelector": ".product",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "price", "selector": ".price", "type": "text"}
    ]
}
run_config.extraction_strategy = JsonCssExtractionStrategy(schema)
```
|
||||
|
||||
**XPath:**
|
||||
```python
from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy

xpath_schema = {
    "name": "Articles",
    "baseSelector": "//div[@class='article']",
    "fields": [
        {"name":"headline","selector":".//h1","type":"text"},
        {"name":"summary","selector":".//p[@class='summary']","type":"text"}
    ]
}
run_config.extraction_strategy = JsonXPathExtractionStrategy(xpath_schema)
```
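To show what a schema of this shape is meant to produce, the sketch below applies the same `baseSelector` and `fields` layout to a small well-formed document using only the standard library. It is not the library's implementation: `apply_xpath_schema` is a hypothetical function, `xml.etree.ElementTree` supports only a subset of XPath, and real pages would need an HTML-tolerant parser.

```python
import xml.etree.ElementTree as ET

demo_schema = {
    "name": "Articles",
    "baseSelector": "//div[@class='article']",
    "fields": [
        {"name": "headline", "selector": ".//h1", "type": "text"},
        {"name": "summary", "selector": ".//p[@class='summary']", "type": "text"},
    ],
}


def apply_xpath_schema(markup, schema):
    """Return one dict per element matched by the schema's baseSelector."""
    root = ET.fromstring(markup)
    base = schema["baseSelector"]
    # ElementTree rejects absolute paths on an element, so make it relative.
    if base.startswith("//"):
        base = "." + base
    rows = []
    for node in root.findall(base):
        row = {}
        for field in schema["fields"]:
            match = node.find(field["selector"])
            # Missing fields become None rather than raising.
            row[field["name"]] = (
                "".join(match.itertext()).strip() if match is not None else None
            )
        rows.append(row)
    return rows
```

Each matched base element becomes one record whose keys are the field names, which is the JSON-like shape the extraction strategies above are described as returning.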
|
||||
|
||||
**More info:** [See /docs/extraction_strategies](#) or [7_extraction_strategies.ex.md](https://github.com/unclecode/crawl4ai/blob/main/extraction_strategies.ex.md)
|
||||
|
||||
---
|
||||
|
||||
## 7. Advanced Extraction: LLM & Open-Source Models
|
||||
Use LLMExtractionStrategy for complex tasks. Works with OpenAI or open-source models (e.g., Ollama).
|
||||
|
||||
```python
from pydantic import BaseModel
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class TravelData(BaseModel):
    destination: str
    attractions: list

run_config.extraction_strategy = LLMExtractionStrategy(
    provider="ollama/nemotron",
    schema=TravelData.schema(),
    instruction="Extract destination and top attractions."
)
```
|
||||
|
||||
**More info:** [See /docs/extraction_strategies](#) or [7_extraction_strategies.ex.md](https://github.com/unclecode/crawl4ai/blob/main/extraction_strategies.ex.md)
|
||||
|
||||
---
|
||||
|
||||
## 8. Page Interactions, JS Execution, & Dynamic Content
|
||||
Insert `js_code` and use `wait_for` to ensure content loads. Example:
|
||||
```python
|
||||
run_config.js_code = """
|
||||
(async () => {
|
||||
document.querySelector('.load-more')?.click();
|
||||
await new Promise(r => setTimeout(r, 2000));
|
||||
})();
|
||||
"""
|
||||
run_config.wait_for = "css:.item-loaded"
|
||||
```
|
||||
|
||||
**More info:** [See /docs/page_interaction](#) or [11_page_interaction.md](https://github.com/unclecode/crawl4ai/blob/main/page_interaction.md)
|
||||
|
||||
---
|
||||
|
||||
## 9. Media, Links, & Metadata Handling
|
||||
`result.media["images"]`: List of images with `src`, `score`, `alt`. Score indicates relevance.
|
||||
|
||||
`result.media["videos"]`, `result.media["audios"]` similarly hold media info.
|
||||
|
||||
`result.links["internal"]`, `result.links["external"]`, `result.links["social"]`: Categorized links. Each link has `href`, `text`, `context`, `type`.
|
||||
|
||||
`result.metadata`: Title, description, keywords, author.
|
||||
|
||||
**Example:**
|
||||
```python
# Images
for img in result.media["images"]:
    print("Image:", img["src"], "Score:", img["score"], "Alt:", img.get("alt","N/A"))

# Links
for link in result.links["external"]:
    print("External Link:", link["href"], "Text:", link["text"])

# Metadata
print("Page Title:", result.metadata["title"])
print("Description:", result.metadata["description"])
```
|
||||
|
||||
**More info:** [See /docs/content_selection](#) or [8_content_selection.ex.md](https://github.com/unclecode/crawl4ai/blob/main/content_selection.ex.md)
|
||||
|
||||
---
|
||||
|
||||
## 10. Authentication & Identity Preservation
|
||||
|
||||
### Manual Setup via User Data Directory
|
||||
1. **Open Chrome with a custom user data dir:**
|
||||
```bash
|
||||
"C:\Program Files\Google\Chrome\Application\chrome.exe" --user-data-dir="C:\MyChromeProfile"
|
||||
```
|
||||
On macOS:
|
||||
```bash
|
||||
"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" --user-data-dir="/Users/username/ChromeProfiles/MyProfile"
|
||||
```
|
||||
|
||||
2. **Log in to sites, solve CAPTCHAs, adjust settings manually.**
|
||||
The browser saves cookies/localStorage in that directory.
|
||||
|
||||
3. **Use `user_data_dir` in `BrowserConfig`:**
|
||||
```python
|
||||
browser_config = BrowserConfig(
|
||||
headless=True,
|
||||
user_data_dir="/Users/username/ChromeProfiles/MyProfile"
|
||||
)
|
||||
```
|
||||
|
||||
Now the crawler starts with those cookies, sessions, etc.
|
||||
|
||||
### Using `storage_state`
|
||||
Alternatively, export and reuse storage states:
|
||||
```python
|
||||
browser_config = BrowserConfig(
|
||||
headless=True,
|
||||
storage_state="mystate.json" # Pre-saved state
|
||||
)
|
||||
```
|
||||
|
||||
No repeated logins needed.
|
||||
|
||||
**More info:** [See /docs/storage_state](#) or [16_storage_state.md](https://github.com/unclecode/crawl4ai/blob/main/storage_state.md)
|
||||
|
||||
---
|
||||
|
||||
## 11. Proxy & Security Enhancements
|
||||
Use `proxy_config` for authenticated proxies:
|
||||
```python
|
||||
browser_config.proxy_config = {
|
||||
"server": "http://proxy.example.com:8080",
|
||||
"username": "proxyuser",
|
||||
"password": "proxypass"
|
||||
}
|
||||
```
|
||||
|
||||
Combine with `headers` or `ignore_https_errors` as needed.
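If proxy credentials arrive as a single URL, they can be split into the `server`, `username`, and `password` keys shown above. This is a minimal standard-library sketch; `proxy_config_from_url` is a hypothetical helper, and the URL is a placeholder.

```python
from urllib.parse import unquote, urlsplit


def proxy_config_from_url(proxy_url):
    """Split 'scheme://user:pass@host:port' into a proxy_config-style dict."""
    parts = urlsplit(proxy_url)
    if not parts.scheme or not parts.hostname:
        raise ValueError("proxy URL must include a scheme and host")
    server = f"{parts.scheme}://{parts.hostname}"
    if parts.port:
        server += f":{parts.port}"
    config = {"server": server}
    # Credentials are optional; percent-encoded characters are decoded.
    if parts.username:
        config["username"] = unquote(parts.username)
    if parts.password:
        config["password"] = unquote(parts.password)
    return config
```

The resulting dictionary has the same keys as the `proxy_config` example above, so it could be assigned in the same way.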
|
||||
|
||||
**More info:** [See /docs/proxy_security](#) or [14_proxy_security.md](https://github.com/unclecode/crawl4ai/blob/main/proxy_security.md)
|
||||
|
||||
---
|
||||
|
||||
## 12. Screenshots, PDFs & File Downloads
|
||||
Enable `screenshot=True` or `pdf=True` in `CrawlerRunConfig`:
|
||||
|
||||
```python
|
||||
run_config.screenshot = True
|
||||
run_config.pdf = True
|
||||
```
|
||||
|
||||
After crawling:
|
||||
```python
from base64 import b64decode

# Screenshot and PDF data are returned as base64-encoded strings
if result.screenshot:
    with open("page.png", "wb") as f:
        f.write(b64decode(result.screenshot))

if result.pdf:
    with open("page.pdf", "wb") as f:
        f.write(b64decode(result.pdf))
```
|
||||
|
||||
**File Downloads:**
|
||||
```python
|
||||
browser_config.accept_downloads = True
|
||||
browser_config.downloads_path = "./downloads"
|
||||
run_config.js_code = """document.querySelector('a.download')?.click();"""
|
||||
|
||||
# After crawl:
|
||||
print("Downloaded files:", result.downloaded_files)
|
||||
```
|
||||
|
||||
**More info:** [See /docs/screenshot_and_pdf_export](#) or [15_screenshot_and_pdf_export.md](https://github.com/unclecode/crawl4ai/blob/main/screenshot_and_pdf_export.md)
|
||||
Also [10_file_download.md](https://github.com/unclecode/crawl4ai/blob/main/file_download.md)
|
||||
|
||||
---
|
||||
|
||||
## 13. Caching & Performance Optimization
|
||||
Set `cache_mode` to reuse fetch results:
|
||||
```python
|
||||
from crawl4ai import CacheMode
|
||||
run_config.cache_mode = CacheMode.ENABLED
|
||||
```
|
||||
|
||||
Adjust delays, increase concurrency, or use `text_mode=True` for faster extraction.
|
||||
|
||||
**More info:** [See /docs/cache_modes](#) or [9_cache_modes.md](https://github.com/unclecode/crawl4ai/blob/main/cache_modes.md)
|
||||
|
||||
---
|
||||
|
||||
## 14. Hooks for Custom Logic
|
||||
Hooks let you run custom code at specific lifecycle events. You do not need to create pages manually in `on_browser_created`.
|
||||
|
||||
Use `on_page_context_created` to apply routing or modify page contexts before crawling the URL:
|
||||
|
||||
**Example Hook:**
|
||||
```python
async def on_page_context_created_hook(context, page, **kwargs):
    # Block all images to speed up load
    await context.route("**/*.{png,jpg,jpeg}", lambda route: route.abort())
    print("[HOOK] Image requests blocked")

async with AsyncWebCrawler(config=browser_config) as crawler:
    crawler.crawler_strategy.set_hook("on_page_context_created", on_page_context_created_hook)
    result = await crawler.arun("https://imageheavy.example.com", config=run_config)
    print("Crawl finished with images blocked.")
```
|
||||
|
||||
This hook is clean and doesn’t create a separate page itself—it just modifies the current context/page setup.
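The general idea of registering a callback by name and invoking it at a lifecycle event can be sketched with plain `asyncio`. The `HookRegistry` class below is hypothetical and only illustrates the pattern; it is not how Crawl4AI's crawler strategy is implemented.

```python
import asyncio
import inspect


class HookRegistry:
    """Hypothetical registry: one optional callback per named event."""

    def __init__(self, names):
        self._hooks = {name: None for name in names}

    def set_hook(self, name, func):
        if name not in self._hooks:
            raise ValueError(f"Unknown hook: {name}")
        self._hooks[name] = func

    async def execute(self, name, *args, **kwargs):
        hook = self._hooks.get(name)
        if hook is None:
            return None  # No callback registered for this event
        result = hook(*args, **kwargs)
        # Support both sync and async callbacks.
        if inspect.isawaitable(result):
            result = await result
        return result


async def demo():
    calls = []

    async def on_page_context_created(context, page, **kwargs):
        calls.append((context, page))

    registry = HookRegistry(["on_browser_created", "on_page_context_created"])
    registry.set_hook("on_page_context_created", on_page_context_created)
    # Placeholder strings stand in for real context and page objects.
    await registry.execute("on_page_context_created", "ctx", "page")
    return calls
```

Keeping one callback per event name means a hook only adjusts state it is handed, which matches the point above about not creating a separate page inside the hook.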
|
||||
|
||||
**More info:** [See /docs/hooks_auth](#) or [13_hooks_auth.md](https://github.com/unclecode/crawl4ai/blob/main/hooks_auth.md)
|
||||
|
||||
---
|
||||
|
||||
## 15. Dockerization & Scaling
|
||||
Use Docker images:
|
||||
|
||||
- AMD64 basic:
|
||||
```bash
|
||||
docker pull unclecode/crawl4ai:basic-amd64
|
||||
docker run -p 11235:11235 unclecode/crawl4ai:basic-amd64
|
||||
```
|
||||
|
||||
- ARM64 for M1/M2:
|
||||
```bash
|
||||
docker pull unclecode/crawl4ai:basic-arm64
|
||||
docker run -p 11235:11235 unclecode/crawl4ai:basic-arm64
|
||||
```
|
||||
|
||||
- GPU support:
|
||||
```bash
|
||||
docker pull unclecode/crawl4ai:gpu-amd64
|
||||
docker run --gpus all -p 11235:11235 unclecode/crawl4ai:gpu-amd64
|
||||
```
|
||||
|
||||
Scale with load balancers or Kubernetes.
|
||||
|
||||
**More info:** [See /docs/proxy_security (for proxy) or relevant Docker instructions in README](#)
|
||||
|
||||
---
|
||||
|
||||
## 16. Troubleshooting & Common Pitfalls
|
||||
- Empty results? Relax filters, check selectors.
|
||||
- Timeouts? Increase `page_timeout` or refine `wait_for`.
|
||||
- CAPTCHAs? Use `user_data_dir` or `storage_state` after manual solving.
|
||||
- JS errors? Try headful mode for debugging.
|
||||
|
||||
Check [examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples) & [quickstart_async.config.py](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_async.config.py) for more code.
|
||||
|
||||
---
|
||||
|
||||
## 17. Comprehensive End-to-End Example
|
||||
Combine hooks, JS execution, PDF saving, LLM extraction—see [quickstart_async.config.py](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/quickstart_async.config.py) for a full example.
|
||||
|
||||
---
|
||||
|
||||
## 18. Further Resources & Community
|
||||
- **Docs:** [https://crawl4ai.com](https://crawl4ai.com)
|
||||
- **Issues & PRs:** [https://github.com/unclecode/crawl4ai/issues](https://github.com/unclecode/crawl4ai/issues)
|
||||
|
||||
Follow [@unclecode](https://x.com/unclecode) for news & community updates.
|
||||
|
||||
**Happy Crawling!**
|
||||
Leverage Crawl4AI to feed your AI models with clean, structured web data today.
|
||||
@@ -65,7 +65,7 @@
|
||||
|
||||
#### `viewport_width` and `viewport_height`
|
||||
- **Description**: Sets the default browser viewport dimensions.
|
||||
-  - Default: `1920` (width), `1080` (height)
+  - Default: `1080` (width), `600` (height)
|
||||
- **Use Case**:
|
||||
- Adjust for crawling responsive layouts or specific device emulations.
|
||||
|
||||
@@ -134,6 +134,19 @@
|
||||
- **Use Case**:
|
||||
- Use for advanced browser configurations like WebRTC or GPU tuning.
|
||||
|
||||
#### `verbose`
|
||||
- **Description**: Enable verbose logging of browser operations.
|
||||
- Default: `True`
|
||||
- **Use Case**:
|
||||
- Enable for detailed logging during development and debugging.
|
||||
- Disable in production for better performance.
|
||||
|
||||
#### `sleep_on_close`
|
||||
- **Description**: Adds a delay before closing the browser.
|
||||
- Default: `False`
|
||||
- **Use Case**:
|
||||
- Enable when you need to ensure all browser operations are complete before closing.
|
||||
|
||||
## CrawlerRunConfig
|
||||
The `CrawlerRunConfig` class centralizes parameters for controlling crawl operations. This configuration covers content extraction, page interactions, caching, and runtime behaviors. Below is an exhaustive breakdown of parameters and their best-use scenarios.
|
||||
|
||||
@@ -341,3 +354,37 @@ The `CrawlerRunConfig` class centralizes parameters for controlling crawl operat
|
||||
- **Use Case**:
|
||||
- Enable when debugging JavaScript errors on pages.
|
||||
|
||||
##### `parser_type`
|
||||
- **Description**: Type of parser to use for HTML parsing.
|
||||
- Default: `"lxml"`
|
||||
- **Use Case**:
|
||||
- Use when specific HTML parsing requirements are needed.
|
||||
- `"lxml"` provides good performance and standards compliance.
|
||||
|
||||
##### `prettiify`
|
||||
- **Description**: Apply `fast_format_html` to produce prettified HTML output.
|
||||
- Default: `False`
|
||||
- **Use Case**:
|
||||
- Enable for better readability of extracted HTML content.
|
||||
- Useful during development and debugging.
|
||||
|
||||
##### `fetch_ssl_certificate`
|
||||
- **Description**: Fetch and store SSL certificate information during crawling.
|
||||
- Default: `False`
|
||||
- **Use Case**:
|
||||
- Enable when SSL certificate analysis is required.
|
||||
- Useful for security audits and certificate validation.
|
||||
|
||||
##### `url`
|
||||
- **Description**: Target URL for the crawl operation.
|
||||
- Default: `None`
|
||||
- **Use Case**:
|
||||
- Set when initializing a crawler for a specific URL.
|
||||
- Can be overridden during actual crawl operations.
|
||||
|
||||
##### `log_console`
|
||||
- **Description**: Log browser console messages during crawling.
|
||||
- Default: `False`
|
||||
- **Use Case**:
|
||||
- Enable to capture JavaScript console output.
|
||||
- Useful for debugging JavaScript-heavy pages.
|
||||
|
||||
@@ -1,97 +1,20 @@
|
||||
### Hypothetical Questions
|
||||
|
||||
**BrowserConfig:**
|
||||
|
||||
1. **Browser Types and Headless Mode**
|
||||
- *"How do I choose between `chromium`, `firefox`, or `webkit` for `browser_type`?"*
|
||||
- *"What are the benefits of running the browser in `headless=True` mode versus a visible UI?"*
|
||||
|
||||
2. **Managed Browser and Persistent Context**
|
||||
- *"When should I enable `use_managed_browser` for advanced session control?"*
|
||||
- *"How do I use `use_persistent_context` and `user_data_dir` to maintain login sessions and persistent storage?"*
|
||||
|
||||
3. **Debugging and Remote Access**
|
||||
- *"How do I use the `debugging_port` to remotely inspect the browser with DevTools?"*
|
||||
|
||||
4. **Proxy and Network Configurations**
|
||||
- *"How can I configure a `proxy` or `proxy_config` for region-specific crawling or authentication?"*
|
||||
|
||||
5. **Viewports and Layout Testing**
|
||||
- *"How do I adjust `viewport_width` and `viewport_height` for responsive layout testing?"*
|
||||
|
||||
6. **Downloads and Storage States**
|
||||
- *"What steps do I need to take to enable `accept_downloads` and specify a `downloads_path`?"*
|
||||
- *"How can I use `storage_state` to preload cookies or session data?"*
|
||||
|
||||
7. **HTTPS and JavaScript Settings**
|
||||
- *"What happens if I set `ignore_https_errors=True` on sites with invalid SSL certificates?"*
|
||||
- *"When should I disable `java_script_enabled` to improve speed and stability?"*
|
||||
|
||||
8. **Cookies, Headers, and User Agents**
|
||||
- *"How do I add custom `cookies` or `headers` to every browser request?"*
|
||||
- *"How can I set a custom `user_agent` or use a `user_agent_mode` like `random` to avoid detection?"*
|
||||
|
||||
9. **Performance Tuning**
|
||||
- *"What is the difference between `text_mode`, `light_mode`, and adding `extra_args` for performance tuning?"*
|
||||
|
||||
---
|
||||
|
||||
**CrawlerRunConfig:**
|
||||
|
||||
10. **Content Extraction and Filtering**
|
||||
- *"How does the `word_count_threshold` affect which pages or sections get processed?"*
|
||||
- *"What `extraction_strategy` should I use for structured data extraction and how does `chunking_strategy` help organize the content?"*
|
||||
- *"How do I apply a `css_selector` or `excluded_tags` to refine my extracted content?"*
|
||||
|
||||
11. **Markdown and Text-Only Modes**
|
||||
- *"Can I generate Markdown output directly and what `markdown_generator` should I use?"*
|
||||
- *"When should I set `only_text=True` to strip out non-textual content?"*
|
||||
|
||||
12. **Caching and Session Handling**
|
||||
- *"How does `cache_mode=ENABLED` improve performance, and when should I consider `WRITE_ONLY` or disabling the cache?"*
|
||||
- *"What is the role of `session_id` in maintaining state across requests?"*
|
||||
|
||||
13. **Page Loading and Timing**
|
||||
- *"How do `wait_until`, `page_timeout`, and `wait_for` elements help control page load timing before extraction?"*
|
||||
- *"When should I disable `wait_for_images` to speed up the crawl?"*
|
||||
|
||||
14. **Delays and Concurrency**
|
||||
- *"How do `mean_delay` and `max_range` randomize request intervals to avoid detection?"*
|
||||
- *"What is `semaphore_count` and how does it manage concurrency for multiple crawling tasks?"*
|
||||
|
||||
15. **JavaScript Execution and Dynamic Content**
|
||||
- *"How can I inject custom `js_code` to load additional data or simulate user interactions?"*
|
||||
- *"When should I use `scan_full_page` or `adjust_viewport_to_content` to handle infinite scrolling?"*
|
||||
|
||||
16. **Screenshots, PDFs, and Media**
|
||||
- *"How do I enable `screenshot` or `pdf` generation to capture page states?"*
|
||||
- *"What are `image_description_min_word_threshold` and `image_score_threshold` for, and how do they enhance image-related extraction?"*
|
||||
|
||||
17. **Logging and Debugging**
|
||||
- *"How do `verbose` and `log_console` help me troubleshoot issues with crawling or page scripts?"*
|
||||
|
||||
---
|
||||
|
||||
### Topics Discussed in the File
|
||||
|
||||
- **BrowserConfig Essentials:**
|
||||
- Browser types (`chromium`, `firefox`, `webkit`)
|
||||
- Headless vs. non-headless mode
|
||||
- Persistent context and managed browser sessions
|
||||
- Proxy configurations and network settings
|
||||
- Viewport dimensions and responsive testing
|
||||
- Download handling and storage states
|
||||
- HTTPS errors and JavaScript enablement
|
||||
- Cookies, headers, and user agents
|
||||
- Performance tuning via `text_mode`, `light_mode`, and `extra_args`
|
||||
|
||||
- **CrawlerRunConfig Core Settings:**
|
||||
- Content extraction parameters (`word_count_threshold`, `extraction_strategy`, `chunking_strategy`)
|
||||
- Markdown generation and text-only extraction
|
||||
- Content filtering (`css_selector`, `excluded_tags`)
|
||||
- Caching strategies and `cache_mode` options
|
||||
- Page load conditions (`wait_until`, `wait_for`) and timeouts (`page_timeout`)
|
||||
- Delays, concurrency, and scaling (`mean_delay`, `max_range`, `semaphore_count`)
|
||||
- JavaScript injections (`js_code`) and handling dynamic/infinite scroll content
|
||||
- Screenshots, PDFs, and image thresholds for enhanced outputs
|
||||
- Logging and debugging modes (`verbose`, `log_console`)
|
||||
browser_config: Configure browser type with chromium, firefox, or webkit support | browser selection, browser engine, web engine | BrowserConfig(browser_type="chromium")
|
||||
headless_mode: Toggle headless browser mode for GUI-less operation | headless browser, no GUI, background mode | BrowserConfig(headless=True)
|
||||
managed_browser: Enable advanced browser manipulation and control | browser management, session control | BrowserConfig(use_managed_browser=True)
|
||||
debugging_setup: Configure remote debugging port for browser inspection | debug port, devtools connection | BrowserConfig(debugging_port=9222)
|
||||
persistent_context: Enable persistent browser sessions for maintaining state | session persistence, profile saving | BrowserConfig(use_persistent_context=True)
|
||||
browser_profile: Specify directory for storing browser profile data | user data, profile storage | BrowserConfig(user_data_dir="/path/to/profile")
|
||||
proxy_configuration: Set up proxy settings for browser connections | proxy server, network routing | BrowserConfig(proxy="http://proxy.example.com:8080")
|
||||
viewport_settings: Configure browser window dimensions | screen size, window dimensions | BrowserConfig(viewport_width=1920, viewport_height=1080)
|
||||
download_handling: Configure browser download behavior and location | file downloads, download directory | BrowserConfig(accept_downloads=True, downloads_path="/downloads")
|
||||
content_threshold: Set minimum word count for processing page content | word limit, content filter | CrawlerRunConfig(word_count_threshold=200)
|
||||
extraction_strategy: Configure method for extracting structured data | data extraction, parsing strategy | CrawlerRunConfig(extraction_strategy=CustomStrategy())
|
||||
content_chunking: Define strategy for breaking content into chunks | text chunking, content splitting | CrawlerRunConfig(chunking_strategy=RegexChunking())
|
||||
cache_behavior: Control caching mode for crawler operations | cache control, data caching | CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
|
||||
page_navigation: Configure page load and navigation timing | page timeout, navigation wait | CrawlerRunConfig(wait_until="domcontentloaded", page_timeout=60000)
|
||||
javascript_execution: Enable or disable JavaScript processing | JS handling, script execution | BrowserConfig(java_script_enabled=True)
|
||||
content_filtering: Configure HTML tag exclusion and content cleanup | tag filtering, content cleanup | CrawlerRunConfig(excluded_tags=["script", "style"])
|
||||
concurrent_operations: Set limit for simultaneous crawler operations | concurrency control, parallel crawling | CrawlerRunConfig(semaphore_count=5)
|
||||
page_interaction: Configure JavaScript execution and page scanning | page automation, interaction control | CrawlerRunConfig(js_code="custom_script()", scan_full_page=True)
|
||||
media_capture: Enable screenshot and PDF generation capabilities | visual capture, page export | CrawlerRunConfig(screenshot=True, pdf=True)
|
||||
debugging_options: Configure logging and console message capture | debug logging, error tracking | CrawlerRunConfig(verbose=True, log_console=True)
|
||||
@@ -1,279 +0,0 @@
|
||||
# Extended Documentation: Asynchronous Crawling with `AsyncWebCrawler`
|
||||
|
||||
This document provides a comprehensive, human-oriented overview of the `AsyncWebCrawler` class and related components from the `crawl4ai` package. It explains the motivations behind asynchronous crawling, shows how to configure and run crawls, and provides examples for advanced features like dynamic content handling, extraction strategies, caching, containerization, and troubleshooting.
|
||||
|
||||
## Introduction
|
||||
|
||||
Crawling websites can be slow if done sequentially, especially when handling large numbers of URLs or rendering dynamic pages. Asynchronous crawling helps you run multiple operations concurrently, improving throughput and performance. The `AsyncWebCrawler` class leverages asynchronous I/O and browser automation tools to fetch content efficiently, handle complex DOM interactions, and extract structured data.
|
||||
|
||||
### Quick Start
|
||||
|
||||
Before diving into advanced features, here is a quick start example that shows how to run a simple asynchronous crawl with a headless Chromium browser, extract basic text, and print the results.
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def main():
    # Basic browser configuration
    browser_config = BrowserConfig(browser_type="chromium", headless=True)

    # Run the crawler asynchronously
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://example.com")
        print("Extracted Markdown:")
        print(result.markdown)

asyncio.run(main())
```
|
||||
|
||||
This snippet initializes a headless Chromium browser, crawls the page, processes the HTML, and prints extracted content as Markdown.
|
||||
|
||||
## Browser Configuration
|
||||
|
||||
The `BrowserConfig` class defines browser-related settings and behaviors. You can customize:
|
||||
|
||||
- `browser_type`: Browser to use, such as `chromium` or `firefox`.
|
||||
- `headless`: Run the browser in headless mode (no visible UI).
|
||||
- `viewport_width` and `viewport_height`: Control viewport dimensions for rendering.
|
||||
- `proxy`: Configure proxies to bypass IP restrictions.
|
||||
- `verbose`: Control logging verbosity.
|
||||
|
||||
**Example: Customizing Browser Settings**
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_config = BrowserConfig(
    browser_type="firefox",
    headless=False,
    viewport_width=1920,
    viewport_height=1080,
    verbose=True,
)

async def main():
    # `async with` must run inside a coroutine
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://yourwebsite.com")
        print(result.markdown)

asyncio.run(main())
```
|
||||
|
||||
### Running in Docker
|
||||
|
||||
For scalability and reproducibility, consider running your crawler inside a Docker container. A simple Dockerfile might look like this:
|
||||
|
||||
```dockerfile
FROM python:3.10-slim
RUN apt-get update && apt-get install -y wget
RUN pip install crawl4ai playwright
# --with-deps also installs the system libraries Chromium needs on a slim image
RUN playwright install --with-deps chromium
COPY your_script.py /app/your_script.py
WORKDIR /app
CMD ["python", "your_script.py"]
```
|
||||
|
||||
You can then run:
|
||||
|
||||
```bash
docker build -t mycrawler .
docker run mycrawler
```
|
||||
|
||||
Within this container, `AsyncWebCrawler` will launch Chromium using Playwright and crawl sites as configured.
|
||||
|
||||
## Asynchronous Crawling Strategies
|
||||
|
||||
By default, `AsyncWebCrawler` uses `AsyncPlaywrightCrawlerStrategy`, which relies on Playwright for browser automation. This lets you interact with DOM elements, scroll, click buttons, and handle dynamic content. If other strategies are available, you can specify them during initialization.
|
||||
|
||||
```python
from crawl4ai import AsyncWebCrawler, AsyncPlaywrightCrawlerStrategy

crawler = AsyncWebCrawler(crawler_strategy=AsyncPlaywrightCrawlerStrategy())
```
|
||||
|
||||
## Handling Dynamic Content
|
||||
|
||||
Modern websites often load data via JavaScript or require user interactions. You can inject custom JavaScript snippets to manipulate the page, click buttons, or wait for certain elements to appear before extracting content.
|
||||
|
||||
**Example: Loading More Content**
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

# JavaScript run in the page: click every "load more" button, then pause
js_code = """
(async () => {
    const loadButtons = document.querySelectorAll(".load-more");
    for (const btn of loadButtons) btn.click();
    await new Promise(r => setTimeout(r, 2000)); // Wait for new content
})();
"""

config = CrawlerRunConfig(js_code=[js_code])

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/infinite-scroll", config=config)
        print("Extracted Markdown:")
        print(result.markdown)

asyncio.run(main())
```
|
||||
|
||||
You can also use Playwright selectors to wait for specific elements before extraction.
|
||||
|
||||
## Extraction and Filtering
|
||||
|
||||
`AsyncWebCrawler` supports various extraction strategies to convert raw HTML into structured data. For example, `JsonCssExtractionStrategy` allows you to specify CSS selectors and get structured JSON from the page. `LLMExtractionStrategy` can feed extracted text into a language model for intelligent data extraction.
|
||||
|
||||
You can also apply content filters and chunking strategies to split large documents into smaller pieces before processing.
|
||||
|
||||
**Example: Using a JSON CSS Extraction Strategy**
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.chunking_strategy import RegexChunking
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# Schema: a base selector plus the fields to read from each match
schema = {
    "name": "Page headings",
    "baseSelector": "body",
    "fields": [{"name": "title", "selector": "h1", "type": "text"}],
}
config = CrawlerRunConfig(
    extraction_strategy=JsonCssExtractionStrategy(schema),
    chunking_strategy=RegexChunking(),
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        print("Extracted Content:")
        print(result.extracted_content)

asyncio.run(main())
```
|
||||
|
||||
**Comparing Chunking Strategies:**
|
||||
|
||||
- Regex-based chunking: Splits text by patterns, good for basic splitting.
|
||||
- NLP-based chunking (if available): Splits text into semantically meaningful units, ideal for LLM-based extraction.
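
To make the regex-based option concrete, here is a minimal standard-library sketch of pattern-based splitting. It is not the `RegexChunking` implementation; the helper name `regex_chunks` and the default blank-line pattern are assumptions chosen for illustration.

```python
import re

def regex_chunks(text, patterns=(r"\n\s*\n",)):
    """Split text on each pattern in turn and drop empty chunks.

    Hypothetical helper for illustration; not part of crawl4ai.
    """
    chunks = [text]
    for pattern in patterns:
        next_chunks = []
        for chunk in chunks:
            next_chunks.extend(re.split(pattern, chunk))
        chunks = next_chunks
    # Trim whitespace and discard chunks that are empty after trimming
    return [c.strip() for c in chunks if c.strip()]

sample = "First paragraph.\n\nSecond paragraph.\n\n\nThird paragraph."
chunks = regex_chunks(sample)
print(chunks)
```

Splitting on blank lines keeps paragraphs intact, which is usually enough for basic processing. Semantic splitting needs an NLP-based approach instead.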
|
||||
|
||||
## Caching and Performance
|
||||
|
||||
Caching helps avoid repeatedly fetching and rendering the same page. By default, caching is enabled (`CacheMode.ENABLED`), so subsequent crawls of the same URL can skip the network fetch if the data is still fresh. You can control the cache mode, clear the cache, or bypass it when needed.
|
||||
|
||||
**Cache Modes:**
|
||||
|
||||
- `CacheMode.ENABLED`: Use cache if available, write new results to cache.
|
||||
- `CacheMode.BYPASS`: Skip the cache for this request and fetch fresh content.
|
||||
- `CacheMode.DISABLED`: Do not use cache at all.
|
||||
|
||||
**Clearing and Flushing the Cache:**
|
||||
|
||||
```python
async with AsyncWebCrawler() as crawler:
    await crawler.aclear_cache()  # Clear entire cache
    # ... run some crawls ...
    await crawler.aflush_cache()  # Flush partial entries if needed
```
|
||||
|
||||
Use caching to speed up development, repeated tests, or partial re-runs of large crawls.
|
||||
|
||||
## Batch Crawling and Parallelization
|
||||
|
||||
The `arun_many` method lets you process multiple URLs concurrently, improving throughput. You can limit concurrency with `semaphore_count` and apply rate limiting via `CrawlerRunConfig` parameters like `mean_delay` and `max_range`.
|
||||
|
||||
**Example: Batch Crawling**
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

urls = [
    "https://site1.com",
    "https://site2.com",
    "https://site3.com",
]

# At most 10 pages in flight; mean_delay and max_range set the pause between requests
config = CrawlerRunConfig(semaphore_count=10, mean_delay=1.0, max_range=0.5)

async def main():
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls, config=config)
        for res in results:
            print(res.url, res.markdown)

asyncio.run(main())
```
|
||||
|
||||
This allows you to process large URL lists efficiently. Adjust `semaphore_count` to match your resource limits.
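
For intuition about what a concurrency limit and a randomized delay do together, the following standard-library sketch models them with `asyncio.Semaphore`. It makes no network requests, and `fake_fetch`, `crawl_all`, and the delay formula (a base delay plus a random extra up to `max_range`) are assumptions for illustration, not the library's internals.

```python
import asyncio
import random

async def fake_fetch(url, semaphore, stats, mean_delay=0.01, max_range=0.005):
    """Stand-in for a page fetch; no network access happens here."""
    async with semaphore:  # at most `semaphore_count` coroutines pass at once
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        # Assumed delay model: a base delay plus a random extra in [0, max_range]
        await asyncio.sleep(mean_delay + random.uniform(0, max_range))
        stats["active"] -= 1
        return f"fetched {url}"

async def crawl_all(urls, semaphore_count=2):
    semaphore = asyncio.Semaphore(semaphore_count)
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(*(fake_fetch(u, semaphore, stats) for u in urls))
    return results, stats["peak"]

urls = [f"https://example.com/page{i}" for i in range(6)]
results, peak = asyncio.run(crawl_all(urls))
print(len(results), "pages fetched; peak concurrency:", peak)
```

The recorded peak never exceeds the semaphore size, which is the property to rely on when sizing `semaphore_count` against available memory and CPU.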
|
||||
|
||||
## Scaling Crawls
|
||||
|
||||
To scale beyond a single machine, consider:
|
||||
|
||||
- Distributing URL lists across multiple workers or containers.
|
||||
- Using a job queue like Celery or Redis Queue to schedule crawls.
|
||||
- Integrating with cloud-based solutions for browser automation.
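
As a rough sketch of the first option, a URL list can be split round-robin so each worker or container receives a near-equal share. The helper name `shard_urls` is hypothetical, and how the shards reach the workers (queue, environment variable, file) is left open.

```python
def shard_urls(urls, worker_count):
    """Assign URLs to workers round-robin so shard sizes differ by at most one."""
    if worker_count < 1:
        raise ValueError("worker_count must be at least 1")
    shards = [[] for _ in range(worker_count)]
    for index, url in enumerate(urls):
        shards[index % worker_count].append(url)
    return shards

urls = [f"https://example.com/page{i}" for i in range(7)]
shards = shard_urls(urls, worker_count=3)
for worker_id, shard in enumerate(shards):
    print(worker_id, shard)
```

Each worker can then pass its own shard to `arun_many`.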
|
||||
|
||||
Always ensure you respect target site policies and comply with legal and ethical guidelines for web scraping.
|
||||
|
||||
## Screenshots and PDFs
|
||||
|
||||
If you need visual confirmation, you can enable screenshots or PDFs:
|
||||
|
||||
```python
import asyncio
import base64
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

config = CrawlerRunConfig(screenshot=True, pdf=True)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        # Assumption: the screenshot comes back as a base64 string, the PDF as raw bytes
        with open("page_screenshot.png", "wb") as f:
            f.write(base64.b64decode(result.screenshot))
        with open("page.pdf", "wb") as f:
            f.write(result.pdf)

asyncio.run(main())
```
|
||||
|
||||
This is helpful for debugging rendering issues or retaining visual copies of crawled pages.
|
||||
|
||||
## Troubleshooting and Common Issues
|
||||
|
||||
**Common Problems and Direct Fixes:**
|
||||
|
||||
1. **Browser not launching**:
|
||||
- Check that you have installed Playwright and run `playwright install` for the chosen browser.
|
||||
- Ensure all required dependencies are installed.
|
||||
|
||||
2. **Timeouts or partial loads**:
|
||||
- Increase timeouts or add delays between requests using `mean_delay` and `max_range`.
|
||||
- Wait for specific DOM elements to appear before proceeding.
|
||||
|
||||
3. **JavaScript not executing as expected**:
|
||||
- Use `js_code` in `CrawlerRunConfig` to inject scripts.
|
||||
- Check the browser console for errors, or set `headless=False` to debug UI interactions.
|
||||
|
||||
4. **Content Extraction fails**:
|
||||
- Validate CSS selectors or extraction strategies.
|
||||
- Try a different extraction strategy if the current one is not producing results.
|
||||
|
||||
5. **Stale Data due to Caching**:
|
||||
- Call `await crawler.aclear_cache()` to remove old entries.
|
||||
- Use `cache_mode=CacheMode.BYPASS` to fetch fresh data.
|
||||
|
||||
**Direct Code Fixes:**
|
||||
If you experience missing content after injecting JS, try waiting longer:
|
||||
```python
js_code = """
(async () => {
    document.querySelector(".load-more").click();
    await new Promise(r => setTimeout(r, 3000));
})();
"""

config = CrawlerRunConfig(js_code=[js_code])
```
|
||||
|
||||
Or run with `headless=False` to visually verify that the UI is changing as expected.
|
||||
|
||||
## Best Practices and Tips
|
||||
|
||||
- **Structuring your code**: Keep crawl logic modular. Have separate functions for configuring crawls, extracting data, and processing results.
|
||||
- **Error Handling**: Wrap crawl operations in try/except blocks and log errors with `crawler.logger`.
|
||||
- **Avoiding Getting Blocked**: Use proxies or rotate user agents if you crawl frequently. Randomize delays between requests.
|
||||
- **Authentication and Session Management**: If the site requires login, provide the crawler with login steps via `js_code` or Playwright selectors. Consider using cookies or session storage retrieval in `CrawlerRunConfig`.
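
The error-handling tip above can be sketched as a small retry wrapper. This is a standard-library illustration under stated assumptions: `with_retries` and `flaky_crawl` are hypothetical names, the standard `logging` module stands in for `crawler.logger`, and the backoff values are arbitrary.

```python
import asyncio
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("crawl")

async def with_retries(operation, attempts=3, base_delay=0.01):
    """Await operation() up to `attempts` times, doubling the pause after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return await operation()
        except Exception as exc:  # narrow this to the errors you expect in real code
            logger.warning("attempt %d failed: %s", attempt, exc)
            if attempt == attempts:
                raise
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))

calls = {"count": 0}

async def flaky_crawl():
    """Stand-in for a crawl call: fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("page load timed out")
    return "<html>ok</html>"

html = asyncio.run(with_retries(flaky_crawl))
print(html, "after", calls["count"], "attempts")
```

In a real crawl, `operation` would wrap a call such as `crawler.arun(url, config=config)`. The final failure is re-raised so the caller can decide whether to skip the URL or stop.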
|
||||
|
||||
## Reference and Additional Resources
|
||||
|
||||
- **GitHub Repository**: [crawl4ai GitHub](https://github.com/unclecode/crawl4ai)
|
||||
- **Playwright Docs**: [https://playwright.dev/](https://playwright.dev/)
|
||||
- **AsyncIO in Python**: [Python Asyncio Docs](https://docs.python.org/3/library/asyncio.html)
|
||||
|
||||
## FAQ
|
||||
|
||||
**Q**: How do I customize user agents?
|
||||
**A**: Set `user_agent="MyUserAgentString"` in `BrowserConfig` when you create the crawler. You can also pass it to `arun` or `arun_many`, or update `crawler_strategy` directly.
|
||||
|
||||
**Q**: Can I crawl local HTML files?
|
||||
**A**: Yes, provide a `file://` URL or `raw:` prefix with raw HTML strings.
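
A small sketch of building both kinds of input with the standard library. The helper names are hypothetical, and the file helper assumes an absolute POSIX path.

```python
from urllib.parse import quote

def local_file_url(posix_path):
    """Build a file:// URL from an absolute POSIX path, percent-encoding unsafe characters."""
    if not posix_path.startswith("/"):
        raise ValueError("expected an absolute POSIX path")
    return "file://" + quote(posix_path)

def raw_html_url(html):
    """Prefix an HTML string with raw: so it is treated as inline content."""
    return "raw:" + html

file_url = local_file_url("/tmp/pages/my page.html")
raw_url = raw_html_url("<html><body><h1>Hello</h1></body></html>")
print(file_url)
print(raw_url)
```

Either string can then be passed where a URL is expected, for example `await crawler.arun(file_url)`.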
|
||||
|
||||
**Q**: How do I integrate LLM-based extraction?
|
||||
**A**: Set `extraction_strategy=LLMExtractionStrategy(...)` and provide a chunking strategy. This allows using large language models for context-aware data extraction.
|
||||
@@ -1,81 +1,15 @@
|
||||
### Questions
|
||||
|
||||
1. **Asynchronous Crawling Basics**
|
||||
- *"How do I perform asynchronous web crawling using `AsyncWebCrawler`?"*
|
||||
- *"What are the performance benefits of asynchronous I/O in `crawl4ai`?"*
|
||||
|
||||
2. **Browser Configuration**
|
||||
- *"How can I configure `BrowserConfig` for headless Chromium or Firefox?"*
|
||||
- *"How do I set viewport dimensions and proxies in the `BrowserConfig`?"*
|
||||
- *"How can I enable verbose logging for browser interactions?"*
|
||||
|
||||
3. **Docker and Containerization**
|
||||
- *"How do I run `AsyncWebCrawler` inside a Docker container for scalability?"*
|
||||
- *"Which dependencies are needed in the Dockerfile to run asynchronous crawls?"*
|
||||
|
||||
4. **Crawling Strategies**
|
||||
- *"What is `AsyncPlaywrightCrawlerStrategy` and when should I use it?"*
|
||||
- *"How do I switch between different crawler strategies if multiple are available?"*
|
||||
|
||||
5. **Handling Dynamic Content**
|
||||
- *"How can I inject custom JavaScript to load more content or simulate user actions?"*
|
||||
- *"What is the best way to wait for specific DOM elements before extracting content?"*
|
||||
|
||||
6. **Extraction Strategies**
|
||||
- *"How do I use `JsonCssExtractionStrategy` to extract structured JSON data?"*
|
||||
- *"What are the differences between regex-based chunking and NLP-based chunking?"*
|
||||
- *"How can I integrate `LLMExtractionStrategy` for more intelligent data extraction?"*
|
||||
|
||||
7. **Caching and Performance**
|
||||
- *"How does caching improve the performance of asynchronous crawling?"*
|
||||
- *"How do I clear or bypass the cache in `AsyncWebCrawler`?"*
|
||||
- *"What are the available `CacheMode` options and when should I use each?"*
|
||||
|
||||
8. **Batch Crawling and Concurrency**
|
||||
- *"How do I crawl multiple URLs concurrently using `arun_many`?"*
|
||||
- *"How can I limit concurrency with `semaphore_count` for resource management?"*
|
||||
|
||||
9. **Scaling Crawls**
|
||||
- *"What strategies can I use to scale asynchronous crawls across multiple machines?"*
|
||||
- *"How do I integrate job queues or distribute tasks for larger crawl projects?"*
|
||||
|
||||
10. **Screenshots and PDFs**
|
||||
- *"How do I enable screenshot or PDF capture during a crawl?"*
|
||||
- *"How can I save visual outputs for troubleshooting rendering issues?"*
|
||||
|
||||
11. **Troubleshooting**
|
||||
- *"What should I do if the browser fails to launch or times out?"*
|
||||
- *"How do I debug JavaScript code injections that don’t work as expected?"*
|
||||
- *"How can I handle partial loads or missing content due to timeouts?"*
|
||||
|
||||
12. **Best Practices**
|
||||
- *"How do I handle authentication or session management in `AsyncWebCrawler`?"*
|
||||
- *"How can I avoid getting blocked by target sites, e.g., by using proxies?"*
|
||||
- *"What error handling approaches are recommended for production crawls?"*
|
||||
- *"How can I adhere to legal and ethical guidelines when crawling?"*
|
||||
|
||||
13. **Configuration Options**
|
||||
- *"How do I customize `CrawlerRunConfig` parameters like `mean_delay` and `max_range`?"*
|
||||
- *"How can I run the crawler non-headless for debugging dynamic interactions?"*
|
||||
|
||||
14. **Integration and Reference**
|
||||
- *"Where can I find the GitHub repository or additional documentation?"*
|
||||
- *"How do I incorporate Playwright’s advanced features with `AsyncWebCrawler`?"*
|
||||
|
||||
### Topics Discussed in the File
|
||||
|
||||
- **Asynchronous Crawling and Performance**
|
||||
- **`AsyncWebCrawler` Initialization and Usage**
|
||||
- **`BrowserConfig` for Browser Choice, Headless Mode, Viewport, Proxy, and Verbosity**
|
||||
- **Running Crawlers in Docker and Containerized Environments**
|
||||
- **`AsyncPlaywrightCrawlerStrategy` and DOM Interactions**
|
||||
- **Dynamic Content Handling via JavaScript Injection**
|
||||
- **Extraction Strategies (e.g., `JsonCssExtractionStrategy`, `LLMExtractionStrategy`)**
|
||||
- **Content Chunking Approaches (Regex and NLP-based)**
|
||||
- **Caching Mechanisms and Cache Modes**
|
||||
- **Parallel Crawling with `arun_many` and Concurrency Controls**
|
||||
- **Scaling Crawls Across Multiple Workers or Containers**
|
||||
- **Screenshot and PDF Generation for Debugging**
|
||||
- **Common Troubleshooting Techniques and Error Handling**
|
||||
- **Authentication, Session Management, and Ethical Guidelines**
|
||||
- **Adjusting `CrawlerRunConfig` for Delays, Concurrency, Extraction, and JavaScript Injection**
|
||||
quick_start: Basic async crawl setup requires BrowserConfig and AsyncWebCrawler initialization | getting started, basic usage, initialization | async with AsyncWebCrawler(config=BrowserConfig(browser_type="chromium", headless=True)) as crawler: result = await crawler.arun(url)
|
||||
browser_types: AsyncWebCrawler supports multiple browser types including Chromium and Firefox | supported browsers, browser options | BrowserConfig(browser_type="chromium")
|
||||
headless_mode: Browser can run in headless mode without UI for better performance | invisible browser, no GUI | BrowserConfig(headless=True)
|
||||
viewport_settings: Configure browser viewport dimensions for proper page rendering | screen size, window size | BrowserConfig(viewport_width=1920, viewport_height=1080)
|
||||
docker_deployment: AsyncWebCrawler can run in Docker containers for scalability | containerization, deployment | FROM python:3.10-slim; RUN pip install crawl4ai playwright
|
||||
dynamic_content: Handle JavaScript-loaded content using custom JS injection | javascript handling, dynamic loading | CrawlerRunConfig(js_code=["document.querySelector('.load-more').click()"])
|
||||
extraction_strategies: Multiple strategies available for content extraction including JsonCssExtractionStrategy and LLMExtractionStrategy | content extraction, data parsing | JsonCssExtractionStrategy({"name": "Page headings", "baseSelector": "body", "fields": [{"name": "title", "selector": "h1", "type": "text"}]})
|
||||
caching_modes: Control cache behavior with different modes: ENABLED, BYPASS, DISABLED | cache control, caching options | CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
|
||||
batch_crawling: Process multiple URLs concurrently using arun_many method | parallel crawling, multiple urls | crawler.arun_many(urls, config=CrawlerRunConfig(semaphore_count=10))
|
||||
rate_limiting: Control crawl rate using mean_delay and max_range parameters | throttling, delay control | CrawlerRunConfig(mean_delay=1.0, max_range=0.5)
|
||||
visual_capture: Generate screenshots and PDFs of crawled pages | page capture, visual output | CrawlerRunConfig(screenshot=True, pdf=True)
|
||||
error_handling: Common issues include browser launch failures, timeouts, and JS execution problems | troubleshooting, debugging | try/except blocks with crawler.logger
|
||||
authentication: Handle login requirements through js_code or Playwright selectors | login handling, sessions | CrawlerRunConfig with login steps via js_code
|
||||
proxy_configuration: Configure proxy settings to bypass IP restrictions | proxy setup, IP rotation | BrowserConfig(proxy="http://proxy-server:port")
|
||||
chunking_strategies: Split content using regex or NLP-based chunking | content splitting, text processing | CrawlerRunConfig(chunking_strategy=RegexChunking())
|
||||
@@ -1,551 +0,0 @@
|
||||
## 4. Creating Browser Instances, Contexts, and Pages
|
||||
|
||||
### Introduction
|
||||
|
||||
#### Overview of Browser Management in Crawl4AI
|
||||
Crawl4AI's browser management system is designed to provide developers with advanced tools for handling complex web crawling tasks. By managing browser instances, contexts, and pages, Crawl4AI ensures optimal performance, identity preservation, and session persistence for high-volume, dynamic web crawling.
|
||||
|
||||
#### Key Objectives
|
||||
- **Identity Preservation**:
|
||||
- Implements stealth techniques to maintain authentic digital identity
|
||||
- Simulates human-like behavior, such as mouse movements, scrolling, and key presses
|
||||
- Supports integration with third-party services to bypass CAPTCHA challenges
|
||||
- **Persistent Sessions**:
|
||||
- Retains session data (cookies, local storage) for workflows requiring user authentication
|
||||
- Allows seamless continuation of tasks across multiple runs without re-authentication
|
||||
- **Scalable Crawling**:
|
||||
- Optimized resource utilization for handling thousands of URLs concurrently
|
||||
- Flexible configuration options to tailor crawling behavior to specific requirements
|
||||
|
||||
---
|
||||
|
||||
### Browser Creation Methods
|
||||
|
||||
#### Standard Browser Creation
|
||||
Standard browser creation initializes a browser instance with default or minimal configurations. It is suitable for tasks that do not require session persistence or heavy customization.
|
||||
|
||||
##### Features and Limitations
|
||||
- **Features**:
|
||||
- Quick and straightforward setup for small-scale tasks
|
||||
- Supports headless and headful modes
|
||||
- **Limitations**:
|
||||
- Lacks advanced customization options like session reuse
|
||||
- May struggle with sites employing strict identity verification
|
||||
|
||||
##### Example Usage
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_config = BrowserConfig(browser_type="chromium", headless=True)

async def main():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://crawl4ai.com")
        print(result.markdown)

asyncio.run(main())
```
|
||||
|
||||
#### Persistent Contexts
|
||||
Persistent contexts create browser sessions with stored data, enabling workflows that require maintaining login states or other session-specific information.
|
||||
|
||||
##### Benefits of Using `user_data_dir`
|
||||
- **Session Persistence**:
|
||||
- Stores cookies, local storage, and cache between crawling sessions
|
||||
- Reduces overhead for repetitive logins or multi-step workflows
|
||||
- **Enhanced Performance**:
|
||||
- Leverages pre-loaded resources for faster page loading
|
||||
- **Flexibility**:
|
||||
- Adapts to complex workflows requiring user-specific configurations
|
||||
|
||||
##### Example: Setting Up Persistent Contexts
|
||||
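```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

# Persist cookies, local storage, and cache in the given profile directory
config = BrowserConfig(use_persistent_context=True, user_data_dir="/path/to/user/data")

async def main():
    async with AsyncWebCrawler(config=config) as crawler:
        result = await crawler.arun("https://crawl4ai.com")
        print(result.markdown)

asyncio.run(main())
```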
|
||||
|
||||
#### Managed Browser
|
||||
The `ManagedBrowser` class offers a high-level abstraction for managing browser instances, emphasizing resource management, debugging capabilities, and identity preservation measures.
|
||||
|
||||
##### How It Works
|
||||
- **Browser Process Management**:
|
||||
- Automates initialization and cleanup of browser processes
|
||||
- Optimizes resource usage by pooling and reusing browser instances
|
||||
- **Debugging Support**:
|
||||
- Integrates with debugging tools like Chrome Developer Tools for real-time inspection
|
||||
- **Identity Preservation**:
|
||||
- Implements stealth plugins to maintain authentic user identity
|
||||
- Preserves browser fingerprints and session data
|
||||
|
||||
##### Features
|
||||
- **Customizable Configurations**:
|
||||
- Supports advanced options such as viewport resizing, proxy settings, and header manipulation
|
||||
- **Debugging and Logging**:
|
||||
- Logs detailed browser interactions for debugging and performance analysis
|
||||
- **Scalability**:
|
||||
- Handles multiple browser instances concurrently, scaling dynamically based on workload
|
||||
|
||||
##### Example: Using `ManagedBrowser`
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

# Managed browser with a remote debugging port for DevTools inspection
config = BrowserConfig(headless=False, use_managed_browser=True, debugging_port=9222)

async def main():
    async with AsyncWebCrawler(config=config) as crawler:
        result = await crawler.arun("https://crawl4ai.com")
        print(result.markdown)

asyncio.run(main())
```
|
||||
|
||||
---
|
||||
|
||||
### Context and Page Management
|
||||
|
||||
#### Creating and Configuring Browser Contexts
|
||||
Browser contexts act as isolated environments within a single browser instance, enabling independent browsing sessions with their own cookies, cache, and storage.
|
||||
|
||||
##### Customizations
|
||||
- **Headers and Cookies**:
|
||||
- Define custom headers to mimic specific devices or browsers
|
||||
- Set cookies for authenticated sessions
|
||||
- **Session Reuse**:
|
||||
- Retain and reuse session data across multiple requests
|
||||
- Example: Preserve login states for authenticated crawls
|
||||
|
||||
##### Example: Context Initialization
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

# Headers are a browser-level setting, so they go in BrowserConfig
browser_config = BrowserConfig(headers={"User-Agent": "Crawl4AI/1.0"})

async def main():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://crawl4ai.com")
        print(result.markdown)

asyncio.run(main())
```
|
||||
|
||||
#### Creating Pages
|
||||
Pages represent individual tabs or views within a browser context. They are responsible for rendering content, executing JavaScript, and handling user interactions.
|
||||
|
||||
##### Key Features
|
||||
- **IFrame Handling**:
|
||||
- Extract content from embedded iframes
|
||||
- Navigate and interact with nested content
|
||||
- **Viewport Customization**:
|
||||
- Adjust viewport size to match target device dimensions
|
||||
- **Lazy Loading**:
|
||||
- Ensure dynamic elements are fully loaded before extraction
|
||||
|
||||
##### Example: Page Initialization
|
||||
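```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

# Viewport dimensions are set on BrowserConfig
browser_config = BrowserConfig(viewport_width=1920, viewport_height=1080)

async def main():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://crawl4ai.com")
        print(result.markdown)

asyncio.run(main())
```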
|
||||
|
||||
---
|
||||
|
||||
# Preserve Your Identity with Crawl4AI
|
||||
|
||||
Crawl4AI empowers you to navigate and interact with the web using your authentic digital identity, ensuring that you are recognized as a human and not mistaken for a bot. This section introduces Managed Browsers, the recommended approach for preserving your rights to access the web, and Magic Mode, a simplified solution for specific scenarios.
|
||||
|
||||
## Managed Browsers: Your Digital Identity Solution
|
||||
|
||||
**Managed Browsers** enable developers to create and use persistent browser profiles. These profiles store local storage, cookies, and other session-related data, allowing you to interact with websites as a recognized user. By leveraging your unique identity, Managed Browsers ensure that your experience reflects your rights as a human browsing the web.
|
||||
|
||||
### Why Use Managed Browsers?
|
||||
1. **Authentic Browsing Experience**: Managed Browsers retain session data and browser fingerprints, mirroring genuine user behavior.
|
||||
2. **Effortless Configuration**: Once you interact with the site using the browser (e.g., solving a CAPTCHA), the session data is saved and reused, providing seamless access.
|
||||
3. **Empowered Data Access**: By using your identity, Managed Browsers empower users to access data they can view on their own screens without artificial restrictions.
|
||||
|
||||
|
||||
### Steps to Use Identity-Based Browsing
|
||||
|
||||
1. **Launch Chrome with a Custom Profile Directory**

   - **Windows**:
     ```batch
     "C:\Program Files\Google\Chrome\Application\chrome.exe" --user-data-dir="C:\ChromeProfiles\CrawlProfile"
     ```

   - **macOS**:
     ```bash
     "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" --user-data-dir="/Users/username/ChromeProfiles/CrawlProfile"
     ```

   - **Linux**:
     ```bash
     google-chrome --user-data-dir="/home/username/ChromeProfiles/CrawlProfile"
     ```

2. **Set Up Your Identity**:
   - In the new Chrome window, log into your accounts (Google, social media, etc.)
   - Complete any necessary CAPTCHA challenges
   - Accept cookies and configure site preferences
   - The profile directory will save all settings, cookies, and login states

3. **Use the Profile in Crawl4AI**:
   ```python
   from crawl4ai import AsyncWebCrawler, BrowserConfig

   browser_config = BrowserConfig(
       headless=True,
       use_managed_browser=True,
       user_data_dir="/path/to/ChromeProfiles/CrawlProfile"  # Use the same directory from step 1
   )

   async with AsyncWebCrawler(config=browser_config) as crawler:
       result = await crawler.arun("https://example.com")
   ```
|
||||
|
||||
This approach provides several advantages:
|
||||
- Complete manual control over profile setup
|
||||
- Persistent logins across multiple sites
|
||||
- Pre-solved CAPTCHAs and saved preferences
|
||||
- Real browser history and cookies for authentic browsing patterns
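
Step 1 can also be scripted. The following is a minimal sketch, assuming the default Chrome install locations shown in the commands above; `chrome_launch_command` and `CHROME_PATHS` are hypothetical names for illustration and are not part of Crawl4AI. The helper only builds the argument list and does not start Chrome itself.

```python
import platform

# Default Chrome executables per OS, taken from the commands in step 1.
# Adjust these if Chrome is installed elsewhere on your machine.
CHROME_PATHS = {
    "Windows": r"C:\Program Files\Google\Chrome\Application\chrome.exe",
    "Darwin": "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
    "Linux": "google-chrome",
}


def chrome_launch_command(profile_dir, system=None):
    """Return the argv list that starts Chrome with a custom profile directory."""
    system = system or platform.system()
    if system not in CHROME_PATHS:
        raise ValueError(f"Unsupported platform: {system}")
    return [CHROME_PATHS[system], f"--user-data-dir={profile_dir}"]


if __name__ == "__main__":
    print(chrome_launch_command("/home/username/ChromeProfiles/CrawlProfile", system="Linux"))
```

Passing the returned list to `subprocess.Popen` would open the browser window for the manual login in step 2; the same `profile_dir` then goes into `user_data_dir` in step 3.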
|
||||
|
||||
### Example: Extracting Data Using Managed Browsers
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # Define schema for structured data extraction
    schema = {
        "name": "Example Data",
        "baseSelector": "div.example",
        "fields": [
            {"name": "title", "selector": "h1", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }

    # Configure crawler
    browser_config = BrowserConfig(
        headless=True,  # Automate subsequent runs
        verbose=True,
        use_managed_browser=True,
        user_data_dir="/path/to/user_profile_data"
    )

    crawl_config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(schema),
        wait_for="css:div.example"  # Wait for the targeted element to load
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=crawl_config
        )

        if result.success:
            print("Extracted Data:", result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
```
|
||||
|
||||
## Benefits of Managed Browsers Over Other Methods
|
||||
Managed Browsers eliminate the need for manual detection workarounds by enabling developers to work directly with their identity and user profile data. This approach ensures maximum compatibility with websites and simplifies the crawling process while preserving your right to access data freely.
|
||||
|
||||
## Magic Mode: Simplified Automation
|
||||
|
||||
While Managed Browsers are the preferred approach, **Magic Mode** provides an alternative for scenarios where persistent user profiles are unnecessary or infeasible. Magic Mode automates user-like behavior and simplifies configuration.
|
||||
|
||||
### What Magic Mode Does:
|
||||
- Simulates human browsing by randomizing interaction patterns and timing
|
||||
- Masks browser automation signals
|
||||
- Handles cookie popups and modals
|
||||
- Modifies navigator properties for enhanced compatibility
|
||||
|
||||
### Using Magic Mode
|
||||
|
||||
```python
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com",
        magic=True  # Enables all automation features
    )
```
|
||||
|
||||
Magic Mode is particularly useful for:
|
||||
- Quick prototyping when a Managed Browser setup is not available
|
||||
- Basic sites requiring minimal interaction or configuration
|
||||
|
||||
### Example: Combining Magic Mode with Additional Options
|
||||
|
||||
```python
from crawl4ai import AsyncWebCrawler

async def crawl_with_magic_mode(url: str):
    async with AsyncWebCrawler(headless=True) as crawler:
        result = await crawler.arun(
            url=url,
            magic=True,
            remove_overlay_elements=True,  # Remove popups/modals
            page_timeout=60000  # Increased timeout (in milliseconds) for complex pages
        )

        return result.markdown if result.success else None
```
|
||||
|
||||
## Magic Mode vs. Managed Browsers
|
||||
While Magic Mode simplifies many tasks, it cannot match the reliability and authenticity of Managed Browsers. By using your identity and persistent profiles, Managed Browsers render Magic Mode largely unnecessary. However, Magic Mode remains a viable fallback for specific situations where user identity is not a factor.
|
||||
|
||||
# Session Management
|
||||
|
||||
Session management in Crawl4AI is a powerful feature that allows you to maintain state across multiple requests, making it particularly suitable for handling complex multi-step crawling tasks. It enables you to reuse the same browser tab (or page object) across sequential actions and crawls, which is beneficial for:
|
||||
|
||||
- **Performing JavaScript actions before and after crawling**
|
||||
- **Executing multiple sequential crawls faster** without needing to reopen tabs or allocate memory repeatedly
|
||||
- **Maintaining state for complex workflows**
|
||||
|
||||
**Note:** This feature is designed for sequential workflows and is not suitable for parallel operations.
|
||||
|
||||
## Basic Session Usage
|
||||
|
||||
Use `BrowserConfig` and `CrawlerRunConfig` to maintain state with a `session_id`:
|
||||
|
||||
```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

async with AsyncWebCrawler() as crawler:
    session_id = "my_session"

    # Define configurations
    config1 = CrawlerRunConfig(url="https://example.com/page1", session_id=session_id)
    config2 = CrawlerRunConfig(url="https://example.com/page2", session_id=session_id)

    # First request
    result1 = await crawler.arun(config=config1)

    # Subsequent request using the same session
    result2 = await crawler.arun(config=config2)

    # Clean up when done
    await crawler.crawler_strategy.kill_session(session_id)
```
|
||||
|
||||
## Dynamic Content with Sessions
|
||||
|
||||
Here's an example of crawling GitHub commits across multiple pages while preserving session state:
|
||||
|
||||
```python
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.cache_context import CacheMode

async def crawl_dynamic_content():
    async with AsyncWebCrawler() as crawler:
        session_id = "github_commits_session"
        url = "https://github.com/microsoft/TypeScript/commits/main"
        all_commits = []

        # Define extraction schema
        schema = {
            "name": "Commit Extractor",
            "baseSelector": "li.Box-sc-g0xbh4-0",
            "fields": [{"name": "title", "selector": "h4.markdown-title", "type": "text"}],
        }
        extraction_strategy = JsonCssExtractionStrategy(schema)

        # JavaScript and wait configurations
        js_next_page = """document.querySelector('a[data-testid="pagination-next-button"]').click();"""
        wait_for = """() => document.querySelectorAll('li.Box-sc-g0xbh4-0').length > 0"""

        # Crawl multiple pages
        for page in range(3):
            config = CrawlerRunConfig(
                url=url,
                session_id=session_id,
                extraction_strategy=extraction_strategy,
                js_code=js_next_page if page > 0 else None,
                wait_for=wait_for if page > 0 else None,
                js_only=page > 0,
                cache_mode=CacheMode.BYPASS
            )

            result = await crawler.arun(config=config)
            if result.success:
                # extracted_content is a JSON string; parse it into a list
                commits = json.loads(result.extracted_content)
                all_commits.extend(commits)
                print(f"Page {page + 1}: Found {len(commits)} commits")

        # Clean up session
        await crawler.crawler_strategy.kill_session(session_id)
        return all_commits
```
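
The example above parses `result.extracted_content` with `json.loads` because it is a JSON string. As a rough sketch of a post-processing step, the hypothetical helper `merge_pages` below combines several such strings and drops repeated titles. The sample payloads are made up, and the sketch assumes each item is an object with a `title` field, as in the schema.

```python
import json


def merge_pages(pages):
    """Merge JSON-encoded lists of commits, keeping the first occurrence of each title."""
    seen = set()
    merged = []
    for raw in pages:
        for item in json.loads(raw):
            title = item.get("title", "").strip()
            if title and title not in seen:
                seen.add(title)
                merged.append({"title": title})
    return merged


# Made-up payloads standing in for result.extracted_content from two pages
page1 = json.dumps([{"title": "Fix parser crash"}, {"title": "Update baselines"}])
page2 = json.dumps([{"title": "Update baselines"}, {"title": "Add new test"}])
print(merge_pages([page1, page2]))
```

Deduplication of this kind can help when a pagination click fails silently and the same page is extracted twice.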
|
||||
|
||||
## Session Best Practices
|
||||
|
||||
1. **Descriptive Session IDs**:
   Use meaningful names for session IDs to organize workflows:
   ```python
   session_id = "login_flow_session"
   session_id = "product_catalog_session"
   ```

2. **Resource Management**:
   Always ensure sessions are cleaned up to free resources:
   ```python
   try:
       # Your crawling code here
       pass
   finally:
       await crawler.crawler_strategy.kill_session(session_id)
   ```

3. **State Maintenance**:
   Reuse the session for subsequent actions within the same workflow:
   ```python
   # Step 1: Login
   login_config = CrawlerRunConfig(
       url="https://example.com/login",
       session_id=session_id,
       js_code="document.querySelector('form').submit();"
   )
   await crawler.arun(config=login_config)

   # Step 2: Verify login success
   dashboard_config = CrawlerRunConfig(
       url="https://example.com/dashboard",
       session_id=session_id,
       wait_for="css:.user-profile"  # Wait for authenticated content
   )
   result = await crawler.arun(config=dashboard_config)
   ```

4. **Common Use Cases for Sessions**:
   1. **Authentication Flows**: Login and interact with secured pages
   2. **Pagination Handling**: Navigate through multiple pages
   3. **Form Submissions**: Fill forms, submit, and process results
   4. **Multi-step Processes**: Complete workflows that span multiple actions
   5. **Dynamic Content Navigation**: Handle JavaScript-rendered or event-triggered content
|
||||
|
||||
# Session-Based Crawling for Dynamic Content
|
||||
|
||||
In modern web applications, content is often loaded dynamically without changing the URL. Examples include "Load More" buttons, infinite scrolling, or paginated content that updates via JavaScript. Crawl4AI provides session-based crawling capabilities to handle such scenarios effectively.
|
||||
|
||||
## Understanding Session-Based Crawling
|
||||
|
||||
Session-based crawling allows you to reuse a persistent browser session across multiple actions. This means the same browser tab (or page object) is used throughout, enabling:
|
||||
|
||||
1. **Efficient handling of dynamic content** without reloading the page
|
||||
2. **JavaScript actions before and after crawling** (e.g., clicking buttons or scrolling)
|
||||
3. **State maintenance** for authenticated sessions or multi-step workflows
|
||||
4. **Faster sequential crawling**, as it avoids reopening tabs or reallocating resources
|
||||
|
||||
**Note:** Session-based crawling is ideal for sequential operations, not parallel tasks.
|
||||
|
||||
## Basic Concepts
|
||||
|
||||
Before diving into examples, here are some key concepts:
|
||||
|
||||
- **Session ID**: A unique identifier for a browsing session. Use the same `session_id` across multiple requests to maintain state.
|
||||
- **BrowserConfig & CrawlerRunConfig**: These configuration objects control browser settings and crawling behavior.
|
||||
- **JavaScript Execution**: Use `js_code` to perform actions like clicking buttons.
|
||||
- **CSS Selectors**: Target specific elements for interaction or data extraction.
|
||||
- **Extraction Strategy**: Define rules to extract structured data.
|
||||
- **Wait Conditions**: Specify conditions to wait for before proceeding.
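
To make the Session ID concept concrete, here is a toy model of the idea: one page object per `session_id`, reused until the session is killed. `FakePage` and `SessionRegistry` are hypothetical stand-ins written for this sketch; they are not Crawl4AI classes and say nothing about how the library stores sessions internally.

```python
class FakePage:
    """Stand-in for a browser tab; it only remembers the URLs it visited."""

    def __init__(self):
        self.history = []
        self.closed = False


class SessionRegistry:
    """Toy model: one page per session_id, reused until the session is killed."""

    def __init__(self):
        self._pages = {}

    def get_page(self, session_id):
        # Reuse the existing page for this session, or open a new one
        if session_id not in self._pages:
            self._pages[session_id] = FakePage()
        return self._pages[session_id]

    def kill_session(self, session_id):
        # Close and forget the page; unknown IDs are ignored
        page = self._pages.pop(session_id, None)
        if page is not None:
            page.closed = True


registry = SessionRegistry()
first = registry.get_page("my_session")
first.history.append("https://example.com/page1")
second = registry.get_page("my_session")  # same object, so state is preserved
second.history.append("https://example.com/page2")
registry.kill_session("my_session")
```

Because both lookups return the same object, anything the first request left behind (here, the visit history) is still there for the second, which is the property the sequential examples in this section depend on.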
|
||||
|
||||
## Advanced Technique 1: Custom Execution Hooks
|
||||
|
||||
Use custom hooks to handle complex scenarios, such as waiting for content to load dynamically:
|
||||
|
||||
```python
import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.cache_context import CacheMode

async def advanced_session_crawl_with_hooks():
    first_commit = ""

    async def on_execution_started(page):
        nonlocal first_commit
        try:
            # Poll until the first commit on the page differs from the last one seen
            while True:
                await page.wait_for_selector("li.commit-item h4")
                commit = await page.query_selector("li.commit-item h4")
                # Await the evaluation first, then strip the resulting string
                commit = (await commit.evaluate("(element) => element.textContent")).strip()
                if commit and commit != first_commit:
                    first_commit = commit
                    break
                await asyncio.sleep(0.5)
        except Exception as e:
            print(f"Warning: New content didn't appear: {e}")

    async with AsyncWebCrawler() as crawler:
        session_id = "commit_session"
        url = "https://github.com/example/repo/commits/main"
        crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)

        js_next_page = """document.querySelector('a.pagination-next').click();"""

        for page in range(3):
            config = CrawlerRunConfig(
                url=url,
                session_id=session_id,
                js_code=js_next_page if page > 0 else None,
                css_selector="li.commit-item",
                js_only=page > 0,
                cache_mode=CacheMode.BYPASS
            )

            result = await crawler.arun(config=config)
            print(f"Page {page + 1}: Found {len(result.extracted_content)} commits")

        await crawler.crawler_strategy.kill_session(session_id)
```
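
The hook above polls in a `while True` loop, which never exits if the content stays the same. One possible variation is to bound the wait with a deadline. The sketch below shows that polling pattern on its own, without a browser: `wait_for_change` is a hypothetical helper, and `read_first_commit` fakes the page read that `page.query_selector` performs in the hook.

```python
import asyncio
import time


async def wait_for_change(get_value, previous, timeout=5.0, interval=0.1):
    """Poll get_value() until it returns a non-empty value different from previous.

    Returns the new value, or raises asyncio.TimeoutError once the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while True:
        current = await get_value()
        if current and current != previous:
            return current
        if time.monotonic() >= deadline:
            raise asyncio.TimeoutError(f"value still {previous!r} after {timeout}s")
        await asyncio.sleep(interval)


async def demo():
    # Fake page state: the "first commit" changes on the third read
    reads = iter(["commit A", "commit A", "commit B"])

    async def read_first_commit():
        return next(reads)

    return await wait_for_change(read_first_commit, "commit A", timeout=1.0, interval=0.01)


print(asyncio.run(demo()))
```

Raising on timeout lets the caller decide whether to retry the click, skip the page, or stop the crawl.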
|
||||
|
||||
## Advanced Technique 2: Integrated JavaScript Execution and Waiting
|
||||
|
||||
Combine JavaScript execution and waiting logic for concise handling of dynamic content:
|
||||
|
||||
```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.cache_context import CacheMode

async def integrated_js_and_wait_crawl():
    async with AsyncWebCrawler() as crawler:
        session_id = "integrated_session"
        url = "https://github.com/example/repo/commits/main"

        # Click "next", then poll inside the page until the first commit changes
        js_next_page_and_wait = """
        (async () => {
            const getCurrentCommit = () => document.querySelector('li.commit-item h4').textContent.trim();
            const initialCommit = getCurrentCommit();
            document.querySelector('a.pagination-next').click();
            while (getCurrentCommit() === initialCommit) {
                await new Promise(resolve => setTimeout(resolve, 100));
            }
        })();
        """

        for page in range(3):
            config = CrawlerRunConfig(
                url=url,
                session_id=session_id,
                js_code=js_next_page_and_wait if page > 0 else None,
                css_selector="li.commit-item",
                js_only=page > 0,
                cache_mode=CacheMode.BYPASS
            )

            result = await crawler.arun(config=config)
            print(f"Page {page + 1}: Found {len(result.extracted_content)} commits")

        await crawler.crawler_strategy.kill_session(session_id)
```
|
||||
|
||||
## Best Practices for Session-Based Crawling
|
||||
|
||||
1. **Unique Session IDs**: Assign descriptive and unique `session_id` values
|
||||
2. **Close Sessions**: Always clean up sessions with `kill_session` after use
|
||||
3. **Error Handling**: Anticipate and handle errors gracefully
|
||||
4. **Respect Websites**: Follow terms of service and robots.txt
|
||||
5. **Delays**: Add delays to avoid overwhelming servers
|
||||
6. **Optimize JavaScript**: Keep scripts concise for better performance
|
||||
7. **Monitor Resources**: Track memory and CPU usage for long sessions
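
For the delay recommendation, one common pattern is a base pause plus a small random jitter between sequential requests. The sketch below is an illustration only: `jittered_delay`, `polite_loop`, and the `fetch` callback are hypothetical names, and the default timings are arbitrary rather than values recommended by Crawl4AI.

```python
import asyncio
import random


def jittered_delay(base=1.0, jitter=0.5, rng=random):
    """Return a delay in seconds drawn uniformly from [base, base + jitter]."""
    return base + rng.uniform(0, jitter)


async def polite_loop(urls, fetch, base=1.0, jitter=0.5):
    """Call fetch(url) for each URL in order, sleeping between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # No pause before the first request, only between requests
            await asyncio.sleep(jittered_delay(base, jitter))
        results.append(await fetch(url))
    return results
```

In a real crawl, `fetch` would wrap a call such as `crawler.arun(...)` with the shared `session_id`.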
|
||||
|
||||
## Conclusion
|
||||
|
||||
By combining browser management, identity-based crawling through Managed Browsers, and robust session management, Crawl4AI provides a comprehensive solution for modern web crawling needs. These features work together to enable:
|
||||
|
||||
1. Authentic identity preservation
|
||||
2. Efficient session management
|
||||
3. Reliable handling of dynamic content
|
||||
4. Scalable and maintainable crawling workflows
|
||||
|
||||
Remember to always follow best practices and respect website policies when implementing these features.
|
||||
@@ -1,62 +1,10 @@
|
||||
### Questions
|
||||
|
||||
1. **Browser Creation and Configuration**
|
||||
- *"How do I create a browser instance with `BrowserConfig` for asynchronous crawling?"*
|
||||
- *"What is the difference between standard browser creation and using persistent contexts?"*
|
||||
- *"How do I configure headless mode and viewport dimensions?"*
|
||||
|
||||
2. **Persistent Sessions and `user_data_dir`**
|
||||
- *"How do persistent contexts work with `user_data_dir` to maintain session data?"*
|
||||
- *"How can I reuse cookies and local storage to avoid repetitive logins?"*
|
||||
|
||||
3. **Managed Browser**
|
||||
- *"What benefits does `ManagedBrowser` provide over a standard browser instance?"*
|
||||
- *"How do I enable identity preservation and stealth techniques using `ManagedBrowser`?"*
|
||||
- *"How can I integrate debugging tools like Chrome Developer Tools with `ManagedBrowser`?"*
|
||||
|
||||
4. **Identity Preservation**
|
||||
- *"How can I simulate human-like behavior (mouse movements, scrolling) to preserve identity?"*
|
||||
- *"What techniques does `crawl4ai` use to bypass CAPTCHA challenges and maintain authenticity?"*
|
||||
- *"How do I use real user profiles to solve CAPTCHAs and save session data?"*
|
||||
|
||||
5. **Session Management**
|
||||
- *"How can I maintain state across multiple crawls using `session_id`?"*
|
||||
- *"What are best practices for using sessions to handle multi-step login flows?"*
|
||||
- *"How do I reuse sessions for authenticated workflows and reduce overhead?"*
|
||||
|
||||
6. **Dynamic Content Handling**
|
||||
- *"How can I inject JavaScript or wait conditions to ensure dynamic elements load before extraction?"*
|
||||
- *"What strategies can I use to navigate infinite scrolling or ‘Load More’ buttons?"*
|
||||
- *"How do I integrate JS code execution and waiting to handle modern SPA (Single Page Application) layouts?"*
|
||||
|
||||
7. **Scaling and Performance**
|
||||
- *"How do I scale crawls to handle thousands of URLs concurrently?"*
|
||||
- *"What options exist for caching and resource utilization optimization?"*
|
||||
- *"How do I handle multiple browser instances efficiently for high-volume crawling?"*
|
||||
|
||||
8. **Extraction Strategies**
|
||||
- *"How can I use `JsonCssExtractionStrategy` to extract structured data?"*
|
||||
- *"What methods are available to chunk or filter extracted content?"*
|
||||
|
||||
9. **Magic Mode vs. Managed Browsers**
|
||||
- *"What is Magic Mode and when should I use it over Managed Browsers?"*
|
||||
- *"Does Magic Mode help with basic sites, and how do I enable it?"*
|
||||
|
||||
10. **Troubleshooting and Best Practices**
|
||||
- *"How can I debug browser automation issues with logs and headful mode?"*
|
||||
- *"What best practices should I follow to respect website policies?"*
|
||||
- *"How do I handle authentication flows, form submissions, and CAPTCHA challenges effectively?"*
|
||||
|
||||
### Topics Discussed in the File
|
||||
|
||||
- **Browser Instance Creation** (Standard vs. Persistent Contexts)
|
||||
- **`BrowserConfig` Customization** (headless mode, viewport, proxies, debugging)
|
||||
- **Managed Browser for Resource Management and Debugging**
|
||||
- **Identity Preservation Techniques** (Stealth, Human-like Behavior, Bypass CAPTCHAs)
|
||||
- **Persistent Sessions and `user_data_dir`** (Session Reuse, Authentication Flows)
|
||||
- **Crawling Modern Web Apps** (Dynamic Content, JS Injection, Infinite Scrolling)
|
||||
- **Session Management with `session_id`** (Maintaining State, Multi-Step Flows)
|
||||
- **Magic Mode** (Automation of User-Like Behavior, Simple Setup)
|
||||
- **Extraction Strategies** (`JsonCssExtractionStrategy`, Handling Structured Data)
|
||||
- **Scaling and Performance Optimization** (Multiple URLs, Concurrency, Reusing Sessions)
|
||||
- **Best Practices and Troubleshooting** (Respecting Policies, Debugging Tools, Handling Errors)
|
||||
browser_creation: Create standard browser instance with default configurations | browser initialization, basic setup, minimal config | AsyncWebCrawler(config=BrowserConfig(browser_type="chromium", headless=True))
|
||||
persistent_context: Use persistent browser contexts to maintain session data and cookies | user_data_dir, session storage, login state | BrowserConfig(user_data_dir="/path/to/user/data")
|
||||
managed_browser: High-level browser management with resource optimization and debugging | browser process, stealth mode, debugging tools | BrowserConfig(headless=False, debug_port=9222)
|
||||
context_config: Configure browser context with custom headers and cookies | headers customization, session reuse | CrawlerRunConfig(headers={"User-Agent": "Crawl4AI/1.0"})
|
||||
page_creation: Create and customize browser pages with viewport settings | viewport size, iframe handling, lazy loading | CrawlerRunConfig(viewport_width=1920, viewport_height=1080)
|
||||
identity_preservation: Maintain authentic digital identity using Managed Browsers | user profiles, CAPTCHA bypass, persistent login | BrowserConfig(use_managed_browser=True, user_data_dir="/path/to/profile")
|
||||
magic_mode: Enable automated user-like behavior and detection bypass | automation masking, cookie handling | crawler.arun(url="example.com", magic=True)
|
||||
session_management: Maintain state across multiple requests using session IDs | session reuse, sequential crawling | CrawlerRunConfig(session_id="my_session")
|
||||
dynamic_content: Handle JavaScript-rendered content with custom execution hooks | content loading, pagination | js_code="document.querySelector('a.pagination-next').click()"
|
||||
best_practices: Follow recommended patterns for efficient crawling | resource management, error handling | crawler.crawler_strategy.kill_session(session_id)
|
||||
@@ -1,152 +0,0 @@
|
||||
# Creating Browser Instances, Contexts, and Pages (Condensed LLM Reference)
|
||||
|
||||
> Minimal code-focused reference retaining all outline sections.
|
||||
|
||||
## Introduction
|
||||
- Manage browsers for crawling with identity preservation, sessions, scaling.
|
||||
- Maintain cookies, local storage, human-like actions.
|
||||
|
||||
### Key Objectives
|
||||
- **Identity Preservation**: Stealth plugins, human-like inputs.
|
||||
- **Persistent Sessions**: Store cookies, continue tasks across runs.
|
||||
- **Scalable Crawling**: Handle large volumes efficiently.
|
||||
|
||||
---
|
||||
|
||||
## Browser Creation Methods
|
||||
|
||||
### Standard Browser Creation
|
||||
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig

cfg = BrowserConfig(browser_type="chromium", headless=True)
async with AsyncWebCrawler(config=cfg) as c:
    r = await c.arun("https://example.com")
```
|
||||
|
||||
### Persistent Contexts
|
||||
```python
cfg = BrowserConfig(user_data_dir="/path/to/data")
async with AsyncWebCrawler(config=cfg) as c:
    r = await c.arun("https://example.com")
```
|
||||
|
||||
### Managed Browser
|
||||
```python
cfg = BrowserConfig(headless=False, debug_port=9222, use_managed_browser=True)
async with AsyncWebCrawler(config=cfg) as c:
    r = await c.arun("https://example.com")
```
|
||||
|
||||
---
|
||||
|
||||
## Context and Page Management
|
||||
|
||||
### Creating and Configuring Browser Contexts
|
||||
```python
from crawl4ai import CrawlerRunConfig
conf = CrawlerRunConfig(headers={"User-Agent": "C4AI"})
async with AsyncWebCrawler() as c:
    r = await c.arun("https://example.com", config=conf)
```
|
||||
|
||||
### Creating Pages
|
||||
```python
conf = CrawlerRunConfig(viewport_width=1920, viewport_height=1080)
async with AsyncWebCrawler() as c:
    r = await c.arun("https://example.com", config=conf)
```
|
||||
|
||||
---
|
||||
|
||||
# Preserve Your Identity with Crawl4AI
|
||||
|
||||
Use Managed Browsers for authentic identity:
|
||||
|
||||
## Managed Browsers: Your Digital Identity Solution
|
||||
- Store sessions, cookies, user profiles.
|
||||
- Reuse solved CAPTCHAs and login states.
|
||||
|
||||
### Steps to Use Identity-Based Browsing
|
||||
```bash
# Launch Chrome with user-data-dir
google-chrome --user-data-dir="/path/to/Profile"
# Then login manually, solve CAPTCHAs, etc.
```
|
||||
|
||||
```python
cfg = BrowserConfig(
    headless=True,
    use_managed_browser=True,
    user_data_dir="/path/to/Profile"
)
async with AsyncWebCrawler(config=cfg) as c:
    r = await c.arun("https://example.com")
```
|
||||
|
||||
### Example: Extracting Data Using Managed Browsers
|
||||
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {...}
cfg = BrowserConfig(
    headless=True, use_managed_browser=True,
    user_data_dir="/path/to/data"
)
crawl_cfg = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))

async with AsyncWebCrawler(config=cfg) as c:
    r = await c.arun("https://example.com", config=crawl_cfg)
```
|
||||
|
||||
## Magic Mode: Simplified Automation
|
||||
```python
async with AsyncWebCrawler() as c:
    r = await c.arun("https://example.com", magic=True)
```
|
||||
|
||||
---
|
||||
|
||||
# Session Management
|
||||
|
||||
Use `session_id` to maintain state across requests:
|
||||
|
||||
```python
from crawl4ai.async_configs import CrawlerRunConfig

async with AsyncWebCrawler() as c:
    sid = "my_session"
    conf1 = CrawlerRunConfig(url="https://example.com/page1", session_id=sid)
    conf2 = CrawlerRunConfig(url="https://example.com/page2", session_id=sid)
    r1 = await c.arun(config=conf1)
    r2 = await c.arun(config=conf2)
    await c.crawler_strategy.kill_session(sid)
```
|
||||
|
||||
---
|
||||
|
||||
# Session-Based Crawling for Dynamic Content
|
||||
|
||||
- Reuse the same session for multi-step actions, JS execution.
|
||||
- Ideal for pagination, JS-driven content.
|
||||
|
||||
## Basic Concepts
|
||||
- `session_id`: Keep the same ID for related crawls.
|
||||
- `js_code`, `wait_for`: Run JS, wait for elements.
|
||||
|
||||
## Advanced Techniques
|
||||
- Execute JS for dynamic content loading.
|
||||
- Wait loops or hooks to handle new elements.
|
||||
|
||||
---
|
||||
|
||||
## Conclusion
|
||||
|
||||
- Combine managed browsers, sessions, and configs for scalable, identity-preserved crawling.
|
||||
- Adjust headers, cookies, viewports.
|
||||
- Magic mode for quick attempts; Managed Browsers for robust identity.
|
||||
- Use sessions for multi-step, dynamic workflows.
|
||||
|
||||
## Optional
|
||||
- [async_crawler_strategy.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_crawler_strategy.py)
|
||||
@@ -1,365 +0,0 @@
|
||||
# 5. Markdown Generation (MEGA Extended Documentation)
|
||||
|
||||
## 5.1 Introduction
|
||||
|
||||
In modern AI workflows—especially those involving Large Language Models (LLMs)—it’s essential to provide clean, structured, and meaningful textual data. **Crawl4AI** assists with this by extracting web content and converting it into Markdown that is easy to process, fine-tune on, or use for retrieval-augmented generation (RAG).
|
||||
|
||||
**What Makes Markdown Outputs Valuable for AI?**
|
||||
- **Human-Readable & Machine-Friendly:** Markdown is a simple, text-based format easily parsed by humans and machines alike.
|
||||
- **Rich Structure:** Headings, lists, code blocks, and links are preserved and well-organized.
|
||||
- **Enhanced Relevance:** Content filtering ensures you focus on the main content while discarding noise, making the data cleaner for LLM training or search.
|
||||
|
||||
### Quick Start Example
|
||||
|
||||
Here’s a minimal snippet to get started:
|
||||
|
||||
```python
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai import CrawlerRunConfig, AsyncWebCrawler

config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator()
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://example.com", config=config)
    print(result.markdown_v2.raw_markdown)
```
|
||||
|
||||
*Within a few lines of code, you can fetch a webpage, run it through the Markdown generator, and get a clean, AI-friendly output.*
|
||||
|
||||
---
|
||||
|
||||
## 5.2 Markdown Generation
|
||||
|
||||
The Markdown generation process transforms raw HTML into a structured format. At its core is the `DefaultMarkdownGenerator` class, which uses configurable parameters and optional filters. Let’s explore its functionality in depth.
|
||||
|
||||
### Internal Workings
|
||||
|
||||
1. **HTML to Markdown Conversion:**
   The generator relies on an HTML-to-text conversion process that respects various formatting options. It preserves headings, code blocks, and references while removing extraneous tags like scripts and styles.

2. **Link Citation Handling:**
   By default, the generator can convert links into citation-style references at the bottom of the document. This feature is particularly useful when you need a clean, reference-rich dataset for an LLM.

3. **Optional Content Filters:**
   You can provide a content filter (like BM25 or Pruning) to generate a “fit_markdown” output that contains only the most relevant or least noisy parts of the page.
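
To illustrate what citation-style link handling means, the sketch below rewrites inline Markdown links as numbered markers and appends a reference list. This is not Crawl4AI's implementation: `links_to_citations`, the regular expression, and the exact marker and reference formats are assumptions chosen for the example.

```python
import re

# Inline links like [text](url); the lookbehind skips images such as ![alt](src)
LINK_PATTERN = re.compile(r"(?<!!)\[([^\]]+)\]\(([^)\s]+)\)")


def links_to_citations(markdown):
    """Replace inline links with numbered markers and append a reference list."""
    references = []  # URLs in order of first appearance

    def replace(match):
        text, url = match.group(1), match.group(2)
        if url not in references:
            references.append(url)
        return f"{text}[{references.index(url) + 1}]"

    body = LINK_PATTERN.sub(replace, markdown)
    if not references:
        return body
    lines = [f"[{i}]: {url}" for i, url in enumerate(references, start=1)]
    return body + "\n\n## References\n\n" + "\n".join(lines)


sample = "See [the docs](https://example.com/docs) and [home](https://example.com)."
print(links_to_citations(sample))
```

Moving URLs out of the running text keeps sentences short for an LLM while the reference list preserves where each link pointed.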
|
||||
|
||||
### Key Parameters
|
||||
|
||||
- **`base_url` (string):**
  A base URL used to resolve relative links in the content.

- **`html2text_config` (dict):**
  Controls how HTML is converted to Markdown. If none is provided, default settings ensure a reasonable output. You can customize a wide array of options. These options mirror standard `html2text` configurations with custom enhancements.

  **Important Options:**
  - `ignore_links` (bool): If `True`, removes all hyperlinks in the output Markdown. Default: `False`
  - `ignore_images` (bool): If `True`, removes all images. Default: `False`
  - `escape_html` (bool): If `True`, escapes raw HTML entities. Default: `True`
  - `body_width` (int): Sets the text wrapping width. Default: unlimited (0 means no wrapping)

  **Advanced html2text-related Options from Source:**
  - `inside_pre`/`inside_code` (internal flags): Track whether we are inside `<pre>` or `<code>` blocks.
  - `preserve_tags` (set): A set of tags to preserve. If not empty, content within these tags is kept verbatim.
  - `current_preserved_tag`/`preserve_depth`: Internally manage nesting levels of preserved tags.
  - `handle_code_in_pre` (bool): If `True`, treats code within `<pre>` blocks distinctly, possibly formatting them as code blocks in Markdown.
  - `skip_internal_links` (bool): If `True`, internal links (like `#section`) are skipped.
  - `single_line_break` (bool): If `True`, uses single line breaks instead of double line breaks.
  - `mark_code` (bool): If `True`, adds special markers around code text.
  - `include_sup_sub` (bool): If `True`, tries to include `<sup>` and `<sub>` text in a readable way.
  - `ignore_mailto_links` (bool): If `True`, ignores `mailto:` links.
  - `escape_backslash`, `escape_dot`, `escape_plus`, `escape_dash`, `escape_snob`: Special escaping options to handle characters that might conflict with Markdown syntax.
|
||||
|
||||
**Example Custom `html2text_config`:**
|
||||
|
||||
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# html2text settings are passed through the generator's `options` dict
config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        options={
            "ignore_links": True,
            "escape_html": False,
            "body_width": 80,
            "skip_internal_links": True,
            "mark_code": True,
            "include_sup_sub": True,
        }
    )
)

async def main():
    # `async with` must run inside a coroutine
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/docs", config=config)
        print(result.markdown_v2.raw_markdown)

asyncio.run(main())
```
|
||||
|
||||
In this example, we ignore all hyperlinks, do not escape HTML entities, wrap text at 80 characters wide, skip internal links, mark code regions, and include superscript/subscript formatting.
|
||||
|
||||
### Using Content Filters in Markdown Generation
|
||||
|
||||
- **`content_filter` (object):**
|
||||
An optional filter (like `BM25ContentFilter` or `PruningContentFilter`) that refines the content before Markdown generation. When applied:
|
||||
- `fit_markdown` is generated: a filtered version of the page focusing on main content.
|
||||
- `fit_html` is also available: the filtered HTML that was used to generate `fit_markdown`.
|
||||
|
||||
### Example Usage
|
||||
|
||||
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        # The filter produces fit_markdown / fit_html alongside the raw output
        content_filter=BM25ContentFilter(
            user_query="machine learning",
            bm25_threshold=1.5,
            use_stemming=True,
        ),
        options={"ignore_links": True, "escape_html": False},
    )
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://crawl4ai.com/ai-research", config=config)
        print(result.markdown_v2.fit_markdown)  # Filtered Markdown focusing on machine learning

asyncio.run(main())
```
|
||||
|
||||
### Troubleshooting Markdown Generation
|
||||
|
||||
- **Empty Markdown Output?**
|
||||
Check if the crawler successfully fetched HTML. Ensure your filters are not overly strict. If no filter is used and you still get no output, verify the HTML content isn’t empty or malformed.
|
||||
|
||||
- **Malformed HTML Content?**
|
||||
The internal parser is robust, but if you encounter strange characters, consider setting `escape_html` to `True` or removing problematic tags using filters.
|
||||
|
||||
- **Performance Considerations:**
|
||||
Complex filters or very large HTML documents can slow down processing. Consider caching results, and leave `body_width` at `0` if you do not need line wrapping.
|
||||
|
||||
---
|
||||
|
||||
### 5.2.1 MarkdownGenerationResult
|
||||
|
||||
After running the crawler, `result.markdown_v2` returns a `MarkdownGenerationResult` object.
|
||||
|
||||
**Attributes:**
|
||||
- `raw_markdown` (str): Unfiltered Markdown.
|
||||
- `markdown_with_citations` (str): Markdown with all links converted into references at the end.
|
||||
- `references_markdown` (str): A list of extracted references.
|
||||
- `fit_markdown` (Optional[str]): Markdown after applying filters.
|
||||
- `fit_html` (Optional[str]): Filtered HTML corresponding to `fit_markdown`.
|
||||
|
||||
**Integration Example:**
|
||||
|
||||
```python
|
||||
result = await crawler.arun("https://crawl4ai.com")
|
||||
print("RAW:", result.markdown_v2.raw_markdown)
|
||||
print("CITED:", result.markdown_v2.markdown_with_citations)
|
||||
print("FIT:", result.markdown_v2.fit_markdown)
|
||||
```
|
||||
|
||||
**Use Cases:**
|
||||
- **RAG Pipelines:** Feed `fit_markdown` into a vector database for semantic search.
|
||||
- **LLM Fine-Tuning:** Use `raw_markdown` or `fit_markdown` as training data for large models.
|
||||
|
||||
---
|
||||
|
||||
## 5.3 Filtering Strategies
|
||||
|
||||
Filters refine raw HTML to produce cleaner Markdown. They can remove boilerplate sections (headers, footers) or focus on content relevant to a specific query.
|
||||
|
||||
**Two Major Strategies:**
|
||||
1. **BM25ContentFilter:**
|
||||
A relevance-based approach using BM25 scoring to rank content sections according to a user query.
|
||||
|
||||
2. **PruningContentFilter (Emphasized):**
|
||||
An unsupervised, clustering-like approach that systematically prunes irrelevant or noisy parts of the HTML. Unlike BM25, which relies on a query for relevance, `PruningContentFilter` attempts to cluster and discard noise based on structural and heuristic metrics. This makes it highly useful for general cleanup without predefined queries.
|
||||
|
||||
---
|
||||
|
||||
### Relevance-Based Filtering: BM25
|
||||
|
||||
BM25 ranks content blocks by relevance to a given query. It’s semi-supervised in the sense that it needs a query (`user_query`).
|
||||
|
||||
**Key Parameters:**
|
||||
- `user_query` (string): The query for content relevance.
|
||||
- `bm25_threshold` (float): The minimum relevance score a block must reach to be kept. Raise it to get fewer, more focused blocks.
|
||||
- `use_stemming` (bool): When `True`, matches variations of words.
|
||||
- `case_sensitive` (bool): Controls case sensitivity.
|
||||
|
||||
**If `user_query` is omitted,** BM25 still scores content but has no specific target. This is useful if you only need general scoring.
|
||||
|
||||
**Example:**
|
||||
```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter

config = CrawlerRunConfig(
    content_filter=BM25ContentFilter(
        user_query="artificial intelligence",
        bm25_threshold=2.0,
        use_stemming=True,
    )
)
```
|
||||
|
||||
**Troubleshooting BM25:**
|
||||
- If you get too much irrelevant content, raise `bm25_threshold`.
|
||||
- If you get too little content, lower it or disable `case_sensitive`.
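
To make the effect of `bm25_threshold` concrete, here is a minimal, dependency-free sketch of Okapi BM25 scoring over text blocks. It is an illustration under simplifying assumptions (whitespace tokenization, no stemming, standard `k1` and `b` defaults), not the code used by `BM25ContentFilter`; the helper names `bm25_scores` and `filter_blocks` are hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query, blocks, k1=1.5, b=0.75):
    """Score each non-empty text block against the query with plain Okapi BM25."""
    docs = [block.lower().split() for block in blocks]
    avg_len = sum(len(d) for d in docs) / len(docs)
    terms = query.lower().split()
    scores = []
    for doc in docs:
        counts = Counter(doc)
        score = 0.0
        for term in terms:
            n = sum(1 for d in docs if term in d)  # number of blocks containing the term
            idf = math.log((len(docs) - n + 0.5) / (n + 0.5) + 1)
            tf = counts[term]
            # Term frequency saturates via k1; b normalizes for block length
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores

def filter_blocks(query, blocks, threshold):
    """Keep only the blocks whose score reaches the threshold."""
    return [blk for blk, s in zip(blocks, bm25_scores(query, blocks)) if s >= threshold]
```

Raising `threshold` in `filter_blocks` drops more blocks, which is the same trade-off the troubleshooting tips above describe for `bm25_threshold`.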
|
||||
|
||||
---
|
||||
|
||||
### PruningContentFilter: Unsupervised Content Clustering
|
||||
|
||||
`PruningContentFilter` is about intelligently stripping away non-essential parts of a page—ads, navigation bars, repetitive links—without relying on a specific user query. Think of it as an unsupervised clustering method that scores content blocks and removes “noise.”
|
||||
|
||||
**Key Features:**
|
||||
- **Unsupervised Nature:** No query needed. Uses heuristics like text density, link density, tag importance, and HTML structure.
|
||||
- **Clustering-Like Behavior:** It effectively “clusters” page sections by their structural and textual qualities, and prunes those that don’t meet thresholds.
|
||||
- **Threshold Adjustments:** Dynamically adjusts or uses a fixed threshold to remove or keep content blocks.
|
||||
|
||||
**Parameters:**
|
||||
- `threshold` (float): Score threshold for removing content. Higher values prune more aggressively. Default: `0.5`.
|
||||
- `threshold_type` (str): `"fixed"` or `"dynamic"`.
|
||||
- **Fixed:** Compares each block’s score directly to a set threshold.
|
||||
- **Dynamic:** Adjusts threshold based on content metrics for a more adaptive approach.
|
||||
- `min_word_threshold` (int): Minimum word count to keep a content block.
|
||||
- Internal metrics consider:
|
||||
- **Text Density:** Prefers sections rich in text over code or sparse elements.
|
||||
- **Link Density:** Penalizes sections with too many links.
|
||||
- **Tag Importance:** Some tags (e.g., `<article>`, `<main>`, `<section>`) are considered more important and less likely to be pruned.
|
||||
- **Class/ID patterns:** Looks for signals (like `nav`, `footer`) to identify boilerplate.
|
||||
|
||||
**Example:**
|
||||
```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter

config = CrawlerRunConfig(
    content_filter=PruningContentFilter(
        threshold=0.7,
        threshold_type="dynamic",
        min_word_threshold=100,
    )
)
```
|
||||
|
||||
In this example, content blocks under a dynamically adjusted threshold are pruned, and any block under 100 words is discarded, ensuring you keep only substantial textual sections.
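
The metrics listed above (text density, link density, tag importance) can be combined into a single score per block. The sketch below shows one way to do that with a fixed threshold. Everything in it is an assumption for illustration: the `Block` structure, the `TAG_WEIGHTS` values, and the equal-weight formula are hypothetical and do not reproduce the scoring logic inside `PruningContentFilter`.

```python
from dataclasses import dataclass

# Hypothetical tag weights; the real values live in the library source
TAG_WEIGHTS = {"article": 1.0, "main": 1.0, "section": 0.8, "div": 0.5, "nav": 0.1, "footer": 0.1}

@dataclass
class Block:
    tag: str
    text: str         # all visible text in the block
    link_text: str    # the part of that text that sits inside <a> tags
    html_length: int  # length of the block's HTML markup (assumed > 0)

def score(block: Block) -> float:
    text_len = len(block.text)
    if text_len == 0:
        return 0.0
    text_density = text_len / block.html_length      # text per unit of markup
    link_density = len(block.link_text) / text_len   # share of text that is links
    tag_weight = TAG_WEIGHTS.get(block.tag, 0.5)
    # Equal-weight mix, chosen only for illustration
    return (text_density + (1 - link_density) + tag_weight) / 3

def prune(blocks, threshold=0.5, min_words=0):
    """Keep blocks that reach the score threshold and the minimum word count."""
    return [b for b in blocks if score(b) >= threshold and len(b.text.split()) >= min_words]
```

With this formula a text-heavy `<article>` scores close to 1, while a `<nav>` made entirely of links scores close to 0, so a fixed threshold of `0.5` separates them.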
|
||||
|
||||
**When to Use PruningContentFilter:**
|
||||
- **General Cleanup:** If you want a broad cleanup of the page without a specific target query, pruning is your go-to.
|
||||
- **Pre-Processing Large Corpora:** Before applying more specific filters, prune to remove boilerplate, then apply BM25 for query-focused refinement.
|
||||
|
||||
**Troubleshooting Pruning Filter:**
|
||||
- **Too Much Content Gone?** Lower the `threshold` or switch from `dynamic` to `fixed` threshold for more predictable behavior.
|
||||
- **Not Enough Pruning?** Increase `threshold` to be more aggressive.
|
||||
- **Mixed Results?** Adjust `min_word_threshold` or try the `dynamic` threshold mode to fine-tune results.
|
||||
|
||||
---
|
||||
|
||||
## 5.4 Fit Markdown: Bringing It All Together
|
||||
|
||||
“Fit Markdown” is the output you get when filters are applied to the raw HTML before Markdown generation. The result is exposed as `fit_markdown`: a content-focused version of the page with most boilerplate removed.
|
||||
|
||||
### Advanced Usage Scenario
|
||||
|
||||
**Combining BM25 and Pruning:**
|
||||
1. First apply `PruningContentFilter` to remove general junk.
|
||||
2. Then apply a `BM25ContentFilter` to focus on query relevance.
|
||||
|
||||
*Example:*
|
||||
|
||||
```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter, PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# Second-stage filter, kept ready for the query-focused pass
bm25_filter = BM25ContentFilter(
    user_query="technology advancements",
    bm25_threshold=1.2,
    use_stemming=True,
)

config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.5)  # First prune
    )
)

async def main():
    async with AsyncWebCrawler() as crawler:
        # First run pruning
        result = await crawler.arun("https://crawl4ai.com", config=config)
        pruned_fit_markdown = result.markdown_v2.fit_markdown

        # Re-run the BM25 filter on the pruned output, or integrate BM25 in a pipeline
        # (In practice, you'd integrate both filters within the crawler or run a second pass.)

asyncio.run(main())
```
|
||||
|
||||
**Performance Note:**
|
||||
Fit Markdown reduces token count, making subsequent LLM operations faster and cheaper.
|
||||
|
||||
---
|
||||
|
||||
## 5.5 Best Practices
|
||||
|
||||
- **Iterative Adjustment:** Start with default parameters, then adjust filters, thresholds, and `html2text_config` based on the quality of output you need.
|
||||
- **Combining Filters:** Use `PruningContentFilter` first to remove boilerplate, then a `BM25ContentFilter` to target relevance.
|
||||
- **Check Downstream Applications:** If you’re using fit Markdown for training LLMs, inspect the output to ensure no essential references were pruned.
|
||||
- **Docker Deployment:**
|
||||
Running Crawl4AI in a Docker container ensures a consistent environment. Just include the required packages in your Dockerfile and run the crawler script inside the container.
|
||||
- **Caching Results:**
|
||||
To save time, cache the raw HTML or intermediate Markdown. If you know you’ll re-run filters or change parameters often, caching avoids redundant crawling.
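
As a rough illustration of the caching advice, the sketch below stores raw HTML on disk keyed by a hash of the URL, so that filters can be re-run without fetching the page again. The function name `cached_html`, the `fetch` callable, and the cache directory are hypothetical; this is independent of any caching Crawl4AI provides itself.

```python
import hashlib
from pathlib import Path

def cached_html(url, fetch, cache_dir=".cache/html"):
    """Return HTML for `url`, calling `fetch(url)` only on a cache miss."""
    directory = Path(cache_dir)
    directory.mkdir(parents=True, exist_ok=True)
    # A hash of the URL gives a filesystem-safe, fixed-length file name
    path = directory / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html")
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch(url)
    path.write_text(html, encoding="utf-8")
    return html
```

Here `fetch` stands in for whatever produces the HTML, for example a wrapper around your crawl call. Changing filter thresholds then only costs the re-processing time, not another network round trip.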
|
||||
|
||||
**Handling Special Cases:**
|
||||
- **Authentication-Protected Pages:**
|
||||
If you need to crawl gated content, provide appropriate session tokens or use a headless browser approach before feeding HTML to the generator.
|
||||
- **Proxies and Timeouts:**
|
||||
Configure the crawler with proxies or increased timeouts for sites that are slow or region-restricted.
|
||||
|
||||
---
|
||||
|
||||
## 5.6 Troubleshooting & FAQ
|
||||
|
||||
**Why am I getting empty Markdown?**
|
||||
- Ensure that the URL is correct and the crawler fetched content.
|
||||
- If using filters, relax your thresholds.
|
||||
|
||||
**How to handle JavaScript-heavy sites?**
|
||||
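- Make sure the page has finished rendering before Markdown generation runs, for example by waiting for a selector with `wait_for` or triggering content with `js_code`. The generator only sees the HTML the crawler captured.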
|
||||
|
||||
**How to improve formatting for code snippets?**
|
||||
- Set `"handle_code_in_pre": True` in `html2text_config` to preserve code blocks more accurately.
|
||||
|
||||
**Links are cluttering my Markdown.**
|
||||
- Set `"ignore_links": True` in `html2text_config`, or use `markdown_with_citations` to move links into a reference list at the end for a cleaner layout.
|
||||
|
||||
---
|
||||
|
||||
## 5.7 Real-World Use Cases
|
||||
|
||||
1. **Summarizing News Articles:**
|
||||
Use `PruningContentFilter` to strip ads and nav bars, then use the filtered output as clean input for a neat summary.
|
||||
|
||||
2. **Preparing Data for LLM Fine-Tuning:**
|
||||
For a large corpus, first prune all pages to remove boilerplate, then optionally apply BM25 to focus on specific topics. The resulting Markdown is ideal for training because it’s dense with meaningful content.
|
||||
|
||||
3. **RAG Pipelines:**
|
||||
Extract `fit_markdown`, store it in a vector database, and use it for retrieval-augmented generation. The references and structured headings enhance search relevance.
|
||||
|
||||
---
|
||||
|
||||
## 5.8 Appendix (References)
|
||||
|
||||
**Source Code Files:**
|
||||
- [markdown_generation_strategy.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/markdown_generation_strategy.py)
|
||||
- **Key Classes:** `MarkdownGenerationStrategy`, `DefaultMarkdownGenerator`
|
||||
- **Key Functions:** `convert_links_to_citations()`, `generate_markdown()`
|
||||
|
||||
- [content_filter_strategy.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/content_filter_strategy.py)
|
||||
- **Key Classes:** `RelevantContentFilter`, `BM25ContentFilter`, `PruningContentFilter`
|
||||
- **Metrics & Heuristics:** Examine `PruningContentFilter` code for scoring logic and threshold adjustments.
|
||||
|
||||
Exploring the source code will provide deeper insights into how tags are parsed, how scores are computed for pruning, and how BM25 relevance is calculated.
|
||||
|
||||
---
|
||||
|
||||
**In summary**, Markdown generation in Crawl4AI provides a powerful, configurable pipeline to transform raw HTML into AI-ready Markdown. By leveraging `PruningContentFilter` for general cleanup and `BM25ContentFilter` for query-focused extraction, plus fine-tuning `html2text_config`, you can achieve high-quality outputs for a wide range of AI applications.
|
||||
@@ -1,53 +1,15 @@
|
||||
### Hypothetical Questions
|
||||
|
||||
1. **Markdown Generation Basics**
|
||||
- *"How can I convert raw HTML into clean, structured Markdown using Crawl4AI?"*
|
||||
- *"What are the main benefits of generating Markdown from web content for LLM workflows?"*
|
||||
- *"How do I quickly start generating Markdown output from a given URL?"*
|
||||
|
||||
2. **Default Markdown Generator Configuration**
|
||||
- *"What parameters can I customize in `DefaultMarkdownGenerator` to control the HTML-to-Markdown conversion?"*
|
||||
- *"How do I ignore links, images, or HTML entities when converting to Markdown?"*
|
||||
- *"Can I set a custom line-wrapping width and handle code blocks in Markdown output?"*
|
||||
|
||||
3. **Content Filtering Strategies**
|
||||
- *"How can I apply filters like BM25 or pruning before Markdown generation?"*
|
||||
- *"What is `fit_markdown` and how does it differ from the raw Markdown output?"*
|
||||
- *"How do I use `BM25ContentFilter` to get content relevant to a specific user query?"*
|
||||
- *"What does `PruningContentFilter` do, and when should I use it to clean up noisy HTML?"*
|
||||
|
||||
4. **BM25 and Pruning Filters**
|
||||
- *"How does BM25 ranking improve the relevance of extracted Markdown content?"*
|
||||
- *"Which parameters should I tweak if BM25 returns too much or too little content?"*
|
||||
- *"How can I combine `PruningContentFilter` with BM25 to first remove boilerplate and then focus on relevance?"*
|
||||
|
||||
5. **Advanced html2text Configuration**
|
||||
- *"What advanced `html2text` options are available and how do I set them?"*
|
||||
- *"How can I preserve specific tags, handle code blocks, or skip internal links?"*
|
||||
- *"Can I handle superscript and subscript formatting in the Markdown output?"*
|
||||
|
||||
6. **Troubleshooting and Best Practices**
|
||||
- *"Why am I getting empty Markdown output and how can I fix it?"*
|
||||
- *"How do I handle malformed HTML or JavaScript-heavy sites?"*
|
||||
- *"What are the recommended workflows for large-scale or performance-critical Markdown generation?"*
|
||||
- *"How do I preserve references or add citation-style links in the final Markdown?"*
|
||||
|
||||
7. **Use Cases and Integration**
|
||||
- *"How can I incorporate `fit_markdown` into an LLM fine-tuning or RAG pipeline?"*
|
||||
- *"Can I run Crawl4AI’s Markdown generation inside a Docker container for consistent environments?"*
|
||||
- *"How do I cache results or reuse sessions to speed up repeated markdown generation tasks?"*
|
||||
|
||||
### Topics Discussed in the File
|
||||
|
||||
- **Markdown Generation Workflow** using `DefaultMarkdownGenerator`
|
||||
- **HTML-to-Markdown Conversion Options** (ignore links, images, escape HTML, line-wrapping, code handling)
|
||||
- **Applying Content Filters** (BM25 and Pruning) before Markdown generation
|
||||
- **fit_markdown vs. raw_markdown** for filtered, cleaner output
|
||||
- **BM25ContentFilter** for query-based content relevance
|
||||
- **PruningContentFilter** for unsupervised noise removal and cleaner pages
|
||||
- **Combining Filters** (prune first, then BM25) to refine content
|
||||
- **Advanced `html2text` Configurations** (handle code blocks, superscripts, skip internal links)
|
||||
- **Troubleshooting Tips** (empty output, malformed HTML, performance considerations)
|
||||
- **Downstream Uses**: Training LLMs, building RAG pipelines, semantic search indexing
|
||||
- **Best Practices** (iterative parameter tuning, caching, Docker deployment)
|
||||
- **Real-World Scenarios** (news summarization, large corpus pre-processing, improved RAG retrieval quality)
|
||||
markdown_generation: Converts web content into clean, structured Markdown format for AI processing | html to markdown, text conversion, content extraction | DefaultMarkdownGenerator()
|
||||
markdown_config_options: Configure HTML to Markdown conversion with html2text options like ignore_links, escape_html, body_width | markdown settings, conversion options | html2text_config={"ignore_links": True, "body_width": 80}
|
||||
content_filtering: Filter and clean web content using BM25 or Pruning strategies | content cleanup, noise removal | content_filter=BM25ContentFilter()
|
||||
bm25_filtering: Score and filter content based on relevance to a user query | relevance filtering, query matching | BM25ContentFilter(user_query="ai", bm25_threshold=1.5)
|
||||
pruning_filter: Remove boilerplate and noise using unsupervised clustering approach | content pruning, noise removal | PruningContentFilter(threshold=0.7, threshold_type="dynamic")
|
||||
markdown_result_types: Access different markdown outputs including raw, cited, and filtered versions | markdown formats, output types | result.markdown_v2.{raw_markdown, markdown_with_citations, fit_markdown}
|
||||
link_citations: Convert webpage links into citation-style references at document end | reference handling, link management | markdown_with_citations output format
|
||||
content_scoring: Evaluate content blocks based on text density, link density, and tag importance | content metrics, scoring system | PruningContentFilter metrics
|
||||
combined_filtering: Apply both pruning and BM25 filters for optimal content extraction | filter pipeline, multi-stage filtering | PruningContentFilter() followed by BM25ContentFilter()
|
||||
markdown_generation_troubleshooting: Debug empty outputs and malformed content issues | error handling, debugging | Check HTML content and filter thresholds
|
||||
performance_optimization: Cache results and adjust parameters for better processing speed | optimization, caching | Store intermediate results for reuse
|
||||
rag_pipeline_integration: Use filtered markdown for retrieval-augmented generation systems | RAG, vector storage | Store fit_markdown in vector database
|
||||
code_block_handling: Preserve and format code snippets in markdown output | code formatting, syntax | handle_code_in_pre=True option
|
||||
authentication_handling: Process content from authenticated pages using session tokens | auth support, protected content | Provide session tokens before markdown generation
|
||||
docker_deployment: Run markdown generation in containerized environment | deployment, containers | Include in Dockerfile configuration
|
||||
@@ -1,87 +0,0 @@
|
||||
```markdown
|
||||
# Chunking Strategies
|
||||
|
||||
> Break large texts into manageable chunks for relevance and retrieval workflows.
|
||||
|
||||
Enables segmentation for similarity-based retrieval and integration into RAG pipelines.
|
||||
|
||||
## Why Use Chunking?
|
||||
|
||||
- Prepare text for cosine similarity scoring
|
||||
- Integrate into RAG systems
|
||||
- Support multiple segmentation methods (regex, sentences, topics, fixed-length, sliding windows)
|
||||
|
||||
## Methods of Chunking
|
||||
|
||||
- [Regex-Based Chunking]: Splits text on patterns (e.g., `\n\n`)
|
||||
```python
import re

class RegexChunking:
    def __init__(self, patterns=None):
        # Avoid a mutable default argument; split on blank lines by default
        self.patterns = patterns or [r'\n\n']

    def chunk(self, text):
        parts = [text]
        # Apply each pattern in turn to every piece produced so far
        for p in self.patterns:
            parts = [seg for pr in parts for seg in re.split(p, pr)]
        return parts
```
|
||||
|
||||
- [Sentence-Based Chunking]: Uses NLP (e.g., `nltk.sent_tokenize`) for sentence-level chunks
|
||||
```python
from nltk.tokenize import sent_tokenize

class NlpSentenceChunking:
    def chunk(self, text):
        return sent_tokenize(text)
```
|
||||
|
||||
- [Topic-Based Segmentation]: Leverages `TextTilingTokenizer` for topic-level segments
|
||||
```python
from nltk.tokenize import TextTilingTokenizer

class TopicSegmentationChunking:
    def __init__(self):
        self.tokenizer = TextTilingTokenizer()

    def chunk(self, text):
        return self.tokenizer.tokenize(text)
```
|
||||
|
||||
- [Fixed-Length Word Chunking]: Chunks by a fixed number of words
|
||||
```python
class FixedLengthWordChunking:
    def __init__(self, chunk_size=100):
        self.chunk_size = chunk_size

    def chunk(self, text):
        w = text.split()
        return [' '.join(w[i:i+self.chunk_size]) for i in range(0, len(w), self.chunk_size)]
```
|
||||
|
||||
- [Sliding Window Chunking]: Overlapping chunks for context retention
|
||||
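```python
class SlidingWindowChunking:
    def __init__(self, window_size=100, step=50):
        self.window_size = window_size
        self.step = step

    def chunk(self, text):
        w = text.split()
        # Text shorter than one window is returned as a single chunk
        if len(w) <= self.window_size:
            return [' '.join(w)]
        starts = list(range(0, len(w) - self.window_size + 1, self.step))
        # Add a final window so trailing words are not dropped
        if starts[-1] + self.window_size < len(w):
            starts.append(len(w) - self.window_size)
        return [' '.join(w[i:i+self.window_size]) for i in starts]
```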
|
||||
|
||||
## Combining Chunking with Cosine Similarity
|
||||
|
||||
- Extract relevant chunks based on a query
|
||||
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class CosineSimilarityExtractor:
    def __init__(self, query):
        self.query = query
        self.vectorizer = TfidfVectorizer()

    def find_relevant_chunks(self, chunks):
        # Row 0 is the query; the remaining rows are the chunks
        X = self.vectorizer.fit_transform([self.query] + chunks)
        sims = cosine_similarity(X[0:1], X[1:]).flatten()
        return list(zip(chunks, sims))
```
|
||||
|
||||
## Optional
|
||||
|
||||
- [chuncking_strategies.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/chuncking_strategies.py)
|
||||
```
|
||||
@@ -1,53 +1,10 @@
|
||||
### Hypothetical Questions
|
||||
|
||||
1. **General Purpose of Chunking**
|
||||
- *"Why is chunking text important before applying cosine similarity or building RAG pipelines?"*
|
||||
- *"How does dividing large texts into smaller chunks improve retrieval accuracy and scalability?"*
|
||||
|
||||
2. **Regex-Based Chunking**
|
||||
- *"How can I split text into chunks using a custom regular expression?"*
|
||||
- *"What are typical use cases for Regex-based chunking, and when should I prefer it over other methods?"*
|
||||
|
||||
3. **Sentence-Based Chunking**
|
||||
- *"How do I break text into individual sentences using an NLP approach like `sent_tokenize`?"*
|
||||
- *"When should I prefer sentence-based chunking over regex-based or fixed-length chunking?"*
|
||||
|
||||
4. **Topic-Based Segmentation**
|
||||
- *"What is topic-based segmentation, and how does it produce thematically coherent chunks?"*
|
||||
- *"How can I integrate TextTiling or other topic segmentation algorithms into my chunking pipeline?"*
|
||||
|
||||
5. **Fixed-Length Word Chunking**
|
||||
- *"How do I evenly distribute text into fixed-size word chunks?"*
|
||||
- *"What are the benefits and drawbacks of using a fixed-length chunking strategy?"*
|
||||
|
||||
6. **Sliding Window Chunking**
|
||||
- *"What is a sliding window approach, and how does overlapping chunks improve context retention?"*
|
||||
- *"How do I choose appropriate window sizes and step values for my sliding window chunking?"*
|
||||
|
||||
7. **Cosine Similarity Integration**
|
||||
- *"How do I apply cosine similarity to identify the most relevant chunks for a given query?"*
|
||||
- *"What preprocessing steps are necessary before computing cosine similarity between a query and the generated chunks?"*
|
||||
|
||||
8. **RAG (Retrieval-Augmented Generation) Applications**
|
||||
- *"How can chunking strategies facilitate integration with Retrieval-Augmented Generation systems?"*
|
||||
- *"Which chunking method is best suited for maintaining context in RAG-based pipelines?"*
|
||||
|
||||
9. **Practical Considerations & Best Practices**
|
||||
- *"How do I choose the right chunking strategy for my specific use case (e.g., documents, transcripts, webpages)?"*
|
||||
- *"What are some best practices for combining chunking, vectorization, and similarity scoring methods?"*
|
||||
|
||||
10. **Advanced Use Cases**
|
||||
- *"Can I combine multiple chunking strategies, such as applying sentence tokenization followed by a sliding window?"*
|
||||
- *"How do I handle very large documents or corpora with chunking and similarity extraction at scale?"*
|
||||
|
||||
### Topics Discussed in the File
|
||||
|
||||
- **Purpose of Chunking Strategies**: Facilitating cosine similarity retrieval and RAG system integration.
|
||||
- **Regex-Based Chunking**: Splitting text based on patterns (e.g., paragraphs, blank lines).
|
||||
- **Sentence-Based Chunking**: Using NLP techniques to create sentence-level segments for fine-grained analysis.
|
||||
- **Topic-Based Segmentation**: Grouping text into topical units for thematic coherence.
|
||||
- **Fixed-Length Word Chunking**: Dividing text into uniform word count segments for consistent structure.
|
||||
- **Sliding Window Chunking**: Overlapping segments to preserve contextual continuity.
|
||||
- **Integrating Cosine Similarity**: Pairing chunked text with a query to retrieve the most relevant content.
|
||||
- **Applications in RAG Systems**: Enhancing retrieval workflows by organizing content into meaningful chunks.
|
||||
- **Comparison of Chunking Methods**: Trade-offs between simplicity, coherence, and context preservation.
|
||||
chunking_overview: Chunking strategies divide large texts into manageable parts for content processing and extraction | text segmentation, content division, document splitting | None
|
||||
cosine_similarity_integration: Chunking prepares text segments for semantic similarity analysis using cosine similarity | semantic search, relevance matching | from sklearn.metrics.pairwise import cosine_similarity
|
||||
rag_integration: Chunks can be integrated into RAG (Retrieval-Augmented Generation) systems for structured workflows | retrieval augmented generation, RAG pipeline | None
|
||||
regex_chunking: Split text using regular expression patterns for basic segmentation | regex splitting, pattern-based chunking | RegexChunking(patterns=[r'\n\n'])
|
||||
sentence_chunking: Divide text into individual sentences using NLP tools | sentence tokenization, NLP chunking | from nltk.tokenize import sent_tokenize
|
||||
topic_chunking: Create topic-coherent chunks using TextTiling algorithm | topic segmentation, TextTiling | from nltk.tokenize import TextTilingTokenizer
|
||||
fixed_length_chunking: Segment text into chunks with fixed word count | word-based chunking, fixed size segments | FixedLengthWordChunking(chunk_size=100)
|
||||
sliding_window_chunking: Generate overlapping chunks for better context preservation | overlapping segments, windowed chunking | SlidingWindowChunking(window_size=100, step=50)
|
||||
cosine_similarity_extraction: Extract relevant chunks using TF-IDF and cosine similarity comparison | similarity search, relevance extraction | from sklearn.feature_extraction.text import TfidfVectorizer
|
||||
chunking_workflow: Combine chunking with cosine similarity for enhanced content retrieval | content extraction, similarity workflow | CosineSimilarityExtractor(query).find_relevant_chunks(chunks)
|
||||
@@ -1,577 +0,0 @@
|
||||
# Structured Data Extraction Strategies
|
||||
|
||||
## Extraction Strategies
|
||||
Structured data extraction strategies convert raw web content into organized, JSON-formatted data. They handle diverse extraction scenarios, including schema-based, language model-driven, and clustering methods. This section covers strategies that extract data with or without an LLM.
|
||||
|
||||
### LLM Extraction Strategy
|
||||
The **LLM Extraction Strategy** employs a large language model (LLM) to process content dynamically. It supports:
|
||||
- **Schema-Based Extraction**: Using a defined JSON schema to structure output.
|
||||
- **Instruction-Based Extraction**: Accepting custom prompts to guide the extraction process.
|
||||
- **Flexible Model Usage**: Supporting open-source or paid LLMs.
|
||||
|
||||
#### Key Features
|
||||
- Accepts customizable schemas for structured outputs.
|
||||
- Incorporates user prompts for tailored results.
|
||||
- Handles large inputs with chunking and overlap for efficient processing.
|
||||
|
||||
#### Parameters and Configurations
|
||||
Below is a detailed explanation of key parameters:
|
||||
|
||||
- **`provider`** *(str)*: Specifies the LLM provider (e.g., `openai`, `ollama`).
|
||||
- Default: `DEFAULT_PROVIDER`
|
||||
|
||||
- **`api_token`** *(Optional[str])*: API token for the LLM provider.
|
||||
- Required unless using a provider that doesn’t need authentication.
|
||||
|
||||
- **`instruction`** *(Optional[str])*: A prompt guiding the model on extraction specifics.
|
||||
- Example: "Extract all prices and model names from the page."
|
||||
|
||||
- **`schema`** *(Optional[Dict])*: JSON schema defining the structure of extracted data.
|
||||
- If provided, extraction switches to schema mode.
|
||||
|
||||
- **`extraction_type`** *(str)*: Determines extraction mode (`block` or `schema`).
|
||||
- Default: `block`
|
||||
|
||||
- **Chunking Settings**:
|
||||
- **`chunk_token_threshold`** *(int)*: Maximum token count per chunk. Default: `CHUNK_TOKEN_THRESHOLD`.
|
||||
- **`overlap_rate`** *(float)*: Proportion of overlapping tokens between chunks. Default: `OVERLAP_RATE`.
|
||||
|
||||
- **`extra_args`** *(Dict)*: Additional arguments passed to the LLM API, such as `max_length`, `temperature`, etc.
|
||||
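The interaction between `chunk_token_threshold` and `overlap_rate` can be sketched in a few lines. This is a simplified illustration, not the library's implementation: the helper `chunk_tokens` is hypothetical, and it treats whitespace-separated words as tokens.

```python
def chunk_tokens(tokens, chunk_token_threshold, overlap_rate):
    """Split tokens into windows of at most chunk_token_threshold items.

    Consecutive windows share roughly overlap_rate of their tokens.
    Hypothetical helper for illustration only.
    """
    if chunk_token_threshold <= 0:
        raise ValueError("chunk_token_threshold must be positive")
    if not 0 <= overlap_rate < 1:
        raise ValueError("overlap_rate must be in [0, 1)")

    overlap = int(chunk_token_threshold * overlap_rate)
    step = chunk_token_threshold - overlap  # always >= 1 given the checks above

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_token_threshold])
        # Stop once a window has reached the end of the input
        if start + chunk_token_threshold >= len(tokens):
            break
    return chunks


words = [f"w{i}" for i in range(10)]
chunks = chunk_tokens(words, chunk_token_threshold=4, overlap_rate=0.5)
for chunk in chunks:
    print(chunk)
```

A higher overlap rate preserves more context across chunk boundaries at the cost of sending more tokens to the model.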
|
||||
#### Example Usage
|
||||
|
||||
```python
import asyncio

from pydantic import BaseModel

from crawl4ai import AsyncWebCrawler
from crawl4ai.config import CrawlerRunConfig, BrowserConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy


class OpenAIModelFee(BaseModel):
    model_name: str
    input_fee: str
    output_fee: str


async def extract_structured_data():
    browser_config = BrowserConfig(headless=True)

    # Passing a schema switches the strategy to schema-based extraction
    extraction_strategy = LLMExtractionStrategy(
        provider="openai",
        api_token="your_api_token",
        schema=OpenAIModelFee.model_json_schema(),
        instruction="Extract all model fees from the content.",
    )

    crawler_config = CrawlerRunConfig(
        extraction_strategy=extraction_strategy
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://crawl4ai.com/pricing",
            config=crawler_config,
        )
        print(result.extracted_content)


asyncio.run(extract_structured_data())
```
|
||||
|
||||
#### Workflow and Error Handling
|
||||
- **Chunk Merging**: Content is divided into manageable chunks based on the token threshold.
|
||||
- **Backoff and Retries**: Handles API rate limits with backoff strategies.
|
||||
- **Error Logging**: Extracted blocks include error tags when issues occur.
|
||||
- **Parallel Execution**: Supports multi-threaded execution for efficiency.
|
||||
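The backoff-and-retry behaviour listed above can be sketched as follows. This is a generic exponential backoff pattern under stated assumptions, not the code Crawl4AI runs internally: `RateLimitError`, `call_with_backoff`, and `flaky_llm_call` are hypothetical names used only for this illustration.

```python
import random
import time


class RateLimitError(Exception):
    """Stand-in for a provider's rate-limit error (hypothetical)."""


def call_with_backoff(func, max_retries=4, base_delay=0.01, sleep=time.sleep):
    """Call func, retrying on RateLimitError with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except RateLimitError:
            if attempt == max_retries:
                raise  # Give up after the final attempt
            # Delay doubles on each attempt; jitter avoids synchronized retries
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)


calls = {"count": 0}


def flaky_llm_call():
    """Simulated LLM call that is rate limited on its first two attempts."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RateLimitError("rate limited")
    return {"blocks": ["ok"]}


# The sleep function is injected so the demo does not actually wait
result = call_with_backoff(flaky_llm_call, sleep=lambda seconds: None)
print(result, "after", calls["count"], "attempts")
```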
|
||||
#### Benefits of Using LLM Extraction Strategy
|
||||
- **Dynamic Adaptability**: Easily switch between schema-based and instruction-based modes.
|
||||
- **Scalable**: Processes large content efficiently using chunking.
|
||||
- **Versatile**: Works with various LLM providers and configurations.
|
||||
|
||||
This strategy is ideal for extracting structured data from complex web pages, ensuring compatibility with LLM training and fine-tuning workflows.
|
||||
|
||||
### Cosine Strategy
|
||||
|
||||
The Cosine Strategy in Crawl4AI uses similarity-based clustering to identify and extract relevant content sections from web pages. This strategy is particularly useful when you need to find and extract content based on semantic similarity rather than structural patterns.
|
||||
|
||||
#### How It Works
|
||||
|
||||
The Cosine Strategy:
|
||||
1. Breaks down page content into meaningful chunks
|
||||
2. Converts text into vector representations
|
||||
3. Calculates similarity between chunks
|
||||
4. Clusters similar content together
|
||||
5. Ranks and filters content based on relevance
|
||||
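A toy version of the vectorize, compare, and rank steps above may make the idea concrete. It assumes plain word-count vectors in place of the sentence embeddings the strategy uses, and it skips the clustering step entirely; the function names are hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn text into a bag-of-words count vector (stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(chunks, semantic_filter, sim_threshold=0.3, top_k=3):
    """Score each chunk against the filter, drop weak matches, keep the best top_k."""
    query = vectorize(semantic_filter)
    scored = [(cosine_similarity(vectorize(chunk), query), chunk) for chunk in chunks]
    kept = [pair for pair in scored if pair[0] >= sim_threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:top_k]


chunks = [
    "Great product, the reviews praise the battery life",
    "Shipping policy and returns information",
    "Customer reviews: product works as described",
]
ranked = rank_chunks(chunks, "product reviews", sim_threshold=0.3)
for score, chunk in ranked:
    print(f"{score:.2f}  {chunk}")
```

Word counts only match exact terms, which is why an embedding model is the better choice when the page uses different wording than the filter.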
|
||||
#### Basic Usage
|
||||
|
||||
```python
import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import CosineStrategy

strategy = CosineStrategy(
    semantic_filter="product reviews",  # Target content type
    word_count_threshold=10,            # Minimum words per cluster
    sim_threshold=0.3,                  # Similarity threshold
)


async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://crawl4ai.com/reviews",
            extraction_strategy=strategy,
        )
        return result.extracted_content


content = asyncio.run(main())
```
|
||||
|
||||
#### Configuration Options
|
||||
|
||||
##### Core Parameters
|
||||
|
||||
```python
|
||||
CosineStrategy(
|
||||
# Content Filtering
|
||||
semantic_filter: str = None, # Keywords/topic for content filtering
|
||||
word_count_threshold: int = 10, # Minimum words per cluster
|
||||
sim_threshold: float = 0.3, # Similarity threshold (0.0 to 1.0)
|
||||
|
||||
# Clustering Parameters
|
||||
max_dist: float = 0.2, # Maximum distance for clustering
|
||||
linkage_method: str = 'ward', # Clustering linkage method
|
||||
top_k: int = 3, # Number of top categories to extract
|
||||
|
||||
# Model Configuration
|
||||
model_name: str = 'sentence-transformers/all-MiniLM-L6-v2', # Embedding model
|
||||
|
||||
verbose: bool = False # Enable logging
|
||||
)
|
||||
```
|
||||
|
||||
##### Parameter Details
|
||||
|
||||
1. **semantic_filter**
|
||||
- Sets the target topic or content type
|
||||
- Use keywords relevant to your desired content
|
||||
- Example: "technical specifications", "user reviews", "pricing information"
|
||||
|
||||
2. **sim_threshold**
|
||||
- Controls how similar content must be to be grouped together
|
||||
- Higher values (e.g., 0.8) mean stricter matching
|
||||
- Lower values (e.g., 0.3) allow more variation
|
||||
```python
|
||||
# Strict matching
|
||||
strategy = CosineStrategy(sim_threshold=0.8)
|
||||
|
||||
# Loose matching
|
||||
strategy = CosineStrategy(sim_threshold=0.3)
|
||||
```
|
||||
|
||||
3. **word_count_threshold**
|
||||
- Filters out short content blocks
|
||||
- Helps eliminate noise and irrelevant content
|
||||
```python
|
||||
# Only consider substantial paragraphs
|
||||
strategy = CosineStrategy(word_count_threshold=50)
|
||||
```
|
||||
|
||||
4. **top_k**
|
||||
- Number of top content clusters to return
|
||||
- Higher values return more diverse content
|
||||
```python
|
||||
# Get top 5 most relevant content clusters
|
||||
strategy = CosineStrategy(top_k=5)
|
||||
```
|
||||
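How `word_count_threshold` and `top_k` narrow the result set can be pictured with a small stand-alone filter. This sketch does not reproduce `CosineStrategy` internals: `select_clusters` is a hypothetical helper, and the `text` and `score` keys are assumed for the example.

```python
def select_clusters(clusters, word_count_threshold=10, top_k=3):
    """Drop short clusters, then keep the top_k highest-scoring ones."""
    substantial = [
        c for c in clusters
        if len(c["text"].split()) >= word_count_threshold
    ]
    ranked = sorted(substantial, key=lambda c: c["score"], reverse=True)
    return ranked[:top_k]


clusters = [
    {"text": "Buy now", "score": 0.9},
    {"text": "The battery lasts two days and charges quickly over USB-C", "score": 0.7},
    {"text": "Setup took five minutes and the app paired on the first try", "score": 0.8},
    {"text": "Footer links and legal notices for the store and its partners", "score": 0.2},
]

# "Buy now" scores highest but is filtered out for being too short
selected = select_clusters(clusters, word_count_threshold=5, top_k=2)
for cluster in selected:
    print(cluster["score"], cluster["text"])
```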
|
||||
#### Use Cases
|
||||
|
||||
##### 1. Article Content Extraction
|
||||
```python
|
||||
strategy = CosineStrategy(
|
||||
semantic_filter="main article content",
|
||||
word_count_threshold=100, # Longer blocks for articles
|
||||
top_k=1 # Usually want single main content
|
||||
)
|
||||
|
||||
result = await crawler.arun(
|
||||
url="https://crawl4ai.com/blog/post",
|
||||
extraction_strategy=strategy
|
||||
)
|
||||
```
|
||||
|
||||
##### 2. Product Review Analysis
|
||||
```python
|
||||
strategy = CosineStrategy(
|
||||
semantic_filter="customer reviews and ratings",
|
||||
word_count_threshold=20, # Reviews can be shorter
|
||||
top_k=10, # Get multiple reviews
|
||||
sim_threshold=0.4 # Allow variety in review content
|
||||
)
|
||||
```
|
||||
|
||||
##### 3. Technical Documentation
|
||||
```python
|
||||
strategy = CosineStrategy(
|
||||
semantic_filter="technical specifications documentation",
|
||||
word_count_threshold=30,
|
||||
sim_threshold=0.6, # Stricter matching for technical content
|
||||
max_dist=0.3 # Allow related technical sections
|
||||
)
|
||||
```
|
||||
|
||||
#### Advanced Features
|
||||
|
||||
##### Custom Clustering
|
||||
```python
|
||||
strategy = CosineStrategy(
|
||||
linkage_method='complete', # Alternative clustering method
|
||||
max_dist=0.4, # Larger clusters
|
||||
model_name='sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2' # Multilingual support
|
||||
)
|
||||
```
|
||||
|
||||
##### Content Filtering Pipeline
|
||||
```python
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import CosineStrategy

strategy = CosineStrategy(
    semantic_filter="pricing plans features",
    word_count_threshold=15,
    sim_threshold=0.5,
    top_k=3,
)


async def extract_pricing_features(url: str):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            extraction_strategy=strategy,
        )

        if result.success:
            content = json.loads(result.extracted_content)
            return {
                'pricing_features': content,
                'clusters': len(content),
                'similarity_scores': [item['score'] for item in content],
            }

        # Crawl failed: nothing to report
        return None
```
|
||||
|
||||
#### Best Practices
|
||||
|
||||
1. **Adjust Thresholds Iteratively**
|
||||
- Start with default values
|
||||
- Adjust based on results
|
||||
- Monitor clustering quality
|
||||
|
||||
2. **Choose Appropriate Word Count Thresholds**
|
||||
- Higher for articles (100+)
|
||||
- Lower for reviews/comments (20+)
|
||||
- Medium for product descriptions (50+)
|
||||
|
||||
3. **Optimize Performance**
|
||||
```python
|
||||
strategy = CosineStrategy(
|
||||
word_count_threshold=10, # Filter early
|
||||
top_k=5, # Limit results
|
||||
verbose=True # Monitor performance
|
||||
)
|
||||
```
|
||||
|
||||
4. **Handle Different Content Types**
|
||||
```python
|
||||
# For mixed content pages
|
||||
strategy = CosineStrategy(
|
||||
semantic_filter="product features",
|
||||
sim_threshold=0.4, # More flexible matching
|
||||
max_dist=0.3, # Larger clusters
|
||||
top_k=3 # Multiple relevant sections
|
||||
)
|
||||
```
|
||||
|
||||
#### Error Handling
|
||||
|
||||
```python
# Runs inside an async function where `crawler` and `strategy` are already defined
try:
    result = await crawler.arun(
        url="https://crawl4ai.com",
        extraction_strategy=strategy
    )

    if result.success:
        content = json.loads(result.extracted_content)
        if not content:
            print("No relevant content found")
    else:
        print(f"Extraction failed: {result.error_message}")

except Exception as e:
    print(f"Error during extraction: {str(e)}")
```
|
||||
|
||||
The Cosine Strategy is particularly effective when:
|
||||
- Content structure is inconsistent
|
||||
- You need semantic understanding
|
||||
- You want to find similar content blocks
|
||||
- Structure-based extraction (CSS/XPath) isn't reliable
|
||||
|
||||
It works well with other strategies and can be used as a pre-processing step for LLM-based extraction.
|
||||
|
||||
|
||||
### JSON-Based Extraction Strategies with AsyncWebCrawler
|
||||
|
||||
In many cases, relying on a Large Language Model (LLM) to parse and structure data from web pages is both unnecessary and wasteful. Instead of incurring additional computational overhead, network latency, and even contributing to unnecessary CO2 emissions, you can employ direct HTML parsing strategies. These approaches are faster, simpler, and more environmentally friendly, running efficiently on any computer or device without costly API calls.
|
||||
|
||||
Crawl4AI offers two primary declarative extraction strategies that do not depend on LLMs:
|
||||
- `JsonCssExtractionStrategy`
|
||||
- `JsonXPathExtractionStrategy`
|
||||
|
||||
Of these two, while CSS selectors are often simpler to use, **XPath selectors are generally more robust and flexible**, particularly for large-scale scraping tasks. Modern websites often generate dynamic or ephemeral class names that are subject to frequent change. XPath, on the other hand, allows you to navigate the DOM structure directly, making your selectors less brittle and less dependent on inconsistent class names.
|
||||
|
||||
#### Why Use JSON-Based Extraction Instead of LLMs?
|
||||
|
||||
1. **Speed & Efficiency**: Direct HTML parsing bypasses the latency of external API calls.
|
||||
2. **Lower Resource Usage**: No need for large models, GPU acceleration, or network overhead.
|
||||
3. **Environmentally Friendly**: Reduced energy consumption and carbon footprint compared to LLM inference.
|
||||
4. **Offline Capability**: Works anywhere you have the HTML, no network needed.
|
||||
5. **Scalability & Reliability**: Stable and predictable, without dealing with model “hallucinations” or downtime.
|
||||
|
||||
#### Advantages of XPath Over CSS
|
||||
|
||||
1. **Stability in Dynamic Environments**: Websites change their classes and IDs constantly. XPath allows you to refer to elements by structure and position instead of relying on fragile class names.
|
||||
2. **Finer-Grained Control**: XPath supports advanced queries like traversing parent/child relationships, filtering based on attributes, and handling complex nested patterns.
|
||||
3. **Consistency Across Complex Pages**: Even when the front-end framework changes markup or introduces randomized class names, XPath expressions often remain valid if the structural hierarchy stays intact.
|
||||
4. **More Powerful Selection Logic**: You can write conditions like `//div[@data-test='price']` or `//tr[3]/td[2]` to accurately pinpoint elements.
|
||||
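The difference between selecting by structure and selecting by class name can be tried offline. The sketch below uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed markup; the table content and the randomized class names are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Class names are random-looking on purpose; the structure is what stays stable
markup = """
<table>
  <tbody>
    <tr class="x9f3a"><td>Bitcoin</td><td data-test="price">$97,000</td></tr>
    <tr class="k2b7c"><td>Ethereum</td><td data-test="price">$3,600</td></tr>
  </tbody>
</table>
"""

root = ET.fromstring(markup)

# Select by position: second cell of every row, no class names involved
rows = root.findall("./tbody/tr")
by_position = [row.find("./td[2]").text for row in rows]

# Select by a stable attribute instead of a generated class
by_attribute = [cell.text for cell in root.findall(".//td[@data-test='price']")]

print(by_position)
print(by_attribute)
```

Both selectors return the same prices even though no two rows share a class name.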
|
||||
#### Example Using XPath
|
||||
|
||||
Below is an example that extracts cryptocurrency prices from a hypothetical page using `JsonXPathExtractionStrategy`. Here, we avoid depending on class names entirely, focusing on the consistent structure of the HTML. By adjusting XPath expressions, you can overcome dynamic naming schemes that would break fragile CSS selectors.
|
||||
|
||||
```python
import json
import asyncio

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy


async def extract_data_using_xpath():
    print("\n--- Using JsonXPathExtractionStrategy for Fast, Reliable Structured Output ---")

    # Define the extraction schema using XPath selectors
    # Example: We know the table rows are always in this structure, regardless of class names
    schema = {
        "name": "Crypto Prices",
        "baseSelector": "//table/tbody/tr",
        "fields": [
            {
                "name": "crypto",
                "selector": ".//td[1]/h2",
                "type": "text",
            },
            {
                "name": "symbol",
                "selector": ".//td[1]/p",
                "type": "text",
            },
            {
                "name": "price",
                "selector": ".//td[2]",
                "type": "text",
            }
        ],
    }

    extraction_strategy = JsonXPathExtractionStrategy(schema, verbose=True)

    async with AsyncWebCrawler(verbose=True) as crawler:
        # Use XPath extraction on a page known for frequently changing its class names
        result = await crawler.arun(
            url="https://www.examplecrypto.com/prices",
            extraction_strategy=extraction_strategy,
            bypass_cache=True,
        )

        assert result.success, "Failed to crawl the page"

        # Parse the extracted content
        crypto_prices = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(crypto_prices)} cryptocurrency prices")
        print(json.dumps(crypto_prices[0], indent=2))

    return crypto_prices


# Run the async function
asyncio.run(extract_data_using_xpath())
```
|
||||
|
||||
#### When to Use CSS vs. XPath
|
||||
|
||||
- **CSS Selectors**: Good for simpler, stable sites where classes and IDs are fixed and descriptive. Ideal if you’re already familiar with front-end development patterns.
|
||||
- **XPath Selectors**: Recommended for complex or highly dynamic websites. If classes and IDs are meaningless, random, or prone to frequent changes, XPath provides a more structural and future-proof solution.
|
||||
|
||||
#### Handling Dynamic Content
|
||||
|
||||
Even on websites that load content asynchronously, you can still rely on XPath extraction. Combine the extraction strategy with JavaScript execution to scroll or wait for certain elements to appear. Using XPath after the page finishes loading ensures you’re targeting elements that are fully rendered and stable.
|
||||
|
||||
For example:
|
||||
|
||||
```python
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy


async def extract_dynamic_data():
    schema = {
        "name": "Dynamic Crypto Prices",
        "baseSelector": "//tr[contains(@class, 'price-row')]",
        "fields": [
            {"name": "name", "selector": ".//td[1]", "type": "text"},
            {"name": "price", "selector": ".//td[2]", "type": "text"},
        ]
    }

    # Scroll to the bottom, then pause so lazily loaded rows can render
    js_code = """
    window.scrollTo(0, document.body.scrollHeight);
    await new Promise(resolve => setTimeout(resolve, 2000));
    """

    extraction_strategy = JsonXPathExtractionStrategy(schema, verbose=True)

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.examplecrypto.com/dynamic-prices",
            extraction_strategy=extraction_strategy,
            js_code=js_code,
            wait_for="//tr[contains(@class, 'price-row')][20]",  # Wait until at least 20 rows load
            bypass_cache=True,
        )

        crypto_data = json.loads(result.extracted_content)
        print(f"Extracted {len(crypto_data)} cryptocurrency entries")
```
|
||||
|
||||
#### Best Practices
|
||||
|
||||
1. **Avoid LLM-Based Extraction**: If the data is repetitive and structured, direct HTML parsing is faster, cheaper, and more stable.
|
||||
2. **Start with XPath**: In a constantly changing environment, building XPath selectors from stable structural elements (like table hierarchies, element positions, or unique attributes) ensures you won’t need to frequently rewrite selectors.
|
||||
3. **Test in Developer Tools**: Use browser consoles or `xmllint` to quickly verify XPath queries before coding.
|
||||
4. **Focus on Hierarchy, Not Classes**: Avoid relying on class names if they’re dynamic. Instead, use structural approaches like `//table/tbody/tr` or `//div[@data-test='price']`.
|
||||
5. **Combine with JS Execution**: For dynamic sites, run small snippets of JS to reveal content before extracting with XPath.
|
||||
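To test selectors before running a crawl, a schema of the same shape (`baseSelector` plus `fields`) can be applied to a saved snippet locally. The sketch below is an assumption-laden stand-in, not `JsonXPathExtractionStrategy` itself: `apply_schema` is a hypothetical helper built on the standard library's `xml.etree.ElementTree`, which accepts only relative paths (hence `.//` rather than `//`) and requires well-formed markup.

```python
import json
import xml.etree.ElementTree as ET


def apply_schema(markup, schema):
    """Apply a baseSelector/fields schema to well-formed markup (text fields only)."""
    root = ET.fromstring(markup)
    records = []
    for node in root.findall(schema["baseSelector"]):
        record = {}
        for field in schema["fields"]:
            match = node.find(field["selector"])
            # Missing elements become None instead of raising
            record[field["name"]] = (
                "".join(match.itertext()).strip() if match is not None else None
            )
        records.append(record)
    return records


schema = {
    "name": "Crypto Prices",
    "baseSelector": ".//tbody/tr",
    "fields": [
        {"name": "crypto", "selector": "./td[1]/h2", "type": "text"},
        {"name": "symbol", "selector": "./td[1]/p", "type": "text"},
        {"name": "price", "selector": "./td[2]", "type": "text"},
    ],
}

markup = """
<table><tbody>
  <tr><td><h2>Bitcoin</h2><p>BTC</p></td><td>$97,000</td></tr>
  <tr><td><h2>Ethereum</h2><p>ETH</p></td><td>$3,600</td></tr>
</tbody></table>
"""

records = apply_schema(markup, schema)
print(json.dumps(records, indent=2))
```

Real pages are rarely well-formed XML, so a check like this is best run on a cleaned-up fragment of the page rather than the full document.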
|
||||
By following these guidelines, you can create high-performance, resilient extraction pipelines. You’ll save resources, reduce environmental impact, and enjoy a level of reliability and speed that LLM-based solutions can’t match when parsing repetitive data from complex or ever-changing websites.
|
||||
|
||||
### **Automating Schema Generation with a One-Time LLM-Assisted Utility**
|
||||
|
||||
While the focus of these extraction strategies is to avoid continuous reliance on LLMs, you can leverage a model once to streamline the creation of complex schemas. Instead of painstakingly determining repetitive patterns, crafting CSS or XPath selectors, and deciding field definitions by hand, you can prompt a language model once with the raw HTML and a brief description of what you need to extract. The result is a ready-to-use schema that you can plug into `JsonCssExtractionStrategy` or `JsonXPathExtractionStrategy` for lightning-fast extraction without further model calls.
|
||||
|
||||
**How It Works:**
|
||||
1. Provide the raw HTML containing your repetitive patterns.
|
||||
2. Optionally specify a natural language query describing the data you want.
|
||||
3. Run `generate_schema(html, query)` to let the LLM generate a schema automatically.
|
||||
4. Take the returned schema and use it directly with `JsonCssExtractionStrategy` or `JsonXPathExtractionStrategy`.
|
||||
5. After this initial step, no more LLM calls are necessary—you now have a schema that you can reuse as often as you like.
|
||||
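One way to make the "generate once, reuse often" step explicit is to cache the schema on disk. The sketch below is an assumption, not part of Crawl4AI: `load_or_generate_schema` and `fake_generate_schema` are hypothetical, with the latter standing in for the LLM-backed `generate_schema` so the example runs without any model call.

```python
import json
import tempfile
from pathlib import Path


def load_or_generate_schema(cache_path, html, query, generate):
    """Return the cached schema if present; otherwise generate it once and save it."""
    path = Path(cache_path)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    schema = generate(html, query)
    path.write_text(json.dumps(schema, indent=2), encoding="utf-8")
    return schema


llm_calls = []


def fake_generate_schema(html, query=None):
    """Stand-in for the LLM-backed generator; records each invocation."""
    llm_calls.append(query)
    return {
        "name": "Jobs",
        "baseSelector": "//div[@class='position-card']",
        "fields": [{"name": "title", "selector": ".//h3", "type": "text"}],
    }


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "jobs_schema.json"
    first = load_or_generate_schema(cache, "<div>...</div>", "titles", fake_generate_schema)
    second = load_or_generate_schema(cache, "<div>...</div>", "titles", fake_generate_schema)

print("generator calls:", len(llm_calls))
```

The second call reads the file instead of invoking the generator, which is the property that keeps ongoing extraction free of LLM costs.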
|
||||
**Code Example:**
|
||||
|
||||
Here is a simplified demonstration using the utility function `generate_schema` that you’ve incorporated into your codebase. In this example, we:
|
||||
- Use a one-time LLM call to derive a schema from the HTML structure of a job board.
|
||||
- Apply the resulting schema to `JsonXPathExtractionStrategy` (although you can also use `JsonCssExtractionStrategy` if preferred).
|
||||
- Extract data from the target page at high speed with no subsequent LLM calls.
|
||||
|
||||
```python
|
||||
import json
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy
|
||||
|
||||
# Assume generate_schema is integrated and available
|
||||
from my_schema_utils import generate_schema
|
||||
|
||||
async def extract_data_with_generated_schema():
|
||||
# Raw HTML snippet representing repetitive patterns in the webpage
|
||||
test_html = """
|
||||
<div class="company-listings">
|
||||
<div class="company" data-company-id="123">
|
||||
<div class="company-header">
|
||||
<img class="company-logo" src="google.png" alt="Google">
|
||||
<h1 class="company-name">Google</h1>
|
||||
<div class="company-meta">
|
||||
<span class="company-size">10,000+ employees</span>
|
||||
<span class="company-industry">Technology</span>
|
||||
<a href="https://google.careers" class="careers-link">Careers Page</a>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="departments">
|
||||
<div class="department">
|
||||
<h2 class="department-name">Engineering</h2>
|
||||
<div class="positions">
|
||||
<div class="position-card" data-position-id="eng-1">
|
||||
<h3 class="position-title">Senior Software Engineer</h3>
|
||||
<span class="salary-range">$150,000 - $250,000</span>
|
||||
<div class="position-meta">
|
||||
<span class="location">Mountain View, CA</span>
|
||||
<span class="job-type">Full-time</span>
|
||||
<span class="experience">5+ years</span>
|
||||
</div>
|
||||
<div class="skills-required">
|
||||
<span class="skill">Python</span>
|
||||
<span class="skill">Kubernetes</span>
|
||||
<span class="skill">Machine Learning</span>
|
||||
</div>
|
||||
<p class="position-description">Join our core engineering team...</p>
|
||||
<div class="application-info">
|
||||
<span class="posting-date">Posted: 2024-03-15</span>
|
||||
<button class="apply-btn" data-req-id="REQ12345">Apply Now</button>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
"""
|
||||
|
||||
# Optional natural language query to guide the schema generation
|
||||
query = "Extract company name, position titles, and salaries"
|
||||
|
||||
# One-time call to the LLM to generate a reusable schema
|
||||
schema = generate_schema(test_html, query=query)
|
||||
|
||||
# Other examples of queries:
|
||||
# # Test 1: No query (should extract everything)
|
||||
# print("\nTest 1: No Query (Full Schema)")
|
||||
# schema1 = generate_schema(test_html)
|
||||
# print(json.dumps(schema1, indent=2))
|
||||
|
||||
# # Test 2: Query for just basic job info
|
||||
# print("\nTest 2: Basic Job Info Query")
|
||||
# query2 = "I only need job titles, salaries, and locations"
|
||||
# schema2 = generate_schema(test_html, query2)
|
||||
# print(json.dumps(schema2, indent=2))
|
||||
|
||||
# # Test 3: Query for company and department structure
|
||||
# print("\nTest 3: Organizational Structure Query")
|
||||
# query3 = "Extract company details and department names, without position details"
|
||||
# schema3 = generate_schema(test_html, query3)
|
||||
# print(json.dumps(schema3, indent=2))
|
||||
|
||||
# # Test 4: Query for specific skills tracking
|
||||
# print("\nTest 4: Skills Analysis Query")
|
||||
# query4 = "I want to analyze required skills across all positions"
|
||||
# schema4 = generate_schema(test_html, query4)
|
||||
# print(json.dumps(schema4, indent=2))
|
||||
|
||||
# Now use the generated schema for high-speed extraction without any further LLM calls
|
||||
extraction_strategy = JsonXPathExtractionStrategy(schema, verbose=True)
|
||||
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
# URL for demonstration purposes (use any URL that contains a similar structure)
|
||||
result = await crawler.arun(
|
||||
url="https://crawl4ai.com/jobs",
|
||||
extraction_strategy=extraction_strategy,
|
||||
bypass_cache=True
|
||||
)
|
||||
|
||||
if not result.success:
|
||||
raise Exception("Extraction failed")
|
||||
|
||||
data = json.loads(result.extracted_content)
|
||||
print("Extracted data:")
|
||||
print(json.dumps(data, indent=2))
|
||||
|
||||
# Run the async function
|
||||
asyncio.run(extract_data_with_generated_schema())
|
||||
```
|
||||
|
||||
**Benefits of the One-Time LLM Approach:**
|
||||
- **Time-Saving**: Quickly bootstrap your schema creation, especially for complex pages.
|
||||
- **Once and Done**: Use the LLM once and then rely purely on the ultra-fast, local extraction strategies.
|
||||
- **Sustainable**: No repeated model calls means less compute, lower cost, and reduced environmental impact.
|
||||
|
||||
This approach leverages the strengths of both worlds: a one-time intelligent schema generation step with a language model, followed by a stable, purely local extraction pipeline that runs efficiently on any machine, without further LLM dependencies.
|
||||
@@ -1,74 +1,12 @@
|
||||
### Hypothetical Questions
|
||||
|
||||
1. **LLM Extraction Strategy**
|
||||
- *"How can I use an LLM to dynamically extract structured data from a webpage?"*
|
||||
- *"What is the difference between block extraction and schema-based extraction in the LLM strategy?"*
|
||||
- *"How can I define a JSON schema and incorporate it into the LLM extraction process?"*
|
||||
- *"What parameters control chunk size and overlap for LLM-based extraction?"*
|
||||
- *"How do I handle errors, retries, and backoff when calling an LLM API for extraction?"*
|
||||
|
||||
2. **Cosine Strategy**
|
||||
- *"How does the Cosine Strategy identify and cluster semantically similar content?"*
|
||||
- *"What parameters (like `sim_threshold` or `word_count_threshold`) affect the relevance of extracted content?"*
|
||||
- *"When should I use semantic filtering with Cosine Strategy vs. simple keyword filtering?"*
|
||||
- *"How can I adjust `top_k` to retrieve more or fewer relevant content clusters?"*
|
||||
- *"In what scenarios is the Cosine Strategy more effective than LLM-based or CSS/XPath extraction?"*
|
||||
|
||||
3. **JSON-Based Extraction Strategies (Without LLMs)**
|
||||
- *"What are the advantages of using JSON-based extraction strategies like `JsonCssExtractionStrategy` and `JsonXPathExtractionStrategy` over LLM-based methods?"*
|
||||
- *"How do CSS and XPath selectors differ, and when is XPath more reliable?"*
|
||||
- *"How can I handle frequently changing class names or dynamic elements using XPath-based extraction?"*
|
||||
- *"Can I run these extraction strategies offline without any external API calls?"*
|
||||
- *"How do I combine JS execution with XPath extraction to handle dynamically loaded content?"*
|
||||
|
||||
4. **Environmental and Efficiency Considerations**
|
||||
- *"Why should I avoid continuous LLM calls for repetitive extraction tasks?"*
|
||||
- *"How does using XPath extraction reduce energy consumption and costs?"*
|
||||
- *"Can I initially use an LLM to generate a schema and then rely solely on efficient, local strategies?"*
|
||||
|
||||
5. **Schema Generation with a One-Time LLM Utility**
|
||||
- *"How can I use a one-time LLM call to generate a schema, then run extraction repeatedly without further LLM costs?"*
|
||||
- *"What steps are involved in using a language model just once to bootstrap my extraction schema?"*
|
||||
- *"How do I incorporate the generated schema into `JsonXPathExtractionStrategy` for fast, robust extraction?"*
|
||||
|
||||
6. **Advanced Use Cases and Best Practices**
|
||||
- *"When should I combine LLM-based extraction with cosine similarity filtering for maximum relevance?"*
|
||||
- *"What best practices should I follow when choosing thresholds and selectors to ensure stable, scalable extractions?"*
|
||||
- *"How can I adapt these strategies to different page layouts, content types, or query requirements?"*
|
||||
- *"Are there recommended troubleshooting steps if extraction fails or yields empty results?"*
|
||||
|
||||
### Topics Discussed in the File
|
||||
|
||||
- **LLM Extraction Strategy**:
|
||||
- **Modes**: Block-based or schema-based extraction using LLMs
|
||||
- **Parameters**: API tokens, instructions, schemas, chunk sizes, overlap rates
|
||||
- **Workflows**: Chunk merging, error handling, parallel execution
|
||||
- **Advantages**: Dynamic adaptability, schema-based extraction, scaling large content
|
||||
|
||||
- **Cosine Strategy**:
|
||||
- **Approach**: Semantic filtering and clustering of content
|
||||
- **Parameters**: `semantic_filter`, `word_count_threshold`, `sim_threshold`, `top_k`
|
||||
- **Use Cases**: Extracting relevant content from unstructured pages based on semantic similarity
|
||||
- **Advanced Config**: Custom clustering methods, model choices, performance optimization
|
||||
|
||||
- **JSON-Based Extraction Strategies (Non-LLM)**:
|
||||
- **Strategies**: `JsonCssExtractionStrategy` and `JsonXPathExtractionStrategy`
|
||||
- **Advantages**: Speed, efficiency, no external dependencies, environmentally friendly
|
||||
- **XPath vs. CSS**: XPath recommended for unstable, dynamic front-ends; more robust and structural
|
||||
- **Dynamic Content**: Combine JS execution and waiting conditions with XPath extraction
|
||||
|
||||
- **Sustainability and Efficiency Considerations**:
|
||||
- **Rationale**: Avoiding continuous LLM use to save cost, reduce latency, and decrease carbon footprint
|
||||
- **Scalability**: Run on any device without expensive hardware or API calls
|
||||
|
||||
- **One-Time LLM-Assisted Schema Generation**:
|
||||
- **Workflow**: Use LLM once to generate a schema from HTML and queries
|
||||
- **Afterwards**: Rely solely on JSON-based extraction (CSS/XPath) for fast and stable extractions
|
||||
- **Benefits**: Time-saving, cost-reducing, sustainable approach without sacrificing complexity
|
||||
|
||||
- **Integration and Best Practices**:
|
||||
- **Threshold Tuning**: Iterative adjustments for `sim_threshold`, `word_count_threshold`
|
||||
- **Performance**: Chunking large content for LLM extraction, vectorizing content for cosine similarity
|
||||
- **Testing and Validation**: Use developer tools or dummy HTML to refine selectors, test JS code for dynamic content loading
|
||||
|
||||
Overall, the file emphasizes choosing the right extraction strategy for the task—ranging from highly dynamic and schema-driven LLM approaches to more stable, efficient, and environmentally friendly direct HTML parsing methods (CSS/XPath). It also suggests a hybrid approach where an LLM can be used initially to generate a schema, then rely on local extraction strategies for ongoing tasks.
|
||||
llm_extraction: LLM Extraction Strategy uses language models to process web content into structured JSON | language model extraction, schema extraction, LLM parsing | LLMExtractionStrategy(provider="openai", api_token="token")
|
||||
schema_based_extraction: Extract data using predefined JSON schemas to structure LLM output | schema extraction, structured output | schema=OpenAIModelFee.model_json_schema()
|
||||
chunking_config: Configure content chunking with token threshold and overlap rate | content chunks, token limits | chunk_token_threshold=1000, overlap_rate=0.1
|
||||
provider_config: Specify LLM provider and API credentials for extraction | model provider, API setup | provider="openai", api_token="your_token"
|
||||
cosine_strategy: Use similarity-based clustering to extract relevant content sections | content clustering, semantic similarity | CosineStrategy(semantic_filter="product reviews")
|
||||
clustering_params: Configure clustering behavior with similarity thresholds and methods | similarity settings, cluster config | sim_threshold=0.3, linkage_method='ward'
|
||||
content_filtering: Filter extracted content based on word count and relevance | content filters, extraction rules | word_count_threshold=10, top_k=3
|
||||
xpath_extraction: Extract data using XPath selectors for stable structural parsing | xpath selectors, HTML parsing | JsonXPathExtractionStrategy(schema)
|
||||
css_extraction: Extract data using CSS selectors for simple HTML parsing | css selectors, HTML parsing | JsonCssExtractionStrategy(schema)
|
||||
schema_generation: Generate extraction schemas automatically using one-time LLM assistance | schema creation, automation | generate_schema(html, query)
|
||||
dynamic_content: Handle dynamic webpage content with JavaScript execution and waiting | async content, js execution | js_code="window.scrollTo(0, document.body.scrollHeight)"
|
||||
extraction_best_practices: Use XPath for stability, avoid unnecessary LLM calls, test selectors | optimization, reliability | baseSelector="//table/tbody/tr"
|
||||
@@ -1,81 +0,0 @@
|
||||
# Extraction Strategies (Condensed LLM-Friendly Reference)
|
||||
|
||||
> Extract structured data (JSON) and text blocks from HTML with LLM-based or clustering methods.
|
||||
|
||||
Streamlined parameters, usage, and code snippets for quick LLM reference.
|
||||
|
||||
## LLMExtractionStrategy
|
||||
|
||||
- Uses LLM to extract structured data from HTML.
|
||||
- Supports `instruction`, `schema`, `extraction_type`, `chunk_token_threshold`, `overlap_rate`.
|
||||
```python
|
||||
from crawl4ai.extraction_strategy import LLMExtractionStrategy
|
||||
strategy = LLMExtractionStrategy(
|
||||
provider="openai",
|
||||
api_token="your_api_token",
|
||||
instruction="Extract prices",
|
||||
schema={"fields": [...]},
|
||||
extraction_type="schema"
|
||||
)
|
||||
```
|
||||
|
||||
## CosineStrategy
|
||||
|
||||
- Clusters content via semantic embeddings.
|
||||
- Key params: `semantic_filter`, `word_count_threshold`, `sim_threshold`, `top_k`.
|
||||
```python
|
||||
from crawl4ai.extraction_strategy import CosineStrategy
|
||||
strategy = CosineStrategy(
|
||||
semantic_filter="product reviews",
|
||||
word_count_threshold=20,
|
||||
sim_threshold=0.3,
|
||||
top_k=5
|
||||
)
|
||||
```
|
||||
|
||||
## JsonCssExtractionStrategy
|
||||
|
||||
- Extracts data using CSS selectors.
|
||||
- `schema` defines `baseSelector`, `fields`.
|
||||
```python
|
||||
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
|
||||
schema = {
|
||||
"baseSelector": ".product",
|
||||
"fields": [
|
||||
{"name":"title","selector":"h2","type":"text"},
|
||||
{"name":"price","selector":".price","type":"text"}
|
||||
]
|
||||
}
|
||||
strategy = JsonCssExtractionStrategy(schema=schema)
|
||||
```
|
||||
|
||||
## JsonXPathExtractionStrategy
|
||||
|
||||
- Similar to CSS but uses XPath.
|
||||
- More stable against changing class names.
|
||||
```python
|
||||
from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy
|
||||
schema = {
|
||||
"baseSelector": "//div[@class='product']",
|
||||
"fields": [
|
||||
{"name":"title","selector":".//h2","type":"text"},
|
||||
{"name":"price","selector":".//span[@class='price']","type":"text"}
|
||||
]
|
||||
}
|
||||
strategy = JsonXPathExtractionStrategy(schema=schema)
|
||||
```
|
||||
|
||||
## Example Usage
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # `strategy` is any extraction strategy built in the sections above
    config = CrawlerRunConfig(extraction_strategy=strategy)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        print(result.extracted_content)

asyncio.run(main())
```
|
||||
|
||||
## Optional
|
||||
|
||||
- [extraction_strategy.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/extraction_strategy.py)
|
||||
@@ -1,385 +0,0 @@
|
||||
# Content Selection in Crawl4AI
|
||||
|
||||
Crawl4AI offers flexible and powerful methods to precisely select and filter content from webpages. Whether you’re extracting articles, filtering unwanted elements, or using LLMs for structured data extraction, this guide will walk you through the essentials and advanced techniques.
|
||||
|
||||
**Table of Contents:**
|
||||
- [Content Selection in Crawl4AI](#content-selection-in-crawl4ai)
|
||||
- [Introduction \& Quick Start](#introduction--quick-start)
|
||||
- [CSS Selectors](#css-selectors)
|
||||
- [Content Filtering](#content-filtering)
|
||||
- [Handling Iframe Content](#handling-iframe-content)
|
||||
- [Structured Content Selection Using LLMs](#structured-content-selection-using-llms)
|
||||
- [Pattern-Based Selection](#pattern-based-selection)
|
||||
- [Comprehensive Example: Combining Techniques](#comprehensive-example-combining-techniques)
|
||||
- [Troubleshooting \& Best Practices](#troubleshooting--best-practices)
|
||||
- [Additional Resources](#additional-resources)
|
||||
|
||||
---
|
||||
|
||||
## Introduction & Quick Start
|
||||
|
||||
When crawling websites, you often need to isolate specific parts of a page—such as main article text, product listings, or metadata. Crawl4AI’s content selection features help you fine-tune your crawls to grab exactly what you need, while filtering out unnecessary elements.
|
||||
|
||||
**Quick Start Example:** Here’s a minimal example that extracts the main article content from a page:
|
||||
|
||||
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def quick_start():
    config = CrawlerRunConfig(css_selector=".main-article")
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://crawl4ai.com", config=config)
        print(result.extracted_content)
```
|
||||
|
||||
This snippet sets a simple CSS selector to focus on the main article area of a webpage. You can build from here, adding more advanced strategies as needed.
|
||||
|
||||
---
|
||||
|
||||
## CSS Selectors
|
||||
|
||||
**What are they?**
|
||||
CSS selectors let you target specific parts of a webpage’s HTML. If you can identify a unique CSS selector (such as `.main-article`, `article h1`, or `.product-listing > li`), you can precisely control what parts of the page are extracted.
|
||||
|
||||
**How to find selectors:**
|
||||
1. Open the page in your browser.
|
||||
2. Use browser dev tools (e.g., Chrome DevTools: right-click → "Inspect") to locate the elements you want.
|
||||
3. Copy the CSS selector for that element.
|
||||
|
||||
**Example:**
|
||||
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def extract_heading_and_content(url):
    config = CrawlerRunConfig(css_selector="article h1, article .content")
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        return result.extracted_content
```
|
||||
|
||||
**Tip:** If your extracted content is empty, verify that your CSS selectors match existing elements on the page. Using overly generic selectors can also lead to too much content being extracted.
|
||||
|
||||
---
|
||||
|
||||
## Video and Audio Content
|
||||
|
||||
The library extracts video and audio elements with their metadata:
|
||||
|
||||
```python
from crawl4ai.async_configs import CrawlerRunConfig

config = CrawlerRunConfig()
result = await crawler.arun(url="https://example.com", config=config)

# Process videos
for video in result.media["videos"]:
    print(f"Video source: {video['src']}")
    print(f"Type: {video['type']}")
    print(f"Duration: {video.get('duration')}")
    print(f"Thumbnail: {video.get('poster')}")

# Process audio
for audio in result.media["audios"]:
    print(f"Audio source: {audio['src']}")
    print(f"Type: {audio['type']}")
    print(f"Duration: {audio.get('duration')}")
```
|
||||
|
||||
## Link Analysis
|
||||
|
||||
Crawl4AI provides sophisticated link analysis capabilities, helping you understand the relationship between pages and identify important navigation patterns.
|
||||
|
||||
### Link Classification
|
||||
|
||||
The library automatically categorizes links into:
|
||||
- Internal links (same domain)
|
||||
- External links (different domains)
|
||||
- Social media links
|
||||
- Navigation links
|
||||
- Content links
|
||||
|
||||
```python
from crawl4ai.async_configs import CrawlerRunConfig

config = CrawlerRunConfig()
result = await crawler.arun(url="https://example.com", config=config)

# Analyze internal links
for link in result.links["internal"]:
    print(f"Internal: {link['href']}")
    print(f"Link text: {link['text']}")
    print(f"Context: {link['context']}")  # Surrounding text
    print(f"Type: {link['type']}")  # nav, content, etc.

# Analyze external links
for link in result.links["external"]:
    print(f"External: {link['href']}")
    print(f"Domain: {link['domain']}")
    print(f"Type: {link['type']}")
```
|
||||
|
||||
### Smart Link Filtering
|
||||
|
||||
Control which links are included in the results with `CrawlerRunConfig`:
|
||||
|
||||
```python
|
||||
config = CrawlerRunConfig(
|
||||
exclude_external_links=True, # Remove external links
|
||||
exclude_social_media_links=True, # Remove social media links
|
||||
exclude_social_media_domains=[ # Custom social media domains
|
||||
"facebook.com", "twitter.com", "instagram.com"
|
||||
],
|
||||
exclude_domains=["ads.example.com"] # Exclude specific domains
|
||||
)
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
```
|
||||
|
||||
## Metadata Extraction
|
||||
|
||||
Crawl4AI automatically extracts and processes page metadata, providing valuable information about the content:
|
||||
|
||||
```python
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
|
||||
config = CrawlerRunConfig()
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
|
||||
metadata = result.metadata
|
||||
print(f"Title: {metadata['title']}")
|
||||
print(f"Description: {metadata['description']}")
|
||||
print(f"Keywords: {metadata['keywords']}")
|
||||
print(f"Author: {metadata['author']}")
|
||||
print(f"Published Date: {metadata['published_date']}")
|
||||
print(f"Modified Date: {metadata['modified_date']}")
|
||||
print(f"Language: {metadata['language']}")
|
||||
```
|
||||
|
||||
|
||||
|
||||
## Content Filtering
|
||||
|
||||
Crawl4AI provides content filtering parameters to exclude unwanted elements and ensure that you only get meaningful data. For instance, you can remove navigation bars, ads, or other non-essential parts of the page.
|
||||
|
||||
**Key Parameters:**
|
||||
- `word_count_threshold`: Minimum word count per extracted block. Helps skip short or irrelevant snippets.
|
||||
- `excluded_tags`: List of HTML tags to omit (e.g., `['form', 'header', 'footer', 'nav']`).
|
||||
- `exclude_external_links`: Strips out links pointing to external domains.
|
||||
- `exclude_social_media_links`: Removes common social media links or widgets.
|
||||
- `exclude_external_images`: Filters out images hosted on external domains.
|
||||
|
||||
**Example:**
|
||||
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def filtered_extraction(url):
    config = CrawlerRunConfig(
        word_count_threshold=10,
        excluded_tags=['form', 'header', 'footer', 'nav'],
        exclude_external_links=True,
        exclude_social_media_links=True,
        exclude_external_images=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        return result.extracted_content
```
|
||||
|
||||
**Best Practice:** Start with a minimal set of exclusions and increase them as needed. If you notice no content is extracted, try lowering `word_count_threshold` or removing certain excluded tags.
|
||||
|
||||
---
|
||||
|
||||
## Handling Iframe Content
|
||||
|
||||
If a page embeds content in iframes (such as videos, maps, or third-party widgets), you may need to enable iframe processing. This ensures that Crawl4AI loads and extracts content displayed inside iframes.
|
||||
|
||||
**How to enable:**
|
||||
- Set `process_iframes=True` in your `CrawlerRunConfig` to process iframe content.
|
||||
- Use `remove_overlay_elements=True` to discard popups or modals that might block iframe content.
|
||||
|
||||
**Example:**
|
||||
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def extract_iframe_content(url):
    config = CrawlerRunConfig(
        process_iframes=True,
        remove_overlay_elements=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        return result.extracted_content
```
|
||||
|
||||
**Troubleshooting:**
|
||||
- If iframe content doesn’t load, ensure the iframe’s origin is allowed and that you have no network-related issues. Check the logs or consider using a browser-based strategy that supports multi-domain requests.
|
||||
|
||||
---
|
||||
|
||||
## Structured Content Selection Using LLMs
|
||||
|
||||
For more complex extraction tasks (e.g., summarizing content, extracting structured data like titles and key points), you can integrate LLMs. LLM-based extraction strategies let you define a schema and provide instructions to an LLM so it returns structured, JSON-formatted results.
|
||||
|
||||
**When to use LLM-based strategies:**
|
||||
- Extracting complex structures not easily captured by simple CSS selectors.
|
||||
- Summarizing or transforming data.
|
||||
- Handling varied, unpredictable page layouts.
|
||||
|
||||
**Example with an LLMExtractionStrategy:**
|
||||
```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from pydantic import BaseModel
from typing import List
import json

class ArticleContent(BaseModel):
    title: str
    main_points: List[str]
    conclusion: str

async def extract_article_with_llm(url):
    strategy = LLMExtractionStrategy(
        provider="ollama/nemotron",
        schema=ArticleContent.schema(),
        instruction="Extract the main article title, key points, and conclusion"
    )

    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        article = json.loads(result.extracted_content)
        return article
```
|
||||
|
||||
**Tips for LLM-based extraction:**
|
||||
- Refine your prompt in `instruction` to guide the LLM towards the desired structure.
|
||||
- If results are incomplete or incorrect, consider adjusting the schema or adding more context to the instruction.
|
||||
- Check for errors and handle edge cases where the LLM might not find certain fields.
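
The last tip can be sketched without calling any LLM. The helper below parses the JSON string that an extraction run is expected to produce and fills in any missing fields with defaults. The name `parse_article`, the default values, and the assumption that a list payload holds one dict per block are illustrative choices, not part of Crawl4AI; the field names mirror the `ArticleContent` model above.

```python
import json

def _defaults():
    # Fresh defaults on every call so callers never share the same list.
    return {"title": "", "main_points": [], "conclusion": ""}

def parse_article(raw):
    """Parse an extraction payload defensively (illustrative helper)."""
    if not raw:
        return _defaults()
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return _defaults()
    # Assumption: a list payload holds one dict per block; use the first dict.
    if isinstance(data, list):
        data = next((item for item in data if isinstance(item, dict)), {})
    if not isinstance(data, dict):
        return _defaults()
    # Keep known fields only, falling back to the default when one is absent.
    return {key: data.get(key, default) for key, default in _defaults().items()}
```

In the example above, `parse_article(result.extracted_content)` could stand in for the bare `json.loads` call when a missing field should not raise.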
|
||||
|
||||
---
|
||||
|
||||
## Pattern-Based Selection
|
||||
|
||||
When dealing with repetitive, structured patterns (like a list of articles or products), you can use `JsonCssExtractionStrategy` to define a JSON schema that maps selectors to specific fields.
|
||||
|
||||
**Use Cases:**
|
||||
- News article listings, product grids, directory entries.
|
||||
- Extract multiple items that follow a similar structure on the same page.
|
||||
|
||||
**Example JSON Schema Extraction:**
|
||||
```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
import json

schema = {
    "name": "News Articles",
    "baseSelector": "article.news-item",
    "fields": [
        {"name": "headline", "selector": "h2", "type": "text"},
        {"name": "summary", "selector": ".summary", "type": "text"},
        {"name": "category", "selector": ".category", "type": "text"},
        {
            "name": "metadata",
            "type": "nested",
            "fields": [
                {"name": "author", "selector": ".author", "type": "text"},
                {"name": "date", "selector": ".date", "type": "text"}
            ]
        }
    ]
}

async def extract_news_items(url):
    strategy = JsonCssExtractionStrategy(schema)
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        articles = json.loads(result.extracted_content)
        return articles
```
|
||||
|
||||
**Maintenance Tip:** If the site’s structure changes, update your schema accordingly. Test small changes to ensure the extracted structure still matches your expectations.
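
One way to act on this tip is to compare each extracted item against the field names declared in the schema. The sketch below is plain Python and does not call Crawl4AI; `expected_keys`, `missing_fields`, and the sample items are hypothetical names and data introduced only for illustration, and it assumes extracted items are dicts keyed by field `name`.

```python
def expected_keys(schema):
    """Collect field names from a schema dict, following nested fields."""
    keys = {}
    for field in schema.get("fields", []):
        if field.get("type") == "nested":
            keys[field["name"]] = expected_keys(field)
        else:
            keys[field["name"]] = None
    return keys

def missing_fields(item, keys, prefix=""):
    """Return dotted paths of schema fields absent from one extracted item."""
    missing = []
    for name, nested in keys.items():
        path = f"{prefix}{name}"
        if name not in item:
            missing.append(path)
        elif nested is not None and isinstance(item[name], dict):
            missing.extend(missing_fields(item[name], nested, prefix=f"{path}."))
    return missing

# Trimmed version of the schema above, plus made-up sample output.
sample_schema = {
    "baseSelector": "article.news-item",
    "fields": [
        {"name": "headline", "selector": "h2", "type": "text"},
        {"name": "metadata", "type": "nested", "fields": [
            {"name": "author", "selector": ".author", "type": "text"},
            {"name": "date", "selector": ".date", "type": "text"},
        ]},
    ],
}
sample_items = [
    {"headline": "H", "metadata": {"author": "A"}},
    {"metadata": {"author": "A", "date": "D"}},
]
report = [missing_fields(item, expected_keys(sample_schema)) for item in sample_items]
print(report)
```

Running a check like this after a site redesign shows which selectors stopped matching before the gaps reach downstream code.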
|
||||
|
||||
---
|
||||
|
||||
## Comprehensive Example: Combining Techniques
|
||||
|
||||
Below is a more involved example that demonstrates combining multiple strategies and filtering parameters. Here, we extract structured article content from an `article.main` section, exclude unnecessary elements, and enforce a word count threshold.
|
||||
|
||||
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
import json

async def extract_article_content(url: str):
    # Schema for structured extraction
    article_schema = {
        "name": "Article",
        "baseSelector": "article.main",
        "fields": [
            {"name": "title", "selector": "h1", "type": "text"},
            {"name": "content", "selector": ".content", "type": "text"}
        ]
    }

    config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(article_schema),
        word_count_threshold=10,
        excluded_tags=['nav', 'footer'],
        exclude_external_links=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        extracted = json.loads(result.extracted_content)
        return extracted
```
|
||||
|
||||
**Expanding This Example:**
|
||||
- Add pagination logic to handle multi-page extractions.
|
||||
- Introduce LLM-based extraction for a summary of the article’s main points.
|
||||
- Adjust filtering parameters to refine what content is included or excluded.
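
As a starting point for the pagination idea, the sketch below generates one URL per page using only the standard library. It assumes the target site exposes pages through a `page` query parameter, which is a common pattern but not something this guide confirms for any particular site; `page_urls` is a hypothetical helper, not a Crawl4AI function.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse

def page_urls(base_url, pages, param="page"):
    """Yield base_url with the page query parameter set to 1..pages."""
    parts = urlparse(base_url)
    query = dict(parse_qsl(parts.query))
    for number in range(1, pages + 1):
        query[param] = str(number)
        # Rebuild the URL, keeping any query parameters already present.
        yield urlunparse(parts._replace(query=urlencode(query)))

urls = list(page_urls("https://example.com/news?sort=new", 2))
print(urls)
```

Each generated URL can then be passed to `extract_article_content` from the example above, one call per page.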
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting & Best Practices
|
||||
|
||||
**Common Issues & Fixes:**
|
||||
- **Empty extraction result:**
|
||||
- Verify CSS selectors and filtering parameters.
|
||||
- Lower or remove `word_count_threshold` to see if overly strict criteria are filtering everything out.
|
||||
- Check network requests or iframe settings if content is loaded dynamically.
|
||||
|
||||
- **Unintended content included:**
|
||||
- Add more tags to `excluded_tags`, or refine your CSS selectors.
|
||||
- Use `exclude_external_links` and other filters to clean up results.
|
||||
|
||||
- **LLM extraction errors:**
|
||||
- Ensure the schema matches the expected JSON structure.
|
||||
- Refine the `instruction` prompt to guide the LLM more clearly.
|
||||
- Validate LLM provider configuration and error logs.
|
||||
|
||||
**Performance Tips:**
|
||||
- Start with simpler strategies (basic CSS selectors) before moving to advanced LLM-based extraction.
|
||||
- Use caching or asynchronous crawling to handle large numbers of pages efficiently.
|
||||
- Consider running headless browser extractions in Docker for consistent, reproducible environments.
|
||||
|
||||
---
|
||||
|
||||
## Additional Resources
|
||||
|
||||
- **GitHub Source Files:**
|
||||
- [Async Web Crawler Implementation](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_webcrawler.py)
|
||||
- [Async Crawler Strategy Implementation](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_crawler_strategy.py)
|
||||
|
||||
- **Advanced Topics:**
|
||||
- Dockerized deployments for reproducible scraping environments.
|
||||
- Integration with caching or proxy services for large-scale crawls.
|
||||
- Expanding LLM strategies to perform complex transformations or summarizations.
|
||||
|
||||
Use these links and approaches as a starting point to refine your crawling strategies. With Crawl4AI’s flexible configuration and powerful selection methods, you’ll be able to extract exactly the content you need—no more, no less.
|
||||
@@ -1,75 +1,12 @@
|
||||
### Hypothetical Questions
|
||||
|
||||
1. **Basic Content Selection**
|
||||
- *"How can I use a CSS selector to extract only the main article text from a webpage?"*
|
||||
- *"What’s a quick way to isolate a specific element or section of a webpage using Crawl4AI?"*
|
||||
|
||||
2. **Advanced CSS Selectors**
|
||||
- *"How do I find the right CSS selector for a given element in a complex webpage?"*
|
||||
- *"Can I combine multiple CSS selectors to target different parts of the page simultaneously?"*
|
||||
|
||||
3. **Content Filtering**
|
||||
- *"What parameters can I use to remove non-essential elements like headers, footers, or ads?"*
|
||||
- *"How do I filter out short or irrelevant text blocks using `word_count_threshold`?"*
|
||||
- *"Is it possible to exclude external links, images, or social media widgets to get cleaner data?"*
|
||||
|
||||
4. **Iframe Content Handling**
|
||||
- *"How do I enable iframe processing to extract content embedded in iframes?"*
|
||||
- *"What should I do if the iframe content doesn’t load or is blocked?"*
|
||||
|
||||
5. **LLM-Based Structured Extraction**
|
||||
- *"When should I consider using LLM strategies for content extraction?"*
|
||||
- *"How can I define a JSON schema for the LLM to produce structured, JSON-formatted outputs?"*
|
||||
- *"What if the LLM returns incomplete or incorrect data—how can I refine the instructions or schema?"*
|
||||
|
||||
6. **Pattern-Based Selection with JSON Strategies**
|
||||
- *"How can I extract multiple items (e.g., a list of articles or products) from a page using `JsonCssExtractionStrategy`?"*
|
||||
- *"What’s the best way to handle nested fields or multiple levels of data using a JSON schema?"*
|
||||
|
||||
7. **Combining Multiple Techniques**
|
||||
- *"How do I use CSS selectors, content filtering, and JSON-based extraction strategies together to get clean, structured data?"*
|
||||
- *"Can I integrate LLM extraction for summarization alongside CSS-based extraction for raw content?"*
|
||||
|
||||
8. **Troubleshooting and Best Practices**
|
||||
- *"Why am I getting empty or no results from my selectors, and how can I debug it?"*
|
||||
- *"What should I do if content loading is dynamic and requires waiting or JS execution?"*
|
||||
- *"How can I optimize performance and reliability for large-scale or repeated crawls?"*
|
||||
|
||||
9. **Performance and Reliability**
|
||||
- *"How can I improve crawl speed while maintaining precision in content selection?"*
|
||||
- *"What’s the benefit of using Dockerized environments for consistent and reproducible results?"*
|
||||
|
||||
10. **Additional Resources and Extensions**
|
||||
- *"Where can I find the source code for the Async Web Crawler and strategies?"*
|
||||
- *"What advanced topics, such as caching, proxy integration, or Docker deployments, can I explore next?"*
|
||||
|
||||
### Topics Discussed in the File
|
||||
|
||||
- **CSS Selectors for Content Isolation**:
|
||||
Identifying elements with CSS selectors, using browser dev tools, and extracting targeted sections of a webpage.
|
||||
|
||||
- **Content Filtering Parameters**:
|
||||
Removing unwanted tags, external links, social media elements, and enforcing minimum word counts to ensure meaningful content.
|
||||
|
||||
- **Handling Iframes**:
|
||||
Enabling `process_iframes` and dealing with multi-domain or overlay elements to extract embedded content.
|
||||
|
||||
- **Structured Extraction with LLMs**:
|
||||
Using `LLMExtractionStrategy` with schemas and instructions for complex or irregular data extraction, including JSON-based outputs.
|
||||
|
||||
- **Pattern-Based Extraction Using Schemas (JsonCssExtractionStrategy)**:
|
||||
Defining a JSON schema to extract lists of items (e.g., articles, products) that follow a consistent pattern, capturing nested fields and attributes.
|
||||
|
||||
- **Combining Techniques**:
|
||||
Integrating CSS selection, filtering, JSON schema extraction, and LLM-based transformation to get clean, structured, and context-rich results.
|
||||
|
||||
- **Troubleshooting and Best Practices**:
|
||||
Adjusting selectors, filters, and instructions, lowering thresholds if empty results occur, and refining LLM prompts for better data.
|
||||
|
||||
- **Performance and Reliability**:
|
||||
Starting with simple strategies, adding complexity as needed, and considering asynchronous crawling, caching, or Docker for large-scale operations.
|
||||
|
||||
- **Additional Resources**:
|
||||
Links to code repositories, instructions for Docker deployments, caching strategies, and further refinement for advanced use cases.
|
||||
|
||||
In summary, the file provides comprehensive guidance on selecting and filtering content within Crawl4AI, covering everything from simple CSS-based extractions to advanced LLM-driven structured outputs, while also addressing common issues, best practices, and performance optimizations.
|
||||
content_selection: Crawl4AI allows precise selection and filtering of webpage content | web scraping, content extraction, web crawler | CrawlerRunConfig(css_selector=".main-article")
|
||||
css_selectors: Target specific webpage elements using CSS selectors like .main-article or article h1 | DOM selection, HTML elements, element targeting | CrawlerRunConfig(css_selector="article h1, article .content")
|
||||
media_extraction: Extract video and audio elements with metadata including source, type, and duration | multimedia content, media files | result.media["videos"], result.media["audios"]
|
||||
link_analysis: Automatically categorize links into internal, external, social media, navigation, and content links | link classification, URL analysis | result.links["internal"], result.links["external"]
|
||||
link_filtering: Control which links are included using exclude parameters | link exclusion, domain filtering | CrawlerRunConfig(exclude_external_links=True, exclude_social_media_links=True)
|
||||
metadata_extraction: Automatically extract page metadata including title, description, keywords, and dates | page information, meta tags | result.metadata['title'], result.metadata['description']
|
||||
content_filtering: Remove unwanted elements using word count threshold and excluded tags | content cleanup, element removal | CrawlerRunConfig(word_count_threshold=10, excluded_tags=['form', 'header'])
|
||||
iframe_handling: Process content within iframes by enabling iframe processing and overlay removal | embedded content, frames | CrawlerRunConfig(process_iframes=True, remove_overlay_elements=True)
|
||||
llm_extraction: Use LLMs for complex content extraction with structured output | AI extraction, structured data | LLMExtractionStrategy(provider="ollama/nemotron", schema=ArticleContent.schema())
|
||||
pattern_extraction: Extract repetitive content patterns using JSON schema mapping | structured extraction, repeated elements | JsonCssExtractionStrategy(schema)
|
||||
troubleshooting: Common issues include empty results, unintended content, and LLM errors | debugging, error handling | config.word_count_threshold, excluded_tags
|
||||
best_practices: Start with simple selectors before advanced strategies and use caching for efficiency | optimization, performance | AsyncWebCrawler().arun(url=url, config=config)
|
||||
@@ -1,130 +0,0 @@
|
||||
# Crawl4AI Content Selection (LLM-Friendly Reference)
|
||||
|
||||
> Minimal, code-oriented reference for selecting and filtering webpage content using Crawl4AI.
|
||||
|
||||
## Quick Start
|
||||
|
||||
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def run():
    config = CrawlerRunConfig(css_selector=".main-article")
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        print(result.extracted_content)
```
|
||||
|
||||
## CSS Selectors
|
||||
|
||||
- Use `css_selector="selector"` to target specific content.
|
||||
|
||||
```python
|
||||
config = CrawlerRunConfig(css_selector="article h1, article .content")
|
||||
result = await crawler.arun(url="...", config=config)
|
||||
```
|
||||
|
||||
## Content Filtering
|
||||
|
||||
- `word_count_threshold`: int
|
||||
- `excluded_tags`: list of tags
|
||||
- `exclude_external_links`: bool
|
||||
- `exclude_social_media_links`: bool
|
||||
- `exclude_external_images`: bool
|
||||
|
||||
```python
|
||||
config = CrawlerRunConfig(
|
||||
word_count_threshold=10,
|
||||
excluded_tags=["form","header","footer","nav"],
|
||||
exclude_external_links=True,
|
||||
exclude_social_media_links=True,
|
||||
exclude_external_images=True
|
||||
)
|
||||
```
|
||||
|
||||
## Iframe Content
|
||||
|
||||
- `process_iframes`: bool
|
||||
- `remove_overlay_elements`: bool
|
||||
|
||||
```python
|
||||
config = CrawlerRunConfig(
|
||||
process_iframes=True,
|
||||
remove_overlay_elements=True
|
||||
)
|
||||
```
|
||||
|
||||
## LLM-Based Extraction
|
||||
|
||||
- Use `LLMExtractionStrategy(provider="...")` with `schema=...` and `instruction="..."`
|
||||
|
||||
```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel

class ArticleContent(BaseModel):
    title: str
    main_points: list[str]
    conclusion: str

strategy = LLMExtractionStrategy(
    provider="ollama/nemotron",
    schema=ArticleContent.schema(),
    instruction="Extract title, points, conclusion"
)

config = CrawlerRunConfig(extraction_strategy=strategy)
```
|
||||
|
||||
## Pattern-Based Selection (JsonCssExtractionStrategy)
|
||||
|
||||
```python
|
||||
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
|
||||
|
||||
schema = {
|
||||
"name": "News Articles",
|
||||
"baseSelector": "article.news-item",
|
||||
"fields": [
|
||||
{"name":"headline","selector":"h2","type":"text"},
|
||||
{"name":"summary","selector":".summary","type":"text"},
|
||||
{"name":"category","selector":".category","type":"text"},
|
||||
{
|
||||
"name":"metadata",
|
||||
"type":"nested",
|
||||
"fields":[
|
||||
{"name":"author","selector":".author","type":"text"},
|
||||
{"name":"date","selector":".date","type":"text"}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))
|
||||
```
|
||||
|
||||
## Combined Example
|
||||
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
|
||||
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
|
||||
|
||||
article_schema = {
|
||||
"name":"Article",
|
||||
"baseSelector":"article.main",
|
||||
"fields":[
|
||||
{"name":"title","selector":"h1","type":"text"},
|
||||
{"name":"content","selector":".content","type":"text"}
|
||||
]
|
||||
}
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
extraction_strategy=JsonCssExtractionStrategy(article_schema),
|
||||
word_count_threshold=10,
|
||||
excluded_tags=["nav","footer"],
|
||||
exclude_external_links=True
|
||||
)
|
||||
```
|
||||
|
||||
## Optional
|
||||
|
||||
- [async_webcrawler.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_webcrawler.py)
|
||||
- [async_crawler_strategy.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_crawler_strategy.py)
|
||||
@@ -1,58 +1,10 @@
|
||||
### Hypothetical Questions
|
||||
|
||||
1. **General Understanding of the New Caching System**
|
||||
- *"Why did Crawl4AI move from boolean cache flags to a `CacheMode` enum?"*
|
||||
- *"What are the benefits of using a single `CacheMode` enum over multiple booleans?"*
|
||||
|
||||
2. **CacheMode Usage**
|
||||
- *"What `CacheMode` should I use if I want normal caching (both read and write)?"*
|
||||
- *"How do I enable a mode that only reads from cache, or only writes to cache?"*
|
||||
- *"What does `CacheMode.BYPASS` do, and how is it different from `CacheMode.DISABLED`?"*
|
||||
|
||||
3. **Migrating from Old to New System**
|
||||
- *"How do I translate `bypass_cache=True` to the new `CacheMode` approach?"*
|
||||
- *"I used to set `disable_cache=True`; what `CacheMode` should I use now?"*
|
||||
- *"If I previously used `no_cache_read=True`, how do I achieve the same effect with `CacheMode`?"*
|
||||
|
||||
4. **Implementation Details**
|
||||
- *"How do I specify the `CacheMode` in my crawler runs?"*
|
||||
- *"Can I pass the `CacheMode` to `arun` directly, or do I need a `CrawlerRunConfig` object?"*
|
||||
|
||||
5. **Suppressing Deprecation Warnings**
|
||||
- *"How can I temporarily disable deprecation warnings while I migrate my code?"*
|
||||
|
||||
6. **Edge Cases and Best Practices**
|
||||
- *"What if I forget to update my code and still use the old flags?"*
|
||||
- *"Is there a `CacheMode` for scenarios where I want to only write to cache and never read old data?"*
|
||||
|
||||
7. **Examples and Code Snippets**
|
||||
- *"Can I see a side-by-side comparison of old and new caching code for a given URL?"*
|
||||
- *"How can I confirm that using `CacheMode.BYPASS` skips both reading and writing cache?"*
|
||||
|
||||
8. **Performance and Reliability**
|
||||
- *"Will switching to `CacheMode` improve my code’s readability and reduce confusion?"*
|
||||
- *"Can the new caching system still handle large-scale crawling scenarios efficiently?"*
|
||||
|
||||
### Topics Discussed in the File
|
||||
|
||||
- **Old vs. New Caching Approach**:
|
||||
Previously, multiple boolean flags (`bypass_cache`, `disable_cache`, `no_cache_read`, `no_cache_write`) controlled caching. Now, a single `CacheMode` enum simplifies configuration.
|
||||
|
||||
- **CacheMode Enum**:
|
||||
Provides clear modes:
|
||||
- `ENABLED`: Normal caching (read and write)
|
||||
- `DISABLED`: No caching at all
|
||||
- `READ_ONLY`: Only read from cache, don’t write new data
|
||||
- `WRITE_ONLY`: Only write to cache, don’t read old data
|
||||
- `BYPASS`: Skip cache entirely for this operation
|
||||
|
||||
- **Migration Patterns**:
|
||||
A simple mapping table helps developers switch old boolean flags to the corresponding `CacheMode` value.
|
||||
|
||||
- **Suppressing Deprecation Warnings**:
|
||||
Temporarily disabling deprecation warnings provides a grace period to update old code.
|
||||
|
||||
- **Code Examples**:
|
||||
Side-by-side comparisons show how to update code from old flags to the new `CacheMode` approach.
|
||||
|
||||
In summary, the file guides developers in transitioning from the old caching boolean flags to the new `CacheMode` enum, explaining the rationale, providing a mapping table, and offering code snippets to facilitate a smooth migration.
|
||||
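The flag-to-mode mapping described above can be sketched as a small helper. The `CacheMode` class here is a local stand-in that mirrors the five listed modes so the snippet runs without Crawl4AI installed; `legacy_flags_to_mode` is a hypothetical name, and the precedence applied when several legacy flags are set at once is an assumption of this sketch, not documented library behavior.

```python
from enum import Enum


class CacheMode(Enum):
    """Local stand-in mirroring the five modes listed above (not the library class)."""
    ENABLED = "enabled"
    DISABLED = "disabled"
    READ_ONLY = "read_only"
    WRITE_ONLY = "write_only"
    BYPASS = "bypass"


def legacy_flags_to_mode(
    bypass_cache: bool = False,
    disable_cache: bool = False,
    no_cache_read: bool = False,
    no_cache_write: bool = False,
) -> CacheMode:
    """Translate the old boolean flags into a single CacheMode value."""
    # Assumption: when several flags are set, the checks below decide precedence.
    if disable_cache:
        return CacheMode.DISABLED
    if bypass_cache:
        return CacheMode.BYPASS
    if no_cache_read:
        return CacheMode.WRITE_ONLY  # never read old data, still write
    if no_cache_write:
        return CacheMode.READ_ONLY   # read existing data, never write
    return CacheMode.ENABLED


print(legacy_flags_to_mode(bypass_cache=True))
```

With the real library, the returned member name would correspond to the value passed as `cache_mode` in `CrawlerRunConfig`.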
cache_system: Crawl4AI v0.5.0 introduces CacheMode enum to replace boolean cache flags | caching system, cache control, cache configuration | CacheMode.ENABLED
|
||||
cache_modes: CacheMode enum supports five states: ENABLED, DISABLED, READ_ONLY, WRITE_ONLY, and BYPASS | cache states, caching options, cache settings | CacheMode.ENABLED, CacheMode.DISABLED, CacheMode.READ_ONLY, CacheMode.WRITE_ONLY, CacheMode.BYPASS
|
||||
cache_migration_bypass: Replace bypass_cache=True with cache_mode=CacheMode.BYPASS | skip cache, bypass caching | cache_mode=CacheMode.BYPASS
|
||||
cache_migration_disable: Replace disable_cache=True with cache_mode=CacheMode.DISABLED | disable caching, turn off cache | cache_mode=CacheMode.DISABLED
|
||||
cache_migration_read: Replace no_cache_read=True with cache_mode=CacheMode.WRITE_ONLY | write-only cache, disable read | cache_mode=CacheMode.WRITE_ONLY
|
||||
cache_migration_write: Replace no_cache_write=True with cache_mode=CacheMode.READ_ONLY | read-only cache, disable write | cache_mode=CacheMode.READ_ONLY
|
||||
crawler_config: Use CrawlerRunConfig to set cache mode in AsyncWebCrawler | crawler settings, configuration object | CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
|
||||
deprecation_warnings: Suppress cache deprecation warnings by setting SHOW_DEPRECATION_WARNINGS to False | warning suppression, legacy support | SHOW_DEPRECATION_WARNINGS = False
|
||||
async_crawler_usage: AsyncWebCrawler requires async/await syntax and supports configuration via CrawlerRunConfig | async crawler, web crawler setup | async with AsyncWebCrawler(verbose=True) as crawler
|
||||
crawler_execution: Run AsyncWebCrawler using asyncio.run() in main script | crawler execution, async main | asyncio.run(main())
|
||||
@@ -1,191 +0,0 @@
|
||||
import os
|
||||
from pathlib import Path
|
||||
from rank_bm25 import BM25Okapi
|
||||
import re
|
||||
from typing import List, Literal
|
||||
|
||||
from nltk.tokenize import word_tokenize
|
||||
from nltk.corpus import stopwords
|
||||
from nltk.stem import WordNetLemmatizer
|
||||
import nltk
|
||||
|
||||
|
||||
BASE_PATH = Path(__file__).resolve().parent
|
||||
|
||||
def get_file_map() -> dict:
|
||||
"""Cache file mappings to avoid repeated directory scans"""
|
||||
files = os.listdir(BASE_PATH)
|
||||
file_map = {}
|
||||
|
||||
for file in files:
|
||||
if file.endswith('.md'):
|
||||
# Extract number and name: "6_chunking_strategies.md" -> ("6", "chunking_strategies")
|
||||
match = re.match(r'(\d+)_(.+?)(?:\.(?:ex|xs|sm|q))?\.md$', file)
|
||||
if match:
|
||||
num, name = match.groups()
|
||||
if name not in file_map:
|
||||
file_map[name] = num
|
||||
return file_map
|
||||
|
||||
def concatenate_docs(file_names: List[str], mode: Literal["extended", "condensed"]) -> str:
|
||||
"""Concatenate documentation files based on names and mode."""
|
||||
file_map = get_file_map()
|
||||
result = []
|
||||
suffix_map = {
|
||||
"extended": ".ex.md",
|
||||
"condensed": [".xs.md", ".sm.md"]
|
||||
}
|
||||
|
||||
for name in file_names:
|
||||
if name not in file_map:
|
||||
continue
|
||||
|
||||
num = file_map[name]
|
||||
base_path = BASE_PATH
|
||||
|
||||
if mode == "extended":
|
||||
file_path = base_path / f"{num}_{name}{suffix_map[mode]}"
|
||||
if not file_path.exists():
|
||||
file_path = base_path / f"{num}_{name}.md"
|
||||
else:
|
||||
file_path = None
|
||||
for suffix in suffix_map["condensed"]:
|
||||
temp_path = base_path / f"{num}_{name}{suffix}"
|
||||
if temp_path.exists():
|
||||
file_path = temp_path
|
||||
break
|
||||
if not file_path:
|
||||
file_path = base_path / f"{num}_{name}.md"
|
||||
|
||||
if file_path.exists():
|
||||
with open(file_path, 'r', encoding='utf-8') as f:
|
||||
result.append(f.read())
|
||||
|
||||
return "\n\n---\n\n".join(result)
|
||||
|
||||
def extract_questions(content: str) -> List[tuple[str, str, str]]:
|
||||
"""
|
||||
Extract questions from Q files, returning list of (category, question, full_section).
|
||||
"""
|
||||
# Split into main sections (### Questions or ### Hypothetical Questions)
|
||||
sections = re.split(r'^###\s+.*Questions\s*$', content, flags=re.MULTILINE)[1:]
|
||||
|
||||
results = []
|
||||
for section in sections:
|
||||
# Find all numbered categories (1. **Category Name**)
|
||||
categories = re.split(r'^\d+\.\s+\*\*([^*]+)\*\*\s*$', section.strip(), flags=re.MULTILINE)
|
||||
|
||||
# Process each category
|
||||
for i in range(1, len(categories), 2):
|
||||
category = categories[i].strip()
|
||||
category_content = categories[i+1].strip()
|
||||
|
||||
# Extract questions (lines starting with dash and wrapped in italics)
|
||||
questions = re.findall(r'^\s*-\s*\*"([^"]+)"\*\s*$', category_content, flags=re.MULTILINE)
|
||||
|
||||
# Add each question with its category and full context
|
||||
for q in questions:
|
||||
results.append((category, q, f"Category: {category}\nQuestion: {q}"))
|
||||
|
||||
return results
|
||||
|
||||
def preprocess_text(text: str) -> List[str]:
|
||||
"""Preprocess text for better semantic matching"""
|
||||
# Lowercase and tokenize
|
||||
tokens = word_tokenize(text.lower())
|
||||
|
||||
# Remove stopwords but keep question words
|
||||
stop_words = set(stopwords.words('english')) - {'how', 'what', 'when', 'where', 'why', 'which'}
|
||||
lemmatizer = WordNetLemmatizer()
|
||||
|
||||
# Lemmatize but preserve original form for technical terms
|
||||
tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]
|
||||
|
||||
return tokens
|
||||
|
||||
def search_questions(query: str, top_k: int = 5) -> str:
|
||||
"""Search through Q files using BM25 ranking and return top K matches."""
|
||||
q_files = [f for f in os.listdir(BASE_PATH) if f.endswith(".q.md")]
|
||||
# Prepare base path for file reading
|
||||
q_files = [BASE_PATH / f for f in q_files] # Convert to full path
|
||||
|
||||
documents = []
|
||||
file_contents = {}
|
||||
|
||||
for file in q_files:
|
||||
with open(file, 'r', encoding='utf-8') as f:
|
||||
content = f.read()
|
||||
questions = extract_questions(content)
|
||||
for category, question, full_section in questions:
|
||||
documents.append(question)
|
||||
file_contents[question] = (file, category, full_section)
|
||||
|
||||
if not documents:
|
||||
return "No questions found in documentation."
|
||||
|
||||
tokenized_docs = [preprocess_text(doc) for doc in documents]
|
||||
tokenized_query = preprocess_text(query)
|
||||
|
||||
bm25 = BM25Okapi(tokenized_docs)
|
||||
doc_scores = bm25.get_scores(tokenized_query)
|
||||
|
||||
score_threshold = max(doc_scores) * 0.4
|
||||
|
||||
# Aggregate scores by file
|
||||
file_data = {}
|
||||
for idx, score in enumerate(doc_scores):
|
||||
if score > score_threshold:
|
||||
question = documents[idx]
|
||||
file, category, _ = file_contents[question]
|
||||
|
||||
if file not in file_data:
|
||||
file_data[file] = {
|
||||
'total_score': 0,
|
||||
'match_count': 0,
|
||||
'questions': []
|
||||
}
|
||||
|
||||
file_data[file]['total_score'] += score
|
||||
file_data[file]['match_count'] += 1
|
||||
file_data[file]['questions'].append({
|
||||
'category': category,
|
||||
'question': question,
|
||||
'score': score
|
||||
})
|
||||
|
||||
# Sort files by match count and total score
|
||||
ranked_files = sorted(
|
||||
file_data.items(),
|
||||
key=lambda x: (x[1]['match_count'], x[1]['total_score']),
|
||||
reverse=True
|
||||
)[:top_k]
|
||||
|
||||
# Format results by file
|
||||
results = []
|
||||
for file, data in ranked_files:
|
||||
questions_summary = "\n".join(
|
||||
f"- [{q['category']}] {q['question']} (score: {q['score']:.2f})"
|
||||
for q in sorted(data['questions'], key=lambda x: x['score'], reverse=True)
|
||||
)
|
||||
|
||||
results.append(
|
||||
f"File: {file}\n"
|
||||
f"Match Count: {data['match_count']}\n"
|
||||
f"Total Score: {data['total_score']:.2f}\n\n"
|
||||
f"Matching Questions:\n{questions_summary}"
|
||||
)
|
||||
|
||||
return "\n\n---\n\n".join(results) if results else "No relevant matches found."
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
# Example 1: Concatenate docs
|
||||
docs = concatenate_docs(["chunking_strategies", "content_selection"], "extended")
|
||||
print("Concatenated docs:", docs[:200], "...\n")
|
||||
|
||||
# Example 2: Search questions
|
||||
results = search_questions("How do I execute JS script on the page?", 3)
|
||||
print("Search results:", results[:200], "...")
|
||||
@@ -87,6 +87,20 @@ class AsyncWebCrawler:
|
||||
awarmup(): Perform warmup sequence.
|
||||
arun_many(): Run the crawler for multiple sources.
|
||||
aprocess_html(): Process HTML content.
|
||||
|
||||
Typical Usage:
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
print(result.markdown)
|
||||
|
||||
Using configuration:
|
||||
browser_config = BrowserConfig(browser_type="chromium", headless=True)
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
crawler_config = CrawlerRunConfig(
|
||||
cache_mode=CacheMode.BYPASS
|
||||
)
|
||||
result = await crawler.arun(url="https://example.com", config=crawler_config)
|
||||
print(result.markdown)
|
||||
"""
|
||||
_domain_last_hit = {}
|
||||
|
||||
@@ -257,7 +271,7 @@ class AsyncWebCrawler:
|
||||
screenshot=True,
|
||||
...
|
||||
)
|
||||
|
||||
|
||||
New way (recommended):
|
||||
config = CrawlerRunConfig(
|
||||
word_count_threshold=200,
|
||||
@@ -270,7 +284,7 @@ class AsyncWebCrawler:
|
||||
url: The URL to crawl (http://, https://, file://, or raw:)
|
||||
crawler_config: Configuration object controlling crawl behavior
|
||||
[other parameters maintained for backwards compatibility]
|
||||
|
||||
|
||||
Returns:
|
||||
CrawlResult: The result of crawling and processing
|
||||
"""
|
||||
|
||||
@@ -1,129 +0,0 @@
|
||||
# Download Handling in Crawl4AI
|
||||
|
||||
This guide explains how to use Crawl4AI to handle file downloads during crawling. You'll learn how to trigger downloads, specify download locations, and access downloaded files.
|
||||
|
||||
## Enabling Downloads
|
||||
|
||||
To enable downloads, set the `accept_downloads` parameter in the `BrowserConfig` object and pass it to the crawler.
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.async_configs import BrowserConfig
|
||||
|
||||
async def main():
|
||||
    config = BrowserConfig(accept_downloads=True)  # Enable downloads globally
|
||||
    async with AsyncWebCrawler(config=config) as crawler:
|
||||
        # ... your crawling logic ...
|
||||
        pass
|
||||
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
Or, enable it for a specific crawl by using `CrawlerRunConfig`:
|
||||
|
||||
```python
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
|
||||
async def main():
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
config = CrawlerRunConfig(accept_downloads=True)
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
# ...
|
||||
```
|
||||
|
||||
## Specifying Download Location
|
||||
|
||||
Specify the download directory using the `downloads_path` attribute in the `BrowserConfig` object. If not provided, Crawl4AI defaults to creating a "downloads" directory inside the `.crawl4ai` folder in your home directory.
|
||||
|
||||
```python
|
||||
from crawl4ai.async_configs import BrowserConfig
|
||||
import os
|
||||
|
||||
downloads_path = os.path.join(os.getcwd(), "my_downloads") # Custom download path
|
||||
os.makedirs(downloads_path, exist_ok=True)
|
||||
|
||||
config = BrowserConfig(accept_downloads=True, downloads_path=downloads_path)
|
||||
|
||||
async def main():
|
||||
async with AsyncWebCrawler(config=config) as crawler:
|
||||
result = await crawler.arun(url="https://example.com")
|
||||
# ...
|
||||
```
|
||||
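As a small complement, the directory handling can be wrapped in one helper that falls back to the default location described above. This is a standard-library sketch; `resolve_downloads_path` is a hypothetical helper name, and the returned path is simply what you would pass as `downloads_path`.

```python
from pathlib import Path
from typing import Optional, Union


def resolve_downloads_path(custom: Optional[Union[str, Path]] = None) -> Path:
    """Return the directory to use for downloads, creating it if needed."""
    if custom:
        target = Path(custom).expanduser()
    else:
        # Default described above: a "downloads" folder inside ~/.crawl4ai
        target = Path.home() / ".crawl4ai" / "downloads"
    target.mkdir(parents=True, exist_ok=True)
    return target.resolve()
```

Creating the directory up front avoids a failed crawl caused by a nonexistent path.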
|
||||
## Triggering Downloads
|
||||
|
||||
Downloads are typically triggered by user interactions on a web page, such as clicking a download button. Use `js_code` in `CrawlerRunConfig` to simulate these actions and `wait_for` to allow sufficient time for downloads to start.
|
||||
|
||||
```python
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
js_code="""
|
||||
const downloadLink = document.querySelector('a[href$=".exe"]');
|
||||
if (downloadLink) {
|
||||
downloadLink.click();
|
||||
}
|
||||
""",
|
||||
wait_for=5 # Wait 5 seconds for the download to start
|
||||
)
|
||||
|
||||
result = await crawler.arun(url="https://www.python.org/downloads/", config=config)
|
||||
```
|
||||
|
||||
## Accessing Downloaded Files
|
||||
|
||||
The `downloaded_files` attribute of the `CrawlResult` object contains paths to downloaded files.
|
||||
|
||||
```python
|
||||
if result.downloaded_files:
|
||||
print("Downloaded files:")
|
||||
for file_path in result.downloaded_files:
|
||||
print(f"- {file_path}")
|
||||
file_size = os.path.getsize(file_path)
|
||||
print(f"- File size: {file_size} bytes")
|
||||
else:
|
||||
print("No files downloaded.")
|
||||
```
|
||||
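The same check can be packaged as a reusable helper that tolerates a missing list and skips paths that do not exist on disk. This is a standard-library sketch; `summarize_downloads` is a hypothetical name, and it only assumes that `downloaded_files` is a list of path strings (or `None`), as described above.

```python
import os
from typing import Dict, Iterable, Optional


def summarize_downloads(paths: Optional[Iterable[str]]) -> Dict[str, int]:
    """Map each downloaded file that exists on disk to its size in bytes."""
    sizes = {}
    for path in paths or []:
        # Skip entries that were reported but never written to disk.
        if os.path.isfile(path):
            sizes[path] = os.path.getsize(path)
    return sizes
```

For example, `summarize_downloads(result.downloaded_files)` would return an empty dictionary when nothing was downloaded.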
|
||||
## Example: Downloading Multiple Files
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
|
||||
import os
|
||||
from pathlib import Path
|
||||
|
||||
async def download_multiple_files(url: str, download_path: str):
|
||||
config = BrowserConfig(accept_downloads=True, downloads_path=download_path)
|
||||
async with AsyncWebCrawler(config=config) as crawler:
|
||||
run_config = CrawlerRunConfig(
|
||||
js_code="""
|
||||
const downloadLinks = document.querySelectorAll('a[download]');
|
||||
for (const link of downloadLinks) {
|
||||
link.click();
|
||||
await new Promise(r => setTimeout(r, 2000)); // Delay between clicks
|
||||
}
|
||||
""",
|
||||
wait_for=10 # Wait for all downloads to start
|
||||
)
|
||||
result = await crawler.arun(url=url, config=run_config)
|
||||
|
||||
if result.downloaded_files:
|
||||
print("Downloaded files:")
|
||||
for file in result.downloaded_files:
|
||||
print(f"- {file}")
|
||||
else:
|
||||
print("No files downloaded.")
|
||||
|
||||
# Usage
|
||||
download_path = os.path.join(Path.home(), ".crawl4ai", "downloads")
|
||||
os.makedirs(download_path, exist_ok=True)
|
||||
|
||||
asyncio.run(download_multiple_files("https://www.python.org/downloads/windows/", download_path))
|
||||
```
|
||||
|
||||
## Important Considerations
|
||||
|
||||
- **Browser Context:** Downloads are managed within the browser context. Ensure `js_code` correctly targets the download triggers on the webpage.
|
||||
- **Timing:** Use `wait_for` in `CrawlerRunConfig` to manage download timing.
|
||||
- **Error Handling:** Handle errors to manage failed downloads or incorrect paths gracefully.
|
||||
- **Security:** Scan downloaded files for potential security threats before use.
|
||||
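One lightweight check that can precede a full security scan is comparing a downloaded file against a checksum published by its source. The sketch below shows SHA-256 verification with the standard library only; it is not a malware scan, and `sha256_of` and `matches_checksum` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 65536) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex: str) -> bool:
    """Return True only if the file exists and its digest equals the expected one."""
    return Path(path).is_file() and sha256_of(path) == expected_hex.strip().lower()
```

A mismatch suggests a truncated or altered download, which is worth handling before the file is opened or executed.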
@@ -1,10 +0,0 @@
|
||||
enable_downloads: Downloads must be enabled using accept_downloads parameter in BrowserConfig or CrawlerRunConfig | download settings, enable downloads | BrowserConfig(accept_downloads=True)
|
||||
download_location: Set custom download directory using downloads_path in BrowserConfig, defaults to .crawl4ai/downloads | download folder, save location | BrowserConfig(downloads_path="/path/to/downloads")
|
||||
download_trigger: Trigger downloads using js_code in CrawlerRunConfig to simulate click actions | download button, click download | CrawlerRunConfig(js_code="document.querySelector('a[download]').click()")
|
||||
download_timing: Control download timing using wait_for parameter in CrawlerRunConfig | download wait, timeout | CrawlerRunConfig(wait_for=5)
|
||||
access_downloads: Access downloaded files through downloaded_files attribute in CrawlResult | download results, file paths | result.downloaded_files
|
||||
multiple_downloads: Download multiple files by clicking multiple download links with delay | batch download, multiple files | js_code="const links = document.querySelectorAll('a[download]'); for(const link of links) { link.click(); }"
|
||||
download_verification: Check download success by examining downloaded_files list and file sizes | verify downloads, file check | if result.downloaded_files: print(os.path.getsize(file_path))
|
||||
browser_context: Downloads are managed within browser context and require proper js_code targeting | download management, browser scope | CrawlerRunConfig(js_code="...")
|
||||
error_handling: Handle failed downloads and incorrect paths for robust download management | download errors, error handling | try-except around download operations
|
||||
security_consideration: Scan downloaded files for security threats before use | security check, virus scan | No direct code reference
|
||||
@@ -1,190 +0,0 @@
|
||||
# Page Interaction
|
||||
|
||||
Crawl4AI provides powerful features for interacting with dynamic webpages, handling JavaScript execution, and managing page events.
|
||||
|
||||
## JavaScript Execution
|
||||
|
||||
### Basic Execution
|
||||
|
||||
```python
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
|
||||
# Single JavaScript command
|
||||
config = CrawlerRunConfig(
|
||||
js_code="window.scrollTo(0, document.body.scrollHeight);"
|
||||
)
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
|
||||
# Multiple commands
|
||||
js_commands = [
|
||||
"window.scrollTo(0, document.body.scrollHeight);",
|
||||
"document.querySelector('.load-more').click();",
|
||||
"document.querySelector('#consent-button').click();"
|
||||
]
|
||||
config = CrawlerRunConfig(js_code=js_commands)
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
```
|
||||
|
||||
## Wait Conditions
|
||||
|
||||
### CSS-Based Waiting
|
||||
|
||||
Wait for elements to appear:
|
||||
|
||||
```python
|
||||
config = CrawlerRunConfig(wait_for="css:.dynamic-content") # Wait for element with class 'dynamic-content'
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
```
|
||||
|
||||
### JavaScript-Based Waiting
|
||||
|
||||
Wait for custom conditions:
|
||||
|
||||
```python
|
||||
# Wait for number of elements
|
||||
wait_condition = """() => {
|
||||
return document.querySelectorAll('.item').length > 10;
|
||||
}"""
|
||||
|
||||
config = CrawlerRunConfig(wait_for=f"js:{wait_condition}")
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
|
||||
# Wait for dynamic content to load
|
||||
wait_for_content = """() => {
|
||||
const content = document.querySelector('.content');
|
||||
return content && content.innerText.length > 100;
|
||||
}"""
|
||||
|
||||
config = CrawlerRunConfig(wait_for=f"js:{wait_for_content}")
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
```
|
||||
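Because these conditions are plain strings, they can be generated rather than hand-written. The sketch below builds the "more than N matching elements" condition shown above; `wait_for_min_count` is a hypothetical helper, and `json.dumps` is used only to produce a safely quoted JavaScript string literal for the selector.

```python
import json


def wait_for_min_count(selector: str, minimum: int) -> str:
    """Build a `js:` wait condition that passes once more than `minimum` elements match."""
    quoted = json.dumps(selector)  # e.g. ".item" -> "\".item\"" (a valid JS string literal)
    return f"js:() => document.querySelectorAll({quoted}).length > {int(minimum)}"


print(wait_for_min_count(".item", 10))
```

The returned string would be passed as `wait_for`, in the same way as the handwritten conditions above.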
|
||||
## Handling Dynamic Content
|
||||
|
||||
### Load More Content
|
||||
|
||||
Handle infinite scroll or load more buttons:
|
||||
|
||||
```python
|
||||
config = CrawlerRunConfig(
|
||||
    js_code=[
|
||||
        "window.previousCount = document.querySelectorAll('.item').length;",  # Record the current item count
|
||||
        "window.scrollTo(0, document.body.scrollHeight);",  # Scroll to bottom
|
||||
        "const loadMore = document.querySelector('.load-more'); if(loadMore) loadMore.click();"  # Click load more
|
||||
    ],
|
||||
    wait_for="js:() => document.querySelectorAll('.item').length > window.previousCount"  # Wait for new content
|
||||
)
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
```
|
||||
|
||||
### Form Interaction
|
||||
|
||||
Handle forms and inputs:
|
||||
|
||||
```python
|
||||
js_form_interaction = """
|
||||
document.querySelector('#search').value = 'search term'; // Fill form fields
|
||||
document.querySelector('form').submit(); // Submit form
|
||||
"""
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
js_code=js_form_interaction,
|
||||
wait_for="css:.results" # Wait for results to load
|
||||
)
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
```
|
||||
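When the search term comes from user input, quoting it by hand inside a JavaScript string is error-prone. The sketch below generates the same fill-and-submit snippet with `json.dumps` handling the escaping; `build_form_js` is a hypothetical helper, and the selectors are the placeholders used in the example above.

```python
import json


def build_form_js(field_selector: str, value: str, form_selector: str = "form") -> str:
    """Return JavaScript that fills one input and then submits the form."""
    return (
        # json.dumps escapes quotes and backslashes, yielding valid JS string literals.
        f"document.querySelector({json.dumps(field_selector)}).value = {json.dumps(value)};\n"
        f"document.querySelector({json.dumps(form_selector)}).submit();"
    )


print(build_form_js("#search", 'crawl "dynamic" pages'))
```

The resulting string can be passed as `js_code` just like the handwritten `js_form_interaction` above.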
|
||||
## Timing Control
|
||||
|
||||
### Delays and Timeouts
|
||||
|
||||
Control timing of interactions:
|
||||
|
||||
```python
|
||||
config = CrawlerRunConfig(
|
||||
page_timeout=60000, # Page load timeout (ms)
|
||||
delay_before_return_html=2.0 # Wait before capturing content
|
||||
)
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
```
|
||||
|
||||
### Complex Interactions Example
|
||||
|
||||
Here's an example of handling a dynamic page with multiple interactions:
|
||||
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
|
||||
async def crawl_dynamic_content():
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
# Initial page load
|
||||
config = CrawlerRunConfig(
|
||||
js_code="document.querySelector('.cookie-accept')?.click();", # Handle cookie consent
|
||||
wait_for="css:.main-content"
|
||||
)
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
|
||||
# Load more content
|
||||
session_id = "dynamic_session" # Keep session for multiple interactions
|
||||
|
||||
for page in range(3): # Load 3 pages of content
|
||||
config = CrawlerRunConfig(
|
||||
session_id=session_id,
|
||||
js_code=[
|
||||
"window.scrollTo(0, document.body.scrollHeight);", # Scroll to bottom
|
||||
"window.previousCount = document.querySelectorAll('.item').length;", # Store item count
|
||||
"document.querySelector('.load-more')?.click();" # Click load more
|
||||
],
|
||||
wait_for="""() => {
|
||||
const currentCount = document.querySelectorAll('.item').length;
|
||||
return currentCount > window.previousCount;
|
||||
}""",
|
||||
js_only=(page > 0) # Execute JS without reloading page for subsequent interactions
|
||||
)
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
print(f"Page {page + 1} cleaned HTML length:", len(result.cleaned_html))
|
||||
|
||||
# Clean up session
|
||||
await crawler.crawler_strategy.kill_session(session_id)
|
||||
```
|
||||
|
||||
### Using with Extraction Strategies
|
||||
|
||||
Combine page interaction with structured extraction:
|
||||
|
||||
```python
|
||||
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, LLMExtractionStrategy
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
from pydantic import BaseModel
|
||||
from typing import List
|
||||
|
||||
# Pattern-based extraction after interaction
|
||||
schema = {
|
||||
"name": "Dynamic Items",
|
||||
"baseSelector": ".item",
|
||||
"fields": [
|
||||
{"name": "title", "selector": "h2", "type": "text"},
|
||||
{"name": "description", "selector": ".desc", "type": "text"}
|
||||
]
|
||||
}
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
js_code="window.scrollTo(0, document.body.scrollHeight);",
|
||||
wait_for="css:.item:nth-child(10)", # Wait for 10 items
|
||||
extraction_strategy=JsonCssExtractionStrategy(schema)
|
||||
)
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
|
||||
# Or use LLM to analyze dynamic content
|
||||
class ContentAnalysis(BaseModel):
|
||||
topics: List[str]
|
||||
summary: str
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
js_code="document.querySelector('.show-more').click();",
|
||||
wait_for="css:.full-content",
|
||||
extraction_strategy=LLMExtractionStrategy(
|
||||
provider="ollama/nemotron",
|
||||
schema=ContentAnalysis.schema(),
|
||||
instruction="Analyze the full content"
|
||||
)
|
||||
)
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
```
|
||||
@@ -1,10 +0,0 @@
|
||||
javascript_execution: Execute single or multiple JavaScript commands in webpage | js code, javascript commands, browser execution | CrawlerRunConfig(js_code="window.scrollTo(0, document.body.scrollHeight);")
|
||||
css_wait: Wait for specific CSS elements to appear on page | css selector, element waiting, dynamic content | CrawlerRunConfig(wait_for="css:.dynamic-content")
|
||||
js_wait_condition: Define custom JavaScript wait conditions for dynamic content | javascript waiting, conditional wait, custom conditions | CrawlerRunConfig(wait_for="js:() => document.querySelectorAll('.item').length > 10")
|
||||
infinite_scroll: Handle infinite scroll and load more buttons | pagination, dynamic loading, scroll handling | CrawlerRunConfig(js_code="window.scrollTo(0, document.body.scrollHeight);")
|
||||
form_interaction: Fill and submit forms using JavaScript | form handling, input filling, form submission | CrawlerRunConfig(js_code="document.querySelector('#search').value = 'search term';")
|
||||
timing_control: Set page timeouts and delays before content capture | page timing, delays, timeouts | CrawlerRunConfig(page_timeout=60000, delay_before_return_html=2.0)
|
||||
session_management: Maintain browser session for multiple interactions | session handling, browser state, session cleanup | crawler.crawler_strategy.kill_session(session_id)
|
||||
cookie_consent: Handle cookie consent popups and notifications | cookie handling, popup management | CrawlerRunConfig(js_code="document.querySelector('.cookie-accept')?.click();")
|
||||
extraction_combination: Combine page interactions with structured data extraction | data extraction, content parsing | JsonCssExtractionStrategy(schema), LLMExtractionStrategy(schema)
|
||||
dynamic_content_loading: Wait for and verify dynamic content loading | content verification, dynamic loading | wait_for="js:() => document.querySelector('.content').innerText.length > 100"
|
||||
@@ -1,158 +0,0 @@
|
||||
# Prefix-Based Input Handling in Crawl4AI
|
||||
|
||||
This guide will walk you through using the Crawl4AI library to crawl web pages, local HTML files, and raw HTML strings. We'll demonstrate these capabilities using a Wikipedia page as an example.
|
||||
|
||||
## Crawling a Web URL
|
||||
|
||||
To crawl a live web page, provide the URL starting with `http://` or `https://`, using a `CrawlerRunConfig` object:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
|
||||
async def crawl_web():
|
||||
config = CrawlerRunConfig(bypass_cache=True)
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(url="https://en.wikipedia.org/wiki/apple", config=config)
|
||||
if result.success:
|
||||
print("Markdown Content:")
|
||||
print(result.markdown)
|
||||
else:
|
||||
print(f"Failed to crawl: {result.error_message}")
|
||||
|
||||
asyncio.run(crawl_web())
|
||||
```
|
||||
|
||||
## Crawling a Local HTML File
|
||||
|
||||
To crawl a local HTML file, prefix the file path with `file://`.
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
|
||||
async def crawl_local_file():
|
||||
local_file_path = "/path/to/apple.html" # Replace with your file path
|
||||
file_url = f"file://{local_file_path}"
|
||||
config = CrawlerRunConfig(bypass_cache=True)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(url=file_url, config=config)
|
||||
if result.success:
|
||||
print("Markdown Content from Local File:")
|
||||
print(result.markdown)
|
||||
else:
|
||||
print(f"Failed to crawl local file: {result.error_message}")
|
||||
|
||||
asyncio.run(crawl_local_file())
|
||||
```
|
||||
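Relative paths and `~` are easy to get wrong when building the `file://` URL by hand. The sketch below normalizes the path first and then applies the same prefix as the example above; `to_file_url` is a hypothetical helper, and the result has only been reasoned about for POSIX-style absolute paths.

```python
from pathlib import Path


def to_file_url(path) -> str:
    """Build a file:// URL from a possibly relative or ~-prefixed path."""
    absolute = Path(path).expanduser().resolve()
    # Same construction as the example above: "file://" + absolute path.
    return f"file://{absolute}"


print(to_file_url("apple.html"))
```

The returned string is what would be passed as `url` to `arun`.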
|
||||
## Crawling Raw HTML Content
|
||||
|
||||
To crawl raw HTML content, prefix the HTML string with `raw:`.
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.async_configs import CrawlerRunConfig
|
||||
|
||||
async def crawl_raw_html():
|
||||
raw_html = "<html><body><h1>Hello, World!</h1></body></html>"
|
||||
raw_html_url = f"raw:{raw_html}"
|
||||
config = CrawlerRunConfig(bypass_cache=True)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(url=raw_html_url, config=config)
|
||||
if result.success:
|
||||
print("Markdown Content from Raw HTML:")
|
||||
print(result.markdown)
|
||||
else:
|
||||
print(f"Failed to crawl raw HTML: {result.error_message}")
|
||||
|
||||
asyncio.run(crawl_raw_html())
|
||||
```
|
||||
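The three input kinds covered so far differ only in their prefix, so a quick classifier can help validate inputs before crawling. This is a minimal sketch of the prefix convention described in this guide; `classify_input` is a hypothetical helper, not a Crawl4AI function.

```python
def classify_input(url: str) -> str:
    """Classify a crawl target by its prefix: 'web', 'file', or 'raw'."""
    if url.startswith(("http://", "https://")):
        return "web"
    if url.startswith("file://"):
        return "file"
    if url.startswith("raw:"):
        return "raw"
    raise ValueError(
        "Expected a URL starting with http://, https://, file://, or raw:"
    )


print(classify_input("raw:<html><body><h1>Hello, World!</h1></body></html>"))
```

Raising early on an unrecognized prefix gives a clearer error than a failed crawl result.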
|
||||
---
|
||||
|
||||
## Complete Example
|
||||
|
||||
Below is a comprehensive script that:
|
||||
|
||||
1. Crawls the Wikipedia page for "Apple."
|
||||
2. Saves the HTML content to a local file (`apple.html`).
|
||||
3. Crawls the local HTML file and verifies the markdown length matches the original crawl.
|
||||
4. Crawls the raw HTML content from the saved file and verifies consistency.
|
||||
|
||||
```python
import os
import sys
import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig

async def main():
    wikipedia_url = "https://en.wikipedia.org/wiki/apple"
    script_dir = Path(__file__).parent
    html_file_path = script_dir / "apple.html"

    async with AsyncWebCrawler() as crawler:
        # Step 1: Crawl the Web URL
        print("\n=== Step 1: Crawling the Wikipedia URL ===")
        web_config = CrawlerRunConfig(bypass_cache=True)
        result = await crawler.arun(url=wikipedia_url, config=web_config)

        if not result.success:
            print(f"Failed to crawl {wikipedia_url}: {result.error_message}")
            return

        with open(html_file_path, 'w', encoding='utf-8') as f:
            f.write(result.html)
        web_crawl_length = len(result.markdown)
        print(f"Length of markdown from web crawl: {web_crawl_length}\n")

        # Step 2: Crawl from the Local HTML File
        print("=== Step 2: Crawling from the Local HTML File ===")
        file_url = f"file://{html_file_path.resolve()}"
        file_config = CrawlerRunConfig(bypass_cache=True)
        local_result = await crawler.arun(url=file_url, config=file_config)

        if not local_result.success:
            print(f"Failed to crawl local file {file_url}: {local_result.error_message}")
            return

        local_crawl_length = len(local_result.markdown)
        assert web_crawl_length == local_crawl_length, "Markdown length mismatch"
        print("✅ Markdown length matches between web and local file crawl.\n")

        # Step 3: Crawl Using Raw HTML Content
        print("=== Step 3: Crawling Using Raw HTML Content ===")
        with open(html_file_path, 'r', encoding='utf-8') as f:
            raw_html_content = f.read()
        raw_html_url = f"raw:{raw_html_content}"
        raw_config = CrawlerRunConfig(bypass_cache=True)
        raw_result = await crawler.arun(url=raw_html_url, config=raw_config)

        if not raw_result.success:
            print(f"Failed to crawl raw HTML content: {raw_result.error_message}")
            return

        raw_crawl_length = len(raw_result.markdown)
        assert web_crawl_length == raw_crawl_length, "Markdown length mismatch"
        print("✅ Markdown length matches between web and raw HTML crawl.\n")

        print("All tests passed successfully!")
    if html_file_path.exists():
        os.remove(html_file_path)

if __name__ == "__main__":
    asyncio.run(main())
```
|
||||
|
||||
---
|
||||
|
||||
## Conclusion
|
||||
|
||||
With the unified `url` parameter and prefix-based handling in **Crawl4AI**, you can seamlessly handle web URLs, local HTML files, and raw HTML content. Use `CrawlerRunConfig` for flexible and consistent configuration in all scenarios.
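
The prefix-based handling described above can be pictured with a small standalone sketch. It only illustrates the idea of dispatching on the `url` prefix; `classify_input` is a hypothetical helper written for this example and is not part of Crawl4AI.

```python
def classify_input(url: str) -> str:
    """Return which kind of input a `url` string represents, judged by its prefix.

    Hypothetical helper for illustration only; not a Crawl4AI function.
    """
    if url.startswith(("http://", "https://")):
        return "web"
    if url.startswith("file://"):
        return "file"
    if url.startswith("raw:"):
        return "raw"
    raise ValueError(f"Unsupported input prefix: {url[:20]!r}")


# The three input styles used throughout this tutorial
for candidate in (
    "https://example.com",
    "file:///path/to/file.html",
    "raw:<html><body><h1>Hello</h1></body></html>",
):
    print(classify_input(candidate))
```

A check like this can be useful in your own code before calling `arun`, for example to reject strings that carry none of the three prefixes.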
|
||||
@@ -1,10 +0,0 @@
|
||||
url_prefix_handling: Crawl4AI supports different URL prefixes for various input types | input handling, url format, crawling types | url="https://example.com" or "file://path" or "raw:html"
|
||||
web_crawling: Crawl live web pages using http:// or https:// prefixes with AsyncWebCrawler | web scraping, url crawling, web content | AsyncWebCrawler().arun(url="https://example.com")
|
||||
local_file_crawling: Access local HTML files using file:// prefix for crawling | local html, file crawling, file access | AsyncWebCrawler().arun(url="file:///path/to/file.html")
|
||||
raw_html_crawling: Process raw HTML content directly using raw: prefix | html string, raw content, direct html | AsyncWebCrawler().arun(url="raw:<html>content</html>")
|
||||
crawler_config: Configure crawling behavior using CrawlerRunConfig object | crawler settings, configuration, bypass cache | CrawlerRunConfig(bypass_cache=True)
|
||||
async_context: AsyncWebCrawler should be used within async context manager | async with, context management, async programming | async with AsyncWebCrawler() as crawler
|
||||
crawl_result: Crawler returns result object containing success status, markdown and error messages | response handling, crawl output, result parsing | result.success, result.markdown, result.error_message
|
||||
html_to_markdown: Crawler automatically converts HTML content to markdown format | format conversion, markdown generation, content processing | result.markdown
|
||||
error_handling: Check crawl success status and handle error messages appropriately | error checking, failure handling, status verification | if result.success: ... else: print(result.error_message)
|
||||
content_verification: Compare markdown length between different crawling methods for consistency | content validation, length comparison, consistency check | assert web_crawl_length == local_crawl_length
|
||||
@@ -1,119 +0,0 @@
|
||||
# Hooks & Auth for AsyncWebCrawler
|
||||
|
||||
Crawl4AI's `AsyncWebCrawler` allows you to customize the behavior of the web crawler using hooks. Hooks are asynchronous functions called at specific points in the crawling process, allowing you to modify the crawler's behavior or perform additional actions. This updated documentation demonstrates how to use hooks, including the new `on_page_context_created` hook, and ensures compatibility with `BrowserConfig` and `CrawlerRunConfig`.
|
||||
|
||||
In this example, we'll:
|
||||
|
||||
1. Configure the browser and set up authentication when it's created.
|
||||
2. Apply custom routing and initial actions when the page context is created.
|
||||
3. Add custom headers before navigating to the URL.
|
||||
4. Log the current URL after navigation.
|
||||
5. Perform actions after JavaScript execution.
|
||||
6. Log the length of the HTML before returning it.
|
||||
|
||||
## Hook Definitions
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from playwright.async_api import Page, Browser, BrowserContext

async def log_routing(route):
    # Example: block loading images
    # The handler is async so each route call is awaited instead of being
    # scheduled as an untracked background task
    if route.request.resource_type == "image":
        print(f"[HOOK] Blocking image request: {route.request.url}")
        await route.abort()
    else:
        await route.continue_()

async def on_browser_created(browser: Browser, **kwargs):
    print("[HOOK] on_browser_created")
    # Example: Set browser viewport size and log in
    context = await browser.new_context(viewport={"width": 1920, "height": 1080})
    page = await context.new_page()
    await page.goto("https://example.com/login")
    await page.fill("input[name='username']", "testuser")
    await page.fill("input[name='password']", "password123")
    await page.click("button[type='submit']")
    await page.wait_for_selector("#welcome")
    await context.add_cookies([{"name": "auth_token", "value": "abc123", "url": "https://example.com"}])
    await page.close()
    await context.close()

async def on_page_context_created(context: BrowserContext, page: Page, **kwargs):
    print("[HOOK] on_page_context_created")
    await context.route("**", log_routing)

async def before_goto(page: Page, context: BrowserContext, **kwargs):
    print("[HOOK] before_goto")
    await page.set_extra_http_headers({"X-Test-Header": "test"})

async def after_goto(page: Page, context: BrowserContext, **kwargs):
    print("[HOOK] after_goto")
    print(f"Current URL: {page.url}")

async def on_execution_started(page: Page, context: BrowserContext, **kwargs):
    print("[HOOK] on_execution_started")
    await page.evaluate("console.log('Custom JS executed')")

async def before_return_html(page: Page, context: BrowserContext, html: str, **kwargs):
    print("[HOOK] before_return_html")
    print(f"HTML length: {len(html)}")
    return page
```
|
||||
|
||||
## Using the Hooks with AsyncWebCrawler
|
||||
|
||||
```python
async def main():
    print("\n🔗 Using Crawler Hooks: Customize AsyncWebCrawler with hooks!")

    # Configure browser and crawler settings
    browser_config = BrowserConfig(
        headless=True,
        viewport_width=1920,
        viewport_height=1080
    )

    crawler_run_config = CrawlerRunConfig(
        js_code="window.scrollTo(0, document.body.scrollHeight);",
        wait_for="footer"
    )

    # Initialize crawler
    async with AsyncWebCrawler(config=browser_config) as crawler:
        crawler.crawler_strategy.set_hook("on_browser_created", on_browser_created)
        crawler.crawler_strategy.set_hook("on_page_context_created", on_page_context_created)
        crawler.crawler_strategy.set_hook("before_goto", before_goto)
        crawler.crawler_strategy.set_hook("after_goto", after_goto)
        crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)
        crawler.crawler_strategy.set_hook("before_return_html", before_return_html)

        # Run the crawler
        result = await crawler.arun(url="https://example.com", config=crawler_run_config)

        print("\n📦 Crawler Hooks Result:")
        print(result)

asyncio.run(main())
```
|
||||
|
||||
## Explanation of Hooks
|
||||
|
||||
- **`on_browser_created`**: Called when the browser is created. Use this to configure the browser or handle authentication (e.g., logging in and setting cookies).
|
||||
- **`on_page_context_created`**: Called when a new page context is created. Use this to apply routing, block resources, or inject custom logic before navigating to the URL.
|
||||
- **`before_goto`**: Called before navigating to the URL. Use this to add custom headers or perform other pre-navigation actions.
|
||||
- **`after_goto`**: Called after navigation. Use this to verify content or log the URL.
|
||||
- **`on_execution_started`**: Called after executing custom JavaScript. Use this to perform additional actions.
|
||||
- **`before_return_html`**: Called before returning the HTML content. Use this to log details or preprocess the content.
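
To make the "named async function called at a specific point" idea concrete, here is a toy model of a hook registry. It is a sketch for illustration only: `HookRunner`, `fire`, and the pretend navigation are hypothetical and do not reflect how Crawl4AI implements hooks internally. Only the `set_hook(name, function)` shape mirrors the registration call used above.

```python
import asyncio


class HookRunner:
    """Toy registry mapping hook names to async callables (not Crawl4AI code)."""

    def __init__(self):
        self.hooks = {}

    def set_hook(self, name, func):
        # Later registrations under the same name replace earlier ones
        self.hooks[name] = func

    async def fire(self, name, *args, **kwargs):
        # Call the hook if one is registered; otherwise do nothing
        hook = self.hooks.get(name)
        if hook is not None:
            return await hook(*args, **kwargs)
        return None


async def demo():
    calls = []

    async def before_goto(url, **kwargs):
        calls.append(("before_goto", url))

    async def after_goto(url, **kwargs):
        calls.append(("after_goto", url))

    runner = HookRunner()
    runner.set_hook("before_goto", before_goto)
    runner.set_hook("after_goto", after_goto)

    # Stand-in for a crawl: fire hooks around a pretend navigation
    await runner.fire("before_goto", "https://example.com")
    await runner.fire("after_goto", "https://example.com")
    # No hook registered under this name, so the call is skipped
    await runner.fire("before_return_html", "<html></html>")
    return calls


print(asyncio.run(demo()))
```

Accepting `**kwargs` in each hook, as the definitions in this document do, keeps a hook working if the caller passes extra keyword arguments.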
|
||||
|
||||
## Additional Customizations
|
||||
|
||||
- **Resource Management**: Use `on_page_context_created` to block or modify requests (e.g., block images, fonts, or third-party scripts).
|
||||
- **Dynamic Headers**: Use `before_goto` to add or modify headers dynamically based on the URL.
|
||||
- **Authentication**: Use `on_browser_created` to handle login processes and set authentication cookies or tokens.
|
||||
- **Content Analysis**: Use `before_return_html` to analyze or modify the extracted HTML content.
|
||||
|
||||
These hooks provide powerful customization options for tailoring the crawling process to your needs.
|
||||
|
||||
@@ -1,12 +0,0 @@
|
||||
crawler_hooks: AsyncWebCrawler supports customizable hooks for modifying crawler behavior | hooks, async functions, crawler customization | crawler.crawler_strategy.set_hook()
|
||||
browser_creation_hook: on_browser_created hook executes when browser is initialized for authentication and setup | browser setup, login, authentication | async def on_browser_created(browser: Browser, **kwargs)
|
||||
page_context_hook: on_page_context_created hook handles routing and initial page setup | page context, routing, resource blocking | async def on_page_context_created(context: BrowserContext, page: Page, **kwargs)
|
||||
navigation_pre_hook: before_goto hook allows adding custom headers before URL navigation | headers, pre-navigation, request modification | async def before_goto(page: Page, context: BrowserContext, **kwargs)
|
||||
navigation_post_hook: after_goto hook executes after URL navigation for verification | post-navigation, URL logging | async def after_goto(page: Page, context: BrowserContext, **kwargs)
|
||||
js_execution_hook: on_execution_started hook runs after custom JavaScript execution | JavaScript, script execution | async def on_execution_started(page: Page, context: BrowserContext, **kwargs)
|
||||
html_processing_hook: before_return_html hook processes HTML content before returning | HTML content, preprocessing | async def before_return_html(page: Page, context: BrowserContext, html: str, **kwargs)
|
||||
browser_configuration: BrowserConfig allows setting headless mode and viewport dimensions | browser settings, viewport | BrowserConfig(headless=True, viewport_width=1920, viewport_height=1080)
|
||||
crawler_configuration: CrawlerRunConfig defines JavaScript execution and wait conditions | crawler settings, JS code, wait conditions | CrawlerRunConfig(js_code="window.scrollTo(0)", wait_for="footer")
|
||||
resource_management: Route handlers can block or modify specific resource types | resource blocking, request handling | if route.request.resource_type == "image": await route.abort()
|
||||
authentication_flow: Browser authentication handled through login form interaction and cookie setting | login process, cookies | await page.fill("input[name='username']", "testuser")
|
||||
hook_registration: Hooks are registered using the crawler strategy's set_hook method | hook setup, strategy | crawler.crawler_strategy.set_hook("hook_name", hook_function)
|
||||
@@ -1,131 +0,0 @@
|
||||
# Proxy & Security
|
||||
|
||||
Configure proxy settings and enhance security features in Crawl4AI for reliable data extraction.
|
||||
|
||||
## Basic Proxy Setup
|
||||
|
||||
Simple proxy configuration with `BrowserConfig`:
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig

async def main():
    # Using proxy URL
    browser_config = BrowserConfig(proxy="http://proxy.example.com:8080")
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com")

    # Using SOCKS proxy
    browser_config = BrowserConfig(proxy="socks5://proxy.example.com:1080")
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com")

asyncio.run(main())
```
|
||||
|
||||
## Authenticated Proxy
|
||||
|
||||
Use an authenticated proxy with `BrowserConfig`:
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig

proxy_config = {
    "server": "http://proxy.example.com:8080",
    "username": "user",
    "password": "pass"
}

async def main():
    browser_config = BrowserConfig(proxy_config=proxy_config)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com")

asyncio.run(main())
```
|
||||
|
||||
## Rotating Proxies
|
||||
|
||||
Example using a proxy rotation service and updating `BrowserConfig` dynamically:
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig

async def get_next_proxy():
    # Your proxy rotation logic here
    return {"server": "http://next.proxy.com:8080"}

async def crawl_with_rotation(urls):
    results = []
    for url in urls:
        # BrowserConfig is consumed when the crawler starts, so build a new
        # config (and a new crawler) for each proxy
        proxy = await get_next_proxy()
        browser_config = BrowserConfig(proxy_config=proxy)
        async with AsyncWebCrawler(config=browser_config) as crawler:
            results.append(await crawler.arun(url=url))
    return results

asyncio.run(crawl_with_rotation(["https://example.com"]))
```
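
The `get_next_proxy` placeholder above leaves the rotation logic open. One simple option is round-robin rotation over a fixed pool, sketched below with only the standard library. The proxy addresses and the `make_proxy_rotator` helper are hypothetical and exist only for this illustration.

```python
import asyncio
from itertools import cycle

# Hypothetical proxy servers; replace with your own pool
PROXY_POOL = [
    {"server": "http://proxy1.example.com:8080"},
    {"server": "http://proxy2.example.com:8080"},
    {"server": "http://proxy3.example.com:8080"},
]


def make_proxy_rotator(pool):
    """Build an async `get_next_proxy` that walks the pool in round-robin order."""
    proxies = cycle(pool)

    async def get_next_proxy():
        # Return a copy so callers cannot mutate the shared pool entries
        return dict(next(proxies))

    return get_next_proxy


async def demo():
    get_next_proxy = make_proxy_rotator(PROXY_POOL)
    # Four requests against a pool of three wraps around to the first proxy
    return [(await get_next_proxy())["server"] for _ in range(4)]


print(asyncio.run(demo()))
```

Round-robin spreads requests evenly; a real rotation service might instead pick proxies by health, region, or recent failure rate.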
|
||||
|
||||
## Custom Headers
|
||||
|
||||
Add security-related headers via `BrowserConfig`:
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig

headers = {
    "X-Forwarded-For": "203.0.113.195",
    "Accept-Language": "en-US,en;q=0.9",
    "Cache-Control": "no-cache",
    "Pragma": "no-cache"
}

async def main():
    browser_config = BrowserConfig(headers=headers)
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com")

asyncio.run(main())
```
|
||||
|
||||
## Combining with Magic Mode
|
||||
|
||||
For maximum protection, combine proxy with Magic Mode via `CrawlerRunConfig` and `BrowserConfig`:
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

browser_config = BrowserConfig(
    proxy="http://proxy.example.com:8080",
    headers={"Accept-Language": "en-US"}
)
crawler_config = CrawlerRunConfig(magic=True)  # Enable all anti-detection features

async def main():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com", config=crawler_config)

asyncio.run(main())
```
|
||||
|
||||
## SSL Certificate Verification
|
||||
|
||||
Crawl4AI can retrieve and analyze SSL certificates from HTTPS websites. This is useful for:
|
||||
- Verifying website authenticity
|
||||
- Detecting potential security issues
|
||||
- Analyzing certificate chains
|
||||
- Exporting certificates for further analysis
|
||||
|
||||
Enable SSL certificate retrieval with `CrawlerRunConfig`:
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(fetch_ssl_certificate=True)
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)

        if result.success and result.ssl_certificate:
            cert = result.ssl_certificate

            # Access certificate properties
            print(f"Issuer: {cert.issuer.get('CN', '')}")
            print(f"Valid until: {cert.valid_until}")
            print(f"Fingerprint: {cert.fingerprint}")

            # Export certificate in different formats
            cert.to_json("cert.json")  # For analysis
            cert.to_pem("cert.pem")  # For web servers
            cert.to_der("cert.der")  # For Java applications

asyncio.run(main())
```
|
||||
|
||||
The SSL certificate object provides:
|
||||
- Direct access to certificate fields (issuer, subject, validity dates)
|
||||
- Methods to export in common formats (JSON, PEM, DER)
|
||||
- Certificate chain information and extensions
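
For further analysis outside Crawl4AI, exported PEM and DER data can be converted with Python's standard `ssl` module. The sketch below is independent of Crawl4AI: the helper names are made up for this example, the input is placeholder bytes rather than a real certificate (the `ssl` helpers only re-encode and do not validate), and computing the fingerprint as a SHA-256 digest of the DER bytes is a common convention assumed here, not a statement about how `cert.fingerprint` is derived.

```python
import hashlib
import ssl


def pem_to_der(pem_text: str) -> bytes:
    """Convert a PEM-encoded certificate string to DER bytes."""
    return ssl.PEM_cert_to_DER_cert(pem_text)


def der_to_pem(der_bytes: bytes) -> str:
    """Convert DER bytes to a PEM-encoded certificate string."""
    return ssl.DER_cert_to_PEM_cert(der_bytes)


def sha256_fingerprint(der_bytes: bytes) -> str:
    """Colon-separated SHA-256 digest of the DER bytes (assumed convention)."""
    digest = hashlib.sha256(der_bytes).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


# Placeholder bytes standing in for the contents of an exported cert.der
placeholder_der = b"not a real certificate, just bytes for the round trip"
pem_text = der_to_pem(placeholder_der)

print(pem_text.splitlines()[0])
print(pem_to_der(pem_text) == placeholder_der)
print(sha256_fingerprint(placeholder_der))
```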
|
||||
@@ -1,8 +0,0 @@
|
||||
proxy_setup: Configure basic proxy in Crawl4AI using BrowserConfig with proxy URL | proxy configuration, proxy setup, basic proxy | BrowserConfig(proxy="http://proxy.example.com:8080")
|
||||
socks_proxy: Use SOCKS proxy protocol for web crawling | SOCKS5, proxy protocol, SOCKS connection | BrowserConfig(proxy="socks5://proxy.example.com:1080")
|
||||
authenticated_proxy: Set up proxy with username and password authentication | proxy auth, proxy credentials, authenticated connection | BrowserConfig(proxy_config={"server": "http://proxy.example.com:8080", "username": "user", "password": "pass"})
|
||||
rotating_proxies: Implement dynamic proxy rotation during crawling | proxy rotation, proxy switching, dynamic proxies | browser_config.proxy_config = await get_next_proxy()
|
||||
custom_headers: Add security headers to browser configuration for enhanced protection | HTTP headers, request headers, security headers | BrowserConfig(headers={"X-Forwarded-For": "203.0.113.195", "Accept-Language": "en-US,en;q=0.9"})
|
||||
magic_mode: Combine proxy settings with Magic Mode for maximum anti-detection | anti-detection, stealth mode, protection features | CrawlerRunConfig(magic=True) with BrowserConfig(proxy="http://proxy.example.com:8080")
|
||||
crawler_context: Use AsyncWebCrawler with async context manager for proper resource management | async crawler, context manager, crawler setup | async with AsyncWebCrawler(config=browser_config) as crawler
|
||||
cache_control: Set cache control headers to prevent caching during crawling | caching headers, no-cache, cache prevention | BrowserConfig(headers={"Cache-Control": "no-cache", "Pragma": "no-cache"})
|
||||
@@ -1,58 +0,0 @@
|
||||
# Capturing Full-Page Screenshots and PDFs from Massive Webpages with Crawl4AI
|
||||
|
||||
When dealing with very long web pages, traditional full-page screenshots can be slow or fail entirely. For large pages (like extensive Wikipedia articles), generating a single massive screenshot often leads to delays, memory issues, or style differences.
|
||||
|
||||
## **The New Approach:**
|
||||
We’ve introduced a new feature that effortlessly handles even the biggest pages by first exporting them as a PDF, then converting that PDF into a high-quality image. This approach leverages the browser’s built-in PDF rendering, making it both stable and efficient for very long content. You also have the option to directly save the PDF for your own usage—no need for multiple passes or complex stitching logic.
|
||||
|
||||
## **Key Benefits:**
|
||||
- **Reliability:** The PDF export never times out and works regardless of page length.
|
||||
- **Versatility:** Get both the PDF and a screenshot in one crawl, without reloading or reprocessing.
|
||||
- **Performance:** Skips manual scrolling and stitching images, reducing complexity and runtime.
|
||||
|
||||
## **Simple Example:**
|
||||
```python
import os, sys
import asyncio
from base64 import b64decode  # imported at module level so the PDF branch can use it too
from crawl4ai import AsyncWebCrawler, CacheMode

# Adjust paths as needed
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(parent_dir)
__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))

async def main():
    async with AsyncWebCrawler() as crawler:
        # Request both PDF and screenshot
        result = await crawler.arun(
            url='https://en.wikipedia.org/wiki/List_of_common_misconceptions',
            cache_mode=CacheMode.BYPASS,
            pdf=True,
            screenshot=True
        )

        if result.success:
            # Save screenshot
            if result.screenshot:
                with open(os.path.join(__location__, "screenshot.png"), "wb") as f:
                    f.write(b64decode(result.screenshot))

            # Save PDF
            if result.pdf:
                pdf_bytes = b64decode(result.pdf)
                with open(os.path.join(__location__, "page.pdf"), "wb") as f:
                    f.write(pdf_bytes)

if __name__ == "__main__":
    asyncio.run(main())
```
|
||||
|
||||
## **What Happens Under the Hood:**
|
||||
- Crawl4AI navigates to the target page.
|
||||
- If `pdf=True`, it exports the current page as a full PDF, capturing all of its content no matter the length.
|
||||
- If `screenshot=True`, and a PDF is already available, it directly converts the first page of that PDF to an image for you—no repeated loading or scrolling.
|
||||
- Finally, you get your PDF and/or screenshot ready to use.
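
The final saving step can be factored into a small helper. This is a minimal sketch that assumes, as the example above does, that the payload arrives as a base64-encoded string; `save_base64_payload` is a hypothetical name, and the payload here is a stand-in rather than real crawler output.

```python
import base64
import tempfile
from pathlib import Path


def save_base64_payload(payload: str, destination: Path) -> int:
    """Decode a base64 string, write the bytes to `destination`, return the byte count."""
    data = base64.b64decode(payload)
    destination.write_bytes(data)
    return len(data)


# Stand-in payload; in a real crawl this would come from the result object
fake_payload = base64.b64encode(b"%PDF-1.4 placeholder").decode("ascii")

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "page.pdf"
    written = save_base64_payload(fake_payload, out_path)
    print(written, out_path.exists())
```

Returning the byte count gives a cheap sanity check: a result of zero usually means the capture produced nothing worth saving.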
|
||||
|
||||
## **Conclusion:**
|
||||
With this feature, Crawl4AI becomes even more robust and versatile for large-scale content extraction. Whether you need a PDF snapshot or a quick screenshot, you now have a reliable solution for even the most extensive webpages.
|
||||
@@ -1,9 +0,0 @@
|
||||
page_capture: Full-page screenshots and PDFs can be generated for massive webpages using Crawl4AI | webpage capture, full page screenshot, pdf export | AsyncWebCrawler().arun(url=url, pdf=True, screenshot=True)
|
||||
pdf_approach: Pages are first exported as PDF then converted to high-quality images for better handling of large content | pdf conversion, image export, page rendering | result.pdf, result.screenshot
|
||||
export_benefits: PDF export method never times out and works with any page length | timeout handling, page size limits, reliability | pdf=True
|
||||
dual_output: Get both PDF and screenshot in single crawl without reloading | multiple formats, single pass, efficient capture | pdf=True, screenshot=True
|
||||
result_handling: Screenshot and PDF data are returned as base64 encoded strings | base64 encoding, binary data, file saving | b64decode(result.screenshot), b64decode(result.pdf)
|
||||
cache_control: Cache mode can be bypassed for fresh page captures | caching, fresh content, bypass cache | cache_mode=CacheMode.BYPASS
|
||||
async_operation: Crawler operates asynchronously using Python's asyncio framework | async/await, concurrent execution | async with AsyncWebCrawler() as crawler
|
||||
file_saving: Screenshots and PDFs can be saved directly to local files | file output, save results, local storage | open("screenshot.png", "wb"), open("page.pdf", "wb")
|
||||
error_handling: Success status can be checked before processing results | error checking, result validation | if result.success:
|
||||
@@ -1,225 +0,0 @@
|
||||
# Using `storage_state` to Pre-Load Cookies and LocalStorage
|
||||
|
||||
Crawl4AI’s `AsyncWebCrawler` lets you preserve and reuse session data, including cookies and localStorage, across multiple runs. By providing a `storage_state`, you can start your crawls already “logged in” or with any other necessary session data, so there is no need to repeat the login flow every time.
|
||||
|
||||
## What is `storage_state`?
|
||||
|
||||
`storage_state` can be:
|
||||
|
||||
- A dictionary containing cookies and localStorage data.
|
||||
- A path to a JSON file that holds this information.
|
||||
|
||||
When you pass `storage_state` to the crawler, it applies these cookies and localStorage entries before loading any pages. This means your crawler effectively starts in a known authenticated or pre-configured state.
|
||||
|
||||
## Example Structure
|
||||
|
||||
Here’s an example storage state:
|
||||
|
||||
```json
{
  "cookies": [
    {
      "name": "session",
      "value": "abcd1234",
      "domain": "example.com",
      "path": "/",
      "expires": 1675363572.037711,
      "httpOnly": false,
      "secure": false,
      "sameSite": "None"
    }
  ],
  "origins": [
    {
      "origin": "https://example.com",
      "localStorage": [
        { "name": "token", "value": "my_auth_token" },
        { "name": "refreshToken", "value": "my_refresh_token" }
      ]
    }
  ]
}
```
|
||||
|
||||
This JSON sets a `session` cookie and two localStorage entries (`token` and `refreshToken`) for `https://example.com`.
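
If you assemble this structure programmatically, a small builder keeps the nesting straight. The sketch below uses only the standard library; `build_storage_state` is a hypothetical helper for illustration, and the values are the placeholder ones from the JSON above.

```python
import json
import tempfile
from pathlib import Path


def build_storage_state(cookies, local_storage_by_origin):
    """Assemble a storage-state dict from cookies and per-origin localStorage items."""
    return {
        "cookies": list(cookies),
        "origins": [
            {
                "origin": origin,
                "localStorage": [
                    {"name": name, "value": value} for name, value in items.items()
                ],
            }
            for origin, items in local_storage_by_origin.items()
        ],
    }


state = build_storage_state(
    cookies=[
        {
            "name": "session",
            "value": "abcd1234",
            "domain": "example.com",
            "path": "/",
            "expires": 1675363572.037711,
            "httpOnly": False,
            "secure": False,
            "sameSite": "None",
        }
    ],
    local_storage_by_origin={
        "https://example.com": {
            "token": "my_auth_token",
            "refreshToken": "my_refresh_token",
        }
    },
)

# Write the state to a JSON file and read it back to confirm it round-trips
with tempfile.TemporaryDirectory() as tmp:
    state_path = Path(tmp) / "mystate.json"
    state_path.write_text(json.dumps(state, indent=2), encoding="utf-8")
    reloaded = json.loads(state_path.read_text(encoding="utf-8"))
    print(reloaded == state)
```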
|
||||
|
||||
---
|
||||
|
||||
## Passing `storage_state` as a Dictionary
|
||||
|
||||
You can directly provide the data as a dictionary:
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    storage_dict = {
        "cookies": [
            {
                "name": "session",
                "value": "abcd1234",
                "domain": "example.com",
                "path": "/",
                "expires": 1675363572.037711,
                "httpOnly": False,
                "secure": False,
                "sameSite": "None"
            }
        ],
        "origins": [
            {
                "origin": "https://example.com",
                "localStorage": [
                    {"name": "token", "value": "my_auth_token"},
                    {"name": "refreshToken", "value": "my_refresh_token"}
                ]
            }
        ]
    }

    async with AsyncWebCrawler(
        headless=True,
        storage_state=storage_dict
    ) as crawler:
        result = await crawler.arun(url='https://example.com/protected')
        if result.success:
            print("Crawl succeeded with pre-loaded session data!")
            print("Page HTML length:", len(result.html))

if __name__ == "__main__":
    asyncio.run(main())
```
|
||||
|
||||
---
|
||||
|
||||
## Passing `storage_state` as a File
|
||||
|
||||
If you prefer a file-based approach, save the JSON above to `mystate.json` and reference it:
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(
        headless=True,
        storage_state="mystate.json"  # Uses a JSON file instead of a dictionary
    ) as crawler:
        result = await crawler.arun(url='https://example.com/protected')
        if result.success:
            print("Crawl succeeded with pre-loaded session data!")
            print("Page HTML length:", len(result.html))

if __name__ == "__main__":
    asyncio.run(main())
```
|
||||
|
||||
---
|
||||
|
||||
## Using `storage_state` to Avoid Repeated Logins (Sign In Once, Use Later)
|
||||
|
||||
A common scenario is when you need to log in to a site (entering username/password, etc.) to access protected pages. Doing so every crawl is cumbersome. Instead, you can:
|
||||
|
||||
1. Perform the login once in a hook.
|
||||
2. After login completes, export the resulting `storage_state` to a file.
|
||||
3. On subsequent runs, provide that `storage_state` to skip the login step.
|
||||
|
||||
**Step-by-Step Example:**
|
||||
|
||||
**First Run (Perform Login and Save State):**
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def on_browser_created_hook(browser):
    # Access the default context and create a page
    context = browser.contexts[0]
    page = await context.new_page()

    # Navigate to the login page
    await page.goto("https://example.com/login", wait_until="domcontentloaded")

    # Fill in credentials and submit
    await page.fill("input[name='username']", "myuser")
    await page.fill("input[name='password']", "mypassword")
    await page.click("button[type='submit']")
    await page.wait_for_load_state("networkidle")

    # Now the site sets tokens in localStorage and cookies
    # Export this state to a file so we can reuse it
    await context.storage_state(path="my_storage_state.json")
    await page.close()

async def main():
    # First run: perform login and export the storage_state
    async with AsyncWebCrawler(
        headless=True,
        verbose=True,
        hooks={"on_browser_created": on_browser_created_hook},
        use_persistent_context=True,
        user_data_dir="./my_user_data"
    ) as crawler:

        # After on_browser_created_hook runs, we have storage_state saved to my_storage_state.json
        result = await crawler.arun(
            url='https://example.com/protected-page',
            cache_mode=CacheMode.BYPASS,
            markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True}),
        )
        print("First run result success:", result.success)
        if result.success:
            print("Protected page HTML length:", len(result.html))

if __name__ == "__main__":
    asyncio.run(main())
```
|
||||
|
||||
**Second Run (Reuse Saved State, No Login Needed):**
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # Second run: no need to hook on_browser_created this time.
    # Just provide the previously saved storage state.
    async with AsyncWebCrawler(
        headless=True,
        verbose=True,
        use_persistent_context=True,
        user_data_dir="./my_user_data",
        storage_state="my_storage_state.json"  # Reuse previously exported state
    ) as crawler:

        # Now the crawler starts already logged in
        result = await crawler.arun(
            url='https://example.com/protected-page',
            cache_mode=CacheMode.BYPASS,
            markdown_generator=DefaultMarkdownGenerator(options={"ignore_links": True}),
        )
        print("Second run result success:", result.success)
        if result.success:
            print("Protected page HTML length:", len(result.html))

if __name__ == "__main__":
    asyncio.run(main())
```
|
||||
|
||||
**What’s Happening Here?**
|
||||
|
||||
- During the first run, the `on_browser_created_hook` logs into the site.
|
||||
- After logging in, the hook exports the current session (cookies, localStorage, etc.) to `my_storage_state.json` by calling `context.storage_state(path=...)`.
|
||||
- On subsequent runs, passing `storage_state="my_storage_state.json"` starts the browser context with these tokens already in place, skipping the login steps.
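
A saved state is only useful while its cookies are still valid, so it can help to check expiry before reusing the file. The sketch below is a standalone illustration: `expired_cookie_names` is a hypothetical helper, and it assumes `expires` is a Unix timestamp in seconds with `-1` marking a session cookie, as in Playwright's storage state format.

```python
import time
from typing import Optional


def expired_cookie_names(storage_state: dict, now: Optional[float] = None) -> list:
    """Return the names of cookies whose `expires` timestamp is in the past.

    Assumes `expires` is Unix time in seconds and -1 means a session cookie.
    """
    now = time.time() if now is None else now
    names = []
    for cookie in storage_state.get("cookies", []):
        expires = cookie.get("expires", -1)
        if expires != -1 and expires <= now:
            names.append(cookie["name"])
    return names


# Example state: one dated cookie and one session cookie
state = {
    "cookies": [
        {"name": "session", "value": "abcd1234", "expires": 1675363572.037711},
        {"name": "prefs", "value": "dark", "expires": -1},
    ]
}

# Pass a fixed `now` so the result does not depend on the current date
print(expired_cookie_names(state, now=1700000000))
```

If the list is non-empty, rerunning the login step and exporting a fresh state is usually simpler than patching the file by hand.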
|
||||
|
||||
**Sign Out Scenario:**
|
||||
If the website allows you to sign out by clearing tokens or by navigating to a sign-out URL, you can also run a script that uses `on_browser_created_hook` or `arun` to simulate signing out, then export the resulting `storage_state` again. That would give you a baseline “logged out” state to start fresh from next time.
|
||||
|
||||
---
|
||||
|
||||
## Conclusion
|
||||
|
||||
By using `storage_state`, you can skip repetitive actions, like logging in, and jump straight into crawling protected content. Whether you provide a file path or a dictionary, this powerful feature helps maintain state between crawls, simplifying your data extraction pipelines.
|
||||
@@ -1,10 +0,0 @@
|
||||
storage_state_concept: Storage state preserves session data including cookies and localStorage across crawler runs | session persistence, state management | storage_state="mystate.json"
|
||||
storage_state_formats: Storage state can be provided as either a dictionary or path to JSON file | state configuration, json format | storage_state={"cookies": [...], "origins": [...]}
|
||||
cookie_structure: Cookies in storage state require name, value, domain, path, and expiration properties | cookie configuration, session cookies | "cookies": [{"name": "session", "value": "abcd1234", "domain": "example.com"}]
|
||||
localstorage_structure: localStorage entries are organized by origin with name-value pairs | web storage, browser storage | "localStorage": [{"name": "token", "value": "my_auth_token"}]
|
||||
authentication_preservation: Storage state enables starting crawls in authenticated state without repeating login flow | session management, login persistence | AsyncWebCrawler(storage_state="my_storage_state.json")
|
||||
state_export: Browser context state can be exported to JSON file after successful login | session export, state saving | await context.storage_state(path="my_storage_state.json")
|
||||
login_automation: Initial login can be performed using browser_created_hook to establish authenticated state | authentication automation, login process | on_browser_created_hook(browser)
|
||||
persistent_context: Crawler supports persistent context with user data directory for maintaining state | browser persistence, session storage | use_persistent_context=True, user_data_dir="./my_user_data"
|
||||
protected_content: Storage state enables direct access to protected content by preserving authentication tokens | authenticated access, protected pages | crawler.arun(url='https://example.com/protected')
|
||||
state_reuse: Subsequent crawler runs can reuse saved storage state to skip authentication steps | session reuse, login bypass | AsyncWebCrawler(storage_state="my_storage_state.json")
|
||||
@@ -1,85 +0,0 @@
|
||||
# CrawlerRunConfig Parameters Documentation
|
||||
|
||||
## Content Processing Parameters
|
||||
|
||||
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `word_count_threshold` | int | 200 | Minimum word count threshold before processing content |
| `extraction_strategy` | ExtractionStrategy | None | Strategy to extract structured data from crawled pages. When None, uses NoExtractionStrategy |
| `chunking_strategy` | ChunkingStrategy | RegexChunking() | Strategy to chunk content before extraction |
| `markdown_generator` | MarkdownGenerationStrategy | None | Strategy for generating markdown from extracted content |
| `content_filter` | RelevantContentFilter | None | Optional filter to prune irrelevant content |
| `only_text` | bool | False | If True, attempt to extract text-only content where applicable |
| `css_selector` | str | None | CSS selector to extract a specific portion of the page |
| `excluded_tags` | list[str] | [] | List of HTML tags to exclude from processing |
| `keep_data_attributes` | bool | False | If True, retain `data-*` attributes while removing unwanted attributes |
| `remove_forms` | bool | False | If True, remove all `<form>` elements from the HTML |
| `prettiify` | bool | False | If True, apply `fast_format_html` to produce prettified HTML output |
|
||||
|
||||
## Caching Parameters
|
||||
|
||||
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `cache_mode` | CacheMode | None | Defines how caching is handled. Defaults to CacheMode.ENABLED internally |
| `session_id` | str | None | Optional session ID to persist browser context and page instance |
| `bypass_cache` | bool | False | Legacy parameter, if True acts like CacheMode.BYPASS |
| `disable_cache` | bool | False | Legacy parameter, if True acts like CacheMode.DISABLED |
| `no_cache_read` | bool | False | Legacy parameter, if True acts like CacheMode.WRITE_ONLY |
| `no_cache_write` | bool | False | Legacy parameter, if True acts like CacheMode.READ_ONLY |
|
||||
|
||||
## Page Navigation and Timing Parameters
|
||||
|
||||
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `wait_until` | str | "domcontentloaded" | The condition to wait for when navigating |
| `page_timeout` | int | 60000 | Timeout in milliseconds for page operations like navigation |
| `wait_for` | str | None | CSS selector or JS condition to wait for before extracting content |
| `wait_for_images` | bool | True | If True, wait for images to load before extracting content |
| `delay_before_return_html` | float | 0.1 | Delay in seconds before retrieving final HTML |
| `mean_delay` | float | 0.1 | Mean base delay between requests when calling arun_many |
| `max_range` | float | 0.3 | Max random additional delay range for requests in arun_many |
| `semaphore_count` | int | 5 | Number of concurrent operations allowed |
|
||||
|
||||
## Page Interaction Parameters
|
||||
|
||||
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `js_code` | str or list[str] | None | JavaScript code/snippets to run on the page |
| `js_only` | bool | False | If True, indicates subsequent calls are JS-driven updates |
| `ignore_body_visibility` | bool | True | If True, ignore whether the body is visible before proceeding |
| `scan_full_page` | bool | False | If True, scroll through the entire page to load all content |
| `scroll_delay` | float | 0.2 | Delay in seconds between scroll steps if scan_full_page is True |
| `process_iframes` | bool | False | If True, attempts to process and inline iframe content |
| `remove_overlay_elements` | bool | False | If True, remove overlays/popups before extracting HTML |
| `simulate_user` | bool | False | If True, simulate user interactions for anti-bot measures |
| `override_navigator` | bool | False | If True, overrides navigator properties for more human-like behavior |
| `magic` | bool | False | If True, attempts automatic handling of overlays/popups |
| `adjust_viewport_to_content` | bool | False | If True, adjust viewport according to page content dimensions |
|
||||
|
||||
## Media Handling Parameters
|
||||
|
||||
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `screenshot` | bool | False | Whether to take a screenshot after crawling |
| `screenshot_wait_for` | float | None | Additional wait time before taking a screenshot |
| `screenshot_height_threshold` | int | 20000 | Threshold for page height to decide screenshot strategy |
| `pdf` | bool | False | Whether to generate a PDF of the page |
| `image_description_min_word_threshold` | int | 50 | Minimum words for image description extraction |
| `image_score_threshold` | int | 3 | Minimum score threshold for processing an image |
| `exclude_external_images` | bool | False | If True, exclude all external images from processing |
|
||||
|
||||
## Link and Domain Handling Parameters
|
||||
|
||||
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `exclude_social_media_domains` | list[str] | SOCIAL_MEDIA_DOMAINS | List of domains to exclude for social media links |
| `exclude_external_links` | bool | False | If True, exclude all external links from the results |
| `exclude_social_media_links` | bool | False | If True, exclude links pointing to social media domains |
| `exclude_domains` | list[str] | [] | List of specific domains to exclude from results |
|
||||
|
||||
## Debugging and Logging Parameters
|
||||
|
||||
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `verbose` | bool | True | Enable verbose logging |
| `log_console` | bool | False | If True, log console messages from the page |
|
||||
@@ -1,17 +0,0 @@
|
||||
content_processing: Configure word count threshold for processing crawled content | minimum words, content length, processing threshold | word_count_threshold=200
|
||||
extraction_config: Set strategy for extracting structured data from pages | data extraction, content parsing, structured data | extraction_strategy=ExtractionStrategy()
|
||||
chunking_setup: Configure content chunking strategy for processing | content splitting, text chunks, segmentation | chunking_strategy=RegexChunking()
|
||||
content_filtering: Filter irrelevant content using RelevantContentFilter | content pruning, filtering, relevance | content_filter=RelevantContentFilter()
|
||||
text_extraction: Extract only text content from web pages | text-only, content extraction, plain text | only_text=True
|
||||
css_selection: Target specific page elements using CSS selectors | element selection, content targeting, DOM selection | css_selector=".main-content"
|
||||
html_cleaning: Configure HTML tag exclusion and attribute handling | tag removal, attribute filtering, HTML cleanup | excluded_tags=["script", "style"], keep_data_attributes=True
|
||||
caching_config: Control page caching behavior and session persistence | cache settings, session management, cache control | cache_mode=CacheMode.ENABLED, session_id="session1"
|
||||
page_navigation: Configure page loading and navigation timing | page timeout, loading conditions, navigation settings | wait_until="domcontentloaded", page_timeout=60000
|
||||
request_timing: Set delays between multiple page requests | request delays, crawl timing, rate limiting | mean_delay=0.1, max_range=0.3
|
||||
concurrent_ops: Control number of concurrent crawling operations | concurrency, parallel requests, semaphore | semaphore_count=5
|
||||
page_interaction: Configure JavaScript execution and page scanning | JS execution, page scanning, user simulation | js_code="window.scrollTo(0,1000)", scan_full_page=True
|
||||
popup_handling: Manage overlay elements and popup removal | overlay removal, popup handling, anti-popup | remove_overlay_elements=True, magic=True
|
||||
media_capture: Configure screenshot and PDF generation settings | screenshots, PDF export, media capture | screenshot=True, pdf=True
|
||||
image_processing: Set thresholds for image processing and description | image handling, description extraction, image scoring | image_score_threshold=3, image_description_min_word_threshold=50
|
||||
link_filtering: Configure domain and link exclusion rules | domain filtering, link exclusion, URL filtering | exclude_external_links=True, exclude_domains=["example.com"]
|
||||
debug_settings: Control logging and debugging output | logging, debugging, console output | verbose=True, log_console=True
|
||||
@@ -1,16 +0,0 @@
|
||||
installation: Install Crawl4AI using pip and setup required dependencies | package installation, setup guide | pip install crawl4ai && crawl4ai-setup && playwright install chromium
|
||||
basic_usage: Create AsyncWebCrawler instance to extract web content into markdown | quick start, basic crawling | async with AsyncWebCrawler(verbose=True) as crawler: result = await crawler.arun("https://example.com")
|
||||
browser_configuration: Configure browser settings like headless mode, viewport, and JavaScript | browser setup, chrome options | BrowserConfig(headless=True, viewport_width=1920, viewport_height=1080)
|
||||
crawler_config: Set crawling parameters including selectors, timeouts and content filters | crawl settings, extraction config | CrawlerRunConfig(css_selector="article.main", page_timeout=60000)
|
||||
markdown_extraction: Get different markdown formats including raw, cited and filtered versions | content extraction, markdown output | result.markdown_v2.raw_markdown, result.markdown_v2.markdown_with_citations
|
||||
structured_extraction: Extract structured data using CSS or XPath selectors into JSON | data extraction, scraping | JsonCssExtractionStrategy(schema), JsonXPathExtractionStrategy(xpath_schema)
|
||||
llm_extraction: Use LLM models to extract structured data with custom schemas | AI extraction, model integration | LLMExtractionStrategy(provider="ollama/nemotron", schema=ModelSchema)
|
||||
dynamic_content: Handle JavaScript-driven content using custom JS code and wait conditions | dynamic pages, JS execution | run_config.js_code="window.scrollTo(0, document.body.scrollHeight);"
|
||||
media_handling: Access extracted images, videos and audio with relevance scores | media extraction, asset handling | result.media["images"], result.media["videos"]
|
||||
link_extraction: Get categorized internal and external links with context | link scraping, URL extraction | result.links["internal"], result.links["external"]
|
||||
authentication: Preserve login state using user data directory or storage state | login, session handling | BrowserConfig(user_data_dir="/path/to/profile")
|
||||
proxy_setup: Configure proxy settings with authentication for crawling | proxy configuration, network setup | browser_config.proxy_config={"server": "http://proxy.example.com:8080"}
|
||||
content_capture: Save screenshots and PDFs of crawled pages | page capture, downloads | run_config.screenshot=True, run_config.pdf=True
|
||||
caching: Enable result caching to improve performance | performance optimization, caching | run_config.cache_mode = CacheMode.ENABLED
|
||||
custom_hooks: Add custom logic at different stages of the crawling process | event hooks, customization | crawler.crawler_strategy.set_hook("on_page_context_created", hook_function)
|
||||
containerization: Run Crawl4AI in Docker with different architectures and GPU support | docker, deployment | docker pull unclecode/crawl4ai:basic-amd64
|
||||
@@ -1,112 +0,0 @@
|
||||
# Crawl4AI LLM Reference
|
||||
|
||||
> Minimal, code-focused reference for LLM-based retrieval and answer generation.
|
||||
|
||||
Intended usage: A language model trained on this document can provide quick answers to developers integrating Crawl4AI.
|
||||
|
||||
## Installation
|
||||
|
||||
- Basic:
|
||||
```bash
pip install crawl4ai
crawl4ai-setup
```
|
||||
|
||||
- If necessary:
|
||||
```bash
playwright install chromium
```
|
||||
|
||||
## Basic Usage
|
||||
|
||||
- Asynchronous crawl:
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(verbose=True) as c:
        r = await c.arun(url="https://example.com")
        print(r.markdown)

asyncio.run(main())
```
|
||||
|
||||
## Concurrent Crawling
|
||||
|
||||
- Multiple URLs:
|
||||
```python
urls = ["https://example.com/page1", "https://example.com/page2"]
async with AsyncWebCrawler() as c:
    results = await asyncio.gather(*[c.arun(url=u) for u in urls])
```
|
||||
|
||||
## Configuration
|
||||
|
||||
- CacheMode:
|
||||
```python
|
||||
from crawl4ai import CacheMode
|
||||
r = await c.arun(url="...", cache_mode=CacheMode.ENABLED)
|
||||
```
|
||||
|
||||
- Proxies:
|
||||
```python
async with AsyncWebCrawler(proxies={"http": "http://user:pass@proxy:port"}) as c:
    r = await c.arun("https://example.com")
```
|
||||
|
||||
- Headers & Viewport:
|
||||
```python
async with AsyncWebCrawler(headers={"User-Agent": "MyUA"}, viewport={"width":1024,"height":768}) as c:
    r = await c.arun("https://example.com")
```
|
||||
|
||||
## JavaScript Injection
|
||||
|
||||
- Custom JS:
|
||||
```python
js_code = ["""
(async () => {
    const btn = document.querySelector('#load-more');
    if (btn) btn.click();
    await new Promise(r => setTimeout(r, 1000));
})();
"""]

r = await c.arun(url="...", js_code=js_code)
```
|
||||
|
||||
## Extraction Strategies
|
||||
|
||||
- JSON CSS Extraction:
|
||||
```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {...}
r = await c.arun(url="...", extraction_strategy=JsonCssExtractionStrategy(schema))
```
|
||||
|
||||
- LLM Extraction:
|
||||
```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy

r = await c.arun(url="...",
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token="YOUR_API_KEY",
        schema={...},
        extraction_type="schema"
    )
)
```
|
||||
|
||||
## Common Issues
|
||||
|
||||
- Playwright errors: `playwright install chromium`
|
||||
- Empty output: Increase wait or use `js_code`.
|
||||
- SSL issues: Check certificates or use `verify_ssl=False` (not recommended for production).
|
||||
|
||||
## Additional Links
|
||||
|
||||
- [GitHub Repository](https://github.com/unclecode/crawl4ai)
|
||||
- [Documentation](https://crawl4ai.com/mkdocs/)
|
||||
@@ -1,13 +0,0 @@
|
||||
installation: Install Crawl4AI using pip and run setup command | package installation, setup | pip install crawl4ai && crawl4ai-setup
|
||||
playwright_setup: Install Chromium browser for Playwright if needed | browser installation, chromium setup | playwright install chromium
|
||||
async_crawler: Create asynchronous web crawler instance with optional verbose logging | crawler initialization, async setup | AsyncWebCrawler(verbose=True)
|
||||
basic_crawl: Perform basic asynchronous webpage crawl and get markdown output | single page crawl, basic usage | async with AsyncWebCrawler() as c: await c.arun(url="https://example.com")
|
||||
concurrent_crawling: Crawl multiple URLs simultaneously using asyncio.gather | parallel crawling, multiple urls | asyncio.gather(*[c.arun(url=u) for u in urls])
|
||||
cache_configuration: Enable or disable cache mode for crawling | caching, cache settings | cache_mode=CacheMode.ENABLED
|
||||
proxy_setup: Configure proxy settings for web crawler | proxy configuration, http proxy | proxies={"http": "http://user:pass@proxy:port"}
|
||||
browser_config: Set custom headers and viewport dimensions | user agent, viewport size | headers={"User-Agent": "MyUA"}, viewport={"width":1024,"height":768}
|
||||
javascript_injection: Inject custom JavaScript code during crawling | js injection, custom scripts | js_code=["""(async () => {...})();"""]
|
||||
json_extraction: Extract data using JSON CSS extraction strategy | css extraction, json schema | JsonCssExtractionStrategy(schema)
|
||||
llm_extraction: Configure LLM-based extraction with OpenAI integration | language model extraction, AI extraction | LLMExtractionStrategy(provider="openai/gpt-4o", api_token="KEY")
|
||||
troubleshooting: Common issues include Playwright errors, empty output, and SSL problems | error handling, debugging | playwright install chromium, verify_ssl=False
|
||||
documentation_links: Access additional resources through GitHub repository and official documentation | resources, links | github.com/unclecode/crawl4ai, crawl4ai.com/mkdocs/
|
||||
@@ -1,390 +0,0 @@
|
||||
# Core Configurations
|
||||
|
||||
## BrowserConfig
|
||||
`BrowserConfig` centralizes all parameters required to set up and manage a browser instance and its context. This configuration ensures consistent and documented browser behavior for the crawler. Below is a detailed explanation of each parameter and its optimal use cases.
|
||||
|
||||
### Parameters and Use Cases
|
||||
|
||||
#### `browser_type`
|
||||
- **Description**: Specifies the type of browser to launch.
|
||||
- Supported values: `"chromium"`, `"firefox"`, `"webkit"`
|
||||
- Default: `"chromium"`
|
||||
- **Use Case**:
|
||||
- Use `"chromium"` for general-purpose crawling with modern web standards.
|
||||
- Use `"firefox"` when testing against Firefox-specific behavior.
|
||||
- Use `"webkit"` for testing Safari-like environments.
|
||||
|
||||
#### `headless`
|
||||
- **Description**: Determines whether the browser runs in headless mode (no GUI).
|
||||
- Default: `True`
|
||||
- **Use Case**:
|
||||
- Enable for faster, automated operations without UI overhead.
|
||||
- Disable (`False`) when debugging or inspecting browser behavior visually.
|
||||
|
||||
#### `use_managed_browser`
|
||||
- **Description**: Enables advanced manipulation via a managed browser approach.
|
||||
- Default: `False`
|
||||
- **Use Case**:
|
||||
- Use when fine-grained control is needed over browser sessions, such as debugging network requests or reusing sessions.
|
||||
|
||||
#### `debugging_port`
|
||||
- **Description**: Port for remote debugging.
|
||||
- Default: 9222
|
||||
- **Use Case**:
|
||||
- Use for debugging browser sessions with DevTools or external tools.
|
||||
|
||||
#### `use_persistent_context`
|
||||
- **Description**: Uses a persistent browser context (e.g., saved profiles).
|
||||
- Automatically enables `use_managed_browser`.
|
||||
- Default: `False`
|
||||
- **Use Case**:
|
||||
- Persistent login sessions for authenticated crawling.
|
||||
- Retaining cookies or local storage across multiple runs.
|
||||
|
||||
#### `user_data_dir`
|
||||
- **Description**: Path to a directory for storing persistent browser data.
|
||||
- Default: `None`
|
||||
- **Use Case**:
|
||||
- Specify a directory to save browser profiles for multi-run crawls or debugging.
|
||||
|
||||
#### `chrome_channel`
|
||||
- **Description**: Specifies the Chrome channel to launch (e.g., `"chrome"`, `"msedge"`).
|
||||
- Applies only when `browser_type` is `"chromium"`.
|
||||
- Default: `"chrome"`
|
||||
- **Use Case**:
|
||||
- Use `"msedge"` for compatibility testing with Edge browsers.
|
||||
|
||||
#### `proxy` and `proxy_config`
|
||||
- **Description**:
|
||||
- `proxy`: Proxy server URL for the browser.
|
||||
- `proxy_config`: Detailed proxy configuration.
|
||||
- Default: `None`
|
||||
- **Use Case**:
|
||||
- Set `proxy` for single-proxy setups.
|
||||
- Use `proxy_config` for advanced configurations, such as authenticated proxies or regional routing.
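
As a rough illustration of the difference between the two options, the sketch below splits a single credentialed proxy URL into a dictionary with separate `server`, `username`, and `password` entries. The helper name `proxy_config_from_url` is hypothetical, and the `username`/`password` keys follow Playwright's proxy settings; whether `proxy_config` accepts exactly these keys in your version is an assumption to verify.

```python
from urllib.parse import unquote, urlsplit


def proxy_config_from_url(proxy_url: str) -> dict:
    """Split a proxy URL into a dict (hypothetical helper; key names are assumed)."""
    parts = urlsplit(proxy_url)
    if not parts.hostname:
        raise ValueError(f"Not a usable proxy URL: {proxy_url!r}")

    # Rebuild the server address without the embedded credentials.
    server = f"{parts.scheme}://{parts.hostname}"
    if parts.port:
        server += f":{parts.port}"

    config = {"server": server}
    # Credentials in URLs may be percent-encoded, so decode them.
    if parts.username:
        config["username"] = unquote(parts.username)
    if parts.password:
        config["password"] = unquote(parts.password)
    return config


if __name__ == "__main__":
    print(proxy_config_from_url("http://user:pass@proxy.example.com:8080"))
```

Keeping credentials out of the `server` string makes it easier to log the proxy address without leaking the password.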
|
||||
|
||||
#### `viewport_width` and `viewport_height`
|
||||
- **Description**: Sets the default browser viewport dimensions.
|
||||
- Default: `1080` (width), `600` (height)
|
||||
- **Use Case**:
|
||||
- Adjust for crawling responsive layouts or specific device emulations.
|
||||
|
||||
#### `accept_downloads` and `downloads_path`
|
||||
- **Description**:
|
||||
- `accept_downloads`: Allows file downloads.
|
||||
- `downloads_path`: Directory for storing downloads.
|
||||
- Default: `False`, `None`
|
||||
- **Use Case**:
|
||||
- Use when downloading and analyzing files like PDFs or spreadsheets.
|
||||
|
||||
#### `storage_state`
|
||||
- **Description**: Specifies cookies and local storage state.
|
||||
- Default: `None`
|
||||
- **Use Case**:
|
||||
- Provide state data for authenticated or preconfigured sessions.
|
||||
|
||||
#### `ignore_https_errors`
|
||||
- **Description**: Ignores HTTPS certificate errors.
|
||||
- Default: `True`
|
||||
- **Use Case**:
|
||||
- Enable for crawling sites with invalid certificates (testing environments).
|
||||
|
||||
#### `java_script_enabled`
|
||||
- **Description**: Toggles JavaScript execution in pages.
|
||||
- Default: `True`
|
||||
- **Use Case**:
|
||||
- Disable for simpler, faster crawls where JavaScript is unnecessary.
|
||||
|
||||
#### `cookies`
|
||||
- **Description**: List of cookies to add to the browser context.
|
||||
- Default: `[]`
|
||||
- **Use Case**:
|
||||
- Use for authenticated or preconfigured crawling scenarios.
|
||||
|
||||
#### `headers`
|
||||
- **Description**: Extra HTTP headers applied to all requests.
|
||||
- Default: `{}`
|
||||
- **Use Case**:
|
||||
- Customize headers for API-like crawling or bypassing bot detections.
|
||||
|
||||
#### `user_agent` and `user_agent_mode`
|
||||
- **Description**:
|
||||
- `user_agent`: Custom User-Agent string.
|
||||
- `user_agent_mode`: Mode for generating User-Agent (e.g., `"random"`).
|
||||
- Default: Standard Chromium-based User-Agent.
|
||||
- **Use Case**:
|
||||
- Set static User-Agent for consistent identification.
|
||||
- Use `"random"` mode to reduce bot detection likelihood.
|
||||
|
||||
#### `text_mode`
|
||||
- **Description**: Disables images and other rich content for faster load times.
|
||||
- Default: `False`
|
||||
- **Use Case**:
|
||||
- Enable for text-only extraction tasks where speed is prioritized.
|
||||
|
||||
#### `light_mode`
|
||||
- **Description**: Disables background features for performance gains.
|
||||
- Default: `False`
|
||||
- **Use Case**:
|
||||
- Enable for high-performance crawls on resource-constrained environments.
|
||||
|
||||
#### `extra_args`
|
||||
- **Description**: Additional command-line arguments for browser execution.
|
||||
- Default: `[]`
|
||||
- **Use Case**:
|
||||
- Use for advanced browser configurations like WebRTC or GPU tuning.
|
||||
|
||||
#### `verbose`
|
||||
- **Description**: Enable verbose logging of browser operations.
|
||||
- Default: `True`
|
||||
- **Use Case**:
|
||||
- Enable for detailed logging during development and debugging.
|
||||
- Disable in production for better performance.
|
||||
|
||||
#### `sleep_on_close`
|
||||
- **Description**: Adds a delay before closing the browser.
|
||||
- Default: `False`
|
||||
- **Use Case**:
|
||||
- Enable when you need to ensure all browser operations are complete before closing.
|
||||
|
||||
## CrawlerRunConfig
|
||||
The `CrawlerRunConfig` class centralizes parameters for controlling crawl operations. This configuration covers content extraction, page interactions, caching, and runtime behaviors. Below is an exhaustive breakdown of parameters and their best-use scenarios.
|
||||
|
||||
### Parameters and Use Cases
|
||||
|
||||
#### Content Processing Parameters
|
||||
|
||||
##### `word_count_threshold`
|
||||
- **Description**: Minimum word count threshold for processing content.
|
||||
- Default: `200`
|
||||
- **Use Case**:
|
||||
- Set a higher threshold for content-heavy pages to skip lightweight or irrelevant content.
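
To make the effect of the threshold concrete, here is a minimal sketch that drops text blocks below a word count. It is not the library's implementation: the function name `filter_blocks`, whitespace-based word counting, and the inclusive comparison are all assumptions made for illustration.

```python
def filter_blocks(blocks, word_count_threshold=200):
    """Keep only text blocks with at least `word_count_threshold` words.

    Words are counted by splitting on whitespace (an assumption for this sketch).
    """
    return [b for b in blocks if len(b.split()) >= word_count_threshold]


if __name__ == "__main__":
    sample = ["Home | About | Contact", "word " * 250, "Short caption."]
    kept = filter_blocks(sample)
    print(f"kept {len(kept)} of {len(sample)} blocks")
```

With the default of 200, navigation bars and captions fall away while article-length blocks remain; lowering the value keeps shorter fragments.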
|
||||
|
||||
##### `extraction_strategy`
|
||||
- **Description**: Strategy for extracting structured data from crawled pages.
|
||||
- Default: `None` (uses `NoExtractionStrategy` by default).
|
||||
- **Use Case**:
|
||||
- Use for schema-driven extraction when working with well-defined data models like JSON.
|
||||
|
||||
##### `chunking_strategy`
|
||||
- **Description**: Strategy to chunk content before extraction.
|
||||
- Default: `RegexChunking()`.
|
||||
- **Use Case**:
|
||||
- Use NLP-based chunking for semantic extractions or regex for predictable text blocks.
|
||||
|
||||
##### `markdown_generator`
|
||||
- **Description**: Strategy for generating Markdown output.
|
||||
- Default: `None`.
|
||||
- **Use Case**:
|
||||
- Use custom Markdown strategies for AI-ready outputs like RAG pipelines.
|
||||
|
||||
##### `content_filter`
|
||||
- **Description**: Optional filter to prune irrelevant content.
|
||||
- Default: `None`.
|
||||
- **Use Case**:
|
||||
- Use relevance-based filters for focused crawls, e.g., keyword-specific searches.
|
||||
|
||||
##### `only_text`
|
||||
- **Description**: Extracts text-only content where applicable.
|
||||
- Default: `False`.
|
||||
- **Use Case**:
|
||||
- Enable for extracting clean text without HTML tags or rich content.
|
||||
|
||||
##### `css_selector`
|
||||
- **Description**: CSS selector to extract a specific portion of the page.
|
||||
- Default: `None`.
|
||||
- **Use Case**:
|
||||
- Use when targeting specific page elements, like articles or headlines.
|
||||
|
||||
##### `excluded_tags`
|
||||
- **Description**: List of HTML tags to exclude from processing.
|
||||
- Default: `None`.
|
||||
- **Use Case**:
|
||||
- Remove elements like `<script>` or `<style>` during text extraction.
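
The idea behind tag exclusion can be sketched with the standard library's `html.parser`: text is collected only while the parser is outside every excluded element. The class and function names here are hypothetical, and this is an illustration of the concept rather than how the crawler processes HTML internally.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they cannot be tracked by depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagExcludingTextExtractor(HTMLParser):
    """Collect text, skipping everything nested inside excluded tags."""

    def __init__(self, excluded_tags):
        super().__init__()
        self.excluded = {t.lower() for t in excluded_tags} - VOID_TAGS
        self.skip_depth = 0  # Greater than zero while inside an excluded element.
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.excluded:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.excluded and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html, excluded_tags=("script", "style")):
    """Return the visible text of `html` with excluded elements removed."""
    parser = TagExcludingTextExtractor(excluded_tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Adding `"form"` to the excluded tags in this sketch has roughly the effect that `remove_forms=True` is described as having. Badly nested markup can confuse the depth counter, so a production pipeline would rely on a tolerant HTML parser instead.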
|
||||
|
||||
##### `keep_data_attributes`
|
||||
- **Description**: Retain `data-*` attributes in the HTML.
|
||||
- Default: `False`.
|
||||
- **Use Case**:
|
||||
- Enable for extracting custom attributes in HTML structures.
|
||||
|
||||
##### `remove_forms`
|
||||
- **Description**: Removes all `<form>` elements from the page.
|
||||
- Default: `False`.
|
||||
- **Use Case**:
|
||||
- Use when forms are irrelevant and clutter the extracted content.
|
||||
|
||||
##### `prettiify`
|
||||
- **Description**: Beautifies the HTML output.
|
||||
- Default: `False`.
|
||||
- **Use Case**:
|
||||
- Enable for generating readable HTML outputs.
|
||||
|
||||
---
|
||||
|
||||
#### Caching Parameters
|
||||
|
||||
##### `cache_mode`
|
||||
- **Description**: Controls how caching is handled.
|
||||
- Default: `CacheMode.ENABLED`.
|
||||
- **Use Case**:
|
||||
- Use `WRITE_ONLY` mode for crawls where fresh content is critical.
|
||||
|
||||
##### `session_id`
|
||||
- **Description**: Specifies a session ID to persist browser context.
|
||||
- Default: `None`.
|
||||
- **Use Case**:
|
||||
- Use for maintaining login states or multi-page workflows.
|
||||
|
||||
##### `bypass_cache`, `disable_cache`, `no_cache_read`, `no_cache_write`
|
||||
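- **Description**: Legacy boolean flags for cache handling, each equivalent to a `CacheMode` value: `bypass_cache` acts like `CacheMode.BYPASS`, `disable_cache` like `CacheMode.DISABLED`, `no_cache_read` like `CacheMode.WRITE_ONLY`, and `no_cache_write` like `CacheMode.READ_ONLY`.
- Default: `False`.
- **Use Case**:
- Kept for older code; prefer `cache_mode` when writing new configurations.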
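
The relationship between the legacy flags and `cache_mode` can be pictured as a small resolution function. The sketch below uses a locally defined stand-in enum (`CacheModeSketch`) so it runs without the library, and the precedence between flags is an assumption; only the flag-to-mode pairs come from the parameter tables.

```python
from enum import Enum


class CacheModeSketch(Enum):
    """Stand-in for the library's CacheMode, defined locally so the sketch runs alone."""
    ENABLED = "enabled"
    DISABLED = "disabled"
    READ_ONLY = "read_only"
    WRITE_ONLY = "write_only"
    BYPASS = "bypass"


def resolve_cache_mode(
    cache_mode=None,
    bypass_cache=False,
    disable_cache=False,
    no_cache_read=False,
    no_cache_write=False,
):
    """Map legacy flags to a mode. The precedence order here is an assumption."""
    if cache_mode is not None:
        return cache_mode  # An explicit mode wins over any legacy flag.
    if disable_cache:
        return CacheModeSketch.DISABLED
    if bypass_cache:
        return CacheModeSketch.BYPASS
    if no_cache_read:
        return CacheModeSketch.WRITE_ONLY
    if no_cache_write:
        return CacheModeSketch.READ_ONLY
    return CacheModeSketch.ENABLED  # Documented internal default.


if __name__ == "__main__":
    print(resolve_cache_mode(no_cache_read=True))
```

Reading the flags this way shows why a single `cache_mode` value is easier to reason about than four booleans that can contradict each other.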
|
||||
|
||||
---
|
||||
|
||||
#### Page Navigation and Timing Parameters
|
||||
|
||||
##### `wait_until`
|
||||
- **Description**: Defines the navigation wait condition (e.g., `"domcontentloaded"`).
|
||||
- Default: `"domcontentloaded"`.
|
||||
- **Use Case**:
|
||||
- Adjust to `"networkidle"` for pages with heavy JavaScript rendering.
|
||||
|
||||
##### `page_timeout`
|
||||
- **Description**: Timeout in milliseconds for page operations.
|
||||
- Default: `60000` (60 seconds).
|
||||
- **Use Case**:
|
||||
- Increase for slow-loading pages or complex sites.
|
||||
|
||||
##### `wait_for`
|
||||
- **Description**: CSS selector or JS condition to wait for before extraction.
|
||||
- Default: `None`.
|
||||
- **Use Case**:
|
||||
- Use for dynamic content that requires specific elements to load.
|
||||
|
||||
##### `wait_for_images`
|
||||
- **Description**: Waits for images to load before content extraction.
|
||||
- Default: `True`.
|
||||
- **Use Case**:
|
||||
- Disable for faster crawls when image data isn’t required.
|
||||
|
||||
##### `delay_before_return_html`
|
||||
- **Description**: Delay in seconds before retrieving HTML.
|
||||
- Default: `0.1`.
|
||||
- **Use Case**:
|
||||
- Use for ensuring final DOM updates are captured.
|
||||
|
||||
##### `mean_delay` and `max_range`
|
||||
- **Description**: Configures base and random delays between requests.
|
||||
- Default: `0.1` (mean), `0.3` (max).
|
||||
- **Use Case**:
|
||||
- Increase for stealthy crawls to avoid bot detection.
|
||||
|
||||
##### `semaphore_count`
|
||||
- **Description**: Number of concurrent operations allowed.
|
||||
- Default: `5`.
|
||||
- **Use Case**:
|
||||
- Adjust based on system resources and network limitations.
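
The timing parameters above (`mean_delay`, `max_range`, `semaphore_count`) can be pictured together with plain `asyncio`: a semaphore caps how many tasks run at once, and each task sleeps for a base delay plus a random extra. This is a sketch of the general technique, not the crawler's internals; `throttled_run` and `fake_fetch` are hypothetical names, and combining the delays as `mean_delay + uniform(0, max_range)` is an assumption.

```python
import asyncio
import random


async def throttled_run(urls, fetch, semaphore_count=5, mean_delay=0.1, max_range=0.3):
    """Run fetch(url) for every URL with bounded concurrency and a jittered delay."""
    semaphore = asyncio.Semaphore(semaphore_count)

    async def worker(url):
        async with semaphore:
            # Base delay plus a random extra in [0, max_range] seconds.
            await asyncio.sleep(mean_delay + random.uniform(0, max_range))
            return await fetch(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(worker(u) for u in urls))


async def fake_fetch(url):
    """Placeholder for a real crawl call; returns the URL length."""
    await asyncio.sleep(0)
    return len(url)


if __name__ == "__main__":
    pages = [f"https://example.com/page{i}" for i in range(8)]
    print(asyncio.run(throttled_run(pages, fake_fetch, mean_delay=0.01, max_range=0.02)))
```

Raising `semaphore_count` increases throughput at the cost of memory and load on the target site, while larger delays trade speed for a gentler request pattern.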
|
||||
|
||||
---
|
||||
|
||||
#### Page Interaction Parameters
|
||||
|
||||
##### `js_code`
|
||||
- **Description**: JavaScript code or snippets to execute on the page.
|
||||
- Default: `None`.
|
||||
- **Use Case**:
|
||||
- Use for custom interactions like clicking tabs or dynamically loading content.
|
||||
|
||||
##### `js_only`
|
||||
- **Description**: Indicates subsequent calls rely only on JS updates.
|
||||
- Default: `False`.
|
||||
- **Use Case**:
|
||||
- Enable for single-page applications (SPAs) with dynamic content.
|
||||
|
||||
##### `scan_full_page`
|
||||
- **Description**: Simulates scrolling to load all content.
|
||||
- Default: `False`.
|
||||
- **Use Case**:
|
||||
- Use for infinite-scroll pages or loading all dynamic elements.
|
||||
|
||||
##### `adjust_viewport_to_content`
|
||||
- **Description**: Adjusts viewport to match content dimensions.
|
||||
- Default: `False`.
|
||||
- **Use Case**:
|
||||
- Enable for capturing content-heavy pages fully.
|
||||
|
||||
---
|
||||
|
||||
#### Media Handling Parameters
|
||||
|
||||
##### `screenshot`
|
||||
- **Description**: Captures a screenshot after crawling.
|
||||
- Default: `False`.
|
||||
- **Use Case**:
|
||||
- Enable for visual debugging or reporting purposes.
|
||||
|
||||
##### `pdf`
|
||||
- **Description**: Generates a PDF of the page.
|
||||
- Default: `False`.
|
||||
- **Use Case**:
|
||||
- Use for archiving or sharing rendered page outputs.
|
||||
|
||||
##### `image_description_min_word_threshold` and `image_score_threshold`
|
||||
- **Description**: Controls thresholds for image description extraction and processing.
|
||||
- Default: `50` (words), `3` (score).
|
||||
- **Use Case**:
|
||||
- Adjust for higher relevance or descriptive quality of image metadata.
|
||||
|
||||
---
|
||||
|
||||
#### Debugging and Logging Parameters
|
||||
|
||||
##### `verbose`
|
||||
- **Description**: Enables detailed logging.
|
||||
- Default: `True`.
|
||||
- **Use Case**:
|
||||
- Use for troubleshooting or analyzing crawler behavior.
|
||||
|
||||
##### `log_console`
|
||||
- **Description**: Logs browser console messages.
|
||||
- Default: `False`.
|
||||
- **Use Case**:
|
||||
- Enable when debugging JavaScript errors on pages.
|
||||
|
||||
##### `parser_type`
|
||||
- **Description**: Type of parser to use for HTML parsing.
|
||||
- Default: `"lxml"`
|
||||
- **Use Case**:
|
||||
- Use when specific HTML parsing requirements are needed.
|
||||
- `"lxml"` provides good performance and standards compliance.
|
||||
|
||||
##### `prettiify`
|
||||
- **Description**: Apply `fast_format_html` to produce prettified HTML output.
|
||||
- Default: `False`
|
||||
- **Use Case**:
|
||||
- Enable for better readability of extracted HTML content.
|
||||
- Useful during development and debugging.
|
||||
|
||||
##### `fetch_ssl_certificate`
|
||||
- **Description**: Fetch and store SSL certificate information during crawling.
|
||||
- Default: `False`
|
||||
- **Use Case**:
|
||||
- Enable when SSL certificate analysis is required.
|
||||
- Useful for security audits and certificate validation.
|
||||
|
||||
##### `url`
|
||||
- **Description**: Target URL for the crawl operation.
|
||||
- Default: `None`
|
||||
- **Use Case**:
|
||||
- Set when initializing a crawler for a specific URL.
|
||||
- Can be overridden during actual crawl operations.
|
||||
|
||||
##### `log_console`
|
||||
- **Description**: Log browser console messages during crawling.
|
||||
- Default: `False`
|
||||
- **Use Case**:
|
||||
- Enable to capture JavaScript console output.
|
||||
- Useful for debugging JavaScript-heavy pages.
|
||||
@@ -1,20 +0,0 @@
|
||||
browser_config: Configure browser type with chromium, firefox, or webkit support | browser selection, browser engine, web engine | BrowserConfig(browser_type="chromium")
|
||||
headless_mode: Toggle headless browser mode for GUI-less operation | headless browser, no GUI, background mode | BrowserConfig(headless=True)
|
||||
managed_browser: Enable advanced browser manipulation and control | browser management, session control | BrowserConfig(use_managed_browser=True)
|
||||
debugging_setup: Configure remote debugging port for browser inspection | debug port, devtools connection | BrowserConfig(debugging_port=9222)
|
||||
persistent_context: Enable persistent browser sessions for maintaining state | session persistence, profile saving | BrowserConfig(use_persistent_context=True)
|
||||
browser_profile: Specify directory for storing browser profile data | user data, profile storage | BrowserConfig(user_data_dir="/path/to/profile")
|
||||
proxy_configuration: Set up proxy settings for browser connections | proxy server, network routing | BrowserConfig(proxy="http://proxy.example.com:8080")
|
||||
viewport_settings: Configure browser window dimensions | screen size, window dimensions | BrowserConfig(viewport_width=1920, viewport_height=1080)
|
||||
download_handling: Configure browser download behavior and location | file downloads, download directory | BrowserConfig(accept_downloads=True, downloads_path="/downloads")
|
||||
content_threshold: Set minimum word count for processing page content | word limit, content filter | CrawlerRunConfig(word_count_threshold=200)
|
||||
extraction_strategy: Configure method for extracting structured data | data extraction, parsing strategy | CrawlerRunConfig(extraction_strategy=CustomStrategy())
|
||||
content_chunking: Define strategy for breaking content into chunks | text chunking, content splitting | CrawlerRunConfig(chunking_strategy=RegexChunking())
|
||||
cache_behavior: Control caching mode for crawler operations | cache control, data caching | CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
|
||||
page_navigation: Configure page load and navigation timing | page timeout, navigation wait | CrawlerRunConfig(wait_until="domcontentloaded", page_timeout=60000)
|
||||
javascript_execution: Enable or disable JavaScript processing | JS handling, script execution | CrawlerRunConfig(java_script_enabled=True)
|
||||
content_filtering: Configure HTML tag exclusion and content cleanup | tag filtering, content cleanup | CrawlerRunConfig(excluded_tags=["script", "style"])
|
||||
concurrent_operations: Set limit for simultaneous crawler operations | concurrency control, parallel crawling | CrawlerRunConfig(semaphore_count=5)
|
||||
page_interaction: Configure JavaScript execution and page scanning | page automation, interaction control | CrawlerRunConfig(js_code="custom_script()", scan_full_page=True)
|
||||
media_capture: Enable screenshot and PDF generation capabilities | visual capture, page export | CrawlerRunConfig(screenshot=True, pdf=True)
|
||||
debugging_options: Configure logging and console message capture | debug logging, error tracking | CrawlerRunConfig(verbose=True, log_console=True)
|
||||
@@ -1,280 +0,0 @@
|
||||
# Extended Documentation: Asynchronous Crawling with `AsyncWebCrawler`
|
||||
|
||||
This document provides a comprehensive, human-oriented overview of the `AsyncWebCrawler` class and related components from the `crawl4ai` package. It explains the motivations behind asynchronous crawling, shows how to configure and run crawls, and provides examples for advanced features like dynamic content handling, extraction strategies, caching, containerization, and troubleshooting.
|
||||
|
||||
## Introduction
|
||||
[EDIT: This is not a good way to introduce the library. The library excels at generating crawl data, as markdown or extracted JSON, as quickly as possible, and it is designed to be efficient in memory and CPU usage. Users should choose it because it generates markdown suitable for large language models and AI. It can also create structured data: it supports attaching large language models to generate structured output, and it includes JSON CSS and JSON XPath extraction, which let users define patterns and extract data quickly. Another strength is that it works everywhere. It can crawl any website by offering capabilities such as connecting to a remote browser or using persistent data, so developers can keep their own identity on websites where they have authenticated access and crawl without being mistaken for a bot. This is a better way to introduce the library. In these documents, we discuss the main class, `AsyncWebCrawler`, and everything we can achieve with it.]
|
||||
|
||||
Crawling websites can be slow if done sequentially, especially when handling large numbers of URLs or rendering dynamic pages. Asynchronous crawling helps you run multiple operations concurrently, improving throughput and performance. The `AsyncWebCrawler` class leverages asynchronous I/O and browser automation tools to fetch content efficiently, handle complex DOM interactions, and extract structured data.
|
||||
|
||||
### Quick Start
|
||||
|
||||
Before diving into advanced features, here is a quick start example that shows how to run a simple asynchronous crawl with a headless Chromium browser, extract basic text, and print the results.
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def main():
    # Basic browser configuration
    browser_config = BrowserConfig(browser_type="chromium", headless=True)

    # Run the crawler asynchronously
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://example.com")
        print("Extracted Markdown:")
        print(result.markdown)

asyncio.run(main())
```
|
||||
|
||||
This snippet initializes a headless Chromium browser, crawls the page, processes the HTML, and prints extracted content as Markdown.
|
||||
|
||||
## Browser Configuration
|
||||
|
||||
The `BrowserConfig` class defines browser-related settings and behaviors. You can customize:
|
||||
|
||||
- `browser_type`: Browser to use, such as `chromium` or `firefox`.
|
||||
- `headless`: Run the browser in headless mode (no visible UI).
|
||||
- `viewport_width` and `viewport_height`: Control viewport dimensions for rendering.
|
||||
- `proxy`: Configure proxies to bypass IP restrictions.
|
||||
- `verbose`: Control logging verbosity.
|
||||
|
||||
**Example: Customizing Browser Settings**
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_config = BrowserConfig(
    browser_type="firefox",
    headless=False,
    viewport_width=1920,
    viewport_height=1080,
    verbose=True,
)

async def main():
    # `async with` must run inside a coroutine
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://yourwebsite.com")
        print(result.markdown)

asyncio.run(main())
```
|
||||
|
||||
### Running in Docker
|
||||
|
||||
For scalability and reproducibility, consider running your crawler inside a Docker container. A simple Dockerfile might look like this:
|
||||
|
||||
```dockerfile
|
||||
FROM python:3.10-slim
|
||||
RUN apt-get update && apt-get install -y wget
|
||||
RUN pip install crawl4ai playwright
|
||||
RUN playwright install --with-deps chromium
|
||||
COPY your_script.py /app/your_script.py
|
||||
WORKDIR /app
|
||||
CMD ["python", "your_script.py"]
|
||||
```
|
||||
|
||||
You can then run:
|
||||
|
||||
```bash
|
||||
docker build -t mycrawler .
|
||||
docker run mycrawler
|
||||
```
|
||||
|
||||
Within this container, `AsyncWebCrawler` will launch Chromium using Playwright and crawl sites as configured.
|
||||
|
||||
## Asynchronous Crawling Strategies
|
||||
|
||||
By default, `AsyncWebCrawler` uses `AsyncPlaywrightCrawlerStrategy`, which relies on Playwright for browser automation. This lets you interact with DOM elements, scroll, click buttons, and handle dynamic content. If other strategies are available, you can specify them during initialization.
|
||||
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler, AsyncPlaywrightCrawlerStrategy
|
||||
|
||||
crawler = AsyncWebCrawler(crawler_strategy=AsyncPlaywrightCrawlerStrategy())
|
||||
```
|
||||
|
||||
## Handling Dynamic Content
|
||||
|
||||
Modern websites often load data via JavaScript or require user interactions. You can inject custom JavaScript snippets to manipulate the page, click buttons, or wait for certain elements to appear before extracting content.
|
||||
|
||||
**Example: Loading More Content**
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

# Click every "load more" button, then pause so new content can render
js_code = """
(async () => {
    const loadButtons = document.querySelectorAll(".load-more");
    for (const btn of loadButtons) btn.click();
    await new Promise(r => setTimeout(r, 2000)); // Wait for new content
})();
"""

config = CrawlerRunConfig(js_code=[js_code])

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/infinite-scroll", config=config)
        print("Extracted Markdown:")
        print(result.markdown)

asyncio.run(main())
```
|
||||
|
||||
You can also set `wait_for` in `CrawlerRunConfig` to a CSS selector or JS condition so that extraction starts only after specific elements have appeared.
|
||||
|
||||
## Extraction and Filtering
|
||||
|
||||
`AsyncWebCrawler` supports various extraction strategies to convert raw HTML into structured data. For example, `JsonCssExtractionStrategy` allows you to specify CSS selectors and get structured JSON from the page. `LLMExtractionStrategy` can feed extracted text into a language model for intelligent data extraction.
|
||||
|
||||
You can also apply content filters and chunking strategies to split large documents into smaller pieces before processing.
|
||||
|
||||
**Example: Using a JSON CSS Extraction Strategy**
|
||||
|
||||
```python
import asyncio
from crawl4ai import JsonCssExtractionStrategy, CrawlerRunConfig, AsyncWebCrawler, RegexChunking

# The strategy takes a schema: one record per baseSelector match, with the listed fields
schema = {"name": "Page", "baseSelector": "body",
          "fields": [{"name": "title", "selector": "h1", "type": "text"}]}

config = CrawlerRunConfig(
    extraction_strategy=JsonCssExtractionStrategy(schema),
    chunking_strategy=RegexChunking(),
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        print("Extracted Content:")
        print(result.extracted_content)

asyncio.run(main())
```
|
||||
|
||||
**Comparing Chunking Strategies:**
|
||||
|
||||
- Regex-based chunking: Splits text by patterns, good for basic splitting.
|
||||
- NLP-based chunking (if available): Splits text into semantically meaningful units, ideal for LLM-based extraction.
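
To make pattern-based splitting concrete, here is a minimal standard-library sketch. It is not the library's `RegexChunking` implementation. The helper name `regex_chunks` and the default blank-line pattern are assumptions chosen for illustration.

```python
import re


def regex_chunks(text: str, patterns=(r"\n\n+",)) -> list[str]:
    """Split text on each pattern in turn, dropping empty pieces."""
    chunks = [text]
    for pattern in patterns:
        chunks = [piece.strip() for chunk in chunks for piece in re.split(pattern, chunk)]
    return [c for c in chunks if c]


sample = "First paragraph.\n\nSecond paragraph.\n\n\nThird paragraph."
print(regex_chunks(sample))
# ['First paragraph.', 'Second paragraph.', 'Third paragraph.']
```

Splitting on blank lines keeps paragraphs intact, which is usually enough for basic processing. Semantic chunkers trade this simplicity for better boundaries.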
|
||||
|
||||
## Caching and Performance
|
||||
|
||||
Caching helps avoid repeatedly fetching and rendering the same page. By default, caching is enabled (`CacheMode.ENABLED`), so subsequent crawls of the same URL can skip the network fetch if the data is still fresh. You can control the cache mode, clear the cache, or bypass it when needed.
|
||||
|
||||
**Cache Modes:**
|
||||
|
||||
- `CacheMode.ENABLED`: Use cache if available, write new results to cache.
|
||||
- `CacheMode.BYPASS`: Skip cache reading, but still write new results.
|
||||
- `CacheMode.DISABLED`: Do not use cache at all.
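
The three modes differ only in whether a crawl reads from and writes to the cache. The sketch below restates the list above as a small lookup. `Mode` and `cache_plan` are hypothetical stand-ins written for this example, not the library's `CacheMode` class.

```python
from enum import Enum


class Mode(Enum):  # hypothetical stand-in for the library's CacheMode
    ENABLED = "enabled"
    BYPASS = "bypass"
    DISABLED = "disabled"


def cache_plan(mode: Mode) -> dict:
    """Describe whether a crawl reads from and writes to the cache."""
    return {
        "read": mode is Mode.ENABLED,
        "write": mode in (Mode.ENABLED, Mode.BYPASS),
    }


for mode in Mode:
    print(mode.name, cache_plan(mode))
```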
|
||||
|
||||
**Clearing and Flushing the Cache:**
|
||||
|
||||
```python
async with AsyncWebCrawler() as crawler:
    await crawler.aclear_cache()  # Clear entire cache
    # ... run some crawls ...
    await crawler.aflush_cache()  # Flush partial entries if needed
```
|
||||
|
||||
Use caching to speed up development, repeated tests, or partial re-runs of large crawls.
|
||||
|
||||
## Batch Crawling and Parallelization
|
||||
|
||||
The `arun_many` method lets you process multiple URLs concurrently, improving throughput. You can limit concurrency with `semaphore_count` and apply rate limiting via `CrawlerRunConfig` parameters like `mean_delay` and `max_range`.
|
||||
|
||||
**Example: Batch Crawling**
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

urls = [
    "https://site1.com",
    "https://site2.com",
    "https://site3.com",
]

# Up to 10 concurrent crawls, with a randomized pause between requests
config = CrawlerRunConfig(semaphore_count=10, mean_delay=1.0, max_range=0.5)

async def main():
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls, config=config)
        for res in results:
            print(res.url, res.markdown)

asyncio.run(main())
```
|
||||
|
||||
This allows you to process large URL lists efficiently. Adjust `semaphore_count` to match your resource limits.
|
||||
|
||||
## Scaling Crawls
|
||||
|
||||
To scale beyond a single machine, consider:
|
||||
|
||||
- Distributing URL lists across multiple workers or containers.
|
||||
- Using a job queue like Celery or Redis Queue to schedule crawls.
|
||||
- Integrating with cloud-based solutions for browser automation.
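
Distributing a URL list across workers can be as simple as dealing the URLs round-robin into batches, as in the sketch below. The `partition` helper is hypothetical, and how each batch reaches a worker (queue, container arguments, files) is left open.

```python
def partition(urls: list[str], n_workers: int) -> list[list[str]]:
    """Deal URLs round-robin into n_workers batches."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    return [urls[i::n_workers] for i in range(n_workers)]


urls = [f"https://example.com/page/{i}" for i in range(7)]
batches = partition(urls, 3)
print([len(b) for b in batches])  # [3, 2, 2]
```

Round-robin keeps batch sizes within one URL of each other, so no worker ends up with a much longer list than the rest.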
|
||||
|
||||
Always ensure you respect target site policies and comply with legal and ethical guidelines for web scraping.
|
||||
|
||||
## Screenshots and PDFs
|
||||
|
||||
If you need visual confirmation, you can enable screenshots or PDFs:
|
||||
|
||||
```python
import asyncio
from base64 import b64decode
from crawl4ai import CrawlerRunConfig, AsyncWebCrawler

config = CrawlerRunConfig(screenshot=True, pdf=True)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        # The screenshot is returned as a base64-encoded string; decode it before writing
        with open("page_screenshot.png", "wb") as f:
            f.write(b64decode(result.screenshot))
        # The PDF is returned as raw bytes
        with open("page.pdf", "wb") as f:
            f.write(result.pdf)

asyncio.run(main())
```
|
||||
|
||||
This is helpful for debugging rendering issues or retaining visual copies of crawled pages.
|
||||
|
||||
## Troubleshooting and Common Issues
|
||||
|
||||
**Common Problems and Direct Fixes:**
|
||||
|
||||
1. **Browser not launching**:
|
||||
- Check that you have installed Playwright and run `playwright install` for the chosen browser.
|
||||
- Ensure all required dependencies are installed.
|
||||
|
||||
2. **Timeouts or partial loads**:
|
||||
- Increase `page_timeout`, or add delays between requests using `mean_delay` and `max_range`.
|
||||
- Wait for specific DOM elements to appear before proceeding.
|
||||
|
||||
3. **JavaScript not executing as expected**:
|
||||
- Use `js_code` in `CrawlerRunConfig` to inject scripts.
|
||||
- Check the browser console for errors, or set `headless=False` to debug UI interactions.
|
||||
|
||||
4. **Content Extraction fails**:
|
||||
- Validate CSS selectors or extraction strategies.
|
||||
- Try a different extraction strategy if the current one is not producing results.
|
||||
|
||||
5. **Stale Data due to Caching**:
|
||||
- Call `await crawler.aclear_cache()` to remove old entries.
|
||||
- Use `cache_mode=CacheMode.BYPASS` to fetch fresh data.
|
||||
|
||||
**Direct Code Fixes:**
|
||||
If you experience missing content after injecting JS, try waiting longer:
|
||||
```python
from crawl4ai import CrawlerRunConfig

js_code = """
(async () => {
    document.querySelector(".load-more").click();
    await new Promise(r => setTimeout(r, 3000)); // Wait longer for new content
})();
"""

config = CrawlerRunConfig(js_code=[js_code])
```
|
||||
|
||||
Or run with `headless=False` to visually verify that the UI is changing as expected.
|
||||
|
||||
## Best Practices and Tips
|
||||
|
||||
- **Structuring your code**: Keep crawl logic modular. Have separate functions for configuring crawls, extracting data, and processing results.
|
||||
- **Error Handling**: Wrap crawl operations in try/except blocks and log errors with `crawler.logger`.
|
||||
- **Avoiding Getting Blocked**: Use proxies or rotate user agents if you crawl frequently. Randomize delays between requests.
|
||||
- **Authentication and Session Management**: If the site requires login, provide the crawler with login steps via `js_code` or Playwright selectors. Consider using cookies or session storage retrieval in `CrawlerRunConfig`.
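
The error-handling advice above can be sketched as a small retry wrapper. This example uses the standard `logging` module in place of `crawler.logger` and a fake `flaky_fetch` coroutine in place of a real crawl. Both, along with the name `crawl_with_retry`, are assumptions made for illustration.

```python
import asyncio
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("crawl")


async def crawl_with_retry(fetch, url: str, attempts: int = 3, backoff: float = 0.01):
    """Call fetch(url); on failure, log the error and retry with a growing pause."""
    for attempt in range(1, attempts + 1):
        try:
            return await fetch(url)
        except Exception as exc:
            logger.warning("attempt %d/%d failed for %s: %s", attempt, attempts, url, exc)
            if attempt == attempts:
                raise  # out of retries: let the caller decide what to do
            await asyncio.sleep(backoff * attempt)


# Demo with a fake fetch that fails once, then succeeds
calls = {"n": 0}


async def flaky_fetch(url: str) -> str:
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("temporary failure")
    return f"content of {url}"


outcome = asyncio.run(crawl_with_retry(flaky_fetch, "https://example.com"))
print(outcome)
```

Re-raising after the last attempt keeps failures visible instead of silently returning nothing.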
|
||||
|
||||
## Reference and Additional Resources
|
||||
|
||||
- **GitHub Repository**: [crawl4ai GitHub](https://github.com/unclecode/crawl4ai)
|
||||
- **Playwright Docs**: [https://playwright.dev/](https://playwright.dev/)
|
||||
- **AsyncIO in Python**: [Python Asyncio Docs](https://docs.python.org/3/library/asyncio.html)
|
||||
|
||||
## FAQ
|
||||
|
||||
**Q**: How do I customize user agents?
|
||||
**A**: Pass `user_agent="MyUserAgentString"` to `arun` or `arun_many`, or update `crawler_strategy` directly.
|
||||
|
||||
**Q**: Can I crawl local HTML files?
|
||||
**A**: Yes, provide a `file://` URL or `raw:` prefix with raw HTML strings.
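
A small helper can build either form of target string. The sketch below is a hypothetical convenience function. The rule "starts with `<` means inline HTML" is an assumption for this example, not library behavior.

```python
from pathlib import Path


def to_crawl_target(source: str) -> str:
    """Map a file path or an HTML string to a crawlable target string."""
    if source.lstrip().startswith("<"):
        return "raw:" + source              # inline HTML
    return Path(source).resolve().as_uri()  # absolute file:// URL


print(to_crawl_target("<html><body><h1>Hi</h1></body></html>"))
print(to_crawl_target("page.html"))
```

`Path.as_uri()` requires an absolute path, which is why the helper resolves the path first.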
|
||||
|
||||
**Q**: How do I integrate LLM-based extraction?
|
||||
**A**: Set `extraction_strategy=LLMExtractionStrategy(...)` and provide a chunking strategy. This allows using large language models for context-aware data extraction.
|
||||
@@ -1,15 +0,0 @@
|
||||
quick_start: Basic async crawl setup requires BrowserConfig and AsyncWebCrawler initialization | getting started, basic usage, initialization | asyncio.run(AsyncWebCrawler(config=BrowserConfig(browser_type="chromium", headless=True)))
|
||||
browser_types: AsyncWebCrawler supports multiple browser types including Chromium and Firefox | supported browsers, browser options | BrowserConfig(browser_type="chromium")
|
||||
headless_mode: Browser can run in headless mode without UI for better performance | invisible browser, no GUI | BrowserConfig(headless=True)
|
||||
viewport_settings: Configure browser viewport dimensions for proper page rendering | screen size, window size | BrowserConfig(viewport_width=1920, viewport_height=1080)
|
||||
docker_deployment: AsyncWebCrawler can run in Docker containers for scalability | containerization, deployment | FROM python:3.10-slim; RUN pip install crawl4ai playwright
|
||||
dynamic_content: Handle JavaScript-loaded content using custom JS injection | javascript handling, dynamic loading | CrawlerRunConfig(js_code=["document.querySelector('.load-more').click()"])
|
||||
extraction_strategies: Multiple strategies available for content extraction including JsonCssExtractionStrategy and LLMExtractionStrategy | content extraction, data parsing | JsonCssExtractionStrategy({"name": "Page", "baseSelector": "body", "fields": [{"name": "title", "selector": "h1", "type": "text"}]})
|
||||
caching_modes: Control cache behavior with different modes: ENABLED, BYPASS, DISABLED | cache control, caching options | CrawlerRunConfig(cache_mode=CacheMode.ENABLED)
|
||||
batch_crawling: Process multiple URLs concurrently using arun_many method | parallel crawling, multiple urls | crawler.arun_many(urls, config=CrawlerRunConfig(semaphore_count=10))
|
||||
rate_limiting: Control crawl rate using mean_delay and max_range parameters | throttling, delay control | CrawlerRunConfig(mean_delay=1.0, max_range=0.5)
|
||||
visual_capture: Generate screenshots and PDFs of crawled pages | page capture, visual output | CrawlerRunConfig(screenshot=True, pdf=True)
|
||||
error_handling: Common issues include browser launch failures, timeouts, and JS execution problems | troubleshooting, debugging | try/except blocks with crawler.logger
|
||||
authentication: Handle login requirements through js_code or Playwright selectors | login handling, sessions | CrawlerRunConfig with login steps via js_code
|
||||
proxy_configuration: Configure proxy settings to bypass IP restrictions | proxy setup, IP rotation | BrowserConfig(proxy="http://proxy-server:port")
|
||||
chunking_strategies: Split content using regex or NLP-based chunking | content splitting, text processing | CrawlerRunConfig(chunking_strategy=RegexChunking())
|
||||
@@ -1,111 +0,0 @@
|
||||
# Crawl4AI: AsyncWebCrawler Reference
|
||||
|
||||
> Minimal code-oriented reference. Focus on parameters, usage patterns, and code.
|
||||
|
||||
(See full code: [async_webcrawler.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_webcrawler.py))
|
||||
|
||||
## Setup & Quick Start
|
||||
```python
from crawl4ai import AsyncWebCrawler, BrowserConfig
import asyncio

async def main():
    async with AsyncWebCrawler(config=BrowserConfig(browser_type="chromium", headless=True)) as c:
        r = await c.arun("https://example.com")
        print(r.markdown)

asyncio.run(main())
```
|
||||
|
||||
## BrowserConfig & Docker
|
||||
**Params:** `browser_type`, `headless`, `viewport_width`, `viewport_height`, `verbose`, `proxy`.
|
||||
```python
browser_config = BrowserConfig(browser_type="firefox", headless=False)
async with AsyncWebCrawler(config=browser_config) as c:
    r = await c.arun("https://site.com")
```
|
||||
|
||||
**Docker Example:**
|
||||
```dockerfile
|
||||
FROM python:3.10-slim
|
||||
RUN pip install crawl4ai playwright
|
||||
RUN playwright install --with-deps chromium
|
||||
COPY script.py /app/
|
||||
WORKDIR /app
|
||||
CMD ["python", "script.py"]
|
||||
```
|
||||
|
||||
## Asynchronous Strategies
|
||||
Default: `AsyncPlaywrightCrawlerStrategy`
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler, AsyncPlaywrightCrawlerStrategy
|
||||
crawler = AsyncWebCrawler(crawler_strategy=AsyncPlaywrightCrawlerStrategy())
|
||||
```
|
||||
|
||||
## Dynamic Content (js_code)
|
||||
```python
|
||||
js_code = ["""
|
||||
(async () => {
|
||||
document.querySelector(".load-more").click();
|
||||
await new Promise(r => setTimeout(r, 2000));
|
||||
})();
|
||||
"""]
|
||||
from crawl4ai import CrawlerRunConfig
|
||||
config = CrawlerRunConfig(js_code=js_code)
|
||||
```
|
||||
|
||||
## Extraction & Filtering
|
||||
**Strategies:** `JsonCssExtractionStrategy`, `LLMExtractionStrategy`, `NoExtractionStrategy`.
|
||||
**Chunking:** `RegexChunking`, NLP-based.
|
||||
```python
|
||||
schema = {"name": "Page", "baseSelector": "body", "fields": [{"name": "title", "selector": "h1", "type": "text"}]}
config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))
|
||||
```
|
||||
|
||||
## Caching & Performance
|
||||
**Cache Modes:** `ENABLED`, `BYPASS`, `DISABLED`
|
||||
```python
|
||||
await c.aclear_cache()
|
||||
await c.aflush_cache()
|
||||
```
|
||||
|
||||
## Batch Crawling & Parallelization
|
||||
```python
urls = ["https://site1.com", "https://site2.com"]
config = CrawlerRunConfig(semaphore_count=10)
async with AsyncWebCrawler() as c:
    results = await c.arun_many(urls, config=config)
```
|
||||
|
||||
## Screenshots & PDFs
|
||||
```python
|
||||
config = CrawlerRunConfig(screenshot=True, pdf=True)
|
||||
result = await c.arun("https://example.com", config=config)
|
||||
with open("page.png","wb") as f: f.write(base64.b64decode(result.screenshot))  # needs `import base64`; the screenshot is a base64 string
|
||||
with open("page.pdf","wb") as f: f.write(result.pdf)
|
||||
```
|
||||
|
||||
## Common Issues & Fixes
|
||||
- Browser not launching: `playwright install chromium`
|
||||
- Timeouts: Increase delays in `CrawlerRunConfig`
|
||||
- JS not executing: Use `js_code` or `headless=False`
|
||||
- Stale cache: `await c.aclear_cache()`
|
||||
- Extraction fail: Check CSS selectors or try different strategy
|
||||
|
||||
## Best Practices & Tips
|
||||
- Modularize crawl logic
|
||||
- Use proxies/rotating user agents
|
||||
- Add delays to avoid blocking
|
||||
- Use `async with` for resource cleanup
|
||||
|
||||
## Links & FAQ
|
||||
- GitHub: [crawl4ai](https://github.com/unclecode/crawl4ai)
|
||||
- Playwright Docs: [https://playwright.dev/](https://playwright.dev/)
|
||||
|
||||
**FAQ:**
|
||||
- Custom user agent: `user_agent="MyUserAgent"`
|
||||
- Local files: `file://` or `raw:`
|
||||
- LLM extraction: Set `extraction_strategy=LLMExtractionStrategy(...)`
|
||||
|
||||
## Links
|
||||
|
||||
- [async_webcrawler.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_webcrawler.py)
|
||||
@@ -1,12 +0,0 @@
|
||||
setup_usage: Initialize AsyncWebCrawler with BrowserConfig for basic web crawling | crawler setup, initialization, basic usage | AsyncWebCrawler(config=BrowserConfig(browser_type="chromium", headless=True))
|
||||
browser_configuration: Configure browser settings including type, headless mode, viewport, and proxy | browser setup, browser settings, viewport config | BrowserConfig(browser_type="firefox", headless=False, viewport_width=1920)
|
||||
docker_setup: Run crawler in Docker using python slim image with playwright installation | docker configuration, containerization | FROM python:3.10-slim; RUN pip install crawl4ai playwright
|
||||
crawler_strategy: Use AsyncPlaywrightCrawlerStrategy as default crawler implementation | crawler implementation, strategy pattern | AsyncWebCrawler(crawler_strategy=AsyncPlaywrightCrawlerStrategy())
|
||||
dynamic_content: Execute custom JavaScript code for dynamic content loading | javascript execution, dynamic loading, interaction | CrawlerRunConfig(js_code=["document.querySelector('.load-more').click()"])
|
||||
extraction_strategies: Choose between JSON CSS, LLM, or No extraction strategies for content parsing | content extraction, parsing strategies | CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy({"name": "Page", "baseSelector": "body", "fields": [{"name": "title", "selector": "h1", "type": "text"}]}))
|
||||
cache_management: Control cache behavior with ENABLED, BYPASS, or DISABLED modes | caching, cache control, performance | await c.aclear_cache(), await c.aflush_cache()
|
||||
parallel_crawling: Crawl multiple URLs concurrently with semaphore control | batch crawling, parallel execution | CrawlerRunConfig(semaphore_count=10)
|
||||
media_capture: Capture screenshots and PDFs of crawled pages | screenshots, pdf generation, media export | CrawlerRunConfig(screenshot=True, pdf=True)
|
||||
troubleshooting: Common issues include browser launch failures, timeouts, and stale cache | error handling, debugging, fixes | playwright install chromium
|
||||
best_practices: Use modular crawl logic, proxies, and proper resource cleanup | optimization, maintenance, efficiency | async with AsyncWebCrawler() as c
|
||||
custom_settings: Configure user agent and local file access options | customization, configuration options | user_agent="MyUserAgent", file:// prefix
|
||||
@@ -1,551 +0,0 @@
|
||||
## 4. Creating Browser Instances, Contexts, and Pages
|
||||
|
||||
### Introduction
|
||||
|
||||
#### Overview of Browser Management in Crawl4AI
|
||||
Crawl4AI's browser management system is designed to provide developers with advanced tools for handling complex web crawling tasks. By managing browser instances, contexts, and pages, Crawl4AI ensures optimal performance, identity preservation, and session persistence for high-volume, dynamic web crawling.
|
||||
|
||||
#### Key Objectives
|
||||
- **Identity Preservation**:
|
||||
- Implements stealth techniques to maintain authentic digital identity
|
||||
- Simulates human-like behavior, such as mouse movements, scrolling, and key presses
|
||||
- Supports integration with third-party services to bypass CAPTCHA challenges
|
||||
- **Persistent Sessions**:
|
||||
- Retains session data (cookies, local storage) for workflows requiring user authentication
|
||||
- Allows seamless continuation of tasks across multiple runs without re-authentication
|
||||
- **Scalable Crawling**:
|
||||
- Optimized resource utilization for handling thousands of URLs concurrently
|
||||
- Flexible configuration options to tailor crawling behavior to specific requirements
|
||||
|
||||
---
|
||||
|
||||
### Browser Creation Methods
|
||||
|
||||
#### Standard Browser Creation
|
||||
Standard browser creation initializes a browser instance with default or minimal configurations. It is suitable for tasks that do not require session persistence or heavy customization.
|
||||
|
||||
##### Features and Limitations
|
||||
- **Features**:
|
||||
- Quick and straightforward setup for small-scale tasks
|
||||
- Supports headless and headful modes
|
||||
- **Limitations**:
|
||||
- Lacks advanced customization options like session reuse
|
||||
- May struggle with sites employing strict identity verification
|
||||
|
||||
##### Example Usage
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_config = BrowserConfig(browser_type="chromium", headless=True)

async def main():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://crawl4ai.com")
        print(result.markdown)

asyncio.run(main())
```
|
||||
|
||||
#### Persistent Contexts
|
||||
Persistent contexts create browser sessions with stored data, enabling workflows that require maintaining login states or other session-specific information.
|
||||
|
||||
##### Benefits of Using `user_data_dir`
|
||||
- **Session Persistence**:
|
||||
- Stores cookies, local storage, and cache between crawling sessions
|
||||
- Reduces overhead for repetitive logins or multi-step workflows
|
||||
- **Enhanced Performance**:
|
||||
- Leverages pre-loaded resources for faster page loading
|
||||
- **Flexibility**:
|
||||
- Adapts to complex workflows requiring user-specific configurations
|
||||
|
||||
##### Example: Setting Up Persistent Contexts
|
||||
```python
|
||||
config = BrowserConfig(user_data_dir="/path/to/user/data")
async with AsyncWebCrawler(config=config) as crawler:
    result = await crawler.arun("https://crawl4ai.com")
    print(result.markdown)
|
||||
```
|
||||
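Because `user_data_dir` points at a directory on disk, it can help to make sure that directory exists before the crawler starts. The sketch below uses only the standard library; the helper name `ensure_profile_dir` and the `CrawlProfile` folder name are illustrative assumptions, not part of Crawl4AI.

```python
import tempfile
from pathlib import Path


def ensure_profile_dir(base_dir: str, profile_name: str) -> str:
    """Create the profile directory if it is missing and return its absolute path."""
    profile = Path(base_dir).expanduser() / profile_name
    profile.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return str(profile.resolve())


if __name__ == "__main__":
    # A temporary directory stands in for a real profile location.
    with tempfile.TemporaryDirectory() as tmp:
        print(ensure_profile_dir(tmp, "CrawlProfile"))
```

The returned string is what you would pass as `user_data_dir`; calling the helper again with the same arguments returns the same path without touching existing profile data.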
|
||||
#### Managed Browser
|
||||
The `ManagedBrowser` class offers a high-level abstraction for managing browser instances, emphasizing resource management, debugging capabilities, and identity preservation measures.
|
||||
|
||||
##### How It Works
|
||||
- **Browser Process Management**:
|
||||
- Automates initialization and cleanup of browser processes
|
||||
- Optimizes resource usage by pooling and reusing browser instances
|
||||
- **Debugging Support**:
|
||||
- Integrates with debugging tools like Chrome Developer Tools for real-time inspection
|
||||
- **Identity Preservation**:
|
||||
- Implements stealth plugins to maintain authentic user identity
|
||||
- Preserves browser fingerprints and session data
|
||||
|
||||
##### Features
|
||||
- **Customizable Configurations**:
|
||||
- Supports advanced options such as viewport resizing, proxy settings, and header manipulation
|
||||
- **Debugging and Logging**:
|
||||
- Logs detailed browser interactions for debugging and performance analysis
|
||||
- **Scalability**:
|
||||
- Handles multiple browser instances concurrently, scaling dynamically based on workload
|
||||
|
||||
##### Example: Using `ManagedBrowser`
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig

config = BrowserConfig(headless=False, debug_port=9222, use_managed_browser=True)
async with AsyncWebCrawler(config=config) as crawler:
    result = await crawler.arun("https://crawl4ai.com")
    print(result.markdown)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Context and Page Management
|
||||
|
||||
#### Creating and Configuring Browser Contexts
|
||||
Browser contexts act as isolated environments within a single browser instance, enabling independent browsing sessions with their own cookies, cache, and storage.
|
||||
|
||||
##### Customizations
|
||||
- **Headers and Cookies**:
|
||||
- Define custom headers to mimic specific devices or browsers
|
||||
- Set cookies for authenticated sessions
|
||||
- **Session Reuse**:
|
||||
- Retain and reuse session data across multiple requests
|
||||
- Example: Preserve login states for authenticated crawls
|
||||
|
||||
##### Example: Context Initialization
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

config = CrawlerRunConfig(headers={"User-Agent": "Crawl4AI/1.0"})
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://crawl4ai.com", config=config)
    print(result.markdown)
|
||||
```
|
||||
|
||||
#### Creating Pages
|
||||
Pages represent individual tabs or views within a browser context. They are responsible for rendering content, executing JavaScript, and handling user interactions.
|
||||
|
||||
##### Key Features
|
||||
- **IFrame Handling**:
|
||||
- Extract content from embedded iframes
|
||||
- Navigate and interact with nested content
|
||||
- **Viewport Customization**:
|
||||
- Adjust viewport size to match target device dimensions
|
||||
- **Lazy Loading**:
|
||||
- Ensure dynamic elements are fully loaded before extraction
|
||||
|
||||
##### Example: Page Initialization
|
||||
```python
|
||||
config = CrawlerRunConfig(viewport_width=1920, viewport_height=1080)
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://crawl4ai.com", config=config)
    print(result.markdown)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
# Preserve Your Identity with Crawl4AI
|
||||
|
||||
Crawl4AI empowers you to navigate and interact with the web using your authentic digital identity, ensuring that you are recognized as a human and not mistaken for a bot. This section introduces Managed Browsers, the recommended approach for preserving your rights to access the web, and Magic Mode, a simplified solution for specific scenarios.
|
||||
|
||||
## Managed Browsers: Your Digital Identity Solution
|
||||
|
||||
**Managed Browsers** enable developers to create and use persistent browser profiles. These profiles store local storage, cookies, and other session-related data, allowing you to interact with websites as a recognized user. By leveraging your unique identity, Managed Browsers ensure that your experience reflects your rights as a human browsing the web.
|
||||
|
||||
### Why Use Managed Browsers?
|
||||
1. **Authentic Browsing Experience**: Managed Browsers retain session data and browser fingerprints, mirroring genuine user behavior.
|
||||
2. **Effortless Configuration**: Once you interact with the site using the browser (e.g., solving a CAPTCHA), the session data is saved and reused, providing seamless access.
|
||||
3. **Empowered Data Access**: By using your identity, Managed Browsers empower users to access data they can view on their own screens without artificial restrictions.
|
||||
|
||||
|
||||
|
||||
### Steps to Use Identity-Based Browsing
|
||||
|
||||
1. **Launch Chrome with a Custom Profile Directory**
|
||||
|
||||
- **Windows**:
|
||||
```batch
|
||||
"C:\Program Files\Google\Chrome\Application\chrome.exe" --user-data-dir="C:\ChromeProfiles\CrawlProfile"
|
||||
```
|
||||
|
||||
- **macOS**:
|
||||
```bash
|
||||
"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" --user-data-dir="/Users/username/ChromeProfiles/CrawlProfile"
|
||||
```
|
||||
|
||||
- **Linux**:
|
||||
```bash
|
||||
google-chrome --user-data-dir="/home/username/ChromeProfiles/CrawlProfile"
|
||||
```
|
||||
|
||||
2. **Set Up Your Identity**:
|
||||
- In the new Chrome window, log into your accounts (Google, social media, etc.)
|
||||
- Complete any necessary CAPTCHA challenges
|
||||
- Accept cookies and configure site preferences
|
||||
- The profile directory will save all settings, cookies, and login states
|
||||
|
||||
3. **Use the Profile in Crawl4AI**:
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig

browser_config = BrowserConfig(
    headless=True,
    use_managed_browser=True,
    user_data_dir="/path/to/ChromeProfiles/CrawlProfile"  # Use the same directory from step 1
)

async with AsyncWebCrawler(config=browser_config) as crawler:
    result = await crawler.arun("https://example.com")
|
||||
```
|
||||
|
||||
This approach provides several advantages:
|
||||
- Complete manual control over profile setup
|
||||
- Persistent logins across multiple sites
|
||||
- Pre-solved CAPTCHAs and saved preferences
|
||||
- Real browser history and cookies for authentic browsing patterns
|
||||
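The per-platform launch commands in step 1 can also be assembled programmatically. The following is a minimal sketch that reuses the executable locations listed above; the function name `chrome_launch_command` is an assumption for illustration, and the paths may differ on your machine.

```python
import platform
from typing import List, Optional

# Executable locations copied from the steps above; adjust if Chrome lives elsewhere.
CHROME_EXECUTABLES = {
    "Windows": r"C:\Program Files\Google\Chrome\Application\chrome.exe",
    "Darwin": "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome",
    "Linux": "google-chrome",
}


def chrome_launch_command(profile_dir: str, system: Optional[str] = None) -> List[str]:
    """Return the argument list that starts Chrome with a dedicated profile directory."""
    system = system or platform.system()
    try:
        executable = CHROME_EXECUTABLES[system]
    except KeyError:
        raise ValueError(f"No Chrome path known for platform {system!r}") from None
    return [executable, f"--user-data-dir={profile_dir}"]


if __name__ == "__main__":
    print(chrome_launch_command("/home/username/ChromeProfiles/CrawlProfile", "Linux"))
```

Passing the returned list to `subprocess.Popen` opens the browser window for the manual login in step 2. Using an argument list rather than a shell string avoids quoting problems with paths that contain spaces.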
|
||||
### Example: Extracting Data Using Managed Browsers
|
||||
|
||||
```python
|
||||
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    # Define schema for structured data extraction
    schema = {
        "name": "Example Data",
        "baseSelector": "div.example",
        "fields": [
            {"name": "title", "selector": "h1", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }

    # Configure crawler
    browser_config = BrowserConfig(
        headless=True,  # Automate subsequent runs
        verbose=True,
        use_managed_browser=True,
        user_data_dir="/path/to/user_profile_data"
    )

    crawl_config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(schema),
        wait_for="css:div.example"  # Wait for the targeted element to load
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=crawl_config
        )

        if result.success:
            print("Extracted Data:", result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
|
||||
```
|
||||
|
||||
## Benefits of Managed Browsers Over Other Methods
|
||||
Managed Browsers eliminate the need for manual detection workarounds by enabling developers to work directly with their identity and user profile data. This approach ensures maximum compatibility with websites and simplifies the crawling process while preserving your right to access data freely.
|
||||
|
||||
## Magic Mode: Simplified Automation
|
||||
|
||||
While Managed Browsers are the preferred approach, **Magic Mode** provides an alternative for scenarios where persistent user profiles are unnecessary or infeasible. Magic Mode automates user-like behavior and simplifies configuration.
|
||||
|
||||
### What Magic Mode Does:
|
||||
- Simulates human browsing by randomizing interaction patterns and timing
|
||||
- Masks browser automation signals
|
||||
- Handles cookie popups and modals
|
||||
- Modifies navigator properties for enhanced compatibility
|
||||
|
||||
### Using Magic Mode
|
||||
|
||||
```python
|
||||
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://example.com",
        magic=True  # Enables all automation features
    )
|
||||
```
|
||||
|
||||
Magic Mode is particularly useful for:
|
||||
- Quick prototyping when a Managed Browser setup is not available
|
||||
- Basic sites requiring minimal interaction or configuration
|
||||
|
||||
### Example: Combining Magic Mode with Additional Options
|
||||
|
||||
```python
|
||||
async def crawl_with_magic_mode(url: str):
    async with AsyncWebCrawler(headless=True) as crawler:
        result = await crawler.arun(
            url=url,
            magic=True,
            remove_overlay_elements=True,  # Remove popups/modals
            page_timeout=60000  # Increased timeout for complex pages
        )

        return result.markdown if result.success else None
|
||||
```
|
||||
|
||||
## Magic Mode vs. Managed Browsers
|
||||
While Magic Mode simplifies many tasks, it cannot match the reliability and authenticity of Managed Browsers. By using your identity and persistent profiles, Managed Browsers render Magic Mode largely unnecessary. However, Magic Mode remains a viable fallback for specific situations where user identity is not a factor.
|
||||
|
||||
# Session Management
|
||||
|
||||
Session management in Crawl4AI is a powerful feature that allows you to maintain state across multiple requests, making it particularly suitable for handling complex multi-step crawling tasks. It enables you to reuse the same browser tab (or page object) across sequential actions and crawls, which is beneficial for:
|
||||
|
||||
- **Performing JavaScript actions before and after crawling**
|
||||
- **Executing multiple sequential crawls faster** without needing to reopen tabs or allocate memory repeatedly
|
||||
- **Maintaining state for complex workflows**
|
||||
|
||||
**Note:** This feature is designed for sequential workflows and is not suitable for parallel operations.
|
||||
|
||||
## Basic Session Usage
|
||||
|
||||
Pass the same `session_id` in each `CrawlerRunConfig` to maintain state across requests:
|
||||
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

async with AsyncWebCrawler() as crawler:
    session_id = "my_session"

    # Define configurations
    config1 = CrawlerRunConfig(url="https://example.com/page1", session_id=session_id)
    config2 = CrawlerRunConfig(url="https://example.com/page2", session_id=session_id)

    # First request
    result1 = await crawler.arun(config=config1)

    # Subsequent request using the same session
    result2 = await crawler.arun(config=config2)

    # Clean up when done
    await crawler.crawler_strategy.kill_session(session_id)
|
||||
```
|
||||
|
||||
## Dynamic Content with Sessions
|
||||
|
||||
Here's an example of crawling GitHub commits across multiple pages while preserving session state:
|
||||
|
||||
```python
|
||||
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai.cache_context import CacheMode

async def crawl_dynamic_content():
    async with AsyncWebCrawler() as crawler:
        session_id = "github_commits_session"
        url = "https://github.com/microsoft/TypeScript/commits/main"
        all_commits = []

        # Define extraction schema
        schema = {
            "name": "Commit Extractor",
            "baseSelector": "li.Box-sc-g0xbh4-0",
            "fields": [{"name": "title", "selector": "h4.markdown-title", "type": "text"}],
        }
        extraction_strategy = JsonCssExtractionStrategy(schema)

        # JavaScript and wait configurations
        js_next_page = """document.querySelector('a[data-testid="pagination-next-button"]').click();"""
        wait_for = """() => document.querySelectorAll('li.Box-sc-g0xbh4-0').length > 0"""

        # Crawl multiple pages
        for page in range(3):
            config = CrawlerRunConfig(
                url=url,
                session_id=session_id,
                extraction_strategy=extraction_strategy,
                js_code=js_next_page if page > 0 else None,
                wait_for=wait_for if page > 0 else None,
                js_only=page > 0,
                cache_mode=CacheMode.BYPASS
            )

            result = await crawler.arun(config=config)
            if result.success:
                # extracted_content is a JSON string; parse it into a list of commits
                commits = json.loads(result.extracted_content)
                all_commits.extend(commits)
                print(f"Page {page + 1}: Found {len(commits)} commits")

        # Clean up session
        await crawler.crawler_strategy.kill_session(session_id)
        return all_commits
|
||||
```
|
||||
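The `wait_for` condition in the example above only checks that at least one commit item exists, which may already be true before the next page has rendered, so consecutive pages could return overlapping items. As a hedged sketch, the extracted JSON strings can be merged and de-duplicated after the crawl. The helper name `merge_commit_pages` and the sample payloads are hypothetical; they only mirror the `{"title": ...}` shape of the schema above.

```python
import json
from typing import Dict, List


def merge_commit_pages(pages: List[str]) -> List[Dict[str, str]]:
    """Parse each page's JSON payload and drop commits whose title was already seen."""
    seen = set()
    merged = []
    for payload in pages:
        for commit in json.loads(payload):
            title = commit.get("title", "").strip()
            if title and title not in seen:
                seen.add(title)
                merged.append({"title": title})
    return merged


if __name__ == "__main__":
    # Hypothetical payloads shaped like the schema above.
    page1 = '[{"title": "Fix parser crash"}, {"title": "Update baselines"}]'
    page2 = '[{"title": "Update baselines"}, {"title": "Add regression test"}]'
    print(merge_commit_pages([page1, page2]))
```

De-duplicating by title is a simplification: two distinct commits with identical titles would be collapsed, so a field such as the commit hash would be a safer key if the schema extracts one.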
|
||||
## Session Best Practices
|
||||
|
||||
1. **Descriptive Session IDs**:
|
||||
Use meaningful names for session IDs to organize workflows:
|
||||
```python
|
||||
session_id = "login_flow_session"
|
||||
session_id = "product_catalog_session"
|
||||
```
|
||||
|
||||
2. **Resource Management**:
|
||||
Always ensure sessions are cleaned up to free resources:
|
||||
```python
|
||||
try:
    # Your crawling code here
    pass
finally:
    await crawler.crawler_strategy.kill_session(session_id)
|
||||
```
|
||||
|
||||
3. **State Maintenance**:
|
||||
Reuse the session for subsequent actions within the same workflow:
|
||||
```python
|
||||
# Step 1: Login
login_config = CrawlerRunConfig(
    url="https://example.com/login",
    session_id=session_id,
    js_code="document.querySelector('form').submit();"
)
await crawler.arun(config=login_config)

# Step 2: Verify login success
dashboard_config = CrawlerRunConfig(
    url="https://example.com/dashboard",
    session_id=session_id,
    wait_for="css:.user-profile"  # Wait for authenticated content
)
result = await crawler.arun(config=dashboard_config)
|
||||
```
|
||||
|
||||
4. **Common Use Cases for Sessions**:
|
||||
1. **Authentication Flows**: Login and interact with secured pages
|
||||
2. **Pagination Handling**: Navigate through multiple pages
|
||||
3. **Form Submissions**: Fill forms, submit, and process results
|
||||
4. **Multi-step Processes**: Complete workflows that span multiple actions
|
||||
5. **Dynamic Content Navigation**: Handle JavaScript-rendered or event-triggered content
|
||||
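The resource-management practice above (always calling `kill_session`) can be packaged as an async context manager so that cleanup cannot be forgotten. This is a minimal sketch: `crawl_session` is a hypothetical helper, and the stub classes exist only so the example runs without a browser. The only Crawl4AI call it relies on is `crawler.crawler_strategy.kill_session(session_id)`, as used throughout this section.

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def crawl_session(crawler, session_id: str):
    """Yield the session ID and always kill the session afterwards."""
    try:
        yield session_id
    finally:
        await crawler.crawler_strategy.kill_session(session_id)


# Hypothetical stand-ins so the sketch runs without launching a browser.
class StubStrategy:
    def __init__(self):
        self.killed = []

    async def kill_session(self, session_id):
        self.killed.append(session_id)


class StubCrawler:
    def __init__(self):
        self.crawler_strategy = StubStrategy()


async def demo_cleanup():
    crawler = StubCrawler()
    try:
        async with crawl_session(crawler, "login_flow_session"):
            raise RuntimeError("simulated crawl failure")
    except RuntimeError:
        pass  # the session is still cleaned up
    return crawler.crawler_strategy.killed


if __name__ == "__main__":
    print(asyncio.run(demo_cleanup()))
```

With a real `AsyncWebCrawler`, the body of the `async with crawl_session(...)` block would contain the `arun` calls that share the session ID.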
|
||||
# Session-Based Crawling for Dynamic Content
|
||||
|
||||
In modern web applications, content is often loaded dynamically without changing the URL. Examples include "Load More" buttons, infinite scrolling, or paginated content that updates via JavaScript. Crawl4AI provides session-based crawling capabilities to handle such scenarios effectively.
|
||||
|
||||
## Understanding Session-Based Crawling
|
||||
|
||||
Session-based crawling allows you to reuse a persistent browser session across multiple actions. This means the same browser tab (or page object) is used throughout, enabling:
|
||||
|
||||
1. **Efficient handling of dynamic content** without reloading the page
|
||||
2. **JavaScript actions before and after crawling** (e.g., clicking buttons or scrolling)
|
||||
3. **State maintenance** for authenticated sessions or multi-step workflows
|
||||
4. **Faster sequential crawling**, as it avoids reopening tabs or reallocating resources
|
||||
|
||||
**Note:** Session-based crawling is ideal for sequential operations, not parallel tasks.
|
||||
|
||||
## Basic Concepts
|
||||
|
||||
Before diving into examples, here are some key concepts:
|
||||
|
||||
- **Session ID**: A unique identifier for a browsing session. Use the same `session_id` across multiple requests to maintain state.
|
||||
- **BrowserConfig & CrawlerRunConfig**: These configuration objects control browser settings and crawling behavior.
|
||||
- **JavaScript Execution**: Use `js_code` to perform actions like clicking buttons.
|
||||
- **CSS Selectors**: Target specific elements for interaction or data extraction.
|
||||
- **Extraction Strategy**: Define rules to extract structured data.
|
||||
- **Wait Conditions**: Specify conditions to wait for before proceeding.
|
||||
|
||||
## Advanced Technique 1: Custom Execution Hooks
|
||||
|
||||
Use custom hooks to handle complex scenarios, such as waiting for content to load dynamically:
|
||||
|
||||
```python
|
||||
import asyncio

from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

async def advanced_session_crawl_with_hooks():
    first_commit = ""

    async def on_execution_started(page):
        nonlocal first_commit
        try:
            while True:
                await page.wait_for_selector("li.commit-item h4")
                commit = await page.query_selector("li.commit-item h4")
                # Await the evaluation first, then strip the resulting string
                commit = (await commit.evaluate("(element) => element.textContent")).strip()
                if commit and commit != first_commit:
                    first_commit = commit
                    break
                await asyncio.sleep(0.5)
        except Exception as e:
            print(f"Warning: New content didn't appear: {e}")

    async with AsyncWebCrawler() as crawler:
        session_id = "commit_session"
        url = "https://github.com/example/repo/commits/main"
        crawler.crawler_strategy.set_hook("on_execution_started", on_execution_started)

        js_next_page = """document.querySelector('a.pagination-next').click();"""

        for page in range(3):
            config = CrawlerRunConfig(
                url=url,
                session_id=session_id,
                js_code=js_next_page if page > 0 else None,
                css_selector="li.commit-item",
                js_only=page > 0,
                cache_mode=CacheMode.BYPASS
            )

            result = await crawler.arun(config=config)
            print(f"Page {page + 1}: Found {len(result.extracted_content)} commits")

        await crawler.crawler_strategy.kill_session(session_id)
|
||||
```
|
||||
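The hook above polls in a `while True` loop with no upper bound, so it can spin indefinitely if the content never changes. One possible refinement, sketched here with the standard library only, is a polling helper with a deadline. The name `wait_for_change` and its parameters are assumptions for illustration; inside a hook, `read_value` would wrap the `page.query_selector(...)` and `evaluate(...)` calls.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional


async def wait_for_change(
    read_value: Callable[[], Awaitable[Optional[str]]],
    previous: str,
    timeout: float = 10.0,
    interval: float = 0.5,
) -> str:
    """Poll read_value() until it returns a non-empty value different from previous."""
    deadline = time.monotonic() + timeout
    while True:
        value = await read_value()
        if value and value != previous:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"value did not change within {timeout} seconds")
        await asyncio.sleep(interval)


async def demo_polling() -> str:
    # A fake reader that returns the old value twice before the new one appears.
    values = iter(["old commit", "old commit", "new commit"])

    async def fake_reader():
        return next(values)

    return await wait_for_change(fake_reader, "old commit", timeout=1.0, interval=0.01)


if __name__ == "__main__":
    print(asyncio.run(demo_polling()))
```

Raising `TimeoutError` lets the caller decide whether to retry, skip the page, or stop the crawl, instead of blocking the session forever.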
|
||||
## Advanced Technique 2: Integrated JavaScript Execution and Waiting
|
||||
|
||||
Combine JavaScript execution and waiting logic for concise handling of dynamic content:
|
||||
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler, CacheMode, CrawlerRunConfig

async def integrated_js_and_wait_crawl():
    async with AsyncWebCrawler() as crawler:
        session_id = "integrated_session"
        url = "https://github.com/example/repo/commits/main"

        # Click "next", then poll until the first commit title changes
        js_next_page_and_wait = """
        (async () => {
            const getCurrentCommit = () => document.querySelector('li.commit-item h4').textContent.trim();
            const initialCommit = getCurrentCommit();
            document.querySelector('a.pagination-next').click();
            while (getCurrentCommit() === initialCommit) {
                await new Promise(resolve => setTimeout(resolve, 100));
            }
        })();
        """

        for page in range(3):
            config = CrawlerRunConfig(
                url=url,
                session_id=session_id,
                js_code=js_next_page_and_wait if page > 0 else None,
                css_selector="li.commit-item",
                js_only=page > 0,
                cache_mode=CacheMode.BYPASS
            )

            result = await crawler.arun(config=config)
            print(f"Page {page + 1}: Found {len(result.extracted_content)} commits")

        await crawler.crawler_strategy.kill_session(session_id)
|
||||
```
|
||||
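The inline script above waits without a time limit and assumes both selectors always match. As a hedged variation, the script can be generated from Python with a deadline and null checks. The builder name `build_click_and_wait_js` and its parameters are hypothetical, and the generated script follows the same async IIFE pattern as the example above; how the crawler handles the resulting promise is not stated in this section.

```python
import json


def build_click_and_wait_js(item_selector: str, next_selector: str, timeout_ms: int = 10000) -> str:
    """Return a JS snippet that clicks next_selector and waits for item_selector's text to change."""
    item = json.dumps(item_selector)  # json.dumps produces a safely quoted JS string literal
    nxt = json.dumps(next_selector)
    return f"""
(async () => {{
    const read = () => {{
        const el = document.querySelector({item});
        return el ? el.textContent.trim() : null;
    }};
    const initial = read();
    const next = document.querySelector({nxt});
    if (!next) return;  // nothing to click: last page reached
    next.click();
    const deadline = Date.now() + {int(timeout_ms)};
    while (read() === initial && Date.now() < deadline) {{
        await new Promise(resolve => setTimeout(resolve, 100));
    }}
}})();
"""


if __name__ == "__main__":
    print(build_click_and_wait_js("li.commit-item h4", "a.pagination-next"))
```

The returned string would be passed as `js_code` in place of `js_next_page_and_wait`. Quoting the selectors with `json.dumps` keeps selectors that contain quotes from breaking the script.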
|
||||
## Best Practices for Session-Based Crawling
|
||||
|
||||
1. **Unique Session IDs**: Assign descriptive and unique `session_id` values
|
||||
2. **Close Sessions**: Always clean up sessions with `kill_session` after use
|
||||
3. **Error Handling**: Anticipate and handle errors gracefully
|
||||
4. **Respect Websites**: Follow terms of service and robots.txt
|
||||
5. **Delays**: Add delays to avoid overwhelming servers
|
||||
6. **Optimize JavaScript**: Keep scripts concise for better performance
|
||||
7. **Monitor Resources**: Track memory and CPU usage for long sessions
|
||||
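Practices 3 and 5 above (error handling and delays) are often combined as retries with exponential backoff. The sketch below is library-agnostic: `run_with_retries` is a hypothetical helper, and `action` would typically be a small coroutine function that wraps a `crawler.arun(...)` call.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def run_with_retries(
    action: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Await action(); on failure, wait base_delay * 2**n seconds and try again."""
    for attempt in range(attempts):
        try:
            return await action()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


async def demo_retries() -> str:
    calls = []

    async def flaky():
        # Fails twice, then succeeds, to simulate a temporary network problem.
        calls.append(1)
        if len(calls) < 3:
            raise ConnectionError("temporary failure")
        return f"ok after {len(calls)} attempts"

    return await run_with_retries(flaky, attempts=3, base_delay=0.001)


if __name__ == "__main__":
    print(asyncio.run(demo_retries()))
```

The growing delay between attempts also reduces load on the target server, which fits the advice to avoid overwhelming it.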
|
||||
## Conclusion
|
||||
|
||||
By combining browser management, identity-based crawling through Managed Browsers, and robust session management, Crawl4AI provides a comprehensive solution for modern web crawling needs. These features work together to enable:
|
||||
|
||||
1. Authentic identity preservation
|
||||
2. Efficient session management
|
||||
3. Reliable handling of dynamic content
|
||||
4. Scalable and maintainable crawling workflows
|
||||
|
||||
Remember to always follow best practices and respect website policies when implementing these features.
|
||||
@@ -1,10 +0,0 @@
|
||||
browser_creation: Create standard browser instance with default configurations | browser initialization, basic setup, minimal config | AsyncWebCrawler(config=BrowserConfig(browser_type="chromium", headless=True))
|
||||
persistent_context: Use persistent browser contexts to maintain session data and cookies | user_data_dir, session storage, login state | BrowserConfig(user_data_dir="/path/to/user/data")
|
||||
managed_browser: High-level browser management with resource optimization and debugging | browser process, stealth mode, debugging tools | BrowserConfig(headless=False, debug_port=9222)
|
||||
context_config: Configure browser context with custom headers and cookies | headers customization, session reuse | CrawlerRunConfig(headers={"User-Agent": "Crawl4AI/1.0"})
|
||||
page_creation: Create and customize browser pages with viewport settings | viewport size, iframe handling, lazy loading | CrawlerRunConfig(viewport_width=1920, viewport_height=1080)
|
||||
identity_preservation: Maintain authentic digital identity using Managed Browsers | user profiles, CAPTCHA bypass, persistent login | BrowserConfig(use_managed_browser=True, user_data_dir="/path/to/profile")
|
||||
magic_mode: Enable automated user-like behavior and detection bypass | automation masking, cookie handling | crawler.arun(url="https://example.com", magic=True)
|
||||
session_management: Maintain state across multiple requests using session IDs | session reuse, sequential crawling | CrawlerRunConfig(session_id="my_session")
|
||||
dynamic_content: Handle JavaScript-rendered content with custom execution hooks | content loading, pagination | js_code="document.querySelector('a.pagination-next').click()"
|
||||
best_practices: Follow recommended patterns for efficient crawling | resource management, error handling | crawler.crawler_strategy.kill_session(session_id)
|
||||
@@ -1,152 +0,0 @@
|
||||
# Creating Browser Instances, Contexts, and Pages (Condensed LLM Reference)
|
||||
|
||||
> Minimal code-focused reference retaining all outline sections.
|
||||
|
||||
## Introduction
|
||||
- Manage browsers for crawling with identity preservation, sessions, scaling.
|
||||
- Maintain cookies, local storage, human-like actions.
|
||||
|
||||
### Key Objectives
|
||||
- **Identity Preservation**: Stealth plugins, human-like inputs.
|
||||
- **Persistent Sessions**: Store cookies, continue tasks across runs.
|
||||
- **Scalable Crawling**: Handle large volumes efficiently.
|
||||
|
||||
---
|
||||
|
||||
## Browser Creation Methods
|
||||
|
||||
### Standard Browser Creation
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig

cfg = BrowserConfig(browser_type="chromium", headless=True)
async with AsyncWebCrawler(config=cfg) as c:
    r = await c.arun("https://example.com")
|
||||
```
|
||||
|
||||
### Persistent Contexts
|
||||
```python
|
||||
cfg = BrowserConfig(user_data_dir="/path/to/data")
async with AsyncWebCrawler(config=cfg) as c:
    r = await c.arun("https://example.com")
|
||||
```
|
||||
|
||||
### Managed Browser
|
||||
```python
|
||||
cfg = BrowserConfig(headless=False, debug_port=9222, use_managed_browser=True)
async with AsyncWebCrawler(config=cfg) as c:
    r = await c.arun("https://example.com")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Context and Page Management
|
||||
|
||||
### Creating and Configuring Browser Contexts
|
||||
```python
|
||||
from crawl4ai import CrawlerRunConfig
conf = CrawlerRunConfig(headers={"User-Agent": "C4AI"})
async with AsyncWebCrawler() as c:
    r = await c.arun("https://example.com", config=conf)
|
||||
```
|
||||
|
||||
### Creating Pages
|
||||
```python
|
||||
conf = CrawlerRunConfig(viewport_width=1920, viewport_height=1080)
async with AsyncWebCrawler() as c:
    r = await c.arun("https://example.com", config=conf)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
# Preserve Your Identity with Crawl4AI
|
||||
|
||||
Use Managed Browsers for authentic identity:
|
||||
|
||||
## Managed Browsers: Your Digital Identity Solution
|
||||
- Store sessions, cookies, user profiles.
|
||||
- Reuse solved CAPTCHAs and saved logins.
|
||||
|
||||
### Steps to Use Identity-Based Browsing
|
||||
```bash
|
||||
# Launch Chrome with user-data-dir
|
||||
google-chrome --user-data-dir="/path/to/Profile"
|
||||
# Then login manually, solve CAPTCHAs, etc.
|
||||
```
|
||||
|
||||
```python
|
||||
cfg = BrowserConfig(
    headless=True,
    use_managed_browser=True,
    user_data_dir="/path/to/Profile"
)
async with AsyncWebCrawler(config=cfg) as c:
    r = await c.arun("https://example.com")
|
||||
```
|
||||
|
||||
### Example: Extracting Data Using Managed Browsers
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {...}
cfg = BrowserConfig(
    headless=True, use_managed_browser=True,
    user_data_dir="/path/to/data"
)
crawl_cfg = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))

async with AsyncWebCrawler(config=cfg) as c:
    r = await c.arun("https://example.com", config=crawl_cfg)
|
||||
```
|
||||
|
||||
## Magic Mode: Simplified Automation
|
||||
```python
|
||||
async with AsyncWebCrawler() as c:
    r = await c.arun("https://example.com", magic=True)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
# Session Management
|
||||
|
||||
Use `session_id` to maintain state across requests:
|
||||
|
||||
```python
|
||||
from crawl4ai.async_configs import CrawlerRunConfig

async with AsyncWebCrawler() as c:
    sid = "my_session"
    conf1 = CrawlerRunConfig(url="https://example.com/page1", session_id=sid)
    conf2 = CrawlerRunConfig(url="https://example.com/page2", session_id=sid)
    r1 = await c.arun(config=conf1)
    r2 = await c.arun(config=conf2)
    await c.crawler_strategy.kill_session(sid)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
# Session-Based Crawling for Dynamic Content
|
||||
|
||||
- Reuse the same session for multi-step actions, JS execution.
|
||||
- Ideal for pagination, JS-driven content.
|
||||
|
||||
## Basic Concepts
|
||||
- `session_id`: Keep the same ID for related crawls.
|
||||
- `js_code`, `wait_for`: Run JS, wait for elements.
|
||||
|
||||
## Advanced Techniques
|
||||
- Execute JS for dynamic content loading.
|
||||
- Wait loops or hooks to handle new elements.
|
||||
|
||||
---
|
||||
|
||||
## Conclusion
|
||||
|
||||
- Combine managed browsers, sessions, and configs for scalable, identity-preserved crawling.
|
||||
- Adjust headers, cookies, viewports.
|
||||
- Magic mode for quick attempts; Managed Browsers for robust identity.
|
||||
- Use sessions for multi-step, dynamic workflows.
|
||||
|
||||
## Optional
|
||||
- [async_crawler_strategy.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_crawler_strategy.py)
|
||||
@@ -1,390 +0,0 @@
|
||||
# 5. Markdown Generation (MEGA Extended Documentation)
|
||||
|
||||
## 5.1 Introduction
|
||||
|
||||
In modern AI workflows—especially those involving Large Language Models (LLMs)—it’s essential to provide clean, structured, and meaningful textual data. **Crawl4AI** assists with this by extracting web content and converting it into Markdown that is easy to process, fine-tune on, or use for retrieval-augmented generation (RAG).
|
||||
|
||||
**What Makes Markdown Outputs Valuable for AI?**
|
||||
- **Human-Readable & Machine-Friendly:** Markdown is a simple, text-based format easily parsed by humans and machines alike.
|
||||
- **Rich Structure:** Headings, lists, code blocks, and links are preserved and well-organized.
|
||||
- **Enhanced Relevance:** Content filtering ensures you focus on the main content while discarding noise, making the data cleaner for LLM training or search.
|
||||
|
||||
### Quick Start Example
|
||||
|
||||
Here’s a minimal snippet to get started:
|
||||
|
||||
```python
|
||||
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai import CrawlerRunConfig, AsyncWebCrawler

config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator()
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun("https://example.com", config=config)
    print(result.markdown_v2.raw_markdown)
|
||||
```
|
||||
|
||||
*Within a few lines of code, you can fetch a webpage, run it through the Markdown generator, and get a clean, AI-friendly output.*
|
||||
|
||||
---
|
||||
|
||||
## 5.2 Markdown Generation
|
||||
|
||||
The Markdown generation process transforms raw HTML into a structured format. At its core is the `DefaultMarkdownGenerator` class, which uses configurable parameters and optional filters. Let’s explore its functionality in depth.
|
||||
|
||||
### Internal Workings
|
||||
|
||||
1. **HTML to Markdown Conversion:**
|
||||
The generator relies on an HTML-to-text conversion process that respects various formatting options. It preserves headings, code blocks, and references while removing extraneous tags like scripts and styles.
|
||||
|
||||
2. **Link Citation Handling:**
|
||||
By default, the generator can convert links into citation-style references at the bottom of the document. This feature is particularly useful when you need a clean, reference-rich dataset for an LLM.
|
||||
|
||||
3. **Optional Content Filters:**
|
||||
You can provide a content filter (like BM25 or Pruning) to generate a “fit_markdown” output that contains only the most relevant or least noisy parts of the page.
|
||||
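To make the link citation step in item 2 concrete, here is a minimal standalone sketch of the idea: inline links are replaced by numbered markers and the URLs are collected into a reference list. It is not Crawl4AI's implementation; the function name `links_to_citations` and the `[n]` marker format are assumptions chosen for illustration.

```python
import re
from typing import Dict, List, Tuple

# Matches inline Markdown links such as [text](https://example.com), but not images.
LINK_PATTERN = re.compile(r"(?<!!)\[([^\]]+)\]\(([^)\s]+)\)")


def links_to_citations(markdown: str) -> Tuple[str, List[str]]:
    """Replace inline links with numbered markers and return (text, reference lines)."""
    numbers: Dict[str, int] = {}

    def replace(match) -> str:
        text, url = match.group(1), match.group(2)
        # Reuse the existing number when the same URL appears again.
        number = numbers.setdefault(url, len(numbers) + 1)
        return f"{text} [{number}]"

    body = LINK_PATTERN.sub(replace, markdown)
    references = [f"[{n}]: {url}" for url, n in numbers.items()]
    return body, references


if __name__ == "__main__":
    text = "See the [docs](https://example.com/docs) and the [home page](https://example.com)."
    body, refs = links_to_citations(text)
    print(body)
    print("\n".join(refs))
```

Moving URLs out of the running text keeps sentences readable for an LLM while the reference list still preserves every source.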
|
||||
### Key Parameters
|
||||
|
||||
- **`base_url` (string):**
|
||||
A base URL used to resolve relative links in the content.
|
||||
|
||||
- **`html2text_config` (dict):**
|
||||
Controls how HTML is converted to Markdown. If none is provided, default settings ensure a reasonable output. You can customize a wide array of options. These options mirror standard `html2text` configurations with custom enhancements.
|
||||
**Important Options:**
|
||||
- `ignore_links` (bool): If `True`, removes all hyperlinks in the output Markdown. Default: `False`
|
||||
- `ignore_images` (bool): If `True`, removes all images. Default: `False`
|
||||
- `escape_html` (bool): If `True`, escapes raw HTML entities. Default: `True`
|
||||
- `body_width` (int): Sets the text wrapping width. Default: unlimited (0 means no wrapping)
|
||||
|
||||
**Advanced html2text-related Options from Source:**
|
||||
- `inside_pre`/`inside_code` (internal flags): Track whether we are inside `<pre>` or `<code>` blocks.
|
||||
- `preserve_tags` (set): A set of tags to preserve. If not empty, content within these tags is kept verbatim.
|
||||
- `current_preserved_tag`/`preserve_depth`: Internally manage nesting levels of preserved tags.
|
||||
- `handle_code_in_pre` (bool): If `True`, treats code within `<pre>` blocks distinctly, possibly formatting them as code blocks in Markdown.
|
||||
- `skip_internal_links` (bool): If `True`, internal links (like `#section`) are skipped.
|
||||
- `single_line_break` (bool): If `True`, uses single line breaks instead of double line breaks.
|
||||
- `mark_code` (bool): If `True`, adds special markers around code text.
|
||||
- `include_sup_sub` (bool): If `True`, tries to include `<sup>` and `<sub>` text in a readable way.
|
||||
- `ignore_mailto_links` (bool): If `True`, ignores `mailto:` links.
|
||||
- `escape_backslash`, `escape_dot`, `escape_plus`, `escape_dash`, `escape_snob`: Special escaping options to handle characters that might conflict with Markdown syntax.
|
||||
|
||||
**Example Custom `html2text_config`:**
|
||||
|
||||
```python
|
||||
import asyncio

from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai import CrawlerRunConfig, AsyncWebCrawler, CacheMode

config = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,
    markdown_generator=DefaultMarkdownGenerator(
        options={
            "ignore_links": True,
            "escape_html": False,
            "body_width": 80,
            "skip_internal_links": True,
            "mark_code": True,
            "include_sup_sub": True
        }
    )
)

async def main():
    # `async with` is only valid inside a coroutine, so the crawl is wrapped in one.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/docs", config=config)
        print(result.markdown_v2.raw_markdown)

asyncio.run(main())
|
||||
```
|
||||
|
||||
In this example, we ignore all hyperlinks, do not escape HTML entities, wrap text at 80 characters wide, skip internal links, mark code regions, and include superscript/subscript formatting.
|
||||
|
||||
### Using Content Filters
|
||||
|
||||
When you need filtered markdown (fit_markdown), configure the content filter with the markdown generator:
|
||||
|
||||
```python
|
||||
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai import CrawlerRunConfig

config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(),  # Content filter goes here
        options={
            "ignore_links": True,
            "escape_html": False
        }
    )
)
|
||||
```
|
||||
|
||||
This setup enables:
|
||||
- Raw markdown generation (always available)
|
||||
- Filtered markdown (fit_markdown) through PruningContentFilter
|
||||
|
||||
### Using Content Filters in Markdown Generation
|
||||
|
||||
- **`content_filter` (object):**
|
||||
An optional filter (like `BM25ContentFilter` or `PruningContentFilter`) that refines the content before Markdown generation. When applied:
|
||||
- `fit_markdown` is generated: a filtered version of the page focusing on main content.
|
||||
- `fit_html` is also available: the filtered HTML that was used to generate `fit_markdown`.
|
||||
|
||||
### Example Usage
|
||||
|
||||
```python
|
||||
import asyncio

from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import BM25ContentFilter
from crawl4ai import CrawlerRunConfig, AsyncWebCrawler

config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=BM25ContentFilter(
            user_query="machine learning",
            bm25_threshold=1.5,
            use_stemming=True
        ),
        options={"ignore_links": True, "escape_html": False}
    )
)

async def main():
    # `async with` is only valid inside a coroutine, so the crawl is wrapped in one.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://crawl4ai.com/ai-research", config=config)
        print(result.markdown_v2.fit_markdown)  # Filtered Markdown focusing on machine learning

asyncio.run(main())
|
||||
```
|
||||
|
||||
### Troubleshooting Markdown Generation
|
||||
|
||||
- **Empty Markdown Output?**
|
||||
Check if the crawler successfully fetched HTML. Ensure your filters are not overly strict. If no filter is used and you still get no output, verify the HTML content isn’t empty or malformed.
|
||||
|
||||
- **Malformed HTML Content?**
|
||||
The internal parser is robust, but if encountering strange characters, consider adjusting `escape_html` to `True` or removing problematic tags using filters.
|
||||
|
||||
- **Performance Considerations:**
|
||||
Complex filters or very large HTML documents can slow down processing. Consider caching results, and set `body_width` to `0` if line wrapping is unnecessary.
|
||||
|
||||
---
|
||||
|
||||
### 5.2.1 MarkdownGenerationResult
|
||||
|
||||
After running the crawler, `result.markdown_v2` returns a `MarkdownGenerationResult` object.
|
||||
|
||||
**Attributes:**
|
||||
- `raw_markdown` (str): Unfiltered Markdown.
|
||||
- `markdown_with_citations` (str): Markdown with all links converted into references at the end.
|
||||
- `references_markdown` (str): A list of extracted references.
|
||||
- `fit_markdown` (Optional[str]): Markdown after applying filters.
|
||||
- `fit_html` (Optional[str]): Filtered HTML corresponding to `fit_markdown`.
|
||||
|
||||
**Integration Example:**
|
||||
|
||||
```python
|
||||
result = await crawler.arun("https://crawl4ai.com")
|
||||
print("RAW:", result.markdown_v2.raw_markdown)
|
||||
print("CITED:", result.markdown_v2.markdown_with_citations)
|
||||
print("FIT:", result.markdown_v2.fit_markdown)
|
||||
```
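
Because `fit_markdown` is optional, downstream code usually needs a fallback when no content filter was configured. The following is a small sketch of that pattern; the helper name `best_markdown` is hypothetical, and `SimpleNamespace` objects stand in for `result.markdown_v2` so the snippet runs without a crawl.

```python
from types import SimpleNamespace


def best_markdown(markdown_result) -> str:
    """Prefer filtered Markdown; fall back to raw Markdown when no filter produced output."""
    fit = getattr(markdown_result, "fit_markdown", None)
    if fit and fit.strip():
        return fit
    return markdown_result.raw_markdown


# Stand-ins for result.markdown_v2 (illustrative data, not real crawl output).
no_filter = SimpleNamespace(raw_markdown="# Title\nBody", fit_markdown=None)
with_filter = SimpleNamespace(raw_markdown="# Title\nNav\nBody", fit_markdown="# Title\nBody")

print(best_markdown(no_filter))
print(best_markdown(with_filter))
```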
|
||||
|
||||
**Use Cases:**
|
||||
- **RAG Pipelines:** Feed `fit_markdown` into a vector database for semantic search.
|
||||
- **LLM Fine-Tuning:** Use `raw_markdown` or `fit_markdown` as training data for large models.
|
||||
|
||||
---
|
||||
|
||||
## 5.3 Filtering Strategies
|
||||
|
||||
Filters refine raw HTML to produce cleaner Markdown. They can remove boilerplate sections (headers, footers) or focus on content relevant to a specific query.
|
||||
|
||||
**Two Major Strategies:**
|
||||
1. **BM25ContentFilter:**
|
||||
A relevance-based approach using BM25 scoring to rank content sections according to a user query.
|
||||
|
||||
2. **PruningContentFilter (Emphasized):**
|
||||
An unsupervised, clustering-like approach that systematically prunes irrelevant or noisy parts of the HTML. Unlike BM25, which relies on a query for relevance, `PruningContentFilter` attempts to cluster and discard noise based on structural and heuristic metrics. This makes it highly useful for general cleanup without predefined queries.
|
||||
|
||||
---
|
||||
|
||||
### Relevance-Based Filtering: BM25
|
||||
|
||||
BM25 ranks content blocks by relevance to a given query. It’s semi-supervised in the sense that it needs a query (`user_query`).
|
||||
|
||||
**Key Parameters:**
|
||||
- `user_query` (string): The query for content relevance.
|
||||
- `bm25_threshold` (float): The minimum relevance score a block must reach to be kept. Increase it to get fewer, more focused results.
|
||||
- `use_stemming` (bool): When `True`, matches variations of words.
|
||||
- `case_sensitive` (bool): Controls case sensitivity.
|
||||
|
||||
**If `user_query` is omitted,** BM25 still scores content but has no specific target. This can be useful when you only need general scoring.
|
||||
|
||||
**Example:**
|
||||
```python
|
||||
from crawl4ai import CrawlerRunConfig
from crawl4ai.content_filter_strategy import BM25ContentFilter

config = CrawlerRunConfig(
    content_filter=BM25ContentFilter(
        user_query="artificial intelligence",
        bm25_threshold=2.0,
        use_stemming=True
    )
)
|
||||
```
|
||||
|
||||
**Troubleshooting BM25:**
|
||||
- If you get too much irrelevant content, raise `bm25_threshold`.
|
||||
- If you get too little content, lower it or disable `case_sensitive`.
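
To see why raising `bm25_threshold` yields fewer blocks, it can help to look at the scoring itself. The sketch below implements the standard Okapi BM25 formula over whitespace tokens. It is an illustration only and not the code used by `BM25ContentFilter`: the function names, the `k1` and `b` values, and the lack of stemming are all assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Score each chunk against the query with the Okapi BM25 formula."""
    if not chunks:
        return []
    docs = [chunk.lower().split() for chunk in chunks]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of chunks that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(d) / avg_len))
            score += idf * norm
        scores.append(score)
    return scores


def filter_by_threshold(query, chunks, threshold):
    """Keep only the chunks whose BM25 score reaches the threshold."""
    return [c for c, s in zip(chunks, bm25_scores(query, chunks)) if s >= threshold]


chunks = [
    "machine learning models learn from data",
    "subscribe to our newsletter today",
    "deep learning is a branch of machine learning",
]
print(filter_by_threshold("machine learning", chunks, threshold=0.1))
```

A chunk that shares no terms with the query scores exactly zero, so any positive threshold removes it; higher thresholds then start removing weakly matching chunks as well.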
|
||||
|
||||
---
|
||||
|
||||
### PruningContentFilter: Unsupervised Content Clustering
|
||||
|
||||
`PruningContentFilter` is about intelligently stripping away non-essential parts of a page—ads, navigation bars, repetitive links—without relying on a specific user query. Think of it as an unsupervised clustering method that scores content blocks and removes “noise.”
|
||||
|
||||
**Key Features:**
|
||||
- **Unsupervised Nature:** No query needed. Uses heuristics like text density, link density, tag importance, and HTML structure.
|
||||
- **Clustering-Like Behavior:** It effectively “clusters” page sections by their structural and textual qualities, and prunes those that don’t meet thresholds.
|
||||
- **Threshold Adjustments:** Dynamically adjusts or uses a fixed threshold to remove or keep content blocks.
|
||||
|
||||
**Parameters:**
|
||||
- `threshold` (float): Score threshold for removing content. Higher values prune more aggressively. Default: `0.5`.
|
||||
- `threshold_type` (str): `"fixed"` or `"dynamic"`.
|
||||
- **Fixed:** Compares each block’s score directly to a set threshold.
|
||||
- **Dynamic:** Adjusts threshold based on content metrics for a more adaptive approach.
|
||||
- `min_word_threshold` (int): Minimum word count to keep a content block.
|
||||
- Internal metrics consider:
|
||||
- **Text Density:** Prefers sections rich in text over code or sparse elements.
|
||||
- **Link Density:** Penalizes sections with too many links.
|
||||
- **Tag Importance:** Some tags (e.g., `<article>`, `<main>`, `<section>`) are considered more important and less likely to be pruned.
|
||||
- **Class/ID patterns:** Looks for signals (like `nav`, `footer`) to identify boilerplate.
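
The metrics listed above can be combined into a single score per block. The sketch below shows one way to do that with a fixed threshold. Everything in it is hypothetical: the `Block` structure, the tag weights, and the 50/50 weighting are illustrative assumptions, and the real scoring logic lives in `PruningContentFilter`.

```python
from dataclasses import dataclass


@dataclass
class Block:
    tag: str        # e.g. "article", "div", "nav"
    text: str       # visible text of the block
    link_text: str  # the part of that text that sits inside <a> elements


# Hypothetical weights, chosen only for illustration.
TAG_WEIGHTS = {"article": 1.0, "main": 1.0, "section": 0.8, "div": 0.5, "nav": 0.1, "footer": 0.1}


def score(block: Block) -> float:
    """Combine text density, link density, and tag importance into one score in [0, 1]."""
    words = len(block.text.split())
    if words == 0:
        return 0.0
    link_density = len(block.link_text.split()) / words
    text_score = min(words / 50, 1.0)  # saturates once a block has 50 words
    tag_score = TAG_WEIGHTS.get(block.tag, 0.5)
    return 0.5 * text_score * (1 - link_density) + 0.5 * tag_score


def prune(blocks, threshold=0.5, min_word_threshold=0):
    """Keep blocks that are long enough and score at or above a fixed threshold."""
    return [
        b for b in blocks
        if len(b.text.split()) >= min_word_threshold and score(b) >= threshold
    ]


article = Block("article", "word " * 60, "")
nav = Block("nav", "Home About Contact", "Home About Contact")
print([b.tag for b in prune([article, nav])])
```

A dynamic threshold would derive the cutoff from the distribution of scores on the page instead of using a constant; that part is left out of this sketch.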
|
||||
|
||||
**Example:**
|
||||
```python
|
||||
from crawl4ai import CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter

config = CrawlerRunConfig(
    content_filter=PruningContentFilter(
        threshold=0.7,
        threshold_type="dynamic",
        min_word_threshold=100
    )
)
|
||||
```
|
||||
|
||||
In this example, content blocks under a dynamically adjusted threshold are pruned, and any block under 100 words is discarded, ensuring you keep only substantial textual sections.
|
||||
|
||||
**When to Use PruningContentFilter:**
|
||||
- **General Cleanup:** If you want a broad cleanup of the page without a specific target query, pruning is your go-to.
|
||||
- **Pre-Processing Large Corpora:** Before applying more specific filters, prune to remove boilerplate, then apply BM25 for query-focused refinement.
|
||||
|
||||
**Troubleshooting Pruning Filter:**
|
||||
- **Too Much Content Gone?** Lower the `threshold` or switch from `dynamic` to `fixed` threshold for more predictable behavior.
|
||||
- **Not Enough Pruning?** Increase `threshold` to be more aggressive.
|
||||
- **Mixed Results?** Adjust `min_word_threshold` or try the `dynamic` threshold mode to fine-tune results.
|
||||
|
||||
---
|
||||
|
||||
## 5.4 Fit Markdown: Bringing It All Together
|
||||
|
||||
“Fit Markdown” is the output you get when applying filters to the raw HTML before markdown generation. This produces a final, optimized Markdown that’s noise-free and content-focused.
|
||||
|
||||
### Advanced Usage Scenario
|
||||
|
||||
**Combining BM25 and Pruning:**
|
||||
1. First apply `PruningContentFilter` to remove general junk.
|
||||
2. Then apply a `BM25ContentFilter` to focus on query relevance.
|
||||
|
||||
*Example:*
|
||||
|
||||
```python
|
||||
import asyncio

from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai import CrawlerRunConfig, AsyncWebCrawler

# Second-stage filter: defined here, but not applied by the crawl below.
bm25_filter = BM25ContentFilter(
    user_query="technology advancements",
    bm25_threshold=1.2,
    use_stemming=True
)

config = CrawlerRunConfig(
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(threshold=0.5)  # First prune
    )
)

async def main():
    async with AsyncWebCrawler() as crawler:
        # First run pruning
        result = await crawler.arun("https://crawl4ai.com", config=config)
        pruned_fit_markdown = result.markdown_v2.fit_markdown

        # Re-run the BM25 filter on the pruned output, or integrate BM25 in a pipeline
        # (In practice, you'd integrate both filters within the crawler or run a second pass.)
        return pruned_fit_markdown

asyncio.run(main())
|
||||
```
|
||||
|
||||
**Performance Note:**
|
||||
Fit Markdown reduces token count, making subsequent LLM operations faster and cheaper.
|
||||
|
||||
---
|
||||
|
||||
## 5.5 Best Practices
|
||||
|
||||
- **Iterative Adjustment:** Start with default parameters, then adjust filters, thresholds, and `html2text_config` based on the quality of output you need.
|
||||
- **Combining Filters:** Use `PruningContentFilter` first to remove boilerplate, then a `BM25ContentFilter` to target relevance.
|
||||
- **Check Downstream Applications:** If you’re using fit Markdown for training LLMs, inspect the output to ensure no essential references were pruned.
|
||||
- **Docker Deployment:**
|
||||
Running Crawl4AI in a Docker container ensures a consistent environment. Just include the required packages in your Dockerfile and run the crawler script inside the container.
|
||||
- **Caching Results:**
|
||||
To save time, cache the raw HTML or intermediate Markdown. If you know you’ll re-run filters or change parameters often, caching avoids redundant crawling.
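
As an illustration of the caching advice, here is a minimal on-disk cache for raw HTML keyed by a hash of the URL. It is a sketch that is independent of Crawl4AI's own `CacheMode` handling; the class name `HtmlCache` and the file layout are assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


class HtmlCache:
    """Tiny on-disk cache that stores one HTML file per URL."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, url):
        # Hash the URL so that any URL maps to a safe, fixed-length file name.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.html"

    def get(self, url):
        """Return the cached HTML for this URL, or None if it was never stored."""
        path = self._path(url)
        return path.read_text(encoding="utf-8") if path.exists() else None

    def put(self, url, html):
        self._path(url).write_text(html, encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    cache = HtmlCache(tmp)
    cache.put("https://example.com", "<html>hi</html>")
    print(cache.get("https://example.com"))
```

With the HTML stored locally, filters and `html2text` options can be re-run repeatedly without fetching the page again.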
|
||||
|
||||
**Handling Special Cases:**
|
||||
- **Authentication-Protected Pages:**
|
||||
If you need to crawl gated content, provide appropriate session tokens or use a headless browser approach before feeding HTML to the generator.
|
||||
- **Proxies and Timeouts:**
|
||||
Configure the crawler with proxies or increased timeouts for sites that are slow or region-restricted.
|
||||
|
||||
---
|
||||
|
||||
## 5.6 Troubleshooting & FAQ
|
||||
|
||||
**Why am I getting empty Markdown?**
|
||||
- Ensure that the URL is correct and the crawler fetched content.
|
||||
- If using filters, relax your thresholds.
|
||||
|
||||
**How to handle JavaScript-heavy sites?**
|
||||
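- Make sure the page has finished rendering before Markdown generation. Crawl4AI can run JavaScript (`js_code`) and wait for a condition (`wait_for`) during a crawl, so dynamic content can be loaded before the HTML is converted.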
|
||||
|
||||
**How to improve formatting for code snippets?**
|
||||
- Set `handle_code_in_pre = True` in `html2text_config` to preserve code blocks more accurately.
|
||||
|
||||
**Links are cluttering my Markdown.**
|
||||
- Use `ignore_links=True` or convert them to citations for a cleaner layout.
|
||||
|
||||
---
|
||||
|
||||
## 5.7 Real-World Use Cases
|
||||
|
||||
1. **Summarizing News Articles:**
|
||||
Use `PruningContentFilter` to strip ads and nav bars, then pass the resulting `fit_markdown` to your summarizer as a clean input.
|
||||
|
||||
2. **Preparing Data for LLM Fine-Tuning:**
|
||||
For a large corpus, first prune all pages to remove boilerplate, then optionally apply BM25 to focus on specific topics. The resulting Markdown is ideal for training because it’s dense with meaningful content.
|
||||
|
||||
3. **RAG Pipelines:**
|
||||
Extract `fit_markdown`, store it in a vector database, and use it for retrieval-augmented generation. The references and structured headings enhance search relevance.
|
||||
|
||||
---
|
||||
|
||||
## 5.8 Appendix (References)
|
||||
|
||||
**Source Code Files:**
|
||||
- [markdown_generation_strategy.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/markdown_generation_strategy.py)
|
||||
- **Key Classes:** `MarkdownGenerationStrategy`, `DefaultMarkdownGenerator`
|
||||
- **Key Functions:** `convert_links_to_citations()`, `generate_markdown()`
|
||||
|
||||
- [content_filter_strategy.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/content_filter_strategy.py)
|
||||
- **Key Classes:** `RelevantContentFilter`, `BM25ContentFilter`, `PruningContentFilter`
|
||||
- **Metrics & Heuristics:** Examine `PruningContentFilter` code for scoring logic and threshold adjustments.
|
||||
|
||||
Exploring the source code will provide deeper insights into how tags are parsed, how scores are computed for pruning, and how BM25 relevance is calculated.
|
||||
|
||||
---
|
||||
|
||||
**In summary**, Markdown generation in Crawl4AI provides a powerful, configurable pipeline to transform raw HTML into AI-ready Markdown. By leveraging `PruningContentFilter` for general cleanup and `BM25ContentFilter` for query-focused extraction, plus fine-tuning `html2text_config`, you can achieve high-quality outputs for a wide range of AI applications.
|
||||
@@ -1,15 +0,0 @@
|
||||
markdown_generation: Converts web content into clean, structured Markdown format for AI processing | html to markdown, text conversion, content extraction | DefaultMarkdownGenerator()
|
||||
markdown_config_options: Configure HTML to Markdown conversion with html2text options like ignore_links, escape_html, body_width | markdown settings, conversion options | html2text_config={"ignore_links": True, "body_width": 80}
|
||||
content_filtering: Filter and clean web content using BM25 or Pruning strategies | content cleanup, noise removal | content_filter=BM25ContentFilter()
|
||||
bm25_filtering: Score and filter content based on relevance to a user query | relevance filtering, query matching | BM25ContentFilter(user_query="ai", bm25_threshold=1.5)
|
||||
pruning_filter: Remove boilerplate and noise using unsupervised clustering approach | content pruning, noise removal | PruningContentFilter(threshold=0.7, threshold_type="dynamic")
|
||||
markdown_result_types: Access different markdown outputs including raw, cited, and filtered versions | markdown formats, output types | result.markdown_v2.{raw_markdown, markdown_with_citations, fit_markdown}
|
||||
link_citations: Convert webpage links into citation-style references at document end | reference handling, link management | markdown_with_citations output format
|
||||
content_scoring: Evaluate content blocks based on text density, link density, and tag importance | content metrics, scoring system | PruningContentFilter metrics
|
||||
combined_filtering: Apply both pruning and BM25 filters for optimal content extraction | filter pipeline, multi-stage filtering | PruningContentFilter() followed by BM25ContentFilter()
|
||||
markdown_generation_troubleshooting: Debug empty outputs and malformed content issues | error handling, debugging | Check HTML content and filter thresholds
|
||||
performance_optimization: Cache results and adjust parameters for better processing speed | optimization, caching | Store intermediate results for reuse
|
||||
rag_pipeline_integration: Use filtered markdown for retrieval-augmented generation systems | RAG, vector storage | Store fit_markdown in vector database
|
||||
code_block_handling: Preserve and format code snippets in markdown output | code formatting, syntax | handle_code_in_pre=True option
|
||||
authentication_handling: Process content from authenticated pages using session tokens | auth support, protected content | Provide session tokens before markdown generation
|
||||
docker_deployment: Run markdown generation in containerized environment | deployment, containers | Include in Dockerfile configuration
|
||||
@@ -1,87 +0,0 @@
|
||||
```markdown
|
||||
# Chunking Strategies
|
||||
|
||||
> Break large texts into manageable chunks for relevance and retrieval workflows.
|
||||
|
||||
Enables segmentation for similarity-based retrieval and integration into RAG pipelines.
|
||||
|
||||
## Why Use Chunking?
|
||||
|
||||
- Prepare text for cosine similarity scoring
|
||||
- Integrate into RAG systems
|
||||
- Support multiple segmentation methods (regex, sentences, topics, fixed-length, sliding windows)
|
||||
|
||||
## Methods of Chunking
|
||||
|
||||
- [Regex-Based Chunking]: Splits text on patterns (e.g., `\n\n`)
|
||||
```python
|
||||
import re

class RegexChunking:
    def __init__(self, patterns=None):
        # Avoid a mutable default argument; split on blank lines by default.
        self.patterns = patterns or [r'\n\n']
    def chunk(self, text):
        parts = [text]
        for p in self.patterns:
            parts = [seg for pr in parts for seg in re.split(p, pr)]
        return parts
|
||||
```
|
||||
|
||||
- [Sentence-Based Chunking]: Uses NLP (e.g., `nltk.sent_tokenize`) for sentence-level chunks
|
||||
```python
|
||||
from nltk.tokenize import sent_tokenize
class NlpSentenceChunking:
    def chunk(self, text):
        return sent_tokenize(text)
|
||||
```
|
||||
|
||||
- [Topic-Based Segmentation]: Leverages `TextTilingTokenizer` for topic-level segments
|
||||
```python
|
||||
from nltk.tokenize import TextTilingTokenizer
class TopicSegmentationChunking:
    def __init__(self):
        self.tokenizer = TextTilingTokenizer()
    def chunk(self, text):
        return self.tokenizer.tokenize(text)
|
||||
```
|
||||
|
||||
- [Fixed-Length Word Chunking]: Chunks by a fixed number of words
|
||||
```python
|
||||
class FixedLengthWordChunking:
    def __init__(self, chunk_size=100):
        self.chunk_size = chunk_size
    def chunk(self, text):
        w = text.split()
        return [' '.join(w[i:i+self.chunk_size]) for i in range(0, len(w), self.chunk_size)]
|
||||
```
|
||||
|
||||
- [Sliding Window Chunking]: Overlapping chunks for context retention
|
||||
```python
|
||||
class SlidingWindowChunking:
    def __init__(self, window_size=100, step=50):
        self.window_size = window_size
        self.step = step
    def chunk(self, text):
        w = text.split()
        return [' '.join(w[i:i+self.window_size]) for i in range(0, max(len(w)-self.window_size+1, 1), self.step)]
|
||||
```
|
||||
|
||||
## Combining Chunking with Cosine Similarity
|
||||
|
||||
- Extract relevant chunks based on a query
|
||||
```python
|
||||
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class CosineSimilarityExtractor:
    def __init__(self, query):
        self.query = query
        self.vectorizer = TfidfVectorizer()
    def find_relevant_chunks(self, chunks):
        X = self.vectorizer.fit_transform([self.query] + chunks)
        sims = cosine_similarity(X[0:1], X[1:]).flatten()
        return list(zip(chunks, sims))
|
||||
```
|
||||
|
||||
## Optional
|
||||
|
||||
- [chuncking_strategies.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/chuncking_strategies.py)
|
||||
```
|
||||
@@ -1,144 +0,0 @@
|
||||
# Chunking Strategies
|
||||
Chunking strategies are critical for dividing large texts into manageable parts, enabling effective content processing and extraction. These strategies are foundational in cosine similarity-based extraction techniques, which allow users to retrieve only the most relevant chunks of content for a given query. Additionally, they facilitate direct integration into RAG (Retrieval-Augmented Generation) systems for structured and scalable workflows.
|
||||
|
||||
### Why Use Chunking?
|
||||
1. **Cosine Similarity and Query Relevance**: Prepares chunks for semantic similarity analysis.
|
||||
2. **RAG System Integration**: Seamlessly processes and stores chunks for retrieval.
|
||||
3. **Structured Processing**: Allows for diverse segmentation methods, such as sentence-based, topic-based, or windowed approaches.
|
||||
|
||||
### Methods of Chunking
|
||||
|
||||
#### 1. Regex-Based Chunking
|
||||
Splits text based on regular expression patterns, useful for coarse segmentation.
|
||||
|
||||
**Code Example**:
|
||||
```python
|
||||
import re

class RegexChunking:
    def __init__(self, patterns=None):
        self.patterns = patterns or [r'\n\n']  # Default pattern for paragraphs

    def chunk(self, text):
        paragraphs = [text]
        # Apply each pattern in turn, splitting every segment produced so far.
        for pattern in self.patterns:
            paragraphs = [seg for p in paragraphs for seg in re.split(pattern, p)]
        return paragraphs

# Example Usage
text = """This is the first paragraph.

This is the second paragraph."""
chunker = RegexChunking()
print(chunker.chunk(text))
|
||||
```
|
||||
|
||||
#### 2. Sentence-Based Chunking
|
||||
Divides text into sentences using NLP tools, ideal for extracting meaningful statements.
|
||||
|
||||
**Code Example**:
|
||||
```python
|
||||
from nltk.tokenize import sent_tokenize

class NlpSentenceChunking:
    def chunk(self, text):
        sentences = sent_tokenize(text)
        return [sentence.strip() for sentence in sentences]

# Example Usage
text = "This is sentence one. This is sentence two."
chunker = NlpSentenceChunking()
print(chunker.chunk(text))
|
||||
```
|
||||
|
||||
#### 3. Topic-Based Segmentation
|
||||
Uses algorithms like TextTiling to create topic-coherent chunks.
|
||||
|
||||
**Code Example**:
|
||||
```python
|
||||
from nltk.tokenize import TextTilingTokenizer

class TopicSegmentationChunking:
    def __init__(self):
        self.tokenizer = TextTilingTokenizer()

    def chunk(self, text):
        return self.tokenizer.tokenize(text)

# Example Usage
# Note: TextTiling is designed for long, multi-paragraph input (paragraphs
# separated by blank lines). Very short text such as the string below may
# raise a ValueError, so substitute a longer document when trying this out.
text = """This is an introduction.
This is a detailed discussion on the topic."""
chunker = TopicSegmentationChunking()
print(chunker.chunk(text))
|
||||
```
|
||||
|
||||
#### 4. Fixed-Length Word Chunking
|
||||
Segments text into chunks of a fixed word count.
|
||||
|
||||
**Code Example**:
|
||||
```python
|
||||
class FixedLengthWordChunking:
    def __init__(self, chunk_size=100):
        self.chunk_size = chunk_size

    def chunk(self, text):
        words = text.split()
        return [' '.join(words[i:i + self.chunk_size]) for i in range(0, len(words), self.chunk_size)]

# Example Usage
text = "This is a long text with many words to be chunked into fixed sizes."
chunker = FixedLengthWordChunking(chunk_size=5)
print(chunker.chunk(text))
|
||||
```
|
||||
|
||||
#### 5. Sliding Window Chunking
|
||||
Generates overlapping chunks for better contextual coherence.
|
||||
|
||||
**Code Example**:
|
||||
```python
|
||||
class SlidingWindowChunking:
    def __init__(self, window_size=100, step=50):
        self.window_size = window_size
        self.step = step

    def chunk(self, text):
        words = text.split()
        if len(words) <= self.window_size:
            # Text shorter than one window: return it as a single chunk instead of nothing.
            return [' '.join(words)] if words else []
        chunks = []
        for i in range(0, len(words) - self.window_size + 1, self.step):
            chunks.append(' '.join(words[i:i + self.window_size]))
        return chunks

# Example Usage
text = "This is a long text to demonstrate sliding window chunking."
chunker = SlidingWindowChunking(window_size=5, step=2)
print(chunker.chunk(text))
|
||||
```
|
||||
|
||||
### Combining Chunking with Cosine Similarity
|
||||
To enhance the relevance of extracted content, chunking strategies can be paired with cosine similarity techniques. Here’s an example workflow:
|
||||
|
||||
**Code Example**:
|
||||
```python
|
||||
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

class CosineSimilarityExtractor:
    def __init__(self, query):
        self.query = query
        self.vectorizer = TfidfVectorizer()

    def find_relevant_chunks(self, chunks):
        # Row 0 of the TF-IDF matrix is the query; the remaining rows are the chunks.
        vectors = self.vectorizer.fit_transform([self.query] + chunks)
        similarities = cosine_similarity(vectors[0:1], vectors[1:]).flatten()
        return [(chunks[i], similarities[i]) for i in range(len(chunks))]

# Example Workflow (uses SlidingWindowChunking from the previous section)
text = """This is a sample document. It has multiple sentences.
We are testing chunking and similarity."""

chunker = SlidingWindowChunking(window_size=5, step=3)
chunks = chunker.chunk(text)
query = "testing chunking"
extractor = CosineSimilarityExtractor(query)
relevant_chunks = extractor.find_relevant_chunks(chunks)

print(relevant_chunks)
|
||||
```
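
The workflow above returns every chunk with its score. In a retrieval setting you would usually keep only the best few, so a small selection step is a natural follow-up. The helper below is a sketch with a hypothetical name (`top_chunks`) and hypothetical default values; it works on any list of `(chunk, score)` pairs.

```python
def top_chunks(scored_chunks, k=3, min_score=0.0):
    """Return the k highest-scoring (chunk, score) pairs at or above min_score."""
    kept = [(chunk, float(score)) for chunk, score in scored_chunks if score >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)[:k]


# Illustrative scores, not real similarity output.
pairs = [("intro", 0.1), ("methods", 0.9), ("footer", 0.0), ("results", 0.4)]
print(top_chunks(pairs, k=2))
```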
|
||||
@@ -1,10 +0,0 @@
|
||||
chunking_overview: Chunking strategies divide large texts into manageable parts for content processing and extraction | text segmentation, content division, document splitting | None
|
||||
cosine_similarity_integration: Chunking prepares text segments for semantic similarity analysis using cosine similarity | semantic search, relevance matching | from sklearn.metrics.pairwise import cosine_similarity
|
||||
rag_integration: Chunks can be integrated into RAG (Retrieval-Augmented Generation) systems for structured workflows | retrieval augmented generation, RAG pipeline | None
|
||||
regex_chunking: Split text using regular expression patterns for basic segmentation | regex splitting, pattern-based chunking | RegexChunking(patterns=[r'\n\n'])
|
||||
sentence_chunking: Divide text into individual sentences using NLP tools | sentence tokenization, NLP chunking | from nltk.tokenize import sent_tokenize
|
||||
topic_chunking: Create topic-coherent chunks using TextTiling algorithm | topic segmentation, TextTiling | from nltk.tokenize import TextTilingTokenizer
|
||||
fixed_length_chunking: Segment text into chunks with fixed word count | word-based chunking, fixed size segments | FixedLengthWordChunking(chunk_size=100)
|
||||
sliding_window_chunking: Generate overlapping chunks for better context preservation | overlapping segments, windowed chunking | SlidingWindowChunking(window_size=100, step=50)
|
||||
cosine_similarity_extraction: Extract relevant chunks using TF-IDF and cosine similarity comparison | similarity search, relevance extraction | from sklearn.feature_extraction.text import TfidfVectorizer
|
||||
chunking_workflow: Combine chunking with cosine similarity for enhanced content retrieval | content extraction, similarity workflow | CosineSimilarityExtractor(query).find_relevant_chunks(chunks)
|
||||
@@ -1,604 +0,0 @@
|
||||
# Structured Data Extraction Strategies
|
||||
|
||||
## Extraction Strategies
|
||||
Structured data extraction strategies convert raw web content into organized, JSON-formatted data. They handle diverse extraction scenarios, including schema-based, language-model-driven, and clustering methods. This section covers strategies that use LLMs as well as strategies that work without them.
|
||||
|
||||
## Input Formats
|
||||
All extraction strategies support different input formats to give you more control over how content is processed:
|
||||
|
||||
- **markdown** (default): Uses the raw markdown conversion of the HTML content. Best for general text extraction where HTML structure isn't critical.
|
||||
- **html**: Uses the raw HTML content. Useful when you need to preserve HTML structure or extract data from specific HTML elements.
|
||||
- **fit_markdown**: Uses the cleaned and filtered markdown content. Best for extracting relevant content while removing noise. Requires a markdown generator with content filter to be configured.
|
||||
|
||||
To specify an input format:
|
||||
```python
|
||||
from crawl4ai.extraction_strategy import LLMExtractionStrategy

strategy = LLMExtractionStrategy(
    input_format="html",  # or "markdown" or "fit_markdown"
    provider="openai/gpt-4",
    instruction="Extract product information"
)
|
||||
```
|
||||
|
||||
Note: When using "fit_markdown", ensure your CrawlerRunConfig includes a markdown generator and content filter:
|
||||
```python
|
||||
config = CrawlerRunConfig(
    extraction_strategy=strategy,
    markdown_generator=DefaultMarkdownGenerator(),
    content_filter=PruningContentFilter()
)
|
||||
```
|
||||
|
||||
If fit_markdown is requested but not available (no markdown generator or content filter), the system will automatically fall back to raw markdown with a warning.
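
The fallback rule described above can be sketched as a small selection function. This is an illustration of the described behavior, not the library's code: the function name `select_input` and its parameters are hypothetical.

```python
import warnings


def select_input(input_format, html, raw_markdown, fit_markdown=None):
    """Pick the content handed to an extraction strategy for a given input format."""
    if input_format == "html":
        return html
    if input_format == "fit_markdown":
        if fit_markdown:
            return fit_markdown
        # No filtered Markdown available: warn and fall back to raw Markdown.
        warnings.warn("fit_markdown requested but not available; falling back to raw markdown")
        return raw_markdown
    if input_format == "markdown":
        return raw_markdown
    raise ValueError(f"Unknown input_format: {input_format!r}")


print(select_input("fit_markdown", "<p>x</p>", "x", fit_markdown="filtered x"))
```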
|
||||
|
||||
### LLM Extraction Strategy
|
||||
The **LLM Extraction Strategy** employs a large language model (LLM) to process content dynamically. It supports:
|
||||
- **Schema-Based Extraction**: Using a defined JSON schema to structure output.
|
||||
- **Instruction-Based Extraction**: Accepting custom prompts to guide the extraction process.
|
||||
- **Flexible Model Usage**: Supporting open-source or paid LLMs.
|
||||
|
||||
#### Key Features
|
||||
- Accepts customizable schemas for structured outputs.
|
||||
- Incorporates user prompts for tailored results.
|
||||
- Handles large inputs with chunking and overlap for efficient processing.
|
||||
|
||||
#### Parameters and Configurations
|
||||
Below is a detailed explanation of key parameters:
|
||||
|
||||
- **`provider`** *(str)*: Specifies the LLM provider (e.g., `openai`, `ollama`).
|
||||
- Default: `DEFAULT_PROVIDER`
|
||||
|
||||
- **`api_token`** *(Optional[str])*: API token for the LLM provider.
|
||||
- Required unless using a provider that doesn’t need authentication.
|
||||
|
||||
- **`instruction`** *(Optional[str])*: A prompt guiding the model on extraction specifics.
|
||||
- Example: "Extract all prices and model names from the page."
|
||||
|
||||
- **`schema`** *(Optional[Dict])*: JSON schema defining the structure of extracted data.
|
||||
- If provided, extraction switches to schema mode.
|
||||
|
||||
- **`extraction_type`** *(str)*: Determines extraction mode (`block` or `schema`).
|
||||
- Default: `block`
|
||||
|
||||
- **Chunking Settings**:
|
||||
- **`chunk_token_threshold`** *(int)*: Maximum token count per chunk. Default: `CHUNK_TOKEN_THRESHOLD`.
|
||||
- **`overlap_rate`** *(float)*: Proportion of overlapping tokens between chunks. Default: `OVERLAP_RATE`.
|
||||
|
||||
- **`extra_args`** *(Dict)*: Additional arguments passed to the LLM API, such as `max_length`, `temperature`, etc.
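
To make the chunking settings concrete, here is a minimal sketch of how a token threshold and an overlap rate could interact. It assumes whitespace-separated words stand in for tokens, and `chunk_with_overlap` is a hypothetical helper; the library's own chunking may differ in detail.

```python
def chunk_with_overlap(tokens, chunk_token_threshold, overlap_rate):
    """Split tokens into chunks that share a fraction of their tokens (sketch)."""
    if chunk_token_threshold <= 0:
        raise ValueError("chunk_token_threshold must be positive")
    if not 0 <= overlap_rate < 1:
        raise ValueError("overlap_rate must be in [0, 1)")

    overlap = int(chunk_token_threshold * overlap_rate)
    step = chunk_token_threshold - overlap  # always at least 1

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_token_threshold])
        if start + chunk_token_threshold >= len(tokens):
            break  # the last chunk already reaches the end of the input
    return chunks


words = "a b c d e f g h i j".split()
chunks = chunk_with_overlap(words, chunk_token_threshold=4, overlap_rate=0.25)
for chunk in chunks:
    print(chunk)
```

With a threshold of 4 and an overlap rate of 0.25, each chunk repeats the last token of the previous one, so context that straddles a chunk boundary is not lost.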
|
||||
|
||||
#### Example Usage
|
||||
|
||||
```python
from pydantic import BaseModel

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy


# Pydantic model describing one extracted record
class OpenAIModelFee(BaseModel):
    model_name: str
    input_fee: str
    output_fee: str


async def extract_structured_data():
    browser_config = BrowserConfig(headless=True)

    # Passing a schema switches the strategy to schema-based extraction
    extraction_strategy = LLMExtractionStrategy(
        provider="openai",
        api_token="your_api_token",
        schema=OpenAIModelFee.model_json_schema(),
        instruction="Extract all model fees from the content."
    )

    crawler_config = CrawlerRunConfig(
        extraction_strategy=extraction_strategy
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://crawl4ai.com/pricing",
            config=crawler_config
        )
        print(result.extracted_content)
```
|
||||
|
||||
#### Workflow and Error Handling
|
||||
- **Chunk Merging**: Content is divided into manageable chunks based on the token threshold.
|
||||
- **Backoff and Retries**: Handles API rate limits with backoff strategies.
|
||||
- **Error Logging**: Extracted blocks include error tags when issues occur.
|
||||
- **Parallel Execution**: Supports multi-threaded execution for efficiency.
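
As an illustration of the backoff-and-retry idea, the sketch below retries a call with exponentially growing delays. It is not the strategy's internal code: `RateLimitError` and `call_with_backoff` are hypothetical names, and the `sleep` parameter exists only so the demo runs without waiting.

```python
import random
import time


class RateLimitError(Exception):
    """Stand-in for a provider's rate-limit error (hypothetical)."""


def call_with_backoff(fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn, retrying on RateLimitError with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # retries exhausted, surface the error
            delay = base_delay * (2 ** attempt)
            # A little random jitter avoids many clients retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))


# Demo: the fake API fails twice, then succeeds. Delays are recorded, not slept.
attempts = []
delays = []


def flaky_api():
    attempts.append(1)
    if len(attempts) < 3:
        raise RateLimitError("429 Too Many Requests")
    return "ok"


result = call_with_backoff(flaky_api, sleep=delays.append)
print(result, len(attempts), "attempts")
```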
|
||||
|
||||
#### Benefits of Using LLM Extraction Strategy
|
||||
- **Dynamic Adaptability**: Easily switch between schema-based and instruction-based modes.
|
||||
- **Scalable**: Processes large content efficiently using chunking.
|
||||
- **Versatile**: Works with various LLM providers and configurations.
|
||||
|
||||
This strategy is ideal for extracting structured data from complex web pages, ensuring compatibility with LLM training and fine-tuning workflows.
|
||||
|
||||
### Cosine Strategy
|
||||
|
||||
The Cosine Strategy in Crawl4AI uses similarity-based clustering to identify and extract relevant content sections from web pages. This strategy is particularly useful when you need to find and extract content based on semantic similarity rather than structural patterns.
|
||||
|
||||
#### How It Works
|
||||
|
||||
The Cosine Strategy:
|
||||
1. Breaks down page content into meaningful chunks
|
||||
2. Converts text into vector representations
|
||||
3. Calculates similarity between chunks
|
||||
4. Clusters similar content together
|
||||
5. Ranks and filters content based on relevance
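
The steps above can be sketched with the standard library alone. This is a deliberately simplified illustration, not the `CosineStrategy` implementation: it uses word-count vectors instead of sentence embeddings, skips the clustering step, and scores each chunk directly against the query. The names `vectorize`, `cosine_similarity`, `rank_chunks`, and `min_score` are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn text into a bag-of-words vector (stand-in for an embedding)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_chunks(chunks, query, word_count_threshold=1, min_score=0.0, top_k=3):
    """Drop short chunks, score the rest against the query, keep the best top_k."""
    query_vec = vectorize(query)
    scored = []
    for chunk in chunks:
        if len(chunk.split()) < word_count_threshold:
            continue  # too short to be meaningful
        score = cosine_similarity(vectorize(chunk), query_vec)
        if score >= min_score:
            scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


chunks = [
    "great product reviews from customers",
    "shipping and returns policy",
    "reviews",
]
ranked = rank_chunks(chunks, "product reviews", word_count_threshold=2, min_score=0.1)
for score, chunk in ranked:
    print(f"{score:.3f}  {chunk}")
```

A real embedding model would also match chunks that share meaning but not vocabulary, which is the main reason to prefer it over word counts.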
|
||||
|
||||
#### Basic Usage
|
||||
|
||||
```python
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import CosineStrategy

strategy = CosineStrategy(
    semantic_filter="product reviews",    # Target content type
    word_count_threshold=10,              # Minimum words per cluster
    sim_threshold=0.3                     # Similarity threshold
)

# Run this part inside an async function (for example via asyncio.run)
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(
        url="https://crawl4ai.com/reviews",
        extraction_strategy=strategy
    )

    content = result.extracted_content
```
|
||||
|
||||
#### Configuration Options
|
||||
|
||||
##### Core Parameters
|
||||
|
||||
```python
|
||||
CosineStrategy(
|
||||
# Content Filtering
|
||||
semantic_filter: str = None, # Keywords/topic for content filtering
|
||||
word_count_threshold: int = 10, # Minimum words per cluster
|
||||
sim_threshold: float = 0.3, # Similarity threshold (0.0 to 1.0)
|
||||
|
||||
# Clustering Parameters
|
||||
max_dist: float = 0.2, # Maximum distance for clustering
|
||||
linkage_method: str = 'ward', # Clustering linkage method
|
||||
top_k: int = 3, # Number of top categories to extract
|
||||
|
||||
# Model Configuration
|
||||
model_name: str = 'sentence-transformers/all-MiniLM-L6-v2', # Embedding model
|
||||
|
||||
verbose: bool = False # Enable logging
|
||||
)
|
||||
```
|
||||
|
||||
##### Parameter Details
|
||||
|
||||
1. **semantic_filter**
|
||||
- Sets the target topic or content type
|
||||
- Use keywords relevant to your desired content
|
||||
- Example: "technical specifications", "user reviews", "pricing information"
|
||||
|
||||
2. **sim_threshold**
|
||||
- Controls how similar content must be to be grouped together
|
||||
- Higher values (e.g., 0.8) mean stricter matching
|
||||
- Lower values (e.g., 0.3) allow more variation
|
||||
```python
|
||||
# Strict matching
|
||||
strategy = CosineStrategy(sim_threshold=0.8)
|
||||
|
||||
# Loose matching
|
||||
strategy = CosineStrategy(sim_threshold=0.3)
|
||||
```
|
||||
|
||||
3. **word_count_threshold**
|
||||
- Filters out short content blocks
|
||||
- Helps eliminate noise and irrelevant content
|
||||
```python
|
||||
# Only consider substantial paragraphs
|
||||
strategy = CosineStrategy(word_count_threshold=50)
|
||||
```
|
||||
|
||||
4. **top_k**
|
||||
- Number of top content clusters to return
|
||||
- Higher values return more diverse content
|
||||
```python
|
||||
# Get top 5 most relevant content clusters
|
||||
strategy = CosineStrategy(top_k=5)
|
||||
```
|
||||
|
||||
#### Use Cases
|
||||
|
||||
##### 1. Article Content Extraction
|
||||
```python
|
||||
strategy = CosineStrategy(
|
||||
semantic_filter="main article content",
|
||||
word_count_threshold=100, # Longer blocks for articles
|
||||
top_k=1 # Usually want single main content
|
||||
)
|
||||
|
||||
result = await crawler.arun(
|
||||
url="https://crawl4ai.com/blog/post",
|
||||
extraction_strategy=strategy
|
||||
)
|
||||
```
|
||||
|
||||
##### 2. Product Review Analysis
|
||||
```python
|
||||
strategy = CosineStrategy(
|
||||
semantic_filter="customer reviews and ratings",
|
||||
word_count_threshold=20, # Reviews can be shorter
|
||||
top_k=10, # Get multiple reviews
|
||||
sim_threshold=0.4 # Allow variety in review content
|
||||
)
|
||||
```
|
||||
|
||||
##### 3. Technical Documentation
|
||||
```python
|
||||
strategy = CosineStrategy(
|
||||
semantic_filter="technical specifications documentation",
|
||||
word_count_threshold=30,
|
||||
sim_threshold=0.6, # Stricter matching for technical content
|
||||
max_dist=0.3 # Allow related technical sections
|
||||
)
|
||||
```
|
||||
|
||||
#### Advanced Features
|
||||
|
||||
##### Custom Clustering
|
||||
```python
|
||||
strategy = CosineStrategy(
|
||||
linkage_method='complete', # Alternative clustering method
|
||||
max_dist=0.4, # Larger clusters
|
||||
model_name='sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2' # Multilingual support
|
||||
)
|
||||
```
|
||||
|
||||
##### Content Filtering Pipeline
|
||||
```python
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import CosineStrategy

strategy = CosineStrategy(
    semantic_filter="pricing plans features",
    word_count_threshold=15,
    sim_threshold=0.5,
    top_k=3
)

async def extract_pricing_features(url: str):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            extraction_strategy=strategy
        )

        if result.success:
            # extracted_content is a JSON string; parse it before use
            content = json.loads(result.extracted_content)
            return {
                'pricing_features': content,
                'clusters': len(content),
                'similarity_scores': [item['score'] for item in content]
            }
```
|
||||
|
||||
#### Best Practices
|
||||
|
||||
1. **Adjust Thresholds Iteratively**
|
||||
- Start with default values
|
||||
- Adjust based on results
|
||||
- Monitor clustering quality
|
||||
|
||||
2. **Choose Appropriate Word Count Thresholds**
|
||||
- Higher for articles (100+)
|
||||
- Lower for reviews/comments (20+)
|
||||
- Medium for product descriptions (50+)
|
||||
|
||||
3. **Optimize Performance**
|
||||
```python
|
||||
strategy = CosineStrategy(
|
||||
word_count_threshold=10, # Filter early
|
||||
top_k=5, # Limit results
|
||||
verbose=True # Monitor performance
|
||||
)
|
||||
```
|
||||
|
||||
4. **Handle Different Content Types**
|
||||
```python
|
||||
# For mixed content pages
|
||||
strategy = CosineStrategy(
|
||||
semantic_filter="product features",
|
||||
sim_threshold=0.4, # More flexible matching
|
||||
max_dist=0.3, # Larger clusters
|
||||
top_k=3 # Multiple relevant sections
|
||||
)
|
||||
```
|
||||
|
||||
#### Error Handling
|
||||
|
||||
```python
try:
    result = await crawler.arun(
        url="https://crawl4ai.com",
        extraction_strategy=strategy
    )

    if result.success:
        content = json.loads(result.extracted_content)
        if not content:
            print("No relevant content found")
    else:
        print(f"Extraction failed: {result.error_message}")

except Exception as e:
    print(f"Error during extraction: {str(e)}")
```
|
||||
|
||||
The Cosine Strategy is particularly effective when:
|
||||
- Content structure is inconsistent
|
||||
- You need semantic understanding
|
||||
- You want to find similar content blocks
|
||||
- Structure-based extraction (CSS/XPath) isn't reliable
|
||||
|
||||
It works well with other strategies and can be used as a pre-processing step for LLM-based extraction.
|
||||
|
||||
|
||||
### JSON-Based Extraction Strategies with AsyncWebCrawler
|
||||
|
||||
In many cases, relying on a Large Language Model (LLM) to parse and structure data from web pages is both unnecessary and wasteful. Instead of incurring additional computational overhead, network latency, and even contributing to unnecessary CO2 emissions, you can employ direct HTML parsing strategies. These approaches are faster, simpler, and more environmentally friendly, running efficiently on any computer or device without costly API calls.
|
||||
|
||||
Crawl4AI offers two primary declarative extraction strategies that do not depend on LLMs:
|
||||
- `JsonCssExtractionStrategy`
|
||||
- `JsonXPathExtractionStrategy`
|
||||
|
||||
Of these two, while CSS selectors are often simpler to use, **XPath selectors are generally more robust and flexible**, particularly for large-scale scraping tasks. Modern websites often generate dynamic or ephemeral class names that are subject to frequent change. XPath, on the other hand, allows you to navigate the DOM structure directly, making your selectors less brittle and less dependent on inconsistent class names.
|
||||
|
||||
#### Why Use JSON-Based Extraction Instead of LLMs?
|
||||
|
||||
1. **Speed & Efficiency**: Direct HTML parsing bypasses the latency of external API calls.
|
||||
2. **Lower Resource Usage**: No need for large models, GPU acceleration, or network overhead.
|
||||
3. **Environmentally Friendly**: Reduced energy consumption and carbon footprint compared to LLM inference.
|
||||
4. **Offline Capability**: Works anywhere you have the HTML, no network needed.
|
||||
5. **Scalability & Reliability**: Stable and predictable, without dealing with model “hallucinations” or downtime.
|
||||
|
||||
#### Advantages of XPath Over CSS
|
||||
|
||||
1. **Stability in Dynamic Environments**: Websites change their classes and IDs constantly. XPath allows you to refer to elements by structure and position instead of relying on fragile class names.
|
||||
2. **Finer-Grained Control**: XPath supports advanced queries like traversing parent/child relationships, filtering based on attributes, and handling complex nested patterns.
|
||||
3. **Consistency Across Complex Pages**: Even when the front-end framework changes markup or introduces randomized class names, XPath expressions often remain valid if the structural hierarchy stays intact.
|
||||
4. **More Powerful Selection Logic**: You can write conditions like `//div[@data-test='price']` or `//tr[3]/td[2]` to accurately pinpoint elements.
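
The two selector styles mentioned in the last point can be tried out with Python's standard library. This is only a sketch: `xml.etree.ElementTree` supports a limited subset of XPath and needs well-formed markup, so real pages usually call for a full HTML parser. The page content and the random-looking class name are made up for the demo.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample page.
PAGE = """
<html><body>
  <div class="x9f2k" data-test="price">42.00</div>
  <table><tbody>
    <tr><td>Bitcoin</td><td>65000</td></tr>
    <tr><td>Ethereum</td><td>3200</td></tr>
    <tr><td>Solana</td><td>150</td></tr>
  </tbody></table>
</body></html>
"""

root = ET.fromstring(PAGE.strip())

# Attribute-based: ignores the unstable class name, relies on data-test.
price = root.find(".//div[@data-test='price']").text

# Position-based: third row, second cell, no class names involved.
third_row_value = root.find(".//tr[3]/td[2]").text

print(price, third_row_value)
```

Both queries keep working if the `class` attribute changes, which is the stability argument made above.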
|
||||
|
||||
#### Example Using XPath
|
||||
|
||||
Below is an example that extracts cryptocurrency prices from a hypothetical page using `JsonXPathExtractionStrategy`. Here, we avoid depending on class names entirely, focusing on the consistent structure of the HTML. By adjusting XPath expressions, you can overcome dynamic naming schemes that would break fragile CSS selectors.
|
||||
|
||||
```python
import json
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy

async def extract_data_using_xpath():
    print("\n--- Using JsonXPathExtractionStrategy for Fast, Reliable Structured Output ---")

    # Define the extraction schema using XPath selectors
    # Example: We know the table rows are always in this structure, regardless of class names
    schema = {
        "name": "Crypto Prices",
        "baseSelector": "//table/tbody/tr",
        "fields": [
            {
                "name": "crypto",
                "selector": ".//td[1]/h2",
                "type": "text",
            },
            {
                "name": "symbol",
                "selector": ".//td[1]/p",
                "type": "text",
            },
            {
                "name": "price",
                "selector": ".//td[2]",
                "type": "text",
            }
        ],
    }

    extraction_strategy = JsonXPathExtractionStrategy(schema, verbose=True)

    async with AsyncWebCrawler(verbose=True) as crawler:
        # Use XPath extraction on a page known for frequently changing its class names
        result = await crawler.arun(
            url="https://www.examplecrypto.com/prices",
            extraction_strategy=extraction_strategy,
            bypass_cache=True,
        )

        assert result.success, "Failed to crawl the page"

        # Parse the extracted content
        crypto_prices = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(crypto_prices)} cryptocurrency prices")
        print(json.dumps(crypto_prices[0], indent=2))

    return crypto_prices

# Run the async function
asyncio.run(extract_data_using_xpath())
```
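
For a rough picture of how a `baseSelector`/`fields` schema turns into records, the sketch below applies one with the standard library. It is not how `JsonXPathExtractionStrategy` is implemented: `apply_schema` is a hypothetical helper, the sample rows are invented, and the base selector starts with `.//` because `ElementTree` rejects absolute paths.

```python
import xml.etree.ElementTree as ET

SCHEMA = {
    "name": "Crypto Prices",
    "baseSelector": ".//table/tbody/tr",
    "fields": [
        {"name": "crypto", "selector": ".//td[1]/h2", "type": "text"},
        {"name": "symbol", "selector": ".//td[1]/p", "type": "text"},
        {"name": "price", "selector": ".//td[2]", "type": "text"},
    ],
}

# Invented sample markup with the structure the schema expects.
DOC = """
<div>
  <table><tbody>
    <tr><td><h2>Bitcoin</h2><p>BTC</p></td><td>$65,000</td></tr>
    <tr><td><h2>Ethereum</h2><p>ETH</p></td><td>$3,200</td></tr>
  </tbody></table>
</div>
"""


def apply_schema(root, schema):
    """Build one dict per base element, one key per field (sketch)."""
    records = []
    for base in root.findall(schema["baseSelector"]):
        record = {}
        for field in schema["fields"]:
            node = base.find(field["selector"])  # relative to the base element
            if node is not None and field["type"] == "text":
                record[field["name"]] = "".join(node.itertext()).strip()
        records.append(record)
    return records


records = apply_schema(ET.fromstring(DOC.strip()), SCHEMA)
print(records)
```

Each field selector is evaluated relative to the element matched by the base selector, which is why the field selectors in the schema begin with `.//`.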
|
||||
|
||||
#### When to Use CSS vs. XPath
|
||||
|
||||
- **CSS Selectors**: Good for simpler, stable sites where classes and IDs are fixed and descriptive. Ideal if you’re already familiar with front-end development patterns.
|
||||
- **XPath Selectors**: Recommended for complex or highly dynamic websites. If classes and IDs are meaningless, random, or prone to frequent changes, XPath provides a more structural and future-proof solution.
|
||||
|
||||
#### Handling Dynamic Content
|
||||
|
||||
Even on websites that load content asynchronously, you can still rely on XPath extraction. Combine the extraction strategy with JavaScript execution to scroll or wait for certain elements to appear. Using XPath after the page finishes loading ensures you’re targeting elements that are fully rendered and stable.
|
||||
|
||||
For example:
|
||||
|
||||
```python
|
||||
async def extract_dynamic_data():
|
||||
schema = {
|
||||
"name": "Dynamic Crypto Prices",
|
||||
"baseSelector": "//tr[contains(@class, 'price-row')]",
|
||||
"fields": [
|
||||
{"name": "name", "selector": ".//td[1]", "type": "text"},
|
||||
{"name": "price", "selector": ".//td[2]", "type": "text"},
|
||||
]
|
||||
}
|
||||
|
||||
js_code = """
|
||||
window.scrollTo(0, document.body.scrollHeight);
|
||||
await new Promise(resolve => setTimeout(resolve, 2000));
|
||||
"""
|
||||
|
||||
extraction_strategy = JsonXPathExtractionStrategy(schema, verbose=True)
|
||||
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://www.examplecrypto.com/dynamic-prices",
|
||||
extraction_strategy=extraction_strategy,
|
||||
js_code=js_code,
|
||||
wait_for="//tr[contains(@class, 'price-row')][20]", # Wait until at least 20 rows load
|
||||
bypass_cache=True,
|
||||
)
|
||||
|
||||
crypto_data = json.loads(result.extracted_content)
|
||||
print(f"Extracted {len(crypto_data)} cryptocurrency entries")
|
||||
```
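
Conceptually, waiting for rows to appear is a polling loop with a timeout. The sketch below shows that pattern in isolation. It makes no claim about how the crawler implements `wait_for`; `wait_until` and the simulated page are hypothetical.

```python
import time


def wait_until(condition, timeout=5.0, interval=0.1,
               clock=time.monotonic, sleep=time.sleep):
    """Poll condition() until it is truthy or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Simulated page: every poll, five more rows have finished loading.
rows = []


def twenty_rows_loaded():
    rows.extend(["row"] * 5)
    return len(rows) >= 20


loaded = wait_until(twenty_rows_loaded, timeout=1.0, interval=0.0)
print(loaded, len(rows))
```

Returning `False` on timeout, instead of hanging, lets the caller decide whether to extract whatever has loaded or to report a failure.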
|
||||
|
||||
#### Best Practices
|
||||
|
||||
1. **Avoid LLM-Based Extraction**: If the data is repetitive and structured, direct HTML parsing is faster, cheaper, and more stable.
|
||||
2. **Start with XPath**: In a constantly changing environment, building XPath selectors from stable structural elements (like table hierarchies, element positions, or unique attributes) ensures you won’t need to frequently rewrite selectors.
|
||||
3. **Test in Developer Tools**: Use browser consoles or `xmllint` to quickly verify XPath queries before coding.
|
||||
4. **Focus on Hierarchy, Not Classes**: Avoid relying on class names if they’re dynamic. Instead, use structural approaches like `//table/tbody/tr` or `//div[@data-test='price']`.
|
||||
5. **Combine with JS Execution**: For dynamic sites, run small snippets of JS to reveal content before extracting with XPath.
|
||||
|
||||
By following these guidelines, you can create high-performance, resilient extraction pipelines. You’ll save resources, reduce environmental impact, and enjoy a level of reliability and speed that LLM-based solutions can’t match when parsing repetitive data from complex or ever-changing websites.
|
||||
|
||||
### **Automating Schema Generation with a One-Time LLM-Assisted Utility**
|
||||
|
||||
While the focus of these extraction strategies is to avoid continuous reliance on LLMs, you can leverage a model once to streamline the creation of complex schemas. Instead of painstakingly determining repetitive patterns, crafting CSS or XPath selectors, and deciding field definitions by hand, you can prompt a language model once with the raw HTML and a brief description of what you need to extract. The result is a ready-to-use schema that you can plug into `JsonCssExtractionStrategy` or `JsonXPathExtractionStrategy` for lightning-fast extraction without further model calls.
|
||||
|
||||
**How It Works:**
|
||||
1. Provide the raw HTML containing your repetitive patterns.
|
||||
2. Optionally specify a natural language query describing the data you want.
|
||||
3. Run `generate_schema(html, query)` to let the LLM generate a schema automatically.
|
||||
4. Take the returned schema and use it directly with `JsonCssExtractionStrategy` or `JsonXPathExtractionStrategy`.
|
||||
5. After this initial step, no more LLM calls are necessary—you now have a schema that you can reuse as often as you like.
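
One way to make the "generate once, reuse forever" step explicit is to cache the schema on disk. The sketch below assumes the schema is plain JSON-serializable data; `load_or_generate_schema` is a hypothetical helper, and `fake_generate` stands in for the real LLM-backed call.

```python
import json
import tempfile
from pathlib import Path


def load_or_generate_schema(cache_path, generate):
    """Return the cached schema if present, otherwise generate and store it."""
    path = Path(cache_path)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    schema = generate()  # the only place a model call would happen
    path.write_text(json.dumps(schema, indent=2), encoding="utf-8")
    return schema


# Demo with a fake generator that counts how often it is called.
calls = []


def fake_generate():
    calls.append(1)
    return {"baseSelector": "//div[@class='company']", "fields": []}


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "schema.json"
    first = load_or_generate_schema(cache, fake_generate)
    second = load_or_generate_schema(cache, fake_generate)

print(len(calls), first == second)
```

The second call reads the file and never reaches the generator, so repeated crawls incur no further model calls.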
|
||||
|
||||
**Code Example:**
|
||||
|
||||
Here is a simplified demonstration using the utility function `generate_schema` that you’ve incorporated into your codebase. In this example, we:
|
||||
- Use a one-time LLM call to derive a schema from the HTML structure of a job board.
|
||||
- Apply the resulting schema to `JsonXPathExtractionStrategy` (although you can also use `JsonCssExtractionStrategy` if preferred).
|
||||
- Extract data from the target page at high speed with no subsequent LLM calls.
|
||||
|
||||
```python
|
||||
import json
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy
|
||||
|
||||
# Assume generate_schema is integrated and available
|
||||
from my_schema_utils import generate_schema
|
||||
|
||||
async def extract_data_with_generated_schema():
|
||||
# Raw HTML snippet representing repetitive patterns in the webpage
|
||||
test_html = """
|
||||
<div class="company-listings">
|
||||
<div class="company" data-company-id="123">
|
||||
<div class="company-header">
|
||||
<img class="company-logo" src="google.png" alt="Google">
|
||||
<h1 class="company-name">Google</h1>
|
||||
<div class="company-meta">
|
||||
<span class="company-size">10,000+ employees</span>
|
||||
<span class="company-industry">Technology</span>
|
||||
<a href="https://google.careers" class="careers-link">Careers Page</a>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
<div class="departments">
|
||||
<div class="department">
|
||||
<h2 class="department-name">Engineering</h2>
|
||||
<div class="positions">
|
||||
<div class="position-card" data-position-id="eng-1">
|
||||
<h3 class="position-title">Senior Software Engineer</h3>
|
||||
<span class="salary-range">$150,000 - $250,000</span>
|
||||
<div class="position-meta">
|
||||
<span class="location">Mountain View, CA</span>
|
||||
<span class="job-type">Full-time</span>
|
||||
<span class="experience">5+ years</span>
|
||||
</div>
|
||||
<div class="skills-required">
|
||||
<span class="skill">Python</span>
|
||||
<span class="skill">Kubernetes</span>
|
||||
<span class="skill">Machine Learning</span>
|
||||
</div>
|
||||
<p class="position-description">Join our core engineering team...</p>
|
||||
<div class="application-info">
|
||||
<span class="posting-date">Posted: 2024-03-15</span>
|
||||
<button class="apply-btn" data-req-id="REQ12345">Apply Now</button>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
</div>
|
||||
"""
|
||||
|
||||
# Optional natural language query to guide the schema generation
|
||||
query = "Extract company name, position titles, and salaries"
|
||||
|
||||
# One-time call to the LLM to generate a reusable schema
|
||||
schema = generate_schema(test_html, query=query)
|
||||
|
||||
# Other examples of queries:
|
||||
# # Test 1: No query (should extract everything)
|
||||
# print("\nTest 1: No Query (Full Schema)")
|
||||
# schema1 = generate_schema(test_html)
|
||||
# print(json.dumps(schema1, indent=2))
|
||||
|
||||
# # Test 2: Query for just basic job info
|
||||
# print("\nTest 2: Basic Job Info Query")
|
||||
# query2 = "I only need job titles, salaries, and locations"
|
||||
# schema2 = generate_schema(test_html, query2)
|
||||
# print(json.dumps(schema2, indent=2))
|
||||
|
||||
# # Test 3: Query for company and department structure
|
||||
# print("\nTest 3: Organizational Structure Query")
|
||||
# query3 = "Extract company details and department names, without position details"
|
||||
# schema3 = generate_schema(test_html, query3)
|
||||
# print(json.dumps(schema3, indent=2))
|
||||
|
||||
# # Test 4: Query for specific skills tracking
|
||||
# print("\nTest 4: Skills Analysis Query")
|
||||
# query4 = "I want to analyze required skills across all positions"
|
||||
# schema4 = generate_schema(test_html, query4)
|
||||
# print(json.dumps(schema4, indent=2))
|
||||
|
||||
# Now use the generated schema for high-speed extraction without any further LLM calls
|
||||
extraction_strategy = JsonXPathExtractionStrategy(schema, verbose=True)
|
||||
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
# URL for demonstration purposes (use any URL that contains a similar structure)
|
||||
result = await crawler.arun(
|
||||
url="https://crawl4ai.com/jobs",
|
||||
extraction_strategy=extraction_strategy,
|
||||
bypass_cache=True
|
||||
)
|
||||
|
||||
if not result.success:
|
||||
raise Exception("Extraction failed")
|
||||
|
||||
data = json.loads(result.extracted_content)
|
||||
print("Extracted data:")
|
||||
print(json.dumps(data, indent=2))
|
||||
|
||||
# Run the async function
|
||||
asyncio.run(extract_data_with_generated_schema())
|
||||
```
|
||||
|
||||
**Benefits of the One-Time LLM Approach:**
|
||||
- **Time-Saving**: Quickly bootstrap your schema creation, especially for complex pages.
|
||||
- **Once and Done**: Use the LLM once and then rely purely on the ultra-fast, local extraction strategies.
|
||||
- **Sustainable**: No repeated model calls means less compute, lower cost, and reduced environmental impact.
|
||||
|
||||
This approach leverages the strengths of both worlds: a one-time intelligent schema generation step with a language model, followed by a stable, purely local extraction pipeline that runs efficiently on any machine, without further LLM dependencies.
|
||||
@@ -1,12 +0,0 @@
|
||||
llm_extraction: LLM Extraction Strategy uses language models to process web content into structured JSON | language model extraction, schema extraction, LLM parsing | LLMExtractionStrategy(provider="openai", api_token="token")
|
||||
schema_based_extraction: Extract data using predefined JSON schemas to structure LLM output | schema extraction, structured output | schema=OpenAIModelFee.model_json_schema()
|
||||
chunking_config: Configure content chunking with token threshold and overlap rate | content chunks, token limits | chunk_token_threshold=1000, overlap_rate=0.1
|
||||
provider_config: Specify LLM provider and API credentials for extraction | model provider, API setup | provider="openai", api_token="your_token"
|
||||
cosine_strategy: Use similarity-based clustering to extract relevant content sections | content clustering, semantic similarity | CosineStrategy(semantic_filter="product reviews")
|
||||
clustering_params: Configure clustering behavior with similarity thresholds and methods | similarity settings, cluster config | sim_threshold=0.3, linkage_method='ward'
|
||||
content_filtering: Filter extracted content based on word count and relevance | content filters, extraction rules | word_count_threshold=10, top_k=3
|
||||
xpath_extraction: Extract data using XPath selectors for stable structural parsing | xpath selectors, HTML parsing | JsonXPathExtractionStrategy(schema)
|
||||
css_extraction: Extract data using CSS selectors for simple HTML parsing | css selectors, HTML parsing | JsonCssExtractionStrategy(schema)
|
||||
schema_generation: Generate extraction schemas automatically using one-time LLM assistance | schema creation, automation | generate_schema(html, query)
|
||||
dynamic_content: Handle dynamic webpage content with JavaScript execution and waiting | async content, js execution | js_code="window.scrollTo(0, document.body.scrollHeight)"
|
||||
extraction_best_practices: Use XPath for stability, avoid unnecessary LLM calls, test selectors | optimization, reliability | baseSelector="//table/tbody/tr"
|
||||
@@ -1,102 +0,0 @@
|
||||
# Extraction Strategies (Condensed LLM-Friendly Reference)
|
||||
|
||||
> Extract structured data (JSON) and text blocks from HTML with LLM-based or clustering methods.
|
||||
|
||||
Streamlined parameters, usage, and code snippets for quick LLM reference.
|
||||
|
||||
## Input Formats
|
||||
|
||||
- **markdown** (default): Raw markdown from HTML
|
||||
- **html**: Raw HTML content
|
||||
- **fit_markdown**: Cleaned markdown (needs markdown_generator + content_filter)
|
||||
|
||||
```python
|
||||
strategy = LLMExtractionStrategy(
|
||||
input_format="html", # Choose format
|
||||
provider="openai/gpt-4",
|
||||
instruction="Extract data"
|
||||
)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
extraction_strategy=strategy,
|
||||
markdown_generator=DefaultMarkdownGenerator(), # For fit_markdown
|
||||
content_filter=PruningContentFilter() # For fit_markdown
|
||||
)
|
||||
```
|
||||
|
||||
## LLMExtractionStrategy
|
||||
|
||||
- Uses LLM to extract structured data from HTML.
|
||||
- Supports `instruction`, `schema`, `extraction_type`, `chunk_token_threshold`, `overlap_rate`, `input_format`.
|
||||
```python
|
||||
from crawl4ai.extraction_strategy import LLMExtractionStrategy
|
||||
strategy = LLMExtractionStrategy(
|
||||
provider="openai",
|
||||
api_token="your_api_token",
|
||||
instruction="Extract prices",
|
||||
schema={"fields": [...]},
|
||||
extraction_type="schema",
|
||||
input_format="html"
|
||||
)
|
||||
```
|
||||
|
||||
## CosineStrategy
|
||||
|
||||
- Clusters content via semantic embeddings.
|
||||
- Key params: `semantic_filter`, `word_count_threshold`, `sim_threshold`, `top_k`.
|
||||
```python
|
||||
from crawl4ai.extraction_strategy import CosineStrategy
|
||||
strategy = CosineStrategy(
|
||||
semantic_filter="product reviews",
|
||||
word_count_threshold=20,
|
||||
sim_threshold=0.3,
|
||||
top_k=5
|
||||
)
|
||||
```
|
||||
|
||||
## JsonCssExtractionStrategy
|
||||
|
||||
- Extracts data using CSS selectors.
|
||||
- `schema` defines `baseSelector`, `fields`.
|
||||
```python
|
||||
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
|
||||
schema = {
|
||||
"baseSelector": ".product",
|
||||
"fields": [
|
||||
{"name":"title","selector":"h2","type":"text"},
|
||||
{"name":"price","selector":".price","type":"text"}
|
||||
]
|
||||
}
|
||||
strategy = JsonCssExtractionStrategy(schema=schema)
|
||||
```
|
||||
|
||||
## JsonXPathExtractionStrategy
|
||||
|
||||
- Similar to CSS but uses XPath.
|
||||
- More stable against changing class names.
|
||||
```python
|
||||
from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy
|
||||
schema = {
|
||||
"baseSelector": "//div[@class='product']",
|
||||
"fields": [
|
||||
{"name":"title","selector":".//h2","type":"text"},
|
||||
{"name":"price","selector":".//span[@class='price']","type":"text"}
|
||||
]
|
||||
}
|
||||
strategy = JsonXPathExtractionStrategy(schema=schema)
|
||||
```
|
||||
|
||||
## Example Usage
|
||||
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
|
||||
|
||||
config = CrawlerRunConfig(extraction_strategy=strategy)
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun("https://example.com", config=config)
|
||||
print(result.extracted_content)
|
||||
```
|
||||
|
||||
## Optional
|
||||
|
||||
- [extraction_strategies.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/extraction_strategies.py)
|
||||
@@ -1,385 +0,0 @@
|
||||
# Content Selection in Crawl4AI
|
||||
|
||||
Crawl4AI offers flexible and powerful methods to precisely select and filter content from webpages. Whether you’re extracting articles, filtering unwanted elements, or using LLMs for structured data extraction, this guide will walk you through the essentials and advanced techniques.
|
||||
|
||||
**Table of Contents:**
|
||||
- [Content Selection in Crawl4AI](#content-selection-in-crawl4ai)
|
||||
- [Introduction \& Quick Start](#introduction--quick-start)
|
||||
- [CSS Selectors](#css-selectors)
|
||||
- [Content Filtering](#content-filtering)
|
||||
- [Handling Iframe Content](#handling-iframe-content)
|
||||
- [Structured Content Selection Using LLMs](#structured-content-selection-using-llms)
|
||||
- [Pattern-Based Selection](#pattern-based-selection)
|
||||
- [Comprehensive Example: Combining Techniques](#comprehensive-example-combining-techniques)
|
||||
- [Troubleshooting \& Best Practices](#troubleshooting--best-practices)
|
||||
- [Additional Resources](#additional-resources)
|
||||
|
||||
---
|
||||
|
||||
## Introduction & Quick Start
|
||||
|
||||
When crawling websites, you often need to isolate specific parts of a page—such as main article text, product listings, or metadata. Crawl4AI’s content selection features help you fine-tune your crawls to grab exactly what you need, while filtering out unnecessary elements.
|
||||
|
||||
**Quick Start Example:** Here’s a minimal example that extracts the main article content from a page:
|
||||
|
||||
```python
# AsyncWebCrawler is exported by the top-level package, not by async_configs
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def quick_start():
    config = CrawlerRunConfig(css_selector=".main-article")
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://crawl4ai.com", config=config)
        print(result.extracted_content)
```
|
||||
|
||||
This snippet sets a simple CSS selector to focus on the main article area of a webpage. You can build from here, adding more advanced strategies as needed.
|
||||
|
||||
---
|
||||
|
||||
## CSS Selectors
|
||||
|
||||
**What are they?**
|
||||
CSS selectors let you target specific parts of a webpage’s HTML. If you can identify a unique CSS selector (such as `.main-article`, `article h1`, or `.product-listing > li`), you can precisely control what parts of the page are extracted.
|
||||
|
||||
**How to find selectors:**
|
||||
1. Open the page in your browser.
|
||||
2. Use browser dev tools (e.g., Chrome DevTools: right-click → "Inspect") to locate the elements you want.
|
||||
3. Copy the CSS selector for that element.
|
||||
|
||||
**Example:**
|
||||
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def extract_heading_and_content(url):
    config = CrawlerRunConfig(css_selector="article h1, article .content")
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        return result.extracted_content
```
|
||||
|
||||
**Tip:** If your extracted content is empty, verify that your CSS selectors match existing elements on the page. Using overly generic selectors can also lead to too much content being extracted.
|
||||
|
||||
---
|
||||
|
||||
## Video and Audio Content
|
||||
|
||||
The library extracts video and audio elements with their metadata:
|
||||
|
||||
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

# Wrapper function added so the `await` runs inside a coroutine
async def print_media(url="https://example.com"):
    config = CrawlerRunConfig()
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)

    # Process videos
    for video in result.media["videos"]:
        print(f"Video source: {video['src']}")
        print(f"Type: {video['type']}")
        print(f"Duration: {video.get('duration')}")
        print(f"Thumbnail: {video.get('poster')}")

    # Process audio
    for audio in result.media["audios"]:
        print(f"Audio source: {audio['src']}")
        print(f"Type: {audio['type']}")
        print(f"Duration: {audio.get('duration')}")
```
|
||||
|
||||
## Link Analysis
|
||||
|
||||
Crawl4AI collects the links found on a page and groups them by type, which helps you map how pages relate to each other and spot navigation patterns.
|
||||
|
||||
### Link Classification
|
||||
|
||||
The library automatically categorizes links into:
|
||||
- Internal links (same domain)
|
||||
- External links (different domains)
|
||||
- Social media links
|
||||
- Navigation links
|
||||
- Content links
|
||||
|
||||
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

# Wrapper function added so the `await` runs inside a coroutine
async def print_links(url="https://example.com"):
    config = CrawlerRunConfig()
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)

    # Analyze internal links
    for link in result.links["internal"]:
        print(f"Internal: {link['href']}")
        print(f"Link text: {link['text']}")
        print(f"Context: {link['context']}")  # Surrounding text
        print(f"Type: {link['type']}")  # nav, content, etc.

    # Analyze external links
    for link in result.links["external"]:
        print(f"External: {link['href']}")
        print(f"Domain: {link['domain']}")
        print(f"Type: {link['type']}")
```
|
||||
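The internal/external split can be illustrated without a crawler. The sketch below is not Crawl4AI's implementation: `classify_links` is a hypothetical helper that resolves each link against the page URL and compares hostnames exactly, so under this simplification a subdomain counts as external.

```python
from urllib.parse import urljoin, urlparse

def classify_links(page_url, hrefs):
    """Split hrefs into internal/external by exact hostname match (a simplification)."""
    page_host = urlparse(page_url).netloc.lower()
    buckets = {"internal": [], "external": []}
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links first
        host = urlparse(absolute).netloc.lower()
        key = "internal" if host == page_host else "external"
        buckets[key].append({"href": absolute, "domain": host})
    return buckets

links = classify_links(
    "https://example.com/blog/",
    ["/about", "post-1", "https://example.com/contact", "https://twitter.com/example"],
)
for kind, found in links.items():
    print(kind, [link["href"] for link in found])
```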
|
||||
### Smart Link Filtering
|
||||
|
||||
Control which links are included in the results with `CrawlerRunConfig`:
|
||||
|
||||
```python
config = CrawlerRunConfig(
    exclude_external_links=True,          # Remove external links
    exclude_social_media_links=True,      # Remove social media links
    exclude_social_media_domains=[        # Custom social media domains
        "facebook.com", "twitter.com", "instagram.com"
    ],
    exclude_domains=["ads.example.com"]   # Exclude specific domains
)
result = await crawler.arun(url="https://example.com", config=config)
```
|
||||
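As a rough illustration of what domain-based exclusion means, the following sketch filters a list of URLs against an exclusion list. The helper `drop_excluded` is hypothetical, and treating subdomains of an excluded domain as excluded is an assumption; Crawl4AI's own matching rule may differ.

```python
from urllib.parse import urlparse

def drop_excluded(hrefs, exclude_domains):
    """Keep only links whose host is not an excluded domain or one of its subdomains."""
    excluded = {domain.lower() for domain in exclude_domains}

    def is_excluded(href):
        host = (urlparse(href).hostname or "").lower()
        return any(host == d or host.endswith("." + d) for d in excluded)

    return [href for href in hrefs if not is_excluded(href)]

kept = drop_excluded(
    [
        "https://example.com/article",
        "https://ads.example.com/banner",
        "https://www.facebook.com/share",
    ],
    exclude_domains=["ads.example.com", "facebook.com"],
)
print(kept)
```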
|
||||
## Metadata Extraction
|
||||
|
||||
Crawl4AI automatically extracts and processes page metadata, providing valuable information about the content:
|
||||
|
||||
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

# Wrapper function added so the `await` runs inside a coroutine
async def print_metadata(url="https://example.com"):
    config = CrawlerRunConfig()
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)

    metadata = result.metadata
    print(f"Title: {metadata['title']}")
    print(f"Description: {metadata['description']}")
    print(f"Keywords: {metadata['keywords']}")
    print(f"Author: {metadata['author']}")
    print(f"Published Date: {metadata['published_date']}")
    print(f"Modified Date: {metadata['modified_date']}")
    print(f"Language: {metadata['language']}")
```
|
||||
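Pages rarely define every metadata field, so it can help to read optional keys defensively. The sketch below uses only the standard library's `html.parser` to collect `<title>` and `<meta name=...>` values from a hypothetical page; it is not how Crawl4AI builds `result.metadata`, and the `MetaCollector` class is an illustration only.

```python
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Collect <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.metadata[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.metadata["title"] = self.metadata.get("title", "") + data

# Hypothetical page that sets a title, description, and author but no keywords.
page_html = """<html><head><title>Example Page</title>
<meta name="description" content="A sample page">
<meta name="author" content="Jane Doe">
</head><body></body></html>"""

collector = MetaCollector()
collector.feed(page_html)
metadata = collector.metadata

# Optional fields are read with .get() so a missing key does not raise KeyError.
print(metadata.get("title"))
print(metadata.get("keywords", "<not provided>"))
```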
|
||||
|
||||
|
||||
## Content Filtering
|
||||
|
||||
Crawl4AI provides content filtering parameters to exclude unwanted elements and ensure that you only get meaningful data. For instance, you can remove navigation bars, ads, or other non-essential parts of the page.
|
||||
|
||||
**Key Parameters:**
|
||||
- `word_count_threshold`: Minimum word count per extracted block. Helps skip short or irrelevant snippets.
|
||||
- `excluded_tags`: List of HTML tags to omit (e.g., `['form', 'header', 'footer', 'nav']`).
|
||||
- `exclude_external_links`: Strips out links pointing to external domains.
|
||||
- `exclude_social_media_links`: Removes common social media links or widgets.
|
||||
- `exclude_external_images`: Filters out images hosted on external domains.
|
||||
|
||||
**Example:**
|
||||
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def filtered_extraction(url):
    config = CrawlerRunConfig(
        word_count_threshold=10,
        excluded_tags=['form', 'header', 'footer', 'nav'],
        exclude_external_links=True,
        exclude_social_media_links=True,
        exclude_external_images=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        return result.extracted_content
```
|
||||
|
||||
**Best Practice:** Start with a minimal set of exclusions and increase them as needed. If you notice no content is extracted, try lowering `word_count_threshold` or removing certain excluded tags.
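To see why a high `word_count_threshold` can leave nothing behind, consider this small sketch of a word-count filter over text blocks. The `filter_blocks` helper and the sample blocks are hypothetical, and whether Crawl4AI uses `>=` or `>` at the boundary is an assumption here.

```python
def filter_blocks(blocks, word_count_threshold):
    """Keep blocks with at least word_count_threshold whitespace-separated words."""
    return [block for block in blocks if len(block.split()) >= word_count_threshold]

blocks = [
    "Home | About | Contact",
    "Subscribe now",
    "Crawl4AI lets you target the main article body and skip short navigation snippets entirely.",
]

# A threshold of 10 keeps only the long paragraph; 50 filters out everything.
kept_at_10 = filter_blocks(blocks, 10)
kept_at_50 = filter_blocks(blocks, 50)
print(len(kept_at_10), len(kept_at_50))
```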
|
||||
|
||||
---
|
||||
|
||||
## Handling Iframe Content
|
||||
|
||||
If a page embeds content in iframes (such as videos, maps, or third-party widgets), you may need to enable iframe processing. This ensures that Crawl4AI loads and extracts content displayed inside iframes.
|
||||
|
||||
**How to enable:**
|
||||
- Set `process_iframes=True` in your `CrawlerRunConfig` to process iframe content.
|
||||
- Use `remove_overlay_elements=True` to discard popups or modals that might block iframe content.
|
||||
|
||||
**Example:**
|
||||
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def extract_iframe_content(url):
    config = CrawlerRunConfig(
        process_iframes=True,
        remove_overlay_elements=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        return result.extracted_content
```
|
||||
|
||||
**Troubleshooting:**
|
||||
- If iframe content doesn’t load, ensure the iframe’s origin is allowed and that you have no network-related issues. Check the logs or consider using a browser-based strategy that supports multi-domain requests.
|
||||
|
||||
---
|
||||
|
||||
## Structured Content Selection Using LLMs
|
||||
|
||||
For more complex extraction tasks (e.g., summarizing content, extracting structured data like titles and key points), you can integrate LLMs. LLM-based extraction strategies let you define a schema and provide instructions to an LLM so it returns structured, JSON-formatted results.
|
||||
|
||||
**When to use LLM-based strategies:**
|
||||
- Extracting complex structures not easily captured by simple CSS selectors.
|
||||
- Summarizing or transforming data.
|
||||
- Handling varied, unpredictable page layouts.
|
||||
|
||||
**Example with an LLMExtractionStrategy:**
|
||||
```python
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from pydantic import BaseModel
from typing import List
import json

class ArticleContent(BaseModel):
    title: str
    main_points: List[str]
    conclusion: str

async def extract_article_with_llm(url):
    strategy = LLMExtractionStrategy(
        provider="ollama/nemotron",
        schema=ArticleContent.schema(),
        instruction="Extract the main article title, key points, and conclusion"
    )

    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        article = json.loads(result.extracted_content)
        return article
```
|
||||
|
||||
**Tips for LLM-based extraction:**
|
||||
- Refine your prompt in `instruction` to guide the LLM towards the desired structure.
|
||||
- If results are incomplete or incorrect, consider adjusting the schema or adding more context to the instruction.
|
||||
- Check for errors and handle edge cases where the LLM might not find certain fields.
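One way to handle missing fields is to validate the parsed JSON before using it. The following is a minimal standard-library sketch, not part of Crawl4AI: `parse_article` and `REQUIRED_FIELDS` are hypothetical names, and the assumption that the output may arrive either as one object or as a list of objects should be checked against your own results.

```python
import json

# Field names follow the ArticleContent model above; the checks are illustrative.
REQUIRED_FIELDS = {"title": str, "main_points": list, "conclusion": str}

def parse_article(raw):
    """Return (article, problems); problems lists missing or mistyped fields."""
    try:
        data = json.loads(raw)
    except (TypeError, json.JSONDecodeError) as exc:
        return None, [f"invalid JSON: {exc}"]
    # Assumption: the output may be a single object or a list of objects.
    if isinstance(data, list):
        data = data[0] if data else {}
    if not isinstance(data, dict):
        return None, ["unexpected top-level structure"]
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        elif not isinstance(data[name], expected_type):
            problems.append(f"wrong type for field: {name}")
    return data, problems

article, problems = parse_article('{"title": "Hello", "main_points": ["a", "b"]}')
print(problems)
```

Returning the problems as a list, rather than raising, lets the caller decide whether to retry with a refined `instruction` or accept a partial result.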
|
||||
|
||||
---
|
||||
|
||||
## Pattern-Based Selection
|
||||
|
||||
When dealing with repetitive, structured patterns (like a list of articles or products), you can use `JsonCssExtractionStrategy` to define a JSON schema that maps selectors to specific fields.
|
||||
|
||||
**Use Cases:**
|
||||
- News article listings, product grids, directory entries.
|
||||
- Extract multiple items that follow a similar structure on the same page.
|
||||
|
||||
**Example JSON Schema Extraction:**
|
||||
```python
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
import json

schema = {
    "name": "News Articles",
    "baseSelector": "article.news-item",
    "fields": [
        {"name": "headline", "selector": "h2", "type": "text"},
        {"name": "summary", "selector": ".summary", "type": "text"},
        {"name": "category", "selector": ".category", "type": "text"},
        {
            "name": "metadata",
            "type": "nested",
            "fields": [
                {"name": "author", "selector": ".author", "type": "text"},
                {"name": "date", "selector": ".date", "type": "text"}
            ]
        }
    ]
}

async def extract_news_items(url):
    strategy = JsonCssExtractionStrategy(schema)
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        articles = json.loads(result.extracted_content)
        return articles
```
|
||||
|
||||
**Maintenance Tip:** If the site’s structure changes, update your schema accordingly. Test small changes to ensure the extracted structure still matches your expectations.
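A lightweight way to notice schema drift is to check each extracted item against the field names in the schema. The sketch below is a hypothetical helper, not a Crawl4AI feature: `find_missing` walks the schema's `fields` (including nested ones) and reports names that are absent or empty. The sample `items` are invented for illustration.

```python
def find_missing(item, fields, prefix=""):
    """List dotted names of schema fields that are absent or empty in item."""
    missing = []
    for field in fields:
        name = prefix + field["name"]
        value = item.get(field["name"]) if isinstance(item, dict) else None
        if field.get("type") == "nested":
            # Recurse into nested fields; a missing parent reports every child.
            missing += find_missing(value or {}, field["fields"], prefix=name + ".")
        elif value in (None, ""):
            missing.append(name)
    return missing

fields = [
    {"name": "headline", "selector": "h2", "type": "text"},
    {"name": "summary", "selector": ".summary", "type": "text"},
    {
        "name": "metadata",
        "type": "nested",
        "fields": [
            {"name": "author", "selector": ".author", "type": "text"},
            {"name": "date", "selector": ".date", "type": "text"},
        ],
    },
]

# Hypothetical extraction output, e.g. after a site renamed one of its classes.
items = [
    {"headline": "A", "summary": "", "metadata": {"author": "X", "date": "2024-01-01"}},
    {"headline": "B", "summary": "ok", "metadata": {"author": "Y"}},
]

report = [find_missing(item, fields) for item in items]
print(report)
```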
|
||||
|
||||
---
|
||||
|
||||
## Comprehensive Example: Combining Techniques
|
||||
|
||||
Below is a more involved example that demonstrates combining multiple strategies and filtering parameters. Here, we extract structured article content from an `article.main` section, exclude unnecessary elements, and enforce a word count threshold.
|
||||
|
||||
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy
import json

async def extract_article_content(url: str):
    # Schema for structured extraction
    article_schema = {
        "name": "Article",
        "baseSelector": "article.main",
        "fields": [
            {"name": "title", "selector": "h1", "type": "text"},
            {"name": "content", "selector": ".content", "type": "text"}
        ]
    }

    config = CrawlerRunConfig(
        extraction_strategy=JsonCssExtractionStrategy(article_schema),
        word_count_threshold=10,
        excluded_tags=['nav', 'footer'],
        exclude_external_links=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)
        extracted = json.loads(result.extracted_content)
        return extracted
```
|
||||
|
||||
**Expanding This Example:**
|
||||
- Add pagination logic to handle multi-page extractions.
|
||||
- Introduce LLM-based extraction for a summary of the article’s main points.
|
||||
- Adjust filtering parameters to refine what content is included or excluded.
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting & Best Practices
|
||||
|
||||
**Common Issues & Fixes:**
|
||||
- **Empty extraction result:**
|
||||
- Verify CSS selectors and filtering parameters.
|
||||
- Lower or remove `word_count_threshold` to see if overly strict criteria are filtering everything out.
|
||||
- Check network requests or iframe settings if content is loaded dynamically.
|
||||
|
||||
- **Unintended content included:**
|
||||
- Add more tags to `excluded_tags`, or refine your CSS selectors.
|
||||
- Use `exclude_external_links` and other filters to clean up results.
|
||||
|
||||
- **LLM extraction errors:**
|
||||
- Ensure the schema matches the expected JSON structure.
|
||||
- Refine the `instruction` prompt to guide the LLM more clearly.
|
||||
- Validate LLM provider configuration and error logs.
|
||||
|
||||
**Performance Tips:**
|
||||
- Start with simpler strategies (basic CSS selectors) before moving to advanced LLM-based extraction.
|
||||
- Use caching or asynchronous crawling to handle large numbers of pages efficiently.
|
||||
- Consider running headless browser extractions in Docker for consistent, reproducible environments.
|
||||
|
||||
---
|
||||
|
||||
## Additional Resources
|
||||
|
||||
- **GitHub Source Files:**
|
||||
- [Async Web Crawler Implementation](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_webcrawler.py)
|
||||
- [Async Crawler Strategy Implementation](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_crawler_strategy.py)
|
||||
|
||||
- **Advanced Topics:**
|
||||
- Dockerized deployments for reproducible scraping environments.
|
||||
- Integration with caching or proxy services for large-scale crawls.
|
||||
- Expanding LLM strategies to perform complex transformations or summarizations.
|
||||
|
||||
Use these links and approaches as a starting point to refine your crawling strategies. With Crawl4AI’s flexible configuration and powerful selection methods, you’ll be able to extract exactly the content you need—no more, no less.
|
||||
@@ -1,12 +0,0 @@
|
||||
content_selection: Crawl4AI allows precise selection and filtering of webpage content | web scraping, content extraction, web crawler | CrawlerRunConfig(css_selector=".main-article")
|
||||
css_selectors: Target specific webpage elements using CSS selectors like .main-article or article h1 | DOM selection, HTML elements, element targeting | CrawlerRunConfig(css_selector="article h1, article .content")
|
||||
media_extraction: Extract video and audio elements with metadata including source, type, and duration | multimedia content, media files | result.media["videos"], result.media["audios"]
|
||||
link_analysis: Automatically categorize links into internal, external, social media, navigation, and content links | link classification, URL analysis | result.links["internal"], result.links["external"]
|
||||
link_filtering: Control which links are included using exclude parameters | link exclusion, domain filtering | CrawlerRunConfig(exclude_external_links=True, exclude_social_media_links=True)
|
||||
metadata_extraction: Automatically extract page metadata including title, description, keywords, and dates | page information, meta tags | result.metadata['title'], result.metadata['description']
|
||||
content_filtering: Remove unwanted elements using word count threshold and excluded tags | content cleanup, element removal | CrawlerRunConfig(word_count_threshold=10, excluded_tags=['form', 'header'])
|
||||
iframe_handling: Process content within iframes by enabling iframe processing and overlay removal | embedded content, frames | CrawlerRunConfig(process_iframes=True, remove_overlay_elements=True)
|
||||
llm_extraction: Use LLMs for complex content extraction with structured output | AI extraction, structured data | LLMExtractionStrategy(provider="ollama/nemotron", schema=ArticleContent.schema())
|
||||
pattern_extraction: Extract repetitive content patterns using JSON schema mapping | structured extraction, repeated elements | JsonCssExtractionStrategy(schema)
|
||||
troubleshooting: Common issues include empty results, unintended content, and LLM errors | debugging, error handling | config.word_count_threshold, excluded_tags
|
||||
best_practices: Start with simple selectors before advanced strategies and use caching for efficiency | optimization, performance | AsyncWebCrawler().arun(url=url, config=config)
|
||||
@@ -1,130 +0,0 @@
|
||||
# Crawl4AI Content Selection (LLM-Friendly Reference)
|
||||
|
||||
> Minimal, code-oriented reference for selecting and filtering webpage content using Crawl4AI.
|
||||
|
||||
## Quick Start
|
||||
|
||||
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def run():
    config = CrawlerRunConfig(css_selector=".main-article")
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        print(result.extracted_content)
```
|
||||
|
||||
## CSS Selectors
|
||||
|
||||
- Use `css_selector="selector"` to target specific content.
|
||||
|
||||
```python
config = CrawlerRunConfig(css_selector="article h1, article .content")
result = await crawler.arun(url="...", config=config)
```
|
||||
|
||||
## Content Filtering
|
||||
|
||||
- `word_count_threshold`: int
|
||||
- `excluded_tags`: list of tags
|
||||
- `exclude_external_links`: bool
|
||||
- `exclude_social_media_links`: bool
|
||||
- `exclude_external_images`: bool
|
||||
|
||||
```python
config = CrawlerRunConfig(
    word_count_threshold=10,
    excluded_tags=["form","header","footer","nav"],
    exclude_external_links=True,
    exclude_social_media_links=True,
    exclude_external_images=True
)
```
|
||||
|
||||
## Iframe Content
|
||||
|
||||
- `process_iframes`: bool
|
||||
- `remove_overlay_elements`: bool
|
||||
|
||||
```python
config = CrawlerRunConfig(
    process_iframes=True,
    remove_overlay_elements=True
)
```
|
||||
|
||||
## LLM-Based Extraction
|
||||
|
||||
- Use `LLMExtractionStrategy(provider="...")` with `schema=...` and `instruction="..."`
|
||||
|
||||
```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel

class ArticleContent(BaseModel):
    title: str
    main_points: list[str]
    conclusion: str

strategy = LLMExtractionStrategy(
    provider="ollama/nemotron",
    schema=ArticleContent.schema(),
    instruction="Extract title, points, conclusion"
)

config = CrawlerRunConfig(extraction_strategy=strategy)
```
|
||||
|
||||
## Pattern-Based Selection (JsonCssExtractionStrategy)
|
||||
|
||||
```python
from crawl4ai import CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "News Articles",
    "baseSelector": "article.news-item",
    "fields": [
        {"name":"headline","selector":"h2","type":"text"},
        {"name":"summary","selector":".summary","type":"text"},
        {"name":"category","selector":".category","type":"text"},
        {
            "name":"metadata",
            "type":"nested",
            "fields":[
                {"name":"author","selector":".author","type":"text"},
                {"name":"date","selector":".date","type":"text"}
            ]
        }
    ]
}

config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))
```
|
||||
|
||||
## Combined Example
|
||||
|
||||
```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

article_schema = {
    "name":"Article",
    "baseSelector":"article.main",
    "fields":[
        {"name":"title","selector":"h1","type":"text"},
        {"name":"content","selector":".content","type":"text"}
    ]
}

config = CrawlerRunConfig(
    extraction_strategy=JsonCssExtractionStrategy(article_schema),
    word_count_threshold=10,
    excluded_tags=["nav","footer"],
    exclude_external_links=True
)
```
|
||||
|
||||
## Optional
|
||||
|
||||
- [async_webcrawler.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_webcrawler.py)
|
||||
- [async_crawler_strategy.py](https://github.com/unclecode/crawl4ai/blob/main/crawl4ai/async_crawler_strategy.py)
|
||||
@@ -1,81 +0,0 @@
|
||||
# Crawl4AI Cache System and Migration Guide
|
||||
|
||||
## Overview
|
||||
Starting from version 0.5.0, Crawl4AI introduces a new caching system that replaces the old boolean flags with a more intuitive `CacheMode` enum. This change simplifies cache control and makes the behavior more predictable.
|
||||
|
||||
## Old vs New Approach
|
||||
|
||||
### Old Way (Deprecated)
|
||||
The old system used multiple boolean flags:
|
||||
- `bypass_cache`: Skip cache entirely
|
||||
- `disable_cache`: Disable all caching
|
||||
- `no_cache_read`: Don't read from cache
|
||||
- `no_cache_write`: Don't write to cache
|
||||
|
||||
### New Way (Recommended)
|
||||
The new system uses a single `CacheMode` enum:
|
||||
- `CacheMode.ENABLED`: Normal caching (read/write)
|
||||
- `CacheMode.DISABLED`: No caching at all
|
||||
- `CacheMode.READ_ONLY`: Only read from cache
|
||||
- `CacheMode.WRITE_ONLY`: Only write to cache
|
||||
- `CacheMode.BYPASS`: Skip cache for this operation
|
||||
|
||||
## Migration Example
|
||||
|
||||
### Old Code (Deprecated)
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def use_proxy():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            bypass_cache=True  # Old way
        )
        print(len(result.markdown))

async def main():
    await use_proxy()

if __name__ == "__main__":
    asyncio.run(main())
```
|
||||
|
||||
### New Code (Recommended)
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.async_configs import CrawlerRunConfig

async def use_proxy():
    config = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)  # Use CacheMode in CrawlerRunConfig
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
            config=config  # Pass the configuration object
        )
        print(len(result.markdown))

async def main():
    await use_proxy()

if __name__ == "__main__":
    asyncio.run(main())
```
|
||||
|
||||
## Common Migration Patterns
|
||||
|
||||
| Old Flag              | New Mode                          |
|-----------------------|-----------------------------------|
| `bypass_cache=True`   | `cache_mode=CacheMode.BYPASS`     |
| `disable_cache=True`  | `cache_mode=CacheMode.DISABLED`   |
| `no_cache_read=True`  | `cache_mode=CacheMode.WRITE_ONLY` |
| `no_cache_write=True` | `cache_mode=CacheMode.READ_ONLY`  |
|
||||
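The mapping in the table can be expressed as a small translation function, which may be handy when migrating many call sites. This is a sketch only: `CacheModeStandIn` is a local enum that mirrors the mode names above rather than the class shipped with Crawl4AI, `legacy_flags_to_mode` is a hypothetical helper, and the precedence applied when several flags are set at once is an assumption.

```python
from enum import Enum

class CacheModeStandIn(Enum):
    """Local stand-in that mirrors the mode names above; not the crawl4ai class."""
    ENABLED = "enabled"
    DISABLED = "disabled"
    READ_ONLY = "read_only"
    WRITE_ONLY = "write_only"
    BYPASS = "bypass"

def legacy_flags_to_mode(bypass_cache=False, disable_cache=False,
                         no_cache_read=False, no_cache_write=False):
    """Translate the deprecated boolean flags into a single mode.

    The precedence used when several flags are set is an assumption.
    """
    if disable_cache:
        return CacheModeStandIn.DISABLED
    if bypass_cache:
        return CacheModeStandIn.BYPASS
    if no_cache_read and no_cache_write:
        return CacheModeStandIn.DISABLED  # neither reading nor writing
    if no_cache_read:
        return CacheModeStandIn.WRITE_ONLY
    if no_cache_write:
        return CacheModeStandIn.READ_ONLY
    return CacheModeStandIn.ENABLED

print(legacy_flags_to_mode(no_cache_read=True))
```

Note that the read and write flags invert: refusing to read from the cache leaves a write-only cache, and refusing to write leaves a read-only one.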
|
||||
## Suppressing Deprecation Warnings
|
||||
If you need time to migrate, you can temporarily suppress deprecation warnings:
|
||||
```python
# In your config.py
SHOW_DEPRECATION_WARNINGS = False
```
|
||||
@@ -1,10 +0,0 @@
|
||||
cache_system: Crawl4AI v0.5.0 introduces CacheMode enum to replace boolean cache flags | caching system, cache control, cache configuration | CacheMode.ENABLED
|
||||
cache_modes: CacheMode enum supports five states: ENABLED, DISABLED, READ_ONLY, WRITE_ONLY, and BYPASS | cache states, caching options, cache settings | CacheMode.ENABLED, CacheMode.DISABLED, CacheMode.READ_ONLY, CacheMode.WRITE_ONLY, CacheMode.BYPASS
|
||||
cache_migration_bypass: Replace bypass_cache=True with cache_mode=CacheMode.BYPASS | skip cache, bypass caching | cache_mode=CacheMode.BYPASS
|
||||
cache_migration_disable: Replace disable_cache=True with cache_mode=CacheMode.DISABLED | disable caching, turn off cache | cache_mode=CacheMode.DISABLED
|
||||
cache_migration_read: Replace no_cache_read=True with cache_mode=CacheMode.WRITE_ONLY | write-only cache, disable read | cache_mode=CacheMode.WRITE_ONLY
|
||||
cache_migration_write: Replace no_cache_write=True with cache_mode=CacheMode.READ_ONLY | read-only cache, disable write | cache_mode=CacheMode.READ_ONLY
|
||||
crawler_config: Use CrawlerRunConfig to set cache mode in AsyncWebCrawler | crawler settings, configuration object | CrawlerRunConfig(cache_mode=CacheMode.BYPASS)
|
||||
deprecation_warnings: Suppress cache deprecation warnings by setting SHOW_DEPRECATION_WARNINGS to False | warning suppression, legacy support | SHOW_DEPRECATION_WARNINGS = False
|
||||
async_crawler_usage: AsyncWebCrawler requires async/await syntax and supports configuration via CrawlerRunConfig | async crawler, web crawler setup | async with AsyncWebCrawler(verbose=True) as crawler
|
||||
crawler_execution: Run AsyncWebCrawler using asyncio.run() in main script | crawler execution, async main | asyncio.run(main())
|
||||
@@ -1,117 +0,0 @@
|
||||
# Tutorial: Clicking Buttons to Load More Content with Crawl4AI
|
||||
|
||||
## Introduction
|
||||
|
||||
When scraping dynamic websites, it’s common to encounter “Load More” or “Next” buttons that must be clicked to reveal new content. Crawl4AI provides a straightforward way to handle these situations using JavaScript execution and waiting conditions. In this tutorial, we’ll cover two approaches:
|
||||
|
||||
1. **Step-by-step (Session-based) Approach:** Multiple calls to `arun()` to progressively load more content.
|
||||
2. **Single-call Approach:** Execute a more complex JavaScript snippet inside a single `arun()` call to handle all clicks at once before the extraction.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- A working installation of Crawl4AI
|
||||
- Basic familiarity with Python’s `async`/`await` syntax
|
||||
|
||||
## Step-by-Step Approach
|
||||
|
||||
Use a session ID to maintain state across multiple `arun()` calls:
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode

js_code = [
    # This JS finds the "Next" button and clicks it
    "const nextButton = document.querySelector('button.next'); nextButton && nextButton.click();"
]

wait_for_condition = "css:.new-content-class"

async def main():
    async with AsyncWebCrawler(headless=True, verbose=True) as crawler:
        # 1. Load the initial page
        result_initial = await crawler.arun(
            url="https://example.com",
            cache_mode=CacheMode.BYPASS,
            session_id="my_session"
        )

        # 2. Click the 'Next' button and wait for new content
        result_next = await crawler.arun(
            url="https://example.com",
            session_id="my_session",
            js_code=js_code,
            wait_for=wait_for_condition,
            js_only=True,
            cache_mode=CacheMode.BYPASS
        )

        # `result_next` now contains the updated HTML after clicking 'Next'

asyncio.run(main())
```
|
||||
|
||||
**Key Points:**
|
||||
- **`session_id`**: Keeps the same browser context open.
|
||||
- **`js_code`**: Executes JavaScript in the context of the already loaded page.
|
||||
- **`wait_for`**: Ensures the crawler waits until new content is fully loaded.
|
||||
- **`js_only=True`**: Runs the JS in the current session without reloading the page.
|
||||
|
||||
By repeating the `arun()` call multiple times and modifying the `js_code` (e.g., clicking different modules or pages), you can iteratively load all the desired content.
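The repetition can be wrapped in a loop that stops once a click reveals nothing new. The sketch below shows only that control flow: `FakeCrawler` is a hypothetical stand-in whose `arun()` returns a plain list of items so the example runs without a browser, whereas a real Crawl4AI call returns a result object whose content you would compare instead. `load_all` and `max_clicks` are likewise illustrative names.

```python
import asyncio

class FakeCrawler:
    """Stand-in for a crawler session: each 'click' reveals one more page of items."""

    def __init__(self, pages):
        self._pages = pages
        self._shown = 1

    async def arun(self, url, session_id, js_code=None, js_only=False):
        if js_only and js_code and self._shown < len(self._pages):
            self._shown += 1  # simulate the "Next" click loading more content
        return [item for page in self._pages[: self._shown] for item in page]

async def load_all(crawler, url, max_clicks=10):
    """Repeat the click step until a click reveals nothing new or max_clicks is hit."""
    items = await crawler.arun(url, session_id="my_session")
    for _ in range(max_clicks):
        updated = await crawler.arun(
            url, session_id="my_session", js_code=["click next"], js_only=True
        )
        if len(updated) == len(items):
            break  # nothing new appeared, so stop clicking
        items = updated
    return items

fake = FakeCrawler([["a", "b"], ["c"], ["d", "e"]])
all_items = asyncio.run(load_all(fake, "https://example.com"))
print(all_items)
```

The `max_clicks` cap guards against pages where the button never disappears.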
|
||||
|
||||
## Single-call Approach
|
||||
|
||||
If the page allows it, you can run a single `arun()` call with a more elaborate JavaScript snippet that:
|
||||
- Iterates over all the modules or "Next" buttons
|
||||
- Clicks them one by one
|
||||
- Waits for content updates between each click
|
||||
- Once done, returns control to Crawl4AI for extraction.
|
||||
|
||||
Example snippet:
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode

js_code = [
    # Example JS that clicks multiple modules:
    """
    (async () => {
        const modules = document.querySelectorAll('.module-item');
        for (let i = 0; i < modules.length; i++) {
            modules[i].scrollIntoView();
            modules[i].click();
            // Wait for each module's content to load, adjust 100ms as needed
            await new Promise(r => setTimeout(r, 100));
        }
    })();
    """
]

async def main():
    async with AsyncWebCrawler(headless=True, verbose=True) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            js_code=js_code,
            wait_for="css:.final-loaded-content-class",
            cache_mode=CacheMode.BYPASS
        )

        # `result` now contains all content after all modules have been clicked in one go.

asyncio.run(main())
```
|
||||
|
||||
**Key Points:**
|
||||
- All interactions (clicks and waits) happen before the extraction.
|
||||
- Ideal for pages where all steps can be done in a single pass.
|
||||
|
||||
## Choosing the Right Approach
|
||||
|
||||
- **Step-by-Step (Session-based)**:
|
||||
- Good when you need fine-grained control or must dynamically check conditions before clicking the next page.
|
||||
- Useful if the page requires multiple conditions checked at runtime.
|
||||
|
||||
- **Single-call**:
|
||||
- Perfect if the sequence of interactions is known in advance.
|
||||
- Cleaner code if the page’s structure is consistent and predictable.
|
||||
|
||||
## Conclusion
|
||||
|
||||
Crawl4AI makes it easy to handle dynamic content:
|
||||
- Use session IDs and multiple `arun()` calls for stepwise crawling.
|
||||
- Or pack all actions into one `arun()` call if the interactions are well-defined upfront.
|
||||
|
||||
This flexibility ensures you can handle a wide range of dynamic web pages efficiently.
|
||||
329
docs/md_v3/tutorials/advanced-features.md
Normal file
@@ -0,0 +1,329 @@
|
||||
# Advanced Features (Proxy, PDF, Screenshot, SSL, Headers, & Storage State)
|
||||
|
||||
Crawl4AI offers multiple power-user features that go beyond simple crawling. This tutorial covers:
|
||||
|
||||
1. **Proxy Usage**
|
||||
2. **Capturing PDFs & Screenshots**
|
||||
3. **Handling SSL Certificates**
|
||||
4. **Custom Headers**
|
||||
5. **Session Persistence & Local Storage**
|
||||
|
||||
> **Prerequisites**
|
||||
> - You have a basic grasp of [AsyncWebCrawler Basics](./async-webcrawler-basics.md)
|
||||
> - You know how to run or configure your Python environment with Playwright installed
|
||||
|
||||
---
|
||||
|
||||
## 1. Proxy Usage
|
||||
|
||||
If you need to route your crawl traffic through a proxy—whether for IP rotation, geo-testing, or privacy—Crawl4AI supports it via `BrowserConfig.proxy_config`.
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig
|
||||
|
||||
async def main():
|
||||
browser_cfg = BrowserConfig(
|
||||
proxy_config={
|
||||
"server": "http://proxy.example.com:8080",
|
||||
"username": "myuser",
|
||||
"password": "mypass",
|
||||
},
|
||||
headless=True
|
||||
)
|
||||
crawler_cfg = CrawlerRunConfig(
|
||||
verbose=True
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_cfg) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://www.whatismyip.com/",
|
||||
config=crawler_cfg
|
||||
)
|
||||
if result.success:
|
||||
print("[OK] Page fetched via proxy.")
|
||||
print("Page HTML snippet:", result.html[:200])
|
||||
else:
|
||||
print("[ERROR]", result.error_message)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
**Key Points**
|
||||
- **`proxy_config`** expects a dict with `server` and optional auth credentials.
|
||||
- Many commercial proxies provide an HTTP/HTTPS “gateway” server that you specify in `server`.
|
||||
- If your proxy doesn’t need auth, omit `username`/`password`.
|
||||
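Proxy providers often hand out a single URL rather than separate fields. As a minimal sketch, the hypothetical helper `proxy_config_from_url` below (not part of Crawl4AI) splits such a URL into the `server`, `username`, and `password` keys described above, using only the standard library:

```python
from urllib.parse import unquote, urlsplit


def proxy_config_from_url(proxy_url: str) -> dict:
    """Split a proxy URL into the keys used by `proxy_config` (hypothetical helper)."""
    parts = urlsplit(proxy_url)
    if not parts.scheme or not parts.hostname:
        raise ValueError(f"Not a valid proxy URL: {proxy_url!r}")

    server = f"{parts.scheme}://{parts.hostname}"
    if parts.port:
        server += f":{parts.port}"

    config = {"server": server}
    # Credentials are optional; add them only when the URL carries them.
    if parts.username:
        config["username"] = unquote(parts.username)
    if parts.password:
        config["password"] = unquote(parts.password)
    return config


print(proxy_config_from_url("http://myuser:mypass@proxy.example.com:8080"))
```

The resulting dict can then be passed as `proxy_config=...` when building `BrowserConfig`.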
|
||||
---
|
||||
|
||||
## 2. Capturing PDFs & Screenshots
|
||||
|
||||
Sometimes you need a visual record of a page or a PDF “printout.” Crawl4AI can do both in one pass:
|
||||
|
||||
```python
|
||||
import os, asyncio
|
||||
from base64 import b64decode
|
||||
from crawl4ai import AsyncWebCrawler, CacheMode
|
||||
|
||||
async def main():
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://en.wikipedia.org/wiki/List_of_common_misconceptions",
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
pdf=True,
|
||||
screenshot=True
|
||||
)
|
||||
|
||||
if result.success:
|
||||
# Save screenshot
|
||||
if result.screenshot:
|
||||
with open("wikipedia_screenshot.png", "wb") as f:
|
||||
f.write(b64decode(result.screenshot))
|
||||
|
||||
# Save PDF
|
||||
if result.pdf:
|
||||
with open("wikipedia_page.pdf", "wb") as f:
|
||||
f.write(result.pdf)  # result.pdf is already raw bytes (see CrawlResult.pdf: Optional[bytes])
|
||||
|
||||
print("[OK] PDF & screenshot captured.")
|
||||
else:
|
||||
print("[ERROR]", result.error_message)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
**Why PDF + Screenshot?**
|
||||
- Large or complex pages can be slow or error-prone with “traditional” full-page screenshots.
|
||||
- Exporting a PDF is more reliable for very long pages. Crawl4AI automatically converts the first PDF page into an image if you request both.
|
||||
|
||||
**Relevant Parameters**
|
||||
- **`pdf=True`**: Exports the current page as a PDF (raw bytes in `result.pdf`).
|
||||
- **`screenshot=True`**: Creates a screenshot (base64-encoded in `result.screenshot`).
|
||||
- **`scan_full_page`** or advanced hooking can further refine how the crawler captures content.
|
||||
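Because a capture field may arrive as base64 text or as raw bytes depending on the field and version, a small normalizing step can make the saving code more robust. The sketch below uses a hypothetical helper, `capture_bytes`, that is not part of Crawl4AI:

```python
import base64


def capture_bytes(data):
    """Return raw bytes for a capture field that may be base64 text or bytes."""
    if data is None:
        return None
    if isinstance(data, str):
        # Text payloads are assumed to be base64-encoded.
        return base64.b64decode(data)
    return bytes(data)


encoded = base64.b64encode(b"fake image data").decode("ascii")
print(capture_bytes(encoded))
print(capture_bytes(b"%PDF-1.7 fake"))
```

With a real result you might write `Path("page.pdf").write_bytes(capture_bytes(result.pdf))` after checking that the field is not `None`.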
|
||||
---
|
||||
|
||||
## 3. Handling SSL Certificates
|
||||
|
||||
If you need to verify or export a site’s SSL certificate—for compliance, debugging, or data analysis—Crawl4AI can fetch it during the crawl:
|
||||
|
||||
```python
|
||||
import asyncio, os
|
||||
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
|
||||
|
||||
async def main():
|
||||
tmp_dir = os.path.join(os.getcwd(), "tmp")
|
||||
os.makedirs(tmp_dir, exist_ok=True)
|
||||
|
||||
config = CrawlerRunConfig(
|
||||
fetch_ssl_certificate=True,
|
||||
cache_mode=CacheMode.BYPASS
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun(url="https://example.com", config=config)
|
||||
|
||||
if result.success and result.ssl_certificate:
|
||||
cert = result.ssl_certificate
|
||||
print("\nCertificate Information:")
|
||||
print(f"Issuer (CN): {cert.issuer.get('CN', '')}")
|
||||
print(f"Valid until: {cert.valid_until}")
|
||||
print(f"Fingerprint: {cert.fingerprint}")
|
||||
|
||||
# Export in multiple formats:
|
||||
cert.to_json(os.path.join(tmp_dir, "certificate.json"))
|
||||
cert.to_pem(os.path.join(tmp_dir, "certificate.pem"))
|
||||
cert.to_der(os.path.join(tmp_dir, "certificate.der"))
|
||||
|
||||
print("\nCertificate exported to JSON/PEM/DER in 'tmp' folder.")
|
||||
else:
|
||||
print("[ERROR] No certificate or crawl failed.")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
**Key Points**
|
||||
- **`fetch_ssl_certificate=True`** triggers certificate retrieval.
|
||||
- `result.ssl_certificate` includes methods (`to_json`, `to_pem`, `to_der`) for saving in various formats (handy for server config, Java keystores, etc.).
|
||||
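Once a certificate has been exported as DER, you can compute a fingerprint of the file yourself. This is a sketch with a hypothetical helper, `der_fingerprint`; whether its output matches `cert.fingerprint` depends on the hash algorithm and formatting Crawl4AI uses, which this tutorial does not specify:

```python
import hashlib


def der_fingerprint(der_bytes: bytes, algorithm: str = "sha256") -> str:
    """Hash DER-encoded certificate bytes and format the digest as colon-separated hex."""
    digest = hashlib.new(algorithm, der_bytes).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))


# Placeholder bytes stand in for the contents of tmp/certificate.der.
print(der_fingerprint(b"placeholder DER bytes"))
```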
|
||||
---
|
||||
|
||||
## 4. Custom Headers
|
||||
|
||||
Sometimes you need to set custom headers (e.g., language preferences, authentication tokens, or specialized user-agent strings). You can do this in multiple ways:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
|
||||
async def main():
|
||||
# Option 1: Set headers at the crawler strategy level
|
||||
crawler1 = AsyncWebCrawler(
|
||||
# The underlying strategy can accept headers in its constructor
|
||||
crawler_strategy=None # None falls back to the default strategy, which we configure below
|
||||
)
|
||||
crawler1.crawler_strategy.update_user_agent("MyCustomUA/1.0")
|
||||
crawler1.crawler_strategy.set_custom_headers({
|
||||
"Accept-Language": "fr-FR,fr;q=0.9"
|
||||
})
|
||||
result1 = await crawler1.arun("https://www.example.com")
|
||||
print("Example 1 result success:", result1.success)
|
||||
|
||||
# Option 2: Pass headers directly to `arun()`
|
||||
crawler2 = AsyncWebCrawler()
|
||||
result2 = await crawler2.arun(
|
||||
url="https://www.example.com",
|
||||
headers={"Accept-Language": "es-ES,es;q=0.9"}
|
||||
)
|
||||
print("Example 2 result success:", result2.success)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
**Notes**
|
||||
- Some sites may react differently to certain headers (e.g., `Accept-Language`).
|
||||
- If you need advanced user-agent randomization or client hints, see [Identity-Based Crawling (Anti-Bot)](./identity-anti-bot.md) or use `UserAgentGenerator`.
|
||||
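The `Accept-Language` values above follow the usual pattern of a preferred language followed by fallbacks with decreasing `q` weights. A minimal sketch for building such a value from an ordered list, using a hypothetical helper named `accept_language`:

```python
def accept_language(languages, step=0.1):
    """Build an Accept-Language value; earlier entries get higher preference."""
    parts = []
    for index, lang in enumerate(languages):
        q = round(1.0 - index * step, 3)
        if q <= 0:
            raise ValueError("Too many languages for the chosen step")
        # The first language carries an implicit q=1 and needs no suffix.
        parts.append(lang if index == 0 else f"{lang};q={q:g}")
    return ",".join(parts)


print(accept_language(["fr-FR", "fr"]))
```

The returned string can be used as the value in the `headers` dict shown in the examples above.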
|
||||
---
|
||||
|
||||
## 5. Session Persistence & Local Storage
|
||||
|
||||
Crawl4AI can preserve cookies and localStorage so you can continue where you left off—ideal for logging into sites or skipping repeated auth flows.
|
||||
|
||||
### 5.1 `storage_state`
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
|
||||
async def main():
|
||||
storage_dict = {
|
||||
"cookies": [
|
||||
{
|
||||
"name": "session",
|
||||
"value": "abcd1234",
|
||||
"domain": "example.com",
|
||||
"path": "/",
|
||||
"expires": 1699999999.0,
|
||||
"httpOnly": False,
|
||||
"secure": False,
|
||||
"sameSite": "None"
|
||||
}
|
||||
],
|
||||
"origins": [
|
||||
{
|
||||
"origin": "https://example.com",
|
||||
"localStorage": [
|
||||
{"name": "token", "value": "my_auth_token"}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
# Provide the storage state as a dictionary to start "already logged in"
|
||||
async with AsyncWebCrawler(
|
||||
headless=True,
|
||||
storage_state=storage_dict
|
||||
) as crawler:
|
||||
result = await crawler.arun("https://example.com/protected")
|
||||
if result.success:
|
||||
print("Protected page content length:", len(result.html))
|
||||
else:
|
||||
print("Failed to crawl protected page")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
### 5.2 Exporting & Reusing State
|
||||
|
||||
You can sign in once, export the browser context, and reuse it later—without re-entering credentials.
|
||||
|
||||
- **`await context.storage_state(path="my_storage.json")`**: Exports cookies, localStorage, etc. to a file.
|
||||
- Provide `storage_state="my_storage.json"` on subsequent runs to skip the login step.
|
||||
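If you build the state dictionary yourself (as in section 5.1) rather than exporting it from a browser context, you can write it to a file with the standard library. This is a sketch under the assumption that the JSON file has the same `cookies`/`origins` layout as the dictionary shown earlier; `save_storage_state` and `load_storage_state` are hypothetical helpers:

```python
import json
import tempfile
from pathlib import Path


def save_storage_state(state: dict, path) -> Path:
    """Write a storage-state dict to a JSON file after a basic shape check."""
    missing = {"cookies", "origins"} - state.keys()
    if missing:
        raise ValueError(f"storage state is missing keys: {sorted(missing)}")
    out = Path(path)
    out.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return out


def load_storage_state(path) -> dict:
    """Read a storage-state JSON file back into a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


state = {
    "cookies": [{"name": "session", "value": "abcd1234", "domain": "example.com", "path": "/"}],
    "origins": [{"origin": "https://example.com", "localStorage": []}],
}
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_storage_state(state, Path(tmp) / "my_storage.json")
    restored = load_storage_state(saved_path)
print(restored == state)
```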
|
||||
**See**: [Detailed session management tutorial](./hooks-custom.md#using-storage_state) or [Explanations → Browser Context & Managed Browser](../../explanations/browser-management.md) for more advanced scenarios (like multi-step logins, or capturing after interactive pages).
|
||||
|
||||
---
|
||||
|
||||
## Putting It All Together
|
||||
|
||||
Here’s a snippet that combines multiple “advanced” features (proxy, PDF, screenshot, SSL, custom headers, and session reuse) into one run. Normally, you’d tailor each setting to your project’s needs.
|
||||
|
||||
```python
|
||||
import os, asyncio
|
||||
from base64 import b64decode
|
||||
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
|
||||
|
||||
async def main():
|
||||
# 1. Browser config with proxy + headless
|
||||
browser_cfg = BrowserConfig(
|
||||
proxy_config={
|
||||
"server": "http://proxy.example.com:8080",
|
||||
"username": "myuser",
|
||||
"password": "mypass",
|
||||
},
|
||||
headless=True,
|
||||
)
|
||||
|
||||
# 2. Crawler config with PDF, screenshot, SSL, custom headers, and ignoring caches
|
||||
crawler_cfg = CrawlerRunConfig(
|
||||
pdf=True,
|
||||
screenshot=True,
|
||||
fetch_ssl_certificate=True,
|
||||
cache_mode=CacheMode.BYPASS,
|
||||
headers={"Accept-Language": "en-US,en;q=0.8"},
|
||||
storage_state="my_storage.json", # Reuse session from a previous sign-in
|
||||
verbose=True,
|
||||
)
|
||||
|
||||
# 3. Crawl
|
||||
async with AsyncWebCrawler(config=browser_cfg) as crawler:
|
||||
result = await crawler.arun("https://secure.example.com/protected", config=crawler_cfg)
|
||||
|
||||
if result.success:
|
||||
print("[OK] Crawled the secure page. Links found:", len(result.links.get("internal", [])))
|
||||
|
||||
# Save PDF & screenshot
|
||||
if result.pdf:
|
||||
with open("result.pdf", "wb") as f:
|
||||
f.write(result.pdf)  # result.pdf is already raw bytes
|
||||
if result.screenshot:
|
||||
with open("result.png", "wb") as f:
|
||||
f.write(b64decode(result.screenshot))
|
||||
|
||||
# Check SSL cert
|
||||
if result.ssl_certificate:
|
||||
print("SSL Issuer CN:", result.ssl_certificate.issuer.get("CN", ""))
|
||||
else:
|
||||
print("[ERROR]", result.error_message)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Conclusion & Next Steps
|
||||
|
||||
You’ve now explored several **advanced** features:
|
||||
|
||||
- **Proxy Usage**
|
||||
- **PDF & Screenshot** capturing for large or critical pages
|
||||
- **SSL Certificate** retrieval & exporting
|
||||
- **Custom Headers** for language or specialized requests
|
||||
- **Session Persistence** via storage state
|
||||
|
||||
**Where to go next**:
|
||||
|
||||
- **[Hooks & Custom Code](./hooks-custom.md)**: For multi-step interactions (clicking “Load More,” performing logins, etc.)
|
||||
- **[Identity-Based Crawling & Anti-Bot](./identity-anti-bot.md)**: If you need more sophisticated user simulation or stealth.
|
||||
- **[Reference → BrowserConfig & CrawlerRunConfig](../../reference/configuration.md)**: Detailed param descriptions for everything you’ve seen here and more.
|
||||
|
||||
With these power tools, you can build robust scraping workflows that mimic real user behavior, handle secure sites, capture detailed snapshots, and manage sessions across multiple runs—streamlining your entire data collection pipeline.
|
||||
|
||||
**Last Updated**: 2024-XX-XX
|
||||
218
docs/md_v3/tutorials/async-webcrawler-basics.md
Normal file
@@ -0,0 +1,218 @@
|
||||
|
||||
|
||||
---
|
||||
|
||||
# AsyncWebCrawler Basics
|
||||
|
||||
In this tutorial, you’ll learn how to:
|
||||
|
||||
1. Create and configure an `AsyncWebCrawler` instance
|
||||
2. Understand the `CrawlResult` object returned by `arun()`
|
||||
3. Use basic `BrowserConfig` and `CrawlerRunConfig` options to tailor your crawl
|
||||
|
||||
> **Prerequisites**
|
||||
> - You’ve already completed the [Getting Started](./getting-started.md) tutorial (or have equivalent knowledge).
|
||||
> - You have **Crawl4AI** installed and configured with Playwright.
|
||||
|
||||
---
|
||||
|
||||
## 1. What is `AsyncWebCrawler`?
|
||||
|
||||
`AsyncWebCrawler` is the central class for running asynchronous crawling operations in Crawl4AI. It manages browser sessions, handles dynamic pages (if needed), and provides you with a structured result object for each crawl. Essentially, it’s your high-level interface for collecting page data.
|
||||
|
||||
```python
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
|
||||
async with AsyncWebCrawler() as crawler:
|
||||
result = await crawler.arun("https://example.com")
|
||||
print(result)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 2. Creating a Basic `AsyncWebCrawler` Instance
|
||||
|
||||
Below is a simple code snippet showing how to create and use `AsyncWebCrawler`. This goes one step beyond the minimal example you saw in [Getting Started](./getting-started.md).
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai import BrowserConfig, CrawlerRunConfig
|
||||
|
||||
async def main():
|
||||
# 1. Set up configuration objects (optional if you want defaults)
|
||||
browser_config = BrowserConfig(
|
||||
browser_type="chromium",
|
||||
headless=True,
|
||||
verbose=True
|
||||
)
|
||||
crawler_config = CrawlerRunConfig(
|
||||
page_timeout=30000, # 30 seconds
|
||||
wait_for_images=True,
|
||||
verbose=True
|
||||
)
|
||||
|
||||
# 2. Initialize AsyncWebCrawler with your chosen browser config
|
||||
async with AsyncWebCrawler(config=browser_config) as crawler:
|
||||
# 3. Run a single crawl
|
||||
url_to_crawl = "https://example.com"
|
||||
result = await crawler.arun(url=url_to_crawl, config=crawler_config)
|
||||
|
||||
# 4. Inspect the result
|
||||
if result.success:
|
||||
print(f"Successfully crawled: {result.url}")
|
||||
print(f"HTML length: {len(result.html)}")
|
||||
print(f"Markdown snippet: {result.markdown[:200]}...")
|
||||
else:
|
||||
print(f"Failed to crawl {result.url}. Error: {result.error_message}")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
### Key Points
|
||||
|
||||
1. **`BrowserConfig`** is optional, but it’s the place to specify browser-related settings (e.g., `headless`, `browser_type`).
|
||||
2. **`CrawlerRunConfig`** deals with how you want the crawler to behave for this particular run (timeouts, waiting for images, etc.).
|
||||
3. **`arun()`** is the main method to crawl a single URL. We’ll see how `arun_many()` works in later tutorials.
|
||||
|
||||
---
|
||||
|
||||
## 3. Understanding `CrawlResult`
|
||||
|
||||
When you call `arun()`, you get back a `CrawlResult` object containing all the relevant data from that crawl attempt. Some common fields include:
|
||||
|
||||
```python
|
||||
class CrawlResult(BaseModel):
|
||||
url: str
|
||||
html: str
|
||||
success: bool
|
||||
cleaned_html: Optional[str] = None
|
||||
media: Dict[str, List[Dict]] = {}
|
||||
links: Dict[str, List[Dict]] = {}
|
||||
screenshot: Optional[str] = None # base64-encoded screenshot if requested
|
||||
pdf: Optional[bytes] = None # binary PDF data if requested
|
||||
markdown: Optional[Union[str, MarkdownGenerationResult]] = None
|
||||
markdown_v2: Optional[MarkdownGenerationResult] = None
|
||||
error_message: Optional[str] = None
|
||||
# ... plus other fields like status_code, ssl_certificate, extracted_content, etc.
|
||||
```
|
||||
|
||||
### Commonly Used Fields
|
||||
|
||||
- **`success`**: `True` if the crawl succeeded, `False` otherwise.
|
||||
- **`html`**: The raw HTML (or final rendered state if JavaScript was executed).
|
||||
- **`markdown` / `markdown_v2`**: Contains the automatically generated Markdown representation of the page.
|
||||
- **`media`**: A dictionary with lists of extracted images, videos, or audio elements.
|
||||
- **`links`**: A dictionary with lists of “internal” and “external” link objects.
|
||||
- **`error_message`**: If `success` is `False`, this often contains a description of the error.
|
||||
|
||||
**Example**:
|
||||
|
||||
```python
|
||||
if result.success:
|
||||
print("HTML snippet:", result.html[:200])
|
||||
if result.markdown:
|
||||
print("Markdown snippet:", result.markdown[:200])
|
||||
print("Links found:", len(result.links.get("internal", [])), "internal links")
|
||||
else:
|
||||
print("Error crawling:", result.error_message)
|
||||
```
|
||||
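To get a quick overview of the `links` field, you can summarize it with plain Python. The sketch below assumes each link object is a dict with an `href` key, which this tutorial does not confirm; `link_summary` is a hypothetical helper and the sample data is invented:

```python
def link_summary(links: dict) -> dict:
    """Count link objects and distinct href values per category."""
    summary = {}
    for kind in ("internal", "external"):
        items = links.get(kind, [])
        # "href" is an assumed key name; objects without it are counted but not deduplicated.
        hrefs = {item.get("href") for item in items if item.get("href")}
        summary[kind] = {"count": len(items), "unique_hrefs": len(hrefs)}
    return summary


sample_links = {
    "internal": [{"href": "/about"}, {"href": "/about"}, {"href": "/contact"}],
    "external": [{"href": "https://example.org"}],
}
print(link_summary(sample_links))
```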
|
||||
---
|
||||
|
||||
## 4. Relevant Basic Parameters
|
||||
|
||||
Below are a few `BrowserConfig` and `CrawlerRunConfig` parameters you might tweak early on. We’ll cover more advanced ones (like proxies, PDF, or screenshots) in later tutorials.
|
||||
|
||||
### 4.1 `BrowserConfig` Essentials
|
||||
|
||||
| Parameter | Description | Default |
|
||||
|--------------------|-----------------------------------------------------------|----------------|
|
||||
| `browser_type` | Which browser engine to use: `"chromium"`, `"firefox"`, `"webkit"` | `"chromium"` |
|
||||
| `headless` | Run the browser with no UI window. If `False`, you see the browser. | `True` |
|
||||
| `verbose` | Print extra logs for debugging. | `True` |
|
||||
| `java_script_enabled` | Toggle JavaScript. When `False`, you might speed up loads but lose dynamic content. | `True` |
|
||||
|
||||
### 4.2 `CrawlerRunConfig` Essentials
|
||||
|
||||
| Parameter | Description | Default |
|
||||
|-----------------------|--------------------------------------------------------------|--------------------|
|
||||
| `page_timeout` | Maximum time in ms to wait for the page to load or scripts. | `30000` (30s) |
|
||||
| `wait_for_images` | Wait for images to fully load. Good for accurate rendering. | `True` |
|
||||
| `css_selector` | Target only certain elements for extraction. | `None` |
|
||||
| `excluded_tags` | Skip certain HTML tags (like `nav`, `footer`, etc.) | `None` |
|
||||
| `verbose` | Print logs for debugging. | `True` |
|
||||
|
||||
> **Tip**: Don’t worry if you see lots of parameters. You’ll learn them gradually in later tutorials.
|
||||
|
||||
---
|
||||
|
||||
## 5. Putting It All Together
|
||||
|
||||
Here’s a slightly more in-depth example that shows off a few key config parameters at once:
|
||||
|
||||
```python
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
from crawl4ai import BrowserConfig, CrawlerRunConfig
|
||||
|
||||
async def main():
|
||||
browser_cfg = BrowserConfig(
|
||||
browser_type="chromium",
|
||||
headless=True,
|
||||
java_script_enabled=True,
|
||||
verbose=False
|
||||
)
|
||||
|
||||
crawler_cfg = CrawlerRunConfig(
|
||||
page_timeout=30000, # wait up to 30 seconds
|
||||
wait_for_images=True,
|
||||
css_selector=".article-body", # only extract content under this CSS selector
|
||||
verbose=True
|
||||
)
|
||||
|
||||
async with AsyncWebCrawler(config=browser_cfg) as crawler:
|
||||
result = await crawler.arun("https://news.example.com", config=crawler_cfg)
|
||||
|
||||
if result.success:
|
||||
print("[OK] Crawled:", result.url)
|
||||
print("HTML length:", len(result.html))
|
||||
print("Extracted Markdown:", result.markdown_v2.raw_markdown[:300])
|
||||
else:
|
||||
print("[ERROR]", result.error_message)
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
```
|
||||
|
||||
**Key Observations**:
|
||||
- `css_selector=".article-body"` ensures we only focus on the main content region.
|
||||
- `page_timeout=30000` helps if the site is slow.
|
||||
- We turned off `verbose` logs for the browser but kept them on for the crawler config.
|
||||
|
||||
---
|
||||
|
||||
## 6. Next Steps
|
||||
|
||||
- **Smart Crawling Techniques**: Learn to handle iframes, advanced caching, and selective extraction in the [next tutorial](./smart-crawling.md).
|
||||
- **Hooks & Custom Code**: See how to inject custom logic before and after navigation in a dedicated [Hooks Tutorial](./hooks-custom.md).
|
||||
- **Reference**: For a complete list of every parameter in `BrowserConfig` and `CrawlerRunConfig`, check out the [Reference section](../../reference/configuration.md).
|
||||
|
||||
---
|
||||
|
||||
## Summary
|
||||
|
||||
You now know the basics of **AsyncWebCrawler**:
|
||||
- How to create it with optional browser/crawler configs
|
||||
- How `arun()` works for single-page crawls
|
||||
- Where to find your crawled data in `CrawlResult`
|
||||
- A handful of frequently used configuration parameters
|
||||
|
||||
From here, you can refine your crawler to handle more advanced scenarios, like focusing on specific content or dealing with dynamic elements. Let’s move on to **[Smart Crawling Techniques](./smart-crawling.md)** to learn how to handle iframes, advanced caching, and more.
|
||||
|
||||
---
|
||||
|
||||
**Last updated**: 2024-XX-XX
|
||||
|
||||
Keep exploring! If you get stuck, remember to check out the [How-To Guides](../../how-to/) for targeted solutions or the [Explanations](../../explanations/) for deeper conceptual background.
|
||||
271
docs/md_v3/tutorials/docker-quickstart.md
Normal file
@@ -0,0 +1,271 @@
|
||||
# Deploying with Docker (Quickstart)
|
||||
|
||||
> **⚠️ WARNING: Experimental & Legacy**
|
||||
> Our current Docker solution for Crawl4AI is **not stable** and **will be discontinued** soon. A more robust Docker/Orchestration strategy is in development, with a planned stable release in **2025**. If you choose to use this Docker approach, please proceed cautiously and avoid production deployment without thorough testing.
|
||||
|
||||
Crawl4AI is **open-source** and under **active development**. We appreciate your interest, but strongly recommend you make **informed decisions** if you need a production environment. Expect breaking changes in future versions.
|
||||
|
||||
---
|
||||
|
||||
## 1. Installation & Environment Setup (Outside Docker)
|
||||
|
||||
Before we jump into Docker usage, here’s a quick reminder (carried over from the legacy docs) of how to install Crawl4AI locally for **non-Docker** deployments or local development:
|
||||
|
||||
```bash
|
||||
# 1. Install the package
|
||||
pip install crawl4ai
|
||||
crawl4ai-setup
|
||||
|
||||
# 2. Install playwright dependencies (all browsers or specific ones)
|
||||
playwright install --with-deps
|
||||
# or
|
||||
playwright install --with-deps chromium
|
||||
# or
|
||||
playwright install --with-deps chrome
|
||||
```
|
||||
|
||||
**Testing** your installation:
|
||||
|
||||
```bash
|
||||
# Visible browser test
|
||||
python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); browser = p.chromium.launch(headless=False); page = browser.new_page(); page.goto('https://example.com'); input('Press Enter to close...')"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 2. Docker Overview
|
||||
|
||||
This Docker approach allows you to run a **Crawl4AI** service via REST API. You can:
|
||||
|
||||
1. **POST** a request (e.g., URLs, extraction config)
|
||||
2. **Retrieve** your results from a task-based endpoint
|
||||
|
||||
> **Note**: This Docker solution is **temporary**. We plan a more robust, stable Docker approach in the near future. For now, you can experiment, but do not rely on it for mission-critical production.
|
||||
|
||||
---
|
||||
|
||||
## 3. Pulling and Running the Image
|
||||
|
||||
### Basic Run
|
||||
|
||||
```bash
|
||||
docker pull unclecode/crawl4ai:basic
|
||||
docker run -p 11235:11235 unclecode/crawl4ai:basic
|
||||
```
|
||||
|
||||
This starts a container on port `11235`. You can `POST` requests to `http://localhost:11235/crawl`.
|
||||
|
||||
### Using an API Token
|
||||
|
||||
```bash
|
||||
docker run -p 11235:11235 \
|
||||
-e CRAWL4AI_API_TOKEN=your_secret_token \
|
||||
unclecode/crawl4ai:basic
|
||||
```
|
||||
|
||||
If **`CRAWL4AI_API_TOKEN`** is set, you must include `Authorization: Bearer <token>` in your requests. Otherwise, the service is open to anyone.
|
||||
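On the client side, one way to keep the token out of your source code is to read it from the same environment variable. A minimal sketch with a hypothetical helper, `auth_headers`, which returns an empty dict when no token is configured:

```python
import os


def auth_headers(token=None) -> dict:
    """Build the Authorization header, falling back to CRAWL4AI_API_TOKEN in the environment."""
    if token is None:
        token = os.environ.get("CRAWL4AI_API_TOKEN", "")
    return {"Authorization": f"Bearer {token}"} if token else {}


print(auth_headers("your_secret_token"))
```

The returned dict can be passed as `headers=` to an HTTP client such as `requests`.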
|
||||
---
|
||||
|
||||
## 4. Docker Compose for Multi-Container Workflows
|
||||
|
||||
You can also use **Docker Compose** to manage multiple services. Below is an **experimental** snippet:
|
||||
|
||||
```yaml
|
||||
version: '3.8'
|
||||
|
||||
services:
|
||||
crawl4ai:
|
||||
image: unclecode/crawl4ai:basic
|
||||
ports:
|
||||
- "11235:11235"
|
||||
environment:
|
||||
- CRAWL4AI_API_TOKEN=${CRAWL4AI_API_TOKEN:-}
|
||||
- OPENAI_API_KEY=${OPENAI_API_KEY:-}
|
||||
# Additional env variables as needed
|
||||
volumes:
|
||||
- /dev/shm:/dev/shm
|
||||
```
|
||||
|
||||
To run:
|
||||
|
||||
```bash
|
||||
docker-compose up -d
|
||||
```
|
||||
|
||||
And to stop:
|
||||
|
||||
```bash
|
||||
docker-compose down
|
||||
```
|
||||
|
||||
**Troubleshooting**:
|
||||
|
||||
- **Check logs**: `docker-compose logs -f crawl4ai`
|
||||
- **Remove orphan containers**: `docker-compose down --remove-orphans`
|
||||
- **Remove networks**: `docker network rm <network_name>`
|
||||
|
||||
---
|
||||
|
||||
## 5. Making Requests to the Container
|
||||
|
||||
**Base URL**: `http://localhost:11235`
|
||||
|
||||
### Example: Basic Crawl
|
||||
|
||||
```python
|
||||
import requests
|
||||
|
||||
task_request = {
|
||||
"urls": "https://example.com",
|
||||
"priority": 10
|
||||
}
|
||||
|
||||
response = requests.post("http://localhost:11235/crawl", json=task_request)
|
||||
task_id = response.json()["task_id"]
|
||||
|
||||
# Check the task status (a single request; repeat it until the task finishes)
|
||||
status_url = f"http://localhost:11235/task/{task_id}"
|
||||
status = requests.get(status_url).json()
|
||||
print(status)
|
||||
```
|
||||
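Because crawl tasks run asynchronously, a single status request may return before the task has finished. The sketch below shows one way to poll until a terminal state is reached. It assumes the status payload has a `status` field and that `"failed"` is a possible terminal value alongside `"completed"`; `wait_for_task` and the fake responses are hypothetical and only illustrate the loop:

```python
import time


def wait_for_task(fetch_status, timeout=60.0, interval=1.0):
    """Call fetch_status() until the task is finished or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        # "failed" is an assumed terminal state; adjust to what your service returns.
        if status.get("status") in ("completed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("task did not finish before the timeout")
        time.sleep(interval)


# Fake responses stand in for repeated GET requests to /task/{task_id}.
fake_responses = iter([
    {"status": "pending"},
    {"status": "processing"},
    {"status": "completed", "result": {}},
])
final_status = wait_for_task(lambda: next(fake_responses), interval=0)
print(final_status["status"])
```

With `requests`, `fetch_status` could be `lambda: requests.get(status_url).json()`.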
|
||||
If you used an API token, do:
|
||||
|
||||
```python
|
||||
headers = {"Authorization": "Bearer your_secret_token"}
|
||||
response = requests.post(
|
||||
"http://localhost:11235/crawl",
|
||||
headers=headers,
|
||||
json=task_request
|
||||
)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 6. Docker + New Crawler Config Approach
|
||||
|
||||
### Using `BrowserConfig` & `CrawlerRunConfig` in Requests
|
||||
|
||||
The Docker-based service can accept **crawler configurations** in the request JSON. Legacy docs may show these as direct top-level parameters; to align with the new approach, embed them in `crawler_params` or `extra` instead. For example:
|
||||
|
||||
```python
|
||||
import requests
|
||||
|
||||
request_data = {
|
||||
"urls": "https://www.nbcnews.com/business",
|
||||
"crawler_params": {
|
||||
"headless": True,
|
||||
"browser_type": "chromium",
|
||||
"verbose": True,
|
||||
"page_timeout": 30000,
|
||||
# ... any other BrowserConfig-like fields
|
||||
},
|
||||
"extra": {
|
||||
"word_count_threshold": 50,
|
||||
"bypass_cache": True
|
||||
}
|
||||
}
|
||||
|
||||
response = requests.post("http://localhost:11235/crawl", json=request_data)
|
||||
task_id = response.json()["task_id"]
|
||||
```
|
||||
|
||||
This is the recommended style if you want to replicate `BrowserConfig` and `CrawlerRunConfig` settings in Docker mode.
|
||||
|
||||
---
|
||||
|
||||
## 7. Example: JSON Extraction in Docker
|
||||
|
||||
```python
|
||||
import requests
|
||||
import json
|
||||
|
||||
# Define a schema for CSS extraction
|
||||
schema = {
|
||||
"name": "Coinbase Crypto Prices",
|
||||
"baseSelector": ".cds-tableRow-t45thuk",
|
||||
"fields": [
|
||||
{
|
||||
"name": "crypto",
|
||||
"selector": "td:nth-child(1) h2",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "symbol",
|
||||
"selector": "td:nth-child(1) p",
|
||||
"type": "text"
|
||||
},
|
||||
{
|
||||
"name": "price",
|
||||
"selector": "td:nth-child(2)",
|
||||
"type": "text"
|
||||
}
|
||||
]
|
||||
}
|
||||
|
||||
request_data = {
|
||||
"urls": "https://www.coinbase.com/explore",
|
||||
"extraction_config": {
|
||||
"type": "json_css",
|
||||
"params": {"schema": schema}
|
||||
},
|
||||
"crawler_params": {
|
||||
"headless": True,
|
||||
"verbose": True
|
||||
}
|
||||
}
|
||||
|
||||
resp = requests.post("http://localhost:11235/crawl", json=request_data)
|
||||
task_id = resp.json()["task_id"]
|
||||
|
||||
# Check the task status (a single request; the task may still be running)
|
||||
status = requests.get(f"http://localhost:11235/task/{task_id}").json()
|
||||
if status["status"] == "completed":
|
||||
extracted_content = status["result"]["extracted_content"]
|
||||
data = json.loads(extracted_content)
|
||||
print("Extracted:", len(data), "entries")
|
||||
else:
|
||||
print("Task still in progress or failed.")
|
||||
```
|
||||
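The `extracted_content` field is a JSON string, so it is worth guarding against unfinished tasks and empty results before parsing it. A minimal sketch, assuming the payload layout used in the example above; `extracted_entries` is a hypothetical helper and the sample payload is invented:

```python
import json


def extracted_entries(status: dict) -> list:
    """Return parsed extraction results, or an empty list if none are available yet."""
    if status.get("status") != "completed":
        return []
    raw = (status.get("result") or {}).get("extracted_content")
    if not raw:
        return []
    data = json.loads(raw)
    # Normalize a single object to a one-item list for uniform handling.
    return data if isinstance(data, list) else [data]


sample_status = {
    "status": "completed",
    "result": {"extracted_content": json.dumps([{"crypto": "Bitcoin", "symbol": "BTC"}])},
}
print(len(extracted_entries(sample_status)))
```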
|
||||
---
|
||||
|
||||
## 8. Why This Docker Is Temporary
|
||||
|
||||
**We are building a new, stable approach**:
|
||||
|
||||
- The current Docker container is **experimental** and might break with future releases.
|
||||
- We plan a stable release in **2025** with a more robust API, versioning, and orchestration.
|
||||
- If you use this Docker in production, do so at your own risk and be prepared for **breaking changes**.
|
||||
|
||||
**Community**: Because Crawl4AI is open-source, you can track progress or contribute to the new Docker approach. Check the [GitHub repository](https://github.com/unclecode/crawl4ai) for roadmaps and updates.
|
||||
|
||||
---
|
||||
|
||||
## 9. Known Limitations & Next Steps
|
||||
|
||||
1. **Not Production-Ready**: This Docker approach lacks extensive security, logging, or advanced config for large-scale usage.
|
||||
2. **Ongoing Changes**: Expect API changes. The official stable version is targeted for **2025**.
|
||||
3. **LLM Integrations**: Docker images are big if you want GPU or multiple model providers. We might unify these in a future build.
|
||||
4. **Performance**: For concurrency or large crawls, you may need to tune resources (memory, CPU) and watch out for ephemeral storage.
|
||||
5. **Version Pinning**: If you must deploy, pin your Docker tag to a specific version (e.g., `:basic-0.3.7`) to avoid surprise updates.
|
||||
|
||||
### Next Steps
|
||||
|
||||
- **Watch the Repository**: For announcements on the new Docker architecture.
|
||||
- **Experiment**: Use this Docker for test or dev environments, but keep an eye out for breakage.
|
||||
- **Contribute**: If you have ideas or improvements, open a PR or discussion.
|
||||
- **Check Roadmaps**: See our [GitHub issues](https://github.com/unclecode/crawl4ai/issues) or [Roadmap doc](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md) to find upcoming releases.
|
||||
|
||||
---
|
||||
|
||||
## 10. Summary
|
||||
|
||||
**Deploying with Docker** can simplify running Crawl4AI as a service. However:
|
||||
|
||||
- **This Docker** approach is **legacy** and subject to removal/overhaul.
|
||||
- For production, please weigh the risks carefully.
|
||||
- A detailed “new Docker approach” is planned for **2025**.
|
||||
|
||||
We hope this guide helps you do a quick spin-up of Crawl4AI in Docker for **experimental** usage. Stay tuned for the fully-supported version!
|
||||
265
docs/md_v3/tutorials/getting-started.md
Normal file
@@ -0,0 +1,265 @@
|
||||
# Getting Started with Crawl4AI
|
||||
|
||||
Welcome to **Crawl4AI**, an open-source, LLM-friendly web crawler and scraper. In this tutorial, you’ll:
|
||||
|
||||
1. **Install** Crawl4AI (both via pip and Docker, with notes on platform challenges).
|
||||
2. Run your **first crawl** using minimal configuration.
|
||||
3. Generate **Markdown** output (and learn how it’s influenced by content filters).
|
||||
4. Experiment with a simple **CSS-based extraction** strategy.
|
||||
5. See a glimpse of **LLM-based extraction** (including open-source and closed-source model options).
|
||||
|
||||
---
|
||||
|
||||
## 1. Introduction
|
||||
|
||||
Crawl4AI provides:
|
||||
- An asynchronous crawler, **`AsyncWebCrawler`**.
|
||||
- Configurable browser and run settings via **`BrowserConfig`** and **`CrawlerRunConfig`**.
|
||||
- Automatic HTML-to-Markdown conversion via **`DefaultMarkdownGenerator`** (supports additional filters).
|
||||
- Multiple extraction strategies (LLM-based or “traditional” CSS/XPath-based).
|
||||
|
||||
By the end of this guide, you’ll have installed Crawl4AI, performed a basic crawl, generated Markdown, and tried out two extraction strategies.
|
||||
|
||||
---
|
||||
|
||||
## 2. Installation
|
||||
|
||||
### 2.1 Python + Playwright
|
||||
|
||||
#### Basic Pip Installation
|
||||
|
||||
```bash
|
||||
pip install crawl4ai
|
||||
crawl4ai-setup
|
||||
playwright install --with-deps
|
||||
```
|
||||
|
||||
- **`crawl4ai-setup`** installs and configures Playwright (Chromium by default).
|
||||
|
||||
We cover advanced installation and Docker in the [Installation](#installation) section.
|
||||
|
||||
---
|
||||
|
||||
## 3. Your First Crawl
|
||||
|
||||
Here’s a minimal Python script that creates an **`AsyncWebCrawler`**, fetches a webpage, and prints the first 300 characters of its Markdown output:
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com")
        print(result.markdown[:300])  # Print first 300 chars

if __name__ == "__main__":
    asyncio.run(main())
```
|
||||
|
||||
**What’s happening?**
|
||||
- **`AsyncWebCrawler`** launches a headless browser (Chromium by default).
|
||||
- It fetches `https://example.com`.
|
||||
- Crawl4AI automatically converts the HTML into Markdown.
|
||||
|
||||
You now have a simple, working crawl!
|
||||
|
||||
---
|
||||
|
||||
## 4. Basic Configuration (Light Introduction)
|
||||
|
||||
Crawl4AI’s crawler can be heavily customized using two main classes:
|
||||
|
||||
1. **`BrowserConfig`**: Controls browser behavior (headless or full UI, user agent, JavaScript toggles, etc.).
|
||||
2. **`CrawlerRunConfig`**: Controls how each crawl runs (caching, extraction, timeouts, hooking, etc.).
|
||||
|
||||
Below is an example with minimal usage:
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def main():
    browser_conf = BrowserConfig(headless=True)  # or False to see the browser
    # Use the CacheMode enum (as in the later tutorials) rather than a bare string
    run_conf = CrawlerRunConfig(cache_mode=CacheMode.BYPASS)

    async with AsyncWebCrawler(config=browser_conf) as crawler:
        result = await crawler.arun(
            url="https://example.com",
            config=run_conf
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```
|
||||
|
||||
We’ll explore more advanced config in later tutorials (like enabling proxies, PDF output, multi-tab sessions, etc.). For now, just note how you pass these objects to manage crawling.
|
||||
|
||||
---
|
||||
|
||||
## 5. Generating Markdown Output
|
||||
|
||||
By default, Crawl4AI automatically generates Markdown from each crawled page. However, the exact output depends on whether you specify a **markdown generator** or **content filter**.
|
||||
|
||||
- **`result.markdown`**:
|
||||
The direct HTML-to-Markdown conversion.
|
||||
- **`result.markdown.fit_markdown`**:
|
||||
The same content after applying any configured **content filter** (e.g., `PruningContentFilter`).
|
||||
|
||||
### Example: Using a Filter with `DefaultMarkdownGenerator`
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

md_generator = DefaultMarkdownGenerator(
    content_filter=PruningContentFilter(threshold=0.4, threshold_type="fixed")
)

config = CrawlerRunConfig(markdown_generator=md_generator)

async def main():
    # 'async with' and 'await' must run inside a coroutine
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://news.ycombinator.com", config=config)
        print("Raw Markdown length:", len(result.markdown.raw_markdown))
        print("Fit Markdown length:", len(result.markdown.fit_markdown))

if __name__ == "__main__":
    asyncio.run(main())
```
|
||||
|
||||
**Note**: If you do **not** specify a content filter or markdown generator, you’ll typically see only the raw Markdown. We’ll dive deeper into these strategies in a dedicated **Markdown Generation** tutorial.
|
||||
|
||||
---
|
||||
|
||||
## 6. Simple Data Extraction (CSS-based)
|
||||
|
||||
Crawl4AI can also extract structured data (JSON) using CSS or XPath selectors. Below is a minimal CSS-based example:
|
||||
|
||||
```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def main():
    schema = {
        "name": "Example Items",
        "baseSelector": "div.item",
        "fields": [
            {"name": "title", "selector": "h2", "type": "text"},
            {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"}
        ]
    }

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/items",
            config=CrawlerRunConfig(
                extraction_strategy=JsonCssExtractionStrategy(schema)
            )
        )
        # The JSON output is stored in 'extracted_content'
        data = json.loads(result.extracted_content)
        print(data)

if __name__ == "__main__":
    asyncio.run(main())
```
|
||||
|
||||
**Why is this helpful?**
|
||||
- Great for repetitive page structures (e.g., item listings, articles).
|
||||
- No AI usage or costs.
|
||||
- The crawler returns a JSON string you can parse or store.
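
To make "parse or store" concrete, here is a minimal standard-library sketch that loads such a JSON string and writes it out as CSV text. The `extracted_content` value below is an invented stand-in shaped like the `title`/`link` schema above, not real crawler output.

```python
import csv
import io
import json

# Hypothetical payload shaped like the schema above (title + link per item).
extracted_content = json.dumps([
    {"title": "First item", "link": "/items/1"},
    {"title": "Second item", "link": "/items/2"},
])

# Parse the JSON string into a list of dicts.
rows = json.loads(extracted_content)

# Write the rows as CSV into an in-memory buffer (swap in a file handle to persist).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "link"], lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

print(csv_text)
```

In a real run you would pass `result.extracted_content` instead of the sample string, and the field names would follow your own schema.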
|
||||
|
||||
---
|
||||
|
||||
## 7. Simple Data Extraction (LLM-based)
|
||||
|
||||
For more complex or irregular pages, a language model can parse text intelligently into a structure you define. Crawl4AI supports **open-source** or **closed-source** providers:
|
||||
|
||||
- **Open-Source Models** (e.g., `ollama/llama3.3`, with `api_token="no_token"`)
|
||||
- **OpenAI Models** (e.g., `openai/gpt-4`, requires `api_token`)
|
||||
- Or any provider supported by the underlying library
|
||||
|
||||
Below is an example using **open-source** style (no token) and closed-source:
|
||||
|
||||
```python
import os
import json
import asyncio
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class PricingInfo(BaseModel):
    model_name: str = Field(..., description="Name of the AI model")
    input_fee: str = Field(..., description="Fee for input tokens")
    output_fee: str = Field(..., description="Fee for output tokens")

async def main():
    # 1) Open-Source usage: no token required
    llm_strategy_open_source = LLMExtractionStrategy(
        provider="ollama/llama3.3",  # or "any-other-local-model"
        api_token="no_token",  # for local models, no API key is typically required
        schema=PricingInfo.schema(),
        extraction_type="schema",
        instruction="""
        From this page, extract all AI model pricing details in JSON format.
        Each entry should have 'model_name', 'input_fee', and 'output_fee'.
        """,
        temperature=0
    )

    # 2) Closed-Source usage: API key for OpenAI, for example
    openai_token = os.getenv("OPENAI_API_KEY", "sk-YOUR_API_KEY")
    llm_strategy_openai = LLMExtractionStrategy(
        provider="openai/gpt-4",
        api_token=openai_token,
        schema=PricingInfo.schema(),
        extraction_type="schema",
        instruction="""
        From this page, extract all AI model pricing details in JSON format.
        Each entry should have 'model_name', 'input_fee', and 'output_fee'.
        """,
        temperature=0
    )

    # We'll demo the open-source approach here
    config = CrawlerRunConfig(extraction_strategy=llm_strategy_open_source)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/pricing",
            config=config
        )
        print("LLM-based extraction JSON:", result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
```
|
||||
|
||||
**What’s happening?**
|
||||
- We define a Pydantic schema (`PricingInfo`) describing the fields we want.
|
||||
- The LLM extraction strategy uses that schema and your instructions to transform raw text into structured JSON.
|
||||
- Depending on the **provider** and **api_token**, you can use local models or a remote API.
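
Because LLM output can vary, it may be worth checking the returned JSON before using it. The sketch below uses only the standard library and assumes the extracted content is a JSON list (or a single object) whose entries carry the three `PricingInfo` fields. The sample payload is invented for illustration, and the real shape of `result.extracted_content` may differ.

```python
import json

# Field names taken from the PricingInfo model above.
REQUIRED_FIELDS = ("model_name", "input_fee", "output_fee")

def validate_entries(extracted_content: str):
    """Split parsed entries into (valid, rejected) based on required string fields."""
    parsed = json.loads(extracted_content)
    if isinstance(parsed, dict):
        parsed = [parsed]  # tolerate a single object instead of a list
    valid, rejected = [], []
    for entry in parsed:
        ok = isinstance(entry, dict) and all(
            isinstance(entry.get(name), str) and entry[name]
            for name in REQUIRED_FIELDS
        )
        (valid if ok else rejected).append(entry)
    return valid, rejected

# Hypothetical LLM output: one complete entry, one missing 'output_fee'.
sample = json.dumps([
    {"model_name": "example-model", "input_fee": "$1 / 1M tokens", "output_fee": "$2 / 1M tokens"},
    {"model_name": "incomplete-model", "input_fee": "$3 / 1M tokens"},
])

valid, rejected = validate_entries(sample)
print(len(valid), "valid;", len(rejected), "rejected")
```

Rejected entries can be logged or retried rather than silently dropped.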
|
||||
|
||||
---
|
||||
|
||||
## 8. Next Steps
|
||||
|
||||
Congratulations! You have:
|
||||
1. Installed Crawl4AI (via pip, with Docker as an option).
|
||||
2. Performed a simple crawl and printed Markdown.
|
||||
3. Seen how adding a **markdown generator** + **content filter** can produce “fit” Markdown.
|
||||
4. Experimented with **CSS-based** extraction for repetitive data.
|
||||
5. Learned the basics of **LLM-based** extraction (open-source and closed-source).
|
||||
|
||||
If you are ready for more, check out:
|
||||
|
||||
- **Installation**: Learn more on how to install Crawl4AI and set up Playwright.
|
||||
- **Focus on Configuration**: Learn to customize browser settings, caching modes, advanced timeouts, etc.
|
||||
- **Markdown Generation Basics**: Dive deeper into content filtering and “fit markdown” usage.
|
||||
- **Dynamic Pages & Hooks**: Tackle sites with “Load More” buttons, login forms, or JavaScript complexities.
|
||||
- **Deployment**: Run Crawl4AI in Docker containers and scale across multiple nodes.
|
||||
- **Explanations & How-To Guides**: Explore browser contexts, identity-based crawling, hooking, performance, and more.
|
||||
|
||||
Crawl4AI is a powerful tool for extracting data and generating Markdown from virtually any website. Enjoy exploring, and we hope you build amazing AI-powered applications with it!
|
||||
335
docs/md_v3/tutorials/hooks-custom.md
Normal file
@@ -0,0 +1,335 @@
|
||||
# Hooks & Custom Code
|
||||
|
||||
Crawl4AI supports a **hook** system that lets you run your own Python code at specific points in the crawling pipeline. By injecting logic into these hooks, you can automate tasks like:
|
||||
|
||||
- **Authentication** (log in before navigating)
|
||||
- **Content manipulation** (modify HTML, inject scripts, etc.)
|
||||
- **Session or browser configuration** (e.g., adjusting user agents, local storage)
|
||||
- **Custom data collection** (scrape extra details or track state at each stage)
|
||||
|
||||
In this tutorial, you’ll learn about:
|
||||
|
||||
1. What hooks are available
|
||||
2. How to attach code to each hook
|
||||
3. Practical examples (auth flows, user agent changes, content manipulation, etc.)
|
||||
|
||||
> **Prerequisites**
|
||||
> - Familiar with [AsyncWebCrawler Basics](./async-webcrawler-basics.md).
|
||||
> - Comfortable with Python async/await.
|
||||
|
||||
---
|
||||
|
||||
## 1. Overview of Available Hooks
|
||||
|
||||
| Hook Name | Called When / Purpose | Context / Objects Provided |
|--------------------------|-----------------------------------------------------------------|-----------------------------------------------------|
| **`on_browser_created`** | Immediately after the browser is launched, but **before** any page or context is created. | **Browser** object only (no `page` yet). Use it for broad browser-level config. |
| **`on_page_context_created`** | Right after a new page context is created. Perfect for setting default timeouts, injecting scripts, etc. | Typically provides `page` and `context`. |
| **`on_user_agent_updated`** | Whenever the user agent changes. For advanced user agent logic or additional header updates. | Typically provides `page` and updated user agent string. |
| **`on_execution_started`** | Right before your main crawling logic runs (before rendering the page). Good for one-time setup or variable initialization. | Typically provides `page`, possibly `context`. |
| **`before_goto`** | Right before navigating to the URL (i.e., `page.goto(...)`). Great for setting cookies, altering the URL, or hooking in authentication steps. | Typically provides `page`, `context`, and `goto_params`. |
| **`after_goto`** | Immediately after navigation completes, but before scraping. For post-login checks or initial content adjustments. | Typically provides `page`, `context`, `response`. |
| **`before_retrieve_html`** | Right before retrieving or finalizing the page’s HTML content. Good for in-page manipulation (e.g., removing ads or disclaimers). | Typically provides `page` or final HTML reference. |
| **`before_return_html`** | Just before the HTML is returned to the crawler pipeline. Last chance to alter or sanitize content. | Typically provides final HTML or a `page`. |
|
||||
|
||||
### A Note on `on_browser_created` (the “unbrowser” hook)
|
||||
- **No `page`** object is available because no page context exists yet. You can, however, set up browser-wide properties.
|
||||
- For example, you might control [CDP sessions][cdp] or advanced browser flags here.
|
||||
|
||||
---
|
||||
|
||||
## 2. Registering Hooks
|
||||
|
||||
You can attach hooks by calling:
|
||||
|
||||
```python
|
||||
crawler.crawler_strategy.set_hook("hook_name", your_hook_function)
|
||||
```
|
||||
|
||||
or by passing a `hooks` dictionary to `AsyncWebCrawler` or your strategy constructor:
|
||||
|
||||
```python
hooks = {
    "before_goto": my_before_goto_hook,
    "after_goto": my_after_goto_hook,
    # ... etc.
}
async with AsyncWebCrawler(hooks=hooks) as crawler:
    ...
```
|
||||
|
||||
### Hook Signature
|
||||
|
||||
Each hook is a function (async or sync, depending on your usage) that receives **certain parameters**—most often `page`, `context`, or custom arguments relevant to that stage. The library then awaits or calls your hook before continuing.
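
To illustrate the "awaits or calls" behavior, here is a rough standard-library sketch of a dispatcher that accepts both sync and async hooks. It is not Crawl4AI's actual implementation: the `hooks` registry, `set_hook`, and `run_hook` below are hypothetical stand-ins written for this example.

```python
import asyncio
import inspect

# Hypothetical registry: hook name -> callable (not Crawl4AI's internal structure).
hooks = {}

def set_hook(name, func):
    """Register a callable under a hook name."""
    hooks[name] = func

async def run_hook(name, *args, **kwargs):
    """Call the hook if registered; await the result only when it is awaitable."""
    func = hooks.get(name)
    if func is None:
        return None
    outcome = func(*args, **kwargs)
    if inspect.isawaitable(outcome):
        outcome = await outcome
    return outcome

def sync_hook(url, **kwargs):
    return f"sync saw {url}"

async def async_hook(url, **kwargs):
    await asyncio.sleep(0)  # stand-in for real async work
    return f"async saw {url}"

async def demo():
    set_hook("before_goto", sync_hook)
    first = await run_hook("before_goto", "https://example.com")
    set_hook("before_goto", async_hook)
    second = await run_hook("before_goto", "https://example.com")
    missing = await run_hook("after_goto", "https://example.com")  # nothing registered
    return first, second, missing

results = asyncio.run(demo())
print(results)
```

Accepting `**kwargs` in each hook, as the examples in this tutorial do, keeps a hook working if the caller passes extra arguments.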
|
||||
|
||||
---
|
||||
|
||||
## 3. Real-Life Examples
|
||||
|
||||
Below are concrete scenarios where hooks come in handy.
|
||||
|
||||
---
|
||||
|
||||
### 3.1 Authentication Before Navigation
|
||||
|
||||
One of the most frequent tasks is logging in or applying authentication **before** the crawler navigates to a URL (so that the user is recognized immediately).
|
||||
|
||||
#### Using `before_goto`
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def before_goto_auth_hook(page, context, goto_params, **kwargs):
    """
    Example: Set cookies or localStorage to simulate login.
    This hook runs right before page.goto() is called.
    """
    # Example: Insert cookie-based auth or local storage data
    # (You could also do more complex actions, like fill forms if you already have a 'page' open.)
    print("[HOOK] Setting auth data before goto.")
    await context.add_cookies([
        {
            "name": "session",
            "value": "abcd1234",
            "domain": "example.com",
            "path": "/"
        }
    ])
    # Optionally manipulate goto_params if needed:
    # goto_params["url"] = goto_params["url"] + "?debug=1"

async def main():
    hooks = {
        "before_goto": before_goto_auth_hook
    }

    browser_cfg = BrowserConfig(headless=True)
    crawler_cfg = CrawlerRunConfig()

    async with AsyncWebCrawler(config=browser_cfg, hooks=hooks) as crawler:
        result = await crawler.arun(url="https://example.com/protected", config=crawler_cfg)
        if result.success:
            print("[OK] Logged in and fetched protected page.")
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
|
||||
|
||||
**Key Points**
|
||||
- `before_goto` receives `page`, `context`, `goto_params` so you can add cookies, localStorage, or even change the URL itself.
|
||||
- If you need to run a real login flow (submitting forms) only once, consider doing it at the start, in `on_browser_created` or `on_page_context_created`.
|
||||
|
||||
---
|
||||
|
||||
### 3.2 Setting Up the Browser in `on_browser_created`
|
||||
|
||||
If you need to do advanced browser-level configuration (e.g., hooking into the Chrome DevTools Protocol, adjusting command-line flags, etc.), you’ll use `on_browser_created`. No `page` is available yet, but you can set up the **browser** instance itself.
|
||||
|
||||
```python
async def on_browser_created_hook(browser, **kwargs):
    """
    Runs immediately after the browser is created, before any pages.
    'browser' here is a Playwright Browser object.
    """
    print("[HOOK] Browser created. Setting up custom stuff.")
    # Possibly connect to DevTools or create an incognito context
    # Example (sketch only):
    # context = await browser.new_context()

# Usage (inside an async function):
async with AsyncWebCrawler(hooks={"on_browser_created": on_browser_created_hook}) as crawler:
    ...
```
|
||||
|
||||
---
|
||||
|
||||
### 3.3 Adjusting Page or Context in `on_page_context_created`
|
||||
|
||||
If you’d like to set default timeouts or inject scripts right after a page context is spun up:
|
||||
|
||||
```python
async def on_page_context_created_hook(page, context, **kwargs):
    print("[HOOK] Page context created. Setting default timeouts or scripts.")
    # set_default_timeout is a plain (non-awaitable) call in Playwright's async API
    page.set_default_timeout(20000)  # 20 seconds
    # Possibly inject a script or set user locale

# Usage:
hooks = {
    "on_page_context_created": on_page_context_created_hook
}
```
|
||||
|
||||
---
|
||||
|
||||
### 3.4 Dynamically Updating User Agents
|
||||
|
||||
`on_user_agent_updated` is fired whenever the strategy updates the user agent. For instance, you might want to set certain cookies or console-log changes for debugging:
|
||||
|
||||
```python
async def on_user_agent_updated_hook(page, context, new_ua, **kwargs):
    print(f"[HOOK] User agent updated to {new_ua}")
    # Maybe add a custom header based on new UA
    await context.set_extra_http_headers({"X-UA-Source": new_ua})

hooks = {
    "on_user_agent_updated": on_user_agent_updated_hook
}
```
|
||||
|
||||
---
|
||||
|
||||
### 3.5 Initializing Stuff with `on_execution_started`
|
||||
|
||||
`on_execution_started` runs before your main crawling logic. It’s a good place for short, one-time setup tasks (like clearing old caches, or storing a timestamp).
|
||||
|
||||
```python
async def on_execution_started_hook(page, context, **kwargs):
    print("[HOOK] Execution started. Setting a start timestamp or logging.")
    context.set_default_navigation_timeout(45000)  # 45s if your site is slow

hooks = {
    "on_execution_started": on_execution_started_hook
}
```
|
||||
|
||||
---
|
||||
|
||||
### 3.6 Post-Processing with `after_goto`
|
||||
|
||||
After the crawler finishes navigating (i.e., the page has presumably loaded), you can do additional checks or manipulations—like verifying you’re on the right page, or removing interstitials:
|
||||
|
||||
```python
async def after_goto_hook(page, context, response, **kwargs):
    """
    Called right after page.goto() finishes, but before the crawler extracts HTML.
    """
    if response and response.ok:
        print("[HOOK] After goto. Status:", response.status)
        # Maybe remove popups or check if we landed on a login failure page.
        await page.evaluate("""() => {
            const popup = document.querySelector(".annoying-popup");
            if (popup) popup.remove();
        }""")
    else:
        print("[HOOK] Navigation might have failed, status not ok or no response.")

hooks = {
    "after_goto": after_goto_hook
}
```
|
||||
|
||||
---
|
||||
|
||||
### 3.7 Last-Minute Modifications in `before_retrieve_html` or `before_return_html`
|
||||
|
||||
Sometimes you need to tweak the page or raw HTML right before it’s captured.
|
||||
|
||||
```python
async def before_retrieve_html_hook(page, context, **kwargs):
    """
    Modify the DOM just before the crawler finalizes the HTML.
    """
    print("[HOOK] Removing adverts before capturing HTML.")
    await page.evaluate("""() => {
        const ads = document.querySelectorAll(".ad-banner");
        ads.forEach(ad => ad.remove());
    }""")

async def before_return_html_hook(page, context, html, **kwargs):
    """
    'html' is the near-finished HTML string. Return an updated string if you like.
    """
    # For example, remove personal data or certain tags from the final text
    print("[HOOK] Sanitizing final HTML.")
    sanitized_html = html.replace("PersonalInfo:", "[REDACTED]")
    return sanitized_html

hooks = {
    "before_retrieve_html": before_retrieve_html_hook,
    "before_return_html": before_return_html_hook
}
```
|
||||
|
||||
**Note**: If you want to make last-second changes in `before_return_html`, you can manipulate the `html` string directly. Return a new string if you want to override.
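
As a rough illustration of string-level sanitizing, the sketch below redacts email addresses with a regular expression. The `sanitize_html` helper and the sample markup are hypothetical, and the pattern is deliberately simple rather than a complete email matcher.

```python
import re

# Simple (not exhaustive) email pattern, used for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def sanitize_html(html: str) -> str:
    """Return a copy of the HTML string with email addresses replaced by a placeholder."""
    return EMAIL_RE.sub("[REDACTED]", html)

sample = "<p>Contact: jane.doe@example.com or admin@example.org</p>"
cleaned = sanitize_html(sample)
print(cleaned)
```

A helper like this could be called from inside a `before_return_html` hook, which would then return the cleaned string.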
|
||||
|
||||
---
|
||||
|
||||
## 4. Putting It All Together
|
||||
|
||||
You can combine multiple hooks in a single run. For instance:
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def on_browser_created_hook(browser, **kwargs):
    print("[HOOK] Browser is up, no page yet. Good for broad config.")

async def before_goto_auth_hook(page, context, goto_params, **kwargs):
    print("[HOOK] Adding cookies for auth.")
    # Playwright needs either a 'url' or a 'domain' + 'path' pair for each cookie
    await context.add_cookies([
        {"name": "session", "value": "abcd1234", "domain": "example.com", "path": "/"}
    ])

async def after_goto_log_hook(page, context, response, **kwargs):
    if response:
        print("[HOOK] after_goto: Status code:", response.status)

async def main():
    hooks = {
        "on_browser_created": on_browser_created_hook,
        "before_goto": before_goto_auth_hook,
        "after_goto": after_goto_log_hook
    }

    browser_cfg = BrowserConfig(headless=True)
    crawler_cfg = CrawlerRunConfig(verbose=True)

    async with AsyncWebCrawler(config=browser_cfg, hooks=hooks) as crawler:
        result = await crawler.arun("https://example.com/protected", config=crawler_cfg)
        if result.success:
            print("[OK] Protected page length:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
|
||||
|
||||
This example:
|
||||
|
||||
1. **`on_browser_created`** sets up the brand-new browser instance.
|
||||
2. **`before_goto`** ensures you inject an auth cookie before accessing the page.
|
||||
3. **`after_goto`** logs the resulting HTTP status code.
|
||||
|
||||
---
|
||||
|
||||
## 5. Common Pitfalls & Best Practices
|
||||
|
||||
1. **Hook Order**: If multiple hooks do overlapping tasks (e.g., two `before_goto` hooks), be mindful of conflicts or repeated logic.
|
||||
2. **Async vs Sync**: Some hooks might be used in a synchronous or asynchronous style. Confirm your function signature. If the crawler expects `async`, define `async def`.
|
||||
3. **Mutating goto_params**: `goto_params` is a dict that eventually goes to Playwright’s `page.goto()`. Changing the `url` or adding extra fields can be powerful but can also lead to confusion. Document your changes carefully.
|
||||
4. **Browser vs Page vs Context**: Not all hooks have both `page` and `context`. For example, `on_browser_created` only has access to **`browser`**.
|
||||
5. **Avoid Overdoing It**: Hooks are powerful but can lead to complexity. If you find yourself writing massive code inside a hook, consider if a separate “how-to” function with a simpler approach might suffice.
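
On pitfall 3: appending `"?debug=1"` to a URL by string concatenation breaks when the URL already has a query string. A small standard-library sketch of a safer approach follows. The `add_query_param` helper is hypothetical, and the `goto_params` dict here is a plain stand-in that assumes a `"url"` key, as in the commented example earlier in this tutorial.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def add_query_param(url: str, key: str, value: str) -> str:
    """Append one query parameter, preserving any existing query and fragment."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((key, value))
    return urlunsplit(parts._replace(query=urlencode(query)))

# Stand-in for the dict a before_goto hook receives (assumed to carry a "url" key).
goto_params = {"url": "https://example.com/protected?page=2"}
goto_params["url"] = add_query_param(goto_params["url"], "debug", "1")
print(goto_params["url"])
```

Keeping the URL logic in a small helper like this also makes the change easy to test outside the crawler.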
|
||||
|
||||
---
|
||||
|
||||
## Conclusion & Next Steps
|
||||
|
||||
**Hooks** let you bend Crawl4AI to your will:
|
||||
|
||||
- **Authentication** (cookies, localStorage) with `before_goto`
|
||||
- **Browser-level config** with `on_browser_created`
|
||||
- **Page or context config** with `on_page_context_created`
|
||||
- **Content modifications** before capturing HTML (`before_retrieve_html` or `before_return_html`)
|
||||
|
||||
**Where to go next**:
|
||||
|
||||
- **[Identity-Based Crawling & Anti-Bot](./identity-anti-bot.md)**: Combine hooks with advanced user simulation to avoid bot detection.
|
||||
- **[Reference → AsyncPlaywrightCrawlerStrategy](../../reference/browser-strategies.md)**: Learn more about how hooks are implemented under the hood.
|
||||
- **[How-To Guides](../../how-to/)**: Check short, specific recipes for tasks like scraping multiple pages with repeated “Load More” clicks.
|
||||
|
||||
With the hook system, you have near-complete control over the browser’s lifecycle—whether it’s setting up environment variables, customizing user agents, or manipulating the HTML. Enjoy the freedom to create sophisticated, fully customized crawling pipelines!
|
||||
|
||||
**Last Updated**: 2024-XX-XX
|
||||
395
docs/md_v3/tutorials/json-extraction-basic.md
Normal file
@@ -0,0 +1,395 @@
|
||||
# Extracting JSON (No LLM)
|
||||
|
||||
One of Crawl4AI’s **most powerful** features is extracting **structured JSON** from websites **without** relying on large language models. By defining a **schema** with CSS or XPath selectors, you can extract data instantly—even from complex or nested HTML structures—without the cost, latency, or environmental impact of an LLM.
|
||||
|
||||
**Why avoid LLM for basic extractions?**
|
||||
|
||||
1. **Faster & Cheaper**: No API calls or GPU overhead.
|
||||
2. **Lower Carbon Footprint**: LLM inference can be energy-intensive. A well-defined schema is practically carbon-free.
|
||||
3. **Precise & Repeatable**: CSS/XPath selectors do exactly what you specify. LLM outputs can vary or hallucinate.
|
||||
4. **Scales Readily**: For thousands of pages, schema-based extraction runs quickly and in parallel.
|
||||
|
||||
Below, we’ll explore how to craft these schemas and use them with **JsonCssExtractionStrategy** (or **JsonXPathExtractionStrategy** if you prefer XPath). We’ll also highlight advanced features like **nested fields** and **base element attributes**.
|
||||
|
||||
---
|
||||
|
||||
## 1. Intro to Schema-Based Extraction
|
||||
|
||||
A schema defines:
|
||||
|
||||
1. A **base selector** that identifies each “container” element on the page (e.g., a product row, a blog post card).
|
||||
2. **Fields** describing which CSS/XPath selectors to use for each piece of data you want to capture (text, attribute, HTML block, etc.).
|
||||
3. **Nested** or **list** types for repeated or hierarchical structures.
|
||||
|
||||
For example, if you have a list of products, each one might have a name, price, reviews, and “related products.” This approach is faster and more reliable than an LLM for consistent, structured pages.
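
To show the idea of "base selector plus fields" in isolation, here is a toy sketch using only the standard library's `xml.etree.ElementTree` and its limited XPath subset. It is not how Crawl4AI implements extraction; the markup, the schema dict, and the `extract` function are all made up for this illustration, and the input must be well-formed XML for `ElementTree` to parse it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup with two repeated "container" elements.
HTML = """
<div>
  <div class="product"><h2>Widget</h2><span class="price">$5</span></div>
  <div class="product"><h2>Gadget</h2><span class="price">$9</span></div>
</div>
"""

# Toy schema: one base selector, then one selector per field (relative to each container).
schema = {
    "baseSelector": ".//div[@class='product']",
    "fields": [
        {"name": "name", "selector": "h2"},
        {"name": "price", "selector": "span[@class='price']"},
    ],
}

def extract(markup, schema):
    """Return one dict per container matched by the base selector."""
    root = ET.fromstring(markup.strip())
    rows = []
    for container in root.findall(schema["baseSelector"]):
        row = {}
        for field in schema["fields"]:
            node = container.find(field["selector"])
            has_text = node is not None and node.text
            row[field["name"]] = node.text.strip() if has_text else None
        rows.append(row)
    return rows

items = extract(HTML, schema)
print(items)
```

Each container yields one record, and each field selector is evaluated relative to its container, which is the same mental model the schemas in the following sections rely on.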
|
||||
|
||||
---
|
||||
|
||||
## 2. Simple Example: Crypto Prices
|
||||
|
||||
Let’s begin with a **simple** schema-based extraction using the `JsonCssExtractionStrategy`. Below is a snippet that extracts cryptocurrency prices from a site (similar to the legacy Coinbase example). Notice we **don’t** call any LLM:
|
||||
|
||||
```python
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def extract_crypto_prices():
    # 1. Define a simple extraction schema
    schema = {
        "name": "Crypto Prices",
        "baseSelector": "div.crypto-row",  # Repeated elements
        "fields": [
            {
                "name": "coin_name",
                "selector": "h2.coin-name",
                "type": "text"
            },
            {
                "name": "price",
                "selector": "span.coin-price",
                "type": "text"
            }
        ]
    }

    # 2. Create the extraction strategy
    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

    # 3. Set up your crawler config (if needed)
    config = CrawlerRunConfig(
        # e.g., pass js_code or wait_for if the page is dynamic
        # wait_for="css:.crypto-row:nth-child(20)"
        cache_mode=CacheMode.BYPASS,
        extraction_strategy=extraction_strategy,
    )

    async with AsyncWebCrawler(verbose=True) as crawler:
        # 4. Run the crawl and extraction
        result = await crawler.arun(
            url="https://example.com/crypto-prices",
            config=config
        )

        if not result.success:
            print("Crawl failed:", result.error_message)
            return

        # 5. Parse the extracted JSON
        data = json.loads(result.extracted_content)
        print(f"Extracted {len(data)} coin entries")
        print(json.dumps(data[0], indent=2) if data else "No data found")

asyncio.run(extract_crypto_prices())
```
|
||||
|
||||
**Highlights**:
|
||||
|
||||
- **`baseSelector`**: Tells us where each “item” (crypto row) is.
|
||||
- **`fields`**: Two fields (`coin_name`, `price`) using simple CSS selectors.
|
||||
- Each field defines a **`type`** (e.g., `text`, `attribute`, `html`, `regex`, etc.).
|
||||
|
||||
No LLM is needed, and the performance is **near-instant** for hundreds or thousands of items.
|
||||
|
||||
---
|
||||
|
||||
### **XPath Example with `raw://` HTML**
|
||||
|
||||
Below is a short example demonstrating **XPath** extraction plus the **`raw://`** scheme. We’ll pass a **dummy HTML** directly (no network request) and define the extraction strategy in `CrawlerRunConfig`.
|
||||
|
||||
```python
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonXPathExtractionStrategy

async def extract_crypto_prices_xpath():
    # 1. Minimal dummy HTML with some repeating rows
    dummy_html = """
    <html>
      <body>
        <div class='crypto-row'>
          <h2 class='coin-name'>Bitcoin</h2>
          <span class='coin-price'>$28,000</span>
        </div>
        <div class='crypto-row'>
          <h2 class='coin-name'>Ethereum</h2>
          <span class='coin-price'>$1,800</span>
        </div>
      </body>
    </html>
    """

    # 2. Define the JSON schema (XPath version)
    schema = {
        "name": "Crypto Prices via XPath",
        "baseSelector": "//div[@class='crypto-row']",
        "fields": [
            {
                "name": "coin_name",
                "selector": ".//h2[@class='coin-name']",
                "type": "text"
            },
            {
                "name": "price",
                "selector": ".//span[@class='coin-price']",
                "type": "text"
            }
        ]
    }

    # 3. Place the strategy in the CrawlerRunConfig
    config = CrawlerRunConfig(
        extraction_strategy=JsonXPathExtractionStrategy(schema, verbose=True)
    )

    # 4. Use raw:// scheme to pass dummy_html directly
    raw_url = f"raw://{dummy_html}"

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url=raw_url,
            config=config
        )

        if not result.success:
            print("Crawl failed:", result.error_message)
            return

        data = json.loads(result.extracted_content)
        print(f"Extracted {len(data)} coin rows")
        if data:
            print("First item:", data[0])

asyncio.run(extract_crypto_prices_xpath())
```
|
||||
|
||||
**Key Points**:
|
||||
|
||||
1. **`JsonXPathExtractionStrategy`** is used instead of `JsonCssExtractionStrategy`.
|
||||
2. **`baseSelector`** and each field’s `"selector"` use **XPath** instead of CSS.
|
||||
3. **`raw://`** lets us pass `dummy_html` with no real network request—handy for local testing.
|
||||
4. Everything (including the extraction strategy) is in **`CrawlerRunConfig`**.
|
||||
|
||||
That’s how you keep the config self-contained, illustrate **XPath** usage, and demonstrate the **raw** scheme for direct HTML input—all while avoiding the old approach of passing `extraction_strategy` directly to `arun()`.
|
||||
|
||||
---
|
||||
|
||||
## 3. Advanced Schema & Nested Structures
|
||||
|
||||
Real sites often have **nested** or repeated data—like categories containing products, which themselves have a list of reviews or features. For that, we can define **nested** or **list** (and even **nested_list**) fields.
|
||||
|
||||
### Sample E-Commerce HTML
|
||||
|
||||
We have a **sample e-commerce** HTML file on GitHub (example):
|
||||
```
|
||||
https://gist.githubusercontent.com/githubusercontent/2d7b8ba3cd8ab6cf3c8da771ddb36878/raw/1ae2f90c6861ce7dd84cc50d3df9920dee5e1fd2/sample_ecommerce.html
|
||||
```
|
||||
This snippet includes categories, products, features, reviews, and related items. Let’s see how to define a schema that fully captures that structure **without LLM**.
|
||||
|
||||
```python
schema = {
    "name": "E-commerce Product Catalog",
    "baseSelector": "div.category",
    # (1) We can define optional baseFields if we want to extract attributes from the category container
    "baseFields": [
        {"name": "data_cat_id", "type": "attribute", "attribute": "data-cat-id"},
    ],
    "fields": [
        {
            "name": "category_name",
            "selector": "h2.category-name",
            "type": "text"
        },
        {
            "name": "products",
            "selector": "div.product",
            "type": "nested_list",  # repeated sub-objects
            "fields": [
                {
                    "name": "name",
                    "selector": "h3.product-name",
                    "type": "text"
                },
                {
                    "name": "price",
                    "selector": "p.product-price",
                    "type": "text"
                },
                {
                    "name": "details",
                    "selector": "div.product-details",
                    "type": "nested",  # single sub-object
                    "fields": [
                        {"name": "brand", "selector": "span.brand", "type": "text"},
                        {"name": "model", "selector": "span.model", "type": "text"}
                    ]
                },
                {
                    "name": "features",
                    "selector": "ul.product-features li",
                    "type": "list",
                    "fields": [
                        {"name": "feature", "type": "text"}
                    ]
                },
                {
                    "name": "reviews",
                    "selector": "div.review",
                    "type": "nested_list",
                    "fields": [
                        {"name": "reviewer", "selector": "span.reviewer", "type": "text"},
                        {"name": "rating", "selector": "span.rating", "type": "text"},
                        {"name": "comment", "selector": "p.review-text", "type": "text"}
                    ]
                },
                {
                    "name": "related_products",
                    "selector": "ul.related-products li",
                    "type": "list",
                    "fields": [
                        {"name": "name", "selector": "span.related-name", "type": "text"},
                        {"name": "price", "selector": "span.related-price", "type": "text"}
                    ]
                }
            ]
        }
    ]
}
```
|
||||
|
||||
Key Takeaways:
|
||||
|
||||
- **Nested vs. List**:
|
||||
- **`type: "nested"`** means a **single** sub-object (like `details`).
|
||||
- **`type: "list"`** means multiple items that are **simple** dictionaries or single text fields.
|
||||
- **`type: "nested_list"`** means repeated **complex** objects (like `products` or `reviews`).
|
||||
- **Base Fields**: We can extract **attributes** from the container element via `"baseFields"`. For instance, `"data_cat_id"` might be `data-cat-id="elect123"`.
|
||||
- **Transforms**: We can also define a `transform` if we want to lower/upper case, strip whitespace, or even run a custom function.
|
||||
|
||||
### Running the Extraction
|
||||
|
||||
```python
import json
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

ecommerce_schema = {
    # ... the advanced schema from above ...
}

async def extract_ecommerce_data():
    strategy = JsonCssExtractionStrategy(ecommerce_schema, verbose=True)

    # Attach the strategy to the run config instead of passing it to arun()
    config = CrawlerRunConfig(extraction_strategy=strategy)

    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://gist.githubusercontent.com/githubusercontent/2d7b8ba3cd8ab6cf3c8da771ddb36878/raw/1ae2f90c6861ce7dd84cc50d3df9920dee5e1fd2/sample_ecommerce.html",
            config=config
        )

        if not result.success:
            print("Crawl failed:", result.error_message)
            return

        # Parse the JSON output
        data = json.loads(result.extracted_content)
        print(json.dumps(data, indent=2) if data else "No data found.")

asyncio.run(extract_ecommerce_data())
```
|
||||
|
||||
If all goes well, you get a **structured** JSON array with each “category,” containing an array of `products`. Each product includes `details`, `features`, `reviews`, etc. All of that **without** an LLM.
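
To make that shape concrete, here is a minimal sketch of walking such a result with the standard library only. The JSON string is hand-written sample data standing in for `result.extracted_content`; the category, product, and review values are assumptions for illustration (not output from the sample page), and only the key names come from the schema above.

```python
import json

# Hand-written stand-in for result.extracted_content (illustrative values only).
sample_extracted_content = """
[
  {
    "data_cat_id": "elect123",
    "category_name": "Electronics",
    "products": [
      {
        "name": "Example Phone",
        "price": "$499",
        "details": {"brand": "ExampleBrand", "model": "X1"},
        "features": [{"feature": "5G"}, {"feature": "OLED display"}],
        "reviews": [
          {"reviewer": "A. Reader", "rating": "4", "comment": "Solid."}
        ]
      }
    ]
  }
]
"""

def summarize_catalog(extracted_content: str) -> list[dict]:
    """Return one flat summary row per product in the nested result."""
    rows = []
    for category in json.loads(extracted_content):
        # "nested_list" fields are lists of dicts; fall back to [] if absent.
        for product in category.get("products", []):
            rows.append({
                "category": category.get("category_name"),
                "product": product.get("name"),
                # "nested" fields are a single dict.
                "brand": product.get("details", {}).get("brand"),
                # The "list" field here is a list of one-key dicts.
                "features": [f.get("feature") for f in product.get("features", [])],
                "review_count": len(product.get("reviews", [])),
            })
    return rows

summary = summarize_catalog(sample_extracted_content)
for row in summary:
    print(row)
```

Using `.get()` with defaults keeps the walk tolerant of categories or products where a sub-selector matched nothing.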
|
||||
|
||||
---
|
||||
|
||||
## 4. Why “No LLM” Is Often Better
|
||||
|
||||
1. **Zero Hallucination**: Schema-based extraction doesn’t guess text. It either finds it or not.
|
||||
2. **Guaranteed Structure**: The same schema yields consistent JSON across many pages, so your downstream pipeline can rely on stable keys.
|
||||
3. **Speed**: LLM-based extraction can be 10–1000x slower for large-scale crawling.
|
||||
4. **Scalable**: Adding or updating a field is a matter of adjusting the schema, not re-tuning a model.
|
||||
|
||||
**When might you consider an LLM?** Possibly if the site is extremely unstructured or you want AI summarization. But always try a schema approach first for repeated or consistent data patterns.
|
||||
|
||||
---
|
||||
|
||||
## 5. Base Element Attributes & Additional Fields
|
||||
|
||||
It’s easy to **extract attributes** (like `href`, `src`, or `data-xxx`) from your base or nested elements using:
|
||||
|
||||
```json
{
  "name": "href",
  "type": "attribute",
  "attribute": "href",
  "default": null
}
```
|
||||
|
||||
You can define them in **`baseFields`** (extracted from the main container element) or in each field’s sub-lists. This is especially helpful if you need an item’s link or ID stored in the parent `<div>`.
|
||||
|
||||
---
|
||||
|
||||
## 6. Putting It All Together: Larger Example
|
||||
|
||||
Consider a blog site. We have a schema that extracts the **URL** from each post card (via `baseFields` with an `"attribute": "href"`), plus the title, date, summary, and author:
|
||||
|
||||
```python
schema = {
    "name": "Blog Posts",
    "baseSelector": "a.blog-post-card",
    "baseFields": [
        {"name": "post_url", "type": "attribute", "attribute": "href"}
    ],
    "fields": [
        {"name": "title", "selector": "h2.post-title", "type": "text", "default": "No Title"},
        {"name": "date", "selector": "time.post-date", "type": "text", "default": ""},
        {"name": "summary", "selector": "p.post-summary", "type": "text", "default": ""},
        {"name": "author", "selector": "span.post-author", "type": "text", "default": ""}
    ]
}
```
|
||||
|
||||
Then run with `JsonCssExtractionStrategy(schema)` to get an array of blog post objects, each with `"post_url"`, `"title"`, `"date"`, `"summary"`, `"author"`.
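
If the `href` values come back as written in the page, some of them may be relative. The sketch below shows one way to normalize them after extraction using only the standard library. `BASE_URL` and the sample records are hypothetical stand-ins for your site and for `result.extracted_content`; the helper name `absolutize_post_urls` is ours, not part of Crawl4AI.

```python
import json
from urllib.parse import urljoin

# Hypothetical base URL and extracted output, for illustration only.
BASE_URL = "https://example.com/blog/"
sample_extracted_content = json.dumps([
    {"post_url": "/blog/first-post", "title": "First Post",
     "date": "", "summary": "", "author": ""},
    {"post_url": "https://example.com/blog/second-post", "title": "No Title",
     "date": "", "summary": "", "author": ""},
])

def absolutize_post_urls(extracted_content: str, base_url: str) -> list[dict]:
    """Resolve each post_url against base_url; absolute URLs pass through unchanged."""
    posts = json.loads(extracted_content)
    for post in posts:
        href = post.get("post_url")
        if href:
            post["post_url"] = urljoin(base_url, href)
    return posts

posts = absolutize_post_urls(sample_extracted_content, BASE_URL)
for post in posts:
    print(post["post_url"], "|", post["title"])
```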
|
||||
|
||||
---
|
||||
|
||||
## 7. Tips & Best Practices
|
||||
|
||||
1. **Inspect the DOM** in Chrome DevTools or Firefox’s Inspector to find stable selectors.
|
||||
2. **Start Simple**: Verify you can extract a single field. Then add complexity like nested objects or lists.
|
||||
3. **Test** your schema on partial HTML or a test page before a big crawl.
|
||||
4. **Combine with JS Execution** if the site loads content dynamically. You can pass `js_code` or `wait_for` in `CrawlerRunConfig`.
|
||||
5. **Look at Logs** when `verbose=True`: if your selectors are off or your schema is malformed, it’ll often show warnings.
|
||||
6. **Use baseFields** if you need attributes from the container element (e.g., `href`, `data-id`), especially for the “parent” item.
|
||||
7. **Performance**: For large pages, make sure your selectors are as narrow as possible.
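
Part of tips 2 and 3 can be automated before any crawl. The helper below is a hypothetical pre-flight check, not part of Crawl4AI and not a description of how the library validates schemas. It only encodes the conventions used in this tutorial: every field has a `name` and a `type`, `attribute` fields name an `attribute`, and container types carry their own `fields`. It deliberately does not require `selector`, since `baseFields` and simple list items omit it.

```python
CONTAINER_TYPES = {"nested", "list", "nested_list"}

def lint_schema(schema: dict) -> list[str]:
    """Return human-readable problems found in a schema dict (empty list if none)."""
    problems = []
    if not schema.get("baseSelector"):
        problems.append("schema: missing 'baseSelector'")

    def check_fields(fields, path):
        for i, field in enumerate(fields):
            label = field.get("name") or f"#{i}"
            where = f"{path}.{label}"
            if "name" not in field:
                problems.append(f"{where}: missing 'name'")
            ftype = field.get("type")
            if ftype is None:
                problems.append(f"{where}: missing 'type'")
            elif ftype == "attribute" and "attribute" not in field:
                problems.append(f"{where}: type 'attribute' needs an 'attribute' key")
            elif ftype in CONTAINER_TYPES:
                if not field.get("fields"):
                    problems.append(f"{where}: type '{ftype}' needs non-empty 'fields'")
                else:
                    # Recurse into nested field definitions.
                    check_fields(field["fields"], where)

    check_fields(schema.get("baseFields", []), "baseFields")
    check_fields(schema.get("fields", []), "fields")
    return problems

blog_schema = {
    "name": "Blog Posts",
    "baseSelector": "a.blog-post-card",
    "baseFields": [{"name": "post_url", "type": "attribute", "attribute": "href"}],
    "fields": [{"name": "title", "selector": "h2.post-title", "type": "text"}],
}
print(lint_schema(blog_schema) or "schema looks consistent")
```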
|
||||
|
||||
---
|
||||
|
||||
## 8. Conclusion
|
||||
|
||||
With **JsonCssExtractionStrategy** (or **JsonXPathExtractionStrategy**), you can build powerful, **LLM-free** pipelines that:
|
||||
|
||||
- Scrape any consistent site for structured data.
|
||||
- Support nested objects, repeating lists, or advanced transformations.
|
||||
- Scale to thousands of pages quickly and reliably.
|
||||
|
||||
**Next Steps**:
|
||||
|
||||
- Explore the [Advanced Usage of JSON Extraction](../../explanations/extraction-chunking.md) for deeper details on schema nesting, transformations, or hooking.
|
||||
- Combine your extracted JSON with advanced filtering or summarization in a second pass if needed.
|
||||
- For dynamic pages, combine strategies with `js_code` or infinite scroll hooking to ensure all content is loaded.
|
||||
|
||||
**Remember**: For repeated, structured data, you don’t need to pay for or wait on an LLM. A well-crafted schema plus CSS or XPath gets you the data faster, cleaner, and cheaper—**the real power** of Crawl4AI.
|
||||
|
||||
**Last Updated**: 2024-XX-XX
|
||||
|
||||
---
|
||||
|
||||
That’s it for **Extracting JSON (No LLM)**! You’ve seen how schema-based approaches (either CSS or XPath) can handle everything from simple lists to deeply nested product catalogs—instantly, with minimal overhead. Enjoy building robust scrapers that produce consistent, structured JSON for your data pipelines!
|
||||
334
docs/md_v3/tutorials/json-extraction-llm.md
Normal file
@@ -0,0 +1,334 @@
|
||||
Below is a **draft** of the **Extracting JSON (LLM)** tutorial, illustrating how to use large language models for structured data extraction in Crawl4AI. It highlights key parameters (like chunking, overlap, instruction, schema) and explains how the system remains **provider-agnostic** via LiteLLM. Adjust field names or code snippets to match your repository’s specifics.
|
||||
|
||||
---
|
||||
|
||||
# Extracting JSON (LLM)
|
||||
|
||||
In some cases, you need to extract **complex or unstructured** information from a webpage that a simple CSS/XPath schema cannot easily parse. Or you want **AI**-driven insights, classification, or summarization. For these scenarios, Crawl4AI provides an **LLM-based extraction strategy** that:
|
||||
|
||||
1. Works with **any** large language model supported by [LiteLLM](https://github.com/BerriAI/litellm) (Ollama, OpenAI, Claude, and more).
|
||||
2. Automatically splits content into chunks (if desired) to handle token limits, then combines results.
|
||||
3. Lets you define a **schema** (like a Pydantic model) or a simpler “block” extraction approach.
|
||||
|
||||
**Important**: LLM-based extraction can be slower and costlier than schema-based approaches. If your page data is highly structured, consider using [`JsonCssExtractionStrategy`](./json-extraction-basic.md) or [`JsonXPathExtractionStrategy`](./json-extraction-basic.md) first. But if you need AI to interpret or reorganize content, read on!
|
||||
|
||||
---
|
||||
|
||||
## 1. Why Use an LLM?
|
||||
|
||||
- **Complex Reasoning**: If the site’s data is unstructured, scattered, or full of natural language context.
|
||||
- **Semantic Extraction**: Summaries, knowledge graphs, or relational data that require comprehension.
|
||||
- **Flexible**: You can pass instructions to the model to do more advanced transformations or classification.
|
||||
|
||||
---
|
||||
|
||||
## 2. Provider-Agnostic via LiteLLM
|
||||
|
||||
Crawl4AI uses a “provider string” (e.g., `"openai/gpt-4o"`, `"ollama/llama2.0"`, `"aws/titan"`) to identify your LLM. **Any** model that LiteLLM supports is fair game. You just provide:
|
||||
|
||||
- **`provider`**: The `<provider>/<model_name>` identifier (e.g., `"openai/gpt-4"`, `"ollama/llama2"`, `"huggingface/google-flan"`, etc.).
|
||||
- **`api_token`**: If needed (for OpenAI, HuggingFace, etc.); local models or Ollama might not require it.
|
||||
- **`api_base`** (optional): If your provider has a custom endpoint.
|
||||
|
||||
This means you **aren’t locked** into a single LLM vendor. Switch or experiment easily.
|
||||
|
||||
---
|
||||
|
||||
## 3. How LLM Extraction Works
|
||||
|
||||
### 3.1 Flow
|
||||
|
||||
1. **Chunking** (optional): The HTML or markdown is split into smaller segments if it’s very long (based on `chunk_token_threshold`, overlap, etc.).
|
||||
2. **Prompt Construction**: For each chunk, the library forms a prompt that includes your **`instruction`** (and possibly schema or examples).
|
||||
3. **LLM Inference**: Each chunk is sent to the model in parallel or sequentially (depending on your concurrency).
|
||||
4. **Combining**: The results from each chunk are merged and parsed into JSON.
|
||||
|
||||
### 3.2 `extraction_type`
|
||||
|
||||
- **`"schema"`**: The model tries to return JSON conforming to your Pydantic-based schema.
|
||||
- **`"block"`**: The model returns freeform text, or smaller JSON structures, which the library collects.
|
||||
|
||||
For structured data, `"schema"` is recommended. You provide `schema=YourPydanticModel.model_json_schema()`.
|
||||
|
||||
---
|
||||
|
||||
## 4. Key Parameters
|
||||
|
||||
Below is an overview of important LLM extraction parameters. All are typically set inside `LLMExtractionStrategy(...)`. You then put that strategy in your `CrawlerRunConfig(..., extraction_strategy=...)`.
|
||||
|
||||
1. **`provider`** (str): e.g., `"openai/gpt-4"`, `"ollama/llama2"`.
|
||||
2. **`api_token`** (str): The API key or token for that model. May not be needed for local models.
|
||||
3. **`schema`** (dict): A JSON schema describing the fields you want. Usually generated by `YourModel.model_json_schema()`.
|
||||
4. **`extraction_type`** (str): `"schema"` or `"block"`.
|
||||
5. **`instruction`** (str): Prompt text telling the LLM what you want extracted. E.g., “Extract these fields as a JSON array.”
|
||||
6. **`chunk_token_threshold`** (int): Maximum tokens per chunk. If your content is huge, you can break it up for the LLM.
|
||||
7. **`overlap_rate`** (float): Overlap ratio between adjacent chunks. E.g., `0.1` means 10% of each chunk is repeated to preserve context continuity.
|
||||
8. **`apply_chunking`** (bool): Set `True` to chunk automatically. If you want a single pass, set `False`.
|
||||
9. **`input_format`** (str): Determines **which** crawler result is passed to the LLM. Options include:
|
||||
- `"markdown"`: The raw markdown (default).
|
||||
- `"fit_markdown"`: The filtered “fit” markdown if you used a content filter.
|
||||
- `"html"`: The cleaned or raw HTML.
|
||||
10. **`extra_args`** (dict): Additional LLM parameters like `temperature`, `max_tokens`, `top_p`, etc.
|
||||
11. **`show_usage()`**: A method you can call to print out usage info (token usage per chunk, total cost if known).
|
||||
|
||||
**Example**:
|
||||
|
||||
```python
extraction_strategy = LLMExtractionStrategy(
    provider="openai/gpt-4",
    api_token="YOUR_OPENAI_KEY",
    schema=MyModel.model_json_schema(),
    extraction_type="schema",
    instruction="Extract a list of items from the text with 'name' and 'price' fields.",
    chunk_token_threshold=1200,
    overlap_rate=0.1,
    apply_chunking=True,
    input_format="html",
    extra_args={"temperature": 0.1, "max_tokens": 1000},
    verbose=True
)
```
|
||||
|
||||
---
|
||||
|
||||
## 5. Putting It in `CrawlerRunConfig`
|
||||
|
||||
**Important**: In Crawl4AI, all strategy definitions should go inside the `CrawlerRunConfig`, not directly as a param in `arun()`. Here’s a full example:
|
||||
|
||||
```python
import os
import asyncio
import json
from pydantic import BaseModel, Field
from typing import List
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class Product(BaseModel):
    name: str
    price: str

async def main():
    # 1. Define the LLM extraction strategy
    llm_strategy = LLMExtractionStrategy(
        provider="openai/gpt-4o-mini",            # e.g. "ollama/llama2"
        api_token=os.getenv('OPENAI_API_KEY'),
        schema=Product.model_json_schema(),       # dict, as the `schema` parameter expects (Pydantic v2)
        extraction_type="schema",
        instruction="Extract all product objects with 'name' and 'price' from the content.",
        chunk_token_threshold=1000,
        overlap_rate=0.0,
        apply_chunking=True,
        input_format="markdown",                  # or "html", "fit_markdown"
        extra_args={"temperature": 0.0, "max_tokens": 800}
    )

    # 2. Build the crawler config
    crawl_config = CrawlerRunConfig(
        extraction_strategy=llm_strategy,
        cache_mode=CacheMode.BYPASS
    )

    # 3. Create a browser config if needed
    browser_cfg = BrowserConfig(headless=True)

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        # 4. Let's say we want to crawl a single page
        result = await crawler.arun(
            url="https://example.com/products",
            config=crawl_config
        )

        if result.success:
            # 5. The extracted content is presumably JSON
            data = json.loads(result.extracted_content)
            print("Extracted items:", data)

            # 6. Show usage stats
            llm_strategy.show_usage()  # prints token usage
        else:
            print("Error:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
|
||||
|
||||
---
|
||||
|
||||
## 6. Chunking Details
|
||||
|
||||
### 6.1 `chunk_token_threshold`
|
||||
|
||||
If your page is large, you might exceed your LLM’s context window. **`chunk_token_threshold`** sets the approximate max tokens per chunk. The library calculates word→token ratio using `word_token_rate` (often ~0.75 by default). If chunking is enabled (`apply_chunking=True`), the text is split into segments.
|
||||
|
||||
### 6.2 `overlap_rate`
|
||||
|
||||
To keep context continuous across chunks, we can overlap them. E.g., `overlap_rate=0.1` means each subsequent chunk includes 10% of the previous chunk’s text. This is helpful if your needed info might straddle chunk boundaries.
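
The interaction between `chunk_token_threshold`, `word_token_rate`, and `overlap_rate` can be sketched with plain word splitting. This is an approximation for intuition, not the library's actual chunker: the function name `chunk_words` is ours, and it assumes `word_token_rate` means "words per token" (so 0.75 yields 750 words for a 1000-token threshold). If the library defines the ratio the other way round, the sizes change but the overlap logic is the same.

```python
def chunk_words(text: str, chunk_token_threshold: int,
                overlap_rate: float = 0.0,
                word_token_rate: float = 0.75) -> list[str]:
    """Split text into word-based chunks, repeating a slice of each chunk in the next."""
    words = text.split()
    # Assumed conversion: tokens * (words per token) = words allowed per chunk.
    words_per_chunk = max(1, int(chunk_token_threshold * word_token_rate))
    overlap = int(words_per_chunk * overlap_rate)
    # Each new chunk starts `step` words after the previous one.
    step = max(1, words_per_chunk - overlap)

    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + words_per_chunk]))
        if start + words_per_chunk >= len(words):
            break  # the last chunk already reached the end of the text
    return chunks

sample_text = " ".join(f"w{i}" for i in range(100))
chunks = chunk_words(sample_text, chunk_token_threshold=40, overlap_rate=0.1)
for c in chunks:
    parts = c.split()
    print(len(parts), parts[0], "...", parts[-1])
```

With a 40-token threshold this gives 30 words per chunk and a 3-word overlap, so the last three words of one chunk reappear at the start of the next.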
|
||||
|
||||
### 6.3 Performance & Parallelism
|
||||
|
||||
By chunking, you can potentially process multiple chunks in parallel (depending on your concurrency settings and the LLM provider). This reduces total time if the site is huge or has many sections.
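
As a rough illustration of bounded parallelism, the sketch below processes chunks concurrently while capping the number of in-flight calls, which is one common way to stay under provider rate limits. `fake_llm_call` is a stub standing in for a real provider request, and `max_concurrency` is a name chosen here, not a Crawl4AI parameter.

```python
import asyncio

async def fake_llm_call(chunk: str) -> dict:
    """Stub for a provider call; reports the chunk's word count after a short delay."""
    await asyncio.sleep(0.01)
    return {"preview": chunk[:20], "words": len(chunk.split())}

async def process_chunks(chunks: list[str], max_concurrency: int = 2) -> list[dict]:
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(chunk: str) -> dict:
        # At most `max_concurrency` calls run at the same time.
        async with semaphore:
            return await fake_llm_call(chunk)

    # gather() returns results in the same order as the input chunks.
    return await asyncio.gather(*(guarded(c) for c in chunks))

results = asyncio.run(process_chunks(["alpha beta", "gamma", "delta epsilon zeta"]))
print(results)
```

Because results keep the input order, per-chunk outputs can be merged back in document order afterwards.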
|
||||
|
||||
---
|
||||
|
||||
## 7. Input Format
|
||||
|
||||
By default, **LLMExtractionStrategy** uses `input_format="markdown"`, meaning the **crawler’s final markdown** is fed to the LLM. You can change to:
|
||||
|
||||
- **`html`**: The cleaned HTML or raw HTML (depending on your crawler config) goes into the LLM.
|
||||
- **`fit_markdown`**: If you used, for instance, `PruningContentFilter`, the “fit” version of the markdown is used. This can drastically reduce tokens if you trust the filter.
|
||||
- **`markdown`**: Standard markdown output from the crawler’s `markdown_generator`.
|
||||
|
||||
This setting is crucial: if the LLM instructions rely on HTML tags, pick `"html"`. If you prefer a text-based approach, pick `"markdown"`.
|
||||
|
||||
```python
LLMExtractionStrategy(
    # ...
    input_format="html",  # Instead of "markdown" or "fit_markdown"
)
```
|
||||
|
||||
---
|
||||
|
||||
## 8. Token Usage & Show Usage
|
||||
|
||||
Each chunk is processed with its own LLM call. To help you keep track of tokens and cost, the strategy records usage in:
|
||||
|
||||
- **`usages`** (list): token usage per chunk or call.
|
||||
- **`total_usage`**: sum of all chunk calls.
|
||||
- **`show_usage()`**: prints a usage report (if the provider returns usage data).
|
||||
|
||||
```python
llm_strategy = LLMExtractionStrategy(...)
# ...
llm_strategy.show_usage()
# e.g. “Total usage: 1241 tokens across 2 chunk calls”
```
|
||||
|
||||
If your model provider doesn’t return usage info, these fields might be partial or empty.
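
Conceptually, the total is just a sum over per-chunk records. The sketch below models that with a hypothetical `ChunkUsage` dataclass; the attribute names (`prompt_tokens`, `completion_tokens`) follow a common provider convention and are an assumption here, so inspect the actual objects in `usages` before relying on specific fields.

```python
from dataclasses import dataclass

@dataclass
class ChunkUsage:
    """Hypothetical per-call usage record (field names are assumptions)."""
    prompt_tokens: int = 0
    completion_tokens: int = 0

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens

def summarize_usage(usages: list[ChunkUsage]) -> str:
    """Aggregate per-chunk usage into a one-line report."""
    total = sum(u.total_tokens for u in usages)
    return f"Total usage: {total} tokens across {len(usages)} chunk calls"

report = summarize_usage([ChunkUsage(900, 120), ChunkUsage(200, 21)])
print(report)
```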
|
||||
|
||||
---
|
||||
|
||||
## 9. Example: Building a Knowledge Graph
|
||||
|
||||
Below is a snippet combining **`LLMExtractionStrategy`** with a Pydantic schema for a knowledge graph. Notice how we pass an **`instruction`** telling the model what to parse.
|
||||
|
||||
```python
import os
import json
import asyncio
from typing import List
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.extraction_strategy import LLMExtractionStrategy

class Entity(BaseModel):
    name: str
    description: str

class Relationship(BaseModel):
    entity1: Entity
    entity2: Entity
    description: str
    relation_type: str

class KnowledgeGraph(BaseModel):
    entities: List[Entity]
    relationships: List[Relationship]

async def main():
    # LLM extraction strategy
    llm_strat = LLMExtractionStrategy(
        provider="openai/gpt-4",
        api_token=os.getenv('OPENAI_API_KEY'),
        schema=KnowledgeGraph.model_json_schema(),  # dict form of the schema (Pydantic v2)
        extraction_type="schema",
        instruction="Extract entities and relationships from the content. Return valid JSON.",
        chunk_token_threshold=1400,
        apply_chunking=True,
        input_format="html",
        extra_args={"temperature": 0.1, "max_tokens": 1500}
    )

    crawl_config = CrawlerRunConfig(
        extraction_strategy=llm_strat,
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        # Example page
        url = "https://www.nbcnews.com/business"
        result = await crawler.arun(url=url, config=crawl_config)

        if result.success:
            with open("kb_result.json", "w", encoding="utf-8") as f:
                f.write(result.extracted_content)
            llm_strat.show_usage()
        else:
            print("Crawl failed:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
|
||||
|
||||
**Key Observations**:
|
||||
|
||||
- **`extraction_type="schema"`** ensures we get JSON fitting our `KnowledgeGraph`.
|
||||
- **`input_format="html"`** means we feed HTML to the model.
|
||||
- **`instruction`** guides the model to output a structured knowledge graph.
|
||||
|
||||
---
|
||||
|
||||
## 10. Best Practices & Caveats
|
||||
|
||||
1. **Cost & Latency**: LLM calls can be slow or expensive. Consider chunking or smaller coverage if you only need partial data.
|
||||
2. **Model Token Limits**: If your page + instruction exceed the context window, chunking is essential.
|
||||
3. **Instruction Engineering**: Well-crafted instructions can drastically improve output reliability.
|
||||
4. **Schema Strictness**: `"schema"` extraction tries to parse the model output as JSON. If the model returns invalid JSON, partial extraction might happen, or you might get an error.
|
||||
5. **Parallel vs. Serial**: The library can process multiple chunks in parallel, but you must watch out for rate limits on certain providers.
|
||||
6. **Check Output**: Sometimes, an LLM might omit fields or produce extraneous text. You may want to post-validate with Pydantic or do additional cleanup.
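
Points 4 and 6 can be handled with a small defensive parsing step. The sketch below is a standard-library stand-in for Pydantic validation, so it runs without extra dependencies; `REQUIRED_KEYS` and `parse_products` are illustrative names matching the `Product` example above, and the input strings are made-up model outputs.

```python
import json

REQUIRED_KEYS = {"name", "price"}

def parse_products(extracted_content: str) -> tuple[list[dict], list[str]]:
    """Return (valid_items, errors) instead of raising on malformed model output."""
    try:
        payload = json.loads(extracted_content)
    except json.JSONDecodeError as exc:
        return [], [f"invalid JSON: {exc.msg}"]

    # Some models return a single object rather than a list.
    if not isinstance(payload, list):
        payload = [payload]

    valid, errors = [], []
    for index, item in enumerate(payload):
        if not isinstance(item, dict):
            errors.append(f"item {index}: expected an object, got {type(item).__name__}")
            continue
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            errors.append(f"item {index}: missing {sorted(missing)}")
        else:
            valid.append(item)
    return valid, errors

valid, errors = parse_products('[{"name": "A", "price": "$1"}, {"name": "B"}]')
print("valid:", valid)
print("errors:", errors)
```

With Pydantic v2 available, the per-item check could instead call `Product.model_validate(item)` and collect the resulting `ValidationError`s.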
|
||||
|
||||
---
|
||||
|
||||
## 11. Conclusion
|
||||
|
||||
**LLM-based extraction** in Crawl4AI is **provider-agnostic**, letting you choose from hundreds of models via LiteLLM. It’s perfect for **semantically complex** tasks or generating advanced structures like knowledge graphs. However, it’s **slower** and potentially costlier than schema-based approaches. Keep these tips in mind:
|
||||
|
||||
- Put your LLM strategy **in `CrawlerRunConfig`**.
|
||||
- Use **`input_format`** to pick which form (markdown, HTML, fit_markdown) the LLM sees.
|
||||
- Tweak **`chunk_token_threshold`**, **`overlap_rate`**, and **`apply_chunking`** to handle large content efficiently.
|
||||
- Monitor token usage with `show_usage()`.
|
||||
|
||||
If your site’s data is consistent or repetitive, consider [`JsonCssExtractionStrategy`](./json-extraction-basic.md) first for speed and simplicity. But if you need an **AI-driven** approach, `LLMExtractionStrategy` offers a flexible, multi-provider solution for extracting structured JSON from any website.
|
||||
|
||||
**Next Steps**:
|
||||
|
||||
1. **Experiment with Different Providers**
|
||||
- Try switching the `provider` (e.g., `"ollama/llama2"`, `"openai/gpt-4o"`, etc.) to see differences in speed, accuracy, or cost.
|
||||
- Pass different `extra_args` like `temperature`, `top_p`, and `max_tokens` to fine-tune your results.
|
||||
|
||||
2. **Combine With Other Strategies**
|
||||
- Use [content filters](../../how-to/content-filters.md) like BM25 or Pruning prior to LLM extraction to remove noise and reduce token usage.
|
||||
- Apply a [CSS or XPath extraction strategy](./json-extraction-basic.md) first for obvious, structured data, then send only the tricky parts to the LLM.
|
||||
|
||||
3. **Performance Tuning**
|
||||
- If pages are large, tweak `chunk_token_threshold`, `overlap_rate`, or `apply_chunking` to optimize throughput.
|
||||
- Check the usage logs with `show_usage()` to keep an eye on token consumption and identify potential bottlenecks.
|
||||
|
||||
4. **Validate Outputs**
|
||||
- If using `extraction_type="schema"`, parse the LLM’s JSON with a Pydantic model for a final validation step.
|
||||
- Log or handle any parse errors gracefully, especially if the model occasionally returns malformed JSON.
|
||||
|
||||
5. **Explore Hooks & Automation**
|
||||
- Integrate LLM extraction with [hooks](./hooks-custom.md) for complex pre/post-processing.
|
||||
- Use a multi-step pipeline: crawl, filter, LLM-extract, then store or index results for further analysis.
|
||||
|
||||
6. **Scale and Deploy**
|
||||
- Combine your LLM extraction setup with [Docker or other deployment solutions](./docker-quickstart.md) to run at scale.
|
||||
- Monitor memory usage and concurrency if you call LLMs frequently.
|
||||
|
||||
**Last Updated**: 2024-XX-XX
|
||||
|
||||
---
|
||||
|
||||
That’s it for **Extracting JSON (LLM)**—now you can harness AI to parse, classify, or reorganize data on the web. Happy crawling!
|
||||
295
docs/md_v3/tutorials/link-media-analysis.md
Normal file
@@ -0,0 +1,295 @@
|
||||
Below is a **draft** of the **“Link & Media Analysis”** tutorial. It demonstrates how to access and filter links, handle domain restrictions, and manage media (especially images) using Crawl4AI’s configuration options. Feel free to adjust examples and text to match your exact workflow or preferences.
|
||||
|
||||
---
|
||||
|
||||
# Link & Media Analysis
|
||||
|
||||
In this tutorial, you’ll learn how to:
|
||||
|
||||
1. Extract links (internal, external) from crawled pages
|
||||
2. Filter or exclude specific domains (e.g., social media or custom domains)
|
||||
3. Access and manage media data (especially images) in the crawl result
|
||||
4. Configure your crawler to exclude or prioritize certain images
|
||||
|
||||
> **Prerequisites**
|
||||
> - You have completed or are familiar with the [AsyncWebCrawler Basics](./async-webcrawler-basics.md) tutorial.
|
||||
> - You can run Crawl4AI in your environment (Playwright, Python, etc.).
|
||||
|
||||
---
|
||||
|
||||
|
||||
## 1. Link Extraction
|
||||
|
||||
### 1.1 `result.links`
|
||||
|
||||
When you call `arun()` or `arun_many()` on a URL, Crawl4AI automatically extracts links and stores them in the `links` field of `CrawlResult`. By default, the crawler tries to distinguish **internal** links (same domain) from **external** links (different domains).
|
||||
|
||||
**Basic Example**:
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://www.example.com")
        if result.success:
            internal_links = result.links.get("internal", [])
            external_links = result.links.get("external", [])
            print(f"Found {len(internal_links)} internal links, {len(external_links)} external links.")

            # Each link is typically a dictionary with fields like:
            # { "href": "...", "text": "...", "title": "...", "base_domain": "..." }
            if internal_links:
                print("Sample Internal Link:", internal_links[0])
        else:
            print("Crawl failed:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
|
||||
|
||||
**Structure Example**:
|
||||
|
||||
```python
result.links = {
    "internal": [
        {
            "href": "https://kidocode.com/",
            "text": "",
            "title": "",
            "base_domain": "kidocode.com"
        },
        {
            "href": "https://kidocode.com/degrees/technology",
            "text": "Technology Degree",
            "title": "KidoCode Tech Program",
            "base_domain": "kidocode.com"
        },
        # ...
    ],
    "external": [
        # possibly other links leading to third-party sites
    ]
}
```
|
||||
|
||||
- **`href`**: The raw hyperlink URL.
|
||||
- **`text`**: The link text (if any) within the `<a>` tag.
|
||||
- **`title`**: The `title` attribute of the link (if present).
|
||||
- **`base_domain`**: The domain extracted from `href`. Helpful for filtering or grouping by domain.
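
For instance, `base_domain` makes it easy to see which sites a page links to most. The sketch below counts links per domain with the standard library. The `links` dictionary is sample data in the structure shown above (the external entry is invented for illustration); in a real run you would pass `result.links` instead.

```python
from collections import Counter

# Sample data in the same shape as result.links (external entry is made up).
links = {
    "internal": [
        {"href": "https://kidocode.com/", "text": "", "title": "",
         "base_domain": "kidocode.com"},
        {"href": "https://kidocode.com/degrees/technology", "text": "Technology Degree",
         "title": "KidoCode Tech Program", "base_domain": "kidocode.com"},
    ],
    "external": [
        {"href": "https://example.org/partner", "text": "Partner", "title": "",
         "base_domain": "example.org"},
    ],
}

def count_by_domain(links: dict) -> Counter:
    """Count links per base_domain across internal and external groups."""
    counts = Counter()
    for group in ("internal", "external"):
        for link in links.get(group, []):
            counts[link.get("base_domain", "unknown")] += 1
    return counts

domain_counts = count_by_domain(links)
print(domain_counts.most_common())
```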
|
||||
|
||||
---
|
||||
|
||||
## 2. Domain Filtering
|
||||
|
||||
Some websites contain hundreds of third-party or affiliate links. You can filter out certain domains at **crawl time** by configuring the crawler. The most relevant parameters in `CrawlerRunConfig` are:
|
||||
|
||||
- **`exclude_external_links`**: If `True`, discard any link pointing outside the root domain.
|
||||
- **`exclude_social_media_domains`**: Provide a list of social media platforms (e.g., `["facebook.com", "twitter.com"]`) to exclude from your crawl.
|
||||
- **`exclude_social_media_links`**: If `True`, automatically skip known social platforms.
|
||||
- **`exclude_domains`**: Provide a list of custom domains you want to exclude (e.g., `["spammyads.com", "tracker.net"]`).
|
||||
|
||||
### 2.1 Example: Excluding External & Social Media Links
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    crawler_cfg = CrawlerRunConfig(
        exclude_external_links=True,          # No links outside primary domain
        exclude_social_media_links=True       # Skip recognized social media domains
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            "https://www.example.com",
            config=crawler_cfg
        )
        if result.success:
            print("[OK] Crawled:", result.url)
            print("Internal links count:", len(result.links.get("internal", [])))
            print("External links count:", len(result.links.get("external", [])))
            # Likely zero external links in this scenario
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
|
||||
|
||||
### 2.2 Example: Excluding Specific Domains
|
||||
|
||||
If you want to let external links in, but specifically exclude a domain (e.g., `suspiciousads.com`), do this:
|
||||
|
||||
```python
|
||||
crawler_cfg = CrawlerRunConfig(
|
||||
exclude_domains=["suspiciousads.com"]
|
||||
)
|
||||
```
|
||||
|
||||
This approach is handy when you still want external links but need to block certain sites you consider spammy.
|
||||
|
||||
---
|
||||
|
||||
## 3. Media Extraction
|
||||
|
||||
### 3.1 Accessing `result.media`
|
||||
|
||||
By default, Crawl4AI collects images, audio, and video URLs it finds on the page. These are stored in `result.media`, a dictionary keyed by media type (e.g., `images`, `videos`, `audio`).
|
||||
|
||||
**Basic Example**:
|
||||
|
||||
```python
# Assumes `result` comes from an earlier `await crawler.arun(...)` call
if result.success:
    images_info = result.media.get("images", [])
    print(f"Found {len(images_info)} images in total.")
    for i, img in enumerate(images_info[:5]):  # Inspect just the first 5
        print(f"[Image {i}] URL: {img['src']}")
        print(f"           Alt text: {img.get('alt', '')}")
        print(f"           Score: {img.get('score')}")
        print(f"           Description: {img.get('desc', '')}\n")
```
|
||||
|
||||
**Structure Example**:
|
||||
|
||||
```python
|
||||
result.media = {
|
||||
"images": [
|
||||
{
|
||||
"src": "https://cdn.prod.website-files.com/.../Group%2089.svg",
|
||||
"alt": "coding school for kids",
|
||||
"desc": "Trial Class Degrees degrees All Degrees AI Degree Technology ...",
|
||||
"score": 3,
|
||||
"type": "image",
|
||||
"group_id": 0,
|
||||
"format": None,
|
||||
"width": None,
|
||||
"height": None
|
||||
},
|
||||
# ...
|
||||
],
|
||||
"videos": [
|
||||
# Similar structure but with video-specific fields
|
||||
],
|
||||
"audio": [
|
||||
# Similar structure but with audio-specific fields
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
Depending on your Crawl4AI version or scraping strategy, these dictionaries can include fields like:
|
||||
|
||||
- **`src`**: The media URL (e.g., image source)
|
||||
- **`alt`**: The alt text for images (if present)
|
||||
- **`desc`**: A snippet of nearby text or a short description (optional)
|
||||
- **`score`**: A heuristic relevance score if you’re using content-scoring features
|
||||
- **`width`**, **`height`**: If the crawler detects dimensions for the image/video
|
||||
- **`type`**: Usually `"image"`, `"video"`, or `"audio"`
|
||||
- **`group_id`**: If you’re grouping related media items, the crawler might assign an ID
|
||||
|
||||
With these details, you can easily filter out or focus on certain images (for instance, ignoring images with very low scores or a different domain), or gather metadata for analytics.
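
As a minimal sketch of that kind of filtering, assuming each image dictionary carries a numeric `score` as in the structure above (the helper `filter_images`, its cut-off values, and the sample entries are illustrative, not part of Crawl4AI):

```python
def filter_images(images, min_score=2, max_items=None):
    """Keep images whose heuristic `score` is at least `min_score`, best first."""
    # A missing or None score is treated as 0 so such images are dropped.
    kept = [img for img in images if (img.get("score") or 0) >= min_score]
    kept.sort(key=lambda img: img.get("score") or 0, reverse=True)
    return kept[:max_items] if max_items is not None else kept

# Sample data shaped like result.media["images"]; the values are illustrative.
sample_images = [
    {"src": "https://example.com/hero.png", "alt": "Hero", "score": 5},
    {"src": "https://example.com/icon.svg", "alt": "", "score": 1},
    {"src": "https://example.com/chart.png", "alt": "Chart", "score": 3},
    {"src": "https://example.com/pixel.gif", "alt": "", "score": None},
]

good_images = filter_images(sample_images, min_score=3)
print([img["src"] for img in good_images])
```

Because the score is a heuristic, it is worth checking a few pages by hand before settling on a cut-off.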
|
||||
|
||||
### 3.2 Excluding External Images
|
||||
|
||||
If you’re dealing with heavy pages or want to skip third-party images (advertisements, for example), you can turn on:
|
||||
|
||||
```python
|
||||
crawler_cfg = CrawlerRunConfig(
|
||||
exclude_external_images=True
|
||||
)
|
||||
```
|
||||
|
||||
This setting attempts to discard images from outside the primary domain, keeping only those from the site you’re crawling.
|
||||
|
||||
### 3.3 Additional Media Config
|
||||
|
||||
- **`screenshot`**: Set to `True` if you want a full-page screenshot stored as `base64` in `result.screenshot`.
|
||||
- **`pdf`**: Set to `True` if you want a PDF version of the page in `result.pdf`.
|
||||
- **`wait_for_images`**: If `True`, attempts to wait until images are fully loaded before final extraction.
|
||||
|
||||
---
|
||||
|
||||
## 4. Putting It All Together: Link & Media Filtering
|
||||
|
||||
Here’s a combined example demonstrating how to filter out external links, skip certain domains, and exclude external images:
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    # Suppose we want to keep only internal links, remove certain domains,
    # and discard external images from the final crawl data.
    crawler_cfg = CrawlerRunConfig(
        exclude_external_links=True,
        exclude_domains=["spammyads.com"],
        exclude_social_media_links=True,   # skip Twitter, Facebook, etc.
        exclude_external_images=True,      # keep only images from main domain
        wait_for_images=True,              # ensure images are loaded
        verbose=True
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://www.example.com", config=crawler_cfg)

        if result.success:
            print("[OK] Crawled:", result.url)

            # 1. Links
            in_links = result.links.get("internal", [])
            ext_links = result.links.get("external", [])
            print("Internal link count:", len(in_links))
            print("External link count:", len(ext_links))  # should be zero with exclude_external_links=True

            # 2. Images
            images = result.media.get("images", [])
            print("Images found:", len(images))

            # Let's see a snippet of these images
            for i, img in enumerate(images[:3]):
                print(f"  - {img['src']} (alt={img.get('alt','')}, score={img.get('score','N/A')})")
        else:
            print("[ERROR] Failed to crawl. Reason:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
|
||||
|
||||
---
|
||||
|
||||
## 5. Common Pitfalls & Tips
|
||||
|
||||
1. **Conflicting Flags**:
|
||||
- Combining `exclude_external_links=True` with `exclude_social_media_links=True` is fine, but the first setting already discards *all* external links, so the second is largely redundant.
- There is currently no domain-based allowlist for images. If you set `exclude_external_images=True` but still want to keep some external images, you’ll need a custom approach or hook logic.
|
||||
|
||||
2. **Relevancy Scores**:
|
||||
- If your version of Crawl4AI or your scraping strategy includes an `img["score"]`, it’s typically a heuristic based on size, position, or content analysis. Evaluate carefully if you rely on it.
|
||||
|
||||
3. **Performance**:
|
||||
- Excluding certain domains or external images can speed up your crawl, especially for large, media-heavy pages.
|
||||
- If you want a “full” link map, do *not* exclude them. Instead, you can post-filter in your own code.
|
||||
|
||||
4. **Social Media Lists**:
|
||||
- `exclude_social_media_links=True` typically references an internal list of known social domains like Facebook, Twitter, LinkedIn, etc. If you need to add or remove from that list, look for library settings or a local config file (depending on your version).
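
The post-filtering approach mentioned under **Performance** can be sketched as follows. This assumes link entries expose an `href` field as shown earlier; the helper name `drop_blocked_links` and the sample URLs are hypothetical.

```python
from urllib.parse import urlparse

def drop_blocked_links(links, blocked_domains):
    """Return links whose host is not a blocked domain or a subdomain of one."""
    blocked = {domain.lower() for domain in blocked_domains}
    kept = []
    for link in links:
        host = (urlparse(link.get("href", "")).hostname or "").lower()
        is_blocked = any(host == d or host.endswith("." + d) for d in blocked)
        if not is_blocked:
            kept.append(link)
    return kept

# Sample data shaped like result.links["external"]; the values are illustrative.
sample_external = [
    {"href": "https://spammyads.com/offer"},
    {"href": "https://cdn.spammyads.com/banner"},
    {"href": "https://docs.python.org/3/"},
]

kept_links = drop_blocked_links(sample_external, ["spammyads.com"])
print([link["href"] for link in kept_links])
```

Filtering after the crawl keeps the full link map available, at the cost of carrying the unwanted entries through the crawl itself.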
|
||||
|
||||
---
|
||||
|
||||
## 6. Next Steps
|
||||
|
||||
Now that you understand how to manage **Link & Media Analysis**, you can:
|
||||
|
||||
- Fine-tune which links are stored or discarded in your final results
|
||||
- Control which images (or other media) appear in `result.media`
|
||||
- Filter out entire domains or social media platforms to keep your dataset relevant
|
||||
|
||||
**Recommended Follow-Ups**:
|
||||
- **[Advanced Features (Proxy, PDF, Screenshots)](./advanced-features.md)**: If you want to capture screenshots or save the page as a PDF for archival or debugging.
|
||||
- **[Hooks & Custom Code](./hooks-custom.md)**: For more specialized logic, such as automated “infinite scroll” or repeated “Load More” button clicks.
|
||||
- **Reference**: Check out [CrawlerRunConfig Reference](../../reference/configuration.md) for a comprehensive parameter list.
|
||||
|
||||
**Last updated**: 2024-XX-XX
|
||||
|
||||
---
|
||||
|
||||
**That’s it for Link & Media Analysis!** You’re now equipped to filter out unwanted sites and zero in on the images and videos that matter for your project.
|
||||
382
docs/md_v3/tutorials/markdown-basics.md
Normal file
@@ -0,0 +1,382 @@
|
||||
|
||||
|
||||
# Markdown Generation Basics
|
||||
|
||||
One of Crawl4AI’s core features is generating **clean, structured markdown** from web pages. Originally built to solve the problem of extracting only the “actual” content and discarding boilerplate or noise, Crawl4AI’s markdown system remains one of its biggest draws for AI workflows.
|
||||
|
||||
In this tutorial, you’ll learn:
|
||||
|
||||
1. How to configure the **Default Markdown Generator**
|
||||
2. How **content filters** (BM25 or Pruning) help you refine markdown and discard junk
|
||||
3. The difference between raw markdown (`result.markdown`) and filtered markdown (`fit_markdown`)
|
||||
|
||||
> **Prerequisites**
|
||||
> - You’ve completed or read [AsyncWebCrawler Basics](./async-webcrawler-basics.md) to understand how to run a simple crawl.
|
||||
> - You know how to configure `CrawlerRunConfig`.
|
||||
|
||||
---
|
||||
|
||||
## 1. Quick Example
|
||||
|
||||
Here’s a minimal code snippet that uses the **DefaultMarkdownGenerator** with no additional filtering:
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator()
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)

        if result.success:
            print("Raw Markdown Output:\n")
            print(result.markdown)  # The unfiltered markdown from the page
        else:
            print("Crawl failed:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
|
||||
|
||||
**What’s happening?**
|
||||
- `CrawlerRunConfig(markdown_generator=DefaultMarkdownGenerator())` instructs Crawl4AI to convert the final HTML into markdown at the end of each crawl.
|
||||
- The resulting markdown is accessible via `result.markdown`.
|
||||
|
||||
---
|
||||
|
||||
## 2. How Markdown Generation Works
|
||||
|
||||
### 2.1 HTML-to-Text Conversion (Forked & Modified)
|
||||
|
||||
Under the hood, **DefaultMarkdownGenerator** uses a specialized HTML-to-text approach that:
|
||||
|
||||
- Preserves headings, code blocks, bullet points, etc.
|
||||
- Removes extraneous tags (scripts, styles) that don’t add meaningful content.
|
||||
- Can optionally generate references for links or skip them altogether.
|
||||
|
||||
A set of **options** (passed as a dict) allows you to customize precisely how HTML converts to markdown. These map to standard html2text-like configuration plus your own enhancements (e.g., ignoring internal links, preserving certain tags verbatim, or adjusting line widths).
|
||||
|
||||
### 2.2 Link Citations & References
|
||||
|
||||
By default, the generator can convert `<a href="...">` elements into `[text][1]` citations, then place the actual links at the bottom of the document. This is handy for research workflows that demand references in a structured manner.
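
To make the idea concrete, here is a rough sketch of turning inline Markdown links into numbered citations with a reference list. It illustrates the concept only and is not Crawl4AI's implementation: the function name `to_citations`, the regular expression, and the output format are all assumptions.

```python
import re

# Inline links like [text](url); the lookbehind skips images such as ![alt](src).
LINK_PATTERN = re.compile(r"(?<!!)\[([^\]]+)\]\(([^)\s]+)\)")

def to_citations(markdown):
    """Rewrite inline links as numbered citations and collect a reference list."""
    numbers = {}  # url -> citation number, so repeated URLs share one number

    def replace(match):
        text, url = match.group(1), match.group(2)
        number = numbers.setdefault(url, len(numbers) + 1)
        return f"[{text}][{number}]"

    body = LINK_PATTERN.sub(replace, markdown)
    references = "\n".join(f"[{n}]: {url}" for url, n in numbers.items())
    return body, references

sample = (
    "See [docs](https://example.com/docs) and [home](https://example.com/). "
    "Again: [docs page](https://example.com/docs)."
)
body, references = to_citations(sample)
print(body)
print(references)
```

Reusing one number per URL keeps the reference list short when a page links to the same target several times.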
|
||||
|
||||
### 2.3 Optional Content Filters
|
||||
|
||||
Before or after the HTML-to-Markdown step, you can apply a **content filter** (like BM25 or Pruning) to reduce noise and produce a “fit_markdown”—a heavily pruned version focusing on the page’s main text. We’ll cover these filters shortly.
|
||||
|
||||
---
|
||||
|
||||
## 3. Configuring the Default Markdown Generator
|
||||
|
||||
You can tweak the output by passing an `options` dict to `DefaultMarkdownGenerator`. For example:
|
||||
|
||||
```python
import asyncio
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    # Example: ignore all links, don't escape HTML, and wrap text at 80 characters
    md_generator = DefaultMarkdownGenerator(
        options={
            "ignore_links": True,
            "escape_html": False,
            "body_width": 80
        }
    )

    config = CrawlerRunConfig(
        markdown_generator=md_generator
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/docs", config=config)
        if result.success:
            print("Markdown:\n", result.markdown[:500])  # Just a snippet
        else:
            print("Crawl failed:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
|
||||
|
||||
Some commonly used `options`:
|
||||
|
||||
- **`ignore_links`** (bool): Whether to remove all hyperlinks in the final markdown.
|
||||
- **`ignore_images`** (bool): Remove all `![image]()` references.
|
||||
- **`escape_html`** (bool): Turn HTML entities into text (default is often `True`).
|
||||
- **`body_width`** (int): Wrap text at N characters. `0` or `None` means no wrapping.
|
||||
- **`skip_internal_links`** (bool): If `True`, omit `#localAnchors` or internal links referencing the same page.
|
||||
- **`include_sup_sub`** (bool): Attempt to handle `<sup>` / `<sub>` in a more readable way.
|
||||
|
||||
---
|
||||
|
||||
## 4. Content Filters
|
||||
|
||||
**Content filters** selectively remove or rank sections of text before turning them into Markdown. This is especially helpful if your page has ads, nav bars, or other clutter you don’t want.
|
||||
|
||||
### 4.1 BM25ContentFilter
|
||||
|
||||
If you have a **search query**, BM25 is a good choice:
|
||||
|
||||
```python
|
||||
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
|
||||
from crawl4ai.content_filter_strategy import BM25ContentFilter
|
||||
from crawl4ai import CrawlerRunConfig
|
||||
|
||||
bm25_filter = BM25ContentFilter(
|
||||
user_query="machine learning",
|
||||
bm25_threshold=1.2,
|
||||
use_stemming=True
|
||||
)
|
||||
|
||||
md_generator = DefaultMarkdownGenerator(
|
||||
content_filter=bm25_filter,
|
||||
options={"ignore_links": True}
|
||||
)
|
||||
|
||||
config = CrawlerRunConfig(markdown_generator=md_generator)
|
||||
```
|
||||
|
||||
- **`user_query`**: The term you want to focus on. BM25 tries to keep only content blocks relevant to that query.
|
||||
- **`bm25_threshold`**: Raise it to keep fewer blocks; lower it to keep more.
|
||||
- **`use_stemming`**: If `True`, variations of words match (e.g., “learn,” “learning,” “learnt”).
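
To build intuition for how the threshold behaves, the following is a small, self-contained BM25 sketch in plain Python. It is not Crawl4AI's implementation, which may tokenize, stem, and weight content differently; the function names, the `k1` and `b` defaults, and the sample chunks are assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(chunks, query, k1=1.5, b=0.75):
    """Score each chunk against the query with a basic Okapi BM25 formula."""
    docs = [chunk.lower().split() for chunk in chunks]
    if not docs:
        return []
    avg_len = (sum(len(doc) for doc in docs) / len(docs)) or 1.0
    query_terms = query.lower().split()
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            doc_freq = sum(1 for other in docs if term in other)
            # The "+ 1" inside the log keeps the IDF non-negative.
            idf = math.log((len(docs) - doc_freq + 0.5) / (doc_freq + 0.5) + 1)
            tf = term_freq[term]
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores

def keep_relevant(chunks, query, threshold):
    """Keep only the chunks whose BM25 score reaches the threshold."""
    scored = zip(chunks, bm25_scores(chunks, query))
    return [chunk for chunk, score in scored if score >= threshold]

sample_chunks = [
    "Machine learning models learn patterns from data",
    "Subscribe to our newsletter for weekly updates",
    "Contact us about advertising and sponsorship",
]
print(keep_relevant(sample_chunks, "machine learning", threshold=1.2))
```

Raising `threshold` in this sketch drops more chunks and lowering it keeps more, which mirrors the role of `bm25_threshold` described above.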
|
||||
|
||||
**No query provided?** BM25 falls back to whatever context it can glean from page metadata, which amounts to a blunt filter that discards text with a low generic score. In practice, supply a query for best results.
|
||||
|
||||
### 4.2 PruningContentFilter
|
||||
|
||||
If you **don’t** have a specific query, or if you just want a robust “junk remover,” use `PruningContentFilter`. It analyzes text density, link density, HTML structure, and known patterns (like “nav,” “footer”) to systematically prune extraneous or repetitive sections.
|
||||
|
||||
```python
|
||||
from crawl4ai.content_filter_strategy import PruningContentFilter
|
||||
|
||||
prune_filter = PruningContentFilter(
|
||||
threshold=0.5,
|
||||
threshold_type="fixed", # or "dynamic"
|
||||
min_word_threshold=50
|
||||
)
|
||||
```
|
||||
|
||||
- **`threshold`**: Score boundary. Blocks below this score get removed.
|
||||
- **`threshold_type`**:
|
||||
- `"fixed"`: Straight comparison (`score >= threshold` keeps the block).
|
||||
- `"dynamic"`: The filter adjusts threshold in a data-driven manner.
|
||||
- **`min_word_threshold`**: Discard blocks under N words as likely too short or unhelpful.
|
||||
|
||||
**When to Use PruningContentFilter**
|
||||
- You want a broad cleanup without a user query.
|
||||
- The page has lots of repeated sidebars, footers, or disclaimers that hamper text extraction.
|
||||
|
||||
---
|
||||
|
||||
## 5. Using Fit Markdown
|
||||
|
||||
When a content filter is active, the library produces two forms of markdown inside `result.markdown_v2` or (if using the simplified field) `result.markdown`:
|
||||
|
||||
1. **`raw_markdown`**: The full unfiltered markdown.
|
||||
2. **`fit_markdown`**: A “fit” version where the filter has removed or trimmed noisy segments.
|
||||
|
||||
**Note**:
|
||||
- In earlier examples, you may see references to `result.markdown_v2`. Depending on your library version, you might access `result.markdown`, `result.markdown_v2`, or an object named `MarkdownGenerationResult`. The idea is the same: you’ll have a raw version and a filtered (“fit”) version if a filter is used.
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import PruningContentFilter

async def main():
    config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=PruningContentFilter(threshold=0.6),
            options={"ignore_links": True}
        )
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://news.example.com/tech", config=config)
        if result.success:
            print("Raw markdown:\n", result.markdown)

            # If a filter is used, we also have .fit_markdown:
            md_object = result.markdown_v2  # or your equivalent
            print("Filtered markdown:\n", md_object.fit_markdown)
        else:
            print("Crawl failed:", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
|
||||
|
||||
---
|
||||
|
||||
## 6. The `MarkdownGenerationResult` Object
|
||||
|
||||
If your library stores detailed markdown output in an object like `MarkdownGenerationResult`, you’ll see fields such as:
|
||||
|
||||
- **`raw_markdown`**: The direct HTML-to-markdown transformation (no filtering).
|
||||
- **`markdown_with_citations`**: A version that moves links to reference-style footnotes.
|
||||
- **`references_markdown`**: A separate string or section containing the gathered references.
|
||||
- **`fit_markdown`**: The filtered markdown if you used a content filter.
|
||||
- **`fit_html`**: The corresponding HTML snippet used to generate `fit_markdown` (helpful for debugging or advanced usage).
|
||||
|
||||
**Example**:
|
||||
|
||||
```python
|
||||
md_obj = result.markdown_v2 # your library’s naming may vary
|
||||
print("RAW:\n", md_obj.raw_markdown)
|
||||
print("CITED:\n", md_obj.markdown_with_citations)
|
||||
print("REFERENCES:\n", md_obj.references_markdown)
|
||||
print("FIT:\n", md_obj.fit_markdown)
|
||||
```
|
||||
|
||||
**Why Does This Matter?**
|
||||
- You can supply `raw_markdown` to an LLM if you want the entire text.
|
||||
- Or feed `fit_markdown` into a vector database to reduce token usage.
|
||||
- `references_markdown` can help you keep track of link provenance.
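
One way to act on this is a small helper that prefers `fit_markdown` and falls back to `raw_markdown` when no filter ran or the filter removed everything. The sketch below uses a hypothetical stand-in class (`MarkdownStandIn`) so it runs on its own; with a real crawl you would pass the markdown result object instead, and the helper name `pick_markdown` is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MarkdownStandIn:
    """Hypothetical stand-in exposing two of the fields listed above."""
    raw_markdown: str
    fit_markdown: Optional[str] = None

def pick_markdown(md_obj, prefer_fit=True):
    """Return fit_markdown when it is present and non-empty, else raw_markdown."""
    fit = getattr(md_obj, "fit_markdown", None)
    if prefer_fit and fit and fit.strip():
        return fit
    return md_obj.raw_markdown

filtered = MarkdownStandIn(raw_markdown="# Title\nNav | Body | Footer", fit_markdown="# Title\nBody")
unfiltered = MarkdownStandIn(raw_markdown="# Title\nNav | Body | Footer")

print(pick_markdown(filtered))    # uses the filtered text
print(pick_markdown(unfiltered))  # falls back to the raw text
```

The fallback matters because an overly aggressive filter can leave `fit_markdown` empty.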
|
||||
|
||||
---
|
||||
|
||||
|
||||
|
||||
## 7. Combining Filters (BM25 + Pruning) in Two Passes
|
||||
|
||||
You might want to **prune out** noisy boilerplate first (with `PruningContentFilter`), and then **rank what’s left** against a user query (with `BM25ContentFilter`). You don’t have to crawl the page twice. Instead:
|
||||
|
||||
1. **First pass**: Apply `PruningContentFilter` directly to the raw HTML from `result.html` (the crawler’s downloaded HTML).
|
||||
2. **Second pass**: Take the pruned HTML (or text) from step 1, and feed it into `BM25ContentFilter`, focusing on a user query.
|
||||
|
||||
### Two-Pass Example
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
from bs4 import BeautifulSoup

async def main():
    # 1. Crawl with minimal or no markdown generator, just get raw HTML
    config = CrawlerRunConfig(
        # If you only want raw HTML, you can skip passing a markdown_generator
        # or provide one but focus on .html in this example
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/tech-article", config=config)

    if not result.success or not result.html:
        print("Crawl failed or no HTML content.")
        return

    raw_html = result.html

    # 2. First pass: PruningContentFilter on raw HTML
    pruning_filter = PruningContentFilter(threshold=0.5, min_word_threshold=50)

    # filter_content returns a list of "text chunks" or cleaned HTML sections
    pruned_chunks = pruning_filter.filter_content(raw_html)
    # This list is basically pruned content blocks, presumably in HTML or text form

    # For demonstration, let's combine these chunks back into a single HTML-like string
    # or you could do further processing. It's up to your pipeline design.
    pruned_html = "\n".join(pruned_chunks)

    # 3. Second pass: BM25ContentFilter with a user query
    bm25_filter = BM25ContentFilter(
        user_query="machine learning",
        bm25_threshold=1.2,
        language="english"
    )

    bm25_chunks = bm25_filter.filter_content(pruned_html)  # returns a list of text chunks

    if not bm25_chunks:
        print("Nothing matched the BM25 query after pruning.")
        return

    # 4. Combine or display final results
    final_text = "\n---\n".join(bm25_chunks)

    print("==== PRUNED OUTPUT (first pass) ====")
    print(pruned_html[:500], "... (truncated)")  # preview

    print("\n==== BM25 OUTPUT (second pass) ====")
    print(final_text[:500], "... (truncated)")

if __name__ == "__main__":
    asyncio.run(main())
```
|
||||
|
||||
### What’s Happening?
|
||||
|
||||
1. **Raw HTML**: We crawl once and store the raw HTML in `result.html`.
|
||||
2. **PruningContentFilter**: Takes HTML + optional parameters. It extracts blocks of text or partial HTML, removing headings/sections deemed “noise.” It returns a **list of text chunks**.
|
||||
3. **Combine or Transform**: We join these pruned chunks back into a single HTML-like string. (Alternatively, you could store them in a list for further logic—whatever suits your pipeline.)
|
||||
4. **BM25ContentFilter**: We feed the pruned string into `BM25ContentFilter` with a user query. This second pass further narrows the content to chunks relevant to “machine learning.”
|
||||
|
||||
**No Re-Crawling**: We used `raw_html` from the first pass, so there’s no need to run `arun()` again—**no second network request**.
|
||||
|
||||
### Tips & Variations
|
||||
|
||||
- **Plain Text vs. HTML**: If your pruned output is mostly text, BM25 can still handle it; just keep in mind it expects a valid string input. If you supply partial HTML (like `"<p>some text</p>"`), it will parse it as HTML.
|
||||
- **Chaining in a Single Pipeline**: If your code supports it, you can chain multiple filters automatically. Otherwise, manual two-pass filtering (as shown) is straightforward.
|
||||
- **Adjust Thresholds**: If you see too much or too little text in step one, tweak `threshold=0.5` or `min_word_threshold=50`. Similarly, `bm25_threshold=1.2` can be raised/lowered for more or fewer chunks in step two.
|
||||
|
||||
### One-Pass Combination?
|
||||
|
||||
If your codebase or pipeline design allows applying multiple filters in one pass, you could do so. But often it’s simpler—and more transparent—to run them sequentially, analyzing each step’s result.
|
||||
|
||||
**Bottom Line**: By **manually chaining** your filtering logic in two passes, you get powerful incremental control over the final content. First, remove “global” clutter with Pruning, then refine further with BM25-based query relevance—without incurring a second network crawl.
|
||||
|
||||
---
|
||||
|
||||
## 8. Common Pitfalls & Tips
|
||||
|
||||
1. **No Markdown Output?**
|
||||
- Make sure the crawler actually retrieved HTML. If the site is heavily JS-based, you may need to enable dynamic rendering or wait for elements.
|
||||
- Check if your content filter is too aggressive. Lower thresholds or disable the filter to see if content reappears.
|
||||
|
||||
2. **Performance Considerations**
|
||||
- Very large pages with multiple filters can be slower. Consider `cache_mode` to avoid re-downloading.
|
||||
- If your final use case is LLM ingestion, consider summarizing further or chunking big texts.
|
||||
|
||||
3. **Take Advantage of `fit_markdown`**
|
||||
- Great for RAG pipelines, semantic search, or any scenario where extraneous boilerplate is unwanted.
|
||||
- Still verify the textual quality—some sites have crucial data in footers or sidebars.
|
||||
|
||||
4. **Adjusting `html2text` Options**
|
||||
- If you see lots of raw HTML slipping into the text, turn on `escape_html`.
|
||||
- If code blocks look messy, experiment with `mark_code` or `handle_code_in_pre`.
|
||||
|
||||
---
|
||||
|
||||
## 9. Summary & Next Steps
|
||||
|
||||
In this **Markdown Generation Basics** tutorial, you learned to:
|
||||
|
||||
- Configure the **DefaultMarkdownGenerator** with HTML-to-text options.
|
||||
- Use **BM25ContentFilter** for query-specific extraction or **PruningContentFilter** for general noise removal.
|
||||
- Distinguish between raw and filtered markdown (`fit_markdown`).
|
||||
- Leverage the `MarkdownGenerationResult` object to handle different forms of output (citations, references, etc.).
|
||||
|
||||
**Where to go from here**:
|
||||
|
||||
- **[Extracting JSON (No LLM)](./json-extraction-basic.md)**: If you need structured data instead of markdown, check out the library’s JSON extraction strategies.
|
||||
- **[Advanced Features](./advanced-features.md)**: Combine markdown generation with proxies, PDF exports, and more.
|
||||
- **[Explanations → Content Filters vs. Extraction Strategies](../../explanations/extraction-chunking.md)**: Dive deeper into how filters differ from chunking or semantic extraction.
|
||||
|
||||
Now you can produce high-quality Markdown from any website, focusing on exactly the content you need—an essential step for powering AI models, summarization pipelines, or knowledge-base queries.
|
||||
|
||||
**Last Updated**: 2024-XX-XX
|
||||
|
||||
---
|
||||
|
||||
That’s it for **Markdown Generation Basics**! Enjoy generating clean, noise-free markdown for your LLM workflows, content archives, or research.
|
||||
227
docs/md_v3/tutorials/targeted-crawling.md
Normal file
@@ -0,0 +1,227 @@
|
||||
|
||||
|
||||
# Smart Crawling Techniques
|
||||
|
||||
In the previous tutorial ([AsyncWebCrawler Basics](./async-webcrawler-basics.md)), you learned how to create an `AsyncWebCrawler` instance, run a basic crawl, and inspect the `CrawlResult`. Now it’s time to explore some of the **targeted crawling** features that let you:
|
||||
|
||||
1. Select specific parts of a webpage using CSS selectors
|
||||
2. Exclude or ignore certain page elements
|
||||
3. Wait for dynamic content to load using `wait_for` (with `css:` or `js:` rules)
|
||||
4. (Optionally) Handle iframes if your target site embeds additional content
|
||||
|
||||
> **Prerequisites**
|
||||
> - You’ve read or completed [AsyncWebCrawler Basics](./async-webcrawler-basics.md).
|
||||
> - You have a working environment for Crawl4AI (Playwright installed, etc.).
|
||||
|
||||
---
|
||||
|
||||
## 1. Targeting Specific Elements with CSS Selectors
|
||||
|
||||
### 1.1 Simple CSS Selector Usage
|
||||
|
||||
Let’s say you only need to crawl the main article content of a news page. By setting `css_selector` in `CrawlerRunConfig`, your final HTML or Markdown output focuses on that region. For example:
|
||||
|
||||
```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_cfg = BrowserConfig(headless=True)
    crawler_cfg = CrawlerRunConfig(
        css_selector=".article-body",       # Only capture .article-body content
        excluded_tags=["nav", "footer"]     # Optional: skip big nav & footer sections
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://news.example.com/story/12345",
            config=crawler_cfg
        )
        if result.success:
            print("[OK] Extracted content length:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
|
||||
|
||||
**Key Parameters**:
|
||||
- **`css_selector`**: Tells the crawler to focus on `.article-body`.
|
||||
- **`excluded_tags`**: Tells the crawler to skip specific HTML tags altogether (e.g., `nav` or `footer`).
|
||||
|
||||
**Tip**: For extremely noisy pages, you can further refine how you exclude certain elements by using `excluded_selector`, which takes a CSS selector you want removed from the final output.
|
||||
|
||||
### 1.2 Excluding Content with `excluded_selector`
|
||||
|
||||
If you want to remove certain sections within `.article-body` (like “related stories” sidebars), set:
|
||||
|
||||
```python
|
||||
CrawlerRunConfig(
|
||||
css_selector=".article-body",
|
||||
excluded_selector=".related-stories, .ads-banner"
|
||||
)
|
||||
```
|
||||
|
||||
This combination grabs the main article content while filtering out sidebars or ads.
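If the list of unwanted regions grows, it can help to keep the selectors in a Python list and join them into one comma-separated selector group. The sketch below assumes, as in the snippet above, that `excluded_selector` takes a single CSS selector string; `join_selectors` is a hypothetical helper for illustration and is not part of Crawl4AI.

```python
def join_selectors(selectors):
    """Join CSS selectors into one comma-separated selector group.

    Hypothetical helper for illustration; not part of Crawl4AI.
    Blank entries are dropped and duplicates removed, keeping first-seen order.
    """
    seen = []
    for selector in selectors:
        selector = selector.strip()
        if selector and selector not in seen:
            seen.append(selector)
    return ", ".join(seen)


excluded = join_selectors([".related-stories", ".ads-banner", " ", ".related-stories"])
print(excluded)  # .related-stories, .ads-banner
```

The resulting string could then be passed along as `CrawlerRunConfig(css_selector=".article-body", excluded_selector=excluded)`.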

---

## 2. Handling Iframes

Some sites embed extra content via `<iframe>` elements, for example embedded videos or external forms. If you want the crawler to traverse these iframes and merge their content into the final HTML or Markdown, set:

```python
crawler_cfg = CrawlerRunConfig(
    process_iframes=True
)
```

- **`process_iframes=True`**: Tells the crawler (specifically the underlying Playwright strategy) to recursively fetch iframe content and integrate it into `result.html` and `result.markdown`.

**Warning**: Not all sites allow iframes to be crawled (some cross-origin policies might block it). If you see partial or missing data, check the domain policy or logs for warnings.
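When iframe content is missing, a quick first check is whether the iframe's `src` points to a different origin than the page. The following is a minimal standard-library sketch of that comparison for `http`/`https` URLs; `origin` and `is_cross_origin` are hypothetical helper names, and a cross-origin result only hints at a possible cause, since whether the content is actually readable depends on the browser and the site.

```python
from urllib.parse import urljoin, urlsplit


def origin(url):
    """Return the (scheme, host, port) triple that defines a URL's origin."""
    parts = urlsplit(url)
    default_ports = {"http": 80, "https": 443}
    port = parts.port or default_ports.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def is_cross_origin(page_url, iframe_src):
    """True if the iframe's resolved URL has a different origin than the page."""
    resolved = urljoin(page_url, iframe_src)  # resolves relative src values
    return origin(resolved) != origin(page_url)


page = "https://example.com/iframe-heavy"
print(is_cross_origin(page, "/embed/comments"))                     # False
print(is_cross_origin(page, "https://player.example.org/video/1"))  # True
```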

---

## 3. Waiting for Dynamic Content

Many modern sites load content dynamically (e.g., after user interaction or asynchronously). Crawl4AI helps you wait for specific conditions before capturing the final HTML. Let’s look at `wait_for`.

### 3.1 `wait_for` Basics

In `CrawlerRunConfig`, `wait_for` can be a simple CSS selector or a JavaScript condition. Under the hood, Crawl4AI uses `smart_wait` to interpret what you provide.

```python
crawler_cfg = CrawlerRunConfig(
    wait_for="css:.main-article-loaded",
    page_timeout=30000
)
```

**Example**: `css:.main-article-loaded` means “Wait for an element with the class `main-article-loaded` to appear in the DOM.” `page_timeout` is given in milliseconds, so if the element doesn’t appear within 30 seconds, you’ll get a timeout.

### 3.2 Using Explicit Prefixes

The **`js:`** and **`css:`** prefixes tell the crawler explicitly which approach to use:

- **`wait_for="css:.comments-section"`** → Wait for `.comments-section` to appear
- **`wait_for="js:() => document.querySelectorAll('.comments').length > 5"`** → Wait until there are more than 5 (i.e., at least 6) `.comments` elements

**Code Example**:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        wait_for="js:() => document.querySelectorAll('.dynamic-items li').length >= 10",
        page_timeout=20000  # 20s
    )

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://example.com/async-list",
            config=config
        )
        if result.success:
            print("[OK] Dynamic items loaded. HTML length:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```
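Writing the JavaScript condition by hand makes quoting mistakes easy, so one option is to generate the string. This is a small sketch under the assumption that a "wait for at least N matching elements" condition is all you need; `js_min_count` is a hypothetical helper, not a Crawl4AI function.

```python
import json


def js_min_count(selector, minimum):
    """Build a `js:` wait_for string that passes once at least `minimum`
    elements match `selector`. Hypothetical helper, not part of Crawl4AI."""
    if minimum < 1:
        raise ValueError("minimum must be at least 1")
    # json.dumps yields a double-quoted, escaped string literal that JavaScript accepts
    js_selector = json.dumps(selector)
    return f"js:() => document.querySelectorAll({js_selector}).length >= {minimum}"


print(js_min_count(".dynamic-items li", 10))
# js:() => document.querySelectorAll(".dynamic-items li").length >= 10
```

The returned string can be used directly as the `wait_for` value in `CrawlerRunConfig`.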

### 3.3 Fallback Logic

If you **don’t** prefix `js:` or `css:`, Crawl4AI tries to detect whether your string looks like a CSS selector or a JavaScript snippet. It’ll first attempt a CSS selector. If that fails, it tries to evaluate it as a JavaScript function. This can be convenient but can also lead to confusion if the library guesses incorrectly. It’s often best to be explicit:

- **`"css:.my-selector"`** → Force CSS
- **`"js:() => myAppState.isReady()"`** → Force JavaScript
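To see why guessing is fragile, consider a toy classifier for `wait_for` strings. The heuristic for unprefixed values below is an assumption made up for this illustration and is not Crawl4AI's actual detection logic; `classify_wait_for` is a hypothetical name.

```python
def classify_wait_for(value):
    """Split a wait_for string into (kind, expression).

    Illustrative only: the heuristic for unprefixed strings is an assumption,
    not Crawl4AI's actual detection logic.
    """
    value = value.strip()
    if value.startswith("css:"):
        return ("css", value[4:].strip())
    if value.startswith("js:"):
        return ("js", value[3:].strip())
    # No prefix: guess. Arrow functions and `function` keywords look like JS.
    looks_like_js = "=>" in value or value.startswith(("function", "("))
    return ("js" if looks_like_js else "css", value)


print(classify_wait_for("css:.my-selector"))   # ('css', '.my-selector')
print(classify_wait_for(".comments-section"))  # ('css', '.comments-section')
print(classify_wait_for('a[href*="=>"]'))      # ('js', 'a[href*="=>"]')
```

The last call shows a valid CSS selector being misread as JavaScript by this toy heuristic, which is the kind of surprise an explicit prefix avoids.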

**What Should My JavaScript Return?**
- A function that returns `true` once the condition is met and `false` while it is not.
- The function can be sync or async; the crawler wraps it in an async loop and polls until it returns `true` or the timeout expires.
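The polling behaviour described above can be pictured with a plain `asyncio` loop. This is a conceptual sketch in Python rather than the crawler's real implementation (which evaluates JavaScript in the page); `poll_until` and its `interval_ms` parameter are names assumed for this example.

```python
import asyncio
import inspect
import time


async def poll_until(predicate, timeout_ms, interval_ms=100):
    """Call `predicate` repeatedly until it returns a truthy value.

    Accepts sync or async predicates. Raises TimeoutError if the condition
    is still false after `timeout_ms` milliseconds.
    """
    deadline = time.monotonic() + timeout_ms / 1000
    while True:
        outcome = predicate()
        if inspect.isawaitable(outcome):  # async predicate: await its result
            outcome = await outcome
        if outcome:
            return True
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout_ms} ms")
        await asyncio.sleep(interval_ms / 1000)


async def demo():
    calls = {"n": 0}

    def ready():
        calls["n"] += 1
        return calls["n"] >= 3  # becomes true on the third poll

    await poll_until(ready, timeout_ms=1000, interval_ms=10)
    return calls["n"]


print(asyncio.run(demo()))  # 3
```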

---

## 4. Example: Targeted Crawl with Iframes & Wait-For

Below is a more advanced snippet combining these features:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

async def main():
    browser_cfg = BrowserConfig(headless=True)
    crawler_cfg = CrawlerRunConfig(
        css_selector=".main-content",
        process_iframes=True,
        wait_for="css:.loaded-indicator",   # Wait for .loaded-indicator to appear
        excluded_tags=["script", "style"],  # Remove script/style tags
        page_timeout=30000,
        verbose=True
    )

    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(
            url="https://example.com/iframe-heavy",
            config=crawler_cfg
        )
        if result.success:
            print("[OK] Crawled with iframes. Length of final HTML:", len(result.html))
        else:
            print("[ERROR]", result.error_message)

if __name__ == "__main__":
    asyncio.run(main())
```

**What’s Happening**:
1. **`css_selector=".main-content"`** → Focus only on `.main-content` for final extraction.
2. **`process_iframes=True`** → Recursively handle `<iframe>` content.
3. **`wait_for="css:.loaded-indicator"`** → Don’t extract until the page shows `.loaded-indicator`.
4. **`excluded_tags=["script", "style"]`** → Remove script and style tags for a cleaner result.

---

## 5. Common Pitfalls & Tips

1. **Be Explicit**: Using `"js:"` or `"css:"` can spare you headaches if the library guesses incorrectly.
2. **Timeouts**: If the site never triggers your wait condition, a `TimeoutError` can occur. Check your logs or use `verbose=True` for more clues.
3. **Infinite Scroll**: If you have repeated “load more” loops, you might use [Hooks & Custom Code](./hooks-custom.md) or add your own JavaScript for repeated scrolling.
4. **Iframes**: Some iframes are cross-origin or protected. In those cases, you might not be able to read their content. Check your logs for permission errors.
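For the timeout pitfall, one possible pattern is to retry with a more generous timeout before giving up. The sketch below is library-agnostic: `crawl_once` is a caller-supplied coroutine function (hypothetical here), the timeout values are arbitrary, and catching the built-in `TimeoutError` is an assumption, since the exception type a real crawl raises may differ.

```python
import asyncio
from types import SimpleNamespace


async def crawl_with_escalating_timeout(crawl_once, timeouts_ms=(10000, 20000, 40000)):
    """Try `crawl_once(timeout_ms)` with increasingly generous timeouts.

    `crawl_once` must return an object with `.success` and `.error_message`,
    or raise TimeoutError. Returns the first successful result, else the
    last failure.
    """
    last = None
    for timeout_ms in timeouts_ms:
        try:
            last = await crawl_once(timeout_ms)
        except TimeoutError as exc:
            last = SimpleNamespace(success=False, error_message=str(exc))
        if last.success:
            return last
    return last


async def fake_crawl(timeout_ms):
    # Stand-in for a real crawl: pretends the page needs at least 15 s.
    if timeout_ms < 15000:
        raise TimeoutError(f"wait condition not met within {timeout_ms} ms")
    return SimpleNamespace(success=True, error_message=None, timeout_ms=timeout_ms)


result = asyncio.run(crawl_with_escalating_timeout(fake_crawl))
print(result.success, result.timeout_ms)  # True 20000
```

With Crawl4AI, `crawl_once` could build a `CrawlerRunConfig` with `page_timeout=timeout_ms` and return the outcome of `crawler.arun(...)`.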

---

## 6. Summary & Next Steps

With these **Targeted Crawling Techniques** you can:

- Precisely target or exclude content using CSS selectors.
- Automatically wait for dynamic elements to load using `wait_for`.
- Merge iframe content into your main page result.

### Where to Go Next?

- **[Link & Media Analysis](./link-media-analysis.md)**: Dive deeper into analyzing extracted links and media items.
- **[Hooks & Custom Code](./hooks-custom.md)**: Learn how to implement repeated actions like infinite scroll or login sequences using hooks.
- **Reference**: For an exhaustive list of parameters and advanced usage, see [CrawlerRunConfig Reference](../../reference/configuration.md).

If you run into issues or want to see real examples from other users, check the [How-To Guides](../../how-to/) or raise a question on GitHub.

**Last updated**: 2024-XX-XX

---

That’s it for **Targeted Crawling Techniques**! You’re now equipped to handle complex pages that rely on dynamic loading, custom CSS selectors, and iframe embedding.

96
mkdocs_v2.yml
Normal file
@@ -0,0 +1,96 @@
site_name: Crawl4AI Documentation
site_description: 🔥🕷️ Crawl4AI, Open-source LLM Friendly Web Crawler & Scrapper
site_url: https://docs.crawl4ai.com
repo_url: https://github.com/unclecode/crawl4ai
repo_name: unclecode/crawl4ai
docs_dir: docs/md_v3

nav:
  - Home: index.md

  - Tutorials:
      - "Getting Started": tutorials/getting-started.md
      - "AsyncWebCrawler Basics": tutorials/async-webcrawler-basics.md
      - "Targeted Crawling Techniques": tutorials/targeted-crawling.md
      - "Link & Media Analysis": tutorials/link-media-analysis.md
      - "Advanced Features (Proxy, PDF, Screenshots)": tutorials/advanced-features.md
      - "Hooks & Custom Code": tutorials/hooks-custom.md
      - "Markdown Generation Basics": tutorials/markdown-basics.md
      - "Extracting JSON (No LLM)": tutorials/json-extraction-basic.md
      - "Extracting JSON (LLM)": tutorials/json-extraction-llm.md
      - "Deploying with Docker (Quickstart)": tutorials/docker-quickstart.md

  - How-To Guides:
      - "Advanced Browser Configuration": how-to/advanced-browser-config.md
      - "Managing Browser Contexts & Remote Browsers": how-to/browser-contexts-remote.md
      - "Identity-Based Crawling (Anti-Bot)": how-to/identity-anti-bot.md
      - "Link & Media Analysis": how-to/link-media-analysis.md
      - "Markdown Generation Customization": how-to/markdown-custom.md
      - "Structured Data Extraction (Advanced)": how-to/structured-data-advanced.md
      - "Deployment Options": how-to/deployment-options.md
      - "Performance & Caching": how-to/performance-caching.md

  - Explanations:
      - "AsyncWebCrawler & Internal Flow": explanations/async-webcrawler-flow.md
      - "Configuration Objects Explained": explanations/configuration-objects.md
      - "Browser Context & Managed Browser": explanations/browser-management.md
      - "Markdown Generation Architecture": explanations/markdown-architecture.md
      - "Extraction & Chunking Strategies": explanations/extraction-chunking.md
      - "Identity-Based Crawling & Anti-Bot": explanations/identity-anti-bot.md
      - "Deployment Architectures": explanations/deployment-architectures.md

  - Reference:
      - "Configuration": reference/configuration.md
      - "Core Crawler": reference/core-crawler.md
      - "Browser Strategies": reference/browser-strategies.md
      - "Markdown Generation": reference/markdown-generation.md
      - "Content Filters": reference/content-filters.md
      - "Extraction Strategies": reference/extraction-strategies.md
      - "Chunking Strategies": reference/chunking-strategies.md
      - "Identity & Utility": reference/identity-utilities.md
      - "Models": reference/models.md

  - Blog:
      - "Blog Overview": blog/index.md
      # You can add real-life application posts here in the future
      # - "Cool Real-World E-Commerce Scraping": blog/ecommerce-case-study.md
      # - "Dealing with Complex Anti-Bot Systems": blog/anti-bot-tricks.md

theme:
  name: terminal
  palette: dark

plugins:
  - search
  - mkdocstrings:
      handlers:
        python:
          analysis:
            follow_imports: true
          rendering:
            show_root_full_path: false

markdown_extensions:
  - codehilite
  - toc:
      permalink: true
  - pymdownx.highlight:
      anchor_linenums: true
  - pymdownx.inlinehilite
  - pymdownx.snippets
  - pymdownx.superfences
  - admonition
  - pymdownx.details
  - attr_list
  - tables

extra_css:
  - assets/styles.css
  - assets/highlight.css
  - assets/dmvendor.css

extra_javascript:
  - assets/highlight.min.js
  - assets/highlight_init.js