Compare commits
24 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
| | b309bc34e1 | | |
| | b8147b64e0 | | |
| | aab6ea022e | | |
| | dd17ed0e63 | | |
| | 768aa06ceb | | |
| | 9ffa34b697 | | |
| | 740802c491 | | |
| | b9ac96c332 | | |
| | d06535388a | | |
| | 2b73bdf6b0 | | |
| | 6aa803d712 | | |
| | 320afdea64 | | |
| | ccbe72cfc1 | | |
| | b9bbd42373 | | |
| | 68e9144ce3 | | |
| | 9b2b267820 | | |
| | ff3524d9b1 | | |
| | b99d20b725 | | |
| | 768b93140f | | |
| | 4750810a67 | | |
| | e0e0db4247 | | |
| | bccadec887 | | |
| | 0759503e50 | | |
| | 7f1c020746 | | |
11
.gitignore
vendored
@@ -196,4 +196,13 @@ docs/.DS_Store
|
||||
tmp/
|
||||
test_env/
|
||||
**/.DS_Store
|
||||
**/.DS_Store
|
||||
**/.DS_Store
|
||||
|
||||
todo.md
|
||||
git_changes.py
|
||||
git_changes.md
|
||||
pypi_build.sh
|
||||
git_issues.py
|
||||
git_issues.md
|
||||
|
||||
.tests/
|
||||
161
CHANGELOG.md
@@ -1,5 +1,166 @@
|
||||
# Changelog
|
||||
|
||||
## [v0.3.71] - 2024-10-18
|
||||
|
||||
### Changes
|
||||
1. **Version Update**:
|
||||
- Updated version number from 0.3.7 to 0.3.71.
|
||||
|
||||
2. **Crawler Enhancements**:
|
||||
- Added `sleep_on_close` option to AsyncPlaywrightCrawlerStrategy for delayed browser closure.
|
||||
- Improved context creation with additional options:
|
||||
- Enabled `accept_downloads` and `java_script_enabled`.
|
||||
- Added a cookie to enable cookies by default.
|
||||
|
||||
3. **Error Handling Improvements**:
|
||||
- Enhanced error messages in AsyncWebCrawler's `arun` method.
|
||||
- Updated error reporting format for better visibility and consistency.
|
||||
|
||||
4. **Performance Optimization**:
|
||||
- Commented out automatic page and context closure in `crawl` method to potentially improve performance in certain scenarios.
|
||||
|
||||
### Documentation
|
||||
- Updated quickstart notebook:
|
||||
- Changed installation command to use the released package instead of GitHub repository.
|
||||
- Updated kernel display name.
|
||||
|
||||
### Developer Notes
|
||||
- Minor code refactoring and cleanup.
|
||||
|
||||
## [v0.3.7] - 2024-10-17
|
||||
|
||||
### New Features
|
||||
1. **Enhanced Browser Stealth**:
|
||||
- Implemented `playwright_stealth` for improved bot detection avoidance.
|
||||
- Added `StealthConfig` for fine-tuned control over stealth parameters.
|
||||
|
||||
2. **User Simulation**:
|
||||
- New `simulate_user` option to mimic human-like interactions (mouse movements, clicks, keyboard presses).
|
||||
|
||||
3. **Navigator Override**:
|
||||
- Added `override_navigator` option to modify navigator properties, further improving bot detection evasion.
|
||||
|
||||
4. **Improved iframe Handling**:
|
||||
- New `process_iframes` parameter to extract and integrate iframe content into the main page.
|
||||
|
||||
5. **Flexible Browser Selection**:
|
||||
- Support for choosing between Chromium, Firefox, and WebKit browsers.
|
||||
|
||||
6. **Include Links in Markdown**:
|
||||
- Added support for including links in Markdown content by defining a new flag, `include_links_on_markdown`, in the `crawl` method.
|
||||
|
||||
### Improvements
|
||||
1. **Better Error Handling**:
|
||||
- Enhanced error reporting in WebScrappingStrategy with detailed error messages and suggestions.
|
||||
- Added console message and error logging for better debugging.
|
||||
|
||||
2. **Image Processing Enhancements**:
|
||||
- Improved image dimension updating and filtering logic.
|
||||
|
||||
3. **Crawling Flexibility**:
|
||||
- Added support for custom viewport sizes.
|
||||
- Implemented delayed content retrieval with `delay_before_return_html` parameter.
|
||||
|
||||
4. **Performance Optimization**:
|
||||
- Adjusted default semaphore count for parallel crawling.
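
The semaphore mentioned above bounds how many crawls run at once. The following is a minimal stand-alone sketch of that pattern, modelled on `crawl_many` in this diff but not taken from the library: `run_bounded` and `worker` are hypothetical names, and `worker` stands in for a single crawl.

```python
import asyncio
from typing import Any, Awaitable, Callable, List


async def run_bounded(urls: List[str],
                      worker: Callable[[str], Awaitable[Any]],
                      limit: int = 5) -> List[Any]:
    """Run `worker` for every URL, with at most `limit` running concurrently."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url: str) -> Any:
        # Only `limit` coroutines can hold the semaphore at the same time.
        async with semaphore:
            return await worker(url)

    # return_exceptions=True keeps one failure from cancelling the batch;
    # failed entries come back as exception objects in the result list.
    return await asyncio.gather(*(guarded(u) for u in urls),
                                return_exceptions=True)
```

Callers need to check each entry with `isinstance(entry, Exception)` before using it, since successes and failures share one list.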
|
||||
|
||||
### Bug Fixes
|
||||
- Fixed an issue where the HTML content could be empty after processing.
|
||||
|
||||
### Examples
|
||||
- Added new example `crawl_with_user_simulation()` demonstrating the use of user simulation and navigator override features.
|
||||
|
||||
### Developer Notes
|
||||
- Refactored code for better maintainability and readability.
|
||||
- Updated browser launch arguments for improved compatibility and performance.
|
||||
|
||||
## [v0.3.6] - 2024-10-12
|
||||
|
||||
### 1. Improved Crawling Control
|
||||
- **New Hook**: Added `before_retrieve_html` hook in `AsyncPlaywrightCrawlerStrategy`.
|
||||
- **Delayed HTML Retrieval**: Introduced `delay_before_return_html` parameter to allow waiting before retrieving HTML content.
|
||||
- Useful for pages with delayed content loading.
|
||||
- **Flexible Timeout**: `smart_wait` function now uses `page_timeout` (default 60 seconds) instead of a fixed 30-second timeout.
|
||||
- Provides better handling for slow-loading pages.
|
||||
- **How to use**: Set `page_timeout=your_desired_timeout` (in milliseconds) when calling `crawler.arun()`.
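
The hook mechanism behind `before_retrieve_html` accepts either a plain function or a coroutine function. Below is a small stand-alone sketch of that dispatch, modelled on `set_hook` and `execute_hook` in this diff. `HookRegistry` is a hypothetical name, only two hook types are registered for brevity, and `inspect.iscoroutinefunction` is used here in place of the `asyncio` variant seen in the diff.

```python
import inspect
from typing import Any, Callable, Dict, Optional


class HookRegistry:
    """Hypothetical stand-in for the strategy's hook table."""

    def __init__(self) -> None:
        self.hooks: Dict[str, Optional[Callable]] = {
            "before_goto": None,
            "before_retrieve_html": None,
        }

    def set_hook(self, hook_type: str, hook: Callable) -> None:
        # Unknown hook names are rejected instead of being silently stored.
        if hook_type not in self.hooks:
            raise ValueError(f"Invalid hook type: {hook_type}")
        self.hooks[hook_type] = hook

    async def execute_hook(self, hook_type: str, *args: Any) -> Any:
        hook = self.hooks.get(hook_type)
        if hook:
            # Await coroutine functions, call plain functions directly.
            if inspect.iscoroutinefunction(hook):
                return await hook(*args)
            return hook(*args)
        # With no hook registered, pass the first argument through unchanged.
        return args[0] if args else None
```

Note the units in this section: `page_timeout` is given in milliseconds, whereas `delay_before_return_html` is handed to `asyncio.sleep` in the diff and is therefore in seconds.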
|
||||
|
||||
### 2. Browser Type Selection
|
||||
- Added support for different browser types (Chromium, Firefox, WebKit).
|
||||
- Users can now specify the browser type when initializing AsyncWebCrawler.
|
||||
- **How to use**: Set `browser_type="firefox"` or `browser_type="webkit"` when initializing AsyncWebCrawler.
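
Going by `start()` in this diff, the browser is chosen with an exact string comparison and anything unrecognised falls back to Chromium. The sketch below only mirrors that branching so the fallback is easy to see. `resolve_browser_type` is a hypothetical helper, not part of crawl4ai.

```python
def resolve_browser_type(browser_type: str = "chromium") -> str:
    """Return the engine that would be launched for a given browser_type."""
    if browser_type == "firefox":
        return "firefox"
    elif browser_type == "webkit":
        return "webkit"
    # Every other value, including typos and different casing, ends up here.
    return "chromium"
```

A value such as `"Firefox"` would therefore launch Chromium without any warning, so it may be worth validating the string before passing it in.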
|
||||
|
||||
### 3. Screenshot Capture
|
||||
- Added ability to capture screenshots during crawling.
|
||||
- Useful for debugging and content verification.
|
||||
- **How to use**: Set `screenshot=True` when calling `crawler.arun()`.
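
Judging from `take_screenshot` in this diff, the screenshot comes back as a base64-encoded string, not raw bytes. A minimal sketch of turning it into a file follows. `save_screenshot` is a hypothetical helper, and the input is assumed to be the screenshot string from a crawl result.

```python
import base64
from pathlib import Path


def save_screenshot(screenshot_b64: str, path: str) -> int:
    """Decode a base64 screenshot string, write it to `path`, return byte count."""
    data = base64.b64decode(screenshot_b64)
    Path(path).write_bytes(data)
    return len(data)
```

The failure path in the diff encodes a JPEG placeholder image, while a successful Playwright screenshot defaults to PNG, so the file extension chosen by the caller is a guess unless the bytes are inspected.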
|
||||
|
||||
### 4. Enhanced LLM Extraction Strategy
|
||||
- Added support for multiple LLM providers (OpenAI, Hugging Face, Ollama).
|
||||
- **Custom Arguments**: Added support for passing extra arguments to LLM providers via `extra_args` parameter.
|
||||
- **Custom Headers**: Users can now pass custom headers to the extraction strategy.
|
||||
- **How to use**: Specify the desired provider and custom arguments when using `LLMExtractionStrategy`.
|
||||
|
||||
### 5. iframe Content Extraction
|
||||
- New feature to process and extract content from iframes.
|
||||
- **How to use**: Set `process_iframes=True` in the crawl method.
|
||||
|
||||
### 6. Delayed Content Retrieval
|
||||
- Introduced `get_delayed_content` method in `AsyncCrawlResponse`.
|
||||
- Allows retrieval of content after a specified delay, useful for dynamically loaded content.
|
||||
- **How to use**: Access `result.get_delayed_content(delay_in_seconds)` after crawling.
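
To illustrate what such a delayed getter does, here is a stand-alone sketch of the closure pattern visible at the end of `crawl` in this diff: the getter sleeps, then reads whatever the page holds at that moment. `FakePage`, `make_delayed_getter` and `demo` are hypothetical stand-ins. No browser is involved.

```python
import asyncio
from typing import Awaitable, Callable, Tuple


class FakePage:
    """Stand-in for a browser page whose content changes over time."""

    def __init__(self) -> None:
        self.html = "<p>loading</p>"

    async def content(self) -> str:
        return self.html


def make_delayed_getter(page: FakePage) -> Callable[[float], Awaitable[str]]:
    async def get_delayed_content(delay: float = 5.0) -> str:
        await asyncio.sleep(delay)    # delay is in seconds
        return await page.content()   # read the page as it is *now*
    return get_delayed_content


async def demo() -> Tuple[str, str]:
    page = FakePage()
    getter = make_delayed_getter(page)
    first = await page.content()

    async def finish_loading() -> None:
        await asyncio.sleep(0.01)
        page.html = "<p>done</p>"

    loader = asyncio.create_task(finish_loading())
    later = await getter(0.05)        # waits long enough for the update
    await loader
    return first, later
```

Because the getter holds a reference to the page, it can only work while that page is still open.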
|
||||
|
||||
## Improvements and Optimizations
|
||||
|
||||
### 1. AsyncWebCrawler Enhancements
|
||||
- **Flexible Initialization**: Now accepts arbitrary keyword arguments, passed directly to the crawler strategy.
|
||||
- Allows for more customized setups.
|
||||
|
||||
### 2. Image Processing Optimization
|
||||
- Enhanced image handling in WebScrappingStrategy.
|
||||
- Added filtering for small, invisible, or irrelevant images.
|
||||
- Improved image scoring system for better content relevance.
|
||||
- Implemented JavaScript-based image dimension updating for more accurate representation.
|
||||
|
||||
### 3. Database Schema Auto-updates
|
||||
- Automatic database schema updates ensure compatibility with the latest version.
|
||||
|
||||
### 4. Enhanced Error Handling and Logging
|
||||
- Improved error messages and logging for easier debugging.
|
||||
|
||||
### 5. Content Extraction Refinements
|
||||
- Refined HTML sanitization process.
|
||||
- Improved handling of base64 encoded images.
|
||||
- Enhanced Markdown conversion process.
|
||||
- Optimized content extraction algorithms.
|
||||
|
||||
### 6. Utility Function Enhancements
|
||||
- `perform_completion_with_backoff` function now supports additional arguments for more customized API calls to LLM providers.
|
||||
|
||||
## Bug Fixes
|
||||
- Fixed an issue where image tags were being prematurely removed during content extraction.
|
||||
|
||||
## Examples and Documentation
|
||||
- Updated `quickstart_async.py` with examples of:
|
||||
- Using custom headers in LLM extraction.
|
||||
- Different LLM provider usage (OpenAI, Hugging Face, Ollama).
|
||||
- Custom browser type usage.
|
||||
|
||||
## Developer Notes
|
||||
- Refactored code for better maintainability, flexibility, and performance.
|
||||
- Enhanced type hinting throughout the codebase for improved development experience.
|
||||
- Expanded error handling for more robust operation.
|
||||
|
||||
These updates significantly enhance the flexibility, accuracy, and robustness of crawl4ai, providing users with more control and options for their web crawling and content extraction tasks.
|
||||
|
||||
## [v0.3.5] - 2024-09-02
|
||||
|
||||
Enhance AsyncWebCrawler with smart waiting and screenshot capabilities
|
||||
|
||||
- Implement smart_wait function in AsyncPlaywrightCrawlerStrategy
|
||||
- Add screenshot support to AsyncCrawlResponse and AsyncWebCrawler
|
||||
- Improve error handling and timeout management in crawling process
|
||||
- Fix typo in CrawlResult model (responser_headers -> response_headers)
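
The `smart_wait` function listed above decides whether a `wait_for` string is JavaScript or a CSS selector. The sketch below reproduces only that classification step as it appears in this diff. `classify_wait_for` is a hypothetical helper, and it omits the diff's final fallback of retrying an invalid selector as JavaScript.

```python
from typing import Tuple


def classify_wait_for(wait_for: str) -> Tuple[str, str]:
    """Return ("js" | "css", payload) for a wait_for string."""
    wait_for = wait_for.strip()
    if wait_for.startswith("js:"):
        return "js", wait_for[3:].strip()    # explicit JavaScript
    if wait_for.startswith("css:"):
        return "css", wait_for[4:].strip()   # explicit CSS selector
    if wait_for.startswith("()") or wait_for.startswith("function"):
        return "js", wait_for                # looks like a JS function
    return "css", wait_for                   # default: treat as a selector
```

Using the explicit `js:` or `css:` prefix avoids relying on the auto-detection heuristic.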
|
||||
|
||||
## [v0.2.77] - 2024-08-04
|
||||
|
||||
Significant improvements in text processing and performance:
|
||||
|
||||
30
README.md
@@ -8,7 +8,16 @@
|
||||
|
||||
Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. 🆓🌐
|
||||
|
||||
> Looking for the synchronous version? Check out [README.sync.md](./README.sync.md).
|
||||
> Looking for the synchronous version? Check out [README.sync.md](./README.sync.md). You can also access the previous version in the branch [V0.2.76](https://github.com/unclecode/crawl4ai/blob/v0.2.76).
|
||||
|
||||
## New update 0.3.6
|
||||
- 🌐 Multi-browser support (Chromium, Firefox, WebKit)
|
||||
- 🖼️ Improved image processing with lazy-loading detection
|
||||
- 🔧 Custom page timeout parameter for better control over crawling behavior
|
||||
- 🕰️ Enhanced handling of delayed content loading
|
||||
- 🔑 Custom headers support for LLM interactions
|
||||
- 🖼️ iframe content extraction for comprehensive page analysis
|
||||
- ⏱️ Flexible timeout and delayed content retrieval options
|
||||
|
||||
## Try it Now!
|
||||
|
||||
@@ -38,7 +47,6 @@ Crawl4AI simplifies asynchronous web crawling and data extraction, making it acc
|
||||
- 🔄 Session management for complex multi-page crawling scenarios
|
||||
- 🌐 Asynchronous architecture for improved performance and scalability
|
||||
|
||||
|
||||
## Installation 🛠️
|
||||
|
||||
Crawl4AI offers flexible installation options to suit various use cases. You can install it as a Python package or use Docker.
|
||||
@@ -55,9 +63,21 @@ For basic web crawling and scraping tasks:
|
||||
pip install crawl4ai
|
||||
```
|
||||
|
||||
By default this will install the asynchronous version of Crawl4AI, using Playwright for web crawling.
|
||||
By default, this will install the asynchronous version of Crawl4AI, using Playwright for web crawling.
|
||||
|
||||
👉 Note: The standard version of Crawl4AI uses Playwright for asynchronous crawling. If you encounter an error saying that Playwright is not installed, you can run playwright install. However, this should be done automatically during the setup process.
|
||||
👉 Note: When you install Crawl4AI, the setup script should automatically install and set up Playwright. However, if you encounter any Playwright-related errors, you can manually install it using one of these methods:
|
||||
|
||||
1. Through the command line:
|
||||
```bash
|
||||
playwright install
|
||||
```
|
||||
|
||||
2. If the above doesn't work, try this more specific command:
|
||||
```bash
|
||||
python -m playwright install chromium
|
||||
```
|
||||
|
||||
This second method has proven to be more reliable in some cases.
|
||||
|
||||
#### Installation with Synchronous Version
|
||||
|
||||
@@ -112,7 +132,7 @@ async def main():
|
||||
result = await crawler.arun(
|
||||
url="https://www.nbcnews.com/business",
|
||||
js_code=js_code,
|
||||
css_selector="article.tease-card",
|
||||
css_selector=".wide-tease-item__description",
|
||||
bypass_cache=True
|
||||
)
|
||||
print(result.extracted_content)
|
||||
|
||||
@@ -3,7 +3,7 @@
|
||||
from .async_webcrawler import AsyncWebCrawler
|
||||
from .models import CrawlResult
|
||||
|
||||
__version__ = "0.3.2"
|
||||
__version__ = "0.3.71"
|
||||
|
||||
__all__ = [
|
||||
"AsyncWebCrawler",
|
||||
|
||||
558
crawl4ai/async_crawler_strategy copy.py
Normal file
@@ -0,0 +1,558 @@
|
||||
import asyncio
|
||||
import base64
|
||||
import time
|
||||
from abc import ABC, abstractmethod
|
||||
from typing import Callable, Dict, Any, List, Optional, Awaitable
|
||||
import os
|
||||
from playwright.async_api import async_playwright, Page, Browser, Error
|
||||
from io import BytesIO
|
||||
from PIL import Image, ImageDraw, ImageFont
|
||||
from pathlib import Path
|
||||
from playwright.async_api import ProxySettings
|
||||
from pydantic import BaseModel
|
||||
import hashlib
|
||||
import json
|
||||
import uuid
|
||||
from playwright_stealth import stealth_async
|
||||
|
||||
class AsyncCrawlResponse(BaseModel):
    html: str
    response_headers: Dict[str, str]
    status_code: int
    screenshot: Optional[str] = None
    get_delayed_content: Optional[Callable[[Optional[float]], Awaitable[str]]] = None

    class Config:
        arbitrary_types_allowed = True
|
||||
|
||||
class AsyncCrawlerStrategy(ABC):
|
||||
@abstractmethod
|
||||
async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
async def crawl_many(self, urls: List[str], **kwargs) -> List[AsyncCrawlResponse]:
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
async def take_screenshot(self, url: str) -> str:
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def update_user_agent(self, user_agent: str):
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def set_hook(self, hook_type: str, hook: Callable):
|
||||
pass
|
||||
|
||||
class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
def __init__(self, use_cached_html=False, js_code=None, **kwargs):
|
||||
self.use_cached_html = use_cached_html
|
||||
self.user_agent = kwargs.get(
|
||||
"user_agent",
|
||||
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
|
||||
"(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
|
||||
)
|
||||
self.proxy = kwargs.get("proxy")
|
||||
self.headless = kwargs.get("headless", True)
|
||||
self.browser_type = kwargs.get("browser_type", "chromium")
|
||||
self.headers = kwargs.get("headers", {})
|
||||
self.sessions = {}
|
||||
self.session_ttl = 1800
|
||||
self.js_code = js_code
|
||||
self.verbose = kwargs.get("verbose", False)
|
||||
self.playwright = None
|
||||
self.browser = None
|
||||
self.hooks = {
|
||||
'on_browser_created': None,
|
||||
'on_user_agent_updated': None,
|
||||
'on_execution_started': None,
|
||||
'before_goto': None,
|
||||
'after_goto': None,
|
||||
'before_return_html': None,
|
||||
'before_retrieve_html': None
|
||||
}
|
||||
|
||||
async def __aenter__(self):
|
||||
await self.start()
|
||||
return self
|
||||
|
||||
async def __aexit__(self, exc_type, exc_val, exc_tb):
|
||||
await self.close()
|
||||
|
||||
async def start(self):
|
||||
if self.playwright is None:
|
||||
self.playwright = await async_playwright().start()
|
||||
if self.browser is None:
|
||||
browser_args = {
|
||||
"headless": self.headless,
|
||||
"args": [
|
||||
"--disable-gpu",
|
||||
"--no-sandbox",
|
||||
"--disable-dev-shm-usage",
|
||||
"--disable-blink-features=AutomationControlled",
|
||||
"--disable-infobars",
|
||||
"--window-position=0,0",
|
||||
"--ignore-certificate-errors",
|
||||
"--ignore-certificate-errors-spki-list",
|
||||
# "--headless=new", # Use the new headless mode
|
||||
]
|
||||
}
|
||||
|
||||
# Add proxy settings if a proxy is specified
|
||||
if self.proxy:
|
||||
proxy_settings = ProxySettings(server=self.proxy)
|
||||
browser_args["proxy"] = proxy_settings
|
||||
|
||||
# Select the appropriate browser based on the browser_type
|
||||
if self.browser_type == "firefox":
|
||||
self.browser = await self.playwright.firefox.launch(**browser_args)
|
||||
elif self.browser_type == "webkit":
|
||||
self.browser = await self.playwright.webkit.launch(**browser_args)
|
||||
else:
|
||||
self.browser = await self.playwright.chromium.launch(**browser_args)
|
||||
|
||||
await self.execute_hook('on_browser_created', self.browser)
|
||||
|
||||
async def close(self):
|
||||
if self.browser:
|
||||
await self.browser.close()
|
||||
self.browser = None
|
||||
if self.playwright:
|
||||
await self.playwright.stop()
|
||||
self.playwright = None
|
||||
|
||||
def __del__(self):
|
||||
if self.browser or self.playwright:
|
||||
asyncio.get_event_loop().run_until_complete(self.close())
|
||||
|
||||
def set_hook(self, hook_type: str, hook: Callable):
|
||||
if hook_type in self.hooks:
|
||||
self.hooks[hook_type] = hook
|
||||
else:
|
||||
raise ValueError(f"Invalid hook type: {hook_type}")
|
||||
|
||||
async def execute_hook(self, hook_type: str, *args):
|
||||
hook = self.hooks.get(hook_type)
|
||||
if hook:
|
||||
if asyncio.iscoroutinefunction(hook):
|
||||
return await hook(*args)
|
||||
else:
|
||||
return hook(*args)
|
||||
return args[0] if args else None
|
||||
|
||||
def update_user_agent(self, user_agent: str):
|
||||
self.user_agent = user_agent
|
||||
|
||||
def set_custom_headers(self, headers: Dict[str, str]):
|
||||
self.headers = headers
|
||||
|
||||
async def kill_session(self, session_id: str):
|
||||
if session_id in self.sessions:
|
||||
context, page, _ = self.sessions[session_id]
|
||||
await page.close()
|
||||
await context.close()
|
||||
del self.sessions[session_id]
|
||||
|
||||
def _cleanup_expired_sessions(self):
|
||||
current_time = time.time()
|
||||
expired_sessions = [
|
||||
sid for sid, (_, _, last_used) in self.sessions.items()
|
||||
if current_time - last_used > self.session_ttl
|
||||
]
|
||||
for sid in expired_sessions:
|
||||
asyncio.create_task(self.kill_session(sid))
|
||||
|
||||
async def smart_wait(self, page: Page, wait_for: str, timeout: float = 30000):
|
||||
wait_for = wait_for.strip()
|
||||
|
||||
if wait_for.startswith('js:'):
|
||||
# Explicitly specified JavaScript
|
||||
js_code = wait_for[3:].strip()
|
||||
return await self.csp_compliant_wait(page, js_code, timeout)
|
||||
elif wait_for.startswith('css:'):
|
||||
# Explicitly specified CSS selector
|
||||
css_selector = wait_for[4:].strip()
|
||||
try:
|
||||
await page.wait_for_selector(css_selector, timeout=timeout)
|
||||
except Error as e:
|
||||
if 'Timeout' in str(e):
|
||||
raise TimeoutError(f"Timeout after {timeout}ms waiting for selector '{css_selector}'")
|
||||
else:
|
||||
raise ValueError(f"Invalid CSS selector: '{css_selector}'")
|
||||
else:
|
||||
# Auto-detect based on content
|
||||
if wait_for.startswith('()') or wait_for.startswith('function'):
|
||||
# It's likely a JavaScript function
|
||||
return await self.csp_compliant_wait(page, wait_for, timeout)
|
||||
else:
|
||||
# Assume it's a CSS selector first
|
||||
try:
|
||||
await page.wait_for_selector(wait_for, timeout=timeout)
|
||||
except Error as e:
|
||||
if 'Timeout' in str(e):
|
||||
raise TimeoutError(f"Timeout after {timeout}ms waiting for selector '{wait_for}'")
|
||||
else:
|
||||
# If it's not a timeout error, it might be an invalid selector
|
||||
# Let's try to evaluate it as a JavaScript function as a fallback
|
||||
try:
|
||||
return await self.csp_compliant_wait(page, f"() => {{{wait_for}}}", timeout)
|
||||
except Error:
|
||||
raise ValueError(f"Invalid wait_for parameter: '{wait_for}'. "
|
||||
"It should be either a valid CSS selector, a JavaScript function, "
|
||||
"or explicitly prefixed with 'js:' or 'css:'.")
|
||||
|
||||
async def csp_compliant_wait(self, page: Page, user_wait_function: str, timeout: float = 30000):
|
||||
wrapper_js = f"""
|
||||
async () => {{
|
||||
const userFunction = {user_wait_function};
|
||||
const startTime = Date.now();
|
||||
while (true) {{
|
||||
if (await userFunction()) {{
|
||||
return true;
|
||||
}}
|
||||
if (Date.now() - startTime > {timeout}) {{
|
||||
throw new Error('Timeout waiting for condition');
|
||||
}}
|
||||
await new Promise(resolve => setTimeout(resolve, 100));
|
||||
}}
|
||||
}}
|
||||
"""
|
||||
|
||||
try:
|
||||
await page.evaluate(wrapper_js)
|
||||
except TimeoutError:
|
||||
raise TimeoutError(f"Timeout after {timeout}ms waiting for condition")
|
||||
except Exception as e:
|
||||
raise RuntimeError(f"Error in wait condition: {str(e)}")
|
||||
|
||||
async def process_iframes(self, page):
|
||||
# Find all iframes
|
||||
iframes = await page.query_selector_all('iframe')
|
||||
|
||||
for i, iframe in enumerate(iframes):
|
||||
try:
|
||||
# Add a unique identifier to the iframe
|
||||
await iframe.evaluate(f'(element) => element.id = "iframe-{i}"')
|
||||
|
||||
# Get the frame associated with this iframe
|
||||
frame = await iframe.content_frame()
|
||||
|
||||
if frame:
|
||||
# Wait for the frame to load
|
||||
await frame.wait_for_load_state('load', timeout=30000) # 30 seconds timeout
|
||||
|
||||
# Extract the content of the iframe's body
|
||||
iframe_content = await frame.evaluate('() => document.body.innerHTML')
|
||||
|
||||
# Generate a unique class name for this iframe
|
||||
class_name = f'extracted-iframe-content-{i}'
|
||||
|
||||
# Replace the iframe with a div containing the extracted content
|
||||
_iframe = iframe_content.replace('`', '\\`')
|
||||
await page.evaluate(f"""
|
||||
() => {{
|
||||
const iframe = document.getElementById('iframe-{i}');
|
||||
const div = document.createElement('div');
|
||||
div.innerHTML = `{_iframe}`;
|
||||
div.className = '{class_name}';
|
||||
iframe.replaceWith(div);
|
||||
}}
|
||||
""")
|
||||
else:
|
||||
print(f"Warning: Could not access content frame for iframe {i}")
|
||||
except Exception as e:
|
||||
print(f"Error processing iframe {i}: {str(e)}")
|
||||
|
||||
# Return the page object
|
||||
return page
|
||||
|
||||
async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
|
||||
response_headers = {}
|
||||
status_code = None
|
||||
|
||||
self._cleanup_expired_sessions()
|
||||
session_id = kwargs.get("session_id")
|
||||
if session_id:
|
||||
context, page, _ = self.sessions.get(session_id, (None, None, None))
|
||||
if not context:
|
||||
context = await self.browser.new_context(
|
||||
user_agent=self.user_agent,
|
||||
viewport={"width": 1920, "height": 1080},
|
||||
proxy={"server": self.proxy} if self.proxy else None
|
||||
)
|
||||
await context.set_extra_http_headers(self.headers)
|
||||
page = await context.new_page()
|
||||
self.sessions[session_id] = (context, page, time.time())
|
||||
else:
|
||||
context = await self.browser.new_context(
|
||||
user_agent=self.user_agent,
|
||||
viewport={"width": 1920, "height": 1080},
|
||||
proxy={"server": self.proxy} if self.proxy else None
|
||||
)
|
||||
await context.set_extra_http_headers(self.headers)
|
||||
|
||||
if kwargs.get("override_navigator", False):
|
||||
# Inject scripts to override navigator properties
|
||||
await context.add_init_script("""
|
||||
// Pass the Permissions Test.
|
||||
const originalQuery = window.navigator.permissions.query;
|
||||
window.navigator.permissions.query = (parameters) => (
|
||||
parameters.name === 'notifications' ?
|
||||
Promise.resolve({ state: Notification.permission }) :
|
||||
originalQuery(parameters)
|
||||
);
|
||||
Object.defineProperty(navigator, 'webdriver', {
|
||||
get: () => undefined
|
||||
});
|
||||
window.navigator.chrome = {
|
||||
runtime: {},
|
||||
// Add other properties if necessary
|
||||
};
|
||||
Object.defineProperty(navigator, 'plugins', {
|
||||
get: () => [1, 2, 3, 4, 5],
|
||||
});
|
||||
Object.defineProperty(navigator, 'languages', {
|
||||
get: () => ['en-US', 'en'],
|
||||
});
|
||||
Object.defineProperty(document, 'hidden', {
|
||||
get: () => false
|
||||
});
|
||||
Object.defineProperty(document, 'visibilityState', {
|
||||
get: () => 'visible'
|
||||
});
|
||||
""")
|
||||
|
||||
page = await context.new_page()
|
||||
|
||||
try:
|
||||
if self.verbose:
|
||||
print(f"[LOG] 🕸️ Crawling {url} using AsyncPlaywrightCrawlerStrategy...")
|
||||
|
||||
if self.use_cached_html:
|
||||
cache_file_path = os.path.join(
|
||||
Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest()
|
||||
)
|
||||
if os.path.exists(cache_file_path):
|
||||
html = ""
|
||||
with open(cache_file_path, "r") as f:
|
||||
html = f.read()
|
||||
# retrieve response headers and status code from cache
|
||||
with open(cache_file_path + ".meta", "r") as f:
|
||||
meta = json.load(f)
|
||||
response_headers = meta.get("response_headers", {})
|
||||
status_code = meta.get("status_code")
|
||||
response = AsyncCrawlResponse(
|
||||
html=html, response_headers=response_headers, status_code=status_code
|
||||
)
|
||||
return response
|
||||
|
||||
if not kwargs.get("js_only", False):
|
||||
await self.execute_hook('before_goto', page)
|
||||
|
||||
response = await page.goto("about:blank")
|
||||
await stealth_async(page)
|
||||
response = await page.goto(
|
||||
url, wait_until="domcontentloaded", timeout=kwargs.get("page_timeout", 60000)
|
||||
)
|
||||
|
||||
# await stealth_async(page)
|
||||
# response = await page.goto("about:blank")
|
||||
# await stealth_async(page)
|
||||
# await page.evaluate(f"window.location.href = '{url}'")
|
||||
|
||||
await self.execute_hook('after_goto', page)
|
||||
|
||||
# Get status code and headers
|
||||
status_code = response.status
|
||||
response_headers = response.headers
|
||||
else:
|
||||
status_code = 200
|
||||
response_headers = {}
|
||||
|
||||
await page.wait_for_selector('body')
|
||||
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
|
||||
|
||||
js_code = kwargs.get("js_code", kwargs.get("js", self.js_code))
|
||||
if js_code:
|
||||
if isinstance(js_code, str):
|
||||
await page.evaluate(js_code)
|
||||
elif isinstance(js_code, list):
|
||||
for js in js_code:
|
||||
await page.evaluate(js)
|
||||
|
||||
await page.wait_for_load_state('networkidle')
|
||||
# Check for on execution event
|
||||
await self.execute_hook('on_execution_started', page)
|
||||
|
||||
if kwargs.get("simulate_user", False):
|
||||
# Simulate user interactions
|
||||
await page.mouse.move(100, 100)
|
||||
await page.mouse.down()
|
||||
await page.mouse.up()
|
||||
await page.keyboard.press('ArrowDown')
|
||||
|
||||
# Handle the wait_for parameter
|
||||
wait_for = kwargs.get("wait_for")
|
||||
if wait_for:
|
||||
try:
|
||||
await self.smart_wait(page, wait_for, timeout=kwargs.get("page_timeout", 60000))
|
||||
except Exception as e:
|
||||
raise RuntimeError(f"Wait condition failed: {str(e)}")
|
||||
|
||||
|
||||
|
||||
# Update image dimensions
|
||||
update_image_dimensions_js = """
|
||||
() => {
|
||||
return new Promise((resolve) => {
|
||||
const filterImage = (img) => {
|
||||
// Filter out images that are too small
|
||||
if (img.width < 100 && img.height < 100) return false;
|
||||
|
||||
// Filter out images that are not visible
|
||||
const rect = img.getBoundingClientRect();
|
||||
if (rect.width === 0 || rect.height === 0) return false;
|
||||
|
||||
// Filter out images with certain class names (e.g., icons, thumbnails)
|
||||
if (img.classList.contains('icon') || img.classList.contains('thumbnail')) return false;
|
||||
|
||||
// Filter out images with certain patterns in their src (e.g., placeholder images)
|
||||
if (img.src.includes('placeholder') || img.src.includes('icon')) return false;
|
||||
|
||||
return true;
|
||||
};
|
||||
|
||||
const images = Array.from(document.querySelectorAll('img')).filter(filterImage);
|
||||
let imagesLeft = images.length;
|
||||
|
||||
if (imagesLeft === 0) {
|
||||
resolve();
|
||||
return;
|
||||
}
|
||||
|
||||
const checkImage = (img) => {
|
||||
if (img.complete && img.naturalWidth !== 0) {
|
||||
img.setAttribute('width', img.naturalWidth);
|
||||
img.setAttribute('height', img.naturalHeight);
|
||||
imagesLeft--;
|
||||
if (imagesLeft === 0) resolve();
|
||||
}
|
||||
};
|
||||
|
||||
images.forEach(img => {
|
||||
checkImage(img);
|
||||
if (!img.complete) {
|
||||
img.onload = () => {
|
||||
checkImage(img);
|
||||
};
|
||||
img.onerror = () => {
|
||||
imagesLeft--;
|
||||
if (imagesLeft === 0) resolve();
|
||||
};
|
||||
}
|
||||
});
|
||||
|
||||
// Fallback timeout of 5 seconds
|
||||
setTimeout(() => resolve(), 5000);
|
||||
});
|
||||
}
|
||||
"""
|
||||
await page.evaluate(update_image_dimensions_js)
|
||||
|
||||
# Wait a bit for any onload events to complete
|
||||
await page.wait_for_timeout(100)
|
||||
|
||||
# Process iframes
|
||||
if kwargs.get("process_iframes", False):
|
||||
page = await self.process_iframes(page)
|
||||
|
||||
await self.execute_hook('before_retrieve_html', page)
|
||||
# Check if delay_before_return_html is set then wait for that time
|
||||
delay_before_return_html = kwargs.get("delay_before_return_html")
|
||||
if delay_before_return_html:
|
||||
await asyncio.sleep(delay_before_return_html)
|
||||
|
||||
html = await page.content()
|
||||
await self.execute_hook('before_return_html', page, html)
|
||||
|
||||
# Check if kwargs has screenshot=True then take screenshot
|
||||
screenshot_data = None
|
||||
if kwargs.get("screenshot"):
|
||||
screenshot_data = await self.take_screenshot(url)
|
||||
|
||||
if self.verbose:
|
||||
print(f"[LOG] ✅ Crawled {url} successfully!")
|
||||
|
||||
if self.use_cached_html:
|
||||
cache_file_path = os.path.join(
|
||||
Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest()
|
||||
)
|
||||
with open(cache_file_path, "w", encoding="utf-8") as f:
|
||||
f.write(html)
|
||||
# store response headers and status code in cache
|
||||
with open(cache_file_path + ".meta", "w", encoding="utf-8") as f:
|
||||
json.dump({
|
||||
"response_headers": response_headers,
|
||||
"status_code": status_code
|
||||
}, f)
|
||||
|
||||
async def get_delayed_content(delay: float = 5.0) -> str:
|
||||
if self.verbose:
|
||||
print(f"[LOG] Waiting for {delay} seconds before retrieving content for {url}")
|
||||
await asyncio.sleep(delay)
|
||||
return await page.content()
|
||||
|
||||
response = AsyncCrawlResponse(
|
||||
html=html,
|
||||
response_headers=response_headers,
|
||||
status_code=status_code,
|
||||
screenshot=screenshot_data,
|
||||
get_delayed_content=get_delayed_content
|
||||
)
|
||||
return response
|
||||
except Error as e:
|
||||
raise Error(f"Failed to crawl {url}: {str(e)}")
|
||||
finally:
|
||||
if not session_id:
|
||||
await page.close()
|
||||
await context.close()
|
||||
|
||||
async def crawl_many(self, urls: List[str], **kwargs) -> List[AsyncCrawlResponse]:
|
||||
semaphore_count = kwargs.get('semaphore_count', 5) # Adjust as needed
|
||||
semaphore = asyncio.Semaphore(semaphore_count)
|
||||
|
||||
async def crawl_with_semaphore(url):
|
||||
async with semaphore:
|
||||
return await self.crawl(url, **kwargs)
|
||||
|
||||
tasks = [crawl_with_semaphore(url) for url in urls]
|
||||
results = await asyncio.gather(*tasks, return_exceptions=True)
|
||||
return [result if not isinstance(result, Exception) else str(result) for result in results]
|
||||
|
||||
async def take_screenshot(self, url: str, wait_time=1000) -> str:
|
||||
async with await self.browser.new_context(user_agent=self.user_agent) as context:
|
||||
page = await context.new_page()
|
||||
try:
|
||||
await page.goto(url, wait_until="domcontentloaded", timeout=30000)
|
||||
# Wait for a specified time (default is 1 second)
|
||||
await page.wait_for_timeout(wait_time)
|
||||
screenshot = await page.screenshot(full_page=True)
|
||||
return base64.b64encode(screenshot).decode('utf-8')
|
||||
except Exception as e:
|
||||
error_message = f"Failed to take screenshot: {str(e)}"
|
||||
print(error_message)
|
||||
|
||||
# Generate an error image
|
||||
img = Image.new('RGB', (800, 600), color='black')
|
||||
draw = ImageDraw.Draw(img)
|
||||
font = ImageFont.load_default()
|
||||
draw.text((10, 10), error_message, fill=(255, 255, 255), font=font)
|
||||
|
||||
buffered = BytesIO()
|
||||
img.save(buffered, format="JPEG")
|
||||
return base64.b64encode(buffered.getvalue()).decode('utf-8')
|
||||
finally:
|
||||
await page.close()
|
||||
|
||||
@@ -1,30 +1,45 @@
|
||||
import asyncio
|
||||
import base64, time
|
||||
import base64
|
||||
import time
|
||||
from abc import ABC, abstractmethod
|
||||
from typing import Callable, Dict, Any, List, Optional
|
||||
from typing import Callable, Dict, Any, List, Optional, Awaitable
|
||||
import os
|
||||
import psutil
|
||||
from playwright.async_api import async_playwright, Page, Browser, Error
|
||||
from io import BytesIO
|
||||
from PIL import Image, ImageDraw, ImageFont
|
||||
from .utils import sanitize_input_encode
|
||||
import json, uuid
|
||||
import hashlib
|
||||
from pathlib import Path
|
||||
from playwright.async_api import ProxySettings
|
||||
from pydantic import BaseModel
|
||||
import hashlib
|
||||
import json
|
||||
import uuid
|
||||
from playwright_stealth import StealthConfig, stealth_async
|
||||
|
||||
stealth_config = StealthConfig(
|
||||
webdriver=True,
|
||||
chrome_app=True,
|
||||
chrome_csi=True,
|
||||
chrome_load_times=True,
|
||||
chrome_runtime=True,
|
||||
navigator_languages=True,
|
||||
navigator_plugins=True,
|
||||
navigator_permissions=True,
|
||||
webgl_vendor=True,
|
||||
outerdimensions=True,
|
||||
navigator_hardware_concurrency=True,
|
||||
media_codecs=True,
|
||||
)
|
||||
|
||||
def calculate_semaphore_count():
|
||||
cpu_count = os.cpu_count()
|
||||
memory_gb = psutil.virtual_memory().total / (1024 ** 3) # Convert to GB
|
||||
base_count = max(1, cpu_count // 2)
|
||||
memory_based_cap = int(memory_gb / 2) # Assume 2GB per instance
|
||||
return min(base_count, memory_based_cap)
|
||||
|
||||
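Review note: as written, `calculate_semaphore_count` can return 0 on a host with less than 2 GB of RAM, because `int(memory_gb / 2)` rounds down to 0, and `asyncio.Semaphore(0)` would block every acquire. `os.cpu_count()` may also return `None`. The sketch below shows one possible guard. It takes the CPU count and memory size as parameters so it runs without `psutil`; `semaphore_count_for` and `gb_per_instance` are hypothetical names, not part of this changeset.

```python
from typing import Optional


def semaphore_count_for(
    cpu_count: Optional[int],
    memory_gb: float,
    gb_per_instance: float = 2.0,
) -> int:
    """Derive a concurrency limit from CPU and memory, never lower than 1."""
    cpus = cpu_count or 1  # os.cpu_count() may return None
    base_count = max(1, cpus // 2)
    memory_based_cap = int(memory_gb / gb_per_instance)
    # Clamp to 1 so a small-memory host still makes progress.
    return max(1, min(base_count, memory_based_cap))
```

A caller could pass `os.cpu_count()` and `psutil.virtual_memory().total / (1024 ** 3)` to reproduce the inputs used in the diff.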
class AsyncCrawlResponse(BaseModel):
|
||||
html: str
|
||||
response_headers: Dict[str, str]
|
||||
status_code: int
|
||||
screenshot: Optional[str] = None
|
||||
get_delayed_content: Optional[Callable[[Optional[float]], Awaitable[str]]] = None
|
||||
|
||||
class Config:
|
||||
arbitrary_types_allowed = True
|
||||
|
||||
class AsyncCrawlerStrategy(ABC):
|
||||
@abstractmethod
|
||||
@@ -50,23 +65,30 @@ class AsyncCrawlerStrategy(ABC):
|
||||
class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
def __init__(self, use_cached_html=False, js_code=None, **kwargs):
|
||||
self.use_cached_html = use_cached_html
|
||||
self.user_agent = kwargs.get("user_agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
|
||||
self.user_agent = kwargs.get(
|
||||
"user_agent",
|
||||
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
|
||||
"(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
|
||||
)
|
||||
self.proxy = kwargs.get("proxy")
|
||||
self.headless = kwargs.get("headless", True)
|
||||
self.headers = {}
|
||||
self.browser_type = kwargs.get("browser_type", "chromium")
|
||||
self.headers = kwargs.get("headers", {})
|
||||
self.sessions = {}
|
||||
self.session_ttl = 1800
|
||||
self.js_code = js_code
|
||||
self.verbose = kwargs.get("verbose", False)
|
||||
self.playwright = None
|
||||
self.browser = None
|
||||
self.sleep_on_close = kwargs.get("sleep_on_close", False)
|
||||
self.hooks = {
|
||||
'on_browser_created': None,
|
||||
'on_user_agent_updated': None,
|
||||
'on_execution_started': None,
|
||||
'before_goto': None,
|
||||
'after_goto': None,
|
||||
'before_return_html': None
|
||||
'before_return_html': None,
|
||||
'before_retrieve_html': None
|
||||
}
|
||||
|
||||
async def __aenter__(self):
|
||||
@@ -82,12 +104,16 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
if self.browser is None:
|
||||
browser_args = {
|
||||
"headless": self.headless,
|
||||
# "headless": False,
|
||||
"args": [
|
||||
"--disable-gpu",
|
||||
"--disable-dev-shm-usage",
|
||||
"--disable-setuid-sandbox",
|
||||
"--no-sandbox",
|
||||
"--disable-dev-shm-usage",
|
||||
"--disable-blink-features=AutomationControlled",
|
||||
"--disable-infobars",
|
||||
"--window-position=0,0",
|
||||
"--ignore-certificate-errors",
|
||||
"--ignore-certificate-errors-spki-list",
|
||||
# "--headless=new", # Use the new headless mode
|
||||
]
|
||||
}
|
||||
|
||||
@@ -96,11 +122,19 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
proxy_settings = ProxySettings(server=self.proxy)
|
||||
browser_args["proxy"] = proxy_settings
|
||||
|
||||
|
||||
self.browser = await self.playwright.chromium.launch(**browser_args)
|
||||
# Select the appropriate browser based on the browser_type
|
||||
if self.browser_type == "firefox":
|
||||
self.browser = await self.playwright.firefox.launch(**browser_args)
|
||||
elif self.browser_type == "webkit":
|
||||
self.browser = await self.playwright.webkit.launch(**browser_args)
|
||||
else:
|
||||
self.browser = await self.playwright.chromium.launch(**browser_args)
|
||||
|
||||
await self.execute_hook('on_browser_created', self.browser)
|
||||
|
||||
async def close(self):
|
||||
if self.sleep_on_close:
|
||||
await asyncio.sleep(500)
|
||||
if self.browser:
|
||||
await self.browser.close()
|
||||
self.browser = None
|
||||
@@ -142,12 +176,52 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
|
||||
def _cleanup_expired_sessions(self):
|
||||
current_time = time.time()
|
||||
expired_sessions = [sid for sid, (_, _, last_used) in self.sessions.items()
|
||||
if current_time - last_used > self.session_ttl]
|
||||
expired_sessions = [
|
||||
sid for sid, (_, _, last_used) in self.sessions.items()
|
||||
if current_time - last_used > self.session_ttl
|
||||
]
|
||||
for sid in expired_sessions:
|
||||
asyncio.create_task(self.kill_session(sid))
|
||||
|
||||
|
||||
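Review note: the expiry check in `_cleanup_expired_sessions` is a pure filter over `(context, page, last_used)` tuples, so it can be exercised without a browser. A minimal sketch, assuming the same tuple layout; `expired_session_ids` and its `now` parameter are hypothetical and exist only to make the logic testable.

```python
import time
from typing import Any, Dict, List, Optional, Tuple

Session = Tuple[Any, Any, float]  # (context, page, last_used)


def expired_session_ids(
    sessions: Dict[str, Session],
    ttl: float,
    now: Optional[float] = None,
) -> List[str]:
    """Return the IDs of sessions idle for longer than `ttl` seconds."""
    current_time = time.time() if now is None else now
    return [
        sid
        for sid, (_, _, last_used) in sessions.items()
        if current_time - last_used > ttl
    ]
```

Injecting `now` keeps the function deterministic in tests; production code would leave it unset and fall back to `time.time()`.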
async def smart_wait(self, page: Page, wait_for: str, timeout: float = 30000):
|
||||
wait_for = wait_for.strip()
|
||||
|
||||
if wait_for.startswith('js:'):
|
||||
# Explicitly specified JavaScript
|
||||
js_code = wait_for[3:].strip()
|
||||
return await self.csp_compliant_wait(page, js_code, timeout)
|
||||
elif wait_for.startswith('css:'):
|
||||
# Explicitly specified CSS selector
|
||||
css_selector = wait_for[4:].strip()
|
||||
try:
|
||||
await page.wait_for_selector(css_selector, timeout=timeout)
|
||||
except Error as e:
|
||||
if 'Timeout' in str(e):
|
||||
raise TimeoutError(f"Timeout after {timeout}ms waiting for selector '{css_selector}'")
|
||||
else:
|
||||
raise ValueError(f"Invalid CSS selector: '{css_selector}'")
|
||||
else:
|
||||
# Auto-detect based on content
|
||||
if wait_for.startswith('()') or wait_for.startswith('function'):
|
||||
# It's likely a JavaScript function
|
||||
return await self.csp_compliant_wait(page, wait_for, timeout)
|
||||
else:
|
||||
# Assume it's a CSS selector first
|
||||
try:
|
||||
await page.wait_for_selector(wait_for, timeout=timeout)
|
||||
except Error as e:
|
||||
if 'Timeout' in str(e):
|
||||
raise TimeoutError(f"Timeout after {timeout}ms waiting for selector '{wait_for}'")
|
||||
else:
|
||||
# If it's not a timeout error, it might be an invalid selector
|
||||
# Let's try to evaluate it as a JavaScript function as a fallback
|
||||
try:
|
||||
return await self.csp_compliant_wait(page, f"() => {{{wait_for}}}", timeout)
|
||||
except Error:
|
||||
raise ValueError(f"Invalid wait_for parameter: '{wait_for}'. "
|
||||
"It should be either a valid CSS selector, a JavaScript function, "
|
||||
"or explicitly prefixed with 'js:' or 'css:'.")
|
||||
|
||||
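Review note: the dispatch rules in `smart_wait` can be summarised as a small classifier. The sketch below mirrors only the prefix and auto-detect branches and leaves out the Playwright calls and the fallback that retries a failed selector as JavaScript. `classify_wait_for` is a hypothetical helper, not part of this changeset.

```python
from typing import Tuple


def classify_wait_for(wait_for: str) -> Tuple[str, str]:
    """Return (kind, payload), where kind is 'js' or 'css'."""
    wait_for = wait_for.strip()
    if wait_for.startswith("js:"):
        # Explicit JavaScript prefix
        return "js", wait_for[3:].strip()
    if wait_for.startswith("css:"):
        # Explicit CSS selector prefix
        return "css", wait_for[4:].strip()
    if wait_for.startswith("()") or wait_for.startswith("function"):
        # Looks like a JavaScript function
        return "js", wait_for
    # Otherwise treat it as a CSS selector first
    return "css", wait_for
```

One consequence of the auto-detect rule is that a selector whose text begins with `function` is treated as JavaScript, so the explicit `css:` prefix is the safer choice for unusual selectors.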
async def csp_compliant_wait(self, page: Page, user_wait_function: str, timeout: float = 30000):
|
||||
wrapper_js = f"""
|
||||
async () => {{
|
||||
@@ -172,6 +246,47 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
except Exception as e:
|
||||
raise RuntimeError(f"Error in wait condition: {str(e)}")
|
||||
|
||||
async def process_iframes(self, page):
|
||||
# Find all iframes
|
||||
iframes = await page.query_selector_all('iframe')
|
||||
|
||||
for i, iframe in enumerate(iframes):
|
||||
try:
|
||||
# Add a unique identifier to the iframe
|
||||
await iframe.evaluate(f'(element) => element.id = "iframe-{i}"')
|
||||
|
||||
# Get the frame associated with this iframe
|
||||
frame = await iframe.content_frame()
|
||||
|
||||
if frame:
|
||||
# Wait for the frame to load
|
||||
await frame.wait_for_load_state('load', timeout=30000) # 30 seconds timeout
|
||||
|
||||
# Extract the content of the iframe's body
|
||||
iframe_content = await frame.evaluate('() => document.body.innerHTML')
|
||||
|
||||
# Generate a unique class name for this iframe
|
||||
class_name = f'extracted-iframe-content-{i}'
|
||||
|
||||
# Replace the iframe with a div containing the extracted content
|
||||
_iframe = iframe_content.replace('`', '\\`')
|
||||
await page.evaluate(f"""
|
||||
() => {{
|
||||
const iframe = document.getElementById('iframe-{i}');
|
||||
const div = document.createElement('div');
|
||||
div.innerHTML = `{_iframe}`;
|
||||
div.className = '{class_name}';
|
||||
iframe.replaceWith(div);
|
||||
}}
|
||||
""")
|
||||
else:
|
||||
print(f"Warning: Could not access content frame for iframe {i}")
|
||||
except Exception as e:
|
||||
print(f"Error processing iframe {i}: {str(e)}")
|
||||
|
||||
# Return the page object
|
||||
return page
|
||||
|
||||
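Review note: `process_iframes` embeds the iframe HTML in a JavaScript template literal and escapes only backticks, so content containing `${` or backslashes would still be interpreted by the template literal. One possible alternative, sketched below, is to serialise the values with `json.dumps`, which yields a valid JavaScript string literal. `build_iframe_replacement_js` is a hypothetical helper, not part of this changeset.

```python
import json


def build_iframe_replacement_js(index: int, iframe_html: str) -> str:
    """Build the page.evaluate() script with JSON-encoded string literals."""
    id_literal = json.dumps(f"iframe-{index}")
    class_literal = json.dumps(f"extracted-iframe-content-{index}")
    html_literal = json.dumps(iframe_html)  # escapes quotes, backslashes, newlines
    return (
        "() => {\n"
        f"    const iframe = document.getElementById({id_literal});\n"
        "    const div = document.createElement('div');\n"
        f"    div.innerHTML = {html_literal};\n"
        f"    div.className = {class_literal};\n"
        "    iframe.replaceWith(div);\n"
        "}"
    )
```

Playwright's `page.evaluate` also accepts an argument that is passed to the function, which would avoid building the script from strings at all.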
async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
|
||||
response_headers = {}
|
||||
status_code = None
|
||||
@@ -183,25 +298,70 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
if not context:
|
||||
context = await self.browser.new_context(
|
||||
user_agent=self.user_agent,
|
||||
proxy={"server": self.proxy} if self.proxy else None
|
||||
viewport={"width": 1920, "height": 1080},
|
||||
proxy={"server": self.proxy} if self.proxy else None,
|
||||
accept_downloads=True,
|
||||
java_script_enabled=True
|
||||
)
|
||||
await context.add_cookies([{"name": "cookiesEnabled", "value": "true", "url": url}])
|
||||
await context.set_extra_http_headers(self.headers)
|
||||
page = await context.new_page()
|
||||
self.sessions[session_id] = (context, page, time.time())
|
||||
else:
|
||||
context = await self.browser.new_context(
|
||||
user_agent=self.user_agent,
|
||||
proxy={"server": self.proxy} if self.proxy else None
|
||||
user_agent=self.user_agent,
|
||||
viewport={"width": 1920, "height": 1080},
|
||||
proxy={"server": self.proxy} if self.proxy else None
|
||||
)
|
||||
await context.set_extra_http_headers(self.headers)
|
||||
|
||||
if kwargs.get("override_navigator", False) or kwargs.get("simulate_user", False) or kwargs.get("magic", False):
|
||||
# Inject scripts to override navigator properties
|
||||
await context.add_init_script("""
|
||||
// Pass the Permissions Test.
|
||||
const originalQuery = window.navigator.permissions.query;
|
||||
window.navigator.permissions.query = (parameters) => (
|
||||
parameters.name === 'notifications' ?
|
||||
Promise.resolve({ state: Notification.permission }) :
|
||||
originalQuery(parameters)
|
||||
);
|
||||
Object.defineProperty(navigator, 'webdriver', {
|
||||
get: () => undefined
|
||||
});
|
||||
window.navigator.chrome = {
|
||||
runtime: {},
|
||||
// Add other properties if necessary
|
||||
};
|
||||
Object.defineProperty(navigator, 'plugins', {
|
||||
get: () => [1, 2, 3, 4, 5],
|
||||
});
|
||||
Object.defineProperty(navigator, 'languages', {
|
||||
get: () => ['en-US', 'en'],
|
||||
});
|
||||
Object.defineProperty(document, 'hidden', {
|
||||
get: () => false
|
||||
});
|
||||
Object.defineProperty(document, 'visibilityState', {
|
||||
get: () => 'visible'
|
||||
});
|
||||
""")
|
||||
|
||||
page = await context.new_page()
|
||||
# await stealth_async(page) #, stealth_config)
|
||||
|
||||
# Add console message and error logging
|
||||
if kwargs.get("log_console", False):
|
||||
page.on("console", lambda msg: print(f"Console: {msg.text}"))
|
||||
page.on("pageerror", lambda exc: print(f"Page Error: {exc}"))
|
||||
|
||||
try:
|
||||
if self.verbose:
|
||||
print(f"[LOG] 🕸️ Crawling {url} using AsyncPlaywrightCrawlerStrategy...")
|
||||
|
||||
if self.use_cached_html:
|
||||
cache_file_path = os.path.join(Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest())
|
||||
cache_file_path = os.path.join(
|
||||
Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest()
|
||||
)
|
||||
if os.path.exists(cache_file_path):
|
||||
html = ""
|
||||
with open(cache_file_path, "r") as f:
|
||||
@@ -211,12 +371,21 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
meta = json.load(f)
|
||||
response_headers = meta.get("response_headers", {})
|
||||
status_code = meta.get("status_code")
|
||||
response = AsyncCrawlResponse(html=html, response_headers=response_headers, status_code=status_code)
|
||||
response = AsyncCrawlResponse(
|
||||
html=html, response_headers=response_headers, status_code=status_code
|
||||
)
|
||||
return response
|
||||
|
||||
if not kwargs.get("js_only", False):
|
||||
await self.execute_hook('before_goto', page)
|
||||
response = await page.goto(url, wait_until="domcontentloaded", timeout=60000)
|
||||
|
||||
response = await page.goto(
|
||||
url, wait_until="domcontentloaded", timeout=kwargs.get("page_timeout", 60000)
|
||||
)
|
||||
|
||||
# response = await page.goto("about:blank")
|
||||
# await page.evaluate(f"window.location.href = '{url}'")
|
||||
|
||||
await self.execute_hook('after_goto', page)
|
||||
|
||||
# Get status code and headers
|
||||
@@ -232,55 +401,116 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
js_code = kwargs.get("js_code", kwargs.get("js", self.js_code))
|
||||
if js_code:
|
||||
if isinstance(js_code, str):
|
||||
r = await page.evaluate(js_code)
|
||||
await page.evaluate(js_code)
|
||||
elif isinstance(js_code, list):
|
||||
for js in js_code:
|
||||
await page.evaluate(js)
|
||||
|
||||
# await page.wait_for_timeout(100)
|
||||
await page.wait_for_load_state('networkidle')
|
||||
# Check for on execution even
|
||||
# Check for on execution event
|
||||
await self.execute_hook('on_execution_started', page)
|
||||
|
||||
# New code to handle the wait_for parameter
|
||||
# Example usage:
|
||||
# await crawler.crawl(
|
||||
# url,
|
||||
# js_code="// some JavaScript code",
|
||||
# wait_for="""() => {
|
||||
# return document.querySelector('#my-element') !== null;
|
||||
# }"""
|
||||
# )
|
||||
# Example of using a CSS selector:
|
||||
# await crawler.crawl(
|
||||
# url,
|
||||
# wait_for="#my-element"
|
||||
# )
|
||||
if kwargs.get("simulate_user", False) or kwargs.get("magic", False):
|
||||
# Simulate user interactions
|
||||
await page.mouse.move(100, 100)
|
||||
await page.mouse.down()
|
||||
await page.mouse.up()
|
||||
await page.keyboard.press('ArrowDown')
|
||||
|
||||
# Handle the wait_for parameter
|
||||
wait_for = kwargs.get("wait_for")
|
||||
if wait_for:
|
||||
try:
|
||||
await self.csp_compliant_wait(page, wait_for, timeout=kwargs.get("timeout", 30000))
|
||||
await self.smart_wait(page, wait_for, timeout=kwargs.get("page_timeout", 60000))
|
||||
except Exception as e:
|
||||
raise RuntimeError(f"Custom wait condition failed: {str(e)}")
|
||||
# try:
|
||||
# await page.wait_for_function(wait_for)
|
||||
# # if callable(wait_for):
|
||||
# # await page.wait_for_function(wait_for)
|
||||
# # elif isinstance(wait_for, str):
|
||||
# # await page.wait_for_selector(wait_for)
|
||||
# # else:
|
||||
# # raise ValueError("wait_for must be either a callable or a CSS selector string")
|
||||
# except Error as e:
|
||||
# raise Error(f"Custom wait condition failed: {str(e)}")
|
||||
raise RuntimeError(f"Wait condition failed: {str(e)}")
|
||||
|
||||
# Update image dimensions
|
||||
update_image_dimensions_js = """
|
||||
() => {
|
||||
return new Promise((resolve) => {
|
||||
const filterImage = (img) => {
|
||||
// Filter out images that are too small
|
||||
if (img.width < 100 && img.height < 100) return false;
|
||||
|
||||
// Filter out images that are not visible
|
||||
const rect = img.getBoundingClientRect();
|
||||
if (rect.width === 0 || rect.height === 0) return false;
|
||||
|
||||
// Filter out images with certain class names (e.g., icons, thumbnails)
|
||||
if (img.classList.contains('icon') || img.classList.contains('thumbnail')) return false;
|
||||
|
||||
// Filter out images with certain patterns in their src (e.g., placeholder images)
|
||||
if (img.src.includes('placeholder') || img.src.includes('icon')) return false;
|
||||
|
||||
return true;
|
||||
};
|
||||
|
||||
const images = Array.from(document.querySelectorAll('img')).filter(filterImage);
|
||||
let imagesLeft = images.length;
|
||||
|
||||
if (imagesLeft === 0) {
|
||||
resolve();
|
||||
return;
|
||||
}
|
||||
|
||||
const checkImage = (img) => {
|
||||
if (img.complete && img.naturalWidth !== 0) {
|
||||
img.setAttribute('width', img.naturalWidth);
|
||||
img.setAttribute('height', img.naturalHeight);
|
||||
imagesLeft--;
|
||||
if (imagesLeft === 0) resolve();
|
||||
}
|
||||
};
|
||||
|
||||
images.forEach(img => {
|
||||
checkImage(img);
|
||||
if (!img.complete) {
|
||||
img.onload = () => {
|
||||
checkImage(img);
|
||||
};
|
||||
img.onerror = () => {
|
||||
imagesLeft--;
|
||||
if (imagesLeft === 0) resolve();
|
||||
};
|
||||
}
|
||||
});
|
||||
|
||||
// Fallback timeout of 5 seconds
|
||||
setTimeout(() => resolve(), 5000);
|
||||
});
|
||||
}
|
||||
"""
|
||||
await page.evaluate(update_image_dimensions_js)
|
||||
|
||||
# Wait a bit for any onload events to complete
|
||||
await page.wait_for_timeout(100)
|
||||
|
||||
# Process iframes
|
||||
if kwargs.get("process_iframes", False):
|
||||
page = await self.process_iframes(page)
|
||||
|
||||
await self.execute_hook('before_retrieve_html', page)
|
||||
# Check if delay_before_return_html is set then wait for that time
|
||||
delay_before_return_html = kwargs.get("delay_before_return_html")
|
||||
if delay_before_return_html:
|
||||
await asyncio.sleep(delay_before_return_html)
|
||||
|
||||
html = await page.content()
|
||||
page = await self.execute_hook('before_return_html', page, html)
|
||||
await self.execute_hook('before_return_html', page, html)
|
||||
|
||||
# Check if kwargs has screenshot=True then take screenshot
|
||||
screenshot_data = None
|
||||
if kwargs.get("screenshot"):
|
||||
screenshot_data = await self.take_screenshot(url)
|
||||
|
||||
if self.verbose:
|
||||
print(f"[LOG] ✅ Crawled {url} successfully!")
|
||||
|
||||
if self.use_cached_html:
|
||||
cache_file_path = os.path.join(Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest())
|
||||
cache_file_path = os.path.join(
|
||||
Path.home(), ".crawl4ai", "cache", hashlib.md5(url.encode()).hexdigest()
|
||||
)
|
||||
with open(cache_file_path, "w", encoding="utf-8") as f:
|
||||
f.write(html)
|
||||
# store response headers and status code in cache
|
||||
@@ -290,67 +520,29 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
"status_code": status_code
|
||||
}, f)
|
||||
|
||||
response = AsyncCrawlResponse(html=html, response_headers=response_headers, status_code=status_code)
|
||||
async def get_delayed_content(delay: float = 5.0) -> str:
|
||||
if self.verbose:
|
||||
print(f"[LOG] Waiting for {delay} seconds before retrieving content for {url}")
|
||||
await asyncio.sleep(delay)
|
||||
return await page.content()
|
||||
|
||||
response = AsyncCrawlResponse(
|
||||
html=html,
|
||||
response_headers=response_headers,
|
||||
status_code=status_code,
|
||||
screenshot=screenshot_data,
|
||||
get_delayed_content=get_delayed_content
|
||||
)
|
||||
return response
|
||||
except Error as e:
|
||||
raise Error(f"Failed to crawl {url}: {str(e)}")
|
||||
finally:
|
||||
if not session_id:
|
||||
await page.close()
|
||||
raise Error(f"[ERROR] 🚫 crawl(): Failed to crawl {url}: {str(e)}")
|
||||
# finally:
|
||||
# if not session_id:
|
||||
# await page.close()
|
||||
# await context.close()
|
||||
|
||||
# try:
|
||||
# html = await _crawl()
|
||||
# return sanitize_input_encode(html)
|
||||
# except Error as e:
|
||||
# raise Error(f"Failed to crawl {url}: {str(e)}")
|
||||
# except Exception as e:
|
||||
# raise Exception(f"Failed to crawl {url}: {str(e)}")
|
||||
|
||||
async def execute_js(self, session_id: str, js_code: str, wait_for_js: str = None, wait_for_css: str = None) -> AsyncCrawlResponse:
|
||||
"""
|
||||
Execute JavaScript code in a specific session and optionally wait for a condition.
|
||||
|
||||
:param session_id: The ID of the session to execute the JS code in.
|
||||
:param js_code: The JavaScript code to execute.
|
||||
:param wait_for_js: JavaScript condition to wait for after execution.
|
||||
:param wait_for_css: CSS selector to wait for after execution.
|
||||
:return: AsyncCrawlResponse containing the page's HTML and other information.
|
||||
:raises ValueError: If the session does not exist.
|
||||
"""
|
||||
if not session_id:
|
||||
raise ValueError("Session ID must be provided")
|
||||
|
||||
if session_id not in self.sessions:
|
||||
raise ValueError(f"No active session found for session ID: {session_id}")
|
||||
|
||||
context, page, last_used = self.sessions[session_id]
|
||||
|
||||
try:
|
||||
await page.evaluate(js_code)
|
||||
|
||||
if wait_for_js:
|
||||
await page.wait_for_function(wait_for_js)
|
||||
|
||||
if wait_for_css:
|
||||
await page.wait_for_selector(wait_for_css)
|
||||
|
||||
# Get the updated HTML content
|
||||
html = await page.content()
|
||||
|
||||
# Get response headers and status code (assuming these are available)
|
||||
response_headers = await page.evaluate("() => JSON.stringify(performance.getEntriesByType('resource')[0].responseHeaders)")
|
||||
status_code = await page.evaluate("() => performance.getEntriesByType('resource')[0].responseStatus")
|
||||
|
||||
# Update the last used time for this session
|
||||
self.sessions[session_id] = (context, page, time.time())
|
||||
|
||||
return AsyncCrawlResponse(html=html, response_headers=response_headers, status_code=status_code)
|
||||
except Error as e:
|
||||
raise Error(f"Failed to execute JavaScript or wait for condition in session {session_id}: {str(e)}")
|
||||
|
||||
|
||||
async def crawl_many(self, urls: List[str], **kwargs) -> List[AsyncCrawlResponse]:
|
||||
semaphore_count = kwargs.get('semaphore_count', calculate_semaphore_count())
|
||||
semaphore_count = kwargs.get('semaphore_count', 5) # Adjust as needed
|
||||
semaphore = asyncio.Semaphore(semaphore_count)
|
||||
|
||||
async def crawl_with_semaphore(url):
|
||||
@@ -361,11 +553,13 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
results = await asyncio.gather(*tasks, return_exceptions=True)
|
||||
return [result if not isinstance(result, Exception) else str(result) for result in results]
|
||||
|
||||
async def take_screenshot(self, url: str) -> str:
|
||||
async def take_screenshot(self, url: str, wait_time=1000) -> str:
|
||||
async with await self.browser.new_context(user_agent=self.user_agent) as context:
|
||||
page = await context.new_page()
|
||||
try:
|
||||
await page.goto(url, wait_until="domcontentloaded")
|
||||
await page.goto(url, wait_until="domcontentloaded", timeout=30000)
|
||||
# Wait for a specified time (default is 1 second)
|
||||
await page.wait_for_timeout(wait_time)
|
||||
screenshot = await page.screenshot(full_page=True)
|
||||
return base64.b64encode(screenshot).decode('utf-8')
|
||||
except Exception as e:
|
||||
@@ -382,4 +576,5 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
||||
img.save(buffered, format="JPEG")
|
||||
return base64.b64encode(buffered.getvalue()).decode('utf-8')
|
||||
finally:
|
||||
await page.close()
|
||||
await page.close()
|
||||
|
||||
|
||||
@@ -29,14 +29,31 @@ class AsyncDatabaseManager:
|
||||
)
|
||||
''')
|
||||
await db.commit()
|
||||
await self.update_db_schema()
|
||||
|
||||
async def aalter_db_add_screenshot(self, new_column: str = "media"):
|
||||
async def update_db_schema(self):
|
||||
async with aiosqlite.connect(self.db_path) as db:
|
||||
# Check if the 'media' column exists
|
||||
cursor = await db.execute("PRAGMA table_info(crawled_data)")
|
||||
columns = await cursor.fetchall()
|
||||
column_names = [column[1] for column in columns]
|
||||
|
||||
if 'media' not in column_names:
|
||||
await self.aalter_db_add_column('media')
|
||||
|
||||
# Check for other missing columns and add them if necessary
|
||||
for column in ['links', 'metadata', 'screenshot']:
|
||||
if column not in column_names:
|
||||
await self.aalter_db_add_column(column)
|
||||
|
||||
async def aalter_db_add_column(self, new_column: str):
|
||||
try:
|
||||
async with aiosqlite.connect(self.db_path) as db:
|
||||
await db.execute(f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""')
|
||||
await db.commit()
|
||||
print(f"Added column '{new_column}' to the database.")
|
||||
except Exception as e:
|
||||
print(f"Error altering database to add screenshot column: {e}")
|
||||
print(f"Error altering database to add {new_column} column: {e}")
|
||||
|
||||
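Review note: `update_db_schema` reads the existing columns with `PRAGMA table_info` and adds whichever are missing. A minimal synchronous sketch of the same idea, using the standard-library `sqlite3` module instead of `aiosqlite` so it runs without extra dependencies. `add_missing_columns` is a hypothetical helper, not part of this changeset, and it assumes table and column names come from trusted constants because SQL identifiers cannot be bound as parameters.

```python
import sqlite3
from typing import Iterable, List


def add_missing_columns(
    conn: sqlite3.Connection, table: str, wanted: Iterable[str]
) -> List[str]:
    """Add each wanted TEXT column that the table lacks; return those added."""
    # Each table_info row is (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    added = []
    for column in wanted:
        if column not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} TEXT DEFAULT ''")
            added.append(column)
    conn.commit()
    return added
```

Running it twice against the same table is a no-op the second time, which is the property the schema check in the diff relies on.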
async def aget_cached_url(self, url: str) -> Optional[Tuple[str, str, str, str, str, str, str, bool, str]]:
|
||||
try:
|
||||
|
||||
@@ -23,17 +23,17 @@ class AsyncWebCrawler:
|
||||
self,
|
||||
crawler_strategy: Optional[AsyncCrawlerStrategy] = None,
|
||||
always_by_pass_cache: bool = False,
|
||||
verbose: bool = False,
|
||||
**kwargs,
|
||||
):
|
||||
self.crawler_strategy = crawler_strategy or AsyncPlaywrightCrawlerStrategy(
|
||||
verbose=verbose
|
||||
**kwargs
|
||||
)
|
||||
self.always_by_pass_cache = always_by_pass_cache
|
||||
self.crawl4ai_folder = os.path.join(Path.home(), ".crawl4ai")
|
||||
os.makedirs(self.crawl4ai_folder, exist_ok=True)
|
||||
os.makedirs(f"{self.crawl4ai_folder}/cache", exist_ok=True)
|
||||
self.ready = False
|
||||
self.verbose = verbose
|
||||
self.verbose = kwargs.get("verbose", False)
|
||||
|
||||
async def __aenter__(self):
|
||||
await self.crawler_strategy.__aenter__()
|
||||
@@ -80,7 +80,7 @@ class AsyncWebCrawler:
|
||||
|
||||
word_count_threshold = max(word_count_threshold, MIN_WORD_THRESHOLD)
|
||||
|
||||
async_response : AsyncCrawlResponse = None
|
||||
async_response: AsyncCrawlResponse = None
|
||||
cached = None
|
||||
screenshot_data = None
|
||||
extracted_content = None
|
||||
@@ -102,15 +102,14 @@ class AsyncWebCrawler:
|
||||
t1 = time.time()
|
||||
if user_agent:
|
||||
self.crawler_strategy.update_user_agent(user_agent)
|
||||
async_response : AsyncCrawlResponse = await self.crawler_strategy.crawl(url, **kwargs)
|
||||
async_response: AsyncCrawlResponse = await self.crawler_strategy.crawl(url, screenshot=screenshot, **kwargs)
|
||||
html = sanitize_input_encode(async_response.html)
|
||||
screenshot_data = async_response.screenshot
|
||||
t2 = time.time()
|
||||
if verbose:
|
||||
print(
|
||||
f"[LOG] 🚀 Crawling done for {url}, success: {bool(html)}, time taken: {t2 - t1:.2f} seconds"
|
||||
)
|
||||
if screenshot:
|
||||
screenshot_data = await self.crawler_strategy.take_screenshot(url)
|
||||
|
||||
crawl_result = await self.aprocess_html(
|
||||
url,
|
||||
@@ -127,15 +126,15 @@ class AsyncWebCrawler:
|
||||
**kwargs,
|
||||
)
|
||||
crawl_result.status_code = async_response.status_code if async_response else 200
|
||||
crawl_result.responser_headers = async_response.response_headers if async_response else {}
|
||||
crawl_result.response_headers = async_response.response_headers if async_response else {}
|
||||
crawl_result.success = bool(html)
|
||||
crawl_result.session_id = kwargs.get("session_id", None)
|
||||
return crawl_result
|
||||
except Exception as e:
|
||||
if not hasattr(e, "msg"):
|
||||
e.msg = str(e)
|
||||
print(f"[ERROR] 🚫 Failed to crawl {url}, error: {e.msg}")
|
||||
return CrawlResult(url=url, html="", success=False, error_message=e.msg)
|
||||
print(f"[ERROR] 🚫 arun(): Failed to crawl {url}, error: {e.msg}")
|
||||
return CrawlResult(url=url, html="", markdown = f"[ERROR] 🚫 arun(): Failed to crawl {url}, error: {e.msg}", success=False, error_message=e.msg)
|
||||
|
||||
async def arun_many(
|
||||
self,
|
||||
@@ -196,6 +195,7 @@ class AsyncWebCrawler:
|
||||
image_description_min_word_threshold=kwargs.get(
|
||||
"image_description_min_word_threshold", IMAGE_DESCRIPTION_MIN_WORD_THRESHOLD
|
||||
),
|
||||
**kwargs,
|
||||
)
|
||||
if verbose:
|
||||
print(
|
||||
@@ -203,11 +203,11 @@ class AsyncWebCrawler:
|
||||
)
|
||||
|
||||
if result is None:
|
||||
raise ValueError(f"Failed to extract content from the website: {url}")
|
||||
raise ValueError(f"Process HTML, Failed to extract content from the website: {url}")
|
||||
except InvalidCSSSelectorError as e:
|
||||
raise ValueError(str(e))
|
||||
except Exception as e:
|
||||
raise ValueError(f"Failed to extract content from the website: {url}, error: {str(e)}")
|
||||
raise ValueError(f"Process HTML, Failed to extract content from the website: {url}, error: {str(e)}")
|
||||
|
||||
cleaned_html = sanitize_input_encode(result.get("cleaned_html", ""))
|
||||
markdown = sanitize_input_encode(result.get("markdown", ""))
|
||||
|
||||
@@ -16,8 +16,6 @@ from .utils import (
|
||||
CustomHTML2Text
|
||||
)
|
||||
|
||||
|
||||
|
||||
class ContentScrappingStrategy(ABC):
|
||||
@abstractmethod
|
||||
def scrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
|
||||
@@ -35,6 +33,7 @@ class WebScrappingStrategy(ContentScrappingStrategy):
|
||||
return await asyncio.to_thread(self._get_content_of_website_optimized, url, html, **kwargs)
|
||||
|
||||
def _get_content_of_website_optimized(self, url: str, html: str, word_count_threshold: int = MIN_WORD_THRESHOLD, css_selector: str = None, **kwargs) -> Dict[str, Any]:
|
||||
success = True
|
||||
if not html:
|
||||
return None
|
||||
|
||||
@@ -129,7 +128,7 @@ class WebScrappingStrategy(ContentScrappingStrategy):
|
||||
image_size = 0 #int(fetch_image_file_size(img,base_url) or 0)
|
||||
image_format = os.path.splitext(img.get('src',''))[1].lower()
|
||||
# Remove . from format
|
||||
image_format = image_format.strip('.')
|
||||
image_format = image_format.strip('.').split('?')[0]
|
||||
score = 0
|
||||
if height_value:
|
||||
if height_unit == 'px' and height_value > 150:
|
||||
@@ -158,6 +157,7 @@ class WebScrappingStrategy(ContentScrappingStrategy):
|
||||
return None
|
||||
return {
|
||||
'src': img.get('src', ''),
|
||||
'data-src': img.get('data-src', ''),
|
||||
'alt': img.get('alt', ''),
|
||||
'desc': find_closest_parent_with_useful_text(img),
|
||||
'score': score,
|
||||
@@ -170,10 +170,12 @@ class WebScrappingStrategy(ContentScrappingStrategy):
|
||||
if isinstance(element, Comment):
|
||||
element.extract()
|
||||
return False
|
||||
|
||||
# if element.name == 'img':
|
||||
# process_image(element, url, 0, 1)
|
||||
# return True
|
||||
|
||||
if element.name in ['script', 'style', 'link', 'meta', 'noscript']:
|
||||
if element.name == 'img':
|
||||
process_image(element, url, 0, 1)
|
||||
element.decompose()
|
||||
return False
|
||||
|
||||
@@ -272,12 +274,46 @@ class WebScrappingStrategy(ContentScrappingStrategy):
|
||||
if base64_pattern.match(src):
|
||||
# Replace base64 data with empty string
|
||||
img['src'] = base64_pattern.sub('', src)
|
||||
|
||||
try:
|
||||
str(body)
|
||||
except Exception as e:
|
||||
# Reset body to the original HTML
|
||||
success = False
|
||||
body = BeautifulSoup(html, 'html.parser')
|
||||
|
||||
# Create a new div with a special ID
|
||||
error_div = body.new_tag('div', id='crawl4ai_error_message')
|
||||
error_div.string = '''
|
||||
Crawl4AI Error: This page is not fully supported.
|
||||
|
||||
Possible reasons:
|
||||
1. The page may have restrictions that prevent crawling.
|
||||
2. The page might not be fully loaded.
|
||||
|
||||
Suggestions:
|
||||
- Try calling the crawl function with these parameters:
|
||||
magic=True,
|
||||
- Set headless=False to visualize what's happening on the page.
|
||||
|
||||
If the issue persists, please check the page's structure and any potential anti-crawling measures.
|
||||
'''
|
||||
|
||||
# Append the error div to the body
|
||||
body.body.append(error_div)
|
||||
|
||||
print(f"[LOG] 😧 Error: After processing the crawled HTML and removing irrelevant tags, nothing was left in the page. Check the markdown for further details.")
|
||||
|
||||
|
||||
cleaned_html = str(body).replace('\n\n', '\n').replace(' ', ' ')
|
||||
cleaned_html = sanitize_html(cleaned_html)
|
||||
|
||||
h = CustomHTML2Text()
|
||||
h.ignore_links = True
|
||||
markdown = h.handle(cleaned_html)
|
||||
h.ignore_links = not kwargs.get('include_links_on_markdown', False)
|
||||
h.body_width = 0
|
||||
try:
|
||||
markdown = h.handle(cleaned_html)
|
||||
except Exception as e:
|
||||
markdown = h.handle(sanitize_html(cleaned_html))
|
||||
markdown = markdown.replace(' ```', '```')
|
||||
|
||||
try:
|
||||
@@ -286,10 +322,11 @@ class WebScrappingStrategy(ContentScrappingStrategy):
|
||||
print('Error extracting metadata:', str(e))
|
||||
meta = {}
|
||||
|
||||
cleaned_html = sanitize_html(cleaned_html)
|
||||
return {
|
||||
'markdown': markdown,
|
||||
'cleaned_html': cleaned_html,
|
||||
'success': True,
|
||||
'success': success,
|
||||
'media': media,
|
||||
'links': links,
|
||||
'metadata': meta
|
||||
|
||||
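Note on the `image_format` change above: the new line drops a query string after the extension has already been taken from the full `src`. A minimal sketch of an alternative, assuming only the standard library, is to parse the URL first so that a dot inside the query cannot be mistaken for the extension. The helper name `image_extension` is hypothetical and not part of the code in this diff.

```python
import os
from urllib.parse import urlparse


def image_extension(src: str) -> str:
    """Return the lowercase file extension of an image URL, without the dot."""
    # Keep only the path component so '?v=1.2' or '#frag' cannot affect the result.
    path = urlparse(src).path
    return os.path.splitext(path)[1].lower().lstrip(".")


print(image_extension("https://cdn.example.com/img/photo.JPG?width=300"))
print(image_extension("https://cdn.example.com/a.jpg?v=1.2"))
print(image_extension("/static/logo.png"))
```

The format check in the scoring code (`'jpg'`, `'png'`, `'webp'`) could then compare against this value directly.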
@@ -80,6 +80,7 @@ class LLMExtractionStrategy(ExtractionStrategy):
|
||||
self.word_token_rate = kwargs.get("word_token_rate", WORD_TOKEN_RATE)
|
||||
self.apply_chunking = kwargs.get("apply_chunking", True)
|
||||
self.base_url = kwargs.get("base_url", None)
|
||||
self.extra_args = kwargs.get("extra_args", {})
|
||||
if not self.apply_chunking:
|
||||
self.chunk_token_threshold = 1e9
|
||||
|
||||
@@ -111,7 +112,13 @@ class LLMExtractionStrategy(ExtractionStrategy):
|
||||
"{" + variable + "}", variable_values[variable]
|
||||
)
|
||||
|
||||
response = perform_completion_with_backoff(self.provider, prompt_with_variables, self.api_token, base_url=self.base_url) # , json_response=self.extract_type == "schema")
|
||||
response = perform_completion_with_backoff(
|
||||
self.provider,
|
||||
prompt_with_variables,
|
||||
self.api_token,
|
||||
base_url=self.base_url,
|
||||
extra_args = self.extra_args
|
||||
) # , json_response=self.extract_type == "schema")
|
||||
try:
|
||||
blocks = extract_xml_data(["blocks"], response.choices[0].message.content)['blocks']
|
||||
blocks = json.loads(blocks)
|
||||
|
||||
@@ -18,5 +18,5 @@ class CrawlResult(BaseModel):
|
||||
metadata: Optional[dict] = None
|
||||
error_message: Optional[str] = None
|
||||
session_id: Optional[str] = None
|
||||
responser_headers: Optional[dict] = None
|
||||
response_headers: Optional[dict] = None
|
||||
status_code: Optional[int] = None
|
||||
@@ -1,4 +1,4 @@
|
||||
PROMPT_EXTRACT_BLOCKS = """YHere is the URL of the webpage:
|
||||
PROMPT_EXTRACT_BLOCKS = """Here is the URL of the webpage:
|
||||
<url>{URL}</url>
|
||||
|
||||
And here is the cleaned HTML content of that webpage:
|
||||
@@ -79,7 +79,7 @@ To generate the JSON objects:
|
||||
2. For each block:
|
||||
a. Assign it an index based on its order in the content.
|
||||
b. Analyze the content and generate ONE semantic tag that describe what the block is about.
|
||||
c. Extract the text content, EXACTLY SAME AS GIVE DATA, clean it up if needed, and store it as a list of strings in the "content" field.
|
||||
c. Extract the text content, EXACTLY SAME AS THE GIVE DATA, clean it up if needed, and store it as a list of strings in the "content" field.
|
||||
|
||||
3. Ensure that the order of the JSON objects matches the order of the blocks as they appear in the original HTML content.
|
||||
|
||||
|
||||
@@ -6,6 +6,7 @@ import json
|
||||
import html
|
||||
import re
|
||||
import os
|
||||
import platform
|
||||
from html2text import HTML2Text
|
||||
from .prompts import PROMPT_EXTRACT_BLOCKS
|
||||
from .config import *
|
||||
@@ -18,6 +19,46 @@ from requests.exceptions import InvalidSchema
|
||||
class InvalidCSSSelectorError(Exception):
|
||||
pass
|
||||
|
||||
def calculate_semaphore_count():
|
||||
cpu_count = os.cpu_count()
|
||||
memory_gb = get_system_memory() / (1024 ** 3) # Convert to GB
|
||||
base_count = max(1, cpu_count // 2)
|
||||
memory_based_cap = int(memory_gb / 2) # Assume 2GB per instance
|
||||
return min(base_count, memory_based_cap)
|
||||
|
||||
def get_system_memory():
|
||||
system = platform.system()
|
||||
if system == "Linux":
|
||||
with open('/proc/meminfo', 'r') as mem:
|
||||
for line in mem:
|
||||
if line.startswith('MemTotal:'):
|
||||
return int(line.split()[1]) * 1024 # Convert KB to bytes
|
||||
elif system == "Darwin": # macOS
|
||||
import subprocess
|
||||
output = subprocess.check_output(['sysctl', '-n', 'hw.memsize']).decode('utf-8')
|
||||
return int(output.strip())
|
||||
elif system == "Windows":
|
||||
import ctypes
|
||||
kernel32 = ctypes.windll.kernel32
|
||||
c_ulonglong = ctypes.c_ulonglong
|
||||
class MEMORYSTATUSEX(ctypes.Structure):
|
||||
_fields_ = [
|
||||
('dwLength', ctypes.c_ulong),
|
||||
('dwMemoryLoad', ctypes.c_ulong),
|
||||
('ullTotalPhys', c_ulonglong),
|
||||
('ullAvailPhys', c_ulonglong),
|
||||
('ullTotalPageFile', c_ulonglong),
|
||||
('ullAvailPageFile', c_ulonglong),
|
||||
('ullTotalVirtual', c_ulonglong),
|
||||
('ullAvailVirtual', c_ulonglong),
|
||||
('ullAvailExtendedVirtual', c_ulonglong),
|
||||
]
|
||||
memoryStatus = MEMORYSTATUSEX()
|
||||
memoryStatus.dwLength = ctypes.sizeof(MEMORYSTATUSEX)
|
||||
kernel32.GlobalMemoryStatusEx(ctypes.byref(memoryStatus))
|
||||
return memoryStatus.ullTotalPhys
|
||||
else:
|
||||
raise OSError("Unsupported operating system")
|
||||
|
||||
def get_home_folder():
|
||||
home_folder = os.path.join(Path.home(), ".crawl4ai")
|
||||
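Note on `calculate_semaphore_count` above: as written, a machine with less than 2 GB of memory yields `memory_based_cap == 0`, so the function returns 0, and `os.cpu_count()` may return `None`. A guarded variant might look like the sketch below. The name `safe_semaphore_count` and the `total_memory_bytes` parameter are assumptions made for illustration; the diff obtains memory from `get_system_memory()` instead.

```python
import os

BYTES_PER_GB = 1024 ** 3


def safe_semaphore_count(total_memory_bytes: int, gb_per_instance: float = 2.0) -> int:
    """Derive a concurrency limit from CPU count and memory, never below 1."""
    cpu_count = os.cpu_count() or 1  # os.cpu_count() may return None
    base_count = max(1, cpu_count // 2)
    # Same assumption as the diff: roughly 2 GB per browser instance.
    memory_cap = int(total_memory_bytes / BYTES_PER_GB / gb_per_instance)
    return max(1, min(base_count, memory_cap))


print(safe_semaphore_count(1 * BYTES_PER_GB))   # low-memory machine
print(safe_semaphore_count(64 * BYTES_PER_GB))  # limited by CPU count instead
```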
@@ -90,7 +131,7 @@ def split_and_parse_json_objects(json_string):
|
||||
return parsed_objects, unparsed_segments
|
||||
|
||||
def sanitize_html(html):
|
||||
# Replace all weird and special characters with an empty string
|
||||
# Replace all unwanted and special characters with an empty string
|
||||
sanitized_html = html
|
||||
# sanitized_html = re.sub(r'[^\w\s.,;:!?=\[\]{}()<>\/\\\-"]', '', html)
|
||||
|
||||
@@ -260,7 +301,7 @@ def get_content_of_website(url, html, word_count_threshold = MIN_WORD_THRESHOLD,
|
||||
if tag.name != 'img':
|
||||
tag.attrs = {}
|
||||
|
||||
# Extract all img tgas inti [{src: '', alt: ''}]
|
||||
# Extract all img tgas int0 [{src: '', alt: ''}]
|
||||
media = {
|
||||
'images': [],
|
||||
'videos': [],
|
||||
@@ -298,7 +339,7 @@ def get_content_of_website(url, html, word_count_threshold = MIN_WORD_THRESHOLD,
|
||||
img.decompose()
|
||||
|
||||
|
||||
# Create a function that replace content of all"pre" tage with its inner text
|
||||
# Create a function that replace content of all"pre" tag with its inner text
|
||||
def replace_pre_tags_with_text(node):
|
||||
for child in node.find_all('pre'):
|
||||
# set child inner html to its text
|
||||
@@ -461,7 +502,7 @@ def get_content_of_website_optimized(url: str, html: str, word_count_threshold:
|
||||
current_tag = tag
|
||||
while current_tag:
|
||||
current_tag = current_tag.parent
|
||||
# Get the text content of the parent tag
|
||||
# Get the text content from the parent tag
|
||||
if current_tag:
|
||||
text_content = current_tag.get_text(separator=' ',strip=True)
|
||||
# Check if the text content has at least word_count_threshold
|
||||
@@ -470,88 +511,88 @@ def get_content_of_website_optimized(url: str, html: str, word_count_threshold:
|
||||
return None
|
||||
|
||||
def process_image(img, url, index, total_images):
|
||||
#Check if an image has valid display and inside undesired html elements
|
||||
def is_valid_image(img, parent, parent_classes):
|
||||
style = img.get('style', '')
|
||||
src = img.get('src', '')
|
||||
classes_to_check = ['button', 'icon', 'logo']
|
||||
tags_to_check = ['button', 'input']
|
||||
return all([
|
||||
'display:none' not in style,
|
||||
src,
|
||||
not any(s in var for var in [src, img.get('alt', ''), *parent_classes] for s in classes_to_check),
|
||||
parent.name not in tags_to_check
|
||||
])
|
||||
#Check if an image has valid display and inside undesired html elements
|
||||
def is_valid_image(img, parent, parent_classes):
|
||||
style = img.get('style', '')
|
||||
src = img.get('src', '')
|
||||
classes_to_check = ['button', 'icon', 'logo']
|
||||
tags_to_check = ['button', 'input']
|
||||
return all([
|
||||
'display:none' not in style,
|
||||
src,
|
||||
not any(s in var for var in [src, img.get('alt', ''), *parent_classes] for s in classes_to_check),
|
||||
parent.name not in tags_to_check
|
||||
])
|
||||
|
||||
#Score an image for it's usefulness
|
||||
def score_image_for_usefulness(img, base_url, index, images_count):
|
||||
# Function to parse image height/width value and units
|
||||
def parse_dimension(dimension):
|
||||
if dimension:
|
||||
match = re.match(r"(\d+)(\D*)", dimension)
|
||||
if match:
|
||||
number = int(match.group(1))
|
||||
unit = match.group(2) or 'px' # Default unit is 'px' if not specified
|
||||
return number, unit
|
||||
return None, None
|
||||
#Score an image for it's usefulness
|
||||
def score_image_for_usefulness(img, base_url, index, images_count):
|
||||
# Function to parse image height/width value and units
|
||||
def parse_dimension(dimension):
|
||||
if dimension:
|
||||
match = re.match(r"(\d+)(\D*)", dimension)
|
||||
if match:
|
||||
number = int(match.group(1))
|
||||
unit = match.group(2) or 'px' # Default unit is 'px' if not specified
|
||||
return number, unit
|
||||
return None, None
|
||||
|
||||
# Fetch image file metadata to extract size and extension
|
||||
def fetch_image_file_size(img, base_url):
|
||||
#If src is relative path construct full URL, if not it may be CDN URL
|
||||
img_url = urljoin(base_url,img.get('src'))
|
||||
try:
|
||||
response = requests.head(img_url)
|
||||
if response.status_code == 200:
|
||||
return response.headers.get('Content-Length',None)
|
||||
else:
|
||||
print(f"Failed to retrieve file size for {img_url}")
|
||||
return None
|
||||
except InvalidSchema as e:
|
||||
# Fetch image file metadata to extract size and extension
|
||||
def fetch_image_file_size(img, base_url):
|
||||
#If src is relative path construct full URL, if not it may be CDN URL
|
||||
img_url = urljoin(base_url,img.get('src'))
|
||||
try:
|
||||
response = requests.head(img_url)
|
||||
if response.status_code == 200:
|
||||
return response.headers.get('Content-Length',None)
|
||||
else:
|
||||
print(f"Failed to retrieve file size for {img_url}")
|
||||
return None
|
||||
finally:
|
||||
return
|
||||
except InvalidSchema as e:
|
||||
return None
|
||||
finally:
|
||||
return
|
||||
|
||||
image_height = img.get('height')
|
||||
height_value, height_unit = parse_dimension(image_height)
|
||||
image_width = img.get('width')
|
||||
width_value, width_unit = parse_dimension(image_width)
|
||||
image_size = 0 #int(fetch_image_file_size(img,base_url) or 0)
|
||||
image_format = os.path.splitext(img.get('src',''))[1].lower()
|
||||
# Remove . from format
|
||||
image_format = image_format.strip('.')
|
||||
score = 0
|
||||
if height_value:
|
||||
if height_unit == 'px' and height_value > 150:
|
||||
score += 1
|
||||
if height_unit in ['%','vh','vmin','vmax'] and height_value >30:
|
||||
score += 1
|
||||
if width_value:
|
||||
if width_unit == 'px' and width_value > 150:
|
||||
score += 1
|
||||
if width_unit in ['%','vh','vmin','vmax'] and width_value >30:
|
||||
score += 1
|
||||
if image_size > 10000:
|
||||
image_height = img.get('height')
|
||||
height_value, height_unit = parse_dimension(image_height)
|
||||
image_width = img.get('width')
|
||||
width_value, width_unit = parse_dimension(image_width)
|
||||
image_size = 0 #int(fetch_image_file_size(img,base_url) or 0)
|
||||
image_format = os.path.splitext(img.get('src',''))[1].lower()
|
||||
# Remove . from format
|
||||
image_format = image_format.strip('.')
|
||||
score = 0
|
||||
if height_value:
|
||||
if height_unit == 'px' and height_value > 150:
|
||||
score += 1
|
||||
if img.get('alt') != '':
|
||||
score+=1
|
||||
if any(image_format==format for format in ['jpg','png','webp']):
|
||||
score+=1
|
||||
if index/images_count<0.5:
|
||||
score+=1
|
||||
return score
|
||||
if height_unit in ['%','vh','vmin','vmax'] and height_value >30:
|
||||
score += 1
|
||||
if width_value:
|
||||
if width_unit == 'px' and width_value > 150:
|
||||
score += 1
|
||||
if width_unit in ['%','vh','vmin','vmax'] and width_value >30:
|
||||
score += 1
|
||||
if image_size > 10000:
|
||||
score += 1
|
||||
if img.get('alt') != '':
|
||||
score+=1
|
||||
if any(image_format==format for format in ['jpg','png','webp']):
|
||||
score+=1
|
||||
if index/images_count<0.5:
|
||||
score+=1
|
||||
return score
|
||||
|
||||
if not is_valid_image(img, img.parent, img.parent.get('class', [])):
|
||||
return None
|
||||
score = score_image_for_usefulness(img, url, index, total_images)
|
||||
if score <= IMAGE_SCORE_THRESHOLD:
|
||||
return None
|
||||
return {
|
||||
'src': img.get('src', ''),
|
||||
'alt': img.get('alt', ''),
|
||||
'desc': find_closest_parent_with_useful_text(img),
|
||||
'score': score,
|
||||
'type': 'image'
|
||||
}
|
||||
if not is_valid_image(img, img.parent, img.parent.get('class', [])):
|
||||
return None
|
||||
score = score_image_for_usefulness(img, url, index, total_images)
|
||||
if score <= IMAGE_SCORE_THRESHOLD:
|
||||
return None
|
||||
return {
|
||||
'src': img.get('src', '').replace('\\"', '"').strip(),
|
||||
'alt': img.get('alt', ''),
|
||||
'desc': find_closest_parent_with_useful_text(img),
|
||||
'score': score,
|
||||
'type': 'image'
|
||||
}
|
||||
|
||||
def process_element(element: element.PageElement) -> bool:
|
||||
try:
|
||||
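Note on `parse_dimension` in the hunk above: the function is copied here as a standalone snippet to show what it returns for common `width`/`height` attribute values. The sample inputs are illustrative only.

```python
import re


def parse_dimension(dimension):
    """Split an HTML width/height attribute into (number, unit)."""
    if dimension:
        match = re.match(r"(\d+)(\D*)", dimension)
        if match:
            number = int(match.group(1))
            unit = match.group(2) or "px"  # default unit when none is given
            return number, unit
    return None, None


for value in ["200", "50%", "300px", "auto", None]:
    print(repr(value), "->", parse_dimension(value))
```

Values that do not start with a digit, such as `auto`, yield `(None, None)`, so the scoring code skips the corresponding size checks.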
@@ -651,8 +692,8 @@ def get_content_of_website_optimized(url: str, html: str, word_count_threshold:
|
||||
for img in imgs:
|
||||
src = img.get('src', '')
|
||||
if base64_pattern.match(src):
|
||||
# Replace base64 data with empty string
|
||||
img['src'] = base64_pattern.sub('', src)
|
||||
|
||||
cleaned_html = str(body).replace('\n\n', '\n').replace(' ', ' ')
|
||||
cleaned_html = sanitize_html(cleaned_html)
|
||||
|
||||
@@ -734,7 +775,14 @@ def extract_xml_data(tags, string):
|
||||
return data
|
||||
|
||||
# Function to perform the completion with exponential backoff
|
||||
def perform_completion_with_backoff(provider, prompt_with_variables, api_token, json_response = False, base_url=None):
|
||||
def perform_completion_with_backoff(
|
||||
provider,
|
||||
prompt_with_variables,
|
||||
api_token,
|
||||
json_response = False,
|
||||
base_url=None,
|
||||
**kwargs
|
||||
):
|
||||
from litellm import completion
|
||||
from litellm.exceptions import RateLimitError
|
||||
max_attempts = 3
|
||||
@@ -743,6 +791,9 @@ def perform_completion_with_backoff(provider, prompt_with_variables, api_token,
|
||||
extra_args = {}
|
||||
if json_response:
|
||||
extra_args["response_format"] = { "type": "json_object" }
|
||||
|
||||
if kwargs.get("extra_args"):
|
||||
extra_args.update(kwargs["extra_args"])
|
||||
|
||||
for attempt in range(max_attempts):
|
||||
try:
|
||||
|
||||
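Note on the `extra_args` change above: caller-supplied options are merged into the arguments the function builds itself before the retry loop starts. The sketch below illustrates that merge together with a simple exponential backoff. It does not call litellm: `RateLimitError`, `complete_with_backoff`, the `call` parameter, and the delay values are stand-ins chosen for this example, and the real retry details are not visible in this hunk.

```python
import time


class RateLimitError(Exception):
    """Stand-in for a provider's rate-limit exception (hypothetical)."""


def complete_with_backoff(call, prompt, json_response=False, max_attempts=3,
                          base_delay=2.0, sleep=time.sleep, **kwargs):
    # Start from the options the function derives itself ...
    extra_args = {}
    if json_response:
        extra_args["response_format"] = {"type": "json_object"}
    # ... then let caller-supplied extra_args extend or override them.
    if kwargs.get("extra_args"):
        extra_args.update(kwargs["extra_args"])

    for attempt in range(max_attempts):
        try:
            return call(prompt, **extra_args)
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            sleep(base_delay * (2 ** attempt))  # wait 2s, then 4s, ...


def flaky_call_factory():
    """Build a fake completion call that is rate-limited twice, then succeeds."""
    state = {"calls": 0}

    def call(prompt, **options):
        state["calls"] += 1
        if state["calls"] < 3:
            raise RateLimitError("slow down")
        return {"prompt": prompt, "options": options}

    return call


delays = []
result = complete_with_backoff(
    flaky_call_factory(),
    "hello",
    extra_args={"extra_headers": {"X-Custom-Header": "Some-Value"}},
    sleep=delays.append,  # record delays instead of actually sleeping
)
print(result["options"], delays)
```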
@@ -12,6 +12,7 @@ from typing import List
|
||||
from concurrent.futures import ThreadPoolExecutor
|
||||
from .config import *
|
||||
import warnings
|
||||
import json
|
||||
warnings.filterwarnings("ignore", message='Field "model_name" has conflict with protected namespace "model_".')
|
||||
|
||||
|
||||
|
||||
48
docs/examples/async_webcrawler_multiple_urls_example.py
Normal file
@@ -0,0 +1,48 @@
|
||||
# File: async_webcrawler_multiple_urls_example.py
|
||||
import os, sys
|
||||
# append 2 parent directories to sys.path to import crawl4ai
|
||||
parent_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
|
||||
sys.path.append(parent_dir)
|
||||
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
|
||||
async def main():
|
||||
# Initialize the AsyncWebCrawler
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
# List of URLs to crawl
|
||||
urls = [
|
||||
"https://example.com",
|
||||
"https://python.org",
|
||||
"https://github.com",
|
||||
"https://stackoverflow.com",
|
||||
"https://news.ycombinator.com"
|
||||
]
|
||||
|
||||
# Set up crawling parameters
|
||||
word_count_threshold = 100
|
||||
|
||||
# Run the crawling process for multiple URLs
|
||||
results = await crawler.arun_many(
|
||||
urls=urls,
|
||||
word_count_threshold=word_count_threshold,
|
||||
bypass_cache=True,
|
||||
verbose=True
|
||||
)
|
||||
|
||||
# Process the results
|
||||
for result in results:
|
||||
if result.success:
|
||||
print(f"Successfully crawled: {result.url}")
|
||||
print(f"Title: {result.metadata.get('title', 'N/A')}")
|
||||
print(f"Word count: {len(result.markdown.split())}")
|
||||
print(f"Number of links: {len(result.links.get('internal', [])) + len(result.links.get('external', []))}")
|
||||
print(f"Number of images: {len(result.media.get('images', []))}")
|
||||
print("---")
|
||||
else:
|
||||
print(f"Failed to crawl: {result.url}")
|
||||
print(f"Error: {result.error_message}")
|
||||
print("---")
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
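Note on the example above: it hands a list of URLs to `arun_many`, whose scheduling is not shown in this diff. As an illustration of the general pattern only, the sketch below bounds concurrency with `asyncio.Semaphore` and collects results in input order with `asyncio.gather`. `fake_crawl` and `crawl_many` are hypothetical stand-ins and do not use the crawl4ai API.

```python
import asyncio


async def fake_crawl(url: str) -> dict:
    """Hypothetical stand-in for a crawl call; returns a result-like dict."""
    await asyncio.sleep(0.01)
    return {"url": url, "success": not url.endswith("/bad")}


async def crawl_many(urls, limit=2):
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        async with semaphore:  # at most `limit` crawls run at once
            return await fake_crawl(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(u) for u in urls))


urls = ["https://example.com", "https://example.org/bad", "https://example.net"]
results = asyncio.run(crawl_many(urls))
for r in results:
    print("ok  " if r["success"] else "fail", r["url"])
```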
45
docs/examples/language_support_example.py
Normal file
@@ -0,0 +1,45 @@
|
||||
import asyncio
|
||||
from crawl4ai import AsyncWebCrawler, AsyncPlaywrightCrawlerStrategy
|
||||
|
||||
async def main():
|
||||
# Example 1: Setting language when creating the crawler
|
||||
crawler1 = AsyncWebCrawler(
|
||||
crawler_strategy=AsyncPlaywrightCrawlerStrategy(
|
||||
headers={"Accept-Language": "fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7"}
|
||||
)
|
||||
)
|
||||
result1 = await crawler1.arun("https://www.example.com")
|
||||
print("Example 1 result:", result1.extracted_content[:100]) # Print first 100 characters
|
||||
|
||||
# Example 2: Setting language before crawling
|
||||
crawler2 = AsyncWebCrawler()
|
||||
crawler2.crawler_strategy.headers["Accept-Language"] = "es-ES,es;q=0.9,en-US;q=0.8,en;q=0.7"
|
||||
result2 = await crawler2.arun("https://www.example.com")
|
||||
print("Example 2 result:", result2.extracted_content[:100])
|
||||
|
||||
# Example 3: Setting language when calling arun method
|
||||
crawler3 = AsyncWebCrawler()
|
||||
result3 = await crawler3.arun(
|
||||
"https://www.example.com",
|
||||
headers={"Accept-Language": "de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7"}
|
||||
)
|
||||
print("Example 3 result:", result3.extracted_content[:100])
|
||||
|
||||
# Example 4: Crawling multiple pages with different languages
|
||||
urls = [
|
||||
("https://www.example.com", "fr-FR,fr;q=0.9"),
|
||||
("https://www.example.org", "es-ES,es;q=0.9"),
|
||||
("https://www.example.net", "de-DE,de;q=0.9"),
|
||||
]
|
||||
|
||||
crawler4 = AsyncWebCrawler()
|
||||
results = await asyncio.gather(*[
|
||||
crawler4.arun(url, headers={"Accept-Language": lang})
|
||||
for url, lang in urls
|
||||
])
|
||||
|
||||
for url, result in zip([u for u, _ in urls], results):
|
||||
print(f"Result for {url}:", result.extracted_content[:100])
|
||||
|
||||
if __name__ == "__main__":
|
||||
asyncio.run(main())
|
||||
@@ -47,8 +47,7 @@
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# !pip install \"crawl4ai @ git+https://github.com/unclecode/crawl4ai.git\"\n",
|
||||
"!pip install \"crawl4ai @ git+https://github.com/unclecode/crawl4ai.git@staging\"\n",
|
||||
"!pip install crawl4ai\n",
|
||||
"!pip install nest-asyncio\n",
|
||||
"!playwright install"
|
||||
]
|
||||
@@ -714,7 +713,7 @@
|
||||
"provenance": []
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"display_name": "venv",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
|
||||
@@ -10,6 +10,7 @@ import time
|
||||
import json
|
||||
import os
|
||||
import re
|
||||
from typing import Dict
|
||||
from bs4 import BeautifulSoup
|
||||
from pydantic import BaseModel, Field
|
||||
from crawl4ai import AsyncWebCrawler
|
||||
@@ -18,6 +19,8 @@ from crawl4ai.extraction_strategy import (
|
||||
LLMExtractionStrategy,
|
||||
)
|
||||
|
||||
__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
|
||||
|
||||
print("Crawl4AI: Advanced Web Crawling and Data Extraction")
|
||||
print("GitHub Repository: https://github.com/unclecode/crawl4ai")
|
||||
print("Twitter: @unclecode")
|
||||
@@ -30,7 +33,7 @@ async def simple_crawl():
|
||||
result = await crawler.arun(url="https://www.nbcnews.com/business")
|
||||
print(result.markdown[:500]) # Print first 500 characters
|
||||
|
||||
async def js_and_css():
|
||||
async def simple_example_with_running_js_code():
|
||||
print("\n--- Executing JavaScript and Using CSS Selectors ---")
|
||||
# New code to handle the wait_for parameter
|
||||
wait_for = """() => {
|
||||
@@ -47,12 +50,21 @@ async def js_and_css():
|
||||
result = await crawler.arun(
|
||||
url="https://www.nbcnews.com/business",
|
||||
js_code=js_code,
|
||||
# css_selector="article.tease-card",
|
||||
# wait_for=wait_for,
|
||||
bypass_cache=True,
|
||||
)
|
||||
print(result.markdown[:500]) # Print first 500 characters
|
||||
|
||||
async def simple_example_with_css_selector():
|
||||
print("\n--- Using CSS Selectors ---")
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://www.nbcnews.com/business",
|
||||
css_selector=".wide-tease-item__description",
|
||||
bypass_cache=True,
|
||||
)
|
||||
print(result.markdown[:500]) # Print first 500 characters
|
||||
|
||||
async def use_proxy():
|
||||
print("\n--- Using a Proxy ---")
|
||||
print(
|
||||
@@ -66,6 +78,28 @@ async def use_proxy():
|
||||
# )
|
||||
# print(result.markdown[:500]) # Print first 500 characters
|
||||
|
||||
async def capture_and_save_screenshot(url: str, output_path: str):
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
result = await crawler.arun(
|
||||
url=url,
|
||||
screenshot=True,
|
||||
bypass_cache=True
|
||||
)
|
||||
|
||||
if result.success and result.screenshot:
|
||||
import base64
|
||||
|
||||
# Decode the base64 screenshot data
|
||||
screenshot_data = base64.b64decode(result.screenshot)
|
||||
|
||||
# Save the decoded screenshot bytes to output_path
|
||||
with open(output_path, 'wb') as f:
|
||||
f.write(screenshot_data)
|
||||
|
||||
print(f"Screenshot saved successfully to {output_path}")
|
||||
else:
|
||||
print("Failed to capture screenshot")
|
||||
|
||||
class OpenAIModelFee(BaseModel):
|
||||
model_name: str = Field(..., description="Name of the OpenAI model.")
|
||||
input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
|
||||
@@ -73,27 +107,30 @@ class OpenAIModelFee(BaseModel):
|
||||
..., description="Fee for output token for the OpenAI model."
|
||||
)
|
||||
|
||||
async def extract_structured_data_using_llm():
|
||||
print("\n--- Extracting Structured Data with OpenAI ---")
|
||||
print(
|
||||
"Note: Set your OpenAI API key as an environment variable to run this example."
|
||||
)
|
||||
if not os.getenv("OPENAI_API_KEY"):
|
||||
print("OpenAI API key not found. Skipping this example.")
|
||||
async def extract_structured_data_using_llm(provider: str, api_token: str = None, extra_headers: Dict[str, str] = None):
|
||||
print(f"\n--- Extracting Structured Data with {provider} ---")
|
||||
|
||||
if api_token is None and provider != "ollama":
|
||||
print(f"API token is required for {provider}. Skipping this example.")
|
||||
return
|
||||
|
||||
extra_args = {}
|
||||
if extra_headers:
|
||||
extra_args["extra_headers"] = extra_headers
|
||||
|
||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||
result = await crawler.arun(
|
||||
url="https://openai.com/api/pricing/",
|
||||
word_count_threshold=1,
|
||||
extraction_strategy=LLMExtractionStrategy(
|
||||
provider="openai/gpt-4o",
|
||||
api_token=os.getenv("OPENAI_API_KEY"),
|
||||
provider=provider,
|
||||
api_token=api_token,
|
||||
schema=OpenAIModelFee.schema(),
|
||||
extraction_type="schema",
|
||||
instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
|
||||
Do not miss any models in the entire content. One extracted model JSON format should look like this:
|
||||
{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}.""",
|
||||
extra_args=extra_args
|
||||
),
|
||||
bypass_cache=True,
|
||||
)
|
||||
@@ -320,6 +357,40 @@ async def crawl_dynamic_content_pages_method_3():
|
||||
await crawler.crawler_strategy.kill_session(session_id)
|
||||
print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
|
||||
|
||||
async def crawl_custom_browser_type():
|
||||
# Use Firefox
|
||||
start = time.time()
|
||||
async with AsyncWebCrawler(browser_type="firefox", verbose=True, headless = True) as crawler:
|
||||
result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
|
||||
print(result.markdown[:500])
|
||||
print("Time taken: ", time.time() - start)
|
||||
|
||||
# Use WebKit
|
||||
start = time.time()
|
||||
async with AsyncWebCrawler(browser_type="webkit", verbose=True, headless = True) as crawler:
|
||||
result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
|
||||
print(result.markdown[:500])
|
||||
print("Time taken: ", time.time() - start)
|
||||
|
||||
# Use Chromium (default)
|
||||
start = time.time()
|
||||
async with AsyncWebCrawler(verbose=True, headless = True) as crawler:
|
||||
result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
|
||||
print(result.markdown[:500])
|
||||
print("Time taken: ", time.time() - start)
|
||||
|
||||
async def crawl_with_user_simultion():
|
||||
async with AsyncWebCrawler(verbose=True, headless=True) as crawler:
|
||||
url = "YOUR-URL-HERE"
|
||||
result = await crawler.arun(
|
||||
url=url,
|
||||
bypass_cache=True,
|
||||
simulate_user = True,# Causes a series of random mouse movements and clicks to simulate user interaction
|
||||
override_navigator = True # Overrides the navigator object to make it look like a real user
|
||||
)
|
||||
|
||||
print(result.markdown)
|
||||
|
||||
async def speed_comparison():
|
||||
# print("\n--- Speed Comparison ---")
|
||||
# print("Firecrawl (simulated):")
|
||||
@@ -387,13 +458,31 @@ async def speed_comparison():
|
||||
|
||||
async def main():
|
||||
await simple_crawl()
|
||||
await js_and_css()
|
||||
await simple_example_with_running_js_code()
|
||||
await simple_example_with_css_selector()
|
||||
await use_proxy()
|
||||
await capture_and_save_screenshot("https://www.example.com", os.path.join(__location__, "tmp/example_screenshot.jpg"))
|
||||
await extract_structured_data_using_css_extractor()
|
||||
|
||||
# LLM extraction examples
|
||||
await extract_structured_data_using_llm()
|
||||
await extract_structured_data_using_llm("huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct", os.getenv("HUGGINGFACE_API_KEY"))
|
||||
await extract_structured_data_using_llm("openai/gpt-4o", os.getenv("OPENAI_API_KEY"))
|
||||
await extract_structured_data_using_llm("ollama/llama3.2")
|
||||
|
||||
# You can always pass custom headers to the extraction strategy
|
||||
custom_headers = {
|
||||
"Authorization": "Bearer your-custom-token",
|
||||
"X-Custom-Header": "Some-Value"
|
||||
}
|
||||
await extract_structured_data_using_llm(extra_headers=custom_headers)
|
||||
|
||||
# await crawl_dynamic_content_pages_method_1()
|
||||
# await crawl_dynamic_content_pages_method_2()
|
||||
await crawl_dynamic_content_pages_method_3()
|
||||
|
||||
await crawl_custom_browser_type()
|
||||
|
||||
await speed_comparison()
|
||||
|
||||
|
||||
|
||||
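Note on `capture_and_save_screenshot` above: the quickstart writes the decoded bytes to a `.jpg` path, while the screenshot tests in this change set assert a PNG image. One way to avoid a mismatched extension, sketched here with the standard library only, is to inspect the file signature of the decoded bytes. The helper names are hypothetical, and the sample payload is a fake header rather than a real screenshot.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_SIGNATURE = b"\xff\xd8\xff"


def detect_image_extension(data: bytes) -> str:
    """Guess a file extension from the leading magic bytes."""
    if data.startswith(PNG_SIGNATURE):
        return ".png"
    if data.startswith(JPEG_SIGNATURE):
        return ".jpg"
    return ".bin"  # unknown format, keep the bytes but do not mislabel them


def decode_screenshot(encoded: str):
    """Decode a base64 screenshot string and report a matching extension."""
    data = base64.b64decode(encoded)
    return data, detect_image_extension(data)


# Fake payload: a PNG signature followed by placeholder bytes.
sample = base64.b64encode(PNG_SIGNATURE + b"rest-of-file").decode("ascii")
data, ext = decode_screenshot(sample)
print(ext)
```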
@@ -2,11 +2,10 @@ aiosqlite==0.20.0
|
||||
html2text==2024.2.26
|
||||
lxml==5.3.0
|
||||
litellm==1.48.0
|
||||
numpy==2.1.1
|
||||
numpy>=1.26.0,<2.1.1
|
||||
pillow==10.4.0
|
||||
playwright==1.47.0
|
||||
python-dotenv==1.0.1
|
||||
requests==2.32.3
|
||||
PyYAML==6.0.2
|
||||
requests>=2.26.0,<2.32.3
|
||||
beautifulsoup4==4.12.3
|
||||
psutil==6.0.0
|
||||
playwright_stealth==1.0.6
|
||||
23
setup.py
@@ -4,6 +4,7 @@ import os
|
||||
from pathlib import Path
|
||||
import shutil
|
||||
import subprocess
|
||||
import sys
|
||||
|
||||
# Create the .crawl4ai folder in the user's home directory if it doesn't exist
|
||||
# If the folder already exists, remove the cache folder
|
||||
@@ -35,21 +36,23 @@ transformer_requirements = ["transformers", "tokenizers", "onnxruntime"]
|
||||
cosine_similarity_requirements = ["torch", "transformers", "nltk", "spacy"]
|
||||
sync_requirements = ["selenium"]
|
||||
|
||||
def post_install():
|
||||
print("Running post-installation setup...")
|
||||
def install_playwright():
|
||||
print("Installing Playwright browsers...")
|
||||
try:
|
||||
subprocess.check_call(["playwright", "install"])
|
||||
subprocess.check_call([sys.executable, "-m", "playwright", "install"])
|
||||
print("Playwright installation completed successfully.")
|
||||
except subprocess.CalledProcessError:
|
||||
print("Error during Playwright installation. Please run 'playwright install' manually.")
|
||||
except FileNotFoundError:
|
||||
print("Playwright not found. Please ensure it's installed and run 'playwright install' manually.")
|
||||
except subprocess.CalledProcessError as e:
|
||||
print(f"Error during Playwright installation: {e}")
|
||||
print("Please run 'python -m playwright install' manually after the installation.")
|
||||
except Exception as e:
|
||||
print(f"Unexpected error during Playwright installation: {e}")
|
||||
print("Please run 'python -m playwright install' manually after the installation.")
|
||||
|
||||
class PostInstallCommand(install):
|
||||
def run(self):
|
||||
install.run(self)
|
||||
post_install()
|
||||
|
||||
install_playwright()
|
||||
|
||||
setup(
|
||||
name="Crawl4AI",
|
||||
version=version,
|
||||
@@ -61,7 +64,7 @@ setup(
|
||||
author_email="unclecode@kidocode.com",
|
||||
license="MIT",
|
||||
packages=find_packages(),
|
||||
install_requires=default_requirements,
|
||||
install_requires=default_requirements + ["playwright"], # Add playwright to default requirements
|
||||
extras_require={
|
||||
"torch": torch_requirements,
|
||||
"transformer": transformer_requirements,
|
||||
|
||||
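Note on `install_playwright` above: invoking a tool as `[sys.executable, "-m", ...]` runs it with the same interpreter, and therefore the same environment, that is executing `setup.py`, instead of whatever executable happens to be first on `PATH`. The sketch below shows the pattern with a harmless `-c` command in place of Playwright. `run_with_current_interpreter` is a hypothetical helper, not part of the diff.

```python
import subprocess
import sys


def run_with_current_interpreter(args) -> bool:
    """Run `python <args>` using the interpreter executing this script."""
    try:
        subprocess.check_call([sys.executable, *args])
        return True
    except subprocess.CalledProcessError as exc:
        print(f"Command failed with exit code {exc.returncode}")
        return False
    except OSError as exc:
        print(f"Could not start the interpreter: {exc}")
        return False


ok = run_with_current_interpreter(["-c", "import sys; sys.exit(0)"])
failed = run_with_current_interpreter(["-c", "import sys; sys.exit(3)"])
print(ok, failed)
```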
@@ -5,7 +5,7 @@ import asyncio
|
||||
import time
|
||||
|
||||
# Add the parent directory to the Python path
|
||||
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
|
||||
parent_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
|
||||
sys.path.append(parent_dir)
|
||||
|
||||
from crawl4ai.async_webcrawler import AsyncWebCrawler
|
||||
|
||||
124
tests/async/test_screenshot.py
Normal file
@@ -0,0 +1,124 @@
import os
import sys
import pytest
import asyncio
import base64
from PIL import Image
import io

# Add the repository root to the Python path (this file lives in tests/async/)
parent_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
sys.path.append(parent_dir)

from crawl4ai.async_webcrawler import AsyncWebCrawler

@pytest.mark.asyncio
async def test_basic_screenshot():
    async with AsyncWebCrawler(verbose=True) as crawler:
        url = "https://example.com"  # A static website
        result = await crawler.arun(url=url, bypass_cache=True, screenshot=True)

        assert result.success
        assert result.screenshot is not None

        # Verify the screenshot is a valid image
        image_data = base64.b64decode(result.screenshot)
        image = Image.open(io.BytesIO(image_data))
        assert image.format == "PNG"

@pytest.mark.asyncio
async def test_screenshot_with_wait_for():
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Using a website with dynamic content
        url = "https://www.youtube.com"
        wait_for = "css:#content"  # Wait for the main content to load

        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            screenshot=True,
            wait_for=wait_for
        )

        assert result.success
        assert result.screenshot is not None

        # Verify the screenshot is a valid image
        image_data = base64.b64decode(result.screenshot)
        image = Image.open(io.BytesIO(image_data))
        assert image.format == "PNG"

        # You might want to add more specific checks here, like image dimensions
        # or even use image recognition to verify certain elements are present

@pytest.mark.asyncio
async def test_screenshot_with_js_wait_for():
    async with AsyncWebCrawler(verbose=True) as crawler:
        url = "https://www.amazon.com"
        wait_for = "js:() => document.querySelector('#nav-logo-sprites') !== null"

        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            screenshot=True,
            wait_for=wait_for
        )

        assert result.success
        assert result.screenshot is not None

        image_data = base64.b64decode(result.screenshot)
        image = Image.open(io.BytesIO(image_data))
        assert image.format == "PNG"

@pytest.mark.asyncio
async def test_screenshot_without_wait_for():
    async with AsyncWebCrawler(verbose=True) as crawler:
        url = "https://www.nytimes.com"  # A website with lots of dynamic content

        result = await crawler.arun(url=url, bypass_cache=True, screenshot=True)

        assert result.success
        assert result.screenshot is not None

        image_data = base64.b64decode(result.screenshot)
        image = Image.open(io.BytesIO(image_data))
        assert image.format == "PNG"

@pytest.mark.asyncio
async def test_screenshot_comparison():
    async with AsyncWebCrawler(verbose=True) as crawler:
        url = "https://www.reddit.com"
        wait_for = "css:#SHORTCUT_FOCUSABLE_DIV"

        # Take screenshot without wait_for
        result_without_wait = await crawler.arun(
            url=url,
            bypass_cache=True,
            screenshot=True
        )

        # Take screenshot with wait_for
        result_with_wait = await crawler.arun(
            url=url,
            bypass_cache=True,
            screenshot=True,
            wait_for=wait_for
        )

        assert result_without_wait.success and result_with_wait.success
        assert result_without_wait.screenshot is not None
        assert result_with_wait.screenshot is not None

        # Compare the two screenshots
        image_without_wait = Image.open(io.BytesIO(base64.b64decode(result_without_wait.screenshot)))
        image_with_wait = Image.open(io.BytesIO(base64.b64decode(result_with_wait.screenshot)))

        # This is a simple size comparison. In a real-world scenario, you might want to use
        # more sophisticated image comparison techniques.
        assert image_with_wait.size[0] >= image_without_wait.size[0]
        assert image_with_wait.size[1] >= image_without_wait.size[1]

# Entry point for debugging
if __name__ == "__main__":
    pytest.main([__file__, "-v"])
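The tests above decode `result.screenshot` from base64 and check only that the image format is PNG; one comment suggests also checking image dimensions. A minimal sketch of such a check follows, using only the standard library. The helper name `png_dimensions` is hypothetical and not part of crawl4ai, and the sketch assumes the screenshot is a base64-encoded PNG, as the tests' use of `base64.b64decode` implies.

```python
import base64
import struct
from typing import Tuple

# Fixed 8-byte signature that starts every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(screenshot_b64: str) -> Tuple[int, int]:
    """Hypothetical helper: decode a base64 PNG and return (width, height)."""
    data = base64.b64decode(screenshot_b64)
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG: bad signature")
    # After the signature come a 4-byte chunk length and a 4-byte chunk type.
    # The first chunk must be IHDR, whose payload starts with the width and
    # height as big-endian unsigned 32-bit integers.
    if len(data) < 24 or data[12:16] != b"IHDR":
        raise ValueError("not a PNG: missing IHDR chunk")
    width, height = struct.unpack(">II", data[16:24])
    return width, height
```

In a test this could be used as `width, height = png_dimensions(result.screenshot)` followed by `assert width > 0 and height > 0`. Because it reads only the header, it does not confirm that the rest of the image data is intact; the Pillow-based check in the tests remains the stronger validation.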