Compare commits: 0.3.4...main-0.3.7 (20 commits)
| Author | SHA1 | Date | |
|---|---|---|---|
| | ac9d83c72f | | |
| | ff9149b5c9 | | |
| | 32f57c49d6 | | |
| | a5f627ba1a | | |
| | dbb587d681 | | |
| | 9ffa34b697 | | |
| | 740802c491 | | |
| | b9ac96c332 | | |
| | d06535388a | | |
| | 2b73bdf6b0 | | |
| | 6aa803d712 | | |
| | 320afdea64 | | |
| | ccbe72cfc1 | | |
| | b9bbd42373 | | |
| | 68e9144ce3 | | |
| | 9b2b267820 | | |
| | ff3524d9b1 | | |
| | b99d20b725 | | |
| | 768b93140f | | |
| | 4750810a67 | | |
13
.gitignore
vendored
@@ -196,4 +196,15 @@ docs/.DS_Store
 tmp/
 test_env/
 **/.DS_Store
 **/.DS_Store
+
+todo.md
+git_changes.py
+git_changes.md
+pypi_build.sh
+git_issues.py
+git_issues.md
+
+.tests/
+
+.issues/
87
CHANGELOG.md
@@ -1,5 +1,92 @@
 # Changelog
 
+## [v0.3.6] - 2024-10-12
+
+### 1. Improved Crawling Control
+- **New Hook**: Added `before_retrieve_html` hook in `AsyncPlaywrightCrawlerStrategy`.
+- **Delayed HTML Retrieval**: Introduced `delay_before_return_html` parameter to allow waiting before retrieving HTML content.
+- Useful for pages with delayed content loading.
+- **Flexible Timeout**: `smart_wait` function now uses `page_timeout` (default 60 seconds) instead of a fixed 30-second timeout.
+- Provides better handling for slow-loading pages.
+- **How to use**: Set `page_timeout=your_desired_timeout` (in milliseconds) when calling `crawler.arun()`.
+
+### 2. Browser Type Selection
+- Added support for different browser types (Chromium, Firefox, WebKit).
+- Users can now specify the browser type when initializing AsyncWebCrawler.
+- **How to use**: Set `browser_type="firefox"` or `browser_type="webkit"` when initializing AsyncWebCrawler.
+
+### 3. Screenshot Capture
+- Added ability to capture screenshots during crawling.
+- Useful for debugging and content verification.
+- **How to use**: Set `screenshot=True` when calling `crawler.arun()`.
+
+### 4. Enhanced LLM Extraction Strategy
+- Added support for multiple LLM providers (OpenAI, Hugging Face, Ollama).
+- **Custom Arguments**: Added support for passing extra arguments to LLM providers via `extra_args` parameter.
+- **Custom Headers**: Users can now pass custom headers to the extraction strategy.
+- **How to use**: Specify the desired provider and custom arguments when using `LLMExtractionStrategy`.
+
+### 5. iframe Content Extraction
+- New feature to process and extract content from iframes.
+- **How to use**: Set `process_iframes=True` in the crawl method.
+
+### 6. Delayed Content Retrieval
+- Introduced `get_delayed_content` method in `AsyncCrawlResponse`.
+- Allows retrieval of content after a specified delay, useful for dynamically loaded content.
+- **How to use**: Access `result.get_delayed_content(delay_in_seconds)` after crawling.
+
+## Improvements and Optimizations
+
+### 1. AsyncWebCrawler Enhancements
+- **Flexible Initialization**: Now accepts arbitrary keyword arguments, passed directly to the crawler strategy.
+- Allows for more customized setups.
+
+### 2. Image Processing Optimization
+- Enhanced image handling in WebScrappingStrategy.
+- Added filtering for small, invisible, or irrelevant images.
+- Improved image scoring system for better content relevance.
+- Implemented JavaScript-based image dimension updating for more accurate representation.
+
+### 3. Database Schema Auto-updates
+- Automatic database schema updates ensure compatibility with the latest version.
+
+### 4. Enhanced Error Handling and Logging
+- Improved error messages and logging for easier debugging.
+
+### 5. Content Extraction Refinements
+- Refined HTML sanitization process.
+- Improved handling of base64 encoded images.
+- Enhanced Markdown conversion process.
+- Optimized content extraction algorithms.
+
+### 6. Utility Function Enhancements
+- `perform_completion_with_backoff` function now supports additional arguments for more customized API calls to LLM providers.
+
+## Bug Fixes
+- Fixed an issue where image tags were being prematurely removed during content extraction.
+
+## Examples and Documentation
+- Updated `quickstart_async.py` with examples of:
+- Using custom headers in LLM extraction.
+- Different LLM provider usage (OpenAI, Hugging Face, Ollama).
+- Custom browser type usage.
+
+## Developer Notes
+- Refactored code for better maintainability, flexibility, and performance.
+- Enhanced type hinting throughout the codebase for improved development experience.
+- Expanded error handling for more robust operation.
+
+These updates significantly enhance the flexibility, accuracy, and robustness of crawl4ai, providing users with more control and options for their web crawling and content extraction tasks.
+
+## [v0.3.5] - 2024-09-02
+
+Enhance AsyncWebCrawler with smart waiting and screenshot capabilities
+
+- Implement smart_wait function in AsyncPlaywrightCrawlerStrategy
+- Add screenshot support to AsyncCrawlResponse and AsyncWebCrawler
+- Improve error handling and timeout management in crawling process
+- Fix typo in CrawlResult model (responser_headers -> response_headers)
+
 ## [v0.2.77] - 2024-08-04
 
 Significant improvements in text processing and performance:
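The v0.3.6 options named in the changelog could be combined roughly as in the sketch below. It is an illustration, not taken from the project's documentation. It assumes crawl4ai 0.3.6 or later with Playwright browsers installed, and that `arun()` forwards extra keyword arguments to the crawler strategy, as the strategy diff in this comparison suggests. `ARUN_OPTIONS`, `crawl_once`, and the URL in the usage comment are hypothetical names chosen for the example.

```python
import asyncio

# Per-call options named in the v0.3.6 changelog (values are illustrative).
ARUN_OPTIONS = {
    "page_timeout": 90_000,           # milliseconds, per the changelog
    "delay_before_return_html": 2.0,  # seconds; the strategy passes it to asyncio.sleep
    "screenshot": True,
    "process_iframes": True,
}


async def crawl_once(url: str):
    # Imported lazily so this module still loads where crawl4ai is not installed.
    from crawl4ai import AsyncWebCrawler

    # browser_type is assumed to reach the Playwright strategy through **kwargs.
    async with AsyncWebCrawler(browser_type="firefox", verbose=True) as crawler:
        return await crawler.arun(url=url, bypass_cache=True, **ARUN_OPTIONS)


# Usage (requires network access and an installed browser):
# result = asyncio.run(crawl_once("https://example.com"))
```

Note the unit mismatch between the two timing options: `page_timeout` is in milliseconds, while `delay_before_return_html` is in seconds.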
10
README.md
@@ -10,6 +10,14 @@ Crawl4AI simplifies asynchronous web crawling and data extraction, making it acc
 
 > Looking for the synchronous version? Check out [README.sync.md](./README.sync.md). You can also access the previous version in the branch [V0.2.76](https://github.com/unclecode/crawl4ai/blob/v0.2.76).
 
+## New update 0.3.6
+- 🌐 Multi-browser support (Chromium, Firefox, WebKit)
+- 🖼️ Improved image processing with lazy-loading detection
+- 🔧 Custom page timeout parameter for better control over crawling behavior
+- 🕰️ Enhanced handling of delayed content loading
+- 🔑 Custom headers support for LLM interactions
+- 🖼️ iframe content extraction for comprehensive page analysis
+- ⏱️ Flexible timeout and delayed content retrieval options
 
 ## Try it Now!
 
@@ -124,7 +132,7 @@ async def main():
|
|||||||
result = await crawler.arun(
|
result = await crawler.arun(
|
||||||
url="https://www.nbcnews.com/business",
|
url="https://www.nbcnews.com/business",
|
||||||
js_code=js_code,
|
js_code=js_code,
|
||||||
css_selector="article.tease-card",
|
css_selector=".wide-tease-item__description",
|
||||||
bypass_cache=True
|
bypass_cache=True
|
||||||
)
|
)
|
||||||
print(result.extracted_content)
|
print(result.extracted_content)
|
||||||
|
|||||||
@@ -3,7 +3,7 @@
 from .async_webcrawler import AsyncWebCrawler
 from .models import CrawlResult
 
-__version__ = "0.3.4"
+__version__ = "0.3.6"
 
 __all__ = [
     "AsyncWebCrawler",
@@ -1,7 +1,7 @@
|
|||||||
import asyncio
|
import asyncio
|
||||||
import base64, time
|
import base64, time
|
||||||
from abc import ABC, abstractmethod
|
from abc import ABC, abstractmethod
|
||||||
from typing import Callable, Dict, Any, List, Optional
|
from typing import Callable, Dict, Any, List, Optional, Awaitable
|
||||||
import os
|
import os
|
||||||
from playwright.async_api import async_playwright, Page, Browser, Error
|
from playwright.async_api import async_playwright, Page, Browser, Error
|
||||||
from io import BytesIO
|
from io import BytesIO
|
||||||
@@ -12,10 +12,16 @@ import hashlib
|
|||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
from playwright.async_api import ProxySettings
|
from playwright.async_api import ProxySettings
|
||||||
from pydantic import BaseModel
|
from pydantic import BaseModel
|
||||||
|
|
||||||
class AsyncCrawlResponse(BaseModel):
|
class AsyncCrawlResponse(BaseModel):
|
||||||
html: str
|
html: str
|
||||||
response_headers: Dict[str, str]
|
response_headers: Dict[str, str]
|
||||||
status_code: int
|
status_code: int
|
||||||
|
screenshot: Optional[str] = None
|
||||||
|
get_delayed_content: Optional[Callable[[Optional[float]], Awaitable[str]]] = None
|
||||||
|
|
||||||
|
class Config:
|
||||||
|
arbitrary_types_allowed = True
|
||||||
|
|
||||||
class AsyncCrawlerStrategy(ABC):
|
class AsyncCrawlerStrategy(ABC):
|
||||||
@abstractmethod
|
@abstractmethod
|
||||||
@@ -44,7 +50,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
|||||||
self.user_agent = kwargs.get("user_agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
|
self.user_agent = kwargs.get("user_agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")
|
||||||
self.proxy = kwargs.get("proxy")
|
self.proxy = kwargs.get("proxy")
|
||||||
self.headless = kwargs.get("headless", True)
|
self.headless = kwargs.get("headless", True)
|
||||||
self.headers = {}
|
self.browser_type = kwargs.get("browser_type", "chromium") # New parameter
|
||||||
|
self.headers = kwargs.get("headers", {})
|
||||||
self.sessions = {}
|
self.sessions = {}
|
||||||
self.session_ttl = 1800
|
self.session_ttl = 1800
|
||||||
self.js_code = js_code
|
self.js_code = js_code
|
||||||
@@ -57,7 +64,8 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
|||||||
'on_execution_started': None,
|
'on_execution_started': None,
|
||||||
'before_goto': None,
|
'before_goto': None,
|
||||||
'after_goto': None,
|
'after_goto': None,
|
||||||
'before_return_html': None
|
'before_return_html': None,
|
||||||
|
'before_retrieve_html': None
|
||||||
}
|
}
|
||||||
|
|
||||||
async def __aenter__(self):
|
async def __aenter__(self):
|
||||||
@@ -73,7 +81,6 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
|||||||
if self.browser is None:
|
if self.browser is None:
|
||||||
browser_args = {
|
browser_args = {
|
||||||
"headless": self.headless,
|
"headless": self.headless,
|
||||||
# "headless": False,
|
|
||||||
"args": [
|
"args": [
|
||||||
"--disable-gpu",
|
"--disable-gpu",
|
||||||
"--disable-dev-shm-usage",
|
"--disable-dev-shm-usage",
|
||||||
@@ -88,7 +95,14 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
|||||||
browser_args["proxy"] = proxy_settings
|
browser_args["proxy"] = proxy_settings
|
||||||
|
|
||||||
|
|
||||||
self.browser = await self.playwright.chromium.launch(**browser_args)
|
# Select the appropriate browser based on the browser_type
|
||||||
|
if self.browser_type == "firefox":
|
||||||
|
self.browser = await self.playwright.firefox.launch(**browser_args)
|
||||||
|
elif self.browser_type == "webkit":
|
||||||
|
self.browser = await self.playwright.webkit.launch(**browser_args)
|
||||||
|
else:
|
||||||
|
self.browser = await self.playwright.chromium.launch(**browser_args)
|
||||||
|
|
||||||
await self.execute_hook('on_browser_created', self.browser)
|
await self.execute_hook('on_browser_created', self.browser)
|
||||||
|
|
||||||
async def close(self):
|
async def close(self):
|
||||||
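The browser selection in the hunk above is an if/elif/else chain in which any unrecognized `browser_type` silently falls back to Chromium. The same behaviour can be sketched with a lookup. `pick_launcher`, `SUPPORTED_BROWSERS`, and the `fake_playwright` stand-in are hypothetical names for this example. A real Playwright instance exposes `chromium`, `firefox`, and `webkit` attributes in the same way.

```python
from types import SimpleNamespace

SUPPORTED_BROWSERS = ("chromium", "firefox", "webkit")


def pick_launcher(playwright, browser_type: str = "chromium"):
    """Return the launcher attribute for browser_type, defaulting to Chromium.

    Mirrors the diff's behaviour: unknown values select Chromium instead of
    raising an error.
    """
    name = browser_type if browser_type in SUPPORTED_BROWSERS else "chromium"
    return getattr(playwright, name)


# Stand-in object used only for demonstration; not a real Playwright instance.
fake_playwright = SimpleNamespace(chromium="C", firefox="F", webkit="W")
print(pick_launcher(fake_playwright, "firefox"))  # F
print(pick_launcher(fake_playwright, "edge"))     # C (fallback)
```

Whether a silent fallback or a `ValueError` is preferable for a mistyped browser name is a design choice; the diff takes the silent route.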
@@ -138,7 +152,45 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
|||||||
for sid in expired_sessions:
|
for sid in expired_sessions:
|
||||||
asyncio.create_task(self.kill_session(sid))
|
asyncio.create_task(self.kill_session(sid))
|
||||||
|
|
||||||
|
async def smart_wait(self, page: Page, wait_for: str, timeout: float = 30000):
|
||||||
|
wait_for = wait_for.strip()
|
||||||
|
|
||||||
|
if wait_for.startswith('js:'):
|
||||||
|
# Explicitly specified JavaScript
|
||||||
|
js_code = wait_for[3:].strip()
|
||||||
|
return await self.csp_compliant_wait(page, js_code, timeout)
|
||||||
|
elif wait_for.startswith('css:'):
|
||||||
|
# Explicitly specified CSS selector
|
||||||
|
css_selector = wait_for[4:].strip()
|
||||||
|
try:
|
||||||
|
await page.wait_for_selector(css_selector, timeout=timeout)
|
||||||
|
except Error as e:
|
||||||
|
if 'Timeout' in str(e):
|
||||||
|
raise TimeoutError(f"Timeout after {timeout}ms waiting for selector '{css_selector}'")
|
||||||
|
else:
|
||||||
|
raise ValueError(f"Invalid CSS selector: '{css_selector}'")
|
||||||
|
else:
|
||||||
|
# Auto-detect based on content
|
||||||
|
if wait_for.startswith('()') or wait_for.startswith('function'):
|
||||||
|
# It's likely a JavaScript function
|
||||||
|
return await self.csp_compliant_wait(page, wait_for, timeout)
|
||||||
|
else:
|
||||||
|
# Assume it's a CSS selector first
|
||||||
|
try:
|
||||||
|
await page.wait_for_selector(wait_for, timeout=timeout)
|
||||||
|
except Error as e:
|
||||||
|
if 'Timeout' in str(e):
|
||||||
|
raise TimeoutError(f"Timeout after {timeout}ms waiting for selector '{wait_for}'")
|
||||||
|
else:
|
||||||
|
# If it's not a timeout error, it might be an invalid selector
|
||||||
|
# Let's try to evaluate it as a JavaScript function as a fallback
|
||||||
|
try:
|
||||||
|
return await self.csp_compliant_wait(page, f"() => {{{wait_for}}}", timeout)
|
||||||
|
except Error:
|
||||||
|
raise ValueError(f"Invalid wait_for parameter: '{wait_for}'. "
|
||||||
|
"It should be either a valid CSS selector, a JavaScript function, "
|
||||||
|
"or explicitly prefixed with 'js:' or 'css:'.")
|
||||||
|
|
||||||
async def csp_compliant_wait(self, page: Page, user_wait_function: str, timeout: float = 30000):
|
async def csp_compliant_wait(self, page: Page, user_wait_function: str, timeout: float = 30000):
|
||||||
wrapper_js = f"""
|
wrapper_js = f"""
|
||||||
async () => {{
|
async () => {{
|
||||||
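The dispatch rules of `smart_wait` can be isolated from Playwright for testing. The sketch below reproduces only the prefix and auto-detection decisions from the diff. It does not reproduce the runtime fallback, in which a selector that fails for a reason other than a timeout is retried as a JavaScript function. `classify_wait_for` is a hypothetical helper name, not part of the library.

```python
from typing import Tuple


def classify_wait_for(wait_for: str) -> Tuple[str, str]:
    """Return (kind, payload), where kind is 'js' or 'css'."""
    wait_for = wait_for.strip()
    if wait_for.startswith("js:"):
        # Explicit JavaScript: drop the prefix.
        return "js", wait_for[3:].strip()
    if wait_for.startswith("css:"):
        # Explicit CSS selector: drop the prefix.
        return "css", wait_for[4:].strip()
    if wait_for.startswith("()") or wait_for.startswith("function"):
        # Looks like a JavaScript function expression.
        return "js", wait_for
    # Everything else is first treated as a CSS selector.
    return "css", wait_for


print(classify_wait_for("css: article.tease-card"))
print(classify_wait_for("() => document.readyState === 'complete'"))
```

One consequence of the auto-detection order is that an arrow function with a named parameter, such as `(el) => ...`, is treated as a CSS selector unless it carries the `js:` prefix.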
@@ -163,6 +215,48 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
|||||||
except Exception as e:
|
except Exception as e:
|
||||||
raise RuntimeError(f"Error in wait condition: {str(e)}")
|
raise RuntimeError(f"Error in wait condition: {str(e)}")
|
||||||
|
|
||||||
|
async def process_iframes(self, page):
|
||||||
|
# Find all iframes
|
||||||
|
iframes = await page.query_selector_all('iframe')
|
||||||
|
|
||||||
|
for i, iframe in enumerate(iframes):
|
||||||
|
try:
|
||||||
|
# Add a unique identifier to the iframe
|
||||||
|
await iframe.evaluate(f'(element) => element.id = "iframe-{i}"')
|
||||||
|
|
||||||
|
# Get the frame associated with this iframe
|
||||||
|
frame = await iframe.content_frame()
|
||||||
|
|
||||||
|
if frame:
|
||||||
|
# Wait for the frame to load
|
||||||
|
await frame.wait_for_load_state('load', timeout=30000) # 30 seconds timeout
|
||||||
|
|
||||||
|
# Extract the content of the iframe's body
|
||||||
|
iframe_content = await frame.evaluate('() => document.body.innerHTML')
|
||||||
|
|
||||||
|
# Generate a unique class name for this iframe
|
||||||
|
class_name = f'extracted-iframe-content-{i}'
|
||||||
|
|
||||||
|
# Replace the iframe with a div containing the extracted content
|
||||||
|
_iframe = iframe_content.replace('`', '\\`')
|
||||||
|
await page.evaluate(f"""
|
||||||
|
() => {{
|
||||||
|
const iframe = document.getElementById('iframe-{i}');
|
||||||
|
const div = document.createElement('div');
|
||||||
|
div.innerHTML = `{_iframe}`;
|
||||||
|
div.className = '{class_name}';
|
||||||
|
iframe.replaceWith(div);
|
||||||
|
}}
|
||||||
|
""")
|
||||||
|
else:
|
||||||
|
print(f"Warning: Could not access content frame for iframe {i}")
|
||||||
|
except Exception as e:
|
||||||
|
print(f"Error processing iframe {i}: {str(e)}")
|
||||||
|
|
||||||
|
# Return the page object
|
||||||
|
return page
|
||||||
|
|
||||||
|
|
||||||
async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
|
async def crawl(self, url: str, **kwargs) -> AsyncCrawlResponse:
|
||||||
response_headers = {}
|
response_headers = {}
|
||||||
status_code = None
|
status_code = None
|
||||||
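`process_iframes` embeds the extracted iframe HTML in a JavaScript template literal and escapes only backticks, so a `${...}` sequence or a backslash in the HTML would still be interpreted by the JavaScript engine. One possible alternative, shown here as a sketch rather than as the project's approach, is to serialize the HTML with `json.dumps`, which yields a valid JavaScript string literal. `build_replace_script` is a hypothetical helper name.

```python
import json


def build_replace_script(index: int, iframe_html: str) -> str:
    """Build the JS snippet that swaps iframe number `index` for a div."""
    # json.dumps escapes quotes, backslashes, and control characters, and with
    # the default ensure_ascii=True also all non-ASCII characters.
    html_literal = json.dumps(iframe_html)
    return (
        "() => {\n"
        f"    const iframe = document.getElementById('iframe-{index}');\n"
        "    const div = document.createElement('div');\n"
        f"    div.innerHTML = {html_literal};\n"
        f"    div.className = 'extracted-iframe-content-{index}';\n"
        "    iframe.replaceWith(div);\n"
        "}"
    )


print(build_replace_script(0, "<p>price: ${total} `quoted`</p>"))
```

Playwright's `page.evaluate` also accepts a second argument that is passed to the function, which would avoid string interpolation altogether.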
@@ -207,7 +301,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
|||||||
|
|
||||||
if not kwargs.get("js_only", False):
|
if not kwargs.get("js_only", False):
|
||||||
await self.execute_hook('before_goto', page)
|
await self.execute_hook('before_goto', page)
|
||||||
response = await page.goto(url, wait_until="domcontentloaded", timeout=60000)
|
response = await page.goto(url, wait_until="domcontentloaded", timeout=kwargs.get("page_timeout", 60000))
|
||||||
await self.execute_hook('after_goto', page)
|
await self.execute_hook('after_goto', page)
|
||||||
|
|
||||||
# Get status code and headers
|
# Get status code and headers
|
||||||
@@ -217,6 +311,7 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
|||||||
status_code = 200
|
status_code = 200
|
||||||
response_headers = {}
|
response_headers = {}
|
||||||
|
|
||||||
|
|
||||||
await page.wait_for_selector('body')
|
await page.wait_for_selector('body')
|
||||||
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
|
await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
|
||||||
|
|
||||||
@@ -250,22 +345,89 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
|||||||
wait_for = kwargs.get("wait_for")
|
wait_for = kwargs.get("wait_for")
|
||||||
if wait_for:
|
if wait_for:
|
||||||
try:
|
try:
|
||||||
await self.csp_compliant_wait(page, wait_for, timeout=kwargs.get("timeout", 30000))
|
await self.smart_wait(page, wait_for, timeout=kwargs.get("page_timeout", 60000))
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
raise RuntimeError(f"Custom wait condition failed: {str(e)}")
|
raise RuntimeError(f"Wait condition failed: {str(e)}")
|
||||||
# try:
|
|
||||||
# await page.wait_for_function(wait_for)
|
|
||||||
# # if callable(wait_for):
|
|
||||||
# # await page.wait_for_function(wait_for)
|
|
||||||
# # elif isinstance(wait_for, str):
|
|
||||||
# # await page.wait_for_selector(wait_for)
|
|
||||||
# # else:
|
|
||||||
# # raise ValueError("wait_for must be either a callable or a CSS selector string")
|
|
||||||
# except Error as e:
|
|
||||||
# raise Error(f"Custom wait condition failed: {str(e)}")
|
|
||||||
|
|
||||||
|
# Check if kwargs has screenshot=True then take screenshot
|
||||||
|
screenshot_data = None
|
||||||
|
if kwargs.get("screenshot"):
|
||||||
|
screenshot_data = await self.take_screenshot(url)
|
||||||
|
|
||||||
|
|
||||||
|
# New code to update image dimensions
|
||||||
|
update_image_dimensions_js = """
|
||||||
|
() => {
|
||||||
|
return new Promise((resolve) => {
|
||||||
|
const filterImage = (img) => {
|
||||||
|
// Filter out images that are too small
|
||||||
|
if (img.width < 100 && img.height < 100) return false;
|
||||||
|
|
||||||
|
// Filter out images that are not visible
|
||||||
|
const rect = img.getBoundingClientRect();
|
||||||
|
if (rect.width === 0 || rect.height === 0) return false;
|
||||||
|
|
||||||
|
// Filter out images with certain class names (e.g., icons, thumbnails)
|
||||||
|
if (img.classList.contains('icon') || img.classList.contains('thumbnail')) return false;
|
||||||
|
|
||||||
|
// Filter out images with certain patterns in their src (e.g., placeholder images)
|
||||||
|
if (img.src.includes('placeholder') || img.src.includes('icon')) return false;
|
||||||
|
|
||||||
|
return true;
|
||||||
|
};
|
||||||
|
|
||||||
|
const images = Array.from(document.querySelectorAll('img')).filter(filterImage);
|
||||||
|
let imagesLeft = images.length;
|
||||||
|
|
||||||
|
if (imagesLeft === 0) {
|
||||||
|
resolve();
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
const checkImage = (img) => {
|
||||||
|
if (img.complete && img.naturalWidth !== 0) {
|
||||||
|
img.setAttribute('width', img.naturalWidth);
|
||||||
|
img.setAttribute('height', img.naturalHeight);
|
||||||
|
imagesLeft--;
|
||||||
|
if (imagesLeft === 0) resolve();
|
||||||
|
}
|
||||||
|
};
|
||||||
|
|
||||||
|
images.forEach(img => {
|
||||||
|
checkImage(img);
|
||||||
|
if (!img.complete) {
|
||||||
|
img.onload = () => {
|
||||||
|
checkImage(img);
|
||||||
|
};
|
||||||
|
img.onerror = () => {
|
||||||
|
imagesLeft--;
|
||||||
|
if (imagesLeft === 0) resolve();
|
||||||
|
};
|
||||||
|
}
|
||||||
|
});
|
||||||
|
|
||||||
|
// Fallback timeout of 5 seconds
|
||||||
|
setTimeout(() => resolve(), 5000);
|
||||||
|
});
|
||||||
|
}
|
||||||
|
"""
|
||||||
|
await page.evaluate(update_image_dimensions_js)
|
||||||
|
|
||||||
|
# Wait a bit for any onload events to complete
|
||||||
|
await page.wait_for_timeout(100)
|
||||||
|
|
||||||
|
# Process iframes
|
||||||
|
if kwargs.get("process_iframes", False):
|
||||||
|
page = await self.process_iframes(page)
|
||||||
|
|
||||||
|
await self.execute_hook('before_retrieve_html', page)
|
||||||
|
# Check if delay_before_return_html is set then wait for that time
|
||||||
|
delay_before_return_html = kwargs.get("delay_before_return_html")
|
||||||
|
if delay_before_return_html:
|
||||||
|
await asyncio.sleep(delay_before_return_html)
|
||||||
|
|
||||||
html = await page.content()
|
html = await page.content()
|
||||||
page = await self.execute_hook('before_return_html', page, html)
|
await self.execute_hook('before_return_html', page, html)
|
||||||
|
|
||||||
if self.verbose:
|
if self.verbose:
|
||||||
print(f"[LOG] ✅ Crawled {url} successfully!")
|
print(f"[LOG] ✅ Crawled {url} successfully!")
|
||||||
@@ -281,7 +443,20 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
|||||||
"status_code": status_code
|
"status_code": status_code
|
||||||
}, f)
|
}, f)
|
||||||
|
|
||||||
response = AsyncCrawlResponse(html=html, response_headers=response_headers, status_code=status_code)
|
|
||||||
|
async def get_delayed_content(delay: float = 5.0) -> str:
|
||||||
|
if self.verbose:
|
||||||
|
print(f"[LOG] Waiting for {delay} seconds before retrieving content for {url}")
|
||||||
|
await asyncio.sleep(delay)
|
||||||
|
return await page.content()
|
||||||
|
|
||||||
|
response = AsyncCrawlResponse(
|
||||||
|
html=html,
|
||||||
|
response_headers=response_headers,
|
||||||
|
status_code=status_code,
|
||||||
|
screenshot=screenshot_data,
|
||||||
|
get_delayed_content=get_delayed_content
|
||||||
|
)
|
||||||
return response
|
return response
|
||||||
except Error as e:
|
except Error as e:
|
||||||
raise Error(f"Failed to crawl {url}: {str(e)}")
|
raise Error(f"Failed to crawl {url}: {str(e)}")
|
||||||
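`get_delayed_content` is a closure over the live `page` object, so it returns whatever the page contains after the delay, not a snapshot taken at crawl time. The sketch below demonstrates that behaviour with a stand-in page class. `FakePage`, `make_delayed_getter`, and `demo` are hypothetical names. With a real Playwright page, the call can only succeed while the page is still open.

```python
import asyncio


class FakePage:
    """Stand-in for a Playwright page; content() returns the current HTML."""

    def __init__(self) -> None:
        self.html = "<p>initial</p>"

    async def content(self) -> str:
        return self.html


def make_delayed_getter(page):
    # The inner coroutine captures `page`, as the closure in the diff does.
    async def get_delayed_content(delay: float = 5.0) -> str:
        await asyncio.sleep(delay)
        return await page.content()

    return get_delayed_content


async def demo() -> str:
    page = FakePage()
    getter = make_delayed_getter(page)
    page.html = "<p>updated</p>"  # simulate content that arrives after the crawl
    return await getter(0.01)


print(asyncio.run(demo()))  # <p>updated</p>
```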
@@ -339,7 +514,6 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
|||||||
except Error as e:
|
except Error as e:
|
||||||
raise Error(f"Failed to execute JavaScript or wait for condition in session {session_id}: {str(e)}")
|
raise Error(f"Failed to execute JavaScript or wait for condition in session {session_id}: {str(e)}")
|
||||||
|
|
||||||
|
|
||||||
async def crawl_many(self, urls: List[str], **kwargs) -> List[AsyncCrawlResponse]:
|
async def crawl_many(self, urls: List[str], **kwargs) -> List[AsyncCrawlResponse]:
|
||||||
semaphore_count = kwargs.get('semaphore_count', calculate_semaphore_count())
|
semaphore_count = kwargs.get('semaphore_count', calculate_semaphore_count())
|
||||||
semaphore = asyncio.Semaphore(semaphore_count)
|
semaphore = asyncio.Semaphore(semaphore_count)
|
||||||
@@ -352,11 +526,13 @@ class AsyncPlaywrightCrawlerStrategy(AsyncCrawlerStrategy):
|
|||||||
results = await asyncio.gather(*tasks, return_exceptions=True)
|
results = await asyncio.gather(*tasks, return_exceptions=True)
|
||||||
return [result if not isinstance(result, Exception) else str(result) for result in results]
|
return [result if not isinstance(result, Exception) else str(result) for result in results]
|
||||||
|
|
||||||
async def take_screenshot(self, url: str) -> str:
|
async def take_screenshot(self, url: str, wait_time = 1000) -> str:
|
||||||
async with await self.browser.new_context(user_agent=self.user_agent) as context:
|
async with await self.browser.new_context(user_agent=self.user_agent) as context:
|
||||||
page = await context.new_page()
|
page = await context.new_page()
|
||||||
try:
|
try:
|
||||||
await page.goto(url, wait_until="domcontentloaded")
|
await page.goto(url, wait_until="domcontentloaded", timeout=30000)
|
||||||
|
# Wait for a specified time (default is 1 second)
|
||||||
|
await page.wait_for_timeout(wait_time)
|
||||||
screenshot = await page.screenshot(full_page=True)
|
screenshot = await page.screenshot(full_page=True)
|
||||||
return base64.b64encode(screenshot).decode('utf-8')
|
return base64.b64encode(screenshot).decode('utf-8')
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
|
|||||||
@@ -29,14 +29,31 @@ class AsyncDatabaseManager:
|
|||||||
)
|
)
|
||||||
''')
|
''')
|
||||||
await db.commit()
|
await db.commit()
|
||||||
|
await self.update_db_schema()
|
||||||
|
|
||||||
async def aalter_db_add_screenshot(self, new_column: str = "media"):
|
async def update_db_schema(self):
|
||||||
|
async with aiosqlite.connect(self.db_path) as db:
|
||||||
|
# Check if the 'media' column exists
|
||||||
|
cursor = await db.execute("PRAGMA table_info(crawled_data)")
|
||||||
|
columns = await cursor.fetchall()
|
||||||
|
column_names = [column[1] for column in columns]
|
||||||
|
|
||||||
|
if 'media' not in column_names:
|
||||||
|
await self.aalter_db_add_column('media')
|
||||||
|
|
||||||
|
# Check for other missing columns and add them if necessary
|
||||||
|
for column in ['links', 'metadata', 'screenshot']:
|
||||||
|
if column not in column_names:
|
||||||
|
await self.aalter_db_add_column(column)
|
||||||
|
|
||||||
|
async def aalter_db_add_column(self, new_column: str):
|
||||||
try:
|
try:
|
||||||
async with aiosqlite.connect(self.db_path) as db:
|
async with aiosqlite.connect(self.db_path) as db:
|
||||||
await db.execute(f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""')
|
await db.execute(f'ALTER TABLE crawled_data ADD COLUMN {new_column} TEXT DEFAULT ""')
|
||||||
await db.commit()
|
await db.commit()
|
||||||
|
print(f"Added column '{new_column}' to the database.")
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
print(f"Error altering database to add screenshot column: {e}")
|
print(f"Error altering database to add {new_column} column: {e}")
|
||||||
|
|
||||||
async def aget_cached_url(self, url: str) -> Optional[Tuple[str, str, str, str, str, str, str, bool, str]]:
|
async def aget_cached_url(self, url: str) -> Optional[Tuple[str, str, str, str, str, str, str, bool, str]]:
|
||||||
try:
|
try:
|
||||||
|
|||||||
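The schema auto-update above reads the existing column names with `PRAGMA table_info` and adds whichever ones are missing. The sketch below shows the same idea with the synchronous standard-library `sqlite3` module in place of `aiosqlite`, so that it runs without extra dependencies. `ensure_columns` is a hypothetical helper name, and the two-column starting table is an assumption made for the demonstration.

```python
import sqlite3
from typing import Iterable, List


def ensure_columns(conn: sqlite3.Connection, table: str, columns: Iterable[str]) -> List[str]:
    """Add each missing TEXT column to `table`; return the names that were added."""
    # Row layout of PRAGMA table_info: (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    added = []
    for column in columns:
        if column not in existing:
            # Identifiers cannot be bound as parameters; pass only trusted names.
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} TEXT DEFAULT ''")
            added.append(column)
    conn.commit()
    return added


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crawled_data (url TEXT PRIMARY KEY, html TEXT)")
wanted = ["media", "links", "metadata", "screenshot"]
print(ensure_columns(conn, "crawled_data", wanted))  # ['media', 'links', 'metadata', 'screenshot']
print(ensure_columns(conn, "crawled_data", wanted))  # []
```

The second call adds nothing, which is what makes it safe to run the check on every start-up.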
@@ -23,17 +23,18 @@ class AsyncWebCrawler:
|
|||||||
self,
|
self,
|
||||||
crawler_strategy: Optional[AsyncCrawlerStrategy] = None,
|
crawler_strategy: Optional[AsyncCrawlerStrategy] = None,
|
||||||
always_by_pass_cache: bool = False,
|
always_by_pass_cache: bool = False,
|
||||||
verbose: bool = False,
|
base_directory: str = str(Path.home()),
|
||||||
|
**kwargs,
|
||||||
):
|
):
|
||||||
self.crawler_strategy = crawler_strategy or AsyncPlaywrightCrawlerStrategy(
|
self.crawler_strategy = crawler_strategy or AsyncPlaywrightCrawlerStrategy(
|
||||||
verbose=verbose
|
**kwargs
|
||||||
)
|
)
|
||||||
self.always_by_pass_cache = always_by_pass_cache
|
self.always_by_pass_cache = always_by_pass_cache
|
||||||
self.crawl4ai_folder = os.path.join(Path.home(), ".crawl4ai")
|
self.crawl4ai_folder = os.path.join(base_directory, ".crawl4ai")
|
||||||
os.makedirs(self.crawl4ai_folder, exist_ok=True)
|
os.makedirs(self.crawl4ai_folder, exist_ok=True)
|
||||||
os.makedirs(f"{self.crawl4ai_folder}/cache", exist_ok=True)
|
os.makedirs(f"{self.crawl4ai_folder}/cache", exist_ok=True)
|
||||||
self.ready = False
|
self.ready = False
|
||||||
self.verbose = verbose
|
self.verbose = kwargs.get("verbose", False)
|
||||||
|
|
||||||
async def __aenter__(self):
|
async def __aenter__(self):
|
||||||
await self.crawler_strategy.__aenter__()
|
await self.crawler_strategy.__aenter__()
|
||||||
@@ -80,7 +81,7 @@ class AsyncWebCrawler:
|
|||||||
|
|
||||||
word_count_threshold = max(word_count_threshold, MIN_WORD_THRESHOLD)
|
word_count_threshold = max(word_count_threshold, MIN_WORD_THRESHOLD)
|
||||||
|
|
||||||
async_response : AsyncCrawlResponse = None
|
async_response: AsyncCrawlResponse = None
|
||||||
cached = None
|
cached = None
|
||||||
screenshot_data = None
|
screenshot_data = None
|
||||||
extracted_content = None
|
extracted_content = None
|
||||||
@@ -102,15 +103,14 @@ class AsyncWebCrawler:
|
|||||||
t1 = time.time()
|
t1 = time.time()
|
||||||
if user_agent:
|
if user_agent:
|
||||||
self.crawler_strategy.update_user_agent(user_agent)
|
self.crawler_strategy.update_user_agent(user_agent)
|
||||||
async_response : AsyncCrawlResponse = await self.crawler_strategy.crawl(url, **kwargs)
|
async_response: AsyncCrawlResponse = await self.crawler_strategy.crawl(url, screenshot=screenshot, **kwargs)
|
||||||
html = sanitize_input_encode(async_response.html)
|
html = sanitize_input_encode(async_response.html)
|
||||||
|
screenshot_data = async_response.screenshot
|
||||||
t2 = time.time()
|
t2 = time.time()
|
||||||
if verbose:
|
if verbose:
|
||||||
print(
|
print(
|
||||||
f"[LOG] 🚀 Crawling done for {url}, success: {bool(html)}, time taken: {t2 - t1:.2f} seconds"
|
f"[LOG] 🚀 Crawling done for {url}, success: {bool(html)}, time taken: {t2 - t1:.2f} seconds"
|
||||||
)
|
)
|
||||||
if screenshot:
|
|
||||||
screenshot_data = await self.crawler_strategy.take_screenshot(url)
|
|
||||||
|
|
||||||
crawl_result = await self.aprocess_html(
|
crawl_result = await self.aprocess_html(
|
||||||
url,
|
url,
|
||||||
@@ -127,7 +127,7 @@ class AsyncWebCrawler:
|
|||||||
**kwargs,
|
**kwargs,
|
||||||
)
|
)
|
||||||
crawl_result.status_code = async_response.status_code if async_response else 200
|
crawl_result.status_code = async_response.status_code if async_response else 200
|
||||||
crawl_result.responser_headers = async_response.response_headers if async_response else {}
|
crawl_result.response_headers = async_response.response_headers if async_response else {}
|
||||||
crawl_result.success = bool(html)
|
crawl_result.success = bool(html)
|
||||||
crawl_result.session_id = kwargs.get("session_id", None)
|
crawl_result.session_id = kwargs.get("session_id", None)
|
||||||
return crawl_result
|
return crawl_result
|
||||||
@@ -203,11 +203,11 @@ class AsyncWebCrawler:
|
|||||||
)
|
)
|
||||||
|
|
||||||
if result is None:
|
if result is None:
|
||||||
raise ValueError(f"Failed to extract content from the website: {url}")
|
raise ValueError(f"Process HTML, Failed to extract content from the website: {url}")
|
||||||
except InvalidCSSSelectorError as e:
|
except InvalidCSSSelectorError as e:
|
||||||
raise ValueError(str(e))
|
raise ValueError(str(e))
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
raise ValueError(f"Failed to extract content from the website: {url}, error: {str(e)}")
|
raise ValueError(f"Process HTML, Failed to extract content from the website: {url}, error: {str(e)}")
|
||||||
|
|
||||||
cleaned_html = sanitize_input_encode(result.get("cleaned_html", ""))
|
cleaned_html = sanitize_input_encode(result.get("cleaned_html", ""))
|
||||||
markdown = sanitize_input_encode(result.get("markdown", ""))
|
markdown = sanitize_input_encode(result.get("markdown", ""))
|
||||||
|
|||||||
@@ -16,8 +16,6 @@ from .utils import (
|
|||||||
CustomHTML2Text
|
CustomHTML2Text
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
class ContentScrappingStrategy(ABC):
|
class ContentScrappingStrategy(ABC):
|
||||||
@abstractmethod
|
@abstractmethod
|
||||||
def scrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
|
def scrap(self, url: str, html: str, **kwargs) -> Dict[str, Any]:
|
||||||
@@ -129,7 +127,7 @@ class WebScrappingStrategy(ContentScrappingStrategy):
|
|||||||
image_size = 0 #int(fetch_image_file_size(img,base_url) or 0)
|
image_size = 0 #int(fetch_image_file_size(img,base_url) or 0)
|
||||||
image_format = os.path.splitext(img.get('src',''))[1].lower()
|
image_format = os.path.splitext(img.get('src',''))[1].lower()
|
||||||
# Remove . from format
|
# Remove . from format
|
||||||
image_format = image_format.strip('.')
|
image_format = image_format.strip('.').split('?')[0]
|
||||||
score = 0
|
score = 0
|
||||||
if height_value:
|
if height_value:
|
||||||
if height_unit == 'px' and height_value > 150:
|
if height_unit == 'px' and height_value > 150:
|
||||||
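Review note on the `image_format` change above: a minimal sketch of what the added `.split('?')[0]` does to a `src` that carries a query string. The helper names `image_format_from_src` and `image_format_from_path` are hypothetical and are not part of the commit; the second one is only an assumed alternative that removes the query before the extension is split off.

```python
import os
from urllib.parse import urlparse


def image_format_from_src(src: str) -> str:
    """Mirror the expression used in the hunk above."""
    image_format = os.path.splitext(src)[1].lower()
    # Remove the leading dot, then cut off anything after '?'
    return image_format.strip('.').split('?')[0]


def image_format_from_path(src: str) -> str:
    """Hypothetical alternative: parse the URL and look only at its path."""
    path = urlparse(src).path
    return os.path.splitext(path)[1].lower().lstrip('.')


if __name__ == "__main__":
    for src in (
        "https://cdn.example.com/photo.JPG?width=200",
        "https://cdn.example.com/photo.jpg?v=1.2",
    ):
        print(src, image_format_from_src(src), image_format_from_path(src))
```

The two helpers agree on the first URL. They can differ when the query string itself contains a dot, because `os.path.splitext` splits on the last dot of the whole string.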
@@ -158,6 +156,7 @@ class WebScrappingStrategy(ContentScrappingStrategy):
|
|||||||
return None
|
return None
|
||||||
return {
|
return {
|
||||||
'src': img.get('src', ''),
|
'src': img.get('src', ''),
|
||||||
|
'data-src': img.get('data-src', ''),
|
||||||
'alt': img.get('alt', ''),
|
'alt': img.get('alt', ''),
|
||||||
'desc': find_closest_parent_with_useful_text(img),
|
'desc': find_closest_parent_with_useful_text(img),
|
||||||
'score': score,
|
'score': score,
|
||||||
@@ -170,10 +169,12 @@ class WebScrappingStrategy(ContentScrappingStrategy):
|
|||||||
if isinstance(element, Comment):
|
if isinstance(element, Comment):
|
||||||
element.extract()
|
element.extract()
|
||||||
return False
|
return False
|
||||||
|
|
||||||
|
# if element.name == 'img':
|
||||||
|
# process_image(element, url, 0, 1)
|
||||||
|
# return True
|
||||||
|
|
||||||
if element.name in ['script', 'style', 'link', 'meta', 'noscript']:
|
if element.name in ['script', 'style', 'link', 'meta', 'noscript']:
|
||||||
if element.name == 'img':
|
|
||||||
process_image(element, url, 0, 1)
|
|
||||||
element.decompose()
|
element.decompose()
|
||||||
return False
|
return False
|
||||||
|
|
||||||
@@ -273,11 +274,14 @@ class WebScrappingStrategy(ContentScrappingStrategy):
|
|||||||
# Replace base64 data with empty string
|
# Replace base64 data with empty string
|
||||||
img['src'] = base64_pattern.sub('', src)
|
img['src'] = base64_pattern.sub('', src)
|
||||||
cleaned_html = str(body).replace('\n\n', '\n').replace(' ', ' ')
|
cleaned_html = str(body).replace('\n\n', '\n').replace(' ', ' ')
|
||||||
cleaned_html = sanitize_html(cleaned_html)
|
|
||||||
|
|
||||||
h = CustomHTML2Text()
|
h = CustomHTML2Text()
|
||||||
h.ignore_links = True
|
h.ignore_links = True
|
||||||
markdown = h.handle(cleaned_html)
|
h.body_width = 0
|
||||||
|
try:
|
||||||
|
markdown = h.handle(cleaned_html)
|
||||||
|
except Exception as e:
|
||||||
|
markdown = h.handle(sanitize_html(cleaned_html))
|
||||||
markdown = markdown.replace(' ```', '```')
|
markdown = markdown.replace(' ```', '```')
|
||||||
|
|
||||||
try:
|
try:
|
||||||
@@ -286,6 +290,7 @@ class WebScrappingStrategy(ContentScrappingStrategy):
|
|||||||
print('Error extracting metadata:', str(e))
|
print('Error extracting metadata:', str(e))
|
||||||
meta = {}
|
meta = {}
|
||||||
|
|
||||||
|
cleaned_html = sanitize_html(cleaned_html)
|
||||||
return {
|
return {
|
||||||
'markdown': markdown,
|
'markdown': markdown,
|
||||||
'cleaned_html': cleaned_html,
|
'cleaned_html': cleaned_html,
|
||||||
|
|||||||
@@ -80,6 +80,7 @@ class LLMExtractionStrategy(ExtractionStrategy):
|
|||||||
self.word_token_rate = kwargs.get("word_token_rate", WORD_TOKEN_RATE)
|
self.word_token_rate = kwargs.get("word_token_rate", WORD_TOKEN_RATE)
|
||||||
self.apply_chunking = kwargs.get("apply_chunking", True)
|
self.apply_chunking = kwargs.get("apply_chunking", True)
|
||||||
self.base_url = kwargs.get("base_url", None)
|
self.base_url = kwargs.get("base_url", None)
|
||||||
|
self.extra_args = kwargs.get("extra_args", {})
|
||||||
if not self.apply_chunking:
|
if not self.apply_chunking:
|
||||||
self.chunk_token_threshold = 1e9
|
self.chunk_token_threshold = 1e9
|
||||||
|
|
||||||
@@ -111,7 +112,13 @@ class LLMExtractionStrategy(ExtractionStrategy):
|
|||||||
"{" + variable + "}", variable_values[variable]
|
"{" + variable + "}", variable_values[variable]
|
||||||
)
|
)
|
||||||
|
|
||||||
response = perform_completion_with_backoff(self.provider, prompt_with_variables, self.api_token, base_url=self.base_url) # , json_response=self.extract_type == "schema")
|
response = perform_completion_with_backoff(
|
||||||
|
self.provider,
|
||||||
|
prompt_with_variables,
|
||||||
|
self.api_token,
|
||||||
|
base_url=self.base_url,
|
||||||
|
extra_args = self.extra_args
|
||||||
|
) # , json_response=self.extract_type == "schema")
|
||||||
try:
|
try:
|
||||||
blocks = extract_xml_data(["blocks"], response.choices[0].message.content)['blocks']
|
blocks = extract_xml_data(["blocks"], response.choices[0].message.content)['blocks']
|
||||||
blocks = json.loads(blocks)
|
blocks = json.loads(blocks)
|
||||||
|
|||||||
@@ -18,5 +18,5 @@ class CrawlResult(BaseModel):
|
|||||||
metadata: Optional[dict] = None
|
metadata: Optional[dict] = None
|
||||||
error_message: Optional[str] = None
|
error_message: Optional[str] = None
|
||||||
session_id: Optional[str] = None
|
session_id: Optional[str] = None
|
||||||
responser_headers: Optional[dict] = None
|
response_headers: Optional[dict] = None
|
||||||
status_code: Optional[int] = None
|
status_code: Optional[int] = None
|
||||||
@@ -1,4 +1,4 @@
|
|||||||
PROMPT_EXTRACT_BLOCKS = """YHere is the URL of the webpage:
|
PROMPT_EXTRACT_BLOCKS = """Here is the URL of the webpage:
|
||||||
<url>{URL}</url>
|
<url>{URL}</url>
|
||||||
|
|
||||||
And here is the cleaned HTML content of that webpage:
|
And here is the cleaned HTML content of that webpage:
|
||||||
@@ -79,7 +79,7 @@ To generate the JSON objects:
|
|||||||
2. For each block:
|
2. For each block:
|
||||||
a. Assign it an index based on its order in the content.
|
a. Assign it an index based on its order in the content.
|
||||||
b. Analyze the content and generate ONE semantic tag that describe what the block is about.
|
b. Analyze the content and generate ONE semantic tag that describe what the block is about.
|
||||||
c. Extract the text content, EXACTLY SAME AS GIVE DATA, clean it up if needed, and store it as a list of strings in the "content" field.
|
c. Extract the text content, EXACTLY SAME AS THE GIVE DATA, clean it up if needed, and store it as a list of strings in the "content" field.
|
||||||
|
|
||||||
3. Ensure that the order of the JSON objects matches the order of the blocks as they appear in the original HTML content.
|
3. Ensure that the order of the JSON objects matches the order of the blocks as they appear in the original HTML content.
|
||||||
|
|
||||||
|
|||||||
@@ -131,7 +131,7 @@ def split_and_parse_json_objects(json_string):
|
|||||||
return parsed_objects, unparsed_segments
|
return parsed_objects, unparsed_segments
|
||||||
|
|
||||||
def sanitize_html(html):
|
def sanitize_html(html):
|
||||||
# Replace all weird and special characters with an empty string
|
# Replace all unwanted and special characters with an empty string
|
||||||
sanitized_html = html
|
sanitized_html = html
|
||||||
# sanitized_html = re.sub(r'[^\w\s.,;:!?=\[\]{}()<>\/\\\-"]', '', html)
|
# sanitized_html = re.sub(r'[^\w\s.,;:!?=\[\]{}()<>\/\\\-"]', '', html)
|
||||||
|
|
||||||
@@ -301,7 +301,7 @@ def get_content_of_website(url, html, word_count_threshold = MIN_WORD_THRESHOLD,
|
|||||||
if tag.name != 'img':
|
if tag.name != 'img':
|
||||||
tag.attrs = {}
|
tag.attrs = {}
|
||||||
|
|
||||||
# Extract all img tgas inti [{src: '', alt: ''}]
|
# Extract all img tgas int0 [{src: '', alt: ''}]
|
||||||
media = {
|
media = {
|
||||||
'images': [],
|
'images': [],
|
||||||
'videos': [],
|
'videos': [],
|
||||||
@@ -339,7 +339,7 @@ def get_content_of_website(url, html, word_count_threshold = MIN_WORD_THRESHOLD,
|
|||||||
img.decompose()
|
img.decompose()
|
||||||
|
|
||||||
|
|
||||||
# Create a function that replace content of all"pre" tage with its inner text
|
# Create a function that replace content of all"pre" tag with its inner text
|
||||||
def replace_pre_tags_with_text(node):
|
def replace_pre_tags_with_text(node):
|
||||||
for child in node.find_all('pre'):
|
for child in node.find_all('pre'):
|
||||||
# set child inner html to its text
|
# set child inner html to its text
|
||||||
@@ -502,7 +502,7 @@ def get_content_of_website_optimized(url: str, html: str, word_count_threshold:
|
|||||||
current_tag = tag
|
current_tag = tag
|
||||||
while current_tag:
|
while current_tag:
|
||||||
current_tag = current_tag.parent
|
current_tag = current_tag.parent
|
||||||
# Get the text content of the parent tag
|
# Get the text content from the parent tag
|
||||||
if current_tag:
|
if current_tag:
|
||||||
text_content = current_tag.get_text(separator=' ',strip=True)
|
text_content = current_tag.get_text(separator=' ',strip=True)
|
||||||
# Check if the text content has at least word_count_threshold
|
# Check if the text content has at least word_count_threshold
|
||||||
@@ -511,88 +511,88 @@ def get_content_of_website_optimized(url: str, html: str, word_count_threshold:
|
|||||||
return None
|
return None
|
||||||
|
|
||||||
def process_image(img, url, index, total_images):
|
def process_image(img, url, index, total_images):
|
||||||
#Check if an image has valid display and inside undesired html elements
|
#Check if an image has valid display and inside undesired html elements
|
||||||
def is_valid_image(img, parent, parent_classes):
|
def is_valid_image(img, parent, parent_classes):
|
||||||
style = img.get('style', '')
|
style = img.get('style', '')
|
||||||
src = img.get('src', '')
|
src = img.get('src', '')
|
||||||
classes_to_check = ['button', 'icon', 'logo']
|
classes_to_check = ['button', 'icon', 'logo']
|
||||||
tags_to_check = ['button', 'input']
|
tags_to_check = ['button', 'input']
|
||||||
return all([
|
return all([
|
||||||
'display:none' not in style,
|
'display:none' not in style,
|
||||||
src,
|
src,
|
||||||
not any(s in var for var in [src, img.get('alt', ''), *parent_classes] for s in classes_to_check),
|
not any(s in var for var in [src, img.get('alt', ''), *parent_classes] for s in classes_to_check),
|
||||||
parent.name not in tags_to_check
|
parent.name not in tags_to_check
|
||||||
])
|
])
|
||||||
|
|
||||||
#Score an image for it's usefulness
|
#Score an image for it's usefulness
|
||||||
def score_image_for_usefulness(img, base_url, index, images_count):
|
def score_image_for_usefulness(img, base_url, index, images_count):
|
||||||
# Function to parse image height/width value and units
|
# Function to parse image height/width value and units
|
||||||
def parse_dimension(dimension):
|
def parse_dimension(dimension):
|
||||||
if dimension:
|
if dimension:
|
||||||
match = re.match(r"(\d+)(\D*)", dimension)
|
match = re.match(r"(\d+)(\D*)", dimension)
|
||||||
if match:
|
if match:
|
||||||
number = int(match.group(1))
|
number = int(match.group(1))
|
||||||
unit = match.group(2) or 'px' # Default unit is 'px' if not specified
|
unit = match.group(2) or 'px' # Default unit is 'px' if not specified
|
||||||
return number, unit
|
return number, unit
|
||||||
return None, None
|
return None, None
|
||||||
|
|
||||||
# Fetch image file metadata to extract size and extension
|
# Fetch image file metadata to extract size and extension
|
||||||
def fetch_image_file_size(img, base_url):
|
def fetch_image_file_size(img, base_url):
|
||||||
#If src is relative path construct full URL, if not it may be CDN URL
|
#If src is relative path construct full URL, if not it may be CDN URL
|
||||||
img_url = urljoin(base_url,img.get('src'))
|
img_url = urljoin(base_url,img.get('src'))
|
||||||
try:
|
try:
|
||||||
response = requests.head(img_url)
|
response = requests.head(img_url)
|
||||||
if response.status_code == 200:
|
if response.status_code == 200:
|
||||||
return response.headers.get('Content-Length',None)
|
return response.headers.get('Content-Length',None)
|
||||||
else:
|
else:
|
||||||
print(f"Failed to retrieve file size for {img_url}")
|
print(f"Failed to retrieve file size for {img_url}")
|
||||||
return None
|
|
||||||
except InvalidSchema as e:
|
|
||||||
return None
|
return None
|
||||||
finally:
|
except InvalidSchema as e:
|
||||||
return
|
return None
|
||||||
|
finally:
|
||||||
|
return
|
||||||
|
|
||||||
image_height = img.get('height')
|
image_height = img.get('height')
|
||||||
height_value, height_unit = parse_dimension(image_height)
|
height_value, height_unit = parse_dimension(image_height)
|
||||||
image_width = img.get('width')
|
image_width = img.get('width')
|
||||||
width_value, width_unit = parse_dimension(image_width)
|
width_value, width_unit = parse_dimension(image_width)
|
||||||
image_size = 0 #int(fetch_image_file_size(img,base_url) or 0)
|
image_size = 0 #int(fetch_image_file_size(img,base_url) or 0)
|
||||||
image_format = os.path.splitext(img.get('src',''))[1].lower()
|
image_format = os.path.splitext(img.get('src',''))[1].lower()
|
||||||
# Remove . from format
|
# Remove . from format
|
||||||
image_format = image_format.strip('.')
|
image_format = image_format.strip('.')
|
||||||
score = 0
|
score = 0
|
||||||
if height_value:
|
if height_value:
|
||||||
if height_unit == 'px' and height_value > 150:
|
if height_unit == 'px' and height_value > 150:
|
||||||
score += 1
|
|
||||||
if height_unit in ['%','vh','vmin','vmax'] and height_value >30:
|
|
||||||
score += 1
|
|
||||||
if width_value:
|
|
||||||
if width_unit == 'px' and width_value > 150:
|
|
||||||
score += 1
|
|
||||||
if width_unit in ['%','vh','vmin','vmax'] and width_value >30:
|
|
||||||
score += 1
|
|
||||||
if image_size > 10000:
|
|
||||||
score += 1
|
score += 1
|
||||||
if img.get('alt') != '':
|
if height_unit in ['%','vh','vmin','vmax'] and height_value >30:
|
||||||
score+=1
|
score += 1
|
||||||
if any(image_format==format for format in ['jpg','png','webp']):
|
if width_value:
|
||||||
score+=1
|
if width_unit == 'px' and width_value > 150:
|
||||||
if index/images_count<0.5:
|
score += 1
|
||||||
score+=1
|
if width_unit in ['%','vh','vmin','vmax'] and width_value >30:
|
||||||
return score
|
score += 1
|
||||||
|
if image_size > 10000:
|
||||||
|
score += 1
|
||||||
|
if img.get('alt') != '':
|
||||||
|
score+=1
|
||||||
|
if any(image_format==format for format in ['jpg','png','webp']):
|
||||||
|
score+=1
|
||||||
|
if index/images_count<0.5:
|
||||||
|
score+=1
|
||||||
|
return score
|
||||||
|
|
||||||
if not is_valid_image(img, img.parent, img.parent.get('class', [])):
|
if not is_valid_image(img, img.parent, img.parent.get('class', [])):
|
||||||
return None
|
return None
|
||||||
score = score_image_for_usefulness(img, url, index, total_images)
|
score = score_image_for_usefulness(img, url, index, total_images)
|
||||||
if score <= IMAGE_SCORE_THRESHOLD:
|
if score <= IMAGE_SCORE_THRESHOLD:
|
||||||
return None
|
return None
|
||||||
return {
|
return {
|
||||||
'src': img.get('src', ''),
|
'src': img.get('src', '').replace('\\"', '"').strip(),
|
||||||
'alt': img.get('alt', ''),
|
'alt': img.get('alt', ''),
|
||||||
'desc': find_closest_parent_with_useful_text(img),
|
'desc': find_closest_parent_with_useful_text(img),
|
||||||
'score': score,
|
'score': score,
|
||||||
'type': 'image'
|
'type': 'image'
|
||||||
}
|
}
|
||||||
|
|
||||||
def process_element(element: element.PageElement) -> bool:
|
def process_element(element: element.PageElement) -> bool:
|
||||||
try:
|
try:
|
||||||
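Review note on `fetch_image_file_size` in the hunk above: both sides of the diff end the `try` block with `finally: return`. In Python, a `return` inside `finally` replaces whatever value the `try` or `except` branch was about to return, so a function shaped like this always yields `None`. The call site shown in the diff is commented out (`image_size = 0`), so this has no visible effect there. The sketch below uses two hypothetical functions to isolate the behaviour and makes no network calls. Recent Python versions may emit a `SyntaxWarning` for this pattern.

```python
def size_with_finally_return():
    """Same control flow as the helper in the diff, with a fixed value."""
    try:
        return "12345"      # stands in for the Content-Length header
    except ValueError:
        return None
    finally:
        return              # overrides the "12345" above with None


def size_without_finally_return():
    """Assumed fix: drop the bare return from the finally clause."""
    try:
        return "12345"
    except ValueError:
        return None


if __name__ == "__main__":
    print(size_with_finally_return())
    print(size_without_finally_return())
```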
@@ -775,7 +775,14 @@ def extract_xml_data(tags, string):
|
|||||||
return data
|
return data
|
||||||
|
|
||||||
# Function to perform the completion with exponential backoff
|
# Function to perform the completion with exponential backoff
|
||||||
def perform_completion_with_backoff(provider, prompt_with_variables, api_token, json_response = False, base_url=None):
|
def perform_completion_with_backoff(
|
||||||
|
provider,
|
||||||
|
prompt_with_variables,
|
||||||
|
api_token,
|
||||||
|
json_response = False,
|
||||||
|
base_url=None,
|
||||||
|
**kwargs
|
||||||
|
):
|
||||||
from litellm import completion
|
from litellm import completion
|
||||||
from litellm.exceptions import RateLimitError
|
from litellm.exceptions import RateLimitError
|
||||||
max_attempts = 3
|
max_attempts = 3
|
||||||
@@ -784,6 +791,9 @@ def perform_completion_with_backoff(provider, prompt_with_variables, api_token,
|
|||||||
extra_args = {}
|
extra_args = {}
|
||||||
if json_response:
|
if json_response:
|
||||||
extra_args["response_format"] = { "type": "json_object" }
|
extra_args["response_format"] = { "type": "json_object" }
|
||||||
|
|
||||||
|
if kwargs.get("extra_args"):
|
||||||
|
extra_args.update(kwargs["extra_args"])
|
||||||
|
|
||||||
for attempt in range(max_attempts):
|
for attempt in range(max_attempts):
|
||||||
try:
|
try:
|
||||||
|
|||||||
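Review note on the `extra_args` plumbing above: `LLMExtractionStrategy` now passes `extra_args=self.extra_args`, which arrives in `perform_completion_with_backoff` through `**kwargs` and is merged into the local `extra_args` dict. The sketch below isolates only that merge step. The function name `merge_completion_args` is hypothetical, and no completion call is made.

```python
def merge_completion_args(json_response: bool = False, **kwargs) -> dict:
    """Build the optional arguments the way the hunk above does."""
    extra_args = {}
    if json_response:
        extra_args["response_format"] = {"type": "json_object"}

    # Caller-supplied options are merged last, so they win on key clashes.
    if kwargs.get("extra_args"):
        extra_args.update(kwargs["extra_args"])
    return extra_args


if __name__ == "__main__":
    merged = merge_completion_args(
        json_response=True,
        extra_args={"extra_headers": {"X-Custom-Header": "Some-Value"}},
    )
    print(merged)
```

Because the update happens after the `json_response` default is set, a caller who passes their own `response_format` inside `extra_args` replaces the default instead of being overridden by it.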
@@ -12,6 +12,7 @@ from typing import List
|
|||||||
from concurrent.futures import ThreadPoolExecutor
|
from concurrent.futures import ThreadPoolExecutor
|
||||||
from .config import *
|
from .config import *
|
||||||
import warnings
|
import warnings
|
||||||
|
import json
|
||||||
warnings.filterwarnings("ignore", message='Field "model_name" has conflict with protected namespace "model_".')
|
warnings.filterwarnings("ignore", message='Field "model_name" has conflict with protected namespace "model_".')
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
48
docs/examples/async_webcrawler_multiple_urls_example.py
Normal file
@@ -0,0 +1,48 @@
|
|||||||
|
# File: async_webcrawler_multiple_urls_example.py
|
||||||
|
import os, sys
|
||||||
|
# append 2 parent directories to sys.path to import crawl4ai
|
||||||
|
parent_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
|
||||||
|
sys.path.append(parent_dir)
|
||||||
|
|
||||||
|
import asyncio
|
||||||
|
from crawl4ai import AsyncWebCrawler
|
||||||
|
|
||||||
|
async def main():
|
||||||
|
# Initialize the AsyncWebCrawler
|
||||||
|
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||||
|
# List of URLs to crawl
|
||||||
|
urls = [
|
||||||
|
"https://example.com",
|
||||||
|
"https://python.org",
|
||||||
|
"https://github.com",
|
||||||
|
"https://stackoverflow.com",
|
||||||
|
"https://news.ycombinator.com"
|
||||||
|
]
|
||||||
|
|
||||||
|
# Set up crawling parameters
|
||||||
|
word_count_threshold = 100
|
||||||
|
|
||||||
|
# Run the crawling process for multiple URLs
|
||||||
|
results = await crawler.arun_many(
|
||||||
|
urls=urls,
|
||||||
|
word_count_threshold=word_count_threshold,
|
||||||
|
bypass_cache=True,
|
||||||
|
verbose=True
|
||||||
|
)
|
||||||
|
|
||||||
|
# Process the results
|
||||||
|
for result in results:
|
||||||
|
if result.success:
|
||||||
|
print(f"Successfully crawled: {result.url}")
|
||||||
|
print(f"Title: {result.metadata.get('title', 'N/A')}")
|
||||||
|
print(f"Word count: {len(result.markdown.split())}")
|
||||||
|
print(f"Number of links: {len(result.links.get('internal', [])) + len(result.links.get('external', []))}")
|
||||||
|
print(f"Number of images: {len(result.media.get('images', []))}")
|
||||||
|
print("---")
|
||||||
|
else:
|
||||||
|
print(f"Failed to crawl: {result.url}")
|
||||||
|
print(f"Error: {result.error_message}")
|
||||||
|
print("---")
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
asyncio.run(main())
|
||||||
45
docs/examples/language_support_example.py
Normal file
@@ -0,0 +1,45 @@
|
|||||||
|
import asyncio
|
||||||
|
from crawl4ai import AsyncWebCrawler, AsyncPlaywrightCrawlerStrategy
|
||||||
|
|
||||||
|
async def main():
|
||||||
|
# Example 1: Setting language when creating the crawler
|
||||||
|
crawler1 = AsyncWebCrawler(
|
||||||
|
crawler_strategy=AsyncPlaywrightCrawlerStrategy(
|
||||||
|
headers={"Accept-Language": "fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7"}
|
||||||
|
)
|
||||||
|
)
|
||||||
|
result1 = await crawler1.arun("https://www.example.com")
|
||||||
|
print("Example 1 result:", result1.extracted_content[:100]) # Print first 100 characters
|
||||||
|
|
||||||
|
# Example 2: Setting language before crawling
|
||||||
|
crawler2 = AsyncWebCrawler()
|
||||||
|
crawler2.crawler_strategy.headers["Accept-Language"] = "es-ES,es;q=0.9,en-US;q=0.8,en;q=0.7"
|
||||||
|
result2 = await crawler2.arun("https://www.example.com")
|
||||||
|
print("Example 2 result:", result2.extracted_content[:100])
|
||||||
|
|
||||||
|
# Example 3: Setting language when calling arun method
|
||||||
|
crawler3 = AsyncWebCrawler()
|
||||||
|
result3 = await crawler3.arun(
|
||||||
|
"https://www.example.com",
|
||||||
|
headers={"Accept-Language": "de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7"}
|
||||||
|
)
|
||||||
|
print("Example 3 result:", result3.extracted_content[:100])
|
||||||
|
|
||||||
|
# Example 4: Crawling multiple pages with different languages
|
||||||
|
urls = [
|
||||||
|
("https://www.example.com", "fr-FR,fr;q=0.9"),
|
||||||
|
("https://www.example.org", "es-ES,es;q=0.9"),
|
||||||
|
("https://www.example.net", "de-DE,de;q=0.9"),
|
||||||
|
]
|
||||||
|
|
||||||
|
crawler4 = AsyncWebCrawler()
|
||||||
|
results = await asyncio.gather(*[
|
||||||
|
crawler4.arun(url, headers={"Accept-Language": lang})
|
||||||
|
for url, lang in urls
|
||||||
|
])
|
||||||
|
|
||||||
|
for url, result in zip([u for u, _ in urls], results):
|
||||||
|
print(f"Result for {url}:", result.extracted_content[:100])
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
asyncio.run(main())
|
||||||
@@ -10,6 +10,7 @@ import time
|
|||||||
import json
|
import json
|
||||||
import os
|
import os
|
||||||
import re
|
import re
|
||||||
|
from typing import Dict
|
||||||
from bs4 import BeautifulSoup
|
from bs4 import BeautifulSoup
|
||||||
from pydantic import BaseModel, Field
|
from pydantic import BaseModel, Field
|
||||||
from crawl4ai import AsyncWebCrawler
|
from crawl4ai import AsyncWebCrawler
|
||||||
@@ -18,6 +19,8 @@ from crawl4ai.extraction_strategy import (
|
|||||||
LLMExtractionStrategy,
|
LLMExtractionStrategy,
|
||||||
)
|
)
|
||||||
|
|
||||||
|
__location__ = os.path.realpath(os.path.join(os.getcwd(), os.path.dirname(__file__)))
|
||||||
|
|
||||||
print("Crawl4AI: Advanced Web Crawling and Data Extraction")
|
print("Crawl4AI: Advanced Web Crawling and Data Extraction")
|
||||||
print("GitHub Repository: https://github.com/unclecode/crawl4ai")
|
print("GitHub Repository: https://github.com/unclecode/crawl4ai")
|
||||||
print("Twitter: @unclecode")
|
print("Twitter: @unclecode")
|
||||||
@@ -30,7 +33,7 @@ async def simple_crawl():
|
|||||||
result = await crawler.arun(url="https://www.nbcnews.com/business")
|
result = await crawler.arun(url="https://www.nbcnews.com/business")
|
||||||
print(result.markdown[:500]) # Print first 500 characters
|
print(result.markdown[:500]) # Print first 500 characters
|
||||||
|
|
||||||
async def js_and_css():
|
async def simple_example_with_running_js_code():
|
||||||
print("\n--- Executing JavaScript and Using CSS Selectors ---")
|
print("\n--- Executing JavaScript and Using CSS Selectors ---")
|
||||||
# New code to handle the wait_for parameter
|
# New code to handle the wait_for parameter
|
||||||
wait_for = """() => {
|
wait_for = """() => {
|
||||||
@@ -47,12 +50,21 @@ async def js_and_css():
|
|||||||
result = await crawler.arun(
|
result = await crawler.arun(
|
||||||
url="https://www.nbcnews.com/business",
|
url="https://www.nbcnews.com/business",
|
||||||
js_code=js_code,
|
js_code=js_code,
|
||||||
# css_selector="article.tease-card",
|
|
||||||
# wait_for=wait_for,
|
# wait_for=wait_for,
|
||||||
bypass_cache=True,
|
bypass_cache=True,
|
||||||
)
|
)
|
||||||
print(result.markdown[:500]) # Print first 500 characters
|
print(result.markdown[:500]) # Print first 500 characters
|
||||||
|
|
||||||
|
async def simple_example_with_css_selector():
|
||||||
|
print("\n--- Using CSS Selectors ---")
|
||||||
|
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||||
|
result = await crawler.arun(
|
||||||
|
url="https://www.nbcnews.com/business",
|
||||||
|
css_selector=".wide-tease-item__description",
|
||||||
|
bypass_cache=True,
|
||||||
|
)
|
||||||
|
print(result.markdown[:500]) # Print first 500 characters
|
||||||
|
|
||||||
async def use_proxy():
|
async def use_proxy():
|
||||||
print("\n--- Using a Proxy ---")
|
print("\n--- Using a Proxy ---")
|
||||||
print(
|
print(
|
||||||
@@ -66,6 +78,28 @@ async def use_proxy():
|
|||||||
# )
|
# )
|
||||||
# print(result.markdown[:500]) # Print first 500 characters
|
# print(result.markdown[:500]) # Print first 500 characters
|
||||||
|
|
||||||
|
async def capture_and_save_screenshot(url: str, output_path: str):
|
||||||
|
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||||
|
result = await crawler.arun(
|
||||||
|
url=url,
|
||||||
|
screenshot=True,
|
||||||
|
bypass_cache=True
|
||||||
|
)
|
||||||
|
|
||||||
|
if result.success and result.screenshot:
|
||||||
|
import base64
|
||||||
|
|
||||||
|
# Decode the base64 screenshot data
|
||||||
|
screenshot_data = base64.b64decode(result.screenshot)
|
||||||
|
|
||||||
|
# Save the screenshot as a JPEG file
|
||||||
|
with open(output_path, 'wb') as f:
|
||||||
|
f.write(screenshot_data)
|
||||||
|
|
||||||
|
print(f"Screenshot saved successfully to {output_path}")
|
||||||
|
else:
|
||||||
|
print("Failed to capture screenshot")
|
||||||
|
|
||||||
class OpenAIModelFee(BaseModel):
|
class OpenAIModelFee(BaseModel):
|
||||||
model_name: str = Field(..., description="Name of the OpenAI model.")
|
model_name: str = Field(..., description="Name of the OpenAI model.")
|
||||||
input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
|
input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
|
||||||
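Review note on `capture_and_save_screenshot` above: the example decodes `result.screenshot` from base64 and writes the bytes with `open(output_path, 'wb')`, which raises if the target directory (here `tmp/`) does not exist yet. A minimal sketch of the same decode-and-save step that creates the parent directory first; `save_base64_image` is a hypothetical helper, not part of crawl4ai.

```python
import base64
from pathlib import Path


def save_base64_image(encoded: str, output_path: str) -> int:
    """Decode a base64 string and write it to output_path.

    Returns the number of bytes written.
    """
    data = base64.b64decode(encoded)
    path = Path(output_path)
    # Create missing parent directories so the write does not fail.
    path.parent.mkdir(parents=True, exist_ok=True)
    return path.write_bytes(data)
```

In the quickstart this would be called as `save_base64_image(result.screenshot, output_path)` once `result.success and result.screenshot` has been checked.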
@@ -73,27 +107,30 @@ class OpenAIModelFee(BaseModel):
|
|||||||
..., description="Fee for output token for the OpenAI model."
|
..., description="Fee for output token for the OpenAI model."
|
||||||
)
|
)
|
||||||
|
|
||||||
async def extract_structured_data_using_llm():
|
async def extract_structured_data_using_llm(provider: str, api_token: str = None, extra_headers: Dict[str, str] = None):
|
||||||
print("\n--- Extracting Structured Data with OpenAI ---")
|
print(f"\n--- Extracting Structured Data with {provider} ---")
|
||||||
print(
|
|
||||||
"Note: Set your OpenAI API key as an environment variable to run this example."
|
if api_token is None and provider != "ollama":
|
||||||
)
|
print(f"API token is required for {provider}. Skipping this example.")
|
||||||
if not os.getenv("OPENAI_API_KEY"):
|
|
||||||
print("OpenAI API key not found. Skipping this example.")
|
|
||||||
return
|
return
|
||||||
|
|
||||||
|
extra_args = {}
|
||||||
|
if extra_headers:
|
||||||
|
extra_args["extra_headers"] = extra_headers
|
||||||
|
|
||||||
async with AsyncWebCrawler(verbose=True) as crawler:
|
async with AsyncWebCrawler(verbose=True) as crawler:
|
||||||
result = await crawler.arun(
|
result = await crawler.arun(
|
||||||
url="https://openai.com/api/pricing/",
|
url="https://openai.com/api/pricing/",
|
||||||
word_count_threshold=1,
|
word_count_threshold=1,
|
||||||
extraction_strategy=LLMExtractionStrategy(
|
extraction_strategy=LLMExtractionStrategy(
|
||||||
provider="openai/gpt-4o",
|
provider=provider,
|
||||||
api_token=os.getenv("OPENAI_API_KEY"),
|
api_token=api_token,
|
||||||
schema=OpenAIModelFee.schema(),
|
schema=OpenAIModelFee.schema(),
|
||||||
extraction_type="schema",
|
extraction_type="schema",
|
||||||
instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
|
instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
|
||||||
Do not miss any models in the entire content. One extracted model JSON format should look like this:
|
Do not miss any models in the entire content. One extracted model JSON format should look like this:
|
||||||
{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}.""",
|
{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}.""",
|
||||||
|
extra_args=extra_args
|
||||||
),
|
),
|
||||||
bypass_cache=True,
|
bypass_cache=True,
|
||||||
)
|
)
|
||||||
@@ -320,6 +357,28 @@ async def crawl_dynamic_content_pages_method_3():
|
|||||||
await crawler.crawler_strategy.kill_session(session_id)
|
await crawler.crawler_strategy.kill_session(session_id)
|
||||||
print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
|
print(f"Successfully crawled {len(all_commits)} commits across 3 pages")
|
||||||
|
|
||||||
|
async def crawl_custom_browser_type():
|
||||||
|
# Use Firefox
|
||||||
|
start = time.time()
|
||||||
|
async with AsyncWebCrawler(browser_type="firefox", verbose=True, headless = True) as crawler:
|
||||||
|
result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
|
||||||
|
print(result.markdown[:500])
|
||||||
|
print("Time taken: ", time.time() - start)
|
||||||
|
|
||||||
|
# Use WebKit
|
||||||
|
start = time.time()
|
||||||
|
async with AsyncWebCrawler(browser_type="webkit", verbose=True, headless = True) as crawler:
|
||||||
|
result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
|
||||||
|
print(result.markdown[:500])
|
||||||
|
print("Time taken: ", time.time() - start)
|
||||||
|
|
||||||
|
# Use Chromium (default)
|
||||||
|
start = time.time()
|
||||||
|
async with AsyncWebCrawler(verbose=True, headless = True) as crawler:
|
||||||
|
result = await crawler.arun(url="https://www.example.com", bypass_cache=True)
|
||||||
|
print(result.markdown[:500])
|
||||||
|
print("Time taken: ", time.time() - start)
|
||||||
|
|
||||||
async def speed_comparison():
|
async def speed_comparison():
|
||||||
# print("\n--- Speed Comparison ---")
|
# print("\n--- Speed Comparison ---")
|
||||||
# print("Firecrawl (simulated):")
|
# print("Firecrawl (simulated):")
|
||||||
@@ -387,13 +446,31 @@ async def speed_comparison():
|
|||||||
|
|
||||||
async def main():
|
async def main():
|
||||||
await simple_crawl()
|
await simple_crawl()
|
||||||
await js_and_css()
|
await simple_example_with_running_js_code()
|
||||||
|
await simple_example_with_css_selector()
|
||||||
await use_proxy()
|
await use_proxy()
|
||||||
|
await capture_and_save_screenshot("https://www.example.com", os.path.join(__location__, "tmp/example_screenshot.jpg"))
|
||||||
await extract_structured_data_using_css_extractor()
|
await extract_structured_data_using_css_extractor()
|
||||||
|
|
||||||
|
# LLM extraction examples
|
||||||
await extract_structured_data_using_llm()
|
await extract_structured_data_using_llm()
|
||||||
|
await extract_structured_data_using_llm("huggingface/meta-llama/Meta-Llama-3.1-8B-Instruct", os.getenv("HUGGINGFACE_API_KEY"))
|
||||||
|
await extract_structured_data_using_llm("openai/gpt-4", os.getenv("OPENAI_API_KEY"))
|
||||||
|
await extract_structured_data_using_llm("ollama/llama3.2")
|
||||||
|
|
||||||
|
# You always can pass custom headers to the extraction strategy
|
||||||
|
custom_headers = {
|
||||||
|
"Authorization": "Bearer your-custom-token",
|
||||||
|
"X-Custom-Header": "Some-Value"
|
||||||
|
}
|
||||||
|
await extract_structured_data_using_llm(extra_headers=custom_headers)
|
||||||
|
|
||||||
# await crawl_dynamic_content_pages_method_1()
|
# await crawl_dynamic_content_pages_method_1()
|
||||||
# await crawl_dynamic_content_pages_method_2()
|
# await crawl_dynamic_content_pages_method_2()
|
||||||
await crawl_dynamic_content_pages_method_3()
|
await crawl_dynamic_content_pages_method_3()
|
||||||
|
|
||||||
|
await crawl_custom_browser_type()
|
||||||
|
|
||||||
await speed_comparison()
|
await speed_comparison()
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@@ -5,7 +5,7 @@ import asyncio
|
|||||||
import time
|
import time
|
||||||
|
|
||||||
# Add the parent directory to the Python path
|
# Add the parent directory to the Python path
|
||||||
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
|
parent_dir = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
|
||||||
sys.path.append(parent_dir)
|
sys.path.append(parent_dir)
|
||||||
|
|
||||||
from crawl4ai.async_webcrawler import AsyncWebCrawler
|
from crawl4ai.async_webcrawler import AsyncWebCrawler
|
||||||
|
|||||||
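The hunk above changes `parent_dir` from two nested `os.path.dirname` calls to three, so the path added to `sys.path` sits one directory higher. A minimal sketch of the same computation with `pathlib`, where the number of levels is explicit rather than implied by nesting depth. `ancestor_dir` is a hypothetical helper name used only for illustration; it is not part of the repository.

```python
import os
from pathlib import Path


def ancestor_dir(file_path: str, levels: int) -> str:
    """Return the directory `levels` steps above `file_path`.

    levels=1 is the containing directory, which matches a single
    os.path.dirname(os.path.abspath(file_path)) call.
    """
    if levels < 1:
        raise ValueError("levels must be at least 1")
    # Path.parents[0] is the containing directory, parents[1] its parent, and so on.
    return str(Path(os.path.abspath(file_path)).parents[levels - 1])
```

Under these assumptions, `ancestor_dir(__file__, 3)` yields the same directory as the three nested `os.path.dirname` calls in the updated line.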
124
tests/async/test_screenshot.py
Normal file
@@ -0,0 +1,124 @@
|
import os
import sys
import pytest
import asyncio
import base64
from PIL import Image
import io

# Add the parent directory to the Python path
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
sys.path.append(parent_dir)

from crawl4ai.async_webcrawler import AsyncWebCrawler

@pytest.mark.asyncio
async def test_basic_screenshot():
    async with AsyncWebCrawler(verbose=True) as crawler:
        url = "https://example.com"  # A static website
        result = await crawler.arun(url=url, bypass_cache=True, screenshot=True)

        assert result.success
        assert result.screenshot is not None

        # Verify the screenshot is a valid image
        image_data = base64.b64decode(result.screenshot)
        image = Image.open(io.BytesIO(image_data))
        assert image.format == "PNG"

@pytest.mark.asyncio
async def test_screenshot_with_wait_for():
    async with AsyncWebCrawler(verbose=True) as crawler:
        # Using a website with dynamic content
        url = "https://www.youtube.com"
        wait_for = "css:#content"  # Wait for the main content to load

        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            screenshot=True,
            wait_for=wait_for
        )

        assert result.success
        assert result.screenshot is not None

        # Verify the screenshot is a valid image
        image_data = base64.b64decode(result.screenshot)
        image = Image.open(io.BytesIO(image_data))
        assert image.format == "PNG"

        # You might want to add more specific checks here, like image dimensions
        # or even use image recognition to verify certain elements are present

@pytest.mark.asyncio
async def test_screenshot_with_js_wait_for():
    async with AsyncWebCrawler(verbose=True) as crawler:
        url = "https://www.amazon.com"
        wait_for = "js:() => document.querySelector('#nav-logo-sprites') !== null"

        result = await crawler.arun(
            url=url,
            bypass_cache=True,
            screenshot=True,
            wait_for=wait_for
        )

        assert result.success
        assert result.screenshot is not None

        image_data = base64.b64decode(result.screenshot)
        image = Image.open(io.BytesIO(image_data))
        assert image.format == "PNG"

@pytest.mark.asyncio
async def test_screenshot_without_wait_for():
    async with AsyncWebCrawler(verbose=True) as crawler:
        url = "https://www.nytimes.com"  # A website with lots of dynamic content

        result = await crawler.arun(url=url, bypass_cache=True, screenshot=True)

        assert result.success
        assert result.screenshot is not None

        image_data = base64.b64decode(result.screenshot)
        image = Image.open(io.BytesIO(image_data))
        assert image.format == "PNG"

@pytest.mark.asyncio
async def test_screenshot_comparison():
    async with AsyncWebCrawler(verbose=True) as crawler:
        url = "https://www.reddit.com"
        wait_for = "css:#SHORTCUT_FOCUSABLE_DIV"

        # Take screenshot without wait_for
        result_without_wait = await crawler.arun(
            url=url,
            bypass_cache=True,
            screenshot=True
        )

        # Take screenshot with wait_for
        result_with_wait = await crawler.arun(
            url=url,
            bypass_cache=True,
            screenshot=True,
            wait_for=wait_for
        )

        assert result_without_wait.success and result_with_wait.success
        assert result_without_wait.screenshot is not None
        assert result_with_wait.screenshot is not None

        # Compare the two screenshots
        image_without_wait = Image.open(io.BytesIO(base64.b64decode(result_without_wait.screenshot)))
        image_with_wait = Image.open(io.BytesIO(base64.b64decode(result_with_wait.screenshot)))

        # This is a simple size comparison. In a real-world scenario, you might want to use
        # more sophisticated image comparison techniques.
        assert image_with_wait.size[0] >= image_without_wait.size[0]
        assert image_with_wait.size[1] >= image_without_wait.size[1]

# Entry point for debugging
if __name__ == "__main__":
    pytest.main([__file__, "-v"])
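A comment in `test_screenshot_with_wait_for` suggests adding more specific checks, such as image dimensions. The following is a minimal sketch of one such check that uses only the standard library: it reads the width and height from the PNG header of a base64-encoded screenshot. It assumes `result.screenshot` is a base64-encoded PNG, as the tests above assert. `png_dimensions` and `PNG_SIGNATURE` are hypothetical names introduced here for illustration and are not part of crawl4ai or the test file.

```python
import base64
import struct

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(screenshot_b64: str) -> tuple[int, int]:
    """Return (width, height) from the IHDR chunk of a base64-encoded PNG."""
    data = base64.b64decode(screenshot_b64)
    if len(data) < 24 or data[:8] != PNG_SIGNATURE:
        raise ValueError("data does not start with a PNG signature")
    # Bytes 8-11 hold the chunk length and bytes 12-15 the chunk type;
    # the first chunk of a valid PNG is always IHDR.
    if data[12:16] != b"IHDR":
        raise ValueError("first chunk is not IHDR")
    # Width and height are big-endian unsigned 32-bit integers.
    width, height = struct.unpack(">II", data[16:24])
    return width, height
```

In a test, this could back an assertion such as `width, height = png_dimensions(result.screenshot)` followed by `assert width > 0 and height > 0`. Pillow's `image.size` gives the same information once the image is opened; the sketch only shows that the check does not require decoding the full image.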